qbruntime.model.Model Class Reference

SDK qb Runtime Library v1.2
MCS001-

Represents an AI model loaded from an MXQ file. More...

Public Member Functions

 __init__ (self, str path, Optional[ModelConfig] model_config=None)
 Creates a Model object from the specified MXQ model file and configuration.
None launch (self, Accelerator acc)
 Launches the model on the specified Accelerator, which represents the actual NPU.
None dispose (self)
 Disposes of the model loaded onto the NPU.
bool is_target (self, CoreId core_id)
 Checks if the NPU core specified by CoreId is the target of the model.
CoreMode get_core_mode (self)
 Retrieves the core mode of the model.
List[CoreId] get_target_cores (self)
 Returns the NPU cores the model is configured to use.
List[CoreId] target_cores (self)
Optional[List[np.ndarray]] infer (self, Union[np.ndarray, List[np.ndarray]] inputs, Optional[List[np.ndarray]] outputs=None, int cache_size=0, Optional[List[BatchParam]] params=None)
 Performs inference.
Optional[List[np.ndarray]] infer_hwc (self, Union[np.ndarray, List[np.ndarray]] inputs, Optional[List[np.ndarray]] outputs=None, int cache_size=0, Optional[List[BatchParam]] params=None)
Optional[List[np.ndarray]] infer_chw (self, Union[np.ndarray, List[np.ndarray]] inputs, Optional[List[np.ndarray]] outputs=None, int cache_size=0, Optional[List[BatchParam]] params=None)
List[np.ndarray] infer_to_float (self, Union[np.ndarray, List[np.ndarray]] inputs, int cache_size=0)
 int8_t-to-float inference. Performs inference with input and output elements of type int8_t and converts the results to float.
List[np.ndarray] infer_hwc_to_float (self, Union[np.ndarray, List[np.ndarray]] inputs, int cache_size=0)
List[np.ndarray] infer_chw_to_float (self, Union[np.ndarray, List[np.ndarray]] inputs, int cache_size=0)
None infer_buffer (self, List[Buffer] inputs, List[Buffer] outputs, List[List[int]] shape=[], int cache_size=0)
 Buffer-to-Buffer inference.
None infer_speedrun (self)
 Development-only API for measuring pure NPU inference speed.
Future infer_async (self, Union[np.ndarray, List[np.ndarray]] inputs)
 Asynchronous Inference.
Future infer_async_to_float (self, Union[np.ndarray, List[np.ndarray]] inputs)
 This method supports int8_t-to-float asynchronous inference.
None reposition_inputs (self, List[np.ndarray] inputs, List[Buffer] input_bufs, List[List[int]] seqlens=[])
 Repositions input data.
None reposition_outputs (self, List[Buffer] output_bufs, List[np.ndarray] outputs, List[List[int]] seqlens=[])
 Repositions output data.
int get_num_model_variants (self)
 Returns the total number of model variants available in this model.
ModelVariantHandle get_model_variant_handle (self, variant_idx)
 Retrieves a handle to the specified model variant.
List[_Shape] get_model_input_shape (self)
 Returns the input shape of the model.
List[_Shape] get_model_output_shape (self)
 Returns the output shape of the model.
List[Scale] get_input_scale (self)
 Returns the input quantization scale(s) of the model.
List[Scale] get_output_scale (self)
 Returns the output quantization scale(s) of the model.
List[BufferInfo] get_input_buffer_info (self)
 Returns the input buffer information for the model.
List[BufferInfo] get_output_buffer_info (self)
 Returns the output buffer information of the model.
DataType get_model_input_data_type (self)
 Returns a data type for model inputs.
DataType get_model_output_data_type (self)
 Returns a data type for model outputs.
List[Buffer] acquire_input_buffer (self, List[List[int]] seqlens=[])
 Buffer Management API.
List[Buffer] acquire_output_buffer (self, List[List[int]] seqlens=[])
 Buffer Management API.
None release_buffer (self, List[Buffer] buffer)
 Buffer Management API.
int get_identifier (self)
 Returns the model's unique identifier.
str get_model_path (self)
 Returns the path to the MXQ model file associated with the Model.
List[CacheInfo] get_cache_infos (self)
 Returns information about the model's KV cache.
int get_latency_consumed (self)
int get_latency_finished (self)
List[bytes] dump_cache_memory (self, int cache_id=0)
 Dumps the KV cache memory into buffers.
None load_cache_memory (self, List[bytes] bufs, int cache_id=0)
 Loads the KV cache memory from buffers.
None dump_cache_memory_to (self, str cache_dir, int cache_id=0)
 Dumps KV cache memory to files in the specified directory.
None load_cache_memory_from (self, str cache_dir, int cache_id=0)
 Loads the KV cache memory from files in the specified directory.
int filter_cache_tail (self, int cache_size, int tail_size, List[bool] mask)
 Filter the tail of the KV cache memory.
int move_cache_tail (self, int num_head, int num_tail, int cache_size)
 Moves the tail of the KV cache memory to the end of the head.

Protected Member Functions

Optional[List[np.ndarray]] _infer (self, Union[np.ndarray, List[np.ndarray]] inputs, Optional[List[np.ndarray]] outputs, int cache_size, Optional[bool] is_target_hwc=None, Optional[List[BatchParam]] params=None)
List[np.ndarray] _infer_to_float (self, Union[np.ndarray, List[np.ndarray]] inputs, int cache_size, Optional[bool] is_target_hwc=None)
 int8_t-to-float inference. Performs inference with input and output elements of type int8_t and converts the results to float.

Protected Attributes

 _model = _cQbRuntime.Model(path)
List[_Shape] _input_shape = self.get_model_input_shape()
List[_Shape] _output_shape = self.get_model_output_shape()
 _acc = acc

Detailed Description

Represents an AI model loaded from an MXQ file.

This class loads an AI model from an MXQ file and provides functions to launch it on the NPU and perform inference.

Definition at line 113 of file model.py.

Constructor & Destructor Documentation

◆ __init__()

qbruntime.model.Model.__init__ ( self,
str path,
Optional[ModelConfig] model_config = None )

Creates a Model object from the specified MXQ model file and configuration.

Parses the MXQ file and constructs a Model object using the provided configuration, initializing the model with the given settings.

Note
The created Model object must be launched before performing inference. See Model.launch for more details.
Parameters
[in] path The path to the MXQ model file.
[in] model_config The configuration settings to initialize the Model.

Definition at line 121 of file model.py.

Member Function Documentation

◆ launch()

None qbruntime.model.Model.launch ( self,
Accelerator acc )

Launches the model on the specified Accelerator, which represents the actual NPU.

Parameters
[in] acc The accelerator on which to launch the model.

Definition at line 144 of file model.py.

◆ dispose()

None qbruntime.model.Model.dispose ( self)

Disposes of the model loaded onto the NPU.

Releases any resources associated with the model on the NPU.

Definition at line 154 of file model.py.
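Taken together, the lifecycle described above (create, launch, infer, dispose) can be sketched as follows. This is a hedged illustration only: the function is defined but not executed here, and `acc` is assumed to be an already-created Accelerator.

```python
def run_once(mxq_path, acc, inputs):
    """Sketch of the typical Model lifecycle: create, launch, infer, dispose.

    Assumes `acc` is a qbruntime Accelerator and `mxq_path` points to an
    MXQ file; illustration only, not executed in this document.
    """
    import qbruntime  # deferred so the sketch can be defined without the SDK

    model = qbruntime.Model(mxq_path)
    model.launch(acc)          # must launch before performing inference
    try:
        return model.infer(inputs)
    finally:
        model.dispose()        # release NPU resources even if inference fails
```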

◆ is_target()

bool qbruntime.model.Model.is_target ( self,
CoreId core_id )

Checks if the NPU core specified by CoreId is the target of the model.

In other words, whether the model is configured to use the given NPU core.

Parameters
[in] core_id The CoreId to check.
Returns
True if the model is configured to use the specified CoreId, False otherwise.

Definition at line 163 of file model.py.

◆ get_core_mode()

CoreMode qbruntime.model.Model.get_core_mode ( self)

Retrieves the core mode of the model.

Returns
The CoreMode of the model.

Definition at line 174 of file model.py.

◆ get_target_cores()

List[CoreId] qbruntime.model.Model.get_target_cores ( self)

Returns the NPU cores the model is configured to use.

Returns
A list of CoreIds representing the target NPU cores.

Definition at line 182 of file model.py.

◆ target_cores()

List[CoreId] qbruntime.model.Model.target_cores ( self)
Deprecated

Definition at line 191 of file model.py.

◆ infer()

Optional[List[np.ndarray]] qbruntime.model.Model.infer ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
Optional[List[np.ndarray]] outputs = None,
int cache_size = 0,
Optional[List[BatchParam]] params = None )

Performs inference.

The following types of inference are supported.

  1. infer(in:List[numpy]) -> List[numpy] (float / int)
  2. infer(in:numpy) -> List[numpy] (float / int)
  3. infer(in:List[numpy], out:List[numpy]) (float / int)
  4. infer(in:List[numpy], out:List[]) (float / int)
  5. infer(in:numpy, out:List[numpy]) (float / int)
  6. infer(in:numpy, out:List[]) (float / int)
Parameters
[in] inputs Input data as a single numpy.ndarray or a list of numpy.ndarray's.
[out] outputs Optional pre-allocated list of numpy.ndarray's to store inference results.
[in] cache_size The number of tokens accumulated in the KV cache so far.
[in] params A list of BatchParam, specifying each batch's information for BatchLLM inference. If params is specified, cache_size is ignored.
Returns
Inference results as a list of numpy.ndarray.

Definition at line 195 of file model.py.
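All six call forms above reduce to list-in/list-out. The helper below is hypothetical (not part of the SDK) and only illustrates how a single ndarray input can be normalized to the list form:

```python
import numpy as np

def normalize_inputs(inputs):
    """Accept a single numpy.ndarray or a list of them, as infer() does,
    and return a list of ndarrays. Hypothetical helper for illustration."""
    if isinstance(inputs, np.ndarray):
        return [inputs]
    return list(inputs)

# Call form 2 (single ndarray) and call form 1 (list of ndarrays):
single = normalize_inputs(np.zeros((1, 3, 224, 224), dtype=np.int8))
many = normalize_inputs([np.zeros(4), np.ones(4)])
```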

◆ infer_hwc()

Optional[List[np.ndarray]] qbruntime.model.Model.infer_hwc ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
Optional[List[np.ndarray]] outputs = None,
int cache_size = 0,
Optional[List[BatchParam]] params = None )

Definition at line 225 of file model.py.

◆ infer_chw()

Optional[List[np.ndarray]] qbruntime.model.Model.infer_chw ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
Optional[List[np.ndarray]] outputs = None,
int cache_size = 0,
Optional[List[BatchParam]] params = None )

Definition at line 234 of file model.py.

◆ _infer()

Optional[List[np.ndarray]] qbruntime.model.Model._infer ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
Optional[List[np.ndarray]] outputs,
int cache_size,
Optional[bool] is_target_hwc = None,
Optional[List[BatchParam]] params = None )
protected

Definition at line 243 of file model.py.

◆ infer_to_float()

List[np.ndarray] qbruntime.model.Model.infer_to_float ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
int cache_size = 0 )

int8_t-to-float inference. Performs inference with input and output elements of type int8_t and converts the results to float.

Using these inference APIs requires manual scaling (quantization) of float values to int8_t for input.

Note
These APIs are intended for advanced use rather than typical usage.

Definition at line 299 of file model.py.
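The manual scaling mentioned above can be sketched with plain numpy. The scale value and the divide-on-quantize/multiply-on-dequantize convention are assumptions for illustration; real scales come from Model.get_input_scale() and Model.get_output_scale():

```python
import numpy as np

INPUT_SCALE = 0.05  # hypothetical per-tensor scale; use get_input_scale() in practice

def quantize_to_int8(x, scale):
    """Scale float values to int8_t, clamping to the representable range."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize_to_float(q, scale):
    """Convert int8_t values back to float using a scale."""
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.2, 1.0], dtype=np.float32)
q = quantize_to_int8(x, INPUT_SCALE)  # int8_t input for the *_to_float-style APIs
```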

◆ infer_hwc_to_float()

List[np.ndarray] qbruntime.model.Model.infer_hwc_to_float ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
int cache_size = 0 )

Definition at line 318 of file model.py.

◆ infer_chw_to_float()

List[np.ndarray] qbruntime.model.Model.infer_chw_to_float ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
int cache_size = 0 )

Definition at line 328 of file model.py.

◆ _infer_to_float()

List[np.ndarray] qbruntime.model.Model._infer_to_float ( self,
Union[np.ndarray, List[np.ndarray]] inputs,
int cache_size,
Optional[bool] is_target_hwc = None )
protected

int8_t-to-float inference. Performs inference with input and output elements of type int8_t and converts the results to float.

Using these inference APIs requires manual scaling (quantization) of float values to int8_t for input.

Note
These APIs are intended for advanced use rather than typical usage.

Definition at line 338 of file model.py.

◆ infer_buffer()

None qbruntime.model.Model.infer_buffer ( self,
List[Buffer] inputs,
List[Buffer] outputs,
List[List[int]] shape = [],
int cache_size = 0 )

Buffer-to-Buffer inference.

Performs inference using input and output elements in the NPU’s internal data type. The inference operates on buffers allocated via Model.acquire_input_buffer() and Model.acquire_output_buffer().

Additionally, Model.reposition_inputs(), Model.reposition_outputs(), ModelVariantHandle.reposition_inputs(), and ModelVariantHandle.reposition_outputs() must be used to move data into and out of these buffers.

Note
These APIs are intended for advanced use rather than typical usage.

Definition at line 375 of file model.py.
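A hedged sketch of the full buffer workflow, using only the Buffer Management and reposition APIs documented on this page. `model` is assumed to be an already-launched Model; the function is defined but not executed here:

```python
def buffer_inference_step(model, np_inputs, np_outputs):
    """Sketch of Buffer-to-Buffer inference with proper buffer management.

    Assumes `model` is a launched qbruntime.model.Model; illustration only.
    """
    in_bufs = model.acquire_input_buffer()    # Buffer Management API
    out_bufs = model.acquire_output_buffer()
    try:
        # Stage numpy inputs into the NPU-format input buffers.
        model.reposition_inputs(np_inputs, in_bufs)
        model.infer_buffer(in_bufs, out_bufs)
        # Copy NPU-format results back into the numpy output arrays.
        model.reposition_outputs(out_bufs, np_outputs)
    finally:
        # Buffers must be released regardless of inference outcome.
        model.release_buffer(in_bufs)
        model.release_buffer(out_bufs)
```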

◆ infer_speedrun()

None qbruntime.model.Model.infer_speedrun ( self)

Development-only API for measuring pure NPU inference speed.

Runs NPU inference without uploading inputs and without retrieving outputs.

Definition at line 403 of file model.py.

◆ infer_async()

Future qbruntime.model.Model.infer_async ( self,
Union[np.ndarray, List[np.ndarray]] inputs )

Asynchronous Inference.

Performs inference asynchronously.

To use asynchronous inference, the model must be created with a ModelConfig in which the async pipeline is enabled, by calling ModelConfig.set_async_pipeline_enabled(True) before passing the configuration to Model().

Example:

import qbruntime
mc = qbruntime.ModelConfig()  # construction details depend on your configuration
mc.set_async_pipeline_enabled(True)
model = qbruntime.Model(MXQ_PATH, mc)
model.launch(acc)  # acc is an already-created Accelerator
future = model.infer_async(inputs)
ret = future.get()
Note
Currently, only CNN-based models are supported, as asynchronous execution is particularly effective for this type of workload.
Limitations:
  • RNN/LSTM and LLM models are not supported yet.
  • Models requiring CPU offloading are not supported yet.
  • Currently, only single-batch inference is supported (i.e., N = 1).
  • Currently, Buffer inference is not supported: the Buffer-based APIs are available in the synchronous path for advanced use cases, but are not yet available for asynchronous inference.

Definition at line 411 of file model.py.

◆ infer_async_to_float()

Future qbruntime.model.Model.infer_async_to_float ( self,
Union[np.ndarray, List[np.ndarray]] inputs )

This method supports int8_t-to-float asynchronous inference.

Parameters
[in] inputs Input data as a single numpy.ndarray or a list of numpy.ndarray's.
Returns
A future that can be used to retrieve the inference result.

Definition at line 465 of file model.py.

◆ reposition_inputs()

None qbruntime.model.Model.reposition_inputs ( self,
List[np.ndarray] inputs,
List[Buffer] input_bufs,
List[List[int]] seqlens = [] )

Reposition input.

Definition at line 488 of file model.py.

◆ reposition_outputs()

None qbruntime.model.Model.reposition_outputs ( self,
List[Buffer] output_bufs,
List[np.ndarray] outputs,
List[List[int]] seqlens = [] )

Reposition output.

Definition at line 500 of file model.py.

◆ get_num_model_variants()

int qbruntime.model.Model.get_num_model_variants ( self)

Returns the total number of model variants available in this model.

The variant_idx parameter passed to Model.get_model_variant_handle() must be in the range [0, return value of this function).

Returns
The total number of model variants.

Definition at line 518 of file model.py.

◆ get_model_variant_handle()

ModelVariantHandle qbruntime.model.Model.get_model_variant_handle ( self,
variant_idx )

Retrieves a handle to the specified model variant.

Use the returned ModelVariantHandle to query details such as input and output shapes for the selected variant.

Parameters
[in] variant_idx Index of the model variant to retrieve. Must be in the range [0, get_num_model_variants()).
Returns
A ModelVariantHandle object if successful; otherwise, raises qbruntime.QbRuntimeError "Model_InvalidVariantIdx".

Definition at line 529 of file model.py.

◆ get_model_input_shape()

List[_Shape] qbruntime.model.Model.get_model_input_shape ( self)

Returns the input shape of the model.

Returns
A list of the model's input shapes.

Definition at line 546 of file model.py.

◆ get_model_output_shape()

List[_Shape] qbruntime.model.Model.get_model_output_shape ( self)

Returns the output shape of the model.

Returns
A list of the model's output shapes.

Definition at line 554 of file model.py.

◆ get_input_scale()

List[Scale] qbruntime.model.Model.get_input_scale ( self)

Returns the input quantization scale(s) of the model.

Returns
A list of input scales.

Definition at line 562 of file model.py.

◆ get_output_scale()

List[Scale] qbruntime.model.Model.get_output_scale ( self)

Returns the output quantization scale(s) of the model.

Returns
A list of output scales.

Definition at line 570 of file model.py.

◆ get_input_buffer_info()

List[BufferInfo] qbruntime.model.Model.get_input_buffer_info ( self)

Returns the input buffer information for the model.

Returns
A list of input buffer information.

Definition at line 578 of file model.py.

◆ get_output_buffer_info()

List[BufferInfo] qbruntime.model.Model.get_output_buffer_info ( self)

Returns the output buffer information of the model.

Returns
A list of output buffer information.

Definition at line 586 of file model.py.

◆ get_model_input_data_type()

DataType qbruntime.model.Model.get_model_input_data_type ( self)

Returns a data type for model inputs.

Returns
An input data type.

Definition at line 594 of file model.py.

◆ get_model_output_data_type()

DataType qbruntime.model.Model.get_model_output_data_type ( self)

Returns a data type for model outputs.

Returns
An output data type.

Definition at line 602 of file model.py.

◆ acquire_input_buffer()

List[Buffer] qbruntime.model.Model.acquire_input_buffer ( self,
List[List[int]] seqlens = [] )

Buffer Management API.

Acquires a list of Buffers for input. This API is required when calling Model.infer_buffer().

Note
These APIs are intended for advanced use rather than typical usage.

Definition at line 610 of file model.py.

◆ acquire_output_buffer()

List[Buffer] qbruntime.model.Model.acquire_output_buffer ( self,
List[List[int]] seqlens = [] )

Buffer Management API.

Acquires a list of Buffers for output. This API is required when calling Model.infer_buffer().

Note
These APIs are intended for advanced use rather than typical usage.

Definition at line 621 of file model.py.

◆ release_buffer()

None qbruntime.model.Model.release_buffer ( self,
List[Buffer] buffer )

Buffer Management API.

Deallocates acquired input/output buffers.

Note
These APIs are intended for advanced use rather than typical usage.

Definition at line 632 of file model.py.

◆ get_identifier()

int qbruntime.model.Model.get_identifier ( self)

Returns the model's unique identifier.

This identifier distinguishes multiple models within a single user program. It is assigned incrementally, starting from 0 (e.g., 0, 1, 2, 3, ...).

Returns
The model identifier.

Definition at line 642 of file model.py.

◆ get_model_path()

str qbruntime.model.Model.get_model_path ( self)

Returns the path to the MXQ model file associated with the Model.

Returns
The MXQ file path.

Definition at line 653 of file model.py.

◆ get_cache_infos()

List[CacheInfo] qbruntime.model.Model.get_cache_infos ( self)

Returns information about the model's KV cache.

Returns
A list of CacheInfo objects.

Definition at line 661 of file model.py.

◆ get_latency_consumed()

int qbruntime.model.Model.get_latency_consumed ( self)
Deprecated

Definition at line 669 of file model.py.

◆ get_latency_finished()

int qbruntime.model.Model.get_latency_finished ( self)
Deprecated

Definition at line 673 of file model.py.

◆ dump_cache_memory()

List[bytes] qbruntime.model.Model.dump_cache_memory ( self,
int cache_id = 0 )

Dumps the KV cache memory into buffers.

Reads the current KV cache data and returns it as a list of byte buffers.

Parameters
[in] cache_id Index of the target cache.
Returns
A list of bytes containing the KV cache data.

Definition at line 677 of file model.py.

◆ load_cache_memory()

None qbruntime.model.Model.load_cache_memory ( self,
List[bytes] bufs,
int cache_id = 0 )

Loads the KV cache memory from buffers.

Restores the KV cache from the provided buffers.

Parameters
[in] bufs A list of bytes containing the KV cache data.
[in] cache_id Index of the target cache.

Definition at line 690 of file model.py.

◆ dump_cache_memory_to()

None qbruntime.model.Model.dump_cache_memory_to ( self,
str cache_dir,
int cache_id = 0 )

Dumps KV cache memory to files in the specified directory.

Writes the KV cache data to binary files within the given directory. Each file is named using the format: cache_<layer_hash>.bin.

Parameters
[in] cache_dir Path to the directory where KV cache files will be saved.
[in] cache_id Index of the target cache.

Definition at line 702 of file model.py.
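The on-disk layout described above can be mimicked in plain Python; the layer hashes below are invented for illustration (real hashes are derived from the model):

```python
import os
import tempfile

layer_hashes = ["a1b2c3", "d4e5f6"]  # hypothetical layer hashes
cache_dir = tempfile.mkdtemp()

# Mimic dump_cache_memory_to(): one cache_<layer_hash>.bin file per layer.
for h in layer_hashes:
    with open(os.path.join(cache_dir, f"cache_{h}.bin"), "wb") as f:
        f.write(b"\x00" * 16)  # placeholder KV data

# Mimic load_cache_memory_from(): read files back by the same naming scheme.
restored = {}
for h in layer_hashes:
    with open(os.path.join(cache_dir, f"cache_{h}.bin"), "rb") as f:
        restored[h] = f.read()
```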

◆ load_cache_memory_from()

None qbruntime.model.Model.load_cache_memory_from ( self,
str cache_dir,
int cache_id = 0 )

Loads the KV cache memory from files in the specified directory.

Reads KV cache data from files within the given directory and restores them. Each file is named using the format: cache_<layer_hash>.bin.

Parameters
[in] cache_dir Path to the directory where KV cache files are saved.

Definition at line 714 of file model.py.

◆ filter_cache_tail()

int qbruntime.model.Model.filter_cache_tail ( self,
int cache_size,
int tail_size,
List[bool] mask )

Filters the tail of the KV cache memory.

Retains the desired caches in the tail of the KV cache memory, excludes the others, and shifts the remaining caches forward.

Parameters
[in] cache_size The number of tokens accumulated in the KV cache so far.
[in] tail_size The tail size of the KV cache to filter (<=32).
[in] mask A mask indicating tokens to retain or exclude at the tail of the KV cache.
Returns
New cache size after tail filtering.

Definition at line 725 of file model.py.
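The tail-filtering semantics described above can be modeled on a plain Python list. This is a host-side simulation of the documented behavior, not the SDK implementation (which operates on NPU cache memory):

```python
def filter_cache_tail_sim(cache, cache_size, tail_size, mask):
    """Keep masked entries in the last tail_size positions, drop the rest
    of the tail, shift the kept entries forward, and return the new size.

    Host-side simulation of filter_cache_tail() semantics for illustration.
    """
    head = cache[:cache_size - tail_size]
    tail = cache[cache_size - tail_size:cache_size]
    kept = [token for token, keep in zip(tail, mask) if keep]
    new_cache = head + kept
    return new_cache, len(new_cache)

cache = list(range(10))  # 10 tokens accumulated so far
# Tail is [6, 7, 8, 9]; the mask retains tokens 6 and 8.
new_cache, new_size = filter_cache_tail_sim(cache, 10, 4, [True, False, True, False])
```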

◆ move_cache_tail()

int qbruntime.model.Model.move_cache_tail ( self,
int num_head,
int num_tail,
int cache_size )

Moves the tail of the KV cache memory to the end of the head.

Slices the tail of the KV cache memory up to the specified size and moves it to the designated cache position.

Parameters
[in] num_head The size of the KV cache head where the tail is appended.
[in] num_tail The size of the KV cache tail to be moved.
[in] cache_size The total number of tokens accumulated in the KV cache so far.
Returns
The updated cache size after moving the tail.

Definition at line 743 of file model.py.
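Likewise, the tail-moving semantics can be modeled on a plain Python list; again a host-side simulation of the documented behavior, not the SDK implementation:

```python
def move_cache_tail_sim(cache, num_head, num_tail, cache_size):
    """Move the last num_tail entries so they directly follow the first
    num_head entries, and return the updated cache plus its new size.

    Host-side simulation of move_cache_tail() semantics for illustration.
    """
    head = cache[:num_head]
    tail = cache[cache_size - num_tail:cache_size]
    new_cache = head + tail
    return new_cache, len(new_cache)

cache = list(range(12))  # 12 tokens accumulated so far
# Head is [0, 1, 2, 3]; tail [9, 10, 11] is appended right after the head.
new_cache, new_size = move_cache_tail_sim(cache, 4, 3, 12)
```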

Member Data Documentation

◆ _model

qbruntime.model.Model._model = _cQbRuntime.Model(path)
protected

Definition at line 135 of file model.py.

◆ _input_shape

List[_Shape] qbruntime.model.Model._input_shape = self.get_model_input_shape()
protected

Definition at line 141 of file model.py.

◆ _output_shape

List[_Shape] qbruntime.model.Model._output_shape = self.get_model_output_shape()
protected

Definition at line 142 of file model.py.

◆ _acc

qbruntime.model.Model._acc = acc
protected

Definition at line 152 of file model.py.


The documentation for this class was generated from the following file:

  • model.py