maccel.model.Model Class Reference
Mobilint SDK qb, Runtime Library v0.30

Represents an AI model loaded from an MXQ file.
Public Member Functions

__init__(self, str path, Optional[ModelConfig] model_config=None)
    Creates a Model object from the specified MXQ model file and configuration.
None launch(self, Accelerator acc)
    Launches the model on the specified Accelerator, which represents the actual NPU.
None dispose(self)
    Disposes of the model loaded onto the NPU.
bool is_target(self, CoreId core_id)
    Checks whether the NPU core specified by CoreId is a target of the model.
CoreMode get_core_mode(self)
    Retrieves the core mode of the model.
List[CoreId] get_target_cores(self)
    Returns the NPU cores the model is configured to use.
List[CoreId] target_cores(self)
Optional[List[np.ndarray]] infer(self, Union[np.ndarray, List[np.ndarray]] inputs, Optional[List[np.ndarray]] outputs=None, int cache_size=0)
    Performs inference.
List[np.ndarray] infer_to_float(self, Union[np.ndarray, List[np.ndarray]] inputs, int cache_size=0)
    int8_t-to-float inference: performs inference with int8_t input elements and returns float results.
None infer_buffer(self, List[Buffer] inputs, List[Buffer] outputs, List[List[int]] shape=[], int cache_size=0)
    Buffer-to-buffer inference.
None infer_speedrun(self)
    Development-only API for measuring pure NPU inference speed.
Future infer_async(self, Union[np.ndarray, List[np.ndarray]] inputs)
    Asynchronous inference.
Future infer_async_to_float(self, Union[np.ndarray, List[np.ndarray]] inputs)
    int8_t-to-float asynchronous inference.
None reposition_inputs(self, List[np.ndarray] inputs, List[Buffer] input_bufs, List[List[int]] seqlens=[])
    Repositions input data.
None reposition_outputs(self, List[Buffer] output_bufs, List[np.ndarray] outputs, List[List[int]] seqlens=[])
    Repositions output data.
int get_num_model_variants(self)
    Returns the total number of model variants available in this model.
ModelVariantHandle get_model_variant_handle(self, variant_idx)
    Retrieves a handle to the specified model variant.
List[_Shape] get_model_input_shape(self)
    Returns the input shape of the model.
List[_Shape] get_model_output_shape(self)
    Returns the output shape of the model.
List[Scale] get_input_scale(self)
    Returns the input quantization scale(s) of the model.
List[Scale] get_output_scale(self)
    Returns the output quantization scale(s) of the model.
List[BufferInfo] get_input_buffer_info(self)
    Returns the input buffer information of the model.
List[BufferInfo] get_output_buffer_info(self)
    Returns the output buffer information of the model.
List[Buffer] acquire_input_buffer(self, List[List[int]] seqlens=[])
    Buffer management API.
List[Buffer] acquire_output_buffer(self, List[List[int]] seqlens=[])
    Buffer management API.
None release_buffer(self, List[Buffer] buffer)
    Buffer management API.
int get_identifier(self)
    Returns the model's unique identifier.
str get_model_path(self)
    Returns the path to the MXQ model file associated with the Model.
List[CacheInfo] get_cache_infos(self)
    Returns information about the model's KV cache.
SchedulePolicy get_schedule_policy(self)
LatencySetPolicy get_latency_set_policy(self)
MaintenancePolicy get_maintenance_policy(self)
int get_latency_consumed(self)
int get_latency_finished(self)
None reset_cache_memory(self)
    Resets the KV cache memory.
List[bytes] dump_cache_memory(self)
    Dumps the KV cache memory into buffers.
None load_cache_memory(self, List[bytes] bufs)
    Loads the KV cache memory from buffers.
None dump_cache_memory_to(self, str cache_dir)
    Dumps the KV cache memory to files in the specified directory.
None load_cache_memory_from(self, str cache_dir)
    Loads the KV cache memory from files in the specified directory.
int filter_cache_tail(self, int cache_size, int tail_size, List[bool] mask)
    Filters the tail of the KV cache memory.
int move_cache_tail(self, int num_head, int num_tail, int cache_size)
    Moves the tail of the KV cache memory to the end of the head.

Static Public Attributes

Optional[List[np.ndarray]] infer_chw = infer
List[np.ndarray] infer_chw_to_float = infer_to_float

Protected Attributes

_model = _cMaccel.Model(path)
List[_Shape] _input_shape = self.get_model_input_shape()
List[_Shape] _output_shape = self.get_model_output_shape()
_acc = acc
Detailed Description
Represents an AI model loaded from an MXQ file.
This class loads an AI model from an MXQ file and provides functions to launch it on the NPU and perform inference.
Constructor & Destructor Documentation
◆ __init__()
maccel.model.Model.__init__(self, str path, Optional[ModelConfig] model_config=None)
Creates a Model object from the specified MXQ model file and configuration.
Parses the MXQ file and constructs a Model object using the provided configuration, initializing the model with the given settings.
Note
    The created Model object must be launched before performing inference. See Model.launch for more details.

Parameters
    [in] path          The path to the MXQ model file.
    [in] model_config  The configuration settings to initialize the Model.
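Example (a minimal construction sketch; "model.mxq" is a placeholder path, and the import follows this class's qualified name):

    from maccel.model import Model

    # "model.mxq" is a placeholder path to a compiled MXQ model file.
    model = Model("model.mxq")

    # Optional: pass a ModelConfig to customize initialization.
    # config = ModelConfig(); model = Model("model.mxq", config)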
Member Function Documentation
◆ launch()
None maccel.model.Model.launch(self, Accelerator acc)
Launches the model on the specified Accelerator, which represents the actual NPU.
Parameters
    [in] acc  The accelerator on which to launch the model.
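Example (a hedged lifecycle sketch; the Accelerator import location and constructor arguments are assumptions, so consult the Accelerator reference for the actual device-selection options):

    from maccel.model import Model
    # Assumed import location for the device handle:
    from maccel import Accelerator

    acc = Accelerator()          # NPU device handle (constructor args assumed)
    model = Model("model.mxq")   # placeholder path
    model.launch(acc)            # required before any inference call
    # ... run inference here ...
    model.dispose()              # release the model from the NPU when done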
◆ dispose()

None maccel.model.Model.dispose(self)

◆ is_target()

bool maccel.model.Model.is_target(self, CoreId core_id)

◆ get_core_mode()

CoreMode maccel.model.Model.get_core_mode(self)

◆ get_target_cores()

List[CoreId] maccel.model.Model.get_target_cores(self)

◆ target_cores()

List[CoreId] maccel.model.Model.target_cores(self)

◆ infer()

Optional[List[np.ndarray]] maccel.model.Model.infer(self, Union[np.ndarray, List[np.ndarray]] inputs, Optional[List[np.ndarray]] outputs=None, int cache_size=0)
Performs inference.
The following forms of inference are supported:
- infer(in: List[numpy]) -> List[numpy] (float / int)
- infer(in: numpy) -> List[numpy] (float / int)
- infer(in: List[numpy], out: List[numpy]) (float / int)
- infer(in: List[numpy], out: List[]) (float / int)
- infer(in: numpy, out: List[numpy]) (float / int)
- infer(in: numpy, out: List[]) (float / int)
Parameters
    [in]  inputs   Input data as a single numpy.ndarray or a list of numpy.ndarray's.
    [out] outputs  Optional pre-allocated list of numpy.ndarray's to store inference results.

Returns
    Inference results as a list of numpy.ndarray.
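Example (a usage sketch of the overloads above; model is assumed to be already launched, float32 input is illustrative, and _Shape is assumed to be tuple-like):

    import numpy as np

    # model: a launched Model (see launch() above).
    shape = model.get_model_input_shape()[0]
    x = np.zeros(shape, dtype=np.float32)

    outputs = model.infer(x)      # single-array form -> List[np.ndarray]
    outputs = model.infer([x])    # list form -> List[np.ndarray]

    out_list = []                 # output-list form: per the overload list
    model.infer([x], out_list)    # above, results land in out_list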
◆ infer_to_float()

List[np.ndarray] maccel.model.Model.infer_to_float(self, Union[np.ndarray, List[np.ndarray]] inputs, int cache_size=0)

int8_t-to-float inference: performs inference with int8_t input elements and returns the results as float values.
Using these inference APIs requires manual scaling (quantization) of float values to int8_t for input.
Note
    These APIs are intended for advanced use rather than typical usage.
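Example (a manual-quantization sketch; the symmetric round-and-clip formula and treating Scale as a plain float are assumptions, so consult the Scale reference for the authoritative mapping):

    import numpy as np

    # model: a launched Model. Scale is treated as a plain float (assumption).
    scale = model.get_input_scale()[0]
    x_float = np.random.rand(*model.get_model_input_shape()[0]).astype(np.float32)

    # Quantize float -> int8_t by hand (symmetric scheme assumed).
    x_int8 = np.clip(np.round(x_float / scale), -128, 127).astype(np.int8)

    outputs = model.infer_to_float(x_int8)   # results come back as float arrays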
◆ infer_buffer()

None maccel.model.Model.infer_buffer(self, List[Buffer] inputs, List[Buffer] outputs, List[List[int]] shape=[], int cache_size=0)
Buffer-to-Buffer inference.
Performs inference using input and output elements in the NPU’s internal data type. The inference operates on buffers allocated via the following APIs:
- Model.acquire_input_buffer()
- Model.acquire_output_buffer()
- ModelVariantHandle.acquire_input_buffer()
- ModelVariantHandle.acquire_output_buffer()
Additionally, Model.reposition_inputs(), Model.reposition_outputs(), ModelVariantHandle.reposition_inputs(), and ModelVariantHandle.reposition_outputs() must be used to move data into and out of these buffers.

Note
    These APIs are intended for advanced use rather than typical usage.
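Example (a hedged end-to-end sketch of the buffer path, combining the acquire/reposition/release APIs named above; the int8 host-array dtype is an assumption):

    import numpy as np

    # model: a launched Model; host-side arrays for repositioning.
    inputs = [np.zeros(s, dtype=np.int8) for s in model.get_model_input_shape()]
    outputs = [np.zeros(s, dtype=np.int8) for s in model.get_model_output_shape()]

    in_bufs = model.acquire_input_buffer()
    out_bufs = model.acquire_output_buffer()

    model.reposition_inputs(inputs, in_bufs)      # host arrays -> buffer layout
    model.infer_buffer(in_bufs, out_bufs)         # runs in the NPU's internal dtype
    model.reposition_outputs(out_bufs, outputs)   # buffer layout -> host arrays

    model.release_buffer(in_bufs)                 # buffers must be released
    model.release_buffer(out_bufs)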
◆ infer_speedrun()

None maccel.model.Model.infer_speedrun(self)

◆ infer_async()

Future maccel.model.Model.infer_async(self, Union[np.ndarray, List[np.ndarray]] inputs)
Asynchronous Inference.
Performs inference asynchronously.
To use asynchronous inference, the model must be created using a ModelConfig object with the async pipeline configured to be enabled. This is done by calling ModelConfig.set_async_pipeline_enabled(True) before passing the configuration to Model().
Example:
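(A hedged sketch: ModelConfig.set_async_pipeline_enabled is documented above, while the ModelConfig import location and the Future result-retrieval method, get() here, are assumptions.)

    from maccel.model import Model

    config = ModelConfig()                     # import location assumed
    config.set_async_pipeline_enabled(True)    # enable the async pipeline

    model = Model("model.mxq", config)         # placeholder path
    model.launch(acc)                          # acc: an Accelerator (see launch())

    future = model.infer_async(x)              # x: input ndarray; returns a Future
    outputs = future.get()                     # blocking retrieval; name assumed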
Note
    Currently, only CNN-based models are supported, as asynchronous execution is particularly effective for this type of workload.

    Limitations:
    - RNN/LSTM and LLM models are not supported yet.
    - Models requiring CPU offloading are not supported yet.
    - Currently, only single-batch inference is supported (i.e., N = 1).
    - Buffer inference is not supported yet: the Buffer-based path available in the synchronous API for advanced use cases (Model.infer_buffer()) is not yet available for asynchronous inference.
◆ infer_async_to_float()

Future maccel.model.Model.infer_async_to_float(self, Union[np.ndarray, List[np.ndarray]] inputs)

int8_t-to-float asynchronous inference.
◆ reposition_inputs()

None maccel.model.Model.reposition_inputs(self, List[np.ndarray] inputs, List[Buffer] input_bufs, List[List[int]] seqlens=[])

◆ reposition_outputs()

None maccel.model.Model.reposition_outputs(self, List[Buffer] output_bufs, List[np.ndarray] outputs, List[List[int]] seqlens=[])

◆ get_num_model_variants()

int maccel.model.Model.get_num_model_variants(self)
Returns the total number of model variants available in this model.
The variant_idx parameter passed to Model.get_model_variant_handle() must be in the range [0, return value of this function).
Returns
    The total number of model variants.
◆ get_model_variant_handle()

ModelVariantHandle maccel.model.Model.get_model_variant_handle(self, variant_idx)
Retrieves a handle to the specified model variant.
Use the returned ModelVariantHandle to query details such as input and output shapes for the selected variant.
Parameters
    [in] variant_idx  Index of the model variant to retrieve. Must be in the range [0, get_num_model_variants()).

Returns
    A ModelVariantHandle object if successful; otherwise raises maccel.MAccelError "Model_InvalidVariantIdx".
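Example (an enumeration sketch; per-variant accessor names are covered in the ModelVariantHandle reference):

    # model: a launched Model. The index must be in [0, get_num_model_variants()).
    for idx in range(model.get_num_model_variants()):
        handle = model.get_model_variant_handle(idx)  # raises on a bad index
        print(idx, handle)  # query per-variant shapes etc. via the handle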
◆ get_model_input_shape()

List[_Shape] maccel.model.Model.get_model_input_shape(self)

◆ get_model_output_shape()

List[_Shape] maccel.model.Model.get_model_output_shape(self)

◆ get_input_scale()

List[Scale] maccel.model.Model.get_input_scale(self)

◆ get_output_scale()

List[Scale] maccel.model.Model.get_output_scale(self)

◆ get_input_buffer_info()

List[BufferInfo] maccel.model.Model.get_input_buffer_info(self)

◆ get_output_buffer_info()

List[BufferInfo] maccel.model.Model.get_output_buffer_info(self)

◆ acquire_input_buffer()

List[Buffer] maccel.model.Model.acquire_input_buffer(self, List[List[int]] seqlens=[])
Buffer management API.

Acquires a list of Buffer objects for input. This API is required when calling Model.infer_buffer().

Note
    These APIs are intended for advanced use rather than typical usage.
◆ acquire_output_buffer()

List[Buffer] maccel.model.Model.acquire_output_buffer(self, List[List[int]] seqlens=[])
Buffer management API.

Acquires a list of Buffer objects for output. This API is required when calling Model.infer_buffer().

Note
    These APIs are intended for advanced use rather than typical usage.
◆ release_buffer()

None maccel.model.Model.release_buffer(self, List[Buffer] buffer)

◆ get_identifier()

int maccel.model.Model.get_identifier(self)

◆ get_model_path()

str maccel.model.Model.get_model_path(self)

◆ get_cache_infos()

List[CacheInfo] maccel.model.Model.get_cache_infos(self)

◆ get_schedule_policy()

SchedulePolicy maccel.model.Model.get_schedule_policy(self)

◆ get_latency_set_policy()

LatencySetPolicy maccel.model.Model.get_latency_set_policy(self)

◆ get_maintenance_policy()

MaintenancePolicy maccel.model.Model.get_maintenance_policy(self)

◆ get_latency_consumed()

int maccel.model.Model.get_latency_consumed(self)

◆ get_latency_finished()

int maccel.model.Model.get_latency_finished(self)

◆ reset_cache_memory()

None maccel.model.Model.reset_cache_memory(self)

◆ dump_cache_memory()

List[bytes] maccel.model.Model.dump_cache_memory(self)

◆ load_cache_memory()

None maccel.model.Model.load_cache_memory(self, List[bytes] bufs)

◆ dump_cache_memory_to()

None maccel.model.Model.dump_cache_memory_to(self, str cache_dir)
Dumps KV cache memory to files in the specified directory.
Writes the KV cache data to binary files within the given directory. Each file is named using the format: cache_<layer_hash>.bin.
Parameters
    [in] cache_dir  Path to the directory where KV cache files will be saved.
◆ load_cache_memory_from()

None maccel.model.Model.load_cache_memory_from(self, str cache_dir)
Loads the KV cache memory from files in the specified directory.
Reads KV cache data from files within the given directory and restores them. Each file is named using the format: cache_<layer_hash>.bin.
Parameters
    [in] cache_dir  Path to the directory where KV cache files are saved.
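Example (a round-trip sketch combining the KV-cache persistence APIs above; "kv_cache_dir" is a placeholder directory):

    # model: a launched Model that maintains a KV cache (e.g., an LLM).
    snapshot = model.dump_cache_memory()          # in-memory snapshot, List[bytes]
    model.reset_cache_memory()                    # clear the on-device KV cache
    model.load_cache_memory(snapshot)             # restore the snapshot

    model.dump_cache_memory_to("kv_cache_dir")    # writes cache_<layer_hash>.bin files
    model.load_cache_memory_from("kv_cache_dir")  # reads the same files back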
◆ filter_cache_tail()

int maccel.model.Model.filter_cache_tail(self, int cache_size, int tail_size, List[bool] mask)

Filters the tail of the KV cache memory.
Retains the desired caches in the tail of the KV cache memory, excludes the others, and shifts the remaining caches forward.
Parameters
    [in] cache_size  The number of tokens accumulated in the KV cache so far.
    [in] tail_size   The tail size of the KV cache to filter (<= 32).
    [in] mask        A mask indicating which tokens to retain or exclude at the tail of the KV cache.

Returns
    The new cache size after tail filtering.
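Example (a worked sketch with illustrative numbers, under the reading of the parameters above):

    # model: a launched Model with 100 tokens accumulated in its KV cache.
    cache_size = 100
    tail_size = 4                        # must be <= 32
    mask = [True, False, True, False]    # keep tail tokens 0 and 2, drop 1 and 3

    new_size = model.filter_cache_tail(cache_size, tail_size, mask)
    # Two tail tokens were excluded and the rest shifted forward,
    # so under this reading new_size == 98.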
◆ move_cache_tail()

int maccel.model.Model.move_cache_tail(self, int num_head, int num_tail, int cache_size)
Moves the tail of the KV cache memory to the end of the head.
Slices the tail of the KV cache memory up to the specified size and moves it to the designated cache position.
Parameters
    [in] num_head    The size of the KV cache head where the tail is appended.
    [in] num_tail    The size of the KV cache tail to be moved.
    [in] cache_size  The total number of tokens accumulated in the KV cache so far.

Returns
    The updated cache size after moving the tail.
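Example (a worked sketch with illustrative numbers; the exact return semantics should be confirmed against the runtime):

    # model: a launched Model with 100 tokens accumulated in its KV cache.
    new_size = model.move_cache_tail(60, 4, 100)  # num_head, num_tail, cache_size
    # The 4-token tail is appended directly after the 60-token head;
    # under this reading, new_size == 64.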
Member Data Documentation
◆ infer_chw

Optional[List[np.ndarray]] infer_chw = infer  (static)

◆ infer_chw_to_float

List[np.ndarray] infer_chw_to_float = infer_to_float  (static)

◆ _model

_model = _cMaccel.Model(path)  (protected)

◆ _input_shape

List[_Shape] _input_shape = self.get_model_input_shape()  (protected)

◆ _output_shape

List[_Shape] _output_shape = self.get_model_output_shape()  (protected)

◆ _acc

_acc = acc  (protected)