Model Class Reference
Mobilint SDK qb, Runtime Library v0.30

Represents an AI model loaded from an MXQ file.

#include <model.h>

Public Member Functions
Model(const Model &other) = delete
Model(Model &&other) noexcept
Model &operator=(const Model &rhs) = delete
Model &operator=(Model &&rhs) noexcept
StatusCode launch(Accelerator &acc)
    Launches the model on the specified Accelerator, which represents the actual NPU.
StatusCode dispose()
    Disposes of the model loaded onto the NPU.
CoreMode getCoreMode() const
    Retrieves the core mode of the model.
bool isTarget(CoreId core_id) const
    Checks whether the NPU core specified by CoreId is a target of the model, i.e., whether the model is configured to use the given NPU core.
std::vector<CoreId> getTargetCores() const
    Returns the NPU cores the model is configured to use.
StatusCode inferSpeedrun(int variant_idx = 0)
    Development-only API for measuring pure NPU inference speed.
int getNumModelVariants() const
    Returns the total number of model variants available in this model.
std::unique_ptr<ModelVariantHandle> getModelVariantHandle(int variant_idx, StatusCode &sc) const
    Retrieves a handle to the specified model variant.
const std::vector<std::vector<int64_t>> &getModelInputShape() const
    Returns the input shape of the model.
const std::vector<std::vector<int64_t>> &getModelOutputShape() const
    Returns the output shape of the model.
const std::vector<BufferInfo> &getInputBufferInfo() const
    Returns the input buffer information of the model.
const std::vector<BufferInfo> &getOutputBufferInfo() const
    Returns the output buffer information of the model.
std::vector<Scale> getInputScale() const
    Returns the input quantization scale(s) of the model.
std::vector<Scale> getOutputScale() const
    Returns the output quantization scale(s) of the model.
uint32_t getIdentifier() const
    Returns the model's unique identifier.
std::string getModelPath() const
    Returns the path to the MXQ model file associated with the Model.
std::vector<CacheInfo> getCacheInfos() const
    Returns information about the KV cache of the model.
NHWC float-to-float inference

Performs inference with input and output elements of type float in NHWC (batch N, height H, width W, channels C) or HWC format. Two input/output type pairs are supported: NDArray<float> inputs with NDArray<float> outputs, and raw float* inputs with std::vector<float> outputs.
StatusCode infer(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output)
    Performs inference.
std::vector<NDArray<float>> infer(const std::vector<NDArray<float>> &input, StatusCode &sc)
    This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.
StatusCode infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector<std::vector<float>> infer(const std::vector<float *> &input, StatusCode &sc)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector<std::vector<float>> infer(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode infer(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output, uint32_t cache_size)
    This overload supports inference with KV cache.
std::vector<NDArray<float>> infer(const std::vector<NDArray<float>> &input, uint32_t cache_size, StatusCode &sc)
    This overload supports inference with KV cache.
StatusCode infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size)
    This overload supports inference with KV cache.
std::vector<std::vector<float>> infer(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)
    This overload supports inference with KV cache.
NCHW float-to-float inference

Performs inference with input and output elements of type float in NCHW (batch N, channels C, height H, width W) or CHW format. Two input/output type pairs are supported: NDArray<float> inputs with NDArray<float> outputs, and raw float* inputs with std::vector<float> outputs.
StatusCode inferCHW(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output)
    Performs inference.
std::vector<NDArray<float>> inferCHW(const std::vector<NDArray<float>> &input, StatusCode &sc)
    This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.
StatusCode inferCHW(const std::vector<float *> &input, std::vector<std::vector<float>> &output)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector<std::vector<float>> inferCHW(const std::vector<float *> &input, StatusCode &sc)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode inferCHW(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector<std::vector<float>> inferCHW(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)
    This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode inferCHW(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output, uint32_t cache_size)
    This overload supports inference with KV cache.
std::vector<NDArray<float>> inferCHW(const std::vector<NDArray<float>> &input, uint32_t cache_size, StatusCode &sc)
    This overload supports inference with KV cache.
StatusCode inferCHW(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size)
    This overload supports inference with KV cache.
std::vector<std::vector<float>> inferCHW(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)
    This overload supports inference with KV cache.
NHWC int8_t-to-int8_t inference

Performs inference with input and output elements of type int8_t in NHWC (batch N, height H, width W, channels C) or HWC format. Using these inference APIs requires manual scaling: quantizing float input values to int8_t, and dequantizing int8_t output values back to float. A hedged conversion sketch follows the overload list below.
StatusCode infer(const std::vector<NDArray<int8_t>> &input, std::vector<NDArray<int8_t>> &output)
std::vector<NDArray<int8_t>> infer(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
StatusCode infer(const std::vector<int8_t *> &input, std::vector<std::vector<int8_t>> &output)
std::vector<std::vector<int8_t>> infer(const std::vector<int8_t *> &input, StatusCode &sc)
StatusCode infer(const std::vector<int8_t *> &input, std::vector<std::vector<int8_t>> &output, const std::vector<std::vector<int64_t>> &shape)
std::vector<std::vector<int8_t>> infer(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)
StatusCode infer(const std::vector<NDArray<int8_t>> &input, std::vector<NDArray<int8_t>> &output, uint32_t cache_size)
std::vector<NDArray<int8_t>> infer(const std::vector<NDArray<int8_t>> &input, uint32_t cache_size, StatusCode &sc)
StatusCode infer(const std::vector<int8_t *> &input, std::vector<std::vector<int8_t>> &output, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size)
std::vector<std::vector<int8_t>> infer(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)
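The scales for this conversion are available from Model::getInputScale and Model::getOutputScale. Below is a minimal conversion sketch, assuming a plain per-tensor scale factor has been extracted from the Scale type (whose exact layout and convention are defined in the type reference; the division/multiplication direction shown is the common one and is an assumption here):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Hypothetical helpers; `scale` stands for the factor carried by a Scale entry.
    std::vector<int8_t> quantize(const std::vector<float>& x, float scale) {
        std::vector<int8_t> q(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            // Round to nearest, then clamp to the int8_t range.
            float v = std::round(x[i] / scale);
            q[i] = static_cast<int8_t>(std::clamp(v, -128.0f, 127.0f));
        }
        return q;
    }

    std::vector<float> dequantize(const std::vector<int8_t>& q, float scale) {
        std::vector<float> x(q.size());
        for (size_t i = 0; i < q.size(); ++i) x[i] = q[i] * scale;
        return x;
    }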
NCHW int8_t-to-int8_t inference

Performs inference with input and output elements of type int8_t in NCHW (batch N, channels C, height H, width W) or CHW format. Using these inference APIs requires manual scaling: quantizing float input values to int8_t, and dequantizing int8_t output values back to float.
StatusCode inferCHW(const std::vector<NDArray<int8_t>> &input, std::vector<NDArray<int8_t>> &output)
std::vector<NDArray<int8_t>> inferCHW(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
StatusCode inferCHW(const std::vector<int8_t *> &input, std::vector<std::vector<int8_t>> &output)
std::vector<std::vector<int8_t>> inferCHW(const std::vector<int8_t *> &input, StatusCode &sc)
StatusCode inferCHW(const std::vector<int8_t *> &input, std::vector<std::vector<int8_t>> &output, const std::vector<std::vector<int64_t>> &shape)
std::vector<std::vector<int8_t>> inferCHW(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)
StatusCode inferCHW(const std::vector<NDArray<int8_t>> &input, std::vector<NDArray<int8_t>> &output, uint32_t cache_size)
std::vector<NDArray<int8_t>> inferCHW(const std::vector<NDArray<int8_t>> &input, uint32_t cache_size, StatusCode &sc)
StatusCode inferCHW(const std::vector<int8_t *> &input, std::vector<std::vector<int8_t>> &output, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size)
std::vector<std::vector<int8_t>> inferCHW(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)
NHWC int8_t-to-float inference

Performs inference with input elements of type int8_t and output elements of type float, in NHWC (batch N, height H, width W, channels C) or HWC format. Using these inference APIs requires manual scaling (quantization) of float input values to int8_t; outputs are returned as float.
std::vector<NDArray<float>> inferToFloat(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
std::vector<std::vector<float>> inferToFloat(const std::vector<int8_t *> &input, StatusCode &sc)
std::vector<std::vector<float>> inferToFloat(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)
std::vector<NDArray<float>> inferToFloat(const std::vector<NDArray<int8_t>> &input, uint32_t cache_size, StatusCode &sc)
std::vector<std::vector<float>> inferToFloat(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)
NCHW int8_t-to-float inference

Performs inference with input elements of type int8_t and output elements of type float, in NCHW (batch N, channels C, height H, width W) or CHW format. Using these inference APIs requires manual scaling (quantization) of float input values to int8_t; outputs are returned as float.
std::vector<NDArray<float>> inferCHWToFloat(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
std::vector<std::vector<float>> inferCHWToFloat(const std::vector<int8_t *> &input, StatusCode &sc)
std::vector<std::vector<float>> inferCHWToFloat(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)
std::vector<NDArray<float>> inferCHWToFloat(const std::vector<NDArray<int8_t>> &input, uint32_t cache_size, StatusCode &sc)
std::vector<std::vector<float>> inferCHWToFloat(const std::vector<int8_t *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)
NHWC Buffer-to-Buffer inference

Performs inference using input and output elements in the NPU's internal data type. The inference operates on buffers allocated via Model::acquireInputBuffer, Model::acquireOutputBuffer, Model::acquireInputBuffers, and Model::acquireOutputBuffers. Additionally, Model::repositionInputs, Model::repositionOutputs, ModelVariantHandle::repositionInputs, and ModelVariantHandle::repositionOutputs must be used to lay data out correctly.
StatusCode inferBuffer(const std::vector<Buffer> &input, std::vector<Buffer> &output, const std::vector<std::vector<int64_t>> &shape = {}, uint32_t cache_size = 0)
StatusCode inferBuffer(const std::vector<std::vector<Buffer>> &input, std::vector<std::vector<Buffer>> &output, const std::vector<std::vector<int64_t>> &shape = {}, uint32_t cache_size = 0)
NHWC Buffer-to-float inference

Performs inference using input elements in the NPU's internal data type, with output returned as float. The inference operates on buffers allocated via Model::acquireInputBuffer and Model::acquireInputBuffers. Additionally, Model::repositionInputs and ModelVariantHandle::repositionInputs must be used to lay the input data out correctly.
StatusCode inferBufferToFloat(const std::vector<Buffer> &input, std::vector<NDArray<float>> &output, const std::vector<std::vector<int64_t>> &shape = {}, uint32_t cache_size = 0)
StatusCode inferBufferToFloat(const std::vector<std::vector<Buffer>> &input, std::vector<NDArray<float>> &output, const std::vector<std::vector<int64_t>> &shape = {}, uint32_t cache_size = 0)
StatusCode inferBufferToFloat(const std::vector<Buffer> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape = {}, uint32_t cache_size = 0)
StatusCode inferBufferToFloat(const std::vector<std::vector<Buffer>> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape = {}, uint32_t cache_size = 0)
Asynchronous Inference

Performs inference asynchronously. To use asynchronous inference, the model must be created from a ModelConfig object with the async pipeline enabled. This is done by calling ModelConfig::setAsyncPipelineEnabled(true) before passing the configuration to Model::create. Example:

    using namespace mobilint;

    ModelConfig mc;
    // Enables support for `inferAsync` and `inferAsyncCHW`
    mc.setAsyncPipelineEnabled(true);

    StatusCode sc;
    std::unique_ptr<Model> model = Model::create("resnet50.mxq", mc, sc);
    if (!sc) {
        // Handle error appropriately
    }

    // Now `inferAsync` can be called.
    Future<float> future = model->inferAsync(input, sc);
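The returned Future is then used to retrieve the inference result once it completes. As a hedged continuation of the example above, assuming Future exposes a std::future-style get() (consult future.h for the actual retrieval API):

    // Assumed retrieval call; see the Future class reference for the real API.
    auto result = future.get();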
Future<float> inferAsync(const std::vector<NDArray<float>> &input, StatusCode &sc)
    Initiates asynchronous inference with input in NHWC (batch N, height H, width W, channels C) or HWC format.
Future<float> inferAsyncCHW(const std::vector<NDArray<float>> &input, StatusCode &sc)
    Initiates asynchronous inference with input in NCHW (batch N, channels C, height H, width W) or CHW format.
Future<int8_t> inferAsync(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
    This overload supports int8_t-to-int8_t asynchronous inference.
Future<int8_t> inferAsyncCHW(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
    This overload supports int8_t-to-int8_t asynchronous inference.
Future<float> inferAsyncToFloat(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
    This overload supports int8_t-to-float asynchronous inference.
Future<float> inferAsyncCHWToFloat(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)
    This overload supports int8_t-to-float asynchronous inference.
Buffer Management APIs

These APIs are required when calling Model::inferBuffer or Model::inferBufferToFloat. Buffers are acquired with acquireInputBuffer, acquireOutputBuffer, acquireInputBuffers, and acquireOutputBuffers. Any acquired buffer must be released with releaseBuffer or releaseBuffers. Repositioning of data into and out of the NPU's internal layout is handled by repositionInputs and repositionOutputs. A usage sketch follows the list below.
std::vector<Buffer> acquireInputBuffer(const std::vector<std::vector<int>> &seqlens = {}) const
std::vector<Buffer> acquireOutputBuffer(const std::vector<std::vector<int>> &seqlens = {}) const
std::vector<std::vector<Buffer>> acquireInputBuffers(const int batch_size, const std::vector<std::vector<int>> &seqlens = {}) const
std::vector<std::vector<Buffer>> acquireOutputBuffers(const int batch_size, const std::vector<std::vector<int>> &seqlens = {}) const
StatusCode releaseBuffer(std::vector<Buffer> &buffer) const
StatusCode releaseBuffers(std::vector<std::vector<Buffer>> &buffers) const
StatusCode repositionInputs(const std::vector<float *> &input, std::vector<Buffer> &input_buf, const std::vector<std::vector<int>> &seqlens = {}) const
StatusCode repositionOutputs(const std::vector<Buffer> &output_buf, std::vector<float *> &output, const std::vector<std::vector<int>> &seqlens = {}) const
StatusCode repositionOutputs(const std::vector<Buffer> &output_buf, std::vector<std::vector<float>> &output, const std::vector<std::vector<int>> &seqlens = {}) const
StatusCode repositionInputs(const std::vector<float *> &input, std::vector<std::vector<Buffer>> &input_buf, const std::vector<std::vector<int>> &seqlens = {}) const
StatusCode repositionOutputs(const std::vector<std::vector<Buffer>> &output_buf, std::vector<float *> &output, const std::vector<std::vector<int>> &seqlens = {}) const
StatusCode repositionOutputs(const std::vector<std::vector<Buffer>> &output_buf, std::vector<std::vector<float>> &output, const std::vector<std::vector<int>> &seqlens = {}) const
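A minimal sketch of the acquire/reposition/infer/release flow using the signatures above. Error handling is elided; `model` is assumed to be a launched Model and `in_ptrs` a std::vector<float*> of caller-owned float buffers in NHWC/HWC layout:

    // Acquire device-side buffers in the NPU's internal layout.
    std::vector<mobilint::Buffer> in_buf  = model->acquireInputBuffer();
    std::vector<mobilint::Buffer> out_buf = model->acquireOutputBuffer();

    // Copy float inputs into the internal layout.
    model->repositionInputs(in_ptrs, in_buf);

    // Run inference directly on the buffers.
    model->inferBuffer(in_buf, out_buf);

    // Copy results back out of the internal layout into float vectors.
    std::vector<std::vector<float>> outputs;
    model->repositionOutputs(out_buf, outputs);

    // Acquired buffers must always be released.
    model->releaseBuffer(in_buf);
    model->releaseBuffer(out_buf);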
KV Cache Management

void resetCacheMemory()
    Resets the KV cache memory.
StatusCode dumpCacheMemory(std::vector<std::vector<int8_t>> &bufs)
    Dumps the KV cache memory into buffers.
std::vector<std::vector<int8_t>> dumpCacheMemory(StatusCode &sc)
    Dumps the KV cache memory into buffers.
StatusCode dumpCacheMemory(const std::string &cache_dir)
    Dumps the KV cache memory to files in the specified directory.
StatusCode loadCacheMemory(const std::vector<std::vector<int8_t>> &bufs)
    Loads the KV cache memory from buffers.
StatusCode loadCacheMemory(const std::string &cache_dir)
    Loads the KV cache memory from files in the specified directory.
int filterCacheTail(int cache_size, int tail_size, const std::vector<bool> &mask, StatusCode &sc)
    Filters the tail of the KV cache memory.
int moveCacheTail(int num_head, int num_tail, int cache_size, StatusCode &sc)
    Moves the tail of the KV cache memory to the end of the head.
Deprecated APIs

StatusCode infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output, int batch_size)
std::vector<std::vector<float>> infer(const std::vector<float *> &input, int batch_size, StatusCode &sc)
StatusCode inferHeightBatch(const std::vector<float *> &input, std::vector<std::vector<float>> &output, int height_batch_size)
SchedulePolicy getSchedulePolicy() const
LatencySetPolicy getLatencySetPolicy() const
MaintenancePolicy getMaintenancePolicy() const
uint64_t getLatencyConsumed(const int npu_op_idx) const
uint64_t getLatencyFinished(const int npu_op_idx) const
std::shared_ptr<Statistics> getStatistics() const
Static Public Member Functions

static std::unique_ptr<Model> create(const std::string &mxq_path, StatusCode &sc)
    Creates a Model object from the specified MXQ model file.
static std::unique_ptr<Model> create(const std::string &mxq_path, const ModelConfig &config, StatusCode &sc)
    Creates a Model object from the specified MXQ model file and configuration.
Friends

class Accelerator
Detailed Description
Represents an AI model loaded from an MXQ file.
This class loads an AI model from an MXQ file and provides functions to launch it on the NPU and perform inference.
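For orientation, here is a minimal end-to-end sketch under stated assumptions: an Accelerator instance is already available (its construction is covered in the Accelerator reference), the MXQ path and input pointers are illustrative, and error handling is kept brief:

    #include <model.h>   // include path as shown in this reference
    #include <memory>
    #include <vector>

    using namespace mobilint;

    // Hypothetical helper: runs one float-to-float inference pass.
    StatusCode runOnce(Accelerator& acc, const std::vector<float*>& inputs) {
        StatusCode sc;

        // Parse the MXQ file and construct the Model
        // (single-core mode with all local cores by default).
        std::unique_ptr<Model> model = Model::create("model.mxq", sc);
        if (!sc) return sc;

        // The model must be launched on an Accelerator before inference.
        sc = model->launch(acc);
        if (!sc) return sc;

        // Synchronous float-to-float inference; inputs are HWC float buffers.
        std::vector<std::vector<float>> outputs;
        sc = model->infer(inputs, outputs);

        // Release the NPU resources held by the model.
        model->dispose();
        return sc;
    }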
Member Function Documentation
◆ create() [1/2]
static std::unique_ptr<Model> mobilint::Model::create(const std::string &mxq_path, StatusCode &sc)

Creates a Model object from the specified MXQ model file.
Parses the MXQ file and constructs a Model object. The model is initialized in single-core mode with all NPU local cores included.

Note
- The created Model object must be launched before performing inference. See Model::launch for more details.

Parameters
- [in] mxq_path: The path to the MXQ model file.
- [out] sc: A reference to a status code that will be updated to indicate whether the model was successfully created or an error occurred.

Returns
- A unique pointer to the created Model object.
◆ create() [2/2]
static std::unique_ptr<Model> mobilint::Model::create(const std::string &mxq_path, const ModelConfig &config, StatusCode &sc)

Creates a Model object from the specified MXQ model file and configuration.
Parses the MXQ file and constructs a Model object using the provided configuration, initializing the model with the given settings.

Note
- The created Model object must be launched before performing inference. See Model::launch for more details.

Parameters
- [in] mxq_path: The path to the MXQ model file.
- [in] config: The configuration settings used to initialize the Model.
- [out] sc: A reference to a status code that will be updated to indicate whether the model was successfully created or an error occurred.

Returns
- A unique pointer to the created Model object.
◆ launch()
StatusCode mobilint::Model::launch(Accelerator &acc)

Launches the model on the specified Accelerator, which represents the actual NPU.

Parameters
- [in] acc: The accelerator on which to launch the model.

Returns
- A status code indicating whether the model was successfully launched or if an error occurred.
◆ dispose()
StatusCode mobilint::Model::dispose()

Disposes of the model loaded onto the NPU.
Releases any resources associated with the model on the NPU.

Returns
- A status code indicating whether the disposal was successful or if an error occurred.
◆ getCoreMode()
CoreMode mobilint::Model::getCoreMode() const

Retrieves the core mode of the model.

Returns
- The CoreMode of the model.
◆ isTarget()
bool mobilint::Model::isTarget(CoreId core_id) const

Checks whether the NPU core specified by CoreId is a target of the model, i.e., whether the model is configured to use the given NPU core.
◆ getTargetCores()
std::vector<CoreId> mobilint::Model::getTargetCores() const

Returns the NPU cores the model is configured to use.

Returns
- A vector of CoreIds representing the target NPU cores.
◆ infer() [1/12]
StatusCode mobilint::Model::infer(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output)

Performs inference.

Parameters
- [in] input: A vector of NDArray<float>. Each NDArray must be in NHWC or HWC format.
- [out] output: A reference to a vector of NDArray<float> that will store the inference results.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ infer() [2/12]
std::vector<NDArray<float>> mobilint::Model::infer(const std::vector<NDArray<float>> &input, StatusCode &sc)

This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.

Parameters
- [in] input: A vector of NDArray<float>. Each NDArray must be in NHWC or HWC format.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of NDArray<float> containing the inference results.
◆ infer() [3/12]
StatusCode mobilint::Model::infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in HWC format.
- [out] output: A reference to a vector of float vectors that will store the inference results.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
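A short sketch of this overload, sizing caller-owned buffers from Model::getModelInputShape. This is a minimal example assuming `model` is already launched; real HWC data would replace the zero fill:

    #include <functional>  // std::multiplies
    #include <numeric>     // std::accumulate

    // Size one flat buffer per model input from the reported shapes.
    const auto& shapes = model->getModelInputShape();
    std::vector<std::vector<float>> storage;
    storage.reserve(shapes.size());  // keep pointers below valid
    std::vector<float*> in_ptrs;
    for (const auto& s : shapes) {
        int64_t elems = std::accumulate(s.begin(), s.end(), int64_t{1},
                                        std::multiplies<int64_t>());
        storage.emplace_back(elems, 0.0f);  // fill with real HWC data in practice
        in_ptrs.push_back(storage.back().data());
    }

    std::vector<std::vector<float>> outputs;
    StatusCode sc = model->infer(in_ptrs, outputs);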
◆ infer() [4/12]
std::vector<std::vector<float>> mobilint::Model::infer(const std::vector<float *> &input, StatusCode &sc)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in HWC format.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of float vectors containing the inference results.
◆ infer() [5/12]
StatusCode mobilint::Model::infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.
Unlike other overloads, this version allows explicitly specifying the shape of each input, which can be in NHWC or HWC format.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
- [out] output: A reference to a vector of float vectors that will store the inference results.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ infer() [6/12]
std::vector<std::vector<float>> mobilint::Model::infer(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of float vectors containing the inference results.
◆ infer() [7/12]
StatusCode mobilint::Model::infer(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output, uint32_t cache_size)

This overload supports inference with KV cache.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of NDArrays, where each NDArray represents input data in NHWC or HWC format.
- [out] output: A reference to a vector of NDArrays that will store the inference results.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ infer() [8/12]
std::vector<NDArray<float>> mobilint::Model::infer(const std::vector<NDArray<float>> &input, uint32_t cache_size, StatusCode &sc)

This overload supports inference with KV cache.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of NDArrays, where each NDArray represents input data in NHWC or HWC format.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of NDArrays containing the inference results.
◆ infer() [9/12]
StatusCode mobilint::Model::infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size)

This overload supports inference with KV cache.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
- [out] output: A reference to a vector of float vectors that will store the inference results.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ infer() [10/12]
std::vector<std::vector<float>> mobilint::Model::infer(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)

This overload supports inference with KV cache.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of float vectors containing the inference results.
◆ inferCHW() [1/10]
StatusCode mobilint::Model::inferCHW(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output)

Performs inference.

Parameters
- [in] input: A vector of NDArray<float>. Each NDArray must be in NCHW or CHW format.
- [out] output: A reference to a vector of NDArray<float> that will store the inference results.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ inferCHW() [2/10]
std::vector<NDArray<float>> mobilint::Model::inferCHW(const std::vector<NDArray<float>> &input, StatusCode &sc)

This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.

Parameters
- [in] input: A vector of NDArray<float>. Each NDArray must be in NCHW or CHW format.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of NDArray<float> containing the inference results.
◆ inferCHW() [3/10]
StatusCode mobilint::Model::inferCHW(const std::vector<float *> &input, std::vector<std::vector<float>> &output)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in CHW format.
- [out] output: A reference to a vector of float vectors that will store the inference results.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ inferCHW() [4/10]
std::vector<std::vector<float>> mobilint::Model::inferCHW(const std::vector<float *> &input, StatusCode &sc)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in CHW format.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of float vectors containing the inference results.
◆ inferCHW() [5/10]
StatusCode mobilint::Model::inferCHW(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.
Unlike other overloads, this version allows explicitly specifying the shape of each input, which can be in NCHW or CHW format.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
- [out] output: A reference to a vector of float vectors that will store the inference results.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ inferCHW() [6/10]
std::vector<std::vector<float>> mobilint::Model::inferCHW(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, StatusCode &sc)

This overload is provided for convenience but may result in additional data copies within the maccel runtime.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of float vectors containing the inference results.
◆ inferCHW() [7/10]
StatusCode mobilint::Model::inferCHW(const std::vector<NDArray<float>> &input, std::vector<NDArray<float>> &output, uint32_t cache_size)

This overload supports inference with KV cache.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of NDArrays, where each NDArray represents input data in NCHW or CHW format.
- [out] output: A reference to a vector of NDArrays that will store the inference results.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ inferCHW() [8/10]
std::vector<NDArray<float>> mobilint::Model::inferCHW(const std::vector<NDArray<float>> &input, uint32_t cache_size, StatusCode &sc)

This overload supports inference with KV cache.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of NDArrays, where each NDArray represents input data in NCHW or CHW format.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of NDArrays containing the inference results.
◆ inferCHW() [9/10]
StatusCode mobilint::Model::inferCHW(const std::vector<float *> &input, std::vector<std::vector<float>> &output, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size)

This overload supports inference with KV cache.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
- [out] output: A reference to a vector of float vectors that will store the inference results.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.

Returns
- A status code indicating whether the inference operation completed successfully or encountered an error.
◆ inferCHW() [10/10]
std::vector<std::vector<float>> mobilint::Model::inferCHW(const std::vector<float *> &input, const std::vector<std::vector<int64_t>> &shape, uint32_t cache_size, StatusCode &sc)

This overload supports inference with KV cache.
Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
- This function is relevant for LLM models that use KV cache.

Parameters
- [in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
- [in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
- [in] cache_size: The number of tokens accumulated in the KV cache so far.
- [out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.

Returns
- A vector of float vectors containing the inference results.
◆ inferSpeedrun()
StatusCode mobilint::Model::inferSpeedrun(int variant_idx = 0)

Development-only API for measuring pure NPU inference speed.
Runs NPU inference without uploading inputs and without retrieving outputs.

Parameters
- [in] variant_idx: Index of the model variant to run.

Returns
- A status code indicating the result.
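A hedged sketch of a throughput measurement loop built on this API; the iteration count and timing approach are illustrative, and `model` is assumed to be a launched Model:

    #include <chrono>
    #include <iostream>

    // Measure average pure-NPU latency over `iters` runs (development only).
    constexpr int iters = 100;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        StatusCode sc = model->inferSpeedrun();
        if (!sc) break;  // stop timing on failure
    }
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::cout << "avg NPU latency: " << (ms / iters) << " ms\n";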
◆ inferAsync() [1/2]
Future<float> mobilint::Model::inferAsync(const std::vector<NDArray<float>> &input, StatusCode &sc)

Initiates asynchronous inference with input in NHWC (batch N, height H, width W, channels C) or HWC format.

Parameters
- [in] input: A vector of NDArray<float>. Each NDArray must be in NHWC or HWC format.
- [out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.

Returns
- A future that can be used to retrieve the inference result.
◆ inferAsyncCHW() [1/2]
Future<float> mobilint::Model::inferAsyncCHW(const std::vector<NDArray<float>> &input, StatusCode &sc)

Initiates asynchronous inference with input in NCHW (batch N, channels C, height H, width W) or CHW format.

Parameters
- [in] input: A vector of NDArray<float>. Each NDArray must be in NCHW or CHW format.
- [out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.

Returns
- A future that can be used to retrieve the inference result.
◆ inferAsync() [2/2]
Future<int8_t> mobilint::Model::inferAsync(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)

This overload supports int8_t-to-int8_t asynchronous inference.

Parameters
- [in] input: A vector of NDArray<int8_t>. Each NDArray must be in NHWC or HWC format.
- [out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.

Returns
- A future that can be used to retrieve the inference result.
◆ inferAsyncCHW() [2/2]
Future<int8_t> mobilint::Model::inferAsyncCHW(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)

This overload supports int8_t-to-int8_t asynchronous inference.

Parameters
- [in] input: A vector of NDArray<int8_t>. Each NDArray must be in NCHW or CHW format.
- [out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.

Returns
- A future that can be used to retrieve the inference result.
◆ inferAsyncToFloat()
Future<float> mobilint::Model::inferAsyncToFloat(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)

This overload supports int8_t-to-float asynchronous inference.

Parameters
- [in] input: A vector of NDArray<int8_t>. Each NDArray must be in NHWC or HWC format.
- [out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.

Returns
- A future that can be used to retrieve the inference result.
◆ inferAsyncCHWToFloat()
Future<float> mobilint::Model::inferAsyncCHWToFloat(const std::vector<NDArray<int8_t>> &input, StatusCode &sc)

This overload supports int8_t-to-float asynchronous inference.

Parameters
- [in] input: A vector of NDArray<int8_t>. Each NDArray must be in NCHW or CHW format.
- [out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.

Returns
- A future that can be used to retrieve the inference result.
◆ getNumModelVariants()
int mobilint::Model::getNumModelVariants() const

Returns the total number of model variants available in this model.
The variant_idx parameter passed to Model::getModelVariantHandle must be in the range [0, return value of this function).

Returns
- The total number of model variants.
◆ getModelVariantHandle()
std::unique_ptr<ModelVariantHandle> mobilint::Model::getModelVariantHandle(int variant_idx, StatusCode &sc) const

Retrieves a handle to the specified model variant.
Use the returned ModelVariantHandle to query details such as input and output shapes for the selected variant.

Parameters
- [in] variant_idx: Index of the model variant to retrieve. Must be in the range [0, getNumModelVariants()).
- [out] sc: A reference to a StatusCode variable that will be updated to indicate success or failure.

Returns
- A unique pointer to the corresponding ModelVariantHandle if successful; otherwise, nullptr.
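A brief sketch enumerating variants with the two functions above; the per-variant queries on the handle are left as comments since their signatures live in the ModelVariantHandle reference:

    // Iterate over every variant and fetch its handle.
    for (int i = 0; i < model->getNumModelVariants(); ++i) {
        StatusCode sc;
        std::unique_ptr<mobilint::ModelVariantHandle> h =
            model->getModelVariantHandle(i, sc);
        if (!sc || !h) continue;  // nullptr is returned on failure
        // Query per-variant details (e.g., input/output shapes) through `h`.
    }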
◆ getModelInputShape()
const std::vector<std::vector<int64_t>> &mobilint::Model::getModelInputShape() const

Returns the input shape of the model.

Returns
- A reference to the input shape of the model.
◆ getModelOutputShape()
const std::vector<std::vector<int64_t>> &mobilint::Model::getModelOutputShape() const

Returns the output shape of the model.

Returns
- A reference to the output shape of the model.
◆ getInputBufferInfo()
const std::vector<BufferInfo> &mobilint::Model::getInputBufferInfo() const

Returns the input buffer information of the model.

Returns
- A reference to a vector of input buffer information.
◆ getOutputBufferInfo()
const std::vector<BufferInfo> &mobilint::Model::getOutputBufferInfo() const

Returns the output buffer information of the model.

Returns
- A reference to a vector of output buffer information.
◆ getInputScale()
std::vector<Scale> mobilint::Model::getInputScale() const

Returns the input quantization scale(s) of the model.

Returns
- A vector of input scales.
◆ getOutputScale()
std::vector<Scale> mobilint::Model::getOutputScale() const

Returns the output quantization scale(s) of the model.

Returns
- A vector of output scales.
◆ getIdentifier()
uint32_t mobilint::Model::getIdentifier() const

Returns the model's unique identifier.
This identifier distinguishes multiple models within a single user program. It is assigned incrementally, starting from 0 (e.g., 0, 1, 2, 3, ...).

Returns
- The model identifier.
◆ getModelPath()
std::string mobilint::Model::getModelPath() const

Returns the path to the MXQ model file associated with the Model.

Returns
- The MXQ file path.
◆ getCacheInfos()
std::vector<CacheInfo> mobilint::Model::getCacheInfos() const

Returns information about the KV cache of the model.

Returns
- A vector of CacheInfo objects.
◆ resetCacheMemory()
void mobilint::Model::resetCacheMemory()

Resets the KV cache memory.
Clears the stored KV cache, restoring it to its initial state.
◆ dumpCacheMemory() [1/3]
StatusCode mobilint::Model::dumpCacheMemory(std::vector<std::vector<int8_t>> &bufs)

Dumps the KV cache memory into buffers.
Writes the current KV cache data into the provided buffers.

Parameters
- [out] bufs: A reference to a vector of byte vectors that will store the KV cache data.

Returns
- A status code indicating whether the dump operation was successful or if an error occurred.
◆ dumpCacheMemory() [2/3]
std::vector<std::vector<int8_t>> mobilint::Model::dumpCacheMemory(StatusCode &sc)

Dumps the KV cache memory into buffers.
Writes the KV cache data into buffers and returns them.

Parameters
- [out] sc: A reference to a status code that will be updated to indicate whether the dump operation was successful or if an error occurred.

Returns
- A vector of byte vectors containing the KV cache data.
◆ dumpCacheMemory() [3/3]
StatusCode mobilint::Model::dumpCacheMemory(const std::string &cache_dir)

Dumps the KV cache memory to files in the specified directory.
Writes the KV cache data to binary files within the given directory. Each file is named using the format cache_<layer_hash>.bin.

Parameters
- [in] cache_dir: Path to the directory where the KV cache files will be saved.

Returns
- A status code indicating whether the dump operation was successful or if an error occurred.
◆ loadCacheMemory() [1/2]
StatusCode mobilint::Model::loadCacheMemory(const std::vector<std::vector<int8_t>> &bufs)

Loads the KV cache memory from buffers.
Restores the KV cache from the provided buffers.

Parameters
- [in] bufs: A reference to a vector of byte vectors containing the KV cache data.

Returns
- A status code indicating whether the load operation was successful or if an error occurred.
◆ loadCacheMemory() [2/2]
StatusCode mobilint::Model::loadCacheMemory(const std::string &cache_dir)

Loads the KV cache memory from files in the specified directory.
Reads KV cache data from files within the given directory and restores it. Each file is named using the format cache_<layer_hash>.bin.

Parameters
- [in] cache_dir: Path to the directory where the KV cache files are saved.

Returns
- A status code indicating whether the load operation was successful or if an error occurred.
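A small sketch of persisting and restoring the KV cache with the directory-based overloads above; the directory path is illustrative:

    // Persist the current KV cache to disk, e.g., after a prefill pass.
    StatusCode sc = model->dumpCacheMemory("/tmp/kv_cache");
    if (!sc) { /* handle dump failure */ }

    // ... later, e.g., in a new session with the same model ...
    model->resetCacheMemory();  // start from a clean cache state
    sc = model->loadCacheMemory("/tmp/kv_cache");
    if (!sc) { /* handle load failure */ }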
◆ filterCacheTail()
int mobilint::Model::filterCacheTail(int cache_size, int tail_size, const std::vector<bool> &mask, StatusCode &sc)

Filters the tail of the KV cache memory.
Retains the desired entries in the tail of the KV cache memory, excludes the others, and shifts the remaining entries forward.

Parameters
- [in] cache_size: The number of tokens accumulated in the KV cache so far.
- [in] tail_size: The tail size of the KV cache to filter (must be <= 32).
- [in] mask: A mask indicating which tokens to retain or exclude at the tail of the KV cache.
- [out] sc: A status code indicating the outcome of the tail filtering.

Returns
- The new cache size after tail filtering.
◆ moveCacheTail()
int mobilint::Model::moveCacheTail(int num_head, int num_tail, int cache_size, StatusCode &sc)

Moves the tail of the KV cache memory to the end of the head.
Slices the tail of the KV cache memory up to the specified size and moves it to the designated cache position.

Parameters
- [in] num_head: The size of the KV cache head to which the tail is appended.
- [in] num_tail: The size of the KV cache tail to be moved.
- [in] cache_size: The total number of tokens accumulated in the KV cache so far.
- [out] sc: A status code indicating the result of the tail move.

Returns
- The updated cache size after moving the tail.
◆ infer() [11/12]
StatusCode mobilint::Model::infer(const std::vector<float *> &input, std::vector<std::vector<float>> &output, int batch_size)

Deprecated
- Use infer(input, output, shape) instead.
◆ infer() [12/12]
std::vector<std::vector<float>> mobilint::Model::infer(const std::vector<float *> &input, int batch_size, StatusCode &sc)

Deprecated
- Use infer(input, shape, sc) instead.
◆ inferHeightBatch()
StatusCode mobilint::Model::inferHeightBatch(const std::vector<float *> &input, std::vector<std::vector<float>> &output, int height_batch_size)

Deprecated
- This function is deprecated.
◆ getSchedulePolicy()
SchedulePolicy mobilint::Model::getSchedulePolicy() const
◆ getLatencySetPolicy()
LatencySetPolicy mobilint::Model::getLatencySetPolicy() const
◆ getMaintenancePolicy()
MaintenancePolicy mobilint::Model::getMaintenancePolicy() const
◆ getLatencyConsumed()
uint64_t mobilint::Model::getLatencyConsumed(const int npu_op_idx) const
◆ getLatencyFinished()
uint64_t mobilint::Model::getLatencyFinished(const int npu_op_idx) const
◆ getStatistics()
std::shared_ptr<Statistics> mobilint::Model::getStatistics() const
Friends And Related Symbol Documentation
◆ Accelerator
friend class Accelerator
The documentation for this class was generated from the following file: model.h