mobilint::Model Class Reference

Runtime Library v0.30

Represents an AI model loaded from an MXQ file.

#include <model.h>

Public Member Functions

 Model (const Model &other)=delete
 Model (Model &&other) noexcept
Model & operator= (const Model &rhs)=delete
Model & operator= (Model &&rhs) noexcept
StatusCode launch (Accelerator &acc)
 Launches the model on the specified Accelerator, which represents the actual NPU.
StatusCode dispose ()
 Disposes of the model loaded onto the NPU.
CoreMode getCoreMode () const
 Retrieves the core mode of the model.
bool isTarget (CoreId core_id) const
 Checks if the NPU core specified by CoreId is the target of the model. In other words, whether the model is configured to use the given NPU core.
std::vector< CoreId > getTargetCores () const
 Returns the NPU cores the model is configured to use.
StatusCode inferSpeedrun (int variant_idx=0)
 Development-only API for measuring pure NPU inference speed.
int getNumModelVariants () const
 Returns the total number of model variants available in this model.
std::unique_ptr< ModelVariantHandle > getModelVariantHandle (int variant_idx, StatusCode &sc) const
 Retrieves a handle to the specified model variant.
const std::vector< std::vector< int64_t > > & getModelInputShape () const
 Returns the input shape of the model.
const std::vector< std::vector< int64_t > > & getModelOutputShape () const
 Returns the output shape of the model.
const std::vector< BufferInfo > & getInputBufferInfo () const
 Returns the input buffer information for the model.
const std::vector< BufferInfo > & getOutputBufferInfo () const
 Returns the output buffer information of the model.
std::vector< Scale > getInputScale () const
 Returns the input quantization scale(s) of the model.
std::vector< Scale > getOutputScale () const
 Returns the output quantization scale(s) of the model.
uint32_t getIdentifier () const
 Returns the model's unique identifier.
std::string getModelPath () const
 Returns the path to the MXQ model file associated with the Model.
std::vector< CacheInfo > getCacheInfos () const
 Returns information about the model's KV cache.
NHWC float-to-float inference

Performs inference with input and output elements of type float in NHWC (batch N, height H, width W, channels C) or HWC format.

Two input-output type pairs are supported:

  1. std::vector<NDArray<float>> for both input and output
    • Recommended approach, as NDArray allows the maccel runtime to avoid unnecessary data copies internally.
  2. std::vector<float*> for input and std::vector<std::vector<float>> for output
    • Provided for user convenience, but results in unavoidable extra copies within the maccel runtime.
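
For example, a minimal sketch using the convenience overload. It assumes `model` has already been created with Model::create and launched with Model::launch, and the buffer size below is illustrative; use Model::getModelInputShape to determine the real input size.

std::vector<float> image(224 * 224 * 3);   // HWC-ordered input data (illustrative size)
std::vector<float*> input{image.data()};
std::vector<std::vector<float>> output;    // one vector per model output, filled by the runtime

mobilint::StatusCode sc = model->infer(input, output);
if (!sc) {
    // Handle error appropriately
}
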
StatusCode infer (const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output)
 Performs inference.
std::vector< NDArray< float > > infer (const std::vector< NDArray< float > > &input, StatusCode &sc)
 This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.
StatusCode infer (const std::vector< float * > &input, std::vector< std::vector< float > > &output)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector< std::vector< float > > infer (const std::vector< float * > &input, StatusCode &sc)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode infer (const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector< std::vector< float > > infer (const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode infer (const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output, uint32_t cache_size)
 This overload supports inference with KV cache.
std::vector< NDArray< float > > infer (const std::vector< NDArray< float > > &input, uint32_t cache_size, StatusCode &sc)
 This overload supports inference with KV cache.
StatusCode infer (const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size)
 This overload supports inference with KV cache.
std::vector< std::vector< float > > infer (const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
 This overload supports inference with KV cache.
NCHW float-to-float inference

Performs inference with input and output elements of type float in NCHW (batch N, channels C, height H, width W) or CHW format.

Two input-output type pairs are supported:

  1. std::vector<NDArray<float>> for both input and output
    • Recommended approach, as NDArray allows the maccel runtime to avoid unnecessary data copies internally.
  2. std::vector<float*> for input and std::vector<std::vector<float>> for output
    • Provided for user convenience, but results in unavoidable extra copies within the maccel runtime.
Note
CHW is not the recommended format, as the NPU natively operates on HWC-ordered data. When input is provided in CHW format, it will be transposed internally, introducing additional overhead.
If your data is in HWC format, use Model::infer instead of Model::inferCHW, as it avoids unnecessary format conversion.
StatusCode inferCHW (const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output)
 Performs inference.
std::vector< NDArray< float > > inferCHW (const std::vector< NDArray< float > > &input, StatusCode &sc)
 This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.
StatusCode inferCHW (const std::vector< float * > &input, std::vector< std::vector< float > > &output)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector< std::vector< float > > inferCHW (const std::vector< float * > &input, StatusCode &sc)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode inferCHW (const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
std::vector< std::vector< float > > inferCHW (const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
 This overload is provided for convenience but may result in additional data copies within the maccel runtime.
StatusCode inferCHW (const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output, uint32_t cache_size)
 This overload supports inference with KV cache.
std::vector< NDArray< float > > inferCHW (const std::vector< NDArray< float > > &input, uint32_t cache_size, StatusCode &sc)
 This overload supports inference with KV cache.
StatusCode inferCHW (const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size)
 This overload supports inference with KV cache.
std::vector< std::vector< float > > inferCHW (const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
 This overload supports inference with KV cache.
NHWC int8_t-to-int8_t inference

Performs inference with input and output elements of type int8_t in NHWC (batch N, height H, width W, channels C) or HWC format.

Using these inference APIs requires manually quantizing float input values to int8_t and dequantizing the int8_t output values back to float.

Note
These APIs are intended for advanced use rather than typical usage.
StatusCode infer (const std::vector< NDArray< int8_t > > &input, std::vector< NDArray< int8_t > > &output)
std::vector< NDArray< int8_t > > infer (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
StatusCode infer (const std::vector< int8_t * > &input, std::vector< std::vector< int8_t > > &output)
std::vector< std::vector< int8_t > > infer (const std::vector< int8_t * > &input, StatusCode &sc)
StatusCode infer (const std::vector< int8_t * > &input, std::vector< std::vector< int8_t > > &output, const std::vector< std::vector< int64_t > > &shape)
std::vector< std::vector< int8_t > > infer (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
StatusCode infer (const std::vector< NDArray< int8_t > > &input, std::vector< NDArray< int8_t > > &output, uint32_t cache_size)
std::vector< NDArray< int8_t > > infer (const std::vector< NDArray< int8_t > > &input, uint32_t cache_size, StatusCode &sc)
StatusCode infer (const std::vector< int8_t * > &input, std::vector< std::vector< int8_t > > &output, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size)
std::vector< std::vector< int8_t > > infer (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
NCHW int8_t-to-int8_t inference

Performs inference with input and output elements of type int8_t in NCHW (batch N, channels C, height H, width W) or CHW format.

Using these inference APIs requires manually quantizing float input values to int8_t and dequantizing the int8_t output values back to float.

Note
These APIs are intended for advanced use rather than typical usage.
StatusCode inferCHW (const std::vector< NDArray< int8_t > > &input, std::vector< NDArray< int8_t > > &output)
std::vector< NDArray< int8_t > > inferCHW (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
StatusCode inferCHW (const std::vector< int8_t * > &input, std::vector< std::vector< int8_t > > &output)
std::vector< std::vector< int8_t > > inferCHW (const std::vector< int8_t * > &input, StatusCode &sc)
StatusCode inferCHW (const std::vector< int8_t * > &input, std::vector< std::vector< int8_t > > &output, const std::vector< std::vector< int64_t > > &shape)
std::vector< std::vector< int8_t > > inferCHW (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
StatusCode inferCHW (const std::vector< NDArray< int8_t > > &input, std::vector< NDArray< int8_t > > &output, uint32_t cache_size)
std::vector< NDArray< int8_t > > inferCHW (const std::vector< NDArray< int8_t > > &input, uint32_t cache_size, StatusCode &sc)
StatusCode inferCHW (const std::vector< int8_t * > &input, std::vector< std::vector< int8_t > > &output, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size)
std::vector< std::vector< int8_t > > inferCHW (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
NHWC int8_t-to-float inference

Performs inference with input and output elements of type int8_t in NHWC (batch N, height H, width W, channels C) or HWC format.

Using these inference APIs requires manually quantizing float input values to int8_t.

Note
These APIs are intended for advanced use rather than typical usage.
std::vector< NDArray< float > > inferToFloat (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
std::vector< std::vector< float > > inferToFloat (const std::vector< int8_t * > &input, StatusCode &sc)
std::vector< std::vector< float > > inferToFloat (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
std::vector< NDArray< float > > inferToFloat (const std::vector< NDArray< int8_t > > &input, uint32_t cache_size, StatusCode &sc)
std::vector< std::vector< float > > inferToFloat (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
NCHW int8_t-to-float inference

Performs inference with input and output elements of type int8_t in NCHW (batch N, channels C, height H, width W) or CHW format.

Using these inference APIs requires manually quantizing float input values to int8_t.

Note
These APIs are intended for advanced use rather than typical usage.
std::vector< NDArray< float > > inferCHWToFloat (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
std::vector< std::vector< float > > inferCHWToFloat (const std::vector< int8_t * > &input, StatusCode &sc)
std::vector< std::vector< float > > inferCHWToFloat (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
std::vector< NDArray< float > > inferCHWToFloat (const std::vector< NDArray< int8_t > > &input, uint32_t cache_size, StatusCode &sc)
std::vector< std::vector< float > > inferCHWToFloat (const std::vector< int8_t * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
NHWC Buffer-to-Buffer inference

Performs inference using input and output elements in the NPU’s internal data type. The inference operates on buffers allocated via the following APIs:

  • Model::acquireInputBuffer
  • Model::acquireOutputBuffer
  • Model::acquireInputBuffers
  • Model::acquireOutputBuffers
  • ModelVariantHandle::acquireInputBuffer
  • ModelVariantHandle::acquireOutputBuffer
  • ModelVariantHandle::acquireInputBuffers
  • ModelVariantHandle::acquireOutputBuffers

Additionally, Model::repositionInputs, Model::repositionOutputs, ModelVariantHandle::repositionInputs, and ModelVariantHandle::repositionOutputs must be used to place input data into, and retrieve output data from, the acquired buffers.

Note
These APIs are intended for advanced use rather than typical usage.
StatusCode inferBuffer (const std::vector< Buffer > &input, std::vector< Buffer > &output, const std::vector< std::vector< int64_t > > &shape={}, uint32_t cache_size=0)
StatusCode inferBuffer (const std::vector< std::vector< Buffer > > &input, std::vector< std::vector< Buffer > > &output, const std::vector< std::vector< int64_t > > &shape={}, uint32_t cache_size=0)
NHWC Buffer-to-float inference

Performs inference using input and output elements in the NPU’s internal data type. The inference operates on buffers allocated via the following APIs:

  • Model::acquireInputBuffer
  • Model::acquireInputBuffers
  • ModelVariantHandle::acquireInputBuffer
  • ModelVariantHandle::acquireInputBuffers

Additionally, Model::repositionInputs and ModelVariantHandle::repositionInputs must be used to place input data into the acquired buffers.

Note
These APIs are intended for advanced use rather than typical usage.
StatusCode inferBufferToFloat (const std::vector< Buffer > &input, std::vector< NDArray< float > > &output, const std::vector< std::vector< int64_t > > &shape={}, uint32_t cache_size=0)
StatusCode inferBufferToFloat (const std::vector< std::vector< Buffer > > &input, std::vector< NDArray< float > > &output, const std::vector< std::vector< int64_t > > &shape={}, uint32_t cache_size=0)
StatusCode inferBufferToFloat (const std::vector< Buffer > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape={}, uint32_t cache_size=0)
StatusCode inferBufferToFloat (const std::vector< std::vector< Buffer > > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape={}, uint32_t cache_size=0)
Asynchronous Inference

Performs inference asynchronously.

To use asynchronous inference, the model must be created with a ModelConfig in which the asynchronous pipeline is enabled. This is done by calling ModelConfig::setAsyncPipelineEnabled(true) before passing the configuration to Model::create.

Example:

using namespace mobilint;

StatusCode sc;
ModelConfig mc;

// Enables support for `inferAsync` and `inferAsyncCHW`
mc.setAsyncPipelineEnabled(true);

std::unique_ptr<Model> model = Model::create("resnet50.mxq", mc, sc);
if (!sc) {
    // Handle error appropriately
}

// Now `inferAsync` can be called (`input` is a std::vector<NDArray<float>> prepared beforehand).
Future<float> future = model->inferAsync(input, sc);

Note
Functions in the inferAsync family (inferAsync, inferAsyncCHW, inferAsyncToFloat, inferAsyncCHWToFloat) typically return immediately. However, they may block if the input queue in the maccel runtime is full.
For all functions in the inferAsync family (inferAsync, inferAsyncCHW, inferAsyncToFloat, inferAsyncCHWToFloat), the data provided through the input parameter must remain unmodified until the asynchronous inference has completed. Modifying this data during execution may result in invalid results.
Currently, only CNN-based models are supported, as asynchronous execution is particularly effective for this type of workload.
Limitations:
  • RNN/LSTM and LLM models are not supported yet.
  • Models requiring CPU offloading are not supported yet.
  • Currently, only single-batch inference is supported (i.e., N = 1).
  • Currently, Buffer inference is not supported: the Buffer-based input and output types accepted by the synchronous API for advanced use cases are not yet available for asynchronous inference.
Future< float > inferAsync (const std::vector< NDArray< float > > &input, StatusCode &sc)
 Initiates asynchronous inference with input in NHWC (batch N, height H, width W, channels C) or HWC format.
Future< float > inferAsyncCHW (const std::vector< NDArray< float > > &input, StatusCode &sc)
 Initiates asynchronous inference with input in NCHW (batch N, channels C, height H, width W) or CHW format.
Future< int8_t > inferAsync (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
 This overload supports int8_t-to-int8_t asynchronous inference.
Future< int8_t > inferAsyncCHW (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
 This overload supports int8_t-to-int8_t asynchronous inference.
Future< float > inferAsyncToFloat (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
 This overload supports int8_t-to-float asynchronous inference.
Future< float > inferAsyncCHWToFloat (const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
 This overload supports int8_t-to-float asynchronous inference.
Buffer Management APIs

These APIs are required when calling Model::inferBuffer or Model::inferBufferToFloat.

Buffers are acquired using:

  • acquireInputBuffer
  • acquireInputBuffers
  • acquireOutputBuffer
  • acquireOutputBuffers

Any acquired buffer must be released using:

  • releaseBuffer
  • releaseBuffers

Repositioning is handled by:

  • repositionInputs
  • repositionOutputs
Note
These APIs are intended for advanced use rather than typical usage.
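
For example, a minimal sketch of the acquire/reposition/infer/release flow for a model with float HWC input. It assumes `model` has already been created and launched; per-call status checks are omitted for brevity.

using namespace mobilint;

// Size the host-side input buffer from the model's first input shape.
const auto& in_shape = model->getModelInputShape()[0];
size_t in_count = 1;
for (int64_t d : in_shape) in_count *= d;
std::vector<float> in_data(in_count);          // fill with real input values

// Acquire NPU-side input/output buffers and place the input data into them.
std::vector<Buffer> in_buf = model->acquireInputBuffer();
std::vector<Buffer> out_buf = model->acquireOutputBuffer();
std::vector<float*> input{in_data.data()};
StatusCode sc = model->repositionInputs(input, in_buf);

// Run inference directly on the NPU-internal buffers.
sc = model->inferBuffer(in_buf, out_buf);

// Retrieve float results from the output buffers, then release the buffers.
std::vector<std::vector<float>> out_data;
sc = model->repositionOutputs(out_buf, out_data);
model->releaseBuffer(in_buf);
model->releaseBuffer(out_buf);
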
std::vector< Buffer > acquireInputBuffer (const std::vector< std::vector< int > > &seqlens={}) const
std::vector< Buffer > acquireOutputBuffer (const std::vector< std::vector< int > > &seqlens={}) const
std::vector< std::vector< Buffer > > acquireInputBuffers (const int batch_size, const std::vector< std::vector< int > > &seqlens={}) const
std::vector< std::vector< Buffer > > acquireOutputBuffers (const int batch_size, const std::vector< std::vector< int > > &seqlens={}) const
StatusCode releaseBuffer (std::vector< Buffer > &buffer) const
StatusCode releaseBuffers (std::vector< std::vector< Buffer > > &buffers) const
StatusCode repositionInputs (const std::vector< float * > &input, std::vector< Buffer > &input_buf, const std::vector< std::vector< int > > &seqlens={}) const
StatusCode repositionOutputs (const std::vector< Buffer > &output_buf, std::vector< float * > &output, const std::vector< std::vector< int > > &seqlens={}) const
StatusCode repositionOutputs (const std::vector< Buffer > &output_buf, std::vector< std::vector< float > > &output, const std::vector< std::vector< int > > &seqlens={}) const
StatusCode repositionInputs (const std::vector< float * > &input, std::vector< std::vector< Buffer > > &input_buf, const std::vector< std::vector< int > > &seqlens={}) const
StatusCode repositionOutputs (const std::vector< std::vector< Buffer > > &output_buf, std::vector< float * > &output, const std::vector< std::vector< int > > &seqlens={}) const
StatusCode repositionOutputs (const std::vector< std::vector< Buffer > > &output_buf, std::vector< std::vector< float > > &output, const std::vector< std::vector< int > > &seqlens={}) const
KV Cache Management
Note
These APIs are used for LLM models that utilize KV cache.
void resetCacheMemory ()
 Resets the KV cache memory.
StatusCode dumpCacheMemory (std::vector< std::vector< int8_t > > &bufs)
 Dumps the KV cache memory into buffers.
std::vector< std::vector< int8_t > > dumpCacheMemory (StatusCode &sc)
 Dumps the KV cache memory into buffers.
StatusCode dumpCacheMemory (const std::string &cache_dir)
 Dumps KV cache memory to files in the specified directory.
StatusCode loadCacheMemory (const std::vector< std::vector< int8_t > > &bufs)
 Loads the KV cache memory from buffers.
StatusCode loadCacheMemory (const std::string &cache_dir)
 Loads the KV cache memory from files in the specified directory.
int filterCacheTail (int cache_size, int tail_size, const std::vector< bool > &mask, StatusCode &sc)
 Filters the tail of the KV cache memory.
int moveCacheTail (int num_head, int num_tail, int cache_size, StatusCode &sc)
 Moves the tail of the KV cache memory to the end of the head.
Deprecated APIs
Note
These APIs are deprecated and should not be used.
StatusCode infer (const std::vector< float * > &input, std::vector< std::vector< float > > &output, int batch_size)
std::vector< std::vector< float > > infer (const std::vector< float * > &input, int batch_size, StatusCode &sc)
StatusCode inferHeightBatch (const std::vector< float * > &input, std::vector< std::vector< float > > &output, int height_batch_size)
SchedulePolicy getSchedulePolicy () const
LatencySetPolicy getLatencySetPolicy () const
MaintenancePolicy getMaintenancePolicy () const
uint64_t getLatencyConsumed (const int npu_op_idx) const
uint64_t getLatencyFinished (const int npu_op_idx) const
std::shared_ptr< Statistics > getStatistics () const

Static Public Member Functions

static std::unique_ptr< Model > create (const std::string &mxq_path, StatusCode &sc)
 Creates a Model object from the specified MXQ model file.
static std::unique_ptr< Model > create (const std::string &mxq_path, const ModelConfig &config, StatusCode &sc)
 Creates a Model object from the specified MXQ model file and configuration.

Friends

class Accelerator

Detailed Description

Represents an AI model loaded from an MXQ file.

This class loads an AI model from an MXQ file and provides functions to launch it on the NPU and perform inference.

Definition at line 40 of file model.h.
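
A minimal end-to-end sketch of the typical lifecycle follows. The model file name is illustrative; obtaining the Accelerator object `acc` is covered by the Accelerator class documentation and omitted here, as is preparation of the input data.

#include <model.h>

using namespace mobilint;

StatusCode sc;

// 1. Create the model from an MXQ file.
std::unique_ptr<Model> model = Model::create("resnet50.mxq", sc);
if (!sc) { /* handle error */ }

// 2. Launch the model on an Accelerator (the actual NPU).
sc = model->launch(acc);
if (!sc) { /* handle error */ }

// 3. Run inference with NHWC/HWC float data matching getModelInputShape().
std::vector<NDArray<float>> input;    // prepare input data here (omitted)
std::vector<NDArray<float>> output;
sc = model->infer(input, output);
if (!sc) { /* handle error */ }

// 4. Release the model's NPU resources when finished.
model->dispose();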

Member Function Documentation

◆ create() [1/2]

std::unique_ptr< Model > mobilint::Model::create ( const std::string & mxq_path,
StatusCode & sc )
static

Creates a Model object from the specified MXQ model file.

Parses the MXQ file and constructs a Model object. The model is initialized in single-core mode with all NPU local cores included.

Note
The created Model object must be launched before performing inference. See Model::launch for more details.
Parameters
[in] mxq_path: The path to the MXQ model file.
[out] sc: A reference to a status code that will be updated to indicate whether the model was successfully created or if an error occurred.
Returns
A unique pointer to the created Model object.

◆ create() [2/2]

std::unique_ptr< Model > mobilint::Model::create ( const std::string & mxq_path,
const ModelConfig & config,
StatusCode & sc )
static

Creates a Model object from the specified MXQ model file and configuration.

Parses the MXQ file and constructs a Model object using the provided configuration, initializing the model with the given settings.

Note
The created Model object must be launched before performing inference. See Model::launch for more details.
Parameters
[in] mxq_path: The path to the MXQ model file.
[in] config: The configuration settings to initialize the Model.
[out] sc: A reference to a status code that will be updated to indicate whether the model was successfully created or if an error occurred.
Returns
A unique pointer to the created Model object.

◆ launch()

StatusCode mobilint::Model::launch ( Accelerator & acc)

Launches the model on the specified Accelerator, which represents the actual NPU.

Parameters
[in] acc: The accelerator on which to launch the model.
Returns
A status code indicating whether the model was successfully launched or if an error occurred.

◆ dispose()

StatusCode mobilint::Model::dispose ( )

Disposes of the model loaded onto the NPU.

Releases any resources associated with the model on the NPU.

Returns
A status code indicating whether the disposal was successful or if an error occurred.

◆ getCoreMode()

CoreMode mobilint::Model::getCoreMode ( ) const

Retrieves the core mode of the model.

Returns
The CoreMode of the model.

◆ isTarget()

bool mobilint::Model::isTarget ( CoreId core_id) const

Checks if the NPU core specified by CoreId is the target of the model. In other words, whether the model is configured to use the given NPU core.

Parameters
[in] core_id: The CoreId to check.
Returns
True if the model is configured to use the specified CoreId, false otherwise.

◆ getTargetCores()

std::vector< CoreId > mobilint::Model::getTargetCores ( ) const

Returns the NPU cores the model is configured to use.

Returns
A vector of CoreIds representing the target NPU cores.

◆ infer() [1/12]

StatusCode mobilint::Model::infer ( const std::vector< NDArray< float > > & input,
std::vector< NDArray< float > > & output )

Performs inference.

Parameters
[in] input: A vector of NDArray<float>. Each NDArray must be in NHWC or HWC format.
[out] output: A reference to a vector of NDArray<float> that will store the inference results.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ infer() [2/12]

std::vector< NDArray< float > > mobilint::Model::infer ( const std::vector< NDArray< float > > & input,
StatusCode & sc )

This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.

Parameters
[in] input: A vector of NDArray<float>. Each NDArray must be in NHWC or HWC format.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of NDArray<float> containing the inference results.

◆ infer() [3/12]

StatusCode mobilint::Model::infer ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in HWC format.
[out] output: A reference to a vector of float vectors that will store the inference results.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ infer() [4/12]

std::vector< std::vector< float > > mobilint::Model::infer ( const std::vector< float * > & input,
StatusCode & sc )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in HWC format.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of float vectors containing the inference results.

◆ infer() [5/12]

StatusCode mobilint::Model::infer ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output,
const std::vector< std::vector< int64_t > > & shape )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Unlike other overloads, this version allows explicitly specifying the shape of each input data, which can be in NHWC or HWC format.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
[out] output: A reference to a vector of float vectors that will store the inference results.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.
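
For example, a minimal sketch passing an explicit NHWC shape. The 1x224x224x3 shape is illustrative, and `model` is assumed to have been created and launched already.

std::vector<float> image(1 * 224 * 224 * 3);
std::vector<float*> input{image.data()};
std::vector<std::vector<float>> output;
std::vector<std::vector<int64_t>> shape{{1, 224, 224, 3}};

mobilint::StatusCode sc = model->infer(input, output, shape);
if (!sc) {
    // Handle error appropriately
}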

◆ infer() [6/12]

std::vector< std::vector< float > > mobilint::Model::infer ( const std::vector< float * > & input,
const std::vector< std::vector< int64_t > > & shape,
StatusCode & sc )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of float vectors containing the inference results.

◆ infer() [7/12]

StatusCode mobilint::Model::infer ( const std::vector< NDArray< float > > & input,
std::vector< NDArray< float > > & output,
uint32_t cache_size )

This overload supports inference with KV cache.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of NDArrays, where each NDArray represents input data in NHWC or HWC format.
[out] output: A reference to a vector of NDArrays that will store the inference results.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ infer() [8/12]

std::vector< NDArray< float > > mobilint::Model::infer ( const std::vector< NDArray< float > > & input,
uint32_t cache_size,
StatusCode & sc )

This overload supports inference with KV cache.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of NDArrays, where each NDArray represents input data in NHWC or HWC format.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of NDArrays containing the inference results.

◆ infer() [9/12]

StatusCode mobilint::Model::infer ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output,
const std::vector< std::vector< int64_t > > & shape,
uint32_t cache_size )

This overload supports inference with KV cache.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
[out] output: A reference to a vector of float vectors that will store the inference results.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ infer() [10/12]

std::vector< std::vector< float > > mobilint::Model::infer ( const std::vector< float * > & input,
const std::vector< std::vector< int64_t > > & shape,
uint32_t cache_size,
StatusCode & sc )

This overload supports inference with KV cache.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NHWC or HWC format.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of float vectors containing the inference results.

◆ inferCHW() [1/10]

StatusCode mobilint::Model::inferCHW ( const std::vector< NDArray< float > > & input,
std::vector< NDArray< float > > & output )

Performs inference.

Parameters
[in] input: A vector of NDArray<float>. Each NDArray must be in NCHW or CHW format.
[out] output: A reference to a vector of NDArray<float> that will store the inference results.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ inferCHW() [2/10]

std::vector< NDArray< float > > mobilint::Model::inferCHW ( const std::vector< NDArray< float > > & input,
StatusCode & sc )

This overload differs from the above function in that it directly returns the inference results instead of modifying an output parameter.

Parameters
[in] input: A vector of NDArray<float>. Each NDArray must be in NCHW or CHW format.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of NDArray<float> containing the inference results.

◆ inferCHW() [3/10]

StatusCode mobilint::Model::inferCHW ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in CHW format.
[out] output: A reference to a vector of float vectors that will store the inference results.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ inferCHW() [4/10]

std::vector< std::vector< float > > mobilint::Model::inferCHW ( const std::vector< float * > & input,
StatusCode & sc )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in CHW format.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of float vectors containing the inference results.

◆ inferCHW() [5/10]

StatusCode mobilint::Model::inferCHW ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output,
const std::vector< std::vector< int64_t > > & shape )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Unlike other overloads, this version allows explicitly specifying the shape of each input data, which can be in NCHW or CHW format.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
[out] output: A reference to a vector of float vectors that will store the inference results.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ inferCHW() [6/10]

std::vector< std::vector< float > > mobilint::Model::inferCHW ( const std::vector< float * > & input,
const std::vector< std::vector< int64_t > > & shape,
StatusCode & sc )

This overload is provided for convenience but may result in additional data copies within the maccel runtime.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of float vectors containing the inference results.

◆ inferCHW() [7/10]

StatusCode mobilint::Model::inferCHW ( const std::vector< NDArray< float > > & input,
std::vector< NDArray< float > > & output,
uint32_t cache_size )

This overload supports inference with KV cache.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of NDArrays, where each NDArray represents input data in NCHW or CHW format.
[out] output: A reference to a vector of NDArrays that will store the inference results.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ inferCHW() [8/10]

std::vector< NDArray< float > > mobilint::Model::inferCHW ( const std::vector< NDArray< float > > & input,
uint32_t cache_size,
StatusCode & sc )

This overload supports inference with KV cache.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of NDArrays, where each NDArray represents input data in NCHW or CHW format.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of NDArrays containing the inference results.

◆ inferCHW() [9/10]

StatusCode mobilint::Model::inferCHW ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output,
const std::vector< std::vector< int64_t > > & shape,
uint32_t cache_size )

This overload supports inference with KV cache.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
[out] output: A reference to a vector of float vectors that will store the inference results.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
Returns
A status code indicating whether the inference operation completed successfully or encountered an error.

◆ inferCHW() [10/10]

std::vector< std::vector< float > > mobilint::Model::inferCHW ( const std::vector< float * > & input,
const std::vector< std::vector< int64_t > > & shape,
uint32_t cache_size,
StatusCode & sc )

This overload supports inference with KV cache.

Unlike the above overload, this function returns the inference results directly instead of modifying an output parameter.

Note
This function is relevant for LLM models that use KV cache.
Parameters
[in] input: A vector of float pointers, where each pointer represents input data in NCHW or CHW format.
[in] shape: A vector of vectors, where each inner vector specifies the shape of the corresponding input data.
[in] cache_size: The number of tokens accumulated in the KV cache so far.
[out] sc: A reference to a status code that will be updated to indicate whether the inference operation was successful or encountered an error.
Returns
A vector of float vectors containing the inference results.

◆ inferSpeedrun()

StatusCode mobilint::Model::inferSpeedrun ( int variant_idx = 0)

Development-only API for measuring pure NPU inference speed.

Runs NPU inference without uploading inputs and without retrieving outputs.

Parameters
[in] variant_idx: Index of the model variant to run.
Returns
A status code indicating the result.
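
For example, a minimal sketch that times inferSpeedrun with std::chrono, assuming `model` has already been created and launched:

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
mobilint::StatusCode sc = model->inferSpeedrun();   // run model variant 0
auto t1 = std::chrono::steady_clock::now();
if (!sc) { /* handle error */ }

// Approximates pure NPU execution time: no input upload or output retrieval
// is included in the measurement.
auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();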

◆ inferAsync() [1/2]

Future< float > mobilint::Model::inferAsync ( const std::vector< NDArray< float > > & input,
StatusCode & sc )

Initiates asynchronous inference with input in NHWC (batch N, height H, width W, channels C) or HWC format.

Parameters
[in] input: A vector of NDArray<float>. Each NDArray must be in NHWC or HWC format.
[out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.
Returns
A future that can be used to retrieve the inference result.

◆ inferAsyncCHW() [1/2]

Future< float > mobilint::Model::inferAsyncCHW ( const std::vector< NDArray< float > > & input,
StatusCode & sc )

Initiates asynchronous inference with input in NCHW (batch N, channels C, height H, width W) or CHW format.

Parameters
[in] input: A vector of NDArray<float>. Each NDArray must be in NCHW or CHW format.
[out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.
Returns
A future that can be used to retrieve the inference result.

◆ inferAsync() [2/2]

Future< int8_t > mobilint::Model::inferAsync ( const std::vector< NDArray< int8_t > > & input,
StatusCode & sc )

This overload supports int8_t-to-int8_t asynchronous inference.

Parameters
[in] input: A vector of NDArray<int8_t>. Each NDArray must be in NHWC or HWC format.
[out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.
Returns
A future that can be used to retrieve the inference result.

◆ inferAsyncCHW() [2/2]

Future< int8_t > mobilint::Model::inferAsyncCHW ( const std::vector< NDArray< int8_t > > & input,
StatusCode & sc )

This overload supports int8_t-to-int8_t asynchronous inference.

Parameters
[in] input: A vector of NDArray<int8_t>. Each NDArray must be in NCHW or CHW format.
[out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.
Returns
A future that can be used to retrieve the inference result.

◆ inferAsyncToFloat()

Future< float > mobilint::Model::inferAsyncToFloat ( const std::vector< NDArray< int8_t > > & input,
StatusCode & sc )

This overload supports int8_t-to-float asynchronous inference.

Parameters
[in] input: A vector of NDArray<int8_t>. Each NDArray must be in NHWC or HWC format.
[out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.
Returns
A future that can be used to retrieve the inference result.

◆ inferAsyncCHWToFloat()

Future< float > mobilint::Model::inferAsyncCHWToFloat ( const std::vector< NDArray< int8_t > > & input,
StatusCode & sc )

This overload supports int8_t-to-float asynchronous inference.

Parameters
[in] input: A vector of NDArray<int8_t>. Each NDArray must be in NCHW or CHW format.
[out] sc: A reference to a status code that will be updated to indicate whether the asynchronous inference request was successfully initiated or encountered an error.
Returns
A future that can be used to retrieve the inference result.

◆ getNumModelVariants()

int mobilint::Model::getNumModelVariants ( ) const

Returns the total number of model variants available in this model.

The variant_idx parameter passed to Model::getModelVariantHandle must be in the range [0, return value of this function).

Returns
The total number of model variants.

◆ getModelVariantHandle()

std::unique_ptr< ModelVariantHandle > mobilint::Model::getModelVariantHandle ( int variant_idx,
StatusCode & sc ) const

Retrieves a handle to the specified model variant.

Use the returned ModelVariantHandle to query details such as input and output shapes for the selected variant.

Parameters
[in] variant_idx: Index of the model variant to retrieve. Must be in the range [0, getNumModelVariants()).
[out] sc: A reference to a StatusCode variable that will be updated to indicate success or failure.
Returns
A unique pointer to the corresponding ModelVariantHandle if successful; otherwise, nullptr.
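
For example, a minimal sketch enumerating all variants of an already created `model`:

mobilint::StatusCode sc;
int num_variants = model->getNumModelVariants();

for (int i = 0; i < num_variants; ++i) {
    std::unique_ptr<mobilint::ModelVariantHandle> variant = model->getModelVariantHandle(i, sc);
    if (!sc || !variant) {
        // Handle error appropriately
        continue;
    }
    // Query per-variant details (e.g., input/output shapes) through `variant`.
}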

◆ getModelInputShape()

const std::vector< std::vector< int64_t > > & mobilint::Model::getModelInputShape ( ) const

Returns the input shape of the model.

Returns
A reference to the input shape of the model.

◆ getModelOutputShape()

const std::vector< std::vector< int64_t > > & mobilint::Model::getModelOutputShape ( ) const

Returns the output shape of the model.

Returns
A reference to the output shape of the model.
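
For example, a short sketch printing every input and output shape of an already created `model`:

#include <iostream>

const auto& in_shapes  = model->getModelInputShape();
const auto& out_shapes = model->getModelOutputShape();

for (size_t i = 0; i < in_shapes.size(); ++i) {
    std::cout << "input " << i << ":";
    for (int64_t d : in_shapes[i]) std::cout << ' ' << d;
    std::cout << '\n';
}
for (size_t i = 0; i < out_shapes.size(); ++i) {
    std::cout << "output " << i << ":";
    for (int64_t d : out_shapes[i]) std::cout << ' ' << d;
    std::cout << '\n';
}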

◆ getInputBufferInfo()

const std::vector< BufferInfo > & mobilint::Model::getInputBufferInfo ( ) const

Returns the input buffer information for the model.

Returns
A reference to a vector of input buffer information.

◆ getOutputBufferInfo()

const std::vector< BufferInfo > & mobilint::Model::getOutputBufferInfo ( ) const

Returns the output buffer information of the model.

Returns
A reference to a vector of output buffer information.

◆ getInputScale()

std::vector< Scale > mobilint::Model::getInputScale ( ) const

Returns the input quantization scale(s) of the model.

Returns
A vector of input scales.

◆ getOutputScale()

std::vector< Scale > mobilint::Model::getOutputScale ( ) const

Returns the output quantization scale(s) of the model.

Returns
A vector of output scales.

◆ getIdentifier()

uint32_t mobilint::Model::getIdentifier ( ) const

Returns the model's unique identifier.

This identifier distinguishes multiple models within a single user program. It is assigned incrementally, starting from 0 (e.g., 0, 1, 2, 3, ...).

Returns
The model identifier.

◆ getModelPath()

std::string mobilint::Model::getModelPath ( ) const

Returns the path to the MXQ model file associated with the Model.

Returns
The MXQ file path.

◆ getCacheInfos()

std::vector< CacheInfo > mobilint::Model::getCacheInfos ( ) const

Returns information about the model's KV cache.

Returns
A vector of CacheInfo objects.

◆ resetCacheMemory()

void mobilint::Model::resetCacheMemory ( )

Resets the KV cache memory.

Clears the stored KV cache, restoring it to its initial state.

◆ dumpCacheMemory() [1/3]

StatusCode mobilint::Model::dumpCacheMemory ( std::vector< std::vector< int8_t > > & bufs)

Dumps the KV cache memory into buffers.

Writes the current KV cache data into provided buffers.

Parameters
[out] bufs: A reference to a vector of byte vectors that will store the KV cache data.
Returns
A status code indicating whether the dump operation was successful or if an error occurred.

◆ dumpCacheMemory() [2/3]

std::vector< std::vector< int8_t > > mobilint::Model::dumpCacheMemory ( StatusCode & sc)

Dumps the KV cache memory into buffers.

Writes the KV cache data into buffers and returns them.

Parameters
[out] sc: A reference to a status code that will be updated to indicate whether the dump operation was successful or if an error occurred.
Returns
A vector of byte vectors containing the KV cache data.

◆ dumpCacheMemory() [3/3]

StatusCode mobilint::Model::dumpCacheMemory ( const std::string & cache_dir)

Dumps KV cache memory to files in the specified directory.

Writes the KV cache data to binary files within the given directory. Each file is named using the format: cache_<layer_hash>.bin.

Parameters
[in] cache_dir: Path to the directory where KV cache files will be saved.
Returns
A status code indicating whether the dump operation was successful or if an error occurred.

◆ loadCacheMemory() [1/2]

StatusCode mobilint::Model::loadCacheMemory ( const std::vector< std::vector< int8_t > > & bufs)

Loads the KV cache memory from buffers.

Restores the KV cache from the provided buffers.

Parameters
[in] bufs: A reference to a vector of byte vectors containing the KV cache data.
Returns
A status code indicating whether the load operation was successful or if an error occurred.

◆ loadCacheMemory() [2/2]

StatusCode mobilint::Model::loadCacheMemory ( const std::string & cache_dir)

Loads the KV cache memory from files in the specified directory.

Reads KV cache data from files within the given directory and restores them. Each file is named using the format: cache_<layer_hash>.bin.

Parameters
[in] cache_dir: Path to the directory where KV cache files are saved.
Returns
A status code indicating whether the load operation was successful or if an error occurred.
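
For example, a minimal sketch that persists the KV cache to disk and later restores it. The directory path is illustrative, and `model` is assumed to have been created and launched already.

// Persist the current KV cache, e.g. before shutting down.
mobilint::StatusCode sc = model->dumpCacheMemory("/tmp/kv_cache");
if (!sc) { /* handle error */ }

// Later (or in another run): clear the cache and restore it from disk.
model->resetCacheMemory();
sc = model->loadCacheMemory("/tmp/kv_cache");
if (!sc) { /* handle error */ }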

◆ filterCacheTail()

int mobilint::Model::filterCacheTail ( int cache_size,
int tail_size,
const std::vector< bool > & mask,
StatusCode & sc )

Filters the tail of the KV cache memory.

Retains the desired caches in the tail of the KV cache memory, excludes the others, and shifts the remaining caches forward.

Parameters
[in] cache_size: The number of tokens accumulated in the KV cache so far.
[in] tail_size: The tail size of the KV cache to filter (<=32).
[in] mask: A mask indicating tokens to retain or exclude at the tail of the KV cache.
[out] sc: A status code indicating the outcome of the tail filtering.
Returns
New cache size after tail filtering.
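
For example, a minimal sketch dropping some of the most recent tokens. It assumes `model` has been created and launched, `cache_size` holds the current number of cached tokens, and (as an assumption) that a `true` entry in the mask marks a tail token to retain.

// Filter the last 4 cached tokens: keep the 1st and 3rd, drop the 2nd and 4th
// (assumption: `true` marks a token to retain).
std::vector<bool> mask{true, false, true, false};

mobilint::StatusCode sc;
int new_cache_size = model->filterCacheTail(cache_size, /*tail_size=*/4, mask, sc);
if (!sc) { /* handle error */ }

// Pass `new_cache_size` as cache_size in subsequent KV-cache inference calls.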

◆ moveCacheTail()

int mobilint::Model::moveCacheTail ( int num_head,
int num_tail,
int cache_size,
StatusCode & sc )

Moves the tail of the KV cache memory to the end of the head.

Slices the tail of the KV cache memory up to the specified size and moves it to the designated cache position.

Parameters
[in] num_head: The size of the KV cache head where the tail is appended.
[in] num_tail: The size of the KV cache tail to be moved.
[in] cache_size: The total number of tokens accumulated in the KV cache so far.
[out] sc: A status code indicating the result of the tail move.
Returns
The updated cache size after moving the tail.

◆ infer() [11/12]

StatusCode mobilint::Model::infer ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output,
int batch_size )
Deprecated
Use infer(input, output, shape) instead.

◆ infer() [12/12]

std::vector< std::vector< float > > mobilint::Model::infer ( const std::vector< float * > & input,
int batch_size,
StatusCode & sc )
Deprecated
Use infer(input, shape, sc) instead.

◆ inferHeightBatch()

StatusCode mobilint::Model::inferHeightBatch ( const std::vector< float * > & input,
std::vector< std::vector< float > > & output,
int height_batch_size )
Deprecated

◆ getSchedulePolicy()

SchedulePolicy mobilint::Model::getSchedulePolicy ( ) const

◆ getLatencySetPolicy()

LatencySetPolicy mobilint::Model::getLatencySetPolicy ( ) const

◆ getMaintenancePolicy()

MaintenancePolicy mobilint::Model::getMaintenancePolicy ( ) const

◆ getLatencyConsumed()

uint64_t mobilint::Model::getLatencyConsumed ( const int npu_op_idx) const

◆ getLatencyFinished()

uint64_t mobilint::Model::getLatencyFinished ( const int npu_op_idx) const

◆ getStatistics()

std::shared_ptr< Statistics > mobilint::Model::getStatistics ( ) const

◆ Accelerator

friend class Accelerator
friend

Definition at line 1191 of file model.h.


The documentation for this class was generated from the following file:

  • model.h