model.h Source File
Runtime Library v0.30
Mobilint SDK qb
Represents an accelerator, i.e., an NPU, used for executing models.
Definition acc.h:66
Represents a future for retrieving the result of asynchronous inference.
Definition future.h:43
Configures the core mode and core allocation of a model for NPU inference.
Definition type.h:257
std::string getModelPath() const
Returns the path to the MXQ model file associated with the Model.
Future< float > inferAsyncToFloat(const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
This overload supports int8_t-to-float asynchronous inference.
uint64_t getLatencyConsumed(const int npu_op_idx) const
Returns the latency consumed by the NPU operation at the given index.
std::vector< std::vector< float > > infer(const std::vector< float * > &input, int batch_size, StatusCode &sc)
Performs batched inference with the specified batch size.
StatusCode inferHeightBatch(const std::vector< float * > &input, std::vector< std::vector< float > > &output, int height_batch_size)
Performs inference with height-wise batching of the input.
bool isTarget(CoreId core_id) const
Checks if the NPU core specified by CoreId is the target of the model. In other words, checks whether the model is configured to run on that core.
static std::unique_ptr< Model > create(const std::string &mxq_path, StatusCode &sc)
Creates a Model object from the specified MXQ model file.
Future< float > inferAsyncCHW(const std::vector< NDArray< float > > &input, StatusCode &sc)
Initiates asynchronous inference with input in NCHW (batch N, channels C, height H, width W) or CHW format.
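The inferAsync and inferAsyncCHW entry points differ only in the expected memory layout of the input: channels-last (NHWC/HWC) versus channels-first (NCHW/CHW). As an SDK-independent illustration of that layout difference, the helper below (hypothetical, not part of maccel) converts one image from NHWC to NCHW:

```cpp
#include <cstddef>
#include <vector>

// Convert a single image from NHWC (channels-last) to NCHW (channels-first).
// Illustrative only; when the CHW entry points are used, any such conversion
// is handled by the runtime.
std::vector<float> nhwcToNchw(const std::vector<float>& src,
                              std::size_t h, std::size_t w, std::size_t c) {
    std::vector<float> dst(src.size());
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            for (std::size_t ch = 0; ch < c; ++ch)
                // NHWC index: (y*w + x)*c + ch  ->  NCHW index: ch*h*w + y*w + x
                dst[ch * h * w + y * w + x] = src[(y * w + x) * c + ch];
    return dst;
}
```

Choosing the entry point that matches how the input was produced (e.g. CHW for framework tensors, HWC for decoded images) avoids paying for this transpose twice.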
std::shared_ptr< Statistics > getStatistics() const
Returns the Statistics object associated with the model.
std::vector< NDArray< float > > infer(const std::vector< NDArray< float > > &input, StatusCode &sc)
This overload differs from the above function in that it directly returns the inference results instead of storing them in an output parameter.
Future< int8_t > inferAsync(const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
This overload supports int8_t-to-int8_t asynchronous inference.
StatusCode loadCacheMemory(const std::vector< std::vector< int8_t > > &bufs)
Loads the KV cache memory from buffers.
Future< float > inferAsyncCHWToFloat(const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
This overload supports int8_t-to-float asynchronous inference.
SchedulePolicy getSchedulePolicy() const
Returns the schedule policy of the model.
std::vector< Scale > getOutputScale() const
Returns the output quantization scale(s) of the model.
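Quantization scales relate the model's int8_t tensors to real values, which is what the int8_t-to-float overloads rely on. As a minimal, SDK-independent sketch (assuming a simple symmetric linear scale; the exact scheme for a given MXQ model is defined by the runtime's Scale type), per-tensor dequantization looks like:

```cpp
#include <cstdint>
#include <vector>

// Dequantize int8 values with a per-tensor linear scale: real = q * scale.
// Assumes symmetric quantization (zero point 0); illustrative helper only,
// not part of the maccel API.
std::vector<float> dequantize(const std::vector<int8_t>& q, float scale) {
    std::vector<float> out(q.size());
    for (std::size_t i = 0; i < q.size(); ++i)
        out[i] = static_cast<float>(q[i]) * scale;
    return out;
}
```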
StatusCode loadCacheMemory(const std::string &cache_dir)
Loads the KV cache memory from files in the specified directory.
StatusCode dumpCacheMemory(const std::string &cache_dir)
Dumps KV cache memory to files in the specified directory.
const std::vector< std::vector< int64_t > > & getModelOutputShape() const
Returns the output shape of the model.
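The returned value holds one std::vector<int64_t> of dimensions per output tensor, and the element count of the corresponding output buffer is the product of those dimensions. A small illustrative helper (not part of the SDK):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Number of elements implied by one tensor shape, e.g. {1, 1000} -> 1000.
// Helper name is hypothetical, shown only to clarify the shape convention.
int64_t elementCount(const std::vector<int64_t>& shape) {
    return std::accumulate(shape.begin(), shape.end(),
                           static_cast<int64_t>(1), std::multiplies<int64_t>());
}
```

The same convention applies to getModelInputShape() when sizing input buffers.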
std::unique_ptr< ModelVariantHandle > getModelVariantHandle(int variant_idx, StatusCode &sc) const
Retrieves a handle to the specified model variant.
StatusCode launch(Accelerator &acc)
Launches the model on the specified Accelerator, which represents the actual NPU.
LatencySetPolicy getLatencySetPolicy() const
Returns the latency set policy of the model.
std::vector< CoreId > getTargetCores() const
Returns the NPU cores the model is configured to use.
std::vector< std::vector< float > > inferCHW(const std::vector< float * > &input, StatusCode &sc)
This overload is provided for convenience but may result in additional data copies within the maccel library.
std::vector< std::vector< float > > inferCHW(const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
This overload is provided for convenience but may result in additional data copies within the maccel library.
const std::vector< BufferInfo > & getInputBufferInfo() const
Returns the input buffer information for the model.
int moveCacheTail(int num_head, int num_tail, int cache_size, StatusCode &sc)
Moves the tail of the KV cache memory to the end of the head.
std::vector< std::vector< float > > infer(const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
This overload supports inference with KV cache.
StatusCode infer(const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output)
Performs inference.
int getNumModelVariants() const
Returns the total number of model variants available in this model.
Future< float > inferAsync(const std::vector< NDArray< float > > &input, StatusCode &sc)
Initiates asynchronous inference with input in NHWC (batch N, height H, width W, channels C) or HWC format.
std::vector< NDArray< float > > inferCHW(const std::vector< NDArray< float > > &input, StatusCode &sc)
This overload differs from the above function in that it directly returns the inference results instead of storing them in an output parameter.
std::vector< NDArray< float > > infer(const std::vector< NDArray< float > > &input, uint32_t cache_size, StatusCode &sc)
This overload supports inference with KV cache.
StatusCode inferCHW(const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output, uint32_t cache_size)
This overload supports inference with KV cache.
static std::unique_ptr< Model > create(const std::string &mxq_path, const ModelConfig &config, StatusCode &sc)
Creates a Model object from the specified MXQ model file and configuration.
int filterCacheTail(int cache_size, int tail_size, const std::vector< bool > &mask, StatusCode &sc)
Filters the tail of the KV cache memory.
const std::vector< std::vector< int64_t > > & getModelInputShape() const
Returns the input shape of the model.
std::vector< NDArray< float > > inferCHW(const std::vector< NDArray< float > > &input, uint32_t cache_size, StatusCode &sc)
This overload supports inference with KV cache.
MaintenancePolicy getMaintenancePolicy() const
Returns the maintenance policy of the model.
Future< int8_t > inferAsyncCHW(const std::vector< NDArray< int8_t > > &input, StatusCode &sc)
This overload supports int8_t-to-int8_t asynchronous inference.
StatusCode infer(const std::vector< float * > &input, std::vector< std::vector< float > > &output)
This overload is provided for convenience but may result in additional data copies within the maccel library.
const std::vector< BufferInfo > & getOutputBufferInfo() const
Returns the output buffer information of the model.
StatusCode infer(const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size)
This overload supports inference with KV cache.
std::vector< CacheInfo > getCacheInfos() const
Returns information about the model's KV caches.
StatusCode infer(const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output, uint32_t cache_size)
This overload supports inference with KV cache.
std::vector< std::vector< float > > infer(const std::vector< float * > &input, StatusCode &sc)
This overload is provided for convenience but may result in additional data copies within the maccel library.
std::vector< std::vector< float > > inferCHW(const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size, StatusCode &sc)
This overload supports inference with KV cache.
std::vector< std::vector< float > > infer(const std::vector< float * > &input, const std::vector< std::vector< int64_t > > &shape, StatusCode &sc)
This overload is provided for convenience but may result in additional data copies within the maccel library.
StatusCode infer(const std::vector< float * > &input, std::vector< std::vector< float > > &output, int batch_size)
Performs batched inference with the specified batch size.
std::vector< std::vector< int8_t > > dumpCacheMemory(StatusCode &sc)
Dumps the KV cache memory into buffers.
StatusCode inferCHW(const std::vector< float * > &input, std::vector< std::vector< float > > &output)
This overload is provided for convenience but may result in additional data copies within the maccel library.
std::vector< Scale > getInputScale() const
Returns the input quantization scale(s) of the model.
StatusCode inferSpeedrun(int variant_idx=0)
Development-only API for measuring pure NPU inference speed.
StatusCode inferCHW(const std::vector< NDArray< float > > &input, std::vector< NDArray< float > > &output)
Performs inference.
StatusCode infer(const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape)
This overload is provided for convenience but may result in additional data copies within the maccel library.
StatusCode inferCHW(const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape)
This overload is provided for convenience but may result in additional data copies within the maccel library.
uint64_t getLatencyFinished(const int npu_op_idx) const
Returns the finish latency of the NPU operation at the given index.
StatusCode inferCHW(const std::vector< float * > &input, std::vector< std::vector< float > > &output, const std::vector< std::vector< int64_t > > &shape, uint32_t cache_size)
This overload supports inference with KV cache.
StatusCode dumpCacheMemory(std::vector< std::vector< int8_t > > &bufs)
Dumps the KV cache memory into the provided buffers.