Release Notes#

v1.2.0#

Release date: April 2, 2026

This update adds Batch LLM support.

Highlights#

BatchParam#

A new struct BatchParam has been introduced to support Batch LLM inference.

BatchParam holds the information required to execute each batch during Batch LLM inference.

Struct Fields

sequence_length : The sequence length for each batch.
cache_size : The cache size each batch will use.
cache_id : The cache identifier for each batch.
- All inputs within the same context must use the same cache ID throughout inference.
- The cache_id value must be within the maximum batch count supported by the model.
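The constraints above can be checked before calling the runtime. The sketch below is an illustration only and is not part of qbruntime: `BatchSpec` and `validate_batch_specs` are hypothetical names, a plain named tuple stands in for BatchParam, and the code assumes cache IDs are zero-based, so valid values run from 0 to num_batches - 1.

```python
from typing import NamedTuple, Sequence


class BatchSpec(NamedTuple):
    """Hypothetical stand-in mirroring the three BatchParam fields."""
    sequence_length: int
    cache_size: int
    cache_id: int


def validate_batch_specs(specs: Sequence[BatchSpec], num_batches: int, total_seq_len: int) -> None:
    """Raise ValueError if the specs break the constraints described above."""
    for i, spec in enumerate(specs):
        # Assumption: cache IDs are zero-based, so the valid range is [0, num_batches).
        if not 0 <= spec.cache_id < num_batches:
            raise ValueError(f"spec {i}: cache_id {spec.cache_id} is outside 0..{num_batches - 1}")
    # The per-batch sequence lengths must add up to the seq_len of the concatenated input.
    expected = sum(spec.sequence_length for spec in specs)
    if expected != total_seq_len:
        raise ValueError(f"sequence lengths sum to {expected}, but the input has seq_len {total_seq_len}")


specs = [BatchSpec(10, 0, 0), BatchSpec(80, 0, 1)]
validate_batch_specs(specs, num_batches=2, total_seq_len=90)  # passes silently
```

In real code, num_batches would come from the model's cache information rather than a literal.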

Usage

To use Batch LLM, concatenate multiple inputs into a single input, then pass a BatchParam for each individual input.

When the input shape is (1, seq_len, hidden_dim), batch inputs must be concatenated along the seq_len dimension.

- `seq_len` : Number of tokens
- `hidden_dim` : Embedding dimension of each token

For example, to combine two inputs with sequence lengths of 10 and 80:

import qbruntime
import numpy as np

# `model` is assumed to be an already-loaded qbruntime model, and `input0` / `input1`
# are assumed to be arrays of shape (1, 10, hidden_dim) and (1, 80, hidden_dim).

# Check the maximum batch count supported by the model.
print(model.get_cache_infos()[0].num_batches)

# Concatenate inputs into a single input for Batch LLM.
# Inputs must be concatenated along the 2nd dimension (axis=1).
batch_input = np.concatenate([input0, input1], axis=1)

# qbruntime.BatchParam(sequence_length, cache_size, cache_id)
batch_params = [
    qbruntime.BatchParam(10, 0, 0),
    qbruntime.BatchParam(80, 0, 1),
]

res = model.infer([batch_input], params=batch_params)

# The output is structured as:
# [[batch_0_output], [batch_1_output], ...]

# Follow-up call: one token per batch. The cache_size values here match the
# 10 and 80 tokens processed above, and each context keeps its cache_id.
batch_params2 = [
    qbruntime.BatchParam(1, 10, 0),
    qbruntime.BatchParam(1, 80, 1),
]

res = model.infer(res, params=batch_params2)
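The shape bookkeeping around the concatenation step can be checked with NumPy alone. The sketch below makes no qbruntime calls; `hidden_dim = 16` and the random inputs are placeholder assumptions. It shows that concatenating along axis=1 yields a seq_len of 90 and that each input's slice can be recovered from the sequence lengths.

```python
import numpy as np

hidden_dim = 16  # placeholder embedding size, chosen only for illustration
seq_lens = [10, 80]

rng = np.random.default_rng(0)
inputs = [rng.standard_normal((1, n, hidden_dim)).astype(np.float32) for n in seq_lens]

# Concatenate along the seq_len dimension (axis=1).
batch_input = np.concatenate(inputs, axis=1)
print(batch_input.shape)  # (1, 90, 16)

# Offsets at which each later input starts inside the concatenated tensor.
offsets = np.cumsum(seq_lens)[:-1]
parts = np.split(batch_input, offsets, axis=1)
for original, recovered in zip(inputs, parts):
    assert np.array_equal(original, recovered)
```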

Known Issues#

Running LLM models on ARM (aarch64) systems may fail with “Bus Error”. This issue has been present since v1.1.0. A driver patch to resolve this issue is planned. We apologize for the inconvenience.

v1.1.0#

Release date: March 23, 2026

qb Runtime v1.1.0 brings automatic core mode selection, data type query APIs, and performance optimizations.

Highlights#

CoreMode::Auto#

The runtime can now automatically select the available core mode for your model. Setting CoreMode::Auto in your ModelConfig enables the runtime to detect and apply the appropriate core mode from the MXQ. Previously, non-default core modes such as Multi, Global4, and Global8 required manual ModelConfig construction; with Auto mode, the available core mode is selected automatically. Since the default constructor also uses Auto mode, no additional configuration is needed in most cases.

Note

If the MXQ was compiled with a flag like scheme="all" that produces multiple core modes, you must still select the core mode manually as before.

New APIs#

getModelInputDataType() / getModelOutputDataType() — Query the data types of a model’s inputs and outputs at runtime, enabling more flexible pipeline construction.
getAvailableDeviceNumbers() — Retrieve the list of available NPU device numbers.

REGULUS Dynamic Allocation#

The dynamic allocation approach introduced in v1.0.0 has been applied to REGULUS as well, ensuring a consistent usage pattern.

Performance Improvements#

Improved data transfer performance to NPU devices on Windows.
Optimized internal type conversion.

Bug Fixes#

Resolved a compile error caused by std::filesystem on GCC versions below 9.
Fixed an intermittent deadlock in certain models.

Breaking Changes#

The supported REGULUS driver revision number changes from REV0 to REV1.

Known Issues#

Running LLM models on ARM (aarch64) systems may fail with “Bus Error”. A driver patch to resolve this issue is planned. We apologize for the inconvenience.

v1.0.0 — Major Release#

Release date: January 31, 2026

This update includes significant improvements across the internal architecture and the SDK qb as a whole. We focused on scalability, consistency, and a structural refactor for future expansion.

Highlights#

SDK qb Naming Unification#

Previously, different components used different names, which could be confusing for users new to the SDK qb. To address this, we unified the names of key SDK qb components as follows:

Runtime library maccel → qb Runtime
Compiler qubee → qb Compiler

This naming unification makes the roles and relationships between SDK qb components more intuitive and enables a more consistent user experience in documentation and future feature expansions.

Model Count Limit Removed#

Previously, the number of models that could run concurrently was limited by the number of NPU cores. This update removes that restriction by improving the underlying design.

Models compiled with the latest qb Compiler can be loaded and executed concurrently within available DRAM, regardless of the core mode specified at compile time.

Benefits include:

More flexibility in services that run multiple models simultaneously
Ability to run models built for different core modes at the same time
Removal of core constraints that affected large models such as LLMs

This change is based on internal runtime optimizations. For users, any model compiled as MXQv7 can take advantage of it without code changes.

Multithreading Performance Improvements#

With this update, the C++ library provides .setActivationSlots(int num) and the Python API provides .set_activation_slots(num), which give you finer control over pipelining between NPU inference and data transfer.

These functions allow you to control the number of input slots for a model. Using more slots increases NPU memory usage, but enables more effective pipelining and improves performance in multithreaded workloads.
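To see why extra slots can help, consider a toy timing model. This is not a description of qb Runtime internals: `pipeline_makespan` is a hypothetical helper that assumes transfers run one at a time, NPU computations run one at a time, and a slot stays occupied from the start of an input's transfer until its computation finishes.

```python
def pipeline_makespan(num_items: int, transfer: float, compute: float, slots: int) -> float:
    """Total time to process num_items in a two-stage pipeline with a fixed number of slots."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    transfer_end = []
    compute_end = []
    for i in range(num_items):
        # A transfer waits for the previous transfer and for a free slot.
        start = transfer_end[i - 1] if i > 0 else 0.0
        if i >= slots:
            start = max(start, compute_end[i - slots])
        transfer_end.append(start + transfer)
        # A computation waits for its own data and for the previous computation.
        ready = transfer_end[i]
        if i > 0:
            ready = max(ready, compute_end[i - 1])
        compute_end.append(ready + compute)
    return compute_end[-1] if compute_end else 0.0


for slots in (1, 2, 3):
    print(slots, pipeline_makespan(num_items=4, transfer=1.0, compute=2.0, slots=slots))
```

In this model the total drops from 12.0 to 9.0 time units when going from one slot to two, and a third slot adds nothing because the computation stage is already saturated. Real workloads should be measured rather than predicted from a model like this.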

Note

For models that use cache (e.g., LLMs), the activation slot count is currently limited to 1.

uint8 Inference Support#

This update officially supports uint8 integer inference.

uint8 quantized models can be compiled with qb Compiler
qb Runtime supports inference execution for these models

This reduces CPU overhead during preprocessing for models that use uint8 inputs.
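As a rough illustration of where the savings can come from, the sketch below compares a uint8 image buffer with the float32 copy that a float-input pipeline would typically create. It uses only NumPy; the 224x224x3 shape and the divide-by-255 normalization are assumptions for illustration, not qb Runtime requirements.

```python
import numpy as np

# Placeholder image: 224x224 RGB with 8-bit channels, as an image decoder would typically return.
image_u8 = np.zeros((224, 224, 3), dtype=np.uint8)

# Float path: an extra conversion pass over every pixel and a buffer four times as large.
image_f32 = image_u8.astype(np.float32) / 255.0

print(image_u8.nbytes)   # 150528
print(image_f32.nbytes)  # 602112
```

When a model accepts uint8 directly, the decoded buffer can be used as-is and the conversion step disappears from the CPU side.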

Migration Guide#

Due to the naming unification, packages, headers, and module names have changed. Legacy packages are no longer maintained.

Installation#

I. Update APT Package Index#

Before installing any packages, update the APT package index:

sudo apt update

II. Install Runtime Library#

The runtime library package name has changed from mobilint-npu-runtime to mobilint-qb-runtime.

# C++ library
sudo apt install mobilint-qb-runtime

# Python package
pip install mobilint-qb-runtime

III. Install Driver#

Under the new naming policy, the driver package name has also changed from aries-driver to mobilint-aries-driver.

sudo apt install mobilint-aries-driver

C++ Library Changes#

Compilation/linking flags updated

# Previous build
g++ -o example example.cpp -lmaccel

# Updated build
g++ -o example example.cpp -lqbruntime

Header path updated

// Previous header
#include "maccel/maccel.h"

// Updated header
#include "qbruntime/qbruntime.h"

Python Package Changes#

Module name updated

# Previous module
import maccel

# Updated module
import qbruntime
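During a migration it can help to check which module name is installed in the current environment. The sketch below uses only the standard library; `installed_runtime_module` is a hypothetical helper name and is not part of either package.

```python
import importlib.util
from typing import Optional


def installed_runtime_module() -> Optional[str]:
    """Return 'qbruntime' if the new module is importable, else 'maccel', else None."""
    for name in ("qbruntime", "maccel"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


found = installed_runtime_module()
if found == "maccel":
    print("Legacy module found; install mobilint-qb-runtime and switch to 'import qbruntime'.")
elif found is None:
    print("No runtime module found; run 'pip install mobilint-qb-runtime'.")
else:
    print("qbruntime is available.")
```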
