# Release Notes

## v1.2.0

**Release date:** April 2, 2026

This update adds Batch LLM support.

### Highlights

#### BatchParam

A new struct {doxylink}`BatchParam <mobilint::BatchParam>` has been introduced to support Batch LLM inference.

`BatchParam` holds the information required to execute each batch during Batch LLM inference.

**Struct Fields**

- `sequence_length` : The sequence length for each batch.
- `cache_size` : The cache size each batch will use.
- `cache_id` : The cache identifier for each batch.
    - All inputs within the same context must use the same cache ID throughout inference.
    - The `cache_id` value must be within the maximum batch count supported by the model.
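
The `cache_id` rules above can be checked on the host before calling inference. The following is a minimal sketch and not part of qb Runtime: `Param` is a hypothetical stand-in that mirrors the three `BatchParam` fields, and `max_batches` stands for the maximum batch count reported by the model. It assumes that valid IDs run from 0 to `max_batches - 1` and that a single call should not reuse an ID, neither of which is stated explicitly in these notes.

```python
from typing import NamedTuple


class Param(NamedTuple):
    """Hypothetical stand-in mirroring the three BatchParam fields."""
    sequence_length: int
    cache_size: int
    cache_id: int


def check_cache_ids(params, max_batches):
    """Return a list of problems found; an empty list means the IDs look valid.

    Assumptions (not confirmed by the release notes): valid IDs are
    0 .. max_batches - 1, and one call should not reuse an ID.
    """
    problems = []
    seen = set()
    for index, param in enumerate(params):
        if not 0 <= param.cache_id < max_batches:
            problems.append(f"batch {index}: cache_id {param.cache_id} out of range")
        if param.cache_id in seen:
            problems.append(f"batch {index}: cache_id {param.cache_id} reused")
        seen.add(param.cache_id)
    return problems


print(check_cache_ids([Param(10, 0, 0), Param(80, 0, 1)], max_batches=2))
print(check_cache_ids([Param(10, 0, 0), Param(80, 0, 2)], max_batches=2))
```

Returning a list of messages instead of raising on the first problem makes it easier to report every invalid batch at once.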

**Usage**

To use Batch LLM, concatenate multiple inputs into a single input, then pass a `BatchParam` for each individual input.

When the input shape is `(1, seq_len, hidden_dim)`, batch inputs must be concatenated along the `seq_len` dimension.

- `seq_len` : Number of tokens
- `hidden_dim` : Embedding dimension of each token

For example, to combine two inputs with sequence lengths of 10 and 80:

```python
import numpy as np
import qbruntime

# `model` is assumed to be an already-loaded qbruntime model, and `input0` /
# `input1` are arrays of shape (1, 10, hidden_dim) and (1, 80, hidden_dim).

# Check the maximum batch count supported by the model.
print(model.get_cache_infos()[0].num_batches)

# Concatenate inputs into a single input for Batch LLM.
# Inputs must be concatenated along the 2nd dimension (axis=1).
batch_input = np.concatenate([input0, input1], axis=1)

# qbruntime.BatchParam(sequence_length, cache_size, cache_id)
batch_params = [
    qbruntime.BatchParam(10, 0, 0),
    qbruntime.BatchParam(80, 0, 1),
]

res = model.infer([batch_input], params=batch_params)

# The output is structured as:
# [[batch_0_output], [batch_1_output], ...]

# Follow-up call: each batch now passes a sequence length of 1, a cache size
# of 10 or 80, and the same cache ID as in the first call.
batch_params2 = [
    qbruntime.BatchParam(1, 10, 0),
    qbruntime.BatchParam(1, 80, 1),
]

res = model.infer(res, params=batch_params2)
```
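
To make the shape bookkeeping of the concatenation step concrete, the sketch below uses NumPy only. The `hidden_dim` value of 8 and the random inputs are placeholders. Splitting by cumulative sequence length is shown purely to illustrate where each batch sits inside the concatenated input; it does not describe how qb Runtime lays out its outputs.

```python
import numpy as np

hidden_dim = 8  # placeholder embedding size, not taken from any real model
seq_lens = [10, 80]

rng = np.random.default_rng(0)
inputs = [rng.standard_normal((1, n, hidden_dim)).astype(np.float32) for n in seq_lens]

# Concatenate along the seq_len dimension (axis=1).
batch_input = np.concatenate(inputs, axis=1)
print(batch_input.shape)  # (1, 90, 8)

# Each batch occupies a contiguous token range in the concatenated input.
offsets = np.cumsum([0] + seq_lens)
for i, (start, stop) in enumerate(zip(offsets[:-1], offsets[1:])):
    print(f"batch {i}: tokens {start}..{stop - 1}")

# Slicing at the interior offsets recovers the original inputs exactly.
recovered = np.split(batch_input, offsets[1:-1], axis=1)
```

The sequence lengths used here are the same values that go into the `sequence_length` field of each `BatchParam`, so their sum should match the `seq_len` dimension of the concatenated input.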

### Known Issues

- Running LLM models on ARM (aarch64) systems may fail with a "Bus Error". This issue has been present since v1.1.0. A driver patch to resolve it is planned. We apologize for the inconvenience.

## v1.1.0

**Release date:** March 23, 2026

qb Runtime v1.1.0 brings automatic core mode selection, data type query APIs, and performance optimizations.

### Highlights

#### CoreMode::Auto

The runtime can now select the core mode for your model automatically. Setting `CoreMode::Auto` in your `ModelConfig` lets the runtime detect the appropriate core mode from the MXQ and apply it. Previously, non-default core modes such as `Multi`, `Global4`, and `Global8` required manual `ModelConfig` construction. Since the default constructor also uses Auto mode, no additional configuration is needed in most cases.

```{note}
If the MXQ was compiled with a flag like `scheme="all"` that produces multiple core modes, you must still select the core mode manually as before.
```

```{seealso}
For more details, see {doxylink}`setAutoCoreMode() <mobilint::ModelConfig::setAutoCoreMode()>`.
```

#### New APIs

- {doxylink}`getModelInputDataType() <mobilint::Model::getModelInputDataType() const>` / {doxylink}`getModelOutputDataType() <mobilint::Model::getModelOutputDataType() const>` — Query the data types of a model's inputs and outputs at runtime, enabling more flexible pipeline construction.
- {doxylink}`getAvailableDeviceNumbers() <mobilint::getAvailableDeviceNumbers()>` — Retrieve the list of available NPU device numbers.

#### REGULUS Dynamic Allocation

The dynamic allocation approach introduced in v1.0.0 has been applied to REGULUS as well, ensuring a consistent usage pattern.

#### Performance Improvements

- Improved data transfer performance to NPU devices on Windows.
- Optimized internal type conversion.

### Bug Fixes

- Resolved a compile error caused by `std::filesystem` on GCC versions below 9.
- Fixed an intermittent deadlock in certain models.

### Breaking Changes

- The supported REGULUS driver revision number changes from REV0 to REV1.

### Known Issues

- Running LLM models on ARM (aarch64) systems may fail with a "Bus Error". A driver patch to resolve this issue is planned. We apologize for the inconvenience.

```{seealso}
For the complete changelog, see the [Changelog](CHANGELOG.md) page.
```

## v1.0.0 — Major Release

**Release date:** January 31, 2026

![Update_illust](/res/image/qb_release.jpg)

This update includes significant improvements across the internal architecture and the SDK qb as a whole. We focused on scalability, consistency, and a structural refactor for future expansion.

### Highlights

#### SDK qb Naming Unification

Previously, different components used different names, which could be confusing for users new to the SDK qb. To address this, we unified the names of key SDK qb components as follows:

- Runtime library maccel → qb Runtime
- Compiler qubee → qb Compiler

This naming unification makes the roles and relationships between SDK qb components more intuitive and enables a more consistent user experience in documentation and future feature expansions.

#### Model Count Limit Removed

Previously, the number of models that could run concurrently was limited by the number of NPU cores. This update removes that restriction by improving the underlying design.

- Models compiled with the latest qb Compiler can be loaded and executed concurrently within available DRAM, regardless of the core mode specified at compile time.

Benefits include:

- More flexibility in services that run multiple models simultaneously
- Ability to run models built for different core modes at the same time
- Removal of core constraints that affected large models such as LLMs

This change is based on internal runtime optimizations. For users, any model compiled as MXQv7 can take advantage of it without code changes.

#### Multithreading Performance Improvements

With this update, the C++ library provides `.setActivationSlots(int num)` and the Python API provides `.set_activation_slots(num)`, which give you finer control over pipelining between NPU inference and data transfer.

These functions allow you to control the number of input slots for a model. Using more slots increases NPU memory usage, but enables more effective pipelining and improves performance in multithreaded workloads.

```{note}
For models that use cache (e.g., LLMs), the activation slot count is currently limited to 1.
```
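
The trade-off between slot count and pipelining can be pictured with a toy producer/consumer model. The sketch below is a conceptual illustration only, written with the Python standard library; it does not call qb Runtime and does not describe its internals. The bounded queue stands in for the model's input slots, and the `transfer` and `infer` functions are hypothetical placeholders for data transfer and NPU inference.

```python
import queue
import threading


def run_pipeline(num_items, num_slots):
    """Toy producer/consumer model; `num_slots` bounds how many inputs wait at once."""
    slots = queue.Queue(maxsize=num_slots)  # stands in for the model's input slots
    results = []

    def transfer():
        # Stands in for host-to-device transfer: blocks when every slot is full.
        for item in range(num_items):
            slots.put(item)
        slots.put(None)  # sentinel: no more inputs

    def infer():
        # Stands in for NPU inference: drains slots as they fill.
        while (item := slots.get()) is not None:
            results.append(item * 2)

    threads = [threading.Thread(target=transfer), threading.Thread(target=infer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline(num_items=5, num_slots=2))  # [0, 2, 4, 6, 8]
```

With `num_slots=1` only one input can be staged ahead of the consumer; a larger value lets transfers run further ahead, at the cost of holding more inputs in memory at once.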

#### uint8 Inference Support

This update officially supports uint8 integer inference.

- uint8 quantized models can be compiled with qb Compiler
- qb Runtime supports inference execution for these models

This reduces CPU overhead during preprocessing for models that take uint8 inputs.
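
As a rough illustration of where the savings come from, the NumPy sketch below compares a float input path with a uint8 input path. It assumes a typical image pipeline in which the CPU converts each frame to float32 and scales it by 255 before inference; the frame size and the scaling step are placeholders, and the actual preprocessing depends on the model.

```python
import numpy as np

# Placeholder 224x224 RGB frame, as an image decoder would typically produce it.
frame = np.zeros((224, 224, 3), dtype=np.uint8)

# Float path: the CPU converts and scales every pixel before inference.
float_input = frame.astype(np.float32) / 255.0

# uint8 path: the frame is passed through unchanged, with no per-pixel CPU work.
uint8_input = frame

print(float_input.nbytes // uint8_input.nbytes)  # 4
```

Besides skipping the conversion, the uint8 buffer is a quarter of the size of its float32 counterpart, so there is less data to move as well.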

### Migration Guide

Due to the naming unification, packages, headers, and module names have changed. Legacy packages are no longer maintained.

#### Installation

##### I. Update APT Package Index

Before installing any packages, update the APT package index:

```bash
sudo apt update
```

##### II. Install Runtime Library

The runtime library package name has changed from `mobilint-npu-runtime` to `mobilint-qb-runtime`.

```bash
# C++ library
sudo apt install mobilint-qb-runtime

# Python package
pip install mobilint-qb-runtime
```

##### III. Install Driver

The driver package name has also changed under the new naming policy, from `aries-driver` to `mobilint-aries-driver`.

```bash
sudo apt install mobilint-aries-driver
```

#### C++ Library Changes

- Compilation/linking flags updated

    ```bash
    # Previous build
    g++ -o example example.cpp -lmaccel

    # Updated build
    g++ -o example example.cpp -lqbruntime
    ```

- Header path updated

    ```cpp
    // Previous header
    #include "maccel/maccel.h"

    // Updated header
    #include "qbruntime/qbruntime.h"
    ```

#### Python Package Changes

- Module name updated

    ```python
    # Previous module
    import maccel

    # Updated module
    import qbruntime
    ```
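
For scripts that must keep running on machines that have not been migrated yet, one common pattern is to try the new module name first and fall back to the old one. The helper below is a hypothetical sketch, not something shipped with qb Runtime, and it only helps if the script limits itself to calls available under both names, which these notes do not guarantee. Since legacy packages are no longer maintained, treat it as a short-term bridge.

```python
import importlib


def import_first(names):
    """Import and return the first module in `names` that is installed."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of these modules are installed: {', '.join(names)}")


# Prefer the new module name and fall back to the legacy one:
# runtime = import_first(["qbruntime", "maccel"])
```

The usage line is left commented out so the snippet can be read and run without either package installed.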
