Rust client library for the Kinara ARA-2 neural network accelerator. Provides session management, model loading, and inference on NXP i.MX platforms equipped with ARA-2 PCIe hardware.
| Platform | SoC | Status |
|---|---|---|
| NXP FRDM i.MX 8M Plus | i.MX 8M Plus | Tested |
| NXP FRDM i.MX 95 | i.MX 95 | Tested |
Requires EdgeFirst Yocto Images with ARA-2 SDK support.
| Crate | Description |
|---|---|
| ara2 | Core client library: session, endpoint, model, and DVM metadata APIs |
| ara2-sys | FFI bindings to libaraclient.so via libloading |
The ara2 crate depends on edgefirst-hal for:
- Tensor memory management — DMA-backed tensors for zero-copy NPU transfers
- Image preprocessing — Hardware-accelerated format conversion and scaling
- Post-processing — YOLO decoding, overlay rendering, segmentation masks
Python bindings are available as a separate package via PyPI:
pip install edgefirst-ara2

See crates/ara2-py/README.md for the Python API reference.
use ara2::{Session, DEFAULT_SOCKET};
use edgefirst_hal::tensor::{TensorMemory, TensorTrait as _};
// Connect to the ARA-2 proxy service
let session = Session::create_via_unix_socket(DEFAULT_SOCKET)?;
// Enumerate NPU endpoints and check status
let endpoints = session.list_endpoints()?;
let endpoint = &endpoints[0];
println!("Endpoint state: {:?}", endpoint.check_status()?);
// Load a compiled model (.dvm) and allocate DMA tensors
let mut model = endpoint.load_model_from_file("model.dvm".as_ref())?;
model.allocate_tensors(Some(TensorMemory::Dma))?;
// Run inference
let timing = model.run()?;
println!("NPU inference: {:?}", timing.run_time);
# Ok::<(), ara2::Error>(())

The submit() / wait() API enables overlapping CPU work with NPU execution, which is the building block for pipeline parallelism:
use ara2::{Session, DEFAULT_SOCKET, DEFAULT_TIMEOUT_MS};
let session = Session::create_via_unix_socket(DEFAULT_SOCKET)?;
let endpoints = session.list_endpoints()?;
let mut model = endpoints[0].load_model_from_file("model.dvm".as_ref())?;
model.allocate_tensors(None)?;
// Submit — returns immediately while the NPU works
let request = model.submit()?;
// CPU is free to do other work (preprocess next frame, etc.)
// Block until the NPU finishes
let timing = request.wait(DEFAULT_TIMEOUT_MS)?;
println!("NPU inference: {:?}", timing.run_time);
// Monitor pipeline depth
assert_eq!(session.inflight_count()?, 0);
# Ok::<(), ara2::Error>(())

The Python API follows the same pattern:
import edgefirst_ara2 as ara2
session = ara2.Session.create_via_unix_socket(ara2.DEFAULT_SOCKET)
endpoint = session.list_endpoints()[0]
model = endpoint.load_model("model.dvm")
model.allocate_tensors()
# Submit — returns immediately
request = model.submit()
# CPU work here... the GIL is NOT held during wait()
timing = request.wait()
print(f"NPU inference: {timing.run_time_us} µs")

See the async_infer example for a complete benchmark comparing synchronous vs. asynchronous inference, and async_pipeline for pipelined inference with a circular buffer of DMA-BUF tensor sets (2x+ throughput improvement).
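The buffer-recycling idea behind a pipelined loop can be sketched with the standard library alone. This is a minimal illustration, not the async_pipeline example and not the ara2 API: `mock_infer`, `preprocess`, and `run_pipeline` are hypothetical names, a worker thread stands in for the NPU, and plain `Vec<u8>` buffers stand in for DMA-BUF tensor sets.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the NPU: "infers" by summing the frame bytes.
fn mock_infer(frame: &[u8]) -> u32 {
    frame.iter().map(|&b| u32::from(b)).sum()
}

/// Hypothetical CPU preprocessing: refill a reusable buffer for frame `index`.
fn preprocess(index: u8, buf: &mut Vec<u8>) {
    buf.clear();
    buf.resize(4, index);
}

/// Push `frames` frames through a two-stage pipeline that recycles `depth` buffers.
fn run_pipeline(frames: u8, depth: usize) -> Vec<u32> {
    assert!(depth > 0, "need at least one buffer");

    // Filled buffers travel CPU -> worker; drained buffers travel back for reuse.
    let (filled_tx, filled_rx) = mpsc::sync_channel::<Vec<u8>>(depth);
    let (free_tx, free_rx) = mpsc::channel::<Vec<u8>>();
    let (result_tx, result_rx) = mpsc::channel::<u32>();

    // Seed the free list: only `depth` buffers ever exist.
    for _ in 0..depth {
        free_tx.send(Vec::with_capacity(4)).unwrap();
    }

    // Worker thread plays the accelerator and hands each buffer back when done.
    let worker = thread::spawn(move || {
        for buf in filled_rx {
            result_tx.send(mock_infer(&buf)).unwrap();
            let _ = free_tx.send(buf);
        }
    });

    for index in 0..frames {
        // Blocks only when every buffer is still in flight.
        let mut buf = free_rx.recv().unwrap();
        preprocess(index, &mut buf);
        filled_tx.send(buf).unwrap();
    }

    drop(filled_tx); // close the queue so the worker loop ends
    worker.join().unwrap();
    result_rx.iter().collect()
}
```

With a depth of 2, the CPU can prepare the next frame while the worker is still busy with the current one. Results arrive in submission order because there is a single worker draining a FIFO queue.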
The following must be present on the target system:
- libaraclient.so.1: Kinara client library (from the ARA-2 SDK)
- ara2-proxy / dvproxy: System service providing NPU access, must be running (systemd unit name is platform-dependent: ara2.service on EdgeFirst Yocto images, dvproxy.service on other platforms)
- ARA-2 hardware: PCIe accelerator card visible via lspci
# Native build
cargo build --release

# Cross-compile for aarch64 Linux
cargo zigbuild --release --target aarch64-unknown-linux-gnu

Benchmarked on NXP FRDM i.MX 95 + ARA-2 with YOLOv8m-seg (640×640). The Python API adds minimal overhead over native Rust thanks to DMA-BUF zero-copy tensor sharing: the GPU and NPU operate on the same physical buffers with no CPU copies in the data path.
| Stage | Rust | Python | Overhead |
|---|---|---|---|
| GPU preprocess (letterbox + RGBA→CHW) | 2.85 ms | 2.88 ms | +0.03 ms |
| NPU inference (wall clock) | 34.53 ms | 34.63 ms | +0.10 ms |
| NPU execution | 26.04 ms | 26.04 ms | — |
| DMA input upload | 2.02 ms | 2.05 ms | — |
| DMA output download | 3.68 ms | 3.68 ms | — |
| Decode (NMS + dequant) | 4.05 ms | 4.31 ms | +0.26 ms |
| Materialize (CPU coeff × proto → bitmaps) | 5.67 ms | 5.98 ms | +0.31 ms |
| Draw (GL mask overlay) | 5.54 ms | 5.71 ms | +0.17 ms |
| Total pipeline | 52.64 ms | 53.52 ms | +0.88 ms |
| Throughput | 19.0 FPS | 18.7 FPS | |
Steady-state mean over 30 iterations after warmup. Python overhead is under 1 ms across the entire pipeline. GPU preprocessing and NPU inference times are nearly identical (within about 0.1 ms) since both use the same DMA-BUF tensors.
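The total and throughput rows can be reproduced from the per-stage figures. The sketch below uses the Rust column and assumes frames are processed strictly one after another. It also assumes that the NPU execution and DMA rows are sub-steps of the wall-clock NPU figure, which is consistent with the total but not stated explicitly. All names are illustrative and not part of the ara2 crate.

```rust
/// Top-level stage latencies (ms) from the Rust column: preprocess,
/// NPU inference (wall clock), decode, materialize, draw.
const RUST_STAGES_MS: [f64; 5] = [2.85, 34.53, 4.05, 5.67, 5.54];

/// Assumed sub-steps of the wall-clock NPU figure (ms):
/// NPU execution, DMA input upload, DMA output download.
const NPU_SUBSTEPS_MS: [f64; 3] = [26.04, 2.02, 3.68];

/// Sum per-stage latencies for one frame.
fn total_latency_ms(stages: &[f64]) -> f64 {
    stages.iter().sum()
}

/// Frames per second when frames run back-to-back with no overlap.
fn sequential_fps(total_ms: f64) -> f64 {
    1000.0 / total_ms
}

/// Wall-clock NPU time not covered by the listed sub-steps.
fn npu_unaccounted_ms(wall_clock_ms: f64, substeps: &[f64]) -> f64 {
    wall_clock_ms - substeps.iter().sum::<f64>()
}
```

Summing `RUST_STAGES_MS` gives the 52.64 ms total, and `sequential_fps` on that total rounds to the 19.0 FPS shown in the table. Under the sub-step assumption, about 2.79 ms of the wall-clock NPU time falls outside the three listed sub-steps.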
| Example | Description |
|---|---|
| yolov8.rs | Rust: YOLOv8 detection + segmentation with letterbox preprocessing and 3-step mask pipeline |
| yolov8.py | Python: Same 3-step pipeline via edgefirst-hal and edgefirst-ara2 Python packages |
| async_infer.rs | Rust: Async inference benchmark: sync vs. submit/wait vs. overlap |
| async_infer.py | Python: Same async benchmark via edgefirst-ara2 |
| async_pipeline.rs | Rust: Pipelined inference with circular DMA-BUF buffer ring (2x+ speedup) |
| async_pipeline.py | Python: Same pipeline demo via edgefirst-ara2 |
| endpoints.py | Python: Connect, list endpoints, check status |
| test_dvm_metadata.rs | Rust: Read and display DVM model metadata |
Cross-compile from your development machine and deploy to the target:
# Build
cargo zigbuild --release --example yolov8 --target aarch64-unknown-linux-gnu
# Deploy and run
scp target/aarch64-unknown-linux-gnu/release/examples/yolov8 <target>:/root/yolov8-ara2
ssh <target> "/root/yolov8-ara2 model.dvm image.jpg --benchmark 30 --save"

Create a virtual environment on the target and install the packages from PyPI:
# On target
python3 -m venv ~/venv
~/venv/bin/pip install edgefirst-ara2 edgefirst-hal

Copy the script and run:
# From dev machine
scp examples/yolov8.py <target>:/root/
# On target
~/venv/bin/python3 /root/yolov8.py model.dvm image.jpg --benchmark 30 --save

Most tests require an NXP i.MX + ARA-2 system with the proxy running; the metadata tests run without hardware:
# All tests (on-target with hardware)
cargo test -p ara2
# Metadata tests only (no hardware needed)
cargo test -p ara2 dvm_metadata
# Model tests (needs a .dvm file)
ARA2_TEST_MODEL=/path/to/model.dvm cargo test -p ara2 model

- ARCHITECTURE.md — System architecture and ownership model
- TESTING.md — Test guide, on-target setup, and debugging
- CONTRIBUTING.md — Contribution guidelines
- SECURITY.md — Security policy
- CHANGELOG.md — Release history
Licensed under the Apache License 2.0. See LICENSE for details.
Copyright 2025 Au-Zone Technologies. All Rights Reserved.