EdgeFirstAI/ara2-rs

ARA-2 Client Library

Rust client library for the Kinara ARA-2 neural network accelerator. Provides session management, model loading, and inference on NXP i.MX platforms equipped with ARA-2 PCIe hardware.

Supported Platforms

| Platform | SoC | Status |
|---|---|---|
| NXP FRDM i.MX 8M Plus | i.MX 8M Plus | Tested |
| NXP FRDM i.MX 95 | i.MX 95 | Tested |

Requires EdgeFirst Yocto Images with ARA-2 SDK support.

Workspace

| Crate | Description |
|---|---|
| ara2 | Core client library: session, endpoint, model, and DVM metadata APIs |
| ara2-sys | FFI bindings to libaraclient.so via libloading |

Integration with edgefirst-hal

The ara2 crate depends on edgefirst-hal for:

  • Tensor memory management — DMA-backed tensors for zero-copy NPU transfers
  • Image preprocessing — Hardware-accelerated format conversion and scaling
  • Post-processing — YOLO decoding, overlay rendering, segmentation masks

Python Bindings

Python bindings are available as a separate package via PyPI:

pip install edgefirst-ara2

See crates/ara2-py/README.md for the Python API reference.

Quick Start

use ara2::{Session, DEFAULT_SOCKET};
use edgefirst_hal::tensor::{TensorMemory, TensorTrait as _};

fn main() -> Result<(), ara2::Error> {
    // Connect to the ARA-2 proxy service
    let session = Session::create_via_unix_socket(DEFAULT_SOCKET)?;

    // Enumerate NPU endpoints and check status
    let endpoints = session.list_endpoints()?;
    // Use the first endpoint (indexing panics if the list is empty)
    let endpoint = &endpoints[0];
    println!("Endpoint state: {:?}", endpoint.check_status()?);

    // Load a compiled model (.dvm) and allocate DMA tensors
    let mut model = endpoint.load_model_from_file("model.dvm".as_ref())?;
    model.allocate_tensors(Some(TensorMemory::Dma))?;

    // Run inference
    let timing = model.run()?;
    println!("NPU inference: {:?}", timing.run_time);
    Ok(())
}

Async Inference

The submit() / wait() API enables overlapping CPU work with NPU execution — the building block for pipeline parallelism:

use ara2::{Session, DEFAULT_SOCKET, DEFAULT_TIMEOUT_MS};

fn main() -> Result<(), ara2::Error> {
    let session = Session::create_via_unix_socket(DEFAULT_SOCKET)?;
    let endpoints = session.list_endpoints()?;
    let mut model = endpoints[0].load_model_from_file("model.dvm".as_ref())?;
    model.allocate_tensors(None)?;

    // Submit: returns immediately while the NPU works
    let request = model.submit()?;

    // CPU is free to do other work (preprocess next frame, etc.)

    // Block until the NPU finishes
    let timing = request.wait(DEFAULT_TIMEOUT_MS)?;
    println!("NPU inference: {:?}", timing.run_time);

    // Monitor pipeline depth: nothing is in flight once wait() has returned
    assert_eq!(session.inflight_count()?, 0);
    Ok(())
}

The Python API follows the same submit/wait pattern:

import edgefirst_ara2 as ara2

session = ara2.Session.create_via_unix_socket(ara2.DEFAULT_SOCKET)
endpoint = session.list_endpoints()[0]
model = endpoint.load_model("model.dvm")
model.allocate_tensors()

# Submit — returns immediately
request = model.submit()

# CPU work here... the GIL is NOT held during wait()
timing = request.wait()
print(f"NPU inference: {timing.run_time_us} µs")

See the async_infer example for a complete benchmark comparing synchronous and asynchronous inference. The async_pipeline example shows pipelined inference with a circular buffer of DMA-BUF tensor sets, which gives a 2x or greater throughput improvement.
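The overlap that pipelining relies on can be illustrated without hardware. The sketch below is only a simulation: FakeModel, preprocess, run_sequential, and run_pipelined are hypothetical names, the sleeps stand in for CPU and NPU work, and none of it is the edgefirst-ara2 implementation. It only imitates the submit-then-wait shape shown above, with a small ring of in-flight requests.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for a loaded model; NOT the edgefirst-ara2 API."""

    def __init__(self, npu_seconds=0.01):
        self._npu_seconds = npu_seconds
        # A single worker thread plays the role of the one NPU.
        self._npu = ThreadPoolExecutor(max_workers=1)

    def submit(self, frame_id):
        # Returns a future immediately; the simulated NPU runs in the background.
        return self._npu.submit(self._run, frame_id)

    def _run(self, frame_id):
        time.sleep(self._npu_seconds)  # simulated NPU execution
        return frame_id

    def close(self):
        self._npu.shutdown()


def preprocess(frame_id, cpu_seconds=0.01):
    time.sleep(cpu_seconds)  # simulated CPU/GPU preprocessing of one frame
    return frame_id


def run_sequential(model, frames):
    # Preprocess, submit, and wait for each frame before touching the next.
    return [model.submit(preprocess(f)).result() for f in frames]


def run_pipelined(model, frames, depth=2):
    # Keep up to `depth` requests in flight, like a ring of tensor sets.
    inflight = deque()
    results = []
    for f in frames:
        if len(inflight) == depth:
            # The ring is full: wait for the oldest request to free a slot.
            results.append(inflight.popleft().result())
        inflight.append(model.submit(preprocess(f)))
    while inflight:
        results.append(inflight.popleft().result())
    return results
```

With equal simulated CPU and NPU times, the pipelined loop hides most of the preprocessing behind NPU execution, which is where a speedup approaching 2x comes from in this toy setup. Real gains depend on the actual stage timings.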

Runtime Requirements

The following must be present on the target system:

  • libaraclient.so.1 — Kinara client library (from the ARA-2 SDK)
  • ara2-proxy / dvproxy — System service providing NPU access, must be running (systemd unit name is platform-dependent: ara2.service on EdgeFirst Yocto images, dvproxy.service on other platforms)
  • ARA-2 hardware — PCIe accelerator card visible via lspci
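The three requirements above can be checked before running anything else. The following is a minimal preflight sketch, not part of the ara2 project: the function names are hypothetical, it assumes systemctl and lspci are on the PATH of the target, and it leaves identifying the ARA-2 card in the lspci output to the reader because the device string is not stated here.

```python
import ctypes
import shutil
import subprocess


def library_loadable(name="libaraclient.so.1"):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


def service_active(units=("ara2.service", "dvproxy.service")):
    """Return True if any of the candidate systemd units is active."""
    if shutil.which("systemctl") is None:
        return False
    for unit in units:
        # `systemctl is-active --quiet` exits with status 0 when the unit is active.
        result = subprocess.run(
            ["systemctl", "is-active", "--quiet", unit], check=False
        )
        if result.returncode == 0:
            return True
    return False


def pcie_devices():
    """Return the lines printed by lspci, or an empty list if it is unavailable."""
    if shutil.which("lspci") is None:
        return []
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=False
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print("libaraclient.so.1 loadable:", library_loadable())
    print("proxy service active:", service_active())
    for line in pcie_devices():
        print("pci:", line)
```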

Building

Native

cargo build --release

Cross-compile for aarch64 (NXP i.MX)

cargo zigbuild --release --target aarch64-unknown-linux-gnu

Performance

Benchmarked on NXP FRDM i.MX 95 + ARA-2 with YOLOv8m-seg (640×640). The Python API adds minimal overhead over native Rust thanks to DMA-BUF zero-copy tensor sharing: the GPU and NPU operate on the same physical buffers with no CPU copies in the data path.

| Stage | Rust | Python | Overhead |
|---|---|---|---|
| GPU preprocess (letterbox + RGBA→CHW) | 2.85 ms | 2.88 ms | +0.03 ms |
| NPU inference (wall clock) | 34.53 ms | 34.63 ms | +0.10 ms |
| ↳ NPU execution | 26.04 ms | 26.04 ms | |
| ↳ DMA input upload | 2.02 ms | 2.05 ms | |
| ↳ DMA output download | 3.68 ms | 3.68 ms | |
| Decode (NMS + dequant) | 4.05 ms | 4.31 ms | +0.26 ms |
| Materialize (CPU coeff × proto → bitmaps) | 5.67 ms | 5.98 ms | +0.31 ms |
| Draw (GL mask overlay) | 5.54 ms | 5.71 ms | +0.17 ms |
| Total pipeline | 52.64 ms | 53.52 ms | +0.88 ms |
| Throughput | 19.0 FPS | 18.7 FPS | |

Steady-state mean over 30 iterations after warmup. Python overhead is under 1 ms across the entire pipeline (+0.88 ms in total). GPU preprocessing and NPU inference times are nearly identical (within about 0.1 ms) since both APIs use the same DMA-BUF tensors.
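As a quick consistency check, the per-stage means can be summed and converted to throughput. The numbers below are copied from the table; the names STAGES, totals, and fps are only illustrative.

```python
# Per-stage steady-state means in milliseconds, copied from the table as
# (rust_ms, python_ms). The three NPU sub-rows are left out because they are
# already part of the wall-clock NPU inference figure.
STAGES = {
    "gpu_preprocess": (2.85, 2.88),
    "npu_inference": (34.53, 34.63),
    "decode": (4.05, 4.31),
    "materialize": (5.67, 5.98),
    "draw": (5.54, 5.71),
}


def totals(stages):
    """Sum the Rust and Python columns separately."""
    rust = sum(r for r, _ in stages.values())
    python = sum(p for _, p in stages.values())
    return rust, python


def fps(total_ms):
    """Frames per second for a pipeline that takes total_ms per frame."""
    return 1000.0 / total_ms


rust_ms, python_ms = totals(STAGES)
print(f"Rust:     {rust_ms:.2f} ms -> {fps(rust_ms):.1f} FPS")
print(f"Python:   {python_ms:.2f} ms -> {fps(python_ms):.1f} FPS")
print(f"Overhead: {python_ms - rust_ms:.2f} ms")
```

The Rust column sums to the reported 52.64 ms. The rounded Python rows sum to 53.51 ms, 0.01 ms below the reported 53.52 ms, which is consistent with the table rounding each stage independently.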

Examples

| Example | Language | Description |
|---|---|---|
| yolov8.rs | Rust | YOLOv8 detection + segmentation with letterbox preprocessing and 3-step mask pipeline |
| yolov8.py | Python | Same 3-step pipeline via edgefirst-hal and edgefirst-ara2 Python packages |
| async_infer.rs | Rust | Async inference benchmark: sync vs. submit/wait vs. overlap |
| async_infer.py | Python | Same async benchmark via edgefirst-ara2 |
| async_pipeline.rs | Rust | Pipelined inference with circular DMA-BUF buffer ring (2x+ speedup) |
| async_pipeline.py | Python | Same pipeline demo via edgefirst-ara2 |
| endpoints.py | Python | Connect, list endpoints, check status |
| test_dvm_metadata.rs | Rust | Read and display DVM model metadata |
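The letterbox preprocessing named in the yolov8 examples fits an image into the model input while preserving its aspect ratio and padding the remainder. The sketch below shows the usual geometry for that step. It is a generic illustration under the assumption of centered padding, not the edgefirst-hal implementation, and the function names are hypothetical.

```python
def letterbox_geometry(src_w, src_h, dst_w=640, dst_h=640):
    """Return (scale, new_w, new_h, pad_x, pad_y) for an aspect-preserving fit."""
    # The smaller ratio guarantees the scaled image fits in both dimensions.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = min(dst_w, round(src_w * scale))
    new_h = min(dst_h, round(src_h * scale))
    # Center the scaled image; the leftover border is padding.
    pad_x = (dst_w - new_w) // 2
    pad_y = (dst_h - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def to_source_coords(x, y, scale, pad_x, pad_y):
    """Map a point from letterboxed model space back to the source image."""
    return (x - pad_x) / scale, (y - pad_y) / scale


if __name__ == "__main__":
    scale, new_w, new_h, pad_x, pad_y = letterbox_geometry(1920, 1080)
    print(new_w, new_h, pad_x, pad_y)
    print(to_source_coords(320, 320, scale, pad_x, pad_y))
```

The inverse mapping matters after decoding: boxes and masks come out in model input coordinates and need the padding removed and the scale undone before they are drawn on the original image.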

Running the Rust example

Cross-compile from your development machine and deploy to the target:

# Build
cargo zigbuild --release --example yolov8 --target aarch64-unknown-linux-gnu

# Deploy and run
scp target/aarch64-unknown-linux-gnu/release/examples/yolov8 <target>:/root/yolov8-ara2
ssh <target> "/root/yolov8-ara2 model.dvm image.jpg --benchmark 30 --save"

Running the Python example

Create a virtual environment on the target and install the packages from PyPI:

# On target
python3 -m venv ~/venv
~/venv/bin/pip install edgefirst-ara2 edgefirst-hal

Copy the script and run:

# From dev machine
scp examples/yolov8.py <target>:/root/

# On target
~/venv/bin/python3 /root/yolov8.py model.dvm image.jpg --benchmark 30 --save

Testing

Most tests require an NXP i.MX + ARA-2 system with the proxy running; the DVM metadata tests run without hardware:

# All tests (on-target with hardware)
cargo test -p ara2

# Metadata tests only (no hardware needed)
cargo test -p ara2 dvm_metadata

# Model tests (needs a .dvm file)
ARA2_TEST_MODEL=/path/to/model.dvm cargo test -p ara2 model

Documentation

License

Licensed under the Apache License 2.0. See LICENSE for details.

Copyright 2025 Au-Zone Technologies. All Rights Reserved.
