On-device NPU inference engine — AMD Ryzen AI (XDNA2), Linux
From-scratch engine running whole model graphs natively on the laptop NPU via the open MLIR-AIE / IRON stack — where the prior open-source high-water mark was a single matmul at ~0.6% utilization and AMD’s own Linux stack offloads zero ops. 16 models across 11 architectures (BERT · Whisper/Parakeet/GigaAM · ViT/DINOv2/ResNet-18/CLIP · ESM-2), parity ~4e-3. Core: fuses a 12-layer decode into one NPU dispatch (72→1 per token), weights + KV resident. −29% energy / ~half the package power vs CPU at equal accuracy (WER 0.117). Making it compile at scale meant fixing AMD’s compiler & operators upstream.