
What if transformer-level face recognition could run on a microcontroller without giving up speed or accuracy? We set out to make that real on the STM32N6 by pairing its neural processing unit with a hybrid model that blends convolutional efficiency and attention-like global context. Along the way, we rewired core assumptions about attention, reworked unsupported operators, and delivered a full on-device pipeline that actually feels instant.
We start with the hardware edge: an Arm Cortex-M55, 4 MB of contiguous RAM, and an NPU pushing up to 600 GOPS at remarkable power efficiency. That lets us chain models: RetinaFace-style detection with landmarks, alignment for a stable canonical view, MobileNetV2 anti-spoofing to block print and replay attacks, and a final recognizer that outputs a 512-dimensional embedding. The recognizer is built on EdgeFace, itself based on EdgeNeXt, chosen for its sweet spot between parameter count and accuracy. It behaves like a transformer where it matters, capturing long-range relationships, yet fits into the tight compute envelope of a microcontroller.
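To make that chaining concrete, here is a minimal Python sketch of such a pipeline. The callable arguments, the landmark handling, and the 0.5 liveness cutoff are illustrative assumptions, not ST's actual API:

```python
import numpy as np

def recognize_face(frame: np.ndarray, detect, align, antispoof, embed):
    """Hypothetical chained pipeline; each argument is a callable
    wrapping one quantized model running on the NPU."""
    face = detect(frame)            # RetinaFace-style box plus landmarks
    if face is None:
        return None                 # no face in this frame
    canonical = align(frame, face)  # warp to a stable canonical view
    if antispoof(canonical) < 0.5:  # liveness score; cutoff is an assumed value
        return None                 # reject print and replay attacks
    return embed(canonical)         # 512-dimensional identity embedding
```

Each stage only runs if the previous one passes, so the expensive recognizer is skipped entirely for empty frames and spoof attempts.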
The turning point is attention without the dot product. Because the ST toolchain doesn't support batch matmul, we replaced the standard attention block with a convolutional self-attention mechanism. Depthwise and pointwise convolutions encode relationships across pixels and channels, a sigmoid stands in for softmax, and element-wise products reconstruct attention's weighting behavior. This maps cleanly to the NPU, avoids attention's quadratic cost in the number of pixels, and preserves the ability to stabilize identities across pose, lighting, and occlusion.
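As a rough illustration of the substitution (a minimal PyTorch sketch under assumed layer sizes, not the exact EdgeFace block), depthwise and pointwise convolutions supply the spatial and channel mixing, a sigmoid replaces softmax, and an element-wise product applies the weights, so no batch matmul is ever needed:

```python
import torch
import torch.nn as nn

class ConvSelfAttention(nn.Module):
    """Attention-like block built only from NPU-friendly operators."""
    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        # Depthwise conv: relates nearby pixels within each channel.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        # Pointwise conv: relates channels at each pixel.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        # Sigmoid gating stands in for softmax normalization.
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention-like weights computed from local context.
        weights = self.gate(self.pointwise(self.depthwise(x)))
        # Element-wise product applies the weighting without any
        # batch matmul, keeping cost linear in the number of pixels.
        return x * weights

# Example: gate a 64-channel feature map.
feat = torch.randn(1, 64, 28, 28)
out = ConvSelfAttention(64)(feat)
print(out.shape)  # torch.Size([1, 64, 28, 28])
```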
Benchmarks show roughly 40 ms per frame end to end (about 25 FPS), substantial speedups over the STM32H7, and higher accuracy than MobileFaceNet across validation sets. That opens doors for privacy-first access control, frictionless enrollment on-device, and personalized experiences where latency matters and data should never leave the edge. If you're exploring embedded AI, this walkthrough shows how to align model design with silicon capabilities and deliver results that feel both fast and trustworthy.
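One common way to use such embeddings for on-device enrollment and matching (an assumption about the flow, not a detail from the episode) is cosine similarity against stored identity vectors; the 0.6 threshold below is purely illustrative:

```python
import numpy as np

def cosine_match(query: np.ndarray, gallery: dict[str, np.ndarray],
                 threshold: float = 0.6) -> str | None:
    """Compare a 512-d query embedding against enrolled identities.
    Returns the best-matching name, or None for an unknown face."""
    query = query / np.linalg.norm(query)
    best_name, best_score = None, threshold
    for name, ref in gallery.items():
        score = float(query @ (ref / np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```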
Enjoy the deep dive? Subscribe, share this episode with a fellow edge AI builder, and leave a quick review to help others find the show.
Send us Fan Mail
Support the show
Learn more about the EDGE AI FOUNDATION - edgeaifoundation.org