Pareto Optimal Benchmarking of AI Models on ARM Cortex Processors for Sustainable Embedded Systems
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

This work presents a practical benchmarking framework for optimizing artificial intelligence (AI) models on ARM Cortex processors (M0+, M4, M7), focusing on energy efficiency, accuracy, and resource utilization in embedded systems. Through the design of an automated test bench, we provide a systematic approach to evaluation across key performance indicators (KPIs) and identify optimal combinations of processor and AI model. The research highlights a near-linear correlation between floating-point operations (FLOPs) and inference time, offering a reliable metric for estimating computational demands. Using Pareto analysis, we demonstrate how to balance trade-offs between energy consumption and model accuracy, ensuring that AI applications meet performance requirements without compromising sustainability. Key findings indicate that the M7 processor is ideal for short inference cycles, while the M4 processor offers better energy efficiency for longer inference tasks. The M0+ processor, while less efficient for complex AI models, remains suitable for simpler tasks. This work provides insights for developers, guiding them to design energy-efficient AI systems that deliver high performance in real-world applications.


💡 Research Summary

The paper presents a practical benchmarking framework aimed at optimizing artificial‑intelligence (AI) models for three ARM Cortex‑M microcontroller families: Cortex‑M0+, Cortex‑M4, and Cortex‑M7. The authors argue that most existing edge‑AI benchmarks rely on single‑board computers (e.g., Raspberry Pi) where operating systems and middleware obscure the true hardware characteristics. To address this gap, they built a fully automated, bare‑metal test bench that measures inference latency, power consumption, and model accuracy directly on the silicon.

The methodology consists of four stages. First, a set of baseline neural‑network architectures (LeNet‑5, ResNet, an auto‑encoder, MobileNet‑V1) is subjected to structured pruning and static 8‑bit quantisation using an automated multi‑objective optimizer. This generates a diverse population of ONNX models with varying FLOP counts and parameter counts. Second, each ONNX model is converted into a self‑contained C library, linked with a C++ harness, and compiled for the target core (M0+, M4, or M7). Third, the compiled binaries are flashed onto a custom carrier board equipped with a Segger J‑Link debugger and a Power Profiler Kit (PPK). Real‑time current and voltage are recorded while a GPIO pin toggles to mark the start and end of each inference, enabling precise separation of active and idle energy. Fourth, the collected metrics—accuracy, inference time, and energy per inference cycle—are fed into a Pareto‑front analysis to identify the most efficient hardware‑model combinations for each use case.
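The GPIO-marker technique in stage three can be sketched as follows: with the pin toggled high during inference, the current trace splits cleanly into active and idle segments. This is a minimal illustration, not the authors' tooling; the function name, sample rate, and array layout are assumptions for the sketch.

```python
import numpy as np

def energy_per_inference(current_ma, gpio_high, supply_v=3.3, sample_rate_hz=1000):
    """Split a power-profiler current trace into active and idle energy (mJ).

    current_ma     : sampled current in milliamps
    gpio_high      : boolean samples, True while the marker pin is high
    supply_v       : fixed supply voltage (the paper uses 3.3 V)
    sample_rate_hz : profiler sampling rate (assumed value for this sketch)
    """
    current_ma = np.asarray(current_ma, dtype=float)
    gpio_high = np.asarray(gpio_high, dtype=bool)
    dt = 1.0 / sample_rate_hz                 # seconds per sample
    power_mw = current_ma * supply_v          # P = I * V (mA * V = mW)
    active_mj = power_mw[gpio_high].sum() * dt   # energy while inference runs
    idle_mj = power_mw[~gpio_high].sum() * dt    # energy between inferences
    return active_mj, idle_mj
```

Integrating power over the marked window is what lets the study report energy per inference cycle separately from the idle-current contribution that dominates long duty cycles.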

Four representative edge‑AI tasks are evaluated: optical digit recognition (LeNet‑5 on MNIST), image classification (custom ResNet on CIFAR‑10), anomaly detection (auto‑encoder on industrial sound data), and visual wake‑words (MobileNet‑V1 on MS‑COCO). Quantisation reduces model ROM footprints by roughly 75 %, making deployment on the constrained devices feasible; the unquantised MobileNet‑V1 would exceed the available memory.
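The roughly 75 % ROM reduction follows directly from the weight width: static 8-bit quantisation stores each weight in one byte instead of the four bytes of float32. A back-of-the-envelope sketch (the ~60k parameter count used here for LeNet-5 is an approximate public figure, not a number from the paper):

```python
def weight_rom_kib(param_count, bytes_per_weight):
    # ROM needed for the weights alone, ignoring code and activation buffers
    return param_count * bytes_per_weight / 1024

params = 60_000                      # approximate LeNet-5 parameter count
f32_kib = weight_rom_kib(params, 4)  # float32 baseline
i8_kib = weight_rom_kib(params, 1)   # after static 8-bit quantisation
reduction = 1 - i8_kib / f32_kib     # 1 - 1/4 = 0.75, the ~75 % reported above
```

The same arithmetic explains the MobileNet-V1 result: at four bytes per weight its footprint exceeds the available ROM, so only the quantised variant is deployable.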

Key experimental findings include:

  1. Linear FLOP‑Latency Relationship – Across all three cores, inference time scales linearly with the number of floating‑point operations (R² ≥ 0.93). This validates FLOPs as a reliable early‑stage predictor of latency, enabling designers to estimate performance before hardware deployment.

  2. Processor‑Specific Energy Profiles – The Cortex‑M4 exhibits an ultra‑low idle current of 0.30 mA, making it the most energy‑efficient for long inference cycles (seconds). The Cortex‑M7, with a higher clock speed and more advanced pipeline, achieves the lowest active energy for short cycles (hundreds of milliseconds) by completing inference quickly. The Cortex‑M0+ has the highest idle current (4.20 mA) and consequently the highest total energy per cycle in all scenarios.

  3. Pareto Trade‑offs Between Accuracy and Energy – For a target accuracy of ≥95 % on MNIST, the pruned‑quantised LeNet‑5 on the M4 consumes less than 0.8 mJ per inference, placing it on the Pareto front. In contrast, achieving ≥80 % accuracy on the Visual Wake‑Words task requires the larger MobileNet‑V1; only the M7 can meet the latency constraint (≤10 ms) while keeping energy reasonable, but the M4 becomes preferable when the inference interval is long because its idle power dominates.

  4. Memory as a Hard Constraint – ROM and RAM usage strongly correlate with both latency and power. Models that exceed the available ROM on a given core are simply infeasible, regardless of their FLOP count.
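A Pareto front over (energy, accuracy) pairs like those in finding 3 can be computed with a simple dominance check: a configuration is kept only if no other one uses no more energy, is at least as accurate, and is strictly better in one dimension. A minimal sketch with hypothetical values, not the paper's measurements:

```python
def pareto_front(points):
    """Return the non-dominated (energy_mj, accuracy) pairs from `points`."""
    front = []
    for e, a in points:
        dominated = any(
            e2 <= e and a2 >= a and (e2 < e or a2 > a)  # strictly better somewhere
            for e2, a2 in points
        )
        if not dominated:
            front.append((e, a))
    return front

# Hypothetical (energy per inference in mJ, accuracy) measurements:
configs = [(0.8, 0.95), (1.2, 0.94), (0.5, 0.90)]
front = pareto_front(configs)  # (1.2, 0.94) is dominated by (0.8, 0.95)
```

Infeasible models (finding 4) would simply be filtered out before this step, since a model that does not fit in ROM has no valid (energy, accuracy) point at all.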

The authors discuss limitations: the experiments were performed at a fixed 3.3 V supply, so battery‑driven voltage droop effects are not captured; only 8‑bit quantisation was explored, leaving lower‑bit schemes and dedicated accelerators (DSP, NPU) for future work; and the test setup is limited to a single carrier board, lacking environmental variations (temperature, supply noise).

Future research directions include extending the framework to dynamic voltage‑frequency scaling, incorporating sub‑8‑bit quantisation, evaluating heterogeneous accelerators, and integrating the methodology into real‑time operating systems to assess system‑level trade‑offs.

In conclusion, the study delivers a reproducible, bare‑metal benchmarking pipeline and demonstrates that FLOPs, memory footprint, and processor‑specific idle power together define the Pareto‑optimal design space for edge AI on ARM Cortex‑M microcontrollers. This provides developers with concrete guidance for co‑designing models and hardware to achieve sustainable, high‑performance embedded AI solutions.
