A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications
Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.
💡 Research Summary
This paper addresses the growing need for flexible, low‑power convolutional neural network (CNN) accelerators on embedded System‑on‑Chip (SoC) FPGAs. While many prior FPGA‑based CNN accelerators focus solely on maximizing raw throughput measured in giga‑operations per second (GOPS), real‑world embedded applications must also satisfy stringent latency, power, area, and cost constraints. To meet these multi‑dimensional requirements, the authors propose a hardware‑software (HW/SW) co‑design methodology that leverages high‑level synthesis (HLS) to create a parameterizable convolution accelerator template.
The accelerator is divided into two main blocks: a CONV‑PART that implements the convolution and optional ReLU, and an MPOOL‑PART that optionally performs max‑pooling. Both blocks are described in C/C++ and synthesized with HLS pragmas that enable data packing, dataflow, pipelining, and array partitioning. These pragmas expose two levels of parallelism: spatial parallelism across input (ICP) and output (OCP) channels, and temporal parallelism across pipeline stages. By adjusting the design parameters—such as the number of output channels (Co), input channels (Ci), the degree of channel parallelism (OCP/ICP), and on‑chip memory tiling—the same template can be tuned to fit a wide range of FPGA devices, from low‑end Zynq‑7000 chips to high‑performance UltraScale+ devices.
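The two levels of parallelism described above can be sketched as a small HLS-style kernel. This is a minimal illustration under assumed tile sizes, not the paper's actual source: the function name `conv_tile` and the fixed values of `ICP`, `OCP`, and `K` are chosen here for demonstration. In synthesis, `UNROLL` on the channel loops yields the spatial parallelism across `ICP`/`OCP`, while `PIPELINE` provides the temporal parallelism across stages (a standard compiler ignores the pragmas).

```cpp
#include <cstdint>

// Illustrative tile sizes; the paper's template exposes these as parameters.
constexpr int ICP = 4;   // input channels processed in parallel
constexpr int OCP = 2;   // output channels produced in parallel
constexpr int K   = 3;   // convolution kernel size

// Computes one output pixel for OCP output channels, accumulating over
// ICP input channels. UNROLL replicates the multiply-accumulate hardware
// (spatial parallelism); PIPELINE overlaps successive calls (temporal).
void conv_tile(const int8_t in[ICP][K][K],
               const int8_t w[OCP][ICP][K][K],
               int32_t out[OCP]) {
#pragma HLS PIPELINE II=1
  for (int oc = 0; oc < OCP; ++oc) {
#pragma HLS UNROLL
    int32_t acc = 0;
    for (int ic = 0; ic < ICP; ++ic) {
#pragma HLS UNROLL
      for (int kr = 0; kr < K; ++kr)
        for (int kc = 0; kc < K; ++kc)
          acc += in[ic][kr][kc] * w[oc][ic][kr][kc];
    }
    out[oc] = acc;
  }
}
```

Because the loop bounds are compile-time constants, changing `ICP`/`OCP` re-scales the generated hardware without touching the algorithm, which is the essence of the parameterizable template.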
A key contribution is the adoption of 8‑bit dynamic fixed‑point (DFP) quantization, which reduces on‑chip memory requirements and eliminates the need for floating‑point units while preserving CNN accuracy within 0.5 % of the full‑precision baseline. This quantization enables the accelerator to operate within a power envelope of less than 10 W, making it suitable for battery‑powered platforms.
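The idea behind dynamic fixed-point can be made concrete with a short sketch. In DFP, an 8-bit integer `m` with a per-tensor fractional length `fl` represents the real value `m * 2^-fl`; `fl` is chosen from each layer's dynamic range. The helper names below are assumptions for illustration, not the paper's API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize a float to 8-bit dynamic fixed-point with fractional length fl:
// round(x * 2^fl), saturated to the int8 range [-128, 127].
int8_t dfp_quantize(float x, int fl) {
  long q = std::lround(x * std::ldexp(1.0f, fl));  // round(x * 2^fl)
  q = std::max(-128L, std::min(127L, q));          // saturate to int8
  return static_cast<int8_t>(q);
}

// Recover the real value represented by mantissa m: m * 2^-fl.
float dfp_dequantize(int8_t m, int fl) {
  return std::ldexp(static_cast<float>(m), -fl);
}
```

With `fl = 7`, for example, the representable range is roughly [-1, 0.992] in steps of 2^-7; a layer with larger activations would simply use a smaller `fl`, which is what makes the scheme "dynamic" per layer.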
The authors evaluate the template on both lightweight (e.g., MobileNet‑V1) and larger (e.g., ResNet‑50) CNN models. Compared with non‑parameterized designs, the parameterizable accelerator achieves 1.8×–2.3× higher GOPS/W efficiency, reduces latency by 30–45 %, and offers configurable resource utilization (DSP, LUT, BRAM) that can be scaled to meet specific area budgets. Importantly, the same bitstream can serve multiple CNNs; only the software‑level parameters need to be updated, avoiding costly FPGA re‑configuration and shortening deployment cycles.
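The "same bitstream, different software parameters" workflow can be pictured as the host writing a per-layer descriptor before each accelerator launch. The struct and field names below are hypothetical, chosen only to illustrate the kind of knobs the summary mentions (channel counts, optional ReLU/max-pool, DFP fractional lengths):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-layer descriptor the host CPU would send to the
// accelerator; field names are illustrative, not the paper's interface.
struct LayerConfig {
  uint16_t Ci, Co;        // input / output channel counts
  uint16_t H, W;          // input feature-map height / width
  uint8_t  K, stride;     // kernel size and stride
  bool     relu, maxpool; // enable optional ReLU / MPOOL-PART
  int8_t   fl_in, fl_out; // DFP fractional lengths for in/out tensors
};

// Switching CNNs means enqueueing a different list of descriptors;
// the FPGA bitstream itself stays loaded and unchanged.
std::vector<LayerConfig> example_network() {
  return {
      {3, 32, 224, 224, 3, 2, true, false, 7, 6},   // first conv layer
      {32, 64, 112, 112, 3, 1, true, true, 6, 5},   // conv + max-pool
  };
}
```

Since reconfiguring the FPGA fabric can take seconds while writing a few descriptor registers takes microseconds, this is where the shortened deployment cycle comes from.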
The paper situates its work within two major categories of FPGA CNN accelerators: single computation engines and streaming architectures. Single engines provide flexibility but often under‑utilize resources, while streaming designs achieve peak performance at the cost of large resource consumption and frequent re‑configuration. The proposed template combines the strengths of both approaches: it retains the flexibility of a single engine while allowing designers to scale parallelism in a stream‑like fashion through parameter tuning.
Limitations are acknowledged: the current implementation accelerates only convolution, ReLU, and max‑pool layers; fully‑connected layers and more exotic activation functions remain on the host CPU. Moreover, parameter selection is performed manually; future work could integrate automated design‑space exploration (e.g., Bayesian optimization) to find optimal configurations for a given power‑latency budget. Extending the template to support additional layer types and full on‑chip pipelines would further increase its applicability.
In summary, the paper demonstrates that HLS‑based, parameterizable accelerator templates provide a practical path to designing CNN accelerators that meet the diverse constraints of embedded deep‑learning applications. By exposing key architectural knobs to the designer and automating much of the RTL generation, the methodology reduces development effort, improves power‑efficiency, and enables rapid adaptation to new models—all essential qualities for the next generation of intelligent edge devices.