A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm^2 chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across detection ranges of 50 to 400 m and target speeds of 5 to 80 pixels per second. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.


💡 Research Summary

The paper presents a highly energy‑efficient anti‑UAV system that fuses frame‑based and event‑based visual processing to reliably detect and track small, fast‑moving drones. The authors first address two major shortcomings of prior work: (i) performance degradation when the target is both small and high‑speed, and (ii) excessive power consumption caused by continuously active neural networks. Their solution consists of a hierarchical hardware‑algorithm co‑design that dramatically reduces neural computation while preserving detection accuracy.

The event stream from a Dynamic Vision Sensor (DVS) is sampled over short temporal windows and compressed using Run‑Length Encoding (RLE). A frame builder reconstructs binary event frames, which are then denoised and fed to a Connected Component Labeling (CCL) engine to generate Region Proposals (RPs). Each RP is evaluated for size and velocity. If the RP is larger than nine pixels and moves faster than 20 pixels/s, the system switches to “event mode”; otherwise it remains in “frame mode”. In frame mode, grayscale patches surrounding the RP are extracted and classified by a CNN. In event mode, the RP is updated only after a configurable number of matching events (threshold TH) have been observed; TH adapts to the object’s area and speed, maximizing Intersection‑over‑Union (IoU) with ground truth. This dual‑mode approach enables tracking speeds up to 200 pixels/s while keeping the always‑on Event Signal Processor (ESP) power low.
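The pipeline above can be sketched in a few lines. This is a minimal illustration, not the chip's RTL: the RLE layout (alternating value/length pairs per row) and the function names are assumptions, while the size and speed thresholds (9 pixels, 20 pixels/s) are the ones quoted in the summary.

```python
def decode_rle_row(runs, width):
    """Decode one run-length-encoded row into a binary event-frame row.

    `runs` is a list of (value, length) pairs; this layout is an
    assumption about the encoding, chosen for illustration.
    """
    row = []
    for value, length in runs:
        row.extend([value] * length)
    assert len(row) == width, "decoded row must match frame width"
    return row

def select_mode(rp_area_px, rp_speed_px_s,
                area_thresh=9, speed_thresh=20.0):
    """Mode switch described in the paper: a region proposal larger
    than 9 pixels that moves faster than 20 pixels/s is tracked in
    event mode; otherwise the system stays in frame mode."""
    if rp_area_px > area_thresh and rp_speed_px_s > speed_thresh:
        return "event"
    return "frame"
```

In frame mode the selected RP would go on to grayscale-patch extraction and CNN classification; in event mode the RP position is updated from the event stream itself once the adaptive threshold TH is reached.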

High‑speed UAVs cause two specific problems: (a) misaligned RPs due to latency, and (b) motion‑blurred grayscale patches that degrade CNN classification. To mitigate these, the authors introduce a Fast Object Tracking Unit (FOTU). FOTU contains 32 RP monitors that continuously observe RP updates and dynamically recompute TH, keeping the RP tightly aligned with the true bounding box. Moreover, for objects moving faster than 25 pixels/s, the system abandons blurred image patches and instead records a sparse trajectory (stored only when displacement exceeds four pixels). These trajectories are fed to a second classifier that operates on motion patterns rather than visual texture, yielding higher accuracy for fast targets.
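The sparse trajectory logging can be sketched as below. Only the 4-pixel displacement gate comes from the summary; the input format (a list of (x, y) RP centers) and the function name are illustrative assumptions.

```python
import math

def record_trajectory(positions, min_disp=4.0):
    """Sparse trajectory recording: a new point is stored only when
    its displacement from the last *stored* point exceeds `min_disp`
    pixels (4 px in the paper), keeping the trajectory compact for
    the motion-pattern classifier."""
    if not positions:
        return []
    traj = [positions[0]]  # always keep the first observation
    for x, y in positions[1:]:
        last_x, last_y = traj[-1]
        if math.hypot(x - last_x, y - last_y) > min_disp:
            traj.append((x, y))
    return traj
```

Classifying on such trajectories rather than on blurred grayscale patches is what lets the system keep accuracy on targets above 25 pixels/s.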

The Neural Processing Unit (NPU) implements a custom 64‑bit instruction set that orchestrates end‑to‑end CNN inference. It features a 16 × 16 processing‑engine array with an output‑stationary dataflow and per‑PE zero‑skipping capability, which eliminates MAC operations on zero operands and reduces dynamic power. Instructions allow all sub‑blocks (weight loading, MAC, pooling, non‑linear activation) to run concurrently, cutting latency compared with conventional sequential accelerators. Because the NPU accounts for over half of the chip’s dynamic power, the system gates its activation so that it runs only once per tracked object, achieving a 98.3 % reduction in NPU activity on the drone dataset and 97 % on a vehicle dataset.
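The zero-skipping idea can be shown with a toy dot product. This is a software analogy of the per-PE behavior, not the hardware datapath: when either operand is zero, the MAC is skipped entirely, which in silicon translates to saved dynamic power rather than saved Python iterations.

```python
def zero_skipping_dot(activations, weights):
    """Dot product that skips a multiply-accumulate whenever either
    operand is zero, mimicking the per-PE zero-skipping described
    for the NPU. Returns (result, number of MACs actually done)."""
    acc = 0
    macs = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:
            continue  # zero operand: the MAC is gated off
        acc += a * w
        macs += 1
    return acc, macs
```

With sparse post-ReLU activations, most element pairs contain a zero, so the performed-MAC count drops far below the nominal count — the mechanism behind the reported reduction in redundant neural computation.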

Fabricated in 40 nm CMOS, the chip occupies 2 mm² and operates from 0.6 V to 1 V at 65 – 214 MHz. Measured energy consumption is 96 pJ per frame‑pixel in frame mode and 61 pJ per event in event mode (both at 0.8 V). Power draw is 0.52 mW (frame) and 0.62 mW (event). The system achieves 98.2 % average recognition accuracy on public event‑camera UAV datasets across ranges of 50 – 400 m and speeds of 5 – 80 pixels/s, outperforming prior CCL‑based and Spiking Neural Network (SNN) baselines in both accuracy and latency.
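A back-of-envelope check ties the measured numbers together. The implied sustained event rate below is an inference from the reported figures, not a number stated in the paper, and assumes all event-mode power is spent on event processing.

```python
# Measured figures from the paper (event mode, 0.8 V)
power_event_w = 0.62e-3       # 0.62 mW event-mode power draw
energy_per_event_j = 61e-12   # 61 pJ consumed per event

# Implied sustained throughput: power / energy-per-event
implied_event_rate = power_event_w / energy_per_event_j
print(f"{implied_event_rate / 1e6:.1f} M events/s")  # roughly 10 M events/s
```

The same division for frame mode (0.52 mW at 96 pJ per frame-pixel) bounds the sustainable pixel throughput analogously.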

A detailed breakdown shows ESP consumes 38.8 % of total power, ISP 7.7 %, and NPU 53.5 %. The Region Proposal Unit (RPU) and FOTU together account for roughly 31 % of the ESP budget, while Run‑Length Encoding, noise filtering, and frame building each contribute modestly. Compared with earlier works, the proposed architecture delivers the best energy‑efficiency per pixel and per event while maintaining state‑of‑the‑art detection performance.

In summary, the paper introduces a novel hybrid tracking algorithm, adaptive thresholding, trajectory‑based classification, and a zero‑skipping CNN accelerator to create an ultra‑low‑power, high‑accuracy anti‑UAV SoC. The design demonstrates that tight co‑optimization of algorithmic strategies and silicon architecture can overcome the traditional trade‑off between speed, accuracy, and power in event‑driven vision systems, opening the door for always‑on, battery‑operated UAV detection platforms.

