A 58.6mW Real-Time Programmable Object Detector with Multi-Scale Multi-Object Support Using Deformable Parts Model on 1920x1080 Video at 30fps

Reading time: 5 minutes
...

📝 Original Info

  • Title: A 58.6mW Real-Time Programmable Object Detector with Multi-Scale Multi-Object Support Using Deformable Parts Model on 1920x1080 Video at 30fps
  • ArXiv ID: 1607.08635
  • Date: 2016-08-01
  • Authors: Amr Suleiman, Zhengdong Zhang, Vivienne Sze

📝 Abstract

This paper presents a programmable, energy-efficient and real-time object detection accelerator using deformable parts models (DPM), with 2x higher accuracy than traditional rigid body models. With 8 deformable parts detection, three methods are used to address the high computational complexity: classification pruning for 33x fewer parts classification, vector quantization for 15x memory size reduction, and feature basis projection for 2x reduction of the cost of each classification. The chip is implemented in 65nm CMOS technology, and can process HD (1920x1080) images at 30fps without any off-chip storage while consuming only 58.6mW (0.94nJ/pixel, 1168 GOPS/W). The chip has two classification engines to simultaneously detect two different classes of objects. With a tested high throughput of 60fps, the classification engines can be time multiplexed to detect even more than two object classes. It is energy scalable by changing the pruning factor or disabling the parts classification.

💡 Deep Analysis

Figure 1: The DPM model used in this work, with a root template and 8 deformable parts.

📄 Full Content

Object detection is critical to many embedded applications that require low power and real-time processing. For example, low latency and HD images are important for autonomous control, so the system can react quickly to fast-approaching objects, while low energy consumption is essential due to battery and heat limitations. Object detection involves not only classification/recognition but also localization, which is achieved by sliding a window of a pretrained model over an image. For multi-scale detection, the window slides over an image pyramid (multiple downscaled copies of the image). Multi-scale detection is very challenging because the image pyramid causes a data expansion that can exceed 100x for HD images. The high computational complexity of object detection necessitates fast hardware implementations [1] to enable real-time processing.
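
As a rough illustration of this sliding-window, image-pyramid flow, the Python sketch below downscales an input across 4 octaves with 3 scales per octave and scores a linear template at every window position. The function names (`build_pyramid`, `score_map`) and the random inputs are placeholders for illustration, not part of the paper's design.

```python
import numpy as np

def build_pyramid(image, num_octaves=4, scales_per_octave=3):
    """Downscaled copies of `image` (nearest-neighbor resampling, for brevity)."""
    pyramid = []
    for level in range(num_octaves * scales_per_octave):
        factor = 2.0 ** (-level / scales_per_octave)
        h, w = int(image.shape[0] * factor), int(image.shape[1] * factor)
        rows = (np.arange(h) / factor).astype(int)
        cols = (np.arange(w) / factor).astype(int)
        pyramid.append(image[rows[:, None], cols])
    return pyramid

def score_map(features, template):
    """Dot product of the template with every valid window position (slow, for clarity)."""
    th, tw = template.shape
    fh, fw = features.shape
    scores = np.full((fh - th + 1, fw - tw + 1), -np.inf)
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = np.sum(features[y:y + th, x:x + tw] * template)
    return scores

frame = np.random.rand(108, 192)     # small stand-in for an HD frame's feature map
template = np.random.rand(8, 8)      # stand-in for a trained linear SVM template
responses = [score_map(level, template)
             for level in build_pyramid(frame)
             if level.shape[0] >= 8 and level.shape[1] >= 8]
```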

This paper presents a complete object detection accelerator using DPM [2] with a root and 8 parts model, as shown in Fig. 1. DPM doubles the detection accuracy compared to rigid template (root-only) detection. The 8 parts account for deformation, so a single model can detect objects at different poses (Fig. 6) and with increased detection confidence. However, this accuracy comes with a classification overhead of 35x more multiplications (i.e. DPM classification consumes 80% of a single detector's power), making multi-object detection a challenge. A software-based DPM object detector is described in [3], which enables detection on 500x500 images at 30fps but requires a powerful, fully loaded 6-core Xeon processor and 32GB of memory. In this work, the classification overhead is significantly reduced by two main techniques:

  • Classification pruning with vector quantization (VQ) for selective part processing.
  • Feature basis projection for sparse multiplications.
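
Before those two reductions, the baseline DPM score itself combines the root response with the best placement of each part inside a small search window, penalized by a deformation cost. The sketch below illustrates that combination; the anchors, quadratic penalty weights, and random response maps are stand-ins for the trained model parameters, not the accelerator's arithmetic.

```python
import numpy as np

def dpm_score(root_score, part_scores, anchors, deform_weights, search=2):
    """Combine a root score with the best placement of each part.

    part_scores:    list of 2-D SVM response maps, one per part.
    anchors:        expected (y, x) position of each part for this root placement.
    deform_weights: per-part quadratic penalty coefficients (wy, wx).
    search:         half-width of the search region (2 -> a 5x5 window).
    """
    total = root_score
    for scores, (ay, ax), (wy, wx) in zip(part_scores, anchors, deform_weights):
        best = -np.inf
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = ay + dy, ax + dx
                if 0 <= y < scores.shape[0] and 0 <= x < scores.shape[1]:
                    # part response minus the cost of deviating from the anchor
                    best = max(best, scores[y, x] - (wy * dy * dy + wx * dx * dx))
        total += best
    return total

parts = [np.random.randn(20, 20) for _ in range(8)]   # 8 part response maps
anchors = [(10, 10)] * 8
deform = [(0.1, 0.1)] * 8
print(dpm_score(1.2, parts, anchors, deform))
```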

Architecture Overview

Fig. 2 shows the block diagram of our detector architecture, including the histogram of oriented gradients (HOG) feature pyramid generation unit and the support vector machine (SVM) classification engines. A feature pyramid size of 12 scales (4 octaves, 3 scales/octave) is selected as a trade-off between detection accuracy and computational complexity. The pyramid contains 87K feature vectors, which is 2.7x more features than a typical HD image. To meet the throughput, three parallel histogram and normalize blocks generate the pyramid. Two classification engines share the generated features to detect two different classes of objects simultaneously. The root and parts SVM weights can be programmed with a maximum template size of 128x128 pixels. This large size gives the detector the flexibility to detect many object classes with different aspect ratios. Each SVM engine contains a root classifier for root detection, a pruning block to select candidate roots, and 8 part processing engines for parts detection. Local feature storage in each part engine allows parallelism, reduces the feature storage read bandwidth and enables a 7x speedup. Finally, the Deform block uses a coarse-to-fine technique for a 2.2x speedup in finding the maximum score in a 5x5 search region for each part after adding the deformation cost.
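
One way to picture the Deform block's coarse-to-fine search: score a coarse subgrid of the 5x5 deformation window first, then refine only around the coarse winner. The sketch below is a guess at the general idea, not the block's actual schedule, and like any coarse-to-fine shortcut it can miss the true maximum on some inputs.

```python
import numpy as np

def coarse_to_fine_max(cost_map):
    """Approximate the max of a 5x5 cost map while scoring roughly half the positions."""
    assert cost_map.shape == (5, 5)
    # Coarse pass: every other row/column (a 3x3 subgrid, 9 of 25 positions).
    coarse = [(y, x) for y in range(0, 5, 2) for x in range(0, 5, 2)]
    by, bx = max(coarse, key=lambda p: cost_map[p])
    # Fine pass: the 3x3 neighborhood around the coarse winner.
    best = -np.inf
    for y in range(max(0, by - 1), min(5, by + 2)):
        for x in range(max(0, bx - 1), min(5, bx + 2)):
            best = max(best, cost_map[y, x])
    return best

window = np.random.rand(5, 5)   # part score minus deformation cost, per offset
print(coarse_to_fine_max(window), window.max())
```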

Classification Pruning and Vector Quantization

With more than 2.6 million features generated per second, on-the-fly processing is used for root classification, similar to [3], for minimal on-chip storage: partial dot products are accumulated in SRAMs. Using the same approach for parts classification would require large accumulation SRAMs (more than 800KB for one classification engine). However, it was observed that if the root score is too low, then the likelihood of detecting an object based on parts is also low. Since parts classification requires significant additional computation, we prune the parts classification when the root score is below a programmable threshold. By pruning 97% of the parts classification (i.e. a 33x reduction in the root candidates that are processed), we achieve a 10x reduction in classification power with a negligible 0.03% reduction in accuracy.
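
A minimal sketch of this pruning policy: only root positions whose score clears a programmable threshold are forwarded to parts classification. The `keep_fraction` helper, which picks a threshold that keeps roughly 3% of candidates, is an illustrative convenience rather than how the chip sets its threshold.

```python
import numpy as np

def prune_root_candidates(root_scores, threshold, keep_fraction=None):
    """Return the flat indices of root positions that survive pruning.

    If `keep_fraction` is given, the threshold is chosen so that roughly that
    fraction of candidates is kept (e.g. 0.03 to prune ~97% of them).
    """
    flat = root_scores.ravel()
    if keep_fraction is not None:
        threshold = np.quantile(flat, 1.0 - keep_fraction)
    return np.flatnonzero(flat >= threshold)

root_scores = np.random.randn(100, 100)          # stand-in root SVM responses
survivors = prune_root_candidates(root_scores, threshold=0.0, keep_fraction=0.03)
print(f"parts classification runs on {survivors.size} of {root_scores.size} windows")
```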

To avoid re-computation, HOG features are stored in line buffers to be reused by the part processing engines after pruning (Fig. 3). VQ is used to reduce the feature line buffer write bandwidth (from 44.4MB/s to 2.5MB/s), making its size suitable for on-chip SRAM (from 572KB to 32KB) and eliminating any off-chip storage. Three parallel VQ engines are used to meet the throughput. 256 programmable cluster centers are stored in a shared SRAM to minimize the read bandwidth. Each 143-bit HOG feature vector (13-D, 11 bits per dimension) is quantized to an 8-bit index, giving a 15x reduction in the overall feature storage size. Dequantization is just a memory read from the feature SRAM.
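
The VQ step can be sketched as a nearest-centroid lookup against the 256-entry codebook, with dequantization reduced to a memory read, as below; the codebook contents and the 13-D feature values are random placeholders, not the trained cluster centers.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.random((256, 13))      # 256 programmable cluster centers, 13-D each

def vq_encode(feature):
    """Quantize a 13-D feature vector to an 8-bit codebook index."""
    distances = np.sum((codebook - feature) ** 2, axis=1)
    return np.uint8(np.argmin(distances))

def vq_decode(index):
    """Dequantization is just a lookup into the codebook (a feature SRAM read)."""
    return codebook[index]

feature = rng.random(13)              # one HOG feature vector (13-D)
idx = vq_encode(feature)              # stored as 8 bits instead of 143 bits
approx = vq_decode(idx)
```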

To further reduce the cost of each classifier, the features are projected into a new space where the classification SVM weights are sparse. Multiplications by zero are skipped, and only the nonzero weights are stored on-chip.
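
A small sketch of the idea: project each feature onto a basis, keep only the indices and values of the nonzero SVM weights, and accumulate the dot product over those entries alone. The basis, weights, and sparsity threshold here are invented for illustration and are not the paper's trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
basis = rng.standard_normal((13, 13))            # projection basis (illustrative)
weights = rng.standard_normal(13)
weights[np.abs(weights) < 1.0] = 0.0             # most projected weights become zero

# Store only the nonzero weights and their indices (the chip stores only nonzeros on-chip).
nz_idx = np.flatnonzero(weights)
nz_w = weights[nz_idx]

def classify(feature):
    """Score a feature: project, then multiply only by the stored nonzero weights."""
    projected = basis @ feature
    return float(projected[nz_idx] @ nz_w)       # multiplications by zero are skipped

print(classify(rng.random(13)))
```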


Reference

This content is AI-processed based on open access ArXiv data.
