An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices

Xiaolong Ma¹†, Wei Niu²†, Tianyun Zhang³, Sijia Liu⁴, Sheng Lin¹, Hongjia Li¹, Wujie Wen⁵, Xiang Chen⁶, Jian Tang⁷, Kaisheng Ma⁸, Bin Ren², and Yanzhi Wang¹

¹ Northeastern University, Boston MA 02115, USA, {ma.xiaol, yanz.wang}@northeastern.edu
² College of William and Mary, ³ Syracuse University, ⁴ IBM Research, ⁵ Lehigh University, ⁶ George Mason University, ⁷ DiDi AI Labs, ⁸ Tsinghua University
† Equal Contribution

Abstract. Weight pruning has been widely acknowledged as a straightforward and effective method to eliminate redundancy in Deep Neural Networks (DNN), thereby achieving acceleration on various platforms. However, most pruning techniques are essentially trade-offs between model accuracy and regularity, which lead to impaired inference accuracy and limited on-device acceleration performance. To solve this problem, we introduce a new sparsity dimension, namely pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware friendly. With carefully designed patterns, the proposed pruning consistently achieves accuracy enhancement and better feature extraction ability on different DNN structures and datasets, and our pattern-aware pruning framework performs pattern library extraction, pattern selection, pattern and connectivity pruning, and weight training simultaneously. Our approach to the new pattern-based sparsity naturally fits into compiler optimization for highly efficient DNN execution on mobile platforms. To the best of our knowledge, this is the first time that mobile devices achieve real-time inference for large-scale DNN models, thanks to the unique spatial property of pattern-based sparsity and the code generation capability of compilers.
1 Introduction

Weight pruning has been proven to be effective in eliminating redundancy in the original model [7,31,15], therefore accelerating DNN execution on target computing platforms. Non-structured pruning [10] achieves high accuracy, but is limited by its hardware unfriendliness [31,15]. Meanwhile, structured pruning [31] is hardware friendly but suffers from accuracy loss. It is imperative to seek an approach that can offer, or even go beyond, the best of both types of sparsity.

We visualize part of the normalized heat map of a pre-trained VGG-16 model on ImageNet in Figure 1 and find that (i) the effective area (i.e., weights with higher absolute values) forms specific shapes that appear repeatedly in the model, and (ii) some entire convolution kernels have very small weight values, making them void kernels.

Fig. 1: Heat map of randomly selected convolution kernels in the third convolutional layer of a VGG-16 on the ImageNet dataset. The weight values in each kernel are normalized, and darker shade represents higher absolute value.

Motivated by these two observations, we introduce a new sparsity dimension, pattern-based sparsity, which exploits both intra-convolution and inter-convolution kernel sparsity, exhibits both high accuracy and regularity, and reveals a previously unknown point in the design space.

In pattern-based sparsity, we call our intra-convolution kernel sparsity pattern sparsity and the inter-convolution kernel sparsity connectivity sparsity. To obtain pattern sparsity, we prune a fixed number of weights in each convolution kernel, and the remaining weights form specific "kernel patterns". Along this line, we find that some carefully designed kernel patterns have special vision properties that potentially enhance image quality, thereby enhancing the feature extraction ability of DNNs.
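As a toy illustration of pattern sparsity (a sketch of ours, not the paper's code), a kernel pattern can be viewed as a binary 3×3 mask applied element-wise to a kernel; the weight values below are made up:

```python
import numpy as np

# Hypothetical 3x3 convolution kernel (illustrative values only).
kernel = np.array([[0.8, 0.1, 0.7],
                   [0.9, 1.2, 0.2],
                   [0.1, 0.6, 0.3]])

# One possible "kernel pattern": a binary mask keeping 4 of the 9 weights.
pattern = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 0, 0]])

# Pattern pruning zeroes the masked-out weights via an element-wise product.
pruned = kernel * pattern
print(int((pruned != 0).sum()))  # 4 non-zero weights remain
```

Different kernels may be assigned different masks, but the set of distinct masks (the pattern library) is kept small.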
For connectivity sparsity, we cut the relatively unimportant connections between certain input and output channels, which is equivalent to the removal of the corresponding kernels.

At the algorithm level, we design a novel pattern-aware network pruning framework that efficiently achieves pattern pruning and connectivity pruning without degrading accuracy. We begin by reforming the pruning problem into an ADMM optimization problem [4], and then solve it iteratively using a Primal-Proximal solution which decouples the stochastic gradient descent process from regularization, enabling a progressive and gradual process of penalizing unimportant weight groups and hence a more accurate selection of the remaining weight patterns. Therefore, the framework can achieve pattern library extraction, pattern assignment, unimportant connectivity removal, as well as weight training simultaneously. Our proposed pattern-based sparsity is mobile hardware friendly with the help of the code generation capability of compilers. More specifically, we design a filter/kernel re-ordering technique that enables compiler optimizations that maintain instruction-level and thread-level parallelism and achieve the maximum possible hardware acceleration.

The contributions of this paper are summarized as follows:
– We design a set of patterns, namely the pattern library, and prove the image enhancement property related to pattern pruning. (Section 4)
– We form a novel pattern-aware network pruning framework that can extract the pattern library and perform pattern and connectivity pruning and weight training at the same time. (Section 5)
– We design the corresponding (algorithm-compiler-hardware) inference framework which fully leverages the new sparsity dimension and achieves real-time DNN execution on mobile devices.
(Section 6)

Fig. 2: Illustration of pattern-based sparsity (filter, convolution kernel, kernel pattern, pruned weights, connectivity pruning).

Section 7 demonstrates the pattern library extraction result, the pattern pruning accuracy and image enhancement results, the overall pattern-based compression results, and the acceleration results on mobile devices.

2 Background

DNN model pruning techniques are studied in early work on non-structured pruning [10], in which an iterative, heuristic method is used with limited, non-uniform model compression rates. The irregular weight distribution causes irregular memory access and thereby execution overheads, which leads to limited acceleration performance. Structured pruning is pioneered by [31][15], in which regular and smaller weight matrices are generated to eliminate the overhead of weight indices and achieve higher acceleration in CPU/GPU executions. However, it suffers from a notable accuracy drop when the pruning rate increases. Kernel-level pruning is studied in [5], where sparse complementary kernels save half of the weights and computations; it differs from our approach because pattern-based sparsity improves both the software and hardware performance of DNNs theoretically and practically, while [5] focuses only on parameter and computation reduction without discussing platform acceleration.

Mobile DNN inference frameworks have been studied, including TFLite [1], TVM [6], Alibaba MNN [2], DeepCache [32] and DeepSense [33]. These works do not account for model compression techniques, and their performance is far from the real-time requirement (usually 30 frames/sec). Other works exploit model sparsity to accelerate DNN inference [18][24], but they either do not target mobile platforms (they require new hardware) or trade off compression rate and accuracy, thus facing different challenges than our work.
3 Overview

Pattern-based sparsity should exploit the best of both non-structured and structured pruning while hiding their disadvantages. Given that, we propose two pattern-based pruning dimensions, pattern pruning and connectivity pruning.

Pattern pruning is illustrated in Figure 2, where the white blocks denote a fixed number of pruned weights in each kernel. The remaining (four) green blocks in each kernel have arbitrary weight values, while their locations form a specific pattern. Different kernels can have different patterns, but the total number of pattern styles (i.e., the size of the pattern library) shall be limited. We focus on the 3×3 kernel pattern in this work because it is widely used in various DNN architectures. For other kernel shapes (e.g., 1×1 or 5×5), we either group 1×1 kernels into 3×3 and then apply patterns, or use 5×5 patterns directly (not discussed in this work due to the space limit).

Connectivity pruning is illustrated in Figure 2, with gray kernels as pruned ones. Connectivity pruning is a good supplement to pattern pruning, as both can be integrated in the same algorithm-level solution and compiler-assisted mobile inference framework.

Compiler-assisted DNN inference framework uniquely enables optimized code generation to guarantee end-to-end inference execution efficiency supporting pattern-based sparsity. As the computation paradigm of DNNs is layerwise execution, we convert a DNN model into a computational graph, which is embodied by static C++ (for CPU execution) or OpenCL and CUDA (for GPU execution) code. The above two pruning schemes can be naturally combined, which achieves a high pruning (acceleration) rate while maintaining hardware friendliness.

4 Pattern Library – Theory and Design

4.1 A Unique Perspective on Weight Pruning

Conventionally, weight pruning is considered a redundant information removal technique.
This perspective inevitably omits other aspects, such as the computer vision properties of pruning. In this work, we consider weight pruning as incorporating an additional convolution mask P on an original kernel. P has the same size as the original kernels and binary-valued elements (0 and 1). From our perspective, pattern pruning is an element-wise multiplication of different P's and original kernels. The set of different P's is the pattern library.

A multi-layer DNN is formed by cascading functional layers. Applying P on every convolution kernel across layers is intrinsically an interpolation operation of P's. Different patterns can form functional steerable filters [9] (e.g., Gaussian blur filter, sharpen filter, edge detection filter, etc.) by interpolation, and this process only needs a limited number of patterns (i.e., a small pattern library). A small pattern library has two advantages: (i) at the algorithm level, an appropriate number of patterns ensures a flexible search space for achieving a solution with good performance on DNNs, and (ii) at the compiler level, fewer patterns mean fewer computation paradigms after kernel reordering and grouping, which reduces thread-level divergence.

4.2 Pattern Library Design

Our designed patterns can be transformed into a series of steerable filters [9], in our case the Gaussian filter and the Laplacian of Gaussian filter, by interpolating patterns through DNN layers.

Transform patterns to Gaussian filter: Consider a two-dimensional Gaussian filter G:

G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}    (1)

where x and y are input coordinates, and \sigma^2 is the variance. Binomial coefficients give a compact approximation of the Gaussian coefficients using only integers. To apply the Gaussian filters with 3×3 filter size, we utilize the following approximation.
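As a quick numerical check of the binomial construction used below (a sketch of ours, not from the paper): convolving two box filters [1 1] gives the 1-D approximation, and the outer product gives the 3×3 kernel:

```python
import numpy as np

# 1-D binomial approximation of a Gaussian: convolve two box filters [1 1].
box = np.array([1, 1])
g1d = np.convolve(box, box)   # 1-D approximation

# 2-D approximation: outer product of the 1-D filter with itself,
# equivalent to convolving [1 2 1] with its transpose.
g2d = np.outer(g1d, g1d)

print(g1d.tolist())
print(g2d.tolist())
```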
According to (1), setting \sigma^2 = 1/2, in the 1-D case the approximation of the Gaussian filter [1 2 1] is given by the convolution of two box filters [1 1]. We then get the 2-D approximation of the Gaussian filter by convolving [1 2 1] and [1 2 1]^T, and the result is

\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}.

Interpolation in a multi-layer DNN is proved to be convergent [29]. We can make a further approximation by interpolating patterns into convolutional layers (i.e., uniformly mapping patterns to each kernel). In continuous probability space, interpolating patterns into the convolution function is a specific Probability Density Function (PDF), so the effect of interpolating patterns is accumulating the probability expectations of the interpolation into n convolutional layers:

\left( \begin{bmatrix} 1&1&0 \\ 1&1&0 \\ 0&0&0 \end{bmatrix},
\begin{bmatrix} 0&1&1 \\ 0&1&1 \\ 0&0&0 \end{bmatrix},
\begin{bmatrix} 0&0&0 \\ 1&1&0 \\ 1&1&0 \end{bmatrix},
\begin{bmatrix} 0&0&0 \\ 0&1&1 \\ 0&1&1 \end{bmatrix} \right)
\xrightarrow{n\ \text{interpolations}}
p^n \begin{bmatrix} 1&2&1 \\ 2&4&2 \\ 1&2&1 \end{bmatrix}    (2)

The four pattern masks P shown in (2) form the Gaussian filter through interpolation. The coefficient p has no effect after normalization.

Transform patterns to Laplacian of Gaussian filter: The Laplacian operator is a second-derivative operator. According to the associative property, smoothing an image with a Gaussian filter and then applying the Laplacian operator is equivalent to convolving the image with the Laplacian of Gaussian (LoG) filter:

\nabla^2 G(x, y, \sigma) = \left( \frac{x^2 + y^2}{\sigma^4} - \frac{2}{\sigma^2} \right) G(x, y, \sigma)    (3)

The LoG has elegant mathematical properties and is valid for a variety of applications including image enhancement, edge detection, and stereo matching.

A Taylor series expansion is utilized to determine the approximate values of the LoG filter with 3×3 filter size. First, consider the 1-D situation. The Taylor series expansions of the 1-D Gaussian filter G(x) are given by:

G(x + \delta) = G(x) + \delta G'(x) + \frac{1}{2}\delta^2 G''(x) + \frac{1}{3!}\delta^3 G'''(x) + O(\delta^4)    (4)

G(x - \delta) = G(x) - \delta G'(x) + \frac{1}{2}\delta^2 G''(x) - \frac{1}{3!}\delta^3 G'''(x) + O(\delta^4)    (5)

By summing (4) and (5), we have

[G(x - \delta) - 2G(x) + G(x + \delta)] / \delta^2 = \nabla^2 G(x) + O(\delta^2)    (6)

Applying the central difference approximation of the LoG \nabla^2 G(x), we derive the 1-D approximation of the LoG filter as [1 −2 1]. We then procure the 2-D approximation of the LoG filter by convolving [1 −2 1] and [1 −2 1]^T, and get

\begin{bmatrix} -1 & 2 & -1 \\ 2 & -4 & 2 \\ -1 & 2 & -1 \end{bmatrix}

as the 1st approximation. According to (6), we have

\nabla^2 G(x, y) = \left( [1\ {-2}\ 1] + [1\ {-2}\ 1]^T \right) * G(x, y)    (7)

Based on (7), we derive the 2nd approximation as

\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}.

According to the central limit theorem, the convolution of two Gaussian functions is still a Gaussian function. Hence, we convolve the above two approximations of the LoG and then apply normalization, and get the Enhanced Laplacian of Gaussian (ELoG) filter

\begin{bmatrix} 0 & 1 & 0 \\ 1 & 8 & 1 \\ 0 & 1 & 0 \end{bmatrix}.

Similarly, we make the further approximation by interpolating patterns into convolutional layers:

\left( \begin{bmatrix} 0&1&0 \\ 1&1&1 \\ 0&0&0 \end{bmatrix},
\begin{bmatrix} 0&1&0 \\ 1&1&0 \\ 0&1&0 \end{bmatrix},
\begin{bmatrix} 0&0&0 \\ 1&1&1 \\ 0&1&0 \end{bmatrix},
\begin{bmatrix} 0&1&0 \\ 0&1&1 \\ 0&1&0 \end{bmatrix} \right)
\xrightarrow{n\ \text{interpolations}}
\begin{bmatrix} 0&p^n&0 \\ p^n&1&p^n \\ 0&p^n&0 \end{bmatrix}    (8)

The four pattern masks P shown in (8) form the ELoG filter through interpolation. To obtain the best approximation to the ELoG filter, we set p = 0.75 and n = 8; the desired filter then equals interpolating these four patterns eight times (p^8 ≈ 1/8). The coefficient p has no effect after normalization.

5 Pattern-Aware Network Pruning Framework for Pattern Library Extraction

In Section 4, we determined the (eight) patterns of our pattern library through theoretical derivation. However, a couple of open questions remain. Are these theoretically derived patterns also the most desirable at the algorithm level?
How do we select the appropriate pattern for each kernel and train the corresponding (remaining) weights? To answer these questions, we propose a novel pattern-aware network pruning framework that simultaneously achieves pattern library extraction (with a predefined number of patterns in the library), pattern assignment, and weight training.

In pattern library extraction, we start from a large library comprising all possible candidate patterns. By extending ADMM [4] and incorporating the Primal-Proximal solution technique, we make convolution kernels dynamically "select" the best-suited patterns within the library and train the unpruned weights. We then delete the least-selected patterns from the library, thereby updating it. This step is iterated on the updated library; a single step proceeds as shown below.

5.1 Pattern Library Extraction – A Single Step

For an N-layer DNN of interest, let W denote the collection of weights for all 3×3 kernels, i.e., W = \{W_i\}_{i=1}^N. The pattern of each kernel W_i is restricted to a finite pattern library \Omega = \{M_1, \ldots, M_j, \ldots, M_K\}, where M_j denotes a binary mask and K denotes the total number of possible patterns. We choose to reserve 4 non-zero entries in a kernel to match the SIMD (single-instruction multiple-data) architecture of embedded CPU/GPU processors, thereby maximizing throughput. As a result, the initial K = \binom{9}{4} = 126, and K decreases in each step. The purpose of each step is to select a pattern from the current library for each kernel and to train the non-zero weights.
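The initial library size follows from counting binary 3×3 masks with exactly 4 non-zero entries; a small enumeration sketch (variable names are ours):

```python
from itertools import combinations
import numpy as np

# Enumerate every binary 3x3 mask with exactly 4 non-zero entries:
# choose 4 of the 9 positions, i.e. C(9, 4) = 126 candidate patterns.
masks = []
for kept in combinations(range(9), 4):
    m = np.zeros(9, dtype=int)
    m[list(kept)] = 1
    masks.append(m.reshape(3, 3))

print(len(masks))  # 126
```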
Let f(W; D) denote the training loss (D denotes the training data). We pose the following optimization problem:

\begin{aligned}
\underset{W,\, z}{\text{minimize}} \quad & f(\{W_i \circ (\textstyle\sum_{j=1}^{K} z_j M_j)\}_{i=1}^{N};\, D) \\
\text{subject to} \quad & z_j \in \{0, 1\},\ \forall j, \quad \textstyle\sum_{j=1}^{K} z_j = 1,
\end{aligned}    (9)

where z_j denotes the Boolean selection variable indicating which pattern in \Omega is chosen for W_i. The constraint \sum_{j=1}^{K} z_j = 1 indicates that only one pattern is selected, and thus W_i \circ (\sum_{j=1}^{K} z_j M_j) denotes the pattern-pruned kernel using one of the pruning patterns. Here \circ denotes the element-wise product. In (9), we have two types of optimization variables: (i) the 3×3 kernel weights W, and (ii) the pattern Boolean selection variables z \in \{0,1\}^K. The pattern selection scheme is co-optimized with non-zero weight training.

To solve the above problem analytically, we introduce an auxiliary variable u together with the constraint z = u. Based on that, we reformulate problem (9) as

\begin{aligned}
\underset{W,\, z,\, u}{\text{minimize}} \quad & f(\{W_i \circ (\textstyle\sum_{j=1}^{K} z_j M_j)\}_{i=1}^{N};\, D) + I(u) \\
\text{subject to} \quad & z = u
\end{aligned}    (10)

where I(u) is the indicator function

I(u) = \begin{cases} 0 & \text{if } u_j \in [0,1],\ \forall j,\ \sum_{j=1}^{K} u_j = 1 \\ \infty & \text{otherwise.} \end{cases}    (11)

Here we relax the binary selection variable z_j \in \{0, 1\} to the (continuous) probabilistic selection variable u_j \in [0, 1]. The augmented Lagrangian function of problem (10) is given by

L(W, z, u, \mu) = f\left( \{W_i \circ (\textstyle\sum_{j=1}^{K} z_j M_j)\}_{i=1}^{N};\, D \right) + I(u) + \mu^T (z - u) + \frac{\rho}{2} \| z - u \|_2^2    (12)

where \mu is the Lagrangian multiplier, \|\cdot\|_2 denotes the Frobenius norm, and \rho > 0 is a given augmented penalty value; for ease of notation we view matrices as vectors in the optimization. ADMM is then given by the following alternating optimization process. At iteration t, ADMM yields

W^{(t)}, z^{(t)} = \arg\min_{W,\, z}\ L(W, z, u^{(t-1)}, \mu^{(t-1)})    (Primal)
u^{(t)} = \arg\min_{u}\ L(W^{(t)}, z^{(t)}, u, \mu^{(t-1)})    (Proximal)
\mu^{(t)} = \mu^{(t-1)} + \rho\, (z^{(t)} - u^{(t)}),    (13)
where the initial values u^{(0)} and \mu^{(0)} are given.

Problem (Primal) can be simplified to

\underset{W,\, z}{\text{minimize}} \quad f(\{W_i \circ (\textstyle\sum_{j=1}^{K} z_j M_j)\}_{i=1}^{N};\, D) + \frac{\rho}{2} \| z - a \|_2^2    (14)

where a := u^{(t-1)} - (1/\rho)\mu^{(t-1)}. In problem (14), the objective function is differentiable and can thus be solved by standard DNN solvers with SGD.

Problem (Proximal) can be equivalently decomposed over u. This leads to the problem

\begin{aligned}
\underset{u}{\text{minimize}} \quad & \frac{\rho}{2} \| u - d \|_2^2 \\
\text{subject to} \quad & u_j \in [0,1],\ \forall j, \quad \textstyle\sum_{j=1}^{K} u_j = 1,
\end{aligned}    (15)

where d := z^{(t)} + (1/\rho)\mu^{(t-1)}. Based on [25], the analytical solution to problem (15) is

u^{(t)} = [d - \nu \mathbf{1}]_+,    (16)

where [x]_+ = x if x \ge 0 and 0 otherwise, and \nu is the root of the equation

\mathbf{1}^T [d - \nu \mathbf{1}]_+ = 1.    (17)

Once W and z are solved, z is a continuous variable rather than a binary variable. We need an intermediate step to project the continuous z_{admm} to an integer z_{binary}, yielding

\begin{aligned}
\underset{z_{binary}}{\text{minimize}} \quad & \| z_{binary} - z_{admm} \|_2^2 \\
\text{subject to} \quad & \mathbf{1}^T z_{binary} = 1,\ [z_{binary}]_j \in \{0, 1\},\ \forall j.
\end{aligned}    (18)

The solution is given by [z_{binary}]_i = 1 if i = \arg\max_j [z_{admm}]_j, and 0 otherwise. At this point, we have simultaneously selected a pattern for each kernel and trained the non-zero weights.

5.2 Pattern Library Extraction – Overall

The overall pattern library extraction starts from K = 126 and decreases K in each step, as briefly shown in Algorithm 1. In the actual implementation we set the new K to 12 in the first step, as most of the patterns occur very few times. We set the target K to be either 12, 8, or 4. When the number of pattern styles is within this range, the overhead in code generation at the compiler level can be kept small and parallelism can be maximized.

Algorithm 1: Pattern library extraction process.
1  Initialization: \Omega = \{M_1, M_2, \ldots, M_K\} with K = 126;
   Result: Subset \Omega' with K = 12, 8 or 4;
2  while training neural network do
3      Update W by solving (Primal);
4      for K ← 126 until K = 12, 8 or 4 do
5          Solve (Proximal) using current \Omega;
6          Update \mu in (13);
7          Calculate pattern distribution of current \Omega;
8          Remove patterns with fewest occurrences from \Omega;
9      end
10 end

Total Runtime: Despite being an iterative process, the total number of epochs (and training time) can be limited. This is because, except for the last step, we only need to extract a number of patterns instead of finishing the final training of the non-zero weights. As a result, we can finish each step with 10% to 20% of the epochs used to train the original DNN. In the last step, we need around 9-12 ADMM iterations, each requiring less than 20% of the epochs of original DNN training. The total number of training epochs using PyTorch [26] is therefore around 300-400 for the whole process, which is even lower than many prior arts [10,22].

6 Connectivity Sparsity and the New Sparsity-Induced Inference Framework

In Section 5, we designed the algorithm-level solution to simultaneously achieve pattern library extraction, pattern selection, and weight training. In this section, we discuss connectivity sparsity and how to use the same solution framework to combine pattern sparsity and connectivity sparsity. We also design a compiler-assisted DNN inference framework for mobile platforms, which fully leverages the regularity in this new sparsity type and potentially surpasses the hardware performance of many prior works.

6.1 Connectivity Sparsity

Connectivity sparsity is achieved by connectivity pruning, which can be integrated in the same algorithm-level solution as in Section 5.1 and the compiler-assisted mobile inference framework.
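For reference, the (Proximal) step of the Section 5.1 solution, problem (15)-(17), is a Euclidean projection onto the probability simplex; a minimal sketch using the standard sort-based solution (our naming, not the paper's code):

```python
import numpy as np

def project_simplex(d):
    """Solve min_u ||u - d||^2 s.t. u_j in [0, 1], sum_j u_j = 1,
    i.e. u = [d - nu]_+ with nu chosen so that the entries sum to 1."""
    s = np.sort(d)[::-1]                      # sort entries descending
    css = np.cumsum(s)
    k = np.arange(1, len(d) + 1)
    # Largest index where the running threshold still keeps the entry positive.
    rho = np.nonzero(s - (css - 1) / k > 0)[0][-1]
    nu = (css[rho] - 1) / (rho + 1)
    return np.maximum(d - nu, 0.0)            # [d - nu]_+

u = project_simplex(np.array([0.9, 0.5, -0.2, 0.1]))
print(u)  # non-negative entries summing to 1
```

Because the result is non-negative and sums to 1, the upper bound u_j ≤ 1 holds automatically.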
Using the same notation as in Section 5.1, we define the collection of weights in the i-th layer as W_i \in \mathbb{R}^{H_i \times W_i \times F_i \times C_i}, where H and W denote the dimensions of the convolution kernel, and F and C denote the number of filters and channels, respectively. We further define a critical connectivity score for each convolution kernel as

\gamma_{i,f,c}(W_i) = \| [W_i]_{:,:,f,c} \|_2    (19)

where f and c are the filter and channel indices, respectively. The problem formulation and solution framework for achieving connectivity sparsity are similar to the ones in Section 5.1; the difference is that the constraint in the framework is related to \gamma_{i,f,c}. Note that our algorithm-level solution can solve the pattern pruning and connectivity pruning problems simultaneously or individually.

6.2 Compiler-assisted Inference Framework for Real-time Execution

After we obtain pattern and connectivity sparsity combined in a DNN model, we use a compiler-assisted inference framework to maximize execution efficiency by utilizing multiple optimization techniques induced by pattern-based sparsity.

Fig. 3: Overview of the compiler-level DNN inference framework: DNN model analysis (layer shape, weights, pattern distribution, connectivity information), filter kernel reorder, data loading optimization (compact weights), memory access and execution code generation for CPU/GPU, and deployment on the mobile device.

The compiler optimizations shown in Figure 3 target the DNN computation graph and memory access for on-device executions.
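A sketch of the connectivity score in (19) and threshold-based kernel removal (the tensor layout follows the paper; the random weights and bottom-25% threshold are ours, for illustration):

```python
import numpy as np

# Weights of one layer, laid out H x W x F x C as in the paper (H = W = 3).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 4, 2))            # 4 filters, 2 channels

# Connectivity score: L2 norm of each (f, c) kernel's 9 weights.
gamma = np.linalg.norm(W.reshape(9, 4, 2), axis=0)   # shape (F, C)

# Connectivity pruning removes the kernels with the smallest scores,
# here the bottom 25% as an illustrative threshold.
threshold = np.quantile(gamma, 0.25)
mask = gamma >= threshold
W_pruned = W * mask                              # broadcasts over H, W
print(int(mask.sum()), "of", gamma.size, "kernels kept")
```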
Layerwise optimization for the DNN computation graph is designed to achieve the best of instruction-level and thread-level parallelism by utilizing the unique filter/kernel re-ordering technique, as Figure 3 shows. In the weight matrix illustration, the internal squares with different colors denote different pattern styles, and empty white squares denote connectivity sparsity. By filter/kernel re-ordering, we (i) organize the filters with similar kernels together to improve inter-thread parallelism, and (ii) group kernels with identical patterns in each filter together to improve intra-thread parallelism. With DNN computation graph optimization, the generated execution code eliminates all execution branches, implying higher instruction-level parallelism; meanwhile, similar filter groups increase execution similarity and result in a good load balance, achieving better thread-level parallelism.

Memory access optimizations for hardware execution address the poor memory performance caused by irregular memory access. In DNN execution, the input/output data access is associated with the non-zero elements of the weights. Since in a pattern-based sparse model the non-zero pattern of each kernel is already known, we can generate data access code with this information for each kernel pattern and call it dynamically during DNN execution. With the data access code, it is possible to directly access the valid input data associated with the non-zero elements in a pattern-based kernel. Moreover, after DNN computation graph optimization, the model weight distribution is highly compact and structured, as Figure 3 shows, which reduces the calling frequency of the data access code and, as a result, the memory overhead.
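The kernel re-ordering idea can be sketched as grouping kernels by their assigned pattern id, so that generated code iterates pattern-by-pattern with a single branch-free body per group (a toy model of ours, not the actual compiler):

```python
from collections import defaultdict

# Toy sketch of compiler kernel grouping: each kernel in a filter is tagged
# with the id of its assigned pattern (ids below are illustrative).
kernel_patterns = [2, 0, 2, 1, 0, 2, 1, 2]   # pattern id per kernel index

groups = defaultdict(list)
for kernel_idx, pat in enumerate(kernel_patterns):
    groups[pat].append(kernel_idx)

# Re-order so that same-pattern kernels are contiguous; generated code can
# then emit "for kernel i..j: compute pattern x" without per-kernel branches.
order = [idx for pat in sorted(groups) for idx in groups[pat]]
print(order)
```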
7 Experimental Results

In our experiments, the generated pattern-based sparse models are based on four widely used network structures, VGG-16 [28], ResNet-18/50 [11] and MobileNet-V2 [16], and are trained on a server with eight NVIDIA RTX-2080Ti GPUs using PyTorch [26]. We show the consistency of the pattern library extraction results with the theoretically designed pattern library in Section 4.2, and provide accuracy improvement and image enhancement demonstrations. We also show the overall compression results of pattern-based pruning on different DNN models. To show the acceleration of pattern-based sparsity on mobile devices, we compare it with three state-of-the-art DNN inference acceleration frameworks, TFLite [1], TVM [6], and MNN [2]. Our experiments are conducted on a Samsung Galaxy S10 cell phone with the latest Qualcomm Snapdragon 855 mobile platform, which consists of a Qualcomm Kryo 485 Octa-core CPU and a Qualcomm Adreno 640 GPU.

Fig. 4: The pattern library extraction result. (a) Remaining pattern styles during the extraction process: Phase 1 (12 patterns, after deleting the 20 least-appeared of 32 patterns), Phase 2 (8 patterns, after deleting the 4 least-appeared Phase 1 patterns in one more step), and Phase 3 (4 patterns, after one further step). (b) Pattern distribution after 2 steps (K = 32), with different colors representing the different pattern styles in (a); the total number of kernels in the model is over 1,630,000 for VGG-16. The 20 less significant patterns account for only 13% of the total, and the remaining 12 patterns form the Phase 1 pattern library.
7.1 Pattern Library Extraction Result

We use VGG-16 on the ImageNet dataset to extract pattern libraries. VGG-16 has more than 1,630,000 convolution kernels, yet the patterns can be concentrated into 12 styles in only a couple of steps. Figure 4 shows the pattern style distribution when K decreases to 32 after two steps. We can see that most of the patterns are distributed in the top 12 styles, namely the Phase 1 pattern library. If we continue to decrease K to 8, the remaining 8 patterns form the Phase 2 pattern library. Notably, Phase 2 is exactly the same as our derived pattern library in Section 4.2. A further extraction step gives the Phase 3 pattern library, i.e., the top-4 pattern styles. Using other DNNs and datasets gives the same extraction results, so we conclude that the theoretically derived patterns are also the most desirable ones at the algorithm level.

7.2 Visualization Demonstration and Accuracy Analysis for Pattern Pruning

After obtaining the extracted pattern libraries in three phases (containing 12, 8 or 4 patterns, respectively), we validate the image enhancement effects and evaluate the accuracy of the pattern-pruned DNN.

Visualization comparisons of applying the Phase 2 pattern library to an original DNN model (pattern pruning) are demonstrated in Figure 5. To ensure fairness in the comparisons, we adopt three visualization methods to eliminate the impact of causal factors: (a) Guided-backpropagation (BP) [30], (b) Integrated gradients [23], and (c) Inverted representation [3].

Fig. 5: Visualization comparisons of three images (hourglass, dragonfly, chihuahua) from the ImageNet dataset on the original and pattern-pruned VGG-16 model using (a) guided-backpropagation (BP), (b) integrated gradients and (c) inverted representation methods.

Through these different visualization techniques, we can see what a DNN has learned and how well it preserves photographically accurate information from an image. Figure 5 provides strong evidence that the pattern-pruned VGG-16 model effectively captures more image details and less noise compared with the original VGG-16 model. We conclude that the accuracy improvement is attributed to the enhanced image processing ability of our designed pattern library.

Accuracy evaluation is shown in Figure 6 (a). Starting from baseline accuracy results that are in many cases higher than prior works, our first conclusion is that the accuracy improvements are more significant when applying the designed 8 patterns (i.e., the pattern library at Phase 2) to each convolution kernel. The accuracy improvements are consistently observed on various network structures (e.g., VGG-16, ResNet-18/50, MobileNet-V2) on the CIFAR-10 and ImageNet datasets.

Fig. 6: (a) Accuracy improvement results from pattern pruning (Baseline vs. Phase 1/2/3 libraries) on different DNN models and datasets (CIFAR-10 & ImageNet).
Fig. 6 (b): Training curves for connectivity sparsity under overall 6× compression (ResNet-18 on ImageNet).

Table 1: Pattern-based pruning results (%) on convolution layers for CIFAR-10 and ImageNet using VGG-16, ResNet-18 and ResNet-50.

| Framework     | CIFAR-10 Ori. Acc. | Prune Acc. | Comp. Rate | Sparsity Type | ImageNet Top-1 Ori./Prune | Top-5 Ori./Prune | Comp. Rate | Sparsity Type |

ResNet-18 †
| AMC [14]      | 90.5 | 90.2 | 2.0×  | Structured | - / -       | - / -       | -    | -          |
| DCP [37]      | -    | -    | -     | -          | 69.6 / 64.1 | 88.9 / 85.7 | 3.3× | Structured |
| TinyADMM [20] | 94.1 | 93.2 | 15.1× | Structured | N/A / N/A   | 89.1 / 88.4 | 3.3× | Structured |
| StrADMM [35]  | -    | -    | -     | -          | 69.6 / 68.8 | N/A / N/A   | 3.0× | Structured |
| SFP [12]      | 92.2 | 90.8 | 1.7×  | Structured | 70.3 / 67.1 | 89.6 / 87.8 | 1.7× | Structured |
| TAS [8]       | 92.8 | 92.8 | 1.8×  | Structured | 70.6 / 69.1 | 89.8 / 89.2 | 1.5× | Structured |
| FPGM [13]     | 92.2 | 91.9 | 2.5×  | Structured | 70.2 / 68.3 | 89.6 / 88.5 | 3.3× | Structured |
| Ours          | 94.0 | 94.7 | 8.0×  | Phase 2    | 69.9 / 69.6 | 89.1 / 89.2 | 4.0× | Phase 2    |
| Ours          | 94.0 | 94.6 | 12.0× | Phase 3    | 69.9 / 68.2 | 89.1 / 88.3 | 6.0× | Phase 2    |
| Ours          | 94.0 | 94.2 | 16.0× | Phase 2    | 69.9 / 67.1 | 89.1 / 87.7 | 8.0× | Phase 2    |

ResNet-50 *
| One Shot [19] | 93.8 | 93.6 | 2.5×  | Irregular  | - / -       | - / -       | -    | -          |
| ADMM-NN [27]  | -    | -    | -     | -          | N/A / N/A   | N/A / 92.3  | 7.0× | Irregular  |
| TAS [8]       | 94.5 | 93.7 | 2.0×  | Structured | 77.5 / 76.2 | 93.5 / 93.1 | 1.7× | Structured |
| SFP [12]      | 93.6 | 93.4 | 1.7×  | Structured | 76.2 / 74.6 | 92.9 / 92.1 | 1.7× | Structured |
| GAL [17]      | 93.3 | 90.4 | 2.9×  | Structured | 76.4 / 69.3 | 92.8 / 89.1 | 2.5× | Structured |
| FPGM [13]     | 93.6 | 93.5 | 2.5×  | Structured | 76.2 / 75.6 | 92.8 / 92.6 | 3.3× | Structured |
| GBN [34]      | -    | -    | -     | -          | 75.8 / 75.2 | 92.7 / 92.4 | 2.2× | Structured |
| Ours          | 94.2 | 95.2 | 8.0×  | Phase 3    | 76.1 / 75.9 | 92.9 / 92.7 | 3.9× | Phase 2    |
| Ours          | 94.2 | 94.9 | 12.0× | Phase 3    | 76.1 / 75.8 | 92.9 / 92.8 | 4.9× | Phase 3    |
| Ours          | 94.2 | 94.5 | 16.0× | Phase 3    | 76.1 / 75.6 | 92.9 / 92.6 | 5.8× | Phase 2    |

VGG-16
| NeST [7]      | -    | -    | -     | -          | 71.6 / 69.3 | 90.4 / 89.4 | 6.5×  | Irregular  |
| ADMM-NN [27]  | -    | -    | -     | -          | 69.0 / 68.7 | 89.1 / 88.9 | 10.2× | Irregular  |
| 2PFPCE [21]   | 92.9 | 92.8 | 4.0×  | Structured | - / -       | - / -       | -     | -          |
| DecorReg [36] | 93.5 | 93.3 | 8.5×  | Structured | 73.1 / 73.2 | N/A / N/A   | 3.9×  | Structured |
| GAL [17]      | 93.9 | 90.8 | 5.6×  | Structured | - / -       | - / -       | -     | -          |
| Ours          | 93.5 | 93.4 | 8.0×  | Phase 2    | 74.5 / 74.4 | 91.7 / 91.5 | 8.0×  | Phase 2    |
| Ours          | 93.5 | 93.3 | 11.6× | Phase 2    | 74.5 / 74.1 | 91.7 / 91.3 | 10.0× | Phase 2    |
| Ours          | 93.5 | 93.2 | 19.7× | Phase 1    | 74.5 / 73.6 | 91.7 / 91.0 | 12.0× | Phase 2    |

† SFP, TAS, FPGM use the ResNet-20 network structure on the CIFAR-10 dataset.
* TAS, SFP, GAL, FPGM use the ResNet-56 network structure on the CIFAR-10 dataset.

7.3 Connectivity Pruning and Overall Model Compression Results

Combining connectivity sparsity with pattern sparsity yields different DNN performance depending on the pattern library. Figure 6 (b) illustrates the testing accuracies when connectivity sparsity is trained on top of existing pattern sparsity. The curves clearly show that the designed pattern library (Phase 2) achieves better training performance and hence higher DNN accuracy. Similar behavior is observed at different compression rates and on different networks/datasets. Note that pattern sparsity already provides a 2.25× compression rate, and we add different connectivity compression rates on top of it to achieve the different overall compression rates. Table 1 records the best final DNN accuracies and compression rates together with the corresponding pattern-library phases, compared with several pruning methods and their sparsity types.

7.4 Performance Evaluation on Mobile Platform

In this part, we demonstrate our evaluation results on mobile devices to show the real-time inference of our proposed pattern-based sparse models with the help of the compiler-assisted inference framework. To guarantee fairness, all frameworks run the same pattern-based sparse model, and we enable the fully optimized configurations of TFLite, TVM and MNN (e.g., Winograd optimization is turned on).

Execution time. Figure 7 shows the mobile CPU/GPU execution time of the pattern-based models on different platforms.
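As a small worked example of the rate bookkeeping from Section 7.3 (pattern sparsity contributes a fixed 9/4 = 2.25× on 3×3 convolution layers, and connectivity pruning multiplies on top of it), the following sketch, with a helper name of our own choosing, recovers the connectivity rate implied by a given overall compression rate:

```python
# Pattern pruning keeps 4 of the 9 weights in every 3x3 kernel,
# so it contributes a fixed 9/4 = 2.25x compression on those layers.
PATTERN_RATE = 9 / 4

def connectivity_rate_needed(overall_rate):
    """Connectivity (whole-kernel) pruning rate required on top of the
    fixed 2.25x pattern rate to reach a given overall compression rate."""
    return overall_rate / PATTERN_RATE

# The overall rates of the models evaluated on-device decompose as:
for model, overall in [("ResNet-18", 8.0), ("ResNet-50", 5.8), ("VGG-16", 12.0)]:
    print(model, round(connectivity_rate_needed(overall), 2))
```

For instance, the 8× overall ResNet-18 model corresponds to roughly 3.56× connectivity pruning on top of the 2.25× pattern rate.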
Since the Phase 2 pattern library performs best for pruning, our test models use Phase 2 patterns with an 8× overall compression rate for ResNet-18, 5.8× for ResNet-50 and 12× for VGG-16. Inference uses images from the ImageNet dataset. Our approach achieves significant acceleration on mobile devices compared with the other frameworks. Real-time execution usually requires 30 frames/sec (i.e., 33 ms/frame). Our results show that all of our DNN models on ImageNet meet or far exceed this requirement, and some of them accomplish real-time inference even on the mobile CPU.

Fig. 7: Inference time (ms) comparisons of different mobile inference frameworks (ours vs. MNN, TVM, TFLite) on mobile CPU and GPU for VGG-16, ResNet-18 and ResNet-50, using images from the ImageNet dataset.

8 Conclusion

This paper proposes pattern-based sparsity, along with a highly efficient algorithm-level pruning framework and a novel compiler-level inference framework. Pattern-based sparsity inherits the flexibility of non-structured sparsity and the regularity of structured sparsity, achieving both a highly accurate, highly compressed model and hardware friendliness. In particular, with the carefully designed pattern library, pattern pruning achieves image enhancement and accuracy improvement. Pattern-based sparsity also elicits compiler optimizations, achieving real-time inference on mobile devices for various representative large-scale DNNs.

9 Acknowledgment

This work is supported by the National Science Foundation CCF-1919117, CCF-1937500 and CNS-1909172. We thank all anonymous reviewers for their feedback.

References

1. https://www.tensorflow.org/mobile/tflite/
2. https://github.com/alibaba/MNN
3.
Aravindh, M., Andrea, V.: Understanding deep image representations by inverting them. In: Computer Vision and Pattern Recognition, 2015. CVPR 2015. IEEE Conference on (2015)
4. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3(1), 1–122 (2011)
5. Chen, C.F., Oh, J., Fan, Q., Pistoia, M.: Sc-conv: Sparse-complementary convolution for efficient model utilization on cnns. In: 2018 IEEE International Symposium on Multimedia (ISM). pp. 97–100. IEEE (2018)
6. Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., et al.: TVM: An automated end-to-end optimizing compiler for deep learning. In: OSDI (2018)
7. Dai, X., Yin, H., Jha, N.K.: NeST: A neural network synthesis tool based on a grow-and-prune paradigm. IEEE Transactions on Computers 68(10), 1487–1497 (2019)
8. Dong, X., Yang, Y.: Network pruning via transformable architecture search. In: Advances in Neural Information Processing Systems. pp. 759–770 (2019)
9. Freeman, W., Adelson, E.: The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 13, 891–906 (1991)
10. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In: International Conference on Learning Representations (ICLR) (2016)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
12. He, Y., Kang, G., Dong, X., Fu, Y., Yang, Y.: Soft filter pruning for accelerating deep convolutional neural networks. In: International Joint Conference on Artificial Intelligence (IJCAI). pp. 2234–2240 (2018)
13.
He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4340–4349 (2019)
14. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: European Conference on Computer Vision. pp. 815–832 (2018)
15. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Computer Vision (ICCV), 2017 IEEE International Conference on. pp. 1398–1406. IEEE (2017)
16. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
17. Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., Doermann, D.: Towards optimal structured cnn pruning via generative adversarial learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2790–2799 (2019)
18. Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: CVPR. pp. 806–814 (2015)
19. Liu, Z., Sun, M., Zhou, T., Huang, G., Darrell, T.: Rethinking the value of network pruning. In: International Conference on Learning Representations (2019)
20. Ma, X., Yuan, G., Lin, S., Ding, C., Yu, F., Liu, T., Wen, W., Chen, X., Wang, Y.: Tiny but accurate: A pruned, quantized and optimized memristor crossbar framework for ultra efficient dnn implementation. ASP-DAC (2020)
21. Min, C., Wang, A., Chen, Y., Xu, W., Chen, X.: 2pfpce: Two-phase filter pruning based on conditional entropy. arXiv preprint arXiv:1809.02220 (2018)
22. Molchanov, P., Tyree, S., Karras, T., Aila, T., Kautz, J.: Pruning convolutional neural networks for resource efficient inference. arXiv preprint (2016)
23.
Mukund, S., Ankur, T., Qiqi, Y.: Axiomatic attribution for deep networks. In: 2017 International Conference on Machine Learning (ICML). ACM/IEEE (2017)
24. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., Emer, J., Keckler, S.W., Dally, W.J.: Scnn: An accelerator for compressed-sparse convolutional neural networks. In: ISCA (2017)
25. Parikh, N., Boyd, S.: Proximal algorithms. Foundations and Trends® in Optimization 1(3), 127–239 (2014)
26. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019)
27. Ren, A., Zhang, T., Ye, S., Xu, W., Qian, X., Lin, X., Wang, Y.: Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers. In: ASPLOS (2019)
28. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
29. Siyuan, M., Raef, B., Mikhail, B.: The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In: 2018 International Conference on Machine Learning (ICML). ACM/IEEE (2018)
30. Springenberg, J.T., Alexey Dosovitskiy, T.B.a.R.: Striving for simplicity: The all convolutional net. In: ICLR-2015 workshop track (2015)
31. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems. pp. 2074–2082 (2016)
32. Xu, M., Zhu, M., Liu, Y., Lin, F.X., Liu, X.: Deepcache: Principled cache for mobile deep vision. In: Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. pp. 129–144. ACM (2018)
33.
Yao, S., Hu, S., Zhao, Y., Zhang, A., Abdelzaher, T.: Deepsense: A unified deep learning framework for time-series mobile sensing data processing. In: Proceedings of the 26th International Conference on World Wide Web (2017)
34. You, Z., Yan, K., Ye, J., Ma, M., Wang, P.: Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 2130–2141 (2019)
35. Zhang, T., Zhang, K., Ye, S., Tang, J., Wen, W., Lin, X., Fardad, M., Wang, Y.: Adam-admm: A unified, systematic framework of structured weight pruning for dnns. arXiv preprint arXiv:1807.11091 (2018)
36. Zhu, X., Zhou, W., Li, H.: Improving deep neural network sparsity through decorrelation regularization. In: IJCAI (2018)
37. Zhuang, Z., Tan, M., Zhuang, B., Liu, J., Guo, Y., Wu, Q., Huang, J., Zhu, J.: Discrimination-aware channel pruning for deep neural networks. In: Advances in Neural Information Processing Systems. pp. 875–886 (2018)
