Enhanced Convolutional Neural Tangent Kernels


Authors: Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S. Du, Wei Hu, Ruslan Salakhutdinov, Sanjeev Arora

Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S. Du, Wei Hu, Ruslan Salakhutdinov, Sanjeev Arora

Abstract

Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are trained. An exact algorithm to compute the CNTK [Arora et al., 2019] yielded the finding that the classification accuracy of the CNTK on CIFAR-10 is within 6-7% of that of the corresponding CNN architecture (the best figure being around 78%), which is interesting performance for a fixed kernel. Here we show how to significantly enhance the performance of these kernels using two ideas. (1) Modifying the kernel using a new operation called Local Average Pooling (LAP), which preserves efficient computability of the kernel and inherits the spirit of standard data augmentation using pixel shifts. Earlier papers were unable to incorporate naive data augmentation because of the quadratic training cost of kernel regression. This idea is inspired by Global Average Pooling (GAP), which we show is equivalent to full translation data augmentation for CNN-GP and CNTK. (2) Representing the input image using a pre-processing technique proposed by Coates et al. [2011], which uses a single convolutional layer composed of random image patches. On CIFAR-10, the resulting kernel, CNN-GP with LAP and horizontal flip data augmentation, achieves 89% accuracy, matching the performance of AlexNet [Krizhevsky et al., 2012]. Note that this is the best such result we know of for a classifier that is not a trained neural network.
Similar improvements are obtained for Fashion-MNIST.

Author affiliations: Zhiyuan Li (Princeton University; zhiyuanli@cs.princeton.edu), Ruosong Wang (Carnegie Mellon University; ruosongw@andrew.cmu.edu), Dingli Yu (Princeton University; dingliy@cs.princeton.edu), Simon S. Du (Institute for Advanced Study; ssdu@ias.edu), Wei Hu (Princeton University; huwei@cs.princeton.edu), Ruslan Salakhutdinov (Carnegie Mellon University; rsalakhu@cs.cmu.edu), Sanjeev Arora (Princeton University and Institute for Advanced Study; arora@cs.princeton.edu). The first three authors contributed equally.

1 Introduction

Recent research shows that for training with $\ell_2$ loss, convolutional neural networks (CNNs) whose width (number of channels in convolutional layers) goes to infinity correspond to regression with respect to the CNN Gaussian Process kernel (CNN-GP) if only the last layer is trained, and correspond to regression with respect to the Convolutional Neural Tangent Kernel (CNTK) if all layers are trained [Jacot et al., 2018]. An exact algorithm was given [Arora et al., 2019] to compute the CNTK for CNN architectures, as well as those that include a Global Average Pooling (GAP) layer (defined below). This is a fixed kernel that inherits some benefits of CNNs, including exploitation of locality via convolution, as well as multiple layers of processing. For CIFAR-10, incorporating GAP into the kernel improves classification accuracy by up to 10% compared to the pure convolutional CNTK. While this performance is encouraging for a fixed kernel, the best accuracy is still under 78%, which is disappointing even compared to AlexNet. One hope for improving the accuracy further is to somehow capture modern innovations such as batch normalization, data augmentation, and residual layers in the CNTK. The current paper shows how to incorporate simple data augmentation.
Specifically, data augmentation here means creating new training images from existing images using pixel translations and flips, while assuming that these operations do not change the label. Since deep learning uses stochastic gradient descent (SGD), it is trivial to do such data augmentation on the fly. However, it is unclear how to efficiently incorporate data augmentation into kernel regression, since training time is quadratic in the number of training images. Thus data augmentation somehow has to be incorporated into the computation of the kernel itself.

The main observation here is that the above-mentioned algorithm for computing the CNTK involves a dynamic program whose recursion depth is equal to the depth of the corresponding finite CNN. It is possible to impose symmetry constraints at any desired layer during this computation. From this viewpoint, it can be shown that prediction using CNTK/CNN-GP with GAP is equivalent to prediction using CNTK/CNN-GP without GAP but with full translation data augmentation with wrap-around at the boundary. The translation invariance property implicitly assumed in data augmentation is exactly equivalent to an imposed symmetry constraint in the computation of the CNTK, which in turn is derived from the pooling layer in the CNN. See Section 4 for more details.

Thus GAP corresponds to the full translation data augmentation scheme, but in practice such data augmentation creates unrealistic images (cf. Figure 1) and training on them can harm performance. However, the idea of incorporating symmetry into the dynamic program leads to a variant we call Local Average Pooling (LAP). This is implicitly like data augmentation where image labels are assumed to be invariant to small translations, say by a few pixels. This operation also suggests a new pooling layer for CNNs, which we call BBlur, and we find it beneficial for CNNs in experiments as well.
Experimentally, we find that LAP significantly enhances performance, as discussed below.

• In extensive experiments on CIFAR-10 and Fashion-MNIST, we find that LAP consistently improves the performance of CNN-GP and CNTK. In particular, we find that CNN-GP with LAP achieves 81% on the CIFAR-10 dataset, outperforming the best previous kernel predictor by 3%.

• When using the technique proposed by Coates et al. [2011], which uses randomly sampled patches from the training data as filters for pre-processing,¹ CNN-GP with LAP and horizontal flip data augmentation achieves 89% accuracy on CIFAR-10, matching the performance of AlexNet [Krizhevsky et al., 2012]; this is the strongest classifier that is not a trained neural network.²

• We also derive a layer for CNNs that corresponds to LAP and observe that it improves performance on certain architectures.

¹ See Section 6.2 for the precise procedure.
² https://benchmarks.ai/cifar-10

2 Related Work

Data augmentation has long been known to improve the performance of neural networks and kernel methods [Sietsma and Dow, 1991, Schölkopf et al., 1996]. Theoretical study of data augmentation dates back to Chapelle et al. [2001]. Recently, Dao et al. [2018] proposed a theoretical framework for understanding data augmentation and showed that data augmentation with a kernel classifier can have feature averaging and variance regularization effects. More recently, Chen et al. [2019] quantitatively showed that in certain settings, data augmentation provably improves classifier performance. For a more comprehensive discussion of data augmentation and its properties, we refer readers to Dao et al. [2018], Chen et al. [2019] and references therein.

CNN-GP and CNTK correspond to infinitely wide CNNs with different training strategies (only training the top layer, or training all layers jointly).
The correspondence between infinite neural networks and kernel machines was first noted by Neal [1996]. More recently, this was extended to deep and convolutional neural networks [Lee et al., 2018, Matthews et al., 2018, Novak et al., 2019, Garriga-Alonso et al., 2019]. These kernels correspond to neural networks where only the last layer is trained. A recent line of work studied overparameterized neural networks where all layers are trained [Allen-Zhu et al., 2018, Du et al., 2019b, 2018, Li and Liang, 2018, Zou et al., 2018]. Their proofs imply that the gradient kernel is close to a fixed kernel which depends only on the training data and the neural network architecture; these kernels thus correspond to neural networks where all layers are trained. Jacot et al. [2018] named this kernel the neural tangent kernel (NTK). Arora et al. [2019] formally proved that a polynomially wide neural net predictor trained by gradient descent is equivalent to the NTK predictor. Recently, NTKs induced by various neural network architectures have been derived and shown to achieve strong empirical performance [Arora et al., 2019, Yang, 2019, Du et al., 2019a].

Global Average Pooling (GAP) was first proposed in Lin et al. [2013] and is common in modern CNN design [Springenberg et al., 2014, He et al., 2016, Huang et al., 2017]. However, current theoretical understanding of GAP is still rather limited. It has been conjectured in Lin et al. [2013] that GAP reduces the number of parameters in the last fully-connected layer and thus avoids overfitting, and that GAP is more robust to spatial translations of the input since it sums out the spatial information. In this work, we study GAP from the CNN-GP and CNTK perspective, and draw an interesting connection between GAP and data augmentation.

The approach proposed in Coates et al.
[2011] is one of the best-performing approaches on CIFAR-10 preceding modern CNNs. In this work we combine the CNTK with LAP and the approach in Coates et al. [2011] to achieve the best performance among classifiers that are not trained neural networks.

3 Preliminaries

3.1 Notation

We use bold-faced letters for vectors, matrices and tensors. For a vector $\mathbf{a}$, we use $[\mathbf{a}]_i$ to denote its $i$-th entry. For a matrix $\mathbf{A}$, we use $[\mathbf{A}]_{i,j}$ to denote its $(i,j)$-th entry. For an order-4 tensor $\mathbf{T}$, we use $[\mathbf{T}]_{i,j,i',j'}$ to denote its $(i,j,i',j')$-th entry, and we use $\mathrm{tr}(\mathbf{T})$ to denote $\sum_{i,j} [\mathbf{T}]_{i,j,i,j}$. For an order-$d$ tensor $\mathbf{T} \in \mathbb{R}^{C_1 \times C_2 \times \cdots \times C_d}$ and an integer $\alpha \in [C_d]$, we use $\mathbf{T}(\alpha) \in \mathbb{R}^{C_1 \times C_2 \times \cdots \times C_{d-1}}$ to denote the order-$(d-1)$ tensor formed by fixing the coordinate of the last dimension to be $\alpha$.

3.2 CNN, CNN-GP and CNTK

In this section we give formal definitions of the CNN, CNN-GP and CNTK that we study in this paper. Throughout the paper, we let $P$ be the width and $Q$ be the height of the image. We use $q \in \mathbb{Z}_+$ to denote the filter size. In practice, $q = 1, 3, 5$ or $7$.

Padding Schemes. In the definitions of CNN, CNTK and CNN-GP, we may use different padding schemes. Let $\mathbf{x} \in \mathbb{R}^{P \times Q}$ be an image. For a given index pair $(i,j)$ with $i \le 0$, $i \ge P+1$, $j \le 0$ or $j \ge Q+1$, different padding schemes define different values for $[\mathbf{x}]_{i,j}$. For circular padding, we define $[\mathbf{x}]_{i,j}$ to be $[\mathbf{x}]_{i \bmod P,\, j \bmod Q}$. For zero padding, we simply define $[\mathbf{x}]_{i,j}$ to be $0$. Note that the difference between circular padding and zero padding occurs only on the boundary of images. We will prove our theoretical results for the circular padding scheme to avoid boundary effects.

CNN. Now we describe CNNs with and without GAP. For any input image $\mathbf{x}$, after $L$ intermediate layers, we obtain $\mathbf{x}^{(L)} \in \mathbb{R}^{P \times Q \times C^{(L)}}$, where $C^{(L)}$ is the number of channels of the last layer.
See Section A for the definition of $\mathbf{x}^{(L)}$. For the output, there are two choices: with and without GAP.

• Without GAP: the final output is defined as
$$f(\boldsymbol{\theta}, \mathbf{x}) = \sum_{\alpha=1}^{C^{(L)}} \left\langle \mathbf{W}^{(L+1)}(\alpha),\; \mathbf{x}^{(L)}(\alpha) \right\rangle,$$
where $\mathbf{x}^{(L)}(\alpha) \in \mathbb{R}^{P \times Q}$ and $\mathbf{W}^{(L+1)}(\alpha) \in \mathbb{R}^{P \times Q}$ is the weight of the last fully-connected layer.

• With GAP: the final output is defined as
$$f(\boldsymbol{\theta}, \mathbf{x}) = \frac{1}{PQ} \sum_{\alpha=1}^{C^{(L)}} \mathbf{W}^{(L+1)}(\alpha) \cdot \sum_{(i,j) \in [P] \times [Q]} \left[\mathbf{x}^{(L)}(\alpha)\right]_{i,j},$$
where $\mathbf{W}^{(L+1)}(\alpha) \in \mathbb{R}$ is the weight of the last fully-connected layer.

CNN-GP and CNTK. Now we describe CNN-GP and CNTK. Let $\mathbf{x}, \mathbf{x}'$ be two input images. We denote the $L$-th layer's CNN-GP kernel as $\boldsymbol{\Sigma}^{(L)}(\mathbf{x}, \mathbf{x}') \in \mathbb{R}^{P \times Q \times P \times Q}$ and the $L$-th layer's CNTK kernel as $\boldsymbol{\Theta}^{(L)}(\mathbf{x}, \mathbf{x}') \in \mathbb{R}^{P \times Q \times P \times Q}$. See Section A for the precise definitions of $\boldsymbol{\Sigma}^{(L)}(\mathbf{x}, \mathbf{x}')$ and $\boldsymbol{\Theta}^{(L)}(\mathbf{x}, \mathbf{x}')$. For the output kernel value, again, there are two choices: without GAP (equivalent to using a fully-connected layer) or with GAP.

• Without GAP: the output of CNN-GP is $\Sigma_{\mathrm{FC}}(\mathbf{x}, \mathbf{x}') = \mathrm{tr}\left(\boldsymbol{\Sigma}^{(L)}(\mathbf{x}, \mathbf{x}')\right)$ and the output of CNTK is $\Theta_{\mathrm{FC}}(\mathbf{x}, \mathbf{x}') = \mathrm{tr}\left(\boldsymbol{\Theta}^{(L)}(\mathbf{x}, \mathbf{x}')\right)$.

• With GAP: the output of CNN-GP is
$$\Sigma_{\mathrm{GAP}}(\mathbf{x}, \mathbf{x}') = \frac{1}{P^2 Q^2} \sum_{(i,j,i',j') \in [P] \times [Q] \times [P] \times [Q]} \left[\boldsymbol{\Sigma}^{(L)}(\mathbf{x}, \mathbf{x}')\right]_{i,j,i',j'}$$
and the output of CNTK is
$$\Theta_{\mathrm{GAP}}(\mathbf{x}, \mathbf{x}') = \frac{1}{P^2 Q^2} \sum_{(i,j,i',j') \in [P] \times [Q] \times [P] \times [Q]} \left[\boldsymbol{\Theta}^{(L)}(\mathbf{x}, \mathbf{x}')\right]_{i,j,i',j'}.$$

Kernel Prediction. Lastly, we recall the formula for kernel regression. For simplicity, throughout the paper we will assume all kernels are invertible. Given a kernel $K(\mathbf{x}, \mathbf{x}')$ and a dataset $(\mathcal{X}, \mathbf{y})$ with data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, define $\mathbf{K}_{\mathcal{X}} \in \mathbb{R}^{N \times N}$ by $[\mathbf{K}_{\mathcal{X}}]_{i,j} = K(\mathbf{x}_i, \mathbf{x}_j)$. The prediction for an unseen data point $\mathbf{x}_0$ is $\sum_{i=1}^N \alpha_i K(\mathbf{x}_0, \mathbf{x}_i)$ where $\boldsymbol{\alpha} = \mathbf{K}_{\mathcal{X}}^{-1} \mathbf{y}$.
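As a quick illustration, the kernel prediction formula can be sketched in a few lines of numpy. The function name `kernel_predict` is ours, the RBF kernel is only a cheap stand-in for CNTK/CNN-GP kernel values, and the `reg` parameter anticipates the ridge regularization used in the experiments later:

```python
import numpy as np

def kernel_predict(K, X_train, y_train, X_test, reg=0.0):
    # Kernel (ridge) regression: alpha = (K_X + reg*I)^{-1} y,
    # prediction f(x0) = sum_i alpha_i K(x0, x_i).
    N = len(X_train)
    K_X = np.array([[K(a, b) for b in X_train] for a in X_train])
    alpha = np.linalg.solve(K_X + reg * np.eye(N), y_train)
    K_test = np.array([[K(x0, xi) for xi in X_train] for x0 in X_test])
    return K_test @ alpha

# Toy run with an RBF kernel standing in for the (expensive) CNTK/CNN-GP values.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
y = np.array([0.0, 1.0, 4.0])
pred = kernel_predict(rbf, X, y, X)  # with reg=0, interpolates training labels
```

With `reg=0` and an invertible kernel matrix, the predictor interpolates the training labels exactly, which is the setting the theory in this paper assumes.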
3.3 Data Augmentation Schemes

In this paper we consider two types of data augmentation schemes: translation and horizontal flip.

Translation. Given $(i,j) \in [P] \times [Q]$, we define the translation operator $\mathcal{T}_{i,j}: \mathbb{R}^{P \times Q \times C} \to \mathbb{R}^{P \times Q \times C}$ as follows. For an image $\mathbf{x} \in \mathbb{R}^{P \times Q \times C}$, $[\mathcal{T}_{i,j}(\mathbf{x})]_{i',j',c} = [\mathbf{x}]_{i'+i,\, j'+j,\, c}$ for $(i',j',c) \in [P] \times [Q] \times [C]$. Here the precise definition of $[\mathbf{x}]_{i'+i,\, j'+j,\, c}$ depends on the padding scheme. Given a dataset $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^N$, the full translation data augmentation scheme creates a new dataset $\mathcal{D}_{\mathcal{T}} = \{(\mathcal{T}_{i,j}(\mathbf{x}_n), y_n)\}_{(i,j,n) \in [P] \times [Q] \times [N]}$ and training is performed on $\mathcal{D}_{\mathcal{T}}$.

Horizontal Flip. For an image $\mathbf{x} \in \mathbb{R}^{P \times Q \times C}$, the flip operator $\mathcal{F}: \mathbb{R}^{P \times Q \times C} \to \mathbb{R}^{P \times Q \times C}$ is defined by $[\mathcal{F}(\mathbf{x})]_{i,j,c} = [\mathbf{x}]_{P+1-i,\, j,\, c}$ for $(i,j,c) \in [P] \times [Q] \times [C]$. Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$, the horizontal flip data augmentation scheme creates a new dataset $\mathcal{D}_{\mathcal{F}} = \{(\mathcal{F}(\mathbf{x}_i), y_i)\}_{i=1}^N$ and training is performed on $\mathcal{D}_{\mathcal{F}} \cup \mathcal{D}$.

4 Equivalence Between Augmented Kernels and Data Augmentation

In this section, we demonstrate the equivalence between data augmentation and augmented kernels. To formally discuss the equivalence, we use group theory to describe the translation and horizontal flip operators. We provide the definition of a group in Section B for completeness. It is easy to verify that $\{\mathcal{F}, \mathcal{I}\}$, $\{\mathcal{T}_{i,j}\}_{(i,j) \in [P] \times [Q]}$, and $\{\mathcal{T}_{i,j} \circ \mathcal{F}\}_{(i,j) \in [P] \times [Q]} \cup \{\mathcal{T}_{i,j}\}_{(i,j) \in [P] \times [Q]}$ are groups, where $\mathcal{I}$ is the identity map. From now on, given a dataset $(\mathcal{X}, \mathbf{y})$ with data $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ and a group $\mathcal{G}$, the augmented dataset $(\mathcal{X}_{\mathcal{G}}, \mathbf{y}_{\mathcal{G}})$ is defined to be $\{(g(\mathbf{x}_i), y_i)\}_{g \in \mathcal{G},\, i \in [N]}$. The prediction for an unseen data point $\mathbf{x}_0$ on the augmented dataset is $\sum_{i \in [N],\, g \in \mathcal{G}} \tilde{\alpha}_{i,g} K(\mathbf{x}_0, g(\mathbf{x}_i))$ where $\tilde{\boldsymbol{\alpha}} = \left(\mathbf{K}_{\mathcal{X}_{\mathcal{G}}}\right)^{-1} \mathbf{y}_{\mathcal{G}}$.
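These operators, and the augmented-dataset regression just described, are easy to realize in a short numpy sketch. All names below are ours, and the RBF kernel is only a toy stand-in for CNTK/CNN-GP: flips and circular translations permute pixels, so the toy kernel is equivariant, and the sketch checks numerically that regression with the flip-augmented kernel on the original dataset matches regression with the plain kernel on the flip-augmented dataset, which is the equivalence established in Section 4:

```python
import numpy as np

def translate(x, di, dj):
    # Circular translation: [T_{di,dj}(x)]_{i',j',c} = [x]_{i'+di, j'+dj, c} (mod P, Q)
    return np.roll(x, shift=(-di, -dj), axis=(0, 1))

def hflip(x):
    # Horizontal flip: [F(x)]_{i,j,c} = [x]_{P+1-i, j, c}
    return x[::-1]

def K(a, b):
    # Toy RBF kernel on raw pixels; flips and circular translations permute
    # pixels, so ||g(x) - g(x')|| = ||x - x'|| and K is equivariant (Def. 4.1).
    return np.exp(-0.1 * np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
X = [rng.normal(size=(4, 4, 1)) for _ in range(5)]
y = rng.normal(size=5)
G = [lambda x: x, hflip]  # the group {I, F}

def KG(a, b):
    # Augmented kernel: K^G(x, x') = E_{g in G} K(g(x), x')
    return np.mean([K(g(a), b) for g in G])

# Prediction of the augmented kernel on the original dataset ...
K_X = np.array([[KG(a, b) for b in X] for a in X])
alpha = np.linalg.solve(K_X, y)
x0 = rng.normal(size=(4, 4, 1))
p1 = sum(a * KG(x0, xi) for a, xi in zip(alpha, X))

# ... equals prediction of the plain kernel on the flip-augmented dataset.
XG = [g(x) for x in X for g in G]
yG = np.array([yi for yi in y for _ in G])
K_XG = np.array([[K(a, b) for b in XG] for a in XG])
alpha_t = np.linalg.solve(K_XG, yG)
p2 = sum(a * K(x0, xi) for a, xi in zip(alpha_t, XG))
# p1 and p2 agree up to numerical error
```

The augmented kernel matrix here is only $N \times N$, while the augmented-dataset matrix is $|\mathcal{G}|N \times |\mathcal{G}|N$, which previews the computational point made after Corollary 4.2.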
To proceed, we define the concept of an augmented kernel. Let $\mathcal{G}$ be a finite group. Define the augmented kernel $K^{\mathcal{G}}$ as
$$K^{\mathcal{G}}(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{g \in \mathcal{G}}\, K(g(\mathbf{x}), \mathbf{x}'),$$
where $\mathbf{x}, \mathbf{x}'$ are two input images and $g$ is drawn from $\mathcal{G}$ uniformly at random.

A key observation is that for the CNTK and CNN-GP, when circular padding and GAP are adopted, the corresponding kernel is the augmented kernel of the group $\mathcal{G} = \{\mathcal{T}_{i,j}\}_{(i,j) \in [P] \times [Q]}$. Formally, we have
$$\Sigma_{\mathrm{GAP}}(\mathbf{x}, \mathbf{x}') = \frac{1}{PQ} \Sigma_{\mathrm{FC}}^{\mathcal{G}}(\mathbf{x}, \mathbf{x}') \quad \text{and} \quad \Theta_{\mathrm{GAP}}(\mathbf{x}, \mathbf{x}') = \frac{1}{PQ} \Theta_{\mathrm{FC}}^{\mathcal{G}}(\mathbf{x}, \mathbf{x}'),$$
which can be seen by checking the formulas of these kernels and using the definition of circular padding. Similarly, the following equivariance property holds for $\Sigma_{\mathrm{GAP}}$, $\Sigma_{\mathrm{FC}}$, $\Theta_{\mathrm{GAP}}$ and $\Theta_{\mathrm{FC}}$, under all groups mentioned above, including $\{\mathcal{F}, \mathcal{I}\}$ and $\{\mathcal{T}_{i,j}\}_{(i,j) \in [P] \times [Q]}$.

Definition 4.1. A kernel $K$ is equivariant under a group $\mathcal{G}$ if and only if for any $g \in \mathcal{G}$, $K(g(\mathbf{x}), g(\mathbf{x}')) = K(\mathbf{x}, \mathbf{x}')$.

The following theorem formally states the equivalence between using an augmented kernel on the original dataset and using the kernel on the augmented dataset.

Theorem 4.1. Given a group $\mathcal{G}$ and a kernel $K$ such that $K$ is equivariant under $\mathcal{G}$, the prediction of the augmented kernel $K^{\mathcal{G}}$ with dataset $(\mathcal{X}, \mathbf{y})$ is equal to that of the kernel $K$ with augmented dataset $(\mathcal{X}_{\mathcal{G}}, \mathbf{y}_{\mathcal{G}})$. Namely, for any $\mathbf{x}_0 \in \mathbb{R}^{P \times Q \times C}$,
$$\sum_{i=1}^N \alpha_i K^{\mathcal{G}}(\mathbf{x}_0, \mathbf{x}_i) = \sum_{i \in [N],\, g \in \mathcal{G}} \tilde{\alpha}_{i,g} K(\mathbf{x}_0, g(\mathbf{x}_i)),$$
where $\boldsymbol{\alpha} = \left(\mathbf{K}^{\mathcal{G}}_{\mathcal{X}}\right)^{-1} \mathbf{y}$ and $\tilde{\boldsymbol{\alpha}} = \left(\mathbf{K}_{\mathcal{X}_{\mathcal{G}}}\right)^{-1} \mathbf{y}_{\mathcal{G}}$.

The proof is deferred to Appendix B. Theorem 4.1 implies the following two corollaries.

Corollary 4.1. For $\mathcal{G} = \{\mathcal{T}_{i,j}\}_{(i,j) \in [P] \times [Q]}$ and any given dataset $\mathcal{D}$, the prediction of $\Sigma_{\mathrm{GAP}}$ (or $\Theta_{\mathrm{GAP}}$) with dataset $\mathcal{D}$ is equal to the prediction of $\Sigma_{\mathrm{FC}}$ (or $\Theta_{\mathrm{FC}}$) with augmented dataset $\mathcal{D}_{\mathcal{T}}$.

Corollary 4.2.
For $\mathcal{G} = \{\mathcal{F}, \mathcal{I}\}$ and any given dataset $\mathcal{D}$, the prediction of $\Sigma^{\mathcal{G}}_{\mathrm{GAP}}$ (or $\Theta^{\mathcal{G}}_{\mathrm{GAP}}$) with dataset $\mathcal{D}$ is equal to the prediction of $\Sigma_{\mathrm{GAP}}$ (or $\Theta_{\mathrm{GAP}}$) with augmented dataset $\mathcal{D}_{\mathcal{F}} \cup \mathcal{D}$.

Now we discuss the implications of Theorem 4.1 and its corollaries. Naively applying data augmentation, with full translation on the CNTK or CNN-GP for example, one needs to create a much larger kernel matrix, since there are $PQ$ translation operators; this is often computationally infeasible. Instead, one can directly use the augmented kernel ($\Sigma_{\mathrm{GAP}}$ or $\Theta_{\mathrm{GAP}}$ for the case of full translation on CNTK or CNN-GP) for prediction, for which one only needs to create a kernel matrix as large as the original one. For horizontal flip, although the augmented kernel cannot be computed as conveniently as for full translation, Corollary 4.2 still provides a more efficient method for computing kernel values and solving kernel regression, since the augmented dataset is twice as large as the original dataset, while the kernel matrix of the augmented kernel is only as large as the original one.

Figure 1: Randomly sampled images from CIFAR-10 with (a) full translation data augmentation (GAP) and (b) local translation data augmentation (LAP with c = 4). Full translation data augmentation can create unrealistic images that harm performance, whereas local translation data augmentation creates more realistic images.

5 Local Average Pooling

In this section, we introduce a new operation called Local Average Pooling (LAP). As discussed in the introduction, full translation data augmentation may create unrealistic images. A natural idea is to do local translation data augmentation, i.e., to restrict the distance of translation. More specifically, we only allow translation operators $\mathcal{T}_{\Delta_i, \Delta_j}$ (cf. Section 3.3) for $(\Delta_i, \Delta_j) \in [-c, c] \times [-c, c]$, where $c$ is a parameter that controls the amount of allowed translation.
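A minimal numpy sketch of this restricted translation set (assuming circular padding; `local_translations` is our own name) makes the cost of doing this naively explicit: the dataset grows by a factor of $(2c+1)^2$, which is exactly the blow-up LAP avoids:

```python
import numpy as np

def local_translations(x, c):
    # All locally translated copies T_{di,dj}(x) for (di, dj) in [-c, c]^2,
    # using circular (wrap-around) padding via np.roll.
    return [np.roll(x, shift=(-di, -dj), axis=(0, 1))
            for di in range(-c, c + 1) for dj in range(-c, c + 1)]

x = np.arange(25.0).reshape(5, 5)
copies = local_translations(x, 1)  # (2c+1)^2 = 9 augmented copies
```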
With a proper choice of the parameter $c$, translation data augmentation will not create unrealistic images (cf. Figure 1). However, naive local translation data augmentation is computationally infeasible for kernel methods, even for moderate choices of $c$. To remedy this issue, in this section we introduce LAP, which is inspired by the connection between full translation data augmentation and GAP on CNN-GP and CNTK.

Here, for simplicity, we assume $P = Q$ and derive the formula only for the CNTK; the formula generalizes to CNN-GP in a straightforward manner. Recall that for two given images $\mathbf{x}$ and $\mathbf{x}'$, without GAP, the output of the CNTK is $\mathrm{tr}(\boldsymbol{\Theta}(\mathbf{x}, \mathbf{x}'))$. With GAP, the output of the CNTK is
$$\frac{1}{P^4} \sum_{(i,j,i',j') \in [P]^4} \left[\boldsymbol{\Theta}(\mathbf{x}, \mathbf{x}')\right]_{i,j,i',j'}.$$
With circular padding, the formula can be rewritten as
$$\frac{1}{P^2}\, \mathbb{E}_{\Delta_i, \Delta'_i, \Delta_j, \Delta'_j \sim [P]^4} \sum_{(i,j) \in [P] \times [P]} \left[\boldsymbol{\Theta}(\mathbf{x}, \mathbf{x}')\right]_{i+\Delta_i,\, j+\Delta_j,\, i+\Delta'_i,\, j+\Delta'_j},$$
which is again equal to
$$\frac{1}{P^2}\, \mathbb{E}_{\Delta_i, \Delta'_i, \Delta_j, \Delta'_j \sim [P]^4}\, \mathrm{tr}\left(\boldsymbol{\Theta}\left(\mathcal{T}_{\Delta_i, \Delta_j}(\mathbf{x}),\; \mathcal{T}_{\Delta'_i, \Delta'_j}(\mathbf{x}')\right)\right).$$
We ignore the $1/P^2$ scaling factor since it plays no role in kernel regression.

Now we consider restricted translation operators $\mathcal{T}_{\Delta_i, \Delta_j}$ with $(\Delta_i, \Delta_j) \in [-c, c] \times [-c, c]$ and derive the formula for LAP. Assuming circular padding, we have
$$\mathbb{E}_{\Delta_i, \Delta'_i, \Delta_j, \Delta'_j \sim [-c,c]^4}\, \mathrm{tr}\left(\boldsymbol{\Theta}\left(\mathcal{T}_{\Delta_i, \Delta_j}(\mathbf{x}),\; \mathcal{T}_{\Delta'_i, \Delta'_j}(\mathbf{x}')\right)\right) = \frac{1}{(2c+1)^4} \sum_{\Delta_i, \Delta'_i, \Delta_j, \Delta'_j \in [-c,c]^4}\; \sum_{(i,j) \in [P]^2} \left[\boldsymbol{\Theta}(\mathbf{x}, \mathbf{x}')\right]_{i+\Delta_i,\, j+\Delta_j,\, i+\Delta'_i,\, j+\Delta'_j}. \quad (1)$$
The right-hand side of Equation (1) is the formula for LAP. Notice that the RHS of Equation (1) is a well-defined quantity for all padding schemes. In particular, assuming zero padding, when $c = P$, LAP is equivalent to GAP; when $c = 0$, LAP is equivalent to having no pooling layer.
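A direct (unoptimized) sketch of the LAP readout on the right-hand side of Equation (1), assuming the order-4 CNTK tensor $\boldsymbol{\Theta}(\mathbf{x}, \mathbf{x}')$ has already been computed. The function name and the brute-force loops are ours, not the paper's optimized implementation:

```python
import numpy as np

def lap(theta, c, padding="zero"):
    # LAP readout (RHS of Eq. (1)): average, over translations of both inputs
    # by at most c in each direction, of the diagonal sum over theta.
    P = theta.shape[0]
    total = 0.0
    shifts = range(-c, c + 1)
    for di in shifts:
        for dj in shifts:
            for dip in shifts:
                for djp in shifts:
                    for i in range(P):
                        for j in range(P):
                            idx = (i + di, j + dj, i + dip, j + djp)
                            if padding == "circular":
                                total += theta[tuple(k % P for k in idx)]
                            elif all(0 <= k < P for k in idx):
                                # zero padding: out-of-range entries contribute 0
                                total += theta[idx]
    return total / (2 * c + 1) ** 4

theta = np.random.default_rng(1).normal(size=(4, 4, 4, 4))
# c = 0 recovers the no-pooling readout tr(theta); with zero padding, c = P
# recovers GAP up to the overall scaling factor, which plays no role.
```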
Another advantage of LAP is that it does not incur any extra computational cost, since the formula in Equation (1) can be rewritten as
$$\sum_{(i,j,i',j') \in [P]^4} [\mathbf{w}]_{i,j,i',j'} \cdot \left[\boldsymbol{\Theta}(\mathbf{x}, \mathbf{x}')\right]_{i,j,i',j'},$$
where each entry of the weight tensor $\mathbf{w}$ can be calculated in constant time.

Note that the GAP operation in CNN-GP and CNTK corresponds to the GAP layer in CNNs. Here we observe that the following box blur layer corresponds to LAP in CNNs. The box blur layer (BBlur) is a function $\mathbb{R}^{P \times Q} \to \mathbb{R}^{P \times Q}$ such that
$$[\mathrm{BBlur}(\mathbf{x})]_{i,j} = \frac{1}{(2c+1)^2} \sum_{(\Delta_i, \Delta_j) \in [-c,c]^2} [\mathbf{x}]_{i+\Delta_i,\, j+\Delta_j}.$$
This is in fact the standard average pooling layer with pooling size $2c+1$ and stride 1. We prove the equivalence between LAP and the box blur layer in Appendix C. In Section 6.3, we verify the effectiveness of BBlur on CNNs via experiments.

6 Experiments

In this section we present our empirical findings on CIFAR-10 [Krizhevsky, 2009] and Fashion-MNIST [Xiao et al., 2017].

Experimental Setup. For both CIFAR-10 and Fashion-MNIST we use the full training set and report the test accuracy on the full test set. Throughout this section we only consider $3 \times 3$ convolutional filters with stride 1 and no dilation. In the convolutional layers in CNTK and CNN-GP, we use zero padding with pad size 1 to ensure the input of each layer has the same size. We use zero padding for LAP throughout the experiments. We perform standard preprocessing (mean subtraction and division by the standard deviation) for all images. In all experiments, we perform kernel ridge regression to utilize the calculated kernel values.³ We normalize the kernel matrices so that all diagonal entries are ones; equivalently, we ensure all features have unit norm in the RKHS. Since the resulting kernel matrices are usually ill-conditioned, we set the regularization term $\lambda = 5 \times 10^{-5}$ to make inverting the kernel matrices numerically stable.
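For concreteness, the BBlur layer from Section 5, i.e., average pooling with window $2c+1$ and stride 1, can be sketched in numpy as follows. This is our own toy implementation supporting both padding schemes, not the training code used in the experiments:

```python
import numpy as np

def bblur(x, c, padding="circular"):
    # Box blur (BBlur): average pooling with window (2c+1) x (2c+1), stride 1:
    # [BBlur(x)]_{i,j} = (2c+1)^{-2} * sum_{|di|,|dj| <= c} [x]_{i+di, j+dj}.
    P, Q = x.shape
    out = np.zeros((P, Q))
    for di in range(-c, c + 1):
        for dj in range(-c, c + 1):
            if padding == "circular":
                out += np.roll(x, shift=(-di, -dj), axis=(0, 1))
            else:  # zero padding: shift and fill the vacated border with zeros
                shifted = np.zeros((P, Q))
                shifted[max(-di, 0):P + min(-di, 0), max(-dj, 0):Q + min(-dj, 0)] = \
                    x[max(di, 0):P + min(di, 0), max(dj, 0):Q + min(dj, 0)]
                out += shifted
    return out / (2 * c + 1) ** 2

x = np.arange(25.0).reshape(5, 5)
```

With circular padding this should agree with a standard stride-1 average pooling (e.g., `scipy.ndimage.uniform_filter` with `mode='wrap'`), though we have not benchmarked it against any particular framework.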
³ We also tried kernel SVM but found that it significantly degraded performance, so we do not include those results.

 c \ d       5              8              11             14
 0       66.55 (69.87)  66.27 (69.87)  65.85 (69.37)  65.47 (68.90)
 4       77.06 (79.08)  77.14 (78.96)  77.06 (78.98)  76.52 (78.74)
 8       79.24 (80.95)  79.25 (81.03)  78.98 (80.94)  78.65 (80.35)
 12      80.11 (81.34)  79.79 (81.28)  79.29 (81.14)  79.13 (80.91)
 16      79.80 (81.21)  79.71 (81.40)  79.74 (81.09)  79.42 (81.00)
 20      79.24 (80.67)  79.27 (80.88)  79.30 (80.76)  78.92 (80.39)
 24      78.07 (79.88)  78.16 (79.79)  78.14 (80.06)  77.87 (80.07)
 28      76.91 (78.69)  77.33 (79.20)  77.65 (79.56)  77.65 (79.74)
 32      76.79 (78.53)  77.39 (79.13)  77.63 (79.51)  77.63 (79.74)

Table 1: Test accuracy of CNTK on CIFAR-10.

We use one-hot encodings of the labels as regression targets. We use scipy.linalg.solve to solve the corresponding kernel ridge regression problem. The kernel values of CNTK and CNN-GP are calculated using the CuPy package, and we write native CUDA code to speed up the calculation. All experiments are performed on Amazon Web Services (AWS), using (possibly multiple) NVIDIA Tesla V100 GPUs. For efficiency, all kernel values are computed with 32-bit precision.

One unique advantage of the dynamic programming algorithm for calculating CNTK and CNN-GP is that we do not need to repeat experiments for, say, different values of $c$ in LAP or different depths. With our highly-optimized native CUDA code, we spend roughly 1,000 GPU hours calculating all kernel values for each dataset.

6.1 Ablation Study on CIFAR-10 and Fashion-MNIST

We perform experiments to study the effect of different values of the $c$ parameter in LAP, and of horizontal flip data augmentation, on CNTK and CNN-GP.
For the experiments in this section we set the bias term in CNTK and CNN-GP to $\gamma = 0$ (cf. Section A). We use the same architecture for CNTK and CNN-GP as in Arora et al. [2019], i.e., we stack multiple convolutional layers before the final pooling layer. We use $d$ to denote the number of convolutional layers, and in our experiments we set $d$ to be 5, 8, 11 or 14 to study the effect of depth on CNTK and CNN-GP. For CIFAR-10, we set the $c$ parameter in LAP to be $0, 4, \ldots, 32$, while for Fashion-MNIST we set it to be $0, 4, \ldots, 28$. Notice that when $c = 32$ for CIFAR-10 or $c = 28$ for Fashion-MNIST, LAP is equivalent to GAP, and when $c = 0$, LAP is equivalent to having no pooling layer.

Results on CIFAR-10 are reported in Tables 1 and 2, and results on Fashion-MNIST are reported in Tables 3 and 4. In each table, for each combination of $c$ and $d$, the first number is the test accuracy (in percentage) without horizontal flip data augmentation, and the second number (in parentheses) is the test accuracy with horizontal flip data augmentation.

 c \ d       5              8              11             14
 0       63.53 (67.90)  65.54 (69.43)  66.42 (70.30)  66.81 (70.48)
 4       76.35 (78.79)  77.03 (79.30)  77.39 (79.52)  77.35 (79.65)
 8       79.48 (81.32)  79.82 (81.49)  79.76 (81.71)  79.69 (81.53)
 12      80.40 (82.13)  80.64 (82.09)  80.58 (82.06)  80.32 (81.95)
 16      80.36 (81.73)  80.78 (82.20)  80.59 (82.06)  80.41 (81.83)
 20      79.87 (81.50)  80.15 (81.33)  79.87 (81.46)  79.98 (81.35)
 24      78.60 (79.98)  78.91 (80.48)  79.22 (80.53)  78.94 (80.46)
 28      77.18 (78.84)  78.03 (79.86)  78.45 (79.87)  78.48 (80.07)
 32      77.00 (78.49)  77.85 (79.65)  78.49 (80.04)  78.45 (80.01)

Table 2: Test accuracy of CNN-GP on CIFAR-10.

 c \ d       5              8              11             14
 0       92.25 (92.56)  92.22 (92.51)  92.11 (92.29)  91.76 (92.17)
 4       93.76 (94.07)  93.69 (93.86)  93.55 (93.74)  93.37 (93.58)
 8       93.72 (93.96)  93.67 (93.78)  93.50 (93.58)  93.32 (93.51)
 12      93.59 (93.80)  93.58 (93.70)  93.35 (93.44)  93.21 (93.40)
 16      93.50 (93.62)  93.42 (93.63)  93.27 (93.40)  93.10 (93.25)
 20      93.10 (93.34)  93.17 (93.49)  93.20 (93.34)  92.99 (93.18)
 24      92.77 (93.04)  93.07 (93.44)  93.11 (93.31)  93.02 (93.21)
 28      92.80 (92.98)  93.08 (93.42)  93.12 (93.28)  92.97 (93.19)

Table 3: Test accuracy of CNTK on Fashion-MNIST.

 c \ d       5              8              11             14
 0       91.47 (91.81)  91.96 (92.37)  92.09 (92.60)  92.22 (92.72)
 4       93.44 (93.60)  93.59 (93.79)  93.63 (93.76)  93.59 (93.64)
 8       93.26 (93.16)  93.41 (93.51)  93.31 (93.52)  93.39 (93.46)
 12      92.83 (92.94)  93.07 (93.20)  93.11 (93.15)  92.94 (93.09)
 16      92.46 (92.51)  92.58 (92.83)  92.64 (92.92)  92.68 (93.07)
 20      91.83 (91.72)  92.35 (92.42)  92.49 (92.79)  92.51 (92.69)
 24      91.15 (91.40)  92.10 (92.18)  92.29 (92.60)  92.41 (92.77)
 28      91.30 (91.37)  92.03 (92.27)  92.41 (92.79)  92.41 (92.74)

Table 4: Test accuracy of CNN-GP on Fashion-MNIST.

We make the following observations regarding our experimental results.

• LAP with a proper choice of the parameter $c$ significantly improves the performance of CNTK and CNN-GP. On CIFAR-10, the best-performing value of $c$ is 12 or 16, while on Fashion-MNIST the best-performing value is $c = 4$. We suspect this difference is due to the nature of the two datasets: CIFAR-10 contains real-life images and thus allows more translation, while Fashion-MNIST contains images of centered clothing and thus allows less translation. For both datasets, the best-performing value of $c$ is consistent across all settings (depth, CNTK or CNN-GP) that we considered.

• Horizontal flip data augmentation is less effective on Fashion-MNIST than on CIFAR-10.
There are two possible explanations for this phenomenon. First, most images in Fashion-MNIST are nearly horizontally symmetric (e.g., T-shirts and bags). Second, CNTK and CNN-GP have already achieved relatively high accuracy on Fashion-MNIST, so it is reasonable for horizontal flip data augmentation to be less effective on this dataset.

• Finally, for CNTK, when $c = 0$ (no pooling layer) and $c = 32$ (GAP), our reported test accuracies are close to those in Arora et al. [2019] on CIFAR-10. For CNN-GP, when $c = 0$ (no pooling layer), our reported test accuracies are close to those in Novak et al. [2019] on CIFAR-10 and Fashion-MNIST. This suggests that we have reproduced previously reported results.

6.2 Improving Performance on CIFAR-10 via Additional Pre-processing

Finally, we explore another interesting question: what is the limit of non-deep-neural-network methods on CIFAR-10? To further improve performance, we combine CNTK and CNN-GP with LAP, together with the previously best-performing non-deep-neural-network method of Coates et al. [2011]. Here we use the variant implemented in Recht et al. [2019].⁴ More specifically, we first sample 2048 random image patches of size $5 \times 5$ from all training images. For the sampled image patches, we subtract the mean of the patches, normalize them to have unit norm, and finally apply a ZCA transformation to the resulting patches. We use the resulting patches as the 2048 filters of a convolutional layer with kernel size 5, stride 1, and no dilation or padding. For an input image $\mathbf{x}$, we use $\mathrm{conv}(\mathbf{x})$ to denote the output of this convolutional layer. As in the implementation in Recht et al. [2019], we use $\mathrm{ReLU}(\mathrm{conv}(\mathbf{x}) - \gamma_{\mathrm{feature}})$ and $\mathrm{ReLU}(-\mathrm{conv}(\mathbf{x}) - \gamma_{\mathrm{feature}})$ as the input features for CNTK and CNN-GP. Here we fix $\gamma_{\mathrm{feature}} = 1$ as in Recht et al.
[2019], and set the bias term γ in CNTK and CNN-GP to γ = 3, which equals the filter size used in CNTK and CNN-GP. To make the features equivariant under horizontal flip (cf. Definition 4.1), for each image patch we horizontally flip it and add the flipped patch to the convolutional layer as a new filter. Thus, for an input CIFAR-10 image of size 32 × 32, the dimension of the output feature is 8192 × 28 × 28. To isolate the effect of randomness in the choice of image patches, we fix the random seed to be 0 throughout the experiment. In this experiment, we set the value of the c parameter in LAP to 4, 8, 12, ..., 20 to avoid small and large values of c.

The results are reported in Tables 5 and 6. In each table, for each combination of c and d, the first number is the test accuracy without horizontal flip data augmentation (in percent), and the second number (in parentheses) is the test accuracy with horizontal flip data augmentation (again in percent).

 c \ d       5               8               11              14
 4       84.63 (86.64)   84.07 (86.23)   83.29 (85.53)   82.57 (84.81)
 8       86.36 (88.32)   85.80 (87.81)   85.01 (87.08)   84.57 (86.53)
 12      86.74 (88.35)   86.20 (87.90)   85.60 (87.36)   84.95 (86.99)
 16      86.77 (88.36)   86.17 (87.85)   85.60 (87.44)   84.92 (86.98)
 20      86.17 (87.77)   85.71 (87.50)   85.14 (87.07)   84.59 (86.84)

Table 5: Test accuracy of additional pre-processing + CNTK on CIFAR-10.

 c \ d       5               8               11              14
 4       85.49 (87.32)   85.37 (87.22)   85.16 (87.11)   84.79 (86.81)
 8       87.07 (88.64)   86.82 (88.68)   86.53 (88.40)   86.39 (88.15)
 12      87.23 (88.91)   87.12 (88.92)   86.87 (88.66)   86.62 (88.29)
 16      87.28 (88.90)   87.11 (88.66)   86.92 (88.61)   86.74 (88.24)
 20      86.81 (88.26)   86.77 (88.24)   86.61 (88.14)   86.26 (87.84)

Table 6: Test accuracy of additional pre-processing + CNN-GP on CIFAR-10.

From our experimental results, it is evident that combining CNTK or CNN-GP with the additional pre-processing significantly improves upon the performance of using CNTK or CNN-GP alone, and upon that of using the approach of Coates et al. [2011] alone. Previously, it was reported in Recht et al. [2019] that the approach of Coates et al. [2011] alone (together with an appropriate pooling layer) can only achieve a test accuracy of 84.2% using 256,000 image patches, or 83.3% using 32,000 image patches. Even with the help of horizontal flip data augmentation, the approach of Coates et al. [2011] can only achieve a test accuracy of 85.6% using 256,000 image patches, or 85.0% using 32,000 image patches. Here we use significantly fewer image patches (only 2048) but achieve much better performance, with the help of CNTK and CNN-GP. In particular, we achieve an accuracy of 88.92% on CIFAR-10, matching the performance of AlexNet on the same dataset. In the setting reported in Coates et al. [2011], increasing the number of sampled image patches further improves performance. We conjecture that in our setting as well, further increasing the number of sampled image patches can improve performance and get close to modern CNNs. However, due to limitations on computational resources, we leave exploring the effect of the number of sampled image patches as a future research direction.

^4 https://github.com/modestyachts/nondeep

6.3 Experiments on CNN with BBlur

In Figure 2, we verify the effectiveness of BBlur on a 10-layer CNN (with Batch Normalization) on CIFAR-10. The setting of this experiment is reported in Appendix D. Our network structure has no pooling layer except for the BBlur layer before the final fully-connected layer. The fully-connected layer is fixed during training.
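The patch-based pre-processing of Section 6.2 can be sketched in NumPy. This is a minimal sketch under our assumptions: the function names (`zca_whiten`, `build_filters`, `featurize`) are ours, and the exact normalization order may differ in detail from the reference implementation of Recht et al. [2019].

```python
import numpy as np

def zca_whiten(patches, eps=1e-5):
    """ZCA-whiten a set of flattened patches (one patch per row)."""
    cov = patches.T @ patches / patches.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return patches @ whitener

def build_filters(images, n_patches=2048, size=5, seed=0):
    """Sample random patches, mean-subtract, unit-normalize, ZCA-whiten.

    images: array of shape (n, H, W, C). Returns filters (n_patches, size, size, C).
    """
    rng = np.random.default_rng(seed)
    n, h, w, c = images.shape
    patches = np.empty((n_patches, size * size * c))
    for k in range(n_patches):
        i = rng.integers(n)
        r = rng.integers(h - size + 1)
        s = rng.integers(w - size + 1)
        patches[k] = images[i, r:r + size, s:s + size, :].ravel()
    patches -= patches.mean(axis=1, keepdims=True)          # per-patch mean subtraction
    patches /= np.linalg.norm(patches, axis=1, keepdims=True) + 1e-12
    return zca_whiten(patches).reshape(n_patches, size, size, c)

def featurize(conv_out, gamma_feature=1.0):
    """Two-sided thresholded features ReLU(z - γ) and ReLU(-z - γ), stacked channel-wise."""
    return np.concatenate([np.maximum(conv_out - gamma_feature, 0.0),
                           np.maximum(-conv_out - gamma_feature, 0.0)], axis=-1)
```

Stacking both one-sided features is what doubles the channel count (2048 patches plus their horizontal flips, two-sided, gives the 8192 channels mentioned above).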
Our experiment illustrates that even with a fixed final FC layer, using GAP can improve the performance of a CNN; this challenges the conjecture that GAP helps merely because it reduces the number of parameters in the last fully-connected layer and thus avoids overfitting. Our experiments also show that BBlur with an appropriate choice of c achieves better performance than GAP.

[Figure 2: Test accuracy of a 10-layer CNN with various values of the c parameter in BBlur. (a) With horizontal flip data augmentation; (b) without horizontal flip data augmentation. Each panel plots test accuracy against c, showing both the average test accuracy and the best test accuracy.]

7 Conclusion

In this paper, inspired by the connection between full translation data augmentation and GAP, we derive a new operation, LAP, on CNTK and CNN-GP, which consistently improves performance on image classification tasks. Combining CNN-GP with LAP and the pre-processing technique proposed by Coates et al. [2011], the resulting kernel achieves 89% accuracy on CIFAR-10, matching the performance of AlexNet; it is the strongest classifier we know of that is not a trained neural network. Here we list a few future research directions. Is it possible to combine more modern techniques from CNNs, such as batch normalization and residual layers, with CNTK or CNN-GP, to further improve performance? Moreover, it is an interesting direction to study other components of modern CNNs through the lens of CNTK and CNN-GP.

Acknowledgements

S. Arora, W. Hu, Z. Li and D. Yu are supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Amazon Research, DARPA and SRC. S. S. Du is supported by National Science Foundation (Grant No. DMS-1638352) and the Infosys Membership. R. Salakhutdinov and R. Wang are supported in part by NSF IIS-1763562, Office of Naval Research grant N000141812861, and an Nvidia NVAIL award.
Part of this work was done while R. Wang was visiting Princeton University. The authors would like to thank Amazon Web Services for providing compute time for the experiments in this paper.

References

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

Olivier Chapelle, Jason Weston, Léon Bottou, and Vladimir Vapnik. Vicinal risk minimization. In Advances in Neural Information Processing Systems, pages 416–422, 2001.

Shuxiao Chen, Edgar Dobriban, and Jane H. Lee. Invariance reduces variance: Understanding data augmentation in deep learning and beyond. arXiv preprint arXiv:1907.10905, 2019.

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, and Christopher Ré. A kernel theory of modern data augmentation. arXiv preprint arXiv:1803.06084, 2018.

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

Simon S. Du, Kangcheng Hou, Barnabás Póczos, Ruslan Salakhutdinov, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192, 2019a.

Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b.

Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bklfsi0cKm.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700–4708, 2017.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.

Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

Radford M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.

Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? arXiv preprint arXiv:1902.10811, 2019.

Bernhard Schölkopf, Chris Burges, and Vladimir Vapnik. Incorporating invariances in support vector learning machines. In International Conference on Artificial Neural Networks, pages 47–52. Springer, 1996.

Jocelyn Sietsma and Robert J. F. Dow. Creating artificial neural networks that generalize. Neural Networks, 4(1):67–79, 1991.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017.

Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.

A Formal Definitions of CNN, CNN-GP and CNTK

In this section we use the following additional notation. Let I be the identity matrix, and [n] = {1, 2, ..., n}. Let e_i be the indicator vector whose i-th entry is 1 and all other entries are 0, and let 1 denote the all-one vector.
We use ⊙ to denote the pointwise (Hadamard) product and ⊗ to denote the tensor product. We use diag(·) to transform a vector into a diagonal matrix. We use σ(·) to denote the activation function, such as the rectified linear unit (ReLU) function σ(z) = max{z, 0}, and σ̇(·) to denote the derivative of σ(·). Moreover, c_σ is a fixed constant. Denote by N(µ, Σ) the Gaussian distribution with mean µ and covariance Σ.

We first define the convolution operation. For a convolutional filter $w \in \mathbb{R}^{q \times q}$ and an image $x \in \mathbb{R}^{P \times Q}$, the convolution operator is defined as
$$[w * x]_{i,j} = \sum_{a=-\frac{q-1}{2}}^{\frac{q-1}{2}} \sum_{b=-\frac{q-1}{2}}^{\frac{q-1}{2}} [w]_{a+\frac{q+1}{2},\, b+\frac{q+1}{2}}\, [x]_{a+i,\, b+j} \quad \text{for } i \in [P],\ j \in [Q]. \tag{2}$$
Here the precise definitions of $[w]_{a+\frac{q+1}{2}, b+\frac{q+1}{2}}$ and $[x]_{a+i, b+j}$ depend on the padding scheme (cf. Section 3.2). Notice that in Equation (2), the value of $[w * x]_{i,j}$ depends on $[x]_{i-\frac{q-1}{2}:i+\frac{q-1}{2},\; j-\frac{q-1}{2}:j+\frac{q-1}{2}}$. Thus, for $(i, j, i', j') \in [P] \times [Q] \times [P] \times [Q]$, we define
$$D_{ij,i'j'} = \left\{ (i+a,\ j+b,\ i'+a',\ j'+b') \in [P] \times [Q] \times [P] \times [Q] \;\middle|\; -\tfrac{q-1}{2} \le a, b, a', b' \le \tfrac{q-1}{2} \right\}.$$

Now we formally define CNN, CNN-GP and CNTK.

CNN.
• Let $x^{(0)} = x \in \mathbb{R}^{P \times Q \times C^{(0)}}$ be the input image, where $C^{(0)}$ is the initial number of channels.
• For $h = 1, \ldots, L$ and $\beta = 1, \ldots, C^{(h)}$, the intermediate outputs are defined as
$$\tilde{x}^{(h)}_{(\beta)} = \sum_{\alpha=1}^{C^{(h-1)}} W^{(h)}_{(\alpha),(\beta)} * x^{(h-1)}_{(\alpha)} + \gamma \cdot b_{(\beta)}, \qquad x^{(h)}_{(\beta)} = \sqrt{\frac{c_\sigma}{C^{(h)} \times q \times q}}\; \sigma\big(\tilde{x}^{(h)}_{(\beta)}\big),$$
where each $W^{(h)}_{(\alpha),(\beta)} \in \mathbb{R}^{q \times q}$ is a filter with Gaussian initialization, $b_{(\beta)}$ is a bias term with Gaussian initialization, and $\gamma$ is the scaling factor for the bias term.

CNN-GP and CNTK.
• For α = 1, ...
, C^{(0)} and (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define
$$K^{(0)}_{(\alpha)}\left(x, x'\right) = x_{(\alpha)} \otimes x'_{(\alpha)} \quad \text{and} \quad \left[\Sigma^{(0)}(x, x')\right]_{ij,i'j'} = \sum_{\alpha=1}^{C^{(0)}} \operatorname{tr}\left( \left[ K^{(0)}_{(\alpha)}(x, x') \right]_{D_{ij,i'j'}} \right) + \gamma^2.$$
• For h ∈ [L − 1]:
  – For (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define
$$\Lambda^{(h)}_{ij,i'j'}(x, x') = \begin{pmatrix} \left[\Sigma^{(h-1)}(x, x)\right]_{ij,ij} & \left[\Sigma^{(h-1)}(x, x')\right]_{ij,i'j'} \\ \left[\Sigma^{(h-1)}(x', x)\right]_{i'j',ij} & \left[\Sigma^{(h-1)}(x', x')\right]_{i'j',i'j'} \end{pmatrix} \in \mathbb{R}^{2 \times 2}.$$
  – For (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define
$$\left[K^{(h)}(x, x')\right]_{ij,i'j'} = \frac{c_\sigma}{q^2} \cdot \mathbb{E}_{(u,v) \sim \mathcal{N}\left(0,\, \Lambda^{(h)}_{ij,i'j'}(x, x')\right)} \left[ \sigma(u)\,\sigma(v) \right], \tag{3}$$
$$\left[\dot{K}^{(h)}(x, x')\right]_{ij,i'j'} = \frac{c_\sigma}{q^2} \cdot \mathbb{E}_{(u,v) \sim \mathcal{N}\left(0,\, \Lambda^{(h)}_{ij,i'j'}(x, x')\right)} \left[ \dot\sigma(u)\,\dot\sigma(v) \right]. \tag{4}$$
  – For (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], define
$$\left[\Sigma^{(h)}(x, x')\right]_{ij,i'j'} = \operatorname{tr}\left( \left[ K^{(h)}(x, x') \right]_{D_{ij,i'j'}} \right) + \gamma^2.$$
Note that the definitions of Σ(x, x′) and Σ̇(x, x′) share similar patterns with their NTK counterparts [Jacot et al., 2018]. The only difference is that we have one more step, taking the trace over patches; this step represents the convolution operation in the corresponding CNN. Now we can define the kernel value recursively.
1. First, we define Θ^{(0)}(x, x′) = Σ^{(0)}(x, x′).
2. For h ∈ [L − 1] and (i, j, i′, j′) ∈ [P] × [Q] × [P] × [Q], we define
$$\left[\Theta^{(h)}(x, x')\right]_{ij,i'j'} = \operatorname{tr}\left( \left[ \dot{K}^{(h)}(x, x') \odot \Theta^{(h-1)}(x, x') + K^{(h)}(x, x') \right]_{D_{ij,i'j'}} \right) + \gamma^2.$$
3. Finally, define
$$\Theta^{(L)}(x, x') = \dot{K}^{(L)}(x, x') \odot \Theta^{(L-1)}(x, x') + K^{(L)}(x, x').$$

B Additional Definitions and Proof of Theorem 4.1

Definition B.1 (Group of Operators). (G, ◦) is a group of operators if and only if
1. Each element g ∈ G is an operator ℝ^{P×Q×C} → ℝ^{P×Q×C};
2.
∀ g₁, g₂ ∈ G, g₁ ◦ g₂ ∈ G, where (g₁ ◦ g₂)(x) is defined as g₁(g₂(x));
3. ∃ e ∈ G such that ∀ g ∈ G, e ◦ g = g ◦ e = g;
4. ∀ g₁ ∈ G, ∃ g₂ ∈ G such that g₁ ◦ g₂ = g₂ ◦ g₁ = e. We say g₂ is the inverse of g₁, namely g₂ = g₁⁻¹.

Proof of Theorem 4.1. Since we assume K_{GX} and K_{XG} are invertible, both α and α̃ are uniquely defined. Now we claim that α̃_g = {α̃_{i,g}}_{i∈[N]} ∈ ℝ^N is equal to α/|G| for all g ∈ G. By the equivariance of K under G, for all j ∈ [N] and g′ ∈ G,
$$\sum_{i \in [N],\, g \in \mathcal{G}} \frac{\alpha_i}{|\mathcal{G}|}\, K\big(g'(x_j), g(x_i)\big) = \sum_{i \in [N],\, g \in \mathcal{G}} \frac{\alpha_i}{|\mathcal{G}|}\, K\big((g^{-1} \circ g')(x_j), x_i\big) = \sum_{i \in [N]} \alpha_i\, \mathbb{E}_{g \in \mathcal{G}}\, K\big(g(x_j), x_i\big) = \sum_{i \in [N]} \alpha_i\, K_{\mathcal{G}}(x_j, x_i) = y_j.$$
Note that α̃ is defined as the unique solution of K_{XG} α̃ = y_G. Similarly, we have
$$\sum_{i \in [N],\, g \in \mathcal{G}} \frac{\alpha_i}{|\mathcal{G}|}\, K\big(x', g(x_i)\big) = \sum_{i \in [N]} \alpha_i\, \mathbb{E}_{g \in \mathcal{G}}\, K\big(g^{-1}(x'), x_i\big) = \sum_{i \in [N]} \alpha_i\, K_{\mathcal{G}}(x', x_i).$$

C Equivalence Between LAP and Box Blur Layer

For a CNN with a box blur layer before the final fully-connected layer, the final output is defined as
$$f(\theta, x) = \sum_{\alpha=1}^{C^{(L)}} \left\langle W^{(L+1)}_{(\alpha)},\ \mathrm{BBlur}\big(x^{(L)}_{(\alpha)}\big) \right\rangle,$$
where $x^{(L)}_{(\alpha)} \in \mathbb{R}^{P \times Q}$ and $W^{(L+1)}_{(\alpha)} \in \mathbb{R}^{P \times Q}$ is the weight of the last fully-connected layer.

Now we establish the equivalence between BBlur and LAP on CNTK; the equivalence on CNN-GP can be derived similarly. Let $\Theta_{\mathrm{BBlur}}(x, x') \in \mathbb{R}^{[P] \times [Q] \times [P] \times [Q]}$ be the CNTK kernel of $\mathrm{BBlur}\big(x^{(L)}_{(\alpha)}\big)$. Since BBlur is just a linear operation, we have
$$\left[\Theta_{\mathrm{BBlur}}(x, x')\right]_{i,j,i',j'} = \frac{1}{(2c+1)^4} \sum_{\Delta_i, \Delta_j, \Delta'_i, \Delta'_j \in [-c,c]} \left[\Theta^{(L)}(x, x')\right]_{i+\Delta_i,\ j+\Delta_j,\ i'+\Delta'_i,\ j'+\Delta'_j}.$$
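The per-entry averaging of the CNTK kernel under a box blur admits a direct NumPy sketch. This is a naive O((2c+1)^4) loop written for clarity, not efficiency, and it assumes circular padding so that shifted indices wrap around (the function name `lap_kernel` is ours).

```python
import numpy as np

def lap_kernel(theta, c):
    """Average a 4-index kernel theta[P, Q, P, Q] over all index shifts in
    [-c, c]^4, with circular (wrap-around) padding.

    np.roll(theta, -d, axis=k)[i] == theta[(i + d) % n], which realizes the
    entry Theta[i + Delta_i, j + Delta_j, i' + Delta'_i, j' + Delta'_j].
    """
    out = np.zeros_like(theta)
    for di in range(-c, c + 1):
        for dj in range(-c, c + 1):
            for dii in range(-c, c + 1):
                for djj in range(-c, c + 1):
                    out += np.roll(theta, shift=(-di, -dj, -dii, -djj),
                                   axis=(0, 1, 2, 3))
    return out / (2 * c + 1) ** 4
```

With c = 0 the operation is the identity, and with c large enough it approaches the full averaging that corresponds to GAP.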
By the formula for the output kernel value of CNTK without GAP, we obtain
$$\operatorname{tr}\left(\Theta_{\mathrm{BBlur}}\left(x, x'\right)\right) = \frac{1}{(2c+1)^4} \sum_{\Delta_i, \Delta'_i, \Delta_j, \Delta'_j \in [-c,c]} \sum_{(i,j) \in [P] \times [Q]} \left[\Theta(x, x')\right]_{i+\Delta_i,\ j+\Delta_j,\ i+\Delta'_i,\ j+\Delta'_j}.$$

D Setting of the Experiment in Section 6.3

The total number of training epochs is 80; the learning rate is 0.1 initially and is decayed by a factor of 10 at epochs 40 and 60, respectively. The momentum is 0.9 and the weight decay factor is 0.0005. In Figure 2, the blue line reports the average test accuracy over the last 10 epochs, while the red line reports the best test accuracy over all 80 epochs. Each experiment is repeated 3 times. We use circular padding for both the convolutional layers and the BBlur layer. The last data point (with the largest x-coordinate) reported in Figure 2 corresponds to GAP.
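The step learning-rate schedule described above can be written as a small helper. This is a sketch: the function is ours and is not tied to any particular training framework.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(40, 60), factor=10.0):
    """Step schedule: start at base_lr and divide by `factor` at each
    milestone epoch (0.1 -> 0.01 at epoch 40 -> 0.001 at epoch 60)."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr /= factor
    return lr
```

This mirrors the common milestone-decay pattern (e.g., PyTorch's MultiStepLR) used for CIFAR-10 training runs.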
