Scale Normalization


Authors: Henry Z. Lo, Kevin Amaral, Wei Ding

Workshop track - ICLR 2016

SCALE NORMALIZATION

Henry Z. Lo, Kevin Amaral, & Wei Ding
Department of Computer Science
University of Massachusetts Boston
Boston, MA 02155, USA
{henryzlo,ding}@cs.umb.edu, kevin.m.amaral@gmail.com

ABSTRACT

One of the difficulties of training deep neural networks is caused by improper scaling between layers. Scaling issues introduce exploding and vanishing gradient problems, and have typically been addressed by careful scale-preserving initialization. We investigate the value of preserving scale, or isometry, beyond the initial weights. We propose two methods of maintaining isometry, one exact and one stochastic. Preliminary experiments show that both determinant normalization and scale normalization effectively speed up learning. Results suggest that isometry is important in the beginning of learning, and that maintaining it leads to faster learning.

1 INTRODUCTION

The goal of many initialization methods is to preserve the gradient signal as it goes backwards through each layer of a neural network (Glorot & Bengio, 2010; He et al., 2015; Saxe et al., 2014). In RNNs, failing to preserve this signal may lead to the well-known vanishing and exploding gradient problems (Hochreiter, 1998). In general, learning in neural nets is much faster when the composite scale of all layers remains near a constant of the problem (Saxe et al., 2014).

Results from this line of work suggest that initially preserving scale is conducive to learning. However, any update rule which does not enforce scale preservation will violate this condition (isometry) after the first iteration. Does preserving scale continue to speed up learning after the first iteration? There is evidence for and against. On one hand, if scale were preserved throughout all epochs, the network would fail to learn non-isometric projections.
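The scale phenomenon at stake here can be seen in a few lines of NumPy (a minimal sketch of our own, not from the paper): a signal pushed through many layers whose singular values sit below one shrinks exponentially, while orthogonal layers, whose singular values are all one, preserve its norm exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 50

def norm_after(layers, x):
    # Forward a vector through a stack of linear layers; return its final norm.
    for W in layers:
        x = W.T @ x
    return np.linalg.norm(x)

x = rng.standard_normal(n)
x /= np.linalg.norm(x)  # unit-length input

# Gaussian layers whose typical scale sits slightly below 1: the signal vanishes.
shrink = [0.9 * rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

# Orthogonal layers (all singular values exactly 1): the signal is preserved.
orth = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(depth)]

print(norm_after(shrink, x))  # collapses toward 0 (vanishing)
print(norm_after(orth, x))    # stays at 1.0 (isometry)
```

The same multiplicative effect applies to the gradient on the backward pass, which is the situation the initialization schemes above are designed to avoid.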
On the other hand, there is circumstantial evidence for the benefit of preserving scale during training:

• The objective most common in autoencoders produces an approximately isometric matrix $W$, and thus implicitly preserves scale (singular values) (Bourlard & Kamp).
• Co-training with both unsupervised and supervised objectives leads to faster-learning and more generalizable networks (Rasmus et al., 2015). At least some of this result is due to the scale-preserving effect of the unsupervised objective, which effectively regularizes the singular values of each weight matrix.

The contribution of this work is to investigate the utility of scale-preserving constraints. We separate this from the other effects of the unsupervised objective by normalizing scale without optimizing reconstruction. Preliminary results with two different methods indicate that, at least for the first few iterations of training, scale normalization leads to faster learning.

2 PRESERVING SCALE

The forward scale of a layer with weight matrix $W$ is its effect on the length of its input $x$: $\|W^\top x\|_2 / \|x\|_2$. This is a function of the singular values of $W$ (if $x$ were a singular vector, it would be scaled by the corresponding singular value). The backward scale of a layer is $\|W d\|_2 / \|d\|_2$, where $d$ is the incoming gradient. Just as $W^\top$ is used to calculate the forward pass, $W$ is used to calculate the outgoing gradient in the backward pass.

The forward and backward scales are both functions of the singular values $\sigma$, which are invariant under transposition. The magnitude of the gradient at a given layer $i$ is the product of the original gradient's magnitude and the scales of all layers $l > i$. If all these scales (singular values) are greater than one, we have exploding gradients; if they are all less than one, we have vanishing gradients.

Saxe et al. (2014) suggest using orthogonal matrices (or rectangular matrices with unit singular values) to avoid the complications of scale. This effectively makes all matrices non-scaling (though length can still be reduced when components of $x$ lie in the nullspace of $W^\top$). This orthonormal initialization scheme contrasts with the popular initializations of Glorot & Bengio (2010) and He et al. (2015), which are based on preserving the variance of the forward and backward passes. The benefits of the singular value interpretation of scale over their interpretation are thus:

• It unites the notions of forward and backward scale (both are functions of the singular values).
• It yields a constructive method for creating non-scaling weight matrices (orthonormal initialization).
• Scaling as a function of singular values relies on far fewer assumptions than scaling as variance preservation.

3 NORMALIZING SCALE

Here we propose multiple ways to normalize scale during learning. The challenge is to preserve scale while retaining what is learned after each update. It should be noted that simply making $W$ unitary after each update (setting all singular values to 1) would decorrelate $W^\top x$ and destroy information contained in $W$. It may not make sense for every neuron's output to lie in the same range. So while orthonormalizing $W$ works for initialization, it does not make sense to do so in the course of training.

3.1 DETERMINANT NORMALIZATION

One method of preserving the scale of each layer is to set its pseudo-determinant to one. Just as the determinant of a matrix is the product of its eigenvalues, the pseudo-determinant is the product of a (not necessarily square) matrix's singular values ($\prod_i \sigma_i$). It is an aggregate measure of the scales of $W$. If our update rule on $W$ gives us a new weight matrix $W^*$, we can determinant-normalize $W$:

$$W \leftarrow \frac{W^*}{\det(W^*)^{1/k}} \qquad (1)$$

where $\det(W^*)$ is the product of the $k$ nonzero singular values of $W^*$.
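Determinant normalization can be sketched in a few lines of NumPy (a minimal illustration with our own function name, not the authors' code). Dividing $W$ by the geometric mean of its singular values leaves their product at one; computing that mean in log space sidesteps underflow of the raw product for large matrices.

```python
import numpy as np

def det_normalize(W, eps=1e-12):
    # Pseudo-determinant = product of the nonzero singular values.  Dividing W
    # by their geometric mean sets that product to one; the log-space mean is
    # numerically stable where the raw product would underflow.
    s = np.linalg.svd(W, compute_uv=False)
    s = s[s > eps]
    return W / np.exp(np.log(s).mean())

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((100, 100))  # the updated weights W*

s_norm = np.linalg.svd(det_normalize(W), compute_uv=False)
print(np.prod(s_norm))  # ≈ 1: pseudo-determinant is normalized
```

The SVD here is exactly the cost the paper flags as prohibitive for per-update use, which motivates the stochastic alternative below.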
This sets the determinant of $W$ to one. Alternatively, determinant normalization sets the geometric mean of the scales to one, in a sense "centering" the singular values.

We test determinant normalization on a 100 × 100 × 100 ReLU MLP on MNIST (see Figures 1 and 2). Models are trained using vanilla SGD with a batch size of 100. Determinant normalization speeds up learning over both Glorot and orthonormal initialization. As expected, the benefits are most pronounced at the beginning of training (1 epoch is 500 batches).

3.2 SCALE NORMALIZATION

Determinant normalization has the attractive property of being a function of $W$ alone, not of $x$. However, calculating the pseudo-determinant runs into multiple problems:

• Calculating the pseudo-determinant requires an SVD to obtain the singular values, and is thus prohibitively expensive.
• When singular values are less than 1 and matrices are large, it is very easy to get numerical underflow ($\det(W) = 0$).

[Figure 1: Experiments over 1 epoch of training an MLP on MNIST.]
[Figure 2: Experiments over 10 epochs of training an MLP on MNIST.]

To address this, we propose an alternative notion of scale. Empirically, the scaling of each vector $x$ by $W$ can be measured by the ratio:

$$s = \frac{\|W^\top x\|_2}{\|x\|_2} \qquad (2)$$

Thus, we can scale-normalize by dividing $W$ by the average scale observed over a mini-batch:

$$W \leftarrow \frac{W^*}{E(s)} \qquad (3)$$

This sets the expected scaling to one, so that on average $W$ will preserve scale. As there are many singular values, $s$ differs depending on $x$, and may thus be noisy depending on the batch, especially for small batches.

We noted that scale normalization does not help after the first epoch. When running scale normalization for only the first epoch (100 iterations), it achieves results similar to determinant normalization.

4 FUTURE WORK

Preliminary results indicate that maintaining isometry is useful for learning, at least in the beginning.
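As a concrete reference for such experiments, the scale-normalization update of Section 3.2 (Eqs. 2 and 3) can be sketched in a few lines of NumPy (our own minimal illustration, not the authors' implementation; `X` holds one mini-batch input per row):

```python
import numpy as np

def scale_normalize(W, X):
    # Eq. (2): observed scale s = ||W^T x||_2 / ||x||_2 for each batch row x.
    s = np.linalg.norm(X @ W, axis=1) / np.linalg.norm(X, axis=1)
    # Eq. (3): divide the updated weights by the batch-average scale.
    return W / s.mean()

rng = np.random.default_rng(0)
W = 0.3 * rng.standard_normal((100, 100))  # post-update weights W*
X = rng.standard_normal((128, 100))        # mini-batch, one input per row

W_n = scale_normalize(W, X)
s_after = np.linalg.norm(X @ W_n, axis=1) / np.linalg.norm(X, axis=1)
print(s_after.mean())  # ≈ 1: expected scaling on this batch
```

Applied after each SGD step, this keeps the average observed scaling of a layer at one on the current batch, at the cost of one extra matrix product rather than an SVD.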
Future work will relate scale normalization to batch normalization and to more advanced optimization algorithms. Experiments on larger datasets such as CIFAR10 and on convolutional architectures are in progress. We believe that investigating how isometry interplays with learning speed will bring insight into how to speed up learning in the future.

[Figure 3: Experiments over 10 epochs of training an MLP on MNIST. Scale-Norm-1 indicates that scale normalization was only used for the first epoch.]

REFERENCES

H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291–294.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 6(2), April 1998.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.

Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
