Surprising properties of dropout in deep networks

David P. Helmbold
Department of Computer Science
University of California, Santa Cruz
Santa Cruz, CA 95064, USA
dph@soe.ucsc.edu

Philip M. Long
Google
plong@google.com

April 21, 2017

Abstract

We analyze dropout in deep networks with rectified linear units and the quadratic loss. Our results expose surprising differences between the behavior of dropout and more traditional regularizers like weight decay. For example, on some simple data sets dropout training produces negative weights even though the output is the sum of the inputs. This provides a counterpoint to the suggestion that dropout discourages co-adaptation of weights. We also show that the dropout penalty can grow exponentially in the depth of the network while the weight-decay penalty remains essentially linear, and that dropout is insensitive to various rescalings of the input features, outputs, and network weights. This last insensitivity implies that there are no isolated local minima of the dropout training criterion. Our work uncovers new properties of dropout, extends our understanding of why dropout succeeds, and lays the foundation for further progress.

1 Introduction

The 2012 ImageNet Large Scale Visual Recognition challenge was won by the University of Toronto team by a surprisingly large margin. In an invited talk at NIPS, Hinton [15] credited the dropout training technique for much of their success. Dropout training is a variant of stochastic gradient descent (SGD) where, as each example is processed, the network is temporarily perturbed by randomly "dropping out" nodes of the network. The gradient calculation and weight updates are performed on the reduced network, and the dropped out nodes are then restored before the next SGD iteration.
Since the ImageNet competition, dropout has been successfully applied to a variety of domains [8, 10, 9, 17, 7], and is widely used [21, 13, 23]; for example, it is incorporated into popular packages such as Torch [25], Caffe [6] and TensorFlow [24]. It is intriguing that crippling the network during training often leads to such dramatically improved results, and dropout has also sparked substantial research on related methods (e.g. [12, 29]).

In this work, we examine the effect of dropout on the inductive bias of the learning algorithm. A match between dropout's inductive bias and some important applications could explain the success of dropout, and its popularity also motivates the study of its unusual properties.

Weight decay training optimizes the empirical error plus an $L_2$ regularization term, $\frac{\lambda}{2}\|w\|_2^2$, so we call $\frac{\lambda}{2}\|w\|_2^2$ the $L_2$ penalty of $w$, since it is the difference between the training criterion evaluated at $w$ and the empirical loss of $w$. By analogy, we define the dropout penalty of $w$ to be the difference between the dropout training criterion and the empirical loss of $w$ (see Section 2). Dropout penalties measure how much dropout discriminates against weight vectors, so they are key to understanding dropout's inductive bias.

Even in one-layer networks, conclusions drawn from (typically quadratic) approximations of the dropout penalty can be misleading [14]. Therefore we focus on exact formal analysis of dropout in multi-layer networks. Theoretical analysis of deep networks is notoriously difficult, so we might expect that a thorough understanding of dropout in deep networks must be achieved in stages. In this paper we further the process by exposing some of the surprising ways that the inductive bias of dropout differs from $L_2$ and other standard regularizers.
These include the following:

• We show that dropout training can lead to negative weights even when the output is a positive multiple of the inputs. Arguably, such use of negative weights constitutes co-adaptation; this adds a counterpoint to previous analyses showing that dropout discourages co-adaptation [22, 14].

• Unlike weight decay and other $p$-norm regularizers, dropout training is insensitive to the rescaling of input features, and largely insensitive to rescaling of the outputs; this may play a role in dropout's practical success. Dropout is also unaffected if the weights in one layer are scaled up by a constant $c$ and the weights of another layer are scaled down by $c$; this implies that dropout training does not have isolated local minima.

• The dropout penalty grows exponentially in the depth of the network in cases where the $L_2$ regularizer grows linearly. This may enable dropout to penalize the complexity of the network in a way that more meaningfully reflects the richness of the network's behaviors. (The exponential growth with $d$ of the dropout penalty is reminiscent of some regularizers for deep networks studied by Neyshabur, et al [20].)

• Dropout in deep networks has a variety of other behaviors different from standard regularizers. In particular: the dropout penalty for a set of weights can be negative; the dropout penalty of a set of weights depends on both the training instances and the labels; and although the dropout probability intuitively measures the strength of dropout regularization, the dropout penalties are often non-monotonic in the dropout probability. In contrast, Wager, et al [28] show that when dropout is applied to generalized linear models, the dropout penalty is always non-negative and does not depend on the labels.
Our analysis is for multilayer neural networks with the square loss at the output node. The hidden layers use the popular rectified linear units [19] outputting $\sigma(a) = \max(0, a)$, where $a$ is the node's activation (the weighted sum of its inputs). We study the minimizers of a criterion that may be viewed as the objective function when using dropout. This abstracts away sampling and optimization issues to focus on the inductive bias, as in [5, 30, 4, 18, 14]. See Section 2 for a complete explanation.

Related work

A number of possible explanations have been suggested for dropout's success. Hinton, et al [16] suggest that dropout controls network complexity by restricting the ability to co-adapt weights, and illustrate how it appears to learn simpler functions at the second layer. Others [3, 1] view dropout as an ensemble method combining the different network topologies resulting from the random deletion of nodes. Wager, et al [27] observe that in 1-layer networks dropout essentially forces learning on a more challenging distribution, akin to 'altitude training' of athletes.

Most formal analysis of the inductive bias of dropout has concentrated on the single-layer setting, where a single neuron combines the (potentially dropped-out) inputs. Wager, et al [28] considered the case that the distribution of label $y$ given feature vector $x$ is a member of the exponential family, and the log-loss is used to evaluate models. They pointed out that, in this situation, the criterion optimized by dropout can be decomposed into the original loss and a term that does not depend on the labels. They then gave approximations to this dropout regularizer and discussed its relationship with other regularizers. As we have seen, many aspects of the behavior of dropout and its relationship to other regularizers are qualitatively different when there are hidden units.
Wager, et al [27] considered dropout for learning topics modeled by a Poisson generative process. They exploited the conditional independence assumptions of the generative process to show that the excess risk of dropout training due to training set variation has a term that decays more rapidly than for straightforward empirical risk minimization, but also has a second additive term related to document length. They also discussed situations where the model learned by dropout has small bias.

Baldi and Sadowski [2] analyzed dropout in linear networks, and showed how dropout can be approximated by normalized geometric means of subnetworks in the nonlinear case. Gal and Ghahramani [11] described an interpretation of dropout as an approximation to a deep Gaussian process. The impact of dropout (and its relative dropconnect) on generalization (roughly, how much dropout restricts the search space of the learner) was studied in [29]. In the on-line learning with experts setting, Van Erven, et al [26] showed that applying dropout in on-line trials leads to algorithms that automatically adapt to the input sequence without requiring doubling or other parameter-tuning techniques.

The rest of the paper is organized as follows. Section 2 introduces our notation and formally defines the dropout model. We prove that dropout enjoys several scaling invariances that weight-decay doesn't in Section 3, and that dropout requires negative weights even in very simple situations in Section 4. Section 5 uncovers various properties of the dropout penalty function. Section 6 describes some simulation experiments. We provide some concluding remarks in Section 7.

2 Preliminaries

Throughout, we will analyze fully connected layered networks with $K$ inputs, one output, $d$ layers (counting the output, but not the inputs), and $n$ nodes in each hidden layer.
We assume that $n$ is a positive multiple of $K$, and that $K$ is an even perfect square and a power of two, to avoid unilluminating floor/ceiling clutter in the analysis. We will call this the standard architecture.

We use $W$ to denote a particular setting of the weights and biases in the network, and $W(x)$ to denote the network's output on input $x$ using $W$. The hidden nodes are ReLUs, and the output node is linear. $W$ can be decomposed as $(W_1, b_1, \ldots, W_{d-1}, b_{d-1}, w, b)$, where each $W_j$ is the matrix of weights on connections from the $(j-1)$st into the $j$th hidden layer, each $b_j$ is the vector of bias inputs into the $j$th hidden layer, $w$ are the weights into the output node, and $b$ is the bias into the output node.

We will refer to a joint probability distribution over examples $(x, y)$ as an example distribution. We focus on square loss, so the loss of $W$ on example $(x, y)$ is $(W(x) - y)^2$. The risk is the expected loss with respect to an example distribution $P$; we denote the risk of $W$ by $R_P(W) \stackrel{\mathrm{def}}{=} E_{(x,y) \sim P}(W(x) - y)^2$. The subscript will often be omitted when $P$ is clear from the context.

The goal of $L_2$ training is to find weights and biases minimizing the $L_2$ criterion with regularization strength $\lambda$: $J_2(W) \stackrel{\mathrm{def}}{=} R(W) + \frac{\lambda}{2}\|W\|^2$. Here and throughout, we use $\|W\|^2$ to denote the sum of the squares of the weights of $W$. (As usual, the biases are not penalized.) We use $W_{L_2}$ to denote a minimizer of this criterion. The $L_2$ penalty, $\frac{\lambda}{2}\|W\|^2$, is non-negative. This is useful, for example, to bound the risk of a minimizer $W_{L_2}$ of $J_2$, since $R(W) \leq J_2(W)$.

Dropout training independently removes nodes in the network. In our analysis each non-output node is dropped out with the same probability $q$, so $p = 1 - q$ is the probability that a node is kept.
(The output node is always kept; dropping it out has the effect of cancelling the training iteration.) When a node is dropped out, the node's output is set to 0. To compensate for this reduction, the values of the kept nodes are multiplied by $1/p$. With this compensation, dropout can be viewed as injecting zero-mean additive noise at each non-output node [28].¹

The dropout process is the collection of random choices, for each node in the network, of whether the node is kept or dropped out. A realization of the dropout process is a dropout pattern, which is a boolean vector indicating the kept nodes. For a network $W$, an input $x$, and dropout pattern $R$, let $D(W, x, R)$ be the output of $W$ when nodes are dropped out or not following $R$ (including the $1/p$ rescaling of kept nodes' outputs).

The goal of dropout training on an example distribution $P$ is to find weights and biases minimizing the dropout criterion for a given dropout probability:
$$J_D(W) \stackrel{\mathrm{def}}{=} E_R E_{(x,y) \sim P}(D(W, x, R) - y)^2.$$

¹ Some authors use a similar adjustment where the weights are scaled down at prediction time instead of inflating the kept nodes' outputs at training time.

[Figure 1: A network where the dropout penalty is negative. Inputs $x_1$ and $x_2$ feed two hidden nodes; all weights are 1.]

This criterion is equivalent to the expected risk of the dropout-modified network, and we use $W_D$ to denote a minimizer of it. Since the selection of dropout pattern and example from $P$ are independent, the order of the two expectations can be swapped, yielding
$$J_D(W) = E_{(x,y) \sim P} E_R (D(W, x, R) - y)^2. \quad (1)$$
Equation (1) is a key property of the dropout criterion. It indicates that when something is true about the dropout criterion for a family of distributions concentrated on single examples, then (usually) the same thing will be true for any mixture of these single-example distributions.
Consider now the example network in Figure 1. The weight parameters $W_1$ and $w$ are all 1's, and all of the biases are 0. $W(1, -1) = 0$, as each hidden node computes 0. Each dropout pattern indicates the subset of the four lower nodes to be kept, and when $q = p = 1/2$ each subset is equally likely to be kept. If $R$ is the dropout pattern where input $x_2$ is dropped and the other nodes are kept, then the network computes $D(W, (1, -1), R) = 8$ (recall that when $p = 1/2$ the values of non-dropped-out nodes are doubled). Only three dropout patterns produce a non-zero output, so if $P$ is concentrated on the example $x = (1, -1)$, $y = 8$, the dropout criterion is:
$$J_D(W) = \tfrac{1}{16}(8 - 8)^2 + \tfrac{2}{16}(4 - 8)^2 + \tfrac{13}{16}(0 - 8)^2 = 54. \quad (2)$$

As mentioned in the introduction, the dropout penalty of a weight vector for a given example distribution and dropout probability is the amount by which the dropout criterion exceeds the risk, $J_D(W) - R(W)$. Wager, et al [28] show that for 1-layer generalized linear models, the dropout penalty is non-negative. Since $W(1, -1) = 0$, we have $R(W) = 64$, and the dropout penalty is negative in our example. This is because the variance in the output due to dropout causes the network to better fit the data (on average) than the network's non-dropout evaluation. In Section 5.2, we give a necessary condition for this variance to be beneficial.

As with the dropout criterion, the dropout penalty decomposes into an expectation of penalties over single examples:
$$J_D(W) - R(W) = E_{(x,y) \sim P} E_R \left[ (D(W, x, R) - y)^2 - (W(x) - y)^2 \right].$$

Definition 1 Define $P_{(x,y)}$ as the distribution with half of its weight on example $(x, y)$ and half of its weight on example $(\mathbf{0}, 0)$.

Unless indicated otherwise, we assume $p = q = 1/2$ for simplicity, although this is not crucial for our results.
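The arithmetic in (2) can be checked by brute force. The sketch below (our illustration, not from the paper) enumerates all 16 dropout patterns over the four lower nodes of the Figure 1 network with $p = 1/2$, and recovers $J_D(W) = 54$, $R(W) = 64$, and the negative penalty $-10$:

```python
import itertools

def relu(a):
    return max(0.0, a)

def D(x, pattern, p=0.5):
    # pattern = (keep x1, keep x2, keep h1, keep h2); kept nodes rescale by 1/p.
    rx1, rx2, rh1, rh2 = pattern
    xs = [x[0] * rx1 / p, x[1] * rx2 / p]
    # Both hidden nodes have weights (1, 1) and bias 0.
    h1 = relu(xs[0] + xs[1]) * rh1 / p
    h2 = relu(xs[0] + xs[1]) * rh2 / p
    return h1 + h2          # output weights are (1, 1), bias 0

x, y = (1.0, -1.0), 8.0
patterns = list(itertools.product([0, 1], repeat=4))   # 16 equally likely patterns
J = sum((D(x, r) - y) ** 2 for r in patterns) / 16     # dropout criterion
R = (0.0 - y) ** 2                                     # W(1, -1) = 0 without dropout
print(J, R, J - R)
```

Running this prints `54.0 64.0 -10.0`, matching (2) and the negative-penalty claim.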
3 Scaling inputs, weights and outputs

3.1 Dropout is scale-free

Here we prove that dropout regularizes deep networks in a manner that is independent of the scale of the input features. In other words, training under dropout regularization does not penalize the use of large weights when needed to compensate for small input values.

Definition 2 For any example distribution $P$, define the dropout aversion of $P$ to be the maximum, over minimizers $W_D$ of the dropout criterion $J_D$, of $R_P(W_D) - \inf_W R_P(W)$.

The dropout aversion of $P$ measures the extent to which $P$ is incompatible with the inductive bias of dropout, measured by the risk gap between the true risk minimizer and the optimizers of the dropout criterion.

Definition 3 For example distribution $P$ and square matrix $A$, denote by $A \circ P$ the distribution obtained by sampling $(x, y)$ from $P$ and outputting $(Ax, y)$.

When $A$ is diagonal and has full rank, then $A \circ P$ is a rescaling of the inputs, like changing one input from minutes to seconds and another from feet to meters.

Theorem 4 For any example distribution $P$, and any diagonal full-rank $K \times K$ matrix $A$, the dropout aversion of $P$ equals the dropout aversion of $A \circ P$.

Proof: Choose a network $W = (W_1, b_1, \ldots, W_{d-1}, b_{d-1}, w, b)$. Let $W' = (W_1 A^{-1}, b_1, \ldots, W_{d-1}, b_{d-1}, w, b)$. For any $x$, $W(x) = W'(Ax)$, as $A^{-1}$ undoes the effect of $A$ before it gets to the rest of the network, which is unchanged. Furthermore, for any dropout pattern $R$, we have $D(W, x, R) = D(W', Ax, R)$. Once again $A^{-1}$ undoes the effects of $A$ on kept nodes (since $A$ is diagonal), and the rest of the network $W'$ is modified by $R$ in a manner paralleling $W$. Thus, there is a bijection between networks $W$ and networks $W'$ with $J_D(W') = J_D(W)$ and $R(W') = R(W)$, yielding the theorem.
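The key identity in the proof of Theorem 4, $D(W, x, R) = D(W', Ax, R)$ with the first layer of $W'$ set to $W_1 A^{-1}$, can be checked numerically. The following sketch is ours, not the paper's: the weights, the diagonal $A$, and the dropout pattern are arbitrary made-up values.

```python
import random

def relu(a):
    return max(0.0, a)

def forward_dropout(layers, x, pattern, p=0.5):
    # layers: list of (weight matrix, bias vector); the last layer is linear.
    # pattern: keep-masks for the inputs and each hidden layer.
    act = [v * m / p for v, m in zip(x, pattern[0])]          # dropout on inputs
    for li, (Wm, bv) in enumerate(layers):
        pre = [sum(w * a for w, a in zip(row, act)) + b for row, b in zip(Wm, bv)]
        if li < len(layers) - 1:                              # hidden: ReLU + dropout
            act = [relu(z) * m / p for z, m in zip(pre, pattern[li + 1])]
        else:
            act = pre
    return act[0]

random.seed(0)
K, n = 3, 4
W1 = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(n)]
b1 = [random.uniform(-1, 1) for _ in range(n)]
w = [[random.uniform(-1, 1) for _ in range(n)]]
b = [random.uniform(-1, 1)]
A = [2.0, 0.25, -3.0]                                         # diagonal, full rank

# W' absorbs A^{-1} into the first layer: W1' = W1 A^{-1}.
W1p = [[wij / aj for wij, aj in zip(row, A)] for row in W1]

x = [0.5, -1.0, 2.0]
Ax = [ai * xi for ai, xi in zip(A, x)]
pattern = [[1, 0, 1], [1, 1, 0, 1]]                           # arbitrary dropout pattern
out1 = forward_dropout([(W1, b1), (w, b)], x, pattern)
out2 = forward_dropout([(W1p, b1), (w, b)], Ax, pattern)
assert abs(out1 - out2) < 1e-9
```

Because $A$ is diagonal, the rescaling commutes with the per-coordinate dropout mask, which is exactly why the proof's bijection works.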
Theorem 4 indicates that some common normalizations of the input features (e.g. to have unit variance) do not affect the quality of the dropout criterion minimizers, but normalization might change the speed of convergence and which minimizer is reached.

Centering the features has slightly different properties. Although it is easy to use the biases to define a $W'$ that "undoes" the centering in the non-dropout computation, different $W'$ appear to be required for different dropout patterns, breaking the bijection exploited in Theorem 4.

As we will see in Section 3.4, weight decay does not enjoy such scale-free status.

3.2 Dropout's invariance to parameter scaling

Next, we describe an equivalence relation among parameterizations for dropout networks of depth $d \geq 2$. Basically, scaling the parameters at a level creates a corresponding scaling of the output. (A similar observation was made in a somewhat different context by Neyshabur, et al [20].)

Theorem 5 For any input $x$, dropout pattern $R$, any network $W = (W_1, b_1, \ldots, W_{d-1}, b_{d-1}, w, b)$, and any positive $c_1, \ldots, c_d$, if
$$W' = \left( c_1 W_1,\; c_1 b_1,\; c_2 W_2,\; c_1 c_2 b_2,\; \ldots,\; c_{d-1} W_{d-1},\; \Big(\prod_{j=1}^{d-1} c_j\Big) b_{d-1},\; c_d w,\; \Big(\prod_{j=1}^{d} c_j\Big) b \right), \quad (3)$$
then $D(W', x, R) = \prod_{j=1}^{d} c_j \, D(W, x, R)$. In particular, if $\prod_{j=1}^{d} c_j = 1$, then for any example distribution $P$, networks $W$ and $W'$ have the same dropout criterion, dropout penalty, and expected loss.

Note that the rescaling of the biases at layer $j$ depends not only on the rescaling of the connection weights at layer $j$, but also on the rescalings at lower layers.

Proof: Choose an input $x$ and a dropout pattern $R$. Define $W'$ as in (3). For each hidden layer $j$, let $(h_{j1}, \ldots, h_{jn})$ be the $j$th hidden layer when applying $W$ to $x$ with $R$, and let $(\tilde{h}_{j1}, \ldots, \tilde{h}_{jn})$ be the $j$th hidden layer when applying $W'$ instead.
By induction, for all $i$, $\tilde{h}_{ji} = \prod_{\ell \leq j} c_\ell \, h_{ji}$; the key step is that the pre-rectified value used to compute $\tilde{h}_{ji}$ has the same sign as for $h_{ji}$, since rescaling by $c_j$ preserves the sign. Thus the same units are zeroed out in $W$ and $W'$, and $D(W', x, R) = \big(\prod_j c_j\big) D(W, x, R)$. When $\prod_j c_j = 1$, this implies $D(W, x, R) = D(W', x, R)$. Since this is true for all $x$ and $R$, we have $J_D(W) = J_D(W')$. Since, similarly, $W(x) = W'(x)$ for all $x$, we have $R(W) = R(W')$, so the dropout penalties for $W$ and $W'$ are also the same.

Theorem 5 implies that the dropout criterion never has isolated minimizers, since one can continuously up-scale the weights on one layer with a compensating down-scaling at another layer to get a contiguous family of networks computing the same function and having the same dropout criterion.

It may be possible to exploit the parameterization equivalence of Theorem 5 in training, by using canonical forms for the equivalent networks or by switching to an equivalent network whose gradients have better properties. We leave this question for future work.

3.3 Output scaling with dropout

Scaling the output values of an example distribution $P$ does affect the aversion, but in a very simple and natural way.

Theorem 6 For any example distribution $P$, if $P'$ is obtained from $P$ by scaling the outputs of $P$ by a positive constant $c$, the dropout aversion of $P'$ is $c^2$ times the dropout aversion of $P$.

Proof: If a network $W$ minimizes the dropout criterion for $P$, then the network $W'$ obtained by scaling up the weights and bias for the output unit by $c$ minimizes the dropout criterion for $P'$, and for any $x$, $y$, and dropout pattern $R$, $(D(W, x, R) - y)^2 = (D(W', x, R) - cy)^2 / c^2$.

3.4 Scaling properties of weight decay

Weight decay does not have the same scaling properties as dropout.
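The layer-rescaling invariance of Theorem 5 is also easy to confirm numerically. The sketch below is our illustration with arbitrary made-up weights: it takes a depth-2 network, scales the first layer by $c_1 = c$ and the output weights by $c_2 = 1/c$ (so $c_1 c_2 = 1$, and the output bias is unchanged), and checks that the dropped-out outputs coincide.

```python
import random

def relu(a):
    return max(0.0, a)

def dropout_out(W1, b1, w, b, x, pattern, p=0.5):
    # Depth-2 network: one ReLU hidden layer, linear output; pattern gives
    # keep-masks for the inputs and the hidden nodes.
    in_mask, hid_mask = pattern
    xs = [v * m / p for v, m in zip(x, in_mask)]
    hid = [relu(sum(wij * a for wij, a in zip(row, xs)) + bi) * m / p
           for row, bi, m in zip(W1, b1, hid_mask)]
    return sum(wi * h for wi, h in zip(w, hid)) + b

random.seed(1)
K, n, c = 3, 5, 7.0
W1 = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(n)]
b1 = [random.uniform(-1, 1) for _ in range(n)]
w = [random.uniform(-1, 1) for _ in range(n)]
b = random.uniform(-1, 1)

# Theorem 5 with c1 = c, c2 = 1/c: scale layer 1 up, output weights down.
W1c = [[c * v for v in row] for row in W1]
b1c = [c * v for v in b1]
wc = [v / c for v in w]                 # output bias b is multiplied by c1*c2 = 1

x = [0.3, -0.7, 1.2]
pattern = ([1, 1, 0], [1, 0, 1, 1, 0])  # an arbitrary dropout pattern
o1 = dropout_out(W1, b1, w, b, x, pattern)
o2 = dropout_out(W1c, b1c, wc, b, x, pattern)
assert abs(o1 - o2) < 1e-9
```

The check relies on exactly the mechanism in the proof: multiplying a pre-activation by a positive $c$ does not change which ReLUs are zeroed out.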
Define the weight-decay aversion analogously to the dropout aversion. We analyze the $L_2$ criterion for depth-2 networks in Appendix B, resulting in the following theorem. Our proof shows that, despite the non-convexity of the $L_2$ criterion, it is still possible to identify a closed form for one of its optimizers.

Theorem 7 Choose an arbitrary number $n$ of hidden nodes, and $\lambda > 0$. The weight-decay aversion of $P_{(x,1)}$ is $\min\left(\frac{1}{4}, \frac{\lambda}{2\sqrt{x \cdot x}}\right)$.

Theorem 7 shows that, unlike dropout, the weight-decay aversion does depend on the scaling of the input features. Furthermore, when $2\lambda > \sqrt{x \cdot x}$, the weight-decay criterion for $P_{(x,1)}$ has only a single isolated optimum weight setting²: all weights set to zero and bias $1/2$ at the output node. This means that weight-decay in 2-layer networks can completely regularize away significant signal in the sample even when $\lambda$ is finite, contrasting starkly with weight-decay's behavior in 1-layer networks.

The "vertical" flexibility to rescale weights between layers enjoyed by dropout (Theorem 5) does not hold for $L_2$: one can always drive the $L_2$ penalty to infinity by scaling one layer up by a large enough positive $c$, even while scaling another down by $c$. On the other hand, the proof of Theorem 7 shows that the $L_2$ criterion has an alternative "horizontal" flexibility involving the rescaling of weights across nodes on the hidden layer (under the theorem's assumptions). Lemma 32 shows that at the optimizers each hidden node's contributions to the output are a constant (depending on the input) times their contribution to the $L_2$ penalty.
Shifting the magnitudes of these contributions between hidden nodes leads to alternative weights that compute the same value and have the same weight-decay penalty. This is a more general observation than the permutation symmetry between hidden nodes, because any portion of a hidden node's contribution can be shifted to another hidden node.

² Since the output node puts weight 0 on each hidden node and the biases are unregularized, this optimum actually represents a class of networks differing only in the irrelevant biases at the hidden nodes. One can easily construct other cases where weight-decay has isolated minima in this sense, for example when $n = 2$ and there is equal probability on $x$ and $-x$, both with label 1.

4 Negative weights for monotone functions

If the weights of a unit are non-negative, then the unit computes a monotone function, in the sense that increasing any input while keeping the others fixed increases the output. The bias does not affect a node's monotonicity. A network of monotone units is also monotone. We first present our theoretical results for many features (Section 4.1) and few features (Section 4.2), and then discuss the implications of these results in Section 4.3.

4.1 The basic case: many features

In this section, we analyze the simple distribution $P_{(\mathbf{1},1)}$ that assigns probability $1/2$ to the example $(0, \ldots, 0), 0$, and probability $1/2$ to the example $(1, \ldots, 1), 1$. This is arguably the simplest monotone function. Nevertheless, we prove that dropout uses negative weights to fit this data. The key intuition is that optimizing the dropout criterion requires controlling the variance. Negative weights at the hidden nodes can be used to control the variance due to dropout at the input layer. When there are enough hidden nodes this becomes so beneficial that every minimizer of the dropout criterion uses such negative weights.
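To illustrate this variance-control intuition numerically (our illustration, not part of the paper's formal argument), one can compare, under input dropout with $p = 1/2$, the variance of the plain rescaled sum of $K$ all-1 inputs with the variance of the output of one "first-one" gadget, the negative-weight construction defined just below: a block of ReLUs in which node $i$ puts weight $+1$ on input $i$ and $-1$ on every earlier input.

```python
import itertools
from statistics import pvariance

K, p = 8, 0.5   # K all-1 inputs, keep-probability 1/2 (toy parameters)

def first_one_sum(kept):
    # Node i fires only when input i is kept and all earlier inputs are
    # dropped, thanks to the -1 weights on earlier inputs.
    vals = [k / p for k in kept]                  # kept inputs rescale to 2
    return sum(max(0.0, vals[i] - sum(vals[:i])) for i in range(K))

plain, gadget = [], []
for kept in itertools.product([0, 1], repeat=K):  # all 2^K dropout patterns
    plain.append(sum(k / p for k in kept))        # direct rescaled sum
    gadget.append(first_one_sum(kept))

# The plain sum has variance K; the gadget output is 2 unless every input
# is dropped, so almost all of the dropout variance disappears.
print(pvariance(plain), pvariance(gadget))
```

For $K = 8$ the plain sum has variance 8 over the dropout patterns, while the gadget's variance is below 0.02, coming only from the all-dropped event of probability $2^{-K}$.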
Theorem 8 For the standard architecture, if $K > 18$ and $n$ is large enough relative to $K$ and $d$, every optimizer of the dropout criterion for $P_{(\mathbf{1},1)}$ uses at least one negative weight.

To prove Theorem 8, we first calculate $J_D(W_{\mathrm{neg}})$ for a network $W_{\mathrm{neg}}$ that uses negative weights, and then prove a lower bound greater than this value that holds for all networks using only non-negative weights. All of the biases in $W_{\mathrm{neg}}$ are 0.

A key building block in the definition of $W_{\mathrm{neg}}$ is a block of hidden units that we call the first-one gadget. Each such block has $K$ hidden nodes, and takes its input from the $K$ input nodes. The $i$th hidden node in the block takes the value 1 if the $i$th input node is 1 and all inputs $x_{i'}$ for $i' < i$ are 0. This can be accomplished with a weight vector $w$ with $w_{i'} = -1$ for $i' < i$, with $w_i = 1$, and with $w_{i'} = 0$ for $i' > i$. The first hidden layer of $W_{\mathrm{neg}}$ comprises $n/K$ copies of the first-one gadget. Informally, this construction removes most of the variance in the number of 1's in the input, as recorded in the following lemma.

Lemma 9 On any input $x \in \{0, 1\}^K$ except $(0, 0, \ldots, 0)$, the sum of the values on the first hidden layer of $W_{\mathrm{neg}}$ is exactly $n/K$.

The weights into the remaining hidden layers of $W_{\mathrm{neg}}$ are all 1, and all the weights into the output layer take the value
$$c \stackrel{\mathrm{def}}{=} \frac{K}{2 n^{d-1}\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}},$$
chosen to minimize the dropout criterion for the network. The following lemma analyzes $W_{\mathrm{neg}}$.

Lemma 10
$$J_D(W_{\mathrm{neg}}) = \frac{1}{2}\left(1 - \frac{1 - 2^{-K}}{\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}}\right).$$

When $n$ is large relative to $K$ and $d$, the $\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}$ denominator in Lemma 10 approaches 1, so $J_D(W_{\mathrm{neg}})$ approaches $2^{-K}/2$ in this case. Lemma 16 below gives a larger lower bound for any network with all non-negative weights. In the concrete case when $d = 2$ and $n = K^3$, Lemma 10 implies $J_D(W_{\mathrm{neg}}) < 1/K^2$.
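Lemma 9 can be verified exhaustively at a toy width. The sketch below is ours: it uses $K = 4$ (which ignores the paper's $K > 18$ requirement but keeps the gadget structure intact), builds the first hidden layer of $W_{\mathrm{neg}}$ as $n/K$ gadget copies, and checks the layer sum on every binary input.

```python
import itertools

K = 4            # toy input width
n = 3 * K        # first hidden layer holds n/K = 3 copies of the gadget

def first_hidden_layer(x):
    # Gadget node i: weight +1 on input i, -1 on inputs i' < i, 0 above;
    # it computes 1 exactly when input i is the first 1 in x.
    gadget = [max(0, x[i] - sum(x[:i])) for i in range(K)]
    return gadget * (n // K)        # n/K identical copies

for x in itertools.product([0, 1], repeat=K):
    total = sum(first_hidden_layer(x))
    # Lemma 9: exactly one key node per gadget computes 1 on a nonzero input.
    assert total == (n // K if any(x) else 0)
print("Lemma 9 verified for all", 2 ** K, "binary inputs")
```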
Proof (of Lemma 10): Consider a computation of $W_{\mathrm{neg}}(1, 1, \ldots, 1)$ under dropout, and let $\hat{y}$ be the (random) output. Let $k_0$ be the number of input nodes kept, and, for each $j \geq 2$, let $k_j$ be the number of nodes in the $j$th hidden layer kept. Call the node in each first-one gadget that computes 1 a key node, and if no node in the gadget computes 1 because the inputs are all dropped, arbitrarily make the gadget's first hidden node the key node. This ensures there is exactly one key node per gadget, and every non-key node computes 0. Let $k_1$ be the number of kept key nodes on the first hidden layer.

If $k_0 = 0$, the output $\hat{y}$ of the network is 0. Otherwise, $\hat{y} = c\, 2^d \prod_{j=1}^{d-1} k_j$. Note that $k_0$ is zero with probability $2^{-K}$. Whenever $k_0 \geq 1$, $k_1$ is distributed as $B(n/K, 1/2)$. Each other $k_j$ is distributed as $B(n, 1/2)$, and $k_1, k_2, \ldots, k_{d-1}$ are independent of one another. Thus
$$E(\hat{y}) = \Pr(k_0 \geq 1)\, c\, 2^d\, E[k_1 \mid k_0 \geq 1] \prod_{j=2}^{d-1} E[k_j] = (1 - 2^{-K})\, c\, 2^d\, \frac{n}{2K} \left(\frac{n}{2}\right)^{d-2} = \frac{2c(1 - 2^{-K})\, n^{d-1}}{K}.$$
Using the value of the second moment of the binomial, we get
$$E(\hat{y}^2) = E\left[\mathbf{1}_{k_0 \geq 1} \left(c\, 2^d \prod_{j=1}^{d-1} k_j\right)^2\right] = 4c^2(1 - 2^{-K})\, \frac{n}{K}\left(\frac{n}{K} + 1\right) n^{d-2}(n+1)^{d-2} = \frac{4c^2(1 - 2^{-K})\, n^{2(d-1)}}{K^2}\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}.$$
Thus,
$$J_D(W_{\mathrm{neg}}) = \frac{1}{2}\left(1 - 2E(\hat{y}) + E(\hat{y}^2)\right) = \frac{1}{2}\left(1 - \frac{4c(1 - 2^{-K})\, n^{d-1}}{K} + \frac{4c^2(1 - 2^{-K})\, n^{2(d-1)}}{K^2}\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}\right) = \frac{1}{2}\left(1 - \frac{1 - 2^{-K}}{\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}}\right),$$
since $c = \frac{K}{2 n^{d-1}\left(1 + \frac{K}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}}$, completing the proof.

Next we prove a lower bound on $J_D$ for networks with non-negative weights. Let $W$ be an arbitrary such network. Our lower bound will use a property of the function computed by $W$ that we now define.
Definition 11 A function $\varphi: \mathbb{R}^K \to \mathbb{R}$ is supermodular if for all $x, \delta_1, \delta_2 \in \mathbb{R}^K$ with $\delta_1, \delta_2 \geq 0$,
$$\varphi(x) + \varphi(x + \delta_1 + \delta_2) \geq \varphi(x + \delta_1) + \varphi(x + \delta_2),$$
or equivalently,
$$\varphi(x + \delta_1 + \delta_2) - \varphi(x + \delta_2) \geq \varphi(x + \delta_1) - \varphi(x).$$
The latter form indicates that adding $\delta_1$ to the bigger input $x + \delta_2$ has at least as large an effect as adding it to the smaller input $x$.

Since $W$ has all non-negative weights, it computes a supermodular function of its inputs. (This fact may be of independent interest.)

Lemma 12 If a network has non-negative weights and its activation functions $\sigma(\cdot)$ are convex, continuous, non-decreasing, and differentiable except on a finite set, then the network computes a supermodular function of its input $x$.

Proof: We will prove by induction over the layers that, for any unit $h$ in the network, if $h(x)$ is the output of unit $h$ when $x$ is the input to $W$, then $h(\cdot)$ is a supermodular function of its input. The base case holds since each input node $h$ outputs the corresponding component of the input, and $(x + \delta_1) - x = (x + \delta_1 + \delta_2) - (x + \delta_2)$.

Now, for the inductive step, let $w$ be the weight vector for node $h$, let $b$ be its bias, and $\sigma(\cdot)$ its activation function. Let $I(x)$, $I(x + \delta_1)$, $I(x + \delta_2)$, and $I(x + \delta_1 + \delta_2)$ be the inputs to node $h$ when the inputs to the network are $x$, $x + \delta_1$, $x + \delta_2$ and $x + \delta_1 + \delta_2$ respectively. By induction, these inputs to node $h$ satisfy (componentwise)
$$I(x + \delta_1) - I(x) \leq I(x + \delta_1 + \delta_2) - I(x + \delta_2).$$
Therefore, since $w$, $\delta_1$, and $\delta_2$ are non-negative, the interval $[w \cdot I(x + \delta_2) + b,\; w \cdot I(x + \delta_1 + \delta_2) + b]$ is at least as long and starts at least as high as the interval $[w \cdot I(x) + b,\; w \cdot I(x + \delta_1) + b]$.
Since $\sigma$ is continuous and differentiable except on a finite set, we have
$$h(I(x + \delta_1)) - h(I(x)) = \int_{w \cdot I(x) + b}^{w \cdot I(x + \delta_1) + b} \sigma'(z)\, dz \leq \int_{w \cdot I(x + \delta_2) + b}^{w \cdot I(x + \delta_1 + \delta_2) + b} \sigma'(z)\, dz = h(I(x + \delta_1 + \delta_2)) - h(I(x + \delta_2)),$$
where the inequality holds since $\sigma'$ is non-decreasing.

Definition 13 Let $r_0 \in \{0, 1\}^K$ be the dropout pattern concerning the input layer, and let $R'$ be the dropout pattern concerning the rest of the network, so that the dropout pattern $R = (r_0, R')$. For each $\ell \in \{0, \ldots, K\}$, let $\psi_W(\ell)$ be the average output of $W$ under dropout when $\ell$ of the inputs are kept: i.e.,
$$\psi_W(\ell) = E\left[ D(W, \mathbf{1}_K, (r_0, R')) \,\Big|\, \textstyle\sum_j r_{0j} = \ell \right].$$

Lemma 14 For any $\ell \in \{1, \ldots, K - 1\}$,
$$\psi_W(\ell + 1) - \psi_W(\ell) \geq \psi_W(\ell) - \psi_W(\ell - 1).$$

Proof: Generate $u$, $i$ and $j$ randomly by first choosing $u$ uniformly at random from among bit vectors with $\ell$ ones, then choosing $i$ uniformly from the 0-components of $u$, and $j$ uniformly from the 1-components of $u$. By Lemma 12,
$$W(u + e_i) - W(u) \geq W(u - e_j + e_i) - W(u - e_j) \quad (4)$$
always holds. Furthermore, $u + e_i$ is uniformly distributed among bit vectors with $\ell + 1$ ones, $u - e_j$ is uniformly distributed among bit vectors with $\ell - 1$ ones, and $u + e_i - e_j$ is uniformly distributed among bit vectors with $\ell$ ones. This is true for $W$, but it is also true for any network obtained by dropping out some of the hidden nodes of $W$. Thus
$$\psi_W(\ell + 1) - \psi_W(\ell) = E\left[ D(W, \mathbf{1}_K, (r_0, R')) \,\Big|\, \textstyle\sum_j r_{0j} = \ell + 1 \right] - E\left[ D(W, \mathbf{1}_K, (r_0, R')) \,\Big|\, \textstyle\sum_j r_{0j} = \ell \right] \geq E\left[ D(W, \mathbf{1}_K, (r_0, R')) \,\Big|\, \textstyle\sum_j r_{0j} = \ell \right] - E\left[ D(W, \mathbf{1}_K, (r_0, R')) \,\Big|\, \textstyle\sum_j r_{0j} = \ell - 1 \right] = \psi_W(\ell) - \psi_W(\ell - 1),$$
where the inequality follows from (4), completing the proof.

We will use the following lower bound on the tail of the binomial.
(Many similar lower bounds are known.)

Lemma 15 If $X$ is distributed according to $\mathrm{Bin}(n, 1/2)$, then $\Pr(X < n/2 - \sqrt{n}/4) = \Pr(X > n/2 + \sqrt{n}/4) \geq 1/4$.

Proof: Using the fact that, for any $i$, $\Pr(X = i) \leq 1/\sqrt{n}$, we get $\Pr(|X - n/2| < \sqrt{n}/4) \leq 1/2$, so $\Pr(X < n/2 - \sqrt{n}/4) \geq 1/4$. $\Box$

Now we are ready for the lower bound on $J_D(\mathcal{W})$.

Lemma 16 If $K > 18$ and the weights in $\mathcal{W}$ are non-negative, then $J_D(\mathcal{W}) \geq \frac{1}{36K}$.

Proof: Assume to the contrary that $J_D(\mathcal{W}) < \frac{1}{36K}$. First, note that $\psi_{\mathcal{W}}(0) \leq \sqrt{\frac{1}{18K}}$, or else the contribution to $J_D(\mathcal{W})$ due to the $(\mathbf{0}, 0)$ example is at least $\frac{1}{36K}$. Applying Lemma 15, we have
\[
\psi_{\mathcal{W}}(K/2 - \sqrt{K}/4) > 1 - \sqrt{\tfrac{2}{9K}}
\quad\text{and}\quad
\psi_{\mathcal{W}}(K/2 + \sqrt{K}/4) < 1 + \sqrt{\tfrac{2}{9K}}, \tag{5}
\]
as otherwise the contribution of one of the tails to $J_D(\mathcal{W})$ would be at least $\frac{1}{36K}$ for the $((1, \ldots, 1), 1)$ example. We will contradict this small variation of $\psi_{\mathcal{W}}(\ell)$ around $K/2$.

The bounds on $\psi_{\mathcal{W}}(0)$ and $\psi_{\mathcal{W}}(K/2 - \sqrt{K}/4)$ and Lemma 14 imply that $\psi_{\mathcal{W}}(\ell)$ grows rapidly when $\ell$ is around $K/2$; in particular,
\[
\psi_{\mathcal{W}}(K/2 - \sqrt{K}/4 + 1) - \psi_{\mathcal{W}}(K/2 - \sqrt{K}/4)
\geq \frac{1 - \sqrt{\frac{2}{9K}} - \sqrt{\frac{1}{18K}}}{K/2 - \sqrt{K}/4}
> \frac{1}{\sqrt{9/32}\,K},
\]
since $K > 18$. Now using Lemma 14 repeatedly shows that
\[
\psi_{\mathcal{W}}(K/2 + \sqrt{K}/4) - \psi_{\mathcal{W}}(K/2 - \sqrt{K}/4)
> \frac{\sqrt{K}}{2} \times \frac{1}{\sqrt{9/32}\,K} = 2\sqrt{\frac{2}{9K}},
\]
which contradicts (5), completing the proof. $\Box$

Putting together Lemmas 10 and 16 immediately proves Theorem 8, since for $K > 18$ and large enough $n$, the criterion for $\mathcal{W}_{neg}$ must be less than the criterion for any network with all non-negative weights.

4.2 The case when K = 2

Theorem 8 uses the assumption that $K > 18$ and $n$ is large enough; is the lower bound on $K$ really necessary? Here we show that it is not, by treating the case that $K = 2$.

Theorem 17 For the standard architecture, if $K = 2$, for any fixed $d$ and large enough $n$, every optimizer of the dropout criterion for $P_{(1,1),1}$ uses negative weights.
Proof: Define $\mathcal{W}_{neg}$ as in the proof of Lemma 10, except that the output layer has a bias of $1/5$. We claim that
\[
\lim_{n \to \infty} J_D(\mathcal{W}_{neg}) = 1/10. \tag{6}
\]
To see this, consider the joint input/label distribution under dropout:
\[
\Pr((0,0), 0) = 1/2, \quad
\Pr((0,0), 1) = 1/8, \quad
\Pr((2,0), 1) = 1/8, \quad
\Pr((0,2), 1) = 1/8, \quad
\Pr((2,2), 1) = 1/8.
\]
Due to the bias of $1/5$ on the output, $\mathcal{W}_{neg}(0,0) = 1/5$. Thus, the contribution to $J_D$ from examples with $x = (0,0)$ in this joint distribution is $\tfrac12 \times (1/5)^2 + \tfrac18 \times (4/5)^2 = 1/10$. Now, choose $x \neq (0,0)$. If, after dropout, the input is $x$, each node in the hidden layer closest to the input computes $1$. Arguing exactly as in the proof of Lemma 10, in such cases,
\[
E((\hat y - 1)^2) = 1 - \frac{1}{\left(1 + \frac{2}{n}\right)\left(1 + \frac{1}{n}\right)^{d-2}},
\]
which goes to $0$ as $n \to \infty$. This proves (6).

Now, let $\mathcal{W}$ be an arbitrary network with non-negative weights. For our distribution,
\[
J_D(\mathcal{W}) = \frac{E_R\big[(D(\mathcal{W}, (0,0), R) - 0)^2\big] + E_R\big[(D(\mathcal{W}, (1,1), R) - 1)^2\big]}{2}.
\]
Let $V_{00} = E(\mathcal{W}(0,0))$, $V_{22} = E(\mathcal{W}(2,2))$, $V_{20} = E(\mathcal{W}(2,0))$, and $V_{02} = E(\mathcal{W}(0,2))$, where the expectations are taken with respect to the dropout patterns at the hidden nodes (with no dropout at the inputs). Since each dropout pattern over the hidden nodes defines a particular network, and Lemma 12 holds for all of them, the relationships also hold for the expectations, so $V_{22} \geq V_{20} + V_{02} - V_{00}$.

Using this $V$ notation, handling the dropout at the input explicitly, and applying the bias-variance decomposition while keeping just the bias terms, we get
\[
J_D(\mathcal{W}) \geq \frac{(V_{00} - 0)^2 + \big[(V_{00} - 1)^2 + (V_{22} - 1)^2 + (V_{20} - 1)^2 + (V_{02} - 1)^2\big]/4}{2} \tag{7}
\]
\[
8 J_D(\mathcal{W}) \geq 4(V_{00} - 0)^2 + (V_{00} - 1)^2 + (V_{22} - 1)^2 + (V_{20} - 1)^2 + (V_{02} - 1)^2. \tag{8}
\]
We will continue lower-bounding the right-hand side. We can re-write $V_{22}$ as $V_{20} + V_{02} - V_{00} + \epsilon$ where $\epsilon \geq 0$.
This is convex and symmetric in $V_{02}$ and $V_{20}$, so they both take the same value at the minimizer of the right-hand side; we proceed using $V_{20}$ for this common minimizing value:
\[
8 J_D(\mathcal{W}) \geq (2V_{20} - V_{00} + \epsilon - 1)^2 + 2(V_{20} - 1)^2 + (V_{00} - 1)^2 + 4V_{00}^2.
\]
Differentiating with respect to $V_{20}$, we see that the right-hand side is minimized when $V_{20} = (2 + V_{00} - \epsilon)/3$, giving
\[
8 J_D(\mathcal{W}) \geq \frac{(V_{00} - \epsilon - 1)^2}{3} + (V_{00} - 1)^2 + 4V_{00}^2.
\]
If $V_{00} \geq 1$, then $J_D \geq 1/2$ (just from the $((0,0),0)$ example); and when $V_{00} < 1$, the right-hand side above is minimized over non-negative $\epsilon$ when $\epsilon = 0$. Using this substitution, the minimizing value of $V_{00}$ is $1/4$, giving $8 J_D(\mathcal{W}) \geq 1$, i.e., $J_D(\mathcal{W}) \geq 1/8$. Combining this with (6) completes the proof. $\Box$

4.3 More general distributions and implications

In the previous subsection we analyzed the distribution $P$ over the 2-feature examples $((1,1), 1)$ and $((0,0), 0)$. However, these two examples can be embedded in a larger feature space by using any fixed vector of additional feature values, creating, for instance, a distribution over $((0,1,0,0,0,1/2,1), 0)$ and $((0,1,1,1,0,1/2,1), 1)$ (here the original two features sit in the third and fourth positions, and the remaining feature values are held fixed). The results of Section 4.2 still apply to the distribution over these extended examples, after defining the $\mathcal{W}_{neg}$ network to have zero weight on the additional features, and noticing that any weight on the additional features in the positive-weight network $\mathcal{W}$ can be simulated using the biases at the hidden nodes.

It is particularly interesting when the additional features all take the value 0; we call these zero-embeddings. Every network $\mathcal{W}$ with non-negative weights has $J_D(\mathcal{W}) \geq 1/8$ on each of these zero-embeddings of $P$. On the other hand, a single $\mathcal{W}_{neg}$ network with $n/K$ copies of the $K$-input first-one gadget has $J_D(\mathcal{W}_{neg}) \approx 1/10$ simultaneously for all of these zero-embeddings of $P$ (when $n \gg K$).
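The dropout-induced joint distribution used in the proof of Theorem 17 is small enough to verify by direct enumeration. The following minimal sketch does so; multiplying kept inputs by 2 implements the $1/(1-q)$ rescaling with $q = 1/2$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# P puts probability 1/2 on each of ((0,0), 0) and ((1,1), 1).
# Under input dropout with q = 1/2, each kept input is rescaled by 1/(1-q) = 2.
dist = defaultdict(Fraction)
for (x, y) in [((0, 0), 0), ((1, 1), 1)]:
    for keep in product([0, 1], repeat=2):      # dropout pattern on the 2 inputs
        xd = tuple(2 * xi * k for xi, k in zip(x, keep))
        dist[(xd, y)] += Fraction(1, 2) * Fraction(1, 4)

# The five probabilities claimed in the proof of Theorem 17:
assert dist[((0, 0), 0)] == Fraction(1, 2)
assert dist[((0, 0), 1)] == Fraction(1, 8)
assert dist[((2, 0), 1)] == Fraction(1, 8)
assert dist[((0, 2), 1)] == Fraction(1, 8)
assert dist[((2, 2), 1)] == Fraction(1, 8)
```

Exact rational arithmetic makes the check independent of floating-point rounding.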
Any source distribution over $\{0,1\}^K \times \{0,1\}$ that puts probability $1/2$ on the example $(\mathbf{0}, 0)$ and distributes the other $1/2$ probability over examples where exactly two inputs are one is a mixture of zero-embeddings of $P$. Thus $J_D(\mathcal{W}) \geq 1/8$ while $J_D(\mathcal{W}_{neg}) \approx 1/10$ for this mixture, and optimizing the dropout criterion requires negative weights.

In our analysis the negative weights used by dropout are counterintuitive for fitting monotone behavior, but are needed to control the variance due to dropout. This suggests that dropout may be less effective when layers with sparse activation patterns are fed into wider layers, as dropout training can hijack part of the expressiveness of the wide layer to control the artificial variance due to dropout rather than fitting the underlying patterns in the data.

5 Properties of the dropout penalty

5.1 Growth of the dropout penalty as a function of d

Weight decay penalizes large weights, while Theorem 5 shows that compensating rescalings of the weights do not affect the dropout penalty or criterion. On the other hand, dropout can be more sensitive than weight decay to the computation of large outputs, and large outputs can be produced in deep networks using only small weights. We make this observation concrete by exhibiting a family of networks where the depth and the desired output are linked while the size of individual weights remains constant. For this family, the dropout penalty grows exponentially in the depth $d$ (as opposed to linearly for weight decay), suggesting that dropout training is less willing to fit the data in this kind of situation.
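To make the variance mechanism concrete, the following sketch simulates dropout on exactly this kind of constant-weight network (all inputs 1, every weight equal to a common value $c$ chosen so the clean output is 1; the sizes $K = n = 4$ are arbitrary illustrative choices) and compares the simulated variance with the closed form $(1 + 1/K)(1 + 1/n)^{d-1} - 1$ derived in the proof of Theorem 18 below.

```python
import random

def dropout_sample(K, n, d, c, rng):
    # One dropout forward pass through the constant-weight network on the
    # all-ones input: each unit is kept with probability 1/2, and kept units'
    # outgoing weights are doubled (the 1/(1-q) rescaling with q = 1/2).
    value = 1.0
    for width in [K] + [n] * (d - 1):
        kept = sum(rng.random() < 0.5 for _ in range(width))
        value *= 2 * c * kept
    return value

K, n, trials, rng = 4, 4, 100_000, random.Random(0)
variances = {}
for d in (2, 3, 4):
    c = (1.0 / (K * n ** (d - 1))) ** (1.0 / d)   # clean output c^d K n^(d-1) = 1
    samples = [dropout_sample(K, n, d, c, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    variances[d] = sum((s - mean) ** 2 for s in samples) / trials
    closed_form = (1 + 1 / K) * (1 + 1 / n) ** (d - 1) - 1
    print(d, round(variances[d], 3), round(closed_form, 3))
```

With the clean output held at 1, the simulated dropout variance (and hence the dropout penalty) increases with depth and tracks the closed form, while the weight-decay penalty on this family grows only with the number of weights.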
Theorem 18 If $x = (1, 1, \ldots, 1)$ and $0 \leq y \leq K n^{d-1}$, for $P_{x,y}$ there are weights $\mathcal{W}$ for the standard architecture with $R(\mathcal{W}) = 0$ such that (a) every weight has magnitude at most one, but (b) $J_D(\mathcal{W}) \geq \frac{y^2}{K+1}$, whereas (c) $J_2(\mathcal{W}) \leq \frac{\lambda y^{2/d}}{2}\big(Kn + n^2(d-2) + n\big)$.

Proof: Let $\mathcal{W}$ be the network whose weights are all $c = \frac{y^{1/d}}{K^{1/d} n^{(d-1)/d}}$ and whose biases are all 0, so that the $L_2$ penalty is the number of weights times $\lambda c^2 / 2$. It is a simple induction to show that, for these weights and input $(1, 1, \ldots, 1)$, the value computed at each hidden node on level $j$ is $c^j K n^{j-1}$, so the network outputs $c^d K n^{d-1}$ and has zero square loss (since $\mathcal{W}(x) = c^d K n^{d-1} = y$).

Consider now dropout on this network. This is equivalent to changing all of the weights from $c$ to $2c$ and, independently with probability $1/2$, replacing the value of each node with 0. For a fixed dropout pattern, each node on a given layer has the same weights and receives the same (kept) inputs; thus the value computed at every node on the same layer is the same. For each $j$, let $H_j$ be the value computed by the units in the $j$th hidden layer. If $k_0$ is the number of input nodes kept under dropout and, for each $j \in \{1, \ldots, d-1\}$, $k_j$ is the number of hidden nodes kept in layer $j$, a straightforward induction shows that, for all $\ell$, we have $H_\ell = (2c)^\ell \prod_{j=0}^{\ell-1} k_j$, so that the output $\hat y$ of the network is $(2c)^d \prod_{j=0}^{d-1} k_j$.

Using a bias-variance decomposition, $E((\hat y - y)^2) = (E[\hat y] - y)^2 + \mathrm{Var}(\hat y)$. Since each $k_j$ is binomially distributed, and $k_0, \ldots, k_{d-1}$ are independent, we have
\[
E(\hat y) = (2c)^d (K/2)(n/2)^{d-1} = c^d K n^{d-1} = y,
\]
so $E((\hat y - y)^2) = \mathrm{Var}(\hat y)$.
Since
\[
E(\hat y^2) = (2c)^{2d} \big(K(K+1)/4\big)\big(n(n+1)/4\big)^{d-1} = y^2 (1 + 1/K)(1 + 1/n)^{d-1},
\]
we have
\[
\mathrm{Var}(\hat y) = E(\hat y^2) - E(\hat y)^2 = y^2 \big((1 + 1/K)(1 + 1/n)^{d-1} - 1\big) \geq y^2 / K,
\]
completing the proof. $\Box$

If $y = \exp(\Theta(d))$, the dropout penalty grows exponentially in $d$, whereas the $L_2$ penalty grows polynomially.

5.2 A necessary condition for negative dropout penalty

Section 2 contains an example where the dropout penalty is negative. The following theorem provides a necessary condition.

Theorem 19 The dropout penalty can be negative. For all example distributions, a necessary condition for this in rectified linear networks is that either a weight, input, or bias is negative.

Proof: Baldi and Sadowski [2] show that for networks of linear units (as opposed to the non-linear rectified linear units we focus on) the network's output without dropout equals the expected output over dropout patterns; in our notation, $\mathcal{W}(x)$ equals $E_R(D(\mathcal{W}, x, R))$. Assume for the moment that the network consists of linear units and the example distribution is concentrated on the single example $(x, y)$. Using the bias-variance decomposition for square loss and this property of linear networks,
\[
J_D(\mathcal{W}) = E_R\big(D(\mathcal{W}, x, R) - y\big)^2
= \big(E_R(D(\mathcal{W}, x, R)) - y\big)^2 + \mathrm{Var}_R\big(D(\mathcal{W}, x, R)\big)
\geq (\mathcal{W}(x) - y)^2,
\]
and the dropout penalty is non-negative. Since the same calculation goes through when averaging over multiple examples, we see that the dropout penalty is always non-negative for networks of linear nodes. When all the weights, biases and inputs in a network of rectified linear units are positive, the rectified linear units behave as linear units, so the dropout penalty will again be non-negative. $\Box$
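The non-negativity in Theorem 19 is easy to confirm by exact enumeration over dropout patterns in a tiny network. The sketch below does this for random non-negative networks, and also exhibits a concrete network with a negative input whose penalty changes with the label $y$; the hidden-layer weights $(2, 1)$ are our own hypothetical choice, picked so that both hidden units compute 0 on $x = (1, -2)$ without dropout, as in the example of Section 5.3.

```python
import itertools
import random

def forward(x, W1, b1, w2, b2, keep, scale):
    # Forward pass under a fixed dropout pattern `keep` covering the K inputs
    # followed by the n hidden units; kept units are rescaled by 1/(1-q).
    K = len(x)
    xd = [scale * xi * k for xi, k in zip(x, keep[:K])]
    h = [max(0.0, sum(w * v for w, v in zip(row, xd)) + b)    # ReLU units
         for row, b in zip(W1, b1)]
    hd = [scale * hi * k for hi, k in zip(h, keep[K:])]
    return sum(w * v for w, v in zip(w2, hd)) + b2

def dropout_penalty(x, y, W1, b1, w2, b2, q=0.5):
    # Expected dropout loss minus the no-dropout loss, computed exactly
    # by enumerating all dropout patterns.
    K, n = len(x), len(W1)
    scale = 1.0 / (1.0 - q)
    expected_loss = 0.0
    for keep in itertools.product([0, 1], repeat=K + n):
        prob = (1 - q) ** sum(keep) * q ** (K + n - sum(keep))
        expected_loss += prob * (forward(x, W1, b1, w2, b2, keep, scale) - y) ** 2
    plain = forward(x, W1, b1, w2, b2, (1,) * (K + n), 1.0)
    return expected_loss - (plain - y) ** 2

rng = random.Random(1)
# Theorem 19: with non-negative weights, biases, and inputs, penalty >= 0.
for _ in range(20):
    W1 = [[rng.random(), rng.random()] for _ in range(3)]
    b1 = [rng.random() for _ in range(3)]
    w2 = [rng.random() for _ in range(3)]
    x = [rng.random(), rng.random()]
    y = rng.random()
    assert dropout_penalty(x, y, W1, b1, w2, rng.random()) >= -1e-9

# With a negative input, the penalty can depend on the label (Section 5.3).
W1 = [[2.0, 1.0], [2.0, 1.0]]            # hypothetical hidden weights
p0 = dropout_penalty([1.0, -2.0], 0.0, W1, [0.0, 0.0], [1.0, 1.0], 0.0)
p1 = dropout_penalty([1.0, -2.0], 1.0, W1, [0.0, 0.0], [1.0, 1.0], 0.0)
assert p0 != p1
```

In the negative-input example the no-dropout output is 0 while the expected dropout output is positive, so, as in the expansion above, the penalty shifts linearly with $y$.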
5.3 Multi-layer dropout penalty does depend on labels

In contrast with its behavior on a variety of linear models including logistic regression [28], the dropout penalty can depend on the value of the response variable in deep networks with ReLUs and the quadratic loss. Thus, in a fundamental and important respect, dropout differs from traditional regularizers like weight decay or an $L_1$ penalty.

Theorem 20 There are joint distributions $P$ and $Q$, and weights $\mathcal{W}$, such that, for all dropout probabilities $q \in (0, 1)$, (a) the marginals of $P$ and $Q$ on the input variables are equal, but (b) the dropout penalties of $\mathcal{W}$ with respect to $P$ and $Q$ are different.

We will prove Theorem 20 by describing a general, somewhat technical, condition implying that $P$ and $Q$ are witnesses to Theorem 20. For each input $x$ and dropout pattern $R$, let $H(\mathcal{W}, x, R)$ be the values presented to the output node under dropout. As before, let $w \in \mathbb{R}^n$ be those weights of $\mathcal{W}$ on connections directly into the output node, and let $b$ be the bias at the output node. Let $r \in \{0,1\}^n$ be the indicator variables for whether the various nodes connecting to the output node are kept.

Proof (of Theorem 20): Suppose that $P$ is concentrated on a single $(x, y)$ pair. We will then get $Q$ by modifying $y$. Let $h$ be the values coming into the output node in the non-dropped-out network, so the output of the non-dropout network is $w \cdot h + b$, while the output of the network with dropout is $w \cdot H(\mathcal{W}, x, R) + b$. We now examine the dropout penalty, which is the expected dropout loss minus the non-dropout loss. We will use $\delta$ as shorthand for $w \cdot (H(\mathcal{W}, x, R) - h)$.
\begin{align*}
\text{dropout penalty}
&= E\big(w \cdot H(\mathcal{W}, x, R) + b - y\big)^2 - (w \cdot h + b - y)^2 \\
&= E\big(w \cdot H(\mathcal{W}, x, R) - w \cdot h + w \cdot h + b - y\big)^2 - (w \cdot h + b - y)^2 \\
&= E(\delta^2) + 2(w \cdot h + b - y)\,E(\delta),
\end{align*}
which depends on the label $y$ unless $E(\delta) = 0$. Typically $E(\delta) \neq 0$. To prove the theorem, consider the case where $x = (1, -2)$ and there are two hidden nodes, each with weight 1 on its connection to the output node. The value at the hidden nodes without dropout is 0, but with dropout the hidden nodes are never negative and compute positive values when only the negative input is dropped, so the expectation of $\delta$ is positive. $\Box$

6 Experiments

We ran simulation experiments using Torch [25]. The code is available at https://www.dropbox.com/sh/6s2lcfrq17zshmp/AAAQ06uDa4gOAuAnw2MAghEMa?dl=0

6.1 Negative weights

We trained networks with the standard architecture with $K = 5$ inputs, depth 3, and width 50 on the training data studied in Section 4: one example with $x = (0, 0, \ldots, 0)$ and $y = 0$, and one example with $x = (1, 1, \ldots, 1)$ and $y = 1$. We used the nn.StochasticGradient function from Torch with a maximum of 10000 iterations, and a learning rate of $\frac{0.1}{1 + 0.1 \times t}$ on iteration $t$. With the above parameters, we repeated the following experiment 1000 times.

• Initialize the parameters of a network $\mathcal{W}_D$.
• Clone this network, to produce a copy $\mathcal{W}$ with the same initialization.
• Train $\mathcal{W}_D$ with dropout probability $1/2$ at all nodes, and train $\mathcal{W}$ without dropout (without using weight decay for either one).
• Compare the number of negative weights $\mathrm{neg}(\mathcal{W}_D)$ in $\mathcal{W}_D$ with $\mathrm{neg}(\mathcal{W})$.

Counts of the outcomes are shown in the following table.
$\mathrm{neg}(\mathcal{W}_D) > \mathrm{neg}(\mathcal{W})$: 584
$\mathrm{neg}(\mathcal{W}_D) < \mathrm{neg}(\mathcal{W})$: 338
$\mathrm{neg}(\mathcal{W}_D) = \mathrm{neg}(\mathcal{W})$: 78

Recall that Theorem 8 shows that a global optimizer of the dropout criterion definitely uses negative weights when trained on data sources similar to $P_{(1,1),1}$, despite the fact that they are consistent with a monotone function. This experiment shows that even when optimization is done in a standard way, using dropout tends to create models with more negative weights for $P_{(1,1),1}$, and we attribute this to the variance-reduction effects associated with Theorem 8.

The experiment also indicates that training with dropout sometimes produces fewer negative weights. Due to the random initialization, a hidden ReLU unit with negative weights can evaluate to 0 on both inputs, so its weights will never be updated without dropout; on the other hand, the extra variance from dropout could cause updates to these negative weights. Another way the non-dropout training could produce more negative weights is if a hidden node whose output is too large has many small weights that turn negative with a standard gradient-descent step; with dropout, only about half of these small weights will be updated. Theorem 8 focuses on the effect of dropout at the global minimum, and abstracts away these kinds of initialization and optimization effects.

[Figure 2: Training error as a function of the scale of the inputs for Dropout and Weight Decay in the experiment of Section 6.2.]

6.2 Scale (in)sensitivity

Our experiments regarding the sensitivity with respect to the scale of the input used the standard architecture with $K = 5$, $d = 2$, $n = 5$. We used stochastic gradient descent via the optim package for Torch, with learning rate $\frac{0.01}{1 + 0.00001\,t}$, momentum 0.5, and a maximum of 100000 iterations.
We performed 10 sets of training runs. In each run:

• Ten training examples were generated uniformly at random from $[-1, 1]^K$.
• Target outputs were assigned using $y = \prod_i \mathrm{sign}(x_i)$.
• Five training sets $S_1, \ldots, S_5$ with ten examples each were obtained by rescaling the inputs by $\{0.5, 0.75, 1, 1.25, 1.5\}$ and leaving the outputs unchanged.
• The weights of a network $\mathcal{W}_{init}$ were initialized using the default initialization from Torch.
• For each $S_i$:
  – $\mathcal{W}_{init}$ was cloned three times to produce $\mathcal{W}_D$, $\mathcal{W}_2$ and $\mathcal{W}_{none}$ with identical starting parameters.
  – $\mathcal{W}_D$ was trained with dropout probability $1/2$ and no weight decay.
  – $\mathcal{W}_2$ was trained with weight decay with $\lambda = 1/2$ and no dropout.
  – $\mathcal{W}_{none}$ was trained without any regularization.

The average training losses of $\mathcal{W}_D$ and $\mathcal{W}_2$, over the 10 runs, are shown in Figure 2. (The average training loss of $\mathcal{W}_{none}$ was less than 0.05 at all scales.) The theoretical insensitivity of dropout to the scale of the inputs described in Theorem 4 is also seen here, along with the contrast with weight decay analyzed in Theorem 7. The scale of the inputs also affects the dynamics of stochastic gradient descent: with very small inputs, convergence is very slow, and with very large inputs, SGD is unstable. The effects of the scale of the inputs on inductive bias analyzed in this paper are visible at the scales where optimization can be done effectively.

7 Conclusions

The reasons behind dropout's surprisingly good performance in training deep networks across a variety of applications are somewhat mysterious, and there is relatively little existing formal analysis. A variety of explanations have been offered (e.g. [1, 2, 3, 11, 27]), including the possibility that dropout reduces the amount of co-adaptation in a network's weights [16].
The dropout criterion is an expected loss over dropout patterns, and the variance in the output values over dropout patterns contributes to this expected loss. Therefore dropout may co-adapt weights in order to reduce this (artificial) variance. We prove that this happens even in very simple situations where nothing in the training data justifies negative weights (Theorem 8). This indicates that the relationship between dropout and co-adaptation is not a simple one. The effects of dropout in deep neural networks are rather complicated, and approximations can be misleading, since the dropout penalty is very non-convex even in 1-layer networks [14].

In Section 3 we showed that dropout enjoys several scale-invariance properties that are not shared by weight decay. A perhaps surprising consequence of these invariances is that there are never isolated local minima when learning a deep network with dropout. Further exploration of these scale-invariance properties is warranted, to see whether they contribute to dropout's empirical success or can be exploited to facilitate training.

While contrasting dropout with weight decay in simple situations, we found that a degenerate all-zero network results (Theorem 7) when the $L_2$ regularization parameter is above a threshold. This is in dramatic contrast to our previous intuition from the 1-layer case.

In [28], dropout was viewed as a regularization method, adding a data-dependent penalty to the empirical loss of (presumably) undesirable solutions. Section 5 shows that, unlike in the generalized linear models case analyzed there, the dropout penalty in deeper networks can be negative and depends on the labels in the training data, and thus behaves unlike most regularizers.
On the other hand, the dropout penalty can grow exponentially in the depth of the network, and thus may better reflect the complexity of the underlying model space than $L_2$ regularization.

This paper uncovers a number of dropout's interesting fundamental properties using formal analysis of simple cases. However, the effects of using dropout training in deep networks are subtle and complex, and we hope that this paper lays a foundation to promote further formal analysis of dropout's properties and behavior.

Acknowledgments

We are very grateful to Peter Bartlett, Seshadhri Comandur, and anonymous reviewers for valuable communications.

References

[1] P. Bachman, O. Alsharif, and D. Precup. Learning with pseudo-ensembles. NIPS, 2014.
[2] P. Baldi and P. Sadowski. The dropout learning algorithm. Artificial Intelligence, 210:78-122, 2014.
[3] P. Baldi and P. Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pages 2814-2822, 2013.
[4] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
[5] L. Breiman. Some infinity theory for predictor ensembles. Annals of Statistics, 32(1):1-11, 2004.
[6] Caffe, 2016. http://caffe.berkeleyvision.edu.
[7] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740-750, 2014.
[8] G. E. Dahl. Deep learning how I did it: Merck 1st place interview, 2012. http://blog.kaggle.com.
[9] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. ICASSP, 2013.
[10] L. Deng, J. Li, J. Huang, K. Yao, D. Yu, F. Seide, M. L. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero. Recent advances in deep learning for speech research at Microsoft. ICASSP, 2013.
[11] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning, 2015.
[12] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In ICML, pages 1319-1327, 2013.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026-1034, 2015.
[14] D. P. Helmbold and P. M. Long. On the inductive bias of dropout. JMLR, 16:3403-3454, 2015.
[15] G. E. Hinton. Dropout: a simple and effective way to improve neural networks, 2012. videolectures.net.
[16] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors, 2012. arXiv:1207.0580v1.
[17] N. Kalchbrenner, E. Grefenstette, and P. Blunsom. A convolutional neural network for modelling sentences. In ACL, pages 655-665, 2014.
[18] P. M. Long and R. A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287-304, 2010.
[19] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807-814, 2010.
[20] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In COLT, pages 1376-1401, 2015.
[21] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85-117, 2015.
[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929-1958, 2014.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CVPR, 2015.
[24] TensorFlow, 2016. https://www.tensorflow.org.
[25] Torch, 2016. http://torch.ch.
[26] T. Van Erven, W. Kotłowski, and M. K. Warmuth. Follow the leader with dropout perturbations. COLT, pages 949-974, 2014.
[27] S. Wager, W. Fithian, S. Wang, and P. S. Liang. Altitude training: Strong bounds for single-layer dropout. NIPS, 2014.
[28] S. Wager, S. Wang, and P. Liang. Dropout training as adaptive regularization. NIPS, 2013.
[29] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In ICML, pages 1058-1066, 2013.
[30] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1):56-85, 2004.

A Table of Notation

Notation — Meaning
$\mathbf{1}_{\text{set}}$ — indicator function for "set"
$(x, y)$ — an example with feature vector $x$ and label $y$
$\sigma(\cdot)$ — the rectified linear unit, computing $\max(0, \cdot)$
$\mathcal{W}$ — an arbitrary weight setting for the network
$w, v$ — specific weights, often subscripted
$\mathcal{W}(x)$ — the output value produced by weight setting $\mathcal{W}$ on input $x$
$P$ — an arbitrary source distribution over $(x, y)$ pairs
$P_{x,y}$ — the source distribution concentrated on the single example $(x, y)$
$R_P(\mathcal{W})$ — the risk (expected square loss) of $\mathcal{W}$ under source $P$
$q, p$ — probabilities that a node is dropped out ($q$) or kept ($p$) by the dropout process
$R$ — a dropout pattern, indicating the kept nodes
$r, s$ — dropout patterns on subsets of the nodes
$D(\mathcal{W}, x, R)$ — output of dropout with network weights $\mathcal{W}$, input $x$, and dropout pattern $R$
$J_D(\mathcal{W})$ — the dropout criterion
$J_2(\mathcal{W})$ — the $L_2$ criterion
$\lambda$ — the $L_2$ regularization strength parameter
$\mathcal{W}_D$ — an optimizer of the dropout criterion
$\mathcal{W}_{L_2}$ — an optimizer of the $L_2$ criterion
$n, d$ — the network width and depth
$K$ — the number of input nodes

B Proof of Theorem 7

Here we prove Theorem 7, showing that the weight-decay aversion depends on the values of the inputs and the number of input nodes $K$. Furthermore, unlike in the single-layer case, the $L_2$ regularization strength has a threshold above which the minimizer of the $L_2$ criterion degenerates to the all-zero network. We will focus on the standard architecture with depth $d = 2$.

Recall that we are analyzing the distribution $P_{(x,1)}$ that assigns probability $1/2$ to $(x, 1)$ and probability $1/2$ to $(\mathbf{0}, 0)$. Also recall that, for $P_{(x,1)}$, the weight-decay aversion is the maximum risk incurred by a minimizer of $J_2$. This motivates the following definition.

Definition 21 An aversion witness is a minimizer of $J_2$ that is also a maximizer of the risk $R$ from among minimizers of $J_2$.

The proof of Theorem 7 involves a series of lemmas. We first show that there is an aversion witness $\mathcal{W}_{L_2}$ with a special form, and then relate any hidden node's effect on the output to the regularization penalty on the weights into and out of that node. This will allow us to treat optimizing the $L_2$ criterion as a one-dimensional problem, whose solution yields the theorem.

For some minimizer $\mathcal{W}_{L_2}$ of $J_2$, let $v^*_j$ denote the vector of weights into hidden node $j$ and $w^*_j$ denote the weight from $j$ to the output node. Let $a^*_j$ be the bias for hidden node $j$ and let $b^*$ be the bias for the output node. Let $h_j$ be the function computed by hidden node $j$. Note that, if there is an aversion witness with a certain property, then we may assume without loss of generality that $\mathcal{W}_{L_2}$ has that property.

Lemma 22 We may assume without loss of generality that for each hidden node $j$, there is an input $\tilde{x} \in \{\mathbf{0}, x\}$ such that $h_j(\tilde{x}) = 0$.

Proof: Suppose neither of $h_j(\mathbf{0})$ or $h_j(x)$ is 0.
If $w^*_j = 0$, then replacing both with 0 does not affect the output of $\mathcal{W}_{L_2}$, and does not increase the penalty. If $w^*_j \neq 0$, then subtracting $\min\{h_j(\mathbf{0}), h_j(x)\}$ from $a^*_j$ and adding $\min\{h_j(\mathbf{0}), h_j(x)\}\,w^*_j$ to $b$ does not affect $\mathcal{W}_{L_2}$ or the penalty, but, after this transformation, $\min\{h_j(\mathbf{0}), h_j(x)\} = 0$. $\Box$

Lemma 23 We may assume without loss of generality that for each hidden node $j$, we have $|v^*_j \cdot x| = \max\{h_j(\mathbf{0}), h_j(x)\}$.

Proof: If $h_j(x) \geq h_j(\mathbf{0}) = 0$, then $a^*_j = 0$, and $h_j(x) = v^*_j \cdot x$. Suppose instead $h_j(\mathbf{0}) > h_j(x) = 0$. Then $a^*_j > 0$, and $v^*_j \cdot x \leq -a^*_j$. If needed, the magnitude of $v^*_j$ can be decreased to make $v^*_j \cdot x = -a^*_j$. This decrease does not affect $\mathcal{W}_{L_2}(x)$ or $\mathcal{W}_{L_2}(\mathbf{0})$, and can only reduce the $L_2$ penalty. $\Box$

Lemma 24 We may assume without loss of generality that for each hidden node $j$, we have $h_j(\mathbf{0}) = 0$.

Proof: Suppose $h_j(\mathbf{0}) > 0$, and let $z$ be this old value of $h_j(\mathbf{0})$. Then $h_j(x) = 0$ and $z = h_j(\mathbf{0}) = -v^*_j \cdot x$. If we negate $v^*_j$ and set $a_j = 0$, then Lemma 23 implies that we swap the values of $h_j(x)$ and $h_j(\mathbf{0})$. Then, by adding $z w^*_j$ to $b^*$ and negating $w^*_j$, we correct for this swap at the output node and do not affect the function computed by $\mathcal{W}_{L_2}$ or the penalty. $\Box$

Note that Lemma 24 implies that all the hidden-node biases $a^*_j$ are 0.

Lemma 25 For all $j$, $v^*_j \cdot x \geq 0$.

Proof: Since $a^*_j = 0$, if $v^*_j \cdot x < 0$, we could make $\mathcal{W}_{L_2}$ compute the same function with a smaller penalty by replacing $v^*_j$ with $\mathbf{0}$. $\Box$

Lemma 25 implies that the optimal $\mathcal{W}_{L_2}$ computes the linear function
\[
\mathcal{W}_{L_2}(\tilde{x}) = (w^*)^T V^* \tilde{x} + b^*. \tag{9}
\]
Later we will call $(w^*)^T V^* \tilde{x}$ the activation at the output node.

Lemma 26 $b^* = \frac{1 - (w^*)^T V^* x}{2}$.

Proof: Minimize $J_2$ (with respect to distribution $P_{(x,1)}$) as a function of $b$ using calculus. $\Box$
Now we have
\[
\mathcal{W}_{L_2}(\tilde{x}) = \frac12 + (w^*)^T V^* (\tilde{x} - x/2), \tag{10}
\]
which immediately implies
\[
\mathcal{W}_{L_2}(\mathbf{0}) = 1 - \mathcal{W}_{L_2}(x). \tag{11}
\]

Lemma 27 Each $v^*_j$ is a rescaling of $x$.

Proof: Projecting $v^*_j$ onto the span of $x$ does not affect $h_j$, and cannot increase the penalty. $\Box$

Lemma 28 $\mathcal{W}_{L_2}(x) \leq 1$.

Proof: By (11), if $\mathcal{W}_{L_2}(x) > 1$ then $\mathcal{W}_{L_2}(\mathbf{0}) < 0$, and the loss and the penalty would both be reduced by scaling down $w^*$. $\Box$

Lemma 29 $\mathcal{W}_{L_2}$ maximizes $\mathcal{W}(x)$ over those weight vectors $\mathcal{W}$ that have the same penalty as $\mathcal{W}_{L_2}$ and compute a function of the form $\mathcal{W}(\tilde{x}) = \frac12 + w^T V (\tilde{x} - x/2)$ (i.e., Equation (10)).

Proof: Let $\mathcal{W}$ maximize $\mathcal{W}(x)$ over the networks considered. If $\mathcal{W}_{L_2}(x) < \mathcal{W}(x) \leq 1$, then $\mathcal{W}$ would have the same penalty as $\mathcal{W}_{L_2}$ but smaller error, contradicting the optimality of $\mathcal{W}_{L_2}$. If $\mathcal{W}(x) > 1$, then the network $\widetilde{\mathcal{W}}$ obtained by scaling down the weights in the output layer so that $\widetilde{\mathcal{W}}(x) = 1$ has a smaller penalty than $\mathcal{W}_{L_2}$ and smaller error, again contradicting $\mathcal{W}_{L_2}$'s optimality. $\Box$

Informally, Lemmas 28 and 29 engender a view of the learner straining against the yoke of the $L_2$ penalty to produce a large enough output on $x$. This motivates us to ask how large $\mathcal{W}(x)$ can be for a given value of $\|\mathcal{W}\|_2^2$ (recall that Lemma 24 allows us to assume that the biases at the hidden nodes are all 0).

Definition 30 For each hidden node $j$, let $\alpha_j$ be the constant such that $v^*_j = \alpha_j x$, so that $h_j(x) = \alpha_j\, x \cdot x$.

Recall that the activation at the output node on input $x$ is the weighted sum of the hidden-node outputs, $(w^*)^T V^* x$.

Definition 31 The contribution to the activation at the output due to hidden node $j$ is $w^*_j h_j(x) = w^*_j \alpha_j\, x \cdot x$, and the contribution to the $L_2$ penalty from these weights is $\frac{\lambda}{2}\big((w^*_j)^2 + \alpha_j^2\, x \cdot x\big)$.

We now bound the contribution to the activation in terms of the contribution to the $L_2$ penalty.
Note that as the $L_2$ "budget" increases, so does the maximum possible contribution to the output node's activation.

Lemma 32 If $B$ is hidden node $j$'s weight-decay contribution, $(w^*_j)^2 + \alpha_j^2 \, x \cdot x$, then hidden node $j$'s contribution to the output node's activation is maximized when $w^*_j = \sqrt{\frac{B}{2}}$ and $\alpha_j = \sqrt{\frac{B}{2\, x \cdot x}}$, where it achieves the value $B\sqrt{x \cdot x}/2$.

Proof: Since $\alpha_j^2 \, x \cdot x + (w^*_j)^2 = B$, we have $w^*_j = \sqrt{B - \alpha_j^2 \, x \cdot x}$, so the contribution to the activation can be rewritten as $\alpha_j \, (x \cdot x) \sqrt{B - \alpha_j^2 \, x \cdot x}$. Taking the derivative with respect to $\alpha_j$, and solving, we get $\alpha_j = \pm\sqrt{\frac{B}{2\, x \cdot x}}$, and we want the positive solution (otherwise the node outputs 0). When $\alpha_j = \sqrt{\frac{B}{2\, x \cdot x}}$ we have $w^*_j = \sqrt{\frac{B}{2}}$, and thus the node's maximum contribution to the activation is

$$\sqrt{\frac{B}{2}} \sqrt{\frac{B}{2\, x \cdot x}} \, (x \cdot x) = \frac{B\sqrt{x \cdot x}}{2}.$$

Lemma 33 The minimum sum-squared weights for a network $W$ (without biases at the hidden nodes) that has an activation $A$ at the output node on input $x$ is $\dfrac{2A}{\sqrt{x \cdot x}}$.

Proof: When maximized, the contribution of each hidden node to the activation at the output is $\sqrt{x \cdot x}/2$ times the hidden node's contribution to the sum of squared weights. Since each weight in $W$ is used in exactly one hidden node's contribution to the output node's activation, this completes the proof.

Note that this bound is independent of $n$, the number of hidden units, but does depend on the input $x$.

Proof (of Theorem 7): Let $A \geq 0$ be the activation at the output node for $W_{L_2}$ on input $x$. From Lemma 26 we get that $b^* = \frac{1-A}{2}$. Combining Lemmas 28, 29 and 33, we can rewrite the $J_2$ criterion for $W_{L_2}$ and distribution $P_{(x,1)}$ in terms of $A$ as follows:

$$J_2(W_{L_2}) = \frac{1}{2}(b^* - 0)^2 + \frac{1}{2}(W_{L_2}(x) - 1)^2 + \frac{\lambda}{2}||W||_2^2 = \frac{1}{2}\left(\frac{1-A}{2}\right)^2 + \frac{1}{2}\left(\frac{1+A}{2} - 1\right)^2 + \frac{\lambda}{2}\left(\frac{2A}{\sqrt{x \cdot x}}\right).$$
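The maximization in Lemma 32 is easy to verify numerically: sweep $\alpha_j$ over its feasible range under the budget constraint and compare the maximizer with the closed form. The values of $B$ and $x \cdot x$ below are arbitrary illustrations.

```python
import numpy as np

# Numerical check of Lemma 32 (a sketch; B and x.x are arbitrary illustrations).
B, xx = 1.7, 3.0                 # weight-decay budget and x . x

# Sweep alpha over [0, sqrt(B / x.x)); w is determined by the budget constraint
# alpha^2 * x.x + w^2 = B.
alpha = np.linspace(0.0, np.sqrt(B / xx), 1000001)[:-1]
w = np.sqrt(B - alpha ** 2 * xx)
contrib = w * alpha * xx          # contribution to the output activation

alpha_best = alpha[np.argmax(contrib)]
assert abs(alpha_best - np.sqrt(B / (2.0 * xx))) < 1e-3   # matches Lemma 32
assert abs(contrib.max() - B * np.sqrt(xx) / 2.0) < 1e-6  # value B*sqrt(x.x)/2
```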
(12)

Differentiating with respect to $A$, we see that the criterion is minimized when

$$A = 1 - \frac{2\lambda}{\sqrt{x \cdot x}} \quad \text{when } \frac{2\lambda}{\sqrt{x \cdot x}} \leq 1, \qquad A = 0 \quad \text{when } \frac{2\lambda}{\sqrt{x \cdot x}} > 1,$$

since we assumed $A \geq 0$; when $A = 0$, $W_{L_2}$ has all zero weights with a bias of $1/2$ at the output. The risk part of (12) simplifies to

$$\frac{(1-A)^2}{4} = \frac{\lambda^2}{x \cdot x},$$

so the overall risk of an aversion witness is $\min\left\{\frac{1}{4}, \frac{\lambda^2}{x \cdot x}\right\}$.
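The closed-form minimizer and risk above can be sanity-checked by minimizing Equation 12 over a grid in $A$. The values of $\lambda$ and $x \cdot x$ below are arbitrary illustrations.

```python
import numpy as np

# Numerical check of the Theorem 7 calculation (a sketch; lam and xx are
# arbitrary illustrations of lambda and x . x).
def check(lam, xx):
    # J_2 as a function of the output activation A (Equation 12):
    # risk term (1 - A)^2 / 4 plus penalty term (lambda / 2) * (2A / sqrt(x.x)).
    A_grid = np.linspace(0.0, 2.0, 2000001)
    J = (1.0 - A_grid) ** 2 / 4.0 + lam * A_grid / np.sqrt(xx)
    A_num = A_grid[np.argmin(J)]
    A_closed = max(0.0, 1.0 - 2.0 * lam / np.sqrt(xx))
    risk_num = (1.0 - A_num) ** 2 / 4.0
    risk_closed = min(0.25, lam ** 2 / xx)
    assert abs(A_num - A_closed) < 1e-4
    assert abs(risk_num - risk_closed) < 1e-4

check(lam=0.1, xx=4.0)   # 2*lambda/sqrt(x.x) = 0.1 <= 1, so A = 0.9
check(lam=3.0, xx=4.0)   # 2*lambda/sqrt(x.x) = 3.0 > 1, so A = 0
```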