Testing the number of parameters of multidimensional MLP
Authors: Joseph Rynkiewicz (CES, Samos)
SAMOS-MATISSE, Université de Paris I, 72 rue Regnault, 75013 Paris, France (e-mail: joseph.rynkiewicz@univ-paris1.fr)

Abstract. This work concerns testing the number of parameters in a one-hidden-layer multilayer perceptron (MLP). For this purpose we assume that we have models that are identifiable up to a finite group of transformations on the weights; this is for example the case when the number of hidden units is known. In this framework, we show that we get a simple asymptotic distribution if we use the logarithm of the determinant of the empirical error covariance matrix as cost function.

Keywords: Multilayer perceptron, Statistical test, Asymptotic distribution.

1 Introduction

Consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d.[1] random vectors (i.e. identically distributed and independent), so that each couple $(Y_t, Z_t)$ has the same law as a generic variable $(Y, Z) \in \mathbb{R}^d \times \mathbb{R}^{d'}$.

1.1 The model

Assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t,$$

where
- $F_{W^0}$ is a function represented by a one-hidden-layer MLP with parameters (or weights) $W^0$ and sigmoidal activation functions in the hidden units;
- the noise $(\varepsilon_t)_{t \in \mathbb{N}}$ is a sequence of i.i.d. centered variables with unknown invertible covariance matrix $\Gamma(W^0)$. Write $\varepsilon$ for the generic variable with the same law as each $\varepsilon_t$.

Note that a finite number of transformations of the weights leave the MLP function invariant; these permutations form a finite group (see [Sussman, 1992]).
To overcome this problem, we consider equivalence classes of MLPs: two MLPs are in the same class if the first one is the image of the second by such a transformation; the considered parameter set is then the quotient of the parameter space by the finite group of transformations. In this space, we assume that the model is identifiable; this can be achieved if we consider only MLPs with the true number of hidden units (see [Sussman, 1992]). Note that if the number of hidden units is over-estimated, such a test can behave very badly (see [Fukumizu, 2003]). We agree that the assumption of identifiability is very restrictive, but we want to emphasize the fact that, even in this framework, the classical test of the number of parameters in the case of multidimensional-output MLPs is not satisfactory, and we propose to improve it.

[1] It is not hard to extend all that we show in this paper to stationary mixing variables, and hence to time series.

1.2 Testing the number of parameters

Let $q$ be an integer smaller than $s$. We want to test "$H_0: W \in \Theta_q \subset \mathbb{R}^q$" against "$H_1: W \in \Theta_s \subset \mathbb{R}^s$", where the sets $\Theta_q$ and $\Theta_s$ are compact. $H_0$ expresses the fact that $W$ belongs to a subset of $\Theta_s$ of parametric dimension smaller than $s$ or, equivalently, that $s - q$ weights of the MLP in $\Theta_s$ are null. If we consider the classical cost function

$$V_n(W) = \frac{1}{n}\sum_{t=1}^{n} \|Y_t - F_W(Z_t)\|^2,$$

where $\|x\|$ denotes the Euclidean norm of $x$, we get the following test statistic:

$$S_n = n \times \left( \min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W) \right).$$

It is shown in [Yao, 2000] that $S_n$ converges in law to a weighted sum of $\chi^2_1$ variables:

$$S_n \xrightarrow{\mathcal{D}} \sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1},$$

where the $\chi^2_{i,1}$ are $s - q$ i.i.d. $\chi^2_1$ variables and the $\lambda_i$ are strictly positive values, different from 1 if the true covariance matrix of the noise is not the identity.
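The practical consequence of this weighted-$\chi^2$ limit can be illustrated numerically: if one ignored the weights and compared $S_n$ to an ordinary $\chi^2_{s-q}$ quantile, the level of the test would be wrong. A minimal simulation sketch in Python; the weights $\lambda_i$ below are hypothetical, chosen only to mimic a non-identity noise covariance:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

lam = np.array([3.0, 0.5])   # hypothetical weights lambda_i (s - q = 2)
n_sim = 100_000

# Draws from the limit law: a weighted sum of independent chi2_1 variables
s_draws = rng.chisquare(1, size=(n_sim, lam.size)) @ lam

# Naive 5%-level threshold, as if the limit were an ordinary chi2_{s-q}
crit = chi2.ppf(0.95, df=lam.size)

# Actual rejection rate under the true weighted limit
actual_level = np.mean(s_draws > crit)
print(actual_level)   # far from the nominal 0.05 when the lambda_i differ from 1
```

In this hypothetical configuration the actual level is several times the nominal 5%, which is exactly the difficulty the next paragraph addresses.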
So, in the general case where the true covariance matrix of the noise is not the identity, the asymptotic distribution is not known, because the $\lambda_i$ are not known, and it is difficult to compute the asymptotic level of the test. To overcome this difficulty we propose to use instead the cost function

$$U_n(W) := \ln \det\left( \frac{1}{n}\sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right). \qquad (1)$$

We will show that, under suitable assumptions, the test statistic

$$T_n = n \times \left( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \right) \qquad (2)$$

converges to a classical $\chi^2_{s-q}$, so the asymptotic level of the test is very easy to compute. The sequel of this paper is devoted to the proof of this property.

2 Asymptotic properties of T_n

In order to investigate the asymptotic properties of the test, we have to prove the consistency and the asymptotic normality of $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$. Assume in the sequel that $\varepsilon$ has a moment of order at least 2, and write

$$\Gamma_n(W) = \frac{1}{n}\sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T.$$

Remark that the matrix $\Gamma_n(W)$ and its inverse are symmetric. In the same way, write $\Gamma(W) = \lim_{n \to \infty} \Gamma_n(W)$, which is well defined because of the moment condition on $\varepsilon$.

2.1 Consistency of $\hat{W}_n$

First we have to identify the contrast function associated with $U_n(W)$.

Lemma 1. $U_n(W) - U_n(W^0) \xrightarrow{a.s.} K(W, W^0)$, with $K(W, W^0) \geq 0$ and $K(W, W^0) = 0$ if and only if $W = W^0$.

Proof: By the strong law of large numbers,

$$U_n(W) - U_n(W^0) \xrightarrow{a.s.} \ln\det(\Gamma(W)) - \ln\det(\Gamma(W^0)) = \ln\frac{\det(\Gamma(W))}{\det(\Gamma(W^0))} = \ln\det\left( \Gamma^{-1}(W^0)\left(\Gamma(W) - \Gamma(W^0)\right) + I_d \right),$$

where $I_d$ denotes the identity matrix of $\mathbb{R}^{d \times d}$. So the lemma is true if $\Gamma(W) - \Gamma(W^0)$ is a positive semidefinite matrix, null only if $W = W^0$.
But this property is true, since

$$\begin{aligned}
\Gamma(W) &= E\left[(Y - F_W(Z))(Y - F_W(Z))^T\right]\\
&= E\left[(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))^T\right]\\
&= E\left[(Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T\right] + E\left[(F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\right]\\
&= \Gamma(W^0) + E\left[(F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\right],
\end{aligned}$$

where the cross terms vanish because $\varepsilon = Y - F_{W^0}(Z)$ is centered and independent of $Z$. We then deduce the consistency theorem:

Theorem 1. If $E\|\varepsilon\|^2 < \infty$, then $\hat{W}_n \xrightarrow{P} W^0$.

Proof: Remark that there exists a constant $B$ such that $\sup_{W \in \Theta_s} \|Y - F_W(Z)\|^2 \leq 2\|Y\|^2 + B$, because $\Theta_s$ is compact, so $F_W(Z)$ is bounded. For a matrix $A \in \mathbb{R}^{d \times d}$, let $\|A\|$ be a norm, for example $\|A\|^2 = \mathrm{tr}(AA^T)$. We have

$$\liminf_{n \to \infty} \inf_{W \in \Theta_s} \|\Gamma_n(W)\| = \|\Gamma(W^0)\| > 0, \qquad \limsup_{n \to \infty} \sup_{W \in \Theta_s} \|\Gamma_n(W)\| := C < \infty,$$

and since the function $\Gamma \mapsto \ln\det\Gamma$ is uniformly continuous for $\|\Gamma(W^0)\| \leq \|\Gamma\| \leq C$, by the same argument as Example 19.8 of [Van der Vaart, 1998] the set of functions $\{U_n(W), W \in \Theta_s\}$ is Glivenko–Cantelli. Finally, Theorem 5.7 of [Van der Vaart, 1998] shows that $\hat{W}_n$ converges in probability to $W^0$.

2.2 Asymptotic normality

For this purpose we have to compute the first and second derivatives of $U_n(W)$ with respect to the parameters. First, we introduce a notation: if $F_W(X)$ is a $d$-dimensional parametric function depending on a parameter $W$, write $\frac{\partial F_W(X)}{\partial W_k}$ (resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives (resp. second-order partial derivatives) of each component of $F_W(X)$.
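The derivative computations that follow rest on the matrix-calculus identity $\frac{\partial}{\partial \theta}\ln\det\Gamma(\theta) = \mathrm{tr}\left(\Gamma^{-1}(\theta)\frac{\partial \Gamma(\theta)}{\partial \theta}\right)$ from [Magnus and Neudecker, 1988]. As a sanity check, the identity can be verified numerically on a small matrix-valued function; the matrices below are arbitrary illustrations, not part of the paper:

```python
import numpy as np

def gamma(theta):
    # Hypothetical smooth, symmetric positive-definite matrix function
    A = np.array([[2.0, 0.3], [0.3, 1.0]])
    B = np.array([[0.5, 0.1], [0.1, 0.2]])
    return A + theta * B + theta**2 * np.eye(2)

def dgamma(theta):
    # Exact derivative of gamma with respect to theta
    return np.array([[0.5, 0.1], [0.1, 0.2]]) + 2.0 * theta * np.eye(2)

theta0, h = 0.7, 1e-6

# Finite-difference derivative of theta -> ln det gamma(theta)
fd = (np.linalg.slogdet(gamma(theta0 + h))[1]
      - np.linalg.slogdet(gamma(theta0 - h))[1]) / (2.0 * h)

# Analytic form: tr(gamma^{-1} dgamma/dtheta)
analytic = np.trace(np.linalg.solve(gamma(theta0), dgamma(theta0)))

print(fd, analytic)   # the two values agree to numerical precision
```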
First derivatives: if $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from [Magnus and Neudecker, 1988]

$$\frac{\partial}{\partial W_k}\ln\det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W)\frac{\partial}{\partial W_k}\Gamma_n(W) \right).$$

Hence, if we write

$$A_n(W_k) = \frac{1}{n}\sum_{t=1}^{n} -\frac{\partial F_W(z_t)}{\partial W_k}(y_t - F_W(z_t))^T,$$

so that $\frac{\partial}{\partial W_k}\Gamma_n(W) = A_n(W_k) + A_n^T(W_k)$, then, using the fact that

$$\mathrm{tr}\left(\Gamma_n^{-1}(W) A_n(W_k)\right) = \mathrm{tr}\left(A_n^T(W_k)\Gamma_n^{-1}(W)\right) = \mathrm{tr}\left(\Gamma_n^{-1}(W) A_n^T(W_k)\right),$$

we get

$$\frac{\partial}{\partial W_k}\ln\det(\Gamma_n(W)) = 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) A_n(W_k)\right). \qquad (3)$$

Second derivatives: we now write

$$B_n(W_k, W_l) := \frac{1}{n}\sum_{t=1}^{n} \frac{\partial F_W(z_t)}{\partial W_k}\left(\frac{\partial F_W(z_t)}{\partial W_l}\right)^T$$

and

$$C_n(W_k, W_l) := \frac{1}{n}\sum_{t=1}^{n} -(y_t - F_W(z_t))\left(\frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l}\right)^T.$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l}\,2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) A_n(W_k)\right) = 2\,\mathrm{tr}\left(\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) B_n(W_k, W_l)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) C_n(W_k, W_l)\right).$$

Now, [Magnus and Neudecker, 1988] give an analytic form for the derivative of an inverse matrix, namely $\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} = -\Gamma_n^{-1}(W)\left(A_n(W_l) + A_n^T(W_l)\right)\Gamma_n^{-1}(W)$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\,\mathrm{tr}\left(\Gamma_n^{-1}(W)\left(A_n(W_l) + A_n^T(W_l)\right)\Gamma_n^{-1}(W) A_n(W_k)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) B_n(W_k, W_l)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) C_n(W_k, W_l)\right). \qquad (4)$$

Asymptotic distribution of $\hat{W}_n$: The previous equations allow us to give the asymptotic properties of the estimator minimizing the cost function $U_n(W)$; namely, from equations (3) and (4) we can compute the asymptotic properties of the first and second derivatives of $U_n(W)$. If the variable $Z$ has a moment of order at least 3, then we get the following result:

Theorem 2. Assume that $E\|\varepsilon\|^2 < \infty$ and $E\|Z\|^3 < \infty$. Let $\Delta U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$ and $HU_n(W^0)$ be the Hessian matrix of $U_n(W)$ at $W^0$.
Write finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k}\left(\frac{\partial F_W(Z)}{\partial W_l}\right)^T.$$

We then get:
1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$;
2. $\sqrt{n}\,\Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0)$;
3. $\sqrt{n}\left(\hat{W}_n - W^0\right) \xrightarrow{Law} \mathcal{N}\left(0, I_0^{-1}\right)$;

where the component $(k, l)$ of the matrix $I_0$ is $\mathrm{tr}\left(\Gamma_0^{-1} E\left[B(W^0_k, W^0_l)\right]\right)$.

Proof: We can easily show that there exist constants such that

$$\left\|\frac{\partial F_W(Z)}{\partial W_k}\right\| \leq C^{te}(1 + \|Z\|), \qquad \left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}(1 + \|Z\|^2),$$

$$\left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}\|W - W^0\|(1 + \|Z\|^3).$$

Write $A(W_k) = -\frac{\partial F_W(Z)}{\partial W_k}(Y - F_W(Z))^T$, the generic variable associated with $A_n(W_k)$. The component $(k, l)$ of the matrix $4 I_0$ is the expectation of the product of the per-observation gradient terms:

$$4\,E\left[\mathrm{tr}\left(\Gamma_0^{-1} A^T(W^0_k)\right)\,\mathrm{tr}\left(\Gamma_0^{-1} A(W^0_l)\right)\right]$$

and, since the trace of a product is invariant under circular permutation, this equals

$$\begin{aligned}
&4\,E\left[\left(-\frac{\partial F_{W^0}(Z)}{\partial W_k}\right)^T \Gamma_0^{-1}(Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1}\left(-\frac{\partial F_{W^0}(Z)}{\partial W_l}\right)\right]\\
&= 4\,E\left[\frac{\partial F_{W^0}(Z)^T}{\partial W_k}\Gamma_0^{-1}\frac{\partial F_{W^0}(Z)}{\partial W_l}\right] = 4\,\mathrm{tr}\left(\Gamma_0^{-1} E\left[B(W^0_k, W^0_l)\right]\right).
\end{aligned}$$

Now, the derivative $\frac{\partial F_W(Z)}{\partial W_k}$ is square integrable, so $\Delta U_n(W^0)$ fulfills Lindeberg's condition (see [Hall and Heyde, 1980]) and

$$\sqrt{n}\,\Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0).$$

For the component $(k, l)$ of the expectation of the Hessian matrix, remark first that, since $A_n(W^0_k)$ and $C_n(W^0_k, W^0_l)$ converge almost surely to 0 (the noise is centered and independent of $Z$),

$$\lim_{n \to \infty} \mathrm{tr}\left(\Gamma_n^{-1}(W^0) A_n(W^0_l) \Gamma_n^{-1}(W^0) A_n(W^0_k)\right) = 0 \quad\text{and}\quad \lim_{n \to \infty} \mathrm{tr}\left(\Gamma_n^{-1}(W^0) C_n(W^0_k, W^0_l)\right) = 0,$$

so

$$\lim_{n \to \infty} \left(HU_n(W^0)\right)_{k,l} = \lim_{n \to \infty} 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W^0) B_n(W^0_k, W^0_l)\right) = 2\,\mathrm{tr}\left(\Gamma_0^{-1} E\left[B(W^0_k, W^0_l)\right]\right).$$

Now, since $\left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}(1 + \|Z\|^2)$ and $\left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}\|W - W^0\|(1 + \|Z\|^3)$, by standard arguments, found for example in [Yao, 2000], we get

$$\sqrt{n}\left(\hat{W}_n - W^0\right) \xrightarrow{Law} \mathcal{N}\left(0, I_0^{-1}\right).$$

2.3 Asymptotic distribution of T_n

In this section we write $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$ and $\hat{W}^0_n = \arg\min_{W \in \Theta_q} U_n(W)$, where $\Theta_q$ is viewed as a subset of $\mathbb{R}^s$. The asymptotic distribution of $T_n$ is then a consequence of the previous section: namely, if we replace $n U_n(W)$ by its Taylor expansion around $\hat{W}_n$ and $\hat{W}^0_n$, then, following [Van der Vaart, 1998], Chapter 16, we have

$$T_n = \sqrt{n}\left(\hat{W}_n - \hat{W}^0_n\right)^T I_0 \sqrt{n}\left(\hat{W}_n - \hat{W}^0_n\right) + o_P(1) \xrightarrow{\mathcal{D}} \chi^2_{s-q}.$$

3 Conclusion

It has been shown that, in the case of multidimensional output, the cost function $U_n(W)$ leads to a simpler test for the number of parameters in an MLP than the traditional mean-square cost function. In fact, the estimator $\hat{W}_n$ is also more efficient than the least-squares estimator (see [Rynkiewicz, 2003]). We can also remark that $U_n(W)$ matches twice the "concentrated Gaussian log-likelihood", but we have to emphasize that its nice asymptotic properties need only moment conditions on $\varepsilon$ and $Z$, so it works even if the distribution of the noise is not Gaussian. Another solution would be to use an approximation of the error covariance matrix to compute a generalized least-squares estimator:

$$\frac{1}{n}\sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \Gamma^{-1} (Y_t - F_W(Z_t)),$$

assuming that $\Gamma$ is a good approximation of the true covariance matrix of the noise $\Gamma(W^0)$. However, it takes time to compute a good matrix $\Gamma$, and if we try to compute the best matrix $\Gamma$ from the data, this leads back to the cost function $U_n(W)$ (see for example [Gallant, 1987]). Finally, as we saw in this paper, the computation of the derivatives of $U_n(W)$ is easy, so we can use effective differential optimization techniques to estimate $\hat{W}_n$; numerical examples can be found in [Rynkiewicz, 2003].
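To make the procedure concrete, here is a minimal numerical sketch of the test in Python. Since fitting an actual MLP would obscure the point, a hypothetical linear map stands in for $F_W$ (with identical regressors in every output equation, per-output least squares also minimizes the log-determinant of the residual covariance); all dimensions, weights and covariances below are illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

n, d = 2000, 2
# True weights: the third input is inactive in both outputs, so H0 holds
# (restricted model: q = 4 weights; full model: s = 6 weights; s - q = 2).
W0 = np.array([[1.0, 0.0, 0.0],
               [0.5, -1.0, 0.0]])
Gam = np.array([[1.0, 0.4],        # non-identity noise covariance Gamma(W0)
                [0.4, 2.0]])
Z = rng.normal(size=(n, 3))
Y = Z @ W0.T + rng.normal(size=(n, d)) @ np.linalg.cholesky(Gam).T

def U_n(resid):
    """ln det of the empirical error covariance matrix, cost function (1)."""
    return np.linalg.slogdet(resid.T @ resid / len(resid))[1]

def min_U(Zcols):
    # Per-output least squares; with identical regressors in every equation
    # this also minimizes ln det of the residual covariance matrix.
    W_hat, *_ = np.linalg.lstsq(Zcols, Y, rcond=None)
    return U_n(Y - Zcols @ W_hat)

T_n = n * (min_U(Z[:, :2]) - min_U(Z))   # statistic (2): restricted minus full
crit = chi2.ppf(0.95, df=2)              # chi2_{s-q} critical value, level 5%
print(T_n, crit, T_n > crit)
```

Under $H_0$, $T_n$ stays below the $\chi^2_2$ quantile with probability approaching 95%; replacing the linear fit by a differential-optimization fit of an MLP gives the procedure studied above.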
References

Fukumizu, 2003. K. Fukumizu. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
Gallant, 1987. R.A. Gallant. Nonlinear statistical models. J. Wiley and Sons, New York, 1987.
Hall and Heyde, 1980. P. Hall and C. Heyde. Martingale limit theory and its applications. Academic Press, New York, 1980.
Magnus and Neudecker, 1988. Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. J. Wiley and Sons, New York, 1988.
Rynkiewicz, 2003. J. Rynkiewicz. Estimation of multidimensional regression model with multilayer perceptrons. In J. Mira and J.R. Alvarez, editors, Computational Methods in Neural Modeling, volume 2686 of Lecture Notes in Computer Science, pages 310–317, 2003.
Sussman, 1992. H.J. Sussman. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5:589–593, 1992.
Van der Vaart, 1998. A.W. Van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK, 1998.
Yao, 2000. J. Yao. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316–331, 2000.