Testing the number of parameters of multidimensional MLP
Authors: Joseph Rynkiewicz (CES, Samos)
SAMOS-MATISSE, Université de Paris I, 72 rue Regnault, 75013 Paris, France (e-mail: joseph.rynkiewicz@univ-paris1.fr)

Abstract. This work concerns testing the number of parameters in a one-hidden-layer multilayer perceptron (MLP). For this purpose we assume that we have models that are identifiable up to a finite group of transformations on the weights; this is for example the case when the number of hidden units is known. In this framework, we show that we get a simple asymptotic distribution if we use the logarithm of the determinant of the empirical error covariance matrix as cost function.

Keywords: Multilayer perceptron, Statistical test, Asymptotic distribution.

1 Introduction

Consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d.[1] random vectors (i.e. identically distributed and independent), so that each couple $(Y_t, Z_t)$ has the same law as a generic variable $(Y, Z) \in \mathbb{R}^d \times \mathbb{R}^{d'}$.

1.1 The model

Assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t,$$

where
- $F_{W^0}$ is a function represented by a one-hidden-layer MLP with parameters (or weights) $W^0$ and sigmoidal activation functions in the hidden units;
- the noise $(\varepsilon_t)_{t \in \mathbb{N}}$ is a sequence of i.i.d. centered variables with unknown invertible covariance matrix $\Gamma(W^0)$. Write $\varepsilon$ for the generic variable with the same law as each $\varepsilon_t$.

Note that a finite number of transformations of the weights leave the MLP function invariant; these permutations form a finite group (see [Sussman, 1992]).
To overcome this problem, we consider equivalence classes of MLPs: two MLPs are in the same class if the first one is the image of the second by such a transformation; the considered parameter set is then the quotient of the parameter space by the finite group of transformations. In this space, we assume that the model is identifiable; this can be achieved if we consider only MLPs with the true number of hidden units (see [Sussman, 1992]). Note that if the number of hidden units is over-estimated, such a test can behave very badly (see [Fukumizu, 2003]). We agree that the assumption of identifiability is very restrictive, but we want to emphasize the fact that, even in this framework, the classical test of the number of parameters in the case of multidimensional-output MLPs is not satisfactory, and we propose to improve it.

[1] It is not hard to extend all that we show in this paper to stationary mixing variables, and hence to time series.

1.2 Testing the number of parameters

Let $q$ be an integer smaller than $s$. We want to test "$H_0: W \in \Theta_q \subset \mathbb{R}^q$" against "$H_1: W \in \Theta_s \subset \mathbb{R}^s$", where the sets $\Theta_q$ and $\Theta_s$ are compact. $H_0$ expresses the fact that $W$ belongs to a subset of $\Theta_s$ of parametric dimension smaller than $s$ or, equivalently, that $s - q$ weights of the MLP in $\Theta_s$ are null. If we consider the classical cost function

$$V_n(W) = \frac{1}{n}\sum_{t=1}^{n} \|Y_t - F_W(Z_t)\|^2,$$

where $\|x\|$ denotes the Euclidean norm of $x$, we get the following test statistic:

$$S_n = n \times \left( \min_{W \in \Theta_q} V_n(W) - \min_{W \in \Theta_s} V_n(W) \right).$$

It is shown in [Yao, 2000] that $S_n$ converges in law to a weighted sum of $\chi^2_1$ variables:

$$S_n \xrightarrow{\mathcal{D}} \sum_{i=1}^{s-q} \lambda_i \chi^2_{i,1},$$

where the $\chi^2_{i,1}$ are $s - q$ i.i.d. $\chi^2_1$ variables and the $\lambda_i$ are strictly positive values, different from 1 if the true covariance matrix of the noise is not the identity.
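The practical consequence of this weighted-$\chi^2$ limit can be illustrated numerically: if one ignored the weights and compared $S_n$ to an ordinary $\chi^2_{s-q}$ quantile, the level of the test would be wrong. A minimal simulation sketch in Python; the weights $\lambda_i$ below are hypothetical, chosen only to mimic a non-identity noise covariance:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

lam = np.array([3.0, 0.5])   # hypothetical weights lambda_i (s - q = 2)
n_sim = 100_000

# Draws from the limit law: a weighted sum of independent chi2_1 variables
s_draws = rng.chisquare(1, size=(n_sim, lam.size)) @ lam

# Naive 5%-level threshold, as if the limit were an ordinary chi2_{s-q}
crit = chi2.ppf(0.95, df=lam.size)

# Actual rejection rate under the true weighted limit
actual_level = np.mean(s_draws > crit)
print(actual_level)   # far from the nominal 0.05 when the lambda_i differ from 1
```

In this hypothetical configuration the actual level is several times the nominal 5%, which is exactly the difficulty the next paragraph addresses.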
So, in the general case where the true covariance matrix of the noise is not the identity, the asymptotic distribution is not known, because the $\lambda_i$ are not known, and it is difficult to compute the asymptotic level of the test. To overcome this difficulty we propose to use instead the cost function

$$U_n(W) := \ln \det\left( \frac{1}{n}\sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right). \qquad (1)$$

We will show that, under suitable assumptions, the test statistic

$$T_n = n \times \left( \min_{W \in \Theta_q} U_n(W) - \min_{W \in \Theta_s} U_n(W) \right) \qquad (2)$$

converges to a classical $\chi^2_{s-q}$, so the asymptotic level of the test is very easy to compute. The sequel of this paper is devoted to the proof of this property.

2 Asymptotic properties of T_n

In order to investigate the asymptotic properties of the test, we have to prove the consistency and the asymptotic normality of $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$. Assume in the sequel that $\varepsilon$ has a moment of order at least 2, and write

$$\Gamma_n(W) = \frac{1}{n}\sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T.$$

Remark that the matrix $\Gamma_n(W)$ and its inverse are symmetric. In the same way, write $\Gamma(W) = \lim_{n \to \infty} \Gamma_n(W)$, which is well defined because of the moment condition on $\varepsilon$.

2.1 Consistency of $\hat{W}_n$

First we have to identify the contrast function associated with $U_n(W)$.

Lemma 1. $U_n(W) - U_n(W^0) \xrightarrow{a.s.} K(W, W^0)$, with $K(W, W^0) \geq 0$ and $K(W, W^0) = 0$ if and only if $W = W^0$.

Proof: By the strong law of large numbers,

$$U_n(W) - U_n(W^0) \xrightarrow{a.s.} \ln\det(\Gamma(W)) - \ln\det(\Gamma(W^0)) = \ln\frac{\det(\Gamma(W))}{\det(\Gamma(W^0))} = \ln\det\left( \Gamma^{-1}(W^0)\left(\Gamma(W) - \Gamma(W^0)\right) + I_d \right),$$

where $I_d$ denotes the identity matrix of $\mathbb{R}^{d \times d}$. So the lemma is true if $\Gamma(W) - \Gamma(W^0)$ is a positive semidefinite matrix, null only if $W = W^0$.
But this property is true, since

$$\begin{aligned}
\Gamma(W) &= E\left[(Y - F_W(Z))(Y - F_W(Z))^T\right]\\
&= E\left[(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))(Y - F_{W^0}(Z) + F_{W^0}(Z) - F_W(Z))^T\right]\\
&= E\left[(Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T\right] + E\left[(F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\right]\\
&= \Gamma(W^0) + E\left[(F_{W^0}(Z) - F_W(Z))(F_{W^0}(Z) - F_W(Z))^T\right],
\end{aligned}$$

where the cross terms vanish because $\varepsilon = Y - F_{W^0}(Z)$ is centered and independent of $Z$. We then deduce the consistency theorem:

Theorem 1. If $E\|\varepsilon\|^2 < \infty$, then $\hat{W}_n \xrightarrow{P} W^0$.

Proof: Remark that there exists a constant $B$ such that $\sup_{W \in \Theta_s} \|Y - F_W(Z)\|^2 \leq 2\|Y\|^2 + B$, because $\Theta_s$ is compact, so $F_W(Z)$ is bounded. For a matrix $A \in \mathbb{R}^{d \times d}$, let $\|A\|$ be a norm, for example $\|A\|^2 = \mathrm{tr}(AA^T)$. We have

$$\liminf_{n \to \infty} \inf_{W \in \Theta_s} \|\Gamma_n(W)\| = \|\Gamma(W^0)\| > 0, \qquad \limsup_{n \to \infty} \sup_{W \in \Theta_s} \|\Gamma_n(W)\| := C < \infty,$$

and since the function $\Gamma \mapsto \ln\det\Gamma$ is uniformly continuous for $\|\Gamma(W^0)\| \leq \|\Gamma\| \leq C$, by the same argument as Example 19.8 of [Van der Vaart, 1998] the set of functions $\{U_n(W), W \in \Theta_s\}$ is Glivenko–Cantelli. Finally, Theorem 5.7 of [Van der Vaart, 1998] shows that $\hat{W}_n$ converges in probability to $W^0$.

2.2 Asymptotic normality

For this purpose we have to compute the first and second derivatives of $U_n(W)$ with respect to the parameters. First, we introduce a notation: if $F_W(X)$ is a $d$-dimensional parametric function depending on a parameter $W$, write $\frac{\partial F_W(X)}{\partial W_k}$ (resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives (resp. second-order partial derivatives) of each component of $F_W(X)$.
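The derivative computations that follow rest on the matrix-calculus identity $\frac{\partial}{\partial \theta}\ln\det\Gamma(\theta) = \mathrm{tr}\left(\Gamma^{-1}(\theta)\frac{\partial \Gamma(\theta)}{\partial \theta}\right)$ from [Magnus and Neudecker, 1988]. As a sanity check, the identity can be verified numerically on a small matrix-valued function; the matrices below are arbitrary illustrations, not part of the paper:

```python
import numpy as np

def gamma(theta):
    # Hypothetical smooth, symmetric positive-definite matrix function
    A = np.array([[2.0, 0.3], [0.3, 1.0]])
    B = np.array([[0.5, 0.1], [0.1, 0.2]])
    return A + theta * B + theta**2 * np.eye(2)

def dgamma(theta):
    # Exact derivative of gamma with respect to theta
    return np.array([[0.5, 0.1], [0.1, 0.2]]) + 2.0 * theta * np.eye(2)

theta0, h = 0.7, 1e-6

# Finite-difference derivative of theta -> ln det gamma(theta)
fd = (np.linalg.slogdet(gamma(theta0 + h))[1]
      - np.linalg.slogdet(gamma(theta0 - h))[1]) / (2.0 * h)

# Analytic form: tr(gamma^{-1} dgamma/dtheta)
analytic = np.trace(np.linalg.solve(gamma(theta0), dgamma(theta0)))

print(fd, analytic)   # the two values agree to numerical precision
```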
First derivatives: if $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from [Magnus and Neudecker, 1988]

$$\frac{\partial}{\partial W_k}\ln\det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W)\frac{\partial}{\partial W_k}\Gamma_n(W) \right).$$

Hence, if we write

$$A_n(W_k) = \frac{1}{n}\sum_{t=1}^{n} -\frac{\partial F_W(z_t)}{\partial W_k}(y_t - F_W(z_t))^T,$$

so that $\frac{\partial}{\partial W_k}\Gamma_n(W) = A_n(W_k) + A_n^T(W_k)$, then, using the fact that

$$\mathrm{tr}\left(\Gamma_n^{-1}(W) A_n(W_k)\right) = \mathrm{tr}\left(A_n^T(W_k)\Gamma_n^{-1}(W)\right) = \mathrm{tr}\left(\Gamma_n^{-1}(W) A_n^T(W_k)\right),$$

we get

$$\frac{\partial}{\partial W_k}\ln\det(\Gamma_n(W)) = 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) A_n(W_k)\right). \qquad (3)$$

Second derivatives: we now write

$$B_n(W_k, W_l) := \frac{1}{n}\sum_{t=1}^{n} \frac{\partial F_W(z_t)}{\partial W_k}\left(\frac{\partial F_W(z_t)}{\partial W_l}\right)^T$$

and

$$C_n(W_k, W_l) := \frac{1}{n}\sum_{t=1}^{n} -(y_t - F_W(z_t))\left(\frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l}\right)^T.$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l}\,2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) A_n(W_k)\right) = 2\,\mathrm{tr}\left(\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) B_n(W_k, W_l)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) C_n(W_k, W_l)\right).$$

Now, [Magnus and Neudecker, 1988] give an analytic form for the derivative of an inverse matrix, namely $\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} = -\Gamma_n^{-1}(W)\left(A_n(W_l) + A_n^T(W_l)\right)\Gamma_n^{-1}(W)$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\,\mathrm{tr}\left(\Gamma_n^{-1}(W)\left(A_n(W_l) + A_n^T(W_l)\right)\Gamma_n^{-1}(W) A_n(W_k)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) B_n(W_k, W_l)\right) + 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W) C_n(W_k, W_l)\right). \qquad (4)$$

Asymptotic distribution of $\hat{W}_n$: The previous equations allow us to give the asymptotic properties of the estimator minimizing the cost function $U_n(W)$; namely, from equations (3) and (4) we can compute the asymptotic properties of the first and second derivatives of $U_n(W)$. If the variable $Z$ has a moment of order at least 3, then we get the following result:

Theorem 2. Assume that $E\|\varepsilon\|^2 < \infty$ and $E\|Z\|^3 < \infty$. Let $\Delta U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$ and $HU_n(W^0)$ be the Hessian matrix of $U_n(W)$ at $W^0$.
Write finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k}\left(\frac{\partial F_W(Z)}{\partial W_l}\right)^T.$$

We then get:
1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$;
2. $\sqrt{n}\,\Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0)$;
3. $\sqrt{n}\left(\hat{W}_n - W^0\right) \xrightarrow{Law} \mathcal{N}\left(0, I_0^{-1}\right)$;

where the component $(k, l)$ of the matrix $I_0$ is $\mathrm{tr}\left(\Gamma_0^{-1} E\left[B(W^0_k, W^0_l)\right]\right)$.

Proof: We can easily show that there exist constants such that

$$\left\|\frac{\partial F_W(Z)}{\partial W_k}\right\| \leq C^{te}(1 + \|Z\|), \qquad \left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}(1 + \|Z\|^2),$$

$$\left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}\|W - W^0\|(1 + \|Z\|^3).$$

Write $A(W_k) = -\frac{\partial F_W(Z)}{\partial W_k}(Y - F_W(Z))^T$, the generic variable associated with $A_n(W_k)$. The component $(k, l)$ of the matrix $4 I_0$ is the expectation of the product of the per-observation gradient terms:

$$4\,E\left[\mathrm{tr}\left(\Gamma_0^{-1} A^T(W^0_k)\right)\,\mathrm{tr}\left(\Gamma_0^{-1} A(W^0_l)\right)\right]$$

and, since the trace of a product is invariant under circular permutation, this equals

$$\begin{aligned}
&4\,E\left[\left(-\frac{\partial F_{W^0}(Z)}{\partial W_k}\right)^T \Gamma_0^{-1}(Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1}\left(-\frac{\partial F_{W^0}(Z)}{\partial W_l}\right)\right]\\
&= 4\,E\left[\frac{\partial F_{W^0}(Z)^T}{\partial W_k}\Gamma_0^{-1}\frac{\partial F_{W^0}(Z)}{\partial W_l}\right] = 4\,\mathrm{tr}\left(\Gamma_0^{-1} E\left[B(W^0_k, W^0_l)\right]\right).
\end{aligned}$$

Now, the derivative $\frac{\partial F_W(Z)}{\partial W_k}$ is square integrable, so $\Delta U_n(W^0)$ fulfills Lindeberg's condition (see [Hall and Heyde, 1980]) and

$$\sqrt{n}\,\Delta U_n(W^0) \xrightarrow{Law} \mathcal{N}(0, 4 I_0).$$

For the component $(k, l)$ of the expectation of the Hessian matrix, remark first that, since $A_n(W^0_k)$ and $C_n(W^0_k, W^0_l)$ converge almost surely to 0 (the noise is centered and independent of $Z$),

$$\lim_{n \to \infty} \mathrm{tr}\left(\Gamma_n^{-1}(W^0) A_n(W^0_l) \Gamma_n^{-1}(W^0) A_n(W^0_k)\right) = 0 \quad\text{and}\quad \lim_{n \to \infty} \mathrm{tr}\left(\Gamma_n^{-1}(W^0) C_n(W^0_k, W^0_l)\right) = 0,$$

so

$$\lim_{n \to \infty} \left(HU_n(W^0)\right)_{k,l} = \lim_{n \to \infty} 2\,\mathrm{tr}\left(\Gamma_n^{-1}(W^0) B_n(W^0_k, W^0_l)\right) = 2\,\mathrm{tr}\left(\Gamma_0^{-1} E\left[B(W^0_k, W^0_l)\right]\right).$$

Now, since $\left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}(1 + \|Z\|^2)$ and $\left\|\frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l}\right\| \leq C^{te}\|W - W^0\|(1 + \|Z\|^3)$, by standard arguments, found for example in [Yao, 2000], we get

$$\sqrt{n}\left(\hat{W}_n - W^0\right) \xrightarrow{Law} \mathcal{N}\left(0, I_0^{-1}\right).$$

2.3 Asymptotic distribution of T_n

In this section we write $\hat{W}_n = \arg\min_{W \in \Theta_s} U_n(W)$ and $\hat{W}^0_n = \arg\min_{W \in \Theta_q} U_n(W)$, where $\Theta_q$ is viewed as a subset of $\mathbb{R}^s$. The asymptotic distribution of $T_n$ is then a consequence of the previous section: namely, if we replace $n U_n(W)$ by its Taylor expansion around $\hat{W}_n$ and $\hat{W}^0_n$, then, following [Van der Vaart, 1998], Chapter 16, we have

$$T_n = \sqrt{n}\left(\hat{W}_n - \hat{W}^0_n\right)^T I_0 \sqrt{n}\left(\hat{W}_n - \hat{W}^0_n\right) + o_P(1) \xrightarrow{\mathcal{D}} \chi^2_{s-q}.$$

3 Conclusion

It has been shown that, in the case of multidimensional output, the cost function $U_n(W)$ leads to a simpler test for the number of parameters in an MLP than the traditional mean-square cost function. In fact, the estimator $\hat{W}_n$ is also more efficient than the least-squares estimator (see [Rynkiewicz, 2003]). We can also remark that $U_n(W)$ matches twice the "concentrated Gaussian log-likelihood", but we have to emphasize that its nice asymptotic properties need only moment conditions on $\varepsilon$ and $Z$, so it works even if the distribution of the noise is not Gaussian. Another solution would be to use an approximation of the error covariance matrix to compute a generalized least-squares estimator:

$$\frac{1}{n}\sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \Gamma^{-1} (Y_t - F_W(Z_t)),$$

assuming that $\Gamma$ is a good approximation of the true covariance matrix of the noise $\Gamma(W^0)$. However, it takes time to compute a good matrix $\Gamma$, and if we try to compute the best matrix $\Gamma$ from the data, this leads back to the cost function $U_n(W)$ (see for example [Gallant, 1987]). Finally, as we saw in this paper, the computation of the derivatives of $U_n(W)$ is easy, so we can use effective differential optimization techniques to estimate $\hat{W}_n$; numerical examples can be found in [Rynkiewicz, 2003].
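To make the procedure concrete, here is a minimal numerical sketch of the test in Python. Since fitting an actual MLP would obscure the point, a hypothetical linear map stands in for $F_W$ (with identical regressors in every output equation, per-output least squares also minimizes the log-determinant of the residual covariance); all dimensions, weights and covariances below are illustrative choices, not taken from the paper:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)

n, d = 2000, 2
# True weights: the third input is inactive in both outputs, so H0 holds
# (restricted model: q = 4 weights; full model: s = 6 weights; s - q = 2).
W0 = np.array([[1.0, 0.0, 0.0],
               [0.5, -1.0, 0.0]])
Gam = np.array([[1.0, 0.4],        # non-identity noise covariance Gamma(W0)
                [0.4, 2.0]])
Z = rng.normal(size=(n, 3))
Y = Z @ W0.T + rng.normal(size=(n, d)) @ np.linalg.cholesky(Gam).T

def U_n(resid):
    """ln det of the empirical error covariance matrix, cost function (1)."""
    return np.linalg.slogdet(resid.T @ resid / len(resid))[1]

def min_U(Zcols):
    # Per-output least squares; with identical regressors in every equation
    # this also minimizes ln det of the residual covariance matrix.
    W_hat, *_ = np.linalg.lstsq(Zcols, Y, rcond=None)
    return U_n(Y - Zcols @ W_hat)

T_n = n * (min_U(Z[:, :2]) - min_U(Z))   # statistic (2): restricted minus full
crit = chi2.ppf(0.95, df=2)              # chi2_{s-q} critical value, level 5%
print(T_n, crit, T_n > crit)
```

Under $H_0$, $T_n$ stays below the $\chi^2_2$ quantile with probability approaching 95%; replacing the linear fit by a differential-optimization fit of an MLP gives the procedure studied above.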
References

Fukumizu, 2003. K. Fukumizu. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
Gallant, 1987. R.A. Gallant. Nonlinear statistical models. J. Wiley and Sons, New York, 1987.
Hall and Heyde, 1980. P. Hall and C. Heyde. Martingale limit theory and its applications. Academic Press, New York, 1980.
Magnus and Neudecker, 1988. Jan R. Magnus and Heinz Neudecker. Matrix differential calculus with applications in statistics and econometrics. J. Wiley and Sons, New York, 1988.
Rynkiewicz, 2003. J. Rynkiewicz. Estimation of multidimensional regression model with multilayer perceptrons. In J. Mira and J.R. Alvarez, editors, Computational Methods in Neural Modeling, volume 2686 of Lecture Notes in Computer Science, pages 310–317, 2003.
Sussman, 1992. H.J. Sussman. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5:589–593, 1992.
Van der Vaart, 1998. A.W. Van der Vaart. Asymptotic statistics. Cambridge University Press, Cambridge, UK, 1998.
Yao, 2000. J. Yao. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316–331, 2000.