Efficient Estimation of Multidimensional Regression Model with Multilayer Perceptron

This work concerns the estimation of multidimensional nonlinear regression models using a multilayer perceptron (MLP). The main problem with such a model is that the covariance matrix of the noise has to be known to obtain the optimal estimator. However, we show that if we choose as cost function the logarithm of the determinant of the empirical error covariance matrix, we obtain an asymptotically optimal estimator.

Authors: Joseph Rynkiewicz (Université Paris I, SAMOS/MATISSE, 72 rue Regnault, Paris, France)

1 Introduction

Let us consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d. random vectors (i.e., identically distributed and independent).^1 Each couple $(Y_t, Z_t)$ has the same law as a generic variable $(Y, Z)$.

^1 It is not hard to extend everything shown in this paper to stationary mixing variables, and thus to time series.

Moreover, we assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t,$$

where

- $F_{W^0}$ is a function represented by an MLP with parameters, or weights, $W^0$;
- $(\varepsilon_t)$ is an i.i.d. centered noise with unknown invertible covariance matrix $\Gamma_0$.

Our goal is to estimate the true parameter by minimizing an appropriate cost function. This model is called a regression model, and a popular choice for the associated cost function is the mean squared error:

$$\frac{1}{n} \sum_{t=1}^{n} \left\| Y_t - F_W(Z_t) \right\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$. Although this cost function is widely used, it is easy to show that it yields a suboptimal estimator. Another solution is to use an approximation $\Gamma$ of the error covariance matrix and compute the generalized least squares estimator by minimizing

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma^{-1} (Y_t - F_W(Z_t)),$$

where $^T$ denotes transposition. Here we assume that $\Gamma$ is a good approximation of the true covariance matrix of the noise, $\Gamma_0$. However, computing a good approximation of $\Gamma_0$ takes time, and this approach leads asymptotically to the cost function proposed in this article (see for example Rynkiewicz [4]):

$$U_n(W) := \log \det \left( \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right). \qquad (1)$$

This paper is devoted to the theoretical study of $U_n(W)$. We assume that the true architecture of the MLP is known, so that the Hessian matrix computed in the sequel satisfies the assumption of being positive definite (see Fukumizu [1]). In this framework, we study the asymptotic behavior of $\hat{W}_n := \arg\min_W U_n(W)$, the weights minimizing the cost function $U_n(W)$. We show that, under simple assumptions, this estimator is asymptotically optimal in the sense that it has the same asymptotic behavior as the generalized least squares estimator using the true covariance matrix of the noise. Numerical procedures to compute this estimator and examples of its behavior can be found in Rynkiewicz [4].

2 The first and second derivatives of $W \mapsto U_n(W)$

First, we introduce a notation: if $F_W(X)$ is a $d$-dimensional parametric function depending on a parameter $W$, we write $\frac{\partial F_W(X)}{\partial W_k}$ (resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives (resp. second-order partial derivatives) of each component of $F_W(X)$.
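Before turning to the derivatives, it may help to see the cost function (1) in code. Below is a minimal sketch in Python/NumPy; the one-hidden-layer network `mlp_forward` and its weight layout are illustrative assumptions, not the architecture or numerical procedure of the paper (for those, see Rynkiewicz [4]).

```python
import numpy as np

def mlp_forward(W, Z):
    """Toy one-hidden-layer MLP mapping Z (n, p) to predictions (n, d).

    The weight layout (a dict of arrays) is an illustrative assumption,
    not the architecture used in the paper."""
    H = np.tanh(Z @ W["W1"] + W["b1"])    # hidden layer
    return H @ W["W2"] + W["b2"]          # linear output layer

def U_n(W, Y, Z):
    """Cost function (1): log det of the empirical error covariance."""
    E = Y - mlp_forward(W, Z)             # residuals, shape (n, d)
    Gamma_n = (E.T @ E) / len(Y)          # Gamma_n(W), shape (d, d)
    sign, logdet = np.linalg.slogdet(Gamma_n)  # stable log-determinant
    return logdet

# Tiny demonstration with d = 2 outputs and correlated noise.
rng = np.random.default_rng(0)
n, p, h, d = 200, 3, 5, 2
W = {"W1": rng.normal(size=(p, h)), "b1": np.zeros(h),
     "W2": rng.normal(size=(h, d)), "b2": np.zeros(d)}
Z = rng.normal(size=(n, p))
eps = rng.multivariate_normal(np.zeros(d), [[1.0, 0.7], [0.7, 1.0]], size=n)
Y = mlp_forward(W, Z) + eps
print(U_n(W, Y, Z))  # value of (1) at the data-generating weights
```

Using np.linalg.slogdet rather than a determinant followed by a logarithm avoids overflow and keeps the computation stable when the residual covariance is badly scaled.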
2.1 First derivatives

If $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from Magnus and Neudecker [3]

$$\frac{\partial}{\partial W_k} \ln \det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W) \, \frac{\partial}{\partial W_k}\Gamma_n(W) \right),$$

where here

$$\Gamma_n(W) = \frac{1}{n} \sum_{t=1}^{n} (y_t - F_W(z_t))(y_t - F_W(z_t))^T.$$

Note that the matrix $\Gamma_n(W)$ and its inverse are symmetric. Now, if we write

$$A_n(W_k) = \frac{1}{n} \sum_{t=1}^{n} \left( -\frac{\partial F_W(z_t)}{\partial W_k} (y_t - F_W(z_t))^T \right),$$

then, using the fact that

$$\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = \mathrm{tr}\left( A_n^T(W_k) \Gamma_n^{-1}(W) \right) = \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_k) \right),$$

we get

$$\frac{\partial}{\partial W_k} \ln \det(\Gamma_n(W)) = 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right). \qquad (2)$$

2.2 Second derivatives

We now write

$$B_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_W(z_t)}{\partial W_k} \left( \frac{\partial F_W(z_t)}{\partial W_l} \right)^T \right)$$

and

$$C_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -(y_t - F_W(z_t)) \left( \frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l} \right)^T \right).$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l}\, 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = 2\, \mathrm{tr}\left( \frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right).$$

Now, Magnus and Neudecker [3] give an analytic form of the derivative of an inverse matrix, namely $\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} = -\Gamma_n^{-1}(W) \frac{\partial \Gamma_n(W)}{\partial W_l} \Gamma_n^{-1}(W)$, where $\frac{\partial \Gamma_n(W)}{\partial W_l} = A_n(W_l) + A_n^T(W_l)$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) \left( A_n(W_l) + A_n^T(W_l) \right) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right). \qquad (3)$$
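Formula (2) lends itself to a direct numerical sanity check: the analytic expression $2\,\mathrm{tr}(\Gamma_n^{-1}(W) A_n(W_k))$ should agree with a finite-difference derivative of $U_n$ with respect to the same weight. The following sketch reuses the hypothetical mlp_forward, U_n, W, Y and Z from the previous listing; the derivative $\partial F_W(z_t)/\partial W_k$ is itself approximated by finite differences, purely for illustration, where an autodiff framework would provide it directly.

```python
import numpy as np

def U_n_grad_analytic(W, Y, Z, dF_dWk):
    """Formula (2): 2 tr(Gamma_n(W)^{-1} A_n(W_k)).

    dF_dWk is the (n, d) array of derivatives dF_W(z_t)/dW_k for the
    single scalar weight W_k under consideration."""
    E = Y - mlp_forward(W, Z)            # residuals e_t = y_t - F_W(z_t)
    Gamma_n = (E.T @ E) / len(Y)
    A_n = -(dF_dWk.T @ E) / len(Y)       # A_n(W_k) = -(1/n) sum dF/dW_k e_t^T
    return 2.0 * np.trace(np.linalg.solve(Gamma_n, A_n))

# Check (2) on the single weight W["W2"][0, 0].
h_ = 1e-6
Wp = {k: v.copy() for k, v in W.items()}; Wp["W2"][0, 0] += h_
Wm = {k: v.copy() for k, v in W.items()}; Wm["W2"][0, 0] -= h_
dF = (mlp_forward(Wp, Z) - mlp_forward(Wm, Z)) / (2 * h_)   # dF/dW_k, (n, d)
print(U_n_grad_analytic(W, Y, Z, dF))              # analytic, formula (2)
print((U_n(Wp, Y, Z) - U_n(Wm, Y, Z)) / (2 * h_))  # finite difference of U_n
```

The two printed values should agree to several decimal places, which is a quick way to catch index or sign mistakes when implementing the gradient.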
3 Asymptotic properties of $\hat{W}_n$

First, following the same lines as Yao [5], it is easy to show that, if the noise of the model has a moment of order at least 2, the estimator is strongly consistent (i.e., $\hat{W}_n \xrightarrow{a.s.} W^0$). Moreover, for an MLP function there exists a constant $C$ such that the following inequalities hold:

$$\left\| \frac{\partial F_W(Z)}{\partial W_k} \right\| \le C (1 + \|Z\|),$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \le C (1 + \|Z\|^2),$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l} \right\| \le C \|W - W^0\| (1 + \|Z\|^3).$$

So, if $Z$ has a moment of order at least 3 (see the justification in Yao [5]), we get the following lemma.

Lemma 1. Let $\nabla U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$ and $HU_n(W^0)$ the Hessian matrix of $U_n(W)$ at $W^0$. Define finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k} \left( \frac{\partial F_W(Z)}{\partial W_l} \right)^T \quad \text{and} \quad A(W_k) := -\frac{\partial F_W(Z)}{\partial W_k} (Y - F_W(Z))^T.$$

Then

1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$,
2. $\sqrt{n}\, \nabla U_n(W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, 4 I_0)$,

where the component $(k, l)$ of the matrix $I_0$ is $\mathrm{tr}\left( \Gamma_0^{-1} E\left[ B(W^0_k, W^0_l) \right] \right)$.

Proof. To prove the lemma, remark first that $4 I_0$ is the variance matrix of the per-observation score, whose $k$-th component at $W^0$ is $2\,\mathrm{tr}\left( \Gamma_0^{-1} A(W^0_k) \right)$; its component $(k, l)$ is therefore

$$E\left[ 2\, \mathrm{tr}\left( \Gamma_0^{-1} A^T(W^0_k) \right) \times 2\, \mathrm{tr}\left( \Gamma_0^{-1} A(W^0_l) \right) \right]$$

and, since the trace of a product is invariant under circular permutation, this equals

$$4\, E\left[ \frac{\partial F_{W^0}(Z)^T}{\partial W_k}\, \Gamma_0^{-1} (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right].$$

Since $\varepsilon = Y - F_{W^0}(Z)$ is independent of $Z$ and $E[\varepsilon \varepsilon^T] = \Gamma_0$, this reduces to

$$4\, E\left[ \frac{\partial F_{W^0}(Z)^T}{\partial W_k}\, \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right] = 4\, \mathrm{tr}\left( \Gamma_0^{-1} E\left[ B(W^0_k, W^0_l) \right] \right).$$

Now, for the component $(k, l)$ of the limit of the Hessian matrix, remark that

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) \left( A_n(W^0_l) + A_n^T(W^0_l) \right) \Gamma_n^{-1}(W^0) A_n(W^0_k) \right) = 0$$

and

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0)\, C_n(W^0_k, W^0_l) \right) = 0$$

almost surely, since $A_n(W^0_k)$ and $C_n(W^0_k, W^0_l)$ are empirical means of products of the centered noise with quantities independent of it, and so converge to zero. Hence, by (3),

$$\lim_{n \to \infty} HU_n(W^0)_{k,l} = \lim_{n \to \infty} 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W^0) B_n(W^0_k, W^0_l) \right) = 2\, \mathrm{tr}\left( \Gamma_0^{-1} E\left[ B(W^0_k, W^0_l) \right] \right). \qquad \square$$

From a classical argument of local asymptotic normality (see for example Yao [5]), we then deduce the following property of the estimator $\hat{W}_n$.

Proposition 1. Let $W^*_n$ be the generalized least squares estimator

$$W^*_n := \arg\min_W \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \Gamma_0^{-1} (Y_t - F_W(Z_t)).$$

Then

$$\sqrt{n}\,(W^*_n - W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, I_0^{-1}) \quad \text{and} \quad \sqrt{n}\,(\hat{W}_n - W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, I_0^{-1}).$$

We remark that $\hat{W}_n$ has the same asymptotic behavior as the generalized least squares estimator using the true covariance matrix $\Gamma_0$, which is asymptotically optimal (see for example Ljung [2]); the proposed estimator is therefore asymptotically optimal too.

4 Conclusion

In the linear multidimensional regression model, the optimal estimator has an analytic solution (see Magnus and Neudecker [3]), so it does not make sense to consider the minimization of a cost function. However, for the nonlinear multidimensional regression model, the ordinary least squares estimator is suboptimal if the covariance matrix of the noise is not the identity matrix. We can overcome this difficulty by using the cost function $U_n(W)$. The numerical computation and the empirical properties of this estimator have been studied in a previous article (see Rynkiewicz [4]). In this paper, we have given a proof of the optimality of the estimator associated with $U_n(W)$. It is therefore a good choice for the estimation of multidimensional nonlinear regression models with multilayer perceptrons.
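The numerical procedures for computing $\hat{W}_n$ are given in Rynkiewicz [4]; as a complement, here is one plausible way to minimize $U_n$ with an off-the-shelf optimizer, continuing the toy example of the previous listings. The flatten/unflatten helpers and the choice of BFGS in scipy.optimize.minimize are assumptions for illustration, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize

# Flatten the weight dict from the first listing so that a generic
# optimizer can work on a single parameter vector.
keys = ["W1", "b1", "W2", "b2"]
shapes = {k: W[k].shape for k in keys}

def pack(Wd):
    return np.concatenate([Wd[k].ravel() for k in keys])

def unpack(theta):
    Wd, i = {}, 0
    for k in keys:
        size = int(np.prod(shapes[k]))
        Wd[k] = theta[i:i + size].reshape(shapes[k])
        i += size
    return Wd

# Start from a perturbed version of the true weights and minimize U_n;
# BFGS with numerical gradients is a pragmatic stand-in for formula (2).
theta0 = pack(W) + 0.1 * np.random.default_rng(1).normal(size=pack(W).size)
res = minimize(lambda th: U_n(unpack(th), Y, Z), theta0, method="BFGS")
W_hat = unpack(res.x)  # estimate of W^0, whose fluctuations Proposition 1 describes
print(res.fun, U_n(W, Y, Z))  # fitted cost vs. cost at the true weights
```

Starting near the true weights sidesteps the permutation and sign symmetries of MLP parameterizations, which is consistent with the local analysis of Proposition 1.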
References

[1] Fukumizu, K. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[2] Ljung, L. System Identification: Theory for the User. Prentice Hall, 1999.
[3] Magnus, J. and Neudecker, H. Matrix Differential Calculus with Applications in Statistics and Econometrics. J. Wiley and Sons, New York, 1988.
[4] Rynkiewicz, J. Estimation of multidimensional regression model with multilayer perceptron. In J. Mira and A. Prieto, editors, Proceedings of the 8th International Work-Conference on Artificial Neural Networks (IWANN 2003), Lecture Notes in Computer Science 2686, pages 310-317. Springer-Verlag, 2003.
[5] Yao, J.F. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316-331, 2000.
