Efficient Estimation of Multidimensional Regression Model with Multilayer Perceptron

This work concerns the estimation of multidimensional nonlinear regression models using a multilayer perceptron (MLP). The main problem with such a model is that the covariance matrix of the noise has to be known to obtain the optimal estimator. However, we show that if we choose as cost function the logarithm of the determinant of the empirical error covariance matrix, we obtain an asymptotically optimal estimator.

Authors: Joseph Rynkiewicz (Université Paris I, SAMOS/MATISSE, 72 rue Regnault, Paris, France)

1 Introduction

Let us consider a sequence $(Y_t, Z_t)_{t \in \mathbb{N}}$ of i.i.d. random vectors (i.e., identically distributed and independent).^1 Each couple $(Y_t, Z_t)$ has the same law as a generic variable $(Y, Z)$.

^1 It is not hard to extend everything shown in this paper to stationary mixing variables, and thus to time series.

Moreover, we assume that the model can be written

$$Y_t = F_{W^0}(Z_t) + \varepsilon_t,$$

where

- $F_{W^0}$ is a function represented by an MLP with parameters, or weights, $W^0$;
- $(\varepsilon_t)$ is an i.i.d. centered noise with unknown invertible covariance matrix $\Gamma_0$.

Our goal is to estimate the true parameter by minimizing an appropriate cost function. This model is called a regression model, and a popular choice for the associated cost function is the mean squared error:

$$\frac{1}{n} \sum_{t=1}^{n} \left\| Y_t - F_W(Z_t) \right\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$. Although this cost function is widely used, it is easy to show that it yields a suboptimal estimator. Another solution is to use an approximation $\Gamma$ of the error covariance matrix and compute the generalized least squares estimator by minimizing

$$\frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \, \Gamma^{-1} (Y_t - F_W(Z_t)),$$

where $^T$ denotes transposition. Here we assume that $\Gamma$ is a good approximation of the true covariance matrix of the noise, $\Gamma_0$. However, computing a good approximation of $\Gamma_0$ takes time, and this approach leads asymptotically to the cost function proposed in this article (see for example Rynkiewicz [4]):

$$U_n(W) := \log \det \left( \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))(Y_t - F_W(Z_t))^T \right). \qquad (1)$$

This paper is devoted to the theoretical study of $U_n(W)$. We assume that the true architecture of the MLP is known, so that the Hessian matrix computed in the sequel satisfies the assumption of being positive definite (see Fukumizu [1]). In this framework, we study the asymptotic behavior of $\hat{W}_n := \arg\min_W U_n(W)$, the weights minimizing the cost function $U_n(W)$. We show that, under simple assumptions, this estimator is asymptotically optimal in the sense that it has the same asymptotic behavior as the generalized least squares estimator using the true covariance matrix of the noise. Numerical procedures to compute this estimator and examples of its behavior can be found in Rynkiewicz [4].

2 The first and second derivatives of $W \mapsto U_n(W)$

First, we introduce a notation: if $F_W(X)$ is a $d$-dimensional parametric function depending on a parameter $W$, we write $\frac{\partial F_W(X)}{\partial W_k}$ (resp. $\frac{\partial^2 F_W(X)}{\partial W_k \partial W_l}$) for the $d$-dimensional vector of partial derivatives (resp. second-order partial derivatives) of each component of $F_W(X)$.
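Before turning to the derivatives, it may help to see the cost function (1) in code. Below is a minimal sketch in Python/NumPy; the one-hidden-layer network `mlp_forward` and its weight layout are illustrative assumptions, not the architecture or numerical procedure of the paper (for those, see Rynkiewicz [4]).

```python
import numpy as np

def mlp_forward(W, Z):
    """Toy one-hidden-layer MLP mapping Z (n, p) to predictions (n, d).

    The weight layout (a dict of arrays) is an illustrative assumption,
    not the architecture used in the paper."""
    H = np.tanh(Z @ W["W1"] + W["b1"])    # hidden layer
    return H @ W["W2"] + W["b2"]          # linear output layer

def U_n(W, Y, Z):
    """Cost function (1): log det of the empirical error covariance."""
    E = Y - mlp_forward(W, Z)             # residuals, shape (n, d)
    Gamma_n = (E.T @ E) / len(Y)          # Gamma_n(W), shape (d, d)
    sign, logdet = np.linalg.slogdet(Gamma_n)  # stable log-determinant
    return logdet

# Tiny demonstration with d = 2 outputs and correlated noise.
rng = np.random.default_rng(0)
n, p, h, d = 200, 3, 5, 2
W = {"W1": rng.normal(size=(p, h)), "b1": np.zeros(h),
     "W2": rng.normal(size=(h, d)), "b2": np.zeros(d)}
Z = rng.normal(size=(n, p))
eps = rng.multivariate_normal(np.zeros(d), [[1.0, 0.7], [0.7, 1.0]], size=n)
Y = mlp_forward(W, Z) + eps
print(U_n(W, Y, Z))  # value of (1) at the data-generating weights
```

Using np.linalg.slogdet rather than a determinant followed by a logarithm avoids overflow and keeps the computation stable when the residual covariance is badly scaled.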
2.1 First derivatives

If $\Gamma_n(W)$ is a matrix depending on the parameter vector $W$, we get from Magnus and Neudecker [3]

$$\frac{\partial}{\partial W_k} \ln \det(\Gamma_n(W)) = \mathrm{tr}\left( \Gamma_n^{-1}(W) \, \frac{\partial}{\partial W_k}\Gamma_n(W) \right),$$

where here

$$\Gamma_n(W) = \frac{1}{n} \sum_{t=1}^{n} (y_t - F_W(z_t))(y_t - F_W(z_t))^T.$$

Note that the matrix $\Gamma_n(W)$ and its inverse are symmetric. Now, if we write

$$A_n(W_k) = \frac{1}{n} \sum_{t=1}^{n} \left( -\frac{\partial F_W(z_t)}{\partial W_k} (y_t - F_W(z_t))^T \right),$$

then, using the fact that

$$\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = \mathrm{tr}\left( A_n^T(W_k) \Gamma_n^{-1}(W) \right) = \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n^T(W_k) \right),$$

we get

$$\frac{\partial}{\partial W_k} \ln \det(\Gamma_n(W)) = 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right). \qquad (2)$$

2.2 Second derivatives

We now write

$$B_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( \frac{\partial F_W(z_t)}{\partial W_k} \left( \frac{\partial F_W(z_t)}{\partial W_l} \right)^T \right)$$

and

$$C_n(W_k, W_l) := \frac{1}{n} \sum_{t=1}^{n} \left( -(y_t - F_W(z_t)) \left( \frac{\partial^2 F_W(z_t)}{\partial W_k \partial W_l} \right)^T \right).$$

We get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = \frac{\partial}{\partial W_l}\, 2\,\mathrm{tr}\left( \Gamma_n^{-1}(W) A_n(W_k) \right) = 2\, \mathrm{tr}\left( \frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right).$$

Now, Magnus and Neudecker [3] give an analytic form of the derivative of an inverse matrix, namely $\frac{\partial \Gamma_n^{-1}(W)}{\partial W_l} = -\Gamma_n^{-1}(W) \frac{\partial \Gamma_n(W)}{\partial W_l} \Gamma_n^{-1}(W)$, where $\frac{\partial \Gamma_n(W)}{\partial W_l} = A_n(W_l) + A_n^T(W_l)$, so we get

$$\frac{\partial^2 U_n(W)}{\partial W_k \partial W_l} = -2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) \left( A_n(W_l) + A_n^T(W_l) \right) \Gamma_n^{-1}(W) A_n(W_k) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) B_n(W_k, W_l) \right) + 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W) C_n(W_k, W_l) \right). \qquad (3)$$
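Formula (2) lends itself to a direct numerical sanity check: the analytic expression $2\,\mathrm{tr}(\Gamma_n^{-1}(W) A_n(W_k))$ should agree with a finite-difference derivative of $U_n$ with respect to the same weight. The following sketch reuses the hypothetical mlp_forward, U_n, W, Y and Z from the previous listing; the derivative $\partial F_W(z_t)/\partial W_k$ is itself approximated by finite differences, purely for illustration, where an autodiff framework would provide it directly.

```python
import numpy as np

def U_n_grad_analytic(W, Y, Z, dF_dWk):
    """Formula (2): 2 tr(Gamma_n(W)^{-1} A_n(W_k)).

    dF_dWk is the (n, d) array of derivatives dF_W(z_t)/dW_k for the
    single scalar weight W_k under consideration."""
    E = Y - mlp_forward(W, Z)            # residuals e_t = y_t - F_W(z_t)
    Gamma_n = (E.T @ E) / len(Y)
    A_n = -(dF_dWk.T @ E) / len(Y)       # A_n(W_k) = -(1/n) sum dF/dW_k e_t^T
    return 2.0 * np.trace(np.linalg.solve(Gamma_n, A_n))

# Check (2) on the single weight W["W2"][0, 0].
h_ = 1e-6
Wp = {k: v.copy() for k, v in W.items()}; Wp["W2"][0, 0] += h_
Wm = {k: v.copy() for k, v in W.items()}; Wm["W2"][0, 0] -= h_
dF = (mlp_forward(Wp, Z) - mlp_forward(Wm, Z)) / (2 * h_)   # dF/dW_k, (n, d)
print(U_n_grad_analytic(W, Y, Z, dF))              # analytic, formula (2)
print((U_n(Wp, Y, Z) - U_n(Wm, Y, Z)) / (2 * h_))  # finite difference of U_n
```

The two printed values should agree to several decimal places, which is a quick way to catch index or sign mistakes when implementing the gradient.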
3 Asymptotic properties of $\hat{W}_n$

First, following the same lines as Yao [5], it is easy to show that, if the noise of the model has a moment of order at least 2, the estimator is strongly consistent (i.e., $\hat{W}_n \xrightarrow{a.s.} W^0$). Moreover, for an MLP function there exists a constant $C$ such that the following inequalities hold:

$$\left\| \frac{\partial F_W(Z)}{\partial W_k} \right\| \le C (1 + \|Z\|),$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} \right\| \le C (1 + \|Z\|^2),$$

$$\left\| \frac{\partial^2 F_W(Z)}{\partial W_k \partial W_l} - \frac{\partial^2 F_{W^0}(Z)}{\partial W_k \partial W_l} \right\| \le C \|W - W^0\| (1 + \|Z\|^3).$$

So, if $Z$ has a moment of order at least 3 (see the justification in Yao [5]), we get the following lemma.

Lemma 1. Let $\nabla U_n(W^0)$ be the gradient vector of $U_n(W)$ at $W^0$ and $HU_n(W^0)$ the Hessian matrix of $U_n(W)$ at $W^0$. Define finally

$$B(W_k, W_l) := \frac{\partial F_W(Z)}{\partial W_k} \left( \frac{\partial F_W(Z)}{\partial W_l} \right)^T \quad \text{and} \quad A(W_k) := -\frac{\partial F_W(Z)}{\partial W_k} (Y - F_W(Z))^T.$$

Then

1. $HU_n(W^0) \xrightarrow{a.s.} 2 I_0$,
2. $\sqrt{n}\, \nabla U_n(W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, 4 I_0)$,

where the component $(k, l)$ of the matrix $I_0$ is $\mathrm{tr}\left( \Gamma_0^{-1} E\left[ B(W^0_k, W^0_l) \right] \right)$.

Proof. To prove the lemma, remark first that $4 I_0$ is the variance matrix of the per-observation score, whose $k$-th component at $W^0$ is $2\,\mathrm{tr}\left( \Gamma_0^{-1} A(W^0_k) \right)$; its component $(k, l)$ is therefore

$$E\left[ 2\, \mathrm{tr}\left( \Gamma_0^{-1} A^T(W^0_k) \right) \times 2\, \mathrm{tr}\left( \Gamma_0^{-1} A(W^0_l) \right) \right]$$

and, since the trace of a product is invariant under circular permutation, this equals

$$4\, E\left[ \frac{\partial F_{W^0}(Z)^T}{\partial W_k}\, \Gamma_0^{-1} (Y - F_{W^0}(Z))(Y - F_{W^0}(Z))^T \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right].$$

Since $\varepsilon = Y - F_{W^0}(Z)$ is independent of $Z$ and $E[\varepsilon \varepsilon^T] = \Gamma_0$, this reduces to

$$4\, E\left[ \frac{\partial F_{W^0}(Z)^T}{\partial W_k}\, \Gamma_0^{-1} \frac{\partial F_{W^0}(Z)}{\partial W_l} \right] = 4\, \mathrm{tr}\left( \Gamma_0^{-1} E\left[ B(W^0_k, W^0_l) \right] \right).$$

Now, for the component $(k, l)$ of the limit of the Hessian matrix, remark that

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0) \left( A_n(W^0_l) + A_n^T(W^0_l) \right) \Gamma_n^{-1}(W^0) A_n(W^0_k) \right) = 0$$

and

$$\lim_{n \to \infty} \mathrm{tr}\left( \Gamma_n^{-1}(W^0)\, C_n(W^0_k, W^0_l) \right) = 0$$

almost surely, since $A_n(W^0_k)$ and $C_n(W^0_k, W^0_l)$ are empirical means of products of the centered noise with quantities independent of it, and so converge to zero. Hence, by (3),

$$\lim_{n \to \infty} HU_n(W^0)_{k,l} = \lim_{n \to \infty} 2\, \mathrm{tr}\left( \Gamma_n^{-1}(W^0) B_n(W^0_k, W^0_l) \right) = 2\, \mathrm{tr}\left( \Gamma_0^{-1} E\left[ B(W^0_k, W^0_l) \right] \right). \qquad \square$$

From a classical argument of local asymptotic normality (see for example Yao [5]), we then deduce the following property of the estimator $\hat{W}_n$.

Proposition 1. Let $W^*_n$ be the generalized least squares estimator

$$W^*_n := \arg\min_W \frac{1}{n} \sum_{t=1}^{n} (Y_t - F_W(Z_t))^T \Gamma_0^{-1} (Y_t - F_W(Z_t)).$$

Then

$$\sqrt{n}\,(W^*_n - W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, I_0^{-1}) \quad \text{and} \quad \sqrt{n}\,(\hat{W}_n - W^0) \xrightarrow{\text{Law}} \mathcal{N}(0, I_0^{-1}).$$

We remark that $\hat{W}_n$ has the same asymptotic behavior as the generalized least squares estimator using the true covariance matrix $\Gamma_0$, which is asymptotically optimal (see for example Ljung [2]); the proposed estimator is therefore asymptotically optimal too.

4 Conclusion

In the linear multidimensional regression model, the optimal estimator has an analytic solution (see Magnus and Neudecker [3]), so it does not make sense to consider the minimization of a cost function. However, for the nonlinear multidimensional regression model, the ordinary least squares estimator is suboptimal if the covariance matrix of the noise is not the identity matrix. We can overcome this difficulty by using the cost function $U_n(W)$. The numerical computation and the empirical properties of this estimator have been studied in a previous article (see Rynkiewicz [4]). In this paper, we have given a proof of the optimality of the estimator associated with $U_n(W)$. It is therefore a good choice for the estimation of multidimensional nonlinear regression models with multilayer perceptrons.
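The numerical procedures for computing $\hat{W}_n$ are given in Rynkiewicz [4]; as a complement, here is one plausible way to minimize $U_n$ with an off-the-shelf optimizer, continuing the toy example of the previous listings. The flatten/unflatten helpers and the choice of BFGS in scipy.optimize.minimize are assumptions for illustration, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize

# Flatten the weight dict from the first listing so that a generic
# optimizer can work on a single parameter vector.
keys = ["W1", "b1", "W2", "b2"]
shapes = {k: W[k].shape for k in keys}

def pack(Wd):
    return np.concatenate([Wd[k].ravel() for k in keys])

def unpack(theta):
    Wd, i = {}, 0
    for k in keys:
        size = int(np.prod(shapes[k]))
        Wd[k] = theta[i:i + size].reshape(shapes[k])
        i += size
    return Wd

# Start from a perturbed version of the true weights and minimize U_n;
# BFGS with numerical gradients is a pragmatic stand-in for formula (2).
theta0 = pack(W) + 0.1 * np.random.default_rng(1).normal(size=pack(W).size)
res = minimize(lambda th: U_n(unpack(th), Y, Z), theta0, method="BFGS")
W_hat = unpack(res.x)  # estimate of W^0, whose fluctuations Proposition 1 describes
print(res.fun, U_n(W, Y, Z))  # fitted cost vs. cost at the true weights
```

Starting near the true weights sidesteps the permutation and sign symmetries of MLP parameterizations, which is consistent with the local analysis of Proposition 1.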
References

[1] Fukumizu, K. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[2] Ljung, L. System Identification: Theory for the User. Prentice Hall, 1999.
[3] Magnus, J. and Neudecker, H. Matrix Differential Calculus with Applications in Statistics and Econometrics. J. Wiley and Sons, New York, 1988.
[4] Rynkiewicz, J. Estimation of multidimensional regression model with multilayer perceptron. In J. Mira and A. Prieto, editors, Proceedings of the 8th International Work-Conference on Artificial Neural Networks (IWANN 2003), Lecture Notes in Computer Science 2686, pages 310-317. Springer-Verlag, 2003.
[5] Yao, J.F. On least squares estimation for stable nonlinear AR processes. Annals of the Institute of Statistical Mathematics, 52:316-331, 2000.
