MINIMUM L2 AND ROBUST KULLBACK–LEIBLER ESTIMATION*

Nils Lid Hjort, University of Oslo
Department of Mathematics, N–0316 Oslo, Norway

Abstract. This paper introduces two new robust methods for estimation of parameters in a given parametric family. The first method is that of 'minimum weighted L2', effectively minimising an estimate of the integrated (and possibly weighted) squared error. The second is 'robust Kullback–Leibler', consisting of minimising a robust version of the empirical Kullback–Leibler distance, and can be viewed as a general robust modification of the maximum likelihood procedure. This second method is also related to recent local likelihood ideas for semiparametric density estimation. The methods are described, influence functions are found, as are formulae for asymptotic variances. In particular large-sample efficiencies are computed under the home turf conditions of the underlying parametric model. The methods and formulae are illustrated for the normal model.

1. Minimum weighted L2 estimation. Let $X_1, \ldots, X_n$ be independent data points from an unknown density $f$, and suppose that the data are to be fitted to some given regular parametric family of densities $f_\theta(x)$. A simple and natural estimation idea is to minimise an estimate of $\int w (f_\theta - f)^2 \,dx$, where $w(\cdot)$ is a suitable weight function, perhaps the constant 1. Disregarding the one term that does not depend on the parameter, this leads to the following strategy: minimise

$$Q_n(\theta) = \int w f_\theta^2 \,dx - 2 \frac{1}{n} \sum_{i=1}^n w(x_i) f_\theta(x_i). \qquad (1.1)$$

Taking the derivative, this is also the same as solving

$$V_n(\theta) = \int w f_\theta u_\theta \,(dF_n - f_\theta \,dx) = 0, \qquad (1.2)$$

where $u_\theta(x) = \partial \log f_\theta(x)/\partial\theta$ is the score function of the model, and where $F_n$ is the empirical distribution of the data.
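As a concrete illustration (a sketch of mine, not code from the paper), criterion (1.1) can be minimised numerically for the normal family with constant weight $w = 1$, using the closed form $\int f_\theta^2\,dx = 1/(2\sigma\sqrt{\pi})$; the function names are my own:

```python
import numpy as np
from scipy.optimize import minimize

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def q_n(theta, x):
    # Criterion (1.1) with w = 1; for the normal family
    # int f_theta^2 dx = 1 / (2 sigma sqrt(pi)).
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return 1.0 / (2 * sigma * np.sqrt(np.pi)) - 2.0 * np.mean(normal_pdf(x, mu, sigma))

def min_l2_normal(x):
    # Start from robust location/scale (median and scaled MAD).
    mad = np.median(np.abs(x - np.median(x)))
    start = np.array([np.median(x), 1.4826 * mad])
    return minimize(q_n, start, args=(x,), method="Nelder-Mead").x
```

Since each observation enters only through the bounded quantity $w(x_i) f_\theta(x_i)$, wild observations have limited influence on the fitted parameters.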
* From Proceedings of the 12th Prague Conference, 1994.

We derived (1.2) as a consequence of the natural (1.1), but forming an estimator by solving this second equation can also be motivated separately. It forces a weighted integral of the nonparametric $dF_n(x)$ to be equal to the corresponding weighted integral of the parametric $f_\theta(x)\,dx$. In spite of much work in the literature on various minimum distance strategies, the particular estimator (1.1)–(1.2) does not seem to have been studied earlier. It has also been proposed independently by M.C. Jones (personal communication). A method recently considered in Brown and Hwang (1993) has intentions similar to that of (1.1), but is unnecessarily hampered by an intermediate histogram approximation. This is a case of 'don't smooth if you don't have to'.

2. Influence function. Let $\hat\theta$ be the estimator and let $\theta_0$ minimise $\int w (f_\theta - f)^2\,dx$. There is typically a unique parameter achieving this, and we interpret this $\theta_0$ as the 'least false' or 'most appropriate' parameter value. As $n$ grows $\hat\theta$ converges almost surely to $\theta_0$. Standard Taylor arguments show that

$$\hat\theta - \theta_0 \doteq -V_n^*(\theta_0)^{-1} V_n(\theta_0), \qquad (2.1)$$

where $V_n^*(\theta_0)$ is the matrix of derivatives of $V_n(\theta)$. Letting $u_\theta^*$ be the matrix of second order derivatives of the log density we have

$$V_n^*(\theta) = \int w (f_\theta u_\theta u_\theta' + f_\theta u_\theta^*)\,(dF_n - f_\theta\,dx) - \int w f_\theta^2 u_\theta u_\theta'\,dx,$$

so that $-V_n^*(\theta_0) \to_p J = J(\theta_0)$, where

$$J(\theta) = \int w f_\theta^2 u_\theta u_\theta'\,dx - \int w (f_\theta u_\theta u_\theta' + f_\theta u_\theta^*)(f - f_\theta)\,dx.$$

The influence function of the estimator can now be established, via (2.1); see for example Huber (1981) for definition of and important uses of influence functions. Here it becomes

$$I(f, x) = J^{-1}\{w(x) f(x, \theta_0) u(x, \theta_0) - \xi_0\}, \qquad (2.2)$$

where

$$\xi_0 = E_f\, w(X) f(X, \theta_0) u(X, \theta_0) = \int w(x) f(x, \theta_0) f(x) u(x, \theta_0)\,dx = \int w(x) f(x, \theta_0)^2 u(x, \theta_0)\,dx,$$

the last equality holding by the defining equation $V(\theta_0) = 0$ for $\theta_0$. Where notationally convenient we write $f(x, \theta)$ for $f_\theta(x)$, and so on. The (2.2) function is typically bounded, which means robustness. The influence function is in fact also redescending in most cases, going to zero for $x$-values outside the mainstream. This is often considered an attractive robustness feature of an estimation method.

3. Limit distribution. By the central limit theorem and the definition of $\theta_0$, $\sqrt{n}\,V_n(\theta_0)$ tends to $N\{0, M\}$, where

$$M = \mathrm{Var}_f\{w(X) f(X, \theta_0) u(X, \theta_0)\} = \int w^2 f_\theta^2 f\, u_\theta u_\theta'\,dx - \xi_0 \xi_0'.$$

From (2.1) follows

$$\sqrt{n}\,(\hat\theta - \theta_0) \to_d N\{0, J^{-1} M J^{-1}\}, \qquad (3.1)$$

with $J$ as given above. Note that this result has been reached without having to assume that the true $f$ belongs to the parametric model. The expressions for $J$ and $M$ simplify under model conditions. Of course there is some loss of efficiency, that is, the limiting covariance matrix $J^{-1} M J^{-1}$ is larger than the best possible one under the model, namely $(\int f_\theta u_\theta u_\theta'\,dx)^{-1}$, achieved by the maximum likelihood method.

4. Local and weighted L2 fitting. The size of the limiting variances depends on the weight function $w(\cdot)$. Choosing a local weight function can be contemplated, say of the kernel type $K_h(x_0 - t)$ around a given $x_0$. Here $K_h(u) = h^{-1} K(h^{-1} u)$ and $K$ is a given kernel function. This gives a locally estimated normal, for example, in a spirit similar to local likelihood methods discussed in Hjort and Jones (1994). The apparatus above can be used to investigate influence functions and large-sample properties.

It is sometimes desirable to let the weight function be data driven too, perhaps to increase precision under close-to-the-model circumstances.
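The matrices $J$ and $M$ of Section 3 can be evaluated numerically for a given model and weight function. A small sketch of mine (not from the paper), for the $N(0,1)$ model with $w = 1$ under model conditions, where $J = \int f_\theta^2 u_\theta u_\theta'\,dx$ and $M = \int f_\theta^3 u_\theta u_\theta'\,dx - \xi_0\xi_0'$:

```python
import numpy as np
from scipy.integrate import quad

def phi(x):
    # standard normal density
    return np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)

def score(x):
    # score vector of the N(mu, sigma^2) family, evaluated at (mu, sigma) = (0, 1)
    return np.array([x, x * x - 1.0])

def moment_matrix(weight):
    # integrate weight(x) * u(x) u(x)' over the real line, entry by entry
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = quad(lambda t: weight(t) * score(t)[i] * score(t)[j],
                             -np.inf, np.inf)[0]
    return out

# Under model conditions (f = f_theta) with w = 1:
J = moment_matrix(lambda t: phi(t) ** 2)
xi = np.array([quad(lambda t: phi(t) ** 2 * score(t)[i], -np.inf, np.inf)[0]
               for i in range(2)])
M = moment_matrix(lambda t: phi(t) ** 3) - np.outer(xi, xi)
Jinv = np.linalg.inv(J)
sandwich = Jinv @ M @ Jinv  # limiting covariance matrix in (3.1)
```

The diagonal of the sandwich matrix comes out as $8/3^{3/2} \approx 1.5396$ and $\approx 0.9241$, agreeing with the normal-model figures derived in Section 6.1.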
One example would be to use $w_n(x) = w_0((x - \tilde\mu)/\tilde\sigma)$, for a suitable $w_0(\cdot)$ function, with preliminary robust estimates of location and scale. Result (3.1) is still true under appropriate conditions, with $J$ and $M$ being defined in terms of the limit function version of $w_n(\cdot)$.

5. Local and robust Kullback–Leibler fitting. The local kernel smoothed likelihood function, around a given $x_0$, is

$$L_n(x_0, \theta) = \sum_{i=1}^n K_h(x_i - x_0) \log f(x_i, \theta) - n \int K_h(t - x_0) f(t, \theta)\,dt, \qquad (5.1)$$

see Hjort and Jones (1994). As shown there, maximising (5.1) aims at minimising the localised Kullback–Leibler distance

$$d(f, f_\theta) = \int K_h(t - x_0)\Big[f(t) \log\frac{f(t)}{f_\theta(t)} - \{f(t) - f_\theta(t)\}\Big]\,dt$$

from true density to parametric density. In other words, the maximiser of (5.1) aims at a 'least false' parameter value $\theta_0$ that in general is different from the one associated with the minimum weighted L2 method. Note that a large $h$ gives a flat $K_h(t - x_0)$ function, and brings back the ordinary Kullback–Leibler distance and the traditional full likelihood method.

The aim of Hjort and Jones (1994) is primarily the complete semiparametric estimation of the full density curve, as partly opposed to concentrating on the locally estimated parameters themselves. But this is also automatically one way of obtaining robust parameter estimates for a given parametric family: apply the above for a suitable centrally placed $x_0$, for a reasonably sized $h$. The resulting maximiser $\hat\theta$ is a robust estimate of $\theta$, and $f(x, \hat\theta)$ a robust estimate of the underlying density curve. Hjort and Jones (1994) demonstrate that $\sqrt{n}\,(\hat\theta - \theta_0) \to_d N\{0, J_h^{-1} M_h J_h^{-1}\}$, with certain generally valid expressions available there for $J_h$ and $M_h$.
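The smoothed likelihood (5.1) is straightforward to implement for a general parametric family. Here is a sketch (my own naming, using a Gaussian kernel and numerical quadrature for the integral term; this is an illustration, not code from Hjort and Jones 1994):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def normal_density(t, theta):
    # the example parametric family f(t, theta), theta = (mu, sigma)
    mu, sigma = theta[0], abs(theta[1]) + 1e-12
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def local_loglik(theta, x, x0, h, pdf):
    # L_n(x0, theta) of (5.1), with Gaussian kernel K_h(u) = phi(u/h)/h
    kh = np.exp(-0.5 * ((x - x0) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    integral = quad(lambda t: np.exp(-0.5 * ((t - x0) / h) ** 2)
                    / (h * np.sqrt(2 * np.pi)) * pdf(t, theta),
                    x0 - 10 * h, x0 + 10 * h)[0]
    logf = np.log(np.maximum(pdf(x, theta), 1e-300))
    return np.sum(kh * logf) - len(x) * integral

def maximise_local(x, x0, h, pdf, start):
    res = minimize(lambda th: -local_loglik(th, x, x0, h, pdf),
                   np.asarray(start, dtype=float), method="Nelder-Mead")
    return res.x
```

With a large $h$ the kernel weights flatten out and the maximiser approaches the ordinary maximum likelihood estimate, as noted above.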
At the moment it will suffice to give these under model conditions:

$$J_h = \int K_h(t - x_0)\, u_\theta u_\theta' f_\theta\,dt, \qquad M_h = \int K_h(t - x_0)^2\, u_\theta u_\theta' f_\theta\,dt - \xi_0 \xi_0', \qquad (5.2)$$

where $\xi_0 = \int K_h(t - x_0)\, u_\theta f_\theta\,dt$. The influence function of this robustified maximum likelihood is also derived in Hjort and Jones (1994), and is of the form

$$I(f, x) = J_h^{-1}\{K_h(x - x_0)\, u(x, \theta) - \xi_0\}. \qquad (5.3)$$

This is reasonably similar to the influence function (2.2) for the minimum L2 method. In many cases the present method, with a suitably chosen $h$, is more efficient at the model than at least the unweighted version of the minimum L2 method.

6. The normal model. The most important special case is that of fitting data to a normal $(\mu, \sigma^2)$.

6.1. The minimum L2 method. From (1.2) two equations are easily put up to define minimum L2 estimators $\hat\mu$ and $\hat\sigma$. These are solved for example by the iterative Newton–Raphson technique. Regarding performance, under Gaussian circumstances, and using a constant weight function, we find

$$J = (\sigma^3 \sqrt{2\pi})^{-1}\,\mathrm{diag}(1/2^{3/2},\; 3/2^{5/2}), \qquad M = (\sigma^4\, 2\pi)^{-1}\,\mathrm{diag}(1/3^{3/2},\; 2/3^{3/2} - 1/8).$$

This gives an asymptotic variance for $\hat\mu$ of size $1.5396\,\sigma^2/n$ and an asymptotic variance for $\hat\sigma$ of size $0.9241\,\sigma^2/n$. These should be compared to the minimum possible values under model conditions. These optimal figures are achieved by the ML method, and are $\sigma^2/n$ and $\frac12\sigma^2/n$, respectively. This makes the direct minimum L2 method qualify as a 'quite robust but perhaps too inefficient method'. Increased efficiency at the model is achieved through appropriate choices of weight function $w(\cdot)$, cf. comments at the end of Section 4. One possibility here is $w_n(x) = \exp\{\frac12\delta (x - \tilde\mu)^2/\tilde\sigma^2\}$, defined in terms of preliminary robust estimates of location and scale, and with an extra tuning parameter $\delta \in (0, 1)$. Choosing e.g. $\delta = 0.8$ leads to quite good efficiency at the model, while still retaining a reasonable robustness.

6.2. The robustified ML method. The robust Kullback–Leibler fitting method of Section 5 can easily be made better than the unweighted minimum L2 method. For the present normal model, let us use a normal kernel. The method is then to minimise, for given $x_0$, the function

$$\frac{1}{n}\sum_{i=1}^n \phi\Big(\frac{x_i - x_0}{h}\Big)\frac{1}{h}\Big\{\log\sigma + \frac12 (x_i - \mu)^2/\sigma^2\Big\} + \phi\Big(\frac{x_0 - \mu}{\sqrt{\sigma^2 + h^2}}\Big)\frac{1}{\sqrt{\sigma^2 + h^2}} \qquad (6.1)$$

over all $(\mu, \sigma)$. We may compute the $J_h$ and $M_h$ matrices of (5.2) without serious difficulties. But in the present context the interest lies more in getting hold of a single, robust $(\mu, \sigma)$-estimate than in obtaining a full function of local estimates. Therefore we suggest using $x_0 = \tilde\mu$, a robust preliminary estimate of the mean, say the simple median. Minimising (6.1) with this $x_0$ defines the proposed $\hat\mu$ and $\hat\sigma$.

Again it is of interest to see how well the method fares under Gaussian home-turf conditions. Somewhat arduous calculations give two diagonal matrices $(J_\mu, J_\sigma)$ and $(M_\mu, M_\sigma)$ for $J_h$ and $M_h$ of (5.2). Here

$$J_\mu = \frac{1}{\sigma^2}\frac{1}{\sqrt{2\pi}}\frac{1}{h}\frac{1}{R^3} \quad\text{and}\quad M_\mu = \frac{1}{\sigma^2}\frac{1}{2\pi}\frac{1}{h^2}\frac{1}{S^3},$$

in which $R = (1 + \sigma^2/h^2)^{1/2}$ and $S = (1 + 2\sigma^2/h^2)^{1/2}$. Similarly,

$$J_\sigma = \frac{1}{\sigma^2}\frac{1}{\sqrt{2\pi}}\frac{1}{h}\frac{1}{R}\Big(1 - \frac{2}{R^2} + \frac{3}{R^4}\Big), \qquad M_\sigma = \frac{1}{\sigma^2}\frac{1}{2\pi}\frac{1}{h^2}\Big\{\frac{1}{S}\Big(1 - \frac{2}{S^2} + \frac{3}{S^4}\Big) - \frac{1}{R^2}\Big(1 - \frac{1}{R^2}\Big)^2\Big\}.$$

Thus $\sqrt{n}\,(\hat\mu - \mu) \to_d N\{0, \kappa_\mu^2\}$ and $\sqrt{n}\,(\hat\sigma - \sigma) \to_d N\{0, \kappa_\sigma^2\}$, where the asymptotic variances are found as $M_\mu/J_\mu^2$ and $M_\sigma/J_\sigma^2$. Some calculations give

$$\kappa_\mu^2 = \sigma^2\,\frac{R^6}{S^3} = \sigma^2\,\frac{(1 + 1/k^2)^3}{(1 + 2/k^2)^{3/2}},$$

writing $h = k\sigma$, and similarly

$$\kappa_\sigma^2 = \sigma^2\,\frac{(1 + 1/k^2)^2}{(1 + 2/k^2)^{5/2}}\, \frac{(1 + 1/k^2)^3 (2 + 4/k^4) - (1 + 2/k^2)^{5/2}/k^4}{(2 + 1/k^4)^2}.$$

How large should $h$ be chosen? We think of $h$ as $k\tilde\sigma$, where $\tilde\sigma$ is a robust preliminary estimate of the standard deviation, and need to choose the factor $k$.
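A sketch in code (my own function names, not from the paper): criterion (6.1) in closed form, minimised with $x_0$ the median and $h = k\tilde\sigma$ based on the scaled MAD, together with the $\kappa^2$ formulas above:

```python
import numpy as np
from scipy.optimize import minimize

def criterion_61(theta, x, x0, h):
    # (6.1): kernel-weighted negative log-likelihood terms plus the exact
    # Gaussian convolution term phi((x0 - mu)/s) / s, with s^2 = sigma^2 + h^2.
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    kh = np.exp(-0.5 * ((x - x0) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    s2 = sigma ** 2 + h ** 2
    conv = np.exp(-0.5 * (x0 - mu) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    return np.mean(kh * (np.log(sigma) + 0.5 * (x - mu) ** 2 / sigma ** 2)) + conv

def robust_normal_fit(x, k=2.0):
    # x0 = median, h = k * (scaled MAD), as suggested in Section 6.2.
    med = np.median(x)
    scale = 1.4826 * np.median(np.abs(x - med))
    res = minimize(criterion_61, np.array([med, scale]),
                   args=(x, med, k * scale), method="Nelder-Mead")
    return res.x

def kappa2_mu(k):
    # large-sample variance of mu-hat, in units of sigma^2
    return (1 + 1 / k**2) ** 3 / (1 + 2 / k**2) ** 1.5

def kappa2_sigma(k):
    # large-sample variance of sigma-hat, in units of sigma^2
    a, b = 1 + 1 / k**2, 1 + 2 / k**2
    return a**2 / b**2.5 * (a**3 * (2 + 4 / k**4) - b**2.5 / k**4) / (2 + 1 / k**4) ** 2
```

With k = 2 this is the 'normal with two standard deviations around the median' recipe, and kappa2_mu(1) reproduces the unweighted minimum L2 value 1.5396 of Section 6.1.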
As a mild surprise, the value $k = 1$ gives precisely the same large-sample variances under the model as the straightforward minimum L2 method of Section 6.1, respectively $1.5396\,\sigma^2$ and $0.9241\,\sigma^2$. A more efficient but still quite robust value would be $k = 2$: 'place a normal with two standard deviations around the median and maximise the local kernel smoothed likelihood'. Then the values are $1.063\,\sigma^2$ and $0.563\,\sigma^2$, only a few percent above the values that are optimal under the model, viz. $\sigma^2$ and $\sigma^2/2$. Increasing the value to three estimated standard deviations brings the large-sample variances further down to $1.015\,\sigma^2$ and $0.5152\,\sigma^2$. One should not go much further if robustness is aimed for, but of course a large $h$ gives back these optimal values.

Comparing the performance of the weighted L2 method, say with the data driven weight function indicated above, with that of the robust Kullback–Leibler estimator, is an interesting problem for further research. One should also devise criteria for choosing the necessary fine-tuning parameters.

7. Robust estimation of location and covariance matrix. The ideas and results above generalise easily to the multi-dimensional case. In particular the localised Kullback–Leibler method seems to constitute a fruitful way of obtaining robust estimates of $\mu$ and $\Sigma$, the mean vector and covariance matrix of the underlying distribution. The estimates can be viewed as robust estimates of these parameters under normality assumptions, but also outside normality. One concrete version of this scheme, in the $p$-dimensional case, is as follows: start out with preliminary and robust estimates $\tilde\mu$ and $\tilde\Sigma$ for mean and covariance matrix. Then carry out local likelihood estimation with a Gaussian kernel function centred at $\tilde\mu$ and with covariance matrix of size $h^2\tilde\Sigma$.
This is seen to be the same as minimising the criterion function

$$\frac{1}{n}\sum_{i=1}^n \Big[\frac{\exp\{-\frac12 (x_i - \tilde\mu)'\tilde\Sigma^{-1}(x_i - \tilde\mu)/h^2\}}{h^p\,|\tilde\Sigma|^{1/2}} \Big\{\frac12 \log|\Sigma| + \frac12 (x_i - \mu)'\Sigma^{-1}(x_i - \mu)\Big\}\Big] + \frac{\exp\{-\frac12 (\mu - \tilde\mu)'(h^2\tilde\Sigma + \Sigma)^{-1}(\mu - \tilde\mu)\}}{|h^2\tilde\Sigma + \Sigma|^{1/2}}$$

over all possible $(\mu, \Sigma)$. Note that this method properly generalises that of (6.1). For $h$ larger than say 5 the procedure is practically the same as ordinary maximum likelihood estimation. A value of perhaps $h = 2$ constitutes a modified maximum likelihood procedure with quite good robustness qualities, without sacrificing much in efficiency under multinormal conditions.

Acknowledgements. This paper has benefited from ongoing joint work on related matters with M.C. Jones, I.R. Harris and A. Basu.

References

Brown, L.D. and Hwang, J.T.G. (1993). How to approximate a histogram by a normal density. American Statistician 47, 251–255.

Hjort, N.L. and Jones, M.C. (1994). Locally parametric nonparametric density estimation. Submitted for publication.

Huber, P.J. (1981). Robust Statistics. Wiley, New York.