Upper Bounds for the I-MSE and max-MSE of Kernel Density Estimators


Authors: Nils Lid Hjort, Nikolai G. Ushakov

Upper Bounds for the I-MSE and max-MSE of Kernel Density Estimators

Nils Lid Hjort^1 and Nikolai G. Ushakov^2,*

December 1999

^1 Department of Mathematics, University of Oslo, Norway
^2 Russian Academy of Sciences, Chernogolovka, Russia

Abstract. The performance of kernel density estimators is usually studied via Taylor expansions and asymptotic approximation arguments, in which the bandwidth parameter tends to zero with increasing sample size. In contrast, this paper focusses directly on the finite-sample situation. Informative upper bounds are derived both for the integrated and the maximal mean squared error function. Results are reached for the traditional case, where the kernel is a probability density function, under various sets of assumptions on the underlying density to be estimated. Results are also derived for the important non-conventional case of the sinc kernel, which is not integrable and also takes negative values. We pin-point ways in which the sinc-based estimator performs better than the conventional kernel estimators. When proving our results we rely on methods related to characteristic and empirical characteristic functions.

Key words: characteristic functions, density estimation, finite-sample performance, max-MSE, sinc kernel, upper bounds

1. Introduction

In this article we derive some rigorous upper bounds for the estimation error of kernel density estimators for finite values of the sample size $n$, in terms of choices of the kernel function $K$ and the bandwidth $h = h_n$. These bounds are by construction non-asymptotic, and are useful when one needs to secure a certain precision of an estimate for a given (finite) value of $n$, for broad classes of densities.
We study both smooth cases (where the density to be estimated is one or more times differentiable) and non-smooth cases (the underlying density function is not supposed to be differentiable or even continuous). The machinery of characteristic and empirical characteristic functions is used, and relevant general results are established in Section 2. In Section 3 conventional kernel estimators will be considered, i.e. estimators whose kernels are probability density functions. These estimators always produce estimates which are densities. We term a kernel density estimator non-conventional if its kernel function is not a probability density, i.e. it may take negative values or/and does not integrate to one (or even is not integrable). Such non-conventional estimators are studied in Section 4, with particular attention to the sinc kernel; see also Glad, Hjort and Ushakov (1999a). Such estimators, based on higher order kernels, superkernels or the sinc kernel, often provide better estimation precision, but have an essential disadvantage: they produce estimates which are not probability density functions, i.e. may take negative values or/and do not integrate to one. However, this defect can be corrected afterwards without loss of their performance properties (see Glad, Hjort and Ushakov, 1999b). A discussion of our results, with a view towards their use in density estimation problems, is given in the final Section 5. Topics there include new strategies for bandwidth selection.

* Partially supported by RFBR Grants 97-01-00273, 98-01-00621 and 98-01-00926, and by INTAS-RFBR Grant IR-97-0537.

2. Auxiliary results, via characteristic functions

In this paper we use the characteristic function approach to studying the performance of density estimators, rather than the traditional Taylor expansions and asymptotic approximations.
Therefore we first express some basic concepts of kernel density estimators in terms of characteristic functions. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables with absolutely continuous distribution function $F(x)$, density function $p(x)$, and characteristic function $f(t)$. The kernel density estimator associated with the sample $X_1, \ldots, X_n$ is defined as
$$p_n(x) = p_{n,h}(x) = \frac{1}{n} \sum_{j=1}^{n} K_h(x - X_j), \qquad (2.1)$$
where $K(x)$ is the kernel function with scaled version $K_h(x) = h^{-1} K(h^{-1} x)$ and $h = h_n$ is a positive number (depending on $n$) called the bandwidth or the smoothing parameter. We do not necessarily demand that $K$ is integrable (sometimes the best estimators correspond to nonintegrable kernels). However, we suppose that $K$ is square integrable, and in addition that it is integrable in the sense of the Cauchy principal value with $\mathrm{v.p.} \int_{-\infty}^{\infty} K(x)\,dx = 1$, in which
$$\mathrm{v.p.} \int_{-\infty}^{\infty} = \lim_{T \to \infty} \lim_{\epsilon \to 0} \Big[ \int_{-T}^{-\epsilon} + \int_{\epsilon}^{T} \Big].$$
Under these assumptions the Fourier transform of $K$ can be defined as
$$\varphi(t) = \mathrm{v.p.} \int_{-\infty}^{\infty} e^{itx} K(x)\,dx$$
(see Chapter 4 of Titchmarsh, 1937). In the following we will omit integration limits when the integral is to be taken over the full real line.

Let $\hat{p}_n$ be an estimator (not necessarily a kernel estimator) of $p$ associated with the sample $X_1, \ldots, X_n$. The bias, the mean squared error (MSE) and the mean integrated squared error (MISE) of $\hat{p}_n$ are defined, respectively, as
$$B_n(\hat{p}_n(x)) = \mathrm{E}\,\hat{p}_n(x) - p(x), \qquad \mathrm{MSE}(\hat{p}_n(x)) = \mathrm{E}\{\hat{p}_n(x) - p(x)\}^2,$$
and
$$\mathrm{MISE}(\hat{p}_n) = \int \mathrm{MSE}(\hat{p}_n(x))\,dx = \mathrm{E} \int \{\hat{p}_n(x) - p(x)\}^2\,dx. \qquad (2.2)$$
In case of the kernel estimator $p_n$, defined by (2.1), the bias may be expressed via the convolution as
$$B_n(p_n(x)) = (K_h \ast p)(x) - p(x) = \int K_h(x - y)\,p(y)\,dy - p(x).$$
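Definition (2.1) is immediate to implement; the following is a minimal sketch, not from the paper (the Gaussian kernel, the simulated sample, and the fixed bandwidth are illustrative choices):

```python
import numpy as np

def kde(x, sample, h, K=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate p_n(x) = (1/n) sum_j K_h(x - X_j),
    with K_h(u) = K(u/h)/h; the default K is the standard normal density."""
    x = np.asarray(x, dtype=float)
    # broadcasting: rows index evaluation points, columns index observations
    u = (x[..., None] - sample) / h
    return K(u).mean(axis=-1) / h

rng = np.random.default_rng(0)
sample = rng.standard_normal(200)
grid = np.linspace(-4, 4, 801)
p_hat = kde(grid, sample, h=0.4)
dx = grid[1] - grid[0]
print(p_hat.sum() * dx)  # close to 1, up to the truncation of the grid
```

Because this $K$ is a probability density, the estimate is itself a density: nonnegative and integrating to one, as the text notes for conventional kernels.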
Since convolution is a kind of smoothing, the bias of the kernel estimator is the difference between a smoothed density and the density itself. The mean squared error admits a well-known decomposition into variance and squared bias, with consequent MISE representation
$$\mathrm{MISE}(\hat{p}_n) = \int B_n^2(\hat{p}_n(x))\,dx + \int \mathrm{Var}(\hat{p}_n(x))\,dx.$$
Note that together with MSE and MISE other measures of deviation may be used. Among them, the mean absolute error $\mathrm{E}|\hat{p}_n(x) - p(x)|$ and its integral are especially important (see Devroye and Györfi, 1985). In the present article we restrict attention to MSE and MISE, however.

For a real valued function $g$ we will use the following notation, provided the integrals exist:
$$\mu_k(g) = \int |x|^k g(x)\,dx \quad \text{and} \quad R(g) = \int g^2(x)\,dx.$$
If the kernel $K$ is a probability density function, and the density to be estimated is twice differentiable and with square integrable second order derivative, then it is well known that the best order of estimation accuracy in terms of MISE is $O(n^{-4/5})$; see also Section 5.1. However, if we permit the kernel not to be a density, then the order can be improved. For example, if $p$ is the normal density and $K$ is the sinc kernel, i.e. $K(x) = \sin x/(\pi x)$, then
$$\min_{h > 0} \mathrm{MISE}(p_n) = O\Big( \frac{\sqrt{\log n}}{n} \Big) \quad \text{as } n \to \infty;$$
see Section 4 below and Glad, Hjort and Ushakov (1999a).

We now express basic characteristics of density estimators in terms of Fourier transforms and establish some auxiliary results. Let $\hat{f}_n$ denote the Fourier transform of an estimator $\hat{p}_n$. Making use of the inversion formula for densities and the Parseval–Plancherel identity we easily obtain the following formulae:
$$B_n(\hat{p}_n(x)) = \frac{1}{2\pi} \int e^{-itx} \{\mathrm{E}\hat{f}_n(t) - f(t)\}\,dt, \qquad (2.3)$$
$$\mathrm{MSE}(\hat{p}_n(x)) = \mathrm{E}\Big\{ \frac{1}{2\pi} \int e^{-itx} [\hat{f}_n(t) - f(t)]\,dt \Big\}^2 = \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x}\, \mathrm{E}\{(\hat{f}_n(u) - f(u))(\hat{f}_n(v) - f(v))\}\,du\,dv, \qquad (2.4)$$
and
$$\mathrm{MISE}(\hat{p}_n) = \frac{1}{2\pi} \int \mathrm{E}|\hat{f}_n(t) - f(t)|^2\,dt. \qquad (2.5)$$

In the remainder of this section we will consider only kernel estimators, and suppose that the kernel $K$ is a probability density function, i.e. it is nonnegative and integrates to one. Study the empirical characteristic function associated with the sample $X_1, \ldots, X_n$,
$$f_n(t) = \frac{1}{n} \sum_{j=1}^{n} e^{itX_j}.$$
The characteristic function of the estimator $p_{n,h}(x)$ is $f_n(t)\varphi(ht)$, where $\varphi(t) = \int e^{itu} K(u)\,du$ is the characteristic function of the kernel. And the kernel estimator (2.1) can be expressed in terms of $f_n$ as
$$p_{n,h}(x) = \frac{1}{2\pi} \int e^{-itx} f_n(t)\,\varphi(ht)\,dt.$$
Now, taking into account that
$$\mathrm{E}\,f_n(u) f_n(v) = (1 - 1/n) f(u) f(v) + (1/n) f(u+v)$$
and
$$\mathrm{E}|f_n(t)|^2 = (1 - 1/n)|f(t)|^2 + 1/n,$$
we can write (2.3)–(2.5) in the form
$$B_n(p_n(x)) = \frac{1}{2\pi} \int e^{-itx} f(t)\{\varphi(ht) - 1\}\,dt, \qquad (2.6)$$
$$\mathrm{MSE}(p_n(x)) = \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x} \Big[ \frac{1}{n}\,\varphi(hu)\varphi(hv) f(u+v) + \Big\{ \Big(1 - \frac{1}{n}\Big) \varphi(hu)\varphi(hv) - 2\varphi(hu) + 1 \Big\} f(u)f(v) \Big]\,du\,dv \qquad (2.7)$$
and
$$\mathrm{MISE}(p_n) = \frac{1}{2\pi} \Big[ \int |f(t)|^2\,|1 - \varphi(ht)|^2\,dt + \frac{1}{n} \int |\varphi(ht)|^2 \{1 - |f(t)|^2\}\,dt \Big]. \qquad (2.8)$$
From (2.6) we immediately obtain
$$|B_n(p_n(x))| \le \frac{1}{2\pi} \int |f(t)|\,|1 - \varphi(ht)|\,dt. \qquad (2.9)$$

Lemma 1. For each $x$,
$$\mathrm{MSE}(p_n(x)) \le \Big\{ \frac{1}{2\pi} \int |f(t)|\,|1 - \varphi(ht)|\,dt \Big\}^2 + \frac{a(x)}{\pi n h} \int |\varphi(t)|\,dt, \qquad (2.10)$$
where $a(x) = (K_h \ast p)(x)$. If $p$ is bounded by $a$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{1}{2\pi} \int |f(t)|\,|1 - \varphi(ht)|\,dt \Big\}^2 + \frac{a}{\pi n h} \int |\varphi(t)|\,dt.$$

Proof. It suffices to prove the first statement, since $a(x) = \int p(x-y) K_h(y)\,dy \le a$ for all $x$.
Making use of relation (2.7), we obtain
$$\mathrm{MSE}(p_n(x)) = \Big[ \frac{1}{2\pi} \int e^{-itx} f(t)\{1 - \varphi(ht)\}\,dt \Big]^2 + \frac{1}{n} \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x} \varphi(hu)\varphi(hv) f(u+v)\,du\,dv - \frac{1}{n} \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x} \varphi(hu)\varphi(hv) f(u)f(v)\,du\,dv.$$
The first term on the right hand side is dominated by the first term of the right hand side of (2.10). Let us then estimate the absolute value of the second (denoted by $T_2$) and third (denoted by $T_3$) terms. We have
$$T_2 = \frac{1}{n} \frac{1}{2\pi} \int \varphi(hu) \Big\{ \frac{1}{2\pi} \int e^{-i(u+v)x} \varphi(hv) f(u+v)\,dv \Big\}\,du.$$
The term in brackets, being transformed to the form
$$\frac{1}{2\pi} \int e^{-itx} \varphi(h(t-u)) f(t)\,dt,$$
is equal to $\int p(x-y) K_h(y) e^{-iuy}\,dy$ (since $\varphi(h(t-u)) f(t)$ is the Fourier transform of the convolution of the functions $p(x)$ and $K_h(x) e^{-iux}$), and clearly
$$\Big| \int p(x-y) K_h(y) e^{-iuy}\,dy \Big| \le a(x).$$
Hence
$$|T_2| \le \frac{a(x)}{n} \frac{1}{2\pi} \int |\varphi(ht)|\,dt.$$
Furthermore,
$$|T_3| = \frac{1}{n} \Big| \frac{1}{2\pi} \int e^{-iux} \varphi(hu) f(u)\,du \cdot \frac{1}{2\pi} \int e^{-ivx} f(v) \varphi(hv)\,dv \Big| \le \frac{1}{n} (K_h \ast p)(x) \cdot \frac{1}{2\pi} \int |f(v)|\,|\varphi(hv)|\,dv \le \frac{a(x)}{2\pi n} \int |\varphi(hv)|\,dv.$$
Thus we finally obtain (2.10).

Lemma 2.
$$\mathrm{MISE}(p_{n,h}) \le \frac{1}{2\pi} \Big\{ \int |f(t)|^2\,|1 - \varphi(ht)|^2\,dt + \frac{1}{nh} \int |\varphi(t)|^2\,dt \Big\}.$$
This lemma immediately follows from relation (2.8).

We conclude this section with some inequalities for characteristic functions which will be used below.

Lemma 3. Let $F$ be a distribution function with characteristic function $f$. If the first order absolute moment $\beta_1 = \int |x|\,dF(x)$ is finite, then
$$|1 - f(t)| \le \beta_1 |t| \quad \text{for all real } t.$$
If $F$ has null expectation and finite variance $\sigma^2$, then
$$|1 - f(t)| \le \tfrac{1}{2} \sigma^2 t^2 \quad \text{for all real } t.$$

Proof. Observe that for any positive integer $n$ and any $x > 0$,
$$\Big| e^{ix} - 1 - \frac{ix}{1!} - \ldots - \frac{(ix)^{n-1}}{(n-1)!} \Big| \le \frac{x^n}{n!} \qquad (2.11)$$
(see for example Feller, 1971, Chapter 15).
The first inequality of the lemma follows quickly via
$$|1 - f(t)| \le \int |1 - e^{itx}|\,dF(x) \le \int |tx|\,dF(x) = \beta_1 |t|.$$
To prove the second inequality, we obtain, again making use of (2.11),
$$|1 - f(t)| = \Big| \int \{e^{itx} - 1\}\,dF(x) \Big| = \Big| \int (e^{itx} - 1 - itx)\,dF(x) \Big| \le \int |e^{itx} - 1 - itx|\,dF(x) \le \int \tfrac{1}{2} t^2 x^2\,dF(x) = \tfrac{1}{2} \sigma^2 t^2.$$
Along the same lines one may prove for example that $|f(t) - (1 - \tfrac{1}{2}\sigma^2 t^2)| \le \tfrac{1}{6} |t|^3 \int |x|^3\,dF(x)$.

Let $g$ be a real-valued function defined on an interval $[a, b]$ of the real line. The total variation of $g$ on $[a, b]$ is defined as
$$V_a^b(g) = \sup \sum_{i=1}^{n} |g(x_i) - g(x_{i-1})|,$$
where the supremum is taken over all $n$ and all collections $x_0, \ldots, x_n$ such that $a = x_0 < \cdots < x_n = b$. The total variation on the whole real line is defined as
$$V_{-\infty}^{\infty}(g) = \lim_{x \to \infty} V_{-x}^{x}(g).$$
In the case $V_{-\infty}^{\infty}(g)$ we omit limits and write $V(g)$. A function $g$ is said to be a function of bounded total variation if $V(g) < \infty$ (or $V_a^b(g) < \infty$ if it is considered on an interval $[a, b]$). Note that if $g$ has an integrable derivative, then $V_a^b(g) = \int_a^b |g'|\,dx$.

Lemma 4. Let $p$ be a probability density and $f$ the corresponding characteristic function. If $p$ is $m - 1$ times differentiable, and $p^{(m-1)}$ is a function of bounded variation, then
$$|f(t)| \le V(p^{(m-1)})/|t|^m \quad \text{for all real } t$$
(by definition, $p^{(0)} = p$). A proof of this lemma is contained in Ushakov and Ushakov (1999).

3. Density estimators with conventional kernels

First we study the 'smooth' case, i.e. when the density to be estimated is one or several times differentiable.

Theorem 1. Let $p$ be twice differentiable, with $p''$ a function of bounded variation, $V(p'') = V_2 < \infty$.
If the kernel $K$ has null expectation, and $h_n = h_0 n^{-1/5}$ ($h_0$ being some constant), then
$$\mathrm{MISE}(p_n) \le \Big\{ \frac{3 \mu_2^2(K) V_2^{5/3} h_0^4}{10\pi} + \frac{R(K)}{h_0} \Big\} n^{-4/5}. \qquad (3.1)$$

Proof. Due to Lemma 4, we have
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_2^{1/3}, \\ V_2/|t|^3 & \text{for } |t| > V_2^{1/3}, \end{cases} \qquad (3.2)$$
and, due to Lemma 3, $|1 - \varphi(h_n t)| \le \tfrac{1}{2}\mu_2(K) h_n^2 t^2$ for all $t$. Hence
$$\int |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt \le \frac{1}{2}\mu_2^2(K) h_n^4 \int_0^{V_2^{1/3}} t^4\,dt + \frac{1}{2}\mu_2^2(K) h_n^4 V_2^2 \int_{V_2^{1/3}}^{\infty} \frac{dt}{t^2} = \frac{3}{5}\,\mu_2^2(K) V_2^{5/3} h_0^4\,n^{-4/5}. \qquad (3.3)$$
Further, using the Parseval–Plancherel identity, we get
$$\frac{1}{nh_n} \frac{1}{2\pi} \int |\varphi(t)|^2\,dt = \frac{1}{h_0} n^{-4/5} \int K^2(x)\,dx = \frac{R(K)}{h_0}\,n^{-4/5}. \qquad (3.4)$$
From (3.3), (3.4) and Lemma 2, we obtain (3.1).

Corollary. Let the conditions of Theorem 1 be satisfied. Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le \Big( \frac{3 \cdot 5^4}{2^9 \pi} \Big)^{1/5} \{\mu_2^2(K) R^4(K)\}^{1/5}\,V_2^{1/3}\,n^{-4/5},$$
with minimum of the upper bound attained for
$$h_n = \Big\{ \frac{5\pi}{6}\,\frac{R(K)}{\mu_2^2(K)} \Big\}^{1/5} V_2^{-1/3}\,n^{-1/5}.$$

If $p$ is only one time differentiable or/and the expectation of $K$ does not equal zero, then results are weaker.

Theorem 2. Let $p$ be differentiable with $p'$ a function of bounded variation, $V(p') = V_1 < \infty$. If $h_n = h_0 n^{-1/3}$, then
$$\mathrm{MISE}(p_n) \le \Big\{ \frac{4}{3\pi}\,\mu_1^2(K) V_1^{3/2} h_0^2 + \frac{R(K)}{h_0} \Big\} n^{-2/3}. \qquad (3.5)$$

Proof. Due to Lemmas 3 and 4,
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_1^{1/2}, \\ V_1/|t|^2 & \text{for } |t| > V_1^{1/2}, \end{cases}$$
and $|1 - \varphi(h_n t)| \le \mu_1(K) h_n |t|$ for all $t$. Hence (see the proof of Theorem 1),
$$\int |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt \le \frac{8}{3}\,\mu_1^2(K) V_1^{3/2} h_0^2\,n^{-2/3}. \qquad (3.6)$$
And, as in the proof of Theorem 1,
$$\frac{1}{2\pi} \frac{1}{nh_n} \int |\varphi(t)|^2\,dt = \frac{R(K)}{h_0}\,n^{-2/3}. \qquad (3.7)$$
From (3.6), (3.7) and Lemma 2, we obtain (3.5).

Corollary. Let the conditions of Theorem 2 be satisfied.
Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le (9/\pi)^{1/3}\,\mu_1^{2/3}(K)\,\sqrt{V_1}\,R(K)^{2/3}\,n^{-2/3}.$$

Theorems 1 and 2 give bounds for the integral deviation of the mean squared error of a kernel estimator from zero. Now we obtain bounds for the sup deviation, in terms of
$$A(K) = \frac{1}{2\pi} \int |\varphi(t)|\,dt.$$

Theorem 3. Let $p$ be three times differentiable with $p'''$ a function of bounded variation, $V(p''') = V_3 < \infty$, and let $p$ be bounded by $a$. If $h_n = h_0 n^{-1/5}$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{4}{9\pi^2}\,\mu_2^2(K) V_3^{3/2} h_0^4 + \frac{2aA(K)}{h_0} \Big\} n^{-4/5}.$$

Proof. Due to Lemma 4,
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_3^{1/4}, \\ V_3/|t|^4 & \text{for } |t| > V_3^{1/4}, \end{cases}$$
and, due to Lemma 3, $|1 - \varphi(h_n t)| \le \tfrac{1}{2}\mu_2(K) h_n^2 t^2$ for all $t$. Hence
$$\int |f(t)|\,|1 - \varphi(h_n t)|\,dt \le \mu_2(K) h_n^2 \Big( \int_0^{V_3^{1/4}} t^2\,dt + V_3 \int_{V_3^{1/4}}^{\infty} \frac{dt}{t^2} \Big) = \frac{4}{3}\,\mu_2(K) V_3^{3/4} h_0^2\,n^{-2/5}.$$
To get the result it suffices now to apply Lemma 1.

Corollary. Let the conditions of Theorem 3 be satisfied. Then for each $n$,
$$\min_{h > 0} \sup_x \mathrm{MSE}(p_n(x)) \le 5(36\pi^2)^{-1/5}\,\mu_2^{2/5}(K) V_3^{3/10} A^{4/5}(K)\,a^{4/5}\,n^{-4/5},$$
with minimum of the upper bound being attained for
$$h_n = \Big\{ \frac{9\pi^2}{8}\,\frac{A(K)}{\mu_2^2(K)} \Big\}^{1/5} a^{1/5} V_3^{-3/10}\,n^{-1/5}.$$

Theorem 4. Let $p$ be twice differentiable with $p''$ a function of bounded variation, and let $p$ be bounded by $a$. If $h_n = h_0 n^{-1/3}$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{9}{4\pi^2}\,\mu_1^2(K) V_2^{4/3} h_0^2 + \frac{2aA(K)}{h_0} \Big\} n^{-2/3}. \qquad (3.8)$$

Proof. Using (3.2) and the first inequality of Lemma 3, we have $|1 - \varphi(h_n t)| \le \mu_1(K) h_n |t|$. This leads to
$$\int |f(t)|\,|1 - \varphi(h_n t)|\,dt \le 2\mu_1(K) h_n \int_0^{V_2^{1/3}} t\,dt + 2V_2\,\mu_1(K) h_n \int_{V_2^{1/3}}^{\infty} \frac{dt}{t^2} = 3\mu_1(K) V_2^{2/3} h_n = 3\mu_1(K) V_2^{2/3} h_0\,n^{-1/3}.$$
Using this estimate and Lemma 1, we get (3.8).

Corollary. Let the conditions of Theorem 4 be satisfied.
Then for each $n$,
$$\min_{h > 0} \sup_x \mathrm{MSE}(p_n(x)) \le 3\Big( \frac{9}{4\pi^2} \Big)^{1/3} \mu_1^{2/3}(K) V_2^{4/9} A^{2/3}(K)\,a^{2/3}\,n^{-2/3}.$$

Next we consider the so-called non-smooth case. This means that the underlying density function is not supposed to be differentiable or even continuous. Some minimum regularity conditions must be introduced, however (otherwise nothing substantial can be derived). Here this minimum condition will be the boundedness of the total variation of the underlying density. Note that this condition is a little less restrictive than those usually assumed when authors work with the non-smooth case (see for example van Eeden, 1985 and van Es, 1997).

Theorem 5. Let the underlying density $p$ be a function of bounded variation, $V = V(p) < \infty$. If $h_n = h_0/(\sqrt{n}\,\log n)$, then
$$\mathrm{MISE}(p_{n,h}) \le \frac{\log^2 n}{\sqrt{n}} \Big[ \frac{4\sqrt{2}}{\pi}\,\max\{\sqrt{\mu_1(K)}, \mu_1(K)\} \max\{V^{3/2}, V^2\} \max\{\sqrt{h_0}, h_0\} + \frac{R(K)}{h_0 \log n} \Big] \qquad (3.9)$$
for all $n \ge 16$.

Proof. Let us use Lemma 2. For the second term in the square brackets, due to the Parseval–Plancherel identity, we have
$$\frac{1}{nh_n} \int |\varphi(t)|^2\,dt = \frac{2\pi}{nh_n} \int K^2(x)\,dx = \frac{2\pi R(K)}{nh_n}. \qquad (3.10)$$
Let us estimate the first term. First we establish the following inequality: for any $0 < \alpha < 1$,
$$|1 - \varphi(t)| \le \mu_1(K)^{\alpha}\,2^{1-\alpha}\,|t|^{\alpha} \qquad (3.11)$$
for all real $t$. Indeed, due to Lemma 3,
$$|1 - \varphi(t)| \le \mu_1(K)|t|. \qquad (3.12)$$
For $|t| \le 2/\mu_1(K)$, the right hand side of (3.11) majorises the right hand side of (3.12), therefore (3.11) holds for these $t$. If $|t| > 2/\mu_1(K)$, then (3.11) is evident because its right hand side exceeds 2. Let $\alpha$ be arbitrary inside $(0, \tfrac{1}{2})$.
Making use of (3.11) and Lemma 4, we get
$$\int |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt = 2\int_0^{V} |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt + 2\int_V^{\infty} |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt$$
$$\le 2\mu_1^{2\alpha}(K)\,2^{2(1-\alpha)} h_n^{2\alpha} \Big( \int_0^V t^{2\alpha}\,dt + V^2 \int_V^{\infty} t^{2\alpha-2}\,dt \Big) = \frac{2^{4-2\alpha}}{1 - 4\alpha^2}\,\mu_1^{2\alpha}(K)\,V^{2\alpha+1} h_n^{2\alpha}.$$
From this estimate and (3.10), using Lemma 2, we obtain
$$\mathrm{MISE}(p_{n,h}) \le \frac{2^{3-2\alpha}}{\pi(1 - 4\alpha^2)}\,\mu_1^{2\alpha}(K) V^{2\alpha+1} h_n^{2\alpha} + \frac{R(K)}{nh_n} = \frac{2^{3-2\alpha}}{\pi(1 - 4\alpha^2)}\,\mu_1^{2\alpha}(K) V^{2\alpha+1} h_0^{2\alpha} \Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha} + \frac{R(K)}{h_0}\,\frac{\log n}{\sqrt{n}} \qquad (3.13)$$
for any $\alpha \in (0, \tfrac{1}{2})$. Put
$$\alpha = \frac{\log n}{2(\log n + 2\log\log n)}.$$
Then $\tfrac{1}{4} < \alpha < \tfrac{1}{2}$ (provided that $n \ge e^e$, which translates into $n \ge 16$), and hence
$$2^{3-2\alpha} < 4\sqrt{2}, \quad \mu_1^{2\alpha}(K) \le \max\{\sqrt{\mu_1(K)}, \mu_1(K)\}, \quad V^{2\alpha+1} \le \max\{V^{3/2}, V^2\}, \quad h_0^{2\alpha} \le \max\{\sqrt{h_0}, h_0\}.$$
Therefore from (3.13) we obtain
$$\mathrm{MISE}(p_{n,h}) \le \frac{4\sqrt{2}}{\pi}\,\max\{\sqrt{\mu_1(K)}, \mu_1(K)\} \max\{V^{3/2}, V^2\} \max\{\sqrt{h_0}, h_0\}\,\frac{1}{1 - 4\alpha^2} \Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha} + \frac{R(K)}{h_0}\,\frac{\log n}{\sqrt{n}}. \qquad (3.14)$$
Putting now
$$\alpha_0 = \frac{\log n - 2\log\log n}{2(\log n + 2\log\log n)},$$
then $\alpha > \alpha_0$ (if $n \ge e^e$), hence
$$\Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha} < \Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha_0} = \frac{\log n}{\sqrt{n}}. \qquad (3.15)$$
It remains to assess the size of $1/(1 - 4\alpha^2)$. We have
$$\frac{1}{1 - 4\alpha^2} = \frac{(\log n + 2\log\log n)^2}{(\log n + 2\log\log n)^2 - (\log n)^2} = \frac{(\log n + 2\log\log n)^2}{(2\log n + 2\log\log n)\,2\log\log n} \le \frac{1}{4}\,\frac{\log n}{\log\log n} + 1 + \frac{\log\log n}{\log n + \log\log n} \le \log n \qquad (3.16)$$
if $n \ge e^e$. From (3.14), (3.15) and (3.16) we finally obtain (3.9).

Corollary. Let $p$ be a unimodal density function, bounded by $a$. If $h_n = h_0/(\sqrt{n}\,\log n)$, then
$$\mathrm{MISE}(p_{n,h}) \le \frac{\log^2 n}{\sqrt{n}} \Big[ \frac{4\sqrt{2}}{\pi}\,\max\{\sqrt{\mu_1(K)}, \mu_1(K)\} \max\{2\sqrt{2}\,a^{3/2}, 4a^2\} \max\{\sqrt{h_0}, h_0\} + \frac{R(K)}{h_0 \log n} \Big].$$

4. The sinc kernel density estimator

The sinc kernel is the function
$$K(x) = \frac{\sin x}{\pi x}$$
with the Fourier transform (defined as the principal value of the corresponding integral)
$$\varphi(t) = \begin{cases} 1 & \text{for } |t| \le 1, \\ 0 & \text{for } |t| > 1. \end{cases}$$
(Sometimes the sinc kernel is defined as $K(x) = \sin(\pi x)/(\pi x)$ with the Fourier transform $\varphi(t) = I\{|t| \le \pi\}$. Both functions $\sin x/(\pi x)$ and $\sin(\pi x)/(\pi x)$ integrate to one in the sense of the principal value, and the difference is only in the scale parameter.)

From now on we focus on the kernel estimator $p_n(x)$ of (2.1) with $K$ being the sinc kernel. It often leads to better performance, and some of its properties are in fact easier to study than for other kernel estimators; see Glad, Hjort and Ushakov (1999a). Its defects (possible negativeness and nonintegrability) can easily be corrected by a certain modification procedure (Glad, Hjort and Ushakov, 1999b). It consists in setting $\bar{p}_n(x) = \max\{p_n(x) - \xi, 0\}$, where the random $\xi$ is chosen so that the integral is 1. After this correction procedure, estimation precision of the estimator is guaranteed to improve.

In terms of the empirical characteristic function $f_n(t)$ the sinc estimator can be expressed as
$$p_n(x) = \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} f_n(t)\,dt. \qquad (4.1)$$
Suppose that the characteristic function $f$ of the underlying density $p$ is integrable. First we obtain relations for the sinc estimator, analogous to those of Lemmas 1 and 2 (these cannot be applied directly since now $K$ is not integrable).

Lemma 5. For the sinc kernel estimator,
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{1}{2\pi} \int_{|t| \ge 1/h_n} |f(t)|\,dt \Big\}^2 + \frac{2}{\pi n h_n}\,\frac{1}{2\pi} \int |f(t)|\,dt \qquad (4.2)$$
and
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi} \Big\{ \int_{|t| \ge 1/h_n} |f(t)|^2\,dt + \frac{2}{nh_n} \Big\}. \qquad (4.3)$$

Proof. We first prove the first inequality.
We have
$$\mathrm{MSE}(p_n(x)) = \mathrm{E}\Big[ \frac{1}{2\pi} \Big\{ \int e^{-itx} f(t)\,dt - \int_{-1/h_n}^{1/h_n} e^{-itx} f_n(t)\,dt \Big\} \Big]^2 = \mathrm{E}\Big[ \frac{1}{2\pi} \int_{|t| \ge 1/h_n} e^{-itx} f(t)\,dt + \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} \{f(t) - f_n(t)\}\,dt \Big]^2$$
$$= \Big\{ \frac{1}{2\pi} \int_{|t| \ge 1/h_n} e^{-itx} f(t)\,dt \Big\}^2 + \mathrm{E}\Big[ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} \{f(t) - f_n(t)\}\,dt \Big]^2.$$
Let us estimate the second term on the right hand side. Denote it by $T_2$. Taking into account that
$$\mathrm{E}\,f_n(u) f_n(v) = (1 - 1/n) f(u) f(v) + (1/n) f(u+v),$$
we obtain
$$T_2 = \frac{1}{n} \frac{1}{(2\pi)^2} \int_{-1/h_n}^{1/h_n} \int_{-1/h_n}^{1/h_n} e^{-i(u+v)x} \{f(u+v) - f(u) f(v)\}\,du\,dv$$
$$= \frac{1}{2\pi n} \int_{-1/h_n}^{1/h_n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-i(u+v)x} f(u+v)\,du \Big\}\,dv - \frac{1}{n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} f(t)\,dt \Big\}^2$$
$$= \frac{1}{2\pi n} \int_{-1/h_n}^{1/h_n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n + v}^{1/h_n + v} e^{-itx} f(t)\,dt \Big\}\,dv - \frac{1}{n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} f(t)\,dt \Big\}^2.$$
Therefore
$$T_2 \le \frac{1}{2\pi n} \int_{-1/h_n}^{1/h_n} dv\,\frac{1}{2\pi} \int |f(t)|\,dt + \frac{1}{n}\,\frac{1}{2\pi} \int |f(t)|\,dt\,\frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} ds = \frac{2}{\pi n h_n}\,\frac{1}{2\pi} \int |f(t)|\,dt.$$
Thus we obtain (4.2). Next we prove (4.3). Observe that we may use relation (2.5) with
$$\hat{f}_n(t) = \begin{cases} f_n(t) & \text{if } |t| \le 1/h_n, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore
$$\mathrm{MISE}(p_n) = \frac{1}{2\pi} \Big\{ \int_{-1/h_n}^{1/h_n} \mathrm{E}|f_n(t) - f(t)|^2\,dt + \int_{|t| \ge 1/h_n} |f(t)|^2\,dt \Big\},$$
and it suffices to show that
$$\int_{-1/h_n}^{1/h_n} \mathrm{E}|f_n(t) - f(t)|^2\,dt \le \frac{2}{nh_n}.$$
Taking now into account that $\mathrm{E}|f_n(t)|^2 = (1 - 1/n)|f(t)|^2 + 1/n$, we obtain
$$\int_{-1/h_n}^{1/h_n} \mathrm{E}|f_n(t) - f(t)|^2\,dt = \frac{1}{n} \int_{-1/h_n}^{1/h_n} (1 - |f(t)|^2)\,dt \le \frac{1}{n} \int_{-1/h_n}^{1/h_n} dt = \frac{2}{nh_n}.$$
This proves the claim.

Now we derive some estimates for the MISE and MSE of the sinc estimator in terms of the degree of smoothness of the underlying density.
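Representation (4.1) makes the sinc estimator directly computable from the empirical characteristic function; the following is a minimal numerical sketch, not from the paper (the simulated sample, the bandwidth, and the integration grid are illustrative choices):

```python
import numpy as np

def sinc_kde(x, sample, h, num_t=2001):
    """Sinc-kernel estimate via (4.1):
    p_n(x) = (1/2π) ∫_{|t| <= 1/h} e^{-itx} f_n(t) dt,
    with f_n(t) = (1/n) sum_j exp(i t X_j) the empirical characteristic function."""
    t = np.linspace(-1.0 / h, 1.0 / h, num_t)
    f_n = np.exp(1j * t[:, None] * sample).mean(axis=1)   # ECF on the t-grid
    integrand = np.exp(-1j * np.outer(np.asarray(x, float), t)) * f_n
    dt = t[1] - t[0]
    # Riemann sum; the imaginary part vanishes up to discretisation error
    return (integrand.sum(axis=1) * dt).real / (2 * np.pi)

rng = np.random.default_rng(1)
sample = rng.standard_normal(400)
grid = np.linspace(-3, 3, 121)
p_hat = sinc_kde(grid, sample, h=0.5)
# unlike a conventional kernel estimate, p_hat may dip below zero
print(p_hat[60])  # value at x = 0, close to the true density there
```

Note that, in line with the discussion above, this estimate need not be nonnegative nor integrate to one; the truncation procedure of Glad, Hjort and Ushakov (1999b) would repair this.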
First we consider the non-smooth case, when the density to be estimated is not supposed to be differentiable or even continuous.

Theorem 6. Let $p$ have bounded variation, $V(p) = V < \infty$, and let $p_n$ be the sinc estimator. If $h_n = h_0/\sqrt{n}$, then
$$\mathrm{MISE}(p_n) \le \frac{1}{\pi\sqrt{n}} \Big( V^2 h_0 + \frac{1}{h_0} \Big).$$

Proof. Making use of relation (4.3) of Lemma 5 and Lemma 4, we obtain
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi} \Big\{ \int_{|t| \ge 1/h_n} |f(t)|^2\,dt + \frac{2}{nh_n} \Big\} \le \frac{1}{2\pi} \Big( 2V^2 \int_{1/h_n}^{\infty} \frac{dt}{t^2} + \frac{2}{nh_n} \Big) = \frac{1}{\pi\sqrt{n}} \Big( V^2 h_0 + \frac{1}{h_0} \Big),$$
as required.

Corollary 1. Let the conditions of Theorem 6 be satisfied. Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le \frac{2V}{\pi\sqrt{n}}.$$

Corollary 2. Let $p$ be a unimodal density function, and let $p_n$ be the sinc estimator. If $p$ is bounded by $a$, and $h_n = h_0/\sqrt{n}$, then
$$\mathrm{MISE}(p_n) \le \frac{1}{\pi\sqrt{n}} \Big( 4a^2 h_0 + \frac{1}{h_0} \Big), \quad \text{and} \quad \min_{h > 0} \mathrm{MISE}(p_n) \le \frac{4a}{\pi\sqrt{n}}.$$

Now consider the case when the density to be estimated is $m$ times differentiable, $m \ge 1$. It will be shown that in this case the upper bound for the MISE of the sinc estimator has order $n^{-2m/(2m+1)}$, which in principle cannot be achieved (for $m > 2$) for kernel estimators with kernels being density functions.

Theorem 7. Let $p$ be $m$ times differentiable with $p^{(m)}$ a function of bounded variation, $V(p^{(m)}) = V_m < \infty$. If $p_n$ is the sinc estimator, and $h_n = h_0 n^{-1/(2m+1)}$, then
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi} \Big\{ \frac{4(m+1)}{2m+1}\,V_m^{(2m+1)/(m+1)} h_0^{2m} + \frac{2}{h_0} \Big\} n^{-2m/(2m+1)}. \qquad (4.4)$$

Proof. We have
$$\int_{|t| \ge 1/h_n} |f(t)|^2\,dt = h_n^{2m} \int_{|t| \ge 1/h_n} \Big( \frac{1}{h_n} \Big)^{2m} |f(t)|^2\,dt \le h_n^{2m} \int_{|t| \ge 1/h_n} |t|^{2m} |f(t)|^2\,dt \le h_n^{2m} \int |t|^{2m} |f(t)|^2\,dt. \qquad (4.5)$$
Let us estimate the integral on the right hand side, making use of Lemma 4.
We have
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_m^{1/(m+1)}, \\ V_m/|t|^{m+1} & \text{for } |t| > V_m^{1/(m+1)}, \end{cases}$$
therefore
$$\int |t|^{2m} |f(t)|^2\,dt \le 2\int_0^{V_m^{1/(m+1)}} t^{2m}\,dt + 2V_m^2 \int_{V_m^{1/(m+1)}}^{\infty} \frac{dt}{t^2} = \frac{4(m+1)}{2m+1}\,V_m^{(2m+1)/(m+1)}. \qquad (4.6)$$
Thus, from inequality (4.3) of Lemma 5, and relations (4.5) and (4.6), we obtain (4.4).

Corollary. Let the conditions of Theorem 7 be satisfied. Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le \frac{1}{2\pi} \{4(m+1)\}^{1/(2m+1)} \Big( \frac{2m+1}{m} \Big)^{2m/(2m+1)} V_m^{1/(m+1)}\,n^{-2m/(2m+1)}.$$

Theorem 8. Let $p$ be $m$ times differentiable, with $p^{(m)}$ a function of bounded variation, $V(p^{(m)}) = V_m < \infty$. If $p_n$ is the sinc estimator, and $h_n = h_0 n^{-1/(2m-1)}$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{1}{\pi^2} \Big\{ \frac{(m+1)^2}{m^2}\,V_m^{2m/(m+1)} h_0^{2(m-1)} + 2\Big( 1 + \frac{1}{m} \Big) V_m^{1/(m+1)}\,\frac{1}{h_0} \Big\} n^{-2(m-1)/(2m-1)}.$$
The proof of this theorem is analogous to that of Theorem 7; one just needs to use relation (4.2) of Lemma 5 instead of relation (4.3) and take into account that, due to Lemma 4,
$$A(p) = \frac{1}{2\pi} \int |f(t)|\,dt \le \frac{1}{2\pi} \int_{-V_m^{1/(m+1)}}^{V_m^{1/(m+1)}} dt + \frac{1}{2\pi} \int_{|t| > V_m^{1/(m+1)}} \frac{V_m\,dt}{|t|^{m+1}} = \frac{1}{\pi} \Big( 1 + \frac{1}{m} \Big) V_m^{1/(m+1)}.$$

Corollary. Let the conditions of Theorem 8 be satisfied. Then for each $n$,
$$\min_{h > 0} \sup_x \mathrm{MSE}(p_n(x)) \le \frac{2m-1}{\pi^2} \Big( \frac{m+1}{m} \Big)^{2/(2m-1)} \Big\{ \frac{m+1}{m(m-1)} \Big\}^{(2m-2)/(2m-1)} V_m^{2/(m+1)}\,n^{-2(m-1)/(2m-1)}.$$

Now we proceed to the 'supersmooth' case, which we define in terms of characteristic functions (although this class of distributions can be defined in terms of density functions as well, a description in terms of characteristic functions is simpler, more natural and more convenient for our purposes).
A distribution $F$ with characteristic function $f(t)$ is said to be supersmooth if for some $\alpha > 0$ and $\gamma > 0$,
$$B(p; \alpha, \gamma) = \int e^{\gamma|t|^{\alpha}} |f(t)|\,dt < \infty.$$
Thus a normal density is supersmooth with $\alpha = 2$ while a Cauchy is supersmooth with $\alpha = 1$, for example.

Theorem 9. Let the characteristic function $f$ of $p$ have a finite $B(p; \alpha, \gamma)$ value, for some positive $\alpha$ and $\gamma$. If $p_n$ is the sinc estimator, and
$$h_n = \Big\{ \frac{1}{\gamma} \log(h_0 n) \Big\}^{-1/\alpha}, \quad \text{with } h_0 \ge \frac{1}{n},$$
then
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi n} \Big\{ \frac{2}{\gamma^{1/\alpha}} (\log h_0 + \log n)^{1/\alpha} + \frac{B(p; \alpha, \gamma)}{h_0} \Big\}. \qquad (4.7)$$

Proof. We have
$$\int_{|t| \ge 1/h_n} |f(t)|^2\,dt \le \int_{|t| \ge 1/h_n} |f(t)|\,dt = e^{-\gamma/h_n^{\alpha}} \int_{|t| \ge 1/h_n} e^{\gamma/h_n^{\alpha}} |f(t)|\,dt \le e^{-\gamma/h_n^{\alpha}} \int e^{\gamma|t|^{\alpha}} |f(t)|\,dt = \frac{B(p; \alpha, \gamma)}{nh_0}.$$
Using this estimate and inequality (4.3) of Lemma 5, we obtain (4.7).

Theorem 10. Let the conditions of Theorem 9 be satisfied, and let again $A(p) = (2\pi)^{-1} \int |f(t)|\,dt$. Then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{1}{n} \Big\{ \frac{2A(p)}{\pi \gamma^{1/\alpha}} (\log h_0 + \log n)^{1/\alpha} + \frac{B^2(p; \alpha, \gamma)}{4\pi^2 n h_0^2} \Big\}.$$
The proof of the theorem is similar to that of Theorem 9 (inequality (4.2) is used instead of (4.3)).

Theorems 9 and 10 can be improved for one subclass of supersmooth densities. The result is given by the next theorem and is quite curious. Note that this theorem corresponds to a special case of a result by Ibragimov and Khas'minskii (1982), and we give it here for completeness.

Theorem 11. Let the characteristic function $f$ of $p$ satisfy the condition: there exists $\tau > 0$ such that $f(t) = 0$ for $|t| > \tau$. If $p_n$ is the sinc estimator, and $h_n \le 1/\tau$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{2A(p)}{\pi n h_n} \le \frac{2\tau}{\pi^2 n h_n}, \quad \text{and} \quad \mathrm{MISE}(p_n) \le \frac{1}{\pi n h_n}.$$
In particular, if $h_n = \text{const} = 1/\tau$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{2\tau^2}{\pi^2 n} \quad \text{and} \quad \mathrm{MISE}(p_n) \le \frac{\tau}{\pi n}.$$
A proof of the theorem can be immediately obtained from inequalities (4.2) and (4.3) of Lemma 5: the integrals on the right hand sides of these vanish when $h_n \le 1/\tau$. Theorem 11 implies in particular that if the characteristic function of the underlying distribution vanishes for large values of the argument, and one uses the sinc estimator for approximation, then $p_n$ converges to $p$ as $n \to \infty$ even when $h_n$ does not converge to zero.

5. Discussion and applications

This article has provided upper bounds for both the traditional MISE and also the less worked-with max-MSE performance measures of kernel density estimators. A list of such upper bounds has been provided, under various sets of assumptions, for both the traditional kernels as well as for the sinc kernel, which has particularly attractive features. Our finite-sample results have been reached entirely outside the customary framework of asymptotics, Taylor expansions and small bandwidths, through the extensive use of characteristic and empirical characteristic functions. Below we give some concluding remarks, pointing to ways in which the results can be applied in statistics.

5.1. Rule-of-thumb bandwidths for MISE and max-MSE. Consider kernel estimators with a traditional kernel $K$, a symmetric density. The traditional large-sample approximations lead to an asymptotically optimal bandwidth of size
$$h_n = \{R(K)/\mu_2^2(K)\}^{1/5} R(p'')^{-1/5}\,n^{-1/5},$$
and with consequent minimum approximate MISE of size
$$\min_{h > 0} \mathrm{AMISE}(p_n) = (5/4)\{\mu_2^2(K) R^4(K)\}^{1/5} R(p'')^{1/5}\,n^{-4/5};$$
see for example Wand and Jones (1995). When $K$ is standard normal, and $p$ is a normal density with standard deviation $\sigma$, this leads to the popular 'normal rule-of-thumb' bandwidth $h_n = 1.0592\,\sigma n^{-1/5}$.
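The constant in this rule is easy to verify numerically; a small sketch, not from the paper, computing $\{R(K)/\mu_2^2(K)\}^{1/5} R(p'')^{-1/5}$ for the standard normal kernel and a normal target with $\sigma = 1$ (the closed-form values of $R(K)$ and $R(p'')$ are standard facts, checked here by quadrature):

```python
import numpy as np
from math import pi, sqrt

# Standard normal kernel: R(K) = 1/(2 sqrt(pi)), mu_2(K) = 1.
R_K = 1.0 / (2.0 * sqrt(pi))
mu2_K = 1.0

# For p the N(0,1) density, R(p'') = 3/(8 sqrt(pi)); cross-check numerically
# using p''(x) = (x^2 - 1) phi(x).
R_p2 = 3.0 / (8.0 * sqrt(pi))
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / sqrt(2 * pi)
num = (((x**2 - 1) * phi) ** 2).sum() * dx

c = (R_K / mu2_K**2) ** 0.2 * R_p2 ** (-0.2)
print(round(c, 4))  # 1.0592, i.e. h_n = 1.0592 sigma n^{-1/5}
```

The constant reduces to $(4/3)^{1/5}$, since $R(K)/R(p'') = 4/3$ for this pair.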
Note that the structure of these classic results is very similar to that seen in Theorem 1 and its corollary; in particular, the well-known large-sample result about the $n^{-4/5}$ precision rate is here reached entirely without asymptotics machinery or approximations. It is interesting to compare the above with what one finds using the upper bounds. For a normal density, $V_2 = \int |p'''|\,dx$ is found to be the scale factor $\sigma^{-3}$ times $2(2\pi)^{-1/2}\{1 + 4\exp(-3/2)\} = 1.5100$, and this leads via Theorem 1 to the rule $h_n = 0.8204\,\sigma n^{-1/5}$. This has been calculated using upper bound results derived under minimal assumptions, which hence do not pretend to be very accurate for smooth densities like the normal. It is comforting to see that only a moderate amount is lost in precision, in this very smooth case, since the ratio of the minimised upper bound to the minimised asymptotic MISE is found to be 1.2911.

The max-MSE criterion is a natural venue, seemingly not travelled before. It is difficult to reach applicable results for this criterion based on the traditional approximations. However, Theorem 3 and its corollary provide ways of bounding the max-MSE when there is information on $V_3 = \int |p^{(4)}|\,dx$. If $p$ is normal, then $V_3$ can be shown to be the scale factor $\sigma^{-4}$ times
$$4\{(3b - b^3)\phi(b) - (3c - c^3)\phi(c)\} = 2.8006,$$
where $b = (3 - \sqrt{6})^{1/2}$ and $c = (3 + \sqrt{6})^{1/2}$, and $\phi$ is the standard normal density. The normal rule-of-thumb, when the normal kernel is used, becomes $h_n = 1.1883\,\sigma n^{-1/5}$, which again is not far from the traditional rule-of-thumb.

We also point out that some of these results may be sharpened under further assumed constraints on the underlying density. The quite crude bound (3.2) has for example been used for $|f(t)|$, which could be bounded more effectively under such additional restrictions. This is not pursued here, however.

5.2. Cross-validation and normal rule-of-thumb in new light. Results reached in this article, about MISE and upper bounds expressed in terms of characteristic functions, point to new ways in which bandwidths can be selected from data. Expression (2.8) and its upper bound given in Lemma 2 depend on $q(t) = |f(t)|^2$, but not on other aspects of the underlying density $p$. A suitable estimate $\hat{q}(t)$ may now be inserted in these expressions, after which one may minimise over the smoothing parameter $h$. For the normal kernel, this could for example mean minimising
$$Q_n(h) = \frac{1}{2\pi} \int \hat{q}(t) \{1 - \exp(-\tfrac{1}{2} h^2 t^2)\}^2\,dt + \frac{1}{2\pi}\,\frac{1}{n} \int \exp(-h^2 t^2)\,dt$$
over $h$, after having selected an estimator $\hat{q}(t)$. Interestingly it turns out that this scheme reproduces the well-known 'unbiased cross-validation' rule, see for example Wand and Jones (1995), when one employs the natural nonparametric unbiased estimator
$$\hat{q}(t) = \frac{1}{n(n-1)} \sum_{j \ne k} \exp(it(X_j - X_k)) = \frac{2}{n(n-1)} \sum_{j < k} \cos(t(X_j - X_k)).$$
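This bandwidth selector is straightforward to implement; the following is a minimal sketch, not from the paper (the integration grid, the bandwidth grid, and minimisation by scan are illustrative choices), assuming the standard normal kernel:

```python
import numpy as np

def q_hat(t, sample):
    """Unbiased estimate of q(t) = |f(t)|^2:
    (2/(n(n-1))) * sum_{j<k} cos(t (X_j - X_k))."""
    n = len(sample)
    iu = np.triu_indices(n, k=1)
    d = (sample[:, None] - sample[None, :])[iu]   # pairwise differences, j < k
    return 2.0 * np.cos(np.outer(t, d)).sum(axis=1) / (n * (n - 1))

def Q_n(h, t, qh, n):
    """Criterion (1/2π) ∫ q̂(t){1 - e^{-h²t²/2}}² dt + (1/2πn) ∫ e^{-h²t²} dt,
    the normal-kernel form of (2.8) with q̂ plugged in."""
    dt = t[1] - t[0]
    bias2 = (qh * (1.0 - np.exp(-0.5 * h**2 * t**2)) ** 2).sum() * dt
    var = np.exp(-(h * t) ** 2).sum() * dt / n
    return (bias2 + var) / (2 * np.pi)

rng = np.random.default_rng(2)
sample = rng.standard_normal(200)
t = np.linspace(-20, 20, 401)
qh = q_hat(t, sample)
hs = np.linspace(0.05, 1.5, 60)
h_best = hs[np.argmin([Q_n(h, t, qh, len(sample)) for h in hs])]
print(h_best)  # comparable to the rule-of-thumb 1.0592 * 200**(-1/5) ≈ 0.37
```

Since $\hat q$ is unbiased for $|f|^2$, minimising $Q_n$ is (up to the upper-bound form of the variance term) the unbiased cross-validation rule mentioned above.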
