Upper Bounds for the I-MSE and max-MSE of Kernel Density Estimators


Authors: Nils Lid Hjort, Nikolai G. Ushakov

Upper Bounds for the I-MSE and max-MSE of Kernel Density Estimators

Nils Lid Hjort^1 and Nikolai G. Ushakov^2,*

December 1999

^1 Department of Mathematics, University of Oslo, Norway
^2 Russian Academy of Sciences, Chernogolovka, Russia

Abstract. The performance of kernel density estimators is usually studied via Taylor expansions and asymptotic approximation arguments, in which the bandwidth parameter tends to zero with increasing sample size. In contrast, this paper focusses directly on the finite-sample situation. Informative upper bounds are derived both for the integrated and the maximal mean squared error function. Results are reached for the traditional case, where the kernel is a probability density function, under various sets of assumptions on the underlying density to be estimated. Results are also derived for the important non-conventional case of the sinc kernel, which is not integrable and also takes negative values. We pin-point ways in which the sinc-based estimator performs better than the conventional kernel estimators. When proving our results we rely on methods related to characteristic and empirical characteristic functions.

Key words: characteristic functions, density estimation, finite-sample performance, max-MSE, sinc kernel, upper bounds

1. Introduction

In this article we derive some rigorous upper bounds for the estimation error of kernel density estimators for finite values of the sample size $n$, in terms of choices of the kernel function $K$ and the bandwidth $h = h_n$. These bounds are by construction non-asymptotic, and are useful when one needs to secure a certain precision of an estimate for a given (finite) value of $n$, for broad classes of densities.
We study both smooth cases (where the density to be estimated is one or more times differentiable) and non-smooth cases (the underlying density function is not supposed to be differentiable or even continuous). The machinery of characteristic and empirical characteristic functions is used, and relevant general results are established in Section 2. In Section 3 conventional kernel estimators will be considered, i.e. estimators whose kernels are probability density functions. These estimators always produce estimates which are densities. We term a kernel density estimator non-conventional if its kernel function is not a probability density, i.e. it may take negative values or/and does not integrate to one (or even is not integrable). Such non-conventional estimators are studied in Section 4, with particular attention to the sinc kernel; see also Glad, Hjort and Ushakov (1999a). Such estimators, based on higher order kernels, superkernels or the sinc kernel, often provide better estimation precision, but have an essential disadvantage: they produce estimates which are not probability density functions, i.e. may take negative values or/and do not integrate to one. However, this defect can be corrected afterwards without loss of their performance properties (see Glad, Hjort and Ushakov, 1999b). A discussion of our results, with a view towards their use in density estimation problems, is given in the final Section 5. Topics there include new strategies for bandwidth selection.

* Partially supported by RFBR Grants 97-01-00273, 98-01-00621 and 98-01-00926, and by INTAS-RFBR Grant IR-97-0537.

2. Auxiliary results, via characteristic functions

In this paper we use the characteristic function approach to studying the performance of density estimators, rather than the traditional Taylor expansions and asymptotic approximations.
Therefore we first express some basic concepts of kernel density estimators in terms of characteristic functions. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables with absolutely continuous distribution function $F(x)$, density function $p(x)$, and characteristic function $f(t)$. The kernel density estimator associated with the sample $X_1, \ldots, X_n$ is defined as
$$p_n(x) = p_{n,h}(x) = \frac{1}{n} \sum_{j=1}^{n} K_h(x - X_j), \qquad (2.1)$$
where $K(x)$ is the kernel function with scaled version $K_h(x) = h^{-1} K(h^{-1} x)$ and $h = h_n$ is a positive number (depending on $n$) called the bandwidth or the smoothing parameter. We do not necessarily demand that $K$ is integrable (sometimes the best estimators correspond to nonintegrable kernels). However, we suppose that $K$ is square integrable, and in addition that it is integrable in the sense of the Cauchy principal value with $\mathrm{v.p.} \int_{-\infty}^{\infty} K(x)\,dx = 1$, in which
$$\mathrm{v.p.} \int_{-\infty}^{\infty} = \lim_{T \to \infty} \lim_{\epsilon \to 0} \Big[ \int_{-T}^{-\epsilon} + \int_{\epsilon}^{T} \Big].$$
Under these assumptions the Fourier transform of $K$ can be defined as
$$\varphi(t) = \mathrm{v.p.} \int_{-\infty}^{\infty} e^{itx} K(x)\,dx$$
(see Chapter 4 of Titchmarsh, 1937). In the following we will omit integration limits when the integral is to be taken over the full real line.

Let $\hat{p}_n$ be an estimator (not necessarily a kernel estimator) of $p$ associated with the sample $X_1, \ldots, X_n$. The bias, the mean squared error (MSE) and the mean integrated squared error (MISE) of $\hat{p}_n$ are defined, respectively, as
$$B_n(\hat{p}_n(x)) = \mathrm{E}\,\hat{p}_n(x) - p(x), \qquad \mathrm{MSE}(\hat{p}_n(x)) = \mathrm{E}\{\hat{p}_n(x) - p(x)\}^2,$$
and
$$\mathrm{MISE}(\hat{p}_n) = \int \mathrm{MSE}(\hat{p}_n(x))\,dx = \mathrm{E} \int \{\hat{p}_n(x) - p(x)\}^2\,dx. \qquad (2.2)$$
In case of the kernel estimator $p_n$, defined by (2.1), the bias may be expressed via the convolution as
$$B_n(p_n(x)) = (K_h \ast p)(x) - p(x) = \int K_h(x - y)\,p(y)\,dy - p(x).$$
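Definition (2.1) is immediate to implement; the following is a minimal sketch, not from the paper (the Gaussian kernel, the simulated sample, and the fixed bandwidth are illustrative choices):

```python
import numpy as np

def kde(x, sample, h, K=lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)):
    """Kernel density estimate p_n(x) = (1/n) sum_j K_h(x - X_j),
    with K_h(u) = K(u/h)/h; the default K is the standard normal density."""
    x = np.asarray(x, dtype=float)
    # broadcasting: rows index evaluation points, columns index observations
    u = (x[..., None] - sample) / h
    return K(u).mean(axis=-1) / h

rng = np.random.default_rng(0)
sample = rng.standard_normal(200)
grid = np.linspace(-4, 4, 801)
p_hat = kde(grid, sample, h=0.4)
dx = grid[1] - grid[0]
print(p_hat.sum() * dx)  # close to 1, up to the truncation of the grid
```

Because this $K$ is a probability density, the estimate is itself a density: nonnegative and integrating to one, as the text notes for conventional kernels.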
Since convolution is a kind of smoothing, the bias of the kernel estimator is the difference between a smoothed density and the density itself. The mean squared error admits a well-known decomposition into variance and squared bias, with consequent MISE representation
$$\mathrm{MISE}(\hat{p}_n) = \int B_n^2(\hat{p}_n(x))\,dx + \int \mathrm{Var}(\hat{p}_n(x))\,dx.$$
Note that together with MSE and MISE other measures of deviation may be used. Among them, the mean absolute error $\mathrm{E}|\hat{p}_n(x) - p(x)|$ and its integral are especially important (see Devroye and Györfi, 1985). In the present article we restrict attention to MSE and MISE, however.

For a real valued function $g$ we will use the following notation, provided the integrals exist:
$$\mu_k(g) = \int |x|^k g(x)\,dx \quad \text{and} \quad R(g) = \int g^2(x)\,dx.$$
If the kernel $K$ is a probability density function, and the density to be estimated is twice differentiable and with square integrable second order derivative, then it is well known that the best order of estimation accuracy in terms of MISE is $O(n^{-4/5})$; see also Section 5.1. However, if we permit the kernel not to be a density, then the order can be improved. For example, if $p$ is the normal density and $K$ is the sinc kernel, i.e. $K(x) = \sin x/(\pi x)$, then
$$\min_{h > 0} \mathrm{MISE}(p_n) = O\Big( \frac{\sqrt{\log n}}{n} \Big) \quad \text{as } n \to \infty;$$
see Section 4 below and Glad, Hjort and Ushakov (1999a).

We now express basic characteristics of density estimators in terms of Fourier transforms and establish some auxiliary results. Let $\hat{f}_n$ denote the Fourier transform of an estimator $\hat{p}_n$. Making use of the inversion formula for densities and the Parseval–Plancherel identity we easily obtain the following formulae:
$$B_n(\hat{p}_n(x)) = \frac{1}{2\pi} \int e^{-itx} \{\mathrm{E}\hat{f}_n(t) - f(t)\}\,dt, \qquad (2.3)$$
$$\mathrm{MSE}(\hat{p}_n(x)) = \mathrm{E}\Big\{ \frac{1}{2\pi} \int e^{-itx} [\hat{f}_n(t) - f(t)]\,dt \Big\}^2 = \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x}\, \mathrm{E}\{(\hat{f}_n(u) - f(u))(\hat{f}_n(v) - f(v))\}\,du\,dv, \qquad (2.4)$$
and
$$\mathrm{MISE}(\hat{p}_n) = \frac{1}{2\pi} \int \mathrm{E}|\hat{f}_n(t) - f(t)|^2\,dt. \qquad (2.5)$$

In the remainder of this section we will consider only kernel estimators, and suppose that the kernel $K$ is a probability density function, i.e. it is nonnegative and integrates to one. Study the empirical characteristic function associated with the sample $X_1, \ldots, X_n$,
$$f_n(t) = \frac{1}{n} \sum_{j=1}^{n} e^{itX_j}.$$
The characteristic function of the estimator $p_{n,h}(x)$ is $f_n(t)\varphi(ht)$, where $\varphi(t) = \int e^{itu} K(u)\,du$ is the characteristic function of the kernel. And the kernel estimator (2.1) can be expressed in terms of $f_n$ as
$$p_{n,h}(x) = \frac{1}{2\pi} \int e^{-itx} f_n(t)\,\varphi(ht)\,dt.$$
Now, taking into account that
$$\mathrm{E}\,f_n(u) f_n(v) = (1 - 1/n) f(u) f(v) + (1/n) f(u+v)$$
and
$$\mathrm{E}|f_n(t)|^2 = (1 - 1/n)|f(t)|^2 + 1/n,$$
we can write (2.3)–(2.5) in the form
$$B_n(p_n(x)) = \frac{1}{2\pi} \int e^{-itx} f(t)\{\varphi(ht) - 1\}\,dt, \qquad (2.6)$$
$$\mathrm{MSE}(p_n(x)) = \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x} \Big[ \frac{1}{n}\,\varphi(hu)\varphi(hv) f(u+v) + \Big\{ \Big(1 - \frac{1}{n}\Big) \varphi(hu)\varphi(hv) - 2\varphi(hu) + 1 \Big\} f(u)f(v) \Big]\,du\,dv \qquad (2.7)$$
and
$$\mathrm{MISE}(p_n) = \frac{1}{2\pi} \Big[ \int |f(t)|^2\,|1 - \varphi(ht)|^2\,dt + \frac{1}{n} \int |\varphi(ht)|^2 \{1 - |f(t)|^2\}\,dt \Big]. \qquad (2.8)$$
From (2.6) we immediately obtain
$$|B_n(p_n(x))| \le \frac{1}{2\pi} \int |f(t)|\,|1 - \varphi(ht)|\,dt. \qquad (2.9)$$

Lemma 1. For each $x$,
$$\mathrm{MSE}(p_n(x)) \le \Big\{ \frac{1}{2\pi} \int |f(t)|\,|1 - \varphi(ht)|\,dt \Big\}^2 + \frac{a(x)}{\pi n h} \int |\varphi(t)|\,dt, \qquad (2.10)$$
where $a(x) = (K_h \ast p)(x)$. If $p$ is bounded by $a$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{1}{2\pi} \int |f(t)|\,|1 - \varphi(ht)|\,dt \Big\}^2 + \frac{a}{\pi n h} \int |\varphi(t)|\,dt.$$

Proof. It suffices to prove the first statement, since $a(x) = \int p(x-y) K_h(y)\,dy \le a$ for all $x$.
Making use of relation (2.7), we obtain
$$\mathrm{MSE}(p_n(x)) = \Big[ \frac{1}{2\pi} \int e^{-itx} f(t)\{1 - \varphi(ht)\}\,dt \Big]^2 + \frac{1}{n} \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x} \varphi(hu)\varphi(hv) f(u+v)\,du\,dv - \frac{1}{n} \frac{1}{(2\pi)^2} \int\!\!\int e^{-i(u+v)x} \varphi(hu)\varphi(hv) f(u)f(v)\,du\,dv.$$
The first term on the right hand side is dominated by the first term of the right hand side of (2.10). Let us then estimate the absolute value of the second (denoted by $T_2$) and third (denoted by $T_3$) terms. We have
$$T_2 = \frac{1}{n} \frac{1}{2\pi} \int \varphi(hu) \Big\{ \frac{1}{2\pi} \int e^{-i(u+v)x} \varphi(hv) f(u+v)\,dv \Big\}\,du.$$
The term in brackets, being transformed to the form
$$\frac{1}{2\pi} \int e^{-itx} \varphi(h(t-u)) f(t)\,dt,$$
is equal to $\int p(x-y) K_h(y) e^{-iuy}\,dy$ (since $\varphi(h(t-u)) f(t)$ is the Fourier transform of the convolution of the functions $p(x)$ and $K_h(x) e^{-iux}$), and clearly
$$\Big| \int p(x-y) K_h(y) e^{-iuy}\,dy \Big| \le a(x).$$
Hence
$$|T_2| \le \frac{a(x)}{n} \frac{1}{2\pi} \int |\varphi(ht)|\,dt.$$
Furthermore,
$$|T_3| = \frac{1}{n} \Big| \frac{1}{2\pi} \int e^{-iux} \varphi(hu) f(u)\,du \cdot \frac{1}{2\pi} \int e^{-ivx} f(v) \varphi(hv)\,dv \Big| \le \frac{1}{n} (K_h \ast p)(x) \cdot \frac{1}{2\pi} \int |f(v)|\,|\varphi(hv)|\,dv \le \frac{a(x)}{2\pi n} \int |\varphi(hv)|\,dv.$$
Thus we finally obtain (2.10).

Lemma 2.
$$\mathrm{MISE}(p_{n,h}) \le \frac{1}{2\pi} \Big\{ \int |f(t)|^2\,|1 - \varphi(ht)|^2\,dt + \frac{1}{nh} \int |\varphi(t)|^2\,dt \Big\}.$$
This lemma immediately follows from relation (2.8).

We conclude this section with some inequalities for characteristic functions which will be used below.

Lemma 3. Let $F$ be a distribution function with characteristic function $f$. If the first order absolute moment $\beta_1 = \int |x|\,dF(x)$ is finite, then
$$|1 - f(t)| \le \beta_1 |t| \quad \text{for all real } t.$$
If $F$ has null expectation and finite variance $\sigma^2$, then
$$|1 - f(t)| \le \tfrac{1}{2} \sigma^2 t^2 \quad \text{for all real } t.$$

Proof. Observe that for any positive integer $n$ and any $x > 0$,
$$\Big| e^{ix} - 1 - \frac{ix}{1!} - \ldots - \frac{(ix)^{n-1}}{(n-1)!} \Big| \le \frac{x^n}{n!} \qquad (2.11)$$
(see for example Feller, 1971, Chapter 15).
The first inequality of the lemma follows quickly via
$$|1 - f(t)| \le \int |1 - e^{itx}|\,dF(x) \le \int |tx|\,dF(x) = \beta_1 |t|.$$
To prove the second inequality, we obtain, again making use of (2.11),
$$|1 - f(t)| = \Big| \int \{e^{itx} - 1\}\,dF(x) \Big| = \Big| \int (e^{itx} - 1 - itx)\,dF(x) \Big| \le \int |e^{itx} - 1 - itx|\,dF(x) \le \int \tfrac{1}{2} t^2 x^2\,dF(x) = \tfrac{1}{2} \sigma^2 t^2.$$
Along the same lines one may prove for example that $|f(t) - (1 - \tfrac{1}{2}\sigma^2 t^2)| \le \tfrac{1}{6} |t|^3 \int |x|^3\,dF(x)$.

Let $g$ be a real-valued function defined on an interval $[a, b]$ of the real line. The total variation of $g$ on $[a, b]$ is defined as
$$V_a^b(g) = \sup \sum_{i=1}^{n} |g(x_i) - g(x_{i-1})|,$$
where the supremum is taken over all $n$ and all collections $x_0, \ldots, x_n$ such that $a = x_0 < \cdots < x_n = b$. The total variation on the whole real line is defined as
$$V_{-\infty}^{\infty}(g) = \lim_{x \to \infty} V_{-x}^{x}(g).$$
In the case $V_{-\infty}^{\infty}(g)$ we omit limits and write $V(g)$. A function $g$ is said to be a function of bounded total variation if $V(g) < \infty$ (or $V_a^b(g) < \infty$ if it is considered on an interval $[a, b]$). Note that if $g$ has an integrable derivative, then $V_a^b(g) = \int_a^b |g'|\,dx$.

Lemma 4. Let $p$ be a probability density and $f$ the corresponding characteristic function. If $p$ is $m - 1$ times differentiable, and $p^{(m-1)}$ is a function of bounded variation, then
$$|f(t)| \le V(p^{(m-1)})/|t|^m \quad \text{for all real } t$$
(by definition, $p^{(0)} = p$). A proof of this lemma is contained in Ushakov and Ushakov (1999).

3. Density estimators with conventional kernels

First we study the 'smooth' case, i.e. when the density to be estimated is one or several times differentiable.

Theorem 1. Let $p$ be twice differentiable, with $p''$ a function of bounded variation, $V(p'') = V_2 < \infty$.
If the kernel $K$ has null expectation, and $h_n = h_0 n^{-1/5}$ ($h_0$ being some constant), then
$$\mathrm{MISE}(p_n) \le \Big\{ \frac{3 \mu_2^2(K) V_2^{5/3} h_0^4}{10\pi} + \frac{R(K)}{h_0} \Big\} n^{-4/5}. \qquad (3.1)$$

Proof. Due to Lemma 4, we have
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_2^{1/3}, \\ V_2/|t|^3 & \text{for } |t| > V_2^{1/3}, \end{cases} \qquad (3.2)$$
and, due to Lemma 3, $|1 - \varphi(h_n t)| \le \tfrac{1}{2}\mu_2(K) h_n^2 t^2$ for all $t$. Hence
$$\int |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt \le \frac{1}{2}\mu_2^2(K) h_n^4 \int_0^{V_2^{1/3}} t^4\,dt + \frac{1}{2}\mu_2^2(K) h_n^4 V_2^2 \int_{V_2^{1/3}}^{\infty} \frac{dt}{t^2} = \frac{3}{5}\,\mu_2^2(K) V_2^{5/3} h_0^4\,n^{-4/5}. \qquad (3.3)$$
Further, using the Parseval–Plancherel identity, we get
$$\frac{1}{nh_n} \frac{1}{2\pi} \int |\varphi(t)|^2\,dt = \frac{1}{h_0} n^{-4/5} \int K^2(x)\,dx = \frac{R(K)}{h_0}\,n^{-4/5}. \qquad (3.4)$$
From (3.3), (3.4) and Lemma 2, we obtain (3.1).

Corollary. Let the conditions of Theorem 1 be satisfied. Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le \Big( \frac{3 \cdot 5^4}{2^9 \pi} \Big)^{1/5} \{\mu_2^2(K) R^4(K)\}^{1/5}\,V_2^{1/3}\,n^{-4/5},$$
with minimum of the upper bound attained for
$$h_n = \Big\{ \frac{5\pi}{6}\,\frac{R(K)}{\mu_2^2(K)} \Big\}^{1/5} V_2^{-1/3}\,n^{-1/5}.$$

If $p$ is only one time differentiable or/and the expectation of $K$ does not equal zero, then results are weaker.

Theorem 2. Let $p$ be differentiable with $p'$ a function of bounded variation, $V(p') = V_1 < \infty$. If $h_n = h_0 n^{-1/3}$, then
$$\mathrm{MISE}(p_n) \le \Big\{ \frac{4}{3\pi}\,\mu_1^2(K) V_1^{3/2} h_0^2 + \frac{R(K)}{h_0} \Big\} n^{-2/3}. \qquad (3.5)$$

Proof. Due to Lemmas 3 and 4,
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_1^{1/2}, \\ V_1/|t|^2 & \text{for } |t| > V_1^{1/2}, \end{cases}$$
and $|1 - \varphi(h_n t)| \le \mu_1(K) h_n |t|$ for all $t$. Hence (see the proof of Theorem 1),
$$\int |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt \le \frac{8}{3}\,\mu_1^2(K) V_1^{3/2} h_0^2\,n^{-2/3}. \qquad (3.6)$$
And, as in the proof of Theorem 1,
$$\frac{1}{2\pi} \frac{1}{nh_n} \int |\varphi(t)|^2\,dt = \frac{R(K)}{h_0}\,n^{-2/3}. \qquad (3.7)$$
From (3.6), (3.7) and Lemma 2, we obtain (3.5).

Corollary. Let the conditions of Theorem 2 be satisfied.
Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le (9/\pi)^{1/3}\,\mu_1^{2/3}(K)\,\sqrt{V_1}\,R(K)^{2/3}\,n^{-2/3}.$$

Theorems 1 and 2 give bounds for the integral deviation of the mean squared error of a kernel estimator from zero. Now we obtain bounds for the sup deviation, in terms of
$$A(K) = \frac{1}{2\pi} \int |\varphi(t)|\,dt.$$

Theorem 3. Let $p$ be three times differentiable with $p'''$ a function of bounded variation, $V(p''') = V_3 < \infty$, and let $p$ be bounded by $a$. If $h_n = h_0 n^{-1/5}$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{4}{9\pi^2}\,\mu_2^2(K) V_3^{3/2} h_0^4 + \frac{2aA(K)}{h_0} \Big\} n^{-4/5}.$$

Proof. Due to Lemma 4,
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_3^{1/4}, \\ V_3/|t|^4 & \text{for } |t| > V_3^{1/4}, \end{cases}$$
and, due to Lemma 3, $|1 - \varphi(h_n t)| \le \tfrac{1}{2}\mu_2(K) h_n^2 t^2$ for all $t$. Hence
$$\int |f(t)|\,|1 - \varphi(h_n t)|\,dt \le \mu_2(K) h_n^2 \Big( \int_0^{V_3^{1/4}} t^2\,dt + V_3 \int_{V_3^{1/4}}^{\infty} \frac{dt}{t^2} \Big) = \frac{4}{3}\,\mu_2(K) V_3^{3/4} h_0^2\,n^{-2/5}.$$
To get the result it suffices now to apply Lemma 1.

Corollary. Let the conditions of Theorem 3 be satisfied. Then for each $n$,
$$\min_{h > 0} \sup_x \mathrm{MSE}(p_n(x)) \le 5(36\pi^2)^{-1/5}\,\mu_2^{2/5}(K) V_3^{3/10} A^{4/5}(K)\,a^{4/5}\,n^{-4/5},$$
with minimum of the upper bound being attained for
$$h_n = \Big\{ \frac{9\pi^2}{8}\,\frac{A(K)}{\mu_2^2(K)} \Big\}^{1/5} a^{1/5} V_3^{-3/10}\,n^{-1/5}.$$

Theorem 4. Let $p$ be twice differentiable with $p''$ a function of bounded variation, and let $p$ be bounded by $a$. If $h_n = h_0 n^{-1/3}$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{9}{4\pi^2}\,\mu_1^2(K) V_2^{4/3} h_0^2 + \frac{2aA(K)}{h_0} \Big\} n^{-2/3}. \qquad (3.8)$$

Proof. Using (3.2) and the first inequality of Lemma 3, we have $|1 - \varphi(h_n t)| \le \mu_1(K) h_n |t|$. This leads to
$$\int |f(t)|\,|1 - \varphi(h_n t)|\,dt \le 2\mu_1(K) h_n \int_0^{V_2^{1/3}} t\,dt + 2V_2\,\mu_1(K) h_n \int_{V_2^{1/3}}^{\infty} \frac{dt}{t^2} = 3\mu_1(K) V_2^{2/3} h_n = 3\mu_1(K) V_2^{2/3} h_0\,n^{-1/3}.$$
Using this estimate and Lemma 1, we get (3.8).

Corollary. Let the conditions of Theorem 4 be satisfied.
Then for each $n$,
$$\min_{h > 0} \sup_x \mathrm{MSE}(p_n(x)) \le 3\Big( \frac{9}{4\pi^2} \Big)^{1/3} \mu_1^{2/3}(K) V_2^{4/9} A^{2/3}(K)\,a^{2/3}\,n^{-2/3}.$$

Next we consider the so-called non-smooth case. This means that the underlying density function is not supposed to be differentiable or even continuous. Some minimum regularity conditions must be introduced, however (otherwise nothing substantial can be derived). Here this minimum condition will be the boundedness of the total variation of the underlying density. Note that this condition is a little less restrictive than those usually assumed when authors work with the non-smooth case (see for example van Eeden, 1985 and van Es, 1997).

Theorem 5. Let the underlying density $p$ be a function of bounded variation, $V = V(p) < \infty$. If $h_n = h_0/(\sqrt{n}\,\log n)$, then
$$\mathrm{MISE}(p_{n,h}) \le \frac{\log^2 n}{\sqrt{n}} \Big[ \frac{4\sqrt{2}}{\pi}\,\max\{\sqrt{\mu_1(K)}, \mu_1(K)\} \max\{V^{3/2}, V^2\} \max\{\sqrt{h_0}, h_0\} + \frac{R(K)}{h_0 \log n} \Big] \qquad (3.9)$$
for all $n \ge 16$.

Proof. Let us use Lemma 2. For the second term in the square brackets, due to the Parseval–Plancherel identity, we have
$$\frac{1}{nh_n} \int |\varphi(t)|^2\,dt = \frac{2\pi}{nh_n} \int K^2(x)\,dx = \frac{2\pi R(K)}{nh_n}. \qquad (3.10)$$
Let us estimate the first term. First we establish the following inequality: for any $0 < \alpha < 1$,
$$|1 - \varphi(t)| \le \mu_1(K)^{\alpha}\,2^{1-\alpha}\,|t|^{\alpha} \qquad (3.11)$$
for all real $t$. Indeed, due to Lemma 3,
$$|1 - \varphi(t)| \le \mu_1(K)|t|. \qquad (3.12)$$
For $|t| \le 2/\mu_1(K)$, the right hand side of (3.11) majorises the right hand side of (3.12), therefore (3.11) holds for these $t$. If $|t| > 2/\mu_1(K)$, then (3.11) is evident because its right hand side exceeds 2. Let $\alpha$ be arbitrary inside $(0, \tfrac{1}{2})$.
Making use of (3.11) and Lemma 4, we get
$$\int |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt = 2\int_0^{V} |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt + 2\int_V^{\infty} |f(t)|^2\,|1 - \varphi(h_n t)|^2\,dt$$
$$\le 2\mu_1^{2\alpha}(K)\,2^{2(1-\alpha)} h_n^{2\alpha} \Big( \int_0^V t^{2\alpha}\,dt + V^2 \int_V^{\infty} t^{2\alpha-2}\,dt \Big) = \frac{2^{4-2\alpha}}{1 - 4\alpha^2}\,\mu_1^{2\alpha}(K)\,V^{2\alpha+1} h_n^{2\alpha}.$$
From this estimate and (3.10), using Lemma 2, we obtain
$$\mathrm{MISE}(p_{n,h}) \le \frac{2^{3-2\alpha}}{\pi(1 - 4\alpha^2)}\,\mu_1^{2\alpha}(K) V^{2\alpha+1} h_n^{2\alpha} + \frac{R(K)}{nh_n} = \frac{2^{3-2\alpha}}{\pi(1 - 4\alpha^2)}\,\mu_1^{2\alpha}(K) V^{2\alpha+1} h_0^{2\alpha} \Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha} + \frac{R(K)}{h_0}\,\frac{\log n}{\sqrt{n}} \qquad (3.13)$$
for any $\alpha \in (0, \tfrac{1}{2})$. Put
$$\alpha = \frac{\log n}{2(\log n + 2\log\log n)}.$$
Then $\tfrac{1}{4} < \alpha < \tfrac{1}{2}$ (provided that $n \ge e^e$, which translates into $n \ge 16$), and hence
$$2^{3-2\alpha} < 4\sqrt{2}, \quad \mu_1^{2\alpha}(K) \le \max\{\sqrt{\mu_1(K)}, \mu_1(K)\}, \quad V^{2\alpha+1} \le \max\{V^{3/2}, V^2\}, \quad h_0^{2\alpha} \le \max\{\sqrt{h_0}, h_0\}.$$
Therefore from (3.13) we obtain
$$\mathrm{MISE}(p_{n,h}) \le \frac{4\sqrt{2}}{\pi}\,\max\{\sqrt{\mu_1(K)}, \mu_1(K)\} \max\{V^{3/2}, V^2\} \max\{\sqrt{h_0}, h_0\}\,\frac{1}{1 - 4\alpha^2} \Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha} + \frac{R(K)}{h_0}\,\frac{\log n}{\sqrt{n}}. \qquad (3.14)$$
Putting now
$$\alpha_0 = \frac{\log n - 2\log\log n}{2(\log n + 2\log\log n)},$$
then $\alpha > \alpha_0$ (if $n \ge e^e$), hence
$$\Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha} < \Big( \frac{1}{\sqrt{n}\,\log n} \Big)^{2\alpha_0} = \frac{\log n}{\sqrt{n}}. \qquad (3.15)$$
It remains to assess the size of $1/(1 - 4\alpha^2)$. We have
$$\frac{1}{1 - 4\alpha^2} = \frac{(\log n + 2\log\log n)^2}{(\log n + 2\log\log n)^2 - (\log n)^2} = \frac{(\log n + 2\log\log n)^2}{(2\log n + 2\log\log n)\,2\log\log n} \le \frac{1}{4}\,\frac{\log n}{\log\log n} + 1 + \frac{\log\log n}{\log n + \log\log n} \le \log n \qquad (3.16)$$
if $n \ge e^e$. From (3.14), (3.15) and (3.16) we finally obtain (3.9).

Corollary. Let $p$ be a unimodal density function, bounded by $a$. If $h_n = h_0/(\sqrt{n}\,\log n)$, then
$$\mathrm{MISE}(p_{n,h}) \le \frac{\log^2 n}{\sqrt{n}} \Big[ \frac{4\sqrt{2}}{\pi}\,\max\{\sqrt{\mu_1(K)}, \mu_1(K)\} \max\{2\sqrt{2}\,a^{3/2}, 4a^2\} \max\{\sqrt{h_0}, h_0\} + \frac{R(K)}{h_0 \log n} \Big].$$

4. The sinc kernel density estimator

The sinc kernel is the function
$$K(x) = \frac{\sin x}{\pi x}$$
with the Fourier transform (defined as the principal value of the corresponding integral)
$$\varphi(t) = \begin{cases} 1 & \text{for } |t| \le 1, \\ 0 & \text{for } |t| > 1. \end{cases}$$
(Sometimes the sinc kernel is defined as $K(x) = \sin(\pi x)/(\pi x)$ with the Fourier transform $\varphi(t) = I\{|t| \le \pi\}$. Both functions $\sin x/(\pi x)$ and $\sin(\pi x)/(\pi x)$ integrate to one in the sense of the principal value, and the difference is only in the scale parameter.)

From now on we focus on the kernel estimator $p_n(x)$ of (2.1) with $K$ being the sinc kernel. It often leads to better performance, and some of its properties are in fact easier to study than for other kernel estimators; see Glad, Hjort and Ushakov (1999a). Its defects (possible negativeness and nonintegrability) can easily be corrected by a certain modification procedure (Glad, Hjort and Ushakov, 1999b). It consists in setting $\bar{p}_n(x) = \max\{p_n(x) - \xi, 0\}$, where the random $\xi$ is chosen so that the integral is 1. After this correction procedure, estimation precision of the estimator is guaranteed to improve.

In terms of the empirical characteristic function $f_n(t)$ the sinc estimator can be expressed as
$$p_n(x) = \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} f_n(t)\,dt. \qquad (4.1)$$
Suppose that the characteristic function $f$ of the underlying density $p$ is integrable. First we obtain relations for the sinc estimator, analogous to those of Lemmas 1 and 2 (these cannot be applied directly since now $K$ is not integrable).

Lemma 5. For the sinc kernel estimator,
$$\sup_x \mathrm{MSE}(p_n(x)) \le \Big\{ \frac{1}{2\pi} \int_{|t| \ge 1/h_n} |f(t)|\,dt \Big\}^2 + \frac{2}{\pi n h_n}\,\frac{1}{2\pi} \int |f(t)|\,dt \qquad (4.2)$$
and
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi} \Big\{ \int_{|t| \ge 1/h_n} |f(t)|^2\,dt + \frac{2}{nh_n} \Big\}. \qquad (4.3)$$

Proof. We first prove the first inequality.
We have
$$\mathrm{MSE}(p_n(x)) = \mathrm{E}\Big[ \frac{1}{2\pi} \Big\{ \int e^{-itx} f(t)\,dt - \int_{-1/h_n}^{1/h_n} e^{-itx} f_n(t)\,dt \Big\} \Big]^2 = \mathrm{E}\Big[ \frac{1}{2\pi} \int_{|t| \ge 1/h_n} e^{-itx} f(t)\,dt + \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} \{f(t) - f_n(t)\}\,dt \Big]^2$$
$$= \Big\{ \frac{1}{2\pi} \int_{|t| \ge 1/h_n} e^{-itx} f(t)\,dt \Big\}^2 + \mathrm{E}\Big[ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} \{f(t) - f_n(t)\}\,dt \Big]^2.$$
Let us estimate the second term on the right hand side. Denote it by $T_2$. Taking into account that
$$\mathrm{E}\,f_n(u) f_n(v) = (1 - 1/n) f(u) f(v) + (1/n) f(u+v),$$
we obtain
$$T_2 = \frac{1}{n} \frac{1}{(2\pi)^2} \int_{-1/h_n}^{1/h_n} \int_{-1/h_n}^{1/h_n} e^{-i(u+v)x} \{f(u+v) - f(u) f(v)\}\,du\,dv$$
$$= \frac{1}{2\pi n} \int_{-1/h_n}^{1/h_n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-i(u+v)x} f(u+v)\,du \Big\}\,dv - \frac{1}{n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} f(t)\,dt \Big\}^2$$
$$= \frac{1}{2\pi n} \int_{-1/h_n}^{1/h_n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n + v}^{1/h_n + v} e^{-itx} f(t)\,dt \Big\}\,dv - \frac{1}{n} \Big\{ \frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} e^{-itx} f(t)\,dt \Big\}^2.$$
Therefore
$$T_2 \le \frac{1}{2\pi n} \int_{-1/h_n}^{1/h_n} dv\,\frac{1}{2\pi} \int |f(t)|\,dt + \frac{1}{n}\,\frac{1}{2\pi} \int |f(t)|\,dt\,\frac{1}{2\pi} \int_{-1/h_n}^{1/h_n} ds = \frac{2}{\pi n h_n}\,\frac{1}{2\pi} \int |f(t)|\,dt.$$
Thus we obtain (4.2). Next we prove (4.3). Observe that we may use relation (2.5) with
$$\hat{f}_n(t) = \begin{cases} f_n(t) & \text{if } |t| \le 1/h_n, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore
$$\mathrm{MISE}(p_n) = \frac{1}{2\pi} \Big\{ \int_{-1/h_n}^{1/h_n} \mathrm{E}|f_n(t) - f(t)|^2\,dt + \int_{|t| \ge 1/h_n} |f(t)|^2\,dt \Big\},$$
and it suffices to show that
$$\int_{-1/h_n}^{1/h_n} \mathrm{E}|f_n(t) - f(t)|^2\,dt \le \frac{2}{nh_n}.$$
Taking now into account that $\mathrm{E}|f_n(t)|^2 = (1 - 1/n)|f(t)|^2 + 1/n$, we obtain
$$\int_{-1/h_n}^{1/h_n} \mathrm{E}|f_n(t) - f(t)|^2\,dt = \frac{1}{n} \int_{-1/h_n}^{1/h_n} (1 - |f(t)|^2)\,dt \le \frac{1}{n} \int_{-1/h_n}^{1/h_n} dt = \frac{2}{nh_n}.$$
This proves the claim.

Now we derive some estimates for the MISE and MSE of the sinc estimator in terms of the degree of smoothness of the underlying density.
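Representation (4.1) makes the sinc estimator directly computable from the empirical characteristic function; the following is a minimal numerical sketch, not from the paper (the simulated sample, the bandwidth, and the integration grid are illustrative choices):

```python
import numpy as np

def sinc_kde(x, sample, h, num_t=2001):
    """Sinc-kernel estimate via (4.1):
    p_n(x) = (1/2π) ∫_{|t| <= 1/h} e^{-itx} f_n(t) dt,
    with f_n(t) = (1/n) sum_j exp(i t X_j) the empirical characteristic function."""
    t = np.linspace(-1.0 / h, 1.0 / h, num_t)
    f_n = np.exp(1j * t[:, None] * sample).mean(axis=1)   # ECF on the t-grid
    integrand = np.exp(-1j * np.outer(np.asarray(x, float), t)) * f_n
    dt = t[1] - t[0]
    # Riemann sum; the imaginary part vanishes up to discretisation error
    return (integrand.sum(axis=1) * dt).real / (2 * np.pi)

rng = np.random.default_rng(1)
sample = rng.standard_normal(400)
grid = np.linspace(-3, 3, 121)
p_hat = sinc_kde(grid, sample, h=0.5)
# unlike a conventional kernel estimate, p_hat may dip below zero
print(p_hat[60])  # value at x = 0, close to the true density there
```

Note that, in line with the discussion above, this estimate need not be nonnegative nor integrate to one; the truncation procedure of Glad, Hjort and Ushakov (1999b) would repair this.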
First we consider the non-smooth case, when the density to be estimated is not supposed to be differentiable or even continuous.

Theorem 6. Let $p$ have bounded variation, $V(p) = V < \infty$, and let $p_n$ be the sinc estimator. If $h_n = h_0/\sqrt{n}$, then
$$\mathrm{MISE}(p_n) \le \frac{1}{\pi\sqrt{n}} \Big( V^2 h_0 + \frac{1}{h_0} \Big).$$

Proof. Making use of relation (4.3) of Lemma 5 and Lemma 4, we obtain
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi} \Big\{ \int_{|t| \ge 1/h_n} |f(t)|^2\,dt + \frac{2}{nh_n} \Big\} \le \frac{1}{2\pi} \Big( 2V^2 \int_{1/h_n}^{\infty} \frac{dt}{t^2} + \frac{2}{nh_n} \Big) = \frac{1}{\pi\sqrt{n}} \Big( V^2 h_0 + \frac{1}{h_0} \Big),$$
as required.

Corollary 1. Let the conditions of Theorem 6 be satisfied. Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le \frac{2V}{\pi\sqrt{n}}.$$

Corollary 2. Let $p$ be a unimodal density function, and let $p_n$ be the sinc estimator. If $p$ is bounded by $a$, and $h_n = h_0/\sqrt{n}$, then
$$\mathrm{MISE}(p_n) \le \frac{1}{\pi\sqrt{n}} \Big( 4a^2 h_0 + \frac{1}{h_0} \Big), \quad \text{and} \quad \min_{h > 0} \mathrm{MISE}(p_n) \le \frac{4a}{\pi\sqrt{n}}.$$

Now consider the case when the density to be estimated is $m$ times differentiable, $m \ge 1$. It will be shown that in this case the upper bound for the MISE of the sinc estimator has order $n^{-2m/(2m+1)}$, which in principle cannot be achieved (for $m > 2$) for kernel estimators with kernels being density functions.

Theorem 7. Let $p$ be $m$ times differentiable with $p^{(m)}$ a function of bounded variation, $V(p^{(m)}) = V_m < \infty$. If $p_n$ is the sinc estimator, and $h_n = h_0 n^{-1/(2m+1)}$, then
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi} \Big\{ \frac{4(m+1)}{2m+1}\,V_m^{(2m+1)/(m+1)} h_0^{2m} + \frac{2}{h_0} \Big\} n^{-2m/(2m+1)}. \qquad (4.4)$$

Proof. We have
$$\int_{|t| \ge 1/h_n} |f(t)|^2\,dt = h_n^{2m} \int_{|t| \ge 1/h_n} \Big( \frac{1}{h_n} \Big)^{2m} |f(t)|^2\,dt \le h_n^{2m} \int_{|t| \ge 1/h_n} |t|^{2m} |f(t)|^2\,dt \le h_n^{2m} \int |t|^{2m} |f(t)|^2\,dt. \qquad (4.5)$$
Let us estimate the integral on the right hand side, making use of Lemma 4.
We have
$$|f(t)| \le \begin{cases} 1 & \text{for } |t| \le V_m^{1/(m+1)}, \\ V_m/|t|^{m+1} & \text{for } |t| > V_m^{1/(m+1)}, \end{cases}$$
therefore
$$\int |t|^{2m} |f(t)|^2\,dt \le 2\int_0^{V_m^{1/(m+1)}} t^{2m}\,dt + 2V_m^2 \int_{V_m^{1/(m+1)}}^{\infty} \frac{dt}{t^2} = \frac{4(m+1)}{2m+1}\,V_m^{(2m+1)/(m+1)}. \qquad (4.6)$$
Thus, from inequality (4.3) of Lemma 5, and relations (4.5) and (4.6), we obtain (4.4).

Corollary. Let the conditions of Theorem 7 be satisfied. Then for each $n$,
$$\min_{h > 0} \mathrm{MISE}(p_n) \le \frac{1}{2\pi} \{4(m+1)\}^{1/(2m+1)} \Big( \frac{2m+1}{m} \Big)^{2m/(2m+1)} V_m^{1/(m+1)}\,n^{-2m/(2m+1)}.$$

Theorem 8. Let $p$ be $m$ times differentiable, with $p^{(m)}$ a function of bounded variation, $V(p^{(m)}) = V_m < \infty$. If $p_n$ is the sinc estimator, and $h_n = h_0 n^{-1/(2m-1)}$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{1}{\pi^2} \Big\{ \frac{(m+1)^2}{m^2}\,V_m^{2m/(m+1)} h_0^{2(m-1)} + 2\Big( 1 + \frac{1}{m} \Big) V_m^{1/(m+1)}\,\frac{1}{h_0} \Big\} n^{-2(m-1)/(2m-1)}.$$
The proof of this theorem is analogous to that of Theorem 7; one just needs to use relation (4.2) of Lemma 5 instead of relation (4.3) and take into account that, due to Lemma 4,
$$A(p) = \frac{1}{2\pi} \int |f(t)|\,dt \le \frac{1}{2\pi} \int_{-V_m^{1/(m+1)}}^{V_m^{1/(m+1)}} dt + \frac{1}{2\pi} \int_{|t| > V_m^{1/(m+1)}} \frac{V_m\,dt}{|t|^{m+1}} = \frac{1}{\pi} \Big( 1 + \frac{1}{m} \Big) V_m^{1/(m+1)}.$$

Corollary. Let the conditions of Theorem 8 be satisfied. Then for each $n$,
$$\min_{h > 0} \sup_x \mathrm{MSE}(p_n(x)) \le \frac{2m-1}{\pi^2} \Big( \frac{m+1}{m} \Big)^{2/(2m-1)} \Big\{ \frac{m+1}{m(m-1)} \Big\}^{(2m-2)/(2m-1)} V_m^{2/(m+1)}\,n^{-2(m-1)/(2m-1)}.$$

Now we proceed to the 'supersmooth' case, which we define in terms of characteristic functions (although this class of distributions can be defined in terms of density functions as well, a description in terms of characteristic functions is simpler, more natural and more convenient for our purposes).
A distribution $F$ with characteristic function $f(t)$ is said to be supersmooth if for some $\alpha > 0$ and $\gamma > 0$,
$$B(p; \alpha, \gamma) = \int e^{\gamma|t|^{\alpha}} |f(t)|\,dt < \infty.$$
Thus a normal density is supersmooth with $\alpha = 2$ while a Cauchy is supersmooth with $\alpha = 1$, for example.

Theorem 9. Let the characteristic function $f$ of $p$ have a finite $B(p; \alpha, \gamma)$ value, for some positive $\alpha$ and $\gamma$. If $p_n$ is the sinc estimator, and
$$h_n = \Big\{ \frac{1}{\gamma} \log(h_0 n) \Big\}^{-1/\alpha}, \quad \text{with } h_0 \ge \frac{1}{n},$$
then
$$\mathrm{MISE}(p_n) \le \frac{1}{2\pi n} \Big\{ \frac{2}{\gamma^{1/\alpha}} (\log h_0 + \log n)^{1/\alpha} + \frac{B(p; \alpha, \gamma)}{h_0} \Big\}. \qquad (4.7)$$

Proof. We have
$$\int_{|t| \ge 1/h_n} |f(t)|^2\,dt \le \int_{|t| \ge 1/h_n} |f(t)|\,dt = e^{-\gamma/h_n^{\alpha}} \int_{|t| \ge 1/h_n} e^{\gamma/h_n^{\alpha}} |f(t)|\,dt \le e^{-\gamma/h_n^{\alpha}} \int e^{\gamma|t|^{\alpha}} |f(t)|\,dt = \frac{B(p; \alpha, \gamma)}{nh_0}.$$
Using this estimate and inequality (4.3) of Lemma 5, we obtain (4.7).

Theorem 10. Let the conditions of Theorem 9 be satisfied, and let again $A(p) = (2\pi)^{-1} \int |f(t)|\,dt$. Then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{1}{n} \Big\{ \frac{2A(p)}{\pi \gamma^{1/\alpha}} (\log h_0 + \log n)^{1/\alpha} + \frac{B^2(p; \alpha, \gamma)}{4\pi^2 n h_0^2} \Big\}.$$
The proof of the theorem is similar to that of Theorem 9 (inequality (4.2) is used instead of (4.3)).

Theorems 9 and 10 can be improved for one subclass of supersmooth densities. The result is given by the next theorem and is quite curious. Note that this theorem corresponds to a special case of a result by Ibragimov and Khas'minskii (1982), and we give it here for completeness.

Theorem 11. Let the characteristic function $f$ of $p$ satisfy the condition: there exists $\tau > 0$ such that $f(t) = 0$ for $|t| > \tau$. If $p_n$ is the sinc estimator, and $h_n \le 1/\tau$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{2A(p)}{\pi n h_n} \le \frac{2\tau}{\pi^2 n h_n}, \quad \text{and} \quad \mathrm{MISE}(p_n) \le \frac{1}{\pi n h_n}.$$
In particular, if $h_n = \text{const} = 1/\tau$, then
$$\sup_x \mathrm{MSE}(p_n(x)) \le \frac{2\tau^2}{\pi^2 n} \quad \text{and} \quad \mathrm{MISE}(p_n) \le \frac{\tau}{\pi n}.$$
A proof of the theorem can be immediately obtained from inequalities (4.2) and (4.3) of Lemma 5: the integrals on the right hand sides of these vanish when $h_n \le 1/\tau$. Theorem 11 implies in particular that if the characteristic function of the underlying distribution vanishes for large values of the argument, and one uses the sinc estimator for approximation, then $p_n$ converges to $p$ as $n \to \infty$ even when $h_n$ does not converge to zero.

5. Discussion and applications

This article has provided upper bounds for both the traditional MISE and also the less worked-with max-MSE performance measures of kernel density estimators. A list of such upper bounds has been provided, under various sets of assumptions, for both the traditional kernels as well as for the sinc kernel, which has particularly attractive features. Our finite-sample results have been reached entirely outside the customary framework of asymptotics, Taylor expansions and small bandwidths, through the extensive use of characteristic and empirical characteristic functions. Below we give some concluding remarks, pointing to ways in which the results can be applied in statistics.

5.1. Rule-of-thumb bandwidths for MISE and max-MSE. Consider kernel estimators with a traditional kernel $K$, a symmetric density. The traditional large-sample approximations lead to an asymptotically optimal bandwidth of size
$$h_n = \{R(K)/\mu_2^2(K)\}^{1/5} R(p'')^{-1/5}\,n^{-1/5},$$
and with consequent minimum approximate MISE of size
$$\min_{h > 0} \mathrm{AMISE}(p_n) = (5/4)\{\mu_2^2(K) R^4(K)\}^{1/5} R(p'')^{1/5}\,n^{-4/5};$$
see for example Wand and Jones (1995). When $K$ is standard normal, and $p$ is a normal density with standard deviation $\sigma$, this leads to the popular 'normal rule-of-thumb' bandwidth $h_n = 1.0592\,\sigma n^{-1/5}$.
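The constant in this rule is easy to verify numerically; a small sketch, not from the paper, computing $\{R(K)/\mu_2^2(K)\}^{1/5} R(p'')^{-1/5}$ for the standard normal kernel and a normal target with $\sigma = 1$ (the closed-form values of $R(K)$ and $R(p'')$ are standard facts, checked here by quadrature):

```python
import numpy as np
from math import pi, sqrt

# Standard normal kernel: R(K) = 1/(2 sqrt(pi)), mu_2(K) = 1.
R_K = 1.0 / (2.0 * sqrt(pi))
mu2_K = 1.0

# For p the N(0,1) density, R(p'') = 3/(8 sqrt(pi)); cross-check numerically
# using p''(x) = (x^2 - 1) phi(x).
R_p2 = 3.0 / (8.0 * sqrt(pi))
x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / 2) / sqrt(2 * pi)
num = (((x**2 - 1) * phi) ** 2).sum() * dx

c = (R_K / mu2_K**2) ** 0.2 * R_p2 ** (-0.2)
print(round(c, 4))  # 1.0592, i.e. h_n = 1.0592 sigma n^{-1/5}
```

The constant reduces to $(4/3)^{1/5}$, since $R(K)/R(p'') = 4/3$ for this pair.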
Note that the structure of these classic results is very similar to that seen in Theorem 1 and its corollary; in particular, the well-known large-sample result about the $n^{-4/5}$ precision rate is here reached entirely without asymptotics machinery or approximations. It is interesting to compare the above with what one finds using the upper bounds. For a normal density, $V_2 = \int |p'''|\,dx$ is found to be the scale factor $\sigma^{-3}$ times $2(2\pi)^{-1/2}\{1 + 4\exp(-3/2)\} = 1.5100$, and this leads via Theorem 1 to the rule $h_n = 0.8204\,\sigma n^{-1/5}$. This has been calculated using upper bound results derived under minimal assumptions, which hence do not pretend to be very accurate for smooth densities like the normal. It is comforting to see that only a moderate amount is lost in precision, in this very smooth case, since the ratio of the minimised upper bound to the minimised asymptotic MISE is found to be 1.2911.

The max-MSE criterion is a natural venue, seemingly not travelled before. It is difficult to reach applicable results for this criterion based on the traditional approximations. However, Theorem 3 and its corollary provide ways of bounding the max-MSE when there is information on $V_3 = \int |p^{(4)}|\,dx$. If $p$ is normal, then $V_3$ can be shown to be the scale factor $\sigma^{-4}$ times
$$4\{(3b - b^3)\phi(b) - (3c - c^3)\phi(c)\} = 2.8006,$$
where $b = (3 - \sqrt{6})^{1/2}$ and $c = (3 + \sqrt{6})^{1/2}$, and $\phi$ is the standard normal density. The normal rule-of-thumb, when the normal kernel is used, becomes $h_n = 1.1883\,\sigma n^{-1/5}$, which again is not far from the traditional rule-of-thumb.

We also point out that some of these results may be sharpened under further assumed constraints on the underlying density. The quite crude bound (3.2) has for example been used for $|f(t)|$, which could be bounded more effectively under such additional restrictions. This is not pursued here, however.

5.2. Cross-validation and normal rule-of-thumb in new light. Results reached in this article, about MISE and upper bounds expressed in terms of characteristic functions, point to new ways in which bandwidths can be selected from data. Expression (2.8) and its upper bound given in Lemma 2 depend on $q(t) = |f(t)|^2$, but not on other aspects of the underlying density $p$. A suitable estimate $\hat{q}(t)$ may now be inserted in these expressions, after which one may minimise over the smoothing parameter $h$. For the normal kernel, this could for example mean minimising
$$Q_n(h) = \frac{1}{2\pi} \int \hat{q}(t) \{1 - \exp(-\tfrac{1}{2} h^2 t^2)\}^2\,dt + \frac{1}{2\pi}\,\frac{1}{n} \int \exp(-h^2 t^2)\,dt$$
over $h$, after having selected an estimator $\hat{q}(t)$. Interestingly it turns out that this scheme reproduces the well-known 'unbiased cross-validation' rule, see for example Wand and Jones (1995), when one employs the natural nonparametric unbiased estimator
$$\hat{q}(t) = \frac{1}{n(n-1)} \sum_{j \ne k} \exp(it(X_j - X_k)) = \frac{2}{n(n-1)} \sum_{j < k} \cos(t(X_j - X_k)).$$
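This bandwidth selector is straightforward to implement; the following is a minimal sketch, not from the paper (the integration grid, the bandwidth grid, and minimisation by scan are illustrative choices), assuming the standard normal kernel:

```python
import numpy as np

def q_hat(t, sample):
    """Unbiased estimate of q(t) = |f(t)|^2:
    (2/(n(n-1))) * sum_{j<k} cos(t (X_j - X_k))."""
    n = len(sample)
    iu = np.triu_indices(n, k=1)
    d = (sample[:, None] - sample[None, :])[iu]   # pairwise differences, j < k
    return 2.0 * np.cos(np.outer(t, d)).sum(axis=1) / (n * (n - 1))

def Q_n(h, t, qh, n):
    """Criterion (1/2π) ∫ q̂(t){1 - e^{-h²t²/2}}² dt + (1/2πn) ∫ e^{-h²t²} dt,
    the normal-kernel form of (2.8) with q̂ plugged in."""
    dt = t[1] - t[0]
    bias2 = (qh * (1.0 - np.exp(-0.5 * h**2 * t**2)) ** 2).sum() * dt
    var = np.exp(-(h * t) ** 2).sum() * dt / n
    return (bias2 + var) / (2 * np.pi)

rng = np.random.default_rng(2)
sample = rng.standard_normal(200)
t = np.linspace(-20, 20, 401)
qh = q_hat(t, sample)
hs = np.linspace(0.05, 1.5, 60)
h_best = hs[np.argmin([Q_n(h, t, qh, len(sample)) for h in hs])]
print(h_best)  # comparable to the rule-of-thumb 1.0592 * 200**(-1/5) ≈ 0.37
```

Since $\hat q$ is unbiased for $|f|^2$, minimising $Q_n$ is (up to the upper-bound form of the variance term) the unbiased cross-validation rule mentioned above.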
