Confidence Sets Based on Penalized Maximum Likelihood Estimators in Gaussian Regression

Benedikt M. Pötscher† and Ulrike Schneider‡
Department of Statistics, University of Vienna
and
Institute for Mathematical Stochastics, University of Göttingen

Preliminary version: February 2008
First version: June 2008
First revision: May 2009
Second revision: January 2010

Abstract

Confidence intervals based on penalized maximum likelihood estimators such as the LASSO, adaptive LASSO, and hard-thresholding are analyzed. In the known-variance case, the finite-sample coverage properties of such intervals are determined and it is shown that symmetric intervals are the shortest. The length of the shortest intervals based on the hard-thresholding estimator is larger than the length of the shortest interval based on the adaptive LASSO, which is larger than the length of the shortest interval based on the LASSO, which in turn is larger than the standard interval based on the maximum likelihood estimator. In the case where the penalized estimators are tuned to possess the 'sparsity property', the intervals based on these estimators are larger than the standard interval by an order of magnitude. Furthermore, a simple asymptotic confidence interval construction in the 'sparse' case, that also applies to the smoothly clipped absolute deviation estimator, is discussed. The results for the known-variance case are shown to carry over to the unknown-variance case in an appropriate asymptotic sense.

MSC Subject Classifications: Primary 62F25; secondary 62C25, 62J07.

Keywords: penalized maximum likelihood, penalized least squares, Lasso, adaptive Lasso, hard-thresholding, soft-thresholding, confidence set, coverage probability, sparsity, model selection.

∗ Earlier versions of this paper were circulated under the title "Confidence Sets Based on Penalized Maximum Likelihood Estimators".
† Department of Statistics, University of Vienna, Universitätsstrasse 5, A-1010 Vienna. Phone: +431 427738640. E-mail: benedikt.poetscher@univie.ac.at
‡ Institute for Mathematical Stochastics, Georg-August-University Göttingen, Goldschmidtstraße 7, D-37077 Göttingen. Phone: +49 55139172107. E-mail: ulrike.schneider@math.uni-goettingen.de

1 Introduction

Recent years have seen an increased interest in penalized maximum likelihood (least squares) estimators. Prominent examples of such estimators are the LASSO estimator (Tibshirani (1996)) and its variants like the adaptive LASSO (Zou (2006)), the Bridge estimators (Frank and Friedman (1993)), or the smoothly clipped absolute deviation (SCAD) estimator (Fan and Li (2001)). In linear regression models with orthogonal regressors, the hard- and soft-thresholding estimators can also be reformulated as penalized least squares estimators, with the soft-thresholding estimator then coinciding with the LASSO estimator.

The asymptotic distributional properties of penalized maximum likelihood (least squares) estimators have been studied in the literature, mostly in the context of a finite-dimensional linear regression model; see Knight and Fu (2000), Fan and Li (2001), and Zou (2006). Knight and Fu (2000) study the asymptotic distribution of Bridge estimators and, in particular, of the LASSO estimator. Their analysis concentrates on the case where the estimators are tuned in such a way as to perform conservative model selection, and their asymptotic framework allows for dependence of parameters on sample size. In contrast, Fan and Li (2001) for the SCAD estimator and Zou (2006) for the adaptive LASSO estimator concentrate on the case where the estimators are tuned to possess the 'sparsity' property.
They show that, with such tuning, these estimators possess what has come to be known as the 'oracle property'. However, their results are based on a fixed-parameter asymptotic framework only. Pötscher and Leeb (2009) and Pötscher and Schneider (2009) study the finite-sample distribution of the hard-thresholding, the soft-thresholding (LASSO), the SCAD, and the adaptive LASSO estimator under normal errors; they also obtain the asymptotic distributions of these estimators in a general 'moving parameter' asymptotic framework. The results obtained in these two papers clearly show that the distributions of the estimators studied are often highly non-normal and that the so-called 'oracle property' typically paints a misleading picture of the actual performance of the estimator. [In the wake of Fan and Li (2001) a considerable literature has sprung up establishing the so-called 'oracle property' for a variety of estimators. All these results are fixed-parameter asymptotic results only and can be very misleading. See Leeb and Pötscher (2008) and Pötscher (2009) for more discussion.]

A natural question now is what all these distributional results mean for confidence intervals that are based on penalized maximum likelihood (least squares) estimators. This is the question we address in the present paper in the context of a normal linear regression model with orthogonal regressors. In the known-variance case we obtain formulae for the finite-sample infimal coverage probabilities of fixed-width confidence intervals based on the following estimators: hard-thresholding, LASSO (soft-thresholding), and adaptive LASSO. We show that among those intervals the symmetric ones are the shortest, and we show that hard-thresholding leads to longer intervals than the adaptive LASSO, which in turn leads to longer intervals than the LASSO.
All these intervals are longer than the standard confidence interval based on the maximum likelihood estimator, which is in line with Joshi (1969). In case the estimators are tuned to possess the 'sparsity' property, explicit asymptotic formulae for the length of the confidence intervals are furthermore obtained, showing that in this case the intervals based on the penalized maximum likelihood estimators are larger by an order of magnitude than the standard maximum likelihood based interval. This refines, for the particular estimators considered, a general result for confidence sets based on 'sparse' estimators (Pötscher (2009)). Additionally, in the 'sparsely' tuned case a simple asymptotic construction of confidence intervals is provided that also applies to other penalized maximum likelihood estimators such as the SCAD estimator. Furthermore, we show how the results for the known-variance case carry over to the unknown-variance case in an asymptotic sense.

The plan of the paper is as follows: After introducing the model and estimators in Section 2, the known-variance case is treated in Section 3 whereas the unknown-variance case is dealt with in Section 4. All proofs as well as some technical lemmata are relegated to the Appendix.

2 The Model and Estimators

For a normal linear regression model with orthogonal regressors, distributional properties of penalized maximum likelihood (least squares) estimators with a separable penalty can be reduced to the case of a Gaussian location problem; for details see, e.g., Pötscher and Schneider (2009). Since we are only interested in confidence sets for individual components of the parameter vector in the regression that are based on such estimators, we shall hence suppose that the data $y_1, \ldots, y_n$ are independent identically distributed as $N(\theta, \sigma^2)$, $\theta \in \mathbb{R}$, $0 < \sigma < \infty$.
[This entails no loss of generality in the known-variance case. In the unknown-variance case an explicit treatment of the orthogonal linear model would differ from the analysis in the present paper only in that the estimator $\hat\sigma^2$ defined below would be replaced by the usual residual variance estimator from the least-squares regression; this would have no substantial effect on the results.] We shall be concerned with confidence sets for $\theta$ based on penalized maximum likelihood estimators such as the hard-thresholding estimator, the LASSO (reducing to soft-thresholding in this setting), and the adaptive LASSO estimator. The hard-thresholding estimator $\tilde\theta_H$ is given by

$$\tilde\theta_H := \tilde\theta_H(\eta_n) = \bar y \, \mathbf{1}(|\bar y| > \hat\sigma \eta_n)$$

where the threshold $\eta_n$ is a positive real number, $\bar y$ denotes the maximum likelihood estimator, i.e., the arithmetic mean of the data, and $\hat\sigma^2 = (n-1)^{-1} \sum_{i=1}^{n} (y_i - \bar y)^2$. Also define the infeasible estimator

$$\hat\theta_H := \hat\theta_H(\eta_n) = \bar y \, \mathbf{1}(|\bar y| > \sigma \eta_n)$$

which uses the value of $\sigma$. The LASSO (or soft-thresholding) estimator $\tilde\theta_S$ is given by

$$\tilde\theta_S := \tilde\theta_S(\eta_n) = \operatorname{sign}(\bar y)\,(|\bar y| - \hat\sigma \eta_n)_+$$

and its infeasible version by

$$\hat\theta_S := \hat\theta_S(\eta_n) = \operatorname{sign}(\bar y)\,(|\bar y| - \sigma \eta_n)_+ .$$

Here $\operatorname{sign}(x)$ is defined as $-1$, $0$, and $1$ in case $x < 0$, $x = 0$, and $x > 0$, respectively, and $z_+$ is shorthand for $\max\{z, 0\}$. The adaptive LASSO estimator $\tilde\theta_A$ in this simple model is given by

$$\tilde\theta_A := \tilde\theta_A(\eta_n) = \bar y\,(1 - \hat\sigma^2 \eta_n^2 / \bar y^2)_+ = \begin{cases} 0 & \text{if } |\bar y| \le \hat\sigma \eta_n \\ \bar y - \hat\sigma^2 \eta_n^2 / \bar y & \text{if } |\bar y| > \hat\sigma \eta_n, \end{cases}$$

and its infeasible counterpart by

$$\hat\theta_A := \hat\theta_A(\eta_n) = \bar y\,(1 - \sigma^2 \eta_n^2 / \bar y^2)_+ = \begin{cases} 0 & \text{if } |\bar y| \le \sigma \eta_n \\ \bar y - \sigma^2 \eta_n^2 / \bar y & \text{if } |\bar y| > \sigma \eta_n. \end{cases}$$

It coincides with the nonnegative garrote in this simple model. For the feasible estimators we always need to assume $n \ge 2$, whereas for the infeasible estimators also $n = 1$ is admissible.
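The three estimators defined above are simple functions of $\bar y$ and of the product of a scale and the tuning parameter, so they can be sketched in a few lines. The following Python sketch is only an illustration: the function names and the `scale` argument (which stands for $\sigma$ in the infeasible versions and for $\hat\sigma$ in the feasible ones) are our own labels and are not notation from the paper.

```python
import math
from statistics import mean, stdev


def hard_threshold(ybar, scale, eta):
    """Hard-thresholding: ybar * 1(|ybar| > scale * eta)."""
    return ybar if abs(ybar) > scale * eta else 0.0


def soft_threshold(ybar, scale, eta):
    """Soft-thresholding (LASSO): sign(ybar) * (|ybar| - scale * eta)_+."""
    return math.copysign(max(abs(ybar) - scale * eta, 0.0), ybar)


def adaptive_lasso(ybar, scale, eta):
    """Adaptive LASSO: ybar * (1 - scale^2 * eta^2 / ybar^2)_+."""
    threshold = scale * eta
    if abs(ybar) <= threshold:
        return 0.0
    return ybar - threshold * threshold / ybar


def feasible_estimators(y, eta):
    """Feasible versions: plug in the sample mean and sigma-hat (needs n >= 2)."""
    ybar = mean(y)
    sigma_hat = stdev(y)  # square root of (n-1)^{-1} * sum (y_i - ybar)^2
    return {
        "hard": hard_threshold(ybar, sigma_hat, eta),
        "soft": soft_threshold(ybar, sigma_hat, eta),
        "adaptive": adaptive_lasso(ybar, sigma_hat, eta),
    }
```

Passing the true $\sigma$ as `scale` gives the infeasible estimators $\hat\theta_H$, $\hat\theta_S$, $\hat\theta_A$; `feasible_estimators` gives $\tilde\theta_H$, $\tilde\theta_S$, $\tilde\theta_A$.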
Note that $\eta_n$ plays the rôle of a tuning parameter and it is most natural to let the estimators depend on the tuning parameter only via $\sigma\eta_n$ and $\hat\sigma\eta_n$, respectively, in order to take account of the scale of the data. This makes the estimators mentioned above scale equivariant. We shall often suppress dependence of the estimators on $\eta_n$ in the notation. In the following let $P_{n,\theta,\sigma}$ denote the distribution of the sample when $\theta$ and $\sigma$ are the true parameters. Furthermore, let $\Phi$ denote the standard normal cumulative distribution function.

We also note the following obvious fact: Since hard- and soft-thresholding operate in a coordinatewise fashion, the results given below also apply mutatis mutandis to linear regressions with non-orthogonal regressors. Of course, the soft-thresholding estimator then no longer coincides with the LASSO estimator. We refrain from spelling out details.

3 Confidence Intervals: Known-Variance Case

In this section we consider the case where the variance $\sigma^2$ is known, $n \ge 1$ holds, and we are interested in the finite-sample coverage properties of intervals of the form $[\hat\theta - \sigma a_n, \hat\theta + \sigma b_n]$ where $a_n$ and $b_n$ are nonnegative real numbers and $\hat\theta$ stands for any one of the estimators $\hat\theta_H = \hat\theta_H(\eta_n)$, $\hat\theta_S = \hat\theta_S(\eta_n)$, or $\hat\theta_A = \hat\theta_A(\eta_n)$. We shall also consider one-sided intervals $(-\infty, \hat\theta + \sigma c_n]$ and $[\hat\theta - \sigma c_n, \infty)$ with $0 \le c_n < \infty$. Let

$$p_n(\theta; \sigma, \eta_n, a_n, b_n) = P_{n,\theta,\sigma}\left(\theta \in [\hat\theta - \sigma a_n, \hat\theta + \sigma b_n]\right)$$

denote the coverage probability. Due to the above-noted scale equivariance of the estimator $\hat\theta$, it is obvious that

$$p_n(\theta; \sigma, \eta_n, a_n, b_n) = p_n(\theta/\sigma; 1, \eta_n, a_n, b_n)$$

holds, and the same is true for the one-sided intervals. In particular, it follows that the infimal coverage probabilities $\inf_{\theta \in \mathbb{R}} p_n(\theta; \sigma, \eta_n, a_n, b_n)$ do not depend on $\sigma$.
Therefore, we shall assume without loss of generality that $\sigma = 1$ for the remainder of this section and we shall write $P_{n,\theta}$ for $P_{n,\theta,1}$.

3.1 Infimal coverage probabilities in finite samples

We begin with soft-thresholding. Let $C_{S,n}$ denote the interval $[\hat\theta_S - a_n, \hat\theta_S + b_n]$. We first determine the infimum of the coverage probability $p_{S,n}(\theta) := p_{S,n}(\theta; 1, \eta_n, a_n, b_n) = P_{n,\theta}(\theta \in C_{S,n})$ of this interval.

Proposition 1 For every $n \ge 1$, the infimal coverage probability of the interval $C_{S,n}$ is given by

$$\inf_{\theta \in \mathbb{R}} p_{S,n}(\theta) = \begin{cases} \Phi(n^{1/2}(a_n - \eta_n)) - \Phi(n^{1/2}(-b_n - \eta_n)) & \text{if } a_n \le b_n \\ \Phi(n^{1/2}(b_n - \eta_n)) - \Phi(n^{1/2}(-a_n - \eta_n)) & \text{if } a_n > b_n. \end{cases} \tag{1}$$

As a point of interest we note that $p_{S,n}(\theta)$ is a piecewise constant function with jumps at $\theta = -a_n$ and $\theta = b_n$.

Next we turn to hard-thresholding. Let $C_{H,n}$ denote the interval $[\hat\theta_H - a_n, \hat\theta_H + b_n]$. The infimum of the coverage probability $p_{H,n}(\theta) := p_{H,n}(\theta; 1, \eta_n, a_n, b_n) = P_{n,\theta}(\theta \in C_{H,n})$ of this interval has been obtained in Proposition 3.1 in Pötscher (2009), which we repeat for convenience.

Proposition 2 For every $n \ge 1$, the infimal coverage probability of the interval $C_{H,n}$ is given by

$$\inf_{\theta \in \mathbb{R}} p_{H,n}(\theta) = \begin{cases} \Phi(n^{1/2}(a_n - \eta_n)) - \Phi(-n^{1/2} b_n) & \text{if } \eta_n \le a_n + b_n \text{ and } a_n \le b_n \\ \Phi(n^{1/2}(b_n - \eta_n)) - \Phi(-n^{1/2} a_n) & \text{if } \eta_n \le a_n + b_n \text{ and } a_n > b_n \\ 0 & \text{if } \eta_n > a_n + b_n. \end{cases} \tag{2}$$

For later use we observe that the interval $C_{H,n}$ has positive infimal coverage probability if and only if the length of the interval $a_n + b_n$ is larger than $\eta_n$. As a point of interest we also note that the coverage probability $p_{H,n}(\theta)$ is discontinuous (with discontinuity points at $\theta = -a_n$ and $\theta = b_n$). Furthermore, as discussed in Pötscher (2009), the infimum in (2) is attained if $\eta_n > a_n + b_n$, but not in case $\eta_n \le a_n + b_n$.
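Formulae (1) and (2) are straightforward to evaluate numerically. The sketch below, with function names of our own choosing, transcribes them for $\sigma = 1$; it relies only on the standard normal cdf from Python's standard library and is meant as an illustration rather than as part of the original analysis.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cumulative distribution function


def inf_coverage_soft(n, eta, a, b):
    """Infimal coverage of [theta_S - a, theta_S + b], formula (1), sigma = 1."""
    root_n = math.sqrt(n)
    shorter, longer = min(a, b), max(a, b)
    return Phi(root_n * (shorter - eta)) - Phi(root_n * (-longer - eta))


def inf_coverage_hard(n, eta, a, b):
    """Infimal coverage of [theta_H - a, theta_H + b], formula (2), sigma = 1."""
    if eta > a + b:
        return 0.0
    root_n = math.sqrt(n)
    shorter, longer = min(a, b), max(a, b)
    return Phi(root_n * (shorter - eta)) - Phi(-root_n * longer)
```

Taking the minimum and maximum of $a_n$ and $b_n$ merges the two cases $a_n \le b_n$ and $a_n > b_n$ of each formula into a single expression. For equal arguments the hard-thresholding value never exceeds the soft-thresholding value, because $\Phi(-n^{1/2} b_n) \ge \Phi(n^{1/2}(-b_n - \eta_n))$.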
Finally, we consider the adaptive LASSO. Let $C_{A,n}$ denote the interval $[\hat\theta_A - a_n, \hat\theta_A + b_n]$. The infimum of the coverage probability $p_{A,n}(\theta) := p_{A,n}(\theta; 1, \eta_n, a_n, b_n) = P_{n,\theta}(\theta \in C_{A,n})$ of this interval is given next.

Proposition 3 For every $n \ge 1$, the infimal coverage probability of $C_{A,n}$ is given by

$$\inf_{\theta \in \mathbb{R}} p_{A,n}(\theta) = \Phi(n^{1/2}(a_n - \eta_n)) - \Phi\left(n^{1/2}\left((a_n - b_n)/2 - \sqrt{((a_n + b_n)/2)^2 + \eta_n^2}\right)\right)$$

if $a_n \le b_n$, and by

$$\inf_{\theta \in \mathbb{R}} p_{A,n}(\theta) = \Phi(n^{1/2}(b_n - \eta_n)) - \Phi\left(n^{1/2}\left((b_n - a_n)/2 - \sqrt{((a_n + b_n)/2)^2 + \eta_n^2}\right)\right)$$

if $a_n > b_n$.

We note that $p_{A,n}$ is continuous except at $\theta = b_n$ and $\theta = -a_n$ and that the infimum of $p_{A,n}$ is not attained, which can be seen from a simple refinement of the proof of Proposition 3.

Remark 4 (i) If we consider the open interval $C^o_{S,n} = (\hat\theta_S - a_n, \hat\theta_S + b_n)$ the formula for the coverage probability becomes

$$\begin{aligned} P_{n,\theta}\left(\theta \in C^o_{S,n}\right) ={} & [\Phi(n^{1/2}(a_n - \eta_n)) - \Phi(n^{1/2}(-b_n - \eta_n))]\, \mathbf{1}(\theta \le -a_n) \\ & + [\Phi(n^{1/2}(a_n + \eta_n)) - \Phi(n^{1/2}(-b_n - \eta_n))]\, \mathbf{1}(-a_n < \theta < b_n) \\ & + [\Phi(n^{1/2}(a_n + \eta_n)) - \Phi(n^{1/2}(-b_n + \eta_n))]\, \mathbf{1}(b_n \le \theta). \end{aligned}$$

As a consequence, the infimal coverage probability of $C^o_{S,n}$ is again given by (1). A fortiori, the half-open intervals $(\hat\theta_S - a_n, \hat\theta_S + b_n]$ and $[\hat\theta_S - a_n, \hat\theta_S + b_n)$ then also have infimal coverage probability given by (1).

(ii) For the open interval $C^o_{H,n} = (\hat\theta_H - a_n, \hat\theta_H + b_n)$ the coverage probability satisfies

$$P_{n,\theta}\left(\theta \in C^o_{H,n}\right) = P_{n,\theta}(\theta \in C_{H,n}) - [\mathbf{1}(\theta = b_n) + \mathbf{1}(\theta = -a_n)]\,[\Phi(n^{1/2}(-\theta + \eta_n)) - \Phi(n^{1/2}(-\theta - \eta_n))].$$

Inspection of the proof of Proposition 3.1 in Pötscher (2009) then shows that $C^o_{H,n}$ has the same infimal coverage probability as $C_{H,n}$. However, now the infimum is always a minimum.
Furthermore, the half-open intervals $(\hat\theta_H - a_n, \hat\theta_H + b_n]$ and $[\hat\theta_H - a_n, \hat\theta_H + b_n)$ then a fortiori have infimal coverage probability given by (2); for these intervals the infimum is attained if $\eta_n > a_n + b_n$, but not necessarily if $\eta_n \le a_n + b_n$.

(iii) If $C^o_{A,n}$ denotes the open interval $(\hat\theta_A - a_n, \hat\theta_A + b_n)$, the formula for the coverage probability becomes

$$P_{n,\theta}\left(\theta \in C^o_{A,n}\right) = \begin{cases} \Phi\left(n^{1/2}\gamma^{(-)}(\theta, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(-)}(\theta, b_n)\right) & \text{if } \theta \le -a_n \\ \Phi\left(n^{1/2}\gamma^{(+)}(\theta, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(-)}(\theta, b_n)\right) & \text{if } -a_n < \theta < b_n \\ \Phi\left(n^{1/2}\gamma^{(+)}(\theta, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(+)}(\theta, b_n)\right) & \text{if } \theta \ge b_n, \end{cases}$$

where $\gamma^{(-)}$ and $\gamma^{(+)}$ are defined in (17) and (18) in the Appendix. Again the coverage probability is continuous except at $\theta = b_n$ and $\theta = -a_n$ (and is continuous everywhere in the trivial case $a_n = b_n = 0$). It is now easy to see that the infimal coverage probability of $C^o_{A,n}$ coincides with the infimal coverage probability of the closed interval $C_{A,n}$, the infimum of the coverage probability of $C^o_{A,n}$ now always being a minimum. Furthermore, the half-open intervals $(\hat\theta_A - a_n, \hat\theta_A + b_n]$ and $[\hat\theta_A - a_n, \hat\theta_A + b_n)$ a fortiori have the same infimal coverage probability as $C_{A,n}$ and $C^o_{A,n}$.

(iv) The one-sided intervals $(-\infty, \hat\theta_S + c_n]$, $(-\infty, \hat\theta_S + c_n)$, $[\hat\theta_S - c_n, \infty)$, $(\hat\theta_S - c_n, \infty)$, $(-\infty, \hat\theta_H + c_n]$, $(-\infty, \hat\theta_H + c_n)$, $[\hat\theta_H - c_n, \infty)$, $(\hat\theta_H - c_n, \infty)$, $(-\infty, \hat\theta_A + c_n]$, $(-\infty, \hat\theta_A + c_n)$, $(\hat\theta_A - c_n, \infty)$, and $[\hat\theta_A - c_n, \infty)$, with $c_n$ a nonnegative real number, have infimal coverage probability $\Phi(n^{1/2}(c_n - \eta_n))$.
This is easy to see for soft-thresholding, follows from the reasoning in Pötscher (2009) for hard-thresholding, and for the adaptive LASSO follows by similar, but simpler, reasoning as in the proof of Proposition 3.

3.2 Symmetric intervals are shortest

For the two-sided confidence sets considered above, we next show that given a prescribed infimal coverage probability the symmetric intervals are shortest. We then show that these shortest intervals are longer than the standard interval based on the maximum likelihood estimator and quantify the excess length of these intervals.

Theorem 5 For every $n \ge 1$ and every $\delta$ satisfying $0 < \delta < 1$ we have:

(a) Among all intervals $C_{S,n}$ with infimal coverage probability not less than $\delta$ there is a unique shortest interval $C^*_{S,n} = [\hat\theta_S - a^*_{n,S}, \hat\theta_S + b^*_{n,S}]$ characterized by $a^*_{n,S} = b^*_{n,S}$ with $a^*_{n,S}$ being the unique solution of

$$\Phi(n^{1/2}(a_n - \eta_n)) - \Phi(n^{1/2}(-a_n - \eta_n)) = \delta. \tag{3}$$

The interval $C^*_{S,n}$ has infimal coverage probability equal to $\delta$ and $a^*_{n,S}$ is positive.

(b) Among all intervals $C_{H,n}$ with infimal coverage probability not less than $\delta$ there is a unique shortest interval $C^*_{H,n} = [\hat\theta_H - a^*_{n,H}, \hat\theta_H + b^*_{n,H}]$ characterized by $a^*_{n,H} = b^*_{n,H}$ with $a^*_{n,H}$ being the unique solution of

$$\Phi(n^{1/2}(a_n - \eta_n)) - \Phi(-n^{1/2} a_n) = \delta. \tag{4}$$

The interval $C^*_{H,n}$ has infimal coverage probability equal to $\delta$ and $a^*_{n,H}$ satisfies $a^*_{n,H} > \eta_n/2$.

(c) Among all intervals $C_{A,n}$ with infimal coverage probability not less than $\delta$ there is a unique shortest interval $C^*_{A,n} = [\hat\theta_A - a^*_{n,A}, \hat\theta_A + b^*_{n,A}]$ characterized by $a^*_{n,A} = b^*_{n,A}$ with $a^*_{n,A}$ being the unique solution of

$$\Phi(n^{1/2}(a_n - \eta_n)) - \Phi\left(-n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) = \delta. \tag{5}$$

The interval $C^*_{A,n}$ has infimal coverage probability equal to $\delta$ and $a^*_{n,A}$ is positive.
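Equations (3), (4), and (5) have no closed-form solution, but each left-hand side is strictly increasing in $a_n \ge 0$, so the half-lengths can be found by bisection. The following sketch is our own illustration and not the authors' code; it assumes that $\delta$ is not extremely close to 1, so that the bracket $[0, \eta_n + 10\, n^{-1/2}]$ contains the root, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
Phi = STD_NORMAL.cdf


def solve_increasing(f, lo, hi, iterations=200):
    """Bisection for the root of an increasing function with f(lo) < 0 < f(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def shortest_half_lengths(n, eta, delta):
    """Half-lengths solving (3), (4), (5), plus the standard MLE half-length."""
    r = math.sqrt(n)
    upper = eta + 10.0 / r  # assumed bracket; adequate unless delta is nearly 1

    def soft(a):
        return Phi(r * (a - eta)) - Phi(r * (-a - eta)) - delta

    def hard(a):
        return Phi(r * (a - eta)) - Phi(-r * a) - delta

    def adaptive(a):
        return Phi(r * (a - eta)) - Phi(-r * math.sqrt(a * a + eta * eta)) - delta

    return {
        "soft": solve_increasing(soft, 0.0, upper),
        "hard": solve_increasing(hard, 0.0, upper),
        "adaptive": solve_increasing(adaptive, 0.0, upper),
        "mle": STD_NORMAL.inv_cdf((1.0 + delta) / 2.0) / r,
    }
```

For example, `shortest_half_lengths(100, 0.05, 0.95)` corresponds to $n^{1/2}\eta_n = 0.5$ and $\delta = 0.95$; the four returned values should follow the ordering hard-thresholding > adaptive LASSO > LASSO > maximum likelihood described in the introduction.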
In the statistically uninteresting case $\delta = 0$ the interval with $a_n = b_n = 0$ is the unique shortest interval in all three cases. However, for the case of the hard-thresholding estimator also any interval with $a_n = b_n$ and $a_n \le \eta_n/2$ has infimal coverage probability equal to zero.

Given that the distributions of the estimation errors $\hat\theta_S - \theta$, $\hat\theta_H - \theta$, and $\hat\theta_A - \theta$ are not symmetric (see Pötscher and Leeb (2009), Pötscher and Schneider (2009)), it may seem surprising at first glance that the shortest confidence intervals are symmetric. Some intuition for this phenomenon can be gained on the grounds that the distributions of the estimation errors under $\theta = \tau$ and $\theta = -\tau$ are mirror-images of one another.

The above theorem shows that given a prespecified $\delta$ ($0 < \delta < 1$), the shortest confidence set with infimal coverage probability equal to $\delta$ based on the soft-thresholding (LASSO) estimator is shorter than the corresponding interval based on the adaptive LASSO estimator, which in turn is shorter than the corresponding interval based on the hard-thresholding estimator. All three intervals are longer than the corresponding standard confidence interval based on the maximum likelihood estimator. That is,

$$a^*_{n,H} > a^*_{n,A} > a^*_{n,S} > n^{-1/2}\,\Phi^{-1}((1 + \delta)/2).$$

Figure 1 below shows $n^{1/2}$ times the half-length of the shortest $\delta$-level confidence intervals based on hard-thresholding, adaptive LASSO, soft-thresholding, and the maximum likelihood estimator, respectively, as a function of $n^{1/2}\eta_n$ for various values of $\delta$. The graphs illustrate that the intervals based on hard-thresholding, adaptive LASSO, and soft-thresholding substantially exceed the length of the maximum likelihood based interval except if $n^{1/2}\eta_n$ is very small.
For large values of $n^{1/2}\eta_n$ the graphs suggest a linear increase in the length of the intervals based on the penalized estimators. This is formally confirmed in Section 3.2.1 below.

[Figure 1 consists of four panels (delta=0.95, delta=0.9, delta=0.8, delta=0.5), each with the curves MLE, LASSO, ad. LASSO, and hard thres.; the plots themselves are not reproduced here.]

Figure 1: $n^{1/2} a^*_{n,H}$, $n^{1/2} a^*_{n,A}$, $n^{1/2} a^*_{n,S}$ as a function of $n^{1/2}\eta_n$ for coverage probabilities $\delta = 0.5$, $0.8$, $0.9$, $0.95$. The horizontal line at height $\Phi^{-1}((1 + \delta)/2)$ indicates $n^{1/2}$ times the half-length of the standard maximum likelihood based interval.

3.2.1 Asymptotic behavior of the length

It is well-known that as $n \to \infty$ two different regimes for the tuning parameter $\eta_n$ can be distinguished. In the first regime $\eta_n \to 0$ and $n^{1/2}\eta_n \to e$, $0 < e < \infty$. This choice of tuning parameter leads to estimators $\hat\theta_S$, $\hat\theta_H$, and $\hat\theta_A$ that perform conservative model selection. In the second regime $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$, leading to estimators $\hat\theta_S$, $\hat\theta_H$, and $\hat\theta_A$ that perform consistent model selection (also known as the 'sparsity property'); that is, with probability approaching 1, the estimators are exactly zero if the true value $\theta = 0$, and they are different from zero if $\theta \ne 0$. See Pötscher and Leeb (2009) and Pötscher and Schneider (2009) for a detailed discussion.

We now discuss the asymptotic behavior, under the two regimes, of the half-length $a^*_{n,S}$, $a^*_{n,H}$, and $a^*_{n,A}$ of the shortest intervals $C^*_{S,n}$, $C^*_{H,n}$, and $C^*_{A,n}$ with a fixed infimal coverage probability $\delta$, $0 < \delta < 1$.
If $\eta_n \to 0$ and $n^{1/2}\eta_n \to e$, $0 < e < \infty$, then it follows immediately from Theorem 5 that $n^{1/2} a^*_{n,S}$, $n^{1/2} a^*_{n,H}$, and $n^{1/2} a^*_{n,A}$ converge to the unique solutions of

$$\Phi(a - e) - \Phi(-a - e) = \delta, \tag{6}$$

$$\Phi(a - e) - \Phi(-a) = \delta, \tag{7}$$

and

$$\Phi\left(\sqrt{a^2 + e^2}\right) - \Phi(-a + e) = \delta, \tag{8}$$

respectively. [Actually, this is even true if $e = 0$.] Hence, while $a^*_{n,H}$, $a^*_{n,A}$, and $a^*_{n,S}$ are larger than the half-length $n^{-1/2}\Phi^{-1}((1 + \delta)/2)$ of the standard interval, they are of the same order $n^{-1/2}$.

The situation is different, however, if $\eta_n \to 0$ but $n^{1/2}\eta_n \to \infty$. In this case Theorem 5 shows that

$$\Phi(n^{1/2}(a^*_{n,S} - \eta_n)) \to \delta$$

since $n^{1/2}(-a^*_{n,S} - \eta_n) \le -n^{1/2}\eta_n \to -\infty$. In other words,

$$a^*_{n,S} = \eta_n + n^{-1/2}\Phi^{-1}(\delta) + o(n^{-1/2}). \tag{9}$$

Similarly, noting that $n^{1/2} a^*_{n,H} > n^{1/2}\eta_n/2 \to \infty$, we get

$$a^*_{n,H} = \eta_n + n^{-1/2}\Phi^{-1}(\delta) + o(n^{-1/2}); \tag{10}$$

and since $n^{1/2}\sqrt{(a^*_{n,A})^2 + \eta_n^2} \ge n^{1/2}\eta_n \to \infty$ we obtain

$$a^*_{n,A} = \eta_n + n^{-1/2}\Phi^{-1}(\delta) + o(n^{-1/2}). \tag{11}$$

[Actually, the condition $\eta_n \to 0$ has not been used in the derivation of (9)-(11).] Hence, the intervals $C^*_{S,n}$, $C^*_{H,n}$, and $C^*_{A,n}$ are asymptotically of the same length. They are also longer than the standard interval by an order of magnitude: the ratio of each of $a^*_{n,S}$ ($a^*_{n,H}$, $a^*_{n,A}$, respectively) to the half-length of the standard interval is of the order $n^{1/2}\eta_n$, which diverges to infinity. Hence, when the estimators $\hat\theta_S$, $\hat\theta_H$, and $\hat\theta_A$ are tuned to possess the 'sparsity property', the corresponding confidence sets become very large. For the particular intervals considered here this is a refinement of a general result in Pötscher (2009) for confidence sets based on arbitrary estimators possessing the 'sparsity property'.
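The transition between the two regimes can be made visible numerically for the soft-thresholding case. In rescaled units $x = n^{1/2} a_n$ and $e = n^{1/2}\eta_n$, equation (3) reads $\Phi(x - e) - \Phi(-x - e) = \delta$, which is (6). The sketch below (our own illustration, with hypothetical function names and a bracket that assumes $\delta$ is not extremely close to 1) solves this equation and returns the excess $x - e$, which equals $\Phi^{-1}((1+\delta)/2)$ for $e = 0$ and, in line with (9), approaches $\Phi^{-1}(\delta)$ as $e$ grows.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Phi = STD_NORMAL.cdf


def rescaled_soft_half_length(e, delta, iterations=200):
    """Solve Phi(x - e) - Phi(-x - e) = delta for x >= 0 by bisection."""
    lo, hi = 0.0, e + 10.0  # assumed bracket; adequate unless delta is nearly 1
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if Phi(mid - e) - Phi(-mid - e) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def excess_over_threshold(e, delta):
    """n^{1/2} (a*_{n,S} - eta_n) as a function of e = n^{1/2} eta_n."""
    return rescaled_soft_half_length(e, delta) - e
```

Evaluating `excess_over_threshold(e, 0.95)` on an increasing grid of `e` shows the excess moving from the two-sided quantile $\Phi^{-1}(0.975)$ down towards the one-sided quantile $\Phi^{-1}(0.95)$, while the half-length itself grows like $e$.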
[We note that the sparsely tuned hard-thresholding estimator or the sparsely tuned adaptive LASSO (under an additional condition on $\eta_n$) are known to possess the so-called 'oracle property'. In light of the 'oracle property' it is sometimes argued in the literature that valid confidence intervals based on these estimators with length proportional to $n^{-1/2}$ can be obtained. However, in light of the above discussion such intervals necessarily have infimal coverage probability that converges to zero and thus are not valid. This once more shows that fixed-parameter asymptotic results like the 'oracle' property can be dangerously misleading.]

3.3 A simple asymptotic confidence interval

The results for the finite-sample confidence intervals given in Section 3.1 required a detailed case by case analysis based on the finite-sample distribution of the estimator on which the interval is based. If the estimators $\hat\theta_S$, $\hat\theta_H$, and $\hat\theta_A$ are tuned to possess the 'sparsity property', i.e., if the tuning parameter satisfies $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$, a simple asymptotic confidence interval construction relying on asymptotic results obtained in Pötscher and Leeb (2009) and Pötscher and Schneider (2009) is possible as shown below. An advantage of this construction is that it easily extends to other estimators like the smoothly clipped absolute deviation (SCAD) estimator when tuned to possess the 'sparsity property'.

As shown in Pötscher and Leeb (2009) and Pötscher and Schneider (2009), the uniform rate of consistency of the 'sparsely' tuned estimators $\hat\theta_S$, $\hat\theta_H$, and $\hat\theta_A$ is not $n^{1/2}$, but only $\eta_n^{-1}$; furthermore, the limiting distributions of these estimators under the appropriate $\eta_n^{-1}$-scaling and under a moving-parameter asymptotic framework are always concentrated on the interval $[-1, 1]$. These facts can be used to obtain the following result.
Proposition 6 Suppose $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$. Let $\hat\theta$ stand for any of the estimators $\hat\theta_S(\eta_n)$, $\hat\theta_H(\eta_n)$, or $\hat\theta_A(\eta_n)$. Let $d$ be a real number, and define the interval $D_n = [\hat\theta - d\eta_n, \hat\theta + d\eta_n]$. If $d > 1$, the interval $D_n$ has infimal coverage probability converging to 1, i.e.,

$$\lim_{n \to \infty} \inf_{\theta \in \mathbb{R}} P_{n,\theta}(\theta \in D_n) = 1.$$

If $d < 1$,

$$\lim_{n \to \infty} \inf_{\theta \in \mathbb{R}} P_{n,\theta}(\theta \in D_n) = 0.$$

The asymptotic distributional results in the above proposition do not provide information on the case $d = 1$. However, from the finite-sample results in Section 3.1 we see that in this case the infimal coverage probability of $D_n$ converges to $1/2$.

Since the interval $D_n$ for $d > 1$ has asymptotic infimal coverage probability equal to one, one may wonder how much cruder this interval is compared to the finite-sample intervals $C^*_{S,n}$, $C^*_{H,n}$, and $C^*_{A,n}$ constructed in Section 3.2, which have infimal coverage probability equal to a prespecified level $\delta$, $0 < \delta < 1$: The ratio of the half-length of $D_n$ to the half-length of the corresponding interval $C^*_{S,n}$, $C^*_{H,n}$, and $C^*_{A,n}$ is

$$d\,(1 + O(n^{-1/2}\eta_n^{-1})) = d\,(1 + o(1))$$

as can be seen from equations (9), (10), and (11). Since $d$ can be chosen arbitrarily close to one, this ratio can be made arbitrarily close to one. This may sound somewhat strange, since we are comparing an interval with asymptotic infimal coverage probability 1 with the shortest finite-sample confidence intervals that have a fixed infimal coverage probability $\delta$ less than 1. The reason for this phenomenon is that, in the relevant moving-parameter asymptotic framework, the distribution of $\hat\theta - \theta$ is made up of a bias-component which in the worst case is of the order $\eta_n$ and a random component which is of the order $n^{-1/2}$. Since $\eta_n \to 0$ and $n^{1/2}\eta_n \to \infty$, the deterministic bias-component dominates the random component.
This can also be gleaned from equations (9), (10), and (11), where the level $\delta$ enters the formula for the half-length only in the lower order term.

We note that using Theorem 19 in Pötscher and Leeb (2009) the same proof immediately shows that Proposition 6 also holds for the smoothly clipped absolute deviation (SCAD) estimator when tuned to possess the 'sparsity property'. In fact, the argument in the proof of the above proposition can be applied to a large class of post-model-selection estimators based on a consistent model selection procedure.

Remark 7 (i) Suppose $D'_n = [\hat\theta - d_1\eta_n, \hat\theta + d_2\eta_n]$ where $\hat\theta$ stands for any of the estimators $\hat\theta_S$, $\hat\theta_H$, or $\hat\theta_A$. If $\min(d_1, d_2) > 1$, then the limit of the infimal coverage probability of $D'_n$ is 1; if $\max(d_1, d_2) < 1$ then this limit is zero. This follows immediately from an inspection of the proof of Proposition 6.

(ii) Proposition 6 also remains correct if $D_n$ is replaced by the corresponding open interval. A similar comment applies to the open version of $D'_n$.

4 Confidence Intervals: Unknown-Variance Case

In this section we consider the case where the variance $\sigma^2$ is unknown, $n \ge 2$, and we are interested in the coverage properties of intervals of the form $[\tilde\theta - \hat\sigma a_n, \tilde\theta + \hat\sigma a_n]$ where $a_n$ is a nonnegative real number and $\tilde\theta$ stands for any one of the estimators $\tilde\theta_H = \tilde\theta_H(\eta_n)$, $\tilde\theta_S = \tilde\theta_S(\eta_n)$, or $\tilde\theta_A = \tilde\theta_A(\eta_n)$. For brevity we only consider symmetric intervals. A similar argument as in the known-variance case shows that we can assume without loss of generality that $\sigma = 1$, and we shall do so in the sequel; in particular, this argument shows that the infimum with respect to $\theta$ of the coverage probability does not depend on $\sigma$.
4.1 Soft-thresholding

Consider the interval $E_{S,n} = [\tilde\theta_S - \hat\sigma a_n, \tilde\theta_S + \hat\sigma a_n]$ where $a_n$ is a nonnegative real number and $\tilde\theta_S = \tilde\theta_S(\eta_n)$. We then have

$$P_{n,\theta}(\theta \in E_{S,n}) = \int_0^\infty P_{n,\theta}(\theta \in E_{S,n} \mid \hat\sigma = s)\, h_n(s)\, ds$$

where $h_n$ is the density of $\hat\sigma$, i.e., $h_n$ is the density of the square root of a chi-square distributed random variable with $n - 1$ degrees of freedom divided by the degrees of freedom. In view of independence of $\hat\sigma$ and $\bar y$ we obtain the following representation of the finite-sample coverage probability

$$\begin{aligned} P_{n,\theta}(\theta \in E_{S,n}) &= \int_0^\infty P_{n,\theta}\left(\theta \in [\hat\theta_S(s\eta_n) - s a_n, \hat\theta_S(s\eta_n) + s a_n]\right) h_n(s)\, ds \\ &= \int_0^\infty p_{S,n}(\theta; 1, s\eta_n, s a_n, s a_n)\, h_n(s)\, ds \end{aligned} \tag{12}$$

where $p_{S,n}$ is given in (15) in the Appendix. We next determine the infimal coverage probability of $E_{S,n}$ in finite samples: It follows from (15), the dominated convergence theorem, and symmetry of the standard normal distribution that

$$\begin{aligned} \inf_{\theta \in \mathbb{R}} P_{n,\theta}(\theta \in E_{S,n}) &\le \lim_{\theta \to \infty} \int_0^\infty p_{S,n}(\theta; 1, s\eta_n, s a_n, s a_n)\, h_n(s)\, ds \\ &= \int_0^\infty \lim_{\theta \to \infty} p_{S,n}(\theta; 1, s\eta_n, s a_n, s a_n)\, h_n(s)\, ds \\ &= \int_0^\infty [\Phi(n^{1/2} s (a_n - \eta_n)) - \Phi(n^{1/2} s (-a_n - \eta_n))]\, h_n(s)\, ds \\ &= T_{n-1}(n^{1/2}(a_n - \eta_n)) - T_{n-1}(n^{1/2}(-a_n - \eta_n)), \end{aligned} \tag{13}$$

where $T_{n-1}$ is the cdf of a Student $t$-distribution with $n - 1$ degrees of freedom. Furthermore, (1) shows that

$$p_{S,n}(\theta; 1, s\eta_n, s a_n, s a_n) \ge \Phi(n^{1/2} s (a_n - \eta_n)) - \Phi(n^{1/2} s (-a_n - \eta_n))$$

holds and whence we obtain from (12) and (13) the following expression for the infimal coverage probability of $E_{S,n}$:

$$\inf_{\theta \in \mathbb{R}} P_{n,\theta}(\theta \in E_{S,n}) = T_{n-1}(n^{1/2}(a_n - \eta_n)) - T_{n-1}(n^{1/2}(-a_n - \eta_n)) \tag{14}$$

for every $n \ge 2$. Remark 4 shows that the same relation is true for the corresponding open and half-open intervals.
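The last step in (13) uses the fact that mixing $\Phi(s x)$ over the density $h_n$ of $\hat\sigma$ yields the Student $t$ cdf $T_{n-1}(x)$. The sketch below checks this numerically and then evaluates (14). It is an illustration under our own assumptions: the integral is truncated at `upper = 12` and approximated by Simpson's rule with a fixed number of steps, and the function names are hypothetical. The density used is that of $\sqrt{\chi^2_k / k}$ with $k = n - 1$, namely $h(s) = 2\,(k/2)^{k/2}\, s^{k-1} e^{-k s^2/2} / \Gamma(k/2)$.

```python
import math
from statistics import NormalDist

Phi = NormalDist().cdf


def h_density(s, k):
    """Density at s >= 0 of sqrt(chi2_k / k), with k = n - 1 degrees of freedom."""
    log_c = math.log(2.0) + 0.5 * k * math.log(0.5 * k) - math.lgamma(0.5 * k)
    if s == 0.0:
        return math.exp(log_c) if k == 1 else 0.0
    return math.exp(log_c + (k - 1) * math.log(s) - 0.5 * k * s * s)


def t_cdf_by_mixture(x, k, upper=12.0, steps=20000):
    """Approximate T_k(x) = int_0^inf Phi(s x) h(s) ds with Simpson's rule."""
    width = upper / steps  # steps must be even for Simpson's rule
    total = 0.0
    for i in range(steps + 1):
        s = i * width
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * Phi(s * x) * h_density(s, k)
    return total * width / 3.0


def inf_coverage_soft_unknown_variance(n, eta, a):
    """Right-hand side of (14) for the interval E_{S,n} (requires n >= 2)."""
    r = math.sqrt(n)
    k = n - 1
    return t_cdf_by_mixture(r * (a - eta), k) - t_cdf_by_mixture(r * (-a - eta), k)
```

For $k = 1$ and $k = 2$ the $t$ cdf has the closed forms $1/2 + \arctan(x)/\pi$ and $1/2 + x / (2\sqrt{2 + x^2})$, which give a convenient check of the numerical integration.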
Relation (14) shows the following: suppose $1/2 \le \delta < 1$ and $a^*_{n,S}$ solves (3), i.e., the corresponding interval $C^*_{S,n}$ has infimal coverage probability equal to $\delta$. Let $a^{**}_{n,S}$ be the (unique) solution to

$$T_{n-1}(n^{1/2}(a_n - \eta_n)) - T_{n-1}(n^{1/2}(-a_n - \eta_n)) = \delta,$$

i.e., the corresponding interval $E^{**}_{S,n} = [\tilde\theta_S - \hat\sigma a^{**}_{n,S}, \tilde\theta_S + \hat\sigma a^{**}_{n,S}]$ has infimal coverage probability equal to $\delta$. Then $a^{**}_{n,S} \ge a^*_{n,S}$ holds in view of Lemma 14 in the Appendix. I.e., given the same infimal coverage probability $\delta \ge 1/2$, the expected length of the interval $E^{**}_{S,n}$ based on $\tilde\theta_S$ is not smaller than the length of the interval $C^*_{S,n}$ based on $\hat\theta_S$.

Since $\|\Phi - T_{n-1}\|_\infty = \sup_{x\in\mathbb{R}} |\Phi(x) - T_{n-1}(x)| \to 0$ for $n \to \infty$ holds by Polya's theorem, the following result is an immediate consequence of (14), Proposition 1, and Remark 4.

Theorem 8 For every sequence $a_n$ of nonnegative real numbers we have with $E_{S,n} = [\tilde\theta_S - \hat\sigma a_n, \tilde\theta_S + \hat\sigma a_n]$ and $C_{S,n} = [\hat\theta_S - a_n, \hat\theta_S + a_n]$ that

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{S,n}) - \inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{S,n}) \to 0$$

as $n \to \infty$. The analogous results hold for the corresponding open and half-open intervals.

We discuss this theorem together with the parallel results for hard-thresholding and adaptive LASSO based intervals in Section 4.4.

4.2 Hard-thresholding

Consider the interval $E_{H,n} = [\tilde\theta_H - \hat\sigma a_n, \tilde\theta_H + \hat\sigma a_n]$ where $a_n$ is a nonnegative real number and $\tilde\theta_H = \tilde\theta_H(\eta_n)$. We then have analogously as in the preceding subsection that

$$P_{n,\theta}(\theta \in E_{H,n}) = \int_0^\infty p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds.$$
Note that $p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)$ is symmetric in $\theta$ and for $\theta \ge 0$ is given by (see Pötscher (2009))

$$\begin{aligned}
p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n) ={}& \left\{\Phi(n^{1/2}(-\theta + s\eta_n)) - \Phi(n^{1/2}(-\theta - s\eta_n))\right\} \mathbf{1}(0 \le \theta \le sa_n) \\
&+ \max\left[0, \Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta + s\eta_n))\right] \mathbf{1}(sa_n < \theta \le s\eta_n + sa_n) \\
&+ \left\{\Phi(n^{1/2}sa_n) - \Phi(-n^{1/2}sa_n)\right\} \mathbf{1}(s\eta_n + sa_n < \theta)
\end{aligned}$$

in case $\eta_n > 2a_n$, by

$$\begin{aligned}
p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n) ={}& \left\{\Phi(n^{1/2}(-\theta + s\eta_n)) - \Phi(n^{1/2}(-\theta - s\eta_n))\right\} \mathbf{1}(0 \le \theta \le s\eta_n - sa_n) \\
&+ \left\{\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta - s\eta_n))\right\} \mathbf{1}(s\eta_n - sa_n < \theta \le sa_n) \\
&+ \left\{\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta + s\eta_n))\right\} \mathbf{1}(sa_n < \theta \le s\eta_n + sa_n) \\
&+ \left\{\Phi(n^{1/2}sa_n) - \Phi(-n^{1/2}sa_n)\right\} \mathbf{1}(s\eta_n + sa_n < \theta)
\end{aligned}$$

if $a_n \le \eta_n \le 2a_n$, and by

$$\begin{aligned}
p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n) ={}& \left\{\Phi(n^{1/2}sa_n) - \Phi(-n^{1/2}sa_n)\right\} \left\{\mathbf{1}(0 \le \theta \le sa_n - s\eta_n) + \mathbf{1}(s\eta_n + sa_n < \theta)\right\} \\
&+ \left\{\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta - s\eta_n))\right\} \mathbf{1}(sa_n - s\eta_n < \theta \le sa_n) \\
&+ \left\{\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta + s\eta_n))\right\} \mathbf{1}(sa_n < \theta \le s\eta_n + sa_n)
\end{aligned}$$

if $\eta_n < a_n$.

In the subsequent theorems we consider only the case where $\eta_n \to 0$ as this is the only interesting case from an asymptotic perspective: note that any of the penalized maximum likelihood estimators considered in this paper is inconsistent for $\theta$ if $\eta_n$ does not converge to zero.

Theorem 9 Suppose $\eta_n \to 0$. For every sequence $a_n$ of nonnegative real numbers we have with $E_{H,n} = [\tilde\theta_H - \hat\sigma a_n, \tilde\theta_H + \hat\sigma a_n]$ and $C_{H,n} = [\hat\theta_H - a_n, \hat\theta_H + a_n]$ that

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{H,n}) - \inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n}) \to 0$$

as $n \to \infty$. The analogous results hold for the corresponding open and half-open intervals.

4.3 Adaptive LASSO

Consider the interval $E_{A,n} = [\tilde\theta_A - \hat\sigma a_n, \tilde\theta_A + \hat\sigma a_n]$ where $a_n$ is a nonnegative real number and $\tilde\theta_A = \tilde\theta_A(\eta_n)$.
We then have analogously as in the preceding subsections that

$$P_{n,\theta}(\theta \in E_{A,n}) = \int_0^\infty p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds$$

where $p_{A,n}$ is given in (16) in the Appendix.

Theorem 10 Suppose $\eta_n \to 0$. For every sequence $a_n$ of nonnegative real numbers we have with $E_{A,n} = [\tilde\theta_A - \hat\sigma a_n, \tilde\theta_A + \hat\sigma a_n]$ and $C_{A,n} = [\hat\theta_A - a_n, \hat\theta_A + a_n]$ that

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{A,n}) - \inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{A,n}) \to 0$$

as $n \to \infty$. The analogous results hold for the corresponding open and half-open intervals.

4.4 Discussion

Theorems 8, 9, and 10 show that the results in Section 3 carry over to the unknown-variance case in an asymptotic sense: For example, suppose $0 < \delta < 1$, and $a_{n,S}$ ($a_{n,H}$, $a_{n,A}$, respectively) is such that $E_{S,n}$ ($E_{H,n}$, $E_{A,n}$, respectively) has infimal coverage probability converging to $\delta$. Then, for a regime where $n^{1/2}\eta_n \to e$ with $0 \le e < \infty$, it follows that $n^{1/2}a_{n,S}$, $n^{1/2}a_{n,H}$, and $n^{1/2}a_{n,A}$ have limits that solve (6)-(8), respectively; that is, they have the same limits as $n^{1/2}a^*_{n,S}$, $n^{1/2}a^*_{n,H}$, and $n^{1/2}a^*_{n,A}$, which are $n^{1/2}$ times the half-length of the shortest $\delta$-confidence intervals $C^*_{S,n}$, $C^*_{H,n}$, and $C^*_{A,n}$, respectively, in the known-variance case. Furthermore, for a regime where $n^{1/2}\eta_n \to \infty$ it follows that $a_{n,S}$, $a_{n,H}$, and $a_{n,A}$ satisfy (9)-(11), respectively (where we also assume $\eta_n \to 0$ for hard-thresholding and the adaptive LASSO). Hence, $a_{n,S}$, $a_{n,H}$, and $a_{n,A}$ on the one hand, and $a^*_{n,S}$, $a^*_{n,H}$, and $a^*_{n,A}$ on the other hand have again the same asymptotic behavior. Furthermore, Theorems 8, 9, and 10 show that Proposition 6 immediately carries over to the unknown-variance case.
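For soft-thresholding, the asymptotic equivalence discussed above can be checked directly, since both infimal coverage probabilities are available in closed form. The following is a rough numerical sketch assuming SciPy, with hypothetical function names and a regime $\eta_n = e\,n^{-1/2}$, $a_n = k\,n^{-1/2}$ chosen only for illustration. It approximates $\|\Phi - T_{n-1}\|_\infty$ on a grid and compares it with the gap between (14) and its normal analogue.

```python
import numpy as np
from scipy import stats

def sup_distance(n):
    # Grid approximation of sup_x |Phi(x) - T_{n-1}(x)|.
    grid = np.linspace(-12.0, 12.0, 24001)
    return np.max(np.abs(stats.norm.cdf(grid) - stats.t.cdf(grid, df=n - 1)))

def soft_gap(n, e=1.0, k=2.0):
    # Gap between unknown- and known-variance infimal coverage for
    # eta_n = e / sqrt(n) and a_n = k / sqrt(n) (illustrative regime).
    u, v = k - e, -k - e
    known = stats.norm.cdf(u) - stats.norm.cdf(v)
    unknown = stats.t.cdf(u, df=n - 1) - stats.t.cdf(v, df=n - 1)
    return abs(unknown - known)

if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, sup_distance(n), soft_gap(n))
```

The gap is bounded by twice the sup-distance, which is the mechanism behind Theorem 8; for hard-thresholding and the adaptive LASSO the proofs in the Appendix need additional steps because no closed form for the unknown-variance infimum is used there.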
A Appendix

Proof of Proposition 1: Using the expression for the finite sample distribution of $n^{1/2}(\hat\theta_S - \theta)$ given in Pötscher and Leeb (2009) and noting that this distribution function has a jump at $-n^{1/2}\theta$ we obtain

$$\begin{aligned}
p_{S,n}(\theta) ={}& \left[\Phi(n^{1/2}(a_n - \eta_n)) - \Phi(n^{1/2}(-b_n - \eta_n))\right] \mathbf{1}(\theta < -a_n) \\
&+ \left[\Phi(n^{1/2}(a_n + \eta_n)) - \Phi(n^{1/2}(-b_n - \eta_n))\right] \mathbf{1}(-a_n \le \theta \le b_n) \\
&+ \left[\Phi(n^{1/2}(a_n + \eta_n)) - \Phi(n^{1/2}(-b_n + \eta_n))\right] \mathbf{1}(b_n < \theta).
\end{aligned} \tag{15}$$

It follows that $\inf_{\theta\in\mathbb{R}} p_{S,n}(\theta)$ is as given in the proposition. ∎

Proof of Proposition 3: The distribution function $F_{A,n,\theta}(x) = P_{n,\theta}(n^{1/2}(\hat\theta_A - \theta) \le x)$ of the adaptive LASSO estimator is given by

$$\begin{aligned}
&\mathbf{1}(x + n^{1/2}\theta \ge 0)\, \Phi\left(-((n^{1/2}\theta - x)/2) + \sqrt{((n^{1/2}\theta + x)/2)^2 + n\eta_n^2}\right) \\
&\quad + \mathbf{1}(x + n^{1/2}\theta < 0)\, \Phi\left(-((n^{1/2}\theta - x)/2) - \sqrt{((n^{1/2}\theta + x)/2)^2 + n\eta_n^2}\right)
\end{aligned}$$

(see Pötscher and Schneider (2009)). Hence, the coverage probability $p_{A,n}(\theta) = F_{A,n,\theta}(n^{1/2}a_n) - \lim_{x \to (-n^{1/2}b_n)^-} F_{A,n,\theta}(x)$ is

$$p_{A,n}(\theta) = \begin{cases}
\Phi\left(n^{1/2}\gamma^{(-)}(\theta, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(-)}(\theta, b_n)\right) & \text{if } \theta < -a_n \\
\Phi\left(n^{1/2}\gamma^{(+)}(\theta, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(-)}(\theta, b_n)\right) & \text{if } -a_n \le \theta \le b_n \\
\Phi\left(n^{1/2}\gamma^{(+)}(\theta, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(+)}(\theta, b_n)\right) & \text{if } \theta > b_n.
\end{cases} \tag{16}$$

Here

$$\gamma^{(-)}(\theta, x) = -((\theta + x)/2) - \sqrt{((\theta - x)/2)^2 + \eta_n^2} \tag{17}$$

$$\gamma^{(+)}(\theta, x) = -((\theta + x)/2) + \sqrt{((\theta - x)/2)^2 + \eta_n^2}, \tag{18}$$

which are clearly smooth functions of $(\theta, x)$. Observe that $\gamma^{(-)}$ and $\gamma^{(+)}$ are nonincreasing in $\theta \in \mathbb{R}$ (for every $x \in \mathbb{R}$). As a consequence, we obtain for $-a_n \le \theta \le b_n$ the lower bound

$$\begin{aligned}
p_{A,n}(\theta) &\ge \Phi\left(n^{1/2}\gamma^{(+)}(b_n, -a_n)\right) - \Phi\left(n^{1/2}\gamma^{(-)}(-a_n, b_n)\right) \\
&= \Phi\left(n^{1/2}\left((a_n - b_n)/2 + \sqrt{((a_n + b_n)/2)^2 + \eta_n^2}\right)\right) - \Phi\left(n^{1/2}\left((a_n - b_n)/2 - \sqrt{((a_n + b_n)/2)^2 + \eta_n^2}\right)\right).
\end{aligned} \tag{19}$$

Consider first the case where $a_n \le b_n$.
We then show that $p_{A,n}(\theta)$ is nonincreasing on $(-\infty, -a_n)$: The derivative $dp_{A,n}(\theta)/d\theta$ is given by

$$dp_{A,n}(\theta)/d\theta = n^{1/2}\left[\phi(n^{1/2}\gamma^{(-)}(\theta, -a_n))\, \partial\gamma^{(-)}(\theta, -a_n)/\partial\theta - \phi(n^{1/2}\gamma^{(-)}(\theta, b_n))\, \partial\gamma^{(-)}(\theta, b_n)/\partial\theta\right]$$

where $\phi$ denotes the standard normal density function. Using the relation $a_n \le b_n$, elementary calculations show that

$$\partial\gamma^{(-)}(\theta, -a_n)/\partial\theta \le \partial\gamma^{(-)}(\theta, b_n)/\partial\theta \quad \text{for } \theta \in (-\infty, -a_n).$$

Furthermore, given $a_n \le b_n$, it is not too difficult to see that $|\gamma^{(-)}(\theta, -a_n)| \le |\gamma^{(-)}(\theta, b_n)|$ for $\theta \in (-\infty, -a_n)$ (cf. Lemma 11 below), which implies that

$$\phi(n^{1/2}\gamma^{(-)}(\theta, -a_n)) \ge \phi(n^{1/2}\gamma^{(-)}(\theta, b_n)).$$

The last two displays together with the fact that $\partial\gamma^{(-)}(\theta, -a_n)/\partial\theta$ as well as $\partial\gamma^{(-)}(\theta, b_n)/\partial\theta$ are less than or equal to zero, imply that $dp_{A,n}(\theta)/d\theta \le 0$ on $(-\infty, -a_n)$. This proves that

$$\inf_{\theta < -a_n} p_{A,n}(\theta) = \lim_{\theta \to (-a_n)^-} p_{A,n}(\theta) = c$$

with

$$c = \Phi\left(n^{1/2}(a_n - \eta_n)\right) - \Phi\left(n^{1/2}\left((a_n - b_n)/2 - \sqrt{((a_n + b_n)/2)^2 + \eta_n^2}\right)\right). \tag{20}$$

Since the lower bound given in (19) is not less than $c$, we have

$$\inf_{\theta \le b_n} p_{A,n}(\theta) = \inf_{\theta < -a_n} p_{A,n}(\theta) = c.$$

It remains to show that $p_{A,n}(\theta) \ge c$ for $\theta > b_n$. From (16) and (20) after rearranging terms we obtain for $\theta > b_n$

$$p_{A,n}(\theta) - c = \left[\Phi(n^{1/2}\gamma^{(+)}(\theta, -a_n)) - \Phi(n^{1/2}\gamma^{(-)}(-a_n, -a_n))\right] - \left[\Phi(n^{1/2}\gamma^{(+)}(\theta, b_n)) - \Phi(n^{1/2}\gamma^{(-)}(-a_n, b_n))\right].$$

It is elementary to show that $\gamma^{(+)}(\theta, -a_n) \ge \gamma^{(-)}(-a_n, -a_n) = a_n - \eta_n$ and $\gamma^{(+)}(\theta, b_n) \ge \gamma^{(-)}(-a_n, b_n)$. We next show that

$$\gamma^{(+)}(\theta, -a_n) - \gamma^{(-)}(-a_n, -a_n) \ge \gamma^{(+)}(\theta, b_n) - \gamma^{(-)}(-a_n, b_n). \tag{21}$$

To establish this note that (21) can equivalently be rewritten as

$$f(0) + f((\theta + a_n)/2) \ge f((\theta - b_n)/2) + f((a_n + b_n)/2) \tag{22}$$

where $f(x) = (x^2 + \eta_n^2)^{1/2}$. Observe that $0 \le (\theta - b_n)/2 \le (\theta + a_n)/2$ holds since $0 \le a_n \le b_n < \theta$. Writing $(\theta - b_n)/2$ as $\lambda(\theta + a_n)/2 + (1 - \lambda)0$ with $0 \le \lambda \le 1$ gives $(a_n + b_n)/2 = (1 - \lambda)(\theta + a_n)/2 + \lambda 0$. Because $f$ is convex, the inequality (22) and hence (21) follows.

Next observe that in case $a_n \ge \eta_n$ we have (using monotonicity of $\gamma^{(+)}(\theta, b_n)$)

$$0 \le \gamma^{(-)}(-a_n, -a_n) = a_n - \eta_n \le b_n - \eta_n = -\gamma^{(+)}(b_n, b_n) \le -\gamma^{(+)}(\theta, b_n) \tag{23}$$

for $\theta > b_n$. In case $a_n < \eta_n$ we have (using $\gamma^{(-)}(\theta, x) = \gamma^{(-)}(x, \theta)$ and monotonicity of $\gamma^{(-)}$ in its first argument)

$$\gamma^{(-)}(-a_n, b_n) \le \gamma^{(-)}(-a_n, -a_n) = a_n - \eta_n < 0, \tag{24}$$

and (using monotonicity of $\gamma^{(+)}$)

$$\gamma^{(-)}(-a_n, b_n) \le -\gamma^{(+)}(b_n, -a_n) \le -\gamma^{(+)}(\theta, -a_n) \tag{25}$$

for $\theta > b_n$. Applying Lemma 12 below with $\alpha = n^{1/2}\gamma^{(-)}(-a_n, -a_n)$, $\beta = n^{1/2}\gamma^{(+)}(\theta, -a_n)$, $\gamma = n^{1/2}\gamma^{(-)}(-a_n, b_n)$, and $\delta = n^{1/2}\gamma^{(+)}(\theta, b_n)$ and using (21)-(25), establishes $p_{A,n}(\theta) - c \ge 0$. This completes the proof in case $a_n \le b_n$. The case $a_n > b_n$ follows from the observation that (16) remains unchanged if $a_n$ and $b_n$ are interchanged and $\theta$ is replaced by $-\theta$. ∎

Lemma 11 Suppose $a_n \le b_n$. Then $|\gamma^{(-)}(\theta, -a_n)| \le |\gamma^{(-)}(\theta, b_n)|$ holds for $\theta \in (-\infty, -a_n)$.

Proof. Squaring both sides of the claimed inequality shows that the claim is equivalent to

$$a_n^2/2 - (a_n - \theta)\sqrt{((a_n + \theta)/2)^2 + \eta_n^2} \le b_n^2/2 + (b_n + \theta)\sqrt{((b_n - \theta)/2)^2 + \eta_n^2}.$$

But, for $\theta < -a_n$, the left-hand side of the preceding display is not larger than

$$a_n^2/2 + (a_n + \theta)\sqrt{((a_n - \theta)/2)^2 + \eta_n^2}.$$
Since $a_n^2/2 \le b_n^2/2$, it hence suffices to show that

$$-(a_n + \theta)\sqrt{((a_n - \theta)/2)^2 + \eta_n^2} \ge -(b_n + \theta)\sqrt{((b_n - \theta)/2)^2 + \eta_n^2}$$

for $\theta < -a_n$. This is immediately seen by distinguishing the cases where $-b_n \le \theta < -a_n$ and where $\theta < -b_n$, and observing that $a_n \le b_n$.

The following lemma is elementary to prove.

Lemma 12 Suppose $\alpha$, $\beta$, $\gamma$, and $\delta$ are real numbers satisfying $\alpha \le \beta$, $\gamma \le \delta$, and $\beta - \alpha \ge \delta - \gamma$. If $0 \le \alpha \le -\delta$, or if $\gamma \le \alpha \le 0$ and $\gamma \le -\beta$, then $\Phi(\beta) - \Phi(\alpha) \ge \Phi(\delta) - \Phi(\gamma)$.

Proof of Theorem 5: (a) Since $\delta$ is positive, any solution to (3) has to be positive. Now the equation (3) has a unique solution $a^*_{n,S}$, since (3) as a function of $a_n \in [0, \infty)$ is easily seen to be strictly increasing with range $[0, 1)$. Furthermore, the infimal coverage probability (1) is a continuous function of the pair $(a_n, b_n)$ on $[0, \infty) \times [0, \infty)$. Let $K \subseteq [0, \infty) \times [0, \infty)$ consist of all pairs $(a_n, b_n)$ such that (i) the corresponding interval $[\hat\theta_S - a_n, \hat\theta_S + b_n]$ has infimal coverage probability not less than $\delta$, and (ii) the length $a_n + b_n$ is less than or equal to $2a^*_{n,S}$. Then $K$ is compact. It is also nonempty as the pair $(a^*_{n,S}, a^*_{n,S})$ belongs to $K$. Since the length $a_n + b_n$ is obviously continuous, it follows that there is a pair $(a^o_n, b^o_n) \in K$ having minimal length within $K$. Since confidence sets corresponding to pairs not belonging to $K$ always have length larger than $2a^*_{n,S}$, the pair $(a^o_n, b^o_n)$ gives rise to an interval with shortest length within the set of all intervals with infimal coverage probability not less than $\delta$.
We next show that $a^o_n = b^o_n$ must hold: Suppose not, then we may assume without loss of generality that $a^o_n < b^o_n$. Increasing $a^o_n$ by $\varepsilon > 0$ and decreasing $b^o_n$ by the same amount such that $a^o_n + \varepsilon < b^o_n - \varepsilon$ holds, will result in an interval of the same length with infimal coverage probability

$$\Phi(n^{1/2}(a^o_n + \varepsilon - \eta_n)) - \Phi(n^{1/2}(-(b^o_n - \varepsilon) - \eta_n)).$$

This infimal coverage probability will be strictly larger than

$$\Phi(n^{1/2}(a^o_n - \eta_n)) - \Phi(n^{1/2}(-b^o_n - \eta_n)) \ge \delta$$

provided $\varepsilon$ is chosen sufficiently small. But then, by continuity of the infimal coverage probability as a function of $a_n$ and $b_n$, the interval $[\hat\theta_S - a^o_n - \varepsilon, \hat\theta_S + b'_n - \varepsilon]$ with $\varepsilon < b'_n < b^o_n$ will still have infimal coverage probability not less than $\delta$ as long as $b'_n$ is sufficiently close to $b^o_n$; at the same time this interval will be shorter than the interval $[\hat\theta_S - a^o_n, \hat\theta_S + b^o_n]$. This leads to a contradiction and establishes $a^o_n = b^o_n$. By what was said at the beginning of the proof, it is now obvious that $a^o_n = b^o_n = a^*_{n,S}$ must hold, thus also establishing uniqueness. The last claim is obvious in view of the construction of $a^*_{n,S}$.

(b) Since $\delta$ is positive, any solution to (4) has to be larger than $\eta_n/2$. Now equation (4) has a unique solution $a^*_{n,H}$, since (4) as a function of $a_n \in [\eta_n/2, \infty)$ is easily seen to be strictly increasing with range $[0, 1)$. Furthermore, define $K$ similarly as in the proof of part (a). Then, by the same reasoning as in (a), the set $K$ is compact and non-empty, leading to a pair $(a^o_n, b^o_n)$ that gives rise to an interval with shortest length within the set of all intervals with infimal coverage probability not less than $\delta$.
We next show that $a^o_n = b^o_n$ must hold: Suppose not, then we may again assume without loss of generality that $a^o_n$ […] $\eta_n$ must hold, since the infimal coverage probability of the corresponding interval is positive by construction. Since all this entails $|a^o_n - \eta_n|$ […] $0$ and decreasing $b^o_n$ by the same amount such that $a^o_n + \varepsilon$ […] $\Phi(n^{1/2}(a^o_n - \eta_n)) - \Phi(-n^{1/2}b^o_n) \ge \delta$ provided $\varepsilon$ is chosen sufficiently small. By continuity of the infimal coverage probability as a function of $a_n$ and $b_n$, the interval $[\hat\theta_H - a^o_n - \varepsilon, \hat\theta_H + b'_n - \varepsilon]$ with $\varepsilon < b'_n < b^o_n$ will still have infimal coverage probability not less than $\delta$ as long as $b'_n$ is sufficiently close to $b^o_n$; at the same time this interval will be shorter than the interval $[\hat\theta_H - a^o_n, \hat\theta_H + b^o_n]$, leading to a contradiction and thus establishing $a^o_n = b^o_n$. As in (a) it now follows that $a^o_n = b^o_n = a^*_{n,H}$ must hold, thus also establishing uniqueness. The last claim is then obvious in view of the construction of $a^*_{n,H}$.

(c) Since $\delta$ is positive, it is easy to see that any solution to (5) has to be positive. Now equation (5) has a unique solution $a^*_{n,A}$, since (5) as a function of $a_n \in [0, \infty)$ is strictly increasing with range $[0, 1)$. Furthermore, the infimal coverage probability as given in Proposition 3 is a continuous function of the pair $(a_n, b_n)$ on $[0, \infty) \times [0, \infty)$. Define $K$ similarly as in the proof of part (a). Then by the same reasoning as in (a), the set $K$ is compact and non-empty, leading to a pair $(a^o_n, b^o_n)$ that gives rise to an interval with shortest length within the set of all intervals with infimal coverage probability not less than $\delta$.
We next show that $a^o_n = b^o_n$ must hold: Suppose not, then we may again assume without loss of generality that $a^o_n < b^o_n$. Increasing $a^o_n$ by $\varepsilon > 0$ and decreasing $b^o_n$ by the same amount such that $a^o_n + \varepsilon < b^o_n - \varepsilon$ holds, will result in an interval of the same length with infimal coverage probability

$$\begin{aligned}
&\Phi(n^{1/2}(a^o_n + \varepsilon - \eta_n)) - \Phi\left(n^{1/2}\left(\varepsilon + (a^o_n - b^o_n)/2 - \sqrt{((a^o_n + b^o_n)/2)^2 + \eta_n^2}\right)\right) \\
&\quad > \Phi(n^{1/2}(a^o_n - \eta_n)) - \Phi\left(n^{1/2}\left((a^o_n - b^o_n)/2 - \sqrt{((a^o_n + b^o_n)/2)^2 + \eta_n^2}\right)\right) \ge \delta,
\end{aligned}$$

provided $\varepsilon$ is chosen sufficiently small. This is so since $a^o_n < b^o_n$ implies

$$|a^o_n - \eta_n| < \left|(a^o_n - b^o_n)/2 - \sqrt{((a^o_n + b^o_n)/2)^2 + \eta_n^2}\right|$$

as is easily seen. But then, by continuity of the infimal coverage probability as a function of $a_n$ and $b_n$, the interval $[\hat\theta_A - a^o_n - \varepsilon, \hat\theta_A + b'_n - \varepsilon]$ with $\varepsilon < b'_n < b^o_n$ will still have infimal coverage probability not less than $\delta$ as long as $b'_n$ is sufficiently close to $b^o_n$; at the same time this interval will be shorter than the interval $[\hat\theta_A - a^o_n, \hat\theta_A + b^o_n]$. This leads to a contradiction and establishes $a^o_n = b^o_n$. As in (a) it now follows that $a^o_n = b^o_n = a^*_{n,A}$ must hold, thus also establishing uniqueness. The last claim is obvious in view of the construction of $a^*_{n,A}$. ∎

Proof of Proposition 6: Let

$$c = \liminf_{n\to\infty} \inf_{\theta\in\mathbb{R}} P_{n,\theta}\left(-d \le \eta_n^{-1}(\hat\theta - \theta) \le d\right).$$

By definition of $c$, we can find a subsequence $n_k$ and elements $\theta_{n_k} \in \mathbb{R}$ such that

$$P_{n_k,\theta_{n_k}}\left(-d \le \eta_{n_k}^{-1}(\hat\theta - \theta_{n_k}) \le d\right) \to c$$

for $k \to \infty$. Now, by Theorem 17 (for $\hat\theta = \hat\theta_H$), Theorem 18 (for $\hat\theta = \hat\theta_S$), and Remark 12 in Pötscher and Leeb (2009), and by Theorem 6 (for $\hat\theta = \hat\theta_A$) and Remark 7 in Pötscher and Schneider (2009), any accumulation point of the distribution of $\eta_{n_k}^{-1}(\hat\theta - \theta_{n_k})$ with respect to weak convergence is a probability measure concentrated on $[-1, 1]$.
Since $d > 1$, it follows that $c = 1$ must hold, which proves the first claim. We next prove the second claim. In view of Theorem 17 (for $\hat\theta = \hat\theta_H$) and Theorem 18 (for $\hat\theta = \hat\theta_S$) in Pötscher and Leeb (2009), and in view of Theorem 6 (for $\hat\theta = \hat\theta_A$) in Pötscher and Schneider (2009) it is possible to choose a sequence $\theta_n \in \mathbb{R}$ such that the distribution of $\eta_n^{-1}(\hat\theta - \theta_n)$ converges to point mass located at one of the endpoints of the interval $[-1, 1]$. But then clearly

$$P_{n,\theta_n}\left(-d \le \eta_n^{-1}(\hat\theta - \theta_n) \le d\right) \to 0$$

for $d < 1$, which implies the second claim. ∎

Proof of Theorem 9: We prove the result for the closed interval. Inspection of the proof together with Remark 4 then gives the result for the open and half-open intervals.

Step 1: Observe that for every $s > 0$ and $n \ge 2$ we have from the above formulae for $p_{H,n}$ that

$$\lim_{\theta\to\infty} p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n) = \Phi(n^{1/2}sa_n) - \Phi(-n^{1/2}sa_n).$$

By the dominated convergence theorem it follows that for $\theta \to \infty$

$$\begin{aligned}
P_{n,\theta}(\theta \in E_{H,n}) &= \int_0^\infty p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&\to \int_0^\infty \left[\Phi(n^{1/2}sa_n) - \Phi(-n^{1/2}sa_n)\right] h_n(s)\, ds = T_{n-1}(n^{1/2}a_n) - T_{n-1}(-n^{1/2}a_n).
\end{aligned}$$

Hence,

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n}) \le \lim_{\theta\to\infty} p_{H,n}(\theta; 1, \eta_n, a_n, a_n) = \Phi(n^{1/2}a_n) - \Phi(-n^{1/2}a_n)$$

and

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{H,n}) \le T_{n-1}(n^{1/2}a_n) - T_{n-1}(-n^{1/2}a_n) \le \Phi(n^{1/2}a_n) - \Phi(-n^{1/2}a_n), \tag{26}$$

the last inequality following from well-known properties of $T_{n-1}$, cf. Lemma 14 below. This proves the theorem in case $n^{1/2}a_n \to 0$ for $n \to \infty$.

Step 2: For every $s > 0$ and $n \ge 2$ we have from (2)

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n}) = \inf_{\theta\in\mathbb{R}} p_{H,n}(\theta; 1, \eta_n, a_n, a_n) = \max\left[\Phi(n^{1/2}a_n) - \Phi(-n^{1/2}(a_n - \eta_n)), 0\right] \tag{27}$$

and

$$\inf_{\theta\in\mathbb{R}} p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n) = \max\left[\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-sa_n + s\eta_n)), 0\right].$$
Furthermore,

$$\begin{aligned}
\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{H,n}) &\ge \int_0^\infty \inf_{\theta\in\mathbb{R}} p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&= \int_0^\infty \max\left[\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-sa_n + s\eta_n)), 0\right] h_n(s)\, ds \\
&= \max\left\{\int_0^\infty \left[\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-sa_n + s\eta_n))\right] h_n(s)\, ds,\ 0\right\} \\
&= \max\left[T_{n-1}(n^{1/2}a_n) - T_{n-1}(-n^{1/2}(a_n - \eta_n)), 0\right].
\end{aligned} \tag{28}$$

If $n^{1/2}(a_n - \eta_n) \to \infty$, then the far right-hand sides of (27) and (28) converge to 1, since $\|\Phi - T_{n-1}\|_\infty \to 0$ as $n \to \infty$ by Polya's theorem and since $n^{1/2}a_n \ge n^{1/2}(a_n - \eta_n)$. This proves the theorem in case $n^{1/2}(a_n - \eta_n) \to \infty$.

Step 3: If $n^{1/2}\eta_n \to 0$, then (27) and the fact that $\Phi$ is globally Lipschitz show that $\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n})$ differs from $\Phi(n^{1/2}a_n) - \Phi(-n^{1/2}a_n)$ only by a term that is $o(1)$. Similarly, (26), (28), the fact that $\|\Phi - T_{n-1}\|_\infty \to 0$ as $n \to \infty$ by Polya's theorem, and the global Lipschitz property of $\Phi$ show that the same is true for $\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{H,n})$, proving the theorem in case $n^{1/2}\eta_n \to 0$.

Step 4: By a subsequence argument and Steps 1-3 it remains to prove the theorem under the assumption that $n^{1/2}a_n$ and $n^{1/2}\eta_n$ are bounded away from zero by a finite positive constant $c_1$, say, and that $n^{1/2}(a_n - \eta_n)$ is bounded from above by a finite constant $c_2$, say. It then follows that $a_n/\eta_n$ is bounded by a finite positive constant $c_3$, say. For given $\varepsilon > 0$ set $\theta_n(\varepsilon) = a_n(1 + 2c(\varepsilon)n^{-1/2})$ where $c(\varepsilon)$ is the constant given in Lemma 13. We then have for $s \in [1 - c(\varepsilon)n^{-1/2}, 1 + c(\varepsilon)n^{-1/2}]$

$$sa_n < \theta_n(\varepsilon) \le s(\eta_n + a_n)$$

whenever $n > n_0(c(\varepsilon), c_3)$. Without loss of generality we may choose $n_0(c(\varepsilon), c_3)$ large enough such that also $1 - c(\varepsilon)n^{-1/2} > 0$ holds for $n > n_0(c(\varepsilon), c_3)$.
Consequently, we have (observing that $\max(0, x)$ has Lipschitz constant 1 and $\Phi$ has Lipschitz constant $(2\pi)^{-1/2}$) for every $s \in [1 - c(\varepsilon)n^{-1/2}, 1 + c(\varepsilon)n^{-1/2}]$ and $n > n_0(c(\varepsilon), c_3)$

$$\begin{aligned}
&\left|p_{H,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n) - p_{H,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| \\
&\quad = \left|\max(0, \Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta_n(\varepsilon) + s\eta_n))) - \max(0, \Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-\theta_n(\varepsilon) + \eta_n)))\right| \\
&\quad \le \left|\left[\Phi(n^{1/2}sa_n) - \Phi(n^{1/2}(-\theta_n(\varepsilon) + s\eta_n))\right] - \left[\Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-\theta_n(\varepsilon) + \eta_n))\right]\right| \\
&\quad \le (2\pi)^{-1/2} n^{1/2}(a_n + \eta_n)|s - 1| \le (2\pi)^{-1/2} c(\varepsilon)(a_n + \eta_n) \le (2\pi)^{-1/2} c(\varepsilon)(c_3 + 1)\eta_n.
\end{aligned}$$

It follows that for every $n > n_0(c(\varepsilon), c_3)$

$$\begin{aligned}
\inf_{\theta\in\mathbb{R}} \int_0^\infty p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds &\le \int_0^\infty p_{H,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&= \int_{1 - c(\varepsilon)n^{-1/2}}^{1 + c(\varepsilon)n^{-1/2}} p_{H,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&\quad + \int_{\{s:\, |s-1| \ge c(\varepsilon)n^{-1/2}\}} p_{H,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds = B_1 + B_2.
\end{aligned}$$

Clearly, $0 \le B_2 \le \varepsilon$ holds, cf. Lemma 13, and for $B_1$ we have

$$\begin{aligned}
&\left|B_1 - p_{H,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| \\
&\quad \le \left|\int_{1 - c(\varepsilon)n^{-1/2}}^{1 + c(\varepsilon)n^{-1/2}} \left[p_{H,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n) - p_{H,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right] h_n(s)\, ds\right| + \varepsilon \\
&\quad \le (2\pi)^{-1/2} c(\varepsilon)(c_3 + 1)\eta_n + \varepsilon
\end{aligned}$$

for $n > n_0(c(\varepsilon), c_3)$. It follows that

$$\inf_{\theta\in\mathbb{R}} \int_0^\infty p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \le p_{H,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n) + (2\pi)^{-1/2} c(\varepsilon)(c_3 + 1)\eta_n + 2\varepsilon$$

holds for $n > n_0(c(\varepsilon), c_3)$. Now

$$\begin{aligned}
p_{H,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n) &= \max(0, \Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-\theta_n(\varepsilon) + \eta_n))) \\
&= \max(0, \Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-a_n(1 + 2c(\varepsilon)n^{-1/2}) + \eta_n))).
\end{aligned}$$
But this differs from

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n}) = \max(0, \Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-a_n + \eta_n)))$$

by at most

$$\left|\Phi(n^{1/2}(-a_n + \eta_n)) - \Phi(n^{1/2}(-a_n(1 + 2c(\varepsilon)n^{-1/2}) + \eta_n))\right| \le (2\pi)^{-1/2}\, 2c(\varepsilon)a_n \le (2\pi)^{-1/2}\, 2c(\varepsilon)c_3\eta_n.$$

Consequently, for $n > n_0(c(\varepsilon), c_3)$

$$\begin{aligned}
\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{H,n}) &= \inf_{\theta\in\mathbb{R}} \int_0^\infty p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&\le \max(0, \Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-a_n + \eta_n))) + (2\pi)^{-1/2} c(\varepsilon)(3c_3 + 1)\eta_n + 2\varepsilon \\
&= \inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n}) + (2\pi)^{-1/2} c(\varepsilon)(3c_3 + 1)\eta_n + 2\varepsilon.
\end{aligned}$$

On the other hand,

$$\begin{aligned}
\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{H,n}) &= \inf_{\theta\in\mathbb{R}} \int_0^\infty p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \ge \int_0^\infty \inf_{\theta\in\mathbb{R}} p_{H,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&= \int_0^\infty \max(0, \Phi(n^{1/2}sa_n) - \Phi(n^{1/2}s(-a_n + \eta_n)))\, h_n(s)\, ds \\
&= \max(0, T_{n-1}(n^{1/2}a_n) - T_{n-1}(n^{1/2}(-a_n + \eta_n))) \\
&\ge \max(0, \Phi(n^{1/2}a_n) - \Phi(n^{1/2}(-a_n + \eta_n))) - 2\|\Phi - T_{n-1}\|_\infty \\
&= \inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{H,n}) - 2\|\Phi - T_{n-1}\|_\infty.
\end{aligned}$$

Since $\eta_n \to 0$ and $\|\Phi - T_{n-1}\|_\infty \to 0$ for $n \to \infty$ and since $\varepsilon$ was arbitrary the proof is complete. ∎

Proof of Theorem 10: We prove the result for the closed interval. Inspection of the proof together with Remark 4 then gives the result for the open and half-open intervals.

Step 1: Observe that for every $s > 0$ and $n \ge 2$ we have from (16) that

$$\lim_{\theta\to\infty} p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n) = \Phi(n^{1/2}sa_n) - \Phi(-n^{1/2}sa_n).$$

Then exactly the same argument as in the proof of Theorem 9 shows that $\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{A,n})$ as well as $\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{A,n})$ converge to zero for $n \to \infty$ if $n^{1/2}a_n \to 0$, thus proving the theorem in this case. For later use we note that this reasoning in particular gives

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{A,n}) \le T_{n-1}(n^{1/2}a_n) - T_{n-1}(-n^{1/2}a_n) \le \Phi(n^{1/2}a_n) - \Phi(-n^{1/2}a_n). \tag{29}$$

Step 2: By Proposition 3 we have for every $s > 0$ and $n \ge 1$

$$\inf_{\theta\in\mathbb{R}} p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n) = \Phi\left(n^{1/2}s\sqrt{a_n^2 + \eta_n^2}\right) - \Phi(n^{1/2}s(-a_n + \eta_n)).$$

Arguing as in the proof of Theorem 9 we then have

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{A,n}) = \inf_{\theta\in\mathbb{R}} p_{A,n}(\theta; 1, \eta_n, a_n, a_n) = \Phi\left(n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) - \Phi(n^{1/2}(-a_n + \eta_n)) \tag{30}$$

and

$$\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in E_{A,n}) \ge \int_0^\infty \inf_{\theta\in\mathbb{R}} p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds = T_{n-1}\left(n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) - T_{n-1}(n^{1/2}(-a_n + \eta_n)). \tag{31}$$

If $n^{1/2}(a_n - \eta_n) \to \infty$, then the far right-hand sides of (30) and (31) converge to 1, since $\|\Phi - T_{n-1}\|_\infty \to 0$ as $n \to \infty$ by Polya's theorem and since $n^{1/2}\sqrt{a_n^2 + \eta_n^2} \ge n^{1/2}a_n \to \infty$ and $n^{1/2}(-a_n + \eta_n) \to -\infty$. This proves the theorem in case $n^{1/2}(a_n - \eta_n) \to \infty$.

Step 3: Analogous to the corresponding step in the proof of Theorem 9, using (30), (29), (31), and additionally noting that $0 \le n^{1/2}\sqrt{a_n^2 + \eta_n^2} - n^{1/2}a_n \le n^{1/2}\eta_n$, the theorem is proved in the case $n^{1/2}\eta_n \to 0$.

Step 4: Similarly as in the proof of Theorem 9 it remains to prove the theorem under the assumption that $n^{1/2}a_n \ge c_1 > 0$, $n^{1/2}\eta_n \ge c_1$, and that $n^{1/2}(a_n - \eta_n) \le c_2 < \infty$. Again, it then follows that $0 \le a_n/\eta_n \le c_3 < \infty$. For given $\varepsilon > 0$ set $\theta_n(\varepsilon) = a_n(1 + 2c(\varepsilon)n^{-1/2})$ where $c(\varepsilon)$ is the constant given in Lemma 13. We then have for $s \in [1 - c(\varepsilon)n^{-1/2}, 1 + c(\varepsilon)n^{-1/2}]$

$$sa_n < \theta_n(\varepsilon)$$

for all $n$. Choose $n_0(c(\varepsilon))$ large enough such that $1 - c(\varepsilon)n^{-1/2} > 1/2$ holds for $n > n_0(c(\varepsilon))$.
Consequently, for every $s \in [1 - c(\varepsilon)n^{-1/2}, 1 + c(\varepsilon)n^{-1/2}]$ and $n > n_0(c(\varepsilon))$ we have from (16) (observing that $\Phi$ has Lipschitz constant $(2\pi)^{-1/2}$)

$$\begin{aligned}
&\left|p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n) - p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| \\
&\quad \le (2\pi)^{-1/2} n^{1/2} \Big\{ |s - 1| a_n + \left|\sqrt{(\theta_n(\varepsilon) + sa_n)^2/4 + s^2\eta_n^2} - \sqrt{(\theta_n(\varepsilon) + a_n)^2/4 + \eta_n^2}\right| \\
&\qquad\qquad + \left|\sqrt{(\theta_n(\varepsilon) - sa_n)^2/4 + s^2\eta_n^2} - \sqrt{(\theta_n(\varepsilon) - a_n)^2/4 + \eta_n^2}\right| \Big\}.
\end{aligned}$$

We note the elementary inequality $|x^{1/2} - y^{1/2}| \le 2^{-1} z^{-1/2} |x - y|$ for positive $x$, $y$, $z$ satisfying $\min(x, y) \ge z$. Using this inequality with $z = (1 - c(\varepsilon)n^{-1/2})^2\eta_n^2$ twice, we obtain for every $s \in [1 - c(\varepsilon)n^{-1/2}, 1 + c(\varepsilon)n^{-1/2}]$ and $n > n_0(c(\varepsilon))$

$$\begin{aligned}
&\left|p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n) - p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| \\
&\quad \le (2\pi)^{-1/2} n^{1/2} |s - 1| \left\{ a_n + \left[(1 - c(\varepsilon)n^{-1/2})^2\eta_n^2\right]^{-1/2} \left(\theta_n(\varepsilon)a_n/2 + (s + 1)\left((a_n^2/4) + \eta_n^2\right)\right) \right\}.
\end{aligned}$$

Since $1 - c(\varepsilon)n^{-1/2} > 1/2$ for $n > n_0(c(\varepsilon))$ by the choice of $n_0(c(\varepsilon))$ and since $a_n/\eta_n \le c_3$ we obtain

$$\begin{aligned}
&\left|p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n) - p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| \\
&\quad \le (2\pi)^{-1/2} c(\varepsilon) \left\{ a_n + 2\eta_n^{-1}\left(a_n^2 + (5/2)((a_n^2/4) + \eta_n^2)\right) \right\} \\
&\quad \le (2\pi)^{-1/2} c(\varepsilon) \left\{ c_3 + (13/4)c_3^2 + 5 \right\} \eta_n = c_4(\varepsilon)\eta_n
\end{aligned} \tag{32}$$

for every $n > n_0(c(\varepsilon))$ and $s \in [1 - c(\varepsilon)n^{-1/2}, 1 + c(\varepsilon)n^{-1/2}]$. Now,

$$\begin{aligned}
\inf_{\theta\in\mathbb{R}} \int_0^\infty p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds &\le \int_0^\infty p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&= \int_{1 - c(\varepsilon)n^{-1/2}}^{1 + c(\varepsilon)n^{-1/2}} p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&\quad + \int_{|s-1| \ge c(\varepsilon)n^{-1/2}} p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds =: B_1 + B_2.
\end{aligned}$$

Clearly, $0 \le B_2 \le \varepsilon$ holds by the choice of $c(\varepsilon)$, see Lemma 13.
For $B_1$ we have using (32)

$$\begin{aligned}
&\left|B_1 - p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| \\
&\quad \le \int_{1 - c(\varepsilon)n^{-1/2}}^{1 + c(\varepsilon)n^{-1/2}} \left|p_{A,n}(\theta_n(\varepsilon); 1, s\eta_n, sa_n, sa_n) - p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)\right| h_n(s)\, ds + \varepsilon \\
&\quad \le c_4(\varepsilon)\eta_n + \varepsilon
\end{aligned}$$

for $n > n_0(c(\varepsilon))$. It follows that

$$\inf_{\theta\in\mathbb{R}} \int_0^\infty p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \le p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n) + c_4(\varepsilon)\eta_n + 2\varepsilon$$

holds for $n > n_0(c(\varepsilon))$. Furthermore, the absolute difference between $p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n)$ and $\inf_{\theta\in\mathbb{R}} P_{n,\theta}(\theta \in C_{A,n})$ can be bounded as follows: Using Proposition 3, (16), observing that $\Phi$ has Lipschitz constant $(2\pi)^{-1/2}$, and using the elementary inequality noted earlier twice with $z = \eta_n^2$ we obtain

$$\begin{aligned}
&\left|p_{A,n}(\theta_n(\varepsilon); 1, \eta_n, a_n, a_n) - \Phi\left(n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) + \Phi\left(n^{1/2}(-a_n + \eta_n)\right)\right| \\
&\quad \le (2\pi)^{-1/2} n^{1/2} \left|-a_n c(\varepsilon)n^{-1/2} + \sqrt{a_n^2(1 + c(\varepsilon)n^{-1/2})^2 + \eta_n^2} - \sqrt{a_n^2 + \eta_n^2}\right| \\
&\qquad + (2\pi)^{-1/2} n^{1/2} \left|\sqrt{(a_n c(\varepsilon)n^{-1/2})^2 + \eta_n^2} - \sqrt{(a_n c(\varepsilon)n^{-1/2} + \eta_n)^2}\right| \\
&\quad \le (2\pi)^{-1/2} \left\{ 2a_n c(\varepsilon) + (2\eta_n)^{-1} a_n^2 (2c(\varepsilon) + c(\varepsilon)^2 n^{-1/2}) \right\} \\
&\quad \le (2\pi)^{-1/2} \left\{ 2c_3 c(\varepsilon) + 2^{-1} c_3^2 (2c(\varepsilon) + c(\varepsilon)^2) \right\} \eta_n = c_5(\varepsilon)\eta_n.
\end{aligned}$$

Consequently, for $n > n_0(c(\varepsilon))$

$$\inf_{\theta\in\mathbb{R}} \int_0^\infty p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \le \Phi\left(n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) - \Phi(n^{1/2}(-a_n + \eta_n)) + (c_4(\varepsilon) + c_5(\varepsilon))\eta_n + 2\varepsilon.$$

On the other hand,

$$\begin{aligned}
\inf_{\theta\in\mathbb{R}} \int_0^\infty p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds &\ge \int_0^\infty \inf_{\theta\in\mathbb{R}} p_{A,n}(\theta; 1, s\eta_n, sa_n, sa_n)\, h_n(s)\, ds \\
&= \int_0^\infty \left[\Phi\left(n^{1/2}s\sqrt{a_n^2 + \eta_n^2}\right) - \Phi(n^{1/2}s(-a_n + \eta_n))\right] h_n(s)\, ds \\
&= T_{n-1}\left(n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) - T_{n-1}(n^{1/2}(-a_n + \eta_n)) \\
&\ge \Phi\left(n^{1/2}\sqrt{a_n^2 + \eta_n^2}\right) - \Phi(n^{1/2}(-a_n + \eta_n)) - 2\|\Phi - T_{n-1}\|_\infty.
\end{aligned}$$
Since $\eta_n \to 0$ and $\|\Phi - T_{n-1}\|_\infty \to 0$ for $n \to \infty$ and since $\varepsilon$ was arbitrary the proof is complete. ∎

Lemma 13 Suppose $\sigma = 1$. Then for every $\varepsilon > 0$ there exists a $c = c(\varepsilon) > 0$ such that

$$\int_{\max(0,\, 1 - cn^{-1/2})}^{1 + cn^{-1/2}} h_n(s)\, ds \ge 1 - \varepsilon$$

holds for every $n \ge 2$.

Proof. By the central limit theorem and the delta-method we have that $n^{1/2}(\hat\sigma - 1)$ converges to a normal distribution. It follows that $n^{1/2}(\hat\sigma - 1)$ is (uniformly) tight. In other words, for every $\varepsilon > 0$ we can find a real number $c > 0$ such that for all $n \ge 2$

$$\Pr\left(\left|n^{1/2}(\hat\sigma - 1)\right| \le c\right) \ge 1 - \varepsilon$$

holds.

Lemma 14 Suppose $n \ge 2$ and $x \ge y \ge 0$. Then $T_{n-1}(x) \le \Phi(x)$ and

$$T_{n-1}(x - y) - T_{n-1}(-x - y) \le \Phi(x - y) - \Phi(-x - y).$$

Proof. The first claim is well-known, see, e.g., Kagan and Nagaev (2008). The second claim follows immediately from the first claim, since by symmetry of $\Phi$ and $T_{n-1}$ we have

$$\Phi(x - y) - \Phi(-x - y) - (T_{n-1}(x - y) - T_{n-1}(-x - y)) = [\Phi(x - y) - T_{n-1}(x - y)] + [\Phi(x + y) - T_{n-1}(x + y)] \ge 0.$$

References

[1] Fan, J. & R. Li (2001): Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348-1360.
[2] Frank, I. E. & J. H. Friedman (1993): A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109-148.
[3] Joshi, V. M. (1969): Admissibility of the usual confidence sets for the mean of a univariate or bivariate normal population. Annals of Mathematical Statistics 40, 1042-1067.
[4] Kagan, A. & A. V. Nagaev (2008): A lemma on stochastic majorization and properties of the Student distribution. Theory of Probability and its Applications 52, 160-164.
[5] Knight, K. & W. Fu (2000): Asymptotics of lasso-type estimators. Annals of Statistics 28, 1356-1378.
[6] Leeb, H. & B. M. Pötscher (2008): Sparse estimators and the oracle property, or the return of Hodges' estimator. Journal of Econometrics 142, 201-211.
[7] Pötscher, B. M. (2009): Confidence sets based on sparse estimators are necessarily large. Sankhya 71-A, 1-18.
[8] Pötscher, B. M. & H. Leeb (2009): On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. Journal of Multivariate Analysis 100, 2065-2082.
[9] Pötscher, B. M. & U. Schneider (2009): On the distribution of the adaptive LASSO estimator. Journal of Statistical Planning and Inference 139, 2775-2790.
[10] Tibshirani, R. (1996): Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58, 267-288.
[11] Zou, H. (2006): The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418-1429.
