Self-concordant analysis for logistic regression
Francis Bach
INRIA - WILLOW Project-Team
Laboratoire d'Informatique de l'Ecole Normale Supérieure (CNRS/ENS/INRIA UMR 8548)
23, avenue d'Italie, 75214 Paris, France
francis.bach@inria.fr

October 24, 2018

Abstract

Most of the non-asymptotic theoretical work in regression is carried out for the square loss, where estimators can be obtained through closed-form expressions. In this paper, we use and extend tools from the convex optimization literature, namely self-concordant functions, to provide simple extensions of theoretical results for the square loss to the logistic loss. We apply the extension techniques to logistic regression with regularization by the $\ell_2$-norm and regularization by the $\ell_1$-norm, showing that new results for binary classification through logistic regression can be easily derived from corresponding results for least-squares regression.

1 Introduction

The theoretical analysis of statistical methods is usually greatly simplified when the estimators have closed-form expressions. For methods based on the minimization of a certain functional, such as M-estimation methods [1], this is true when the function to minimize is quadratic, i.e., in the context of regression, for the square loss. When such a loss is used, asymptotic and non-asymptotic results may be derived with classical tools from probability theory (see, e.g., [2]).

When the function which is minimized in M-estimation is not amenable to closed-form solutions, local approximations are needed for obtaining and analyzing a solution of the optimization problem. In the asymptotic regime, this has led to interesting developments and extensions of results from the quadratic case, e.g., consistency or asymptotic normality (see, e.g., [1]). However, the situation is different when one wishes to derive non-asymptotic results, i.e., results where all constants of the problem are explicit. Indeed, in order to prove results as sharp as for the square loss, much notation and many assumptions have to be introduced regarding second and third derivatives; this makes the derived results much more complicated than the ones for closed-form estimators [3, 4, 5].

A similar situation occurs in convex optimization, in the study of Newton's method for obtaining solutions of unconstrained optimization problems. It is known to be locally quadratically convergent for convex problems. However, its classical analysis requires cumbersome notations and assumptions regarding second and third-order derivatives (see, e.g., [6, 7]). This situation was greatly improved with the introduction of the notion of self-concordant functions, i.e., functions whose third derivatives are controlled by their second derivatives. With this tool, the analysis is much more transparent [7, 8]. While Newton's method is a commonly used algorithm for logistic regression (see, e.g., [9, 10]), leading to iterative least-squares algorithms, we do not focus in this paper on the resolution of the optimization problems, but on the statistical analysis of the associated global minimizers.

In this paper, we aim to borrow tools from convex optimization and self-concordance to analyze the statistical properties of logistic regression.
Since the logistic loss is not itself a self-concordant function, we introduce in Section 2 a new type of functions with a different control of the third derivatives. For these functions, we prove two types of results: first, we provide lower and upper Taylor expansions, i.e., Taylor expansions which are globally upper-bounding or lower-bounding a given function. Second, we prove results on the behavior of Newton's method which are similar to the ones for self-concordant functions. We then apply them in Sections 3, 4 and 5 to the one-step Newton iterate from the population solution of the corresponding problem (i.e., $\ell_2$- or $\ell_1$-regularized logistic regression). This essentially shows that the analysis of logistic regression can be done non-asymptotically using the local quadratic approximation of the logistic loss, without complex additional assumptions. Since this approximation corresponds to a weighted least-squares problem, results from least-squares regression can thus be naturally extended.

In order to consider such extensions and make sure that the new results closely match the corresponding ones for least-squares regression, we derive in Appendix G new Bernstein-like concentration inequalities for quadratic forms of bounded random variables, obtained from general results on U-statistics [11].

We first apply in Section 4 the extension technique to regularization by the $\ell_2$-norm, where we consider two settings: a situation with no assumptions regarding the conditional distribution of the observations, and another one where the model is assumed well-specified and we derive asymptotic expansions of the generalization performance with explicit bounds on remainder terms. In Section 5, we consider regularization by the $\ell_1$-norm and extend two known recent results for the square loss, one on model consistency [12, 13, 14, 15] and one on prediction efficiency [16]. The main contribution of this paper is to make these extensions as simple as possible, by allowing the use of non-asymptotic second-order Taylor expansions.

Notation. For $x \in \mathbb{R}^p$ and $q \geq 1$, we denote by $\|x\|_q$ the $\ell_q$-norm of $x$, defined as $\|x\|_q^q = \sum_{i=1}^p |x_i|^q$. We also denote by $\|x\|_\infty = \max_{i \in \{1,\dots,p\}} |x_i|$ its $\ell_\infty$-norm. We denote by $\lambda_{\max}(Q)$ and $\lambda_{\min}(Q)$ the largest and smallest eigenvalues of a symmetric matrix $Q$. We use the notation $Q_1 \preccurlyeq Q_2$ (resp. $Q_1 \succcurlyeq Q_2$) for the positive semi-definiteness of the matrix $Q_2 - Q_1$ (resp. $Q_1 - Q_2$). For $a \in \mathbb{R}$, $\mathrm{sign}(a)$ denotes the sign of $a$, defined as $\mathrm{sign}(a) = 1$ if $a > 0$, $-1$ if $a < 0$, and $0$ if $a = 0$. For a vector $v \in \mathbb{R}^p$, $\mathrm{sign}(v) \in \{-1, 0, 1\}^p$ denotes the vector of signs of the elements of $v$. Moreover, given a vector $v \in \mathbb{R}^p$ and a subset $I$ of $\{1,\dots,p\}$, $|I|$ denotes the cardinality of the set $I$ and $v_I$ denotes the vector in $\mathbb{R}^{|I|}$ of elements of $v$ indexed by $I$. Similarly, for a matrix $A \in \mathbb{R}^{p \times p}$, $A_{IJ}$ denotes the submatrix of $A$ composed of elements of $A$ whose rows are in $I$ and columns are in $J$. Finally, we let $\mathbb{P}$ and $\mathbb{E}$ denote general probability measures and expectations.

2 Taylor expansions and Newton's method

In this section, we consider a generic function $F : \mathbb{R}^p \to \mathbb{R}$, which is convex and three times differentiable. We denote by $F'(w) \in \mathbb{R}^p$ its gradient at $w \in \mathbb{R}^p$ and by $F''(w) \in \mathbb{R}^{p \times p}$ its Hessian at $w \in \mathbb{R}^p$.
We denote by $\lambda(w) \geq 0$ the smallest eigenvalue of the Hessian $F''(w)$ at $w \in \mathbb{R}^p$. If $\lambda(w) > 0$, i.e., the Hessian is invertible at $w$, we can define the Newton step as $\Delta^N(w) = -F''(w)^{-1} F'(w)$, and the Newton decrement $\nu(F,w)$ at $w$, defined through
\[
\nu(F,w)^2 = F'(w)^\top F''(w)^{-1} F'(w) = \Delta^N(w)^\top F''(w)\, \Delta^N(w).
\]
The one-step Newton iterate $w + \Delta^N(w)$ is the minimizer of the second-order Taylor expansion of $F$ at $w$, i.e., of the function $v \mapsto F(w) + F'(w)^\top (v-w) + \frac{1}{2}(v-w)^\top F''(w)(v-w)$. Newton's method consists in successively applying the same iteration until convergence. For more background and details about Newton's method, see, e.g., [7, 6, 17].
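As a concrete illustration of these definitions, the following minimal Python sketch (ours, not from the paper) computes the Newton step, the Newton decrement and the one-step Newton iterate for a generic function, given callables for its gradient and Hessian; `grad` and `hess` are placeholder names:

```python
import numpy as np

def newton_quantities(w, grad, hess):
    """Newton step, decrement and one-step iterate at w.

    grad(w) should return F'(w) (shape (p,)); hess(w) should return the
    Hessian F''(w) (shape (p, p)), assumed invertible at w.
    """
    g, H = grad(w), hess(w)
    step = -np.linalg.solve(H, g)            # Delta_N(w) = -F''(w)^{-1} F'(w)
    nu = np.sqrt(g @ np.linalg.solve(H, g))  # Newton decrement nu(F, w)
    return step, nu, w + step                # one-step Newton iterate
```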
2.1 Self-concordant functions

We now review some important properties of self-concordant functions [7, 8], i.e., three times differentiable convex functions such that for all $u, v \in \mathbb{R}^p$, the function $g : t \mapsto F(u + tv)$ satisfies, for all $t \in \mathbb{R}$, $|g'''(t)| \leq 2 g''(t)^{3/2}$.

The local behavior of self-concordant functions is well studied, and lower and upper Taylor expansions can be derived (similar to the ones we derive in Proposition 1). Moreover, bounds are available for the behavior of Newton's method; given a self-concordant function $F$, if $w \in \mathbb{R}^p$ is such that $\nu(F,w) \leq 1/4$, then $F$ attains its unique global minimum at some $w^* \in \mathbb{R}^p$, and we have the following bound on the error $w - w^*$ (see, e.g., [8]):
\[
(w - w^*)^\top F''(w)\,(w - w^*) \leq 4\, \nu(F,w)^2. \tag{1}
\]
Moreover, the Newton decrement at the one-step Newton iterate from $w \in \mathbb{R}^p$ can be upper-bounded as follows:
\[
\nu(F, w + \Delta^N(w)) \leq \nu(F,w)^2, \tag{2}
\]
which allows to prove an upper bound on the error of the one-step iterate, by application of Eq. (1) to $w + \Delta^N(w)$. Note that these bounds are not the sharpest, but are sufficient in our context. They are commonly used to show the global convergence of the damped Newton's method [8] or of Newton's method with backtracking line search [7], as well as a precise upper bound on the number of iterations to reach a given precision. Note that in the context of machine learning and statistics, self-concordant functions have been used for bandit optimization and online learning [18], but for barrier functions related to constrained optimization problems, and not directly for M-estimation.

2.2 Modifications of self-concordant functions

The logistic function $u \mapsto \log(1 + e^{-u})$ is not self-concordant, as the third derivative is bounded by a constant times the second derivative (without the power $3/2$). However, similar bounds can be derived with a different control of the third derivatives. Proposition 1 provides lower and upper Taylor expansions, while Proposition 2 considers the behavior of Newton's method. Proofs may be found in Appendix A and follow closely the ones for regular self-concordant functions found in [8].

Proposition 1 (Taylor expansions) Let $F : \mathbb{R}^p \to \mathbb{R}$ be a convex three times differentiable function such that for all $w, v \in \mathbb{R}^p$, the function $g(t) = F(w + tv)$ satisfies, for all $t \in \mathbb{R}$, $|g'''(t)| \leq R \|v\|_2 \times g''(t)$, for some $R \geq 0$. We then have, for all $w, v, z \in \mathbb{R}^p$:
\[
F(w+v) \geq F(w) + v^\top F'(w) + \frac{v^\top F''(w) v}{R^2 \|v\|_2^2} \big( e^{-R\|v\|_2} + R\|v\|_2 - 1 \big), \tag{3}
\]
\[
F(w+v) \leq F(w) + v^\top F'(w) + \frac{v^\top F''(w) v}{R^2 \|v\|_2^2} \big( e^{R\|v\|_2} - R\|v\|_2 - 1 \big), \tag{4}
\]
\[
\frac{z^\top [ F'(w+v) - F'(w) - F''(w) v ]}{[z^\top F''(w) z]^{1/2}} \leq [v^\top F''(w) v]^{1/2}\, \frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2}, \tag{5}
\]
\[
e^{-R\|v\|_2}\, F''(w) \preccurlyeq F''(w+v) \preccurlyeq e^{R\|v\|_2}\, F''(w). \tag{6}
\]

The inequalities in Eq. (3) and Eq. (4) provide upper and lower second-order Taylor expansions of $F$, while Eq. (5) provides a first-order Taylor expansion of $F'$, and Eq. (6) can be considered as an upper and lower zero-order Taylor expansion of $F''$. Note the difference here between Eqs. (3)-(4) and regular third-order Taylor expansions of $F$: the remainder term in the Taylor expansion, i.e., $F(w+v) - F(w) - v^\top F'(w) - \frac{1}{2} v^\top F''(w) v$, is upper-bounded by
\[
\frac{v^\top F''(w) v}{R^2 \|v\|_2^2} \big( e^{R\|v\|_2} - \tfrac{1}{2} R^2 \|v\|_2^2 - R\|v\|_2 - 1 \big);
\]
for $\|v\|_2$ small, we obtain a term proportional to $\|v\|_2^3$ (like a regular local Taylor expansion), but the bound remains valid for all $v$ and does not grow as fast as a third-order polynomial. Moreover, a regular Taylor expansion with a uniformly bounded third-order derivative would lead to a bound proportional to $\|v\|_2^3$, which does not take into account the local curvature of $F$ at $w$. Taking into account this local curvature is key to obtaining sharp and simple bounds on the behavior of Newton's method (see proof in Appendix A):

Proposition 2 (Behavior of Newton's method) Let $F : \mathbb{R}^p \to \mathbb{R}$ be a convex three times differentiable function such that for all $w, v \in \mathbb{R}^p$, the function $g(t) = F(w + tv)$ satisfies, for all $t \in \mathbb{R}$, $|g'''(t)| \leq R \|v\|_2 \times g''(t)$, for some $R \geq 0$. Let $\lambda(w) > 0$ be the lowest eigenvalue of $F''(w)$ for some $w \in \mathbb{R}^p$. If $\nu(F,w) \leq \frac{\lambda(w)^{1/2}}{2R}$, then $F$ has a unique global minimizer $w^* \in \mathbb{R}^p$ and we have:
\[
(w - w^*)^\top F''(w)\,(w - w^*) \leq 16\, \nu(F,w)^2, \tag{7}
\]
\[
\frac{R\, \nu(F, w + \Delta^N(w))}{\lambda(w + \Delta^N(w))^{1/2}} \leq \left( \frac{R\, \nu(F,w)}{\lambda(w)^{1/2}} \right)^2, \tag{8}
\]
\[
\big( w + \Delta^N(w) - w^* \big)^\top F''(w)\, \big( w + \Delta^N(w) - w^* \big) \leq \frac{16 R^2}{\lambda(w)}\, \nu(F,w)^4. \tag{9}
\]

Eq. (7) extends Eq. (1), while Eq. (8) extends Eq. (2). Note that the notion and the results are not invariant by affine transform (contrary to self-concordant functions) and that we still need a (non-uniformly) lower-bounded Hessian. The last two propositions constitute the main technical contribution of this paper. We now apply these to logistic regression and its regularized versions.

3 Application to logistic regression

We consider $n$ pairs of observations $(x_i, y_i)$ in $\mathbb{R}^p \times \{-1, 1\}$ and the following objective function for logistic regression:
\[
\hat{J}_0(w) = \frac{1}{n} \sum_{i=1}^n \log\big( 1 + \exp(-y_i w^\top x_i) \big) = \frac{1}{n} \sum_{i=1}^n \Big\{ \ell(w^\top x_i) - \frac{y_i}{2}\, w^\top x_i \Big\}, \tag{10}
\]
where $\ell : u \mapsto \log( e^{-u/2} + e^{u/2} )$ is an even convex function. A short calculation leads to
\[
\ell'(u) = -1/2 + \sigma(u), \qquad \ell''(u) = \sigma(u)[1 - \sigma(u)], \qquad \ell'''(u) = \sigma(u)[1 - \sigma(u)][1 - 2\sigma(u)],
\]
where $\sigma(u) = (1 + e^{-u})^{-1}$ is the sigmoid function. Note that we have, for all $u \in \mathbb{R}$, $|\ell'''(u)| \leq \ell''(u)$.
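To make the derivative computations above concrete, here is a minimal numerical sketch (ours, with illustrative names) of $\ell$ and its derivatives, together with a check of the key bound $|\ell'''(u)| \leq \ell''(u)$ on a grid:

```python
import numpy as np

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))    # sigmoid

ell = lambda u: np.logaddexp(-u / 2, u / 2)   # log(e^{-u/2} + e^{u/2})
d1 = lambda u: -0.5 + sigma(u)                # ell'
d2 = lambda u: sigma(u) * (1 - sigma(u))      # ell''
d3 = lambda u: d2(u) * (1 - 2 * sigma(u))     # ell'''

u = np.linspace(-20, 20, 1001)
assert np.all(np.abs(d3(u)) <= d2(u) + 1e-12)  # |ell'''| <= ell''
```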
The cost function $\hat{J}_0$ defined in Eq. (10) is proportional to the negative conditional log-likelihood of the data under the conditional model $\mathbb{P}(y_i = \varepsilon_i \mid x_i) = \sigma(\varepsilon_i w^\top x_i)$.

If $R = \max_{i \in \{1,\dots,n\}} \|x_i\|_2$ denotes the maximum $\ell_2$-norm of all input data points, then the cost function $\hat{J}_0$ defined in Eq. (10) satisfies the assumptions of Proposition 2. Indeed, we have, with the notations of Proposition 2,
\[
|g'''(t)| = \bigg| \frac{1}{n} \sum_{i=1}^n \ell'''[(w+tv)^\top x_i]\, (x_i^\top v)^3 \bigg| \leq \frac{1}{n} \sum_{i=1}^n \ell''[(w+tv)^\top x_i]\, (x_i^\top v)^2\, \|v\|_2 \|x_i\|_2 \leq R \|v\|_2 \times g''(t).
\]
Throughout this paper, we will consider a certain vector $w \in \mathbb{R}^p$ (usually defined through the population functionals) and consider the one-step Newton iterate from this $w$. The results from Section 2.2 will allow to show that this approximates the global minimum of $\hat{J}_0$ or a regularized version thereof.

Throughout this paper, we consider a fixed design setting (i.e., $x_1, \dots, x_n$ are considered deterministic) and we make the following assumptions:

(A1) Independent outputs: the outputs $y_i \in \{-1, 1\}$, $i = 1, \dots, n$, are independent (but not identically distributed).

(A2) Bounded inputs: $\max_{i \in \{1,\dots,n\}} \|x_i\|_2 \leq R$.

We define the model as well-specified if there exists $w_0 \in \mathbb{R}^p$ such that for all $i = 1, \dots, n$, $\mathbb{P}(y_i = \varepsilon_i) = \sigma(\varepsilon_i w_0^\top x_i)$, which is equivalent to $\mathbb{E}(y_i/2) = \ell'(w_0^\top x_i)$, and implies $\mathrm{var}(y_i/2) = \ell''(w_0^\top x_i)$. However, we do not always make such assumptions in the paper.

We use the matrix notation $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times p}$ for the design matrix, and $\varepsilon_i = y_i/2 - \mathbb{E}(y_i/2)$, for $i = 1, \dots, n$, which formally corresponds to the additive noise in least-squares regression. We also use the notation $Q = \frac{1}{n} X^\top \mathrm{Diag}(\mathrm{var}(y_i/2))\, X \in \mathbb{R}^{p \times p}$ and $q = \frac{1}{n} X^\top \varepsilon \in \mathbb{R}^p$. By assumption, we have $\mathbb{E}(q q^\top) = \frac{1}{n} Q$. We denote by $J_0$ the expectation of $\hat{J}_0$, i.e.:
\[
J_0(w) = \mathbb{E}\, \hat{J}_0(w) = \frac{1}{n} \sum_{i=1}^n \Big\{ \ell(w^\top x_i) - \mathbb{E}(y_i/2)\, w^\top x_i \Big\}.
\]
Note that with our notation, $\hat{J}_0(w) = J_0(w) - q^\top w$. In this paper we consider $J_0(\hat{w})$ as the generalization performance of a certain estimator $\hat{w}$. This corresponds to the average Kullback-Leibler divergence to the best model when the model is well-specified, and is common for the study of logistic regression and more generally generalized linear models [19, 20]. Measuring the classification performance through the 0-1 loss [21] is out of the scope of this paper.
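As an illustration, a minimal Python sketch (ours, assuming a well-specified model with a chosen $w_0$, so that $\mathrm{var}(y_i/2) = \ell''(w_0^\top x_i)$ is available in closed form) of the quantities $Q$, $q$ and $J_0$ defined above could look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                # fixed design
w0 = rng.normal(size=p)                    # generating loading vector

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
mu = -0.5 + sigma(X @ w0)                  # E(y_i/2) = ell'(w0^T x_i)
var = sigma(X @ w0) * (1 - sigma(X @ w0))  # var(y_i/2) = ell''(w0^T x_i)

y = np.where(rng.uniform(size=n) < sigma(X @ w0), 1.0, -1.0)
eps = y / 2 - mu                           # additive "noise"

Q = X.T @ (var[:, None] * X) / n           # weighted Gram matrix
q = X.T @ eps / n

def J0(w):
    """Expected risk J_0(w) = (1/n) sum { ell(w^T x_i) - E(y_i/2) w^T x_i }."""
    u = X @ w
    return np.mean(np.logaddexp(-u / 2, u / 2) - mu * u)
```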
The function $J_0$ is bounded from below, and therefore has a finite infimum $\inf_{w \in \mathbb{R}^p} J_0(w) \geq 0$. This infimum might or might not be attained at a finite $w_0 \in \mathbb{R}^p$; when the model is well-specified, it is always attained (but this is not a necessary condition), and, unless the design matrix $X$ has rank $p$, it is not unique.

The difference between the analysis through self-concordance and the classical asymptotic analysis is best seen when the model is well-specified, and exactly mimics the difference between the self-concordant analysis of Newton's method and its classical analysis. The usual analysis of logistic regression requires that the logistic function $u \mapsto \log(1 + e^{-u})$ be strongly convex (i.e., with a strictly positive lower bound on the second derivative), which is true only on a compact subset of $\mathbb{R}$. Thus, non-asymptotic results such as the ones from [5, 3] require an upper bound $M$ on $|w_0^\top x_i|$, where $w_0$ is the generating loading vector; then, the second derivative of the logistic loss is lower-bounded by $(1 + e^M)^{-1}$, and this lower bound may be very small when $M$ gets large. Our analysis does not require such a bound, because of the fine control of the third derivative.

4 Regularization by the $\ell_2$-norm

We denote by $\hat{J}_\lambda(w) = \hat{J}_0(w) + \frac{\lambda}{2} \|w\|_2^2$ the empirical $\ell_2$-regularized functional. For $\lambda > 0$, the function $\hat{J}_\lambda$ is strongly convex and we denote by $\hat{w}_\lambda$ the unique global minimizer of $\hat{J}_\lambda$. In this section, our goal is to find upper and lower bounds on the generalization performance $J_0(\hat{w}_\lambda)$, under minimal assumptions (Section 4.2) or when the model is well-specified (Section 4.3).

4.1 Reproducing kernel Hilbert spaces and splines

In this paper we focus explicitly on linear logistic regression, i.e., on a generalized linear model that allows a linear dependency between $x_i$ and the distribution of $y_i$. Although apparently limiting, in the context of regularization by the $\ell_2$-norm, this setting contains non-parametric and non-linear methods based on splines or reproducing kernel Hilbert spaces (RKHS) [22]. Indeed, because of the representer theorem [23], minimizing the cost function
\[
\frac{1}{n} \sum_{i=1}^n \Big\{ \ell[f(x_i)] - \frac{y_i}{2} f(x_i) \Big\} + \frac{\lambda}{2} \|f\|_{\mathcal{F}}^2,
\]
with respect to the function $f$ in the RKHS $\mathcal{F}$ (with norm $\|\cdot\|_{\mathcal{F}}$ and kernel $k$), is equivalent to minimizing the cost function
\[
\frac{1}{n} \sum_{i=1}^n \Big\{ \ell[(T\beta)_i] - \frac{y_i}{2} (T\beta)_i \Big\} + \frac{\lambda}{2} \|\beta\|_2^2, \tag{11}
\]
with respect to $\beta \in \mathbb{R}^p$, where $T \in \mathbb{R}^{n \times p}$ is a square root of the kernel matrix $K \in \mathbb{R}^{n \times n}$ defined as $K_{ij} = k(x_i, x_j)$, i.e., such that $K = T T^\top$. The unique solution $f$ of the original problem is then obtained as $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$, where $\alpha$ is any vector satisfying $T T^\top \alpha = T \beta$ (which can be obtained by matrix pseudo-inversion [24]). Similar developments can be carried out for smoothing splines (see, e.g., [22, 25]). By identifying the matrix $T$ with the data matrix $X$, the optimization problem in Eq. (11) is identical to minimizing $\hat{J}_0(w) + \frac{\lambda}{2} \|w\|_2^2$, and thus our results apply to estimation in RKHSs.
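The reduction above is straightforward to implement; here is a small Python sketch (our own illustration, with an RBF kernel chosen arbitrarily) that builds a square root $T$ of the kernel matrix by eigendecomposition and recovers the coefficients $\alpha$ from $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))

# RBF kernel matrix K_ij = k(x_i, x_j)
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

# Square root T with K = T T^T, via eigendecomposition
vals, vecs = np.linalg.eigh(K)
vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
T = vecs * np.sqrt(vals)                 # K ≈ T @ T.T

# Given a solution beta of Eq. (11), recover alpha with T T^T alpha = T beta
beta = rng.normal(size=T.shape[1])
alpha = np.linalg.pinv(K) @ (T @ beta)   # pseudo-inversion, as in the text
```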
4.2 Minimal assumptions (misspecified model)

In this section, we do not assume that the model is well-specified. We obtain the following theorem (see proof in Appendix B), which only assumes boundedness of the covariates and independence of the outputs:

Theorem 1 (Misspecified model) Assume (A1), (A2) and $\lambda = 19 R^2 \sqrt{\frac{\log(8/\delta)}{n}}$, with $\delta \in (0,1)$. Then, with probability at least $1 - \delta$, for all $w_0 \in \mathbb{R}^p$,
\[
J_0(\hat{w}_\lambda) \leq J_0(w_0) + \big( 10 + 100 R^2 \|w_0\|_2^2 \big) \sqrt{\frac{\log(8/\delta)}{n}}. \tag{12}
\]

In particular, if the global minimum of $J_0$ is attained at $w_0$ (which is not an assumption of Theorem 1), we obtain an oracle inequality, as $J_0(w_0) = \inf_{w \in \mathbb{R}^p} J_0(w)$. The lack of additional assumptions unsurprisingly gives rise to a slow rate of $n^{-1/2}$. This is to be compared with [26], which uses different proof techniques but obtains similar results for all convex Lipschitz-continuous losses (and not only for the logistic loss). However, the techniques presented in this paper allow the derivation of much more precise statements in terms of bias and variance (and with better rates), which involve some knowledge of the problem. We do not pursue detailed results here, but focus in the next section on well-specified models, where results have a simpler form.

This highlights two opposite strategies for the theoretical analysis of regularized problems. The first one, followed by [26, 27], is mostly loss-independent and relies on advanced tools from empirical process theory, namely uniform concentration inequalities. Results are widely applicable and make very few assumptions. However, they tend to give performance guarantees which are far below the observed performance of such methods in applications. The second strategy, which we follow in this paper, is to restrict the loss class (to linear or logistic) and derive the limiting convergence rate, which does depend on unknown constants (typically the best linear classifier itself). Once the limit is obtained, we believe it gives a better interpretation of the performance of these methods; and if one really wishes to make no assumption, taking upper bounds on these quantities, we may get back results obtained with the generic strategy, which is exactly what Theorem 1 is achieving.

Thus, a detailed analysis of the convergence rate, as done in Theorem 2 in the next section, serves two purposes: first, it gives a sharp result that depends on unknown constants; second, the constants can be maximized out and more general results may be obtained, with fewer assumptions but worse convergence rates.

4.3 Well-specified models

We now assume that the model is well-specified, i.e., that the probability that $y_i = 1$ is a sigmoid function of a linear function of $x_i$, which is equivalent to:

(A3) Well-specified model: there exists $w_0 \in \mathbb{R}^p$ such that $\mathbb{E}(y_i/2) = \ell'(w_0^\top x_i)$.

Theorem 2 will give upper and lower bounds on the expected risk of the $\ell_2$-regularized estimator $\hat{w}_\lambda$, i.e., $J_0(\hat{w}_\lambda)$. We use the following definitions for the two degrees of freedom and biases, which are usual in the context of ridge regression and spline smoothing (see, e.g., [22, 25, 28]):
\[
\text{degrees of freedom (1):} \quad d_1 = \mathrm{tr}\, Q (Q + \lambda I)^{-1},
\]
\[
\text{degrees of freedom (2):} \quad d_2 = \mathrm{tr}\, Q^2 (Q + \lambda I)^{-2},
\]
\[
\text{bias (1):} \quad b_1 = \lambda^2\, w_0^\top (Q + \lambda I)^{-1} w_0,
\]
\[
\text{bias (2):} \quad b_2 = \lambda^2\, w_0^\top Q (Q + \lambda I)^{-2} w_0.
\]
Note that we always have the inequalities $d_2 \leq d_1 \leq \min\{ R^2/\lambda, n \}$ and $b_2 \leq b_1 \leq \min\{ \lambda \|w_0\|_2^2,\ \lambda^2 w_0^\top Q^{-1} w_0 \}$, and that these quantities depend on $\lambda$. In the context of RKHSs outlined in Section 4.1, we have $d_1 = \mathrm{tr}\, K ( K + n\lambda\, \mathrm{Diag}(\sigma_i^2) )^{-1}$, a quantity which is also usually referred to as the degrees of freedom [29]. In the context of the analysis of $\ell_2$-regularized methods, the two degrees of freedom are necessary, as outlined in Theorems 2 and 3, and in [28]. Moreover, we denote by $\kappa \geq 0$ the following quantity:
\[
\kappa = \frac{R}{\lambda^{1/2}} \Big( \frac{d_1}{n} + b_1 \Big) \Big( \frac{d_2}{n} + b_2 \Big)^{-1/2}. \tag{13}
\]
Such a quantity is an extension of the one used by [30] in the context of kernel Fisher discriminant analysis used as a test for homogeneity. In order to obtain asymptotic equivalents, we require $\kappa$ to be small, which, as shown later in this section, occurs in many interesting cases when $n$ is large enough.
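For concreteness, a small Python sketch (ours, reusing `Q` and `w0` from the earlier sketch) of the four quantities and of $\kappa$ from Eq. (13):

```python
import numpy as np

def dof_bias_kappa(Q, w0, lam, R, n):
    """Degrees of freedom d1, d2, biases b1, b2 and kappa of Eq. (13)."""
    p = Q.shape[0]
    A = np.linalg.inv(Q + lam * np.eye(p))
    d1 = np.trace(Q @ A)
    d2 = np.trace(Q @ Q @ A @ A)
    b1 = lam**2 * w0 @ A @ w0
    b2 = lam**2 * w0 @ Q @ A @ A @ w0
    kappa = R / np.sqrt(lam) * (d1 / n + b1) / np.sqrt(d2 / n + b2)
    return d1, d2, b1, b2, kappa
```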
In this section, we will apply the results from Section 2 to the functions $\hat{J}_\lambda$ and $J_0$. Essentially, we will consider local quadratic approximations of these functions around the generating loading vector $w_0$, leading to replacing the true estimator $\hat{w}_\lambda$ by the one-step Newton iterate from $w_0$. This is only possible if the Newton decrement $\nu(\hat{J}_\lambda, w_0)$ is small enough, which leads to additional constraints (in particular the upper bound on $\kappa$).

Theorem 2 (Asymptotic generalization performance) Assume (A1), (A2) and (A3). Assume moreover $\kappa \leq 1/16$, where $\kappa$ is defined in Eq. (13). If $v \in [0, 1/4]$ satisfies $v^3 (d_2 + n b_2)^{1/2} \leq 12$, then, with probability at least $1 - \exp( -v^2 (d_2 + n b_2) )$:
\[
\Big| J_0(\hat{w}_\lambda) - J_0(w_0) - \frac{1}{2} \Big( b_2 + \frac{d_2}{n} \Big) \Big| \leq \Big( b_2 + \frac{d_2}{n} \Big) ( 69 v + 2560 \kappa ). \tag{14}
\]

Relationship to previous work. When the dimension $p$ of $w_0$ is bounded, then under the regular asymptotic regime ($n$ tends to $+\infty$), $J_0(\hat{w}_\lambda)$ has the expansion $J_0(w_0) + \frac{1}{2}\big( b_2 + \frac{d_2}{n} \big)$, a result which has been obtained by several authors in several settings [31, 32]. In this asymptotic regime, the optimal $\lambda$ is known to be of order $O(n^{-1})$ [33]. The main contribution of our analysis is to allow a non-asymptotic analysis with explicit constants. Moreover, note that for the square loss, the bound in Eq. (14) holds with $\kappa = 0$, which can be linked to the fact that our self-concordant analysis from Propositions 1 and 2 is applicable with $R = 0$ for the square loss. Note that the constants in the previous theorem could probably be improved.

Conditions for asymptotic equivalence. In order to have the remainder term in Eq. (14) negligible with high probability compared to the lowest-order term in the expansion of $J_0(\hat{w}_\lambda)$, we need to have $d_2 + n b_2$ large and $\kappa$ small (so that $v$ can be taken small while $v^2 (d_2 + n b_2)$ is large, and hence we have a result with high probability). The assumption that $d_2 + n b_2$ grows unbounded when $n$ tends to infinity is a classical assumption in the study of smoothing splines and RKHSs [34, 35], and simply states that the convergence rate of the excess risk $J_0(\hat{w}_\lambda) - J_0(w_0)$, i.e., $b_2 + d_2/n$, is slower than for parametric estimation, i.e., slower than $n^{-1}$.

Study of the parameter $\kappa$. First, we always have $\kappa \geq \frac{R}{\lambda^{1/2}} \big( \frac{d_1}{n} + b_1 \big)^{1/2}$; thus an upper bound on $\kappa$ implies an upper bound on $\frac{d_1}{n} + b_1$, which is needed in the proof of Theorem 2 to show that the Newton decrement is small enough. Moreover, $\kappa$ is bounded by the sum of $\kappa_{\mathrm{bias}} = \frac{R}{\lambda^{1/2}}\, b_1 b_2^{-1/2}$ and $\kappa_{\mathrm{var}} = \frac{R}{\lambda^{1/2}}\, \frac{d_1}{n} \big( \frac{d_2}{n} \big)^{-1/2}$. Under simple assumptions on the eigenvalues of $Q$, or equivalently of $\mathrm{Diag}(\sigma_i) K \mathrm{Diag}(\sigma_i)$, one can show that $\kappa_{\mathrm{var}}$ is small. For example, if $d$ of these eigenvalues are equal to one and the remaining ones are zero, then $\kappa_{\mathrm{var}} = \frac{R d^{1/2}}{\lambda^{1/2} n^{1/2}}$, and thus we simply need $\lambda$ asymptotically greater than $R^2 d / n$. For additional conditions for $\kappa_{\mathrm{var}}$, see [28, 30]. A simple condition for $\kappa_{\mathrm{bias}}$ can be obtained if $w_0^\top Q^{-1} w_0$ is assumed bounded (in the context of RKHSs this is a stricter condition than the generating function being inside the RKHS, and is used by [36] in the context of sparsity-inducing norms). In this case, the bias terms are negligible compared to the variance term as soon as $\lambda$ is asymptotically greater than $n^{-1/2}$.
Variance term. Note that the diagonal matrix $\mathrm{Diag}(\sigma_i^2)$ is upper-bounded by $\frac{1}{4} I$, i.e., $\mathrm{Diag}(\sigma_i^2) \preccurlyeq \frac{1}{4} I$, so that the degrees of freedom for logistic regression are always less than the corresponding ones for least-squares regression (for $\lambda$ multiplied by 4). Indeed, the pairs $(x_i, y_i)$ for which the conditional distribution is close to deterministic are such that $\sigma_i^2$ is close to zero. This should thus reduce the variance of the estimator, as little noise is associated with these points, and the effect of this reduction is exactly measured by the reduction in the degrees of freedom. Moreover, the rate of convergence $d_2/n$ of the variance term has been studied by many authors (see, e.g., [22, 25, 30]) and depends on the decay of the eigenvalues of $Q$ (the faster the decay, the smaller $d_2$). The degrees of freedom usually grow with $n$, but in many cases more slowly than $n^{1/2}$, leading to faster rates in Eq. (14).

4.4 Smoothing parameter selection

In this section, we obtain a criterion similar to Mallows's $C_L$ [37] to estimate the generalization error and select in a data-driven way the regularization parameter $\lambda$ (referred to as the smoothing parameter when dealing with splines or RKHSs). The following theorem shows that with a data-dependent criterion, we may obtain a good estimate of the generalization performance, up to a constant term $q^\top w_0$ independent of $\lambda$ (see proof in Appendix D):

Theorem 3 (Data-driven estimation of generalization performance) Assume (A1), (A2) and (A3). Let $\hat{Q}_\lambda = \frac{1}{n} \sum_{i=1}^n \ell''(\hat{w}_\lambda^\top x_i)\, x_i x_i^\top$ and $q = \frac{1}{n} \sum_{i=1}^n ( y_i/2 - \mathbb{E}(y_i/2) )\, x_i$. Assume moreover $\kappa \leq 1/16$, where $\kappa$ is defined in Eq. (13). If $v \in [0, 1/4]$ satisfies $v^3 (d_2 + n b_2)^{1/2} \leq 12$, then, with probability at least $1 - \exp( -v^2 (d_2 + n b_2) )$:
\[
\Big| J_0(\hat{w}_\lambda) - \hat{J}_0(\hat{w}_\lambda) - \frac{1}{n} \mathrm{tr}\, \hat{Q}_\lambda ( \hat{Q}_\lambda + \lambda I )^{-1} - q^\top w_0 \Big| \leq \Big( b_2 + \frac{d_2}{n} \Big) ( 69 v + 2560 \kappa ).
\]

The previous theorem, which is essentially a non-asymptotic version of results in [31, 32], can be further extended to obtain oracle inequalities when minimizing the data-driven criterion $\hat{J}_0(\hat{w}_\lambda) + \frac{1}{n} \mathrm{tr}\, \hat{Q}_\lambda ( \hat{Q}_\lambda + \lambda I )^{-1}$, similar to results obtained in [35, 28] for the square loss. Note that, contrary to least-squares regression with Gaussian noise, there is no need to estimate the unknown noise variance (of course, only when the logistic model is actually well-specified); however, the matrix $Q$ used to define the degrees of freedom does depend on $w_0$ and thus requires that $\hat{Q}_\lambda$ be used as an estimate. Finally, criteria based on generalized cross-validation [38, 4] could be studied with similar tools.
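A minimal sketch of this selection rule in Python (our own illustration; `fit_l2_logistic` is a hypothetical solver for $\hat{w}_\lambda$, e.g., a few damped Newton iterations):

```python
import numpy as np

def criterion(X, y, lam, fit_l2_logistic):
    """Data-driven criterion J0_hat(w_lam) + (1/n) tr Q_lam (Q_lam + lam I)^{-1}."""
    n, p = X.shape
    w = fit_l2_logistic(X, y, lam)                  # estimate of w_hat_lambda
    u = X @ w
    J0_hat = np.mean(np.logaddexp(-u / 2, u / 2) - (y / 2) * u)
    s = 1 / (1 + np.exp(-u))
    Q_lam = X.T @ ((s * (1 - s))[:, None] * X) / n  # weighted Gram at w
    dof = np.trace(Q_lam @ np.linalg.inv(Q_lam + lam * np.eye(p)))
    return J0_hat + dof / n

# Choose lambda on a grid by minimizing the criterion, e.g.:
# best = min(lams, key=lambda lam: criterion(X, y, lam, fit_l2_logistic))
```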
5 Regularization by the $\ell_1$-norm

In this section, we consider an estimator $\hat{w}_\lambda$ obtained as a minimizer of the $\ell_1$-regularized empirical risk, i.e., of $\hat{J}_0(w) + \lambda \|w\|_1$. It is well known that the estimator has some zero components [39]. In this section, we extend some of the recent results [12, 13, 14, 15, 16, 40] for the square loss (i.e., the Lasso) to the logistic loss. We assume throughout this section that the model is well-specified, that is, that the observations $y_i$, $i = 1, \dots, n$, are generated according to the logistic model $\mathbb{P}(y_i = \varepsilon_i) = \sigma( \varepsilon_i w_0^\top x_i )$. We denote by $K = \{ j \in \{1,\dots,p\},\ (w_0)_j \neq 0 \}$ the set of non-zero components of $w_0$ and by $s = \mathrm{sign}(w_0) \in \{-1,0,1\}^p$ the vector of signs of $w_0$.

On top of Assumptions (A1), (A2) and (A3), we will make the following assumption regarding the normalization of each covariate (which can always be imposed by renormalization):

(A4) Normalized covariates: for all $j = 1, \dots, p$, $\frac{1}{n} \sum_{i=1}^n [(x_i)_j]^2 \leq 1$.

In this section, we consider two different results, one on model consistency (Section 5.1) and one on efficiency (Section 5.2). As for the square loss, they will both depend on additional assumptions regarding the square $p \times p$ matrix $Q = \frac{1}{n} \sum_{i=1}^n \ell''(w_0^\top x_i)\, x_i x_i^\top$. This matrix is a weighted Gram matrix, which corresponds to the unweighted one for the square loss. As already shown in [5, 3], usual assumptions on the Gram matrix for the square loss extend to the logistic loss setting through the weighted Gram matrix $Q$. In this paper, we consider two types of results based on specific assumptions on $Q$, but other ones could be considered as well (such as [41]). The main contribution of using self-concordant analysis is to allow simple extensions from the square loss, with short proofs and sharper bounds, in particular by avoiding an exponential constant in the maximal value of $|w_0^\top x_i|$, $i = 1, \dots, n$.

5.1 Model consistency condition

The following theorem provides a sufficient condition for model consistency. It is based on the consistency condition $\| Q_{K^c K} Q_{KK}^{-1} s_K \|_\infty < 1$, which is exactly the same as the one for the square loss [15, 12, 14] (see proof in Appendix E):

Theorem 4 (Model consistency for $\ell_1$-regularization) Assume (A1), (A2), (A3) and (A4). Assume that there exist $\eta, \rho, \mu > 0$ such that
\[
\| Q_{K^c K} Q_{KK}^{-1} s_K \|_\infty \leq 1 - \eta, \tag{15}
\]
$\lambda_{\min}(Q_{KK}) \geq \rho$ and $\min_{j \in K} |(w_0)_j| \geq \mu$. Assume $\lambda \leq \min\Big\{ \frac{\rho\mu}{4 |K|^{1/2}},\ \frac{\eta \rho^{3/2}}{64 R |K|} \Big\}$. Then the probability that the vector of signs of $\hat{w}_\lambda$ is different from $s = \mathrm{sign}(w_0)$ is upper-bounded by
\[
2p \exp\Big( -\frac{n \lambda^2 \eta^2}{16} \Big) + 2|K| \exp\Big( -\frac{n \rho^2 \mu^2}{16 |K|} \Big) + 2|K| \exp\Big( -\frac{\lambda n \rho^{3/2} \eta}{64 R |K|} \Big). \tag{16}
\]

Comparison with square loss. For the square loss, the previous theorem simplifies [15, 12]: with our notations, the constraint $\lambda \leq \frac{\eta \rho^{3/2}}{64 R |K|}$ and the last term in Eq. (16), which are the only ones depending on $R$, can be removed (indeed, the square loss allows the application of our adapted self-concordant analysis with the constant $R = 0$). On the one hand, the favorable scaling between $p$ and $n$, i.e., $\log p = O(n)$ for a certain well-chosen $\lambda$, is preserved (since the logarithm of the added term is proportional to $-\lambda n$). On the other hand, the terms in $R$ may be large, as $R$ is the radius of the entire data (i.e., with all $p$ covariates). Bounds with the radius of the data restricted to the relevant features in $K$ could be derived as well (see details in the proof in Appendix E).

Necessary condition. In the case of the square loss, a weak form of Eq. (15), i.e., $\| Q_{K^c K} Q_{KK}^{-1} s_K \|_\infty \leq 1$, turns out to be necessary and sufficient for asymptotic correct model selection [14]. While the weak form is clearly necessary for model consistency, and the strict form sufficient (as proved in Theorem 4), we are currently investigating whether the weak condition is also sufficient for the logistic loss.
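The consistency condition of Eq. (15) is easy to evaluate numerically; here is a small sketch (ours) that, given $Q$ and $w_0$, computes $\| Q_{K^c K} Q_{KK}^{-1} s_K \|_\infty$ together with the constants $\rho$ and $\mu$ of Theorem 4:

```python
import numpy as np

def consistency_constants(Q, w0):
    """Evaluate the quantities appearing in Theorem 4."""
    K = np.flatnonzero(w0)                 # support of w0
    Kc = np.flatnonzero(w0 == 0)
    sK = np.sign(w0[K])
    QKK = Q[np.ix_(K, K)]
    QKcK = Q[np.ix_(Kc, K)]
    # should be strictly less than 1 for model consistency
    irrep = np.abs(QKcK @ np.linalg.solve(QKK, sK)).max() if Kc.size else 0.0
    rho = np.linalg.eigvalsh(QKK).min()    # lower bound on lambda_min(Q_KK)
    mu = np.abs(w0[K]).min()               # smallest nonzero |(w0)_j|
    return irrep, rho, mu
```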
5.2 Efficiency

Another type of result has been derived, based on different proof techniques [16] and aimed at efficiency (i.e., predictive performance). Here again, we can extend the result in a very simple way. We assume, given $K$ the set of non-zero components of $w_0$:

(A5) Restricted eigenvalue condition:
\[
\rho = \min_{\|\Delta_{K^c}\|_1 \leq 3 \|\Delta_K\|_1} \frac{ (\Delta^\top Q \Delta)^{1/2} }{ \|\Delta_K\|_2 } > 0.
\]

Note that the assumption made in [16] is slightly stronger but only depends on the cardinality of $K$ (by minimizing with respect to all sets of indices with cardinality equal to the one of $K$). The following theorem provides an estimate of the estimation error as well as an oracle inequality for the generalization performance (see proof in Appendix F):

Theorem 5 (Efficiency for $\ell_1$-regularization) Assume (A1), (A2), (A3), (A4), and (A5). For all $\lambda \leq \frac{\rho^2}{48 R |K|}$, with probability at least $1 - 2p\, e^{-n\lambda^2/5}$, we have:
\[
\| \hat{w}_\lambda - w_0 \|_1 \leq 12 \lambda |K| \rho^{-2}, \qquad J_0(\hat{w}_\lambda) - J_0(w_0) \leq 12 \lambda^2 |K| \rho^{-2}.
\]

We obtain a result which directly mimics the one obtained in [16] for the square loss, with the exception of the added bound on $\lambda$. In particular, if we take $\lambda = \sqrt{\frac{10 \log(p)}{n}}$, we get, with probability at least $1 - 2/p$, an upper bound on the generalization performance:
\[
J_0(\hat{w}_\lambda) \leq J_0(w_0) + 120\, \frac{\log p}{n}\, |K|\, \rho^{-2}.
\]
Again, the proof of this result is a direct extension of the corresponding one for the square loss, with few additional assumptions, owing to the proper self-concordant analysis.

6 Conclusion

We have provided an extension of self-concordant functions that allows simple extensions of theoretical results for the square loss to the logistic loss. We have applied the extension techniques to regularization by the $\ell_2$-norm and regularization by the $\ell_1$-norm, showing that new results for logistic regression can be easily derived from corresponding results for least-squares regression, without added complex assumptions.

The present work could be extended in several interesting ways to different settings. First, for logistic regression, other extensions of theoretical results from least-squares regression could be carried out: for example, the analysis of sequential experimental design for logistic regression leads to many assumptions that could be relaxed (see, e.g., [42]). Also, other regularization frameworks based on sparsity-inducing norms could be applied to logistic regression with similar guarantees as for least-squares regression, such as the group Lasso for grouped variables [43] or non-parametric problems [36], or resampling-based procedures [44, 45] that allow to get rid of sufficient consistency conditions.

Second, the techniques developed in this paper could be extended to other M-estimation problems: indeed, other generalized linear models beyond logistic regression could be considered, where higher-order derivatives can be expressed through cumulants [19]. Moreover, similar developments could be made for density estimation for the exponential family, which would in particular lead to interesting developments for Gaussian models in high dimensions, where $\ell_1$-regularization has proved useful [46, 47]. Finally, other losses for binary or multi-class classification are of clear interest [21], potentially with different controls of the third derivatives.

A Proofs of optimization results

We follow the proof techniques of [8], simply changing the control of the third-order derivative.
We denote by $F'''(w)$ the third-order derivative of $F$, which is itself a function from $\mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p$ to $\mathbb{R}$. The assumptions made in Propositions 1 and 2 are in fact equivalent to (see a similar proof in [8]):
\[
\forall u, v, t, w \in \mathbb{R}^p, \qquad | F'''(w)[u, v, t] | \leq R \|u\|_2\, [ v^\top F''(w) v ]^{1/2}\, [ t^\top F''(w) t ]^{1/2}. \tag{17}
\]

A.1 Univariate functions

We first consider univariate functions and prove the following lemma, which gives upper and lower Taylor expansions:

Lemma 1 Let $g : \mathbb{R} \to \mathbb{R}$ be a convex three times differentiable function such that for all $t \in \mathbb{R}$, $|g'''(t)| \leq S g''(t)$, for some $S \geq 0$. Then, for all $t \geq 0$:
\[
\frac{g''(0)}{S^2} \big( e^{-St} + St - 1 \big) \leq g(t) - g(0) - g'(0)\, t \leq \frac{g''(0)}{S^2} \big( e^{St} - St - 1 \big). \tag{18}
\]

Proof. Let us first assume that $g''(t)$ is strictly positive for all $t \in \mathbb{R}$. We have, for all $t \geq 0$:
\[
-S \leq \frac{d \log g''(t)}{dt} \leq S.
\]
Then, by integrating once between $0$ and $t$, taking exponentials, and then integrating twice:
\[
-St \leq \log g''(t) - \log g''(0) \leq St,
\]
\[
g''(0)\, e^{-St} \leq g''(t) \leq g''(0)\, e^{St}, \tag{19}
\]
\[
g''(0)\, S^{-1} ( 1 - e^{-St} ) \leq g'(t) - g'(0) \leq g''(0)\, S^{-1} ( e^{St} - 1 ),
\]
\[
g(t) \geq g(0) + g'(0)\, t + g''(0)\, S^{-2} ( e^{-St} + St - 1 ), \tag{20}
\]
\[
g(t) \leq g(0) + g'(0)\, t + g''(0)\, S^{-2} ( e^{St} - St - 1 ), \tag{21}
\]
which leads to Eq. (18).

Let us now assume only that $g''(0) > 0$. If we denote by $A$ the connected component containing $0$ of the open set $\{ t \in \mathbb{R},\ g''(t) > 0 \}$, then the preceding developments are valid on $A$; thus, Eq. (19) implies that $A$ is not upper-bounded. The same reasoning on $-g$ ensures that $A = \mathbb{R}$, and hence $g''(t)$ is strictly positive for all $t \in \mathbb{R}$. Since the problem is invariant by translation, we have shown that if there exists $t_0 \in \mathbb{R}$ such that $g''(t_0) > 0$, then $g''(t) > 0$ for all $t \in \mathbb{R}$. Thus, we need to prove Eq. (18) for $g''$ always strictly positive (which is done above) and for $g''$ identically equal to zero, which implies that $g$ is linear, in which case Eq. (18) holds trivially.

Note the difference with a classical uniform bound on the third derivative, which leads to a third-order polynomial lower bound, which tends to $-\infty$ more quickly than Eq. (20). Moreover, Eq. (21) may be interpreted as an upper bound on the remainder in the Taylor expansion of $g$ around $0$:
\[
g(t) - g(0) - g'(0)\, t - \frac{g''(0)}{2} t^2 \leq g''(0)\, S^{-2} \big( e^{St} - \tfrac{1}{2} S^2 t^2 - St - 1 \big).
\]
The right-hand side is equivalent to $\frac{S t^3}{6}\, g''(0)$ for $t$ close to zero (which should be expected from a three-times differentiable function such that $g'''(0) \leq S g''(0)$), but still provides a good bound for $t$ away from zero (which cannot be obtained from a regular Taylor expansion).

Throughout the proofs, we will use the fact that the functions $u \mapsto \frac{e^u - 1}{u}$ and $u \mapsto \frac{e^u - 1 - u}{u^2}$ can be extended to continuous functions on $\mathbb{R}$, which are thus bounded on any compact set. The bound depends on the compact set and can be obtained easily.
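As a sanity check (ours, not in the paper), one can verify Eq. (18) numerically for the logistic loss $g = \ell$, which satisfies the hypothesis with $S = 1$:

```python
import numpy as np

ell = lambda u: np.logaddexp(-u / 2, u / 2)

g0, g1, g2 = np.log(2), 0.0, 0.25  # g(0), g'(0) = -1/2 + sigma(0), g''(0)
S = 1.0                            # |ell'''| <= ell'' gives S = 1

t = np.linspace(0, 10, 1001)
lower = g0 + g1 * t + g2 / S**2 * (np.exp(-S * t) + S * t - 1)
upper = g0 + g1 * t + g2 / S**2 * (np.exp(S * t) - S * t - 1)
assert np.all(lower <= ell(t) + 1e-12) and np.all(ell(t) <= upper + 1e-12)
```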
A.2 Proof of Proposition 1

By applying Lemma 1 (Eq. (20) and Eq. (21)) to $g(t) = F(w + tv)$ (with constant $S = R\|v\|_2$) and taking $t = 1$, we get the first two desired inequalities, Eq. (3) and Eq. (4). By considering the function $g(t) = u^\top F''(w + tv)\, u$, we have $g'(t) = F'''(w + tv)[u, u, v]$, which is such that $|g'(t)| \leq \|v\|_2 R\, g(t)$, leading to $g(0)\, e^{-\|v\|_2 R t} \leq g(t) \leq g(0)\, e^{\|v\|_2 R t}$, and thus to Eq. (6) for $t = 1$ (when considered for all $u \in \mathbb{R}^p$).

In order to prove Eq. (5), we consider $h(t) = z^\top \big( F'(w + tv) - F'(w) - F''(w)\, v t \big)$. We have $h(0) = 0$, $h'(0) = 0$ and
\[
h''(t) = F'''(w + tv)[v, v, z] \leq R \|v\|_2\, e^{t R \|v\|_2}\, [ z^\top F''(w) z ]^{1/2} [ v^\top F''(w) v ]^{1/2},
\]
using Eq. (6) and Eq. (17). Thus, by integrating between $0$ and $t$,
\[
h'(t) \leq [ z^\top F''(w) z ]^{1/2} [ v^\top F''(w) v ]^{1/2}\, ( e^{t R \|v\|_2} - 1 ),
\]
which implies $h(1) \leq [ z^\top F''(w) z ]^{1/2} [ v^\top F''(w) v ]^{1/2} \int_0^1 ( e^{t R \|v\|_2} - 1 )\, dt$, which in turn leads to Eq. (5).

Using similar techniques, i.e., by considering the function $t \mapsto z^\top [ F''(w + tv) - F''(w) ]\, u$, we can prove that for all $z, u, v, w \in \mathbb{R}^p$, we have:
\[
z^\top [ F''(w + v) - F''(w) ]\, u \leq \frac{ e^{R\|v\|_2} - 1 }{ \|v\|_2 }\, [ v^\top F''(w) v ]^{1/2} [ z^\top F''(w) z ]^{1/2}\, \|u\|_2. \tag{22}
\]

A.3 Proof of Proposition 2

Since we have assumed that $\lambda(w) > 0$, by Eq. (6) the Hessian of $F$ is everywhere invertible, and hence the function $F$ is strictly convex. Therefore, if the minimum is attained, it is unique.

Let $v \in \mathbb{R}^p$ be such that $v^\top F''(w)\, v = 1$. Without loss of generality, we may assume that $F'(w)^\top v$ is negative. This implies that for all $t \leq 0$, $F(w + tv) \geq F(w)$. Moreover, let us denote $\kappa = - v^\top F'(w)\, R \|v\|_2$, which is nonnegative and such that $\kappa \leq \frac{ R | v^\top F'(w) | }{ \lambda(w)^{1/2} } \leq \frac{ R\, \nu(F,w) }{ \lambda(w)^{1/2} } \leq 1/2$. From Eq. (3), for all $t \geq 0$, we have:
\[
F(w + tv) \geq F(w) + v^\top F'(w)\, t + \frac{1}{R^2 \|v\|_2^2} \big( e^{-R\|v\|_2 t} + R\|v\|_2 t - 1 \big) \geq F(w) + \frac{1}{R^2 \|v\|_2^2} \Big[ e^{-R\|v\|_2 t} + (1 - \kappa)\, R\|v\|_2 t - 1 \Big].
\]
Moreover, a short calculation shows that for all $\kappa \in (0, 1]$:
\[
e^{-2\kappa(1-\kappa)^{-1}} + (1-\kappa)\, 2\kappa(1-\kappa)^{-1} - 1 \geq 0. \tag{23}
\]
This implies that for $t_0 = 2 ( R\|v\|_2 )^{-1} \kappa (1-\kappa)^{-1}$, $F(w + t_0 v) \geq F(w)$. Since
\[
t_0 \leq \frac{2}{1-\kappa}\, | v^\top F'(w) | \leq 2\, \nu(F,w) \Big( 1 - \frac{ \nu(F,w)\, R }{ \lambda(w)^{1/2} } \Big)^{-1} \leq 4\, \nu(F,w),
\]
we have $F(w + tv) \geq F(w)$ for $t = 4\, \nu(F,w)$.

Since this is true for all $v$ such that $v^\top F''(w)\, v = 1$, this shows that the value of the function $F$ on the entire ellipsoid (since $F''(w)$ is positive definite) $\{ v,\ v^\top F''(w)\, v = 16\, \nu(F,w)^2 \}$ is greater than or equal to the value at $w$; thus, by convexity, there must be a minimizer $w^*$ — which is unique because of Eq. (6) — of $F$ such that $( w - w^* )^\top F''(w)\, ( w - w^* ) \leq 16\, \nu(F,w)^2$, leading to Eq. (7).

In order to prove Eq. (9), we will simply apply Eq. (7) at $w + v$, which requires upper-bounding $\nu(F, w + v)$. If we denote by $v = -F''(w)^{-1} F'(w)$ the Newton step, we have:
\[
\begin{aligned}
\| F''(w)^{-1/2} F'(w + v) \|_2 &= \big\| F''(w)^{-1/2} [ F'(w+v) - F'(w) - F''(w)\, v ] \big\|_2
= \Big\| \int_0^1 F''(w)^{-1/2} [ F''(w + tv) - F''(w) ]\, v\, dt \Big\|_2 \\
&\leq \int_0^1 \Big\| F''(w)^{-1/2} [ F''(w + tv) - F''(w) ]\, F''(w)^{-1/2}\, F''(w)^{1/2} v \Big\|_2\, dt \\
&\leq \int_0^1 \Big\| \big[ F''(w)^{-1/2} F''(w + tv)\, F''(w)^{-1/2} - I \big]\, F''(w)^{1/2} v \Big\|_2\, dt.
\end{aligned}
\]
Moreover, we have from Eq. (6):
\[
( e^{-t R \|v\|_2} - 1 )\, I \preccurlyeq F''(w)^{-1/2} F''(w + tv)\, F''(w)^{-1/2} - I \preccurlyeq ( e^{t R \|v\|_2} - 1 )\, I.
\]
Thus,
\[
\| F''(w)^{-1/2} F'(w + v) \|_2 \leq \int_0^1 \max\{ e^{t R \|v\|_2} - 1,\ 1 - e^{-t R \|v\|_2} \}\, \| F''(w)^{1/2} v \|_2\, dt
= \nu(F,w) \int_0^1 ( e^{t R \|v\|_2} - 1 )\, dt = \nu(F,w)\, \frac{ e^{R\|v\|_2} - 1 - R\|v\|_2 }{ R\|v\|_2 }.
\]
Therefore, using Eq. (6) again, we obtain:
\[
\nu(F, w + v) = \| F''(w+v)^{-1/2} F'(w + v) \|_2 \leq \nu(F,w)\, e^{R\|v\|_2/2}\, \frac{ e^{R\|v\|_2} - 1 - R\|v\|_2 }{ R\|v\|_2 }.
\]
We have $R\|v\|_2 \leq R \lambda(w)^{-1/2}\, \nu(F,w) \leq 1/2$, and thus
\[
e^{R\|v\|_2/2}\, \frac{ e^{R\|v\|_2} - 1 - R\|v\|_2 }{ R\|v\|_2 } \leq R\|v\|_2 \leq R\, \nu(F,w)\, \lambda(w)^{-1/2},
\]
leading to:
\[
\nu(F, w + v) \leq \frac{R}{\lambda(w)^{1/2}}\, \nu(F,w)^2. \tag{24}
\]
Moreover, we have:
\[
\frac{ R\, \nu(F, w+v) }{ \lambda(w+v)^{1/2} } \leq \frac{ R\, e^{R\|v\|_2/2} }{ \lambda(w)^{1/2} }\, \nu(F, w+v)
\leq \frac{R}{\lambda(w)^{1/2}}\, \nu(F,w)\, e^{R\|v\|_2}\, \frac{ e^{R\|v\|_2} - 1 - R\|v\|_2 }{ R\|v\|_2 }
\leq \frac{R}{\lambda(w)^{1/2}}\, \nu(F,w) \times R\|v\|_2
\leq \Big( \frac{R}{\lambda(w)^{1/2}}\, \nu(F,w) \Big)^2 \leq 1/4,
\]
which leads to Eq. (8). Moreover, it shows that we can apply Eq. (7) at $w + v$ and get:
\[
\big[ ( w^* - w - v )^\top F''(w)\, ( w^* - w - v ) \big]^{1/2}
\leq e^{R\|v\|_2/2} \big[ ( w^* - w - v )^\top F''(w+v)\, ( w^* - w - v ) \big]^{1/2}
\leq 4\, e^{R\|v\|_2/2}\, \nu(F, w+v) \leq 4\, R\|v\|_2\, \nu(F,w),
\]
which leads to the desired result, i.e., Eq. (9).

B Proof of Theorem 1

Following [26, 27], we denote by $w_\lambda$ the unique global minimizer of the expected regularized risk $J_\lambda(w) = J_0(w) + \frac{\lambda}{2} \|w\|_2^2$. We simply apply Eq. (7) from Proposition 2 to $\hat{J}_\lambda$ and $w_\lambda$ to obtain, if the Newton decrement (see Section 2 for its definition) $\nu(\hat{J}_\lambda, w_\lambda)^2$ is less than $\lambda / 4R^2$, that $\hat{w}_\lambda$ and its population counterpart $w_\lambda$ are close, i.e.:
\[
( \hat{w}_\lambda - w_\lambda )^\top \hat{J}_\lambda''(w_\lambda)\, ( \hat{w}_\lambda - w_\lambda ) \leq 16\, \nu(\hat{J}_\lambda, w_\lambda)^2.
\]
We can then apply the upper Taylor expansion in Eq. (4) from Proposition 1 to $J_\lambda$ and $w_\lambda$, to obtain, with $v = \hat{w}_\lambda - w_\lambda$ (which is such that $R\|v\|_2 \leq \frac{ 4 R\, \nu(\hat{J}_\lambda, w_\lambda) }{ \lambda^{1/2} } \leq 2$):
\[
J_\lambda(\hat{w}_\lambda) - J_\lambda(w_\lambda) \leq \frac{ v^\top J_\lambda''(w_\lambda)\, v }{ R^2 \|v\|_2^2 } \big( e^{R\|v\|_2} - R\|v\|_2 - 1 \big) \leq 20\, \nu(\hat{J}_\lambda, w_\lambda)^2.
\]
Therefore, for any $w_0 \in \mathbb{R}^p$, since $w_\lambda$ is the minimizer of $J_\lambda(w) = J_0(w) + \frac{\lambda}{2} \|w\|_2^2$:
\[
J_0(\hat{w}_\lambda) \leq J_0(w_0) + \frac{\lambda}{2} \|w_0\|_2^2 + 20\, \nu(\hat{J}_\lambda, w_\lambda)^2. \tag{25}
\]
We can now apply the concentration inequality from Proposition 4 in Appendix G, i.e., Eq. (42), with $u = \log(8/\delta)$. We use $\lambda = 19 R^2 \sqrt{\frac{\log(8/\delta)}{n}}$. In order to actually have $\nu(\hat{J}_\lambda, w_\lambda) \leq \lambda^{1/2} / 2R$ (so that we can apply our self-concordant analysis), it is sufficient that
\[
\frac{41 R^2 u}{\lambda n} \leq \frac{\lambda}{8 R^2}, \qquad \frac{63 (u/n)^{3/2} R^2}{\lambda} \leq \frac{\lambda}{16 R^2}, \qquad \frac{8 (u/n)^2 R^2}{\lambda} \leq \frac{\lambda}{16 R^2},
\]
leading to the constraint $u \leq n/125$. We then get, with probability at least $1 - \delta = 1 - 8 e^{-u}$ (for $u \leq n/125$):
\[
J_0(\hat{w}_\lambda) \leq J_0(w_0) + \frac{\lambda}{2} \|w_0\|_2^2 + 20\, \frac{\lambda}{4 R^2} \leq J_0(w_0) + \big( 10 + 100 R^2 \|w_0\|_2^2 \big)\, \frac{ \sqrt{\log(8/\delta)} }{ \sqrt{n} }.
\]
For $u \geq n/125$, the bound in Eq. (12) is always satisfied. Indeed, this implies, with our choice of $\lambda$, that $\lambda \geq R^2$. Moreover, since $\| \hat{w}_\lambda \|_2^2$ is bounded from above by $\log(2)\, \lambda^{-1} \leq R^{-2}$,
\[
J_0(\hat{w}_\lambda) \leq J_0(w_0) + \frac{R^2}{2} \| \hat{w}_\lambda - w_0 \|_2^2 \leq J_0(w_0) + 1 + R^2 \|w_0\|_2^2,
\]
which is smaller than the right-hand side of Eq. (12).

C Proof of Theorem 2

We denote by $J_0^T$ the second-order Taylor expansion of $J_0$ around $w_0$, equal to $J_0^T(w) = J_0(w_0) + \frac{1}{2} ( w - w_0 )^\top Q\, ( w - w_0 )$, with $Q = J_0''(w_0)$, and by $\hat{J}_0^T$ the expansion of $\hat{J}_0$ around $w_0$, equal to $J_0^T(w) - q^\top w$. We denote by $\hat{w}_\lambda^N$ the one-step Newton iterate from $w_0$ for the function $\hat{J}_\lambda$, defined as the global minimizer of $\hat{J}_0^T(w) + \frac{\lambda}{2} \|w\|_2^2$ and equal to $\hat{w}_\lambda^N = w_0 + ( Q + \lambda I )^{-1} ( q - \lambda w_0 )$.
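A quick numerical sanity check (ours) of this closed form: the iterate below should zero the gradient of the quadratic approximation $\hat{J}_0^T(w) + \frac{\lambda}{2}\|w\|_2^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam = 4, 0.1
A = rng.normal(size=(p, p)); Q = A @ A.T / p         # stand-in for J_0''(w0)
w0, q = rng.normal(size=p), rng.normal(size=p) / 10  # stand-ins for w0 and q

w_N = w0 + np.linalg.solve(Q + lam * np.eye(p), q - lam * w0)

# Gradient of the quadratic approximation at w_N: Q (w - w0) - q + lam w
grad = Q @ (w_N - w0) - q + lam * w_N
assert np.allclose(grad, 0)
```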
The following proposition shows that we can replace $\hat{J}_0$ by $\hat{J}_0^T$ for obtaining the estimator, and that we can replace $J_0$ by $J_0^T$ for measuring its performance; i.e., we may proceed as if we had a weighted least-squares cost, as long as the Newton decrement is small enough:

Proposition 3 (Quadratic approximation of risks) Assume $\nu(\hat{J}_\lambda, w_0)^2 = ( q - \lambda w_0 )^\top ( Q + \lambda I )^{-1} ( q - \lambda w_0 ) \leq \frac{\lambda}{4 R^2}$. We have:
\[
| J_0(\hat{w}_\lambda) - J_0^T(\hat{w}_\lambda^N) | \leq \frac{ 15 R\, \nu(\hat{J}_\lambda, w_0)^2 }{ \lambda^{1/2} }\, \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2 + \frac{ 40 R^2 }{ \lambda }\, \nu(\hat{J}_\lambda, w_0)^4. \tag{26}
\]

Proof. We show (1) that $\hat{w}_\lambda^N$ is close to $\hat{w}_\lambda$, using Proposition 2 on the behavior of Newton's method, (2) that $\hat{w}_\lambda^N$ is close to $w_0$, using its closed form $\hat{w}_\lambda^N = w_0 + ( Q + \lambda I )^{-1} ( q - \lambda w_0 )$, and (3) that $J_0$ and $J_0^T$ are close, using Proposition 1 on upper and lower Taylor expansions.

We first apply Eq. (9) from Proposition 2 to get
\[
( \hat{w}_\lambda - \hat{w}_\lambda^N )^\top \hat{J}_\lambda''(w_0)\, ( \hat{w}_\lambda - \hat{w}_\lambda^N ) \leq \frac{ 16 R^2 }{ \lambda }\, \nu(\hat{J}_\lambda, w_0)^4. \tag{27}
\]
This implies that $\hat{w}_\lambda$ and $\hat{w}_\lambda^N$ are close, i.e.,
\[
\| \hat{w}_\lambda - \hat{w}_\lambda^N \|_2^2 \leq \lambda^{-1} ( \hat{w}_\lambda - \hat{w}_\lambda^N )^\top \hat{J}_\lambda''(w_0)\, ( \hat{w}_\lambda - \hat{w}_\lambda^N ) \leq \frac{ 16 R^2 }{ \lambda^2 }\, \nu(\hat{J}_\lambda, w_0)^4 \leq \frac{4}{\lambda}\, \nu(\hat{J}_\lambda, w_0)^2 \leq \frac{1}{R^2}.
\]
Thus, using the closed-form expression $\hat{w}_\lambda^N = w_0 + ( Q + \lambda I )^{-1} ( q - \lambda w_0 )$, we obtain
\[
\| \hat{w}_\lambda - w_0 \|_2 \leq \| \hat{w}_\lambda - \hat{w}_\lambda^N \|_2 + \| w_0 - \hat{w}_\lambda^N \|_2 \leq \frac{ 2\, \nu(\hat{J}_\lambda, w_0) }{ \lambda^{1/2} } + \frac{ \nu(\hat{J}_\lambda, w_0) }{ \lambda^{1/2} } \leq \frac{ 3\, \nu(\hat{J}_\lambda, w_0) }{ \lambda^{1/2} } \leq \frac{3}{2R}.
\]
We can now apply Eq. (3) and Eq. (4) from Proposition 1 to get, for all $v$ such that $R\|v\|_2 \leq 3/2$,
\[
| J_0( w_0 + v ) - J_0^T( w_0 + v ) | \leq ( v^\top Q v )\, R \|v\|_2 / 4. \tag{28}
\]
Thus, using Eq. (28) for $v = \hat{w}_\lambda - w_0$ and $v = \hat{w}_\lambda^N - w_0$:
\[
\begin{aligned}
| J_0(\hat{w}_\lambda) - J_0^T(\hat{w}_\lambda^N) |
&\leq | J_0(\hat{w}_\lambda) - J_0^T(\hat{w}_\lambda) | + | J_0^T(\hat{w}_\lambda^N) - J_0^T(\hat{w}_\lambda) | \\
&\leq \frac{R}{4}\, \| \hat{w}_\lambda - w_0 \|_2\, \| Q^{1/2} ( \hat{w}_\lambda - w_0 ) \|_2^2 + \frac{1}{2} \Big| \| Q^{1/2} ( \hat{w}_\lambda - w_0 ) \|_2^2 - \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2^2 \Big| \\
&\leq \frac{ 3 R\, \nu(\hat{J}_\lambda, w_0) }{ 4 \lambda^{1/2} }\, \| Q^{1/2} ( \hat{w}_\lambda - w_0 ) \|_2^2 + \frac{1}{2} \Big| \| Q^{1/2} ( \hat{w}_\lambda - w_0 ) \|_2^2 - \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2^2 \Big| \\
&\leq \frac{ 3 R\, \nu(\hat{J}_\lambda, w_0) }{ 4 \lambda^{1/2} }\, \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2^2 + \Big( \frac{1}{2} + \frac{3}{4} \Big) \Big| \| Q^{1/2} ( \hat{w}_\lambda - w_0 ) \|_2^2 - \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2^2 \Big| \\
&\leq \frac{ 3 R\, \nu(\hat{J}_\lambda, w_0) }{ 4 \lambda^{1/2} }\, \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2^2 + \frac{5}{4}\, \| Q^{1/2} ( \hat{w}_\lambda - \hat{w}_\lambda^N ) \|_2^2 + \frac{5}{2}\, \| Q^{1/2} ( \hat{w}_\lambda - \hat{w}_\lambda^N ) \|_2\, \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2.
\end{aligned}
\]
From Eq. (27), we have $\| Q^{1/2} ( \hat{w}_\lambda - \hat{w}_\lambda^N ) \|_2^2 \leq \frac{ 16 R^2 }{ \lambda }\, \nu(\hat{J}_\lambda, w_0)^4$. We thus obtain, using $\| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2 \leq \nu(\hat{J}_\lambda, w_0)$:
\[
| J_0(\hat{w}_\lambda) - J_0^T(\hat{w}_\lambda^N) | \leq \Big( \frac{3}{4} + \frac{5}{2} \sqrt{32} \Big)\, \frac{ R\, \nu(\hat{J}_\lambda, w_0)^2 }{ \lambda^{1/2} }\, \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2 + \frac{ 40 R^2 }{ \lambda }\, \nu(\hat{J}_\lambda, w_0)^4,
\]
which leads to the desired result.

We can now go on with the proof of Theorem 2. From Eq. (26) in Proposition 3 above, we have, if $\nu(\hat{J}_\lambda, w_0)^2 \leq \lambda / 4R^2$,
\[
J_0(\hat{w}_\lambda) = J_0^T(\hat{w}_\lambda^N) + B = J_0(w_0) + \frac{1}{2} ( q - \lambda w_0 )^\top Q ( Q + \lambda I )^{-2} ( q - \lambda w_0 ) + B = J_0(w_0) + \frac{d_2}{2n} + \frac{b_2}{2} + B + C,
\]
with
\[
C = -\lambda\, w_0^\top ( Q + \lambda I )^{-2} Q\, q + \frac{1}{2}\, \mathrm{tr}\Big( ( Q + \lambda I )^{-2} Q \Big( q q^\top - \frac{1}{n} Q \Big) \Big),
\]
\[
|B| \leq \frac{ 15 R\, \nu(\hat{J}_\lambda, w_0)^2 }{ \lambda^{1/2} }\, \| Q^{1/2} ( \hat{w}_\lambda^N - w_0 ) \|_2 + \frac{ 40 R^2 }{ \lambda }\, \nu(\hat{J}_\lambda, w_0)^4.
\]
We can now bound each term separately and check that we indeed have $\nu(\hat{J}_\lambda, w_0)^2 \leq \lambda / 4R^2$ (which allows us to apply Proposition 2).
First, from Eq. (13), we can derive
\[
b_2 + \frac{d_2}{n} \leq b_1 + \frac{d_1}{n} = \frac{ \kappa \lambda^{1/2} }{ R } \Big( b_2 + \frac{d_2}{n} \Big)^{1/2} \leq \frac{ \kappa \lambda^{1/2} }{ R } \Big( b_1 + \frac{d_1}{n} \Big)^{1/2},
\]
which implies the following inequalities:
\[
b_2 + \frac{d_2}{n} \leq b_1 + \frac{d_1}{n} \leq \frac{ \kappa^2 \lambda }{ R^2 }. \tag{29}
\]
We have moreover:
\[
\nu(\hat{J}_\lambda, w_0)^2 = ( q - \lambda w_0 )^\top ( Q + \lambda I )^{-1} ( q - \lambda w_0 ) \leq b_1 + \frac{d_1}{n} + \Big| \mathrm{tr}\, ( Q + \lambda I )^{-1} \Big( q q^\top - \frac{Q}{n} \Big) \Big| + 2\, | \lambda w_0^\top ( Q + \lambda I )^{-1} q |.
\]
We can now apply the concentration inequalities from Appendix G, together with the following applications of Bernstein's inequality. Indeed, we have $\lambda w_0^\top ( Q + \lambda I )^{-2} Q\, q = \sum_{i=1}^n Z_i$, with
\[
|Z_i| \leq \frac{\lambda}{2n}\, \big[ w_0^\top ( Q + \lambda I )^{-2} Q\, w_0 \big]^{1/2} \big[ x_i^\top ( Q + \lambda I )^{-2} Q\, x_i \big]^{1/2} \leq \frac{ b_2^{1/2} }{ 2n }\, R \lambda^{-1/2}.
\]
Moreover, $\sum_{i=1}^n \mathbb{E} Z_i^2 \leq \frac{\lambda^2}{n}\, w_0^\top ( Q + \lambda I )^{-2} Q^3 ( Q + \lambda I )^{-2} w_0 \leq \frac{b_2}{n}$. We can now apply Bernstein's inequality [2] to get, with probability at least $1 - 2 e^{-u}$ (and using Eq. (29)):
\[
| \lambda w_0^\top ( Q + \lambda I )^{-2} Q\, q | \leq \sqrt{ \frac{ 2 b_2 u }{ n } } + \frac{u}{6n}\, b_2^{1/2} R \lambda^{-1/2} \leq \sqrt{ \frac{ 2 b_2 u }{ n } } + \frac{ u \kappa }{ 6n }.
\]
Similarly, with probability at least $1 - 2 e^{-u}$, we have:
\[
| \lambda w_0^\top ( Q + \lambda I )^{-1} q | \leq \sqrt{ \frac{ 2 b_2 u }{ n } } + \frac{ u \kappa }{ 6n }.
\]
We thus get, through the union bound, with probability at least $1 - 20 e^{-u}$:
\[
\begin{aligned}
\nu(\hat{J}_\lambda, w_0)^2 &\leq b_1 + \frac{d_1}{n} + \frac{ 32\, d_2^{1/2} u^{1/2} }{ n } + \frac{ 18 u }{ n } + \frac{ 53 R\, d_1^{1/2} u^{3/2} }{ n^{3/2} \lambda^{1/2} } + \frac{ 9 R^2 u^2 }{ \lambda n^2 } + 2 \sqrt{ \frac{ 2 b_2 u }{ n } } + \frac{ \kappa u }{ 6 n } \\
&\leq b_1 + \frac{d_1}{n} + \frac{ 64\, u^{1/2} }{ n^{1/2} } \Big( b_2 + \frac{d_2}{n} \Big)^{1/2} + \frac{u}{n} \Big( 18 + \frac{\kappa}{6} \Big) + \frac{R^2}{\lambda}\, \frac{ 9 u^2 }{ n^2 } + \frac{ 53\, \kappa u^{3/2} }{ n } \\
&\leq \frac{ \lambda \kappa^2 }{ R^2 } + E,
\end{aligned}
\]
together with $|C| \leq E$. We now take $u = ( n b_2 + d_2 )\, v^2$ and assume $v \leq 1/4$, $\kappa \leq 1/16$, and $v^3 ( n b_2 + d_2 )^{1/2} \leq 12$, so that we have
\[
\begin{aligned}
E &\leq 64 v \Big( b_2 + \frac{d_2}{n} \Big) + v^2 \Big( b_2 + \frac{d_2}{n} \Big) \Big( 18 + \frac{\kappa}{6} \Big) + \frac{ 9 R^2 }{ \lambda }\, v^4 \Big( b_2 + \frac{d_2}{n} \Big)^2 + 53\, \kappa v^3 \Big( b_2 + \frac{d_2}{n} \Big) ( n b_2 + d_2 )^{1/2} \\
&\leq \Big( b_2 + \frac{d_2}{n} \Big) \Big[ 64 v + \Big( 18 + \frac{\kappa}{6} \Big) v^2 + \frac{ 9 R^2 }{ \lambda }\, v^4\, \frac{ \lambda \kappa^2 }{ R^2 } + 53\, \kappa v^3 ( n b_2 + d_2 )^{1/2} \Big] \\
&\leq \Big( b_2 + \frac{d_2}{n} \Big) \Big[ 68.5\, v + \frac{ \kappa }{ 6 \times 16 } + \frac{ 9 \kappa }{ 16 \times 16 \times 16 } + \frac{ 53\, \kappa \times 12 }{ 64 } \Big] \\
&\leq \Big( b_2 + \frac{d_2}{n} \Big) ( 69 v + 10 \kappa ) \leq 20 \Big( b_2 + \frac{d_2}{n} \Big).
\end{aligned}
\]
This implies that $\nu(\hat{J}_\lambda, w_0)^2 \leq \frac{\lambda}{R^2} \cdot \frac{21}{256} \leq \frac{\lambda}{4 R^2}$, so that we can apply Proposition 2. Thus, denoting $e_2 = b_2 + \frac{d_2}{n}$, $e_1 = b_1 + \frac{d_1}{n}$, and $\alpha = 69 v + 10 \kappa \leq 20$, we get a global upper bound:
\[
|B| + |C| \leq e_2 \alpha + \frac{ 40 R^2 }{ \lambda }\, ( e_1 + e_2 \alpha )^2 + \frac{ 15 R\, e_2^{1/2} }{ \lambda^{1/2} }\, ( e_1 + e_2 \alpha ) ( 1 + \alpha )^{1/2}.
\]
With $e_1 + e_2 \alpha \leq e_2^{1/2}\, ( \kappa \lambda^{1/2} / R )\, ( 1 + \alpha )$, we get
\[
|B| + |C| \leq e_2 \alpha + 40\, \kappa^2 e_2\, ( 1 + \alpha )^2 + 15\, \kappa e_2\, ( 1 + \alpha )^{3/2} \leq e_2 \alpha + e_2 \kappa \Big( \frac{ 40 \times 21 \times 21 }{ 16 } + 15\, (21)^{3/2} \Big) \leq e_2\, ( 69 v + 2560 \kappa ),
\]
which leads to the desired result, i.e., Eq. (14).

D Proof of Theorem 3

We follow the same proof technique as for Theorem 2 in Appendix C. We have:
\[
J_0(\hat{w}_\lambda) = \hat{J}_0(\hat{w}_\lambda) + q^\top ( \hat{w}_\lambda - w_0 ) + q^\top w_0 = \hat{J}_0(\hat{w}_\lambda) + q^\top ( \hat{w}_\lambda - \hat{w}_\lambda^{NN} ) + q^\top ( \hat{w}_\lambda^N - w_0 ) - q^\top \hat{J}_\lambda''( \hat{w}_\lambda^N )^{-1} \hat{J}_\lambda'( \hat{w}_\lambda^N ) + q^\top w_0,
\]
where $\hat{w}_\lambda^{NN}$ is the two-step Newton iterate from $w_0$. We have, from Eq. (24),
\[
\nu( \hat{J}_\lambda, \hat{w}_\lambda^N ) \leq \frac{ 2R }{ \lambda^{1/2} }\, \nu( \hat{J}_\lambda, w_0 )^2,
\]
which then implies (with Eq. (9)):
\[
( \hat{w}_\lambda - \hat{w}_\lambda^{NN} )^\top ( Q + \lambda I ) ( \hat{w}_\lambda - \hat{w}_\lambda^{NN} ) \leq \frac{ 16 R^2 }{ \lambda } \Big( \frac{ 2R }{ \lambda^{1/2} }\, \nu( \hat{J}_\lambda, w_0 )^2 \Big)^4 \leq \frac{ 512\, R^6\, \nu( \hat{J}_\lambda, w_0 )^8 }{ \lambda^3 },
\]
which in turn implies
\[
| q^\top ( \hat{w}_\lambda - \hat{w}_\lambda^{NN} ) | \leq \big[ q^\top ( Q + \lambda I )^{-1} q \big]^{1/2}\, \frac{ 32 R^3\, \nu( \hat{J}_\lambda, w_0 )^4 }{ \lambda^{3/2} } = \frac{ R\, \big[ q^\top ( Q + \lambda I )^{-1} q \big]^{1/2} }{ \lambda^{1/2} } \cdot \frac{ 32 R^2\, \nu( \hat{J}_\lambda, w_0 )^4 }{ \lambda }. \tag{30}
\]
Moreover, we have, from the closed-form expression of $\hat{w}_\lambda^N$:
\[
\Big| q^\top ( \hat{w}_\lambda^N - w_0 ) - \frac{d_1}{n} \Big| \leq \Big| \mathrm{tr}\, ( Q + \lambda I )^{-1} \big( q q^\top - Q/n \big) \Big| + | \lambda w_0^\top ( Q + \lambda I )^{-1} q |. \tag{31}
\]
Finally, we have, using Eq. (5) from Proposition 1:
\[
\begin{aligned}
| q^\top \hat{J}_\lambda''( \hat{w}_\lambda^N )^{-1} \hat{J}_\lambda'( \hat{w}_\lambda^N ) | &= \big| q^\top \hat{J}_\lambda''( \hat{w}_\lambda^N )^{-1} \big[ \hat{J}_0'( \hat{w}_\lambda^N ) - \hat{J}_0'( w_0 ) - Q ( \hat{w}_\lambda^N - w_0 ) \big] \big| \\
&\leq \big[ q^\top \hat{J}_\lambda''( \hat{w}_\lambda^N )^{-1} Q\, \hat{J}_\lambda''( \hat{w}_\lambda^N )^{-1} q \big]^{1/2}\, \big[ \Delta^\top Q \Delta \big]^{1/2}\, R \|\Delta\|_2 \\
&\leq 2\, \big[ q^\top Q ( Q + \lambda I )^{-2} q \big]^{1/2}\, \| Q^{1/2} \Delta \|_2\, \frac{ R\, \nu( \hat{J}_\lambda, w_0 ) }{ \lambda^{1/2} },
\end{aligned} \tag{32}
\]
where $\Delta = \hat{w}_\lambda^N - w_0$. What also needs to be shown is that $\mathrm{tr}\, \hat{Q}_\lambda ( \hat{Q}_\lambda + \lambda I )^{-1} - \mathrm{tr}\, Q ( Q + \lambda I )^{-1}$ is small enough; by noting that $Q = J_0''(w_0)$ and $\hat{Q}_\lambda = J_0''( w_0 + v )$, with $v = \hat{w}_\lambda - w_0$, we have, using Eq. (22) from Appendix A.2 (with $(\delta_i)$ the canonical basis of $\mathbb{R}^p$):
\[
\begin{aligned}
\mathrm{tr}\, \hat{Q}_\lambda ( \hat{Q}_\lambda + \lambda I )^{-1} - \mathrm{tr}\, Q ( Q + \lambda I )^{-1}
&= \lambda\, \mathrm{tr}\, ( \hat{Q}_\lambda + \lambda I )^{-1} ( Q - \hat{Q}_\lambda ) ( Q + \lambda I )^{-1} \\
&\leq \lambda \sum_{i=1}^p \delta_i^\top ( \hat{Q}_\lambda + \lambda I )^{-1} ( Q - \hat{Q}_\lambda ) ( Q + \lambda I )^{-1} \delta_i \\
&\leq \lambda R \sum_{i=1}^p \| Q^{1/2} ( Q + \lambda I )^{-1} \delta_i \|_2\, \| ( \hat{Q}_\lambda + \lambda I )^{-1} \delta_i \|_2\, \| Q^{1/2} v \|_2 \\
&\leq \lambda^{-1/2} R\, \| Q^{1/2} v \|_2 \sum_{i=1}^p \delta_i^\top Q ( Q + \lambda I )^{-1} \delta_i
= \lambda^{-1/2} R\, \| Q^{1/2} v \|_2\, d_1. \tag{33}
\end{aligned}
\]
All the terms in Eqs. (30)-(33) that need to be added to obtain the required upper bound are essentially the same as the ones in the proof of Theorem 2 in Appendix C (with smaller constants). Thus the rest of the proof follows.

E Proof of Theorem 4

We follow the same proof technique as for the Lasso [15, 12, 14]: we consider $\tilde{w}$ the minimizer of $\hat{J}_0(w) + \lambda s^\top w$ subject to $w_{K^c} = 0$ (which is unique because $Q_{KK}$ is invertible), and (1) show that $\tilde{w}_K$ has the correct (non-zero) signs, and (2) show that it is actually the unrestricted minimum of $\hat{J}_0(w) + \lambda \|w\|_1$ over $\mathbb{R}^p$, i.e., using optimality conditions for nonsmooth convex optimization problems [48], that $\| [ \hat{J}_0'(\tilde{w}) ]_{K^c} \|_\infty \leq \lambda$. All this will be shown by replacing $\tilde{w}$ by the proper one-step Newton iterate from $w_0$.

Correct signs on $K$. We directly use Proposition 2 with the function $w_K \mapsto \hat{J}_0( w_K, 0 ) + \lambda s_K^\top w_K$ — where $( w_K, 0 )$ denotes the $p$-dimensional vector obtained by completing $w_K$ with zeros — to obtain from Eq. (7):
\[
( \tilde{w}_K - ( w_0 )_K )^\top Q_{KK}\, ( \tilde{w}_K - ( w_0 )_K ) \leq 16\, ( q_K - \lambda s_K )^\top Q_{KK}^{-1} ( q_K - \lambda s_K ) = 16\, \nu^2,
\]
as soon as $\nu^2 = ( q_K - \lambda s_K )^\top Q_{KK}^{-1} ( q_K - \lambda s_K ) \leq \frac{\rho}{4 R^2}$, and thus as soon as $q_K^\top Q_{KK}^{-1} q_K \leq \frac{\rho}{8 R^2}$ and $\lambda^2\, s_K^\top Q_{KK}^{-1} s_K \leq \frac{\rho}{8 R^2}$. We thus have:
\[
\| \tilde{w} - w_0 \|_\infty \leq \| \tilde{w}_K - ( w_0 )_K \|_2 \leq \rho^{-1/2}\, \| Q_{KK}^{1/2} ( \tilde{w}_K - ( w_0 )_K ) \|_2 \leq 4\, \rho^{-1/2} \nu.
\]
We therefore get the correct signs for the covariates indexed by $K$ as soon as $\| \tilde{w} - w_0 \|_\infty^2 \leq \min_{j \in K} |( w_0 )_j|^2 = \mu^2$, i.e., as soon as
\[
\max\big\{ q_K^\top Q_{KK}^{-1} q_K,\ \lambda^2\, s_K^\top Q_{KK}^{-1} s_K \big\} \leq \min\Big\{ \frac{ \rho \mu^2 }{ 16 },\ \frac{ \rho }{ 8 R^2 } \Big\}.
\]
Note that $s_K^\top Q_{KK}^{-1} s_K \leq |K| \rho^{-1}$; thus the above is implied by the following constraints:
\[
\lambda \leq \frac{ \rho }{ 4 |K|^{1/2} } \min\{ \mu,\ R^{-1} \}, \tag{34}
\]
\[
q_K^\top Q_{KK}^{-1} q_K \leq \frac{ \rho }{ 16 } \min\{ \mu^2,\ R^{-2} \}. \tag{35}
\]

Gradient condition on $K^c$. We denote by $\tilde{w}^N$ the one-step Newton iterate from $w_0$ for the minimization of $\hat{J}_0(w) + \lambda s^\top w$ restricted to $w_{K^c} = 0$, equal to $\tilde{w}^N_K = ( w_0 )_K + Q_{KK}^{-1} ( q_K - \lambda s_K )$. From Eq. (9), we get:
\[
( \tilde{w}_K - \tilde{w}^N_K )^\top Q_{KK}\, ( \tilde{w}_K - \tilde{w}^N_K ) \leq \frac{ 16 R^2 }{ \rho } \big[ ( q_K - \lambda s_K )^\top Q_{KK}^{-1} ( q_K - \lambda s_K ) \big]^2 = \frac{ 16 R^2 \nu^4 }{ \rho }.
\]
Gradient condition on $K^c$.  We denote by $\tilde w^N$ the one-step Newton iterate from $w_0$ for the minimization of $\hat J_0(w) + \lambda s^\top w$ restricted to $w_{K^c} = 0$, equal to
$$\tilde w^N_K = (w_0)_K + Q_{KK}^{-1}(q_K - \lambda s_K).$$
From Eq. (9), we get:
$$(\tilde w_K - \tilde w^N_K)^\top Q_{KK}(\tilde w_K - \tilde w^N_K) \le \frac{16R^2}{\rho}\big[(q_K - \lambda s_K)^\top Q_{KK}^{-1}(q_K - \lambda s_K)\big]^2 = \frac{16R^2\nu^4}{\rho}.$$
We thus have
$$\|\tilde w - \tilde w^N\|_2 \le \rho^{-1/2}\,\frac{4R\nu^2}{\rho^{1/2}} = \frac{4R\nu^2}{\rho} \le \frac{1}{R},\qquad \|w_0 - \tilde w^N\|_2 \le \rho^{-1/2}\nu \le \frac{1}{2R},$$
$$\|\tilde w - w_0\|_2 \le \|\tilde w - \tilde w^N\|_2 + \|w_0 - \tilde w^N\|_2 \le 3\nu\rho^{-1/2} \le \frac{3}{2R}.$$
Note that, up to here, all occurrences of $R$ may be replaced by the maximal $\ell_2$-norm of the data points restricted to the variables in $K$.

In order to check the gradient condition, we compute the gradient of $\hat J_0$ along the directions in $K^c$, to obtain, for all $z\in\mathbb{R}^p$, using Eq. (5), and with any $v$ such that $R\|v\|_2 \le 3/2$:
$$\frac{z^\top\big[\hat J'_0(w_0+v) - \hat T'_0(w_0+v)\big]}{(z^\top Qz)^{1/2}} \le (v^\top Qv)^{1/2}\,\frac{e^{R\|v\|_2} - 1 - R\|v\|_2}{R\|v\|_2} \le 2\,(v^\top Qv)^{1/2}\,R\|v\|_2,$$
where $\hat T'_0(w) = \hat J'_0(w_0) + \hat J''_0(w_0)(w - w_0)$ is the derivative of the Taylor expansion of $\hat J_0$ around $w_0$. This implies, since $\mathrm{diag}(Q) \le 1/4$, the following $\ell_\infty$-bound on the difference between the gradients of $\hat J_0$ and of its Taylor expansion:
$$\big\|\big[\hat J'_0(w_0+v) - \hat T'_0(w_0+v)\big]_{K^c}\big\|_\infty \le (v^\top Qv)^{1/2}\,R\|v\|_2.$$
We now have
$$\begin{aligned}
\|\hat J'_0(\tilde w)_{K^c}\|_\infty &\le \|\hat T'_0(\tilde w^N)_{K^c}\|_\infty + \|\hat T'_0(\tilde w^N)_{K^c} - \hat T'_0(\tilde w)_{K^c}\|_\infty + \|\hat T'_0(\tilde w)_{K^c} - \hat J'_0(\tilde w)_{K^c}\|_\infty\\
&\le \big\|[\hat J'_0(w_0) + Q(\tilde w^N - w_0)]_{K^c}\big\|_\infty + \big\|[Q(\tilde w - \tilde w^N)]_{K^c}\big\|_\infty + R\|\tilde w - w_0\|_2\,\|Q^{1/2}(\tilde w - w_0)\|_2\\
&\le \big\|-q_{K^c} + Q_{K^cK}Q_{KK}^{-1}(q_K - \lambda s_K)\big\|_\infty + \big\|Q_{K^cK}Q_{KK}^{-1/2}\,Q_{KK}^{1/2}(\tilde w_K - \tilde w^N_K)\big\|_\infty + 3\nu R\rho^{-1/2}\big(4R\nu^2\rho^{-1/2} + \nu\big)\\
&\le \big\|q_{K^c} - Q_{K^cK}Q_{KK}^{-1}(q_K - \lambda s_K)\big\|_\infty + \frac{1}{4}\,\big\|Q_{KK}^{1/2}(\tilde w_K - \tilde w^N_K)\big\|_2 + \frac{9R}{\rho^{1/2}}\nu^2\\
&\le \big\|q_{K^c} - Q_{K^cK}Q_{KK}^{-1}(q_K - \lambda s_K)\big\|_\infty + \frac{1}{4}\cdot\frac{16R}{\rho^{1/2}}\nu^2 + \frac{9R}{\rho^{1/2}}\nu^2\\
&\le \big\|q_{K^c} - Q_{K^cK}Q_{KK}^{-1}(q_K - \lambda s_K)\big\|_\infty + \frac{16R}{\rho^{1/2}}\nu^2.
\end{aligned}$$
Thus, in order to get $\|\hat J'_0(\tilde w)_{K^c}\|_\infty \le \lambda$, we need
$$\big\|q_{K^c} - Q_{K^cK}Q_{KK}^{-1}q_K\big\|_\infty \le \eta\lambda/4,\qquad(36)$$
and
$$\max\big\{q_K^\top Q_{KK}^{-1}q_K,\ \lambda^2 s_K^\top Q_{KK}^{-1}s_K\big\} \le \frac{\lambda\eta\rho^{1/2}}{64R}.\qquad(37)$$
In terms of an upper bound on $\lambda$, we then get
$$\lambda \le \min\Big\{\frac{\rho}{4|K|^{1/2}}\,\mu,\ \frac{\rho}{4|K|^{1/2}}\,R^{-1},\ \frac{\eta\rho^{3/2}}{64R|K|}\Big\},$$
which can be reduced to $\lambda \le \min\big\{\frac{\rho}{4|K|^{1/2}}\mu,\ \frac{\eta\rho^{3/2}}{64R|K|}\big\}$. In terms of an upper bound on $q_K^\top Q_{KK}^{-1}q_K$, we get
$$q_K^\top Q_{KK}^{-1}q_K \le \min\Big\{\frac{\rho}{16}\mu^2,\ \frac{\rho}{16}R^{-2},\ \frac{\lambda\eta\rho^{1/2}}{64R}\Big\},$$
which can be reduced to $q_K^\top Q_{KK}^{-1}q_K \le \min\big\{\frac{\rho}{16}\mu^2,\ \frac{\lambda\eta\rho^{1/2}}{64R}\big\}$, using the constraint on $\lambda$.

We now derive and use concentration inequalities. We first use Bernstein's inequality (using, for all $k$ and $i$, $|(x_i)_k - Q_{kK}Q_{KK}^{-1}(x_i)_K|\,|\varepsilon_i| \le R/\rho^{1/2}$ and $Q_{kk} \le 1/4$) and the union bound to get
$$\mathbb{P}\big(\|q_{K^c} - Q_{K^cK}Q_{KK}^{-1}q_K\|_\infty \ge \lambda\eta/4\big) \le 2p\,\exp\Big(-\frac{n\lambda^2\eta^2/32}{1/4 + R\lambda\eta\rho^{-1/2}/12}\Big) \le 2p\,\exp\Big(-\frac{n\lambda^2\eta^2}{16}\Big),$$
as soon as $R\lambda\eta\rho^{-1/2} \le 3$, i.e., as soon as $\lambda \le 3\rho^{1/2}R^{-1}$, which is indeed satisfied because of our assumption on $\lambda$. We also use Bernstein's inequality to get
$$\mathbb{P}\big(q_K^\top Q_{KK}^{-1}q_K \ge t\big) \le \mathbb{P}\Big(\|q_K\|_\infty \ge \sqrt{\frac{\rho t}{|K|}}\Big) \le 2|K|\exp\Big(-\frac{n\rho t}{|K|}\Big).$$
The union bound then leads to the desired result.
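Step (2) of this proof is the standard subgradient (KKT) certificate for $\ell_1$-regularized problems. A small numerical sketch of that certificate (helper names are ours; labels are assumed in $\{0,1\}$, with the empirical gradient of $\hat J_0$):

```python
import numpy as np

def lasso_kkt_certificate(X, y, w_tilde, K, lam, tol=1e-8):
    """Check the nonsmooth optimality conditions used in step (2):
    on K, the gradient of hat{J}_0 plus lam * sign(w) vanishes; on K^c,
    the gradient has l_infinity norm at most lam."""
    n, p = X.shape
    mu = 1.0 / (1.0 + np.exp(-(X @ w_tilde)))     # logistic predictions
    grad = X.T @ (mu - y) / n                     # gradient of hat{J}_0 at w_tilde
    Kc = np.setdiff1d(np.arange(p), K)
    on_K = np.all(np.abs(grad[K] + lam * np.sign(w_tilde[K])) <= tol)
    on_Kc = np.max(np.abs(grad[Kc])) <= lam + tol
    return on_K and on_Kc
```

If both checks pass, $\tilde w$ (extended by zeros on $K^c$) is a global minimizer of $\hat J_0(w) + \lambda\|w\|_1$ over $\mathbb{R}^p$.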
F  Proof of Theorem 5

We follow the proof technique of [16]. We have $\hat J_0(\hat w_\lambda) = J_0(\hat w_\lambda) - q^\top \hat w_\lambda$. Thus, because $\hat w_\lambda$ is a minimizer of $\hat J_0(w) + \lambda\|w\|_1$,
$$J_0(\hat w_\lambda) - q^\top\hat w_\lambda + \lambda\|\hat w_\lambda\|_1 \le J_0(w_0) - q^\top w_0 + \lambda\|w_0\|_1,\qquad(38)$$
which implies, since $J_0(\hat w_\lambda) \ge J_0(w_0)$:
$$\lambda\|\hat w_\lambda\|_1 \le \lambda\|w_0\|_1 + \|q\|_\infty\,\|\hat w_\lambda - w_0\|_1,$$
$$\lambda\|(\hat w_\lambda)_K\|_1 + \lambda\|(\hat w_\lambda)_{K^c}\|_1 \le \lambda\|(w_0)_K\|_1 + \|q\|_\infty\big(\|(\hat w_\lambda)_K - (w_0)_K\|_1 + \|(\hat w_\lambda)_{K^c}\|_1\big).$$
If we denote by $\Delta = \hat w_\lambda - w_0$ the estimation error, we deduce:
$$(\lambda - \|q\|_\infty)\,\|\Delta_{K^c}\|_1 \le (\lambda + \|q\|_\infty)\,\|\Delta_K\|_1.$$
If we assume $\|q\|_\infty \le \lambda/2$, then we have $\|\Delta_{K^c}\|_1 \le 3\|\Delta_K\|_1$, and thus, using (A5), we get $\Delta^\top Q\Delta \ge \rho^2\|\Delta_K\|_2^2$. From Eq. (38), we thus get:
$$J_0(\hat w_\lambda) - J_0(w_0) \le q^\top(\hat w_\lambda - w_0) - \lambda\|\hat w_\lambda\|_1 + \lambda\|w_0\|_1,$$
$$J_0(w_0+\Delta) - J_0(w_0) \le (\|q\|_\infty + \lambda)\,\|\Delta\|_1 \le \frac{3\lambda}{2}\|\Delta\|_1.\qquad(39)$$
Using Eq. (3) in Proposition 1 with $J_0$, we obtain:
$$J_0(w_0+\Delta) - J_0(w_0) \ge \frac{\Delta^\top Q\Delta}{R^2\|\Delta\|_2^2}\big(e^{-R\|\Delta\|_2} + R\|\Delta\|_2 - 1\big),$$
which implies, using $\Delta^\top Q\Delta \ge \rho^2\|\Delta_K\|_2^2$ and Eq. (39):
$$\frac{\rho^2\|\Delta_K\|_2^2}{R^2\|\Delta\|_2^2}\big(e^{-R\|\Delta\|_2} + R\|\Delta\|_2 - 1\big) \le \frac{3\lambda}{2}\|\Delta\|_1.\qquad(40)$$
We can now use, with $s = |K|$, $\|\Delta\|_2 \le \|\Delta\|_1 \le 4\|\Delta_K\|_1 \le 4\sqrt s\,\|\Delta_K\|_2$ to get:
$$\rho^2\big(e^{-R\|\Delta\|_2} + R\|\Delta\|_2 - 1\big) \le \frac{3\lambda}{2}\,\frac{(4\sqrt s\,\|\Delta_K\|_2)\,R^2\|\Delta\|_2^2}{\|\Delta_K\|_2^2} \le 24\,\lambda sR^2\,\|\Delta\|_2.$$
This implies, using Eq. (23), that
$$R\|\Delta\|_2 \le \frac{48\,\lambda Rs/\rho^2}{1 - 24\,\lambda sR/\rho^2} \le 2$$
as soon as $\lambda sR\rho^{-2} \le 1/48$, which itself implies that
$$\frac{1}{(R\|\Delta\|_2)^2}\big(e^{-R\|\Delta\|_2} + R\|\Delta\|_2 - 1\big) \ge \frac{1}{4},$$
and thus, from Eq. (40),
$$\frac{\rho^2}{4}\,\|\Delta_K\|_2^2 \le \frac{3\lambda}{2}\times 4\sqrt s\,\|\Delta_K\|_2.$$
The second result then follows from Eq. (39) (using Bernstein's inequality for an upper bound on $\mathbb{P}(\|q\|_\infty \ge \lambda/2)$).

G  Concentration inequalities

In this section, we derive concentration inequalities for quadratic forms of bounded random variables that extend the ones already known for Gaussian random variables [28]. The following proposition is a simple corollary of a general concentration result on U-statistics [11].

Proposition 4  Let $y_1,\dots,y_n$ be $n$ vectors in $\mathbb{R}^p$ such that $\|y_i\|_2 \le b$ for all $i = 1,\dots,n$, and let $Y = [y_1^\top,\dots,y_n^\top]^\top \in\mathbb{R}^{n\times p}$. Let $\varepsilon\in\mathbb{R}^n$ be a vector of zero-mean independent random variables almost surely bounded by 1 and with variances $\sigma_i^2$, $i = 1,\dots,n$. Let $S = \mathrm{Diag}(\sigma_i)\,YY^\top\,\mathrm{Diag}(\sigma_i)$. Then, for all $u \ge 0$:
$$\mathbb{P}\Big(\big|\varepsilon^\top YY^\top\varepsilon - \mathrm{tr}\,S\big| \ge 32\,\mathrm{tr}(S^2)^{1/2}u^{1/2} + 18\,\lambda_{\max}(S)\,u + 126\,b\,(\mathrm{tr}\,S)^{1/2}u^{3/2} + 39\,b^2u^2\Big) \le 8e^{-u}.\qquad(41)$$

Proof  We apply Theorem 3.4 from [11], with $T_i = \varepsilon_i$ and $g_{i,j}(t_i,t_j) = y_i^\top y_j\,t_it_j$ if $|t_i|,|t_j| \le 1$, and zero otherwise. We then have (following the notations from [11]): $A = \max_{i,j}|y_i^\top y_j| \le b^2$, $B^2 = \max_{i\in\{1,\dots,n\}}\sum_j \cdots$, leading to a deviation bound of the form
$$\mathbb{P}\big(\cdots \ge 44.8\,Cu^{1/2} + 35.36\,Du + 124.56\,Bu^{3/2} + 38.26\,Au^2\big) \le 5.542\,e^{-u}.$$
Moreover, we have, from Bernstein's inequality [2]:
$$\mathbb{P}\Big(\sum_{i=1}^n y_i^\top y_i\,(\varepsilon_i^2 - \sigma_i^2) \ge u^{1/2}\sqrt{2b^2\,\mathrm{tr}\,S} + \frac{b^2u}{3}\Big) \le 2e^{-u},$$
leading to the desired result, noting that for $u \le \log 8$ the bound is trivial.
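Proposition 4 lends itself to a quick Monte Carlo sanity check. A minimal sketch (the setup, names and distribution choices are ours; the constants are those of Eq. (41)):

```python
import numpy as np

# Monte Carlo sanity check of Eq. (41): the empirical tail probability of
# |eps^T Y Y^T eps - tr S| should lie below 8*exp(-u) at the stated threshold.
rng = np.random.default_rng(0)
n, p, reps, u = 50, 10, 20000, 3.0            # u > log 8, else the bound is trivial
Y = rng.normal(size=(n, p))
Y /= np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1.0)  # ||y_i||_2 <= b = 1
b = 1.0
G = Y @ Y.T
sigma = 0.5 * np.ones(n)                      # eps_i = +/- 1/2: mean 0, |eps_i| <= 1
S = sigma[:, None] * G * sigma[None, :]       # S = Diag(sigma) Y Y^T Diag(sigma)
thresh = (32.0 * np.sqrt(np.trace(S @ S)) * u ** 0.5
          + 18.0 * np.linalg.eigvalsh(S).max() * u
          + 126.0 * b * np.sqrt(np.trace(S)) * u ** 1.5
          + 39.0 * b ** 2 * u ** 2)
eps = rng.choice([-0.5, 0.5], size=(reps, n))
stat = np.einsum('ri,ij,rj->r', eps, G, eps)  # quadratic forms eps^T Y Y^T eps
print((np.abs(stat - np.trace(S)) >= thresh).mean(), "vs", 8.0 * np.exp(-u))
```

As expected for a non-asymptotic bound of this type, the empirical tail is far below the theoretical level; the check merely confirms the inequality is not violated.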
We can apply this in our setting with $y_i = \frac{1}{n}(P+\lambda I)^{-1/2}x_i$ (with $\|x_i\|_2 \le R$), leading to $b = \frac12 Rn^{-1}\lambda^{-1/2}$ and $S = \frac{1}{n}\,\mathrm{Diag}(\sigma)\,X(P+\lambda I)^{-1}X^\top\,\mathrm{Diag}(\sigma)$.

Misspecified models.  If no assumptions are made, we simply have $\lambda_{\max}(S) \le (\mathrm{tr}\,S^2)^{1/2} \le \mathrm{tr}\,S \le R^2/(\lambda n)$, and we get, after bringing terms together:
$$\mathbb{P}\Big(q^\top(P+\lambda I)^{-1}q \ge \frac{41R^2u}{\lambda n} + \frac{R^2}{\lambda}\Big(\frac{8u^2}{n^2} + \frac{63u^{3/2}}{n^{3/2}}\Big)\Big) \le 8e^{-u}.\qquad(42)$$

Well-specified models.  In this case, $P = Q$ and $\lambda_{\max}(S) \le 1/n$, $\mathrm{tr}\,S = d_1/n$, $\mathrm{tr}\,S^2 = d_2/n^2$, so that:
$$\mathbb{P}\Big(q^\top(P+\lambda I)^{-1}q - \frac{d_1}{n} \ge \frac{32\,d_2^{1/2}u^{1/2}}{n} + \frac{18u}{n} + \frac{53\,Rd_1^{1/2}u^{3/2}}{n^{3/2}\lambda^{1/2}} + \frac{9R^2u^2}{\lambda n^2}\Big) \le 8e^{-u}.\qquad(43)$$

Acknowledgements

I would like to thank Sylvain Arlot, Jean-Yves Audibert and Guillaume Obozinski for fruitful discussions related to this work. This work was supported by a French grant from the Agence Nationale de la Recherche (MGA Project ANR-07-BLAN-0311).

References

[1] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[2] P. Massart. Concentration Inequalities and Model Selection: Ecole d'Eté de Probabilités de Saint-Flour 23. Springer, 2003.
[3] S. A. van de Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36(2):614–645, 2008.
[4] C. Gu. Adaptive spline smoothing in non-Gaussian regression models. Journal of the American Statistical Association, pages 801–807, 1990.
[5] F. Bunea. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1+ℓ2 penalization. Electronic Journal of Statistics, 2:1153–1194, 2008.
[6] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[7] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
[8] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM Studies in Applied Mathematics, 1994.
[9] R. Christensen. Log-Linear Models and Logistic Regression. Springer, 1997.
[10] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley-Interscience, 2004.
[11] C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In Stochastic Inequalities and Applications, Progress in Probability, 56, pages 55–69. Birkhäuser, 2003.
[12] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
[13] M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143–161, 2007.
[14] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, December 2006.
[15] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 2009. To appear.
[16] P. J. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 2009. To appear.
[17] J. F. Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer, 2003.
[18] J. Abernethy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), pages 263–274, 2008.
[19] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall/CRC, 1989.
[20] B. Efron. The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–633, 2004.
[21] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[22] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
[23] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82–95, 1971.
[24] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[25] C. Gu. Smoothing Spline ANOVA Models. Springer, 2002.
[26] K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. In Advances in Neural Information Processing Systems, 2008.
[27] I. Steinwart, D. Hush, and C. Scovel. A new concentration result for regularized risk minimizers. High Dimensional Probability: Proceedings of the Fourth International Conference, 51:260–275, 2006.
[28] S. Arlot and F. Bach. Data-driven calibration of linear estimators with minimal penalties. In Advances in Neural Information Processing Systems, 2009.
[29] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
[30] Z. Harchaoui, F. R. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. Technical Report 00270806, HAL, 2008.
[31] R. Shibata. Statistical aspects of model selection. In From Data to Model, pages 215–240. Springer, 1989.
[32] H. Bozdogan. Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370, 1987.
[33] P. Liang, F. Bach, G. Bouchard, and M. I. Jordan. An asymptotic analysis of smooth regularizers. In Advances in Neural Information Processing Systems, 2009.
[34] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31(4):377–403, 1978/79.
[35] K.-C. Li. Asymptotic optimality for C_p, C_L, cross-validation and generalized cross-validation: discrete index set. Annals of Statistics, 15(3):958–975, 1987.
[36] F. Bach. Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[37] C. L. Mallows. Some comments on C_p. Technometrics, 15:661–675, 1973.
[38] F. O'Sullivan, B. S. Yandell, and W. J. Raynor Jr. Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association, pages 96–103, 1986.
[39] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[40] T. Zhang. Some sharp performance bounds for least squares regression with ℓ1 regularization. Annals of Statistics, 2009. To appear.
[41] A. Juditsky and A. S. Nemirovski. On verifiable sufficient conditions for sparse signal recovery via ℓ1 minimization. Technical Report 0809.2650, arXiv, 2008.
[42] P. Chaudhuri and P. A. Mykland. Nonlinear experiments: Optimal design and inference based on likelihood. Journal of the American Statistical Association, pages 538–546, 1993.
[43] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
[44] F. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
[45] N. Meinshausen and P. Bühlmann. Stability selection. Technical report, arXiv:0809.2932, 2008.
[46] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.
[47] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. Journal of Machine Learning Research, 9:485–516, 2008.
[48] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Number 3 in CMS Books in Mathematics. Springer, 2000.