Sharp Convergence Rate and Support Consistency of Multiple Kernel Learning with Sparse and Dense Regularization


Taiji Suzuki, Ryota Tomioka
Department of Mathematical Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo
t-suzuki@mist.i.u-tokyo.ac.jp, tomioka@mist.i.u-tokyo.ac.jp

Masashi Sugiyama
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo
sugi@cs.titech.ac.jp

Abstract

We theoretically investigate the convergence rate and support consistency (i.e., correctly identifying the subset of non-zero coefficients in the large sample limit) of multiple kernel learning (MKL). We focus on MKL with block-$\ell_1$ regularization (inducing sparse kernel combination), block-$\ell_2$ regularization (inducing uniform kernel combination), and elastic-net regularization (including both block-$\ell_1$ and block-$\ell_2$ regularization). For the case where the true kernel combination is sparse, we show a sharper convergence rate of the block-$\ell_1$ and elastic-net MKL methods than the existing rate for block-$\ell_1$ MKL. We further show that elastic-net MKL requires a milder condition for being consistent than block-$\ell_1$ MKL. For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL can achieve a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between the block-$\ell_1$ and block-$\ell_2$ regularizers. Thus, our theoretical results overall suggest the use of elastic-net regularization in MKL.

1 Introduction

The choice of kernel functions is a key issue for kernel methods such as support vector machines to work well (Vapnik, 1998). A traditional but very powerful approach to optimizing the kernel function is the use of cross-validation (CV) (Stone, 1974). Although the CV-based kernel choice often leads to better generalization, it is computationally expensive when the kernel contains multiple tuning parameters. To overcome this limitation, the framework of multiple kernel learning (MKL) has been introduced, which tries to learn the optimal linear combination of prefixed base kernels by convex optimization (Lanckriet et al., 2004, Micchelli and Pontil, 2005, Lin and Zhang, 2006, Sonnenburg et al., 2006, Rakotomamonjy et al., 2008, Suzuki and Tomioka, 2009). The seminal paper by Bach et al. (2004) showed that this MKL formulation can be interpreted as block-$\ell_1$ regularization (i.e., $\ell_1$ regularization across the kernels and $\ell_2$ regularization within each kernel). We refer to this MKL formulation as 'block-$\ell_1$ MKL'. Based on this interpretation, block-$\ell_1$ MKL was proved to be support consistent (i.e., correctly identifying the subset of non-zero coefficients with probability one in the large sample limit) when the true kernel combination is sparse (Bach, 2008). Furthermore, the convergence rate of block-$\ell_1$ MKL has been elucidated in Koltchinskii and Yuan (2008), which can be regarded as an extension of the theoretical analysis for ordinary (non-block) $\ell_1$ regularization (Bickel et al., 2009, Zhang, 2009). However, in many practical applications, the true kernel combination may not be exactly sparse.
In such a non-sparse situation, block-$\ell_1$ MKL was shown to perform rather poorly—even the uniform combination of base kernels obtained by block-$\ell_2$ regularization (Micchelli and Pontil, 2005) (which we call 'block-$\ell_2$ MKL') often works better in practice (Cortes, 2009). Furthermore, recent work has shown that some 'intermediate' regularization between block-$\ell_1$ and block-$\ell_2$ regularization is more promising, e.g., block-$\ell_p$ regularization with $1 \le p \le 2$ (Cortes et al., 2009, Kloft et al., 2009), and elastic-net regularization (Zou and Hastie, 2005), which includes both block-$\ell_1$ and block-$\ell_2$ regularization (Tomioka and Suzuki, 2010) (we call this method 'elastic-net MKL'). Theoretically, the support consistency and the convergence rate of parametric elastic-nets have been elucidated in Yuan and Lin (2007) and Zou and Zhang (2009), respectively, and the non-parametric case has been investigated in Meier et al. (2009), focusing on Sobolev spaces.

In this paper, we theoretically analyze the support consistency and convergence rate of MKL, and provide three new results.

• For the case where the true kernel combination is sparse, we show that elastic-net MKL achieves a faster convergence rate than the one shown for block-$\ell_1$ MKL (Koltchinskii and Yuan, 2008). More specifically, we show that the $L_2$ convergence error is given by $O_p\big(\min\{ d n^{-\frac{2}{2+s}} + d\log(M)/n,\ d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + d\log(M)/n \}\big)$, where $d$ is the number of active components of the target function, $s$ is the complexity of the RKHSs, $M$ is the number of candidate kernels, and $n$ is the number of samples.

• For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL achieves a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between block-$\ell_1$ and block-$\ell_2$ regularization. Our theoretical result agrees well with the experimental results reported in Tomioka and Suzuki (2010).

• For the case where the true kernel combination is sparse, we prove that the necessary and sufficient conditions for the support consistency of elastic-net MKL are milder than the conditions required for block-$\ell_1$ MKL (Bach, 2008).

Overall, our theoretical results suggest the use of elastic-net regularization in MKL.

2 Preliminaries

In this section, we formulate the elastic-net MKL approach and summarize the mathematical tools that are needed for the theoretical analysis.

2.1 Formulation

Suppose we are given $n$ samples $(x_i, y_i)_{i=1}^n$, where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i \in \mathbb{R}$. The samples $(x_i, y_i)_{i=1}^n$ are independently and identically distributed from a probability measure $P$. We denote the marginal distribution of $X$ by $\Pi$. We consider an MKL regression problem in which the unknown target function is represented in the form $f(x) = \sum_{m=1}^M f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$ ($m = 1, \dots, M$) corresponding to one of $M$ different base kernels $k_m$ over $\mathcal{X} \times \mathcal{X}$. Elastic-net MKL learns a decision function $\hat{f}$ as follows (see Footnote 1):

$$\hat{f} = \operatorname*{arg\,min}_{f_m \in \mathcal{H}_m\,(m=1,\dots,M)}\ \frac{1}{n}\sum_{i=1}^n \Big( y_i - \sum_{m=1}^M f_m(x_i) \Big)^2 + \lambda_1^{(n)} \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)} \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^2, \tag{1}$$

where the first term is the squared loss of function fitting, and the second and third terms are the block-$\ell_1$ and block-$\ell_2$ regularizers, respectively.
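As a concrete illustration of the optimization problem (1), the following is a minimal sketch added here for exposition (it is not part of the original analysis): if each RKHS is approximated by an explicit finite-dimensional feature map $\Phi_m$, so that $\|f_m\|_{\mathcal{H}_m}$ becomes the Euclidean norm of the corresponding coefficient group, then (1) reduces to a group elastic net, which can be solved by proximal gradient descent with a group soft-threshold. All names below (e.g., `elastic_net_mkl`) are hypothetical.

```python
import numpy as np

def elastic_net_mkl(Phis, y, lam1, lam2, n_iter=500):
    """Proximal gradient for a feature-map analogue of Eq. (1).

    Phis : list of (n, p_m) arrays, one feature matrix per base kernel.
    """
    n = y.shape[0]
    Phi = np.hstack(Phis)                      # stacked design matrix
    bounds = np.cumsum([0] + [P.shape[1] for P in Phis])
    # Lipschitz constant of the smooth part (squared loss + ridge term).
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2 / n + 2.0 * lam2
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 / n * Phi.T @ (Phi @ w - y) + 2.0 * lam2 * w
        z = w - grad / L                       # gradient step on the smooth part
        for m in range(len(Phis)):             # prox of lam1 * sum_m ||w_m||_2
            g = slice(bounds[m], bounds[m + 1])
            nrm = np.linalg.norm(z[g])
            z[g] = max(0.0, 1.0 - lam1 / (L * nrm)) * z[g] if nrm > 0 else 0.0
        w = z
    return [w[bounds[m]:bounds[m + 1]] for m in range(len(Phis))]
```

Setting `lam2 = 0` in this sketch recovers block-$\ell_1$ MKL (whole groups are zeroed out), while `lam1 = 0` recovers block-$\ell_2$ MKL (every group stays active), mirroring the reductions discussed next.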
It can be seen from (1) that elastic-net MKL reduces to block-$\ell_1$ MKL if $\lambda_2^{(n)} = 0$, which tends to induce a sparse kernel combination (Lanckriet et al., 2004, Bach et al., 2004). On the other hand, it reduces to block-$\ell_2$ MKL if $\lambda_1^{(n)} = 0$, which results in a uniform kernel combination (Micchelli and Pontil, 2005). It is worth noting that elastic-net MKL allows us to obtain various levels of sparsity by controlling the ratio between $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$.

2.2 Notations and Assumptions

Here, we prepare the technical tools needed in the following sections. By Mercer's theorem, there exist an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and a spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the following spectral representation:

$$k_m(x, x') = \sum_{k=1}^\infty \mu_{k,m}\,\phi_{k,m}(x)\,\phi_{k,m}(x'). \tag{2}$$

By this spectral representation, the inner product of the RKHS can be expressed as $\langle f_m, g_m \rangle_{\mathcal{H}_m} = \sum_{k=1}^\infty \mu_{k,m}^{-1} \langle f_m, \phi_{k,m} \rangle_{L_2(\Pi)} \langle \phi_{k,m}, g_m \rangle_{L_2(\Pi)}$.

Let $\mathcal{H} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$. For $f = (f_1, \dots, f_M) \in \mathcal{H}$ and a subset of indices $I \subseteq \{1, \dots, M\}$, we denote by $f_I$ the restriction of $f$ to the index set $I$, i.e., $f_I = (f_m)_{m \in I}$. We denote by $I_0$ the indices of the truly active kernels, i.e., $I_0 = \{ m \mid \|f^*_m\|_{\mathcal{H}_m} > 0 \}$, and define the complement of $I_0$ as $J_0 = I_0^c$. Throughout the paper, we assume the following technical conditions (see also Bach (2008)).

Footnote 1: For simplicity, we focus on the squared-loss function here. However, we note that it is straightforward to extend our convergence analysis and support consistency results given in Sections 3 and 4 to general loss functions that are strongly convex and Lipschitz continuous, following the line of Koltchinskii and Yuan (2008).

Table 1: Summary of the constants used in this article.
  $M$ : the number of candidate kernels.
  $d$ : the number of active kernels of the truth, i.e., $d = |I_0|$.
  $R$ : the upper bound of $\sum_{m=1}^M (\|f^*_m\|_{\mathcal{H}_m} + \|f^*_m\|_{\mathcal{H}_m}^2)$; see (A4).
  $s$ : the spectral decay coefficient; see (A5).
  $\beta$ : the approximate sparsity coefficient; see (A7).
  $b$ : the parameter that tunes the correlation between kernels; see (A8).

Assumption 1 (Basic Assumptions)
(A1) There exists $f^* = (f^*_1, \dots, f^*_M) \in \mathcal{H}$ such that $\mathrm{E}[Y|X] = \sum_{m=1}^M f^*_m(X)$, and the noise $\epsilon := Y - f^*(X)$ has a strictly positive variance: there exists $\sigma > 0$ such that $\mathrm{E}[\epsilon^2|X] > \sigma^2$ for all $X \in \mathcal{X}$. We also assume that $\epsilon$ is bounded as $|\epsilon| \le L$.
(A2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable and $\sup_{X \in \mathcal{X}} |k_m(X, X)| < 1$.
(A3) There exists $g^*_m \in \mathcal{H}_m$ such that

$$f^*_m(x) = \int_{\mathcal{X}} k_m^{(1/2)}(x, x')\, g^*_m(x')\, \mathrm{d}\Pi(x') \quad (\forall m = 1, \dots, M), \tag{3}$$

where $k_m^{(1/2)}(x, x') = \sum_{k=1}^\infty \mu_{k,m}^{1/2} \phi_{k,m}(x)\phi_{k,m}(x')$ is the operator square root of $k_m$.

The first assumption in (A1) ensures that the model $\mathcal{H}$ is correctly specified, and the technical assumption $|\epsilon| \le L$ allows $\epsilon f$ to be Lipschitz continuous with respect to $f$. It is known that assumption (A2) gives the following relation:

$$\|f_m\|_\infty \le \sup_x \langle k_m(x,\cdot), f_m \rangle_{\mathcal{H}_m} \le \sup_x \|k_m(x,\cdot)\|_{\mathcal{H}_m} \|f_m\|_{\mathcal{H}_m} \le \sup_x \sqrt{k_m(x,x)}\,\|f_m\|_{\mathcal{H}_m} \le \|f_m\|_{\mathcal{H}_m}.$$

The assumption (A3) was used in Caponnetto and de Vito (2007) and also in Bach (2008). It ensures the consistency of the least-squares estimates in terms of the RKHS norm.
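The following small numerical check (an added sketch, not part of the original text) illustrates the spectral representation (2) and the induced inner-product formula in their empirical form: the eigensystem of the normalized Gram matrix $K/n$ plays the role of $\{\mu_k, \phi_k\}$ under the empirical measure, and the weighted sum $\sum_k \mu_k^{-1}\langle f, \phi_k\rangle^2$ recovers the squared RKHS norm. For $f = k(\cdot, x_0)$ with a Gaussian kernel this should return approximately $k(x_0, x_0) = 1$.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 1)
K = np.exp(-(X - X.T) ** 2)            # Gaussian Gram matrix, k(x, x) = 1
n = len(X)
lam, U = np.linalg.eigh(K)             # K = U diag(lam) U^T, ascending order
mu, U = lam[::-1] / n, U[:, ::-1]      # empirical spectrum mu_k of Eq. (2)
# phi_k at the sample points is sqrt(n) * U[:, k]: orthonormal in empirical L2.
f = K[:, 0]                            # values of f = k(., x_0) at the sample
coef = U.T @ f / np.sqrt(n)            # <f, phi_k> under the empirical L2(Pi)
keep = mu > 1e-10                      # drop the numerically-zero tail
print(np.sum(coef[keep] ** 2 / mu[keep]))   # ~ k(x_0, x_0) = 1 (RKHS norm^2)
lead = np.clip(mu[:20], 1e-15, None)   # leading eigenvalues, clipped for log
slope, _ = np.polyfit(np.log(np.arange(1, 21)), np.log(lead), 1)
print(-1.0 / slope)                    # crude estimate of s: mu_k <~ k^(-1/s)
```

The log-log slope of the leading eigenvalues also gives a crude estimate of the decay coefficient $s$ listed in Table 1 and formalized in the spectral assumption (A5) below; for a Gaussian kernel the eigenvalues decay nearly exponentially, so the fitted $s$ is close to 0, corresponding to a 'simple' RKHS.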
Using the spectral representation (2), the condition $g^*_m \in \mathcal{H}_m$ is expressed as

$$\|g^*_m\|_{\mathcal{H}_m}^2 = \sum_{k=1}^\infty \mu_{k,m}^{-2} \langle f^*_m, \phi_{k,m} \rangle_{L_2(\Pi)}^2 < \infty. \tag{4}$$

This condition was also assumed in Koltchinskii and Yuan (2008). Proposition 9 of Bach (2008) gave a sufficient condition for (3) to hold for translation-invariant kernels $k_m(x, x') = h_m(x - x')$. The constants we use later are summarized in Table 1.

3 Convergence Rate of Elastic-net MKL

In this section, we derive the convergence rate of elastic-net MKL in two situations: (i) a sparse situation, where the truth $f^*$ is sparse (Section 3.1); (ii) a near-sparse situation, where the truth is not exactly sparse but $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially as $m$ increases (Section 3.2). For (i), we show that elastic-net MKL (and block-$\ell_1$ MKL) achieves a faster convergence rate than the rate shown for block-$\ell_1$ MKL in Koltchinskii and Yuan (2008). Furthermore, for (ii), we show that elastic-net MKL can outperform block-$\ell_1$ MKL and block-$\ell_2$ MKL depending on the sparsity of the truth and the conditioning of the problem. Throughout this section, we assume the following conditions.

Assumption 2 (Boundedness Assumption) There exist constants $C_1$ and $R$ such that

(A4) $\displaystyle \max_{m \in I_0}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} \le C_1, \qquad \sum_{m=1}^M \big(\|f^*_m\|_{\mathcal{H}_m} + \|f^*_m\|_{\mathcal{H}_m}^2\big) \le R.$

Assumption 3 (Spectral Assumption) There exist $0 < s < 1$ and $C_2$ such that

(A5) $\mu_{k,m} \le C_2\, k^{-\frac{1}{s}} \quad (1 \le \forall k,\ 1 \le \forall m \le M),$

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (2)).

The first assumption in (A4) appeared in Theorem 2 of Koltchinskii and Yuan (2008). The second assumption in (A4) bounds the amplitude of $f^*$. The spectral assumption (A5) was shown to be equivalent to the classical covering-number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ of $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A5) holds, there exists a constant $c$ that depends only on $s$ such that

$$N(\varepsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \le c\,\varepsilon^{-2s}, \tag{5}$$

and the converse is also true (see Theorem 15 of Steinwart et al. (2009) and Steinwart (2008) for details). Therefore, if $s$ is large, at least one RKHS is 'complex', and if $s$ is small, the RKHSs are regarded as 'simple'.

For a given set of indices $I \subseteq \{1, \dots, M\}$, let $\kappa(I)$ be defined as follows:

$$\kappa(I) := \sup\Big\{ \kappa \ge 0 \;\Big|\; \kappa \le \frac{\|\sum_{m \in I} f_m\|_{L_2(\Pi)}^2}{\sum_{m \in I}\|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m \in I) \Big\}.$$

$\kappa(I)$ represents the correlation of the RKHSs inside the index set $I$. Similarly, we define the correlation of the RKHSs between $I$ and $I^c$ as follows:

$$\rho(I) := \sup\Big\{ \frac{\langle f_I, g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\|g_{I^c}\|_{L_2(\Pi)}} \;\Big|\; f_I \in \mathcal{H}_I,\ g_{I^c} \in \mathcal{H}_{I^c},\ f_I \ne 0,\ g_{I^c} \ne 0 \Big\}.$$

In Sections 3.1 and 3.2, we will assume that the kernels have no perfect canonical dependence, meaning that the kernels are not too similar to each other (see (A6) and (A8) below). Throughout this paper, we assume $\frac{\log(Mn)}{n} \le 1$ and that $\log(M)$ grows more slowly than any polynomial order in the number of samples $n$: $\log(M) = o(n^\epsilon)$ for all $\epsilon > 0$. With some abuse of notation, we use $C$ to denote constants that are independent of $d$ and $n$; its value may change from line to line.

3.1 Sparse Situation

Here we derive the convergence rate of the estimator $\hat{f}$ when the truth $f^*$ is sparse.
Let $d = |I_0|$, and suppose that the number of kernels $M$ and the number of active kernels $d$ can increase with the number of samples $n$. We further assume the following condition in this subsection.

Assumption 4 (Incoherence Assumption) There exists a constant $C_3 > 0$ such that

(A6) $0 < C_3^{-1} < \kappa(I_0)\big(1 - \rho^2(I_0)\big). \tag{6}$

This condition is known as the incoherence condition (Koltchinskii and Yuan, 2008, Meier et al., 2009); that is, the kernels are not too dependent on each other and the problem is well conditioned. Then we have the following convergence rate.

Theorem 1 Under assumptions (A1)–(A6), there exist constants $C$, $F$ and $K$ depending only on $\kappa(I_0)$, $\rho(I_0)$, $s$, $C_1$, $C_2$, $L$, and $R$ such that the $L_2(\Pi)$-norm of the residual $\hat{f} - f^*$ can be bounded as follows. When $d^{3+s} n^{-1} \le 1$, for $\lambda_1^{(n)} = \lambda_2^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{\tfrac{t}{n}},\ F\sqrt{\tfrac{\log(Mn)}{n}} \}$,

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C\Big( d\, n^{-\frac{2}{2+s}} + \frac{dt}{n} \Big), \tag{7}$$

and, when $d^{3+s} n^{-1} > 1$, for $\lambda_1^{(n)} = \max\{ K(1+\sqrt{t})\, n^{-\frac{1}{2}},\ F\sqrt{\tfrac{\log(Mn)}{n}} \}$ and $\lambda_2^{(n)} \le \lambda_1^{(n)}$,

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C\Big( d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + \frac{d(\log(Mn) + t)}{n} \Big), \tag{8}$$

where each inequality holds with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$.

The above theorem indicates that the learning rate depends on the complexity of the RKHSs (the simpler, the faster) and on the number of active kernels rather than the total number of kernels $M$ (the influence of $M$ is at most $\frac{d\log(M)}{n}$). It is worth noting that the convergence rates in (7) and (8) are faster than or equal to the rate for block-$\ell_1$ MKL shown by Koltchinskii and Yuan (2008), which established the learning rate $O_p\big( d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + \frac{d\log(M)}{n} \big)$ under the same conditions as ours (see Footnote 2).

Footnote 2: In our second bound (8), there is an additional $\frac{d\log(n)}{n}$ term. However, this can be eliminated by replacing the probability $1 - e^{-t} - n^{-1}$ with $1 - e^{-t} - M^{-A}$ as in Koltchinskii and Yuan (2008). Moreover, if $\sqrt{n}\,\log(n)^{-\frac{1+s}{2s}} \ge d$, then the term $\frac{d\log(n)}{n}$ is dominated by the first term $d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}}$.

3.2 Near-Sparse Situation

In this subsection, we analyze the convergence rate in a situation where $f^*$ is not sparse but near sparse. We showed a faster learning rate than existing bounds in the previous subsection; however, the assumptions used there might be too restrictive to capture the situations in which MKL is used in practice. In fact, it was pointed out by Zou and Hastie (2005), in the context of (non-block) $\ell_1$ regularization, that $\ell_1$ regularization can fail in the following situations:

• When the truth $f^*$ is not sparse, $\ell_1$ regularization shrinks many small but non-zero components to zero.
• When there are strong correlations between different kernels, the solution of block-$\ell_1$ MKL becomes unstable.
• When the number of kernels $M$ is not large, there is no need to force the estimator to be sparse.

In order to analyze these situations in the MKL setting, we introduce three parameters $\beta$, $b$, and $\tau$: $\beta$ controls the level of sparsity (see (A7)), $b$ controls the correlation between candidate kernels (see (A8)), and $\tau$ controls the growth of the number of kernels against the number of samples (see (A9)). We show that block-$\ell_2$ MKL is naturally preferable when there are only a few candidate kernels or the truth is dense.
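As a toy numerical illustration of these regimes (an added sketch, not an experiment from the paper; it reuses the hypothetical `elastic_net_mkl` function from the sketch in Section 2.1), one can generate a near-sparse truth in the sense of (A7) and simply switch the two regularization parameters:

```python
import numpy as np

rng = np.random.RandomState(1)
n, M, p = 200, 10, 5
Phis = [rng.randn(n, p) for _ in range(M)]
# Near-sparse truth: group norms decay polynomially, as in (A7).
w_true = [rng.randn(p) * (m + 1) ** -2.0 for m in range(M)]
y = sum(P @ w for P, w in zip(Phis, w_true)) + 0.1 * rng.randn(n)

for name, lam1, lam2 in [("block-l1", 0.1, 0.0),
                         ("block-l2", 0.0, 0.1),
                         ("elastic ", 0.05, 0.05)]:
    w_hat = elastic_net_mkl(Phis, y, lam1, lam2)
    err = sum(np.sum((a - b) ** 2) for a, b in zip(w_hat, w_true))
    print(name, "active groups:",
          sum(np.linalg.norm(w) > 1e-8 for w in w_hat), "error:", round(err, 3))
```

Which regularizer wins in such a comparison depends on the decay rate of the true group norms and on the number of groups; Theorems 2 and 3 below quantify exactly this trade-off.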
Importantly, if the candidate kernels are correlated, the convergence of block-$\ell_1$ MKL can be slow even when the truth is sparse. Our analysis shows that elastic-net MKL is most valuable in such intermediate situations.

By permuting indices, we can assume without loss of generality that $\|f^*_m\|_{\mathcal{H}_m}$ is decreasing with respect to $m$, i.e., $\|f^*_1\|_{\mathcal{H}_1} \ge \|f^*_2\|_{\mathcal{H}_2} \ge \|f^*_3\|_{\mathcal{H}_3} \ge \cdots$. We further assume the following conditions in this subsection.

Assumption 5 (Approximate Sparsity) The truth is approximately sparse, i.e., $\|f^*_m\|_{\mathcal{H}_m} > 0$ for all $m$ and thus $I_0 = \{1,\dots,M\}$; however, $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially with respect to $m$:

(A7) $\|f^*_m\|_{\mathcal{H}_m} \le C_3\, m^{-\beta}.$

We call $\beta\ (>1)$ the approximate sparsity coefficient.

Assumption 6 (Generalized Incoherence) There exist $b > 0$ and $C_4$ such that for all $I \subseteq \{1,\dots,M\}$,

(A8) $\big(1 - \rho^2(I)\big)\kappa(I) \ge C_4\, |I|^{-b}.$

Assumption 7 (Kernel-Set Growth) The number of kernels $M$ increases polynomially with respect to the number of samples $n$, i.e., there exists $\tau > 0$ such that

(A9) $M = \lceil n^\tau \rceil.$

For notational convenience, let

$$\tau_1 = \frac{1}{(2\beta+b)(2+s)-1-s},\quad \tau_2 = \frac{(s-1)(2\beta-1)+bs}{(2\beta+b)(2+s)-1-s},\quad \tau_3 = \frac{s\{2(b+\beta)-1\}}{2(2+s)(b+\beta)-s},\quad \tau_4 = \frac{s}{2+s},\quad \tau_5 = \frac{b+1}{(\beta+b)\{b(2+s)+2\}},\quad \tau_6 = \frac{1}{(1-s)(1+b)}.$$

In addition, we denote by $K$ some sufficiently large constant.

Theorem 2 Suppose that assumptions (A1–A5) and (A7–A9), $2\beta(1-s) < s(b-1)$, and $\tau_1 < \tau < \tau_4$ are satisfied. Then the estimator of elastic-net MKL possesses the following convergence rates, each of which holds with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:

1. When $\tau_1 < \tau < \tau_2$,
$$\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \le C\Big\{ n^{-\gamma_1} + \Big( n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t) \Big\},\quad \text{where } \gamma_1 = \frac{4\beta+b-2}{(2+s)(2\beta+b)-1-s}, \tag{9}$$
with $\lambda_1^{(n)} = \max\{ K n^{-\frac{3\beta+b-1}{(2\beta+b)(2+s)-1-s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K n^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}$.

2. When $\tau_2 \le \tau < \tau_3$,
$$\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \le C\Big\{ n^{\tau\frac{(2+s)b+2}{2\{(2+s)(b+\beta)-s\}}-\gamma_2} + \Big( n^{\frac{\tau(2+s)(1-\beta)-(4\beta+2b+sb-2)}{2\{(\beta+b)(2+s)-s\}}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t) \Big\},\quad \text{where } \gamma_2 = \frac{4\beta+b(2+s)-2}{2\{(2+s)(b+\beta)-s\}}, \tag{10}$$
with $\lambda_1^{(n)} = \max\{ K\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K n^{\frac{\tau-\{2(b+\beta)-1\}}{2\{(2+s)(b+\beta)-s\}}}$.

3. When $\tau_3 \le \tau < \tau_4$,
$$\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \le C\Big\{ n^{\tau\gamma_3-\gamma_3} + \Big( n^{\frac{\tau(\beta-1)+1-2\beta-b}{2(b+\beta)}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t) \Big\},\quad \text{where } \gamma_3 = \frac{b+2\beta-1}{2(b+\beta)}, \tag{11}$$
with $\lambda_1^{(n)} = \max\{ K\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K(M/n)^{\frac{2(b+\beta)-1}{4(b+\beta)}}$.

Theorem 3 Under assumptions (A1–A5) and (A7–A9), if $\tau_5 < \tau$, the estimator $\hat{f}_{\ell_1}$ of block-$\ell_1$ MKL has the following convergence rate with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:

$$\text{(block-}\ell_1\text{ MKL)}\qquad \|\hat{f}_{\ell_1}-f^*\|_{L_2(\Pi)}^2 \le C\Big( n^{-\gamma_4} + n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t) \Big),\quad \text{where } \gamma_4 = \frac{2\beta+b-1}{(\beta+b)(2+s)}, \tag{12}$$

with $\lambda_1^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = 0$.
Moreover, if $\tau < \tau_6$, the estimator $\hat{f}_{\ell_2}$ of block-$\ell_2$ MKL has the following convergence rate with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:

$$\text{(block-}\ell_2\text{ MKL)}\qquad \|\hat{f}_{\ell_2}-f^*\|_{L_2(\Pi)}^2 \le C\Big( n^{\tau(b+\frac{2}{2+s})-\gamma_5} + \Big( \lambda_2^{(n)2} + \frac{M^{1+b}}{n} \Big)t \Big),\quad \text{where } \gamma_5 = \frac{2}{2+s}, \tag{13}$$

with $\lambda_2^{(n)} = \max\{ K(M/n)^{\frac{1}{2+s}},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_1^{(n)} = 0$.

In all convergence rates presented in Theorems 2 and 3, the leading terms are those that do not contain $t$. The terms containing $t$ converge faster than the leading terms and are thus negligible. By a simple calculation, we can confirm that elastic-net MKL always converges faster than block-$\ell_1$ MKL and block-$\ell_2$ MKL if $\beta$ and $M$ satisfy the condition of Theorem 2. The convergence rate of elastic-net MKL becomes identical to that of block-$\ell_2$ MKL and block-$\ell_1$ MKL at the two extreme points of the interval, $\tau = \tau_1$ and $\tau = \tau_4$, respectively. Outside this region, block-$\ell_1$ MKL or block-$\ell_2$ MKL has a faster convergence rate than elastic-net MKL. Moreover, at $\tau = \tau_2$, the convergence rates (9) and (10) of elastic-net MKL are identical, and at $\tau = \tau_3$, the convergence rates (10) and (11) are identical. The relation between the most preferred method and the growth rate $\tau$ of the number of kernels is illustrated in Figure 1.

[Figure 1 (figure omitted; it plots the convergence rate against the growth rate $\tau$ of the number of kernels, with the thresholds $\tau_1, \tau_2, \tau_3, \tau_4, \tau_5$ and the levels $-\gamma_1, -\gamma_4$ marking the elastic-net, block-$\ell_1$, and block-$\ell_2$ regimes.) Caption: Relation between the convergence rate and the number of kernels. If the truth is intermediately sparse (the growth rate $\tau$ of the number of kernels is between $\tau_1$ and $\tau_5$), then elastic-net MKL performs best. At the edges of the interval, the convergence rate of elastic-net MKL coincides with that of block-$\ell_1$ MKL or block-$\ell_2$ MKL.]

The condition $\tau_1 < \tau < \tau_4$ in Theorem 2 indicates that when the number of kernels is neither too small nor too large, the 'intermediate' effect of elastic-net MKL becomes advantageous. Roughly speaking, if $M$ is large, sparsity is needed to ensure convergence, and thus block-$\ell_1$ MKL performs best. On the other hand, if $M$ is small, there is no need to make the solution sparse, and thus block-$\ell_2$ MKL becomes the best. For intermediate $M$, elastic-net MKL becomes the best. The condition $2\beta(1-s) < s(b-1)$ in Theorem 2 ensures the existence of an $M$ that satisfies the condition of the theorem, i.e., $\tau_1 < \tau_2 < \tau_3 < \tau_4$. It can be seen that as $b$ becomes large (i.e., the problem becomes worse conditioned), the range of $\beta$ and $M$ in which elastic-net MKL performs better than block-$\ell_1$ MKL and block-$\ell_2$ MKL becomes larger. This indicates that the worse the conditioning of the problem, the more important it is to control the balance of $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$ appropriately.

4 Support Consistency of Elastic-net MKL

In this section, we derive necessary and sufficient conditions for the statistical support consistency of the estimated sparsity pattern, i.e., for the probability of $\{ m \mid \|\hat{f}_m\|_{\mathcal{H}_m} \ne 0 \} = I_0$ to go to 1 as the number of samples $n$ tends to infinity. Due to the additional squared regularization term, the necessary condition for the support consistency of elastic-net MKL is shown to be weaker than that for block-$\ell_1$ MKL (Bach, 2008). In this section, we assume that the true sparsity pattern $I_0$, the number of active kernels $d = |I_0|$, and the number of kernels $M$ are fixed independently of the number of samples $n$.

Let $\mathcal{H}_I$ be the restriction of $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$ to the index set $I$. Since $\mathrm{E}_X[k_m(X,X)] < \infty$ for all $m$ (from assumption (A2)), we can define the (non-centered) cross covariance operator $\Sigma_{I,J}: \mathcal{H}_I \to \mathcal{H}_J$ as a bounded linear operator such that (see Footnote 3)

$$\langle f_I, \Sigma_{I,J}\, g_J \rangle_{\mathcal{H}_I} = \sum_{m \in I}\sum_{m' \in J} \langle f_m, \Sigma_{m,m'}\, g_{m'} \rangle_{\mathcal{H}_m} = \sum_{m \in I}\sum_{m' \in J} \mathrm{E}_X[f_m(X)\, g_{m'}(X)], \tag{14}$$

for all $f_I = (f_m)_{m\in I} \in \mathcal{H}_I$ and $g_J = (g_{m'})_{m'\in J} \in \mathcal{H}_J$. See Baker (1973) for the details of the cross covariance operator $(f,g) \mapsto \mathrm{cov}(f(X), g(X))$. Moreover, we define the bounded (non-centered) cross-correlation operators $V_{l,m}$ by $\Sigma_{l,l}^{1/2} V_{l,m} \Sigma_{m,m}^{1/2} = \Sigma_{l,m}$ (see Footnote 4). The joint cross-correlation operator $V_{I,J}: \mathcal{H}_J \to \mathcal{H}_I$ is defined analogously to $\Sigma_{I,J}$. In this section, we assume, in addition to the basic assumptions (A1–A3), that

(A10) all $V_{l,m}$ are compact and the joint correlation operator $V$ is invertible.

Let $\hat{I}$ be the indices of the active kernels of the estimate $\hat{f} \in \mathcal{H}$ by elastic-net MKL: $\hat{I} := \{ m \mid \|\hat{f}_m\|_{\mathcal{H}_m} > 0 \}$. Let $D := \mathrm{Diag}(\|f^*_m\|_{\mathcal{H}_m}^{-1}) = \mathrm{Diag}\big((\|f^*_m\|_{\mathcal{H}_m}^{-1})_{m\in I_0}\big)$, where $\mathrm{Diag}(\cdot)$ denotes the $|I_0| \times |I_0|$ block-diagonal operator with the operators $\|f^*_m\|_{\mathcal{H}_m}^{-1} I_{\mathcal{H}_m}$ on the diagonal blocks for $m \in I_0$. The norm of $f \in \mathcal{H}$ is defined by $\|f\|_{\mathcal{H}} := \sqrt{\sum_{m=1}^M \|f_m\|^2_{\mathcal{H}_m}}$, and similarly that of $f_I \in \mathcal{H}_I$ is defined by $\|f_I\|_{\mathcal{H}_I} := \sqrt{\sum_{m\in I}\|f_m\|^2_{\mathcal{H}_m}}$. The following theorem gives a sufficient condition for the consistency of sparsity patterns.

Theorem 4 Suppose $\lambda_2^{(n)} > 0$, $\lambda_1^{(n)} \to 0$, $\lambda_2^{(n)} \to 0$, $\lambda_1^{(n)}\sqrt{n} \to \infty$, and

$$\limsup_n \Big\| \Sigma_{m,I_0}\big(\Sigma_{I_0,I_0} + \lambda_2^{(n)}\big)^{-1}\Big( D + \frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}} \Big) f^*_{I_0} \Big\|_{\mathcal{H}_m} < 1 \quad (\forall m \in J = I_0^c). \tag{15}$$

Then, under assumptions (A1–A3, A10), $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat{I} \xrightarrow{p} I_0$ (see Footnote 5).

The condition $\lambda_2^{(n)} > 0$ is just for technical simplicity, to make $\Sigma_{I_0,I_0} + \lambda_2^{(n)}$ invertible. The condition $\lambda_1^{(n)}\sqrt{n} \to \infty$ means that $\lambda_1^{(n)}$ does not decrease too quickly. Condition (15) corresponds to an infinite-dimensional extension of the elastic-net 'irrepresentable' condition. In the paper of Zhao and Yu (2006), the irrepresentable condition was derived as a necessary and sufficient condition for the sign consistency of $\ell_1$ regularization when the number of parameters is finite. Its elastic-net version was derived in Yuan and Lin (2007), and it was extended to a situation where the number of parameters diverges as $n$ increases (Jia and Yu, 2010). We also have a necessary condition for consistency, stated as Theorem 5 below.

Footnote 3: If one fits a function with a constant offset ($f(x) + b$ instead of $f(x)$) as in Bach (2008), then the centered version of the cross covariance operator is required instead of the non-centered version, i.e., $\langle f_m, \Sigma_{m,m'}\, g_{m'}\rangle_{\mathcal{H}_m} = \mathrm{E}_X[(f_m(X) - \mathrm{E}_X[f_m])(g_{m'}(X) - \mathrm{E}_X[g_{m'}])]$. However, this difference is not essential because, without loss of generality, one can consider a situation where $\mathrm{E}_Y[Y] = 0$ and $\mathrm{E}_X[f_m(X)] = 0$ for all $f_m \in \mathcal{H}_M$ by centering all the functions.

Footnote 4: Actually, such a bounded operator always exists (Baker, 1973).
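Before stating the necessary condition, the following finite-dimensional sketch (an added illustration, not from the paper) evaluates the left-hand side of the irrepresentable-type condition (15) when each 'kernel' is a single coordinate, so that the covariance operators become matrices and $D = \mathrm{Diag}(|w_m|^{-1})$; the function name and the design below are hypothetical.

```python
import numpy as np

def enet_irrepresentable(Sigma, w, active, lam1, lam2):
    """Return the LHS of condition (15) for every inactive coordinate."""
    A = np.asarray(active)
    J = np.setdiff1d(np.arange(Sigma.shape[0]), A)
    D = np.diag(1.0 / np.abs(w[A]))                 # D = Diag(||w_m||^{-1})
    inner = np.linalg.solve(Sigma[np.ix_(A, A)] + lam2 * np.eye(len(A)),
                            (D + 2.0 * lam2 / lam1 * np.eye(len(A))) @ w[A])
    return np.abs(Sigma[np.ix_(J, A)] @ inner)      # must all be < 1

Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])                 # design covariance
w = np.array([1.0, -1.0, 0.0])                      # truth active on {0, 1}
print(enet_irrepresentable(Sigma, w, [0, 1], lam1=0.1, lam2=0.1))  # [0.5]
```

For this design the single inactive coordinate satisfies the condition with margin ($0.5 < 1$); making the inactive coordinate strongly and asymmetrically correlated with the active block, e.g., replacing the off-diagonal entries $0.3, 0.2$ by $0.9, 0.0$, drives the value to $4.5 > 1$, so the condition fails.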
Footnote 5: For random variables $x_n$ and $y$, $x_n \xrightarrow{p} y$ means convergence in probability, i.e., the probability of $|x_n - y| > \epsilon$ goes to 0 for every $\epsilon > 0$ as the number of samples $n$ tends to infinity.

Theorem 5 If $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat{I} \xrightarrow{p} I_0$, then under assumptions (A1–A3, A10) there exist sequences $\lambda_1^{(n)}, \lambda_2^{(n)} \to 0$ such that

$$\limsup_n \Big\| \Sigma_{m,I_0}\big(\Sigma_{I_0,I_0} + \lambda_2^{(n)}\big)^{-1}\Big( D + \frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}} \Big) f^*_{I_0} \Big\|_{\mathcal{H}_m} \le 1 \quad (\forall m \in J = I_0^c). \tag{16}$$

Moreover, such $\lambda_1^{(n)}$ satisfies $\lambda_1^{(n)}\sqrt{n} \to \infty$.

The sufficient condition (15) contains the strict inequality ('<'), while similar conditions for ordinary (non-block) $\ell_1$ regularization or ordinary (non-block) elastic-net regularization contain the weak inequality ('≤'). The strict inequality appears because each block contains multiple variables in group lasso and MKL (Bach, 2008). The condition $\lambda_1^{(n)}\sqrt{n} \to \infty$ is necessary to obtain the RKHS-norm convergence $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$. Roughly speaking, this means that the block-$\ell_1$ regularization term should be stronger than the noise level, to suppress fluctuations caused by noise.

It is worth noting that the conditions (15) and (16) are weaker than the conditions for block-$\ell_1$ MKL presented in Bach (2008); the block-$\ell_1$ MKL irrepresentable conditions are (see Footnote 6)

$$\text{(Sufficient condition)}\quad \big\| \Sigma_{m,m}^{1/2} V_{m,I_0} V_{I_0,I_0}^{-1} D\, g^*_{I_0} \big\|_{\mathcal{H}_m} < 1 \quad (\forall m \in J), \qquad \text{(Necessary condition)}\quad \big\| \Sigma_{m,m}^{1/2} V_{m,I_0} V_{I_0,I_0}^{-1} D\, g^*_{I_0} \big\|_{\mathcal{H}_m} \le 1 \quad (\forall m \in J). \tag{17}$$

This is because the group-$\ell_2$ regularization term eases the singularity of the problem. Examples where elastic-nets successfully estimate the true sparsity pattern while $\ell_1$ regularization fails, in parametric situations, can be found in Jia and Yu (2010).

5 Conclusions

We provided three novel theoretical results on the support consistency and convergence rate of elastic-net MKL. (i) Elastic-net MKL was shown to be support consistent under a milder condition than block-$\ell_1$ MKL. (ii) A tighter convergence rate than existing bounds was derived for the situation where the truth is sparse. (iii) The convergence rates of block-$\ell_1$ MKL, elastic-net MKL, and block-$\ell_2$ MKL when the truth is near sparse were elucidated, and elastic-net MKL was shown to perform better when the decrease rate $\beta$ is not large or the problem is badly conditioned. Based on our theoretical findings, we conclude that the use of elastic-net regularization is recommended for MKL.

Elastic-net MKL can be regarded as 'intermediate' between block-$\ell_1$ MKL and block-$\ell_2$ MKL. Another popular intermediate variant is block-$\ell_p$ MKL for $1 \le p \le 2$ (Kloft et al., 2009, Cortes et al., 2009). Elastic-net MKL and block-$\ell_p$ MKL are conceptually similar, but they have a notable difference: elastic-net MKL with $\lambda_1^{(n)} > 0$ tends to produce sparse solutions, while block-$\ell_p$ MKL with $1 < p \le 2$ always produces dense solutions (i.e., all combination coefficients of the kernels are non-zero). The sparsity of elastic-net MKL would be advantageous when the true kernel combination is sparse, as we proved in this paper. However, when the true kernel combination is non-sparse, the difference/relation between elastic-net MKL and block-$\ell_p$ MKL is not yet clear. This needs to be further investigated in future work.

A Proofs of the theorems

For a function $f$ on $\mathcal{X} \times \mathbb{R}$, we define $P_n f := \frac{1}{n}\sum_{i=1}^n f(x_i, y_i)$ and $P f := \mathrm{E}_{X,Y}[f(X,Y)]$.
For a function $f_I \in \mathcal{H}_I$, we define $\|f_I\|_{\ell_1} := \sum_{m\in I}\|f_m\|_{\mathcal{H}_m}$, and for $f \in \mathcal{H}$ we write $\|f\|_{\ell_1} := \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}$. Similarly, we define $\|f_I\|_{\ell_2}$ by $\|f_I\|^2_{\ell_2} := \sum_{m\in I}\|f_m\|^2_{\mathcal{H}_m}$ for $f_I \in \mathcal{H}_I$, and for $f \in \mathcal{H}$ we write $\|f\|^2_{\ell_2} := \sum_{m=1}^M \|f_m\|^2_{\mathcal{H}_m}$. We write $\max\{a,b\}$ as $a \vee b$.

Footnote 6: Note that in the original paper by Bach (2008), the RHS of (17) is $\sum_{m\in I_0}\|f^*_m\|_{\mathcal{H}_m}$ because the squared group-$\ell_1$ regularizer $(\sum_m \|f_m\|_{\mathcal{H}_m})^2$ was used. We can show that the squared formulation is actually equivalent to the non-squared formulation, in the sense that there exists a one-to-one correspondence between the two formulations.

Lemma 6 For all $I \subseteq \{1,\dots,M\}$, we have

$$\|f\|^2_{L_2(\Pi)} \ge \big(1-\rho(I)^2\big)\kappa(I)\Big( \sum_{m\in I}\|f_m\|^2_{L_2(\Pi)} \Big). \tag{18}$$

Proof: For $J = I^c$, we have

$$P f^2 = \|f_I\|^2_{L_2(\Pi)} + 2\langle f_I, f_J\rangle_{L_2(\Pi)} + \|f_J\|^2_{L_2(\Pi)} \ge \|f_I\|^2_{L_2(\Pi)} - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|^2_{L_2(\Pi)} \ge \big(1-\rho(I)^2\big)\|f_I\|^2_{L_2(\Pi)} \ge \big(1-\rho(I)^2\big)\kappa(I)\Big(\sum_{m\in I}\|f_m\|^2_{L_2(\Pi)}\Big), \tag{19}$$

where we used Schwarz's inequality in the last line. ∎

The following lemma gives an upper bound on $\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}$ that holds with high probability. This is an extension of Theorem 1 of Koltchinskii and Yuan (2008). The proof is given in Appendix B.

Lemma 7 There exists a constant $F$ depending only on $L$ in (A1) such that, if $\lambda_1^{(n)} \ge F\sqrt{\frac{\log(Mn)}{n}}$, then, for $r = \frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}}$, with probability $1-n^{-1}$,

$$\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m} \le M^{\frac{1-r}{2-r}}\Big( 3\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m} + 3\sum_{m=1}^M\|f^*_m\|^2_{\mathcal{H}_m} \Big)^{\frac{1}{2-r}}.$$

Moreover, if $\lambda_2^{(n)} \ge F\sqrt{\frac{\log(Mn)}{n}}$ and $\lambda_2^{(n)} \ge \lambda_1^{(n)}$, we have

$$\sum_{m=1}^M \|\hat{f}_m - f^*_m\|_{\mathcal{H}_m} \le M\Big( 3/2 + 2\max_m\|f^*_m\|_{\mathcal{H}_m} \Big).$$

The following lemma gives a basic inequality that is the starting point of the subsequent analyses. The proof is given in Appendix B.

Lemma 8 Suppose $\lambda_1^{(n)} \vee \lambda_2^{(n)} \ge F\sqrt{\frac{\log(Mn)}{n}}$, where $F$ is the constant appearing in Lemma 7. Then there exist constants $\tilde{K}_1$ and $\tilde{K}_2$, depending only on $L$ in (A1), $R$ in (A4), and $s$ and $C_2$ in (A5), such that for all $I \subseteq \{1,\dots,M\}$ and all $t \ge \log\log(R\sqrt{n}) + \log M$, with probability at least $1-e^{-t}-n^{-1}$,

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}-\hat{\gamma}_n-\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}$$
$$\le \tilde{K}_1\big(1+\|\hat{f}-f^*\|_{\ell_1}\big)\Big( \sum_{m\in I}\Big( \frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\,\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}, \tag{20}$$

where $J = I^c$, $\gamma_n := \tilde{K}_1/\sqrt{n}$, and $\hat{\gamma}_n := \gamma_n(1+\|\hat{f}-f^*\|_\infty)$.

The above lemma is derived by the peeling device (localization method). Details of these techniques can be found in, for example, Bartlett et al. (2005), Koltchinskii (2006), Mendelson (2002), and van de Geer (2000).

Proof: (Theorem 1) Since $\lambda_1^{(n)} \ge F\sqrt{\log(Mn)/n}$, we may assume that inequality (20) holds with $I = I_0$. For notational simplicity, $I$ denotes $I_0$ in this proof. In addition, since $\lambda_1^{(n)} \ge \lambda_2^{(n)}$, we have $\|\hat{f}\|_\infty \le \sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m} \le 3R$ (with probability $1-n^{-1}$) by Lemma 7.
Note that $\|f^*_m\|_{\mathcal{H}_m} = 0$ for all $m \in J = I^c = I_0^c$, and $\hat{\gamma}_n + \tilde{K}_2\sqrt{t/n} \le \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \} = \lambda_1^{(n)}$ by taking $K$ sufficiently large. Therefore, by inequality (20), we have

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\|\hat{f}_I-f^*_I\|^2_{\ell_2} \le K_1\Big( \sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t}{n} \Big) + \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{21}$$

where $K_1$ is $\tilde{K}_1(1+3R)$ (here we omitted the term $\sum_{m\in I} n^{-\frac{1}{1+s}}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}$ for simplicity; one can show that this term is negligible). By Hölder's inequality, the first term on the RHS of the above inequality can be bounded as

$$K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \le K_1\frac{\big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\|\hat{f}_I-f^*_I\|_{\ell_1}\big)^s}{\sqrt{n}} \le \sqrt{d}\,K_1\frac{\big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\big)^{\frac{1-s}{2}}\big(\|\hat{f}_I-f^*_I\|^2_{\ell_2}\big)^{\frac{s}{2}}}{\sqrt{n}}.$$

Applying Young's inequality, the last term can be bounded by

$$K_1\frac{(\lambda_2^{(n)}/2)^{-\frac{s}{2}}\sqrt{d}}{\sqrt{n}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\Big)^{\frac{1-s}{2}} \times (\lambda_2^{(n)}/2)^{\frac{s}{2}}\big(\|\hat{f}_I-f^*_I\|^2_{\ell_2}\big)^{\frac{s}{2}}$$
$$\le C\big( n^{-\frac12}\sqrt{d}\,\lambda_2^{(n)-\frac{s}{2}} \big)^{\frac{2}{2-s}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\Big)^{\frac{1-s}{2-s}} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|^2_{\ell_2}$$
$$\le C\big[(1-\rho^2(I))\kappa(I)\big]^{-1} n^{-1} d\,\lambda_2^{(n)-s} + \frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|^2_{\ell_2}$$
$$\le C n^{-1} d\,\lambda_2^{(n)-s} + \frac18\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|^2_{\ell_2}, \tag{22}$$

where $C$ denotes a constant that is independent of $d$ and $n$ and may change between contexts, and where we used Lemma 6 in the last line. Similarly, by the inequality of arithmetic and geometric means, we obtain the bound

$$\sum_{m\in I}\Big( 2\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + \lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}$$
$$\le C\big[(1-\rho^2(I))\kappa(I)\big]^{-1}\sum_{m\in I}\Big\{ \Big(\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\Big)^2\lambda_1^{(n)2} + \|g^*_m\|^2_{\mathcal{H}_m}\lambda_2^{(n)2} + \frac{t}{n} \Big\} + \frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}$$
$$\le C\big( d\lambda_1^{(n)2} + \lambda_2^{(n)2} + dt/n \big) + \frac18\|\hat{f}-f^*\|^2_{L_2(\Pi)}, \tag{23}$$

where we used Lemma 6 in the last line. Substituting (22) and (23) into (21), we have

$$\frac14\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( d n^{-1}\lambda_2^{(n)-s} + d\lambda_1^{(n)2} + \lambda_2^{(n)2} + \frac{(d+1)t}{n} \Big). \tag{24}$$

The minimum of the RHS with respect to $\lambda_1^{(n)}, \lambda_2^{(n)}$ under the constraint $\lambda_1^{(n)} \ge \lambda_2^{(n)}$ is achieved by $\lambda_1^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K n^{-\frac{1}{2+s}}$ up to constants. Thus we have the first assertion (7).

Next we show the second assertion (8). By Hölder's inequality and Young's inequality, we have

$$K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \le K_1\frac{\big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\|\hat{f}_I-f^*_I\|_{\ell_1}\big)^s}{\sqrt{n}}$$
$$\le C\tilde{\lambda}^{-\frac{s}{1-s}} n^{-\frac{1}{2(1-s)}}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \frac{\tilde{\lambda}}{2}\|\hat{f}_I-f^*_I\|_{\ell_1} \le C d\,\tilde{\lambda}^{-\frac{2s}{1-s}} n^{-\frac{1}{1-s}} + \frac18\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \frac{\tilde{\lambda}}{2}\big(\|\hat{f}_I\|_{\ell_1} + \|f^*_I\|_{\ell_1}\big), \tag{25}$$

where $\tilde{\lambda} > 0$ is an arbitrary positive real. Substituting (25) and (23) into (21), we have

$$\frac14\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( d\,\tilde{\lambda}^{-\frac{2s}{1-s}} n^{-\frac{1}{1-s}} + \tilde{\lambda} + d\lambda_1^{(n)2} + \lambda_2^{(n)2} + \frac{(d+1)t}{n} \Big).$$

This is minimized by $\tilde{\lambda} = C d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}}$, $\lambda_1^{(n)} = \big( \frac{2\tilde{K}_1(1+3R)}{\sqrt{n}} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \big) \vee F\sqrt{\frac{\log(Mn)}{n}} \ge \big( 2\hat{\gamma}_n + \tilde{K}_2\sqrt{\tfrac{t}{n}} \big) \vee F\sqrt{\frac{\log(Mn)}{n}}$, and $\lambda_2^{(n)} \le \lambda_1^{(n)}$.
Thus we obtain the assertion. ∎

Proof: (Theorem 2) Let $I_d := \{1,\dots,d\}$ and $J_d = I_d^c = \{d+1,\dots,M\}$. By assumption (A7), we have $\sum_{m\in J_d}\|f^*_m\|^2_{\mathcal{H}_m} \le \frac{C_3^2}{2\beta-1} d^{1-2\beta}$ and $\sum_{m\in J_d}\|f^*_m\|_{\mathcal{H}_m} \le \frac{C_3}{\beta-1} d^{1-\beta}$. Therefore Lemma 8 gives

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + \lambda_2^{(n)}\|\hat{f}_{J_d}\|^2_{\ell_2}$$
$$\le K_1\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big) + K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I_d}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + C\Big( \lambda_2^{(n)} d^{1-2\beta} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}} \Big) d^{1-\beta} \Big), \tag{26}$$

if $\lambda_1^{(n)} > \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ and $\lambda_1^{(n)} \ge F\sqrt{\log(Mn)/n}$. The second term can be upper bounded as

$$K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$\overset{\text{Hölder}}{\le} K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big\{ \frac{\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)^s}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big\}$$
$$= K_1\frac{\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)^s}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{Jensen}}{\le} K_1\frac{d^{\frac{1-s}{2}}\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\big)^{\frac{1-s}{2}} M^{\frac12}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m}\big)^{\frac12} d^{\frac{s}{2}}\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m}\big)^{\frac{s}{2}}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{Lemma 6}}{\le} K_1\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{2}}\frac{\big(\|\hat{f}-f^*\|^2_{L_2(\Pi)}\big)^{\frac{1-s}{2}} d^{\frac12} M^{\frac12}\|\hat{f}-f^*\|^{1+s}_{\ell_2}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{Young}}{\le} \frac{\|\hat{f}-f^*\|^2_{L_2(\Pi)}}{2} + C\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{1+s}}\frac{d^{\frac{1}{1+s}} M^{\frac{1}{1+s}}\|\hat{f}-f^*\|^2_{\ell_2}}{n^{\frac{1}{1+s}}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{(A8)}}{\le} \frac{\|\hat{f}-f^*\|^2_{L_2(\Pi)}}{2} + C\frac{d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|^2_{\ell_2} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}.$$

We will see that we may assume $C d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}} n^{-\frac{1}{1+s}} \le \frac{\lambda_2^{(n)}}{4}$. Thus the second term on the RHS of the above inequality can be upper bounded as

$$C\frac{d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|^2_{\ell_2} \le \frac{\lambda_2^{(n)}}{4}\|\hat{f}-f^*\|^2_{\ell_2} \le \frac{\lambda_2^{(n)}}{4}\big( \|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + 2\|\hat{f}_{J_d}\|^2_{\ell_2} + 2\|f^*_{J_d}\|^2_{\ell_2} \big) \le \frac{\lambda_2^{(n)}}{2}\big( \|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + \|\hat{f}_{J_d}\|^2_{\ell_2} + \|f^*_{J_d}\|^2_{\ell_2} \big). \tag{27}$$

Moreover, Lemma 7 gives $\frac{\|\hat{f}-f^*\|_{\ell_1}}{n} \le \frac{C\sqrt{R}\,M}{n} \le C\lambda_2^{(n)2}$ and $\frac{\|\hat{f}-f^*\|^2_{\ell_1}}{n} \le \frac{C R M^2}{n} \le C_R\,\lambda_2^{(n)2}$. Therefore (26) becomes

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_{J_d}\|^2_{\ell_2} \le C\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + t\lambda_2^{(n)2} \Big) + \sum_{m\in I_d}\Big( C_1\lambda_1^{(n)} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + C\Big( \lambda_2^{(n)} d^{1-2\beta} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}} \Big) d^{1-\beta} \Big).$$

As in the proof of Theorem 1 (using the relations (23) and (22)), we have

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big\{ \big[(1-\rho^2(I_d))\kappa(I_d)\big]^{-1}\Big( d n^{-1}\lambda_2^{(n)-s} + d\lambda_1^{(n)2} + \lambda_2^{(n)2} + \frac{t}{n} \Big) + \lambda_2^{(n)} d^{1-2\beta} + \big( \lambda_1^{(n)}+\hat{\gamma}_n+(t/n)^{\frac12} \big) d^{1-\beta} + t\lambda_2^{(n)2} \Big\}.$$

Now, using the assumption $(1-\rho^2(I_d))\kappa(I_d) \ge C_4 d^{-b}$, we have

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big[ d^{1+b} n^{-1}\lambda_2^{(n)-s} + d^{1+b}\lambda_1^{(n)2} + d^b\lambda_2^{(n)2} + \lambda_2^{(n)} d^{1-2\beta} + \big(\lambda_1^{(n)}+\hat{\gamma}_n\big) d^{1-\beta} + t\lambda_2^{(n)2} + d^{1-\beta}\sqrt{\tfrac{t}{n}} + \frac{d^{1+b} t}{n} \Big]. \tag{28}$$
Recall that $\hat{\gamma}_n = \tilde{K}_1(1+\|\hat{f}-f^*\|_\infty)/\sqrt{n}$. Since $\lambda_1^{(n)} \ge F\sqrt{\log(Mn)/n}$, Lemma 7 gives $\|\hat{f}-f^*\|_\infty \le \sqrt{M}\,3R + R \le c\sqrt{M}$ with probability $1-n^{-1}$ for some constant $c > 0$. Therefore $\hat{\gamma}_n \le c\sqrt{M/n}$. The values of $\lambda_1^{(n)}, \lambda_2^{(n)}$ presented in the statement are obtained by minimizing the RHS of Eq. (28) under the constraints $\lambda_1^{(n)} \ge c\sqrt{M/n} + \tilde{K}_2\sqrt{t/n} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ and $C d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}} n^{-\frac{1}{1+s}} \le \frac{\lambda_2^{(n)}}{4}$.

i) Suppose $n^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}} > c\sqrt{M/n}$, i.e., $\tau \le \tau_2$. Then the RHS of the above inequality is minimized by $d = n^{\frac{1}{(2\beta+b)(2+s)-1-s}}$, $\lambda_2^{(n)} = K n^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}$, and $\lambda_1^{(n)} = \max\{ K n^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ up to constants independent of $n$, where the leading terms are $d^{1+b} n^{-1}\lambda_2^{(n)-s} + d^b\lambda_2^{(n)2} + \lambda_2^{(n)} d^{1-2\beta} + \lambda_1^{(n)} d^{1-\beta}$. It should be noted that $\lambda_1^{(n)}$ is greater than $\hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ because $n^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}} > c\sqrt{M/n} \ge \hat{\gamma}_n$; therefore (26) is valid. Using $\tau \le \tau_2$, we can show that $C d^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} \le \lambda_2^{(n)}/4$ by taking the constant $K$ sufficiently large; hence (27) is valid. Moreover, since $M > n^{\frac{1}{(2\beta+b)(2+s)-1-s}} = n^{\tau_1}$, we can take $d = n^{\frac{1}{(2\beta+b)(2+s)-1-s}} \le M$.

ii) Suppose $\tau_2 \le \tau \le \tau_3$. Then the RHS of the above inequality is minimized by $d = (M^{2+s} n^{2-s})^{\frac{1}{2\{(2+s)(b+\beta)-s\}}}$, $\lambda_2^{(n)} = K\big(M n^{-\{2(b+\beta)-1\}}\big)^{\frac{1}{2\{(2+s)(b+\beta-1)+2\}}}$, and $\lambda_1^{(n)} = \max\{ c\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ up to constants independent of $n$, where the leading terms are $d^{1+b} n^{-1}\lambda_2^{(n)-s} + d^b\lambda_2^{(n)2} + \lambda_1^{(n)} d^{1-\beta}$. Since $\lambda_1^{(n)} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$, (26) is valid. Using $\tau \le \tau_3$, we can show that $C d^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} \le \lambda_2^{(n)}/4$ by taking the constant $K$ sufficiently large; hence (27) is valid. Moreover, since $\beta \le \frac{s(b-1)}{2(1-s)}$ and $\tau_2 \le \tau$, we can show that $d \le M$.

iii) Suppose $\tau_3 \le \tau \le \tau_4$. We take $\lambda_1^{(n)} = \max\{ c\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$. Then the RHS of inequality (28) is minimized by $\lambda_2^{(n)} = K\sqrt{d}\,\lambda_1^{(n)} \sim \sqrt{dM/n}$ and $d = (n/M)^{\frac{1}{2(b+\beta)}}$ up to constants, where the leading terms are $d^b\lambda_2^{(n)2} + d^{1+b}\lambda_1^{(n)2} + \lambda_1^{(n)} d^{1-\beta}$. Note that since $\lambda_1^{(n)} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$, (26) is valid. Using $\tau \le \tau_4$, we can show that $C d^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} \le \lambda_2^{(n)}/4$ by taking the constant $K$ sufficiently large; hence (27) is valid. Moreover, since $\beta \le \frac{s(b-1)}{2(1-s)}$ and $n^{\tau_3} \le M$, we have $d = (n/M)^{\frac{1}{2(b+\beta)}} \le M$.

In all settings i) to iii), we can show that $\frac{d^{1-\beta}}{\sqrt{n}} \gtrsim \frac{d^{1+b}}{n}$. Thus the terms involving $t$ are upper bounded as

$$d^{1-\beta}\sqrt{\tfrac{t}{n}} + \frac{d^{1+b} t}{n} + t\lambda_2^{(n)2} \lesssim \Big( \frac{d^{1-\beta}}{\sqrt{n}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t).$$

Through a simple calculation, $\frac{d^{1-\beta}}{\sqrt{n}}$ is evaluated as i) $\frac{d^{1-\beta}}{\sqrt{n}} \simeq n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}}$, ii) $\frac{d^{1-\beta}}{\sqrt{n}} \simeq \big( M^{(2+s)(1-\beta)} n^{-(4\beta+2b+sb-2)} \big)^{\frac{1}{2\{(\beta+b)(2+s)-s\}}}$, and iii) $\frac{d^{1-\beta}}{\sqrt{n}} \simeq \big( M^{\beta-1} n^{1-2\beta-b} \big)^{\frac{1}{2(\beta+b)}}$, respectively. Thus we obtain the assertion. ∎

Proof: (Theorem 3) (Convergence rate of block-$\ell_1$ MKL) Note that since $\lambda_1^{(n)} > \lambda_2^{(n)} = 0$, we have $\frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}} = 1$.
Therefore Lemma 7 gives $\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m} \le 3R$ with probability $1-n^{-1}$. Thus $\hat{\gamma}_n = \gamma_n(1+\|\hat{f}-f^*\|_\infty) \le \gamma_n\big(1+\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m}+\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m}\big) \le \gamma_n(1+4R)$. When $\lambda_2^{(n)} = 0$ and $\lambda_1^{(n)} > (1+4R)\gamma_n + \tilde{K}_2\sqrt{t/n}$, as in Lemma 8 we have, with probability at least $1-e^{-t}-n^{-1}$,

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_1^{(n)}\sum_{m\in I}\|\hat{f}_m\|_{\mathcal{H}_m} \le K_1\Big( \sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t}{n} \Big) + \lambda_1^{(n)}\sum_{m\in I}\|f^*_m\|_{\mathcal{H}_m} + 2\lambda_1^{(n)}\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sum_{m\in I}\sqrt{\tfrac{t}{n}}\|f^*_m-\hat{f}_m\|_{L_2(\Pi)}, \tag{29}$$

for all $t \ge \log\log(R\sqrt{n}) + \log M$. We lower bound the term $\lambda_1^{(n)}\sum_{m\in I}\big(\|\hat{f}_m\|_{\mathcal{H}_m}-\|f^*_m\|_{\mathcal{H}_m}\big)$ appearing on the LHS of the above inequality. There exists $c_1 > 0$, depending only on $R$, such that

$$\|f_m\|_{\mathcal{H}_m} = \sqrt{\|f_m-f^*_m\|^2_{\mathcal{H}_m} + 2\langle f_m-f^*_m, f^*_m\rangle_{\mathcal{H}_m} + \|f^*_m\|^2_{\mathcal{H}_m}} \ge c_1\|f_m-f^*_m\|^2_{\mathcal{H}_m} - 2\|f^*_m\|^{-1}_{\mathcal{H}_m}\big|\langle f_m-f^*_m, f^*_m\rangle_{\mathcal{H}_m}\big| + \|f^*_m\|_{\mathcal{H}_m} \tag{30}$$

for all $f_m \in \mathcal{H}_m$ such that $\|f_m\|_{\mathcal{H}_m} \le 3R$ and $m \in I_0$. Recall that $f^*_m = T_m^{1/2} g^*_m$; then we have

$$\|f_m\|_{\mathcal{H}_m} \ge c_1\|f_m-f^*_m\|^2_{\mathcal{H}_m} - 2\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|f_m-f^*_m\|_{L_2(\Pi)} + \|f^*_m\|_{\mathcal{H}_m}.$$

Since $\max_m\|\hat{f}_m\|_{\mathcal{H}_m} \le 3R$ holds with probability $1-n^{-1}$,

$$\|\hat{f}_m\|_{\mathcal{H}_m} \ge c_1\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} - 2\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \|f^*_m\|_{\mathcal{H}_m},$$

with probability $1-n^{-1}$. Therefore, by inequality (29), we have, with probability at least $1-e^{-t}-n^{-1}$,

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_1^{(n)}\sum_{m\in I}\Big( c_1\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} - 2\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \|f^*_m\|_{\mathcal{H}_m} \Big)$$
$$\le K_1\Big( \sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t}{n} \Big) + \lambda_1^{(n)}\sum_{m\in I}\|f^*_m\|_{\mathcal{H}_m} + 2\lambda_1^{(n)}\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sum_{m\in I}\sqrt{\tfrac{t}{n}}\|f^*_m-\hat{f}_m\|_{L_2(\Pi)}, \tag{31}$$

for all $t \ge \log\log(R\sqrt{n}) + \log M$. Thus, using Young's inequality,

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( d^{1+b} n^{-1}\lambda_1^{(n)-s} + d^{1+b}\lambda_1^{(n)2} + 2\lambda_1^{(n)} d^{1-\beta} + \frac{t(1+d^{1+b})}{n} \Big).$$

The RHS is minimized by $d = n^{\frac{1}{(2+s)(\beta+b)}}$ and $\lambda_1^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ (up to constants independent of $n$). Note that since the optimal $\lambda_1^{(n)}$ obtained above satisfies $\lambda_1^{(n)} > (1+4R)\gamma_n + \tilde{K}_2\sqrt{t/n}$ by taking $K$ sufficiently large, inequality (31) is valid. Moreover, the condition $M > n^{\tau_5} = n^{\frac{b+1}{(\beta+b)\{b(2+s)+2\}}}$ in the statement ensures $d < M$. Finally, we evaluate the terms including $t$, that is, $\frac{t}{n} d^{1+b} + \sqrt{\frac{t}{n}}\, d^{1-\beta}$. We can check that $\frac{1}{n} d^{1+b} \lesssim \sqrt{\frac{1}{n}}\, d^{1-\beta}$. Therefore those terms are upper bounded as

$$\frac{t}{n} d^{1+b} + \sqrt{\tfrac{t}{n}}\, d^{1-\beta} \lesssim \sqrt{\tfrac{1}{n}}\, d^{1-\beta}(\sqrt{t}+t) \simeq n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t).$$

Thus we obtain the assertion.

(Convergence rate of block-$\ell_2$ MKL) When $\lambda_1^{(n)} = 0$, substituting $I = \{1,\dots,M\}$ in Lemma 8 and using Young's inequality as in the proof of Theorem 2, the convergence rate of block-$\ell_2$ MKL can be evaluated as

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( M^{1+b} n^{-1}\lambda_2^{(n)-s} + M^b\lambda_2^{(n)2} + t\lambda_2^{(n)2} + \frac{t}{n} M^{1+b} \Big), \tag{32}$$

with probability $1-e^{-t}-n^{-1}$ (note that since $I = \{1,\dots,M\}$, i.e., $I^c = \emptyset$, we do not need the condition $\lambda_1^{(n)} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$). Then $\lambda_2^{(n)} = K(M/n)^{\frac{1}{2+s}} \vee F\sqrt{\log(Mn)/n}$ gives the minimum of the RHS with respect to $\lambda_2^{(n)}$ up to constants.
Using $\tau \le \tau_6$, we can show that $M^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} = M^{\frac{b(1-s)+2}{1+s}} n^{-\frac{1}{1+s}} \lesssim \lambda_2^{(n)}$ by taking the constant $K$ sufficiently large; hence (27) is valid. ∎

B Proof of Lemmas 7 and 8

Proof: (Lemma 7) Since $\hat{f}$ minimizes the empirical risk (1), we have

$$\frac{1}{n}\sum_{i=1}^n\Big( \sum_{m=1}^M\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) \Big)^2 + \lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \le \frac{2}{n}\sum_{m=1}^M\sum_{i=1}^n \epsilon_i\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) + \lambda_1^{(n)}\|f^*\|_{\ell_1} + \lambda_2^{(n)}\|f^*\|^2_{\ell_2}. \tag{33}$$

By Proposition 1 (Bernstein's inequality in Hilbert spaces; see also Theorem 6.14 of Steinwart (2008), for example), there exists a universal constant $C$ such that

$$\frac{1}{n}\sum_{i=1}^n\epsilon_i\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) \le \Big\| \frac{1}{n}\sum_{i=1}^n\epsilon_i k_m(x_i,\cdot) \Big\|_{\mathcal{H}_m}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m} \le CL\sqrt{\frac{\log(Mn)}{n}}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m} \le CL\sqrt{\frac{\log(Mn)}{n}}\big(\|\hat{f}_m\|_{\mathcal{H}_m}+\|f^*_m\|_{\mathcal{H}_m}\big) \tag{34}$$

for all $m$, with probability at least $1-n^{-1}$, where we used the assumption $\frac{\log(Mn)}{n} \le 1$. If $\lambda_1^{(n)} \ge 4CL\sqrt{\frac{\log(Mn)}{n}}$, then we have

$$\lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \le 3\big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\big(\|f^*\|_{\ell_1}+\|f^*\|^2_{\ell_2}\big), \tag{35}$$

with probability at least $1-n^{-1}$. Set $r = \frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}}$; then, by Young's inequality and Jensen's inequality, the LHS of the above inequality (33) is lower bounded by

$$\lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \ge \big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\Big( \sum_{m=1}^M\|\hat{f}_m\|^{2-r}_{\mathcal{H}_m} \Big) = M\big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\Big( \frac{1}{M}\sum_{m=1}^M\|\hat{f}_m\|^{2-r}_{\mathcal{H}_m} \Big) \ge M^{r-1}\big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\|\hat{f}\|^{2-r}_{\ell_1}. \tag{36}$$

Therefore we have the first assertion by setting $F = 4CL$. The second assertion can be shown as follows: by inequality (33) we have

$$M^{-1}\lambda_2^{(n)}\big( \|\hat{f}-f^*\|_{\ell_1} \big)^2 \le \lambda_2^{(n)}\|\hat{f}-f^*\|^2_{\ell_2} \le \frac{2}{n}\sum_{m=1}^M\sum_{i=1}^n\epsilon_i\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) + \lambda_1^{(n)}\|\hat{f}-f^*\|_{\ell_1} + 2\lambda_2^{(n)}\sum_{m=1}^M\langle f^*_m, f^*_m-\hat{f}_m\rangle_{\mathcal{H}_m} \le \lambda_2^{(n)}\Big( \frac32 + 2\max_m\|f^*_m\|_{\mathcal{H}_m} \Big)\|\hat{f}-f^*\|_{\ell_1} \tag{37}$$

with probability at least $1-n^{-1}$, where we used (34), $\lambda_2^{(n)} \ge 4CL\sqrt{\log(Mn)/n}$, and $\lambda_2^{(n)} \ge \lambda_1^{(n)}$ in the last inequality. ∎

Proof: (Lemma 8) In what follows, we assume $\|\hat{f}-f^*\|_{\ell_1} \le \bar{R}$, where $\bar{R} = 4MR$ (the probability of this event is greater than $1-n^{-1}$ by Lemma 7). Since $\hat{f}$ minimizes the empirical risk, we have

$$P_n(\hat{f}-Y)^2 + \lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \le P_n(f^*-Y)^2 + \lambda_1^{(n)}\|f^*\|_{\ell_1} + \lambda_2^{(n)}\|f^*\|^2_{\ell_2}$$
$$\Rightarrow\quad P(\hat{f}-f^*)^2 + \lambda_1^{(n)}\|\hat{f}_J\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}_J\|^2_{\ell_2} \le (P-P_n)\big((f^*-\hat{f})^2 + 2(\hat{f}-f^*)\epsilon\big) + \lambda_1^{(n)}\big(\|f^*_I\|_{\ell_1}-\|\hat{f}_I\|_{\ell_1}\big) + \lambda_2^{(n)}\big(\|f^*_I\|^2_{\ell_2}-\|\hat{f}_I\|^2_{\ell_2}\big) + \lambda_1^{(n)}\|f^*_J\|_{\ell_1} + \lambda_2^{(n)}\|f^*_J\|^2_{\ell_2}. \tag{38}$$

The second term on the RHS of the above inequality (38) can be bounded from above as

$$\|f^*_I\|_{\ell_1} - \|\hat{f}_I\|_{\ell_1} \le \sum_{m\in I}\big\langle \nabla\|f^*_m\|_{\mathcal{H}_m},\, \hat{f}_m-f^*_m \big\rangle_{\mathcal{H}_m} = \sum_{m\in I}\frac{\big\langle g^*_m, T_m^{1/2}(\hat{f}_m-f^*_m) \big\rangle_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} \le \sum_{m\in I}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{39}$$

where we used $f^*_m = T_m^{1/2} g^*_m$ for $m \in I \subseteq I_0$. We also have

$$\lambda_2^{(n)}\big(\|f^*_I\|^2_{\ell_2}-\|\hat{f}_I\|^2_{\ell_2}\big) = \lambda_2^{(n)}\Big( \sum_{m\in I}2\langle f^*_m, f^*_m-\hat{f}_m\rangle_{\mathcal{H}_m} - \|\hat{f}_I-f^*_I\|^2_{\ell_2} \Big) \le \lambda_2^{(n)}\Big( \sum_{m\in I}2\|g^*_m\|_{\mathcal{H}_m}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} - \|\hat{f}_I-f^*_I\|^2_{\ell_2} \Big). \tag{40}$$
Substituting (39) and (40) into (38), we obtain

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\|\hat{f}_I-f^*_I\|^2_{\ell_2} + \lambda_1^{(n)}\|\hat{f}_J\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}_J\|^2_{\ell_2} \le (P-P_n)\big((f^*-\hat{f})^2+2(\hat{f}-f^*)\epsilon\big) + \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_1^{(n)}\|f^*_J\|_{\ell_1} + \lambda_2^{(n)}\|f^*_J\|^2_{\ell_2}. \tag{41}$$

Finally, we evaluate the first term $(P-P_n)((f^*-\hat{f})^2+2(\hat{f}-f^*)\epsilon)$ on the RHS of the above inequality (41) by applying Talagrand's concentration inequality (Talagrand, 1996a,b, Bousquet, 2002). First we decompose it as

$$(P-P_n)\big((f^*-\hat{f})^2+2(\hat{f}-f^*)\epsilon\big) = \sum_{m=1}^M (P-P_n)\big((f^*-\hat{f})(f^*_m-\hat{f}_m) + 2(\hat{f}_m-f^*_m)\epsilon\big),$$

and bound each term $(P-P_n)\big((f^*-\hat{f})(f^*_m-\hat{f}_m)+2(\hat{f}_m-f^*_m)\epsilon\big)$ in the summation. Suppose $f \in \mathcal{H}$ satisfies $\|f\|_\infty \le \|f\|_{\ell_1} \le \hat{R}$ for a constant $\hat{R}\ (\le \bar{R})$. Since $|\epsilon| \le L$, we have

$$|f f_m + 2f_m\epsilon| \le 2(L+\hat{R})|f_m| \le 2(L+\hat{R})\|f_m\|_{\mathcal{H}_m}, \tag{42a}$$
$$\sqrt{P(f f_m + 2f_m\epsilon)^2} = \sqrt{P(f^2 f_m^2) + 4P(f_m^2\epsilon^2)} \le \sqrt{\|f\|^2_{L_2(\Pi)}\|f_m\|^2_{L_2(\Pi)} + 4L^2\|f_m\|^2_{L_2(\Pi)}} \le \|f\|_{L_2(\Pi)}\|f_m\|_{L_2(\Pi)} + 2L\|f_m\|_{L_2(\Pi)}, \tag{42b}$$

for all $f \in \mathcal{H}$. Let $Q_n f := \frac{1}{n}\sum_{i=1}^n\varepsilon_i f(x_i,y_i)$, where $\{\varepsilon_i\}_{i=1}^n \in \{\pm1\}^n$ are Rademacher random variables, and let

$$\Psi_m(\xi_m,\sigma_m) := \mathrm{E}\big[\sup\{ Q_n(|f_m|) \mid f_m \in \mathcal{H}_m,\ \|f_m\|_{\mathcal{H}_m}\le\xi_m,\ \|f_m\|_{L_2(\Pi)}\le\sigma_m \}\big].$$

Then one can show, by the spectral assumption (A5) (equivalently, the covering-number condition), that

$$\Psi_m(\xi_m,\sigma_m) \le K_s\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee n^{-\frac{1}{1+s}}\xi_m \Big),$$

where $K_s$ is a constant that depends on $s$ and $C_2$ (Mendelson, 2002). Let $\Xi_m(\xi_m,\sigma_m) := \{ f_m \in \mathcal{H}_m \mid \|f_m\|_{\mathcal{H}_m}\le\xi_m,\ \|f_m\|_{L_2(\Pi)}\le\sigma_m \}$. Now, by the Rademacher contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12), for given $\{\xi_m,\sigma_m\}_{m\in I}$ and $\hat{R}$ we have

$$\mathrm{E}\big[\sup\{ Q_n(f f_m + 2f_m\epsilon) \mid f \in \mathcal{H}\ \text{s.t.}\ f_m \in \Xi_m(\xi_m,\sigma_m),\ \|f\|_{\ell_1}\le\hat{R} \}\big] \le 2(L+\hat{R})\Psi_m(\xi_m,\sigma_m) \le 2K_s(L+\hat{R})\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee n^{-\frac{1}{1+s}}\xi_m \Big). \tag{43}$$

Therefore, by the symmetrization argument (van der Vaart and Wellner, 1996), we have

$$\mathrm{E}\big[\sup\{ (P_n-P)(f f_m + 2f_m\epsilon) \mid f \in \mathcal{H}\ \text{s.t.}\ f_m \in \Xi_m(\xi_m,\sigma_m),\ \|f\|_{\ell_1}\le\hat{R} \}\big] \le 4K_s(L+\hat{R})\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee n^{-\frac{1}{1+s}}\xi_m \Big). \tag{44}$$

By Talagrand's concentration inequality with (42) and (44), for given $\hat{R}, \bar{\sigma}, \xi_m, \sigma_m$, with probability at least $1-e^{-t}$ ($t>0$) we have

$$\sup_{f\in\mathcal{H}:\ \|f\|_{L_2(\Pi)}\le\bar{\sigma},\ \|f\|_\infty\le\hat{R},\ f_m\in\Xi_m(\xi_m,\sigma_m)} (P_n-P)(f f_m + 2f_m\epsilon) \le \sqrt{2}\Big( 4K_s(L+\hat{R})\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee \frac{\xi_m}{n^{\frac{1}{1+s}}} \Big) + \sqrt{\tfrac{t}{n}}\big( \bar{\sigma}\sigma_m + 2L\sigma_m \big) + \frac{2(L+\hat{R})\xi_m t}{n} \Big), \tag{45}$$

where we used relation (42). Our next goal is to derive a uniform version of the above inequality over

$$\frac{1}{\sqrt{n}} \le \hat{R} \le \bar{R},\qquad \frac{1}{\sqrt{n}} \le \bar{\sigma} \le \bar{R},\qquad \frac{1}{\sqrt{n}M} \le \xi_m \le \bar{R},\qquad \frac{1}{\sqrt{n}M} \le \sigma_m \le \bar{R}.$$

By considering a grid $\{\hat{R}^{(k_1)}, \bar{\sigma}^{(k_2)}, \xi_m^{(k_3)}, \sigma_m^{(k_4)}\}_{k_i=0\ (i=1,\dots,4)}^{\log_2(M\bar{R}\sqrt{n})}$ such that $\hat{R}^{(k)} := \bar{R}2^{-k}$, $\bar{\sigma}^{(k)} := \bar{R}2^{-k}$, $\xi_m^{(k)} := \bar{R}2^{-k}$, and $\sigma_m^{(k)} := \bar{R}2^{-k}$, we have, with probability at least $1-(\log(M\bar{R}\sqrt{n}))^4 e^{-t} \ge 1-(\log(4RM^2\sqrt{n}))^4 e^{-t}$,

$$(P_n-P)(f f_m + 2f_m\epsilon) \le K(1+\|f\|_{\ell_1})\Big( \frac{\|f_m\|^{1-s}_{L_2(\Pi)}\|f_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} + \frac{t\|f_m\|_{\mathcal{H}_m}}{n} \Big) + \sqrt{\tfrac{2t}{n}}\big( \|f\|_{L_2(\Pi)}\|f_m\|_{L_2(\Pi)} + 2L\|f_m\|_{L_2(\Pi)} \big),$$

for all $f \in \mathcal{H}$ such that $\|f_m\|_{\mathcal{H}_m} \le \bar{R}$ and $\|f\|_{\ell_1} \le \bar{R}$, and for all $t > 1$, where $K = 4(4K_sL \vee 4K_s \vee 2L \vee 2)$. Summing this bound over $m = 1,\dots,M$, we obtain

$$(P_n-P)(f^2 + 2f\epsilon) \le K(1+\|f\|_{\ell_1})\Big( \sum_{m=1}^M\Big( \frac{\|f_m\|^{1-s}_{L_2(\Pi)}\|f_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\|f\|_{\ell_1}}{n} \Big) + \sqrt{\tfrac{2t}{n}}\Big( \|f\|_{L_2(\Pi)}\sum_{m=1}^M\|f_m\|_{L_2(\Pi)} + 2L\sum_{m=1}^M\|f_m\|_{L_2(\Pi)} \Big),$$

uniformly for all $f \in \mathcal{H}$ such that $\|f_m\|_{\mathcal{H}_m} \le \bar{R}\ (\forall m)$ and $\|f\|_{\ell_1} \le \bar{R}$, with probability at least $1-M(\log(4RM^2\sqrt{n}))^4 e^{-t}$. Now set $\gamma_n = \frac{K}{\sqrt{n}}$, and note that

$$\sqrt{\tfrac{2t}{n}}\,\|f\|_{L_2(\Pi)}\sum_{m=1}^M\|f_m\|_{L_2(\Pi)} \le \frac12\|f\|^2_{L_2(\Pi)} + \frac{t}{n}\Big(\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}\Big)^2 \le \frac12\|f\|^2_{L_2(\Pi)} + \frac{t}{n}\|f\|^2_{\ell_1};$$

then we have

$$(P_n-P)(f^2+2f\epsilon) \le K(1+\|f\|_{\ell_1})\Big[ \sum_{m\in I}\Big( \frac{\|f_m\|^{1-s}_{L_2(\Pi)}\|f_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{2t\|f\|_{\ell_1}}{n} \Big] + \gamma_n(1+\|f\|_{\ell_1})\|f_J\|_{\ell_1} + \frac12\|f\|^2_{L_2(\Pi)} + 2\sqrt{2}L\sqrt{\tfrac{t}{n}}\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}, \tag{46}$$

for all $f \in \mathcal{H}$ such that $\|f_m\|_{\mathcal{H}_m} \le \bar{R}\ (\forall m)$ and $\|f\|_{\ell_1} \le \bar{R}$, with probability at least $1-M(\log(4RM^2\sqrt{n}))^4 e^{-t}$. We now replace $t$ with $t + 5\log M + 4\log\log(R\sqrt{n})$; then the probability $1-M(\log(4R\sqrt{n}M^2))^4 e^{-t}$ can be replaced with $1-e^{-t}$, and we have $t + 5\log M + 4\log\log(R\sqrt{n}) \le 6t$ for all $t \ge \log M + \log\log(R\sqrt{n})$. On the event where $\|\hat{f}-f^*\|_{\ell_1} \le \bar{R}$ holds, substituting $\hat{f}-f^*$ for $f$ in (46) and adjusting $K$ appropriately, (41) yields

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|^2_{\mathcal{H}_m} + \big(\lambda_1^{(n)}-\hat{\gamma}_n\big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}$$
$$\le \tilde{K}_1(1+\|\hat{f}-f^*\|_{\ell_1})\Big( \sum_{m\in I}\Big( \frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|^2_{\mathcal{H}_m} + \big(\lambda_1^{(n)}+\hat{\gamma}_n\big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}}\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{47}$$

where $\tilde{K}_1$ and $\tilde{K}_2$ are constants and $\hat{\gamma}_n = \gamma_n(1+\|\hat{f}-f^*\|_{\ell_1})$. Finally, since

$$\tilde{K}_2\sqrt{\tfrac{t}{n}}\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} = \tilde{K}_2\sqrt{\tfrac{t}{n}}\Big( \sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \sum_{m\in J}\|\hat{f}_m\|_{L_2(\Pi)} + \sum_{m\in J}\|f^*_m\|_{L_2(\Pi)} \Big) \le \tilde{K}_2\sqrt{\tfrac{t}{n}}\Big( \sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m} + \sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} \Big),$$

(47) becomes

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}-\hat{\gamma}_n-\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}$$
$$\le \tilde{K}_1(1+\|\hat{f}-f^*\|_{\ell_1})\Big( \sum_{m\in I}\Big( \frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}, \tag{48}$$

which yields the assertion. ∎

C Proof of Theorems 4 and 5

We write the operator norm of $S_{I,J}: \mathcal{H}_J \to \mathcal{H}_I$ as $\|S_{I,J}\|_{\mathcal{H}_I,\mathcal{H}_J} := \sup_{g_J\in\mathcal{H}_J,\, g_J\ne0}\frac{\|S_{I,J} g_J\|_{\mathcal{H}_I}}{\|g_J\|_{\mathcal{H}_J}}$.

Definition 9 For all $1 \le m, m' \le M$, we define the empirical (non-centered) cross covariance operator $\hat{\Sigma}_{m,m'}$ as follows:

$$\langle f_m, \hat{\Sigma}_{m,m'}\, g_{m'} \rangle_{\mathcal{H}_m} := \frac{1}{n}\sum_{i=1}^n f_m(x_i)\, g_{m'}(x_i), \tag{49}$$

where $f_m \in \mathcal{H}_m$, $g_{m'} \in \mathcal{H}_{m'}$.
Analogously to the joint covariance operator $\Sigma$, we define the joint empirical cross-covariance operator $\hat\Sigma:H\to H$ by $(\hat\Sigma h)_m=\sum_{l=1}^M\hat\Sigma_{m,l}h_l$. We denote by $\hat\Sigma_{m,\epsilon}$ the element of $H_m$ such that
\[
\langle f_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}:=\frac1n\sum_{i=1}^n\epsilon_if_m(x_i).
\]
Let $\bar R$ be a constant such that $4\big(\sum_{m=1}^M\|f^*_m\|_{H_m}+\sum_{m=1}^M\|f^*_m\|_{H_m}\big)\le\bar R$. We denote by $F_n$ the objective function of elastic-net MKL:
\[
F_n(f):=\frac1n\sum_{i=1}^n(f(x_i)-y_i)^2+\lambda_1^{(n)}\sum_{m=1}^M\|f_m\|_{H_m}+\lambda_2^{(n)}\sum_{m=1}^M\|f_m\|_{H_m}^2.
\]

Proof: (Theorem 4) Let $\tilde f\in\bigoplus_{m\in I_0}H_m$ be the minimizer of $\tilde F_n$: $\tilde f:=\arg\min_{f\in H_{I_0}}\tilde F_n(f)$, where
\[
\tilde F_n(f):=\frac1n\sum_{i=1}^n(f(x_i)-y_i)^2+\lambda_1^{(n)}\sum_{m\in I_0}\|f_m\|_{H_m}+\lambda_2^{(n)}\sum_{m\in I_0}\|f_m\|_{H_m}^2.
\]

(Step 1) We first show that $\tilde f\xrightarrow{\,p\,}f^*$ with respect to the RKHS norm. Since $\lambda_1^{(n)}\sqrt n\to\infty$, as in the proof of Lemma 7, the probability of $\sum_{m=1}^M\|\hat f_m-f^*_m\|_{H_m}\le\sqrt M\bar R$ goes to 1 (this can be checked as follows: replacing $\sqrt{\log(Mn)/n}$ in Eq. (34) with $\log(M)\lambda_1^{(n)}$, we see that Eq. (34) holds with probability $1-\exp(-\lambda_1^{(n)2}n)$). There exists $c_1$, depending only on $\sqrt M\bar R$, such that
\[
\|f_m\|_{H_m}=\sqrt{\|f_m-f^*_m\|_{H_m}^2+2\langle f_m-f^*_m,f^*_m\rangle_{H_m}+\|f^*_m\|_{H_m}^2}\ge c_1\|f_m-f^*_m\|_{H_m}^2-2\|f^*_m\|_{H_m}^{-1}|\langle f_m-f^*_m,f^*_m\rangle_{H_m}|+\|f^*_m\|_{H_m}\quad(50)
\]
for all $m\in I_0$ and all $f_m\in H_m$ such that $\|f_m\|_{H_m}\le\sqrt M\bar R$. Since $\tilde f$ minimizes $\tilde F_n$, if $\sum_{m=1}^M\|\tilde f_m-f^*_m\|_{H_m}\le\sqrt M\bar R$ (an event whose probability goes to 1), we have
\[
\langle\tilde f_{I_0}-f^*_{I_0},\hat\Sigma_{I_0,I_0}(\tilde f_{I_0}-f^*_{I_0})\rangle_{H_{I_0}}+c_1\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le2\langle\hat\Sigma_{I_0,\epsilon},\tilde f-f^*\rangle_{H_{I_0}}+2\sum_{m\in I_0}\Big(\frac{\lambda_1^{(n)}}{\|f^*_m\|_{H_m}}+\lambda_2^{(n)}\Big)|\langle\tilde f_m-f^*_m,f^*_m\rangle_{H_m}|,\quad(51)
\]
where we used relation (50). By the assumption $f^*_m=\Sigma_{m,m}^{1/2}g^*_m$, we have $|\langle\tilde f_m-f^*_m,f^*_m\rangle_{H_m}|\le\|g^*_m\|_{H_m}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}$. By Lemma 10 and Lemma 11, we have $\|\Sigma_{m,m'}-\hat\Sigma_{m,m'}\|_{H_m,H_{m'}}=O_p(1/\sqrt n)$ and $\|\hat\Sigma_{I_0,\epsilon}\|_{H_{I_0}}=O_p(1/\sqrt n)$. Substituting these inequalities into (51), we have
\[
\|\tilde f-f^*\|_{L_2(\Pi)}^2+c_1\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac{\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}}{\sqrt n}+(\lambda_1^{(n)}+\lambda_2^{(n)})\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}\Big).\quad(52)
\]
Recall that the (non-centered) cross-correlation operator is invertible. Thus there exists a constant $c$ such that
\[
\|\tilde f-f^*\|_{L_2(\Pi)}^2=\langle\tilde f_{I_0}-f^*_{I_0},\Sigma_{I_0,I_0}(\tilde f_{I_0}-f^*_{I_0})\rangle_H=\langle\tilde f_{I_0}-f^*_{I_0},\mathrm{Diag}(\Sigma_{m,m}^{1/2})V_{I_0,I_0}\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\tilde f_{I_0}-f^*_{I_0})\rangle_{H_{I_0}}\ge c\sum_{m\in I_0}\langle\tilde f_m-f^*_m,\Sigma_{m,m}(\tilde f_m-f^*_m)\rangle_{H_m}=c\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}^2.
\]
This and Eq. (52), using $ab\le(a^2+b^2)/2$, give
\[
\|\tilde f-f^*\|_{L_2(\Pi)}^2+c_1\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac{\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}}{\sqrt n}+(\lambda_1^{(n)}+\lambda_2^{(n)})\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}\Big)
\]
\[
\le O_p\Big(\frac1{n\lambda_1^{(n)}}+(\lambda_1^{(n)}+\lambda_2^{(n)})^2\Big)+\frac{c_1}2\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+c_2\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}^2\le O_p\Big(\frac1{n\lambda_1^{(n)}}+(\lambda_1^{(n)}+\lambda_2^{(n)})^2\Big)+\frac{c_1}2\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\frac12\|\tilde f-f^*\|_{L_2(\Pi)}^2.
\]
Therefore we have
\[
\frac12\|\tilde f-f^*\|_{L_2(\Pi)}^2+\frac{c_1}2\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac1{n\lambda_1^{(n)}}+(\lambda_1^{(n)}+\lambda_2^{(n)})^2\Big)
\]
\[
\Rightarrow\quad\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac1{(c_1\lambda_1^{(n)}+\lambda_2^{(n)})n\lambda_1^{(n)}}+\frac{(\lambda_1^{(n)}+\lambda_2^{(n)})^2}{c_1\lambda_1^{(n)}+\lambda_2^{(n)}}\Big)=O_p\Big(\frac1{n\lambda_1^{(n)2}}+(\lambda_1^{(n)}+\lambda_2^{(n)})\Big).
\]
This and $\lambda_1^{(n)}\sqrt n\to\infty$ give $\|\tilde f-f^*_{I_0}\|_{H_{I_0}}\to0$ in probability.
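As a concrete illustration of the resulting tradeoff (an aside for intuition, not needed for the proof): taking, say, $\lambda_1^{(n)}\asymp n^{-1/3}$, which satisfies $\lambda_1^{(n)}\sqrt n\asymp n^{1/6}\to\infty$, and $\lambda_2^{(n)}=O(\lambda_1^{(n)})$ balances the two terms of the bound above, since
\[
\frac1{n\lambda_1^{(n)2}}\asymp\frac1{n\cdot n^{-2/3}}=n^{-1/3},\qquad\lambda_1^{(n)}+\lambda_2^{(n)}=O(n^{-1/3}),
\]
so that $\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2=O_p(n^{-1/3})\to0$; the precise parameter conditions are those stated in the main text.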
(Step 2) Next we show that the probability of $\tilde f=\hat f$ goes to 1. Since $\|\tilde f-f^*_{I_0}\|_{H_{I_0}}\to0$, we can assume $\|\tilde f_m\|_{H_m}>0$ $(m\in I_0)$ without loss of generality. We identify $\tilde f$ with an element of $H$ by setting $\tilde f_m=0$ for $m\in J_0$. We now show that $\tilde f$ is also the minimizer of $F_n$, that is, $\tilde f=\hat f$, with high probability, and hence $\hat I=I_0$ with high probability. By the KKT conditions, the necessary and sufficient conditions for $\tilde f$ to minimize $F_n$ are
\[
\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}\le\lambda_1^{(n)}\quad(\forall m\in J_0),\quad(53)
\]
\[
(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)(\tilde f_{I_0}-f^*_{I_0})+\lambda_1^{(n)}D_nf^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}=0,\quad(54)
\]
where $D_n=\mathrm{Diag}(\|\tilde f_m\|_{H_m}^{-1})$. Note that (54) is satisfied (with high probability) because $\tilde f$ is the minimizer of $\tilde F_n$ and $\|\tilde f_m\|_{H_m}>0$ for all $m\in I_0$ (with high probability). Therefore, if condition (53) holds with high probability, then $\tilde f=\hat f$ with high probability. We now show that condition (53) holds with high probability. By (54), we have
\[
\tilde f_{I_0}-f^*_{I_0}=-(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\big[(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}\big].
\]
Therefore the LHS of (53), $\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}$, can be evaluated as
\[
\big\|-2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\big[(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}\big]-2\hat\Sigma_{m,\epsilon}\big\|_{H_m}
\]
\[
=\big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}-2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}+2\hat\Sigma_{m,\epsilon}\big\|_{H_m}
\]
\[
\le\big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_m}+\big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}-2\hat\Sigma_{m,\epsilon}\big\|_{H_m}.\quad(55)
\]
We evaluate the probabilistic orders of the last two terms.

(i) (Bounding $B_{n,m}:=\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}-2\hat\Sigma_{m,\epsilon}\|_{H_m}$.) We show that
\[
\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}=O_p\Big(\frac1{\sqrt n}\Big).
\]
Since $O\preceq\begin{pmatrix}\hat\Sigma_{I_0,I_0}&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}\end{pmatrix}$, we have
\[
O\preceq\begin{pmatrix}\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}D_n/2&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}+\lambda_2^{(n)}\end{pmatrix}\preceq\begin{pmatrix}2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n&0\\0&2\hat\Sigma_{m,m}+2\lambda_2^{(n)}\end{pmatrix}.
\]
The second inequality is due to the fact that for all $(f_{I_0},f_m)\in H_{I_0\cup m}$ we have
\[
\Bigg\langle\begin{pmatrix}f_{I_0}\\-f_m\end{pmatrix},\begin{pmatrix}\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}D_n/2&-\hat\Sigma_{I_0,m}\\-\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}+\lambda_2^{(n)}\end{pmatrix}\begin{pmatrix}f_{I_0}\\-f_m\end{pmatrix}\Bigg\rangle_{H_{I_0\cup m}}\ge0
\]
because of $O\preceq\begin{pmatrix}\hat\Sigma_{I_0,I_0}&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}\end{pmatrix}$. Thus we have
\[
\Bigg\|\begin{pmatrix}\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}D_n/2&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}+\lambda_2^{(n)}\end{pmatrix}\begin{pmatrix}2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n&0\\0&2\hat\Sigma_{m,m}+2\lambda_2^{(n)}\end{pmatrix}^{-1}\begin{pmatrix}\hat\Sigma_{I_0,\epsilon}\\\hat\Sigma_{m,\epsilon}\end{pmatrix}\Bigg\|_{H_{I_0\cup m}}\le\Bigg\|\begin{pmatrix}\hat\Sigma_{I_0,\epsilon}\\\hat\Sigma_{m,\epsilon}\end{pmatrix}\Bigg\|_{H_{I_0\cup m}}\le O_p(1/\sqrt n).\quad(56)
\]
Here the LHS of inequality (56) is equal to
\[
\Bigg\|\begin{pmatrix}*\\\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}+(\hat\Sigma_{m,m}+\lambda_2^{(n)})(2\hat\Sigma_{m,m}+2\lambda_2^{(n)})^{-1}\hat\Sigma_{m,\epsilon}\end{pmatrix}\Bigg\|_{H_{I_0\cup m}}.
\]
Therefore we observe
\[
\Big\|\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}+\frac12\hat\Sigma_{m,\epsilon}\Big\|_{H_m}=O_p(1/\sqrt n).
\]
Since $\|\hat\Sigma_{m,\epsilon}\|_{H_m}=O_p(1/\sqrt n)$, we also have $\|\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}\|_{H_m}=O_p(1/\sqrt n)$. This and $\|\hat\Sigma_{m,\epsilon}\|_{H_m}=O_p(1/\sqrt n)$ yield
\[
B_{n,m}=O_p(1/\sqrt n).\quad(57)
\]

(ii) (Bounding $E_{n,m}:=\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\|_{H_m}$.) Note that, due to $\|\tilde f-f^*\|_H\xrightarrow{\,p\,}0$, we have $D_n\xrightarrow{\,p\,}D$, and we know that $\max_{m,m'}\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}=O_p(\sqrt{\log(M)/n})=O_p(1/\sqrt n)$ by Lemma 10. Thus $S_n:=(2\Sigma_{I_0,I_0}-2\hat\Sigma_{I_0,I_0})/\lambda_1^{(n)}+D-D_n$ satisfies $S_n=o_p(1)$, and thus $D-S_n\succeq D/2$ with high probability. Hence
\[
2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}=2\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+O_p\Big(\frac1{\sqrt n}\Big)
\]
\[
=2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}\lambda_1^{(n)}S_n\big(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+O_p\Big(\frac1{\sqrt n}\Big).\quad(58)
\]
Here we obtain
\[
\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-\frac12}\|_{H_m,H_{I_0}}^2=\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}\Sigma_{I_0,m}\|_{H_m,H_m}\le\|\Sigma_{m,m}^{\frac12}V_{m,I_0}(2V_{I_0,I_0})^{-1}V_{I_0,m}\Sigma_{m,m}^{\frac12}\|_{H_m,H_m}=O_p(1),\quad(59)
\]
and, due to the fact that $D-S_n\succeq D/2$ with high probability, we have
\[
\big\|\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-\frac12}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_{I_0}}=\big\|\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-\frac12}\mathrm{Diag}(\Sigma_{m,m}^{\frac12})(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})g^*_{I_0}\big\|_{H_{I_0}}\le O_p\big(\|V_{I_0,I_0}^{-1}\|_{H_{I_0},H_{I_0}}^{\frac12}(\lambda_1^{(n)}+\lambda_2^{(n)})\big)=O_p(\lambda_1^{(n)}+\lambda_2^{(n)}).
\]
Therefore, the second term on the RHS of Eq. (58) is evaluated as
\[
\big\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}\lambda_1^{(n)}S_n\big(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_m}
\]
\[
\le\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-\frac12}\|_{H_m,H_{I_0}}\,\|(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-\frac12}\|_{H_{I_0},H_{I_0}}\,\lambda_1^{(n)}\|S_n\|_{H_{I_0},H_{I_0}}\times\|(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n))^{-\frac12}\|_{H_{I_0},H_{I_0}}\,\big\|\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-\frac12}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_{I_0}}
\]
\[
\le O_p\big(1\cdot(\lambda_1^{(n)}+\lambda_2^{(n)})^{-\frac12}\cdot\lambda_1^{(n)}o_p(1)\cdot(\lambda_1^{(n)}+\lambda_2^{(n)})^{-\frac12}\cdot(\lambda_1^{(n)}+\lambda_2^{(n)})\big)=o_p(\lambda_1^{(n)}).
\]
Therefore, this and Eq. (58) give
\[
2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}=2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+o_p(\lambda_1^{(n)})+O_p\Big(\frac1{\sqrt n}\Big)
\]
\[
=2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+o_p(\lambda_1^{(n)}).
\]
Define
\[
A_n:=\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}
\]
and
\[
A:=\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}.
\]
We show $\|A_n-A\|_{H_m}=o_p(1)$. By definition, we have
\[
A-A_n=\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\frac{\lambda_1^{(n)}D}2\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}+\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}(D-D_n)f^*_{I_0}.\quad(60)
\]
On the other hand, as in Eq. (56), we observe that
\[
2\ge\Bigg\|\begin{pmatrix}\Sigma_{I_0,I_0}&\Sigma_{I_0,m}\\\Sigma_{m,I_0}&\Sigma_{m,m}\end{pmatrix}\begin{pmatrix}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}&0\\0&0\end{pmatrix}\Bigg\|_{H_{I_0\cup m},H_{I_0\cup m}}=\Bigg\|\begin{pmatrix}*&0\\\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}&0\end{pmatrix}\Bigg\|_{H_{I_0\cup m},H_{I_0\cup m}}\ge\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\|_{H_m,H_{I_0}}.\quad(61)
\]
Moreover, since $f^*_m=\Sigma_{m,m}^{\frac12}g^*_m$ $(\forall m)$, we have
\[
\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\Big\|_{H_{I_0}}=\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\mathrm{Diag}(\Sigma_{m,m}^{\frac12})\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_{I_0}}
\]
\[
\le\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-\frac12}\Big\|_{H_{I_0},H_{I_0}}\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-\frac12}\mathrm{Diag}(\Sigma_{m,m}^{\frac12})\Big\|_{H_{I_0},H_{I_0}}\Big\|\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_{I_0}}\le O_p\Big((\lambda_1^{(n)}+\lambda_2^{(n)})^{-\frac12}\|V_{I_0,I_0}^{-\frac12}\|_{H_{I_0},H_{I_0}}\Big)\le O_p\big(\lambda_1^{(n)-\frac12}\big).\quad(62)
\]
We can also bound the second term of (60) as
\[
\Big\|\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}(D-D_n)f^*_{I_0}\Big\|_{H_m}\le\Big\|\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big\|_{H_m,H_{I_0}}\|(D-D_n)f^*_{I_0}\|_{H_{I_0}}\le\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\|_{H_m,H_{I_0}}\|(D-D_n)f^*_{I_0}\|_{H_{I_0}}\le2\|(D-D_n)f^*_{I_0}\|_{H_{I_0}}\ (\because\text{Eq. (61)})=o_p(1).
\]
Therefore, applying the inequalities Eq. (61) and Eq. (62) to Eq. (60), we have
\[
\|A_n-A\|_{H_m}=O_p\big(\lambda_1^{(n)\frac12}\big)+o_p(1)=o_p(1).\quad(63)
\]
Hence we have $E_{n,m}=\lambda_1^{(n)}\|A\|_{H_m}+o_p(\lambda_1^{(n)})$.

(iii) (Combining (i) and (ii).) Due to the above evaluations (i) and (ii), we have
\[
\max_{m\in J_0}\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}=\max_{m\in J_0}\lambda_1^{(n)}\Big\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\Big\|_{H_m}+o_p(\lambda_1^{(n)})<\lambda_1^{(n)}(1-\eta)+o_p(\lambda_1^{(n)}).
\]
This yields
\[
P\big(\exists m\in J_0:\ \|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}\ge\lambda_1^{(n)}\big)\to0.
\]
Thus the probability that condition (53) holds goes to 1.

Proof: (Theorem 5) First we prove that $\lambda_1^{(n)}\sqrt n\to\infty$ is a necessary condition for $\hat I\xrightarrow{\,p\,}I_0$. Assume that $\liminf\lambda_1^{(n)}\sqrt n<\infty$. Then we can take a sub-sequence along which $\lambda_1^{(n)}\sqrt n$ converges to a finite value; therefore, by passing to this sub-sequence if necessary, we may assume $\lambda_1^{(n)}\sqrt n\to\mu_1$ without loss of generality. We will derive a contradiction under the conditions $\|\hat f-f^*\|_H\xrightarrow{\,p\,}0$ and $\hat I\xrightarrow{\,p\,}I_0$. Suppose $\hat I=I_0$.
By the KKT condition,
\[
0=2(\hat\Sigma_{I_0,I_0}\hat f_{I_0}-\hat\Sigma_{I_0,\epsilon}-\hat\Sigma_{I_0,I_0}f^*_{I_0})+\lambda_1^{(n)}D_n\hat f_{I_0}+2\lambda_2^{(n)}\hat f_{I_0}
\]
\[
\Rightarrow\quad2(\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)})(f^*_{I_0}-\hat f_{I_0})=\lambda_1^{(n)}D_nf^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}\quad(64)
\]
\[
\Rightarrow\quad2\sqrt n(\Sigma_{I_0,I_0}+\lambda_2^{(n)})(f^*_{I_0}-\hat f_{I_0})=\sqrt n\lambda_1^{(n)}Df^*_{I_0}+2\sqrt n\lambda_2^{(n)}f^*_{I_0}-2\sqrt n\hat\Sigma_{I_0,\epsilon}+\big(2\sqrt n(\Sigma_{I_0,I_0}-\hat\Sigma_{I_0,I_0})(f^*_{I_0}-\hat f_{I_0})+\sqrt n\lambda_1^{(n)}(D_n-D)f^*_{I_0}\big)
\]
\[
\Rightarrow\quad2\sqrt n(\Sigma_{I_0,I_0}+\lambda_2^{(n)})(f^*_{I_0}-\hat f_{I_0})=\mu_1Df^*_{I_0}+2\sqrt n\lambda_2^{(n)}f^*_{I_0}-2\sqrt n\hat\Sigma_{I_0,\epsilon}+o_p(1),\quad(65)
\]
where the last equality is due to $\sqrt n\lambda_1^{(n)}\to\mu_1$, $\|D_n-D\|_{H_{I_0},H_{I_0}}=o_p(1)$, $\|\hat f-f^*\|_H=o_p(1)$ and $\|\Sigma_{I_0,I_0}-\hat\Sigma_{I_0,I_0}\|_{H_{I_0},H_{I_0}}=o_p(1)$. Moreover, since the second equality (64) indicates that $o_p(1)+o_p(\lambda_2^{(n)})=\lambda_1^{(n)}Df^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}+o_p(1)$, we have $\lambda_1^{(n)}=o_p(1)$ and $\lambda_2^{(n)}=o_p(1)$.

We now show that the KKT condition under which $\hat f$ satisfying $\hat I=I_0$ is optimal with respect to $F_n$ is violated with strictly positive probability:
\[
\liminf P\Big(\exists m\in J_0:\ \|2(\hat\Sigma_{m,I_0}\hat f_{I_0}-\hat\Sigma_{m,I_0}f^*_{I_0}-\hat\Sigma_{m,\epsilon})\|_{H_m}>\lambda_1^{(n)}\Big)>0.\quad(66)
\]
Obviously this indicates that the probability of $\hat I=I_0$ does not converge to 1, which is a contradiction. For all $v_m\in H_m$ $(m\in J_0)$, there exists $w_{I_0}\in H_{I_0}$ such that
\[
\Sigma_{I_0,m}v_m=(\Sigma_{I_0,I_0}+\lambda_2^{(n)})w_{I_0}.\quad(67)
\]
Note that $w_{I_0}$ is uniformly bounded for all $\lambda_2^{(n)}\ge0$: the range of $\Sigma_{I_0,m}$ is included in the range of $\Sigma_{I_0,I_0}$ (Baker, 1973), so there exists $\tilde w_{I_0}$ such that $\Sigma_{I_0,m}v_m=\Sigma_{I_0,I_0}\tilde w_{I_0}$ ($\tilde w_{I_0}$ is independent of $\lambda_2^{(n)}$); hence $\Sigma_{I_0,I_0}\tilde w_{I_0}=(\Sigma_{I_0,I_0}+\lambda_2^{(n)})w_{I_0}$, and
\[
\|w_{I_0}\|_{H_{I_0}}\le\sqrt{\langle\tilde w_{I_0},\Sigma_{I_0,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-2}\Sigma_{I_0,I_0}\tilde w_{I_0}\rangle_{H_{I_0}}}\le\|\tilde w_{I_0}\|_{H_{I_0}}
\]
for $\lambda_2^{(n)}>0$, and $\|w_{I_0}\|_{H_{I_0}}=\|\tilde w_{I_0}\|_{H_{I_0}}$ for $\lambda_2^{(n)}=0$. Let $v_m\in H_m$ be any non-zero element such that $\Sigma_{m,m}^{1/2}v_m\ne0$, and let $w_{I_0}$ satisfy equality (67). Then
\[
\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}+\hat\Sigma_{m,I_0}f^*_{I_0}-\hat\Sigma_{m,I_0}\hat f_{I_0}\rangle_{H_m}=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}+\langle v_m,\hat\Sigma_{m,I_0}\sqrt n(f^*_{I_0}-\hat f_{I_0})\rangle_{H_m}
\]
\[
=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}+\langle v_m,\Sigma_{m,I_0}\sqrt n(f^*_{I_0}-\hat f_{I_0})\rangle_{H_m}+o_p(1)=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}+\langle w_{I_0},(\Sigma_{I_0,I_0}+\lambda_2^{(n)})\sqrt n(f^*_{I_0}-\hat f_{I_0})\rangle_{H_{I_0}}+o_p(1)
\]
\[
=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}-\sqrt n\langle w_{I_0},\hat\Sigma_{I_0,\epsilon}\rangle_{H_{I_0}}+\Big\langle w_{I_0},\Big(\frac{\mu_1}2D+\sqrt n\lambda_2^{(n)}\Big)f^*_{I_0}\Big\rangle_{H_{I_0}}+o_p(1),
\]
where we used $\|\hat\Sigma_{m,I_0}-\Sigma_{m,I_0}\|_{H_m,H_{I_0}}=O_p(1/\sqrt n)$ and $\|f^*-\hat f\|_H\xrightarrow{\,p\,}0$ in the second equality, and relation (65) in the last equality. We can show that $Z_n:=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle-\sqrt n\langle w_{I_0},\hat\Sigma_{I_0,\epsilon}\rangle$ has a strictly positive variance as follows (see also Bach (2008)):
\[
\mathrm E[Z_n]=0,\qquad\mathrm E[Z_n^2]\ge\sigma^2\big(\langle v_m,\Sigma_{m,m}v_m\rangle-2\langle v_m,\Sigma_{m,I_0}w_{I_0}\rangle+\langle w_{I_0},\Sigma_{I_0,I_0}w_{I_0}\rangle\big)=\sigma^2\big(\langle v_m,\Sigma_{m,m}v_m\rangle-\langle v_m,\Sigma_{m,I_0}w_{I_0}\rangle+o_p(1)\big)\quad(\because\lambda_2^{(n)}=o_p(1))
\]
\[
=\sigma^2\langle\Sigma_{m,m}^{1/2}v_m,(I_{H_m}-V_{m,I_0}\tilde V_{I_0,I_0}^{-1}V_{I_0,m})\Sigma_{m,m}^{1/2}v_m\rangle+o_p(1),
\]
where $\tilde V_{I_0,I_0}^{-1}=\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\mathrm{Diag}(\Sigma_{m,m}^{1/2})$ (note that $\tilde V_{I_0,I_0}$ is invertible because $V_{I_0,I_0}\preceq\tilde V_{I_0,I_0}$ and $V_{I_0,I_0}$ is invertible).
Now, since $V_{I_0,I_0}\preceq\tilde V_{I_0,I_0}$ and $I_{H_m}-V_{m,I_0}V_{I_0,I_0}^{-1}V_{I_0,m}\succ O$ (this is because $V_{I_0\cup m,I_0\cup m}=\begin{pmatrix}V_{I_0,I_0}&V_{I_0,m}\\V_{m,I_0}&I_{H_m}\end{pmatrix}$ is invertible), we have $I_{H_m}-V_{m,I_0}\tilde V_{I_0,I_0}^{-1}V_{I_0,m}\succ O$. Therefore, by the central limit theorem, $Z_n$ converges in distribution to a Gaussian random variable with strictly positive variance. Thus the probability of
\[
2|\langle v_m,\hat\Sigma_{m,\epsilon}+\hat\Sigma_{m,I_0}f^*_{I_0}-\hat\Sigma_{m,I_0}\hat f_{I_0}\rangle_{H_m}|>\lambda_1^{(n)}\|v_m\|_{H_m}
\]
is asymptotically strictly positive because $\lambda_1^{(n)}\sqrt n\to\mu_1$ (note that this is true whether or not $\sqrt n\lambda_2^{(n)}$ converges to a finite value). This yields (66); that is, $\hat f$ fails to satisfy $\hat I=I_0$ with asymptotically strictly positive probability.

We refer to the following as Condition A:

Condition A: $\lambda_1^{(n)}\sqrt n\to\infty$.

Now that we have proven $\lambda_1^{(n)}\sqrt n\to\infty$, we are ready to prove assertion (16). Suppose that condition (16) is not satisfied for any sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to0$; that is, there exists a constant $\xi>0$ such that
\[
\limsup_{n\to\infty}\Big\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_m}>(1+\xi)\quad(\exists m\in J_0)\quad(68)
\]
for any sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to0$ satisfying Condition A ($\lambda_1^{(n)}\sqrt n\to\infty$). Fix arbitrary sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to0$ satisfying Condition A. If $\hat I=I_0$, the KKT conditions
\[
\|2\hat\Sigma_{m,I_0}(\hat f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}\le\lambda_1^{(n)}\quad(\forall m\in J_0),\quad(69)
\]
\[
(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)(\hat f_{I_0}-f^*_{I_0})+\lambda_1^{(n)}D_nf^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}=0,\quad(70)
\]
must be satisfied (see (53) and (54)). We prove that the first inequality (69) of the KKT conditions is violated with strictly positive probability under the assumptions and condition (70). We have shown (see (55)) that
\[
\lambda_1^{(n)-1}\big(2\hat\Sigma_{m,I_0}(\hat f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\big)=2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}-\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}+\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,\epsilon}.\quad(71)
\]
As shown in the proof of Theorem 4, the first term can be approximated by $\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\big)f^*_{I_0}$; more precisely, Eq. (63) gives
\[
\Big\|\hat\Sigma_{m,I_0}\Big(\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D_n}2\Big)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}-\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_m}\xrightarrow{\,p\,}0.
\]
Since $\liminf_n\big\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\big)g^*_{I_0}\big\|_{H_m}>(1+\xi)$ by the assumption, we observe that
\[
P\bigg(\Big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\Big\|_{H_m}>(1+\xi)\bigg)\not\to0.\quad(72)
\]
Now, since $\lambda_1^{(n)}\sqrt n\to\infty$, we have already proven in the proof of Theorem 4 (Eq. (57)) that
\[
\Big\|-\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}+\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,\epsilon}\Big\|_{H_m}=O_p\big(1/(\lambda_1^{(n)}\sqrt n)\big)=o_p(1).\quad(73)
\]
Therefore, combining (71), (72) and (73), we observe that the KKT condition (69) is violated with strictly positive probability if condition (68) is satisfied. This shows that the irrepresentable condition (16) is necessary for the support consistency of elastic-net MKL.
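To make condition (68) concrete, the following toy computation evaluates its left-hand side in a finite-dimensional analogue (a simplifying assumption of this sketch, not the paper's infinite-dimensional setting): each $H_m$ is taken one-dimensional with $\Sigma_{m,m}=1$, so that $g^*_m=f^*_m$ and $Df^*_{I_0}=\mathrm{sign}(f^*_{I_0})$, and $\Sigma$ reduces to an $M\times M$ covariance matrix. At $\lambda_2=0$ the quantity reduces to the classical irrepresentable quantity $|\Sigma_{m,I_0}\Sigma_{I_0,I_0}^{-1}\mathrm{sign}(f^*_{I_0})|$ of Zhao and Yu (2006).

```python
import numpy as np

I0 = [0, 1]                             # indices of the true active components
Sigma = np.eye(4)                       # M = 4 one-dimensional "blocks" (toy assumption)
Sigma[2, 0] = Sigma[0, 2] = 0.9         # inactive component 2 correlated with component 0
f_star = np.array([1.0, -0.5])          # true coefficients on I0

def en_irrepresentable_lhs(m, lam1, lam2):
    """LHS of (68) at an inactive index m; here D f*_{I0} = sign(f*_{I0})."""
    S_II = Sigma[np.ix_(I0, I0)] + lam2 * np.eye(len(I0))
    vec = np.sign(f_star) + (2.0 * lam2 / lam1) * f_star   # (D + 2 lam2/lam1) g*_{I0}
    return abs(Sigma[m, I0] @ np.linalg.solve(S_II, vec))

print(en_irrepresentable_lhs(2, lam1=0.1, lam2=0.0))    # block-l1 case: 0.9
print(en_irrepresentable_lhs(2, lam1=0.1, lam2=0.05))   # elastic-net case
```

Whether the elastic-net quantity is smaller or larger than the block-$\ell_1$ one depends on the interplay between $D$ and $2\lambda_2^{(n)}/\lambda_1^{(n)}$ and on the scaling of the regularization parameters, which is exactly the balance discussed in the main text.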
Lemma 10 If $\sup_Xk_m(X,X)\le1$ and $\sup_Xk_{m'}(X,X)\le1$, then
\[
P\big(\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\ge\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}]+\varepsilon\big)\le\exp(-n\varepsilon^2/2).\quad(74)
\]
In particular,
\[
P\Big(\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\ge\sqrt{\frac1n}+\varepsilon\Big)\le\exp(-n\varepsilon^2/2).\quad(75)
\]

Proof: We use McDiarmid's inequality (Devroye et al., 1996). By definition,
\[
\langle g,\hat\Sigma_{m,m'}f\rangle=\frac1n\sum_{i=1}^n\langle g,k_m(\cdot,x_i)\rangle_{H_m}\langle f,k_{m'}(\cdot,x_i)\rangle_{H_{m'}}.
\]
We denote by $\tilde\Sigma_{m,m'}$ the empirical cross-covariance operator computed from the $n$ samples $(x_1,\dots,x_{j-1},\tilde x_j,x_{j+1},\dots,x_n)$, in which the $j$-th sample $x_j$ is replaced by $\tilde x_j$, drawn independently from the same distribution as the $x_i$'s. By the triangle inequality, we have
\[
\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\|\tilde\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\le\|\hat\Sigma_{m,m'}-\tilde\Sigma_{m,m'}\|_{H_m,H_{m'}}.
\]
The RHS can be evaluated as follows:
\[
\|\hat\Sigma_{m,m'}-\tilde\Sigma_{m,m'}\|_{H_m,H_{m'}}=\Big\|\frac1n\big(k_m(\cdot,x_j)k_{m'}(x_j,\cdot)-k_m(\cdot,\tilde x_j)k_{m'}(\tilde x_j,\cdot)\big)\Big\|_{H_m,H_{m'}}.\quad(76)
\]
The RHS of (76) can be further bounded as
\[
\Big\|\frac1n\big(k_m(\cdot,x_j)k_{m'}(x_j,\cdot)-k_m(\cdot,\tilde x_j)k_{m'}(\tilde x_j,\cdot)\big)\Big\|_{H_m,H_{m'}}\le\frac1n\big(\|k_m(\cdot,x_j)k_{m'}(x_j,\cdot)\|_{H_m,H_{m'}}+\|k_m(\cdot,\tilde x_j)k_{m'}(\tilde x_j,\cdot)\|_{H_m,H_{m'}}\big)
\]
\[
\le\frac1n\big(\|k_m(\cdot,x_j)\|_{H_m}\|k_{m'}(x_j,\cdot)\|_{H_{m'}}+\|k_m(\cdot,\tilde x_j)\|_{H_m}\|k_{m'}(\tilde x_j,\cdot)\|_{H_{m'}}\big)\le\frac1n\Big(\sqrt{k_m(x_j,x_j)k_{m'}(x_j,x_j)}+\sqrt{k_m(\tilde x_j,\tilde x_j)k_{m'}(\tilde x_j,\tilde x_j)}\Big)\le\frac2n,\quad(77)
\]
where we used $\|k_m(\cdot,x_j)\|_{H_m}=\sqrt{\langle k_m(\cdot,x_j),k_m(\cdot,x_j)\rangle_{H_m}}=\sqrt{k_m(x_j,x_j)}$. Bounding the norm in (76) by (77), we have
\[
\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\|\tilde\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\le\frac2n.
\]
By symmetry, exchanging $\hat\Sigma$ and $\tilde\Sigma$ gives
\[
\big|\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\|\tilde\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\big|\le\frac2n.
\]
Therefore, by McDiarmid's inequality, we obtain
\[
P\big(\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}]\ge\varepsilon\big)\le\exp\Big(-\frac{2\varepsilon^2}{n(2/n)^2}\Big)=\exp\Big(-\frac{\varepsilon^2n}2\Big).
\]
This gives the first assertion, Eq. (74). To show the second assertion (Eq. (75)), first note that
\[
\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}]\le\sqrt{\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}^2]}=\sqrt{\mathrm E[\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{H_m,H_m}]}\le\sqrt{\mathrm E[\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{\mathrm{tr}}]},\quad(78)
\]
where $\|\cdot\|_{\mathrm{tr}}$ is the trace norm and the last inequality holds because the operator norm is bounded by the trace norm. As in Lemma 1 of Gretton et al. (2005), we see that
\[
\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{\mathrm{tr}}=\frac1{n^2}\sum_{i,j=1}^n\|k_m(\cdot,x_i)k_{m'}(x_i,x_j)k_m(x_j,\cdot)\|_{\mathrm{tr}}-\frac2n\sum_{i=1}^n\mathrm E_X[\|k_m(\cdot,x_i)k_{m'}(x_i,X)k_m(X,\cdot)\|_{\mathrm{tr}}]+\mathrm E_{X,X'}[\|k_m(\cdot,X)k_{m'}(X,X')k_m(X',\cdot)\|_{\mathrm{tr}}]
\]
\[
=\frac1{n^2}\sum_{i,j=1}^nk_m(x_j,x_i)k_{m'}(x_i,x_j)-\frac2n\sum_{i=1}^n\mathrm E_X[k_m(X,x_i)k_{m'}(x_i,X)]+\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')],
\]
where $X$ and $X'$ are independent random variables distributed according to $\Pi$. Thus
\[
\mathrm E[\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{\mathrm{tr}}]=\frac n{n^2}\mathrm E_X[k_m(X,X)k_{m'}(X,X)]+\frac{n(n-1)}{n^2}\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]-2\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]+\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]
\]
\[
=\frac1n\mathrm E_X[k_m(X,X)k_{m'}(X,X)]-\frac1n\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]\le\frac1n.
\]
This and Eq. (78), together with the first assertion (Eq. (74)), give the second assertion.
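The trace computation in the proof makes $\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{\mathrm{HS}}^2$ (which dominates the squared operator norm) expressible purely through kernel evaluations, so the expectation bound $1/n$ can be sanity-checked numerically. The sketch below uses illustrative choices (Gaussian kernels, uniform inputs) that are assumptions of this check, not of the lemma, and approximates the population expectations by large independent samples:

```python
import numpy as np

rng = np.random.default_rng(1)

def gram(x, y, bw):
    """Gaussian kernel matrix k(x_i, y_j); the kernel and input law are illustrative."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * bw ** 2))

def hs_error_sq(x, pop_a, pop_b, bw1=0.5, bw2=2.0):
    """||Sigma_hat_{m,m'} - Sigma_{m,m'}||_HS^2 via the three-term kernel identity
    from the proof; population expectations approximated by samples pop_a, pop_b."""
    sample_term = np.mean(gram(x, x, bw1) * gram(x, x, bw2))          # (1/n^2) double sum
    cross_term = np.mean(gram(pop_a, x, bw1) * gram(pop_a, x, bw2))   # (1/n) sum_i E_X[...]
    pop_term = np.mean(gram(pop_a, pop_b, bw1) * gram(pop_a, pop_b, bw2))  # E_{X,X'}[...]
    return sample_term - 2.0 * cross_term + pop_term

for n in (50, 200, 800):
    trials = [hs_error_sq(rng.uniform(-1, 1, n),
                          rng.uniform(-1, 1, 2000),
                          rng.uniform(-1, 1, 2000)) for _ in range(100)]
    print(n, np.mean(trials), 1.0 / n)   # the empirical mean should sit below 1/n
```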
Lemma 11 If $\mathrm E[\epsilon^2|X]\le\sigma^2$ almost surely and $\sup_Xk_m(X,X)\le1$, then we have
\[
\|\hat\Sigma_{m,\epsilon}\|_{H_m}=O_p(\sigma/\sqrt n).\quad(79)
\]

Proof: By definition, we have
\[
\mathrm E[\|\hat\Sigma_{m,\epsilon}\|_{H_m}]\le\sqrt{\mathrm E[\|\hat\Sigma_{m,\epsilon}\|_{H_m}^2]}=\sqrt{\mathrm E\Big[\frac1{n^2}\sum_{i,j=1}^nk_m(x_i,x_j)\epsilon_i\epsilon_j\Big]}\le\sqrt{\frac{\sigma^2}n}.
\]
Applying Markov's inequality, we obtain the assertion.

Proposition 1 (Bernstein's inequality in Hilbert spaces) Let $(\Omega,\mathcal A,P)$ be a probability space, $H$ be a separable Hilbert space, $B>0$, and $\sigma>0$. Furthermore, let $\xi_1,\dots,\xi_n:\Omega\to H$ be independent random variables satisfying $\mathrm E[\xi_i]=0$, $\|\xi_i\|_H\le B$, and $\mathrm E[\|\xi_i\|_H^2]\le\sigma^2$ for all $i=1,\dots,n$. Then we have
\[
P\Bigg(\Big\|\frac1n\sum_{i=1}^n\xi_i\Big\|_H\ge\sqrt{\frac{2\sigma^2\tau}n}+\sqrt{\frac{\sigma^2}n}+\frac{2B\tau}{3n}\Bigg)\le e^{-\tau}\quad(\tau>0).
\]

Proof: See Theorem 6.14 of Steinwart (2008).
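Like Proposition 1, Lemma 11 controls the norm of an empirical average in a Hilbert space, and the quantity it bounds is exactly computable from the Gram matrix, since $\|\hat\Sigma_{m,\epsilon}\|_{H_m}^2=\frac1{n^2}\sum_{i,j}k_m(x_i,x_j)\epsilon_i\epsilon_j$. This permits a quick Monte Carlo check of the $\sigma/\sqrt n$ rate; the Gaussian kernel, uniform inputs, and Gaussian noise below are assumptions of this sketch only:

```python
import numpy as np

rng = np.random.default_rng(2)

def gram(x, bw=1.0):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * bw ** 2))

sigma = 0.3
for n in (100, 400, 1600):
    norms = []
    for _ in range(200):
        x = rng.uniform(-1.0, 1.0, n)
        eps = rng.normal(0.0, sigma, n)                  # noise with E[eps^2 | X] = sigma^2
        norms.append(np.sqrt(eps @ gram(x) @ eps) / n)   # = ||Sigma_hat_{m,eps}||_{H_m}
    print(n, np.mean(norms), sigma / np.sqrt(n))         # both columns shrink like 1/sqrt(n)
```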
References

F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, pages 41–48, 2004.

F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris, Ser. I Math., 334:495–500, 2002.

A. Caponnetto and E. de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

C. Cortes. Can learning kernels help performance? Invited talk at the International Conference on Machine Learning (ICML 2009), Montréal, Canada, 2009.

C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montréal, Canada, 2009.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, pages 63–77, Berlin, 2005. Springer-Verlag.

J. Jia and B. Yu. On model selection consistency of the elastic net when p ≫ n. Statistica Sinica, 20(2), 2010. To appear.

M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate ℓp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22, pages 997–1005, Cambridge, MA, 2009. MIT Press.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.

V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In Proceedings of the Annual Conference on Learning Theory, pages 229–238, 2008.

G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.

Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272–2297, 2006.

L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.

S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48:1977–1991, 2002.

C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

I. Steinwart. Support Vector Machines. Springer, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79–93, 2009.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

T. Suzuki and R. Tomioka. SpicyMKL. arXiv:0909.5026, 2009.

M. Talagrand. A new look at independence. The Annals of Probability, 24:1–34, 1996a.

M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996b.

R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL. arXiv:1001.2615, 2010.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.

V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143–161, 2007.

T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5):2109–2144, 2009.

P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.

H. Zou and H. H. Zhang. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009.
