Sharp Convergence Rate and Support Consistency of Multiple Kernel Learning with Sparse and Dense Regularization


Taiji Suzuki, Ryota Tomioka
Department of Mathematical Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo
t-suzuki@mist.i.u-tokyo.ac.jp, tomioka@mist.i.u-tokyo.ac.jp

Masashi Sugiyama
Department of Computer Science, Tokyo Institute of Technology, 2-12-1 O-okayama, Meguro-ku, Tokyo
sugi@cs.titech.ac.jp

Abstract

We theoretically investigate the convergence rate and support consistency (i.e., correctly identifying the subset of non-zero coefficients in the large sample limit) of multiple kernel learning (MKL). We focus on MKL with block-$\ell_1$ regularization (inducing sparse kernel combination), block-$\ell_2$ regularization (inducing uniform kernel combination), and elastic-net regularization (including both block-$\ell_1$ and block-$\ell_2$ regularization). For the case where the true kernel combination is sparse, we show a sharper convergence rate of the block-$\ell_1$ and elastic-net MKL methods than the existing rate for block-$\ell_1$ MKL. We further show that elastic-net MKL requires a milder condition for being consistent than block-$\ell_1$ MKL. For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL can achieve a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between the block-$\ell_1$ and block-$\ell_2$ regularizers. Thus, our theoretical results overall suggest the use of elastic-net regularization in MKL.

1 Introduction

The choice of kernel functions is a key issue for kernel methods such as support vector machines to work well (Vapnik, 1998). A traditional but very powerful approach to optimizing the kernel function is the use of cross-validation (CV) (Stone, 1974). Although the CV-based kernel choice often leads to better generalization, it is computationally expensive when the kernel contains multiple tuning parameters. To overcome this limitation, the framework of multiple kernel learning (MKL) has been introduced, which tries to learn the optimal linear combination of prefixed base kernels by convex optimization (Lanckriet et al., 2004, Micchelli and Pontil, 2005, Lin and Zhang, 2006, Sonnenburg et al., 2006, Rakotomamonjy et al., 2008, Suzuki and Tomioka, 2009). The seminal paper by Bach et al. (2004) showed that this MKL formulation can be interpreted as block-$\ell_1$ regularization (i.e., $\ell_1$ regularization across the kernels and $\ell_2$ regularization within each kernel). We refer to this MKL formulation as 'block-$\ell_1$ MKL'. Based on this interpretation, block-$\ell_1$ MKL was proved to be support consistent (i.e., correctly identifying the subset of non-zero coefficients with probability one in the large sample limit) when the true kernel combination is sparse (Bach, 2008). Furthermore, the convergence rate of block-$\ell_1$ MKL has been elucidated in Koltchinskii and Yuan (2008), which can be regarded as an extension of the theoretical analysis for ordinary (non-block) $\ell_1$ regularization (Bickel et al., 2009, Zhang, 2009). However, in many practical applications, the true kernel combination may not be exactly sparse.
In such a non-sparse situation, block-$\ell_1$ MKL was shown to perform rather poorly—even the uniform combination of base kernels obtained by block-$\ell_2$ regularization (Micchelli and Pontil, 2005) (which we call 'block-$\ell_2$ MKL') often works better in practice (Cortes, 2009). Furthermore, recent work has shown that some 'intermediate' regularization between block-$\ell_1$ and block-$\ell_2$ regularization is more promising, e.g., block-$\ell_p$ regularization with $1 \le p \le 2$ (Cortes et al., 2009, Kloft et al., 2009), and elastic-net regularization (Zou and Hastie, 2005), which includes both block-$\ell_1$ and block-$\ell_2$ regularization (Tomioka and Suzuki, 2010) (we call this method 'elastic-net MKL'). Theoretically, the support consistency and the convergence rate of parametric elastic-nets have been elucidated in Yuan and Lin (2007) and Zou and Zhang (2009), respectively, and the non-parametric case has been investigated in Meier et al. (2009), focusing on Sobolev spaces.

In this paper, we theoretically analyze the support consistency and convergence rate of MKL, and provide three new results.

• For the case where the true kernel combination is sparse, we show that elastic-net MKL achieves a faster convergence rate than the one shown for block-$\ell_1$ MKL (Koltchinskii and Yuan, 2008). More specifically, we show that the $L_2$ convergence error is given by $O_p\big(\min\{ d n^{-\frac{2}{2+s}} + d\log(M)/n,\ d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + d\log(M)/n \}\big)$, where $d$ is the number of active components of the target function, $s$ is the complexity of the RKHSs, $M$ is the number of candidate kernels, and $n$ is the number of samples.

• For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL achieves a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between block-$\ell_1$ and block-$\ell_2$ regularization. Our theoretical result agrees well with the experimental results reported in Tomioka and Suzuki (2010).

• For the case where the true kernel combination is sparse, we prove that the necessary and sufficient conditions for the support consistency of elastic-net MKL are milder than the conditions required for block-$\ell_1$ MKL (Bach, 2008).

Overall, our theoretical results suggest the use of elastic-net regularization in MKL.

2 Preliminaries

In this section, we formulate the elastic-net MKL approach and summarize the mathematical tools that are needed for the theoretical analysis.

2.1 Formulation

Suppose we are given $n$ samples $(x_i, y_i)_{i=1}^n$, where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i \in \mathbb{R}$. The samples $(x_i, y_i)_{i=1}^n$ are independently and identically distributed from a probability measure $P$. We denote the marginal distribution of $X$ by $\Pi$. We consider an MKL regression problem in which the unknown target function is represented in the form $f(x) = \sum_{m=1}^M f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$ ($m = 1, \dots, M$) corresponding to one of $M$ different base kernels $k_m$ over $\mathcal{X} \times \mathcal{X}$. Elastic-net MKL learns a decision function $\hat{f}$ as follows (see Footnote 1):

$$\hat{f} = \operatorname*{arg\,min}_{f_m \in \mathcal{H}_m\,(m=1,\dots,M)}\ \frac{1}{n}\sum_{i=1}^n \Big( y_i - \sum_{m=1}^M f_m(x_i) \Big)^2 + \lambda_1^{(n)} \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)} \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^2, \tag{1}$$

where the first term is the squared loss of function fitting, and the second and third terms are the block-$\ell_1$ and block-$\ell_2$ regularizers, respectively.
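As a concrete illustration of the optimization problem (1), the following is a minimal sketch added here for exposition (it is not part of the original analysis): if each RKHS is approximated by an explicit finite-dimensional feature map $\Phi_m$, so that $\|f_m\|_{\mathcal{H}_m}$ becomes the Euclidean norm of the corresponding coefficient group, then (1) reduces to a group elastic net, which can be solved by proximal gradient descent with a group soft-threshold. All names below (e.g., `elastic_net_mkl`) are hypothetical.

```python
import numpy as np

def elastic_net_mkl(Phis, y, lam1, lam2, n_iter=500):
    """Proximal gradient for a feature-map analogue of Eq. (1).

    Phis : list of (n, p_m) arrays, one feature matrix per base kernel.
    """
    n = y.shape[0]
    Phi = np.hstack(Phis)                      # stacked design matrix
    bounds = np.cumsum([0] + [P.shape[1] for P in Phis])
    # Lipschitz constant of the smooth part (squared loss + ridge term).
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2 / n + 2.0 * lam2
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 / n * Phi.T @ (Phi @ w - y) + 2.0 * lam2 * w
        z = w - grad / L                       # gradient step on the smooth part
        for m in range(len(Phis)):             # prox of lam1 * sum_m ||w_m||_2
            g = slice(bounds[m], bounds[m + 1])
            nrm = np.linalg.norm(z[g])
            z[g] = max(0.0, 1.0 - lam1 / (L * nrm)) * z[g] if nrm > 0 else 0.0
        w = z
    return [w[bounds[m]:bounds[m + 1]] for m in range(len(Phis))]
```

Setting `lam2 = 0` in this sketch recovers block-$\ell_1$ MKL (whole groups are zeroed out), while `lam1 = 0` recovers block-$\ell_2$ MKL (every group stays active), mirroring the reductions discussed next.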
It can be seen from (1) that elastic-net MKL reduces to block-$\ell_1$ MKL if $\lambda_2^{(n)} = 0$, which tends to induce a sparse kernel combination (Lanckriet et al., 2004, Bach et al., 2004). On the other hand, it reduces to block-$\ell_2$ MKL if $\lambda_1^{(n)} = 0$, which results in a uniform kernel combination (Micchelli and Pontil, 2005). It is worth noting that elastic-net MKL allows us to obtain various levels of sparsity by controlling the ratio between $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$.

2.2 Notations and Assumptions

Here, we prepare the technical tools needed in the following sections. By Mercer's theorem, there exist an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and a spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the following spectral representation:

$$k_m(x, x') = \sum_{k=1}^\infty \mu_{k,m}\,\phi_{k,m}(x)\,\phi_{k,m}(x'). \tag{2}$$

By this spectral representation, the inner product of the RKHS can be expressed as $\langle f_m, g_m \rangle_{\mathcal{H}_m} = \sum_{k=1}^\infty \mu_{k,m}^{-1} \langle f_m, \phi_{k,m} \rangle_{L_2(\Pi)} \langle \phi_{k,m}, g_m \rangle_{L_2(\Pi)}$.

Let $\mathcal{H} = \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$. For $f = (f_1, \dots, f_M) \in \mathcal{H}$ and a subset of indices $I \subseteq \{1, \dots, M\}$, we denote by $f_I$ the restriction of $f$ to the index set $I$, i.e., $f_I = (f_m)_{m \in I}$. We denote by $I_0$ the indices of the truly active kernels, i.e., $I_0 = \{ m \mid \|f^*_m\|_{\mathcal{H}_m} > 0 \}$, and define the complement of $I_0$ as $J_0 = I_0^c$. Throughout the paper, we assume the following technical conditions (see also Bach (2008)).

Footnote 1: For simplicity, we focus on the squared-loss function here. However, we note that it is straightforward to extend our convergence analysis and support consistency results given in Sections 3 and 4 to general loss functions that are strongly convex and Lipschitz continuous, following the line of Koltchinskii and Yuan (2008).

Table 1: Summary of the constants used in this article.
  $M$ : the number of candidate kernels.
  $d$ : the number of active kernels of the truth, i.e., $d = |I_0|$.
  $R$ : the upper bound of $\sum_{m=1}^M (\|f^*_m\|_{\mathcal{H}_m} + \|f^*_m\|_{\mathcal{H}_m}^2)$; see (A4).
  $s$ : the spectral decay coefficient; see (A5).
  $\beta$ : the approximate sparsity coefficient; see (A7).
  $b$ : the parameter that tunes the correlation between kernels; see (A8).

Assumption 1 (Basic Assumptions)
(A1) There exists $f^* = (f^*_1, \dots, f^*_M) \in \mathcal{H}$ such that $\mathrm{E}[Y|X] = \sum_{m=1}^M f^*_m(X)$, and the noise $\epsilon := Y - f^*(X)$ has a strictly positive variance: there exists $\sigma > 0$ such that $\mathrm{E}[\epsilon^2|X] > \sigma^2$ for all $X \in \mathcal{X}$. We also assume that $\epsilon$ is bounded as $|\epsilon| \le L$.
(A2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable and $\sup_{X \in \mathcal{X}} |k_m(X, X)| < 1$.
(A3) There exists $g^*_m \in \mathcal{H}_m$ such that

$$f^*_m(x) = \int_{\mathcal{X}} k_m^{(1/2)}(x, x')\, g^*_m(x')\, \mathrm{d}\Pi(x') \quad (\forall m = 1, \dots, M), \tag{3}$$

where $k_m^{(1/2)}(x, x') = \sum_{k=1}^\infty \mu_{k,m}^{1/2} \phi_{k,m}(x)\phi_{k,m}(x')$ is the operator square root of $k_m$.

The first assumption in (A1) ensures that the model $\mathcal{H}$ is correctly specified, and the technical assumption $|\epsilon| \le L$ allows $\epsilon f$ to be Lipschitz continuous with respect to $f$. It is known that assumption (A2) gives the following relation:

$$\|f_m\|_\infty \le \sup_x \langle k_m(x,\cdot), f_m \rangle_{\mathcal{H}_m} \le \sup_x \|k_m(x,\cdot)\|_{\mathcal{H}_m} \|f_m\|_{\mathcal{H}_m} \le \sup_x \sqrt{k_m(x,x)}\,\|f_m\|_{\mathcal{H}_m} \le \|f_m\|_{\mathcal{H}_m}.$$

The assumption (A3) was used in Caponnetto and de Vito (2007) and also in Bach (2008). It ensures the consistency of the least-squares estimates in terms of the RKHS norm.
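The following small numerical check (an added sketch, not part of the original text) illustrates the spectral representation (2) and the induced inner-product formula in their empirical form: the eigensystem of the normalized Gram matrix $K/n$ plays the role of $\{\mu_k, \phi_k\}$ under the empirical measure, and the weighted sum $\sum_k \mu_k^{-1}\langle f, \phi_k\rangle^2$ recovers the squared RKHS norm. For $f = k(\cdot, x_0)$ with a Gaussian kernel this should return approximately $k(x_0, x_0) = 1$.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200, 1)
K = np.exp(-(X - X.T) ** 2)            # Gaussian Gram matrix, k(x, x) = 1
n = len(X)
lam, U = np.linalg.eigh(K)             # K = U diag(lam) U^T, ascending order
mu, U = lam[::-1] / n, U[:, ::-1]      # empirical spectrum mu_k of Eq. (2)
# phi_k at the sample points is sqrt(n) * U[:, k]: orthonormal in empirical L2.
f = K[:, 0]                            # values of f = k(., x_0) at the sample
coef = U.T @ f / np.sqrt(n)            # <f, phi_k> under the empirical L2(Pi)
keep = mu > 1e-10                      # drop the numerically-zero tail
print(np.sum(coef[keep] ** 2 / mu[keep]))   # ~ k(x_0, x_0) = 1 (RKHS norm^2)
lead = np.clip(mu[:20], 1e-15, None)   # leading eigenvalues, clipped for log
slope, _ = np.polyfit(np.log(np.arange(1, 21)), np.log(lead), 1)
print(-1.0 / slope)                    # crude estimate of s: mu_k <~ k^(-1/s)
```

The log-log slope of the leading eigenvalues also gives a crude estimate of the decay coefficient $s$ listed in Table 1 and formalized in the spectral assumption (A5) below; for a Gaussian kernel the eigenvalues decay nearly exponentially, so the fitted $s$ is close to 0, corresponding to a 'simple' RKHS.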
Using the spectral representation (2), the condition $g^*_m \in \mathcal{H}_m$ is expressed as

$$\|g^*_m\|_{\mathcal{H}_m}^2 = \sum_{k=1}^\infty \mu_{k,m}^{-2} \langle f^*_m, \phi_{k,m} \rangle_{L_2(\Pi)}^2 < \infty. \tag{4}$$

This condition was also assumed in Koltchinskii and Yuan (2008). Proposition 9 of Bach (2008) gave a sufficient condition for (3) to hold for translation-invariant kernels $k_m(x, x') = h_m(x - x')$. The constants we use later are summarized in Table 1.

3 Convergence Rate of Elastic-net MKL

In this section, we derive the convergence rate of elastic-net MKL in two situations: (i) a sparse situation, where the truth $f^*$ is sparse (Section 3.1); (ii) a near-sparse situation, where the truth is not exactly sparse but $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially as $m$ increases (Section 3.2). For (i), we show that elastic-net MKL (and block-$\ell_1$ MKL) achieves a faster convergence rate than the rate shown for block-$\ell_1$ MKL in Koltchinskii and Yuan (2008). Furthermore, for (ii), we show that elastic-net MKL can outperform block-$\ell_1$ MKL and block-$\ell_2$ MKL depending on the sparsity of the truth and the conditioning of the problem. Throughout this section, we assume the following conditions.

Assumption 2 (Boundedness Assumption) There exist constants $C_1$ and $R$ such that

(A4) $\displaystyle \max_{m \in I_0}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} \le C_1, \qquad \sum_{m=1}^M \big(\|f^*_m\|_{\mathcal{H}_m} + \|f^*_m\|_{\mathcal{H}_m}^2\big) \le R.$

Assumption 3 (Spectral Assumption) There exist $0 < s < 1$ and $C_2$ such that

(A5) $\mu_{k,m} \le C_2\, k^{-\frac{1}{s}} \quad (1 \le \forall k,\ 1 \le \forall m \le M),$

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (2)).

The first assumption in (A4) appeared in Theorem 2 of Koltchinskii and Yuan (2008). The second assumption in (A4) bounds the amplitude of $f^*$. The spectral assumption (A5) was shown to be equivalent to the classical covering-number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $N(\epsilon, B_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $B_{\mathcal{H}_m}$ of $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A5) holds, there exists a constant $c$ that depends only on $s$ such that

$$N(\varepsilon, B_{\mathcal{H}_m}, L_2(\Pi)) \le c\,\varepsilon^{-2s}, \tag{5}$$

and the converse is also true (see Theorem 15 of Steinwart et al. (2009) and Steinwart (2008) for details). Therefore, if $s$ is large, at least one RKHS is 'complex', and if $s$ is small, the RKHSs are regarded as 'simple'.

For a given set of indices $I \subseteq \{1, \dots, M\}$, let $\kappa(I)$ be defined as follows:

$$\kappa(I) := \sup\Big\{ \kappa \ge 0 \;\Big|\; \kappa \le \frac{\|\sum_{m \in I} f_m\|_{L_2(\Pi)}^2}{\sum_{m \in I}\|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m \in I) \Big\}.$$

$\kappa(I)$ represents the correlation of the RKHSs inside the index set $I$. Similarly, we define the correlation of the RKHSs between $I$ and $I^c$ as follows:

$$\rho(I) := \sup\Big\{ \frac{\langle f_I, g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\|g_{I^c}\|_{L_2(\Pi)}} \;\Big|\; f_I \in \mathcal{H}_I,\ g_{I^c} \in \mathcal{H}_{I^c},\ f_I \ne 0,\ g_{I^c} \ne 0 \Big\}.$$

In Sections 3.1 and 3.2, we will assume that the kernels have no perfect canonical dependence, meaning that the kernels are not too similar to each other (see (A6) and (A8) below). Throughout this paper, we assume $\frac{\log(Mn)}{n} \le 1$ and that $\log(M)$ grows more slowly than any polynomial order in the number of samples $n$: $\log(M) = o(n^\epsilon)$ for all $\epsilon > 0$. With some abuse of notation, we use $C$ to denote constants that are independent of $d$ and $n$; its value may change from line to line.

3.1 Sparse Situation

Here we derive the convergence rate of the estimator $\hat{f}$ when the truth $f^*$ is sparse.
Let $d = |I_0|$, and suppose that the number of kernels $M$ and the number of active kernels $d$ can increase with the number of samples $n$. We further assume the following condition in this subsection.

Assumption 4 (Incoherence Assumption) There exists a constant $C_3 > 0$ such that

(A6) $0 < C_3^{-1} < \kappa(I_0)\big(1 - \rho^2(I_0)\big). \tag{6}$

This condition is known as the incoherence condition (Koltchinskii and Yuan, 2008, Meier et al., 2009); that is, the kernels are not too dependent on each other and the problem is well conditioned. Then we have the following convergence rate.

Theorem 1 Under assumptions (A1)–(A6), there exist constants $C$, $F$ and $K$ depending only on $\kappa(I_0)$, $\rho(I_0)$, $s$, $C_1$, $C_2$, $L$, and $R$ such that the $L_2(\Pi)$-norm of the residual $\hat{f} - f^*$ can be bounded as follows. When $d^{3+s} n^{-1} \le 1$, for $\lambda_1^{(n)} = \lambda_2^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{\tfrac{t}{n}},\ F\sqrt{\tfrac{\log(Mn)}{n}} \}$,

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C\Big( d\, n^{-\frac{2}{2+s}} + \frac{dt}{n} \Big), \tag{7}$$

and, when $d^{3+s} n^{-1} > 1$, for $\lambda_1^{(n)} = \max\{ K(1+\sqrt{t})\, n^{-\frac{1}{2}},\ F\sqrt{\tfrac{\log(Mn)}{n}} \}$ and $\lambda_2^{(n)} \le \lambda_1^{(n)}$,

$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le C\Big( d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + \frac{d(\log(Mn) + t)}{n} \Big), \tag{8}$$

where each inequality holds with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$.

The above theorem indicates that the learning rate depends on the complexity of the RKHSs (the simpler, the faster) and on the number of active kernels rather than the total number of kernels $M$ (the influence of $M$ is at most $\frac{d\log(M)}{n}$). It is worth noting that the convergence rates in (7) and (8) are faster than or equal to the rate for block-$\ell_1$ MKL shown by Koltchinskii and Yuan (2008), which established the learning rate $O_p\big( d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}} + \frac{d\log(M)}{n} \big)$ under the same conditions as ours (see Footnote 2).

Footnote 2: In our second bound (8), there is an additional $\frac{d\log(n)}{n}$ term. However, this can be eliminated by replacing the probability $1 - e^{-t} - n^{-1}$ with $1 - e^{-t} - M^{-A}$ as in Koltchinskii and Yuan (2008). Moreover, if $\sqrt{n}\,\log(n)^{-\frac{1+s}{2s}} \ge d$, then the term $\frac{d\log(n)}{n}$ is dominated by the first term $d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}}$.

3.2 Near-Sparse Situation

In this subsection, we analyze the convergence rate in a situation where $f^*$ is not sparse but near sparse. We showed a faster learning rate than existing bounds in the previous subsection; however, the assumptions used there might be too restrictive to capture the situations in which MKL is used in practice. In fact, it was pointed out by Zou and Hastie (2005), in the context of (non-block) $\ell_1$ regularization, that $\ell_1$ regularization can fail in the following situations:

• When the truth $f^*$ is not sparse, $\ell_1$ regularization shrinks many small but non-zero components to zero.
• When there are strong correlations between different kernels, the solution of block-$\ell_1$ MKL becomes unstable.
• When the number of kernels $M$ is not large, there is no need to force the estimator to be sparse.

In order to analyze these situations in the MKL setting, we introduce three parameters $\beta$, $b$, and $\tau$: $\beta$ controls the level of sparsity (see (A7)), $b$ controls the correlation between candidate kernels (see (A8)), and $\tau$ controls the growth of the number of kernels against the number of samples (see (A9)). We show that block-$\ell_2$ MKL is naturally preferable when there are only a few candidate kernels or the truth is dense.
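As a toy numerical illustration of these regimes (an added sketch, not an experiment from the paper; it reuses the hypothetical `elastic_net_mkl` function from the sketch in Section 2.1), one can generate a near-sparse truth in the sense of (A7) and simply switch the two regularization parameters:

```python
import numpy as np

rng = np.random.RandomState(1)
n, M, p = 200, 10, 5
Phis = [rng.randn(n, p) for _ in range(M)]
# Near-sparse truth: group norms decay polynomially, as in (A7).
w_true = [rng.randn(p) * (m + 1) ** -2.0 for m in range(M)]
y = sum(P @ w for P, w in zip(Phis, w_true)) + 0.1 * rng.randn(n)

for name, lam1, lam2 in [("block-l1", 0.1, 0.0),
                         ("block-l2", 0.0, 0.1),
                         ("elastic ", 0.05, 0.05)]:
    w_hat = elastic_net_mkl(Phis, y, lam1, lam2)
    err = sum(np.sum((a - b) ** 2) for a, b in zip(w_hat, w_true))
    print(name, "active groups:",
          sum(np.linalg.norm(w) > 1e-8 for w in w_hat), "error:", round(err, 3))
```

Which regularizer wins in such a comparison depends on the decay rate of the true group norms and on the number of groups; Theorems 2 and 3 below quantify exactly this trade-off.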
Importantly, if the candidate kernels are correlated, the convergence of block-$\ell_1$ MKL can be slow even when the truth is sparse. Our analysis shows that elastic-net MKL is most valuable in such intermediate situations.

By permuting indices, we can assume without loss of generality that $\|f^*_m\|_{\mathcal{H}_m}$ is decreasing with respect to $m$, i.e., $\|f^*_1\|_{\mathcal{H}_1} \ge \|f^*_2\|_{\mathcal{H}_2} \ge \|f^*_3\|_{\mathcal{H}_3} \ge \cdots$. We further assume the following conditions in this subsection.

Assumption 5 (Approximate Sparsity) The truth is approximately sparse, i.e., $\|f^*_m\|_{\mathcal{H}_m} > 0$ for all $m$ and thus $I_0 = \{1,\dots,M\}$; however, $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially with respect to $m$:

(A7) $\|f^*_m\|_{\mathcal{H}_m} \le C_3\, m^{-\beta}.$

We call $\beta\ (>1)$ the approximate sparsity coefficient.

Assumption 6 (Generalized Incoherence) There exist $b > 0$ and $C_4$ such that for all $I \subseteq \{1,\dots,M\}$,

(A8) $\big(1 - \rho^2(I)\big)\kappa(I) \ge C_4\, |I|^{-b}.$

Assumption 7 (Kernel-Set Growth) The number of kernels $M$ increases polynomially with respect to the number of samples $n$, i.e., there exists $\tau > 0$ such that

(A9) $M = \lceil n^\tau \rceil.$

For notational convenience, let

$$\tau_1 = \frac{1}{(2\beta+b)(2+s)-1-s},\quad \tau_2 = \frac{(s-1)(2\beta-1)+bs}{(2\beta+b)(2+s)-1-s},\quad \tau_3 = \frac{s\{2(b+\beta)-1\}}{2(2+s)(b+\beta)-s},\quad \tau_4 = \frac{s}{2+s},\quad \tau_5 = \frac{b+1}{(\beta+b)\{b(2+s)+2\}},\quad \tau_6 = \frac{1}{(1-s)(1+b)}.$$

In addition, we denote by $K$ some sufficiently large constant.

Theorem 2 Suppose that assumptions (A1–A5) and (A7–A9), $2\beta(1-s) < s(b-1)$, and $\tau_1 < \tau < \tau_4$ are satisfied. Then the estimator of elastic-net MKL possesses the following convergence rates, each of which holds with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:

1. When $\tau_1 < \tau < \tau_2$,
$$\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \le C\Big\{ n^{-\gamma_1} + \Big( n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t) \Big\},\quad \text{where } \gamma_1 = \frac{4\beta+b-2}{(2+s)(2\beta+b)-1-s}, \tag{9}$$
with $\lambda_1^{(n)} = \max\{ K n^{-\frac{3\beta+b-1}{(2\beta+b)(2+s)-1-s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K n^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}$.

2. When $\tau_2 \le \tau < \tau_3$,
$$\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \le C\Big\{ n^{\tau\frac{(2+s)b+2}{2\{(2+s)(b+\beta)-s\}}-\gamma_2} + \Big( n^{\frac{\tau(2+s)(1-\beta)-(4\beta+2b+sb-2)}{2\{(\beta+b)(2+s)-s\}}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t) \Big\},\quad \text{where } \gamma_2 = \frac{4\beta+b(2+s)-2}{2\{(2+s)(b+\beta)-s\}}, \tag{10}$$
with $\lambda_1^{(n)} = \max\{ K\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K n^{\frac{\tau-\{2(b+\beta)-1\}}{2\{(2+s)(b+\beta)-s\}}}$.

3. When $\tau_3 \le \tau < \tau_4$,
$$\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \le C\Big\{ n^{\tau\gamma_3-\gamma_3} + \Big( n^{\frac{\tau(\beta-1)+1-2\beta-b}{2(b+\beta)}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t) \Big\},\quad \text{where } \gamma_3 = \frac{b+2\beta-1}{2(b+\beta)}, \tag{11}$$
with $\lambda_1^{(n)} = \max\{ K\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K(M/n)^{\frac{2(b+\beta)-1}{4(b+\beta)}}$.

Theorem 3 Under assumptions (A1–A5) and (A7–A9), if $\tau_5 < \tau$, the estimator $\hat{f}_{\ell_1}$ of block-$\ell_1$ MKL has the following convergence rate with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:

$$\text{(block-}\ell_1\text{ MKL)}\qquad \|\hat{f}_{\ell_1}-f^*\|_{L_2(\Pi)}^2 \le C\Big( n^{-\gamma_4} + n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t) \Big),\quad \text{where } \gamma_4 = \frac{2\beta+b-1}{(\beta+b)(2+s)}, \tag{12}$$

with $\lambda_1^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = 0$.
Moreover, if $\tau < \tau_6$, the estimator $\hat{f}_{\ell_2}$ of block-$\ell_2$ MKL has the following convergence rate with probability at least $1 - e^{-t} - n^{-1}$ for all $t \ge \log\log(R\sqrt{n}) + \log M$:

$$\text{(block-}\ell_2\text{ MKL)}\qquad \|\hat{f}_{\ell_2}-f^*\|_{L_2(\Pi)}^2 \le C\Big( n^{\tau(b+\frac{2}{2+s})-\gamma_5} + \Big( \lambda_2^{(n)2} + \frac{M^{1+b}}{n} \Big)t \Big),\quad \text{where } \gamma_5 = \frac{2}{2+s}, \tag{13}$$

with $\lambda_2^{(n)} = \max\{ K(M/n)^{\frac{1}{2+s}},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_1^{(n)} = 0$.

In all convergence rates presented in Theorems 2 and 3, the leading terms are those that do not contain $t$. The terms containing $t$ converge faster than the leading terms and are thus negligible. By a simple calculation, we can confirm that elastic-net MKL always converges faster than block-$\ell_1$ MKL and block-$\ell_2$ MKL if $\beta$ and $M$ satisfy the condition of Theorem 2. The convergence rate of elastic-net MKL becomes identical to that of block-$\ell_2$ MKL and block-$\ell_1$ MKL at the two extreme points of the interval, $\tau = \tau_1$ and $\tau = \tau_4$, respectively. Outside this region, block-$\ell_1$ MKL or block-$\ell_2$ MKL has a faster convergence rate than elastic-net MKL. Moreover, at $\tau = \tau_2$, the convergence rates (9) and (10) of elastic-net MKL are identical, and at $\tau = \tau_3$, the convergence rates (10) and (11) are identical. The relation between the most preferred method and the growth rate $\tau$ of the number of kernels is illustrated in Figure 1.

[Figure 1 (figure omitted; it plots the convergence rate against the growth rate $\tau$ of the number of kernels, with the thresholds $\tau_1, \tau_2, \tau_3, \tau_4, \tau_5$ and the levels $-\gamma_1, -\gamma_4$ marking the elastic-net, block-$\ell_1$, and block-$\ell_2$ regimes.) Caption: Relation between the convergence rate and the number of kernels. If the truth is intermediately sparse (the growth rate $\tau$ of the number of kernels is between $\tau_1$ and $\tau_5$), then elastic-net MKL performs best. At the edges of the interval, the convergence rate of elastic-net MKL coincides with that of block-$\ell_1$ MKL or block-$\ell_2$ MKL.]

The condition $\tau_1 < \tau < \tau_4$ in Theorem 2 indicates that when the number of kernels is neither too small nor too large, the 'intermediate' effect of elastic-net MKL becomes advantageous. Roughly speaking, if $M$ is large, sparsity is needed to ensure convergence, and thus block-$\ell_1$ MKL performs best. On the other hand, if $M$ is small, there is no need to make the solution sparse, and thus block-$\ell_2$ MKL becomes the best. For intermediate $M$, elastic-net MKL becomes the best. The condition $2\beta(1-s) < s(b-1)$ in Theorem 2 ensures the existence of an $M$ that satisfies the condition of the theorem, i.e., $\tau_1 < \tau_2 < \tau_3 < \tau_4$. It can be seen that as $b$ becomes large (i.e., the problem becomes worse conditioned), the range of $\beta$ and $M$ in which elastic-net MKL performs better than block-$\ell_1$ MKL and block-$\ell_2$ MKL becomes larger. This indicates that the worse the conditioning of the problem, the more important it is to control the balance of $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$ appropriately.

4 Support Consistency of Elastic-net MKL

In this section, we derive necessary and sufficient conditions for the statistical support consistency of the estimated sparsity pattern, i.e., for the probability of $\{ m \mid \|\hat{f}_m\|_{\mathcal{H}_m} \ne 0 \} = I_0$ to go to 1 as the number of samples $n$ tends to infinity. Due to the additional squared regularization term, the necessary condition for the support consistency of elastic-net MKL is shown to be weaker than that for block-$\ell_1$ MKL (Bach, 2008). In this section, we assume that the true sparsity pattern $I_0$, the number of active kernels $d = |I_0|$, and the number of kernels $M$ are fixed independently of the number of samples $n$.

Let $\mathcal{H}_I$ be the restriction of $\mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_M$ to the index set $I$. Since $\mathrm{E}_X[k_m(X,X)] < \infty$ for all $m$ (from assumption (A2)), we can define the (non-centered) cross covariance operator $\Sigma_{I,J}: \mathcal{H}_I \to \mathcal{H}_J$ as a bounded linear operator such that (see Footnote 3)

$$\langle f_I, \Sigma_{I,J}\, g_J \rangle_{\mathcal{H}_I} = \sum_{m \in I}\sum_{m' \in J} \langle f_m, \Sigma_{m,m'}\, g_{m'} \rangle_{\mathcal{H}_m} = \sum_{m \in I}\sum_{m' \in J} \mathrm{E}_X[f_m(X)\, g_{m'}(X)], \tag{14}$$

for all $f_I = (f_m)_{m\in I} \in \mathcal{H}_I$ and $g_J = (g_{m'})_{m'\in J} \in \mathcal{H}_J$. See Baker (1973) for the details of the cross covariance operator $(f,g) \mapsto \mathrm{cov}(f(X), g(X))$. Moreover, we define the bounded (non-centered) cross-correlation operators $V_{l,m}$ by $\Sigma_{l,l}^{1/2} V_{l,m} \Sigma_{m,m}^{1/2} = \Sigma_{l,m}$ (see Footnote 4). The joint cross-correlation operator $V_{I,J}: \mathcal{H}_J \to \mathcal{H}_I$ is defined analogously to $\Sigma_{I,J}$. In this section, we assume, in addition to the basic assumptions (A1–A3), that

(A10) all $V_{l,m}$ are compact and the joint correlation operator $V$ is invertible.

Let $\hat{I}$ be the indices of the active kernels of the estimate $\hat{f} \in \mathcal{H}$ by elastic-net MKL: $\hat{I} := \{ m \mid \|\hat{f}_m\|_{\mathcal{H}_m} > 0 \}$. Let $D := \mathrm{Diag}(\|f^*_m\|_{\mathcal{H}_m}^{-1}) = \mathrm{Diag}\big((\|f^*_m\|_{\mathcal{H}_m}^{-1})_{m\in I_0}\big)$, where $\mathrm{Diag}(\cdot)$ denotes the $|I_0| \times |I_0|$ block-diagonal operator with the operators $\|f^*_m\|_{\mathcal{H}_m}^{-1} I_{\mathcal{H}_m}$ on the diagonal blocks for $m \in I_0$. The norm of $f \in \mathcal{H}$ is defined by $\|f\|_{\mathcal{H}} := \sqrt{\sum_{m=1}^M \|f_m\|^2_{\mathcal{H}_m}}$, and similarly that of $f_I \in \mathcal{H}_I$ is defined by $\|f_I\|_{\mathcal{H}_I} := \sqrt{\sum_{m\in I}\|f_m\|^2_{\mathcal{H}_m}}$. The following theorem gives a sufficient condition for the consistency of sparsity patterns.

Theorem 4 Suppose $\lambda_2^{(n)} > 0$, $\lambda_1^{(n)} \to 0$, $\lambda_2^{(n)} \to 0$, $\lambda_1^{(n)}\sqrt{n} \to \infty$, and

$$\limsup_n \Big\| \Sigma_{m,I_0}\big(\Sigma_{I_0,I_0} + \lambda_2^{(n)}\big)^{-1}\Big( D + \frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}} \Big) f^*_{I_0} \Big\|_{\mathcal{H}_m} < 1 \quad (\forall m \in J = I_0^c). \tag{15}$$

Then, under assumptions (A1–A3, A10), $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat{I} \xrightarrow{p} I_0$ (see Footnote 5).

The condition $\lambda_2^{(n)} > 0$ is just for technical simplicity, to make $\Sigma_{I_0,I_0} + \lambda_2^{(n)}$ invertible. The condition $\lambda_1^{(n)}\sqrt{n} \to \infty$ means that $\lambda_1^{(n)}$ does not decrease too quickly. Condition (15) corresponds to an infinite-dimensional extension of the elastic-net 'irrepresentable' condition. In the paper of Zhao and Yu (2006), the irrepresentable condition was derived as a necessary and sufficient condition for the sign consistency of $\ell_1$ regularization when the number of parameters is finite. Its elastic-net version was derived in Yuan and Lin (2007), and it was extended to a situation where the number of parameters diverges as $n$ increases (Jia and Yu, 2010). We also have a necessary condition for consistency, stated as Theorem 5 below.

Footnote 3: If one fits a function with a constant offset ($f(x) + b$ instead of $f(x)$) as in Bach (2008), then the centered version of the cross covariance operator is required instead of the non-centered version, i.e., $\langle f_m, \Sigma_{m,m'}\, g_{m'}\rangle_{\mathcal{H}_m} = \mathrm{E}_X[(f_m(X) - \mathrm{E}_X[f_m])(g_{m'}(X) - \mathrm{E}_X[g_{m'}])]$. However, this difference is not essential because, without loss of generality, one can consider a situation where $\mathrm{E}_Y[Y] = 0$ and $\mathrm{E}_X[f_m(X)] = 0$ for all $f_m \in \mathcal{H}_M$ by centering all the functions.

Footnote 4: Actually, such a bounded operator always exists (Baker, 1973).
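Before stating the necessary condition, the following finite-dimensional sketch (an added illustration, not from the paper) evaluates the left-hand side of the irrepresentable-type condition (15) when each 'kernel' is a single coordinate, so that the covariance operators become matrices and $D = \mathrm{Diag}(|w_m|^{-1})$; the function name and the design below are hypothetical.

```python
import numpy as np

def enet_irrepresentable(Sigma, w, active, lam1, lam2):
    """Return the LHS of condition (15) for every inactive coordinate."""
    A = np.asarray(active)
    J = np.setdiff1d(np.arange(Sigma.shape[0]), A)
    D = np.diag(1.0 / np.abs(w[A]))                 # D = Diag(||w_m||^{-1})
    inner = np.linalg.solve(Sigma[np.ix_(A, A)] + lam2 * np.eye(len(A)),
                            (D + 2.0 * lam2 / lam1 * np.eye(len(A))) @ w[A])
    return np.abs(Sigma[np.ix_(J, A)] @ inner)      # must all be < 1

Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])                 # design covariance
w = np.array([1.0, -1.0, 0.0])                      # truth active on {0, 1}
print(enet_irrepresentable(Sigma, w, [0, 1], lam1=0.1, lam2=0.1))  # [0.5]
```

For this design the single inactive coordinate satisfies the condition with margin ($0.5 < 1$); making the inactive coordinate strongly and asymmetrically correlated with the active block, e.g., replacing the off-diagonal entries $0.3, 0.2$ by $0.9, 0.0$, drives the value to $4.5 > 1$, so the condition fails.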
Footnote 5: For random variables $x_n$ and $y$, $x_n \xrightarrow{p} y$ means convergence in probability, i.e., the probability of $|x_n - y| > \epsilon$ goes to 0 for every $\epsilon > 0$ as the number of samples $n$ tends to infinity.

Theorem 5 If $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$ and $\hat{I} \xrightarrow{p} I_0$, then under assumptions (A1–A3, A10) there exist sequences $\lambda_1^{(n)}, \lambda_2^{(n)} \to 0$ such that

$$\limsup_n \Big\| \Sigma_{m,I_0}\big(\Sigma_{I_0,I_0} + \lambda_2^{(n)}\big)^{-1}\Big( D + \frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}} \Big) f^*_{I_0} \Big\|_{\mathcal{H}_m} \le 1 \quad (\forall m \in J = I_0^c). \tag{16}$$

Moreover, such $\lambda_1^{(n)}$ satisfies $\lambda_1^{(n)}\sqrt{n} \to \infty$.

The sufficient condition (15) contains the strict inequality ('<'), while similar conditions for ordinary (non-block) $\ell_1$ regularization or ordinary (non-block) elastic-net regularization contain the weak inequality ('≤'). The strict inequality appears because each block contains multiple variables in group lasso and MKL (Bach, 2008). The condition $\lambda_1^{(n)}\sqrt{n} \to \infty$ is necessary to obtain the RKHS-norm convergence $\|\hat{f} - f^*\|_{\mathcal{H}} \xrightarrow{p} 0$. Roughly speaking, this means that the block-$\ell_1$ regularization term should be stronger than the noise level, to suppress fluctuations caused by noise.

It is worth noting that the conditions (15) and (16) are weaker than the conditions for block-$\ell_1$ MKL presented in Bach (2008); the block-$\ell_1$ MKL irrepresentable conditions are (see Footnote 6)

$$\text{(Sufficient condition)}\quad \big\| \Sigma_{m,m}^{1/2} V_{m,I_0} V_{I_0,I_0}^{-1} D\, g^*_{I_0} \big\|_{\mathcal{H}_m} < 1 \quad (\forall m \in J), \qquad \text{(Necessary condition)}\quad \big\| \Sigma_{m,m}^{1/2} V_{m,I_0} V_{I_0,I_0}^{-1} D\, g^*_{I_0} \big\|_{\mathcal{H}_m} \le 1 \quad (\forall m \in J). \tag{17}$$

This is because the group-$\ell_2$ regularization term eases the singularity of the problem. Examples where elastic-nets successfully estimate the true sparsity pattern while $\ell_1$ regularization fails, in parametric situations, can be found in Jia and Yu (2010).

5 Conclusions

We provided three novel theoretical results on the support consistency and convergence rate of elastic-net MKL. (i) Elastic-net MKL was shown to be support consistent under a milder condition than block-$\ell_1$ MKL. (ii) A tighter convergence rate than existing bounds was derived for the situation where the truth is sparse. (iii) The convergence rates of block-$\ell_1$ MKL, elastic-net MKL, and block-$\ell_2$ MKL when the truth is near sparse were elucidated, and elastic-net MKL was shown to perform better when the decrease rate $\beta$ is not large or the problem is badly conditioned. Based on our theoretical findings, we conclude that the use of elastic-net regularization is recommended for MKL.

Elastic-net MKL can be regarded as 'intermediate' between block-$\ell_1$ MKL and block-$\ell_2$ MKL. Another popular intermediate variant is block-$\ell_p$ MKL for $1 \le p \le 2$ (Kloft et al., 2009, Cortes et al., 2009). Elastic-net MKL and block-$\ell_p$ MKL are conceptually similar, but they have a notable difference: elastic-net MKL with $\lambda_1^{(n)} > 0$ tends to produce sparse solutions, while block-$\ell_p$ MKL with $1 < p \le 2$ always produces dense solutions (i.e., all combination coefficients of the kernels are non-zero). The sparsity of elastic-net MKL would be advantageous when the true kernel combination is sparse, as we proved in this paper. However, when the true kernel combination is non-sparse, the difference/relation between elastic-net MKL and block-$\ell_p$ MKL is not yet clear. This needs to be further investigated in future work.

A Proofs of the theorems

For a function $f$ on $\mathcal{X} \times \mathbb{R}$, we define $P_n f := \frac{1}{n}\sum_{i=1}^n f(x_i, y_i)$ and $P f := \mathrm{E}_{X,Y}[f(X,Y)]$.
For a function $f_I \in \mathcal{H}_I$, we define $\|f_I\|_{\ell_1} := \sum_{m\in I}\|f_m\|_{\mathcal{H}_m}$, and for $f \in \mathcal{H}$ we write $\|f\|_{\ell_1} := \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}$. Similarly, we define $\|f_I\|_{\ell_2}$ by $\|f_I\|^2_{\ell_2} := \sum_{m\in I}\|f_m\|^2_{\mathcal{H}_m}$ for $f_I \in \mathcal{H}_I$, and for $f \in \mathcal{H}$ we write $\|f\|^2_{\ell_2} := \sum_{m=1}^M \|f_m\|^2_{\mathcal{H}_m}$. We write $\max\{a,b\}$ as $a \vee b$.

Footnote 6: Note that in the original paper by Bach (2008), the RHS of (17) is $\sum_{m\in I_0}\|f^*_m\|_{\mathcal{H}_m}$ because the squared group-$\ell_1$ regularizer $(\sum_m \|f_m\|_{\mathcal{H}_m})^2$ was used. We can show that the squared formulation is actually equivalent to the non-squared formulation, in the sense that there exists a one-to-one correspondence between the two formulations.

Lemma 6 For all $I \subseteq \{1,\dots,M\}$, we have

$$\|f\|^2_{L_2(\Pi)} \ge \big(1-\rho(I)^2\big)\kappa(I)\Big( \sum_{m\in I}\|f_m\|^2_{L_2(\Pi)} \Big). \tag{18}$$

Proof: For $J = I^c$, we have

$$P f^2 = \|f_I\|^2_{L_2(\Pi)} + 2\langle f_I, f_J\rangle_{L_2(\Pi)} + \|f_J\|^2_{L_2(\Pi)} \ge \|f_I\|^2_{L_2(\Pi)} - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|^2_{L_2(\Pi)} \ge \big(1-\rho(I)^2\big)\|f_I\|^2_{L_2(\Pi)} \ge \big(1-\rho(I)^2\big)\kappa(I)\Big(\sum_{m\in I}\|f_m\|^2_{L_2(\Pi)}\Big), \tag{19}$$

where we used Schwarz's inequality in the last line. ∎

The following lemma gives an upper bound on $\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}$ that holds with high probability. This is an extension of Theorem 1 of Koltchinskii and Yuan (2008). The proof is given in Appendix B.

Lemma 7 There exists a constant $F$ depending only on $L$ in (A1) such that, if $\lambda_1^{(n)} \ge F\sqrt{\frac{\log(Mn)}{n}}$, then, for $r = \frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}}$, with probability $1-n^{-1}$,

$$\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m} \le M^{\frac{1-r}{2-r}}\Big( 3\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m} + 3\sum_{m=1}^M\|f^*_m\|^2_{\mathcal{H}_m} \Big)^{\frac{1}{2-r}}.$$

Moreover, if $\lambda_2^{(n)} \ge F\sqrt{\frac{\log(Mn)}{n}}$ and $\lambda_2^{(n)} \ge \lambda_1^{(n)}$, we have

$$\sum_{m=1}^M \|\hat{f}_m - f^*_m\|_{\mathcal{H}_m} \le M\Big( 3/2 + 2\max_m\|f^*_m\|_{\mathcal{H}_m} \Big).$$

The following lemma gives a basic inequality that is the starting point of the subsequent analyses. The proof is given in Appendix B.

Lemma 8 Suppose $\lambda_1^{(n)} \vee \lambda_2^{(n)} \ge F\sqrt{\frac{\log(Mn)}{n}}$, where $F$ is the constant appearing in Lemma 7. Then there exist constants $\tilde{K}_1$ and $\tilde{K}_2$, depending only on $L$ in (A1), $R$ in (A4), and $s$ and $C_2$ in (A5), such that for all $I \subseteq \{1,\dots,M\}$ and all $t \ge \log\log(R\sqrt{n}) + \log M$, with probability at least $1-e^{-t}-n^{-1}$,

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}-\hat{\gamma}_n-\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}$$
$$\le \tilde{K}_1\big(1+\|\hat{f}-f^*\|_{\ell_1}\big)\Big( \sum_{m\in I}\Big( \frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\,\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}, \tag{20}$$

where $J = I^c$, $\gamma_n := \tilde{K}_1/\sqrt{n}$, and $\hat{\gamma}_n := \gamma_n(1+\|\hat{f}-f^*\|_\infty)$.

The above lemma is derived by the peeling device (localization method). Details of these techniques can be found in, for example, Bartlett et al. (2005), Koltchinskii (2006), Mendelson (2002), and van de Geer (2000).

Proof: (Theorem 1) Since $\lambda_1^{(n)} \ge F\sqrt{\log(Mn)/n}$, we may assume that inequality (20) holds with $I = I_0$. For notational simplicity, $I$ denotes $I_0$ in this proof. In addition, since $\lambda_1^{(n)} \ge \lambda_2^{(n)}$, we have $\|\hat{f}\|_\infty \le \sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m} \le 3R$ (with probability $1-n^{-1}$) by Lemma 7.
Note that $\|f^*_m\|_{\mathcal{H}_m} = 0$ for all $m \in J = I^c = I_0^c$, and $\hat{\gamma}_n + \tilde{K}_2\sqrt{t/n} \le \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \} = \lambda_1^{(n)}$ by taking $K$ sufficiently large. Therefore, by inequality (20), we have

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\|\hat{f}_I-f^*_I\|^2_{\ell_2} \le K_1\Big( \sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t}{n} \Big) + \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{21}$$

where $K_1$ is $\tilde{K}_1(1+3R)$ (here we omitted the term $\sum_{m\in I} n^{-\frac{1}{1+s}}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}$ for simplicity; one can show that this term is negligible). By Hölder's inequality, the first term on the RHS of the above inequality can be bounded as

$$K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \le K_1\frac{\big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\|\hat{f}_I-f^*_I\|_{\ell_1}\big)^s}{\sqrt{n}} \le \sqrt{d}\,K_1\frac{\big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\big)^{\frac{1-s}{2}}\big(\|\hat{f}_I-f^*_I\|^2_{\ell_2}\big)^{\frac{s}{2}}}{\sqrt{n}}.$$

Applying Young's inequality, the last term can be bounded by

$$K_1\frac{(\lambda_2^{(n)}/2)^{-\frac{s}{2}}\sqrt{d}}{\sqrt{n}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\Big)^{\frac{1-s}{2}} \times (\lambda_2^{(n)}/2)^{\frac{s}{2}}\big(\|\hat{f}_I-f^*_I\|^2_{\ell_2}\big)^{\frac{s}{2}}$$
$$\le C\big( n^{-\frac12}\sqrt{d}\,\lambda_2^{(n)-\frac{s}{2}} \big)^{\frac{2}{2-s}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\Big)^{\frac{1-s}{2-s}} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|^2_{\ell_2}$$
$$\le C\big[(1-\rho^2(I))\kappa(I)\big]^{-1} n^{-1} d\,\lambda_2^{(n)-s} + \frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|^2_{\ell_2}$$
$$\le C n^{-1} d\,\lambda_2^{(n)-s} + \frac18\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|^2_{\ell_2}, \tag{22}$$

where $C$ denotes a constant that is independent of $d$ and $n$ and may change between contexts, and where we used Lemma 6 in the last line. Similarly, by the inequality of arithmetic and geometric means, we obtain the bound

$$\sum_{m\in I}\Big( 2\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + \lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}$$
$$\le C\big[(1-\rho^2(I))\kappa(I)\big]^{-1}\sum_{m\in I}\Big\{ \Big(\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\Big)^2\lambda_1^{(n)2} + \|g^*_m\|^2_{\mathcal{H}_m}\lambda_2^{(n)2} + \frac{t}{n} \Big\} + \frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}$$
$$\le C\big( d\lambda_1^{(n)2} + \lambda_2^{(n)2} + dt/n \big) + \frac18\|\hat{f}-f^*\|^2_{L_2(\Pi)}, \tag{23}$$

where we used Lemma 6 in the last line. Substituting (22) and (23) into (21), we have

$$\frac14\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( d n^{-1}\lambda_2^{(n)-s} + d\lambda_1^{(n)2} + \lambda_2^{(n)2} + \frac{(d+1)t}{n} \Big). \tag{24}$$

The minimum of the RHS with respect to $\lambda_1^{(n)}, \lambda_2^{(n)}$ under the constraint $\lambda_1^{(n)} \ge \lambda_2^{(n)}$ is achieved by $\lambda_1^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ and $\lambda_2^{(n)} = K n^{-\frac{1}{2+s}}$ up to constants. Thus we have the first assertion (7).

Next we show the second assertion (8). By Hölder's inequality and Young's inequality, we have

$$K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \le K_1\frac{\big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\|\hat{f}_I-f^*_I\|_{\ell_1}\big)^s}{\sqrt{n}}$$
$$\le C\tilde{\lambda}^{-\frac{s}{1-s}} n^{-\frac{1}{2(1-s)}}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \frac{\tilde{\lambda}}{2}\|\hat{f}_I-f^*_I\|_{\ell_1} \le C d\,\tilde{\lambda}^{-\frac{2s}{1-s}} n^{-\frac{1}{1-s}} + \frac18\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \frac{\tilde{\lambda}}{2}\big(\|\hat{f}_I\|_{\ell_1} + \|f^*_I\|_{\ell_1}\big), \tag{25}$$

where $\tilde{\lambda} > 0$ is an arbitrary positive real. Substituting (25) and (23) into (21), we have

$$\frac14\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( d\,\tilde{\lambda}^{-\frac{2s}{1-s}} n^{-\frac{1}{1-s}} + \tilde{\lambda} + d\lambda_1^{(n)2} + \lambda_2^{(n)2} + \frac{(d+1)t}{n} \Big).$$

This is minimized by $\tilde{\lambda} = C d^{\frac{1-s}{1+s}} n^{-\frac{1}{1+s}}$, $\lambda_1^{(n)} = \big( \frac{2\tilde{K}_1(1+3R)}{\sqrt{n}} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \big) \vee F\sqrt{\frac{\log(Mn)}{n}} \ge \big( 2\hat{\gamma}_n + \tilde{K}_2\sqrt{\tfrac{t}{n}} \big) \vee F\sqrt{\frac{\log(Mn)}{n}}$, and $\lambda_2^{(n)} \le \lambda_1^{(n)}$.
Thus we obtain the assertion. ∎

Proof: (Theorem 2) Let $I_d := \{1,\dots,d\}$ and $J_d = I_d^c = \{d+1,\dots,M\}$. By assumption (A7), we have $\sum_{m\in J_d}\|f^*_m\|^2_{\mathcal{H}_m} \le \frac{C_3^2}{2\beta-1} d^{1-2\beta}$ and $\sum_{m\in J_d}\|f^*_m\|_{\mathcal{H}_m} \le \frac{C_3}{\beta-1} d^{1-\beta}$. Therefore Lemma 8 gives

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + \lambda_2^{(n)}\|\hat{f}_{J_d}\|^2_{\ell_2}$$
$$\le K_1\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big) + K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I_d}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + C\Big( \lambda_2^{(n)} d^{1-2\beta} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}} \Big) d^{1-\beta} \Big), \tag{26}$$

if $\lambda_1^{(n)} > \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ and $\lambda_1^{(n)} \ge F\sqrt{\log(Mn)/n}$. The second term can be upper bounded as

$$K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$\overset{\text{Hölder}}{\le} K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big\{ \frac{\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)^s}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big\}$$
$$= K_1\frac{\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}\big)^{1-s}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)^s}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{Jensen}}{\le} K_1\frac{d^{\frac{1-s}{2}}\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|^2_{L_2(\Pi)}\big)^{\frac{1-s}{2}} M^{\frac12}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m}\big)^{\frac12} d^{\frac{s}{2}}\big(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m}\big)^{\frac{s}{2}}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{Lemma 6}}{\le} K_1\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{2}}\frac{\big(\|\hat{f}-f^*\|^2_{L_2(\Pi)}\big)^{\frac{1-s}{2}} d^{\frac12} M^{\frac12}\|\hat{f}-f^*\|^{1+s}_{\ell_2}}{\sqrt{n}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{Young}}{\le} \frac{\|\hat{f}-f^*\|^2_{L_2(\Pi)}}{2} + C\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{1+s}}\frac{d^{\frac{1}{1+s}} M^{\frac{1}{1+s}}\|\hat{f}-f^*\|^2_{\ell_2}}{n^{\frac{1}{1+s}}} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}$$
$$\overset{\text{(A8)}}{\le} \frac{\|\hat{f}-f^*\|^2_{L_2(\Pi)}}{2} + C\frac{d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|^2_{\ell_2} + \frac{t\|\hat{f}-f^*\|^2_{\ell_1}}{n}.$$

We will see that we may assume $C d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}} n^{-\frac{1}{1+s}} \le \frac{\lambda_2^{(n)}}{4}$. Thus the second term on the RHS of the above inequality can be upper bounded as

$$C\frac{d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|^2_{\ell_2} \le \frac{\lambda_2^{(n)}}{4}\|\hat{f}-f^*\|^2_{\ell_2} \le \frac{\lambda_2^{(n)}}{4}\big( \|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + 2\|\hat{f}_{J_d}\|^2_{\ell_2} + 2\|f^*_{J_d}\|^2_{\ell_2} \big) \le \frac{\lambda_2^{(n)}}{2}\big( \|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + \|\hat{f}_{J_d}\|^2_{\ell_2} + \|f^*_{J_d}\|^2_{\ell_2} \big). \tag{27}$$

Moreover, Lemma 7 gives $\frac{\|\hat{f}-f^*\|_{\ell_1}}{n} \le \frac{C\sqrt{R}\,M}{n} \le C\lambda_2^{(n)2}$ and $\frac{\|\hat{f}-f^*\|^2_{\ell_1}}{n} \le \frac{C R M^2}{n} \le C_R\,\lambda_2^{(n)2}$. Therefore (26) becomes

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_{I_d}-f^*_{I_d}\|^2_{\ell_2} + \frac{\lambda_2^{(n)}}{2}\|\hat{f}_{J_d}\|^2_{\ell_2} \le C\Big( \sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + t\lambda_2^{(n)2} \Big) + \sum_{m\in I_d}\Big( C_1\lambda_1^{(n)} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + C\Big( \lambda_2^{(n)} d^{1-2\beta} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}} \Big) d^{1-\beta} \Big).$$

As in the proof of Theorem 1 (using the relations (23) and (22)), we have

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big\{ \big[(1-\rho^2(I_d))\kappa(I_d)\big]^{-1}\Big( d n^{-1}\lambda_2^{(n)-s} + d\lambda_1^{(n)2} + \lambda_2^{(n)2} + \frac{t}{n} \Big) + \lambda_2^{(n)} d^{1-2\beta} + \big( \lambda_1^{(n)}+\hat{\gamma}_n+(t/n)^{\frac12} \big) d^{1-\beta} + t\lambda_2^{(n)2} \Big\}.$$

Now, using the assumption $(1-\rho^2(I_d))\kappa(I_d) \ge C_4 d^{-b}$, we have

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big[ d^{1+b} n^{-1}\lambda_2^{(n)-s} + d^{1+b}\lambda_1^{(n)2} + d^b\lambda_2^{(n)2} + \lambda_2^{(n)} d^{1-2\beta} + \big(\lambda_1^{(n)}+\hat{\gamma}_n\big) d^{1-\beta} + t\lambda_2^{(n)2} + d^{1-\beta}\sqrt{\tfrac{t}{n}} + \frac{d^{1+b} t}{n} \Big]. \tag{28}$$
Recall that $\hat{\gamma}_n = \tilde{K}_1(1+\|\hat{f}-f^*\|_\infty)/\sqrt{n}$. Since $\lambda_1^{(n)} \ge F\sqrt{\log(Mn)/n}$, Lemma 7 gives $\|\hat{f}-f^*\|_\infty \le \sqrt{M}\,3R + R \le c\sqrt{M}$ with probability $1-n^{-1}$ for some constant $c > 0$. Therefore $\hat{\gamma}_n \le c\sqrt{M/n}$. The values of $\lambda_1^{(n)}, \lambda_2^{(n)}$ presented in the statement are obtained by minimizing the RHS of Eq. (28) under the constraints $\lambda_1^{(n)} \ge c\sqrt{M/n} + \tilde{K}_2\sqrt{t/n} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ and $C d^{\frac{b(1-s)+1}{1+s}} M^{\frac{1}{1+s}} n^{-\frac{1}{1+s}} \le \frac{\lambda_2^{(n)}}{4}$.

i) Suppose $n^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}} > c\sqrt{M/n}$, i.e., $\tau \le \tau_2$. Then the RHS of the above inequality is minimized by $d = n^{\frac{1}{(2\beta+b)(2+s)-1-s}}$, $\lambda_2^{(n)} = K n^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}$, and $\lambda_1^{(n)} = \max\{ K n^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ up to constants independent of $n$, where the leading terms are $d^{1+b} n^{-1}\lambda_2^{(n)-s} + d^b\lambda_2^{(n)2} + \lambda_2^{(n)} d^{1-2\beta} + \lambda_1^{(n)} d^{1-\beta}$. It should be noted that $\lambda_1^{(n)}$ is greater than $\hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ because $n^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}} > c\sqrt{M/n} \ge \hat{\gamma}_n$; therefore (26) is valid. Using $\tau \le \tau_2$, we can show that $C d^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} \le \lambda_2^{(n)}/4$ by taking the constant $K$ sufficiently large; hence (27) is valid. Moreover, since $M > n^{\frac{1}{(2\beta+b)(2+s)-1-s}} = n^{\tau_1}$, we can take $d = n^{\frac{1}{(2\beta+b)(2+s)-1-s}} \le M$.

ii) Suppose $\tau_2 \le \tau \le \tau_3$. Then the RHS of the above inequality is minimized by $d = (M^{2+s} n^{2-s})^{\frac{1}{2\{(2+s)(b+\beta)-s\}}}$, $\lambda_2^{(n)} = K\big(M n^{-\{2(b+\beta)-1\}}\big)^{\frac{1}{2\{(2+s)(b+\beta-1)+2\}}}$, and $\lambda_1^{(n)} = \max\{ c\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$ up to constants independent of $n$, where the leading terms are $d^{1+b} n^{-1}\lambda_2^{(n)-s} + d^b\lambda_2^{(n)2} + \lambda_1^{(n)} d^{1-\beta}$. Since $\lambda_1^{(n)} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$, (26) is valid. Using $\tau \le \tau_3$, we can show that $C d^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} \le \lambda_2^{(n)}/4$ by taking the constant $K$ sufficiently large; hence (27) is valid. Moreover, since $\beta \le \frac{s(b-1)}{2(1-s)}$ and $\tau_2 \le \tau$, we can show that $d \le M$.

iii) Suppose $\tau_3 \le \tau \le \tau_4$. We take $\lambda_1^{(n)} = \max\{ c\sqrt{M/n} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$. Then the RHS of inequality (28) is minimized by $\lambda_2^{(n)} = K\sqrt{d}\,\lambda_1^{(n)} \sim \sqrt{dM/n}$ and $d = (n/M)^{\frac{1}{2(b+\beta)}}$ up to constants, where the leading terms are $d^b\lambda_2^{(n)2} + d^{1+b}\lambda_1^{(n)2} + \lambda_1^{(n)} d^{1-\beta}$. Note that since $\lambda_1^{(n)} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$, (26) is valid. Using $\tau \le \tau_4$, we can show that $C d^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} \le \lambda_2^{(n)}/4$ by taking the constant $K$ sufficiently large; hence (27) is valid. Moreover, since $\beta \le \frac{s(b-1)}{2(1-s)}$ and $n^{\tau_3} \le M$, we have $d = (n/M)^{\frac{1}{2(b+\beta)}} \le M$.

In all settings i) to iii), we can show that $\frac{d^{1-\beta}}{\sqrt{n}} \gtrsim \frac{d^{1+b}}{n}$. Thus the terms involving $t$ are upper bounded as

$$d^{1-\beta}\sqrt{\tfrac{t}{n}} + \frac{d^{1+b} t}{n} + t\lambda_2^{(n)2} \lesssim \Big( \frac{d^{1-\beta}}{\sqrt{n}} + \lambda_2^{(n)2} \Big)(\sqrt{t}+t).$$

Through a simple calculation, $\frac{d^{1-\beta}}{\sqrt{n}}$ is evaluated as i) $\frac{d^{1-\beta}}{\sqrt{n}} \simeq n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}}$, ii) $\frac{d^{1-\beta}}{\sqrt{n}} \simeq \big( M^{(2+s)(1-\beta)} n^{-(4\beta+2b+sb-2)} \big)^{\frac{1}{2\{(\beta+b)(2+s)-s\}}}$, and iii) $\frac{d^{1-\beta}}{\sqrt{n}} \simeq \big( M^{\beta-1} n^{1-2\beta-b} \big)^{\frac{1}{2(\beta+b)}}$, respectively. Thus we obtain the assertion. ∎

Proof: (Theorem 3) (Convergence rate of block-$\ell_1$ MKL) Note that since $\lambda_1^{(n)} > \lambda_2^{(n)} = 0$, we have $\frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}} = 1$.
Therefore Lemma 7 gives $\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m} \le 3R$ with probability $1-n^{-1}$. Thus $\hat{\gamma}_n = \gamma_n(1+\|\hat{f}-f^*\|_\infty) \le \gamma_n\big(1+\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m}+\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m}\big) \le \gamma_n(1+4R)$. When $\lambda_2^{(n)} = 0$ and $\lambda_1^{(n)} > (1+4R)\gamma_n + \tilde{K}_2\sqrt{t/n}$, as in Lemma 8 we have, with probability at least $1-e^{-t}-n^{-1}$,

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_1^{(n)}\sum_{m\in I}\|\hat{f}_m\|_{\mathcal{H}_m} \le K_1\Big( \sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t}{n} \Big) + \lambda_1^{(n)}\sum_{m\in I}\|f^*_m\|_{\mathcal{H}_m} + 2\lambda_1^{(n)}\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sum_{m\in I}\sqrt{\tfrac{t}{n}}\|f^*_m-\hat{f}_m\|_{L_2(\Pi)}, \tag{29}$$

for all $t \ge \log\log(R\sqrt{n}) + \log M$. We lower bound the term $\lambda_1^{(n)}\sum_{m\in I}\big(\|\hat{f}_m\|_{\mathcal{H}_m}-\|f^*_m\|_{\mathcal{H}_m}\big)$ appearing on the LHS of the above inequality. There exists $c_1 > 0$, depending only on $R$, such that

$$\|f_m\|_{\mathcal{H}_m} = \sqrt{\|f_m-f^*_m\|^2_{\mathcal{H}_m} + 2\langle f_m-f^*_m, f^*_m\rangle_{\mathcal{H}_m} + \|f^*_m\|^2_{\mathcal{H}_m}} \ge c_1\|f_m-f^*_m\|^2_{\mathcal{H}_m} - 2\|f^*_m\|^{-1}_{\mathcal{H}_m}\big|\langle f_m-f^*_m, f^*_m\rangle_{\mathcal{H}_m}\big| + \|f^*_m\|_{\mathcal{H}_m} \tag{30}$$

for all $f_m \in \mathcal{H}_m$ such that $\|f_m\|_{\mathcal{H}_m} \le 3R$ and $m \in I_0$. Recall that $f^*_m = T_m^{1/2} g^*_m$; then we have

$$\|f_m\|_{\mathcal{H}_m} \ge c_1\|f_m-f^*_m\|^2_{\mathcal{H}_m} - 2\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|f_m-f^*_m\|_{L_2(\Pi)} + \|f^*_m\|_{\mathcal{H}_m}.$$

Since $\max_m\|\hat{f}_m\|_{\mathcal{H}_m} \le 3R$ holds with probability $1-n^{-1}$,

$$\|\hat{f}_m\|_{\mathcal{H}_m} \ge c_1\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} - 2\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \|f^*_m\|_{\mathcal{H}_m},$$

with probability $1-n^{-1}$. Therefore, by inequality (29), we have, with probability at least $1-e^{-t}-n^{-1}$,

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_1^{(n)}\sum_{m\in I}\Big( c_1\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} - 2\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \|f^*_m\|_{\mathcal{H}_m} \Big)$$
$$\le K_1\Big( \sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} + \frac{t}{n} \Big) + \lambda_1^{(n)}\sum_{m\in I}\|f^*_m\|_{\mathcal{H}_m} + 2\lambda_1^{(n)}\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sum_{m\in I}\sqrt{\tfrac{t}{n}}\|f^*_m-\hat{f}_m\|_{L_2(\Pi)}, \tag{31}$$

for all $t \ge \log\log(R\sqrt{n}) + \log M$. Thus, using Young's inequality,

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( d^{1+b} n^{-1}\lambda_1^{(n)-s} + d^{1+b}\lambda_1^{(n)2} + 2\lambda_1^{(n)} d^{1-\beta} + \frac{t(1+d^{1+b})}{n} \Big).$$

The RHS is minimized by $d = n^{\frac{1}{(2+s)(\beta+b)}}$ and $\lambda_1^{(n)} = \max\{ K n^{-\frac{1}{2+s}} + \tilde{K}_2\sqrt{t/n},\ F\sqrt{\log(Mn)/n} \}$ (up to constants independent of $n$). Note that since the optimal $\lambda_1^{(n)}$ obtained above satisfies $\lambda_1^{(n)} > (1+4R)\gamma_n + \tilde{K}_2\sqrt{t/n}$ by taking $K$ sufficiently large, inequality (31) is valid. Moreover, the condition $M > n^{\tau_5} = n^{\frac{b+1}{(\beta+b)\{b(2+s)+2\}}}$ in the statement ensures $d < M$. Finally, we evaluate the terms including $t$, that is, $\frac{t}{n} d^{1+b} + \sqrt{\frac{t}{n}}\, d^{1-\beta}$. We can check that $\frac{1}{n} d^{1+b} \lesssim \sqrt{\frac{1}{n}}\, d^{1-\beta}$. Therefore those terms are upper bounded as

$$\frac{t}{n} d^{1+b} + \sqrt{\tfrac{t}{n}}\, d^{1-\beta} \lesssim \sqrt{\tfrac{1}{n}}\, d^{1-\beta}(\sqrt{t}+t) \simeq n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t).$$

Thus we obtain the assertion.

(Convergence rate of block-$\ell_2$ MKL) When $\lambda_1^{(n)} = 0$, substituting $I = \{1,\dots,M\}$ in Lemma 8 and using Young's inequality as in the proof of Theorem 2, the convergence rate of block-$\ell_2$ MKL can be evaluated as

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} \le C\Big( M^{1+b} n^{-1}\lambda_2^{(n)-s} + M^b\lambda_2^{(n)2} + t\lambda_2^{(n)2} + \frac{t}{n} M^{1+b} \Big), \tag{32}$$

with probability $1-e^{-t}-n^{-1}$ (note that since $I = \{1,\dots,M\}$, i.e., $I^c = \emptyset$, we do not need the condition $\lambda_1^{(n)} \ge \hat{\gamma}_n + \tilde{K}_2\sqrt{t/n}$). Then $\lambda_2^{(n)} = K(M/n)^{\frac{1}{2+s}} \vee F\sqrt{\log(Mn)/n}$ gives the minimum of the RHS with respect to $\lambda_2^{(n)}$ up to constants.
Using $\tau \le \tau_6$, we can show that $M^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}} = M^{\frac{b(1-s)+2}{1+s}} n^{-\frac{1}{1+s}} \lesssim \lambda_2^{(n)}$ by taking the constant $K$ sufficiently large; hence (27) is valid. ∎

B Proof of Lemmas 7 and 8

Proof: (Lemma 7) Since $\hat{f}$ minimizes the empirical risk (1), we have

$$\frac{1}{n}\sum_{i=1}^n\Big( \sum_{m=1}^M\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) \Big)^2 + \lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \le \frac{2}{n}\sum_{m=1}^M\sum_{i=1}^n \epsilon_i\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) + \lambda_1^{(n)}\|f^*\|_{\ell_1} + \lambda_2^{(n)}\|f^*\|^2_{\ell_2}. \tag{33}$$

By Proposition 1 (Bernstein's inequality in Hilbert spaces; see also Theorem 6.14 of Steinwart (2008), for example), there exists a universal constant $C$ such that

$$\frac{1}{n}\sum_{i=1}^n\epsilon_i\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) \le \Big\| \frac{1}{n}\sum_{i=1}^n\epsilon_i k_m(x_i,\cdot) \Big\|_{\mathcal{H}_m}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m} \le CL\sqrt{\frac{\log(Mn)}{n}}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m} \le CL\sqrt{\frac{\log(Mn)}{n}}\big(\|\hat{f}_m\|_{\mathcal{H}_m}+\|f^*_m\|_{\mathcal{H}_m}\big) \tag{34}$$

for all $m$, with probability at least $1-n^{-1}$, where we used the assumption $\frac{\log(Mn)}{n} \le 1$. If $\lambda_1^{(n)} \ge 4CL\sqrt{\frac{\log(Mn)}{n}}$, then we have

$$\lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \le 3\big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\big(\|f^*\|_{\ell_1}+\|f^*\|^2_{\ell_2}\big), \tag{35}$$

with probability at least $1-n^{-1}$. Set $r = \frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}}$; then, by Young's inequality and Jensen's inequality, the LHS of the above inequality (33) is lower bounded by

$$\lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \ge \big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\Big( \sum_{m=1}^M\|\hat{f}_m\|^{2-r}_{\mathcal{H}_m} \Big) = M\big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\Big( \frac{1}{M}\sum_{m=1}^M\|\hat{f}_m\|^{2-r}_{\mathcal{H}_m} \Big) \ge M^{r-1}\big(\lambda_1^{(n)}\vee\lambda_2^{(n)}\big)\|\hat{f}\|^{2-r}_{\ell_1}. \tag{36}$$

Therefore we have the first assertion by setting $F = 4CL$. The second assertion can be shown as follows: by inequality (33) we have

$$M^{-1}\lambda_2^{(n)}\big( \|\hat{f}-f^*\|_{\ell_1} \big)^2 \le \lambda_2^{(n)}\|\hat{f}-f^*\|^2_{\ell_2} \le \frac{2}{n}\sum_{m=1}^M\sum_{i=1}^n\epsilon_i\big(\hat{f}_m(x_i)-f^*_m(x_i)\big) + \lambda_1^{(n)}\|\hat{f}-f^*\|_{\ell_1} + 2\lambda_2^{(n)}\sum_{m=1}^M\langle f^*_m, f^*_m-\hat{f}_m\rangle_{\mathcal{H}_m} \le \lambda_2^{(n)}\Big( \frac32 + 2\max_m\|f^*_m\|_{\mathcal{H}_m} \Big)\|\hat{f}-f^*\|_{\ell_1} \tag{37}$$

with probability at least $1-n^{-1}$, where we used (34), $\lambda_2^{(n)} \ge 4CL\sqrt{\log(Mn)/n}$, and $\lambda_2^{(n)} \ge \lambda_1^{(n)}$ in the last inequality. ∎

Proof: (Lemma 8) In what follows, we assume $\|\hat{f}-f^*\|_{\ell_1} \le \bar{R}$, where $\bar{R} = 4MR$ (the probability of this event is greater than $1-n^{-1}$ by Lemma 7). Since $\hat{f}$ minimizes the empirical risk, we have

$$P_n(\hat{f}-Y)^2 + \lambda_1^{(n)}\|\hat{f}\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}\|^2_{\ell_2} \le P_n(f^*-Y)^2 + \lambda_1^{(n)}\|f^*\|_{\ell_1} + \lambda_2^{(n)}\|f^*\|^2_{\ell_2}$$
$$\Rightarrow\quad P(\hat{f}-f^*)^2 + \lambda_1^{(n)}\|\hat{f}_J\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}_J\|^2_{\ell_2} \le (P-P_n)\big((f^*-\hat{f})^2 + 2(\hat{f}-f^*)\epsilon\big) + \lambda_1^{(n)}\big(\|f^*_I\|_{\ell_1}-\|\hat{f}_I\|_{\ell_1}\big) + \lambda_2^{(n)}\big(\|f^*_I\|^2_{\ell_2}-\|\hat{f}_I\|^2_{\ell_2}\big) + \lambda_1^{(n)}\|f^*_J\|_{\ell_1} + \lambda_2^{(n)}\|f^*_J\|^2_{\ell_2}. \tag{38}$$

The second term on the RHS of the above inequality (38) can be bounded from above as

$$\|f^*_I\|_{\ell_1} - \|\hat{f}_I\|_{\ell_1} \le \sum_{m\in I}\big\langle \nabla\|f^*_m\|_{\mathcal{H}_m},\, \hat{f}_m-f^*_m \big\rangle_{\mathcal{H}_m} = \sum_{m\in I}\frac{\big\langle g^*_m, T_m^{1/2}(\hat{f}_m-f^*_m) \big\rangle_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} \le \sum_{m\in I}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{39}$$

where we used $f^*_m = T_m^{1/2} g^*_m$ for $m \in I \subseteq I_0$. We also have

$$\lambda_2^{(n)}\big(\|f^*_I\|^2_{\ell_2}-\|\hat{f}_I\|^2_{\ell_2}\big) = \lambda_2^{(n)}\Big( \sum_{m\in I}2\langle f^*_m, f^*_m-\hat{f}_m\rangle_{\mathcal{H}_m} - \|\hat{f}_I-f^*_I\|^2_{\ell_2} \Big) \le \lambda_2^{(n)}\Big( \sum_{m\in I}2\|g^*_m\|_{\mathcal{H}_m}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} - \|\hat{f}_I-f^*_I\|^2_{\ell_2} \Big). \tag{40}$$
Substituting (39) and (40) into (38), we obtain

$$\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\|\hat{f}_I-f^*_I\|^2_{\ell_2} + \lambda_1^{(n)}\|\hat{f}_J\|_{\ell_1} + \lambda_2^{(n)}\|\hat{f}_J\|^2_{\ell_2} \le (P-P_n)\big((f^*-\hat{f})^2+2(\hat{f}-f^*)\epsilon\big) + \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_1^{(n)}\|f^*_J\|_{\ell_1} + \lambda_2^{(n)}\|f^*_J\|^2_{\ell_2}. \tag{41}$$

Finally, we evaluate the first term $(P-P_n)((f^*-\hat{f})^2+2(\hat{f}-f^*)\epsilon)$ on the RHS of the above inequality (41) by applying Talagrand's concentration inequality (Talagrand, 1996a,b, Bousquet, 2002). First we decompose it as

$$(P-P_n)\big((f^*-\hat{f})^2+2(\hat{f}-f^*)\epsilon\big) = \sum_{m=1}^M (P-P_n)\big((f^*-\hat{f})(f^*_m-\hat{f}_m) + 2(\hat{f}_m-f^*_m)\epsilon\big),$$

and bound each term $(P-P_n)\big((f^*-\hat{f})(f^*_m-\hat{f}_m)+2(\hat{f}_m-f^*_m)\epsilon\big)$ in the summation. Suppose $f \in \mathcal{H}$ satisfies $\|f\|_\infty \le \|f\|_{\ell_1} \le \hat{R}$ for a constant $\hat{R}\ (\le \bar{R})$. Since $|\epsilon| \le L$, we have

$$|f f_m + 2f_m\epsilon| \le 2(L+\hat{R})|f_m| \le 2(L+\hat{R})\|f_m\|_{\mathcal{H}_m}, \tag{42a}$$
$$\sqrt{P(f f_m + 2f_m\epsilon)^2} = \sqrt{P(f^2 f_m^2) + 4P(f_m^2\epsilon^2)} \le \sqrt{\|f\|^2_{L_2(\Pi)}\|f_m\|^2_{L_2(\Pi)} + 4L^2\|f_m\|^2_{L_2(\Pi)}} \le \|f\|_{L_2(\Pi)}\|f_m\|_{L_2(\Pi)} + 2L\|f_m\|_{L_2(\Pi)}, \tag{42b}$$

for all $f \in \mathcal{H}$. Let $Q_n f := \frac{1}{n}\sum_{i=1}^n\varepsilon_i f(x_i,y_i)$, where $\{\varepsilon_i\}_{i=1}^n \in \{\pm1\}^n$ are Rademacher random variables, and let

$$\Psi_m(\xi_m,\sigma_m) := \mathrm{E}\big[\sup\{ Q_n(|f_m|) \mid f_m \in \mathcal{H}_m,\ \|f_m\|_{\mathcal{H}_m}\le\xi_m,\ \|f_m\|_{L_2(\Pi)}\le\sigma_m \}\big].$$

Then one can show, by the spectral assumption (A5) (equivalently, the covering-number condition), that

$$\Psi_m(\xi_m,\sigma_m) \le K_s\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee n^{-\frac{1}{1+s}}\xi_m \Big),$$

where $K_s$ is a constant that depends on $s$ and $C_2$ (Mendelson, 2002). Let $\Xi_m(\xi_m,\sigma_m) := \{ f_m \in \mathcal{H}_m \mid \|f_m\|_{\mathcal{H}_m}\le\xi_m,\ \|f_m\|_{L_2(\Pi)}\le\sigma_m \}$. Now, by the Rademacher contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12), for given $\{\xi_m,\sigma_m\}_{m\in I}$ and $\hat{R}$ we have

$$\mathrm{E}\big[\sup\{ Q_n(f f_m + 2f_m\epsilon) \mid f \in \mathcal{H}\ \text{s.t.}\ f_m \in \Xi_m(\xi_m,\sigma_m),\ \|f\|_{\ell_1}\le\hat{R} \}\big] \le 2(L+\hat{R})\Psi_m(\xi_m,\sigma_m) \le 2K_s(L+\hat{R})\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee n^{-\frac{1}{1+s}}\xi_m \Big). \tag{43}$$

Therefore, by the symmetrization argument (van der Vaart and Wellner, 1996), we have

$$\mathrm{E}\big[\sup\{ (P_n-P)(f f_m + 2f_m\epsilon) \mid f \in \mathcal{H}\ \text{s.t.}\ f_m \in \Xi_m(\xi_m,\sigma_m),\ \|f\|_{\ell_1}\le\hat{R} \}\big] \le 4K_s(L+\hat{R})\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee n^{-\frac{1}{1+s}}\xi_m \Big). \tag{44}$$

By Talagrand's concentration inequality with (42) and (44), for given $\hat{R}, \bar{\sigma}, \xi_m, \sigma_m$, with probability at least $1-e^{-t}$ ($t>0$) we have

$$\sup_{f\in\mathcal{H}:\ \|f\|_{L_2(\Pi)}\le\bar{\sigma},\ \|f\|_\infty\le\hat{R},\ f_m\in\Xi_m(\xi_m,\sigma_m)} (P_n-P)(f f_m + 2f_m\epsilon) \le \sqrt{2}\Big( 4K_s(L+\hat{R})\Big( \frac{\sigma_m^{1-s}\xi_m^s}{\sqrt{n}} \vee \frac{\xi_m}{n^{\frac{1}{1+s}}} \Big) + \sqrt{\tfrac{t}{n}}\big( \bar{\sigma}\sigma_m + 2L\sigma_m \big) + \frac{2(L+\hat{R})\xi_m t}{n} \Big), \tag{45}$$

where we used relation (42). Our next goal is to derive a uniform version of the above inequality over

$$\frac{1}{\sqrt{n}} \le \hat{R} \le \bar{R},\qquad \frac{1}{\sqrt{n}} \le \bar{\sigma} \le \bar{R},\qquad \frac{1}{\sqrt{n}M} \le \xi_m \le \bar{R},\qquad \frac{1}{\sqrt{n}M} \le \sigma_m \le \bar{R}.$$

By considering a grid $\{\hat{R}^{(k_1)}, \bar{\sigma}^{(k_2)}, \xi_m^{(k_3)}, \sigma_m^{(k_4)}\}_{k_i=0\ (i=1,\dots,4)}^{\log_2(M\bar{R}\sqrt{n})}$ such that $\hat{R}^{(k)} := \bar{R}2^{-k}$, $\bar{\sigma}^{(k)} := \bar{R}2^{-k}$, $\xi_m^{(k)} := \bar{R}2^{-k}$, and $\sigma_m^{(k)} := \bar{R}2^{-k}$, we have, with probability at least $1-(\log(M\bar{R}\sqrt{n}))^4 e^{-t} \ge 1-(\log(4RM^2\sqrt{n}))^4 e^{-t}$,

$$(P_n-P)(f f_m + 2f_m\epsilon) \le K(1+\|f\|_{\ell_1})\Big( \frac{\|f_m\|^{1-s}_{L_2(\Pi)}\|f_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} + \frac{t\|f_m\|_{\mathcal{H}_m}}{n} \Big) + \sqrt{\tfrac{2t}{n}}\big( \|f\|_{L_2(\Pi)}\|f_m\|_{L_2(\Pi)} + 2L\|f_m\|_{L_2(\Pi)} \big),$$

for all $f \in \mathcal{H}$ such that $\|f_m\|_{\mathcal{H}_m} \le \bar{R}$ and $\|f\|_{\ell_1} \le \bar{R}$, and for all $t > 1$, where $K = 4(4K_sL \vee 4K_s \vee 2L \vee 2)$. Summing this bound over $m = 1,\dots,M$, we obtain

$$(P_n-P)(f^2 + 2f\epsilon) \le K(1+\|f\|_{\ell_1})\Big( \sum_{m=1}^M\Big( \frac{\|f_m\|^{1-s}_{L_2(\Pi)}\|f_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\|f\|_{\ell_1}}{n} \Big) + \sqrt{\tfrac{2t}{n}}\Big( \|f\|_{L_2(\Pi)}\sum_{m=1}^M\|f_m\|_{L_2(\Pi)} + 2L\sum_{m=1}^M\|f_m\|_{L_2(\Pi)} \Big),$$

uniformly for all $f \in \mathcal{H}$ such that $\|f_m\|_{\mathcal{H}_m} \le \bar{R}\ (\forall m)$ and $\|f\|_{\ell_1} \le \bar{R}$, with probability at least $1-M(\log(4RM^2\sqrt{n}))^4 e^{-t}$. Now set $\gamma_n = \frac{K}{\sqrt{n}}$, and note that

$$\sqrt{\tfrac{2t}{n}}\,\|f\|_{L_2(\Pi)}\sum_{m=1}^M\|f_m\|_{L_2(\Pi)} \le \frac12\|f\|^2_{L_2(\Pi)} + \frac{t}{n}\Big(\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}\Big)^2 \le \frac12\|f\|^2_{L_2(\Pi)} + \frac{t}{n}\|f\|^2_{\ell_1};$$

then we have

$$(P_n-P)(f^2+2f\epsilon) \le K(1+\|f\|_{\ell_1})\Big[ \sum_{m\in I}\Big( \frac{\|f_m\|^{1-s}_{L_2(\Pi)}\|f_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|f_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{2t\|f\|_{\ell_1}}{n} \Big] + \gamma_n(1+\|f\|_{\ell_1})\|f_J\|_{\ell_1} + \frac12\|f\|^2_{L_2(\Pi)} + 2\sqrt{2}L\sqrt{\tfrac{t}{n}}\sum_{m=1}^M\|f_m\|_{L_2(\Pi)}, \tag{46}$$

for all $f \in \mathcal{H}$ such that $\|f_m\|_{\mathcal{H}_m} \le \bar{R}\ (\forall m)$ and $\|f\|_{\ell_1} \le \bar{R}$, with probability at least $1-M(\log(4RM^2\sqrt{n}))^4 e^{-t}$. We now replace $t$ with $t + 5\log M + 4\log\log(R\sqrt{n})$; then the probability $1-M(\log(4R\sqrt{n}M^2))^4 e^{-t}$ can be replaced with $1-e^{-t}$, and we have $t + 5\log M + 4\log\log(R\sqrt{n}) \le 6t$ for all $t \ge \log M + \log\log(R\sqrt{n})$. On the event where $\|\hat{f}-f^*\|_{\ell_1} \le \bar{R}$ holds, substituting $\hat{f}-f^*$ for $f$ in (46) and adjusting $K$ appropriately, (41) yields

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|^2_{\mathcal{H}_m} + \big(\lambda_1^{(n)}-\hat{\gamma}_n\big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}$$
$$\le \tilde{K}_1(1+\|\hat{f}-f^*\|_{\ell_1})\Big( \sum_{m\in I}\Big( \frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|^2_{\mathcal{H}_m} + \big(\lambda_1^{(n)}+\hat{\gamma}_n\big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}}\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{47}$$

where $\tilde{K}_1$ and $\tilde{K}_2$ are constants and $\hat{\gamma}_n = \gamma_n(1+\|\hat{f}-f^*\|_{\ell_1})$. Finally, since

$$\tilde{K}_2\sqrt{\tfrac{t}{n}}\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} = \tilde{K}_2\sqrt{\tfrac{t}{n}}\Big( \sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \sum_{m\in J}\|\hat{f}_m\|_{L_2(\Pi)} + \sum_{m\in J}\|f^*_m\|_{L_2(\Pi)} \Big) \le \tilde{K}_2\sqrt{\tfrac{t}{n}}\Big( \sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m} + \sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m} \Big),$$

(47) becomes

$$\frac12\|\hat{f}-f^*\|^2_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|^2_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}-\hat{\gamma}_n-\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}$$
$$\le \tilde{K}_1(1+\|\hat{f}-f^*\|_{\ell_1})\Big( \sum_{m\in I}\Big( \frac{\|\hat{f}_m-f^*_m\|^{1-s}_{L_2(\Pi)}\|\hat{f}_m-f^*_m\|^s_{\mathcal{H}_m}}{\sqrt{n}} \vee \frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}} \Big) + \frac{t\|\hat{f}-f^*\|_{\ell_1}}{n} \Big)$$
$$+ \sum_{m\in I}\Big( \lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} + 2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m} + \tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)} + \lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|^2_{\mathcal{H}_m} + \Big( \lambda_1^{(n)}+\hat{\gamma}_n+\tilde{K}_2\sqrt{\tfrac{t}{n}} \Big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}, \tag{48}$$

which yields the assertion. ∎

C Proof of Theorems 4 and 5

We write the operator norm of $S_{I,J}: \mathcal{H}_J \to \mathcal{H}_I$ as $\|S_{I,J}\|_{\mathcal{H}_I,\mathcal{H}_J} := \sup_{g_J\in\mathcal{H}_J,\, g_J\ne0}\frac{\|S_{I,J} g_J\|_{\mathcal{H}_I}}{\|g_J\|_{\mathcal{H}_J}}$.

Definition 9 For all $1 \le m, m' \le M$, we define the empirical (non-centered) cross covariance operator $\hat{\Sigma}_{m,m'}$ as follows:

$$\langle f_m, \hat{\Sigma}_{m,m'}\, g_{m'} \rangle_{\mathcal{H}_m} := \frac{1}{n}\sum_{i=1}^n f_m(x_i)\, g_{m'}(x_i), \tag{49}$$

where $f_m \in \mathcal{H}_m$, $g_{m'} \in \mathcal{H}_{m'}$.
Analogously to the joint covariance operator $\Sigma$, we define the joint empirical cross-covariance operator $\hat\Sigma:H\to H$ by $(\hat\Sigma h)_m=\sum_{l=1}^M\hat\Sigma_{m,l}h_l$. We denote by $\hat\Sigma_{m,\epsilon}$ the element of $H_m$ such that
\[
\langle f_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}:=\frac1n\sum_{i=1}^n\epsilon_if_m(x_i).
\]
Let $\bar R$ be a constant such that $4\big(\sum_{m=1}^M\|f^*_m\|_{H_m}+\sum_{m=1}^M\|f^*_m\|_{H_m}\big)\le\bar R$. We denote by $F_n$ the objective function of elastic-net MKL:
\[
F_n(f):=\frac1n\sum_{i=1}^n(f(x_i)-y_i)^2+\lambda_1^{(n)}\sum_{m=1}^M\|f_m\|_{H_m}+\lambda_2^{(n)}\sum_{m=1}^M\|f_m\|_{H_m}^2.
\]

Proof: (Theorem 4) Let $\tilde f\in\bigoplus_{m\in I_0}H_m$ be the minimizer of $\tilde F_n$: $\tilde f:=\arg\min_{f\in H_{I_0}}\tilde F_n(f)$, where
\[
\tilde F_n(f):=\frac1n\sum_{i=1}^n(f(x_i)-y_i)^2+\lambda_1^{(n)}\sum_{m\in I_0}\|f_m\|_{H_m}+\lambda_2^{(n)}\sum_{m\in I_0}\|f_m\|_{H_m}^2.
\]

(Step 1) We first show that $\tilde f\xrightarrow{\,p\,}f^*$ with respect to the RKHS norm. Since $\lambda_1^{(n)}\sqrt n\to\infty$, as in the proof of Lemma 7, the probability of $\sum_{m=1}^M\|\hat f_m-f^*_m\|_{H_m}\le\sqrt M\bar R$ goes to 1 (this can be checked as follows: replacing $\sqrt{\log(Mn)/n}$ in Eq. (34) with $\log(M)\lambda_1^{(n)}$, we see that Eq. (34) holds with probability $1-\exp(-\lambda_1^{(n)2}n)$). There exists $c_1$, depending only on $\sqrt M\bar R$, such that
\[
\|f_m\|_{H_m}=\sqrt{\|f_m-f^*_m\|_{H_m}^2+2\langle f_m-f^*_m,f^*_m\rangle_{H_m}+\|f^*_m\|_{H_m}^2}\ge c_1\|f_m-f^*_m\|_{H_m}^2-2\|f^*_m\|_{H_m}^{-1}|\langle f_m-f^*_m,f^*_m\rangle_{H_m}|+\|f^*_m\|_{H_m}\quad(50)
\]
for all $m\in I_0$ and all $f_m\in H_m$ such that $\|f_m\|_{H_m}\le\sqrt M\bar R$. Since $\tilde f$ minimizes $\tilde F_n$, if $\sum_{m=1}^M\|\tilde f_m-f^*_m\|_{H_m}\le\sqrt M\bar R$ (an event whose probability goes to 1), we have
\[
\langle\tilde f_{I_0}-f^*_{I_0},\hat\Sigma_{I_0,I_0}(\tilde f_{I_0}-f^*_{I_0})\rangle_{H_{I_0}}+c_1\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le2\langle\hat\Sigma_{I_0,\epsilon},\tilde f-f^*\rangle_{H_{I_0}}+2\sum_{m\in I_0}\Big(\frac{\lambda_1^{(n)}}{\|f^*_m\|_{H_m}}+\lambda_2^{(n)}\Big)|\langle\tilde f_m-f^*_m,f^*_m\rangle_{H_m}|,\quad(51)
\]
where we used relation (50). By the assumption $f^*_m=\Sigma_{m,m}^{1/2}g^*_m$, we have $|\langle\tilde f_m-f^*_m,f^*_m\rangle_{H_m}|\le\|g^*_m\|_{H_m}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}$. By Lemma 10 and Lemma 11, we have $\|\Sigma_{m,m'}-\hat\Sigma_{m,m'}\|_{H_m,H_{m'}}=O_p(1/\sqrt n)$ and $\|\hat\Sigma_{I_0,\epsilon}\|_{H_{I_0}}=O_p(1/\sqrt n)$. Substituting these inequalities into (51), we have
\[
\|\tilde f-f^*\|_{L_2(\Pi)}^2+c_1\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac{\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}}{\sqrt n}+(\lambda_1^{(n)}+\lambda_2^{(n)})\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}\Big).\quad(52)
\]
Recall that the (non-centered) cross-correlation operator is invertible. Thus there exists a constant $c$ such that
\[
\|\tilde f-f^*\|_{L_2(\Pi)}^2=\langle\tilde f_{I_0}-f^*_{I_0},\Sigma_{I_0,I_0}(\tilde f_{I_0}-f^*_{I_0})\rangle_H=\langle\tilde f_{I_0}-f^*_{I_0},\mathrm{Diag}(\Sigma_{m,m}^{1/2})V_{I_0,I_0}\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\tilde f_{I_0}-f^*_{I_0})\rangle_{H_{I_0}}\ge c\sum_{m\in I_0}\langle\tilde f_m-f^*_m,\Sigma_{m,m}(\tilde f_m-f^*_m)\rangle_{H_m}=c\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}^2.
\]
This and Eq. (52), using $ab\le(a^2+b^2)/2$, give
\[
\|\tilde f-f^*\|_{L_2(\Pi)}^2+c_1\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac{\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}}{\sqrt n}+(\lambda_1^{(n)}+\lambda_2^{(n)})\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}\Big)
\]
\[
\le O_p\Big(\frac1{n\lambda_1^{(n)}}+(\lambda_1^{(n)}+\lambda_2^{(n)})^2\Big)+\frac{c_1}2\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+c_2\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{L_2(\Pi)}^2\le O_p\Big(\frac1{n\lambda_1^{(n)}}+(\lambda_1^{(n)}+\lambda_2^{(n)})^2\Big)+\frac{c_1}2\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\frac12\|\tilde f-f^*\|_{L_2(\Pi)}^2.
\]
Therefore we have
\[
\frac12\|\tilde f-f^*\|_{L_2(\Pi)}^2+\frac{c_1}2\lambda_1^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2+\lambda_2^{(n)}\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac1{n\lambda_1^{(n)}}+(\lambda_1^{(n)}+\lambda_2^{(n)})^2\Big)
\]
\[
\Rightarrow\quad\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2\le O_p\Big(\frac1{(c_1\lambda_1^{(n)}+\lambda_2^{(n)})n\lambda_1^{(n)}}+\frac{(\lambda_1^{(n)}+\lambda_2^{(n)})^2}{c_1\lambda_1^{(n)}+\lambda_2^{(n)}}\Big)=O_p\Big(\frac1{n\lambda_1^{(n)2}}+(\lambda_1^{(n)}+\lambda_2^{(n)})\Big).
\]
This and $\lambda_1^{(n)}\sqrt n\to\infty$ give $\|\tilde f-f^*_{I_0}\|_{H_{I_0}}\to0$ in probability.
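As a concrete illustration of the resulting tradeoff (an aside for intuition, not needed for the proof): taking, say, $\lambda_1^{(n)}\asymp n^{-1/3}$, which satisfies $\lambda_1^{(n)}\sqrt n\asymp n^{1/6}\to\infty$, and $\lambda_2^{(n)}=O(\lambda_1^{(n)})$ balances the two terms of the bound above, since
\[
\frac1{n\lambda_1^{(n)2}}\asymp\frac1{n\cdot n^{-2/3}}=n^{-1/3},\qquad\lambda_1^{(n)}+\lambda_2^{(n)}=O(n^{-1/3}),
\]
so that $\sum_{m\in I_0}\|\tilde f_m-f^*_m\|_{H_m}^2=O_p(n^{-1/3})\to0$; the precise parameter conditions are those stated in the main text.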
(Step 2) Next we show that the probability of $\tilde f=\hat f$ goes to 1. Since $\|\tilde f-f^*_{I_0}\|_{H_{I_0}}\to0$, we can assume $\|\tilde f_m\|_{H_m}>0$ $(m\in I_0)$ without loss of generality. We identify $\tilde f$ with an element of $H$ by setting $\tilde f_m=0$ for $m\in J_0$. We now show that $\tilde f$ is also the minimizer of $F_n$, that is, $\tilde f=\hat f$, with high probability, and hence $\hat I=I_0$ with high probability. By the KKT conditions, the necessary and sufficient conditions for $\tilde f$ to minimize $F_n$ are
\[
\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}\le\lambda_1^{(n)}\quad(\forall m\in J_0),\quad(53)
\]
\[
(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)(\tilde f_{I_0}-f^*_{I_0})+\lambda_1^{(n)}D_nf^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}=0,\quad(54)
\]
where $D_n=\mathrm{Diag}(\|\tilde f_m\|_{H_m}^{-1})$. Note that (54) is satisfied (with high probability) because $\tilde f$ is the minimizer of $\tilde F_n$ and $\|\tilde f_m\|_{H_m}>0$ for all $m\in I_0$ (with high probability). Therefore, if condition (53) holds with high probability, then $\tilde f=\hat f$ with high probability. We now show that condition (53) holds with high probability. By (54), we have
\[
\tilde f_{I_0}-f^*_{I_0}=-(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\big[(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}\big].
\]
Therefore the LHS of (53), $\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}$, can be evaluated as
\[
\big\|-2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\big[(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}\big]-2\hat\Sigma_{m,\epsilon}\big\|_{H_m}
\]
\[
=\big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}-2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}+2\hat\Sigma_{m,\epsilon}\big\|_{H_m}
\]
\[
\le\big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_m}+\big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}-2\hat\Sigma_{m,\epsilon}\big\|_{H_m}.\quad(55)
\]
We evaluate the probabilistic orders of the last two terms.

(i) (Bounding $B_{n,m}:=\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}-2\hat\Sigma_{m,\epsilon}\|_{H_m}$.) We show that
\[
\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}=O_p\Big(\frac1{\sqrt n}\Big).
\]
Since $O\preceq\begin{pmatrix}\hat\Sigma_{I_0,I_0}&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}\end{pmatrix}$, we have
\[
O\preceq\begin{pmatrix}\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}D_n/2&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}+\lambda_2^{(n)}\end{pmatrix}\preceq\begin{pmatrix}2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n&0\\0&2\hat\Sigma_{m,m}+2\lambda_2^{(n)}\end{pmatrix}.
\]
The second inequality is due to the fact that for all $(f_{I_0},f_m)\in H_{I_0\cup m}$ we have
\[
\Bigg\langle\begin{pmatrix}f_{I_0}\\-f_m\end{pmatrix},\begin{pmatrix}\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}D_n/2&-\hat\Sigma_{I_0,m}\\-\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}+\lambda_2^{(n)}\end{pmatrix}\begin{pmatrix}f_{I_0}\\-f_m\end{pmatrix}\Bigg\rangle_{H_{I_0\cup m}}\ge0
\]
because of $O\preceq\begin{pmatrix}\hat\Sigma_{I_0,I_0}&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}\end{pmatrix}$. Thus we have
\[
\Bigg\|\begin{pmatrix}\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}D_n/2&\hat\Sigma_{I_0,m}\\\hat\Sigma_{m,I_0}&\hat\Sigma_{m,m}+\lambda_2^{(n)}\end{pmatrix}\begin{pmatrix}2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n&0\\0&2\hat\Sigma_{m,m}+2\lambda_2^{(n)}\end{pmatrix}^{-1}\begin{pmatrix}\hat\Sigma_{I_0,\epsilon}\\\hat\Sigma_{m,\epsilon}\end{pmatrix}\Bigg\|_{H_{I_0\cup m}}\le\Bigg\|\begin{pmatrix}\hat\Sigma_{I_0,\epsilon}\\\hat\Sigma_{m,\epsilon}\end{pmatrix}\Bigg\|_{H_{I_0\cup m}}\le O_p(1/\sqrt n).\quad(56)
\]
Here the LHS of inequality (56) is equal to
\[
\Bigg\|\begin{pmatrix}*\\\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}+(\hat\Sigma_{m,m}+\lambda_2^{(n)})(2\hat\Sigma_{m,m}+2\lambda_2^{(n)})^{-1}\hat\Sigma_{m,\epsilon}\end{pmatrix}\Bigg\|_{H_{I_0\cup m}}.
\]
Therefore we observe
\[
\Big\|\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}+\frac12\hat\Sigma_{m,\epsilon}\Big\|_{H_m}=O_p(1/\sqrt n).
\]
Since $\|\hat\Sigma_{m,\epsilon}\|_{H_m}=O_p(1/\sqrt n)$, we also have $\|\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\hat\Sigma_{I_0,\epsilon}\|_{H_m}=O_p(1/\sqrt n)$. This and $\|\hat\Sigma_{m,\epsilon}\|_{H_m}=O_p(1/\sqrt n)$ yield
\[
B_{n,m}=O_p(1/\sqrt n).\quad(57)
\]

(ii) (Bounding $E_{n,m}:=\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\|_{H_m}$.) Note that, due to $\|\tilde f-f^*\|_H\xrightarrow{\,p\,}0$, we have $D_n\xrightarrow{\,p\,}D$, and we know that $\max_{m,m'}\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}=O_p(\sqrt{\log(M)/n})=O_p(1/\sqrt n)$ by Lemma 10. Thus $S_n:=(2\Sigma_{I_0,I_0}-2\hat\Sigma_{I_0,I_0})/\lambda_1^{(n)}+D-D_n$ satisfies $S_n=o_p(1)$, and thus $D-S_n\succeq D/2$ with high probability. Hence
\[
2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}=2\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+O_p\Big(\frac1{\sqrt n}\Big)
\]
\[
=2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}\lambda_1^{(n)}S_n\big(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+O_p\Big(\frac1{\sqrt n}\Big).\quad(58)
\]
Here we obtain
\[
\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-\frac12}\|_{H_m,H_{I_0}}^2=\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}\Sigma_{I_0,m}\|_{H_m,H_m}\le\|\Sigma_{m,m}^{\frac12}V_{m,I_0}(2V_{I_0,I_0})^{-1}V_{I_0,m}\Sigma_{m,m}^{\frac12}\|_{H_m,H_m}=O_p(1),\quad(59)
\]
and, due to the fact that $D-S_n\succeq D/2$ with high probability, we have
\[
\big\|\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-\frac12}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_{I_0}}=\big\|\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-\frac12}\mathrm{Diag}(\Sigma_{m,m}^{\frac12})(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})g^*_{I_0}\big\|_{H_{I_0}}\le O_p\big(\|V_{I_0,I_0}^{-1}\|_{H_{I_0},H_{I_0}}^{\frac12}(\lambda_1^{(n)}+\lambda_2^{(n)})\big)=O_p(\lambda_1^{(n)}+\lambda_2^{(n)}).
\]
Therefore, the second term on the RHS of Eq. (58) is evaluated as
\[
\big\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}\lambda_1^{(n)}S_n\big(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_m}
\]
\[
\le\|\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-\frac12}\|_{H_m,H_{I_0}}\,\|(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-\frac12}\|_{H_{I_0},H_{I_0}}\,\lambda_1^{(n)}\|S_n\|_{H_{I_0},H_{I_0}}\times\|(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n))^{-\frac12}\|_{H_{I_0},H_{I_0}}\,\big\|\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\lambda_1^{(n)}(D-S_n)\big)^{-\frac12}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}\big\|_{H_{I_0}}
\]
\[
\le O_p\big(1\cdot(\lambda_1^{(n)}+\lambda_2^{(n)})^{-\frac12}\cdot\lambda_1^{(n)}o_p(1)\cdot(\lambda_1^{(n)}+\lambda_2^{(n)})^{-\frac12}\cdot(\lambda_1^{(n)}+\lambda_2^{(n)})\big)=o_p(\lambda_1^{(n)}).
\]
Therefore, this and Eq. (58) give
\[
2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}=2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+o_p(\lambda_1^{(n)})+O_p\Big(\frac1{\sqrt n}\Big)
\]
\[
=2\Sigma_{m,I_0}(2\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D)^{-1}(\lambda_1^{(n)}D_n+2\lambda_2^{(n)})f^*_{I_0}+o_p(\lambda_1^{(n)}).
\]
Define
\[
A_n:=\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}
\]
and
\[
A:=\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}.
\]
We show $\|A_n-A\|_{H_m}=o_p(1)$. By definition, we have
\[
A-A_n=\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\frac{\lambda_1^{(n)}D}2\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}+\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}(D-D_n)f^*_{I_0}.\quad(60)
\]
On the other hand, as in Eq. (56), we observe that
\[
2\ge\Bigg\|\begin{pmatrix}\Sigma_{I_0,I_0}&\Sigma_{I_0,m}\\\Sigma_{m,I_0}&\Sigma_{m,m}\end{pmatrix}\begin{pmatrix}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}&0\\0&0\end{pmatrix}\Bigg\|_{H_{I_0\cup m},H_{I_0\cup m}}=\Bigg\|\begin{pmatrix}*&0\\\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}&0\end{pmatrix}\Bigg\|_{H_{I_0\cup m},H_{I_0\cup m}}\ge\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\|_{H_m,H_{I_0}}.\quad(61)
\]
Moreover, since $f^*_m=\Sigma_{m,m}^{\frac12}g^*_m$ $(\forall m)$, we have
\[
\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\Big\|_{H_{I_0}}=\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\mathrm{Diag}(\Sigma_{m,m}^{\frac12})\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_{I_0}}
\]
\[
\le\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-\frac12}\Big\|_{H_{I_0},H_{I_0}}\Big\|\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-\frac12}\mathrm{Diag}(\Sigma_{m,m}^{\frac12})\Big\|_{H_{I_0},H_{I_0}}\Big\|\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_{I_0}}\le O_p\Big((\lambda_1^{(n)}+\lambda_2^{(n)})^{-\frac12}\|V_{I_0,I_0}^{-\frac12}\|_{H_{I_0},H_{I_0}}\Big)\le O_p\big(\lambda_1^{(n)-\frac12}\big).\quad(62)
\]
We can also bound the second term of (60) as
\[
\Big\|\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}(D-D_n)f^*_{I_0}\Big\|_{H_m}\le\Big\|\Sigma_{m,I_0}\Big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D}2\Big)^{-1}\Big\|_{H_m,H_{I_0}}\|(D-D_n)f^*_{I_0}\|_{H_{I_0}}\le\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\|_{H_m,H_{I_0}}\|(D-D_n)f^*_{I_0}\|_{H_{I_0}}\le2\|(D-D_n)f^*_{I_0}\|_{H_{I_0}}\ (\because\text{Eq. (61)})=o_p(1).
\]
Therefore, applying the inequalities Eq. (61) and Eq. (62) to Eq. (60), we have
\[
\|A_n-A\|_{H_m}=O_p\big(\lambda_1^{(n)\frac12}\big)+o_p(1)=o_p(1).\quad(63)
\]
Hence we have $E_{n,m}=\lambda_1^{(n)}\|A\|_{H_m}+o_p(\lambda_1^{(n)})$.

(iii) (Combining (i) and (ii).) Due to the above evaluations (i) and (ii), we have
\[
\max_{m\in J_0}\|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}=\max_{m\in J_0}\lambda_1^{(n)}\Big\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\Big\|_{H_m}+o_p(\lambda_1^{(n)})<\lambda_1^{(n)}(1-\eta)+o_p(\lambda_1^{(n)}).
\]
This yields
\[
P\big(\exists m\in J_0:\ \|2\hat\Sigma_{m,I_0}(\tilde f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}\ge\lambda_1^{(n)}\big)\to0.
\]
Thus the probability that condition (53) holds goes to 1.

Proof: (Theorem 5) First we prove that $\lambda_1^{(n)}\sqrt n\to\infty$ is a necessary condition for $\hat I\xrightarrow{\,p\,}I_0$. Assume that $\liminf\lambda_1^{(n)}\sqrt n<\infty$. Then we can take a sub-sequence along which $\lambda_1^{(n)}\sqrt n$ converges to a finite value; therefore, by passing to this sub-sequence if necessary, we may assume $\lambda_1^{(n)}\sqrt n\to\mu_1$ without loss of generality. We will derive a contradiction under the conditions $\|\hat f-f^*\|_H\xrightarrow{\,p\,}0$ and $\hat I\xrightarrow{\,p\,}I_0$. Suppose $\hat I=I_0$.
By the KKT condition,
\[
0=2(\hat\Sigma_{I_0,I_0}\hat f_{I_0}-\hat\Sigma_{I_0,\epsilon}-\hat\Sigma_{I_0,I_0}f^*_{I_0})+\lambda_1^{(n)}D_n\hat f_{I_0}+2\lambda_2^{(n)}\hat f_{I_0}
\]
\[
\Rightarrow\quad2(\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)})(f^*_{I_0}-\hat f_{I_0})=\lambda_1^{(n)}D_nf^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}\quad(64)
\]
\[
\Rightarrow\quad2\sqrt n(\Sigma_{I_0,I_0}+\lambda_2^{(n)})(f^*_{I_0}-\hat f_{I_0})=\sqrt n\lambda_1^{(n)}Df^*_{I_0}+2\sqrt n\lambda_2^{(n)}f^*_{I_0}-2\sqrt n\hat\Sigma_{I_0,\epsilon}+\big(2\sqrt n(\Sigma_{I_0,I_0}-\hat\Sigma_{I_0,I_0})(f^*_{I_0}-\hat f_{I_0})+\sqrt n\lambda_1^{(n)}(D_n-D)f^*_{I_0}\big)
\]
\[
\Rightarrow\quad2\sqrt n(\Sigma_{I_0,I_0}+\lambda_2^{(n)})(f^*_{I_0}-\hat f_{I_0})=\mu_1Df^*_{I_0}+2\sqrt n\lambda_2^{(n)}f^*_{I_0}-2\sqrt n\hat\Sigma_{I_0,\epsilon}+o_p(1),\quad(65)
\]
where the last equality is due to $\sqrt n\lambda_1^{(n)}\to\mu_1$, $\|D_n-D\|_{H_{I_0},H_{I_0}}=o_p(1)$, $\|\hat f-f^*\|_H=o_p(1)$ and $\|\Sigma_{I_0,I_0}-\hat\Sigma_{I_0,I_0}\|_{H_{I_0},H_{I_0}}=o_p(1)$. Moreover, since the second equality (64) indicates that $o_p(1)+o_p(\lambda_2^{(n)})=\lambda_1^{(n)}Df^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}+o_p(1)$, we have $\lambda_1^{(n)}=o_p(1)$ and $\lambda_2^{(n)}=o_p(1)$.

We now show that the KKT condition under which $\hat f$ satisfying $\hat I=I_0$ is optimal with respect to $F_n$ is violated with strictly positive probability:
\[
\liminf P\Big(\exists m\in J_0:\ \|2(\hat\Sigma_{m,I_0}\hat f_{I_0}-\hat\Sigma_{m,I_0}f^*_{I_0}-\hat\Sigma_{m,\epsilon})\|_{H_m}>\lambda_1^{(n)}\Big)>0.\quad(66)
\]
Obviously this indicates that the probability of $\hat I=I_0$ does not converge to 1, which is a contradiction. For all $v_m\in H_m$ $(m\in J_0)$, there exists $w_{I_0}\in H_{I_0}$ such that
\[
\Sigma_{I_0,m}v_m=(\Sigma_{I_0,I_0}+\lambda_2^{(n)})w_{I_0}.\quad(67)
\]
Note that $w_{I_0}$ is uniformly bounded for all $\lambda_2^{(n)}\ge0$: the range of $\Sigma_{I_0,m}$ is included in the range of $\Sigma_{I_0,I_0}$ (Baker, 1973), so there exists $\tilde w_{I_0}$ such that $\Sigma_{I_0,m}v_m=\Sigma_{I_0,I_0}\tilde w_{I_0}$ ($\tilde w_{I_0}$ is independent of $\lambda_2^{(n)}$); hence $\Sigma_{I_0,I_0}\tilde w_{I_0}=(\Sigma_{I_0,I_0}+\lambda_2^{(n)})w_{I_0}$, and
\[
\|w_{I_0}\|_{H_{I_0}}\le\sqrt{\langle\tilde w_{I_0},\Sigma_{I_0,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-2}\Sigma_{I_0,I_0}\tilde w_{I_0}\rangle_{H_{I_0}}}\le\|\tilde w_{I_0}\|_{H_{I_0}}
\]
for $\lambda_2^{(n)}>0$, and $\|w_{I_0}\|_{H_{I_0}}=\|\tilde w_{I_0}\|_{H_{I_0}}$ for $\lambda_2^{(n)}=0$. Let $v_m\in H_m$ be any non-zero element such that $\Sigma_{m,m}^{1/2}v_m\ne0$, and let $w_{I_0}$ satisfy equality (67). Then
\[
\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}+\hat\Sigma_{m,I_0}f^*_{I_0}-\hat\Sigma_{m,I_0}\hat f_{I_0}\rangle_{H_m}=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}+\langle v_m,\hat\Sigma_{m,I_0}\sqrt n(f^*_{I_0}-\hat f_{I_0})\rangle_{H_m}
\]
\[
=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}+\langle v_m,\Sigma_{m,I_0}\sqrt n(f^*_{I_0}-\hat f_{I_0})\rangle_{H_m}+o_p(1)=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}+\langle w_{I_0},(\Sigma_{I_0,I_0}+\lambda_2^{(n)})\sqrt n(f^*_{I_0}-\hat f_{I_0})\rangle_{H_{I_0}}+o_p(1)
\]
\[
=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle_{H_m}-\sqrt n\langle w_{I_0},\hat\Sigma_{I_0,\epsilon}\rangle_{H_{I_0}}+\Big\langle w_{I_0},\Big(\frac{\mu_1}2D+\sqrt n\lambda_2^{(n)}\Big)f^*_{I_0}\Big\rangle_{H_{I_0}}+o_p(1),
\]
where we used $\|\hat\Sigma_{m,I_0}-\Sigma_{m,I_0}\|_{H_m,H_{I_0}}=O_p(1/\sqrt n)$ and $\|f^*-\hat f\|_H\xrightarrow{\,p\,}0$ in the second equality, and relation (65) in the last equality. We can show that $Z_n:=\sqrt n\langle v_m,\hat\Sigma_{m,\epsilon}\rangle-\sqrt n\langle w_{I_0},\hat\Sigma_{I_0,\epsilon}\rangle$ has a strictly positive variance as follows (see also Bach (2008)):
\[
\mathrm E[Z_n]=0,\qquad\mathrm E[Z_n^2]\ge\sigma^2\big(\langle v_m,\Sigma_{m,m}v_m\rangle-2\langle v_m,\Sigma_{m,I_0}w_{I_0}\rangle+\langle w_{I_0},\Sigma_{I_0,I_0}w_{I_0}\rangle\big)=\sigma^2\big(\langle v_m,\Sigma_{m,m}v_m\rangle-\langle v_m,\Sigma_{m,I_0}w_{I_0}\rangle+o_p(1)\big)\quad(\because\lambda_2^{(n)}=o_p(1))
\]
\[
=\sigma^2\langle\Sigma_{m,m}^{1/2}v_m,(I_{H_m}-V_{m,I_0}\tilde V_{I_0,I_0}^{-1}V_{I_0,m})\Sigma_{m,m}^{1/2}v_m\rangle+o_p(1),
\]
where $\tilde V_{I_0,I_0}^{-1}=\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\mathrm{Diag}(\Sigma_{m,m}^{1/2})$ (note that $\tilde V_{I_0,I_0}$ is invertible because $V_{I_0,I_0}\preceq\tilde V_{I_0,I_0}$ and $V_{I_0,I_0}$ is invertible).
Now, since $V_{I_0,I_0}\preceq\tilde V_{I_0,I_0}$ and $I_{H_m}-V_{m,I_0}V_{I_0,I_0}^{-1}V_{I_0,m}\succ O$ (this is because $V_{I_0\cup m,I_0\cup m}=\begin{pmatrix}V_{I_0,I_0}&V_{I_0,m}\\V_{m,I_0}&I_{H_m}\end{pmatrix}$ is invertible), we have $I_{H_m}-V_{m,I_0}\tilde V_{I_0,I_0}^{-1}V_{I_0,m}\succ O$. Therefore, by the central limit theorem, $Z_n$ converges in distribution to a Gaussian random variable with strictly positive variance. Thus the probability of
\[
2|\langle v_m,\hat\Sigma_{m,\epsilon}+\hat\Sigma_{m,I_0}f^*_{I_0}-\hat\Sigma_{m,I_0}\hat f_{I_0}\rangle_{H_m}|>\lambda_1^{(n)}\|v_m\|_{H_m}
\]
is asymptotically strictly positive because $\lambda_1^{(n)}\sqrt n\to\mu_1$ (note that this is true whether or not $\sqrt n\lambda_2^{(n)}$ converges to a finite value). This yields (66); that is, $\hat f$ fails to satisfy $\hat I=I_0$ with asymptotically strictly positive probability.

We refer to the following as Condition A:

Condition A: $\lambda_1^{(n)}\sqrt n\to\infty$.

Now that we have proven $\lambda_1^{(n)}\sqrt n\to\infty$, we are ready to prove assertion (16). Suppose that condition (16) is not satisfied for any sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to0$; that is, there exists a constant $\xi>0$ such that
\[
\limsup_{n\to\infty}\Big\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_m}>(1+\xi)\quad(\exists m\in J_0)\quad(68)
\]
for any sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to0$ satisfying Condition A ($\lambda_1^{(n)}\sqrt n\to\infty$). Fix arbitrary sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to0$ satisfying Condition A. If $\hat I=I_0$, the KKT conditions
\[
\|2\hat\Sigma_{m,I_0}(\hat f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\|_{H_m}\le\lambda_1^{(n)}\quad(\forall m\in J_0),\quad(69)
\]
\[
(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)(\hat f_{I_0}-f^*_{I_0})+\lambda_1^{(n)}D_nf^*_{I_0}+2\lambda_2^{(n)}f^*_{I_0}-2\hat\Sigma_{I_0,\epsilon}=0,\quad(70)
\]
must be satisfied (see (53) and (54)). We prove that the first inequality (69) of the KKT conditions is violated with strictly positive probability under the assumptions and condition (70). We have shown (see (55)) that
\[
\lambda_1^{(n)-1}\big(2\hat\Sigma_{m,I_0}(\hat f_{I_0}-f^*_{I_0})-2\hat\Sigma_{m,\epsilon}\big)=2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}-\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}+\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,\epsilon}.\quad(71)
\]
As shown in the proof of Theorem 4, the first term can be approximated by $\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\big)f^*_{I_0}$; more precisely, Eq. (63) gives
\[
\Big\|\hat\Sigma_{m,I_0}\Big(\hat\Sigma_{I_0,I_0}+\lambda_2^{(n)}+\frac{\lambda_1^{(n)}D_n}2\Big)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}-\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\Big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)g^*_{I_0}\Big\|_{H_m}\xrightarrow{\,p\,}0.
\]
Since $\liminf_n\big\|\Sigma_{m,I_0}(\Sigma_{I_0,I_0}+\lambda_2^{(n)})^{-1}\big(D+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\big)g^*_{I_0}\big\|_{H_m}>(1+\xi)$ by the assumption, we observe that
\[
P\bigg(\Big\|2\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}\Big(D_n+\frac{2\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\Big\|_{H_m}>(1+\xi)\bigg)\not\to0.\quad(72)
\]
Now, since $\lambda_1^{(n)}\sqrt n\to\infty$, we have already proven in the proof of Theorem 4 (Eq. (57)) that
\[
\Big\|-\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,I_0}(2\hat\Sigma_{I_0,I_0}+2\lambda_2^{(n)}+\lambda_1^{(n)}D_n)^{-1}2\hat\Sigma_{I_0,\epsilon}+\frac2{\lambda_1^{(n)}}\hat\Sigma_{m,\epsilon}\Big\|_{H_m}=O_p\big(1/(\lambda_1^{(n)}\sqrt n)\big)=o_p(1).\quad(73)
\]
Therefore, combining (71), (72) and (73), we observe that the KKT condition (69) is violated with strictly positive probability if condition (68) is satisfied. This shows that the irrepresentable condition (16) is necessary for the support consistency of elastic-net MKL.
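To make condition (68) concrete, the following toy computation evaluates its left-hand side in a finite-dimensional analogue (a simplifying assumption of this sketch, not the paper's infinite-dimensional setting): each $H_m$ is taken one-dimensional with $\Sigma_{m,m}=1$, so that $g^*_m=f^*_m$ and $Df^*_{I_0}=\mathrm{sign}(f^*_{I_0})$, and $\Sigma$ reduces to an $M\times M$ covariance matrix. At $\lambda_2=0$ the quantity reduces to the classical irrepresentable quantity $|\Sigma_{m,I_0}\Sigma_{I_0,I_0}^{-1}\mathrm{sign}(f^*_{I_0})|$ of Zhao and Yu (2006).

```python
import numpy as np

I0 = [0, 1]                             # indices of the true active components
Sigma = np.eye(4)                       # M = 4 one-dimensional "blocks" (toy assumption)
Sigma[2, 0] = Sigma[0, 2] = 0.9         # inactive component 2 correlated with component 0
f_star = np.array([1.0, -0.5])          # true coefficients on I0

def en_irrepresentable_lhs(m, lam1, lam2):
    """LHS of (68) at an inactive index m; here D f*_{I0} = sign(f*_{I0})."""
    S_II = Sigma[np.ix_(I0, I0)] + lam2 * np.eye(len(I0))
    vec = np.sign(f_star) + (2.0 * lam2 / lam1) * f_star   # (D + 2 lam2/lam1) g*_{I0}
    return abs(Sigma[m, I0] @ np.linalg.solve(S_II, vec))

print(en_irrepresentable_lhs(2, lam1=0.1, lam2=0.0))    # block-l1 case: 0.9
print(en_irrepresentable_lhs(2, lam1=0.1, lam2=0.05))   # elastic-net case
```

Whether the elastic-net quantity is smaller or larger than the block-$\ell_1$ one depends on the interplay between $D$ and $2\lambda_2^{(n)}/\lambda_1^{(n)}$ and on the scaling of the regularization parameters, which is exactly the balance discussed in the main text.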
Lemma 10 If $\sup_Xk_m(X,X)\le1$ and $\sup_Xk_{m'}(X,X)\le1$, then
\[
P\big(\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\ge\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}]+\varepsilon\big)\le\exp(-n\varepsilon^2/2).\quad(74)
\]
In particular,
\[
P\Big(\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\ge\sqrt{\frac1n}+\varepsilon\Big)\le\exp(-n\varepsilon^2/2).\quad(75)
\]

Proof: We use McDiarmid's inequality (Devroye et al., 1996). By definition,
\[
\langle g,\hat\Sigma_{m,m'}f\rangle=\frac1n\sum_{i=1}^n\langle g,k_m(\cdot,x_i)\rangle_{H_m}\langle f,k_{m'}(\cdot,x_i)\rangle_{H_{m'}}.
\]
We denote by $\tilde\Sigma_{m,m'}$ the empirical cross-covariance operator computed from the $n$ samples $(x_1,\dots,x_{j-1},\tilde x_j,x_{j+1},\dots,x_n)$, in which the $j$-th sample $x_j$ is replaced by $\tilde x_j$, drawn independently from the same distribution as the $x_i$'s. By the triangle inequality, we have
\[
\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\|\tilde\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\le\|\hat\Sigma_{m,m'}-\tilde\Sigma_{m,m'}\|_{H_m,H_{m'}}.
\]
The RHS can be evaluated as follows:
\[
\|\hat\Sigma_{m,m'}-\tilde\Sigma_{m,m'}\|_{H_m,H_{m'}}=\Big\|\frac1n\big(k_m(\cdot,x_j)k_{m'}(x_j,\cdot)-k_m(\cdot,\tilde x_j)k_{m'}(\tilde x_j,\cdot)\big)\Big\|_{H_m,H_{m'}}.\quad(76)
\]
The RHS of (76) can be further bounded as
\[
\Big\|\frac1n\big(k_m(\cdot,x_j)k_{m'}(x_j,\cdot)-k_m(\cdot,\tilde x_j)k_{m'}(\tilde x_j,\cdot)\big)\Big\|_{H_m,H_{m'}}\le\frac1n\big(\|k_m(\cdot,x_j)k_{m'}(x_j,\cdot)\|_{H_m,H_{m'}}+\|k_m(\cdot,\tilde x_j)k_{m'}(\tilde x_j,\cdot)\|_{H_m,H_{m'}}\big)
\]
\[
\le\frac1n\big(\|k_m(\cdot,x_j)\|_{H_m}\|k_{m'}(x_j,\cdot)\|_{H_{m'}}+\|k_m(\cdot,\tilde x_j)\|_{H_m}\|k_{m'}(\tilde x_j,\cdot)\|_{H_{m'}}\big)\le\frac1n\Big(\sqrt{k_m(x_j,x_j)k_{m'}(x_j,x_j)}+\sqrt{k_m(\tilde x_j,\tilde x_j)k_{m'}(\tilde x_j,\tilde x_j)}\Big)\le\frac2n,\quad(77)
\]
where we used $\|k_m(\cdot,x_j)\|_{H_m}=\sqrt{\langle k_m(\cdot,x_j),k_m(\cdot,x_j)\rangle_{H_m}}=\sqrt{k_m(x_j,x_j)}$. Bounding the norm in (76) by (77), we have
\[
\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\|\tilde\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\le\frac2n.
\]
By symmetry, exchanging $\hat\Sigma$ and $\tilde\Sigma$ gives
\[
\big|\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\|\tilde\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}\big|\le\frac2n.
\]
Therefore, by McDiarmid's inequality, we obtain
\[
P\big(\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}-\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}]\ge\varepsilon\big)\le\exp\Big(-\frac{2\varepsilon^2}{n(2/n)^2}\Big)=\exp\Big(-\frac{\varepsilon^2n}2\Big).
\]
This gives the first assertion, Eq. (74). To show the second assertion (Eq. (75)), first note that
\[
\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}]\le\sqrt{\mathrm E[\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{H_m,H_{m'}}^2]}=\sqrt{\mathrm E[\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{H_m,H_m}]}\le\sqrt{\mathrm E[\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{\mathrm{tr}}]},\quad(78)
\]
where $\|\cdot\|_{\mathrm{tr}}$ is the trace norm and the last inequality holds because the operator norm is bounded by the trace norm. As in Lemma 1 of Gretton et al. (2005), we see that
\[
\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{\mathrm{tr}}=\frac1{n^2}\sum_{i,j=1}^n\|k_m(\cdot,x_i)k_{m'}(x_i,x_j)k_m(x_j,\cdot)\|_{\mathrm{tr}}-\frac2n\sum_{i=1}^n\mathrm E_X[\|k_m(\cdot,x_i)k_{m'}(x_i,X)k_m(X,\cdot)\|_{\mathrm{tr}}]+\mathrm E_{X,X'}[\|k_m(\cdot,X)k_{m'}(X,X')k_m(X',\cdot)\|_{\mathrm{tr}}]
\]
\[
=\frac1{n^2}\sum_{i,j=1}^nk_m(x_j,x_i)k_{m'}(x_i,x_j)-\frac2n\sum_{i=1}^n\mathrm E_X[k_m(X,x_i)k_{m'}(x_i,X)]+\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')],
\]
where $X$ and $X'$ are independent random variables distributed according to $\Pi$. Thus
\[
\mathrm E[\|(\hat\Sigma_{m,m'}-\Sigma_{m,m'})(\hat\Sigma_{m',m}-\Sigma_{m',m})\|_{\mathrm{tr}}]=\frac n{n^2}\mathrm E_X[k_m(X,X)k_{m'}(X,X)]+\frac{n(n-1)}{n^2}\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]-2\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]+\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]
\]
\[
=\frac1n\mathrm E_X[k_m(X,X)k_{m'}(X,X)]-\frac1n\mathrm E_{X,X'}[k_m(X',X)k_{m'}(X,X')]\le\frac1n.
\]
This and Eq. (78), together with the first assertion (Eq. (74)), give the second assertion.
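The trace computation in the proof makes $\|\hat\Sigma_{m,m'}-\Sigma_{m,m'}\|_{\mathrm{HS}}^2$ (which dominates the squared operator norm) expressible purely through kernel evaluations, so the expectation bound $1/n$ can be sanity-checked numerically. The sketch below uses illustrative choices (Gaussian kernels, uniform inputs) that are assumptions of this check, not of the lemma, and approximates the population expectations by large independent samples:

```python
import numpy as np

rng = np.random.default_rng(1)

def gram(x, y, bw):
    """Gaussian kernel matrix k(x_i, y_j); the kernel and input law are illustrative."""
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2.0 * bw ** 2))

def hs_error_sq(x, pop_a, pop_b, bw1=0.5, bw2=2.0):
    """||Sigma_hat_{m,m'} - Sigma_{m,m'}||_HS^2 via the three-term kernel identity
    from the proof; population expectations approximated by samples pop_a, pop_b."""
    sample_term = np.mean(gram(x, x, bw1) * gram(x, x, bw2))          # (1/n^2) double sum
    cross_term = np.mean(gram(pop_a, x, bw1) * gram(pop_a, x, bw2))   # (1/n) sum_i E_X[...]
    pop_term = np.mean(gram(pop_a, pop_b, bw1) * gram(pop_a, pop_b, bw2))  # E_{X,X'}[...]
    return sample_term - 2.0 * cross_term + pop_term

for n in (50, 200, 800):
    trials = [hs_error_sq(rng.uniform(-1, 1, n),
                          rng.uniform(-1, 1, 2000),
                          rng.uniform(-1, 1, 2000)) for _ in range(100)]
    print(n, np.mean(trials), 1.0 / n)   # the empirical mean should sit below 1/n
```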
Lemma 11 If $\mathrm E[\epsilon^2|X]\le\sigma^2$ almost surely and $\sup_Xk_m(X,X)\le1$, then we have
\[
\|\hat\Sigma_{m,\epsilon}\|_{H_m}=O_p(\sigma/\sqrt n).\quad(79)
\]

Proof: By definition, we have
\[
\mathrm E[\|\hat\Sigma_{m,\epsilon}\|_{H_m}]\le\sqrt{\mathrm E[\|\hat\Sigma_{m,\epsilon}\|_{H_m}^2]}=\sqrt{\mathrm E\Big[\frac1{n^2}\sum_{i,j=1}^nk_m(x_i,x_j)\epsilon_i\epsilon_j\Big]}\le\sqrt{\frac{\sigma^2}n}.
\]
Applying Markov's inequality, we obtain the assertion.

Proposition 1 (Bernstein's inequality in Hilbert spaces) Let $(\Omega,\mathcal A,P)$ be a probability space, $H$ be a separable Hilbert space, $B>0$, and $\sigma>0$. Furthermore, let $\xi_1,\dots,\xi_n:\Omega\to H$ be independent random variables satisfying $\mathrm E[\xi_i]=0$, $\|\xi_i\|_H\le B$, and $\mathrm E[\|\xi_i\|_H^2]\le\sigma^2$ for all $i=1,\dots,n$. Then we have
\[
P\Bigg(\Big\|\frac1n\sum_{i=1}^n\xi_i\Big\|_H\ge\sqrt{\frac{2\sigma^2\tau}n}+\sqrt{\frac{\sigma^2}n}+\frac{2B\tau}{3n}\Bigg)\le e^{-\tau}\quad(\tau>0).
\]

Proof: See Theorem 6.14 of Steinwart (2008).
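Like Proposition 1, Lemma 11 controls the norm of an empirical average in a Hilbert space, and the quantity it bounds is exactly computable from the Gram matrix, since $\|\hat\Sigma_{m,\epsilon}\|_{H_m}^2=\frac1{n^2}\sum_{i,j}k_m(x_i,x_j)\epsilon_i\epsilon_j$. This permits a quick Monte Carlo check of the $\sigma/\sqrt n$ rate; the Gaussian kernel, uniform inputs, and Gaussian noise below are assumptions of this sketch only:

```python
import numpy as np

rng = np.random.default_rng(2)

def gram(x, bw=1.0):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * bw ** 2))

sigma = 0.3
for n in (100, 400, 1600):
    norms = []
    for _ in range(200):
        x = rng.uniform(-1.0, 1.0, n)
        eps = rng.normal(0.0, sigma, n)                  # noise with E[eps^2 | X] = sigma^2
        norms.append(np.sqrt(eps @ gram(x) @ eps) / n)   # = ||Sigma_hat_{m,eps}||_{H_m}
    print(n, np.mean(norms), sigma / np.sqrt(n))         # both columns shrink like 1/sqrt(n)
```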
References

F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, pages 41–48, 2004.

F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.

P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris, Ser. I Math., 334:495–500, 2002.

A. Caponnetto and E. de Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

C. Cortes. Can learning kernels help performance? Invited talk at the International Conference on Machine Learning (ICML 2009), Montréal, Canada, 2009.

C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montréal, Canada, 2009.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, pages 63–77, Berlin, 2005. Springer-Verlag.

J. Jia and B. Yu. On model selection consistency of the elastic net when p ≫ n. Statistica Sinica, 20(2), 2010. To appear.

M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate ℓp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22, pages 997–1005, Cambridge, MA, 2009. MIT Press.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.

V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In Proceedings of the Annual Conference on Learning Theory, pages 229–238, 2008.

G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.

Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272–2297, 2006.

L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.

S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48:1977–1991, 2002.

C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.

A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

I. Steinwart. Support Vector Machines. Springer, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79–93, 2009.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

T. Suzuki and R. Tomioka. SpicyMKL. arXiv:0909.5026, 2009.

M. Talagrand. A new look at independence. The Annals of Probability, 24:1–34, 1996a.

M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996b.

R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL. arXiv:1001.2615, 2010.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.

V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society, Series B, 69(2):143–161, 2007.

T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. The Annals of Statistics, 37(5):2109–2144, 2009.

P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.

H. Zou and H. H. Zhang. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009.
