Fast Learning Rate of ℓp-MKL and its Minimax Optimality


Authors: Taiji Suzuki

Taiji Suzuki
Department of Mathematical Informatics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan. t-suzuki@mist.i.u-tokyo.ac.jp

Abstract

In this paper, we give a new sharp generalization bound for ℓp-MKL, a generalized framework of multiple kernel learning (MKL) that imposes ℓp-mixed-norm regularization instead of ℓ1-mixed-norm regularization. We utilize localization techniques to obtain the sharp learning rate. The bound is characterized by the decay rate of the eigenvalues of the associated kernels: a larger decay rate gives a faster convergence rate. Furthermore, we give the minimax learning rate on the ball characterized by the ℓp-mixed-norm in the product space. We then show that our derived learning rate of ℓp-MKL achieves the minimax optimal rate on the ℓp-mixed-norm ball.

1 Introduction

Multiple Kernel Learning (MKL), proposed by Lanckriet et al. (2004), is one of the most promising methods for adaptively selecting the kernel function in supervised kernel learning. Kernel methods are widely used and several studies have supported their usefulness (Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004). However, the performance of kernel methods critically relies on the choice of the kernel function, and many methods have been proposed to deal with the issue of kernel selection. Ong et al. (2005) studied hyperkernels as a kernel over kernel functions. Argyriou et al. (2006) considered a DC-programming approach to learn a mixture of kernels with continuous parameters (see also Argyriou et al. (2005)). Some studies tackled the problem of learning a non-linear combination of kernels, as in Bach (2009), Cortes et al. (2009a), Varma and Babu (2009).

Among these approaches, learning a linear combination of finitely many candidate kernels with non-negative coefficients is the most basic, fundamental and commonly used one. The seminal work of Lanckriet et al. (2004) considered learning a convex combination of candidate kernels and opened up the subsequent sequence of MKL studies. Bach et al. (2004) showed that MKL can be reformulated as a kernel version of the group lasso (Yuan and Lin, 2006). This formulation gives the insight that MKL can be described as an ℓ1-regularized learning method. As a generalization of MKL, ℓp-MKL, which imposes the ℓp-mixed-norm regularization $\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p$ with $p \ge 1$, has been proposed (Micchelli and Pontil, 2005, Kloft et al., 2009), where $\{\mathcal{H}_m\}_{m=1}^M$ are $M$ reproducing kernel Hilbert spaces (RKHSs) and $f_m \in \mathcal{H}_m$. ℓp-MKL includes the original MKL as the special case $p = 1$. One recent perception is that ℓp-MKL with $p > 1$ shows better performance than ℓ1-MKL in several situations (Kloft et al., 2009, Cortes et al., 2009b). To justify the usefulness of ℓp-MKL, a few papers have given theoretical analyses of it (Cortes et al., 2009b, 2010, Kloft et al., 2010a). In this paper, we give a new, faster learning rate for ℓp-MKL utilizing localization techniques (van de Geer, 2000, Bartlett et al., 2005, 2006, Koltchinskii, 2006), and show that our learning rate is optimal in the sense of minimaxity. This is the first attempt to establish a fast localized learning rate for ℓp-MKL.
In the pioneering paper of Lanckriet et al. (2004), a convergence rate of MKL is given as $\sqrt{M/n}$, where $M$ is the number of given kernels and $n$ is the number of samples. Srebro and Ben-David (2006) gave an improved learning bound utilizing the pseudo-dimension of the given kernel class. Ying and Campbell (2009) gave a convergence bound utilizing Rademacher chaos, along with upper bounds of the Rademacher chaos in terms of the pseudo-dimension of the kernel class. Cortes et al. (2009b) presented a convergence bound for a learning method with $L_2$ regularization on the kernel weights. Cortes et al. (2010) showed that the convergence rate of ℓ1-MKL is $\sqrt{\log(M)/n}$, and also gave the convergence rate of ℓp-MKL as $M^{1-\frac{1}{p}}/\sqrt{n}$ for $p > 1$. Kloft et al. (2010a) gave a similar convergence bound with improved constants. Kloft et al. (2010b) generalized the bound to a variant of elastic-net-type regularization and widened the effective range of $p$ to all of $p \ge 1$, while the earlier bounds imposed $1 \le p \le 2$.

Our concern about the existing bounds is that all the bounds introduced above are "global" bounds, in the sense that they are applicable to all candidate estimators. Consequently, all the convergence rates presented above are of order $1/\sqrt{n}$ with respect to the number $n$ of samples. However, by utilizing localization techniques, including the so-called local Rademacher complexity (Bartlett et al., 2005, 2006, Koltchinskii, 2006) and the peeling device (van de Geer, 2000), we can derive a faster learning rate. Instead of uniformly bounding all candidate estimators, a localized inequality focuses on a particular estimator, such as the empirical risk minimizer, and thus can give a sharp convergence rate. Localized bounds for MKL have been given mainly in sparse learning settings such as ℓ1-MKL or elastic-net-type MKL (Shawe-Taylor, 2008, Tomioka and Suzuki, 2009). The first localized bound for MKL was derived by Koltchinskii and Yuan (2008) in the setting of ℓ1-MKL. The second one was given by Meier et al. (2009), who proved a near-optimal convergence rate for elastic-net-type regularization. Recently, Koltchinskii and Yuan (2010) considered a variant of ℓ1-MKL and showed that it achieves the minimax optimal convergence rate. All these localized convergence rates were obtained in sparse learning settings; the localized fast learning rate of ℓp-MKL has not been addressed.

In this paper, we give a sharp convergence rate for ℓp-MKL utilizing the localization techniques. Our bound also clarifies the relation between the convergence rate and the tuning parameter $p$. The resultant convergence rate is
$$M^{1-\frac{2s}{p(1+s)}}\, n^{-\frac{1}{1+s}}\, R_p^{\frac{2s}{1+s}},$$
where $R_p = \big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\big)^{\frac{1}{p}}$ is determined by the true function $f^*$, and $s$ ($0 < s < 1$) is a constant that represents the complexity of the RKHSs. The bound includes the bounds of Cortes et al. (2010) and Kloft et al. (2010a) as the special case $s \to 1$. Finally, we show that the bound for ℓp-MKL achieves the minimax optimal rate on the ball with respect to the ℓp-mixed-norm, $\{f = \sum_{m=1}^M f_m \mid (\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p)^{\frac{1}{p}} \le R\}$. This indicates that ℓp-MKL is well matched to the ℓp-mixed-norm ball.
2 Preliminary

In this section, we give the problem formulation, the notation and the assumptions used in the convergence analysis of ℓp-MKL.

2.1 Problem Formulation

Suppose that we are given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ distributed from a probability distribution $P$ on $\mathcal{X} \times \mathbb{R}$ with marginal distribution $\Pi$ on $\mathcal{X}$. We are also given $M$ reproducing kernel Hilbert spaces (RKHSs) $\{\mathcal{H}_m\}_{m=1}^M$, each of which is associated with a kernel $k_m$. ℓp-MKL ($p \ge 1$) fits a function $f = \sum_{m=1}^M f_m$ ($f_m \in \mathcal{H}_m$) to the data by solving the following optimization problem:
$$\hat{f} = \sum_{m=1}^M \hat{f}_m = \operatorname*{arg\,min}_{f_m \in \mathcal{H}_m\,(m=1,\dots,M)}\ \frac{1}{n}\sum_{i=1}^n \Big(y_i - \sum_{m=1}^M f_m(x_i)\Big)^2 + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}. \tag{1}$$
This is reduced to a finite-dimensional optimization problem by the representer theorem (Kimeldorf and Wahba, 1971). The problem is convex, and thus there are efficient algorithms to solve it, e.g., Kloft et al. (2009, 2010a) and Vishwanathan et al. (2010). In this paper we focus on the regression problem (the squared loss); however, the discussion presented here can be generalized to Lipschitz continuous and strongly convex losses (Bartlett et al., 2006). (One might like to use $\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p$ instead of $(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p)^{\frac{2}{p}}$ as the regularizer; this difference does not matter, because by adjusting the regularization parameter $\lambda_1^{(n)}$ there is a one-to-one correspondence between the solutions of both regularization types.)

Sometimes the regularization of ℓp-MKL for $1 \le p \le 2$ is imposed in terms of the kernel weights as
$$\min_{\theta \in \mathbb{R}^M,\, f \in \mathcal{H}_{k_\theta}}\ \frac{1}{n}\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda_1^{(n)} \|f\|_{\mathcal{H}_{k_\theta}}^2 \quad \text{s.t.}\ \ k_\theta = \sum_{m=1}^M \theta_m k_m,\ \ \sum_{m=1}^M \theta_m^{\frac{p}{2-p}} = 1,\ \ \theta_m \ge 0, \tag{2}$$
where $\mathcal{H}_{k_\theta}$ is the RKHS corresponding to the kernel $k_\theta$. However, these two formulations are completely the same, that is, we obtain the same solution in both formulations (see Lemma 25 of Micchelli and Pontil (2005) and Tomioka and Suzuki (2010) for details). Moreover, our formulation (1) also covers the situation $p > 2$, while the kernel-weight formulation is restricted to $1 \le p \le 2$.

The constants used throughout this article are summarized in Table 1.

Table 1: Summary of the constants used in this article.
  n:    the number of samples.
  M:    the number of candidate kernels.
  s:    the spectral decay coefficient; see (A3).
  κ_M:  the smallest eigenvalue of the design (see Eq. (6)).
  R_p:  the ℓp-mixed-norm of the truth: $(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p)^{1/p}$.
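The following is a minimal numerical sketch, not part of the paper and not one of the solvers cited above, of what Eq. (1) looks like after the representer theorem: $f_m(x) = \sum_i \alpha_{m,i} k_m(x_i, x)$, so $\|f_m\|_{\mathcal{H}_m}^2 = \alpha_m^\top K_m \alpha_m$ and the objective becomes a finite-dimensional convex problem in the coefficient matrix. Step sizes, iteration counts and the toy kernels are illustrative choices only.

```python
import numpy as np

# Minimal sketch of Eq. (1) via the representer theorem:
#   (1/n)||y - sum_m K_m a_m||^2 + lam * (sum_m (a_m^T K_m a_m)^{p/2})^{2/p},
# minimized here by plain gradient descent (for p > 1 the penalty is smooth
# wherever the norms are nonzero; eps guards the square root at zero).
def lp_mkl(Ks, y, p=1.5, lam=1e-2, lr=1e-2, iters=2000, eps=1e-12):
    M, n, _ = Ks.shape
    A = np.zeros((M, n))                                     # one coefficient row per kernel
    for _ in range(iters):
        resid = y - np.einsum("mij,mj->i", Ks, A)            # y - sum_m K_m a_m
        h = np.sqrt(np.einsum("mi,mij,mj->m", A, Ks, A)) + eps  # RKHS norms ||f_m||
        S = (h ** p).sum() ** (2 / p - 1)                    # (sum_m h_m^p)^{2/p - 1}
        for m in range(M):
            grad = (-2 / n) * Ks[m] @ resid \
                   + 2 * lam * S * h[m] ** (p - 2) * (Ks[m] @ A[m])
            A[m] -= lr * grad
    return A

# Toy data with two candidate Gaussian kernels of different widths.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(60)
Ks = np.stack([np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * w ** 2)) for w in (0.1, 0.5)])
A = lp_mkl(Ks, y)
print("RKHS norms per kernel:", np.sqrt(np.einsum("mi,mij,mj->m", A, Ks, A)))
```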
2.2 Notations and Assumptions

Here we prepare the notation and the conditions used in the analysis. Let $\mathcal{H}^{\oplus M} = \mathcal{H}_1 \oplus \dots \oplus \mathcal{H}_M$. Throughout the paper, we assume the following technical conditions (see also Bach (2008)).

Assumption 1 (Basic Assumptions)
(A1) There exists $f^* = (f_1^*, \dots, f_M^*) \in \mathcal{H}^{\oplus M}$ such that $\mathrm{E}[Y|X] = f^*(X) = \sum_{m=1}^M f_m^*(X)$, and the noise $\epsilon := Y - f^*(X)$ is bounded as $|\epsilon| \le L$.
(A2) For each $m = 1, \dots, M$, $\mathcal{H}_m$ is separable (with respect to the RKHS norm) and $\sup_{X \in \mathcal{X}} |k_m(X, X)| < 1$.

The first assumption in (A1) ensures that the model $\mathcal{H}^{\oplus M}$ is correctly specified, and the technical assumption $|\epsilon| \le L$ allows $\epsilon f$ to be Lipschitz continuous with respect to $f$. The noise boundedness can be relaxed to the unbounded situation as in Raskutti et al. (2010), but we do not pursue that direction for simplicity.

By Mercer's theorem, there exist an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and a spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the spectral representation
$$k_m(x, x') = \sum_{k=1}^\infty \mu_{k,m}\, \phi_{k,m}(x)\, \phi_{k,m}(x'). \tag{3}$$
By this spectral representation, the inner product of the RKHS can be expressed as $\langle f_m, g_m \rangle_{\mathcal{H}_m} = \sum_{k=1}^\infty \mu_{k,m}^{-1} \langle f_m, \phi_{k,m} \rangle_{L_2(\Pi)} \langle \phi_{k,m}, g_m \rangle_{L_2(\Pi)}$ for $f_m, g_m \in \mathcal{H}_m$.

Assumption 2 (Spectral Assumption) There exist $0 < s < 1$ and $0 < c$ such that
(A3) $\mu_{k,m} \le c\, k^{-\frac{1}{s}}$ $(1 \le \forall k,\ 1 \le \forall m \le M)$,
where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (3)).

It was shown that the Spectral Assumption (A3) is equivalent to the classical covering-number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $N(\epsilon, \mathcal{B}_{\mathcal{H}_m}, L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $\mathcal{B}_{\mathcal{H}_m}$ of $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the Spectral Assumption (A3) holds, there exists a constant $C$ depending only on $s$ and $c$ such that
$$\log N(\varepsilon, \mathcal{B}_{\mathcal{H}_m}, L_2(\Pi)) \le C \varepsilon^{-2s}, \tag{4}$$
and the converse is also true (see Steinwart et al. (2009, Theorem 15) and Steinwart (2008) for details). Therefore, if $s$ is large the RKHSs are regarded as "complex", and if $s$ is small the RKHSs are "simple". Associated with the $\epsilon$-covering number, the $i$-th entropy number $e_i(\mathcal{H}_m \to L_2(\Pi))$ is defined as the infimum over all $\varepsilon > 0$ for which $N(\varepsilon, \mathcal{B}_{\mathcal{H}_m}, L_2(\Pi)) \le 2^{i-1}$. If the Spectral Assumption (A3) holds, the relation (4) implies that the $i$-th entropy number is bounded as
$$e_i(\mathcal{H}_m \to L_2(\Pi)) \le C i^{-\frac{1}{2s}}, \tag{5}$$
where $C$ is a constant. To bound the empirical process, a bound on the entropy number with respect to the empirical distribution is needed. The following proposition gives such an upper bound (see, for example, Corollary 7.31 of Steinwart (2008)).

Proposition 1 If there exist constants $0 < s < 1$ and $C \ge 1$ such that $e_i(\mathcal{H}_m \to L_2(\Pi)) \le C i^{-\frac{1}{2s}}$, then there exists a constant $c_s > 0$ depending only on $s$ such that
$$\mathrm{E}_{D_n \sim \Pi^n}\big[e_i(\mathcal{H}_m \to L_2(D_n))\big] \le c_s C \big(\min(i, n)\big)^{\frac{1}{2s}} i^{-\frac{1}{s}},$$
and in particular $\mathrm{E}_{D_n \sim \Pi^n}[e_i(\mathcal{H}_m \to L_2(D_n))] \le c_s C i^{-\frac{1}{2s}}$.

An important class of RKHSs for which $s$ is known is the Sobolev spaces. (A3) holds with $s = \frac{d}{2m}$ for the Sobolev space of $m$-times continuous differentiability on the Euclidean ball of $\mathbb{R}^d$ (van der Vaart and Wellner, 1996, Theorem 2.7.1). Moreover, for $m$-times differentiable kernels on a closed Euclidean ball in $\mathbb{R}^d$, (A3) holds with $s = \frac{d}{2m}$ (Steinwart, 2008, Theorem 6.26). According to Zhou (2002), for Gaussian kernels with compact support, (A3) holds for arbitrarily small $0 < s$. The entropy number of Gaussian kernels with unbounded support is described in Theorem 7.34 of Steinwart (2008).
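A small empirical illustration, not from the paper: the eigenvalues of the normalized Gram matrix $K/n$ approximate the Mercer spectrum $\{\mu_k\}$ of the kernel integral operator, so the decay coefficient $s$ in (A3) can be estimated by a log-log regression. The Laplace kernel and the sample sizes below are arbitrary choices; for this kernel (a Sobolev-type RKHS with $m = 1$, $d = 1$) the estimate should come out roughly $1/2$.

```python
import numpy as np

# Estimate the spectral decay coefficient s of (A3), mu_k <= c * k^(-1/s),
# from the eigenvalues of the normalized Gram matrix K/n.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)

K = np.exp(-np.abs(x[:, None] - x[None, :]))   # Laplace kernel Gram matrix
mu = np.linalg.eigvalsh(K / n)[::-1]           # eigenvalues, descending

k = np.arange(1, 31)                           # fit on the leading eigenvalues
slope, _ = np.polyfit(np.log(k), np.log(mu[:30]), 1)
s_hat = -1.0 / slope                           # since mu_k ~ k^(-1/s)
print(f"estimated spectral decay coefficient s ≈ {s_hat:.3f}")
```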
Let $\kappa_M$ be defined as
$$\kappa_M := \sup\Big\{\kappa \ge 0 \;\Big|\; \kappa \le \frac{\|\sum_{m=1}^M f_m\|_{L_2(\Pi)}^2}{\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}^2},\ \forall f_m \in \mathcal{H}_m\ (m = 1, \dots, M)\Big\}. \tag{6}$$
$\kappa_M$ represents the correlation of the RKHSs. We assume that the RKHSs are not completely correlated with each other.

Assumption 3 (Incoherence Assumption) $\kappa_M$ is strictly bounded from below; that is, there exists a constant $C_0 > 0$ such that
(A4) $0 < C_0^{-1} < \kappa_M$.

This condition is motivated by the incoherence condition (Koltchinskii and Yuan, 2008, Meier et al., 2009) considered in sparse MKL settings. It ensures the uniqueness of the decomposition $f^* = \sum_{m=1}^M f_m^*$ of the ground truth. Bach (2008) also assumed this condition to show the consistency of ℓ1-MKL.

Finally, we give a technical assumption with respect to the sup-norm.

Assumption 4 (Embedded Assumption) Under the Spectral Assumption with coefficient $s$, there exists a constant $C_1 > 0$ such that
(A5) $\|f_m\|_\infty \le C_1 \|f_m\|_{L_2(\Pi)}^{1-s}\, \|f_m\|_{\mathcal{H}_m}^{s}$.

This condition is met when the RKHSs are continuously embedded in a Besov space $B_{2,1}^{sm}(\mathcal{X})$, where $s = \frac{d}{2m}$, $d$ is the dimension of the input space $\mathcal{X}$ and $m$ is the smoothness of the Besov space. For example, the RKHSs of Gaussian kernels can be embedded in all Sobolev spaces, and therefore the condition (A5) is rather common and practical. More generally, there is a clean characterization of the condition (A5) in terms of real interpolation of spaces. One can find detailed and formal discussions of interpolation in Steinwart et al. (2009), and Proposition 2.10 of Bennett and Sharpley (1988) gives a necessary and sufficient condition for (A5).

3 Convergence Rate of ℓp-MKL

Here we derive the convergence rate of the estimator $\hat{f}$. We suppose that the number of kernels $M$ can increase with the number of samples $n$. The motivation of our analysis is summarized as follows:
• deriving a sharp convergence rate utilizing localization techniques;
• clarifying the relation between the norm $R_p = (\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p)^{\frac{1}{p}}$ of the truth and the generalization bound.

Now define $\eta(t) := \eta_n(t) = \max(1, \sqrt{t}, t/\sqrt{n})$, and for a given positive real $\lambda$ define
$$\zeta_n := 2\left(\sqrt{\frac{M \log M}{n}} \;\vee\; \frac{\lambda^{-\frac{s}{2}}\, M^{\frac{1+s}{2}-\frac{s}{p}}}{\sqrt{n}} \;\vee\; \frac{M^{\frac{1+4s-s^2}{2(1+s)}-\frac{s(3-s)}{p(1+s)}}\, \lambda^{-\frac{s(3-s)}{2(1+s)}}}{n^{\frac{1}{1+s}}}\right). \tag{7}$$
Then we obtain the following convergence rate.

Theorem 2 Suppose $\lambda_1^{(n)} > 0$ and let $\lambda$ in the definition (7) of $\zeta_n$ be $\lambda = \lambda_1^{(n)}$. Then there exists a constant $\psi_s$ depending on $L, s, c, C_1$ such that, for all $n$ and $t' (> 0)$ satisfying $\frac{\log M}{\sqrt{n}} \le 1$ and $\psi_s \sqrt{n}\, \zeta_n^2\, \eta(t')/\kappa_M \le 1$, the solution of ℓp-MKL given in Eq. (1) for arbitrary real $p \ge 1$ satisfies
$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le \frac{\psi_s^2}{\kappa_M}\, \zeta_n^2\, \eta(t)^2 + \frac{8}{3}\, \lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} \tag{8}$$
with probability $1 - \exp(-t) - \exp(-t')$ for all $t \ge 1$.

The proof is given in Appendix A. Let $R_p := \big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\big)^{\frac{1}{p}}$, and suppose that $n$ is sufficiently large compared with $M$ and $R_p$, namely $n \ge M^{\frac{2}{p}} R_p^{-2} (\log M)^{\frac{1+s}{s}}$ and $n \ge (R_p/M^{\frac{1}{p}})^{\frac{4s}{1-s}}$. Then the regularization parameter $\lambda_1^{(n)}$ that minimizes the right-hand side of the bound (8) is, up to a constant,
$$\lambda_1^{(n)} = n^{-\frac{1}{1+s}}\, M^{1-\frac{2s}{p(1+s)}}\, R_p^{-\frac{2}{1+s}},$$
and the convergence rate of $\|\hat{f} - f^*\|_{L_2(\Pi)}^2$ becomes
$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\left(n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}} + \frac{M \log M}{n} + n^{-\frac{1}{1+s}-\frac{(s-1)^2}{(1+s)^2}} M^{1-\frac{2s(3-s)}{p(1+s)^2}} R_p^{\frac{2s(3-s)}{(1+s)^2}}\right). \tag{9}$$
Under the condition $n \ge M^{\frac{2}{p}} R_p^{-2} (\log M)^{\frac{1+s}{s}} \vee (R_p/M^{\frac{1}{p}})^{\frac{4s}{1-s}}$, the leading term is the first one, and thus we have
$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 = O_p\left(n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R_p^{\frac{2s}{1+s}}\right). \tag{10}$$
Note that as the complexity $s$ of the RKHSs becomes small, the convergence rate becomes fast. It is known that $n^{-\frac{1}{1+s}}$ is the minimax optimal learning rate for single kernel learning. The derived rate of ℓp-MKL is thus the optimal rate of single kernel learning multiplied by a coefficient depending on $M$ and $R_p$.
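As a direct illustration, not from the paper, the snippet below evaluates the leading term of Eq. (10) together with the rate-optimal $\lambda_1^{(n)}$; the sparse and dense cases examined next can be reproduced by choosing $R_p$ accordingly.

```python
# Leading term of Eq. (10) and the optimizing regularization parameter.
def rate(n, M, p, s, R_p):
    return n ** (-1 / (1 + s)) * M ** (1 - 2 * s / (p * (1 + s))) * R_p ** (2 * s / (1 + s))

def lam_opt(n, M, p, s, R_p):
    return n ** (-1 / (1 + s)) * M ** (1 - 2 * s / (p * (1 + s))) * R_p ** (-2 / (1 + s))

n, M, s = 10_000, 50, 0.5
for p in (1.0, 1.5, 2.0, 4.0):
    sparse = rate(n, M, p, s, R_p=1.0)           # (||f*_m||)_m = (1, 0, ..., 0)
    dense = rate(n, M, p, s, R_p=M ** (1 / p))   # (||f*_m||)_m = (1, 1, ..., 1)
    print(f"p={p:>3}: sparse rate {sparse:.3e}, dense rate {dense:.3e}")
# The sparse rate grows with p, while the dense rate equals M * n^{-1/(1+s)}
# for every p, matching the discussion below.
```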
To investigate the dependency of the learning rate on $R_p$, let us consider two extreme settings, a sparse setting $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, 0, \dots, 0)$ and a dense setting $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, \dots, 1)$, as in Kloft et al. (2010a).
• $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, 0, \dots, 0)$: here $R_p = 1$ for all $p$. Therefore the convergence rate $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}}$ is fast for small $p$, and the minimum is achieved at $p = 1$. This means that ℓ1 regularization is preferred for a sparse truth.
• $(\|f_m^*\|_{\mathcal{H}_m})_{m=1}^M = (1, \dots, 1)$: here $R_p = M^{\frac{1}{p}}$, and thus the convergence rate is $M n^{-\frac{1}{1+s}}$ for all $p$. Interestingly, for a dense ground truth there is no dependency of the convergence rate on the parameter $p$: the rate is $M$ times the optimal learning rate $n^{-\frac{1}{1+s}}$ of single kernel learning, for all $p$. This means that in the dense setting the complexity of solving the MKL problem is equivalent to that of solving $M$ single kernel learning problems.

3.1 Comparison with existing bounds

Here we compare our bound with the existing ones. Let $\mathcal{H}_{\ell_p}(R)$ be the ℓp-mixed-norm ball with radius $R$:
$$\mathcal{H}_{\ell_p}(R) := \Big\{f = \sum_{m=1}^M f_m \;\Big|\; \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{\frac{1}{p}} \le R\Big\}.$$
The bounds by Cortes et al. (2010) and Kloft et al. (2010b,a) are the most relevant to our results. Roughly speaking, their bounds are given as
$$\mathcal{R}(f) \le \widehat{\mathcal{R}}(f) + C\, \frac{M^{1-\frac{1}{p}} \vee \sqrt{\log M}}{\sqrt{n}}\, R \quad \text{for all } f \in \mathcal{H}_{\ell_p}(R), \tag{11}$$
where $\mathcal{R}(f)$ and $\widehat{\mathcal{R}}(f)$ are the population risk and the empirical risk. A first observation is that the bounds of Cortes et al. (2010) and Kloft et al. (2010a) are restricted to the situation $1 \le p \le 2$, because their analysis is based on the kernel-weight-constraint formulation (2); our analysis, and that of Kloft et al. (2010b), covers all $p \ge 1$. Second, since our bound is specialized to the regularized risk minimizer $\hat{f}$ defined in Eq. (1), while the existing bound (11) is applicable to all $f \in \mathcal{H}_{\ell_p}(R)$, our bound is sharper than theirs. To see this, suppose that $1 \le p \le 2$ and $n \ge M^{\frac{2}{p}}$ (which implies $n^{-\frac{1}{2}} M^{1-\frac{1}{p}} \le 1$, so that the bound (11) makes sense); then
$$n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} \le n^{-\frac{1}{2}} M^{1-\frac{1}{p}}.$$
For the situation $p \ge 2$, we also have $n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} \le n^{-\frac{1}{2}} M^{1-\frac{1}{p}}$ for $n \ge M^{\frac{2}{p}}$. Moreover, we should note that $s$ can be taken as large as the Spectral Assumption (A3) allows; thus the bound (11) is recovered by our analysis by letting $s$ approach 1.

The results of Koltchinskii and Yuan (2008), Meier et al. (2009) and Koltchinskii and Yuan (2010) are also related to ours in terms of the proof techniques: their analyses and ours utilize localization techniques to obtain fast localized learning rates, in contrast to the global bounds of Cortes et al. (2010) and Kloft et al. (2010b,a). However, all those localized bounds were obtained in sparse learning settings, such as ℓ1 and elastic-net regularizations; hence their frameworks are rather different from ours.
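A quick numerical sanity check, not from the paper, of the comparison above: whenever $n \ge M^{2/p}$ the localized rate should not exceed the global rate appearing in Eq. (11). The tested configurations are arbitrary.

```python
# Check: n^{-1/(1+s)} M^{1-2s/(p(1+s))} <= n^{-1/2} M^{1-1/p} whenever n >= M^{2/p}.
def localized(n, M, p, s):
    return n ** (-1 / (1 + s)) * M ** (1 - 2 * s / (p * (1 + s)))

def global_rate(n, M, p):
    return n ** (-0.5) * M ** (1 - 1 / p)

for s in (0.2, 0.5, 0.8):
    for p in (1.0, 1.5, 2.0, 4.0):
        for n, M in ((10**3, 10), (10**4, 100), (10**6, 500)):
            if n >= M ** (2 / p):                     # side condition
                assert localized(n, M, p, s) <= global_rate(n, M, p) * (1 + 1e-9)
print("localized rate <= global rate on all tested configurations")
```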
4 Lower Bound of the Learning Rate

In this section, we show that the derived learning rate achieves the minimax learning rate on $\mathcal{H}_{\ell_p}(R)$. We derive the minimax learning rate in a simplified situation. First, we assume that each RKHS is identical to the others: the input vector is decomposed into $M$ components as $x = (x^{(1)}, \dots, x^{(M)})$, where $\{x^{(m)}\}_{m=1}^M$ are $M$ i.i.d. copies of a random variable $\tilde{X}$, and
$$\mathcal{H}_m = \big\{f_m \mid f_m(x) = f_m(x^{(1)}, \dots, x^{(M)}) = \tilde{f}_m(x^{(m)}),\ \tilde{f}_m \in \tilde{\mathcal{H}}\big\},$$
where $\tilde{\mathcal{H}}$ is an RKHS shared by all $\mathcal{H}_m$. Thus $f \in \mathcal{H}^{\oplus M}$ is decomposed as $f(x) = f(x^{(1)}, \dots, x^{(M)}) = \sum_{m=1}^M \tilde{f}_m(x^{(m)})$, where each $\tilde{f}_m$ is a member of the common RKHS $\tilde{\mathcal{H}}$. We denote by $\tilde{k}$ the kernel associated with the RKHS $\tilde{\mathcal{H}}$.

In addition to the upper bound on the spectrum (Spectral Assumption (A3)), we assume that the spectra of all the RKHSs $\mathcal{H}_m$ have a lower bound of the same polynomial rate.

Assumption 5 (Strong Spectral Assumption) There exist $0 < s < 1$ and $0 < c, c'$ such that
(A6) $c' k^{-\frac{1}{s}} \le \tilde{\mu}_k \le c\, k^{-\frac{1}{s}}$ $(1 \le \forall k)$,
where $\{\tilde{\mu}_k\}_k$ is the spectrum of the kernel $\tilde{k}$. In particular, the spectra of the kernels $k_m$ also satisfy $\mu_{k,m} \sim k^{-\frac{1}{s}}$ $(\forall k, m)$.

As discussed after Assumption 2, this means that the covering number of $\tilde{\mathcal{H}}$ satisfies $\log N(\varepsilon, \mathcal{B}_{\tilde{\mathcal{H}}}, L_2(\Pi)) \sim \varepsilon^{-2s}$, where $\mathcal{B}_{\tilde{\mathcal{H}}}$ is the unit ball of $\tilde{\mathcal{H}}$ (see Steinwart et al. (2009, Theorem 15) and Steinwart (2008) for details). Without loss of generality, we may assume that $\mathrm{E}[f(\tilde{X})] = 0$ $(\forall f \in \tilde{\mathcal{H}})$. Since each $f_m$ receives an i.i.d. copy of $\tilde{X}$, the spaces $\mathcal{H}_m$ are orthogonal to each other:
$$\mathrm{E}[f_m(X) f_{m'}(X)] = \mathrm{E}\big[\tilde{f}_m(X^{(m)})\, \tilde{f}_{m'}(X^{(m')})\big] = 0 \quad (\forall f_m \in \mathcal{H}_m,\ \forall f_{m'} \in \mathcal{H}_{m'},\ 1 \le \forall m \ne m' \le M).$$
We also assume that the noise $\{\epsilon_i\}_{i=1}^n$ is an i.i.d. normal sequence with standard deviation $\sigma > 0$. Under the assumptions described above, we have the following minimax $L_2(\Pi)$-error.

Theorem 3 For a given $0 < R$, the minimax learning rate on $\mathcal{H}_{\ell_p}(R)$ is lower bounded as
$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge C\, n^{-\frac{1}{1+s}}\, M^{1-\frac{2s}{p(1+s)}}\, R^{\frac{2s}{1+s}},$$
where the minimum is taken over all measurable estimators based on the $n$ samples $\{(x_i, y_i)\}_{i=1}^n$.

The proof is given in Appendix B. One can see that the convergence rate derived in Theorem 2 and Eq. (10) achieves the lower bound of Theorem 3; thus our bound is tight. Interestingly, the learning rate (10) of ℓp-MKL and the minimax learning rate on the ℓp-mixed-norm ball coincide at the common $p$. This means that the ℓp-mixed-norm regularization is well suited to estimators lying in the ℓp-mixed-norm ball.

5 Conclusion and Discussion

We have shown a sharp optimal learning rate for ℓp-MKL by utilizing localization techniques. Our bound is sharper than the existing bounds and achieves the minimax learning rate under the Spectral Assumption (A3). Important directions for future work remain. The bound given in Eq. (10) becomes smaller as $p$ becomes smaller, since $R_p/M^{\frac{1}{p}}$ decreases as $p \searrow 1$. That is, according to the theoretical result, ℓp-MKL shows the best performance at $p = 1$, despite the disappointing results for $p = 1$ reported in some numerical experiments. This concern was also pointed out by Cortes et al. (2010). It is an important future work to theoretically clarify why ℓp-MKL with $p > 1$ works well in some real situations. A second interesting direction concerns the $\frac{M \log M}{n}$ term appearing in the bound (9). Because of this term, our bound is $O(M \log M)$ with respect to $M$, while the existing bounds are $O(M^{1-\frac{1}{p}})$. It is an interesting issue to clarify whether the $\frac{M \log M}{n}$ term can be replaced by a tighter bound or not. To do so, it might be useful to precisely estimate the covering number of $\mathcal{H}_{\ell_p}(R)$.
Acknowledgement

We would like to thank Ryota Tomioka and Masashi Sugiyama for suggestive discussions. This work was partially supported by MEXT Kakenhi 22700289.

A Proof of Theorem 2

Before we prove Theorem 2, we prepare several lemmas. The following two propositions are key to the localization argument. Let $\{\sigma_i\}_{i=1}^n$ be i.i.d. Rademacher random variables, i.e., $\sigma_i \in \{\pm 1\}$ and $P(\sigma_i = 1) = P(\sigma_i = -1) = \frac{1}{2}$.

Proposition 4 (Steinwart, 2008, Theorem 7.16) Let $\mathcal{B}_{\sigma,a,b} \subset \mathcal{H}_m$ be the set $\mathcal{B}_{\sigma,a,b} = \{f_m \in \mathcal{H}_m \mid \|f_m\|_{L_2(\Pi)} \le \sigma,\ \|f_m\|_{\mathcal{H}_m} \le a,\ \|f_m\|_\infty \le b\}$. Assume that there exist constants $0 < s < 1$ and $0 < \tilde{c}_s$ such that $\mathrm{E}_{D_n}[e_i(\mathcal{H}_m \to L_2(D_n))] \le \tilde{c}_s i^{-\frac{1}{2s}}$. Then there exists a constant $C_s'$ depending only on $s$ such that
$$\mathrm{E}\left[\sup_{f_m \in \mathcal{B}_{\sigma,a,b}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\Big|\right] \le C_s' \left(\frac{\sigma^{1-s}(\tilde{c}_s a)^s}{\sqrt{n}} \vee (\tilde{c}_s a)^{\frac{2s}{1+s}}\, b^{\frac{1-s}{1+s}}\, n^{-\frac{1}{1+s}}\right). \tag{12}$$

Proposition 5 (Talagrand's Concentration Inequality (Talagrand, 1996, Bousquet, 2002)) Let $\mathcal{G}$ be a function class on $\mathcal{X}$ that is separable with respect to the $\infty$-norm, and let $\{x_i\}_{i=1}^n$ be i.i.d. random variables with values in $\mathcal{X}$. Furthermore, let $B := \sup_{g \in \mathcal{G}} \mathrm{E}[(g - \mathrm{E}[g])^2]$ and $U := \sup_{g \in \mathcal{G}} \|g\|_\infty$. Then there exists a universal constant $K$ such that, for $Z := \sup_{g \in \mathcal{G}} \big|\frac{1}{n}\sum_{i=1}^n g(x_i) - \mathrm{E}[g]\big|$, we have
$$P\left(Z \ge K\Big[\mathrm{E}[Z] + \sqrt{\frac{B t}{n}} + \frac{U t}{n}\Big]\right) \le e^{-t}.$$

Let $\lambda > 0$ be an arbitrary positive real. We define $U_{n,s}(f_m)$ as
$$U_{n,s}(f_m) := \left(\sqrt{\frac{M \log M}{n}} \vee \frac{\lambda^{-\frac{s}{2}} M^{\frac{1-s}{2}+s(1-\frac{1}{p})}}{\sqrt{n}} \vee \frac{\lambda^{-\frac{s(3-s)}{2(1+s)}} M^{\frac{1+4s-s^2}{2(1+s)}-\frac{s(3-s)}{p(1+s)}}}{n^{\frac{1}{1+s}}}\right) \times \left(\frac{\|f_m\|_{L_2(\Pi)}}{\sqrt{M}} + \frac{\lambda^{\frac{1}{2}} \|f_m\|_{\mathcal{H}_m}}{M^{1-\frac{1}{p}}}\right).$$
It is easy to see that $U_{n,s}(f_m)$ is an upper bound of the quantity
$$\frac{\|f_m\|_{L_2(\Pi)}^{1-s}\, \|f_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}} \;\vee\; \frac{\|f_m\|_{L_2(\Pi)}^{\frac{(1-s)^2}{1+s}}\, \|f_m\|_{\mathcal{H}_m}^{\frac{s(3-s)}{1+s}}}{n^{\frac{1}{1+s}}}$$
(this corresponds to the right-hand side of Eq. (12)), because
$$\frac{\|f_m\|_{L_2(\Pi)}^{1-s}\, \|f_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}} = \frac{\lambda^{-\frac{s}{2}} M^{\frac{1-s}{2}+s(1-\frac{1}{p})}}{\sqrt{n}} \left(\frac{\|f_m\|_{L_2(\Pi)}}{\sqrt{M}}\right)^{1-s} \left(\frac{\lambda^{\frac{1}{2}} \|f_m\|_{\mathcal{H}_m}}{M^{1-\frac{1}{p}}}\right)^{s} \le \frac{\lambda^{-\frac{s}{2}} M^{\frac{1-s}{2}+s(1-\frac{1}{p})}}{\sqrt{n}} \left(\frac{\|f_m\|_{L_2(\Pi)}}{\sqrt{M}} + \frac{\lambda^{\frac{1}{2}} \|f_m\|_{\mathcal{H}_m}}{M^{1-\frac{1}{p}}}\right), \tag{13}$$
where we used Young's inequality in the last step, and similarly
$$\frac{\|f_m\|_{L_2(\Pi)}^{\frac{(1-s)^2}{1+s}}\, \|f_m\|_{\mathcal{H}_m}^{\frac{s(3-s)}{1+s}}}{n^{\frac{1}{1+s}}} \le \frac{\lambda^{-\frac{s(3-s)}{2(1+s)}} M^{\frac{(1-s)^2}{2(1+s)}+(1-\frac{1}{p})\frac{s(3-s)}{1+s}}}{n^{\frac{1}{1+s}}} \left(\frac{\|f_m\|_{L_2(\Pi)}}{\sqrt{M}} + \frac{\lambda^{\frac{1}{2}} \|f_m\|_{\mathcal{H}_m}}{M^{1-\frac{1}{p}}}\right).$$
Note that $\frac{1-s}{2} + s(1-\frac{1}{p}) = \frac{1+s}{2} - \frac{s}{p}$ and $\frac{(1-s)^2}{2(1+s)} + (1-\frac{1}{p})\frac{s(3-s)}{1+s} = \frac{1+4s-s^2}{2(1+s)} - \frac{s(3-s)}{p(1+s)}$, so these coefficients are exactly the second and third elements in the definitions of $U_{n,s}$ and $\zeta_n$ (Eq. (7)).
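The exponent bookkeeping above is easy to get wrong, so here is a symbolic verification (not part of the paper) of the identities just stated, plus the fact that the Young exponents sum to one.

```python
import sympy as sp

# Verify the exponent identities used in the definitions of U_{n,s} and zeta_n.
s, p = sp.symbols("s p", positive=True)

lhs1 = (1 - s) / 2 + s * (1 - 1 / p)
rhs1 = (1 + s) / 2 - s / p
assert sp.simplify(lhs1 - rhs1) == 0

lhs2 = (1 - s) ** 2 / (2 * (1 + s)) + (1 - 1 / p) * s * (3 - s) / (1 + s)
rhs2 = (1 + 4 * s - s**2) / (2 * (1 + s)) - s * (3 - s) / (p * (1 + s))
assert sp.simplify(lhs2 - rhs2) == 0

# Young's inequality applies because (1-s)^2/(1+s) + s(3-s)/(1+s) = 1.
assert sp.simplify((1 - s) ** 2 / (1 + s) + s * (3 - s) / (1 + s) - 1) == 0
print("exponent identities verified")
```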
Using Propositions 4 and 5, we obtain the following ratio-type uniform bound.

Lemma 6 Under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4), there exists a constant $C_s$ depending only on $s$, $c$ and $C_1$ such that
$$\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m:\, \|f_m\|_{\mathcal{H}_m}=1} \frac{\big|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\big|}{U_{n,s}(f_m)}\right] \le C_s. \tag{14}$$

Proof (Lemma 6): Let $\mathcal{H}_m(\sigma) := \{f_m \in \mathcal{H}_m \mid \|f_m\|_{\mathcal{H}_m} = 1,\ \|f_m\|_{L_2(\Pi)} \le \sigma\}$ and $z = 2^{1/s} > 1$, and define $\tau := \lambda^{\frac{s}{2}} M^{\frac{1}{2}-\frac{1}{p}}$. Peeling the class into the shells $\mathcal{H}_m(\tau z^k) \setminus \mathcal{H}_m(\tau z^{k-1})$ by the $L_2(\Pi)$-norm and combining Propositions 1 and 4 with Assumption 4, we have
$$\mathrm{E}\left[\sup_{f_m:\, \|f_m\|_{\mathcal{H}_m}=1} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)}\right] \le \mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m(\tau)} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)}\right] + \sum_{k=1}^\infty \mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m(\tau z^k) \setminus \mathcal{H}_m(\tau z^{k-1})} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)}\right].$$
Bounding each numerator by Proposition 4 (with $b$ supplied by Assumption 4) and each denominator from below through the corresponding shell radius, the right-hand side is bounded by
$$C_s' \Big(\tilde{c}_s^s \vee C_1^{\frac{1-s}{1+s}} \tilde{c}_s^{\frac{2s}{1+s}}\Big) \left(1 + \sum_{k=1}^\infty z^{1-ks} \vee z^{1-k\frac{s(3-s)}{1+s}}\right) = C_s' \Big(\tilde{c}_s^s \vee C_1^{\frac{1-s}{1+s}} \tilde{c}_s^{\frac{2s}{1+s}}\Big) \left(1 + \frac{z^{1-s}}{1 - z^{-s}} \vee \frac{z^{1-\frac{s(3-s)}{1+s}}}{1 - z^{-\frac{s(3-s)}{1+s}}}\right).$$
Thus, setting $C_s = C_s' \big(\tilde{c}_s^s \vee C_1^{\frac{1-s}{1+s}}\tilde{c}_s^{\frac{2s}{1+s}}\big)\big(1 + \frac{z^{1-s}}{1-z^{-s}} \vee \frac{z^{1-\frac{s(3-s)}{1+s}}}{1-z^{-\frac{s(3-s)}{1+s}}}\big)$, we obtain the assertion.

This lemma immediately gives the following corollary.

Corollary 7 Under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4), there exists a constant $C_s$ depending only on $s$, $c$ and $C_1$ such that
$$\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{\big|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\big|}{U_{n,s}(f_m)}\right] \le C_s.$$

Lemma 8 If $\frac{\log M}{\sqrt{n}} \le 1$, then under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4) there exists a constant $\tilde{C}_s$ depending only on $s$, $c$, $C_1$ such that
$$\mathrm{E}\left[\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\big|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)\big|}{U_{n,s}(f_m)}\right] \le \tilde{C}_s.$$

Proof (Lemma 8): First notice that the $L_2(\Pi)$-norm and the $\infty$-norm of $\frac{\sigma_i f_m(x_i)}{U_{n,s}(f_m)}$ can be evaluated as
$$\left\|\frac{\sigma_i f_m(x_i)}{U_{n,s}(f_m)}\right\|_{L_2(\Pi)} = \frac{\|f_m\|_{L_2(\Pi)}}{U_{n,s}(f_m)} \le \left(\sqrt{\frac{\log M}{n}} \vee \frac{\lambda^{-\frac{s}{2}} M^{-\frac{s}{2}+s(1-\frac{1}{p})}}{\sqrt{n}}\right)^{-1} \le \sqrt{\frac{n}{\log M}}, \tag{15}$$
$$\left\|\frac{\sigma_i f_m(x_i)}{U_{n,s}(f_m)}\right\|_\infty = \frac{\|f_m\|_\infty}{U_{n,s}(f_m)} \le \frac{C_1 \|f_m\|_{L_2(\Pi)}^{1-s} \|f_m\|_{\mathcal{H}_m}^{s}}{U_{n,s}(f_m)} \le C_1 \sqrt{n}, \tag{16}$$
where the last inequality of Eq. (16) follows from the relation (13). Thus Talagrand's inequality implies
$$P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)} \ge K\Big[C_s + \sqrt{\frac{t}{\log M}} + \frac{C_1 t}{\sqrt{n}}\Big]\right) \le \sum_{m=1}^M P\left(\sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)} \ge K\Big[C_s + \sqrt{\frac{t}{\log M}} + \frac{C_1 t}{\sqrt{n}}\Big]\right) \le M e^{-t}.$$
Setting $t \leftarrow t + \log M$, we obtain
$$P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)} \ge K\Big[C_s + \sqrt{\frac{t + \log M}{\log M}} + \frac{C_1 (t + \log M)}{\sqrt{n}}\Big]\right) \le e^{-t}$$
for all $t \ge 0$. Consequently, the expectation of the max-sup term can be bounded as
$$\mathrm{E}\left[\max_m \sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)}\right] \le K\Big[C_s + 1 + \frac{C_1 \log M}{\sqrt{n}}\Big] + \int_0^\infty K\Big[C_s + \sqrt{\frac{t + 1 + \log M}{\log M}} + \frac{C_1(t + 1 + \log M)}{\sqrt{n}}\Big] e^{-t}\, \mathrm{d}t \le 2K\Big[C_s + 2 + \sqrt{\frac{\pi}{4 \log M}} + 2C_1 + \frac{C_1 \log M}{\sqrt{n}}\Big] \le \tilde{C}_s,$$
where we used $\sqrt{t + 1 + \log M} \le \sqrt{t} + \sqrt{1 + \log M}$, $\int_0^\infty \sqrt{t}\, e^{-t}\, \mathrm{d}t = \sqrt{\frac{\pi}{4}}$ and $\frac{\log M}{\sqrt{n}} \le 1$, with $\tilde{C}_s = 2K\big[C_s + 2 + \sqrt{\frac{\pi}{4}} + 3C_1\big]$.
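As a side illustration, not from the paper, the Rademacher averages that Proposition 4 and Lemma 6 control can be estimated directly for an RKHS unit ball: by the reproducing property, $\sup_{\|f\|_{\mathcal{H}} \le 1} \frac{1}{n}\sum_i \sigma_i f(x_i) = \sqrt{\sigma^\top K \sigma}/n$, so a Monte Carlo over $\sigma$ suffices. The kernel and sample sizes are arbitrary.

```python
import numpy as np

# Monte Carlo estimate of the Rademacher complexity of the RKHS unit ball:
#   sup_{||f||_H <= 1} (1/n) sum_i sigma_i f(x_i) = sqrt(sigma^T K sigma) / n.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
K = np.exp(-np.abs(x[:, None] - x[None, :]))   # Laplace kernel Gram matrix

draws = [np.sqrt(sig @ K @ sig) / n
         for sig in rng.choice([-1.0, 1.0], size=(2000, n))]
print(f"empirical Rademacher complexity of the unit ball: {np.mean(draws):.4f}")
# Localization replaces the unit ball by the smaller set B_{sigma,a,b} of
# Proposition 4, which shrinks this average and yields the fast rates.
```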
Lemma 9 If $\frac{\log M}{\sqrt{n}} \le 1$, then under the Spectral Assumption (Assumption 2) and the Embedded Assumption (Assumption 4) the following holds:
$$P\left(\max_m \sup_{f_m \in \mathcal{H}_m} \frac{\big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\big|}{U_{n,s}(f_m)} \ge K L\Big[2\tilde{C}_s + \sqrt{t} + \frac{C_1 t}{\sqrt{n}}\Big]\right) \le e^{-t}.$$

Proof (Lemma 9): By the contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12) and Lemma 8, we have
$$\mathrm{E}\left[\max_m \sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)|}{U_{n,s}(f_m)}\right] \le 2\,\mathrm{E}\left[\max_m \sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i \epsilon_i f_m(x_i)|}{U_{n,s}(f_m)}\right] \le 2L\tilde{C}_s.$$
Using this together with Eqs. (15) and (16), Talagrand's inequality gives the assertion.

Theorem 10 Let $\phi_s' = K[2C_1\tilde{C}_s + C_1 + C_1^2]$. Then, if $\frac{\log M}{\sqrt{n}} \le 1$, we have for all $t \ge 0$
$$\Big|\big\|\textstyle\sum_{m=1}^M f_m\big\|_n^2 - \big\|\textstyle\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2\Big| \le \phi_s' \sqrt{n}\Big(\sum_{m=1}^M U_{n,s}(f_m)\Big)^2 \eta(t), \quad \text{for all } f_m \in \mathcal{H}_m\ (m = 1, \dots, M),$$
with probability $1 - \exp(-t)$.

Proof (Theorem 10): By symmetrization and the contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12),
$$\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{\big|\|\sum_m f_m\|_n^2 - \|\sum_m f_m\|_{L_2(\Pi)}^2\big|}{(\sum_m U_{n,s}(f_m))^2}\right] \le 2\,\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i (\sum_m f_m(x_i))^2|}{(\sum_m U_{n,s}(f_m))^2}\right] \le \sup_{f_m \in \mathcal{H}_m} \frac{\|\sum_m f_m\|_\infty}{\sum_m U_{n,s}(f_m)} \times 2\,\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i \sum_m f_m(x_i)|}{\sum_m U_{n,s}(f_m)}\right]. \tag{17}$$
Using Eq. (16), the right-hand side of (17) can be bounded as
$$2C_1\sqrt{n}\,\mathrm{E}\left[\sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i \sum_m f_m(x_i)|}{\sum_m U_{n,s}(f_m)}\right] \le 2C_1\sqrt{n}\,\mathrm{E}\left[\max_m \sup_{f_m \in \mathcal{H}_m} \frac{|\frac{1}{n}\sum_{i=1}^n \sigma_i f_m(x_i)|}{U_{n,s}(f_m)}\right],$$
where we used the relation $\frac{\sum_m a_m}{\sum_m b_m} \le \max_m \frac{a_m}{b_m}$ for all $a_m \ge 0$ and $b_m \ge 0$ with the convention $\frac{0}{0} = 0$. By Lemma 8, the right-hand side is upper bounded by $2C_1\sqrt{n}\,\tilde{C}_s$. Applying Talagrand's concentration inequality again, we have
$$P\left(\sup_{f_m \in \mathcal{H}_m} \frac{\big|\|\sum_m f_m\|_n^2 - \|\sum_m f_m\|_{L_2(\Pi)}^2\big|}{(\sum_m U_{n,s}(f_m))^2} \ge K\big[2C_1\tilde{C}_s\sqrt{n} + \sqrt{tn}\, C_1 + C_1^2 t\big]\right) \le e^{-t}, \tag{18}$$
where we substituted the following upper bounds on $B$ and $U$:
$$B \le \sup_{f_m \in \mathcal{H}_m} \mathrm{E}\left[\frac{(\sum_m f_m)^4}{(\sum_m U_{n,s}(f_m))^4}\right] \le \sup_{f_m \in \mathcal{H}_m} \frac{(\sum_m \|f_m\|_{L_2(\Pi)})^2}{(\sum_m U_{n,s}(f_m))^2} \cdot \frac{(\sum_m C_1\sqrt{n}\, U_{n,s}(f_m))^2}{(\sum_m U_{n,s}(f_m))^2} \le \frac{C_1^2 n^2}{\log M} \le C_1^2 n^2,$$
where we used $\mathrm{E}[(\sum_m f_m)^2] = \mathrm{E}[\sum_{m,m'} f_m f_{m'}] \le \sum_{m,m'} \|f_m\|_{L_2(\Pi)} \|f_{m'}\|_{L_2(\Pi)} = (\sum_m \|f_m\|_{L_2(\Pi)})^2$ together with Eqs. (16) and (15), and, again by Eq. (16),
$$U = \sup_{f_m \in \mathcal{H}_m} \left\|\frac{(\sum_m f_m)^2}{(\sum_m U_{n,s}(f_m))^2}\right\|_\infty \le C_1^2 n.$$
Therefore, replacing $t \leftarrow \sqrt{n}\, t$, the inequality (18) implies
$$\sup_{f_m \in \mathcal{H}_m} \frac{\big|\|\sum_m f_m\|_n^2 - \|\sum_m f_m\|_{L_2(\Pi)}^2\big|}{(\sum_m U_{n,s}(f_m))^2} \le K\big[2C_1\tilde{C}_s + C_1 + C_1^2\big] \sqrt{n}\, \max(1, \sqrt{t}, t/\sqrt{n}) \tag{19}$$
with probability $1 - \exp(-t)$. Recalling $\phi_s' = K[2C_1\tilde{C}_s + C_1 + C_1^2]$, we obtain the assertion.

Now define $\phi_s := \max\big(KL[2\tilde{C}_s + 1 + C_1],\ K[2C_1\tilde{C}_s + C_1 + C_1^2]\big)$, and define the events $E_1(t)$ and $E_2(t')$ as
$$E_1(t) = \Big\{\Big|\frac{1}{n}\sum_{i=1}^n \epsilon_i f_m(x_i)\Big| \le \phi_s\, U_{n,s}(f_m)\, \eta(t),\ \forall f_m \in \mathcal{H}_m\ (m = 1, \dots, M)\Big\},$$
$$E_2(t') = \Big\{\Big|\big\|\textstyle\sum_{m=1}^M f_m\big\|_n^2 - \big\|\textstyle\sum_{m=1}^M f_m\big\|_{L_2(\Pi)}^2\Big| \le \phi_s \sqrt{n}\Big(\sum_{m=1}^M U_{n,s}(f_m)\Big)^2 \eta(t'),\ \forall f_m \in \mathcal{H}_m\ (m = 1, \dots, M)\Big\}.$$
The following theorem immediately gives Theorem 2.

Theorem 11 Let $\lambda$ be an arbitrary positive real, and let $\lambda_1^{(n)}$ satisfy $\lambda_1^{(n)} \ge \frac{1}{2}\lambda$. Then for all $n$ and $t' (> 0)$ that satisfy $\frac{\log M}{\sqrt{n}} \le 1$ and $\phi_s \sqrt{n}\, \zeta_n^2\, \eta(t')/\kappa_M \le \frac{1}{8}$, we have
$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le \frac{8}{3\kappa_M}\, \eta(t)^2 \phi_s^2 \zeta_n^2 + \frac{8}{3}\, \lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}$$
with probability $1 - \exp(-t) - \exp(-t')$ for all $t \ge 1$. This gives Theorem 2 by setting $\psi_s = 8\phi_s$.

Proof (Theorem 11): Since $y_i = f^*(x_i) + \epsilon_i$, the optimality of $\hat{f}$ in Eq. (1) gives
$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} \le \big(\|\hat{f} - f^*\|_{L_2(\Pi)}^2 - \|\hat{f} - f^*\|_n^2\big) + \frac{1}{n}\sum_{i=1}^n \sum_{m=1}^M \epsilon_i \big(\hat{f}_m(x_i) - f_m^*(x_i)\big) + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}.$$
On the event $E_2(t')$, this inequality gives
$$\|\hat{f} - f^*\|_{L_2(\Pi)}^2 + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} \le \phi_s \sqrt{n}\Big(\sum_{m=1}^M U_{n,s}(\hat{f}_m - f_m^*)\Big)^2 \eta(t') + \frac{1}{n}\sum_{i=1}^n \sum_{m=1}^M \epsilon_i \big(\hat{f}_m(x_i) - f_m^*(x_i)\big) + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}. \tag{20}$$
Before proving the statement, we derive a basic upper bound of $\sum_{m=1}^M U_{n,s}(f_m)$ required in the proof. First note that, by the Cauchy-Schwarz and Hölder inequalities,
$$\sum_{m=1}^M \frac{\|f_m\|_{L_2(\Pi)}}{\sqrt{M}} + \lambda^{\frac{1}{2}} \frac{\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}}{M^{1-\frac{1}{p}}} \le \Big(\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}^2\Big)^{\frac{1}{2}} + \lambda^{\frac{1}{2}} \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{\frac{1}{p}} \le 2\left(\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right)^{\frac{1}{2}}.$$
Recalling the definition of $\zeta_n$ (Eq. (7)), this yields
$$\sum_{m=1}^M U_{n,s}(f_m) \le \zeta_n \left(\sum_{m=1}^M \|f_m\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right)^{\frac{1}{2}}. \tag{21}$$

Step 1. By Eq. (21), the first term on the right-hand side of (20) can be upper bounded as
$$\phi_s \sqrt{n}\Big(\sum_{m=1}^M U_{n,s}(\hat{f}_m - f_m^*)\Big)^2 \eta(t') \le \phi_s \sqrt{n}\, \zeta_n^2\, \eta(t') \left(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right) \le \phi_s \sqrt{n}\, \zeta_n^2\, \eta(t') \left(\frac{\|\hat{f} - f^*\|_{L_2(\Pi)}^2}{\kappa_M} + \lambda \Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right),$$
where the second inequality uses the definition (6) of $\kappa_M$. By assumption, $\phi_s \sqrt{n}\, \zeta_n^2\, \eta(t')/\kappa_M \le \frac{1}{8}$; hence the right-hand side is bounded by
$$\frac{1}{8}\left(\|\hat{f} - f^*\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right). \tag{22}$$

Step 2. On the event $E_1(t)$, we have
$$\frac{1}{n}\sum_{i=1}^n \sum_{m=1}^M \epsilon_i \big(\hat{f}_m(x_i) - f_m^*(x_i)\big) \le \sum_{m=1}^M \eta(t)\, \phi_s\, U_{n,s}(\hat{f}_m - f_m^*) \le \eta(t)\, \phi_s\, \zeta_n \left(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right)^{\frac{1}{2}} \le \frac{2}{\kappa_M}\, \eta(t)^2 \phi_s^2 \zeta_n^2 + \frac{\kappa_M}{8}\left(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right) \le \frac{2}{\kappa_M}\, \eta(t)^2 \phi_s^2 \zeta_n^2 + \frac{1}{8}\left(\|\hat{f} - f^*\|_{L_2(\Pi)}^2 + \lambda \Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}\right), \tag{23}$$
where the third inequality is the weighted Young inequality $ab \le \frac{2}{\kappa_M}a^2 + \frac{\kappa_M}{8}b^2$, and the last again uses the definition of $\kappa_M$ (note that $\kappa_M \le 1$).
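For completeness, here is a one-line check (not spelled out in the original) of the weighted Young inequality invoked in Step 2, applied with $a = \eta(t)\phi_s\zeta_n$ and $b$ the square-rooted bracket in (23):
$$\frac{2}{\kappa_M}a^2 + \frac{\kappa_M}{8}b^2 \;\ge\; 2\sqrt{\frac{2}{\kappa_M}a^2 \cdot \frac{\kappa_M}{8}b^2} \;=\; 2 \cdot \frac{ab}{2} \;=\; ab,$$
which is simply the AM-GM inequality.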
Step 3. Substituting the inequalities (22) and (23) into Eq. (20) and rearranging, we obtain
$$\frac{3}{4}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} \le \frac{2}{\kappa_M}\, \eta(t)^2 \phi_s^2 \zeta_n^2 + \frac{\lambda}{4}\Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} + \lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}.$$
Now the second term on the right-hand side can be bounded as
$$\Big(\sum_{m=1}^M \|\hat{f}_m - f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} \le \Big(\sum_{m=1}^M \big(\|\hat{f}_m\|_{\mathcal{H}_m} + \|f_m^*\|_{\mathcal{H}_m}\big)^p\Big)^{\frac{2}{p}} \le 2\Big(\sum_{m=1}^M \|\hat{f}_m\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}} + 2\Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}}.$$
Therefore, using $\lambda_1^{(n)} \ge \frac{1}{2}\lambda$, we have
$$\frac{3}{4}\|\hat{f} - f^*\|_{L_2(\Pi)}^2 \le \frac{2}{\kappa_M}\, \eta(t)^2 \phi_s^2 \zeta_n^2 + 2\lambda_1^{(n)} \Big(\sum_{m=1}^M \|f_m^*\|_{\mathcal{H}_m}^p\Big)^{\frac{2}{p}},$$
which gives the assertion.

B Proof of Theorem 3 (Minimax Learning Rate)

Proof (Theorem 3): The proof utilizes the techniques developed by Raskutti et al. (2009, 2010), who applied the information-theoretic technique of Yang and Barron (1999) to MKL settings. The $\delta$-packing number $Q(\delta, \mathcal{H}, L_2(\Pi))$ of a function class $\mathcal{H}$ is the largest number of functions $\{f_1, \dots, f_Q\} \subseteq \mathcal{H}$ such that $\|f_i - f_j\|_{L_2(\Pi)} \ge \delta$ for all $i \ne j$. To simplify the notation, we write $\mathcal{F} := \mathcal{H}_{\ell_p}(R)$, $N(\varepsilon, \mathcal{H}) := N(\varepsilon, \mathcal{H}, L_2(\Pi))$ and $Q(\varepsilon, \mathcal{H}) := Q(\varepsilon, \mathcal{H}, L_2(\Pi))$. It can easily be shown that $Q(2\varepsilon, \mathcal{F}) \le N(2\varepsilon, \mathcal{F}) \le Q(\varepsilon, \mathcal{F})$. We utilize the following inequality given by Lemma 3 of Raskutti et al. (2009):
$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge \frac{\delta_n^2}{4}\left(1 - \frac{\log N(\varepsilon_n, \mathcal{F}) + n\varepsilon_n^2/2 + \log 2}{\log Q(\delta_n, \mathcal{F})}\right).$$
First we show the assertion for $p = \infty$. In this situation, there is a constant $C$ depending only on $s$ such that
$$\log Q(\delta, \mathcal{F}) \ge C M \log Q\big(\delta/\sqrt{M}, \tilde{\mathcal{H}}(R)\big), \qquad \log N(\varepsilon, \mathcal{F}) \le M \log N\big(\varepsilon/\sqrt{M}, \tilde{\mathcal{H}}(R)\big)$$
(this is shown in Lemma 5 of Raskutti et al. (2010), but we give a proof in Lemma 12 for completeness). Using these expressions, the minimax learning rate is bounded as
$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge \frac{\delta_n^2}{4}\left(1 - \frac{M \log N(\varepsilon_n/\sqrt{M}, \tilde{\mathcal{H}}(R)) + n\varepsilon_n^2/2 + \log 2}{C M \log Q(\delta_n/\sqrt{M}, \tilde{\mathcal{H}}(R))}\right).$$
Here we choose $\varepsilon_n$ and $\delta_n$ to satisfy the relations
$$\frac{n}{2\sigma^2}\varepsilon_n^2 \le M \log N\big(\varepsilon_n/\sqrt{M}, \tilde{\mathcal{H}}(R)\big), \tag{24}$$
$$4 \log N\big(\delta_n/\sqrt{M}, \tilde{\mathcal{H}}(R)\big) \le C \log Q\big(\delta_n/\sqrt{M}, \tilde{\mathcal{H}}(R)\big). \tag{25}$$
With $\varepsilon_n$ and $\delta_n$ satisfying the relations (24) and (25), we have
$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge \frac{\delta_n^2}{16}. \tag{26}$$
The relation (24) can be rewritten as $\frac{n}{2\sigma^2}\varepsilon_n^2 \le C M \big(\frac{\varepsilon_n}{R\sqrt{M}}\big)^{-2s}$, so it suffices to impose $\varepsilon_n^2 \le C n^{-\frac{1}{1+s}} M R^{\frac{2s}{1+s}}$ with an appropriate constant $C$. The relation (25) can be satisfied by taking $\delta_n = c\,\varepsilon_n$ with an appropriately chosen constant $c$. Thus Eq. (26) gives
$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge C n^{-\frac{1}{1+s}} M R^{\frac{2s}{1+s}} \tag{27}$$
with a constant $C$. This gives the assertion for $p = \infty$.

Finally, we show the assertion for $1 \le p < \infty$. Note that $\mathcal{H}_{\ell_\infty}(R/M^{\frac{1}{p}}) \subset \mathcal{H}_{\ell_p}(R)$ (this is because, for $\{x_m\}_{m=1}^M$ such that $|x_m| \le R/M^{\frac{1}{p}}$ $(\forall m)$, we have $\sum_{m=1}^M |x_m|^p \le M (R/M^{\frac{1}{p}})^p = R^p$). Therefore,
$$\min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_p}(R)} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge \min_{\hat{f}} \max_{f^* \in \mathcal{H}_{\ell_\infty}(R/M^{1/p})} \mathrm{E}\big[\|\hat{f} - f^*\|_{L_2(\Pi)}^2\big] \ge C n^{-\frac{1}{1+s}} M \big(R/M^{\frac{1}{p}}\big)^{\frac{2s}{1+s}} = C n^{-\frac{1}{1+s}} M^{1-\frac{2s}{p(1+s)}} R^{\frac{2s}{1+s}},$$
where the second inequality is Eq. (27). This concludes the proof.

Lemma 12 There is a constant $C$ such that, for sufficiently small $\delta$,
$$\log Q\big(\delta, \mathcal{H}_{\ell_\infty}(R)\big) \ge C M \log Q\big(\delta/\sqrt{M}, \tilde{\mathcal{H}}(R)\big).$$

Proof: The proof is analogous to that of Lemma 5 in Raskutti et al. (2010); we describe its outline. Let $N = Q(\sqrt{2}\delta/\sqrt{M}, \tilde{\mathcal{H}}(R))$ and let $\{f_m^1, \dots, f_m^N\}$ be a $\sqrt{2}\delta/\sqrt{M}$-packing of $\mathcal{H}_m(R)$. Then we can construct a function class $\Upsilon$ as
$$\Upsilon = \Big\{f^{\mathbf{j}} = \sum_{m=1}^M f_m^{j_m} \;\Big|\; \mathbf{j} = (j_1, \dots, j_M) \in \{1, \dots, N\}^M\Big\}.$$
We denote $[N] := \{1, \dots, N\}$. For two functions $f^{\mathbf{j}}, f^{\mathbf{j}'} \in \Upsilon$, we have by construction (using the orthogonality of the spaces $\mathcal{H}_m$)
$$\|f^{\mathbf{j}} - f^{\mathbf{j}'}\|_{L_2(\Pi)}^2 = \sum_{m=1}^M \|f_m^{j_m} - f_m^{j_m'}\|_{L_2(\Pi)}^2 \ge \frac{2\delta^2}{M} \sum_{m=1}^M \mathbb{1}[j_m \ne j_m'].$$
Thus it suffices to construct a sufficiently large subset $A \subset [N]^M$ such that every pair of distinct $\mathbf{j}, \mathbf{j}' \in A$ has Hamming distance $d_H(\mathbf{j}, \mathbf{j}') := \sum_{m=1}^M \mathbb{1}[j_m \ne j_m']$ at least $M/2$. Define $d_H(A, \mathbf{j}) := \min_{\mathbf{j}' \in A} d_H(\mathbf{j}', \mathbf{j})$. If $A$ satisfies
$$\Big|\Big\{\mathbf{j} \in [N]^M \;\Big|\; d_H(A, \mathbf{j}) \le \frac{M}{2}\Big\}\Big| < \big|[N]^M\big| = N^M, \tag{28}$$
then there exists an element $\mathbf{j}' \in [N]^M$ that is more than $M/2$ away from $A$ with respect to $d_H$, i.e., $d_H(A, \mathbf{j}') > \frac{M}{2}$. That is, we can keep adding elements to $A$ as long as Eq. (28) holds. Now since
$$\Big|\Big\{\mathbf{j} \in [N]^M \;\Big|\; d_H(A, \mathbf{j}) \le \frac{M}{2}\Big\}\Big| \le |A| \binom{M}{M/2} N^{M/2}, \tag{29}$$
Eq. (28) holds as long as $A$ satisfies $|A| \le \frac{1}{2} N^M \big/ \big(\binom{M}{M/2} N^{M/2}\big) =: Q^*$. The logarithm of $Q^*$ can be evaluated as
$$\log Q^* = M \log N - \log 2 - \log\binom{M}{M/2} - \frac{M}{2}\log N \ge \frac{M}{2}\log N - \log 2 - \log 2^M \ge \frac{M \log N}{16}$$
for sufficiently large $N$. There exists a constant $C$ such that $N = Q(\sqrt{2}\delta/\sqrt{M}, \tilde{\mathcal{H}}(R)) \ge C\, Q(\delta/\sqrt{M}, \tilde{\mathcal{H}}(R))$, because $Q(\delta, \tilde{\mathcal{H}}(R)) \sim (\delta/R)^{-2s}$. Thus we obtain the assertion for sufficiently large $N$.
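The greedy construction in the proof of Lemma 12 is concrete enough to run; here is a tiny brute-force version, not from the paper, for small $N$ and $M$, compared against the counting bound $Q^*$ from (28)-(29).

```python
from itertools import product
from math import comb

# Greedily grow A within [N]^M, adding any codeword whose Hamming distance to
# everything already in A exceeds M/2, as in the proof of Lemma 12.
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def greedy_packing(N, M):
    A = []
    for j in product(range(N), repeat=M):
        if all(hamming(j, a) > M / 2 for a in A):
            A.append(j)
    return A

for N, M in ((2, 4), (3, 4), (4, 4)):
    A = greedy_packing(N, M)
    q_star = N**M / (2 * comb(M, M // 2) * N ** (M / 2))
    print(f"N={N}, M={M}: greedy |A| = {len(A)}, counting bound Q* = {q_star:.2f}")
```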
References

A. Argyriou, C. A. Micchelli, and M. Pontil. Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the Annual Conference on Computational Learning Theory, 2005.

A. Argyriou, R. Hauser, C. A. Micchelli, and M. Pontil. A DC-programming algorithm for kernel selection. In the 23rd International Conference on Machine Learning, 2006.

F. Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems 21, pages 105-112, 2009.

F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In the 21st International Conference on Machine Learning, pages 41-48, 2004.

F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.

P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487-1537, 2005.

P. Bartlett, M. Jordan, and D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.

C. Bennett and R. Sharpley. Interpolation of Operators. Academic Press, Boston, 1988.

O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I Math., 334:495-500, 2002.

C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems 22, pages 396-404, 2009a.

C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montréal, Canada, 2009b.

C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings of the 27th International Conference on Machine Learning, 2010.

G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82-95, 1971.
M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate ℓp-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22, pages 997-1005, Cambridge, MA, 2009. MIT Press.

M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Non-sparse regularization for multiple kernel learning, 2010a. arXiv:1003.0079.

M. Kloft, U. Rückert, and P. L. Bartlett. A unifying view of multiple kernel learning. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2010b.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593-2656, 2006.

V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In Proceedings of the Annual Conference on Learning Theory, pages 229-238, 2008.

V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. The Annals of Statistics, 38(6):3660-3695, 2010.

G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27-72, 2004.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991. MR1102015.

L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779-3821, 2009.

C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099-1125, 2005.

C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043-1071, 2005.

G. Raskutti, M. Wainwright, and B. Yu. Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness. In Advances in Neural Information Processing Systems 22, pages 1563-1570. MIT Press, Cambridge, MA, 2009.

G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Technical report, 2010. arXiv:1008.3654.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

J. Shawe-Taylor. Kernel learning for novelty detection. In NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, Whistler, 2008.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

N. Srebro and S. Ben-David. Learning bounds for support vector machines with learned kernels. In Proceedings of the Annual Conference on Learning Theory, 2006.

I. Steinwart. Support Vector Machines. Springer, 2008.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79-93, 2009.

M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505-563, 1996.

R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL. In NIPS 2009 Workshop: Understanding Multiple Kernel Learning Methods, Whistler, 2009.
R. Tomioka and T. Suzuki. Regularization strategies and empirical Bayesian learning for MKL. In NIPS 2010 Workshop: New Directions in Multiple Kernel Learning, Whistler, 2010.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.

M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In the 26th International Conference on Machine Learning, 2009.

S. Vishwanathan, Z. Sun, N. Ampornpunt, and M. Varma. Multiple kernel learning and the SMO algorithm. In Advances in Neural Information Processing Systems 23, pages 2361-2369, 2010.

Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564-1599, 1999.

Y. Ying and C. Campbell. Generalization bounds for learning the kernel. In Proceedings of the Annual Conference on Learning Theory, Montreal, Quebec, 2009. Omnipress.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49-67, 2006.

D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18:739-767, 2002.
