The Local Rademacher Complexity of $\ell_p$-Norm Multiple Kernel Learning

Marius Kloft*
Machine Learning Laboratory, Technische Universität Berlin
Franklinstr. 28/29, 10587 Berlin, Germany
mkloft@mail.tu-berlin.de

Gilles Blanchard
Department of Mathematics, University of Potsdam
Am Neuen Palais 10, 14469 Potsdam, Germany
gilles.blanchard@math.uni-potsdam.de

May 29, 2018

Abstract

We derive an upper bound on the local Rademacher complexity of $\ell_p$-norm multiple kernel learning, which yields a tighter excess risk bound than global approaches. Previous local approaches analyzed the case $p = 1$ only, while our analysis covers all cases $1 \le p \le \infty$, assuming the different feature mappings corresponding to the different kernels to be uncorrelated. We also show a lower bound that shows that the bound is tight, and derive consequences regarding excess loss, namely fast convergence rates of the order $O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels.

Keywords: multiple kernel learning, learning kernels, generalization bounds, local Rademacher complexity

1. Introduction

Propelled by the increasing "industrialization" of modern application domains such as bioinformatics or computer vision, leading to the accumulation of vast amounts of data, the past decade experienced a rapid professionalization of machine learning methods. Sophisticated machine learning solutions such as the support vector machine can nowadays almost completely be applied out-of-the-box (Bouckaert et al., 2010).

(* Part of the work was done while MK was at the Learning Theory Group, Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720-1758, USA.)
Nevertheless, a displeasing stumbling block towards the complete automatization of machine learning remains that of finding the best abstraction or kernel for a problem at hand. In the current state of research, there is little hope that a machine will be able to find automatically (or even engineer) the best kernel for a particular problem (Searle, 1980). However, by restricting to a less general problem, namely to a finite set of base kernels the algorithm can pick from, one might hope to achieve automatic kernel selection: clearly, cross-validation based model selection (Stone, 1974) can be applied if the number of base kernels is decent. Still, the performance of such an algorithm is limited by the performance of the best kernel in the set.

In the seminal work of Lanckriet et al. (2004) it was shown that it is computationally feasible to simultaneously learn a support vector machine and a linear combination of kernels, if we require the so-formed kernel combinations to be positive definite and trace-norm normalized. Though feasible for small sample sizes, the computational burden of this so-called multiple kernel learning (MKL) approach is still high. By further restricting the multi-kernel class to only contain convex combinations of kernels, the efficiency can be considerably improved, so that ten thousands of training points and thousands of kernels can be processed (Sonnenburg et al., 2006).

However, these computational advances come at a price. Empirical evidence has accumulated showing that sparse-MKL optimized kernel combinations rarely help in practice and frequently are outperformed by a regular SVM using an unweighted-sum kernel $K = \sum_m K_m$ (Cortes et al., 2008; Gehler and Nowozin, 2009), leading for instance to the provocative question "Can learning kernels help performance?" (Cortes, 2009).
By imposing an $\ell_q$-norm, $q \ge 1$, rather than an $\ell_1$ penalty on the kernel combination coefficients, MKL was finally made useful for practical applications and profitable (Kloft et al., 2009, 2011). The $\ell_q$-norm MKL is an empirical minimization algorithm that operates on the multi-kernel class consisting of functions $f: x \mapsto \langle w, \phi_k(x)\rangle$ with $\|w\|_k \le D$, where $\phi_k$ is the kernel mapping into the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ with kernel $k$ and norm $\|\cdot\|_k$, while the kernel $k$ itself ranges over the set of possible kernels $\big\{k = \sum_{m=1}^M \theta_m k_m \;\big|\; \|\theta\|_q \le 1,\ \theta \ge 0\big\}$.

In Figure 1, we reproduce exemplary results taken from Kloft et al. (2009, 2011) (see also references therein for further evidence pointing in the same direction). We first observe that, as expected, $\ell_q$-norm MKL enforces strong sparsity in the coefficients $\theta_m$ when $q = 1$, and no sparsity at all for $q = \infty$, which corresponds to the SVM with an unweighted-sum kernel, while intermediate values of $q$ enforce different degrees of soft sparsity (understood as the steepness of the decrease of the ordered coefficients $\theta_m$). Crucially, the performance (as measured by the AUC criterion) is not monotonic as a function of $q$; $q = 1$ (sparse MKL) yields significantly worse performance than $q = \infty$ (regular SVM with sum kernel), but optimal performance is attained for some intermediate value of $q$. This is a strong empirical motivation to study theoretically the performance of $\ell_q$-MKL beyond the limiting cases $q = 1$ or $q = \infty$.

Figure 1: Splice site detection experiment in Kloft et al. (2009, 2011). Left: the area under ROC curve as a function of the training set size (sample sizes up to 60K; curves for 1-norm, 4/3-norm, 2-norm, and 4-norm MKL and the SVM). The regular SVM is equivalent to $q = \infty$ (or $p = 2$).
Right: the optimal kernel weights $\theta_m$ as output by $\ell_q$-norm MKL are shown (for $n$ = 5K, 20K, and 60K).

A conceptual milestone going back to the work of Bach et al. (2004) and Micchelli and Pontil (2005) is that the above multi-kernel class can equivalently be represented as a block-norm regularized linear class in the product Hilbert space $\mathcal{H} := \mathcal{H}_1 \times \cdots \times \mathcal{H}_M$, where $\mathcal{H}_m$ denotes the RKHS associated to kernel $k_m$, $1 \le m \le M$. More precisely, denoting by $\phi_m$ the kernel feature mapping associated to kernel $k_m$ over input space $\mathcal{X}$, and $\phi: x \in \mathcal{X} \mapsto (\phi_1(x), \ldots, \phi_M(x)) \in \mathcal{H}$, the class of functions defined above coincides with
$$H_{p,D,M} = \Big\{ f_w: x \mapsto \langle w, \phi(x)\rangle \;\Big|\; w = \big(w^{(1)}, \ldots, w^{(M)}\big),\; \|w\|_{2,p} \le D \Big\}, \qquad (1)$$
where there is a one-to-one mapping of $q \in [1, \infty]$ to $p \in [1, 2]$ given by $p = \frac{2q}{q+1}$. The $\ell_{2,p}$-norm is defined here as $\|w\|_{2,p} := \big\|\big(\|w^{(1)}\|_{k_1}, \ldots, \|w^{(M)}\|_{k_M}\big)\big\|_p = \big(\sum_{m=1}^M \|w^{(m)}\|_{k_m}^p\big)^{1/p}$; for simplicity, we will frequently write $\|w^{(m)}\|_2 = \|w^{(m)}\|_{k_m}$.

Clearly, the complexity of learning over (1) will be greater than that of learning based on a single kernel only. However, it is unclear whether the increase is moderate or considerably high and, since there is a free parameter $p$, how this relates to the choice of $p$. To this end, the main aim of this paper is to analyze the sample complexity of the above hypothesis class (1). An analysis of this model, based on global Rademacher complexities, was developed by Cortes et al. (2010). In the present work, we base our main analysis on the theory of local Rademacher complexities, which allows us to derive improved and more precise rates of convergence.
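To make the block-norm class (1) concrete, here is a minimal numerical sketch (Python/NumPy; the helper names `block_norm_2p` and `q_to_p` are ours, not the paper's) of the $\ell_{2,p}$ norm and of the mapping $p = \frac{2q}{q+1}$:

```python
import numpy as np

def block_norm_2p(blocks, p):
    """l_{2,p} norm of w = (w^(1), ..., w^(M)): the l_p norm of the
    vector of per-block l_2 norms."""
    block_l2 = np.array([np.linalg.norm(w_m) for w_m in blocks])
    if np.isinf(p):
        return float(block_l2.max())
    return float((block_l2 ** p).sum() ** (1.0 / p))

def q_to_p(q):
    """One-to-one mapping of the kernel-weight exponent q in [1, inf]
    to the block-norm exponent p in [1, 2]: p = 2q / (q + 1)."""
    if np.isinf(q):
        return 2.0
    return 2.0 * q / (q + 1.0)

# A parameter vector with M = 3 blocks; per-block l_2 norms are (5, 12, 5).
w = [np.array([3.0, 4.0]), np.array([12.0]), np.array([0.0, 0.0, 5.0])]
print(block_norm_2p(w, 1.0))        # 22.0 (sum of the block norms)
print(q_to_p(1.0), q_to_p(np.inf))  # 1.0 2.0
```

Note how $q = 1$ (sparse MKL) corresponds to $p = 1$ and $q = \infty$ (unweighted-sum SVM) to $p = 2$, as stated in the text.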
Outline of the contributions. This paper makes the following contributions:

• Upper bounds on the local Rademacher complexity of $\ell_p$-norm MKL are shown, from which we derive an excess risk bound that achieves a fast convergence rate of the order $O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels (previous bounds for $\ell_p$-norm MKL only achieved $O(n^{-\frac12})$).

• A lower bound is shown that, besides absolute constants, matches the upper bounds, showing that our results are tight.

• The generalization performance of $\ell_p$-norm MKL as guaranteed by the excess risk bound is studied for varying values of $p$, shedding light on the appropriateness of a small/large $p$ in various learning scenarios.

Furthermore, we also present a simpler proof of the global Rademacher bound shown in Cortes et al. (2010). A comparison of the rates obtained with local and global Rademacher analysis, respectively, can be found in Section 6.1.

Notation. For notational simplicity we will omit feature maps and directly view $\phi(x)$ and $\phi_m(x)$ as random variables $x$ and $x^{(m)}$ taking values in the Hilbert spaces $\mathcal{H}$ and $\mathcal{H}_m$, respectively, where $x = (x^{(1)}, \ldots, x^{(M)})$. Correspondingly, the hypothesis class we are interested in reads $H_{p,D,M} = \big\{ f_w: x \mapsto \langle w, x\rangle \;\big|\; \|w\|_{2,p} \le D \big\}$. If $D$ or $M$ are clear from the context, we sometimes synonymously denote $H_p = H_{p,D} = H_{p,D,M}$. We will frequently use the notation $(u^{(m)})_{m=1}^M$ for the element $u = (u^{(1)}, \ldots, u^{(M)}) \in \mathcal{H} = \mathcal{H}_1 \times \cdots \times \mathcal{H}_M$.

We denote the kernel matrices corresponding to $k$ and $k_m$ by $K$ and $K_m$, respectively. Note that we are considering normalized kernel Gram matrices, i.e., the $ij$-th entry of $K$ is $\frac1n k(x_i, x_j)$. We will also work with covariance operators in Hilbert spaces.
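Before moving on to covariance operators, here is a quick sanity check of the Gram normalization just introduced (a sketch with a linear kernel on toy data of our own choosing; nothing here is specific to the paper): with $K_{ij} = \frac1n k(x_i, x_j)$, the trace of $K$ equals the empirical mean of $\|x_i\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))    # n = 50 toy sample points in R^4

# Normalized Gram matrix of the linear kernel k(x, x') = <x, x'>:
# K_ij = k(x_i, x_j) / n, so tr(K) is the empirical mean of ||x_i||^2.
n = X.shape[0]
K = (X @ X.T) / n

trace_K = np.trace(K)
mean_sq_norm = np.mean(np.sum(X ** 2, axis=1))
print(trace_K - mean_sq_norm)   # ~0: the two quantities agree by definition
```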
In a finite-dimensional vector space, the (uncentered) covariance operator can be defined in usual vector/matrix notation as $E\,xx^\top$. Since we are working with potentially infinite-dimensional vector spaces, we will use instead of $xx^\top$ the tensor notation $x \otimes x \in \mathrm{HS}(\mathcal{H})$, which is the Hilbert-Schmidt operator $\mathcal{H} \to \mathcal{H}$ defined as $(x \otimes x)u = \langle x, u\rangle x$. The space $\mathrm{HS}(\mathcal{H})$ of Hilbert-Schmidt operators on $\mathcal{H}$ is itself a Hilbert space, and the expectation $E\,x \otimes x$ is well-defined and belongs to $\mathrm{HS}(\mathcal{H})$ as soon as $E\|x\|^2$ is finite, which will always be assumed (as a matter of fact, we will often assume that $\|x\|$ is bounded a.s.). We denote by $J = E\,x \otimes x$ and $J_m = E\,x^{(m)} \otimes x^{(m)}$ the uncentered covariance operators corresponding to the variables $x$ and $x^{(m)}$; it holds that $\mathrm{tr}(J) = E\|x\|_2^2$ and $\mathrm{tr}(J_m) = E\|x^{(m)}\|_2^2$. Finally, for $p \in [1, \infty]$ we use the standard notation $p^*$ to denote the conjugate of $p$, that is, $p^* \in [1, \infty]$ and $\frac1p + \frac1{p^*} = 1$.

2. Global Rademacher Complexities in Multiple Kernel Learning

We first review global Rademacher complexities (GRC) in multiple kernel learning. Let $x_1, \ldots, x_n$ be an i.i.d. sample drawn from $P$. The global Rademacher complexity is defined as $R(H_p) = E \sup_{f_w \in H_p} \langle w, \frac1n\sum_{i=1}^n \sigma_i x_i\rangle$, where $(\sigma_i)_{1\le i\le n}$ is an i.i.d. family (independent of $(x_i)$) of Rademacher variables (random signs). Its empirical counterpart is denoted by
$$\hat R(H_p) = E\Big[\sup_{f_w \in H_p}\Big\langle w, \frac1n\sum_{i=1}^n \sigma_i x_i\Big\rangle \,\Big|\, x_1, \ldots, x_n\Big] = E_\sigma \sup_{f_w \in H_p}\Big\langle w, \frac1n\sum_{i=1}^n \sigma_i x_i\Big\rangle.$$
The interest in the global Rademacher complexity comes from the fact that, if it is known, it can be used to bound the generalization error (Koltchinskii, 2001; Bartlett and Mendelson, 2002). In the recent paper of Cortes et al.
(2010) it was shown, using a combinatorial argument, that the empirical version of the global Rademacher complexity can be bounded as
$$\hat R(H_p) \;\le\; D\sqrt{\frac{cp^*}{n}\Big\|\big(\mathrm{tr}(K_m)\big)_{m=1}^M\Big\|_{\frac{p^*}{2}}},$$
where $c = \frac{23}{22}$ and $\mathrm{tr}(K)$ denotes the trace of the kernel matrix $K$. We will now show a quite short proof of this result and then present a bound on the population version of the GRC. The proof presented here is based on the Khintchine-Kahane inequality (Kahane, 1985), using the constants taken from Lemma 3.3.1 and Proposition 3.4.1 in Kwapień and Woyczyński (1992).

Lemma 1 (Khintchine-Kahane inequality). Let $v_1, \ldots, v_n \in \mathcal{H}$. Then, for any $q \ge 1$, it holds
$$E_\sigma\Big\|\sum_{i=1}^n \sigma_i v_i\Big\|_2^q \;\le\; \Big(c\sum_{i=1}^n \|v_i\|_2^2\Big)^{\frac q2},$$
where $c = \max(1, q-1)$. In particular, the result holds for $c = q$.

Proposition 2 (Global Rademacher complexity, empirical version). For any $p \ge 1$, the empirical version of the global Rademacher complexity of the multi-kernel class $H_p$ can be bounded as
$$\forall t \ge p: \quad \hat R(H_p) \;\le\; D\sqrt{\frac{t^*}{n}\Big\|\big(\mathrm{tr}(K_m)\big)_{m=1}^M\Big\|_{\frac{t^*}{2}}}.$$

Proof. First note that it suffices to prove the result for $t = p$: trivially $\|w\|_{2,t} \le \|w\|_{2,p}$ holds for all $t \ge p$, so that $H_p \subseteq H_t$ and therefore $\hat R(H_p) \le \hat R(H_t)$. We can use a block-structured version of Hölder's inequality (cf. Lemma 12) and the Khintchine-Kahane (K.-K.) inequality (cf. Lemma 1) to bound the empirical version of the global Rademacher complexity as follows:
$$\hat R(H_p) \overset{\text{def.}}{=} E_\sigma \sup_{f_w\in H_p}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i x_i\Big\rangle \overset{\text{Hölder}}{\le} D\,E_\sigma\Big\|\frac1n\sum_{i=1}^n\sigma_i x_i\Big\|_{2,p^*} \overset{\text{Jensen}}{\le} D\Big(\sum_{m=1}^M E_\sigma\Big\|\frac1n\sum_{i=1}^n \sigma_i x_i^{(m)}\Big\|_2^{p^*}\Big)^{\frac1{p^*}} \overset{\text{K.-K.}}{\le} D\sqrt{\frac{p^*}{n}}\Big(\sum_{m=1}^M\Big(\underbrace{\frac1n\sum_{i=1}^n\big\|x_i^{(m)}\big\|_2^2}_{=\,\mathrm{tr}(K_m)}\Big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}} = D\sqrt{\frac{p^*}{n}\Big\|\big(\mathrm{tr}(K_m)\big)_{m=1}^M\Big\|_{\frac{p^*}{2}}},$$
which was to show.
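For $q = 2$, the Khintchine-Kahane inequality of Lemma 1 holds with equality, since the cross terms $E\,\sigma_i\sigma_l$ vanish for $i \ne l$; this can be checked exactly by enumerating all sign patterns (a sketch with toy vectors of our own choosing):

```python
import itertools
import numpy as np

v = [np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([3.0, 0.0])]
n = len(v)

# Exact expectation over all 2^n Rademacher sign patterns.
lhs = np.mean([np.linalg.norm(sum(s * vi for s, vi in zip(signs, v))) ** 2
               for signs in itertools.product([-1, 1], repeat=n)])
# Right-hand side of Lemma 1 with q = 2 and c = max(1, q - 1) = 1.
rhs = sum(float(np.linalg.norm(vi)) ** 2 for vi in v)
print(lhs, rhs)   # equal: cross terms E sigma_i sigma_l vanish for i != l
```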
Remark. Note that there is a very good reason to state the above bound in terms of $t \ge p$ instead of solely in terms of $p$: the Rademacher complexity $\hat R(H_p)$ is not monotonic in $p$, and thus it is not always the best choice to take $t := p$ in the above bound. This is readily seen, for example, in the easy case where all kernels have the same trace; in that case the bound translates into $\hat R(H_p) \le D\sqrt{\frac{t^* M^{\frac2{t^*}}\mathrm{tr}(K_1)}{n}}$. Interestingly, the function $x \mapsto xM^{2/x}$ is not monotone and attains its minimum for $x = 2\log M$, where $\log$ denotes the natural logarithm. This has interesting consequences: for any $p \le (2\log M)^*$ we can take the bound $\hat R(H_p) \le D\sqrt{\frac{2e\log(M)\,\mathrm{tr}(K_1)}{n}}$, which has only a mild dependency on the number of kernels; note that in particular we can take this bound for the $\ell_1$-norm class, $\hat R(H_1)$, for all $M > 1$.

Despite the simplicity of the above proof, the constants are slightly better than the ones achieved in Cortes et al. (2010). However, computing the population version of the global Rademacher complexity of MKL is somewhat more involved and, to the best of our knowledge, has not yet been addressed in the literature. To this end, note that from the previous proof we obtain
$$R(H_p) \;\le\; D\sqrt{\frac{p^*}{n}}\;E\Big(\sum_{m=1}^M\Big(\frac1n\sum_{i=1}^n\big\|x_i^{(m)}\big\|_{\mathcal{H}_m}^2\Big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}}.$$
We can thus use Jensen's inequality to move the expectation operator inside the root,
$$R(H_p) \;\le\; D\sqrt{\frac{p^*}{n}}\Big(\sum_{m=1}^M E\Big(\frac1n\sum_{i=1}^n\big\|x_i^{(m)}\big\|_2^2\Big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}}, \qquad (2)$$
but now need a handle on the $\frac{p^*}{2}$-th moments. To this aim we use the inequalities of Rosenthal (1970) and Young (e.g., Steele, 2004) to show the following lemma.

Lemma 3 (Rosenthal + Young). Let $X_1, \ldots, X_n$ be independent nonnegative random variables satisfying $\forall i: X_i \le B < \infty$ almost surely.
Then, denoting $C_q := (2qe)^q$, for any $q \ge \frac12$ it holds
$$E\Big(\frac1n\sum_{i=1}^n X_i\Big)^q \;\le\; C_q\Big(\Big(\frac Bn\Big)^q + \Big(\frac1n\sum_{i=1}^n E X_i\Big)^q\Big).$$
The proof is deferred to Appendix A. It is now easy to show:

Corollary 4 (Global Rademacher complexity, population version). Assume the kernels are uniformly bounded, that is, $\|k\|_\infty \le B < \infty$, almost surely. Then for any $p \ge 1$ the population version of the global Rademacher complexity of the multi-kernel class $H_p$ can be bounded as
$$\forall t \ge p: \quad R(H_{p,D,M}) \;\le\; D\,t^*\sqrt{\frac en\Big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\Big\|_{\frac{t^*}{2}}} \;+\; \sqrt{Be}\,\frac{D M^{\frac1{t^*}} t^*}{n}.$$
For $t \ge 2$ the right-hand term can be discarded and the result also holds for unbounded kernels.

Proof. As in the previous proof, it suffices to prove the result for $t = p$. From (2) we conclude by the previous lemma that
$$R(H_p) \;\le\; D\sqrt{\frac{p^*}{n}}\Big(\sum_{m=1}^M (ep^*)^{\frac{p^*}{2}}\Big(\Big(\frac Bn\Big)^{\frac{p^*}{2}} + \Big(\underbrace{E\,\frac1n\sum_{i=1}^n\big\|x_i^{(m)}\big\|_2^2}_{=\,\mathrm{tr}(J_m)}\Big)^{\frac{p^*}{2}}\Big)\Big)^{\frac1{p^*}} \;\le\; D\,p^*\sqrt{\frac en\Big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\Big\|_{\frac{p^*}{2}}} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n},$$
where for the last inequality we used the subadditivity of the root function. Note that for $p \ge 2$ we have $p^*/2 \le 1$, and thus it suffices to employ Jensen's inequality instead of the previous lemma, so that we come along without the last term on the right-hand side.

For example, when the traces of the kernels are bounded, the above bound is essentially determined by $O\big(\frac{p^* M^{1/p^*}}{\sqrt n}\big)$. We can also remark that by setting $t = (\log(M))^*$ we obtain the bound $R(H_1) = O\big(\frac{\log M}{\sqrt n}\big)$.

3. The Local Rademacher Complexity of Multiple Kernel Learning

Let $x_1, \ldots, x_n$ be an i.i.d. sample drawn from $P$. We define the local Rademacher complexity of $H_p$ as
$$R_r(H_p) = E\sup_{f_w\in H_p:\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i x_i\Big\rangle, \qquad \text{where } Pf_w^2 := E\big(f_w(x)\big)^2.$$
Note that it subsumes the global RC as a special case for $r = \infty$.
As self-adjoint, positive Hilbert-Schmidt operators, covariance operators enjoy discrete eigenvalue-eigenvector decompositions
$$J = E\,x\otimes x = \sum_{j=1}^\infty \lambda_j\, u_j\otimes u_j \qquad\text{and}\qquad J_m = E\,x^{(m)}\otimes x^{(m)} = \sum_{j=1}^\infty \lambda_j^{(m)}\, u_j^{(m)}\otimes u_j^{(m)},$$
where $(u_j)_{j\ge1}$ and $(u_j^{(m)})_{j\ge1}$ form orthonormal bases of $\mathcal{H}$ and $\mathcal{H}_m$, respectively. We will need the following assumption for the case $1 \le p \le 2$:

Assumption (U) (no-correlation). The Hilbert space valued variables $x^{(1)}, \ldots, x^{(M)}$ are said to be (pairwise) uncorrelated if for any $m \ne m'$ and $w \in \mathcal{H}_m$, $w' \in \mathcal{H}_{m'}$, the real variables $\langle w, x^{(m)}\rangle$ and $\langle w', x^{(m')}\rangle$ are uncorrelated.

Since $\mathcal{H}_m, \mathcal{H}_{m'}$ are RKHSs with kernels $k_m, k_{m'}$, if we go back to the input random variable in the original space, $X \in \mathcal{X}$, the above property is equivalent to saying that for any fixed $t, t' \in \mathcal{X}$, the variables $k_m(X, t)$ and $k_{m'}(X, t')$ are uncorrelated. This is the case, for example, if the original input space $\mathcal{X}$ is $\mathbb{R}^M$, the original input variable $X \in \mathcal{X}$ has independent coordinates, and the kernels $k_1, \ldots, k_M$ each act on a different coordinate. Such a setting was considered in particular by Raskutti et al. (2010) in the setting of $\ell_1$-penalized MKL. We discuss this assumption in more detail in Section 6.2.

We are now equipped to state our main results:

Theorem 5 (Local Rademacher complexity, $p \in [1,2]$). Assume that the kernels are uniformly bounded ($\|k\|_\infty \le B < \infty$) and that Assumption (U) holds. The local Rademacher complexity of the multi-kernel class $H_p$ can be bounded for any $1 \le p \le 2$ as
$$\forall t \in [p, 2]: \quad R_r(H_p) \;\le\; \sqrt{\frac{16}{n}\sum_{j=1}^\infty \min\Big(rM^{1-\frac2{t^*}},\; e\,t^{*2}D^2\,\Big\|\big(\lambda_j^{(m)}\big)_{m=1}^M\Big\|_{\frac{t^*}{2}}\Big)} \;+\; \sqrt{Be}\,\frac{DM^{\frac1{t^*}}t^*}{n}.$$
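Theorem 5 relies on Assumption (U). The canonical example given above, kernels acting on independent input coordinates, can be checked empirically; the following sketch (toy Gaussian kernels and data of our own choosing, not the paper's) estimates the correlation between two such kernel evaluations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=(n, 2))            # input with independent coordinates

t, t_prime = 0.3, -0.7                 # arbitrary fixed points t, t'
k1 = np.exp(-(X[:, 0] - t) ** 2)       # Gaussian kernel on coordinate 1 only
k2 = np.exp(-(X[:, 1] - t_prime) ** 2) # Gaussian kernel on coordinate 2 only

corr = np.corrcoef(k1, k2)[0, 1]
print(corr)    # ~0: the two kernel evaluations are uncorrelated
```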
Theorem 6 (Local Rademacher complexity, $p \ge 2$). The local Rademacher complexity of the multi-kernel class $H_p$ can be bounded for any $p \ge 2$ as
$$R_r(H_p) \;\le\; \sqrt{\frac2n\sum_{j=1}^\infty \min\big(r,\; D^2 M^{\frac2{p^*}-1}\lambda_j\big)}.$$

Remark 1. Note that for the case $p = 1$, by using $t = (\log(M))^*$ in Theorem 5, we obtain the bound
$$R_r(H_1) \;\le\; \sqrt{\frac{16}{n}\sum_{j=1}^\infty\min\Big(rM,\; e^3D^2(\log M)^2\,\Big\|\big(\lambda_j^{(m)}\big)_{m=1}^M\Big\|_\infty\Big)} + \sqrt B\,e^{\frac32}\,\frac{D\log(M)}{n}.$$
(See below after the proof of Theorem 5 for a detailed justification.)

Remark 2. The result of Theorem 6 for $p \ge 2$ can be proved using considerably simpler techniques and without imposing assumptions on boundedness nor on uncorrelatedness of the kernels. If in addition the variables $(x^{(m)})$ are centered and uncorrelated, then the spectra are related as follows: $\mathrm{spec}(J) = \bigcup_{m=1}^M \mathrm{spec}(J_m)$; that is, $\{\lambda_i, i \ge 1\} = \bigcup_{m=1}^M \{\lambda_i^{(m)}, i \ge 1\}$. Then one can write the bound of Theorem 6 equivalently as
$$R_r(H_p) \;\le\; \sqrt{\frac2n\sum_{m=1}^M\sum_{j=1}^\infty\min\big(r,\; D^2M^{\frac2{p^*}-1}\lambda_j^{(m)}\big)} = \sqrt{\frac2n\Big\|\Big(\sum_{j=1}^\infty\min\big(r,\; D^2M^{\frac2{p^*}-1}\lambda_j^{(m)}\big)\Big)_{m=1}^M\Big\|_1}.$$
However, the main intended focus of this paper is on the more challenging case $1 \le p \le 2$, which is usually studied in multiple kernel learning and relevant in practice.

Remark 3. It is interesting to compare the above bounds for the special case $p = 2$ with the ones of Bartlett et al. (2005). The main term of the bound of Theorem 5 (taking $t = p = 2$) is then essentially determined by $O\Big(\sqrt{\frac1n\sum_{m=1}^M\sum_{j=1}^\infty\min\big(r, \lambda_j^{(m)}\big)}\Big)$. If the variables $(x^{(m)})$ are centered and uncorrelated, by the relation between the spectra stated in Remark 2 this is equivalently of order $O\Big(\sqrt{\frac1n\sum_{j=1}^\infty\min(r,\lambda_j)}\Big)$, which is also what we obtain through Theorem 6, and coincides with the rate shown in Bartlett et al. (2005).
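To get a quantitative feel for Theorem 6, the following sketch evaluates its right-hand side for a toy polynomially decaying spectrum (all parameter values are our own choices, not from the paper):

```python
import numpy as np

def theorem6_bound(r, p, D, M, n, lam):
    """Right-hand side of Theorem 6:
    sqrt((2/n) * sum_j min(r, D^2 M^(2/p* - 1) lambda_j)), for p > 1."""
    p_star = p / (p - 1.0)                     # conjugate exponent p*
    scale = D ** 2 * M ** (2.0 / p_star - 1.0)
    return float(np.sqrt(2.0 / n * np.sum(np.minimum(r, scale * lam))))

lam = 1.0 / np.arange(1, 10_001) ** 2          # toy spectrum lambda_j = j^{-2}
b2 = theorem6_bound(r=0.01, p=2.0, D=1.0, M=8, n=1000, lam=lam)
b4 = theorem6_bound(r=0.01, p=4.0, D=1.0, M=8, n=1000, lam=lam)
print(b2, b4)    # the bound grows with p, as the class H_p gets larger
```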
Proof of Theorem 5 and Remark 1. The proof is based on first relating the complexity of the class $H_p$ to that of its centered counterpart, i.e., where all functions $f_w \in H_p$ are centered around their expected value. Then we compute the complexity of the centered class by decomposing it into blocks, applying the no-correlation assumption, and using the inequalities of Hölder and Rosenthal. We then relate it back to the original class, which in the final step we relate to a bound involving the truncation of the particular spectra of the kernels. Note that it suffices to prove the result for $t = p$, as trivially $R_r(H_p) \le R_r(H_t)$ for all $p \le t$.

Step 1: Relating the original class with the centered class. In order to exploit the no-correlation assumption, we will work in large parts of the proof with the centered class $\tilde H_p = \{\tilde f_w \mid \|w\|_{2,p} \le D\}$, wherein $\tilde f_w: x \mapsto \langle w, \tilde x\rangle$ and $\tilde x := x - E\,x$. We start the proof by noting that $\tilde f_w(x) = f_w(x) - \langle w, E\,x\rangle = f_w(x) - E\langle w, x\rangle = f_w(x) - E f_w(x)$, so that, by the bias-variance decomposition, it holds that
$$Pf_w^2 = E f_w(x)^2 = E\big(f_w(x) - E f_w(x)\big)^2 + \big(E f_w(x)\big)^2 = P\tilde f_w^2 + (Pf_w)^2. \qquad (3)$$
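Equation (3) is a plain bias-variance identity, which can be verified exactly on a small discrete distribution (all values below are our own toy choices):

```python
import numpy as np

# A discrete distribution over four atoms in H = R^3.
xs = np.array([[1.0, 0.0, 2.0],
               [0.5, 1.0, -1.0],
               [-2.0, 0.3, 0.0],
               [0.0, -1.0, 1.0]])
probs = np.array([0.1, 0.4, 0.2, 0.3])
w = np.array([0.7, -1.2, 0.5])

f = xs @ w                                # f_w(x) = <w, x> on each atom
Pf2 = np.sum(probs * f ** 2)              # P f_w^2
mean_f = np.sum(probs * f)                # P f_w
Pft2 = np.sum(probs * (f - mean_f) ** 2)  # P f~_w^2, the centered version
print(Pf2, Pft2 + mean_f ** 2)            # equal, as claimed in eq. (3)
```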
Furthermore, we note that by Jensen's inequality
$$\|E\,x\|_{2,p^*} = \Big(\sum_{m=1}^M\big\|E\,x^{(m)}\big\|_2^{p^*}\Big)^{\frac1{p^*}} = \Big(\sum_{m=1}^M\big\langle E\,x^{(m)}, E\,x^{(m)}\big\rangle^{\frac{p^*}{2}}\Big)^{\frac1{p^*}} \overset{\text{Jensen}}{\le} \Big(\sum_{m=1}^M\big(E\big\langle x^{(m)}, x^{(m)}\big\rangle\big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}} = \sqrt{\Big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\Big\|_{\frac{p^*}{2}}}, \qquad (4)$$
so that we can express the complexity of the centered class in terms of the uncentered one as follows:
$$R_r(H_p) = E\sup_{f_w\in H_p,\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i x_i\Big\rangle \;\le\; E\sup_{f_w\in H_p,\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i \tilde x_i\Big\rangle + E\sup_{f_w\in H_p,\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i\, E\,x\Big\rangle.$$
Concerning the first term of the above upper bound, using (3) we have $P\tilde f_w^2 \le Pf_w^2$, and thus
$$E\sup_{f_w\in H_p,\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i \tilde x_i\Big\rangle \;\le\; E\sup_{f_w\in H_p,\,P\tilde f_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i \tilde x_i\Big\rangle = R_r(\tilde H_p).$$
Now, to bound the second term, we write
$$E\sup_{f_w\in H_p,\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i\, E\,x\Big\rangle = E\,\Big|\frac1n\sum_{i=1}^n\sigma_i\Big|\sup_{f_w\in H_p,\,Pf_w^2\le r}\langle w, E\,x\rangle \;\le\; \sup_{f_w\in H_p,\,Pf_w^2\le r}\langle w, E\,x\rangle\,\Big(E\Big(\frac1n\sum_{i=1}^n\sigma_i\Big)^2\Big)^{\frac12} = n^{-\frac12}\sup_{f_w\in H_p,\,Pf_w^2\le r}\langle w, E\,x\rangle.$$
Now observe finally that we have $\langle w, E\,x\rangle \overset{\text{Hölder}}{\le} \|w\|_{2,p}\,\|E\,x\|_{2,p^*} \overset{(4)}{\le} \|w\|_{2,p}\sqrt{\big\|(\mathrm{tr}(J_m))_{m=1}^M\big\|_{\frac{p^*}{2}}}$, as well as $\langle w, E\,x\rangle = E f_w(x) \le \sqrt{Pf_w^2}$. Putting together the steps above, we finally obtain
$$R_r(H_p) \;\le\; R_r(\tilde H_p) + n^{-\frac12}\min\Big(\sqrt r,\; D\sqrt{\big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\big\|_{\frac{p^*}{2}}}\Big). \qquad (5)$$
This shows that, at the expense of the additional summand on the right-hand side, we can work with the centered class instead of the uncentered one.

Step 2: Bounding the complexity of the centered class. Since the (centered) covariance operator $E\,\tilde x^{(m)}\otimes\tilde x^{(m)}$ is also a self-adjoint Hilbert-Schmidt operator on $\mathcal{H}_m$, there exists an eigendecomposition
$$E\,\tilde x^{(m)}\otimes\tilde x^{(m)} = \sum_{j=1}^\infty\tilde\lambda_j^{(m)}\,\tilde u_j^{(m)}\otimes\tilde u_j^{(m)}, \qquad (6)$$
wherein $(\tilde u_j^{(m)})_{j\ge1}$ is an orthonormal basis of $\mathcal{H}_m$. Furthermore, the no-correlation assumption (U) entails $E\,\tilde x^{(l)}\otimes\tilde x^{(m)} = 0$ for all $l \ne m$.
As a consequence,
$$P\tilde f_w^2 = E\Big(\sum_{m=1}^M\big\langle w^{(m)}, \tilde x^{(m)}\big\rangle\Big)^2 = \sum_{l,m=1}^M\big\langle w^{(l)}, \big(E\,\tilde x^{(l)}\otimes\tilde x^{(m)}\big)w^{(m)}\big\rangle \overset{\text{(U)}}{=} \sum_{m=1}^M\big\langle w^{(m)}, \big(E\,\tilde x^{(m)}\otimes\tilde x^{(m)}\big)w^{(m)}\big\rangle = \sum_{m=1}^M\sum_{j=1}^\infty\tilde\lambda_j^{(m)}\big\langle w^{(m)}, \tilde u_j^{(m)}\big\rangle^2 \qquad (7)$$
and, for all $j$ and $m$,
$$E\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)}, \tilde u_j^{(m)}\Big\rangle^2 = E\,\frac1{n^2}\sum_{i,l=1}^n\sigma_i\sigma_l\big\langle\tilde x_i^{(m)},\tilde u_j^{(m)}\big\rangle\big\langle\tilde x_l^{(m)},\tilde u_j^{(m)}\big\rangle \overset{\sigma\text{ i.i.d.}}{=} E\,\frac1{n^2}\sum_{i=1}^n\big\langle\tilde x_i^{(m)},\tilde u_j^{(m)}\big\rangle^2 = \frac1n\Big\langle\tilde u_j^{(m)}, \Big(\underbrace{\frac1n\sum_{i=1}^n E\,\tilde x_i^{(m)}\otimes\tilde x_i^{(m)}}_{=\,E\,\tilde x^{(m)}\otimes\tilde x^{(m)}}\Big)\tilde u_j^{(m)}\Big\rangle = \frac{\tilde\lambda_j^{(m)}}{n}. \qquad (8)$$
Let now $h_1, \ldots, h_M$ be arbitrary nonnegative integers. We can express the local Rademacher complexity of the centered class in terms of the eigendecomposition (6) as follows:
$$R_r(\tilde H_p) = E\sup_{\tilde f_w\in\tilde H_p:\,P\tilde f_w^2\le r}\Big\langle\big(w^{(m)}\big)_{m=1}^M, \Big(\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)}\Big)_{m=1}^M\Big\rangle$$
$$\le\; E\sup_{P\tilde f_w^2\le r}\Big\langle\Big(\sum_{j=1}^{h_m}\sqrt{\tilde\lambda_j^{(m)}}\,\big\langle w^{(m)},\tilde u_j^{(m)}\big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M, \Big(\sum_{j=1}^{h_m}\big(\sqrt{\tilde\lambda_j^{(m)}}\big)^{-1}\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M\Big\rangle \;+\; E\sup_{\tilde f_w\in\tilde H_p}\Big\langle w, \Big(\sum_{j=h_m+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M\Big\rangle$$
$$\overset{\text{C.-S., Jensen}}{\le} \sup_{P\tilde f_w^2\le r}\Big(\sum_{m=1}^M\sum_{j=1}^{h_m}\tilde\lambda_j^{(m)}\big\langle w^{(m)},\tilde u_j^{(m)}\big\rangle^2\Big)^{\frac12}\Big(\sum_{m=1}^M\sum_{j=1}^{h_m}\big(\tilde\lambda_j^{(m)}\big)^{-1}E\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle^2\Big)^{\frac12} + E\sup_{\tilde f_w\in\tilde H_p}\Big\langle w, \Big(\sum_{j=h_m+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M\Big\rangle,$$
so that (7) and (8) yield
$$R_r(\tilde H_p) \overset{(7),\,(8)}{\le} \sqrt{\frac{r\sum_{m=1}^M h_m}{n}} + E\sup_{\tilde f_w\in\tilde H_p}\Big\langle w, \Big(\sum_{j=h_m+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M\Big\rangle \overset{\text{Hölder}}{\le} \sqrt{\frac{r\sum_{m=1}^M h_m}{n}} + D\,E\,\Big\|\Big(\sum_{j=h_m+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M\Big\|_{2,p^*}.$$
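The $\sigma$-averaging step used in (8), where the cross terms with $i \ne l$ vanish because the $\sigma_i$ are independent signs, can be checked exactly by enumerating all sign patterns (toy vectors of our own choosing):

```python
import itertools
import numpy as np

v = [np.array([1.0, -0.5]), np.array([0.2, 2.0]), np.array([-1.0, 1.0])]
u = np.array([0.6, 0.8])      # a fixed unit vector
n = len(v)

# Exact average over all 2^n sign patterns of <(1/n) sum_i sigma_i v_i, u>^2.
lhs = np.mean([(sum(s * vi for s, vi in zip(signs, v)) @ u / n) ** 2
               for signs in itertools.product([-1, 1], repeat=n)])
rhs = sum(float(vi @ u) ** 2 for vi in v) / n ** 2
print(lhs, rhs)   # equal: E sigma_i sigma_l = 0 for i != l kills cross terms
```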
Step 3: Khintchine-Kahane's and Rosenthal's inequalities. We can now use the Khintchine-Kahane (K.-K.) inequality (see Lemma 1 in Appendix A) to further bound the right term in the above expression as follows:
$$E\,\Big\|\Big(\sum_{j=h_m+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big)_{m=1}^M\Big\|_{2,p^*} \overset{\text{Jensen}}{\le} E\Big(\sum_{m=1}^M E_\sigma\Big\|\sum_{j=h_m+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_i\tilde x_i^{(m)},\tilde u_j^{(m)}\Big\rangle\,\tilde u_j^{(m)}\Big\|_{\mathcal{H}_m}^{p^*}\Big)^{\frac1{p^*}}$$
$$\overset{\text{K.-K.}}{\le} \sqrt{\frac{p^*}{n}}\,E\Big(\sum_{m=1}^M\Big(\sum_{j=h_m+1}^\infty\frac1n\sum_{i=1}^n\big\langle\tilde x_i^{(m)},\tilde u_j^{(m)}\big\rangle^2\Big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}} \overset{\text{Jensen}}{\le} \sqrt{\frac{p^*}{n}}\Big(\sum_{m=1}^M E\Big(\sum_{j=h_m+1}^\infty\frac1n\sum_{i=1}^n\big\langle\tilde x_i^{(m)},\tilde u_j^{(m)}\big\rangle^2\Big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}}.$$
Note that for $p \ge 2$ it holds that $p^*/2 \le 1$, and thus it suffices to employ Jensen's inequality once again in order to move the expectation operator inside the inner term. In the general case we need a handle on the $\frac{p^*}{2}$-th moments, and to this end we employ Lemma 3 (Rosenthal + Young), which yields
$$\Big(\sum_{m=1}^M E\Big(\sum_{j=h_m+1}^\infty\frac1n\sum_{i=1}^n\big\langle\tilde x_i^{(m)},\tilde u_j^{(m)}\big\rangle^2\Big)^{\frac{p^*}{2}}\Big)^{\frac1{p^*}} \overset{\text{R+Y}}{\le} \Big(\sum_{m=1}^M(ep^*)^{\frac{p^*}{2}}\Big(\Big(\frac Bn\Big)^{\frac{p^*}{2}} + \Big(\sum_{j=h_m+1}^\infty\underbrace{\frac1n\sum_{i=1}^n E\big\langle\tilde x_i^{(m)},\tilde u_j^{(m)}\big\rangle^2}_{=\,\tilde\lambda_j^{(m)}}\Big)^{\frac{p^*}{2}}\Big)\Big)^{\frac1{p^*}}$$
$$\overset{(*)}{\le} \sqrt{ep^*\Big(\frac{BM^{\frac2{p^*}}}{n} + \Big\|\Big(\sum_{j=h_m+1}^\infty\tilde\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}\Big)} \;\le\; \sqrt{ep^*\Big(\frac{BM^{\frac2{p^*}}}{n} + \Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}\Big)},$$
where for $(*)$ we used the subadditivity of $\sqrt[p^*]{\cdot}$ and in the last step we applied the Lidskii-Mirsky-Wielandt theorem, which gives $\forall j, m: \tilde\lambda_j^{(m)} \le \lambda_j^{(m)}$. Thus, by the subadditivity of the root function,
$$R_r(\tilde H_p) \;\le\; \sqrt{\frac{r\sum_{m=1}^M h_m}{n}} + D\sqrt{\frac{ep^{*2}}{n}\Big(\frac{BM^{\frac2{p^*}}}{n} + \Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}\Big)} = \sqrt{\frac{r\sum_{m=1}^M h_m}{n}} + \sqrt{\frac{ep^{*2}D^2}{n}\Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n}. \qquad (9)$$
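Lemma 3 (Rosenthal + Young), applied in the step above, can be checked exactly on a tiny example by enumerating all outcomes of bounded i.i.d. variables (the parameters are our own toy choices; the large constant $C_q$ makes the inequality quite loose here):

```python
import itertools
import numpy as np

n, prob, B, q = 6, 0.3, 1.0, 1.5
C_q = (2 * q * np.e) ** q

# Exact E[(1/n sum X_i)^q] for i.i.d. X_i in {0, B} with P(X_i = B) = prob,
# by enumerating all 2^n outcomes and their probabilities.
lhs = 0.0
for outcome in itertools.product([0.0, B], repeat=n):
    p_out = np.prod([prob if x == B else 1 - prob for x in outcome])
    lhs += p_out * np.mean(outcome) ** q

rhs = C_q * ((B / n) ** q + (prob * B) ** q)   # note E X_i = prob * B
print(lhs, rhs)   # lhs is far below rhs: the inequality is loose but valid
```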
Step 4: Bounding the complexity of the original class. Now note that for all nonnegative integers $h_m$ we either have
$$n^{-\frac12}\min\Big(\sqrt r,\; D\sqrt{\big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\big\|_{\frac{p^*}{2}}}\Big) \;\le\; \sqrt{\frac{ep^{*2}D^2}{n}\Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}}$$
(in case all $h_m$ are zero), or it holds
$$n^{-\frac12}\min\Big(\sqrt r,\; D\sqrt{\big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\big\|_{\frac{p^*}{2}}}\Big) \;\le\; \sqrt{\frac{r\sum_{m=1}^M h_m}{n}}$$
(in case at least one $h_m$ is nonzero), so that in any case we get
$$n^{-\frac12}\min\Big(\sqrt r,\; D\sqrt{\big\|\big(\mathrm{tr}(J_m)\big)_{m=1}^M\big\|_{\frac{p^*}{2}}}\Big) \;\le\; \sqrt{\frac{r\sum_{m=1}^M h_m}{n}} + \sqrt{\frac{ep^{*2}D^2}{n}\Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}}. \qquad (10)$$
Thus the following preliminary bound follows from (5) by (9) and (10):
$$R_r(H_p) \;\le\; \sqrt{\frac{4r\sum_{m=1}^M h_m}{n}} + \sqrt{\frac{4ep^{*2}D^2}{n}\Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n}, \qquad (11)$$
for all nonnegative integers $h_m \ge 0$. We could stop here, as the above bound is already the one that will be used in the subsequent section for the computation of the excess loss bounds. However, we can work a little more on the form of the above bound to gain more insight into its properties: we will show that it is related to the truncation of the spectra at scale $r$.
Step 5: Relating the bound to the truncation of the spectra of the kernels. To this end, notice that for all nonnegative real numbers $A_1, A_2$ and any $a_1, a_2 \in \mathbb{R}_+^M$ it holds for all $q \ge 1$ that
$$\sqrt{A_1} + \sqrt{A_2} \;\le\; \sqrt{2(A_1 + A_2)}, \qquad (12)$$
$$\|a_1\|_q + \|a_2\|_q \;\le\; 2^{1-\frac1q}\|a_1 + a_2\|_q \;\le\; 2\,\|a_1 + a_2\|_q \qquad (13)$$
(the first statement follows from the concavity of the square root function, and the second one is proved in Appendix A; see Lemma 14), and thus
$$R_r(H_p) \overset{(12)}{\le} \sqrt{\frac8n\Big(r\sum_{m=1}^M h_m + ep^{*2}D^2\Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}\Big)} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n}$$
$$\overset{\ell_1\text{-to-}\ell_{p^*/2}}{\le} \sqrt{\frac8n\Big(rM^{1-\frac2{p^*}}\big\|(h_m)_{m=1}^M\big\|_{\frac{p^*}{2}} + ep^{*2}D^2\Big\|\Big(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}\Big)} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n}$$
$$\overset{(13)}{\le} \sqrt{\frac{16}n\Big\|\Big(rM^{1-\frac2{p^*}}h_m + ep^{*2}D^2\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n},$$
where to obtain the second inequality we applied that for all nonnegative $a \in \mathbb{R}^M$ and $0 < q < p \le \infty$ it holds ($\ell_q$-to-$\ell_p$ conversion)
$$\|a\|_q = \big\langle \mathbf 1, a^q\big\rangle^{\frac1q} \overset{\text{Hölder}}{\le} \big(\|\mathbf 1\|_{(p/q)^*}\,\|a^q\|_{p/q}\big)^{1/q} = M^{\frac1q-\frac1p}\|a\|_p \qquad (14)$$
(here $a^q$ denotes the vector with entries $a_i^q$, and $\mathbf 1$ the vector with all entries equal to 1). Since the above holds for all nonnegative integers $h_m$, it follows that
$$R_r(H_p) \;\le\; \sqrt{\frac{16}n\,\min_{h_m\ge0}\Big\|\Big(rM^{1-\frac2{p^*}}h_m + ep^{*2}D^2\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n} = \sqrt{\frac{16}n\sum_{j=1}^\infty\min\Big(rM^{1-\frac2{p^*}},\; ep^{*2}D^2\big\|\big(\lambda_j^{(m)}\big)_{m=1}^M\big\|_{\frac{p^*}{2}}\Big)} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n},$$
which completes the proof of the theorem.

Proof of the remark. To see that Remark 1 holds, notice that $R_r(H_1) \le R_r(H_p)$ for all $p \ge 1$, and thus, choosing $p = (\log(M))^*$, the above bound implies
$$R_r(H_1) \;\le\; \sqrt{\frac{16}n\sum_{j=1}^\infty\min\Big(rM^{1-\frac2{p^*}},\; ep^{*2}D^2\big\|\big(\lambda_j^{(m)}\big)_{m=1}^M\big\|_{\frac{p^*}{2}}\Big)} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n}$$
$$\overset{\ell_{p^*/2}\text{-to-}\ell_\infty}{\le} \sqrt{\frac{16}n\sum_{j=1}^\infty\min\Big(rM,\; ep^{*2}M^{\frac2{p^*}}D^2\big\|\big(\lambda_j^{(m)}\big)_{m=1}^M\big\|_\infty\Big)} + \sqrt{Be}\,\frac{DM^{\frac1{p^*}}p^*}{n} = \sqrt{\frac{16}n\sum_{j=1}^\infty\min\Big(rM,\; e^3D^2(\log M)^2\big\|\big(\lambda_j^{(m)}\big)_{m=1}^M\big\|_\infty\Big)} + \sqrt B\,e^{\frac32}\,\frac{D\log(M)}{n},$$
which completes the proof.
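The elementary inequalities (13) and (14) above are easy to verify numerically on random nonnegative vectors (a sketch; the exponents are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
M, q, p = 5, 1.7, 3.0
a1 = rng.uniform(size=M)
a2 = rng.uniform(size=M)

def lq(a, q):
    """l_q norm of a nonnegative vector."""
    return float((a ** q).sum() ** (1.0 / q))

# (13): ||a1||_q + ||a2||_q <= 2^(1 - 1/q) ||a1 + a2||_q (nonnegative vectors)
lhs13 = lq(a1, q) + lq(a2, q)
rhs13 = 2 ** (1 - 1 / q) * lq(a1 + a2, q)

# (14): ||a||_q <= M^(1/q - 1/p) ||a||_p for 0 < q < p (l_q-to-l_p conversion)
lhs14 = lq(a1, q)
rhs14 = M ** (1 / q - 1 / p) * lq(a1, p)
print(lhs13 <= rhs13, lhs14 <= rhs14)   # True True
```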
Proof of Theorem 6. The eigendecomposition $E\,x\otimes x = \sum_{j=1}^\infty\lambda_j\,u_j\otimes u_j$ yields
$$Pf_w^2 = E\big(f_w(x)\big)^2 = E\langle w, x\rangle^2 = \big\langle w, (E\,x\otimes x)\,w\big\rangle = \sum_{j=1}^\infty\lambda_j\langle w, u_j\rangle^2 \qquad (15)$$
and, for all $j$,
$$E\Big\langle\frac1n\sum_{i=1}^n\sigma_i x_i, u_j\Big\rangle^2 = E\,\frac1{n^2}\sum_{i,l=1}^n\sigma_i\sigma_l\langle x_i,u_j\rangle\langle x_l,u_j\rangle \overset{\sigma\text{ i.i.d.}}{=} E\,\frac1{n^2}\sum_{i=1}^n\langle x_i,u_j\rangle^2 = \frac1n\Big\langle u_j, \Big(\underbrace{\frac1n\sum_{i=1}^n E\,x_i\otimes x_i}_{=\,E\,x\otimes x}\Big)u_j\Big\rangle = \frac{\lambda_j}n. \qquad (16)$$
Therefore, we can use, for any nonnegative integer $h$, the Cauchy-Schwarz inequality and a block-structured version of Hölder's inequality (see Lemma 12) to bound the local Rademacher complexity as follows:
$$R_r(H_p) = E\sup_{f_w\in H_p:\,Pf_w^2\le r}\Big\langle w, \frac1n\sum_{i=1}^n\sigma_i x_i\Big\rangle = E\sup_{f_w\in H_p:\,Pf_w^2\le r}\Big[\Big\langle\sum_{j=1}^h\sqrt{\lambda_j}\,\langle w,u_j\rangle\,u_j,\; \sum_{j=1}^h\big(\sqrt{\lambda_j}\big)^{-1}\Big\langle\frac1n\sum_{i=1}^n\sigma_ix_i,u_j\Big\rangle\,u_j\Big\rangle + \Big\langle w, \sum_{j=h+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_ix_i,u_j\Big\rangle\,u_j\Big\rangle\Big]$$
$$\overset{\text{C.-S., (15), (16)}}{\le} \sqrt{\frac{rh}n} + E\sup_{f_w\in H_p}\Big\langle w, \sum_{j=h+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_ix_i,u_j\Big\rangle\,u_j\Big\rangle \overset{\text{Hölder}}{\le} \sqrt{\frac{rh}n} + D\,E\,\Big\|\sum_{j=h+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_ix_i,u_j\Big\rangle\,u_j\Big\|_{2,p^*}$$
$$\overset{\ell_{p^*}\text{-to-}\ell_2}{\le} \sqrt{\frac{rh}n} + D\,M^{\frac1{p^*}-\frac12}\,E\,\Big\|\sum_{j=h+1}^\infty\Big\langle\frac1n\sum_{i=1}^n\sigma_ix_i,u_j\Big\rangle\,u_j\Big\|_{\mathcal{H}} \overset{\text{Jensen}}{\le} \sqrt{\frac{rh}n} + D\,M^{\frac1{p^*}-\frac12}\Big(\sum_{j=h+1}^\infty\underbrace{E\Big\langle\frac1n\sum_{i=1}^n\sigma_ix_i,u_j\Big\rangle^2}_{\overset{(16)}{=}\,\lambda_j/n}\Big)^{\frac12} \le \sqrt{\frac{rh}n} + \sqrt{\frac{D^2M^{\frac2{p^*}-1}}n\sum_{j=h+1}^\infty\lambda_j}.$$
Since the above holds for all $h$, the result now follows from $\sqrt A + \sqrt B \le \sqrt{2(A+B)}$ for all nonnegative real numbers $A, B$ (which holds by the concavity of the square root function):
$$R_r(H_p) \;\le\; \sqrt{\frac2n\,\min_{h\ge0}\Big(rh + D^2M^{\frac2{p^*}-1}\sum_{j=h+1}^\infty\lambda_j\Big)} = \sqrt{\frac2n\sum_{j=1}^\infty\min\big(r,\;D^2M^{\frac2{p^*}-1}\lambda_j\big)},$$
which completes the proof.
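The truncation identity used to conclude the proof of Theorem 6, namely $\min_{h\ge0}\big(rh + c\sum_{j>h}\lambda_j\big) = \sum_j\min(r, c\lambda_j)$ for a nonincreasing spectrum, can be checked directly (toy spectrum and constants of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.sort(rng.pareto(2.0, size=200))[::-1]   # a nonincreasing toy spectrum
r, c = 0.05, 1.3

# tails[h] = c * sum_{j > h} lambda_j, i.e., the tail when h eigenvalues are kept.
tails = np.concatenate([np.cumsum((c * lam)[::-1])[::-1], [0.0]])
left = min(r * h + tails[h] for h in range(len(lam) + 1))
right = np.sum(np.minimum(r, c * lam))
print(left, right)   # equal for a nonincreasing spectrum
```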
For example, this happens if the original input space $\mathcal X$ is $\mathbb{R}^M$, the original input variable $X\in\mathcal X$ has i.i.d. coordinates, and the kernels $k_1,\ldots,k_M$ are identical, each acting on a different coordinate of $X$.

Lemma 7. Assume that the variables $x^{(1)},\ldots,x^{(M)}$ are centered and independently identically distributed. Then the following lower bound holds for the local Rademacher complexity of $H_p$, for any $p\geq 1$:
\[
R_r(H_{p,D,M}) \geq R_{rM}\big(H_{1,\,DM^{1/p^*},\,1}\big).
\]

Proof. First note that, since the $x^{(m)}$ are centered and uncorrelated,
\[
Pf_w^2 = E\Big(\sum_{m=1}^M\big\langle w_m, x^{(m)}\big\rangle\Big)^2 = \sum_{m=1}^M E\big\langle w_m, x^{(m)}\big\rangle^2.
\]
Now it follows that
\begin{align*}
R_r(H_{p,D,M}) &= E\sup_{\substack{w:\ Pf_w^2\leq r\\ \|w\|_{2,p}\leq D}}\Big\langle w,\frac{1}{n}\sum_{i=1}^n\sigma_i x_i\Big\rangle
= E\sup_{\substack{w:\ \sum_{m=1}^M E\langle w^{(m)},x^{(m)}\rangle^2\leq r\\ \|w\|_{2,p}\leq D}}\Big\langle w,\frac{1}{n}\sum_{i=1}^n\sigma_i x_i\Big\rangle\\
&\geq E\sup_{\substack{w:\ \forall m:\ E\langle w^{(m)},x^{(m)}\rangle^2\leq r/M\\ \|w\|_{2,p}\leq D,\ w^{(1)}=\cdots=w^{(M)}}}\Big\langle w,\frac{1}{n}\sum_{i=1}^n\sigma_i x_i\Big\rangle\\
&= E\sup_{\substack{w:\ \forall m:\ E\langle w^{(m)},x^{(m)}\rangle^2\leq r/M\\ \forall m:\ \|w^{(m)}\|_2\leq DM^{-1/p}}}\ \sum_{m=1}^M\Big\langle w^{(m)},\frac{1}{n}\sum_{i=1}^n\sigma_i x_i^{(m)}\Big\rangle
= \sum_{m=1}^M E\sup_{\substack{w^{(m)}:\ E\langle w^{(m)},x^{(m)}\rangle^2\leq r/M\\ \|w^{(m)}\|_2\leq DM^{-1/p}}}\Big\langle w^{(m)},\frac{1}{n}\sum_{i=1}^n\sigma_i x_i^{(m)}\Big\rangle,
\end{align*}
so that we can use the i.i.d. assumption on the $x^{(m)}$ to equivalently rewrite the last term as
\begin{align*}
R_r(H_{p,D,M}) &\overset{x^{(m)}\ \text{i.i.d.}}{\geq} E\sup_{\substack{w^{(1)}:\ E\langle w^{(1)},x^{(1)}\rangle^2\leq r/M\\ \|w^{(1)}\|_2\leq DM^{-1/p}}} M\,\Big\langle w^{(1)},\frac{1}{n}\sum_{i=1}^n\sigma_i x_i^{(1)}\Big\rangle\\
&= E\sup_{\substack{w^{(1)}:\ E\langle Mw^{(1)},x^{(1)}\rangle^2\leq rM\\ \|Mw^{(1)}\|_2\leq DM^{1/p^*}}}\Big\langle Mw^{(1)},\frac{1}{n}\sum_{i=1}^n\sigma_i x_i^{(1)}\Big\rangle
= R_{rM}\big(H_{1,\,DM^{1/p^*},\,1}\big).
\end{align*}

In Mendelson (2003) it was shown that there is an absolute constant $c$ such that if $\lambda_1^{(1)}\geq\frac{1}{n}$, then for all $r\geq\frac{1}{n}$ it holds $R_r(H_{1,1,1})\geq\sqrt{\frac{c}{n}\sum_{j=1}^\infty\min\big(r,\lambda_j^{(1)}\big)}$.
Closer inspection of the proof reveals that, more generally, $R_r(H_{1,D,1})\geq\sqrt{\frac{c}{n}\sum_{j=1}^\infty\min\big(r,D^2\lambda_j^{(1)}\big)}$ holds whenever $\lambda_1^{(1)}\geq\frac{1}{nD^2}$, so that we can use that result together with the previous lemma to obtain:

Theorem 8 (Lower bound). Assume that the kernels are centered and independently identically distributed. Then the following lower bound holds for the local Rademacher complexity of $H_p$. There is an absolute constant $c$ such that if $\lambda_1^{(1)}\geq\frac{1}{nD^2}$, then for all $r\geq\frac{1}{n}$ and $p\geq 1$,
\[
R_r(H_{p,D,M}) \geq \sqrt{\frac{c}{n}\sum_{j=1}^\infty\min\Big(rM,\ D^2M^{2/p^*}\lambda_j^{(1)}\Big)}. \tag{17}
\]

We would like to compare the above lower bound with the upper bound of Theorem 5. To this end, note that for centered i.i.d. kernels the upper bound reads
\[
R_r(H_p) \leq \sqrt{\frac{16}{n}\sum_{j=1}^\infty\min\Big(rM,\ ep^{*2}D^2M^{\frac{2}{p^*}}\lambda_j^{(1)}\Big)} + \frac{\sqrt{Be}\,DM^{\frac{1}{p^*}}p^*}{n},
\]
which is of the order $O\Big(\sqrt{\sum_{j=1}^\infty\min\big(rM,\,D^2M^{2/p^*}\lambda_j^{(1)}\big)}\Big)$ and thus, disregarding the quickly converging term on the right-hand side and absolute constants, matches the lower bound (17). A similar comparison can be performed for the upper bound of Theorem 6: by Remark 2 the bound reads
\[
R_r(H_p) \leq \sqrt{\frac{2}{n}\bigg\|\bigg(\sum_{j=1}^\infty\min\Big(r,\ D^2M^{\frac{2}{p^*}-1}\lambda_j^{(m)}\Big)\bigg)_{m=1}^M\bigg\|_{1}},
\]
which for i.i.d. kernels becomes $\sqrt{\frac{2}{n}\sum_{j=1}^\infty\min\big(rM,\,D^2M^{2/p^*}\lambda_j^{(1)}\big)}$ and thus, besides absolute constants, matches the lower bound as well. This shows that the upper bounds of the previous section are tight.

5. Excess Risk Bounds

In this section we show an application of our results to prediction problems, such as classification or regression. To this end, in addition to the data $x_1,\ldots,x_n$ introduced earlier in this paper, let also a label sequence $y_1,\ldots,y_n\subset[-1,1]$ be given, drawn i.i.d. from a probability distribution.
The goal in statistical learning is to find a hypothesis $f$ from a pregiven class $\mathcal F$ that minimizes the expected loss $E\,l(f(x),y)$, where $l:\mathbb{R}^2\to[-1,1]$ is a predefined loss function encoding the objective of the learning or prediction task at hand. For example, the hinge loss $l(t,y)=\max(0,1-yt)$ and the squared loss $l(t,y)=(t-y)^2$ are frequently used in classification and regression problems, respectively.

Since the distribution generating the example/label pairs is unknown, the optimal decision function $f^*:=\operatorname{argmin}_f E\,l(f(x),y)$ cannot be computed directly, and a frequently used method consists of instead minimizing the empirical loss,
\[
\hat f := \operatorname{argmin}_f\ \frac{1}{n}\sum_{i=1}^n l\big(f(x_i),y_i\big).
\]
In order to evaluate the performance of this so-called empirical minimization algorithm, we study the excess loss,
\[
P(l_{\hat f}-l_{f^*}) := E\,l\big(\hat f(x),y\big) - E\,l\big(f^*(x),y\big).
\]
In Bartlett et al. (2005) and Koltchinskii (2006) it was shown that the rate of convergence of the excess risk is essentially determined by the fixed point of the local Rademacher complexity. For example, the following result is a slight modification of Corollary 5.3 in Bartlett et al. (2005) that is well-tailored to the class studied in this paper.$^2$

Lemma 9. Let $\mathcal F$ be an absolutely convex class ranging in the interval $[a,b]$ and let $l$ be a Lipschitz continuous loss with constant $L$. Assume there is a positive constant $F$ such that $\forall f\in\mathcal F:\ P(f-f^*)^2\leq F\,P(l_f-l_{f^*})$. Then, denoting by $r^*$ the fixed point of $2FL\,R_{\frac{r}{4L^2}}(\mathcal F)$, for all $x>0$, with probability at least $1-e^{-x}$ the excess loss can be bounded as
\[
P(l_{\hat f}-l_{f^*}) \leq 7\,\frac{r^*}{F} + \frac{\big(11L(b-a)+27F\big)\,x}{n}.
\]

2. We exploit the improved constants from Theorem 3.3 in Bartlett et al. (2005) because an absolutely convex class is star-shaped. Compared to Corollary 5.3 in Bartlett et al.
(2005), we also use a slightly more general function class ranging in $[a,b]$ instead of the interval $[-1,1]$. This is also justified by Theorem 3.3.

The above result shows that, in order to obtain an excess risk bound for the multi-kernel class $H_p$, it suffices to compute the fixed point of our bound on the local Rademacher complexity presented in Section 3. To this end we show:

Lemma 10. Assume that $\|k\|_\infty\leq B$ almost surely and let $p\in[1,2]$. For the fixed point $r^*$ of the local Rademacher complexity $2FL\,R_{\frac{r}{4L^2}}(H_p)$ it holds
\[
r^* \leq \min_{0\leq h_m\leq\infty}\ \frac{4F^2\sum_{m=1}^M h_m}{n} + 8FL\sqrt{\frac{ep^{*2}D^2}{n}\bigg\|\bigg(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\bigg)_{m=1}^M\bigg\|_{\frac{p^*}{2}}} + \frac{4\sqrt{Be}\,DFLM^{\frac{1}{p^*}}p^*}{n}.
\]

Proof. For this proof we make use of the bound (11) on the local Rademacher complexity. Defining
\[
a = \frac{4F^2\sum_{m=1}^M h_m}{n}
\quad\text{and}\quad
b = 4FL\sqrt{\frac{ep^{*2}D^2}{n}\bigg\|\bigg(\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\bigg)_{m=1}^M\bigg\|_{\frac{p^*}{2}}} + \frac{2\sqrt{Be}\,DFLM^{\frac{1}{p^*}}p^*}{n},
\]
in order to find a fixed point of (11) we need to solve $r=\sqrt{ar}+b$, which is equivalent to solving $r^2-(a+2b)r+b^2=0$ for its positive root. Denote this solution by $r^*$. It is then easy to see that $r^*\leq a+2b$. Resubstituting the definitions of $a$ and $b$ yields the result.

We now address the issue of computing actual rates of convergence of the fixed point $r^*$ under the assumption of algebraically decreasing eigenvalues of the kernel matrices; that is, we assume $\exists\, d_m:\ \lambda_j^{(m)}\leq d_m\,j^{-\alpha_m}$ for some $\alpha_m>1$. This is a common assumption and is met, for example, by finite-rank kernels and convolution kernels (Williamson et al., 2001). Notice that it implies
\[
\sum_{j=h_m+1}^\infty\lambda_j^{(m)} \leq d_m\sum_{j=h_m+1}^\infty j^{-\alpha_m} \leq d_m\int_{h_m}^\infty x^{-\alpha_m}\,dx = d_m\Big[\frac{x^{1-\alpha_m}}{1-\alpha_m}\Big]_{h_m}^\infty = \frac{d_m}{\alpha_m-1}\,h_m^{1-\alpha_m}. \tag{18}
\]
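The quadratic fixed-point step in the proof of Lemma 10 is easy to verify numerically: the positive root of $r=\sqrt{ar}+b$ satisfies $r^*\leq a+2b$. The following is a minimal sketch of that check (our own code, not from the paper):

```python
import math
import random

random.seed(0)

def fixed_point(a, b):
    """Larger root of r^2 - (a + 2b) r + b^2 = 0, i.e. the solution of r = sqrt(a r) + b."""
    s = a + 2.0 * b
    return 0.5 * (s + math.sqrt(s * s - 4.0 * b * b))

ok = True
for _ in range(1000):
    a, b = random.uniform(0.0, 10.0), random.uniform(0.0, 10.0)
    r = fixed_point(a, b)
    ok &= abs(r - (math.sqrt(a * r) + b)) < 1e-8  # r is indeed a fixed point
    ok &= r <= a + 2.0 * b + 1e-12                # and it is bounded by a + 2b
```

The bound $r^*\leq a+2b$ follows because the discriminant $(a+2b)^2-4b^2=a^2+4ab$ is at most $(a+2b)^2$.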
To exploit the above fact, first note that by $\ell_p$-to-$\ell_q$ conversion,
\[
\frac{4F^2\sum_{m=1}^M h_m}{n} \leq 4F\sqrt{\frac{F^2M\sum_{m=1}^M h_m^2}{n^2}} \leq 4F\sqrt{\frac{F^2M^{2-\frac{2}{p^*}}\big\|(h_m^2)_{m=1}^M\big\|_{\frac{p^*}{2}}}{n^2}},
\]
so that we can translate the result of the previous lemma, by (12), (13), and (14), into
\[
r^* \leq \min_{0\leq h_m\leq\infty}\ 8F\sqrt{\frac{1}{n}\bigg\|\bigg(\frac{F^2M^{2-\frac{2}{p^*}}h_m^2}{n} + 4ep^{*2}D^2L^2\sum_{j=h_m+1}^\infty\lambda_j^{(m)}\bigg)_{m=1}^M\bigg\|_{\frac{p^*}{2}}} + \frac{4\sqrt{Be}\,DFLM^{\frac{1}{p^*}}p^*}{n}. \tag{19}
\]
Inserting the result of (18) into the above bound and setting the derivative with respect to $h_m$ to zero, we find the optimal $h_m$ as
\[
h_m = \Big(4d_m\,ep^{*2}D^2F^{-2}L^2M^{\frac{2}{p^*}-2}\,n\Big)^{\frac{1}{1+\alpha_m}}.
\]
Resubstituting this into (19), we note that
\[
r^* = O\Bigg(\sqrt{\Big\|\Big(n^{-\frac{2\alpha_m}{1+\alpha_m}}\Big)_{m=1}^M\Big\|_{\frac{p^*}{2}}}\Bigg),
\]
so that the asymptotic rate of convergence in $n$ is determined by the kernel with the slowest decaying spectrum (i.e., the smallest $\alpha_m$). Denoting $d_{\max}:=\max_{m=1,\ldots,M}d_m$, $\alpha_{\min}:=\min_{m=1,\ldots,M}\alpha_m$, and $h_{\max}:=\big(4d_{\max}ep^{*2}D^2F^{-2}L^2M^{\frac{2}{p^*}-2}n\big)^{\frac{1}{1+\alpha_{\min}}}$, we can upper-bound (19) by
\begin{align*}
r^* &\leq 8F\sqrt{\frac{3-\alpha_{\min}}{1-\alpha_{\min}}\,F^2M^2h_{\max}^2n^{-2}} + \frac{4\sqrt{Be}\,DFLM^{\frac{1}{p^*}}p^*}{n}
\leq 8\sqrt{\frac{3-\alpha_{\min}}{1-\alpha_{\min}}}\,F^2Mh_{\max}n^{-1} + \frac{4\sqrt{Be}\,DFLM^{\frac{1}{p^*}}p^*}{n}\\
&\leq 16\sqrt{e\,\frac{3-\alpha_{\min}}{1-\alpha_{\min}}}\,\big(d_{\max}D^2F^{-2}L^2p^{*2}\big)^{\frac{1}{1+\alpha_{\min}}}\,M^{1+\frac{2}{1+\alpha_{\min}}\left(\frac{1}{p^*}-1\right)}\,n^{-\frac{\alpha_{\min}}{1+\alpha_{\min}}} + \frac{4\sqrt{Be}\,DFLM^{\frac{1}{p^*}}p^*}{n}. \tag{20}
\end{align*}
We have thus proved the following theorem, which follows from the above inequality, Lemma 9, and the fact that our class $H_p$ ranges in $\big[-\sqrt{B}DM^{1/p^*},\,\sqrt{B}DM^{1/p^*}\big]$.

Theorem 11. Assume that $\|k\|_\infty\leq B$ and $\exists\, d_m:\ \lambda_j^{(m)}\leq d_m\,j^{-\alpha_m}$ for some $\alpha_m>1$. Let $l$ be a Lipschitz continuous loss with constant $L$ and assume there is a positive constant $F$ such that $\forall f\in\mathcal F:\ P(f-f^*)^2\leq F\,P(l_f-l_{f^*})$. Then, for all $x>0$, with probability at least $1-e^{-x}$ the excess loss of the multi-kernel class $H_p$ can be bounded for $p\in[1
, 2]$ as
\[
P(l_{\hat f}-l_{f^*}) \leq \min_{t\in[p,2]}\ 186\sqrt{\frac{3-\alpha_{\min}}{1-\alpha_{\min}}}\,\big(d_{\max}D^2F^{-2}L^2t^{*2}\big)^{\frac{1}{1+\alpha_{\min}}}\,M^{1+\frac{2}{1+\alpha_{\min}}\left(\frac{1}{t^*}-1\right)}\,n^{-\frac{\alpha_{\min}}{1+\alpha_{\min}}} + \frac{47\sqrt{B}\,DLM^{\frac{1}{t^*}}t^*}{n} + \frac{\big(22\sqrt{B}\,DLM^{\frac{1}{t^*}}+27F\big)\,x}{n}.
\]

We see from the above bound that convergence can be almost as slow as $O\big(p^*M^{\frac{1}{p^*}}n^{-\frac{1}{2}}\big)$ (if at least one $\alpha_m\approx 1$ is small, and thus $\alpha_{\min}$ is small) and almost as fast as $O(n^{-1})$ (if $\alpha_m$ is large for all $m$, and thus $\alpha_{\min}$ is large). For example, the latter is the case if all kernels have finite rank; the convolution kernel is another example of this type. Notice that we could of course repeat the above discussion to obtain excess risk bounds for the case $p\geq 2$ as well, but since it is very questionable that this would lead to new insights, it is omitted for simplicity.

6. Discussion

In this section we compare the obtained local Rademacher bound with the global one, discuss related work as well as Assumption (U), and give a practical application of the bounds by studying the appropriateness of small or large $p$ in various learning scenarios.

6.1 Global vs. Local Rademacher Bounds

In this section we discuss the rates obtained from the bound in Theorem 11 for the excess risk and compare them to the rates obtained using the global Rademacher complexity bound of Corollary 4. To simplify the discussion somewhat, we assume that the eigenvalues satisfy $\lambda_j^{(m)}\leq d\,j^{-\alpha}$ (with $\alpha>1$) for all $m$ and concentrate on the rates obtained as a function of the parameters $n$, $\alpha$, $M$, $D$, and $p$, while considering the other parameters fixed and hiding them in a big-O notation. Using this simplification, the bound of Theorem 11 reads
\[
\forall t\in[p,2]:\quad P(l_{\hat f}-l_{f^*}) = O\Big(t^*D^{\frac{2}{1+\alpha}}M^{1+\frac{2}{1+\alpha}\left(\frac{1}{t^*}-1\right)}n^{-\frac{\alpha}{1+\alpha}}\Big). \tag{21}
\]
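The rate $n^{-\alpha/(1+\alpha)}$ in (21) arises from balancing the two $h$-dependent terms inside the square root of (19): minimizing $c_1h^2/n^2 + c_2h^{1-\alpha}/n$ over $h$ gives a value of order $n^{-2\alpha/(1+\alpha)}$, whose square root yields the rate. A numeric sketch (the constants $c_1=c_2=1$ and the grid are our own choices):

```python
import numpy as np

def min_tradeoff(n, alpha):
    """Grid-minimize h^2/n^2 + h^(1-alpha)/n over integer h >= 1."""
    h = np.arange(1, 200_000, dtype=float)
    return (h**2 / n**2 + h ** (1.0 - alpha) / n).min()

alpha = 2.0
v1, v2 = min_tradeoff(1e4, alpha), min_tradeoff(1e7, alpha)
emp_exponent = np.log(v2 / v1) / np.log(1e7 / 1e4)  # empirical decay exponent in n
predicted = -2.0 * alpha / (1.0 + alpha)            # = -4/3 for alpha = 2
```

The optimal grid point scales as $n^{1/(1+\alpha)}$, so the constants cancel asymptotically and the empirical exponent matches the prediction.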
On the other hand, the global Rademacher complexity directly leads to a bound on the supremum of the centered empirical process indexed by $\mathcal F$ and thus also provides a bound on the excess risk (see, e.g., Bousquet et al., 2004). Therefore, using Corollary 4, wherein we upper-bound the trace of each $J_m$ by the constant $B$ (and subsume it under the O-notation), we have a second bound on the excess risk, of the form
\[
\forall t\in[p,2]:\quad P(l_{\hat f}-l_{f^*}) = O\big(t^*DM^{\frac{1}{t^*}}n^{-\frac{1}{2}}\big). \tag{22}
\]
First consider the case $p\geq(\log M)^*$, so that the best choice in (21) and (22) is $t=p$. Clearly, if we hold all other parameters fixed and let $n$ grow to infinity, the rate obtained through the local Rademacher analysis is better, since $\alpha>1$. However, it is also of interest to consider what happens when the number of kernels $M$ and the $\ell_p$-ball radius $D$ can grow with $n$. In general, we have a bound on the excess risk given by the minimum of (21) and (22); a straightforward calculation shows that the local Rademacher analysis improves over the global one whenever
\[
\frac{M^{\frac{1}{p}}}{D} = O(\sqrt{n}).
\]
Interestingly, this "phase transition" does not depend on $\alpha$ (i.e., on the "complexity" of the individual kernels), but only on $p$.

If $p\leq(\log M)^*$, the best choice in (21) and (22) is $t=(\log M)^*$. In this case taking the minimum of the two bounds reads
\[
\forall p\leq(\log M)^*:\quad P(l_{\hat f}-l_{f^*}) \leq O\Big(\min\Big(D(\log M)\,n^{-\frac{1}{2}},\ \big(D\log M\big)^{\frac{2}{1+\alpha}}M^{\frac{\alpha-1}{1+\alpha}}n^{-\frac{\alpha}{1+\alpha}}\Big)\Big), \tag{23}
\]
and the phase transition where the local Rademacher bound improves over the global one occurs for
\[
\frac{M}{D\log M} = O(\sqrt{n}).
\]
Finally, it is also interesting to observe the behavior of (21) and (22) as $\alpha\to\infty$. In this limit only one eigenvalue is nonzero for each kernel, that is, each kernel space
is one-dimensional. In other words, we are then in the case of "classical" aggregation of $M$ basis functions, and the minimum of the two bounds reads
\[
\forall t\in[p,2]:\quad P(l_{\hat f}-l_{f^*}) \leq O\Big(\min\big(Mn^{-1},\ t^*DM^{\frac{1}{t^*}}n^{-\frac{1}{2}}\big)\Big). \tag{24}
\]
In this configuration, observe that the local Rademacher bound is $O(M/n)$ and no longer depends on $D$ or $p$; in fact, it is the same bound one would obtain for empirical risk minimization over the space of all linear combinations of the $M$ base functions, without any restriction on the norm of the coefficients: the $\ell_p$-norm constraint becomes void. The global Rademacher bound, on the other hand, still depends crucially on the $\ell_p$-norm constraint.

This situation is to be compared with the sharp analysis of the optimal convergence rate of convex aggregation of $M$ functions obtained by Tsybakov (2003) in the framework of squared error loss regression, which is shown to be
\[
O\Bigg(\min\Bigg(\frac{M}{n},\ \sqrt{\frac{1}{n}\log\Big(\frac{M}{\sqrt{n}}\Big)}\Bigg)\Bigg).
\]
This corresponds to the setting studied here with $D=1$, $p=1$, and $\alpha\to\infty$, and we see that the bound (23) recovers (up to log factors) this sharp bound and the related phase transition phenomenon.

6.2 Discussion of Assumption (U)

Assumption (U) is arguably quite a strong hypothesis for the validity of our results (needed for $1\leq p\leq 2$), which was not required for the global Rademacher bound. A similar assumption was made in the recent work of Raskutti et al. (2010), where a related MKL algorithm using an $\ell_1$-type penalty is studied and bounds are derived that depend on the "sparsity pattern" of the Bayes function, i.e., on how many coefficients $w^*_m$ are nonzero.
If the kernel spaces are one-dimensional, in which case $\ell_1$-penalized MKL reduces qualitatively to standard lasso-type methods, this assumption can be seen as a strong form of the so-called Restricted Isometry Property (RIP), which is known to be necessary to grant the validity of bounds taking into account the sparsity pattern of the Bayes function.

In the present work, our analysis stays deliberately "agnostic" (or worst-case) with respect to the true sparsity pattern (in part because experimental evidence seems to point towards the fact that the Bayes function is not strongly sparse); correspondingly, it could legitimately be hoped that the RIP condition, or Assumption (U), could be substantially relaxed. Considering again the special case of one-dimensional kernel spaces and the discussion about the qualitatively equivalent case $\alpha\to\infty$ in the previous section, it can be seen that Assumption (U) is indeed unnecessary for bound (24) to hold, and more specifically for the rate of $M/n$ obtained through the local Rademacher analysis in this case. However, as we discussed, what happens in this specific case is that the local Rademacher analysis becomes oblivious to the $\ell_p$-norm constraint, and we are left with the standard parametric convergence rate in dimension $M$. In other words, with one-dimensional kernel spaces, the two constraints (on the $L_2(P)$-norm of the function and on the $\ell_p$ block-norm of the coefficients) appearing in the definition of the local Rademacher complexity are essentially not active simultaneously. Unfortunately, it is clear that this property is not true anymore for
kernels of higher complexity (i.e., with a non-trivial decay rate of the eigenvalues). This is a specificity of the kernel setting as compared to combinations of a dictionary of $M$ simple functions, and Assumption (U) was in effect used to "align" the two constraints.

To sum up, Assumption (U) is used here for a different purpose from that of the RIP in sparsity analyses of $\ell_1$ regularization methods; it is not clear to us at this point whether this assumption is necessary, or whether uncorrelated variables $x^{(m)}$ constitute a "worst case" for our analysis. We have not succeeded so far in relinquishing this assumption for $p\leq 2$, and this question remains open.

To our knowledge, there is no previous analysis of the $\ell_p$-MKL setting for $p>1$; the recent works of Raskutti et al. (2010) and Koltchinskii and Yuan (2010) focus on the case $p=1$ and on the sparsity pattern of the Bayes function. A refined analysis of $\ell_p$-regularized methods in the case of combinations of $M$ basis functions was laid out by Koltchinskii (2009), also taking into account the possible soft sparsity pattern of the Bayes function. Extending the ideas underlying the latter analysis to the kernel setting is likely to open interesting developments.

6.3 Analysis of the Impact of the Norm Parameter p on the Accuracy of ℓp-norm MKL

As outlined in the introduction, there is empirical evidence that the performance of $\ell_p$-norm MKL crucially depends on the choice of the norm parameter $p$ (cf. Figure 1 in the introduction). The aim of this section is to relate the theoretical analysis presented here to this empirically observed phenomenon. We believe that this phenomenon can be (at least partly) explained on the basis of our excess risk bound obtained in the last section. To this end we analyze the dependency of the excess risk bound on the chosen norm parameter $p$.
We will show that the optimal $p$ depends on the geometrical properties of the learning problem, and that in general, depending on the true geometry, any $p$ can be optimal. Since our excess risk bound is only formulated for $p\leq 2$, we limit the analysis to the range $p\in[1,2]$.

To start with, first note that the choice of $p$ affects the excess risk bound only through the factor (cf. Theorem 11 and Equation (21))
\[
\nu_p := \min_{t\in[p,2]}\ \big(D_p\,t^*\big)^{\frac{2}{1+\alpha}}\,M^{1+\frac{2}{1+\alpha}\left(\frac{1}{t^*}-1\right)}.
\]
So we write the excess risk as $P(l_{\hat f}-l_{f^*}) = O(\nu_p)$ and hide all other variables and constants in the O-notation for the whole section (in particular, the sample size $n$ is considered a constant for the purposes of the present discussion). It might surprise the reader that we consider the factor $D$ in the bound, although the bound seems not to depend on $p$ through $D$. This stems from a subtle reason that we have ignored in the analysis so far: $D$ is related to the approximation properties of the class, i.e., its ability to attain the Bayes hypothesis. For a "fair" analysis we should take the approximation properties of the class into account.

To illustrate this, let us assume that the Bayes hypothesis belongs to the space $H$ and can be represented by $w^*$; assume further that the block components satisfy $\|w^*_m\|_2 = m^{-\beta}$, $m=1,\ldots,M$, where $\beta\geq 0$ is a parameter parameterizing the "soft sparsity" of the components. For example, the cases $\beta\in\{0.5,1,2\}$ are shown in Figure 2 for $M=2$ and

[Figure 2: Two-dimensional illustration of the three analyzed learning scenarios, which differ in the soft sparsity of the Bayes hypothesis $w^*$ (parametrized by $\beta$). Left ($\beta=2$): a soft sparse $w^*$. Center ($\beta=1$): an intermediate non-sparse $w^*$. Right ($\beta=0.5$): an almost-uniformly non-sparse $w^*$.]

assuming that each kernel has rank 1 (thus being isomorphic to $\mathbb{R}$).
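Anticipating the choice $D_p=\|w^*\|_{2,p}$ made below, the bound factor $\nu_p$ can be evaluated numerically for these scenarios. The following is our own reimplementation of the described simulation ($\alpha=2$, $M=1000$; the grids and tolerances are our choices):

```python
import numpy as np

def nu(p, alpha, M, beta):
    """Bound factor nu_p with D_p = ||w*||_{2,p} and ||w*_m||_2 = m^(-beta)."""
    m = np.arange(1, M + 1, dtype=float)
    D_p = np.sum(m ** (-beta * p)) ** (1.0 / p)  # block l_p norm of w*
    ts = np.linspace(p, 2.0, 400)                # minimize over t in [p, 2]
    t_star = ts / (ts - 1.0)
    factor = t_star ** (2.0 / (1.0 + alpha)) * M ** (1.0 + (2.0 / (1.0 + alpha)) * (1.0 / t_star - 1.0))
    return D_p ** (2.0 / (1.0 + alpha)) * factor.min()

alpha, M = 2.0, 1000
p_grid = np.linspace(1.01, 2.0, 100)

def optimal_p(beta):
    return p_grid[np.argmin([nu(p, alpha, M, beta) for p in p_grid])]

p_sparse, p_mid, p_dense = optimal_p(2.0), optimal_p(1.0), optimal_p(0.5)
```

In our runs the minima fall near the values reported in Figure 3: a small optimal $p$ for the soft sparse scenario ($\beta=2$), an intermediate one for $\beta=1$, and the boundary value $p=2$ for the almost-uniform scenario ($\beta=0.5$).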
If $n$ is large, the best bias-complexity tradeoff for a fixed $p$ will correspond to a vanishing bias, so that the best choice of $D$ will be close to the minimal value such that $w^*\in H_{p,D}$, that is, $D_p = \|w^*\|_{2,p}$. Plugging in this value for $D_p$, the bound factor $\nu_p$ becomes
\[
\nu_p = \|w^*\|_{2,p}^{\frac{2}{1+\alpha}}\ \min_{t\in[p,2]}\ t^{*\frac{2}{1+\alpha}}\,M^{1+\frac{2}{1+\alpha}\left(\frac{1}{t^*}-1\right)}.
\]
We can now plot the value of $\nu_p$ as a function of $p$ for special choices of $\alpha$, $M$, and $\beta$. We realized this simulation for $\alpha=2$, $M=1000$, and $\beta\in\{0.5,1,2\}$, which means we generated three learning scenarios with different levels of soft sparsity parametrized by $\beta$. The results are shown in Figure 3. Note that the soft sparsity of $w^*$ decreases from the left-hand to the right-hand side. We observe that in the "soft sparsest" scenario ($\beta=2$, shown on the left-hand side) the minimum is attained for a quite small $p=1.2$, while for the intermediate case ($\beta=1$, shown at the center) $p=1.4$ is optimal, and finally, in the uniformly non-sparse scenario ($\beta=0.5$, shown on the right-hand side), the choice $p=2$ is optimal (an even higher $p$ could be optimal, but our bound is only valid for $p\in[1,2]$).

This means that if the true Bayes hypothesis has an intermediately dense representation, our bound gives the strongest generalization guarantees for $\ell_p$-norm MKL with an intermediate choice of $p$. This is also intuitive: if the truth exhibits some soft sparsity but is not strongly sparse, we expect non-sparse MKL to perform better than strongly sparse MKL or the unweighted-sum kernel SVM.

7. Conclusion

We derived a sharp upper bound on the local Rademacher complexity of $\ell_p$-norm multiple kernel learning under the assumption of uncorrelated kernels. We also proved a lower bound that matches the upper one and shows that our result is tight.
[Figure 3: Results of the simulation for the three analyzed learning scenarios. The value of the bound factor $\nu_p$ is plotted as a function of $p$; the location of the minimum depends on the true soft sparsity of the Bayes hypothesis $w^*$ (parametrized by $\beta$): (a) $\beta=2$, an "almost sparse" $w^*$; (b) $\beta=1$; (c) $\beta=0.5$.]

Using the local Rademacher complexity bound, we derived an excess risk bound that attains the fast rate of $O(n^{-\frac{\alpha}{1+\alpha}})$, where $\alpha$ is the minimum eigenvalue decay rate of the individual kernels.

In a practical case study, we found that the optimal value of $p$ in that bound depends on the true Bayes-optimal kernel weights: if the true weights exhibit soft sparsity but are not strongly sparse, then the generalization bound is minimized for an intermediate $p$. This is not only intuitive but also supports empirical studies showing that sparse MKL ($p=1$) rarely works well in practice, while some intermediate choice of $p$ can improve performance.

Of course, this connection is only valid if the optimal kernel weights are likely to be non-sparse in practice. Indeed, related research points in that direction. For example, already weak connectivity in a causal graphical model may be sufficient for all variables to be required for optimal predictions, and even the prevalence of sparsity in causal flows is being questioned (e.g., for the social sciences Gelman, 2010, argues that "there are (almost) no true zeros"). Finally, we note that there seems to be a certain preference for sparse models in the scientific community.
However, sparsity by itself should not be considered the ultimate virtue to be strived for. On the contrary, previous MKL research has shown that non-sparse models may improve quite impressively over sparse ones in practical applications. The present analysis supports this by showing that the reason might be traced back to non-sparse MKL attaining better generalization bounds in non-sparse learning scenarios. We remark that this point of view is also supported by related analyses. For example, it was shown by Leeb and Pötscher (2008), in a fixed design setup, that any sparse estimator (i.e., one satisfying the oracle property of correctly predicting the zero values of the true target $w^*$) has a maximal scaled mean squared error (MSMSE) that diverges to $\infty$. This is somewhat suboptimal since, for example, least-squares regression has a converging MSMSE. Although this is an asymptotic result, it might also be one of the reasons for finding excellent (nonasymptotic) results in non-sparse MKL. In another, recent study, Xu et al. (2008) showed that no sparse algorithm can be algorithmically stable. This is noticeable because algorithmic stability is connected with generalization error (Bousquet and Elisseeff, 2002).

Acknowledgments

We thank Peter L. Bartlett and Klaus-Robert Müller for helpful comments on the manuscript.

Appendix A. Lemmata and Proofs

The following result gives a block-structured version of Hölder's inequality (e.g., Steele, 2004).

Lemma 12 (Block-structured Hölder inequality). Let $x=(x^{(1)},\ldots,x^{(M)})$, $y=(y^{(1)},\ldots,y^{(M)})\in\mathcal H = \mathcal H_1\times\cdots\times\mathcal H_M$. Then, for any $p\geq 1$, it holds $\langle x,y\rangle \leq \|x\|_{2,p}\,\|y\|_{2,p^*}$.

Proof. By the Cauchy--Schwarz inequality (C.-S.), we have for all $x,y\in\mathcal H$:
\[
\langle x,y\rangle = \sum_{m=1}^M\big\langle x^{(m)},y^{(m)}\big\rangle \overset{\text{C.-S.}}{\leq} \sum_{m=1}^M\big\|x^{(m)}\big\|_2\,\big\|y^{(m)}\big\|_2 = \Big\langle\big(\|x^{(1)}\|_2,\ldots
,\|x^{(M)}\|_2\big),\ \big(\|y^{(1)}\|_2,\ldots,\|y^{(M)}\|_2\big)\Big\rangle \overset{\text{H\"older}}{\leq} \|x\|_{2,p}\,\|y\|_{2,p^*}.
\]

Proof of Lemma 3 (Rosenthal + Young). It is clear that the result trivially holds for $\frac{1}{2}\leq q\leq 1$ with $C_q=1$ by Jensen's inequality. In the case $q\geq 1$, we apply Rosenthal's inequality (Rosenthal, 1970) to the sequence $X_1,\ldots,X_n$, thereby using the optimal constants computed in Ibragimov and Sharakhmetov (2001), which are $C_q=2$ ($q\leq 2$) and $C_q=E\,Z^q$ ($q\geq 2$), respectively, where $Z$ is a random variable distributed according to a Poisson law with parameter $\lambda=1$. This yields
\[
E\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big)^q \leq C_q\max\Bigg(\frac{1}{n^q}\sum_{i=1}^n E\,X_i^q,\ \Big(\frac{1}{n}\sum_{i=1}^n E\,X_i\Big)^q\Bigg). \tag{25}
\]
Using that $X_i\leq B$ holds almost surely, we could readily obtain a bound of the form $\frac{B^q}{n^{q-1}}$ on the first term. However, this is loose and, for $q=1$, does not converge to zero as $n\to\infty$. Therefore, we follow a different approach based on Young's inequality (e.g., Steele, 2004):
\[
\frac{1}{n^q}\sum_{i=1}^n E\,X_i^q \leq \Big(\frac{B}{n}\Big)^{q-1}\,\frac{1}{n}\sum_{i=1}^n E\,X_i \overset{\text{Young}}{\leq} \frac{1}{q^*}\Big(\frac{B}{n}\Big)^{q^*(q-1)} + \frac{1}{q}\Big(\frac{1}{n}\sum_{i=1}^n E\,X_i\Big)^q = \frac{1}{q^*}\Big(\frac{B}{n}\Big)^{q} + \frac{1}{q}\Big(\frac{1}{n}\sum_{i=1}^n E\,X_i\Big)^q.
\]
It thus follows from (25) that for all $q\geq\frac{1}{2}$,
\[
E\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big)^q \leq C_q\Bigg(\Big(\frac{B}{n}\Big)^q + \Big(\frac{1}{n}\sum_{i=1}^n E\,X_i\Big)^q\Bigg),
\]
where $C_q$ can be taken as $2$ ($q\leq 2$) and $E\,Z^q$ ($q\geq 2$), respectively, where $Z$ is Poisson-distributed. In the subsequent Lemma 13 we show $E\,Z^q\leq(q+e)^q$. Clearly, for $q\geq 2$ it holds $q+e\leq 2eq$, so that in any case $C_q\leq\max\big(2,(q+e)^q\big)\leq(2eq)^q$, which concludes the proof.

The following lemma gives a handle on the $q$-th moment of a Poisson-distributed random variable and is used in the previous proof.

Lemma 13. For the $q$-th moment of a random variable $Z$ distributed according to a Poisson law with parameter $\lambda=1$, the following inequality holds for all $q\geq 1$:
\[
E\,Z^q \overset{\text{def.}}{=} \frac{1}{e}\sum_{k=0}^\infty\frac{k^q}{k!} \leq (q+e)^q.
\]
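Before turning to the proof, the bound of Lemma 13 can be checked numerically against the series definition (a quick check of our own; the truncation at 400 terms is ours):

```python
import math

def poisson_moment(q, terms=400):
    """E Z^q for Z ~ Poisson(1), via the series (1/e) * sum_{k>=0} k^q / k!."""
    return math.exp(-1.0) * sum(k**q / math.factorial(k) for k in range(terms))

ok_moments = all(poisson_moment(q) <= (q + math.e) ** q for q in range(1, 21))
third_moment = poisson_moment(3)  # equals the third Bell number, 5
```

The moments $E\,Z^q$ of a Poisson(1) variable are the Bell numbers, so the bound is loose but of the right order.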
Proof. We start by decomposing $E\,Z^q$ as follows:
\[
E\,Z^q = \frac{1}{e}\Bigg(0 + \sum_{k=1}^q\frac{k^q}{k!} + \sum_{k=q+1}^\infty\frac{k^q}{k!}\Bigg) = \frac{1}{e}\Bigg(\sum_{k=1}^q\frac{k^{q-1}}{(k-1)!} + \sum_{k=q+1}^\infty\frac{k^q}{k!}\Bigg) \leq \frac{1}{e}\Bigg(q^q + \sum_{k=q+1}^\infty\frac{k^q}{k!}\Bigg). \tag{26}
\]
Note that by Stirling's approximation it holds $k! = \sqrt{2\pi k}\,e^{\tau_k}\,k^k e^{-k}$ with $\frac{1}{12k+1}<\tau_k<\frac{1}{12k}$ for all $k$. Thus
\[
\sum_{k=q+1}^\infty\frac{k^q}{k!} = \sum_{k=q+1}^\infty\frac{1}{\sqrt{2\pi k}\,e^{\tau_k}}\,e^k\,k^{-(k-q)}
= \sum_{k=1}^\infty\frac{1}{\sqrt{2\pi(k+q)}\,e^{\tau_{k+q}}}\,e^{k+q}\,(k+q)^{-k}
= e^q\sum_{k=1}^\infty\frac{1}{\sqrt{2\pi(k+q)}\,e^{\tau_{k+q}}}\Big(\frac{e}{k+q}\Big)^{k}
\overset{(*)}{\leq} e^q\sum_{k=1}^\infty\frac{1}{\sqrt{2\pi k}\,e^{\tau_k}}\Big(\frac{e}{k}\Big)^{k}
\overset{\text{Stirling}}{=} e^q\sum_{k=1}^\infty\frac{1}{k!} \leq e^{q+1},
\]
where for $(*)$ note that $e^{\tau_k}\,k\leq e^{\tau_{k+q}}\,(k+q)$ can be shown by some algebra using $\frac{1}{12k+1}<\tau_k<\frac{1}{12k}$. Now by (26),
\[
E\,Z^q \leq \frac{1}{e}\big(q^q + e^{q+1}\big) \leq q^q + e^q \leq (q+e)^q,
\]
which was to be shown.

Lemma 14. For any $a,b\in\mathbb{R}^m_+$ it holds for all $q\geq 1$ that
\[
\|a\|_q + \|b\|_q \leq 2^{1-\frac{1}{q}}\,\|a+b\|_q \leq 2\,\|a+b\|_q.
\]

Proof. Let $a=(a_1,\ldots,a_m)$ and $b=(b_1,\ldots,b_m)$. Because all components of $a$ and $b$ are nonnegative, we have for all $i=1,\ldots,m$ that $a_i^q+b_i^q\leq(a_i+b_i)^q$, and thus
\[
\|a\|_q^q + \|b\|_q^q \leq \|a+b\|_q^q. \tag{28}
\]
We conclude by $\ell_1$-to-$\ell_q$ conversion (see (14)):
\[
\|a\|_q+\|b\|_q = \Big\|\big(\|a\|_q,\|b\|_q\big)\Big\|_1 \overset{(14)}{\leq} 2^{1-\frac{1}{q}}\Big\|\big(\|a\|_q,\|b\|_q\big)\Big\|_q = 2^{1-\frac{1}{q}}\big(\|a\|_q^q+\|b\|_q^q\big)^{\frac{1}{q}} \overset{(28)}{\leq} 2^{1-\frac{1}{q}}\,\|a+b\|_q,
\]
which completes the proof.

References

F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proc. 21st ICML. ACM, 2004.

P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, Nov. 2002.

P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.

R. R. Bouckaert, E. Frank, M. A. Hall, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten.
WEKA: Experiences with a Java open-source project. Journal of Machine Learning Research, 11:2533–2541, 2010.

O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, March 2002.

O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In O. Bousquet, U. von Luxburg, and G. Rätsch, editors, Advanced Lectures on Machine Learning, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer Berlin/Heidelberg, 2004.

C. Cortes. Invited talk: Can learning kernels help performance? In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1:1–1:1, New York, NY, USA, 2009. ACM.

C. Cortes, A. Gretton, G. Lanckriet, M. Mohri, and A. Rostamizadeh. Proceedings of the NIPS workshop on kernel learning: Automatic selection of optimal kernels, 2008. URL http://www.cs.nyu.edu/learning_kernels.

C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proceedings, 27th ICML, 2010.

P. V. Gehler and S. Nowozin. Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.

A. Gelman. Causality and statistical learning. American Journal of Sociology, 0, 2010.

R. Ibragimov and S. Sharakhmetov. The best constant in the Rosenthal inequality for nonnegative random variables. Statistics & Probability Letters, 55(4):367–376, 2001.

J.-P. Kahane. Some Random Series of Functions. Cambridge University Press, 2nd edition, 1985.

M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate lp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A.
Culotta, editors, A dvanc es in Neur al Information Pr o c essing Systems 22 , pages 997–1005. MIT Pr ess, 2009. 29 M. Kloft, U. Brefeld, S. S onnen b u rg, and A. Zien. ℓ p -norm m ultiple kernel learning. Journal of M achine L e arning R ese ar ch , 2011 . T o app ear. URL http: //doc.ml .tu- berlin .de/ nonspars e_mkl/ . V. Koltc hin skii. Rademac her p enalties and stru ctur al risk min imization. IEEE T r ansactions on Information The ory , 47(5):19 02–1914, 2001. V. Koltc hinskii. Lo cal Rademac her complexities and oracle inequalities in risk min imization. Anna ls of Statistics , 34(6):2 593–2656, 2006. V. Koltc hin skii. Sparsit y in p enalized empirical risk minimization. Annales de l’Institut Henri Poinc ar ´ e, Pr ob abilit´ es et Statistiques , 45(1): 7–57, 2009. V. Koltc hinskii and M. Y uan. S parsit y in multiple k ernel learning. Annals of Statistics , 38 (6):36 60–3695, 2010. S. Kw api ´ en and W. A. W o yczy ´ nski. R andom Se rie s and Sto chastic Inte gr als: Single and Multiple . Birkh ¨ au s er, Basel and Boston, M.A., 1992. G. Lanckriet, N. Cristianini, L. E. Ghaoui, P . Bartlett, and M. I. Jordan. L earning th e k ernel matrix with semi-defin ite programming. JMLR , 5:27–72, 2004. H. Leeb and B. M. P¨ otsc her . Sparse estimators and the oracle pr op ert y , or the return of Ho dges’ estimator. Journal of Ec onometrics , 142:201– 211, 200 8. S. Mendelson. On the p erformance of k ernel classes. J. Mach. L e arn. R e s. , 4:75 9–771, Decem b er 2003. C. A. Micc h elli and M. P ontil. Learning the k ernel fu n ction via regularization. Journal of Machine L e arning R ese ar ch , 6:1099–112 5, 200 5. G. Raskutti, M. J. W ain wr igh t, and B. Y u . Minimax-optimal rates for sparse ad d itiv e mo dels ov er k ernel classes via con ve x programming. CoRR , abs/1008.3 654, 2010. H. Rosenthal. On the subspaces of L p ( p > 2) spanned by sequences of in dep end en t r andom v ariables. Isr ael J. M ath. , 8:273–3 03, 1970. J. R. Searle. 
Mind s, brains, and pr ograms. B e havior al and Br ain Scienc es , 3(03):417– 424, 1980. doi: 10.1017/S014 0525X00005756. URL http ://dx.do i.org/10 .1017/ S0140525 X0000575 6 . S. Sonn en bu rg, G. R¨ atsc h, C. S c h¨ afer, an d B. Sc h ¨ olk opf. Large scale multiple k ernel learning. Journal of M achine L e arning R e se ar ch , 7:1531– 1565, July 2006. J. M. Steele. The Cauchy-Schwarz Master Class: An Intr o duction to the Art of M athematic al Ine qualities . Cam br idge Universit y Press, New Y ork, NY, USA, 2004. I S BN 0521546 77X. M. S tone. C r oss-v alidatory c hoice and assessment of statistical predictors (with discu ssion). Journal of the R oyal Statistic al So ciety , B36:111 –147, 1974. 30 A. Ts y b ak o v. Optimal rates of aggregation. In B. Sch¨ olk opf and M. W armuth, editors, Computation al L e arning The ory and Kernel Machines (COL T-2003) , vo lu m e 2777 of L e ctur e Notes in Artificial Intel ligenc e , pages 303–3 13. Springer, 2003. R. C. Williamson, A. J . Smola, and B. Sch¨ olk opf. Generalizatio n p erformance of r egulariza- tion net works and s upp ort v ector m ac hines via en tropy num b ers of compact op erators. IEEE T r ansactions on Information The ory , 47(6):2516– 2532, 200 1. H. Xu, S. Mannor, and C. Caramanis. Spars e algorithms are not stable: A no-free-lunc h theorem. In Pr o c e e dings of the 46th Annua l Al lerton Confer enc e on Communic ation, Contr ol, and Computing , pages 1299 –1303, 2008. 31
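The two bounds proved in the appendix lend themselves to a quick numerical sanity check. The sketch below is not part of the original derivation; it assumes, as the decomposition $\mathbb{E} Z^q = \frac{1}{e}\sum_{k\geq 0} k^q/k!$ indicates, that $Z$ follows a Poisson distribution with parameter 1, and the helper names `poisson1_moment` and `lq_norm` are introduced here purely for illustration.

```python
import math
import random

def poisson1_moment(q, terms=200):
    """E[Z^q] for Z ~ Poisson(1), via a truncation of (1/e) * sum_k k^q / k!."""
    return math.exp(-1) * sum(k**q / math.factorial(k) for k in range(terms))

def lq_norm(v, q):
    """The l_q norm of a nonnegative vector v for q >= 1."""
    return sum(x**q for x in v) ** (1.0 / q)

# Moment bound of the proof above: E[Z^q] <= (q + e)^q for integer q >= 1.
for q in range(1, 15):
    assert poisson1_moment(q) <= (q + math.e) ** q

# Lemma 14: ||a||_q + ||b||_q <= 2^(1 - 1/q) * ||a + b||_q for a, b >= 0.
rng = random.Random(0)
for _ in range(1000):
    q = rng.uniform(1, 10)
    m = rng.randint(1, 8)
    a = [rng.random() for _ in range(m)]
    b = [rng.random() for _ in range(m)]
    lhs = lq_norm(a, q) + lq_norm(b, q)
    rhs = 2 ** (1 - 1 / q) * lq_norm([x + y for x, y in zip(a, b)], q)
    assert lhs <= rhs + 1e-12  # small tolerance for floating-point error
```

Both inequalities hold comfortably in practice: the Poisson moments (the Bell numbers, up to the $1/e$ factor) grow far more slowly than $(q+e)^q$, and the constant $2^{1-1/q}$ in Lemma 14 interpolates between $1$ at $q=1$ and $2$ as $q \to \infty$.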