Reproducing Kernel Banach Spaces with the ℓ1 Norm II: Error Analysis for Regularized Least Square Regression∗

Guohui Song† and Haizhang Zhang‡

∗ Supported by the Guangdong Provincial Government of China through the "Computational Science Innovative Research Team" program.
† School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85287. E-mail address: gsong9@asu.edu.
‡ School of Mathematics and Computational Science and Guangdong Province Key Laboratory of Computational Science, Sun Yat-sen University, Guangzhou 510275, P. R. China. E-mail address: zhhaizh2@sysu.edu.cn.

Abstract

A typical approach to estimating the learning rate of a regularized learning scheme is to bound the approximation error by the sum of the sampling error, the hypothesis error and the regularization error. Using a reproducing kernel space that satisfies the linear representer theorem brings the advantage of discarding the hypothesis error from the sum automatically. Following this direction, we illustrate how reproducing kernel Banach spaces with the ℓ1 norm can be applied to improve the learning rate estimate of $\ell^1$-regularization in machine learning.

Keywords: reproducing kernel Banach spaces, sparse learning, regularization, least square regression, learning rate, the representer theorem

1 Introduction

A class of reproducing kernel Banach spaces (RKBS) with the ℓ1 norm that satisfies the linear representer theorem was recently constructed in [14]. The purpose of this note is to illustrate how the obtained spaces can be applied to estimate the learning rate of the $\ell^1$-regularized least square regression in machine learning.

A general coefficient-based regularization of the least square regression has the form
$$\min_{c\in\mathbb{R}^m}\ \frac{1}{m}\sum_{j=1}^m |K_{\mathbf{x}}(x_j)c - y_j|^2 + \lambda\,\phi(c), \qquad(1.1)$$
where $\mathbf{x} := \{x_j : j\in\mathbb{N}_m\}$ with $\mathbb{N}_m := \{1,2,\dots,m\}$ is the sequence of sampling points from an input space $X$, $y_j\in Y\subseteq\mathbb{R}$ is the datum observed at $x_j$, $\lambda$ is a positive regularization parameter, $\phi$ is a nonnegative regularization function of the coefficient column vector $c$, and, with a chosen function $K\colon X\times X\to\mathbb{R}$, $K_{\mathbf{x}}(x)$ is the $1\times m$ row vector $(K(x_j,x): j\in\mathbb{N}_m)$.

When $K$ is a positive-definite reproducing kernel on $X$ and
$$\phi(c) := c^T K[\mathbf{x}]\,c, \qquad(1.2)$$
where $K[\mathbf{x}]$ is the $m\times m$ matrix defined by
$$(K[\mathbf{x}])_{j,k} := K(x_k,x_j),\quad j,k\in\mathbb{N}_m,$$
it follows from the celebrated representer theorem [7] that (1.1) is the classical regularization network, which has been extensively studied in the literature [6, 9, 10, 13, 19]. Estimates for the learning rate of the regularization network can be found, for example, in [4, 5, 12, 15, 23]. Learning rates for (1.1) with $\phi(c) = \sum_{j=1}^m |c_j|^p$ for $1 < p \le 2$ and for $p = 2$ were obtained in [18] and [16], respectively.

The linear programming regularization, where $\phi(c)$ is the $\ell^1$ norm $\|c\|_1$ of $c$, has recently attracted much attention. The increasing interest is mainly brought about by the progress of the lasso in statistics [17] and of compressive sensing [2, 3], in which $\ell^1$-regularization is able to yield a sparse representation of the resulting minimizer, a desirable feature in model selection. Moreover, $\ell^1$-regularization is particularly robust to non-Gaussian additive noise such as impulsive noise [1, 8].
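To make the coefficient-based $\ell^1$ scheme concrete, the following is a minimal numerical sketch (ours, not from the paper) that solves (1.1) with $\phi(c)=\|c\|_1$ for the exponential kernel $K(s,t)=e^{-|s-t|}$, one of the example kernels discussed in Section 2 below, using a generic proximal-gradient (ISTA) iteration. The toy data, the solver, and all names are illustrative assumptions.

```python
import numpy as np

def exponential_kernel(s, t):
    # K(s, t) = exp(-|s - t|): one of the example kernels recalled in Section 2
    return np.exp(-np.abs(s - t))

def l1_regularized_ls(K_mat, y, lam, n_iter=5000):
    """Approximately minimize (1/m)*||K_mat @ c - y||^2 + lam*||c||_1 by ISTA."""
    m = len(y)
    step = m / (2.0 * np.linalg.norm(K_mat, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    c = np.zeros(m)
    for _ in range(n_iter):
        grad = (2.0 / m) * K_mat.T @ (K_mat @ c - y)
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return c

# toy data: noisy samples of a smooth target on [0, 3] (illustrative)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 3.0, 40))
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(40)
K_mat = exponential_kernel(x[:, None], x[None, :])     # (K[x])_{j,k} = K(x_k, x_j); symmetric here
c = l1_regularized_ls(K_mat, y, lam=0.05)
f_z = lambda t: exponential_kernel(x, t) @ c           # f_{z,lambda}(t) = K_x(t) c
print("nonzero coefficients:", int(np.count_nonzero(np.abs(c) > 1e-8)))
print("f_z(1.5) ~", float(f_z(1.5)))
```

The soft-thresholding step is what produces the sparse coefficient vector referred to above; nothing in this sketch depends on the error analysis developed in the paper.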
Without making use of a reproducing kernel space, the recent references [11, 20] established estimates of the learning rate for the $\ell^1$-regularized least square regression
$$\min_{c\in\mathbb{R}^m}\ \frac{1}{m}\sum_{j=1}^m |K_{\mathbf{x}}(x_j)c - y_j|^2 + \lambda\|c\|_1. \qquad(1.3)$$
We attempt to show that an improvement on those estimates can be made if an RKBS with the ℓ1 norm is used. To explain how this can be done, we first introduce the popular approach [5] to learning rate estimates in machine learning.

A fundamental assumption in machine learning is that the sample data $\mathbf{z} := \{(x_j,y_j) : j\in\mathbb{N}_m\}\subseteq X\times Y$ consists of independent and identically distributed instances of a random variable $(x,y)\in X\times Y$ subject to an unknown probability measure $\rho$ on $X\times Y$. The performance of a predictor $f\colon X\to Y$ is hence measured by
$$\mathcal{E}(f) := \int_{X\times Y} |f(x)-y|^2\,d\rho.$$
The predictor that minimizes the above error is the regression function
$$f_\rho(x) := \int_Y y\,d\rho(y|x),\quad x\in X, \qquad(1.4)$$
where $\rho(y|x)$ denotes the conditional probability measure of $y$ with respect to $x$. In fact, we have for every predictor $f$ that
$$\mathcal{E}(f) = \mathcal{E}(f_\rho) + \|f - f_\rho\|_{L^2_{\rho_X}}^2, \qquad(1.5)$$
where $\rho_X$ is the marginal probability measure of $\rho$ on $X$ and, for $p\in[1,+\infty)$, $L^p_{\rho_X}$ denotes the Banach space of measurable functions $f$ on $X$ with respect to $\rho_X$ such that
$$\|f\|_{L^p_{\rho_X}} := \Bigl(\int_X |f(x)|^p\,d\rho_X(x)\Bigr)^{1/p} < +\infty.$$

The formula (1.4), though attractive, is only of theoretical value as $\rho$ is unknown. A practical way is to find a minimizer $c_{\mathbf{z},\lambda}$ of (1.1) and hope that
$$f_{\mathbf{z},\lambda}(x) := K_{\mathbf{x}}(x)\,c_{\mathbf{z},\lambda},\quad x\in X, \qquad(1.6)$$
will be competitive with $f_\rho$ in the sense that the approximation error $\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho)$ is small. To be more precise, for the learning scheme (1.1) to be useful in practice, this error should converge to zero fast in probability as the number of sampling points increases.

The approach in [5] works by introducing intermediate functions between $f_{\mathbf{z},\lambda}$ and $f_\rho$ that come from a Banach space $\mathcal{B}$ of functions on $X$ with the properties that $K(x,\cdot)\in\mathcal{B}$ for all $x\in X$ and that, for all pairwise distinct $x_j\in X$, $j\in\mathbb{N}_m$, and all $c\in\mathbb{R}^m$,
$$\psi(\|K_{\mathbf{x}}(\cdot)c\|_{\mathcal{B}}) = \phi(c)$$
for some nonnegative function $\psi$. Here $\|\cdot\|_{\mathcal{B}}$ is the norm on $\mathcal{B}$. Let $g$ be an arbitrary function from such a space $\mathcal{B}$ and set for each function $f\colon X\to\mathbb{R}$
$$\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{m}\sum_{j=1}^m (f(x_j)-y_j)^2.$$
The approximation error $\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho)$ can then be decomposed into the sum of four quantities,
$$\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) = S(\mathbf{z},\lambda,g) + P(\mathbf{z},\lambda,g) + D(\lambda,g) - \lambda\psi(\|f_{\mathbf{z},\lambda}\|_{\mathcal{B}}), \qquad(1.7)$$
where
$$S(\mathbf{z},\lambda,g) := \mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) + \mathcal{E}_{\mathbf{z}}(g) - \mathcal{E}(g),$$
$$P(\mathbf{z},\lambda,g) := \bigl(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) + \lambda\psi(\|f_{\mathbf{z},\lambda}\|_{\mathcal{B}})\bigr) - \bigl(\mathcal{E}_{\mathbf{z}}(g) + \lambda\psi(\|g\|_{\mathcal{B}})\bigr),$$
$$D(\lambda,g) := \mathcal{E}(g) - \mathcal{E}(f_\rho) + \lambda\psi(\|g\|_{\mathcal{B}}).$$
The above three quantities are called the sampling error, the hypothesis error and the regularization error, respectively. The strategy is to choose $\mathcal{B}$ and $g$ carefully so that these three errors can be well bounded from above.
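For the reader's convenience (this verification is ours, not part of the original text), the identity (1.7) can be checked by simply expanding the three quantities: every term involving $g$ and every $\lambda\psi$ term cancels,
$$
\begin{aligned}
&S(\mathbf{z},\lambda,g) + P(\mathbf{z},\lambda,g) + D(\lambda,g) - \lambda\psi(\|f_{\mathbf{z},\lambda}\|_{\mathcal{B}})\\
&\quad= \bigl[\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) + \mathcal{E}_{\mathbf{z}}(g) - \mathcal{E}(g)\bigr]
+ \bigl[\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) + \lambda\psi(\|f_{\mathbf{z},\lambda}\|_{\mathcal{B}}) - \mathcal{E}_{\mathbf{z}}(g) - \lambda\psi(\|g\|_{\mathcal{B}})\bigr]\\
&\qquad+ \bigl[\mathcal{E}(g) - \mathcal{E}(f_\rho) + \lambda\psi(\|g\|_{\mathcal{B}})\bigr]
- \lambda\psi(\|f_{\mathbf{z},\lambda}\|_{\mathcal{B}})
= \mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho).
\end{aligned}
$$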
When $\mathcal{B}$ is the reproducing kernel Hilbert space of a positive-definite reproducing kernel $K$ on $X$ and the regularizer $\phi$ is given by (1.2), we have $\psi(t) = t^2$, $t\in\mathbb{R}$, and, by the representer theorem and the definition of $f_{\mathbf{z},\lambda}$ in (1.6),
$$f_{\mathbf{z},\lambda} = \arg\min_{f\in\mathcal{B}}\ \mathcal{E}_{\mathbf{z}}(f) + \lambda\|f\|_{\mathcal{B}}^2. \qquad(1.8)$$
In this case, one immediately has that $P(\mathbf{z},\lambda,g)\le 0$ and thus, by (1.7), that
$$\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le S(\mathbf{z},\lambda,g) + D(\lambda,g). \qquad(1.9)$$

For the $\ell^1$-regularization, where $\phi(c) = \|c\|_1$, the space $\mathcal{B}$ chosen in [11, 20] does not satisfy the linear representer theorem. Consequently, the hypothesis error needed to be dealt with there. A class of RKBS with the ℓ1 norm that satisfies the linear representer theorem was recently constructed in [14]. In Section 2, we shall follow a similar idea to construct a slightly larger RKBS with the same desirable properties. By using the constructed space, we enjoy the same advantage as in the RKHS case of discarding the hypothesis error automatically. Moreover, the space also leads to a better estimate of the regularization error than that in [20]. Combining these two improvements and directly using the estimates of the sampling error established in [20] or [11], one immediately obtains a superior learning rate. As our focus is on the advantages brought by the constructed RKBS, we shall only improve the learning rate estimate of [20] in Section 3. Interested readers may follow our strategy to engage the more sophisticated sampling error estimate given in [11] to improve the learning rate therein.

2 RKBS by Borel Measures

In this section, we construct RKBS applicable to the error analysis of the $\ell^1$-regularized least square regression. The constructed spaces are expected to have the ℓ1 norm and to satisfy the linear representer theorem. The approach is different from the one by semi-inner products in [21, 22], as an infinite-dimensional $\ell^1$ space is neither reflexive nor strictly convex.

Suppose that the input space $X$ is a locally convex topological space and denote by $C_0(X)$ the space of continuous functions $f\colon X\to\mathbb{R}$ such that for all $\varepsilon > 0$ the set $\{x\in X : |f(x)| > \varepsilon\}$ is compact. We also impose the requirement that for all pairwise distinct $x_j\in X$, $j\in\mathbb{N}_m$, $m\in\mathbb{N}$, the kernel matrix $K[\mathbf{x}]$ is nonsingular. With the maximum norm
$$\|f\|_{C_0(X)} := \max_{x\in X}|f(x)|,$$
the space $C_0(X)$ is a Banach space. Its dual space is isometrically isomorphic to the space $M(X)$ of all the signed Borel measures on $X$ with bounded total variation. In other words, for each continuous linear functional $T$ on $C_0(X)$ there exists a unique measure $\mu\in M(X)$ such that
$$T(f) = \int_X f(x)\,d\mu(x) \quad\text{and}\quad \sup_{f\in C_0(X),\,f\ne 0}\frac{|Tf|}{\|f\|_{C_0(X)}} = \|\mu\|, \qquad(2.1)$$
where $\|\mu\|$ denotes the total variation of $\mu$.

Let $K$ be a real-valued function on $X\times X$ such that $K(\cdot,x)\in C_0(X)$ for all $x\in X$ and
$$\overline{\operatorname{span}}\,\{K(\cdot,x) : x\in X\} = C_0(X). \qquad(2.2)$$
With such a function, we introduce the following space
$$\mathcal{B} := \Bigl\{ f_\mu := \int_X K(t,\cdot)\,d\mu(t) \;:\; \mu\in M(X) \Bigr\} \qquad(2.3)$$
with the norm
$$\|f_\mu\|_{\mathcal{B}} := \|\mu\|. \qquad(2.4)$$
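As an illustrative example (ours, not part of the original text): taking $\mu$ to be the discrete measure $\mu = \sum_{j=1}^m c_j\delta_{x_j}$ with point masses at pairwise distinct points $x_1,\dots,x_m$ gives
$$f_\mu = \sum_{j=1}^m c_j K(x_j,\cdot) = K_{\mathbf{x}}(\cdot)c, \qquad \|f_\mu\|_{\mathcal{B}} = \|\mu\| = \sum_{j=1}^m |c_j| = \|c\|_1,$$
which anticipates the ℓ1-norm identity (2.5) established below.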
Recall that a vector space $V$ is called a pre-RKBS [14] on $X$ if it is a Banach space consisting of functions on $X$ such that point evaluation functionals are continuous on $V$ and such that, for all $f\in V$, $\|f\|_V = 0$ if and only if $f$ vanishes everywhere on $X$.

Proposition 2.1 Suppose that $K(\cdot,x)\in C_0(X)$ for all $x\in X$ and (2.2) is satisfied. Then $\mathcal{B}$ defined by (2.3) is a pre-RKBS on $X$.

Proof: We first show that the norm (2.4) is well-defined. Let $\mu,\nu$ be two measures in $M(X)$ such that $f_\mu(x) = f_\nu(x)$ for all $x\in X$. Then we get that
$$\int_X K(t,x)\,d(\mu-\nu)(t) = 0 \quad\text{for all } x\in X.$$
By the denseness condition (2.2), the above equation implies that $\mu - \nu = 0$. Thus, the measure $\mu$ associated with a function $f_\mu\in\mathcal{B}$ is unique. This proves that (2.4) is well-defined and that $\|f_\mu\|_{\mathcal{B}} = 0$ if and only if $f_\mu(x) = 0$ for all $x\in X$. Another consequence is that $\mathcal{B}$ is isometrically isomorphic to $M(X)$ and is hence a Banach space. Finally, we observe for all $x_0\in X$ and $\mu\in M(X)$ that
$$|f_\mu(x_0)| = \Bigl|\int_X K(t,x_0)\,d\mu(t)\Bigr| \le \|K(\cdot,x_0)\|_{C_0(X)}\,\|\mu\| = \|K(\cdot,x_0)\|_{C_0(X)}\,\|f_\mu\|_{\mathcal{B}}.$$
Therefore, point evaluations are continuous linear functionals on $\mathcal{B}$. We conclude that $\mathcal{B}$ is a pre-RKBS on $X$. The proof is complete. □

Let the sampling points in $\mathbf{x}$ be pairwise distinct. By definition, $K_{\mathbf{x}}(\cdot)c\in\mathcal{B}$ for all $c\in\mathbb{R}^m$. The denseness condition (2.2) implies that $K(x_j,\cdot)$, $j\in\mathbb{N}_m$, are linearly independent. As a result,
$$\|K_{\mathbf{x}}(\cdot)c\|_{\mathcal{B}} = \|c\|_1. \qquad(2.5)$$
It is in the above sense that $\mathcal{B}$ is said to possess the ℓ1 norm.

We next turn to the crucial linear representer theorem in $\mathcal{B}$. We say that $\mathcal{B}$ satisfies the linear representer theorem if for every continuous nonnegative loss function $Q$ and every regularizer $\psi$ with $\lim_{t\to\infty}\psi(t) = +\infty$, the regularized learning scheme
$$\inf_{f\in\mathcal{B}}\ Q(f(\mathbf{x})) + \lambda\psi(\|f\|_{\mathcal{B}})$$
has a minimizer $f_0$ of the form $f_0 = K_{\mathbf{x}}(\cdot)c$ for some $c\in\mathbb{R}^m$. Here, $f(\mathbf{x}) = (f(x_j) : j\in\mathbb{N}_m)^T$. The following lemma can be proved by arguments similar to those in [14].

Lemma 2.2 The space $\mathcal{B}$ satisfies the linear representer theorem if and only if for all $\mathbf{x}$ of pairwise distinct sampling points and all $\mathbf{y}\in\mathbb{R}^m$, the minimal norm interpolation
$$\inf\{\|f\|_{\mathcal{B}} : f\in\mathcal{B},\ f(\mathbf{x}) = \mathbf{y}\} \qquad(2.6)$$
has a minimizer $f_0$ of the form $f_0 = K_{\mathbf{x}}(\cdot)c$ for some $c\in\mathbb{R}^m$.

A subspace of $\mathcal{B}$ was constructed in [14] and conditions for it to satisfy the linear representer theorem were studied there. In order to make use of the results obtained there, we first introduce the subspace. Denote by $\ell^1(X)$ the subset of $M(X)$ of those Borel measures that are supported on a countable subset of $X$. Thus, for each $\nu\in\ell^1(X)$ there exist pairwise distinct points $x_j\in X$, $j\in I$, where $I$ is a countable index set, such that
$$\nu(A) = \sum_{x_j\in A}\nu(x_j) \quad\text{for every Borel subset } A\subseteq X.$$
Denote by $\operatorname{supp}\nu$ the countable set of points where $\nu$ is nonzero. The space $\mathcal{B}_1$ considered in [14] is
$$\mathcal{B}_1 := \Bigl\{ \sum_{x\in\operatorname{supp}\nu}\nu(x)\,K(x,\cdot) \;:\; \nu\in\ell^1(X) \Bigr\}$$
with the norm inherited from that of $\mathcal{B}$. Put for all $x\in X$
$$K^{\mathbf{x}}(x) := (K(x,x_j) : j\in\mathbb{N}_m)^T,$$
which is an $m\times 1$ vector in $\mathbb{R}^m$. One should not confuse $K^{\mathbf{x}}(x)$ with $K_{\mathbf{x}}(x)$: the latter is $1\times m$ and might not even be the transpose of the former, as $K$ is not required to be symmetric. The following result about $\mathcal{B}_1$ is from [14].

Lemma 2.3 For all $\mathbf{y}\in\mathbb{R}^m$, the minimal norm interpolation
$$\inf\{\|f\|_{\mathcal{B}_1} : f\in\mathcal{B}_1,\ f(\mathbf{x}) = \mathbf{y}\} \qquad(2.7)$$
has a minimizer $f_0$ of the form $f_0 = K_{\mathbf{x}}(\cdot)c$ for some $c\in\mathbb{R}^m$ if and only if
$$\|K[\mathbf{x}]^{-1}K^{\mathbf{x}}(x)\|_1 \le 1 \quad\text{for all } x\in X. \qquad(2.8)$$
Moreover, under condition (2.8), there holds for all $c\in\mathbb{R}^m$ that
$$\|c^T K^{\mathbf{x}}(\cdot)\|_{C_0(X)} = \|c^T K[\mathbf{x}]\|_\infty, \qquad(2.9)$$
where $\|\cdot\|_\infty$ is the maximum norm on $\mathbb{R}^m$.
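Condition (2.8) can also be probed numerically. The following sketch (ours; the sampling points and the grid are illustrative assumptions) estimates $\max_x \|K[\mathbf{x}]^{-1}K^{\mathbf{x}}(x)\|_1$ on a fine grid for the exponential kernel $K(s,t) = e^{-|s-t|}$, which is recalled after Theorem 2.4 below as a kernel proved in [14] to satisfy (2.8).

```python
import numpy as np

def exponential_kernel(s, t):
    return np.exp(-np.abs(s - t))

# pairwise distinct sampling points x_1, ..., x_m (illustrative choice)
x = np.array([0.3, 1.1, 1.7, 2.6, 3.2])
m = len(x)

# (K[x])_{j,k} = K(x_k, x_j); for this symmetric kernel the ordering does not matter
K_x = exponential_kernel(x[None, :], x[:, None])

# columns of K^x(t) = (K(t, x_j) : j)^T evaluated on a fine grid of t
grid = np.linspace(-1.0, 5.0, 4001)
cols = exponential_kernel(grid[None, :], x[:, None])   # shape (m, number of grid points)
coeff = np.linalg.solve(K_x, cols)                     # K[x]^{-1} K^x(t), column by column
max_l1 = np.abs(coeff).sum(axis=0).max()
print("max_t ||K[x]^{-1} K^x(t)||_1 ~", round(float(max_l1), 6))  # should not exceed 1 (cf. [14])
```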
We are ready to present the main result of this section.

Theorem 2.4 The space $\mathcal{B}$ satisfies the linear representer theorem if and only if (2.8) holds true.

Proof: Suppose that (2.8) holds true. By Lemma 2.2, to show that $\mathcal{B}$ satisfies the linear representer theorem it suffices to show that $f_0 = K_{\mathbf{x}}(\cdot)K[\mathbf{x}]^{-1}\mathbf{y}$ is a minimizer of (2.6). Clearly, $f_0(\mathbf{x}) = \mathbf{y}$. Let $f_\mu$, $\mu\in M(X)$, be an arbitrary function in $\mathcal{B}$ that satisfies the interpolation condition $f_\mu(\mathbf{x}) = \mathbf{y}$. We then have for all $c\in\mathbb{R}^m$ that
$$\int_X c^T K^{\mathbf{x}}(t)\,d\mu(t) = \int_X \sum_{j=1}^m c_j K(t,x_j)\,d\mu(t) = \sum_{j=1}^m c_j f_\mu(x_j) = c^T\mathbf{y}.$$
It follows from (2.1) that for all $c\in\mathbb{R}^m$
$$|c^T\mathbf{y}| \le \|c^T K^{\mathbf{x}}(\cdot)\|_{C_0(X)}\,\|\mu\|.$$
This together with (2.9) implies that
$$\|\mu\| \ge \sup_{c\in\mathbb{R}^m,\,c\ne 0}\frac{|c^T\mathbf{y}|}{\|c^T K^{\mathbf{x}}(\cdot)\|_{C_0(X)}} = \sup_{c\in\mathbb{R}^m,\,c\ne 0}\frac{|c^T\mathbf{y}|}{\|c^T K[\mathbf{x}]\|_\infty} = \sup_{a\in\mathbb{R}^m,\,a\ne 0}\frac{|a^T K[\mathbf{x}]^{-1}\mathbf{y}|}{\|a\|_\infty} = \|K[\mathbf{x}]^{-1}\mathbf{y}\|_1.$$
Now, recall by (2.5) that $\|f_0\|_{\mathcal{B}} = \|K[\mathbf{x}]^{-1}\mathbf{y}\|_1$ and by the definition of $\|\cdot\|_{\mathcal{B}}$ that $\|f_\mu\|_{\mathcal{B}} = \|\mu\|$. These two facts combined with the above inequality imply that $\|f_\mu\|_{\mathcal{B}} \ge \|f_0\|_{\mathcal{B}}$. Thus, $f_0$ is indeed a minimizer of (2.6).

On the other hand, suppose that $\mathcal{B}$ satisfies the linear representer theorem and we want to prove (2.8). Let $\mathbf{y}\in\mathbb{R}^m$. By Lemma 2.2, the minimal norm interpolation (2.6) has a minimizer $f_0$ of the form $f_0 = K_{\mathbf{x}}(\cdot)c$ for some $c\in\mathbb{R}^m$. Clearly, $f_0$ is also a minimizer of (2.7), because $f_0\in\mathcal{B}_1$ and
$$\|f_0\|_{\mathcal{B}_1} = \|f_0\|_{\mathcal{B}} = \inf\{\|f\|_{\mathcal{B}} : f\in\mathcal{B},\ f(\mathbf{x}) = \mathbf{y}\} \le \inf\{\|f\|_{\mathcal{B}_1} : f\in\mathcal{B}_1,\ f(\mathbf{x}) = \mathbf{y}\}.$$
By Lemma 2.3, (2.8) holds true. The proof is complete. □

It will become clear in the next section that the above theorem makes $\mathcal{B}$ a useful space for the error analysis of the $\ell^1$-regularized least square regression. We present two examples of $K$ that satisfy all the assumptions of this section, especially (2.8):

– the exponential kernel $K(s,t) := e^{-|s-t|}$, $s,t\in\mathbb{R}$;
– the Brownian bridge kernel $K(s,t) := \min\{s,t\} - st$, $s,t\in(0,1)$.

That these two kernels satisfy (2.8) has been proved in [14]. It remains to verify the denseness requirement (2.2). The exponential kernel is a particular case of the following result.

Proposition 2.5 If $\phi$ is Lebesgue integrable on $\mathbb{R}^d$ and nonzero almost everywhere, then the function
$$K(s,t) := \int_{\mathbb{R}^d} e^{-i(s-t)\cdot\xi}\,\phi(\xi)\,d\xi,\quad s,t\in\mathbb{R}^d, \qquad(2.10)$$
satisfies $K(\cdot,t)\in C_0(\mathbb{R}^d)$ for all $t\in\mathbb{R}^d$ and the denseness condition (2.2). So does $K(s,t) := \psi(s-t)$, $s,t\in\mathbb{R}^d$, where $\psi$ is a nontrivial continuous function on $\mathbb{R}^d$ of compact support.

Proof: That the function given by (2.10) belongs to $C_0(\mathbb{R}^d)$ for all $t\in\mathbb{R}^d$ follows from the Riemann–Lebesgue lemma. The denseness condition (2.2) for the two kernels can be proved by arguments similar to those in [14]. □

The Brownian bridge kernel is handled in a manner different from that in [14].

Proposition 2.6 The Brownian bridge kernel satisfies (2.2).

Proof: Clearly, for the Brownian bridge kernel, $K(\cdot,t)$ is continuous for all $t\in(0,1)$. Let $\nu$ be a Borel measure on $X := (0,1)$ such that
$$\int_X K(s,t)\,d\nu(s) = 0 \quad\text{for all } t\in(0,1). \qquad(2.11)$$
Note that $K$ has the representation
$$K(s,t) = \int_X \Gamma_s(z)\,\Gamma_t(z)\,dz,\quad s,t\in(0,1),$$
where $\Gamma_s := \chi_{(0,s)} - s$, with $\chi_{(0,s)}$ denoting the characteristic function of $(0,s)$. Arguments similar to those in [14] yield that there exists a constant $C$ such that
$$\int_0^s d\nu(t) = C \quad\text{for all } s\in(0,1).$$
It follows that $\nu((s_1,s_2)) = 0$ for all $0 < s_1 < s_2 < 1$.
Consequently, $\nu$ is the zero Borel measure on $(0,1)$. Thus, the Brownian bridge kernel satisfies (2.2). □

Finally, we remark that the function $K$ can be regarded as the reproducing kernel for the space $\mathcal{B}$ constructed by (2.3). To see this, we introduce a bilinear form on $\mathcal{B}\times C_0(X)$ by setting
$$\langle f_\mu, g\rangle := \int_X g(x)\,d\mu(x) \quad\text{for all } \mu\in M(X) \text{ and } g\in C_0(X).$$
We observe by (2.1) that
$$|\langle f_\mu, g\rangle| \le \|\mu\|\,\|g\|_{C_0(X)} = \|f_\mu\|_{\mathcal{B}}\,\|g\|_{C_0(X)}$$
and that for all $x\in X$,
$$f_\mu(x) = \langle f_\mu, K(\cdot,x)\rangle,\qquad g(x) = \langle K(x,\cdot), g\rangle.$$
In the above senses, $K$ is said to be the reproducing kernel for both $\mathcal{B}$ and $C_0(X)$.

3 Error Analysis of the ℓ1-Regularization

We apply the constructed space $\mathcal{B}$ to estimate the learning rate of the $\ell^1$-regularized least square regression (1.3) in this section. To this end, we first introduce some standard assumptions from the literature imposed on the regression function $f_\rho$, the input space $X$ and the function $K$.

Let $X$ be a compact metric space with distance $d$ and assume that $\rho_X$ is a Borel probability measure on $X$. In this note, we suppose that $K$ is a positive-definite reproducing kernel on $X$ satisfying the Lipschitz condition
$$|K(x,t) - K(x,t')| \le C_\alpha\,(d(t,t'))^\alpha \quad\text{for some positive constants } \alpha, C_\alpha \text{ and all } x,t,t'\in X. \qquad(3.1)$$
Denote for all $r > 0$ by $\mathcal{N}(X,r)$ the least number of open balls with radius $r$ that cover $X$. Assume that this covering number satisfies, for some positive constants $\eta$, $C_\eta$,
$$\mathcal{N}(X,r) \le C_\eta\, r^{-\eta} \quad\text{for all } 0 < r \le 1. \qquad(3.2)$$
The requirement on $f_\rho$ is that it is contained in the range $\operatorname{ran}(L_K^s)$ of $L_K^s$ for some $s > 0$. Here, $L_K$ is the compact positive operator on $L^2_{\rho_X}$ defined by
$$L_K f := \int_X K(t,\cdot)\,f(t)\,d\rho_X(t),\quad f\in L^2_{\rho_X}.$$
Let $\phi_j$, $j\in\mathbb{N}$, be an orthonormal basis for $L^2_{\rho_X}$ consisting of eigenfunctions of $L_K$ with the corresponding eigenvalues $\lambda_j \ge \lambda_{j+1}$, $j\in\mathbb{N}$. The assumption $f_\rho\in\operatorname{ran}(L_K^s)$ implies that
$$f_\rho = \sum_{j=1}^\infty \lambda_j^s a_j\phi_j$$
for some $h = \sum_{j=1}^\infty a_j\phi_j$ in $L^2_{\rho_X}$.
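To give a concrete instance of this source condition (an illustrative example of ours, not from the paper): if the eigenvalues behave like $\lambda_j = j^{-2}$ and one takes $a_j = j^{-1}$, then $h = \sum_{j\ge 1} j^{-1}\phi_j \in L^2_{\rho_X}$ and
$$f_\rho = \sum_{j\ge 1}\lambda_j^s a_j\phi_j = \sum_{j\ge 1} j^{-2s-1}\phi_j$$
lies in $\operatorname{ran}(L_K^s)$. A larger $s$ forces the coefficients of $f_\rho$ to decay faster, which is the sense in which $s$ measures the regularity of $f_\rho$.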
In order to make use of the space constructed in the last section, our last requirement is that $K$ satisfies
$$\overline{\operatorname{span}}\,\{K(\cdot,x) : x\in X\} = C(X)$$
and condition (2.8).

Let $c_{\mathbf{z},\lambda}$ be a minimizer of (1.3) and let $f_{\mathbf{z},\lambda}$ be given by (1.6). For the minimization problem (1.3), the hypothesis error and the regularization error have the specific forms
$$P(\mathbf{z},\lambda,g) := \bigl(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) + \lambda\|f_{\mathbf{z},\lambda}\|_{\mathcal{B}}\bigr) - \bigl(\mathcal{E}_{\mathbf{z}}(g) + \lambda\|g\|_{\mathcal{B}}\bigr),$$
$$D(\lambda,g) := \mathcal{E}(g) - \mathcal{E}(f_\rho) + \lambda\|g\|_{\mathcal{B}},$$
where $g$ is a function in $\mathcal{B}$ to be carefully chosen. The use of the space $\mathcal{B}$ enables us to discard the hypothesis error immediately.

Lemma 3.1 Under the above assumptions on $K$, there holds
$$\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le S(\mathbf{z},\lambda,g) + D(\lambda,g) \quad\text{for all } g\in\mathcal{B}.$$

Proof: By Theorem 2.4,
$$f_{\mathbf{z},\lambda} = \arg\min_{f\in\mathcal{B}}\ \mathcal{E}_{\mathbf{z}}(f) + \lambda\|f\|_{\mathcal{B}}.$$
As a consequence, $P(\mathbf{z},\lambda,g)\le 0$, which together with the identity (1.7) completes the proof. □

We next estimate the regularization error.

Lemma 3.2 If $0 < s < 1$ then
$$\inf_{g\in\mathcal{B}} D(\lambda,g) \le \bigl(\|h\|_{L^2_{\rho_X}} + \|h\|_{L^2_{\rho_X}}^2\bigr)\,\lambda^{\frac{2s}{1+s}}. \qquad(3.3)$$
If $s\ge 1$ then $f_\rho\in\mathcal{B}$ and
$$D(\lambda,f_\rho) \le \bigl(\lambda_1^{s-1}\|h\|_{L^2_{\rho_X}}\bigr)\,\lambda. \qquad(3.4)$$

Proof: Firstly, we have for each $\varphi\in L^2_{\rho_X}$ that $L_K\varphi\in\mathcal{B}$ and, by the Cauchy–Schwarz inequality, that
$$\|L_K\varphi\|_{\mathcal{B}} = \|\varphi\|_{L^1_{\rho_X}} \le \|\varphi\|_{L^2_{\rho_X}}. \qquad(3.5)$$
If $s\ge 1$ then $f_\rho = L_K\varphi$, where
$$\varphi = \sum_{j=1}^\infty \lambda_j^{s-1}a_j\phi_j.$$
As $\lambda_j$ is non-increasing,
$$\|\varphi\|_{L^2_{\rho_X}} \le \lambda_1^{s-1}\Bigl(\sum_{j=1}^\infty |a_j|^2\Bigr)^{1/2} = \lambda_1^{s-1}\|h\|_{L^2_{\rho_X}}.$$
We then get by the above equation and (3.5) that
$$D(\lambda,f_\rho) = \lambda\|f_\rho\|_{\mathcal{B}} \le \lambda\|\varphi\|_{L^2_{\rho_X}} \le \lambda\,\lambda_1^{s-1}\|h\|_{L^2_{\rho_X}},$$
which is (3.4).

Suppose now that $0 < s < 1$. If $\lambda_1 \le \lambda^{\frac{1}{1+s}}$ then by (1.5),
$$D(\lambda,0) = \mathcal{E}(0) - \mathcal{E}(f_\rho) = \|f_\rho\|_{L^2_{\rho_X}}^2 = \sum_{j=1}^\infty \lambda_j^{2s}a_j^2 \le \lambda^{\frac{2s}{1+s}}\|h\|_{L^2_{\rho_X}}^2,$$
which implies (3.3). If $\lambda_1 > \lambda^{\frac{1}{1+s}}$ then, since $\lambda_j$ decreases to zero as $j$ tends to infinity, there exists some $N\in\mathbb{N}$ such that $\lambda_{N+1} < \lambda^{\frac{1}{1+s}} \le \lambda_N$. Put
$$\varphi := \sum_{j=1}^N \lambda_j^{s-1}a_j\phi_j.$$
It follows from (1.5) and (3.5) that
$$D(\lambda, L_K\varphi) \le \|L_K\varphi - f_\rho\|_{L^2_{\rho_X}}^2 + \lambda\|\varphi\|_{L^2_{\rho_X}}.$$
We estimate that
$$\lambda\|\varphi\|_{L^2_{\rho_X}} = \lambda\Bigl(\sum_{j=1}^N a_j^2\lambda_j^{2s-2}\Bigr)^{1/2} \le \lambda\,\lambda^{\frac{s-1}{1+s}}\Bigl(\sum_{j=1}^N a_j^2\Bigr)^{1/2} \le \lambda^{\frac{2s}{1+s}}\|h\|_{L^2_{\rho_X}}$$
and that
$$\|L_K\varphi - f_\rho\|_{L^2_{\rho_X}}^2 = \sum_{j=N+1}^\infty \lambda_j^{2s}a_j^2 \le \lambda^{\frac{2s}{1+s}}\sum_{j=N+1}^\infty a_j^2 \le \lambda^{\frac{2s}{1+s}}\|h\|_{L^2_{\rho_X}}^2.$$
Combining the above two inequalities leads to (3.3). The proof is complete. □

We remark that the estimated regularization error in [20] was of the order $O(\lambda^{\frac{2s}{2+s}})$ for $0 < s \le 2$.

Turning to the sampling error, we follow the approach in [20] to decompose it into the sum
$$S(\mathbf{z},\lambda,g) = S_1(\mathbf{z},\lambda,g) + S_2(\mathbf{z},\lambda),$$
where
$$S_1(\mathbf{z},\lambda,g) = \bigl(\mathcal{E}_{\mathbf{z}}(g) - \mathcal{E}_{\mathbf{z}}(f_\rho)\bigr) - \bigl(\mathcal{E}(g) - \mathcal{E}(f_\rho)\bigr),$$
$$S_2(\mathbf{z},\lambda) = \bigl(\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho)\bigr) - \bigl(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\lambda}) - \mathcal{E}_{\mathbf{z}}(f_\rho)\bigr).$$
The first summand $S_1(\mathbf{z},\lambda,g)$ can be bounded by using the law of large numbers. By the same arguments as those in [5, 20], we use the estimate in Lemma 3.2 to obtain an improved bound.

Lemma 3.3 Suppose that the output of the sample data is bounded by a positive constant almost surely. If $0 < s < 1$ then for each $\varepsilon > 0$ there exists some $g\in\mathcal{B}$ such that for all $0 < \delta < 1$, we have with confidence $1 - \frac{\delta}{2}$ that
$$S_1(\mathbf{z},\lambda,g) \le C_1\Bigl(\frac{\lambda^{\frac{2(s-1)}{1+s}}}{m} + \frac{\lambda^{\frac{2s-1}{1+s}}}{\sqrt{m}}\Bigr)\log\frac{2}{\delta}$$
for some positive constant $C_1$. If $s\ge 1$ then $S_1(\mathbf{z},\lambda,f_\rho) = 0$.

For $S_2(\mathbf{z},\lambda)$, we cite the following result from [20].

Lemma 3.4 Suppose that (3.1) and (3.2) hold true. If $\lambda\le 1$ then we have with confidence $1 - \frac{\delta}{2}$ that
$$S_2(\mathbf{z},\lambda) \le \frac12\bigl(\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho)\bigr) + C_2\Bigl(\log\frac{2}{\delta} + \log(1+m)\Bigr)\lambda^{-2}\, m^{-\frac{1}{1+\eta/\alpha}}$$
for some positive constant $C_2$.
Combining Lemmas 3.1, 3.2, 3.3 and 3.4, we reach a new learning rate estimate for the $\ell^1$-regularized least square regression.

Theorem 3.5 Suppose that $X$ satisfies (3.2), the output is bounded by a positive constant almost surely, and $f_\rho\in\operatorname{ran}(L_K^s)$ for some $s > 0$. Let $K$ be a positive-definite reproducing kernel satisfying $\overline{\operatorname{span}}\,\{K(\cdot,x) : x\in X\} = C(X)$, the condition (2.8) and the Lipschitz condition (3.1). Then there exists some constant $C > 0$ such that with the choice
$$\lambda = m^{-\frac12\cdot\frac{1}{1+\eta/\alpha}\cdot\frac{1+s}{1+2s}},$$
we have for all $0 < \delta < 1$, with confidence $1-\delta$, that
$$\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le C\, m^{-\frac{s}{1+2s}\cdot\frac{1}{1+\eta/\alpha}}\log\frac{2+2m}{\delta} \quad\text{when } 0 < s < 1, \qquad(3.6)$$
and
$$\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le C\, m^{-\frac{1}{3(1+\eta/\alpha)}}\log\frac{1+m}{\delta} \quad\text{when } s\ge 1.$$

Proof: We only discuss the case when $0 < s < 1$, as the other situation is easier and can be shown in a similar way. We choose $\lambda = m^{-\theta}$, $\theta > 0$, and get by Lemmas 3.1, 3.2, 3.3 and 3.4 that there exists some constant $C > 0$ such that with confidence $1-\delta$,
$$\mathcal{E}(f_{\mathbf{z},\lambda}) - \mathcal{E}(f_\rho) \le C\, m^{-\gamma}\log\frac{2+2m}{\delta}, \qquad(3.7)$$
where
$$\gamma = \min\Bigl\{\frac{1}{1+\eta/\alpha} - 2\theta,\ \ 1 - \frac{2\theta(1-s)}{1+s},\ \ \frac12 - \frac{1-2s}{1+s}\,\theta,\ \ \frac{2\theta s}{1+s}\Bigr\}.$$
The maximum of $\gamma$ is achieved when
$$\theta = \frac12\cdot\frac{1}{1+\eta/\alpha}\cdot\frac{1+s}{1+2s}.$$
Substituting the above choice into (3.7) yields (3.6). □
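The balancing of the four exponents in the proof is easy to reproduce numerically. The short sketch below (ours, with illustrative parameter values) evaluates $\gamma$ at the stated choice of $\theta$ and checks that it matches the exponent $\frac{s}{1+2s}\cdot\frac{1}{1+\eta/\alpha}$ appearing in (3.6).

```python
def gamma_at(theta, s, eta_over_alpha):
    """The exponent gamma from (3.7) for the choice lambda = m^{-theta}, 0 < s < 1."""
    q = 1.0 / (1.0 + eta_over_alpha)
    return min(q - 2.0 * theta,
               1.0 - 2.0 * theta * (1.0 - s) / (1.0 + s),
               0.5 - (1.0 - 2.0 * s) / (1.0 + s) * theta,
               2.0 * theta * s / (1.0 + s))

# the choice of theta from Theorem 3.5, for illustrative values s = 1/2 and eta/alpha = 1
s, eoa = 0.5, 1.0
theta_star = 0.5 * (1.0 / (1.0 + eoa)) * (1.0 + s) / (1.0 + 2.0 * s)
rate = s / (1.0 + 2.0 * s) * (1.0 / (1.0 + eoa))
print(gamma_at(theta_star, s, eoa), rate)   # both print 0.125
```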
Improvements of the learning rate can be achieved if higher regularity is imposed on the kernel $K$ [24] or if better estimates of the sampling error are engaged [11]. Another remark is that the assumption of positive-definiteness and symmetry on $K$ might be abandoned by using the strategy in [20].

References

[1] S. Alliney. A property of the minimum vectors of a regularizing functional defined by means of the absolute norm. IEEE Transactions on Signal Processing, 45:913–917, 1997.
[2] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[3] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.
[4] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.), 39(1):1–49 (electronic), 2002.
[5] F. Cucker and D.-X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2007. With a foreword by Stephen Smale.
[6] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Adv. Comput. Math., 13(1):1–50, 2000.
[7] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33:82–95, 1971.
[8] M. Nikolova. A variational approach to remove outliers and impulse noise. J. Math. Imaging Vision, 20(1-2):99–120, 2004. Special issue on mathematics and image analysis.
[9] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press, Cambridge, December 2001.
[10] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, 2004.
[11] L. Shi, Y.-L. Feng, and D.-X. Zhou. Concentration estimates for learning with ℓ1-regularizer and data dependent hypothesis spaces. Appl. Comput. Harmon. Anal., to appear.
[12] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constr. Approx., 26(2):153–172, 2007.
[13] G. Song and Y. Xu. Approximation of high-dimensional kernel matrices by multilevel circulant matrices. J. Complexity, 26(4):375–405, 2010.
[14] G. Song, H. Zhang, and F. J. Hickernell. Reproducing kernel Banach spaces with the ℓ1 norm. Preprint, arXiv:1101.4388v1, 2011.
[15] H. Sun and Q. Wu. Regularized least square regression with dependent samples. Adv. Comput. Math., 32(2):175–189, 2010.
[16] H. Sun and Q. Wu. Coefficient regularization in least square kernel regression. Preprint, 2011.
[17] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[18] H. Tong, D.-R. Chen, and F. Yang. Least square regression with ℓp-coefficient regularization. Neural Comput., 22:3221–3235, 2010.
[19] V. N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, New York, 1998. A Wiley-Interscience Publication.
[20] Q.-W. Xiao and D.-X. Zhou. Learning by nonsymmetric kernels with data dependent spaces and ℓ1-regularizer. Taiwanese J. Math., 14(5):1821–1836, 2010.
[21] H. Zhang, Y. Xu, and J. Zhang. Reproducing kernel Banach spaces for machine learning. J. Mach. Learn. Res., 10:2741–2775, 2009.
[22] H. Zhang and J. Zhang. Frames, Riesz bases, and sampling expansions in Banach spaces via semi-inner products. Appl. Comput. Harmon. Anal., to appear.
[23] T. Zhang. Leave-one-out bounds for kernel methods. Neural Comput., 15:1397–1437, 2003.
[24] D.-X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory, 49(7):1743–1752, 2003.