Reproducing Kernel Banach Spaces with the ℓ1 Norm

Targeting at sparse learning, we construct Banach spaces B of functions on an input space X with the following properties: (1) B possesses an ℓ1 norm in the sense that it is isometrically isomorphic to the Banach space of integrable functions on X with respect to the counting measure; (2) point evaluations are continuous linear functionals on B and are representable through a bilinear form with a kernel function; and (3) regularized learning schemes on B satisfy the linear representer theorem. Examples of kernel functions admissible for the construction of such spaces are given.

Authors: Guohui Song, Haizhang Zhang, Fred J. Hickernell

Keywords: reproducing kernel Banach spaces, sparse learning, lasso, basis pursuit, regularization, the representer theorem, the Brownian bridge kernel, the exponential kernel.

1 Introduction

It is now widely known that minimizing a loss function regularized by the ℓ1 norm yields sparsity in the resulting minimizer. The sparsity is essential for extracting relatively low dimensional features from sample data that usually live in a high dimensional space. When the square loss function is used in regression, the method is known as the lasso in statistics [26]. Recently, the methodology has been applied to compressive sensing, where it is referred to as basis pursuit [4, 5]. The purpose of this paper is to establish an appropriate foundation for developing ℓ1 regularization for machine learning with reproducing kernels.

Past research on learning with kernels [6, 7, 9, 22, 23, 24, 27] has mainly been built upon the theory of reproducing kernel Hilbert spaces (RKHS) [2]. There are many reasons that account for the success of this choice. RKHS are by definition the Hilbert spaces of functions on which point evaluations are continuous linear functionals. Sample data available for learning are usually modeled by point evaluations of the unknown target function. Therefore, RKHS form a class of function spaces where sampling is stable, a desirable feature in applications. By the Riesz representation theorem, continuous linear functionals on a Hilbert space are representable by the inner product on the space. This gives rise to the representation of point evaluation functionals on an RKHS by its associated reproducing kernel and leads to the celebrated representer theorem [14] in machine learning.
This theorem states that the original minimization problem in a typically infinite dimensional RKHS can be converted into the problem of determining finitely many coefficients in a linear combination of the kernel function with one argument evaluated at the data sites. For this representer theorem, the nonzero coefficients to be found are generally as many as the sampling points. For the sake of economy, it is hence desirable to regularize the class of candidate functions by some ℓ1 norm to force most of the coefficients to be zero. An attempt in this direction is the linear programming approach to coefficient-based regularization for machine learning [22]. The method lacks a general mathematical foundation like that of the RKHS, though. In particular, it is unknown whether the algorithm results by some representer theorem from a minimization on an infinite dimensional Banach space. A consequence is that the hypothesis error in the learning rate estimate will not go away automatically as in the RKHS case [30].

We aim at combining reproducing kernel methods and the ℓ1 regularization technique. Specifically, we desire to construct function spaces with the following properties:

- point evaluation functionals on the space are continuous and can be represented by some kernel function;
- the space possesses an ℓ1 norm;
- a linear representer theorem holds for regularized learning schemes on the space.

There are three ways of representing continuous point evaluation functionals in a function space: by an inner product, by a semi-inner product [11, 15], or by a bilinear form on the tensor product of the space and its dual space. Since the space we construct is expected to have an ℓ1 norm, it cannot have an inner product. Semi-inner products are a natural substitute for inner products in Banach spaces. A notion of reproducing kernel Banach spaces (RKBS) was established in [31, 32] via the semi-inner product. The spaces considered there are uniformly convex and uniformly Fréchet differentiable, to ensure that continuous linear functionals have a unique representation by the semi-inner product. An infinite dimensional Banach space with the ℓ1 norm is non-reflexive. As a consequence, there is no guarantee [13] that the semi-inner product is able to represent all continuous point evaluation functionals in such a space. For these reasons, we shall pursue the third approach in this study, that is, to represent the point evaluation functionals by a bilinear form.

We briefly introduce the construction and main results of the paper below. Let X be a prescribed set that we call the input space. The construction starts directly with a complex-valued function K on X × X, which is not necessarily Hermitian. For the constructed space to have the three desirable properties described above, K needs to be an admissible kernel. To introduce this class of functions, crucial to our construction, we denote for any set Ω by ℓ1(Ω) the Banach space of functions on Ω that are integrable with respect to the counting measure on Ω. In other words,

$$\ell^1(\Omega) := \Big\{ c = (c_t \in \mathbb{C} : t \in \Omega) : \|c\|_{\ell^1(\Omega)} := \sum_{t \in \Omega} |c_t| < +\infty \Big\}.$$

Note that Ω might be uncountable, but for every c ∈ ℓ1(Ω), supp c := {t ∈ Ω : c_t ≠ 0} must be countable. Finally, we define the set N_n := {1, 2, ..., n} for all n ∈ N.
Definition 1.1. A function K on X × X is called an admissible kernel for the construction of RKBS on X with the ℓ1 norm if the following requirements are satisfied:

(A1) for all sequences x = {x_j : j ∈ N_n} ⊆ X of pairwise distinct sampling points, the matrix
$$K[x] := [K(x_k, x_j) : j, k \in \mathbb{N}_n] \in \mathbb{C}^{n \times n} \tag{1.1}$$
is nonsingular;

(A2) K is bounded, namely, |K(s,t)| ≤ M for some positive constant M and all s, t ∈ X;

(A3) for all pairwise distinct x_j ∈ X, j ∈ N, and c ∈ ℓ1(N), if $\sum_{j=1}^\infty c_j K(x_j, x) = 0$ for all x ∈ X then c = 0; and

(A4) for all pairwise distinct x_1, x_2, ..., x_{n+1} ∈ X,
$$\big\| (K[x])^{-1} K_x(x_{n+1}) \big\|_{\ell^1(\mathbb{N}_n)} \le 1, \tag{1.2}$$
where $K_x(x) = (K(x, x_j) : j \in \mathbb{N}_n)^T \in \mathbb{C}^n$.

The following theorem will be proved in the next three sections.

Theorem 1.2. If K is an admissible kernel on X × X then
$$\mathcal{B} := \Big\{ \sum_{t \in \operatorname{supp} c} c_t K(t, \cdot) : c \in \ell^1(X) \Big\}$$
with the norm
$$\Big\| \sum_{t \in \operatorname{supp} c} c_t K(t, \cdot) \Big\|_{\mathcal{B}} := \|c\|_{\ell^1(X)} \tag{1.3}$$
and B♯, the completion of the vector space of functions $\sum_{j=1}^n c_j K(\cdot, x_j)$, x_j ∈ X, under the supremum norm
$$\Big\| \sum_{j=1}^n c_j K(\cdot, x_j) \Big\|_{\mathcal{B}^\sharp} := \sup \Big\{ \Big| \sum_{j=1}^n c_j K(x, x_j) \Big| : x \in X \Big\},$$
are both Banach spaces of functions on X on which point evaluations are continuous linear functionals. In addition, the bilinear form
$$\Big\langle \sum_{j=1}^n a_j K(s_j, \cdot),\ \sum_{k=1}^m b_k K(\cdot, t_k) \Big\rangle_K := \sum_{j=1}^n \sum_{k=1}^m a_j b_k K(s_j, t_k), \quad s_j, t_k \in X, \tag{1.4}$$
can be extended to B × B♯ such that |⟨f, g⟩_K| ≤ ‖f‖_B ‖g‖_{B♯} for all f ∈ B, g ∈ B♯ and
$$\langle f, K(\cdot, x) \rangle_K = f(x), \quad \langle K(x, \cdot), g \rangle_K = g(x) \quad \text{for all } x \in X,\ f \in \mathcal{B},\ g \in \mathcal{B}^\sharp.$$
Furthermore, every regularized learning scheme of the form
$$\inf_{f \in \mathcal{B}}\ V(f(x_1), f(x_2), \dots, f(x_n)) + \mu \phi(\|f\|_{\mathcal{B}}),$$
where μ is a positive regularization parameter and V and φ are nonnegative continuous functions with $\lim_{t\to\infty} \phi(t) = +\infty$, has a minimizer f_0 of the form
$$f_0(x) = \sum_{j=1}^n c_j K(x_j, x), \quad x \in X,$$
for some coefficients c_j ∈ C, j ∈ N_n. Conversely, for the constructed spaces B and B♯ to enjoy those desirable properties, K must be an admissible kernel on X × X.

The organization of the paper is as follows. We first present a general construction of Banach spaces of functions with a reproducing kernel in the next section. In Section 3, we specialize the construction to the building of RKBS with the ℓ1 norm as described in Theorem 1.2. In Section 4, we study conditions on the reproducing kernel so that regularized learning schemes on the constructed spaces satisfy the linear representer theorem. In Section 5, we show that the Brownian bridge kernel and the exponential kernel are admissible kernels. In the final section, condition (A4), the most stringent requirement in Definition 1.1, is relaxed, which leads to a modified version of Theorem 1.2.
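To make (A4) concrete: given a kernel, a set of sampling points, and a candidate new point, the quantity in (1.2) can be evaluated directly. The minimal numerical sketch below is ours, not part of the paper; the function name a4_ratio and the test points are our own choices for illustration. It uses the exponential kernel, which Section 5 proves admissible.

```python
import numpy as np

def a4_ratio(kernel, xs, t):
    """Return ||K[x]^{-1} K_x(t)||_{l1}, the quantity bounded by 1 in (1.2).

    kernel -- bivariate function K(s, t); xs -- pairwise distinct points; t -- a new point.
    """
    K = np.array([[kernel(xk, xj) for xk in xs] for xj in xs])  # K[x], entry (j,k) = K(x_k, x_j)
    Kxt = np.array([kernel(t, xj) for xj in xs])                # K_x(t), entries K(t, x_j)
    return np.linalg.norm(np.linalg.solve(K, Kxt), 1)

# Spot check with the exponential kernel K(s,t) = exp(-|s-t|) of Section 5.
exp_kernel = lambda s, t: np.exp(-abs(s - t))
xs = [0.3, 1.1, 2.0, 4.5]
ratios = [a4_ratio(exp_kernel, xs, t) for t in np.linspace(-1.0, 6.0, 200)]
print(max(ratios))  # stays <= 1, consistent with (A4)
```

Swapping in another kernel gives a quick, if non-rigorous, indication of whether (A4) can hold for it.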
2 A General Construction

To ensure that there exists a reproducing kernel, we shall start the construction of the Banach space from such a function. Let X be an input space and let K be a function on X × X. Introduce the vector space
$$\mathcal{B}_0 := \operatorname{span}\{K(x, \cdot) : x \in X\}.$$
Note that, unlike reproducing kernels for Hilbert spaces, this K is not necessarily symmetric in its arguments or positive definite. Suppose that a norm ‖·‖_{B_0} is imposed on B_0 such that point evaluation functionals are continuous on B_0; that is, for any x ∈ X there exists a positive constant M_x such that
$$|\delta_x(f)| = |f(x)| \le M_x \|f\|_{\mathcal{B}_0} \quad \text{for all } f \in \mathcal{B}_0. \tag{2.1}$$
The function K and the norm on B_0 will be explicitly given in a specific construction.

In [31, 32, 33], a vector space B is called an RKBS on X if it is a uniformly convex and uniformly Fréchet differentiable Banach space of functions on X on which point evaluation functionals are continuous. The uniform convexity and uniform Fréchet differentiability were imposed there to ensure the existence of a reproducing kernel for representing the point evaluation functionals. By the results to be established in the current paper, these stronger conditions are not necessary. To accommodate the search for alternatives, we introduce the following definitions.

Definition 2.1. The space B is called a Banach space of functions if the point evaluation functionals are consistent with the norm on B in the sense that for all f ∈ B, ‖f‖_B = 0 if and only if f vanishes everywhere on X. A Banach space B of functions on X is said to be a pre-RKBS on X if point evaluations are continuous linear functionals on B.

We plan to complete B_0 under the norm ‖·‖_{B_0} to obtain a pre-RKBS B. Two things need to be checked for the approach to succeed: an abstract completion of B_0 might not consist of functions, or might not have bounded point evaluation functionals. We shall present a completion process that yields a Banach space of functions. Let {f_n : n ∈ N} be a Cauchy sequence in B_0. Since point evaluation functionals are continuous on B_0, for any x ∈ X the sequence {f_n(x) : n ∈ N} converges in C. We denote the limit by f(x), which defines a function on X. One sees that two equivalent Cauchy sequences in B_0 give the same function. We let B be composed of all such limit functions, with the norm
$$\|f\|_{\mathcal{B}} := \lim_{n\to\infty} \|f_n\|_{\mathcal{B}_0}.$$
To investigate conditions for B to be a pre-RKBS, we need to invoke the following assumption.

Definition 2.2. A normed vector space V of functions on X satisfies the Norm Consistency Property if for every Cauchy sequence {f_n : n ∈ N} in V, lim_{n→∞} f_n(x) = 0 for all x ∈ X implies lim_{n→∞} ‖f_n‖_V = 0.

Proposition 2.3. The norm ‖·‖_B is well-defined and makes B a pre-RKBS on X if and only if B_0 satisfies the Norm Consistency Property.

Proof. We first show the necessity. If B is a Banach space then ‖·‖_B is a well-defined norm. The validity of the Norm Consistency Property follows directly from ‖0‖_B = 0.

We next prove the sufficiency. Suppose that the Norm Consistency Property holds for B_0. We first show that ‖·‖_B is a well-defined norm. Suppose that {f_n : n ∈ N} and {g_n : n ∈ N} are both Cauchy sequences in B_0 such that lim_{n→∞} f_n(x) = lim_{n→∞} g_n(x) for all x ∈ X. We need to show that lim_{n→∞} ‖f_n‖_{B_0} = lim_{n→∞} ‖g_n‖_{B_0}. Clearly, f_n − g_n forms a Cauchy sequence in B_0. Since lim_{n→∞} (f_n − g_n)(x) = 0 for all x ∈ X, it follows from the Norm Consistency Property that lim_{n→∞} ‖f_n − g_n‖_{B_0} = 0, which implies lim_{n→∞} ‖f_n‖_{B_0} = lim_{n→∞} ‖g_n‖_{B_0}. Therefore, ‖·‖_B is well-defined.
As a result, B is isometrically isomorphic to the abstract Banach space that is the completion of B_0. This implies that B is a Banach space and that B_0 is dense in B. Moreover, it follows immediately from the Norm Consistency Property that B is a Banach space of functions. It remains to show that the point evaluation functional δ_x is continuous on B for all x ∈ X. Let x ∈ X and f ∈ B. By definition, there exists a Cauchy sequence {f_n : n ∈ N} in B_0 such that f(x) = lim_{n→∞} f_n(x) for all x ∈ X and ‖f‖_B = lim_{n→∞} ‖f_n‖_{B_0}. Since δ_x is continuous on B_0, there exists a positive constant M_x such that |f_n(x)| ≤ M_x ‖f_n‖_{B_0} for all n ∈ N. Taking limits on both sides, we obtain |f(x)| ≤ M_x ‖f‖_B. The proof is complete.

In the rest of this section, we assume the Norm Consistency Property for B_0 and aim at deriving a reproducing kernel for B. To this end, we set
$$\mathcal{B}_0^\sharp := \operatorname{span}\{K(\cdot, x) : x \in X\}$$
and define a bilinear form ⟨·,·⟩_K on B_0 × B♯_0 by (1.4). It is straightforward to observe that
$$\langle f, K(\cdot,x)\rangle_K = f(x), \quad \langle K(x,\cdot), g\rangle_K = g(x) \quad \text{for all } f\in\mathcal{B}_0,\ g\in\mathcal{B}_0^\sharp,\ x\in X.$$
This means that (1.4) is well defined and that K is able to reproduce the point evaluations of functions in B_0 via this bilinear form. We need to extend this property to the whole space B in order to claim that K is a reproducing kernel for B. For this purpose, we define another norm
$$\|g\|_{\mathcal{B}_0^\sharp} := \sup_{f\in\mathcal{B}_0,\,f\ne 0} \frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}_0}}, \quad g\in\mathcal{B}_0^\sharp. \tag{2.2}$$
The next result indicates that the above norm is well-defined.

Proposition 2.4. The norm ‖·‖_{B♯_0} is well-defined and point evaluation functionals are continuous on B♯_0 if and only if point evaluation functionals are continuous on B_0.

Proof. We begin with the sufficiency. Suppose that point evaluation functionals are continuous on B_0; that is, for any x ∈ X there exists a positive constant M_x satisfying (2.1). Let g ∈ B♯_0. It must be of the form g = Σ_{j=1}^n a_j K(·, x_j) for some a_j ∈ C and x_j ∈ X, j ∈ N_n, n ∈ N. We have for all f ∈ B_0
$$\frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}_0}} = \frac{\big|\big\langle f, \sum_{j=1}^n a_j K(\cdot,x_j)\big\rangle_K\big|}{\|f\|_{\mathcal{B}_0}} = \frac{\big|\sum_{j=1}^n a_j f(x_j)\big|}{\|f\|_{\mathcal{B}_0}} \le \sum_{j=1}^n |a_j| M_{x_j},$$
which implies that ‖g‖_{B♯_0} is well-defined. We next prove that point evaluation functionals are continuous on B♯_0. By (2.2), we have for all f ∈ B_0, g ∈ B♯_0
$$|\langle f,g\rangle_K| \le \|f\|_{\mathcal{B}_0}\|g\|_{\mathcal{B}_0^\sharp}. \tag{2.3}$$
For any x ∈ X, taking f = K(x,·) in the above inequality yields
$$|g(x)| = |\langle K(x,\cdot), g\rangle_K| \le \|K(x,\cdot)\|_{\mathcal{B}_0}\|g\|_{\mathcal{B}_0^\sharp} \quad \text{for all } g\in\mathcal{B}_0^\sharp.$$
It follows that the point evaluation functional δ_x is continuous on B♯_0, as ‖K(x,·)‖_{B_0} is a constant independent of g.

We next turn to the necessity. Suppose ‖g‖_{B♯_0} is well-defined for all g ∈ B♯_0. For any x ∈ X, letting g = K(·,x) in (2.3) yields
$$|f(x)| \le \|K(\cdot,x)\|_{\mathcal{B}_0^\sharp}\|f\|_{\mathcal{B}_0},$$
which implies that point evaluation functionals are continuous on B_0.

We complete B♯_0 under the norm ‖·‖_{B♯_0} to a Banach space B♯ by the process described before Proposition 2.3. We have the following observation, similar to that about the space B.

Proposition 2.5. The space B♯ is a pre-RKBS on X if and only if the normed vector space B♯_0 satisfies the Norm Consistency Property.
In the following discussion, suppose that B♯_0, endowed with the norm ‖·‖_{B♯_0}, has the Norm Consistency Property. By applying the Hahn–Banach extension theorem twice, we can extend the bilinear form ⟨·,·⟩_K from B_0 × B♯_0 to B × B♯ in a unique way such that
$$|\langle f,g\rangle_K| \le \|f\|_{\mathcal{B}}\|g\|_{\mathcal{B}^\sharp}, \quad f\in\mathcal{B},\ g\in\mathcal{B}^\sharp. \tag{2.4}$$
The next result shows that the definition of ‖·‖_{B♯_0} in (2.2) can be extended to B♯.

Proposition 2.6. Suppose that point evaluation functionals are continuous on B_0. If both B_0 and B♯_0 satisfy the Norm Consistency Property then we have
$$\|g\|_{\mathcal{B}^\sharp} = \sup_{f\in\mathcal{B},\,f\ne 0}\frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}}, \quad g\in\mathcal{B}^\sharp. \tag{2.5}$$

Proof. By (2.4), the right hand side above is bounded by the left hand side. We only need to prove the other direction of the inequality. We first show it for functions in B♯_0. Let g ∈ B♯_0. It is straightforward to observe that
$$\|g\|_{\mathcal{B}^\sharp} = \sup_{f\in\mathcal{B}_0,\,f\ne 0}\frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}} \le \sup_{f\in\mathcal{B},\,f\ne 0}\frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}}. \tag{2.6}$$
Now let g be an arbitrary but fixed function in B♯. Since B♯_0 is dense in B♯, there exists {g_n : n ∈ N} ⊆ B♯_0 such that ‖g − g_n‖_{B♯} → 0 as n → ∞. This together with (2.6) implies
$$\|g\|_{\mathcal{B}^\sharp} = \lim_{n\to\infty}\|g_n\|_{\mathcal{B}^\sharp} \le \lim_{n\to\infty}\sup_{f\in\mathcal{B},\,f\ne 0}\frac{|\langle f,g_n\rangle_K|}{\|f\|_{\mathcal{B}}}.$$
Note that
$$\frac{|\langle f,g_n\rangle_K|}{\|f\|_{\mathcal{B}}} \le \frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}} + \frac{|\langle f,g-g_n\rangle_K|}{\|f\|_{\mathcal{B}}} \le \frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}} + \|g-g_n\|_{\mathcal{B}^\sharp}.$$
It follows from the above two relations that
$$\|g\|_{\mathcal{B}^\sharp} \le \lim_{n\to\infty}\sup_{f\in\mathcal{B},\,f\ne 0}\Big(\frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}} + \|g-g_n\|_{\mathcal{B}^\sharp}\Big) = \sup_{f\in\mathcal{B},\,f\ne 0}\frac{|\langle f,g\rangle_K|}{\|f\|_{\mathcal{B}}},$$
which completes the proof.

We next present necessary and sufficient conditions for K to be able to reproduce point evaluation functionals on B and B♯ via the bilinear form. We shall see that, assuming the Norm Consistency Property, both B and B♯ are Banach spaces of functions on X such that the point evaluation functionals are continuous and can be represented by the bilinear form with the function K. It is in this sense that B and B♯ are said to be reproducing kernel Banach spaces with the reproducing kernel K.

Theorem 2.7. Suppose that B_0 and B♯_0 satisfy the Norm Consistency Property. Then both B and B♯ are pre-RKBS on X and the kernel K reproduces function values via the bilinear form, namely,
$$\langle f, K(\cdot,x)\rangle_K = f(x) \quad \text{for all } x\in X \text{ and } f\in\mathcal{B} \tag{2.7}$$
and
$$\langle K(x,\cdot), g\rangle_K = g(x) \quad \text{for all } x\in X \text{ and } g\in\mathcal{B}^\sharp. \tag{2.8}$$
Thus, B and B♯ are reproducing kernel Banach spaces (RKBS).

Proof. By Propositions 2.3 and 2.5, both B and B♯ are pre-RKBS on X. For each f ∈ B, there exists a sequence {f_n : n ∈ N} ⊆ B_0 convergent to f. As a consequence, we have for any x ∈ X
$$f(x) = \lim_{n\to\infty} f_n(x) = \lim_{n\to\infty}\langle f_n, K(\cdot,x)\rangle_K.$$
By (2.4), ⟨·, K(·,x)⟩_K is a bounded linear functional on B, which implies
$$\lim_{n\to\infty}\langle f_n, K(\cdot,x)\rangle_K = \langle f, K(\cdot,x)\rangle_K.$$
Combining the above two equations proves (2.7). Equation (2.8) can be proved similarly.

We next discuss the relationship between the space B♯ and the dual space B* of B. It is clear by (2.4) and (2.5) that the mapping L from B♯ to B* defined by the bilinear form,
$$(Lg)(f) := \langle f,g\rangle_K, \quad f\in\mathcal{B},\ g\in\mathcal{B}^\sharp, \tag{2.9}$$
is isometric and linear. In other words, L is an embedding of B♯ into B*. We next present a necessary and sufficient condition for it to be surjective.
Proposition 2.8. Suppose that both B_0 and B♯_0 satisfy the Norm Consistency Property. The mapping L defined by (2.9) is surjective onto B* if and only if for any proper closed subspace M ⊊ B, the orthogonal space
$$\mathcal{M}^\perp := \{g\in\mathcal{B}^\sharp : \langle f,g\rangle_K = 0 \text{ for all } f\in\mathcal{M}\}$$
is nontrivial.

Proof. We first prove the necessity. For any proper closed subspace M ⊊ B, by the Hahn–Banach theorem there exists a nontrivial functional ν ∈ B* such that ν(f) = 0 for all f ∈ M. If L is surjective then there exists a function g ∈ B♯ such that L(g) = ν, namely, ν(f) = ⟨f,g⟩_K for all f ∈ B. It follows that g ∈ M⊥ and g ≠ 0, as ν is nontrivial.

We next show the sufficiency. Let ν be a nontrivial functional in B*. Then M := ker(ν) is a proper closed subspace of B. By assumption, there exists a nonzero function g ∈ M⊥. This enables us to find a function f_0 ∈ B \ M such that ⟨f_0,g⟩_K ≠ 0 and ν(f_0) = 1. Set g_0 := g/⟨f_0,g⟩_K. Since f − ν(f)f_0 ∈ ker(ν) for all f ∈ B, we get for any f ∈ B
$$\langle f,g_0\rangle_K = \langle f-\nu(f)f_0,\ g_0\rangle_K + \langle \nu(f)f_0,\ g_0\rangle_K = \nu(f)\langle f_0,g_0\rangle_K = \nu(f),$$
which implies that L is surjective.

We close the section with a conclusion on the general construction and the related results presented above.

Theorem 2.9. Suppose that

(a) the vector space B_0 = span{K(x,·) : x ∈ X} with the norm ‖·‖_{B_0} has the Norm Consistency Property, and

(b) point evaluation functionals are continuous on B_0.

Then the following statements hold true:

(1) B_0 can be completed to a pre-RKBS B on X;

(2) the norm ‖·‖_{B♯_0} given by (2.2) is well-defined and point evaluation functionals are bounded on B♯_0 with respect to this norm;

(3) if B♯_0 satisfies the Norm Consistency Property as well, then B♯_0 can be completed to an RKBS B♯ and K is the reproducing kernel for both B and B♯ in the sense that (2.7) and (2.8) hold true. In this case, B♯ can be isometrically embedded into B* via the bilinear form, and the embedding is surjective if and only if for any proper closed subspace M of B, M⊥ is nontrivial.

3 RKBS with the ℓ1 Norm

We shall follow the procedures in Theorem 2.9 to construct an RKBS with the ℓ1 norm in this section. To start, we let K be a bounded function on X × X such that
$$K(x_j,\cdot),\ j\in\mathbb{N}_n,\ \text{are linearly independent for all pairwise distinct points } x_j\in X,\ j\in\mathbb{N}_n. \tag{3.1}$$
Note that this assumption is implied by Admissibility Assumption (A1), but is somewhat weaker than (A1). Introduce an ℓ1 norm on B_0 = span{K(x,·) : x ∈ X} by setting, for all finitely many pairwise distinct points x_j ∈ X and constants c_j ∈ C, j ∈ N_m, m ∈ N,
$$\Big\|\sum_{j=1}^m c_j K(x_j,\cdot)\Big\|_{\mathcal{B}_0} := \sum_{j=1}^m |c_j|. \tag{3.2}$$
Since K is bounded, it is clear that point evaluation functionals are bounded on B_0. We next check the important Norm Consistency Property and find that it is implied by the Admissibility Assumption above.

Proposition 3.1. The space B_0 with the norm (3.2) satisfies the Norm Consistency Property if and only if K satisfies (A3).

Proof. We first show the necessity. Suppose that for some c ∈ ℓ1(N) and pairwise distinct {x_j ∈ X : j ∈ N}, Σ_{j=1}^∞ c_j K(x_j, x) = 0 for all x ∈ X. Let f_n := Σ_{j=1}^n c_j K(x_j, ·) for all n ∈ N. Since c ∈ ℓ1(N), {f_n : n ∈ N} forms a Cauchy sequence in B_0.
Moreover, lim_{n→∞} f_n(x) = 0 for all x ∈ X, as K is bounded on X × X. It follows from the Norm Consistency Property that
$$\lim_{n\to\infty}\|f_n\|_{\mathcal{B}_0} = \lim_{n\to\infty}\sum_{j=1}^n|c_j| = \|c\|_{\ell^1(\mathbb{N})} = 0.$$
Therefore, c = 0 and (A3) holds true.

On the other hand, suppose that K satisfies (A3). Let {f_n : n ∈ N} be a Cauchy sequence in B_0 with lim_{n→∞} f_n(x) = 0 for all x ∈ X. We can find pairwise distinct x_j ∈ X, j ∈ N, such that for any n ∈ N
$$f_n = \sum_{j=1}^\infty c_{n,j}K(x_j,\cdot),$$
where c_n := (c_{n,j} : j ∈ N) has finitely many nonzero components. By definition (3.2), {c_n : n ∈ N} is a Cauchy sequence in ℓ1(N). Let c be its limit in ℓ1(N) and define
$$f := \sum_{j=1}^\infty c_j K(x_j,\cdot).$$
Suppose that |K(s,t)| ≤ M for some positive constant M and all s, t ∈ X. A direct calculation gives, for any x ∈ X,
$$|f_n(x)-f(x)| = \Big|\sum_{j=1}^\infty (c_{n,j}-c_j)K(x_j,x)\Big| \le M\|c_n-c\|_{\ell^1(\mathbb{N})}.$$
It follows that lim_{n→∞} f_n(x) = f(x) for all x ∈ X. Since lim_{n→∞} f_n(x) = 0 for all x ∈ X, we have f(x) = 0 for all x ∈ X. By (A3), c = 0, which implies
$$\lim_{n\to\infty}\|f_n\|_{\mathcal{B}_0} = \lim_{n\to\infty}\|c_n\|_{\ell^1(\mathbb{N})} = \|c\|_{\ell^1(\mathbb{N})} = 0.$$
The proof is complete.

Functions K satisfying property (A3) will be given later. We assume for the time being that (A3) holds true. One sees from the proof of Proposition 3.1 that B has the form (1.3). We remark that, in the preparation of this paper, we came across a Banach space of a form similar to (1.3), used in [30] for error estimates with linear programming regularization. One observes from (1.3) that ℓ1(X) is isometrically isomorphic to B through the mapping
$$\Phi(c) := \sum_{t\in X} c_t K(t,\cdot), \quad c\in\ell^1(X).$$
In this sense, we say that B is a pre-RKBS on X with the ℓ1 norm. It remains to derive a reproducing kernel for it. By Theorem 2.7, it suffices to check the Norm Consistency Property for B♯_0. We shall show that the Norm Consistency Property automatically holds true for B♯_0 without any additional requirement. To this end, we first calculate a specific form of the norm ‖·‖_{B♯_0}. Denote for any function g on X by ‖g‖_{L∞(X)} the supremum of |g(x)| over x ∈ X.

Lemma 3.2. There holds for any function g ∈ B♯_0 that ‖g‖_{B♯_0} = ‖g‖_{L∞(X)}.

Proof. We first prove that ‖g‖_{B♯_0} is bounded by ‖g‖_{L∞(X)}. Any f ∈ B_0 has the form f = Σ_{j=1}^n c_j K(x_j, ·) for some c_j ∈ C and pairwise distinct x_j ∈ X, j ∈ N_n. We verify that
$$|\langle f,g\rangle_K| = \Big|\Big\langle \sum_{j=1}^n c_jK(x_j,\cdot),\ g\Big\rangle_K\Big| = \Big|\sum_{j=1}^n c_j g(x_j)\Big| \le \|g\|_{L^\infty(X)}\sum_{j=1}^n|c_j| = \|g\|_{L^\infty(X)}\|f\|_{\mathcal{B}_0},$$
which implies ‖g‖_{B♯_0} ≤ ‖g‖_{L∞(X)}. For the other direction, we notice for all x_0 ∈ X that
$$\|g\|_{\mathcal{B}_0^\sharp} \ge \frac{|\langle K(x_0,\cdot),g\rangle_K|}{\|K(x_0,\cdot)\|_{\mathcal{B}_0}} = |g(x_0)|.$$
Since x_0 is arbitrarily chosen, we have ‖g‖_{B♯_0} ≥ ‖g‖_{L∞(X)}.

We next show that the space B♯ is also a pre-RKBS on X.

Lemma 3.3. The space B♯_0 satisfies the Norm Consistency Property.

Proof. Let {f_n : n ∈ N} be a Cauchy sequence in B♯_0 with lim_{n→∞} f_n(x) = 0 for all x ∈ X. By Lemma 3.2, for any ε > 0 there exists a positive integer N_0 such that when m, n ≥ N_0,
$$|f_m(x)-f_n(x)| \le \varepsilon \quad \text{for all } x\in X.$$
Since lim_{n→∞} f_n(x) = 0, we let n go to infinity in the above inequality to obtain that, when m ≥ N_0,
$$|f_m(x)| \le \varepsilon \quad \text{for all } x\in X.$$
In other words, ‖f_m‖_{L∞(X)} ≤ ε when m ≥ N_0, implying lim_{n→∞} ‖f_n‖_{L∞(X)} = 0.

By Proposition 3.1 and Lemmas 3.2 and 3.3, we conclude our construction of RKBS with the ℓ1 norm in the following result.

Theorem 3.4. Let K be a bounded function on X × X that satisfies (A3). Then B, having the form (1.3), and B♯ are RKBS on X with the reproducing kernel K.

In the rest of this section we discuss conditions on translation invariant K : R^d × R^d → C for which Admissibility Assumption (A3) holds. Specifically, such K are of the form
$$K(s,t) = \int_{\mathbb{R}^d} e^{-i(s-t)\cdot\xi}\varphi(\xi)\,d\xi, \quad s,t\in\mathbb{R}^d, \tag{3.3}$$
where s · t stands for the standard inner product on R^d and φ ∈ L¹(R^d), the space of Lebesgue integrable functions on R^d. One should not confuse L¹(R^d) with ℓ1(R^d): the latter is defined with respect to the counting measure on R^d, while the former is with respect to the Lebesgue measure. Note that K is bounded and continuous on R^d × R^d. We give a sufficient condition for a function K so defined to satisfy (A3).

Proposition 3.5. Let K be given by (3.3). If φ is nonzero almost everywhere on R^d with respect to the Lebesgue measure then K satisfies (A3).

Proof. Suppose that there exist c ∈ ℓ1(N) and pairwise distinct points s_j ∈ R^d, j ∈ N, such that
$$\sum_{j=1}^\infty c_j K(s_j,t) = 0 \quad \text{for all } t\in\mathbb{R}^d.$$
This equation can be reformulated by (3.3) as
$$\int_{\mathbb{R}^d}\Big(\sum_{j=1}^\infty c_j e^{-i s_j\cdot\xi}\Big)\varphi(\xi)e^{i t\cdot\xi}\,d\xi = 0 \quad \text{for all } t\in\mathbb{R}^d.$$
It follows that for almost every ξ ∈ R^d with respect to the Lebesgue measure
$$\Big(\sum_{j=1}^\infty c_j e^{-i s_j\cdot\xi}\Big)\varphi(\xi) = 0.$$
By the assumption on φ,
$$\sum_{j=1}^\infty c_j e^{-i s_j\cdot\xi} = 0 \quad \text{for almost every } \xi\in\mathbb{R}^d.$$
Note that the function on the left hand side above is continuous in ξ. We hence obtain that the Fourier transform of the discrete measure
$$\nu(A) := \sum_{s_j\in A} c_j \quad \text{for every Borel subset } A\subseteq\mathbb{R}^d$$
is zero. Consequently, ν is the zero measure, implying c = 0.

We next present a particular example as a corollary to Proposition 3.5.

Corollary 3.6. If φ is a nontrivial continuous function on R^d with compact support then
$$K(s,t) = \phi(s-t), \quad s,t\in\mathbb{R}^d,$$
satisfies (A3).

Proof. We regard φ as a tempered distribution and note by the Paley–Wiener theorem that the Fourier transform of φ is real-analytic on R^d. Therefore, the Fourier transform of φ is nonzero everywhere on R^d except on a subset of zero Lebesgue measure. Arguments similar to those in the proof of the last proposition hence apply.

We next present, by Proposition 3.5 and Corollary 3.6, several examples of K that satisfy (A3) and hence can be used to construct RKBS with the ℓ1 norm. Such functions include:

- the exponential kernel
$$K(s,t) = \exp(-\|s-t\|_{\ell^1(\mathbb{N}_d)}) = \frac{1}{\pi^d}\int_{\mathbb{R}^d}e^{-i(s-t)\cdot\xi}\prod_{j=1}^d\frac{1}{1+\xi_j^2}\,d\xi, \quad s,t\in\mathbb{R}^d,$$
where for s ∈ R^d, ‖s‖_{ℓ1(N_d)} := Σ_{j=1}^d |s_j|;

- the Gaussian kernel
$$K(s,t) = \exp\Big(-\frac{\|s-t\|_2^2}{\sigma}\Big) = \Big(\frac{\sqrt{\sigma}}{2\sqrt{\pi}}\Big)^d\int_{\mathbb{R}^d}e^{-i(s-t)\cdot\xi}\exp\Big(-\frac{\sigma}{4}\|\xi\|_2^2\Big)\,d\xi, \quad s,t\in\mathbb{R}^d, \tag{3.4}$$
where ‖s‖_2 denotes the standard Euclidean norm of s on R^d;

- the inverse multiquadrics
$$K(s,t) = \Big(\frac{1}{1+\|s-t\|_2^2}\Big)^\beta, \quad s,t\in\mathbb{R}^d,\ \beta>0, \tag{3.5}$$
whose Fourier transform is given by a modified Bessel function and is positive almost everywhere on R^d (see [28], pages 52, 76, and 95);
- the B-spline kernels
$$K(s,t) = \prod_{j=1}^d B_p(s_j-t_j), \quad s,t\in\mathbb{R}^d,$$
where s_j is the j-th component of s and B_p denotes the B-spline of order p, p ≥ 2; B-spline kernels satisfy (A3), as they are given by bounded continuous functions of compact support;

- radial basis functions of compact support, including Wu's functions [29] and Wendland's functions [28]. Such functions are of the form K(s,t) = φ(‖s−t‖_2), s, t ∈ R^d, where φ is a compactly supported univariate function dependent on the dimension d. We give two examples for d = 3:
$$\phi(r) := (1-r)_+^2 \quad\text{and}\quad \phi(r) := (1-r)_+^4(1+4r), \quad r\ge 0,$$
where t_+ := max{0, t} for t ∈ R. These functions satisfy (A3) by Corollary 3.6.

On the other hand, a translation invariant K does not satisfy (A3) if its Fourier transform is compactly supported, as indicated in the next result.

Proposition 3.7. If φ ∈ L¹(R^d) is compactly supported on R^d then K given by (3.3) does not satisfy (A3).

Proof. Without loss of generality, we may assume that supp φ ⊆ [−1,1]^d. Choose a nontrivial infinitely continuously differentiable function ψ that is supported on [−π,π]^d and vanishes on [−1,1]^d. We expand ψ into a Fourier series
$$\psi(\xi) = \sum_{j\in\mathbb{Z}^d} c_j e^{-i j\cdot\xi}, \quad \xi\in[-\pi,\pi]^d,$$
where the c_j are the Fourier coefficients of ψ. Note that {c_j : j ∈ Z^d} ∈ ℓ1(Z^d), as ψ is infinitely continuously differentiable on [−π,π]^d. By the arguments in the proof of Proposition 3.5,
$$\sum_{j\in\mathbb{Z}^d} c_j K(j,t) = \int_{\mathbb{R}^d}\Big(\sum_{j\in\mathbb{Z}^d}c_j e^{-i j\cdot\xi}\Big)\varphi(\xi)e^{i t\cdot\xi}\,d\xi, \quad t\in\mathbb{R}^d.$$
By our construction,
$$\Big(\sum_{j\in\mathbb{Z}^d}c_j e^{-i j\cdot\xi}\Big)\varphi(\xi) = 0 \quad \text{for all } \xi\in\mathbb{R}^d,$$
which implies Σ_{j∈Z^d} c_j K(j,·) = 0. Moreover, c_j ≠ 0 for at least one j ∈ Z^d, because ψ is nontrivial. We conclude that K does not satisfy (A3).

By Proposition 3.7, the sinc kernel
$$K(s,t) := \operatorname{sinc}(s-t) := \prod_{j=1}^d\frac{\sin(\pi(s_j-t_j))}{\pi(s_j-t_j)}, \quad s,t\in\mathbb{R}^d,$$
does not satisfy (A3). As a consequence, it cannot yield an RKBS with the ℓ1 norm by the procedures introduced in this section. Arguments similar to those in the proof of Proposition 3.7 show that if ν is a compactly supported Borel measure on R^d of finite total variation then the function
$$K(s,t) := \int_{\mathbb{R}^d}e^{-i(s-t)\cdot\xi}\,d\nu(\xi), \quad s,t\in\mathbb{R}^d,$$
does not satisfy (A3). Instances include the class of Bessel-based radial functions [10], where the Borel measure is the Dirac delta measure on the unit sphere of the Euclidean space.
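For readers who wish to experiment with the examples above, the following sketch (ours, not from the paper) collects several of the kernels just listed, together with the sinc kernel that Proposition 3.7 rules out, and spot-checks the nonsingularity required by (A1) on a few points.

```python
import numpy as np

# Several of the kernels discussed above, for s, t given as 1-D numpy arrays in R^d.
def exponential(s, t):                      # exp(-||s-t||_1); satisfies (A3)
    return np.exp(-np.abs(s - t).sum())

def gaussian(s, t, sigma=1.0):              # exp(-||s-t||_2^2 / sigma); satisfies (A3)
    return np.exp(-np.dot(s - t, s - t) / sigma)

def inverse_multiquadric(s, t, beta=0.5):   # (1 + ||s-t||_2^2)^(-beta); satisfies (A3)
    return (1.0 + np.dot(s - t, s - t)) ** (-beta)

def wendland(s, t):                         # phi(r) = (1-r)_+^4 (1+4r), compact support, d = 3
    r = np.linalg.norm(s - t)
    return max(1.0 - r, 0.0) ** 4 * (1.0 + 4.0 * r)

def sinc_kernel(s, t):                      # band-limited; fails (A3) by Proposition 3.7
    return np.prod(np.sinc(s - t))          # numpy's sinc(u) = sin(pi*u)/(pi*u)

# (A1) spot check: the matrix K[x] on pairwise distinct points should be nonsingular.
pts = [np.array(p) for p in ([0.1, 0.2, 0.3], [0.5, 0.1, 0.9], [1.0, 1.5, 0.2])]
for ker in (exponential, gaussian, inverse_multiquadric, wendland):
    K = np.array([[ker(xk, xj) for xk in pts] for xj in pts])
    print(ker.__name__, np.linalg.cond(K))  # finite condition numbers: K[x] invertible here
```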
4 Representer Theorems in RKBS with the ℓ1 Norm

Up to now our arguments have relied on Admissibility Assumptions (A1)–(A3). In this section the final assumption, (A4), is invoked to guarantee that the representer theorem holds for the constructed RKBS. A regularized learning scheme in the RKBS B constructed by (1.3) can generally be expressed as finding f_0 such that
$$f_0 = \operatorname*{argmin}_{f\in\mathcal{B}}\ \big[V(f(x)) + \mu\phi(\|f\|_{\mathcal{B}})\big], \tag{4.1}$$
where x := {x_j ∈ X : j ∈ N_n}, n ∈ N, is the sequence of given pairwise distinct sampling points, f(x) := (f(x_j) : j ∈ N_n) ∈ C^n, V : C^n → R_+ is a loss function, μ is a positive regularization parameter, and φ : R_+ → R_+ is a nondecreasing regularization function. Here, R_+ := [0, +∞). The loss function and regularization function should satisfy some minimal requirements for the learning scheme (4.1) to be useful. This consideration gives rise to the following definition.

Definition 4.1. A regularized learning scheme (4.1) is said to be acceptable if V and φ are continuous and
$$\lim_{t\to\infty}\phi(t) = +\infty. \tag{4.2}$$

It is possible that the solution of (4.1) is non-unique; in that case we are only interested in finding one possible solution. We now introduce the main concept of this section.

Definition 4.2. The space B is said to satisfy the linear representer theorem for regularized learning if every acceptable regularized learning scheme (4.1) has a minimizer of the form
$$f_0 = \sum_{j=1}^n c_j K(x_j,\cdot), \tag{4.3}$$
where the c_j are constants. In other words, there exists a solution f_0 lying in the finite dimensional subspace
$$S_x := \operatorname{span}\{K(x_j,\cdot) : j\in\mathbb{N}_n\}.$$

An RKHS with K as its reproducing kernel in the usual sense always satisfies the linear representer theorem [14]. The corresponding result for uniformly convex and uniformly Fréchet differentiable pre-RKBS, with a reproducing kernel given by the semi-inner product, was established in [31, 32]. For more information on this important property for RKHS and vector-valued RKHS, see, for example, [1, 17, 21] and the references cited therein. Our purpose is to find conditions on K such that B satisfies the linear representer theorem.

The representer theorem for (4.1) is closely related to the representer theorem for the minimal norm interpolation problem. In the RKHS case, an equivalence was proved in [16]. We shall follow that approach and consider minimal norm interpolation in B first. For any y ∈ C^n, let I_x(y) be the subset of functions in B that interpolate the specified data, namely,
$$I_x(y) := \{f\in\mathcal{B} : f(x) = y\}.$$
A minimal norm interpolant in B is a function f_min satisfying
$$f_{\min} = \operatorname*{argmin}\{\|f\|_{\mathcal{B}} : f\in I_x(y)\}. \tag{4.4}$$
Again, in the case of a non-unique solution, we are only interested in obtaining one solution. Since K[x] is nonsingular, one sees that the typically infinite dimensional set I_x(y) always has a non-empty intersection with S_x, for all y ∈ C^n and pairwise distinct x ⊆ X.

Definition 4.3. An RKBS B is said to satisfy the linear representer theorem for minimal norm interpolation if for any choice of data x and y, there is a minimal norm interpolant (4.4) lying in S_x.

We shall show that B satisfies the linear representer theorem for regularized learning if and only if it does so for minimal norm interpolation. We first prove one direction of the equivalence.

Lemma 4.4. If B satisfies the linear representer theorem for minimal norm interpolation, then it also does so for regularized learning.

Proof. Let V, φ, and μ be arbitrary but fixed, subject to the conditions that make (4.1) an acceptable regularization scheme. For an arbitrary function f in B, we let f_0 be a minimizer of inf_{g∈I_x(f(x))} ‖g‖_B that has the form (4.3). Then f_0(x) = f(x) and ‖f_0‖_B ≤ ‖f‖_B. As a consequence, V(f_0(x)) = V(f(x)) but φ(‖f_0‖_B) ≤ φ(‖f‖_B), as φ is nondecreasing. It follows that
$$\inf_{f\in\mathcal{B}}\ V(f(x)) + \mu\phi(\|f\|_{\mathcal{B}}) = \inf_{f\in S_x}\ V(f(x)) + \mu\phi(\|f\|_{\mathcal{B}}).$$
By (4.2), there exists a positive constant α such that
$$\inf_{f\in S_x}\ V(f(x)) + \mu\phi(\|f\|_{\mathcal{B}}) = \inf_{f\in S_x,\ \|f\|_{\mathcal{B}}\le\alpha}\ V(f(x)) + \mu\phi(\|f\|_{\mathcal{B}}).$$
Note that the functional we are minimizing is continuous on B by the assumptions on V and φ and by the continuity of point evaluation functionals on B. By the elementary fact that a continuous function on a compact metric space attains its minimum, (4.1) has a minimizer belonging to {f ∈ S_x : ‖f‖_B ≤ α}. Therefore, B satisfies the linear representer theorem.

For the other direction, it suffices to consider a class of regularization functionals with a particular choice of V and φ. In the limit of vanishing μ we recover the minimal norm interpolant.

Lemma 4.5. If B satisfies the linear representer theorem for regularized learning, then it also satisfies the linear representer theorem for minimal norm interpolation.

Proof. We shall follow the idea in [16]. Choose any n ∈ N, any x = {x_j ∈ X : j ∈ N_n} with pairwise distinct elements, and any y ∈ C^n. For every μ > 0, let f_{0,μ} ∈ S_x be a minimizer of (4.1) with the choice
$$V(f(x)) = \|f(x)-y\|_2^2, \qquad \phi(t) = t. \tag{4.5}$$
Here, ‖·‖_2 is the standard Euclidean norm on C^n. Define the 1 × n row vector function K^x(x) := (K(x_j, x) : j ∈ N_n) for all x ∈ X. It follows that f_{0,μ} = K^x(·)c_μ for some c_μ ∈ C^n. Then we have
$$\|K[x]c_\mu - y\|_2^2 = \|f_{0,\mu}(x)-y\|_2^2 \le V(f_{0,\mu}(x)) + \mu\phi(\|f_{0,\mu}\|_{\mathcal{B}}) \le V(0) + \mu\phi(\|0\|_{\mathcal{B}}) = \|y\|_2^2.$$
As K[x] is nonsingular, the above inequality implies that {c_μ : μ > 0} forms a bounded set in C^n. By restricting to a subsequence if necessary, we may hence assume that c_μ converges to some c_0 ∈ C^n as μ goes to zero. We shall show that f_{0,0} := K^x(·)c_0 ∈ S_x is a minimal norm interpolant. Since c_μ converges to c_0 as μ tends to zero, we first get
$$\lim_{\mu\to 0}\|f_{0,\mu}-f_{0,0}\|_{\mathcal{B}} = \lim_{\mu\to 0}\|c_\mu-c_0\|_{\ell^1(\mathbb{N}_n)} = 0. \tag{4.6}$$
Since point evaluation functionals are continuous on B, we obtain by (4.6)
$$f_{0,0}(x_j) = \lim_{\mu\to 0}f_{0,\mu}(x_j) \quad \text{for all } j\in\mathbb{N}_n. \tag{4.7}$$
Now let g be an arbitrary interpolant, i.e., an arbitrary element of I_x(y). As f_{0,μ} is a minimizer of (4.1) with the choice (4.5), it follows that
$$\|f_{0,\mu}(x)-y\|_2^2 + \mu\|f_{0,\mu}\|_{\mathcal{B}} \le \|g(x)-y\|_2^2 + \mu\|g\|_{\mathcal{B}} = \mu\|g\|_{\mathcal{B}}. \tag{4.8}$$
Letting μ → 0 on both sides of the above inequality, we obtain by (4.7) that ‖f_{0,0}(x) − y‖²₂ = 0, which implies that f_{0,0} is also an interpolant, i.e., f_{0,0} ∈ I_x(y). It also follows from (4.8) that ‖f_{0,μ}‖_B ≤ ‖g‖_B for all μ > 0, which together with (4.6) implies ‖f_{0,0}‖_B ≤ ‖g‖_B. Since g is an arbitrary function in I_x(y) and f_{0,0} ∈ I_x(y), we see that f_{0,0} is a minimal norm interpolant, i.e., a solution of (4.4). The proof is complete.

Combining Lemmas 4.4 and 4.5, we reach the characterization for B to satisfy the linear representer theorem.

Proposition 4.6. The space B satisfies the linear representer theorem for regularized learning if and only if B satisfies the linear representer theorem for minimal norm interpolation.
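The practical payoff of the representer theorem is that, with the square loss and φ(t) = t, searching over all of B collapses to the finite dimensional ℓ1-regularized least squares problem min_c ‖K[x]c − y‖₂² + μ‖c‖₁ for the coefficients in (4.3). The sketch below is ours; iterative soft-thresholding (ISTA) is a standard solver we substitute for illustration, not an algorithm from the paper.

```python
import numpy as np

def ista_l1(K, y, mu, n_iter=5000):
    """Solve min_c ||K c - y||_2^2 + mu*||c||_1 by iterative soft-thresholding."""
    c = np.zeros(K.shape[1])
    L = np.linalg.norm(K, 2) ** 2                    # ||K||_2^2; gradient step scale
    for _ in range(n_iter):
        z = c - (K.T @ (K @ c - y)) / L              # gradient step on the smooth part
        c = np.sign(z) * np.maximum(np.abs(z) - mu / (2 * L), 0.0)  # soft threshold
    return c

# Toy run with the exponential kernel on the line.
xs = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
K = np.exp(-np.abs(xs[:, None] - xs[None, :]))       # the matrix K[x]
y = np.sin(xs)
print(ista_l1(K, y, mu=0.1))                         # typically sparse coefficients
```

The ℓ1 penalty typically drives several coefficients to exactly zero, which is the sparsity motivating the whole construction.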
In view of Proposition 4.6, we shall focus on necessary and sufficient conditions for the minimal norm interpolation in B to satisfy the linear representer theorem. To this end, we begin with the simplest case, in which only one more sampling point is added to x. Recall the definition of K_x(x) from the introduction. It is worthwhile to point out that K^x(x) is in general not the transpose of K_x(x), as K is not required to be symmetric.

Lemma 4.7. Let x = {x_j ∈ X : j ∈ N_n} have pairwise distinct elements, let x_{n+1} be an arbitrary point in X \ x, and set x̄ := {x_j : j ∈ N_{n+1}}. Then the minimal norm interpolant in S_x̄ is the same as the minimal norm interpolant in S_x, i.e.,
$$\min_{f\in I_x(y)\cap S_{\bar{x}}}\|f\|_{\mathcal{B}} = \min_{f\in I_x(y)\cap S_x}\|f\|_{\mathcal{B}} \quad \text{for all } y\in\mathbb{C}^n, \tag{4.9}$$
if and only if (1.2) holds true.

Proof. Notice that I_x(y) ∩ S_x contains only one function, f = K^x(·)K[x]^{-1}y. We next estimate the norm of functions in I_x(y) ∩ S_x̄. Let g ∈ I_x(y) ∩ S_x̄ and b := g(x_{n+1}). Note that g is uniquely determined by b, as it already satisfies the interpolation condition g(x) = y. In fact, as K[x̄] is nonsingular, g = K^{x̄}(·)K[x̄]^{-1}ȳ, where ȳ = (y^T, b)^T ∈ C^{n+1}. Direct computations show that
$$K[\bar{x}]^{-1}\bar{y} = \begin{pmatrix} K[x] & K_x(x_{n+1}) \\ K^x(x_{n+1}) & K(x_{n+1},x_{n+1}) \end{pmatrix}^{-1}\begin{pmatrix} y \\ b \end{pmatrix} = \begin{pmatrix} K[x]^{-1}y + \frac{q}{p}K[x]^{-1}K_x(x_{n+1}) \\ -\frac{q}{p} \end{pmatrix},$$
where p := K(x_{n+1},x_{n+1}) − K^x(x_{n+1})K[x]^{-1}K_x(x_{n+1}) and q := K^x(x_{n+1})K[x]^{-1}y − b.

We now show sufficiency. If (1.2) holds true then we have
$$\|g\|_{\mathcal{B}} = \|K[\bar{x}]^{-1}\bar{y}\|_{\ell^1(\mathbb{N}_{n+1})} \ge \|K[x]^{-1}y\|_{\ell^1(\mathbb{N}_n)} - \big\|(K[x])^{-1}K_x(x_{n+1})\big\|_{\ell^1(\mathbb{N}_n)}\Big|\frac{q}{p}\Big| + \Big|\frac{q}{p}\Big| \ge \|K[x]^{-1}y\|_{\ell^1(\mathbb{N}_n)} = \|f\|_{\mathcal{B}},$$
which implies min_{f∈I_x(y)∩S_x̄} ‖f‖_B ≥ min_{f∈I_x(y)∩S_x} ‖f‖_B. Since S_x ⊆ S_x̄, min_{f∈I_x(y)∩S_x̄} ‖f‖_B ≤ min_{f∈I_x(y)∩S_x} ‖f‖_B. Thus, (4.9) holds true.

On the other hand, if (4.9) is true for all y ∈ C^n then we must have ‖K[x̄]^{-1}ȳ‖_{ℓ1(N_{n+1})} ≥ ‖K[x]^{-1}y‖_{ℓ1(N_n)} for all y ∈ C^n and b ∈ C. In particular, the choices y = K_x(x_{n+1}) and b = K^x(x_{n+1})K[x]^{-1}K_x(x_{n+1}) + p yield
$$\|K[\bar{x}]^{-1}\bar{y}\|_{\ell^1(\mathbb{N}_{n+1})} = \bigg\|\begin{pmatrix}0\\1\end{pmatrix}\bigg\|_{\ell^1(\mathbb{N}_{n+1})} = 1 \quad\text{and}\quad \|K[x]^{-1}y\|_{\ell^1(\mathbb{N}_n)} = \big\|(K[x])^{-1}K_x(x_{n+1})\big\|_{\ell^1(\mathbb{N}_n)}.$$
Combining the above two equations proves (1.2). The proof is complete.
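The block inverse formula in the proof above is easy to sanity-check numerically. The sketch below (ours, for illustration only) compares it against a direct solve, using the exponential kernel so that K[x̄] is nonsingular.

```python
import numpy as np

rng = np.random.default_rng(0)
kernel = lambda s, t: np.exp(-abs(s - t))    # exponential kernel, nonsingular K[x-bar]

xs = np.array([0.2, 0.9, 1.7, 2.4])          # x_1 < ... < x_n
xn1 = 1.1                                    # the added point x_{n+1}
y = rng.standard_normal(len(xs))
b = rng.standard_normal()

K = np.array([[kernel(xk, xj) for xk in xs] for xj in xs])   # K[x]
u = np.array([kernel(xn1, xj) for xj in xs])                 # column K_x(x_{n+1})
w = np.array([kernel(xj, xn1) for xj in xs])                 # row K^x(x_{n+1})
d = kernel(xn1, xn1)

# Direct solve on the enlarged matrix K[x-bar].
Kbar = np.block([[K, u[:, None]], [w[None, :], np.array([[d]])]])
direct = np.linalg.solve(Kbar, np.append(y, b))

# Formula from the proof: p is the Schur complement, q the interpolation defect.
p = d - w @ np.linalg.solve(K, u)
q = w @ np.linalg.solve(K, y) - b
formula = np.append(np.linalg.solve(K, y) + (q / p) * np.linalg.solve(K, u), -q / p)

print(np.allclose(direct, formula))   # True
```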
We are now ready to present one of the main results of this paper.

Theorem 4.8. Every minimal norm interpolant (4.4) in B satisfies the linear representer theorem if and only if (1.2) holds true for all n ∈ N and all pairwise distinct sampling points x_j ∈ X, j ∈ N_{n+1}.

Proof. The minimal norm interpolant (4.4) satisfies the linear representer theorem if and only if
$$\min_{g\in I_x(y)}\|g\|_{\mathcal{B}} = \min_{f\in I_x(y)\cap S_x}\|f\|_{\mathcal{B}}.$$
Therefore, if the above equation holds true then, since I_x(y) ∩ S_x ⊆ I_x(y) ∩ S_x̄ ⊆ I_x(y), we obtain (4.9). By Lemma 4.7, (1.2) is true for every x_{n+1} ∈ X.

It remains to prove the sufficiency. We shall first show that ‖g‖_B ≥ min_{f∈I_x(y)∩S_x} ‖f‖_B for all g ∈ I_x(y) ∩ B_0. To this end, we express g as g = Σ_{j=1}^m c_j K(x_j, ·) for some m ≥ n and pairwise distinct {x_j : j ∈ N_m} ⊆ X. This can always be done by adding sampling points, setting the corresponding coefficients to zero, and relabeling if necessary. We let y_j := g(x_j), j ∈ N_m, u_l := (y_j : j ∈ N_l), and v_l := {x_j : j ∈ N_l} for 1 ≤ l ≤ m. Note that y = u_n and x = v_n. It follows that g ∈ I_{v_m}(u_m) ∩ S_{v_m} and thus ‖g‖_B ≥ min_{f∈I_{v_m}(u_m)∩S_{v_m}} ‖f‖_B. Since I_{v_m}(u_m) ⊆ I_{v_{m−1}}(u_{m−1}), we apply Lemma 4.7 to get
$$\min_{f\in I_{v_m}(u_m)\cap S_{v_m}}\|f\|_{\mathcal{B}} \ge \min_{f\in I_{v_{m-1}}(u_{m-1})\cap S_{v_m}}\|f\|_{\mathcal{B}} = \min_{f\in I_{v_{m-1}}(u_{m-1})\cap S_{v_{m-1}}}\|f\|_{\mathcal{B}}.$$
It follows that ‖g‖_B ≥ min_{f∈I_{v_{m−1}}(u_{m−1})∩S_{v_{m−1}}} ‖f‖_B. Repeating this process, we reach
$$\|g\|_{\mathcal{B}} \ge \min_{f\in I_{v_n}(u_n)\cap S_{v_n}}\|f\|_{\mathcal{B}} = \min_{f\in I_x(y)\cap S_x}\|f\|_{\mathcal{B}} \quad \text{for all } g\in I_x(y)\cap\mathcal{B}_0. \tag{4.10}$$

Now let g ∈ I_x(y) be arbitrary but fixed. Then there exists a sequence of functions {g_j ∈ B_0 : j ∈ N} that converges to g in B. We let f and f_j be the functions in S_x such that f(x) = y and f_j(x) = g_j(x), j ∈ N. They are explicitly given by
$$f = K^x(\cdot)K[x]^{-1}g(x) \quad\text{and}\quad f_j = K^x(\cdot)K[x]^{-1}g_j(x), \quad j\in\mathbb{N}.$$
Since g_j converges to g in B and point evaluation functionals are continuous on B, g_j(x) → g(x) as j → ∞. As a result, lim_{j→∞} ‖f − f_j‖_B = 0. By (4.10), ‖g_j‖_B ≥ ‖f_j‖_B for all j ∈ N. We hence obtain ‖g‖_B ≥ ‖f‖_B. Therefore, min_{g∈I_x(y)} ‖g‖_B ≥ min_{f∈I_x(y)∩S_x} ‖f‖_B. The reverse direction of the inequality is clear, as I_x(y) ∩ S_x ⊆ I_x(y).

We draw the following conclusion from Proposition 4.6 and Theorem 4.8.

Corollary 4.9. Every acceptable regularized learning scheme of the form (4.1) has a minimizer of the form (4.3) if and only if the function K satisfies property (1.2).

In the last part of the section, we briefly discuss the linear representer theorem in B♯, under the same assumptions that K is bounded and satisfies (A3). By Theorem 3.4, B♯ is an RKBS on X. Likewise, we call a regularized learning scheme
$$f_0 = \operatorname*{argmin}_{f\in\mathcal{B}^\sharp}\ V(f(x)) + \mu\phi(\|f\|_{\mathcal{B}^\sharp}) \tag{4.11}$$
acceptable if V and φ are continuous and (4.2) is satisfied by φ. The space B♯ is said to satisfy the linear representer theorem if every acceptable learning scheme (4.11) has a minimizer of the form
$$f_0 = \sum_{j=1}^n c_j K(\cdot, x_j), \tag{4.12}$$
where the c_j are constants. We follow approaches similar to those used for B to study this important property of B♯.

Proposition 4.10. Let x ⊆ X have pairwise distinct elements. Every acceptable regularized learning scheme (4.11) in B♯ has a minimizer f_0 lying in S_x := span{K(·, x_j) : j ∈ N_n} if and only if there is a minimal norm interpolant
$$f_{\min} := \operatorname*{argmin}_{f\in\mathcal{B}^\sharp,\ f(x)=y}\ \|f\|_{\mathcal{B}^\sharp} \tag{4.13}$$
lying in S_x for all y ∈ C^n.

Proof. The arguments of the proof are similar to those for B. One only needs to note that, although the norm of a function in B♯ may not be known explicitly, any two norms on the finite dimensional vector space S_x are equivalent.

To study conditions ensuring that the minimal norm interpolation (4.13) satisfies the linear representer theorem, we first identify a specific form of the norm ‖·‖_{B♯_0} under the assumption that K satisfies (1.2). Notice that a function f_c = Σ_{j=1}^n c_j K(·, x_j) ∈ S_x ⊆ B♯_0 can be represented as f_c = c^T K_x(·).

Lemma 4.11. Let x have pairwise distinct elements. The function K satisfies (1.2) if and only if
$$\|f_c\|_{\mathcal{B}^\sharp} = \|c^T K[x]\|_\infty \quad \text{for all } f_c = c^T K_x(\cdot),\ c\in\mathbb{C}^n, \tag{4.14}$$
where ‖·‖_∞ denotes the maximum norm on C^n.

Proof. Suppose that K satisfies (1.2) for all x_{n+1} ∈ X \ x. Then for all x ∈ X we have ‖K[x]^{-1}K_x(x)‖_{ℓ1(N_n)} ≤ 1. Let c ∈ C^n and x ∈ X.
It follows from this inequality that
$$|c^T K_x(x)| = |c^T K[x]\,K[x]^{-1}K_x(x)| \le \|c^T K[x]\|_\infty\,\|K[x]^{-1}K_x(x)\|_{\ell^1(\mathbb{N}_n)} \le \|c^T K[x]\|_\infty,$$
which implies by Lemma 3.2 that for f_c = c^T K_x(·)
$$\|f_c\|_{\mathcal{B}^\sharp} = \|c^T K_x(\cdot)\|_{L^\infty(X)} \le \|c^T K[x]\|_\infty.$$
The other direction of the inequality is clear, as
$$\|c^T K[x]\|_\infty = \max\{|c^T K_x(x_j)| : j\in\mathbb{N}_n\} \le \|c^T K_x(\cdot)\|_{L^\infty(X)} = \|f_c\|_{\mathcal{B}^\sharp}.$$

It remains to show that (4.14) implies (1.2). We prove this by construction. For any x_{n+1} ∈ X, we can find a nonzero vector c ∈ C^n such that
$$|c^T K_x(x_{n+1})| = |c^T K[x]\,K[x]^{-1}K_x(x_{n+1})| = \|c^T K[x]\|_\infty\,\|K[x]^{-1}K_x(x_{n+1})\|_{\ell^1(\mathbb{N}_n)}.$$
We then let f_c = c^T K_x(·) and obtain by (4.14)
$$\|c^T K[x]\|_\infty\,\|K[x]^{-1}K_x(x_{n+1})\|_{\ell^1(\mathbb{N}_n)} = |f_c(x_{n+1})| \le \|f_c\|_{L^\infty(X)} = \|f_c\|_{\mathcal{B}^\sharp} = \|c^T K[x]\|_\infty,$$
which implies (1.2), since c^T K[x] is not the zero vector. The proof is complete.

We now show that (1.2) is sufficient for B♯ to satisfy the linear representer theorem.

Theorem 4.12. If K satisfies (1.2) then B♯ satisfies the linear representer theorem.

Proof. Suppose that (1.2) holds true. By Proposition 4.10, it suffices to show that the minimal norm interpolation (4.13) has a minimizer of the form (4.12). We shall prove this by directly showing that f_0 = y^T K[x]^{-1}K_x(·) is a minimizer for (4.13). Let f be an arbitrary function in B♯ such that f(x) = y. Then we have by Lemma 3.2
$$\|f\|_{\mathcal{B}^\sharp} = \|f\|_{L^\infty(X)} \ge \|f(x)\|_\infty = \|y\|_\infty.$$
By Lemma 4.11,
$$\|f_0\|_{\mathcal{B}^\sharp} = \|y^T K[x]^{-1}K[x]\|_\infty = \|y\|_\infty.$$
Combining the above two inequalities leads to ‖f_0‖_{B♯} ≤ ‖f‖_{B♯}. Therefore, (4.13) has the minimizer f_0 = y^T K[x]^{-1}K_x(·), which has the form (4.12).

In the particular case when X has finite cardinality, we shall show that condition (1.2) is also necessary for B♯ to satisfy the linear representer theorem.

Proposition 4.13. If X consists of finitely many points and B♯ satisfies the linear representer theorem then (1.2) holds true.

Proof. Let c ∈ C^n and f_c = c^T K_x(·). Under the assumptions, we get by Proposition 4.10 that f_c is a minimizer for the minimal norm interpolation (4.13) with y = f_c(x) = (K[x])^T c. Since X has finite cardinality and K[x] is nonsingular for all pairwise distinct x ⊆ X, we can find a function g ∈ B♯_0 such that g(x) = y and ‖g‖_{L∞(X)} ≤ ‖y‖_∞. Since f_c is a minimizer of (4.13) and g satisfies g(x) = y,
$$\|f_c\|_{\mathcal{B}^\sharp} \le \|g\|_{\mathcal{B}^\sharp} = \|g\|_{L^\infty(X)} = \|y\|_\infty = \|(K[x])^T c\|_\infty.$$
On the other hand, we have by Lemma 3.2
$$\|f_c\|_{\mathcal{B}^\sharp} = \|f_c\|_{L^\infty(X)} \ge \|f_c(x)\|_\infty = \|(K[x])^T c\|_\infty.$$
By the above two relations, (4.14) holds true. By Lemma 4.11, K satisfies (1.2).

One observes that the key ingredient in the proof of Proposition 4.13 is to extend a function on the discrete set x to a function in B♯ in a way that preserves the supremum norm. In many cases, this is achievable without X being a finite set. For instance, by the Tietze extension theorem in topology, such an extension exists when X is a compact metric space and K is a universal kernel [19] on X. Thus, for those input spaces X and functions K, B♯ satisfies the linear representer theorem if and only if (1.2) holds true.
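Lemma 4.11 has a tangible numerical meaning: when (1.2) holds, the supremum of |f_c| over all of X is already attained on the sampling points. A small sketch (ours) illustrating this for the exponential kernel, with a fine grid standing in for the supremum over X = R:

```python
import numpy as np

# Check of (4.14) for the exponential kernel, for which (1.2) holds (Proposition 5.2):
# sup_x |c^T K_x(x)| should equal ||c^T K[x]||_inf, the maximum over the nodes.
kernel = lambda s, t: np.exp(-abs(s - t))
xs = np.array([0.0, 0.7, 1.5, 3.0])
c = np.array([1.0, -2.0, 0.5, 1.5])

K = np.array([[kernel(xk, xj) for xk in xs] for xj in xs])    # K[x]
node_norm = np.max(np.abs(c @ K))                             # ||c^T K[x]||_inf

grid = np.linspace(-2.0, 5.0, 2001)                           # proxy for the sup over X = R
grid_sup = max(abs(c @ np.array([kernel(x, xj) for xj in xs])) for x in grid)
print(node_norm, grid_sup)   # the two agree, up to grid resolution
```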
5 Examples of Admissible Kernels

Recall the definition of admissible kernels from the introduction. Note that the first requirement, (A1), in the definition implies (3.1). Theorem 1.2 is proved by combining Theorem 3.4 and Corollary 4.9. By this result, admissible kernels are crucial for our construction. Functions K satisfying requirements (A1)–(A3) are usually relatively easy to find; some examples were presented before Proposition 3.7 in Section 3. However, requirement (A4) can be rather demanding and rules out many commonly used kernels. We are able to present two examples of admissible kernels below.

The first example is the Brownian bridge kernel, which arises in the study of the Brownian bridge stochastic process in statistics [3].

Proposition 5.1. The Brownian bridge kernel, defined by
$$K(s,t) := \min\{s,t\} - st, \quad s,t\in(0,1),$$
is an admissible kernel on the input space X = (0,1).

Proof. We start by validating requirement (A4). Let 0 < x_1 < x_2 < ... < x_n < 1 be given and let x ∈ (0,1) be different from x_j, j ∈ N_n. Direct computations show the following:

1. If x < x_1 then K[x]^{-1}K_x(x) = (x/x_1, 0, ..., 0)^T.

2. If x > x_n then K[x]^{-1}K_x(x) = (0, ..., 0, (1−x)/(1−x_n))^T.

3. If x_j < x < x_{j+1} for some j ∈ N_{n−1} then
$$K[x]^{-1}K_x(x) = \Big(0,\dots,0,\ \frac{x_{j+1}-x}{x_{j+1}-x_j},\ \frac{x-x_j}{x_{j+1}-x_j},\ 0,\dots,0\Big)^T.$$

In all cases, it is straightforward to see that ‖K[x]^{-1}K_x(x)‖_{ℓ1(N_n)} ≤ 1. Therefore, requirement (A4) is indeed fulfilled.

To verify the other three requirements, we first observe
$$K(s,t) = \int_0^1 \Gamma_s(z)\Gamma_t(z)\,dz, \quad s,t\in(0,1),$$
where Γ_x := χ_{(0,x)} − x, with χ_A standing for the characteristic function of A ⊆ (0,1). Suppose that K[x]c = 0 for some c ∈ C^n. Then we have
$$\int_0^1\Big|\sum_{j=1}^n c_j\Gamma_{x_j}(z)\Big|^2 dz = c^*K[x]c = 0,$$
which implies that Σ_{j=1}^n c_j Γ_{x_j}(z) = 0 for almost every z ∈ [0,1]. Clearly, the Γ_{x_j}, j ∈ N_n, are linearly independent. Therefore, c_j = 0 for all j ∈ N_n. Requirement (A1) is hence satisfied.

The function K is clearly bounded by 1, so (A2) holds. Suppose that for some c ∈ ℓ1(N) and pairwise distinct x_j ∈ (0,1), j ∈ N,
$$\sum_{j=1}^\infty c_j K(x_j,x) = \int_0^1\Big(\sum_{j=1}^\infty c_j\Gamma_{x_j}(z)\Big)\Gamma_x(z)\,dz = 0 \quad \text{for all } x\in(0,1).$$
This implies that the function φ := Σ_{j=1}^∞ c_j Γ_{x_j} is orthogonal to Γ_x for all x ∈ (0,1), that is,
$$\int_0^x\varphi(t)\,dt - x\int_0^1\varphi(t)\,dt = 0 \quad \text{for all } x\in(0,1).$$
Taking the derivative on both sides of the above equation yields that φ equals a constant C almost everywhere on [0,1]. Namely,
$$\sum_{j=1}^\infty c_j\chi_{[0,x_j]} - \sum_{j=1}^\infty c_j x_j = C \quad \text{almost everywhere}.$$
We now take the derivative of both sides of the equation above in the distributional sense to get Σ_{j∈N} c_j δ_{x_j} = 0. Let j be an arbitrary but fixed positive integer. We can find a sequence of infinitely continuously differentiable functions φ_k, k ∈ N, such that ‖φ_k‖_{L∞([0,1])} ≤ 1, φ_k(x_j) = 1, and the Lebesgue measure of the set where φ_k is nonzero is at most 1/k. For each N ∈ N, we have for sufficiently large k that φ_k(x_l) = 0 for all l ∈ N_N \ {j}. We get for this φ_k
$$0 = \Big|\Big(\sum_{l=1}^\infty c_l\delta_{x_l}\Big)(\varphi_k)\Big| \ge |c_j| - \sum_{l>N}|c_l|.$$
Since Σ_{l>N} |c_l| converges to zero as N → ∞, we have c_j = 0. Therefore, c = 0, as j was arbitrarily chosen, and (A3) holds true. We conclude that all four requirements of an admissible kernel are fulfilled by the Brownian bridge kernel.
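The three closed forms in the proof are easy to confirm numerically; the following sketch (ours) solves the corresponding linear systems directly for a small point set.

```python
import numpy as np

# Numerical confirmation (ours) of the three closed forms in the proof above.
kernel = lambda s, t: min(s, t) - s * t          # Brownian bridge kernel on (0,1)
xs = [0.2, 0.4, 0.7]                             # x_1 < x_2 < x_3
K = np.array([[kernel(xk, xj) for xk in xs] for xj in xs])

def coeffs(x):
    """K[x]^{-1} K_x(x): the coefficient vector whose l1 norm (A4) bounds by 1."""
    return np.linalg.solve(K, np.array([kernel(x, xj) for xj in xs]))

print(coeffs(0.1))   # x < x_1:       (x/x_1, 0, 0) = (0.5, 0, 0)
print(coeffs(0.9))   # x > x_n:       (0, 0, (1-x)/(1-x_n)) = (0, 0, 1/3)
print(coeffs(0.5))   # x_2 < x < x_3: (0, 2/3, 1/3), the hat-function weights
# In every case the l1 norm is at most 1, as (A4) requires.
```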
The second example is the exponential kernel (also called the C⁰ Matérn kernel).

Proposition 5.2. The exponential kernel
$$K(s,t) := e^{-|s-t|}, \quad s,t\in\mathbb{R}, \tag{5.1}$$
is an admissible kernel on R.

Proof. We have seen in Section 3 that this kernel satisfies requirements (A1)–(A3). It remains to check requirement (A4). Let x_1 < x_2 < ... < x_n be given and let x ∈ R be different from x_j, j ∈ N_n. Direct computations show the following:

1. If x < x_1 then K[x]^{-1}K_x(x) = (e^{x−x_1}, 0, ..., 0)^T.

2. If x > x_n then K[x]^{-1}K_x(x) = (0, ..., 0, e^{x_n−x})^T.

3. If x_j < x < x_{j+1} for some j ∈ N_{n−1} then
$$K[x]^{-1}K_x(x) = \Big(0,\dots,0,\ \frac{e^{x_{j+1}-x}-e^{x-x_{j+1}}}{e^{x_{j+1}-x_j}-e^{x_j-x_{j+1}}},\ \frac{e^{x-x_j}-e^{x_j-x}}{e^{x_{j+1}-x_j}-e^{x_j-x_{j+1}}},\ 0,\dots,0\Big)^T.$$

In all cases, ‖K[x]^{-1}K_x(x)‖_{ℓ1(N_n)} ≤ 1. The proof is complete.

Finally, we remark that, by numerical experiments, the Gaussian kernel
$$K(s,t) = \exp\Big(-\frac{(s-t)^2}{\sigma}\Big), \quad s,t\in\mathbb{R},$$
does not satisfy (A4); the sketch at the end of Section 6 reproduces such an experiment. Consequently, neither does the Gaussian kernel (3.4) on R^d. The same situation happens for the inverse multiquadric (3.5) when β = 1/2.
A standard approach [7] to estimating the error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ is to bound it by the sum of the sampling error, the hypothesis error, and the regularization error. Let $g$ be an arbitrary function in $B$ and set, for each function $f: X \to \mathbb{C}$,
$$\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{n} \sum_{j=1}^n | f(x_j) - y_j |^2.$$
The approximation error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ can then be decomposed into the sum of four quantities,
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) = S(\mathbf{z},\mu,g) + P(\mathbf{z},\mu,g) + D(\mu,g) - \mu \| f_{\mathbf{z},\mu} \|_B,$$
where the sampling error, the hypothesis error, and the regularization error are respectively defined by
$$S(\mathbf{z},\mu,g) := \mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mathcal{E}_{\mathbf{z}}(g) - \mathcal{E}(g),$$
$$P(\mathbf{z},\mu,g) := \bigl( \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu \| f_{\mathbf{z},\mu} \|_B \bigr) - \bigl( \mathcal{E}_{\mathbf{z}}(g) + \mu \| g \|_B \bigr),$$
$$D(\mu,g) := \mathcal{E}(g) - \mathcal{E}(f_\rho) + \mu \| g \|_B.$$

Under the condition (A4), $B$ satisfies the linear representer theorem. As a result,
$$\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu \| f_{\mathbf{z},\mu} \|_B = \min_{f \in S_{\mathbf{x}}} \bigl( \mathcal{E}_{\mathbf{z}}(f) + \mu \| f \|_B \bigr) = \min_{f \in B} \bigl( \mathcal{E}_{\mathbf{z}}(f) + \mu \| f \|_B \bigr). \tag{6.3}$$
Immediately, one has $P(\mathbf{z},\mu,g) \le 0$, leading to the estimate
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le S(\mathbf{z},\mu,g) + D(\mu,g).$$
Starting from this inequality, learning rates for $f_{\mathbf{z},\mu}$ can be obtained [25].

To weaken (A4), we should not insist on the linear representer theorem (6.3). Instead, we wish to replace it with the relaxed linear representer theorem
$$\min_{f \in S_{\mathbf{x}}} \mathcal{E}_{\mathbf{z}}(f) + \mu \| f \|_B \le \min_{f \in B} \mathcal{E}_{\mathbf{z}}(f) + \mu \beta_n \| f \|_B, \tag{6.4}$$
where $\beta_n$ is a constant depending on the number $n$ of sampling points, the kernel $K$, and the input space $X$. For simplicity, we suppress the dependence on $K$ and $X$, as they are fixed in our context. The approximation error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ is accordingly decomposed as
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) = S(\mathbf{z},\mu,g) + \tilde P(\mathbf{z},\mu,g) + \tilde D(\mu,g) - \mu \| f_{\mathbf{z},\mu} \|_B,$$
where
$$\tilde P(\mathbf{z},\mu,g) := \bigl( \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu \| f_{\mathbf{z},\mu} \|_B \bigr) - \bigl( \mathcal{E}_{\mathbf{z}}(g) + \mu \beta_n \| g \|_B \bigr), \qquad \tilde D(\mu,g) := \mathcal{E}(g) - \mathcal{E}(f_\rho) + \mu \beta_n \| g \|_B.$$
By (6.4), we keep the advantage that $\tilde P(\mathbf{z},\mu,g) \le 0$. Therefore,
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le S(\mathbf{z},\mu,g) + \tilde D(\mu,g).$$
As long as $\beta_n$ does not increase too fast with $n$, one is still able to obtain a learning rate competitive with those in [25, 30]. We omit the detailed arguments and the assumptions on the kernel $K$, the regression function $f_\rho$, and the input space $X$, as they are similar to those in [25]. We present one result: for all $0 < \delta < 1$ there exists a constant $C_\delta$ such that, with confidence $1 - \delta$,
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le C_\delta \left( (\mu\beta_n)^{\frac{2s}{1+s}} + \frac{\log\frac{2}{\delta}}{n} (\mu\beta_n)^{\frac{2s-2}{1+s}} + \frac{\log\frac{2}{\delta}}{\sqrt{n}} (\mu\beta_n)^{\frac{2s-1}{1+s}} + \frac{\log\frac{2}{\delta} + \log(1+n)}{(\mu\beta_n)^2} \, \beta_n^2 \, n^{-\frac{1}{1+\theta}} \right),$$
where $s \in (0,1)$ represents the regularity of $f_\rho$ and $\theta > 0$ is a positive constant related to assumptions on the kernel $K$ and the input space $X$ [25]. Thus, as long as $\beta_n^2$ does not cancel the decay of the term $n^{-\frac{1}{1+\theta}}$, one still has the hope of obtaining a satisfactory learning rate when $\mu$ is appropriately chosen. We discuss two instances below.

(i) If $\beta_n$ is uniformly bounded with high confidence, then $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ has the same learning rate as that established in [25], namely,
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le C_\delta \, n^{-\frac{s}{1+2s} \cdot \frac{1}{1+\theta}} \log\frac{2+2n}{\delta}. \tag{6.5}$$

(ii) If $\beta_n \le C n^\alpha$ for some positive constants $C$ and $\alpha < \frac{1}{2+2\theta}$, then
$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le C_\delta \, n^{-\frac{s}{1+2s} \left( \frac{1}{1+\theta} - 2\alpha \right)} \log\frac{2+2n}{\delta}. \tag{6.6}$$
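To make the effect of $\alpha$ on the rate in (6.6) concrete, here is a quick arithmetic check; the values of $s$, $\theta$, and $\alpha$ below are illustrative choices of ours, since the paper leaves them abstract.

```python
# Exponent of n in the rate (6.6): s/(1+2s) * (1/(1+theta) - 2*alpha).
# Illustrative values only; alpha = 0 recovers the rate (6.5).
s, theta = 0.5, 1.0
for alpha in (0.0, 0.1, 0.2):            # all below the threshold 1/(2+2*theta) = 0.25
    rate = s / (1 + 2 * s) * (1 / (1 + theta) - 2 * alpha)
    print(f"alpha = {alpha:.1f}: error decays like n^(-{rate:.3f}), up to the log factor")
```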
If we give up the linear representer theorem and pursue the relaxed version (6.4) instead, how can the admissible condition (A4) be weakened? We answer this question next.

Proposition 6.1. If there exists some $\beta_n \ge 1$ such that for all $y \in \mathbb{C}^n$
$$\min_{f \in I_{\mathbf{x}}(y)} \| f \|_B \ge \frac{1}{\beta_n} \min_{f \in I_{\mathbf{x}}(y) \cap S_{\mathbf{x}}} \| f \|_B, \tag{6.7}$$
then the relaxed linear representer theorem (6.4) holds true for any continuous loss function $V$ and any regularization parameter $\mu$.

Proof. Suppose that (6.7) is satisfied. Let $f_0$ be a minimizer of
$$\min_{f \in B} V(f(\mathbf{x})) + \mu \beta_n \| f \|_B.$$
Choose $g$ to be the function in $S_{\mathbf{x}}$ that interpolates $f_0$ at $\mathbf{x}$, namely $g(\mathbf{x}) = f_0(\mathbf{x})$. By (6.7), $\| g \|_B \le \beta_n \| f_0 \|_B$, which yields
$$V(g(\mathbf{x})) + \mu \| g \|_B \le V(f_0(\mathbf{x})) + \mu \beta_n \| f_0 \|_B.$$
Since $g \in S_{\mathbf{x}}$, this establishes (6.4), and the proof is complete.

We next give a characterization of (6.7), which yields a relaxation of the admissible condition (A4) and leads to the relaxed linear representer theorem (6.4).

Theorem 6.2. Condition (6.7) holds true for all $y \in \mathbb{C}^n$ if and only if
$$\| K[\mathbf{x}]^{-1} K_{\mathbf{x}}(t) \|_{\ell^1(\mathbb{N}_n)} \le \beta_n \quad \text{for all } t \in X. \tag{6.8}$$

Proof. The set $I_{\mathbf{x}}(y) \cap S_{\mathbf{x}}$ consists of the single function $f_0 := K_{\mathbf{x}}(\cdot) K[\mathbf{x}]^{-1} y$. Let $g$ be an arbitrary function in $I_{\mathbf{x}}(y) \cap B_0$. By adding sampling points and assigning the corresponding coefficients to be zero if necessary, we may assume $g \in S_{\mathbf{x} \cup \mathbf{t}} \cap I_{\mathbf{x}}(y)$ for some $\mathbf{t} := \{ t_j \in X : j \in \mathbb{N}_m \}$ disjoint from $\mathbf{x}$. Let $b := g(\mathbf{t})$, and denote by $K[\mathbf{t},\mathbf{x}]$ and $K[\mathbf{x},\mathbf{t}]$ the $n \times m$ and $m \times n$ matrices given by
$$(K[\mathbf{t},\mathbf{x}])_{jk} := K(t_k, x_j), \quad j \in \mathbb{N}_n, \; k \in \mathbb{N}_m, \qquad (K[\mathbf{x},\mathbf{t}])_{jk} := K(x_k, t_j), \quad j \in \mathbb{N}_m, \; k \in \mathbb{N}_n.$$
Then
$$\| g \|_B = \left\| \begin{pmatrix} K[\mathbf{x}] & K[\mathbf{t},\mathbf{x}] \\ K[\mathbf{x},\mathbf{t}] & K[\mathbf{t}] \end{pmatrix}^{-1} \begin{pmatrix} y \\ b \end{pmatrix} \right\|_{\ell^1(\mathbb{N}_{n+m})} = \left\| \begin{pmatrix} K[\mathbf{x}]^{-1} y - K[\mathbf{x}]^{-1} K[\mathbf{t},\mathbf{x}] \tilde b \\ \tilde b \end{pmatrix} \right\|_{\ell^1(\mathbb{N}_{n+m})}, \tag{6.9}$$
where
$$\tilde b := \bigl( K[\mathbf{t}] - K[\mathbf{x},\mathbf{t}] K[\mathbf{x}]^{-1} K[\mathbf{t},\mathbf{x}] \bigr)^{-1} \bigl( b - K[\mathbf{x},\mathbf{t}] K[\mathbf{x}]^{-1} y \bigr).$$
Note that as $b$ is allowed to equal any vector in $\mathbb{C}^m$, so is $\tilde b$.

If (6.7) holds true for all $y \in \mathbb{C}^n$, then we choose $\mathbf{t}$ to be a singleton $\{t\}$, $\tilde b = 1$, and $y = K[t,\mathbf{x}] = K_{\mathbf{x}}(t)$ to get
$$\left\| \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right\|_{\ell^1(\mathbb{N}_{n+1})} \ge \frac{1}{\beta_n} \| f_0 \|_B = \frac{1}{\beta_n} \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} = \frac{1}{\beta_n} \| K[\mathbf{x}]^{-1} K_{\mathbf{x}}(t) \|_{\ell^1(\mathbb{N}_n)},$$
which is (6.8).

Conversely, suppose that (6.8) is satisfied. We need to show that for all $g \in I_{\mathbf{x}}(y)$,
$$\| g \|_B \ge \frac{1}{\beta_n} \| f_0 \|_B = \frac{1}{\beta_n} \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)}.$$
We discuss only the case $g \in I_{\mathbf{x}}(y) \cap B_0$, as the general case then follows by the same arguments as those in the last paragraph of the proof of Theorem 4.8. Let $g \in I_{\mathbf{x}}(y) \cap B_0$ have the norm (6.9). Clearly, $\| g \|_B \ge \frac{1}{\beta_n} \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)}$ whenever $\| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} \le \beta_n \| \tilde b \|_{\ell^1(\mathbb{N}_m)}$. When $\| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} > \beta_n \| \tilde b \|_{\ell^1(\mathbb{N}_m)}$, we have by (6.9) and (6.8) that
$$\| g \|_B \ge \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} - \| K[\mathbf{x}]^{-1} K[\mathbf{t},\mathbf{x}] \tilde b \|_{\ell^1(\mathbb{N}_n)} + \| \tilde b \|_{\ell^1(\mathbb{N}_m)}$$
$$\ge \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} - \Bigl( \max_{k \in \mathbb{N}_m} \| K[\mathbf{x}]^{-1} K_{\mathbf{x}}(t_k) \|_{\ell^1(\mathbb{N}_n)} \Bigr) \| \tilde b \|_{\ell^1(\mathbb{N}_m)} + \| \tilde b \|_{\ell^1(\mathbb{N}_m)}$$
$$\ge \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} - (\beta_n - 1) \| \tilde b \|_{\ell^1(\mathbb{N}_m)}$$
$$\ge \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} - (\beta_n - 1) \frac{1}{\beta_n} \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)} = \frac{1}{\beta_n} \| K[\mathbf{x}]^{-1} y \|_{\ell^1(\mathbb{N}_n)},$$
which completes the proof.
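Condition (6.8) is straightforward to estimate numerically for a given kernel and node set. The sketch below is our own illustration: `beta_n` is a made-up helper name, and the Matérn kernel (smoothness parameter 3/2) is our choice of test case. For the exponential kernel the printed value should stay at 1, in line with Proposition 5.2; for other kernels the printout indicates whether $\beta_n$ stays bounded as $n$ grows on quasi-uniform nodes.

```python
import numpy as np

def beta_n(kernel, x, grid):
    """Numerical estimate of sup_t ||K[x]^{-1} K_x(t)||_{l1(N_n)} in (6.8)."""
    K = kernel(x[:, None], x[None, :])                # kernel matrix K[x]
    Kt = kernel(x[:, None], grid[None, :])            # columns are the vectors K_x(t)
    return np.abs(np.linalg.solve(K, Kt)).sum(axis=0).max()

exponential = lambda s, t: np.exp(-np.abs(s - t))
r = lambda s, t: np.sqrt(3.0) * np.abs(s - t)
matern32 = lambda s, t: (1.0 + r(s, t)) * np.exp(-r(s, t))   # Matern kernel, nu = 3/2

grid = np.linspace(-1.0, 1.0, 2001)
for n in (10, 20, 40, 80):
    x = np.linspace(-1.0, 1.0, n)                     # quasi-uniform nodes
    print(f"n = {n:3d}: exponential {beta_n(exponential, x, grid):.3f}, "
          f"matern32 {beta_n(matern32, x, grid):.3f}")
```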
The above result, together with the application of Proposition 6.1 to regularized learning discussed above, provides a relaxation of the requirement (A4). The quantity
$$\sup_{t \in X} \| K[\mathbf{x}]^{-1} K_{\mathbf{x}}(t) \|_{\ell^1(\mathbb{N}_n)}$$
is the Lebesgue constant of kernel interpolation. Requiring it to be bounded exactly by 1 is indeed demanding. Recent numerical experiments [8] and analysis [12] indicate that for many kernels this Lebesgue constant is uniformly bounded. In that case, $\ell^1$-regularized learning in $B$ performs well by (6.5). Furthermore, as long as $\beta_n$ does not increase to infinity too fast, the learning scheme can still work well by (6.6). Specifically, it was proved in [12] that the Lebesgue constant for the reproducing kernel of the Sobolev space on a compact domain is uniformly bounded for quasi-uniform input points (see Theorem 4.6 therein). Another example is given in [8] for translation-invariant kernels $K(x,y) = \phi(x-y)$, $x,y \in \mathbb{R}^d$. It was shown there that as long as
$$c_1 (1 + \| \xi \|_2^2)^{-\tau} \le \hat\phi(\xi) \le c_2 (1 + \| \xi \|_2^2)^{-\tau}, \qquad \| \xi \|_2 > M, \tag{6.10}$$
for some positive constants $c_1$, $c_2$, $M$, and $\tau$, the Lebesgue constant for quasi-uniform inputs is bounded by a multiple of $\sqrt{n}$. Commonly used kernels satisfying (6.10) include Poisson radial functions [10], Matérn kernels, and Wendland's compactly supported kernels [28]. Finally, we remark from numerical experiments that the kernels [20]
$$\exp\bigl( - \| x - y \|_{\ell^p(\mathbb{N}_d)}^\gamma \bigr), \qquad x, y \in \mathbb{R}^d, \; \gamma \in (0,1), \; p = 1, 2,$$
seem to satisfy (A4) for small enough $\gamma$ and moderate $n$. We leave the search for more kernels satisfying (A4), or its relaxation (6.8), as an open question for future study.
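This last remark can be probed numerically in one dimension. The sketch below (our own code, reusing the same style of estimator as before) computes the supremum in (6.8) for $K(x,y) = \exp(-|x-y|^\gamma)$ at a few values of $\gamma$; checking whether the printed values stay at or below 1 is exactly the (A4) test.

```python
import numpy as np

def a4_sup(kernel, x, grid):
    """Estimate sup_t ||K[x]^{-1} K_x(t)||_{l1} over a grid of evaluation points."""
    K = kernel(x[:, None], x[None, :])
    Kt = kernel(x[:, None], grid[None, :])
    return np.abs(np.linalg.solve(K, Kt)).sum(axis=0).max()

x = np.linspace(-1.0, 1.0, 15)                        # moderate n; d = 1, so p plays no role
grid = np.linspace(-1.2, 1.2, 2401)
for gamma in (0.25, 0.5, 0.75):
    k = lambda s, t, g=gamma: np.exp(-np.abs(s - t) ** g)
    print(f"gamma = {gamma}: sup_t l1-norm ~ {a4_sup(k, x, grid):.4f}")
```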
7 Numerical Experiments

We end this paper with a numerical experiment showing that the regularization algorithm (4.1) is indeed able to yield sparse learning, in contrast to the classical regularization network in machine learning. We use the exponential kernel $K$ of (5.1). Let $B$ be the corresponding RKBS with the $\ell^1$ norm constructed by (1.3) and let $H_K$ be the RKHS of $K$. We restrict ourselves to the field of real numbers and use the square loss function $V(f(\mathbf{x})) := \| f(\mathbf{x}) - y \|_2^2$. We compare the two models
$$\min_{f \in B} \| f(\mathbf{x}) - y \|_2^2 + \mu \| f \|_B \qquad \text{and} \qquad \min_{g \in H_K} \| g(\mathbf{x}) - y \|_2^2 + \mu \| g \|_{H_K}^2.$$
Both satisfy the linear representer theorem. Specifically, the minimizers $f_0$ and $g_0$ of the two models are respectively given by $f_0 = K_{\mathbf{x}}(\cdot) b$ with
$$b := \operatorname*{argmin}_{c \in \mathbb{R}^n} \bigl\{ \| K[\mathbf{x}] c - y \|_2^2 + \mu \| c \|_{\ell^1(\mathbb{N}_n)} \bigr\}$$
and $g_0 = K_{\mathbf{x}}(\cdot) h$ with
$$h := \operatorname*{argmin}_{c \in \mathbb{R}^n} \bigl\{ \| K[\mathbf{x}] c - y \|_2^2 + \mu \, c^T K[\mathbf{x}] c \bigr\}.$$
We point out that the above $\ell^1$ minimization problem for $b$ does not have a closed-form solution. Numerous methods have been proposed for this problem; here we employ the proximity algorithm recently developed in [18]. The minimizer $h$ has the well-known closed form $(K[\mathbf{x}] + \mu I_n)^{-1} y$, where $I_n$ denotes the $n \times n$ identity matrix.

For both models, $\mathbf{x}$ is set to be 200 equally spaced points in $[-1,1]$ and the output vector $y$ is the evaluation at $\mathbf{x}$ of the target function
$$f(x) = e^{-|x+1|} + e^{-|x+0.8|} + e^{-|x|} + e^{-|x-0.8|} + e^{-|x-1|}, \qquad x \in [-1,1],$$
perturbed by noise. The regularization parameter $\mu$ for each model is chosen optimally from $\{ 10^j : j = -7, -6, \ldots, 1 \}$ so that the $L^2([-1,1])$ distance between the learned function and the target function is minimized. We then compare the approximation accuracy measured by this error and the sparsity of the two models, where sparsity is measured by the number of nonzero components of the coefficient vectors $b$ and $h$.

We test both models with three types of noise: Gaussian noise with variance 0.01, uniform noise on $[-0.1, 0.1]$, and random salt-and-pepper noise taking values in $\{-0.1, 0.1\}$. For each type of noise we run the experiment 50 times and compute the average approximation error, the average sparsity, and the maximum sparsity over the 50 runs. The results are tabulated below.

          Gaussian noise            Uniform noise             Salt-and-pepper noise
          Error     Sparsity (Max)  Error     Sparsity (Max)  Error     Sparsity (Max)
  RKHS    2.1E-3    200 (200)       7.9E-4    200 (200)       9.4E-4    200 (200)
  RKBS    1.0E-3    13.4 (17)       3.6E-4    14.7 (25)       4.5E-4    14.5 (23)

Table 1: Comparison of least squares regularization in the RKHS and in the RKBS with the $\ell^1$ norm for the exponential kernel.
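For readers who want to reproduce the flavor of this experiment, here is a self-contained sketch of ours. It swaps the proximity algorithm of [18] for FISTA, a standard accelerated soft-thresholding method, and fixes one illustrative value of $\mu$ instead of searching the grid; the error reported is a discrete proxy for the $L^2([-1,1])$ distance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)                       # 200 equally spaced points
kernel = lambda s, t: np.exp(-np.abs(s - t))          # exponential kernel (5.1)
K = kernel(x[:, None], x[None, :])

target = lambda t: sum(np.exp(-np.abs(t - c)) for c in (-1.0, -0.8, 0.0, 0.8, 1.0))
y = target(x) + rng.normal(0.0, 0.1, size=x.size)     # Gaussian noise, variance 0.01

mu = 0.1                                              # illustrative; the paper tunes mu on a grid

# RKHS model: closed-form coefficients h = (K[x] + mu*I)^{-1} y.
h = np.linalg.solve(K + mu * np.eye(x.size), y)

# RKBS model: lasso-type coefficients b via FISTA (a stand-in for the
# proximity algorithm of [18] used in the paper).
L = 2.0 * np.linalg.norm(K, 2) ** 2                   # Lipschitz constant of the gradient
b = z = np.zeros(x.size)
t_k = 1.0
for _ in range(20000):
    g = z - (2.0 / L) * (K @ (K @ z - y))             # gradient step (K is symmetric)
    b_new = np.sign(g) * np.maximum(np.abs(g) - mu / L, 0.0)  # soft thresholding
    t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_k * t_k))
    z = b_new + ((t_k - 1.0) / t_next) * (b_new - b)  # Nesterov momentum
    b, t_k = b_new, t_next

for name, c in (("RKHS", h), ("RKBS", b)):
    err = np.sqrt(np.mean((K @ c - target(x)) ** 2))  # discrete proxy for the L2 error
    print(f"{name}: error ~ {err:.1e}, nonzeros = {(np.abs(c) > 1e-8).sum()}/{c.size}")
```

In runs of this kind, the thresholding step drives most coefficients exactly to zero, which is the sparsity effect reported in Table 1, while the ridge coefficients $h$ are generically all nonzero.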
References

[1] A. Argyriou, C. A. Micchelli, and M. Pontil. When is there a representer theorem? Vector versus matrix regularizers. J. Mach. Learn. Res., 10:2507–2529, 2009.
[2] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
[3] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, Dordrecht, 2004.
[4] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[5] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.
[6] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.), 39(1):1–49, 2002.
[7] F. Cucker and D.-X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2007.
[8] S. De Marchi and R. Schaback. Stability of kernel-based interpolation. Adv. Comput. Math., 32(2):155–161, 2010.
[9] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Adv. Comput. Math., 13(1):1–50, 2000.
[10] B. Fornberg, E. Larsson, and G. Wright. A new class of oscillatory radial basis functions. Comput. Math. Appl., 51(8):1209–1222, 2006.
[11] J. R. Giles. Classes of semi-inner-product spaces. Trans. Amer. Math. Soc., 129:436–446, 1967.
[12] T. Hangelbroek, F. J. Narcowich, and J. D. Ward. Kernel approximation on manifolds I: bounding the Lebesgue constant. SIAM J. Math. Anal., 42(4):1732–1760, 2010.
[13] R. C. James. Characterizations of reflexivity. Studia Math., 23:205–216, 1963/1964.
[14] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33:82–95, 1971.
[15] G. Lumer. Semi-inner-product spaces. Trans. Amer. Math. Soc., 100:29–43, 1961.
[16] C. A. Micchelli and A. Pinkus. Variational problems arising from balancing several error criteria. Rendiconti di Matematica, Serie VII, 14:37–86, 1994.
[17] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Comput., 17(1):177–204, 2005.
[18] C. A. Micchelli, L. Shen, and Y. Xu. Proximity algorithms for image models: denoising. Inverse Problems, 27:045009, 2011.
[19] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. J. Mach. Learn. Res., 7:2651–2667, 2006.
[20] I. J. Schoenberg. Metric spaces and positive definite functions. Trans. Amer. Math. Soc., 44(3):522–536, 1938.
[21] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Computational Learning Theory (Amsterdam, 2001), volume 2111 of Lecture Notes in Comput. Sci., pages 416–426. Springer, Berlin, 2001.
[22] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, 2001.
[23] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, 2004.
[24] G. Song and Y. Xu. Approximation of high-dimensional kernel matrices by multilevel circulant matrices. J. Complexity, 26(4):375–405, 2010.
[25] G. Song and H. Zhang. Reproducing kernel Banach spaces with the ℓ1 norm II: error analysis for regularized least square regression. Neural Comput., 23(10):2713–2729, 2011.
[26] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[27] V. N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, New York, 1998.
[28] H. Wendland. Scattered Data Approximation, volume 17 of Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, Cambridge, 2005.
[29] Z. M. Wu. Compactly supported positive definite radial functions. Adv. Comput. Math., 4(3):283–292, 1995.
[30] Q.-W. Xiao and D.-X. Zhou. Learning by nonsymmetric kernels with data dependent spaces and ℓ1-regularizer. Taiwanese J. Math., 14(5):1821–1836, 2010.
[31] H. Zhang, Y. Xu, and J. Zhang. Reproducing kernel Banach spaces for machine learning. J. Mach. Learn. Res., 10:2741–2775, 2009.
[32] H. Zhang and J. Zhang. Regularized learning in Banach spaces as an optimization problem: representer theorems. J. Global Optim., to appear.
[33] H. Zhang and J. Zhang. Frames, Riesz bases, and sampling expansions in Banach spaces via semi-inner products. Appl. Comput. Harmon. Anal., 31:1–25, 2011.
