Universality, Characteristic Kernels and RKHS Embedding of Measures

A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity test…

Authors: Bharath K. Sriperumbudur, Kenji Fukumizu, Gert R. G. Lanckriet

Bharath K. Sriperumbudur (bharathsv@ucsd.edu)
Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407, USA.

Kenji Fukumizu (fukumizu@ism.ac.jp)
The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan.

Gert R. G. Lanckriet (gert@ece.ucsd.edu)
Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407, USA.

Abstract

A Hilbert space embedding for probability measures has recently been proposed, wherein any probability measure is represented as a mean element in a reproducing kernel Hilbert space (RKHS). Such an embedding has found applications in homogeneity testing, independence testing, dimensionality reduction, etc., with the requirement that the reproducing kernel is characteristic, i.e., the embedding is injective. In this paper, we generalize this embedding to finite signed Borel measures, wherein any finite signed Borel measure is represented as a mean element in an RKHS. We show that the proposed embedding is injective if and only if the kernel is universal. This therefore provides a novel characterization of universal kernels, which are proposed in the context of achieving the Bayes risk by kernel-based classification/regression algorithms. By exploiting this relation between universality and the embedding of finite signed Borel measures into an RKHS, we establish the relation between universal and characteristic kernels.

Keywords: Kernel methods, Characteristic kernels, Hilbert space embeddings, Universal kernels, Translation invariant kernels, Radial kernels, Probability metrics, Binary classification, Homogeneity testing.

1. Introduction

Kernel methods have been popular in machine learning and pattern analysis for their superior performance on a wide spectrum of learning tasks. They are broadly established as an easy way to construct nonlinear algorithms from linear ones, by embedding data points into higher dimensional reproducing kernel Hilbert spaces (RKHSs) (Schölkopf and Smola, 2002; Shawe-Taylor and Cristianini, 2004). Recently, this idea has been generalized to embed probability distributions into RKHSs, which provides a linear method for dealing with higher order statistics (Gretton et al., 2007; Smola et al., 2007; Fukumizu et al., 2008, 2009b; Sriperumbudur et al., 2008, 2009a,b). Formally, given the set of all Borel probability measures defined on the topological space $X$, and the RKHS $(\mathcal{H}, k)$ of functions on $X$ with $k : X \times X \to \mathbb{R}$ as its reproducing kernel (r.k.) that is measurable and bounded, any Borel probability measure, $P$ is embedded as,

$$P \mapsto \int_X k(\cdot, x)\, dP(x). \tag{1}$$

Such an embedding has been found to be useful in many statistical applications like homogeneity testing (Gretton et al., 2007), independence testing (Gretton et al., 2008; Fukumizu et al., 2008), dimensionality reduction (Fukumizu et al., 2004, 2009a), etc., as it provides a powerful and straightforward method of dealing with higher-order statistics of random variables. However, in these applications, it is critical that the embedding in (1) is injective so that probability measures can be distinguished by their images in $\mathcal{H}$. To this end, Fukumizu et al.
(2008) introduced the notion of characteristic kernel (a bounded, measurable $k$ is said to be characteristic if (1) is injective), for which many characterizations have recently been provided (Gretton et al., 2007; Fukumizu et al., 2008, 2009b; Sriperumbudur et al., 2008, 2009a,b).

A natural extension to the above idea of embedding probability measures into an RKHS, $\mathcal{H}$ is to embed finite signed Borel measures, $\mu$ into $\mathcal{H}$ as

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \tag{2}$$

and study the conditions on the kernel, $k$ for which such an embedding is injective. Although the embedding in (2) can be proposed and investigated for mathematical pleasure, we show as one of the main contributions of this paper that under certain conditions on $\mu$ and $X$, the embedding in (2) is closely related to the concept of universal kernels (see Section 1.1 for the formal introduction to universal kernels), which was first proposed by Steinwart (2001), in the context of achieving the Bayes risk in kernel-based classification/regression algorithms, and later extended by Micchelli et al. (2006), Carmeli et al. (2009) and Sriperumbudur et al. (2010) (see footnote 1). This connection shows that the embedding in (2) is not just an abstract mathematical object, but has applications in kernel-based classification/regression algorithms. Using the connection between (2) and universal kernels, we then show how the various notions of universality mentioned above are related to each other. In addition, since the embedding in (2) is a generalization of the embedding in (1), we also demonstrate the relation between characteristic kernels and universal kernels, which extends the preliminary study carried out in Sriperumbudur et al. (2009b, Section 3.4).

In the remainder of this introduction, we provide a comprehensive overview of our contributions which are presented in detail in later sections.
First, in Section 1.1, we introduce universality, briefly discuss various notions of universality that are proposed in literature, and outline our contribution: a measure embedding viewpoint of universality, which is novel and different from the existing viewpoint of approximating functions in some target space by functions in an RKHS. We show that a kernel is universal if and only if the embedding in (2) is injective. Second, in Section 1.2, we discuss our second contribution of relating universal and characteristic kernels.

Footnote 1. The present paper is an extended version of Sriperumbudur et al. (2010).

1.1 Contribution 1: Injective RKHS embedding of finite signed Radon measures to characterize universality

In the regularization approach to learning (Evgeniou et al., 2000), it is well known that kernel-based algorithms (for classification/regression) generally invoke the representer theorem (Kimeldorf and Wahba, 1970; Schölkopf et al., 2001) and learn a function in $\mathcal{H}$ that has the representation,

$$f := \sum_{j \in \mathbb{N}_n} c_j\, k(\cdot, x_j), \tag{3}$$

where $\mathbb{N}_n := \{1, 2, \ldots, n\}$ and $\{c_j : j \in \mathbb{N}_n\} \subset \mathbb{R}$ are parameters typically obtained from training data, $\{x_j : j \in \mathbb{N}_n\} \subset X$. As noted in Micchelli et al. (2006), one can ask whether the function, $f$ in (3) approximates any real-valued target function arbitrarily well as the number of summands increases without bound.
This is an important question to consider because if the answer is affirmative, then the kernel-based learning algorithm is consistent in the sense that for any target function, $f^\star$ (which is usually assumed to belong to some subset of the space of real-valued continuous functions defined on $X$), the discrepancy between $f$ (which is learned from the training data) and $f^\star$ goes to zero (in some sense) as the sample size goes to infinity. Since

$$\Big\{ \sum_{j \in \mathbb{N}_n} c_j\, k(\cdot, x_j) : n \in \mathbb{N},\ \{c_j\} \subset \mathbb{R},\ \{x_j\} \subset X \Big\}$$

is dense in $\mathcal{H}$ (Aronszajn, 1950), and assuming that the kernel-based algorithm makes $f$ "converge to an appropriate function" in $\mathcal{H}$ as $n \to \infty$, the above question of approximating $f^\star$ arbitrarily well by $f$ in (3) as $n$ goes to infinity is equivalent to the question of whether $\mathcal{H}$ is rich enough to approximate any $f^\star$ arbitrarily well, i.e., whether $\mathcal{H}$ is universal. We show that characterizing universal RKHSs (or equivalently, characterizing the corresponding reproducing kernels (r.k.), as any RKHS is uniquely determined by its reproducing kernel) leads to the embedding in (2).

As mentioned above, the goal is to characterize the RKHSs $\mathcal{H}$ that can approximate any $f^\star$ in some target space, usually assumed to be some subset of the space of real-valued continuous functions on $X$. Therefore, depending on the choice of $X$, the choice of target space and the type of approximation, various notions of universality have been proposed (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009; Sriperumbudur et al., 2010), which are briefly discussed in the following paragraphs. The eventual goal is to have a notion of universality that allows comprehensive (and general) necessary and/or sufficient conditions on the reproducing kernel for approximating, in as strong a sense as possible, a class of target functions that is as general as possible.
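The approximation question around (3) can be made concrete with a small numerical sketch. It is an illustration only, not the algorithm analysed in the paper: the Gaussian kernel bandwidth, the equispaced points on the compact interval $[-1, 1]$, the target $\sin(\pi x)$, and the tiny ridge term used to stabilise the linear solve are all assumptions made for this example. The coefficients $c_j$ of the expansion in (3) are obtained by (regularised) interpolation, and the uniform error on a grid is compared for a small and a larger number of summands.

```python
import math


def gaussian_kernel(x, y, sigma=0.3):
    """Gaussian kernel on the real line; sigma is an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def solve_linear_system(a, b):
    """Solve a c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for j in range(col, n + 1):
                m[r][j] -= factor * m[col][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


def fit_kernel_expansion(target, n, ridge=1e-8):
    """Return f = sum_j c_j k(., x_j) fitted to target at n equispaced points.

    The ridge term is an assumption added for numerical stability only.
    """
    xs = [-1.0 + 2.0 * j / (n - 1) for j in range(n)]
    gram = [[gaussian_kernel(xi, xj) + (ridge if i == j else 0.0)
             for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    coeffs = solve_linear_system(gram, [target(x) for x in xs])
    return lambda x: sum(c * gaussian_kernel(x, xj) for c, xj in zip(coeffs, xs))


def sup_error(target, f, grid_size=201):
    """Approximate the uniform norm of target - f on a grid over [-1, 1]."""
    grid = [-1.0 + 2.0 * i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(target(x) - f(x)) for x in grid)


def target(x):
    return math.sin(math.pi * x)


err_small = sup_error(target, fit_kernel_expansion(target, 4))
err_large = sup_error(target, fit_kernel_expansion(target, 12))
print("sup error with n=4 :", err_small)
print("sup error with n=12:", err_large)
```

A shrinking grid error for one smooth target is of course only suggestive; universality is a statement about every function in the target space, which is what the characterizations in this paper address.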
c-universality: Let $C(X)$ denote the space of continuous real-valued functions on some topological space, $X$. Steinwart (2001) considered the above approximation problem when $X$ is a compact metric space, with $f^\star \in C(X)$ and defined a continuous kernel, $k$ as universal (in this paper, we refer to it as c-universal) if its associated RKHS, $\mathcal{H}$ is dense in $C(X)$ w.r.t. the uniform norm (see Section 2 for the definition of uniform norm), i.e., for any $f^\star \in C(X)$, there exists a $g \in \mathcal{H}$ that uniformly approximates $f^\star$. In the context of learning, this indicates that if a kernel is c-universal, then the corresponding kernel-based learning algorithm could be consistent in the sense that any target function, $f^\star \in C(X)$ could be approximated arbitrarily well in the uniform norm by $f$ in (3) as $n$ goes to infinity (see Steinwart and Christmann (2008, Corollary 5.29) for a rigorous result). By applying the Stone-Weierstraß theorem (Folland, 1999, Theorem 4.45), Steinwart (2001) then provided sufficient conditions for a kernel to be c-universal, using which the Gaussian kernel is shown to be c-universal on every compact subset of $\mathbb{R}^d$. As our contribution, in Section 3.1, we completely characterize c-universal kernels by showing that $k$ is c-universal if and only if the embedding in (2) is injective for $\mu \in M_b(X)$, the space of finite signed Radon measures defined on a compact Hausdorff space, $X$ (see Section 2 for a formal definition of $M_b(X)$). It has to be noted that this result is different from and more general (as both necessary and sufficient conditions are provided) than the one by Steinwart (2001, Theorem 9), where only a sufficient condition is provided.
Using this characterization, as a special case, we also obtain necessary and sufficient conditions for a Fourier kernel (see Section 3.3) to be c-universal, while Steinwart (2001) provided only a sufficient condition.

cc-universality: One limitation in the setup considered by Steinwart (2001) is that $X$ is assumed to be compact, which excludes many interesting spaces, such as $\mathbb{R}^d$ and infinite discrete sets. To overcome this limitation, Carmeli et al. (2009, Definition 2, Theorem 3) and Sriperumbudur et al. (2010) approximated any $f^\star \in C(X)$ by some $g \in \mathcal{H}$ uniformly over every compact set, $Z \subset X$, by defining a continuous kernel, $k$ to be universal (in this paper, we refer to it as cc-universal) if the corresponding RKHS, $\mathcal{H}$ is dense in $C(X)$ with the topology of compact convergence, where $X$ is a non-compact Hausdorff space. I.e., for any compact set $Z \subset X$, for any $f^\star \in C(Z)$, there exists a $g \in \mathcal{H}|_Z$ that uniformly approximates $f^\star$. Here, $C(Z)$ is the space of all continuous real-valued functions on $Z$ equipped with the uniform norm, $\mathcal{H}|_Z := \{ f|_Z : f \in \mathcal{H} \}$ is the restriction of $\mathcal{H}$ to $Z$ and $f|_Z$ is the restriction of $f$ to $Z$. As our contribution, in Section 3.1, we show that $k$ is cc-universal if and only if the embedding in (2) is injective for $\mu \in M_{bc}(X)$, the space of compactly supported finite signed Radon measures defined on a non-compact Hausdorff space, $X$. Compared to the characterization by Carmeli et al. (2009, Theorem 4), which deals with the injectivity of a certain integral operator on the space of square-integrable functions, our characterization is easy to understand (as it is related to a generalization of the embedding in (1)) and will naturally lead to understanding the relation between cc-universal and characteristic kernels.
Using this characterization, we also show that $k$ is cc-universal if and only if it is universal in the sense of Micchelli et al. (2006): for any compact $Z \subset X$, the set $K(Z) := \mathrm{span}\{ k(\cdot, y) : y \in Z \}$ is dense in $C(Z)$ in the uniform norm (see Remark 7(b); also see Carmeli et al. (2009, Remark 1)). As examples, many popular kernels on $\mathbb{R}^d$ are shown to be cc-universal (see Sections 3.2 and 3.4; also see Micchelli et al. (2006, Section 4)): Gaussian, Laplacian, $B_{2l+1}$-spline, sinc kernel, etc.

$c_0$-universality: Although cc-universality solves the limitation of c-universality by handling non-compact $X$, the topology of compact convergence considered in cc-universality is weaker than the topology of uniform convergence, i.e., a sequence of functions, $\{f_n\} \subset C(X)$ converging to $f \in C(X)$ in the topology of uniform convergence ensures that they converge in the topology of compact convergence but not vice-versa. So, the natural question to ask is whether we can characterize $\mathcal{H}$ that are rich enough to approximate any $f^\star$ on non-compact $X$ in a stronger sense, i.e., uniformly, by some $g \in \mathcal{H}$. Recently, this has been answered by Carmeli et al. (2009, Definition 2, Theorem 1) and Sriperumbudur et al. (2010), wherein they defined $k$ to be $c_0$-universal if $k$ is bounded, $k(\cdot, x) \in C_0(X)$, $\forall x \in X$ and its corresponding RKHS, $\mathcal{H}$ is dense in $C_0(X)$ w.r.t. the uniform norm, where $X$ is a locally compact Hausdorff (LCH) space and $C_0(X)$ is the Banach space of bounded continuous functions vanishing at infinity, endowed with the uniform norm (see Section 2 for the definition of $C_0(X)$).
As our contribution, in Section 3.1, we present the following necessary and sufficient condition for a kernel to be $c_0$-universal: $k$ is $c_0$-universal if and only if the embedding in (2) is injective for $\mu \in M_b(X)$. It can be seen that this characterization naturally leads to understanding the relation between $c_0$-universal and characteristic kernels, which is not straightforward with the characterization obtained by Carmeli et al. (2009, Theorem 2), wherein $c_0$-universality is characterized by the injectivity of a certain integral operator on the space of square-integrable functions. Using this result, simple necessary and sufficient conditions are derived for translation invariant kernels on $\mathbb{R}^d$ (see Section 3.2), Fourier kernels on $\mathbb{T}^d$, the $d$-Torus (see Section 3.3) and radial kernels on $\mathbb{R}^d$ (see Section 3.4) to be $c_0$-universal. Examples of $c_0$-universal kernels on $\mathbb{R}^d$ include the Gaussian, Laplacian, $B_{2l+1}$-spline, inverse multiquadratics, Matérn class, etc.

$c_b$-universality: The definition of $c_0$-universality deals with $\mathcal{H}$ being dense in $C_0(X)$ w.r.t. the uniform norm, where $X$ is an LCH space. Although the notion of $c_0$-universality addresses limitations associated with both c- and cc-universality, it only approximates a subset of $C(X)$, i.e., it cannot deal with functions in $C(X) \setminus C_0(X)$. This limitation can be addressed by considering a larger class of functions to be approximated. To this end, we propose a notion of universality that is stronger than $c_0$-universality: $k$ is said to be $c_b$-universal if its corresponding RKHS, $\mathcal{H}$ is dense in $C_b(X)$, the space of bounded continuous functions on a topological space, $X$ (note that $C_0(X) \subset C_b(X)$).
This notion of $c_b$-universality is more applicable in learning theory than $c_0$-universality as the target function, $f^\star$ can belong to $C_b(X)$ (which is a more natural assumption) instead of it being restrained to $C_0(X)$ (note that $C_0(X)$ only contains functions that vanish at infinity). We show in Section 3.1 that $k$ is $c_b$-universal if and only if the embedding in (2) is injective for $\mu$ belonging to a certain class of set functions (see Section 2 for the definition of set functions) defined on a normal topological space, $X$ (see Theorem 6 for details). Because of the technicalities involved in dealing with set functions, in this paper, we do not fully analyze this notion of universality unlike the other aforementioned notions, although it is an interesting problem to be resolved because of its applicability in learning theory.

Based on the above discussion that relates injectivity of the embedding in (2) to various notions of universality, we also show how these notions of universality are related. If $X$ is compact, the notions of c-, cc-, $c_0$- and $c_b$-universality are equivalent. On the other hand, if $X$ is not compact, the notion of $c_0$-universality is stronger than cc-universality. I.e., if a kernel is $c_0$-universal, then it is cc-universal but not vice-versa (for example, the Gaussian kernel on $\mathbb{R}^d$ is shown to be $c_0$-universal and therefore is cc-universal, while the sinc kernel is cc-universal but not $c_0$-universal). We show in Section 3.4 that the converse is true in the case of radial kernels on $\mathbb{R}^d$. Similarly, when $X$ is not compact (but an LCH space), the notion of $c_b$-universality is stronger than $c_0$-universality, and therefore cc-universality. A summary of the relationship between various notions of universality is shown in Figure 1.
To summarize our first contribution, we show that, by appropriately choosing $X$ and $\mu$ in (2), the injectivity of the embedding in (2) completely characterizes various notions of universality that are proposed in literature. Using this connection between universality and the injectivity of the embedding in (2), we relate all these notions of universality, which is summarized in Figure 1.

1.2 Contribution 2: Relation between characteristic and universal kernels

Gretton et al. (2007) related universality and the characteristic property of $k$ by showing that if $k$ is c-universal, then it is characteristic. Besides this result, not much is known or understood about the relation between universal and characteristic kernels. In Section 4.1, we relate universality and characteristic kernels by using the results in Section 3.1 that relate universality and the RKHS embedding of Radon measures. As an example, we show that a translation invariant kernel on $\mathbb{R}^d$ (in general, any locally compact Abelian group) or a radial kernel on $\mathbb{R}^d$ is $c_0$-universal if and only if it is characteristic. We also show that the converse to the result by Gretton et al. (2007) is not true, i.e., if a kernel is characteristic, it need not be c-universal (see Sriperumbudur et al., 2009b, Corollary 15). A summary of the relation between universal and characteristic kernels is shown in Figure 1.

Using the embedding in (1), Gretton et al. (2007) proposed a metric, called the maximum mean discrepancy (MMD), on the space of all Borel probability measures, when $k$ is characteristic.
One important theoretical question that is usually considered for metrics on probability measures is (Dudley, 2002, Chapter 11): "What is the nature of the topology induced by the probability metric in relation to the usual weak topology?" In probability theory, this question is important in understanding and proving central limit theorems. Although $k$ being characteristic is sufficient for MMD to be a metric, we show in Section 4.2 that a notion stronger than the characteristic property is required to answer the above question. In particular, we show in Proposition 24 that if $X$ is an LCH space and $k$ is $c_0$-universal, then the topology induced by MMD coincides with the usual weak topology on the space of Radon probability measures defined on $X$ (see footnote 2). This result can be used to compare MMD to other probability metrics, such as the Dudley metric, total variation distance, Wasserstein distance, etc. We refer to Sriperumbudur et al. (2009b) for a detailed study on the comparison of MMD to other probability metrics.

To summarize, our main contributions in this paper are:

(a) To establish the relationship between various notions of universality and the RKHS embedding, shown in (2), of finite signed Radon measures, and in turn present a novel measure embedding viewpoint of universality compared to the classical function approximation viewpoint.

(b) To clarify the relationship between universal and characteristic kernels.

Footnote 2. Sriperumbudur et al. (2009b) showed that if $X$ is a compact metric space and $k$ is c-universal, then the topology induced by MMD coincides with the usual weak topology. The result for non-compact $X$ was left as an open question and is addressed in this paper, by applying the notion of $c_0$-universality.

A summary of the results in this paper is shown in Figure 1.
In the following section, we introduce the notation and some definitions that are used throughout the paper. Supplementary results used in proofs are collected in Appendix A.

2. Definitions & Notation

Let $X$ be a topological space. $C(X)$ denotes the space of all continuous functions on $X$. $C_b(X)$ is the space of all bounded, continuous functions on $X$. For a locally compact Hausdorff space, $X$, $f \in C(X)$ is said to vanish at infinity if for every $\epsilon > 0$ the set $\{ x : |f(x)| \ge \epsilon \}$ is compact. The class of all continuous $f$ on $X$ which vanish at infinity is denoted as $C_0(X)$. The spaces $C_b(X)$ and $C_0(X)$ are endowed with the uniform norm, $\|\cdot\|_u$ defined as $\|f\|_u := \sup_{x \in X} |f(x)|$ for $f \in C_0(X) \subset C_b(X)$. If $Y$ denotes a topological vector space, we denote by $Y'$ the vector space of continuous linear functionals on $Y$, and $Y'$ is called the topological dual space (in this paper, we simply refer to it as the dual). For a set $A$, we denote its interior as $A^\circ$.

Radon measure: A signed Radon measure $\mu$ on a Hausdorff space $X$ is a Borel measure on $X$ satisfying

(i) $\mu(C) < \infty$ for each compact subset $C \subset X$,
(ii) $\mu(B) = \sup\{ \mu(C) \mid C \subset B,\ C \text{ compact} \}$ for each $B$ in the Borel $\sigma$-algebra of $X$.

$\mu$ is said to be finite if $\|\mu\| := |\mu|(X) < \infty$, where $|\mu|$ is the total-variation of $\mu$. $M^b_+(X)$ denotes the space of all finite Radon measures on $X$ while $M_b(X)$ denotes the space of all finite signed Radon measures on $X$. The space of all Radon probability measures is denoted as $M^1_+(X) := \{ \mu \in M^b_+(X) : \mu(X) = 1 \}$. For $\mu \in M_b(X)$, the support of $\mu$ is defined as

$$\mathrm{supp}(\mu) = \{ x \in X \mid \text{for any open set } U \text{ such that } x \in U,\ |\mu|(U) \neq 0 \}. \tag{4}$$

$M_{bc}(X)$ denotes the space of all compactly supported finite signed Radon measures on $X$. We refer the reader to Berg et al.
(1984, Chapter 2) for a general reference on the theory of Radon measures.

Finitely additive, regular set function: A set function is a function defined on a family of sets, and has values in $[-\infty, +\infty]$. A set function $\mu$ defined on a family $\tau$ of sets is said to be finitely additive if $\emptyset \in \tau$, $\mu(\emptyset) = 0$ and $\mu(\cup_{l=1}^n A_l) = \sum_{l=1}^n \mu(A_l)$, for every finite family $\{A_1, \ldots, A_n\}$ of disjoint subsets of $\tau$ such that $\cup_{l=1}^n A_l \in \tau$. A field of subsets of a set $X$ is a non-empty family, $\Sigma$, of subsets of $X$ such that $\emptyset \in \Sigma$, $X \in \Sigma$, and for all $A, B \in \Sigma$, we have $A \cup B \in \Sigma$ and $B \setminus A \in \Sigma$. An additive set function $\mu$ defined on a field $\Sigma$ of subsets of a topological space $X$ is said to be regular if for each $A \in \Sigma$ and $\epsilon > 0$, there exists $B \in \Sigma$ whose closure is contained in $A$ and there exists $C \in \Sigma$ whose interior contains $A$ such that $|\mu(D)| < \epsilon$ for every $D \in \Sigma$ with $D \subset C \setminus B$.

Figure 1 (panels (a)-(d); image not reproduced): Summary of results: The relationships between various notions are shown along with the reference. The letters "P", "R" and "T" refer to Proposition, Remark and Theorem respectively. For example, P. 7 refers to Proposition 7. The implications which are open problems are shown with "?". The trivial implications are shown without any reference. (a) $X$ is an LCH space. Refer to Section 2 for the definition of $M_b(X)$ and $M_{bc}(X)$. (b) The implications shown hold for any compact Hausdorff space, $X$. However, when $X = \mathbb{T}^d$, the $d$-Torus, with $k(x, y) = \psi((x - y) \bmod 2\pi)$, where $\psi \in C(\mathbb{T}^d)$ is a positive definite (pd) function, the implication between characteristic and strictly pd, shown as (A 2) is valid, which follows from Proposition 14 and Theorem 15. (c) $X = \mathbb{R}^d$ and $k(x, y) = \psi(x - y)$, where $\psi \in C_b(\mathbb{R}^d)$ is a pd function and the Fourier transform of a finite non-negative Borel measure, $\Lambda$ (see Theorem 10 for details). If $\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$, then the implication shown as ($\spadesuit$) holds. Otherwise, it is not clear whether the implication holds. For a set $A$, $A^\circ$ represents its interior. (d) $X = \mathbb{R}^d$ and $k(x, y) = \varphi(\|x - y\|_2^2)$, where $\varphi$ is the Laplace transform of a finite non-negative Borel measure, $\nu$ on $[0, \infty)$ (see (21)).

Positive definite (pd), strictly pd and conditionally strictly pd: A function $k : X \times X \to \mathbb{R}$ is called positive definite (pd) (resp. conditionally pd) if, for all $n \in \mathbb{N}$ (resp. $n \ge 2$), $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ (resp. with $\sum_{j=1}^n \alpha_j = 0$) and all $x_1, \ldots, x_n \in X$, we have

$$\sum_{l,j=1}^n \alpha_l \alpha_j\, k(x_l, x_j) \ge 0. \tag{5}$$

Furthermore, $k$ is said to be strictly pd (resp. conditionally strictly pd) if, for mutually distinct $x_1, \ldots, x_n \in X$, equality in (5) only holds for $\alpha_1 = \cdots = \alpha_n = 0$.

Fourier transform in $\mathbb{R}^d$: For $X \subset \mathbb{R}^d$, let $L^p(X)$ denote the Banach space of $p$-power ($p \ge 1$) integrable functions w.r.t. the Lebesgue measure. For $f \in L^1(\mathbb{R}^d)$, $\hat{f}$ and $\check{f}$ represent the Fourier transform and inverse Fourier transform of $f$ respectively, defined as

$$\hat{f}(y) := (2\pi)^{-d/2} \int_{\mathbb{R}^d} e^{-i y^T x} f(x)\, dx, \quad y \in \mathbb{R}^d, \tag{6}$$

$$\check{f}(x) := (2\pi)^{-d/2} \int_{\mathbb{R}^d} e^{i x^T y} f(y)\, dy, \quad x \in \mathbb{R}^d, \tag{7}$$

where $i$ denotes the imaginary unit $\sqrt{-1}$. For a finite Borel measure, $\mu$ on $\mathbb{R}^d$, the Fourier transform of $\mu$ is given by

$$\hat{\mu}(\omega) = \int_{\mathbb{R}^d} e^{-i \omega^T x}\, d\mu(x), \quad \omega \in \mathbb{R}^d, \tag{8}$$

which is a bounded, uniformly continuous function on $\mathbb{R}^d$.
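For a measure with finitely many atoms, the integral in (8) reduces to a finite sum, which makes the boundedness claim easy to check numerically. The sketch below is illustrative only: the one-dimensional setting ($d = 1$), the particular signed measure $\mu = \delta_{-1} - 2\delta_0 + \delta_1$, and the function name `fourier_transform_discrete` are assumptions chosen for this example. It verifies that $|\hat{\mu}(\omega)|$ never exceeds the total variation $\|\mu\| = \sum_j |a_j|$ and that $\hat{\mu}(0) = \mu(\mathbb{R})$.

```python
import cmath
import math


def fourier_transform_discrete(weights, atoms, omega):
    """Evaluate (8) for mu = sum_j weights[j] * delta_{atoms[j]} on the real line."""
    return sum(a * cmath.exp(-1j * omega * x) for a, x in zip(weights, atoms))


# Hypothetical example measure: mu = delta_{-1} - 2 * delta_0 + delta_{1}.
weights = [1.0, -2.0, 1.0]
atoms = [-1.0, 0.0, 1.0]

# Total variation norm of a discrete signed measure with distinct atoms.
total_variation = sum(abs(a) for a in weights)

omegas = [0.1 * i for i in range(-100, 101)]
values = [fourier_transform_discrete(weights, atoms, w) for w in omegas]

# For this measure the sum collapses to 2*cos(omega) - 2.
closed_form = [2.0 * math.cos(w) - 2.0 for w in omegas]

max_modulus = max(abs(v) for v in values)
print("total variation:", total_variation)
print("largest |mu_hat| on the grid:", max_modulus)
```

The comparison with `closed_form` is specific to this symmetric three-atom measure; for other weights and atoms only the total-variation bound should be expected to carry over.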
Holomorphic and entire functions: Let $D \subset \mathbb{C}^d$ be an open subset and $f : D \to \mathbb{C}$ be a function. $f$ is said to be holomorphic at the point $z_0 \in D$ if

$$f'(z_0) := \lim_{z \to z_0} \frac{f(z_0) - f(z)}{z_0 - z}$$

exists. Moreover, $f$ is called holomorphic if it is holomorphic at every $z_0 \in D$. $f$ is called an entire function if $f$ is holomorphic and $D = \mathbb{C}^d$.

3. Characterization of Universal Kernels

In Section 1, we have briefly discussed the relation between the embedding in (2) and various notions of universality. In Section 3.1, we present and prove our main result (Theorem 6), which relates universality and the embedding in (2). Theorem 6 shows that under appropriate assumptions on $\mu$ and $X$, the injectivity of the embedding in (2) is necessary and sufficient for a kernel to be c-, cc-, $c_0$- or $c_b$-universal. Using this result, it is shown that the notion of $c_0$-universality is stronger than that of cc-universality, i.e., if $k$ is $c_0$-universal, then it is cc-universal but not vice-versa. Then, in Proposition 8, we obtain alternate necessary and sufficient conditions for the embedding in (2) to be injective, which resembles a condition for the kernel to be strictly pd (but not quite so!). However, in Proposition 8, we show that strict positive definiteness of $k$ is a necessary condition for the embedding in (2) to be injective, i.e., for $k$ to be universal. Using the characterization obtained in Proposition 8, in Sections 3.2–3.5, we derive characterizations for universality that are easy to check, for specific classes of kernels, e.g., translation invariant kernels on $\mathbb{R}^d$ and $\mathbb{T}^d$, radial kernels on $\mathbb{R}^d$, Taylor-type kernels on $\mathbb{R}^d$, etc. The results of this section are summarized in Figure 1.

Before characterizing various notions of universality, let us revisit their formal definitions.
Definition 1 (c-universal) A continuous kernel $k$ on a compact Hausdorff space $X$ is called c-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C(X)$ w.r.t. the uniform norm, i.e., for every function $g \in C(X)$ and all $\epsilon > 0$, there exists an $f \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$.

Definition 2 (cc-universal) A continuous kernel $k$ on a Hausdorff space $X$ is said to be cc-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C(X)$ endowed with the topology of compact convergence, i.e., for any compact set $Z \subset X$, for any $g \in C(Z)$ and all $\epsilon > 0$, there exists an $f \in \mathcal{H}|_Z$ such that $\|f - g\|_u \le \epsilon$.

Definition 3 ($c_0$-universal) A bounded kernel, $k$ with $k(\cdot, x) \in C_0(X)$, $\forall x \in X$ on a locally compact Hausdorff space, $X$ is said to be $c_0$-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C_0(X)$ w.r.t. the uniform norm, i.e., for every function $g \in C_0(X)$ and all $\epsilon > 0$, there exists an $f \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$.

Definition 4 ($c_b$-universal) A bounded continuous kernel, $k$ on a topological space, $X$, is said to be $c_b$-universal if the RKHS, $\mathcal{H}$ induced by $k$ is dense in $C_b(X)$ w.r.t. the uniform norm, i.e., for any $g \in C_b(X)$ and all $\epsilon > 0$, there exists an $f \in \mathcal{H}$ such that $\|f - g\|_u \le \epsilon$.

First note that the above definitions are valid only if $\mathcal{H}$ is included in the appropriate target space, i.e., $C(X)$ for c- and cc-universality, $C_0(X)$ for $c_0$-universality, and $C_b(X)$ for $c_b$-universality. By Steinwart and Christmann (2008, Lemma 4.28, Theorem 4.61), the assumptions made on the kernel in the above definitions ensure that the definitions are valid. Also note that all these definitions are equivalent when $X$ is compact as $C_0(X) = C_b(X) = C(X)$ for compact $X$.
When $X$ is not compact, it is easy to see that $c_b$-universality is stronger than $c_0$-universality, i.e., if $k$ is $c_b$-universal, then it is also $c_0$-universal, but not vice-versa. On the other hand, it is not straightforward to see how the notions of cc-universal and $c_0$-universal are related when $X$ is non-compact. By characterizing $c_0$-universality and cc-universality, Theorem 6 in the following section shows that the notion of $c_0$-universality is stronger than cc-universality, i.e., if a kernel is $c_0$-universal, then it is cc-universal, but not vice-versa. Based on these results, it follows that $c_b$-universality is stronger than cc-universality (but not vice-versa), when $X$ is non-compact.

3.1 Main results

Before we state our main result, i.e., Theorem 6, we need the following result, usually referred to as the Hahn-Banach theorem, which we quote from Rudin (1991, Theorem 3.5) (also see the remark following Theorem 3.5 in Rudin (1991)).

Theorem 5 (Hahn-Banach) Suppose $A$ is a subspace of a locally convex topological vector space $Y$. Then $A$ is dense in $Y$ if and only if $A^\perp = \{0\}$, where

$$A^\perp := \{ T \in Y' : \forall x \in A,\ T(x) = 0 \}. \tag{9}$$

The following main result of this paper, which presents a necessary and sufficient condition for $k$ to be c-, cc-, $c_0$- or $c_b$-universal, hinges on the above theorem, where we choose $A$ to be the RKHS, $\mathcal{H}$ and $Y$ to be $C(X)$, $C_0(X)$ or $C_b(X)$ for which $Y'$ is known through the Riesz representation theorem.

Theorem 6 (Characterization of universal kernels) The following hold:

(a) Let $X$ be a compact Hausdorff space with $k$ being continuous. Then $k$ is c-universal if and only if the embedding,

$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_b(X), \tag{10}$$

is injective.

(b) Let $X$ be an LCH space and $k \in C_b(X \times X)$.
Then $k$ is cc-universal if and only if the embedding,
$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_{bc}(X), \tag{11}$$
is injective.
(c) Let $X$ be an LCH space with the kernel, $k$, being bounded and $k(\cdot, x) \in C_0(X)$, $\forall x \in X$. Then $k$ is $c_0$-universal if and only if the embedding,
$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_b(X), \tag{12}$$
is injective.
(d) Let $X$ be a normal topological space and let $M_{rba}(X)$ be the space of all finitely additive, regular, bounded set functions defined on the field generated by the closed sets of $X$. Then, a bounded continuous kernel, $k$, is $c_b$-universal if and only if the embedding,
$$\mu \mapsto \int_X k(\cdot, x)\, d\mu(x), \quad \mu \in M_{rba}(X), \tag{13}$$
is injective.

Proof First, we prove (c), from which (a) follows.
(c) By Definition 3, $k$ is $c_0$-universal if $\mathcal{H}$ is dense in $C_0(X)$. We now invoke Theorem 5 to characterize the denseness of $\mathcal{H}$ in $C_0(X)$, which means we need to consider the dual $C_0'(X) := (C_0(X))'$ of $C_0(X)$. By the Riesz representation theorem (Folland, 1999, Theorem 7.17), $C_0'(X) = M_b(X)$ in the sense that there is a bijective linear isometry $\mu \mapsto T_\mu$ from $M_b(X)$ onto $C_0'(X)$, given by the natural mapping,
$$T_\mu(f) = \int_X f\, d\mu, \quad f \in C_0(X). \tag{14}$$
Therefore, by Theorem 5, $\mathcal{H}$ is dense in $C_0(X)$ if and only if
$$\mathcal{H}^\perp := \left\{ \mu \in M_b(X) : \forall f \in \mathcal{H},\ \int_X f\, d\mu = 0 \right\} = \{0\}. \tag{15}$$
($\Leftarrow$) Suppose (12) is injective, i.e., for $\mu \in M_b(X)$, $\int_X k(\cdot, x)\, d\mu(x) = 0 \Rightarrow \mu = 0$. Then by Lemma 26 (see Appendix A), we have
$$\int_X f\, d\mu = \left\langle f, \int_X k(\cdot, x)\, d\mu(x) \right\rangle_{\mathcal{H}} = 0,\ \forall f \in \mathcal{H} \ \Rightarrow\ \mu = 0,$$
which by (15) means $\mathcal{H}$ is dense in $C_0(X)$ and therefore $k$ is $c_0$-universal.
($\Rightarrow$) We need to prove that if $\mathcal{H}$ is dense in $C_0(X)$ then $\left(\int_X k(\cdot, x)\, d\mu(x) = 0 \Rightarrow \mu = 0\right)$ holds.
This is equivalent to showing that if $\left(\int_X k(\cdot, x)\, d\mu(x) = 0 \Rightarrow \mu = 0\right)$ does not hold, then $\mathcal{H}$ is not dense in $C_0(X)$. Suppose $\left(\int_X k(\cdot, x)\, d\mu(x) = 0 \Rightarrow \mu = 0\right)$ does not hold, i.e., $\exists\, 0 \ne \mu \in M_b(X)$ such that $\int_X k(\cdot, x)\, d\mu(x) = 0$, which means $\exists\, 0 \ne \mu \in M_b(X)$ such that $\int_X f\, d\mu = 0$ for every $f \in \mathcal{H}$; then, by (15), $\mathcal{H}$ is not dense in $C_0(X)$.
(a) When $X$ is compact, $C_0(X)$ coincides with $C(X)$, which means c-universality and $c_0$-universality are equivalent. Therefore, $k$ is c-universal if and only if the embedding in (10) is injective.
(b) The proof is similar to that of (c), except that we need to consider the dual of $C(X)$ endowed with the topology of compact convergence (a locally convex topological vector space) to characterize the denseness of $\mathcal{H}$ in $C(X)$. It is known (Hewitt, 1950) that $C'(X) = M_{bc}(X)$ in the sense that there is a bijective linear isometry $\mu \mapsto T_\mu$ from $M_{bc}(X)$ onto $C'(X)$, given by the natural mapping, $T_\mu(f) = \int_X f\, d\mu$, $f \in C(X)$. The rest of the proof is verbatim with $M_b(X)$ replaced by $M_{bc}(X)$.
(d) The proof is very similar to that of (c), wherein we identify $(C_b(X))' \cong M_{rba}(X)$ such that $T \in (C_b(X))'$ and $\mu \in M_{rba}(X)$ satisfy $T(f) = \int_X f\, d\mu$, $f \in C_b(X)$ (Dunford and Schwartz, 1958, p. 262). Here, $\cong$ represents the isometric isomorphism. The rest of the proof is verbatim with $M_b(X)$ replaced by $M_{rba}(X)$.

Theorem 6 can also be interpreted as: for appropriate assumptions on $X$ and $\mu$, the embedding in (2) is injective if and only if the kernel is universal, therefore relating universality and injective RKHS embedding of finite signed Radon measures. In other words, Theorem 6 provides a novel measure embedding viewpoint of universality compared to its well-known function approximation viewpoint.
Based on Theorem 6, the following remarks can be made.

Remark 7 (a) Theorem 6 provides a necessary and sufficient condition for c-universality ($k$ is c-universal if and only if the embedding in (10) is injective), while Steinwart (2001) provided only a sufficient condition (in terms of the feature maps being an algebra; see Steinwart and Christmann (2008, Theorem 4.56) for details) using the Stone-Weierstraß theorem. Therefore, Theorem 6 differs from and generalizes the result by Steinwart (2001).
(b) Note that the embedding in (11) is injective if and only if for any compact set $Z \subset X$, the embedding
$$\mu \mapsto \int_Z k(\cdot, x)\, d\mu(x), \quad \mu \in M_b(Z), \tag{16}$$
is injective. Micchelli et al. (2006, Proposition 1) have shown that for any compact set $Z \subset X$, the embedding in (16) is injective if and only if the set $K(Z) = \mathrm{span}\{k(\cdot, y) : y \in Z\}$ is dense in $C(Z)$ w.r.t. the uniform norm. Therefore, it is clear that $k$ is cc-universal if and only if it is universal in the sense of Micchelli et al. (2006). See also Carmeli et al. (2009, Remark 1).
(c) By comparing the embeddings in (11) and (12), since $M_{bc}(X) \subset M_b(X)$, it is clear that $c_0$-universality is stronger than cc-universality, i.e., if a kernel is $c_0$-universal (satisfies (12)), then it is cc-universal (satisfies (11)). In general, the converse is not true (see Proposition 11 and Example 1). However, we will show these notions to be equivalent in the case of radial kernels on $\mathbb{R}^d$ (see Proposition 16).
(d) Carmeli et al.
(2009, Theorems 2, 4) provided characterizations for $c_0$- and cc-universality in terms of the injectivity of an integral operator on the space of square-integrable functions, whereas our characterizations in Theorem 6 deal with the injectivity of an embedding that maps finite signed Radon measures into an RKHS, $\mathcal{H}$. Since the latter can be seen as a generalization of the embedding in (1) that deals with characteristic kernels, our characterizations can be used in a straightforward way to relate universal and characteristic kernels (see Section 4 for details).
(e) Note that $M_{rba}(X)$ in (13) does not contain any measure (though a set function in $M_{rba}(X)$ can be extended to a measure), as measures are countably additive and defined on a $\sigma$-field. Since $\mu$ in Theorem 6(d) is not a measure but a finitely additive set function defined on a field, it is not clear how to deal with the integral in (13). Because of the technicalities involved in dealing with set functions, we do not further pursue the notion of $c_b$-universality in this paper.

Based on Theorem 6, the following result provides an alternate and equivalent characterization of universality or injectivity of the embedding in (2), which is easier to interpret, as it resembles the condition of $k$ being strictly pd (though not quite exactly the same). This alternate characterization is then used in Sections 3.2–3.4 to obtain easily checkable conditions for the universality of specific classes of kernels. We also show that strictly pd is a necessary condition for universality.

Proposition 8 Suppose the assumptions in Theorem 6 hold. Then,
(a) $k$ is c-universal if and only if
$$\iint_X k(x, y)\, d\mu(x)\, d\mu(y) > 0, \quad \forall\, 0 \ne \mu \in M_b(X). \tag{17}$$
(b) $k$ is cc-universal if and only if
$$\iint_X k(x, y)\, d\mu(x)\, d\mu(y) > 0, \quad \forall\, 0 \ne \mu \in M_{bc}(X). \tag{18}$$
(c) $k$ is $c_0$-universal if and only if
$$\iint_X k(x, y)\, d\mu(x)\, d\mu(y) > 0, \quad \forall\, 0 \ne \mu \in M_b(X). \tag{19}$$
(d) If $k$ is c-, cc- or $c_0$-universal, then it is strictly pd.

Proof We only prove (c). The proof of (b) is exactly the same as that of (c) with $M_b(X)$ replaced by $M_{bc}(X)$, while the proof of (a) is trivial.
(c) ($\Leftarrow$) Suppose $k$ is not $c_0$-universal. By Theorem 6(c), there exists $0 \ne \mu \in M_b(X)$ such that $\int_X k(\cdot, x)\, d\mu(x) = 0$, which implies $\left\| \int_X k(\cdot, x)\, d\mu(x) \right\|_{\mathcal{H}} = 0$. This means
$$0 = \left\langle \int_X k(\cdot, x)\, d\mu(x), \int_X k(\cdot, x)\, d\mu(x) \right\rangle_{\mathcal{H}} \overset{(e)}{=} \iint_X k(x, y)\, d\mu(x)\, d\mu(y),$$
where (e) follows from Lemma 26 (see Appendix A). By our assumption in (19), this leads to a contradiction. Therefore, if (19) holds, then $k$ is $c_0$-universal.
($\Rightarrow$) Suppose there exists $0 \ne \mu \in M_b(X)$ such that $\iint_X k(x, y)\, d\mu(x)\, d\mu(y) = 0$, i.e., $\left\| \int_X k(\cdot, x)\, d\mu(x) \right\|_{\mathcal{H}} = 0$, which implies $\int_X k(\cdot, x)\, d\mu(x) = 0$. Therefore, the embedding in (12) is not injective, which by Theorem 6 implies that $k$ is not $c_0$-universal. Therefore, if $k$ is $c_0$-universal, then $k$ satisfies (19).
(d) Suppose $k$ is not strictly pd. This means for some $n \in \mathbb{N}$ and for mutually distinct $x_1, \ldots, x_n \in X$, there exists $\mathbb{R} \ni \alpha_j \ne 0$ for some $j \in \{1, \ldots, n\}$ such that
$$\sum_{l,j=1}^{n} \alpha_l \alpha_j k(x_l, x_j) = 0. \tag{20}$$
Define $\mu := \sum_{j=1}^{n} \alpha_j \delta_{x_j}$, where $\delta_x$ represents the Dirac measure at $x$. Clearly $\mu \ne 0$ and $\mu \in M_{bc}(X)$. From (20), it is clear that $\iint_X k(x, y)\, d\mu(x)\, d\mu(y) = 0$. Therefore, by Proposition 8(b), $k$ is not cc-universal. The result for $c_0$-universality follows from Remark 7(c), while the result for c-universality is trivial. See Carmeli et al. (2009, Corollary 5), Steinwart and Christmann (2008, Proposition 4.54, Example 4.11) and Sriperumbudur et al. (2009b, Footnote 4).
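The argument for Proposition 8(d) can be tried numerically. The following is a minimal sketch, not taken from the paper: for a discrete signed measure $\mu = \sum_j \alpha_j \delta_{x_j}$ the double integral reduces to the finite quadratic form in (20). The helper names (`quad_form`, `gaussian`, `linear`) and the choice of points and weights are illustrative assumptions. The linear kernel $k(x, y) = xy$ on $\mathbb{R}$ is pd but not strictly pd, so the form can vanish for a nonzero $\mu$; for the Gaussian kernel it stays positive.

```python
import math


def quad_form(kernel, points, weights):
    """Return sum_{l,j} a_l a_j k(x_l, x_j), i.e. the double integral of k
    against the discrete signed measure mu = sum_j a_j * delta_{x_j}."""
    return sum(
        a_l * a_j * kernel(x_l, x_j)
        for x_l, a_l in zip(points, weights)
        for x_j, a_j in zip(points, weights)
    )


def gaussian(x, y, sigma=1.0):
    # Gaussian kernel on the real line, k(x, y) = exp(-sigma * (x - y)^2).
    return math.exp(-sigma * (x - y) ** 2)


def linear(x, y):
    # Linear kernel on the real line: pd, but not strictly pd.
    return x * y


# Hypothetical test measure: mu = delta_1 - 2 * delta_{1/2}, which is nonzero.
points = [1.0, 0.5]
weights = [1.0, -2.0]

q_gauss = quad_form(gaussian, points, weights)  # strictly positive
q_lin = quad_form(linear, points, weights)      # (1*1 - 2*0.5)^2 = 0

print(q_gauss, q_lin)
```

A single vanishing quadratic form is enough to rule out strict positive definiteness and hence, by Proposition 8(d), universality; a positive value for one particular $\mu$ proves nothing by itself.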
Remark 9 (a) Although the conditions in (17)-(19) resemble the strictly pd condition, they are not equivalent. By combining any of (a)-(c) with (d) in Proposition 8, it is easy to see that if $k$ satisfies any of (17)-(19), then it is strictly pd. However, the converse is not true (see Remark 12(a) and the discussion following Example 2; also refer to Steinwart and Christmann (2008, Proposition 4.60, Theorem 4.62) for the related discussion). We show in Section 3.4 that in the case of radial kernels on $\mathbb{R}^d$, the converse is true, i.e., $k$ being strictly pd is also sufficient for it to be cc- or $c_0$-universal (see Proposition 16).
(b) The condition on $k$ in (19) can be seen as a generalization of integrally strictly pd kernels (Stewart, 1976, Section 6): $\iint_X k(x, y) f(x) f(y)\, dx\, dy > 0$ for all $f \in L^2(\mathbb{R}^d)$, which is the strictly positive definiteness of the integral operator given by the kernel.

A summary of results based on Theorem 6, Remarks 7, 9 and Proposition 8 is shown in Figures 1(a) and 1(b).

Although the conditions in (17)-(19) are easy to interpret, they are not always easy to check. To this end, in the remainder of this section, we present easily checkable characterizations for the following classes of kernels. These classes of kernels are both mathematically and practically interesting, as many of the popular kernels used in machine learning, e.g., Gaussian, Laplacian, exponential, etc., fall in these classes (see Examples 1–3 for more examples).

(A1) $k$ is translation invariant on $\mathbb{R}^d \times \mathbb{R}^d$, i.e., $k(x, y) = \psi(x - y)$, where $0 \ne \psi \in C_b(\mathbb{R}^d)$ is a pd function on $\mathbb{R}^d$.$^{3}$
(A2) Fourier kernel: $k$ is translation invariant on $\mathbb{T}^d \times \mathbb{T}^d$, where $\mathbb{T}^d := [0, 2\pi)^d$, the $d$-torus, i.e., $k(x, y) = \psi((x - y) \bmod 2\pi)$, where $\psi \in C(\mathbb{T}^d)$ is a pd function on $\mathbb{T}^d$.
(A3) $k$ is a radial kernel on $\mathbb{R}^d \times \mathbb{R}^d$, i.e., there exists a finite nonnegative Borel measure, $\nu$, on $[0, \infty)$ such that for all $x, y \in \mathbb{R}^d$,
$$k(x, y) = \int_{[0, \infty)} e^{-t \|x - y\|_2^2}\, d\nu(t). \tag{21}$$
These kernels are also called Schoenberg kernels (Wendland, 2005, Corollary 7.12, Theorem 7.13).$^{4}$
(A4) $X$ is an LCH space with bounded $k$. Let $k(x, y) = \sum_{j \in I} \phi_j(x) \phi_j(y)$, $(x, y) \in X \times X$, where we assume the series converges uniformly on $X \times X$. $\{\phi_j : j \in I\}$ is a set of continuous real-valued functions on $X$, where $I$ is a countable index set.

3.2 Translation invariant kernels on $\mathbb{R}^d$: (A1)

The following result provides an easily checkable characterization for $k$ to be $c_0$-universal or cc-universal (we do not consider c-universality as $X = \mathbb{R}^d$ is not compact) when $k$ is translation invariant on $\mathbb{R}^d$, i.e., when $k$ satisfies (A1). Before we present the result, we need a theorem due to Bochner that characterizes translation invariant kernels on $\mathbb{R}^d$, which is quoted from Wendland (2005, Theorem 6.6).

Theorem 10 (Bochner) $\psi \in C_b(\mathbb{R}^d)$ is pd on $\mathbb{R}^d$ if and only if it is the Fourier transform of a finite nonnegative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,
$$\psi(x) = \int_{\mathbb{R}^d} e^{-i x^T \omega}\, d\Lambda(\omega), \quad x \in \mathbb{R}^d. \tag{22}$$

Proposition 11 (Translation invariant kernels on $\mathbb{R}^d$) Suppose (A1) holds.
(a) Let $\psi \in C_0(\mathbb{R}^d)$. Then $k$ is $c_0$-universal if and only if $\mathrm{supp}(\Lambda) = \mathbb{R}^d$.$^{5}$
(b) If $\mathrm{supp}(\psi)$ is compact, then $k$ is $c_0$-universal.
(c) If $(\mathrm{supp}(\Lambda))^\circ \ne \emptyset$, then $k$ is cc-universal.

Proof (a) ($\Leftarrow$) Consider $\iint_{\mathbb{R}^d} k(x, y)\, d\mu(x)\, d\mu(y)$ for any $0 \ne \mu \in M_b(\mathbb{R}^d)$ with $k(x, y) = \psi(x - y)$.
$$\begin{aligned}
B := \iint_{\mathbb{R}^d} k(x, y)\, d\mu(x)\, d\mu(y) &= \iint_{\mathbb{R}^d} \psi(x - y)\, d\mu(x)\, d\mu(y) \\
&\overset{(d)}{=} \iiint_{\mathbb{R}^d} e^{-i (x - y)^T \omega}\, d\Lambda(\omega)\, d\mu(x)\, d\mu(y) \\
&\overset{(e)}{=} \int_{\mathbb{R}^d} \left[ \int_{\mathbb{R}^d} e^{-i x^T \omega}\, d\mu(x) \int_{\mathbb{R}^d} e^{i y^T \omega}\, d\mu(y) \right] d\Lambda(\omega) \\
&\overset{(f)}{=} \int_{\mathbb{R}^d} \hat{\mu}(\omega)\, \overline{\hat{\mu}(\omega)}\, d\Lambda(\omega) = \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2\, d\Lambda(\omega),
\end{aligned} \tag{23}$$
where Theorem 10 is invoked in (d), Fubini's theorem (Folland, 1999, Theorem 2.37) in (e) and (6) in (f). If $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, then it is clear that $B > 0$. Therefore, by Proposition 8(c), $k$ is $c_0$-universal.
($\Rightarrow$) Suppose $k$ is $c_0$-universal, which by Theorem 6(c) means that $\mu \mapsto \int_{\mathbb{R}^d} k(\cdot, x)\, d\mu(x)$ is injective for $\mu \in M_b(\mathbb{R}^d)$. This means $\mu \mapsto \int_{\mathbb{R}^d} k(\cdot, x)\, d\mu(x)$ is injective for $\mu \in M_+^1(\mathbb{R}^d)$ and therefore Theorem 7 in Sriperumbudur et al. (2008) yields $\mathrm{supp}(\Lambda) = \mathbb{R}^d$.
(b) The proof is the same as that of Corollary 10 in Sriperumbudur et al. (2009b). Since $\mathrm{supp}(\psi)$ is compact in $\mathbb{R}^d$, by the Paley-Wiener theorem (Rudin, 1991, Theorem 7.23), we deduce that $\mathrm{supp}(\Lambda) = \mathbb{R}^d$. Therefore, the result follows from Proposition 11(a).
(c) Consider $\iint_{\mathbb{R}^d} k(x, y)\, d\mu(x)\, d\mu(y)$ with $k(x, y) = \psi(x - y)$ and $\mu \in M_{bc}(\mathbb{R}^d)$. Since (23) holds for any $\mu \in M_b(\mathbb{R}^d)$, it also holds for any $\mu \in M_{bc}(\mathbb{R}^d)$, i.e.,
$$B := \iint_{\mathbb{R}^d} k(x, y)\, d\mu(x)\, d\mu(y) = \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2\, d\Lambda(\omega).$$
Since $\mu \in M_{bc}(\mathbb{R}^d)$, by the Paley-Wiener theorem (Rudin, 1991, Theorem 7.23), we obtain that $\hat{\mu}$ cannot vanish over an open set in $\mathbb{R}^d$ and $\mathrm{supp}(\hat{\mu}) = \mathbb{R}^d$. Therefore, if $(\mathrm{supp}(\Lambda))^\circ \ne \emptyset$, then $B > 0$ for every $0 \ne \mu \in M_{bc}(\mathbb{R}^d)$ and the result follows from Proposition 8(b).

Footnotes:
3. $\psi$ is said to be a pd function on $\mathbb{R}^d$ if $k(x, y) = \psi(x - y)$ is pd.
4. Note that $k$ is a scale mixture of Gaussian kernels.
5. See (4) for the definition of support of a Borel measure.
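As a small worked instance of (23), offered as an illustration rather than as part of the original argument, take the two-point signed measure $\mu = \delta_a - \delta_b$ with $a \ne b$ in $\mathbb{R}^d$ (our choice of test measure), and read the convention $\hat{\mu}(\omega) = \int_{\mathbb{R}^d} e^{-i x^T \omega}\, d\mu(x)$ off step (e) of (23); this convention is an assumption here, since (6) is stated outside this section.

```latex
\begin{align*}
\hat{\mu}(\omega) &= e^{-i a^T \omega} - e^{-i b^T \omega}, \\
|\hat{\mu}(\omega)|^2 &= 2 - 2\cos\big((a - b)^T \omega\big), \\
B &= \int_{\mathbb{R}^d} \Big( 2 - 2\cos\big((a - b)^T \omega\big) \Big)\, d\Lambda(\omega)
   = 2\psi(0) - 2\,\mathrm{Re}\,\psi(a - b).
\end{align*}
```

The last equality uses (22), and it agrees with expanding $\iint \psi(x - y)\, d\mu(x)\, d\mu(y) = 2\psi(0) - \psi(a - b) - \psi(b - a)$ directly. For this particular $\mu$, $B = 0$ exactly when $\Lambda$ is concentrated on $\{\omega : (a - b)^T \omega \in 2\pi\mathbb{Z}\}$, so a $\Lambda$ with full support forces $B > 0$, whereas a $\Lambda$ carried by a lattice (a periodic $\psi$) admits a pair $a, b$ with $B = 0$.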
Proposition 11 can easily be extended to locally compact Abelian groups by using the ideas in Fukumizu et al. (2009b). Note that Proposition 11(c) matches with Proposition 15 in Micchelli et al. (2006), which is not surprising (see Remark 7(b)). Based on Proposition 11, in the following, we provide some examples of $c_0$- and cc-universal kernels that are translation invariant kernels on $\mathbb{R}^d$.

Example 1 Let $d\Lambda(\omega) = (2\pi)^{-d/2}\, \hat{\psi}(\omega)\, d\omega$. Note that $\mathrm{supp}(\Lambda) = \mathrm{supp}(\hat{\psi})$. The following kernels satisfy $\mathrm{supp}(\hat{\psi}) = \mathbb{R}^d$ and therefore are both $c_0$- and cc-universal.
(1) Gaussian, $\psi(x) = \exp\left(-\frac{\|x\|_2^2}{2\sigma^2}\right)$, $\sigma > 0$, with $\hat{\psi}(\omega) = \sigma^d \exp\left(-\frac{\sigma^2 \|\omega\|_2^2}{2}\right)$.
(2) Laplacian, $\psi(x) = \exp(-\sigma \|x\|_1)$, $\sigma > 0$, with $\hat{\psi}(\omega) = \left(\frac{2}{\pi}\right)^{d/2} \prod_{j=1}^{d} \frac{\sigma}{\sigma^2 + \omega_j^2}$, where $\omega = (\omega_1, \ldots, \omega_d)$.
(3) $B_1$-spline, $\psi(x) = \prod_{j=1}^{d} (1 - |x_j|)\, \mathbb{1}_{[-1,1]}(x_j)$ with $\hat{\psi}(\omega) = \prod_{j=1}^{d} \frac{4}{\sqrt{2\pi}} \frac{\sin^2(\omega_j / 2)}{\omega_j^2}$, where $x = (x_1, \ldots, x_d)$ and $\omega = (\omega_1, \ldots, \omega_d)$.
The following are some examples of translation invariant kernels on $\mathbb{R}^d$ that are not $c_0$-universal but cc-universal. These kernels satisfy $\mathrm{supp}(\hat{\psi}) \subsetneq \mathbb{R}^d$ and $(\mathrm{supp}(\hat{\psi}))^\circ \ne \emptyset$.
(4) Sinc kernel, $\psi(x) = \prod_{j=1}^{d} \frac{\sin \sigma x_j}{x_j}$, $\sigma \in \mathbb{R}$: $\hat{\psi}(\omega) = \left(\frac{\pi}{2}\right)^{d/2} \prod_{j=1}^{d} \mathbb{1}_{[-\sigma,\sigma]}(\omega_j)$ and $\mathrm{supp}(\hat{\psi}) = [-\sigma, \sigma]^d \subsetneq \mathbb{R}^d$.
(5) Sinc-squared kernel, $\psi(x) = \prod_{j=1}^{d} \frac{\sin^2 x_j}{x_j^2}$: $\hat{\psi}(\omega) = \frac{(2\pi)^{d/2}}{4^d} \prod_{j=1}^{d} (1 - |\omega_j|)\, \mathbb{1}_{[-1,1]}(\omega_j)$ and $\mathrm{supp}(\hat{\psi}) = [-1, 1]^d \subsetneq \mathbb{R}^d$.

The following remarks can be made about Proposition 11.

Remark 12 (a) Theorem 6.8 in Wendland (2005) states that: if $(\mathrm{supp}(\Lambda))^\circ \ne \emptyset$, then $k(x, y) = \psi(x - y)$ is strictly pd.
By Proposition 11(a), this means a strictly pd kernel need not be $c_0$-universal and therefore need not satisfy the condition in (19), i.e., strictly pd is not a sufficient condition for (19) to hold (see Remark 9(a)). As an example, a sinc-squared kernel is strictly pd but not $c_0$-universal (see Example 1).
(b) In Proposition 8(d), we have shown that strictly pd is a necessary condition for a kernel to be $c_0$- or cc-universal. From the above remark, it is clear that $k$ being strictly pd does not imply it is $c_0$-universal. But does it imply $k$ is cc-universal? In general, it is not clear whether this is true. However, if $\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$ is strictly pd, then $k(x, y) = \psi(x - y)$ is cc-universal. This follows from Wendland (2005, Theorem 6.11, Corollary 6.12): if $\psi \in C_b(\mathbb{R}^d) \cap L^1(\mathbb{R}^d)$ is strictly pd, then $0 \ne \hat{\psi} \in L^1(\mathbb{R}^d)$, $\hat{\psi} \ge 0$ and $(\mathrm{supp}(\hat{\psi}))^\circ \ne \emptyset$, which by Proposition 11(c) implies $k$ is cc-universal.
(c) Is the converse to Proposition 11(c) true? I.e., if $k$ is cc-universal, then does $(\mathrm{supp}(\Lambda))^\circ \ne \emptyset$ hold? Let $X = \mathbb{R}$. Suppose $(\mathrm{supp}(\Lambda))^\circ = \emptyset$, which means $\mathrm{supp}(\Lambda)$ is of the form $\{0, \pm\omega_1, \pm\omega_2, \ldots\}$, where $0 \ne \omega_j \in \mathbb{R}$ for all $j$. Let us assume that there exists a non-zero entire function, $h$, on $\mathbb{C}$ that satisfies (i) $h(\omega_j) = 0$, $\forall j$, and (ii) for each $N \in \mathbb{N}$, there is a $C_N$ such that
$$|h(\zeta)| \le \frac{C_N\, e^{R |\mathrm{Im}\, \zeta|}}{(1 + |\zeta|)^N},$$
for all $\zeta \in \mathbb{C}$ and some $R > 0$. Here $\mathrm{Im}\, \zeta$ represents the imaginary part of $\zeta$. By the Paley-Wiener theorem (Reed and Simon, 1972, Theorem IX.11, p. 16), $\check{h} \in C_0(\mathbb{R})$ is an infinitely differentiable function on $\mathbb{R}$ and $\mathrm{supp}(\check{h}) \subset \{x \in \mathbb{R} : |x| \le R\}$. Define $d\mu(x) = \check{h}(x)\, dx$.
It is easy to check that
$$\iint_{\mathbb{R}} k(x, y)\, d\mu(x)\, d\mu(y) = \iint_{\mathbb{R}} k(x, y)\, \check{h}(x)\, \check{h}(y)\, dx\, dy = 2\pi \int_{\mathbb{R}} \big|\hat{\check{h}}(\omega)\big|^2\, d\Lambda(\omega) = 2\pi \int_{\mathbb{R}} |h(\omega)|^2\, d\Lambda(\omega) = 2\pi \sum_j |h(\omega_j)|^2\, \Lambda(\{\omega_j\}) = 0.$$
This means there exists $0 \ne \mu \in M_{bc}(\mathbb{R})$ such that $\iint_{\mathbb{R}} k(x, y)\, d\mu(x)\, d\mu(y) = 0$, which means $k$ is not cc-universal, by Proposition 8(b). Therefore, if $k$ is cc-universal, then $(\mathrm{supp}(\Lambda))^\circ \ne \emptyset$, under the assumption that there exists an $h$ that satisfies (i) and (ii) shown above. The construction of such an $h$ is not straightforward for any $k$, and therefore it is not clear whether the above converse is true in general. On the other hand, Sriperumbudur et al. (2009b, Example 5) have shown that if $k$ is a periodic kernel (these kernels satisfy $(\mathrm{supp}(\Lambda))^\circ = \emptyset$), then such an $h$ defined on $\mathbb{R}$ can be constructed. This means if $k$ is cc-universal on $\mathbb{R}$, then it is not periodic on $\mathbb{R}$. However, this does not rule out the case of $k$ being cc-universal but aperiodic such that $(\mathrm{supp}(\Lambda))^\circ = \emptyset$.

A summary of results, based on Proposition 11 and Remark 12, for the case of kernels satisfying (A1), is shown in Figure 1(c).

3.3 Translation invariant kernels on $\mathbb{T}^d$: (A2)

First note that since $\mathbb{T}^d$ is a compact metric space, the notions of c-universality, cc-universality and $c_0$-universality are equivalent. Steinwart (2001, Corollary 11) provided a sufficient condition for a Fourier kernel to be c-universal. In Proposition 14, we show that this condition is also necessary. Using this result, we then show that the converse to Proposition 8(d) is not true. Before we present the result on the characterization of c-universality of kernels in (A2), we state Bochner's theorem that characterizes pd functions, $\psi$, on $\mathbb{T}^d$.
Theorem 13 (Bochner) $\psi \in C(\mathbb{T}^d)$ is pd if and only if
$$\psi(x) = \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, e^{i x^T n}, \quad x \in \mathbb{T}^d, \tag{24}$$
where $A_\psi : \mathbb{Z}^d \to \mathbb{R}_+$, $A_\psi(-n) = A_\psi(n)$ and $\sum_{n \in \mathbb{Z}^d} A_\psi(n) < \infty$. $A_\psi$ are called the Fourier series coefficients of $\psi$.

Proposition 14 (Translation invariant kernels on $\mathbb{T}^d$) Suppose (A2) holds. Then, $k$ is c-universal if and only if $A_\psi(n) > 0$, $\forall n \in \mathbb{Z}^d$.

Proof ($\Leftarrow$) Consider $\iint_{\mathbb{T}^d} k(x, y)\, d\mu(x)\, d\mu(y)$ for $0 \ne \mu \in M_b(\mathbb{T}^d)$. Substituting for $k$ as in (A2) and for $\psi$ as in (24), we have
$$\begin{aligned}
B := \iint_{\mathbb{T}^d} k(x, y)\, d\mu(x)\, d\mu(y) &= \iint_{\mathbb{T}^d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, e^{i (x - y)^T n}\, d\mu(x)\, d\mu(y) \\
&\overset{(a)}{=} \sum_{n \in \mathbb{Z}^d} A_\psi(n) \int_{\mathbb{T}^d} e^{i x^T n}\, d\mu(x) \int_{\mathbb{T}^d} e^{-i y^T n}\, d\mu(y) \\
&\overset{(b)}{=} (2\pi)^{2d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, \overline{A_\mu(n)}\, A_\mu(n) = (2\pi)^{2d} \sum_{n \in \mathbb{Z}^d} A_\psi(n)\, |A_\mu(n)|^2,
\end{aligned} \tag{25}$$
where Fubini's theorem is invoked in (a) and
$$A_\mu(n) := (2\pi)^{-d} \int_{\mathbb{T}^d} e^{-i n^T x}\, d\mu(x), \quad n \in \mathbb{Z}^d, \tag{26}$$
is used in (b). Note that $A_\mu$ is the Fourier transform of $\mu$ in $\mathbb{T}^d$. Since $A_\psi(n) > 0$, $\forall n \in \mathbb{Z}^d$, we have $B > 0$, which by Proposition 8(a) implies $k$ is c-universal.
($\Rightarrow$) Proving necessity is equivalent to proving that if $A_\psi(n) = 0$ for some $n = n_0$, then there exists $0 \ne \mu \in M_b(\mathbb{T}^d)$ such that $\iint_{\mathbb{T}^d} k(x, y)\, d\mu(x)\, d\mu(y) = 0$. Let $A_\psi(n) = 0$ for some $n = n_0$. Define $d\mu(x) = 2\alpha \cos(x^T n_0)\, dx$, $\alpha \in \mathbb{R} \setminus \{0\}$. By (26), we get $A_\mu(n) = \alpha\, (\delta_{n_0}(n) + \delta_{-n_0}(n))$, where $\delta$ represents the Kronecker delta. This means $\mu \ne 0$. Since $A_\psi(-n_0) = A_\psi(n_0) = 0$, using $A_\psi$ and $A_\mu$ in (25), it is easy to check that $\iint_{\mathbb{T}^d} k(x, y)\, d\mu(x)\, d\mu(y) = 0$. Therefore, $k$ is not c-universal.

Note that Proposition 14 provides an easy-to-check condition for the c-universality of translation invariant kernels on $\mathbb{T}^d$.
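The condition in Proposition 14 lends itself to a quick numerical probe. The following is a minimal sketch for $d = 1$, under the assumption that $A_\psi(n) = (2\pi)^{-1} \int_0^{2\pi} \psi(x)\, e^{-inx}\, dx$, which is the inversion of (24). The function names, the grid size, the tolerance, and the two test kernels (a Poisson kernel with $\sigma = 0.5$ and a Dirichlet kernel with $l = 2$, the latter written as the trigonometric sum $1 + 2\sum_{n=1}^{l} \cos nx$) are illustrative choices, not part of the paper.

```python
import cmath
import math


def fourier_coefficient(psi, n, grid_size=512):
    """Approximate A_psi(n) = (2*pi)^{-1} * int_0^{2*pi} psi(x) e^{-inx} dx
    by a Riemann sum on an equispaced grid of [0, 2*pi)."""
    total = 0j
    for m in range(grid_size):
        x = 2.0 * math.pi * m / grid_size
        total += psi(x) * cmath.exp(-1j * n * x)
    return (total / grid_size).real


def poisson(x, sigma=0.5):
    # Poisson kernel; its coefficients are sigma^{|n|}, all strictly positive.
    return (1.0 - sigma ** 2) / (sigma ** 2 - 2.0 * sigma * math.cos(x) + 1.0)


def dirichlet(x, l=2):
    # Dirichlet kernel as a finite sum; coefficients are 1 for |n| <= l, else 0.
    return 1.0 + 2.0 * sum(math.cos(n * x) for n in range(1, l + 1))


def vanishing_frequencies(psi, max_freq, tol=1e-9):
    """Frequencies 0..max_freq whose coefficient is numerically zero.
    Only n >= 0 is scanned because A_psi(-n) = A_psi(n)."""
    return [n for n in range(max_freq + 1) if fourier_coefficient(psi, n) <= tol]


print(vanishing_frequencies(poisson, 5))
print(vanishing_frequencies(dirichlet, 5))
```

A non-empty list refutes c-universality via the necessity direction of Proposition 14. An empty list over a finite range of frequencies is only consistent with universality and does not establish it.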
Example 2 The following are some examples of translation invariant kernels on $\mathbb{T}$ that are c-universal (and therefore $c_0$-universal and cc-universal).
(1) Poisson kernel, $\psi(x) = \frac{1 - \sigma^2}{\sigma^2 - 2\sigma \cos x + 1}$, $0 < \sigma < 1$, with $A_\psi(n) = \sigma^{|n|}$, $n \in \mathbb{Z}$.
(2) $\psi(x) = e^{\alpha \cos x} \cos(\alpha \sin x)$, $0 < \alpha \le 1$, with $A_\psi(0) = 1$ and $A_\psi(n) = \frac{\alpha^{|n|}}{2\, |n|!}$, $\forall n \ne 0$.
(3) $\psi(x) = (\pi - (x) \bmod 2\pi)^2$ with $A_\psi(0) = \frac{\pi^2}{3}$ and $A_\psi(n) = \frac{2}{n^2}$, $\forall n \ne 0$.
Some examples of translation invariant kernels on $\mathbb{T}$ that are not c-universal (and therefore not $c_0$-universal and not cc-universal) are:
(4) Dirichlet kernel, $\psi(x) = \frac{\sin\frac{(2l+1)x}{2}}{\sin\frac{x}{2}}$, $l \in \mathbb{N}$, with $A_\psi(n) = 1$ for $n \in \{0, \pm 1, \ldots, \pm l\} =: D$ and $A_\psi(n) = 0$ for $n \notin D$.
(5) Fejér kernel, $\psi(x) = \frac{1}{l+1} \frac{\sin^2\frac{(l+1)x}{2}}{\sin^2\frac{x}{2}}$, $l \in \mathbb{N}$, with $A_\psi(n) = 1 - \frac{|n|}{l+1}$ for $n \in D$ and $A_\psi(n) = 0$ for $n \notin D$.

c-universal kernels vs. strictly pd kernels: We have shown in Proposition 8(d) that strictly pd is a necessary condition for $k$ to be c-, cc- or $c_0$-universal. However, the converse is not true (see Remark 9(a)), which is based on Proposition 14 and the following result in Theorem 15. Before we state the result, we need some definitions. For natural numbers $m$ and $n$ and a set $A$ of integers, $m + nA := \{j \in \mathbb{Z} \mid j = m + na,\ a \in A\}$. An increasing sequence $\{c_l\}$ of nonnegative integers is said to be prime if it is not contained in any set of the form $p_1\mathbb{N} \cup p_2\mathbb{N} \cup \cdots \cup p_n\mathbb{N}$, where $p_1, p_2, \ldots, p_n$ are prime numbers. Any infinite increasing sequence of prime numbers is a trivial example of a prime sequence. We write $\mathbb{N}_0^n := \{0, 1, \ldots, n\}$.

Theorem 15 (Menegatto (1995)) Let $\psi$ be a pd function on $\mathbb{T}$ of the form in (24). Let $\mathcal{N} := \{|n| : A_\psi(n) > 0,\ n \in \mathbb{Z}\} \subset \mathbb{N} \cup \{0\}$.
Then $\psi$ is strictly pd if $\mathcal{N}$ has a subset of the form $\cup_{l=0}^{\infty} (b_l + c_l \mathbb{N}_0^l)$, in which $\{b_l\} \cup \{c_l\} \subset \mathbb{N}$ and $\{c_l\}$ is a prime sequence.

Suppose $\psi$ is such that $\mathcal{N} \subsetneq \mathbb{N} \cup \{0\}$ has a subset of the form mentioned in Theorem 15. Clearly, $\psi$ is strictly pd. However, it is not c-universal, as Proposition 14 states that $k$ is c-universal if and only if $\mathcal{N} = \mathbb{N} \cup \{0\}$. A summary of results for kernels of the type (A2) is shown in Figure 1(b).

3.4 Radial kernels on $\mathbb{R}^d$: (A3)

The following result provides an easily checkable characterization for $k$ to be $c_0$- and cc-universal (c-universality is not considered as $X = \mathbb{R}^d$ is not compact) when $k$ satisfies (A3).

Proposition 16 (Radial kernels on $\mathbb{R}^d$) Suppose (A3) holds. Then the following conditions are equivalent.
(a) $k$ is $c_0$-universal.
(b) $\mathrm{supp}(\nu) \ne \{0\}$.
(c) $k$ is strictly pd.
(d) $k$ is cc-universal.

Proof (a) $\Rightarrow$ (d) by Remark 7(c), (d) $\Rightarrow$ (c) by Proposition 8(d) and (c) $\Leftrightarrow$ (b) by Wendland (2005, Theorem 7.14). Now, we show (b) $\Rightarrow$ (a). Consider $\iint_{\mathbb{R}^d} k(x, y)\, d\mu(x)\, d\mu(y)$ with $k$ as in (21), given by
$$\begin{aligned}
B := \iint_{\mathbb{R}^d} k(x, y)\, d\mu(x)\, d\mu(y) &= \iint_{\mathbb{R}^d} \int_0^\infty e^{-t \|x - y\|_2^2}\, d\nu(t)\, d\mu(x)\, d\mu(y) \\
&\overset{(e)}{=} \int_0^\infty \left[ \iint_{\mathbb{R}^d} e^{-t \|x - y\|_2^2}\, d\mu(x)\, d\mu(y) \right] d\nu(t) \\
&\overset{(f)}{=} \int_0^\infty \frac{1}{(2t)^{d/2}} \left[ \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2\, e^{-\frac{\|\omega\|_2^2}{4t}}\, d\omega \right] d\nu(t) \\
&\overset{(g)}{=} \int_{\mathbb{R}^d} |\hat{\mu}(\omega)|^2 \left[ \int_0^\infty \frac{1}{(2t)^{d/2}}\, e^{-\frac{\|\omega\|_2^2}{4t}}\, d\nu(t) \right] d\omega,
\end{aligned} \tag{27}$$
where Fubini's theorem is invoked in (e) and (g), while (23) is invoked in (f). Since $\mathrm{supp}(\nu) \ne \{0\}$, the inner integral in (27) is positive for every $\omega \in \mathbb{R}^d$ and so $B > 0$. Therefore $k$ is $c_0$-universal by Proposition 8.

The above result shows that the notions of $c_0$-universality, cc-universality and strict positive definiteness are equivalent for the class of radial kernels on $\mathbb{R}^d$.
Example 3 The following radial kernels on $\mathbb{R}^d$ have $\mathrm{supp}(\nu) \ne \{0\}$ and therefore are $c_0$-universal, cc-universal and strictly pd.
(1) Gaussian, $k(x, y) = e^{-\sigma \|x - y\|_2^2}$, $\sigma > 0$. Note that $\nu = \delta_\sigma$ in (21), where $\delta_\sigma$ represents a Dirac measure at $\sigma$. Clearly $\mathrm{supp}(\nu) = \{\sigma\} \ne \{0\}$.
(2) Inverse multiquadratic, $k(x, y) = (c^2 + \|x - y\|_2^2)^{-\beta}$, $\beta > 0$, $c > 0$, obtained by choosing $d\nu(t) = \frac{1}{\Gamma(\beta)}\, t^{\beta - 1} e^{-c^2 t}\, dt$ in (21). It is easy to verify that $\mathrm{supp}(\nu) \ne \{0\}$.

A summary of results for kernels of the type (A3) is shown in Figure 1(d).

3.5 Kernels of type (A4)

We now consider the characterization of c-, cc- and $c_0$-universality for (A4).

Proposition 17 (Kernels of type (A4)) Suppose (A4) holds.
(a) $k$ is c-universal (resp. cc-universal) if and only if for any $0 \ne \mu \in M_b(X)$ (resp. $0 \ne \mu \in M_{bc}(X)$), there exists some $j \in I$ for which $\int_X \phi_j\, d\mu \ne 0$.
(b) Let $k(\cdot, x) \in C_0(X)$, $\forall x \in X$. Then $k$ is $c_0$-universal if and only if for any $0 \ne \mu \in M_b(X)$, there exists some $j \in I$ for which $\int_X \phi_j\, d\mu \ne 0$.

Proof We first prove (b). The proof for c-universality in (a) is trivial as it follows from (b), while the proof for cc-universality in (a) is exactly the same as that of (b) with $M_b(X)$ replaced by $M_{bc}(X)$. Let us consider
$$\iint_X k(x, y)\, d\mu(x)\, d\mu(y) = \iint_X \sum_{j \in I} \phi_j(x) \phi_j(y)\, d\mu(x)\, d\mu(y) \overset{(c)}{=} \sum_{j \in I} \left| \int_X \phi_j(x)\, d\mu(x) \right|^2, \tag{28}$$
where we have invoked Fubini's theorem in (c).
(b) ($\Leftarrow$) Suppose for any $0 \ne \mu \in M_b(X)$, there exists some $j \in I$ for which $\int_X \phi_j\, d\mu \ne 0$. Then, from (28), it is clear that $\iint_X k(x, y)\, d\mu(x)\, d\mu(y) > 0$, $\forall\, 0 \ne \mu \in M_b(X)$, and therefore $k$ is $c_0$-universal, which follows from Proposition 8(c).
($\Rightarrow$) Suppose there exists a non-zero measure, $\mu \in M_b(X)$, for which $\int_X \phi_j\, d\mu = 0$ for any $j \in I$. By (28), this means there exists a $0 \ne \mu \in M_b(X)$ for which $\iint_X k(x, y)\, d\mu(x)\, d\mu(y) = 0$, i.e., $k$ is not $c_0$-universal (by Proposition 8(c)).

The conditions in Proposition 17 are not always easy to check. However, for the case of Taylor kernels (Steinwart and Christmann, 2008, Lemma 4.8), which include the exponential kernel, simple, easy-to-check sufficient conditions can be obtained as shown in Corollary 18. Although this result is exactly the same as Corollary 4.57 in Steinwart and Christmann (2008), we present a different proof (we would like to remind the reader that our characterization of c-universality is different from the one provided by Steinwart (2001) and therefore the proof is different; see Remark 7(a)).

Corollary 18 (Universal Taylor kernels) Let $X := \{x \in \mathbb{R}^d : \|x\|_2 < \sqrt{r}\}$, where $r \in (0, \infty]$. Let $f(t) = \sum_{n=0}^{\infty} a_n t^n$, $t \in (-r, r)$. If $a_n > 0$, $\forall n \ge 0$, then $k(x, y) = f(x^T y)$, $x, y \in X$, is c-universal on every compact subset of $X$.

Proof From the proof of Lemma 4.8 in Steinwart and Christmann (2008), we have
$$k(x, y) = f(x^T y) = \sum_{n=0}^{\infty} a_n \left( x^T y \right)^n = \sum_{\alpha \in \mathbb{N}^d} a_{|\alpha|}\, c_\alpha\, x^\alpha y^\alpha, \tag{29}$$
where $\alpha := (\alpha_j : j \in \mathbb{N}_d)$, $|\alpha| := \sum_{j \in \mathbb{N}_d} \alpha_j$, $c_\alpha := \frac{n!}{\prod_{j=1}^{d} \alpha_j!}$, $x = (x_1, \ldots, x_d)$ and $x^\alpha := \prod_{j=1}^{d} (x_j)^{\alpha_j}$. From (29), it is clear that $k(x, y) = \sum_{\alpha \in \mathbb{N}^d} \phi_\alpha(x) \phi_\alpha(y)$, $x, y \in X$, where $\phi_\alpha(x) = \sqrt{a_{|\alpha|}\, c_\alpha}\; x^\alpha$. Let $a_{|\alpha|} > 0$ for all $\alpha \in \mathbb{N}^d$. Then it is clear that for any $0 \ne \mu \in M_b(X)$, there exists $\alpha \in \mathbb{N}^d$ such that $\int_X x^\alpha\, d\mu(x) \ne 0$. Therefore, by Proposition 17, $k$ is c-universal.
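The monomial expansion (29) can be sanity-checked numerically. The following is a minimal sketch for $d = 2$ and the exponential kernel, i.e., $a_n = 1/n!$. It assumes that the multi-indices range over nonnegative integers (including zero) and that $n = |\alpha|$ in $c_\alpha$; the function names, the truncation degree, and the evaluation points are illustrative choices.

```python
import math
from itertools import product


def multinomial(alpha):
    """c_alpha = |alpha|! / prod_j alpha_j! (each intermediate quotient is an integer)."""
    c = math.factorial(sum(alpha))
    for a_j in alpha:
        c //= math.factorial(a_j)
    return c


def monomial(x, alpha):
    """x^alpha = prod_j x_j^{alpha_j}."""
    return math.prod(x_j ** a_j for x_j, a_j in zip(x, alpha))


def taylor_kernel_series(x, y, coeff, max_degree):
    """Truncation of sum_alpha a_{|alpha|} c_alpha x^alpha y^alpha to |alpha| <= max_degree."""
    d = len(x)
    total = 0.0
    for alpha in product(range(max_degree + 1), repeat=d):
        n = sum(alpha)
        if n <= max_degree:
            total += coeff(n) * multinomial(alpha) * monomial(x, alpha) * monomial(y, alpha)
    return total


# Exponential kernel: f(t) = exp(t), so a_n = 1/n! > 0 for all n.
x, y = (0.3, -0.7), (0.5, 0.2)
series = taylor_kernel_series(x, y, lambda n: 1.0 / math.factorial(n), 20)
closed = math.exp(sum(a * b for a, b in zip(x, y)))

print(series, closed)
```

The agreement between `series` and `closed` reflects the multinomial theorem applied to $(x^T y)^n$, which is the step that turns the power series in $x^T y$ into a sum of products $\phi_\alpha(x)\phi_\alpha(y)$ of the form required by (A4).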
Examples of kernels that satisfy the conditions in Corollary 18 and therefore are c-universal include the exponential kernel, $k(x, y) = \exp(x^T y)$, $x, y \in \mathbb{R}^d$, the binomial kernel, $k(x, y) = (1 - x^T y)^{-\beta}$, $\beta > 0$, defined on $X \times X$, where $X := \{x \in \mathbb{R}^d : \|x\|_2 < 1\}$, etc. See Examples 4.9 and 4.11 in Steinwart and Christmann (2008).

To summarize, in this section, by showing the relation between various notions of universality and the injective RKHS embedding of finite signed Radon measures, we have presented a novel measure embedding point of view of universality compared to its well-known function approximation viewpoint. Since the RKHS embedding of finite signed Radon measures generalizes the concept of RKHS embedding of Radon probability measures, the latter being related to characteristic kernels (Fukumizu et al., 2004, 2008; Sriperumbudur et al., 2008), in the following section, we relate the notion of universality to characteristic kernels.

4. Characteristic Kernels and Universality

Recent studies in machine learning have considered the mapping of random variables into a suitable RKHS and showed that this provides a powerful and straightforward method of dealing with higher-order statistics of the variables. Using their RKHS mappings, for sufficiently rich RKHSs, it becomes possible to test for homogeneity (Gretton et al., 2007), independence (Gretton et al., 2008), conditional independence (Fukumizu et al., 2008), to find the most predictive subspace in regression (Fukumizu et al., 2004), etc. Key to the above applications is the notion of a characteristic kernel, defined below, which gives rise to an RKHS that is sufficiently rich in the sense required above.
Definition 19 (Characteristic kernel) Let $X$ be a topological space, $P$ be a Borel probability measure on $X$ and $k$ be a measurable, bounded kernel on $X$. Then $k$ is said to be characteristic if the embedding,
$$P \mapsto \int_X k(\cdot, x)\, dP(x), \tag{30}$$
is injective.

Since the embedding in (30) is a special case of the embedding in (2), and the injectivity of the embedding in (2) is related to universality (see Section 3), we now relate universal and characteristic kernels.

4.1 Main results

Gretton et al. (2007) have shown that a c-universal kernel is characteristic. Besides this result, not much is known or understood about the relation between characteristic and universal kernels. The following result not only provides the same result obtained by Gretton et al. (2007), but also generalizes it for non-compact $X$.

Proposition 20 (Universal and characteristic kernels - I) Suppose the assumptions in Theorem 6 hold. If $k$ is c-, cc- or $c_0$-universal, then it is characteristic to the set of probability measures contained in $M_b(X)$, $M_{bc}(X)$ or $M_b(X)$, respectively.

Proof The proof is trivial and follows from Theorem 6 and Definition 19.

Now, one can ask when the converse to Proposition 20 is true. The following result answers this question for some special classes of kernels.

Proposition 21 (Universal and characteristic kernels - II) The following hold:
(a) Suppose ($A_1$) holds with $\psi \in C_0(\mathbb{R}^d)$. Then, $k$ is $c_0$-universal if and only if it is characteristic to the set of all Borel probability measures on $\mathbb{R}^d$.
(b) Suppose ($A_2$) holds. Then, $k$ is c-universal if it is characteristic to the set of all Borel probability measures on $\mathbb{T}^d$ and $A_\psi(0) > 0$, where $A_\psi$ is defined in (24).
(c) Suppose ($A_3$) holds.
Then, $k$ is cc-universal if and only if it is characteristic to the set of all Borel probability measures on $\mathbb{R}^d$.

Proof (a) Suppose $k$ is $c_0$-universal. Then, by Proposition 20, $k$ is characteristic to $M^1_+(\mathbb{R}^d)$. Conversely, if $k$ is characteristic to $M^1_+(\mathbb{R}^d)$, we have $\mathrm{supp}(\Lambda) = \mathbb{R}^d$, which follows from Theorem 7 in Sriperumbudur et al. (2008). The result therefore follows from Proposition 11(a).
(b) Fukumizu et al. (2009b, Theorem 8) and Sriperumbudur et al. (2009b, Theorem 14) have shown that $k$ is characteristic to $M^1_+(\mathbb{T}^d)$ if and only if $A_\psi(0) \geq 0$, $A_\psi(n) > 0$, $\forall\, n \in \mathbb{Z}^d \setminus \{0\}$. Therefore, if $k$ is characteristic with $A_\psi(0) > 0$, then it is c-universal by Proposition 14.
(c) If $k$ is cc-universal, then by Proposition 16, it is $c_0$-universal, and thus characteristic to $M^1_+(\mathbb{R}^d)$ by Proposition 20. To prove the converse, we need to prove that if $k$ is not cc-universal, then it is not characteristic to $M^1_+(\mathbb{R}^d)$. If $k$ is not cc-universal, then by Proposition 16, we have $\mathrm{supp}(\nu) = \{0\}$ (see (21) for the definition of $\nu$), which means the kernel is a constant function on $\mathbb{R}^d \times \mathbb{R}^d$ and therefore not characteristic to $M^1_+(\mathbb{R}^d)$.

Remark 22 (a) If $k$ is $c_0$-universal, then $k$ is characteristic, which follows from Proposition 20. In general, the converse is not true, which follows from Proposition 14 and Proposition 21(b). However, on the class of translation invariant kernels and radial kernels defined over $\mathbb{R}^d$, the converse is true, which is shown in Proposition 21(a,c).
(b) Although an RKHS, $\mathcal{H}$, can be characteristic without containing constant functions (Fukumizu et al., 2009b, Lemma 1), Proposition 21(b) shows that if $\mathcal{H}$ does contain constant functions (i.e., $A_\psi(0) > 0$), then the class of characteristic kernels on $\mathbb{T}^d$ is equivalent to the class of c-universal (and, therefore, cc- and $c_0$-universal) kernels. Based on Fukumizu et al. (2009b, Lemma 1) and Carmeli et al. (2009, Theorem 1), this result can be generalized to any LCH space, $X$, which says that if constant functions are included in $\mathcal{H}$, then characteristic kernels are equivalent to $c_0$-universal kernels.

A summary of the relation between characteristic and universal kernels is shown in Figure 1.

Characteristic kernels vs. Strictly pd kernels: In Section 3, we have shown the relation between universal kernels and strictly pd kernels, while in Propositions 20 and 21, we have related universal and characteristic kernels. We now investigate the relation between characteristic and strictly pd kernels. Based on Propositions 11, 16 and 21, it is clear that a characteristic kernel that is translation invariant or radial on $\mathbb{R}^d$ is strictly pd. While the converse holds for radial kernels on $\mathbb{R}^d$, it does not hold for translation invariant kernels on $\mathbb{R}^d$, which follows from Proposition 21 and Remark 12(a). Similarly, in the case of translation invariant kernels on $\mathbb{T}$, if a kernel is characteristic, then it is strictly pd, which follows from Theorem 15 and Proposition 21, while the converse is not true. So far, we have presented the relation between characteristic and strictly pd kernels for specific cases of kernels satisfying ($A_1$)–($A_3$), which is summarized in Figure 1. For the general case, it is not clear whether strict pd is a necessary condition for $k$ to be characteristic.
However, the following result shows that conditionally strictly pd is a necessary condition for $k$ to be characteristic.

Proposition 23 If $k$ is characteristic, then it is conditionally strictly pd.

Proof Suppose $k$ is not conditionally strictly pd. This means that for some $n \geq 2$ and for mutually distinct $x_1, \ldots, x_n \in X$, there exists $\{\alpha_j\} \neq 0$ with $\sum_{j=1}^{n} \alpha_j = 0$ such that $\sum_{l,j=1}^{n} \alpha_l \alpha_j k(x_l, x_j) = 0$. Define $\mu := \sum_{j=1}^{n} \alpha_j \delta_{x_j}$, where $\delta_x$ represents the Dirac measure at $x$. Clearly, $\mu$ is a finite non-zero Borel measure that satisfies (i) $\iint_X k(x,y)\, d\mu(x)\, d\mu(y) = 0$ and (ii) $\mu(X) = 0$. Since $\mu$ is a finite non-zero Borel measure, by the Jordan decomposition theorem (Dudley, 2002, Theorem 5.6.1), there exist unique positive measures $\mu^+$ and $\mu^-$ such that $\mu = \mu^+ - \mu^-$ and $\mu^+ \perp \mu^-$ ($\mu^+$ and $\mu^-$ are singular). By (ii), we have $\mu^+(X) = \mu^-(X) =: \alpha$. Define $P = \alpha^{-1} \mu^+$ and $Q = \alpha^{-1} \mu^-$. Clearly, $P$ and $Q$ are distinct Borel probability measures defined on $X$. Then, we have
$$\left\| \int_X k(\cdot, x)\, dP(x) - \int_X k(\cdot, x)\, dQ(x) \right\|_{\mathcal{H}}^2 \overset{(a)}{=} \iint_X k(x,y)\, d(P-Q)(x)\, d(P-Q)(y) = \alpha^{-2} \iint_X k(x,y)\, d\mu(x)\, d\mu(y) \overset{(b)}{=} 0,$$
where Lemma 26 is invoked in (a) and (b) is obtained by invoking (i). So, there exist $P \neq Q$ such that $\int_X k(\cdot, x)\, d(P-Q)(x) = 0$, i.e., $k$ is not characteristic.

The converse to Proposition 23 is, however, not true.

So far, we have presented the relation between characteristic kernels and universal kernels and showed that for any LCH space, $X$, the characteristic property is a weaker notion than $c_0$-universality.
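The construction used in the proof of Proposition 23 can be traced on a small example. The sketch below is an illustration under assumptions that are not in the original text: the kernel $k(x,y) = (1 + xy)^2$ on $X = [0, 3]$, the four points, and the weights (a third finite difference) are chosen for convenience, and all names are hypothetical. Exact rational arithmetic is used so that the zero values are not affected by rounding.

```python
from fractions import Fraction


def kernel(x, y):
    """Polynomial kernel k(x, y) = (1 + x*y)^2 (illustrative choice)."""
    return (1 + x * y) ** 2


def quadratic_form(weights, points):
    """sum_{l,j} alpha_l * alpha_j * k(x_l, x_j)."""
    return sum(
        a * b * kernel(x, y)
        for a, x in zip(weights, points)
        for b, y in zip(weights, points)
    )


# Mutually distinct points and non-zero weights that sum to zero.
points = [0, 1, 2, 3]
alphas = [1, -3, 3, -1]

# Jordan decomposition of mu = sum_j alpha_j * delta_{x_j}.
mu_plus = {x: a for a, x in zip(alphas, points) if a > 0}
mu_minus = {x: -a for a, x in zip(alphas, points) if a < 0}
total_mass = sum(mu_plus.values())  # equals sum(mu_minus.values()) because mu(X) = 0

# Normalise both parts to probability measures P and Q.
P = {x: Fraction(m, total_mass) for x, m in mu_plus.items()}
Q = {x: Fraction(m, total_mass) for x, m in mu_minus.items()}


def embedding_distance_sq(P, Q):
    """Squared RKHS distance between the mean elements of discrete P and Q."""
    diff = {}
    for x, w in P.items():
        diff[x] = diff.get(x, 0) + w
    for x, w in Q.items():
        diff[x] = diff.get(x, 0) - w
    return sum(
        wx * wy * kernel(x, y)
        for x, wx in diff.items()
        for y, wy in diff.items()
    )


print(quadratic_form(alphas, points), embedding_distance_sq(P, Q))
```

Here the weights annihilate $1$, $x$ and $x^2$ on the chosen points, so the quadratic form vanishes and the kernel is not conditionally strictly pd. The resulting $P$ and $Q$ are distinct probability measures whose mean elements coincide, which is the failure of the characteristic property that the proof derives.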
Although the weaker characteristic property is sufficient to make the embedding in (30) injective, in the following section, we show that the stronger notion of $c_0$-universality is required to study an important property of the "probability metric" associated with the embedding in (30).

4.2 Metrization of weak topology on $M^1_+(X)$

Let $X$ be a Polish space.$^6$ Based on the embedding, $P \mapsto \int_X k(\cdot, x)\, dP(x)$, $P \in M^1_+(X)$, Gretton et al. (2007) proposed the following pseudometric,
$$\gamma_k(P, Q) := \left\| \int_X k(\cdot, x)\, dP(x) - \int_X k(\cdot, x)\, dQ(x) \right\|_{\mathcal{H}}, \tag{31}$$
on $M^1_+(X)$, called the maximum mean discrepancy (MMD). Note that when $k$ is characteristic, $\gamma_k$ is a metric on $M^1_+(X)$. One immediate question that naturally arises is "how is MMD related to other metrics on $M^1_+(X)$, such as the Prohorov metric, Dudley metric, Wasserstein-Kantorovich metric, total variation metric, etc.?" This is a question of both theoretical and practical importance. For example, let us consider the problem of estimating an unknown density based on finite random samples drawn i.i.d. from it. The quality of the estimate is measured by determining the distance between the estimated density and the true density. Given two probability metrics, $\rho_1$ and $\rho_2$, one might want to use the stronger$^7$ of the two to determine this distance, as the convergence of the estimated density to the true density in the stronger metric implies the convergence in the weaker metric, while the converse is not true. On the other hand, one might need to use a metric of weaker topology (i.e., coarser topology) to show convergence of some estimators, as the convergence might not occur w.r.t. a metric of strong topology. This motivates a deeper analysis of the relation between probability metrics, e.g., the relation between MMD and other popular probability metrics, to determine which metrics are stronger and which are weaker.

6. A topological space $(X, \tau)$ is called a Polish space if the topology $\tau$ has a countable basis and there exists a complete metric defining $\tau$.
7. Two metrics $\rho_1 : Y \times Y \to \mathbb{R}_+$ and $\rho_2 : Y \times Y \to \mathbb{R}_+$ are said to be equivalent if $\rho_1(x,y) = 0 \Leftrightarrow \rho_2(x,y) = 0$, $\forall\, x, y \in Y$. On the other hand, $\rho_1$ is said to be stronger than $\rho_2$ if $\rho_1(x,y) = 0 \Rightarrow \rho_2(x,y) = 0$, $\forall\, x, y \in Y$, but not vice-versa. If $\rho_1$ is stronger than $\rho_2$, then we say $\rho_2$ is weaker than $\rho_1$.

Recently, Sriperumbudur et al. (2009b) studied the relation between MMD and other probability metrics such as the Prohorov distance, Dudley metric, Wasserstein distance and total variation distance and showed that MMD is weaker than all these other metrics. This means that the topology induced by MMD is coarser than the topology induced by all these other metrics on $M^1_+(X)$. It is well known that the Prohorov and Dudley metrics induce a topology that coincides with the weak topology (also called the weak-$*$ (weak-star) topology) on $M^1_+(X)$, defined as the weakest topology such that the map $P \mapsto \int_X f\, dP$ is continuous for all $f \in C_b(X)$. This naturally leads to the question, "For what $k$ does the topology induced by MMD coincide with the weak topology?" In other words, "For what $k$ is MMD equivalent to the Prohorov and Dudley metrics?" Although we arrived at this question motivated by an application, this question on its own is theoretically interesting and important in probability theory, especially in proving central limit theorems. Before we answer it (this question was answered for compact Hausdorff $X$ and $X = \mathbb{R}^d$ in Sriperumbudur et al.
(2009b, Section 5), whereas in the following, we answer it for general LCH spaces), we need some preliminaries.

The weak topology on $M^1_+(X)$ is the weakest topology such that the map $P \mapsto \int_X f\, dP$ is continuous for all $f \in C_b(X)$. A sequence of measures is said to converge weakly to $P$, written as $P_n \xrightarrow{w} P$, if and only if $\int_X f\, dP_n \to \int_X f\, dP$ for every $f \in C_b(X)$. A metric $\gamma$ on $M^1_+(X)$ is said to metrize the weak topology if the topology induced by $\gamma$ coincides with the weak topology, which is defined as follows: if, for $P, P_1, P_2, \ldots \in M^1_+(X)$, ($P_n \xrightarrow{w} P \Leftrightarrow \gamma(P_n, P) \to 0$ as $n \to \infty$) holds, then the topology induced by $\gamma$ coincides with the weak topology.

Proposition 24 Let $X$ be an LCH space and $k$ be $c_0$-universal. Then, the topology induced by $\gamma_k$ coincides with the weak topology on $M^1_+(X)$.

Proof We need to show that for measures $P, P_1, P_2, \ldots \in M^1_+(X)$, $P_n \xrightarrow{w} P$ if and only if $\gamma_k(P_n, P) \to 0$ as $n \to \infty$. To prove the result, we use an equivalent representation of $\gamma_k$ given by Sriperumbudur et al. (2008, Theorem 3),
$$\gamma_k(P, Q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \left| \int_X f\, dP - \int_X f\, dQ \right| = \sup_{f \in \mathcal{H}} \frac{\left| \int_X f\, dP - \int_X f\, dQ \right|}{\|f\|_{\mathcal{H}}}. \tag{32}$$
($\Leftarrow$) Define $Pf := \int_X f\, dP$. Since $k$ is $c_0$-universal, $\mathcal{H}$ is dense in $C_0(X)$ w.r.t. $\|\cdot\|_u$, i.e., for any $f \in C_0(X)$ and every $\epsilon > 0$, there exists a $g \in \mathcal{H}$ such that $\|f - g\|_u \leq \epsilon$. Therefore,
$$|P_n f - Pf| = |P_n(f - g) + P(g - f) + (P_n g - Pg)| \leq P_n|f - g| + P|f - g| + |P_n g - Pg| \leq 2\epsilon + |P_n g - Pg| \leq 2\epsilon + \|g\|_{\mathcal{H}}\, \gamma_k(P_n, P). \tag{33}$$
Since $\gamma_k(P_n, P) \to 0$ as $n \to \infty$ and $\epsilon$ is arbitrary, $|P_n f - Pf| \to 0$ for any $f \in C_0(X)$. The result follows from Berg et al. (1984, Corollary 4.3), which says that if $P_n f \to Pf$, $\forall\, f \in C_0(X)$, then $P_n f \to Pf$, $\forall\, f \in C_b(X)$, i.e., $P_n \xrightarrow{w} P$.
($\Rightarrow$) Suppose $P_n \xrightarrow{w} P$, i.e., $P_n f \to Pf$, $\forall\, f \in C_b(X)$. This implies $P_n f \to Pf$, $\forall\, f \in \mathcal{H}$, and therefore $\gamma_k(P_n, P) \to 0$ as $n \to \infty$.

Proposition 24 shows that if $k$ is $c_0$-universal, then MMD induces the same topology as induced by the Prohorov and Dudley metrics and therefore is equivalent to both these metrics. This means that, although $k$ being characteristic is sufficient to guarantee $\gamma_k$ being a metric, a stronger condition on $k$, i.e., $k$ being $c_0$-universal, is required for $\gamma_k$ to metrize the weak topology on $M^1_+(X)$. The following result in Sriperumbudur et al. (2009b, Theorem 23) can be obtained as a simple corollary to Proposition 24, wherein the question of metrization of weak topology by $\gamma_k$ is addressed only for compact Hausdorff $X$. The general non-compact case was left as an open problem, which we addressed in Proposition 24.

Corollary 25 (Sriperumbudur et al. (2009b)) Suppose $X$ is compact Hausdorff and $k$ is c-universal. Then, $\gamma_k$ metrizes the weak topology on $M^1_+(X)$.

Proof When $X$ is compact, c-universality and $c_0$-universality are equivalent (see Remark 7(c)). Therefore, the result follows from Proposition 24.

To summarize, in this section, we have related the notions of universality and characteristic kernels by exploiting the relation between universality and the RKHS embedding of Radon measures, which is discussed in Section 3. We showed that universal and characteristic kernels are equivalent on the class of translation invariant and radial kernels on $\mathbb{R}^d$. In addition, one of the open questions in Sriperumbudur et al. (2009b, Section 5) is addressed by determining the conditions on $k$ so that $\gamma_k$ metrizes the weak topology on the space of probability measures, defined on a general non-compact $X$.

5. Conclusions & Discussion

In this work, we have considered the problem of embedding finite signed Borel measures into an RKHS, which is a generalization of the recently studied concept of embedding Borel probability measures into an RKHS, and studied the conditions on the kernel under which this embedding is injective. We showed that the injectivity of this embedding is related to the notion of universality: the embedding is injective if and only if the kernel is universal. In other words, compared to earlier characterizations of universality (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2009), we have provided a novel characterization for universal kernels, which is based on the measure embedding viewpoint as opposed to the point of view of function approximation. In addition, because of this relation between universality and the injective embedding of finite signed Borel measures, we established the relation between universal and characteristic kernels, the latter being related to the injective embedding of Borel probability measures into an RKHS. As an example, we showed the universal and characteristic property to be equivalent in the case of translation invariant and radial kernels on $\mathbb{R}^d$.

The discussion in this paper has been related to the characterization of various notions of universality wherein the RKHS, $\mathcal{H}$, is dense in some subset of $C(X)$ (the space of real-valued continuous functions on $X$) w.r.t. the uniform norm (here, $X$ is some arbitrary topological space). This means any target function, $f^\star$, in the appropriate subset of $C(X)$ can be approximated arbitrarily well by some $g \in \mathcal{H}$ w.r.t. the uniform norm.
There is a notion of universality, which we have not considered, called $L^p$-universality (Steinwart and Christmann, 2008, Chapter 5): a measurable and bounded kernel, $k$, defined on a Hausdorff space, $X$, is said to be $L^p$-universal if the RKHS, $\mathcal{H}$, induced by $k$ is dense in $L^p(X, \mu)$ w.r.t. the $p$-norm, defined as $\|f\|_p := \left( \int_X |f(x)|^p\, d\mu(x) \right)^{1/p}$, for all $\mu \in M^1_+(X)$ and some $p \in [1, \infty)$. Here $L^p(X, \mu)$ is the Banach space of $p$-integrable $\mu$-measurable functions on $X$. This notion of universality is more applicable in learning theory, where the target function, $f^\star$, is usually assumed to lie in $L^p(X, \mu)$ for some $p \in [1, \infty)$ and for some Borel probability measure, $\mu$. By considering this notion of universality, any $f^\star \in L^p(X, \mu)$ can be approximated arbitrarily well by some $g \in \mathcal{H}$ w.r.t. the $p$-norm for all Borel probability measures $\mu$ and some $p \in [1, \infty)$. In particular, Steinwart and Christmann (2008, Theorems 5.31, 5.36 and Corollary 5.37) have shown that $L^p$-universality is necessary and sufficient to achieve consistency in kernel-based learning algorithms. In this paper, we did not consider this notion of universality because, unlike the other notions of universality, it is not straightforward to relate $L^p$-universality and the RKHS embedding of measures by using the Hahn-Banach theorem (see Theorem 5). However, recently, Carmeli et al. (2009, Theorem 1) have shown that $k$ is $L^p$-universal if and only if it is $c_0$-universal, which therefore establishes the relation between $L^p$-universality and the RKHS embedding of measures. Using this result, $L^p$-universality can be related to all other notions considered in this paper, through Figure 1.

Acknowledgments

B. K. S. and G. R. G. L.
wish to acknowledge support from the Institute of Statistical Mathematics (ISM), Tokyo, the National Science Foundation (grant DMS-MSPA 0625409), the Fair Isaac Corporation and the University of California MICRO program. Part of this work was done while B. K. S. was visiting ISM. K. F. was supported by JSPS KAKENHI 19500249.

Appendix A. Supplementary Results

For completeness, we present the following supplementary result, which is a simple generalization of the technique used in the proof of Sriperumbudur et al. (2008, Theorem 3).

Lemma 26 Let $k$ be a measurable and bounded kernel on a measurable space, $X$, and let $\mathcal{H}$ be its associated RKHS. Then, for any $f \in \mathcal{H}$ and for any finite signed Borel measure, $\mu$,
$$\int_X f(x)\, d\mu(x) = \int_X \langle f, k(\cdot, x) \rangle_{\mathcal{H}}\, d\mu(x) = \left\langle f, \int_X k(\cdot, x)\, d\mu(x) \right\rangle_{\mathcal{H}}. \tag{34}$$

Proof Let $T_\mu : \mathcal{H} \to \mathbb{R}$ be a linear functional defined as $T_\mu[f] := \int_X f(x)\, d\mu(x)$. It is easy to show that
$$\|T_\mu\| := \sup_{f \in \mathcal{H}} \frac{|T_\mu[f]|}{\|f\|_{\mathcal{H}}} \leq \sqrt{\sup_{x \in X} k(x,x)}\; \|\mu\| < \infty.$$
Therefore, $T_\mu$ is a bounded linear functional on $\mathcal{H}$. By the Riesz representation theorem (Folland, 1999, Theorem 5.25), there exists a unique $\lambda_\mu \in \mathcal{H}$ such that $T_\mu[f] = \langle f, \lambda_\mu \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. Set $f = k(\cdot, u)$ for some $u \in X$, which implies $\lambda_\mu = \int_X k(\cdot, x)\, d\mu(x)$ and the result follows.

References

N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984.
C. Carmeli, E. De Vito, A. Toigo, and V. Umanità. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 2009.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
N. Dunford and J. T. Schwartz. Linear operators. I: General theory.
Wiley-Interscience, New York, 1958.
T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1–50, 2000.
G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley-Interscience, New York, 1999.
K. Fukumizu, F. Bach, and M. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimension reduction in regression. Annals of Statistics, 37(5):1871–1905, 2009a.
K. Fukumizu, B. K. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 473–480, 2009b.
A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two sample problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, 2007.
A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592. MIT Press, 2008.
E. Hewitt. Linear functionals on spaces of continuous functions. Fundamenta Mathematicae, 37:161–189, 1950.
G. S. Kimeldorf and G. Wahba.
A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41(2):495–502, 1970.
V. A. Menegatto. Strictly positive definite kernels on the circle. Rocky Mountain Journal of Mathematics, 25(3):1149–1163, 1995.
C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7:2651–2667, 2006.
M. Reed and B. Simon. Functional Analysis. Academic Press, New York, 1972.
W. Rudin. Functional Analysis. McGraw-Hill, USA, 1991.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In Proc. of the 14th Annual Conference on Learning Theory, pages 416–426, 2001.
J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, UK, 2004.
A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. 18th International Conference on Algorithmic Learning Theory, pages 13–31. Springer-Verlag, Berlin, Germany, 2007.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In R. Servedio and T. Zhang, editors, Proc. of the 21st Annual Conference on Learning Theory, pages 111–122, 2008.
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, G. R. G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1750–1758. MIT Press, 2009a.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet.
Hilbert space embeddings and metrics on probability measures. http://arxiv.org/abs/0907.5309, August 2009b.
B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. On the relation between universality, characteristic kernels and RKHS embedding of measures. In Proc. of 13th International Conference on Artificial Intelligence and Statistics, 2010. To appear.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
J. Stewart. Positive definite functions and generalizations, an historical survey. Rocky Mountain Journal of Mathematics, 6(3):409–433, 1976.
H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.
