PAC-Bayes Analysis of Multi-view Learning
Shiliang Sun (shiliangsun@gmail.com), Shanghai Key Laboratory of Multidimensional Information Processing, Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China

John Shawe-Taylor (j.shawe-taylor@ucl.ac.uk), Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

Liang Mao (lmao14@outlook.com), Shanghai Key Laboratory of Multidimensional Information Processing, Department of Computer Science and Technology, East China Normal University, 500 Dongchuan Road, Shanghai 200241, China

Abstract

This paper presents eight PAC-Bayes bounds to analyze the generalization performance of multi-view classifiers. These bounds adopt data-dependent Gaussian priors which emphasize classifiers with high view agreements. The center of the prior for the first two bounds is the origin, while the center of the prior for the third and fourth bounds is given by a data-dependent vector. An important technique to obtain these bounds is two derived logarithmic determinant inequalities whose difference lies in whether the dimensionality of data is involved. The centers of the fifth and sixth bounds are calculated on a separate subset of the training set. The last two bounds use unlabeled data to represent view agreements and are thus applicable to semi-supervised multi-view learning. We evaluate all the presented multi-view PAC-Bayes bounds on benchmark data and compare them with previous single-view PAC-Bayes bounds. The usefulness and performance of the multi-view bounds are discussed.

Keywords: PAC-Bayes bound, statistical learning theory, support vector machine, multi-view learning

1. Introduction

Multi-view learning is a promising research direction with prevalent applicability (Sun, 2013). For instance, in multimedia content understanding, multimedia segments can be described by both their video and audio signals, and the video and audio signals are regarded as the two views. Learning from data relies on collecting data that contain a sufficient signal and encoding our prior knowledge in increasingly sophisticated regularization schemes that enable the signal to be extracted. With certain co-regularization schemes, multi-view learning performs well on various learning tasks.

Statistical learning theory (SLT) provides a general framework to analyze the generalization performance of machine learning algorithms. The theoretical outcomes can be used to motivate algorithm design, select models or give insights on the effects and behaviors of some interesting quantities. For example, the well-known large margin principle in support vector machines (SVMs) is well supported by various SLT bounds (Vapnik, 1998; Bartlett and Mendelson, 2002; Sun and Shawe-Taylor, 2010). Different from early bounds that often rely on the complexity measures of the considered function classes, the recent PAC-Bayes bounds (McAllester, 1999; Seeger, 2002; Langford, 2005) give the tightest predictions of the generalization performance, for which the prior and posterior distributions of learners are involved on top of the PAC (Probably Approximately Correct) learning setting (Catoni, 2007; Germain et al., 2009).
Beyond common supervised learning, PAC-Bayes analysis has also been applied to other tasks, e.g., density estimation (Seldin and Tishby, 2010; Higgs and Shawe-Taylor, 2010) and reinforcement learning (Seldin et al., 2012). Although the field of multi-view learning has enjoyed great success with algorithms and applications and is provided with some theoretical results, PAC-Bayes analysis of multi-view learning is still absent. In this paper, we attempt to fill the gap between the developments in theory and practice by proposing new PAC-Bayes bounds for multi-view learning.

An earlier attempt to analyze the generalization of two-view learning was made using Rademacher complexity (Farquhar et al., 2006; Rosenberg and Bartlett, 2007). The bound relied on estimating the empirical Rademacher complexity of the class of pairs of functions from the two views that are matched in expectation under the data generating distribution. Hence, this approach also implicitly relied on the data generating distribution to define the function class (and hence the prior). The current paper makes the definition of the prior in terms of the data generating distribution explicit through the PAC-Bayes framework and provides several bounds. The main advantage, however, is that it defines a framework that makes explicit the definition of the prior in terms of the data generating distribution, setting a template for other related approaches to encoding complex prior knowledge that relies on the data generating distribution.

Kakade and Foster (2007) characterized the expected regret of a semi-supervised multi-view regression algorithm. The results given by Sridharan and Kakade (2008) take an information theoretic approach that involves a number of assumptions that may be difficult to check in practice. With these assumptions, theoretical results including PAC-style analysis to bound expected losses were given, which involve some Bayes optimal predictor but cannot provide computable classification error bounds since the data generating distribution is usually unknown. These results therefore represent a related but distinct approach.

We adopt a PAC-Bayes analysis where we encode our assumptions through priors defined in terms of the data generating distribution. Such priors have been studied by Catoni (2007) under the name of localized priors and more recently by Lever et al. (2013) as data distribution dependent priors. Both papers considered schemes for placing a prior over classifiers defined through their true generalization errors. In contrast, the prior that we consider is mainly used to encode the assumption about the relationship between the two views in the data generating distribution. Such data distribution dependent priors cannot be subjected to traditional Bayesian analysis since we do not have an explicit form for the prior, making inference impossible. Hence, this paper illustrates one of the advantages that arise from the PAC-Bayes framework. The PAC-Bayes theorem bounds the true error of the distribution of classifiers in terms of a term from the sample complexity and the KL divergence between the posterior and the prior distributions of classifiers.
The key technical innovations of the paper enable the bounding of the KL divergence term in terms of empirical quantities despite involving priors that cannot be computed. This approach was adopted in Parrado-Hernández et al. (2012) for some simple priors such as the Gaussian centered at $\mathbb{E}[y\phi(\mathbf{x})]$. The current paper treats a significantly more sophisticated case where the priors encode our expectation that good weight vectors can be found that give similar outputs from both views.

Specifically, we first provide four PAC-Bayes bounds using priors that reflect how well the two views agree on average over all examples. The first two bounds use a Gaussian prior centered at the origin, while the third and fourth ones adopt a different prior whose center is not the origin. However, the formulations of the priors involve mathematical expectations with respect to the unknown data distributions. We manage to bound the expectation-related terms with their empirical estimations on a finite sample of data. Then, we further provide two PAC-Bayes bounds using a part of the training data to determine priors, and two PAC-Bayes bounds for semi-supervised multi-view learning where unlabeled data are involved in the definition of the priors.

When a natural feature split does not exist, multi-view learning could still obtain performance improvements with manufactured splits, provided that each of the views contains not only enough information for the learning task itself, but also some knowledge that the other view does not have. It is therefore important to split features into views satisfying these assumptions. However, data splitting is still an open question and beyond the scope of this paper.

The rest of this paper is organized as follows. After briefly reviewing the PAC-Bayes bound for SVMs in Section 2, we give and derive four multi-view PAC-Bayes bounds involving only empirical quantities in Section 3 and Section 4. Then we give two bounds whose centers are calculated on a separate subset of the training data in Section 5. After that, we present two semi-supervised multi-view PAC-Bayes bounds in Section 6. The optimization formulations of the related single-view and multi-view SVMs as well as semi-supervised multi-view SVMs are given in Section 7. After evaluating the usefulness and performance of the bounds in Section 8, we give concluding remarks in Section 9.

2. PAC-Bayes Bound and Specialization to SVMs

Consider a binary classification problem. Let $\mathcal{D}$ be the distribution of feature $\mathbf{x}$ lying in an input space $\mathcal{X}$ and the corresponding output label $y$ where $y \in \{-1, 1\}$. Suppose $Q$ is a posterior distribution over the parameters of the classifier $c$. Define the true error and empirical error of a classifier as
$$e_{\mathcal{D}} = \mathrm{Pr}_{(\mathbf{x},y)\sim\mathcal{D}}(c(\mathbf{x}) \neq y), \qquad \hat{e}_S = \mathrm{Pr}_{(\mathbf{x},y)\sim S}(c(\mathbf{x}) \neq y) = \frac{1}{m}\sum_{i=1}^m I(c(\mathbf{x}_i) \neq y_i),$$
where $S$ is a sample including $m$ examples, and $I(\cdot)$ is the indicator function. With the distribution $Q$, we can then define the average true error $E_{Q,\mathcal{D}} = \mathbb{E}_{c\sim Q}\, e_{\mathcal{D}}$, and the average empirical error $\hat{E}_{Q,S} = \mathbb{E}_{c\sim Q}\, \hat{e}_S$. The following theorem provides the PAC-Bayes bound on $E_{Q,\mathcal{D}}$ in the current context of binary classification.
Theorem 1 (PAC-Bayes Bound (Langford, 2005)) For any data distribution $\mathcal{D}$, for any prior $P(c)$ over the classifier $c$, for any $\delta \in (0, 1]$:
$$\mathrm{Pr}_{S\sim\mathcal{D}^m}\left(\forall Q(c): \; KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{KL(Q\|P) + \ln\frac{m+1}{\delta}}{m}\right) \ge 1 - \delta, \qquad (1)$$
where $KL(Q\|P) = \mathbb{E}_{c\sim Q}\ln\frac{Q(c)}{P(c)}$ is the KL divergence between $Q$ and $P$, and $KL_+(q\|p) = q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ for $p > q$ and $0$ otherwise.

Suppose from the $m$ training examples we learn an SVM classifier represented by $c_{\mathbf{u}}(\mathbf{x}) = \mathrm{sign}(\mathbf{u}^\top\phi(\mathbf{x}))$, where $\phi(\mathbf{x})$ is a projection of the original feature to a certain feature space induced by some kernel function. Define the prior and the posterior of the classifier to be Gaussian with $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, I)$ and $\mathbf{u} \sim \mathcal{N}(\mu\mathbf{w}, I)$, respectively. Note that here $\|\mathbf{w}\| = 1$, and thus the distance between the center of the posterior and the origin is $\mu$. With this specialization, we give the PAC-Bayes bound for SVMs (Langford, 2005; Parrado-Hernández et al., 2012) below.

Theorem 2 For any data distribution $\mathcal{D}$, for any $\delta \in (0, 1]$, we have
$$\mathrm{Pr}_{S\sim\mathcal{D}^m}\left(\forall \mathbf{w}, \mu: \; KL_+(\hat{E}_{Q,S}(\mathbf{w},\mu) \,\|\, E_{Q,\mathcal{D}}(\mathbf{w},\mu)) \le \frac{\frac{\mu^2}{2} + \ln\frac{m+1}{\delta}}{m}\right) \ge 1 - \delta, \qquad (2)$$
where $\|\mathbf{w}\| = 1$.

All that remains is calculating the empirical stochastic error rate $\hat{E}_{Q,S}$. It can be shown that for a posterior $Q = \mathcal{N}(\mu\mathbf{w}, I)$ with $\|\mathbf{w}\| = 1$, we have
$$\hat{E}_{Q,S} = \mathbb{E}_S\big[\tilde{F}(\mu\gamma(\mathbf{x}, y))\big], \qquad (3)$$
where $\mathbb{E}_S$ is the average over the $m$ training examples, $\gamma(\mathbf{x}, y)$ is the normalized margin of the example,
$$\gamma(\mathbf{x}, y) = y\,\mathbf{w}^\top\phi(\mathbf{x}) / \|\phi(\mathbf{x})\|, \qquad (4)$$
and $\tilde{F}(x)$ is the Gaussian tail probability
$$\tilde{F}(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-t^2/2}\, dt. \qquad (5)$$

The generalization error of the original SVM classifier $c_{\mathbf{w}}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top\phi(\mathbf{x}))$ can be bounded by at most twice the average true error $E_{Q,\mathcal{D}}(\mathbf{w},\mu)$ of the corresponding stochastic classifier (Langford and Shawe-Taylor, 2002). That is, for any $\mu$ we have
$$\mathrm{Pr}_{(\mathbf{x},y)\sim\mathcal{D}}\big(\mathrm{sign}(\mathbf{w}^\top\phi(\mathbf{x})) \neq y\big) \le 2\, E_{Q,\mathcal{D}}(\mathbf{w},\mu). \qquad (6)$$
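As an illustration of how Theorem 2 can be evaluated numerically, the following is a minimal sketch (not the authors' code, and assuming numpy/scipy with linear features $\phi(\mathbf{x}) = \mathbf{x}$): it computes the stochastic empirical error via (3)-(5) and inverts $KL_+$ by bisection to obtain the bound on $E_{Q,\mathcal{D}}$. The same inversion step is reused for all the multi-view bounds below.

```python
# Minimal sketch (illustrative, not from the paper): evaluating the SVM PAC-Bayes
# bound of Theorem 2 for a given unit direction w, scaling mu, and sample (X, y).
import numpy as np
from scipy.stats import norm

def kl_plus_inverse(q_hat, rhs, tol=1e-9):
    """Largest p with KL_+(q_hat || p) <= rhs, found by bisection (KL_+ increases in p)."""
    def kl_plus(q, p):
        if p <= q:
            return 0.0
        q = min(max(q, 1e-12), 1 - 1e-12)
        return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if kl_plus(q_hat, mid) <= rhs else (lo, mid)
    return lo

def svm_pac_bayes_bound(X, y, w, mu, delta=0.05):
    """Bound on the average true error E_{Q,D} from Theorem 2 (linear features)."""
    w = w / np.linalg.norm(w)                               # posterior center direction, ||w|| = 1
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)       # normalized margins, Eq. (4)
    e_hat = np.mean(norm.sf(mu * margins))                  # stochastic empirical error, Eqs. (3), (5)
    m = len(y)
    rhs = (mu**2 / 2 + np.log((m + 1) / delta)) / m         # right-hand side of Eq. (2)
    return e_hat, kl_plus_inverse(e_hat, rhs)
```

By (6), the error of the deterministic classifier $\mathrm{sign}(\mathbf{w}^\top\mathbf{x})$ is then at most twice the returned bound.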
3. Multi-view PAC-Bayes Bounds

We propose a new data-dependent prior for PAC-Bayes analysis of multi-view learning. In particular, we take the distribution on the concatenation of the two weight vectors $\mathbf{u}_1$ and $\mathbf{u}_2$ as their individual product, $\tilde{P}([\mathbf{u}_1^\top, \mathbf{u}_2^\top]^\top) = P_1(\mathbf{u}_1)P_2(\mathbf{u}_2)$, but then weight it in a manner associated with how well the two weight vectors agree on average over all examples. That is, the prior is
$$P([\mathbf{u}_1^\top, \mathbf{u}_2^\top]^\top) \propto P_1(\mathbf{u}_1)P_2(\mathbf{u}_2)V(\mathbf{u}_1, \mathbf{u}_2),$$
where $P_1(\mathbf{u}_1)$ and $P_2(\mathbf{u}_2)$ are Gaussian with zero mean and identity covariance, and
$$V(\mathbf{u}_1, \mathbf{u}_2) = \exp\Big(-\frac{1}{2\sigma^2}\mathbb{E}_{(\mathbf{x}_1,\mathbf{x}_2)}(\mathbf{x}_1^\top\mathbf{u}_1 - \mathbf{x}_2^\top\mathbf{u}_2)^2\Big).$$

To specialize the PAC-Bayes bound for multi-view learning, we consider classifiers of the form
$$c(\mathbf{x}) = \mathrm{sign}(\mathbf{u}^\top\phi(\mathbf{x})), \qquad (7)$$
where $\mathbf{u} = [\mathbf{u}_1^\top, \mathbf{u}_2^\top]^\top$ is the concatenated weight vector from the two views, and $\phi(\mathbf{x})$ can be the concatenated $\mathbf{x} = [\mathbf{x}_1^\top, \mathbf{x}_2^\top]^\top$ itself or a concatenation of maps of $\mathbf{x}$ to kernel-induced feature spaces. Note that $\mathbf{x}_1$ and $\mathbf{x}_2$ indicate features of one example from the two views, respectively. For simplicity, here we use the original features to derive our results, though kernel maps can be implicitly employed as well. Our dimensionality-independent bounds work even when the dimension of the kernelized feature space goes to infinity.

According to our setting, the classifier prior is fixed to be
$$P(\mathbf{u}) \propto \mathcal{N}(\mathbf{0}, I) \times V(\mathbf{u}_1, \mathbf{u}_2). \qquad (8)$$
The function $V(\mathbf{u}_1, \mathbf{u}_2)$ makes the prior place large probability mass on parameters with which the classifiers from the two views agree well on all examples on average. The posterior is chosen to be of the form
$$Q(\mathbf{u}) = \mathcal{N}(\mu\mathbf{w}, I), \qquad (9)$$
where $\|\mathbf{w}\| = 1$.

Define $\tilde{\mathbf{x}} = [\mathbf{x}_1^\top, -\mathbf{x}_2^\top]^\top$. We have
$$P(\mathbf{u}) \propto \exp\Big(-\frac{1}{2}\mathbf{u}^\top\mathbf{u}\Big)\exp\Big(-\frac{1}{2\sigma^2}\mathbb{E}_{(\mathbf{x}_1,\mathbf{x}_2)}(\mathbf{x}_1^\top\mathbf{u}_1 - \mathbf{x}_2^\top\mathbf{u}_2)^2\Big) = \exp\Big(-\frac{1}{2}\mathbf{u}^\top\mathbf{u}\Big)\exp\Big(-\frac{1}{2\sigma^2}\mathbf{u}^\top\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)\mathbf{u}\Big) = \exp\Big(-\frac{1}{2}\mathbf{u}^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)\mathbf{u}\Big).$$
That is, $P(\mathbf{u}) = \mathcal{N}(\mathbf{0}, \Sigma)$ with $\Sigma = \big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\big)^{-1}$. Suppose $\dim(\mathbf{u}) = d$. Given the above prior and posterior, we have the following theorem to characterize their divergence.

Theorem 3
$$KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) = \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] + \mu^2\Big]. \qquad (10)$$

Proof. It is easy to show that the KL divergence between two Gaussians (Rasmussen and Williams, 2006) in a $d$-dimensional space is
$$KL(\mathcal{N}(\boldsymbol{\mu}_0, \Sigma_0) \,\|\, \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1)) = \frac{1}{2}\Big[\ln\frac{|\Sigma_1|}{|\Sigma_0|} + \mathrm{tr}(\Sigma_1^{-1}\Sigma_0) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top\Sigma_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - d\Big]. \qquad (11)$$
The KL divergence between the posterior and the prior is thus
$$\begin{aligned}
KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) &= \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \mathrm{tr}\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big) + \mu^2\mathbf{w}^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)\mathbf{w} - d\Big]\\
&= \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}[\mathrm{tr}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)] + \frac{\mu^2}{\sigma^2}\mathbb{E}[(\mathbf{w}^\top\tilde{\mathbf{x}})^2] + \mu^2\Big]\\
&= \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] + \mu^2\Big],
\end{aligned}$$
which completes the proof.

The problem with this expression is that it contains expectations over the input distribution that we are unable to compute. This is because we have defined the prior distribution in terms of the input distribution via the $V$ function. Such priors are referred to as localized by Catoni (2007). While his work considered specific examples of such priors that satisfy certain optimality conditions, the definition we consider here encodes natural prior assumptions about the link between the input distribution and the classification function, namely that it will have a simple representation in both views. This is an example of luckiness (Shawe-Taylor et al., 1998), where generalization is estimated making assumptions that, if proven true, lead to tighter bounds, as for example in the case of a large margin classifier. We now develop methods that estimate the relevant quantities in (10) from empirical data, so that there will be additional empirical estimations involved in the final bounds besides the usual empirical error.
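To make (10) and (11) concrete, the following minimal sketch computes the KL divergence between the Gaussian posterior $\mathcal{N}(\mu\mathbf{w}, I)$ and the multi-view prior $\mathcal{N}(\mathbf{0}, \Sigma)$, substituting an empirical average of $\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top$ for the unknown expectation purely for illustration; the bounds developed below exist precisely because that expectation is not computable in practice. The function name and inputs are illustrative assumptions.

```python
# Illustrative sketch of Theorem 3 / Eq. (11): KL between the posterior N(mu*w, I)
# and the multi-view prior N(0, Sigma), with E[x~ x~^T] replaced by an empirical
# average (for illustration only).
import numpy as np

def multiview_prior_kl(X1, X2, w, mu, sigma=100.0):
    """X1, X2: (m, d1) and (m, d2) view features; w: unit vector of length d1 + d2."""
    X_tilde = np.hstack([X1, -X2])                    # x~ = [x1; -x2]
    d = X_tilde.shape[1]
    C = X_tilde.T @ X_tilde / len(X_tilde)            # empirical surrogate for E[x~ x~^T]
    A = np.eye(d) + C / sigma**2                      # inverse covariance of the prior
    _, logdet = np.linalg.slogdet(A)
    # Eq. (10): 1/2 [ -ln|A| + (1/sigma^2) E[x~^T x~ + mu^2 (w^T x~)^2] + mu^2 ]
    quad = np.mean(np.sum(X_tilde**2, axis=1) + mu**2 * (X_tilde @ w)**2)
    return 0.5 * (-logdet + quad / sigma**2 + mu**2)
```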
We proceed to provide and prove two inequalities on the involved logarithmic determinant function, which are very important for the subsequent multi-view PAC-Bayes bounds.

Theorem 4
$$-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| \le -d\ln\mathbb{E}\Big[\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|^{1/d}\Big], \qquad (12)$$
$$-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| \le -\mathbb{E}\ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|. \qquad (13)$$

Proof. According to the Minkowski determinant theorem, for $n \times n$ positive semi-definite matrices $A$ and $B$ the following inequality holds
$$|A + B|^{1/n} \ge |A|^{1/n} + |B|^{1/n}, \qquad (14)$$
which implies that the function $A \mapsto |A|^{1/n}$ is concave on the set of $n \times n$ positive semi-definite matrices. Therefore, with Jensen's inequality we have
$$-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| = -d\ln\Big|\mathbb{E}\Big(I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big)\Big|^{1/d} \le -d\ln\mathbb{E}\Big[\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|^{1/d}\Big].$$
Since the natural logarithm is concave, we further have
$$-d\ln\mathbb{E}\Big[\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|^{1/d}\Big] \le -d\,\mathbb{E}\Big[\ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|^{1/d}\Big] = -\mathbb{E}\ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|,$$
and thereby
$$-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| \le -\mathbb{E}\ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|. \qquad (15)$$

Denote $R = \sup_{\tilde{\mathbf{x}}}\|\tilde{\mathbf{x}}\|$. From inequality (12), we can finally prove the following theorem, as detailed in Appendix A.

Theorem 5 (Multi-view PAC-Bayes bound 1) Consider a classifier prior given in (8) and a classifier posterior given in (9). For any data distribution $\mathcal{D}$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, the following inequality holds
$$\forall \mathbf{w}, \mu: \; KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{-\frac{d}{2}\ln\Big[f_m - \big(\sqrt[d]{(R/\sigma)^2 + 1} - 1\big)\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}}\Big]_+ + \frac{H_m}{2\sigma^2} + \frac{(1+\mu^2)R^2}{2\sigma^2}\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}} + \frac{\mu^2}{2} + \ln\frac{m+1}{\delta/3}}{m},$$
where
$$f_m = \frac{1}{m}\sum_{i=1}^m\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|^{1/d}, \qquad H_m = \frac{1}{m}\sum_{i=1}^m\big[\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2\big],$$
and $\|\mathbf{w}\| = 1$.

From the bound formulation, we see that if $(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2$ is small, that is, if the outputs of the two views tend to agree, the bound will be tight.

Note that, although the formulation of $f_m$ involves the outer product of feature vectors, it can actually be represented by the inner product, which is obvious through the following determinant equality
$$\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big| = \frac{\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i}{\sigma^2} + 1, \qquad (16)$$
where we have used the fact that the matrix $\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top$ has rank 1 and has only one nonzero eigenvalue.

We can use inequality (13) instead of (12) to derive a $d$-independent bound (see Theorem 6 below), which does not depend on the dimensionality of the feature representation space.

Theorem 6 (Multi-view PAC-Bayes bound 2) Consider a classifier prior given in (8) and a classifier posterior given in (9). For any data distribution $\mathcal{D}$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, the following inequality holds
$$\forall \mathbf{w}, \mu: \; KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\tilde{f}/2 + \frac{1}{2}\Big(\frac{(1+\mu^2)R^2}{\sigma^2} + \ln\big(1 + \frac{R^2}{\sigma^2}\big)\Big)\sqrt{\frac{1}{2m}\ln\frac{2}{\delta}} + \frac{\mu^2}{2} + \ln\frac{m+1}{\delta/2}}{m},$$
where
$$\tilde{f} = \frac{1}{m}\sum_{i=1}^m\Big(\frac{1}{\sigma^2}\big[\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2\big] - \ln\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|\Big), \qquad (17)$$
and $\|\mathbf{w}\| = 1$.

The proof of this theorem is given in Appendix B. Since this bound is independent of $d$ and the term $\big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\big|$ involving the outer product can be represented by the inner product through (16), this bound can be employed when the dimension of the kernelized feature space goes to infinity.
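The dimensionality-independent bound of Theorem 6 is purely empirical once $\mathbf{w}$, $\mu$, $\sigma$ and $R$ are fixed. The following minimal sketch (an illustration, not the authors' code) evaluates its right-hand side on a labeled two-view sample; combined with the `kl_plus_inverse` routine from the earlier sketch, it yields a numerical bound on $E_{Q,\mathcal{D}}$. Here $R$ is taken as the maximum sample norm, a stand-in for the true supremum (in the experiments of Section 8, $R = 1$ after normalization).

```python
# Illustrative sketch: right-hand side of multi-view PAC-Bayes bound 2 (Theorem 6).
import numpy as np

def mv_pac_bayes_bound2_rhs(X1, X2, w, mu, sigma=100.0, delta=0.05):
    X_tilde = np.hstack([X1, -X2])                     # x~ = [x1; -x2]
    m = len(X_tilde)
    sq_norms = np.sum(X_tilde**2, axis=1)
    R = np.sqrt(sq_norms.max())                        # empirical stand-in for sup ||x~||
    # Eq. (17), via the rank-one identity (16): |I + x x^T / sigma^2| = 1 + x^T x / sigma^2
    f_tilde = np.mean((sq_norms + mu**2 * (X_tilde @ w)**2) / sigma**2
                      - np.log1p(sq_norms / sigma**2))
    dev = ((1 + mu**2) * R**2 / sigma**2 + np.log1p(R**2 / sigma**2)) \
          * np.sqrt(np.log(2 / delta) / (2 * m)) / 2
    return (f_tilde / 2 + dev + mu**2 / 2 + np.log((m + 1) / (delta / 2))) / m
```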
4. Another Two Multi-view PAC-Bayes Bounds

We further propose a new prior whose center is not located at the origin, inspired by Parrado-Hernández et al. (2012). The new classifier prior is
$$P(\mathbf{u}) \propto \mathcal{N}(\eta\mathbf{w}_p, I) \times V(\mathbf{u}_1, \mathbf{u}_2), \qquad (18)$$
and the posterior is still
$$Q(\mathbf{u}) = \mathcal{N}(\mu\mathbf{w}, I), \qquad (19)$$
where $\eta > 0$, $\|\mathbf{w}\| = 1$ and $\mathbf{w}_p = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[y\mathbf{x}]$ (or $\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[y\phi(\mathbf{x})]$ in a predefined kernel space) with $\mathbf{x} = [\mathbf{x}_1^\top, \mathbf{x}_2^\top]^\top$.

We have
$$P(\mathbf{u}) \propto \mathcal{N}(\eta\mathbf{w}_p, I) \times V(\mathbf{u}_1, \mathbf{u}_2) \propto \exp\Big(-\frac{1}{2}(\mathbf{u} - \eta\mathbf{w}_p)^\top(\mathbf{u} - \eta\mathbf{w}_p)\Big)\exp\Big(-\frac{1}{2\sigma^2}\mathbf{u}^\top\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)\mathbf{u}\Big).$$
That is, $P(\mathbf{u}) = \mathcal{N}(\mathbf{u}_p, \Sigma)$ with $\Sigma = \big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\big)^{-1}$ and $\mathbf{u}_p = \eta\Sigma\mathbf{w}_p$. With $d$ being the dimensionality of $\mathbf{u}$, the KL divergence between the posterior and prior is
$$\begin{aligned}
KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) &= \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \mathrm{tr}\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big) + (\mathbf{u}_p - \mu\mathbf{w})^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)(\mathbf{u}_p - \mu\mathbf{w}) - d\Big]\\
&= \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}}] + (\mathbf{u}_p - \mu\mathbf{w})^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)(\mathbf{u}_p - \mu\mathbf{w})\Big]. \qquad (20)
\end{aligned}$$
We have
$$\begin{aligned}
(\mathbf{u}_p - \mu\mathbf{w})^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)(\mathbf{u}_p - \mu\mathbf{w}) &= \eta^2\mathbf{w}_p^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)^{-1}\mathbf{w}_p - 2\eta\mu\,\mathbf{w}_p^\top\mathbf{w} + \mu^2\mathbf{w}^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)\mathbf{w}\\
&= \eta^2\mathbf{w}_p^\top\Big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big)^{-1}\mathbf{w}_p - 2\eta\mu\,\mathbb{E}[y(\mathbf{w}^\top\mathbf{x})] + \frac{\mu^2}{\sigma^2}\mathbb{E}[(\mathbf{w}^\top\tilde{\mathbf{x}})^2] + \mu^2\\
&\le \eta^2\mathbf{w}_p^\top\mathbf{w}_p - 2\eta\mu\,\mathbb{E}[y(\mathbf{w}^\top\mathbf{x})] + \frac{\mu^2}{\sigma^2}\mathbb{E}[(\mathbf{w}^\top\tilde{\mathbf{x}})^2] + \mu^2, \qquad (21)
\end{aligned}$$
where for the last inequality we have used the fact that the matrix $I - \big(I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\big)^{-1}$ is symmetric and positive semi-definite.

Define $\hat{\mathbf{w}}_p = \mathbb{E}_{(\mathbf{x},y)\sim S}[y\mathbf{x}] = \frac{1}{m}\sum_{i=1}^m y_i\mathbf{x}_i$. We have
$$\eta^2\mathbf{w}_p^\top\mathbf{w}_p = \|\eta\mathbf{w}_p - \mu\mathbf{w} + \mu\mathbf{w}\|^2 = \|\eta\mathbf{w}_p - \mu\mathbf{w}\|^2 + \mu^2 + 2(\eta\mathbf{w}_p - \mu\mathbf{w})^\top\mu\mathbf{w} \le \|\eta\mathbf{w}_p - \mu\mathbf{w}\|^2 + \mu^2 + 2\mu\|\eta\mathbf{w}_p - \mu\mathbf{w}\| = \big(\|\eta\mathbf{w}_p - \mu\mathbf{w}\| + \mu\big)^2. \qquad (22)$$
Moreover, we have
$$\|\eta\mathbf{w}_p - \mu\mathbf{w}\| = \|\eta\mathbf{w}_p - \eta\hat{\mathbf{w}}_p + \eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\| \le \|\eta\mathbf{w}_p - \eta\hat{\mathbf{w}}_p\| + \|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\|. \qquad (23)$$
From (20), (21), (22) and (23), it follows that
$$KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) \le -\frac{1}{2}\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{2}\big(\|\eta\mathbf{w}_p - \eta\hat{\mathbf{w}}_p\| + \|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\| + \mu\big)^2 + \frac{1}{2\sigma^2}\mathbb{E}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} - 2\eta\mu\sigma^2 y(\mathbf{w}^\top\mathbf{x}) + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] + \frac{\mu^2}{2}. \qquad (24)$$
By using inequalities (12) and (13), we get the following two theorems, whose proofs are detailed in Appendix C and Appendix D, respectively.

Theorem 7 (Multi-view PAC-Bayes bound 3) Consider a classifier prior given in (18) and a classifier posterior given in (19). For any data distribution $\mathcal{D}$, for any $\mathbf{w}$, positive $\mu$, and positive $\eta$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$ the following multi-view PAC-Bayes bound holds
$$KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{1}{m}\Big\{-\frac{d}{2}\ln\Big[f_m - \big(\sqrt[d]{(R/\sigma)^2+1} - 1\big)\sqrt{\tfrac{1}{2m}\ln\tfrac{4}{\delta}}\Big]_+ + \frac{1}{2}\Big(\frac{\eta R}{\sqrt{m}}\Big(2 + \sqrt{2\ln\tfrac{4}{\delta}}\Big) + \|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\| + \mu\Big)^2 + \frac{\hat{H}_m}{2\sigma^2} + \frac{R^2 + \mu^2R^2 + 4\eta\mu\sigma^2R}{2\sigma^2}\sqrt{\tfrac{1}{2m}\ln\tfrac{4}{\delta}} + \frac{\mu^2}{2} + \ln\frac{m+1}{\delta/4}\Big\},$$
where
$$f_m = \frac{1}{m}\sum_{i=1}^m\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|^{1/d}, \qquad \hat{H}_m = \frac{1}{m}\sum_{i=1}^m\big[\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i - 2\eta\mu\sigma^2 y_i(\mathbf{w}^\top\mathbf{x}_i) + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2\big],$$
and $\|\mathbf{w}\| = 1$.

Besides the term $(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2$ that appears in the previous bounds, we can see that if $\|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\|$ is small, that is, the centers of the prior and posterior tend to overlap, the bound will be tight.
Theorem 8 (Multi-view PAC-Bayes bound 4) Consider a classifier prior given in (18) and a classifier posterior given in (19). For any data distribution $\mathcal{D}$, for any $\mathbf{w}$, positive $\mu$, and positive $\eta$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$ the following multi-view PAC-Bayes bound holds
$$KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{1}{m}\Big\{\frac{1}{2}\Big(\frac{\eta R}{\sqrt{m}}\Big(2 + \sqrt{2\ln\tfrac{3}{\delta}}\Big) + \|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\| + \mu\Big)^2 + \frac{\tilde{H}_m}{2} + \frac{R^2 + 4\eta\mu\sigma^2R + \mu^2R^2 + \sigma^2\ln\big(1 + \frac{R^2}{\sigma^2}\big)}{2\sigma^2}\sqrt{\tfrac{1}{2m}\ln\tfrac{3}{\delta}} + \frac{\mu^2}{2} + \ln\frac{m+1}{\delta/3}\Big\},$$
where
$$\tilde{H}_m = \frac{1}{m}\sum_{i=1}^m\Big(\frac{\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i - 2\eta\mu\sigma^2 y_i(\mathbf{w}^\top\mathbf{x}_i) + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2}{\sigma^2} - \ln\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|\Big), \qquad (25)$$
and $\|\mathbf{w}\| = 1$.

5. Separate Training Data Dependent Multi-view PAC-Bayes Bounds

We attempt to improve our bounds by using a separate set of training data to determine new priors, inspired by Ambroladze et al. (2007) and Parrado-Hernández et al. (2012). We consider a spherical Gaussian whose center is calculated on a subset $T$ of the training set comprising $r$ training patterns and labels. In the experiments this is taken as a random subset, but for simplicity of the presentation we will assume $T$ comprises the last $r$ examples $\{\mathbf{x}_k, y_k\}_{k=m-r+1}^m$. The new prior is
$$P(\mathbf{u}) = \mathcal{N}(\eta\mathbf{w}_p, I), \qquad (26)$$
and the posterior is again
$$Q(\mathbf{u}) = \mathcal{N}(\mu\mathbf{w}, I). \qquad (27)$$
One reasonable choice of $\mathbf{w}_p$ is
$$\mathbf{w}_p = \mathbb{E}_{\tilde{\mathbf{x}}}[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top]^{-1}\,\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[y\mathbf{x}], \qquad (28)$$
which is the solution to the following optimization problem
$$\max_{\mathbf{w}}\; \frac{\mathbb{E}_{\mathbf{x}_1,y}[y\,\mathbf{w}_1^\top\mathbf{x}_1] + \mathbb{E}_{\mathbf{x}_2,y}[y\,\mathbf{w}_2^\top\mathbf{x}_2]}{\mathbb{E}_{\mathbf{x}_1,\mathbf{x}_2}[(\mathbf{w}_1^\top\mathbf{x}_1 - \mathbf{w}_2^\top\mathbf{x}_2)^2]}, \qquad (29)$$
where $\mathbf{w} = [\mathbf{w}_1^\top, \mathbf{w}_2^\top]^\top$. We use the subset $T$ to approximate $\mathbf{w}_p$, that is, let
$$\mathbf{w}_p = \mathbb{E}_{\tilde{\mathbf{x}}\sim T}[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top]^{-1}\,\mathbb{E}_{(\mathbf{x},y)\sim T}[y\mathbf{x}] = \Big(\frac{1}{r}\sum_{k=m-r+1}^{m}\tilde{\mathbf{x}}_k\tilde{\mathbf{x}}_k^\top\Big)^{-1}\Big(\frac{1}{r}\sum_{k=m-r+1}^{m}y_k\mathbf{x}_k\Big). \qquad (30)$$
The KL divergence between the posterior and prior is
$$KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) = KL(\mathcal{N}(\mu\mathbf{w}, I) \,\|\, \mathcal{N}(\eta\mathbf{w}_p, I)) = \frac{1}{2}\|\eta\mathbf{w}_p - \mu\mathbf{w}\|^2. \qquad (31)$$
Since we separate $r$ examples to calculate the prior, the actual size of the training set that we apply the bound to is $m - r$. We have the following bound.

Theorem 9 (Multi-view PAC-Bayes bound 5) Consider a classifier prior given in (26) and a classifier posterior given in (27), with $\mathbf{w}_p$ given in (30). For any data distribution $\mathcal{D}$, for any $\mathbf{w}$, positive $\mu$, and positive $\eta$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$ the following multi-view PAC-Bayes bound holds
$$KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\frac{1}{2}\|\eta\mathbf{w}_p - \mu\mathbf{w}\|^2 + \ln\frac{m-r+1}{\delta}}{m - r}, \qquad (32)$$
where $\|\mathbf{w}\| = 1$.

Another choice of $\mathbf{w}_p$ is to learn a multi-view SVM classifier with the subset $T$, leading to the following bound.

Theorem 10 (Multi-view PAC-Bayes bound 6) Consider a classifier prior given in (26) and a classifier posterior given in (27). Classifier $\mathbf{w}_p$ has been learned from a subset $T$ of $r$ examples a priori separated from a training set $S$ of $m$ samples. For any data distribution $\mathcal{D}$, for any $\mathbf{w}$, positive $\mu$, and positive $\eta$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$ the following multi-view PAC-Bayes bound holds
$$KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\frac{1}{2}\|\eta\mathbf{w}_p - \mu\mathbf{w}\|^2 + \ln\frac{m-r+1}{\delta}}{m - r}, \qquad (33)$$
where $\|\mathbf{w}\| = 1$.

Although the above two bounds look similar, they are essentially different in that the priors are determined differently. We will see in the experimental results that they also perform differently when applied in our experiments.
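The separate-training-data bound is particularly simple to compute once the prior subset is fixed. The following minimal sketch (an illustration, with assumed function names and a small ridge term added for invertibility) builds $\mathbf{w}_p$ from the last $r$ examples via (30) and evaluates the right-hand side of Theorem 9.

```python
# Illustrative sketch: multi-view PAC-Bayes bound 5 (Theorem 9).
# A subset T of r examples fixes the prior center w_p via Eq. (30); the bound is then
# evaluated with the remaining m - r examples.
import numpy as np

def bound5_rhs(X1, X2, y, w, mu, eta=1.0, r=None, delta=0.05, reg=1e-8):
    m = len(y)
    r = r if r is not None else m // 5                # size of the prior subset T
    X = np.hstack([X1, X2])                           # x  = [x1;  x2]
    X_tilde = np.hstack([X1, -X2])                    # x~ = [x1; -x2]
    # Prior center from the last r examples, Eq. (30); ridge added only for numerical stability.
    Ct = X_tilde[-r:].T @ X_tilde[-r:] / r
    w_p = np.linalg.solve(Ct + reg * np.eye(Ct.shape[0]),
                          (y[-r:, None] * X[-r:]).mean(axis=0))
    # KL between N(mu*w, I) and N(eta*w_p, I), Eq. (31), and the bound of Eq. (32).
    kl = 0.5 * np.linalg.norm(eta * w_p - mu * w)**2
    return (kl + np.log((m - r + 1) / delta)) / (m - r)
```

The empirical stochastic error entering the left-hand side is computed on the remaining $m - r$ examples, as noted above.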
6. Semi-supervised Multi-view PAC-Bayes Bounds

Now we consider PAC-Bayes analysis for semi-supervised multi-view learning, where besides the $m$ labeled examples we are further provided with $u$ unlabeled examples $U = \{\tilde{\mathbf{x}}_j\}_{j=m+1}^{m+u}$. We replace $V(\mathbf{u}_1, \mathbf{u}_2)$ with $\hat{V}(\mathbf{u}_1, \mathbf{u}_2)$, which has the form
$$\hat{V}(\mathbf{u}_1, \mathbf{u}_2) = \exp\Big(-\frac{1}{2\sigma^2}\mathbf{u}^\top\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)\mathbf{u}\Big), \qquad (34)$$
where $\mathbb{E}_U$ denotes the empirical average over the unlabeled set $U$.

6.1 Noninformative Prior Center

Under a setting similar to Section 3, that is, $P(\mathbf{u}) \propto \mathcal{N}(\mathbf{0}, I) \times \hat{V}(\mathbf{u}_1, \mathbf{u}_2)$, we have $P(\mathbf{u}) = \mathcal{N}(\mathbf{0}, \Sigma)$ with $\Sigma = \big(I + \frac{\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\big)^{-1}$. Therefore, according to Theorem 3, we have
$$KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) = \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}_U\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] + \mu^2\Big]. \qquad (35)$$
Substituting (35) into Theorem 1, we reach the following semi-supervised multi-view PAC-Bayes bound.

Theorem 11 (Semi-supervised multi-view PAC-Bayes bound 1) Consider a classifier prior given in (8) with $\hat{V}$ defined in (34), a classifier posterior given in (9) and an unlabeled set $U = \{\tilde{\mathbf{x}}_j\}_{j=m+1}^{m+u}$. For any data distribution $\mathcal{D}$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, the following inequality holds
$$\forall \mathbf{w}, \mu: \; KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\frac{1}{2}\Big[-\ln\big|I + \frac{\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\big| + \frac{1}{\sigma^2}\mathbb{E}_U\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] + \mu^2\Big] + \ln\frac{m+1}{\delta}}{m},$$
where $\|\mathbf{w}\| = 1$.

6.2 Informative Prior Center

Similar to Section 4, we take the classifier prior to be
$$P(\mathbf{u}) \propto \mathcal{N}(\eta\mathbf{w}_p, I) \times \hat{V}(\mathbf{u}_1, \mathbf{u}_2), \qquad (36)$$
where $\hat{V}(\mathbf{u}_1, \mathbf{u}_2)$ is given by (34), $\eta > 0$ and $\mathbf{w}_p = \mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}[y\mathbf{x}]$ with $\mathbf{x} = [\mathbf{x}_1^\top, \mathbf{x}_2^\top]^\top$. We have $P(\mathbf{u}) = \mathcal{N}(\mathbf{u}_p, \Sigma)$ with $\Sigma = \big(I + \frac{\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\big)^{-1}$ and $\mathbf{u}_p = \eta\Sigma\mathbf{w}_p$. By similar reasoning, we get
$$KL(Q(\mathbf{u}) \,\|\, P(\mathbf{u})) \le -\frac{1}{2}\ln\Big|I + \frac{\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{2}\big(\|\eta\mathbf{w}_p - \eta\hat{\mathbf{w}}_p\| + \|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\| + \mu\big)^2 + \frac{1}{2\sigma^2}\mathbb{E}_U\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] - \eta\mu\,\mathbb{E}\big[y(\mathbf{w}^\top\mathbf{x})\big] + \frac{\mu^2}{2}, \qquad (37)$$
which is analogous to (24). Then, we can give the following semi-supervised multi-view PAC-Bayes bound, whose proof is provided in Appendix E.

Theorem 12 (Semi-supervised multi-view PAC-Bayes bound 2) Consider a classifier prior given in (36) with $\hat{V}$ defined in (34), a classifier posterior given in (19) and an unlabeled set $U = \{\tilde{\mathbf{x}}_j\}_{j=m+1}^{m+u}$. For any data distribution $\mathcal{D}$, for any $\mathbf{w}$, positive $\mu$, and positive $\eta$, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, the following inequality holds
$$KL_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{1}{m}\Big\{\frac{1}{2}\Big(\frac{\eta R}{\sqrt{m}}\Big(2 + \sqrt{2\ln\tfrac{3}{\delta}}\Big) + \|\eta\hat{\mathbf{w}}_p - \mu\mathbf{w}\| + \mu\Big)^2 + \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}_U(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}_U\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] + \mu^2\Big] + \bar{S}_m + \eta\mu R\sqrt{\tfrac{2}{m}\ln\tfrac{3}{\delta}} + \ln\frac{m+1}{\delta/3}\Big\},$$
where
$$\bar{S}_m = \frac{1}{m}\sum_{i=1}^m\big[-\eta\mu y_i(\mathbf{w}^\top\mathbf{x}_i)\big],$$
and $\|\mathbf{w}\| = 1$.
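Because the prior covariance in (35) is an empirical average over the unlabeled set, Theorem 11 requires no further relaxation. The following minimal sketch (illustrative names and inputs) evaluates its right-hand side; the labeled sample size $m$ is passed separately since only $U$ enters the prior term.

```python
# Illustrative sketch: right-hand side of semi-supervised multi-view bound 1 (Theorem 11).
import numpy as np

def smv_bound1_rhs(U1, U2, w, mu, m, sigma=100.0, delta=0.05):
    """U1, U2: unlabeled view features; m: number of labeled examples behind E_hat_{Q,S}."""
    U_tilde = np.hstack([U1, -U2])                     # x~ = [x1; -x2] for unlabeled data
    d = U_tilde.shape[1]
    C = U_tilde.T @ U_tilde / len(U_tilde)             # E_U[x~ x~^T]
    _, logdet = np.linalg.slogdet(np.eye(d) + C / sigma**2)
    quad = np.mean(np.sum(U_tilde**2, axis=1) + mu**2 * (U_tilde @ w)**2)
    kl = 0.5 * (-logdet + quad / sigma**2 + mu**2)     # Eq. (35)
    return (kl + np.log((m + 1) / delta)) / m
```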
7. Learning Algorithms

Below we provide the optimization formulations for the single-view and multi-view SVMs as well as semi-supervised multi-view SVMs that are adopted to train classifiers and calculate PAC-Bayes bounds. Note that the augmented vector representation is used by appending a scalar 1 at the end of the feature representations, in order to formulate the classifier in a simple form without the explicit bias term.

7.1 SVMs

The optimization problem (Cristianini and Shawe-Taylor, 2000; Shawe-Taylor and Sun, 2011) is formulated as
$$\min_{\mathbf{w},\boldsymbol{\xi}}\; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top\mathbf{x}_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n, \qquad (38)$$
where the scalar $C$ controls the balance between the margin and the empirical loss. This problem is a differentiable convex problem with affine constraints. The constraint qualification is satisfied by the refined Slater's condition. The Lagrangian of problem (38) is
$$L(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\lambda}, \boldsymbol{\gamma}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^n\xi_i - \sum_{i=1}^n\lambda_i\big[y_i(\mathbf{w}^\top\mathbf{x}_i) - 1 + \xi_i\big] - \sum_{i=1}^n\gamma_i\xi_i, \quad \lambda_i \ge 0, \; \gamma_i \ge 0, \qquad (39)$$
where $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_n]^\top$ and $\boldsymbol{\gamma} = [\gamma_1, \ldots, \gamma_n]^\top$ are the associated Lagrange multipliers. From the optimality conditions, we obtain
$$\partial_{\mathbf{w}}L(\mathbf{w}^*, \boldsymbol{\xi}^*, \boldsymbol{\lambda}^*, \boldsymbol{\gamma}^*) = \mathbf{w}^* - \sum_{i=1}^n\lambda_i^*y_i\mathbf{x}_i = \mathbf{0}, \qquad (40)$$
$$\partial_{\xi_i}L(\mathbf{w}^*, \boldsymbol{\xi}^*, \boldsymbol{\lambda}^*, \boldsymbol{\gamma}^*) = C - \lambda_i^* - \gamma_i^* = 0, \quad i = 1, \ldots, n. \qquad (41)$$
The dual optimization problem is derived as
$$\min_{\boldsymbol{\lambda}}\; \frac{1}{2}\boldsymbol{\lambda}^\top D\boldsymbol{\lambda} - \boldsymbol{\lambda}^\top\mathbf{1} \quad \text{s.t.} \quad \mathbf{0} \preceq \boldsymbol{\lambda} \preceq C\mathbf{1}, \qquad (42)$$
where $D$ is a symmetric $n \times n$ matrix with entries $D_{ij} = y_iy_j\mathbf{x}_i^\top\mathbf{x}_j$. Once the solution $\boldsymbol{\lambda}^*$ is given, the SVM decision function is given by
$$c^*(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^n y_i\lambda_i^*\mathbf{x}^\top\mathbf{x}_i\Big). \qquad (43)$$
Using the kernel trick, the optimization problem for SVMs is still (42). However, now $D_{ij} = y_iy_j\kappa(\mathbf{x}_i, \mathbf{x}_j)$ with the kernel function $\kappa(\cdot,\cdot)$, and the solution for the SVM classifier is formulated as
$$c^*(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^n y_i\lambda_i^*\kappa(\mathbf{x}_i, \mathbf{x})\Big). \qquad (44)$$

7.2 MvSVMs

Denote the classifier weights from the two views by $\mathbf{w}_1$ and $\mathbf{w}_2$, which are not assumed to be unit vectors at the moment. Inspired by semi-supervised multi-view SVMs (Sindhwani et al., 2005; Sindhwani and Rosenberg, 2008; Sun and Shawe-Taylor, 2010), the objective function of the multi-view SVMs (MvSVMs) can be given by
$$\begin{aligned}
\min_{\mathbf{w}_1,\mathbf{w}_2,\boldsymbol{\xi}_1,\boldsymbol{\xi}_2}\;& \frac{1}{2}\big(\|\mathbf{w}_1\|^2 + \|\mathbf{w}_2\|^2\big) + C_1\sum_{i=1}^n(\xi_{i1} + \xi_{i2}) + C_2\sum_{i=1}^n(\mathbf{w}_1^\top\mathbf{x}_{i1} - \mathbf{w}_2^\top\mathbf{x}_{i2})^2\\
\text{s.t.}\;& y_i\mathbf{w}_1^\top\mathbf{x}_{i1} \ge 1 - \xi_{i1}, \quad y_i\mathbf{w}_2^\top\mathbf{x}_{i2} \ge 1 - \xi_{i2}, \quad \xi_{i1}, \xi_{i2} \ge 0, \quad i = 1, \ldots, n. \qquad (45)
\end{aligned}$$
If kernel functions are used, the solution of the above optimization problem can be given by $\mathbf{w}_1 = \sum_{i=1}^n\alpha_{i1}k_1(\mathbf{x}_{i1}, \cdot)$ and $\mathbf{w}_2 = \sum_{i=1}^n\alpha_{i2}k_2(\mathbf{x}_{i2}, \cdot)$. Since a function defined on view $j$ only depends on the $j$th feature set, the solution is given by
$$\mathbf{w}_1 = \sum_{i=1}^n\alpha_{i1}k_1(\mathbf{x}_i, \cdot), \qquad \mathbf{w}_2 = \sum_{i=1}^n\alpha_{i2}k_2(\mathbf{x}_i, \cdot). \qquad (46)$$
It can be shown that
$$\|\mathbf{w}_1\|^2 = \boldsymbol{\alpha}_1^\top K_1\boldsymbol{\alpha}_1, \quad \|\mathbf{w}_2\|^2 = \boldsymbol{\alpha}_2^\top K_2\boldsymbol{\alpha}_2, \quad \sum_{i=1}^n(\mathbf{w}_1^\top\mathbf{x}_i - \mathbf{w}_2^\top\mathbf{x}_i)^2 = (K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2)^\top(K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2),$$
where $K_1$ and $K_2$ are the kernel matrices from the two views. The optimization problem (45) can be reformulated as the following
$$\begin{aligned}
\min_{\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2,\boldsymbol{\xi}_1,\boldsymbol{\xi}_2}\; F_0 =\;& \frac{1}{2}\big(\boldsymbol{\alpha}_1^\top K_1\boldsymbol{\alpha}_1 + \boldsymbol{\alpha}_2^\top K_2\boldsymbol{\alpha}_2\big) + C_2(K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2)^\top(K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2) + C_1\sum_{i=1}^n(\xi_{i1} + \xi_{i2})\\
\text{s.t.}\;& y_i\sum_{j=1}^n\alpha_{j1}k_1(\mathbf{x}_j, \mathbf{x}_i) \ge 1 - \xi_{i1}, \quad y_i\sum_{j=1}^n\alpha_{j2}k_2(\mathbf{x}_j, \mathbf{x}_i) \ge 1 - \xi_{i2}, \quad \xi_{i1}, \xi_{i2} \ge 0, \quad i = 1, \ldots, n. \qquad (47)
\end{aligned}$$
The derivation of the dual optimization formulation is detailed in Appendix F.
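For illustration, the primal problem (45) with linear kernels can also be handed directly to a generic convex solver. The sketch below uses the cvxpy modeling package (an assumption for illustration; the paper instead solves the dual derived in Appendix F and summarized in Table 1).

```python
# Illustrative sketch: solving the MvSVM primal (45) with linear kernels via cvxpy.
# This is not the paper's algorithm (which uses the dual of Appendix F).
import cvxpy as cp
import numpy as np

def mvsvm_primal(X1, X2, y, C1=1.0, C2=1.0):
    n, d1 = X1.shape
    d2 = X2.shape[1]
    w1, w2 = cp.Variable(d1), cp.Variable(d2)
    xi1, xi2 = cp.Variable(n, nonneg=True), cp.Variable(n, nonneg=True)
    objective = (0.5 * (cp.sum_squares(w1) + cp.sum_squares(w2))
                 + C1 * cp.sum(xi1 + xi2)
                 + C2 * cp.sum_squares(X1 @ w1 - X2 @ w2))   # view-agreement co-regularizer
    constraints = [cp.multiply(y, X1 @ w1) >= 1 - xi1,
                   cp.multiply(y, X2 @ w2) >= 1 - xi2]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w1.value, w2.value
```

The concatenation $[\mathbf{w}_1^\top, \mathbf{w}_2^\top]^\top$, normalized to unit length as described in Section 8, can then serve as the posterior center direction $\mathbf{w}$ when evaluating the multi-view bounds.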
Table 1 summarizes the MvSVM algorithm.

Table 1: The MvSVM Algorithm
Input: A training set with n examples {(x_i, y_i)}_{i=1}^n (each example has two views); kernel functions k_1(·,·) and k_2(·,·) for the two views, respectively; regularization coefficients C_1, C_2.
Algorithm:
1. Calculate Gram matrices K_1 and K_2 from the two views.
2. Calculate A, B, D according to (90).
3. Solve the quadratic optimization problem (92) to get λ_1, λ_2.
4. Calculate α_1 and α_2 using (86) and (87).
Output: Classifier parameters α_1 and α_2 used by (46).

7.3 Semi-supervised MvSVMs (SMvSVMs)

Next we give the optimization formulation for semi-supervised MvSVMs (SMvSVMs) (Sindhwani et al., 2005; Sindhwani and Rosenberg, 2008; Sun and Shawe-Taylor, 2010), where besides the $n$ labeled examples we further have $u$ unlabeled examples. Denote the classifier weights from the two views by $\mathbf{w}_1$ and $\mathbf{w}_2$, which are not assumed to be unit vectors. The objective function of SMvSVMs is
$$\begin{aligned}
\min_{\mathbf{w}_1,\mathbf{w}_2,\boldsymbol{\xi}_1,\boldsymbol{\xi}_2}\;& \frac{1}{2}\big(\|\mathbf{w}_1\|^2 + \|\mathbf{w}_2\|^2\big) + C_1\sum_{i=1}^n(\xi_{i1} + \xi_{i2}) + C_2\sum_{i=1}^{n+u}(\mathbf{w}_1^\top\mathbf{x}_{i1} - \mathbf{w}_2^\top\mathbf{x}_{i2})^2\\
\text{s.t.}\;& y_i\mathbf{w}_1^\top\mathbf{x}_{i1} \ge 1 - \xi_{i1}, \quad y_i\mathbf{w}_2^\top\mathbf{x}_{i2} \ge 1 - \xi_{i2}, \quad \xi_{i1}, \xi_{i2} \ge 0, \quad i = 1, \ldots, n. \qquad (48)
\end{aligned}$$
If kernel functions are used, the solution can be expressed by $\mathbf{w}_1 = \sum_{i=1}^{n+u}\alpha_{i1}k_1(\mathbf{x}_{i1}, \cdot)$ and $\mathbf{w}_2 = \sum_{i=1}^{n+u}\alpha_{i2}k_2(\mathbf{x}_{i2}, \cdot)$. Since a function defined on view $j$ only depends on the $j$th feature set, the solution is given by
$$\mathbf{w}_1 = \sum_{i=1}^{n+u}\alpha_{i1}k_1(\mathbf{x}_i, \cdot), \qquad \mathbf{w}_2 = \sum_{i=1}^{n+u}\alpha_{i2}k_2(\mathbf{x}_i, \cdot). \qquad (49)$$
It is straightforward to show that
$$\|\mathbf{w}_1\|^2 = \boldsymbol{\alpha}_1^\top K_1\boldsymbol{\alpha}_1, \quad \|\mathbf{w}_2\|^2 = \boldsymbol{\alpha}_2^\top K_2\boldsymbol{\alpha}_2, \quad \sum_{i=1}^{n+u}(\mathbf{w}_1^\top\mathbf{x}_i - \mathbf{w}_2^\top\mathbf{x}_i)^2 = (K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2)^\top(K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2),$$
where the $(n+u)\times(n+u)$ matrices $K_1$ and $K_2$ are the kernel matrices from the two views. The optimization problem (48) can be reformulated as
$$\begin{aligned}
\min_{\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2,\boldsymbol{\xi}_1,\boldsymbol{\xi}_2}\; \tilde{F}_0 =\;& \frac{1}{2}\big(\boldsymbol{\alpha}_1^\top K_1\boldsymbol{\alpha}_1 + \boldsymbol{\alpha}_2^\top K_2\boldsymbol{\alpha}_2\big) + C_2(K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2)^\top(K_1\boldsymbol{\alpha}_1 - K_2\boldsymbol{\alpha}_2) + C_1\sum_{i=1}^n(\xi_{i1} + \xi_{i2})\\
\text{s.t.}\;& y_i\sum_{j=1}^{n+u}\alpha_{j1}k_1(\mathbf{x}_j, \mathbf{x}_i) \ge 1 - \xi_{i1}, \quad y_i\sum_{j=1}^{n+u}\alpha_{j2}k_2(\mathbf{x}_j, \mathbf{x}_i) \ge 1 - \xi_{i2}, \quad \xi_{i1}, \xi_{i2} \ge 0, \quad i = 1, \ldots, n. \qquad (50)
\end{aligned}$$
The derivation of the dual optimization formulation is detailed in Appendix G. Table 2 summarizes the SMvSVM algorithm.

Table 2: The SMvSVM Algorithm
Input: A training set with n labeled examples {(x_i, y_i)}_{i=1}^n (each example has two views) and u unlabeled examples; kernel functions k_1(·,·) and k_2(·,·) for the two views, respectively; regularization coefficients C_1, C_2.
Algorithm:
1. Calculate Gram matrices K_1 and K_2 from the two views.
2. Calculate A, B, D according to (104).
3. Solve the quadratic optimization problem (106) to get λ_1, λ_2.
4. Calculate α_1 and α_2 using (100) and (101).
Output: Classifier parameters α_1 and α_2 used by (49).

8. Experiments

The new bounds are evaluated on one synthetic and three real-world multi-view data sets where the learning task is binary classification. Below we first introduce the used data and the experimental settings. Then we report the test errors of the involved variants of the SVM algorithms, and evaluate the usefulness and relative performance of the new PAC-Bayes bounds.

8.1 Data Sets

The four multi-view data sets are introduced as follows.

Synthetic. The synthetic data include 2000 examples, half of which belong to the positive class. The dimensionality of each of the two views is 50.
We first generate two random direction vectors, one for each view, and then for each view sample 2000 points such that the inner products between the direction and the feature vectors of half of the points are positive and the inner products for the other half of the points are negative. For the same point, the corresponding inner products calculated from the two views are made identical. Finally, we add Gaussian white noise to the generated data to form the synthetic data set.

Handwritten. The handwritten digit data set is taken from the UCI machine learning repository (Bache and Lichman, 2013), which includes features of ten handwritten digits (0-9) extracted from a collection of Dutch utility maps. It consists of 2000 examples (200 examples per class) with the first view being the 76 Fourier coefficients, and the second view being the 64 Karhunen-Loève coefficients of each image. Binary classification between digits (1, 2, 3) and (4, 5, 6) is used for the experiments.

Ads. The ads data are used for classifying web images into ads and non-ads (Kushmerick, 1999). This data set consists of 3279 examples with 459 of them being ads. 1554 binary attributes (weights of text terms related to an image using the Boolean model) are used for classification, whose values can be 0 and 1. These attributes are divided into two views: one view describes the image itself (terms in the image's caption, URL and alt text) and the other view contains features from other information (terms in the page and destination URLs). The two views have 587 and 967 features, respectively.

Course. The course data set consists of 1051 two-view web pages collected from computer science department web sites at four universities: Cornell University, University of Washington, University of Wisconsin, and University of Texas. There are 230 course pages and 821 non-course pages. The two views are words occurring in a web page and words appearing in the links pointing to that page (Blum and Mitchell, 1998; Sun and Shawe-Taylor, 2010). The document vectors are normalized to tf-idf (term frequency-inverse document frequency) features and then principal component analysis is used to perform dimensionality reduction. The dimensions of the two views are 500 and 87, respectively.
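For concreteness, the following is a minimal sketch of a generator consistent with the synthetic-data description above. The noise level, scaling, and function name are assumptions; the paper does not specify them.

```python
# Illustrative generator consistent with the synthetic-data description above.
# The noise level and scaling are assumptions, not taken from the paper.
import numpy as np

def make_synthetic(n=2000, d=50, noise=0.1, rng=np.random.default_rng(0)):
    w1, w2 = rng.standard_normal(d), rng.standard_normal(d)   # one direction per view
    y = np.repeat([1, -1], n // 2)                             # half positive, half negative
    t = y * np.abs(rng.standard_normal(n))                     # shared target inner products
    def view(w):
        X = rng.standard_normal((n, d))
        X += np.outer((t - X @ w) / (w @ w), w)                # enforce x . w = t (same t in both views)
        return X + noise * rng.standard_normal((n, d))         # add Gaussian white noise
    return view(w1), view(w2), y
```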
8.2 Experimental Settings

Our experiments include algorithm test error evaluation and PAC-Bayes bound evaluation for single-view learning, multi-view learning, supervised learning and semi-supervised learning. For single-view learning, SVMs are trained separately on each of the two views and on the third view (concatenating the previous two views to form a long view), providing three supervised classifiers which are called SVM-1, SVM-2 and SVM-3, respectively. Evaluating the performance of the third view is interesting for comparing single-view and multi-view learning methods, since single-view learning on the third view can exploit the same data as the usual multi-view learning algorithms. The MvSVMs and SMvSVMs are supervised multi-view learning and semi-supervised multi-view learning algorithms, respectively. The linear kernel is used for all the algorithms.

For each data set, four experimental settings are used. All the settings use 20% of all the examples as the unlabeled examples. For the remaining examples, the four settings use 20%, 40%, 60% and 80% of them as the labeled training set, respectively, and the rest forms the test set. Supervised algorithms do not use the unlabeled training data. For multi-view PAC-Bayes bounds 5 and 6, we use 20% of the labeled training set to calculate the prior, and evaluate the bounds on the remaining 80% of the training set. Each setting involves 10 random partitions of the above subsets. The reported performance is the average test error and standard deviation over these random partitions.

Model parameters, i.e., $C$ in SVMs, and $C_1$, $C_2$ in MvSVMs and SMvSVMs, are selected by three-fold cross-validation on each labeled training set, where $C_1$, $C_2$ are selected from $\{10^{-6}, 10^{-4}, 10^{-2}, 1, 10, 100\}$ and $C$ is selected from $\{10^{-8}, 5\times10^{-8}, 10^{-7}, 5\times10^{-7}, 10^{-6}, 5\times10^{-6}, 10^{-5}, 5\times10^{-5}, 10^{-4}, 5\times10^{-4}, 10^{-3}, 5\times10^{-3}, 10^{-2}, 5\times10^{-2}, 10^{-1}, 5\times10^{-1}, 1, 5, 10, 20, 25, 30, 40, 50, 55, 60, 70, 80, 85, 90, 100, 300, 500, 700, 900, 1000\}$. All the PAC-Bayes bounds are evaluated with a confidence of $\delta = 0.05$. We normalize $\mathbf{w}$ in the posterior when we calculate the bounds. For the multi-view PAC-Bayes bounds, $\sigma$ is fixed to 100, $\eta$ is set to 1, and $R$ is equal to 1, which is clear from the augmented feature representation and the data normalization preprocessing (all the training examples after feature augmentation are divided by a common value to make the maximum feature vector length be one).

We evaluate the following eleven PAC-Bayes bounds, where the last eight bounds are presented in this paper.

• PB-1: The PAC-Bayes bound given by Theorem 2 and the SVM algorithm on the first view.
• PB-2: The PAC-Bayes bound given by Theorem 2 and the SVM algorithm on the second view.
• PB-3: The PAC-Bayes bound given by Theorem 2 and the SVM algorithm on the third view.
• MvPB-1: Multi-view PAC-Bayes bound 1 with the MvSVM algorithm.
• MvPB-2: Multi-view PAC-Bayes bound 2 with the MvSVM algorithm.
• MvPB-3: Multi-view PAC-Bayes bound 3 with the MvSVM algorithm.
• MvPB-4: Multi-view PAC-Bayes bound 4 with the MvSVM algorithm.
• MvPB-5: Multi-view PAC-Bayes bound 5 with the MvSVM algorithm.
• MvPB-6: Multi-view PAC-Bayes bound 6 with the MvSVM algorithm.
• SMvPB-1: Semi-supervised multi-view PAC-Bayes bound 1 with the SMvSVM algorithm.
• SMvPB-2: Semi-supervised multi-view PAC-Bayes bound 2 with the SMvSVM algorithm.

8.3 Test Errors

The prediction performances of SVMs, MvSVMs and SMvSVMs for the four experimental settings are reported in Table 3, Table 4, Table 5 and Table 6, respectively. For each data set, the best performance is indicated with boldface numbers. From all of these results, we see that MvSVMs and SMvSVMs have the best overall performance, while sometimes single-view SVMs can have the best performance. SMvSVMs often perform better than MvSVMs since additional unlabeled examples are used, especially when the labeled training data set is small. Moreover, as expected, with more labeled training data the prediction performance of the algorithms will usually increase.
Test Error   Synthetic       Handwritten     Ads             Course
SVM-1        17.20 ± 1.39    5.66 ± 0.94     5.84 ± 0.56     19.15 ± 1.54
SVM-2        19.98 ± 0.76    3.98 ± 0.68     5.25 ± 0.79     10.15 ± 1.60
SVM-3        16.55 ± 2.04    1.65 ± 0.53     4.62 ± 0.80     10.33 ± 1.34
MvSVM        10.54 ± 0.73    2.17 ± 0.64     4.55 ± 0.66     10.55 ± 1.47
SMvSVM       10.30 ± 0.79    2.04 ± 0.69     4.70 ± 0.70     10.28 ± 1.63
Table 3: Average error rates (%) and standard deviations for different learning algorithms under the 20% training setting.

Test Error   Synthetic       Handwritten     Ads             Course
SVM-1        14.49 ± 0.98    5.57 ± 0.41     5.04 ± 0.83     14.23 ± 1.27
SVM-2        16.88 ± 1.06    3.75 ± 0.99     4.14 ± 0.40     7.64 ± 0.80
SVM-3        10.31 ± 0.82    1.51 ± 0.39     3.61 ± 0.54     7.68 ± 0.97
MvSVM        7.72 ± 0.78     1.98 ± 0.61     3.56 ± 0.54     7.00 ± 0.93
SMvSVM       7.48 ± 0.66     2.03 ± 0.61     3.44 ± 0.54     6.81 ± 0.98
Table 4: Average error rates (%) and standard deviations for different learning algorithms under the 40% training setting.

Test Error   Synthetic       Handwritten     Ads             Course
SVM-1        14.23 ± 1.24    5.16 ± 0.61     4.32 ± 0.50     11.28 ± 1.30
SVM-2        16.11 ± 0.94    3.46 ± 0.94     3.90 ± 0.58     6.53 ± 1.44
SVM-3        9.08 ± 1.07     1.77 ± 0.85     3.43 ± 0.51     6.62 ± 1.33
MvSVM        7.30 ± 0.85     1.67 ± 0.63     3.45 ± 0.32     5.82 ± 1.73
SMvSVM       7.31 ± 0.80     1.82 ± 0.70     3.36 ± 0.38     5.93 ± 1.63
Table 5: Average error rates (%) and standard deviations for different learning algorithms under the 60% training setting.

Test Error   Synthetic       Handwritten     Ads             Course
SVM-1        13.06 ± 2.00    5.42 ± 1.51     4.47 ± 0.60     9.70 ± 1.64
SVM-2        16.03 ± 1.73    3.54 ± 1.33     3.59 ± 0.66     5.62 ± 1.68
SVM-3        8.06 ± 1.11     1.93 ± 0.66     2.96 ± 0.51     5.56 ± 1.72
MvSVM        6.28 ± 1.20     1.82 ± 0.75     3.19 ± 0.63     4.20 ± 1.51
SMvSVM       6.28 ± 1.19     1.93 ± 0.77     3.15 ± 0.75     3.96 ± 1.59
Table 6: Average error rates (%) and standard deviations for different learning algorithms under the 80% training setting.

8.4 PAC-Bayes Bounds

Table 7, Table 8, Table 9 and Table 10 show the values of the various PAC-Bayes bounds under the different settings, where for each data set the best bound is indicated in bold and the best multi-view bound is indicated with an underline. From all the bound results, we find that the best single-view bound is usually tighter than the best multi-view bound, except on the synthetic data set. One possible explanation for this is that the synthetic data set is ideal and in accordance with the assumptions for multi-view learning encoded in the prior, while the real-world data sets are not. This also indicates that there is much space and possibility for further developments of multi-view PAC-Bayes analysis. In addition, with more labeled training data the corresponding bound will usually become tighter. Last but not least, among the eight presented multi-view PAC-Bayes bounds on the real-world data sets, the tightest one is often the first semi-supervised multi-view bound, which exploits unlabeled data to calculate the function $\hat{V}(\mathbf{u}_1, \mathbf{u}_2)$ and needs no further relaxation. The results also show that the second multi-view PAC-Bayes bound (the dimensionality-independent bound with the prior distribution centered at the origin) is sometimes very good.
PAC-Bayes Bound   Synthetic       Handwritten     Ads             Course
PB-1              60.58 ± 0.12    54.61 ± 1.59    40.49 ± 2.09    58.93 ± 8.90
PB-2              60.72 ± 0.09    45.17 ± 3.74    40.44 ± 2.12    61.64 ± 1.49
PB-3              60.49 ± 0.12    47.62 ± 3.42    43.75 ± 3.15    59.67 ± 2.32
MvPB-1            61.27 ± 0.07    51.63 ± 2.89    40.87 ± 2.77    63.54 ± 0.45
MvPB-2            61.04 ± 0.07    51.45 ± 2.89    40.80 ± 2.77    63.26 ± 0.47
MvPB-3            62.35 ± 0.01    63.44 ± 0.62    56.38 ± 1.49    66.37 ± 0.06
MvPB-4            62.17 ± 0.01    63.23 ± 0.61    56.29 ± 1.48    66.14 ± 0.06
MvPB-5            61.84 ± 0.09    52.52 ± 3.01    43.21 ± 2.94    64.36 ± 0.43
MvPB-6            63.74 ± 0.08    58.65 ± 7.09    54.94 ± 4.68    67.75 ± 0.25
SMvPB-1           60.60 ± 0.06    49.84 ± 2.87    40.65 ± 3.25    62.77 ± 0.49
SMvPB-2           62.17 ± 0.01    62.94 ± 0.62    56.28 ± 1.30    66.14 ± 0.06
Table 7: Average PAC-Bayes bounds (%) and standard deviations for different learning algorithms under the 20% training setting.

PAC-Bayes Bound   Synthetic       Handwritten     Ads             Course
PB-1              57.20 ± 0.05    45.26 ± 1.48    33.11 ± 3.89    59.68 ± 0.52
PB-2              57.40 ± 0.11    35.45 ± 3.22    28.85 ± 3.26    55.26 ± 1.97
PB-3              57.15 ± 0.07    35.48 ± 2.26    32.74 ± 4.29    56.12 ± 0.78
MvPB-1            57.69 ± 0.09    40.85 ± 3.23    33.36 ± 2.17    59.17 ± 0.51
MvPB-2            57.54 ± 0.08    40.76 ± 3.22    33.32 ± 2.17    58.99 ± 0.50
MvPB-3            58.97 ± 0.02    57.26 ± 1.17    51.68 ± 1.38    61.91 ± 0.07
MvPB-4            58.85 ± 0.02    57.15 ± 1.16    51.62 ± 1.37    61.77 ± 0.10
MvPB-5            57.44 ± 0.13    42.56 ± 3.36    35.86 ± 2.23    59.91 ± 0.48
MvPB-6            52.67 ± 2.36    42.57 ± 5.93    47.34 ± 3.05    62.86 ± 0.09
SMvPB-1           57.27 ± 0.06    40.76 ± 3.26    34.26 ± 3.00    58.69 ± 0.44
SMvPB-2           58.85 ± 0.01    57.22 ± 1.18    52.16 ± 1.50    61.77 ± 0.09
Table 8: Average PAC-Bayes bounds (%) and standard deviations for different learning algorithms under the 40% training setting.

PAC-Bayes Bound   Synthetic       Handwritten     Ads             Course
PB-1              55.45 ± 0.08    42.07 ± 2.35    29.65 ± 1.93    57.52 ± 0.22
PB-2              55.71 ± 0.08    30.70 ± 2.05    28.59 ± 3.71    53.71 ± 2.27
PB-3              55.39 ± 0.16    30.50 ± 3.31    30.49 ± 4.35    53.78 ± 1.01
MvPB-1            55.89 ± 0.08    34.16 ± 1.88    31.72 ± 4.13    56.90 ± 0.46
MvPB-2            55.78 ± 0.07    34.09 ± 1.88    31.69 ± 4.13    56.75 ± 0.45
MvPB-3            57.38 ± 0.01    52.82 ± 1.08    49.77 ± 2.49    59.82 ± 0.07
MvPB-4            57.29 ± 0.01    52.73 ± 1.07    49.74 ± 2.48    59.69 ± 0.07
MvPB-5            55.60 ± 0.08    36.17 ± 1.88    34.11 ± 4.26    57.56 ± 0.42
MvPB-6            39.20 ± 5.03    31.76 ± 4.17    47.56 ± 3.81    60.67 ± 0.05
SMvPB-1           55.58 ± 0.06    33.93 ± 2.00    32.33 ± 3.37    56.53 ± 0.43
SMvPB-2           57.28 ± 0.01    52.76 ± 1.15    50.51 ± 1.64    59.69 ± 0.07
Table 9: Average PAC-Bayes bounds (%) and standard deviations for different learning algorithms under the 60% training setting.
PAC-Bayes Bound   Synthetic       Handwritten     Ads             Course
PB-1              54.64 ± 0.77    37.52 ± 1.42    28.97 ± 1.51    56.21 ± 0.18
PB-2              54.59 ± 0.04    28.47 ± 2.07    30.28 ± 1.83    51.28 ± 2.97
PB-3              54.21 ± 0.08    26.50 ± 2.15    29.74 ± 3.42    52.00 ± 0.85
MvPB-1            54.65 ± 0.05    30.25 ± 0.86    29.69 ± 0.84    55.77 ± 1.09
MvPB-2            54.63 ± 0.05    30.19 ± 0.86    29.67 ± 0.84    55.38 ± 0.50
MvPB-3            56.41 ± 0.00    49.51 ± 0.52    48.12 ± 0.94    58.55 ± 0.07
MvPB-4            56.32 ± 0.01    49.43 ± 0.54    48.09 ± 0.92    58.44 ± 0.07
MvPB-5            54.36 ± 0.05    32.39 ± 0.88    31.44 ± 0.98    56.22 ± 0.41
MvPB-6            26.89 ± 2.05    31.52 ± 3.33    46.31 ± 1.50    59.23 ± 0.18
SMvPB-1           54.41 ± 0.03    30.15 ± 0.79    30.55 ± 2.28    55.24 ± 0.43
SMvPB-2           56.32 ± 0.01    49.43 ± 0.46    48.77 ± 1.38    58.44 ± 0.06
Table 10: Average PAC-Bayes bounds (%) and standard deviations for different learning algorithms under the 80% training setting.

9. Conclusion

The paper lays the foundation of a theoretical and practical framework for defining priors that encode non-trivial interactions between data distributions and classifiers and translating them into sophisticated regularization schemes and associated generalization bounds. Specifically, we have presented eight new multi-view PAC-Bayes bounds, which integrate the view agreement as a key measure to modulate the prior distributions of classifiers. As extensions of PAC-Bayes analysis to the multi-view learning scenario, the proposed theoretical results are promising to fill the gap between the developments in theory and practice of multi-view learning, and may also serve as the underpinnings to explain the effectiveness of multi-view learning. We have validated the theoretical superiority of multi-view learning in the ideal case of synthetic data, though this is not so evident for real-world data, which may not well meet our assumptions on the priors for multi-view learning.

The usefulness of the proposed bounds has been shown. Although often the current bounds are not the tightest, they indeed open the possibility of applying PAC-Bayes analysis to multi-view learning. We think the set of bounds could be further tightened in the future by adopting other techniques. It is also possible to study algorithms whose co-regularization term pushes towards the minimization of the multi-view PAC-Bayes bounds. In addition, we may use the work in this paper to motivate PAC-Bayes analysis for other learning tasks such as multi-task learning and domain adaptation, since these tasks are closely related to the current multi-view learning.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Project 61370175, the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, and the Shanghai Knowledge Service Platform Project (No. ZF1213).

Appendix A. Proof of Theorem 5

Define
$$f(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) = \frac{1}{m}\sum_{i=1}^m\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|^{1/d}. \qquad (51)$$
Since the rank of the matrix $\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top/\sigma^2$ is 1 with the nonzero eigenvalue being $\|\tilde{\mathbf{x}}_i\|^2/\sigma^2$, and the determinant of a positive semi-definite matrix is equal to the product of its eigenvalues, it follows that
$$\sup_{\tilde{\mathbf{x}}_1,\ldots,\tilde{\mathbf{x}}_m,\bar{\mathbf{x}}_i}\big|f(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) - f(\tilde{\mathbf{x}}_1, \ldots, \bar{\mathbf{x}}_i, \tilde{\mathbf{x}}_{i+1}, \ldots, \tilde{\mathbf{x}}_m)\big| = \frac{1}{m}\bigg|\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|^{1/d} - \Big|I + \frac{\bar{\mathbf{x}}_i\bar{\mathbf{x}}_i^\top}{\sigma^2}\Big|^{1/d}\bigg| \le \frac{1}{m}\Big(\sqrt[d]{(R/\sigma)^2 + 1} - 1\Big).$$
By McDiarmid's inequality (Shawe-Taylor and Cristianini, 2004), we have for all $\epsilon > 0$,
$$P\bigg(\mathbb{E}\Big[\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|^{1/d}\Big] \ge f(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) - \epsilon\bigg) \ge 1 - \exp\bigg(\frac{-2m\epsilon^2}{\big(\sqrt[d]{(R/\sigma)^2 + 1} - 1\big)^2}\bigg). \qquad (52)$$
Setting the right-hand side equal to $1 - \frac{\delta}{3}$, we have with probability at least $1 - \frac{\delta}{3}$,
$$\mathbb{E}\Big[\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|^{1/d}\Big] \ge f(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) - \Big(\sqrt[d]{(R/\sigma)^2 + 1} - 1\Big)\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}}, \qquad (53)$$
and
$$-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| \le -d\ln\Big[f(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) - \Big(\sqrt[d]{(R/\sigma)^2 + 1} - 1\Big)\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}}\Big]_+, \qquad (54)$$
where to reach (54) we have used (12) and defined $[\cdot]_+ = \max(\cdot, 0)$.

Denote $H_m = \frac{1}{m}\sum_{i=1}^m[\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2]$. It is clear that
$$\mathbb{E}[H_m] = \mathbb{E}\Big\{\frac{1}{m}\sum_{i=1}^m\big[\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2\big]\Big\} = \mathbb{E}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big]. \qquad (55)$$
Recall $R = \sup_{\tilde{\mathbf{x}}}\|\tilde{\mathbf{x}}\|$. By McDiarmid's inequality, we have for all $\epsilon > 0$,
$$P\{\mathbb{E}[H_m] \le H_m + \epsilon\} \ge 1 - \exp\Big(\frac{-2m\epsilon^2}{(1+\mu^2)^2R^4}\Big). \qquad (56)$$
Setting the right-hand side equal to $1 - \frac{\delta}{3}$, we have with probability at least $1 - \frac{\delta}{3}$,
$$\mathbb{E}[H_m] \le H_m + (1 + \mu^2)R^2\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}}. \qquad (57)$$
In addition, from Theorem 1, we have
$$\mathrm{Pr}_{S\sim\mathcal{D}^m}\Big(\forall Q(c): \; KL_+(\hat{E}_{Q,S}\,\|\,E_{Q,\mathcal{D}}) \le \frac{KL(Q\|P) + \ln\frac{m+1}{\delta/3}}{m}\Big) \ge 1 - \delta/3. \qquad (58)$$
According to the union bound ($Pr(A \text{ or } B \text{ or } C) \le Pr(A) + Pr(B) + Pr(C)$), the probability that at least one of the inequalities in (54), (57) and (58) fails is no larger than $\delta/3 + \delta/3 + \delta/3 = \delta$. Hence, the probability that all three inequalities hold is no less than $1 - \delta$. That is, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, the following inequality holds
$$\forall \mathbf{w}, \mu: \; KL_+(\hat{E}_{Q,S}\,\|\,E_{Q,\mathcal{D}}) \le \frac{-\frac{d}{2}\ln\Big[f_m - \big(\sqrt[d]{(R/\sigma)^2+1} - 1\big)\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}}\Big]_+ + \frac{H_m}{2\sigma^2} + \frac{(1+\mu^2)R^2}{2\sigma^2}\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}} + \frac{\mu^2}{2} + \ln\frac{m+1}{\delta/3}}{m},$$
where $f_m$ is shorthand for $f(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m)$, and $\|\mathbf{w}\| = 1$.

Appendix B. Proof of Theorem 6

Now the KL divergence between the posterior and prior becomes
$$KL(Q(\mathbf{u})\,\|\,P(\mathbf{u})) = \frac{1}{2}\Big[-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| + \frac{1}{\sigma^2}\mathbb{E}[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2] + \mu^2\Big] \le \frac{1}{2}\mathbb{E}\Big[\frac{1}{\sigma^2}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] - \ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|\Big] + \frac{\mu^2}{2},$$
where the inequality uses (13). Define
$$\tilde{f}(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) = \frac{1}{m}\sum_{i=1}^m\Big(\frac{1}{\sigma^2}\big[\tilde{\mathbf{x}}_i^\top\tilde{\mathbf{x}}_i + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}}_i)^2\big] - \ln\Big|I + \frac{\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top}{\sigma^2}\Big|\Big). \qquad (59)$$
Recall $R = \sup_{\tilde{\mathbf{x}}}\|\tilde{\mathbf{x}}\|$. Since the rank of the matrix $\tilde{\mathbf{x}}_i\tilde{\mathbf{x}}_i^\top/\sigma^2$ is 1 with the nonzero eigenvalue being $\|\tilde{\mathbf{x}}_i\|^2/\sigma^2$, and the determinant of a positive semi-definite matrix is equal to the product of its eigenvalues, it follows that
$$\sup_{\tilde{\mathbf{x}}_1,\ldots,\tilde{\mathbf{x}}_m,\bar{\mathbf{x}}_i}\big|\tilde{f}(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m) - \tilde{f}(\tilde{\mathbf{x}}_1, \ldots, \bar{\mathbf{x}}_i, \tilde{\mathbf{x}}_{i+1}, \ldots, \tilde{\mathbf{x}}_m)\big| \le \frac{1}{m}\Big[\frac{(1+\mu^2)R^2}{\sigma^2} + \ln\Big(1 + \frac{R^2}{\sigma^2}\Big)\Big].$$
By McDiarmid's inequality, we have for all $\epsilon > 0$,
$$P\Big(\mathbb{E}\Big[\frac{1}{\sigma^2}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] - \ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|\Big] \le \tilde{f} + \epsilon\Big) \ge 1 - \exp\Big(\frac{-2m\epsilon^2}{\Delta^2}\Big), \qquad (60)$$
where $\tilde{f}$ is short for $\tilde{f}(\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_m)$, and $\Delta = \frac{(1+\mu^2)R^2}{\sigma^2} + \ln(1 + \frac{R^2}{\sigma^2})$. Setting the right-hand side of (60) equal to $1 - \frac{\delta}{2}$, we have with probability at least $1 - \frac{\delta}{2}$,
$$\mathbb{E}\Big[\frac{1}{\sigma^2}\big[\tilde{\mathbf{x}}^\top\tilde{\mathbf{x}} + \mu^2(\mathbf{w}^\top\tilde{\mathbf{x}})^2\big] - \ln\Big|I + \frac{\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top}{\sigma^2}\Big|\Big] \le \tilde{f} + \Delta\sqrt{\frac{1}{2m}\ln\frac{2}{\delta}}. \qquad (61)$$
Meanwhile, from Theorem 1, we have
$$\mathrm{Pr}_{S\sim\mathcal{D}^m}\Big(\forall Q(c): \; KL_+(\hat{E}_{Q,S}\,\|\,E_{Q,\mathcal{D}}) \le \frac{KL(Q\|P) + \ln\frac{m+1}{\delta/2}}{m}\Big) \ge 1 - \delta/2. \qquad (62)$$
According to the union bound, we can complete the proof of the dimensionality-independent PAC-Bayes bound.

Appendix C. Proof of Theorem 7

From $R = \sup_{\tilde{\mathbf{x}}}\|\tilde{\mathbf{x}}\|$, it is clear that $\sup_{\mathbf{x}}\|\mathbf{x}\| = R$ and $\sup_{(\mathbf{x},y)}\|y\mathbf{x}\| = R$. From (54), it follows that with probability at least $1 - \frac{\delta}{4}$,
$$-\ln\Big|I + \frac{\mathbb{E}(\tilde{\mathbf{x}}\tilde{\mathbf{x}}^\top)}{\sigma^2}\Big| \le -d\ln\Big[f_m - \Big(\sqrt[d]{(R/\sigma)^2 + 1} - 1\Big)\sqrt{\frac{1}{2m}\ln\frac{4}{\delta}}\Big]_+. \qquad (63)$$
Appendix C. Proof of Theorem 7

It is clear from $R = \sup_{\tilde{x}}\|\tilde{x}\|$ that $\sup_x \|x\| = R$ and $\sup_{(x,y)} \|yx\| = R$. From (54), it follows that with probability at least $1 - \delta/4$,
\[
-\ln\left|I + \frac{E(\tilde{x}\tilde{x}^\top)}{\sigma^2}\right| \le -d \ln\left[ f_m - \big(\sqrt[d]{(R/\sigma)^2+1} - 1\big)\sqrt{\frac{1}{2m}\ln\frac{4}{\delta}} \right]_+. \tag{63}
\]
With reference to a bounding result on estimating the center of mass (Shawe-Taylor and Cristianini, 2004), it follows that with probability at least $1 - \delta/4$ the following inequality holds:
\[
\|w_p - \hat{w}_p\| \le \frac{R}{\sqrt{m}}\left( 2 + \sqrt{2\ln\frac{4}{\delta}} \right). \tag{64}
\]
Denote $\hat{H}_m = \frac{1}{m}\sum_{i=1}^m [\tilde{x}_i^\top\tilde{x}_i - 2\eta\mu\sigma^2 y_i(w^\top x_i) + \mu^2(w^\top\tilde{x}_i)^2]$. It is clear that
\[
E[\hat{H}_m] = E\big[\tilde{x}^\top\tilde{x} - 2\eta\mu\sigma^2 y(w^\top x) + \mu^2(w^\top\tilde{x})^2\big]. \tag{65}
\]
By McDiarmid's inequality, we have for all $\epsilon > 0$,
\[
P\{ E[\hat{H}_m] \le \hat{H}_m + \epsilon \} \ge 1 - \exp\left( \frac{-2m\epsilon^2}{(R^2 + 4\eta\mu\sigma^2 R + \mu^2 R^2)^2} \right). \tag{66}
\]
Setting the right-hand side equal to $1 - \delta/4$, we have with probability at least $1 - \delta/4$,
\[
E[\hat{H}_m] \le \hat{H}_m + (R^2 + \mu^2 R^2 + 4\eta\mu\sigma^2 R)\sqrt{\frac{1}{2m}\ln\frac{4}{\delta}}. \tag{67}
\]
In addition, according to Lemma 1, we have
\[
\Pr_{S\sim\mathcal{D}^m}\left( \forall Q(c):\; \mathrm{KL}_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta/4}}{m} \right) \ge 1 - \delta/4. \tag{68}
\]
Therefore, from the union bound, we get the result.

Appendix D. Proof of Theorem 8

Applying (13) to (24), we obtain
\[
\begin{aligned}
\mathrm{KL}(Q(u)\,\|\,P(u)) &\le -\frac{1}{2} E\ln\left|I + \frac{\tilde{x}\tilde{x}^\top}{\sigma^2}\right| + \frac{1}{2}\big( \|\eta w_p - \eta\hat{w}_p\| + \|\eta\hat{w}_p - \mu w\| + \mu \big)^2 \\
&\quad + \frac{1}{2\sigma^2} E\big[ \tilde{x}^\top\tilde{x} - 2\eta\mu\sigma^2 y(w^\top x) + \mu^2(w^\top\tilde{x})^2 \big] + \frac{\mu^2}{2} \\
&= \frac{1}{2}\big( \|\eta w_p - \eta\hat{w}_p\| + \|\eta\hat{w}_p - \mu w\| + \mu \big)^2 \\
&\quad + \frac{1}{2} E\left( \frac{\tilde{x}^\top\tilde{x} - 2\eta\mu\sigma^2 y(w^\top x) + \mu^2(w^\top\tilde{x})^2}{\sigma^2} - \ln\left|I + \frac{\tilde{x}\tilde{x}^\top}{\sigma^2}\right| \right) + \frac{\mu^2}{2}.
\end{aligned}
\]
Following Shawe-Taylor and Cristianini (2004), we have with probability at least $1 - \delta/3$,
\[
\|w_p - \hat{w}_p\| \le \frac{R}{\sqrt{m}}\left( 2 + \sqrt{2\ln\frac{3}{\delta}} \right). \tag{69}
\]
Denote $\tilde{H}_m = \frac{1}{m}\sum_{i=1}^m \left[ \frac{\tilde{x}_i^\top\tilde{x}_i - 2\eta\mu\sigma^2 y_i(w^\top x_i) + \mu^2(w^\top\tilde{x}_i)^2}{\sigma^2} - \ln\left|I + \frac{\tilde{x}_i\tilde{x}_i^\top}{\sigma^2}\right| \right]$. It is clear that
\[
E[\tilde{H}_m] = E\left[ \frac{\tilde{x}^\top\tilde{x} - 2\eta\mu\sigma^2 y(w^\top x) + \mu^2(w^\top\tilde{x})^2}{\sigma^2} - \ln\left|I + \frac{\tilde{x}\tilde{x}^\top}{\sigma^2}\right| \right]. \tag{70}
\]
By McDiarmid's inequality, we have for all $\epsilon > 0$,
\[
P\{ E[\tilde{H}_m] \le \tilde{H}_m + \epsilon \} \ge 1 - \exp\left( \frac{-2m\epsilon^2}{\left( \frac{R^2 + 4\eta\mu\sigma^2 R + \mu^2 R^2}{\sigma^2} + \ln\big(1 + \frac{R^2}{\sigma^2}\big) \right)^2} \right). \tag{71}
\]
Setting the right-hand side equal to $1 - \delta/3$, we have with probability at least $1 - \delta/3$,
\[
E[\tilde{H}_m] \le \tilde{H}_m + \left( \frac{R^2 + 4\eta\mu\sigma^2 R + \mu^2 R^2}{\sigma^2} + \ln\Big(1 + \frac{R^2}{\sigma^2}\Big) \right)\sqrt{\frac{1}{2m}\ln\frac{3}{\delta}}. \tag{72}
\]
In addition, from Lemma 1, we have
\[
\Pr_{S\sim\mathcal{D}^m}\left( \forall Q(c):\; \mathrm{KL}_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta/3}}{m} \right) \ge 1 - \delta/3. \tag{73}
\]
By applying the union bound, we complete the proof.

Appendix E. Proof of Theorem 12

We already have $\sup_x \|x\| = R$ and $\sup_{(x,y)} \|yx\| = R$ from the definition $R = \sup_{\tilde{x}}\|\tilde{x}\|$. Following Shawe-Taylor and Cristianini (2004), we have with probability at least $1 - \delta/3$,
\[
\|w_p - \hat{w}_p\| \le \frac{R}{\sqrt{m}}\left( 2 + \sqrt{2\ln\frac{3}{\delta}} \right). \tag{74}
\]
Denote $\bar{S}_m = \frac{1}{m}\sum_{i=1}^m [-\eta\mu\, y_i(w^\top x_i)]$. It is clear that
\[
E[\bar{S}_m] = -\eta\mu\, E\big[ y(w^\top x) \big]. \tag{75}
\]
By McDiarmid's inequality, we have for all $\epsilon > 0$,
\[
P\{ E[\bar{S}_m] \le \bar{S}_m + \epsilon \} \ge 1 - \exp\left( \frac{-2m\epsilon^2}{(2\eta\mu R)^2} \right). \tag{76}
\]
Setting the right-hand side equal to $1 - \delta/3$, we have with probability at least $1 - \delta/3$,
\[
E[\bar{S}_m] \le \bar{S}_m + \eta\mu R\sqrt{\frac{2}{m}\ln\frac{3}{\delta}}. \tag{77}
\]
In addition, from Lemma 1, we have
\[
\Pr_{S\sim\mathcal{D}^m}\left( \forall Q(c):\; \mathrm{KL}_+(\hat{E}_{Q,S} \,\|\, E_{Q,\mathcal{D}}) \le \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta/3}}{m} \right) \ge 1 - \delta/3. \tag{78}
\]
After applying the union bound, the proof is completed.
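The proofs of Theorems 7, 8 and 12 combine the same three ingredients: a center-of-mass estimate as in (64), (69) and (74), a McDiarmid deviation term, and the PAC-Bayes lemma, with the confidence $\delta$ split evenly over the events. A minimal numeric illustration of how the individual radii shrink with $m$ (our own sketch, not part of the paper; the function names and the example constants are assumptions):

    import numpy as np

    def center_of_mass_radius(R, m, delta_slice):
        """Right-hand side of (64)/(69)/(74): R/sqrt(m) * (2 + sqrt(2 ln(1/delta_slice)))."""
        return R / np.sqrt(m) * (2.0 + np.sqrt(2.0 * np.log(1.0 / delta_slice)))

    def mcdiarmid_radius(lipschitz_const, m, delta_slice):
        """Generic deviation term c * sqrt(ln(1/delta_slice) / (2m)) as in (57), (61), (67), (72)."""
        return lipschitz_const * np.sqrt(np.log(1.0 / delta_slice) / (2.0 * m))

    # Example: Theorem 7 splits delta over four events, so each slice is delta/4.
    delta, R, sigma, mu, eta = 0.05, 1.0, 1.0, 1.0, 1.0
    for m in (100, 1000, 10000):
        c67 = R**2 + mu**2 * R**2 + 4 * eta * mu * sigma**2 * R   # constant in (67)
        print(m, center_of_mass_radius(R, m, delta / 4), mcdiarmid_radius(c67, m, delta / 4))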
Appendix F. Dual Optimization Derivation for MvSVMs

To optimize (47), here we derive the Lagrange dual function. Let $\lambda_{i1}, \lambda_{i2}, \nu_{i1}, \nu_{i2} \ge 0$ be the Lagrange multipliers associated with the inequality constraints of problem (47). The Lagrangian $L(\alpha_1, \alpha_2, \xi_1, \xi_2, \lambda_1, \lambda_2, \nu_1, \nu_2)$ can be written as
\[
L = F_0 - \sum_{i=1}^n \left[ \lambda_{i1}\Big( y_i \sum_{j=1}^n \alpha_{j1} k_1(x_j, x_i) - 1 + \xi_{i1} \Big) + \lambda_{i2}\Big( y_i \sum_{j=1}^n \alpha_{j2} k_2(x_j, x_i) - 1 + \xi_{i2} \Big) + \nu_{i1}\xi_{i1} + \nu_{i2}\xi_{i2} \right].
\]
To obtain the Lagrangian dual function, $L$ has to be minimized with respect to the primal variables $\alpha_1, \alpha_2, \xi_1, \xi_2$. To eliminate these variables, we compute the corresponding partial derivatives and set them to 0, obtaining the following conditions:
\[
(K_1 + 2C_2 K_1 K_1)\alpha_1 - 2C_2 K_1 K_2 \alpha_2 = \Lambda_1, \tag{79}
\]
\[
(K_2 + 2C_2 K_2 K_2)\alpha_2 - 2C_2 K_2 K_1 \alpha_1 = \Lambda_2, \tag{80}
\]
\[
\lambda_{i1} + \nu_{i1} = C_1, \tag{81}
\]
\[
\lambda_{i2} + \nu_{i2} = C_1, \tag{82}
\]
where we have defined
\[
\Lambda_1 \triangleq \sum_{i=1}^n \lambda_{i1} y_i K_1(:, i), \qquad \Lambda_2 \triangleq \sum_{i=1}^n \lambda_{i2} y_i K_2(:, i),
\]
with $K_1(:, i)$ and $K_2(:, i)$ being the $i$th columns of the corresponding Gram matrices.

Substituting (79)–(82) into $L$ results in the following expression of the Lagrangian dual function $g(\lambda_1, \lambda_2, \nu_1, \nu_2)$:
\[
\begin{aligned}
g &= \frac{1}{2}(\alpha_1^\top K_1 \alpha_1 + \alpha_2^\top K_2 \alpha_2) + C_2(\alpha_1^\top K_1 K_1 \alpha_1 - 2\alpha_1^\top K_1 K_2 \alpha_2 + \alpha_2^\top K_2 K_2 \alpha_2) - \alpha_1^\top \Lambda_1 - \alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}) \\
&= \frac{1}{2}\alpha_1^\top \Lambda_1 + \frac{1}{2}\alpha_2^\top \Lambda_2 - \alpha_1^\top \Lambda_1 - \alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}) \\
&= -\frac{1}{2}\alpha_1^\top \Lambda_1 - \frac{1}{2}\alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}).
\end{aligned} \tag{83}
\]
Define $\tilde{K}_1 = K_1 + 2C_2 K_1 K_1$, $\bar{K}_1 = 2C_2 K_1 K_2$, $\tilde{K}_2 = K_2 + 2C_2 K_2 K_2$, $\bar{K}_2 = 2C_2 K_2 K_1$. Then (79) and (80) become
\[
\tilde{K}_1 \alpha_1 - \bar{K}_1 \alpha_2 = \Lambda_1, \tag{84}
\]
\[
\tilde{K}_2 \alpha_2 - \bar{K}_2 \alpha_1 = \Lambda_2. \tag{85}
\]
From (84) and (85), we have
\[
(\tilde{K}_1 - \bar{K}_1 \tilde{K}_2^{-1} \bar{K}_2)\alpha_1 = \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1, \qquad
(\tilde{K}_2 - \bar{K}_2 \tilde{K}_1^{-1} \bar{K}_1)\alpha_2 = \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2.
\]
Define $M_1 \triangleq \tilde{K}_1 - \bar{K}_1 \tilde{K}_2^{-1} \bar{K}_2$ and $M_2 \triangleq \tilde{K}_2 - \bar{K}_2 \tilde{K}_1^{-1} \bar{K}_1$. It follows that
\[
\alpha_1 = M_1^{-1}\big[ \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1 \big], \tag{86}
\]
\[
\alpha_2 = M_2^{-1}\big[ \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2 \big]. \tag{87}
\]
Now with $\alpha_1$ and $\alpha_2$ substituted into (83), the Lagrange dual function $g(\lambda_1, \lambda_2, \nu_1, \nu_2)$ is
\[
g = \inf_{\alpha_1, \alpha_2, \xi_1, \xi_2} L = -\frac{1}{2}\alpha_1^\top \Lambda_1 - \frac{1}{2}\alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2})
= -\frac{1}{2}\Lambda_1^\top M_1^{-1}\big[ \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1 \big] - \frac{1}{2}\Lambda_2^\top M_2^{-1}\big[ \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2 \big] + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}).
\]
The Lagrange dual problem is given by
\[
\max_{\lambda_1, \lambda_2}\; g \quad \text{s.t.} \quad 0 \le \lambda_{i1} \le C_1,\; 0 \le \lambda_{i2} \le C_1,\; i = 1, \ldots, n. \tag{88}
\]
As Lagrange dual functions are concave, we can formulate the Lagrange dual problem as a convex optimization problem
\[
\min_{\lambda_1, \lambda_2}\; -g \quad \text{s.t.} \quad 0 \le \lambda_{i1} \le C_1,\; 0 \le \lambda_{i2} \le C_1,\; i = 1, \ldots, n. \tag{89}
\]
Define the matrix $Y \triangleq \mathrm{diag}(y_1, \ldots, y_n)$. Then $\Lambda_1 = K_1 Y \lambda_1$ and $\Lambda_2 = K_2 Y \lambda_2$ with $\lambda_1 = (\lambda_{11}, \ldots, \lambda_{n1})^\top$ and $\lambda_2 = (\lambda_{12}, \ldots, \lambda_{n2})^\top$. It is clear that $\tilde{K}_1$ and $\tilde{K}_2$ are symmetric matrices, and $\bar{K}_1 = \bar{K}_2^\top$. Therefore, the matrices $M_1$ and $M_2$ are also symmetric. We have
\[
\begin{aligned}
-g &= \frac{1}{2}\Lambda_1^\top M_1^{-1}\big[ \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1 \big] + \frac{1}{2}\Lambda_2^\top M_2^{-1}\big[ \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2 \big] - \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}) \\
&= \frac{1}{2}\Big\{ \lambda_1^\top [Y K_1 M_1^{-1} K_1 Y] \lambda_1 + \lambda_1^\top [Y K_1 M_1^{-1} \bar{K}_1 \tilde{K}_2^{-1} K_2 Y] \lambda_2 + \lambda_2^\top [Y K_2 M_2^{-1} \bar{K}_2 \tilde{K}_1^{-1} K_1 Y] \lambda_1 + \lambda_2^\top [Y K_2 M_2^{-1} K_2 Y] \lambda_2 \Big\} - \mathbf{1}^\top (\lambda_1 + \lambda_2) \\
&= \frac{1}{2} (\lambda_1^\top\; \lambda_2^\top) \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} - \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}^{\!\top} \mathbf{1}_{2n},
\end{aligned}
\]
where
\[
A \triangleq Y K_1 M_1^{-1} K_1 Y, \qquad B \triangleq Y K_1 M_1^{-1} \bar{K}_1 \tilde{K}_2^{-1} K_2 Y, \qquad D \triangleq Y K_2 M_2^{-1} K_2 Y, \tag{90}
\]
$\mathbf{1}_{2n} = (1, \ldots, 1)^\top$ is the all-ones vector of length $2n$, and we have used the fact that
\[
Y K_1 M_1^{-1} \bar{K}_1 \tilde{K}_2^{-1} K_2 Y = \big[ Y K_2 M_2^{-1} \bar{K}_2 \tilde{K}_1^{-1} K_1 Y \big]^\top. \tag{91}
\]
Because of the convexity of the function $-g$, we affirm that the matrix $\begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}$ is positive semidefinite. Hence, the optimization problem in (89) can be rewritten as
\[
\min_{\lambda_1, \lambda_2}\; \frac{1}{2} (\lambda_1^\top\; \lambda_2^\top) \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} - \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}^{\!\top} \mathbf{1}_{2n}
\quad \text{s.t.} \quad \mathbf{0} \preceq \lambda_1 \preceq C_1 \mathbf{1},\; \mathbf{0} \preceq \lambda_2 \preceq C_1 \mathbf{1}. \tag{92}
\]
After solving this problem, we can then obtain the classifier parameters $\alpha_1$ and $\alpha_2$ using (86) and (87), which are finally used by (46).
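Problem (92) is a standard box-constrained quadratic program. The sketch below is our own illustration (not the authors' released code); the function name is assumed, any bound-constrained solver would do, and K1, K2, y, C1, C2 follow the notation above. It assembles the matrices of (90), minimizes $-g$, and recovers $\alpha_1, \alpha_2$ through (86)–(87).

    import numpy as np
    from scipy.optimize import minimize

    def solve_mvsvm_dual(K1, K2, y, C1, C2):
        """Sketch of the MvSVM dual (92): build A, B, D from (90),
        minimize -g over 0 <= lambda <= C1, recover alpha via (86)-(87)."""
        n = K1.shape[0]
        Y = np.diag(y.astype(float))
        Kt1 = K1 + 2 * C2 * K1 @ K1          # K~_1
        Kt2 = K2 + 2 * C2 * K2 @ K2          # K~_2
        Kb1 = 2 * C2 * K1 @ K2               # K-bar_1
        Kb2 = 2 * C2 * K2 @ K1               # K-bar_2
        Kt1_inv, Kt2_inv = np.linalg.inv(Kt1), np.linalg.inv(Kt2)
        M1_inv = np.linalg.inv(Kt1 - Kb1 @ Kt2_inv @ Kb2)
        M2_inv = np.linalg.inv(Kt2 - Kb2 @ Kt1_inv @ Kb1)

        A = Y @ K1 @ M1_inv @ K1 @ Y
        B = Y @ K1 @ M1_inv @ Kb1 @ Kt2_inv @ K2 @ Y
        D = Y @ K2 @ M2_inv @ K2 @ Y
        H = np.block([[A, B], [B.T, D]])     # quadratic term of (92)

        def neg_g(lam):
            return 0.5 * lam @ H @ lam - lam.sum()

        def grad(lam):
            return H @ lam - np.ones(2 * n)  # H is symmetric

        res = minimize(neg_g, x0=np.zeros(2 * n), jac=grad,
                       bounds=[(0.0, C1)] * (2 * n), method="L-BFGS-B")
        lam1, lam2 = res.x[:n], res.x[n:]

        # Recover alpha_1, alpha_2 from (86)-(87), with Lambda_1 = K1 Y lam1, Lambda_2 = K2 Y lam2.
        Lam1, Lam2 = K1 @ Y @ lam1, K2 @ Y @ lam2
        alpha1 = M1_inv @ (Kb1 @ Kt2_inv @ Lam2 + Lam1)
        alpha2 = M2_inv @ (Kb2 @ Kt1_inv @ Lam1 + Lam2)
        return alpha1, alpha2

A generic gradient-based solver is used here only to keep the sketch self-contained; since the objective is a convex quadratic with box constraints, a dedicated QP solver would typically be preferable.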
Appendix G. Dual Optimization Derivation for SMvSVMs

To optimize (50), we first derive the Lagrange dual function following the same line as the optimization derivations for MvSVMs. Although some of the derivations here are similar to those for MvSVMs, we include them for completeness.

Let $\lambda_{i1}, \lambda_{i2}, \nu_{i1}, \nu_{i2} \ge 0$ be the Lagrange multipliers associated with the inequality constraints of problem (50). The Lagrangian $L(\alpha_1, \alpha_2, \xi_1, \xi_2, \lambda_1, \lambda_2, \nu_1, \nu_2)$ can be formulated as
\[
L = \tilde{F}_0 - \sum_{i=1}^n \left[ \lambda_{i1}\Big( y_i \sum_{j=1}^{n+u} \alpha_{j1} k_1(x_j, x_i) - 1 + \xi_{i1} \Big) + \lambda_{i2}\Big( y_i \sum_{j=1}^{n+u} \alpha_{j2} k_2(x_j, x_i) - 1 + \xi_{i2} \Big) + \nu_{i1}\xi_{i1} + \nu_{i2}\xi_{i2} \right].
\]
To obtain the Lagrangian dual function, $L$ will be minimized with respect to the primal variables $\alpha_1, \alpha_2, \xi_1, \xi_2$. To eliminate these variables, setting the corresponding partial derivatives to 0 results in the following conditions:
\[
(K_1 + 2C_2 K_1 K_1)\alpha_1 - 2C_2 K_1 K_2 \alpha_2 = \Lambda_1, \tag{93}
\]
\[
(K_2 + 2C_2 K_2 K_2)\alpha_2 - 2C_2 K_2 K_1 \alpha_1 = \Lambda_2, \tag{94}
\]
\[
\lambda_{i1} + \nu_{i1} = C_1, \tag{95}
\]
\[
\lambda_{i2} + \nu_{i2} = C_1, \tag{96}
\]
where we have defined
\[
\Lambda_1 \triangleq \sum_{i=1}^n \lambda_{i1} y_i K_1(:, i), \qquad \Lambda_2 \triangleq \sum_{i=1}^n \lambda_{i2} y_i K_2(:, i),
\]
with $K_1(:, i)$ and $K_2(:, i)$ being the $i$th columns of the corresponding Gram matrices.

Substituting (93)–(96) into $L$ results in the Lagrangian dual function $g(\lambda_1, \lambda_2, \nu_1, \nu_2)$:
\[
\begin{aligned}
g &= \frac{1}{2}(\alpha_1^\top K_1 \alpha_1 + \alpha_2^\top K_2 \alpha_2) + C_2(\alpha_1^\top K_1 K_1 \alpha_1 - 2\alpha_1^\top K_1 K_2 \alpha_2 + \alpha_2^\top K_2 K_2 \alpha_2) - \alpha_1^\top \Lambda_1 - \alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}) \\
&= \frac{1}{2}\alpha_1^\top \Lambda_1 + \frac{1}{2}\alpha_2^\top \Lambda_2 - \alpha_1^\top \Lambda_1 - \alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}) \\
&= -\frac{1}{2}\alpha_1^\top \Lambda_1 - \frac{1}{2}\alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}).
\end{aligned} \tag{97}
\]
Define $\tilde{K}_1 = K_1 + 2C_2 K_1 K_1$, $\bar{K}_1 = 2C_2 K_1 K_2$, $\tilde{K}_2 = K_2 + 2C_2 K_2 K_2$, $\bar{K}_2 = 2C_2 K_2 K_1$. Then (93) and (94) become
\[
\tilde{K}_1 \alpha_1 - \bar{K}_1 \alpha_2 = \Lambda_1, \tag{98}
\]
\[
\tilde{K}_2 \alpha_2 - \bar{K}_2 \alpha_1 = \Lambda_2. \tag{99}
\]
From (98) and (99), we have
\[
(\tilde{K}_1 - \bar{K}_1 \tilde{K}_2^{-1} \bar{K}_2)\alpha_1 = \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1, \qquad
(\tilde{K}_2 - \bar{K}_2 \tilde{K}_1^{-1} \bar{K}_1)\alpha_2 = \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2.
\]
Define $M_1 \triangleq \tilde{K}_1 - \bar{K}_1 \tilde{K}_2^{-1} \bar{K}_2$ and $M_2 \triangleq \tilde{K}_2 - \bar{K}_2 \tilde{K}_1^{-1} \bar{K}_1$. It is clear that
\[
\alpha_1 = M_1^{-1}\big[ \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1 \big], \tag{100}
\]
\[
\alpha_2 = M_2^{-1}\big[ \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2 \big]. \tag{101}
\]
With $\alpha_1$ and $\alpha_2$ substituted into (97), the Lagrange dual function $g(\lambda_1, \lambda_2, \nu_1, \nu_2)$ is then
\[
g = \inf_{\alpha_1, \alpha_2, \xi_1, \xi_2} L = -\frac{1}{2}\alpha_1^\top \Lambda_1 - \frac{1}{2}\alpha_2^\top \Lambda_2 + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2})
= -\frac{1}{2}\Lambda_1^\top M_1^{-1}\big[ \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1 \big] - \frac{1}{2}\Lambda_2^\top M_2^{-1}\big[ \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2 \big] + \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}).
\]
The Lagrange dual problem is given by
\[
\max_{\lambda_1, \lambda_2}\; g \quad \text{s.t.} \quad 0 \le \lambda_{i1} \le C_1,\; 0 \le \lambda_{i2} \le C_1,\; i = 1, \ldots, n. \tag{102}
\]
As Lagrange dual functions are concave, we formulate the Lagrange dual problem as the convex optimization problem
\[
\min_{\lambda_1, \lambda_2}\; -g \quad \text{s.t.} \quad 0 \le \lambda_{i1} \le C_1,\; 0 \le \lambda_{i2} \le C_1,\; i = 1, \ldots, n. \tag{103}
\]
Define the matrix $Y \triangleq \mathrm{diag}(y_1, \ldots, y_n)$. Then $\Lambda_1 = K_{n1} Y \lambda_1$ and $\Lambda_2 = K_{n2} Y \lambda_2$ with $K_{n1} = K_1(:, 1\!:\!n)$, $K_{n2} = K_2(:, 1\!:\!n)$, $\lambda_1 = (\lambda_{11}, \ldots, \lambda_{n1})^\top$ and $\lambda_2 = (\lambda_{12}, \ldots, \lambda_{n2})^\top$. It is clear that $\tilde{K}_1$ and $\tilde{K}_2$ are symmetric matrices, and $\bar{K}_1 = \bar{K}_2^\top$. Therefore, the matrices $M_1$ and $M_2$ are also symmetric. We have
\[
\begin{aligned}
-g &= \frac{1}{2}\Lambda_1^\top M_1^{-1}\big[ \bar{K}_1 \tilde{K}_2^{-1} \Lambda_2 + \Lambda_1 \big] + \frac{1}{2}\Lambda_2^\top M_2^{-1}\big[ \bar{K}_2 \tilde{K}_1^{-1} \Lambda_1 + \Lambda_2 \big] - \sum_{i=1}^n (\lambda_{i1} + \lambda_{i2}) \\
&= \frac{1}{2}\Big\{ \lambda_1^\top [Y K_{n1}^\top M_1^{-1} K_{n1} Y] \lambda_1 + \lambda_1^\top [Y K_{n1}^\top M_1^{-1} \bar{K}_1 \tilde{K}_2^{-1} K_{n2} Y] \lambda_2 + \lambda_2^\top [Y K_{n2}^\top M_2^{-1} \bar{K}_2 \tilde{K}_1^{-1} K_{n1} Y] \lambda_1 + \lambda_2^\top [Y K_{n2}^\top M_2^{-1} K_{n2} Y] \lambda_2 \Big\} - \mathbf{1}^\top (\lambda_1 + \lambda_2) \\
&= \frac{1}{2} (\lambda_1^\top\; \lambda_2^\top) \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} - \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}^{\!\top} \mathbf{1}_{2n},
\end{aligned}
\]
where
\[
A \triangleq Y K_{n1}^\top M_1^{-1} K_{n1} Y, \qquad B \triangleq Y K_{n1}^\top M_1^{-1} \bar{K}_1 \tilde{K}_2^{-1} K_{n2} Y, \qquad D \triangleq Y K_{n2}^\top M_2^{-1} K_{n2} Y, \tag{104}
\]
$\mathbf{1}_{2n} = (1, \ldots, 1)^\top$ is the all-ones vector of length $2n$, and we have used the fact that
\[
Y K_{n1}^\top M_1^{-1} \bar{K}_1 \tilde{K}_2^{-1} K_{n2} Y = \big[ Y K_{n2}^\top M_2^{-1} \bar{K}_2 \tilde{K}_1^{-1} K_{n1} Y \big]^\top. \tag{105}
\]
Because of the convexity of the function $-g$, we affirm that the matrix $\begin{pmatrix} A & B \\ B^\top & D \end{pmatrix}$ is positive semidefinite. Hence, the optimization problem in (103) can be rewritten as
\[
\min_{\lambda_1, \lambda_2}\; \frac{1}{2} (\lambda_1^\top\; \lambda_2^\top) \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix} - \begin{pmatrix} \lambda_1 \\ \lambda_2 \end{pmatrix}^{\!\top} \mathbf{1}_{2n}
\quad \text{s.t.} \quad \mathbf{0} \preceq \lambda_1 \preceq C_1 \mathbf{1},\; \mathbf{0} \preceq \lambda_2 \preceq C_1 \mathbf{1}. \tag{106}
\]
After solving this problem, we can then obtain the classifier parameters $\alpha_1$ and $\alpha_2$ using (100) and (101), which are finally used by (49).
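Relative to the MvSVM dual, the only change when assembling (104) is that the labeled blocks $K_{n1} = K_1(:, 1\!:\!n)$ and $K_{n2} = K_2(:, 1\!:\!n)$ replace the full square Gram matrices, while $\tilde{K}$, $\bar{K}$ and $M$ are still built from the full $(n+u) \times (n+u)$ kernels. A minimal illustration of that change (our own sketch, not the authors' code; the function name is assumed):

    import numpy as np

    def smvsvm_quadratic_blocks(K1, K2, y, C2):
        """Sketch: assemble A, B, D of (104) for SMvSVMs. K1, K2 are the full
        (n+u) x (n+u) Gram matrices; only the first n columns (labeled points)
        enter through K_{n1}, K_{n2}."""
        n = y.shape[0]
        Y = np.diag(y.astype(float))
        Kt1 = K1 + 2 * C2 * K1 @ K1
        Kt2 = K2 + 2 * C2 * K2 @ K2
        Kb1 = 2 * C2 * K1 @ K2
        Kb2 = 2 * C2 * K2 @ K1
        Kt1_inv, Kt2_inv = np.linalg.inv(Kt1), np.linalg.inv(Kt2)
        M1_inv = np.linalg.inv(Kt1 - Kb1 @ Kt2_inv @ Kb2)
        M2_inv = np.linalg.inv(Kt2 - Kb2 @ Kt1_inv @ Kb1)
        Kn1, Kn2 = K1[:, :n], K2[:, :n]          # K_{n1}, K_{n2}
        A = Y @ Kn1.T @ M1_inv @ Kn1 @ Y
        B = Y @ Kn1.T @ M1_inv @ Kb1 @ Kt2_inv @ Kn2 @ Y
        D = Y @ Kn2.T @ M2_inv @ Kn2 @ Y
        return A, B, D   # problem (106) is then solved exactly like (92)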
References

A. Ambroladze, E. Parrado-Hernández, and J. Shawe-Taylor. Tighter PAC-Bayes bounds. Advances in Neural Information Processing Systems, 19:9–16, 2007.

K. Bache and M. Lichman. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2013. URL http://archive.ics.uci.edu/ml.

P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, pages 92–100, 1998.

O. Catoni. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2007.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.

J. Farquhar, D. Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmak. Two view learning: SVM-2K, theory and practice. Advances in Neural Information Processing Systems, 18:355–362, 2006.

P. Germain, A. Lacasse, F. Laviolette, and M. Marchand. PAC-Bayesian learning of linear classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 353–360, 2009.

M. Higgs and J. Shawe-Taylor. A PAC-Bayes bound for tailored density estimation. Lecture Notes in Computer Science, 6331:148–162, 2010.

S. Kakade and D. Foster. Multi-view regression via canonical correlation analysis. In Proceedings of the 20th Annual Conference on Learning Theory, pages 82–96, 2007.

N. Kushmerick. Learning to remove internet advertisements. In Proceedings of the 3rd International Conference on Autonomous Agents, pages 175–181, 1999.
J. Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(Mar):273–306, 2005.

J. Langford and J. Shawe-Taylor. PAC-Bayes & margins. Advances in Neural Information Processing Systems, 15:423–430, 2002.

G. Lever, F. Laviolette, and J. Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473(Feb):4–28, 2013.

D. McAllester. PAC-Bayesian model averaging. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pages 164–170, 1999.

E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13(Dec):3507–3531, 2012.

C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

D. Rosenberg and P. Bartlett. The Rademacher complexity of co-regularized kernel classes. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, pages 396–403, 2007.

M. Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269, 2002.

Y. Seldin and N. Tishby. PAC-Bayesian analysis of co-clustering and beyond. Journal of Machine Learning Research, 11(Dec):3595–3646, 2010.

Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.

J. Shawe-Taylor and S. Sun. A review of optimization methodologies in support vector machines. Neurocomputing, 74(17):3609–3618, 2011.

J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

V. Sindhwani and D. Rosenberg. An RKHS for multi-view learning and manifold co-regularization. In Proceedings of the 25th Annual International Conference on Machine Learning, pages 976–983, 2008.

V. Sindhwani, P. Niyogi, and M. Belkin. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of the ICML Workshop on Learning with Multiple Views, pages 74–79, 2005.

K. Sridharan and S. Kakade. An information theoretic framework for multi-view learning. In Proceedings of the 21st Annual Conference on Learning Theory, pages 403–414, 2008.

S. Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23(7-8):2031–2038, 2013.

S. Sun and J. Shawe-Taylor. Sparse semi-supervised learning using conjugate functions. Journal of Machine Learning Research, 11(Sep):2423–2455, 2010.

V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.