Multiple Kernel Learning: A Unifying Probabilistic Viewpoint
We present a probabilistic viewpoint to multiple kernel learning unifying well-known regularised risk approaches and recent advances in approximate Bayesian inference relaxations. The framework proposes a general objective function suitable for regression, robust regression and classification that is a lower bound of the marginal likelihood and contains many regularised risk approaches as special cases. Furthermore, we derive an efficient and provably convergent optimisation algorithm.
Authors: Hannes Nickisch, Matthias Seeger
Hannes Nickisch (hannes@nickisch.org), Max Planck Institute for Intelligent Systems, Spemannstraße 38, 72076 Tübingen, Germany
Matthias Seeger (matthias.seeger@epfl.ch), Ecole Polytechnique Fédérale de Lausanne, INJ 339, Station 14, 1015 Lausanne, Switzerland

Abstract

We present a probabilistic viewpoint to multiple kernel learning unifying well-known regularised risk approaches and recent advances in approximate Bayesian inference relaxations. The framework proposes a general objective function suitable for regression, robust regression and classification that is a lower bound of the marginal likelihood and contains many regularised risk approaches as special cases. Furthermore, we derive an efficient and provably convergent optimisation algorithm.

Keywords: Multiple kernel learning, approximate Bayesian inference, double loop algorithms, Gaussian processes

1. Introduction

Nonparametric kernel methods, cornerstones of machine learning today, can be seen from different angles: as regularised risk minimisation in function spaces (Schölkopf and Smola, 2002), or as probabilistic Gaussian process methods (Rasmussen and Williams, 2006). In these techniques, the kernel (or equivalently covariance) function encodes interpolation characteristics from observed to unseen points, and two basic statistical problems have to be mastered. First, a latent function must be predicted which fits the data well, yet is as smooth as possible given the fixed kernel. Second, the kernel function parameters have to be learned as well, to best support predictions, which are of primary interest.
While the first problem is simpler and has been addressed much more frequently so far, the central role of learning the covariance function is well acknowledged, and a substantial number of methods for “learning the kernel”, “multiple kernel learning”, or “evidence maximisation” are available now. However, much of this work has firmly been associated with one of the “camps” (referred to as regularised risk and probabilistic in the sequel) with surprisingly little crosstalk or acknowledgment of prior work across this boundary. In this paper, we clarify the relationship between major regularised risk and probabilistic kernel learning techniques precisely, pointing out advantages and pitfalls of either, as well as algorithmic similarities leading to novel powerful algorithms. We develop a common analytical and algorithmic framework encompassing approaches from both camps and provide clear insights into the optimisation structure. Even though most of the optimisation is non-convex, we show how to operate a provably convergent “almost Newton” method nevertheless. Each step is not much more expensive than a gradient-based approach. Also, we do not require any foreign optimisation code to be available. Our framework unifies kernel learning for regression, robust regression and classification.

© Hannes Nickisch and Matthias Seeger.

The paper is structured as follows: In section 2, we introduce the regularised risk and the probabilistic view of kernel learning. In increasing order of generality, we explain multiple kernel learning (MKL, section 2.1), maximum a posteriori estimation (MAP, section 2.2) and marginal likelihood maximisation (MLM, section 2.3). A taxonomy of the mutual relations between the approaches and important special cases is given in section 2.4. Section 3 introduces a general optimisation scheme and section 4 draws a conclusion.
2. Kernel Methods and Kernel Learning

Kernel-based algorithms come in many shapes; however, the primary goal is – based on training data {(x_i, y_i) | i = 1..n}, x_i ∈ X, y_i ∈ Y, and a parametrised kernel function k_θ(x, x′) – to predict the output y_* for unseen inputs x_*. Often, linear parametrisations k_θ(x, x′) = Σ_{m=1}^M θ_m k_m(x, x′) are used, where the k_m are fixed positive definite functions and θ ⪰ 0. Learning the kernel means finding θ to best support this goal.

In general, kernel methods employ a postulated latent function u : X → ℝ whose smoothness is controlled via the function space squared norm ‖u(·)‖²_{k_θ}. Most often, smoothness is traded against data fit, either enforced by a loss function ℓ(y_i, u(x_i)) or modelled by a likelihood P(y_i | u_i). Let us define kernel matrices K_θ := [k_θ(x_i, x_j)]_{ij} and K_m := [k_m(x_i, x_j)]_{ij} in ℝ^{n×n}, and the vectors y := [y_i]_i ∈ Y^n and u := [u(x_i)]_i ∈ ℝ^n collecting outputs and latent function values, respectively.

The regularised risk route to kernel prediction, which is followed by any support vector machine (SVM) or ridge regression technique, yields ‖u(·)‖²_{k_θ} + (C/n) Σ_{i=1}^n ℓ(y_i, u_i) as criterion, enforcing smoothness of u(·) as well as good data fit through the loss function (C/n) ℓ(y_i, u(x_i)). By the representer theorem, the minimiser can be written as u(·) = Σ_i α_i k_θ(·, x_i), so that ‖u(·)‖²_{k_θ} = α^⊤ K_θ α (Schölkopf and Smola, 2002). As u = K_θ α, the regularised risk problem is given by

    min_u  u^⊤ K_θ^{-1} u + (C/n) Σ_{i=1}^n ℓ(y_i, u_i).    (1)

A probabilistic viewpoint of the same setting is based on the notion of a Gaussian process (GP) (Rasmussen and Williams, 2006): a Gaussian random function u(·) with mean function E[u(x)] = m(x) ≡ 0 and covariance function V[u(x), u(x′)] = E[u(x) u(x′)] = k_θ(x, x′).
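For the quadratic loss ℓ(y_i, u_i) = (y_i − u_i)², the inner problem of equation (1) has a closed-form minimiser, û = K_θ(K_θ + (n/C)I)^{-1} y. A minimal numerical sketch (toy RBF kernel and data, all values hypothetical; assumes NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 1))                       # toy inputs (hypothetical)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=8)

# fixed RBF kernel matrix K_theta (jitter for numerical stability)
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-8 * np.eye(8)

C, n = 4.0, 8
# minimiser of u^T K^-1 u + (C/n) ||y - u||^2, from setting the gradient to zero:
u_hat = np.linalg.solve(np.linalg.inv(K) + (C / n) * np.eye(n), (C / n) * y)
# equivalent "ridge" form u = K (K + (n/C) I)^-1 y, avoiding the explicit inverse of K
u_ridge = K @ np.linalg.solve(K + (n / C) * np.eye(n), y)
assert np.allclose(u_hat, u_ridge)

def objective(u):
    return u @ np.linalg.solve(K, u) + (C / n) * np.sum((y - u) ** 2)

# u_hat should beat random perturbations
assert all(objective(u_hat) <= objective(u_hat + 0.01 * rng.normal(size=n))
           for _ in range(5))
```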
In practice, we only use finite-dimensional snapshots of the process u(·): for example, P(u; θ) = N(u | 0, K_θ), a zero-mean joint Gaussian with covariance matrix K_θ. We adopt this GP as prior distribution over u(·), estimating the latent function as the maximum of the posterior process P(u(·) | y; θ) ∝ P(y | u) P(u(·); θ). Since the likelihood depends on u(·) only through the finite subset {u(x_i)}, the posterior process has a finite-dimensional representation: P(u(·) | y, u) = P(u(·) | u), so that P(u(·) | y; θ) = ∫ P(u(·) | u) P(u | y; θ) du is specified by the joint distribution P(u | y; θ), a probabilistic equivalent of the representer theorem. Kernel prediction amounts to maximum a posteriori (MAP) estimation

    max_u P(u | y; θ) ≡ max_u P(u; θ) P(y | u) ≡ min_u u^⊤ K_θ^{-1} u − 2 ln P(y | u) + ln |K_θ|,    (2)

ignoring an additive constant. Minimising equations (1) and (2) for any fixed kernel matrix K gives the same minimiser û and prediction u(x_*) = û^⊤ K_θ^{-1} [k_θ(x_i, x_*)]_i.

Table 1: Relations between loss functions and likelihoods

  Y      Loss function ℓ(y_i, u_i)                       Likelihood P(y_i | u_i)
  {±1}   SVM hinge loss: max(0, 1 − y_i u_i)             (none)
  {±1}   Log loss: ln(exp(−y_i u_i) + 1)                 Logistic: 1/(exp(−τ y_i u_i) + 1)
  ℝ      SVM ε-insensitive loss: max(0, |y_i − u_i|/ε − 1)   (none)
  ℝ      Quadratic loss: (y_i − u_i)²                    Gaussian: N(y_i | u_i, σ²)
  ℝ      Linear loss: |y_i − u_i|                        Laplace: L(y_i | u_i, τ)

The correspondence between likelihood and loss function bridges probabilistic and regularised risk techniques.
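The correspondences of table 1 are easy to check numerically; a small sketch (standard library only) verifying that the quadratic loss matches the Gaussian likelihood up to a u-independent constant, and that the log loss is exactly the negative log logistic likelihood for τ = 1:

```python
import math

def log_gauss(y, u, s2):            # ln N(y | u, s2)
    return -0.5 * ((y - u) ** 2 / s2 + math.log(2 * math.pi * s2))

def log_logistic(y, u, tau=1.0):    # ln 1/(1 + exp(-tau*y*u))
    return -math.log1p(math.exp(-tau * y * u))

# quadratic loss (y-u)^2 equals -2 s2 ln N(y|u,s2) up to a u-independent constant
s2 = 0.7
const = -2 * s2 * log_gauss(1.3, 1.3, s2)   # value at y = u fixes the constant
for u in [-1.0, 0.0, 2.5]:
    assert math.isclose((1.3 - u) ** 2, -2 * s2 * log_gauss(1.3, u, s2) - const)

# log loss ln(exp(-y*u) + 1) is exactly the negative log logistic likelihood (tau = 1)
for y, u in [(1, -0.5), (-1, 2.0), (1, 3.0)]:
    assert math.isclose(math.log(math.exp(-y * u) + 1), -log_logistic(y, u))
```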
More specifically , an y lik eliho o d P ( y | u ) induces a loss function ` ( y , u ) via − 2 ln P ( y | u ) = − 2 X i ln P ( y i | u i ) C n n X i =1 ` ( y i , u i ) = ` ( y , u ) , ho wev er some loss functions cannot b e in terpreted as a negative log likelihoo d as shown in table ( 1 ) and as discussed for the SVM by Sollich ( 2000 ). If, t he lik eliho o d is a lo g- c onc ave function of u , it corresponds to a conv ex loss function ( Bo yd and V anden b erghe , 2002 , Sect. 3.5.1). Common loss functions and likelihoo ds for classification Y = {± 1 } and regression Y = R are listed in table ( 1 ). In the following, w e discuss sev eral approac hes to learn the kernel parameters θ and sho w ho w all of them can b e understo o d as instances of or appro ximations to Bay esian evidence maximisation. Although the exp osition MKL section 2.1 and MAP section 2.2 use a linear parametrisation θ 7→ K θ = P M m =1 θ m K m , muc h of the results in MLM 2.3 and all the aforementioned discussion are still applicable to non-linear parametrisations. 2.1 Multiple Kernel Learning A widely adopted regularised risk principle, known as multiple kernel le arning (MKL) ( Chris- tianini et al. , 2001 ; Lanckriet et al. , 2004 ; Bac h et al. , 2004 ), is to minimise equation ( 1 ) w.r.t. the k ernel parameters θ as well. One obvious ca v eat is that for an y fixed u , equation ( 1 ) b ecomes ev er smaller as θ m → ∞ : it cannot p er se play a meaningful statistical role. In order to preven t this, researchers constrain the domain of θ ∈ Θ and obtain min θ ∈ Θ min u u > K − 1 θ u + ` ( y , u ) , where Θ = { θ 0 , k θ k 2 ≤ 1 } or Θ = { θ 0 , 1 > θ ≤ 1 } ( V arma and Ray , 2007 ). Notably , these constraints are imp osed indep enden tly of the statistical problem, the mo del and of the parametrization θ 7→ K θ . The Lagrangian form of the MKL problem with parameter λ and a general p -norm unit ball constraint where p ≥ 1 ( Kloft et al. 
, 2009), is given by min_{θ⪰0} φ_MKL(θ), where

    φ_MKL(θ) := min_u u^⊤ K_θ^{-1} u + ℓ(y, u) + λ · 1^⊤θ^p,    λ > 0,    (3)

with regulariser ρ(θ) := λ · 1^⊤θ^p = λ‖θ‖_p^p. Since the regulariser ρ(θ) for the kernel parameter θ is convex, the map (u, K) ↦ u^⊤ K^{-1} u is jointly convex for K ≻ 0 (Boyd and Vandenberghe, 2002) and the parametrisation θ ↦ K_θ is linear, MKL is a jointly convex problem for θ ⪰ 0 whenever the loss function ℓ(y, u) is convex. Furthermore, there are efficient algorithms to solve equation (3) for large models (Sonnenburg et al., 2006).

Figure 1: Convex upper bounds on the (concave, non-decreasing) function ln |K_θ|: a linear upper bound (left) and a quadratic upper bound (right). By Fenchel duality, we can represent any concave non-decreasing function, and hence the log-determinant function, by ln |K_θ| = min_{λ⪰0} λ^⊤|θ|^p − g*(λ). As a consequence, we obtain a piecewise polynomial upper bound for any particular value λ.

2.2 Joint MAP Estimation

Adopting a probabilistic MAP viewpoint, we can minimise equation (2) w.r.t. u and θ ⪰ 0:

    min_{θ⪰0} φ_MAP(θ), where φ_MAP(θ) := min_u u^⊤ K_θ^{-1} u − 2 ln P(y | u) + ln |K_θ|.    (4)

While equation (3) and equation (4) share the “inner solution” û for fixed K_θ – in case the loss ℓ(y, u) corresponds to a likelihood P(y | u) – they are different when it comes to optimising θ. The joint MAP problem is not in general jointly convex in (θ, u), since θ ↦ ln |K_θ| is concave, see figure 2. However, it is always a well-posed statistical procedure, since ln |K_θ| → ∞ as θ_m → ∞ for all m. We show in the following how the regularisers ρ(θ) = λ‖θ‖_p^p of equation (3) can be related to the probabilistic term f(θ) = ln |K_θ|. In fact, the same reasoning can be applied to any concave non-decreasing function.
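Returning to the MKL objective, equation (3): under a quadratic loss the inner minimisation over u is available in closed form, min_u u^⊤K_θ^{-1}u + ‖y − u‖²/σ² = y^⊤(K_θ + σ²I)^{-1}y. A small numerical sketch (toy kernels and data, hypothetical; assumes NumPy) checking midpoint convexity of φ_MKL in θ and illustrating the caveat that the unregularised part alone keeps decreasing as θ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam, s2 = 6, 0.5, 0.25
y = rng.normal(size=n)
# two fixed base kernels K_1, K_2 (hypothetical PSD matrices)
A = rng.normal(size=(n, n)); K1 = A @ A.T / n
B = rng.normal(size=(n, n)); K2 = B @ B.T / n

def phi_mkl(theta, p=1):
    """phi_MKL(theta) for the quadratic loss: the inner minimum over u equals
    y^T (K_theta + s2 I)^-1 y, so no inner optimisation loop is needed."""
    K = theta[0] * K1 + theta[1] * K2
    return y @ np.linalg.solve(K + s2 * np.eye(n), y) + lam * np.sum(theta ** p)

# joint convexity in theta (for p >= 1): midpoint convexity along random segments
for _ in range(20):
    t1, t2 = rng.uniform(0.1, 2.0, size=(2, 2))
    assert phi_mkl((t1 + t2) / 2) <= 0.5 * phi_mkl(t1) + 0.5 * phi_mkl(t2) + 1e-9

# without the regulariser, scaling theta up only ever helps -> ill-posed
unreg = lambda c: phi_mkl(np.array([c, c])) - lam * 2 * c
assert unreg(10.0) < unreg(1.0) < unreg(0.1)
```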
Since the function θ ↦ f(θ) = ln |K_θ|, θ ⪰ 0, is jointly concave, we can represent it by f(θ) = min_{λ⪰0} λ^⊤θ − f*(λ), where f*(λ) denotes the Fenchel dual of f(θ). Furthermore, the mapping ϑ ↦ ln |Σ_{m=1}^M ϑ_m^{1/p} K_m| = f(ϑ^{1/p}) = g(ϑ), ϑ ⪰ 0, is jointly concave due to the composition rule (Boyd and Vandenberghe, 2002, §3.2.4), because ϑ ↦ ϑ^{1/p} is jointly concave and θ ↦ f(θ) is non-decreasing in all components θ_m, as all matrices K_m are positive (semi-)definite, which guarantees that the eigenvalues of K_θ increase as θ_m increases. Thus we can – similarly to Zhang (2010) – represent ln |K_θ| as

    ln |K_θ| = f(θ) = g(ϑ) = min_{λ⪰0} λ^⊤ϑ − g*(λ) = min_{λ⪰0} λ^⊤|θ|^p − g*(λ).

Choosing a particular value λ = λ·1, we obtain the bound ln |K_θ| ≤ λ·‖θ‖_p^p − g*(λ·1). Figure 1 illustrates the bounds for p = 1 and p = 2. The bottom line is that one can interpret the regularisers ρ(θ) = λ‖θ‖_p^p in equation (3) as parametrised upper bounds on the ln |K_θ| part of equation (4), hence φ_MKL(θ) = ψ_MAP(θ, λ = λ·1), where φ_MAP(θ) = min_{λ⪰0} ψ_MAP(θ, λ). Far from an ad hoc choice to keep θ small, the ln |K_θ| term embodies the Occam’s razor concept behind MAP estimation: overly large θ are ruled out, since their explanation of the data y is extremely unlikely under the prior P(u; θ). The Occam’s razor effect depends crucially on the proper normalisation of the prior (MacKay, 1992). For example, the weighting parameter C of k (k = C k̃) can be learned by joint MAP: if C = e^c, then equation (4) is convex in c for any fixed u.
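The tangent bound behind this construction can be checked numerically for p = 1: at any θ₀, the gradient λ = [tr(K_{θ₀}^{-1}K_m)]_m defines a linear function that touches ln |K_θ| at θ₀ and upper bounds it everywhere on θ ⪰ 0. A sketch with hypothetical base kernels (assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.normal(size=(n, n)); K1 = A @ A.T / n + 0.1 * np.eye(n)
B = rng.normal(size=(n, n)); K2 = B @ B.T / n + 0.1 * np.eye(n)
Ks = [K1, K2]

def f(theta):                       # f(theta) = ln |K_theta|
    K = theta[0] * K1 + theta[1] * K2
    return np.linalg.slogdet(K)[1]

# tangent (Fenchel) bound at theta0: lam = grad f(theta0) = [tr(K^-1 K_m)]_m,
# so ln|K_theta| <= lam^T theta - g*(lam) with equality at theta0
theta0 = np.array([0.7, 1.4])
Kinv = np.linalg.inv(theta0[0] * K1 + theta0[1] * K2)
lam = np.array([np.trace(Kinv @ Km) for Km in Ks])
offset = f(theta0) - lam @ theta0   # equals minus the dual at the touching point

assert np.isclose(f(theta0), lam @ theta0 + offset)   # bound is tight at theta0
for _ in range(30):                                   # ...and an upper bound elsewhere
    th = rng.uniform(0.05, 3.0, size=2)
    assert f(th) <= lam @ th + offset + 1e-9
```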
If kernel-regularised estimation, equation (1), is interpreted as MAP estimation under a GP prior, equation (2), the correct extension to kernel learning is joint MAP: the MKL criterion, equation (3), lacks the prior normalisation which renders MAP w.r.t. θ meaningful in the first place. From a non-probabilistic viewpoint, the ln |K_θ| term comes with a model- and data-dependent structure at least as complex as the rest of equation (3). While the MKL objective, equation (3), enjoys the benefit of being convex in the (linear) kernel parameters θ, this does not hold true for joint MAP estimation, equation (4), in general. We illustrate the differences in figure 2. The function ψ_MAP(θ, u) is a building block of the MAP objective φ_MAP(θ) = min_u [ψ_MAP(θ, u) − 2 ln P(y | u)], where

    ψ_MAP(θ, u) = u^⊤K_θ^{-1}u + ln |K_θ| = ψ_∪(θ, u) + ψ_∩(θ) ≤ ψ_MKL(θ, u) − g*(λ·1),
    ψ_MKL(θ, u) = u^⊤K_θ^{-1}u + λ‖θ‖_p^p,

with ψ_∪(θ, u) := u^⊤K_θ^{-1}u and ψ_∩(θ) := ln |K_θ|. More concretely, ψ_MAP(θ, u) is the sum of a nonnegative, jointly convex function ψ_∪(θ, u) that is strictly decreasing in every component θ_m and a concave function ψ_∩(θ) that is strictly increasing in every component θ_m. Neither ψ_∪(θ, u) nor ψ_∩(θ) alone has a stationary point, due to their monotonicity in θ_m. However, their sum can have (even multiple) stationary points, as shown in figure 2 on the left. We can show that the map K ↦ u^⊤K^{-1}u + ln |K| is invex, i.e. every stationary point K̂ is a global minimum. Using the convexity of A ↦ u^⊤Au − ln |A| (Boyd and Vandenberghe, 2002) and the fact that the derivative of A ↦ A^{-1} for A ∈ ℝ^{n×n}, A ≻ 0, has full rank n², we see by Mishra and Giorgi (2008, theorem 2.1) that K ↦ u^⊤K^{-1}u + ln |K| is indeed invex. Often, the MKL objective for the case p = 1 is motivated by the fact that the optimal solution θ* is sparse (e.g.
Sonnenburg et al., 2006), meaning that many components θ_m are zero. Figure 2 illustrates that φ_MAP(θ) also yields sparse solutions; in fact it enforces even more sparsity. In MKL, φ_MAP(θ) is simply relaxed to a convex objective φ_MKL(θ) at the expense of having only a single, less sparse solution.

Intuition for the Gaussian Case

We can gain further intuition about the criteria φ_MKL and φ_MAP by asking which matrices K minimise them. For simplicity, assume that P(y | u) = N(y | u, σ²I) and n/C = σ², hence ℓ(y, u) = (1/σ²)‖y − u‖₂². The inner minimiser û for both φ_MKL and φ_MAP is given by K_θ^{-1}û = (K_θ + σ²I)^{-1}y. With σ² → 0, we find for joint MAP that ∂φ_MAP/∂K = 0 results in K̂ = yy^⊤. While this “nonparametric” estimate requires smoothing to be useful in practice, closeness to yy^⊤ is fundamental to covariance estimation and can be found in regularised risk kernel learning work (Cristianini et al., 2001). On the other hand, for tr(K_m) = 1 and hence ρ(θ) = λ tr(K_θ) = λ‖θ‖₁, ∂φ_MKL/∂K = 0 leads to K̂² = λ^{-1}yy^⊤: an odd way of estimating covariance, not supported by any statistical literature we are aware of.

Figure 2: Convex and non-convex building blocks of the MKL and MAP objective functions: contour plots of ψ_MAP(θ, u) = u^⊤K_θ^{-1}u + ln |K_θ| (left) and ψ_MKL(θ, u) = u^⊤K_θ^{-1}u + |θ|₁ (right) over (θ₁, θ₂) ∈ [0, 2]².

2.3 Marginal Likelihood Maximisation

While the joint MAP criterion uses a properly normalised prior distribution, it is still not probabilistically consistent. Kernel learning amounts to finding a value θ̂ of high data likelihood, no matter what the latent function u(·) is. The correct likelihood to be maximised is marginal: P(y | θ) = ∫ P(y | u) P(u | θ) du (“max-sum”), while joint MAP employs the plug-in surrogate max_u P(y | u) P(u | θ) (“max-max”).
Marginal likelihood maximisation (MLM) is also known as Bayesian estimation, and it underlies the EM algorithm or maximum likelihood learning of conditional random fields just as well: complexity is controlled (and overfitting avoided) by averaging over unobserved variables u (MacKay, 1992), rather than plugging in some point estimate û:

    φ_MLM(θ) := −2 ln ∫ N(u | 0, K_θ) P(y | u) du.    (5)

The Gaussian Case

Before developing a general MLM approximation, we note an important analytically solvable exception: for the Gaussian likelihood P(y | u) = N(y | u, σ²I), we have P(y | θ) = N(y | 0, K_θ + σ²I), and MLM becomes

    φ_GAU(θ) := y^⊤(K_θ + σ²I)^{-1}y + ln |K_θ + σ²I|.    (6)

Even if the primary purpose is classification, the Gaussian likelihood is used for its analytical simplicity (Kapoor et al., 2009). Only in the Gaussian case do joint MAP and MLM have an analytically closed form. From the product formula of Gaussians (Brookes, 2005, §5.1),

    Q(u) := N(u | 0, K_θ) N(y | u, Γ) = N(y | 0, K_θ + Γ) N(u | m, V),

where V = (K_θ^{-1} + Γ^{-1})^{-1} and m = VΓ^{-1}y, we can deduce that

    −2 ln ∫ Q(u) du = ln |K_θ^{-1} + Γ^{-1}| + min_u [−2 ln Q(u)] − n ln(2π).    (7)

Using Γ = σ²I and min_u [−2 ln Q(u)] = −2 ln Q(m), we see that

    φ_MAP/GAU(θ) := φ_GAU(θ) − ln |K_θ^{-1} + σ^{-2}I| ≗ y^⊤(K_θ + σ²I)^{-1}y + ln |K_θ|,    (8)

where ≗ denotes equality up to an additive constant: MLM and MAP are very similar in the Gaussian case. The “ridge regression” approximation is also used together with p-norm constraints instead of the ln |K_θ| term (Cortes et al., 2009):

    φ_RR(θ) := y^⊤(K_θ + σ²I)^{-1}y + λ‖θ‖_p^p.    (9)

Unfortunately, most GP methods to date work with a Gaussian likelihood for simplicity, a restriction which often proves short-sighted.
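The determinant identity behind equation (8), |K_θ + σ²I| = |K_θ| · |K_θ^{-1} + σ^{-2}I| · σ^{2n}, shows that φ_GAU(θ) − ln |K_θ^{-1} + σ^{-2}I| and the right-hand side of equation (8) differ only by the θ-independent constant n ln σ². A numerical check (hypothetical kernels and data, assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)
n, s2 = 6, 0.3
y = rng.normal(size=n)
A = rng.normal(size=(n, n)); K1 = A @ A.T / n + 0.05 * np.eye(n)
B = rng.normal(size=(n, n)); K2 = B @ B.T / n + 0.05 * np.eye(n)

def K_of(theta): return theta[0] * K1 + theta[1] * K2
logdet = lambda M: np.linalg.slogdet(M)[1]

def phi_gau(theta):                 # eq. (6), dropping the constant n ln 2pi
    M = K_of(theta) + s2 * np.eye(n)
    return y @ np.linalg.solve(M, y) + logdet(M)

def phi_map_gau(theta):             # right-hand side of eq. (8)
    K = K_of(theta)
    return y @ np.linalg.solve(K + s2 * np.eye(n), y) + logdet(K)

# eq. (8): the two sides agree up to the theta-independent constant n ln s2
for _ in range(10):
    th = rng.uniform(0.2, 2.0, size=2)
    K = K_of(th)
    lhs = phi_gau(th) - logdet(np.linalg.inv(K) + np.eye(n) / s2)
    assert np.isclose(lhs, phi_map_gau(th) + n * np.log(s2))
```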
Gaussian-linear models come with unrealistic properties, and the benefits of MLM over joint MAP cannot be realised. Kernel parameter learning has been an integral part of probabilistic GP methods from the very beginning. Williams and Rasmussen (1996) proposed MLM for Gaussian noise, equation (6), fifteen years ago. They treated sums of exponential and linear kernels as well as learning lengthscales (ARD), predating recent proposals such as “products of kernels” (Varma and Babu, 2009).

The General Case

In general, joint MAP always has the analytical form of equation (4), while P(y | θ) can only be approximated. For non-Gaussian P(y | u), numerous approximate inference methods have been proposed, specifically motivated by learning kernel parameters via MLM. The simplest such method is Laplace’s approximation, applied to GP binary and multi-way classification by Williams and Barber (1998): starting with convex joint MAP, ln P(y, u) is expanded to second order around the posterior mode û. More recent approximations (Girolami and Rogers, 2005; Girolami and Zhong, 2006) can be much more accurate, yet come with non-convex problems and less robust algorithms (Nickisch and Rasmussen, 2008). In this paper, we concentrate on the variational lower bound relaxation (VB) by Jaakkola and Jordan (2000), which is convex for log-concave likelihoods P(y | u) (Nickisch and Seeger, 2009), providing a novel, simple and efficient algorithm. While our VB approximation to MLM is more expensive to run than joint MAP for non-Gaussian likelihood (even using Laplace’s approximation), the implementation complexity of our VB algorithm is comparable to what is required in the Gaussian noise case, equation (6).
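As a point of comparison for the relaxations discussed here, Laplace's approximation for GP logistic classification can be sketched in a few lines: Newton iterations find the posterior mode û, and the evidence is approximated by a second-order expansion around it. This is a generic textbook-style sketch on toy data (assumes NumPy), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10
X = rng.normal(size=(n, 1))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)  # labels in {-1,+1}
K = np.exp(-0.5 * (X - X.T) ** 2) + 1e-6 * np.eye(n)

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
t = (y + 1) / 2                                 # labels mapped to {0, 1}

# Newton iteration for the posterior mode of the logistic GP model:
# maximise -u^T K^-1 u / 2 + sum_i ln sigma(y_i u_i), a concave problem
u = np.zeros(n)
for _ in range(50):
    pi = sig(u)
    W = pi * (1 - pi)                           # negative Hessian of the log likelihood
    b = W * u + (t - pi)
    u = K @ np.linalg.solve(np.eye(n) + W[:, None] * K, b)   # (K^-1 + W)^-1 b

# at the mode: K^-1 u = t - sigma(u)
assert np.allclose(np.linalg.solve(K, u), t - sig(u), atol=1e-8)

# Laplace's approximation of phi_MLM (up to constants):
W = sig(u) * (1 - sig(u))
phi_laplace = (u @ np.linalg.solve(K, u)
               - 2 * np.sum(np.log(sig(y * u)))
               + np.linalg.slogdet(np.eye(n) + W[:, None] * K)[1])
assert np.isfinite(phi_laplace)
```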
More specifically, we exploit that super-Gaussian likelihoods P(y_i | u_i) can be lower bounded by scaled Gaussians N_{γ_i} of any width γ_i:

    P(y_i | u_i) = max_{γ_i>0} N_{γ_i} = max_{γ_i>0} exp(β_i u_i − u_i²/(2γ_i) − h_i(γ_i)/2),

where the β_i ∝ y_i are constants, and h_i(·) is convex (Nickisch and Seeger, 2009) whenever the likelihood P(y_i | u_i) is log-concave. If the posterior distribution is P(u | y) = Z^{-1} P(y | u) P(u), then Z ≥ C e^{−ψ_VB(θ,γ)/2} by plugging in these bounds, where C is a constant and

    φ_VB(θ) := min_{γ≻0} ψ_VB(θ, γ),  ψ_VB(θ, γ) := h(γ) − 2 ln ∫ N(u | 0, K_θ) exp(β^⊤u − ½u^⊤Γ^{-1}u) du,    (10)

with h(γ) := Σ_i h_i(γ_i) and Γ := dg(γ). The variational relaxation¹ amounts to maximising the lower bound, which means that P(u | y) is fitted by the Gaussian approximation Q(u | y; γ) with covariance matrix V = (K_θ^{-1} + Γ^{-1})^{-1} (Nickisch and Seeger, 2009). Alternatively, we can interpret ψ_VB(θ, γ) as an upper bound on the Kullback-Leibler divergence KL(Q(u | y; γ) ‖ P(u | y)) (Nickisch, 2010, §2.5.9), a measure of the dissimilarity between the exact posterior P(u | y) and the parametrised Gaussian approximation Q(u | y; γ). Finally, note that by equation (7), ψ_VB(θ, γ) can also be written as

    ψ_VB(θ, γ) = ln |K_θ^{-1} + Γ^{-1}| + h(γ) + min_u R(u, θ, γ) + ln |K_θ|,    (11)

where R(u, θ, γ) = u^⊤(K_θ^{-1} + Γ^{-1})u − 2β^⊤u.

1. Generalisations to other super-Gaussian potentials (log-concave or not) or models including linear couplings and mixed potentials are given by Nickisch and Seeger (2009).
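For the logistic likelihood, such a scaled-Gaussian lower bound is the classical one of Jaakkola and Jordan (2000): σ(z) ≥ σ(ξ) exp((z − ξ)/2 − λ(ξ)(z² − ξ²)) with λ(ξ) = tanh(ξ/2)/(4ξ), which has the form above with z = y_i u_i, β_i = y_i/2 and γ_i = 1/(2λ(ξ_i)). A quick numerical check (standard library only):

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))

def jj_lower_bound(z, xi):
    """Jaakkola-Jordan bound: sigma(z) >= sigma(xi) exp((z-xi)/2 - lam(xi)(z^2-xi^2))."""
    lam = math.tanh(xi / 2) / (4 * xi)
    return sig(xi) * math.exp((z - xi) / 2 - lam * (z * z - xi * xi))

xi = 1.7
for z in [x / 10 for x in range(-60, 61)]:
    assert sig(z) >= jj_lower_bound(z, xi) - 1e-12     # lower bound everywhere
assert math.isclose(sig(xi), jj_lower_bound(xi, xi))   # tight at z = xi
assert math.isclose(sig(-xi), jj_lower_bound(-xi, xi)) # ...and at z = -xi
```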
Using the concavity of γ^{-1} ↦ ln |K_θ^{-1} + Γ^{-1}| and Fenchel duality,

    ln |K_θ^{-1} + Γ^{-1}| = min_{z⪰0} z^⊤γ^{-1} − g*_θ(z) = ẑ_θ^⊤γ^{-1} − g*_θ(ẑ_θ),

with the optimal value ẑ_θ = dg(V), we can reformulate ψ_VB(θ, γ) as

    ψ_VB(θ, γ) = min_{z⪰0} [z^⊤γ^{-1} − g*_θ(z)] + h(γ) + min_u R(u, θ, γ) + ln |K_θ|,

which allows us to perform the minimisation w.r.t. γ in closed form (Nickisch, 2010, §3.5.6):

    φ_VB(θ) = min_{z⪰0} ψ_VB(θ, z),  ψ_VB(θ, z) = min_u u^⊤K_θ^{-1}u + ℓ̃_z(y, u) − g*_θ(z) + ln |K_θ|,    (12)

where ℓ̃_z(y, u) := 2β^⊤(v − u) − 2 ln P(y | v) and v = sign(u)√(u² + z) (understood componentwise). Note that for z = 0 we exactly recover joint MAP estimation, equation (4), as z = 0 implies u = v and ℓ̃_z(y, u) = ℓ(y, u). For fixed θ, the optimal value ẑ_θ = dg(V) corresponds to the marginal variances of the Gaussian approximation Q(u | y; γ): variational inference corresponds to variance-smoothed joint MAP estimation (Nickisch, 2010) with a loss function ℓ̃(y, u, θ) that explicitly depends on the kernel parameters θ. We have two equivalent representations of the loss ℓ̃(y, u, θ) that directly follow from equations (11) and (12):

    ℓ̃(y, u, θ) = min_{γ≻0} [ln |K_θ^{-1} + Γ^{-1}| + h(γ) + u^⊤Γ^{-1}u − 2β^⊤u],  and
    ℓ̃(y, u, θ) = min_{z⪰0} [2β^⊤(v − u) − 2 ln P(y | v) − g*_θ(z)],  v = sign(u)√(u² + z).

Our VB problem is min_{θ⪰0, γ≻0} ψ_VB(θ, γ) or, equivalently, min_{θ⪰0, z⪰0} ψ_VB(θ, z). The inner variables here are γ and z, in addition to u in joint MAP. There are further similarities: since ψ_VB(θ, γ) = −2 ln ∫ e^{−R(u,γ,θ)/2} du + h(γ) + ln |2πK_θ|, the map (γ, θ) ↦ ψ_VB − ln |K_θ| is jointly convex for γ ≻ 0, θ ⪰ 0, by the joint convexity of (u, γ, θ) ↦ R and Prékopa’s theorem (Boyd and Vandenberghe, 2002, §3.5.2).
Joint MAP and VB share the same convexity structure. In contrast, approximating P(y | θ) by other techniques like Expectation Propagation (Minka, 2001) or general Variational Bayes (Opper and Archambeau, 2009) does not even constitute a convex problem for fixed θ.

2.4 Summary and Taxonomy

In the last paragraphs, we have detailed how a variety of kernel learning approaches can be obtained from Bayesian marginal likelihood maximisation in a sequence of nested upper bounding steps. Table 2 illustrates how the kernel learning objectives are related to each other – either by upper bounds or by Gaussianity assumptions.

Table 2: Taxonomy of kernel learning objective functions

  Name                               Objective function
  Marginal likelihood maximisation   φ_MLM(θ) = −2 ln ∫ N(u | 0, K_θ) P(y | u) du
  Variational bounds                 φ_VB(θ) = min_{γ⪰0} ψ_VB(θ, γ) ≥ φ_MLM(θ), by P(y_i | u_i) ≥ N_{γ_i}
  Maximum a posteriori               φ_MAP(θ) = −2 ln [max_u N(u | 0, K_θ) P(y | u)] = ψ_VB(θ, z = 0)
  Multiple kernel learning           φ_MKL(θ) = φ_MAP(θ) + λ‖θ‖_p^p − ln |K_θ| = ψ_MAP(θ, λ = λ·1)

                                          General P(y_i | u_i)      Gaussian P(y_i | u_i)
                                          φ_MLM(θ), eq. (5)    →    φ_GAU(θ), eq. (6)
  super-Gaussian bounding                   ↓                         ↓  (bound is tight)
                                          φ_VB(θ), eq. (10)    →    φ_GAU(θ), eq. (6)
  maximum instead of integral               ↓                         ↓  (mode ≡ mean)
                                          φ_MAP(θ), eq. (4)    →    φ_MAP/GAU(θ), eq. (8)
  bound ln|K_θ| ≤ λ‖θ‖_p^p − g*(λ·1)        ↓                         ↓
                                          φ_MKL(θ), eq. (3)    →    φ_RR(θ), eq. (9)

The upper table visualises the relationship between several kernel learning objective functions for arbitrary likelihood/loss functions: marginal likelihood maximisation (MLM) can be bounded by variational bounds (VB), and maximum a posteriori estimation (MAP) is a special case (z = 0) thereof. Finally, multiple kernel learning (MKL) can be understood as an upper bound to the MAP estimation objective (λ = λ·1). The lower table complements the upper table by also covering the analytically important Gaussian case.

We can clearly see that φ_VB(θ) – as an upper bound to the negative log marginal likelihood – can be seen as the mother function. For the special case z = 0, we obtain joint maximum a posteriori estimation, where the loss function does not depend on the kernel parameters. Going further, the particular instance λ = λ·1 yields the widely used multiple kernel learning objective, which becomes convex in the kernel parameters θ. In the following, we will concentrate on the optimisation and computational similarities between the approaches.

3. Algorithms

In this section, we derive a simple, provably convergent and efficient algorithm for MKL, joint MAP and VB. We use the Lagrangian form of equation (3) and ℓ(y, u) := −2 ln P(y | u):

    ψ_MKL(θ, u) = u^⊤K_θ^{-1}u + ℓ(y, u) + λ·1^⊤θ,  λ > 0,
    ψ_MAP(θ, u) = u^⊤K_θ^{-1}u + ℓ(y, u) + ln |K_θ|,  and
    ψ_VB(θ, u) = u^⊤K_θ^{-1}u + min_{z⪰0} [ℓ(y, v) + 2β^⊤(v − u) − g*_θ(z)] + ln |K_θ|,

where v = sign(u)√(u² + z). Many previous algorithms use alternating minimisation, which is easy to implement but tends to converge slowly. Both φ_VB and φ_MAP are jointly convex up to the concave part θ ↦ ln |K_θ|. Since ln |K_θ| = min_{λ⪰0} λ^⊤θ − f*(λ) (Legendre duality; Boyd and Vandenberghe, 2002), joint MAP becomes min_{λ⪰0, θ⪰0, u} φ_λ(θ, u) with φ_λ := u^⊤K_θ^{-1}u + ℓ(y, u) + λ^⊤θ − f*(λ), which is jointly convex in (θ, u). Algorithm 1 iterates between refits of λ and joint Newton updates of (θ, u).

Algorithm 1: Double loop algorithm for joint MAP, MKL and VB.
Require: Criterion ψ_#(θ, u) = ψ̃_#(θ, u) + ln |K_θ| to minimise for (u, θ) ∈ ℝ^n × ℝ^M_+.
repeat
    Newton: minimise ψ_# over u for fixed θ (optional; few steps).
    Refit the upper bound: λ ← ∇_θ ln |K_θ| = [tr(K_θ^{-1}K_1), .., tr(K_θ^{-1}K_M)]^⊤.
    Compute the joint Newton search direction d for ψ_λ := ψ̃_# + λ^⊤θ:  ∇²_{[θ;u]} ψ_λ d = −∇_{[θ;u]} ψ_λ.
    Linesearch: minimise ψ_#(α), i.e. ψ_#(θ, u) along [θ; u] + αd, α > 0.
until outer loop converged

The Newton direction costs O(n³ + Mn²), with n the number of data points and M the number of base kernels. All algorithms discussed in this paper require O(n³) time, apart from the requirement to store the base matrices K_m. The convergence proof hinges on the fact that φ and φ_λ are tangentially equal (Nickisch and Seeger, 2009). Equivalently, the algorithm can be understood as Newton's method with the part of the Hessian corresponding to the ln |K_θ| term dropped (note that ∇_{(u,θ)} φ_λ = ∇_{(u,θ)} φ for the Newton direction computation); for MKL, where no such term is present, the method is exact Newton. In practice, we use K_θ = Σ_m θ_m K_m + εI, ε = 10⁻⁸, to avoid numerical problems when computing λ and ln |K_θ|. We also have to enforce θ ⪰ 0 in algorithm 1, which is done by the barrier method (Boyd and Vandenberghe, 2002): we minimise tφ − 1^⊤ ln θ instead of φ, increasing t > 0 every few outer loop iterations.

A variant of algorithm 1 can be used to solve VB in a different parametrisation (γ ⪰ 0 replaces u), which has the same convexity structure as joint MAP. Transforming equation (10) similarly to equation (6), we obtain

    φ_VB(θ) = min_{γ⪰0} ln |C| − ln |Γ| + β^⊤ΓC^{-1}Γβ − β^⊤Γβ + h(γ)    (13)

with C := K_θ + Γ, computed using the Cholesky factorisation C = LL^⊤. The required gradients cost O(Mn³) to compute, which is more expensive than for joint MAP or MKL. Note that the cost O(Mn³) is not specific to our particular relaxation or algorithm; e.g. the Laplace MLM approximation (Williams and Barber, 1998), solved using gradients w.r.t.
θ only, comes with the same complexity.

4. Conclusion

We presented a unifying probabilistic viewpoint on multiple kernel learning that derives regularised risk approaches as special cases of approximate Bayesian inference. We provided an efficient and provably convergent optimisation algorithm suitable for regression, robust regression and classification. Our taxonomy of multiple kernel learning approaches connected many previously only loosely related ideas and provided insights into the common structure of the respective optimisation problems. Finally, we proposed an algorithm to solve the latter efficiently.

References

Francis Bach, Gert Lanckriet, and Michael Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2002.

Mike Brookes. The matrix reference manual, 2005. URL http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html.

Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz Kandola. On kernel-target alignment. In NIPS, 2001.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. L2 regularization for learning kernels. In UAI, 2009.

Mark Girolami and Simon Rogers. Hierarchic Bayesian models for kernel learning. In ICML, 2005.

Mark Girolami and Mingjun Zhong. Data integration for classification problems employing Gaussian process priors. In NIPS, 2006.

Tommi Jaakkola and Michael Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000.

Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Gaussian processes for object categorization. IJCV, 2009. doi: 10.1007/s11263-009-0268-3.

Marius Kloft, Ulf Brefeld, Sören Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and accurate lp-norm multiple kernel learning.
In NIPS, 2009.

Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004.

David MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

Tom Minka. Expectation propagation for approximate Bayesian inference. In UAI, 2001.

Shashi Kant Mishra and Giorgio Giorgi. Invexity and Optimization. Springer, 2008.

Hannes Nickisch. Bayesian Inference and Experimental Design for Large Generalised Linear Models. PhD thesis, TU Berlin, 2010.

Hannes Nickisch and Carl Edward Rasmussen. Approximations for binary Gaussian process classification. JMLR, 9:2035–2078, 2008.

Hannes Nickisch and Matthias Seeger. Convex variational Bayesian inference for large scale generalized linear models. In ICML, 2009.

Manfred Opper and Cédric Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT Press, 1st edition, 2002.

Peter Sollich. Probabilistic methods for support vector machines. In NIPS, 2000.

Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer, and Bernhard Schölkopf. Large scale multiple kernel learning. JMLR, 7:1531–1565, 2006.

Manik Varma and Bodla Rakesh Babu. More generality in efficient multiple kernel learning. In ICML, 2009.

Manik Varma and Debajyoti Ray. Learning the discriminative power-invariance trade-off. In ICCV, 2007.

Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian processes for regression. In NIPS, 1996.

Tong Zhang.
Analysis of multi-stage convex relaxation for sparse regularization. JMLR, 11:1081–1107, 2010.