Message-Passing Inference on a Factor Graph for Collaborative Filtering

Byung-Hak Kim, Arvind Yedla, and Henry D. Pfister
Department of Electrical and Computer Engineering, Texas A&M University
College Station, TX 77843, USA
{bhkim, yarvind, hpfister}@tamu.edu

Abstract — This paper introduces a novel message-passing (MP) framework for the collaborative filtering (CF) problem associated with recommender systems. We model the movie-rating prediction problem popularized by the Netflix Prize using a probabilistic factor graph model and study the model by deriving generalization error bounds in terms of the training error. Based on the model, we develop a new MP algorithm, termed IMP, for learning the model. To show the superiority of the IMP algorithm, we compare it with the closely related expectation-maximization (EM) based algorithm and a number of other matrix completion algorithms. Our simulation results on Netflix data show that, while the methods perform similarly with large amounts of data, the IMP algorithm is superior for small amounts of data. This improves the cold-start problem of CF systems in practice. Another advantage of the IMP algorithm is that it can be analyzed using the technique of density evolution (DE) that was originally developed for MP decoding of error-correcting codes.

Index Terms — Belief propagation; message passing; factor graph model; collaborative filtering; recommender systems.

I. INTRODUCTION

One compelling application of collaborative filtering is the automatic generation of recommendations. For example, the Netflix Prize [9] has increased the interest in this field dramatically. Recommendation systems analyze, in essence, patterns of user interest in items to provide personalized recommendations of items that might suit a user's taste.
Their ability to characterize and recommend items within huge collections has been steadily increasing and now represents a computerized alternative to human recommendations. In collaborative filtering, the recommender system identifies users who share the same preferences (e.g., rating patterns) with the active user, and proposes items which the like-minded users favored (and the active user has not yet seen). One difficult part of building a recommendation system is accurately predicting the preference of a user, for a given item, based only on a few known ratings. The collaborative filtering problem is now being studied by a broad research community including groups interested in statistics, machine learning, and information theory [5], [4]. Recent work on the collaborative filtering problem can be largely divided into two areas:
1) The first area considers efficient models and practical algorithms. There are two primary approaches: neighborhood model approaches that are loosely based on "k-Nearest Neighbor" algorithms, and factor models (e.g., low-dimension or low-rank models with a least-squares flavor) such as hard clustering based on singular value decomposition (SVD) or probabilistic matrix factorization (PMF), and soft clustering which employs expectation-maximization (EM) frameworks [5], [14], [15], [3], [16].
2) The second area involves exploration of the fundamental limits of these systems. Prior work has developed some precise relationships between sparse observation models and the recovery of missing entries in terms of the matrix completion problem under the restriction of low-rank matrix models or clustering models [6], [7], [13]. This area is closely related to the practical issue known as the cold-start problem [20], [9].
That is, giving recommendations to new users who have submitted only a few ratings, or recommending new items that have received only a few ratings from users. In other words, how few ratings must be provided for the system to guess the preferences and generate recommendations? In this paper, we apply an alternative modern coding-theoretic approach, one that has been very successful in the field of error-correcting codes, to this problem. Our results differ from the above works in several aspects, as outlined below.
1) Our approach combines the benefits of clustering users and movies into groups probabilistically and applying a factor analysis to make predictions based on the groups. The precise probabilistic generative factor graph model is stated, and generalization error bounds for the model with some observations are studied, in Sec. II. Based on the model, we derive an MP based algorithm, termed IMP; MP has demonstrated empirical success in other applications such as the decoding of low-density parity-check codes and turbo codes. Furthermore, as a benchmark, popular EM algorithms which are frequently used in both the learning and coding communities [3], [11], [2] are developed in Sec. III.
2) Our goal is to characterize system limits via modern coding-theoretic techniques. Toward this end, we provide a characterization of the distribution of messages passed on the graph via density evolution (DE) in Sec. IV. DE is an asymptotic analysis technique that was originally developed for MP decoding of error-correcting codes. Also, through simulations emphasizing cold-start settings, we see in Sec. V that the cold-start problem is greatly reduced by the IMP algorithm in comparison to other methods on real Netflix data.

II. FACTOR GRAPH MODEL

A. Model Description

Consider a collection of N users and M movies where the set O of user-movie pairs has been observed.
The main theoretical question is, "How large should the size of $O$ be to estimate the unknown ratings within some distortion $\delta$?" Answers to this question certainly require some assumptions about the movie-rating process, as has been studied in prior works [6], [7]. So we begin differently by introducing a probabilistic model for the movie ratings. The basic idea is that hidden variables are introduced for users and movies, and that the movie ratings are conditionally independent given these hidden variables. It is convenient to think of the hidden variable for any user (or movie) as the user group (or movie group) of that user (or movie). In this context, the rating associated with a user-movie pair depends only on the user group and the movie group.

Let there be $g_u$ user groups, $g_v$ movie groups, and define $[k] \triangleq \{1, 2, \ldots, k\}$. The user group of the $n$-th user, $U_n \in [g_u]$, is a discrete r.v. drawn from $\Pr(U_n = u) \triangleq p_U(u)$, and $\mathbf{U} = U_1, U_2, \ldots, U_N$ is the user group vector. Likewise, the movie group of the $m$-th movie, $V_m \in [g_v]$, is a discrete r.v. drawn from $\Pr(V_m = v) \triangleq p_V(v)$, and $\mathbf{V} = V_1, V_2, \ldots, V_M$ is the movie group vector. Then, the rating of the $m$-th movie by the $n$-th user is a discrete r.v. $R_{nm} \in \mathcal{R}$ (e.g., Netflix uses $\mathcal{R} = [5]$) drawn from $\Pr(R_{nm} = r \mid U_n = u, V_m = v) \triangleq w(r \mid u, v)$, and the rating $R_{nm}$ is conditionally independent given the user group $U_n$ and the movie group $V_m$. Let $R$ denote the rating matrix and let the observed submatrix be $R_O$ with $O \subseteq [N] \times [M]$. In this setup, some of the entries in the rating matrix are observed while others must be predicted. The conditional independence assumption in the model implies that
$$\Pr(R_O \mid \mathbf{U}, \mathbf{V}) = \prod_{(n,m) \in O} w(R_{nm} \mid U_n, V_m).$$

Figure 1.
The factor graph model for the collaborative filtering problem. The graph is sparse when there are few ratings. Edges represent random variables and nodes represent local probabilities. The node probability associated with the ratings implies that each rating depends only on the movie group (top edge) and the user group (bottom edge). Synthetic data can be generated by picking i.i.d. random user/movie groups and then using random permutations to associate groups with ratings. Note that $x^{(i)}$ and $y^{(i)}$ are the messages from movie to user and from user to movie during iteration $i$ of Algorithm 1.

Specifically, we consider the factor graph (composed of 3 layers; see Fig. 1) as a randomly chosen instance of this problem based on this probabilistic model. The key assumptions are that these layers separate the influence of user groups, movie groups, and observed ratings, and that the outgoing edges from each user node are attached to movie nodes via a random permutation. The main advantage of our model is that, since it exploits the correlation in ratings based on similarity between users (and movies) and includes a noise process, it approximates the real Netflix data generation process more closely than other, simpler factor models. It is also important to note that this is a probabilistic generative model which allows one to evaluate different learning algorithms on synthetic data and compare the results with theoretical bounds (see Sec. V-B for details).

B. Generalization Error Bound

In this section, we consider bounds on generalization from partial knowledge of the (binary-rating) matrix for the collaborative filtering application. A tighter bound implies that one can use most of the known ratings for learning the model completely.
Since computation of $R$ can be viewed as the product of three matrices, we consider the simplified class of tri-factorized matrices
$$\chi_{g_u, g_v} \triangleq \left\{ X \,\middle|\, X = U^T W V,\; U \in [0,1]^{g_u \times N},\; V \in [0,1]^{g_v \times M},\; W \in \{\pm 1\}^{g_u \times g_v} \right\}.$$
We bound the overall distortion between the entire predicted matrix $X$ and the true matrix $Y$ as a function of the distortion on the observed set of size $|O|$ and the error $\epsilon$. Let $y \in \{\pm 1\}$ be binary ratings and define a zero-one sign agreement distortion as
$$d(x, y) \triangleq \begin{cases} 1 & \text{if } xy \le 0 \\ 0 & \text{otherwise.} \end{cases}$$
Also, define the average distortion over the entire prediction matrix as
$$D(X, Y) \triangleq \frac{1}{NM} \sum_{(n,m) \in [N] \times [M]} d(x_{n,m}, y_{n,m})$$
and the average observed distortion as
$$D_O(X, Y) \triangleq \frac{1}{|O|} \sum_{(n,m) \in O} d(x_{n,m}, y_{n,m}).$$

Theorem 1: For any matrix $Y \in \{\pm 1\}^{N \times M}$, $N, M > 2$, $\delta > 0$, and integers $g_u$ and $g_v$, with probability at least $1 - \delta$ over the choice of a subset $O$ of entries in $Y$, chosen uniformly among all subsets of $|O|$ entries, for all $X \in \chi_{g_u, g_v}$, $|D(X, Y) - D_O(X, Y)|$ is upper bounded by
$$\sqrt{ \frac{ (N g_u + M g_v + g_u g_v) \log \frac{12 e M}{\min(g_u, g_v)} - \log \delta }{ 2 |O| } } \;\triangleq\; h(g_u, g_v, N, M, |O|).$$

Proof: The proof of this theorem is given in Appendix A.

Let us finish this section with two implications of Thm. 1 in terms of the five parameters $g_u$, $g_v$, $N$, $M$, $|O|$:
1) For fixed group numbers $g_u$ and $g_v$, as the number of users $N$ and movies $M$ increases, the number of observed ratings $|O|$ also needs to grow in the same order to keep the bound tight.
2) For a fixed-size matrix, when the choice of $g_u$ and/or $g_v$ increases, $|O|$ needs to grow in the same order to prevent overfitting the model. Also, as $|O|$ increases, we can increase the value of $g_u$ and/or $g_v$.

III. LEARNING ALGORITHMS

A.
Message Passing (MP) Learning

Once a generative model describing the data has been specified, we describe how two algorithms can be applied to the model using a unified cost function, the free energy. Since exact learning and inference are often intractable, we turn to approximate algorithms that search for distributions close to the correct posterior distribution by minimizing pseudo-distances on distributions, called free energies by statistical physicists. The problem can be formulated via the message-passing (also known as belief propagation) framework via the sum-product algorithm, since fixed points of (loopy) belief propagation correspond to extrema of the Bethe approximation of the free energy [17]. The basic idea is that the local neighborhood of any node in the factor graph is tree-like, so that belief propagation gives a nearly optimal estimate of the a posteriori distributions. We denote the message from movie $m$ to user $n$ during iteration $i$ by $x^{(i)}_{m \to n}$ and the message from user $n$ to movie $m$ by $y^{(i)}_{n \to m}$. The set of all users whose rating of movie $m$ was observed is denoted $\mathcal{U}_m$, and the set of all movies whose rating by user $n$ was observed is denoted $\mathcal{V}_n$. The exact update equations are given in Algorithm 1. Though the idea is similar to an EM update, the resulting equations are different and seem to perform much better.

B. Expectation Maximization (EM) Learning

Now, we reformulate the problem in a standard variational EM framework and propose a second algorithm by minimizing an upper bound on the free energy [2]. In other words, we view the problem as a maximum-likelihood parameter estimation problem where $p_{U_n}(\cdot)$, $p_{V_m}(\cdot)$, and $p_{R|U,V}(\cdot \mid \cdot)$ are the model parameters $\theta$ and $\mathbf{U}$, $\mathbf{V}$ are the missing data.
For each of these parameters, the $i$-th estimates are denoted $f^{(i)}_n(u)$, $h^{(i)}_m(v)$, and $w^{(i)}(r \mid u, v)$. Let $O \subseteq [N] \times [M]$ be the set of user-movie pairs that have been observed. Then, we can write the complete-data (negative) log-likelihood as
$$R_c(\theta) = -\log \prod_{(n,m) \in O} \Pr(R_{nm} = r_{n,m}, U_n = u_n, V_m = v_m) = -\log \prod_{(n,m) \in O} w(r_{n,m} \mid u_n, v_m)\, f_n(u_n)\, h_m(v_m).$$

Algorithm 1 (IMP Algorithm)
Step I: Initialization
$$x^{(0)}_{m \to n}(v) = x^{(0)}_m(v) = p_V(v), \qquad y^{(0)}_{n \to m}(u) = y^{(0)}_n(u) = p_U(u), \qquad w(r \mid u, v).$$
Step II: Recursive update
$$y^{(i+1)}_{n \to m}(u) = \frac{ y^{(0)}_n(u) \prod_{k \in \mathcal{V}_n \setminus m} \sum_v w(r_{n,k} \mid u, v)\, x^{(i)}_{k \to n}(v) }{ \sum_{u'} y^{(0)}_n(u') \prod_{k \in \mathcal{V}_n \setminus m} \sum_v w(r_{n,k} \mid u', v)\, x^{(i)}_{k \to n}(v) }$$
$$x^{(i+1)}_{m \to n}(v) = \frac{ x^{(0)}_m(v) \prod_{k \in \mathcal{U}_m \setminus n} \sum_u w(r_{k,m} \mid u, v)\, y^{(i)}_{k \to m}(u) }{ \sum_{v'} x^{(0)}_m(v') \prod_{k \in \mathcal{U}_m \setminus n} \sum_u w(r_{k,m} \mid u, v')\, y^{(i)}_{k \to m}(u) }$$
Step III: Output
$$\hat{p}^{(i+1)}_{R_{nm} \mid R_O}(r) = \frac{ \sum_{u,v} y^{(i+1)}_{n \to m}(u)\, x^{(i+1)}_{m \to n}(v)\, w(r \mid u, v) }{ \sum_{r} \sum_{u,v} y^{(i+1)}_{n \to m}(u)\, x^{(i+1)}_{m \to n}(v)\, w(r \mid u, v) }$$
$$\hat{p}^{(i+1)}_{U_n \mid R_O}(u) = \frac{ y^{(0)}_n(u) \prod_{k \in \mathcal{V}_n} \sum_v w(r_{n,k} \mid u, v)\, x^{(i)}_{k \to n}(v) }{ \sum_{u'} y^{(0)}_n(u') \prod_{k \in \mathcal{V}_n} \sum_v w(r_{n,k} \mid u', v)\, x^{(i)}_{k \to n}(v) }$$
$$\hat{p}^{(i+1)}_{V_m \mid R_O}(v) = \frac{ x^{(0)}_m(v) \prod_{k \in \mathcal{U}_m} \sum_u w(r_{k,m} \mid u, v)\, y^{(i)}_{k \to m}(u) }{ \sum_{v'} x^{(0)}_m(v') \prod_{k \in \mathcal{U}_m} \sum_u w(r_{k,m} \mid u, v')\, y^{(i)}_{k \to m}(u) }$$
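To make the message updates concrete, here is a minimal numpy sketch of one synchronous IMP round. This is our own illustration, not the authors' implementation; the group sizes, the observation model `w`, and the toy ratings are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
g_u, g_v, n_ratings = 2, 2, 5
p_U = np.full(g_u, 1 / g_u)                       # prior over user groups
p_V = np.full(g_v, 1 / g_v)                       # prior over movie groups
# w[u, v, r] = Pr(R = r | U = u, V = v); random rows just for the demo.
w = rng.dirichlet(np.ones(n_ratings), size=(g_u, g_v))

obs = {(0, 0): 3, (0, 1): 0, (1, 0): 2}           # (user, movie) -> rating index
V_n = {0: [0, 1], 1: [0]}                         # movies rated by each user
U_m = {0: [0, 1], 1: [0]}                         # users who rated each movie

# Step I: initialize all messages with the group priors.
x = {(m, n): p_V.copy() for (n, m) in obs}        # movie -> user messages over [g_v]
y = {(n, m): p_U.copy() for (n, m) in obs}        # user -> movie messages over [g_u]

def imp_round(x, y):
    """One synchronous IMP update (Step II of Algorithm 1)."""
    y_new, x_new = {}, {}
    for (n, m) in obs:
        msg = p_U.copy()
        for k in V_n[n]:
            if k != m:
                # sum over movie groups v of w(r_{n,k} | u, v) * x_{k->n}(v)
                msg = msg * (w[:, :, obs[(n, k)]] @ x[(k, n)])
        y_new[(n, m)] = msg / msg.sum()
    for (n, m) in obs:
        msg = p_V.copy()
        for k in U_m[m]:
            if k != n:
                msg = msg * (w[:, :, obs[(k, m)]].T @ y[(k, m)])
        x_new[(m, n)] = msg / msg.sum()
    return x_new, y_new

x, y = imp_round(x, y)
# Step III: rating belief for the pair (0, 0), normalized over ratings.
belief = np.einsum('u,v,uvr->r', y[(0, 0)], x[(0, 0)], w)
belief /= belief.sum()
```

The sketch keeps the schedule synchronous for clarity; the normalizations mirror the denominators in the update equations above.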
Using a variational approach, the complete-data likelihood above can be upper bounded by
$$\sum_{(n,m) \in O} D\!\left( Q_{U_n, V_m \mid R_{nm}}(\cdot, \cdot \mid r_{n,m}) \,\big\|\, \hat{p}_{U_n, V_m \mid R_{nm}}(\cdot, \cdot \mid r_{n,m}) \right),$$
where we introduce variational probability distributions $Q_{U_n, V_m \mid R_{nm}}(u, v \mid r)$ that satisfy $\sum_{u,v} Q_{U_n, V_m \mid R_{nm}}(u, v \mid r) = 1$ and let
$$\hat{p}_{U_n, V_m \mid R_{nm}}(u, v \mid r) = \frac{ w(r_{n,m} \mid u, v)\, f_n(u)\, h_m(v) }{ \sum_{u', v'} w(r_{n,m} \mid u', v')\, f_n(u')\, h_m(v') }.$$
The variational EM algorithm we have developed uses alternating steps of KL divergence minimization to estimate the underlying generative model [1]. The results show that this variational approach gives the same update rule as the standard EM framework (with a simpler derivation in Appendix B), which guarantees convergence to local minima. The update equations are presented in Algorithm 2.

Algorithm 2 (EM Learning Algorithm)
Step I: Initialization
$$f^{(0)}_n(u) = p_U(u), \qquad h^{(0)}_m(v) = p_V(v), \qquad w^{(0)}(r \mid u, v).$$
Step II: Recursive update
$$f^{(i+1)}_n(u) = \frac{ \sum_{m \in \mathcal{V}_n} f^{(i)}_n(u) \sum_{v \in [g_v]} w^{(i)}(r_{n,m} \mid u, v)\, h^{(i)}_m(v) }{ \sum_{u' \in [g_u]} \sum_{m \in \mathcal{V}_n} f^{(i)}_n(u') \sum_{v \in [g_v]} w^{(i)}(r_{n,m} \mid u', v)\, h^{(i)}_m(v) }$$
$$h^{(i+1)}_m(v) = \frac{ \sum_{n \in \mathcal{U}_m} h^{(i)}_m(v) \sum_{u \in [g_u]} w^{(i)}(r_{n,m} \mid u, v)\, f^{(i)}_n(u) }{ \sum_{v' \in [g_v]} \sum_{n \in \mathcal{U}_m} h^{(i)}_m(v') \sum_{u \in [g_u]} w^{(i)}(r_{n,m} \mid u, v')\, f^{(i)}_n(u) }$$
$$w^{(i+1)}(r \mid u, v) = \frac{ \sum_{(n,m):\, r_{n,m} = r} w^{(i)}(r_{n,m} \mid u, v)\, f^{(i+1)}_n(u)\, h^{(i+1)}_m(v) }{ \sum_{r \in \mathcal{R}} \sum_{(n,m):\, r_{n,m} = r} w^{(i)}(r_{n,m} \mid u, v)\, f^{(i+1)}_n(u)\, h^{(i+1)}_m(v) }$$
Step III: Output
$$\hat{p}^{(i+1)}_{R_{nm} \mid R_O}(r) = \frac{ \sum_{u,v} f^{(i+1)}_n(u)\, h^{(i+1)}_m(v)\, w^{(i+1)}(r \mid u, v) }{ \sum_{r \in \mathcal{R}} \sum_{u,v} f^{(i+1)}_n(u)\, h^{(i+1)}_m(v)\, w^{(i+1)}(r \mid u, v) }, \qquad \hat{p}^{(i+1)}_{U_n \mid R_O}(u) = f^{(i+1)}_n(u), \qquad \hat{p}^{(i+1)}_{V_m \mid R_O}(v) = h^{(i+1)}_m(v).$$
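For comparison, here is a compact sketch (ours, not the paper's code) of the textbook variational-EM flavor of this scheme: accumulate the variational posteriors $\hat{p}(u, v \mid r)$, then renormalize each factor. It normalizes per node, which is a simplification rather than a verbatim transcription of Algorithm 2, and all sizes and data below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
g_u, g_v, n_ratings = 2, 2, 5
f = np.full((2, g_u), 1 / g_u)     # f[n, u]: per-user group estimates (2 toy users)
h = np.full((2, g_v), 1 / g_v)     # h[m, v]: per-movie group estimates (2 toy movies)
w = rng.dirichlet(np.ones(n_ratings), size=(g_u, g_v))   # w[u, v, r]

obs = {(0, 0): 3, (0, 1): 0, (1, 0): 2}    # (user, movie) -> rating index

def posterior(n, m, r):
    """Variational posterior p_hat(u, v | r) = w(r|u,v) f_n(u) h_m(v) / Z."""
    joint = w[:, :, r] * np.outer(f[n], h[m])
    return joint / joint.sum()

def em_step(f, h, w):
    """One EM round: E-step posteriors, then renormalized M-step factors."""
    f_new = np.zeros_like(f)
    h_new = np.zeros_like(h)
    w_new = np.zeros_like(w)
    for (n, m), r in obs.items():
        q = posterior(n, m, r)          # q[u, v]
        f_new[n] += q.sum(axis=1)       # marginal over movie groups
        h_new[m] += q.sum(axis=0)       # marginal over user groups
        w_new[:, :, r] += q
    f_new /= f_new.sum(axis=1, keepdims=True)
    h_new /= h_new.sum(axis=1, keepdims=True)
    w_new = (w_new + 1e-9) / (w_new + 1e-9).sum(axis=2, keepdims=True)
    return f_new, h_new, w_new

f, h, w = em_step(f, h, w)
```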
This learning algorithm, in fact, extends Thomas Hofmann's work and generalizes probabilistic matrix factorization (PMF) results [3], [15]. Its main drawback is that it is difficult to analyze, because the effects of initial conditions and local minima can be very complicated.

C. Prediction and Initialization

Since the primary goal is the prediction of hidden variables based on observed ratings, the learning algorithms focus on estimating the distribution of each hidden variable given the observed ratings.

Algorithm 3 (VDVQ Clustering Algorithm via GLA Splitting; shown only for users)
Step I: Initialization. Let $i = j = 0$ and let $c^{(0,0)}_m(0)$ be the average rating of movie $m$.
Step II: Splitting of critics. Set
$$c^{(i+1,j)}_m(u) = \begin{cases} c^{(i,j)}_m(u) & u = 0, \ldots, 2^i - 1 \\ c^{(i,j)}_m(u - 2^i) + z^{(i+1,j)}_m(u) & u = 2^i, \ldots, 2^{i+1} - 1 \end{cases}$$
where the $z^{(i+1,j)}_m(u)$ are i.i.d. random variables with small variance.
Step III: Recursive soft K-means clustering of the $c^{(i,j)}_m(u)$ for $j = 1, \ldots, J$.
1. Each training vector is assigned a soft degree of assignment $\pi_n(u)$ to each of the critics using
$$\pi^{(i,j)}_n(u) = \frac{ \exp\!\left( -\beta\, d\!\left( R_O, c^{(i,j)}_m(u) \right) \right) }{ \sum_{u' \in [g_u]} \exp\!\left( -\beta\, d\!\left( R_O, c^{(i,j)}_m(u') \right) \right) }, \quad \text{where } d\!\left( R_O, c^{(i,j)}_m(u) \right) = \sqrt{ \sum_{(n,m) \in O} \left( c^{(i,j)}_m(u) - r_{n,m} \right)^2 / |O| }$$
and $g_u = 2^{i+1}$.
2. Update all critics as
$$c^{(i,j+1)}_m(u) = \frac{ \sum_{n} \pi^{(i,j)}_n(u)\, r_{n,m} }{ \sum_{n} \pi^{(i,j)}_n(u) }.$$
Step IV: Repeat Steps II and III until the desired number of critics $g_u$ is obtained.
Step V: Estimate of $w(r \mid u, v)$. After clustering users and movies each into user and movie groups with the soft group memberships $\pi_n(u)$ and $\tilde{\pi}_m(v)$, compute the soft frequencies of ratings for each user/movie group pair as
$$w(r \mid u, v) = \frac{ \sum_{(n,m) \in O :\, R_{nm} = r} \pi_n(u)\, \tilde{\pi}_m(v) }{ \sum_{r \in \mathcal{R}} \sum_{(n,m) \in O :\, R_{nm} = r} \pi_n(u)\, \tilde{\pi}_m(v) }.$$
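A rough sketch (ours) of the GLA splitting and soft K-means steps of Algorithm 3 on a toy ratings matrix with missing entries; the inverse temperature `beta`, the data, and the iteration counts are arbitrary choices, and distances are taken only over the entries a user has actually rated.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy 4-user x 3-movie ratings matrix; missing entries encoded as NaN.
R = np.array([[5., 4., np.nan],
              [1., np.nan, 2.],
              [5., 5., np.nan],
              [np.nan, 1., 1.]])
beta = 1.0

def split(critics):
    """GLA splitting (Step II): double the codebook with small perturbations."""
    noise = 0.01 * rng.standard_normal(critics.shape)
    return np.vstack([critics, critics + noise])

def soft_assign(R, critics):
    """Soft memberships pi_n(u) from distances over shared (observed) entries."""
    n_users, n_critics = R.shape[0], critics.shape[0]
    d = np.zeros((n_users, n_critics))
    for n in range(n_users):
        seen = ~np.isnan(R[n])
        diff = critics[:, seen] - R[n, seen]
        d[n] = np.sqrt((diff ** 2).mean(axis=1))
    logits = -beta * d
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return pi / pi.sum(axis=1, keepdims=True)

def update(R, pi):
    """Centroid rule on shared entries: weighted average of observed ratings."""
    mask = (~np.isnan(R)).astype(float)
    filled = np.where(mask > 0, np.nan_to_num(R), 0.0)
    num = pi.T @ filled              # (critics, movies)
    den = pi.T @ mask                # effective weight per (critic, movie)
    return num / np.maximum(den, 1e-12)

critics = np.nanmean(R, axis=0, keepdims=True)   # Step I: one average critic
critics = split(critics)                          # now 2 critics
for _ in range(5):                                # Step III iterations
    pi = soft_assign(R, critics)
    critics = update(R, pi)
```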
In particular, the outputs of both algorithms (after $i$ iterations) are estimates of the distributions for $R_{nm}$, $U_n$, and $V_m$. They are denoted, respectively, $\hat{p}^{(i)}_{R_{nm} \mid R_O}(r)$, $\hat{p}^{(i)}_{U_n \mid R_O}(u)$, and $\hat{p}^{(i)}_{V_m \mid R_O}(v)$. Using these, one can minimize various types of prediction error. For example, minimizing the mean-squared prediction error results in the conditional mean estimate
$$\hat{r}^{(i)}_{n,m,1} = \sum_{r \in \mathcal{R}} r\, \hat{p}^{(i)}_{R_{nm} \mid R_O}(r).$$
Meanwhile, minimizing the classification error of users (and movies) into groups results in the maximum a posteriori (MAP) estimates
$$\hat{u}^{(i)}_n = \arg\max_u \hat{p}^{(i)}_{U_n \mid R_O}(u), \qquad \hat{v}^{(i)}_m = \arg\max_v \hat{p}^{(i)}_{V_m \mid R_O}(v).$$
Likewise, after $w^{(i)}(r \mid u, v)$ converges, we can also make hard decisions on groups via the MAP estimates first and then compute rating predictions by
$$\hat{r}^{(i)}_{n,m,2} = \sum_{r \in \mathcal{R}} r\, w^{(i)}\!\left( r \mid \hat{u}^{(i)}_n, \hat{v}^{(i)}_m \right).$$
While this should perform worse with exact inference, this may not be the case with approximate inference algorithms.

Both iterative learning algorithms require proper initial estimates of the initial group (user and movie) probabilities $f^{(0)}_n(u)$, $h^{(0)}_m(v)$ and the observation model $w^{(0)}(r \mid u, v)$, since randomized initialization often leads to local minima and poor performance. To cluster users (or movies), we employ a variable-dimension vector quantization (VDVQ) algorithm [10] and the standard codebook splitting approach known as the generalized Lloyd algorithm (GLA) to generate codebooks whose size is any power of 2. The VDVQ algorithm is essentially based on alternating minimization of the average distance between users (or movies) and codebooks (which contain no missing data) with the two optimality criteria, the nearest neighbor and centroid rules, applied only to the elements both vectors share.
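A toy numeric illustration (ours) of the two prediction rules $\hat{r}^{(i)}_{n,m,1}$ and $\hat{r}^{(i)}_{n,m,2}$; the distributions below are invented placeholders, not learned outputs.

```python
import numpy as np

ratings = np.arange(1, 6)                            # Netflix-style 1..5 scale
# Hypothetical learned outputs for one (user, movie) pair with 2x2 groups.
p_rating = np.array([0.05, 0.05, 0.2, 0.4, 0.3])     # p_hat(R_nm = r | R_O)
p_user = np.array([0.7, 0.3])                        # p_hat(U_n = u | R_O)
p_movie = np.array([0.2, 0.8])                       # p_hat(V_m = v | R_O)
w = np.array([[[0.5, 0.3, 0.1, 0.05, 0.05],          # w[u, v, r]
               [0.1, 0.2, 0.4, 0.2, 0.1]],
              [[0.1, 0.1, 0.2, 0.3, 0.3],
               [0.02, 0.08, 0.1, 0.3, 0.5]]])

# Conditional-mean estimate r_hat_1 (minimizes mean-squared error).
r_hat_1 = ratings @ p_rating                         # -> 3.85
# Hard MAP group decisions first, then predict from w (r_hat_2).
u_map, v_map = p_user.argmax(), p_movie.argmax()
r_hat_2 = ratings @ w[u_map, v_map]                  # -> 3.0
```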
The group probabilities are initialized by assuming that the VDVQ gives the "correct" group with probability $\epsilon = 0.9$ and spreading its errors uniformly across all other groups. In the case of users, one can think of Algorithm 3 as a "k-critics" algorithm which tries to design $k$ critics (i.e., people who have seen every movie) that cover the space of all user tastes; each user is given a soft "degree of assignment" (or soft group membership) to each of the critics, which can take values between 0 and 1.

IV. DENSITY EVOLUTION ANALYSIS

Density evolution (DE) is a well-known technique for analyzing probabilistic message-passing inference algorithms. It was originally developed to analyze belief-propagation decoding of error-correcting codes and was later extended to more general inference problems [13]. It works by tracking the distribution of the messages passed in the graph under the assumption that the local neighborhood of each node is a tree. While this assumption is not rigorous, it is motivated by the following lemma. We consider the factor graph for a randomly chosen instance of this problem. The key assumption is that the outgoing edges from each user node are attached to movie nodes via a random permutation. This is identical to the model used for irregular LDPC codes [12].

Lemma 1: Let $N_l(v)$ denote the depth-$l$ neighborhood (i.e., the induced subgraph including all nodes within $l$ steps of $v$) of an arbitrary user (or movie) node $v$. Let the problem size $N$ become unbounded with $M = \beta N$ for $\beta < 1$, maximum degree $d_N$, and depth-$l_N$ neighborhoods. One finds that if
$$\frac{(2 l_N + 1) \log d_N}{\log N} < 1 - \delta$$
for some $\delta > 0$ and all $N$, then the graph $N_l(v)$ is a tree w.h.p. for almost all $v$ as $N \to \infty$.

Proof: The proof follows from a careful treatment of standard tree-like neighborhood arguments, as in Appendix C.
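The sufficient condition of Lemma 1 is easy to check numerically; the problem sizes below are hypothetical and only illustrate how the condition trades off degree against neighborhood depth.

```python
import math

def tree_like(N, d_N, l_N, delta=0.1):
    """Sufficient condition of Lemma 1: (2*l_N + 1) * log(d_N) / log(N) < 1 - delta."""
    return (2 * l_N + 1) * math.log(d_N) / math.log(N) < 1 - delta

# Hypothetical numbers: logarithmic degrees keep shallow neighborhoods tree-like,
# while large constant degrees with deeper neighborhoods violate the condition.
N = 10**6
assert tree_like(N, d_N=round(math.log(N)), l_N=1)
assert not tree_like(N, d_N=1000, l_N=2)
```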
For this problem, the messages passed during inference consist of belief functions for user groups (e.g., passed from movie nodes to user nodes) and movie groups (e.g., passed from user nodes to movie nodes). The message set for user belief functions is $\mathcal{M}_u = \mathcal{P}([g_u])$, where $\mathcal{P}(S)$ is the set of probability distributions over the finite set $S$. Likewise, the message set for movie belief functions is $\mathcal{M}_v = \mathcal{P}([g_v])$. The decoder combines $d$ user (resp. movie) belief functions $a_1(\cdot), \ldots, a_d(\cdot) \in \mathcal{M}_u$ (resp. $b_1(\cdot), \ldots, b_d(\cdot) \in \mathcal{M}_v$) using
$$F_d(a_1, r_1, \ldots, a_d, r_d; b) \triangleq \frac{ b(v) \prod_{j=1}^d \sum_u a_j(u)\, w(r_j \mid u, v) }{ \sum_v b(v) \prod_j \sum_u a_j(u)\, w(r_j \mid u, v) }$$
$$G_d(b_1, r_1, \ldots, b_d, r_d; a) \triangleq \frac{ a(u) \prod_{j=1}^d \sum_v b_j(v)\, w(r_j \mid u, v) }{ \sum_u a(u) \prod_j \sum_v b_j(v)\, w(r_j \mid u, v) }.$$
Since we need to consider the possibility that the ratings are generated by a process other than the assumed model, we must also keep track of the true user (or movie) group associated with each belief function. Let $\mu^{(i)}(u, A)$ (resp. $\nu^{(i)}(v, B)$) be the probability that, during the $i$-th iteration, a randomly chosen user (resp. movie) message comes from a node with true user group $u$ (resp. movie group $v$) and has a user belief function $a(\cdot) \in A \subseteq \mathcal{M}_u$ (resp. movie belief function $b(\cdot) \in B \subseteq \mathcal{M}_v$). The DE update equations for degree-$d$ user and movie nodes, in the spirit of [13], are
$$\mu^{(i+1)}_d(u, A) = \int \sum_{r_1, \ldots, r_d} I\!\left( G_d\big( (b_1, r_1), \ldots, (b_d, r_d); a \big) \in A \right) \mu^{(0)}(u, da) \prod_{j=1}^d \sum_v \nu^{(i)}(v, db_j)\, w(r_j \mid u, v) \tag{1}$$
$$\nu^{(i+1)}_d(v, B) = \int \sum_{r_1, \ldots, r_d} I\!\left( F_d\big( (a_1, r_1), \ldots, (a_d, r_d); b \big) \in B \right) \nu^{(0)}(v, db) \prod_{j=1}^d \sum_u \mu^{(i)}(u, da_j)\, w(r_j \mid u, v) \tag{2}$$
where $I(x \in A)$ is the indicator function
$$I(x \in A) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A. \end{cases}$$
Like LDPC codes, we expect the performance of Algorithm 1 to depend crucially on the degree structure of the factor graph. Therefore, we let $\Lambda_j$ (resp. $\Gamma_j$) be the fraction of user (resp. movie) nodes with degree $j$ and define the edge degree distributions to be $\lambda_j = \Lambda_j j / \sum_{k \ge 1} \Lambda_k k$ (resp. $\rho_j = \Gamma_j j / \sum_{k \ge 1} \Gamma_k k$). Averaging over the degree distribution gives the final update equations
$$\mu^{(i+1)}(u, A) = \sum_{d \ge 1} \lambda_d\, \mu^{(i+1)}_d(u, A), \qquad \nu^{(i+1)}(v, B) = \sum_{d \ge 1} \rho_d\, \nu^{(i+1)}_d(v, B).$$
We anticipate that this analysis will help us understand the IMP algorithm's observed performance on large problems, based on the success of DE for channel coding problems.

V. EXPERIMENTAL RESULTS

A. Details of Datasets and Training

The key challenge of the collaborative filtering problem is predicting the preference of a user for a given item, based only on very few known ratings, in a way that minimizes some per-letter metric $d(r, r')$ for ratings.

Figure 2. Remedy for the Cold-Start Problem: Each plot shows the RMSE on the validation set versus the average number of observations per user for the Netflix datasets (Dataset 1, left; Dataset 2, right). Performance is compared with three different matrix completion algorithms (OptSpace [21], SET [22], and SVT [23]) and an algorithm that uses the average rating for each movie as the prediction. For IMP and EM, the $\hat{r}^{(i)}_{n,m,1}$ prediction formula is used.

Figure 3. Each plot shows the RMSE on the validation set versus the average number of observations per user for the synthetic datasets (Dataset 1, left; Dataset 2, right). Performance is compared with an (analytical) lower bound on the RMSE assuming known user and movie groups.

To study this, we created two smaller datasets from the Netflix data by randomly subsampling user/movie/rating triples from the original Netflix dataset, which emphasizes the advantages of the MP scheme. This idea follows [15], [6].
• Netflix Dataset 1 is the matrix given by the first 5,000 movies and users. This matrix contains 280,714 user/movie pairs. Over 15% of the users and 43% of the movies have fewer than 3 ratings.
• Netflix Dataset 2 is a matrix of 5,035 movies and 5,017 users obtained by selecting some 5,300 movies and 7,250 users and avoiding movies and users with fewer than 3 ratings. This matrix contains 454,218 user/movie pairs. Over 16% of the users and 41% of the movies have fewer than 10 ratings.
To provide further insight into the quality of the proposed factor graph model and the suboptimality of the algorithms by comparison with theoretical lower bounds, we generated two synthetic datasets from the above partial matrices. The synthetic datasets are generated once with the learned densities $\hat{p}^{(i)}_{R_{nm} \mid R_O}(r)$, $\hat{p}^{(i)}_{U_n \mid R_O}(u)$, and $\hat{p}^{(i)}_{V_m \mid R_O}(v)$, and then randomly subsampled like the partial Netflix datasets.
• Synthetic Dataset 1 is generated after learning Netflix Dataset 1 with $g_u = g_v = 8$.
• Synthetic Dataset 2 is generated after learning Netflix Dataset 2 with $g_u = g_v = 16$.
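The synthetic-data generation procedure described above (draw i.i.d. hidden groups, then conditionally independent ratings, then subsample a sparse observed set) can be sketched as follows; this is our sketch, and the priors and observation model here are random placeholders rather than the learned Netflix densities.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, g_u, g_v, n_ratings = 100, 80, 8, 8, 5
p_U = rng.dirichlet(np.ones(g_u))                        # placeholder learned p_U
p_V = rng.dirichlet(np.ones(g_v))                        # placeholder learned p_V
w = rng.dirichlet(np.ones(n_ratings), size=(g_u, g_v))   # placeholder w(r|u,v)

# Draw i.i.d. hidden groups for every user and movie.
U = rng.choice(g_u, size=N, p=p_U)
V = rng.choice(g_v, size=M, p=p_V)

# Subsample a sparse observed set O of distinct (user, movie) pairs.
avg_obs_per_user = 10
n_obs = N * avg_obs_per_user
pairs = rng.choice(N * M, size=n_obs, replace=False)
O = [(i // M, i % M) for i in pairs]

# Ratings are conditionally independent given the hidden groups.
ratings = {(n, m): rng.choice(n_ratings, p=w[U[n], V[m]]) + 1 for (n, m) in O}
```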
Additionally, to evaluate the performance of different algorithms/models efficiently, we hide 1,000 randomly selected user/movie entries from each dataset as a validation set. Note that the choices of $g_u$ and $g_v$ used to obtain the synthetic datasets resulted in competitive performance on this validation set, but were not fully optimized. Simulations were performed on these partial datasets, with the average number of observed ratings per user varied between 1 and 30. The experimental results are shown in Figs. 2 and 3, and the performance is evaluated using the root mean squared error (RMSE) of prediction, defined by
$$\sqrt{ \sum_{(n,m) \in S} \left( \hat{r}_{n,m} - r_{n,m} \right)^2 / |S| }.$$

B. Results and Model Comparisons

While the IMP algorithm is not yet competitive on the entire Netflix dataset [9], it shows some promise for recommender systems based on MP frameworks. In particular, we have found that MP approaches really do improve the cold-start problem. Here is a summary of observations from the simulation study.
1) Improvement of the cold-start problem with MP algorithms: From the Fig. 2 results on partial Netflix datasets, we clearly see that while many methods perform similarly with large amounts of observed ratings, IMP is superior for small amounts of data. This better threshold performance of the IMP algorithm over the other algorithms does help reduce the cold-start problem, and provides strong support for using MP approaches in standard CF systems. Also, in simulations, we observe lower computational complexity for the IMP algorithm, even though we developed computationally efficient versions of our EM algorithm (see Appendix B).
2) Comparison with low-rank matrix models: Our factor graph model is a probabilistic generalization of other low-rank matrix models.
The similar asymptotic behavior (given enough measurements) between the partial Netflix and synthetic datasets suggests that the factor graph model is a good fit for this problem. Comparing with the results in [6] supports the claim that the Netflix dataset is well described by the factor graph model. Beyond these advantages, each output group has a generative nature with explicit semantics. In other words, after learning the densities, we can use them to generate synthetic data with clear meanings. These benefits do not extend easily to low-rank matrix models.

VI. CONCLUSIONS

For the Netflix problem, a number of researchers have used low-rank models that lead to learning methods based on SVD and principal component analysis (PCA) with a least-squares flavor. Unlike these prior works, in this paper we proposed a new factor graph model and successfully applied the MP framework to the problem. First, we presented the IMP algorithm and used simulations to show its superiority over other algorithms, focusing on the cold-start problem. Second, we studied the quality of the model by deriving the DE analysis and a generalization error bound, and complemented these theoretical analyses with simulation results for synthetic data generated from the learned model.

REFERENCES
[1] I. Csiszar and G. Tusnady, "Information Geometry and Alternating Minimization Procedures," Statistics & Decisions, Supplement Issue, 1:205–237, 1984.
[2] R. M. Neal and G. E. Hinton, "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants," in Learning in Graphical Models, pp. 355–368, 1998.
[3] T. Hofmann, "Probabilistic Latent Semantic Analysis," in Uncertainty in Artificial Intelligence, 1999.
[4] J. Lafferty and L. Wasserman, "Challenges in Statistical Machine Learning," Statistica Sinica, 16(2):307–323, 2006.
[5] ACM SIGKDD KDD Cup and Workshop 2007.
http://www.cs.uic.edu/~liub/KDD-cup-2007/proceedings.html
[6] R. Keshavan, A. Montanari, and S. Oh, "Learning Low Rank Matrices from O(n) Entries," in Proc. Allerton Conf. on Commun., Control and Computing, Monticello, IL, Sep. 2008.
[7] S. T. Aditya, O. Dabeer, and B. K. Dey, "A Channel Coding Perspective of Recommendation Systems," in Proc. 2009 IEEE Int'l Symp. Information Theory, Seoul, Korea, Jun. 2009.
[8] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, Draft, 2008.
[9] Netflix Prize website: http://www.netflixprize.com
[10] A. Das, A. V. Rao, and A. Gersho, "Variable-dimension Vector Quantization of Speech Spectra for Low-rate Vocoders," in Proc. Data Compression Conference, 1994.
[11] D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge, 2005.
[12] T. Richardson and R. Urbanke, "The Capacity of Low-density Parity-check Codes under Message-passing Decoding," IEEE Trans. Inform. Theory, vol. 47, pp. 599–618, Feb. 2001.
[13] A. Montanari, "Estimating Random Variables from Random Sparse Observations," Eur. Trans. Telecom., vol. 19, no. 4, pp. 385–403, Apr. 2008.
[14] S. Funk, "Netflix update: Try this at home," http://sifter.org/~simon/journal/20061211.html
[15] R. Salakhutdinov and A. Mnih, "Probabilistic Matrix Factorization," in Advances in Neural Information Processing Systems, 20, MIT, 2008.
[16] B.-H. Kim, "An Information-theoretic Approach to Collaborative Filtering," Technical Report, Texas A&M University, 2009.
[17] J. Yedidia, W. T. Freeman, and Y. Weiss, "Understanding Belief Propagation and Its Generalizations," in Advances in Neural Information Processing Systems, 13, MIT, 2001.
[18] N. Alon, "Tools from Higher Algebra," in Handbook of Combinatorics, North Holland, 1995.
[19] N. Srebro, N. Alon, and T.
Jaakkola, "Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices," in Advances in Neural Information Processing Systems, 17, 2005.
[20] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock, "Methods and Metrics for Cold-Start Recommendations," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 253–260, Aug. 2002.
[21] R. Keshavan, S. Oh, and A. Montanari, "Matrix completion from noisy entries," arXiv preprint cs.IT/0906.2027, 2009.
[22] W. Dai and O. Milenkovic, "SET: an algorithm for consistent matrix completion," arXiv preprint cs.IT/0909.2705, 2009.
[23] J. Cai, E. Candes, and Z. Shen, "A singular value thresholding algorithm for matrix completion," arXiv preprint math.OC/0810.3286, 2008.

APPENDIX A
PROOF OF THEOREM 1

This proof follows the generalization-error arguments of [19]. First, fix $Y$ as well as $X \in \mathbb{R}^{N \times M}$. When an index pair $(n, m)$ is chosen uniformly at random, $d(x_{n,m}, y_{n,m})$ is a Bernoulli random variable with probability $D(X, Y)$ of being one. If the entries of $O$ are chosen independently at random, then $|O| D_O(X, Y)$ is binomially distributed with parameters $|O|$ and $D(X, Y)$. Using Chernoff's inequality, we get
$$\Pr\bigl( D(X, Y) \ge D_O(X, Y) + \epsilon \bigr) = \Pr\bigl( |O| D_O(X, Y) \le |O| D(X, Y) - |O| \epsilon \bigr) \le e^{-2 |O| \epsilon^2}.$$
Now note that $d(x, y)$ depends only on the sign of $xy$, so it is enough to consider equivalence classes of matrices with the same sign patterns. Let $f(N, M, g_u, g_v)$ be the number of such equivalence classes. For all matrices in an equivalence class, the random variable $D_O(X, Y)$ is the same.
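The Chernoff step above can be checked numerically: the empirical disagreement rate $D_O$ is a mean of i.i.d. Bernoulli draws, and the deviation event should occur no more often than $e^{-2|O|\epsilon^2}$. A small Monte Carlo sketch (all parameter values are illustrative, not taken from the paper):

```python
import math
import random

def empirical_tail(D, n_obs, eps, trials, rng):
    """Estimate Pr(D >= D_hat + eps), where D_hat is the mean of
    n_obs i.i.d. Bernoulli(D) draws (the empirical disagreement
    rate D_O over |O| = n_obs observed entries)."""
    hits = 0
    for _ in range(trials):
        d_hat = sum(rng.random() < D for _ in range(n_obs)) / n_obs
        if D >= d_hat + eps:
            hits += 1
    return hits / trials

rng = random.Random(1)
D, n_obs, eps = 0.3, 500, 0.05
bound = math.exp(-2 * n_obs * eps ** 2)   # e^{-2 |O| eps^2}
freq = empirical_tail(D, n_obs, eps, trials=2000, rng=rng)
print(f"empirical tail {freq:.4f} <= Chernoff bound {bound:.4f}")
```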
Thus, taking a union bound of the events $\{ X \mid D(X, Y) \ge D_O(X, Y) + \epsilon \}$ for each of these $f(N, M, g_u, g_v)$ random variables, with the bound above and
$$\epsilon = \sqrt{\frac{\log f(N, M, g_u, g_v) - \log \delta}{2 |O|}},$$
we have
$$\Pr\left( \exists X \in \chi_{g_u, g_v} : D(X, Y) \ge D_O(X, Y) + \sqrt{\frac{\log f(N, M, g_u, g_v) - \log \delta}{2 |O|}} \right) \le \delta.$$
Since any matrix $X \in \chi_{g_u, g_v}$ can be written as $X = U^T G V$, to bound the number of sign patterns of $X$, $f(N, M, g_u, g_v)$, consider the $N g_u + M g_v + g_u g_v$ entries of $U$, $G$, $V$ as variables and the $NM$ entries of $X$ as polynomials of degree three over these variables:
$$x_{n,m} = \sum_{k=1}^{g_u} \sum_{l=1}^{g_v} u_{k,n} \, g_{k,l} \, v_{l,m}.$$
Using the bound in Lemma 2, we obtain
$$f(N, M, g_u, g_v) \le \left( \frac{4e \cdot 3 \cdot NM}{N g_u + M g_v + g_u g_v} \right)^{N g_u + M g_v + g_u g_v} \le \left( \frac{12 e M}{\min(g_u, g_v)} \right)^{N g_u + M g_v + g_u g_v}.$$
This bound yields a factor of $\log \frac{12 e M}{\min(g_u, g_v)}$ in the bound and establishes the theorem.

Lemma 2 ([18]): The total number of sign patterns of $r$ polynomials, each of degree at most $d$, over $q$ variables is at most $(8edr/q)^q$ if $2r > q > 2$. Also, the total number of sign patterns of $r$ polynomials with $\{\pm 1\}$ coordinates, each of degree at most $d$, over $q$ variables is at most $(4edr/q)^q$ if $r > q > 2$.

APPENDIX B
DERIVATION OF ALGORITHM 2

As the first step, we specify the complete data likelihood as
$$\Pr(R_{nm} = r_{n,m}, U_n = u_n, V_m = v_m) = w(r_{n,m} \mid u_n, v_m) \, f_n(u_n) \, h_m(v_m),$$
and the corresponding (negative) log-likelihood function can be written as
$$R_c(\theta) = -\log \prod_{(n,m) \in O} \Pr(R_{nm} = r_{n,m}, U_n = u_n, V_m = v_m) = -\sum_{(n,m) \in O} \bigl[ \log w(r_{n,m} \mid u_n, v_m) + \log f_n(u_n) + \log h_m(v_m) \bigr].$$
The variational EM algorithm now consists of two steps that are performed in alternation, with a $Q$ distribution used to approximate a general distribution.

A.
E-step

Since the states of the latent variables are not known, we introduce a variational probability distribution $Q_{U_n, V_m \mid R_{nm}}(u, v \mid r)$ subject to
$$\sum_{u,v} Q_{U_n, V_m \mid R_{nm}}(u, v \mid r) = 1$$
for all observed pairs $(n, m)$. Exploiting the concavity of the logarithm and using Jensen's inequality, we have
$$\begin{aligned}
R(\theta) &= -\sum_{(n,m) \in O} \log \sum_{u,v} \Pr(R_{nm} = r_{n,m}, U_n = u_n, V_m = v_m) \\
&= -\sum_{(n,m) \in O} \log \sum_{u,v} Q_{U_n, V_m \mid R_{nm}}(u, v \mid r) \, \frac{w(r_{n,m} \mid u, v) f_n(u) h_m(v)}{Q_{U_n, V_m \mid R_{nm}}(u, v \mid r)} \\
&\le -\sum_{(n,m) \in O} \sum_{u,v} Q_{U_n, V_m \mid R_{nm}}(u, v \mid r) \log \frac{w(r_{n,m} \mid u, v) f_n(u) h_m(v)}{Q_{U_n, V_m \mid R_{nm}}(u, v \mid r)} \\
&\triangleq \bar{R}(\theta \mid Q) - \sum_{(n,m) \in O} H\bigl( Q(\cdot, \cdot \mid r) \bigr) \triangleq R(\theta; Q).
\end{aligned}$$
To compute the tightest bound given parameters $\hat{\theta}$, we optimize the bound with respect to the $Q$'s using
$$\nabla_Q \left[ R(\theta; Q) + \sum_{(n,m) \in O} \sum_{u,v} \lambda_{u,v} Q \right] = 0.$$
This yields the posterior probabilities of the latent variables,
$$\hat{p}_{U_n, V_m \mid R_{nm}}(u, v \mid r; \hat{\theta}) = Q^*_{U_n, V_m \mid R_{nm}}\bigl( u, v \mid r; \hat{\theta} \bigr) = \frac{w(r_{n,m} \mid u, v) f_n(u) h_m(v)}{\sum_{u', v'} w(r_{n,m} \mid u', v') f_n(u') h_m(v')}.$$
Note that the same result follows from the Gibbs inequality:
$$R(\theta) \le -\sum_{(n,m) \in O} \sum_{u,v} Q_{U_n, V_m \mid R_{nm}}(u, v \mid r) \log \frac{w(r_{n,m} \mid u, v) f_n(u) h_m(v)}{Q_{U_n, V_m \mid R_{nm}}(u, v \mid r)},$$
where the gap in the inequality is $\sum_{(n,m) \in O} D\bigl( Q_{U_n, V_m \mid R_{nm}}(\cdot, \cdot \mid r_{n,m}) \,\|\, \hat{p}_{U_n, V_m \mid R_{nm}}(\cdot, \cdot \mid r_{n,m}) \bigr)$.

B. M-step

Obviously, the posterior probabilities need only be computed for the pairs $(n, m)$ that have actually been observed.
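The E-step posterior $Q^*$ is simply the product $w(r \mid u, v) f_n(u) h_m(v)$ normalized over $(u, v)$, so it can be computed per observed pair with a few array operations. A minimal numerical sketch for discrete latent alphabets (all array values and names are illustrative):

```python
import numpy as np

def e_step_posterior(w_r, f_n, h_m):
    """Posterior Q*(u, v | r) ∝ w(r | u, v) f_n(u) h_m(v) for one
    observed pair (n, m) with rating r.

    w_r : (|U|, |V|) array of w(r | u, v) at the observed rating r
    f_n : (|U|,) prior over user groups
    h_m : (|V|,) prior over movie groups
    """
    joint = w_r * np.outer(f_n, h_m)   # unnormalized w * f * h
    return joint / joint.sum()         # normalize over all (u, v)

# Toy example: 2 user groups, 3 movie groups
w_r = np.array([[0.9, 0.1, 0.5],
                [0.2, 0.7, 0.3]])
f_n = np.array([0.6, 0.4])
h_m = np.array([0.2, 0.5, 0.3])
Q = e_step_posterior(w_r, f_n, h_m)
print(Q)  # entries sum to 1 by construction
```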
Thus, we optimize
$$\begin{aligned}
\bar{R}\bigl( \theta, \hat{\theta} \bigr) &= -\sum_{(n,m) \in O} \sum_{u,v} Q^*_{U_n, V_m \mid R_{nm}}\bigl( u, v \mid r; \hat{\theta} \bigr) \bigl\{ \log w(r_{n,m} \mid u, v) + \log f_n(u) + \log h_m(v) \bigr\} \\
&= -\sum_{(n,m) \in O} \sum_{u,v} \frac{w(r_{n,m} \mid u, v) f_n(u) h_m(v)}{\sum_{u', v'} w(r_{n,m} \mid u', v') f_n(u') h_m(v')} \bigl\{ \log w(r_{n,m} \mid u, v) + \log f_n(u) + \log h_m(v) \bigr\}
\end{aligned}$$
with respect to the parameters $\theta$, which leads to three sets of update equations for $w(r \mid u, v)$, $f_n(u)$, and $h_m(v)$. Moreover, for large-scale problems, combining the E and M steps by plugging the $Q$ function directly into the M-step avoids the computational load of each separate step and gives a more tractable EM algorithm. The resulting update equations are given in Algorithm 2.

APPENDIX C
PROOF OF LEMMA 1

Starting from any node $v$, we can recursively grow $\mathcal{N}_{i+1}(v)$ from $\mathcal{N}_i(v)$ by adding all neighbors at distance $i+1$. Let $A_i$ be the number of outgoing edges from $\mathcal{N}_i(v)$ to the next level, and let $b_1^{(i)}, \ldots, b_{n_i}^{(i)}$ be the degrees of the $n_i$ available nodes that can be chosen in the next level. The probability that the graph remains a tree is
$$p\bigl( A_i, b^{(i)} \bigr) = \frac{\sum_{S \subset [n_i], |S| = A_i} \prod_{s \in S} b_s^{(i)}}{\binom{\sum_{j=1}^{n_i} b_j^{(i)}}{A_i}},$$
where the numerator is the number of ways that the $A_i$ edges can attach to distinct nodes in the next level and the denominator is the total number of ways that the $A_i$ edges may attach to the available nodes. Using the fact that the numerator is an unnormalized expected value of the product of $A_i$ $b$'s drawn without replacement, we can lower bound the numerator, with $b_i$ denoting the average degree $\frac{1}{n_i} \sum_j b_j^{(i)}$, by
$$\sum_{S \subset [n_i], |S| = A_i} \prod_{s \in S} b_s^{(i)} \ge \binom{n_i}{A_i} \left( b_i - \frac{(d-1) A_i}{n_i} \right)^{A_i} \ge \frac{(n_i - A_i)^{A_i}}{A_i!} \left( b_i - \frac{(d-1) A_i}{n_i} \right)^{A_i}.$$
This can be seen as lower bounding the expected value of $A_i$ $b$'s drawn with replacement from a distribution with a slightly lower mean. Upper bounding the denominator by $(n_i b_i)^{A_i} / A_i!$
gives
$$p\bigl( A_i, b^{(i)} \bigr) \ge \frac{(n_i - A_i)^{A_i}}{A_i!} \left( b_i - \frac{(d-1) A_i}{n_i} \right)^{A_i} \frac{A_i!}{(n_i b_i)^{A_i}} = \left( 1 - \frac{A_i}{n_i} \right)^{A_i} \left( 1 - \frac{(d-1) A_i}{b_i n_i} \right)^{A_i} \ge 1 - \frac{A_i^2}{n_i} - \frac{A_i^2 (d-1)}{b_i n_i}.$$
Now, we can take the product over $i = 0, \ldots, l-1$ to get
$$\begin{aligned}
\Pr\bigl( \mathcal{N}_l(v) \text{ is a tree} \bigr) &= \prod_{i=0}^{l-1} \Pr\bigl( \mathcal{N}_{i+1}(v) \text{ is a tree} \mid \mathcal{N}_0(v), \ldots, \mathcal{N}_i(v) \text{ are trees} \bigr) \\
&\ge \prod_{i=0}^{l-1} \left( 1 - \frac{A_i^2}{n_i} - \frac{A_i^2 (d-1)}{b_i n_i} \right) \ge 1 - \sum_{i=0}^{l-1} \left( \frac{A_i^2}{n_i} + \frac{A_i^2 (d-1)}{b_i n_i} \right) \\
&\ge 1 - \left( 1 + \frac{1}{d^2 - 1} \right) \left( \frac{d^{2l}}{\beta N - d^l} + \frac{d^{2l} (d-1)}{\beta N - d^l} \right) \ge 1 - \left( 1 + \frac{1}{d^2 - 1} \right) \frac{d^{2l+1}}{\beta N - d^l},
\end{aligned}$$
because $A_i \le d^{i+1}$, $\sum_{i=0}^{l-1} A_i^2 \le \frac{d^2 d^{2l}}{d^2 - 1} = d^{2l} \left( 1 + \frac{1}{d^2 - 1} \right)$, and $n_i \ge \beta N - \sum_{j=0}^{i} d^j \ge \beta N - d^{i+1}$. Examining the expression
$$\log \frac{d^{2l+1}}{\beta N - d^l} \le (2l + 1) \log d - \log N + O(1) \le -\delta \log N + O(1)$$
shows that the probability of failure is $O(N^{-\delta})$. Let $Z$ be a random variable whose value is the number of user nodes whose depth-$l$ neighborhood is not a tree. We can upper bound the expected value of $Z$ by
$$E[Z] \le \frac{d^{2l+1}}{\Theta(N) - d^l} \, N \le O\bigl( N^{-\delta} \bigr) N = O\bigl( N^{1-\delta} \bigr).$$
With Markov's inequality, one can show that
$$\Pr\bigl( Z \ge N^{1 - \delta/2} \bigr) \le \frac{E[Z]}{N^{1 - \delta/2}} \le \frac{O\bigl( N^{1-\delta} \bigr)}{N^{1 - \delta/2}} = O\bigl( N^{-\delta/2} \bigr).$$
Therefore, the depth-$l$ neighborhood is a tree (w.h.p. as $N \to \infty$) for all but a vanishing fraction of user nodes.
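The lemma's conclusion can be illustrated empirically: in a sparse random bipartite rating graph, almost every depth-$l$ user neighborhood is cycle-free. A minimal Monte Carlo sketch, with all graph parameters (N, M, k, depth) chosen for illustration rather than taken from the paper:

```python
import random
from collections import deque

def neighborhood_is_tree(adj, v, depth):
    """BFS from v up to `depth` levels; report False if the
    neighborhood contains a cycle (some node is reached twice)."""
    seen = {v}
    frontier = deque([(v, None, 0)])
    while frontier:
        node, parent, d = frontier.popleft()
        if d == depth:
            continue
        for nb in adj[node]:
            if nb == parent:
                continue          # skip the edge we came in on
            if nb in seen:
                return False      # second path to nb => cycle
            seen.add(nb)
            frontier.append((nb, node, d + 1))
    return True

def random_bipartite(N, M, k, rng):
    """N user nodes (0..N-1), M movie nodes (N..N+M-1);
    each user rates k distinct movies chosen uniformly."""
    adj = [[] for _ in range(N + M)]
    for u in range(N):
        for m in rng.sample(range(M), k):
            adj[u].append(N + m)
            adj[N + m].append(u)
    return adj

rng = random.Random(0)
N, M, k, depth = 2000, 2000, 3, 2
adj = random_bipartite(N, M, k, rng)
frac = sum(neighborhood_is_tree(adj, u, depth) for u in range(N)) / N
print(f"fraction of users with tree-like depth-{depth} neighborhoods: {frac:.3f}")
```

For fixed depth, the non-tree fraction shrinks as $N$ grows, consistent with the $O(N^{-\delta})$ failure probability derived above.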
