A Hybrid Latent Variable Neural Network Model for Item Recommendation
Michael R. Smith
Department of Computer Science, Brigham Young University, Provo, UT 84602
msmith@axon.cs.byu.edu

Tony Martinez
Department of Computer Science, Brigham Young University, Provo, UT 84602
martinez@cs.byu.edu

Michael Gashler
Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701
mgashler@uark.edu

Abstract

Collaborative filtering is used to recommend items to a user without requiring knowledge of the item itself and tends to outperform other techniques. However, collaborative filtering suffers from the cold-start problem, which occurs when an item has not yet been rated or a user has not rated any items. Incorporating additional information, such as item or user descriptions, into collaborative filtering can address the cold-start problem. In this paper, we present a neural network model with latent input variables (a latent neural network, or LNN) as a hybrid collaborative filtering technique that addresses the cold-start problem. LNN outperforms a broad selection of content-based filters (which make recommendations based on item descriptions) and other hybrid approaches while maintaining the accuracy of state-of-the-art collaborative filtering techniques.

1 Introduction

Modern technology enables users to access an abundance of information, and this deluge of data makes it difficult to sift through it all to find what is desired. This problem is of particular concern to companies that sell products (e.g., Amazon or Walmart) or recommend movies (e.g., Netflix). To lessen the severity of information overload, recommender systems help a user find what he or she is looking for. Two commonly used classes of recommender systems are content-based filters and collaborative filters.
Content-based filters (CBF) make recommendations based on item/user descriptions and users' ratings of the items. Creating item/user descriptions that are predictive of how a user will rate an item, however, is not a trivial process. Collaborative filtering (CF) techniques, on the other hand, use correlations between users' ratings to infer the ratings of unrated items for a user and make recommendations without having to understand the item or user itself. CF does not depend on item descriptions and tends to produce higher accuracies than CBF. However, CF suffers from the cold-start problem, which occurs when an item cannot be recommended unless it has been rated before (the first-rater problem) or when a user has not rated any items (the new-user problem). This is particularly important in domains where new items are frequently added and users are more interested in the new items. For example, many users are more interested in, and more likely to purchase, new styles of shoes rather than outdated styles, and many users prefer watching newly released movies to older ones. Recommending old items has the potential to drive away customers, and making inappropriate recommendations for new users who have not yet built a profile can also drive away users.

One approach to the cold-start problem is a hybrid recommender system that can leverage the advantages of multiple recommendation systems. Developing hybrid models is a significant research direction [4, 18, 12, 20, 6, 7, 13]. Many hybrid approaches combine a content-based filter with a collaborative filter through methods such as averaging the predicted ratings or combining the top recommendations from both techniques [2].
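The rating-averaging combination mentioned above can be sketched in a few lines. This is a minimal illustration; the function name and the weighting parameter are ours, not from any cited system:

```python
def hybrid_predict(cbf_rating, cf_rating, alpha=0.5):
    """Blend a content-based and a collaborative prediction for one
    user/item pair; alpha weights the CF side. Illustrative only."""
    return alpha * cf_rating + (1.0 - alpha) * cbf_rating

# CF predicts 4.0, CBF predicts 3.0; equal weighting gives 3.5.
print(hybrid_predict(3.0, 4.0))  # 3.5
```

Switching schemes simply replace the fixed blend with a context-dependent choice of one predictor or the other.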
In this paper, we present a neural network model with latent input variables (a latent neural network, or LNN) as a hybrid recommendation algorithm that addresses the cold-start problem. LNN uses a matrix of item ratings and item/user descriptions to simultaneously train the weights of a neural network and induce a set of latent input variables for matrix factorization. Using a neural network allows for flexible architecture configurations to model higher-order dependencies in the data.

LNN is based on the idea of generative backpropagation (GenBP) [9] and expands upon unsupervised backpropagation (UBP) [8]. Both GenBP and UBP are neural network methods that induce a set of latent input variables. The latent input variables form an internal representation of the observed values; when the latent input variables are fewer than the observed variables, both methods act as dimensionality reduction techniques. GenBP adjusts its latent inputs while holding the network weights constant. It has been used to generate labels for images [5] and for natural language [1]. UBP differs from GenBP in that it trains the network weights simultaneously with the latent inputs, instead of training the weights in a pre-processing step. LNN is a further development of UBP that incorporates input features among the latent input variables. By incorporating user/item descriptions as input features, LNN is able to address the cold-start problem. We find that LNN outperforms other content-based filters and hybrid filters on the cold-start problem. Additionally, LNN outperforms its predecessor (UBP) and maintains an accuracy similar to matrix factorization (which cannot handle the cold-start problem) on non-cold-start recommendations.
2 Related Work

Matrix factorization (MF) has become a popular technique, in part due to its effectiveness on the data used in the Netflix competition [10, 16], and is widely considered a state-of-the-art recommendation technique. MF is a linear dimensionality reduction technique that factors the rating matrix into two much smaller matrices; these smaller matrices can then be combined to predict all of the missing ratings in the original matrix. It was previously shown that MF can be represented as a neural network model with one hidden layer and linear activation functions [21]. By using non-linear activation functions, unsupervised backpropagation (UBP) may be viewed as a non-linear generalization of MF. UBP is related to non-linear PCA (NLPCA), which has been used as a means of imputing missing values (a task similar to recommending items) [19]. UBP uses three phases of training: it initializes the latent variables, then the weights of the model, and then updates the weights and latent variables simultaneously. LNN further builds on UBP and NLPCA by integrating item or user descriptions with the latent input variables.

Pure collaborative filtering (CF) techniques are not able to handle the cold-start problem for items or users. As a result, several hybrid methods have been developed that incorporate item and/or user descriptions into collaborative filtering approaches. The most common, as surveyed by Burke [2], involves using separate CBF and CF techniques and then combining their outputs (e.g., taking a weighted average, combining the output from both techniques, or switching depending on the context) or using the output from one technique as input to another.
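The factorization view of MF described above can be sketched in a few lines; the dimensions and variable names below are illustrative, not from any cited implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2                       # items, users, latent dimension
item_factors = rng.normal(size=(m, k))
user_factors = rng.normal(size=(k, n))

# MF approximates the m x n rating matrix as the product of two much
# smaller matrices; each missing rating is predicted by one dot product.
# With linear activations this is exactly a one-hidden-layer network.
X_hat = item_factors @ user_factors
assert X_hat.shape == (m, n)
```

Replacing the linear map with non-linear activations is what turns this picture into UBP/NLPCA.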
Content-boosted collaborative filtering [14] uses CBF to fill in the missing values in the ratings matrix; the resulting dense ratings matrix is then passed to a collaborative filtering method (in their implementation, a neighbor-based CF). Other work addresses the cold-start problem by building user/item descriptions for later use in a recommendation system [22].

3 Latent Neural Network

In this section, we formally describe latent neural networks (LNN). At a high level, an LNN is a neural network with latent input variables induced using generative backpropagation. Put simply, generative backpropagation calculates the gradient of the error with respect to the latent inputs and updates them in a manner similar to how the weights are updated in the backpropagation algorithm.

3.1 Preliminaries

In order to formally describe LNNs, we define the following terms.

• Let $X$ be a given $m \times n$ sparse user/item rating matrix, where $m$ is the number of items and $n$ is the number of users.
• Let $A$ be an $m \times a$ matrix representing the given portion of the item profiles.
• Let $V$ be an $m \times t$ matrix representing the latent portion of the item profiles.
• If $x_{rc}$ is the rating for item $r$ by user $c$ in $X$, then $\hat{x}_{rc}$ is the predicted rating when $a_r \in A$ and $v_r \in V$ are concatenated into a single vector $q_r$ and fed forward into the LNN.
• Let $w_{ij}$ be the weight that feeds from unit $i$ to unit $j$ in the LNN.
• For each network unit $i$ on hidden layer $j$, let $\beta^j_i$ be the net input into the unit, $\alpha^j_i$ be the output or activation value of the unit, and $\delta^j_i$ be an error term associated with the unit.
• Let $l$ be the number of hidden layers in the LNN.
• Let $g$ be a vector representing the gradient with respect to the weights of the LNN, such that $g_{ij}$ is the component of the gradient used to refine $w_{ij}$.
• Let $h$ be a vector representing the gradient with respect to the latent inputs of the LNN, such that $h_i$ is the component of the gradient used to refine $v_{ri} \in v_r$.

We use item descriptions, but user descriptions could easily be used instead by transposing $X$ and substituting user descriptions for item descriptions.

As using generative backpropagation to compute the gradient with respect to the latent inputs, $h$, is less common, we provide a derivation of it here. We compute each $h_i \in h$ from the presentation of a single element $x_{rc} \in X$, since we assume that $X$ is typically high-dimensional and sparse; it is significantly more efficient to train with the presentation of each known element individually. We begin by defining an error signal for an individual element, $E_{rc} = (x_{rc} - \hat{x}_{rc})^2$, and then express the gradient as the partial derivative of this error signal with respect to each latent input (the non-latent inputs in $A$ do not change):

$$h_i = \frac{\partial E_{rc}}{\partial v_{ri}}. \qquad (1)$$

The latent input $v_{ri}$ affects the value of $E_{rc}$ through the net value of a unit ($\beta^j_i$) and further through the output of a unit ($\alpha^j_i$). Using the chain rule, Equation 1 becomes:

$$h_i = \frac{\partial E_{rc}}{\partial \alpha^0_c} \frac{\partial \alpha^0_c}{\partial \beta^0_c} \frac{\partial \beta^0_c}{\partial v_{ri}}, \qquad (2)$$

where $\alpha^0_c$ and $\beta^0_c$ represent, respectively, the output value and the net input value of output node $c$ (the 0th layer). The backpropagation algorithm calculates $\frac{\partial E_{rc}}{\partial \alpha^0_c} \frac{\partial \alpha^0_c}{\partial \beta^0_c}$ (which is $\frac{\partial E_{rc}}{\partial \beta^j_i}$ for a network unit) as the error term $\delta^j_i$ associated with that unit. Thus, to calculate $h_i$, the only calculation needed beyond the backpropagation algorithm is $\frac{\partial \beta^0_c}{\partial v_{ri}}$. For a single-layer perceptron (0 hidden layers):

$$\frac{\partial \beta^0_c}{\partial v_{ri}} = \frac{\partial}{\partial v_{ri}} \sum_t w_{tc} v_{rt},$$

which is non-zero only when $t$ equals $i$ and is then equal to $w_{ic}$, since the error is calculated with respect to a single element in $X$.
When there are no hidden layers ($l = 0$), using the error from a single element $x_{rc}$:

$$h_i = -w_{ic}\,\delta_c. \qquad (3)$$

If there is at least one hidden layer ($l > 0$), then

$$\frac{\partial \beta^0_c}{\partial v_{ri}} = \frac{\partial \beta^0_c}{\partial \alpha^1} \frac{\partial \alpha^1}{\partial \beta^1} \cdots \frac{\partial \alpha^l}{\partial \beta^l} \frac{\partial \beta^l}{\partial v_{ri}},$$

where $\alpha^k$ and $\beta^k$ are vectors representing the output values and the net values of the units in the $k$th hidden layer. As part of the error term for the units in the $l$th layer, backpropagation calculates $\frac{\partial \beta^0_c}{\partial \alpha^1} \frac{\partial \alpha^1}{\partial \beta^1} \cdots \frac{\partial \alpha^l}{\partial \beta^l}$ as the error terms associated with the network units. Thus, the only additional calculation for $h_i$ is

$$\frac{\partial \beta^l_j}{\partial v_{ri}} = \frac{\partial}{\partial v_{ri}} \sum_t w_{tj} v_{rt}.$$

As before, $\frac{\partial \beta^l_j}{\partial v_{ri}}$ is non-zero only when $t$ equals $i$. For networks with at least one hidden layer:

$$h_i = -\sum_j w_{ij}\,\delta_j. \qquad (4)$$

Equation 4 is a strict generalization of Equation 3: Equation 3 considers only the one output unit, $c$, for which a known target value is presented, whereas Equation 4 sums over each unit $j$ into which the latent value $v_{ri}$ feeds.

Algorithm 1 LNN(A, X, η′, η″, γ, λ)
1: Initialize each element in V with small random values
2: Let T be the weights of a single-layer perceptron
3: Initialize each element in T with small random values
4: η ← η′; s′ ← ∞
5: while η > η″ do
6:   s ← train_epoch(A, X, T, λ, true, 0)
7:   if 1 − s/s′ < γ then η ← η/2
8:   s′ ← s
9: end while
10: Let W be the weights of a multi-layer perceptron with l hidden layers, l ≥ 0
11: Initialize each element in W with small random values
12: η ← η′; s′ ← ∞
13: while η > η″ do
14:   s ← train_epoch(A, X, W, λ, false, l)
15:   if 1 − s/s′ < γ then η ← η/2
16:   s′ ← s
17: end while
18: η ← η′; s′ ← ∞
19: while η > η″ do
20:   s ← train_epoch(A, X, W, 0, true, l)
21:   if 1 − s/s′ < γ then η ← η/2
22:   s′ ← s
23: end while
24: return {V, W}
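Equation 3 can be checked numerically for a single-layer linear network. The sketch below uses the common convention of scaling the error by 1/2, i.e. E = ½(x − x̂)², so that h_i = −w_ic δ_c matches the finite-difference gradient exactly; all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 5                          # number of latent inputs
w = rng.normal(size=t)         # weights w_ic into the single output unit c
v = rng.normal(size=t)         # latent input vector v_r
x = 0.7                        # observed rating x_rc

# Single-layer linear net: x_hat = sum_t w_tc * v_rt. With the 1/2
# convention, delta_c = x - x_hat and h_i = -w_ic * delta_c (Equation 3).
delta_c = x - w @ v
h = -w * delta_c               # analytic gradient w.r.t. each latent input

# Central finite-difference check of the same gradient.
eps = 1e-6
for i in range(t):
    vp, vm = v.copy(), v.copy()
    vp[i] += eps
    vm[i] -= eps
    fd = (0.5 * (x - w @ vp) ** 2 - 0.5 * (x - w @ vm) ** 2) / (2 * eps)
    assert abs(fd - h[i]) < 1e-6
```

The same check extends to Equation 4 by forward-propagating through hidden layers before differencing.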
3.2 Three-Phase Training

To integrate generative backpropagation into the training process, LNN uses three phases to train V and W: 1) the first phase computes an initial estimate for the latent vectors, V; 2) the second phase computes an initial estimate for the network weights, W; and 3) the third phase refines them both together. All three phases train using stochastic gradient descent.

In phase 1, the latent vectors are induced while there are no hidden layers to form nonlinear separations among them. Likewise, phase 2 gives the weights a chance to converge without having to train against moving inputs. These two preprocessing phases initialize the system (consisting of both latent vectors and weights) to a good starting point, such that gradient descent is more likely to find a local optimum of higher quality. Empirical results show that three-phase training produces more accurate results than single-phase training, which only refines V and W together (see [8]).

Pseudo-code for the LNN algorithm, which trains V and W in three phases, is given in Algorithm 1. LNN calls the train_epoch function (shown in Algorithm 2), which performs a single epoch of training. A detailed description of LNN follows.
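The per-phase convergence loop of Algorithm 1 (halve η whenever the relative improvement 1 − s/s′ falls below γ, and stop once η drops below η″) can be sketched as follows; the function and parameter names are ours:

```python
def train_until_converged(run_epoch, eta_start=0.1, eta_stop=1e-4, gamma=1e-3):
    """Per-phase loop of Algorithm 1: run epochs, halve the learning rate
    when the relative improvement 1 - s/s' falls below gamma, and stop
    once the rate drops below eta_stop. Names are illustrative."""
    eta, s_prev = eta_start, float("inf")
    while eta > eta_stop:
        s = run_epoch(eta)
        if 1.0 - s / s_prev < gamma:   # too little improvement: decay eta
            eta /= 2.0
        s_prev = s
    return s_prev

# Toy epoch whose error shrinks geometrically, then plateaus at 0.5.
state = {"err": 2.0}
def fake_epoch(eta):
    state["err"] = max(0.5, state["err"] * 0.9)
    return state["err"]

final = train_until_converged(fake_epoch)
assert final == 0.5
```

Once the error plateaus, the improvement test fails every epoch, so η is halved repeatedly until it crosses η″ and the phase terminates.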
Algorithm 2 train_epoch(A, X, W, λ, p, l)
1: for each known x_rc ∈ X in random order do
2:   q_r ← (v_r, a_r)
3:   Compute α_c by forward-propagating q_r into a multilayer perceptron with weights W
4:   δ_c ← (x_rc − α_c) f′(β_c)
5:   for each hidden unit i feeding into output unit c do
6:     δ_i ← w_ic δ_c f′(β_i)
7:   end for
8:   for each hidden unit j in an earlier hidden layer (in backward order) do
9:     δ_j ← (Σ_k w_jk δ_k) f′(β_j)
10:   end for
11:   for each w_ij ∈ W do
12:     g_ij ← −δ_j α_i
13:   end for
14:   W ← W − η(g + λW)
15:   if p = true then
16:     for i from 0 to t − 1 do
17:       if l = 0 then h_i ← −w_ic δ_c else h_i ← −Σ_j w_ij δ_j
18:     end for
19:     v_r ← v_r − η(h + λ v_r)
20:   end if
21: end for
22: s ← measure RMSE with X
23: return s

Matrices containing the known data values, X, and the item descriptions, A, are passed into LNN along with the parameters η′, η″, γ, and λ (defined below). LNN returns V and W. W is a set, or ragged matrix, containing weight values for an MLP that maps from each v_i to an approximation of x_i ∈ X.

Lines 1-9 perform the first phase of training, which computes an initial estimate for V. Lines 1-4 initialize the model variables: T represents the weights of a single-layer perceptron, and the elements in T and V are initialized with small random values. Our implementation draws values from a normal distribution with a mean of 0 and a deviation of 0.01. The single-layer perceptron is a temporary model used only in phase 1 for the initial training of V. η is the learning rate, and s′ stores the previous error score; as no error has been measured yet, it is initialized to ∞. Lines 5-9 train V and T until convergence is detected; T may then be discarded. We note that many techniques could be used to detect convergence. Our implementation decays the learning rate whenever predictions fail to improve by a sufficient amount.
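A minimal sketch of train_epoch for the 0-hidden-layer case with identity activations follows. NaN marks missing ratings; the names and the tiny data set are ours, not the paper's implementation:

```python
import numpy as np

def train_epoch(A, X, W, V, eta=0.05, lam=0.01, update_latent=True):
    """One SGD epoch over the known entries of X (NaN = missing).

    Sketch of Algorithm 2 with l = 0 and identity activations: A holds the
    item descriptions, V the latent item inputs, and W maps the
    concatenated profile q_r = (v_r, a_r) to the user ratings.
    """
    t = V.shape[1]
    rows, cols = np.where(~np.isnan(X))
    order = np.random.default_rng(0).permutation(len(rows))
    for idx in order:
        r, c = rows[idx], cols[idx]
        q = np.concatenate([V[r], A[r]])               # line 2 of Algorithm 2
        delta_c = X[r, c] - q @ W[:, c]                # error term, f' = 1
        W[:, c] += eta * (delta_c * q - lam * W[:, c]) # W <- W - eta(g + lam*W)
        if update_latent:                              # phases 1 and 3 only
            h = -W[:t, c] * delta_c                    # Equation 3 (l = 0)
            V[r] -= eta * (h + lam * V[r])
    err = X - np.concatenate([V, A], axis=1) @ W
    return np.sqrt(np.nanmean(err ** 2))               # RMSE over known entries

# Tiny usage check: the RMSE on the known entries should fall across epochs.
rng = np.random.default_rng(2)
m, n, t, a = 6, 4, 2, 3
A = rng.normal(size=(m, a))
X = rng.normal(size=(m, n))
X[0, 0] = np.nan                                       # one missing rating
V = 0.01 * rng.normal(size=(m, t))
W = 0.01 * rng.normal(size=(t + a, n))
errs = [train_epoch(A, X, W, V) for _ in range(50)]
assert errs[-1] < errs[0]
```

Passing update_latent=False reproduces phase 2, where V is held constant while the weights converge.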
Convergence is detected when the learning rate η falls below η″. γ specifies the amount of improvement that is expected after each epoch before the learning rate is decayed, and λ is the regularization term used in train_epoch.

Lines 10-17 perform the second phase of training. This phase differs from the first phase in two ways: 1) a multilayer perceptron is used instead of a temporary single-layer perceptron, and 2) V is held constant during this phase. Lines 18-23 perform the third phase of training. In this phase, the same multilayer perceptron used in phase 2 is used again, but V and W are both refined together. Also, no regularization is used in the third phase.

3.3 Stochastic Gradient Descent

For completeness, we describe train_epoch, given in Algorithm 2, which performs a single epoch of training by stochastic gradient descent. This algorithm is very similar to an epoch of traditional backpropagation, except that it presents each element individually instead of presenting each vector, and it conditionally refines the latent variables, V, as well as the weights, W.

Line 1 presents each known element x_rc ∈ X in random order. Line 2 concatenates v_r with the corresponding item description a_r. Line 3 computes a predicted value for the presented element given the current v_r. Note that efficient implementations of line 3 should only propagate values into output unit c. Lines 4-10 compute an error term for output unit c and each hidden unit in the network; the activation of the other output units is not computed, so the error on those units is 0.

Table 1: The MAE of the investigated recommendation systems on the validation set, the test set, and 10-fold cross-validation.

            CBCF    CBF     LNN     LNN_3PT  MF      NLPCA   UBP
Validation  0.7709  0.8781  0.5885  0.5877   0.5886  0.6058  0.5942
Test        0.7767  0.8831  0.5795  0.5810   0.5779  0.5971  0.5942
10CV        0.7754  0.8695  0.5781  0.5778   0.5760  0.5915  0.5915
Lines 11-14 refine W by gradient descent. Line 15 specifies that V should only be refined during phases 1 and 3. Lines 16-19 refine V by gradient descent. Line 22 computes the root-mean-squared error of the MLP over each known element in X.

4 Experimental Results

In this section we present the results of our experiments. We examine LNN using the MovieLens data set (http://www.grouplens.org) from the HetRec 2011 workshop [3]. We use this data set because it provides descriptions for the movies in addition to the ratings matrix. Few data sets provide user/item descriptions in addition to the ratings matrix (e.g., the Netflix data only contains user ratings). Some data sets provide unstructured data, such as Twitter information or a set of friends on last.fm, from which input variables could be created. As this paper focuses on the performance of LNN rather than feature creation from unstructured data, we chose to use the MovieLens data set. Also, running state-of-the-art recommendation systems can take a long time: it was reported that running Bayesian probabilistic MF took 188 hours on the Netflix data [17]. Using a smaller data set allows for a more extensive evaluation and facilitates cross-validation.

The MovieLens data set contains 2113 users and 10197 movies with 855598 ratings. On average, there are 405 ratings per user and 84 ratings per movie. For item descriptions, we use the genre(s) of each movie as a set of binary variables indicating whether a movie belongs to each of the 19 genres.

We use LNN with and without three-phase training; the latter is equivalent to a hybrid UBP and hybrid NLPCA technique. LNN with three-phase training is denoted LNN_3PT.
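The binary genre encoding described above can be sketched as follows, using an illustrative genre subset rather than the data set's full 19-genre vocabulary:

```python
GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]  # illustrative subset

def genre_vector(movie_genres, genres=GENRES):
    """Binary item description a_r: one 0/1 indicator per genre."""
    return [1 if g in movie_genres else 0 for g in genres]

print(genre_vector({"Comedy", "Romance"}))  # [0, 1, 0, 0, 1]
```

Each movie's indicator vector serves as the fixed portion a_r of its item profile.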
We compare LNN with several other recommendation systems: 1) content-boosted collaborative filtering (CBCF), 2) content-based filtering (CBF), 3) nonlinear principal component analysis (NLPCA), 4) unsupervised backpropagation (UBP), and 5) matrix factorization (MF). For each recommendation system, we test several parameter settings. CBF uses a single learning algorithm to learn the rating preferences of a user; we experiment with naïve Bayes (as is commonly used [14]), linear regression, a decision tree, and a neural network trained with backpropagation. The same learning algorithms are also used for CBCF, and the number of neighbors ranges from 1 to 64. For MF, the number of latent variables ranges from 2 to 32 and the regularization term from 0.001 to 0.1. In addition to the values used for MF for the number of latent variables and the regularization term, the number of nodes in the hidden layer ranges from 0 to 32 for UBP, NLPCA, LNN, and LNN_3PT.

For each experiment, we randomly select 20% of the ratings as a test set and use 10% of the training set as a validation set for parameter selection. Using the selected parameters, we evaluate on the test set and with 10-fold cross-validation.

4.1 Results

The results comparing LNN with the other recommendation approaches are shown in Table 1. We report the mean absolute error (MAE) for each approach; the bold values represent the lowest means within 0.002. The algorithms that use latent variables achieve significantly lower error than those that do not (CBCF and CBF), demonstrating the predictive power of latent variables for item recommendation. Latent inputs also allow one to bypass feature engineering, often a difficult process.
Algorithm 3 new_item_prediction(a_newItem)
1: Let counts be a map containing the count of how many times each rating was predicted
2: Initialize each element in counts to 0
3: numNeighbors ← 100; distThresh ← 0
4: neighbors ← getNeighbors(a_newItem, numNeighbors)
5: for i from 0 to numNeighbors − 1 do
6:   numRatings ← count of ratings for neighbors[i]
7:   if numRatings > 50 && distance(neighbors[i]) ≤ distThresh then
8:     q_new ← (v_neighbors[i], a_newItem)
9:     prediction ← rounded prediction of q_new
10:    counts[prediction] += numRatings
11:  end if
12: end for
13: return maxIndex(counts)

The addition of the item descriptions to NLPCA and UBP (yielding LNN and LNN_3PT) improves performance compared to using only the latent variables. The performance of LNN and LNN_3PT is similar to that of matrix factorization, which is widely considered state-of-the-art in recommendation systems when comparing MAE. The power of LNN and LNN_3PT shows when faced with the cold-start problem, which we address in the following section. As discussed previously, MF and other pure collaborative filtering techniques are not able to address the cold-start problem, despite performing very well on items that have previously been rated a certain number of times. (They also suffer from the gray-sheep problem, which occurs when an item has been rated only a small number of times.) LNN and LNN_3PT are capable of addressing the cold-start problem while still obtaining performance similar to matrix factorization.

4.2 Cold-Start Problem

To examine the cold-start problem, we remove the ratings for the top 10 most-rated movies, individually and collectively. The number of removed ratings for a single movie ranged from 1263 to 1670, and 15,131 ratings were removed for all of the top 10.
The recommendation systems were trained on the remaining ratings using the parameter settings found in the previous experiments. For LNN, predicting a new item poses an additional challenge, since the latent variables for the new items have not been induced. We find that using values of 0 for the latent inputs often produced worse results than CBF. A CBF creates a model for each user based on item descriptions and corresponding user ratings. LNN, on the other hand, produces a single model, which is beneficial when using all of the ratings because the mutual information between users and items can be shared. The shared information is contained in the latent variables, and the quality of the latent variables depends on the number of ratings that a user has given and/or an item has received.

To compensate for the lack of latent variables for the new items, we use the new_item_prediction function, which takes a vector a_newItem representing the description of the new item and is outlined in Algorithm 3. At a high level, new_item_prediction uses a_newItem to find its nearest neighbors. The induced latent input variables for each neighbor are concatenated with a_newItem and fed into a trained LNN to predict a rating for the new item. The weighted mode of the predicted ratings is then returned. The rating from each neighbor is weighted according to how many times that neighbor has been rated: when selecting the mode from the set of predictions, a predicted rating is added r times to the set, where r is the number of times the neighbor item has been rated. We chose the mode rather than the mean because the mode is more robust to outliers and achieved better empirical results on the validation sets in our experimentation. We next describe new_item_prediction in more detail.
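The weighted-mode aggregation just described can be sketched as follows; the helper name is ours, not from the paper's code:

```python
from collections import Counter

def weighted_mode(predictions, weights):
    """Weighted mode of the neighbors' rounded predictions: each prediction
    counts once per rating its neighbor item has received. Illustrative
    helper, not the paper's implementation."""
    counts = Counter()
    for rating, num_ratings in zip(predictions, weights):
        counts[rating] += num_ratings
    return counts.most_common(1)[0][0]

# A heavily rated neighbor predicting 4 outweighs two sparse neighbors.
print(weighted_mode([4, 2, 2], [300, 60, 80]))  # 4
```

An unweighted mode would return 2 here; the weighting lets well-established neighbors dominate.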
Lines 1-2 initialize a counter that keeps track of how many times each rating has been predicted for the new item, with all values set to 0. Line 3 initializes the number of nearest neighbors to search for to 100 and sets the distance threshold to 0. We chose 100 neighbors because it was generally more than enough to produce good results. As we use binary item descriptions of movie genres, we only consider the latent variables from items that have the same genre(s) (i.e., a distance of 0). These values come into play in line 7, where an item is not used if its distance is greater than distThresh (in this case 0) or if it has not been rated at least 50 times. The value of 50 was chosen based on the evaluation of a content-based predictor [15]. The number of times that an item has been rated helps to determine the quality of the induced latent variables for that item and provides a confidence level for the latent variables. Line 4 finds the closest neighbors and inserts their indexes into an array. Lines 5-10 count the number of times each rating is predicted, weighted by the number of times that the item has been rated.

Table 2: The MAE for the top 10 most-rated movies (individually and combined) when held out of the training set.

alg      2571   2858   2959   296    318    356    480    4993   5952   7153   top10
CBCF     0.889  0.898  0.875  0.742  0.929  0.760  0.720  0.755  1.053  0.981  0.896
CBF      0.957  0.905  0.920  0.870  0.965  0.866  0.766  0.790  1.121  1.041  0.972
LNN      1.175  0.689  0.894  0.666  0.789  0.593  0.552  0.558  0.577  0.523  0.859
LNN_3PT  1.189  0.690  0.906  0.680  0.810  0.595  0.541  0.587  0.566  0.521  0.847

Table 3: The time (in seconds) taken to run each algorithm.

          CBCF    CBF   LNN   LNN_3PT  MF   NLPCA  UBP
train     2278.2  9.1   43.4  60.2     4.8  5.8    5.8
Ave 10CV  2432.7  9.6   53.9  193.4    7.6  8.5    10.0
We use a linear weighting such that the prediction for an item that has been rated 100 times counts as 100 ratings of that predicted value. This helps to discount items that have only been rated a few times and whose latent variables may not be set to good values. Line 13 returns the index (rating) that has the maximum count (i.e., the mode).

The results for recommending new items using new_item_prediction are provided in Table 2. The values at the top of the table correspond to the movie ids in the MovieLens data set, and the bold values represent the lowest MAE obtained. No single recommendation system produces the lowest MAE for all of the items, suggesting that some recommendation systems are better than others for a given user and/or item, as has been suggested previously [11]. LNN and LNN_3PT each score the lowest MAE for several movies individually. With the exception of movie 2571, LNN and LNN_3PT produce the lowest MAE for all of the movies when they have not been previously rated. When holding out all 10 items, LNN_3PT produces the lowest MAE. This shows the importance of using latent variables. CBCF uses CBF to create a dense matrix (except for the ratings corresponding to the active user) and then uses a collaborative filtering technique on the dense matrix to recommend items to the user. Thus, more emphasis is given to the CBF, which generally produces poorer item recommendations than a collaborative filtering approach. LNN, on the other hand, utilizes the latent variables and their predictive power.

4.3 Efficiency

The efficiency of LNN cannot be stated precisely, as is the case for most neural network models, since it depends on the number of iterations until convergence. In our experiments, LNN always converged regardless of the parameter settings; however, some parameter settings did take longer to reach convergence than others.
The average time in seconds required to run each algorithm using the parameter settings found in the previous experiments is shown in Table 3. The additional complexity of LNN requires more time to train. However, it has the benefit that a new model does not have to be induced in order to recommend new or unrated items, as is the case with MF, NLPCA, and UBP. For recommending new items, LNN uses a k-d tree for the nearest-neighbor search, which has O(log n) search and insert complexities.

5 Conclusions and Future Work

In this paper, we presented a neural network with latent input variables, which we call a latent neural network (LNN), capable of recommending unrated items to users or items to new users. The latent variables and input variables allow information and correlations among the rated items to be represented while also incorporating the item descriptions into the recommendation. Thus, LNN is a hybrid recommendation algorithm that leverages the advantages of collaborative filtering and content-based filtering.

Empirically, an LNN is able to achieve results similar to state-of-the-art collaborative filtering techniques such as matrix factorization while also addressing the cold-start problem. Compared with other hybrid filters and content-based filtering, LNN achieves much lower error when recommending previously unrated items. As LNN achieves error rates similar to the state-of-the-art filtering techniques and can make recommendations for previously unrated items, LNN does not have to be retrained once new items are rated in order to recommend them.

As LNN is built on a neural network, it is capable of modeling higher-order dependencies and non-linearities in the data. However, the data in the MovieLens data set, and in many similar data sets, is well suited to linear models such as matrix factorization.
This may be d ue in part to the fact many o f the d ata sets are inheren tly spa rse and nonlinear models could overfit them and reduce their generalizatio n. As a directio n of futur e work, we are examining how to better inco rporate the non - linear comp onent o f LNN. W e are also look ing at integrating both user and item descriptions with latent input variables to address the new user problem and the new item problem in a single model. Refer ences [1] Y . B engio, H. S chwenk, J. Sen ´ ecal, F . Morin, and J. Gauvain. Neural probabilistic language models. In Innov ations in Machine Learning , pages 137–186. Springer , 2006 . [2] R. D. Burke. Hybrid recommen der systems: Surve y and experiments. User Modeling and User-Adap ted Interaction , 12 (4):331–370 , 2002. [3] I. Cantador , P . Brusilo vsk y , and T . Kuflik. 2nd workshop on information heterogeneity and fusion in rec- ommender systems (hetrec 2011). In Proceed ings of the 5th ACM confer ence on Recommender systems , RecSys 2011, New Y ork, NY , USA, 2011. A CM. [4] M. Claypool, A. Gokhale, T . Miranda, P . Murniko v , D. Netes, and M. Sart in. Combining content-based and collaborativ e filters in an online newspaper . In Pr oceed ings of the ACM SIGIR ’99 W o rkshop on Recommender Systems: Algorithms and Evaluation , Berkeley , California, 19 99. A CM. [5] D. Coheh and J. Shawe-T ay lor . Daug man’ s gabor transform as a simple generativ e back propag ation network. Electro nics Letters , 26(1 6):1241–1 243, 1990. [6] P . Cremonesi, R. T urrin, and F . Airold i. Hybrid algo rithms for recommend ing new items. In Pr oceeding s of the 2Nd International W orkshop on Information Heter ogeneity and Fusion in Recommender Systems , HetRec ’11, pages 33–40, New Y ork, NY , USA, 2011. A CM. [7] P . Forbes an d M. Zhu. Content-boo sted matrix factorization for recomme nder systems: experimen ts with recipe recommenda tion. In B. M obasher , R. D. Burke, D. Jannach, and G. 