Algorithms of the LDA model [REPORT]

Jaka Špeh, Andrej Muhič, Jan Rupnik
Artificial Intelligence Laboratory, Jožef Stefan Institute
Jamova cesta 39, 1000 Ljubljana, Slovenia
e-mail: {jaka.speh, andrej.muhic, jan.rupnik}@ijs.si

ABSTRACT

We review three algorithms for Latent Dirichlet Allocation (LDA). Two of them are variational inference algorithms, Variational Bayesian inference and Online Variational Bayesian inference, and one is a Markov Chain Monte Carlo (MCMC) algorithm, Collapsed Gibbs sampling. We compare their time complexity and performance. We find that Online Variational Bayesian inference is the fastest algorithm and still returns reasonably good results.

1 INTRODUCTION

Nowadays big corpora are used daily. People often search through huge numbers of documents, either in libraries or online using web search engines. We therefore need algorithms that enable efficient information retrieval.

Appropriate documents are sometimes hard to find, especially if you do not have the exact title. One solution is to search using keywords. Since many documents do not come with keywords, we need to know what a given document is about; we would therefore like to tag documents with appropriate keywords by clustering them according to their topics.

As the size of a corpus increases, manual annotation is not an option. We would like computers to process documents and find their topics automatically. That can be done using machine learning.

Probabilistic graphical models such as Latent Dirichlet Allocation (LDA) allow us to describe a document in terms of probability distributions over topics, and these topics in terms of distributions over words. In order to obtain document topics and corpus topics (distributions over words), we need to compute the posterior distribution. Unfortunately, the posterior is intractable to compute and one must appeal to approximate posterior inference.
Modern approximate posterior inference algorithms fall into two categories: sampling approaches and optimization approaches. Sampling approaches are usually based on MCMC sampling; the conceptual idea is to generate independent samples from the posterior and then reason about document and corpus topics. Optimization approaches are usually based on variational inference, also called Variational Bayes (VB) for Bayesian models. Variational Bayes methods optimize the closeness, in Kullback-Leibler divergence, of a simplified parametric distribution to the posterior.

In this paper, we compare one MCMC and two VB algorithms for approximating the posterior distribution. In the subsequent sections, we formally introduce the LDA model and the algorithms. We study the performance of the algorithms and make comparisons between them. For the training and testing sets we use articles from Wikipedia. We show that Online Variational Bayesian inference is the fastest algorithm. Its accuracy is lower than that of the other two algorithms, but the results are still good enough for practical use.

2 LDA MODEL

Latent Dirichlet Allocation [1] is a Bayesian probabilistic graphical model, which is regularly used in topic modeling. It assumes M documents are built in the following fashion. First, a collection of K topics (distributions over words) is drawn from a Dirichlet distribution, $\phi_k \sim \mathrm{Dirichlet}(\beta)$. Then for the m-th document, we:

1. Choose a topic distribution $\theta_m \sim \mathrm{Dirichlet}(\alpha)$.
2. For each word $w_{m,n}$ in the m-th document:
   i. choose a topic of the word $z_{m,n} \sim \mathrm{Multinomial}(\theta_m)$,
   ii. choose a word $w_{m,n} \sim \mathrm{Multinomial}(\phi_{z_{m,n}})$.

LDA can be graphically presented using plate notation (Figure 1).

[Figure 1. Plate notation of LDA.]

The probability of the LDA model is

$$p(w, z, \theta, \phi \mid \alpha, \beta) = \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{m=1}^{M} \left( p(\theta_m \mid \alpha) \prod_{n=1}^{N_m} p(z_{m,n} \mid \theta_m)\, p(w_{m,n} \mid \phi_{z_{m,n}}) \right).$$
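The generative process above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical corpus dimensions, not the settings used in the experiments below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus dimensions (illustration only).
K, V, M, N = 3, 20, 5, 50   # topics, vocabulary size, documents, words per doc
alpha, beta = 0.1, 0.01     # symmetric Dirichlet hyperparameters

# Draw K topics phi_k ~ Dirichlet(beta), each a distribution over V terms.
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for m in range(M):
    # 1. Choose a topic distribution theta_m ~ Dirichlet(alpha).
    theta_m = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N):
        # 2i. Choose the word's topic z_{m,n} ~ Multinomial(theta_m).
        z = rng.choice(K, p=theta_m)
        # 2ii. Choose the word w_{m,n} ~ Multinomial(phi_{z_{m,n}}).
        words.append(rng.choice(V, p=phi[z]))
    docs.append(words)
```

Inference, discussed next, runs this process in reverse: given only `docs`, it recovers approximations of `phi` and the `theta_m`.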
We can analyse a corpus of documents by computing the posterior distribution of the hidden variables ($z$, $\theta$, $\phi$) given the documents ($w$). This posterior reveals latent structure in the corpus that can be used for prediction or data exploration. Unfortunately, this distribution cannot be computed directly [1], and is usually approximated using Markov Chain Monte Carlo (MCMC) methods or variational inference.

3 ALGORITHMS

In the following subsections, we derive one MCMC algorithm and two Variational Bayes algorithms for approximate posterior inference.

3.1 Collapsed Gibbs sampling

In Collapsed Gibbs sampling we first integrate $\theta$ and $\phi$ out:

$$p(z, w \mid \alpha, \beta) = \int_{\theta} \int_{\phi} p(z, w, \theta, \phi \mid \alpha, \beta)\, d\theta\, d\phi.$$

The goal of Collapsed Gibbs sampling here is to approximate the distribution $p(z \mid w, \alpha, \beta)$. The conditional probability $p(w \mid \alpha, \beta)$ does not depend on $z$, so the Gibbs sampling equations can be derived from $p(z, w \mid \alpha, \beta)$ directly. Specifically, we are interested in the conditional probability

$$p(z_{m,n} \mid z_{\neg(m,n)}, w, \alpha, \beta),$$

where $z_{\neg(m,n)}$ denotes all $z$-s except $z_{m,n}$. Note that for Collapsed Gibbs sampling we only need to sample a value for $z_{m,n}$ according to the above probability, so we only need the probability mass function up to scalar multiplication. The distribution can therefore be simplified [4, page 22] as:

$$p(z_{m,n} = k \mid z_{\neg(m,n)}, w, \alpha, \beta) \propto \frac{n_{k,\neg(m,n)}^{(v)} + \beta}{\sum_{v=1}^{V} \big( n_{k,\neg(m,n)}^{(v)} + \beta \big)} \big( n_{m,\neg(m,n)}^{(k)} + \alpha \big), \tag{1}$$

where $n_k^{(v)}$ is the number of times that term $v$ has been observed with topic $k$, $n_m^{(k)}$ is the number of times that topic $k$ has been observed with a word of document $m$, and the subscript $\neg(m,n)$ indicates that the $n$-th token of the $m$-th document is excluded from the corresponding count $n_k^{(v)}$ or $n_m^{(k)}$.
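Equation (1) translates directly into a sampling loop over tokens. The sketch below is a minimal NumPy implementation; the function name `gibbs_lda` and the fixed iteration count are our own choices, and burn-in and convergence checks (discussed below) are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, K, V, alpha, beta, iters=50):
    """Collapsed Gibbs sampling sketch; docs is a list of term-id lists."""
    M = len(docs)
    z = [[rng.integers(K) for _ in d] for d in docs]       # topic assignments
    n_kv = np.zeros((K, V))                                # term-topic counts
    n_mk = np.zeros((M, K))                                # document-topic counts
    for m, d in enumerate(docs):
        for n, v in enumerate(d):
            n_kv[z[m][n], v] += 1
            n_mk[m, z[m][n]] += 1
    for _ in range(iters):
        for m, d in enumerate(docs):
            for n, v in enumerate(d):
                k_old = z[m][n]
                n_kv[k_old, v] -= 1                        # exclude token (m, n)
                n_mk[m, k_old] -= 1
                # Equation (1): unnormalized conditional over the K topics.
                p = (n_kv[:, v] + beta) / (n_kv.sum(axis=1) + V * beta) \
                    * (n_mk[m] + alpha)
                k_new = rng.choice(K, p=p / p.sum())
                n_kv[k_new, v] += 1
                n_mk[m, k_new] += 1
                z[m][n] = k_new
    # Count-based point estimates of corpus and document topics.
    phi = (n_kv + beta) / (n_kv.sum(axis=1, keepdims=True) + V * beta)
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta
```

A real implementation would discard early (burn-in) iterations and assess convergence of the chain before taking samples.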
The corpus and document topics can be obtained by [4, page 23]:

$$\phi_{k,v} = \frac{n_k^{(v)} + \beta}{\sum_{v=1}^{V} \big( n_k^{(v)} + \beta \big)}, \qquad \theta_{m,k} = \frac{n_m^{(k)} + \alpha}{\sum_{k=1}^{K} \big( n_m^{(k)} + \alpha \big)}.$$

In the Collapsed Gibbs sampling algorithm, we need to store the values of three variables, $z_{m,n}$, $n_m^{(k)}$, and $n_k^{(v)}$, plus some sums of these variables for efficiency. The algorithm first initializes $z$ and computes $n_m^{(k)}$ and $n_k^{(v)}$ according to the initialized values. In one iteration of the algorithm we then go over all words of all documents, sample values of $z_{m,n}$ according to Equation (1), and update $n_m^{(k)}$ and $n_k^{(v)}$. One then has to decide from which iteration(s) to take a sample or samples, and which criterion to use to check whether the Markov chain has converged.

3.2 Variational Bayesian inference

This algorithm was proposed in the original LDA paper [1]. In Variational Bayesian inference (VB) the true posterior is approximated by a simpler distribution $q(z, \theta, \phi)$, which is indexed by a set of free parameters [6]. We choose a fully factorized distribution $q$ of the form

$$q(z_{m,n} = k) = \psi_{m,n,k}, \qquad q(\theta_m) = \mathrm{Dirichlet}(\theta_m \mid \gamma_m), \qquad q(\phi_k) = \mathrm{Dirichlet}(\phi_k \mid \lambda_k).$$

The posterior is parameterized by $\psi$, $\gamma$, and $\lambda$. We refer to $\lambda$ as the corpus topics and to $\gamma$ as the document topics. The parameters are optimized to maximize the Evidence Lower Bound (ELBO):

$$\log p(w \mid \alpha, \beta) \ge \mathcal{L}(w, \psi, \gamma, \lambda) = \mathbb{E}_q[\log p(w, z, \theta, \phi \mid \alpha, \beta)] - \mathbb{E}_q[\log q(z, \theta, \phi)]. \tag{2}$$

Maximizing the ELBO is equivalent to minimizing the Kullback-Leibler divergence between $q(z, \theta, \phi)$ and the posterior $p(z, \theta, \phi \mid w, \alpha, \beta)$.

[Figure 2. Plate notation of the parameterized distribution q.]
The ELBO $\mathcal{L}$ can be optimized using coordinate ascent over the variational parameters (for a detailed derivation see [1, 2]):

$$\psi_{m,v,k} \propto \exp\big( \mathbb{E}_q[\log \theta_{m,k}] + \mathbb{E}_q[\log \phi_{k,v}] \big), \tag{3}$$

$$\gamma_{m,k} = \alpha + \sum_{v=1}^{V} n_{m,v}\, \psi_{m,v,k}, \tag{4}$$

$$\lambda_{k,v} = \beta + \sum_{m=1}^{M} n_{m,v}\, \psi_{m,v,k}, \tag{5}$$

where $n_{m,v}$ is the number of occurrences of term $v$ in document $m$. The expectations are

$$\mathbb{E}_q[\log \theta_{m,k}] = \Psi(\gamma_{m,k}) - \Psi\Big( \sum_{k'=1}^{K} \gamma_{m,k'} \Big), \qquad \mathbb{E}_q[\log \phi_{k,v}] = \Psi(\lambda_{k,v}) - \Psi\Big( \sum_{v'=1}^{V} \lambda_{k,v'} \Big),$$

where $\Psi$ denotes the digamma function (the first derivative of the logarithm of the gamma function).

The updates of the variational parameters are guaranteed to converge to a stationary point of the ELBO. We can draw some parallels with the Expectation-Maximization (EM) algorithm [3]: iteratively updating $\gamma$ and $\psi$ until convergence, holding $\lambda$ fixed, can be seen as the "E"-step, and updating $\lambda$ given $\gamma$ and $\psi$ can be seen as the "M"-step.

The Variational Bayesian inference algorithm first initializes $\lambda$ randomly. Then, for each document, it performs the "E"-step: it initializes $\gamma$ randomly and performs coordinate ascent using Equations (3) and (4) until $\gamma$ converges. After $\gamma$ converges, the algorithm performs the "M"-step: it sets $\lambda$ using Equation (5). Each combination of an "E"- and "M"-step improves the ELBO. Variational Bayesian inference finishes once the relative improvement of $\mathcal{L}$ falls below a prescribed limit, or after a maximum number of iterations is reached. We define one iteration as an "E"-step plus an "M"-step. After the algorithm finishes, $\gamma$ represents the document topics and $\lambda$ represents the corpus topics.

3.3 Online Variational Bayesian inference

The previously described algorithm has constant memory requirements, but it requires a full pass through the entire corpus in each iteration. It is therefore not naturally suited to settings where new data is constantly arriving. We would like an algorithm that receives data, calculates the topics of that data, and updates the existing corpus topics.
Let us modify the previous algorithm accordingly. First, we factorize the ELBO (Equation (2)) into

$$\mathcal{L}(w, \psi, \gamma, \lambda) = \sum_{m=1}^{M} \Big( \mathbb{E}_q[\log p(w_m \mid \theta_m, z_m, \phi)] + \mathbb{E}_q[\log p(z_m \mid \theta_m)] - \mathbb{E}_q[\log q(z_m)] + \mathbb{E}_q[\log p(\theta_m \mid \alpha)] - \mathbb{E}_q[\log q(\theta_m)] + \big( \mathbb{E}_q[\log p(\phi \mid \beta)] - \mathbb{E}_q[\log q(\phi)] \big) / M \Big).$$

Note that we bring the per-corpus topic terms into the summation over documents and divide them by the number of documents $M$. This allows us to consider the maximization of the ELBO with respect to the parameters $\psi$ and $\gamma$ for each document individually. We therefore first maximize the ELBO with respect to $\psi$ and $\gamma$, as in the previous algorithm, with $\lambda$ fixed. Then we choose the $\lambda$ for which the ELBO is as high as possible. Let $\gamma(w_m, \lambda)$ and $\psi(w_m, \lambda)$ be the values of $\gamma_m$ and $\psi_m$ produced by the "E"-step. Our goal is to find the $\lambda$ that maximizes

$$\mathcal{L}(w, \lambda) = \sum_{m=1}^{M} \ell_m(w_m, \gamma(w_m, \lambda), \psi(w_m, \lambda), \lambda),$$

where $\ell_m(w_m, \gamma(w_m, \lambda), \psi(w_m, \lambda), \lambda)$ is the $m$-th document's contribution to the ELBO.

We then compute $\tilde{\lambda}$, the setting of $\lambda$ that would be optimal, given $\psi$, if the entire corpus consisted of the single document $w_m$ repeated $M$ times:

$$\tilde{\lambda}_{k,v} = \beta + M\, n_{m,v}\, \psi_{m,v,k}.$$

Here $M$ is the number of available documents, i.e. the size of the corpus. We then update $\lambda$ using a convex combination of its previous value and $\tilde{\lambda}$:

$$\lambda = (1 - \rho_m)\, \lambda + \rho_m \tilde{\lambda}, \qquad \rho_m = (\tau_0 + m)^{-\kappa},$$

where $\tau_0 \ge 0$ slows down the early iterations and $\kappa$ controls the rate at which old values of $\tilde{\lambda}$ are forgotten.

To sum up: the algorithm first initializes $\lambda$ randomly. Then, on a given document, it performs the "E"-step as in Variational Bayesian inference, and next updates $\lambda$ as discussed above. Finally, it moves on to the next document and repeats. The algorithm terminates after all documents are processed.
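The steps just summarized can be sketched in NumPy/SciPy. This is a minimal sketch with per-document updates (i.e. batch size 1); the helper names `e_step` and `online_vb` are ours, and fixed iteration counts stand in for the convergence checks:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

def e_step(n_v, lam, alpha, iters=50):
    """ "E"-step for one document: iterate Equations (3) and (4), lambda fixed.

    n_v: (V,) term counts of the document; lam: (K, V) corpus topics lambda.
    """
    K = lam.shape[0]
    e_log_phi = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.ones(K)                          # document topics gamma_m
    for _ in range(iters):
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        # Equation (3): psi[k, v] proportional to exp(E[log theta] + E[log phi]).
        psi = np.exp(e_log_theta[:, None] + e_log_phi)
        psi /= psi.sum(axis=0, keepdims=True)   # normalize over topics k
        # Equation (4): gamma_k = alpha + sum_v n_v * psi_{v,k}.
        gamma = alpha + (n_v[None, :] * psi).sum(axis=1)
    return gamma, psi

def online_vb(docs_n_v, K, alpha, beta, tau0=1024.0, kappa=0.7):
    """One pass of Online VB; docs_n_v is an (M, V) matrix of term counts."""
    M, V = docs_n_v.shape
    lam = rng.gamma(100.0, 0.01, size=(K, V))   # random initial lambda
    for m in range(M):
        gamma, psi = e_step(docs_n_v[m], lam, alpha)
        # lambda-tilde: optimal lambda if the corpus were document m, M times.
        lam_tilde = beta + M * docs_n_v[m][None, :] * psi
        rho = (tau0 + m) ** (-kappa)            # rho_m = (tau0 + m)^(-kappa)
        lam = (1.0 - rho) * lam + rho * lam_tilde
    return lam
```

Note that the experiments in Section 4 process mini-batches of 100 documents between corpus-topic updates rather than single documents as in this sketch.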
This algorithm is called Online Variational Bayesian inference (Online VB) and was proposed by Hoffman, Blei and Bach in [5].

4 EXPERIMENTS

We ran several experiments to evaluate the algorithms of the LDA model. Our purpose was to compare the time complexity and performance of the previously described algorithms. For the training and testing corpora we used Wikipedia. Efficiency was measured using perplexity on held-out data, which is defined as

$$\mathrm{perplexity}(w^{\mathrm{test}}, \lambda) = \exp\left( -\frac{\sum_{m=1}^{M} \log p(w_m \mid \lambda)}{\sum_{m=1}^{M} N_m} \right),$$

where $N_m$ denotes the number of words in the $m$-th document. Since we cannot compute $\log p(w_m \mid \lambda)$ directly, we use the ELBO as an approximation:

$$\mathrm{perplexity}(w^{\mathrm{test}}, \lambda) \le \exp\left( -\frac{\sum_{m=1}^{M} \big( \mathbb{E}_q[\log p(w_m, z_m, \theta_m \mid \phi)] - \mathbb{E}_q[\log q(z_m, \theta_m)] \big)}{\sum_{m=1}^{M} N_m} \right).$$

We tested the three algorithms and ran experiments with 10,000, 20,000, ..., 80,000 documents as the training set for the corpus topics. We then evaluated perplexity on 100 held-out documents. The size of the vocabulary was around 150,000 words.

In all experiments $\alpha$ and $\beta$ are fixed at 0.01 and the number of topics $K$ is equal to 100. For Collapsed Gibbs sampling, no experiment converged. The criterion was the relative change in the $z$ variable; the change did not drop below 20% within 1000 iterations.

In Variational Bayesian inference the "E"-step and the "M"-step converge if the relative change in $\gamma$ is below 0.001 and the relative improvement of the ELBO is below 0.001, respectively. If there is no convergence, we terminate after 100 iterations for both the "E"- and "M"-step. However, the algorithm always converged in fewer than 20 iterations.

In Online Variational Bayesian inference the limit for the "E"-step was the same as in Variational Bayesian inference. The batch size was 100 documents, $\tau_0$ was 1024, and $\kappa$ was equal to 0.7, as proposed in [5].
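Once per-document log-likelihoods (or their ELBO lower bounds) are available, the perplexity computation itself is a one-liner; a sketch:

```python
import numpy as np

def perplexity(log_liks, doc_lens):
    """Held-out perplexity from per-document log-likelihoods.

    Substituting the ELBO for log p(w_m | lambda) yields an upper bound
    on the true perplexity, since the ELBO is a lower bound on the
    log-likelihood.
    """
    return np.exp(-np.sum(log_liks) / np.sum(doc_lens))
```

Lower perplexity is better; a model that assigned each held-out word probability $1/V$ would score a perplexity of exactly $V$.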
The fastest algorithm is Online VB; the other two have similar time complexity, with an important caveat: the VB algorithm converged every time, while the Gibbs sampling algorithm did not converge. Unexpectedly, Online VB does not perform as well as the other two, but in practice it still gives reasonably good results. Our future goal is to explain the results obtained in the experiments.

We therefore recommend the Online VB algorithm for practical use, if time is a factor.

[Figure 3. Time used by the algorithms (in hours) given the number of documents: Gibbs, VB, Online VB.]

[Figure 4. Perplexity on held-out documents as a function of the number of documents analyzed: Gibbs, VB, Online VB.]

5 ACKNOWLEDGMENT

The authors gratefully acknowledge that the funding for this work was provided by the project XLIKE (ICT-257790-STREP).

REFERENCES

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, March 2003.
[2] Wim De Smet and Marie-Francine Moens. Cross-language linking of news stories on the web using interlingual topic modelling. In Proceedings of the 2nd ACM Workshop on Social Web Search and Mining, SWSM '09, pages 57-64, New York, NY, USA, 2009. ACM.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[4] Gregor Heinrich. Parameter estimation for text analysis. Technical report, Fraunhofer IGD, Darmstadt, Germany, 2005.
[5] Matthew D. Hoffman, David M. Blei, and Francis R. Bach. Online learning for latent Dirichlet allocation. In NIPS, pages 856-864. Curran Associates, Inc., 2010.
[6] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Mach. Learn., 37(2):183-233, November 1999.
