The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies


Authors: David M. Blei (Princeton University), Thomas L. Griffiths (University of California, Berkeley), Michael I. Jordan (University of California, Berkeley)

THE NESTED CHINESE RESTAURANT PROCESS AND BAYESIAN NONPARAMETRIC INFERENCE OF TOPIC HIERARCHIES

DAVID M. BLEI, PRINCETON UNIVERSITY
THOMAS L. GRIFFITHS, UNIVERSITY OF CALIFORNIA, BERKELEY
MICHAEL I. JORDAN, UNIVERSITY OF CALIFORNIA, BERKELEY

ABSTRACT. We present the nested Chinese restaurant process (nCRP), a stochastic process which assigns probability distributions to infinitely-deep, infinitely-branching trees. We show how this stochastic process can be used as a prior distribution in a Bayesian nonparametric model of document collections. Specifically, we present an application to information retrieval in which documents are modeled as paths down a random tree, and the preferential attachment dynamics of the nCRP leads to clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this algorithm on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning: the use of Bayesian nonparametric methods to infer distributions on flexible data structures.

1. INTRODUCTION

For much of its history, computer science has focused on deductive formal methods, allying itself with deductive traditions in areas of mathematics such as set theory, logic, algebra, and combinatorics. There has accordingly been less focus on efforts to develop inductive, empirically-based formalisms in computer science, a gap which became increasingly visible over the years as computers have been required to interact with noisy, difficult-to-characterize sources of data, such as those deriving from physical signals or from human activity. In more recent history, the field of machine learning
has aimed to fill this gap, allying itself with inductive traditions in probability and statistics, while focusing on methods that are amenable to analysis as computational procedures.

Machine learning methods can be divided into supervised learning methods and unsupervised learning methods. Supervised learning has been a major focus of machine learning research. In supervised learning, each data point is associated with a label (e.g., a category, a rank or a real number) and the goal is to find a function that maps data into labels (so as to predict the labels of data that have not yet been labeled). A canonical example of supervised machine learning is the email spam filter, which is trained on known spam messages and then used to mark incoming unlabeled email as spam or non-spam.

While supervised learning remains an active and vibrant area of research, more recently the focus in machine learning has turned to unsupervised learning methods. In unsupervised learning the data are not labeled, and the broad goal is to find patterns and structure within the data set. Different formulations of unsupervised learning are based on different notions of "pattern" and "structure." Canonical examples include clustering, the problem of grouping data into meaningful groups of similar points, and dimension reduction, the problem of finding a compact representation that retains useful information in the data set. One way to render these notions concrete is to tie them to a supervised learning problem; thus, a structure is validated if it aids the performance of an associated supervised learning system. Often, however, the goal is more exploratory. Inferred structures and patterns might be used, for example, to visualize or organize the data according to subjective criteria.
With the increased access to all kinds of unlabeled data (scientific data, personal data, consumer data, economic data, government data, text data), exploratory unsupervised machine learning methods have become increasingly prominent.

Another important dichotomy in machine learning distinguishes between parametric and nonparametric models. A parametric model involves a fixed representation that does not grow structurally as more data are observed. Examples include linear regression and clustering methods in which the number of clusters is fixed a priori. A nonparametric model, on the other hand, is based on representations that are allowed to grow structurally as more data are observed.¹ Nonparametric approaches are often adopted when the goal is to impose as few assumptions as possible and to "let the data speak."

¹ In particular, despite the nomenclature, a nonparametric model can involve parameters; the issue is whether or not the number of parameters grows as more data are observed.

The nonparametric approach underlies many of the most significant developments in the supervised learning branch of machine learning over the past two decades. In particular, modern classifiers such as decision trees, boosting and nearest neighbor methods are nonparametric, as are the class of supervised learning systems built on "kernel methods," including the support vector machine. (See Hastie et al. (2001) for a good review of these methods.) Theoretical developments in supervised learning have shown that as the number of data points grows, these methods can converge to the true labeling function underlying the data, even when the data lie in an uncountably infinite space and the labeling function is arbitrary (Devroye et al., 1996). This would clearly not be possible for parametric classifiers.
The assumption that labels are available in supervised learning is a strong assumption, but it has the virtue that few additional assumptions are generally needed to obtain a useful supervised learning methodology. In unsupervised learning, on the other hand, the absence of labels and the need to obtain operational definitions of "pattern" and "structure" generally make it necessary to impose additional assumptions on the data source. In particular, unsupervised learning methods are often based on "generative models," which are probabilistic models that express hypotheses about the way in which the data may have been generated. Probabilistic graphical models (also known as "Bayesian networks" and "Markov random fields") have emerged as a broadly useful approach to specifying generative models (Lauritzen, 1996; Jordan, 2000). The elegant marriage of graph theory and probability theory in graphical models makes it possible to take a fully probabilistic (i.e., Bayesian) approach to unsupervised learning in which efficient algorithms are available to update a prior generative model into a posterior generative model once data have been observed.

Although graphical models have catalyzed much research in unsupervised learning and have had many practical successes, it is important to note that most of the graphical model literature has been focused on parametric models. In particular, the graphs and the local potential functions comprising a graphical model are viewed as fixed objects; they do not grow structurally as more data are observed. Thus, while nonparametric methods have dominated the literature in supervised learning, parametric methods have dominated in unsupervised learning. This may seem surprising given that the open-ended nature of the unsupervised learning problem seems particularly commensurate with the nonparametric philosophy.
But it reflects an underlying tension in unsupervised learning: to obtain a well-posed learning problem it is necessary to impose assumptions, but the assumptions should not be too strong or they will inform the discovered structure more than the data themselves.

It is our view that the framework of Bayesian nonparametric statistics provides a general way to lessen this tension and to pave the way to unsupervised learning methods that combine the virtues of the probabilistic approach embodied in graphical models with the nonparametric spirit of supervised learning. In Bayesian nonparametric (BNP) inference, the prior and posterior distributions are no longer restricted to be parametric distributions, but are general stochastic processes (Hjort et al., 2009). Recall that a stochastic process is simply an indexed collection of random variables, where the index set is allowed to be infinite. Thus, using stochastic processes, the objects of Bayesian inference are no longer restricted to finite-dimensional spaces, but are allowed to range over general infinite-dimensional spaces. For example, objects such as trees of arbitrary branching factor and arbitrary depth are allowed within the BNP framework, as are other structured objects of open-ended cardinality such as partitions and lists. It is also possible to work with stochastic processes that place distributions on functions and distributions on distributions. The latter fact exhibits the potential for recursive constructions that is available within the BNP framework. In general, we view the representational flexibility of the BNP framework as a statistical counterpart of the flexible data structures that are ubiquitous in computer science.
In this paper, we aim to introduce the BNP framework to a wider computational audience by showing how BNP methods can be deployed in a specific unsupervised machine learning problem of significant current interest: that of learning topic models for collections of text, images and other semi-structured corpora (Blei et al., 2003; Griffiths and Steyvers, 2006; Blei and Lafferty, 2009).

Let us briefly introduce the problem here; a more formal presentation appears in Section 4. A topic is defined to be a probability distribution across words from a vocabulary. Given an input corpus (a set of documents each consisting of a sequence of words), we want an algorithm to both find useful sets of topics and learn to organize the topics according to a hierarchy in which more abstract topics are near the root of the hierarchy and more concrete topics are near the leaves. While a classical unsupervised analysis might require the topology of the hierarchy (branching factors, etc.) to be chosen in advance, our BNP approach aims to infer a distribution on topologies, in particular placing high probability on those hierarchies that best explain the data. Moreover, in accordance with our goals of using flexible models that "let the data speak," we wish to allow this distribution to have its support on arbitrary topologies; there should be no limitations such as a maximum depth or maximum branching factor.

We provide an example of the output from our algorithm in Figure 1.
The input corpus in this case was a collection of abstracts from the Journal of the ACM (JACM) from the years 1987 to 2004.

FIGURE 1. The topic hierarchy learned from 536 abstracts of the Journal of the ACM (JACM) from 1987–2004. The vocabulary was restricted to the 1,539 terms that occurred in more than five documents, yielding a corpus of 68K words. The learned hierarchy contains 25 topics, and each topic node is annotated with its top five most probable terms. We also present examples of documents associated with a subset of the paths in the hierarchy.
The figure depicts a topology that is given highest probability by our algorithm, along with the highest probability words from the topics associated with this topology (each node in the tree corresponds to a single topic). As can be seen from the figure, the algorithm has discovered the category of function words at level zero (e.g., "the" and "of"), and has discovered a set of first-level topics that are a reasonably faithful representation of some of the main areas of computer science. The second level provides a further subdivision into more concrete topics. We emphasize that this is an unsupervised problem. The algorithm discovers the topic hierarchy without any extra information about the corpus (e.g., keywords, titles or authors). The documents are the only inputs to the algorithm.

A learned topic hierarchy can be useful for many tasks, including text categorization, text compression, text summarization and language modeling for speech recognition. A commonly-used surrogate for the evaluation of performance in these tasks is predictive likelihood, and we use predictive likelihood to evaluate our methods quantitatively. But we also view our work as making a contribution to the development of methods for the visualization and browsing of documents. The model and algorithm we describe can be used to build a topic hierarchy for a document collection, and that hierarchy can be used to sharpen a user's understanding of the contents of the collection. A qualitative measure of the success of our approach is that the same tool should be able to uncover a useful topic hierarchy in different domains based solely on the input data.
By defining a probabilistic model for documents, we do not define the level of "abstraction" of a topic formally, but rather define a statistical procedure that allows a system designer to capture notions of abstraction that are reflected in usage patterns of the specific corpus at hand. While the content of topics will vary across corpora, the ways in which abstraction interacts with usage will not. A corpus might be a collection of images, a collection of HTML documents or a collection of DNA sequences. Different notions of abstraction will be appropriate in these different domains, but each is expressed and discoverable in the data, making it possible to automatically construct a hierarchy of topics.

This paper is organized as follows. We begin with a review of the necessary background in stochastic processes and Bayesian nonparametric statistics in Section 2. In Section 3, we develop the nested Chinese restaurant process, the prior on topologies that we use in the hierarchical topic model of Section 4. We derive an approximate posterior inference algorithm in Section 5 to learn topic hierarchies from text data. Examples and an empirical evaluation are provided in Section 6. Finally, we present related work and a discussion in Section 7.

FIGURE 2. A configuration of the Chinese restaurant process. There are an infinite number of tables, each associated with a parameter β_i. The customers sit at the tables according to Eq. (1) and each generate data with the corresponding parameter. In this configuration, ten customers have been seated in the restaurant, populating four of the infinite set of tables.
2. BACKGROUND

Our approach to topic modeling reposes on several building blocks from stochastic process theory and Bayesian nonparametric statistics, specifically the Chinese restaurant process (Aldous, 1985), stick-breaking processes (Pitman, 2002), and the Dirichlet process mixture (Antoniak, 1974). In this section we briefly review these ideas and the connections between them.

2.1. Dirichlet and beta distributions. Recall that the Dirichlet distribution is a probability distribution on the simplex of nonnegative real numbers that sum to one. We write

U ∼ Dir(α_1, α_2, ..., α_K)

for a random vector U distributed as a Dirichlet random variable on the K-simplex, where the α_i > 0 are parameters. The mean of U is proportional to the parameters,

E[U_i] = α_i / ∑_{k=1}^K α_k,

and the magnitude of the parameters determines the concentration of U around the mean. The specific choice α_1 = · · · = α_K = 1 yields the uniform distribution on the simplex. Letting α_i > 1 yields a unimodal distribution peaked around the mean, and letting α_i < 1 yields a distribution that has modes at the corners of the simplex. The beta distribution is a special case of the Dirichlet distribution for K = 2, in which case the simplex is the unit interval (0, 1). In this case we write U ∼ Beta(α_1, α_2), where U is a scalar.

2.2. Chinese restaurant process. The Chinese restaurant process (CRP) is a single-parameter distribution over partitions of the integers. The distribution can be most easily described by specifying how to draw a sample from it. Consider a restaurant with an infinite number of tables, each with infinite capacity. A sequence of N customers arrive, labeled with the integers {1, ..., N}.
The first customer sits at the first table; the nth subsequent customer sits at a table drawn from the following distribution:

(1)  p(occupied table i | previous customers) = n_i / (γ + n − 1)
     p(next unoccupied table | previous customers) = γ / (γ + n − 1),

where n_i is the number of customers currently sitting at table i, and γ is a real-valued parameter which controls how often, relative to the number of customers in the restaurant, a customer chooses a new table versus sitting with others. After N customers have been seated, the seating plan gives a partition of those customers, as illustrated in Figure 2.

With an eye towards Bayesian statistical applications, we assume that each table is endowed with a parameter vector β drawn from a distribution G_0. Each customer is associated with the parameter vector at the table at which he sits. The resulting distribution on sequences of parameter values is referred to as a Pólya urn model (Johnson and Kotz, 1977).

The Pólya urn distribution can be used to define a flexible clustering model. Let the parameters at the tables index a family of probability distributions (for example, the distribution might be a multivariate Gaussian, in which case the parameter would be a mean vector and covariance matrix). Associate customers to data points, and draw each data point from the probability distribution associated with the table at which the customer sits. This induces a probabilistic clustering of the generated data because customers sitting around each table share the same parameter vector.

This model is in the spirit of a traditional mixture model (Titterington et al., 1985), but is critically different in that the number of tables is unbounded. Data analysis amounts to inverting the generative process to determine a probability distribution on the "seating assignment" of a data set.
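The seating rule in Eq. (1) can be simulated directly. The following sketch is ours, not the paper's: a minimal Python sampler using only the standard library, with function and variable names chosen for illustration.

```python
import random

def sample_crp(num_customers, gamma, seed=0):
    """Draw one seating arrangement from a CRP with parameter gamma.

    Implements Eq. (1): customer n joins occupied table i with
    probability n_i / (gamma + n - 1) and opens a new table with
    probability gamma / (gamma + n - 1).
    """
    rng = random.Random(seed)
    counts = []       # counts[i] = number of customers at table i
    assignments = []  # table index chosen by each customer
    for n in range(1, num_customers + 1):
        # Unnormalized weights over occupied tables plus one new table;
        # their total mass is (n - 1) + gamma, the denominator in Eq. (1).
        r = rng.random() * (gamma + n - 1)
        table, acc = len(counts), 0.0
        for i, c in enumerate(counts):
            acc += c
            if r < acc:
                table = i
                break
        if table == len(counts):
            counts.append(0)  # open a new table
        counts[table] += 1
        assignments.append(table)
    return assignments, counts
```

With small γ most customers accumulate at a few large tables; with large γ new tables open often, mirroring the role of γ described above.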
The underlying CRP lets the data determine the number of clusters (i.e., the number of occupied tables) and further allows new data to be assigned to new clusters (i.e., new tables).

2.3. Stick-breaking constructions. The Dirichlet distribution places a distribution on nonnegative K-dimensional vectors whose components sum to one. In this section we discuss a stochastic process that allows K to be unbounded. Consider a collection of nonnegative real numbers {θ_i}_{i=1}^∞ where ∑_i θ_i = 1. We wish to place a probability distribution on such sequences. Given that each such sequence can be viewed as a probability distribution on the positive integers, we obtain a distribution on distributions, i.e., a random probability distribution.

To do this, we use a stick-breaking construction. View the interval (0, 1) as a unit-length stick. Draw a value V_1 from a Beta(α_1, α_2) distribution and break off a fraction V_1 of the stick. Let θ_1 = V_1 denote this first fragment of the stick and let 1 − θ_1 denote the remainder of the stick. Continue this procedure recursively, letting θ_2 = V_2(1 − θ_1), and in general define

θ_i = V_i ∏_{j=1}^{i−1} (1 − V_j),

where {V_i} are an infinite sequence of independent draws from the Beta(α_1, α_2) distribution. Sethuraman (1994) shows that the resulting sequence {θ_i} satisfies ∑_i θ_i = 1 with probability one.

In the special case α_1 = 1 we obtain a one-parameter stochastic process known as the GEM distribution (Pitman, 2002). Let γ = α_2 denote this parameter and denote draws from this distribution as θ ∼ GEM(γ). Large values of γ skew the beta distribution towards zero and yield random sequences that are heavy-tailed, i.e., significant probability tends to be assigned to large integers. Small values of γ yield random sequences that decay more quickly to zero.

2.4. Connections. The GEM distribution and the CRP are closely related.
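Before stating this connection formally, note that the stick-breaking recursion of Section 2.3 is easy to simulate. The sketch below is our own illustration (hypothetical names, truncated to finitely many breaks rather than the infinite sequence):

```python
import random

def stick_breaking(num_weights, a1, a2, seed=0):
    """Truncated stick-breaking: V_i ~ Beta(a1, a2) and
    theta_i = V_i * prod_{j<i} (1 - V_j)."""
    rng = random.Random(seed)
    weights, remainder = [], 1.0
    for _ in range(num_weights):
        v = rng.betavariate(a1, a2)
        weights.append(v * remainder)
        remainder *= 1.0 - v
    # The truncation sums to 1 - remainder; the full infinite
    # sequence sums to 1 with probability one (Sethuraman, 1994).
    return weights

# GEM(gamma) is the special case a1 = 1; here gamma = 5 gives a
# slowly decaying, heavier-tailed sequence of weights.
theta = stick_breaking(50, 1.0, 5.0)
```

Varying the second parameter reproduces the behavior described above: small γ concentrates mass on the first few weights, large γ spreads it over many.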
Let θ ∼ GEM(γ) and let {Z_1, Z_2, ..., Z_N} be a sequence of indicator variables drawn independently from θ, i.e., p(Z_n = i | θ) = θ_i. This distribution on indicator variables induces a random partition of the integers {1, 2, ..., N}, where the partition reflects indicators that share the same values. It can be shown that this distribution on partitions is the same as the distribution on partitions induced by the CRP (Pitman, 2002). As implied by this result, the GEM parameter γ controls the partition in the same way as the CRP parameter γ.

As with the CRP, we can augment the GEM distribution to consider draws of parameter values. Let {β_i} be an infinite sequence of independent draws from a distribution G_0 defined on a sample space Ω. Define

G = ∑_{i=1}^∞ θ_i δ_{β_i},

where δ_{β_i} is an atom at location β_i and where θ ∼ GEM(γ). The object G is a distribution on Ω; it is a random distribution.

Consider now a finite partition of Ω. Sethuraman (1994) showed that the probability assigned by G to the cells of this partition follows a Dirichlet distribution. Moreover, if we consider all possible finite partitions of Ω, the resulting Dirichlet distributions are consistent with each other. Thus, by appealing to the Kolmogorov consistency theorem (Billingsley, 1995), we can view G as a draw from an underlying stochastic process, where the index set is the set of Borel sets of Ω. This stochastic process is known as the Dirichlet process (Ferguson, 1973).

Note that if we truncate the stick-breaking process after L − 1 breaks, we obtain a Dirichlet distribution on an L-dimensional vector. The first L − 1 components of this vector manifest the same kind of bias towards larger values for earlier components as the full stick-breaking distribution. However, the last component θ_L represents the portion of the stick that remains after L − 1 breaks, and has less of a bias toward small values than in the untruncated case.

Finally, we will find it convenient to define a two-parameter variant of the GEM distribution that allows control over both the mean and variance of stick lengths. We denote this distribution as GEM(m, π), in which π > 0 and m ∈ (0, 1). In this variant, the stick lengths are defined as V_i ∼ Beta(mπ, (1 − m)π). The standard GEM(γ) is the special case when mπ = 1 and γ = (1 − m)π. Note that the mean and variance of the standard GEM are tied through its single parameter.

3. THE NESTED CHINESE RESTAURANT PROCESS

The Chinese restaurant process and related distributions are widely used in Bayesian nonparametric statistics because they make it possible to define statistical models in which observations are assumed to be drawn from an unknown number of classes. However, this kind of model is limited in the structures that it allows to be expressed in data. Analyzing the richly structured data that are common in computer science requires extending this approach. In this section we discuss how similar ideas can be used to define a probability distribution on infinitely-deep, infinitely-branching trees. This distribution is subsequently used as a prior distribution in a hierarchical topic model that identifies documents with paths down the tree.

A tree can be viewed as a nested sequence of partitions. We obtain a distribution on trees by generalizing the CRP to such sequences. Specifically, we define a nested Chinese restaurant process (nCRP) by imagining the following scenario for generating a sample. Suppose there are an infinite number of infinite-table Chinese restaurants in a city. One restaurant is identified as the root restaurant, and on each of its infinite tables is a card with the name of another restaurant.
FIGURE 3. A configuration of the nested Chinese restaurant process illustrated to three levels. Each box represents a restaurant with an infinite number of tables, each of which refers to a unique table in the next level of the tree. In this configuration, five tourists have visited restaurants along four unique paths. Their paths trace a subtree in the infinite tree. (Note that the configuration of customers within each restaurant can be determined by observing the restaurants chosen by customers at the next level of the tree.) In the hLDA model of Section 4, each restaurant is associated with a topic distribution β. Each document is assumed to choose its words from the topic distributions along a randomly chosen path.

On each of the tables in those restaurants are cards that refer to other restaurants, and this structure repeats infinitely many times.² Each restaurant is referred to exactly once; thus, the restaurants in the city are organized into an infinitely-branched, infinitely-deep tree. Note that each restaurant is associated with a level in this tree. The root restaurant is at level 1, the restaurants referred to on its tables' cards are at level 2, and so on.

A tourist arrives at the city for a culinary vacation. On the first evening, he enters the root Chinese restaurant and selects a table using the CRP distribution in Eq. (1). On the second evening, he goes to the restaurant identified on the first night's table and chooses a second table using a CRP distribution based on the occupancy pattern of the tables in the second night's restaurant. He repeats this process forever. After M tourists have been on vacation in the city, the collection of paths describes a random subtree of the infinite tree; this subtree has a branching factor of at most M at all nodes.
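The tourist metaphor translates directly into a sampling procedure. The sketch below is our own illustration (hypothetical names, truncated to a fixed number of evenings rather than repeated forever); it applies the CRP rule of Eq. (1) independently at each restaurant:

```python
import random

def sample_ncrp_paths(num_tourists, depth, gamma, seed=0):
    """Sample tourist paths down the nCRP tree, truncated at `depth`.

    Each node is a "restaurant", represented by the occupancy counts of
    its child tables. A tourist starts at the root and, at each level,
    chooses a table by the CRP rule of Eq. (1), then descends to the
    restaurant named on that table's card.
    """
    rng = random.Random(seed)
    counts = {}  # node (tuple of child indices) -> child-table counts
    paths = []
    for _ in range(num_tourists):
        node, path = (), []
        for _ in range(depth):
            tables = counts.setdefault(node, [])
            n = sum(tables)  # tourists previously seated here
            r = rng.random() * (gamma + n)
            choice, acc = len(tables), 0.0
            for i, w in enumerate(tables):
                acc += w
                if r < acc:
                    choice = i
                    break
            if choice == len(tables):
                tables.append(0)  # open a new table: a new subtree
            tables[choice] += 1
            path.append(choice)
            node = node + (choice,)
        paths.append(tuple(path))
    return paths
```

The set of distinct prefixes of the returned paths is exactly the random subtree described above; with M tourists no node can have more than M occupied child tables.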
See Figure 3 for an example of the first three levels from such a random tree.

² A finite-depth precursor of this model was presented in Blei et al. (2003).

There are many ways to place prior distributions on trees, and our specific choice is based on several considerations. First and foremost, a prior distribution combines with a likelihood to yield a posterior distribution, and we must be able to compute this posterior distribution. In our case, the likelihood will arise from the hierarchical topic model to be described in Section 4. As we will show in Section 5, the specific prior that we propose in this section combines with the likelihood to yield a posterior distribution that is amenable to probabilistic inference. Second, we have retained important aspects of the CRP, in particular the "preferential attachment" dynamics that are built into Eq. (1). Probability structures of this form have been used as models in a variety of applications (Barabási and Albert, 1999; Krapivsky and Redner, 2001; Albert and Barabási, 2002; Drinea et al., 2006), and the clustering that they induce makes them a reasonable starting place for a hierarchical topic model.

In fact, these two points are intimately related. The CRP yields an exchangeable distribution across partitions, i.e., the distribution is invariant to the order of the arrival of customers (Pitman, 2002). This exchangeability property makes CRP-based models amenable to posterior inference using Monte Carlo methods (Escobar and West, 1995; MacEachern and Muller, 1998; Neal, 2000).

4. HIERARCHICAL LATENT DIRICHLET ALLOCATION

The nested CRP provides a way to define a prior on tree topologies that does not limit the branching factor or depth of the trees. We can use this distribution as a component of a probabilistic topic model.
The goal of topic modeling is to identify subsets of words that tend to co-occur within documents. Some of the early work on topic modeling derived from latent semantic analysis, an application of the singular value decomposition in which "topics" are viewed post hoc as the basis of a low-dimensional subspace (Deerwester et al., 1990). Subsequent work treated topics as probability distributions over words and used likelihood-based methods to estimate these distributions from a corpus (Hofmann, 1999b). In both of these approaches, the interpretation of "topic" differs in key ways from the clustering metaphor because the same word can be given high probability (or weight) under multiple topics. This gives topic models the capability to capture notions of polysemy (e.g., "bank" can occur with high probability in both a finance topic and a waterways topic). Probabilistic topic models were given a fully Bayesian treatment in the latent Dirichlet allocation (LDA) model (Blei et al., 2003).

Topic models such as LDA treat topics as a "flat" set of probability distributions, with no direct relationship between one topic and another. While these models can be used to recover a set of topics from a corpus, they fail to indicate the level of abstraction of a topic, or how the various topics are related. The model that we present in this section builds on the nCRP to define a hierarchical topic model. This model arranges the topics into a tree, with the desideratum that more general topics should appear near the root and more specialized topics should appear near the leaves (Hofmann, 1999a). Having defined such a model, we use probabilistic inference to simultaneously identify the topics and the relationships between them.

Our approach to defining a hierarchical topic model is based on identifying documents with the paths generated by the nCRP.
We augment the nCRP in two ways to obtain a generative model for documents. First, we associate a topic, i.e., a probability distribution across words, with each node in the tree. A path in the tree thus picks out an infinite collection of topics. Second, given a choice of path, we use the GEM distribution to define a probability distribution on the topics along this path. Given a draw from a GEM distribution, a document is generated by repeatedly selecting topics according to the probabilities defined by that draw, and then drawing each word from the probability distribution defined by its selected topic.

More formally, consider the infinite tree defined by the nCRP and let c_d denote the path through that tree for the d-th customer (i.e., document). In the hierarchical LDA (hLDA) model, the documents in a corpus are assumed drawn from the following generative process:

(1) For each table k ∈ T in the infinite tree,
    (a) Draw a topic β_k ~ Dirichlet(η).
(2) For each document d ∈ {1, 2, . . . , D},
    (a) Draw c_d ~ nCRP(γ).
    (b) Draw a distribution over levels in the tree, θ_d | {m, π} ~ GEM(m, π).
    (c) For each word,
        (i) Choose level z_{d,n} | θ_d ~ Mult(θ_d).
        (ii) Choose word w_{d,n} | {z_{d,n}, c_d, β} ~ Mult(β_{c_d[z_{d,n}]}), which is parameterized by the topic in position z_{d,n} on the path c_d.

This generative process defines a probability distribution across possible corpora.

The goal of finding a topic hierarchy at different levels of abstraction is distinct from the problem of hierarchical clustering (Zamir and Etzioni, 1998; Larsen and Aone, 1999; Vaithyanathan and Dom, 2000; Duda et al., 2000; Hastie et al., 2001; Heller and Ghahramani, 2005). Hierarchical clustering treats each data point as a leaf in a tree, and merges similar data points up the tree until all are merged into a root node. Thus, internal nodes represent summaries of the data below, which, in this setting, would
yield distributions across words that share high-probability words with their children. In the hierarchical topic model, the internal nodes are not summaries of their children. Rather, the internal nodes reflect the shared terminology of the documents assigned to the paths that contain them. This can be seen in Figure 1, where the high-probability words of a node are distinct from the high-probability words of its children.

It is important to emphasize that ours is an unsupervised learning approach in which the probabilistic components that we have defined are latent variables. That is, we do not assume that topics are predefined, nor do we assume that the nested partitioning of documents or the allocation of topics to levels are predefined. We infer these entities from a Bayesian computation in which a posterior distribution is obtained from conditioning on a corpus and computing probabilities for all latent variables.

As we will see experimentally, there is statistical pressure in the posterior to place more general topics near the root of the tree and to place more specialized topics further down in the tree. To see this, note that each path in the tree includes the root node. Given that the GEM distribution tends to assign relatively large probabilities to small integers, there will be a relatively large probability for documents to select the root node when generating words. Therefore, to explain an observed corpus, the topic at the root node will place high probability on words that are useful across all the documents. Moving down in the tree, recall that each document is assigned to a single path. Thus, the first level below the root induces a coarse partition on the documents, and the topics at that level will place high probability on words that are useful within the corresponding subsets. As we move still further down, the nested partitions of documents become finer.
Consequently, the corresponding topics will be more specialized to the particular documents in those paths.

We have presented the model as a two-phase process: an infinite set of topics are generated and assigned to all of the nodes of an infinite tree, and then documents are obtained by selecting nodes in the tree and drawing words from the corresponding topics. It is also possible, however, to conceptualize a "lazy" procedure in which a topic is generated only when a node is first selected. In particular, consider an empty tree (i.e., containing no topics) and consider generating the first document. We select a path and then repeatedly select nodes along that path in order to generate words. A topic is generated at a node when that node is first selected, and subsequent selections of the node reuse the same topic.

After n words have been generated, at most n nodes will have been visited and at most n topics will have been generated. The (n + 1)-th word in the document can come from one of the previously generated topics or it can come from a new topic. Similarly, suppose that d documents have previously been generated. The (d + 1)-th document can follow one of the paths laid down by an earlier document and select only "old" topics, or it can branch off at any point in the tree and generate "new" topics along the new branch.

This discussion highlights the nonparametric nature of our model. Rather than describing a corpus by using a probabilistic model involving a fixed set of parameters, our model assumes that the number of parameters can grow as the corpus grows, both within documents and across documents. New documents can spark new subtopics or new specializations of existing subtopics. Given a corpus, this flexibility allows us to use approximate posterior inference to discover the particular tree of topics that best describes its documents.
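The "lazy" generative procedure just described can be sketched directly: paths grow through per-node CRP choices, a topic is drawn only when a node is first visited, and words are drawn through GEM level proportions. This is an illustrative sketch with a truncated depth and hypothetical names (`generate_corpus`, `gem_weights`, `dirichlet`); as an assumption, we take V_j ~ Beta((1 − m)π, mπ), matching the expectations that appear in the sampling equations of Section 5.1.

```python
import random

def dirichlet(rng, alpha, size):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(g)
    return [x / total for x in g]

def gem_weights(rng, m, pi, depth):
    """Truncated GEM(m, pi) stick-breaking weights over levels,
    with V_j ~ Beta((1 - m) * pi, m * pi) (an assumption; see lead-in)."""
    weights, stick = [], 1.0
    for _ in range(depth - 1):
        v = rng.betavariate((1.0 - m) * pi, m * pi)
        weights.append(v * stick)
        stick *= 1.0 - v
    weights.append(stick)  # remaining mass on the deepest level
    return weights

def generate_corpus(num_docs, words_per_doc, vocab_size=20, gamma=1.0,
                    eta=0.1, m=0.5, pi=10.0, depth=3, seed=0):
    """Lazy hLDA-style generation: topics are drawn only when a node
    is first visited."""
    rng = random.Random(seed)
    children = {}  # node id -> list of [child id, customer count]
    topics = {}    # node id -> distribution over the vocabulary
    next_id = 1
    corpus = []
    for _ in range(num_docs):
        # Draw the path c_d with a CRP choice at each internal node.
        path, node = [0], 0
        for _ in range(depth - 1):
            kids = children.setdefault(node, [])
            r = rng.uniform(0.0, sum(c for _, c in kids) + gamma)
            acc, chosen = 0.0, None
            for entry in kids:
                acc += entry[1]
                if r < acc:
                    entry[1] += 1
                    chosen = entry[0]
                    break
            if chosen is None:  # branch off: create a new node
                chosen = next_id
                next_id += 1
                kids.append([chosen, 1])
            path.append(chosen)
            node = chosen
        # Lazily draw a topic for each newly visited node on the path.
        for v in path:
            if v not in topics:
                topics[v] = dirichlet(rng, eta, vocab_size)
        # Draw level proportions, then a level and a word per token.
        theta = gem_weights(rng, m, pi, depth)
        doc = []
        for _ in range(words_per_doc):
            z = rng.choices(range(depth), weights=theta)[0]
            doc.append(rng.choices(range(vocab_size),
                                   weights=topics[path[z]])[0])
        corpus.append((path, doc))
    return corpus
```

Nothing is allocated ahead of time: the tree and the topic set grow only as documents demand them, which is the nonparametric behavior described above.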
It is important to note that even with this flexibility, the model still makes assumptions about the tree. Its size, shape, and character will be affected by the settings of the hyperparameters. The most influential hyperparameters in this regard are the Dirichlet parameter η for the topics and the stick-breaking parameters {m, π} for the topic proportions. The Dirichlet parameter controls the sparsity of the topics; smaller values of η will lead to topics with most of their probability mass on a small set of words. With a prior bias to sparser topics, the posterior will prefer more topics to describe a collection and thus place higher probability on larger trees. The stick-breaking parameters control how many words in the documents are likely to come from topics of varying abstractions. If we set m to be large (e.g., m = 0.5) then the posterior will more likely assign more words from each document to higher levels of abstraction. Setting π to be large (e.g., π = 100) means that word allocations will not likely deviate from such a setting.

How we set these hyperparameters depends on the goal of the analysis. When we analyze a document collection with hLDA for discovering and visualizing a hierarchy embedded within it, we might examine various settings of the hyperparameters to find a tree that meets our exploratory needs. We analyze documents with this purpose in mind in Section 6.2. In a different setting, when we are looking for a good predictive model of the data, e.g., to compare hLDA to other statistical models of text, then it makes sense to "fit" the hyperparameters by placing priors on them and computing their posterior. We describe posterior inference for the hyperparameters in Section 5.4 and analyze documents using this approach in Section 6.3.

Finally, we note that hLDA is the simplest model that exploits the nested CRP, i.e., a flexible hierarchy of distributions, in the topic modeling framework.
In a more complicated model, one could consider a variant of hLDA in which each document exhibits multiple paths through the tree. This can be modeled using a two-level distribution for word generation: first choose a path through the tree, and then choose a level for the word.

Recent extensions to topic models can also be adapted to make use of a flexible topic hierarchy. As examples, in the dynamic topic model the documents are time-stamped and the underlying topics change over time (Blei and Lafferty, 2006); in the author-topic model the authorship of the documents affects which topics they exhibit (Rosen-Zvi et al., 2004). This said, some extensions are more easily adaptable than others. In the correlated topic model, the topic proportions exhibit a covariance structure (Blei and Lafferty, 2007). This is achieved by replacing a Dirichlet distribution with a logistic normal, and the application of Bayesian nonparametric extensions is less direct.

4.1. Related work. In previous work, researchers have developed a number of methods that employ hierarchies in analyzing text data. In one line of work, the algorithms are given a hierarchy of document categories, and their goal is to correctly place documents within it (Koller and Sahami, 1997; Chakrabarti et al., 1998; McCallum et al., 1999; Dumais and Chen, 2000). Other work has focused on deriving hierarchies of individual terms using side information, such as a grammar or a thesaurus, that is sometimes available for text domains (Sanderson and Croft, 1999; Stoica and Hearst, 2004; Cimiano et al., 2005).

Our method provides still another way to employ a notion of hierarchy in text analysis. First, rather than learn a hierarchy of terms, we learn a hierarchy of topics, where a topic is a distribution over terms that describes a significant pattern of word co-occurrence in the data.
Moreover, while we focus on text, a "topic" is simply a data-generating distribution; we do not rely on any text-specific side information such as a thesaurus or grammar. Thus, by using other data types and distributions, our methodology is readily applied to biological data sets, purchasing data, collections of images, or social network data. (Note that applications in such domains have already been demonstrated for flat topic models (Pritchard et al., 2000; Marlin, 2003; Fei-Fei and Perona, 2005; Blei and Jordan, 2003; Airoldi et al., 2008).) Finally, as a Bayesian nonparametric model, our approach can accommodate future data that might lie in new and previously undiscovered parts of the tree. Previous work commits to a single fixed tree for all future data.

5. PROBABILISTIC INFERENCE

With the hLDA model in hand, our goal is to perform posterior inference, i.e., to "invert" the generative process of documents described above for estimating the hidden topical structure of a document collection. We have constructed a joint distribution of hidden variables and observations—the latent topic structure and observed documents—by combining prior expectations about the kinds of tree topologies we will encounter with a generative process for producing documents given a particular topology. We are now interested in the distribution of the hidden structure conditioned on having seen the data, i.e., the distribution of the underlying topic structure that might have generated an observed collection of documents. Finding this posterior distribution for different kinds of data and models is a central problem in Bayesian statistics. See Bernardo and Smith (1994) and Gelman et al. (1995) for general introductions to Bayesian statistics.
In our nonparametric setting, we must find a posterior distribution on countably infinite collections of objects—hierarchies, path assignments, and level allocations of words—given a collection of documents. Moreover, we need to be able to do this using the finite resources of the computer. Not surprisingly, the posterior distribution for hLDA is not available in closed form, and we must appeal to an approximation.

We develop a Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior for hLDA. In MCMC, one samples from a target distribution on a set of variables by constructing a Markov chain that has the target distribution as its stationary distribution (Robert and Casella, 2004). One then samples from the chain for sufficiently long that it approaches the target, collects the sampled states thereafter, and uses those collected states to estimate the target. This approach is particularly straightforward to apply to latent variable models, where we take the state space of the Markov chain to be the set of values that the latent variables can take on, and the target distribution is the conditional distribution of these latent variables given the observed data.

The particular MCMC algorithm that we present in this paper is a Gibbs sampling algorithm (Geman and Geman, 1984; Gelfand and Smith, 1990). In a Gibbs sampler, each latent variable is iteratively sampled conditioned on the observations and all the other latent variables. We employ collapsed Gibbs sampling (Liu, 1994), in which we marginalize out some of the latent variables to speed up the convergence of the chain. Collapsed Gibbs sampling for topic models (Griffiths and Steyvers, 2004) has been widely used in a number of topic modeling applications (McCallum et al., 2004; Rosen-Zvi et al., 2004; Mimno and McCallum, 2007; Dietz et al., 2007; Newman et al., 2006).
In hLDA, we sample the per-document paths c_d and the per-word level allocations to topics in those paths, z_{d,n}. We marginalize out the topic parameters β_i and the per-document topic proportions θ_d. The state of the Markov chain is illustrated, for a single document, in Figure 4. (The particular assignments illustrated in the figure are taken at the approximate mode of the hLDA model posterior conditioned on abstracts from the JACM.)

FIGURE 4. A single state of the Markov chain in the Gibbs sampler for the abstract of "A new approach to the maximum-flow problem" (Goldberg and Tarjan, 1986). The document is associated with a path c_d through the hierarchy, and each node in the hierarchy is associated with a distribution over terms. (The five most probable terms are illustrated.) Finally, each word w_{d,n} in the abstract is associated with a level z_{d,n} in the path through the hierarchy, with 0 being the highest level and 2 being the lowest. The Gibbs sampler iteratively draws c_d and z_{d,n} for all words in all documents (see Section 5).

Thus, we approximate the posterior p(c_{1:D}, z_{1:D} | γ, η, m, π, w_{1:D}). The hyperparameter γ reflects the tendency of the customers in each restaurant to share tables, η reflects the expected variance of the underlying topics (e.g., η ≪ 1 will tend to choose topics with fewer high-probability words), and m and π reflect our expectation about the allocation of words to levels within a document. The hyperparameters can be fixed according to the constraints of the analysis and prior expectation about the data, or inferred as described in Section 5.4.
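When the hyperparameters are inferred rather than fixed (Section 5.4), the updates interleaved with the Gibbs sweeps are of the Metropolis–Hastings type. The following single-step sketch for a positive scalar hyperparameter is our own illustrative choice (the name `mh_step` and the log-space random-walk proposal are assumptions, not the paper's implementation).

```python
import math
import random

def mh_step(current, log_posterior, rng, step=0.5):
    """One random-walk Metropolis-Hastings update for a positive scalar
    hyperparameter. The proposal is symmetric in log space, so the
    acceptance ratio picks up a Jacobian term log(proposal/current)."""
    proposal = current * math.exp(rng.gauss(0.0, step))
    log_accept = (log_posterior(proposal) - log_posterior(current)
                  + math.log(proposal / current))
    if math.log(rng.random()) < log_accept:
        return proposal
    return current  # proposal rejected; keep the current value
```

Here `log_posterior` would evaluate the log of the (unnormalized) conditional posterior of the hyperparameter given the current latent-variable state.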
Intuitively, the CRP parameter γ and the topic prior η provide control over the size of the inferred tree. For example, a model with large γ and small η will tend to find a tree with more topics. The small η encourages fewer words to have high probability in each topic; thus, the posterior requires more topics to explain the data. The large γ increases the likelihood that documents will choose new paths when traversing the nested CRP. The GEM parameter m reflects the proportion of general words relative to specific words, and the GEM parameter π reflects how strictly we expect the documents to adhere to these proportions. A larger value of π enforces the notions of generality and specificity that lead to more interpretable trees.

The remainder of this section is organized as follows. First, we outline the two main steps in the algorithm: the sampling of level allocations and the sampling of path assignments. We then combine these steps into an overall algorithm. Next, we present prior distributions for the hyperparameters of the model and describe posterior inference for the hyperparameters. Finally, we outline how to assess the convergence of the sampler and approximate the mode of the posterior distribution.

5.1. Sampling level allocations. Given the current path assignments, we need to sample the level allocation variable z_{d,n} for word n in document d from its distribution given the current values of all other variables:

(2)  p(z_{d,n} | z_{-(d,n)}, c, w, m, π, η) ∝ p(z_{d,n} | z_{d,-n}, m, π) p(w_{d,n} | z, c, w_{-(d,n)}, η),

where z_{-(d,n)} and w_{-(d,n)} are the vectors of level allocations and observed words leaving out z_{d,n} and w_{d,n}, respectively. We will use similar notation whenever items are left out from an index set; for example, z_{d,-n} denotes the level allocations in document d, leaving out z_{d,n}.

The first term in Eq. (2) is a distribution over levels.
This distribution has an infinite number of components, so we sample in stages. First, we sample from the distribution over the space of levels that are currently represented in the rest of the document, i.e., max(z_{d,-n}), and a level deeper than that level. The first components of this distribution are, for k ≤ max(z_{d,-n}),

p(z_{d,n} = k | z_{d,-n}, m, π)
  = E[ V_k ∏_{j=1}^{k-1} (1 - V_j) | z_{d,-n}, m, π ]
  = E[ V_k | z_{d,-n}, m, π ] ∏_{j=1}^{k-1} E[ 1 - V_j | z_{d,-n}, m, π ]
  = ((1 - m)π + #[z_{d,-n} = k]) / (π + #[z_{d,-n} ≥ k]) × ∏_{j=1}^{k-1} (mπ + #[z_{d,-n} > j]) / (π + #[z_{d,-n} ≥ j]),

where #[·] counts the elements of an array satisfying a given condition.

The second term in Eq. (2) is the probability of a given word based on a possible assignment. From the assumption that the topic parameters β_i are generated from a Dirichlet distribution with hyperparameter η, we obtain

(3)  p(w_{d,n} | z, c, w_{-(d,n)}, η) ∝ #[z_{-(d,n)} = z_{d,n}, c_{z_{d,n}} = c_{d,z_{d,n}}, w_{-(d,n)} = w_{d,n}] + η,

which is the smoothed frequency of seeing word w_{d,n} allocated to the topic at level z_{d,n} of the path c_d.

The last component of the distribution over topic assignments is

p(z_{d,n} > max(z_{d,-n}) | z_{d,-n}, w, m, π, η) = 1 - Σ_{j=1}^{max(z_{d,-n})} p(z_{d,n} = j | z_{d,-n}, w, m, π, η).

If the last component is sampled, then we sample from a Bernoulli distribution for increasing values of ℓ, starting with ℓ = max(z_{d,-n}) + 1, until we determine z_{d,n}:

p(z_{d,n} = ℓ | z_{d,-n}, z_{d,n} > ℓ - 1, w, m, π, η) = (1 - m) p(w_{d,n} | z, c, w_{-(d,n)}, η)
p(z_{d,n} > ℓ | z_{d,-n}, z_{d,n} > ℓ - 1) = 1 - p(z_{d,n} = ℓ | z_{d,-n}, z_{d,n} > ℓ - 1, w, m, π, η).

Note that this changes the maximum level when resampling subsequent level assignments.

5.2. Sampling paths.
Given the level allocation variables, we need to sample the path associated with each document, conditioned on all other paths and the observed words. We appeal to the fact that max(z_d) is finite, and are only concerned with paths of that length:

(4)  p(c_d | w, c_{-d}, z, η, γ) ∝ p(c_d | c_{-d}, γ) p(w_d | c, w_{-d}, z, η).

This expression is an instance of Bayes's theorem with p(w_d | c, w_{-d}, z, η) as the probability of the data given a particular choice of path, and p(c_d | c_{-d}, γ) as the prior on paths implied by the nested CRP. The probability of the data is obtained by integrating over the multinomial parameters, which gives a ratio of normalizing constants for the Dirichlet distribution:

p(w_d | c, w_{-d}, z, η) = ∏_{ℓ=1}^{max(z_d)} [ Γ(Σ_w #[z_{-d} = ℓ, c_{-d,ℓ} = c_{d,ℓ}, w_{-d} = w] + Vη) / ∏_w Γ(#[z_{-d} = ℓ, c_{-d,ℓ} = c_{d,ℓ}, w_{-d} = w] + η) ] × [ ∏_w Γ(#[z = ℓ, c_ℓ = c_{d,ℓ}, w = w] + η) / Γ(Σ_w #[z = ℓ, c_ℓ = c_{d,ℓ}, w = w] + Vη) ],

where we use the same notation for counting over arrays of variables as above. Note that the path must be drawn as a block, because its value at each level depends on its value at the previous level. The set of possible paths corresponds to the union of the set of existing paths through the tree, each represented by a leaf, with the set of possible novel paths, each represented by an internal node.

5.3. Summary of Gibbs sampling algorithm. With these conditional distributions in hand, we specify the full Gibbs sampling algorithm. Given the current state of the sampler, {c^{(t)}_{1:D}, z^{(t)}_{1:D}}, we iteratively sample each variable conditioned on the rest:

(1) For each document d ∈ {1, . . . , D},
    (a) Randomly draw c^{(t+1)}_d from Eq. (4).
    (b) Randomly draw z^{(t+1)}_{d,n} from Eq. (2) for each word n ∈ {1, . . . , N_d}.
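The two conditional distributions used in this sweep can be sketched numerically. Below, `level_prior` computes the stick-breaking expectation of Section 5.1 for a represented level k, and `path_level_log_prob` computes one level's factor of the Dirichlet-multinomial ratio above in log space. Both are illustrative sketches with hypothetical names and simplified data structures (levels are 1-indexed; counts are passed as plain lists and dicts).

```python
from math import lgamma, log

def level_prior(k, z_rest, m, pi):
    """First factor of Eq. (2): prior mass on level k (1-indexed) given
    the document's other level allocations z_rest, via the posterior
    expectations of V_k and (1 - V_j) under GEM(m, pi)."""
    eq = lambda j: sum(1 for z in z_rest if z == j)   # #[z = j]
    ge = lambda j: sum(1 for z in z_rest if z >= j)   # #[z >= j]
    gt = lambda j: sum(1 for z in z_rest if z > j)    # #[z > j]
    p = ((1.0 - m) * pi + eq(k)) / (pi + ge(k))
    for j in range(1, k):
        p *= (m * pi + gt(j)) / (pi + ge(j))
    return p

def path_level_log_prob(doc_counts, other_counts, eta, vocab_size):
    """One level's factor of p(w_d | c, w_-d, z, eta) in log space.
    doc_counts[w]   = count of word w in document d at this level;
    other_counts[w] = count of word w at the candidate node from all
                      other documents."""
    n_other = sum(other_counts.values())
    n_total = n_other + sum(doc_counts.values())
    ll = (lgamma(n_other + vocab_size * eta)
          - lgamma(n_total + vocab_size * eta))
    for w, c in doc_counts.items():
        prev = other_counts.get(w, 0)
        ll += lgamma(prev + c + eta) - lgamma(prev + eta)
    return ll
```

With no counts, `level_prior` reduces to the geometric prior (1 − m)m^{k−1}, and the counts tilt it toward levels already used in the document; the log-space form of the path term avoids overflow in the Gamma functions.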
The stationary distribution of the corresponding Markov chain is the conditional distribution of the latent variables in the hLDA model given the corpus. After running the chain for sufficiently many iterations that it can approach its stationary distribution (the "burn-in"), we can collect samples at intervals selected to minimize autocorrelation, and approximate the true posterior with the corresponding empirical distribution.

Although this algorithm is guaranteed to converge in the limit, it is difficult to say something more definitive about the speed of the algorithm independent of the data being analyzed. In hLDA, we sample a leaf from the tree for each document c_d and a level assignment for each word z_{d,n}. As described above, the number of items from which each is sampled depends on the current state of the hierarchy and the other level assignments in the document. Two data sets of equal size may induce different trees and yield different running times for each iteration of the sampler. For the corpora analyzed below in Section 6.2, the Gibbs sampler averaged 0.001 seconds per document for the JACM data and Psychological Review data, and 0.006 seconds per document for the Proceedings of the National Academy of Sciences data.^3

^3 Timings were measured with the Gibbs sampler running on a 2.2GHz Opteron 275 processor.

FIGURE 5. (Left) The complete log likelihood of Eq. (5) for the first 2000 iterations of the Gibbs sampler run on the JACM corpus of Section 6.2. (Right) The autocorrelation function (ACF) of the complete log likelihood (with confidence interval) for the remaining 8000 iterations.
The autocorrelation decreases rapidly as a function of the lag between samples.

5.4. Sampling the hyperparameters. The values of the hyperparameters are generally unknown a priori. We include them in the inference process by endowing them with prior distributions,

m ~ Beta(α_1, α_2)
π ~ Exponential(α_3)
γ ~ Gamma(α_4, α_5)
η ~ Exponential(α_6).

These priors also contain parameters ("hyper-hyperparameters"), but the resulting inferences are less influenced by these hyper-hyperparameters than they would be by fixing the original hyperparameters to specific values (Bernardo and Smith, 1994). To incorporate this extension into the Gibbs sampler, we interleave Metropolis–Hastings (MH) steps between iterations of the Gibbs sampler to obtain new values of m, π, γ, and η. This preserves the integrity of the Markov chain, although the chain may mix more slowly than the collapsed Gibbs sampler without the MH updates (Robert and Casella, 2004).

5.5. Assessing convergence and approximating the mode. Practical applications must address the issue of approximating the mode of the distribution on trees and assessing convergence of the Markov chain. We can obtain information about both by examining the log probability of each sampled state. For a particular sample, i.e., a configuration of the latent variables, we compute the log probability of that configuration and the observations, conditioned on the hyperparameters:

(5)  L^{(t)} = log p(c^{(t)}_{1:D}, z^{(t)}_{1:D}, w_{1:D} | γ, η, m, π).

With this statistic, we can approximate the mode of the posterior by choosing the state with the highest log probability. Moreover, we can assess convergence of the chain by examining the autocorrelation of L^{(t)}. Figure 5 (right) illustrates the autocorrelation as a function of the number of iterations between samples (the "lag") when modeling the JACM corpus described in Section 6.2.
The chain was run for 10,000 iterations; 2000 iterations were discarded as burn-in. Figure 5 (left) illustrates Eq. (5) for the burn-in iterations.

Gibbs samplers stochastically climb the posterior distribution surface to find an area of high posterior probability, and then explore its curvature through sampling. In practice, one usually restarts this procedure a handful of times and chooses the local mode with the highest posterior probability (Robert and Casella, 2004). Despite the lack of theoretical guarantees, Gibbs sampling is appropriate for the kind of data analysis for which hLDA and many other latent variable models are tailored. Rather than trying to understand the full surface of the posterior, the goal of latent variable modeling is to find a useful representation of complicated high-dimensional data, and a local mode of the posterior found by Gibbs sampling often provides such a representation. In the next section, we will assess hLDA qualitatively, through visualization of summaries of the data, and quantitatively, by using the latent variable representation to provide a predictive model of text.

6. EXAMPLES AND EMPIRICAL RESULTS

We present experiments analyzing both simulated and real text data to demonstrate the application of hLDA and its corresponding Gibbs sampler.

6.1. Analysis of simulated data. In Figure 6, we depict the hierarchies and allocations for ten simulated data sets drawn from an hLDA model. For each data set, we draw 100 documents of 250 words each. The vocabulary size is 100, and the hyperparameters are fixed at η = 0.005 and γ = 1. In these simulations, we truncated the stick-breaking procedure at three levels,
and simply took a Dirichlet distribution over the proportions of words allocated to those levels. The resulting hierarchies shown in Figure 6 illustrate the range of structures on which the prior assigns probability.

FIGURE 6. Inferring the mode of the posterior hierarchy from simulated data: for each of the ten data sets, the true hierarchy ("True dataset hierarchy") is shown alongside the inferred posterior mode ("Posterior mode"). See Section 6.1.

In the same figure, we illustrate the estimated mode of the posterior distribution across the hierarchy and allocations for the ten data sets. We recover the correct hierarchies exactly, with only two errors. In one case, the error is a single wrongly allocated path. In the other case, the inferred mode has higher posterior probability than the true tree structure (due to finite data).

In general, we cannot expect to always find the exact tree; this depends on the size of the data set, and on how identifiable the topics are. Our choice of small η yields topics that are relatively sparse and (probably) very different from each other. Trees will not be as easy to identify in data sets which exhibit polysemy and similarity between topics.

6.2. Hierarchy discovery in scientific abstracts.
Given a document collection, one is typically interested in examining the underlying tree of topics at the mode of the posterior. As described above, our inferential procedure yields a tree structure by assembling the unique subset of paths contained in {c_1, . . . , c_D} at the approximate mode of the posterior.

For a given tree, we can examine the topics that populate the tree. Given the assignment of words to levels and the assignment of documents to paths, the probability of a particular word at a particular node is roughly proportional to the number of times that word was generated by the topic at that node. More specifically, the mean probability of a word w in a topic at level ℓ of path p is given by

(6)  p(w | z, c, w, η) = (#[z = ℓ, c = p, w = w] + η) / (#[z = ℓ, c = p] + Vη).

Using these quantities, the hLDA model can be used for analyzing collections of scientific abstracts, recovering the underlying hierarchical structure appropriate to a collection, and visualizing that hierarchy of topics for a better understanding of the structure of the corpora. We demonstrate the analysis of three different collections of journal abstracts under hLDA.

In these analyses, as above, we truncate the stick-breaking procedure at three levels, facilitating visualization of the results. The topic Dirichlet hyperparameters were fixed at η = {2.0, 1.0, 0.5}, which encourages many terms in the high-level distributions, fewer terms in the mid-level distributions, and still fewer terms in the low-level distributions. The nested CRP parameter γ was fixed at 1.0. The GEM parameters were fixed at m = 0.5 and π = 100. This strongly biases the level proportions to place more mass at the higher levels of the hierarchy.

In Figure 1, we illustrate the approximate posterior mode of a hierarchy estimated from a collection of 536 abstracts from the JACM.
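Eq. (6) is straightforward to compute from a sampled state. The sketch below (the name `posterior_word_prob` and the representation of a state as (word, level, path) triples are our own illustrative choices) returns the smoothed relative frequency of a word at a node.

```python
def posterior_word_prob(word, level, path, tokens, eta, vocab_size):
    """Posterior mean of Eq. (6): the smoothed relative frequency of
    `word` among all tokens allocated to `level` of `path`.
    `tokens` holds one (word, level, path) triple per corpus token."""
    at_node = [w for (w, z, c) in tokens if z == level and c == path]
    return (at_node.count(word) + eta) / (len(at_node) + vocab_size * eta)
```

Ranking a node's vocabulary by this quantity yields the top-word lists shown in the figures; the probabilities over the full vocabulary sum to one at each node.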
The tree structure illustrates the ensemble of paths assigned to the documents. In each node, we illustrate the top five words sorted by expected posterior probability, computed from Eq. (6). Several leaves are annotated with document titles. For each leaf, we chose the five documents assigned to its path that have the highest numbers of words allocated to the bottom level.

The model has found the function words in the data set, assigning words like "the," "of," "or," and "and" to the root topic. In its second level, the posterior hierarchy appears to have captured some of the major subfields in computer science, distinguishing between databases, algorithms, programming languages, and networking. In the third level, it further refines those fields. For example, it delineates between the verification area of networking and the queuing area.

In Figure 7, we illustrate an analysis of a collection of 1,272 psychology abstracts from Psychological Review from 1967 to 2003. Again, we have discovered an underlying hierarchical structure of the field. The top node contains the function words; the second level delineates between large subfields such as behavioral, social, and cognitive psychology; the third level further refines those subfields.

Finally, in Figure 8, we illustrate a portion of the analysis of a collection of 12,913 abstracts from the Proceedings of the National Academy of Sciences from 1991 to 2001. An underlying hierarchical structure of the content of the journal has been discovered, dividing articles into groups such as neuroscience, immunology, population genetics, and enzymology.

In all three of these examples, the same posterior inference algorithm with the same hyperparameters yields very different tree structures for different corpora. Models of fixed tree structure force us to commit to one in advance of seeing the data.
The nested Chinese restaurant process at the heart of hLDA provides a flexible solution to this difficult problem.

6.3. Comparison to LDA. In this section we present experiments comparing hLDA to its non-hierarchical precursor, LDA. We use the infinite-depth hLDA model; the per-document distribution over levels is not truncated. We use predictive held-out likelihood to compare the two approaches quantitatively, and we present examples of LDA topics in order to provide a qualitative comparison of the methods. LDA has been shown to yield good predictive performance relative to competing unigram language models, and it has also been argued that the topic-based analysis provided by LDA represents a qualitative improvement on competing language models (Blei et al., 2003; Griffiths and Steyvers, 2006). Thus LDA provides a natural point of comparison.

There are several issues that must be borne in mind in comparing hLDA to LDA. First, in LDA the number of topics is a fixed parameter, and a model selection procedure is required to choose the number of topics.
(A Bayesian nonparametric solution to this can be obtained with the hierarchical Dirichlet process (Teh et al., 2007).) Second, given a set of topics, LDA places no constraints on the usage of the topics by documents in the corpus; a document can place an arbitrary probability distribution on the topics. In hLDA, on the other hand, a document can only access the topics that lie along a single path in the tree. In this sense, LDA is significantly more flexible than hLDA.

[Figure 7 appears here.] FIGURE 7. A portion of the hierarchy learned from the 1,272 abstracts of Psychological Review from 1967–2003. The vocabulary was restricted to the 1,971 terms that occurred in more than five documents, yielding a corpus of 136K words. The learned hierarchy, of which only a portion is illustrated, contains 52 topics.

[Figure 8 appears here.] FIGURE 8. A portion of the hierarchy learned from the 12,913 abstracts of the Proceedings of the National Academy of Sciences from 1991–2001. The vocabulary was restricted to the 7,200 terms that occurred in more than five documents, yielding a corpus of 2.3M words. The learned hierarchy, of which only a portion is illustrated, contains 56 topics. Note that the γ parameter is fixed at a smaller value, to provide a reasonably sized topic hierarchy with the significantly larger corpus.
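The difference in flexibility can be made concrete with a toy sketch of the per-document topic support under each model. The tree, its path list, and the three-level truncation below are illustrative assumptions, not the paper's learned structures, and a symmetric Dirichlet stands in for the truncated GEM draw over levels:

```python
import numpy as np

rng = np.random.default_rng(0)

# LDA: a document may place an arbitrary distribution over all K topics.
K = 9
lda_theta = rng.dirichlet(np.ones(K))

# hLDA: a toy three-level tree; a document first picks one root-to-leaf
# path, then mixes only over the three topics lying on that path.
tree_paths = [(0, 1, 3), (0, 1, 4), (0, 2, 5)]    # node ids along each path
path = tree_paths[rng.integers(len(tree_paths))]
level_props = rng.dirichlet(np.ones(3))            # stand-in for the GEM draw
hlda_theta = dict(zip(path, level_props))          # support: 3 nodes, not K
```

The document-level constraint is visible in the support sizes: the LDA document can use all nine topics, while the hLDA document is restricted to the three topics on its sampled path.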
This flexibility of LDA implies that for large corpora we can expect LDA to dominate hLDA in terms of predictive performance (assuming that the model selection problem is resolved satisfactorily and assuming that hyperparameters are set in a manner that controls overfitting). Thus, rather than trying to simply optimize for predictive performance within the hLDA family and within the LDA family, we have instead opted to first run hLDA to obtain a posterior distribution over the number of topics, and then to conduct multiple runs of LDA for a range of topic cardinalities bracketing the hLDA result. This provides an hLDA-centric assessment of the consequences (for predictive performance) of using a hierarchy versus a flat model.

We used predictive held-out likelihood as a measure of performance. The procedure is to divide the corpus into D_1 observed documents and D_2 held-out documents, and approximate the conditional probability of the held-out set given the training set

(7)    p(w^held-out_1, ..., w^held-out_{D_2} | w^obs_1, ..., w^obs_{D_1}, M),

where M represents a model, either LDA or hLDA. We employed collapsed Gibbs sampling for both models and integrated out all the hyperparameters with priors. We used the same prior for those hyperparameters that exist in both models.

To approximate this predictive quantity, we run two samplers. First, we collect 100 samples from the posterior distribution of latent variables given the observed documents, taking samples 100 iterations apart and using a burn-in of 2000 samples. For each of these outer samples, we collect 800 samples of the latent variables given the held-out documents and approximate their conditional probability given the outer sample with the harmonic mean (Kass and Raftery, 1995). Finally, these conditional probabilities are averaged to obtain an approximation to Eq. (7).
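The harmonic-mean step and the outer averaging can be sketched as follows. This is a numerically stable log-space version of the estimator of Kass and Raftery (1995); the function names are ours, and it is an illustration of the averaging scheme rather than the paper's code:

```python
import numpy as np

def log_harmonic_mean(log_liks):
    """Harmonic-mean estimate of log p(held-out words | outer sample):
    log [ S / sum_s exp(-log_liks[s]) ], computed stably in log space,
    where log_liks are per-inner-sample held-out log likelihoods."""
    log_liks = np.asarray(log_liks, dtype=float)
    neg = -log_liks
    m = neg.max()
    lse = m + np.log(np.exp(neg - m).sum())       # logsumexp(-log_liks)
    return np.log(log_liks.size) - lse

def heldout_log_lik(inner_log_liks_per_outer):
    """Average the per-outer-sample conditional probabilities (in log
    space) to approximate the predictive quantity in Eq. (7)."""
    ests = np.array([log_harmonic_mean(ll) for ll in inner_log_liks_per_outer])
    m = ests.max()
    return m + np.log(np.exp(ests - m).mean())    # log of the mean probability
```

Each inner list corresponds to one outer posterior sample; the outer function then averages the resulting conditional probability estimates, matching the two-sampler procedure described above.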
Figure 9 illustrates the five-fold cross-validated held-out likelihood for hLDA and LDA on the JACM corpus. The figure also provides a visual indication of the mean and variance of the posterior distribution over topic cardinality for hLDA; the mode is approximately a hierarchy with 140 topics. For LDA, we plot the predictive likelihood in a range of topics around this value.

[Figure 9 appears here; it plots mean held-out log likelihood against the number of topics for both models.] FIGURE 9. The held-out predictive log likelihood for hLDA compared to the same quantity for LDA as a function of the number of topics. The shaded blue region is centered at the mean number of topics in the hierarchies found by hLDA (and has width equal to twice the standard error).

[Figure 10 appears here.] FIGURE 10. The five most probable words for each of ten randomly chosen topics from an LDA model fit to fifty topics.

We see that at each fixed topic cardinality in this range of topics, hLDA provides significantly better predictive performance than LDA. As discussed above, we eventually expect LDA to dominate hLDA for large numbers of topics. In a large range near the hLDA mode, however, the constraint that documents pick topics along single paths in a hierarchy yields superior performance. This suggests that the hierarchy is useful not only for interpretation, but also for capturing predictive statistical structure.
To give a qualitative sense of the relative degree of interpretability of the topics that are found using the two approaches, Figure 10 illustrates ten LDA topics chosen randomly from a 50-topic model. As these examples make clear, the LDA topics are generally less interpretable than the hLDA topics. In particular, function words are given high probability throughout. In practice, to sidestep this issue, corpora are often stripped of function words before fitting an LDA model. While this is a reasonable ad hoc solution for (English) text, it is not a general solution that can be used for non-text corpora, such as visual scenes. Even more importantly, there is no notion of abstraction in the LDA topics. The notion of multiple levels of abstraction requires a model such as hLDA.

In summary, if interpretability is the goal, then there are strong reasons to prefer hLDA to LDA. If predictive performance is the goal, then hLDA may well remain the preferred method if there is a constraint that a relatively small number of topics should be used. When there is no such constraint, LDA may be preferred. These comments also suggest, however, that an interesting direction for further research is to explore the feasibility of a model that combines the defining features of the LDA and hLDA models. As we described in Section 4, it may be desirable to consider an hLDA-like hierarchical model that allows each document to exhibit multiple paths along the tree. This might be appropriate for collections of long documents, such as full-text articles, which tend to be more heterogeneous than short abstracts.

7. DISCUSSION

In this paper, we have shown how the nested Chinese restaurant process can be used to define prior distributions on recursive data structures.
We have also shown how this prior can be combined with a topic model to yield a Bayesian nonparametric methodology for analyzing document collections in terms of hierarchies of topics. Given a collection of documents, we use MCMC sampling to learn an underlying thematic structure that provides a useful abstract representation for data visualization and summarization. We emphasize that no knowledge of the topics of the collection or the structure of the tree is needed to infer a hierarchy from data. We have demonstrated our methods on collections of abstracts from three different scientific journals, showing that while the content of these different domains can vary significantly, the statistical principles behind our model make it possible to recover meaningful sets of topics at multiple levels of abstraction, organized in a tree.

The Bayesian nonparametric framework underlying our work makes it possible to define probability distributions and inference procedures over countably infinite collections of objects. There has been other recent work in artificial intelligence in which probability distributions are defined on infinite objects via concepts from first-order logic (Milch et al., 2005; Pasula and Russell, 2001; Poole, 2007). While providing an expressive language, this approach does not necessarily yield structures that are amenable to efficient posterior inference. Our approach reposes instead on combinatorial structure, namely the exchangeability of the Dirichlet process as a distribution on partitions, and this leads directly to a posterior inference algorithm that can be applied effectively to large-scale learning problems.

The hLDA model draws on two complementary insights, one from statistics, the other from computer science.
From statistics, we take the idea that it is possible to work with general stochastic processes as prior distributions, thus accommodating latent structures that vary in complexity. This is the key idea behind Bayesian nonparametric methods. In recent years, these models have been extended to include spatial models (Duan et al., 2007) and grouped data (Teh et al., 2007), and Bayesian nonparametric methods now enjoy new applications in computer vision (Sudderth et al., 2005), bioinformatics (Xing et al., 2007), and natural language processing (Li et al., 2007; Teh et al., 2007; Goldwater et al., 2006b,a; Johnson et al., 2007; Liang et al., 2007).

From computer science, we take the idea that the representations we infer from data should be richly structured, yet admit efficient computation. This is a growing theme in Bayesian nonparametric research. For example, one line of recent research has explored stochastic processes involving multiple binary features rather than clusters (Griffiths and Ghahramani, 2006; Thibaux and Jordan, 2007; Teh et al., 2007). A parallel line of investigation has explored alternative posterior inference techniques for Bayesian nonparametric models, providing more efficient algorithms for extracting this latent structure. Specifically, variational methods, which replace sampling with optimization, have been developed for Dirichlet process mixtures to further increase their applicability to large-scale data analysis problems (Blei and Jordan, 2005; Kurihara et al., 2007).

The hierarchical topic model that we explored in this paper is just one example of how this synthesis of statistics and computer science can produce powerful new tools for the analysis of complex data. However, this example showcases the two major strengths of the Bayesian nonparametric approach.
First, the use of the nested CRP means that the model does not start with a fixed set of topics or hypotheses about their relationship, but grows to fit the data at hand. Thus, we learn a topology but do not commit to it; the tree can grow as new documents about new topics and subtopics are observed. Second, despite the fact that this results in a very rich hypothesis space, containing trees of arbitrary depth and branching factor, it is still possible to perform approximate probabilistic inference using a simple algorithm. This combination of flexible, structured representations and efficient inference makes nonparametric Bayesian methods uniquely promising as a formal framework for learning with flexible data structures.

Acknowledgments. We thank Edo Airoldi for providing the PNAS data, and we thank three anonymous reviewers for their insightful comments. David M. Blei is supported by ONR 175-6343, NSF CAREER 0745520, and grants from Google and Microsoft Research. Thomas L. Griffiths is supported by NSF grant BCS-0631518 and the DARPA CALO project. Michael I. Jordan is supported by grants from Google and Microsoft Research.

REFERENCES

AIROLDI, E., BLEI, D., FIENBERG, S., AND XING, E. 2008. Mixed membership stochastic blockmodels. Journal of Machine Learning Research 9, 1981–2014.
ALBERT, R. AND BARABASI, A. 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 1, 47–97.
ALDOUS, D. 1985. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour, XIII—1983. Springer, Berlin, Germany, 1–198.
ANTONIAK, C. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2, 1152–1174.
BARABASI, A. AND REKA, A. 1999. Emergence of scaling in random networks. Science 286, 5439, 509–512.
BERNARDO, J. AND SMITH, A. 1994. Bayesian Theory. John Wiley & Sons Ltd., Chichester, UK.
BILLINGSLEY, P. 1995. Probability and Measure. Wiley-Interscience, New York, NY.
BLEI, D., GRIFFITHS, T., JORDAN, M., AND TENENBAUM, J. 2003. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 17–24.
BLEI, D. AND JORDAN, M. 2003. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 127–134.
BLEI, D. AND JORDAN, M. 2005. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis 1, 121–144.
BLEI, D. AND LAFFERTY, J. 2006. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York, NY, 113–120.
BLEI, D. AND LAFFERTY, J. 2007. A correlated topic model of Science. Annals of Applied Statistics 1, 17–35.
BLEI, D. AND LAFFERTY, J. 2009. Topic models. In Text Mining: Theory and Applications. Taylor and Francis, London, UK.
BLEI, D., NG, A., AND JORDAN, M. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.
CHAKRABARTI, S., DOM, B., AGRAWAL, R., AND RAGHAVAN, P. 1998. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal 7, 163–178.
CIMIANO, P., HOTHO, A., AND STAAB, S. 2005. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research 24, 305–339.
DEERWESTER, S., DUMAIS, S., LANDAUER, T., FURNAS, G., AND HARSHMAN, R. 1990. Indexing by latent semantic analysis. Journal of the American Society of Information Science 6, 391–407.
DEVROYE, L., GYÖRFI, L., AND LUGOSI, G. 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY.
DIETZ, L., BICKEL, S., AND SCHEFFER, T. 2007. Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning. ACM Press, New York, NY, 233–240.
DRINEA, E., ENACHESU, M., AND MITZENMACHER, M. 2006. Variations on random graph models for the web. Tech. Rep. TR-06-01, Harvard University.
DUAN, J., GUINDANI, M., AND GELFAND, A. 2007. Generalized spatial Dirichlet process models. Biometrika 94, 809–825.
DUDA, R., HART, P., AND STORK, D. 2000. Pattern Classification. Wiley-Interscience, New York, NY.
DUMAIS, S. AND CHEN, H. 2000. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 256–263.
ESCOBAR, M. AND WEST, M. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577–588.
FEI-FEI, L. AND PERONA, P. 2005. A Bayesian hierarchical model for learning natural scene categories. IEEE Computer Vision and Pattern Recognition, 524–531.
FERGUSON, T. 1973. A Bayesian analysis of some nonparametric problems. Annals of Statistics 1, 209–230.
GELFAND, A. AND SMITH, A. 1990. Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, 398–409.
GELMAN, A., CARLIN, J., STERN, H., AND RUBIN, D. 1995. Bayesian Data Analysis. Chapman & Hall, London, UK.
GEMAN, S. AND GEMAN, D. 1984. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
GOLDBERG, A. AND TARJAN, R. 1986. A new approach to the maximum flow problem. Journal of the Association for Computing Machinery 35, 4, 921–940.
GOLDWATER, S., GRIFFITHS, T., AND JOHNSON, M. 2006a. Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, 673–680.
GOLDWATER, S., GRIFFITHS, T., AND JOHNSON, M. 2006b. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 459–467.
GRIFFITHS, T. AND GHAHRAMANI, Z. 2006. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 475–482.
GRIFFITHS, T. AND STEYVERS, M. 2004. Finding scientific topics. Proceedings of the National Academy of Science 101, 5228–5235.
GRIFFITHS, T. AND STEYVERS, M. 2006. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds. Erlbaum, Hillsdale, NJ.
HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. 2001. The Elements of Statistical Learning. Springer, New York, NY.
HELLER, K. AND GHAHRAMANI, Z. 2005. Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning. ACM Press, Cambridge, MA, 297–304.
HJORT, N., HOLMES, C., MÜLLER, P., AND WALKER, S. 2009. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, Cambridge, UK.
HOFMANN, T. 1999a. The cluster-abstraction model: Unsupervised learning of topic hierarchies from text data. In Proceedings of the 15th International Joint Conferences on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, 682–687.
HOFMANN, T. 1999b. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 50–57.
JOHNSON, M., GRIFFITHS, T., AND GOLDWATER, S. 2007. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 641–648.
JOHNSON, N. AND KOTZ, S. 1977. Urn Models and Their Applications: An Approach to Modern Discrete Probability Theory. Wiley, New York, NY.
JORDAN, M. I. 2000. Graphical models. Statistical Science 19, 140–155.
KASS, R. AND RAFTERY, A. 1995. Bayes factors. Journal of the American Statistical Association 90, 773–795.
KOLLER, D. AND SAHAMI, M. 1997. Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA, 170–178.
KRAPIVSKY, P. AND REDNER, S. 2001. Organization of growing random networks. Physical Review E 63, 6.
KURIHARA, K., WELLING, M., AND VLASSIS, N. 2007. Accelerated variational Dirichlet process mixtures. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 761–768.
LARSEN, B. AND AONE, C. 1999. Fast and effective text mining using linear-time document clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 16–22.
LAURITZEN, S. L. 1996. Graphical Models. Oxford University Press, Oxford, UK.
LI, W., BLEI, D., AND MCCALLUM, A. 2007. Nonparametric Bayes pachinko allocation. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence. AUAI Press, Menlo Park, CA.
LIANG, P., PETROV, S., KLEIN, D., AND JORDAN, M. 2007. The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, Stroudsburg, PA, 688–697.
LIU, J. 1994. The collapsed Gibbs sampler in Bayesian computations with application to a gene regulation problem. Journal of the American Statistical Association 89, 958–966.
MACEACHERN, S. AND MULLER, P. 1998. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics 7, 223–238.
MARLIN, B. 2003. Modeling user rating profiles for collaborative filtering. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 627–634.
MCCALLUM, A., CORRADA-EMMANUEL, A., AND WANG, X. 2004. The author-recipient-topic model for topic and role discovery in social networks: Experiments with Enron and academic email. Tech. rep., University of Massachusetts, Amherst.
MCCALLUM, A., NIGAM, K., RENNIE, J., AND SEYMORE, K. 1999. Building domain-specific search engines with machine learning techniques. In Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace. AAAI Press, Menlo Park, CA.
MILCH, B., MARTHI, B., SONTAG, D., ONG, D., AND KOLOBOV, A. 2005. Approximate inference for infinite contingent Bayesian networks. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics. The Society for Artificial Intelligence and Statistics, NJ.
MIMNO, D. AND MCCALLUM, A. 2007. Organizing the OCA: Learning faceted subjects from a library of digital books. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM Press, New York, NY, 376–385.
NEAL, R. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 249–265.
NEWMAN, D., CHEMUDUGUNTA, C., AND SMYTH, P. 2006. Statistical entity-topic models. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 680–686.
PASULA, H. AND RUSSELL, S. 2001. Approximate inference for first-order probabilistic languages. In Proceedings of the 17th International Joint Conferences on Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, 741–748.
PITMAN, J. 2002. Combinatorial Stochastic Processes. Lecture Notes for St. Flour Summer School. Springer-Verlag, New York, NY.
POOLE, D. 2007. Logical generative models for probabilistic reasoning about existence, roles and identity. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA, 1271–1279.
PRITCHARD, J., STEPHENS, M., AND DONNELLY, P. 2000. Inference of population structure using multilocus genotype data. Genetics 155, 945–959.
ROBERT, C. AND CASELLA, G. 2004. Monte Carlo Statistical Methods. Springer-Verlag, New York, NY.
ROSEN-ZVI, M., GRIFFITHS, T., STEYVERS, M., AND SMYTH, P. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, Menlo Park, CA, 487–494.
SANDERSON, M. AND CROFT, B. 1999. Deriving concept hierarchies from text. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, 206–213.
SETHURAMAN, J. 1994. A constructive definition of Dirichlet priors. Statistica Sinica 4, 639–650.
STOICA, E. AND HEARST, M. 2004. Nearly-automated metadata hierarchy creation. In Companion Proceedings of HLT-NAACL. Boston, MA.
SUDDERTH, E., TORRALBA, A., FREEMAN, W., AND WILLSKY, A. 2005. Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 1297–1306.
TEH, Y., GORUR, D., AND GHAHRAMANI, Z. 2007. Stick-breaking construction for the Indian buffet process. In Proceedings of the 11th International Workshop on Artificial Intelligence and Statistics. The Society for Artificial Intelligence and Statistics, NJ.
TEH, Y., JORDAN, M., BEAL, M., AND BLEI, D. 2007. Hierarchical Dirichlet processes. Journal of the American Statistical Association 101, 1566–1581.
THIBAUX, R. AND JORDAN, M. 2007. Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Workshop on Artificial Intelligence and Statistics. The Society for Artificial Intelligence and Statistics, NJ.
TITTERINGTON, D., SMITH, A., AND MAKOV, E. 1985. Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester, UK.
VAITHYANATHAN, S. AND DOM, B. 2000. Model-based hierarchical clustering. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, 599–608.
XING, E., JORDAN, M., AND SHARAN, R. 2007. Bayesian haplotype inference via the Dirichlet process. Journal of Computational Biology 14, 267–284.
ZAMIR, O. AND ETZIONI, O. 1998. Web document clustering: A feasibility demonstration. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York, NY, 46–54.
