Continuous-time Infinite Dynamic Topic Models
Authors: Wesam Elshamy
CONTINUOUS-TIME INFINITE DYNAMIC TOPIC MODELS

by

WESAM SAMY ELSHAMY

B.S., Ain Shams University, Egypt, 2004
M.Sc., Cairo University, Egypt, 2007

AN ABSTRACT OF A DISSERTATION

submitted in partial fulfillment of the requirements for the degree

DOCTOR OF PHILOSOPHY

Department of Computing and Information Sciences
College of Engineering

KANSAS STATE UNIVERSITY
Manhattan, Kansas
2013

Abstract

Topic models are probabilistic models for discovering topical themes in collections of documents. In real-world applications, these models provide us with the means of organizing what would otherwise be unstructured collections. They can help us cluster a huge collection into different topics or find a subset of the collection that resembles the topical theme found in an article at hand.

The first wave of topic models developed were able to discover the prevailing topics in a big collection of documents spanning a period of time. It was later realized that these time-invariant models were not capable of modeling 1) the time-varying number of topics they discover and 2) the time-changing structure of these topics. Few models were developed to address these two deficiencies. The online hierarchical Dirichlet process models the documents with a time-varying number of topics, and it varies the structure of the topics over time as well; however, it relies on document order, not timestamps, to evolve the model over time. The continuous-time dynamic topic model evolves topic structure in continuous time; however, it uses a fixed number of topics over time.

In this dissertation, I present a model, the continuous-time infinite dynamic topic model, that combines the advantages of these two models: 1) the online hierarchical Dirichlet process, and 2) the continuous-time dynamic topic model.
More specifically, the model I present is a probabilistic topic model that does the following: 1) it changes the number of topics over continuous time, and 2) it changes the topic structure over continuous time. I compared the model I developed with the two other models under different setting values. The results obtained were favorable to my model and showed the need for having a model with a continuous-time-varying number of topics and topic structure.

Approved by: Major Professor William Henry Hsu

Copyright Wesam Samy Elshamy 2013

Table of Contents

Table of Contents
List of Tables
List of Figures
Acknowledgements
1 Introduction
  1.1 Goal
    1.1.1 Problem statement
    1.1.2 Central thesis
    1.1.3 Technical objectives
  1.2 Existing approaches: atemporal models
    1.2.1 Latent semantic analysis
    1.2.2 Probabilistic latent semantic analysis
    1.2.3 Latent Dirichlet allocation and topic models
  1.3 Significance
2 Bayesian models and inference algorithms
  2.1 Introduction
    2.1.1 Factorized distributions
    2.1.2 Variational parameters
  2.2 Variational inference
3 Existing solutions and their limitations
  3.1 Temporal topic models
  3.2 Topics over time
  3.3 Discrete-time infinite dynamic topic model
  3.4 Online hierarchical Dirichlet processes
  3.5 Continuous-time dynamic topic model
4 Continuous-time infinite dynamic topic model
  4.1 Dim sum process
  4.2 Mathematical model
5 Testbed development
  5.1 News stories corpus
  5.2 Corpora
    5.2.1 Reuters
    5.2.2 BBC News
  5.3 Evaluation of proposed solution
6 Experimental design and results
  6.1 Reuters corpus
  6.2 BBC news corpus
    6.2.1 Why ciDTM outperforms cDTM
    6.2.2 Why ciDTM outperforms oHDP
  6.3 Scalability
    6.3.1 Scalability of ciDTM and oHDP vs cDTM
  6.4 Timeline construction
    6.4.1 Why ciDTM is better than oHDP in timeline construction
Bibliography
A Probability distributions
  A.1 Uniform
  A.2 Bernoulli
  A.3 Binomial
  A.4 Multinomial
  A.5 Beta
  A.6 Dirichlet
B Dirichlet Process
C Graphical models
  C.1 Inference algorithms
    C.1.1 Example
D Information theory
  D.1 Entropy
  D.2 Mutual information
  D.3 Information divergence
E Table of notations

List of Tables

3.1 Table of notations for iDTM
3.2 Table of notations for cDTM
6.1 Setting values used for oHDP
6.2 Setting values used for ciDTM
6.3 Confusion matrix for timeline construction using oHDP
6.4 Confusion matrix for timeline construction using ciDTM
E.1 Table of notations

List of Figures

1.1 Comparison of cDTM, oHDP, and the proposed ciDTM
1.2 Topic state-transition diagram
1.3 Latent Dirichlet allocation (LDA) graphical model
1.4 Two-dimensional visualization of 50 documents from the NIPS dataset
2.1 Variational transformation of the logarithmic function
3.1 Topics-over-time graphical model
3.2 Graphical model of the infinite dynamic topic model
3.3 Graphical model of the continuous-time dynamic topic model
4.1 Continuous-time infinite dynamic topic model
4.2 Block diagram of the ciDTM workflow for a single document
5.1 Birth/death/resurrection of five topics
6.1 cDTM performance for different numbers of topics (Reuters corpus)
6.2 oHDP per-word log-likelihood for different batch sizes (Reuters corpus)
6.3 ciDTM per-word log-likelihood for different batch sizes (Reuters corpus)
6.4 Comparison of ciDTM, oHDP and cDTM with their best settings (Reuters corpus)
6.5 Per-word log-likelihood of cDTM, oHDP and ciDTM on the Reuters-21578 corpus [47]
6.6 cDTM performance for different numbers of topics (BBC news corpus)
6.7 oHDP per-word log-likelihood for different batch sizes (BBC news corpus)
6.8 ciDTM per-word log-likelihood for different batch sizes (BBC news corpus)
6.9 Comparison of ciDTM, oHDP and cDTM with their best settings (BBC news corpus)
6.10 Wall-clock running time of cDTM, oHDP and ciDTM on subsets of the Reuters corpus
6.11 oHDP timeline construction for the Arab Spring topic
6.12 ciDTM timeline construction for the Arab Spring topic
C.1 Directed graphical model
C.2 Undirected graphical model
C.3 A skip-chain CRF model for an NER task

Acknowledgments

I am greatly thankful to all those who helped me or gave me advice while working on this dissertation and the research that led to it. Specifically, I would like to thank my advisor William Hsu for guiding and supporting me since I came to Kansas State University over five years ago. He gave me the right amount of freedom to explore different problems and different approaches to solve them. His guidance and feedback motivated and inspired me. I learned a lot from Bill throughout those years.
I wish to thank Tracey Hsu (Bill's wife) for proofreading and editing my dissertation. She is very attentive to details, and my dissertation is in better shape now thanks to her.

Dr. Doina Caragea's advice helped me refine the work I presented in this dissertation. I would like to thank her for that, and for the advice and guidance she gave me while working on an independent study class with her.

I was fortunate to have the help of undergraduate programmers in our group. Xinghuang Leon Hsu developed the crawler and parser used for crawling the BBC News corpus. He spent precious hours and days working on it, making sure the corpus had what I needed.

I enjoyed having good discussions with Surya Teja Kallumadi over the past few years. We discussed research and non-research problems. I want to thank him for hosting me in the summer of 2011 in Boston.

Special thanks to my beautiful wife Rachel, who believed I would finish my PhD one day. She did not give up even though that day was a moving target, always moving forward.

Finally, I am greatly indebted to my parents. Without their love and support throughout my life I could not have accomplished this.

Chapter 1
Introduction

1.1 Goal

1.1.1 Problem statement

In this thesis, I develop a continuous-time dynamic topic model. This model is an extension of latent Dirichlet allocation (LDA), which is atemporal. I add two temporal components to it: i) the word distribution per topic, and ii) the number of topics. Both evolve in continuous time.

There exist similar temporal dynamic topic models. Ahmed and Xing [2] presented a model where the word distribution per topic and the number of topics evolve in discrete time. Bayesian inference using Markov chain Monte Carlo techniques in this model is feasible when the time step is big enough; it was efficiently applied to a system with 13 time steps [2].
Increasing the time granularity of the model dramatically increases the number of latent variables, making inference prohibitively expensive. On the other hand, Wang et al. [72] presented a model where the word distribution per topic evolves in continuous time. They used variational Bayes methods for fast inference. A big limitation of their model is that it uses a predefined and fixed number of topics that does not evolve over time. In this thesis I develop and use a topic model that is a mixture of these two models, in which the word distribution per topic and the number of topics both evolve in continuous time.

The need for the dynamic continuous-time topic model I am developing is evident in the mass media business, where news stories are published round-the-clock¹. To provide the news reader with broad context and a rich reading experience, many news outlets accompany the stories they present with a list of related stories from news archives. As these archives are typically huge, manual search for these related stories is infeasible. Having the archived stories categorized into geographical areas or news topics such as politics and sports may help a little: a news editor can comb through a category looking for relevant stories. However, relevant stories may cross category boundaries. Keyword searching the entire archive may not be effective either; it returns stories based on keyword mentions, not the topics the stories cover.

A dynamic continuous-time topic model can be efficiently used to find relevant stories. It can track a topic over time, even as a story develops and the word distribution associated with its topic evolves. It can detect the number of topics discussed in the news over time and fine-tune the model accordingly.

Goal: To develop a continuous-time topic model in which the word distribution per topic and the number of topics evolve in continuous time.
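The kind of continuous-time drift in the per-topic word distribution described above can be sketched with Brownian motion, the mechanism used by the continuous-time dynamic topic model [72]: between two irregularly spaced document timestamps, a topic's natural parameters drift with variance proportional to the elapsed time, and a softmax maps them back to a distribution over words. This is only an illustrative sketch, not the dissertation's implementation; the vocabulary size, the variance rate `sigma2`, and the function names are assumptions.

```python
import numpy as np

def softmax(x):
    """Map unconstrained natural parameters to a word distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

def evolve_topic(beta, dt, rng, sigma2=0.05):
    """One Brownian-motion step for a topic's natural parameters:
    beta_t | beta_s ~ Normal(beta_s, sigma2 * (t - s) * I),
    so the drift variance grows with the elapsed time dt = t - s."""
    return beta + rng.normal(0.0, np.sqrt(sigma2 * dt), size=beta.shape)

rng = np.random.default_rng(42)
beta = np.zeros(10)                # one topic over a 10-word vocabulary
for dt in [0.5, 3.0, 0.1]:         # irregular gaps between document timestamps
    beta = evolve_topic(beta, dt, rng)
    word_dist = softmax(beta)      # the topic's word distribution at this time
    assert abs(word_dist.sum() - 1.0) < 1e-9
```

Because the variance scales with `dt`, the model handles uneven document timestamps directly, with no need to discretize time into epochs.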
¹ Agence France-Presse (AFP) releases on average one news story every 20 seconds (5,000 per day), while Reuters releases 800,000 English-language news stories annually. See: http://www.afp.com/en/agency/afp-in-numbers and http://www.infotoday.com/it/apr01/news6.htm

1.1.2 Central thesis

A continuous-time topic model with an evolving number of topics and a dynamic word distribution per topic (ciDTM) can be built using a Wiener process to model the dynamic topics in a hierarchical Dirichlet process [68]. This model cannot be efficiently emulated by the discrete-time infinite dynamic topic model when the topics it models evolve at different speeds: using a time granularity fine enough to fit the fastest-evolving topic would make inference in such a system impractical. On the other hand, ciDTM cannot be emulated by a continuous-time dynamic topic model, as that would require the use of a fixed number of topics, which is a model parameter. Apart from the problem of finding an appropriate value for this parameter to start with, the value remains fixed over time, leading to the potential problem of having multiple topics merge into one topic, or one topic split into multiple topics, to keep the overall number of topics fixed.

For the problem of creating news timelines, ciDTM is more efficient and less computationally expensive than using a discrete-time unbounded-topic model [2] with a very small time unit (epoch), and more accurate than using a continuous-time bounded-topic model [72]. This is illustrated in Figure 1.1.

Figure 1.1: Top left: the continuous-time dynamic topic model (cDTM) has a continuous-time domain; word and topic distributions evolve in continuous time. The number of topics in this model is fixed, though. This may lead to two separate topics being merged into one topic, which was the case in the first topic from below. Top right: the online hierarchical Dirichlet process (oHDP) based topic model evolves the number of topics over time. The five documents belonging to the first and second topics from below were associated with their respective topics and were not merged into one topic, which was the case with cDTM on the left. However, the model has a discrete-time domain, which practically limits the degree of time granularity we can use; very small time steps would make inference in the model prohibitively expensive. Bottom: the continuous-time infinite dynamic topic model (ciDTM) is a mixture of the oHDP and cDTM. It has a continuous-time domain like cDTM, and its number of topics evolves over time as in oHDP. It overcomes the limitations of both models regarding evolution of the number of topics and time-step granularity.

1.1.3 Technical objectives

A news timeline is characterized by a mixture of topics. When a news story arrives at the system, it should be placed on the appropriate timeline if one exists; otherwise a new timeline will be created. A set of related stories, which may not belong to the same timeline as the arriving story, is also created. In this process, care should be taken to avoid some pitfalls: if the dynamic number of topics generated by the model becomes too small, some news timelines may get conflated and distances between news stories may get distorted; if the dynamic number of topics becomes too large, then the number of variational parameters of the model will explode and the inference algorithm will become prohibitively expensive to run.

In a topic model system that receives a text stream, a topic has two dormancy states: Active and Dead. Transition between these two states is illustrated in the state-transition diagram in Figure 1.2.
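This two-state topic life cycle can be sketched as a small state machine. A minimal sketch, assuming a lazy check of the timer against document timestamps; the class name `TopicState`, the `ttl` value, and the method names are illustrative, not part of the dissertation's system.

```python
ACTIVE, DEAD = "Active", "Dead"

class TopicState:
    """Topic dormancy tracker: a topic stays Active while relevant
    documents keep arriving, transitions to Dead when its ActiveTimer
    expires, and is resurrected by the next relevant document."""

    def __init__(self, ttl=30.0):
        self.ttl = ttl                 # ActiveTimer duration (e.g., in days)
        self.state = ACTIVE            # start state: a newly born topic is Active
        self.timer_expiry = ttl        # timer starts at birth (time 0)

    def on_document(self, t, relevant):
        # Event: ActiveTimer expired (checked lazily at each arrival time t).
        if self.state == ACTIVE and t > self.timer_expiry:
            self.state = DEAD
        # Event: relevant document / Action: (re)start ActiveTimer.
        if relevant:
            self.state = ACTIVE
            self.timer_expiry = t + self.ttl
        return self.state

topic = TopicState(ttl=10.0)
assert topic.on_document(t=5.0, relevant=True) == ACTIVE    # timer reset
assert topic.on_document(t=40.0, relevant=False) == DEAD    # timer expired
assert topic.on_document(t=41.0, relevant=True) == ACTIVE   # resurrected
```

Non-relevant documents leave a Dead topic Dead and an unexpired Active topic Active, matching the self-loops in the diagram.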
When a new topic is born, it enters the Active state and a timer (ActiveTimer) starts. The timer is reset whenever a document relevant to the topic is received and processed by the system. When the timer expires, the topic transitions to the Dead state. The topic remains in this state as long as the system receives and processes non-relevant documents. If a relevant document is received and processed by the system, the topic transitions to the Active state, and the timer (ActiveTimer) is started.

Figure 1.2: Topic state-transition diagram: oval-shaped nodes represent states, with their names typed inside them. Arrows represent transitions from one state to another, or to itself. Each arrow has an Event / Action label describing the event that triggered the transition and the action taken by the system in response to that transition. The start state is represented by an arrow with no origin pointing to the state.

1.2 Existing approaches: atemporal models

The volume of digitized knowledge has been increasing at an unprecedented rate recently. The past decade has witnessed the rise of projects like JSTOR [67], Google Books [21], the Internet Archive [32] and Europeana [31] that scan and index printed books, documents, newspapers, photos, paintings and maps. The volume of data that is born in digital format is increasing even more rapidly, whether it is mainly unstructured user-generated content, such as in social media, or content generated by news portals. New, innovative ways were developed for categorizing, searching and presenting the digital material [17, 28, 48].
The usefulness of these services to the end user, which are free in most cases, is evident by their ever increasing utilization in learning, en tertainmen t and decision- making. A do cumen t collection can b e indexed and k eyword searc h can b e used to search for a do cumen t in it. Ho wev er, most keyw ord searc h algorithms are context-free and lac k the abilit y to categorize do cuments based on their topic or find do cumen ts related to one of in terest to us. The large v olume of do cument collections precludes manual tec hniques of 5 annotation or categorization of the text. The pro cess of finding a set of do cuments related to a do cument at hand can b e au- tomated using topic models [ 8 ]. Most of these existing models are atemp oral and p erform badly in finding old relev ant stories b ecause the w ord collections that iden tifies the topics that these stories co v er c hange o v er time and the model do es not account for that c hange. Moreo ver, these mo dels assume the n umber of topics is fixed o ver time, whereas in realit y this n umber is constan tly changing. Some famous examples of these mo dels that w ere widely used in the past are latent semantic analysis (LSA) [ 46 ], probabilistic LSA (pLSA) [ 35 ] and laten t Dirichlet allo cation (LDA) [ 13 ]. 1.2.1 Laten t seman tic analysis Laten t semantic analysis (LSA) is a deterministic technique for analyzing the relationship b et ween do cumen ts or b et ween w ords [ 46 ]. Like most topic mo dels it treats a do cumen t as a bag of w ords; a do cument is a v ector of term frequency ignoring term sequence. It is built on the idea that w ords that tend to app ear in the same do cument together tend to carry similar meanings. In this do cument analysis technique, which is sometimes known as laten t semantic indexing (LSI), the v ery large, noisy and sparse term-do cumen t frequency matrix gets its rank low ered using matrix factorization techniques such as singular v alue decomp osition. 
In the lower-rank approximation matrix, each document is represented by a number of concepts in lieu of terms.

LSA is known for its inability to distinguish the different meanings of a polysemous word and to place synonymous words closer together in the semantic space. Different occurrences of a polysemous word could have different meanings based on their context. However, since LSA treats each word as a point in space, that point will be the semantic average of the different meanings of the word in the document. Different words with the same meaning (synonyms) represent another challenge to LSA in information retrieval applications. If a document is made out of the synonyms of terms found in another document, the correlation between these two documents over their terms using LSA will not be as high as we would expect it to be.

1.2.2 Probabilistic latent semantic analysis

To overcome some of the LSA limitations, we can condition the occurrence of a word within a document on a latent topic variable. This leads us to probabilistic latent semantic analysis (pLSA). This model overcomes the discrepancy between the joint Gaussian distribution assumed between terms and documents in LSA and the Poisson distribution observed between them. However, pLSA is not a proper generative model for new documents, and the number of parameters in its model grows linearly with the number of documents. To overcome the generative process issue, a Dirichlet prior can be used for the document topic distribution, leading us to latent Dirichlet allocation, which is the basis for topic models.

1.2.3 Latent Dirichlet allocation and topic models

Topic models are probabilistic models that capture the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original text [12, 13].
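The rank-lowering step described in Section 1.2.1 can be sketched with a truncated SVD. The tiny term-document matrix below is a made-up illustration, not data from this dissertation: documents 0 and 1 share vocabulary, while document 2 uses different terms.

```python
import numpy as np

# Toy term-document frequency matrix: rows are terms, columns are documents.
X = np.array([[2., 1., 0.],
              [1., 2., 0.],
              [0., 0., 3.],
              [0., 1., 2.]])

# Full SVD, then keep the top-k singular values to lower the rank.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation

# Each document is now described by k "concepts" instead of raw terms.
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]      # shape: k x num_documents

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 1 end up closer in concept space than documents 0 and 2.
sim_01 = cosine(doc_concepts[:, 0], doc_concepts[:, 1])
sim_02 = cosine(doc_concepts[:, 0], doc_concepts[:, 2])
```

Comparing documents in this low-rank concept space, rather than over raw term vectors, is what lets LSA relate documents that share meaning but not exact vocabulary.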
By discovering word patterns in the collection and establishing a pattern similarity measure, similar documents can be linked together and semantically categorized based on their topic. A graphical model for latent Dirichlet allocation is shown in Figure 1.3 [8]. The rectangles, known as plates in this notation, represent replication. The plate with multiplicity D denotes a collection of documents. Each of these documents is made of a collection of words represented by the plate with multiplicity N. The plate with multiplicity K represents a collection of topics. The shaded node W represents a word, which is the only observed random variable in the model. The non-shaded nodes represent latent random variables in the model; α is the Dirichlet prior parameter for the topic distribution per document. θ is a topic distribution for a document, while Z is the topic sampled from θ for word W. β is a Markov matrix giving the word distribution per topic, and η is the Dirichlet prior parameter used in generating that matrix.

Figure 1.3: Latent Dirichlet allocation (LDA) graphical model. The rectangles, known as plates in this notation, represent replication. The plate with multiplicity D denotes a collection of documents. Each of these documents is made of a collection of words represented by the plate with multiplicity N. The plate with multiplicity K represents a collection of topics. The shaded node W represents a word, which is the only observed random variable in the model. The non-shaded nodes represent latent random variables in the model; α is the Dirichlet prior parameter for the topic distribution per document. θ is a topic distribution for a document, while Z is the topic sampled from θ for word W. β is a Markov matrix giving the word distribution per topic, and η is the Dirichlet prior parameter used in generating that matrix.
Shaded nodes are observed random variables, while non-shaded nodes are latent random variables. Nodes are labeled with the variable they represent. The rectangles are known as plates in this notation and they represent replication. For nested plates, an inner plate has multiplicity equal to the replication value shown on its lower-right corner times the multiplicity of its parent.

The reason for choosing the Dirichlet distribution to model the distribution of topics in a document and to model the distribution of words in a topic [15] is that the Dirichlet distribution is a convenient distribution on the simplex [13], it is in the exponential family [70], it has finite dimensional sufficient statistics [16], and it is conjugate to the multinomial distribution [60]. These properties help in developing an inference procedure and parameter estimation, as pointed out by Blei et al. [13].

1.3 Significance

Even though news tracker and news aggregator systems have been used for a few years at a commercial scale for web news portals and news websites, most of them only provide relevant stories from the near past. It is understood that this is done to limit the number of relevant stories, but at the same time it casts doubt over the performance of these systems when they try to dig for related stories that are a few years old and therefore have different word distributions for the same topic.
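The LDA generative process behind the graphical model of Figure 1.3 can be sketched as a short simulation. The corpus sizes and hyperparameter values below are arbitrary illustrative choices, not values from this dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 8, 5, 20   # topics, vocabulary size, documents, words per doc
alpha, eta = 0.5, 0.5      # symmetric Dirichlet hyperparameters (illustrative)

# beta[k] is the word distribution for topic k, drawn from Dirichlet(eta).
beta = rng.dirichlet([eta] * V, size=K)

docs = []
for d in range(D):
    theta = rng.dirichlet([alpha] * K)           # per-document topic mixture
    z = rng.choice(K, size=N, p=theta)           # topic for each word position
    w = [rng.choice(V, p=beta[zn]) for zn in z]  # word drawn from its topic
    docs.append(w)
```

Inference reverses this process: given only the observed words `docs`, it recovers posterior estimates of θ, Z and β.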
Figure 1.4: Visualization of a subset of 50 documents in two dimensions from the NIPS dataset. Each rectangle represents a document with its title inside of the rectangle. Distance between two documents is inversely proportional to their similarity. The symmetrized Kullback-Leibler divergence is used to evaluate the distance between pairs of documents.

Topic models not only help us automate the process of text categorization and search, they enable us to analyze text in a way that cannot be done manually. Using topic models we can see how topics evolve over time [2], how different topics are correlated with each other [10], and how this correlation changes over time.
We can project the documents onto the topic space and take a bird's eye view to understand and analyze their distribution. We can zoom in to analyze the main themes of a topic, or zoom out to get a broader view of where this topic sits among other related topics. It should be noted that topic modeling algorithms do all that with no prior knowledge of the existence or the composition of any topic, and without text annotation.

The successful utilization of topic modeling in text encouraged researchers to explore other application domains for it. It has been used in software analysis [49], as it can automatically measure, monitor and better understand software content, complexity and temporal evolution. Topic models were used to "improve the classification of protein-protein interactions by condensing lexical knowledge available in unannotated biomedical text into a semantically-informed kernel smoothing matrix" [57]. In the field of signal processing, topic models were used in audio scene understanding [39], where audio signals were assumed to contain latent topics that generate acoustic words describing an audio scene. They have also been used in text summarization [22], building semantic question answering systems [18], stock market modeling [24] and music analysis [36].

Using variational methods for approximate Bayesian inference in the developed hierarchical Dirichlet allocation model for the dynamic continuous-time topic model will facilitate inference in models with a higher number of latent topic variables. Other than the obvious application of the timeline creation system in retrieving a set of documents relevant to a document at hand, topic models can be used as a visualization technique. The user can view the timeline at different scale levels and understand how different events temporally unfolded.
Chapter 2

Bayesian models and inference algorithms

Evaluation of the posterior distribution p(Z|X) of the set of latent variables Z given the observed data variable set X is essential in topic modeling applications [6]. This evaluation is infeasible in real-life applications of practical interest due to the high number of latent variables we need to deal with [14]. For example, the time complexity of the junction tree algorithm is exponential in the size of the maximal clique in the junction tree. Expectation evaluation with respect to such a highly complex posterior distribution would be analytically intractable [70].

As with many engineering problems, when finding an exact solution is not possible or too expensive to obtain, we resort to approximate methods. Even in some cases when the exact solution can be obtained, we might favor an approximate solution because the benefit of reaching the exact solution does not justify the extra cost spent to reach it. When the nodes or node clusters of the graphical model are almost conditionally independent, or when the node probabilities can be determined by a subset of their neighbors, an approximate solution will suffice for all practical purposes [55]. Approximate inference methods fall broadly under two categories, stochastic and deterministic [41].

Stochastic methods, such as Markov chain Monte Carlo (MCMC) methods, can theoretically reach an exact solution in limited time given unlimited computational power [29]. In practice, the quality of the approximation obtained is tied to the available computational power. Even though these methods are easy to implement and therefore widely used, they are computationally expensive.
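As a concrete instance of the stochastic family, a minimal random-walk Metropolis sampler for a one-dimensional target is sketched below. The standard-normal target is an arbitrary stand-in for an intractable posterior, and the step size and sample count are illustrative choices.

```python
import math
import random

random.seed(0)

def log_target(z):
    """Unnormalized log density of the stand-in target: N(0, 1)."""
    return -0.5 * z * z

def metropolis(n_samples, step=1.0, z0=0.0):
    """Random-walk Metropolis: propose z' ~ N(z, step^2), accept with
    probability min(1, p(z') / p(z))."""
    z, samples = z0, []
    for _ in range(n_samples):
        z_new = z + random.gauss(0.0, step)
        if math.log(random.random()) < log_target(z_new) - log_target(z):
            z = z_new                 # accept the proposal
        samples.append(z)             # a rejected proposal repeats z
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With enough samples, the empirical mean and variance approach the target's values (0 and 1 here), illustrating how approximation quality is tied to computational budget.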
Deterministic approximation methods, like variational methods, make simplifying assumptions on the form of the posterior distribution or the way the nodes of the graphical model can be factorized [37]. These methods therefore can never reach an exact solution, even with unlimited computational resources.

2.1 Introduction

As I stated earlier, the main problem in graphical model applications is finding an approximation for the posterior distribution p(Z|X) and the model evidence p(X), where X = {x_1, ..., x_N} is the set of all observed variables and Z = {z_1, ..., z_N} is the set of all latent variables and model parameters. p(X) can be decomposed using [6]:

\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q \,\|\, p)    (2.1)

where

\mathcal{L}(q) = \int q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z})}{q(\mathbf{Z})} \, d\mathbf{Z}    (2.2)

\mathrm{KL}(q \,\|\, p) = -\int q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X})}{q(\mathbf{Z})} \, d\mathbf{Z}    (2.3)

Our goal is to find a distribution q(Z) that is as close as possible to p(Z|X). To do so, we need to minimize the Kullback-Leibler (KL) divergence between them [43]. Minimizing this measure while keeping the left-hand side fixed means maximizing the lower bound L(q) on the log marginal probability. The approximation in finding such a distribution arises from the set of restrictions we put on the family of distributions we pick q from. This family of distributions has to be rich enough to allow us to include a distribution that is close enough to our target posterior distribution, yet the distributions have to be tractable [6]. This problem can be transformed into a non-linear optimization problem if we use a parametric distribution q(Z|ω), where ω is its set of parameters. L(q) becomes a function of ω and the problem can be solved using non-linear optimization methods such as Newton or quasi-Newton methods [63].
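The decomposition in (2.1) can be verified numerically on a toy discrete model, where the integrals reduce to sums. The joint probabilities and the choice of q below are arbitrary illustration values; the identity holds for any valid q.

```python
import math

# Toy discrete model: latent Z in {0, 1, 2} with one fixed observation X.
# p_joint[z] = p(X, Z = z); the numbers are arbitrary illustration values.
p_joint = [0.10, 0.25, 0.15]
p_X = sum(p_joint)                       # model evidence p(X)
p_post = [p / p_X for p in p_joint]      # posterior p(Z | X)

# Any valid q(Z) satisfies ln p(X) = L(q) + KL(q || p).
q = [0.5, 0.3, 0.2]

L = sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))    # eq. (2.2)
KL = -sum(qz * math.log(pz / qz) for qz, pz in zip(q, p_post))   # eq. (2.3)
```

Since KL(q‖p) is always non-negative, L(q) is indeed a lower bound on ln p(X), and maximizing L(q) over q while ln p(X) stays fixed drives q toward the posterior.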
2.1.1 Factorized distributions

Instead of restricting the form of the family of distributions we want to pick q(Z) from, we can make assumptions on the way it can be factored; that is, we can make some independence assumptions [70]. Let us say that our set of latent variables Z can be factorized according to [37]:

q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i)    (2.4)

This approximation method is known in the physics domain as mean field theory [4]. To maximize the lower bound L(q), we optimize each one of the factors q_i(Z_i) in turn. We can do so by substituting (2.4) into (2.2) to get the following [6]:

\mathcal{L}(q) = \int \prod_i q_i \left( \ln p(\mathbf{X}, \mathbf{Z}) - \sum_i \ln q_i \right) d\mathbf{Z}    (2.5)

= \int q_j \left\{ \int \ln p(\mathbf{X}, \mathbf{Z}) \prod_{i \neq j} q_i \, d\mathbf{Z}_i \right\} d\mathbf{Z}_j - \int q_j \ln q_j \, d\mathbf{Z}_j + c    (2.6)

= \int q_j \ln \tilde{p}(\mathbf{X}, \mathbf{Z}_j) \, d\mathbf{Z}_j - \int q_j \ln q_j \, d\mathbf{Z}_j + c    (2.7)

where

\ln \tilde{p}(\mathbf{X}, \mathbf{Z}_j) = \mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \mathbf{Z})] + c    (2.8)

and

\mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \mathbf{Z})] = \int \ln p(\mathbf{X}, \mathbf{Z}) \prod_{i \neq j} q_i \, d\mathbf{Z}_i    (2.9)

where E_{i≠j} is the expectation with respect to q over all z_i such that i ≠ j. Since (2.7) is a negative KL divergence between q_j(Z_j) and p̃(X, Z_j), we can maximize L(q) with respect to q_j(Z_j) and obtain [6]:

\ln q_j^{\star}(\mathbf{Z}_j) = \mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \mathbf{Z})] + c    (2.10)

To get rid of the constant, we can take the exponential of both sides of this equation to get:

q_j^{\star}(\mathbf{Z}_j) = \frac{\exp\left(\mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \mathbf{Z})]\right)}{\int \exp\left(\mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \mathbf{Z})]\right) d\mathbf{Z}_j}    (2.11)

Example

For illustration, let us take a simple example of the use of a factorized distribution in the variational approximation of the parameters of a simple distribution [50]: the univariate Gaussian distribution over x. In this example, we will infer the posterior distribution of its mean (μ) and precision (τ = σ^{-2}), given a data set D = {x_1, ..., x_N} of i.i.d.
observed values with a likelihood:

p(\mathcal{D} \mid \mu, \tau) = \left( \frac{\tau}{2\pi} \right)^{N/2} \exp\left\{ -\frac{\tau}{2} \sum_{n=1}^{N} (x_n - \mu)^2 \right\}    (2.12)

with Gaussian and Gamma conjugate prior distributions for μ and τ as follows:

p(\mu \mid \tau) = \mathcal{N}\!\left(\mu \mid \mu_0, (\lambda_0 \tau)^{-1}\right)    (2.13)

p(\tau) = \mathrm{Gamma}(\tau \mid a_0, b_0)    (2.14)

By using a factorized variational approximation, we can assume the posterior distribution factorizes according to:

q(\mu, \tau) = q_\mu(\mu) \, q_\tau(\tau)    (2.15)

We can evaluate the optimum for each of the factors using (2.10). The optimal factor for the mean is:

\ln q_\mu^{\star}(\mu) = -\frac{\mathbb{E}[\tau]}{2} \left\{ \lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^{N} (x_n - \mu)^2 \right\} + c    (2.16)

which is the Gaussian N(μ | μ_N, λ_N^{-1}) with mean and precision given by:

\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N}    (2.17)

\lambda_N = (\lambda_0 + N) \, \mathbb{E}[\tau]    (2.18)

And the optimal factor for the precision is:

\ln q_\tau^{\star}(\tau) = \left( a_0 - 1 + \frac{N}{2} \right) \ln \tau - b_0 \tau - \frac{\tau}{2} \mathbb{E}_\mu\!\left[ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right] + c    (2.19)

which is a Gamma distribution Gamma(τ | a_N, b_N) with parameters:

a_N = a_0 + \frac{N}{2}    (2.20)

b_N = b_0 + \frac{1}{2} \mathbb{E}_\mu\!\left[ \sum_{n=1}^{N} (x_n - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right]    (2.21)

As we can see from (2.18) and (2.21), the optimal distributions for the mean and precision factors depend on expectations that are functions of the other variable. Their values can be evaluated iteratively by first assuming an initial value, say for E[τ], and using it to evaluate q_μ(μ). From q_μ(μ) we can extract E[μ] and E[μ^2] and use them to evaluate q_τ(τ), which in turn yields a new revised value for E[τ] that can be used to update q_μ(μ); the cycle continues until the values converge to the optimum.

Figure 2.1: Variational transformation of the logarithmic function. The transformed function (λx − log λ − 1) forms an upper bound for the original function.
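The iterative scheme just described can be sketched directly from equations (2.17)-(2.21). The synthetic data, prior hyperparameters and iteration count below are arbitrary illustration choices; the expectation E[(x_n − μ)^2] is expanded using E[μ] and E[μ^2] from the current q_μ.

```python
import random
import statistics

# Synthetic data from a Gaussian with known mean and precision.
random.seed(1)
true_mu, true_tau = 2.0, 4.0      # precision tau = 1 / sigma^2
data = [random.gauss(true_mu, true_tau ** -0.5) for _ in range(500)]

N = len(data)
xbar = statistics.fmean(data)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0   # illustrative prior hyperparameters

E_tau = 1.0                               # initial guess for E[tau]
for _ in range(50):
    # q_mu update, equations (2.17)-(2.18)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N

    # q_tau update, equations (2.20)-(2.21), with E[(x - mu)^2] expanded
    a_N = a0 + N / 2.0
    b_N = b0 + 0.5 * (sum(x * x - 2 * x * E_mu + E_mu2 for x in data)
                      + lam0 * (E_mu2 - 2 * E_mu * mu0 + mu0 ** 2))
    E_tau = a_N / b_N                     # mean of Gamma(a_N, b_N)
```

After convergence, μ_N and E[τ] should sit close to the generating values (2.0 and 4.0 here), up to sampling noise.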
2.1.2 Variational parameters

Variational methods transform a complex problem into a simpler form by decoupling the degrees of freedom of the problem through the addition of variational parameters. For example, we can transform the logarithm function as follows [37]:

\log(x) = \min_{\lambda} \left\{ \lambda x - \log \lambda - 1 \right\}    (2.22)

Here I introduced the variational parameter λ, for which we seek the value that minimizes the right-hand side. As we can see in Figure 2.1, for different values of λ there is a line tangent to the concave logarithmic function, and the set of lines formed by varying λ forms a family of upper bounds for the logarithmic function. Therefore,

\log(x) \leq \lambda x - \log \lambda - 1 \quad \forall \lambda    (2.23)

2.2 Variational inference

In this section, I use variational inference to find an approximation to the true posterior of the latent topic structure [76]: the topic distribution per word, the topic distribution per document, and the word distribution over topics. I use variational Kalman filtering [38] in continuous time for this problem. The variational distribution over the latent variables can be factorized as follows:

q(\beta_{1:T}, z_{1:T,1:N}, \theta_{1:T} \mid \hat{\beta}, \phi, \gamma) = \prod_{k=1}^{K} q(\beta_{1,k}, \ldots, \beta_{T,k} \mid \hat{\beta}_{1,k}, \ldots, \hat{\beta}_{T,k}) \times \prod_{t=1}^{T} \left( q(\theta_t \mid \gamma_t) \prod_{n=1}^{N_t} q(z_{t,n} \mid \phi_{t,n}) \right)    (2.24)

where β is the word distribution over topics and β_{1:T, z_{1:T}, 1:N} is the word distribution over topics for times 1:T, topics z_{1:T} and word indices 1:N, where N is the size of the dictionary. In equation (2.24), γ_t is a Dirichlet parameter at time t for the multinomial per-document topic distribution θ_t, and φ_{t,n} is a multinomial parameter at time t for word n for the topic z_{t,n}. {β̂_{1,k}, ..., β̂_{T,K}} are Gaussian variational observations for the Kalman filter [72].
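The transformation in (2.22) can be checked numerically: the bound λx − log λ − 1 holds for every positive λ and becomes tight (zero gap) at the minimizing value λ = 1/x. The test point x = 2.5 and the λ grid are arbitrary.

```python
import math

def upper_bound(x, lam):
    """Variational upper bound on log(x) from equation (2.22)."""
    return lam * x - math.log(lam) - 1.0

x = 2.5
# The bound holds for every positive lambda ...
gaps = [upper_bound(x, lam) - math.log(x) for lam in (0.1, 0.5, 1.0, 2.0, 7.0)]
# ... and is tight at the minimizing value lambda = 1/x, since
# (1/x) * x - log(1/x) - 1 = log(x).
tight_gap = upper_bound(x, 1.0 / x) - math.log(x)
```

This is the decoupling at work: for fixed λ the bound is linear in x, which is the simpler form the variational method trades the logarithm for.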
In discrete-time topic models, a topic at time t is represented by a distribution over all terms in the dictionary, including terms not observed at that time instance. This leads to high memory requirements, especially when the time granularity gets finer. In my model, I use sparse variational inference [30], in which a topic at time t is represented by a multinomial distribution over the terms observed at that time instance; variational observations are only made for observed words. The probability of the variational observation β̂_{t,w} given β_{t,w} is Gaussian [72]:

p(\hat{\beta}_{t,w} \mid \beta_{t,w}) = \mathcal{N}(\beta_{t,w}, \hat{v}_t)    (2.25)

I use the forward-backward algorithm [59] for inference in the sparse variational Kalman filter. The variational forward distribution p(β_{t,w} | β̂_{1:t,w}) is Gaussian [11]:

p(\beta_{t,w} \mid \hat{\beta}_{1:t,w}) = \mathcal{N}(m_{t,w}, V_{t,w})    (2.26)

where

m_{t,w} = \mathbb{E}(\beta_{t,w} \mid \hat{\beta}_{1:t,w}) = \frac{\hat{v}_t}{V_{t-1,w} + v\Delta s_t + \hat{v}_t} \, m_{t-1,w} + \left( 1 - \frac{\hat{v}_t}{V_{t-1,w} + v\Delta s_t + \hat{v}_t} \right) \hat{\beta}_{t,w}    (2.27)

V_{t,w} = \mathbb{E}\left( (\beta_{t,w} - m_{t,w})^2 \mid \hat{\beta}_{1:t,w} \right) = \frac{\hat{v}_t}{V_{t-1,w} + v\Delta s_t + \hat{v}_t} \left( V_{t-1,w} + v\Delta s_t \right)    (2.28)

Similarly, the backward distribution p(β_{t,w} | β̂_{1:T,w}) is Gaussian [11]:

p(\beta_{t,w} \mid \hat{\beta}_{1:T,w}) = \mathcal{N}(\tilde{m}_{t,w}, \tilde{V}_{t,w})    (2.29)

where

\tilde{m}_{t-1,w} = \mathbb{E}(\beta_{t-1,w} \mid \hat{\beta}_{1:T}) = \frac{v\Delta s_t}{V_{t-1,w} + v\Delta s_t} \, m_{t-1,w} + \left( 1 - \frac{v\Delta s_t}{V_{t-1,w} + v\Delta s_t} \right) \tilde{m}_{t,w}    (2.30)

\tilde{V}_{t-1,w} = \mathbb{E}\left( (\beta_{t-1,w} - \tilde{m}_{t-1,w})^2 \mid \hat{\beta}_{1:T} \right) = V_{t-1,w} + \left( \frac{V_{t-1,w}}{V_{t-1,w} + v\Delta s_t} \right)^2 \left( \tilde{V}_{t,w} - V_{t-1,w} - v\Delta s_t \right)    (2.31)

The likelihood of the observations has a lower bound defined by:

\mathcal{L}(\hat{\beta}) \geq \sum_{t=1}^{T} \mathbb{E}_q \left[ \log p(w_t \mid \beta_t) - \log p(\hat{\beta}_t \mid \beta_t) \right] + \sum_{t=1}^{T} \log q(\hat{\beta}_t \mid \hat{\beta}_{1:t-1})    (2.32)

where

\mathbb{E}_q \log p(w_t \mid \beta_t) = \sum_{w} n_{t,w} \, \mathbb{E}_q \left( \beta_{t,w} - \log \sum_{w'} \exp(\beta_{t,w'}) \right)
\geq \sum_{w} n_{t,w} \tilde{m}_{t,w} - n_t \log \sum_{w} \exp\left( \tilde{m}_{t,w} + \tilde{V}_{t,w} / 2 \right)    (2.33)

\mathbb{E}_q \log p(\hat{\beta}_t \mid \beta_t) = \sum_{w} \delta_{t,w} \, \mathbb{E}_q \log p(\hat{\beta}_{t,w} \mid \beta_{t,w})    (2.34)

\log q(\hat{\beta}_t \mid \hat{\beta}_{1:t-1}) = \sum_{w} \delta_{t,w} \log q(\hat{\beta}_{t,w} \mid \hat{\beta}_{1:t-1,w})    (2.35)

where δ_{t,w} is the Dirac delta function [23]; it is equal to 1 if and only if β̂_{t,w} is among the variational observations. n_{t,w} is the number of occurrences of word w in document d_t, and n_t = Σ_w n_{t,w}.

Chapter 3

Existing solutions and their limitations

Several studies have been done to account for the changing latent variables of the topic model. Xing [77] presented dynamic logistic-normal-multinomial and logistic-normal-Poisson models that he later used as building blocks for his models. Wang and McCallum [74] presented a non-Markovian continuous-time topic model in which each topic is associated with a continuous distribution over timestamps. Blei and Lafferty [11] proposed a dynamic topic model in which the topic's word distribution and popularity are linked over time, though the number of topics was fixed. This work was picked up by other researchers who extended the model. In the following I describe some of these extended models.

3.1 Temporal topic models

Traditional topic models, which are manifestations of graphical models, model the occurrence and co-occurrence of words in documents, disregarding the fact that many document collections cover a long period of time. Over this long period of time, topics, which are distributions over words, could change drastically. The word distribution for a topic covering a branch of medicine is a good example of a topic that is dynamic and evolves quickly over time. The terms used in medical journals and publications change over time as the field develops.
Learning the word distribution for such a topic using a collection of old medical documents would not be good for classifying new medical documents, as the new documents are written using more modern terms that reflect recent medical research directions and discoveries that keep advancing every day. Using a fixed word distribution for such a topic, and for many other topics that usually evolve in time, would result in wrong document classification and wrong topic inference. The error could become greater over the course of time as the topic evolves more and more, and the distance between the original word distribution that was learned using the old collection of documents and the word distribution of the same topic in a more recent document collection for the same field becomes greater. Therefore, there is a strong need to add a temporal model to such topic models to reflect the changing word distribution for topics in a collection of documents and to improve topic inference in a dynamic collection of documents.

3.2 Topics over time

Several topic models were suggested that add a temporal component to the model. I will refer to them in this dissertation as temporal topic models. These temporal topic models include the Topics over Time (TOT) topic model [74]. This model directly observes document timestamps. In its generative process, the model generates a word collection and a timestamp for each document. In their original paper, Wang and McCallum gave two alternative views of the model. The first one has a generative process in which for each document a multinomial distribution θ is sampled from a Dirichlet prior α. Then from that multinomial θ, one timestamp t is sampled for the document from a Beta distribution, and one topic z is sampled for each word w in the document.
Another multinomial β is sampled for each topic in the document from a Dirichlet prior η, and a word w is sampled from that multinomial β given the topic z sampled for this document. This process, which seems natural for document collections in which each document has one timestamp, contrasts with the other view which the authors presented, given in Figure 3.1. In this view, which the authors adopted in their model, the generative process is similar to the one presented for the first view, but instead of sampling one timestamp t from a Beta distribution for each document, one timestamp is sampled from a Beta distribution for each word w in the document. All words in the same document, however, have the same timestamp. The authors claim that this view makes the model easier to implement and understand.

It is to be noted, though, that the word distribution per topic in this TOT model is fixed over time: "TOT captures changes in the occurrence (and co-occurrence conditioned on time) of the topics themselves, not changes of the word distribution of each topic." [74] The authors argue that evolution in topics happens through the changing occurrence and co-occurrence of topics, as two co-occurring topics would be equivalent to a new topic formed by merging both topics, and losing the co-occurrence would be equivalent to splitting that topic into two topics.

In this TOT model, topic co-occurrences happen in continuous time and the timestamps are sampled from a Beta distribution. A temporal topic model evolving in continuous time has a big advantage over a discrete-time topic model. Discretization of time usually comes with the problem of selecting a good timestep.
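The per-word view of the TOT generative process (the one the authors adopted) can be sketched as follows. The corpus sizes, hyperparameters and per-topic Beta parameters below are arbitrary illustration values, not those of the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 10, 4, 15     # topics, vocab size, documents, words per doc
alpha, eta = 0.5, 0.5         # symmetric Dirichlet hyperparameters

beta = rng.dirichlet([eta] * V, size=K)    # word distribution per topic
# psi[k] holds the Beta(a, b) timestamp parameters for topic k (illustrative).
psi = [(2.0, 5.0), (5.0, 2.0), (3.0, 3.0)]

corpus = []
for d in range(D):
    theta = rng.dirichlet([alpha] * K)     # per-document topic mixture
    words, stamps = [], []
    for _ in range(N):
        z = rng.choice(K, p=theta)         # topic for this word
        words.append(rng.choice(V, p=beta[z]))
        stamps.append(rng.beta(*psi[z]))   # one timestamp per word
    corpus.append((words, stamps))
```

Note that in the training data every word in a document carries the document's single observed timestamp; the per-word sampling above is exactly the modeling convenience, and deficiency, discussed in this section.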
Picking a large timestep leads to the problem of having documents that cover a large time period, over which the word distributions for the topics covered in these documents evolved significantly, used in learning these distributions, or inferenced using one fixed distribution over that period of time. Picking a small timestep would complicate inference and learning, as the number of parameters would explode as the timestep granularity increases. Another problem that arises with discretization is that it does not account for varying time granularity over time. Since topics evolve at different paces, and even one topic may evolve at different speeds over time, having a small timestep at one point in time to capture the fast dynamics of an evolving topic may be unnecessary later in time, when the topic becomes stagnant and does not evolve as quickly. Keeping a fine-grained timestep at that point will make inference and learning slower, as it increases the number of model parameters. On the other hand, having a coarse timestep at one point in time when a topic does not evolve quickly may be suitable for that time, but would become too big of a timestep when the topic starts evolving faster, and documents that fall within one timestep would be inferenced using the fixed word distributions for the topics, not reflecting the change that happened to these topics.

Figure 3.1: Topics over time graphical model. In this view, one timestamp t is sampled from a Beta distribution per word in a document. All words in a single document have the same timestamp. All the symbols in this figure follow the notation given in Figure 1.3, and ψ is the Beta distribution prior.
To take this case to the extreme, a very large timestep covering the entire time period over which the document collection exists would be equivalent to using a classical latent Dirichlet allocation model that has no notion of time at all.

It is to be noted that the TOT topic model uses a fixed number of topics. This limitation has major implications, because not only do topics evolve over time, but topics get born, die, and are reborn over time. The number of active topics over time should not be assumed to be fixed. Assuming a fixed number could lead to topics being conflated or merged in a wrong way. Assuming a number of topics that is greater than the actual number of topics in the document collection at a certain point in time causes the actual topics to be split over more than one topic. In an application that classifies news articles based on the topics they discuss, this will cause extra classes to be created and leaves the reader distracted between two classes that cover the same topic. On the other hand, having a number of topics that is smaller than the actual number of topics covered by a collection of documents makes the model merge different topics into one. In the same application of news article classification, this leads to articles covering different topics appearing under the same class. Article classes could become very broad, and this is usually undesirable, as the reader relies on classification to read articles based on his/her focused interest.

In the TOT model, exact inference cannot be done. Wang and McCallum [74] resorted to Gibbs sampling for approximate inference in this model. Since a Dirichlet, which is a conjugate prior for the multinomial distribution, is used, the multinomials θ and φ can be integrated out in this model. Therefore, we do not have to sample θ and φ.
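Integrating out the multinomials thanks to Dirichlet-multinomial conjugacy is what makes collapsed Gibbs sampling practical. The sketch below illustrates the idea for plain (atemporal) LDA, not the full TOT sampler with its Beta timestamp factor; the tiny corpus and hyperparameters are made up for illustration.

```python
import random

random.seed(0)

# Tiny corpus: documents as lists of word ids over a vocabulary of size V = 6.
docs = [[0, 1, 0, 2, 1], [3, 4, 5, 4], [0, 2, 1, 0], [5, 3, 4, 5]]
K, V = 2, 6
alpha, eta = 0.5, 0.5     # symmetric Dirichlet hyperparameters

# Count tables; theta and phi are integrated out thanks to conjugacy,
# so only topic assignments z and these counts are maintained.
ndk = [[0] * K for _ in docs]      # topic counts per document
nkw = [[0] * V for _ in range(K)]  # word counts per topic
nk = [0] * K                       # total words per topic
z = []                             # current topic assignment per token
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        k = random.randrange(K)    # random initialization
        zd.append(k); ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    z.append(zd)

for _ in range(200):               # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]            # remove the current assignment
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            # unnormalized full conditional p(z = j | rest), theta and phi
            # collapsed into the count ratios
            p = [(ndk[d][j] + alpha) * (nkw[j][w] + eta) / (nk[j] + V * eta)
                 for j in range(K)]
            r = random.uniform(0, sum(p))
            k = 0
            while r > p[k]:
                r -= p[k]; k += 1
            z[d][i] = k            # add the new assignment back
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
```

After sampling, θ and φ can be recovered as posterior means from the count tables, which is why they never need to be sampled explicitly.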
This makes the model simpler, faster to simulate, faster to implement, and less prone to errors. The authors claim that because they use a continuous Beta distribution rather than discretizing time, sparsity would not be a big concern in fitting the temporal part of the model.

In the training phase, or learning the parameters of the model, every document has a single timestamp, and this timestamp is associated with every word in the document. Therefore, all the words in a single document have the same timestamp, which is what we would naturally expect. However, in the generative graphical model presented in Figure 3.1 for the topics over time topic model, one timestamp is generated for each word in the same document. This would probably result in different words appearing in the same document having different timestamps. This is typically something we would not expect, because naturally all words in a single document have the same timestamp, as they were all authored and released or published under one title as a single publication unit or publication entity. In this sense, the generative model presented in Figure 3.1 is deficient, as it assigns different timestamps to words within the same document. The authors argue that this deficiency does not detract much from the model, and that it remains powerful in modeling large dynamic text collections.

An alternative generative process for topics over time was also presented by the authors of this model. In this alternative process, one timestamp is generated for each document using rejection sampling or importance sampling from a mixture of per-topic Beta distributions over time, with mixture weights given by the per-document distribution θ_d over topics.
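Drawing one document timestamp from such a mixture of per-topic Beta distributions can be sketched with simple rejection sampling. The mixture weights, Beta parameters, and the density bound M are arbitrary illustration values (M = 3 dominates this particular mixture, whose component densities peak below 2.5).

```python
import math
import random

random.seed(0)

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, via the log-gamma function."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

# Per-topic Beta timestamp distributions and per-document mixture weights.
psi = [(2.0, 5.0), (5.0, 2.0)]
theta_d = [0.7, 0.3]

def mixture_pdf(x):
    return sum(w * beta_pdf(x, a, b) for w, (a, b) in zip(theta_d, psi))

def rejection_sample(M=3.0):
    """Draw from the mixture using a Uniform(0, 1) proposal; M bounds the
    mixture density on (0, 1)."""
    while True:
        x = random.random()
        if random.random() * M <= mixture_pdf(x):
            return x

stamps = [rejection_sample() for _ in range(5000)]
mean_stamp = sum(stamps) / len(stamps)
```

The empirical mean of the draws should approach the mixture mean, 0.7 · (2/7) + 0.3 · (5/7) ≈ 0.414, confirming the sampler targets the intended distribution.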
Instead of jointly modeling the co-occurrence of words and timestamps in document collections, other topic models analyzed how topics change over time by dividing the time covered by the documents into regions for analysis, or by discretizing time. Griffiths and Steyvers [33] used an atemporal topic model to infer the topic mixtures of the Proceedings of the National Academy of Sciences (PNAS). They then ordered the documents in time based on their timestamps, assigned them to different time regions, and analyzed their topic mixtures over time. This study does not infer or learn the timestamps of documents; it merely analyzes topics learned using a simple latent Dirichlet allocation model [13]. Instead of learning one topic model for the entire document collection and then analyzing the topic structure of the documents in the collection, Wang et al. [75] first divided the documents into consecutive time regions based on their timestamps. They then trained a different topic model for each region and analyzed how topics changed over time. This model has several limitations. First, aligning topics from one time region to the next is hard and would probably have to be done by hand, which is difficult even with a relatively small number of topics. Second, the number of topics was held constant throughout time, even at times when the documents become rich in context and naturally contain more topics, or at times when the documents are less rich and contain relatively fewer topics. The model does not account for topics dying out and others being born. Third, finding the correct time segmentation and number of segments is hard, as it typically involves manual inspection of the documents.
The model, however, benefited from the fact that models for documents in adjacent time regions are similar, so the Gibbs sampling parameters learned for one region could be used as a starting point for learning the parameters of the next time region [64]. In their TimeMines system, Swan and Jensen [66] generated a topic model that assigns one topic per document for a collection of news stories used to construct timelines in a topic detection and tracking task.

The topics over time topic model is a temporal model but not a Markovian one. It does not make the assumption that a topic's state at time t + 1 is independent of all its previous states except its state at time t. Sarkar and Moore [61] analyzed the dynamic social network of friends as it evolves over time using a Markovian assumption. Nodelman et al. [56] developed the continuous-time Bayesian network (CTBN), which does not rely on time discretization. In their model, a Bayesian network evolves based on a continuous-time transition model using a Markovian assumption. Kleinberg [40] created a model that relies on the relative order of documents in time instead of using timestamps. This relative ordering may simplify the model and may be suitable when documents are released at fixed or nearly fixed time intervals, but it does not take into account the possibility that in other applications, such as news streams, the pace at which new stories are released and published varies over time.

3.3 Discrete-time infinite dynamic topic model

Ahmed and Xing [2] proposed a solution that overcomes the problem of having a fixed number of topics. They proposed the infinite dynamic topic model (iDTM), which allows for an unbounded number of topics and an evolving representation of topics according to Markovian dynamics.
They analyzed the birth and evolution of topics in the NIPS community based on conference proceedings. Their model evolved topics over discrete time units called epochs. All proceedings of a single conference meeting fall into the same epoch. This model does not suit many applications, as news article production and tweet generation are more spread out over time and do not usually come in bursts. In many topic modeling applications, such as discussion forums, news feeds, and tweets, the time duration of an epoch may not be clear. Choosing too coarse a resolution may invalidate the assumption that documents within the same epoch are exchangeable. Different topics and storylines will get conflated, and unrelated documents will have similar topic distributions. If the resolution is chosen to be too fine, then the number of variational parameters of the model will explode with the number of data points. The inference algorithm will become prohibitively expensive to run.

Figure 3.2: A graphical model representation of the infinite dynamic topic model using plate notation.

Using a discrete-time dynamic topic model could be valid based on assumptions about the data. In many cases, the continuity of the data, which has an arbitrary time granularity, prevents us from using a discrete-time model. To summarize: in streaming-text topic modeling applications, the discrete-time model given above is brittle. An extension to continuous time will give it the needed flexibility to account for changes in temporal granularity.
Statistical model

Figure 3.2 shows a graphical model representation of an order-one recurrent Chinese restaurant franchise process (RCRF). The symbols in this figure follow the same notation used for LDA in Figure 1.3. In an RCRF, documents in each epoch are modeled using a Chinese restaurant franchise process (CRFP). The menus from which the documents were sampled at different timesteps are tied over time.

For a given Dirichlet process (DP), DP(G_0, α), with base measure G_0 and concentration parameter α, G ∼ DP(G_0, α) is a Dirichlet distribution over the parameter space θ. By integrating out G, θ follows a Polya urn distribution [7], or a recurrent Chinese restaurant process [2]:

\[
\theta_i \mid \theta_{1:i-1}, G_0, \alpha \sim \sum_k \frac{m_k}{i-1+\alpha}\,\delta(\phi_k) + \frac{\alpha}{i-1+\alpha}\,G_0
\tag{3.1}
\]

where φ_{1:k} are the distinct values of θ, and m_k is the number of parameters θ with value φ_k. A Dirichlet process mixture model (DPM) can be built using the given DP on top of a hierarchical Bayesian model.

One disadvantage of using the RCRP is that in it each document is generated from a single topic. This assumption is unrealistic, especially in domains such as news stories, where each document is typically a mixture of different topics. To overcome this limitation, I can use a hierarchical Dirichlet process (HDP), in which each document can be generated from multiple topics.

To add temporal dependence to our model, we can use the temporal Dirichlet process mixture model (TDPM) proposed in [1], which allows an unlimited number of mixture components. In this model, G evolves as follows [2]:

\[
G_t \mid \phi_{1:k}, G_0, \alpha \sim DP(\zeta, D)
\tag{3.2}
\]
\[
\zeta = \alpha + \sum_k m'_{kt}
\tag{3.3}
\]
\[
D = \sum_k \frac{m'_{kt}}{\sum_l m'_{lt} + \alpha}\,\delta(\phi_k) + \frac{\alpha}{\sum_l m'_{lt} + \alpha}\,G_0
\tag{3.4}
\]
\[
m'_{kt} = \sum_{\delta=1}^{\Delta} \exp(-\delta/\lambda)\, m_{k,t-\delta}
\tag{3.5}
\]

where m'_{kt} is the prior weight of component k at time t, and ∆ and λ are the width and decay factor of the time-decaying kernel.
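The Polya urn scheme of (3.1) is easy to simulate directly. The following sketch (the function name and the uniform base measure are my own, illustrative choices, not part of the model above) draws values one at a time: an existing distinct value φ_k is reused with probability proportional to its count m_k, and a fresh value is drawn from G_0 with probability proportional to α.

```python
import random

def polya_urn(n, alpha, base_draw, rng=random.Random(0)):
    """Draw n values theta_1..theta_n from the Polya urn scheme of (3.1).

    With probability m_k / (i - 1 + alpha) the i-th draw reuses an existing
    distinct value phi_k (m_k = number of previous draws equal to phi_k);
    with probability alpha / (i - 1 + alpha) it is a fresh draw from G_0.
    """
    draws = []
    for i in range(1, n + 1):
        if rng.random() < alpha / (i - 1 + alpha):
            draws.append(base_draw(rng))      # new value from G_0
        else:
            # uniform choice over past draws = prob. proportional to m_k
            draws.append(rng.choice(draws))
    return draws

# Example: G_0 = Uniform(0, 1); clustering emerges among the 100 draws.
values = polya_urn(100, alpha=1.0, base_draw=lambda r: r.uniform(0, 1))
```

Note that the first draw always comes from G_0 (the reuse probability is zero when i = 1), and the number of distinct values grows only logarithmically with n, which is the clustering behavior the DPM exploits.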
For ∆ = 0, the TDPM reduces to a set of independent DPMs, one at each time step, and to a single global DPM when ∆ = T and λ = ∞. As we did with the time-independent DPM, we can integrate G_{1:T} out of our model to get a set of parameters θ_{1:t} that follows a Polya urn distribution:

\[
\theta_{ti} \mid \theta_{t-1:t-\Delta}, \theta_{t,1:i-1}, G_0, \alpha \propto \sum_k \left( m'_{kt} + m_{kt} \right) \delta(\phi_{kt}) + \alpha\, G_0
\tag{3.6}
\]

In this process, the popularity of a topic at epoch t depends on its use at this epoch, m_{kt}, and on its use in the previous ∆ epochs, m'_{kt}. Therefore, after ∆ epochs of not being used, a topic can be considered dead. This makes sense, as higher-order models require more epochs to pass without a topic being used before it is considered dead: the chain is longer, and the effect of the global menu of topics carries along for more epochs. In lower-order models, on the other hand, the effect of the global menu of topics is passed along for fewer epochs.

By placing this model on top of a hierarchical Dirichlet process as indicated earlier, we can tie together all the random measures G_d, from which the multinomially distributed parameters θ_d are drawn for each document w_d in a topic model scheme, by modeling G_0 as a random measure sampled from DP(γ, H). By integrating G_d out of this model, we get the Chinese restaurant franchise process (CRFP) [11]:

\[
\theta_{di} \mid \theta_{d,1:i-1}, \alpha, \psi \sim \sum_{b=1}^{B_d} \frac{n_{db}}{i-1+\alpha}\,\delta_{\psi_{db}} + \frac{\alpha}{i-1+\alpha}\,\delta_{\psi_{db}^{\,new}}
\tag{3.7}
\]
\[
\psi_{db}^{\,new} \mid \psi, \gamma \sim \sum_{k=1}^{K} \frac{m_k}{\sum_{l=1}^{K} m_l + \gamma}\,\delta_{\phi_k} + \frac{\gamma}{\sum_{l=1}^{K} m_l + \gamma}\,H
\tag{3.8}
\]

where ψ_{db} is topic b for document d, n_{db} is the number of words sampled from it, ψ_{db}^new is a new topic, B_d is the number of topics in document d, and m_k is the number of documents sharing topic φ_k. We can make word distributions and topic trends evolve over time if we tie together all the hyperparameter base measures G_0 through time.
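The time-decaying prior weight of (3.5) makes the topic-death behavior above concrete. A minimal sketch (function name and data layout are my own) shows a topic's prior weight decaying once the topic falls out of use, and dropping to exactly zero once it has gone unused for ∆ epochs:

```python
import math

def prior_weight(usage, t, delta_max, lam):
    """m'_{kt} from (3.5): exponentially decayed use of topic k over the
    previous delta_max epochs.  usage[s] is m_{k,s}, the number of times
    topic k was used at epoch s (missing epochs count as zero)."""
    return sum(math.exp(-d / lam) * usage.get(t - d, 0)
               for d in range(1, delta_max + 1))

# A topic used at epochs 0-2 and never again, with a kernel of width 3:
usage = {0: 5, 1: 8, 2: 3}
weights = [prior_weight(usage, t, delta_max=3, lam=2.0) for t in range(1, 8)]
# Epochs 3-5 still "see" the old usage through the kernel; from epoch 6 on
# (last use at epoch 2, plus delta_max, plus one) the weight is exactly 0,
# i.e. the topic is dead.
```

A wider kernel (larger ∆) or slower decay (larger λ) keeps dying topics alive longer, matching the higher-order-chain discussion above.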
The model will now take the following form:

\[
\theta_{tdi} \mid \theta_{td,1:i-1}, \alpha, \psi_{t-\Delta:t} \sim \sum_{b=1}^{B_d} \frac{n_{tdb}}{i-1+\alpha}\,\delta_{\psi_{tdb}} + \frac{\alpha}{i-1+\alpha}\,\delta_{\psi_{tdb}^{\,new}}
\tag{3.9}
\]
\[
\psi_{tdb}^{\,new} \mid \psi, \gamma \sim \sum_{k:\, m_{kt}>0} \frac{m_{kt} + m'_{kt}}{\sum_{l=1}^{K_t} \left( m_{lt} + m'_{lt} \right) + \gamma}\,\delta_{\phi_{kt}} + \sum_{k:\, m_{kt}=0} \frac{m_{kt} + m'_{kt}}{\sum_{l=1}^{K_t} \left( m_{lt} + m'_{lt} \right) + \gamma}\,P(\,\cdot \mid \phi_{k,t-1}) + \frac{\gamma}{\sum_{l=1}^{K_t} \left( m_{lt} + m'_{lt} \right) + \gamma}\,H
\tag{3.10}
\]

where φ_{kt} evolves using a random walk kernel as in [11]:

\[
H = N(0, \sigma I)
\tag{3.11}
\]
\[
\phi_{k,t} \mid \phi_{k,t-1} \sim N(\phi_{k,t-1}, \rho I)
\tag{3.12}
\]
\[
w_{tdi} \mid \phi_{kt} \sim M(L(\phi_{kt}))
\tag{3.13}
\]
\[
L(\phi_{kt}) = \frac{\exp(\phi_{kt})}{\sum_{w=1}^{W} \exp(\phi_{ktw})}
\tag{3.14}
\]

We can see from (3.9), (3.10), and (3.11) the non-conjugacy between the base measure and the likelihood.

Table 3.1: Table of notations for iDTM

    DP(·)          Dirichlet process
    G_0            base measure
    α              concentration parameter
    θ              Dirichlet distribution parameter space
    φ_{1:k}        distinct values of θ
    m_k            number of parameters θ with value φ_k
    m'_{kt}        prior weight of component k at time t
    ∆              width of the time-decaying kernel
    λ              decay factor of the time-decaying kernel
    w              word in a document
    H              base measure of the DP generating G_0
    γ              concentration parameter for the DP generating G_0
    ψ_{tdb}        topic b for document d at time t
    n_{tdb}        number of words sampled from ψ_{tdb}
    ψ_{tdb}^new    a new topic

The model given above is suitable for discrete-time topic models with an evolving number of topics. The topic trends and the word distributions of topics will, however, evolve in discrete time. Evolving topics in continuous time can be achieved by using a Brownian motion model [54] for the natural parameters of the multinomial distribution of words over topics. A Dirichlet distribution can be used to model the natural parameters of the multinomial distribution of topics given the words.
3.4 Online hierarchical Dirichlet processes

Traditional variational inference algorithms are suitable for applications in which the document collection to be modeled is known before model learning or posterior inference takes place. If the document collection changes, however, the entire posterior inference procedure has to be repeated to update the learned model. This clearly incurs the additional cost of relearning and re-analyzing a potentially huge volume of information, especially as the collection grows over time. This cost could become very high, and if this traditional variational inference algorithm is to be used, a compromise must be made between having an up-to-date model and saving computational power. Online inference algorithms do not require several passes over the entire dataset to update the model, which is a requirement for traditional inference algorithms. Sato [62] introduced an online variational Bayesian inference algorithm that gives variational inference algorithms an extra edge over their MCMC counterparts.

Traditional variational inference algorithms approximate the true posterior over the latent variables by suggesting a simpler distribution that gets refined to minimize its Kullback-Leibler (KL) divergence to the true posterior. In online variational inference, this optimization is done using stochastic approximation [62, 73].

Statistical model

At the top level of a two-level hierarchical Dirichlet process (HDP), a Dirichlet distribution G_0 is sampled from a Dirichlet process (DP). This distribution G_0 is used as the base measure for another DP at the lower level, from which another Dirichlet distribution G_j is drawn. This means that all the distributions G_j share the same set of atoms, inherited from their parent, with different atom weights.
Formally put:

\[
G_0 \sim DP(\gamma, H)
\tag{3.15}
\]
\[
G_j \sim DP(\alpha_0, G_0)
\tag{3.16}
\]

where γ and H are the concentration parameter and base measure for the first-level DP, α_0 and G_0 are the concentration parameter and base measure for the second-level DP, and G_j is the Dirichlet distribution sampled from the second-level DP.

In a topic model utilizing this HDP structure, a document is made of a collection of words, and each topic is a distribution over the words in the document collection. The atoms of the top-level DP are the global set of topics. Since the base measure of the second-level DP is sampled from the first-level DP, the sets of topics in the second-level DP are subsets of the global set of topics in the first-level DP. This ensures that the documents sampled from the second-level process share the same set of topics from the upper level. For each document j in the collection, a Dirichlet G_j is sampled from the second-level process. Then, for each word in the document, a topic is sampled, and a word is generated from that topic.

In Bayesian non-parametric models, variational methods are usually represented using a stick-breaking construction. This representation has its own set of latent variables on which an approximate posterior is given [9, 44, 69]. The stick-breaking representation used for this HDP is given at two levels: a corpus-level draw of the Dirichlet G_0 from the top-level DP, and a document-level draw of the Dirichlet G_j from the lower-level DP. The corpus-level sample can be obtained as follows:

\[
\beta'_k \sim Beta(1, \gamma)
\tag{3.17}
\]
\[
\beta_k = \beta'_k \prod_{l=1}^{k-1} \left(1 - \beta'_l\right)
\tag{3.18}
\]
\[
\phi_k \sim H
\tag{3.19}
\]
\[
G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\phi_k}
\tag{3.20}
\]

where γ is a parameter for the Beta distribution, β_k is the weight for topic k, φ_k is atom (topic) k, H is the base distribution for the top-level DP, and δ is the Dirac delta function.
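The corpus-level weights of (3.17)-(3.18) can be sampled with a few lines of code. This is a sketch under my own assumptions (a finite truncation at K atoms, and Beta(1, γ) sampled by inverse-CDF, since 1 − U^{1/γ} ∼ Beta(1, γ) for U uniform); the true G_0 has infinitely many atoms, and the leftover stick mass stands in for the truncated tail:

```python
import random

def stick_breaking(gamma, K, rng=random.Random(1)):
    """Truncated stick-breaking weights beta_1..beta_K for G_0 in
    (3.17)-(3.18): beta'_k ~ Beta(1, gamma) and
    beta_k = beta'_k * prod_{l<k} (1 - beta'_l)."""
    remaining = 1.0                     # length of the stick still unbroken
    weights = []
    for _ in range(K):
        # Beta(1, gamma) via inverse CDF: x = 1 - u**(1/gamma)
        b = 1.0 - rng.random() ** (1.0 / gamma)
        weights.append(remaining * b)   # break off a piece of the stick
        remaining *= 1.0 - b
    return weights

weights = stick_breaking(gamma=2.0, K=50)
# weights are positive and sum to less than 1; the leftover mass
# belongs to the infinite tail of atoms k > K that the truncation drops.
```

Smaller γ breaks off big pieces early (mass concentrated on few topics); larger γ spreads mass over many topics, mirroring the role of the concentration parameter in the HDP.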
The second-level (document-level) draws for the Dirichlet G_j are done by applying Sethuraman's stick-breaking construction of the DP again, as follows:

\[
\psi_{jt} \sim G_0
\tag{3.21}
\]
\[
\pi'_{jt} \sim Beta(1, \alpha_0)
\tag{3.22}
\]
\[
\pi_{jt} = \pi'_{jt} \prod_{l=1}^{t-1} \left(1 - \pi'_{jl}\right)
\tag{3.23}
\]
\[
G_j = \sum_{t=1}^{\infty} \pi_{jt}\, \delta_{\psi_{jt}}
\tag{3.24}
\]

where ψ_{jt} is a document-level atom (topic) and π_{jt} is the weight associated with it. This model can be simplified by introducing indicator variables c_j that are drawn from the weights β [73]:

\[
c_{jt} \sim Mult(\beta)
\tag{3.25}
\]

The variational distribution is thus given by:

\[
q(\beta', \pi', c, z, \phi) = q(\beta')\, q(\pi')\, q(c)\, q(z)\, q(\phi)
\tag{3.26}
\]
\[
q(\beta') = \prod_{k=1}^{K-1} q(\beta'_k \mid u_k, v_k)
\tag{3.27}
\]
\[
q(\pi') = \prod_j \prod_{t=1}^{T-1} q(\pi'_{jt} \mid a_{jt}, b_{jt})
\tag{3.28}
\]
\[
q(c) = \prod_j \prod_t q(c_{jt} \mid \varphi_{jt})
\tag{3.29}
\]
\[
q(z) = \prod_j \prod_n q(z_{jn} \mid \zeta_{jn})
\tag{3.30}
\]
\[
q(\phi) = \prod_k q(\phi_k \mid \lambda_k)
\tag{3.31}
\]

where β' is the corpus-level stick proportions, with (u_k, v_k) the parameters of its beta distribution; π'_j is the document-level stick proportions, with (a_{jt}, b_{jt}) the parameters of its beta distribution; c_j is the vector of indicators; φ is the topic distributions; and z is the vector of topic indices. In this setting, the variational parameters are ϕ_{jt}, ζ_{jn}, and λ_k. The variational objective function to be optimized is the marginal log-likelihood of the document collection D, given by [73]:

\[
\log p(D \mid \gamma, \alpha_0, \zeta) \ge E_q[\log p(D, \beta', \pi', c, z, \phi)] + H(q)
\tag{3.32}
\]
\[
= \sum_j \Big\{ E_q\big[\log\big( p(w_j \mid c_j, z_j, \phi)\, p(c_j \mid \beta')\, p(z_j \mid \pi'_j)\, p(\pi'_j \mid \alpha_0) \big)\big]
\tag{3.33}
\]
\[
+ H(q(c_j)) + H(q(z_j)) + H(q(\pi'_j)) \Big\}
\tag{3.34}
\]
\[
+ E_q[\log p(\beta')\, p(\phi)] + H(q(\beta')) + H(q(\phi))
\tag{3.35}
\]
\[
= L(q)
\tag{3.36}
\]

where H(·) is the entropy term for the variational distribution.

3.5 Continuous-time dynamic topic model

Wang et al. [72], on the other hand, proposed a continuous-time dynamic topic model. That model uses Brownian motion [54] to model the evolution of topics over time. Even though that model uses a novel sparse variational Kalman filtering algorithm for fast inference, the number of topics it samples from is bounded, which severely limits its application in news-feed storyline creation and article aggregation. When the number of topics covered by the news feed is less than the pre-tuned number of topics set for the model, similar stories will show up under different storylines. On the other hand, if the number of topics covered becomes greater than the pre-set number of topics, topics and storylines will get conflated.

Statistical model

A graphical model representation of this model using plate notation is given in Figure 3.3.

Figure 3.3: Graphical model representation of the continuous-time dynamic topic model using plate notation.

Let the distribution of the k-th topic parameter for word w be:

\[
\beta_{0,k,w} \sim N(m, v_0)
\tag{3.37}
\]
\[
\beta_{j,k,w} \mid \beta_{i,k,w}, s \sim N(\beta_{i,k,w}, v\, \Delta_{s_j, s_i})
\tag{3.38}
\]

where i and j are time indices, s_i and s_j are timestamps, and ∆_{s_j,s_i} is the time elapsed between them. In this model, the multinomially distributed topic distribution at time t is sampled from a Dirichlet distribution, θ_t ∼ Dir(α), and then a topic z_{t,n} is sampled from a multinomial distribution parametrized by θ_t, z_{t,n} ∼ Mult(θ_t). To make the topics evolve over time, we define a Wiener process [54] {X(t), t ≥ 0} and sample β_t from it. The obtained unconstrained β_t can then be mapped onto the simplex.
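This drift-and-map scheme — a Gaussian random walk on the natural parameters, with variance scaled by the elapsed time as in (3.38), followed by a softmax mapping onto the simplex — can be sketched in a few lines. Function names, vocabulary size, and parameter values here are my own illustrative choices, not part of the model's specification:

```python
import math
import random

def evolve_topic(beta, dt, v, rng=random.Random(2)):
    """One Brownian-motion step: each word's natural parameter drifts by
    independent Gaussian noise with variance v * dt (elapsed time dt)."""
    return [b + rng.gauss(0.0, math.sqrt(v * dt)) for b in beta]

def to_simplex(beta):
    """Softmax: map unconstrained natural parameters to word probabilities."""
    m = max(beta)                        # subtract max for numerical stability
    exps = [math.exp(b - m) for b in beta]
    z = sum(exps)
    return [e / z for e in exps]

beta = [0.0] * 5                         # 5-word vocabulary, flat start
for _ in range(10):                      # evolve through 10 unit time gaps
    beta = evolve_topic(beta, dt=1.0, v=0.1)
probs = to_simplex(beta)                 # a drifted word distribution
```

Because the variance grows with the gap dt between timestamps, two documents close in time see nearly the same topic, while distant documents see a topic that may have drifted substantially; this is what distinguishes the continuous-time model from an epoch-based one.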
More formally:

\[
\beta_{t,k} \mid \beta_{t-1,k}, s \sim N(\beta_{t-1,k}, v\, \Delta_{s_t} I)
\tag{3.39}
\]
\[
\pi(\beta_{t,k})_w = \frac{\exp(\beta_{t,k,w})}{\sum_w \exp(\beta_{t,k,w})}
\tag{3.40}
\]
\[
w_{t,n} \sim Mult\!\left(\pi(\beta_{t,z_{t,n}})\right)
\tag{3.41}
\]

where π(·) maps the unconstrained multinomial natural parameters to its mean parameters, which lie on the simplex. The posterior, which is the distribution of the latent topic structure given the observed documents, is intractable, and we resort to approximate inference. For this model, the sparse variational inference presented in [72] could be used.

Table 3.2: Table of notations for cDTM

    B_d            number of topics in document d
    N(m, v_0)      a Gaussian distribution with mean m and variance v_0
    β_{i,k,w}      distribution of words over topic k at time i for word w
    s_i            timestamp for time index i
    ∆_{s_j,s_i}    time duration between s_i and s_j
    Dir(·)         Dirichlet distribution
    z_{t,n}        topic n sampled at time t
    I              identity matrix
    Mult(·)        multinomial distribution
    π(·)           mapping function

Chapter 4

Continuous-time infinite dynamic topic model

In this chapter I describe my own contribution, the continuous-time infinite dynamic topic model.

4.1 Dim sum process

I propose building a topic model called the continuous-time infinite dynamic topic model (ciDTM) that combines the properties of 1) the continuous-time dynamic topic model (cDTM), and 2) the online hierarchical Dirichlet process model (oHDP). I will refer to the stochastic process I develop, which combines the properties of these two systems, as the dim sum process. Dim sum is a style of Chinese food. One important feature of dim sum restaurants is the freedom given to the chef to create new dishes depending on seasonal availability and what he thinks is auspicious for that day.
This leeway given to the cook leads the ingredient mixture of the different dishes the restaurant serves to change over time, to better satisfy the customers' tastes and suit seasonal changes in the availability of these ingredients. The generative process of a dim sum process is similar to that of a Chinese restaurant franchise process [68] in the way customers are seated in the restaurant and dishes are assigned to tables, but differs in that the ingredients (mixture) of the dishes evolve in continuous time.

When the dim sum process is used in a topic modeling application, each customer corresponds to a word and the restaurant represents a document. A dish is mapped to a topic, and a table is used to group a set of customers around a dish. The generative process of the dim sum process proceeds as follows:

• Initially the dim sum restaurant (document) is empty.

• The first customer (word) to arrive sits down at a table and orders a dish (topic) for his table.

• The second customer to arrive has two options: 1) she can sit at a new table with probability α/(1 + α) and order food for her new table, or 2) she can sit at the occupied table with probability 1/(1 + α) and eat from the food that has already been ordered for that table.

• In general, when a new customer enters a restaurant in which n customers are already seated, he can sit down at a new table with probability α/(n + α), or he can sit down at table k with probability n_k/(n + α), where n_k is the number of customers currently sitting at table k.

Higher values of α lead to a higher number of occupied tables and dishes (topics) sampled in the restaurant (document). This model can be extended to a franchise setting in which all the restaurants share one global menu from which the customers order. To do so, each restaurant samples its parameter α and its local menu from a global menu. This global menu is a higher-level Dirichlet process.
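The seating rule above is the standard Chinese restaurant process step, and can be simulated directly. This sketch (function name and representation are my own) seats n customers one at a time and records the table occupancies, making the role of α visible:

```python
import random

def seat_customers(n, alpha, rng=random.Random(3)):
    """Seat n customers (words) following the seating rule above: after i
    customers are seated, the next one opens a new table with probability
    alpha / (i + alpha), or joins an occupied table k with probability
    n_k / (i + alpha)."""
    tables = []                          # tables[k] = customers at table k
    for i in range(n):                   # i customers already seated
        if rng.random() < alpha / (i + alpha):
            tables.append(1)             # open a new table, order a dish
        else:
            # pick an occupied table with prob. proportional to n_k
            r = rng.uniform(0, i)
            for k, nk in enumerate(tables):
                r -= nk
                if r <= 0:
                    tables[k] += 1
                    break
    return tables

few = seat_customers(200, alpha=0.5)     # small alpha: few crowded tables
many = seat_customers(200, alpha=20.0)   # large alpha: many sparse tables
```

The first customer always opens a table (the new-table probability is 1 when i = 0), and occupied tables attract customers in proportion to their size — the rich-get-richer effect that produces a small number of dominant topics per document.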
This two-level Dirichlet process is known in the literature as the Chinese restaurant franchise process (CRFP) [68]. The main difference between a dim sum process and a CRFP is in the global menu. The global menu is kept fixed over time in the CRFP setting, whereas it evolves in continuous time in the dim sum process using a Brownian motion model [54]. The dim sum process can be described using plate notation as shown in Figure 4.1.

This model combines the two models in that it gets rid of the highest-level, time-specific hierarchical Dirichlet process. It implicitly ties the base measures G_0 across all documents. This measure, which ensures sharing of topics through time and over documents, is sampled from DP(γ, H), and it has been integrated out of our model. Note that I reduced the hierarchical structure of the model by one level; it now represents a Chinese restaurant process instead of a recurrent Chinese restaurant process. The other modification I make to the model is to evolve its topic distribution using a Brownian motion model [54], similar to the one used in the continuous-time dynamic topic model.

The true posterior is not tractable; we have to resort to approximate inference for this model. Moreover, due to the non-conjugacy between the distribution of words for each topic and the word probabilities, I cannot use collapsed Gibbs sampling. Instead, I plan to use variational methods for inference. Figure 4.2 shows a block diagram of the continuous-time infinite dynamic topic model operating on a single document at a time.

4.2 Mathematical model

The continuous-time infinite dynamic topic model (ciDTM) is a mixture of the continuous-time dynamic topic model presented earlier in Section 3.5 and the online hierarchical Dirichlet process (oHDP) model presented in Section 3.4.
A generative process for ciDTM using the dim sum process proceeds as follows. We build a two-level hierarchical Dirichlet process (HDP) like the one presented in Section 3.4. At the top level of this HDP, a Dirichlet distribution G_0 is sampled from a Dirichlet process (DP). This distribution G_0 is used as the base measure for another DP at the lower level, from which another Dirichlet distribution G_j is drawn. This means that all the distributions G_j share the same set of atoms, inherited from their parent, with different atom weights.

Figure 4.1: Continuous-time infinite dynamic topic model. Non-shaded nodes represent unobserved latent variables, shaded nodes are observed variables, diamonds represent hyperparameters, and plates represent repetition. Symbols in this figure follow the notation in Figure 3.2; H is the base distribution for the upper-level DP, γ and α are the concentration parameters for the upper-level and lower-level DPs respectively, and s_t is the timestamp for the document at time t.

Figure 4.2: A block diagram describing the workflow of the continuous-time infinite dynamic topic model working on a single document at a time.

Formally put:

\[
G_0 \sim DP(\gamma, H)
\tag{4.1}
\]
\[
G_j \sim DP(\alpha_0, G_0)
\tag{4.2}
\]

where γ and H are the concentration parameter and base measure for the first-level DP, α_0 and G_0 are the concentration parameter and base measure for the second-level DP, and G_j is the Dirichlet distribution sampled from the second-level DP. In a topic model utilizing this HDP structure, a document is made of a collection of words, and each topic is a distribution over the words in the document collection.
The atoms of the top-level DP are the global set of topics. Since the base measure of the second-level DP is sampled from the first-level DP, the sets of topics in the second-level DP are subsets of the global set of topics in the first-level DP. This ensures that the documents sampled from the second-level process share the same set of topics from the upper level. For each document j in the collection, a Dirichlet G_j is sampled from the second-level process. Then, for each word in the document, a topic is sampled, and a word is generated from that topic.

In Bayesian non-parametric models, variational methods are usually represented using a stick-breaking construction. This representation has its own set of latent variables on which an approximate posterior is given [9, 44, 69]. The stick-breaking representation used for this HDP is given at two levels: a corpus-level draw of the Dirichlet G_0 from the top-level DP, and a document-level draw of the Dirichlet G_j from the lower-level DP. The corpus-level sample can be obtained as follows:

\[
\beta'_k \sim Beta(1, \gamma)
\tag{4.3}
\]
\[
\beta_k = \beta'_k \prod_{l=1}^{k-1} \left(1 - \beta'_l\right)
\tag{4.4}
\]
\[
\phi_k \sim H
\tag{4.5}
\]
\[
G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\phi_k}
\tag{4.6}
\]

where γ is a parameter for the Beta distribution, β_k is the weight for topic k, φ_k is atom (topic) k, H is the base distribution for the top-level DP, and δ is the Dirac delta function. The second-level (document-level) draws for the Dirichlet G_j are done by applying Sethuraman's stick-breaking construction of the DP again, as follows:

\[
\psi_{jt} \sim G_0
\tag{4.7}
\]
\[
\pi'_{jt} \sim Beta(1, \alpha_0)
\tag{4.8}
\]
\[
\pi_{jt} = \pi'_{jt} \prod_{l=1}^{t-1} \left(1 - \pi'_{jl}\right)
\tag{4.9}
\]
\[
G_j = \sum_{t=1}^{\infty} \pi_{jt}\, \delta_{\psi_{jt}}
\tag{4.10}
\]

where ψ_{jt} is a document-level atom (topic) and π_{jt} is the weight associated with it.
This model can be simplified by introducing indicator variables c_j that are drawn from the weights β [73]:

\[
c_{jt} \sim Mult(\beta)
\tag{4.11}
\]

The variational distribution is thus given by:

\[
q(\beta', \pi', c, z, \phi') = q(\beta')\, q(\pi')\, q(c)\, q(z)\, q(\phi')
\tag{4.12}
\]
\[
q(\beta') = \prod_{k=1}^{K-1} q(\beta'_k \mid u_k, v_k)
\tag{4.13}
\]
\[
q(\pi') = \prod_j \prod_{t=1}^{T-1} q(\pi'_{jt} \mid a_{jt}, b_{jt})
\tag{4.14}
\]
\[
q(c) = \prod_j \prod_t q(c_{jt} \mid \varphi_{jt})
\tag{4.15}
\]
\[
q(z) = \prod_j \prod_n q(z_{jn} \mid \zeta_{jn})
\tag{4.16}
\]
\[
q(\phi') = \prod_k q(\phi_k \mid \lambda_k)
\tag{4.17}
\]

where β' is the corpus-level stick proportions, with (u_k, v_k) the parameters of its beta distribution; π'_j is the document-level stick proportions, with (a_{jt}, b_{jt}) the parameters of its beta distribution; c_j is the vector of indicators; φ' is the topic distributions; and z is the vector of topic indices. In this setting, the variational parameters are ϕ_{jt}, ζ_{jn}, and λ_k. The variational objective function to be optimized is the marginal log-likelihood of the document collection D, given by [73]:

\[
\log p(D \mid \gamma, \alpha_0, \zeta) \ge E_q[\log p(D, \beta', \pi', c, z, \phi')] + H(q)
\tag{4.18}
\]
\[
= \sum_j \Big\{ E_q\big[\log\big( p(w_j \mid c_j, z_j, \phi')\, p(c_j \mid \beta')\, p(z_j \mid \pi'_j)\, p(\pi'_j \mid \alpha_0) \big)\big]
\tag{4.19}
\]
\[
+ H(q(c_j)) + H(q(z_j)) + H(q(\pi'_j)) \Big\}
\tag{4.20}
\]
\[
+ E_q[\log p(\beta')\, p(\phi')] + H(q(\beta')) + H(q(\phi'))
\tag{4.21}
\]
\[
= L(q)
\tag{4.22}
\]

where H(·) is the entropy term for the variational distribution. I use coordinate ascent to maximize the log-likelihood bound given in (4.18). Next, given the per-topic word distributions φ', I use a Wiener process [54] to make the topics evolve over time. I define the process {X(t), t ≥ 0} and sample φ_t from it.
The obtained unconstrained φ_t can then be mapped onto the simplex. More formally:

\[
\phi_{t,k} \mid \phi_{t-1,k}, s \sim N(\phi_{t-1,k}, v\, \Delta_{s_t} I)
\tag{4.23}
\]
\[
\pi(\phi_{t,k})_w = \frac{\exp(\phi_{t,k,w})}{\sum_w \exp(\phi_{t,k,w})}
\tag{4.24}
\]
\[
w_{t,n} \sim Mult\!\left(\pi(\phi_{t,z_{t,n}})\right)
\tag{4.25}
\]

where π(·) maps the unconstrained multinomial natural parameters to its mean parameters, which lie on the simplex. The posterior, which is the distribution of the latent topic structure given the observed documents, is intractable, and we resort to approximate inference. For this model, the sparse variational inference presented in [72] could be used.

Chapter 5

Testbed development

In this chapter, I motivate the need for a continuous-time infinite dynamic topic model and present the dataset/corpus I will be using for my testbed and for the evaluation of the model I will develop, as well as of other competing models. I start by describing the corpus:

1. Why do I need the corpus?
2. What should this corpus be made of?
3. What makes a good corpus?
4. How can a good corpus be created?
5. Are there corpora that fit my needs?
6. Can I modify existing corpora to fit my needs?
7. What are the challenges in creating such a corpus?
8. Who else could benefit from this corpus if I can publish it publicly?
9. Can I publish this corpus publicly?

After that, I will define the problem of creating timelines for different news stories, which has many potential applications, especially in the news media industry. I will follow that with several attempts to solve this problem, at first using a simple topic model, then successively using more advanced models that alleviate some of the shortcomings of the earlier ones.

5.1 News stories corpus

In order to test the performance of my ciDTM and create news timelines, I need a news corpus. This corpus should be made of a collection of news stories.
These stories could be collected from news outlets such as newspapers, newswires, and transcripts of radio news broadcasts, or crawled from the websites of news agencies or newspapers. In my case, I can crawl newspaper websites, news agency websites, or news websites. For one of these sources to be considered a valid source for my corpus, each of the news stories it publishes should contain: 1) an identification number, 2) the story's publication date and time, 3) the story title, 4) the story's text body, and 5) a list of related stories. Many news agencies such as Reuters and the Associated Press, news websites such as BBC News and the Huffington Post, and newspaper websites such as The Guardian and The New York Times meet these conditions in their published stories. There are a few differences, though, that make some of them better than others, and each has its advantages and disadvantages.

News agencies typically publish more stories per day than the other sources I considered. This makes each news timeline richer with stories and yields more news timelines. They tend to publish more follow-up stories than other sources, which contributes to the richness of the timeline as well. They also cover a bigger variety of topics and a larger geographical region, usually all world countries, than other sources. They do not have limitations on the word count of their stories, as they are not restricted by page space or other space requirements, which makes their stories richer in syntax and vocabulary than those of other sources.

On the other hand, news agencies only publish stories produced by their own journalists, who have to follow the agency's guidelines and rules for writing, editing, and updating news stories. This restriction makes the stories more uniform in editorial structure, and sometimes in vocabulary as well.
Even though this uniformity does not affect the journalistic quality of the news stories they publish, it makes the task of news timeline creation easier for the topic model: the model could simply learn the set of keywords used for the news stories that fall under a timeline, a set that gets repeated over and over by the same journalist who covers the topic for the news agency over a certain period of time. The model is better tested when the stories are written by different journalists belonging to different organizations and following different rules and guidelines.

The process of crawling news agency websites is usually easy. Their web pages are usually well formatted, with few advertisements and little unrelated content. They are usually well tagged, too. News agencies also tend not to have strong political bias in their news coverage when compared to newspapers and news websites. This is reflected in the lack of the opinionated adjectives that come with political bias and affect the vocabulary structure of the news stories.

News websites typically exist only in electronic form on the web and collect their news stories from different sources. They purchase stories from different news agencies. Some of them have their own dedicated journalists and freelance journalists, and some purchase stories from other news websites and newspapers. Many of the stories they gather from other sources pass through an editorial step in which some sections of a story may be removed because of publication space limitations. In other cases, sections written by their own journalists or collected from other sources may be added to the story, or two stories may be merged to fill up publication space.
This editorial process can lead to syntactic and vocabulary richness in the news stories, as stories belonging to the same timeline will contain many synonymous words, and this synonymy has to be learned by the topic model in order to classify the related stories as belonging to the same timeline.

As news websites exist only in electronic form on the web, they tend to make the most of it and be rich in media formats; a story could have the main story text, text commentary, audio, video, and links to other websites covering the same story. This richness in media format could help in modeling the document and in placing it on the correct timeline. It also makes it harder for automated crawling applications to extract the news story components, such as the title, story text, and related stories, from a page crammed with different kinds of content: advertisements, unrelated stories, headers, footers, and the like. News websites tend to have soft publication space limits; a news story could run a few pages long or be a couple of paragraphs.

The list of related stories can be rich in the case of news websites, as the media richness discussed earlier usually includes links to news stories published by other sources (call them external sources) and by the same news website (call them internal sources), and that again helps in better testing news timeline creation. Different news stories along the same news timeline will include many synonymous words, and this synonymy should be learned by the topic model to identify these stories as belonging to the same timeline.

News websites tend to have more political bias than news agencies, but less than newspapers. This means the related news stories they publish will not all carry the same opinionated adjectives in their coverage of a story.
It should be noted that this usually applies only to the stories written by their own journalists, not to those collected or purchased from news agencies.

Newspaper websites tend to fall in the middle between news agency websites and news websites in news story diversity, coverage, and richness. Newspapers usually collect news stories from news agencies for regions of the world they do not cover. They have their own journalists who write according to the newspaper's editorial and political rules and guidelines, and they purchase stories from other newspapers as well. This produces a similar syntactic and vocabulary richness in their content, even though it does not match that of the news websites. The stories collected or purchased from other sources typically go through an editorial process in which a story may be cut short, extended by adding content from other sources, or merged with another story, purchased or collected from another source, covering the same topic. Newspapers usually put on their websites all the stories they publish on paper. They sometimes publish extended versions online, and may also have online-exclusive content. The online-exclusive content usually has soft space limitations; the paper-published content usually has hard space limitations.

The process of crawling newspaper websites is usually easier than crawling news websites, as the former's web pages usually contain fewer advertisements, less unrelated content, and fewer links to other sources. Newspapers' news web pages are usually well tagged. Newspaper websites tend to link heavily to their own sources, unlike news websites, which link heavily to external sources.
If I am interested in creating a rich news timeline, this counts in favor of newspaper websites, as I usually crawl web pages with the same format, which means crawling only internal sources, not external sources, because the latter usually have a different web page format.

Newspapers tend to have a stronger political bias in their news coverage when compared to news agencies and news websites. They tend to use more strongly opinionated adjectives. This has the side effect that news stories belonging to the same timeline share a collection of adjective keywords, which makes the process of clustering the news stories into different timelines easier. This is usually not desirable, as it could make the model more reliant on clustering stories based on keywords instead of using a word distribution, or better still, a word distribution that changes over time.

A good text corpus of news stories contains a syntactically and semantically rich collection of stories. The stories should be collected from different sources and written by different authors who follow different writing and editorial rules and guidelines, or, even better, share no set of such standards at all. This diversity in vocabulary and syntactic structure translates to a larger set of synonyms and antonyms being used in the collection. The topic model is tested on the different relationships among these words, and the degree to which it correctly learns these relationships translates to good performance in document classification and in the correct placement of a news story on its natural timeline.

A good corpus should have a big set of related stories for each news story it contains, if such related stories exist in the corpus. The bigger the set, the richer the news timeline becomes, and the more chances of success and failure the topic model will have in creating the timeline.
This generates another challenge in discovering the birth/death/revival of topics: the longer the timeline gets, the more of this topic life cycle can be detected or missed.

The set of related news stories provided by many news outlets for each news story they release can be either manually created or automatically generated. In the manual creation process, a news editor sifts through past news stories, manually or assisted by search tools, and looks for relevant stories; the relevancy judgment is made by a human. The number of stories in the set of related stories is usually kept within a certain limit to emphasize the importance and relatedness of the stories in the set. The set usually includes the five or seven most recent related stories. A related-stories set created manually this way is needed for my corpus, to be used in testing the performance of the timeline creation algorithm and the topic detection and tracking algorithm. This manually created set is called the gold standard. It represents the highest possible standard that any algorithm trying to solve this problem should seek to match.

Not all news outlets use human annotators to judge the relatedness of news stories when creating a set of related stories. Some of them use algorithms to do this job, usually driven by the need to avoid the cost of a human annotator. Related stories generated by such systems, like the Newstracker^1 system used by BBC News, cannot be used as a gold standard for my system: they do not represent an ultimate standard that cannot be surpassed, and their performance can be improved upon by other systems addressing the same problem. They can, however, be used as a baseline for the performance of other systems that try to match or even exceed them.

A good corpus should have news stories time-stamped with high time granularity.
Some news sources, such as news agencies, publish news stories round-the-clock; Agence France-Presse (AFP) releases, on average, one story every 20 seconds.^2 Such a fast-paced publication needs a resolution of a few minutes, and accurately time-stamped stories, to correctly place each story on its timeline.

A news timeline typically contains many news stories, usually over ten, and extends over a relatively long period of time, several months long. One set of related news stories cannot be used to create a timeline; it typically contains fewer than ten related stories, and in many cases the stories extend over only a few days or a week. To be able to create a timeline, different sets of related news stories have to be chained: for each story in the set of related stories, I get its set of related stories; for each one of these stories in turn, I get a set of related stories. This process can repeat, and I can extend the chain as desired. However, the longer the chain gets, the less related the stories at the end of the chain will be to the original story at the other end, because at each step along the chain we compare stories to the current node (story) in the chain, not to the first node (the original story) whose related set we want.

I create a news corpus by crawling news websites such as the website of the British newspaper The Guardian. I start with a set of diverse news story seeds, or links, that typically covers a wide variety of topics and geographical areas. For each of these links, I use the link as the story identifier, and I crawl the main news story text and title, its release date and time, and the set of related news stories.

^1 BBC links to other news sites: http://www.bbc.co.uk/news/10621663
^2 Agence France-Presse (AFP) releases, on average, one news story every 20 seconds (5,000 per day).
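The chaining procedure described above — expanding related-story sets hop by hop, while keeping chains short so relatedness to the original story does not decay too far — can be sketched as a depth-limited breadth-first traversal. The toy `related_of` mapping below is hypothetical; a real run would be fed by the crawler:

```python
from collections import deque

# related_of maps a story ID to its (editor-provided) related-story IDs.
# A hypothetical toy graph standing in for real crawl output.
related_of = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A"],
    "D": ["B", "E"], "E": ["D"],
}

def chain_related(seed, max_hops):
    """Breadth-first expansion of related-story sets, depth-limited
    because relatedness to the seed decays with every hop."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        story, hops = frontier.popleft()
        if hops == max_hops:
            continue  # stop extending the chain past the hop budget
        for rel in related_of.get(story, []):
            if rel not in seen:
                seen.add(rel)
                frontier.append((rel, hops + 1))
    return seen
```

With one hop from seed "A" only its direct related set is gathered; with three hops the whole toy chain is reached, illustrating how longer chains sweep in ever less related stories.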
I follow the links in the related news stories section to create the set of related news stories. These related stories were hand-picked by a news editor, and therefore I can use them for my gold standard. I repeat this process until I have crawled a predefined number of news stories.

5.2 Corpora

For my experiments I use two corpora: a Reuters corpus and a BBC news corpus. (For a sense of scale, Reuters releases 800,000 English-language news stories annually; see http://www.afp.com/en/agency/afp-in-numbers and http://www.infotoday.com/it/apr01/news6.htm.)

5.2.1 Reuters

The Reuters-21578 corpus [47]^3 is currently the most widely used test collection for text categorization research, and it is available for free online. The corpus is made of 21,578 documents in English that appeared on the Reuters newswire in 1987. The average number of unique words in a news story in this corpus is 58.4, and the corpus has a vocabulary of 28,569 unique words. This collection of news articles comes in XML format. Each document comes with:

Date: The date the story was published, accurate to milliseconds. E.g., 26-FEB-1987 15:01:01.79.
Topic: A manually assigned topic that the news story discusses. There are 135 topics. E.g., cocoa.
Places: A geographical location where the story took place. E.g., El Salvador.
People: Names of famous people mentioned in the story. E.g., Murdoch.
Organizations: Names of organizations mentioned in the story. E.g., ACM.
Exchanges: Abbreviations of the various stock exchanges mentioned in the story. E.g., NASDAQ.
Companies: Names of companies mentioned in the story. E.g., Microsoft.
Title: The title of the news story. E.g., "Bahia cocoa review".
Body: The body of the news story. E.g., "Showers continued throughout the week in the Bahia cocoa zone, ..."

In my experiments, I only use Date and Body.
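A minimal sketch of pulling just these two fields out of a Reuters-21578-style record (the fragment below is simplified; the real distribution's markup carries more tags, attributes, and entities):

```python
import re

# A simplified Reuters-21578-style record; only the fields relevant to
# the experiments (Date and Body) are kept in this toy fragment.
record = """<REUTERS NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in the Bahia cocoa zone, ...</BODY>
</TEXT>
</REUTERS>"""

def extract(tag, text):
    """Return the trimmed contents of the first <tag>...</tag> pair."""
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), text, re.DOTALL)
    return m.group(1).strip() if m else None

date = extract("DATE", record)  # the two fields the experiments use
body = extract("BODY", record)
```

A full preprocessing pass would loop such extraction over every record in each corpus file before tokenization and filtering.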
^3 http://www.daviddlewis.com/resources/testcollections/reuters21578/

5.2.2 BBC News

The BBC news corpus is made of 10,000 news stories in English that I collected myself from the BBC news website.^4 The stories in the corpus cover a period of about two and a half years, from April 2010 to November 2012. The average number of unique words in a news story in this corpus is 189.6, and the corpus has a vocabulary of 64,374 unique words. For each news story I collect:

ID: An identifier for the news story; I use the story's web page URL. E.g., http://www.bbc.co.uk/news/world-europe-10912658
Date: The date the story was published, accurate to seconds. E.g., 2010/08/09 15:51:53
Title: The title of the news story. E.g., "Death rate doubles in Moscow as heat wave continues"
Body: The body of the news story. E.g., "Death rate doubles in Moscow as heatwave continues. Extreme heat and wildfires have led to ..."
Related: The ID of a related news story. E.g., http://www.bbc.co.uk/news/world-europe-10916011

^4 http://www.bbcnews.com

5.3 Evaluation of proposed solution

To evaluate the performance of the proposed solution I use the per-word log-likelihood of a news story conditioned on the rest of the data. More formally, the performance measure is:

likelihood_pw = ( Σ_{w∈D} log p(w | D_{−w}) ) / ( Σ_{w∈D} |w| )    (5.1)

where w is a document in the collection of documents D, |w| is the number of words in document w, and D_{−w} is the collection D minus document w.

[Figure 5.1: Illustration of the birth/death/resurrection of five topics (topics 1-5 plotted against time). Discontinuity of a topic indicates its death.]

The use of log-likelihood as an evaluation measure for topic models is widely accepted in the machine learning community [19, 69, 71].
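Note that (5.1) is a ratio of totals, not an average of per-document ratios. A sketch of the aggregation with made-up per-document values (the log p(w | D−w) terms would come from the trained model's predictive distribution):

```python
# Hypothetical held-out documents: word counts and total predictive
# log-likelihoods; the log_p values are invented for illustration.
docs = [
    {"n_words": 120, "log_p": -4200.0},
    {"n_words": 80,  "log_p": -2960.0},
    {"n_words": 200, "log_p": -6800.0},
]

def per_word_log_likelihood(docs):
    """Equation (5.1): total log-likelihood divided by total word count,
    so long documents weigh in proportionally, not one-vote-per-document."""
    total_log_p = sum(d["log_p"] for d in docs)
    total_words = sum(d["n_words"] for d in docs)
    return total_log_p / total_words

ll_pw = per_word_log_likelihood(docs)
```

Here the totals are -13,960 over 400 words, giving -34.9 per word; higher (less negative) values indicate a better fit.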
It is a measure of the goodness of fit of the model to the data [3]. The better the fit, the more likely the model will label the documents with the right topics. Using the natural log function is convenient for several reasons: 1) it reduces the chances of running into an underflow problem due to very small likelihood values when simulating the model; 2) it allows us to use a summation of terms (computationally cheap) instead of a product of terms (more computationally expensive); and 3) the natural log function is a monotone transformation, which means that the extrema of the log-likelihood function coincide with the extrema of the likelihood function [58].

Scalability: A real-world news feed contains hundreds of thousands of archived news articles, with more articles being generated every minute. An offline model is expected to be able to build the storylines of a hundred thousand documents in less than a day running on a desktop machine. An online model, if implemented, should be able to process the incoming documents in real time as they stream in.

Accuracy: The accuracy of the model will be measured with the per-word log-likelihood given in (5.1). A higher value of the per-word log-likelihood is desirable: it means that the documents have topics with word distributions that are consistent with the per-topic word distributions the model has learned.

Chapter 6

Experimental design and results

I ran experiments to evaluate the performance of my ciDTM model against the continuous-time dynamic topic model (cDTM) presented in Section 3.5 and the online hierarchical Dirichlet process (oHDP) presented in Section 3.4. To find the best settings for each of the competing models, I evaluated their performance with different practical setting values.
In the next two sections I show the results obtained by running these experiments on the Reuters-21578 and BBC news corpora presented in Section 5.2 on page 53. I used the code provided by Wang et al. [72]^1 to build and run the cDTM system, and the code provided by Wang et al. [73]^2 to build and run the oHDP system. In my own system, ciDTM, I used some code from these two software packages. I also used MALLET [53] for corpus creation, parsing, and filtering.

6.1 Reuters corpus

cDTM number of topics: Since the cDTM has a fixed number of topics, its performance is expected to vary with this parameter. The closer its value is to the average number of topics the model can fit to the data over time, the higher the performance will be. I first evaluated the performance of the cDTM with values for the number of topics ranging from 1 to 200.

^1 http://www.cs.cmu.edu/~chongw/software/cdtm.tar.gz
^2 http://www.cs.cmu.edu/~chongw/software/onlinehdp.tar.gz

[Figure 6.1: cDTM performance varies with different values for the number of topics (1, 10, 50, 100, 150, and 200) on the Reuters corpus. A value of 50 leads to the best model fit among the six values tried. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

The per-word log-likelihood values for the inferred documents are presented in Figure 6.1. Higher values indicate better model performance.
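The 100-document moving average used to smooth the per-word log-likelihood plots can be sketched as a sliding window over the raw per-document series (the series below is a synthetic stand-in for real model output):

```python
def moving_average(values, window=100):
    """Smooth a noisy per-document series: each output point is the mean
    of `window` consecutive input values, updated incrementally."""
    if len(values) < window:
        return []
    out = []
    running = sum(values[:window])
    out.append(running / window)
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one
        out.append(running / window)
    return out

# Synthetic stand-in: a repeating 0..9 pattern whose every 100-wide
# window has mean 4.5, so the smoothed series is flat.
series = [float(i % 10) for i in range(300)]
smoothed = moving_average(series, window=100)
```

A series of n points yields n - window + 1 smoothed points, which is why the plotted curves start slightly after the first documents.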
In all cDTM experiments to follow, the cDTM model was first trained (unsupervised) on a set of 5,000 documents sampled uniformly from the entire corpus of 10,000 documents; the performance of the trained model was then evaluated on the remaining documents. This guarantees that the documents the model is tested on cover the same time range as the training documents.

Figure 6.1 shows that, among the six values tried, the model performed best with 50 topics. This reflects the moderate richness of the 5,000 Reuters news stories the model was tested on. By manually inspecting the corpus, I found that most of its documents mainly discuss economy and finance, and only slightly cover politics. This corpus is somewhat limited in the topics it covers compared to the wide spectrum of topics covered by the stories released by the Reuters news agency.

[Figure 6.2: oHDP per-word log-likelihood for different batch size and κ values (batch size : κ of 1:1, 8:1, 16:1, 64:0.8, 256:0.8, 1024:0.6, and 2048:0.6) on the Reuters corpus. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

oHDP batch size: oHDP uses online variational inference to fit its parameters. In each iteration of the algorithm, it uses a small batch of documents to update these parameters. I experimented with different batch size and model parameter (κ) values to determine the best values to use. These values were suggested by Wang et al. [73], as they generated the best results in their case. Lower κ values favor larger batch sizes.
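The interaction between κ and the batch size comes from the stochastic (online) variational update, which blends the old global parameters with a noisy per-batch estimate using a step size that decays as (τ₀ + t)^(−κ). This is the standard online-variational recipe that oHDP follows, but τ₀ and the toy "batch estimates" below are illustrative, not the thesis settings:

```python
# kappa controls how fast the step size rho_t decays across mini-batch
# updates: noisier (smaller) batches call for faster decay (larger kappa).

def online_updates(batch_estimates, kappa, tau0=1.0, init=0.0):
    """Apply a sequence of stochastic variational updates:
    lam <- (1 - rho_t) * lam + rho_t * lam_hat, rho_t = (tau0 + t)^(-kappa)."""
    lam = init
    for t, lam_hat in enumerate(batch_estimates):
        rho = (tau0 + t) ** (-kappa)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam

# Noisy estimates of the same target (mean 10): repeated updates with a
# decaying step size average toward it, whatever the per-batch noise.
estimates = [10.0, 12.0, 8.0, 11.0, 9.0, 10.0] * 50
final = online_updates(estimates, kappa=0.8)
```

For κ in (0.5, 1] the step sizes satisfy the usual Robbins-Monro conditions, which is why the experiments only consider κ values in that range.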
It should be noted that the batch size should be kept relatively small to effectively retain the online-inference character of this algorithm. Experimental results with seven values are presented in Figure 6.2. The two largest batch sizes, 1024 and 2048, were only included to show the trend of decreasing performance with increasing batch size. They are not used later on, because such large batches require a long waiting period to collect the documents and update the system, leaving the system out-of-date in the meantime. For the oHDP model, I use the same values suggested by Wang et al. [73], given in Table 6.1.

T = 300    K = 20    α = 1    ζ = 0.01    γ = 1
Table 6.1: Setting values used for oHDP

The per-word log-likelihood values for the model are better with smaller batch sizes; the difference is minor for batches smaller than 64.

ciDTM batch size: The size of the batch of documents ciDTM processes in each iteration of its algorithm has a double effect: 1) it affects the convergence of the online variational inference algorithm, as was the case with oHDP above, and 2) it affects the variational Kalman filter used to evolve the per-topic word distributions in continuous time. Larger batches of documents are expected to improve the performance of the model, as the document timestamp (arrival time) information and the inter-arrival times between documents are used by the filter to dynamically evolve the per-topic word distributions. Note that a batch of size one cannot be used in ciDTM: the Kalman filter evolves the per-topic word distributions based on document inter-arrival times within the batch, and the process running the filter starts fresh with every new batch of documents.

Figure 6.3 shows the results of an experiment in which four different batch size values were tried.
[Figure 6.3: ciDTM per-word log-likelihood for different batch size and κ values (batch size : κ of 8:1, 16:1, 64:0.8, and 256:0.8) on the Reuters corpus. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

Since this model has the benefit of using an online variational inference algorithm, the batch size should be kept small to maintain this desirable feature. From the figure we can see that a point of diminishing returns is reached at a batch size of 64: the performance gained by increasing the batch size from 64 to 256 is negligible. On the other hand, if documents arrive at a fixed rate, then 64 documents need 4.5 hours to gather, while 256 documents require 18 hours. At the pace of 14 documents per hour encountered in this corpus, a once-a-day update would be enough to keep the model up-to-date. I use a batch size of 256 as the optimal batch size for this model.

The ciDTM model parameter values I used are equal to their corresponding parameter values in the cDTM and oHDP models, where applicable. More specifically, the model parameters I used are:

T = 300    K = 20    α = 1    ζ = 0.01    γ = 1    α₀ = 0.2
Table 6.2: Setting values used for ciDTM

Comparing the three models: Figure 6.4 shows the per-word log-likelihood performance of the three competing models using the best setting values discovered earlier. It is clear that ciDTM outperforms both cDTM and oHDP by a big margin on this news corpus, especially ciDTM with batch sizes greater than 16.
Figure 6.5 shows the non-smoothed per-word log-likelihood values for each of the three models, with ciDTM using a batch size of 10, oHDP a batch size of 1, and cDTM 10 topics. ciDTM outperformed oHDP, while cDTM performed the worst of all three. The improvement of ciDTM over oHDP is significant given that the value on the vertical axis is a log value. The per-word log-likelihood value is evaluated for each document in the mini-batch of documents processed by ciDTM in each iteration of the algorithm. Since cDTM relies on an offline learning algorithm, its per-word log-likelihood is evaluated after the model has been trained on the training data set. This explains the steady-state performance of the cDTM curve: the word distributions for the topics covered by the corpus were learned by the model before the log-likelihood values were evaluated.

[Figure 6.4: A comparison of ciDTM (batch size 256, κ = 0.8), oHDP (batch sizes 1 and 16, κ = 1), and cDTM (10 and 50 topics) using their best setting values discovered earlier on the Reuters corpus. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

The fluctuation in the data in the case of ciDTM and oHDP is due to the fact that they encounter new documents at every time step. As the models learn the word distributions for these topics, the fluctuations decrease over time. The gap periods in the graph are dormancy periods when no news stories were being published.
6.2 BBC news corpus

I repeated the same set of experiments run above, this time using the BBC news corpus presented in Section 5.2.2 on page 55.

[Figure 6.5: Comparison of the per-word log-likelihood for cDTM (10 topics), oHDP (batch size 1, κ = 1), and ciDTM (batch size 10, κ = 1) using 10,000 news stories from the Reuters-21578 corpus [47]. Higher per-word log-likelihood values indicate a better fit of the model to the data.]

cDTM number of topics: The cDTM model was trained on a set of 5,000 documents sampled uniformly from the 10,000-document corpus. The trained model was then used to infer the documents' topic mixtures and the topics' word mixtures for the other half of the corpus. Figure 6.6 shows the per-word log-likelihood values obtained using different values for the number of topics. The per-word log-likelihood performance of this model improved as the number of topics increased, reaching its peak with a model of 200 topics. A higher number of topics hurt the model's performance, as can be seen in the 250-topic model, which performs slightly worse than the 200-topic model.

[Figure 6.6: cDTM performance varies with different values for the number of topics (1, 10, 50, 100, 150, 200, and 250) on the BBC news corpus. A value of 200 leads to the best model fit among the seven values tried. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

The reason this model reached
its peak performance with 200 topics on this corpus, while the same model peaked at 50 topics on the Reuters corpus, can be explained by the properties of the two corpora. The BBC news corpus has a higher vocabulary richness and relatively lengthy documents compared to the Reuters news corpus: the BBC news corpus has a vocabulary of 64,374 unique words and an average document length of 189.6 unique words, while the Reuters news corpus has a vocabulary of 28,569 unique words and an average document length of 58.4 unique words.

[Figure 6.7: oHDP per-word log-likelihood for different batch size and κ values (batch size : κ of 1:1, 8:1, 16:1, 64:0.8, 256:0.8, 1024:0.6, and 2048:0.6) on the BBC news corpus. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

oHDP batch size: I experimented with different batch size and κ values for the oHDP model. I used the values suggested by Wang et al. [73], along with other values following the same trend. Wang et al. [73] found that higher κ values favor smaller batch sizes. Figure 6.7 shows the results of running these experiments. The trend obtained is similar to the one obtained with the Reuters corpus: the best per-word log-likelihood was obtained with a batch size of 1, and as the batch size increased, the performance tended to drop consistently.
[Figure 6.8: ciDTM per-word log-likelihood for different batch size and κ values (batch size : κ of 8:1, 16:1, 64:0.8, and 256:0.8) on the BBC news corpus. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]

ciDTM batch size: The size of the batch of documents the ciDTM model processes in each iteration affects the convergence of the inference algorithm and the variational Kalman filter used to evolve the per-topic word distributions in continuous time. Figure 6.8 shows how the per-word log-likelihood performance of the model changes with different batch size values. The trend mirrors the one obtained with the Reuters corpus: model performance improves as the batch size increases, and a point of diminishing returns is reached around a batch size of 64. The minor performance gain obtained with higher batch size values is outweighed by the longer periods separating model updates.

[Figure 6.9: A comparison of ciDTM (batch size 256, κ = 0.8), cDTM (200 topics), and oHDP (batch size 1, κ = 1) using their best setting values discovered earlier on the BBC news corpus. The per-word log-likelihood values shown are the moving average of a set of 100 consecutive documents; this averaging was needed to smooth out the plot. Higher values indicate better model fit on this 10,000-news-story corpus.]
Comparing the three models. Figure 6.9 shows the per-word log-likelihood performance for the cDTM, oHDP, and ciDTM models using the best settings discovered in the previous experiments. ciDTM outperforms both oHDP and cDTM by a wide margin. The steady-state performance of cDTM and oHDP is about the same; it took oHDP about 2,000 documents to reach this stage.

6.2.1 Why ciDTM outperforms cDTM

ciDTM outperforms cDTM because ciDTM evolves the number of topics per document while cDTM uses a fixed number of topics per document.

Limitations of cDTM. Because cDTM uses a fixed and predefined number of topics per document, it does not allow for topic discovery. When cDTM infers the topic mixture of a document that discusses a new topic, it has to label the document with the predefined set of topics. The new topic gets absorbed into that predefined set and is never learned as a topic of its own. There is then no single topic from which all the words in that document can be sampled. This directly reduces the per-word log-likelihood of the document's words given a topic, which is the measure I use to evaluate the performance of the three competing models.

Advantage of ciDTM. ciDTM evolves the number of topics per document to reflect the number of topics discovered so far in the corpus and the richness of the document. As more documents are processed by ciDTM, more topics are discovered and their word distributions are learned from the words in these documents. Discovering new topics, instead of absorbing them into a set of predefined topics, directly improves the per-word log-likelihood of sampling the words in a document given a single topic. This is why ciDTM achieves a higher per-word log-likelihood value than cDTM.
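Per-word log-likelihood, the fit measure used throughout this chapter, can be sketched for a simple mixture model as follows. This is a generic illustration assuming a per-document topic mixture `theta` and per-topic word distributions `phi` (names I introduce here for illustration), not the dissertation's evaluation code:

```python
import math

def per_word_log_likelihood(doc, theta, phi):
    """Average log p(word | theta, phi) over the words of one document.

    doc   : list of word ids
    theta : per-document topic mixture, theta[k]
    phi   : per-topic word distributions, phi[k][w]
    """
    total = 0.0
    for w in doc:
        # Marginalize the word probability over topics
        p_w = sum(theta[k] * phi[k][w] for k in range(len(theta)))
        total += math.log(p_w)
    return total / len(doc)

# Two topics over a three-word vocabulary (illustrative numbers only)
theta = [0.5, 0.5]
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
doc = [0, 1, 2]
ll = per_word_log_likelihood(doc, theta, phi)
```

A model whose inferred topics place high probability on the words that actually occur yields values closer to zero, which is why higher curves in Figures 6.7 through 6.9 indicate better fit.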
6.2.2 Why ciDTM outperforms oHDP

ciDTM outperforms oHDP because ciDTM evolves the per-topic word distribution in continuous time using document timestamps, while oHDP relies only on document ordering and does not use document timestamps.

Advantage of ciDTM. Because ciDTM knows how much time elapsed between documents that belong to the same topic, it evolves the topic word distribution accordingly to reflect the real change in that distribution. The longer the time period separating two documents covering the same topic, the more drastic the change to the topic word distribution over that period becomes. This directly proportional relationship between the length of the time period and the degree of change to the topic word distribution is plausible and can be seen in real life. The closeness of the inferred topic word distribution to the real one translates to a higher per-word log-likelihood value for the model, because it helps ciDTM better fit the model to the data.

Limitations of oHDP. Since oHDP is unaware of the time period separating documents, its change to the topic word distribution is unaffected by the length of this period. If this period grows long enough, the real topic word distribution will change drastically and oHDP will not change its topic word distribution to match it, because oHDP relies on document ordering to make this change and is not aware of the length of that time period. The large distance between the inferred topic word distribution and the real one leads to a lower per-word log-likelihood value for oHDP.

6.3 Scalability

A practical topic model system that infers the topic structure of news stories arriving in real time from a news outlet, such as a news agency, should be able to process these documents in real time.
All three topic model systems I experimented with earlier, that is, the continuous-time dynamic topic model (cDTM), the online hierarchical Dirichlet process based topic model (oHDP), and the continuous-time infinite dynamic topic model (ciDTM), were able to train on the entire set of 10,000 documents in less than six hours. The topic inference process (the test phase of the model), in which the topic mixtures of news stories arriving in real time from a news outlet are inferred, takes much less time to finish than the training phase. The cDTM model was able to finish this task for a test set of 5,000 documents in less than 5 minutes. Because oHDP and ciDTM are online learners, the learning and inference phases alternated; the time spent running the test phase was included in the overall run time for these models and was not measured separately. By manual inspection I found that the time spent on the test phase is negligible compared to the time spent on the training phase.

Given that a news agency like Agence France-Presse (AFP) releases on average 5,000 news stories per day, that is, roughly one news story every 17 seconds, and assuming that the running time of all three systems is monotonically increasing in corpus size, all three systems presented can train on the entire set of documents released by AFP and then infer the topic mixture of a set of documents of the same size in real time.

However, the cDTM system is an offline learner: with every newly arriving document or set of documents it has to be retrained from scratch. As time passes, the number of released documents that the model needs to train on increases linearly with time, and at some point in the future the system will not be able to train on all released documents before a new batch of documents requiring analysis arrives from the news outlet.
This lag will keep increasing linearly with time. For cDTM to work in real time, a compromise must be made. The system could maintain a limited history of the most recently released documents to train on, with the history limited so that the cDTM system can train on it and then infer the topic mixtures in real time. This solution has a negative side effect on the performance of the system, as the per-topic word distribution evolved by the system will not reflect the per-topic word distribution of documents outside the limited history. The system ignores older documents because it does not have enough time to train on them.

The oHDP and ciDTM systems are online learners. They can train on every news story arriving live from a news agency like AFP thanks to their fast online learning algorithms, and they can infer the topic mixture of news stories as they arrive as well. The effect of the per-topic word distribution seen in a document is retained by the system and carried forward to affect the topic composition of documents arriving at any point in the future, as long as the system is up and running. The states of these two systems thus reflect the entire history of documents they have received.

To illustrate the time requirements of the three systems and how they scale up, I measured the wall-clock time it takes each of them to train on subsets of the Reuters corpus of size 1,000; 2,000; 5,000; and 10,000 documents. The inference (testing) time was negligible compared to training time (about a 1 to 100 ratio) for all three systems. For each system I used the settings that resulted in the best performance in the earlier experiments on the Reuters news corpus: a cDTM model with 50 topics, an oHDP model with a batch size of 1, and a ciDTM model with a batch size of 256.
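A wall-clock measurement of this kind takes only a few lines to script. The sketch below shows the measurement pattern only; `dummy_train` is a hypothetical stand-in of mine, not one of the three actual training routines:

```python
import time

def wall_clock_minutes(train_fn, corpus):
    """Measure wall-clock training time in minutes for one model on one corpus."""
    start = time.perf_counter()
    train_fn(corpus)
    return (time.perf_counter() - start) / 60.0

# Hypothetical stand-in for a real training routine
def dummy_train(corpus):
    for _ in corpus:
        pass

# One timing per corpus size, mirroring the subsets used in the experiment
minutes = {n: wall_clock_minutes(dummy_train, range(n))
           for n in (1_000, 2_000, 5_000, 10_000)}
```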
The results are presented in Figure 6.10.

6.3.1 Scalability of ciDTM and oHDP vs cDTM

As shown in the figure, cDTM does not scale up well with the size of the corpus. It becomes practically infeasible to use it in real life applications with corpora of average size (~10,000 documents). This nearly exponential growth in running time can be attributed to the fixed number of topics in the model. The reason cDTM does not scale up well with the size of the corpus is that the number of variables in the variational distribution over the latent variables is exponential in the number of topics and corpus size [72]. Since cDTM needs to tune the variational distribution variables to minimize the KL distance between the true posterior and the variational posterior, the time it needs to do so grows linearly with the number of these variables, which in turn grows exponentially with the size of the corpus.

Figure 6.10: The wall-clock running time in minutes for cDTM (50 topics), oHDP (batch size = 1), and ciDTM (batch size = 256) using subsets of the Reuters news corpus with different numbers of documents. These systems were run on a PC with 16 GB of RAM and a 3.4 GHz processor.

On the other hand, oHDP and ciDTM scale up well with the size of the corpus. This graceful scalability can be attributed to two factors: 1) the online variational inference algorithm, which scales linearly with the number of documents, and 2) the number of topics, which varies from one document to another. Note, however, that the oHDP based topic model is faster than ciDTM. This is due to the extra time needed by ciDTM to run and sample from the Brownian motion model [54] that evolves the per-topic word distribution over time.
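The role of the elapsed time Δt in that Brownian motion model can be illustrated with a small sketch in which a topic's natural parameters receive Gaussian noise whose variance grows linearly with Δt. The variance parameter `sigma2` and the softmax mapping back to a word distribution are my own illustrative choices, not the exact ciDTM parameterization:

```python
import numpy as np

def evolve_topic(beta, dt, sigma2=0.01, rng=None):
    """Evolve a topic's natural parameters by Brownian motion over a gap of dt.

    beta : unnormalized log word weights for one topic
    dt   : elapsed time since the topic was last observed
    The noise variance is sigma2 * dt, so a long dormant period
    permits a larger drift in the word distribution.
    """
    rng = rng or np.random.default_rng(0)
    beta_next = beta + rng.normal(0.0, np.sqrt(sigma2 * dt), size=beta.shape)
    # Map natural parameters back to a normalized word distribution
    weights = np.exp(beta_next - beta_next.max())
    return beta_next, weights / weights.sum()

beta0 = np.zeros(5)                       # uniform topic over 5 words
_, dist_short = evolve_topic(beta0, dt=0.1)
_, dist_long = evolve_topic(beta0, dt=100.0)
```

With the same random draws, the distribution after the long gap drifts further from uniform than the one after the short gap, which is the behavior the scalability and timeline discussions rely on.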
6.4 Timeline construction

For the timeline construction task, I manually selected 61 news stories from the BBC news corpus covering the Arab Spring events from their inception in January 2011 until the near present (November 2012). I will call this set of news stories the Arab Spring set. Discovering such a topic is challenging, as the events that fall under it are geographically scattered across the Middle East, and geographical names in the news stories will not give much of a clue to the topic model. More importantly, the events associated with this topic evolve rapidly over a short period of time, and the set of vocabulary associated with it changes as well.

The cDTM is not suitable for this task. It assigns a fixed number of topics to every document in the corpus. If this number is high, the topic we are trying to discover will be split over more than one topic; if the number is low, the topic will be merged with other topics. Moreover, since the number of topics typically varies from one document to another, any single fixed value will always be inferior to approaches that evolve it dynamically.

I trained the oHDP based topic model and ciDTM on the entire BBC news corpus and inferred the topics associated with each document in the Arab Spring set. Then, I manually inspected the inferred topics and found the one that corresponds to the Arab Spring events. Both systems were trained using their best settings found in the earlier experiments on the BBC news corpus in Section 6.2.

Figure 6.11 shows the inferred topics for documents in the Arab Spring set using oHDP; the notation I use is described in the figure caption. After the start of the Arab Spring in early January 2011, the documents discussing its events were assigned topic 3, which in the oHDP model corresponds to accidents and disasters.
One month later, the model was able to infer the correct topic and associate it with the documents of the Arab Spring events. The performance of the system was steady from mid February 2011 until June 2012. After that, the topic evolved rapidly once the Egyptian president assumed office in late June 2012, and oHDP was unable to evolve the word distribution associated with that topic quickly enough to keep track of it. We can see that after that date the model was unable to infer the correct topic for 3 documents in this set; these documents were associated with the accidents and disasters topic instead. Overall, the model was unable to infer the correct topic for 10 out of 61 documents in the Arab Spring set.

Figure 6.11: oHDP timeline construction for the Arab Spring topic over the 61 documents covering this topic. The independent axis is time, and the dependent axis is the topic index. Topic 0 represents the Arab Spring events topic. A pink dot (•) represents the real topic that should be discovered, a black circle with a small dot represents the topic(s) discovered for the document, and a black circle with a pink dot in the middle is a real topic that was successfully discovered by the system.

Table 6.3: Confusion matrix for timeline construction using oHDP. AS stands for the Arab Spring topic, and non-AS for the non Arab Spring topic. The true class is presented in rows, and the inferred class in columns.

                 Inferred AS   Inferred non-AS   Accuracy
  True AS             51              10           0.836
  True non-AS          0              13           1.000
                                           Overall  0.865

Table 6.3 shows the confusion matrix for the oHDP topic assignment to news stories belonging to the Arab Spring timeline. oHDP failed to correctly tag 10 out of 61 stories as belonging to that topic.
However, it successfully labeled all 13 stories which had no Arab Spring topic components with the correct topic label. This gives oHDP a recall score of 83.6% and a precision of 100%.

Figure 6.12: ciDTM timeline construction for the Arab Spring topic over the 61 documents covering this topic. The independent axis is time, and the dependent axis is the topic index. Topic 0 represents the Arab Spring events topic. A pink dot (•) represents the real topic that should be discovered, a black circle with a small dot represents the topic(s) discovered for the document, and a black circle with a pink dot in the middle is a real topic that was successfully discovered by the system.

Figure 6.12 shows the inferred topics for documents in the Arab Spring set using ciDTM; it uses the same notation as the previous figure, described in its caption. Topic 0 represents the Arab Spring topic in this model. At first glance at the figure, we notice that ciDTM was able to infer the correct topic for the first document in the Arab Spring set. This can be explained by the fact that ciDTM trains on batches of 256 documents: the model was able to train on a large batch that included some Arab Spring set documents and to learn the new topic's word distribution before inferring their topics. If that explanation is valid, then this advantage is not intrinsic to ciDTM, as the oHDP based model could behave similarly if its batch size were increased.

We notice that some documents in this set were assigned multiple topics by this model. Besides topic 0, which corresponds to the Arab Spring events, topics 1, 2, or 3 sometimes appear together with topic 0 in the same document.
By inspecting these topics, I found that topic 1 corresponds to economy and finance, topic 2 to health and medicine, and topic 3 to accidents and disasters. Some of the early documents in the Arab Spring set were assigned topic 3; this reflects the violence that marred the early days of the Egyptian revolution in late January 2011. The association of topic 2 with the set's documents may reflect the mention of injured protesters at that time. Later documents in this set are more associated with topic 1. This can be explained by the volume of published news stories discussing the Egyptian government's efforts to recover from the economic and financial damage which the revolution inflicted on the country.

Table 6.4: Confusion matrix for timeline construction using ciDTM. AS stands for the Arab Spring topic, and non-AS for the non Arab Spring topic. The true class is presented in rows, and the inferred class in columns.

                 Inferred AS   Inferred non-AS   Accuracy
  True AS             57               4           0.934
  True non-AS          0              13           1.000
                                           Overall  0.946

Table 6.4 shows the confusion matrix for the ciDTM topic assignment to news stories belonging to the Arab Spring timeline. ciDTM shows about a 10% improvement in accuracy over the oHDP based topic model due to a 10% higher true positive rate. ciDTM achieves a recall score of 93.4% and a precision of 100%; it has higher recall without sacrificing precision.

6.4.1 Why ciDTM is better than oHDP in timeline construction

ciDTM is better than oHDP in timeline construction because ciDTM evolves the per-topic word distribution in continuous time, while oHDP relies only on document ordering and does not make use of document timestamps in evolving the per-topic word distribution. This can be seen by comparing the timelines constructed for the Arab Spring topic using the oHDP based topic model in Figure 6.11 and using ciDTM in Figure 6.12.
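The recall, precision, and accuracy figures reported for these confusion matrices follow directly from the four counts. A minimal sketch, using the ciDTM counts from Table 6.4:

```python
def binary_metrics(tp, fn, fp, tn):
    """Recall, precision, and overall accuracy from a 2x2 confusion matrix."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return recall, precision, accuracy

# ciDTM, Table 6.4: 57 true positives, 4 false negatives,
# 0 false positives, 13 true negatives
recall, precision, accuracy = binary_metrics(tp=57, fn=4, fp=0, tn=13)
```

The same computation applied to the oHDP counts of Table 6.3 (51, 10, 0, 13) reproduces its 83.6% recall and 86.5% overall accuracy.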
oHDP and ciDTM were both able to detect the birth of a new topic, the Arab Spring topic, soon after it started in January 2011. The true positive rate was high for both models (oHDP = 0.91, ciDTM = 0.94) in the first part of this timeline (January 2011 to March 2012). In the second part of the timeline (June 2012 to November 2012) the true positive rate for oHDP dropped sharply to 0.74, while it dropped only slightly, to 0.92, for ciDTM. In what follows I explain why ciDTM maintained a high true positive rate in the second part of the Arab Spring topic timeline while the true positive rate for oHDP dropped sharply. The corpus used to train both models had no documents with the Arab Spring topic in the period from March 2012 to June 2012.

Limitations of oHDP. Because oHDP evolves the per-topic word distribution based on document ordering and does not use document timestamps, it was unaware of the long time period (3 months) that separated the last document in the first part of the timeline from the first document in the second part. Because of that, oHDP did not evolve the Arab Spring topic word distribution to reflect the changes in vocabulary of the news stories covering this topic. The actual word distribution for the Arab Spring topic changed significantly, and oHDP did not change its Arab Spring topic word distribution to match it; oHDP was unaware of the long period (March 2012 to June 2012, 3 months) in which no documents covering this topic were included in the corpus. Because of the large difference between the actual Arab Spring topic word distribution and oHDP's, oHDP failed to correctly label many news stories with the Arab Spring topic in the second part of the timeline, after the long period of topic dormancy.
This mislabeling resulted in a sharp drop in the true positive rate for oHDP in the second part of the timeline compared to the value it achieved in the first part.

Advantage of ciDTM. On the other hand, ciDTM uses document timestamps to evolve the per-topic word distribution. When ciDTM infers the topic mixture for the first document in the second part of the Arab Spring timeline (June 2012 to November 2012), it evolves the Arab Spring topic word distribution to reflect the long period (March 2012 to June 2012, 3 months) of dormancy of this topic. The word distribution of the Arab Spring topic evolved by ciDTM over this long dormant period is closer to the actual Arab Spring topic. Because of this closeness, ciDTM was able to correctly label news stories with the Arab Spring topic in the second part of the timeline. This successful labeling resulted in only a slight drop in the true positive rate for ciDTM in the second part of the Arab Spring timeline compared to the value it achieved in the first part.

Bibliography

[1] Amr Ahmed and Eric P. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In SDM, pages 219–230. SIAM, 2008. URL http://www.siam.org/proceedings/datamining/2008/dm08_20_Ahmed.pdf.

[2] Amr Ahmed and Eric P. Xing. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. In Peter Grünwald and Peter Spirtes, editors, UAI, pages 20–29, Corvallis, Oregon, 2010. AUAI Press. ISBN 978-0-9749039-6-5. URL http://uai.sis.pitt.edu/papers/10/p20-ahmed.pdf.

[3] Onureena Banerjee, Laurent El Ghaoui, and Alexandre d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res.
, 9:485–516, June 2008. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1390681.1390696.

[4] A.L. Barabási, R. Albert, and H. Jeong. Mean-field theory for scale-free random networks. Physica A: Statistical Mechanics and its Applications, 272(1):173–187, 1999. URL http://www.barabasilab.com/pubs/CCNR-ALB_Publications/199910-01_PhysA-Meanfield/199910-01_PhysA-Meanfield.pdf.

[5] Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K. I. Williams, and Aron Culotta, editors. Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, 2009. Curran Associates, Inc. ISBN 9781615679119. URL http://nips2009.topicmodels.net.

[6] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing edition, October 2007. ISBN 0387310738. URL http://www.worldcat.org/isbn/0387310738.

[7] D. Blackwell and J. B. MacQueen. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1:353–355, 1973. URL http://www.stat.ucla.edu/~macqueen/PP12.pdf.

[8] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012. ISSN 0001-0782. doi: 10.1145/2133806.2133826. URL http://www.cs.princeton.edu/~blei/papers/Blei2012.pdf.

[9] David M. Blei and Michael I. Jordan. Variational methods for the Dirichlet process. In Proceedings of the twenty-first international conference on Machine learning, ICML '04, pages 12–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. doi: 10.1145/1015330.1015439. URL www.cs.berkeley.edu/~jordan/papers/vdp-icml.pdf.

[10] David M. Blei and John D. Lafferty. Correlated topic models. In NIPS, 2005. URL http://books.nips.cc/papers/files/nips18/NIPS2005_0774.pdf.

[11] David M. Blei and John D.
Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, ICML '06, pages 113–120, New York, NY, USA, 2006. ACM. URL http://www.cs.princeton.edu/~blei/papers/BleiLafferty2006a.pdf.

[12] David M. Blei and John D. Lafferty. Topic models. In Mehran Sahami and Ashok Srivastava, editors, Text Mining: Classification, Clustering, and Applications. Taylor and Francis Group, 2009. URL http://www.cs.princeton.edu/~blei/papers/BleiLafferty2009.pdf.

[13] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003. ISSN 1532-4435. URL http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf.

[14] N. E. Breslow and D. G. Clayton. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421):9–25, 1993. URL http://www.public.iastate.edu/~alicia/stat544/Breslow%20and%20Clayton%201993.pdf.

[15] Wray L. Buntine. Estimating likelihoods for topic models. In Asian Conference on Machine Learning, 2009. URL http://www.nicta.com.au/__data/assets/pdf_file/0019/20746/sdca-0202.pdf.

[16] Wray L. Buntine and Aleks Jakulin. Discrete component analysis. In SLSFS, volume 3940 of Lecture Notes in Computer Science, pages 1–33. Springer, 2005. ISBN 3-540-34137-4. URL http://cosco.hiit.fi/Articles/buntineBohinj.pdf.

[17] Mike Cafarella and Doug Cutting. Building Nutch: Open source search. Queue, 2(2):54–61, April 2004. ISSN 1542-7730. doi: 10.1145/988392.988408. URL http://doi.acm.org/10.1145/988392.988408.

[18] Asli Celikyilmaz. A semantic question / answering system using topic models. In Bengio et al. [5]. ISBN 9781615679119. URL http://www.umiacs.umd.edu/~jbg/nips_tm_workshop/16.pdf.

[19] Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. Reading tea leaves: How humans interpret topic models.
In Neural Information Processing Systems, 2009. URL http://www.umiacs.umd.edu/~jbg/docs/nips2009-rtl.pdf.

[20] Gregory F. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(2–3):393–405, 1990. ISSN 0004-3702. doi: 10.1016/0004-3702(90)90060-D. URL http://bmir.stanford.edu/file_asset/index.php/644/BMIR-1990-0317.pdf.

[21] Robert Darnton. Google and the future of books. Online, 2009.

[22] Pradipto Das and Rohini Srihari. Learning to summarize using coherence. In Bengio et al. [5]. ISBN 9781615679119. URL http://www.cedar.buffalo.edu/~rohini/Papers/pdasNIPS09Wkshp.pdf.

[23] P.A.M. Dirac. The principles of quantum mechanics. Clarendon Press, 1992. ISBN 9780198520115. URL http://books.google.com/books?id=svhAAQAAIAAJ.

[24] Gabriel Doyle and Charles Elkan. Financial topic models. In Bengio et al. [5]. ISBN 9781615679119. URL http://idiom.ucsd.edu/~gdoyle/papers/doyle-elkan-2009-nips-paper.pdf.

[25] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL '05: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370, Morristown, NJ, USA, 2005. Association for Computational Linguistics. doi: 10.3115/1219840.1219885. URL http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf.

[26] Eugene C. Freuder. Complexity of k-tree structured constraint satisfaction problems. In AAAI'90: Proceedings of the eighth National conference on Artificial intelligence, pages 4–9. AAAI Press, 1990. ISBN 0-262-51057-X. URL http://rna-informatics.uga.edu/Readings/tertiary/Complexity%20of%20K-Tree%20Structured%20Constraint%20Satisfaction%20Problems%20(Freuder1990).pdf.

[27] Subir Ghosh, Kenneth P. Burnham, Nico F. Laubscher, Gerard E. Dallal, Leland Wilkinson, Donald F.
Morrison, Milton W. Loyer, Bennett Eisenberg, Solomon Kullback, Ian T. Jolliffe, and Jeffrey S. Simonoff. Letters to the editor. The American Statistician, 41(4):338–341, 1987. ISSN 00031305. URL http://www.jstor.org/stable/2684769.

[28] C. Lee Giles, Kurt D. Bollacker, and Steve Lawrence. CiteSeer: an automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, DL '98, pages 89–98, New York, NY, USA, 1998. ACM. ISBN 0-89791-965-3. doi: 10.1145/276675.276685. URL http://clgiles.ist.psu.edu/papers/DL-1998-citeseer.pdf.

[29] W.R. Gilks. Markov chain Monte Carlo. Encyclopedia of Biostatistics, 2005.

[30] M. Girolami. A variational method for learning sparse and overcomplete representations. Neural Computation, 13(11):2517–2532, 2001.

[31] Stefan Gradmann. Knowledge = information in context: on the importance of semantic contextualisation in Europeana. Europeana, 2010. URL http://www.europeanaconnect.eu/europeanatech/documents/Gradmann_Stefan_LinkedOpenEuropeana.pdf.

[32] Heather Green. A library as big as the world. Online, 2002.

[33] Thomas L. Griffiths and Mark Steyvers. Finding scientific topics. PNAS, 101(suppl. 1):5228–5235, 2004. URL http://www.pnas.org/content/101/suppl.1/5228.full.pdf.

[34] David Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, WA, March 1995. URL http://research.microsoft.com/pubs/69588/tr-95-06.pdf.

[35] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '99, pages 50–57, New York, NY, USA, 1999. ACM. ISBN 1-58113-096-1. doi: 10.1145/312624.312649. URL http://www.cs.brown.edu/~th/papers/Hofmann-UAI99.pdf.

[36] Diane Hu and Lawrence Saul. A probabilistic topic model for music analysis. In Bengio et al.
[5]. ISBN 9781615679119. URL http://cseweb.ucsd.edu/~dhu/docs/nips09_abstract.pdf.

[37] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999. URL http://www.cs.berkeley.edu/~jordan/papers/variational-intro.pdf.

[38] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960. URL http://www.cs.unc.edu/~welch/kalman/media/pdf/Kalman1960.pdf.

[39] Samuel Kim, Shiva Sundaram, Panayiotis Georgiou, and Shrikanth Narayanan. Audio scene understanding using topic models. In Bengio et al. [5]. ISBN 9781615679119. URL http://sail.usc.edu/aigaion2/index.php/attachments/single/372.

[40] Jon Kleinberg. Bursty and hierarchical structure in streams. Data Min. Knowl. Discov., 7(4):373–397, October 2003. ISSN 1384-5810. doi: 10.1023/A:1024940629314. URL http://www.cs.cornell.edu/home/kleinber/bhs.pdf.

[41] D. Koller, U. Lerner, and D. Angelov. A general algorithm for approximate inference and its application to hybrid Bayes nets. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 324–333. Morgan Kaufmann Publishers Inc., 1999. URL http://robotics.stanford.edu/~uri/Papers/uai99.ps.

[42] D. Koller, N. Friedman, L. Getoor, and B. Taskar. Graphical models in a nutshell. In L. Getoor and B. Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2007.

[43] S. Kullback and R.A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. URL http://www.csee.wvu.edu/~xinl/library/papers/math/statistics/Kullback_Leibler_1951.pdf.

[44] Kenichi Kurihara, Max Welling, and Yee Whye Teh. Collapsed variational Dirichlet process mixture models. In IJCAI, pages 2796–2801, 2007.
URL http://dli.iiit.ac.in/ijcai/IJCAI-2007/PDF/IJCAI07-449.pdf.

[45] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Carla E. Brodley and Andrea Pohoreckyj Danyluk, editors, ICML, pages 282–289. Morgan Kaufmann, 2001. ISBN 1-55860-778-1. URL http://www.cis.upenn.edu/~pereira/papers/crf.pdf.

[46] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284, 1998. doi: 10.1080/01638539809545028. URL http://lsa.colorado.edu/papers/dp1.LSAintro.pdf.

[47] David D. Lewis. Reuters-21578.

[48] Michael Ley. The DBLP computer science bibliography: Evolution, research issues, perspectives. In Proceedings of the 9th International Symposium on String Processing and Information Retrieval, SPIRE 2002, pages 1–10, London, UK, 2002. Springer-Verlag. ISBN 3-540-44158-1. URL http://dl.acm.org/citation.cfm?id=646491.694954.

[49] Erik Linstead, Lindsey Hughes, Cristina Lopes, and Pierre Baldi. Software analysis with unsupervised topic models. In Bengio et al. [5]. ISBN 9781615679119. URL http://www.umiacs.umd.edu/~jbg/nips_tm_workshop/5.pdf.

[50] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 1st edition, June 2003. ISBN 0521642981. URL http://www.cs.toronto.edu/~mackay/itprnn/book.pdf.

[51] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, pages 188–191, Morristown, NJ, USA, 2003. Association for Computational Linguistics. doi: 10.3115/1119176.1119206. URL http://www.cs.umass.edu/~mccallum/papers/mccallum-conll2003.pdf.
[52] Andrew McCallum and Charles Sutton. An introduction to conditional random fields for relational learning. In Lise Getoor and Ben Taskar, editors, Introduction to Statistical Relational Learning. MIT Press, 2006. URL http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf.

[53] Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

[54] Peter Mörters and Yuval Peres. Brownian Motion. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2010. ISBN 978-0-521-76018-8. URL http://www.stat.berkeley.edu/~peres/bmbook.pdf. With an appendix by Oded Schramm and Wendelin Werner.

[55] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI'99, pages 467–475, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1-55860-614-9. URL http://www.cs.ubc.ca/~murphyk/Papers/loopy_uai99.pdf.

[56] U. Nodelman, C. R. Shelton, and D. Koller. Continuous time Bayesian networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 378–387, 2002. URL http://www.cs.ucr.edu/~cshelton/papers/docs/ctbn.pdf.

[57] Tamara Polajnar and Mark Girolami. Application of lexical topic models to protein interaction sentence prediction. In Bengio et al. [5]. ISBN 9781615679119. URL http://eprints.pascal-network.org/archive/00005815/01/9.pdf.

[58] John W. Pratt. On the efficiency of maximum likelihood estimation. Annals of Statistics, 4(3):501–514, 1976. URL http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aos/1176343457.

[59] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993. ISBN 0-13-015157-2.

[60] John A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, 3rd edition, April 2001. ISBN 0534399428.

[61] Purnamrita Sarkar and Andrew W. Moore. Dynamic social network analysis using latent space models. SIGKDD Explor. Newsl., 7(2):31–40, December 2005. ISSN 1931-0145. doi: 10.1145/1117454.1117459. URL http://www.cs.cmu.edu/~psarkar/NIPS2005_0724.pdf.

[62] Masa-Aki Sato. Online model selection based on the variational Bayes. Neural Comput., 13(7):1649–1681, July 2001. ISSN 0899-7667. URL http://www.robots.ox.ac.uk/~parg/mlrg/papers/online_variational.pdf.

[63] D. F. Shanno et al. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970. URL http://www.ams.org/journals/mcom/1970-24-111/S0025-5718-1970-0274029-X/S0025-5718-1970-0274029-X.pdf.

[64] Xiaodan Song, Ching-Yung Lin, Belle L. Tseng, and Ming-Ting Sun. Modeling and predicting personal information dissemination behavior. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 479–488, New York, NY, USA, 2005. ACM. ISBN 1-59593-135-X. doi: 10.1145/1081870.1081925. URL http://wing.comp.nus.edu.sg/downloads/keyphraseCorpus/138/138.pdf.

[65] Charles Sutton, Andrew McCallum, and Khashayar Rohanimanesh. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. J. Mach. Learn. Res., 8:693–723, 2007. ISSN 1532-4435. URL http://people.cs.umass.edu/~mccallum/papers/dcrf-icml04.pdf.

[66] Russell Swan and David Jensen. TimeMines: Constructing timelines with statistical models of word usage.
In The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 73–80, 2000. URL http://www.cs.cmu.edu/~dunja/KDDpapers/Swan_TM.pdf.

[67] J. Taylor. JSTOR: An electronic archive from 1665. Notes and Records of the Royal Society of London, 55(1):179–181, 2001. doi: 10.1098/rsnr.2001.0135. URL http://rsnr.royalsocietypublishing.org/content/55/1/179.abstract.

[68] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006. URL http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf.

[69] Yee Whye Teh, Kenichi Kurihara, and Max Welling. Collapsed variational inference for HDP. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. MIT Press, 2007. URL http://books.nips.cc/papers/files/nips20/NIPS2007_0763.pdf.

[70] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1–305, January 2008. ISSN 1935-8237. doi: 10.1561/2200000001. URL http://www.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf.

[71] Hanna M. Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 1105–1112, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553515. URL http://people.cs.umass.edu/~wallach/talks/evaluation.pdf.

[72] Chong Wang, David Blei, and David Heckerman. Continuous time dynamic topic models. In David A. McAllester and Petri Myllymäki, editors, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence (UAI), 2008. URL http://uai2008.cs.helsinki.fi/UAI_camera_ready/wang.pdf.

[73] Chong Wang, John William Paisley, and David M. Blei.
Online variational inference for the hierarchical Dirichlet process. Journal of Machine Learning Research - Proceedings Track, 15:752–760, 2011. URL http://www.jmlr.org/proceedings/papers/v15/wang11a/wang11a.pdf.

[74] Xuerui Wang and Andrew McCallum. Topics over time: A non-Markov continuous-time model of topical trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 424–433, New York, NY, USA, 2006. ACM. ISBN 1-59593-339-5. doi: 10.1145/1150402.1150450. URL http://www.bradblock.com/Topics_over_Time_A_Non_Markov_Continuous_Time_Model_of_Topical_Trends.pdf.

[75] Xuerui Wang, Natasha Mohanty, and Andrew McCallum. Group and topic discovery from relations and text. In Proceedings of the 3rd International Workshop on Link Discovery, LinkKDD '05, pages 28–35, New York, NY, USA, 2005. ACM. ISBN 1-59593-215-1. doi: 10.1145/1134271.1134276. URL http://www.cs.umass.edu/~mccallum/papers/grouptopic_linkkdd05.pdf.

[76] E. P. Xing, M. I. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 583–591. Morgan Kaufmann Publishers Inc., 2002. URL http://www.gatsby.ucl.ac.uk/publications/papers/06-2000.pdf.

[77] Eric P. Xing. On topic evolution. Technical Report CMU-CALD-05-115, Carnegie Mellon University, Pittsburgh, PA, USA, Dec 2005. URL http://www.cs.cmu.edu/~epxing/papers/TR-115.pdf. TDT 2004 workshop notes.

[78] Raymond W. Yeung. Information Theory and Network Coding. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387792333, 9780387792330.

Appendix A

Probability distributions

In this appendix, I briefly introduce some of the most important probability distributions in Bayesian data modeling.
A.1 Uniform

This is a simple distribution for a continuous variable x. The probability that this variable takes any value in its finite domain [a, b] is constant and is defined by:

U(x | a, b) = 1 / (b − a)    (A.1)

A.2 Bernoulli

This distribution is for a single binary variable x ∈ {0, 1} and can be thought of as representing the result of flipping a coin. The parameter µ ∈ [0, 1] gives the probability of success, which is having x = 1. More formally:

Bern(x | µ) = µ^x (1 − µ)^{1−x}    (A.2)

A.3 Binomial

If a Bernoulli trial (such as coin flipping) is repeated N times, where the probability of success is µ ∈ [0, 1], then the probability distribution of the number of successes x is given by:

Bin(x | µ, N) = (N choose x) µ^x (1 − µ)^{N−x}    (A.3)

A.4 Multinomial

The multinomial distribution is a generalization of the binomial distribution in which the Bernoulli distribution giving the probability of success is replaced with a categorical distribution. In this case, there are K possible outcomes for the discrete variable x = {x_1, ..., x_K}, with probabilities µ = {µ_1, ..., µ_K} such that Σ_k µ_k = 1. For N observations, let m = {m_1, ..., m_K}, where m_k is the number of times outcome x_k is observed. Then m follows a multinomial distribution with parameters µ and N:

Mult(m | µ, N) = (N! / (m_1! ⋯ m_K!)) ∏_{k=1}^{K} µ_k^{m_k}    (A.4)

The categorical and multinomial distributions are often conflated in the computer science literature. The outcome of a categorical distribution is sometimes represented as a binary 1-of-K vector indicating which one of the K outcomes was observed. In this notation, which is adopted by Bishop [6, p. 690], the categorical distribution represents a multinomial distribution over a single observation.
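The three discrete distributions above can be checked numerically. The following sketch (standard-library Python; the function names are mine, not part of the dissertation) evaluates each probability mass function directly from (A.2), (A.3), and (A.4):

```python
import math

def bernoulli_pmf(x, mu):
    """Bern(x | mu) = mu^x (1 - mu)^(1 - x), for x in {0, 1}; see (A.2)."""
    return mu**x * (1 - mu)**(1 - x)

def binomial_pmf(x, mu, n):
    """Bin(x | mu, N): probability of x successes in N trials; see (A.3)."""
    return math.comb(n, x) * mu**x * (1 - mu)**(n - x)

def multinomial_pmf(m, mu):
    """Mult(m | mu, N) with N = sum(m): the coefficient N!/(m_1!...m_K!)
    times prod_k mu_k^m_k; see (A.4)."""
    coef = math.factorial(sum(m))
    for m_k in m:
        coef //= math.factorial(m_k)   # exact integer division
    p = 1.0
    for m_k, mu_k in zip(m, mu):
        p *= mu_k**m_k
    return coef * p
```

As a sanity check, Bin(x | µ, N) sums to one over x = 0, ..., N, and the binomial is the K = 2 special case of the multinomial.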
A.5 Beta

This is a distribution over a continuous variable x ∈ [0, 1] and is often used to represent the probability of a binary event. It has two parameters, a > 0 and b > 0. The Beta distribution is the conjugate prior for the Bernoulli distribution, where a and b can be interpreted as the effective prior numbers of success and failure observations, respectively. It is defined by:

Beta(x | a, b) = (Γ(a + b) / (Γ(a) Γ(b))) x^{a−1} (1 − x)^{b−1}    (A.5)

where Γ(·) is the Gamma function, given by:

Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du    (A.6)
     = (x − 1)!, for integer x    (A.7)

A.6 Dirichlet

The Dirichlet distribution is a multivariate distribution over K random variables 0 ≤ µ_k ≤ 1, where k = 1, ..., K:

Dir(µ | α) = C(α) ∏_{k=1}^{K} µ_k^{α_k − 1}    (A.8)

where

C(α) = Γ(α̂) / (Γ(α_1) ⋯ Γ(α_K))    (A.9)

and

α̂ = Σ_{k=1}^{K} α_k,    (A.10)

subject to the constraints:

0 ≤ µ_k ≤ 1    (A.11)
Σ_{k=1}^{K} µ_k = 1.    (A.12)

The Dirichlet distribution forms a conjugate prior with the multinomial distribution and is a generalization of the Beta distribution. The parameters α_k can be interpreted as the effective numbers of observations of the corresponding values of the K-dimensional observation vector x.

Appendix B

Dirichlet Process

The Dirichlet process (DP) is a distribution over distributions: a draw from the process is itself a probability distribution. A DP, denoted by DP(G_0, α), is parametrized by a base measure G_0 and a concentration parameter α, which are analogous to the mean and variance of a Gaussian distribution, respectively. For parameters θ_i drawn from G ∼ DP(G_0, α), integrating out G leaves θ following a Polya urn distribution [7], or a Chinese restaurant process [2]:

θ_i | θ_{1:i−1}, G_0, α ∼ Σ_k (m_k / (i − 1 + α)) δ(φ_k) + (α / (i − 1 + α)) G_0    (B.1)

where φ_{1:k} are the distinct values of θ, and m_k is the number of parameters θ with value φ_k.
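The Polya-urn scheme of (B.1) is straightforward to simulate. Below is a minimal sketch (plain Python; the function and its arguments are illustrative, not part of the model's implementation): draw i joins an existing cluster k with probability m_k / (i − 1 + α), or takes a fresh value from the base measure G_0 with probability α / (i − 1 + α):

```python
import random

def polya_urn(n, alpha, base_draw, seed=0):
    """Simulate n draws theta_1..theta_n from DP(G_0, alpha) via the
    Polya-urn scheme of (B.1).  `base_draw(rng)` plays the role of a
    draw from the base measure G_0."""
    rng = random.Random(seed)
    values, counts = [], []            # distinct values phi_k and counts m_k
    assignments = []                   # cluster index chosen for each draw
    for i in range(1, n + 1):
        weights = counts + [alpha]     # existing clusters, then "new value"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(values):           # new cluster: phi_new ~ G_0
            values.append(base_draw(rng))
            counts.append(1)
        else:                          # join existing cluster k
            counts[k] += 1
        assignments.append(k)
    return values, counts, assignments
```

With a few hundred draws and a moderate α, the number of distinct values stays far below the number of draws (it grows roughly logarithmically), which is exactly the clustering behavior the DP is used for.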
A Dirichlet process mixture model (DPM) can be built using the given DP on top of a hierarchical Bayesian model. Instead of having a fixed Dirichlet distribution to draw a parameter vector θ from in a generative model, as shown in Figure 1.3, that vector, which represents a topic-proportions vector in a topic model, can be drawn from a distribution G_d that was itself drawn from a DP. Such a model can accommodate an infinite number of topics. To share topics among different documents, the document-specific distributions G_d can be tied together by drawing their base measure G_0 from another DP; more formally, G_0 ∼ DP(H, γ). This final construction is known as the hierarchical Dirichlet process (HDP). By integrating G_d out of this model, we get the Chinese restaurant franchise process (CRFP) [11]:

θ_{di} | θ_{d,1:i−1}, α, ψ ∼ Σ_{b=1}^{B_d} (n_{db} / (i − 1 + α)) δ_{ψ_{db}} + (α / (i − 1 + α)) δ_{ψ_{db}^{new}}    (B.2)

ψ_{db}^{new} | ψ, γ ∼ Σ_{k=1}^{K} (m_k / (Σ_{l=1}^{K} m_l + γ)) δ_{φ_k} + (γ / (Σ_{l=1}^{K} m_l + γ)) H    (B.3)

where ψ_{db} is topic b for document d, n_{db} is the number of words sampled from it, ψ_{db}^{new} is a new topic, B_d is the number of topics in document d, and m_k is the number of documents sharing topic φ_k.

The generative process of this model follows a Chinese restaurant metaphor, in which a restaurant represents a document, a customer represents a word, and a dish represents a topic. Customers sitting at the same table in a Chinese restaurant share the dish served on that table. When customer i arrives at a restaurant, she is assigned a table to sit at. This table could be an empty table, with probability α / (i − 1 + α), or a table occupied by n_{db} customers, with probability n_{db} / (i − 1 + α).
If an empty table is assigned, the dish served on this table could either be a new dish that has not been served in any restaurant before, with probability γ / (Σ_{l=1}^{K} m_l + γ), where m_l is the number of restaurants that have served dish l before, or a dish that has already been served on m_k tables across all restaurants, with probability m_k / (Σ_{l=1}^{K} m_l + γ). Table assignment and dish selection thus follow this two-level sharing scheme.

To add temporal dependence to our model, we can use the temporal Dirichlet process mixture model (TDPM) proposed in [1], which allows an unlimited number of mixture components. In this model, G evolves as follows [2]:

G_t | φ_{1:k}, G_0, α ∼ DP(ζ, D)    (B.4)
ζ = α + Σ_k m'_{kt}    (B.5)
D = Σ_k (m'_{kt} / (Σ_l m'_{lt} + α)) δ(φ_k) + (α / (Σ_l m'_{lt} + α)) G_0    (B.6)
m'_{kt} = Σ_{δ=1}^{Δ} exp(−δ/λ) m_{k,t−δ}    (B.7)

where m'_{kt} is the prior weight of component k at time t, and Δ and λ are the width and decay factor of the time-decaying kernel. For Δ = 0, the TDPM reduces to a set of independent DPMs, one at each time step, and it becomes a single global DPM when Δ = T and λ = ∞.

As we did with the time-independent DPM, we can integrate G_{1:T} out of our model to get a set of parameters θ_{1:t} that follows a Polya urn distribution:

θ_{ti} | θ_{t−1:t−Δ}, θ_{t,1:i−1}, G_0, α ∝ Σ_k (m'_{kt} + m_{kt}) δ(φ_{kt}) + α G_0    (B.8)

We can make word distributions and topic trends evolve over time if we tie all hyper-parameter base measures G_0 together through time. The model then takes the following form:

θ_{tdi} | θ_{td,1:i−1}, α, ψ_{t−Δ:t} ∼ Σ_{b=1}^{B_d} (n_{tdb} / (i − 1 + α)) δ_{ψ_{tdb}} + (α / (i − 1 + α)) δ_{ψ_{tdb}^{new}}    (B.9)

ψ_{tdb}^{new} | ψ, γ ∼ Σ_{k: m_{kt} > 0} ((m_{kt} + m'_{kt}) / (Σ_{l=1}^{K_t} (m_{lt} + m'_{lt}) + γ)) δ_{φ_{kt}}
    + Σ_{k: m_{kt} = 0} ((m_{kt} + m'_{kt}) / (Σ_{l=1}^{K_t} (m_{lt} + m'_{lt}) + γ)) P(· | φ_{k,t−1})
    + (γ / (Σ_{l=1}^{K_t} (m_{lt} + m'_{lt}) + γ)) H    (B.10)

where φ_{kt} evolves using a random-walk kernel, as in [11]:

H = N(0, σI)    (B.11)
φ_{k,t} | φ_{k,t−1} ∼ N(φ_{k,t−1}, ρI)    (B.12)
w_{tdi} | φ_{kt} ∼ Mult(L(φ_{kt}))    (B.13)
L(φ_{kt}) = exp(φ_{kt}) / Σ_{w=1}^{W} exp(φ_{ktw})    (B.14)

We can see from (3.9), (3.10), and (3.11) the non-conjugacy between the base measure and the likelihood.

Appendix C

Graphical models

Probabilistic graphical models are tools that allow us to visualize the independence relationships between the random variables in a model and to use available graph-manipulation algorithms to model complex systems, tune model parameters, and infer the likelihood of events.

Probabilistic graphical models, or simply graphical models, can be categorized into directed graphical models (otherwise known as Bayesian networks) and undirected graphical models (also called Markov networks, or Markov random fields [MRFs]).

An example of a directed graphical model is given in Figure C.1. This graph models the independence/dependence relationships between seven random variables. The structure of the graph and the relationships between the variables were created using human judgment; algorithms for learning the graph structure and parameters do exist [34] but were not used in this simple example. The nodes in the graph represent random variables, and the directed edges show the flow of influence between them. In this model, the priors Genes, Training, and Drugs define the probability that an athlete would have good genes, train well, and take performance-enhancing drugs, respectively. These three random variables affect the athlete's Performance, which, along with the probability that the Other athletes perform well or not, affects the probability that our athlete wins a Medal. Drug doping can be tested separately by its effect on the Drug test.
An example of an undirected graphical model is given in Figure C.2, where the direct interactions between market stocks' values are represented by the graph edges and the nodes themselves represent the stocks' values.

Figure C.1: Directed graphical model.

Figure C.2: Undirected graphical model.

Since Stock A and Stock D are not neighbors in the graph, they do not affect each other directly.

Simply speaking, a graphical model (directed or undirected) represents a joint distribution P over a set of random variables X. For inference on a set of variables, we can exhaustively marginalize over the rest of the variables in the model. However, the inference complexity increases exponentially with the number of variables, and the problem becomes intractable. This problem can be avoided in Bayesian networks (directed graphs) by making use of the independence relationships between the variables, as follows:

Definition 1. Given a Bayesian network G defined over a set of random variables X = {X_1, ..., X_n} and a probability distribution P_X over the same set of variables, we say that P_X factorizes according to G if

P_X = ∏_{i=1}^{n} P(X_i | Pa(X_i))    (C.1)

where Pa(X_i) is the set of parent nodes of X_i.

This follows from the effect of the blanket variables that separate variables, as follows:

Definition 2. Given a Bayesian network G defined over a set of random variables X = {X_1, ..., X_n}, a variable X_i is independent of its non-descendants given its parents:

∀ X_i, (X_i ⊥ NonDescendants(X_i) | Pa(X_i))    (C.2)

Analogously, for Markov networks:

Definition 3. Given a Markov network structure H defined over the set of random variables X = {X_1, ..., X_n} and a distribution P_H defined over the same space.
We say that the distribution P_H factorizes over H if there is a set of factors π_1[D_1], ..., π_m[D_m], where each D_i is a clique in H, such that:

P_H = (1/Z) P'_H(X_1, ..., X_n)    (C.3)
P'_H = ∏_{i=1}^{m} π_i[D_i]    (C.4)
Z = Σ_{X_1,...,X_n} P'_H(X_1, ..., X_n)    (C.5)

where P'_H is an unnormalized measure and Z is a normalizing constant.

Independence in Markov networks can be identified using blankets of variables [42]:

Definition 4. Given a Markov network H defined over a set of random variables X = {X_1, ..., X_n}, let N_H(X_i) be the set of neighbors of X_i in the graph. Then the local Markov independencies associated with H are defined as:

I(H) = {(X ⊥ X − N_H(X) − {X} | N_H(X)) : X ∈ X}    (C.6)

C.1 Inference algorithms

Inference on graphical models could be a conditional independence query, a Maximum a Posteriori (MAP) query, or its special (and important) case, the Most Probable Explanation (MPE), among other types of queries.

A conditional independence query finds the probability of an assignment of a random variable to a set of values given some evidence:

P(Y | E = e) = P(Y, e) / P(e)    (C.7)
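To make the factorization of Definition 1 and the conditional query (C.7) concrete, here is a small sketch on a three-variable network in the spirit of Figure C.1 (Drugs affects both Performance and the Drug test). All probability tables below are invented for illustration only; the mechanics follow the text: factorize the joint into ∏ P(X_i | Pa(X_i)), then answer the query by exhaustive marginalization.

```python
from itertools import product

# Toy network in the spirit of Figure C.1: Drugs -> Performance, Drugs -> Test.
# All numbers below are made up for illustration.
p_drugs = {1: 0.1, 0: 0.9}
p_perf_given_drugs = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.4, 0: 0.6}}    # P(perf | drugs)
p_test_given_drugs = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.05, 0: 0.95}}  # P(test+ | drugs)

def joint(d, p, t):
    """Definition 1: the joint factorizes as prod_i P(X_i | Pa(X_i))."""
    return p_drugs[d] * p_perf_given_drugs[d][p] * p_test_given_drugs[d][t]

def conditional(query_var, evidence):
    """Conditional query (C.7): P(Y | E = e) = P(Y, e) / P(e),
    computed by exhaustively marginalizing the remaining variables."""
    num = {0: 0.0, 1: 0.0}
    for d, p, t in product([0, 1], repeat=3):
        x = {"drugs": d, "perf": p, "test": t}
        if all(x[k] == v for k, v in evidence.items()):
            num[x[query_var]] += joint(d, p, t)
    z = num[0] + num[1]
    return {y: v / z for y, v in num.items()}

posterior = conditional("drugs", {"test": 1})   # P(drugs | positive test)
```

Exhaustive marginalization like this is exponential in the number of variables, which is exactly the intractability the definitions above are designed to avoid; it is fine only for toy networks of this size.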
The first alternative tries to find a distribution Q that is similar enough to the desired distribution P by searching a set of easy candidate distributions Q trying to find the one that minimizes the distance measure b et ween P and Q . Among the distance measures that could b e used are Euclidean distance, whic h would require doing hard inference on P taking us back to the main problem, and the en trop y based Kullbac k-Leibler (KL)-divergence measure [ 27 ]. C.1.1 Example Conditional Random Field (CRF) [ 45 ] mo dels ha v e replaced hidden Mark ov mo dels (HMMs) and maxim um entrop y Marko v mo dels (MEMM) in many NLP applications suc h as named- en tity recognition [ 51 ] and part-of-sp eec h tagging (POS) [ 65 ] due to their ability to relax strong indep endence assumptions made in these mo dels. Figure C.3 shows a graphical represen tation for a skip-c hain CRF adapted from [ 52 ]. Only the x no des are observed, and identical w ords are connected (with skip e dges ) b ecause they are lik ely to ha ve the same lab el. Strong evidence at one end can affect the lab el at the other endp oin t of the edge. Long range dep endencies in a mo del like this one w ould b e 104 y(t−1) y(t) y(t+1) x(t−1) x(t+7) x(t+1) x(t) x(t+8) x(t+9) y(t+8) y(t+7) y(t+9) . started Doe Doe John Announced Figure C.3 : A skip-chain CRF mo del for an NER task. hard to represen t using n -gram mo dels b ecause they tend to ha v e to o many parameters if n is large. In a skip-c hain mo del, skip-edge creation should dep end on the application of the mo del. In a POS tagging task, this edge creation could b e based on a similarity measure that uses stemming and word ro ots or edit distance. In the current example, we are connecting iden tical capitalized w ords for the NER task. Adding so man y edges w ould complicate the mo del and mak e appro ximate inference harder. 
In our model, the probability of a label sequence y given an input x is evaluated by:

p_θ(y | x) = (1 / Z(x)) ∏_{t=1}^{T} Ψ_t(y_t, y_{t−1}, x) ∏_{(u,v) ∈ I} Ψ_{uv}(y_u, y_v, x)    (C.8)

where I = {(u, v)} is the set of all skip-edge sequence positions, Ψ_t are the factors for the linear-chain edges, and Ψ_{uv} are the factors over the skip edges (like the ones connecting identical words in Figure C.3). We have:

Ψ_t(y_t, y_{t−1}, x) = exp{ Σ_k λ_{1k} f_{1k}(y_t, y_{t−1}, x, t) }    (C.9)

Ψ_{uv}(y_u, y_v, x) = exp{ Σ_k λ_{2k} f_{2k}(y_u, y_v, x, u, v) }    (C.10)

where θ_1 = {λ_{1k}}_{k=1}^{K_1} are the linear-chain template parameters and θ_2 = {λ_{2k}}_{k=1}^{K_2} are the parameters of the skip template. The observation functions q_k(x, t) can depend on arbitrary positions of the input string. The skip-edge features factorize as:

f'_k(y_u, y_v, x, u, v) = 1{y_u = ỹ_u} 1{y_v = ỹ_v} q'_k(x, u, v)    (C.11)

Skip-chain CRFs have been introduced and tested by McCallum and Sutton on the problem of extracting speaker names from seminar announcements [52]. They reported improved performance over linear-chain CRFs, which in turn outperform HMMs due to the relaxed independence assumptions between the states and the current observation. However, skip-chain CRFs fall into the trap of implicitly assuming independence between the current observations x and the neighbors of the states on both ends of a skip edge. For example, in one document the word China, representing the country, would be linked with a skip edge to the word China in The China Daily, and both could be labeled as COUNTRY, or as ORGANIZATION, in an NER task, even though they are different entities. A solution to this problem has been proposed by Finkel et al. [25], who added richer long-distance factors to the original pairwise word factors.
These factors are useful for adding exceptions to the rule that identical words have the same entity label in a text. However, due to the sparsity of the training data, this model may fail. Augmenting the features with the neighbors of the states at the ends of the skip edges may remedy this problem. This could take the form:

Ψ_{ef}(y_e, y_f, x) = exp{ Σ_k λ_{3k} f_{3k}(y_e, y_f, x, e, f) }    (C.12)

Ψ_{gh}(y_g, y_h, x) = exp{ Σ_k λ_{4k} f_{4k}(y_g, y_h, x, g, h) }    (C.13)

Each factor of a skip edge then becomes a function of four more states. This would require using different parameter estimation techniques in two stages [25]: first, we estimate the parameters of the linear-chain CRF, ignoring the skip edges; then we heuristically select the parameters for the long-distance factors and neighboring states.

Another point worth considering is the precision of the probabilities in the forward-backward belief propagation algorithm: their values become too small to be represented within numeric precision. To solve this problem, we can carry out the computation in the logarithmic domain, where products become sums,

log(ab) = log a + log b,    (C.14)

and sums are computed with the log-sum-exp operation

a ⊕ b = log(e^a + e^b),    (C.15)

which can be evaluated stably as

a ⊕ b = a + log(1 + e^{b−a}).    (C.16)

Even though graphical models give us flexibility in modeling complex systems, these models can easily become intractable if the system input involves a set of rich, non-independent features. This is typical of many real-world natural language processing applications. Graphical models are usually used to represent a joint probability distribution p(y, x), where y is the set of attributes of the entity we want to predict and x is our observation about this entity. Modeling the joint distribution for a system with a rich set of features is complicated by the fact that it requires modeling the distribution p(x), with its complex dependencies, causing the model to become intractable.
This problem can be solved by modeling the conditional distribution p(y | x) instead. This approach to representation is known as conditional random fields (CRFs) [45].

Appendix D

Information theory

Many definitions and equations used in this thesis can be better understood in the context of information theory. The average amount of information contained in a random variable and the relative entropy measure are useful in understanding how two probability distributions diverge from each other.

D.1 Entropy

The entropy H(X) of a random variable X is defined by [78]:

H(X) = − Σ_x p(x) log p(x)    (D.1)

The base of this logarithm can be chosen to be any number greater than 1. If the value chosen is 2, the unit of this entropy measure is the bit; if the base is Euler's number e, the unit is the nat, for natural logarithm. In this thesis, the base of the logarithm is the size of the code alphabet, which is 2. It should be noted that in computer science a bit is an entity that can take one of two values, 1 or 0, while in information theory a bit is a unit of measurement for the entropy of a random variable.

It is common in information theory to express the entropy of a random variable in terms of an expected value. An alternative definition of the entropy is then given by

H(X) = −E log p(X)    (D.2)

where

E g(X) = Σ_x p(x) g(x)    (D.3)

The entropy H(X) of a random variable X is a functional of the probability distribution p(x). It measures the average amount of information contained in X, or the amount of uncertainty removed from X by revealing its outcome. The value of the entropy depends on p(x) and not on the values in the alphabet X; in information theory, the alphabet X of a random variable is the set of all values the random variable may take.
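A short numerical sketch of (D.1) (plain Python; the helper name is mine, not from the text): for a fair binary variable the entropy is exactly one bit, and any bias lowers it.

```python
import math

def entropy(p, base=2):
    """H(X) = -sum_x p(x) log p(x), from (D.1).
    base=2 gives bits; base=math.e gives nats.
    Terms with p(x) = 0 contribute nothing, by the usual 0 log 0 = 0 convention."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

h_fair = entropy([0.5, 0.5])     # a fair coin carries exactly 1 bit
h_biased = entropy([0.9, 0.1])   # less uncertainty, so lower entropy
```

The same helper with `base=math.e` returns the entropy in nats; for the fair coin that is log 2 ≈ 0.693 nats.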
To maximize the entropy of a random variable, we want to maximize the uncertainty of its outcome. For a binary random variable X with X = {0, 1}, this is achieved when both outcomes are equally likely to occur, i.e., when p(0) = p(1) = 0.5.

The joint entropy H(X, Y) of two random variables X and Y is given by

H(X, Y) = − Σ_{x,y} p(x, y) log p(x, y)    (D.4)
        = −E log p(X, Y)    (D.5)

In some cases, the outcome of one random variable might give a clue about the outcome of another variable. In that case, the conditional entropy is of interest: for two random variables X and Y, the conditional entropy of Y given X is

H(Y | X) = − Σ_{x,y} p(x, y) log p(y | x)    (D.8)
         = −E log p(Y | X)    (D.9)

D.2 Mutual information

For two random variables X and Y, the mutual information between X and Y is defined as

I(X; Y) = Σ_{x,y} p(x, y) log (p(x, y) / (p(x) p(y)))    (D.10)
        = E log (p(X, Y) / (p(X) p(Y)))    (D.11)

From this definition, we can see that mutual information is symmetric in X and Y, i.e., I(X; Y) = I(Y; X).

D.3 Information divergence

In many communication applications we want to measure how a probability distribution p differs from another probability distribution q defined over the same alphabet. This measure is usually desired to be non-negative. The information divergence from p to q can be used in this case; it is defined as

D(p || q) = Σ_x p(x) log (p(x) / q(x))    (D.12)
          = E_p log (p(X) / q(X))    (D.13)

where E_p is the expectation with respect to p. In information theory, information divergence is also known as relative entropy.
This makes sense, as (D.12) can be interpreted as the entropy of p relative to q, in the sense of the entropy measures defined in (D.1) and (D.2). Information divergence is also known as the Kullback-Leibler distance. However, care must be taken when using this term: the information divergence measure is not symmetric in p and q, i.e., D(p || q) ≠ D(q || p), and it does not satisfy the triangle inequality, whereas the word distance in many situations implies symmetry.

Appendix E

Table of notations

DP(·): Dirichlet process
G_0: base measure
α: concentration parameter
θ: Dirichlet distribution parameter space
φ_{1:k}: distinct values of θ
m_k: number of parameters θ with value φ_k
m'_{kt}: prior weight of component k at time t
Δ: width of the time-decaying kernel
λ: decay factor of a time-decaying kernel
w: word in a document
H: base measure of the DP generating G_0
γ: concentration parameter of the DP generating G_0
ψ_{tdb}: topic b for document d at time t
n_{tdb}: number of words sampled from ψ_{tdb}
ψ_{tdb}^{new}: a new topic
B_d: number of topics in document d
N(m, v_0^2): a Gaussian distribution with mean m and variance v_0^2
β_{i,k,w}: distribution of words over topic k at time i for word w
s_i: timestamp for time index i
Δ_{s_j,s_i}: time duration between s_i and s_j
Dir(·): Dirichlet distribution
z_{t,n}: topic n sampled at time t
I: identity matrix
Mult(·): multinomial distribution
π(·): mapping function

Table E.1: Table of notations