We develop the "Draw My Topics" toolkit, which provides a fast way to incorporate social scientists' interests into standard topic modelling. Instead of using the raw corpus with only primitive preprocessing as input, an algorithm based on the Vector Space Model and Conditional Entropy is used to connect social scientists' interests with the output of unsupervised topic models. Room is also left for users to adjust the procedure to the specific corpus of their interest. We demonstrate the toolkit's use on the diachronic People's Daily corpus in Chinese.
Probabilistic topic models, such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (Blei, Ng and Jordan, 2003), are widely used tools that assist social scientists in understanding large, unstructured collections of documents. Their value is increasingly recognized in the social sciences for summarizing and abstracting large document collections around economically and politically interesting questions such as Chinese censorship (Grimmer and Stewart, 2013; King, Pan and Roberts, 2013; Tingley, 2013; Bamman, O'Connor and Smith, 2012).
Social scientists often start with off-the-shelf implementations of topic modeling that are widely available on the Internet, and then conduct a variety of post-hoc evaluations of the output, including topic prevalence and topic variation. However, because topic models are mainly unsupervised methods, social scientists have little influence over the topic generation process, so many topics that are not of their interest may show up.

There is already considerable work on better connecting social scientists with topic modeling. Kim, Zhai and Diermeier (2013) connected topic modeling with time-series feedback; Roberts, Stewart, Tingley and Airoldi (2013) developed the "Structural Topic Model" to incorporate observed document metadata into the standard topic model; Wallach, Mimno and McCallum (2009) accommodated outside information by optimizing the hyperparameters of LDA; and Hall, Jurafsky and Manning (2008) hand-selected seed words, adding pseudo-counts to the topic-related words they were especially interested in.

We develop the "Draw My Topics" toolkit to help social scientists and other topic model users obtain the topics they want in a more direct way. The central idea is that users define a topic of interest by a "central word", and we then extract this word's relatively small related context, rather than the huge volume of raw corpus, as the topic model's input. Based on the "Spatial Locality Principle", this allows us to draw the central word's related topics and their prevalence much more easily than searching through the whole corpus. To define and find the "related context", we propose a two-step approach: first, find the central word's top twenty similar words using the Vector Space Model (Salton, Wong and Yang, 1975) and Conditional Entropy (Cover and Thomas, 1991), which form the similar-word set; second, extract the adjacent context of the similar-word set to form the whole related context. Furthermore, users can adjust both steps with their subjective judgment (in other words, their social-science knowledge), guided by the part-of-speech tagging statistics of their own corpus, to get more desirable results.

After describing the method, we demonstrate the use of the "Draw My Topics" toolkit by analyzing several interesting words on the diachronic People's Daily corpus in Chinese.
The input of our "Draw My Topics" toolkit is a set of interesting words defined by the users (mainly targeting social scientists) together with a large corpus. The output is the "central word"-related topic content and topic prevalence. Users can also adjust the output with their domain knowledge and intuitions through the flexible parameters we provide.
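To make this interface concrete, the stub below sketches one plausible signature for the toolkit; the function name draw_my_topics and all of its parameter names are hypothetical illustrations, not the toolkit's published API.

```python
from typing import Dict, List, Optional

def draw_my_topics(central_words: List[str],
                   corpus_path: str,
                   n_similar: int = 300,
                   pos_thresholds: Optional[Dict[str, float]] = None):
    """Hypothetical interface sketch, not the toolkit's actual API.

    central_words:  words of interest supplied by the user
    corpus_path:    path to the large raw corpus
    n_similar:      size of the similar-word set per central word
    pos_thresholds: per-POS information thresholds the user may override
    Returns central-word-related topic content and topic prevalence.
    """
    raise NotImplementedError  # composed from the two steps described below
```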
In the first step, we calculate the top three hundred similar words of each given "central word" using the vector space model and conditional entropy. The vector space model is an algebraic model that represents text documents as vectors of identifiers; in our case, each word is treated as a vector in the space, and the similarity of two words is calculated as the cosine of the angle between their vectors. Each entry of a word vector is the pointwise mutual information (PMI) between the word and a context word.
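A minimal sketch of this similarity computation is given below, assuming the corpus has already been segmented into lists of tokens. The fixed window parameter is a simplification; the toolkit instead chooses the window per word via conditional entropy, as described next. The names pmi_vectors and cosine are illustrative, not the toolkit's actual code.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(sentences, window=5):
    """Build PMI-weighted context vectors for every word.

    `sentences` is a list of token lists; `window` is a fixed context
    size standing in for the entropy-chosen window described below.
    """
    word_count = Counter()
    pair_count = defaultdict(Counter)
    total = 0
    for toks in sentences:
        for i, w in enumerate(toks):
            word_count[w] += 1
            total += 1
            for c in toks[max(0, i - window):i] + toks[i + 1:i + 1 + window]:
                pair_count[w][c] += 1
    vectors = {}
    for w, ctx in pair_count.items():
        n_w = sum(ctx.values())
        vec = {}
        for c, n_wc in ctx.items():
            # PMI = log p(c|w) / p(c); positive values only (common practice)
            pmi = math.log((n_wc / n_w) * total / word_count[c])
            if pmi > 0:
                vec[c] = pmi
        vectors[w] = vec
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# usage: rank candidates by similarity to a central word (assumes
# `tokenized_sentences`, a list of token lists, is available)
# vecs = pmi_vectors(tokenized_sentences)
# ranked = sorted(vecs, key=lambda w: cosine(vecs["改革"], vecs[w]), reverse=True)
```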
To calculate this mutual information, the choice of the information window's length becomes crucial and subtle. We make this choice based on the "amount of information" of each window, calculated as the conditional entropy

H(Y|X) = -\sum_x p(x) \sum_y p(y|x) \log p(y|x),

where Y denotes the target word and X denotes the words in the nearby context. For four part-of-speech tagging types, we set four different information thresholds based on sampling, observation, and statistics. This information threshold table for the similarity calculation can also be set by toolkit users themselves, since "similar" is quite a subjective measure across disciplines: for example, "demand" may be "similar" to "supply" from an economist's view, while a political scientist may think "demand" is related to "power".
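The exact entropy estimator and threshold values are not reproduced here, so the following is a speculative sketch under one plausible reading: Y is treated as the indicator that the target word occurs within the candidate window around a context word X, and the window grows until H(Y|X) reaches a part-of-speech-dependent threshold. The POS_THRESHOLDS values are placeholders, not the values from the paper's table.

```python
import math
from collections import Counter

# Placeholder per-POS thresholds (noun, verb, adjective, adverb);
# the paper's actual table values differ and are user-adjustable.
POS_THRESHOLDS = {"n": 0.5, "v": 0.4, "a": 0.3, "d": 0.2}

def window_information(sentences, target, window):
    """Estimate H(Y|X), where Y indicates whether `target` occurs within
    `window` tokens of a context word X; one plausible reading of the
    method, not its confirmed estimator."""
    x_count, xy_count = Counter(), Counter()
    for toks in sentences:
        positions = [i for i, w in enumerate(toks) if w == target]
        for i, x in enumerate(toks):
            if x == target:
                continue
            x_count[x] += 1
            if any(abs(i - j) <= window for j in positions):
                xy_count[x] += 1
    total = sum(x_count.values())
    if total == 0:
        return 0.0
    h = 0.0
    for x, n_x in x_count.items():
        p_y = xy_count[x] / n_x  # p(Y=1 | X=x)
        for p in (p_y, 1.0 - p_y):
            if p > 0.0:
                h -= (n_x / total) * p * math.log2(p)
    return h

def choose_window(sentences, target, pos_tag, max_window=10):
    """Grow the window until its information reaches the POS threshold."""
    threshold = POS_THRESHOLDS[pos_tag]
    for w in range(1, max_window + 1):
        if window_information(sentences, target, w) >= threshold:
            return w
    return max_window
```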
Some of the similarity calculation results based on the Chinese diachronic People's Daily corpus are presented in Table 2.

Table 2: Similarity calculation results on the diachronic People's Daily corpus.

In the second step, we apply a straightforward method to extract the related corpus from the original one, based on the similar-word results of the first step. We go through the People's Daily corpus year by year; for every line within a year, we write the line down if it contains any of the similar words of the given central word.
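A minimal sketch of this extraction step is shown below, assuming the corpus is stored as plain-text lines grouped by year; simple substring matching is used, which suffices for unsegmented Chinese text. The function name extract_related_context is illustrative.

```python
def extract_related_context(corpus_by_year, similar_words):
    """Collect every line that mentions a word from the similar-word set.

    `corpus_by_year` maps a year (e.g. 1954) to a list of text lines,
    mirroring the year-by-year pass over the People's Daily corpus.
    The year key is kept so topic prevalence can be compared over time.
    """
    return {year: [line for line in lines
                   if any(w in line for w in similar_words)]
            for year, lines in corpus_by_year.items()}
```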