Entities as topic labels: Improving topic interpretability and evaluability combining Entity Linking and Labeled LDA

Reading time: 5 minutes

📝 Original Info

  • Title: Entities as topic labels: Improving topic interpretability and evaluability combining Entity Linking and Labeled LDA
  • ArXiv ID: 1604.07809
  • Date: 2016-04-27
  • Authors: Federico Nanni and Pablo Ruiz Fabo

📝 Abstract

In order to create a corpus exploration method providing topics that are easier to interpret than standard LDA topic models, here we propose combining two techniques called Entity linking and Labeled LDA. Our method identifies in an ontology a series of descriptive labels for each document in a corpus. Then it generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier; using an ontology as background knowledge limits label ambiguity. As our topics are described with a limited number of clear-cut labels, they promote interpretability, and this may help quantitative evaluation. We illustrate the potential of the approach by applying it in order to define the most relevant topics addressed by each party in the European Parliament's fifth mandate (1999-2004).

💡 Deep Analysis

Figure 1

📄 Full Content

Humanities scholars have experimented with the potential of different text mining techniques for exploring large corpora, from co-occurrence-based methods to sequence-labeling algorithms (e.g. named entity recognition). LDA topic modeling (Blei et al., 2003) has become one of the most widely employed approaches (Meeks and Weingart, 2012). Scholars have often remarked on its potential for distant reading analyses (Milligan, 2012) and have assessed its reliability by, for example, using it to examine already well-known historical facts (Au Yeung, 2011). However, researchers have observed that topic modeling results are usually difficult to interpret (Schmidt, 2012). This limits the possibilities for evaluating topic modeling outputs (Chang et al., 2009).

In order to create a corpus exploration method providing topics that are easier to interpret than standard LDA topic models, we propose combining two techniques called Entity linking and Labeled LDA; we are not aware of literature combining these two techniques in the way we describe. Our method identifies in an ontology a series of descriptive labels for each document in a corpus. Then it generates a specific topic for each label. Having a direct relation between topics and labels makes interpretation easier; using an ontology as background knowledge limits label ambiguity. As our topics are described with a limited number of clear-cut labels, they promote interpretability, and this may help quantitative evaluation.

We illustrate the potential of the approach by applying it to define the most relevant topics addressed by each party in the European Parliament's fifth term (1999–2004).

The structure of the abstract is as follows: we first describe the basic technologies considered. We then describe our approach combining Entity Linking and Labeled LDA. Based on the European Parliament corpus (Koehn, 2005), we show how the results of the combined approach are easier to interpret or evaluate than results for standard LDA.

Entity linking (Rao et al., 2013) tags textual mentions with an entity from a knowledge base like DBpedia (Auer et al., 2007). Mentions can be ambiguous, and the challenge is to choose the entity that most closely reflects the sense of the mention in context. For instance, in the expression "Clinton Sanders debate", "Clinton" is more likely to refer to the DBpedia entity Hillary_Clinton than to Bill_Clinton. However, in the expression "Clinton vs. Bush debate", the mention "Clinton" is more likely to refer to Bill_Clinton. An entity linking tool is able to disambiguate mentions taking into account their context, among other factors.
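The Clinton example above can be sketched as a toy context-overlap disambiguator. This is not the linker used in the paper (DBpedia Spotlight combines several signals); the candidate entities and their context profiles below are invented for illustration only.

```python
# Toy context-based entity disambiguation: pick the candidate entity
# whose (hypothetical) co-occurrence profile best overlaps the mention's
# surrounding words. Profiles are invented examples, not DBpedia data.

PROFILES = {
    "Hillary_Clinton": {"sanders", "senator", "2016", "primary"},
    "Bill_Clinton": {"bush", "1992", "arkansas", "president"},
}

def disambiguate(context_words):
    """Return the entity with the largest context-word overlap."""
    scores = {
        entity: len(profile & context_words)
        for entity, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)

print(disambiguate({"clinton", "sanders", "debate"}))     # Hillary_Clinton
print(disambiguate({"clinton", "vs", "bush", "debate"}))  # Bill_Clinton
```

Real linkers replace the hand-built profiles with statistics learned from the knowledge base, but the core idea (scoring candidates against the mention's context) is the same.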

Topic modeling is arguably the most popular text mining technique in digital humanities (Brauer and Fridlund, 2013). It addresses a common research need, as it can identify the most important topics in a collection of documents, and how these topics are distributed across the documents in the collection. The method's unsupervised nature makes it attractive for large corpora. However, topic modeling does not always yield satisfactory results. The topics obtained are usually difficult to interpret (Schmidt, 2012, among others). Each topic is presented as a list of words, and it generally depends on the researcher's intuition to interpret these tokens and propose the concepts or issues that the word lists represent.

An extension of LDA is Labeled LDA (Ramage et al., 2009). If each document in a corpus is described by a set of tags (e.g. a newspaper archive with articles tagged for areas like 'economics', 'foreign policy', etc.), Labeled LDA will identify the relation between LDA topics, documents and tags, and the output will consist of a list of labeled topics.
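The key constraint in Labeled LDA is that a document's tokens may only be assigned to topics corresponding to that document's own labels. A minimal collapsed-Gibbs sketch of this idea follows; the toy corpus, labels, hyperparameters and iteration count are all illustrative, not the paper's setup.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy corpus: (tokens, labels) pairs. Labels double as topic names.
docs = [
    (["tax", "budget", "deficit", "trade"], ["economics"]),
    (["treaty", "border", "trade", "sanction"], ["foreign_policy", "economics"]),
    (["embassy", "treaty", "sanction", "border"], ["foreign_policy"]),
]
topics = sorted({lab for _, labs in docs for lab in labs})
vocab = sorted({w for ws, _ in docs for w in ws})
alpha, beta = 0.5, 0.1

n_tw = defaultdict(int)   # (topic, word) counts
n_t = defaultdict(int)    # topic counts
n_dt = defaultdict(int)   # (doc, topic) counts
assign = []               # per-token topic assignments

# Random initialization, restricted to each document's label set.
for d, (words, labs) in enumerate(docs):
    za = []
    for w in words:
        z = random.choice(labs)
        za.append(z)
        n_tw[(z, w)] += 1; n_t[z] += 1; n_dt[(d, z)] += 1
    assign.append(za)

for _ in range(200):  # Gibbs sweeps
    for d, (words, labs) in enumerate(docs):
        for i, w in enumerate(words):
            z = assign[d][i]  # remove the current assignment
            n_tw[(z, w)] -= 1; n_t[z] -= 1; n_dt[(d, z)] -= 1
            # Resample, but ONLY among this document's labels.
            weights = [
                (n_dt[(d, t)] + alpha) *
                (n_tw[(t, w)] + beta) / (n_t[t] + beta * len(vocab))
                for t in labs
            ]
            z = random.choices(labs, weights=weights)[0]
            assign[d][i] = z
            n_tw[(z, w)] += 1; n_t[z] += 1; n_dt[(d, z)] += 1

for t in topics:  # top words per labeled topic
    print(t, sorted(vocab, key=lambda w: -n_tw[(t, w)])[:3])
```

Because every topic is named after a label, the output is directly interpretable: each label comes with its own ranked word list.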

Labeled LDA has shown its potential for fine-grained topic modeling (e.g. Zirn and Stuckenschmidt, 2014). The method requires a corpus where documents are annotated with tags describing their content. Several methods can be applied to generate tags automatically, e.g. keyphrase extraction (Kim et al., 2010). Our source for tags is Entity linking. Since entity linking provides a unique label for sets of topically related expressions across a corpus's documents, it can help researchers get an overview of the different concepts present in the corpus, even if the concepts are conveyed by different expressions in different documents.

Our first step is identifying potential topic labels via entity linking. Linked entities were obtained with DBpedia Spotlight (Mendes et al., 2011). Spotlight disambiguates against DBpedia, outputting a confidence value for each annotation. Annotations whose confidence was below 0.1 were filtered out. We also removed too general or too frequent entities (e.g. Country or European_Union).
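The filtering step can be sketched as follows, assuming the Spotlight annotations have already been parsed into dicts with "entity" and "confidence" keys (the actual Spotlight output format differs; the field names and sample values here are illustrative).

```python
# Filter linked entities by confidence and by a stop-list of overly
# general entities. Threshold and stop-list examples follow the paper;
# the annotation dict shape is an assumption for this sketch.

CONF_THRESHOLD = 0.1
TOO_GENERAL = {"Country", "European_Union"}

def filter_annotations(annotations):
    return [
        a for a in annotations
        if a["confidence"] >= CONF_THRESHOLD
        and a["entity"] not in TOO_GENERAL
    ]

sample = [
    {"entity": "European_Parliament", "confidence": 0.9},
    {"entity": "Country", "confidence": 0.8},            # too general
    {"entity": "Fishing_industry", "confidence": 0.05},  # low confidence
]
print(filter_annotations(sample))  # keeps only European_Parliament
```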

We then rank entities' relevance per document with tf-idf, which promotes entities that are salient in a specific subset of corpus documents rather than frequent overall in the corpus. Finally, we select the top five entities per document as per tf-idf.
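A minimal version of this ranking step, with documents represented as lists of linked-entity labels (toy data, not the Europarl corpus; the exact tf-idf variant the authors used is not specified, so a standard `tf * log(N/df)` weighting is assumed):

```python
import math
from collections import Counter

docs = {
    "speech_1": ["Common_Agricultural_Policy", "Euro", "Euro", "Enlargement"],
    "speech_2": ["Euro", "Kosovo_War", "Kosovo_War", "NATO"],
}

def tfidf_rank(docs, top_n=5):
    """Rank each document's entities by tf-idf; keep the top_n."""
    n_docs = len(docs)
    df = Counter()                       # document frequency per entity
    for ents in docs.values():
        df.update(set(ents))
    ranked = {}
    for name, ents in docs.items():
        tf = Counter(ents)
        scores = {e: tf[e] * math.log(n_docs / df[e]) for e in tf}
        ranked[name] = [e for e, _ in
                        sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]
    return ranked

print(tfidf_rank(docs))
```

Note how the weighting behaves as described: Euro occurs in both toy documents, so its idf (and hence its score) is zero, and it is demoted below entities that are distinctive for a single document.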

Reference

This content is AI-processed based on open access ArXiv data.
