Inferring Missing Entity Type Instances for Knowledge Base Completion: New Dataset and Methods


Authors: Arvind Neelakantan, Ming-Wei Chang

Arvind Neelakantan* (Department of Computer Science, University of Massachusetts Amherst, Amherst, MA 01003, arvind@cs.umass.edu) and Ming-Wei Chang (Microsoft Research, 1 Microsoft Way, Redmond, WA 98052, USA, minchang@microsoft.com)

* Most of the research was conducted during a summer internship at Microsoft.

Abstract

Most previous work in knowledge base (KB) completion has focused on the problem of relation extraction. In this work, we focus on the task of inferring missing entity type instances in a KB, a fundamental task for KB completion that has received little attention. Due to the novelty of this task, we construct a large-scale dataset and design an automatic evaluation methodology. Our knowledge base completion method uses information within the existing KB and external information from Wikipedia. We show that individual methods trained with a global objective that considers unobserved cells from both the entity and the type side give consistently higher quality predictions compared to baseline methods. We also perform manual evaluation on a small subset of the data to verify the effectiveness of our knowledge base completion methods and the correctness of our proposed automatic evaluation method.

1 Introduction

There is now increasing interest in the construction of knowledge bases like Freebase (Bollacker et al., 2008) and NELL (Carlson et al., 2010) in the natural language processing community. KBs contain facts such as "Tiger Woods is an athlete" and "Barack Obama is the president of the USA". However, one of the main drawbacks of existing KBs is that they are incomplete and are missing important facts (West et al., 2014), jeopardizing their usefulness in downstream tasks such as question answering. This makes the task of completing knowledge base entries, or Knowledge Base Completion (KBC), extremely important.
In this paper, we address an important subproblem of knowledge base completion: inferring missing entity type instances. Most previous work in KB completion has focused only on the problem of relation extraction (Mintz et al., 2009; Nickel et al., 2011; Bordes et al., 2013; Riedel et al., 2013). Entity type information is crucial in KBs and is widely used in many NLP tasks such as relation extraction (Chang et al., 2014), coreference resolution (Ratinov and Roth, 2012; Hajishirzi et al., 2013), entity linking (Fang and Chang, 2014), semantic parsing (Kwiatkowski et al., 2013; Berant et al., 2013) and question answering (Bordes et al., 2014; Yao and Van Durme, 2014). For example, adding entity type information improves relation extraction by 3% (Chang et al., 2014) and entity linking by 4.2 F1 points (Guo et al., 2013). Despite their importance, there is surprisingly little previous work on this problem, and there are no datasets publicly available for evaluation.

We construct a large-scale dataset for the task of inferring missing entity type instances in a KB. Most previous KBC datasets (Mintz et al., 2009; Riedel et al., 2013) are constructed using a single snapshot of the KB, and methods are evaluated on a subset of facts that are hidden during training. Hence, the methods could potentially be evaluated on their ability to predict easy facts that the KB already contains. Moreover, the methods are not directly evaluated on their ability to predict missing facts. To overcome these drawbacks, we construct the train and test data using two snapshots of the KB and evaluate the methods on predicting facts that were added to the more recent snapshot, enabling a more realistic and challenging evaluation.

Figure 1: The Freebase description of Jean Metellus can be used to infer that the entity has the type /book/author. This missing fact is found by our algorithm and is still missing in the latest version of Freebase at the time of writing.
Standard evaluation metrics for KBC methods are generally type-based (Mintz et al., 2009; Riedel et al., 2013), measuring the quality of the predictions by aggregating scores computed within a type. This is not ideal because: (1) it treats every entity type equally, without considering the distribution of types; (2) it does not measure the ability of the methods to rank predictions across types. Therefore, we additionally use a global evaluation metric, where the quality of predictions is measured both within and across types, and which accounts for the high variance in the type distribution. In our experiments, we show that models trained with negative examples from the entity side perform better on type-based metrics, while models trained with negative examples from the type side perform better on the global metric.

In order to design methods that can rank predictions both within and across entity (or relation) types, we propose a global objective to train the models. Our proposed method combines the advantages of previous approaches by using negative examples from both the entity and the type side. When considering the same number of negative examples, we find that linear classifiers and low-dimensional embedding models trained with the global objective produce better quality rankings within and across entity types when compared to training with negative examples only from the entity or the type side. Additionally, compared to prior methods, the model trained on the proposed global objective can more reliably suggest confident entity-type pair candidates that could be added to the given knowledge base. Our contributions are summarized as follows:

• We develop an evaluation framework comprising methods for dataset construction and evaluation metrics to evaluate KBC approaches for missing entity type instances. The dataset and evaluation scripts are publicly available at http://research.microsoft.com/en-US/downloads/df481862-65cc-4b05-886c-acc181ad07bb/default.aspx.

• We propose a global training objective for KBC methods. The experimental results show that both linear classifiers and low-dimensional embedding models achieve the best overall performance when trained with the global objective function.

• We conduct extensive studies on models for inferring missing type instances, studying the impact of using various features and models.

2 Inferring Entity Types

We consider a KB Λ containing entity type information of the form (e, t), where e ∈ E (the set of all entities) is an entity in the KB with type t ∈ T (the set of all types). For example, e could be Tiger Woods and t could be sports athlete. As a single entity can have multiple types, entities in Freebase often miss some of their types. The aim of this work is to infer missing entity type instances in the KB. Given an entity-type pair (e, t) ∉ Λ unobserved in the training data, where e ∈ E and t ∈ T, the task is to infer whether the KB currently misses the fact, i.e., to infer whether (e, t) ∈ Λ. We consider entities in the intersection of Freebase and Wikipedia in our experiments.

2.1 Information Resources

We now describe the information sources used to construct the feature representation of an entity for inferring its types. We use information in Freebase and external information from Wikipedia to complete the KB.

• Entity Type Features: The entity types observed in the training data can be a useful signal for inferring missing entity type instances. For example, in our snapshot of Freebase, it is not uncommon to find an entity with the type /people/deceased_person but missing the type /people/person.

• Freebase Description: Almost all entities in Freebase have a short, one-paragraph description of the entity.
Figure 1 shows the Freebase description of Jean Metellus, which can be used to infer the type /book/author, a type that Freebase does not contain as of the date of writing this article.

• Wikipedia: As external information, we include the Wikipedia full-text article of an entity in its feature representation. We consider entities in Freebase that have a link to their Wikipedia article. The Wikipedia full text of an entity gives several clues for predicting its entity types. For example, Figure 2 shows a section of the Wikipedia article of Claire Martin which gives clues for inferring the type /award/award_winner, which Freebase misses.

3 Evaluation Framework

In this section, we propose an evaluation methodology for the task of inferring missing entity type instances in a KB. While we focus on recovering entity types, the proposed framework can easily be adapted to relation extraction as well. First, we discuss our two-snapshot dataset construction strategy. Then we motivate the importance of evaluating KBC algorithms globally and describe the evaluation metrics we employ.

3.1 Two-Snapshot Construction

In most previous work on KB completion to predict missing relation facts (Mintz et al., 2009; Riedel et al., 2013), the methods are evaluated on a subset of facts from a single KB snapshot that are hidden during training. However, given that the missing entries are usually selected randomly, the distribution of the selected unknown entries could be very different from the actual distribution of missing facts. Also, since any fact could potentially be used for evaluation, the methods could be evaluated on their ability to predict easy facts that are already present in the KB. To overcome this drawback, we construct our train and test sets by considering two snapshots of the knowledge base. The train snapshot is taken from an earlier time without special treatment.
The test snapshot is taken from a later period, and a KBC algorithm is evaluated by its ability to recover newly added knowledge in the test snapshot. This enables the methods to be directly evaluated on facts that are missing in a KB snapshot. Note that the facts added to the test snapshot are, in general, more subtle than the facts it already contains, and predicting the newly added facts can be harder. Hence, our approach enables a more realistic and challenging evaluation setting than previous work. We use the manually constructed Freebase as the KB in our experiments. Notably, Chang et al. (2014) use a two-snapshot strategy for constructing a relation extraction dataset using the automatically constructed NELL as their KB. The new facts that are added to a KB by an automatic method may not have all the characteristics that make the two-snapshot strategy advantageous.

We construct our train snapshot Λ0 by taking the Freebase snapshot of 3rd September, 2013 and considering entities that have a link to their Wikipedia page. KBC algorithms are evaluated by their ability to predict facts that were added to the 1st June, 2014 snapshot of Freebase, Λ. To get negative data, we make a closed-world assumption, treating any unobserved instance in Freebase as a negative example. Unobserved instances in the Freebase snapshots of 3rd September, 2013 and 1st June, 2014 are used as negative examples in training and testing, respectively.¹ The positive instances in the test data (Λ − Λ0) are facts that were newly added to the test snapshot Λ. Using the entire set of negative examples in the test data is impractical due to their large number. To avoid this, we only add the negative types of entities that have at least one new fact in the test data. Additionally, we add a portion of the negative examples for entities which do not have a new fact in the test data and that were unused during training. This makes our dataset quite challenging, since the number of negative instances is much larger than the number of positive instances in the test data.

¹ Note that some of the negative instances used in training could be positive instances in test, but we do not remove them during training.

Figure 2: A section of the Wikipedia article of Claire Martin which gives clues that the entity has the type /award/award_winner. This currently missing fact is also found by our algorithm.

It is important to note that the goal of this work is not to predict facts that emerged between the time periods of the train and test snapshots.² For example, we do not aim to predict the type /award/award_winner for an entity that won an award after 3rd September, 2013. Hence, we use the Freebase description in the training snapshot and the Wikipedia snapshot of 3rd September, 2013 to get the features for entities. One might worry that the new snapshot could contain a significant amount of emerging facts, making it an ineffective way to evaluate KBC algorithms. We therefore examined the difference between the training snapshot and the test snapshot manually and found that this is likely not the case. For example, we randomly selected 25 /award/award_winner instances that were added to the test snapshot and found that all of them had won at least one award before 3rd September, 2013. Note that while this automatic evaluation is closer to the real-world scenario, it is still not perfect, as the new KB snapshot is itself incomplete. Therefore, we also perform human evaluation on a small dataset to verify the effectiveness of our approach.
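The two-snapshot construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's released pipeline: the snapshot contents below are toy placeholders, and the logic follows the closed-world construction of this section (new facts are positives; unobserved pairs for entities with at least one new fact are negatives).

```python
def build_test_set(train_kb, test_kb, all_types):
    """Two-snapshot test-set construction (sketch).

    train_kb and test_kb are sets of (entity, type) pairs from the earlier
    and later KB snapshots. Positives are the facts newly added in test_kb;
    under the closed-world assumption, unobserved pairs for entities with
    at least one new fact are treated as negatives.
    """
    positives = test_kb - train_kb
    entities_with_new_facts = {e for e, _ in positives}
    negatives = {
        (e, t)
        for e in entities_with_new_facts
        for t in all_types
        if (e, t) not in test_kb
    }
    return positives, negatives
```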
² In this work, we also do not aim to correct existing false-positive errors in Freebase.

3.2 Global Evaluation Metric

Mean average precision (MAP) (Manning et al., 2008) is now commonly used to evaluate KB completion methods (Mintz et al., 2009; Riedel et al., 2013). MAP is defined as the mean of average precision over all entity (or relation) types. MAP treats each entity type equally, not explicitly accounting for their distribution. However, some types occur much more frequently than others. For example, in our large-scale experiment with 500 entity types, there are many entity types with only 5 instances in the test set, while the most frequent entity type has tens of thousands of missing instances. Moreover, MAP only measures the ability of the methods to correctly rank predictions within a type.

To account for the high variance in the distribution of entity types, and to measure the ability of the methods to correctly rank predictions across types, we use global average precision (GAP) (similar in spirit to micro-F1) as an additional evaluation metric for KB completion. We convert the multi-label classification problem into a binary classification problem, where the label of an entity-type pair is true if the entity has that type in Freebase and false otherwise. GAP is the average precision of this transformed problem, which measures the ability of the methods to rank predictions both within and across entity types.

Prior to us, Bordes et al. (2013) use mean reciprocal rank as a global evaluation metric for a KBC task. We use average precision instead of mean reciprocal rank since MRR can be biased toward the top predictions of the method (West et al., 2014).

While GAP captures global ordering, it would also be beneficial to measure the quality of the top k predictions of the model for bootstrapping and active learning scenarios (Lewis and Gale, 1994; Cucerzan and Yarowsky, 1999).
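Both metrics reduce to average precision computed over different pools of predictions. The sketch below (with illustrative scores, not numbers from our experiments) shows how GAP can differ from MAP by rewarding correct ranking across types:

```python
def average_precision(ranked_labels):
    """AP of a ranked list of boolean relevance labels."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i
    return precision_sum / hits if hits else 0.0

def mean_ap(scored_by_type):
    """MAP: mean of average precision computed within each type.

    scored_by_type maps a type to a list of (score, is_true_fact) pairs."""
    aps = []
    for preds in scored_by_type.values():
        ranked = [label for _, label in sorted(preds, key=lambda p: -p[0])]
        aps.append(average_precision(ranked))
    return sum(aps) / len(aps)

def global_ap(scored_by_type):
    """GAP: average precision over all predictions pooled across types."""
    pooled = [p for preds in scored_by_type.values() for p in preds]
    ranked = [label for _, label in sorted(pooled, key=lambda p: -p[0])]
    return average_precision(ranked)
```

For example, with type A = [(0.9, True)] and type B = [(0.5, False), (0.4, True)], MAP is 0.75 regardless of how the two types' scores compare, while GAP rises to 5/6 because the confident correct prediction of type A is ranked above type B's false positive.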
We report G@k, GAP measured on the top k predictions (similar to Precision@k and Hits@k). This metric can be reliably used to measure the overall quality of the top k predictions.

4 Global Objective for Knowledge Base Completion

In this section, we describe our approach for predicting missing entity types in a KB. While we focus on recovering entity types in this paper, the methods we develop can easily be extended to other KB completion tasks.

4.1 Global Objective Framework

During training, only positive examples are observed in KB completion tasks. Similar to previous work (Mintz et al., 2009; Bordes et al., 2013; Riedel et al., 2013), we obtain negative training examples by treating the unobserved data in the KB as negative examples. Because the number of unobserved examples is much larger than the number of facts in the KB, we follow previous methods and sample a few unobserved negative examples for every positive example.

Previous methods largely neglect the choice of sampling method for unobserved negative examples. The proposed global objective framework allows us to systematically study the effect of different sampling methods for obtaining negative data, as the performance of the model on different evaluation metrics does depend on the sampling method.

We consider a training snapshot of the KB, Λ0, containing facts of the form (e, t), where e is an entity in the KB with type t. Given a fact (e, t) in the KB, we consider two kinds of negative examples, constructed from the following two sets: N_E(e, t), the "negative entity set", and N_T(e, t), the "negative type set". More precisely,

N_E(e, t) ⊂ { e′ | e′ ∈ E, e′ ≠ e, (e′, t) ∉ Λ0 }, and
N_T(e, t) ⊂ { t′ | t′ ∈ T, t′ ≠ t, (e, t′) ∉ Λ0 }.

Let θ be the model parameters, and let m = |N_E(e, t)| and n = |N_T(e, t)| be the numbers of negative entities and negative types considered for training, respectively.
For each entity-type pair (e, t), we define the scoring function of our model as s(e, t | θ).³ We define two loss functions, one using negative entities and the other using negative types:

L_E(Λ0, θ) = Σ_{(e,t) ∈ Λ0, e′ ∈ N_E(e,t)} [ s(e′, t) − s(e, t) + 1 ]₊^k , and

L_T(Λ0, θ) = Σ_{(e,t) ∈ Λ0, t′ ∈ N_T(e,t)} [ s(e, t′) − s(e, t) + 1 ]₊^k ,

where k is the power of the loss function (k can be 1 or 2), and [·]₊ is the hinge function. The global objective function is defined as

min_θ  Reg(θ) + C · L_T(Λ0, θ) + C · L_E(Λ0, θ),    (1)

where Reg(θ) is the regularization term of the model and C is the regularization parameter. Intuitively, the parameters θ are estimated to rank the observed facts above the negative examples with a margin. The total number of negative examples is controlled by the sizes of the sets N_E and N_T. In Section 5, we experiment with sampling only entities, only types, or both, while fixing the total number of negative examples.

The rest of the section is organized as follows: we propose three algorithms based on the global objective in Section 4.2. In Section 4.3, we discuss the relationship between the proposed algorithms and existing approaches. Let Φ(e) → R^{d_e} be the feature function that maps an entity to its feature representation, and Ψ(t) → R^{d_t} be the feature function that maps an entity type to its feature representation.⁴ d_e and d_t denote the dimensionality of the entity features and the type features, respectively. The feature representation of entity types (Ψ) is used only in the embedding model.

³ We often use s(e, t) as an abbreviation of s(e, t | θ) to save space.
⁴ This gives the possibility of defining features for the labels in the output space, but we currently use a simple one-hot representation for types, since richer features did not give performance gains in our initial experiments.
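The objective in Eq. (1) can be sketched concretely. The sampling scheme and helper names below are our own illustration (the paper does not prescribe a particular sampler); the hinge terms mirror L_E and L_T, omitting the regularizer:

```python
import random

def hinge(x, k=1):
    """[x]_+^k: the hinged, optionally squared, margin loss."""
    return max(0.0, x) ** k

def sample_negatives(e, t, kb, entities, types, m, n, rng):
    """Sample N_E(e, t) (m negative entities) and N_T(e, t) (n negative
    types); any pair unobserved in the KB snapshot counts as negative."""
    neg_e = [e2 for e2 in rng.sample(entities, len(entities))
             if e2 != e and (e2, t) not in kb][:m]
    neg_t = [t2 for t2 in rng.sample(types, len(types))
             if t2 != t and (e, t2) not in kb][:n]
    return neg_e, neg_t

def global_loss(kb, score, entities, types, m=1, n=1, k=1, C=1.0, seed=0):
    """C * L_T + C * L_E from Eq. (1), without the regularization term."""
    rng = random.Random(seed)
    l_e = l_t = 0.0
    for e, t in sorted(kb):
        neg_e, neg_t = sample_negatives(e, t, kb, entities, types, m, n, rng)
        l_e += sum(hinge(score(e2, t) - score(e, t) + 1, k) for e2 in neg_e)
        l_t += sum(hinge(score(e, t2) - score(e, t) + 1, k) for t2 in neg_t)
    return C * l_t + C * l_e
```

A scoring function that separates observed facts from all sampled negatives by the unit margin drives this loss to zero.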
Algorithm 1: The training algorithm for Linear.Adagrad.

1: Initialize w_t = 0, ∀ t = 1 … |T|
2: for (e, t) ∈ Λ0 do
3:   for e′ ∈ N_E(e, t) do
4:     if w_t^T Φ(e) − w_t^T Φ(e′) − 1 < 0 then
5:       AdaGradUpdate(w_t, Φ(e′) − Φ(e))
6:     end if
7:   end for
8:   for t′ ∈ N_T(e, t) do
9:     if w_t^T Φ(e) − w_{t′}^T Φ(e) − 1 < 0 then
10:      AdaGradUpdate(w_t, −Φ(e))
11:      AdaGradUpdate(w_{t′}, Φ(e))
12:    end if
13:  end for
14: end for

4.2 Algorithms

We propose three different algorithms based on the global objective framework for predicting missing entity types. Two algorithms use the linear model and one uses the embedding model.

Linear Model. The scoring function in this model is s(e, t | θ = {w_t}) = w_t^T Φ(e), where w_t ∈ R^{d_e} is the parameter vector for target type t. The regularization term in Eq. (1) is defined as Reg(θ) = 1/2 Σ_t w_t^T w_t. We use k = 2 in our experiments. Our first algorithm is obtained by using the dual coordinate descent algorithm (Hsieh et al., 2008) to optimize Eq. (1), where we modified the original algorithm to handle multiple weight vectors. We refer to this algorithm as Linear.DCD. While the DCD algorithm ensures convergence to the globally optimal solution, its convergence can be slow in certain cases. Therefore, we also adopt an online algorithm, AdaGrad (Duchi et al., 2011). Here we use the hinge loss (k = 1) with no regularization (Reg(θ) = 0), since this gave the best results in our initial experiments. We refer to this algorithm as Linear.Adagrad, described in Algorithm 1. Note that AdaGradUpdate(x, g) is a procedure which updates the vector x with respect to the gradient g.

Embedding Model. In this model, vector representations are constructed for entities and types using linear projection matrices. Recall that Ψ(t) → R^{d_t} is the feature function that maps a type to its feature representation.
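Algorithm 1 above can be rendered directly in code. This is a minimal sketch: the learning rate, epoch count and data layout are our own choices rather than the paper's, and Φ(e) is precomputed into a lookup table.

```python
import math

def adagrad_update(w, g2, grad, lr=0.1, eps=1e-8):
    """AdaGradUpdate(w, grad): per-coordinate step sizes scaled by the
    accumulated squared gradients (Duchi et al., 2011)."""
    for i, g in enumerate(grad):
        if g != 0.0:
            g2[i] += g * g
            w[i] -= lr * g / (math.sqrt(g2[i]) + eps)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_linear_adagrad(facts, phi, neg_ents, neg_types, num_types, dim,
                         epochs=5):
    """Sketch of Linear.Adagrad (Algorithm 1): one weight vector w_t per
    type, updated on margin violations from negative entities and types."""
    w = [[0.0] * dim for _ in range(num_types)]
    g2 = [[0.0] * dim for _ in range(num_types)]
    for _ in range(epochs):
        for e, t in facts:
            fe = phi[e]
            for e2 in neg_ents[(e, t)]:          # lines 3-7 of Algorithm 1
                fn = phi[e2]
                if dot(w[t], fe) - dot(w[t], fn) - 1 < 0:
                    adagrad_update(w[t], g2[t],
                                   [b - a for a, b in zip(fe, fn)])
            for t2 in neg_types[(e, t)]:         # lines 8-13 of Algorithm 1
                if dot(w[t], fe) - dot(w[t2], fe) - 1 < 0:
                    adagrad_update(w[t], g2[t], [-x for x in fe])
                    adagrad_update(w[t2], g2[t2], list(fe))
    return w
```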
Algorithm 2: The training algorithm for the embedding model.

1: Initialize V, U randomly
2: for (e, t) ∈ Λ0 do
3:   for e′ ∈ N_E(e, t) do
4:     if s(e, t) − s(e′, t) − 1 < 0 then
5:       µ ← V^T Ψ(t)
6:       η ← U^T (Φ(e′) − Φ(e))
7:       for i ∈ 1 … d do
8:         AdaGradUpdate(U_i, µ[i] (Φ(e′) − Φ(e)))
9:         AdaGradUpdate(V_i, η[i] Ψ(t))
10:      end for
11:    end if
12:  end for
13:  for t′ ∈ N_T(e, t) do
14:    if s(e, t) − s(e, t′) − 1 < 0 then
15:      µ ← V^T (Ψ(t′) − Ψ(t))
16:      η ← U^T Φ(e)
17:      for i ∈ 1 … d do
18:        AdaGradUpdate(U_i, µ[i] Φ(e))
19:        AdaGradUpdate(V_i, η[i] (Ψ(t′) − Ψ(t)))
20:      end for
21:    end if
22:  end for
23: end for

The scoring function is given by s(e, t | θ = (U, V)) = Ψ(t)^T V U^T Φ(e), where U ∈ R^{d_e × d} and V ∈ R^{d_t × d} are projection matrices that embed the entities and types in a d-dimensional space. As in the linear classifier model, we use the l1 hinge loss (k = 1) with no regularization (Reg(θ) = 0). U_i and V_i denote the i-th column vectors of the matrices U and V, respectively. The algorithm is described in detail in Algorithm 2.

The embedding model has more expressive power than the linear model, but, unlike for the linear model, training converges only to a locally optimal solution, since the objective function is non-convex.

4.3 Relationship to Existing Methods

Many existing methods for relation extraction and entity type prediction can be cast as special cases of the global objective framework. For example, we can view the relation extraction models of Mintz et al. (2009), Bordes et al. (2013) and Riedel et al. (2013) as models trained with N_T(e, t) = ∅. These models are trained using only negative entities, which we refer to as the Negative Entity (NE) objective.
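The bilinear score used by the embedding model can be sketched as follows (pure Python, row-major matrices; the dimensions are illustrative):

```python
def embed_score(psi_t, V, U, phi_e):
    """s(e, t) = Psi(t)^T V U^T Phi(e): project entity and type into a
    shared d-dimensional space, then take the inner product."""
    d = len(U[0])                       # U: d_e x d, V: d_t x d
    u_e = [sum(U[i][j] * phi_e[i] for i in range(len(phi_e)))
           for j in range(d)]           # U^T Phi(e)
    v_t = [sum(V[i][j] * psi_t[i] for i in range(len(psi_t)))
           for j in range(d)]           # V^T Psi(t)
    return sum(a * b for a, b in zip(u_e, v_t))
```

With a one-hot Ψ(t), the type projection V^T Ψ(t) selects the t-th row of V as the type embedding, so all types score entities through the shared projection U, which is how the model shares parameters across types.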
The entity type prediction model in Ling and Weld (2012) is a linear model with N_E(e, t) = ∅, which we refer to as the Negative Type (NT) objective. The embedding model described in Weston et al. (2011), developed for image retrieval, is also a special case of our model trained with the NT objective.

While the NE or NT objective functions may be suitable for some classification tasks (Weston et al., 2011), the choice of objective function for KBC tasks has not been well motivated. Often the choice is made with neither theoretical foundation nor empirical support. To the best of our knowledge, the global objective function, which includes both N_E(e, t) and N_T(e, t), has not been considered previously by KBC methods.

Table 1: Statistics of our dataset. Λ0 is our training snapshot and Λ is our test snapshot. An example is an entity-type pair.

                                     70 types    500 types
Entities                             2.2M        2.2M
Training data (Λ0):
  positive examples                  4.5M        6.2M
  max #entities for a type           1.1M        1.1M
  min #entities for a type           6732        32
Test data (Λ − Λ0):
  positive examples                  163K        240K
  negative examples                  17.1M       132M
  negative/positive ratio            105.22      554.44

5 Experiments

In this section, we give details about our dataset and discuss our experimental results. Finally, we perform manual evaluation on a small subset of the data.

5.1 Data

First, we evaluate our methods on the 70 entity types with the most observed facts in the training data.⁵ We also perform a large-scale evaluation by testing the methods on the 500 types with the most observed facts in the training data.

⁵ We removed a few entity types that were trivial to predict in the test data.

Table 1 shows statistics of our dataset. The number of positive examples is much larger in the training data than in the test data, since the test set contains only facts that were added to the more recent snapshot. An additional effect of this is that most of the facts in the test data are about entities that are not very well known or famous. The high ratio of negative to positive examples in the test data makes this dataset very challenging.

5.2 Automatic Evaluation Results

Table 2 shows automatic evaluation results on 70 types and 500 types. We compare different aspects of the system on 70 types empirically.

Adagrad vs. DCD. We first study the linear models by comparing Linear.DCD and Linear.Adagrad. Table 2a shows that Linear.Adagrad consistently performs better on our task.

Impact of Features. We compare the effect of different features on the final performance using Linear.Adagrad in Table 2b. Types are represented by boolean features, while the Freebase description and Wikipedia full text are represented using tf-idf weighting. The best MAP results are obtained by using all the information (T+D+W), while the best GAP results are obtained by using the Freebase description and Wikipedia article of the entity. Note that the features are simply concatenated when multiple resources are used. We tried idf weighting on the type features and on all features, but this did not yield improvements.

The Importance of the Global Objective. Tables 2c and 2d compare the global training objective with the NE and NT training objectives. Note that all three methods use the same number of negative examples; more precisely, for each (e, t) ∈ Λ0, |N_E(e, t)| + |N_T(e, t)| = m + n = 2. The results show that the global training objective achieves the best scores on both MAP and GAP for both the classifiers and the low-dimensional embedding models. Between NE and NT, NE performs better on the type-based metric, while NT performs better on the global metric.

Linear Model vs. Embedding Model. Finally, we compare the linear classifier model with the embedding model in Table 2e. The linear classifier model performs better than the embedding model on both MAP and GAP.
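The feature construction compared in Table 2b (boolean type indicators plus tf-idf over the text resources, simply concatenated) can be sketched as below. The vocabulary handling, idf smoothing and tokenization are our own simplifications, not the paper's exact setup:

```python
import math
from collections import Counter

def type_features(observed_types, all_types):
    """Boolean entity-type features: one dimension per known type."""
    return [1.0 if t in observed_types else 0.0 for t in all_types]

def tfidf_features(tokens, corpus, vocab):
    """tf-idf weights of one document over a fixed vocabulary."""
    tf = Counter(tokens)
    n = len(corpus)
    feats = []
    for word in vocab:
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n / (1.0 + df)) + 1.0    # smoothed idf
        feats.append(tf[word] * idf)
    return feats

def entity_features(observed_types, all_types, desc_tokens, corpus, vocab):
    """Phi(e): concatenation of type and description features (T + D)."""
    return (type_features(observed_types, all_types)
            + tfidf_features(desc_tokens, corpus, vocab))
```

Adding the Wikipedia resource (T + D + W) is just a further concatenation of another tf-idf block built from the article text.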
We also perform a large-scale evaluation on 500 types using the description features (as these experiments are expensive); the results are shown in Table 2f.

Table 2: Automatic evaluation results. Note that m = |N_E(e, t)| and n = |N_T(e, t)|.

(a) Adagrad vs. dual coordinate descent (DCD). Results are obtained using linear models trained with the global training objective (m=1, n=1) on 70 types.

Features                  Algorithm        MAP    GAP
Description               Linear.Adagrad   29.17  28.17
                          Linear.DCD       28.40  27.76
Description + Wikipedia   Linear.Adagrad   33.28  31.97
                          Linear.DCD       31.92  31.36

(b) Feature comparison. Results are obtained using Linear.Adagrad with the global training objective (m=1, n=1) on 70 types.

Features          MAP    GAP
Type (T)          12.33  13.58
Description (D)   29.17  28.17
Wikipedia (W)     30.81  30.56
D + W             33.28  31.97
T + D + W         36.13  31.13

(c) Global objective vs. NE and NT. Results are obtained using Linear.Adagrad on 70 types.

Features    Objective            MAP    GAP
D + W       NE (m=2)             33.01  23.97
            NT (n=2)             31.61  29.09
            Global (m=1, n=1)    33.28  31.97
T + D + W   NE (m=2)             34.56  21.79
            NT (n=2)             34.45  31.42
            Global (m=1, n=1)    36.13  31.13

(d) Global objective vs. NE and NT. Results are obtained using the embedding model on 70 types.

Features    Objective            MAP    GAP
D + W       NE (m=2)             30.92  22.38
            NT (n=2)             25.77  23.40
            Global (m=1, n=1)    31.60  30.13
T + D + W   NE (m=2)             28.70  19.34
            NT (n=2)             28.06  25.42
            Global (m=1, n=1)    30.35  28.71

(e) Model comparison. The models were trained with the global training objective (m=1, n=1) on 70 types.

Features    Model            MAP    GAP    G@1000  G@10000
D + W       Linear.Adagrad   33.28  31.97  79.63   68.08
            Embedding        31.60  30.13  73.40   64.69
T + D + W   Linear.Adagrad   36.13  31.13  70.02   65.09
            Embedding        30.35  28.71  62.61   64.30

(f) Results on 500 types using Freebase description features. We train the models with the global training objective (m=1, n=1).

Model            MAP    GAP    G@1000  G@10000
Linear.Adagrad   13.28  20.49  69.23   60.14
Embedding        9.82   17.67  55.31   51.29
One might expect that, with the increased number of types, the embedding model would perform better than the classifier, since it shares parameters across types. However, despite the recent popularity of embedding models in NLP, the linear model still performs better on our task.

5.3 Human Evaluation

To verify the effectiveness of our KBC algorithms and the correctness of our automatic evaluation method, we perform manual evaluation on the top 100 predictions of the output obtained from two different experimental settings; the results are shown in Table 3. Even though the automatic evaluation gives pessimistic results, since the test KB is also incomplete⁶, the results indicate that the automatic evaluation is correlated with manual evaluation. More excitingly, among the 179 unique instances we manually evaluated, 17 of them are still⁷ missing in Freebase, which emphasizes the effectiveness of our approach.

⁶ This is true even with existing automatic evaluation methods.
⁷ At submission time.

Table 3: Manual vs. automatic evaluation of the top 100 predictions on 70 types. Predictions are obtained by training a linear classifier using Adagrad with the global training objective (m=1, n=1). G@100-M and Accuracy-M are computed by manual evaluation.

Features    G@100  G@100-M  Accuracy-M
D + W       87.68  97.31    97
T + D + W   84.91  91.47    88

5.4 Error Analysis

• Effect of training data: We find that the performance of the models on a type is highly dependent on the number of training instances for that type. For example, the linear classifier model, when evaluated on 70 types, performs 24.86% better on the most frequent 35 types than on the least frequent 35 types. This indicates that bootstrapping or active learning techniques could profitably be used to provide more supervision for the methods. In this case, G@k would be a useful metric to compare the effectiveness of the different methods.
• Shallow linguistic features: We found that some of the false-positive predictions are caused by the use of shallow linguistic features. For example, an entity who has acted in a movie and composes music only for television shows is wrongly tagged with the type /film/composer, since words like "movie", "composer" and "music" occur frequently in the Wikipedia article of the entity (http://en.wikipedia.org/wiki/J._J._Abrams).

6 Related Work

Entity Type Prediction and Wikipedia Features. Much previous work (Pantel et al., 2012; Ling and Weld, 2012) in entity type prediction has focused on the task of predicting entity types at the sentence level. Yao et al. (2013) develop a method based on matrix factorization for entity type prediction in a KB using information within the KB and New York Times articles. However, the method was still evaluated only at the sentence level. Toral and Munoz (2006) and Kazama and Torisawa (2007) use the first line of an entity's Wikipedia article to perform named entity recognition on three entity types.

Knowledge Base Completion. Much previous work in KB completion has focused on the problem of relation extraction. The majority of the methods infer missing relation facts using information within the KB (Nickel et al., 2011; Lao et al., 2011; Socher et al., 2013; Bordes et al., 2013), while methods such as Mintz et al. (2009) use information in text documents. Riedel et al. (2013) use both information within and outside the KB to complete the KB.

Linear Embedding Model. Weston et al. (2011) is one of the first works that developed a supervised linear embedding model, applying it to image retrieval. We apply this model to entity type prediction, but we train it with a different objective function that is better suited to our task.
7 Conclusion and Future Work

We propose an evaluation framework, comprising methods for dataset construction and evaluation metrics, to evaluate KBC approaches for inferring missing entity type instances. We verified that our automatic evaluation is correlated with human evaluation, and our dataset and evaluation scripts are publicly available (http://research.microsoft.com/en-US/downloads/df481862-65cc-4b05-886c-acc181ad07bb/default.aspx). Experimental results show that models trained with our proposed global training objective produce higher-quality rankings within and across types when compared to baseline methods. In future work, we plan to use information from entity-linked documents to improve performance, and also to explore active learning and other human-in-the-loop methods to obtain more training data.

References

[Berant et al.2013] Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Abby Vander Linden, Brittany Harding, Brad Huang, and Christopher D. Manning. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing.

[Bollacker et al.2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

[Bordes et al.2013] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems.

[Bordes et al.2014] Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Empirical Methods in Natural Language Processing.

[Carlson et al.2010] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010.
Toward an architecture for never-ending language learning. In AAAI.

[Chang et al.2014] Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. 2014. Typed tensor decomposition of knowledge bases for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

[Cucerzan and Yarowsky1999] Silviu Cucerzan and David Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

[Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. In Journal of Machine Learning Research.

[Fang and Chang2014] Yuan Fang and Ming-Wei Chang. 2014. Entity linking on microblogs with spatial and temporal signals. In Transactions of the Association for Computational Linguistics.

[Guo et al.2013] Stephen Guo, Ming-Wei Chang, and Emre Kiciman. 2013. To link or not to link? A study on end-to-end tweet entity linking. In The North American Chapter of the Association for Computational Linguistics, June.

[Hajishirzi et al.2013] Hannaneh Hajishirzi, Leila Zilles, Daniel S. Weld, and Luke Zettlemoyer. 2013. Joint coreference resolution and named-entity linking with multi-pass sieves. In Empirical Methods in Natural Language Processing.

[Hsieh et al.2008] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. 2008. A dual coordinate descent method for large-scale linear SVM. In International Conference on Machine Learning.

[Kazama and Torisawa2007] Jun'ichi Kazama and Kentaro Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
[Kwiatkowski et al.2013] Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing.

[Lao et al.2011] Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Conference on Empirical Methods in Natural Language Processing.

[Lewis and Gale1994] David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In ACM SIGIR Conference on Research and Development in Information Retrieval.

[Ling and Weld2012] Xiao Ling and Daniel S. Weld. 2012. Fine-grained entity recognition. In Association for the Advancement of Artificial Intelligence.

[Manning et al.2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.

[Mintz et al.2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing.

[Nickel et al.2011] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning.

[Pantel et al.2012] Patrick Pantel, Thomas Lin, and Michael Gamon. 2012. Mining entity types from query logs via user intent modeling. In Association for Computational Linguistics.

[Ratinov and Roth2012] Lev Ratinov and Dan Roth. 2012. Learning-based multi-sieve co-reference resolution with knowledge. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

[Riedel et al.2013] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013.
Relation extraction with matrix factorization and universal schemas. In The North American Chapter of the Association for Computational Linguistics.

[Socher et al.2013] Richard Socher, Danqi Chen, Christopher Manning, and Andrew Y. Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems.

[Toral and Munoz2006] Antonio Toral and Rafael Munoz. 2006. A proposal to automatically build and maintain gazetteers for named entity recognition by using Wikipedia. In European Chapter of the Association for Computational Linguistics.

[West et al.2014] Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge base completion via search-based question answering. In Proceedings of the 23rd International Conference on World Wide Web, pages 515–526.

[Weston et al.2011] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence.

[Yao and Durme2014] Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Association for Computational Linguistics.

[Yao et al.2013] Limin Yao, Sebastian Riedel, and Andrew McCallum. 2013. Universal schema for entity type prediction. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction.
