Inducing Semantic Representation from Text by Jointly Predicting and Factorizing Relations

Authors: Ivan Titov, Ehsan Khoddam
Universiteit van Amsterdam, Amsterdam, the Netherlands
{titov,e.khoddammohammadi}@uva.nl

Accepted as a workshop contribution at ICLR 2015

ABSTRACT

In this work, we propose a new method to integrate two recent lines of work: unsupervised induction of shallow semantics (e.g., semantic roles) and factorization of relations in text and knowledge bases. Our model consists of two components: (1) an encoding component: a semantic role labeling model which predicts roles given a rich set of syntactic and lexical features; (2) a reconstruction component: a tensor factorization model which relies on roles to predict argument fillers. When the components are estimated jointly to minimize errors in argument reconstruction, the induced roles largely correspond to roles defined in annotated resources. Our method performs on par with the most accurate role induction methods on English, even though, unlike these previous approaches, we do not incorporate any prior linguistic knowledge about the language.

1 INTRODUCTION

Shallow representations of meaning, and semantic role labels in particular, have a long history in linguistics (Fillmore, 1968). More recently, with the emergence of large annotated resources such as PropBank (Palmer et al., 2005) and FrameNet (Baker et al., 1998), automatic semantic role labeling (SRL) has attracted a lot of attention (Surdeanu et al., 2008; Hajič et al., 2009; Das et al., 2010). Semantic role representations encode the underlying predicate-argument structure of sentences: more specifically, for every predicate in a sentence they identify a set of arguments and associate each argument with an underlying semantic role, such as an agent (an initiator or doer of the action) or a patient (an affected entity).
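As a rough illustration (ours, not the paper's), such a predicate-argument structure can be stored as a predicate lemma plus a mapping from role names to argument fillers; the class and field names below are hypothetical, chosen only to make the representation concrete for the example sentence analyzed next:

```python
from dataclasses import dataclass, field

@dataclass
class PredicateArgumentStructure:
    """One predicate with its role-labeled arguments (illustrative sketch only)."""
    predicate: str                      # predicate lemma, e.g. "charge"
    arguments: dict = field(default_factory=dict)  # role name -> argument filler

# "The police charged the demonstrators with batons" yields, for the
# predicate "charge", one structure with three role-labeled arguments:
analysis = PredicateArgumentStructure(
    predicate="charge",
    arguments={"Agent": "police", "Patient": "demonstrator", "Instrument": "baton"},
)
assert analysis.arguments["Agent"] == "police"
```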
Consider the following sentence:

[Agent The police] charged [Patient the demonstrators] [Instrument with batons].

Here, the police, the demonstrators and with batons are assigned to the roles Agent, Patient and Instrument, respectively. Semantic roles have many potential applications in NLP and have been shown to benefit, for example, question answering (Shen and Lapata, 2007; Berant et al., 2014) and textual entailment (Sammons et al., 2009), among others.

The scarcity of annotated data has motivated research into unsupervised learning of semantic representations (Lang and Lapata, 2010; 2011a;b; Titov and Klementiev, 2012; Fürstenau and Rambow, 2012; Garg and Henderson, 2012). The existing methods have a number of serious shortcomings. First, they make very strong assumptions, for example, assuming that arguments are conditionally independent of each other given the predicate. Second, unlike state-of-the-art supervised parsers, they rely on a very simplistic set of sentence features. These factors leave the models insufficiently expressive to capture the syntax-semantics interface, lead to inadequate handling of language ambiguity and, overall, impose an upper bound on their performance.

In this work, we propose a method for effective unsupervised estimation of feature-rich models of semantic roles. We demonstrate that reconstruction-error objectives, which have been shown to be effective primarily for training neural networks, are well suited for inducing feature-rich log-linear models of semantics. Our model consists of two components: a feature-rich log-linear semantic role labeler and a tensor-factorization model which captures the interaction between semantic roles and argument fillers. Our method rivals the most accurate semantic role induction methods on English (Titov and Klementiev, 2012; Lang and Lapata, 2011a). Importantly, no prior knowledge about any specific language was incorporated in our feature-rich model, whereas the clustering counterparts relied on language-specific argument signatures.

[Figure 1: (a) An autoencoder from R^m to R^p (typically p < m). (b) Modeling roles within the reconstruction-error minimization framework: a feature-rich semantic role labeler acts as the encoding model p(r | x, w), and an argument-prediction model p(a_i | a_{-i}, r, v, θ) acts as the reconstruction model.]

2 APPROACH

At the core of our approach is a statistical model encoding the interdependence between a semantic role structure and its realization in a sentence. In the unsupervised learning setting, sentences, their syntactic representations and argument positions (denoted by x) are observable, whereas the associated semantic roles r are latent and need to be induced by the model. Crucially, the induced r should encode roles rather than some other form of abstraction. In what follows, we will refer to roles using their names, though, in the unsupervised setting, our method, like any other latent variable model, will not yield human-interpretable labels for them. We also focus only on the labeling stage of semantic role labeling. Identification, though an important problem, can be tackled with heuristics (Lang and Lapata, 2011a), with unsupervised techniques (Abend et al., 2009) or potentially by using a supervised classifier trained on a small amount of data.

The model consists of two components. The first component is responsible for the prediction of argument tuples based on roles and the predicate.
In our experiments, in this component, we represent arguments as lemmas of their lexical heads (e.g., baton instead of with batons), and we also restrict ourselves to verbal predicates only. Intuitively, we can think of predicting one argument at a time (see Figure 1(b)): an argument (e.g., demonstrator in our example) is predicted based on the predicate lemma (charge), the role assigned to this argument (i.e. Patient) and the other role-argument pairs ((Agent, police) and (Instrument, baton)). While learning to predict arguments, the inference algorithm will search for role assignments which simplify this prediction task as much as possible. Our hypothesis is that these assignments will correspond to roles accepted in linguistic theories (or, more importantly, roles useful in practical applications). Why is this hypothesis plausible? Primarily because these semantic representations were introduced as an abstraction capturing crucial properties of a relation (or an event). Thus, these representations, rather than surface linguistic details like argument order or syntactic functions, should be crucial for modeling sets of potential argument tuples.

The reconstruction component is not the only part of the model. Crucially, what we referred to above as 'searching for role assignments to simplify argument prediction' actually corresponds to learning another component: a semantic role labeler which predicts roles relying on a rich set of sentence features. These two components are estimated jointly in such a way as to minimize errors in recovering arguments. The role labeler is the end product of learning: it will be used to process new sentences, and it will be compared to existing methods in our evaluation.

Generative modeling is not the only way to learn latent representations.
One alternative, popular in the neural network community, is to use autoencoders instead and optimize the reconstruction error (Hinton, 1989; Vincent et al., 2008). The encoding model will be a feature-rich classifier which predicts semantic roles for a sentence, and the reconstruction model is the model which predicts an argument given its role and given the rest of the arguments with their roles. The idea of training linear models with reconstruction error was previously explored by Daumé III (2009) and very recently by Ammar et al. (2014). However, they do not consider learning factorization models, and they also do not deal with semantics. Tensor and factorization methods used in the context of modeling knowledge bases (e.g., Bordes et al. (2011)) are also close in spirit. However, they do not deal with inducing semantics but rather factorize existing relations (i.e. they rely on semantics).

2.1 MODELING SEMANTICS WITHIN THE RECONSTRUCTION-ERROR FRAMEWORK

As mentioned above, we focus on argument labeling: we assume that the arguments a = (a_1, ..., a_N), a_i ∈ A, are known, and only their roles r = (r_1, ..., r_N), r_i ∈ R, need to be induced. For the encoder (i.e. the semantic role labeler), we use a log-linear model:

p(r | x, w) ∝ exp(w^T g(x, r)),

where g(x, r) is a feature vector encoding interactions between the sentence x and the semantic role representation r. Any model can be used here as long as the posterior distributions of the roles r_i can be efficiently computed or approximated. In our experiments, we used a model which factorizes over individual arguments (i.e. independent logistic regression classifiers).

The reconstruction component predicts an argument (e.g., the i-th argument a_i) given the semantic roles r, the predicate v and the other arguments a_{-i} = (a_1, ..., a_{i-1}, a_{i+1}, ..., a_N) with a bilinear softmax model:

p(a_i | a_{-i}, r, v, C, u) = exp(u_{a_i}^T C_{v,r_i}^T Σ_{j≠i} C_{v,r_j} u_{a_j}) / Z(r, v, i),   (1)

where u_a ∈ R^d (for every a ∈ A) and C_{v,r} ∈ R^{d×k} (for every verb v and every role r ∈ R) are model parameters, and Z(r, v, i) is the partition function ensuring that the probabilities sum to one. Intuitively, the embeddings u_a encode semantic properties of an argument: for example, the embeddings for the words demonstrator and protestor should be somewhere near each other in R^d, and further away from that for the word cat. The product C_{v,r} u_a is a k-dimensional vector encoding beliefs about the other arguments based on the argument-role pair (a, r). In turn, the dot product (C_{v,r_i} u_{a_i})^T C_{v,r_j} u_{a_j} is large if the argument pair (a_i, a_j) is semantically compatible with the predicate, and small otherwise. Intuitively, this objective corresponds to scoring argument tuples according to

h(a, r, v, C, u) = Σ_{i≠j} u_{a_i}^T C_{v,r_i}^T C_{v,r_j} u_{a_j},

hinting at connections to (coupled) tensor and factorization methods (Yılmaz et al., 2011; Bordes et al., 2011) and distributional semantics (Mikolov et al., 2013; Pennington et al., 2014). Note also that the reconstruction model does not have access to any features of the sentence (e.g., argument order or syntax), forcing the roles to convey all the necessary information. In practice, we smooth the model by using a sum of predicate-specific and cross-predicate projection matrices (C_{v,r} + C_r) instead of just C_{v,r}.

2.2 LEARNING

The parameters of both model components (w, u and C) are learned jointly; the natural objective associated with every sentence is the following:

Σ_{i=1}^N log Σ_r p(a_i | a_{-i}, r, v, C, u) p(r | x, w).   (2)
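The bilinear reconstruction distribution of Eq. (1) can be illustrated with a small numeric sketch (our own toy sizes and names, with randomly initialized rather than learned parameters; projections are stored as (k, d) matrices so that C @ u gives the k-dimensional projection described in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 4, 2                                            # toy embedding / projection sizes
lemmas = ["police", "demonstrator", "baton", "cat"]    # candidate argument lemmas A
role_of = {0: "Agent", 1: "Patient", 2: "Instrument"}  # role assignment r for a 3-argument tuple
args = ["police", "demonstrator", "baton"]             # observed arguments a

# Parameters for a single verb v = "charge": an embedding u_a per lemma and a
# projection matrix per role (here merged with the cross-predicate matrix C_r).
u = {a: rng.normal(size=d) for a in lemmas}
C = {r: rng.normal(size=(k, d)) for r in role_of.values()}

def reconstruct(i):
    """p(a_i | a_{-i}, r, v): softmax over candidate lemmas a of
    (C_{v,r_i} u_a)^T sum_{j != i} C_{v,r_j} u_{a_j}, cf. Eq. (1)."""
    context = sum(C[role_of[j]] @ u[args[j]] for j in range(len(args)) if j != i)
    scores = np.array([(C[role_of[i]] @ u[a]) @ context for a in lemmas])
    exps = np.exp(scores - scores.max())    # subtract max for numerical stability
    return dict(zip(lemmas, exps / exps.sum()))  # Z(r, v, i) is the normalizer

probs = reconstruct(1)                      # reconstruct the Patient argument
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

During training, role posteriors replace the fixed role assignment above (the mean-field substitution of Section 2.2), and the normalizer is approximated by negative sampling rather than summing over the full lemma vocabulary.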
However, optimizing this objective in its exact form is not practical for two reasons: (1) the marginalization over r is exponential in the number of arguments; (2) the partition function Z(r, v, i) requires summation over the entire set of potential argument lemmas. We use existing techniques to address both challenges. To deal with the first challenge, we use a basic mean-field approximation: instead of marginalizing over r, we substitute r with the posterior distributions μ_{is} = p(r_i = s | x, w). To tackle the second problem, the computation of Z(μ, v, i), we use a negative sampling technique (see, e.g., Mikolov et al. (2013)). At test time, only the log-linear semantic role labeler is used, so inference is straightforward.

Table 1: Purity (PU), collocation (CO) and F1 on English (PropBank / CoNLL 2008).

                 PU    CO    F1
Our Model       79.7  86.2  82.8
Bayes           89.3  76.6  82.5
Agglom+         87.9  75.6  81.3
RoleOrdering    83.5  78.5  80.9
Agglom          88.7  73.0  80.1
GraphPart       88.6  70.7  78.6
LLogistic       79.5  76.5  78.0
SyntF           81.6  77.5  79.5

3 EXPERIMENTS

We followed Lang and Lapata (2010) and used the CoNLL 2008 shared task data (Surdeanu et al., 2008). As in most previous work on unsupervised SRL, we evaluate our model using clustering metrics: purity, collocation and their harmonic mean F1. For the semantic role labeling (encoding) component, we relied on 14 feature patterns used for argument labeling in one of the popular supervised role labelers (Johansson and Nugues, 2008), which resulted in a quite large feature space (49,474 feature instantiations for our English dataset).

For the reconstruction component, we set the dimensionality of embeddings d, the projection dimensionality k and the number of negative samples n to 30, 15 and 20, respectively.
These hyperparameters were tuned on held-out data (the CoNLL 2008 development set): d and k were chosen from {10, 15, 20, 30, 50}, with the constraint that d is always greater than k, and n was chosen from {10, 20}. The model was not sensitive to the parameter defining the number of roles, as long as it was large enough. For training, we used uniform random initialization and AdaGrad (Duchi et al., 2011).

Following Lang and Lapata (2010), we use a baseline (SyntF) which simply clusters predicate arguments according to the dependency relation to their head. A separate cluster is allocated for each of the 20 most frequent relations in the dataset, and an additional cluster is used for all other relations. As observed in previous work (Lang and Lapata, 2011a), this is a hard baseline to beat. We also compare against previous approaches: the latent logistic classification model (Lang and Lapata, 2010) (labeled LLogistic), the agglomerative clustering method (Lang and Lapata, 2011a) (Agglom), the graph partitioning approach (Lang and Lapata, 2011b) (GraphPart), and the global role ordering model (Garg and Henderson, 2012) (RoleOrdering). We also report results of an improved version of Agglom, recently reported by Lang and Lapata (2014) (Agglom+). The strongest previous model is Bayes: the most accurate ('coupled') version of the Bayesian model of Titov and Klementiev (2012), estimated from the CoNLL data without relying on any external data.

Our model outperforms or performs on par with the best previous models in terms of F1 (see Table 1). Interestingly, the balance between purity and collocation is very different for our model and for the rest of the systems. In fact, our model induces at most 4-6 roles. On the contrary, Bayes predicts more than 30 roles for the majority of frequent predicates (e.g., 43 roles for the predicate include or 35 for say).
Though this tendency reduces the purity scores for our model, it also means that our roles are more human-interpretable. For example, agents and patients are clearly identifiable in the model predictions.

4 CONCLUSIONS

We introduced a method for inducing feature-rich semantic role labelers from unannotated text. In our approach, we view a semantic role representation as an encoding of a latent relation between a predicate and a tuple of its arguments. We capture this relation with a probabilistic tensor factorization model. Our estimation method yields a semantic role labeler which achieves state-of-the-art results on English.

REFERENCES

Abend, O., Reichart, R., and Rappoport, A. (2009). Unsupervised argument identification for semantic role labeling. In ACL-IJCNLP.
Ammar, W., Dyer, C., and Smith, N. (2014). Conditional random field autoencoders for unsupervised structured prediction. In NIPS.
Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). The Berkeley FrameNet project. In ACL-COLING.
Berant, J., Srikumar, V., Chen, P., Huang, B., Manning, C. D., Vander Linden, A., Harding, B., and Clark, P. (2014). Modeling biological processes for reading comprehension. In EMNLP.
Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In AAAI.
Das, D., Schneider, N., Chen, D., and Smith, N. A. (2010). Probabilistic frame-semantic parsing. In NAACL.
Daumé III, H. (2009). Unsupervised search-based structured prediction. In ICML.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159.
Fillmore, C. J. (1968). The case for case. In Bach, E. and Harms, R. T., editors, Universals in Linguistic Theory, pages 1–88. Holt, Rinehart, and Winston, New York.
Fürstenau, H. and Rambow, O. (2012). Unsupervised induction of a syntax-semantics lexicon using iterative refinement. In *SEM: The First Joint Conference on Lexical and Computational Semantics.
Garg, N. and Henderson, J. (2012). Unsupervised semantic role induction with global role ordering. In ACL: Short Papers.
Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Martí, M. A., Màrquez, L., Meyers, A., Nivre, J., Padó, S., Štěpánek, J., Straňák, P., Surdeanu, M., Xue, N., and Zhang, Y. (2009). The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In CoNLL.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(1):185–234.
Johansson, R. and Nugues, P. (2008). Dependency-based syntactic-semantic analysis with PropBank and NomBank. In CoNLL.
Lang, J. and Lapata, M. (2010). Unsupervised induction of semantic roles. In ACL.
Lang, J. and Lapata, M. (2011a). Unsupervised semantic role induction via split-merge clustering. In ACL.
Lang, J. and Lapata, M. (2011b). Unsupervised semantic role induction with graph partitioning. In EMNLP.
Lang, J. and Lapata, M. (2014). Similarity-driven semantic role induction via graph partitioning. Computational Linguistics, 40(3):633–669.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint.
Palmer, M., Gildea, D., and Kingsbury, P. (2005). The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP.
Sammons, M., Vydiswaran, V., Vieira, T., Johri, N., Chang, M., Goldwasser, D., Srikumar, V., Kundu, G., Tu, Y., Small, K., Rule, J., Do, Q., and Roth, D. (2009). Relation alignment for textual entailment recognition. In Text Analysis Conference (TAC).
Shen, D. and Lapata, M. (2007). Using semantic roles to improve question answering. In EMNLP.
Surdeanu, M., Johansson, R., Meyers, A., Màrquez, L., and Nivre, J. (2008). The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. In CoNLL.
Titov, I. and Klementiev, A. (2012). A Bayesian approach to semantic role induction. In EACL.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML.
Yılmaz, K. Y., Cemgil, A. T., and Simsekli, U. (2011). Generalised coupled tensor factorisation. In NIPS.