Tree-HMM-Based Word Representation Learning Using Syntactic Functions

This paper proposes a tree-structured hidden Markov model that adds the labels of dependency trees (syntactic functions) as observed variables. By designing the syntactic functions to modulate the transition and emission probabilities, the model learns more precise representations when the same lexical item plays different semantic roles depending on context. Applied to NER and semantic frame identification, the learned representations outperform sequential HMMs and unlabeled-tree HMMs, and are competitive with state-of-the-art embedding-based methods. We also revisit the relationship between sequential and unlabeled-tree models and find that the advantage of the latter is not self-evident.

Authors: Simon Šuster, Gertjan van Noord, Ivan Titov

Word Representations, Tree Models and Syntactic Functions

Simon Šuster, University of Groningen, Netherlands, s.suster@rug.nl
Gertjan van Noord, University of Groningen, Netherlands, g.j.m.van.noord@rug.nl
Ivan Titov, University of Amsterdam, Netherlands, titov@uva.nl

Abstract

Word representations induced from models with discrete latent variables (e.g. HMMs) have been shown to be beneficial in many NLP applications. In this work, we exploit labeled syntactic dependency trees and formalize the induction problem as unsupervised learning of tree-structured hidden Markov models. Syntactic functions are used as additional observed variables in the model, influencing both transition and emission components. Such syntactic information can potentially lead to capturing more fine-grained and functional distinctions between words, which, in turn, may be desirable in many NLP applications. We evaluate the word representations on two tasks: named entity recognition and semantic frame identification. We observe improvements from exploiting syntactic function information in both cases, with results rivaling those of state-of-the-art representation learning methods. Additionally, we revisit the relationship between sequential and unlabeled-tree models and find that the advantage of the latter is not self-evident.

1 Introduction

Word representations have proven to be an indispensable source of features in many NLP systems, as they allow better generalization to unseen lexical cases (Koo et al., 2008; Turian et al., 2010; Titov and Klementiev, 2012; Passos et al., 2014; Belinkov et al., 2014). Roughly speaking, word representations allow us to capture semantically or otherwise similar lexical items, be it categorically (e.g. cluster ids) or in a vectorial way (e.g. word embeddings).
Although the methods for obtaining word representations are diverse, they normally share the well-known distributional hypothesis (Harris, 1954), according to which similarity is established based on occurrence in similar contexts. However, word representation methods frequently differ in how they operationalize the definition of context.

Recently, it has been shown that representations using syntactic contexts can be superior to those learned from linear sequences in downstream tasks such as named entity recognition (Grave et al., 2013), dependency parsing (Bansal et al., 2014; Sagae and Gordon, 2009) and PP-attachment disambiguation (Belinkov et al., 2014). They have also been shown to perform well on datasets for intrinsic evaluation, and to capture a different type of semantic similarity than sequence-based representations (Levy and Goldberg, 2014; Šuster and van Noord, 2014; Padó and Lapata, 2007).

Unlike the recent research in word representation learning, focused heavily on word embeddings from the neural network tradition (Collobert and Weston, 2008; Mikolov et al., 2013a; Pennington et al., 2014), our work falls into the framework of hidden Markov models (HMMs), drawing on the work of Grave et al. (2013) and Huang et al. (2014). An attractive property of HMMs is their ability to provide context-sensitive representations, so the same word in two different sentential contexts can be given distinct representations. In this way, we account for various senses of a word.[1] However, this ability requires inference, which is expensive compared to a simple look-up, so we explore in our experiments word representations that are originally obtained in a context-sensitive way, but are then available for look-up as static representations. Our method includes two types of observed variables: words and syntactic functions.
This allows us to address a drawback of learning word representations from unlabeled dependency trees in the context of HMMs (§ 2).

[1] The handling of polysemy and homonymy typically requires extending a model in other frameworks, cf. Huang et al. (2012), Tian et al. (2014) and Neelakantan et al. (2014).

[Figure 1: Hidden Markov tree model with syntactic functions, r, as an additional observed layer, illustrated on the sentence "The magic happens beneath oak trees" with the dependency labels NMOD, SBJ, ROOT, LOC, NMOD, PMOD.]

The motivation for including syntactic functions comes from the intuition that they act as proxies for semantic roles. The current research practice is to either discard this type of information (so context words are determined on the syntactic structure alone (Grave et al., 2013)), or to include it in a preprocessing step, i.e. by attaching syntactic labels to words, as in Levy and Goldberg (2014).

We evaluate the word representations in two structured prediction tasks, named entity recognition (NER) and semantic frame identification. As our extension builds upon sequential and unlabeled-tree HMMs, we also revisit the basic difference between the two, but are unable to entirely corroborate the alleged advantage of syntactic context for word representations in the NER task.

2 Why syntactic functions

A word can typically occur in distinct syntactic functions. Since these account for words in different semantic roles (Bender, 2013; Levin, 1993), incorporating the syntactic function between the word and its parent could give us more precise representations. For example, in "Carla bought the computer", the subject and the object represent two different semantic roles, namely the buyer and the goods, respectively. Along similar lines, Padó and Lapata (2007), Šuster and van Noord (2014) and Grave et al.
(2013) argue that it is inaccurate to treat all context words as equal contributors to a word's meaning.

In HMM learning, the parameters obtained from training on unlabeled syntactic structure encode the probabilistic relationship between the hidden states of parent and child, and that between the hidden state and the word. The tree structure thus only defines the word's context, but is oblivious of the relationship between the words. For example, Grave et al. (2013) acknowledge precisely this limitation of their unlabeled-tree representations by giving as an example the hidden state of a verb, which cannot discriminate between left (e.g. subject) and right (e.g. object) neighbors because of shared transition parameters. This adversely affects the accuracy of their super-sense tagger for English. Similarly, Šuster and van Noord (2014) show that filtering dependency instances based on syntactic functions can positively affect the quality of the obtained Brown word clusters when measured in a wordnet similarity task.

3 A tree model with syntactic functions

We represent a sentence as a tuple of K words, w = (w_1, ..., w_K), where each w_k ∈ {1, ..., |V|} is an integer representing a word in the vocabulary V. The goal is to infer a tuple of K states c = (c_1, ..., c_K), where each c_k ∈ {1, ..., N} is an integer representing a semantic class of w_k, and N is the number of states, which needs to be set prior to training. Another possibility is to let w_k's representation be a probability distribution over the N states; in this case, we denote w_k's representation as u_k ∈ R^N. As usual in Markovian models, the generation of the sentence can be decomposed into the generation of classes (transitions) and the generation of words (emissions). The process is defined on a tree, in which a node c_k is generated by its single parent c_π(k), where π : {1, ..., K} → {0, ...
, K}, with 0 representing the root of the tree (the only node not emitting a word). We denote a syntactic function as r ∈ {r_1, ..., r_S}, where S is the total number of syntactic function types produced by the syntactic parser. We encode the syntactic function at position k as r_k ≜ r_{w_k → w_π(k)}, i.e. the dependency relation between w_k and its parent.

We would like the variable r to modulate the transition and emission processes. We achieve this by drawing on the Input-Output HMM architecture of Bengio and Frasconi (1996), who introduce a sequential model in which an additional sequence of observations, called the input, becomes part of the model, and the model is used as a conditional predictor. The authors describe the application of their model in speech processing, where the goal is to obtain an accurate predictor of the output phoneme layer from the input acoustic layer. Our focus is, in contrast, on representation learning (the hidden layer) rather than prediction (the output layer). Also, we adapt their sequential topology to trees.

The probability distribution of words and semantic classes is conditional on the syntactic functions and is factorized as

    p(w, c | r) = ∏_{k=1}^{K} p(w_k | c_k, r_k) p(c_k | c_π(k), r_k),    (3.1)

where r_k encodes additional information about w_k, in our case the syntactic function of w_k to its parent. This is represented graphically in fig. 1. The parameters of the model are stored in column-stochastic transition and emission matrices:[2]

    T, where T_ijl = p(c_k = i | c_π(k) = j, r_k = l)
    O, where O_ijl = p(w_k = i | c_k = j, r_k = l)

The number of parameters required for representing the transitions is O(N²S), and for representing the emissions O(N|V|S). Our model satisfies the single-parent constraint and can be applied to proper trees only.
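As a concrete illustration, the factorization in eq. (3.1) with 3-D parameter arrays can be sketched as follows. The array sizes, the toy tree, and the convention of fixing the root's state to 0 are assumptions of this sketch, not details from the paper.

```python
import numpy as np

# Hypothetical sizes: N hidden states, V vocabulary items, S syntactic functions.
N, V, S = 4, 10, 3
rng = np.random.default_rng(0)

# Column-stochastic 3-D parameter arrays, as in § 3:
# T[i, j, l] = p(c_k = i | c_parent = j, r_k = l); columns over i sum to 1.
# O[w, j, l] = p(w_k = w | c_k = j, r_k = l);      columns over w sum to 1.
T = rng.random((N, N, S)); T /= T.sum(axis=0, keepdims=True)
O = rng.random((V, N, S)); O /= O.sum(axis=0, keepdims=True)

def joint_prob(words, states, parents, rels):
    """Eq. (3.1): p(w, c | r) for one tree, given a state assignment.
    parents[k] is the index of k's parent; -1 marks the root, whose
    state we fix to a designated state 0 here (an assumption)."""
    p = 1.0
    for k, (w, c, r) in enumerate(zip(words, states, rels)):
        c_par = 0 if parents[k] == -1 else states[parents[k]]
        p *= T[c, c_par, r] * O[w, c, r]
    return p

# A toy 3-word tree: token 1 is the root's child, tokens 0 and 2 depend on it.
p = joint_prob(words=[2, 5, 7], states=[1, 0, 3],
               parents=[1, -1, 1], rels=[0, 1, 2])
assert 0.0 < p <= 1.0
```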
It is in principle possible to extend the base representation of the model by using approximate inference techniques that work on graphs (Murphy, 2012, p. 720), but we do not explore this possibility here.[3]

As opposed to an unlabeled-tree HMM, our extension can in fact be categorized as an inhomogeneous model, since the transition and emission probability distributions change as a function of the input, cf. Bengio (1999). Another comparison concerns the learning of long-term dependencies: since in the Input-Output architecture the transition probabilities can change as a function of the input at each k, they can be more deterministic (have lower entropy) than the transition probabilities of an HMM. Having the transition parameters closer to zero or one reduces the ambiguity of the next state and allows the context to flow more easily. A concrete graphical example is given in fig. 2.

[2] We are abusing the terminology slightly, as these are in fact three-dimensional arrays.
[3] This would be relevant for dependency annotation schemes which include secondary edges.

[Figure 2: The transition probabilities of a tree HMM with syntactic functions (synfunc) are sparser and have a lower entropy (5.34) than those of an unlabeled-tree HMM (tree; entropy of 5.6).]

4 Learning and inference

We train the model with the Expectation-Maximization (EM) algorithm (Baum, 1972) and use sum-product message passing for inference on trees (Pearl, 1988). The inference procedure (the estimation of hidden states) is the same as in an unlabeled-tree model, except that it is performed conditionally on r. The parameters T and O are estimated with maximum likelihood estimation. In the E-step, we obtain pseudo-counts from the existing parameters, as shown in fig. 3.
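The E-step accumulation of expected sufficient statistics (fig. 3) can be sketched as below. The per-token posteriors would come from sum-product message passing; here they are assumed to be given, and the toy inputs are invented for illustration.

```python
import numpy as np

# Hypothetical sizes; the posteriors below stand in for the output of
# sum-product message passing on the tree, which we assume precomputed.
N, V, S = 4, 10, 3
rng = np.random.default_rng(1)

def e_step_counts(sentences, N, V, S):
    """Accumulate the expected sufficient statistics of fig. 3.
    Each sentence is a list of tokens (w, r, pair_post, node_post), where
    pair_post[i, j] = p(c_k = i, c_parent = j | w, r) and
    node_post[j]    = p(c_k = j | w, r)."""
    tau = np.zeros((N, N, S))    # expected transition counts
    omega = np.zeros((V, N, S))  # expected emission counts
    for sent in sentences:
        for w, r, pair_post, node_post in sent:
            tau[:, :, r] += pair_post
            omega[w, :, r] += node_post
    return tau, omega

# One toy sentence with two tokens and consistent (made-up) posteriors.
def rand_post():
    pp = rng.random((N, N)); pp /= pp.sum()
    return pp, pp.sum(axis=1)

sent = [(2, 0, *rand_post()), (5, 1, *rand_post())]
tau, omega = e_step_counts([sent], N, V, S)
# Each token contributes one unit of expected mass to each statistic.
assert np.isclose(tau.sum(), 2.0) and np.isclose(omega.sum(), 2.0)
```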
The M-step then normalizes the transition pseudo-counts (and similarly for the emissions):

    T̂_ijl = τ_ijl / Σ_{i'} τ_{i'jl}    (4.1)

4.1 State splitting and merging

We explore the idea of introducing complexity gradually in order to alleviate the problem of EM finding a poor solution, which can be particularly severe when the search space is large (Petrov et al., 2006). The splitting procedure starts with a small number of states and splits the parameters of each state s into s_1 and s_2 by cloning s and slightly perturbing the copies. The model is retrained, and a new split round takes place. To allow splitting states to various degrees, Petrov et al. also merge back those split states which improve the likelihood the least. Although the merge step is done approximately and does not require new cycles of inference, we find that the extra running time does not justify the sporadic improvements we observe. We therefore settle on the splitting-only regime.

    τ_ijl = Σ_{n=1}^{N} Σ_{k=1}^{K_n} E[ 1{C_k^(n) = i, C_π(k)^(n) = j, R_k^(n) = l} | W^(n) = w^(n), R^(n) = r^(n) ]
    ω_ijl = Σ_{n=1}^{N} Σ_{k=1}^{K_n} E[ 1{W_k^(n) = i, C_k^(n) = j, R_k^(n) = l} | W^(n) = w^(n), R^(n) = r^(n) ]

Figure 3: Obtaining the pseudo-counts, or expected sufficient statistics, in the E-step.

4.2 Decoding for HMM-based models

Once a model is trained, we can search for the most probable states given the observed data by using max-product message passing (MAX-PRODUCT, a generalization of Viterbi) for efficient decoding on trees:

    ĉ = argmax_c p(C = c | W = w, R = r).

We have also tried posterior (or minimum-risk) decoding (Lember and Koloydenko, 2014; Ganchev et al., 2008), but without consistent improvements. The search for the best states can be avoided by taking the posterior state distribution u_k over the N hidden states (Nepal and Yates, 2014; Grave et al., 2014):

    u_c^(k) = E[ 1{C_k = c} | W = w, R = r ].
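For intuition, the posterior state distribution u_k can be computed by brute-force enumeration on a toy tree; sum-product message passing gives the same result efficiently. The array sizes, toy inputs, and the root-state-0 convention are assumptions of this sketch.

```python
import itertools
import numpy as np

# Hypothetical small model with column-stochastic 3-D parameter arrays (§ 3).
rng = np.random.default_rng(3)
N, V, S = 3, 6, 2
T = rng.random((N, N, S)); T /= T.sum(axis=0, keepdims=True)
O = rng.random((V, N, S)); O /= O.sum(axis=0, keepdims=True)

def posteriors(words, parents, rels):
    """u[k, c] = p(C_k = c | w, r), by brute-force enumeration over all
    N**K state assignments (exponential; message passing is the efficient
    alternative). The root's state is fixed to 0 here, an assumption."""
    K = len(words)
    u = np.zeros((K, N))
    for states in itertools.product(range(N), repeat=K):
        p = 1.0
        for k in range(K):
            c_par = 0 if parents[k] == -1 else states[parents[k]]
            p *= T[states[k], c_par, rels[k]] * O[words[k], states[k], rels[k]]
        for k in range(K):
            u[k, states[k]] += p
    return u / u.sum(axis=1, keepdims=True)

# Toy 3-word tree: token 1 is the root's child; tokens 0 and 2 depend on it.
u = posteriors(words=[1, 4, 2], parents=[1, -1, 1], rels=[0, 1, 0])
assert np.allclose(u.sum(axis=1), 1.0)  # each row is a distribution over states
```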
We call this vectorial representation POST-TOKEN. In both cases, inference is performed on a concrete sentence, thus providing a context-sensitive representation. We find in our experiments that POST-TOKEN consistently outperforms MAX-PRODUCT due to its ability to carry more information and uncertainty, which can then be exploited by the downstream task predictor.

One disadvantage of obtaining context-sensitive representations is the relatively costly inference. Inference and decoding are also sometimes not applicable, such as in information retrieval, where the entire sentence is usually not given (Huang et al., 2011). A trade-off between full context sensitivity and efficiency can be achieved by considering a static representation (POST-TYPE). It is obtained in a context-insensitive way (Huang et al., 2011; Grave et al., 2014) by averaging the (context-sensitive) posterior state distributions of all occurrences of a word type w̃ in a large corpus U:

    v(w̃) = (1 / Z_w̃) Σ_{i ∈ U : w_i = w̃} u^(i).    (4.2)

In fig. 4, we give a graphical example of the word representations learned with our model (§ 5.5), obtained either with POST-TOKEN or POST-TYPE. To visualize the representations, we apply multidimensional scaling.[4] The model clearly separates between management positions and parts of the body, and, interestingly, puts "head" closer to the management positions, which can be explained by the business and economic nature of the Bllip corpus. The words "chief" and "executive" are located together, yet isolated from the others, possibly because of their strong tendency to precede nouns. The arrow on the plot indicates the shift in meaning when a POST-TOKEN representation is obtained for "head" (part-of-body) within a sentence.
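The POST-TYPE averaging of eq. (4.2) amounts to collecting the token-level posteriors per word type and taking their mean. The token posteriors below are invented for illustration; in practice they would come from inference on the corpus.

```python
from collections import defaultdict
import numpy as np

def post_type(tokens, posteriors):
    """Eq. (4.2): average the context-sensitive POST-TOKEN posteriors of
    every occurrence of a word type into one static POST-TYPE vector.
    tokens[i] is the word type at corpus position i; posteriors[i] is its
    posterior state distribution u_i (assumed precomputed by inference)."""
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    for w, u in zip(tokens, posteriors):
        sums[w] = sums[w] + u
        counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}

# Toy corpus: "head" occurs twice with different token-level posteriors.
u1, u2 = np.array([0.9, 0.1]), np.array([0.3, 0.7])
v = post_type(["head", "oak", "head"], [u1, np.array([0.5, 0.5]), u2])
assert np.allclose(v["head"], [0.6, 0.4])  # the mean of u1 and u2
```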
[Figure 4: Representations obtained with our model with syntactic functions, plotted for the words "hand", "director", "leg", "foot", "head", "chairman", "face", "president", "shoulder", "manager", "eye", "chief", "executive". All are static POST-TYPE representations, except "head", which is obtained with POST-TOKEN from the concrete sentence "The head louse is one of the three types of lice that infest people".]

Despite the advantage of POST-TOKEN in accounting for word senses, we observe that POST-TYPE performs better in almost all experiments. A likely explanation is that averaging increases the generalizability of the representations. For the concrete tasks in which we apply the word representations, the increased robustness is simply more important than context sensitivity. Also, POST-TYPE might be less sensitive to parsing errors at test time.

[4] https://github.com/scikit-learn/scikit-learn

5 Empirical study

5.1 Parameters and setup

We observe faster convergence times with online EM, which updates the parameters more frequently. Specifically, we use mini-batch step-wise EM (Liang and Klein, 2009; Cappé and Moulines, 2009), and determine the hyper-parameters on a held-out dataset of 10,000 sentences so as to maximize the log-likelihood. We find that higher values for the step-wise reduction power α and the mini-batch size lead to better overall log-likelihood, but with a somewhat negative effect on convergence speed. We explore α ∈ {0.6, 0.7, 0.8, 0.9, 1} and mini-batch sizes of {1, 10, 100, 1000} sentences. We find that a couple of iterations over the entire dataset is sufficient to obtain good parameters, cf. Klein (2005).

Initialization. Since the EM algorithm in our setting only finds a local optimum of the log-likelihood, the initialization of the model parameters can have a major impact on the final outcome.
We initialize the emission matrices with Brown clusters by first assigning random values between 0 and 1 to the matrix elements, and then multiplying those elements that represent words in a cluster by a factor f ∈ {10, 100, 1K, 10K}. Finally, we normalize the matrices. This technique incorporates a strong bias towards the word–class emissions that exist (deterministically) in the Brown clusters. The transition parameters are simply set to random numbers sampled from the uniform distribution between 0 and 1, and then normalized.

Approximate inference. Following Grave et al. (2013) and Pal et al. (2006), we approximate the belief vectors during inference,[5] which speeds up learning and works as regularization. We use the k-best projection method, in which only the k largest coefficients (in our case k = N/8) are kept.

[5] In a bottom-up pass, a belief vector represents the local evidence by multiplying the messages received from the children of a node, as well as the emission probability at that node.

5.2 Data for obtaining word representations

English. We use the 43M-word Bllip corpus (Charniak et al., 2000) of WSJ texts, from which we remove the sentences of the PTB and those whose length is ≤ 4 or ≥ 40. We use the MST dependency parser (McDonald and Pereira, 2006) for English and build a projective, second-order model, trained on sections 2–21 of the Penn Treebank WSJ (PTB). Prior to that, the PTB was patched with NP bracketing rules (Vadas and Curran, 2007) and converted to dependencies with LTH (Johansson and Nugues, 2007). The parser achieves an unlabeled/labeled accuracy of 91.5/85.22 on section 23 of the PTB without retagging the POS. For POS-tagging the Bllip corpus and the evaluation datasets, we use the Citar tagger.[6] After parsing, we replace the words occurring less than 40 times with a special symbol to model OOV words. This results in a vocabulary size of 27,000 words.

Dutch.
We first produce a random sample of 2.5M sentences from the SoNaR corpus,[7] then follow the same preprocessing steps as for English. We parse the corpus with Alpino (van Noord, 2006), an HPSG parser with a maximum-entropy disambiguation component. In contrast with English, for which we use word forms, we keep here the root forms produced by the parser's lexical analyzer. The resulting vocabulary size is 25,000 words. The analyses produced by the parser represent multiple parents to facilitate the treatment of wh-clauses, coordination and passivization. Since our method expects proper trees, we convert the parser output to CoNLL format.[8]

5.3 Evaluation tasks

Named entity recognition. We use the standard CoNLL-2002 shared task dataset for Dutch and the CoNLL-2003 dataset for English. We also include the out-of-domain MUC-7 test set, preprocessed according to Turian et al. (2010). We refer the reader to Ratinov and Roth (2009) for a detailed description of the NER classification problem. Like Turian et al. (2010), we use the averaged structured perceptron (Collins, 2002) with Viterbi as the base for our NER system.[9] The classifier is trained for a fixed number of iterations, and uses these baseline features:

• w_k information: is-alphanumeric, all-digits, all-capitalized, is-capitalized, is-hyphenated;
• prefixes and suffixes of w_k;
• word window: w_k, w_k±1, w_k±2;
• capitalization pattern in the word window.

We construct N real-valued features for a word vector of dimensionality N, and a simple indicator feature for a categorical word representation.

[6] http://github.com/danieldk/citar
[7] http://lands.let.ru.nl/projects/SoNaR
[8] http://www.let.rug.nl/bplank/alpino2conll/
[9] http://github.com/LxMLS/lxmls-toolkit

Semantic frame identification.
Frame-semantic parsing is the task of identifying a) the semantic frames of predicates in a sentence (given target words evoking frames), and b) the frame arguments participating in these events (Fillmore, 1982; Das et al., 2014). Compared to NER, in which the classification decisions apply to a relatively small set of words, the problem of semantic frame identification concerns making predictions for a broader set of words (verbs, nouns, adjectives, sometimes prepositions).

[Figure 5: A parse with HINDERING and CAUSE_TO_MAKE_PROGRESS frames with their respective arguments.]

We use the Semafor parser (Das et al., 2014), consisting of two log-linear components trained with gradient-based techniques. The parser is trained and tested on the FrameNet 1.5 full-text annotations. Our test set consists of the same 23 documents as in Hermann et al. (2014). We investigate the effect of word representation features on the frame identification component. We measure Semafor's performance on gold-standard targets, and report the accuracy on exact matches, as well as on partial matches; the latter give partial credit to identified related frames. We use and modify the publicly available implementation at http://github.com/sammthomson/semafor.

Our baseline features for a target w_k include:

• w_k and w_π(k) (if the parent is a preposition, the grandparent is taken by collapsing the dependency),
• their lemmas and POS tags,
• the syntactic functions between:
  – w_k and its children,
  – w_k and w_π(k) (added by ourselves),
  – w_π(k) and its parent w_π(π(k)).
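The baseline feature template above can be sketched as a small extraction function. The token dictionary keys, feature-string format, and toy sentence are hypothetical, and the preposition-collapsing special case is omitted for brevity.

```python
# A minimal sketch of the baseline feature template for a target w_k,
# assuming a token is a dict with hypothetical keys: form, lemma, pos,
# head (parent index, -1 for the root) and rel (syntactic function).
def frame_features(tokens, k):
    t = tokens[k]
    feats = [f"w={t['form']}", f"lemma={t['lemma']}", f"pos={t['pos']}"]
    p = t["head"]
    if p >= 0:
        par = tokens[p]
        feats += [f"parent={par['form']}", f"parent_lemma={par['lemma']}",
                  f"rel_to_parent={t['rel']}"]      # w_k -- w_pi(k)
        if par["head"] >= 0:                        # w_pi(k) -- w_pi(pi(k))
            feats.append(f"parent_rel={par['rel']}")
    feats += [f"child_rel={c['rel']}"               # w_k -- its children
              for c in tokens if c["head"] == k]
    return feats

# Toy parse of "Carla bought computer" with made-up annotations.
toks = [
    {"form": "Carla", "lemma": "carla", "pos": "NNP", "head": 1, "rel": "SBJ"},
    {"form": "bought", "lemma": "buy", "pos": "VBD", "head": -1, "rel": "ROOT"},
    {"form": "computer", "lemma": "computer", "pos": "NN", "head": 1, "rel": "OBJ"},
]
assert "rel_to_parent=SBJ" in frame_features(toks, 0)
assert "child_rel=OBJ" in frame_features(toks, 1)
```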
5.4 Baseline word representations

We test our model, which we call SYNFUNC-HMM, against the following baselines:

• BASELINE: no word representation features
• HMM: a sequential HMM
• TREE-HMM: a tree HMM

We induce other representations for comparison:

• BROWN: Brown clusters
• DEP-BROWN: dependency Brown clusters
• SKIP-GRAM: Skip-Gram word embeddings

5.5 Preparing word representations

Brown clusters. Brown clusters (Brown et al., 1992) are known to be effective and robust when compared, for example, to word embeddings (Bansal et al., 2014; Passos et al., 2014; Nepal and Yates, 2014; Qu et al., 2015). The method can be seen as a special case of an HMM in which word emissions are deterministic, i.e. a word belongs to at most one semantic class. Recently, an extension has been proposed on the basis of a dependency language model (Šuster and van Noord, 2014). We use the publicly available implementations.[10]

Following other work on English (Koo et al., 2008; Nepal and Yates, 2014), we add both coarse- and fine-grained clusters as features by using prefixes of length 4, 6, 10 and 20 in addition to the complete binary tree path. For Dutch, coarser-grained clusters do not yield any improvement. Brown features are included in a window around the target word, just like the NER word features. When adding cluster features to the frame-semantic parser, we transform the cluster identifiers to one-hot vectors, which gives a small improvement over the use of indicator features.

HMM-based models. The N-dimensional representations obtained from HMMs and their variants are included as N distinct continuous features. In the NER task, word representations are included at w_k and w_k+1 for Dutch, and at w_k for English, which we determined on the development set.
We investigate state space sizes of 64, 128 and 256, and finally choose N = 128 as a reasonable trade-off between training time and quality. We use the same dimensionality for the other word representation models in this paper.

We observe that by constraining SYNFUNC-HMM to use only the k most frequent syntactic functions and to treat the remaining ones as a single special syntactic function, we obtain better results in our evaluation tasks. This is because, for a model with all S syntactic functions produced by the parser, there is less learning evidence for the more infrequent syntactic functions. We explore the effect of keeping up to the five most frequent syntactic functions, ignoring functional ones such as punctuation and determiner.[11] The final selection is shown in table 2.

[10] http://github.com/percyliang/brown-cluster, http://github.com/rug-compling/dep-brown-cluster

Table 1: NER results (precision, recall and F-score) on the English and Dutch test sets, and F-scores on the MUC test set. The score increase reported in parentheses is in comparison to TREE-HMM. F-1 type is the F-score measured per word type, and F-1 unlab is the F-score measured per word type, ignoring labels.

                   English                Dutch                  MUC test set
Model              P      R      F-1      P      R      F-1      F-1           F-1 type      F-1 unlab
BASELINE           80.12  77.30  78.69    75.36  70.92  73.07    65.44         87.04         96.25
HMM                81.49  78.90  80.17    77.61  73.97  75.74    70.20         87.66         96.50
TREE-HMM           80.49  78.10  79.28    77.41  73.48  75.40    65.67         86.99         96.53
SYNFUNC-HMM        80.65  78.90  79.76 (+.48)  78.54  75.23  76.85 (+1.45)  66.49 (+.82)  86.93 (-.06)  96.69 (+.16)
BROWN              80.15  77.28  78.70    77.88  71.73  74.68    68.85         87.72         96.67
DEP-BROWN          78.80  75.73  77.23    77.50  73.66  75.53    68.31         87.44         96.47
SKIP-GRAM          80.80  78.98  79.88    76.02  71.28  73.57    72.42         88.94         96.69
Table 2: Syntactic functions used in SYNFUNC-HMM for English (produced by the MST parser) and Dutch (produced by Alpino).

English                          Dutch
nmod (nominal modifier)          mod (modifier)
pmod (prepositional modifier)    su (subject)
sub (subject)                    obj1 (direct object)
                                 cnj (conjunction)
                                 mwp (multiword unit)

For the NER experiments, the representations from all HMM models are obtained with the three different "decoding" methods (§ 4.2). Since POST-TYPE performs best overall, we only report the results for this method in the evaluation.[12]

Word embeddings. We use the Skip-Gram model presented in Mikolov et al. (2013a) (https://code.google.com/p/word2vec/), trained with negative sampling (Mikolov et al., 2013b). The training seeks to maximize the dot product between word–context pairs encountered in the training corpus and to minimize the dot product between those pairs in which the context word is randomly sampled. We set both the number of negative examples and the size of the context window to 5, the down-sampling threshold to 1e-4, and the number of iterations to 15.

[11] We define the list of function-marker syntactic functions following Goldberg and Orwant (2013).
[12] While exploring the constraint on the number of syntactic functions, we do find that POST-TOKEN outperforms POST-TYPE for some sets of syntactic functions, but not in the final, best-performing selection.

5.6 NER results

The results for all test sets are shown in table 1. For English, all HMM-based models improve over the baseline, with the sequential HMM achieving the highest F-score. Our SYNFUNC-HMM performs on a par with SKIP-GRAM. It outperforms the unlabeled-tree model, indicating that the added observations are useful and correctly incorporated. Brown clusters do not exceed the BASELINE score.[13]
Testing for significance with a bootstrap method (Søgaard et al., 2014), we find that only HMM improves significantly (at p < 0.01 on macro-F1) over BASELINE, while SKIP-GRAM and SYNFUNC-HMM show significant improvements only for the location entity type.

The general trend for Dutch is somewhat different. Most notably, all word representations contribute much more effectively to the overall classification performance than for English. The best-scoring model, our SYNFUNC-HMM, improves over the baseline significantly, by as much as about 3.8 points. Part of the reason SYNFUNC-HMM works so well in this case is that it can make use of the informative "mwp" syntactic function between the parts of a multiword unit. As for English, the unlabeled-tree HMM performs slightly worse than the sequential HMM. The cluster features are more valuable here than in English, and we also observe a 0.7-point advantage from using dependency Brown clusters over the standard, bigram Brown clusters. The SKIP-GRAM model does not perform as well as in English, which might indicate that its hyper-parameters need fine-tuning specific to Dutch.

On the out-of-domain MUC dataset, tree-based representations appear to perform poorly, whereas the highest score is achieved by the SKIP-GRAM method. Unfortunately, it is difficult to generalize from these F-1 results alone. Concretely, the dataset contains 3,518 named entities, and the SKIP-GRAM method makes 258 more correct predictions than TREE-HMM. However, because the MUC dataset covers the narrow topic of missile-launch scenarios, the system gets badly penalized if a mistake is made repeatedly for a certain named entity.

[13] However, after additional experiments we observe that the cluster features do improve over the baseline score when the number of clusters is increased.
For example, the entity "NASA" alone occurs 103 times, most of which are wrongly classified by the TREE-HMM system, but correctly by SKIP-GRAM. The overall performance may therefore hinge on a limited number of frequently occurring entities. A workaround is to evaluate per entity type: calculate the F-score for each entity, then average over all entity types. The results for this evaluation scenario are reported as F-1 type. SKIP-GRAM still performs best, but the difference to the other models is smaller. Finally, we also report F-1 unlab, calculated like F-1 type but ignoring the actual entity label. So, if a named-entity token is recognized as such, we count it as a correct prediction regardless of the entity label type, similarly to Ratinov and Roth (2009). Since SYNFUNC-HMM performs better here, we can conclude that it is more effective at identifying entities than at labeling them.

The fact that we observe different tendencies for English and Dutch can be attributed to an interplay of factors, such as language differences (Bender, 2011), differently-performing syntactic parsers, and differences specific to the evaluation datasets. We briefly discuss the first possibility. It is clear from table 1 that all syntax-based models (DEP-BROWN, TREE-HMM, SYNFUNC-HMM) generally benefit Dutch more than English. We hypothesize that since word order in Dutch is generally less fixed than in English,[14] a sequence-based model for Dutch cannot capture selectional preferences as successfully, i.e. there is more interchanging of semantically diverse words within a small word window. This then makes the difference in performance between sequential and tree models more apparent for Dutch.

5.7 Semantic frame identification results

The results are shown in table 3.
The best score is obtained by the SKIP-GRAM embeddings; however, the difference to the other models outperforming the baseline is small. For example, SKIP-GRAM correctly identifies only two cases more than DEP-BROWN, out of 3,681 correctly disambiguated frames. The SYNFUNC-HMM model outperforms all other HMM models in this task. The differences are larger when scoring partial matches.[15]

Model          Exact           Partial
BASELINE       82.70           90.44
HMM            82.20           90.20
TREE-HMM       82.89           90.59
SYNFUNC-HMM    82.95 (+0.06)   90.80 (+0.21)
BROWN          83.15           90.74
DEP-BROWN      83.15           90.76
SKIP-GRAM      83.19           90.91

Table 3: Frame identification accuracy. Score increase in parentheses is relative to TREE-HMM.

5.8 Further discussion

We can conclude from the NER experiments that unlabeled syntactic trees do not in general provide a better structure for defining the contexts than plain sequences. The only exception is the case of dependency Brown clustering for Dutch. Comparing our results to those of Grave et al. (2013), we therefore cannot confirm the same advantage when using unlabeled-tree representations. In semantic frame identification, however, the unlabeled-tree representations do compare more favorably to sequential representations.

Our extension with syntactic functions outperforms the baseline and the other HMM-based representations in practically all experiments. It also outperforms all other word representations in Dutch NER. The advantage comes from discriminating between the types of contexts, for example between a modifier and a subject, which is impossible in sequential or unlabeled-tree HMM architectures. The results for English are comparable to those of the state-of-the-art representation methods.
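The paired permutation test used for the significance comparisons in this section can be sketched as a generic two-sided sign-flipping test over per-item score differences; the details of the actual procedure may differ, and the names below are ours:

```python
import random

def paired_permutation(scores_a, scores_b, n_perm=10000, seed=0):
    # Two-sided p-value: randomly flip the sign of each per-item score
    # difference and count how often the permuted total difference is
    # at least as extreme as the observed one.
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_perm):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            extreme += 1
    return extreme / n_perm
```

Under the null hypothesis that the two systems are interchangeable, each pairing is equally likely with either sign, which is exactly what the random flips simulate.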
[15] On exact matches, only DEP-BROWN and BROWN significantly outperform the baseline at p < 0.05. On partial matches, DEP-BROWN, BROWN, SKIP-GRAM and SYNFUNC-HMM all outperform the baseline significantly. SYNFUNC-HMM performs significantly better than TREE-HMM on partial matches, whereas the difference between SKIP-GRAM and SYNFUNC-HMM is not significant. The significance tests were run using paired permutation.

6 Related work

HMMs have been used successfully for learning word representations before; see Huang et al. (2014) for an overview, with an emphasis on investigating domain adaptability. Models with more complex architecture have been proposed, such as a factorial HMM (Nepal and Yates, 2014), trained using approximate variational inference and applied to POS tagging and chunking. Recently, semantic compositionality of HMM-based representations based on the framework of distributional semantics has been investigated by Grave et al. (2014).

There is a long tradition of unsupervised training of HMMs for POS tagging (Kupiec, 1992; Merialdo, 1994), with more recent work on incorporating bias by favoring sparse posterior distributions within the posterior regularization framework (Graça et al., 2007), and, for example, on auto-supervised refinement of HMMs (Garrette and Baldridge, 2012). It would be interesting to see how well these techniques could be applied to word representation learning methods like ours.

The extension of HMMs to dependency trees for the purpose of word representation learning was first proposed by Grave et al. (2013). Although our baseline HMM methods, HMM and TREE-HMM, conceptually follow the models of Grave et al., there are still several practical differences.
One source of differences is in the precise steps taken when performing Brown initialization, state splitting, and also the approximation of belief vectors during inference. Another source involves the evaluation setting. Their NER classifier uses only a single feature, and their inclusion of Brown clusters does not make use of the clustering hierarchy. In this respect, our experimental setting is more similar to Turian et al. (2010). Another practical difference is that Grave et al. concatenate words with POS tags to construct the input text, whereas we use tokens (English) or word roots (Dutch).

The incorporation of word representations into semantic frame identification has been explored by Hermann et al. (2014). They project generic word embeddings for context words to a low-dimensional representation, which also learns an embedding for each frame label. The method selects the frame closest to the low-dimensional representation obtained by mapping the input embeddings. Their approach differs from ours in that they induce new representations tied to a specific application, whereas we aim to obtain linguistically enhanced word representations that can subsequently be used in a variety of tasks. In our case, the word representations are thus included as additional features in the log-linear model. The inclusion accounts for syntactic functions between the target and its context words. Although Hermann et al. also use syntactic functions, they are used to position the general word embeddings within a single input context embedding. Unfortunately, we are unable to directly compare our results with theirs, as their parser implementation is proprietary. The accuracy of our baseline system on the test set is 0.27 percent lower in the exact matching regime and 0.07 lower in the partial matching regime compared to the baseline implementation (Das et al., 2014) they used.[16]
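The selection step of Hermann et al. can be caricatured as a nearest-neighbour lookup in the shared low-dimensional space. The cosine choice and all names below are our simplification for illustration, not their actual model:

```python
import math

def nearest_frame(projected_context, frame_embeddings):
    # Return the frame label whose learned embedding lies closest
    # (by cosine similarity) to the projected context representation.
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm
    return max(frame_embeddings,
               key=lambda label: cosine(projected_context, frame_embeddings[label]))
```

The contrast with our approach is visible even at this level: here the representation space itself is fit to the frame labels, whereas our representations are task-independent features fed to a log-linear classifier.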
[16] Among other implementation differences, they introduce a variable capturing lexico-semantic relations from WordNet.

The topic of context type (syntactic vs. linear) has been abundantly treated in distributional semantics (Lin, 1998; Baroni and Lenci, 2010; van de Cruys, 2010) and elsewhere (Boyd-Graber and Blei, 2008; Tjong Kim Sang and Hofmann, 2009).

7 Conclusion

We have proposed an extension of a tree HMM with syntactic functions. The obtained word representations achieve better performance than those from the unlabeled-tree model. Our results also show that simply preferring an unlabeled-tree model over a sequential model does not always lead to an improvement. An important direction for future work is to investigate how discriminating between context types can lead to more accurate models in other frameworks. The code for obtaining HMM-based representations described in this paper is freely available at http://github.com/rug-compling/hmm-reps.

Acknowledgements

We would like to thank Edouard Grave and Sam Thomson for valuable discussion and suggestions.

References

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL.

Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Leonard E. Baum. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. In Inequalities.

Yonatan Belinkov, Tao Lei, Regina Barzilay, and Amir Globerson. 2014. Exploring compositional architectures and word vector representations for prepositional phrase attachment. Transactions of the Association for Computational Linguistics, 2:561–572.

Emily M. Bender. 2011. On Achieving and Evaluating Language-Independence in NLP. Linguistic Issues in Language Technology, 6(3):1–26.

Emily M. Bender.
2013. Linguistic Fundamentals for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Yoshua Bengio and Paolo Frasconi. 1996. Input-output HMMs for sequence processing. IEEE Transactions on Neural Networks, 7(5).

Yoshua Bengio. 1999. Markovian models for sequential data. Neural Computing Surveys, 2:129–162.

Jordan Boyd-Graber and David M. Blei. 2008. Syntactic topic models. In NIPS.

Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Olivier Cappé and Eric Moulines. 2009. Online EM algorithm for latent data models. Journal of the Royal Statistical Society.

Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, and Mark Johnson. 2000. BLLIP 1987–1989 WSJ Corpus Release 1, LDC No. LDC2000T43. Linguistic Data Consortium.

Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In EMNLP.

Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In ICML.

Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.

Charles Fillmore. 1982. Frame semantics. Linguistics in the Morning Calm, pages 111–137.

Kuzman Ganchev, João V. Graça, and Ben Taskar. 2008. Better alignments = better translations? In ACL-HLT.

Dan Garrette and Jason Baldridge. 2012. Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. In EMNLP.

Yoav Goldberg and Jon Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of English books. In *SEM.

Edouard Grave, Guillaume Obozinski, and Francis Bach. 2013.
Hidden Markov tree models for semantic class induction. In CoNLL.

Edouard Grave, Guillaume Obozinski, and Francis Bach. 2014. A Markovian approach to distributional semantics with application to semantic compositionality. In COLING.

João Graça, Kuzman Ganchev, and Ben Taskar. 2007. Expectation maximization and posterior constraints. In NIPS.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In ACL.

Fei Huang, Alexander Yates, Arun Ahuja, and Doug Downey. 2011. Language models as representations for weakly-supervised NLP tasks. In CoNLL.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In ACL.

Fei Huang, Arun Ahuja, Doug Downey, Yi Yang, Yuhong Guo, and Alexander Yates. 2014. Learning representations for weakly supervised natural language processing tasks. Computational Linguistics, 40(1):85–120.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In NODALIDA, pages 105–112, Tartu, Estonia.

Dan Klein. 2005. The unsupervised learning of natural language structure. Ph.D. thesis, Stanford University.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In ACL-HLT.

Julian Kupiec. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language, 6(3):225–242.

Jüri Lember and Alexey A. Koloydenko. 2014. Bridging Viterbi and posterior decoding: A generalized risk approach to hidden path inference based on Hidden Markov models. Journal of Machine Learning Research, 15(1):1–58.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.
Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In ACL.

Percy Liang and Dan Klein. 2009. Online EM for unsupervised models. In HLT-NAACL.

Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In COLING.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.

Bernard Merialdo. 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–171.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop Papers.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS.

Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In EMNLP.

Anjan Nepal and Alexander Yates. 2014. Factorial Hidden Markov models for learning representations of natural language. In ICLR.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33:161–199.

Chris Pal, Charles Sutton, and Andrew McCallum. 2006. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In ICASSP.

Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. In CoNLL.

Judea Pearl. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.
Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In ACL.

Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider, and Timothy Baldwin. 2015. Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks. arXiv preprint arXiv:1504.05319.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.

Kenji Sagae and Andrew S. Gordon. 2009. Clustering words by syntactic similarity improves dependency parsing of predicate-argument structures. In IWPT.

Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Héctor Martínez Alonso. 2014. What's in a p-value in NLP? In CoNLL.

Simon Šuster and Gertjan van Noord. 2014. From neighborhood to parenthood: the advantages of dependency representation over bigrams in Brown clustering. In COLING.

Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In COLING.

Ivan Titov and Alexandre Klementiev. 2012. A Bayesian approach to unsupervised semantic role induction. In EACL.

Erik Tjong Kim Sang and Katja Hofmann. 2009. Lexical patterns or dependency patterns: Which is better for hypernym extraction? In CoNLL.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In ACL.

David Vadas and James R. Curran. 2007. Adding Noun Phrase Structure to the Penn Treebank. In ACL.

Tim van de Cruys. 2010. Mining for Meaning: The Extraction of Lexico-semantic Knowledge from Text. Ph.D. thesis, University of Groningen.

Gertjan van Noord. 2006. At Last Parsing Is Now Operational. In TALN.