Geometry of Polysemy


Authors: Jiaqi Mu, Suma Bhat, Pramod Viswanath

Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{jiaqimu2,spbhat2,pramodv}@illinois.edu

ABSTRACT

Vector representations of words have heralded a transformational approach to classical problems in NLP; the most popular example is word2vec. However, a single vector does not suffice to model the polysemous nature of many (frequent) words, i.e., words with multiple meanings. In this paper, we propose a three-fold approach for unsupervised polysemy modeling: (a) context representations, (b) sense induction and disambiguation and (c) lexeme (as a word and sense pair) representations. A key feature of our work is the finding that a sentence containing a target word is well represented by a low rank subspace, instead of a point in a vector space. We then show that the subspaces associated with a particular sense of the target word tend to intersect over a line (one-dimensional subspace), which we use to disambiguate senses using a clustering algorithm that harnesses the Grassmannian geometry of the representations. The disambiguation algorithm, which we call K-Grassmeans, leads to a procedure to label the different senses of the target word in the corpus, yielding lexeme vector representations, all in an unsupervised manner starting from a large (Wikipedia) corpus in English. Apart from several prototypical target (word, sense) examples and a host of empirical studies to intuit and justify the various geometric representations, we validate our algorithms on standard sense induction and disambiguation datasets and present new state-of-the-art results.
1 INTRODUCTION

Distributed representations are embeddings of words in a real vector space, achieved via an appropriate function that models the interaction between neighboring words in sentences (e.g., neural networks [6, 23, 13], log-bilinear models [25, 22], co-occurrence statistics [28, 19]). Such an approach has been strikingly successful in capturing the syntactic and semantic similarity between words (and pairs of words), via simple linear algebraic relations between their corresponding vector representations. On the other hand, the polysemous nature of words, i.e., the phenomenon of the same surface form representing multiple senses, is a central feature of the creative process embodying all natural languages. For example, a large, tall machine used for moving heavy objects and a tall, long-legged, long-necked bird both share the same surface form “crane”. A vast majority of words, especially frequent ones, are polysemous, with each word taking on anywhere from two to a dozen different senses in many natural languages. For instance, WordNet collects 26,896 polysemous English words with an average of 4.77 senses each [24]. Naturally, a single vector embedding does not appropriately represent a polysemous word. There are currently two approaches to address the polysemy issue:

• Sense-specific representation learning [8, 34], usually aided by hand-crafted lexical resources such as WordNet [24];

• Unsupervised sense induction and sense/lexeme representation learning by inferring the senses directly from text [13, 26, 20, 2].

Since hand-crafted lexical resources sometimes do not reflect the actual meaning of a target word in a given context [37] and, more importantly, such resources are lacking in many languages (and their creation draws upon intensive expert human resources), we focus on the second approach in this paper; such an approach is inherently scalable and potentially plausible with the right set of ideas.
Indeed, a human expects the contexts to cue in on the particular sense of a specific word, and successful unsupervised sense representation and sense extraction algorithms would represent progress in the broader area of representation of natural language. Such are the goals of this work. Firth's (1957) hypothesis, that a word is characterized by the company it keeps [11], has motivated the development of single embeddings for words, but also suggests that multiple senses for a target word could be inferred from its contexts (neighboring words within the sentence). This task is naturally broken into three related questions: (a) how to represent contexts (neighboring words of the target word); (b) how to induce word senses (partition instances of contexts into groups where the target word is used in the same sense within each group); and (c) how to represent lexemes (word and sense pairs) by vectors.

Existing works address these questions by exploring the latent structure of contexts. In an inspired work, [2] hypothesizes that the global word representation is a linear combination of its sense representations, models the contexts by a finite number of discourse atoms, and recovers the sense representations via sparse coding of all the vectors of the vocabulary (a global fit). Other works perform a local context-specific sense induction: [20] introduces a sense-based language model to disambiguate word senses and to learn lexeme representations by incorporating the Chinese restaurant process; [31] and [13] label the word senses by clustering the contexts based on the average of the context word embeddings and learn lexeme representations using the labeled corpus. [26] retains the representation of contexts by the average of the word vectors, but improves on the previous approach by jointly learning the lexeme vectors and the cluster centroids.
Grassmannian Model: We depart from the linear latent models in these prior works by presenting a nonlinear (Grassmannian) geometric property of contexts. We empirically observe and hypothesize that the context word representations surrounding a target word reside roughly in a low dimensional subspace. Under this hypothesis, a specific sense representation for a target word should reside in all the subspaces of the contexts where the word means this sense. Note that these subspaces need not cluster at all: a word such as “launch” in the sense of “beginning or initiating a new endeavor” could be used in a large variety of contexts. Nevertheless, our hypothesis that large semantic units (such as sentences) reside in low dimensional subspaces implies that the subspaces of all contexts where the target word shares the same meaning should intersect non-trivially. This further implies that there exists a direction (one-dimensional subspace) that is very close to all subspaces, and we treat such an intersection vector as the representation of a group of subspaces. Following this intuition, we propose a three-fold approach to deal with the three central questions posed above.

• Context Representation: we define the context for a target word to be the set of W non-functional words to the left and W to the right of the target word (W ≈ 10 in our experiments), including the target word itself, and represent it by a low-dimensional subspace spanned by its context word representations;

• Sense Induction and Disambiguation: we induce word senses from their contexts by partitioning multiple context instances into groups, where the target word has the same sense within each group. Each group is associated with a representation, the intersection direction of the group, found via K-Grassmeans, a novel clustering method that harnesses the geometry of subspaces.
Finally, we disambiguate word senses for new context instances using the respective group representations;

• Lexeme Representation: the lexeme representations can be obtained by running an off-the-shelf word embedding algorithm on a labeled corpus. We label the corpus through hard decisions (involving erasure labels) and soft decisions (probabilistic labels), motivated by analogous successful approaches to decoding of turbo and LDPC codes in wireless communication.

Experiments: The lexical aspect of our algorithm (i.e., senses can be induced and disambiguated individually for each word) as well as the novel geometry (subspace intersection vectors) jointly allow us to capture subtle shades of senses. For instance, in “Can you hear me? You're on the air. One of the great moments of live television, isn't it?”, our representation is able to capture the occurrence of “air” to mean “live event on camera”. In contrast, with a global fit such as that in [2], the senses are inherently limited in the number and type of “discourse atoms” that can be captured. As a quantitative demonstration of the latent geometry captured by our methods, we evaluate the proposed induction algorithm on standard Word Sense Induction (WSI) tasks. Our algorithm outperforms the state of the art on two datasets: (a) SemEval-2010 Task 14 [21], whose word senses are obtained from OntoNotes [29]; and (b) a custom-built dataset built by repurposing the polysemous dataset of [2]. In terms of lexeme vector embeddings, our representations have evaluations comparable to the state of the art on standard tasks (the word similarity task of SCWS [13]) and significantly better on a subset of the SCWS dataset which focuses on polysemous target words and on the “police lineup” task of [2].
We summarize our contributions below:

• We observe a new geometric property of context words and use it to represent contexts by the low-dimensional subspaces spanned by their word vector representations;

• We use the geometry of the subspaces in conjunction with unsupervised clustering methods to propose a sense induction and disambiguation algorithm;

• We introduce a new dataset for the WSI task which includes 50 polysemous words and 6,567 contexts. The word senses in this dataset are coarser and more human-interpretable than those in previous WSI datasets.

2 CONTEXT REPRESENTATION

Contexts refer to entire sentences or (long enough) consecutive blocks of words in sentences surrounding a target word. Efficient distributed vector representations for sentences and paragraphs are active topics of research in the literature ([18, 35, 16]), with much emphasis on appropriately relating the individual word embeddings with those of the sentences (and paragraphs) they reside in. The scenario of contexts studied here is similar in the sense that they constitute long semantic units similar to sentences, but different in that we are considering semantic units that all have a common target word residing inside them. Instead of a straightforward application of the existing literature on sentence (and paragraph) vector embeddings to our setting, we deviate and propose a non-vector-space representation; such a representation is central to the results of this paper and is best motivated by the following simple experiment. Given a random word and a set of its contexts (culled from the set of all sentences where the target word appears), we use principal component analysis (PCA) to project the context word embeddings for every context into an N-dimensional subspace and measure the low dimensional nature of context word embeddings.
We randomly sampled 500 words whose occurrence (frequency) is larger than 10,000, extracted their contexts from Wikipedia, and plotted the histogram of the variance ratios captured by rank-N PCA in Figure 1(a) for N = 3, 4, 5. We make the following observations: even rank-3 PCA captures at least 45% of the energy (i.e., variance ratio) of the context word representations, and rank-4 PCA can capture at least half of the energy almost surely. As a comparison, we note that the average number of context words is roughly 21, and a rank-4 PCA over a random collection of 21 words would be expected to capture only 20% of the energy (this calculation is justified because word vectors have been observed to possess a spatial isotropy property [1]). All word vectors were trained on the Wikipedia corpus with dimension d = 300 using the skip-gram program of word2vec [22]. This experiment immediately suggests the low-dimensional nature of contexts, and that the contexts be represented in the space of subspaces, i.e., the Grassmannian manifold: we represent a context c (as a multiset of words) by a point on the Grassmannian manifold, namely the subspace (denoted by S(c)) spanned by its top N principal components (denoted by {u_n(c)}_{n=1}^N), i.e.,

$$S(c) = \left\{ \sum_{n=1}^{N} \alpha_n u_n(c) : \alpha_n \in \mathbb{R} \right\}.$$

A detailed algorithm chart for context representations is provided in Appendix A, for completeness.

[Figure 1: An experiment to study the linear algebraic structure of word senses; panel (a) shows the histogram of variance ratios captured by rank-N PCA for N = 3, 4, 5.]

3 SENSE INDUCTION AND DISAMBIGUATION

We now turn to sense induction, a basic task that explores polysemy: in this task, a set of sentences (each containing a common target word) has to be partitioned such that the target word is used in the same sense in all the sentences within each partition.
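As a concrete sketch of the representation step (ours, not the authors' released code), the rank-N subspace for a context can be computed with a thin SVD of the stacked context embeddings; we assume the embeddings are looked up elsewhere (e.g., from a trained word2vec model), and we leave the matrix uncentered so that S(c) is a linear subspace through the origin, which matches how distances to S(c) are used later:

```python
import numpy as np

def context_subspace(context_vectors, N=3):
    """Rank-N subspace S(c) for one context.

    context_vectors: (num_words, d) array of embeddings for the
    non-functional words in the context window.
    Returns an (N, d) array whose rows are orthonormal basis vectors
    u_1(c), ..., u_N(c), the top-N principal directions.
    """
    X = np.asarray(context_vectors, dtype=float)
    # Thin SVD: the rows of Vt are the principal directions of X
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:N]
```

Any vector in S(c) is then a linear combination of the returned rows, matching the set definition of S(c) above.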
The number of partitions relates to the number of senses being identified for the target word. The geometry of the subspace representations plays a key role in our algorithm, and we start with this next.

3.1 GEOMETRY OF POLYSEMY

Consider a target word w and a context sentence c containing this word w. The empirical experiment from the previous section allows us to represent c by an N-dimensional subspace of the vectors of the words in c. Since N (3 to 5 in our experiments) is much smaller than the number of words in c (21, on average), one suspects that the representation associated with c would not change very much if the target word w were expurgated from it, i.e., S(c) ≈ S(c \ w). On the other hand, v(w) (the vector representation of the target word w) perhaps has a fairly large intersection with S(c) and thus also with S(c \ w). Putting these two observations together, one arrives at the following hypothesis, in the context of monosemous target words:

Intersection Hypothesis: the target word vector v(w) should reside in the intersection of S(c \ w), where the intersection is over all its contexts c.

The reason why this hypothesis is made in the context of monosemous words is that in this case the word vector representation is “pure”, while polysemous words are really different words with the same (lexical) surface form.
Empirical Validation of the Intersection Hypothesis: We illustrate the intersection phenomenon via another experiment. Consider a monosemous word “typhoon” and consider all contexts in the Wikipedia corpus where this word appears (there are 14,829 contexts, and a few sample contexts are provided in Table 1). We represent each context by the rank-N PCA subspace of all the vectors (with N = 3) associated with the words in the context and consider their intersection. Each of these subspaces is 3 × d dimensional (where d = 300 is the dimension of the word vectors). We find that the cosine similarity (the normalized projection onto the subspace) between the vector associated with “typhoon” and each context subspace is large: the average is 0.693 with standard deviation 0.211. For comparison, we randomly sample 14,829 contexts and find the average is 0.305 with standard deviation 0.041 (a detailed histogram is provided in Figure 2(a)). This corroborates the hypothesis that the target word vector is in the intersection of the context subspaces. A visual representation of this geometric phenomenon is in Figure 2(b), where we have projected the d-dimensional word representations into 3-dimensional vectors and used these 3-dimensional word vectors to get the subspaces for the contexts (we set N = 2 here for visualization) in Table 1, plotting the subspaces as 2-dimensional planes. From Figure 2, we can see that all the context subspaces roughly intersect at a common direction, thus empirically justifying the intersection hypothesis.

[Figure 2: The geometry of contexts for monosemy: (a) cosine similarity between the word representation and its context subspaces (Wikipedia vs. random); (b) contexts of “typhoon”.]
Table 1: Contexts containing the monosemous word “typhoon”.

• “powerful typhoon that affected southern japan in july it was the sixth named storm and second typhoon of the pacific typhoon season originating from an area of low pressure near wake island on july the precursor to maon gradually developed”
• “typhoon ineng was a powerful typhoon that affected southern japan in july it was the sixth named storm and second typhoon of the pacific typhoon season originating from an area of low pressure near wake island on july the precursor”
• “crossing through a gulf of the south china sea patsy weakened to a mph kmh tropical storm before the joint typhoon warning center ceased following the system on the morning of august as it made landfall near the city of”
• “bay were completely wiped out while all airplanes at the local airports were damaged this is the first active pacific typhoon season on record typhoon barbara formed on march and moved west it strengthened briefly to a category three with”

Recovering the Intersection Direction: An algorithmic approach to robustly discover the intersection direction involves finding the direction vector that is “closest” to all subspaces; we propose doing so by solving the following optimization problem:

$$\hat{u}(w) = \arg\min_{\|u\|=1} \sum_{c : w \in c} d(u, S(c \setminus w))^2, \qquad (1)$$

where d(u, S) is the shortest $\ell_2$-distance between u and the subspace S, i.e.,

$$d(u, S) = \sqrt{\|u\|^2 - \sum_{n=1}^{N} (u^\top u_n)^2},$$

where $u_1, \ldots, u_N$ are N orthonormal basis vectors for the subspace S. Thus (1) is equivalent to

$$\hat{u}(w) = \arg\max_{\|u\|=1} \sum_{c : w \in c} \sum_{n=1}^{N} \left( u^\top u_n(c \setminus w) \right)^2, \qquad (2)$$

which can be solved by taking the first principal component of $\{u_n(c \setminus w)\}_{c : w \in c,\; n=1,\ldots,N}$.

The property that context subspaces of a monosemous word intersect at one direction naturally generalizes to polysemy:

Polysemy Intersection Hypothesis: the context subspaces of a polysemous word intersect at different directions for different senses.
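The distance d(u, S) and the principal-component solution to (2) can be sketched as follows (an illustrative implementation under our own naming; each context subspace is assumed to be given as an array of orthonormal rows, as a rank-N PCA step would produce):

```python
import numpy as np

def subspace_distance(u, basis):
    """d(u, S) = sqrt(||u||^2 - sum_n (u . u_n)^2), where S is spanned
    by the orthonormal rows of `basis`."""
    proj = basis @ u
    return float(np.sqrt(max(u @ u - proj @ proj, 0.0)))

def intersection_direction(bases):
    """Solve (2): the unit direction closest to all subspaces is the
    first principal component of the stacked basis vectors
    {u_n(c \\ w)}.  `bases` is a list of (N, d) orthonormal-row arrays,
    one per context."""
    stacked = np.vstack(bases)
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[0]  # top right singular vector, unit length
```

The equivalence of (1) and (2) is what makes this a one-line SVD: maximizing the summed squared projections of u onto all basis vectors means taking the top eigenvector of their outer-product sum.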
This intuition is validated by the following experiment, which continues on the same theme as the one done for the monosemous word “typhoon”. Now we study the geometry of contexts for a polysemous word “crane”, which can either mean a large, tall machine used for moving heavy objects or a tall, long-legged, long-necked bird. We list four contexts for each sense of “crane” in Table 2, repeat the experiment conducted above for the monosemous word “typhoon”, and visualize the context subspaces for the two senses in Figures 3(a) and 3(b) respectively. Figure 3(c) plots the directions of the two intersections. This immediately suggests that the contexts where “crane” stands for a bird intersect at one direction and the contexts where “crane” stands for a machine intersect at a different direction, as visualized in 3 dimensions.

[Figure 3: Geometry of contexts for the polysemous word “crane”: (a) all contexts where “crane” means a machine roughly intersect at one direction; (b) all contexts where “crane” means a bird roughly intersect at another direction; (c) the two directions representing “crane” as a machine and as a bird.]

Table 2: Contexts containing the polysemous word “crane”.

crane: machine
• “In June 1979, the anchor towed ashore and lifted by mobile crane into a tank made of concrete built into the ground specifically for the purpose of conserving the anchor.”
• “The company ran deck and covered lighters, stick lighters, steam cranes and heavy lift crane barges, providing a single agency for Delaware Valley shippers.”
• “He claimed that his hydraulic crane could unload ships faster and more cheaply than conventional cranes.”
• “A large pier was built into the harbour to accommodate a heavy lift marine crane which would carry the components into the Northumberland Strait to be installed.”

crane: bird
• “The sandhill crane (“Grus canadensis”) is a species of large crane of North America and extreme northeastern Siberia.”
• “Although the grey crowned crane remains common over much of its range, it faces threats to its habitat due to drainage, overgrazing, and”
• “The blue crane (“Anthropoides paradiseus”), also known as the Stanley crane and the paradise crane, is the national bird of South Africa.”
• “The sarus crane is easily distinguished from other cranes in the region by the overall grey colour and the contrasting red head”

3.2 SENSE INDUCTION

We can use the representation of senses by the intersection directions of context subspaces for unsupervised sense induction: supposing the target polysemous word has K senses (known ahead of time, for now), the goal is to partition the contexts associated with this target word into K groups, within each of which the target polysemous word shares the same sense. The fact that two groups of context subspaces, corresponding to different senses, intersect at different directions motivates our geometric algorithm: each context belongs to the group associated with the nearest intersection direction, which serves as a prototype of the group. Part of the task is also to identify the most appropriate intersection direction vector associated with each group. This task represents a form of unsupervised clustering, which can be formalized as the optimization problem below.
Given a target polysemous word w, M contexts c_1, ..., c_M containing w, and a number K indicating the number of senses w has, we would like to partition the M contexts into K sets S_1, ..., S_K so as to minimize the distance d(·, ·) of each subspace to the intersection direction of its group:

$$L = \min_{u_1, \ldots, u_K,\; S_1, \ldots, S_K} \sum_{k=1}^{K} \sum_{c \in S_k} d^2(u_k, S(c \setminus w)). \qquad (3)$$

This problem (3) is analogous to the objective of K-means clustering for vectors, and solving it exactly in the worst case can be shown to be NP-hard. We propose a natural algorithm by repurposing traditional K-means clustering built for vector spaces to the Grassmannian space as follows (a detailed algorithm chart is provided in Appendix B):

Algorithm: K-Grassmeans

• Initialization: we randomly initialize K unit-length vectors u_1, ..., u_K.

• Expectation: we group contexts based on the distance to each intersection direction:
$$S_k \leftarrow \{ c_m : d(u_k, S(c_m \setminus w)) \le d(u_{k'}, S(c_m \setminus w)) \;\; \forall k' \}, \quad \forall k.$$

• Maximization: we update the intersection direction for each group based on the contexts in the group:
$$u_k \leftarrow \arg\min_{u} \sum_{c \in S_k} d^2(u, S(c \setminus w)), \qquad L \leftarrow \sum_{k=1}^{K} \sum_{c \in S_k} d^2(u_k, S(c \setminus w)).$$

To ensure good performance, we randomize the intersection directions with multiple different seeds and output the best one in terms of the objective function L; this step is analogous to the random initialization conducted in k-means++ in the classical clustering literature [27, 3].

To get a qualitative feel for this algorithm at work, we consider an exemplar target word “columbia” with K = 5 senses. We considered 100K sentences, extracted from the Wikipedia corpus. The goal of sense induction is to partition the set of contexts into 5 groups, so that within each group the target word “columbia” has the same sense. We run K-Grassmeans for this target word and extract the intersection vectors u_1, ..., u_K for K = 5.
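The alternating steps above can be sketched as follows. This is an illustrative re-implementation, not the authors' code: the restart count, iteration cap, and unit-norm assumption on the directions are our choices.

```python
import numpy as np

def k_grassmeans(bases, K, n_iter=20, n_restarts=5, seed=0):
    """Cluster context subspaces by their intersection directions.

    bases: list of (N, d) arrays with orthonormal rows, one subspace
    S(c \\ w) per context.  Returns (directions, labels, loss): the best
    of `n_restarts` random initializations under the objective (3).
    """
    rng = np.random.default_rng(seed)
    d = bases[0].shape[1]

    def dist2(u, B):                       # squared d(u, S), ||u|| = 1
        p = B @ u
        return max(1.0 - p @ p, 0.0)

    best = None
    for _ in range(n_restarts):
        U = rng.standard_normal((K, d))
        U /= np.linalg.norm(U, axis=1, keepdims=True)   # random unit init
        labels = np.zeros(len(bases), dtype=int)
        for _ in range(n_iter):
            # Expectation: assign each context to its nearest direction
            labels = np.array([min(range(K), key=lambda k: dist2(U[k], B))
                               for B in bases])
            # Maximization: each direction becomes the first principal
            # component of its group's stacked basis vectors (cf. (2))
            for k in range(K):
                members = [B for B, l in zip(bases, labels) if l == k]
                if members:
                    _, _, Vt = np.linalg.svd(np.vstack(members),
                                             full_matrices=False)
                    U[k] = Vt[0]
        loss = sum(dist2(U[l], B) for B, l in zip(bases, labels))
        if best is None or loss < best[2]:
            best = (U.copy(), labels.copy(), loss)
    return best
```

The synthetic merged-word experiment described below (merging monosemous words into one artificial polysemous word) is a natural way to exercise this routine, since ground-truth labels are known.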
One sample sentence for each group is given in Table 3 as an example, from which we can see that the first group corresponds to British Columbia in Canada, the second to Columbia Records, the third to Columbia University in New York, the fourth to the District of Columbia, and the fifth to the Columbia River. The performance of K-Grassmeans in the context of the target word “columbia” is described in detail in the context of sense disambiguation (Section 3.3). Note that our algorithm can be run for any one specific target word, and makes for efficient online sense induction; this is relevant in information retrieval applications where the sense of the query words may need to be found in real time.

To get a feel for how good K-Grassmeans is at the sense induction task, we run the following synthetic experiment: we randomly pick K monosemous words, merge their surface forms to create a single artificial polysemous word, collect all the contexts corresponding to the K monosemous words, and replace every occurrence of the K monosemous words by the single artificial polysemous word. Then we run the K-Grassmeans algorithm on these contexts with the artificial polysemous word as the target word, so as to recover the original labels (which are known ahead of time, since we merged known monosemous words together to create the artificial polysemous word). Figure 4(a) shows the clustering performance on a realization of the artificial polysemous word made of “monastery” and “phd” (here K = 2), and Figure 4(b) shows the clustering performance when K = 5 monosemous words “employers”, “exiled”, “grossed”, “incredible” and “unreleased” are merged together. We repeat the experiment over 100 trials with K varying from 2 to 8, and the accuracy of sense induction is reported in Figure 4(c). From these experiments we see that K-Grassmeans performs very well, qualitatively and quantitatively.
A quantitative experiment on a large and standardized real dataset (which involves real polysemous target words, as opposed to the synthetic ones we created), as well as a comparison with other algorithms in the literature, is detailed in Section 5, where we see that K-Grassmeans outperforms the state of the art.

Table 3: Semantics of the 5 groups for the target word “columbia”.

Group 1 (a): “research centres in canada it is located on patricia bay and the former british columbia highway a in sidney british columbia vancouver island just west of victoria international airport the institute is paired with a canadian coast guard base”

Group 2 (b): “her big break performing on the arthur godfrey show and had since then released a series of successful singles through columbia records hechtlancaster productions first published the music from their film marty in april and june through cromwell music this”

Group 3 (c): “fellow at the merrill school of journalism university of maryland college park in she was a visiting scholar at the columbia university center for the study of human rights in haddad completed a master of international policy”

Group 4 (d): “signed into law by president benjamin harrison in march a site for the new national conservatory in the district of columbia was never selected much less built the school continued to function in new york city existing solely from philanthropy”

Group 5 (e): “in cowlitz county washington the caples community is located west of woodland along caples road on the east shore of columbia river and across the river from columbia city oregon the caples community is part of the woodland school district”
[Figure 4: A synthetic experiment to study the performance of K-Grassmeans: (a) K = 2 monosemous words, “monastery” and “phd”; (b) K = 5 monosemous words, “employers”, “exiled”, “grossed”, “incredible” and “unreleased”; (c) accuracy versus K, compared against random guessing.]

3.3 SENSE DISAMBIGUATION

Having the intersection directions to represent the senses, we are ready to disambiguate a target word's sense in a given context using the learned intersection directions specific to this target word: for a new context instance of a polysemous word, the goal is to identify which sense this word takes in the context. Our approach is three-fold: represent the context by a low dimensional subspace S(c \ w) approximating the linear span of the word embeddings of the non-functional words in the context, find the orthogonal projection distance between each intersection vector u_k(w) and the context subspace, and finally output the k* that minimizes the distance, i.e.,

$$k^* = \arg\min_k d(u_k(w), S(c \setminus w)). \qquad (4)$$

We refer to (4) as hard decoding of word senses, since it outputs a deterministic label. At times, it makes sense to consider a soft decoding algorithm where the output is a probability distribution. The probability that w takes the k-th sense given the context c is defined via

$$P(w, c, k) = \frac{\exp(-d(u_k(w), S(c \setminus w)))}{\sum_{k'} \exp(-d(u_{k'}(w), S(c \setminus w)))}. \qquad (5)$$

Here we calculate the probability as a monotonic function of the cosine distance between the intersection vector u_k(w) and the context subspace S(c \ w), inspired by similar heuristics in the literature [13].
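Equations (4) and (5) translate directly into code; the sketch below is our own, with the intersection vectors u_k(w) stored as the rows of a matrix and assumed unit-length:

```python
import numpy as np

def disambiguate(directions, basis):
    """Hard label (Eq. 4) and soft distribution (Eq. 5) for one context.

    directions: (K, d) array of unit intersection vectors u_k(w);
    basis: (N, d) orthonormal rows spanning S(c \\ w).
    """
    proj = basis @ directions.T                       # (N, K) projections
    dists = np.sqrt(np.clip(1.0 - (proj ** 2).sum(axis=0), 0.0, None))
    k_star = int(np.argmin(dists))                    # hard decoding
    p = np.exp(-dists)
    p /= p.sum()                                      # soft decoding
    return k_star, p
```

Returning both outputs mirrors the discussion that follows: the hard label is convenient, but the soft distribution retains information that the single label discards.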
We applied (4) and (5) to the target word “columbia” and the five sentences listed in Table 3; the probability distributions P(w, c, k) returned by the soft decoding algorithm and the optimal k*'s returned by the hard decoding algorithm are provided in Table 4. From Table 4 we can see that even though the hard decoding algorithm outputs the correct label, some information is missing if we return a single label k*. For example, since we take a bag-of-words model in K-Grassmeans, some words (e.g., “school” and “new york city” in context (d) provided in Table 3) suggest that the meaning of “columbia” in this instance might also be Columbia University. The influence of those words is reflected in the probability distribution returned by the soft decoding algorithm, where we can see the probability that “columbia” in this instance means Columbia University is around 0.13. The misleading result mainly comes from the bag-of-words model, and how to resolve it remains open.

Table 4: Hard decoding and soft decoding for disambiguation of “columbia” in the five sentences given in Table 3.

Context | hard decoding (k*) | soft decoding P(w, c, k): k=1, k=2, k=3, k=4, k=5
(a)     | 1                  | 0.81, 0.01, 0.01, 0.05, 0.13
(b)     | 2                  | 0.02, 0.92, 0.01, 0.04, 0.01
(c)     | 3                  | 0.01, 0.00, 0.91, 0.06, 0.01
(d)     | 4                  | 0.07, 0.03, 0.13, 0.70, 0.07
(e)     | 5                  | 0.04, 0.01, 0.01, 0.05, 0.90

4 LEXEME REPRESENTATION

Induction and disambiguation are important tasks by themselves, but several downstream applications can use a distributed vector representation of the multiple senses associated with a target word. Just as with word representations, we expect the distributed lexeme representations to have semantic meaning: similar lexemes should be represented by similar vectors. It might seem natural to represent a lexeme s_k(w) of a given word w by the intersection vector associated with the k-th sense group of w, i.e., u_k(w).
Such an idea is supported by the observation that the intersection vector is close to the word representation vector for many monosemous words. We perform another experiment to directly confirm this observation: we randomly sample 500 monosemous words which occur at least 10,000 times; for each word we compute the intersection vector and check the cosine similarity between the intersection vector and the corresponding word representation. We find that on average the cosine similarity score is a very high 0.607, with a small standard deviation of 0.095.

Despite this empirical evidence, somewhat surprisingly, lexeme representation using the intersection vectors turns out to be not such a good idea, and the reason is fairly subtle. It turns out that the intersection vectors are concentrated in a relatively small surface area on the sphere (magnitudes are not available in the intersection vectors): the cosine similarity between two random intersection vectors among 10,000 intersection vectors (five intersection vectors each for 2,000 polysemous words) is 0.889 on average with standard deviation 0.068 (a detailed histogram is provided in Figure 5(a)). This is quite in contrast to the analogous statistics for (global) word embeddings from the word2vec algorithm: the cosine similarity between two random word vectors is 0.134 on average with standard deviation 0.072 (a detailed histogram is provided in Figure 5(b)). Indeed, word vector representations are known to be approximately uniformly scattered on the unit sphere (the so-called isotropy property; see [1]). The intersection vectors cluster together far more and are quite far from being isotropic, yet they are still able to distinguish different senses, as shown by the empirical studies and qualitative experiments on prototypical examples above (and also on standard datasets, as seen later in Section 5).
Due to this geometric mismatch between word vectors and intersection directions, and the corresponding mismatch in the linear algebraic properties expected of these distributed representations, it is not appropriate to use the intersection direction as the lexeme vector representation. Why word representations are isotropic while intersection vectors cluster close to each other is an intriguing open question: a detailed empirical study of this phenomenon and a theoretical exploration of generative models that might mathematically explain this behavior are exciting future directions of research, and beyond the scope of this paper. In addition to the geometric mismatch, intersection vectors are perhaps not appropriate to represent the lexemes for two further reasons: (a) the intersections only represent directions and lack information about their magnitudes; and (b) the context subspaces are themselves noisy, since during the initial phase a polysemous word is represented by a single vector.

In this light, we propose to learn the lexeme representations by an alternate (and more direct) procedure: first label the polysemous words in the corpus using the proposed disambiguation algorithm from Section 3.3, and then run a standard word embedding algorithm (we use word2vec) on this labeled corpus, yielding lexeme embeddings. There are several possibilities regarding labeling, discussed next.

Figure 5: Histograms of cosine similarities between (a) intersection directions and (b) word representations.

4.1 HARD DECODING

We label the corpus using the disambiguation algorithm as in (4).
A special label "IDK" ("I don't know") is introduced to avoid introducing too many errors during the labeling phase, since (a) our approach is based on the bag-of-words model and cannot guarantee to label every sense correctly (for example, "arm" in "the boat commander was also not allowed to resume his career in the Greek Navy due to his missing arm, which was deemed a factor that could possibly raise enquiries regarding the mission which caused the trauma." will be labeled as "weapon"); and (b) it is not clear how such errors would affect existing word embedding algorithms. The "IDK" label is assigned by checking the closest distance between the context subspace and the intersection directions: let u_{k*}(w) be the intersection vector of w closest to context c; we label this instance as k* if d(u_{k*}(w), S(c\w)) < θ and "IDK" otherwise, where θ is a hyperparameter. A detailed algorithm chart for sense disambiguation and corpus labeling is provided in Appendix C. The "IDK" label covers instances where a word takes a rare sense (for example, "crane" as in stretching the neck) or a confusing sense that requires disambiguation of the context words themselves (for example, "book" and "ticket" in "book a flight ticket"). The IDK labeling procedure is inspired by analogous scenarios in wireless communication, where (coded) bits whose log likelihood ratio is close to zero are in practice better labeled as "erasures" than treated as informative for the overall decoding task [9].

4.2 SOFT DECODING

Another way of labeling uses the absolute scores of K-Grassmeans disambiguation for each sense of a target word in a specific context, cf. Equation (5). Soft decoding involves generating a random corpus by sampling one sense for every occurrence of a polysemous word according to its probability distribution from (5).
Then lexeme representations are obtained by applying a standard word embedding algorithm (we use word2vec) to this (random) labeled corpus. Since we only consider words that are frequent enough (i.e., occurring more than 10,000 times), each sense of a polysemous word is sampled enough times to allow a robust lexeme representation with high probability.

Soft decoding helps in two scenarios: (a) when a context has enough information for disambiguation (i.e., the probability distribution (5) concentrates on one sense), the random sampling has a high chance of making the correct decision; (b) when a context is ambiguous (i.e., the probability distribution has more than one peak), the random sampling has a chance of not making a wrong (irreversible) decision.

5 EXPERIMENTS

Throughout this paper we have conducted multiple qualitative and empirical experiments to highlight and motivate the various geometric representations. In this section we evaluate our algorithms (sense disambiguation and sense representation) empirically on (standardized) datasets from the literature, allowing us to get a quantitative feel for the performance on large datasets, as well as to afford a comparison with other algorithms from the literature.

5.1 PRELIMINARIES

All our algorithms are unsupervised and operate on a large corpus obtained from Wikipedia, dated 09/15. We use WikiExtractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract the plain text. We use the skip-gram model from word2vec [22] as the word embedding algorithm, with the default parameter setting. We set c = 10 as the context window size and N = 3 as the rank of the PCA. We choose K = 2 and K = 5 in our experiments. For the disambiguation algorithm, we set θ = 0.6.
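The core geometric operations above — representing a context by the rank-N principal subspace of its word vectors, and hard decoding with the IDK threshold θ — can be sketched as follows. This is a minimal sketch under stated assumptions: the distance d(u, S(c\w)) is taken to be the residual norm after projecting the (unit) intersection direction onto the subspace, and any preprocessing of the context vectors (stop-word removal, mean centering, weighting) from Section 3 is omitted.

```python
import numpy as np

N = 3        # rank of the context subspace (PCA rank used in the paper)
THETA = 0.6  # IDK threshold, playing the role of the paper's theta

def context_subspace(context_vecs, n=N):
    """Orthonormal basis (rows) of the rank-n principal subspace of the
    stacked context word vectors, via SVD (mean centering omitted here)."""
    X = np.asarray(context_vecs, dtype=float)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n]

def dist_to_subspace(u, basis):
    """Assumed distance of a unit vector u to span(basis):
    the norm of the residual after orthogonal projection."""
    u = u / np.linalg.norm(u)
    proj = basis.T @ (basis @ u)
    return float(np.linalg.norm(u - proj))

def hard_decode(intersections, context_vecs, theta=THETA):
    """Return the index k* of the closest intersection direction u_k(w),
    or 'IDK' when even the closest one is farther than theta."""
    B = context_subspace(context_vecs)
    d = [dist_to_subspace(u, B) for u in intersections]
    k_star = int(np.argmin(d))
    return k_star if d[k_star] < theta else "IDK"
```

For instance, if the context vectors span the first three coordinate axes, an intersection direction inside that span decodes to its sense index, while an orthogonal one (distance 1) falls back to "IDK" under a tight threshold.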
5.2 BASELINES

Our main comparisons are with algorithms that conduct unsupervised polysemy disambiguation, specifically the sense clustering method of [13], the multi-sense skip-gram model (MSSG) of [26] with different parameters, and the sparse coding method with a global dictionary of [2]. We were able to download the word and sense representations for [13, 26] online, and trained the word and sense representations of [2] on the same corpus as that used by our algorithms.

5.3 SENSE INDUCTION AND DISAMBIGUATION

Word sense induction (WSI) tasks conduct the following test: given a set of context instances containing a target word, one is asked to partition the context instances into groups such that within each group the target word shares the same sense. We test our induction algorithm, K-Grassmeans, on two datasets – one standard and the other custom-built.

• SemEval-2010: The test set of SemEval-2010 shared task 14 [21] contains 50 polysemous nouns and 50 polysemous verbs whose senses are extracted from OntoNotes [29]; in total, 8,915 instances are expected to be disambiguated. The context instances are extracted from various sources including CNN and ABC.

• Make-Sense-2016: Several word senses from SemEval-2010 are too fine-grained in our view (no performance results from tests with native human speakers are provided in the literature) – this creates "noise" that reduces the performance of all the algorithms, and the required senses are perhaps not that useful to downstream applications. For example, "guarantee" (as a noun) is labeled as having four different meanings in the following four sentences:

  – It has provided a legal guarantee to the development of China's Red Cross cause and connections with the International Red Cross movement, signifying that China's Red Cross cause has entered a new historical phase.

  – Some hotels in the hurricane-stricken Caribbean promise money-back guarantees.
  – Many agencies roll over their debt, paying off delinquent loans by issuing new loans, or converting defaulted loan guarantees into direct loans.

  – Litigation consulting isn't a guarantee of a favorable outcome.

However, in general they all mean "a formal promise or pledge". Towards a more human-interpretable version of the WSI task, we custom-build a dataset whose senses are coarser and clearer. We do this by repurposing a recent dataset created in [2] as part of their "police lineup" task. Our dataset contains 50 polysemous words, together with their senses (on average 2.78 senses per word) borrowed from [2]. We generate the testing instances for a target word by extracting all occurrences of it in the Wikipedia corpus, analyzing its Wikipedia annotations (if any), grouping those which have the same annotations, and finally merging annotations where the target word shares the same sense. Since the senses are quite readily distinguishable from the perspective of native/fluent speakers of English, the disambiguation variability among the human raters we tested our dataset on is negligible (this effect is also seen in Figure 6 of [2]).

We evaluate the performance of the algorithms on this (disambiguation) task according to standard measures in the literature: V-Measure [33] and paired F-score [4]; these two evaluation metrics also feature in the SemEval-2010 WSI task [21]. V-Measure is an entropy-based external cluster evaluation metric. Paired F-score evaluates clustering performance by converting the clustering problem into a binary classification problem – given two instances, do they belong to the same cluster or not? Both metrics operate on a contingency table A = {a_tk}, where a_tk is the number of instances that are manually labeled as t and algorithmically labeled as k. A detailed description is given in Appendix D for completeness. Both metrics range from 0 to 1, and perfect clustering gives a score of 1.
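Both metrics can be computed directly from the contingency table A = {a_tk}. A compact sketch using the standard definitions (V-Measure as the harmonic mean of entropy-based homogeneity and completeness [33]; paired F-score via pair counting [4]):

```python
import numpy as np
from math import comb

def v_measure(A):
    """V-Measure from contingency table A[t][k]: harmonic mean of
    homogeneity h = 1 - H(T|K)/H(T) and completeness c = 1 - H(K|T)/H(K)."""
    A = np.asarray(A, dtype=float)
    n = A.sum()
    pt = A.sum(axis=1) / n   # gold class marginals
    pk = A.sum(axis=0) / n   # cluster marginals

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    H_t_given_k = -sum(A[t, k] / n * np.log(A[t, k] / A[:, k].sum())
                       for t in range(A.shape[0]) for k in range(A.shape[1])
                       if A[t, k] > 0)
    H_k_given_t = -sum(A[t, k] / n * np.log(A[t, k] / A[t, :].sum())
                       for t in range(A.shape[0]) for k in range(A.shape[1])
                       if A[t, k] > 0)
    h = 1 - H_t_given_k / H(pt) if H(pt) > 0 else 1.0
    c = 1 - H_k_given_t / H(pk) if H(pk) > 0 else 1.0
    return 2 * h * c / (h + c) if h + c > 0 else 0.0

def paired_f_score(A):
    """Paired F-score: precision/recall over instance pairs that share an
    algorithmic cluster vs. pairs that share a gold class."""
    A = np.asarray(A, dtype=int)
    pairs_common = sum(comb(int(a), 2) for a in A.flat)
    pairs_pred = sum(comb(int(s), 2) for s in A.sum(axis=0))
    pairs_gold = sum(comb(int(s), 2) for s in A.sum(axis=1))
    p = pairs_common / pairs_pred if pairs_pred else 0.0
    r = pairs_common / pairs_gold if pairs_gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

A perfect clustering (diagonal table) scores 1 on both; collapsing everything into one cluster drives V-Measure to 0 while paired F-score retains perfect recall but poor precision.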
Empirical statistics show that V-Measure favors clusterings with a larger number of clusters, while paired F-score favors those with a smaller number of clusters.

algorithms         | SemEval-2010                 | Make-Sense-2016
                   | V-Measure  F-score  #cluster | V-Measure  F-score  #cluster
MSSG.300D.30K.key  | 9.00       47.26    2.88     | 19.40      54.49    2.88
MSSG.300D.6K.key   | 6.90       48.43    2.45     | 14.40      57.91    2.35
huang 2012         | 10.60      38.05    6.63     | 15.9       46.86    2.74
#cluster=2         | 7.30       57.14    1.93     | 29.30      64.58    2.37
#cluster=5         | 14.50      44.07    4.30     | 34.40      58.17    4.98

Table 5: Performances (V-Measure (×100) and paired F-score (×100)) on the Word Sense Induction task on two datasets.

Table 5 shows the detailed results of our experiments, from which we see that K-Grassmeans strongly outperforms the others. The main reason behind the better performance seems to be that K-Grassmeans disambiguates some subtle senses that the others cannot. For example, the following are three sentences containing "air":

• Can you hear me? You're on the air. One of the great moments of live television, isn't it?
• The air force is to take at least 250 more.
• The empty shells piled here along the roadside fill the air with their briny aroma.

Enough information is contained in each sentence to inform us that the first "air" is about broadcasting, the second is about the region above the ground, and the third is about a mixture of gases. K-Grassmeans can distinguish all three while the other algorithms cannot.

5.4 LEXEME REPRESENTATION

The key requirement of lexeme representations is that they have the same properties as word embeddings, i.e., similar lexemes (or monosemous words) should be represented by similar vectors. Hence we evaluate our lexeme representations on a standard word similarity task focusing on context-specific scenarios: the Stanford Contextual Word Similarities (SCWS) dataset [13].
In addition to the word similarity task on SCWS, we also evaluate our lexeme representations on the "police lineup" task proposed in [2].

Word Similarity on SCWS. The task is as follows: given a pair of target words, the algorithm assigns a measure of similarity between the pair. The algorithm is evaluated by checking the degree of agreement between the similarity measure given by the algorithm and that given by humans, in terms of Spearman's rank correlation coefficient. Although the SCWS dataset is not meant specifically for polysemy, we repurpose it for our tasks since it asks for the similarity between two words in two given sentential contexts (the contexts presumably provide important clues to the human rater on the senses of the target words), and also because this is a large (2,003 word pairs) and standard dataset in the literature, with 10 human-rated similarity scores per pair, each rated on an integral scale from 1 to 10. We take the average of the 10 human scores to represent the human judgment. We propose two measures between w and w' given their respective contexts c and c' – one (denoted HardSim) based on the hard decoding algorithm and the other (denoted SoftSim) based on the soft one. HardSim and SoftSim are defined via

HardSim(w, w', c, c') = d(v_k(w), v_{k'}(w')),

where k and k' are the senses obtained via (4), v_k(w) and v_{k'}(w') are the lexeme representations for the two senses, and d(v, v') is the cosine similarity between two vectors v and v', i.e., d(v, v') = v^T v' / (||v|| ||v'||); and

SoftSim(w, w', c, c') = Σ_k Σ_{k'} P(w, c, k) P(w', c', k') d(v_k(w), v_{k'}(w')),

where P(w, c, k) is the probability that w takes the k-th sense given the context c, defined in (5). Table 6 shows the detailed results on this task.
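Given the lexeme vectors v_k(w) and the sense distributions from (4) and (5) — both assumed precomputed here, since the decoding step itself is described in Section 3 — the two similarity measures are straightforward to compute:

```python
import numpy as np

def cos(v, vp):
    """Cosine similarity d(v, v') = v^T v' / (||v|| ||v'||)."""
    return float(v @ vp / (np.linalg.norm(v) * np.linalg.norm(vp)))

def hard_sim(lex_w, lex_wp, k, kp):
    """HardSim: cosine similarity between the two lexeme vectors
    selected by hard decoding (indices k, k' from Eq. (4))."""
    return cos(lex_w[k], lex_wp[kp])

def soft_sim(lex_w, lex_wp, p_w, p_wp):
    """SoftSim: expected lexeme-lexeme cosine similarity under the
    sense distributions P(w,c,.) and P(w',c',.) from Eq. (5)."""
    return sum(p_w[k] * p_wp[kp] * cos(lex_w[k], lex_wp[kp])
               for k in range(len(lex_w)) for kp in range(len(lex_wp)))
```

When a sense distribution is sharply peaked, SoftSim collapses to HardSim; when it is spread out, SoftSim averages over the plausible sense pairs instead of committing to one.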
Here we conclude that, in general, our lexeme representations perform comparably to the state of the art under both soft and hard decisions. It is worth mentioning that the vanilla word2vec representation (which simply ignores the provided contexts) also has a comparable performance – this makes some sense, since some of the words in SCWS are monosemous (and hence their meanings are context-independent). A closer inspection of the results shows that the vanilla word2vec representation also performs fairly well for polysemous words – we hypothesize that this is because the word2vec global representation is actually close to the individual sense representations, if the senses occur frequently enough in the corpus. This hypothesis is supported by the linear algebraic structure uncovered by [2], that a word representation is a linear combination of its lexeme representations: let v(w) = Σ_k α_k v_k(w), and let w' be a word semantically close to the k'-th sense of w; then

v(w)^T v(w') = α_{k'} v_{k'}(w)^T v(w') + Σ_{k≠k'} α_k v_k(w)^T v(w').

Since (a) the other senses are irrelevant to w', we can assume v_k(w)^T v(w') ≈ 0 for k ≠ k', and (b) the k'-th sense is frequent enough, we can assume α_{k'} ≈ 1. Putting (a) and (b) together, we conclude that v(w)^T v(w') ≈ v_{k'}(w)^T v(w'), and therefore the inner product between v(w) and v(w') captures the similarity between the k'-th sense of w and w'.

To separate out the effect of the mix of monosemous and polysemous words in the SCWS dataset, we expurgate monosemous words from the dataset, creating a smaller version that we denote SCWS-lite. In SCWS-lite, we only consider the pairs of sentences where both target words in a pair have the same surface form but different contexts (and hence potentially different senses). SCWS-lite contains 241 pairs of words, roughly 12% of the original dataset.
The performances of the various algorithms are detailed in Table 6, from which we see that our representations (and corresponding algorithms) outperform the state of the art, showcasing superior representational performance when it comes to context-specific polysemy disambiguation.

Model              | SCWS               | SCWS-lite
                   | SoftSim   HardSim  | SoftSim   HardSim
skip-gram          | 62.21              | –
MSSG.300D.6K       | 67.89     55.99    | 21.30     20.06
MSSG.300D.30K      | 67.84     56.66    | 19.49     20.67
NP-MSSG.300D.6K    | 67.72     58.55    | 20.14     19.26
Huang 2012         | 65.7      26.1     | –         –
#cluster=2 (hard)  | 61.03     58.40    | 9.33      4.97
#cluster=5 (hard)  | 60.82     53.14    | 27.40     22.28
#cluster=2 (soft)  | 63.59     63.67    | 5.07      6.53
#cluster=5 (soft)  | 62.46     61.23    | 16.54     17.83

Table 6: Performance (Spearman's rank correlation ×100) on the SCWS task.

Police Lineup Task. This task was proposed in [2] to evaluate the efficacy of their sense representations (via the vector representations of the discourse atoms). The testbed contains 200 polysemous words and their 704 senses, where each sense is defined by eight related words. For example, the "tool/weapon" sense of "axes" is represented by "handle", "harvest", "cutting", "split", "tool", "wood", "battle", "chop". The task is the following: given a polysemous word, the algorithm needs to identify the true senses of this word from a group of m senses (which includes many distractors) by outputting k senses from the group. The algorithm is evaluated by the corresponding precision and recall scores. This task offers another opportunity to compare our representations with others in the literature, and also provides insight into some of those representations themselves. One possible algorithm is to simply use the one proposed in [2], with their sense representations replaced by ours: let s_k(w) denote a sense of w, and let L denote the set of words representing a sense. We define a similarity
score between a sense representation of w and a sense set from the m senses as

score(s_k(w), L) = sqrt( Σ_{w'∈L} (v_k(w)^T v_{w'})^2 ),

group the two senses with the highest scores with respect to s_k(w) for each k, and then output the top k senses with the highest scores. A detailed algorithm is provided in Appendix E, with a discussion of the potential variants.

Figure 6: Precision and recall curves in the sense identification task, for (a) hard and (b) soft decoding, with m = 20 and k varying from 1 to 6.

Figure 6 shows the precision and recall curves in the polysemy test, where we let m = 20 and let k vary from 1 to 6. First, we observe that our representations are uniformly better than the state of the art over the precision-recall curve, although by a relatively small margin; soft decoding performs slightly better than hard decoding on this task. Second, the surprising finding is that the baseline we create using vanilla word2vec representations (precise details of this baseline algorithm are provided in Appendix E for completeness) performs as well as the state of the art described in [2]. A careful look shows that all algorithm outputs (word2vec, ours and those in [2]) are highly correlated – they all make correct calls on obvious instances and all make mistakes on confusing instances.
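The sense-set score defined above, together with a simplified top-k selection built on it, can be sketched as follows; the word vectors and the eight-word sense lists are assumed inputs, and the grouping step of the full procedure (Appendix E) is collapsed into a plain ranking:

```python
import numpy as np

def sense_set_score(v_k, sense_words, word_vecs):
    """score(s_k(w), L) = sqrt(sum over w' in L of (v_k(w)^T v_{w'})^2):
    how strongly a lexeme vector aligns with the words defining a sense."""
    return float(np.sqrt(sum((v_k @ word_vecs[wp]) ** 2
                             for wp in sense_words)))

def rank_senses(lexeme_vecs, candidate_senses, word_vecs, top_k):
    """Score each candidate sense set by its best-matching lexeme vector
    and return the indices of the top_k candidates (a simplified variant
    of the selection rule; the paper's Appendix E discusses alternatives)."""
    scores = [max(sense_set_score(v, L, word_vecs) for v in lexeme_vecs)
              for L in candidate_senses]
    return sorted(range(len(candidate_senses)), key=lambda i: -scores[i])[:top_k]
```

For example, a lexeme vector aligned with "cut"/"wood" vectors scores the tool sense above an unrelated distractor sense.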
We believe that this is because of the following: (a) Algorithm 1 in [2] is highly correlated with word2vec, since their overall similarity measure uses a linear combination of the similarity measures associated with the sense (discourse atom) vector and the word vector (see Step 6 of Algorithm 1 in [2]); (b) word2vec embeddings appear to have an ability to capture two different senses of a polysemous word (as discussed earlier); (c) the instances where the errors occur all seem to be either genuinely subtle or rare in the domain where the embeddings were trained (for instance, "bat" in the sense of fluttering eyelashes is rare in the Wikipedia corpus, and is one of the error instances).

6 RELATED WORK

There are two main approaches to modeling polysemy: one is supervised and uses linguistic resources [8, 34]; the other is unsupervised, inferring senses directly from a large text corpus [13, 26, 20, 2]. Our approach belongs to the latter category. There are differing approaches to harnessing hand-crafted lexical resources (such as WordNet): [8] leverages a "gloss" as a definition for each lexeme, and uses this to model and disambiguate word senses; [34] models sense and lexeme representations through the ontology of WordNet. While these approaches are natural and interesting, they are inherently limited by (a) the coverage of WordNet: WordNet only covers 26k polysemous words, and the senses for polysemous words are incomplete and domain-agnostic (for example, the mathematical sense of "ring" is missing in WordNet, yet a majority of occurrences of "ring" in the Wikipedia corpus mean exactly this sense); and (b) the fine-grained nature of WordNet: WordNet senses appear at times too pedantic to be useful in practical downstream applications (for example, "air" in "This show will air Saturdays at 2 P.M." and "air" in "We cannot air this X-rated song" are identified as having different meanings).
The unsupervised methods do not suffer from the idiosyncrasies of linguistic resources, but are inherently more challenging to pursue, since they rely only on the latent structures of the word senses embedded inside their contexts. Existing unsupervised approaches can be divided into two categories, based on which aspects of the contexts of target words are used: (a) global structures of contexts and (b) local structures of contexts.

Global structure: [2] hypothesizes that the global word representation is a linear combination of its sense vectors. This linear algebraic hypothesis is validated by a surprising experiment wherein a single artificial polysemous word is created by merging two random words. The experiment is ingenious and the finding quite surprising, but it was conducted in a restricted setting: a single artificial polysemous word created by merging only two random words. Upon enlarging these parameters (i.e., many artificial polysemous words created by merging multiple random words) to better suit the landscape of polysemy in natural language, we find the linear algebraic hypothesis to be fragile: Figure 7 plots the linearity fit as a function of the number of artificial polysemous words created, and also as a function of how many words were merged to create each polysemous word. We see that the linearity fit worsens fairly quickly as the number of polysemous words increases, a scenario that is typical of natural languages.

Figure 7: A synthetic experiment to study the linear algebraic structure of word senses (cosine similarity of the linear fit versus the number of new words, for 2 to 5 merged words per new word).

The main reason for this effect appears to be that the linearity fit is quite sensitive to the interaction between the word vectors, caused by the polysemous nature of the words.
The linear algebraic hypothesis is mathematically justified in Section 5 of [2] in terms of the RAND-WALK generative model of [1], with three extra simplifications. If one were to generalize this proof to handle multiple artificial words at the same time, it appears particularly relevant that Simplification 2 should continue to hold. This simplification step involves the assumption that, if w_1, ..., w_n are the random words being merged, then (a) w_i and w_j do not occur together in a context window for any i ≠ j, and (b) any other word κ can occur with only a single one of the w_i's across all context windows. This simplification clearly no longer holds as n increases, and especially so when n nears the size of the vocabulary. However, this latter scenario (n being the size of the vocabulary) is the very basis of the sparse-coding algorithm proposed in [2], where the latent structure of the multiple senses is modeled as a corpus of discourse atoms in which every atom interacts with all the others.

The experiment whose results are depicted in Figure 7 is designed to mimic these underlying simplifications of the proof in [2]: we train word vectors via the skip-gram version of word2vec using the following steps. (a) We initialize the newly generated artificial polysemous words with random vectors; (b) we initialize, and do not update, the (two sets of) vector representations of the other words κ with the existing word vectors. The embeddings are learnt on the 2016/06/01 Wikipedia dump, tokenized via WikiExtractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor); words that occur fewer than 1,000 times are ignored; words being merged are chosen randomly in proportion to their frequencies. Due to computational constraints, each instance of mergers is subjected to a single run of the word2vec algorithm.
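The corpus-rewriting half of this experiment — creating artificial polysemous tokens by sampling words in proportion to their frequencies and merging them — can be sketched as follows. The merged corpus would then be fed to word2vec and the merged vectors compared against linear combinations of the original vectors; that training/comparison step is omitted here, and the token name `<merged_i>` is an illustrative choice, not the paper's.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def merge_words(corpus_tokens, n_new_words, n_merged, min_count=1):
    """Create artificial polysemous tokens: for each new word, sample
    `n_merged` distinct existing words (proportional to frequency) and
    rewrite every occurrence of them as the shared merged token.
    (A word sampled by one group may be re-sampled by a later group;
    that corner case is ignored in this sketch.)"""
    counts = Counter(corpus_tokens)
    vocab = [w for w, c in counts.items() if c >= min_count]
    freqs = np.array([counts[w] for w in vocab], dtype=float)
    probs = freqs / freqs.sum()
    mapping = {}
    for i in range(n_new_words):
        chosen = rng.choice(len(vocab), size=n_merged, replace=False, p=probs)
        for j in chosen:
            mapping[vocab[j]] = f"<merged_{i}>"
    return [mapping.get(t, t) for t in corpus_tokens], mapping
```

On a toy corpus, merging two of three word types replaces all of their occurrences with a single shared token while leaving the rest of the corpus untouched.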
Local structure: [13, 26] model a context by the average of its constituent word embeddings and use this average vector as a feature to induce word senses, by partitioning context instances into groups, and to disambiguate word senses for new context instances. [20] models the senses for a target word in a given context by a Chinese restaurant process, models the contexts also by averaging their constituent word embeddings, and then applies standard word embedding algorithms (continuous bag-of-words (CBOW) or skip-gram). Our approach is broadly similar in spirit to these approaches, in that a local lexical-level model is conceived, but we depart in several ways, the most prominent being the modeling of contexts as subspaces (and not as vectors, which is what an average of constituent word embeddings would entail).

7 CONCLUSION

In this paper we study the geometry of contexts and polysemy and propose a three-fold approach (entitled K-Grassmeans) to model target polysemous words in an unsupervised fashion: (a) we represent a context (the non-function words surrounding the target word) by a low rank subspace, (b) we induce word senses by clustering the subspaces in terms of a distance to an intersection vector, and (c) we represent a lexeme (a word and sense pair) by labeling the corpus. Our representations are novel and involve the nonlinear (Grassmannian) geometry of subspaces, and the clustering algorithms are designed to harness this specific geometry. The overall performance of the method is evaluated quantitatively on standardized word sense induction and word similarity tasks, and we present new state-of-the-art results. Several new avenues of research in natural language representations arise from the ideas in this work, and we discuss a few in detail below.

• Interactions between Polysemous Words: One of the findings of this work, via the experiments in Section 6, is that polysemous words interact with each other in the corpus.
One natural way to harness these interactions, and hence to sharpen K-Grassmeans, is an iterative labeling procedure. Both hard decoding and soft decoding (discussed in Section 4) can benefit from iterations. In hard decoding, the "IDK" labels may be resolved over multiple iterations, since (a) rare senses can become dominant senses once the major senses are already labeled, and (b) a confusing sense can be disambiguated once the polysemous words in its context are disambiguated. In soft decoding, the probability can be expected to concentrate on one sense, since each iteration yields yet more precise context word embeddings. This hypothesis is inspired by the success of such procedures in the message passing algorithms for turbo and LDPC codes in reliable wireless communication, which share a fair bit of commonality with the setting of polysemy disambiguation [32]. A quantitative tool to measure the disambiguation improvements from iterations and to decide when to stop the iterative process (akin to the EXIT charts for message passing iterative decoders of LDPC codes [36]) is an interesting research direction, as is a detailed algorithm design of the iterative decoding and its implementation; both of these are beyond the scope of this paper.

• Low Dimensional Context Representation: A surprising finding of this work is that contexts (sentences) that contain a common target word tend to reside in a low dimensional subspace, as justified via empirical observations in Figure 1. Understanding this geometric phenomenon in the context of a generative model (for instance, RAND-WALK of [1] is not able to explain this) is a basic problem of interest, with several relevant applications (including language modeling [14, 30] and semantic parsing of sentences (textual entailment, for example [10, 5, 12])).
Such an understanding could also provide new ideas for the topical subject of representing sentences and paragraphs [18, 38, 15, 17], and eventually for combining with document/topic modeling methods such as Latent Dirichlet Allocation [7].

• Combining Linguistic Resources: Presently the methods of lexeme representation are either exclusively external-resource-based or entirely unsupervised. The unsupervised method of K-Grassmeans reveals robust clusters of senses (and also provides a soft score measuring the robustness of each identified sense, in terms of how frequent the sense is and how sharply/crisply it is used). On the other hand, WordNet lists a very detailed number of senses, some frequent and robust but many others very fine-grained; the lack of any accompanying metric relating to the frequency and robustness of a sense (which could potentially be domain/corpus specific) makes this resource hard to use computationally, at least within the context of polysemy representations. An interesting research direction would be to combine K-Grassmeans and existing linguistic resources to automatically define senses of multiple granularities, along with metrics relating to the frequency and robustness of the identified senses.

REFERENCES

[1] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399, 2016.

[2] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. CoRR, abs/1601.03764, 2016.

[3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027–1035, 2007.
[4] Javier Artiles, Enrique Amigó, and Julio Gonzalo. The role of named entities in web people search. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, 6-7 August 2009, Singapore, pages 534–542, 2009.

[5] Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, volume 6, pages 6–4, 2006.

[6] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

[7] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[8] Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1025–1035, 2014.

[9] Asaf Cidon, Kanthi Nagaraj, Sachin Katti, and Pramod Viswanath. Flashback: decoupled lightweight wireless control. In ACM SIGCOMM 2012 Conference, SIGCOMM '12, Helsinki, Finland, August 13-17, 2012, pages 223–234, 2012.

[10] Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification and Recognizing Textual Entailment, First PASCAL Machine Learning Challenges Workshop, MLCW 2005, Southampton, UK, April 11-13, 2005, Revised Selected Papers, pages 177–190, 2005.

[11] J. Firth. A synopsis of linguistic theory 1930-1955.
In Studies in Linguistic Analysis. Philological Society, Oxford, 1957. Reprinted in Palmer, F. (ed., 1968) Selected Papers of J. R. Firth, Longman, Harlow.

[12] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9. Association for Computational Linguistics, 2007.

[13] Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Volume 1: Long Papers, pages 873–882, 2012.

[14] Rukmini Iyer, Mari Ostendorf, and Jan Robin Rohlicek. Language modeling with sentence-level mixtures. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, USA, March 8–11, 1994, 1994.

[15] Tom Kenter, Alexey Borisov, and Maarten de Rijke. Siamese CBOW: optimizing word embeddings for sentence representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Volume 1: Long Papers, 2016.

[16] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1746–1751, 2014.

[17] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pages 957–966, 2015.

[18] Quoc V. Le and Tomas Mikolov.
Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 1188–1196, 2014.

[19] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 2177–2185, 2014.

[20] Jiwei Li and Dan Jurafsky. Do multi-sense embeddings improve natural language understanding? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 1722–1732, 2015.

[21] Suresh Manandhar, Ioannis P. Klapaftis, Dmitriy Dligach, and Sameer Pradhan. SemEval-2010 task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation (SemEval@ACL 2010), pages 63–68, 2010.

[22] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[23] Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, pages 1045–1048, 2010.

[24] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[25] Andriy Mnih and Geoffrey E. Hinton. Three new graphical models for statistical language modelling. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), pages 641–648, 2007.
[26] Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1059–1069, 2014.

[27] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM, 59(6):28, 2012.

[28] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532–1543, 2014.

[29] Sameer S. Pradhan and Nianwen Xue. OntoNotes: The 90% solution. In Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2009), Tutorial Abstracts, pages 11–12, 2009.

[30] Klen Copic Pucihar, Matjaz Kljun, John Mariani, and Alan Dix. An empirical study of long-term personal project information management. Aslib Journal of Information Management, 68(4):495–522, 2016.

[31] Joseph Reisinger and Raymond J. Mooney. Multi-prototype vector-space models of word meaning. In Human Language Technologies: Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), pages 109–117, 2010.

[32] Thomas J. Richardson and Rüdiger L. Urbanke. Modern Coding Theory. Cambridge University Press, 2008.

[33] Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure.
In EMNLP-CoNLL 2007: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 410–420, 2007.

[34] Sascha Rothe and Hinrich Schütze. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL 2015), Volume 1: Long Papers, pages 1793–1803, 2015.

[35] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL 2015), Volume 1: Long Papers, pages 1556–1566, 2015.

[36] Stephan ten Brink. Convergence behavior of iteratively decoded parallel concatenated codes. IEEE Transactions on Communications, 49(10):1727–1737, 2001.

[37] Jean Véronis. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3):223–252, 2004.

[38] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198, 2015.

Supplementary Material: Geometry of Polysemy

A CONTEXT REPRESENTATION ALGORITHM

The pseudocode for context representation (c.f. Section 2) is provided in Algorithm 1.

Algorithm 1: The algorithm for context representation.
Input: a context $c$, word embeddings $v(\cdot)$, and a PCA rank $N$.
1: Compute the first $N$ principal components of the samples $\{v(w') : w' \in c\}$,
$$u_1, \ldots, u_N \leftarrow \mathrm{PCA}(\{v(w') : w' \in c\}), \qquad S \leftarrow \Big\{ \sum_{n=1}^{N} \alpha_n u_n : \alpha_n \in \mathbb{R} \Big\}.$$
Output: $N$ orthonormal basis vectors $u_1, \ldots, u_N$ and the subspace $S$.

B SENSE INDUCTION ALGORITHM

The pseudocode for word sense induction (c.f. Section 3.2) is provided in Algorithm 2.

Algorithm 2: Word sense induction algorithm with a given number of senses.
Input: contexts $c_1, \ldots, c_M$ and an integer $K$.
1: Initialize unit-length vectors $u_1, \ldots, u_K$ randomly; initialize $L \leftarrow \infty$.
2: while $L$ has not converged do
3:   Expectation step: assign each subspace to the group whose intersection direction is closest,
$$S_k \leftarrow \{ c_m : d(u_k, S(c_m \setminus w)) \le d(u_{k'}, S(c_m \setminus w)) \;\; \forall k' \},$$
4:   Maximization step: update each intersection direction to minimize the distances to all subspaces in its group, as in (1),
$$u_k \leftarrow \arg\min_{u} \sum_{c \in S_k} d^2(u, S(c \setminus w)), \qquad L \leftarrow \sum_{k=1}^{K} \sum_{c \in S_k} d^2(u_k, S(c \setminus w)).$$
5: end while
Output: $K$ intersection directions $u_1, \ldots, u_K$.

C SENSE DISAMBIGUATION ALGORITHM

The pseudocode for word sense disambiguation (c.f. Section 3.3) is provided in Algorithm 3.

D V-MEASURE AND PAIRED F-SCORE

Clustering algorithms partition $N$ data points into a set of clusters $\mathcal{K} = \{1, 2, \ldots, K\}$. Given the ground truth, i.e., another partition of the data into clusters $\mathcal{T} = \{1, 2, \ldots, T\}$, the performance of a clustering algorithm is evaluated via a contingency table $A = (a_{tk})$, where $a_{tk}$ is the number of data points whose ground-truth label is $t$ and whose algorithm label is $k$. There are two intrinsic properties of all desirable evaluation measures:

Algorithm 3: The algorithm for sense disambiguation.
Input: a context $c$, word embeddings $v(\cdot)$, a PCA rank $N$, a set of intersection directions $u_k(w)$, and a threshold $\theta$.
1: Compute the denoised context subspace, $S \leftarrow S(c \setminus w)$.
2: Compute the distances between $S(c \setminus w)$ and the intersection directions, $d_k \leftarrow d(u_k, S)$.
3: if hard decoding then
4:   Find the closest cluster, $k^* \leftarrow \arg\min_k d_k$.
5:   Check the threshold:
6:   if $d_{k^*} < \theta$ then
7:     return $k^*$;
8:   end if
9:   return IDK;
10: end if
11: if soft decoding then
12:   Compute the probabilities,
$$P(w, c, k) \leftarrow \frac{\exp(-10\, d(u_k(w), S(c \setminus w)))}{\sum_{k'} \exp(-10\, d(u_{k'}(w), S(c \setminus w)))},$$
13: end if
14: return $P(w, c, k)$ for $k = 1, \ldots, K$;
Output: a label indicating which sense the word takes in this context.

• The measure should be permutation-invariant, i.e., the measure should be unchanged if we permute the labels in $\mathcal{K}$ or $\mathcal{T}$.
• The measure should encourage intra-cluster similarity and penalize inter-cluster similarity.

V-Measure [33] and paired F-score [4] are two standard measures, the definitions of which are given below.

D.1 V-MEASURE

V-Measure is an entropy-based metric, defined as the harmonic mean of homogeneity and completeness.

• Homogeneity is satisfied if the data points belonging to one algorithm cluster fall into a single ground-truth cluster, formally defined as
$$h = \begin{cases} 1 & \text{if } H(\mathcal{T}) = 0 \\ 1 - \frac{H(\mathcal{T}|\mathcal{K})}{H(\mathcal{T})} & \text{otherwise,} \end{cases}$$
where
$$H(\mathcal{T}|\mathcal{K}) = -\sum_{k=1}^{K} \sum_{t=1}^{T} \frac{a_{tk}}{N} \log \frac{a_{tk}}{\sum_{t'=1}^{T} a_{t'k}}, \qquad H(\mathcal{T}) = -\sum_{t=1}^{T} \frac{\sum_{k=1}^{K} a_{tk}}{N} \log \frac{\sum_{k=1}^{K} a_{tk}}{N}.$$

• Completeness is analogous to homogeneity. Formally, it is defined as
$$c = \begin{cases} 1 & \text{if } H(\mathcal{K}) = 0 \\ 1 - \frac{H(\mathcal{K}|\mathcal{T})}{H(\mathcal{K})} & \text{otherwise,} \end{cases}$$
where
$$H(\mathcal{K}|\mathcal{T}) = -\sum_{t=1}^{T} \sum_{k=1}^{K} \frac{a_{tk}}{N} \log \frac{a_{tk}}{\sum_{k'=1}^{K} a_{tk'}}, \qquad H(\mathcal{K}) = -\sum_{k=1}^{K} \frac{\sum_{t=1}^{T} a_{tk}}{N} \log \frac{\sum_{t=1}^{T} a_{tk}}{N}.$$

Given $h$ and $c$, the V-Measure is their harmonic mean, i.e.,
$$V = \frac{2hc}{h + c}.$$

D.2 PAIRED F-SCORE

The paired F-score evaluates clustering performance by converting the clustering into a binary classification problem: given two instances, do they belong to the same cluster or not?
For each cluster $k$ identified by the algorithm, we generate $\binom{\sum_{t=1}^{T} a_{tk}}{2}$ instance pairs, and for each ground-truth cluster $t$, we generate $\binom{\sum_{k=1}^{K} a_{tk}}{2}$ instance pairs. Let $F(\mathcal{K})$ be the set of instance pairs from the algorithm clusters and let $F(\mathcal{T})$ be the set of instance pairs from the ground-truth clusters. Precision and recall are defined accordingly:
$$P = \frac{|F(\mathcal{K}) \cap F(\mathcal{T})|}{|F(\mathcal{K})|}, \qquad R = \frac{|F(\mathcal{K}) \cap F(\mathcal{T})|}{|F(\mathcal{T})|},$$
where $|F(\mathcal{K})|$, $|F(\mathcal{T})|$ and $|F(\mathcal{K}) \cap F(\mathcal{T})|$ can be computed from the matrix $A$ as
$$|F(\mathcal{K})| = \sum_{k=1}^{K} \binom{\sum_{t=1}^{T} a_{tk}}{2}, \qquad |F(\mathcal{T})| = \sum_{t=1}^{T} \binom{\sum_{k=1}^{K} a_{tk}}{2}, \qquad |F(\mathcal{K}) \cap F(\mathcal{T})| = \sum_{t=1}^{T} \sum_{k=1}^{K} \binom{a_{tk}}{2}.$$

E ALGORITHMS FOR POLICE LINEUP TASK

We first introduce the baseline algorithm for word2vec and our algorithm, given in Algorithms 4 and 5. Both algorithms are motivated by the algorithm in [2]. In our algorithm, the similarity score can be thought of as a mean value of the word similarities between a target word $w$ and a definition word $w'$ in the given sense $L$; we take the power mean with $p = 2$. Our algorithm can be adapted to different choices of $p$, i.e.,
$$\mathrm{score}_p(s_k(w), L) \leftarrow \Big( \sum_{w' \in L} \big| v_k(w)^{\mathsf{T}} v(w') \big|^p \Big)^{1/p} - \frac{1}{|V|} \sum_{w'' \in V} \Big( \sum_{w' \in L} \big| v_k(w'')^{\mathsf{T}} v(w') \big|^p \Big)^{1/p}.$$
Different choices of $p$ lead to different weightings of the similarities between $w$ and $w' \in L$; generally speaking, larger $p$ puts larger weight on the most relevant words:
• If we take $p = 1$, $\mathrm{score}_1(w, L)$ becomes an average of the similarities;
• If we take $p = \infty$, $\mathrm{score}_\infty(w, L)$ measures the similarity between $w$ and $L$ by the similarity between $w$ and the single most relevant word $w' \in L$, i.e.,
$$\mathrm{score}_\infty(s_k(w), L) \leftarrow \max_{w' \in L} \big| v_k(w)^{\mathsf{T}} v(w') \big|.$$
In our case we take $p = 2$ to allow enough (but not too much) influence from the most relevant words.

Algorithm 4: The baseline algorithm with word2vec for the Princeton Police Lineup Task.
Input: a target word $w$, $m$ senses $L_1, \ldots, L_m$, and word vectors $v(w')$ for every $w'$ in the vocabulary $V$.
1: Compute a similarity score between $w$ and each sense $L_i$,
$$\mathrm{score}(w, L_i) \leftarrow \sqrt{\sum_{w' \in L_i} \big( v(w)^{\mathsf{T}} v(w') \big)^2} - \frac{1}{|V|} \sum_{w''} \sqrt{\sum_{w' \in L_i} \big( v(w'')^{\mathsf{T}} v(w') \big)^2}.$$
2: return the top $k$ senses $L$ with the highest similarity scores.
Output: $k$ candidate senses.

Algorithm 5: Our algorithm for the Princeton Police Lineup Task.
Input: a target word $w$, $m$ senses $L_1, \ldots, L_m$, lexeme representations $v_k(w)$ for every sense of $w$, and word representations $v(w')$ for every $w'$ in the vocabulary.
1: candidates ← list()
2: scores ← list()
3: for every sense $s_k(w)$ of $w$ do
4:   for every $i = 1, 2, \ldots, m$ do
5:     Compute a similarity score between the lexeme $s_k(w)$ and the sense $L_i$,
$$\mathrm{score}(s_k(w), L_i) \leftarrow \sqrt{\sum_{w' \in L_i} \big( v_k(w)^{\mathsf{T}} v(w') \big)^2} + \sqrt{\sum_{w' \in L_i} \big( v(w)^{\mathsf{T}} v(w') \big)^2} - \frac{1}{|V|} \sum_{w'' \in V} \sqrt{\sum_{w' \in L_i} \big( v(w'')^{\mathsf{T}} v(w') \big)^2},$$
6:   end for
7:   $L_{i_1}, L_{i_2} \leftarrow$ top 2 senses.
8:   candidates ← candidates + $L_{i_1}$ + $L_{i_2}$
9:   scores ← scores + $\mathrm{score}(s_k(w), L_{i_1})$ + $\mathrm{score}(s_k(w), L_{i_2})$
10: end for
11: return the top $k$ senses $L$ with the highest similarity scores.
Output: $k$ candidate senses.

Neither our algorithm nor the algorithm in [2] takes into account that one atom represents one sense of the target word, and thus some atoms might generate two senses in the output $k$ candidates. A more sophisticated algorithm is required to address this issue.
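To make the geometric primitives of Algorithms 1–3 concrete, the following numpy sketch (ours, for illustration; not the authors' released code) implements the PCA context subspace, the distance $d(u, S)$ from a unit vector to a subspace, and the K-Grassmeans alternating minimization. The M-step uses the fact that the minimizer of $\sum_c d^2(u, S(c \setminus w))$ over unit vectors $u$ is the top eigenvector of $\sum_c B_c B_c^{\mathsf{T}}$, where $B_c$ is an orthonormal basis of the subspace; whether the word vectors are mean-centered before the PCA step is an implementation detail we leave out here.

```python
import numpy as np

def context_subspace(word_vectors, rank=3):
    # Algorithm 1 sketch: span of the top principal directions of the
    # context's word vectors, computed via SVD (no mean-centering here).
    X = np.asarray(word_vectors, dtype=float)           # (num_words, dim)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:rank].T                                  # (dim, rank), orthonormal columns

def distance_to_subspace(u, basis):
    # d(u, S) = ||u - proj_S(u)|| for a unit vector u and an orthonormal basis of S.
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    coords = basis.T @ u
    return float(np.sqrt(max(0.0, 1.0 - coords @ coords)))

def k_grassmeans(subspaces, K, n_iters=50, seed=0):
    # Alternating minimization in the spirit of Algorithm 2: assign each
    # context subspace to its nearest intersection direction (E-step), then
    # refit each direction (M-step) as the top eigenvector of the summed
    # projection matrices of its member subspaces.
    rng = np.random.default_rng(seed)
    dim = subspaces[0].shape[0]
    U = rng.normal(size=(K, dim))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    labels = np.zeros(len(subspaces), dtype=int)
    for _ in range(n_iters):
        dists = np.array([[distance_to_subspace(U[k], B) for k in range(K)]
                          for B in subspaces])
        labels = dists.argmin(axis=1)
        for k in range(K):
            members = [subspaces[i] for i in np.flatnonzero(labels == k)]
            if members:
                M = sum(B @ B.T for B in members)
                U[k] = np.linalg.eigh(M)[1][:, -1]      # top eigenvector
    return U, labels
```

Hard-decoding disambiguation (Algorithm 3) then reduces to evaluating `distance_to_subspace` against each learned direction and thresholding the minimum.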
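The evaluation measures of Appendix D can be computed directly from the contingency table $A$. The sketch below is an illustrative implementation of ours (not the evaluation scripts of [33] or [4]); it uses the identities $H(\mathcal{T}|\mathcal{K}) = H(\mathcal{T}, \mathcal{K}) - H(\mathcal{K})$ and $H(\mathcal{K}|\mathcal{T}) = H(\mathcal{T}, \mathcal{K}) - H(\mathcal{T})$ to avoid computing the conditional entropies term by term:

```python
import numpy as np
from math import comb

def v_measure(A):
    # V-Measure from a contingency table A[t, k] (ground-truth label t,
    # algorithm label k): harmonic mean of homogeneity and completeness.
    A = np.asarray(A, dtype=float)
    p = A / A.sum()                         # joint distribution over (t, k)

    def H(q):                               # entropy, ignoring zero cells
        q = q[q > 0]
        return float(-(q * np.log(q)).sum())

    H_T = H(p.sum(axis=1))                  # entropy of ground-truth labels
    H_K = H(p.sum(axis=0))                  # entropy of algorithm labels
    H_joint = H(p.ravel())
    h = 1.0 if H_T == 0 else 1.0 - (H_joint - H_K) / H_T   # H(T|K) = H(T,K) - H(K)
    c = 1.0 if H_K == 0 else 1.0 - (H_joint - H_T) / H_K   # H(K|T) = H(T,K) - H(T)
    return 0.0 if h + c == 0 else 2.0 * h * c / (h + c)

def paired_f_score(A):
    # Paired F-score: precision/recall over same-cluster instance pairs.
    A = np.asarray(A, dtype=int)
    pairs_K = sum(comb(int(n), 2) for n in A.sum(axis=0))   # |F(K)|
    pairs_T = sum(comb(int(n), 2) for n in A.sum(axis=1))   # |F(T)|
    both = sum(comb(int(n), 2) for n in A.ravel())          # |F(K) ∩ F(T)|
    if pairs_K == 0 or pairs_T == 0:
        return 0.0
    P, R = both / pairs_K, both / pairs_T
    return 0.0 if P + R == 0 else 2.0 * P * R / (P + R)
```

For example, a perfect two-cluster partition scores 1.0 on both measures, while lumping everything into a single algorithm cluster drives homogeneity (and hence V-Measure) to 0 even though completeness stays at 1.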
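The $\mathrm{score}_p$ rule used in Algorithms 4 and 5 is also straightforward to vectorize. The following sketch (our own naming and layout, shown only to illustrate the scoring rule) computes the $p$-norm aggregation over one sense's definition words together with the vocabulary-average correction term:

```python
import numpy as np

def power_mean_score(lexeme_vec, sense_def_vecs, vocab_vecs, p=2):
    # score_p(s_k(w), L): p-norm-style aggregation of |v_k(w)^T v(w')| over
    # the definition words w' of sense L, minus the vocabulary-wide average
    # of the same quantity, so that senses whose definition words are
    # similar to everything are not favored. As written, p = 1 gives the
    # plain sum of similarities, and p -> infinity approaches the max.
    sims = np.abs(sense_def_vecs @ lexeme_vec)            # |v_k(w)^T v(w')|, shape (|L|,)
    raw = float((sims ** p).sum() ** (1.0 / p))
    all_sims = np.abs(vocab_vecs @ sense_def_vecs.T)      # shape (|V|, |L|)
    baseline = float(((all_sims ** p).sum(axis=1) ** (1.0 / p)).mean())
    return raw - baseline
```

With $p = 2$, as in the paper, the most relevant definition words carry amplified influence without any single word dominating the score.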
