The Expressive Power of Word Embeddings

Yanqing Chen (cyanqing@cs.stonybrook.edu)
Bryan Perozzi (bperozzi@cs.stonybrook.edu)
Rami Al-Rfou' (ralfrou@cs.stonybrook.edu)
Steven Skiena (skiena@cs.stonybrook.edu)
Computer Science Dept., Stony Brook University, Stony Brook, NY 11794

Abstract

We seek to better understand the information encoded in word embeddings. We propose several tasks that help to distinguish the characteristics of different publicly released embeddings. Our evaluation shows that embeddings are able to capture surprisingly nuanced semantics even in the absence of sentence structure. Moreover, benchmarking the embeddings shows great variance in the quality and characteristics of the semantics captured by the tested embeddings. Finally, we show the impact of varying the number of dimensions and the resolution of each dimension on the effective useful features captured by the embedding space. Our contributions highlight the importance of embeddings for NLP tasks and the effect of their quality on the final results.

1. Introduction

Distributed word representations (embeddings) capture semantic and syntactic features of words from a raw text corpus without human intervention or language-dependent processing. The features that embeddings capture are task-independent, which makes them ideal for language modeling. However, embeddings are hard to interpret and understand. Despite efforts to visualize word embeddings (Van der Maaten and Hinton, 2008), points in high-dimensional spaces carry a lot of information that is hard to quantify.
Additionally, publicly available embeddings generated by multiple research groups use different data and training procedures, and there is not yet an understanding of the best way to learn these representations.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

In this paper, we investigate four publicly released word embeddings: (1) HLBL, (2) SENNA, (3) Turian's and (4) Huang's. We use context-free classification tasks rather than sequence labeling tasks (such as part-of-speech tagging) to isolate the effects of context in making decisions and to eliminate the complexity of the learning methods. Specifically, our work makes the following contributions:

• We show through evaluation that embeddings are able to capture semantics in the absence of sentence structure, and that there is a difference in the characteristics of the publicly released word embeddings.
• We explore the impact of the number of dimensions and the resolution of each dimension on the quality of the information that can be encoded in the embedding space. This reveals the minimum effective space needed to capture the useful information in the embeddings.
• We demonstrate the importance of word pair orientation in encoding useful linguistic information. We run two pair classification tasks and provide an example with one of them where pair performance greatly exceeds that of individual words.

The rest of the work proceeds as follows: first we describe the word embeddings we consider. Next we discuss our classification experiments and present their results. Finally we discuss the effects of scaling down the size of the embedding space.

2. Related Work

The original work on generating word embeddings was presented by Bengio et al. (Bengio et al., 2003a).
The embeddings were a secondary output of training a language model. Since (Bengio et al., 2003a), there has been significant interest in speeding up the generation process (Bengio et al., 2003b; 2009). These original language models were evaluated using perplexity. We argue that while perplexity is a good metric for language modeling, it offers little insight into how well the embeddings capture diverse types of information.

SENNA's embeddings (Collobert, 2011) are generated using a model that is discriminative and non-probabilistic. In each training update, an n-gram is read from the corpus and the learned embeddings of its n words are concatenated. Then a corrupted n-gram is created by replacing the word in the middle with a random one from the vocabulary. On top of the two phrases, the model learns a scoring function that scores the original phrase higher than the corrupted one. The loss function used for training is hinge loss. (Collobert et al., 2011) shows that embeddings are able to perform well on several NLP tasks in the absence of any other features. The NLP tasks considered by SENNA all consist of sequence labeling, which implies that the model might learn from sequence dependencies. Our work enriches the discussion by focusing on term classification problems.

In (Turian et al., 2010), Turian et al. duplicated the SENNA embeddings with some differences; they corrupt the last word of each n-gram instead of the word in the middle. They also show that using embeddings in conjunction with typical NLP features improves performance on the Named Entity Recognition task. An additional result of (Turian et al., 2010) shows that most of the embeddings have a similar effect when added to an existing NLP task. This gives the wrong impression.
Our work illustrates that not all embeddings are created equal, and that there are significant differences in the information captured by each publicly released model.

Mnih and Hinton (Mnih and Hinton, 2007) proposed a log-bilinear loss function to model language. Given an n-gram, the model concatenates the embeddings of the first n-1 words and learns a linear model to predict the embedding of the last word. Mnih and Hinton later proposed the hierarchical log-bilinear (HLBL) model (Mnih and Hinton, 2009), which speeds up model evaluation during training and testing by using a hierarchical approach (similar to (Morin and Bengio, 2005)) that prunes the search space for the next word by dividing the prediction into a series of predictions that filter regions of the space. The language model is ultimately evaluated using perplexity.

A fundamental challenge for neural language models involves representing words which have multiple meanings. In (Huang et al., 2012), Huang et al. incorporate global context to deal with the challenges raised by such words.

Recent work by Mikolov et al. (Mikolov et al., 2013) investigates linguistic regularities captured by the relative positions of points in the embedding space. Our results regarding pair classification are complementary.

3. Experimental Setup

We construct three term classification problems and two pair classification problems to quantify the quality of the embeddings.

3.1. Evaluation Tasks

Our evaluation tasks are as follows:

• Sentiment Polarity: We use Lydia's sentiment lexicon (Godbole et al., 2007) to create sets of words which have positive or negative connotations and construct the 2-class sentiment polarity test. The data size is 6923 words.
• Noun Gender: We use Bergsma's data set (Bergsma and Lin, 2006) to compile a list of masculine and feminine proper nouns.
Names that co-refer more frequently with she/he are considered feminine/masculine, respectively. Strings that co-refer the most with it, appear fewer than 300 times in the corpus, or consist of multiple words are ignored. The total size is 2133 words.
• Plurality: We use WordNet (Fellbaum, 2010) to extract nouns in their singular and plural forms. The data consists of 3012 words.
• Synonyms and Antonyms: We use WordNet to extract synonym and antonym pairs and check whether we can tell one kind from the other. The relation is symmetric, so we include each word pair together with its order-reversed counterpart. There are 3446 different word pairs.
• Regional Spellings: We collect words that differ in spelling between UK English and their American counterparts from an online source (Limited, 2009). We make this a pair classification task to emphasize relative distances between embeddings. We have 1565 pairs in this task.

We ensure that the class labels are balanced for all tasks. This allows our baseline to be either the random classifier or the most-frequent-label classifier; either gives an accuracy of 50%.

Table 1 shows examples from each of the 2-class evaluation tasks. The classifier is asked to identify which of the classes a term or pair belongs to.

Table 1. Example input from each task.

| Sentiment: Positive | Sentiment: Negative | Noun Gender: Feminine | Noun Gender: Masculine | Plurality: Plural | Plurality: Singular |
|---|---|---|---|---|---|
| good | bad | Ada | Steve | cats | cat |
| talent | stupid | Irena | Roland | tables | table |
| amazing | flaw | Linda | Leonardo | systems | system |

| Synonyms and Antonyms: Synonyms | Synonyms and Antonyms: Antonyms | Regional Spellings: UK / US |
|---|---|---|
| store / shop | rear / front | colour / color |
| virgin / pure | polite / impolite | driveable / drivable |
| permit / license | friend / foe | smash-up / smashup |

3.2. Embeddings' Datasets

We choose the following publicly available embeddings datasets for evaluation.

• SENNA's embeddings cover 130,000 words with 50 dimensions for each word.
• Turian's embeddings cover 268,810 words, each represented with either 25, 50 or 100 dimensions.
• HLBL's embeddings cover 246,122 words. These embeddings were trained on the same data used for Turian's embeddings for 100 epochs (7 days), and have been induced in 50 or 100 dimensions.
• Huang's embeddings cover 100,232 words in 50 dimensions. Huang's embeddings require context to disambiguate which prototype to use for a word. Our tasks are context-free, so we average the multiple prototypes to a single point in the space. (This was the approach which worked best in our testing.)

It should be emphasized that each of these models has been induced under substantially different training parameters. Each model has its own vocabulary, used a different context size, and was trained for a different number of epochs on its training set. While the control of these variables is outside the scope of this study, we hope to mitigate one of these challenges by running our experiments on the vocabulary shared by all these embeddings. The size of this shared vocabulary is 58,411 words.

3.3. Classification

For classification we used logistic regression and an SVM with the RBF kernel as linear and non-linear classifiers, respectively. Model selection is done by running a grid search over the parameter space with the help of the development data. All experiments were written using the Python package Scikit-Learn (Pedregosa et al., 2011). For the term classification tasks we offered the classifier only the embedding of the word as input. For pairwise experiments, the input consists of the embeddings of the two words concatenated.
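The setup just described can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names (`term_features`, `pair_features`, `make_classifiers`), the toy embedding table and the parameter grids are all assumptions, and a cross-validated grid search stands in for the development-set procedure used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def term_features(words, emb):
    """Stack the embedding of each word into an (n_words, dim) matrix."""
    return np.vstack([emb[w] for w in words])


def pair_features(pairs, emb):
    """Concatenate the two embeddings of each pair: (n_pairs, 2 * dim)."""
    return np.vstack([np.concatenate([emb[a], emb[b]]) for a, b in pairs])


def make_classifiers():
    """Linear and non-linear classifiers, each wrapped in a small grid search.

    The grids below are illustrative assumptions, not values from the paper.
    """
    linear = GridSearchCV(LogisticRegression(max_iter=1000),
                          {"C": [0.01, 0.1, 1, 10]}, cv=3)
    rbf = GridSearchCV(SVC(kernel="rbf"),
                       {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}, cv=3)
    return {"logreg": linear, "rbf_svm": rbf}


# Toy stand-in for a real embedding table (random vectors, hypothetical words).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["cat", "cats", "table", "tables"]}

X_term = term_features(["cat", "cats"], emb)                         # term tasks
X_pair = pair_features([("cat", "cats"), ("table", "tables")], emb)  # pair tasks
```

With a real embedding table and labels, each classifier returned by `make_classifiers()` would be fitted on `X_term` or `X_pair` in the usual scikit-learn way.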
The average of four folds of cross-validation is used to evaluate the performance of each classifier on each task. 50%, 25% and 25% of the data are used as training, development and testing datasets, respectively, for evaluation and model selection.

4. Evaluation Results

Here we present the evaluation of both our term and pair classification results.

4.1. Term Classification

Figure 1 shows the results over all the 2-class term classification tasks using logistic regression and RBF-kernel SVM. Surprisingly, all the embeddings we considered did much better than the baseline, even on seemingly hard tests such as sentiment detection. Moreover, both the SENNA and Huang embeddings perform strongly. SENNA embeddings seem to capture the plurality relationship better, which may stem from the emphasis that the SENNA embeddings place on shallow syntactic features.

Figure 1. Results of the term-based tasks considered; shaded areas represent improvements using kernel SVM. [Bar chart: accuracy (0.4 to 1.0) on the Sentiment, Gender and Plurality tasks for SENNA, HLBL-50, HLBL-100, Turian-25, Turian-50, Turian-100 and Huang.]

Table 2 shows examples of words from the test datasets after classifying them using logistic regression on the SENNA embeddings. The top and bottom rows show the words that the classifier is confident in classifying, while the rows in the middle show the words that lie close to the decision boundary. For example, resilient could have positive or negative connotations in text; therefore, we find it close to the region where words are more neutral than polarized.

For SENNA, the best-performing task was the Plurality task. That explains the obvious contrast between the probabilities given to the words. The top words are given almost 100% probability and the bottom ones are given almost 0%.
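A ranking of this kind (confident words at the top and bottom, borderline words in the middle) can be produced from the class probabilities of a fitted logistic regression. The sketch below is only an illustration under stated assumptions: the data are synthetic, the word labels are placeholders, and `rank_by_confidence` is a hypothetical helper rather than anything from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def rank_by_confidence(clf, words, X):
    """Return (word, P(positive class)) pairs, most positive first."""
    proba = clf.predict_proba(X)[:, 1]   # column 1 corresponds to clf.classes_[1]
    order = np.argsort(-proba)
    return [(words[i], float(proba[i])) for i in order]


# Synthetic stand-in for embeddings: class 1 is shifted along the first axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.repeat([0, 1], 100)
X[y == 1, 0] += 3.0
words = [f"w{i}" for i in range(200)]    # placeholder vocabulary

clf = LogisticRegression().fit(X, y)
ranked = rank_by_confidence(clf, words, X)

# Items whose probability is closest to 0.5 lie nearest the decision boundary.
borderline = sorted(ranked, key=lambda item: abs(item[1] - 0.5))[:5]
```

Reading `ranked` from both ends gives the confidently classified words, while `borderline` corresponds to the middle rows of a table like Table 2.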
The results of the regional spelling task are shown here in the term-wise setup. Despite not performing as well as the pair-wise spelling task, we can see that the classifier shows meaningful results. We can clearly notice that the British spellings of words favor the usage of hyphens, s over z, and ll over l.

Table 2. Examples of the results of the logistic regression classifier on different tasks. Prob is the probability (in %) of the Positive class (Sentiment) or the British class (Regional Spelling); 1.0 - Prob is the probability of the Negative or American class.

| Sentiment word | Prob (Positive) | Regional Spelling word | Prob (British) |
|---|---|---|---|
| world-famous | 99.85 | kick-off | 92.37 |
| award-winning | 99.83 | hauliers | 91.54 |
| high-quality | 99.83 | re-exported | 89.46 |
| achievement | 99.81 | bullet-proof | 88.69 |
| athletic | 99.81 | initialled | 88.42 |
| resilient | 50.14 | paralysed | 50.16 |
| ragged | 50.11 | italicized | 50.04 |
| discriminating | 50.10 | exorcise | 50.03 |
| stout | 49.97 | fusing | 49.90 |
| lose | 49.83 | lacklustre | 49.78 |
| bored | 49.81 | subsidizing | 49.77 |
| bloodshed | 0.74 | signaling | 32.04 |
| burglary | 0.68 | hemorrhagic | 21.69 |
| robbery | 0.58 | tumor | 21.69 |
| panic | 0.45 | homologue | 19.53 |
| stone-throwing | 0.28 | localize | 17.50 |

4.2. Pair Classification

Sometimes, however, the choice to use pair classification can make quite a difference in the results. Figure 2a shows that classifying individual words according to their regional usage performs poorly. We can redefine the problem such that the classifier is asked to decide whether the first word in a pair of words is the American spelling or not. Figure 2a shows that performance improves a lot. This hints that the words under this criterion are not separable by a hyperplane in any subspace of the original embedding space. Instead, we draw a conclusion similar to (Mikolov et al., 2013): the pairs' positions relative to each other, not their absolute coordinates, are what encode such information, and a relationship between words is often indicated by the relative difference vector between the corresponding points.
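The following toy experiment illustrates why concatenated pairs can succeed where single words fail. It is a sketch under strong assumptions, not a reproduction of the paper's data: each synthetic "UK" vector is simply its "US" vector plus a small shared offset and a little noise, and all names (`term_data`, `pair_data`, `accuracy`) are hypothetical. Because the offset is small compared with the spread of the vectors, absolute positions say little, whereas a linear model over the concatenation can read off the difference between the two halves.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_words, dim = 500, 10

# Assumed toy geometry: UK vector = US vector + small shared offset + noise.
offset = np.zeros(dim)
offset[0] = 0.5
us = rng.normal(size=(n_words, dim))
uk = us + offset + rng.normal(scale=0.05, size=(n_words, dim))

train, test = slice(0, 400), slice(400, 500)   # split by word, not by example


def term_data(idx):
    """Single-word inputs: label 1 = US spelling, 0 = UK spelling."""
    X = np.vstack([us[idx], uk[idx]])
    y = np.r_[np.ones(len(us[idx])), np.zeros(len(uk[idx]))]
    return X, y


def pair_data(idx):
    """Concatenated pairs in both orders: label 1 = first word is the US spelling."""
    X = np.vstack([np.hstack([us[idx], uk[idx]]),
                   np.hstack([uk[idx], us[idx]])])
    y = np.r_[np.ones(len(us[idx])), np.zeros(len(us[idx]))]
    return X, y


def accuracy(make_data):
    """Fit logistic regression on the training words, score on held-out words."""
    X_tr, y_tr = make_data(train)
    X_te, y_te = make_data(test)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)


term_acc = accuracy(term_data)
pair_acc = accuracy(pair_data)
```

Under these assumptions `pair_acc` should come out far above `term_acc`, mirroring the qualitative gap reported for the spelling task; real embeddings are of course not guaranteed to follow such a clean offset structure.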
In our previous Plurality test, the SENNA embeddings significantly outperformed Huang's. However, in our regional spelling task (which might seem similar), Huang's embeddings outperform SENNA in both the term and pair classification setups. We believe that Huang's approach of building word prototypes from significant differences in context provides a significant advantage on this task.

We note that it is surprising that neural language models may capture the relation between a synonym and an antonym. Both the language modeling of HLBL and the way that SENNA/Turian corrupted their examples favor words that can syntactically replace each other; e.g. bad can replace good as easily as excellent can. The result of this syntactic interchangeability is that both bad and excellent are close to good in the embedding space.

5. Information Reduction

Distributed word representations exist in continuous space, which is quite different from common language modeling techniques. Besides the powerful expressiveness that we demonstrated previously, another key advantage of distributed representations is their size: they require far less memory and disk storage than other techniques. In this section we seek to understand exactly how much space word embeddings need in order to serve as useful features. We also investigate whether the powerful representation that embeddings offer is a result of having real-valued coordinates or of the exponential number of regions which can be described using multiple independent dimensions.

5.1. Bitwise Truncation

To reduce the resolution of the real numbers that make up the embeddings matrix, we first scale them to 32-bit integer values, then divide the values by 2^b, where b is the number of bits we wish to remove. Finally, we scale the values back to lie between (−1, 1).
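One possible reading of this procedure is sketched below. The function name `truncate_bits`, the normalisation by the largest absolute value and the exact rescaling are assumptions made for illustration; the paper does not spell out these details. The rescaling is chosen so that removing 31 of 32 bits leaves exactly the two values −1 and 1.

```python
import numpy as np


def truncate_bits(emb, bits_removed, total_bits=32):
    """Quantise an embedding matrix, dropping `bits_removed` low-order bits.

    Assumes `emb` contains at least one non-zero value.
    """
    emb = np.asarray(emb, dtype=np.float64)
    unit = emb / np.max(np.abs(emb))              # scale into [-1, 1]
    half = 2 ** (total_bits - 1)
    q = np.floor(unit * half).astype(np.int64)    # signed integer grid
    q = np.clip(q, -half, half - 1)               # stay inside the 32-bit range
    q = q // (2 ** bits_removed)                  # floor-divide by 2^b
    levels = 2 ** (total_bits - bits_removed)     # distinct values left per dimension
    return (2 * q + 1) / (levels - 1)             # map bucket centres back to [-1, 1]


emb = np.array([[0.3, -0.7], [-0.01, 0.9]])
one_bit = truncate_bits(emb, 31)    # only the sign survives
full = truncate_bits(emb, 0)        # essentially the normalised input
```

With 31 bits removed, each dimension becomes a binary feature carrying only the sign of the original value, which is the extreme case discussed in the text.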
After this preprocessing we give the new values as features to our classifiers. In the extreme case, when we truncate 31 bits, the values will all be either 1 or −1.

Figure 3 shows that when we remove 31 bits (i.e., values are in {1, −1}), the performance of an embedding dataset drops by no more than 7%. This reduced resolution is equivalent to 2^50 regions (one bit in each of 50 dimensions) which can be encoded in the new space. This is still a huge resolution, but surprisingly it seems to be sufficient for solving the tasks we proposed. A naïve approximation of this trick which may be of interest is to simply take the sign of the embedding values as the representation of the embeddings themselves.

Figure 2. Results of the pair-based tests. (a) UK/US term vs. pair: the difference between treating the UK/US spellings as a single-word problem and using a pair of embeddings. (b) 2-class pair results; shaded areas represent improvements using kernel SVM. [Bar charts: accuracy (0.4 to 1.0) for SENNA, HLBL-50, HLBL-100, Turian-25, Turian-50, Turian-100 and Huang on (a) Spellings (term) and Spellings (pair), and (b) Synonym and Spellings.]

5.2. Principal Component Analysis

The bitwise truncation experiment indicates that the number of dimensions could be a key factor in the performance of the embeddings. To experiment on this further, we run PCA over the embeddings datasets to evaluate task performance on a reduced number of dimensions. Figure 4 shows that reducing the dimensions drops the accuracy of the classifiers significantly across all embedding datasets.
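A dimensionality sweep of this kind can be sketched with scikit-learn's PCA. The random matrix below merely stands in for a 50-dimensional embedding table, and `reduce_dimensions` is a hypothetical helper; the list of target dimensions follows the axis of Figure 4.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_dimensions(emb_matrix, n_components):
    """Project embeddings onto their top `n_components` principal components."""
    return PCA(n_components=n_components).fit_transform(emb_matrix)


rng = np.random.default_rng(0)
emb_matrix = rng.normal(size=(1000, 50))   # stand-in for a real embedding table

# One reduced copy of the embeddings per target dimensionality.
reduced = {k: reduce_dimensions(emb_matrix, k)
           for k in (1, 2, 3, 5, 10, 15, 20, 25, 50)}
```

Each reduced matrix would then be fed to the same classifiers as the full embeddings, so that accuracy can be compared across dimensionalities.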
Another key difference between the truncation experiment and the PCA experiment is that the truncation experiment may preserve relationships captured by non-linearities in the embedding space. Linear PCA cannot offer such guarantees, and this weakness may contribute to the difference in performance.

Figure 3. Results of reducing the precision of the embeddings, averaged by the geometric mean of classifiers. We note that after removing 31 bits, each dimension of the embeddings is a binary feature. [Line plot: accuracy (0.50 to 0.90) against bits remaining in each dimension (16, 12, 8, 4, 2, 1) for SENNA, HLBL-50, Turian-50 and Huang.]

Figure 4. Results of reducing the dimensions of the embeddings through PCA, averaged by the geometric mean. [Line plot: accuracy (0.50 to 0.90) against number of dimensions (1, 2, 3, 5, 10, 15, 20, 25, 50) for SENNA, HLBL-50, Turian-50 and Huang.]

6. Conclusion

Distributed word representations show a lot of promise for improving supervised and semi-supervised learning. The practical advantages of dense representations make them ideal for industrial applications and software development. Previous work mainly focused on speeding up the training process, with one metric for evaluation: perplexity. We show that this metric is not able to provide a nuanced view of their quality. We develop a suite of linguistically oriented tasks which might serve as part of a comprehensive benchmark for word embedding evaluation. The tasks focus on words, or pairs of them, in isolation from the actual text. The goal here is not to build a useful classifier so much as to understand how much supervised learning can benefit from the features which are encoded in the embeddings.
We succeed in showing that the publicly available datasets differ in their quality and usefulness, and our results are consistent across tasks and classifiers. Our future work will try to address the factors that lead to such diverse quality. The effect of training corpus size and the choice of objective function are two main areas where better understanding is needed.

While our tasks are simple, the differences in task performance shed light on the features encoded by embeddings. We showed that in addition to shallow syntactic features like plural and gender agreement, there are significant semantic partitions regarding sentiment and synonym/antonym meaning. Our current tasks focus on nouns and adjectives, and the suite of tasks has to be extended to include tasks that address verbs and other parts of speech.

Acknowledgments

This research was partially supported by NSF Grants DBI-1060572 and IIS-1017181, with additional support from TexelTek, Inc.

References

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003a.

Y. Bengio, J.S. Senécal, et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS Conference, 2003b.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.

Shane Bergsma and Dekang Lin. Bootstrapping path-based pronoun resolution. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 33–40, Sydney, Australia, July 2006. Association for Computational Linguistics.

R. Collobert. Deep learning for efficient discriminative parsing. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537, 2011, JMLR.org.

C. Fellbaum. WordNet. Theory and Applications of Ontology: Computer Applications, pages 231–243, 2010, Springer.

N. Godbole, M. Srinivasaiah, and S. Skiena. Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), volume 2, 2007.

Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 873–882, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.

Words Worldwide Limited. Word list of US/UK spelling variants, May 2009. URL http://www.wordsworldwide.co.uk/docs/Words-Worldwide-Word-

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, 2013.

A. Mnih and G. Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th international conference on Machine learning, pages 641–648. ACM, 2007.

A. Mnih and G.E. Hinton. A scalable hierarchical distributed language model. Advances in neural information processing systems, 21:1081–1088, 2009, Citeseer.

F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the international workshop on artificial intelligence and statistics, pages 246–252, 2005.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. Urbana, 51:61801, 2010.

L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85, 2008.

Supplemental Materials

3-class Tests

To strengthen these results, we performed a 3-class version of the sentiment test, in which we evaluated the ability to classify words as having positive, negative, or neutral sentiment value. The results are presented in Figure 5. The results are consistent with those from our 2-label test, and all embeddings perform much higher than the baseline score of 33%.

Figure 5. The performance on the 3-class version of the sentiment task; shaded areas represent improvements using kernel SVM. [Bar chart: accuracy (0.4 to 1.0) on the Sentiment and 3-Sentiment tasks for SENNA, HLBL-50, HLBL-100, Turian-25, Turian-50, Turian-100 and Huang.]

In order to investigate the depth to which synonyms and antonyms are captured, we conducted a 3-class version of the same test. We now evaluate between pairs of words that are synonyms, antonyms, or have no such relation. While such a task is much harder for the embeddings, the results in Figure 6 show that a nonlinear classifier can capture the relationship, particularly with the SENNA embeddings. An analysis of the confusion matrix for the nonlinear SVM showed that errors occurred roughly evenly between the classes.
We believe that this finding regarding the encoding of synonym/antonym relationships is an interesting contribution of our work.

Figure 6. The performance on the 3-class synonym/antonym task; shaded areas represent improvements using kernel SVM. [Bar chart: accuracy (0.4 to 1.0) on the Synonym and 3-Synonym tasks for SENNA, HLBL-50, HLBL-100, Turian-25, Turian-50, Turian-100 and Huang.]

Dimensional Reduction by Task

Figure 7. Results of reducing the precision of the embeddings, averaged by the geometric mean of classifiers across tasks. [Line plot: accuracy (0.50 to 0.90) against bits remaining in each dimension (16, 12, 8, 4, 2, 1) for the Spellings, Synonym, Gender, Plural and Sentiment tasks.]

Looking at Figure 8a, reducing the word embeddings to points on a real line deletes almost all the features that are relevant to pair classification and, to a lesser degree, the sentiment features. Despite the 10%-20% drop in accuracy in the Plurality and Gender tasks, the classification is still better than random. The results show that shallow syntactic features such as gender and number agreement are preserved at the expense of more subtle semantic features such as sentiment polarity. This gives us insight into what the hierarchical structure of the embedding space looks like. Shallow semantic features are present in all aspects of the space, and when PCA chooses to maximize this variance of the feature space it is at the expense of the other semantic properties.

We also illustrate this phenomenon in Figure 8b, by showing how the performance of the linear and non-linear classifiers converges for our harder tasks (sentiment and synonym) as we reduce the number of dimensions with PCA.

Figure 8. Results of reducing the dimensions of the embeddings through PCA, averaged by the geometric mean across tasks (8a). Figure 8b shows the difference between linear (dashed) and non-linear (solid) classifiers for our harder tasks (sentiment and synonym) and an easy task (plural). The performance of the linear and nonlinear classifiers converges as PCA removes more dimensions. This results in significantly degraded performance on nuanced tasks like sentiment analysis. [Line plots: accuracy (0.50 to 0.90) against number of dimensions (1, 2, 3, 5, 10, 15, 20, 25, 50); (a) By task: Spellings, Synonym, Gender, Plural, Sentiment; (b) Linear vs. Nonlinear: Synonym, Plural and Sentiment, each with RBF and linear classifiers.]
