IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

Yaza Wainakh, Department of Computer Science, TU Darmstadt, Darmstadt, Germany, yaza.wainakh@gmail.com
Moiz Rauf, Department of Computer Science, University of Stuttgart, Stuttgart, Germany, moiz.rauf@iste.uni-stuttgart.de
Michael Pradel, Department of Computer Science, University of Stuttgart, Stuttgart, Germany, michael@binaervarianz.de

Abstract: Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based analyses are semantic representations of identifiers, e.g., in the form of learned embeddings. The high-level goal of such a representation is to encode whether two identifiers, e.g., len and size, are semantically similar. Unfortunately, it is currently unclear to what extent semantic representations match the semantic relatedness and similarity perceived by developers. This paper presents IdBench, the first benchmark for evaluating semantic representations against a ground truth created from thousands of ratings by 500 software developers. We use IdBench to study state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions. Our results show that the effectiveness of semantic representations varies significantly and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing technique provides a satisfactory representation of semantic similarities, among other reasons because identifiers with opposing meanings are incorrectly considered to be similar, which may lead to fatal mistakes, e.g., in a refactoring tool. Studying the strengths and weaknesses of the different techniques shows that they complement each other.
As a first step toward exploiting this complementarity, we present an ensemble model that combines existing techniques and that clearly outperforms the best available semantic representation.

Index Terms: source code, neural networks, embeddings, identifiers, benchmark

I. INTRODUCTION

Identifier names play an important role in writing, understanding, and maintaining high-quality source code [1]. Because they convey information about the meaning of variables, functions, classes, and other program elements, developers often rely on identifiers to understand code written by themselves and others. Beyond developers, various automated techniques analyze, use, and improve identifier names. For example, identifiers have been used to find programming errors [2]–[5], to mine specifications [6], to infer types [7], [8], to predict the name of a method [9], or to complete partial code using a learned language model [10]. Techniques for improving identifier names pinpoint inappropriate names [11] and suggest more suitable names [12]. The basic idea of all these approaches is to infer the intended meaning of a piece of code from the natural language information in identifiers, possibly along with other information, such as the structure of code, data flow, and control flow. We here refer to program analyses that rely on identifier names as a primary source of information as name-based analyses.

This work was supported by the European Research Council (ERC, grant agreement 851895), and by the German Research Foundation within the ConcSys and Perf4JS projects.

Most name-based analyses reason about names in one of two ways. First, some approaches build upon string distance functions, such as the Levenshtein distance, sometimes in combination with algorithms for tokenizing names, e.g., based on underscore or camel-case notation [13]. Given a pair of identifiers, e.g., len and length, a string distance function yields a real-valued number that indicates to what extent the character sequences in the identifiers resemble each other. String distance functions are at the core of name-based analyses to detect name-related bugs [2], [3], to predict types [7], to improve identifier names [14], or to suggest appropriate names [15].

Second, an approach that has become popular more recently is to use neural network-learned embeddings of identifiers. An embedding maps each identifier into a continuous vector representation, so that similar identifiers are mapped to similar vectors. Embeddings implicitly define a similarity function via the cosine similarity of embedding vectors. For example, embeddings of identifiers are at the core of neural program analyses [16] to predict types [8], to detect bugs [4], to de-obfuscate code [17], to complete partial code [10], and to map API elements across programming languages [18].

The common aim of both string distance functions and embeddings of identifiers is to reason about the semantics of identifiers, and we hence call both of them semantic representations of identifiers, or semantic representations for short. The overall effectiveness of a name-based analysis relies on the assumption that the underlying semantic representation encodes some kind of semantic relationship between identifiers. For example, two semantically similar identifiers, such as len and length, should be closer to each other than two unrelated identifiers, such as length and click.

Despite the importance of semantic representations for name-based analyses, it is currently unclear how well existing approaches actually represent semantic relationships.
Specifically, we are interested in the following questions:

a) RQ 1: How accurately do state-of-the-art semantic representations match the semantic relatedness of identifiers as perceived by software developers? "Relatedness" here means the degree of association between two identifiers, which covers various possible relations between them, e.g., being used in the same application domain or being opposites of each other. For example, top and bottom are related because they are opposites, click and dblclick are related because they belong to the same general concept, and getBorderWidth and getPadding are related because they belong to the same application domain. The relatedness of identifiers is relevant for tools that reason about the broad meaning of code elements, e.g., to predict the types of functions [8], [19].

b) RQ 2: How accurately do state-of-the-art semantic representations match the semantic similarity of identifiers as perceived by software developers? "Similarity" here means the degree to which two identifiers have the same meaning, in the sense that one could substitute the other without changing the overall meaning [20]. For example, length and size, as well as username and userid, are similar to each other. The similarity of identifiers is, e.g., relevant for name-based bug detection tools [4], [21].

c) RQ 3: What are the strengths and weaknesses of the existing semantic representations? Better understanding why particular techniques sometimes succeed or fail to accurately represent identifiers will enable improving the current semantic representations.

d) RQ 4: Do the existing semantic representations complement each other? If current techniques are complementary, it may be possible to combine them in a way that outperforms the individual techniques.

Addressing these questions relies on a way to measure and compare the effectiveness of semantic representations of identifiers in source code.
This paper presents IdBench, the first benchmark for this task, which is based on a dataset of developer assessments about the relatedness and similarity of pairs of identifiers. We gather this dataset through surveys that show real-world identifiers and code snippets to hundreds of developers, asking them to rate their semantic relationship. Taking the developer assessments as a gold standard, IdBench allows for evaluating semantic representations in a systematic way by measuring to what extent a semantic representation agrees with ratings given by developers. Moreover, inspecting pairs of identifiers for which a representation strongly agrees or disagrees with the benchmark helps understand the strengths and weaknesses of the representation.

Applying our methodology to seven widely used semantic representations leads to various novel insights. We find that the techniques differ heavily in their ability to accurately represent identifier relatedness and similarity. The best among the studied techniques, the CBOW variant of FastText [22], accurately represents the relatedness of identifiers (RQ 1), but none of the available techniques accurately represents the similarity of identifiers (RQ 2). Studying the strengths and weaknesses of each technique (RQ 3) shows that some embeddings are confused about identifiers with opposite meaning, e.g., rows and cols, about identifiers that belong to the same application domain but are not similar, and about synonyms, e.g., file and record. Furthermore, practically all techniques struggle with identifiers that use abbreviations, which are very common in software. We also find that simple string distance functions, which measure the similarity of identifiers without any learning, are surprisingly effective, and even outperform some learned embeddings for the similarity task.

A close inspection of the results shows that different techniques complement each other (RQ 4).
To benefit from the strengths of multiple techniques, we present a new semantic representation that combines the available techniques into an ensemble model based on features of identifiers, such as the number of characters or whether an identifier contains non-dictionary words. The ensemble model clearly outperforms each of the existing semantic representations, improving agreement with developers by 6% and 19% for relatedness and similarity, respectively.

In summary, this paper makes the following contributions.

• A reusable benchmark. We make available a benchmark of hundreds of pairs of identifiers, providing a way to systematically evaluate existing and future embeddings (see footnote 1). To the best of our knowledge, this is the first benchmark to systematically evaluate semantic representations of identifiers.
• Novel insights. Our study reveals both strengths and limitations of current semantic representations, along with concrete examples to illustrate them. These insights provide a basis for future work on better semantic representations.
• A technique that outperforms the state-of-the-art. Combining the currently available techniques based on a few simple features yields a semantic representation that clearly outperforms all individual techniques.

II. METHODOLOGY

To measure and compare the accuracy of semantic representations, we gather thousands of ratings from 500 developers (Section II-A). Cleaning and compiling this raw dataset into a benchmark yields several hundreds of pairs of identifiers with gold standard similarities (Section II-B). We then measure the agreement between the gold standard and state-of-the-art semantic representations (Section II-C), where we study two string distance functions and five learned embeddings (Section II-D).
We apply our methodology to JavaScript code, because recent work on identifier names and code embeddings focuses on this language [4], [8], [17], [23], but our methodology can also be applied to other languages.

A. Developer Surveys

IdBench includes three benchmark tasks: a relatedness task and two tasks to measure how well an embedding reflects the similarity of identifiers, namely a similarity task and a contextual similarity task. The following describes how we gather developer assessments that provide data for these tasks. The supplementary material provides additional examples and details of the survey setup.

Footnote 1: https://github.com/sola-st/IdBench

Fig. 1: Examples of the developer surveys.

(a) Direct survey.
    Identifiers: radians, angle
    1) How related are the identifiers?   Unrelated ( ) ( ) ( ) ( ) ( ) Related
    2) Could one substitute the other?    Not substitutable ( ) ( ) ( ) ( ) ( ) Substitutable

(b) Indirect survey.
    Which identifier fits best into the blanks?   ( ) positions   ( ) indices
        Opentip.____ = ["top", "topRight", "right", "bottomRight", "bottom", "bottomLeft", "left", "topLeft"];
        Opentip.position = {};
        _ref = Opentip.____;

a) Direct Survey of Developer Assessments: This survey shows pairs of identifiers to a developer and directly asks, for each pair, how related and how similar the two identifiers are. The developer answers both questions on a five-point Likert scale. Figure 1a shows an example question from the survey. In total, each developer is shown 18 pairs of identifiers, which we randomly sample from a larger pool of pairs. Before showing the questions, we provide a brief description of what the developers are supposed to do, including an explanation of the terms "related" and "substitutable".
The ratings gathered in the direct survey are the basis for the relatedness task and the similarity task of IdBench.

b) Indirect Survey of Developer Assessments: This survey asks developers to pick an identifier that best fits a given code context, which indirectly asks about the similarity of identifiers. The motivation is that identifier names alone may not provide enough information to fully judge how similar they are [20]. For example, without any context, identifiers idx and hl may cause confusion for developers who are trying to judge their similarity. The survey addresses this challenge by showing the code context in which an identifier occurs, and by asking the developers to decide which of two given identifiers best fits this context. If, for a specific pair of identifiers, the developers choose both identifiers equally often, then the identifiers are likely to be similar to each other, since one can substitute the other. Figure 1b shows a question asked during the indirect survey. As shown in the example, for code contexts where the identifier occurs multiple times, we show multiple blanks that all refer to the same identifier. In total, we show 15 such questions to each participant of the survey, where the 15 identifier pairs are randomly selected from the set of all studied pairs. The ratings gathered in the indirect survey are the basis for the contextual similarity task of IdBench.

TABLE I: Occurrences of IdBench identifiers in code corpora of different languages.

              Total occurrences      Occurrences of individual identifiers
Language      Number      Perc.      Min     Mean     Max
JavaScript    3,697,498   12.5%      62      7,639    629,413
Python        2,279,866   14.8%      0       4,710    1,367,832
Java          757,064     6.3%       0       1,564    119,424

c) Selection of Identifiers and Code Examples: We select identifiers and code contexts from a corpus of 50,000 JavaScript files [24].
We select 300 pairs, made out of 488 identifiers, through a combination of automated and manual selection, aimed at a diverse set that covers different degrees of similarity and relatedness. At first, we extract from the code corpus all identifier names that appear more than 50 times, including method names, variable names, property names, and other types of identifiers. A naive approach would be to randomly sample pairs among those identifiers. However, this naive approach would result almost only in unrelated and dissimilar identifier pairs. Instead, we follow a methodology proposed for natural language [25], which ranks all pairs based on the cosine similarity according to a given embedding, and then selects pairs from different ranges in the ranking. We select pairs using two embeddings [17], [26]. The fact that these embeddings are later also evaluated with the benchmark does not introduce bias because the ground truth of the benchmark is constructed only from the human ratings, not from the embeddings. In addition to pairs selected as suggested in [25], we manually select some synonym pairs, which we observed to lack otherwise, and add randomly selected pairs, which are likely to be unrelated. The manual selection was done before evaluating any semantic representations to avoid biasing the benchmark.

To gather the code contexts for the indirect survey, we search the code corpus for occurrences of the selected identifiers. As the size of the context, we choose five lines, aiming to provide sufficient context to pick the best identifier without overwhelming the study participants with large amounts of code. For each identifier, we randomly select five different contexts. When showing a specific pair of identifiers to a developer, we randomly select one of the gathered contexts for one of the two identifiers.

Table I shows how often the selected identifiers occur in the JavaScript corpus.
Overall, the identifiers in IdBench occur 3.7 million times, which covers 12.5% of all identifier occurrences. Even though this was not a criterion when selecting the identifiers, the benchmark covers a non-negligible portion of real-world code. The table also shows how often individual identifiers occur, which is 7,639 times on average. To assess whether IdBench could also be used for other languages, we also measure the occurrences in Python [27] and Java [28] code corpora with 50,000 files each. As shown in Table I, the identifiers are also frequent in code beyond JavaScript, with an average number of occurrences of 4,710 and 1,564 in the Python and Java corpora, respectively. A manual analysis shows that identifiers that occur across languages cover general programming terminology, whereas identifiers that appear in JavaScript only are mostly specific to the web domain, e.g., tag_h4 or DomRange.

To better understand whether IdBench covers identifiers that appear in different syntactic roles, we measure for each identifier how often it is used as a function name, variable name, or property name. We then assign each identifier to one of these roles based on whether the majority of its occurrences is in a specific role. The measurements show that 17% of the identifiers are primarily function names, 18% are primarily variable names, 34% are primarily property names, and the rest are commonly used in multiple roles.

d) Participants: We recruit developers to participate in the survey in several ways. About half of the participants are volunteers recruited via personal contacts, posts in public developer forums, and a post in an internal forum within a major software company. The other half of the participants are recruited via Amazon Mechanical Turk, where we offer a compensation of one US dollar for completing both surveys. On average, participants took around 15 minutes to complete both surveys.
That is, the offered compensation matches the average salary of software developers in some countries of the world (see footnote 2). In total, 500 developers participate in the survey. Most participants live in North America and in India, and they have at least five years of experience in software development.

Footnote 2: https://www.payscale.com/research/IN/Job=Software_Developer/Salary

B. Data Cleaning

Crowd-sourced surveys may contain noise, e.g., due to lack of expertise or involvement by the participants [29]. To address this challenge, we gather at least ten ratings per pair of identifiers and then clean the data based on the inter-rater agreement, which has been found effective in other crowd-sourced surveys [30].

a) Removing Outlier Participants: As a first filter, we remove outlier participants based on the inter-rater agreement, which measures the degree of agreement between participants. We use Krippendorff's alpha coefficient, because it handles unequal sample sizes, which fits our data, as not all participants rate the same pairs and not all pairs have the same number of ratings. The coefficient ranges between zero and one, where zero represents complete disagreement and one represents perfect agreement. For each participant and each pair rated by that participant, we calculate the difference between the participant's rating and the average of all the other ratings for that pair. Then, we average these differences for each rater, and discard participants with a difference above a threshold τ (values given in Table II). We perform this computation both for the relatedness and similarity ratings from the direct survey, and then remove outliers based on the average difference across both ratings.

b) Removing Downer Participants: As a second filter, we eliminate participants that decrease the overall inter-rater agreement (IRA). We call such participants downers [31], because they bring the agreement level between all participants down. For each participant p, we compute IRA_sim and IRA_rel before and after removing p from the data. If IRA_sim or IRA_rel increases by at least 10%, then we discard that participant's ratings.

c) Removing Outlier Pairs: As a third filter, we eliminate some pairs of identifiers used in the indirect survey. Since our random selection of code contexts may include contexts that are not helpful in deciding about the most suitable identifier, the ratings for some pairs may be misleading. For example, this is the case for code contexts that contain short and meaningless identifiers or that mostly consist of comments unrelated to the missing identifier. To mitigate this problem, we remove a pair if the difference in similarity as rated in the direct and indirect surveys exceeds some threshold θ (values given in Table II).

TABLE II: Benchmark sizes and inter-rater agreement (IRA).

          Thresholds      Relatedness     Similarity      Contextual simil.
Size      τ       θ       Pairs   IRA     Pairs   IRA     Pairs
Small     0.215   0.4     167     0.67    167     0.62    115
Medium    0.23    0.5     247     0.61    247     0.57    145
Large     0.25    0.6     291     0.56    291     0.51    176

TABLE III: Pairs of identifiers with their gold standard similarities.

Identifier 1    Identifier 2     Relatedness   Similarity   Contextual similarity
substr          substring        0.94          1.00         0.89
setMinutes      setSeconds       0.91          0.22         0.06
reset           clear            0.90          0.89         0.94
rows            columns          0.88          0.08         0.22
setInterval     clearInterval    0.86          0.09         0.34
count           total            0.83          0.81         0.79
item            entry            0.78          0.77         0.92
miny            ypos             0.68          0.37         0.02
events          rchecked         0.16          0.14         0.18
re              destruct         0.06          0.02         0.02

Table II shows the number of identifier pairs that remain in the benchmark after data cleaning. For each of the three tasks, we provide a small, medium, and large benchmark, which differ in the thresholds used during data cleaning.
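As an illustration of the first filter (removing outlier participants), the following Python sketch computes, for every participant, the mean difference between their ratings and the average of all other ratings for the same pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `find_outlier_participants`, the `(participant, pair, score)` input format, the use of the absolute difference, and ratings already scaled to [0, 1] are all hypothetical choices. It also handles a single rating type, whereas the text averages the difference across relatedness and similarity ratings.

```python
from collections import defaultdict


def find_outlier_participants(ratings, tau):
    """Return participants whose mean deviation from the other raters exceeds tau.

    ratings: iterable of (participant, pair, score) tuples; scores are assumed
    (hypothetically) to be scaled to [0, 1] already.
    """
    by_pair = defaultdict(list)
    for participant, pair, score in ratings:
        by_pair[pair].append((participant, score))

    deviations = defaultdict(list)
    for entries in by_pair.values():
        if len(entries) < 2:
            continue  # no other ratings to compare against
        total = sum(score for _, score in entries)
        for participant, score in entries:
            # Average of all ratings for this pair except the participant's own.
            mean_of_others = (total - score) / (len(entries) - 1)
            # Assumption: deviation is measured as an absolute difference.
            deviations[participant].append(abs(score - mean_of_others))

    return {p for p, d in deviations.items() if sum(d) / len(d) > tau}
```

Because each participant is compared only against the other raters of the same pairs, the computation works even when participants rate different, partially overlapping sets of pairs, which is the situation described above.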
The smaller benchmarks use stricter thresholds and hence provide higher agreements between the participants, whereas the larger benchmarks offer more pairs. The thresholds are selected to strike a balance between increasing the overall inter-rater agreement while keeping enough pairs and ratings to form a representative benchmark.

C. Measuring Agreement with the Benchmark

Given the ground truth similarities and a semantic representation technique, we want to measure to what extent both agree with each other.

a) Converting Ratings to Scores: As a first step of measuring the agreement with the benchmark, we convert the ratings gathered for a specific pair during the developer surveys into a similarity score in the [0, 1] range. For the direct survey, we scale the 5-point Likert-scale ratings into the [0, 1] range and average all ratings for a specific pair of identifiers. For the indirect survey, we use a signal detection theory-based approach for converting the collected ratings into numeric values, which has been previously used to create a similarity benchmark for natural languages [20]. This conversion yields an unbounded distance measure $d$ for each pair, which we convert into a similarity score $s$ by normalizing and inverting the distance:

$$ s = 1 - \frac{d - \min_d}{\max_d - \min_d} $$

where $\min_d$ and $\max_d$ are the minimum and maximum distances across all pairs.

b) Examples: Table III shows representative examples of identifier pairs and their scores for the three benchmark tasks (see footnote 3). The examples illustrate that the scores match human intuition and that the gold standard clearly distinguishes relatedness from similarity. Some of the highly related and highly similar pairs, e.g., substr and substring, are lexically similar, while others are synonyms, e.g., count and total. The identifiers rows and columns are strongly related, but one cannot substitute the other, and they hence have low similarity.
Similarly, miny and ypos represent distinct properties of the variable y, which is why they are related but not similar. Finally, some pairs are either weakly or not at all related, e.g., re and destruct.

c) Correlation with benchmark: We measure the magnitude of agreement of a semantic representation with IdBench by computing Spearman's rank correlation between the similarities of pairs of identifier vectors according to the semantic representation and our gold standard of similarity scores.

Definition 1 (Correlation with benchmark): Given $n$ pairs $(s_i, g_i)$ of similarity scores, where $s_i$ is computed by a semantic representation and $g_i$ is the gold standard, let $\mathrm{rank}(s_i)$ and $\mathrm{rank}(g_i)$ be the ranks of $s_i$ and $g_i$, respectively. The correlation of the semantic representation with the benchmark is

$$ r = \frac{\mathrm{cov}(\mathrm{rank}(s_i), \mathrm{rank}(g_i))}{\sigma_{\mathrm{rank}(s_i)} \, \sigma_{\mathrm{rank}(g_i)}} $$

where $\mathrm{cov}$ and $\sigma$ are the covariance and standard deviation of the rank variables, respectively.

The correlation ranges between 1 (perfect agreement) and -1 (complete disagreement). For string distance functions, we compute the similarity score $s_i = 1 - d_{norm}$ for each pair based on a normalized version $d_{norm}$ of the distance returned by the string distance function. We use Spearman's rank correlation because directly comparing absolute similarities across different embeddings may be misleading [32]. The reason is that, depending on how "wide" or "narrow" an embedding space is, a cosine similarity of 0.3 may mean a rather high or a rather low similarity. A rank-based comparison, as provided by Spearman's rank correlation, is more robust to different ways of populating the embedding space with identifiers than computing the correlation of absolute similarities.

Footnote 3: The full list of identifier pairs is available for download as part of our benchmark.

D. Embeddings and String Distance Functions

To assess how accurately existing semantic representations encode the relatedness and similarity of identifiers, we evaluate seven semantic representations against IdBench: two string distance functions and five learned embeddings.

String distance functions use lexical similarity as a proxy for the semantic relatedness of identifiers. We consider these functions because they are used in name-based bug detection tools [2], including a bug detection tool deployed at Google [3], to improve identifier names [14], and to suggest appropriate names [15]. The two string distance functions we evaluate are:

• "LV": Levenshtein's edit distance, which is the number of character insertions, deletions, and substitutions required to transform one identifier into another.
• "NW": Needleman-Wunsch distance [33], which generalizes the Levenshtein distance by computing global alignments of two strings.

Learned embeddings are popular in recent name-based analyses, e.g., for bug detection [4], type prediction [8], and for predicting names and types of program elements [17]. The five learned embeddings we evaluate are:

• "w2v-cbow": The continuous bag of words variant of Word2vec [26], [34].
• "w2v-sg": The skip-gram variant of Word2vec.
• "FT-cbow": The continuous bag of words variant of FastText [22], a sub-word extension of Word2vec that represents words as character n-grams.
• "FT-sg": The skip-gram variant of FastText.
• "path-based": An embedding technique specifically designed for code, which learns from paths through a structural, tree-based representation of code [17].

We train all embeddings on the same code corpus of 50,000 JavaScript files [24]. For each embedding, we experiment with various hyper-parameters (e.g., dimension, number of context words) and report results only for the best-performing models (see footnote 4).
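The agreement measure from Definition 1, applied to the "LV" string distance function, can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the benchmark's evaluation code: all function names are hypothetical, the normalization of the edit distance by the length of the longer identifier is an assumption (the text only states that a normalized distance is used), and ties are handled with average ranks, which is the conventional choice for Spearman's correlation.

```python
from statistics import mean


def levenshtein(a, b):
    """Number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def lexical_similarity(a, b):
    """s = 1 - d_norm; dividing by the longer length is an assumed normalization."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Usage with a few similarity scores taken from Table III.
gold = [("substr", "substring", 1.00), ("reset", "clear", 0.89),
        ("count", "total", 0.81), ("rows", "columns", 0.08),
        ("re", "destruct", 0.02)]
predicted = [lexical_similarity(a, b) for a, b, _ in gold]
correlation = spearman(predicted, [g for _, _, g in gold])
```

Because only ranks enter the computation, rescaling the predicted similarities by any strictly increasing function leaves the result unchanged, which is the robustness property that motivates the rank-based comparison.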
We provide all identifiers as they are to the semantic representations, without pre-processing or tokenizing identifiers. The rationale is that such pre-processing should be part of the semantic representation. For example, the NW string distance function aligns the characters of identifiers, and the FastText embeddings split identifiers into character n-grams, which may enable these techniques to reason about subtokens of an identifier.

[Fig. 2: Correlations of embeddings and string distance functions with the small, medium, and large variants of the benchmark. Panels: (a) Relatedness, (b) Similarity, (c) Contextual similarity. Each panel plots the correlation with the benchmark (y-axis, 0.0 to 0.8) for the similarity functions LV, NW, FT-cbow, FT-SG, w2v-cbow, w2v-SG, Path-based, and Combined (x-axis), with one bar each for the large, medium, and small benchmark.]

III. RESULTS

A. RQ 1: Accuracy of Representing Semantic Relatedness

The following addresses the question of how accurately the studied techniques represent the relatedness of identifiers, i.e., the degree of association between the two identifiers. Figure 2a shows the agreement of the evaluated semantic representations with the small, medium, and large variants of the relatedness benchmark in IdBench. All techniques achieve relatively high levels of agreement, with correlations between 41% and 74%. The neurally learned embeddings clearly outperform the string distance-based similarity functions (41-74% vs. 46-49%), showing that the effort of learning a semantic representation is worthwhile.
In particular, the learned embeddings match or even slightly exceed the inter-rater agreement, which is considered an upper bound of how strongly an embedding may correlate with a similarity-based benchmark [31].

Comparing different embedding techniques with each other, we find that both FastText variants achieve higher scores than all other embeddings. In contrast, despite using additional structural information of source code, path-based embeddings score only comparably to Word2vec.

A likely reason for the effectiveness of FastText is that it generalizes across lexically similar names by computing embeddings based on character n-grams of an identifier. E.g., given the identifier getIndex, FastText computes its embedding based on embeddings for its various character n-grams, such as Index and Ind, allowing the approach to generalize across lexically similar identifiers, such as setIndex or ind.

B. RQ 2: Accuracy of Representing Semantic Similarity

This research question is about semantic similarity, i.e., the degree to which two identifiers have the same meaning. Figure 2b shows how much the studied semantic representations agree with the similarity benchmark in IdBench. Overall, the figure shows a much lower agreement with the gold standard than for relatedness. One explanation is that encoding semantic similarity is a harder task than encoding the less strict concept of relatedness. Similar to relatedness, FT-cbow shows the strongest agreement, ranging between 35% and 38%.

Footnote 4: Details on the hyperparameters and how we tuned them are available in the supplementary material.

The results of the contextual similarity task (Figure 2c) confirm the findings from the similarity task. All studied techniques are less effective than for relatedness, and FT-cbow achieves the highest agreement with IdBench.
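The n-gram explanation given for FastText under RQ 1 can be made concrete with a small Python sketch. It only enumerates the character n-grams of an identifier and measures how many of them two identifiers share; it does not train or query any embedding. The function names are hypothetical, and the Jaccard overlap is merely an illustrative proxy chosen for this sketch. The default n-gram lengths of 3 to 6 and the `<` and `>` boundary markers follow the published FastText design, but the settings used in the study may differ.

```python
def char_ngrams(identifier, n_min=3, n_max=6):
    """Character n-grams of an identifier wrapped in boundary markers."""
    padded = f"<{identifier}>"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}


def ngram_overlap(a, b):
    """Jaccard overlap of the n-gram sets (an illustrative proxy, not FastText)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# getIndex and setIndex share n-grams such as "Ind" and "Index",
# whereas getIndex and click share none.
shared = char_ngrams("getIndex") & char_ngrams("setIndex")
lexically_close = ngram_overlap("getIndex", "setIndex")
lexically_far = ngram_overlap("getIndex", "click")
```

Since an identifier's vector is built from the vectors of its n-grams, identifiers with many shared n-grams tend to end up close to each other, which also hints at why lexical overlap alone cannot recognize synonyms that share no characters, such as count and total.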
A perhaps surprising result is that string distance functions are roughly as effective as some of the learned embeddings and sometimes even outperform them. The reason is that some semantically similar identifiers are also lexically similar, e.g., len and length. One downside of string distance functions is that they miss synonymous identifiers, e.g., count and total.

C. RQ 3: Strengths and Weaknesses of Existing Techniques

To better understand the strengths and weaknesses of the studied semantic representations, we inspect various examples (Section III-C1) and study interesting subsets of all identifier pairs in isolation (Section III-C2).

1) Examples: To better understand why current embeddings sometimes fail to accurately represent similarities, Table IV shows the most similar identifiers of selected identifiers according to the FT-cbow and path-based embeddings. The examples illustrate two observations.

First, FastText, due to its use of n-grams [22], tends to cluster identifiers based on lexical commonalities. While many lexically similar identifiers are also semantically similar, e.g., substr and substring, this approach misses other synonyms, e.g., item and entry. Another downside is that lexical similarity may also establish wrong relationships. For example, substring and substrCount represent different concepts, but FastText finds them to be highly similar.

Second, in contrast to FastText, path-based embeddings tend to cluster words based on the structural and syntactic contexts they occur in. This approach helps the embeddings to identify synonyms despite their lexical differences, e.g., count and total, or files and records. The downside is that it also clusters various related but not similar identifiers, e.g., minText and maxText, or substr and getPadding. Some of these identifiers even have opposing meanings, e.g., rows and cols, which can mislead code analysis tools when reasoning about the semantics of code.

TABLE IV: Top-5 most similar identifiers by the FT-cbow and path-based models.

Identifier    Embedding    Nearest neighbors
substr        FT-cbow      substring, substrs, subst, substring1, substrCount
              Path-based   substring, getInstanceProp, getPadding, getMinutes, floor
item          FT-cbow      itemNr, itemJ, itemL, itemI, itemAt
              Path-based   entry, child, record, targ, nextElement
count         FT-cbow      countTbl, countInt, countRTO, countsAsNum, countOne
              Path-based   total, limit, minVal, exponent, rate
rows          FT-cbow      rowOrRows, rowXs, rows_l, rowsAr, rowIDs
              Path-based   cols, cells, columns, tiles, items
setInterval   FT-cbow      resetInterval, setTimeoutInterval, clearInterval, getInterval, retInterval
              Path-based   clearInterval, assume, alert, nextTick, ReactTextComponent
minText       FT-cbow      maxText, minLengthText, microsecText, maxLengthText, minuteText
              Path-based   maxText, displayMsg, blankText, disableText, emptyText
files         FT-cbow      filesObjs, filesGen, fileSets, extFiles, libFiles
              Path-based   records, tasks, names, tiles, todos
miny          FT-cbow      min_y, minBy, minx, minPt, min_z
              Path-based   minx, ymin, dataMax, dataMin, ymax

2) Interesting Subsets of All Identifier Pairs: To better understand the strengths and weaknesses of semantic representations for specific kinds of identifiers, we analyze some interesting subsets of all identifier pairs in more detail. We focus on five subsets:

• Abbreviations. Pairs where at least one identifier is an abbreviation and where both identifiers refer to the same concept, e.g., substr and substring, or cfg and conf. Since abbreviations are commonly used for concise source code, accurately reasoning about them is important.
• Opposites. Pairs where one identifier is the opposite of the other identifier, e.g., xMin and xMax, or setInstanceProp and getInstanceProp. Since opposite identifiers often occur in similar contexts, they may be difficult to distinguish.
• Synonyms. Pairs that refer to the same concepts, e.g., reset and clear, or emptyText and blankText. These identifiers often are lexically different but should be represented in a similar way.
• Added subtoken. Pairs where both identifiers are identical, except that one adds a subtoken to the other, e.g., id and sessionid, or maxLine and maxLineLength.
• Tricky tokenization. Pairs where at least one of the identifiers is composed of multiple subtokens but uses neither camel case nor snake case to combine subtokens, e.g., touchmove and touchend, or newtext and content. This and the above subset are interesting because some semantic representations reason about subtokens of identifiers.

To extract pairs into these subsets, we inspect all 167 pairs from the small benchmark, which yields between 7 and 22 pairs per set.

[Figure 3: grid of bar charts with one row per subset (Abbreviations, Opposites, Synonyms, Added subtoken, Tricky tokenization) and one column per task (Relatedness, Similarity, Contextual similarity). x-axis: LV, NW, FT-cbow, FT-SG, w2v-cbow, w2v-SG, Path-based; y-axis: correlation with the benchmark, from -0.5 to 0.5.]

Fig. 3: Agreement and disagreement with the benchmark for different kinds of identifiers.

Figure 3 shows how much the different techniques agree or disagree with the benchmark for selected subsets. As in Figure 2, each bar shows the Spearman rank correlation between the predicted similarities and the ground truth. That is, higher values are better and negative values indicate a clear disagreement with the ground truth.

The results show that all techniques are challenged by abbreviations, with more than half of the correlations being negative. The poor performance for abbreviations can be attributed to the fact that fewer characters provide less information and that there may be many variants of the same name. For opposites and synonyms, we find that most techniques, and in particular the learned embeddings, successfully represent the relatedness of these identifiers.
However, almost all techniques clearly fail to capture that opposite identifiers are not similar, as one cannot replace the other, and to capture that synonyms are similar.

For the subtoken-related subsets, we find that most techniques are challenged by pairs where one identifier adds a subtoken to the other, in particular when reasoning about similarity. One explanation is that identifiers with an added subtoken tend to be rather specialized and hence occur less frequently, which gives less training data to the learning-based techniques. When faced with identifiers that use non-obvious tokenization, most techniques, with the exception of Needleman-Wunsch, perform relatively well. We attribute this result to the fact that techniques that reason about substrings of an identifier, such as FastText [22], do not rely on a specific tokenization approach, such as camel case or snake case, but instead consider character n-grams of the given identifier.

D. RQ 4: Complementarity of Existing Techniques

Our inspection of examples and of specific subsets of identifier pairs shows that different semantic representation techniques work well for different kinds of identifiers. For example, some techniques work better for abbreviations than others. Based on this observation, we hypothesize that the existing semantic representations complement each other. If this hypothesis is correct, combining techniques in such a way that the most suitable set of techniques is used for a given pair of identifiers could represent similarities more accurately than any of the individual techniques.

To validate this hypothesis, we present an ensemble model that combines existing semantic representations. The key idea is to train a model that predicts the similarity of two identifiers based on the similarity scores provided by the existing semantic representations.
To this end, the approach queries each of the seven techniques studied in this paper for a similarity score and provides these scores to the model. To help the model decide which representations to favor for a given pair of identifiers, we also provide to the model a set of features that describe some properties of identifiers. Given two identifiers, the features we consider are:

• The length of these identifiers.
• The number of subtokens in each of the identifiers, based on snake case and camel case conventions.
• The number of words among the subtokens that are not in an English dictionary. The rationale for this feature is to identify abbreviations, which usually are not dictionary words.

Given the seven similarity scores and the features, we train a model that takes the scores and features of a pair as an input and then predicts a similarity score for the pair. We train the model in a supervised way, using the ground truth provided in IdBench as the labels for learning. We use an off-the-shelf support vector machine model with the default hyperparameters provided by the underlying library (see footnote 5). In practice, one would train the model with all pairs in our benchmark and then apply the trained model to new pairs. To enable us to measure the effectiveness of the model, we here train it with all but one pair and then apply the trained model to the left-out pair. We repeat this step for each pair and use the score predicted by the model as the score of the combined technique.

Figure 2 shows the results of the combined approach. Combining different semantic representations clearly outperforms all existing techniques. For example, for the large benchmark, the combined approach increases the agreement for relatedness, similarity, and contextual similarity over the best individual technique by 6%, 19%, and 5%, respectively. This result confirms our hypothesis that the existing techniques complement each other and shows the benefits of combining them.
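The feature extraction and the leave-one-out protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the tiny ENGLISH_WORDS set stands in for a real English dictionary, the subtoken regular expression is one possible reading of the snake case and camel case conventions, the example scores and labels are made up, and MeanRegressor is a trivial placeholder for the support vector regressor used in the paper.

```python
import re

# Toy stand-in for an English dictionary (assumption for illustration only).
ENGLISH_WORDS = {"max", "line", "length", "session", "count", "total", "get", "set"}


def subtokens(identifier):
    """Split an identifier at underscores and at camel case boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [part.lower() for part in parts]


def pair_features(id_a, id_b):
    """Per identifier: length, number of subtokens, number of non-dictionary subtokens."""
    features = []
    for identifier in (id_a, id_b):
        tokens = subtokens(identifier)
        features.append(len(identifier))
        features.append(len(tokens))
        features.append(sum(1 for token in tokens if token not in ENGLISH_WORDS))
    return features


class MeanRegressor:
    """Placeholder model with a fit/predict interface; predicts the mean training label."""

    def fit(self, rows, labels):
        self.mean_ = sum(labels) / len(labels)
        return self

    def predict(self, rows):
        return [self.mean_ for _ in rows]


def leave_one_out_scores(rows, labels, make_model=MeanRegressor):
    """Train on all pairs but one, predict the left-out pair, and repeat for each pair."""
    predictions = []
    for held_out in range(len(rows)):
        train_rows = rows[:held_out] + rows[held_out + 1:]
        train_labels = labels[:held_out] + labels[held_out + 1:]
        model = make_model().fit(train_rows, train_labels)
        predictions.append(model.predict([rows[held_out]])[0])
    return predictions


# Hypothetical inputs: two similarity scores per pair instead of the paper's seven.
pairs = [("maxLine", "maxLineLength"), ("id", "sessionid"), ("count", "total")]
technique_scores = [[0.7, 0.6], [0.4, 0.5], [0.1, 0.8]]
labels = [0.6, 0.5, 0.9]
rows = [scores + pair_features(a, b) for scores, (a, b) in zip(technique_scores, pairs)]
print(leave_one_out_scores(rows, labels))
```

Any regressor that offers the same fit and predict interface could be passed as make_model; with scikit-learn installed, `sklearn.svm.SVR` with default hyperparameters would match the model class named in the paper's footnote.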
IV. DISCUSSION

This section discusses some lessons learned from our study of semantic representations, along with ideas for addressing the current limitations in future work.

a) Neurally learned embeddings accurately represent the relatedness of identifiers: Overall, all neural embeddings considered in our evaluation provide a high agreement with the ground truth provided by the relatedness scores in IdBench. This result shows that embeddings are effective at assigning similar vector representations to identifiers that occur in the same application domain or that are associated in some other way.

b) No existing technique accurately represents the similarity of identifiers: While the best available embeddings are highly effective at representing relatedness, none of the studied techniques reaches the same level of agreement for similarity. In fact, even the best results in Figures 2b and 2c (38%) clearly stay below the inter-rater agreement of our benchmark (62%), showing a huge potential for improvement. For many applications of embeddings of identifiers, semantic similarity is crucial. For example, techniques that suggest suitable variable or method names [9], [17] aim for the name that is most similar, not only most related, to the concept represented by the variable or method. Likewise, name-based analyses for finding programming errors [4] or variable misuses [21] aim at identifying situations where the developer uses a wrong but perhaps related variable. Improving the ability of semantic representations to accurately represent the similarity of identifiers will benefit these name-based analyses.

Footnote 5: Class sklearn.svm.SVR from scikit-learn.

c) Neural embeddings generally outperform string distance functions: Our results for both relatedness and similarity show that the best available neural embeddings outperform classical string distance functions.
For example, for the relatedness benchmark, the string distance functions achieve up to 49% correlation, whereas embeddings achieve up to 74% correlation. For the similarity and contextual similarity benchmarks, the differences are smaller (32% vs. 38% for similarity, and 29% vs. 36% for contextual similarity), but still clearly visible. These results suggest that name-based analyses are likely to benefit from using embeddings instead of string distance functions.

d) Opposites are challenging: Inspecting examples of (in)accurately represented pairs of identifiers shows that identifiers that describe opposing concepts are particularly challenging for current semantic representations. For example, both the FT-cbow and path-based embeddings assign similar vectors to minText and maxText, even though these identifiers are clearly not similar but only related. Another example is the pair of function names setInterval and clearInterval. Table IV shows these and other examples of this phenomenon. Improving semantic representations to better distinguish identifiers with opposing meanings will benefit name-based analyses that, e.g., suggest method names [9] or refactorings of identifiers [12].

e) Distinguishing singular and plural identifiers is particularly challenging: Another challenge we observe while inspecting inaccurately represented pairs of identifiers is to distinguish identifiers of individual items from identifiers of collections of items. For example, FT-cbow assigns very similar vectors to substr and substrs (Table IV). Such a conflation of singular and plural concepts may be misleading, e.g., for name-based analyses that predict types [7], [8], [19].

f) Shared subword information may be misleading: String distance functions and, to some extent, also subword-based embeddings, such as FastText, rely on the assumption that substrings shared by two identifiers increase the chance that the identifiers are semantically similar.
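The assumption that shared substrings signal semantic similarity can be made concrete with a normalized edit-distance similarity. The sketch below assumes that the LV function in Figure 2 denotes Levenshtein distance and normalizes it by the length of the longer identifier; the exact normalization used in the study is not stated in this section, so this choice is an assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def lexical_similarity(a, b):
    """Map the edit distance to [0, 1]; 1.0 means the identifiers are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# A lexically close but semantically different pair scores higher than a pair of synonyms.
print(lexical_similarity("minText", "minuteText"))
print(lexical_similarity("count", "total"))
```

Under this measure, minText and minuteText appear far more similar than the synonyms count and total, which is the kind of misleading signal this paragraph describes.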
While a subword-based approach helps deal with the out-of-vocabulary problem [35], it may also mislead the semantic representation. For example, the FT-cbow embedding assigns similar vectors to minText and minuteText, as well as to setInterval and clearInterval, as these identifiers share subwords, even though the identifiers refer to clearly different concepts.

g) Expanding abbreviations may improve semantic representations: The finding that practically all existing semantic representations have difficulties with abbreviations raises the question of how to address this limitation. One promising direction is to expand abbreviations into longer identifiers before querying for their relatedness or similarity to another identifier. Several techniques for expanding identifiers have been proposed [36]–[40], which could possibly be used as a preprocessing step within semantic representations.

h) Different semantic representations complement each other: The availability of different techniques for reasoning about the similarity of identifiers can be exploited by combining multiple such techniques. Our ensemble model (Section III-D) shows the potential of combined approaches.

V. THREATS TO VALIDITY

A. Threats to Internal Validity

Threats to internal validity are about factors that may influence our results. The identifiers and the code examples associated with them may not be representative of other code. To mitigate this threat, we gather data from a large and diverse code corpus, and we select identifiers that cover semantically similar and dissimilar pairs of identifiers (Section II-A0c).

The decision to perform our work with code written in a dynamically typed programming language, JavaScript, biases our results toward such languages. The reason for focusing on a dynamically typed language is that such languages are the target of various name-based analyses [4], [7], [8], [19], [41], [42] and embedding techniques [17], [23].
Some ratings gathered in our surveys may be inaccurate, e.g., because participants may have misunderstood the instructions. To mitigate this threat, we gather at least ten ratings per pair of identifiers and then carefully clean the ratings provided by developers to remove noise and outliers (Section II-B).

B. Threats to External Validity

Threats to external validity are about factors that may influence the generalizability of our results. One limitation is that IdBench focuses on individual identifiers only. As a result, it is not clear to what extent our evaluation of semantic representations of identifiers allows for conclusions about representations at a larger granularity, e.g., of complex expressions, statements, or sequences of statements. We focus on individual identifiers as they are the basic building blocks of code. Recent work on improving name-based and learning-based bug detection [4] by aggregating identifiers in complex expressions suggests that improving embeddings for individual identifiers also benefits larger-scale code representations [41].

Another limitation is that other string distance functions or other embeddings may perform better or worse than those studied here. We select semantic representations that have been used in past name-based analyses, as well as some recent embedding techniques that are state of the art in natural language processing (NLP). By making IdBench publicly available, we enable others to evaluate future semantic representations.

Like any benchmark, IdBench consists of a finite set of subjects, which may not be representative of all others. The number of pairs of identifiers in the benchmark (Table II) is in the same order of magnitude as that of word similarity benchmarks used in NLP [20], [31], [43]–[45]. In addition, we focus on JavaScript code, i.e., our findings may not generalize to identifiers in other languages.
Finally, different name-based analyses have different requirements on the semantic representations they build upon. The tasks we present to survey participants may not represent all these requirements, and hence, a semantic representation may perform better or worse in a specific name-based analysis than IdBench suggests.

VI. RELATED WORK

a) Name-based Program Analysis: Various analyses exploit the rich information provided by identifier names, e.g., to find bugs [2]–[5] and vulnerabilities [47], to mine specifications [6], to infer types based on identifier names as implicit type hints [7], [8], to predict the name of a method [9], to complete partial code using a learned language model [10], to identify inappropriate names [11], to suggest more suitable names [12], to resolve fully qualified type names of methods, variables, etc. in a given code snippet [48], or to map APIs between programming languages based on an embedding of code tokens [18]. A systematic way of evaluating semantic representations of identifiers, as provided in this paper, helps in further exploiting the implicit knowledge encoded in identifiers, and hence will benefit name-based program analyses.

b) Embeddings of Identifiers: Embeddings of identifiers are at the core of several code analysis tools. A popular approach, e.g., for bug detection [4], type prediction [8], or vulnerability detection [47], is applying Word2vec [26], [34] to token sequences, which corresponds to the Word2vec embedding evaluated in Section III. White et al. [49] train an RNN-based language model and extract its final hidden layer as an embedding of identifiers. Chen et al. [50] provide a more comprehensive survey of embeddings for source code. Beyond learned embeddings, string distance functions are used in other name-based tools, e.g., for detecting bugs [2], [3] or for inferring specifications [6].
The quality of embeddings is crucial in these and other code analysis tools, and IdBench will help to improve the state of the art.

c) Embeddings of Programs: Beyond embeddings of identifiers, there is work on embedding larger parts of a program. One approach [9] uses a log-bilinear, neural language model [51] to predict the names of methods. Other work embeds code based on graph neural networks [21] or sequence-based neural networks applied to paths through a graph representation of code [23], [52]–[56]. Code2seq embeds code and then generates sequences of NL words [57]. For a broader overview and a detailed survey of learning-based software analysis, we refer the reader to [16] and [58], respectively. To evaluate embeddings of programs, the COSET benchmark provides thousands of programs with semantic labels [59]. Another study measures how effective pre-trained code2vec [23] embeddings are for different downstream tasks [46]. One conclusion from Kang et al.'s work [46] is that evaluating embeddings on a specific downstream task is insufficient, a problem we here address with a task-independent benchmark. Both of the above [46], [59] complement IdBench because the existing work is about entire programs, whereas IdBench is about identifiers. Since identifiers are a basic building block of source code, a benchmark for improving embeddings of identifiers will eventually also benefit learning-based code analysis tools.

d) Benchmarks of Word Embeddings: The NLP community has a long tradition of reasoning about the semantics of words. In particular, that community has addressed the challenge of measuring how well a semantic representation of words matches actual relationships between words through a series of gold standards of words, focusing on either relatedness [25], [43], [44] or similarity [20], [31], [45], [60] of words.
These gold standards define how similar two words are based on ratings by human judges, enabling an evaluation that measures how well an embedding reflects the human ratings. Unfortunately, simply reusing these existing gold standards for identifiers in source code would be misleading. One reason is that the vocabularies of natural languages and source code overlap only partially, because source code contains various terms and abbreviations not found in natural language texts. Moreover, source code has a constantly growing vocabulary, as developers tend to quickly invent new identifiers, e.g., for newly emerging application domains [35]. Finally, even words present in both natural languages and source code may differ in their meaning due to computer science-specific terms, e.g., “float” or “string”. This work is the first to address the need for a gold standard for identifiers in code.

e) Data Gathering: Asking human raters how related or similar two words are was first proposed by Rubenstein and Goodenough [45] and then adopted by others [20], [31], [43], [60]. Our direct survey also follows this methodology. Miller and Charles [20] propose to gather judgments about contextual similarity by asking participants to choose a word to fill in a blank, an idea we adopt in our indirect survey. To choose words and pairs of words, prior work relies on manual selection [45], pre-existing free association databases [31], [60], e.g., USF [61] or VerbNet [62], [63], or cosine similarities according to pre-existing models [25]. We follow the latter approach, as it minimizes human bias while covering a wide range of degrees of relatedness and similarity.

f) Inter-rater Agreement: Validating and cleaning data gathered via crowd-sourcing based on the inter-rater agreement has been found effective in other crowd-sourced surveys [30]. Gold standards for natural language words reach an inter-rater agreement of 0.61 [43] and 0.67 [31].
Our “small” dataset reaches similar levels of agreement, showing that the ratings in IdBench represent a genuine human intuition. As noted by Hill et al. [31], the inter-rater agreement also gives an upper bound of the expected correlation between the tested model and the gold standard. Our results show that current models still leave plenty of room for improvement, especially w.r.t. similarity.

VII. CONCLUSION

This paper presents the first benchmark for evaluating semantic representations of identifier names, along with a study of current semantic representation techniques. We compile thousands of ratings gathered from 500 developers into a benchmark that provides gold standard similarity scores representing the relatedness, similarity, and contextual similarity of identifiers. Using IdBench to experimentally compare two string distance functions and five embedding techniques shows that these techniques differ significantly in their agreement with our gold standard. The best available embeddings are effective at representing how related identifiers are. However, all studied techniques show huge room for improvement in their ability to represent how similar identifiers are. An in-depth study of different subsets of identifiers shows the specific strengths and weaknesses of current semantic representations, e.g., that most techniques are challenged by abbreviations, opposites, and the difference between singular and plural. To exploit the complementarity of current techniques, we present an ensemble model that effectively combines them and clearly outperforms the best individual techniques.

Our work will help address the limitations of current semantic representations of identifiers. Such progress will benefit downstream developer tools, in particular name-based program analyses. More broadly, improving semantic representations of identifiers will also contribute toward better learning-based program testing and analysis techniques.
REFERENCES

[1] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp, “Exploring the influence of identifier names on code quality: An empirical study,” in European Conference on Software Maintenance and Reengineering (CSMR). IEEE, 2010, pp. 156–165.
[2] M. Pradel and T. R. Gross, “Detecting anomalies in the order of equally-typed method arguments,” in International Symposium on Software Testing and Analysis (ISSTA), 2011, pp. 232–242.
[3] A. Rice, E. Aftandilian, C. Jaspan, E. Johnston, M. Pradel, and Y. Arroyo-Paredes, “Detecting argument selection defects,” in Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2017.
[4] M. Pradel and K. Sen, “DeepBugs: A learning approach to name-based bug detection,” PACMPL, vol. 2, no. OOPSLA, pp. 147:1–147:25, 2018. [Online]. Available: https://doi.org/10.1145/3276517
[5] S. Kate, J. Ore, X. Zhang, S. G. Elbaum, and Z. Xu, “Phys: probabilistic physical unit assignment and inconsistency detection,” in Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena Vista, FL, USA, November 04-09, 2018, 2018, pp. 563–573. [Online]. Available: https://doi.org/10.1145/3236024.3236035
[6] H. Zhong, L. Zhang, T. Xie, and H. Mei, “Inferring resource specifications from natural language API documentation,” in International Conference on Automated Software Engineering (ASE), 2009, pp. 307–318.
[7] Z. Xu, X. Zhang, L. Chen, K. Pei, and B. Xu, “Python probabilistic type inference with natural language support,” in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, 2016, pp. 607–618. [Online]. Available: https://doi.org/10.1145/2950290.2950343
[8] R. S. Malik, J. Patra, and M. Pradel, “NL2Type: Inferring JavaScript function types from natural language information,” in Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, 2019, pp. 304–315. [Online]. Available: https://doi.org/10.1109/ICSE.2019.00045
[9] M. Allamanis, E. T. Barr, C. Bird, and C. A. Sutton, “Suggesting accurate method and class names,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, 2015, pp. 38–49.
[10] M. R. Parvez, S. Chakraborty, B. Ray, and K. Chang, “Building language models for text with named entities,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018, pp. 2373–2383. [Online]. Available: https://www.aclweb.org/anthology/P18-1221/
[11] E. W. Høst and B. M. Østvold, “Debugging method names,” in European Conference on Object-Oriented Programming (ECOOP). Springer, 2009, pp. 294–317.
[12] K. Liu, D. Kim, T. F. Bissyandé, T. Kim, K. Kim, A. Koyuncu, S. Kim, and Y. L. Traon, “Learning to spot and refactor inconsistent method names,” in Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, 2019, pp. 1–12. [Online]. Available: https://dl.acm.org/citation.cfm?id=3339507
[13] S. Butler, M. Wermelinger, Y. Yu, and H. Sharp, “Improving the tokenisation of identifier names,” in European Conference on Object-Oriented Programming (ECOOP). Springer, 2011, pp. 130–154.
[14] Y. Jiang, H. Liu, J. Q. Zhu, and L. Zhang, “Automatic and accurate expansion of abbreviations in parameters,” IEEE Transactions on Software Engineering, 2018.
[15] H. Liu, Q. Liu, C.-A. Staicu, M. Pradel, and Y. Luo, “Nomen est omen: Exploring and exploiting similarities between argument and parameter names,” in International Conference on Software Engineering (ICSE), 2016, pp. 1063–1073.
[16] M. Pradel and S. Chandra, “Neural software analysis,” CoRR, vol. abs/2011.07986, 2020. [Online]. Available: https://arxiv.org/abs/2011.07986
[17] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “A general path-based representation for predicting program properties,” in ACM SIGPLAN Notices, vol. 53, no. 4. ACM, 2018, pp. 404–419.
[18] T. D. Nguyen, A. T. Nguyen, H. D. Phan, and T. N. Nguyen, “Exploring API embedding for API usages and applications,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017, pp. 438–449.
[19] V. Hellendoorn, C. Bird, E. T. Barr, and M. Allamanis, “Deep learning type inference,” in FSE, 2018.
[20] G. A. Miller and W. G. Charles, “Contextual correlates of semantic similarity,” Language and Cognitive Processes, vol. 6, no. 1, pp. 1–28, 1991.
[21] M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” CoRR, vol. abs/1711.00740, 2017. [Online]. Available: http://arxiv.org/abs/1711.00740
[22] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” TACL, vol. 5, pp. 135–146, 2017. [Online]. Available: https://transacl.org/ojs/index.php/tacl/article/view/999
[23] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, p. 40, 2019.
[24] V. Raychev, P. Bielik, M. Vechev, and A. Krause, “Learning programs from noisy data,” in ACM SIGPLAN Notices, vol. 51, no. 1. ACM, 2016, pp. 761–774.
[25] E. Bruni, N.-K. Tran, and M. Baroni, “Multimodal distributional semantics,” Journal of Artificial Intelligence Research, vol. 49, pp. 1–47, 2014.
[26] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[27] V. Raychev, P. Bielik, and M. Vechev, “Probabilistic model for code with decision trees,” in OOPSLA, 2016.
[28] M. Allamanis and C. A. Sutton, “Mining source code repositories at massive scale using language modeling,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, San Francisco, CA, USA, May 18-19, 2013, 2013, pp. 207–216.
[29] A. Kittur, E. H. Chi, and B. Suh, “Crowdsourcing user studies with mechanical turk,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2008, pp. 453–456.
[30] S. Nowak and S. Rüger, “How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation,” in Proceedings of the International Conference on Multimedia Information Retrieval, 2010, pp. 557–566.
[31] F. Hill, R. Reichart, and A. Korhonen, “Simlex-999: Evaluating semantic models with (genuine) similarity estimation,” Computational Linguistics, vol. 41, no. 4, pp. 665–695, 2015.
[32] V. Zhelezniak, A. Savkov, A. Shen, and N. Y. Hammerla, “Correlation coefficients and semantic textual similarity,” arXiv preprint arXiv:1905.07790, 2019.
[33] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[34] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[35] H. Babii, A. Janes, and R. Robbes, “Modeling vocabulary for big code machine learning,” CoRR, 2019. [Online]. Available: https://arxiv.org/abs/1904.01873
[36] A. Corazza, S. D. Martino, and V. Maggio, “LINSEN: an efficient approach to split identifiers and expand abbreviations,” in 28th IEEE International Conference on Software Maintenance, ICSM 2012, Trento, Italy, September 23-28, 2012. IEEE Computer Society, 2012, pp. 233–242. [Online]. Available: https://doi.org/10.1109/ICSM.2012.6405277
[37] Y. Jiang, H. Liu, and L. Zhang, “Semantic relation based expansion of abbreviations,” in Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Tallinn, Estonia, August 26-30, 2019, M. Dumas, D. Pfahl, S. Apel, and A. Russo, Eds. ACM, 2019, pp. 131–141. [Online]. Available: https://doi.org/10.1145/3338906.3338929
[38] C. D. Newman, M. J. Decker, R. S. Alsuhaibani, A. Peruma, D. Kaushik, and E. Hill, “An empirical study of abbreviations and expansions in software artifacts,” in 2019 IEEE International Conference on Software Maintenance and Evolution, ICSME 2019, Cleveland, OH, USA, September 29 - October 4, 2019. IEEE, 2019, pp. 269–279. [Online]. Available: https://doi.org/10.1109/ICSME.2019.00040
[39] D. J. Lawrie and D. W. Binkley, “Expanding identifiers to normalize source code vocabulary,” in IEEE 27th International Conference on Software Maintenance, ICSM 2011, Williamsburg, VA, USA, September 25-30, 2011. IEEE Computer Society, 2011, pp. 113–122. [Online]. Available: https://doi.org/10.1109/ICSM.2011.6080778
[40] D. Lawrie, H. Feild, and D. Binkley, “Extracting meaning from abbreviated identifiers,” in Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 2007, pp. 213–222.
[41] R.-M. Karampatsis and C. Sutton, “Scelmo: Source code embeddings from language models,” 2020. [Online]. Available: https://openreview.net/pdf?id=ryxnJlSKvr
[42] M. Pradel, G. Gousios, J. Liu, and S. Chandra, “Typewriter: Neural type prediction with search-based validation,” in ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, 2020, pp. 209–220. [Online]. Available: https://doi.org/10.1145/3368089.3409715
[43] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin, “Placing search in context: The concept revisited,” ACM Transactions on Information Systems, vol. 20, no. 1, pp. 116–131, 2002.
[44] T. Schnabel, I. Labutov, D. M. Mimno, and T. Joachims, “Evaluation methods for unsupervised word embeddings,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 2015, pp. 298–307. [Online]. Available: http://aclweb.org/anthology/D/D15/D15-1036.pdf
[45] H. Rubenstein and J. B. Goodenough, “Contextual correlates of synonymy,” Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965.
[46] H. J. Kang, T. F. Bissyandé, and D. Lo, “Assessing the generalizability of code2vec token embeddings,” in ASE, 2019.
[47] J. A. Harer, L. Y. Kim, R. L. Russell, O. Ozdemir, L. R. Kosta, A. Rangamani, L. H. Hamilton, G. I. Centeno, J. R. Key, P. M. Ellingwood, M. W. McConley, J. M. Opper, S. P. Chin, and T. Lazovich, “Automated software vulnerability detection with machine learning,” CoRR, vol. abs/1803.04497, 2018. [Online]. Available: http://arxiv.org/abs/1803.04497
[48] H. Phan, H. A. Nguyen, N. M. Tran, L. H. Truong, A. T. Nguyen, and T. N. Nguyen, “Statistical learning of API fully qualified names in code snippets of online forums,” in Proceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, 2018, pp. 632–642. [Online]. Available: http://doi.acm.org/10.1145/3180155.3180230
[49] M. White, M. Tufano, C.
V endome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection, ” in ASE , 2016, pp. 87–98. [50] Z. Chen and M. Monperrus, “ A literature study of embeddings on source code, ” arXiv preprint , 2019. [51] Y . Bengio, R. Ducharme, P . V incent, and C. Jauvin, “ A neural proba- bilistic language model, ” Journal of machine learning resear ch , vol. 3, no. Feb, pp. 1137–1155, 2003. [52] T . Ben-Nun, A. S. Jakobovits, and T . Hoeﬂer, “Neural code comprehension: A learnable representation of code semantics, ” CoRR , vol. abs/1806.07336, 2018. [Online]. A vailable: http://arxi v .org/abs/ 1806.07336 [53] J. Devlin, J. Uesato, R. Singh, and P . Kohli, “Semantic code repair using neuro-symbolic transformation networks, ” CoRR , vol. abs/1710.11054, 2017. [Online]. A vailable: http://arxiv .org/abs/1710.11054 [54] J. Henkel, S. K. Lahiri, B. Liblit, and T . W . Reps, “Code vectors: understanding programs through embedded abstracted symbolic traces, ” in Pr oceedings of the 2018 ACM Joint Meeting on Eur opean Softwar e Engineering Conference and Symposium on the F oundations of Software Engineering, ESEC/SIGSOFT FSE 2018, Lake Buena V ista, FL, USA, November 04-09, 2018 , 2018, pp. 163–174. [55] D. DeFreez, A. V . Thakur, and C. Rubio-Gonz ´ alez, “Path-based function embedding and its application to speciﬁcation mining, ” CoRR , vol. abs/1802.07779, 2018. [56] X. Xu, C. Liu, Q. Feng, H. Y in, L. Song, and D. Song, “Neural network- based graph embedding for cross-platform binary code similarity detec- tion, ” in CCS , 2017, pp. 363–376. [57] U. Alon, S. Brody , O. Levy , and E. Y aha v , “code2seq: Generating sequences from structured representations of code, ” in 7th International Confer ence on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 , 2019. [Online]. A vailable: https://openrevie w . net/forum?id=H1gKY o09tX [58] M. Allamanis, E. T . Barr , P . Dev anbu, and C. 
Sutton, “ A survey of machine learning for big code and naturalness, ” ACM Computing Surveys (CSUR) , vol. 51, no. 4, p. 81, 2018. [59] K. W ang and M. Christodorescu, “Coset: A benchmark for ev aluating neural program embeddings, ” CoRR , 2019. [Online]. A vailable: https://arxiv .or g/abs/1905.11445 [60] D. Gerz, I. V uli ´ c, F . Hill, R. Reichart, and A. Korhonen, “Simverb- 3500: A large-scale e valuation set of verb similarity , ” arXiv pr eprint arXiv:1608.00869 , 2016. [61] D. L. Nelson, C. L. McEv oy , and T . A. Schreiber , “The univ ersity of south ﬂorida free association, rhyme, and word fragment norms, ” Behavior Resear ch Methods, Instruments, & Computers , vol. 36, no. 3, pp. 402–407, 2004. [62] K. Kipper, B. Snyder , and M. Palmer , “Extending a verb-lexicon using a semantically annotated corpus. ” in LREC , 2004. [63] K. Kipper, A. K orhonen, N. Ryant, and M. Palmer , “ A large-scale classiﬁcation of english verbs, ” Language Resources and Evaluation , vol. 42, no. 1, pp. 21–40, 2008.
