CupCleaner: A Hybrid Data Cleaning Approach for Comment Updating
Comment updating is an emerging task in software evolution that aims to automatically revise source code comments in accordance with code changes. This task plays a vital role in maintaining code-comment consistency throughout software development. Recently, deep learning-based approaches have shown great potential in addressing comment updating by learning complex patterns between code edits and corresponding comment modifications. However, the effectiveness of these learning-based approaches heavily depends on the quality of training data. Existing datasets are typically constructed by mining version histories from open-source repositories such as GitHub, where there is often little quality control over comment edits. As a result, these datasets may contain noisy or inconsistent samples that hinder model learning and generalization. In this paper, we focus on cleaning existing comment updating datasets, considering both the data's characteristics in the updating scenario and their implications for the model training process. We propose a hybrid statistical approach named CupCleaner (Comment UPdating's CLEANER) to achieve this purpose. Specifically, we combine static semantic information within data samples and dynamic loss information during the training process to clean the dataset. Experimental results demonstrate that, on the same test set, the static and dynamic strategies each filter out a portion of the data and significantly improve model performance. Furthermore, a model ensemble combines the strengths of static and dynamic cleaning, further improving both model performance and the reliability of its outputs.
💡 Research Summary
The paper addresses a critical but often overlooked problem in the emerging task of comment updating: the quality of training data. Comment updating aims to automatically revise natural‑language comments in source code whenever the underlying code changes, a capability that can save developers considerable effort and prevent outdated documentation from causing bugs. Recent deep‑learning approaches such as PLBART and CodeT5 have shown promising results, but their performance is highly sensitive to the noise present in the datasets that are typically mined automatically from open‑source repositories. Because these datasets are constructed by extracting commit histories, they frequently contain samples where the code change and the comment change are unrelated, where the old and new comments are semantically inconsistent, or where the comment change is trivial and does not reflect the code modification. Such noisy samples hinder model convergence, inflate loss, and ultimately degrade the quality of generated comments.
To remedy this, the authors propose CupCleaner (Comment UPdating’s CLEANER), a hybrid data‑cleaning framework that combines two complementary strategies: a static, semantics‑aware filter and a dynamic, loss‑aware filter.
**Static Semantic Filtering**
- Each sample (old comment, old code, new code, new comment) is embedded using a pretrained code‑comment model (e.g., CodeBERT).
- Three similarity scores are computed: (a) semantic distance between old and new comments, (b) semantic distance between old and new code, and (c) overall consistency of the (code, comment) pairs before and after the change.
- A weighted sum yields a final quality score; samples below a preset threshold are deemed semantically noisy and removed. This step directly tackles the first limitation of prior rule‑based cleaners, which cannot capture the nuanced relationship between code edits and comment edits.
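The static filter described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the `embed` function here uses a bag-of-words vector as a stand-in for a pretrained encoder such as CodeBERT, and the weights and threshold values are hypothetical placeholders.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a pretrained encoder such as CodeBERT:
    # a simple bag-of-words vector, purely for illustration.
    return Counter(text.split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    num = sum(count * b.get(token, 0) for token, count in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def quality_score(sample, weights=(0.4, 0.3, 0.3)):
    # Weighted sum of the three similarity signals (a)-(c);
    # the weights here are illustrative, not the paper's values.
    old_comment, old_code, new_code, new_comment = sample
    s_comment = cosine(embed(old_comment), embed(new_comment))   # (a)
    s_code = cosine(embed(old_code), embed(new_code))            # (b)
    s_pair = 0.5 * (cosine(embed(old_comment), embed(old_code))  # (c)
                    + cosine(embed(new_comment), embed(new_code)))
    return (weights[0] * s_comment + weights[1] * s_code
            + weights[2] * s_pair)

def static_filter(samples, threshold=0.3):
    # Drop samples whose quality score falls below the preset threshold.
    return [s for s in samples if quality_score(s) >= threshold]
```

A sample whose old and new comments (or old and new code) share no semantic content scores near zero and is discarded, while a coherent edit pair passes the threshold.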
**Dynamic Loss Filtering**
- A base model is trained on the full dataset for a few epochs. For every training instance, the average loss and the variance of loss across epochs are recorded.
- High average loss indicates that the model finds the sample difficult to learn, while high variance suggests label instability or inherent noise.
- A composite metric (e.g., a weighted combination of mean loss and loss variance) ranks the samples, and the top-ranked fraction (e.g., 20%) is filtered out. This strategy exploits the observation that noisy data often manifests as outliers in the training dynamics, addressing the second limitation of static-only approaches.
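The loss-based ranking above can be expressed compactly. This sketch assumes per-sample losses have already been recorded across epochs (e.g., by training with an unreduced loss); the function name, the mixing weight `alpha`, and the default fraction are illustrative assumptions, not the paper's exact formulation.

```python
from statistics import mean, pvariance

def rank_noisy(loss_history, alpha=0.5, top_frac=0.2):
    # loss_history maps sample_id -> list of that sample's per-epoch losses.
    # Score each sample by a weighted mix of mean loss (hard to learn)
    # and loss variance (unstable, possibly noisy), then flag the
    # top-ranked fraction as suspected-noisy and return their ids.
    scores = {sid: alpha * mean(losses) + (1 - alpha) * pvariance(losses)
              for sid, losses in loss_history.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = int(len(ranked) * top_frac)
    return set(ranked[:k])
```

Because the loss values are collected during the warm-up training pass anyway, this ranking adds essentially no extra compute.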
The authors evaluate each strategy separately and in combination. On two public comment‑updating datasets (the Panthaplackel 2021 dataset and an extended version covering @param and summary comments), they conduct human annotation to verify that filtered samples receive significantly lower quality scores (average 2.5/5) compared with retained high‑quality samples (average 4.3/5).
When training PLBART and CodeT5 on the cleaned data, the Exact Match (EM) metric improves from 33.33 % to 38.0 % (PLBART) and from 41.33 % to 45.33 % (CodeT5). Notably, CodeT5 trained on CupCleaner‑cleaned data outperforms the specialized CodeT5‑Edit model, which was pre‑trained on code‑editing tasks. Both static and dynamic filters individually boost performance, but the model‑level ensemble—training two separate models on the two cleaned corpora and selecting the higher‑scoring output at inference—yields the best results, confirming that the two filters capture complementary aspects of data quality.
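The inference-time ensemble reduces to a simple selection rule: each cleaned-corpus model proposes an updated comment together with a confidence score, and the higher-scoring output is kept. A minimal sketch, assuming the score is something like a sequence log-probability (the exact scoring criterion is not specified here):

```python
def ensemble_select(candidates):
    # candidates: one (generated_comment, model_score) pair per model,
    # e.g. from the statically-cleaned and dynamically-cleaned models.
    # Keep the output whose model is most confident.
    return max(candidates, key=lambda c: c[1])[0]
```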
Regarding efficiency, the static filter requires a one‑time embedding pass (linear in the number of samples) and can be reused for multiple experiments. The dynamic filter incurs negligible overhead because it reuses loss information already generated during normal training. Overall cleaning time is modest compared with the time needed to fine‑tune the downstream models, making CupCleaner practical for real‑world pipelines.
The paper’s contributions are threefold: (1) a novel hybrid cleaning methodology that jointly leverages semantic similarity and training dynamics, (2) an ensemble technique that merges the strengths of both cleaned corpora at inference time, and (3) extensive empirical validation—including human studies and quantitative metrics—demonstrating consistent gains across models and datasets.
In summary, CupCleaner proves that systematic, task‑aware data cleaning can substantially improve the reliability of comment‑updating systems. The approach is unsupervised, requires no external high‑quality labels, and is readily extensible to other programming languages or software‑engineering tasks such as automated refactoring or bug‑fix generation. Future work may explore multilingual extensions, real‑time cleaning during continuous integration, and integration with larger code‑pretraining frameworks.