ComPass: Contrastive Learning for Automated Patch Correctness Assessment in Program Repair
Automated program repair (APR) aims to reduce manual debugging effort and plays a vital role in software maintenance. Despite remarkable progress, APR still tends to generate overfitting patches, i.e., patches that pass the available test suite but are nevertheless incorrect. This issue, known as patch overfitting, has become a key concern in the APR community, and numerous approaches have been proposed to address it. Recent work proposes a pre-trained language model (PLM)-based automated patch correctness assessment (APCA) approach, indicating the potential of PLMs for reasoning about patch correctness. Although promising, it remains far from perfect due to limitations such as its training paradigm and training dataset. In this paper, we present ComPass, a PLM-based APCA approach that leverages contrastive learning and data augmentation to address the technical limitations of prior work. Our work is motivated by the opportunity to integrate contrastive learning with recent PLMs for patch correctness assessment, where large-scale labeled patches are difficult to obtain. ComPass uses code transformation rules to generate semantics-preserving code snippets for both the unlabeled pre-training corpus and the labeled fine-tuning patches. It then pre-trains PLMs with a contrastive objective that captures code features sharing the same semantics but differing in structure. Finally, ComPass integrates the representation embeddings of patch code snippets and jointly fine-tunes the PLM with a binary classifier to assess patch correctness. Experimental results on 2,274 real-world patches from Defects4J demonstrate that ComPass achieves an accuracy of 88.35%, significantly outperforming the state-of-the-art baseline APPT.
💡 Research Summary
Automated Program Repair (APR) aims to reduce the manual effort required to fix software bugs by automatically generating patches. While recent advances have produced many promising patch generation techniques, a persistent problem is “patch overfitting”: patches that pass the developer‑provided test suite but do not actually fix the bug. Overfitting patches force developers to spend additional time inspecting and discarding incorrect fixes, limiting the practical adoption of APR.
Existing Automated Patch Correctness Assessment (APCA) methods fall into static, dynamic, and learning‑based categories. The most recent learning‑based approach, APPT, fine‑tunes a pre‑trained BERT model on a modestly sized labeled patch dataset (1,183 patches from Defects4J) and then classifies patches as correct or overfitting. APPT, however, suffers from two fundamental drawbacks: (1) it treats the problem as a plain classification task, making the model overly sensitive to superficial code changes (e.g., variable renaming) that do not affect semantics; (2) the limited amount of labeled patches hampers the model’s ability to learn robust representations, leading to sub‑optimal generalization.
ComPass addresses these limitations by integrating contrastive learning with data augmentation. The core idea is to generate semantic‑preserving code transformations (e.g., variable renaming, statement reordering, insertion of no‑op code, conditional negation) and use them to create pairs of code snippets that share the same meaning but have different syntactic structures. These pairs serve as positive examples in a contrastive pre‑training phase, while unrelated code fragments act as negatives. By minimizing an InfoNCE‑style loss, the encoder learns an embedding space where semantically equivalent snippets are close together, regardless of superficial syntactic variations.
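The idea above can be made concrete with a minimal sketch: one illustrative semantics-preserving transformation (systematic variable renaming) generates a positive pair, and an InfoNCE-style loss pulls the pair's embeddings together while pushing unrelated embeddings apart. The function names, the regex-based renamer, and the plain-list embeddings are assumptions for illustration, not the paper's implementation, which operates on a PLM encoder's outputs.

```python
import math
import re

# Hypothetical transformation rule: systematic variable renaming.
# This is one illustrative semantics-preserving rule; the paper's
# eight handcrafted rules are not reproduced here.
def rename_variables(code, mapping):
    """Rename identifiers in a code snippet according to `mapping`."""
    for old, new in mapping.items():
        code = re.sub(r"\b%s\b" % re.escape(old), new, code)
    return code

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE over pre-computed embedding vectors (lists of floats).

    The loss is low when the anchor is close to its positive (a
    semantics-preserving variant) and far from the negatives
    (unrelated code fragments).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def cosine(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

In practice the embeddings would come from the encoder being pre-trained, and the loss would be minimized over large batches of transformed/original snippet pairs.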
The workflow consists of three stages:
- Transformation Rule Definition – Eight handcrafted rules are designed to guarantee semantic equivalence while altering code appearance.
- Contrastive Pre‑training – A large corpus of unlabeled source code (millions of lines) is transformed according to the rules, producing massive numbers of positive pairs. The encoder (BERT in the baseline implementation) is trained with a contrastive objective, encouraging it to be invariant to the defined syntactic perturbations.
- Fine‑tuning with Augmented Labeled Patches – For each labeled patch (correct or overfitting), the same transformation rules are applied to generate multiple augmented versions that inherit the original label. The pre‑trained encoder and a binary classifier are jointly fine‑tuned using a weighted sum of the contrastive loss and cross‑entropy loss. This joint optimization ensures that the encoder’s robustness to syntactic changes is retained while the classifier learns to discriminate the two classes.
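The joint fine-tuning objective, a weighted sum of the contrastive and classification losses, can be sketched as follows. The weight `alpha` and the exact loss formulations are illustrative assumptions; the paper's hyperparameters are not reproduced here.

```python
import math

def binary_cross_entropy(prob_correct, label):
    """Binary cross-entropy for one patch; label is 1 (correct) or 0 (overfitting)."""
    p = prob_correct if label == 1 else 1.0 - prob_correct
    return -math.log(max(p, 1e-12))  # clamp to avoid log(0)

def joint_loss(contrastive_loss, prob_correct, label, alpha=0.5):
    """Weighted sum of the contrastive and classification terms.

    `alpha` balances representation robustness (contrastive term)
    against classification accuracy (cross-entropy term); its value
    here is a placeholder, not the paper's setting.
    """
    return alpha * contrastive_loss + (1.0 - alpha) * binary_cross_entropy(prob_correct, label)
```

During fine-tuning, the contrastive term is computed over a patch and its augmented variants, while the cross-entropy term is computed from the classifier's prediction against the inherited label.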
Experimental Evaluation
The authors evaluate ComPass on 2,274 real patches extracted from Defects4J, generated by more than 30 repair tools. Compared with the state‑of‑the‑art APPT, ComPass achieves:
- Accuracy: 88.35% (↑ 6.33 percentage points)
- Precision: 87.50% (↑ 5.29 percentage points)
- Recall: 88.69% (↑ 8.87 percentage points)
- F1‑score: 88.09% (↑ 7.05 percentage points)
Ablation studies reveal that the contrastive pre‑training contributes +4.11% to accuracy and +7.80% to precision, while the augmentation during fine‑tuning adds another ≈2.6% across all metrics. When the framework is integrated with more advanced encoders such as CodeBERT, accuracy improvements range from 4.9% to 15.7%, demonstrating the method’s encoder‑agnostic nature. Cross‑project experiments (training on one project, testing on another) show that ComPass still outperforms all baselines, confirming that the learned semantic‑invariant embeddings generalize across code bases.
Key Contributions
- Contrastive Pre‑training for APCA – Introduces a self‑supervised pre‑training stage that makes the model robust to syntactic perturbations while preserving sensitivity to semantic differences.
- Data Augmentation for Low‑Resource Settings – Leverages the same transformation rules to expand the limited labeled patch set, mitigating the data‑scarcity problem inherent in APR research.
- Generic Framework – Designed to work with any encoder‑only PLM; the paper demonstrates integration with BERT and CodeBERT, and discusses potential extensions to encoder‑decoder models.
- Extensive Empirical Validation – Provides a large, high‑quality benchmark of 2,274 labeled patches, conducts thorough comparisons against seven baselines (static, dynamic, and learning‑based), and performs detailed ablations.
Limitations and Future Work
The current implementation focuses on encoder‑only models; extending the approach to encoder‑decoder architectures (e.g., CodeT5) could further improve performance on generation‑heavy repair scenarios. The transformation rules are manually crafted; automating rule discovery or learning them from data could reduce engineering effort and adapt the method to new programming languages or frameworks. Finally, integrating runtime information (e.g., execution traces) with the contrastive embeddings may yield a hybrid static‑dynamic APCA system that captures both semantic and behavioral aspects of patches.
In summary, ComPass demonstrates that contrastive learning combined with systematic code augmentation can substantially improve the reliability of automated patch correctness assessment, offering a practical, plug‑in component for APR pipelines that filters out overfitting patches before costly validation or manual review.