Rethinking the effects of data contamination in Code Intelligence

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns about data contamination and its potential impact on model performance evaluation. Previous studies mainly focused on sample-level contamination, ignoring the partial contamination scenarios that are pervasive in code intelligence. This paper fills this gap with a systematic empirical study of fine-grained data contamination on mainstream code tasks. Our study involves representative PLMs (RoBERTa and GPT-2) and LLMs (LLaMA and StarCoder), covering three major tasks (code translation, code generation, and code summarization) across two Programming Languages (PLs): Java and Python. We categorize contamination scenarios into four types according to code intelligence practice, namely input-only, output-only, unpaired, and paired contamination, and construct corresponding experimental and control groups for exploration. Experimental results show that, under the pre-training, fine-tuning, and inference paradigm adopted by PLMs, even deliberately injected paired contamination does not lead to significant performance overestimation, whereas direct inference or small-scale fine-tuning does uncover the contamination effects. In contrast, LLMs, which follow a pre-training-and-inference paradigm, are significantly affected by paired contamination. The remaining contamination scenarios have no impact on either PLMs or LLMs. Our findings challenge the conventional belief that contamination inevitably leads to performance overestimation, providing new insights into the evaluation and deployment of code intelligence models.


💡 Research Summary

The paper conducts a systematic empirical investigation of data contamination in code‑intelligence tasks, a topic that has received limited attention beyond the coarse “sample‑level” contamination studied in natural‑language settings. The authors first define four realistic contamination scenarios that frequently arise in software‑engineering pipelines: (1) input‑only, where only the test inputs appear in the pre‑training corpus; (2) output‑only, where only the test outputs are present; (3) unpaired, where both inputs and outputs are seen during pre‑training but never together as a matching pair; and (4) paired, where exact input‑output pairs from the test set are also present in the pre‑training data. These categories capture the nuanced ways code fragments can leak into large corpora (e.g., a Java method seen during pre‑training while its translated ArkTS version is not, or unit‑test methods and the code under test appearing in separate files).
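The four contamination categories can be sketched as a small classifier over a pre‑training corpus. This is a toy illustration of the definitions above, not the paper's implementation; the set-based corpus representation and the `classify` helper are assumptions made for the sketch:

```python
from enum import Enum

class Contamination(Enum):
    NONE = "none"
    INPUT_ONLY = "input-only"
    OUTPUT_ONLY = "output-only"
    UNPAIRED = "unpaired"
    PAIRED = "paired"

def classify(example, corpus_inputs, corpus_outputs, corpus_pairs):
    """Classify a test example's contamination type against a pre-training
    corpus. `corpus_inputs` / `corpus_outputs` are sets of code or NL
    fragments seen during pre-training; `corpus_pairs` holds (input, output)
    tuples that co-occurred as a matching pair.
    """
    inp, out = example
    if (inp, out) in corpus_pairs:
        return Contamination.PAIRED          # exact pair leaked
    if inp in corpus_inputs and out in corpus_outputs:
        return Contamination.UNPAIRED        # both seen, never together
    if inp in corpus_inputs:
        return Contamination.INPUT_ONLY
    if out in corpus_outputs:
        return Contamination.OUTPUT_ONLY
    return Contamination.NONE
```

In a real pipeline the membership checks would be fuzzier (normalized or near-duplicate matching), but the four-way distinction is the same.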

To evaluate the impact of each scenario, the study selects two representative pre‑trained language models (PLMs), RoBERTa (encoder‑only) and GPT‑2 (decoder‑only), and two large language models (LLMs), LLaMA (general‑purpose) and StarCoder (code‑focused). For PLMs, the authors train the models from scratch on a controlled code corpus (Java and Python subsets of CodeSearchNet), deliberately injecting contaminated samples into the pre‑training data according to the four scenarios. After pre‑training, the models are fine‑tuned on three downstream code tasks: code translation (Java↔C# and Python↔Java), code generation (NL→Java, NL→Python), and code summarization (code→NL). Performance is measured with BLEU and METEOR, and each experiment is repeated five times to enable statistical testing.
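The repeated-runs protocol boils down to comparing matched score lists from the contaminated and clean conditions. A minimal, standard-library-only sketch of that comparison is below; the paper's exact statistical test is not specified here, so the paired t-statistic is an assumption:

```python
import statistics as st

def relative_gain(contaminated, clean):
    """Mean relative gain (%) of contaminated over clean scores, plus a
    paired t-statistic over the per-seed differences.

    `contaminated` and `clean` are matched score lists from repeated runs
    (e.g. BLEU over five random seeds with contaminated vs. clean test sets).
    """
    diffs = [c - k for c, k in zip(contaminated, clean)]
    gain_pct = 100.0 * st.mean(diffs) / st.mean(clean)
    # paired t-statistic: mean difference / standard error of the differences
    n = len(diffs)
    t = st.mean(diffs) / (st.stdev(diffs) / n ** 0.5)
    return gain_pct, t
```

A |t| well above ~2 with five runs would indicate a significant gap, mirroring the paper's p < 0.05 reporting.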

For LLMs, full re‑training is infeasible, so the authors instead extract contaminated examples directly from the publicly released pre‑training corpora of LLaMA and StarCoder. They construct contaminated test sets and corresponding “clean” counterparts by applying perturbations that preserve length and difficulty while removing the leaked content. The LLMs are then evaluated in a pure pre‑training‑then‑inference regime, mirroring typical usage of these massive models.
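A length-preserving perturbation can be illustrated by swapping identifiers for equal-length placeholders, which breaks verbatim overlap with the pre-training corpus without changing the snippet's size or token structure. This is a toy sketch, not the paper's exact perturbation procedure; the `mapping` argument is an assumption:

```python
import re

def perturb_identifiers(code, mapping):
    """Rename identifiers to same-length replacements so the perturbed
    snippet keeps the original's length and shape while no longer matching
    the leaked corpus text verbatim.

    `mapping` pairs each identifier with an equal-length substitute,
    e.g. {"total": "accum"}.
    """
    def repl(m):
        name = m.group(0)
        new = mapping.get(name, name)
        assert len(new) == len(name), "length-preserving only"
        return new
    return re.sub(r"[A-Za-z_]\w*", repl, code)
```

Keeping length and difficulty comparable is what makes the perturbed set a fair "clean" control group for the contaminated one.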

Key findings are as follows:

  1. PLMs with the full pre‑training → fine‑tuning → inference pipeline are remarkably robust to all four contamination types. Across all three tasks, BLEU and METEOR differences between contaminated and clean conditions are tiny (‑0.07 % to +0.2 % for BLEU, ‑0.03 % to +0.13 % for METEOR) and not statistically significant (p > 0.05).

  2. When the PLM pipeline is shortened—either by performing only direct inference on a decoder‑only model (GPT‑2) or by fine‑tuning on a very small dataset—the contamination effect resurfaces dramatically. In the paired‑contamination setting, direct inference yields average BLEU gains of 54.09 % and METEOR gains of 39.06 % (p < 0.05), indicating severe over‑estimation caused by memorized input‑output pairs.

  3. Large‑scale fine‑tuning mitigates this memorization. When GPT‑2 is fine‑tuned on a substantial amount of task‑specific data, the performance gap between contaminated and clean conditions collapses, suggesting that extensive task‑aligned training can overwrite the spurious memorization introduced by contamination.

  4. LLMs are sensitive only to paired contamination. In the pre‑training‑then‑inference setting, both LLaMA and StarCoder exhibit consistent over‑estimation when the test set contains exact input‑output pairs from their training data: BLEU improves by an average of 13.41 % and METEOR by 7.39 % (p < 0.05). Input‑only, output‑only, and unpaired contamination have no measurable impact on these models.

  5. The observed patterns hold for both statically typed Java and dynamically typed Python, indicating that the phenomenon is not language‑specific but rather tied to model architecture and usage paradigm.
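Findings 2 and 4 can be made concrete with a toy "memorizing model": a lookup table that reproduces completions seen verbatim during pre-training. Under direct inference, exact-match accuracy is inflated only when the test set contains the memorized pairs; every name below is hypothetical and the setup is deliberately simplistic:

```python
def memorizing_model(prompt, memory, fallback=lambda p: ""):
    """Toy model: return the memorized completion when the prompt was seen
    verbatim during 'pre-training'; otherwise fall back (here: empty)."""
    return memory.get(prompt, fallback(prompt))

# paired contamination: exact prompt -> completion pairs are in the corpus
memory = {"add(1, 2)": "3", "add(2, 3)": "5"}

contaminated_test = [("add(1, 2)", "3"), ("add(2, 3)", "5")]  # leaked pairs
clean_test = [("add(4, 5)", "9"), ("add(6, 1)", "7")]         # unseen pairs

def exact_match(test):
    """Fraction of test pairs the model reproduces exactly."""
    return sum(memorizing_model(p, memory) == y for p, y in test) / len(test)
```

The gap between `exact_match(contaminated_test)` and `exact_match(clean_test)` exists only for paired leaks, which matches why input-only, output-only, and unpaired contamination show no effect.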

The authors argue that these results overturn the prevailing belief that any data contamination inevitably inflates performance metrics. Instead, the impact is highly conditional: it depends on whether the model sees the contaminated data during fine‑tuning, on the model’s architecture (encoder vs. decoder), and on the evaluation pipeline (full fine‑tuning vs. direct inference). For PLMs, the standard three‑stage pipeline effectively shields against contamination, whereas for LLMs the lack of a fine‑tuning stage leaves them vulnerable to paired leaks.

Beyond the empirical contributions, the paper provides a reproducible methodology for extracting contaminated samples from open‑source corpora, enabling future researchers to replicate or extend the study. It also offers practical guidance: developers deploying LLM‑based code assistants should audit their test suites for potential paired leaks, especially when using models that are not further fine‑tuned on task‑specific data. Conversely, researchers evaluating PLMs can be more confident that standard fine‑tuning mitigates most contamination risks.
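The suggested audit for paired leaks can be sketched as a fingerprint scan over (input, output) pairs. The hash-based, whitespace-insensitive exact matching below is an assumption for illustration; a production audit would also catch near-duplicates and pairs that merely co-occur in the same file:

```python
import hashlib

def _fingerprint(inp, out):
    """Whitespace-insensitive fingerprint of an (input, output) pair."""
    norm = " ".join(inp.split()) + "\x1f" + " ".join(out.split())
    return hashlib.sha256(norm.encode()).hexdigest()

def pair_fingerprints(corpus_pairs):
    """Fingerprint every pair that co-occurs in the pre-training corpus."""
    return {_fingerprint(i, o) for i, o in corpus_pairs}

def audit_test_suite(test_pairs, fingerprints):
    """Return the test pairs that leak from the corpus as exact pairs."""
    return [(i, o) for i, o in test_pairs
            if _fingerprint(i, o) in fingerprints]
```

Running such a scan before evaluation flags exactly the scenario the paper found harmful for pre-training-then-inference LLM usage.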

Future work suggested includes extending the analysis to other code‑related tasks such as bug fixing or automated refactoring, testing newer ultra‑large models (e.g., GPT‑4, Claude), and developing automated data‑filtering pipelines that detect and excise paired leaks before model training. Overall, the paper deepens our understanding of how subtle data leakage interacts with model training dynamics in the software‑engineering domain, prompting a more nuanced approach to benchmark construction and model deployment.

