A Comprehensive Study on Large Language Models for Mutation Testing
Large Language Models (LLMs) have recently been used to generate mutants in both research and industrial practice. However, there has been no comprehensive empirical study of their performance for this increasingly important LLM-based software engineering application. To address this, we conduct a comprehensive empirical study evaluating BugFarm and LLMorpheus (the two state-of-the-art LLM-based approaches), alongside seven LLMs using our newly designed prompt, including both leading open- and closed-source models, on 851 real bugs from two real-world Java bug benchmarks. Our results reveal that, compared to existing rule-based approaches, LLMs generate more diverse mutants that are behaviorally closer to real bugs and, most importantly, achieve 111.29% higher fault detection: 87.98% (for LLMs) vs. 41.64% (for rule-based), an increase of 46.34 percentage points. Nevertheless, our results also reveal that these effectiveness gains come at a cost: the LLM-generated mutants have higher non-compilability, duplication, and equivalent-mutant rates, by 26.60, 10.14, and 3.51 percentage points, respectively. These findings are immediately actionable for both research and practice. Practitioners can have greater confidence in deploying LLM-based mutation, while researchers now have a state-of-the-art baseline against which to develop techniques that further improve effectiveness and reduce cost.
💡 Research Summary
This paper presents the first large-scale empirical evaluation of large language models (LLMs) for mutation testing, comparing them against traditional rule-based mutant generators. The authors focus on Java and use two well-known real-bug benchmarks, Defects4J 2.0 (605 bugs across 12 projects) and ConDefects (246 bugs), for a total of 851 real faults. They evaluate two state-of-the-art LLM-based systems, BugFarm and LLMorpheus, and introduce a novel few-shot prompt template (LLMut) that is applied to seven LLMs, both closed-source (GPT-4o, GPT-4o-Mini, GPT-3.5-Turbo) and open-source (DeepSeek-V3-671b, DeepSeek-Coder-V2-236b, StarChat-β-16b, CodeLlama-Instruct-13b). In total, more than 701,400 mutants are generated, enabling a comprehensive quantitative and qualitative analysis.
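The paper's LLMut template itself is not reproduced here, but the idea of a few-shot mutation prompt can be sketched as follows. This is a hypothetical minimal example in Python; the wording, the example pair, and the `build_prompt` helper are illustrative assumptions, not the authors' actual template.

```python
# Hypothetical sketch of a few-shot mutation-testing prompt in the spirit
# of LLMut; the paper's real template is not reproduced here.
FEW_SHOT_EXAMPLE = """\
Original:  if (a > b) return a;
Mutant:    if (a >= b) return a;
"""

def build_prompt(method_source: str, n_mutants: int = 5) -> str:
    """Assemble a few-shot prompt asking an LLM for mutants of a Java method."""
    return (
        "You are a mutation testing assistant for Java.\n"
        "Given a method, produce subtle, compilable mutants that mimic real bugs.\n\n"
        f"Example:\n{FEW_SHOT_EXAMPLE}\n"
        f"Now generate {n_mutants} distinct mutants for this method:\n"
        f"```java\n{method_source}\n```"
    )

prompt = build_prompt("int max(int a, int b) { return a > b ? a : b; }")
```

The few-shot example anchors the model on small, behavior-changing edits rather than wholesale rewrites, which is the kind of mutant the study rewards.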
Effectiveness (RQ1) – The LLM-based approaches dramatically outperform traditional mutation tools (PIT, Major, LEAM, μBERT). Across all LLM configurations the real-bug detection rate improves by an average of 32.99 percentage points, a 1.75× increase over rule-based methods. The best result is achieved by LLMut using DeepSeek-V3-671b, which reaches a 91.1% detection rate. Coupling analysis shows that LLMorpheus (DeepSeek-V3-671b) couples 52% of its mutants with real bugs, far higher than any traditional tool. Diversity metrics reveal that LLM-generated mutants introduce up to 49 new AST node types (LLMut with DeepSeek-236b and DeepSeek-671b), compared with only 2 new types from Major. Moreover, the distribution of AST edit distances indicates that LLMs produce more complex edits (only 38% have distance = 1 versus 70% for Major), suggesting richer semantic changes.
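The "new AST node types" diversity metric can be illustrated with a small sketch. The study operates on Java ASTs; for brevity this uses Python's built-in `ast` module on a Python snippet, and the `clamp` example is invented for illustration. The principle is the same: collect the node-type sets of the original and the mutant and see what the mutant adds.

```python
import ast

def node_types(src: str) -> set:
    """Collect the set of AST node-type names appearing in a code snippet."""
    return {type(n).__name__ for n in ast.walk(ast.parse(src))}

# Invented example: the mutant wraps the return value in a conditional.
original = "def clamp(x, lo, hi):\n    return min(max(x, lo), hi)"
mutant   = "def clamp(x, lo, hi):\n    return min(max(x, lo), hi) if lo <= hi else x"

# Node types the mutant introduces that the original never used
# (here: a conditional expression and a comparison).
new_types = node_types(mutant) - node_types(original)
```

A rule-based operator like Major's mostly swaps one node for a sibling of the same kind, so its `new_types` set stays tiny; LLM mutants often introduce whole new constructs, which is what the 49-vs-2 contrast captures.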
Validity (RQ2) – The gains in effectiveness come with higher costs in mutant quality. Compilability drops to 76.4% for GPT-4o-generated mutants, whereas Major achieves 97.6%. Duplicate mutants appear at 7.8% and equivalent mutants at 3.1%, representing increases of 10.14 pp and 3.51 pp relative to rule-based baselines. An error-type analysis of non-compilable mutants identifies nine categories; the most frequent are "Usage of Unknown Methods" and "Code Structural Destruction." These errors are concentrated in code regions involving method references and invocations, highlighting the need for better syntactic guidance in prompts or post-processing filters.
Efficiency (RQ3) – The authors measure token consumption and latency for each LLM. Large commercial models (GPT‑4o, DeepSeek‑V3‑671b) incur higher monetary costs per mutant, but their superior detection efficiency yields a favorable cost‑per‑effective‑mutant ratio compared with cheaper models. Open‑source models provide a cheaper alternative with modest effectiveness loss, offering a practical trade‑off for industry adoption.
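The cost-per-effective-mutant argument is simple arithmetic: dividing the per-mutant cost by the fraction of mutants that are effective can flip the ranking between a cheap and an expensive model. The figures below are invented for illustration; the paper's actual per-model costs are not reproduced here.

```python
# Illustrative numbers only; not the paper's reported costs.
def cost_per_effective_mutant(cost_per_mutant: float, effective_rate: float) -> float:
    """Amortized cost of one effective mutant, treating the
    effectiveness rate as the yield of the generation process."""
    return cost_per_mutant / effective_rate

# A pricier model with a high yield...
large = cost_per_effective_mutant(cost_per_mutant=0.004, effective_rate=0.90)
# ...can beat a cheaper model with a low yield.
small = cost_per_effective_mutant(cost_per_mutant=0.002, effective_rate=0.30)
```

With these hypothetical numbers the large model costs about $0.0044 per effective mutant versus about $0.0067 for the small one, matching the paper's qualitative finding that superior detection efficiency can offset a higher sticker price.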
Compilation Errors (RQ4) – Detailed taxonomy shows that most compilation failures stem from missing imports, incorrect method signatures, and malformed control structures. The paper suggests integrating static analysis checks before feeding generated code back to the test harness.
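The suggested pre-filter can be sketched as a compile check that rejects malformed mutants before they reach the test harness. For a Java pipeline this would invoke `javac` (or an in-process compiler API); here Python's built-in `compile()` stands in, and the candidate strings are invented examples.

```python
# Sketch of a static pre-filter: discard mutants that fail to compile
# before running any tests. Python's compile() stands in for javac.
def compiles(src: str) -> bool:
    try:
        compile(src, "<mutant>", "exec")
        return True
    except SyntaxError:
        return False

candidates = [
    "x = a + b",   # well-formed mutant
    "x = a +",     # malformed: dangling operator (structural destruction)
]
valid = [m for m in candidates if compiles(m)]
```

Such a filter catches the "Code Structural Destruction" category outright; "Usage of Unknown Methods" needs an actual compiler with the project's classpath, since the error is semantic rather than syntactic.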
Surviving Mutants (RQ5) – Surviving mutants are categorized into three cases: (1) not covered by any test, (2) covered but without state infection, and (3) cause state infection but remain undetected. LLM‑generated mutants, especially those from the LLMut prompt, produce a higher proportion of case (3) mutants, which are valuable for guiding test‑suite augmentation because they expose subtle weaknesses in assertions.
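The three-way triage above is mechanical once per-mutant coverage and state-infection flags are available. The sketch below assumes such flags come from coverage and state-inspection tooling (hypothetical inputs, not the paper's pipeline):

```python
# Triage a surviving mutant into the three cases described in RQ5.
# Inputs are assumed to come from coverage / state-inspection tooling.
def classify_survivor(covered: bool, infects_state: bool) -> int:
    """Return the survival case:
    1 = not covered by any test,
    2 = covered but no state infection,
    3 = state infected yet undetected (best target for new assertions)."""
    if not covered:
        return 1
    return 3 if infects_state else 2
```

Case 3 mutants are the actionable ones: the test suite reaches and corrupts program state but its assertions never notice, so adding or strengthening an assertion is usually enough to kill them.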
Overall contributions: (1) a thorough comparative benchmark of LLM‑based versus traditional mutation generators; (2) a novel prompt template that consistently improves LLM performance; (3) an extensive error analysis that pinpoints where LLMs falter; (4) a publicly released, richly annotated Java mutant dataset suitable for future fine‑tuning work.
The study concludes that LLMs hold great promise for advancing mutation testing—delivering markedly higher fault detection and diversity—yet practical deployment must address the higher rates of non‑compilable, duplicate, and equivalent mutants. Future work should explore prompt engineering, post‑generation validation pipelines, and model fine‑tuning to reduce these overheads, thereby making LLM‑driven mutation testing a robust component of modern software quality assurance pipelines.