Examining LLMs Ability to Summarize Code Through Mutation-Analysis


As developers increasingly rely on LLM-generated code summaries for documentation, testing, and review, it is important to study whether these summaries accurately reflect what a program actually does. LLMs often produce confident descriptions of what code looks like it should do (intent) while missing the subtle edge cases or logic changes that define what it actually does (behavior). We present a mutation-based evaluation methodology that directly tests whether a summary matches the code’s logic: our approach generates a summary, injects a targeted mutation into the code, and checks whether the LLM updates its summary to reflect the new behavior. We validate the methodology through three experiments totaling 624 mutation-summary evaluations across 62 programs. First, we evaluate 324 mutations on 12 controlled synthetic programs, varying mutation type (statement, value, decision) and location (beginning, middle, end). We find that summary accuracy decreases sharply with complexity, from 76.5% for single functions to 17.3% for multi-threaded systems, while mutation type and location exhibit weaker effects. Second, testing 150 mutated samples derived from 50 human-written programs in the Less Basic Python Problems (LBPP) dataset confirms that the same failure patterns persist, as models often describe algorithmic intent rather than actual mutated behavior, yielding a summary accuracy of 49.3%. Furthermore, while a comparison between GPT-4 and GPT-5.2 shows a substantial performance leap (from 49.3% to 85.3%) and an improved ability to identify mutations as “bugs”, both models continue to struggle to distinguish implementation details from standard algorithmic patterns. This work establishes mutation analysis as a systematic approach for assessing whether LLM-generated summaries reflect program behavior rather than superficial textual patterns.


💡 Research Summary

The paper addresses a critical gap in the evaluation of large language model (LLM)–generated code summaries: whether the natural‑language description truly reflects the program’s actual behavior rather than merely restating its apparent intent. To answer this, the authors adapt mutation analysis—a well‑established software testing technique that injects purposeful faults into correct code—to the domain of code summarization. Their methodology consists of three steps: (1) generate targeted mutations in a code snippet (statement‑level rearrangements, value changes, or decision‑logic alterations); (2) prompt an LLM with a fixed “Explain the following code snippet in plain English” request to obtain a summary for both the original and mutated versions; (3) have human evaluators compare the two summaries and decide whether the mutation is detectable from the textual change. A positive detection means the summary changed in a way that reveals the mutation; a negative detection means the summary remained effectively unchanged, indicating a failure to capture the behavioral shift.
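Step (1), the mutation generator, is described as rule-based but is not shown in code. A minimal sketch of one decision-logic operator (an assumption for illustration, not the authors' implementation; requires Python 3.9+ for `ast.unparse`) could look like this:

```python
import ast

class DecisionMutator(ast.NodeTransformer):
    """Decision-logic mutation: flip every comparison operator."""
    # Operators without an entry (e.g. ==, !=) are left unchanged.
    SWAP = {ast.Lt: ast.GtE, ast.LtE: ast.Gt, ast.Gt: ast.LtE, ast.GtE: ast.Lt}

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.SWAP.get(type(op), type(op))() for op in node.ops]
        return node

def mutate(source: str) -> str:
    """Return the source with each comparison flipped."""
    return ast.unparse(DecisionMutator().visit(ast.parse(source)))

print(mutate("def is_adult(age):\n    return age >= 18"))
# the guard `age >= 18` becomes `age < 18`
```

Both the original and the mutated source would then be sent through the same fixed “Explain the following code snippet in plain English” prompt, and the two resulting summaries handed to human evaluators.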

The authors conduct three complementary experiments. Experiment 1 uses a synthetic dataset of 12 programs deliberately designed to span three complexity levels (single‑function, multi‑function, multi‑threaded) and to vary mutation location (beginning, middle, end). Across 324 mutation‑summary pairs, they find that summary accuracy drops sharply with program complexity: 76.5 % for single‑function programs, 45.2 % for multi‑function, and only 17.3 % for multi‑threaded systems. Mutation type (statement, value, decision) and location have comparatively modest effects.
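The three mutation types can be illustrated on a single toy function (hypothetical examples constructed for this summary, not drawn from the paper's dataset):

```python
def sum_positives(xs):
    """Original: sum of the positive elements."""
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total

def value_mutant(xs):
    total = 1          # value mutation: accumulator starts at 1, not 0
    for x in xs:
        if x > 0:
            total += x
    return total

def decision_mutant(xs):
    total = 0
    for x in xs:
        if x < 0:      # decision mutation: comparison flipped, sums negatives
            total += x
    return total

def statement_mutant(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
        return total   # statement mutation: return moved inside the loop
    return total

print(sum_positives([3, -1, 2]))    # 5
print(value_mutant([3, -1, 2]))     # 6
print(decision_mutant([3, -1, 2]))  # -1
print(statement_mutant([3, -1, 2])) # 3 (exits after the first element)
```

Each mutant still “looks like” a summation routine, which is what makes a purely intent-level summary insensitive to the change.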

Experiment 2 applies the same pipeline to 50 human‑written Python problems from the Less Basic Python Problems (LBPP) benchmark, creating 150 mutated samples. Here the overall detection rate falls to 49.3 %, confirming that the patterns observed on synthetic code persist on real‑world code. The models tend to describe the algorithmic intent (e.g., “merge sort”) even when a subtle change (such as an off‑by‑one constant) alters the actual output.
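This intent-versus-behavior gap is easy to make concrete. The snippet below (an illustrative toy, not taken from LBPP) still reads as “merge two sorted lists”, yet a single off-by-one value mutation in a slice bound silently drops an element:

```python
def merge_sorted(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    # Value mutation: the correct bound len(a) was changed to len(a) - 1,
    # so the last leftover element of `a` is silently dropped.
    out.extend(a[i:len(a) - 1])
    out.extend(b[j:])
    return out

print(merge_sorted([1, 3, 5], [2, 4]))  # [1, 2, 3, 4] -- the 5 is lost
```

A summary such as “merges two sorted lists in ascending order” remains plausible for both the original and the mutant, which is exactly the failure mode the detection rate measures.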

Experiment 3 compares two generations of models: GPT‑4 and the newer GPT‑5.2. Under identical conditions, GPT‑5.2 achieves an 85.3 % detection rate, a substantial improvement over GPT‑4’s 49.3 %. Nevertheless, both models still struggle to differentiate implementation‑specific details from generic algorithmic patterns, though the newer model shows a greater propensity to label a mutation as a “bug.”

The paper makes several contributions. First, it introduces a mutation‑based evaluation framework that directly tests the grounding of LLM summaries in code semantics, sidestepping the shortcomings of n‑gram metrics (BLEU, ROUGE) and execution‑based reconstruction approaches. Second, it provides a systematic analysis of how program complexity, mutation type, and mutation location affect summarization fidelity. Third, it demonstrates that newer LLMs can markedly improve on this metric, yet the gap between intent‑level description and behavior‑level understanding remains.

Limitations are acknowledged: the mutation operators are rule‑based and may not capture the full spectrum of real bugs; human evaluation introduces subjectivity, though the authors argue it is necessary for fine‑grained semantic judgment. Future work is outlined, including automated large‑scale mutation generation, crowdsourced labeling to reduce evaluator bias, and incorporating mutation detection as an explicit training objective for LLMs.

Overall, the study establishes mutation analysis as a practical, behavior‑centric benchmark for code summarization, offering valuable insights for researchers developing more reliable LLM‑based developer tools.

