LLPut: Investigating Large Language Models for Bug Report-Based Input Generation
Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Since bug reports are written in natural language, prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique to empirically evaluate the performance of three open-source generative LLMs – LLaMA, Qwen, and Qwen-Coder – in extracting relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
💡 Research Summary
The paper “LLPut: Investigating Large Language Models for Bug Report-Based Input Generation” addresses the challenge of automatically extracting failure‑inducing inputs (commands or test cases) from natural‑language bug reports, a task that is essential for reproducing and diagnosing software defects. The authors focus on the GNU coreutils project and construct a curated dataset of 206 bug reports drawn from Red Hat Bugzilla. To build the dataset, they collected 779 reports, filtered out irrelevant entries (e.g., release announcements), and had two authors manually annotate a random subset of 250 reports, discussing disagreements until consensus was reached. After discarding reports without clear input information, the final set comprised 149 reports containing explicit commands, 54 reports with no commands, and 3 ambiguous cases, all labeled accordingly.
The study evaluates two families of approaches. First, a baseline using BERT‑base‑uncased fine‑tuned for token classification (command vs. non‑command tokens) on the 149 command‑bearing instances (80 % training, 20 % testing). The baseline performed poorly: only 3.33 % of test predictions achieved a BLEU‑2 score ≥ 0.5, and no exact matches were observed. Error analysis revealed that the model frequently missed entire tokens or mis‑identified command fragments, likely due to the limited training data and the complex, often multi‑line nature of command descriptions embedded in prose.
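The paper does not show its preprocessing code, but the token‑classification framing of the BERT baseline can be sketched as follows: every token in a bug report is labeled as belonging to a command or not, and the model is fine‑tuned to predict those labels. The function name, the `CMD`/`O` label scheme, and the set‑membership alignment below are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of the token-classification framing behind the BERT
# baseline: each whitespace token of a bug report gets a label, CMD if it
# is part of the annotated command, O otherwise. Real pipelines align
# character spans to subword tokens; this set-membership shortcut is a
# deliberate simplification for illustration.

def label_tokens(report_text, command):
    """Return (token, label) pairs: CMD for tokens that appear in the
    annotated command string, O for everything else."""
    command_tokens = set(command.split())
    tokens = report_text.split()
    return [(tok, "CMD" if tok in command_tokens else "O") for tok in tokens]

pairs = label_tokens(
    "Running sort -u file.txt crashes with a segfault.",
    "sort -u file.txt",
)
```

A model such as `BertForTokenClassification` would then be trained on such pairs; as the summary notes, 149 labeled reports proved too little data for this to work well.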
Second, the authors assess three open‑source generative large language models (LLMs): Meta’s LLaMA‑3.3‑70B, Alibaba’s Qwen2.5‑32B‑instruct, and Qwen2.5‑coder‑32B. All models are accessed via the Ollama platform with temperature set to 0 to ensure deterministic outputs. Two prompting strategies are explored: zero‑shot (no examples) and one‑shot (a single illustrative example). Preliminary qualitative testing indicated that one‑shot prompts yielded more structured and accurate responses, so the final experiments used a uniform one‑shot prompt across all models. The prompt instructs the model to read a bug description and output only the exact command(s) needed for reproduction, or “None” if no command exists.
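The exact prompt wording is not reproduced here, so the sketch below uses an assumed phrasing and an invented illustrative example; only the overall shape (an instruction, one worked example, then the target report) follows the one‑shot setup described above.

```python
# Hedged sketch of the one-shot prompting setup. The instruction text and
# the worked example are assumptions for illustration, not the paper's
# actual prompt.

ONE_SHOT_EXAMPLE = (
    "Bug report: wc -l prints the wrong count on files without a "
    "trailing newline.\n"
    "Command: printf 'a' | wc -l"
)  # invented example, not taken from the paper's dataset

def build_prompt(bug_description):
    """Assemble instruction + one-shot example + target bug report."""
    return (
        "Read the bug report below and output ONLY the exact command(s) "
        "needed to reproduce the bug, or 'None' if no command exists.\n\n"
        + ONE_SHOT_EXAMPLE
        + "\n\nBug report: " + bug_description + "\nCommand:"
    )

# With a local Ollama server running, the query might look like:
#   import ollama
#   reply = ollama.chat(
#       model="llama3.3:70b",  # assumed model tag
#       messages=[{"role": "user", "content": build_prompt(report_text)}],
#       options={"temperature": 0},  # deterministic decoding, as in the study
#   )
```

Setting the temperature to 0, as the authors do, makes repeated runs reproducible, which matters when comparing models on a fixed benchmark.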
Performance is measured using BLEU‑2, a standard n‑gram overlap metric, comparing model‑generated command strings against the human‑annotated ground truth. Empty‑output cases (both model and reference produce “None”) are treated as perfect matches. Results show that Qwen2.5‑32B‑instruct attains the highest average BLEU‑2 scores, marginally outperforming LLaMA‑3.3‑70B and Qwen2.5‑coder‑32B. While all LLMs substantially exceed the BERT baseline, the differences among them suggest that instruction‑following capability (Qwen) and code‑specific fine‑tuning (Qwen‑coder) each confer distinct advantages depending on the nature of the command text.
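The scoring scheme described above can be sketched in a few lines: BLEU‑2 is the geometric mean of clipped unigram and bigram precision times a brevity penalty, with the paper's special case that two “None” answers count as a perfect match. This is a simplified single‑reference implementation for illustration, not the authors' evaluation script.

```python
# Simplified BLEU-2: geometric mean of unigram and bigram modified
# precision with a brevity penalty, plus the "both say None" special case
# described in the summary. Single reference, no smoothing.
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(candidate, reference):
    # Paper's convention: if both sides report no command, score as 1.0.
    if candidate.strip().lower() == "none" and reference.strip().lower() == "none":
        return 1.0
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    precisions = []
    for n in (1, 2):
        c_counts, r_counts = _ngrams(cand, n), _ngrams(ref, n)
        overlap = sum((c_counts & r_counts).values())  # clipped matches
        precisions.append(overlap / max(sum(c_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero n-gram precision zeroes the score
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / 2)
```

As the summary's fourth insight notes, a high BLEU‑2 score does not guarantee the extracted command is executable or reproduces the bug; it only measures textual overlap with the annotation.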
Key insights include: (1) traditional token‑level NLP methods struggle with the sparse, heterogeneous command extraction task, especially under limited data; (2) generative LLMs can leverage their massive pre‑training to infer command structures with minimal task‑specific examples; (3) model performance is sensitive to prompt design and the presence of code‑oriented pre‑training; (4) BLEU‑2, while useful for measuring textual similarity, does not capture functional correctness or executability of the extracted commands.
The authors acknowledge several limitations: the dataset is confined to a single project (coreutils) and a single language (English), the sample size (206 reports) is modest, and the evaluation relies on a similarity metric rather than actual execution of the commands. They propose future work to expand the dataset across diverse projects and languages, to develop an automated test harness that runs the extracted commands to verify reproducibility, and to design richer evaluation metrics that combine textual similarity with functional success rates.
In summary, LLPut provides the first systematic empirical study of open‑source generative LLMs for extracting reproducible inputs from bug reports. It demonstrates that LLMs, particularly instruction‑tuned models like Qwen, can substantially outperform conventional NLP baselines, opening a promising avenue for automating a labor‑intensive step in software debugging and maintenance workflows.