From Trace to Line: LLM Agent for Real-World OSS Vulnerability Localization
Large language models show promise for vulnerability discovery, yet prevailing methods inspect code in isolation, struggle with long contexts, and focus on coarse function- or file-level detections that offer limited guidance to engineers who need precise line-level localization for targeted patches. We introduce T2L, an executable framework for project-level, line-level vulnerability localization that progressively narrows scope from repository modules to exact vulnerable lines via AST-based chunking and evidence-guided refinement. We provide a baseline agent with an Agentic Trace Analyzer (ATA) that fuses runtime evidence such as crash points and stack traces to translate failure symptoms into actionable diagnoses. To enable rigorous evaluation, we introduce T2L-ARVO, an expert-verified 50-case benchmark spanning five crash families in real-world projects. On T2L-ARVO, our baseline achieves up to 58.0% chunk-level detection and a 54.8% line-level localization rate. Together, the T2L framework, benchmark, and baseline advance LLM-based vulnerability detection toward deployable, precision diagnostics in open-source software workflows.
💡 Research Summary
The paper introduces T2L (Trace‑to‑Line), a novel framework that moves large language model (LLM)‑based vulnerability detection from coarse, function‑ or file‑level predictions to precise, line‑level localization in real‑world open‑source software (OSS) projects. The authors identify three practical gaps in existing work: (1) LLMs are typically applied to isolated code fragments, limiting context; (2) they produce only high‑level alerts that do not tell developers exactly which line to patch; and (3) they rarely incorporate runtime evidence such as crash logs, stack traces, or sanitizer reports. To address these issues, T2L defines a two‑tier task: a coarse‑grained “chunk‑level detection” phase followed by a fine‑grained “line‑level localization” phase.
The technical pipeline begins with AST‑based chunking, which partitions a repository into semantically meaningful, function‑aligned code chunks that fit within LLM context windows while preserving cross‑module relationships. Next, the Agentic Trace Analyzer (ATA) runs the target program in a reproducible Docker environment, instruments it with tools such as AddressSanitizer, GDB, and static analyzers, and collects a rich set of runtime artifacts: crash points, stack traces, memory‑violation patterns, and static warnings. All artifacts are merged into a single “evidence block” that is fed to the LLM, ensuring that the model reasons over a global view of the failure rather than isolated hints.
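The chunking idea can be illustrated with a minimal sketch. T2L targets real C/C++ OSS projects; the example below uses Python's own `ast` module purely for demonstration, and the `max_lines` budget and the `chunk_source` helper are illustrative assumptions, not the paper's implementation:

```python
import ast
import textwrap

def chunk_source(source: str, max_lines: int = 120) -> list[dict]:
    """Partition a module into function-aligned chunks (illustrative sketch).

    Each top-level function or class becomes one chunk; chunks exceeding
    the line budget are flagged so a caller could split them further to
    fit an LLM context window.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno
            chunks.append({
                "name": node.name,
                "start_line": start,
                "end_line": end,
                "oversized": (end - start + 1) > max_lines,
            })
    return chunks

example = textwrap.dedent("""
    def parse_header(buf):
        return buf[:4]

    def copy_payload(dst, src, n):
        for i in range(n):
            dst[i] = src[i]
""")

for c in chunk_source(example):
    print(c["name"], c["start_line"], c["end_line"])
```

Because chunk boundaries follow the AST rather than fixed character windows, no function is ever split mid-body, which preserves the semantic units the LLM reasons over.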
A hierarchical planner‑executor architecture orchestrates the workflow. The planner first asks the LLM to generate a list of candidate file:line pairs (hypotheses) together with confidence scores, based on the unified evidence. Each hypothesis is independently checked against the ground‑truth patch diff. The executor then reports success indicators and confidence changes back to the planner, which decides whether to continue refining or stop, respecting a predefined token budget. Two key enhancements are introduced: (a) divergence tracing, which runs multiple parallel LLM reasoning branches on the same evidence to broaden the candidate pool, and (b) detection refinement, a two‑stage loop where the initial coarse candidates are used to extract relevant code snippets; these snippets are appended to the evidence block for a second LLM pass, allowing the model to correct early mistakes and discover lines that are far from the crash point (e.g., use‑after‑free bugs).
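The control flow of that loop can be sketched as follows. This is a hedged skeleton, not the paper's implementation: `llm_propose` is a hypothetical stand-in for a real LLM call (seeded randomness emulates the parallel reasoning branches of divergence tracing), and the `branches`, `budget`, and `threshold` parameters are illustrative assumptions:

```python
from dataclasses import dataclass
import random

@dataclass
class Hypothesis:
    file: str
    line: int
    confidence: float

def llm_propose(evidence: str, seed: int) -> list[Hypothesis]:
    # Stand-in for an LLM call; a real system would query a model with
    # the evidence block. Different seeds emulate divergence tracing:
    # parallel branches over the same evidence yield different candidates.
    rng = random.Random(seed)
    lines = rng.sample(range(100, 120), 3)
    return [Hypothesis("src/parse.c", ln, rng.random()) for ln in lines]

def plan_and_execute(evidence: str, branches: int = 3,
                     budget: int = 2, threshold: float = 0.6) -> Hypothesis:
    for round_ in range(budget):                   # planner's token-budget loop
        pool: list[Hypothesis] = []
        for b in range(branches):                  # divergence tracing
            pool.extend(llm_propose(evidence, seed=round_ * branches + b))
        pool.sort(key=lambda h: h.confidence, reverse=True)
        best = pool[0]
        if best.confidence >= threshold:           # planner decides to stop
            return best
        # Detection refinement: fold snippets around the top candidate
        # back into the evidence block for the next pass.
        evidence += f"\n[code snippet near {best.file}:{best.line}]"
    return best
```

The key design point is that the second pass sees strictly more evidence than the first, which is what lets the model recover lines far from the crash point.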
To evaluate the approach, the authors construct T2L‑ARVO, a curated benchmark derived from the larger ARVO dataset. From over 4,900 reproducible vulnerabilities, 50 cases are selected, evenly covering five crash families: buffer overflows, uninitialized accesses, memory‑lifecycle errors, type‑safety violations, and system/runtime faults. Each case includes multiple ground‑truth vulnerable lines and has been validated by experts and LLM‑assisted checks to ensure realistic difficulty and reproducibility.
Experimental results on T2L‑ARVO show that the baseline T2L‑Agent (built on GPT‑4‑style prompting) achieves 58.0% chunk‑level detection and 54.8% exact line‑level localization. While this outperforms prior function‑level baselines, a substantial portion of vulnerable lines remain undiscovered, especially in multi‑module scenarios. The authors discuss limitations such as dependence on LLM prompt quality, token‑budget constraints, and the need for richer static‑dynamic hybrid representations (e.g., code property graphs) to further improve precision.
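The reported percentages are case-level success rates over the 50 benchmark cases. A minimal sketch of one plausible scoring rule (the paper's exact criterion may differ; the toy `cases` data below is invented for illustration):

```python
def line_hit(pred: set[int], truth: set[int]) -> bool:
    # A case counts as line-level localized if any predicted line is a
    # ground-truth vulnerable line (illustrative criterion only).
    return bool(pred & truth)

def chunk_hit(pred_chunk: tuple[int, int], truth: set[int]) -> bool:
    # Chunk-level detection: the predicted chunk's line range covers
    # at least one ground-truth vulnerable line.
    start, end = pred_chunk
    return any(start <= ln <= end for ln in truth)

# Toy per-case results: (predicted lines, ground-truth vulnerable lines)
cases = [
    ({101, 150}, {150}),     # hit
    ({42},       {43, 44}),  # miss
    ({7},        {7, 8}),    # hit
    ({200},      {300}),     # miss
    ({5, 6},     {6}),       # hit
]

localization_rate = sum(line_hit(p, t) for p, t in cases) / len(cases)
print(f"line-level localization rate: {localization_rate:.1%}")  # 60.0%
```

Note that chunk-level detection is strictly easier than line-level localization under this scoring: every line hit is also a chunk hit, which matches the paper's 58.0% vs. 54.8% ordering.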
Related work is surveyed across three axes: traditional static/dynamic vulnerability localization, AI‑for‑cybersecurity agents, and recent LLM‑based line‑level localization methods (e.g., LineVul, LO‑VA, MatsVD). The paper argues that T2L uniquely combines runtime evidence, iterative refinement, and a realistic project‑scale benchmark, positioning it as a bridge between research prototypes and deployable security tooling.
In conclusion, T2L demonstrates that LLMs, when coupled with systematic evidence collection and multi‑stage reasoning, can provide actionable, line‑precise vulnerability diagnostics for large OSS codebases. The framework, benchmark, and baseline open avenues for future research on stronger multi‑modal LLMs, graph‑enhanced reasoning, multi‑agent collaboration, and large‑scale continuous evaluation in real development pipelines.