A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

The increasing adoption of Large Language Models (LLMs) in software engineering has sparked interest in their use for software vulnerability detection. However, the rapid development of this field has resulted in a fragmented research landscape, with diverse studies that are difficult to compare due to differences in, e.g., system designs and dataset usage. This fragmentation makes it difficult to obtain a clear overview of the state-of-the-art or compare and categorize studies meaningfully. In this work, we present a comprehensive systematic literature review (SLR) of LLM-based software vulnerability detection. We analyze 263 studies published between January 2020 and November 2025, categorizing them by task formulation, input representation, system architecture, and techniques. Further, we analyze the datasets used, including their characteristics, vulnerability coverage, and diversity. We present a fine-grained taxonomy of vulnerability detection approaches, identify key limitations, and outline actionable future research opportunities. By providing a structured overview of the field, this review improves transparency and serves as a practical guide for researchers and practitioners aiming to conduct more comparable and reproducible research. We publicly release all artifacts and maintain a living repository of LLM-based software vulnerability detection studies at https://github.com/hs-esslingen-it-security/Awesome-LLM4SVD.


💡 Research Summary

This paper presents a comprehensive systematic literature review (SLR) of research that applies large language models (LLMs) to software vulnerability detection (SVD). Covering the period from January 2020 to November 2025, the authors identified and analyzed 263 peer‑reviewed studies. The review is organized around four primary dimensions—task formulation, input representation, system architecture, and adaptation/orchestration techniques—plus an extensive examination of the datasets used in these works.

Task formulation is split into classification (binary or multi‑class CWE labeling, vulnerability presence detection) and generation (automatic creation of vulnerability descriptions, patches, or remediation advice). The authors note a growing interest in generative tasks, especially for automated repair, but classification remains the dominant focus.
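The two task formulations can be made concrete with a pair of hypothetical prompt templates; the function names and exact wording below are illustrative assumptions, not taken from any surveyed study.

```python
# Sketch of the two task formulations the review distinguishes,
# expressed as hypothetical prompt templates (illustrative only).

def classification_prompt(code: str) -> str:
    """Classification formulation: the model assigns a label (CWE ID or 'none')."""
    return (
        "Does the following function contain a vulnerability? "
        "Answer with a CWE ID or 'none'.\n\n" + code
    )

def generation_prompt(code: str) -> str:
    """Generative formulation: the model produces a description or patch."""
    return (
        "Describe the vulnerability in the following function and "
        "propose a patch.\n\n" + code
    )

snippet = "char buf[8]; strcpy(buf, user_input);"
print(classification_prompt(snippet).splitlines()[0])
```

In practice, the classification variant is typically scored against ground-truth labels, while the generative variant requires human or heuristic evaluation of the produced text.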

Input representation covers a spectrum from raw source‑code text to richer program abstractions such as abstract syntax trees (AST), control‑flow graphs (CFG), data‑flow graphs (DFG), and hybrid representations that combine code with auxiliary metadata (e.g., CVE narratives, library dependencies, project‑level information). The review highlights that hybrid inputs tend to improve detection performance, particularly for subtle, context‑dependent vulnerabilities.
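The contrast between raw text and structural representations can be sketched in a few lines. Most surveyed studies target C/C++ with dedicated parsers; the example below uses Python's standard-library `ast` module on a toy snippet purely for illustration.

```python
import ast

# Sketch: two input representations for the same snippet --
# raw source text versus a flattened AST node sequence.
# (Illustrative only; real pipelines use language-specific parsers.)

source = """
def copy(dst, src):
    for i in range(len(src)):
        dst[i] = src[i]
"""

raw_tokens = source.split()                             # raw-text view
tree = ast.parse(source)
ast_nodes = [type(n).__name__ for n in ast.walk(tree)]  # structural view

print(raw_tokens[:4])
print(ast_nodes[:5])
```

A control-flow or data-flow graph would further annotate these nodes with edges, which is what hybrid representations feed to the model alongside the raw code.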

System architecture is categorized into three families: (1) Zero‑shot or prompt‑only usage of pre‑trained LLMs, (2) fine‑tuning or continual‑learning approaches that adapt the model to domain‑specific code corpora, and (3) retrieval‑augmented generation (RAG) or other orchestration schemes that integrate external knowledge bases, code search engines, or static analysis tools. The authors observe a shift from pure prompt‑only methods toward hybrid pipelines that combine LLM reasoning with traditional program analysis.
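The retrieval-augmented pattern can be sketched minimally: retrieve the most similar known vulnerable example and prepend it to the prompt. The knowledge base and the naive token-overlap scoring below are assumptions for illustration; real systems use embedding search or static-analysis tools.

```python
# Sketch of a RAG-style pipeline: retrieve a similar known-vulnerable
# example (here by naive token overlap) and enrich the prompt with it.
# Knowledge base contents and scoring are illustrative assumptions.

knowledge_base = {
    "CWE-89": "query = 'SELECT * FROM users WHERE id=' + user_id",
    "CWE-79": "html = '<div>' + request.args['name'] + '</div>'",
}

def retrieve(code: str) -> tuple:
    """Return the (cwe_id, example) entry with the largest token overlap."""
    tokens = set(code.split())
    return max(knowledge_base.items(),
               key=lambda kv: len(tokens & set(kv[1].split())))

def build_prompt(code: str) -> str:
    cwe, example = retrieve(code)
    return (f"Known {cwe} example:\n{example}\n\n"
            f"Is the following code vulnerable?\n{code}")

target = "sql = 'SELECT * FROM users WHERE id=' + form_id"
print(build_prompt(target).splitlines()[0])
```

Swapping the overlap metric for embedding similarity, or the knowledge base for static-analysis findings, yields the other hybrid pipelines the review describes.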

Adaptation and orchestration techniques include prompt engineering (manual and automated), label smoothing, multi‑task learning, model ensembling, and meta‑learning for automatic prompt generation. The review points out that systematic prompt optimization can yield performance gains comparable to full fine‑tuning, which is valuable given the high computational cost of large models.
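Model ensembling, one of the orchestration techniques listed above, can be sketched as a majority vote over several predictors. The three stand-in "models" below are plain functions; in practice each would be an LLM call with a different prompt, temperature, or checkpoint.

```python
from collections import Counter

# Sketch of ensembling by majority vote. The three "models" are
# stand-in functions (assumptions); real ensembles vote across
# different prompts, sampling seeds, or fine-tuned checkpoints.

def model_a(code): return "vulnerable"
def model_b(code): return "vulnerable"
def model_c(code): return "safe"

def ensemble_predict(code: str, models=(model_a, model_b, model_c)) -> str:
    votes = Counter(m(code) for m in models)
    return votes.most_common(1)[0][0]

print(ensemble_predict("strcpy(buf, input);"))
```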

The dataset analysis is a standout contribution. The authors construct a taxonomy that classifies datasets by type (code‑only, text‑only, graph‑based, mixed), granularity (file, function, line), source (open‑source repositories, industrial partners, synthetic), and labeling methodology (manual, automatic, semi‑automatic). They evaluate each dataset on class balance, CWE coverage and diversity, duplication rate, and label noise. Findings reveal that most benchmarks are heavily skewed toward C/C++ and Java, concentrate on a limited set of CWE families (e.g., CWE‑79, CWE‑89), and suffer from significant class imbalance and duplicate code snippets. Consequently, many reported performance improvements may not generalize to broader, real‑world codebases.
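Two of the dataset-quality metrics mentioned above, class balance and duplication rate, are simple to compute; the tiny dataset below is a made-up example, not one of the surveyed benchmarks.

```python
from collections import Counter

# Sketch of two dataset-audit metrics on a hypothetical toy dataset:
# class balance (share of vulnerable samples) and duplication rate.
# The samples and labels below are invented for illustration.

dataset = [
    ("strcpy(buf, s);", 1),
    ("strcpy(buf, s);", 1),   # exact duplicate
    ("memcpy(dst, src, n);", 0),
    ("free(p); free(p);", 1),
    ("return x + y;", 0),
]

labels = Counter(label for _, label in dataset)
vuln_ratio = labels[1] / len(dataset)

unique_snippets = {code for code, _ in dataset}
dup_rate = 1 - len(unique_snippets) / len(dataset)

print(f"vulnerable ratio: {vuln_ratio:.2f}, duplication rate: {dup_rate:.2f}")
```

Real audits additionally deduplicate near-clones (e.g., after normalizing whitespace and identifiers), which typically raises the measured duplication rate further.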

The paper identifies several limitations in the current research landscape: (1) lack of standardized evaluation metrics and protocols, making cross‑study comparisons difficult; (2) limited reproducibility due to proprietary datasets, undisclosed data splits, and missing hyper‑parameter details; (3) prohibitive computational costs for training or inference with very large models such as GPT‑4; and (4) security concerns that LLM‑generated code can introduce new vulnerabilities, yet few works provide systematic risk assessments.

Based on the analysis, the authors outline future research opportunities: (i) develop community‑agreed benchmark suites and evaluation pipelines that include diverse programming languages, multi‑CWE coverage, and realistic class distributions; (ii) expand and curate multilingual, multi‑domain datasets with high‑quality, verified labels, possibly through collaborative industry‑academia initiatives; (iii) investigate efficient fine‑tuning strategies (e.g., parameter‑efficient adapters, LoRA) and prompt‑optimization algorithms to reduce resource demands; (iv) create metrics and defensive mechanisms to assess and mitigate the security risk of LLM‑generated code, such as static‑analysis‑in‑the‑loop or adversarial testing; and (v) integrate LLM‑based detectors into continuous integration pipelines, enabling real‑time feedback for developers.
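The savings behind parameter-efficient strategies such as LoRA (opportunity iii) come from simple arithmetic: instead of updating a full d × k weight matrix, LoRA trains two low-rank factors of shapes d × r and r × k. The dimensions below are illustrative assumptions.

```python
# Sketch of why low-rank adaptation (LoRA) cuts trainable-parameter
# counts: a rank-r update W + A @ B replaces a full weight update.
# The matrix dimensions are illustrative assumptions.

def trainable_params(d: int, k: int, rank: int):
    full = d * k            # full fine-tuning of one d x k weight matrix
    lora = rank * (d + k)   # low-rank factors: d x r plus r x k
    return full, lora

full, lora = trainable_params(d=4096, k=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ({lora / full:.2%} of full)")
```

For a 4096 × 4096 matrix at rank 8, the low-rank update trains well under one percent of the parameters that full fine-tuning would, which is the core of the resource argument.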

Finally, the authors release all collected artifacts, taxonomies, and analysis scripts in a publicly maintained GitHub repository (Awesome‑LLM4SVD), positioning the work as a living resource for the community. By providing a fine‑grained taxonomy and a critical dataset audit, the paper not only maps the current state of LLM‑driven vulnerability detection but also establishes a clear roadmap for more comparable, reproducible, and secure research in the coming years.

