From Vulnerabilities to Remediation: A Systematic Literature Review of LLMs in Code Security
Large Language Models (LLMs) have emerged as powerful tools for automating programming tasks, including security-related ones. However, they can also introduce vulnerabilities during code generation, fail to detect existing vulnerabilities, or report nonexistent ones. This systematic literature review investigates the security benefits and drawbacks of using LLMs for code-related tasks. In particular, it focuses on the types of vulnerabilities introduced by LLMs when generating code. Moreover, it analyzes the capabilities of LLMs to detect and fix vulnerabilities, and examines how prompting strategies impact these tasks. Finally, it examines how data poisoning attacks impact LLMs performance in the aforementioned tasks.
💡 Research Summary
This paper presents a systematic literature review (SLR) that investigates the security implications of using large language models (LLMs) for code‑related tasks. The authors aim to (1) catalogue the types of vulnerabilities that LLM‑generated code can introduce, (2) assess how well LLMs can detect and remediate such vulnerabilities, including the influence of prompting strategies, and (3) examine the impact of data‑poisoning attacks on these capabilities.
The review follows the Petersen et al. SLR guidelines and covers studies published from 2021 through early 2026. An exhaustive search across IEEE Xplore, ACM Digital Library, ScienceDirect, SpringerLink, and USENIX yielded 7,008 records; after duplicate removal, screening, and full‑text assessment, 102 primary studies were included. The selection criteria required peer‑reviewed full‑text papers that either (i) identify security flaws in LLM‑generated code, (ii) evaluate LLMs for vulnerability detection or fixing, or (iii) explore poisoning of training data.
RQ1 – Vulnerabilities introduced by LLMs
From 21 papers that explicitly discuss code‑level flaws, the authors extract ten high‑level categories, each mapped to CWE identifiers where possible: (1) missing input validation (e.g., SQL/command injection), (2) authentication/authorization errors, (3) cryptographic misuse (weak hashes, hard‑coded keys), (4) resource‑management defects (memory/file leaks), (5) inadequate error handling (exception suppression, debug leakage), (6) type confusion in dynamically typed languages, (7) vulnerable third‑party dependencies, (8) unsafe system‑call usage, (9) insufficient logging/auditing, and (10) code‑injection risk from prompt contamination. The taxonomy shows that LLMs often reproduce insecure patterns present in their training corpora or generate code that violates modern secure‑coding guidelines.
RQ2 – Detection and fixing capabilities
The majority of the 96 studies focusing on detection/fixing employ hybrid static‑plus‑dynamic analysis pipelines driven by prompts. Model performance varies: GPT‑4 and Claude achieve the highest average F1 scores (~0.78), while Llama‑2 and earlier GPT‑3‑based models trail (~0.71). Prompt engineering is shown to be a decisive factor. Zero‑shot prompts yield modest accuracy (~0.55), whereas few‑shot examples improve results by 10–12 percentage points. Adding Chain‑of‑Thought (CoT) reasoning further boosts performance, with the few‑shot + CoT combination consistently outperforming other configurations.
Detection and remediation are often combined in a “Detect‑Fix” pipeline: the model first flags a vulnerability, then proposes a patch, and finally a verification step checks that the patch does not introduce new issues. Studies report that omitting the verification stage can lead to regression bugs or even new security flaws. Fine‑tuning on security‑focused corpora improves detection rates but may increase false positives if the fine‑tuning data are not carefully curated.
RQ3 – Effects of data poisoning
Fourteen papers examine adversarial poisoning of LLM training data. Attack vectors include injecting malicious repositories, contaminating code snippets, or mislabeling security‑related examples. Experiments demonstrate that poisoned data can cause LLMs to (a) deliberately generate insecure code, (b) overlook known vulnerabilities, or (c) misclassify dangerous patterns as safe. Notably, “source‑code injection” attacks can bias the model to favor certain unsafe API calls without altering model weights; a simple prompt perturbation (e.g., typo in a security keyword) can trigger the malicious behavior, highlighting the stealthy nature of such attacks.
Key insights and recommendations
- Prompt templates anchored in CWE taxonomy – embedding explicit security checks (e.g., “ensure input sanitization for X”) consistently improves detection and fixing accuracy.
- Multi‑stage verification – integrating static analysis, dynamic testing, and post‑patch validation reduces regression risk in the Detect‑Fix workflow.
- Data hygiene – systematic vetting of pre‑training and fine‑tuning corpora, combined with automated poisoning detection tools, is essential to preserve model integrity.
- Defensive training – adversarial training with poisoned samples and regular model audits can mitigate the impact of data‑poisoning attacks.
- Broader language coverage – current literature focuses on Python and JavaScript; extending studies to systems languages (C/C++, Rust, Go) is a critical future direction.
The authors conclude that while LLMs offer substantial productivity gains and have demonstrated promising abilities to locate and remediate security flaws, their deployment must be accompanied by rigorous prompt engineering, robust verification pipelines, and proactive defenses against data poisoning. This SLR provides a consolidated taxonomy of LLM‑induced vulnerabilities, an evaluation of detection/fixing techniques, and a roadmap for securing LLM‑driven software development pipelines.
Comments & Academic Discussion
Loading comments...
Leave a Comment