An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems

Reading time: 5 minutes
...

📝 Original Info

  • Title: An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems
  • ArXiv ID: 2601.00254
  • Date: 2026-01-01
  • Authors: Md Hasan Saju, Maher Muhtadi, Akramul Azim

📝 Abstract

The rapid advancement of Large Language Models (LLMs) presents new opportunities for automated software vulnerability detection, a crucial task in securing modern codebases. This paper presents a comparative study on the effectiveness of LLM-based techniques for detecting software vulnerabilities. The study evaluates three approaches, Retrieval-Augmented Generation (RAG), Supervised Fine-Tuning (SFT), and a Dual-Agent LLM framework, against a baseline LLM model. A curated dataset was compiled from Big-Vul [1] and real-world code repositories from GitHub, focusing on five critical Common Weakness Enumeration (CWE) categories: CWE-119, CWE-399, CWE-264, CWE-20, and CWE-200. Our RAG approach, which integrated external domain knowledge from the internet and the MITRE CWE database, achieved the highest overall accuracy (0.86) and F1 score (0.85), highlighting the value of contextual augmentation. Our SFT approach, implemented using parameter-efficient QLoRA adapters, also demonstrated strong performance. Our Dual-Agent system, an architecture in which a secondary agent audits and refines the output of the first, showed promise in improving reasoning transparency and error mitigation, with reduced resource overhead. These results emphasize that incorporating a domain expertise mechanism significantly strengthens the practical applicability of LLMs in real-world vulnerability detection tasks.
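The retrieval-augmented setup described in the abstract, fetching CWE domain knowledge and prepending it to the detection prompt, can be sketched as follows. The tiny in-memory knowledge base, keyword retriever, and prompt template below are illustrative assumptions for this summary, not the authors' actual corpus, retriever, or prompts:

```python
import re

# Minimal RAG sketch: retrieve CWE context and prepend it to the detection
# prompt. The knowledge base paraphrases MITRE CWE summaries for the five
# categories studied; the keyword retriever is a toy stand-in for the paper's
# retrieval over internet and MITRE CWE sources.
CWE_KB = {
    "CWE-119": "Improper restriction of operations within the bounds of a memory buffer",
    "CWE-399": "Resource management errors such as leaks and double frees",
    "CWE-264": "Weaknesses in permissions, privileges, and access controls",
    "CWE-20":  "Improper validation of input before it is used",
    "CWE-200": "Exposure of sensitive information to an unauthorized actor",
}

def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank knowledge-base entries by keyword overlap with the query snippet."""
    q = _tokens(query)
    ranked = sorted(CWE_KB.items(), key=lambda kv: -len(q & _tokens(kv[1])))
    return [f"{cwe}: {desc}" for cwe, desc in ranked[:k]]

def build_prompt(code: str) -> str:
    """Augment the vulnerability-detection prompt with retrieved CWE context."""
    context = "\n".join(retrieve(code))
    return f"Context:\n{context}\n\nIs the following code vulnerable?\n{code}"
```

In the full system the output of `build_prompt` would be sent to the LLM; the retrieval step alone shows where the external domain knowledge enters the pipeline.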

💡 Deep Analysis

Figure 1

📄 Full Content

An Empirical Evaluation of LLM-Based Approaches for Code Vulnerability Detection: RAG, SFT, and Dual-Agent Systems

Md Hasan Saju, Maher Muhtadi, Akramul Azim
Department of Electrical, Computer, and Software Engineering, Ontario Tech University, Oshawa, Canada
{mdhasan.saju, maher.muhtadi, akramul.azim}@ontariotechu.ca

Index Terms—Vulnerability Detection, LLM, RAG, SFT, Dual-Agent

I. INTRODUCTION

A software vulnerability is a flaw in source code, caused by weaknesses such as buffer overflows, authentication errors, code injection, or design deficiencies, that can be exploited by attackers to breach security measures and gain unauthorized access to a system or network [2]. This can lead to severe consequences, including data theft, system manipulation, service disruption, and financial loss [3]. For example, according to IBM's Cost of a Data Breach Report 2024, the average cost of a data breach is USD 4.88 million, which includes the costs of detecting and addressing the breach, disruption and losses, and the damage to business reputation [3]. Vulnerabilities are especially significant in safety-critical systems, where the consequences of exploitation can be catastrophic. As highlighted in [4], real-time systems such as automotive control systems (e.g., anti-lock braking, cruise control) depend on both logical and temporal correctness for faultless operation. A breach in such systems can disrupt timing constraints, leading to missed deadlines and potentially life-threatening failures. Therefore, it is important to detect and mitigate vulnerabilities in a timely manner.

Vulnerability detection involves identifying security weaknesses in software code that attackers could exploit. Conventional detection approaches like rule-based methods and signature-based techniques rely on predefined patterns to spot known vulnerabilities but often fail to detect new or sophisticated threats [5]. Recent advances in machine learning, especially deep learning, have transformed this field by enabling systems to automatically learn complex patterns from code [5]. Moreover, the rise of large language models (LLMs) has further enhanced detection capabilities, as these models can analyze code syntax and context to identify vulnerabilities more effectively.
LLMs are further customized to improve vulnerability detection through Retrieval-Augmented Generation (RAG) and fine-tuning approaches. However, X. Du et al. [6] and A. Z. H. Yang et al. [7] highlighted that challenges such as false positives and computational costs persist, motivating the exploration of hybrid approaches like Dual-Agent systems.

In this paper, we evaluate and compare the effectiveness of different LLM-based approaches for detecting source code vulnerabilities. The approaches investigated are RAG, Supervised Fine-Tuning (SFT), and a Dual-Agent system, each compared against the performance of the base LLM model. The Dual-Agent system comprises a detector model for identifying vulnerabilities and a validation model for reviewing the first agent's findings.

There are two main motivations behind this study. First, it provides a holistic comparison of three LLM techniques, RAG, fine-tuning, and Dual-Agent LLMs, for vulnerability detection, helping researchers and developers decide on the best approach for their needs. Second, to the best of our knowledge, this paper is the first study to implement and apply a Dual-Agent system in the domain of code vulnerability detection [8].
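The Dual-Agent flow described above, a detector whose finding is audited by a validator, can be sketched as two cooperating functions. Both agents here are heuristic stubs standing in for LLM calls, and the string-matching rules they apply are illustrative assumptions, not the paper's actual models:

```python
# Sketch of the Dual-Agent architecture: agent 1 proposes a verdict with
# supporting evidence, agent 2 audits and may overturn it. Real deployments
# would back each function with an LLM call.
RISKY_CALLS = ("strcpy", "gets", "sprintf")  # classic CWE-119-prone C APIs

def detector_agent(code: str) -> dict:
    """Agent 1: flag the snippet and cite the evidence it relied on."""
    evidence = [fn for fn in RISKY_CALLS if fn in code]
    return {"vulnerable": bool(evidence), "evidence": evidence}

def validator_agent(code: str, finding: dict) -> dict:
    """Agent 2: audit agent 1's finding, e.g. reject positives with no evidence."""
    if finding["vulnerable"] and not finding["evidence"]:
        return {**finding, "vulnerable": False, "note": "rejected: no evidence"}
    return {**finding, "note": "confirmed"}

def dual_agent_detect(code: str) -> dict:
    """Run the two-stage detect-then-validate pipeline."""
    return validator_agent(code, detector_agent(code))
```

For instance, `dual_agent_detect("char b[4]; strcpy(b, user);")` confirms a CWE-119-style finding with `strcpy` as evidence, while a benign snippet passes through as not vulnerable; keeping the evidence and audit note in the output mirrors the reasoning-transparency benefit the paper attributes to this architecture.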

📸 Image Gallery

RAG.png ResultFinal.png TwoAgent.png cover.png finetuning.png responseDual.png

Reference

This content is AI-processed based on open access ArXiv data.
