Adaptive Hierarchical Evaluation of LLMs and SAST tools for CWE Prediction in Python

Reading time: 4 minutes
...

📝 Original Info

  • Title: Adaptive Hierarchical Evaluation of LLMs and SAST tools for CWE Prediction in Python
  • ArXiv ID: 2601.01320
  • Date: 2026-01-04
  • Authors: Muntasir Adnan, Carlos C. N. Kuhn

📝 Abstract

Large Language Models have become integral to software development, yet they frequently generate vulnerable code. Existing code vulnerability detection benchmarks employ binary classification, lacking the CWE-level specificity required for actionable feedback in iterative correction systems. We present ALPHA (Adaptive Learning via Penalty in Hierarchical Assessment), the first function-level Python benchmark that evaluates both LLMs and SAST tools using hierarchically aware, CWE-specific penalties. ALPHA distinguishes between over-generalisation, over-specification, and lateral errors, reflecting practical differences in diagnostic utility. Evaluating seven LLMs and two SAST tools, we find LLMs substantially outperform SAST, though SAST demonstrates higher precision when detections occur. Critically, prediction consistency varies dramatically across models (8.26%-81.87% agreement), with significant implications for feedback-driven systems. We further outline a pathway for future work incorporating ALPHA penalties into supervised fine-tuning, which could provide principled hierarchy-aware vulnerability detection pending empirical validation.
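To make the hierarchical penalty idea concrete, here is a minimal Python sketch of how a predicted CWE could be scored against a ground-truth CWE by walking the CWE parent hierarchy, distinguishing over-generalisation (predicting an ancestor), over-specification (predicting a descendant), and lateral errors (an unrelated branch). The parent map, penalty weights, their relative ordering, and the function names are illustrative assumptions, not ALPHA's actual definitions.

```python
# Minimal sketch of a hierarchy-aware CWE penalty, in the spirit of ALPHA.
# The parent map, penalty weights, and helper names are illustrative
# assumptions, not the paper's exact scheme.

# Toy fragment of the CWE hierarchy: child -> parent.
CWE_PARENT = {
    "CWE-89":  "CWE-943",   # SQL Injection -> Improper Neutralization in Data Query Logic
    "CWE-943": "CWE-74",    # -> Injection
    "CWE-74":  "CWE-707",   # -> Improper Neutralization
}

def ancestors(cwe: str) -> set[str]:
    """Walk the child->parent map to collect all ancestors of a CWE."""
    seen = set()
    while cwe in CWE_PARENT:
        cwe = CWE_PARENT[cwe]
        seen.add(cwe)
    return seen

def alpha_penalty(predicted: str, expected: str) -> float:
    """Assign a penalty that reflects diagnostic utility, not just exact match."""
    if predicted == expected:
        return 0.0                  # exact CWE match
    if predicted in ancestors(expected):
        return 0.3                  # over-generalisation: right branch, too coarse
    if expected in ancestors(predicted):
        return 0.5                  # over-specification: too narrow a diagnosis
    return 1.0                      # lateral error: unrelated CWE branch

# Predicting the generic "Injection" class for a SQL injection flaw is
# penalised less than predicting a sibling CWE from another branch.
print(alpha_penalty("CWE-74", "CWE-89"))   # 0.3 (over-generalisation)
print(alpha_penalty("CWE-79", "CWE-89"))   # 1.0 (lateral error in this toy map)
```

A sibling prediction such as CWE-79 for a CWE-89 ground truth lands in the lateral bucket because neither weakness is an ancestor of the other, which matches the intuition that it offers little diagnostic value for a repair loop.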

💡 Deep Analysis

Figure 1

📄 Full Content

Adaptive Hierarchical Evaluation of LLMs and SAST tools for CWE Prediction in Python

Muntasir Adnan* and Carlos C. N. Kuhn
Open Source Institute, Faculty of Science and Technology, University of Canberra, Canberra, Australia
Adnan.adnan@canberra.edu.au, Carlos.NoschangKuhn@canberra.edu.au
*Corresponding author

This work is funded under the agreement with the ACT Government, Future Jobs Fund - Open Source Institute (OpenSI) - R01553, and the NetApp Technology Alliance Agreement with OpenSI - R01657. Additionally, this research was supported by the Australian Government through the Department of Education's National Industry PhD Program (project 36337). The views expressed herein are those of the authors and are not necessarily those of the Australian Government or the Department of Education.

Abstract: Large Language Models have become integral to software development, yet they frequently generate vulnerable code. Existing code vulnerability detection benchmarks employ binary classification, lacking the CWE-level specificity required for actionable feedback in iterative correction systems. We present ALPHA (Adaptive Learning via Penalty in Hierarchical Assessment), the first function-level Python benchmark that evaluates both LLMs and SAST tools using hierarchically aware, CWE-specific penalties. ALPHA distinguishes between over-generalisation, over-specification, and lateral errors, reflecting practical differences in diagnostic utility. Evaluating seven LLMs and two SAST tools, we find LLMs substantially outperform SAST, though SAST demonstrates higher precision when detections occur. Critically, prediction consistency varies dramatically across models (8.26%-81.87% agreement), with significant implications for feedback-driven systems. We further outline a pathway for future work incorporating ALPHA penalties into supervised fine-tuning, which could provide principled hierarchy-aware vulnerability detection pending empirical validation.

Index Terms: Vulnerability Detection, Large Language Models, Static Analysis, CWE Classification, Hierarchical Evaluation

I. INTRODUCTION

Large Language Models (LLMs) have fundamentally transformed software development practices, with code-generation capabilities now integrated into mainstream development workflows via tools such as GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. This paradigm shift has dramatically improved developer productivity, enabling rapid prototyping and accelerating development cycles [1]. However, recent studies demonstrate that LLMs frequently produce code containing exploitable vulnerabilities, with security flaws appearing in 40-60% of generated code snippets, depending on task complexity and model [2], [3]. The research community has approached this vulnerability challenge through two primary avenues: prompt engineering and supervised fine-tuning.

Prompt engineering techniques have demonstrated potential, with security-focused prompts reducing vulnerability generation by up to 61% in models like GPT-4o [4]. Self-reflection approaches, where models detect vulnerabilities in their previously generated code and apply repairs, achieve 41.9-68.7% vulnerability remediation when specifically prompted [4], [5]. However, these techniques exhibit significant practical limitations. Not only do they require vulnerability-specific instructions to achieve substantial improvements, but more critically, they demonstrate concerning instability: results vary substantially with minor prompt modifications and prove inconsistent both within and across studies [6], [7].

Fine-tuning approaches have demonstrated improvements [8]-[10], but suffer from critical limitations, including catastrophic forgetting [11], poor generalisation [12], and prohibitive computational costs for frequent retraining as vulnerability landscapes evolve [11], [13]. Recent evidence suggests that contemporary Small Language Models (SLMs) with improved training regimes can match or exceed the performance of older, larger models [14], potentially rendering vulnerability-specific fine-tuning obsolete.

Given these limitations, iterative feedback loops have emerged as a promising alternative, already demonstrating effectiveness in improving functional correctness in code generation [15]-[17]. These systems iteratively analyse generated code, identify specific issues, and prompt the model to address them. However, a critical question remains unanswered: what tool should provide the feedback for vulnerability detection? Two candidate approaches exist: traditional static analysis security testing (SAST) tools and LLM-based vulnerability detection. Current practice predominantly employs SAST tools for feedback generation [12], [18], yet this choice appears conventional rather than evidence-based. Recent comparative studies suggest LLMs can outperform SAST tools for vulnerability detection [19],
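As a sketch of the feedback-driven setting the benchmark targets, the loop below wires an arbitrary code generator together with a CWE detector (a SAST tool or an LLM judge) and re-prompts the generator each round with the detector's CWE-level findings. The function names, prompt template, and round limit are hypothetical placeholders, not an implementation described in the paper.

```python
# Hedged sketch of an iterative correction loop driven by CWE-level feedback.
# `generate_code` and `detect_cwes` stand in for whatever LLM and detector
# (SAST tool or LLM judge) a concrete system plugs in; none are from the paper.

from typing import Callable

def feedback_loop(
    task: str,
    generate_code: Callable[[str], str],      # prompt -> candidate code
    detect_cwes: Callable[[str], list[str]],  # code -> predicted CWE IDs
    max_rounds: int = 3,
) -> str:
    prompt = task
    code = generate_code(prompt)
    for _ in range(max_rounds):
        findings = detect_cwes(code)
        if not findings:
            break                              # detector reports no weaknesses
        # CWE-specific feedback: the model is told *which* weakness to fix,
        # which is only actionable if the detector's CWE prediction is reliable.
        prompt = (
            f"{task}\n\nYour previous solution:\n{code}\n\n"
            f"It appears vulnerable to: {', '.join(findings)}. "
            "Rewrite the function to remove these weaknesses."
        )
        code = generate_code(prompt)
    return code
```

In this setting, a detector that mislabels the CWE (a lateral error) steers the repair prompt toward the wrong fix, which is why the paper argues that CWE-level accuracy and prediction consistency, not just binary detection, determine a tool's usefulness as a feedback source.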


Reference

This content is AI-processed based on open access ArXiv data.
