Title: Adaptive Hierarchical Evaluation of LLMs and SAST tools for CWE Prediction in Python
ArXiv ID: 2601.01320
Date: 2026-01-04
Authors: Muntasir Adnan, Carlos C. N. Kuhn
📝 Abstract
Large Language Models have become integral to software development, yet they frequently generate vulnerable code. Existing code vulnerability detection benchmarks employ binary classification, lacking the CWE-level specificity required for actionable feedback in iterative correction systems. We present ALPHA (Adaptive Learning via Penalty in Hierarchical Assessment), the first function-level Python benchmark that evaluates both LLMs and SAST tools using hierarchically aware, CWE-specific penalties. ALPHA distinguishes between overgeneralisation, over-specification, and lateral errors, reflecting practical differences in diagnostic utility. Evaluating seven LLMs and two SAST tools, we find LLMs substantially outperform SAST, though SAST demonstrates higher precision when detections occur. Critically, prediction consistency varies dramatically across models (8.26%-81.87% agreement), with significant implications for feedback-driven systems. We further outline a pathway for future work incorporating ALPHA penalties into supervised fine-tuning, which could provide principled hierarchy-aware vulnerability detection pending empirical validation.
📄 Full Content
Adaptive Hierarchical Evaluation of LLMs and SAST tools for CWE Prediction in Python
Muntasir Adnan
Open Source Institute
Faculty of Science and Technology
University of Canberra
Canberra, Australia
Adnan.adnan@canberra.edu.au
*Corresponding author
Carlos C. N. Kuhn
Open Source Institute
Faculty of Science and Technology
University of Canberra
Canberra, Australia
Carlos.NoschangKuhn@canberra.edu.au
Abstract—Large Language Models have become integral to software development, yet they frequently generate vulnerable code. Existing code vulnerability detection benchmarks employ binary classification, lacking the CWE-level specificity required for actionable feedback in iterative correction systems. We present ALPHA (Adaptive Learning via Penalty in Hierarchical Assessment), the first function-level Python benchmark that evaluates both LLMs and SAST tools using hierarchically aware, CWE-specific penalties. ALPHA distinguishes between overgeneralisation, over-specification, and lateral errors, reflecting practical differences in diagnostic utility. Evaluating seven LLMs and two SAST tools, we find LLMs substantially outperform SAST, though SAST demonstrates higher precision when detections occur. Critically, prediction consistency varies dramatically across models (8.26%-81.87% agreement), with significant implications for feedback-driven systems. We further outline a pathway for future work incorporating ALPHA penalties into supervised fine-tuning, which could provide principled hierarchy-aware vulnerability detection pending empirical validation.
Index Terms—Vulnerability Detection, Large Language Models, Static Analysis, CWE Classification, Hierarchical Evaluation
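The abstract's distinction among overgeneralisation, over-specification, and lateral errors can be made concrete with a small sketch. The Python snippet below is not the ALPHA benchmark's implementation: the toy CWE parent map, the penalty weights, and the function names are illustrative assumptions chosen only to show how a hierarchy-aware penalty might separate the three error types.

```python
# Illustrative sketch only: the CWE parent links and penalty weights below are
# assumptions for demonstration, not the ALPHA benchmark's actual values.

# Toy fragment of a CWE hierarchy: child CWE -> parent CWE.
CWE_PARENT = {
    "CWE-79": "CWE-74",    # XSS treated as a child of Injection
    "CWE-89": "CWE-74",    # SQL Injection treated as a child of Injection
    "CWE-74": "CWE-707",   # Injection treated as a child of Improper Neutralization
}

def ancestors(cwe: str) -> set[str]:
    """Collect all ancestors of a CWE in the toy hierarchy."""
    seen = set()
    while cwe in CWE_PARENT:
        cwe = CWE_PARENT[cwe]
        seen.add(cwe)
    return seen

def hierarchical_penalty(predicted: str, actual: str) -> float:
    """Assign a penalty reflecting diagnostic utility (illustrative weights)."""
    if predicted == actual:
        return 0.0                 # exact match: no penalty
    if predicted in ancestors(actual):
        return 0.25                # overgeneralisation: an ancestor was predicted
    if actual in ancestors(predicted):
        return 0.5                 # over-specification: a descendant was predicted
    return 1.0                     # lateral error: unrelated branch of the hierarchy

# Predicting the broad Injection class for an actual SQL injection is
# penalised less than predicting a CWE from an unrelated branch.
print(hierarchical_penalty("CWE-74", "CWE-89"))   # 0.25 (overgeneralisation)
print(hierarchical_penalty("CWE-79", "CWE-89"))   # 1.0  (lateral error)
```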
I. INTRODUCTION
Large Language Models (LLMs) have fundamentally transformed software development practices, with code-generation capabilities now integrated into mainstream development workflows via tools such as GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. This paradigm shift has dramatically improved developer productivity, enabling rapid prototyping and accelerating development cycles [1]. However, recent studies demonstrate that LLMs frequently produce code containing exploitable vulnerabilities, with security flaws appearing in 40-60% of generated code snippets, depending on task complexity and model [2], [3]. The research community has approached this vulnerability challenge through two primary avenues: prompt engineering and supervised fine-tuning.
This work is funded under the agreement with the ACT Government, Future Jobs Fund - Open Source Institute (OpenSI) - R01553; and NetApp Technology Alliance Agreement with OpenSI - R01657. Additionally, this research was supported by the Australian Government through the Department of Education's National Industry PhD Program (project 36337). The views expressed herein are those of the authors and are not necessarily those of the Australian Government or the Department of Education.
Prompt engineering techniques have demonstrated potential, with security-focused prompts reducing vulnerability generation by up to 61% in models like GPT-4o [4]. Self-reflection approaches, where models detect vulnerabilities in their previously generated code and apply repairs, achieve 41.9-68.7% vulnerability remediation when specifically prompted [4], [5]. However, these techniques exhibit significant practical limitations. Not only do they require vulnerability-specific instructions to achieve substantial improvements, but more critically, they demonstrate concerning instability: results vary substantially with minor prompt modifications and prove inconsistent both within and across studies [6], [7].
Fine-tuning approaches have demonstrated improvements [8]–[10], but suffer from critical limitations including catastrophic forgetting [11], poor generalisation [12], and prohibitive computational costs for frequent retraining as vulnerability landscapes evolve [11], [13]. Recent evidence suggests that contemporary Small Language Models (SLMs) with improved training regimes can match or exceed the performance of older, larger models [14], potentially rendering vulnerability-specific fine-tuning obsolete.
Given these limitations, iterative feedback loops have emerged as a promising alternative, already demonstrating effectiveness in improving functional correctness in code generation [15]–[17]. These systems iteratively analyse generated code, identify specific issues, and prompt the model to address them. However, a critical question remains unanswered: what tool should provide the feedback for vulnerability detection? Two candidate approaches exist: traditional static analysis security testing (SAST) tools and LLM-based vulnerability detection. Current practice predominantly employs SAST tools for feedback generation [12], [18], yet this choice appears conventional rather than evidence-based. Recent comparative studies suggest LLMs can outperform SAST tools for vulnerability detection [19],