SAFuzz: Semantic-Guided Adaptive Fuzzing for LLM-Generated Code
While AI coding assistants accelerate software development, current testing frameworks struggle to keep pace with the resulting volume of AI-generated code. Traditional fuzzing techniques often allocate resources uniformly and lack semantic awareness of algorithmic vulnerability patterns, leading to inefficient resource usage and missed vulnerabilities. To address these limitations, we present a hybrid testing framework that leverages LLM-guided adaptive fuzzing to detect algorithmic vulnerabilities efficiently. Our system, SAFuzz, integrates prompt-based behavioral diversification, harness generation with problem-specific oracles, and an LLM-based predictor that enables adaptive resource allocation and dynamic early stopping. Evaluating SAFuzz on CSES algorithmic problems, we improve vulnerability discrimination precision from 77.9% to 85.7% and reduce time cost by 1.71x compared with the state-of-the-art GreenFuzz, while maintaining comparable recall. We further observe that combining our approach with existing unit test generation methods yields complementary gains, increasing bug detection recall from 67.3% to 79.5%.
💡 Research Summary
The paper introduces SAFuzz, a hybrid testing framework designed to efficiently uncover algorithmic vulnerabilities in code generated by large language models (LLMs). The authors observe that existing testing approaches—LLM‑based unit test generation, formal verification, and traditional fuzzing—each suffer from distinct shortcomings when applied to AI‑generated code. Unit test generators achieve high functional coverage but cannot detect runtime anomalies such as timeouts or memory overflows. Formal verification offers strong guarantees but requires expert‑crafted specifications and does not scale to the volume of code produced by LLMs. Conventional fuzzers allocate uniform time budgets across all programs, lack semantic awareness of algorithmic risk, and rely on generic harnesses that ignore problem‑specific constraints, leading to wasted effort and missed bugs.
SAFuzz addresses these gaps through three tightly integrated stages. First, it creates a diverse set of prompt variants for each programming problem. For every problem, twelve prompts are generated: one original, five semantic variations (e.g., emphasizing edge cases, reordering constraints, focusing on examples), and six “buggy” variations that intentionally steer the LLM toward known vulnerability patterns such as integer overflow, inefficient algorithms, deep recursion, or off‑by‑one errors. By feeding each variant to the coding LLM, the system obtains a spectrum of solutions that reflect real‑world user interaction diversity and consequently a broader distribution of potential bugs.
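The diversification stage can be sketched as follows. This is a minimal illustration of the 1 + 5 + 6 prompt layout described above; the variant wordings and the names `SEMANTIC_VARIANTS`, `BUGGY_VARIANTS`, and `make_prompts` are assumptions for illustration, not SAFuzz's actual templates.

```python
# Illustrative sketch of SAFuzz's prompt-based behavioral diversification:
# 1 original prompt + 5 semantic variations + 6 "buggy" variations per problem.

# Semantic variations: rephrasings that preserve the task but shift emphasis.
SEMANTIC_VARIANTS = [
    "Pay special attention to edge cases such as empty or minimal inputs.",
    "Read the constraints carefully; they are listed in no particular order.",
    "Focus on the worked examples when deriving your solution.",
    "Restate the problem in your own words before solving it.",
    "Prefer a solution that is easy to verify against the sample outputs.",
]

# "Buggy" variations: hints that steer the model toward known vulnerability
# patterns (overflow, inefficiency, deep recursion, off-by-one, ...).
BUGGY_VARIANTS = [
    "Use plain 32-bit integers for all arithmetic.",          # integer overflow
    "A simple brute-force approach is acceptable here.",      # inefficient algorithm
    "Solve this with a straightforward recursive function.",  # deep recursion
    "Iterate with inclusive bounds on both ends.",            # off-by-one errors
    "Skip input validation to keep the code short.",
    "Store all intermediate results in memory for clarity.",  # memory overflow
]

def make_prompts(problem_statement: str) -> list[str]:
    """Return the twelve prompt variants for one problem."""
    prompts = [problem_statement]
    prompts += [f"{problem_statement}\n\nHint: {v}" for v in SEMANTIC_VARIANTS]
    prompts += [f"{problem_statement}\n\nStyle note: {v}" for v in BUGGY_VARIANTS]
    return prompts
```

Each of the twelve prompts is then sent to the coding LLM, so every problem yields a dozen candidate solutions with deliberately varied failure modes.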
Second, SAFuzz employs an LLM‑guided harness generation agent that produces problem‑specific fuzzing harnesses. The agent parses the natural‑language problem description to extract numeric bounds, type requirements, and structural constraints (e.g., maximum graph size). It then synthesizes weighted input generators that bias toward stress‑inducing values (maximum recursion depth, largest permissible input sizes). Crucially, the agent also creates four semantic oracles: (1) a timeout oracle that monitors execution time in a separate thread, (2) a crash oracle for memory violations such as null dereferences or out‑of‑bounds accesses, (3) a determinism oracle that checks for consistent outputs on repeated inputs, and (4) an overflow oracle that detects sign‑change or wrap‑around behavior indicative of arithmetic overflow. A two‑stage validation loop, inspired by OSS‑Fuzz‑Gen, compiles the generated harness, captures compiler errors, and feeds them back to the LLM for targeted fixes, thus achieving a high success rate (≈92%) while avoiding exponential retry costs.
Third, SAFuzz predicts vulnerability risk and allocates fuzzing resources adaptively. It extracts fourteen features that combine static code metrics (lines of code, cyclomatic complexity, call‑graph depth, number of loops) with dynamic, LLM‑derived semantic features (identified algorithmic class, expected time‑complexity order, weighted input distribution). These features feed a Gradient Boosting binary classifier trained to output a vulnerability probability for each program. Programs scoring below a configurable threshold are filtered out early, while high‑risk programs receive proportionally larger time budgets (e.g., a program with a 0.9 probability may receive three times the fuzzing time of a program with 0.5). During fuzzing, real‑time coverage and oracle trigger rates are monitored; when a saturation window is detected, the campaign for that program is stopped, preventing unnecessary computation.
The authors evaluate SAFuzz on 96 CSES algorithmic problems, generating 1,152 LLM‑produced variants. Compared against GreenFuzz, a recent static‑feature‑based adaptive fuzzer, SAFuzz improves vulnerability discrimination precision from 77.9% to 85.7% (a 7.8‑point gain) while reducing total fuzzing time by a factor of 1.71. Recall remains comparable, with only a marginal drop (<0.5%). When combined with an LLM‑based unit‑test generator (ChatUniTest), overall bug detection recall rises from 67.3% to 79.5%, demonstrating complementary strengths. The harness generation component scales linearly with problem difficulty thanks to the bounded retry mechanism, and the adaptive scheduler consistently focuses effort on the most promising targets.
Limitations acknowledged include dependence on manually crafted prompt‑variant templates (which may need redesign for new domains), sensitivity of the semantic feature extractor to the underlying LLM model, and the current focus on algorithmic contest problems rather than system‑level or network code. Future work is suggested in automatic prompt‑variant synthesis, multi‑language support, and richer hybrid oracles that combine static analysis with dynamic simulation.
In summary, SAFuzz presents a novel, semantics‑aware approach to fuzzing AI‑generated code, achieving higher precision and substantially lower computational cost than prior adaptive fuzzers, and illustrating how LLMs can be leveraged not only for code creation but also for intelligent test generation and resource management.