LLAMAFUZZ: Large Language Model Enhanced Greybox Fuzzing
Greybox fuzzing has achieved success in revealing bugs and vulnerabilities in programs. However, randomized mutation strategies have limited the fuzzer's performance on structured data. Specialized fuzzers can handle complex structured data, but require additional grammar-engineering effort and suffer from low throughput. In this paper, we explore the potential of utilizing large language models to enhance greybox fuzzing for structured data. We utilize the pre-trained knowledge of LLMs about data conversion and formats to generate new valid inputs. We further fine-tune the model with paired mutation seeds so it learns structured formats and mutation strategies effectively. Our LLM-based fuzzer, LLAMAFUZZ, integrates the LLM's ability to understand and mutate structured data into the fuzzing loop. We conduct experiments on the standard bug-based benchmark Magma and a wide variety of real-world programs. LLAMAFUZZ outperforms our top competitor by 41 bugs on average. We also identified 47 unique bugs across all trials. Moreover, LLAMAFUZZ demonstrated consistent performance on both bugs triggered and bugs reached. Compared to AFL++, LLAMAFUZZ achieved 27.19% more branch coverage on the real-world program set on average. We also present a case study explaining how LLMs enhance the fuzzing process in terms of code coverage.
💡 Research Summary
LLAMAFUZZ introduces a novel hybrid fuzzing framework that leverages large language models (LLMs) to enhance grey‑box fuzzing on structured inputs. Traditional grey‑box fuzzers such as AFL++ rely on high‑throughput bit‑level mutations, which work well for unstructured data but often break the syntactic integrity of formats like XML, JSON, or binary protocols. Grammar‑based fuzzers preserve structure but require handcrafted grammars and suffer from low throughput. LLAMAFUZZ bridges this gap by (1) exploiting the pre‑trained knowledge of LLMs about data conversion and format semantics, and (2) fine‑tuning the LLM on paired mutation seeds collected from real fuzzing runs.
The authors first build a training corpus from FuzzBench and AFL++ logs, selecting seeds that (i) discovered new execution paths, (ii) exhibited distinct hit‑counts, or (iii) caused crashes. Each training example consists of an original seed and a successful mutated counterpart. Binary seeds are transformed into a uniform hexadecimal representation; two consecutive hex characters are merged into a single token to shorten the sequence the model must process, while text‑based formats receive a simple prompt. This preprocessing enables the LLM to handle diverse formats without custom token vocabularies.
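The hex preprocessing described above can be sketched as follows. This is a minimal illustration, not code from the LLAMAFUZZ artifact; function names are hypothetical, and the key point is that each byte of a binary seed maps to one two-character token, and the transformation is invertible so generated mutations can be executed.

```python
def seed_to_hex_tokens(seed: bytes) -> list[str]:
    """Convert a binary seed to byte-level hex tokens ('ff', 'd8', ...)."""
    hex_str = seed.hex()  # two hex characters per byte
    # Merge consecutive hex-character pairs so each byte becomes a single
    # token, halving the sequence length the LLM must process.
    return [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]

def hex_tokens_to_seed(tokens: list[str]) -> bytes:
    """Invert the transformation to recover a runnable binary seed."""
    return bytes.fromhex("".join(tokens))

if __name__ == "__main__":
    jpeg_magic = b"\xff\xd8\xff\xe0"  # header bytes of a structured format
    tokens = seed_to_hex_tokens(jpeg_magic)
    print(tokens)  # ['ff', 'd8', 'ff', 'e0']
    assert hex_tokens_to_seed(tokens) == jpeg_magic  # lossless round trip
```

Because the mapping is lossless, any hex sequence the model emits can be decoded straight back into a candidate input for the target program.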
Fine‑tuning employs supervised fine‑tuning (SFT) with a cross‑entropy loss, augmented by LoRA adapters and mixed‑precision (FP16) quantization to keep training and inference efficient. Noise is added to the training data to mitigate over‑fitting and keep the model from merely memorizing seeds. After fine‑tuning, the LLM can generate structure‑preserving mutations that are semantically meaningful, as demonstrated by the XML mutation example in the paper.
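To see why LoRA keeps fine‑tuning cheap, recall its core idea: instead of updating a full weight matrix W, one trains a low‑rank pair B and A and adds the scaled product (α/r)·BA to the frozen weights. The toy NumPy sketch below (dimensions and values are illustrative, unrelated to the actual model) shows the adapted forward pass and the parameter savings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16  # illustrative layer sizes and rank

W = rng.standard_normal((d_out, d_in))    # frozen pre-trained weights
A = rng.standard_normal((r, d_in)) * 0.01  # small random init
B = np.zeros((d_out, r))                   # zero init: adapter starts as a no-op

def adapted_forward(x: np.ndarray) -> np.ndarray:
    # LoRA-adapted layer: W x + (alpha / r) * B (A x)
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapted layer exactly matches the base model.
assert np.allclose(adapted_forward(x), W @ x)

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
print(d_out * d_in, "->", r * (d_in + d_out))  # 4096 -> 1024
```

During training only A and B receive gradients, which is what makes fine‑tuning a billion‑parameter model on mutation pairs tractable on a single GPU.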
Integration with the fuzzer is achieved through an asynchronous job‑queue architecture. The main fuzzing loop runs the high‑speed AFL++ mutation pipeline and execution monitoring. In parallel, an LLM mutation thread consumes seeds from a dedicated queue, converts them to the hex representation, invokes the fine‑tuned LLM on a GPU, and returns the generated seeds back to the main queue. The scheduler always prefers LLM‑generated seeds when available, otherwise falling back to AFL++ mutations. This dual‑layer, non‑blocking design isolates the computationally intensive LLM work from the latency‑sensitive fuzzing core, preserving overall throughput while enriching the mutation space with structure‑aware candidates.
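The dual‑queue scheduling described above can be sketched in a few lines of Python. This is a simplified stand‑in (the real system calls a GPU‑hosted LLM and AFL++'s mutation pipeline; here a `sleep` and a byte reversal play those roles, and all names are hypothetical), but it shows the non‑blocking pattern: a slow worker mutates seeds off the fast path, and the scheduler prefers its output when available.

```python
import queue
import threading
import time

llm_in: queue.Queue = queue.Queue()   # seeds awaiting LLM mutation
llm_out: queue.Queue = queue.Queue()  # LLM-generated seeds, ready to run

def llm_worker() -> None:
    """Background thread standing in for the GPU-hosted LLM mutator."""
    while True:
        seed = llm_in.get()
        if seed is None:          # sentinel: shut the worker down
            break
        time.sleep(0.01)          # stand-in for LLM inference latency
        llm_out.put(seed[::-1])   # stand-in for a structure-aware mutation
        llm_in.task_done()

def next_input(seed: bytes) -> bytes:
    """Scheduler: prefer an LLM-generated seed; never block waiting for one."""
    try:
        return llm_out.get_nowait()
    except queue.Empty:
        # Fall back to a cheap AFL++-style mutation (here: a bit flip).
        return bytes(b ^ 1 for b in seed)

if __name__ == "__main__":
    threading.Thread(target=llm_worker, daemon=True).start()
    llm_in.put(b"ABCD")
    llm_in.join()               # wait until the worker has produced output
    print(next_input(b"ABCD"))  # b'DCBA'  (LLM-generated seed preferred)
    print(next_input(b"ABCD"))  # b'@CBE'  (queue empty: bit-flip fallback)
```

The essential property is that `next_input` never waits on the LLM: the execution loop keeps its native throughput, and LLM-generated seeds simply enrich the queue whenever they arrive.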
Experimental evaluation comprises two parts. First, on the Magma V1.2 bug‑based benchmark (10 real programs with known bugs), LLAMAFUZZ is compared against AFL++, MOpt‑AFL, Honggfuzz, and FairFuzz. LLAMAFUZZ discovers on average 41 more bugs than the best competitor and identifies 47 unique bugs across all runs. Second, the authors test 15 real‑world applications covering a variety of structured formats (e.g., image files, network protocols, archive formats). LLAMAFUZZ outperforms AFL++ on 11 of the 15 targets, achieving an average 27.19% increase in branch coverage. In many cases, its performance rivals or exceeds that of specialized grammar‑based fuzzers, while retaining the flexibility of a general‑purpose fuzzer.
Key insights include: (1) LLMs can internalize complex data format rules from their massive pre‑training corpora, allowing them to generate syntactically valid mutations without explicit grammars; (2) fine‑tuning on real mutation pairs teaches the model concrete transformation patterns that are highly effective for bug discovery; (3) an asynchronous queue decouples the slower LLM generation from the fast execution loop, eliminating bottlenecks and maintaining high throughput.
The paper’s contributions are threefold: a methodology for preparing and fine‑tuning LLMs for structured seed mutation, a practical system architecture that integrates LLM‑driven mutations with a state‑of‑the‑art grey‑box fuzzer, and a thorough empirical evaluation showing significant improvements in bug detection and code coverage. LLAMAFUZZ demonstrates that large language models can be harnessed to give grey‑box fuzzers structural awareness, achieving both the speed of traditional fuzzers and the precision of grammar‑based approaches. This work opens the door for future research on scaling LLM‑enhanced fuzzing to larger codebases, richer data formats, and even tighter integration with symbolic analysis techniques.