Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

Large Language Models (LLMs) for code generation boost productivity but frequently introduce Knowledge-Conflicting Hallucinations (KCHs): subtle semantic errors, such as non-existent API parameters, that evade linters and cause runtime failures. Existing mitigations, such as constrained decoding or non-deterministic LLM-in-the-loop repair, are often unreliable against these errors. This paper investigates whether a deterministic static-analysis framework can reliably detect and auto-correct KCHs. We propose a post-processing framework that parses generated code into an Abstract Syntax Tree (AST) and validates it against a dynamically generated Knowledge Base (KB) built via library introspection. This non-executing approach uses deterministic rules to find and fix both API-level and identifier-level conflicts. On a manually curated dataset of 200 Python snippets, our framework detected KCHs with 100% precision and 87.6% recall (0.934 F1-score) and automatically corrected 77.0% of all identified hallucinations. Our findings demonstrate that deterministic post-processing is a viable and reliable alternative to probabilistic repair, offering a clear path toward trustworthy code generation.


💡 Research Summary

The paper addresses a pressing problem in the use of large language models (LLMs) for code generation: Knowledge-Conflicting Hallucinations (KCHs). These are subtle semantic bugs—such as misspelled API names, non-existent parameters, or misused variables—that pass syntactic checks and linters but cause runtime failures. Existing mitigations either constrain the generation process (e.g., PICARD, Synchromesh), which cannot catch calls that are semantically valid in form yet incorrect in substance, or rely on non-deterministic "LLM-in-the-loop" repair, where the model is asked to fix its own mistakes. Both approaches are insufficient for KCHs.

The authors propose a deterministic, static‑analysis pipeline that works entirely without executing the generated code. First, the code snippet is parsed into an Abstract Syntax Tree (AST) using Python’s standard ast module. From the AST they extract (i) import statements and aliases, (ii) fully‑qualified call sites, (iii) bare function calls lacking a module qualifier, and (iv) call arguments, especially string literals that hint at intent (e.g., a “.csv” file extension).
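The extraction step described above can be sketched with Python's standard `ast` module. This is a minimal, illustrative version (the function and variable names are ours, not the paper's): it collects import aliases, module-qualified call sites, bare calls, and the string-literal arguments that hint at intent.

```python
import ast

def extract_calls(source: str):
    """Collect import aliases and call sites from a snippet (illustrative sketch)."""
    tree = ast.parse(source)
    aliases = {}          # alias -> module name, e.g. {"pd": "pandas"}
    qualified_calls = []  # e.g. ("pd", "read_csv", ["data.csv"])
    bare_calls = []       # calls lacking a module qualifier, e.g. "read_json"

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                aliases[name.asname or name.name] = name.name
        elif isinstance(node, ast.Call):
            # Keep string-literal arguments: extensions like ".csv" hint at intent
            str_args = [a.value for a in node.args
                        if isinstance(a, ast.Constant) and isinstance(a.value, str)]
            if isinstance(node.func, ast.Attribute) and isinstance(node.func.value, ast.Name):
                qualified_calls.append((node.func.value.id, node.func.attr, str_args))
            elif isinstance(node.func, ast.Name):
                bare_calls.append((node.func.id, str_args))
    return aliases, qualified_calls, bare_calls

aliases, qualified, bare = extract_calls(
    "import pandas as pd\ndf = pd.read_csv('data.csv')\nx = read_json('a.json')\n"
)
```

Because the snippet is only parsed, never executed, this step is safe even for hallucinated code that would crash at runtime.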

Next, a Dynamic Knowledge Base (KB) is built on‑the‑fly by importing each referenced library and introspecting it with inspect and dir. The KB stores the list of public callables, common aliases, and lightweight semantic cues (e.g., requests.get expects a URL). The KB is versioned with the library’s __version__ attribute to guarantee reproducibility.
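A minimal sketch of this introspection step, using only `importlib`, `dir`, and `inspect`. The KB schema below (a dict of callables, signatures, and a version pin) is our assumption; the paper's actual KB additionally stores aliases and semantic cues.

```python
import importlib
import inspect

def build_kb(module_name: str) -> dict:
    """Build a minimal knowledge base for one library by importing and
    introspecting it (illustrative sketch of the paper's KB construction)."""
    mod = importlib.import_module(module_name)
    callables = {}
    for name in dir(mod):
        if name.startswith("_"):
            continue  # skip private and dunder names
        obj = getattr(mod, name)
        if callable(obj):
            try:
                sig = str(inspect.signature(obj))
            except (ValueError, TypeError):
                sig = None  # some builtins have no introspectable signature
            callables[name] = sig
    return {
        "module": module_name,
        # Version-pin the KB so validation results are reproducible
        "version": getattr(mod, "__version__", "unknown"),
        "callables": callables,
    }

kb = build_kb("json")  # stdlib module used here so the sketch is self-contained
```

Because the KB is built from the library actually installed in the user's environment, it reflects the exact API surface the generated code will run against.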

The Validation layer cross‑references each extracted call with the KB. Three error categories are defined: (1) Unknown API – the call does not exist in the KB (e.g., pd.read_exel); the system suggests the closest match using edit distance. (2) Bare Critical Call – a function is used without its required module prefix (e.g., read_csv); the system flags it for insertion of the missing import. (3) Semantic Inconsistency – the call’s signature conflicts with contextual cues (e.g., passing a “.csv” file to pd.read_excel). The validation runs in O(n·m) time, where n is the number of call sites and m the size of the KB, and proved tractable even for large libraries.
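The first two error categories can be sketched as follows. Here `difflib.get_close_matches` stands in for the paper's edit-distance suggestion; the category names and KB shape are illustrative assumptions.

```python
from difflib import get_close_matches

def validate_call(qualifier, func, known_callables, known_modules):
    """Classify one call site against the KB (sketch of two of the three
    error categories; contextual-mismatch checks are omitted here)."""
    if qualifier is None:
        # Bare Critical Call: a function used without its required module prefix
        owners = [m for m, funcs in known_modules.items() if func in funcs]
        if owners:
            return ("bare_critical_call", f"add 'import {owners[0]}' and qualify the call")
        return ("unknown_api", None)
    if func in known_callables.get(qualifier, []):
        return ("ok", None)
    # Unknown API: suggest the closest valid symbol
    matches = get_close_matches(func, known_callables.get(qualifier, []), n=1)
    return ("unknown_api", matches[0] if matches else None)

kb_callables = {"pd": ["read_csv", "read_excel", "DataFrame"]}
kb_modules = {"pandas": ["read_csv", "read_excel"]}

status, suggestion = validate_call("pd", "read_exel", kb_callables, kb_modules)
status2, hint = validate_call(None, "read_csv", kb_callables, kb_modules)
```

Each call site is compared against the KB entries for its module, which is where the O(n·m) bound comes from.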

The Correction module performs localized AST edits: misspellings are replaced by the nearest valid symbol, mismatched arguments are rewritten according to intent heuristics, and missing imports are inserted at the top of the file. The modified AST is then unparsed back to source code, yielding a deterministic, reproducible fix.
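A misspelling fix of this kind can be sketched with an `ast.NodeTransformer` plus `ast.unparse` (available since Python 3.9). The transformer below is our illustration of the idea, not the paper's implementation:

```python
import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite a misspelled attribute name to the nearest valid symbol."""
    def __init__(self, wrong: str, right: str):
        self.wrong, self.right = wrong, right

    def visit_Attribute(self, node):
        self.generic_visit(node)
        if node.attr == self.wrong:
            node.attr = self.right  # localized edit: only the bad symbol changes
        return node

def fix_snippet(source: str, wrong: str, right: str) -> str:
    tree = ast.parse(source)
    tree = RenameCall(wrong, right).visit(tree)
    ast.fix_missing_locations(tree)
    # Unparsing the edited AST is deterministic: same input, same fix
    return ast.unparse(tree)

fixed = fix_snippet("import numpy as np\nx = np.arrya([1, 2, 3])\n", "arrya", "array")
```

Operating on the AST rather than on raw text keeps the edit localized and guarantees the output is still syntactically valid Python.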

Evaluation uses a manually curated dataset of 200 Python snippets generated by GPT‑5, covering five popular libraries (numpy, pandas, requests, matplotlib, json). The set contains 161 hallucinated snippets and 39 clean ones. Detection results show 100 % precision (no false positives) and 87.6 % recall, giving an F1‑score of 0.934. Recall is highest for Missing Imports (97.9 %) and numpy calls (100 %), and lowest for Contextual Mismatches (33.3 %) and matplotlib (72.2 %).
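The reported F1-score follows directly from the stated precision and recall; a quick arithmetic check using the paper's reported figures:

```python
# Dataset composition reported in the paper
hallucinated, clean = 161, 39
total = hallucinated + clean  # 200 snippets

# Detection metrics reported in the paper
precision = 1.0    # 100% precision: no false positives
recall = 0.876     # 87.6% recall
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.934
```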

For correction, 77 % of the identified hallucinations are automatically fixed. Success is near‑perfect for Missing Imports (97.9 %) but drops for mis‑typed APIs (70 %) and especially for pandas (56.2 %). Failure analysis reveals that while the system can correct surface‑level typos, it sometimes misses deeper semantic intent (e.g., replacing np.arrya with np.array but not recognizing that the intended operation was np.mean).

The authors argue that a deterministic static‑analysis approach offers a reliable, interpretable alternative to probabilistic repair. It can be packaged as a lightweight semantic linter integrated into IDEs, providing real‑time feedback as the LLM generates code. Future work includes automating KB population from official documentation, extending analysis to multi‑module projects, and incorporating machine‑learning‑based intent inference to handle more nuanced semantic errors.

In summary, the paper demonstrates that a combination of AST parsing and dynamic introspection can effectively detect and automatically correct a large class of LLM‑induced knowledge‑conflicting hallucinations, achieving high precision and respectable recall without executing the code. This deterministic pipeline paves the way for more trustworthy code‑generation tools and offers a solid foundation for further research into static, non‑probabilistic mitigation of LLM‑generated bugs.

