From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation
Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, verifiable computation, and secure finance. However, authoring ZK programs remains challenging: unlike conventional software development, ZK programming requires a fundamental paradigm shift from *imperative computation* to *declarative verification*. This shift demands rigorous reasoning about finite field arithmetic and complex constraint systems, which is rare in mainstream imperative languages, making ZK development knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code-generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and constraint-level reasoning, remains unexplored. To address this gap, we propose ZK-Eval, a domain-specific evaluation pipeline that probes LLM capabilities in ZK programming at three levels: language knowledge, algebraic primitive competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that while the models demonstrate strong proficiency in language syntax, they struggle to implement and compose algebraic primitives into correct constraint systems, frequently producing incorrect programs. Based on these insights, we introduce ZK-Coder, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments with GPT-o3 on Circom and Noir show substantial gains, with success rates improving from 20.29% to 87.85% and from 28.38% to 97.79%, respectively. With ZK-Eval and ZK-Coder, we establish a basis for systematically measuring and augmenting LLMs in ZK code generation, lowering barriers for practitioners and advancing privacy-preserving computation.
💡 Research Summary
Zero‑knowledge proofs (ZKPs) have moved from theory to practice, powering privacy‑preserving authentication, blockchain scaling, and verifiable computation. Writing ZKP programs, however, is fundamentally different from conventional software development: developers must express a computation as a system of algebraic constraints over a finite field rather than as a sequence of imperative instructions. Small mistakes in constraint wiring can break completeness, soundness, or zero‑knowledge, making ZKP development highly knowledge‑intensive and error‑prone.
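The imperative-to-declarative shift can be sketched in plain Python (illustrative only, not tied to any particular DSL): an imperative program *computes* an output from its inputs, whereas a constraint system merely *checks* that a claimed witness satisfies an algebraic relation over a prime field. The modulus below is the BN254/bn128 scalar-field prime that Circom uses by default.

```python
# The same computation expressed imperatively vs. as a declarative
# constraint over a finite field. Illustrative sketch only.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field prime (Circom's default)

def compute(a, b):
    # Imperative style: derive the output from the inputs.
    return (a * b) % P

def constraint_holds(a, b, c):
    # Declarative style: only *check* that the witness (a, b, c)
    # satisfies the relation a * b = c (mod P).
    return (a * b - c) % P == 0
```

A sound circuit must accept exactly the witnesses satisfying the relation; a wiring mistake that weakens `constraint_holds` would let a prover "prove" false statements, which is why small errors are so costly in this setting.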
Large language models (LLMs) have demonstrated impressive code‑generation abilities for mainstream languages, but their suitability for ZKP domain‑specific languages (DSLs) such as Circom and Noir has not been studied. To fill this gap, the authors introduce ZK‑Eval, a three‑stage benchmark that isolates (i) language and toolchain knowledge, (ii) competence with algebraic primitives (modular arithmetic, inverses, bit‑wise gates), and (iii) end‑to‑end program generation (from a natural‑language specification to a compilable, provable circuit). They collect 172 curated test cases from official docs, tutorials, and real‑world repositories, and evaluate four state‑of‑the‑art LLMs (GPT‑o4‑mini, GPT‑o3, Claude‑2, Llama‑2‑70B).
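To make the primitive-level stage concrete, here is a minimal Python sketch of two primitives of the kind the benchmark probes: the modular inverse (via Fermat's little theorem, valid because the field modulus is prime) and the standard "booleanity" gate that constrains a signal to be a bit. These are generic illustrations, not the benchmark's actual test cases.

```python
def mod_inverse(a, p):
    # Fermat's little theorem: a^(p-2) ≡ a^(-1) (mod p) for prime p, a ≠ 0.
    if a % p == 0:
        raise ZeroDivisionError("0 has no inverse in a prime field")
    return pow(a, p - 2, p)

def is_bit(x, p):
    # Classic ZK bit constraint: x * (x - 1) = 0 (mod p) forces x ∈ {0, 1}.
    return (x * (x - 1)) % p == 0
```

Getting such primitives right individually is necessary but not sufficient: the harder failure mode the paper identifies is *composing* them into a constraint system that is simultaneously complete and sound.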
Results reveal a clear pattern: all models achieve >85% accuracy on pure syntax questions, but their performance collapses to <30% on primitive‑level tasks and <15% on full‑pipeline generation. This indicates that while LLMs have memorized DSL syntax, they lack the deeper mathematical reasoning required to compose correct constraint systems.
Guided by these findings, the paper proposes ZK‑Coder, an agentic framework that augments an LLM with three complementary modules:
- ZK Sketch Layer (ZKSL) – transforms a natural‑language specification into an explicit graph of algebraic primitives, forcing the model to reason about the high‑level constraint structure before emitting concrete code.
- Guided Retrieval‑Augmented Generation (RAG) – searches a curated library of verified implementations (e.g., circomlib, Noir stdlib) and injects relevant snippets into the prompt, grounding the generation in proven patterns.
- Interactive Repair – after an initial generation, the system automatically compiles and attempts a proof; any error messages are fed back as corrective prompts, iterating until the circuit passes.
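The interactive-repair module can be sketched as a compile-prove-retry loop. The sketch below uses the hypothetical callables `generate` (the LLM call) and `compile_and_prove` (the DSL toolchain) as stand-ins; these names are assumptions for illustration, not the paper's actual API.

```python
# Hedged sketch of the interactive-repair loop described above.
# `generate` and `compile_and_prove` are hypothetical stand-ins for
# the LLM call and the compile/prove toolchain, respectively.

def repair_loop(spec, generate, compile_and_prove, max_iters=5):
    prompt = spec
    for _ in range(max_iters):
        code = generate(prompt)
        ok, error = compile_and_prove(code)
        if ok:
            return code
        # Feed compiler/prover diagnostics back as a corrective prompt.
        prompt = f"{spec}\nPrevious attempt failed with: {error}\nFix the circuit."
    return None  # give up after max_iters unsuccessful attempts
```

The key design choice is that the error messages themselves become part of the next prompt, so each iteration is grounded in concrete toolchain feedback rather than asking the model to guess what went wrong.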
When applied to GPT‑o3, ZK‑Coder raises the success rate on Circom from 20.29% to 87.85% and on Noir from 28.38% to 97.79%. Ablation studies show that removing any of the three modules degrades performance by at least 10%, with the sketch layer being especially critical for eliminating primitive‑level mistakes.
The contributions are threefold: (1) a systematic, three‑stage benchmark (ZK‑Eval) for measuring LLM capabilities in ZKP programming; (2) the ZK‑Coder framework that demonstrates how constraint sketching, retrieval grounding, and interactive repair can bridge the gap between syntactic knowledge and mathematical correctness; (3) extensive empirical evidence that these augmentations dramatically improve end‑to‑end ZKP code generation.
Overall, the work highlights that ZKP DSLs demand domain‑specific reasoning beyond what generic LLMs learn from open‑source code. By providing both a rigorous evaluation methodology and a practical augmentation pipeline, the paper paves the way for more reliable AI‑assisted development of privacy‑preserving cryptographic applications.