AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean), and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve moderate success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof construction required by Lean (7.8%). Beyond aggregate metrics, we uncover a sharp divergence in test-time compute dynamics: Gemini-3 effectively uses iterative repair to boost performance (e.g., tripling pass rates in Dafny), whereas GPT-OSS saturates early. Finally, our error analysis shows that language design affects the refinement trajectory: while Dafny allows models to focus on logical correctness, Verus and Lean trap models in persistent syntactic and semantic barriers. All data and evaluation code can be found at https://github.com/haoyuzhao123/algoveri.


💡 Research Summary

AlgoVeri addresses a critical gap in the emerging field of “vericoding,” the automatic generation of formally verified code from high‑level specifications. Existing benchmarks evaluate verified code generation in isolation—each focusing on a single language or tool such as Dafny, Verus, or Lean—and consequently compare incomparable tasks. As a result, it has been impossible to assess whether a model’s superior performance stems from genuine reasoning ability or from easier benchmark design.

To solve this, the authors construct a unified, cross‑language benchmark consisting of 77 textbook‑level algorithms spanning sorting, data structures (heaps, segment trees, red‑black trees), graph algorithms (Bellman‑Ford, Edmonds‑Karp), dynamic programming, greedy methods, and selected mathematical procedures (Gaussian elimination). For each algorithm they write aligned specifications in Dafny, Verus, and Lean that share identical pre‑conditions, post‑conditions, global invariants, and, when necessary, ghost state. The specifications are curated by formal‑methods experts, mechanically checked for local satisfiability and necessity, and verified to avoid degenerate formulations. This alignment isolates algorithmic difficulty from tool‑specific quirks, enabling a fair, head‑to‑head comparison of LLM reasoning across SMT‑based (Dafny, Verus) and ITP‑based (Lean) ecosystems.
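
To make the notion of an "aligned functional contract" concrete, a hedged Lean sketch for sorting might read as follows. The names (`Sorted`, `SortContract`) are illustrative and assume Lean 4 with the standard `List.Pairwise` and `List.Perm` definitions available; the benchmark's actual specifications differ.

```lean
-- Hypothetical aligned contract for sorting (illustrative names).
-- A list is sorted when every pair of adjacent-or-later elements is ordered:
def Sorted (xs : List Int) : Prop :=
  xs.Pairwise (· ≤ ·)

-- Every candidate implementation must return a sorted permutation of its input:
def SortContract (sort : List Int → List Int) : Prop :=
  ∀ xs : List Int, Sorted (sort xs) ∧ (sort xs).Perm xs
```

In the SMT-based languages, the same pre- and post-conditions would be phrased as `requires`/`ensures` clauses on the implementing method, so that all three versions constrain implementations identically.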

The evaluation pipeline proceeds as follows: a model receives the natural‑language problem description together with the aligned formal specification; it then generates an implementation and the required proof artifacts. A multi‑turn iterative repair process allows the model to revise its output for up to 15 rounds, each round ingesting the compiler or verifier error messages. A solution is considered verified if the underlying verifier accepts the code and proof. Because a specification may admit multiple correct implementations, a secondary LLM‑judge checks algorithmic fidelity—whether the verified code actually implements the intended algorithm (e.g., bubble sort vs. merge sort). The final metric, “Full Mark,” combines formal verification with this semantic filter, thereby preventing “spec‑gaming,” where a model circumvents the intended task.
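
The repair loop described above can be sketched as follows. This is an illustrative outline, not the paper's actual harness: `ask_model` and `run_verifier` are hypothetical stand-ins for the LLM call and the Dafny/Verus/Lean toolchain invocation.

```python
def vericode(spec: str, ask_model, run_verifier, max_rounds: int = 15):
    """Iterative-repair sketch: generate code for `spec`, then feed
    verifier error messages back to the model for up to `max_rounds` rounds.
    `ask_model(transcript) -> candidate` and
    `run_verifier(candidate) -> (accepted, error_messages)` are stand-ins."""
    transcript = [spec]                    # conversation so far
    for _ in range(max_rounds + 1):        # initial attempt + repair rounds
        candidate = ask_model(transcript)
        accepted, errors = run_verifier(candidate)
        if accepted:
            return candidate               # verifier accepted code + proof
        transcript.append(errors)          # model ingests the error messages
    return None                            # unsolved within the repair budget
```

A separate judging step (the LLM-judge for algorithmic fidelity) would then be applied only to candidates that survive verification.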

The authors evaluate both proprietary frontier models (Gemini‑3 Flash, GPT‑5 mini) and open‑weight models (GPT‑OSS‑120B, Qwen3‑235B, Qwen3‑Next‑80B, Devstral‑2‑123B). Results reveal stark performance gaps: Gemini‑3 Flash achieves 40.3% full correctness on Dafny, 24.7% on Verus, and only 7.8% on Lean. Open‑weight models perform substantially worse, and scaling test‑time compute (10 parallel samples, each with up to 15 repair rounds) yields limited gains. These numbers contrast sharply with earlier benchmarks that reported over 80% success on Dafny for simpler tasks, underscoring that verifying algorithms which require global reasoning and ghost state is far more challenging.
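
When aggregating results over parallel samples, a standard tool is the unbiased pass@k estimator from the HumanEval line of work; the paper's exact aggregation may differ, so this is offered only as the conventional reference point.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: of `n` generated samples, `c` were verified
    correct; returns the probability that at least one of `k` randomly
    drawn samples is correct (Chen et al.'s HumanEval estimator)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw: success certain
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 1 verifies, pass@1 is 0.1 while pass@10 is 1.0, which is why "width" can mask weak per-attempt reliability.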

A detailed error analysis identifies three dominant failure modes. In Dafny, most errors are logical—incorrect loop invariants or insufficient termination arguments—reflecting the need for deeper global reasoning despite SMT automation. In Verus, syntactic and type errors dominate; the language’s low‑level memory model forces the model to struggle with parsing and pointer handling before any logical reasoning can begin. In Lean, the primary obstacle is proof‑search: models frequently invent or misuse tactics, fail to prove auxiliary lemmas, or get stuck in proof‑state exploration, leading to very low success rates.
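
The three failure modes above suggest a simple triage step over verifier output. The following toy classifier is purely illustrative: the regex patterns are hypothetical examples of each error family, not the paper's taxonomy or the tools' actual message formats.

```python
import re

# Illustrative buckets for the three dominant failure modes discussed above.
# Patterns are hypothetical; real Dafny/Verus/Lean messages vary.
CATEGORIES = [
    ("syntax/type",  re.compile(r"parse error|expected token|mismatched types", re.I)),
    ("proof-search", re.compile(r"unknown tactic|unsolved goals|failed to synthesize", re.I)),
    ("logic",        re.compile(r"invariant.*(might not|not) hold|termination|postcondition", re.I)),
]

def classify(message: str) -> str:
    """Assign a verifier error message to the first matching bucket."""
    for label, pattern in CATEGORIES:
        if pattern.search(message):
            return label
    return "other"
```

Bucketing errors this way is what makes the cross-language comparison in the paper legible: Dafny failures concentrate in the "logic" bucket, Verus in "syntax/type", and Lean in "proof-search".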

The study also examines compute dynamics. Frontier models like Gemini‑3 Flash benefit dramatically from iterative repair: each additional round yields a measurable increase in success, tripling the final pass rate for Dafny. By contrast, open‑weight models saturate after a few rounds, indicating that for them “depth” (more repair steps) is less effective than “width” (more parallel samples). This suggests an emerging “intelligence gap” where only the most advanced models can leverage feedback loops to refine proofs.
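
As a hedged illustration of the depth-versus-width tradeoff (a toy model, not the paper's analysis): suppose each attempt succeeds with probability `p`, and iterative repair multiplies the per-round success probability by an assumed factor `r` as feedback accumulates. Then "width" is `k` independent draws at fixed `p`, while "depth" is `k` sequential rounds with improving odds.

```python
def width_success(p: float, k: int) -> float:
    """P(at least one success) over k independent samples at probability p."""
    return 1.0 - (1.0 - p) ** k

def depth_success(p: float, r: float, k: int) -> float:
    """P(at least one success) over k sequential repair rounds, where each
    round's success probability is the previous one scaled by r (capped at 1).
    r > 1 models a model that genuinely learns from verifier feedback;
    r = 1 models one that does not (repair degenerates to resampling)."""
    prob_all_fail, q = 1.0, p
    for _ in range(k):
        prob_all_fail *= (1.0 - q)
        q = min(1.0, q * r)
    return 1.0 - prob_all_fail
```

Under this toy model, depth beats width exactly when `r > 1`, matching the observed split: Gemini‑3 behaves as if `r > 1`, while the saturating open‑weight models behave as if `r ≈ 1`, for which extra parallel samples are the better use of compute.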

Finally, the authors argue that language design fundamentally shapes the repair trajectory. Dafny’s high level of automation lets models focus on logical correctness; Verus’s requirement to manage concrete memory details creates a syntactic barrier; Lean’s interactive theorem‑proving paradigm imposes a search‑heavy proof construction barrier. These observations point to several future research directions: (1) developing meta‑learning strategies that teach models to navigate tactic spaces efficiently; (2) designing prompting and tool‑integration techniques that expose verifier feedback more effectively; (3) possibly redesigning verification languages to lower syntactic friction while preserving expressive power.

In sum, AlgoVeri provides the first rigorously aligned, multi‑language benchmark for verified code generation on non‑trivial algorithms, reveals that current LLMs are far from solving the problem, and offers a rich diagnostic framework for guiding the next generation of vericoding systems. All data, specifications, and evaluation scripts are publicly released at https://github.com/haoyuzhao123/algoveri.

