Automated Proof Generation for Rust Code via Self-Evolution

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the Original ArXiv Source.

Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle lies in the severe lack of data: there are far fewer proofs than code snippets for Large Language Models (LLMs) to train upon. In this paper, we introduce SAFE, a framework that overcomes the lack of human-written proofs to enable automated proof generation for Rust code. SAFE establishes a self-evolving cycle in which data synthesis and fine-tuning collaborate to enhance model capability, leveraging the definitive power of a symbolic verifier to tell correct proofs from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier’s feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proofs for Rust code. This advancement leads to a significant improvement in performance, achieving a 52.52% accuracy rate on a benchmark crafted by human experts, a significant leap over GPT-4o’s performance of 14.39%.


💡 Research Summary

The paper introduces SAFE (Self‑evolving Automated proof Generation), a framework that tackles the severe data scarcity problem in automated formal verification for Rust code. Traditional proof‑oriented languages such as Lean or F* benefit from large, human‑written proof corpora, but tools like Verus—designed to verify Rust programs—have only a few hundred verified files, making direct fine‑tuning of language models infeasible. SAFE overcomes this by constructing two intertwined self‑evolving loops: one for generating specifications and another for generating proofs.

First, a large pool of Rust functions is assembled by translating existing Python and Rust snippets from popular code‑synthesis datasets (MBPP, CodeNet) into Verus‑compatible Rust using GPT‑4o. Unsupported language features (e.g., iterators, HashMap) are replaced with equivalent constructs (e.g., while loops). Only programs that compile with the Verus compiler are retained, effectively using the verifier as an automatic filter.
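The filtering step above can be sketched as a simple loop in which the Verus compiler acts as an automatic accept/reject oracle. The helper names below (`translate`, `verus_compiles`) are hypothetical stand-ins for the paper's GPT-4o translation step and a call into the Verus toolchain; this is a minimal sketch of the idea, not the actual implementation.

```python
# Sketch of the verifier-as-filter step: translate snippets into
# Verus-compatible Rust, keep only what the Verus compiler accepts.
# `translate` and `verus_compiles` are hypothetical stand-ins.

def build_function_pool(snippets, translate, verus_compiles):
    """Return only translated candidates that pass the Verus compiler."""
    pool = []
    for snippet in snippets:
        candidate = translate(snippet)          # e.g. GPT-4o translation
        if candidate is not None and verus_compiles(candidate):
            pool.append(candidate)              # compiler acts as the filter
    return pool

# Toy demonstration with stub implementations:
snippets = ["def add(a, b): return a + b", "uses_hashmap", "iter_based"]
translate = lambda s: None if s == "iter_based" else f"fn translated({s!r})"
verus_compiles = lambda code: "uses_hashmap" not in code
pool = build_function_pool(snippets, translate, verus_compiles)
```

In this toy run only the first snippet survives: the second is rejected by the stub "compiler" (unsupported `HashMap` usage), and the third fails translation outright.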

Second, specification synthesis proceeds in rounds. In round 0, GPT‑4o creates pre‑ and post‑conditions from each function’s implementation and doc‑string. A quantitative quality metric (based on Lahiri 2024) evaluates these specs; high‑quality specs are kept, low‑quality ones are discarded. The retained specs fine‑tune an open‑source LLM (DeepSeekCoder). The newly fine‑tuned model then generates specs for the next round, gradually improving both quantity and quality. Crucially, the process does not require perfect specs—reasonably accurate ones suffice for the verifier to attempt a proof.
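The round-based structure of this stage can be sketched as a loop that alternates generation, quality filtering, and fine-tuning. All function names here are hypothetical (`generate_spec`, `quality_score`, `fine_tune`), with `quality_score` standing in for the Lahiri-2024-style metric; the threshold is an illustrative assumption.

```python
# Minimal sketch of the self-evolving specification loop described above.
# All helpers are hypothetical stand-ins for the paper's components.

def evolve_specs(functions, generate_spec, quality_score, fine_tune,
                 rounds=3, threshold=0.8):
    """Each round: generate specs, keep high-quality ones, and fine-tune
    the model on them so the next round's generator is stronger."""
    model = "round0"                      # round 0 uses GPT-4o in the paper
    kept = []
    for _ in range(rounds):
        new_specs = [generate_spec(model, f) for f in functions]
        good = [s for s in new_specs if quality_score(s) >= threshold]
        kept.extend(good)                 # low-quality specs are discarded
        model = fine_tune(model, good)    # produces the next-round model
    return kept

# Toy demonstration with deterministic stubs:
functions = ["f1", "f2"]
generate_spec = lambda model, f: (model, f)
quality_score = lambda spec: 1.0 if spec[1] == "f1" else 0.0
fine_tune = lambda model, good: model + "+ft"
kept = evolve_specs(functions, generate_spec, quality_score, fine_tune,
                    rounds=2)
```

The toy run keeps one spec per round (two total), illustrating how the retained set grows across rounds while low-scoring specs never enter the training data.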

Third, proof synthesis follows a similar self‑evolving pattern. Using the specs from the previous stage, GPT‑4o attempts to produce full Verus proofs. Verus instantly validates each proof; correct proofs (P✓) are saved, while incorrect attempts (P✗) are paired with the verifier’s error messages to form a triplet (P✗, error, P✓). These triplets become “debug data” that teach the model how to repair faulty proofs. Over multiple rounds, the model’s ability to both generate and debug proofs improves dramatically, eventually surpassing the original GPT‑4o baseline.
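The triplet construction can be sketched as follows: for each specification, sample several proof attempts, let the verifier sort them, and pair every failure (with its error message) against an eventually verified proof. The helpers `propose_proof` and `verify` are hypothetical stand-ins for the LLM sampler and a Verus invocation; the attempt count is an illustrative assumption.

```python
# Sketch of repurposing failed attempts as self-debugging training data.
# `propose_proof` and `verify` are hypothetical stand-ins.
import itertools

def collect_proof_data(specs, propose_proof, verify, attempts_per_spec=4):
    """Return verified proofs plus (bad_proof, error, good_proof)
    triplets that later train the repair capability."""
    correct, debug_triplets = [], []
    for spec in specs:
        attempts = [propose_proof(spec) for _ in range(attempts_per_spec)]
        good, failures = None, []
        for proof in attempts:
            ok, error = verify(proof)
            if ok and good is None:
                good = proof                      # first verified attempt
            elif not ok:
                failures.append((proof, error))   # keep the error message
        if good is not None:
            correct.append(good)
            # pair each failure with the eventual correct proof
            debug_triplets += [(bad, err, good) for bad, err in failures]
    return correct, debug_triplets

# Toy demonstration: attempts alternate between failing and verifying.
seq = itertools.cycle(["bad", "good"])
propose_proof = lambda spec: next(seq)
verify = lambda p: (True, None) if p == "good" else (False, "assert failed")
correct, triplets = collect_proof_data(["spec1"], propose_proof, verify)
```

In the toy run, one spec yields one verified proof and two debug triplets, mirroring the paper's observation that incorrect attempts vastly outnumber correct ones and are too valuable to throw away.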

Experimental evaluation uses two benchmarks: VerusBench (human‑curated specifications and verified Rust functions) and CodeNet‑Test (synthetically generated). Direct proof generation yields 43.17% accuracy on VerusBench and 43.83% on CodeNet‑Test, far exceeding GPT‑4o’s 11.51% and 0.28% respectively. When the self‑debugging module is invoked, accuracy climbs to 79.14% and 48.43%. Overall, SAFE synthesizes 19,017 specifications and 9,706 verified Rust functions from an initial dataset with zero formal annotations.

Key contributions include: (1) leveraging the verifier itself as a data filter and labeler, eliminating the need for human‑annotated proof data; (2) repurposing abundant incorrect proofs as training material for a self‑debugging capability; (3) demonstrating that two mutually reinforcing self‑evolving loops can bootstrap high‑quality proof generation from virtually no initial data; and (4) showing that an open‑source LLM, initially unaware of Verus, can be transformed into a competent Rust proof generator.

Limitations are acknowledged: the approach depends on Verus’s current subset of Rust, so broader language support would increase applicability; the spec‑quality metric may still discard useful specs; and extending the method to other verification tools (e.g., Dafny, Why3) will require adapting the feedback loop to their specific error reporting and specification languages.

In summary, SAFE provides a practical, fully automated pipeline that turns a verification engine into a data generator, a quality filter, and a feedback provider, thereby achieving a substantial leap in automated proof generation for Rust code and opening avenues for similar self‑evolving systems in other formal verification contexts.

