APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries
While frontier formal mathematics systems now routinely produce repository-scale proof engineering artifacts that require multi-file coordination and semantic correctness beyond mere compilation, existing evaluation benchmarks remain focused on isolated theorem proving. We introduce Automated Proof Engineering (APE), the first systematic framework for evaluating repository-scale proof engineering through dual verification, which validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. We present a complete infrastructure comprising APE-Bench, which automatically extracts proof engineering tasks from real library commit histories, and APE-Harness, a unified execution framework built on a task-contract abstraction. This contract-based design enables standardized evaluation across diverse formal mathematics tasks and fair, systematic comparison of different agent implementations (including our APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. We demonstrate the framework's effectiveness through comprehensive evaluation of frontier models and agent scaffolds on APE-Bench. All code and the benchmark dataset are released as open source at https://github.com/xinhjBrant/APE-Bench.
💡 Research Summary
The paper introduces APE‑Bench, the first systematic benchmark for evaluating Automated Proof Engineering (APE) at repository scale, together with an execution framework called APE‑Harness. While existing formal‑math benchmarks such as miniF2F, miniCTX, and FA‑TE focus on isolated theorem proving—requiring a proof term that type‑checks against a single statement—real‑world development of libraries like Mathlib involves multi‑file modifications, attribute registration, and semantic requirements that go far beyond simple compilation. To capture this, the authors formalize an APE task as a tuple consisting of (i) a pinned repository commit and toolchain version, (ii) a natural‑language instruction describing the engineering goal, (iii) a set of file‑level modifications, and (iv) a dual verification protocol: syntactic verification (Lean compilation) and semantic verification (LLM‑as‑Judge assessing requirement alignment, scope control, and logical correctness).
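The four-part task tuple and the dual verification protocol described above can be sketched as a small data structure. This is an illustrative model only; the field and function names below are assumptions, not the paper's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"

@dataclass(frozen=True)
class ApeTask:
    """One APE task: a pinned environment plus an engineering goal.

    The paper defines the task abstractly as (environment, instruction,
    modifications, verification protocol); these names are illustrative.
    """
    commit: str          # (i) pinned repository commit hash
    toolchain: str       # (i) pinned Lean toolchain version
    instruction: str     # (ii) natural-language engineering goal
    target_files: tuple  # (iii) file-level modifications in scope

def dual_verify(compiles: bool, semantic_scores: dict) -> Verdict:
    """(iv) Dual verification: a task passes only if the patched library
    compiles AND an LLM judge accepts all three semantic dimensions."""
    semantic_ok = all(
        semantic_scores.get(dim, False)
        for dim in ("requirement_alignment", "scope_control", "logical_correctness")
    )
    return Verdict.PASS if compiles and semantic_ok else Verdict.FAIL
```

Note how the conjunction makes compilation necessary but not sufficient: a patch that type-checks yet exceeds the task's scope still fails.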
APE‑Bench is built by mining the commit history of Mathlib. An automated pipeline extracts 100 proof‑engineering tasks from 67 commits dated after 2026‑01‑01, ensuring zero contamination with model training data. Each task preserves the original commit hash, the Lean 4 toolchain version, the modified file, and a concise natural‑language description of the intended change. The pipeline also implements content‑addressable deduplication: identical files across different commits are stored once, and compiled artifacts are shared, turning linear version growth into logarithmic storage requirements.
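The deduplication scheme above is essentially content addressing: files are keyed by a hash of their contents, so a file unchanged between two commits is stored, and compiled, only once. A minimal sketch of the idea (this is an illustration, not the paper's implementation):

```python
import hashlib

class ContentStore:
    """Content-addressable store: identical file contents are kept once,
    keyed by their SHA-256 digest, so storage and compiled artifacts can
    be shared across commits (illustrative sketch)."""

    def __init__(self):
        self._blobs = {}   # digest -> file content (one copy per unique content)
        self._trees = {}   # commit -> {path: digest}

    def add_commit(self, commit: str, files: dict) -> None:
        tree = {}
        for path, content in files.items():
            digest = hashlib.sha256(content.encode()).hexdigest()
            self._blobs.setdefault(digest, content)  # deduplication happens here
            tree[path] = digest
        self._trees[commit] = tree

    def blob_count(self) -> int:
        return len(self._blobs)

store = ContentStore()
store.add_commit("c1", {"A.lean": "theorem t : True := trivial", "B.lean": "def b := 1"})
# c2 changes only B.lean; A.lean is stored once and shared by both commits
store.add_commit("c2", {"A.lean": "theorem t : True := trivial", "B.lean": "def b := 2"})
```

Because compiled artifacts can likewise be keyed by the digest of their source, a cache hit on the blob also skips recompilation.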
APE‑Harness provides the runtime infrastructure. Central to it is the “Task Contract” abstraction, which declaratively specifies environment bindings, objectives, access boundaries, and verification protocols. Contracts are independent of any particular execution strategy, allowing heterogeneous agents—APE‑Agent (a ReAct‑style scaffold), Claude Code, Codex CLI—to operate on the exact same specification while using their own internal toolsets. The harness offers three core services: (a) Execute Service for compilation checking, (b) Retrieve Service for semantic search over the pinned library (e.g., locating lemmas, definitions), and (c) Orchestrator that creates isolated workspaces, enforces access control, and coordinates the dual verification flow. Nested contracts enable sub‑tasks such as judgment checks to be executed through the same infrastructure, demonstrating self‑hosting capability.
The authors evaluate frontier LLMs (GPT‑5.2, Gemini 3 Pro, Gemini 3 Flash) on the 100 APE‑Bench tasks using the three agent scaffolds. Success is defined as passing both the compilation check and the semantic check. GPT‑5.2 achieves the highest overall success rate (≈42%), followed by Gemini 3 Pro (≈35%) and Gemini 3 Flash (≈31%); Claude Code and Codex CLI trail at 27% and 22%, respectively. To validate the LLM‑as‑Judge component, 64 expert‑annotated solutions are collected and rated on the three semantic dimensions; the automatic judge's scores reach a Pearson correlation of 0.84 with the human judgments, supporting the judge's reliability.
Beyond the core benchmark, the authors demonstrate the generality of APE‑Harness by running traditional theorem‑proving benchmarks (miniF2F, miniCTX) within the same framework, showing that the contract‑based design can accommodate pure proof synthesis, proof‑engineering with semantic validation, and even workflow automation tasks such as benchmark construction and library annotation.
Key contributions are: (1) the formal APE formulation and the APE‑Bench dataset derived from real commit histories, (2) a validated LLM‑as‑Judge benchmark for semantic evaluation, (3) the contract‑based APE‑Harness infrastructure enabling unified, reproducible evaluation across diverse tasks and agents, (4) an open‑source APE‑Agent scaffold for research and extension, and (5) an efficient multi‑version execution strategy based on content deduplication. The released code and data (https://github.com/xinhjBrant/APE‑Bench) provide a foundation for future work on large‑scale, automated proof engineering, bridging the gap between isolated theorem proving research and the practical demands of maintaining massive formal mathematics libraries.