Construction-Verification: A Benchmark for Applied Mathematics in Lean 4
Recent advances in large language models have demonstrated impressive capabilities in mathematical formalization. However, existing benchmarks focus on logical verification of declarative propositions, often neglecting the task of explicitly synthesizing solutions. This limitation is particularly acute in applied mathematics, where the goal is frequently to derive concrete values or executable algorithms rather than solely to prove theorems. To address this, we introduce a Lean 4 framework that enforces a construction-verification workflow, compelling the agent to define explicit solutions before proving their correctness. We curate a comprehensive benchmark, AMBER (Applied Mathematics BEnchmark for Reasoning), spanning core domains of applied mathematics, including convex analysis, optimization, numerical algebra, and high-dimensional probability. Beyond theorem proving, our benchmark features complex tasks such as evaluation, algorithm design, and representation transformation. Experiments reveal that current models face significant difficulties with these constructive tasks. Notably, we observe that general-purpose reasoning models consistently outperform specialized theorem provers. We attribute this to a degradation of instruction-following capabilities in specialized models: fine-tuning on proof corpora appears to induce “tactical overfitting”, compromising the ability to adhere to complex constructive requirements, whereas general models retain the versatility needed for multi-task formal reasoning.
💡 Research Summary
This paper addresses a critical gap in the evaluation of large language models (LLMs) for mathematical formalization. Existing benchmarks such as MiniF2F and ProofNet focus on proving the existence of a solution, which is sufficient for pure mathematics but inadequate for applied mathematics where concrete numerical values, executable algorithms, and problem transformations are required. To bridge this gap, the authors introduce a “Construction‑Verification” workflow implemented in Lean 4. The workflow forces an agent to first construct an explicit solution (via a def that takes only the raw parameters) and then verify that this solution satisfies the required property (via a subsequent theorem). This two‑stage pattern eliminates non‑constructive shortcuts, ensuring that the model cannot simply prove existence without actually producing the solution.
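The two-stage pattern can be illustrated with a minimal Lean 4 sketch. The names (`solveLinear`, `solveLinear_correct`) and the toy task of solving a scalar linear equation are purely illustrative and are not taken from the paper; the point is the shape of the workflow, where the `def` receives only the raw parameters and the subsequent `theorem` certifies it:

```lean
import Mathlib

-- Stage 1 (construction): an explicit solution built from the raw
-- parameters alone, with no access to an existence proof.
def solveLinear (a b : ℚ) : ℚ := b / a

-- Stage 2 (verification): a separate theorem certifying that the
-- constructed value actually satisfies the specification a * x = b.
theorem solveLinear_correct (a b : ℚ) (ha : a ≠ 0) :
    a * solveLinear a b = b := by
  unfold solveLinear
  field_simp
```

Because the `def` must elaborate on its own, a model cannot discharge the task by proving `∃ x, a * x = b` non-constructively; the witness has to be written down explicitly before the proof stage begins.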
Based on this paradigm, the authors curate the AMBER benchmark (Applied Mathematics BEnchmark for Reasoning). AMBER spans four core applied‑math domains—convex analysis, optimization, numerical algebra, and high‑dimensional probability—and includes three problem families: (1) evaluation problems (e.g., solving a linear system or deriving the closed‑form optimum of a quadratic program), (2) algorithm design problems (e.g., writing the update rule of gradient descent as a recursive Lean function), and (3) representation‑transformation problems (e.g., mapping a non‑standard optimization problem to a canonical SDP or ILP form). Each task is presented with a construction stage (defining the solution or algorithm) followed by a verification stage (proving optimality, convergence, or equivalence). Difficulty ranges from undergraduate exercises to PhD‑level research questions, providing a comprehensive testbed for multi‑step mathematical reasoning.
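For the algorithm-design family, the gradient-descent example mentioned above might look like the following hedged sketch (the name `gradDescent` and the one-dimensional setting are assumptions for illustration, not the benchmark's actual task statement):

```lean
import Mathlib

-- Illustrative construction stage: the gradient-descent iterate
-- xₙ₊₁ = xₙ − η · ∇f(xₙ) written as a recursive Lean function over
-- the step count. `grad` is the gradient oracle, `η` the step size,
-- `x₀` the initial point.
def gradDescent (grad : ℝ → ℝ) (η x₀ : ℝ) : ℕ → ℝ
  | 0     => x₀
  | n + 1 => gradDescent grad η x₀ n - η * grad (gradDescent grad η x₀ n)
```

A verification stage would then state a convergence or descent property of `gradDescent` as a theorem; crucially, the recursion itself must be supplied by the model, not merely asserted to exist.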
The experimental evaluation compares several state‑of‑the‑art LLMs: general‑purpose models such as DeepSeek‑V3.2‑Thinking, Kimi K2, and Gemini‑3 Pro, and specialized theorem‑proving models fine‑tuned on proof corpora (Seed‑Prover, DeepSeek‑Prover‑V2, etc.). Results show that general‑purpose models consistently outperform the specialized provers on AMBER’s constructive tasks. The specialized models suffer from “tactical overfitting”: fine‑tuning on pure proof data improves local proof‑search tactics but degrades instruction‑following and construction abilities, leading them to exploit existential proofs rather than synthesize explicit solutions. In contrast, the general models retain broader reasoning and code‑generation capabilities, enabling them to handle the construction phase effectively.
The authors conclude that applied‑mathematics automation demands benchmarks that evaluate both logical correctness and constructive synthesis. Over‑specialization in proof‑search can be detrimental when the target tasks require algorithmic design, numerical computation, or problem reformulation. Future work should explore training regimes that balance tactical proof proficiency with flexible instruction following, possibly through multi‑task fine‑tuning that includes constructive objectives. This study thus provides a new benchmark, a clear methodological framework, and valuable insights for the next generation of neuro‑symbolic systems aimed at real‑world mathematical problem solving.