Don't Eliminate Cut: Exponential Separations in LLM-Based Theorem Proving
We develop a theoretical analysis of LLM-guided formal theorem proving in interactive proof assistants (e.g., Lean) by modeling tactic proposal as a stochastic policy in a finite-horizon deterministic MDP. To capture modern representation learning, we treat the state and action spaces as general compact metric spaces and assume Lipschitz policies. To explain the gap between worst-case hardness and empirical success, we introduce problem distributions generated by a reference policy $q$, including a latent-variable model in which proofs exhibit reusable cut/lemma/sketch structure represented by a proof DAG. Under a top-$k$ search protocol and Tsybakov-type margin conditions, we derive lower bounds on finite-horizon success probability that decompose into search and learning terms, with learning controlled by sequential Rademacher/covering complexity. Our main separation result shows that when cut elimination expands a DAG of depth $D$ into a cut-free tree of size $\Omega(\Lambda^D)$ while the cut-aware hierarchical process has size $O(\lambda^D)$ with $\lambda \ll \Lambda$, a flat (cut-free) learner provably requires exponentially more data than a cut-aware hierarchical learner. This provides a principled justification for subgoal decomposition in recent agentic theorem provers.
💡 Research Summary
The paper presents a rigorous theoretical framework for large‑language‑model (LLM)‑guided interactive theorem proving (ITP) by casting the process of tactic proposal as a stochastic policy within a finite‑horizon deterministic Markov decision process (MDP). The authors abstract both the state space (the current set of goals) and the action space (tactics together with their parameters) as compact metric spaces, and restrict policies to be Lipschitz continuous with respect to the state metric. This abstraction captures the continuous embeddings and representation learning that modern LLMs employ while keeping the analysis mathematically tractable.
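The abstraction above can be made concrete in a few lines. The sketch below is a toy instantiation, not the paper's formal model: states are goal sets, the transition `apply_tactic` and the policy `toy_policy` are hypothetical placeholders standing in for a real proof assistant and a learned LLM policy.

```python
import random

# Minimal sketch (illustrative, not the paper's formal construction):
# proof search as a finite-horizon deterministic MDP whose states are
# goal sets, driven by a stochastic tactic-proposal policy.

State = frozenset   # a proof state = the current set of open goals
Action = str        # a tactic (with its parameters) serialized as a string

def apply_tactic(state: State, action: Action) -> State:
    """Deterministic transition: a toy rule where a matching tactic closes a goal."""
    return frozenset(g for g in state if g != action)

def rollout(policy, x0: State, T: int) -> bool:
    """Run the policy for at most T steps; success = all goals closed."""
    state = x0
    for _ in range(T):
        if not state:            # empty goal set: proof complete
            return True
        state = apply_tactic(state, policy(state))  # stochastic proposal, deterministic step
    return not state

# Toy stochastic policy: propose one of the open goals uniformly at random.
toy_policy = lambda s: random.choice(sorted(s))
```

Note that only the proposal is stochastic; the environment step is deterministic, matching the MDP in the abstract.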
A central contribution is the introduction of explicit data‑generation models that reflect the non‑uniform distribution of real‑world theorem‑proving tasks. A reference stochastic policy q (e.g., a pretrained LLM or a random draw from a library such as Mathlib) is used to generate successful proof traces. Conditioning on success within a bounded length L yields a “cut‑free” distribution Q_tree over theorem instances and their associated proof trees.
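One simple way to realize this conditioning is rejection sampling: roll out the reference policy q and keep only traces that succeed within length L. The sketch below uses a toy goal-set transition as a stand-in for the proof assistant; the names `sample_trace` and `sample_Q_tree` are illustrative, not from the paper.

```python
import random

# Illustrative sketch (not the paper's construction): generate samples
# from the conditioned distribution Q_tree by rejection sampling on the
# reference policy q.

def sample_trace(q, x0, L):
    """Roll out reference policy q from x0 for at most L steps."""
    state, trace = x0, []
    for _ in range(L):
        if not state:
            return trace, True    # success: all goals closed within L
        action = q(state)
        trace.append(action)
        state = frozenset(g for g in state if g != action)  # toy transition
    return trace, not state

def sample_Q_tree(q, x0, L, max_tries=10_000):
    """Condition on success: resample until a trace of length <= L succeeds."""
    for _ in range(max_tries):
        trace, ok = sample_trace(q, x0, L)
        if ok:
            return trace
    raise RuntimeError("success probability too small for rejection sampling")
```

The rejection loop also makes the paper's motivation tangible: when q rarely succeeds, conditioning is expensive, which is exactly where the distributional assumptions do their work.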
To capture the reusable structure that appears in practice—lemmas, cuts, or sketch sub‑proofs—the authors define a latent‑variable model in which a hidden variable Z is a directed acyclic graph (DAG) representing shared sub‑proofs. The observation model p(y|x,Z) “unfolds” the DAG into a cut‑free trace, while p(Z|x) encodes a compact generative process parameterized by a depth D, an effective branching factor b_eff, and a contraction rate α that governs how local proof complexity shrinks under decomposition. This latent‑DAG formulation provides a parsimonious description of the inductive bias present in many mathematical corpora.
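The size gap driving the separation can be illustrated numerically. In the toy model below, each depth level contains `lam` distinct shared lemmas (maximal reuse) and every lemma body invokes `b` lemmas from the level below; these parameters are hypothetical stand-ins, not the paper's b_eff or α.

```python
# Toy model of cut-elimination blowup: a shared-lemma DAG grows linearly
# in depth, while its cut-free unfolding is a full b-ary tree.
# Parameters `lam` and `b` are illustrative, not the paper's b_eff / alpha.

def dag_nodes(D: int, lam: int) -> int:
    """Distinct statements in the shared-lemma DAG: lam per level plus the root."""
    return lam * D + 1

def tree_nodes(D: int, b: int) -> int:
    """Nodes of the cut-free proof tree after inlining every lemma call."""
    return (b ** (D + 1) - 1) // (b - 1)   # full b-ary tree of depth D

for D in (3, 6, 12):
    print(f"D={D:2d}  DAG: {dag_nodes(D, 4):>4,d}  cut-free tree: {tree_nodes(D, 4):>12,d}")
```

Even at modest depth the inlined tree dwarfs the DAG, which is the quantitative phenomenon behind the $\Omega(\Lambda^D)$ versus $O(\lambda^D)$ separation in the abstract.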
Performance is measured by the finite‑horizon success probability V_π(T, x₀) = P_π(the proof search from initial state x₀ reaches a complete proof within T steps), i.e., the probability that the policy π closes all goals within the horizon T.
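This quantity can be estimated by plain Monte Carlo rollouts. In the sketch below, the goal-set state, the toy transition, and the fixed tactic vocabulary are all illustrative assumptions, not the paper's setup.

```python
import random

# Monte Carlo estimate of the finite-horizon success probability
# V_pi(T, x0): the chance that policy pi closes all goals within T steps.
# The goal-set state, toy transition, and vocabulary are illustrative.

def estimate_V(policy, x0, T: int, n_rollouts: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_rollouts):
        state = set(x0)
        for _ in range(T):
            if not state:
                break
            a = policy(rng, state)
            if a in state:           # toy transition: a matching tactic closes its goal
                state.remove(a)
        wins += (not state)
    return wins / n_rollouts

# Toy policy: guess a tactic uniformly from a fixed vocabulary.
vocab = ["g1", "g2", "g3", "g4"]
pi = lambda rng, s: rng.choice(vocab)
print(estimate_V(pi, {"g1", "g2"}, 4))
```

A policy that always proposes an open goal achieves V = 1 once T exceeds the number of goals, while the uniform guesser's estimate sits strictly below 1, mirroring the search term in the paper's lower bounds.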