Escaping the Cognitive Well: Efficient Competition Math with Off-the-Shelf Models
In the past year, custom and unreleased math reasoning models reached gold medal performance on the International Mathematical Olympiad (IMO). Similar performance was then reported using large-scale inference on publicly available models but at prohibitive costs (e.g., 3000 USD per problem). In this work, we present an inference pipeline that attains best-in-class performance on IMO-style math problems at an average inference cost orders of magnitude below competing methods while using only general-purpose off-the-shelf models. Our method relies on insights about grader failure in solver-grader pipelines, which we call the Cognitive Well (iterative refinement converging to a wrong solution that the solver as well as the pipeline’s internal grader consider to be basically correct). Our pipeline addresses these failure modes through conjecture extraction, wherein candidate lemmas are isolated from generated solutions and independently verified alongside their negations in a fresh environment (context detachment). On IMO-ProofBench Advanced (PB-Adv), our pipeline achieves 67.1 percent performance using Gemini 3.0 Pro with an average cost per question of approximately 31 USD. At the time of evaluation, this represented the state-of-the-art on PB-Adv among both public and unreleased models, and more than doubles the success rate of the next best publicly accessible pipeline, all at a fraction of the cost.
💡 Research Summary
The paper tackles the problem of solving International Mathematical Olympiad (IMO)‑style proof questions using only publicly available large language models (LLMs), without the need for custom‑trained or proprietary models that have dominated recent leaderboards. The authors identify two systemic failure modes that cripple existing solver‑grader pipelines: (1) Cognitive Plateau, where the solver makes incremental progress but the grader’s score does not reflect meaningful improvement, causing the loop to stagnate; and (2) Cognitive Well, a more insidious situation in which both solver and grader converge on a flawed solution that appears logically consistent enough to be judged correct, leading to high grader scores for completely wrong proofs. These phenomena arise because the grader often shares the same parameters and context as the solver, making it vulnerable to “reward hacking” and to being “contaminated” by the erroneous proof it is evaluating.
To overcome these issues, the authors propose a novel inference pipeline—dubbed the Momus pipeline—that relies on three design principles:
- Narrow Width (Limited Parallelism) – Instead of brute-force parallelism with dozens of branches, the pipeline runs only a modest number of concurrent solvers (typically K ≈ 4). This dramatically reduces API calls and cost while keeping the system tractable.
- Conjecture Extraction – When a proof stalls (grader score ≤ 6/7), the system explicitly extracts the logical gaps as self-contained conjectures C_i and also generates their negations ¬C_i. By turning vague uncertainty into concrete statements, the pipeline creates new, well-defined sub-problems that can be independently verified.
- Contextual Detachment – Each conjecture and its negation are fed to fresh solver-grader instances without the surrounding (potentially misleading) proof context. If a conjecture or its negation can be proved, the result is stored in a global memory M_lemma as a verified lemma (positive or negative). This detachment prevents the grader from being "contaminated" by the original flawed proof and lets the next solving round reason from a clean slate.
The pipeline operates in three phases:
- Phase I (Limited Exploration) generates K candidate proofs using a dialectic prompting strategy that pits multiple personas (e.g., a strict "Momus" grader and a more lenient counterpart) against each other. Each proof is graded independently; a perfect 7/7 score across three runs leads to immediate acceptance.
- Phase II (Contextual Detachment) activates when no perfect proof is found. The Conjecture Extractor isolates key lemmas from the top-scoring attempts, strips them from their original context, and launches two parallel sub-pipelines to prove both C_i and ¬C_i. Proven lemmas are added to M_lemma.
- Phase III (Refinement with Global Memory) restarts the solver pool, now equipped with the verified lemmas from Phase II. All but one solver receive this enriched context, preserving diversity while steering the search away from previously identified dead-ends.
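The three phases amount to a control loop. The sketch below captures that loop under stated simplifications (single grading pass instead of the paper's three, and `solve`, `grade`, `extract`, `verify` as stand-ins for LLM calls); the function names and signatures are assumptions, not the authors' code.

```python
def run_pipeline(solve, grade, extract, verify, k=4, max_rounds=2):
    """Sketch of the three-phase loop.
    solve(lemmas) -> candidate proof, grade(proof) -> score in 0..7,
    extract(proof) -> conjecture strings, verify(conj) -> True/False/None."""
    lemmas: dict[str, bool] = {}                 # global memory M_lemma
    for _ in range(max_rounds):
        # Phase I / III: narrow-width exploration; all but one
        # solver see the verified lemmas, one keeps a clean context.
        contexts = [lemmas] * (k - 1) + [{}]
        proofs = [solve(ctx) for ctx in contexts]
        scored = sorted(((grade(p), p) for p in proofs), reverse=True)
        best_score, best_proof = scored[0]
        if best_score == 7:                      # perfect grade -> accept
            return best_proof
        # Phase II: detach conjectures from the best attempt, verify each,
        # and record decided ones (positively or negatively) in memory.
        for conj in extract(best_proof):
            result = verify(conj)
            if result is not None:
                lemmas[conj] = result
    return None                                  # budget exhausted
```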
If the budget for conjecture iterations is exhausted without a full solution, a post‑enhancement step runs two more Phase II cycles, incorporates any newly proved lemmas, and performs a final Phase III pass. For especially hard questions the authors optionally run two full pipelines in parallel and employ a “Judge” persona to pick the better final proof.
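The optional parallel-run variant with a final Judge reduces to a small selection step. This is a hedged sketch: `judge` stands in for an LLM "Judge" persona call that returns the index of the preferred proof, and the function name is illustrative.

```python
def combined_run(run_pipeline_once, judge, n=2):
    """Run n full pipelines independently and let a 'Judge' persona pick
    the better final proof. `run_pipeline_once` returns a proof or None;
    `judge(candidates)` (a stand-in LLM call) returns the winning index."""
    candidates = [run_pipeline_once() for _ in range(n)]
    finished = [c for c in candidates if c is not None]
    if not finished:
        return None                  # no pipeline produced a full solution
    if len(finished) == 1:
        return finished[0]           # nothing to compare
    return finished[judge(finished)]
```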
A key technical contribution is the Dialectic Prompting framework. Instead of separate “solver” and “grader” prompts, a single call contains multiple named personas that interact according to explicit rules. This yields more controllable behavior, easier prompt engineering, and the ability to add new roles (e.g., a dedicated conjecture extractor) without redesigning the entire pipeline.
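A single multi-persona call might be assembled as follows. The template text and persona format here are assumptions for illustration only; the paper's actual prompts are not reproduced in this summary.

```python
def dialectic_prompt(problem: str, personas: list[tuple[str, str]]) -> str:
    """Assemble one dialectic prompt containing every persona.
    `personas` is a list of (name, role_instructions) pairs, e.g. a strict
    'Momus' grader alongside a more lenient advocate (illustrative format)."""
    header = (
        "You will simulate a dialogue between the personas below. "
        "Each persona speaks in turn, prefixed by its name, until the "
        "strict grader either accepts the proof or lists concrete errors.\n"
    )
    blocks = [f"### Persona: {name}\n{role}" for name, role in personas]
    return "\n\n".join([header, *blocks, f"### Problem\n{problem}"])
```

Because roles live in one prompt, adding a new persona (such as a dedicated conjecture extractor) means appending one block rather than redesigning the pipeline.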
The authors evaluate the system on the IMO‑ProofBench Advanced (PB‑Adv) benchmark, which consists of 30 curated IMO‑style problems. Using off‑the‑shelf models such as Gemini 3.0 Pro, Gemini 2.5 Pro, and Gemini 3.0 Flash, the pipeline achieves:
- Single‑run performance ranging from 52.6 % to 57.9 % accuracy.
- Combined‑run performance (two parallel pipelines with a final judge) reaching 64.8 % accuracy.
- Full pipeline (including conjecture verification) attaining 67.1 % accuracy with an average cost of ≈ $31 per question (based on January 2026 pricing).
These results place the method above all publicly accessible pipelines (e.g., DeepSeekMath V2 at 61.9 % with $3000/question) and within striking distance of unreleased, custom‑trained systems that do not disclose their compute budgets. A cost‑performance Pareto analysis shows the proposed approach dominates the frontier, delivering an order‑of‑magnitude reduction in cost while improving success rates.
The paper also revisits grader design. Rather than using a scalar score as a progress signal, the grader is instructed to output a list of concrete logical errors (“guilty until proven innocent” style). This makes the feedback more actionable for the solver and reduces the risk of the grader being fooled by superficially coherent but incorrect proofs. The authors validate this design with GraderBench, demonstrating higher error‑detection fidelity compared to prior grader implementations.
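The error-list design can be turned into a progress signal with a simple rule: a proof earns full marks only when the grader surfaces no concrete error. The penalty weighting below is an illustrative assumption, not the paper's rubric.

```python
def score_from_errors(errors: list[str], max_score: int = 7) -> int:
    """'Guilty until proven innocent' scoring sketch: full credit requires
    an empty error list; each concrete logical error the grader reports
    removes credit (the 2-point weight is an assumption), floored at 0."""
    if not errors:
        return max_score
    return max(0, max_score - 2 * len(errors))
```

Unlike a bare scalar score, the error list itself is what the solver receives, so every deduction comes with an actionable description of the flaw.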
Limitations are acknowledged: the approach currently focuses on natural‑language proofs, whereas formal proof assistants (Lean, Coq) remain a separate line of work. Conjecture extraction quality still depends heavily on model size; smaller models may struggle to generate precise lemmas. Managing the global lemma memory and deciding which lemmas to retain could be further optimized.
In summary, the paper delivers a practical, cost‑effective pipeline that leverages only off‑the‑shelf LLMs to achieve state‑of‑the‑art performance on a challenging IMO‑style benchmark. By diagnosing and systematically addressing the Cognitive Well and Cognitive Plateau failure modes through conjecture extraction, context detachment, and dialectic prompting, the authors demonstrate that high‑level mathematical reasoning is attainable without prohibitive compute budgets. This work lowers the barrier for researchers, educators, and enthusiasts to experiment with AI‑driven competition math, and opens avenues for future integration with formal verification tools and broader mathematical domains.