Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems


We present a case study in semi-autonomous mathematics discovery, using Gemini to systematically evaluate 700 conjectures labeled ‘Open’ in Bloom’s Erdős Problems database. We employ a hybrid methodology: AI-driven natural language verification to narrow the search space, followed by human expert evaluation to gauge correctness and novelty. We address 13 problems that were marked ‘Open’ in the database: 5 through seemingly novel autonomous solutions, and 8 through identification of previous solutions in the existing literature. Our findings suggest that the ‘Open’ status of these problems stemmed from obscurity rather than difficulty. We also identify and discuss issues arising in applying AI to mathematical conjectures at scale, highlighting the difficulty of literature identification and the risk of “subconscious plagiarism” by AI. We reflect on the takeaways from AI-assisted efforts on the Erdős Problems.


💡 Research Summary

This paper presents a case study in semi‑autonomous mathematics discovery using Gemini Deep Think, specifically a custom research agent named Aletheia, to tackle the “Open” problems listed in Bloom’s Erdős Problems database. At the time of the study (December 2025), the database contained 700 unsolved conjectures. The authors deployed Aletheia to generate candidate solutions for all 700 prompts, then applied an AI‑driven natural‑language verifier to filter the output. The verifier flagged 212 responses as potentially correct, reducing the human workload dramatically.

A small team of mathematicians performed a two‑stage human review. First, a broad‑skill group quickly screened the 212 candidates, discarding obviously irrelevant or malformed answers. Next, domain experts examined the surviving candidates (200 in total) in depth, checking mathematical correctness, alignment with the original problem statement, and novelty relative to the existing literature. The experts identified 63 technically correct solutions, but only 13 were judged to meaningfully solve the intended problem. The remaining 50 were either trivial reinterpretations of the problem, solutions to a mis‑phrased version, or partially correct answers that did not address the full question.
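The funnel described above can be sketched as a two-stage filter. This is a minimal illustration, not the paper's actual code: the class and function names, the malformed-output heuristic, and the stand-in data are all invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    problem_id: str       # e.g. "erdos-652"
    solution_text: str    # the agent's natural-language proof attempt
    verifier_flag: bool   # AI verifier's judgment ("potentially correct"), not ground truth

def ai_filter(candidates):
    """Stage 1: keep only candidates the natural-language verifier
    flagged as potentially correct (700 -> 212 in the study)."""
    return [c for c in candidates if c.verifier_flag]

def broad_screen(candidates, looks_malformed):
    """Stage 2: quick human screen that discards obviously irrelevant
    or malformed answers before deep expert review."""
    return [c for c in candidates if not looks_malformed(c)]

# Toy run: four candidates, two survive the verifier, one survives
# the quick human screen.
batch = [
    Candidate("erdos-A", "complete proof ...", True),
    Candidate("erdos-B", "[truncated output]", True),
    Candidate("erdos-C", "restates the problem", False),
    Candidate("erdos-D", "off-topic answer", False),
]
shortlist = broad_screen(
    ai_filter(batch),
    looks_malformed=lambda c: "[truncated" in c.solution_text,
)
print([c.problem_id for c in shortlist])  # -> ['erdos-A']
```

Only the survivors of both filters reach the domain experts, which is where the expensive correctness and novelty checks happen.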

The 13 meaningful results fall into four categories:

  1. Autonomous Resolution – Aletheia produced the first known correct proof for a problem. This includes Erdős‑652 (which reduces to known results) and Erdős‑1051, where the model applied a tail‑of‑the‑series argument and Mahler’s criterion in a way not previously published.

  2. Partial AI Solution – For multi‑part problems, Aletheia solved at least one sub‑question. Erdős‑654 and Erdős‑1040 belong here.

  3. Independent Rediscovery – The model generated a correct proof that later turned out to already exist in the literature, but the reasoning trace shows no direct copying. This category contains Erdős‑397, ‑659, ‑935, and ‑1089. The authors highlight the risk of “subconscious plagiarism,” where the model reproduces knowledge absorbed during training without attribution.

  4. Literature Identification – Aletheia recognized that a problem was already solved, despite being marked “Open” in the database. Erdős‑333, ‑591, ‑705, ‑992, and ‑1105 fall into this group, exposing gaps in the curation of the Erdős Problems repository.

Statistical results are summarized in Tables 1 and 2: of the 200 AI outputs evaluated, 68.5 % (137) were fundamentally flawed and 31.5 % (63) were technically correct; only 6.5 % (13), a subset of the technically correct group, constituted genuine solutions. This quantitative breakdown challenges the simplistic claim that “AI accelerates science” by showing that the majority of AI‑generated material requires extensive human vetting.
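The quoted percentages follow directly from the counts in the text. A quick arithmetic check (counts inferred from the summary, with the flawed count derived as the remainder):

```python
total = 200               # AI outputs evaluated by experts
technically_correct = 63  # correct as mathematics
genuine = 13              # meaningfully solve the intended problem (subset of the 63)
flawed = total - technically_correct  # 137 fundamentally flawed outputs

# These reproduce the 68.5 % / 31.5 % / 6.5 % figures exactly.
assert round(100 * flawed / total, 1) == 68.5
assert round(100 * technically_correct / total, 1) == 31.5
assert round(100 * genuine / total, 1) == 6.5

print(flawed, technically_correct - genuine)  # -> 137 50
```

The 50 printed at the end is the "technically correct but not meaningful" group, matching the expert-review breakdown earlier in the summary.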

The paper also discusses several systemic challenges uncovered during the study. First, problem statement ambiguity: many entries in the database contain transcription errors, missing definitions, or outdated notation, leading the model to solve a slightly different problem than Erdős intended. The authors illustrate this with a detailed appendix on Erdős‑75, where Aletheia’s solution was mathematically sound but addressed a mis‑quoted version of the original conjecture.

Second, literature search at scale: identifying prior work for each candidate solution proved to be the most time‑consuming step. The authors argue that AI can aid this process, but current tools lack reliable citation‑matching and provenance tracking.
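As a toy illustration of why naive citation matching is fragile, purely lexical similarity over titles (here via Python's standard `difflib`) scores a reworded statement of the same result and a similarly worded but unrelated result almost identically. The titles below are invented for the example:

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] from SequenceMatcher; purely lexical, so it
    cannot distinguish a rewording from an unrelated result."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

query = "On the sum of reciprocals of a sparse sequence"
library = [
    "On sums of reciprocals of sparse sequences",   # same result, reworded
    "On the sum of divisors of a sparse sequence",  # different result, similar words
]
scores = [round(title_similarity(query, t), 2) for t in library]
print(scores)  # both scores come out high, so the match is ambiguous
```

Reliable provenance tracking needs semantic matching of the actual statements and proofs, not just surface text, which is exactly the gap the authors point to.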

Third, formal verification limitations: while integrating proof assistants such as Lean could provide absolute certainty, the scarcity of formally verified mathematics limits the applicability of this approach for most Erdős problems.
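A toy Lean 4 statement shows what "absolute certainty" means here: the kernel checks every step, but every definition and lemma used (below, only `Nat.add_comm` from the core library) must already be formalized. Real Erdős problems would require formalizing far more background first, which is the scarcity the authors describe.

```lean
-- A trivially formalized statement: addition on Nat is commutative.
-- The proof term is checked by Lean's kernel, leaving no room for
-- the informal gaps that plague natural-language proofs.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```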

Finally, the authors reflect on authorship and accountability. Even though large portions of the manuscript are derived from AI‑generated text, the human authors retain responsibility for mathematical validity, proper attribution, and ethical considerations. They advocate for a model where AI assists but does not replace human judgment, especially in fields where correctness is non‑negotiable.

In conclusion, the study demonstrates that semi‑autonomous AI systems can indeed discover new mathematics or at least uncover hidden literature, but the overall success rate is modest. Future work should focus on improving natural‑language understanding of mathematical statements, building robust citation‑retrieval pipelines, and integrating formal verification to reduce the burden on human experts. By addressing these challenges, AI could become a more reliable partner in the ongoing quest to resolve the remaining open problems in Erdős’s prolific legacy.

