Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis

Notice: This research summary and analysis were generated automatically using AI. For complete accuracy, please refer to the original arXiv source.

Advancing complex reasoning in large language models relies on high-quality, verifiable datasets, yet human annotation remains cost-prohibitive and difficult to scale. Current synthesis paradigms often face a recurring trade-off: maintaining structural validity typically restricts problem complexity, while relaxing constraints to increase difficulty frequently leads to inconsistent or unsolvable instances. To address this, we propose Agentic Proposing, a framework that models problem synthesis as a goal-driven sequential decision process where a specialized agent dynamically selects and composes modular reasoning skills. Through an iterative workflow of internal reflection and tool-use, we develop the Agentic-Proposer-4B using Multi-Granularity Policy Optimization (MGPO) to generate high-precision, verifiable training trajectories across mathematics, coding, and science. Empirical results demonstrate that downstream solvers trained on agent-synthesized data significantly outperform leading baselines and exhibit robust cross-domain generalization. Notably, a 30B solver trained on only 11,000 synthesized trajectories achieves a state-of-the-art 91.6% accuracy on AIME25, rivaling frontier-scale proprietary models such as GPT-5 and proving that a small volume of high-quality synthetic signals can effectively substitute for massive human-curated datasets.


💡 Research Summary

The paper introduces “Agentic Proposing,” a novel framework for generating high‑quality, verifiable reasoning data by treating problem synthesis as a goal‑driven sequential decision process. Instead of relying on static templates or single‑pass generation, the authors model synthesis as a Partially Observable Markov Decision Process (POMDP) where the latent state captures logical consistency and difficulty, the action space comprises cognitive language actions, tool invocations, and a final submission, and observations include an active subset of modular reasoning skills, dialogue history, and a stage indicator.
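To make the POMDP formulation concrete, the components above can be sketched as plain data structures. This is a minimal illustrative sketch, not the paper's actual implementation: every class and field name here is an assumption chosen to mirror the description (latent state = consistency and difficulty; actions = cognitive language actions, tool invocations, final submission; observations = active skills, dialogue history, stage indicator).

```python
from dataclasses import dataclass
from typing import List, Union

# Hedged sketch of the POMDP components described above.
# All names are illustrative assumptions, not the paper's API.

@dataclass
class LatentState:
    logically_consistent: bool  # hidden: is the draft problem logically sound?
    difficulty: float           # hidden: current estimated difficulty

@dataclass
class ThinkAction:              # cognitive language action (internal reflection)
    thought: str

@dataclass
class ToolAction:               # tool invocation, e.g. editing/pruning a skill
    tool_name: str
    argument: str

@dataclass
class SubmitAction:             # final submission of the synthesized problem
    problem_text: str

Action = Union[ThinkAction, ToolAction, SubmitAction]

@dataclass
class Observation:
    active_skills: List[str]     # active subset of modular reasoning skills
    dialogue_history: List[str]  # prior actions and messages
    stage: str                   # stage indicator, e.g. "drafting", "checking"

# Example observation at the start of an episode.
obs = Observation(active_skills=["modular_arithmetic"],
                  dialogue_history=[],
                  stage="drafting")
```

The point of the sketch is only to show how the three observation components and the three action types partition the agent's interface; the true state remains hidden, which is what makes the process partially observable.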

A central contribution is the definition of “Composable Agent Skills.” Each skill is a four‑tuple ⟨intent, method, difficulty effect, tool hint⟩. A skill library K_self is built automatically from large corpora using a teacher policy that scores candidate skills and filters them via a quality threshold. During generation, the agent dynamically selects, combines, and, if necessary, prunes skills through an internal reflection action (τthink) and a tool call (τedit) that removes misaligned skills. This self‑corrective loop ensures that logical errors are caught early, producing stable, verifiable problems.
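The four-tuple skill definition, the threshold-based library construction, and the pruning tool call can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, the 0.8 threshold, and the predicate-based alignment check are all invented for clarity; the paper specifies only that a teacher policy scores candidates against a quality threshold and that a tool call removes misaligned skills.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative sketch of skill-library construction and the self-corrective
# pruning step; names and the 0.8 threshold are assumptions.

@dataclass(frozen=True)
class Skill:
    intent: str               # what the skill is meant to achieve
    method: str               # how the skill is applied
    difficulty_effect: float  # how applying it shifts problem difficulty
    tool_hint: str            # which tool, if any, supports the skill

def build_skill_library(candidates: List[Skill],
                        teacher_score: Callable[[Skill], float],
                        threshold: float = 0.8) -> List[Skill]:
    """Keep only candidate skills the teacher policy scores above threshold."""
    return [s for s in candidates if teacher_score(s) >= threshold]

def prune_misaligned(active: List[Skill],
                     aligned: Callable[[Skill], bool]) -> List[Skill]:
    """Mimic the tool call that removes skills misaligned with the draft."""
    return [s for s in active if aligned(s)]
```

In the framework's loop, the reflection action would decide *whether* pruning is needed, and the tool call would carry it out, so logical errors introduced by a bad skill combination are caught before the final submission.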

Training proceeds in three stages. First, Skill Acquisition extracts and formalizes atomic skills, establishing the prior knowledge base. Second, Agentic Supervised Fine‑tuning (SFT) uses expert trajectories—generated by a teacher policy and filtered by a high‑precision verifier—to teach the agent to imitate complex behaviors such as internal reflection, tool use, and skill pruning. Only trajectories whose final problems pass verification are retained, yielding a clean SFT dataset. Third, Multi‑Granularity Policy Optimization (MGPO) refines the policy via reinforcement learning with a multi‑level reward structure: a binary validity reward for logical soundness and a difficulty‑alignment reward that penalizes deviation from a target difficulty distribution. Rewards are further decomposed across the drafting, checking, refining, and finalizing phases, encouraging the agent to orchestrate skills effectively at each step.
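The multi-level reward in MGPO can be illustrated with a short sketch. The exact functional forms are assumptions: the paper states only that the validity reward is binary, that the difficulty-alignment term penalizes deviation from a target distribution (an absolute-gap penalty is assumed here), and that rewards are decomposed across the four phases (equal weighting is assumed).

```python
# Hedged sketch of the MGPO reward structure; penalty form and
# phase weighting are assumptions, not the paper's specification.

PHASES = ("drafting", "checking", "refining", "finalizing")

def validity_reward(problem_is_verified: bool) -> float:
    """Binary validity reward for logical soundness."""
    return 1.0 if problem_is_verified else 0.0

def difficulty_alignment_reward(difficulty: float, target: float) -> float:
    """Penalize deviation from the target difficulty (absolute gap assumed)."""
    return -abs(difficulty - target)

def trajectory_reward(problem_is_verified: bool,
                      difficulty: float,
                      target_difficulty: float,
                      phase_scores: dict) -> float:
    """Combine validity, difficulty alignment, and per-phase shaping terms."""
    r = validity_reward(problem_is_verified)
    r += difficulty_alignment_reward(difficulty, target_difficulty)
    r += sum(phase_scores.get(p, 0.0) for p in PHASES)
    return r
```

Under this sketch, a trajectory that yields a verified problem exactly at the target difficulty, with no phase shaping, scores 1.0; per-phase terms then reward orchestrating skills well at each of the drafting, checking, refining, and finalizing steps.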

Empirical evaluation demonstrates that data generated by the Agentic‑Proposer‑4B dramatically improves downstream solvers. A 4‑billion‑parameter solver trained on just 10 k synthetic trajectories outperforms baselines trained on much larger human‑curated datasets across mathematics, coding, and science benchmarks. Scaling the downstream model to 30 billion parameters and training on only 11 k trajectories yields a state‑of‑the‑art 91.6 % accuracy on the AIME 2025 exam, surpassing open‑source models up to 20× larger (e.g., DeepSeek‑v3.1, Mistral‑3) and rivaling proprietary frontier models such as GPT‑5 and Gemini‑3. Cross‑domain tests also show consistent gains (roughly 7 percentage points on average) and no sign of data saturation as the synthetic set is enlarged.

The work substantiates the hypothesis that high‑precision synthetic data can replace massive human‑annotated corpora for advancing LLM reasoning. By integrating modular skill composition, POMDP‑based decision making, and multi‑granularity reinforcement learning, the authors provide a scalable, cost‑effective pipeline for producing challenging, verifiable problems. Future directions include automatic expansion of the skill library, multi‑domain skill recombination, human‑in‑the‑loop quality assurance, and deployment of the generated data in real‑world educational or assessment settings.

