Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.


💡 Research Summary

The paper introduces TrajFusion, a fine‑tuning strategy that augments the widely used rejection‑sampling fine‑tuning (RFT) paradigm for mathematical reasoning with teacher‑generated error trajectories. Traditional RFT samples multiple chain‑of‑thought (CoT) attempts for each problem, keeps only those whose final answer is correct, and discards all incorrect attempts. While this binary filter yields clean supervision, it also throws away rich diagnostic information contained in the failed reasoning paths—common computational slips, missing assumptions, misapplied formulas, and plausible yet invalid argument chains.

TrajFusion reframes this binary filtering as a supervision‑construction process. For each problem x, the method first samples K candidate CoT trajectories from a strong teacher model and partitions them into a correct set Y⁺(x) and an incorrect set Y⁻(x) using an automatic verifier. Two problem‑level statistics are then computed:

  1. Error rate r(x) = |Y⁻(x)| / K, measuring how often the teacher fails on this problem.
  2. Error diversity u(x) = |{Ans(y) : y ∈ Y⁻(x)}|, i.e., the number of distinct final answers among the wrong trajectories (quantified via Shannon entropy in the analysis).
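These two statistics can be sketched in a few lines of code. The function and argument names below are illustrative (the paper does not prescribe an API), and error diversity is computed here as a distinct-answer count rather than the entropy variant used in the analysis:

```python
def error_stats(trajectories, answers, gold_answer):
    """Partition K sampled CoT trajectories into correct/incorrect sets
    and compute the error rate r(x) and error diversity u(x).

    `trajectories` and `answers` are parallel lists: the i-th trajectory
    and its extracted final answer. (Illustrative layout, not the paper's API.)
    """
    K = len(trajectories)
    correct = [t for t, a in zip(trajectories, answers) if a == gold_answer]
    incorrect = [(t, a) for t, a in zip(trajectories, answers) if a != gold_answer]

    r = len(incorrect) / K                 # error rate r(x) = |Y⁻(x)| / K
    u = len({a for _, a in incorrect})     # error diversity u(x): distinct wrong answers
    return correct, incorrect, r, u
```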

These statistics drive an adaptive selection rule for the number of incorrect trajectories to be fused:

k(x) = min( k_max , ⌊ α · r(x) · u(x) ⌋ )

where k_max caps the number of fused error paths and α controls sensitivity. When r(x)=0 (the teacher never fails) or u(x) carries no information (all errors collapse to a single answer, giving zero entropy), k(x)=0 and TrajFusion reduces to vanilla RFT. Conversely, when a problem elicits many diverse failures, k(x) grows, allowing richer supervision.
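A minimal sketch of the selection rule; the default values for α and k_max are illustrative placeholders, since the paper treats both as hyperparameters:

```python
import math

def num_error_paths(r, u, alpha=1.0, k_max=3):
    """Adaptive selection rule k(x) = min(k_max, floor(alpha * r(x) * u(x))).

    Returns 0 (i.e., vanilla RFT) when the teacher never fails (r == 0)
    or the product alpha * r * u falls below 1.
    """
    return min(k_max, math.floor(alpha * r * u))
```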

From Y⁻(x), TrajFusion selects representative error trajectories: it groups them by final answer, orders the groups by frequency, and within each group picks the shortest trajectory as a concise exemplar. Each selected error path in {yₑ₁,…,yₑ_k} is then followed by a reflection prompt ρ_i (e.g., “Why is this answer wrong?”), and the correct trajectory y* is appended last. The fused training sample is

T(x) = ( x , yₑ₁ , ρ₁ , yₑ₂ , ρ₂ , … , yₑ_k , ρ_k , y* ),

i.e., the problem, the selected error exemplars each paired with its reflection prompt, and the correct trajectory, concatenated into a single training sequence.
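The exemplar selection and fusion steps above can be sketched as follows. The reflection-prompt wording and the plain-text joining format are assumptions; the paper specifies only the interleaving structure:

```python
from collections import Counter

def fuse_trajectory(problem, incorrect, correct_traj, k,
                    reflection="Why is this answer wrong?"):
    """Build a fused training sample T(x): k representative error paths,
    each followed by a reflection prompt, then the correct trajectory.

    `incorrect` is a list of (trajectory_text, final_answer) pairs.
    With k == 0 this degenerates to a vanilla RFT sample.
    """
    if k == 0:
        return problem + "\n" + correct_traj

    # Group wrong trajectories by final answer, most frequent answers first.
    freq = Counter(a for _, a in incorrect)
    groups = sorted(freq, key=lambda a: -freq[a])

    parts = [problem]
    for ans in groups[:k]:
        # The shortest trajectory in each group serves as a concise exemplar.
        exemplar = min((t for t, a in incorrect if a == ans), key=len)
        parts += [exemplar, reflection]
    parts.append(correct_traj)
    return "\n".join(parts)
```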
