FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.


💡 Research Summary

FIPO (Future‑KL Influenced Policy Optimization) tackles a fundamental limitation of outcome‑based reward (ORM) methods such as GRPO and DAPO, which assign the same scalar advantage to every token in a generated trajectory. This uniform credit assignment prevents the model from distinguishing pivotal reasoning steps from filler tokens, leading to a plateau in chain‑of‑thought (CoT) length around 4,000 tokens and capping performance on demanding tasks.
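To make the limitation concrete, here is a minimal sketch of GRPO-style uniform credit assignment: a group-normalized scalar reward is broadcast identically to every token of its trajectory. The function name and the 1e-8 stabilizer are illustrative, not from the paper.

```python
import numpy as np

def grpo_uniform_advantages(rewards, lengths):
    """Sketch of ORM credit assignment: normalize outcome rewards within the
    group, then broadcast the same scalar to every token of each trajectory."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + 1e-8)  # group-relative advantage
    # Every token in trajectory i receives the identical value adv[i];
    # pivotal and filler tokens are indistinguishable to the update.
    return [np.full(length, a) for a, length in zip(adv, lengths)]

per_token = grpo_uniform_advantages([1.0, 0.0, 1.0, 0.0], [5, 3, 4, 6])
```

Every entry of `per_token[i]` is the same number, which is exactly the coarse-grained signal FIPO sets out to refine.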

The core contribution of FIPO is the introduction of a token‑level “Future‑KL” metric that re‑weights the advantage of each token based on the cumulative log‑probability shift of all subsequent tokens. Formally, FutureKLₜ = Σ_{k=t}^{T} Δlog pₖ, where Δlog pₖ = log π_θ(oₖ|…) – log π_{θ_old}(oₖ|…). Positive FutureKL indicates that the current token anchors a favorable future trajectory, while negative values signal that the downstream tokens are being collectively suppressed.
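The plain FutureKL defined above is a suffix sum of per-token log-probability shifts, which can be computed in one pass with a reversed cumulative sum. This is a sketch of the formula only, not the authors' implementation.

```python
import numpy as np

def future_kl(logp_new, logp_old):
    """FutureKL_t = sum_{k=t}^{T} (log pi_theta(o_k) - log pi_theta_old(o_k)),
    i.e. a suffix sum over the per-token log-prob shifts."""
    delta = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    # Reverse, cumulatively sum, and reverse back to get suffix sums.
    return np.cumsum(delta[::-1])[::-1]

fk = future_kl([0.2, 0.1, 0.3], [0.1, 0.1, 0.1])
```

A positive `fk[t]` means the tokens from position t onward are collectively gaining probability under the new policy; a negative value means they are being suppressed.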

Direct use of FutureKL leads to instability because large importance ratios can cause gradient explosions. FIPO mitigates this in two ways. First, a binary mask Mₖ zeroes out any future token whose importance ratio exceeds a Dual‑Clip threshold c, preventing harmful outliers from contaminating the sum. Second, a soft decay window applies an exponential discount γ^{k‑t} (γ = 2^{‑1/τ}) to the contribution of distant tokens, reflecting the intuition that the influence of a token diminishes with temporal distance. The final FutureKL becomes FutureKLₜ = Σ_{k=t}^{T} Mₖ·γ^{k‑t}·Δlog pₖ.
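The masked, discounted variant can be sketched with a backward recursion, since FutureKLₜ = Mₜ·Δlog pₜ + γ·FutureKLₜ₊₁. The values of τ and the Dual-Clip threshold c below are placeholders, not the paper's hyperparameters.

```python
import numpy as np

def masked_discounted_future_kl(logp_new, logp_old, tau=512.0, c=3.0):
    """FutureKL_t = sum_{k=t}^{T} M_k * gamma**(k-t) * delta_k (sketch).
    M_k zeroes any token whose importance ratio exceeds the Dual-Clip
    threshold c; gamma = 2**(-1/tau) halves influence every tau tokens."""
    delta = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    ratio = np.exp(delta)                  # per-token importance ratio
    mask = (ratio <= c).astype(float)      # drop harmful outliers from the sum
    gamma = 2.0 ** (-1.0 / tau)
    fkl = np.empty(len(delta))
    acc = 0.0
    # Backward pass: each step adds its (masked) delta and decays the future.
    for t in range(len(delta) - 1, -1, -1):
        acc = mask[t] * delta[t] + gamma * acc
        fkl[t] = acc
    return fkl

# With tau=1, gamma = 0.5, so each extra step of distance halves the weight.
fkl = masked_discounted_future_kl([0.1, 0.2], [0.0, 0.0], tau=1.0)
```

Verifying the recursion against the closed form: FutureKL₁ = 0.2 and FutureKL₀ = 0.1 + 0.5 × 0.2 = 0.2, matching Σₖ Mₖ·γ^{k−t}·Δlog pₖ term by term.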

The discounted, masked FutureKL is exponentiated and clipped to a range

