DyJR: Preserving Diversity in Reinforcement Learning with Verifiable Rewards via Dynamic Jensen-Shannon Replay

Long Li 1*, Zhijian Zhou 2,3*, Tianyi Wang 4, Weidi Xu 5, Zuming Huang 5, Wei Chu 5, Zhe Wang 1, Shirui Pan 1†, Chao Qu 2,6†, Yuan Qi 2,6

Abstract

While Reinforcement Learning (RL) enhances Large Language Model reasoning, on-policy algorithms like GRPO are sample-inefficient as they discard past rollouts. Existing experience replay methods address this by reusing accurate samples for direct policy updates, but this often incurs high computational costs and causes mode collapse via overfitting. We argue that historical data should prioritize sustaining diversity rather than simply reinforcing accuracy. To this end, we propose Dynamic Jensen-Shannon Replay (DyJR), a simple yet effective regularization framework using a dynamic reference distribution from recent trajectories. DyJR introduces two innovations: (1) a Time-Sensitive Dynamic Buffer that uses FIFO and adaptive sizing to retain only temporally proximal samples, synchronizing with model evolution; and (2) Jensen-Shannon Divergence Regularization, which replaces direct gradient updates with a distributional constraint to prevent diversity collapse. Experiments on mathematical reasoning and Text-to-SQL benchmarks demonstrate that DyJR significantly outperforms GRPO as well as baselines such as RLEP and Ex-GRPO, while maintaining training efficiency comparable to the original GRPO. Furthermore, from the perspective of Rank-k token probability evolution, we show that DyJR enhances diversity and mitigates over-reliance on Rank-1 tokens, elucidating how specific sub-modules of DyJR influence the training dynamics.

* Equal contribution. † Co-corresponding authors.
1 Griffith University, Australia. 2 Fudan University, China. 3 Shanghai Innovation Institute, China. 4 Beijing University of Posts and Telecommunications. 5 INFLY TECH, China. 6 Shanghai Academy of Artificial Intelligence for Science, China. Correspondence to: Long Li <long.li@griffithuni.edu.au>.

Preprint. March 18, 2026.

1. Introduction

Reasoning capability is central to Artificial General Intelligence in the era of large language models (LLMs) (OpenAI, 2023; Yang et al., 2025; Zhu et al., 2025a). Models like OpenAI o1 demonstrate that Reinforcement Learning (RL) transcends preference alignment by enhancing logic through long Chain-of-Thought (CoT) exploration (OpenAI, 2024; Guo et al., 2025). While "long thinking" via Verifiable Rewards is now mainstream (Chen et al., 2025b), a key bottleneck exists: the on-policy nature of current Reinforcement Learning with Verifiable Reward (RLVR) algorithms (Cai et al., 2025; Cai & Sugiyama, 2026) causes expensive rollout data to be discarded after a single update (Shao et al., 2024; Zhan et al., 2025). This inefficiency wastes vast resources and prevents learning from past successes, hindering the scalability of RL in reasoning tasks (Fu et al., 2025; Zhou et al., 2026).

Experience Replay (Rolnick et al., 2019; Mnih et al., 2013; Lillicrap, 2015; Liu et al., 2026) has recently been adapted for LLM training to improve sample efficiency. Methods such as ReMix (Liang et al., 2025), RePO (Li et al., 2025b), Ex-GRPO (Zhan et al., 2025), and RLEP (Zhang et al., 2025) reuse historical trajectories by following a data augmentation paradigm. These approaches typically maintain massive buffers to store historical samples and treat them as additional positive instances for direct policy gradient updates. Formally, they optimize a joint objective $J(\theta) = J_{\text{on}}(\theta) + \alpha J_{\text{exp}}(\theta)$, using the same standard clipped surrogate objective for both terms.
Specifically, $J_{\text{exp}}(\theta)$ takes the exact same form as the on-policy objective, $\mathbb{E}_{q \sim \mathcal{B}_{\text{exp}}}\left[\min\left(r(\theta)\hat{A},\ \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A}\right)\right]$, where $r(\theta) = \pi_\theta / \pi_{\text{old}}$. This effectively treats replayed trajectories as additional off-policy samples.

We contend that this mainstream approach suffers from two core misconceptions. First, indiscriminate forward updates exacerbate mode collapse. By directly maximizing the likelihood of historical trajectories, the model is coerced into overfitting specific solution paths (Zhu et al., 2025b; Peng et al., 2025), leading to a swift erosion of its exploratory potential (Wang et al., 2025; Yue et al., 2025; Cheng et al., 2025). Second, traditional Experience Replay methods necessitate substantial training resources for sample storage and reuse; for instance, approaches like RLEP incur massive GPU memory overhead by archiving the entire trajectory history. In contrast, our empirical findings suggest that historical data is not uniformly valuable: RL training exhibits a rapid transition window in which the model's entropy decreases sharply during the early stages (typically the first 20 steps) before converging to a peaked distribution. Consequently, by prioritizing large-scale sample storage exclusively during this volatile initial phase while maintaining a minimal footprint in later stages, one can achieve performance comparable to continuous large-scale storage while drastically reducing memory requirements.

Based on these observations, we redefine the role of Experience Replay in reasoning tasks: the objective should shift from accuracy optimization via correct samples to a regularization mechanism for sustaining diversity. Guided by this philosophy, we propose the Dynamic Jensen-Shannon Replay algorithm (DyJR). Unlike previous approaches, DyJR introduces two key innovations.
Regarding data construction, we replace brute-force storage with a non-uniform dynamic buffer strategy. Specifically, we implement a dynamic capacity mechanism that expands the buffer to retain a larger volume of samples during the rapid transition phase, thereby capturing high-entropy reasoning patterns, and subsequently contracts it. Crucially, we employ a First-In-First-Out (FIFO) protocol for buffer updates, retaining only samples that are most temporally proximal to the current model. Empirical evidence indicates that while retaining excessive historical data with large temporal variance impedes learning, utilizing temporally adjacent data minimizes training resource consumption while yielding optimal performance. Regarding data utilization, we move away from direct policy gradient updates and introduce Jensen-Shannon divergence as a regularization constraint. By treating the mixture of historical policies as a dynamic distributional anchor, we minimize the Jensen-Shannon divergence between the current policy and this mixture. This prevents the model from drifting away from diverse successful paths without aggressively altering the optimization direction.

Our main contributions are summarized as follows:

(1) Redefining the Replay Paradigm: From Accuracy Optimization to Diversity Regularization. We demonstrate that the primary value of replayed data lies in sustaining diversity rather than merely reinforcing accuracy. Consequently, we replace direct gradient updates with a Jensen-Shannon divergence distributional constraint, effectively preserving the model's exploration capability and robustness.

(2) Proposing a Dynamic Data Construction Strategy Based on Temporal Proximity. We introduce a non-uniform dynamic buffer mechanism that expands storage during the rapid transition phase to capture high-entropy patterns and contracts it as the model stabilizes.
By employing a FIFO protocol to retain only the most temporally adjacent samples, we achieve optimal performance while minimizing training resource consumption.

(3) Extensive Experiments and Fine-Grained Analysis. We demonstrate robust improvements across diverse tasks (e.g., Math and Text-to-SQL) and architectures (Qwen and Llama families), achieving substantial gains in both Pass@1 and Pass@k with negligible GPU memory overhead. Furthermore, we provide a detailed ablation analysis from the perspective of Rank-k token probability evolution to elucidate how DyJR's sub-modules influence training dynamics.

2. Preliminaries

In this section, we formalize the RLVR setup and introduce the foundational components: Group Relative Policy Optimization (GRPO) and JS divergence.

2.1. RL Backbone: GRPO

We formulate the reasoning task as a Markov Decision Process (MDP). Given a query $x$, a policy $\pi_\theta$ generates a reasoning chain $y = (y_1, \dots, y_L)$. The environment returns a binary reward $r(x, y) \in \{0, 1\}$. Standard PPO requires a value function critic, which is computationally expensive for LLMs. Instead, we utilize GRPO (Shao et al., 2024), which estimates baselines using group statistics. For each query $x$, GRPO samples a group of $G$ outputs $\{y_1, \dots, y_G\}$ from the old policy $\pi_{\theta_{\text{old}}}$. The advantage for the $i$-th output is computed as:

$$\hat{A}_i = \frac{r(x, y_i) - \mu_{\text{group}}}{\sigma_{\text{group}}} \tag{1}$$

where $\mu_{\text{group}}$ and $\sigma_{\text{group}}$ are the mean and standard deviation of rewards within the group. GRPO maximizes the clipped surrogate objective, which is equivalent to minimizing the loss:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^{G} \min\left(\rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\right) \tag{2}$$

where $\rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}$ is the importance sampling ratio.

2.2. Forward KL Divergence

In contrast to symmetric measures, the Forward Kullback-Leibler (KL) divergence is a standard asymmetric metric typically minimized in maximum likelihood estimation.
For a target distribution $P$ and an approximating distribution $Q$, it is defined as:

$$D_{\text{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\ln \frac{P(x)}{Q(x)}\right] \tag{3}$$

The Forward KL is characterized by its mode-covering behavior. Because the expectation is taken with respect to the reference distribution $P$, the divergence penalizes $Q$ severely if $Q(x)$ is small where $P(x)$ is large. Consequently, minimizing $D_{\text{KL}}(P \,\|\, Q)$ forces $Q$ to spread its probability mass to cover all modes of $P$, ensuring broad support even if it results in assigning mass to low-probability regions of the target distribution.

2.3. Jensen-Shannon Divergence

The Jensen-Shannon divergence offers a symmetrized and smoothed alternative to the KL divergence. For two probability distributions $P$ and $Q$, the JS divergence is defined via a mixture distribution $M = \frac{1}{2}(P + Q)$ as:

$$D_{\text{JS}}(P \,\|\, Q) = \frac{1}{2} D_{\text{KL}}(P \,\|\, M) + \frac{1}{2} D_{\text{KL}}(Q \,\|\, M) \tag{4}$$

Unlike the standard KL divergence, which is asymmetric and unbounded (potentially approaching infinity if the support of $Q$ does not fully encompass $P$), the JS divergence is symmetric and bounded within $[0, \ln 2]$.

3. Method

We now introduce DyJR. While GRPO provides efficient exploration, it is prone to mode collapse in sparse-reward settings. DyJR mitigates this by maintaining a Dynamic Replay Buffer and applying a JS Regularization term derived from the definition in Sec. 2.

Algorithm 1 DyJR
  Input: Dataset $D$, Policy $\pi_\theta$, Max Age $M$, Reg Coefficient $\alpha_{\text{JS}}$.
  Initialize: Buffer $S \leftarrow \emptyset$.
  for step $t = 1, \dots, T$ do
    Rollout: Sample queries $X$, generate group $Y \sim \pi_{\theta_{\text{old}}}(X)$.
    Eval: Compute rewards $R$ and group confidence $C_{\text{id}}$.
    Buffer Maintenance:
      1. Evict samples where age $> M$.
      2. Select new samples based on $C_{\text{id}}$ and schedule $\eta$ (Eq. 6).
      3. Store $(x, y, \log \pi_{\theta_{\text{old}}})$ into $S$.
    Optimization:
      Calculate $\mathcal{L}_{\text{GRPO}}$ on online batch $(X, Y)$ using Eq. (2).
      Sample batch $B_{\text{replay}}$ uniformly from $S$.
      Calculate $\mathcal{L}_{\text{JS}}$ on $B_{\text{replay}}$ using Eq. (7)-(9).
      Update $\theta$ minimizing $\mathcal{L}_{\text{GRPO}} + \alpha_{\text{JS}} \mathcal{L}_{\text{JS}}$.
  end for

3.1. Dynamic Reference Construction

To leverage historical success, we construct a reference distribution $Q_B$ supported by a Replay Buffer $S_t$.

Dynamic Replay Buffer. To address the non-stationarity of the policy, we enforce a Max Age ($M$) constraint. The buffer strictly retains only perfect samples ($r = 1$) generated within the last $M$ steps. To enable efficient divergence computation without re-forwarding, we store the token-level log-probabilities computed at generation time. A buffer entry is defined as $(x_k, y_k, \log \pi_{\text{old}}^{(k)})$. At step $t$, any stale sample is evicted:

$$S_t = \left\{ (x_k, y_k, \log \pi_{\text{old}}^{(k)}) \;\middle|\; r(y_k) = 1,\ t - T^{\text{id}}_{k} \le M \right\} \tag{5}$$

where $T^{\text{id}}_{k}$ is the training step at which sample $k$ was generated. This ensures the reference distribution tightly tracks the shifting capability boundary of the current policy $\pi_\theta$.

Bias-Aware Adaptive Data Selection. To mitigate the selection bias inherent in filtering exclusively for correctness while ensuring robust coverage across varying task difficulties, we propose a confidence-stratified descending admission strategy. For each query in batch $B_t$, we define the empirical confidence $C_{\text{id}}$ as the count of correct responses among $G$ sampled paths. We strictly prioritize high-confidence samples by iteratively admitting perfect trajectories where $C_{\text{id}} = k$, sweeping $k$ from $G$ down to $1$:

$$P^{\text{new}}_t \leftarrow P^{\text{new}}_t \cup \left\{ (x, y, \log \pi_{\text{old}}) \in B_t \;\middle|\; C_{\text{id}} = k,\ r(y) = 1 \right\} \tag{6}$$

This process halts once the newly admitted samples reach the target fill rate $\eta$ (default 5%).
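The eviction rule of Eq. (5) and the descending admission sweep of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and field names (`Rollout`, `DynamicReplayBuffer`, `admit`, `evict`) are hypothetical, the fill rate is assumed to be a fraction of the rollouts in the current batch, and the sweep is assumed to stop as soon as that budget is reached, even in the middle of a confidence level.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    query_id: str          # identifies the query x
    response: str          # sampled reasoning chain y
    logprobs: List[float]  # token-level log pi_old, stored at generation time
    reward: int            # binary verifiable reward r(x, y) in {0, 1}
    step: int = 0          # generation step, filled in on admission


class DynamicReplayBuffer:
    """Hypothetical FIFO-style buffer with a max-age constraint M."""

    def __init__(self, max_age: int) -> None:
        self.max_age = max_age
        self.entries: List[Rollout] = []

    def evict(self, step: int) -> None:
        """Eq. (5): keep only entries whose age (step - generation step) is <= M."""
        self.entries = [e for e in self.entries if step - e.step <= self.max_age]

    def admit(self, batch: List[Rollout], group_size: int,
              fill_rate: float, step: int) -> int:
        """Eq. (6): sweep k = G..1, admitting correct rollouts with C_id == k."""
        # Assumption: eta is interpreted as a fraction of the rollouts in the batch.
        budget = int(fill_rate * len(batch))
        # Empirical confidence C_id: number of correct responses per query.
        confidence: Dict[str, int] = {}
        for r in batch:
            confidence[r.query_id] = confidence.get(r.query_id, 0) + r.reward
        admitted = 0
        for k in range(group_size, 0, -1):  # high confidence first
            for r in batch:
                if admitted >= budget:      # assumed halting rule
                    return admitted
                if r.reward == 1 and confidence[r.query_id] == k:
                    r.step = step
                    self.entries.append(r)
                    admitted += 1
        return admitted
```

In a training loop following Algorithm 1, one would call `evict(t)` and then `admit(batch, G, eta, t)` at each step `t`, before sampling the replay batch uniformly from `entries`.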
This "High-to-Low" mechanism inherently provides a difficulty-adaptive property: (1) for easy tasks, it preferentially secures high-confidence samples ($C_{\text{id}} \approx G$), ensuring a low-variance reference; (2) for hard tasks, it naturally relaxes admission criteria to capture rare solutions ($C_{\text{id}} \ll G$), thereby preventing data starvation and the "catastrophic forgetting" of difficult capabilities.

Mitigating Early Diversity Collapse. A critical challenge in RLVR is the rapid collapse of policy entropy during early training stages, often occurring before the replay buffer is sufficiently populated. To counteract this, we implement a Time-Aware Adaptive Schedule. During the initial warm-up phase (e.g., the first 20 steps), we temporarily elevate the target fill rate $\eta$ from 5% to 20%. By proactively admitting a broader spectrum of exploratory samples during initialization, this strategy rapidly diversifies the reference distribution, effectively smoothing the optimization trajectory and safeguarding against premature mode collapse.

3.2. JS Regularization Implementation

We apply JS divergence to regularize $\pi_\theta$ towards the buffer distribution $Q_B$. While Eq. (4) gives the theoretical definition, computing the standard JS divergence requires calculating the mixture distribution $M$, which is intractable for auto-regressive models.

Instead, we employ a low-variance generative estimator (Wang et al., 2023). For a sample $s$ drawn from $S_t$, we compute the probability ratio $u_s$ using the current policy and the stored log-probabilities: u ( j ) s = exp log π θ ( y ( j ) s | x s , y ( 0.375).

Evaluation is performed on highly challenging benchmarks, including AIME25 (Li et al., 2024a), HMMT25 (Balunović et al., 2025), BRUMO25 (Balunović et al., 2025), AMC23, Minerva (Lewkowycz et al., 2022) and Beyond AIME.
This task is designed to observe the significant performance leap from a low baseline to high-level proficiency following RL training. (2) A SQL generation task based on Llama-3.1-8B-Instruct (Meta AI, 2024), which focuses on moderate-difficulty reasoning typically within a 1k token limit. The model is trained on the BIRD (Li et al., 2024b) dataset and evaluated across both BIRD and Spider (Yu et al., 2018) datasets to test cross-domain performance.

In this work, we employ multiple experimental setups to validate the effectiveness of DyJR in comparison with baseline methods; Table 3 summarizes the detailed configurations for all setups. Specifically, Replay Bsz denotes the batch size of newly added data items during each data fusion step. For DPH-RL, following the experimental protocol in their original work, we incorporate samples that yield 6 correct responses out of 8 sampled paths into the regularization term. Regarding RLEP, we preserve two correct solution trajectories for each query $q$ in the replay buffer. These buffered samples are subsequently treated as data generated by the current policy $\pi_\theta$ and integrated into the joint policy update process.

2 https://huggingface.co/datasets/open-r1/OpenR1-Math-220k

Table 4. ExGRPO-specific hyperparameters for experience replay.

Parameter | Value | Description
Replay ratio ($\rho$) | 0.5 | Fraction of replay samples in each mini-batch
Delayed replay threshold | 0.35 | Minimum batch Pass@1 required to activate experience replay
Difficulty bucket range | [0, 7] | Bounds for difficulty-based bucketing ($k/K$)
Replay metric | Entropy | Criterion for evaluating trajectory quality
Trajectory selection mode | Argmin | Select the lowest-entropy successful trajectory per prompt

For Pass@1, we employ greedy decoding (i.e., temperature $T = 0$). For Mean@256, we generate 256 samples with a temperature of $T = 0.7$ and calculate the average accuracy across all samples.

ExGRPO. ExGRPO (Experiential Group Relative Policy Optimization) (Zhan et al., 2025) extends GRPO with an experience replay framework that integrates experience management and mixed optimization. During training, ExGRPO maintains an experience pool that maps each prompt to a set of previously successful solution trajectories, and categorizes prompts into difficulty buckets according to their most recent rollout accuracy. To prevent overfitting to trivial cases, prompts whose rollouts are entirely correct are moved into a retired set and excluded from subsequent replay. At each update step, ExGRPO applies a replay ratio $\rho$ to partition each mini-batch into on-policy samples and replayed experiences. Difficulty buckets are sampled using a Gaussian weighting centered at $0.5$, thereby emphasizing prompts of intermediate difficulty. For each selected prompt, ExGRPO further selects the successful trajectory with the lowest entropy under the current policy, encouraging stable exploitation while mitigating excessive diversity collapse. The additional hyperparameters introduced by ExGRPO are summarized in Table 4, while all other training configurations follow the GRPO baseline in Table 3.

C. The Irreplaceability of Early Exploration

Table 5. Performance of DyJR on various mathematical benchmarks. init denotes the percentage of rollout samples retained per step during the initial 20 steps, while $\eta$ represents the retention rate thereafter. The maximum age of the FIFO buffer is consistently set to 8 steps.
Model | AIME25 | AMC23 | Beyond AIME | BRUMO25 | HMMT25 | Minerva * | Avg
DyJR (init = 20%, η = 20%) | 22.9 | 72.5 | 12.5 | 33.8 | 12.6 | 51.0 | 34.2
DyJR (init = 20%, η = 5%) | 23.1 | 72.0 | 12.7 | 33.4 | 12.7 | 50.4 | 34.1
DyJR (init = 10%, η = 5%) | 22.5 | 69.2 | 11.4 | 31.5 | 11.5 | 48.2 | 32.4
DyJR (init = 5%, η = 5%) | 21.8 | 68.5 | 10.1 | 29.2 | 10.3 | 47.8 | 31.3

We evaluated whether maintaining a consistently large sample collection rate yields significant differences compared to our current strategy, with results presented in Table 5. Our ablation study yields two critical observations regarding data utilization:

1. Diminishing Returns of Late-Stage Data: Maintaining a consistently high global sampling rate (0.2) yields no significant performance gain over our dynamic strategy (34.2 vs. 34.1 average). This suggests that as the model stabilizes in the mid-to-late stages, a minimal set of replay samples suffices to anchor the distribution, validating the memory efficiency of DyJR's dynamic contraction.

2. Criticality of Initial Data: Conversely, reducing the sampling rate during the initial phase (init) results in a significant deterioration of performance: the average falls from 34.1 to 32.4 and 31.3 as init drops from 20% to 10% and 5%.

Reevaluating Replay Value: Diversity Over Accuracy. These findings challenge the conventional understanding of Experience Replay. If the primary value of replayed data lay solely in its correctness (i.e., reinforcing accurate solutions to prevent forgetting), then late-stage trajectories, which are generated by a model with higher accuracy, should theoretically be more valuable than early-stage ones. However, our experiments reveal a paradox: although the early-stage model exhibits lower accuracy, the data it generates is indispensable for final performance. This contradiction strongly suggests that the efficacy of DyJR stems not from merely cloning correct behaviors, but from preserving the high entropy and reasoning diversity inherent in early training.
By leveraging these early, diverse trajectories as a distributional constraint via JS divergence, DyJR prevents the model from prematurely converging to a narrow solution path (mode collapse). Consequently, unlike static replay methods that simply mitigate forgetting, DyJR actively safeguards the model's exploration capability, leading to substantial gains in Pass@1.
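The init and η settings compared in Table 5 can be read as a two-phase fill-rate schedule combined with the 8-step maximum age. The sketch below is illustrative only: the function names are hypothetical, the fill rate is assumed to be a fraction of the rollouts generated per step, steps are assumed to be 1-indexed as in Algorithm 1, and the buffer-size figure is an upper bound that assumes every step fills its whole budget.

```python
def target_fill_rate(step: int, warmup_steps: int = 20,
                     init_rate: float = 0.20, eta: float = 0.05) -> float:
    """Fill rate at a 1-indexed training step: init_rate during warm-up, eta afterwards."""
    if step < 1:
        raise ValueError("step is 1-indexed")
    return init_rate if step <= warmup_steps else eta


def admission_budget(step: int, num_rollouts: int, **schedule: float) -> int:
    """Maximum number of new samples admitted at this step (assumed: rate * rollouts)."""
    return int(target_fill_rate(step, **schedule) * num_rollouts)


def max_buffer_size(step: int, num_rollouts: int, max_age: int = 8,
                    **schedule: float) -> int:
    """Upper bound on buffer size: budgets of all steps s with step - s <= max_age."""
    first = max(1, step - max_age)
    return sum(admission_budget(s, num_rollouts, **schedule)
               for s in range(first, step + 1))
```

With a hypothetical 1024 rollouts per step, this bound is 1836 entries at step 20 and 459 entries at step 40, which illustrates how the buffer contracts once the warm-up phase ends.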
