Detecting RLVR Training Data via Structural Convergence of Reasoning


Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the $k$ smallest nearest-neighbor edit distances. Min-$k$NN Distance requires no access to a reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-$k$NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.


💡 Research Summary

The paper addresses the problem of detecting whether a specific example was part of the reinforcement‑learning‑with‑verifiable‑rewards (RLVR) training set of a reasoning language model. Unlike conventional pre‑training or supervised fine‑tuning, RLVR optimizes models by rewarding self‑generated chain‑of‑thought (CoT) trajectories rather than maximizing token‑level likelihood. Consequently, classic likelihood‑based membership inference methods, which rely on statistical traces in token probabilities, perform poorly for RLVR‑trained models.

The authors first conduct a systematic analysis of how RLVR reshapes model behavior. Using a Qwen-2.5-7B-Base backbone, they train the model with two representative RLVR algorithms, DAPO and GRPO, on a standard RL dataset. For each checkpoint they sample 32 completions for 300 prompts and evaluate three complementary diversity metrics: lexical diversity (Expectation-Adjusted Distinct n-grams, EAD), logical diversity (an NLI-based entailment/contradiction ratio), and semantic diversity (one minus the average pairwise cosine similarity of sentence embeddings). Across both algorithms, all three metrics steadily decline as training progresses, indicating that the space of possible reasoning trajectories conditioned on a fixed prompt contracts over time.
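Of the three metrics, semantic diversity is the simplest to reproduce. Below is a minimal sketch, assuming the m sampled completions have already been encoded into sentence embeddings by some external model; the embedding model itself and the EAD/NLI metrics are outside this sketch:

```python
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """One minus the average pairwise cosine similarity of the rows
    of `embeddings` (shape (m, d), one sentence embedding per
    completion). Returns 0.0 when all completions are embedded
    identically and grows as they spread apart."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T            # cosine similarity matrix
    m = len(embeddings)
    # average over the m*(m-1) off-diagonal (distinct-pair) entries
    avg_sim = (sim.sum() - m) / (m * (m - 1))
    return 1.0 - avg_sim
```

Tracking this value across RLVR checkpoints for a fixed prompt reproduces the contraction the authors report: it decreases as training progresses on seen prompts.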

To pinpoint which parts of the output become rigid, the authors extract high‑frequency 3‑grams that appear in at least half of the completions for a given prompt. They categorize these 3‑grams into three types: (1) restatements of the problem, (2) boilerplate connective phrases, and (3) symbolic/algebraic logic steps (e.g., “x = y + 2”, “let f(x) = …”). The analysis shows that the symbolic logic fragments increase rapidly during RLVR training, while the other two categories grow more slowly. This suggests that RLVR primarily compresses the core logical component of reasoning into a limited set of fixed structural patterns.
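The extraction step can be sketched as a document-frequency count over whitespace-tokenized completions; the subsequent categorization of 3-grams into the three types is a separate labeling step not shown here:

```python
from collections import Counter

def rigid_trigrams(completions, min_frac=0.5):
    """Return 3-grams that appear in at least `min_frac` of the
    completions sampled for one prompt. Counts document frequency
    (how many completions contain the 3-gram), not raw occurrences."""
    doc_freq = Counter()
    for text in completions:
        toks = text.split()
        # a set so each completion contributes at most once per 3-gram
        doc_freq.update({" ".join(toks[i:i + 3])
                         for i in range(len(toks) - 2)})
    threshold = min_frac * len(completions)
    return [g for g, c in doc_freq.items() if c >= threshold]
```

Applied per prompt and per checkpoint, the count of surviving 3-grams gives the "rigidity" trend the authors describe, with symbolic fragments like "x = y" dominating the growth.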

Further, hierarchical agglomerative clustering of logic‑related 3‑grams reveals that most prompts converge to a small number (typically 2–4) of distinct reasoning structure clusters rather than a single deterministic path. The distribution of cluster sizes shifts toward fewer, tighter clusters for prompts that the model has seen during RLVR, whereas unseen prompts retain a broader spread of structures. Table 1 in the paper quantifies this effect: seen prompts have a higher proportion of clusters with ≤2 structures and a larger count of rigid 3‑grams compared to unseen prompts.
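The clustering step can be sketched with SciPy's hierarchical-clustering utilities, assuming a pairwise distance matrix over the per-completion logic fragments has already been computed. The paper's exact distance function, linkage method, and cut threshold are not given in this summary, so average linkage and a distance-based cut are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def structure_clusters(dist_matrix: np.ndarray, threshold: float):
    """Agglomerative clustering over a precomputed symmetric pairwise
    distance matrix (e.g., edit distances between the logic 3-gram
    sequences of each completion). Returns an integer cluster label
    per completion; the number of distinct labels is the count of
    reasoning-structure modes for this prompt."""
    condensed = squareform(dist_matrix, checks=False)
    Z = linkage(condensed, method="average")
    return fcluster(Z, t=threshold, criterion="distance")
```

Under this sketch, an RLVR-seen prompt would yield 2-4 distinct labels concentrated in one or two large clusters, while an unseen prompt would spread its completions over more, smaller clusters.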

Based on these observations, the authors propose Min-kNN Distance, a black-box statistic designed to detect RLVR exposure. For a query prompt x, the method samples m completions from the RLVR-tuned model, computes the edit distance between every pair of completions, selects the k smallest distances, and averages them. Because RLVR-seen prompts generate completions that collapse into a few structural modes, their Min-kNN Distance values are systematically lower than those of unseen prompts, which exhibit higher diversity. Importantly, the detector requires only sampling access; it does not need token log-probabilities, model gradients, or a reference (pre-RLVR) model.
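The statistic as described above can be sketched in a few lines. Whether the paper computes edit distance at the character or token level, and whether it normalizes by length, is not specified in this summary, so plain character-level Levenshtein distance is an assumption; `completions` stands in for the m generations sampled from the RLVR-tuned model:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def min_knn_distance(completions, k: int) -> float:
    """Average of the k smallest pairwise edit distances among the
    sampled completions. Lower values indicate the structural
    collapse associated with RLVR-seen prompts."""
    dists = sorted(
        levenshtein(a, b)
        for i, a in enumerate(completions)
        for b in completions[i + 1:]
    )
    return sum(dists[:k]) / k
```

A detector then thresholds this score: prompts whose completions collapse into a few near-duplicate structures score low and are flagged as likely RLVR-seen.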

The experimental evaluation spans multiple RLVR-trained models, varying decoding hyper-parameters (temperature, top-p), paraphrased prompts, and even distilled versions of the original models. Min-kNN Distance consistently outperforms prior membership inference baselines (e.g., LiRA, shadow-model approaches) and recent RL-contamination detectors, achieving higher AUC and accuracy, often improving by 5-10 percentage points. The method remains robust under paraphrasing attacks and when the attacker only has black-box query access.

In conclusion, the paper demonstrates that RLVR training induces a distinctive "structural convergence" in reasoning outputs: prompts encountered during training produce rigid, low-diversity generations, while novel prompts retain higher variability. This convergence provides a reliable signal for membership inference that can be captured with a simple, model-agnostic statistic, Min-kNN Distance. The work not only deepens our understanding of how reward-driven fine-tuning shapes reasoning behavior but also offers a practical tool for auditing open-source RLVR models for benchmark contamination, thereby promoting more trustworthy evaluation of reasoning capabilities.

