Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We find that classic AL sampling strategies fail to outperform random selection in this setting because they select only by subjective uncertainty and ignore objective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, PBC estimation is difficult because of limited sampling and dynamically shifting output distributions. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks.


💡 Research Summary

The paper tackles the high annotation cost inherent in Reinforcement Learning with Verifiable Reward (RLVR) for mathematical reasoning with large language models (LLMs). While RLVR eliminates the need for learned reward models by using binary, rule‑based rewards, it still requires tens of thousands of query‑answer pairs to achieve strong performance. The authors ask whether a much smaller, more informative subset of queries can suffice, and they bring active learning (AL) into the RLVR pipeline.

A key observation is that classic AL strategies—uncertainty‑based (e.g., least confidence, margin, entropy) or diversity‑based (e.g., k‑means, core‑set)—fail to outperform random sampling in this setting. The failure stems from the fact that these methods consider only subjective uncertainty (model‑estimated perplexity, entropy, etc.) and ignore objective uncertainty (the binary reward indicating correctness). Samples with high subjective uncertainty but correct answers (“inconsistent” samples) generate large, noisy policy gradients, destabilizing training. Conversely, samples where subjective and objective uncertainties are aligned (“consistent” samples) produce smaller, more stable gradients.
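The subjective-uncertainty scores mentioned above (perplexity, entropy) are standard quantities computed from a model's per-token log-probabilities. As a minimal sketch, assuming per-token log-probs and entropies are already available for each sampled response (the paper's exact scoring code is not reproduced here):

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of one sampled response from its per-token log-probs.
    A common subjective-uncertainty score: higher = more uncertain."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def mean_token_entropy(token_entropies):
    """Average per-token predictive entropy over a response,
    another subjective-uncertainty score."""
    return sum(token_entropies) / len(token_entropies)
```

Crucially, neither score looks at whether the final answer is correct, which is exactly the objective signal these AL strategies ignore.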

To quantify this alignment, the authors introduce a Point‑Biserial Correlation (PBC) metric, denoted $r_{pb}$. For each query $x$, they generate $K$ responses, compute a subjective uncertainty score $U$ for each response, and record the binary reward $R$. The PBC measures the correlation between $U$ and $R$; a strongly negative value indicates that higher uncertainty corresponds to lower reward, i.e., the two uncertainties are consistent. In the offline setting, they pre‑compute $r_{pb}$ for the whole pool and select the top $p\%$ of queries with the smallest (most negative) values for RL training.
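The offline selection step can be sketched directly from these definitions. Point-biserial correlation is just Pearson correlation between a continuous variable ($U$) and a 0/1 variable ($R$); the function and selection helper below are illustrative, not the paper's code:

```python
import numpy as np

def point_biserial(u, r):
    """Point-biserial correlation between continuous uncertainty
    scores u and binary rewards r over K responses to one query."""
    u, r = np.asarray(u, float), np.asarray(r, float)
    p = r.mean()  # fraction of correct responses
    if p in (0.0, 1.0) or u.std() == 0:
        return 0.0  # degenerate group: correlation undefined
    m1, m0 = u[r == 1].mean(), u[r == 0].mean()
    return (m1 - m0) / u.std() * np.sqrt(p * (1 - p))

def select_offline(pool_scores, frac=0.1):
    """Keep the fraction of queries with the most negative r_pb,
    i.e., where subjective and objective uncertainty agree most."""
    order = np.argsort(pool_scores)  # ascending: most negative first
    k = max(1, int(len(pool_scores) * frac))
    return order[:k]
```

On a query where uncertain responses tend to be wrong, `point_biserial` returns a strongly negative value, so that query ranks early in `select_offline`.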

However, online RL training cannot afford the large $K$ needed for a reliable PBC estimate, and the policy distribution shifts continuously. The authors therefore propose an online uncertainty consistency metric $r^{online}_{pb}$, computed from the normalized advantage $\hat{A}$ of each sampled response and the current model's subjective uncertainty. They prove theoretically that $r^{online}_{pb}$ is strictly negatively correlated with the offline $r_{pb}$ and that maximizing $r^{online}_{pb}$ is equivalent to selecting samples with high alignment between the two uncertainties. This provides a solid justification for using $r^{online}_{pb}$ as a selection criterion during training.
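The summary does not reproduce the paper's exact formula for $r^{online}_{pb}$, but its two ingredients are stated: the group-normalized advantage and the subjective uncertainty of each sampled response. The sketch below is a stand-in built from those ingredients, signed so that a higher score means better alignment (matching the "select the highest" criterion); the `online_consistency` function is an assumption, not the authors' definition:

```python
import numpy as np

def normalized_advantage(rewards):
    """GRPO-style normalized advantage: (R - mean) / std computed
    over the K responses sampled for a single query."""
    r = np.asarray(rewards, float)
    s = r.std()
    return np.zeros_like(r) if s == 0 else (r - r.mean()) / s

def online_consistency(uncertainties, rewards):
    """Illustrative online score: negated Pearson correlation between
    normalized advantage and subjective uncertainty, so that queries
    where uncertain responses earn low advantage score highest."""
    a = normalized_advantage(rewards)
    u = np.asarray(uncertainties, float)
    if a.std() == 0 or u.std() == 0:
        return 0.0  # all-correct or all-wrong groups carry no signal
    return -float(np.corrcoef(a, u)[0, 1])
```

Because only the responses already sampled for the policy update are reused, this style of score adds essentially no extra inference cost during training.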

Experiments are conducted on the MATH benchmark and GSM8K using several Qwen model sizes (0.5B, 3B, 7B). Baselines include random sampling, perplexity‑based uncertainty, entropy, k‑center, k‑means, and an LLM‑prompted selector (AskLLM). Results show that classic AL methods achieve virtually the same performance as random selection, whereas the proposed offline $r_{pb}$ selection consistently outperforms them, reaching full‑dataset accuracy with only 10% of the data. In the online regime, selecting the top $p\%$ of queries with the highest $r^{online}_{pb}$ yields performance comparable to or better than training on the entire dataset while using only 30% of the queries. Gradient‑norm analyses reveal that inconsistent samples cause orders‑of‑magnitude higher variance, confirming the theoretical motivation.

The main contributions are:

  1. Identification of the mismatch between subjective and objective uncertainty as the root cause of AL failure in RLVR.
  2. Introduction of the offline PBC‑based uncertainty consistency metric for query selection.
  3. Development of an online, advantage‑driven consistency metric with provable negative correlation to the offline metric.
  4. Empirical demonstration that these metrics reduce annotation cost by up to 70% without sacrificing reasoning performance.

The work opens a new direction for cost‑effective RL fine‑tuning of LLMs: rather than selecting “most uncertain” examples, one should select examples where the model’s internal uncertainty aligns with the external reward signal. This insight is likely applicable to other RL‑LLM tasks such as code generation, factual QA, or any domain where binary, verifiable rewards are available. Future research may explore Bayesian or meta‑learning approaches to improve online metric estimation, extend the framework to multi‑objective RL, or apply it to non‑binary reward settings.

