Rewarding Intellectual Humility: Learning When Not To Answer In Large Language Models


Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention (“I don’t know”) alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (−1 for incorrect answers, r_abs for abstentions, +1 for correct answers), varying the abstention reward r_abs. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs ≈ −0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.


💡 Research Summary

The paper introduces a reinforcement‑learning framework called Reinforcement Learning with Verifiable Rewards (RLVR) that explicitly rewards large language models (LLMs) for abstaining (“I don’t know”) when they are uncertain, alongside rewarding correct answers and penalizing wrong ones. The reward function is ternary: +1 for a correct answer, –1 for an incorrect answer, and a tunable r_abs for the abstention action. By varying r_abs from negative (discouraging abstention) to positive (encouraging abstention), the authors can control the trade‑off between accuracy and humility.
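The ternary reward described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors’ code: the function name, the string-based abstention check, and the exact-match correctness check are assumptions made for clarity.

```python
R_ABS = 0.3  # tunable abstention reward; the paper sweeps it from negative to positive


def ternary_reward(response: str, gold_answer: str, r_abs: float = R_ABS) -> float:
    """Verifiable ternary reward: +1 for a correct answer, r_abs for an
    explicit abstention, and -1 for any other (incorrect) answer."""
    if "i don't know" in response.lower():
        return r_abs
    return 1.0 if response.strip() == gold_answer.strip() else -1.0
```

Because all three outcomes are mechanically verifiable, no human preference labels or learned reward model are needed; the accuracy–humility trade-off is controlled entirely by the scalar r_abs.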

Two instruction‑tuned models are fine‑tuned under this paradigm: IBM’s Granite‑3.3‑2B‑Instruct (a compact 2‑billion‑parameter model) and Qwen‑3‑4B‑Instruct (a 4‑billion‑parameter multilingual model). Experiments are conducted on two verification‑friendly benchmarks: MedMCQA, a multiple‑choice medical exam dataset, and Hendrycks Math, a set of challenging open‑ended math problems. For MedMCQA the authors augment the answer set with an explicit “I Don’t Know” option and enforce a strict XML‑style tagged output format to enable automatic reward computation. For Hendrycks Math they require the final answer to be wrapped in \boxed{…} or to be the phrase “I Don’t Know”.
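Automatic verification on the math benchmark hinges on extracting either a \boxed{…} answer or the abstention phrase from each completion. A minimal extractor might look like the following; the function name, the "ABSTAIN" sentinel, and the use of the last boxed expression as the final answer are assumptions, not details taken from the paper.

```python
import re


def extract_final_answer(completion: str):
    """Return "ABSTAIN" if the completion abstains, the content of the
    last \\boxed{...} expression if one exists, or None otherwise."""
    if "i don't know" in completion.lower():
        return "ABSTAIN"
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return boxed[-1] if boxed else None
```

A completion that matches neither pattern yields None and would simply be scored as incorrect, which also penalizes format violations.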

Three training pipelines are explored: (1) RL‑only, where the model is directly optimized with the ternary reward; (2) RL‑SFT‑Random, where a supervised‑fine‑tuning (SFT) stage first replaces 30 % of ground‑truth answers with “I Don’t Know” to teach the model to say it does not know, followed by RLVR; and (3) RL‑R‑Tuning, where an SFT stage replaces only those questions the base model originally answered incorrectly with “I Don’t Know”, then RLVR is applied. Baselines include a standard no‑IDK model and an IDK‑enabled model that simply adds a fifth answer choice.
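The SFT stage of the RL‑SFT‑Random pipeline amounts to relabeling a random 30 % of the training targets with the abstention phrase. A hedged sketch of that data construction, with illustrative field names (`question`, `answer`) that may differ from the authors’ actual dataset schema:

```python
import random


def make_sft_random(dataset, idk_fraction=0.3, seed=0):
    """Build the RL-SFT-Random supervised set: replace a random
    idk_fraction of gold answers with "I Don't Know" so the model
    learns the abstention phrase before RLVR begins."""
    rng = random.Random(seed)
    out = []
    for ex in dataset:
        answer = "I Don't Know" if rng.random() < idk_fraction else ex["answer"]
        out.append({"question": ex["question"], "answer": answer})
    return out
```

The RL‑R‑Tuning variant would instead relabel only the questions the base model answers incorrectly, which (as the results below show) can over‑teach abstention.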

Results on MedMCQA show that moderate abstention rewards (r_abs ≈ –0.25 to 0.3) substantially reduce the proportion of incorrect answers while keeping overall accuracy within a few points of the baseline. For the smaller Granite model, RL‑only with r_abs = –0.25 yields 46.4 % correct, 46.4 % wrong, and 7 % abstentions, a balanced improvement over the baseline. For the larger Qwen model, RL‑only with r_abs = 0.3 drives abstentions up to 41 % and cuts wrong answers from 32.5 % to 10.3 %, albeit with a larger accuracy drop (48 % correct). This suggests that larger models are more robust to abstention incentives and can maintain higher performance while being more cautious.

On the open‑ended Hendrycks Math benchmark, RL‑only fails to increase abstentions because the policy does not explore the “I Don’t Know” action sufficiently—a classic exploration problem in reinforcement learning. The RL‑SFT‑Random pipeline mitigates this: with r_abs = –0.5 the model answers correctly 35 % of the time, reduces wrong answers from 53 % to 26 %, and produces abstentions for 39 % of the questions. In contrast, RL‑R‑Tuning generates a very high abstention rate (≈ 60 %) after SFT, leaving little room for RL to recover correct answering, and thus underperforms the random variant.

A systematic sweep of r_abs values confirms a monotonic relationship: as r_abs becomes more positive, abstention rates rise and accuracy falls. The optimal operating point depends on the application’s tolerance for errors versus unanswered queries. The authors also note that the reward scaling experiments reveal model‑size effects: Qwen‑3‑4B tolerates higher abstention rewards without catastrophic accuracy loss, whereas Granite‑3.3‑2B can be driven to 100 % abstention with r_abs = 0.3.

The paper’s contributions are threefold: (i) a verification‑only reward scheme that eliminates the need for costly human preference data; (ii) empirical evidence that modest abstention incentives can meaningfully curb hallucinations in multiple‑choice settings; and (iii) an analysis of why open‑ended tasks require additional supervised abstention training to overcome exploration deficits.

In discussion, the authors argue that “intellectual humility” can be operationalized as a tunable reward parameter, offering a practical lever for system designers to balance safety (low hallucination) against utility (high answer coverage). They acknowledge limitations, including the need for better exploration strategies (e.g., on‑policy entropy bonuses, mixed‑policy sampling) and more sophisticated reward shaping for open‑ended generation. Future work is suggested on scaling to larger models, integrating with existing RLHF pipelines, and extending the framework to multimodal or retrieval‑augmented systems.

Overall, the study demonstrates that verifiable reward design is a feasible and flexible approach for reducing hallucinations in LLMs, especially when combined with targeted supervised fine‑tuning to teach the model when to say “I don’t know.”

