OVD: On-policy Verbal Distillation

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original paper or the original arXiv source.

Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration, prevents effective use of interactive environment feedback, and incurs severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching based on discrete verbal scores (0–9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teachers that provide only verbal feedback, and it avoids token-level alignment, allowing the student model to explore the output space freely. Extensive experiments on web question answering and mathematical reasoning tasks show that OVD substantially outperforms existing methods, delivering up to a +12.9% absolute improvement in average EM on web Q&A tasks and up to a +25.7% gain on math benchmarks (when trained with only one random sample), while also exhibiting superior training efficiency. Our project page is available at https://OVD.github.io


💡 Research Summary

The paper introduces On‑policy Verbal Distillation (OVD), a novel knowledge‑distillation framework designed to transfer the multi‑step reasoning abilities of large language models (LLMs) to smaller, more efficient student models while overcoming the severe memory bottlenecks of existing token‑level on‑policy distillation methods. Traditional token‑level approaches require the teacher to output full‑vocabulary probability distributions (logits) at every decoding step. For long sequences (e.g., L = 8192) and large vocabularies (V ≈ 152 K), storing these logits for a batch of rollouts can consume hundreds of gigabytes of GPU memory, making it impractical for reinforcement‑learning (RL) settings that need many rollouts per problem.
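The memory cost above is easy to verify with back-of-envelope arithmetic. The sketch below uses the sequence length and vocabulary size quoted in the summary; the rollout count and fp16 precision are illustrative assumptions, not figures from the paper.

```python
# Back-of-envelope memory estimate for storing full-vocabulary teacher
# logits in token-level on-policy distillation.
L = 8192          # sequence length (example from the summary)
V = 152_000       # vocabulary size (~152K)
BYTES = 2         # assumed fp16/bf16 storage per logit

def logits_memory_gb(num_rollouts: int) -> float:
    """GiB needed to store full-vocabulary logits for N rollouts."""
    return num_rollouts * L * V * BYTES / 1024**3

# An assumed batch of 64 rollouts already exceeds a single GPU's memory.
print(f"{logits_memory_gb(64):.0f} GB for 64 rollouts")  # → 148 GB for 64 rollouts
```

Even at modest rollout counts, the token-level approach lands in the hundreds-of-gigabytes regime the summary describes, which is what motivates replacing logits with compact verbal scores.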

OVD replaces token‑level logits with a compact, discrete “verbal score” ranging from 0 to 9 that the teacher provides for each reasoning step or for the entire trajectory. These scores act as stochastic feedback, preserving the teacher’s uncertainty while drastically reducing the data that must be stored (memory scales as O(N·K·v) instead of O(N·L·V), where K is the number of reasoning steps and v = 10). The student generates N trajectories per problem on‑policy; low‑scoring trajectories are rejected via a verbal rejection‑sampling scheme and regenerated until they meet a quality threshold. This process requires only the 10‑class scores, not the full logits, allowing OVD to work even with black‑box teachers.

The learning objective combines the verbal feedback with standard RL rewards. For web question-answering (Web Q&A) tasks, a soft F1-based reward is used; for mathematical reasoning, a binary exact-match reward is applied. The policy-gradient loss takes the standard REINFORCE form

L(θ) = −E_{τ ∼ π_θ}[ R(τ) · log π_θ(τ) ]

where R(τ) combines the teacher's verbal score for trajectory τ with the task reward.
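The two task rewards can be sketched concretely. The soft F1 below is the common token-overlap F1 used in QA evaluation; the paper's exact answer normalization is an assumption here.

```python
from collections import Counter

def soft_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 reward for Web Q&A (common QA definition; a sketch)."""
    pred_toks, gold_toks = prediction.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, gold: str) -> float:
    """Binary exact-match reward for mathematical reasoning."""
    return float(prediction.strip() == gold.strip())

print(round(soft_f1("barack obama", "obama"), 2))  # → 0.67
print(exact_match("42", "42"))                     # → 1.0
```

The soft F1 gives partial credit for partially correct answers, which is why it suits free-form Web Q&A, while math answers are graded all-or-nothing.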

