B-GRPO: Unsupervised Speech Emotion Recognition based on Batched-Group Relative Policy Optimization
Unsupervised speech emotion recognition (SER) addresses the data sparsity and annotation bias of emotional speech. Reinforcement learning (RL) is a promising approach that improves performance through rule-based or model-based verification functions rather than human annotations. We treat sample selection during learning as a long-term decision process, with the choice of whether to select a sample as the policy's action, thereby applying RL to measure sample quality in SER. We propose a modified Group Relative Policy Optimization (GRPO) adapted to classification problems: it treats the samples in a batch as a group and uses their average reward as the baseline for computing the advantage. And rather than using a verifiable reward function as in GRPO, we put forward self-reward and teacher-reward functions that encourage the model to produce high-confidence outputs. Experiments indicate that the proposed method improves the performance of the baseline without RL by 19.8%.
💡 Research Summary
The paper introduces B‑GRPO (Batched‑Group Relative Policy Optimization), a reinforcement‑learning‑based framework for unsupervised speech emotion recognition (SER). Traditional SER suffers from data sparsity and annotation bias, making fully supervised training costly and potentially biased. While recent self‑supervised and contrastive methods (e.g., DINO, mask‑prediction, contrastive predictive coding) alleviate the need for labels, they still rely on generic feature learning and do not directly address sample quality during training.
GRPO, originally proposed for generative reasoning tasks, computes a relative advantage by comparing each response’s reward to the average reward of a group of responses to the same query, thereby eliminating the need for a separate value network. However, SER produces a single prediction per utterance, so the authors reinterpret the learning process as a long‑term decision problem: at each step the agent decides whether to incorporate a particular unlabeled sample into the policy update. By treating an entire mini‑batch as a “group,” B‑GRPO uses the batch‑wise mean reward as a baseline, normalizes the advantage, and discards negative advantages to prevent low‑confidence samples from harming the policy.
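The batch-wise advantage computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes one scalar reward per sample in the mini-batch, uses the batch mean as the baseline, normalizes by the batch standard deviation, and zeroes out negative advantages so low-confidence samples do not contribute to the update.

```python
import numpy as np

def batch_advantage(rewards):
    """B-GRPO-style batch-relative advantage (illustrative sketch):
    baseline = batch mean reward, normalized by batch std,
    with negative advantages discarded."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    std = rewards.std() + 1e-8           # guard against zero variance
    adv = (rewards - baseline) / std     # normalized relative advantage
    return np.maximum(adv, 0.0)          # keep positive advantages only

# Example: rewards for a mini-batch of four unlabeled samples
adv = batch_advantage([0.9, 0.7, 0.4, 0.2])
```

Samples rewarded below the batch mean end up with an advantage of exactly zero, which is how low-quality samples are effectively dropped from the policy update.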
Two families of reward functions are defined. The self‑reward relies solely on the model’s own confidence: either a binary reward (constant C if the maximum softmax probability exceeds a threshold δ) or a continuous reward equal to the maximum probability itself. Teacher‑reward functions compare the policy’s prediction with those of external, frozen teacher models (Emotion2vec‑plus‑large, Emotion2vec‑base, Whisper‑large‑v3) using either agreement of argmax labels, conjunction with the self‑reward, or KL‑divergence below a threshold θ. Empirical results show that self‑reward consistently outperforms teacher‑reward across all five corpora, indicating that the model’s own calibrated confidence is a reliable proxy for sample quality in the unsupervised setting.
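The reward families above can be sketched as simple functions of the softmax output. The constants C, delta, and theta below are illustrative hyperparameters, not values reported in the paper, and the teacher rewards assume the teacher exposes a probability distribution over the same label set.

```python
import numpy as np

C, DELTA, THETA = 1.0, 0.8, 0.5  # illustrative hyperparameters

def self_reward_binary(probs):
    """r1: constant C if the max softmax probability exceeds delta."""
    return C if np.max(probs) > DELTA else 0.0

def self_reward_continuous(probs):
    """r2: the max softmax probability itself."""
    return float(np.max(probs))

def teacher_reward_agreement(student_probs, teacher_probs):
    """Reward when the policy's argmax label matches the frozen teacher's."""
    return C if np.argmax(student_probs) == np.argmax(teacher_probs) else 0.0

def teacher_reward_kl(student_probs, teacher_probs, eps=1e-8):
    """Reward when KL(teacher || student) falls below the threshold theta."""
    kl = np.sum(teacher_probs * np.log((teacher_probs + eps) / (student_probs + eps)))
    return C if kl < THETA else 0.0

# Example with a confident 3-class prediction (illustrative values)
p = np.array([0.9, 0.05, 0.05])
r1 = self_reward_binary(p)      # binary reward fires since 0.9 > delta
r2 = self_reward_continuous(p)  # continuous reward equals 0.9
```

The conjunction variant mentioned above would simply multiply a teacher reward by the self-reward, so the policy is rewarded only when both signals agree.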
The loss combines the PPO clipped surrogate with the normalized advantage Âᵢ and a KL regularization term that keeps the policy close to a reference distribution. Experiments were conducted on five publicly available SER datasets covering English, Mandarin, and French: IEMOCAP, CASIA, CAFE, MELD, and M3ED. The baseline first trains on half of the data with labels for 100 epochs, then the remaining half is used without labels either by DINO or B‑GRPO for another 100 epochs. B‑GRPO improves macro‑F1 scores over the baseline by 2.2%–48.0% depending on the corpus and yields an average 10.3% gain over DINO. When compared to a fully supervised model trained on the entire labeled set for 200 epochs, B‑GRPO’s performance is comparable, demonstrating that high‑quality unlabeled samples can be leveraged effectively.
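The objective described above can be sketched as follows. This is a hedged reconstruction under stated assumptions, not the paper's implementation: inputs are per-sample log-probabilities of the chosen class under the new, old, and reference policies; eps and beta are illustrative hyperparameters; and the KL term uses the unbiased per-sample estimator commonly used with GRPO.

```python
import numpy as np

def b_grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Clipped-surrogate objective with KL regularization (sketch).
    eps and beta are illustrative, not the paper's reported values."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    surrogate = np.minimum(unclipped, clipped)   # PPO pessimistic bound
    # Unbiased per-sample KL estimator: exp(x) - x - 1 with x = logp_ref - logp_new
    x = logp_ref - logp_new
    kl = np.exp(x) - x - 1.0
    return float(-(surrogate - beta * kl).mean())  # negated for minimization

# Sanity check: identical old/new/reference policies => loss = -mean(advantage)
logp = np.log(np.array([0.8, 0.6, 0.5]))
loss = b_grpo_loss(logp, logp, logp, np.array([1.0, 0.5, 0.0]))
```

With identical policies the ratio is 1 and the KL term vanishes, so the loss reduces to the negative mean advantage, which is a useful check when wiring this into a training loop.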
Ablation studies examine the impact of the advantage clipping, the choice of reward (r₁ vs. r₂), and the presence of teacher models. Positive‑only advantage yields the best results; using the raw advantage or removing it degrades performance. Among self‑rewards, the continuous version (r₂) slightly outperforms the binary version (r₁). Teacher‑reward variants sometimes achieve the best score on a single corpus but are overall inferior to self‑reward.
The authors also test different feature extractors for the policy network: SenseVoice, Emotion2vec, and Whisper. Whisper‑large‑v3 combined with B‑GRPO shows the largest absolute improvement, suggesting that stronger pre‑trained acoustic representations enhance the reliability of confidence estimates used in the reward.
Finally, experiments on cross‑corpus data selection reveal that B‑GRPO can successfully pick useful samples even when the source and target corpora differ in language or domain, though using the same corpus yields larger gains. This indicates that the method can serve both as a sample selector within a dataset and as a lightweight data‑augmentation tool across corpora.
In summary, B‑GRPO reframes sample selection as a reinforcement learning policy, introduces batch‑wise relative advantage computation, and leverages self‑confidence as a reward signal to achieve substantial improvements in unsupervised SER. The approach is simple, does not require external annotations, and can be extended to other classification tasks where labeled data are scarce. Future work may explore richer teacher ensembles, multimodal extensions, and online updating mechanisms for real‑time emotion recognition.