Unseen but not Unknown: Using Dataset Concealment to Robustly Evaluate Speech Quality Estimation Models
We introduce Dataset Concealment (DSC), a rigorous new procedure for evaluating and interpreting objective speech quality estimation models. DSC quantifies and decomposes the performance gap between research results and real-world application requirements, while offering context and additional insights into model behavior and dataset characteristics. We also show the benefits of addressing the corpus effect by using the dataset Aligner from AlignNet when training models with multiple datasets. We demonstrate DSC and the improvements from the Aligner using nine training datasets and nine unseen datasets with three well-studied models: MOSNet, NISQA, and a Wav2Vec2.0-based model. DSC provides interpretable views of the generalization capabilities and limitations of models, while allowing all available data to be used in training. An additional result is that adding the 1000-parameter dataset Aligner to the 94-million-parameter Wav2Vec model during training significantly improves the resulting model’s ability to estimate speech quality for unseen data.
💡 Research Summary
The paper introduces Dataset Concealment (DSC), a systematic evaluation framework designed to expose and quantify the generalization capabilities of no‑reference speech quality estimation models. DSC operates on N datasets, each split into training, validation, and held‑out test partitions. Three families of models are trained: (1) Individual models, each using a single dataset; (2) a Global model trained on the union of all N datasets; and (3) Concealed models, each trained on all datasets except one. Each model is then evaluated on every dataset’s test set, producing correlation metrics (LCC or SRCC) denoted ρ_I,j, ρ_G,j, and ρ_C,j respectively. Two gap metrics are defined: the versatility gap v_j = |ρ_I,j| − |ρ_G,j|, measuring performance loss when moving from single‑dataset to multi‑dataset training, and the concealment gap c_j = |ρ_G,j| − |ρ_C,j|, measuring how well knowledge from other datasets transfers to a truly unseen dataset. Small or negative gaps indicate strong generalization.
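The two gap metrics above are simple to compute once the per-dataset correlations are in hand. A minimal sketch, with hypothetical correlation values (not results from the paper):

```python
# Sketch of the DSC gap metrics. The correlation values below are
# hypothetical illustrations, not numbers reported in the paper.

def pearson(x, y):
    """Pearson linear correlation coefficient (LCC) of two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def dsc_gaps(rho_i, rho_g, rho_c):
    """Gap metrics for dataset j.

    rho_i: correlation of the Individual model (trained on dataset j only)
    rho_g: correlation of the Global model (trained on all N datasets)
    rho_c: correlation of the Concealed model (trained on all but dataset j)
    """
    versatility_gap = abs(rho_i) - abs(rho_g)   # v_j
    concealment_gap = abs(rho_g) - abs(rho_c)   # c_j
    return versatility_gap, concealment_gap

# Example: a dataset where the global model loses a little versatility
# and the concealed model transfers reasonably well.
v_j, c_j = dsc_gaps(rho_i=0.92, rho_g=0.88, rho_c=0.81)
```

Small or negative `v_j` and `c_j` values indicate that multi-dataset training costs little and that knowledge transfers well to the held-out corpus.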
A major obstacle when training on multiple subjective MOS datasets is the “corpus effect”: MOS scores are not absolute and can differ across experiments due to listener pools, test conditions, or rating scales. This creates label noise that hampers learning. To mitigate this, the authors incorporate the Dataset Aligner from the AlignNet architecture. The Aligner receives a dataset identifier and learns a mapping from a reference MOS scale to each dataset’s scale, effectively normalizing labels across corpora. For MOSNet and NISQA, the Aligner is frozen until the validation LCC reaches 0.6; for the large Wav2Vec2.0‑based model, the Aligner is trained from the start.
Experiments involve nine training datasets and nine completely unseen test datasets, covering narrow‑band, wide‑band, and full‑band speech with various degradations (noise, reverberation, codecs, packet loss, etc.). Three representative models are evaluated: MOSNet (1.4 M parameters, CNN‑BLSTM), NISQA (0.218 M parameters, lightweight attention‑based), and a Wav2Vec2.0‑based model (94 M parameters, self‑supervised speech representation with a single linear head). Results show that, as expected, MOSNet has the lowest individual‑model performance, NISQA is intermediate, and Wav2Vec2.0 achieves the highest LCC. When trained globally without alignment, all models suffer a noticeable versatility gap, especially MOSNet. Adding the Aligner dramatically reduces the versatility gap for NISQA and Wav2Vec2.0, and also shrinks the concealment gap, indicating better transfer to unseen datasets. Notably, the 1000‑parameter Aligner improves the 94 M‑parameter Wav2Vec2.0 model’s LCC on unseen data by roughly 0.07–0.09, demonstrating that a tiny alignment module can substantially enhance a massive SSL model’s robustness.
The DSC framework provides richer insight than a single aggregate metric: it reveals which datasets are “easy” (high ρ_I,j) or “hard” (low ρ_I,j), quantifies how much label inconsistencies affect learning, and allows researchers to objectively assess the benefit of alignment techniques. The authors argue that DSC should become a standard tool for evaluating new speech quality estimators, guiding model architecture choices, data‑selection strategies, and the inclusion of alignment modules.
In summary, the paper makes three key contributions: (1) the DSC methodology for detailed, dataset‑level generalization analysis; (2) empirical evidence that the AlignNet Dataset Aligner effectively mitigates the corpus effect across diverse models; and (3) demonstration that even a very small Aligner can significantly boost the performance of a large self‑supervised model on truly unseen speech quality data. These findings advance the state of the art in robust, real‑world speech quality estimation.