SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment

Singing voice generation has progressed rapidly, yet evaluating singing quality remains a critical challenge. Human subjective assessment, typically in the form of listening tests, is costly and time-consuming, while existing objective metrics capture only limited perceptual aspects. In this work, we introduce SingMOS-Pro, a dataset for automatic singing quality assessment. Building on our preview version SingMOS, which provides only overall ratings, SingMOS-Pro extends the annotations of the newly added data to cover lyrics, melody, and overall quality, offering broader coverage and greater diversity. The dataset contains 7,981 singing clips generated by 41 models across 12 datasets, spanning early systems to recent state-of-the-art approaches. Each clip is rated by at least five experienced annotators to ensure reliability and consistency. Furthermore, we investigate strategies for effectively utilizing MOS data annotated under heterogeneous standards and benchmark several widely used evaluation methods from related tasks on SingMOS-Pro, establishing strong baselines and practical references for future research. The dataset is publicly available at https://huggingface.co/datasets/TangRain/SingMOS-Pro.


💡 Research Summary

The paper addresses a critical gap in the evaluation of singing voice generation systems: the lack of a large‑scale, multi‑dimensional dataset for automatic quality assessment. Building on their earlier SingMOS dataset, the authors introduce SingMOS‑Pro, a comprehensive benchmark that contains 7,981 singing clips generated by 41 distinct models across 12 publicly available singing corpora. The dataset spans four major tasks—singing voice synthesis (SVS), singing voice conversion (SVC), singing voice resynthesis (SVR), and ground‑truth recordings (GT)—and includes samples at three sampling rates (16 kHz, 24 kHz, 44.1 kHz).
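
For orientation, the dataset can be pulled directly from the Hugging Face Hub with the `datasets` library. Only the repository ID below comes from the paper; split and field names are dataset-defined, so this sketch just inspects whatever schema the card provides.

```python
# Minimal sketch: load SingMOS-Pro from the Hugging Face Hub.
# Only the repository ID is taken from the paper; split and column
# names vary by dataset card, so inspect them before relying on any.
from datasets import load_dataset

ds = load_dataset("TangRain/SingMOS-Pro")
print(ds)                      # shows the available splits and features

first_split = next(iter(ds))   # whichever split comes first
print(ds[first_split][0])      # one example record (schema is dataset-defined)
```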

A key innovation of SingMOS‑Pro is the annotation of three separate MOS dimensions: overall quality, lyrics clarity, and melody naturalness. Each clip received at least five ratings from experienced annotators, yielding 44,247 overall MOS scores and an additional 23,475 scores for both lyrics and melody. The annotation protocol involved five online batches; the first and fourth batches collected only overall MOS, while the second, third, and fifth batches collected all three dimensions. Quality control was enforced through trap and golden clips, and annotators completed a training lecture before participation.
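
Since each clip carries at least five ratings, the per-clip MOS in each dimension is simply the mean of those ratings. A minimal sketch of that aggregation, assuming a flat ratings table with hypothetical `clip_id`, `dimension`, and `score` columns (not the dataset's actual schema):

```python
# Sketch of MOS aggregation: a clip's MOS is the mean of its ratings.
# The column names below are hypothetical, not SingMOS-Pro's actual schema.
import pandas as pd

ratings = pd.DataFrame({
    "clip_id":   ["c1"] * 5 + ["c2"] * 5,
    "dimension": ["overall"] * 10,
    "score":     [4, 4, 5, 3, 4, 2, 3, 2, 3, 2],
})

mos = (ratings.groupby(["clip_id", "dimension"])["score"]
              .agg(mos="mean", n_ratings="count")
              .query("n_ratings >= 5"))   # enforce the >= 5 ratings rule
print(mos)
```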

The dataset is split strategically: systems with more than 50 clips are divided 70/30 into training and test sets, systems with 10–50 clips are placed entirely in the test set, and those with fewer than 10 clips are kept in training. Because annotation standards differ across batches, three distinct test subsets (test1, test2, test3) are maintained, allowing evaluation of models under heterogeneous MOS standards.
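
The split rule is simple enough to state as code. The following is a sketch under the thresholds quoted above; the function name and the input format (a mapping from system ID to its clip IDs) are illustrative, not the authors' tooling.

```python
import random

def split_by_system(clips_by_system, train_frac=0.7, seed=0):
    """Sketch of the described split: >50 clips -> 70/30 train/test,
    10-50 clips -> entirely test, <10 clips -> entirely train."""
    rng = random.Random(seed)
    train, test = [], []
    for system, clips in clips_by_system.items():
        clips = sorted(clips)
        rng.shuffle(clips)
        if len(clips) > 50:
            cut = int(train_frac * len(clips))   # 70% to training
            train += clips[:cut]
            test += clips[cut:]
        elif len(clips) >= 10:
            test += clips      # mid-sized systems: held out entirely
        else:
            train += clips     # tiny systems: kept in training
    return train, test
```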

For benchmarking, the authors fine‑tune a wav2vec2‑large self‑supervised learning (SSL) backbone on the 16 kHz subset (4,007 training clips; 2,091/1,540/376 test clips for test1/test2/test3). Training uses an L1 margin loss and SGD (learning rate 0.001, momentum 0.9) for 200 epochs with a batch size of 15. Evaluation metrics are RMSE, the linear correlation coefficient (LCC), and Spearman's rank correlation coefficient (SRCC), with SRCC emphasized as the primary indicator of ranking consistency.
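
The three metrics are standard and easy to reproduce with scipy, as sketched below. The `l1_margin_loss` shown alongside them is one plausible reading of the paper's "L1 margin loss"; the margin value is an assumption, not taken from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def mos_metrics(y_true, y_pred):
    # RMSE, linear correlation (LCC), and rank correlation (SRCC).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return {"RMSE": rmse,
            "LCC": float(pearsonr(y_true, y_pred)[0]),
            "SRCC": float(spearmanr(y_true, y_pred)[0])}

def l1_margin_loss(pred, target, margin=0.1):
    # One plausible reading of "L1 margin loss": absolute errors below
    # the margin incur no penalty. The margin value here is an assumption.
    return float(np.maximum(np.abs(pred - target) - margin, 0.0).mean())
```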

Results in Table 2 reveal that incorporating multi‑dataset fine‑tuning (MDF) dramatically improves performance. When both domain ID and MDF are disabled, utterance‑level SRCC hovers around 0.45; enabling MDF while keeping domain ID off raises SRCC to 0.75. Adding domain ID further boosts SRCC to 0.86, demonstrating that leveraging heterogeneous MOS data with appropriate domain identifiers yields more robust predictors.
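
How exactly the domain ID enters the model is not spelled out in this summary. A common recipe, sketched below under that assumption, is to learn an embedding per annotation domain and concatenate it with pooled SSL features before the regression head; nothing here is the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SSLMOSWithDomainID(nn.Module):
    """Illustrative MOS predictor, not the paper's exact architecture:
    pooled wav2vec2-large features plus a learned per-domain embedding."""
    def __init__(self, ssl_dim=1024, num_domains=5, emb_dim=128):
        super().__init__()
        # num_domains is illustrative (e.g., one ID per annotation batch).
        self.domain_emb = nn.Embedding(num_domains, emb_dim)
        self.head = nn.Sequential(
            nn.Linear(ssl_dim + emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, ssl_feats, domain_id):
        # ssl_feats: (batch, frames, ssl_dim); domain_id: (batch,) long tensor.
        pooled = ssl_feats.mean(dim=1)        # temporal mean pooling
        dom = self.domain_emb(domain_id)
        return self.head(torch.cat([pooled, dom], dim=-1)).squeeze(-1)
```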

Table 3 compares several established MOS prediction methods (DNS‑MOS, UTMOS, the original SingMOS model, and SHEET‑ssqa) with the SSL‑based approach. The SSL model achieves the highest utterance‑level SRCC (0.79, with 0.68 at the system level), against DNS‑MOS (0.55 utterance‑level / 0.86 system‑level) and UTMOS (0.35/0.19). Notably, DNS‑MOS and UTMOS are designed for speech and predict only overall MOS, lacking the ability to assess lyrics or melody, whereas the SSL model can be extended to multi‑task prediction.
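
Extending the single-output predictor to the three rated dimensions is straightforward in principle. One hypothetical shape, with a regression branch per dimension on shared features (an illustration, not the authors' design):

```python
import torch.nn as nn

class MultiTaskMOSHead(nn.Module):
    # Hypothetical multi-task extension: one regression branch per
    # rated dimension (overall, lyrics, melody) on shared SSL features.
    def __init__(self, in_dim=1024, dims=("overall", "lyrics", "melody")):
        super().__init__()
        self.branches = nn.ModuleDict({d: nn.Linear(in_dim, 1) for d in dims})

    def forward(self, pooled):
        # pooled: (batch, in_dim) pooled SSL features.
        return {d: head(pooled).squeeze(-1)
                for d, head in self.branches.items()}
```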

The authors also analyze the impact of sampling rate: the same model's outputs receive lower MOS at 16 kHz than at 24 kHz or 44.1 kHz, confirming that listeners perceive the quality difference between 16 kHz and higher rates, while the gap between 24 kHz and 44.1 kHz is marginal.

In summary, SingMOS‑Pro constitutes the first multilingual, multi‑task MOS dataset for singing quality assessment, providing rich annotations that enable research on automatic MOS prediction, multi‑task learning, and cross‑domain generalization. The benchmark experiments establish strong baselines and illustrate effective strategies for exploiting heterogeneous MOS data. Future work may explore language‑transfer learning, higher‑resolution audio modeling, real‑time assessment pipelines, and integration of perceptual features such as pitch contours and lyrical intelligibility into end‑to‑end quality predictors.

