SA-SSL-MOS: Self-Supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

Fengyuan Cao¹, Xinyu Liang¹, Fredrik Cumlin¹, Victor Ungureanu², Chandan K. A. Reddy², Christian Schüldt², Saikat Chatterjee¹
¹KTH Royal Institute of Technology, Stockholm, Sweden   ²Google LLC

ABSTRACT

Designing a speech quality assessment (SQA) system for estimating the mean opinion score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises from the limited availability of MOS-labeled training datasets comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pre-trained on 16 kHz speech and therefore discard high-frequency information present at higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to a 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pre-trained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental results show that leveraging high-frequency information overlooked by SSL features is crucial for accurate multi-rate SQA, and that the proposed two-step training substantially improves generalization when multi-rate data is limited.

Index Terms: Speech quality assessment, deep learning, self-supervised learning, generalization ability

1. INTRODUCTION

Speech quality assessment (SQA) is the task of evaluating how well human or synthetic speech is perceived by a listener. There are two main approaches to SQA: subjective and objective. Subjective methods involve human listeners rating the speech, typically on the mean opinion score (MOS) scale, where listeners rate quality from 1 (bad) to 5 (excellent). Objective methods use algorithms to predict human perception and are more efficient and reproducible. These include intrusive methods such as PESQ [1] and POLQA [2], which compare a degraded speech signal to its clean reference, and non-intrusive methods that assess quality using only the degraded signal.

Since clean reference signals are rarely available in real-world scenarios, non-intrusive SQA methods are popular. Recent state-of-the-art non-intrusive SQA models [3, 4, 5, 6] leverage self-supervised learning (SSL) representations extracted from large-scale pre-trained models such as Wav2Vec2, HuBERT, and WavLM [7, 8, 9]. In this framework, an SSL model is pre-trained on vast amounts of unlabeled data and provides generic representations that can be exploited for downstream tasks such as SQA. However, a key limitation is that current SSL models are typically pre-trained on 16 kHz speech. As a result, high-fidelity recordings (e.g., 24 kHz or 48 kHz) must be downsampled to 16 kHz before feature extraction, which discards perceptually important high-frequency information and negatively impacts SQA performance.
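As a concrete illustration of this point (not from the paper): by the Nyquist theorem, 16 kHz audio cannot represent content above 8 kHz, so any component above that limit is removed by the anti-aliasing filter during downsampling. A minimal torchaudio check:

```python
import torch
import torchaudio

# One second of a 12 kHz tone at 48 kHz: valid content at 48 kHz,
# but above the 8 kHz Nyquist limit of 16 kHz audio.
sr = 48000
t = torch.arange(sr) / sr
tone = torch.sin(2 * torch.pi * 12000 * t).unsqueeze(0)

down = torchaudio.functional.resample(tone, orig_freq=sr, new_freq=16000)
print(tone.abs().max().item())  # 1.0
print(down.abs().max().item())  # ~0: the anti-aliasing filter removed the tone
```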
Developing a generalized SSL-based multi-rate SQA method providing MOS, one that works across different sampling rates, is an interesting yet challenging task for three reasons. (a) First, SSL-based models lack access to high-band information. (b) Second, there is a scarcity of multi-rate datasets. Most MOS-labeled corpora are collected at a single sampling rate, limiting the availability of suitable training data. (c) Third, the range-equalizing bias complicates cross-dataset learning [10]. Human raters typically use the full MOS scale even when the variance in perceived quality is limited, leading to misaligned MOS distributions across datasets. For example, a MOS rating of 5 for a 16 kHz sample may not correspond to the same perceived quality as a MOS rating of 5 for a 48 kHz sample. It is therefore difficult to directly combine MOS-labeled datasets recorded at different sampling rates and use the combined dataset for model training.

Recently, a multi-rate MOS-labeled subjective dataset containing recordings at 16, 24, and 48 kHz within a single evaluation was released [11] as part of the AudioMOS 2025 challenge, aimed at tackling the issue of multi-rate SQA. However, its limited size makes it challenging to train a generalizable multi-rate SQA model. In this work, we show that SSL-based multi-rate SQA methods trained only on the AudioMOS dataset struggle to generalize when evaluated on diverse external datasets.

To address this limitation, we propose SA-SSL-MOS, a spectrogram-augmented SSL-based model for non-intrusive MOS prediction. The proposed method augments SSL-based features at 16 kHz with spectrogram features to preserve high-frequency information. By effectively combining SSL-based and spectral-augmented features, SA-SSL-MOS takes advantage of the robustness and performance of SSL-based approaches while still retrieving higher-frequency information from high-fidelity recordings. Furthermore, we introduce a two-step pretraining-finetuning framework that enables effective use of limited multi-rate MOS data. From this, we investigate two research questions. First, does high-frequency information improve MOS prediction for high-fidelity recordings? Second, given the dataset limitations, does a pretraining strategy improve generalization to unseen speech recordings?

Our contributions in this article are as follows: (1) we propose SA-SSL-MOS, a method for high-fidelity multi-rate speech quality assessment; (2) we demonstrate that incorporating high-frequency information significantly improves objective speech quality prediction; (3) we show that an SSL-based multi-rate SQA method trained on the limited AudioMOS data suffers in generalization, and we introduce a two-step training strategy that improves generalization to out-of-distribution datasets.

Fig. 1. Architecture comparison between the existing SSL-Layer-MOS (left) and the proposed SA-SSL-MOS (right).

2. METHODS

Let $x$ denote a speech clip and $y$ its corresponding MOS label. A speech quality dataset can be represented as $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where $N$ is the total number of clips. Our goal is to design a regressor function $f_{\boldsymbol{\theta}}(x)$ with parameters $\boldsymbol{\theta}$ that predicts $y$ for a given input $x$. The regressor is typically implemented as a deep neural network (DNN), which learns its parameters in a data-driven manner.

2.1. SA-SSL-MOS architecture

We use the SSL-based MOS prediction method of [5] as our baseline due to its design simplicity and high performance. The baseline method performs a layer selection and is referred to as 'SSL-Layer-MOS' in this article.
It follows the architectural design of [12]; in a comprehensive study of different SSL-model layers for MOS prediction across multiple datasets, earlier SSL layers were found to be more effective, and larger SSL models generally demonstrated better performance. However, as previously discussed, most existing SSL feature extractors operate on 16 kHz inputs, leading to the loss of high-frequency information that is important for intelligibility and hence quality assessment [13].

To address this limitation, we propose SA-SSL-MOS, a spectral-augmented SSL-based MOS prediction model. Our approach introduces a parallel spectrogram-based pathway to complement the SSL features, enriching the representation with high-frequency information. The overall architectural modifications are illustrated in Figure 1.

Given an input audio signal, SA-SSL-MOS processes it through two parallel branches. In the first path, the audio is downsampled to 16 kHz and follows the same procedure as SSL-Layer-MOS: features are extracted using an SSL-based feature extractor, passed through the Feature Processing Module (FPM), and then flattened. In the second path, the audio is upsampled to 48 kHz, converted into a spectrogram, and processed by the Spectrogram Processing Module (SPM), followed by a global pooling layer. The second path preserves high-frequency information that would be lost by SSL features when the original audio has a sampling frequency higher than 16 kHz. Afterwards, the two resulting vector representations are concatenated and jointly used to predict the MOS score.

Fig. 2. Detailed layer-wise architecture of SA-SSL-MOS.

Following previous studies [14, 15, 16], we model the MOS prediction $y$ as a Gaussian posterior, where the network predicts both the mean $\hat{\mu}$ and the variance $\hat{\sigma}^2$. The model parameters are optimized using the Gaussian negative log-likelihood (GNLL) loss, which not only improves the performance of the point estimate but also provides an uncertainty estimate of the prediction. The GNLL loss is formulated as

$$\mathcal{L}_{\mathrm{GNLL}} = \sum \frac{1}{2}\left(\log(\hat{\sigma}^2) + \frac{(y - \hat{\mu})^2}{\hat{\sigma}^2}\right). \tag{1}$$

The detailed implementation of the proposed SA-SSL-MOS is illustrated in Figure 2. For the SSL branch, we use layer 9 of the Wav2Vec2-XLS-R-2B¹ model as the feature extractor. These SSL features are processed by the FPM, which consists of three 1D convolutional layers. For the spectrogram branch, the SPM is designed based on the encoder architecture of DNSMOS Pro [16] and operates on the upsampled spectrogram using 2D convolutions. The concatenated 640-dimensional feature vector is passed into the MOS Mapping Module, which consists of three fully connected layers followed by a linear transformation. Unlike DNSMOS Pro, we design independent mapping heads for $\hat{\mu}$ and $\hat{\sigma}^2$ for better modeling of the posterior parameters. At inference time, the predicted mean is used as the point estimate for MOS, following [16].

¹ https://docs.pytorch.org/audio/main/generated/torchaudio.pipelines.WAV2VEC2_XLSR_2B.html
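To make the dual-branch design concrete, below is a minimal PyTorch sketch of the forward pass and the GNLL objective in Eq. (1). Only the overall structure follows the paper (a 16 kHz SSL branch through the FPM, a 48 kHz spectrogram branch through the SPM, concatenation into a 640-dimensional vector, and independent heads for $\hat{\mu}$ and $\hat{\sigma}^2$); the channel counts, kernel sizes, and pooling choices are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SASSLMOSSketch(nn.Module):
    """Illustrative sketch of SA-SSL-MOS (not the authors' implementation)."""

    def __init__(self, ssl_dim=1920, joint_dim=640):
        super().__init__()
        # FPM: three 1D convolutions over the layer-9 SSL feature sequence.
        self.fpm = nn.Sequential(
            nn.Conv1d(ssl_dim, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(512, 320, kernel_size=3, padding=1), nn.ReLU(),
        )
        # SPM: encoder-style 2D convolutions over the 48 kHz spectrogram.
        self.spm = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 320, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Independent mapping heads for the posterior mean and variance.
        def head():
            return nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
        self.mu_head, self.logvar_head = head(), head()

    def forward(self, ssl_feats, spec):
        # ssl_feats: (B, T, ssl_dim) layer-9 features; spec: (B, 1, F, T').
        # Global mean pooling stands in for the paper's flatten/pooling steps.
        h_ssl = self.fpm(ssl_feats.transpose(1, 2)).mean(dim=-1)  # (B, 320)
        h_spec = self.spm(spec).mean(dim=(-2, -1))                # (B, 320)
        h = torch.cat([h_ssl, h_spec], dim=-1)                    # (B, 640)
        mu = self.mu_head(h).squeeze(-1)
        var = self.logvar_head(h).squeeze(-1).exp()  # keeps variance positive
        return mu, var

def gnll_loss(mu, var, y):
    # Eq. (1): 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), summed over samples.
    return (0.5 * (var.log() + (y - mu).pow(2) / var)).sum()
```

PyTorch's built-in torch.nn.GaussianNLLLoss implements the same objective, up to the reduction and a small clamp on the variance for numerical stability.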
2.2. Two-step Training Design

To effectively leverage the limited multi-rate MOS dataset and develop a robust MOS prediction model that generalizes well to unseen scenarios, we adopt a two-step training strategy for SA-SSL-MOS. In the first stage, we pre-train the model on a large-scale MOS-labeled dataset sampled at 48 kHz. This enables the model, particularly the spectral-augmented branch, to learn rich representations and adapt its parameters to handle diverse acoustic conditions. In the second stage, we fine-tune the pre-trained model on the multi-rate MOS dataset for only a few epochs. This controlled fine-tuning prevents overfitting to the smaller dataset while maintaining strong generalization performance on out-of-domain evaluation sets.

3. EXPERIMENTS

3.1. Datasets

We use a collection of datasets to train and evaluate our models, as summarized in Table 1. The multi-rate AudioMOS 2025 Track 3 dataset [11] contains separate training and test splits, each comprising 400 samples across 16, 24, and 48 kHz sampling rates. For model training and fine-tuning, we further divide the AudioMOS train split into 320 training samples and 80 validation samples, ensuring that the split is performed at the system level to avoid any overlap between speech systems across the two subsets. We pre-train our SA-SSL-MOS model on the combined NISQA TRAIN SIM and NISQA TRAIN LIVE datasets (denoted NISQA TRAIN), using their corresponding validation sets NISQA VAL SIM and NISQA VAL LIVE (denoted NISQA VAL). In total, NISQA TRAIN contains 11,020 training samples, while NISQA VAL includes 2,700 validation samples. Finally, to evaluate the generalization ability of the proposed system, we use a diverse collection of additional datasets covering different languages, sampling rates, and recording conditions.

Dataset | Purpose | Sampling Rate | Language | # Samples | Ratings/Clip
AudioMOS train [11] | train, val | 16/24/48 kHz | English | 320/80 | 10
AudioMOS test [11] | test | 16/24/48 kHz | English | 400 | 10
NISQA TRAIN (SIM+LIVE) [17] | train | 48 kHz | English | 10000+1020 | ~5
NISQA VAL (SIM+LIVE) [17] | val | 48 kHz | English | 2500+200 | ~5
NISQA TEST LIVETALK [17] | test | 48 kHz | German | 232 | 24
NISQA TEST FOR [17] | test | 48 kHz | Australian English | 240 | ~30
NISQA TEST P501 [17] | test | 48 kHz | British English | 240 | ~28
Tencent w R [18] | test | 24 kHz | Chinese | 3197 | ~20
Tencent w/o R [18] | test | 24 kHz | Chinese | 8366 | ~20
TCD-VoIP [19] | test | 48 kHz | English | 384 | 24
Table 1. Overview of datasets.

3.2. Feature Extraction

Since SA-SSL-MOS employs two feature processing branches, the input audio is processed through two distinct feature extractors: one based on an SSL model and the other based on spectrogram features. A unified feature extraction strategy is applied across all datasets to ensure consistency.

For the SSL branch, the audio signal is first downsampled to 16 kHz and then repetitively padded or cropped to a fixed length of 10 s. The processed audio is fed into a general-purpose pre-trained Wav2Vec2 XLS-R 2B model, and the output of its ninth transformer layer is selected as the feature representation. This choice follows the findings of [5], which showed that earlier transformer layers provide improved performance at reduced inference cost.

For the spectrogram branch, the audio signal is first upsampled to 48 kHz and resized to 10 s using the same repetitive padding or cropping strategy. The spectrogram is computed using the short-time Fourier transform (STFT) with a window length of 320, a frame shift of 160, and an FFT size of 320. Afterwards, we extract the magnitude spectrum and apply a logarithmic transformation to produce the final spectrogram features, following [16]. For the baseline SSL-Layer-MOS model [5], we use the same SSL-based feature extraction procedure but without the additional spectrogram branch.
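A minimal sketch of the two feature pipelines described above, using the torchaudio bundle linked in footnote 1. The fix_length helper encodes one reading of "repetitive padding" (tiling the clip until it is long enough); the resampling targets, 10 s length, and STFT parameters (window 320, hop 160, FFT 320) follow the paper, while the small eps in the log is an assumption for numerical safety.

```python
import torch
import torchaudio

def fix_length(wav, num_samples):
    # Repetitively pad (by tiling) or crop a (1, T) waveform to num_samples.
    while wav.shape[-1] < num_samples:
        wav = torch.cat([wav, wav], dim=-1)
    return wav[..., :num_samples]

def extract_features(wav, sr):
    # SSL branch: downsample to 16 kHz, fix to 10 s, take layer 9 of the
    # pre-trained Wav2Vec2 XLS-R 2B model.
    wav16 = fix_length(torchaudio.functional.resample(wav, sr, 16000), 16000 * 10)
    bundle = torchaudio.pipelines.WAV2VEC2_XLSR_2B
    ssl_model = bundle.get_model()
    with torch.no_grad():
        layers, _ = ssl_model.extract_features(wav16, num_layers=9)
    ssl_feats = layers[-1]  # ninth transformer layer, shape (1, T, 1920)

    # Spectrogram branch: upsample to 48 kHz, fix to 10 s, log-magnitude STFT.
    wav48 = fix_length(torchaudio.functional.resample(wav, sr, 48000), 48000 * 10)
    spec = torch.stft(wav48, n_fft=320, hop_length=160, win_length=320,
                      window=torch.hann_window(320), return_complex=True)
    log_mag = torch.log(spec.abs() + 1e-8)
    return ssl_feats, log_mag
```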
3.3. Training and Fine-tuning

We adopt a two-step training strategy for the proposed SA-SSL-MOS². In the first stage, the model is pre-trained for 30 epochs on the NISQA TRAIN dataset. In the second stage, the pre-trained model is fine-tuned for 3 epochs on the AudioMOS train dataset. To evaluate the effectiveness of this strategy, we compare against two alternative configurations: (1) training only on AudioMOS train for 30 epochs, and (2) training only on NISQA TRAIN for 30 epochs.

We conduct five rounds of experiments for the two-step strategy, where each round involves two fine-tuning runs. For configurations trained directly on a single dataset, we perform ten independent runs per setting. All other hyperparameters are kept constant across experiments. We use the Adam optimizer with a learning rate of $1 \times 10^{-4}$, no weight decay, and moving-average parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. An ExponentialLR scheduler is applied with a decay coefficient of $\gamma = 0.9999$. A batch size of 64 is used for all experiments. The same parameters and training procedure are applied to the baseline SSL-Layer-MOS model, for which we use the GNLL loss with posterior modeling instead of the standard MSE loss.

² The implementation of the model can be found at https://github.com/Dear-xxf/SA_SSL_MOS
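A sketch of the two-step schedule with the hyperparameters stated above (Adam with lr 1e-4 and betas (0.9, 0.999), ExponentialLR with gamma 0.9999, batch size 64; 30 pre-training epochs, then 3 fine-tuning epochs). The model and gnll_loss are the illustrative sketches from Section 2; nisqa_train and audiomos_train are placeholders for MOS-labeled datasets yielding (ssl_feats, spec, y) tuples, and per-batch scheduler stepping is an assumption (the paper does not state the stepping granularity).

```python
import torch
from torch.utils.data import DataLoader

def run_epochs(model, loader, optimizer, scheduler, num_epochs):
    model.train()
    for _ in range(num_epochs):
        for ssl_feats, spec, y in loader:
            optimizer.zero_grad()
            mu, var = model(ssl_feats, spec)
            loss = gnll_loss(mu, var, y)  # GNLL objective of Eq. (1)
            loss.backward()
            optimizer.step()
            scheduler.step()  # assumption: decay applied per batch

model = SASSLMOSSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9999)

# Step 1: pre-train on the large 48 kHz NISQA TRAIN set for 30 epochs.
run_epochs(model, DataLoader(nisqa_train, batch_size=64, shuffle=True),
           optimizer, scheduler, num_epochs=30)
# Step 2: fine-tune briefly on the multi-rate AudioMOS train set for 3 epochs.
run_epochs(model, DataLoader(audiomos_train, batch_size=64, shuffle=True),
           optimizer, scheduler, num_epochs=3)
```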
3.4. Results

We use the mean squared error (MSE), the linear correlation coefficient (LCC) [20], and Spearman's rank correlation coefficient (SRCC) [21] as standard evaluation metrics, following [5, 16].

Table 2 summarizes the results on the AudioMOS test dataset, reporting both utterance-level (UTT) and system-level (SYS) metrics. We compare the baseline SSL-Layer-MOS [5] and the proposed SA-SSL-MOS under the three training strategies described in Section 3.3 to evaluate the effectiveness of the two-step training scheme.

Model | Train Data | UTT MSE ↓ | UTT LCC ↑ | UTT SRCC ↑ | SYS MSE ↓ | SYS LCC ↑ | SYS SRCC ↑
baseline [5] | AudioMOS train | *0.282 ± 0.017 | 0.830 ± 0.012 | 0.678 ± 0.020 | *0.138 ± 0.012 | *0.961 ± 0.006 | 0.852 ± 0.035
baseline [5] | NISQA | 0.835 ± 0.071 | 0.798 ± 0.014 | 0.712 ± 0.033 | 0.641 ± 0.057 | 0.920 ± 0.008 | 0.781 ± 0.042
baseline [5] | NISQA+AudioMOS train | 0.465 ± 0.066 | 0.819 ± 0.016 | 0.731 ± 0.023 | 0.385 ± 0.079 | 0.936 ± 0.007 | 0.845 ± 0.015
SA-SSL-MOS (Ours) | AudioMOS train | 0.375 ± 0.035 | 0.830 ± 0.006 | 0.679 ± 0.015 | 0.286 ± 0.060 | 0.953 ± 0.014 | 0.826 ± 0.084
SA-SSL-MOS (Ours) | NISQA | 0.555 ± 0.070 | 0.789 ± 0.011 | 0.721 ± 0.024 | 0.424 ± 0.059 | 0.911 ± 0.005 | 0.754 ± 0.022
SA-SSL-MOS (Ours) | NISQA+AudioMOS train | 0.377 ± 0.082 | *0.848 ± 0.008 | *0.750 ± 0.018 | 0.323 ± 0.104 | 0.943 ± 0.005 | *0.856 ± 0.025
Table 2. Results on AudioMOS test. Metrics reported as mean ± standard deviation; the best value in each column is marked with *.

Our key observations are as follows:
• The SSL-Layer-MOS model trained only on the 320-clip split of AudioMOS train achieves competitive results on AudioMOS test. Under the same setup, SA-SSL-MOS performs slightly worse. We attribute this to the added spectrogram processing branch, which naturally requires more data to converge than the baseline's SSL features, which benefit from pre-training on a massive amount of data.
• Training exclusively on the NISQA dataset yields strong correlation-based performance, suggesting that NISQA provides diverse coverage and facilitates generalization to unseen data. However, the MSE is higher due to misaligned score distributions between NISQA and AudioMOS, stemming from range-equalizing bias and domain mismatch.
• Fine-tuning on AudioMOS train for just three epochs after pre-training on NISQA significantly improves performance on AudioMOS test. This benefit is consistent across both the baseline and SA-SSL-MOS, demonstrating that the two-step training effectively mitigates dataset misalignment with a small amount of fine-tuning.
• Combining the spectral-augmented architecture of SA-SSL-MOS with the two-step training strategy yields the best utterance-level performance overall, indicating the effectiveness of SA-SSL-MOS for multi-rate SQA.

While training on the limited AudioMOS train dataset achieves competitive performance on its corresponding test split, it remains unclear how well these models generalize to unseen conditions. To investigate this, we evaluate both architectures (baseline and SA-SSL-MOS) under two training configurations, only on AudioMOS train and using the proposed two-step strategy, and assess their generalization across a diverse set of out-of-domain datasets. The results are summarized in Table 3; metrics are reported at the utterance level only, since system-level labels are unavailable.

Test Data | Train Data | Model | MSE ↓ | LCC ↑ | SRCC ↑
Tencent w/o R | AudioMOS train | baseline | 1.002 ± 0.054 | 0.691 ± 0.023 | 0.687 ± 0.024
Tencent w/o R | AudioMOS train | SA-SSL-MOS | 1.097 ± 0.057 | 0.669 ± 0.035 | 0.666 ± 0.033
Tencent w/o R | NISQA+AudioMOS train | baseline | *0.751 ± 0.043 | *0.917 ± 0.009 | *0.901 ± 0.006
Tencent w/o R | NISQA+AudioMOS train | SA-SSL-MOS | 1.192 ± 0.124 | 0.877 ± 0.024 | 0.891 ± 0.010
Tencent w R | AudioMOS train | baseline | 0.577 ± 0.047 | 0.638 ± 0.029 | 0.542 ± 0.036
Tencent w R | AudioMOS train | SA-SSL-MOS | 0.712 ± 0.081 | 0.630 ± 0.054 | 0.544 ± 0.056
Tencent w R | NISQA+AudioMOS train | baseline | *0.421 ± 0.051 | *0.814 ± 0.009 | *0.780 ± 0.010
Tencent w R | NISQA+AudioMOS train | SA-SSL-MOS | 0.453 ± 0.035 | 0.795 ± 0.009 | 0.741 ± 0.017
TCD-VoIP | AudioMOS train | baseline | 0.836 ± 0.032 | 0.420 ± 0.042 | 0.343 ± 0.042
TCD-VoIP | AudioMOS train | SA-SSL-MOS | 0.864 ± 0.011 | 0.391 ± 0.027 | 0.318 ± 0.036
TCD-VoIP | NISQA+AudioMOS train | baseline | 0.615 ± 0.061 | 0.844 ± 0.025 | 0.836 ± 0.030
TCD-VoIP | NISQA+AudioMOS train | SA-SSL-MOS | *0.590 ± 0.092 | *0.860 ± 0.022 | *0.847 ± 0.029
NISQA TEST FOR | AudioMOS train | baseline | 0.671 ± 0.063 | 0.692 ± 0.017 | 0.692 ± 0.020
NISQA TEST FOR | AudioMOS train | SA-SSL-MOS | 0.775 ± 0.106 | 0.664 ± 0.021 | 0.666 ± 0.020
NISQA TEST FOR | NISQA+AudioMOS train | baseline | 0.323 ± 0.059 | 0.900 ± 0.005 | *0.914 ± 0.004
NISQA TEST FOR | NISQA+AudioMOS train | SA-SSL-MOS | *0.268 ± 0.024 | *0.901 ± 0.010 | 0.901 ± 0.011
NISQA TEST P501 | AudioMOS train | baseline | 0.788 ± 0.040 | 0.656 ± 0.022 | 0.665 ± 0.026
NISQA TEST P501 | AudioMOS train | SA-SSL-MOS | 0.888 ± 0.076 | 0.627 ± 0.029 | 0.642 ± 0.027
NISQA TEST P501 | NISQA+AudioMOS train | baseline | 0.463 ± 0.064 | 0.907 ± 0.011 | *0.930 ± 0.005
NISQA TEST P501 | NISQA+AudioMOS train | SA-SSL-MOS | *0.393 ± 0.045 | *0.926 ± 0.005 | 0.926 ± 0.006
NISQA TEST LIVETALK | AudioMOS train | baseline | 0.655 ± 0.037 | 0.604 ± 0.017 | 0.596 ± 0.011
NISQA TEST LIVETALK | AudioMOS train | SA-SSL-MOS | 0.733 ± 0.042 | 0.566 ± 0.053 | 0.572 ± 0.040
NISQA TEST LIVETALK | NISQA+AudioMOS train | baseline | 0.418 ± 0.052 | 0.877 ± 0.010 | *0.881 ± 0.006
NISQA TEST LIVETALK | NISQA+AudioMOS train | SA-SSL-MOS | *0.392 ± 0.054 | *0.893 ± 0.007 | *0.881 ± 0.007
Table 3. Results of the generalization ability test. Utterance-level metrics only; the best value per column within each test set is marked with *.

From the experimental results, we observe the following:
• Adopting the two-step training strategy, which leverages pre-training on a much larger dataset, substantially improves the model's generalization ability on unseen data. This suggests that the strong system-level performance on AudioMOS test achieved without pre-training may stem from overfitting and from the similarity between the training and test systems within the AudioMOS dataset.
• Under the two-step setup, SA-SSL-MOS consistently outperforms SSL-Layer-MOS across all NISQA test splits and the TCD-VoIP dataset. This highlights the importance of incorporating high-frequency information missed by SSL-only feature extractors and demonstrates the effectiveness of the spectrogram augmentation branch introduced in SA-SSL-MOS.
• In contrast, SSL-Layer-MOS achieves better performance on the two Tencent datasets. We attribute this to a language distribution mismatch. SSL-Layer-MOS relies solely on features from a pre-trained SSL backbone, whose pre-training included exposure to Chinese speech, whereas the SPM module in SA-SSL-MOS is pre-trained on NISQA, which contains no Chinese data. This discrepancy likely causes negative transfer, reducing the performance of SA-SSL-MOS on these datasets.
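For reference (not from the paper), the three metrics reported in Tables 2 and 3 can be computed directly with numpy and scipy; y_pred and y_true are illustrative arrays of predicted and labeled MOS values:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_pred, y_true):
    """MSE, LCC (Pearson), and SRCC (Spearman) between predictions and labels."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    mse = float(np.mean((y_pred - y_true) ** 2))
    lcc = pearsonr(y_pred, y_true)[0]
    srcc = spearmanr(y_pred, y_true)[0]
    return mse, lcc, srcc
```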
4. CONCLUSIONS

In this work, we presented SA-SSL-MOS, a novel non-intrusive MOS prediction model designed for multi-rate speech quality assessment. By augmenting SSL-based representations with spectrogram features from upsampled 48 kHz audio, SA-SSL-MOS effectively captures high-frequency information that SSL models discard. We further proposed a two-step training strategy that first pre-trains the model on a large-scale single-rate dataset and then fine-tunes it on a smaller multi-rate dataset. Experimental results show that this approach achieves superior performance on the AudioMOS test set and delivers significant improvements in generalization across six out-of-distribution test sets with different languages, sampling rates, and recording conditions.

5. ACKNOWLEDGEMENT

The research is supported by funding from the Digital Futures Center and the European Defence Fund REACT II project, and partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by resources provided by Chalmers e-Commons at Chalmers.

6. REFERENCES

[1] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001, vol. 2.

[2] John G. Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl, "Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - temporal alignment," Journal of the Audio Engineering Society, vol. 61, no. 6, pp. 366-384, 2013.

[3] Erica Cooper, Wen-Chin Huang, Tomoki Toda, and Junichi Yamagishi, "Generalization ability of MOS prediction networks," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 8442-8446.

[4] Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari, "UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022," in Proc. Interspeech 2022, 2022, pp. 4521-4525.

[5] Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, and Saikat Chatterjee, "Selection of layers from self-supervised learning models for predicting mean-opinion-score of speech," in 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025.
[6] Zili Qi, Xinhui Hu, Wangjin Zhou, Sheng Li, Hao Wu, Jian Lu, and Xinkang Xu, "LE-SSL-MOS: Self-supervised learning MOS prediction with listener enhancement," in 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023.

[7] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449-12460, 2020.

[8] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.

[9] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, 2022.

[10] Erica Cooper and Junichi Yamagishi, "Investigating range-equalizing bias in mean opinion score ratings of synthesized speech," arXiv preprint arXiv:2305.10608, 2023.

[11] Wen-Chin Huang, Hui Wang, Cheng Liu, Yi-Chiao Wu, Andros Tjandra, Wei-Ning Hsu, Erica Cooper, Yong Qin, and Tomoki Toda, "The AudioMOS Challenge 2025," arXiv preprint arXiv:2509.01336, 2025.

[12] Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, and Saikat Chatterjee, "Multivariate probabilistic assessment of speech quality," in Proc. Interspeech 2025, 2025.

[13] Lina Motlagh Zadeh, Noah H. Silbert, Katherine Sternasty, De Wet Swanepoel, Lisa L. Hunter, and David R. Moore, "Extended high-frequency hearing enhances speech perception in noise," Proceedings of the National Academy of Sciences, vol. 116, no. 47, pp. 23753-23759, 2019.

[14] Xinyu Liang, Fredrik Cumlin, Christian Schüldt, and Saikat Chatterjee, "DeepMOS: Deep posterior mean-opinion-score of speech," in Proc. Interspeech 2023, 2023.

[15] Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, and Saikat Chatterjee, "DeepMOS-B: Deep posterior mean-opinion-score using beta distribution," in 2024 32nd European Signal Processing Conference (EUSIPCO). IEEE, 2024, pp. 416-420.

[16] Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, and Saikat Chatterjee, "DNSMOS Pro: A reduced-size DNN for probabilistic MOS of speech," in Proc. Interspeech 2024, 2024, pp. 4818-4822.

[17] Gabriel Mittag, Babak Naderi, Assmaa Chehadi, and Sebastian Möller, "NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets," in Proc. Interspeech 2021, 2021.

[18] Gaoxiong Yi, Wei Xiao, Yiming Xiao, Babak Naderi, Sebastian Möller, Wafaa Wardah, Gabriel Mittag, Ross Cutler, Z. Zhang, Donald S. Williamson, Fei Chen, Fuzheng Yang, and Shidong Shang, "ConferencingSpeech 2022 challenge: Non-intrusive objective speech quality assessment (NISQA) challenge for online conferencing applications," in Proc. Interspeech 2022, 2022.
[19] Naomi Harte, Eoin Gillen, and Andrew Hines, "TCD-VoIP, a research database of degraded speech for assessing quality in VoIP applications," in 2015 Seventh International Workshop on Quality of Multimedia Experience (QoMEX). IEEE, 2015, pp. 1-6.

[20] Karl Pearson, "Notes on the history of correlation," Biometrika, vol. 13, no. 1, pp. 25-45, 1920.

[21] Charles Spearman, "The proof and measurement of association between two things," The American Journal of Psychology, vol. 100, no. 3/4, pp. 441-471, 1987.