Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction
Authors: Amartyaveer (Spire Lab, Dept. of Electrical Engg., Indian Institute of Science (IISc), Bengaluru, India; amartyaveer72@gmail.com), Murali Kadambi (Spire Lab, Dept. of Electrical Engg., IISc, Bengaluru, India; mkkadambi@gmail.com), Chandra Mohan Sharma (Center for Artificial Intelligence and Robotics (CAIR), DRDO, India; chandramohan.cair@gov.in), Anupam Mandal (Center for Artificial Intelligence and Robotics (CAIR), DRDO, India; amandal.cair@gov.in), Prasanta Kumar Ghosh (Spire Lab, Dept. of Electrical Engg., IISc, Bengaluru, India; prasantg@iisc.ac.in)

Abstract—In this study, we present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this, numerous deep learning-based nonintrusive speech assessment models have garnered significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose the use of a bottleneck transformer, incorporating convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model shows higher correlation and lower mean squared error in both seen and unseen scenarios compared to the state-of-the-art model using self-supervised learning (SSL) and spectral features as inputs.

Index Terms—Nonintrusive, Objective Intelligibility, Short-Time Objective Intelligibility (STOI), Bottleneck Transformer (BoT)

I. Introduction

Speech assessment refers to the process of evaluating various attributes of speech signals, such as quality and intelligibility. Speech assessment metrics are indicators that quantitatively measure specific attributes of speech signals. Speech assessment is mainly divided into two categories: Subjective Assessment, which requires human involvement, and Objective Assessment, which requires no human listeners. Objective assessment is further divided into two subcategories, Intrusive assessment and Nonintrusive assessment. The former requires a clean reference signal to calculate the assessment score, while the latter does not. In most cases where a large amount of data is involved, the clean reference signal is unavailable, and therefore subjective or intrusive assessment methods are not feasible. To overcome this problem, several approaches have been proposed to estimate speech intelligibility as surrogates for human listening tests and intrusive assessment.

The researchers in [1] proposed Quality-Net, which used the magnitude of the spectrogram as input to bidirectional long short-term memory (BiLSTM) modules. It estimates the Perceptual Evaluation of Speech Quality (PESQ) score [2] at the utterance level by incorporating a weighted sum of both utterance-level and frame-level evaluations, using the mean squared error as the objective function.
The researchers in [3] introduced STOI-Net, which also used the magnitude of the spectrogram as input. STOI-Net is a combination of convolutional neural networks (CNN) and bidirectional long short-term memory (BiLSTM) with a multiplicative attention mechanism (CNN-BiLSTM-ATTN). The researchers used the same objective function employed by Quality-Net [1], and the model showed a high correlation with the ground-truth STOI scores. More research works were published with multi-task setups, in which objective evaluation scores such as the speech transmission index (STI) and Short-Time Objective Intelligibility (STOI) [4], as well as subjective ratings from human listening tests, were evaluated. In MOSA-Net [5], cross-domain features (spectral and temporal features) and latent representations from a self-supervised learning (SSL) HuBERT [6] model were used to predict objective quality and intelligibility scores simultaneously. MOSA-Net can quite accurately predict objective quality (PESQ) and intelligibility (STOI) scores. Later, an improved version of MOSA-Net called MTI-Net [7] was developed to simultaneously predict Subjective Intelligibility (SI), STOI and WER scores.

Research continues on MOS prediction as well. Recent efforts include MOS-Net [8], a CNN-BiLSTM-based model designed to estimate the quality of speech. MB-Net [9] uses two separate networks to predict the mean quality score of an utterance and the difference between the mean score and the listener's score. QUAL-Net [10] utilizes the same architecture and features as MTI-Net [7] but uses a simpler CNN architecture for feature extraction.

More work has been done in the medical field, where DNN-based models are utilized in hearing aids (HA) to predict evaluation metrics such as the Hearing Aid Speech Quality Index (HASQI) [11] and the Hearing Aid Speech Perception Index (HASPI) [12]. MBI-Net [13] resembles MTI-Net [7] and takes spectral features along with a hearing loss pattern as inputs. It has two branches that use different input channels, fed into a feature extractor that produces spectral, learnable filter bank (LFB), and SSL features, and it estimates subjective intelligibility scores. MBI-Net+ [14], an enhanced version, incorporates HASPI in its objective function to improve the intelligibility prediction score. It uses Whisper model embeddings and speech metadata as the inputs and utilizes a classifier to identify speech signals enhanced by various methods.

In this study, we propose a model for STOI prediction that combines a convolution block (conv block), a bottleneck transformer [15], and dense layers. The conv block extracts and refines the input features. The bottleneck transformer helps to capture short- and long-term contexts while removing redundant information. The dense layers predict the STOI scores.
Experimental results show that the predicted scores have a higher correlation with the ground-truth STOI scores when tested in both Seen (as explained in Section V) and Unseen conditions (test speakers and utterances are not involved in the training). The experimental results confirm that the proposed model achieves comparably better results than the baseline model.

The remainder of the paper is organized as follows. Section II reviews the datasets used, Section III presents the related works, Section IV presents the proposed method, Section V presents the experiments and results, and Section VI concludes with a discussion of future prospects of this work.

II. Dataset

Due to the unavailability of datasets containing STOI scores, we developed our own noisy dataset. We selected the Indic TIMIT [16], LibriSpeech [17], RESPIN [18] and Bhashini (https://bhashini.gov.in) Hindi datasets. The Signal-to-Noise Ratio (SNR) levels of the audio clips from these datasets were analyzed using the Long-Term SNR (LT-SNR) [19] and Waveform Amplitude Distribution Analysis (WADA)-SNR [20] metrics. Files with LT-SNR scores above 16 and WADA-SNR scores exceeding 80 were classified as clean and used to build the noisy dataset. We then selected a 12-hour subset of this data. Various types of noise were added to our clean dataset to form a comprehensive noisy dataset. The following noise types were included:

1) Mobile/telephone channel noise: To simulate the mobile/telephone channel, we used SoX to emulate the characteristics of the Global System for Mobile Communication (GSM) format.
2) Reverberation: Reverberation was induced in speech signals by convolving the signals with Room Impulse Responses (RIR) from the ACE Corpus [21]. The corpus contains real recordings of RIRs, with their corresponding T-60 values, from 7 different rooms with varying microphone positions.
3) Radio channel noise: For radio channel noise, we added white noise at 30-40 dB SNR after applying a random high-pass filter between 500-1000 Hz and a band-pass filter of 50-2600 Hz.
4) Trans-coding (mp3, ogg, flac, aiff, wav): In trans-coding, we converted the audio into different audio codecs; each codec has its own way of compression, which affects the intelligibility of the audio. After compression, we decompressed the audio back to wav format.
5) Variable-length clipping: Clipping strongly affects intelligibility. To create the clipping effect, we use a moving window; within each window, we randomly select negative and positive thresholds and clip the samples whose values lie beyond them.

Additive noises from the MUSAN [22] dataset were randomly added at 0-20 dB SNR. We added the above noises in three different combinations: the resulting noisy files have either a single type of noise, a combination of 2 noises, or a combination of 3 noises.

To compute the STOI score between the noisy signal and the clean reference, we utilized the STOI metric from TorchMetrics Audio (https://torchmetrics.readthedocs.io/en/v0.9.0/audio/short_time_objective_intelligibility.html). This STOI score serves as the ground truth STOI for the experiment.
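To make the labeling step concrete, the sketch below mixes a noise clip into clean speech at a randomly drawn 0-20 dB SNR and computes the ground-truth label with the TorchMetrics STOI metric mentioned above. The add_noise_at_snr helper and the stand-in tensors are illustrative assumptions, not the authors' actual pipeline.

```python
import torch
from torchmetrics.audio import ShortTimeObjectiveIntelligibility  # needs pystoi installed

def add_noise_at_snr(clean: torch.Tensor, noise: torch.Tensor,
                     snr_db: float) -> torch.Tensor:
    """Scale `noise` so the clean-to-noise power ratio equals snr_db,
    then mix it into the clean signal (both are 1-D waveform tensors)."""
    noise = noise[: clean.numel()]                  # trim to equal length
    p_clean = clean.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-10)  # guard against silence
    scale = torch.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# Intrusive STOI between the noisy signal and its clean reference,
# used as the ground-truth training label.
stoi = ShortTimeObjectiveIntelligibility(fs=16000)

clean = torch.randn(3 * 16000)                      # stand-in for a clean utterance
noise = torch.randn(3 * 16000)                      # stand-in for a MUSAN noise clip
snr_db = float(torch.randint(0, 21, (1,)))          # random SNR in 0-20 dB
noisy = add_noise_at_snr(clean, noise, snr_db)
label = stoi(noisy, clean)                          # scalar STOI, roughly in [0, 1]
```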
We used the Indic TIMIT dataset for training, validation, and testing; a 5-fold cross-validation was performed, with each fold containing 2 hours of data. Additionally, we used a 2-hour subset each from the LibriSpeech, RESPIN (Bhojpuri and Bengali subsets) and Bhashini (Hindi) datasets, used only for testing; these constitute the Unseen test sets.

III. Existing Methods

ML-based speech assessment models are increasingly used in speech processing, including for building speech assessment systems. This section reviews deep learning techniques for predicting speech metrics.

[7] proposed MTI-Net, a network that predicts STOI, WER, and intelligibility in separate branches. It uses STFT acoustic features, LFB, and embeddings from an SSL model like HuBERT. The network comprises convolution layers, bidirectional LSTM layers, and linear layers, with predictions occurring in individual branches. The loss function combines utterance-level and frame-level scores for each metric it predicts. Results showed that using all three feature types to predict all metrics produced the best model, and replacing the pretrained SSL model with a fine-tuned one further improved performance.

[5] introduced MOSA-Net, a network that uses STFT, LFB, and SSL model embeddings as features, with an architecture similar to MTI-Net [7]. MOSA-Net predicted PESQ, STOI, and SDI (Speech Distortion Index) scores. Using the Wall Street Journal (WSJ) dataset, the researchers showed that MOSA-Net performed well in predicting PESQ scores with a BiLSTM + CNN + Attention configuration when noise types were seen during training. This setup also excelled at predicting STOI for unseen noise types. Additionally, they presented QIA-SE, a network that enhances noisy speech using the output of the final linear layer of MOSA-Net as an input.

[3] introduced STOI-Net, which used the STFT spectrum as the input feature. The model was built with convolution layers, bidirectional LSTM layers and attention layers. It showed good correlation with actual STOI scores while performing nonintrusive prediction on the noisy WSJ dataset.

The work in [23] used features from the Whisper ASR model's decoder layers to predict intelligibility in hearing aids. In [24], Generalized Enhanced STOI (GESTOI) was predicted using an LFB that learns to process temporal envelopes; a Temporal Attention Block then weighs the contribution of the input speech in different time segments. The work in [25] predicted the intelligibility of speech under hearing loss and hearing aids: a pretrained WavLM [26] model was used to extract acoustic features for each audio segment, which were averaged across the time dimension to produce a representative speech feature vector. In [27], the acoustic features of a pre-trained Wav2Vec-based XLS-R model were used as input to a network consisting of BiLSTM, attention, pooling and linear layers to predict MOS scores. A comparison showed that the XLS-R features with BiLSTM and attention layers gave the best performance, outperforming the two baseline models considered. [28] experimented with many fairseq models via fine-tuning and zero-shot training and found that the small and large variants of the Wav2Vec2 model gave the best and most consistent results.

IV. Proposed Method

The overall architecture of our proposed method is shown in Figure 1.
Three types of input features were experimented with for the model. The first type is the latent feature vectors of SSL models, Wav2Vec2 [29] small and HuBERT [6] base, taken from the projection layer of the models. The second type is input spectral features, similar to the feature extraction method of STOI-Net [3]: the utterances were converted into a 257-dimensional spectrogram using a 512-point STFT with a 32 ms Hamming window and a 16 ms hop length (referred to as PS-I). The third type of input feature is obtained by passing the aforementioned spectral features through a set of convolution layers: the convolution layers leveraged from [3] are referred to as PS-II, and those leveraged from [10] as PS-III. These features are learnable and are directly fed to the Convolution Block (Conv Block) of the proposed model (Table I).

TABLE I
Definition of PS-I, II and III
  PS-I:   257-D spectrogram (512-point STFT, 32 ms Hamming window, 16 ms hop); post-processing: none
  PS-II:  same spectrogram; post-processing: STOI-Net conv layers [3]
  PS-III: same spectrogram; post-processing: QUAL-Net conv layers [10]
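As a concrete reference for the PS-I front end, the following sketch computes the 257-dimensional magnitude spectrogram described above; at 16 kHz, the 32 ms window and 16 ms hop correspond to 512 and 256 samples. This is our reading of the description, not the authors' exact code.

```python
import torch

def ps1_features(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """PS-I: magnitude spectrogram from a 512-point STFT with a 32 ms
    Hamming window and a 16 ms hop (512 and 256 samples at 16 kHz)."""
    n_fft = int(0.032 * sr)             # 512-point window at 16 kHz
    hop = int(0.016 * sr)               # 256-sample hop at 16 kHz
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, win_length=n_fft,
                      window=torch.hamming_window(n_fft),
                      return_complex=True)
    return spec.abs().transpose(0, 1)   # (frames, 257)

frames = ps1_features(torch.randn(3 * 16000))
print(frames.shape)                     # torch.Size([188, 257]) for 3 s of audio
```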
Conv Block: The Conv block consists of two 1-D convolutional layers, along with a 1-D Batch Normalization layer for regularization and a Gaussian Error Linear Unit (GELU) [30] activation. It takes the input features, reduces their dimension, and helps in extracting and refining the features. The extracted features are then provided as input to the Bottleneck Transformer.

Bottleneck Transformer: A model has to be good enough to capture both the local and global information present in the signal, even in the presence of nonstationary noise or distortions; for this, we use the Bottleneck Transformer. It includes convolution layers, multi-head self-attention layers, Batch Normalization, and pooling layers. Using both convolution and attention layers, it captures a richer and more robust representation of the input data. Convolution layers reduce dimensionality and capture the local context, while self-attention layers aggregate the information learned by the convolution layers and learn the global context. Additionally, another convolution layer is used for upsampling. These layers filter out redundant information and focus on relevant data. A residual connection from input to output aids gradient propagation during training. The Bottleneck Transformer's output is then passed to Dense Block-1.

Dense Blocks: Dense blocks comprise linear layers and nonlinear activation layers. Dense Block-1 further refines the features extracted by the bottleneck transformer. Global pooling is then applied to reduce the temporal dimension by averaging across time. This output is fed into Dense Block-2, which predicts the STOI score.

The objective function used for training is the Mean Squared Error (MSE) between the true and predicted utterance-level STOI scores. The model effectively captures information at both the frame and utterance levels without needing frame-level scores for training.

[Fig. 1. Proposed architecture. Left: architecture of the full model (Input Feature → Conv Block → Bottleneck Transformer → Dense Block-1 → Global Average Pooling → Dense Block-2 → STOI). Right: architecture of the Bottleneck Transformer.]

V. Experiments and Results

We prepared five test sets to evaluate the baseline and the proposed method: one Seen test set and four Unseen test sets. The Seen test set consists of data where speakers, utterances, and noise types from the training data may appear in various combinations, while ensuring that complete data leakage does not occur. The Unseen test sets consist of speakers and utterances different from those in the training data; however, the noise types still overlap with those of the training data.

TABLE II
Parameter count of models for different feature types
  Features    Proposed Model   STOI-Net
  Wav2Vec2      706,465          967,938
  HuBERT        903,073        1,230,082
  PS-I          314,017        N/A
  PS-II       1,019,937        1,195,106
  PS-III        669,921          845,538

All training/fine-tuning was performed on systems containing 24 GB NVIDIA RTX A5000 GPUs. Our proposed model consistently has fewer parameters than the baseline across all features, as shown in Table II. Hyper-parameters were tuned and models were selected using the development set.

A. Experimental Setup

In a previous work, STOI-Net [3], the researchers showed the advantage of BiLSTM and self-attention for modeling time-variant speech patterns and predicting STOI as the label. We therefore used the STOI-Net model as our baseline system in this study. We also compute the MSE, LCC and SRCC for the different feature types while varying the number of additive noises added to the clean signals from the Seen test set.

TABLE III
Baseline and proposed model performance on the seen test set
  Model      Features   LCC ↑          SRCC ↑         MSE ↓
  STOI-Net   Wav2Vec2   92.69 ± 0.72   92.67 ± 0.65   0.0078 ± 0.0007
             HuBERT     91.64 ± 0.79   91.48 ± 0.85   0.0088 ± 0.0009
             PS-II      93.15 ± 1.01   92.87 ± 1.00   0.0073 ± 0.0011
             PS-III     77.73 ± 7.97   80.24 ± 5.92   0.0266 ± 0.0114
  Proposed   Wav2Vec2   93.95 ± 0.30   93.89 ± 0.42   0.0064 ± 0.0003
             HuBERT     93.95 ± 0.26   93.54 ± 0.32   0.0065 ± 0.0003
             PS-I       91.86 ± 0.20   91.60 ± 0.28   0.0085 ± 0.0001
             PS-II      92.21 ± 0.64   91.99 ± 0.73   0.0084 ± 0.0009
             PS-III     93.42 ± 0.19   93.39 ± 0.28   0.0069 ± 0.00023

For our proposed model, we used a Conv block for feature extraction, which takes an input size of 257 for PS-I, 512 for PS-II and PS-III, 768 for Wav2Vec2 and 1024 for HuBERT features. The hidden dimensions of the first and second Conv1D layers are 256 and 128, respectively. Both use a kernel size of 3, followed by 1-D Batch Normalization and GELU activation layers.
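A minimal PyTorch sketch of the Conv block with the dimensions just described; the padding value is our assumption, since the text does not specify how frame counts are preserved.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two Conv1d layers (hidden sizes 256 and 128, kernel size 3), each
    followed by 1-D batch normalization and a GELU activation."""
    def __init__(self, in_dim: int):   # 257 (PS-I), 512 (PS-II/III), 768/1024 (SSL)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, kernel_size=3, padding=1),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Conv1d(256, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)             # (batch, in_dim, time) -> (batch, 128, time)

feats = torch.randn(4, 768, 150)       # e.g. a batch of Wav2Vec2 features
print(ConvBlock(768)(feats).shape)     # torch.Size([4, 128, 150])
```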
The output of the Conv block is given to a bottleneck transformer with a hidden dimension of 64, which takes the 128-dimensional input. The bottleneck transformer consists of three blocks. The first block consists of a 2-D convolution layer with an input dimension of 128, a hidden dimension of 64 and a kernel size of 1, followed by a GELU activation (approximating tanh), 2-D Batch Normalization, and a dropout of 0.1. The second block features a multi-head self-attention mechanism that takes a 64-dimensional input, with 8 heads and a 0.2 dropout; this is followed by a 2-D adaptive average pooling layer, which reduces the height and width dimensions to 1×1 while maintaining the batch size and the number of channels, and then by a GELU activation (approximating tanh), 2-D batch normalization, and a 0.1 dropout. The third block includes a 2-D convolutional layer with input dimensions matching the hidden dimension and output dimensions corresponding to the bottleneck transformer's input, with a kernel size of 1, followed by batch normalization. Finally, a residual connection from the bottleneck input is added, and the result is passed through a sigmoid activation.

Following the bottleneck transformer, Dense Block-1 includes a linear layer with a 128-dimensional input and a 32-dimensional output, followed by layer normalization. We then apply a 1-D adaptive average pooling to eliminate the time dimension. Next, Dense Block-2 features a linear layer with a 32-dimensional input and a 1-dimensional output, followed by a sigmoid activation function for the STOI prediction task.

TABLE IV
Performance of the proposed Bottleneck Transformer model on unseen (English, Bengali, Bhojpuri and Hindi) datasets. Each row gives LCC ↑ / SRCC ↑ / MSE ↓.
LibriSpeech (English):
  Wav2Vec2   80.06 ± 2.33 / 80.57 ± 2.61 / 0.023 ± 0.0024
  HuBERT     80.47 ± 0.84 / 80.77 ± 1.29 / 0.023 ± 0.0010
  PS-I       78.25 ± 1.75 / 77.75 ± 1.95 / 0.028 ± 0.0034
  PS-II      78.12 ± 3.39 / 78.18 ± 3.15 / 0.031 ± 0.0038
  PS-III     83.13 ± 1.94 / 82.91 ± 2.30 / 0.023 ± 0.0026
RESPIN (Bengali):
  Wav2Vec2   76.75 ± 0.72 / 76.97 ± 0.88 / 0.023 ± 0.0014
  HuBERT     77.28 ± 1.34 / 77.17 ± 1.46 / 0.024 ± 0.0013
  PS-I       73.55 ± 2.26 / 74.50 ± 2.24 / 0.031 ± 0.004
  PS-II      76.49 ± 1.25 / 77.12 ± 1.22 / 0.028 ± 0.003
  PS-III     81.59 ± 1.09 / 82.50 ± 1.60 / 0.022 ± 0.002
RESPIN (Bhojpuri):
  Wav2Vec2   75.50 ± 1.45 / 75.60 ± 1.38 / 0.026 ± 0.0018
  HuBERT     74.95 ± 1.36 / 75.06 ± 1.59 / 0.027 ± 0.0012
  PS-I       73.25 ± 2.54 / 73.35 ± 2.29 / 0.035 ± 0.0047
  PS-II      74.13 ± 1.77 / 74.70 ± 1.87 / 0.033 ± 0.004
  PS-III     81.34 ± 1.38 / 81.64 ± 1.76 / 0.024 ± 0.002
Bhashini (Hindi):
  Wav2Vec2   90.10 ± 0.96 / 89.80 ± 1.11 / 0.010 ± 0.001
  HuBERT     91.87 ± 0.59 / 91.60 ± 0.58 / 0.008 ± 0.0006
  PS-I       91.31 ± 0.84 / 90.64 ± 1.01 / 0.010 ± 0.0015
  PS-II      89.56 ± 1.25 / 89.54 ± 1.40 / 0.014 ± 0.004
  PS-III     94.07 ± 0.54 / 94.02 ± 0.62 / 0.006 ± 0.001

TABLE V
Performance of the baseline STOI-Net model on unseen (English, Bengali, Bhojpuri and Hindi) datasets. Each row gives LCC ↑ / SRCC ↑ / MSE ↓.
LibriSpeech (English):
  Wav2Vec2   78.50 ± 0.82 / 80.10 ± 1.22 / 0.025 ± 0.0022
  HuBERT     78.62 ± 1.80 / 80.12 ± 2.72 / 0.022 ± 0.003
  PS-II      77.25 ± 2.25 / 76.90 ± 2.59 / 0.023 ± 0.0021
  PS-III     62.20 ± 6.80 / 61.19 ± 8.84 / 0.037 ± 0.006
RESPIN (Bengali):
  Wav2Vec2   75.54 ± 1.64 / 76.27 ± 1.73 / 0.026 ± 0.0017
  HuBERT     73.18 ± 4.15 / 73.93 ± 4.85 / 0.027 ± 0.0054
  PS-II      77.74 ± 2.08 / 77.41 ± 2.53 / 0.021 ± 0.0021
  PS-III     56.75 ± 9.05 / 58.44 ± 8.70 / 0.038 ± 0.0078
RESPIN (Bhojpuri):
  Wav2Vec2   72.43 ± 1.61 / 73.56 ± 1.43 / 0.031 ± 0.0021
  HuBERT     69.99 ± 3.57 / 72.11 ± 4.28 / 0.030 ± 0.0047
  PS-II      76.17 ± 2.31 / 75.21 ± 2.71 / 0.024 ± 0.0023
  PS-III     52.87 ± 7.70 / 54.82 ± 8.01 / 0.042 ± 0.0069
Bhashini (Hindi):
  Wav2Vec2   91.14 ± 1.37 / 91.24 ± 1.45 / 0.009 ± 0.0017
  HuBERT     91.88 ± 0.29 / 92.43 ± 0.32 / 0.009 ± 0.0004
  PS-II      93.32 ± 0.78 / 92.98 ± 0.96 / 0.007 ± 0.0009
  PS-III     78.95 ± 9.09 / 82.14 ± 6.79 / 0.026 ± 0.015

[Fig. 2. Predicted STOI vs. actual STOI for various SNR bins when using Wav2Vec2 features on the Seen test data.]

We conducted the experiments for the proposed method in PyTorch and for the baseline methods in TensorFlow.
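The following is a simplified PyTorch sketch of one bottleneck transformer block under the description above. It flattens the paper's 2-D layers into 1-D operations over (channels, time) and omits the 1×1 adaptive pooling step, so it should be read as an approximation of the design rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class BottleneckTransformerBlock(nn.Module):
    """1x1 conv down-projection -> multi-head self-attention over time ->
    1x1 conv up-projection, with a residual connection and a final sigmoid.
    A simplified 1-D reading of the three blocks described in the text."""
    def __init__(self, dim: int = 128, hidden: int = 64, heads: int = 8):
        super().__init__()
        self.down = nn.Sequential(              # block 1: channel reduction
            nn.Conv1d(dim, hidden, kernel_size=1),
            nn.GELU(approximate="tanh"),
            nn.BatchNorm1d(hidden),
            nn.Dropout(0.1),
        )
        self.mhsa = nn.MultiheadAttention(      # block 2: global aggregation
            hidden, heads, dropout=0.2, batch_first=True)
        self.post = nn.Sequential(
            nn.GELU(approximate="tanh"),
            nn.BatchNorm1d(hidden),
            nn.Dropout(0.1),
        )
        self.up = nn.Sequential(                # block 3: channel restoration
            nn.Conv1d(hidden, dim, kernel_size=1),
            nn.BatchNorm1d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.down(x)                        # (batch, hidden, time)
        seq = h.transpose(1, 2)                 # (batch, time, hidden) for MHSA
        attn, _ = self.mhsa(seq, seq, seq)      # self-attention across time
        h = self.post(attn.transpose(1, 2))
        return torch.sigmoid(self.up(h) + x)    # residual connection + sigmoid

x = torch.randn(4, 128, 150)                    # stand-in for Conv block output
print(BottleneckTransformerBlock()(x).shape)    # torch.Size([4, 128, 150])
```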
We trained for 50 epochs with a learning rate of 0.0001 and the Adam optimizer, using the mean squared error (MSE) as the objective function.

We used three evaluation metrics to assess the models: MSE, the linear correlation coefficient (LCC) and the Spearman rank correlation coefficient (SRCC) [31]. The predictions and comparisons of STOI scores were done at the utterance level.

Furthermore, the performance of the model was analyzed by varying the SNR of the signals for which the STOI score was predicted. The SNR values of the noisy signals in the Seen test set were calculated using the corresponding clean signals from the dataset as a reference. The SNR values were then grouped into six bins: less than 0 dB, 0-5 dB, 5-10 dB, 10-15 dB, 15-20 dB, and more than 20 dB.
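The utterance-level evaluation and SNR binning can be reproduced with standard NumPy/SciPy routines; the helper below is our illustration, with random stand-ins for the per-utterance SNRs and the predicted and ground-truth scores.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred: np.ndarray, true: np.ndarray) -> dict:
    """Utterance-level LCC (Pearson), SRCC (Spearman) and MSE."""
    lcc, _ = pearsonr(pred, true)
    srcc, _ = spearmanr(pred, true)
    mse = float(np.mean((pred - true) ** 2))
    return {"LCC": lcc, "SRCC": srcc, "MSE": mse}

# Group utterances into the six SNR bins described above, then score each bin.
edges = [0.0, 5.0, 10.0, 15.0, 20.0]        # <0, 0-5, 5-10, 10-15, 15-20, >20 dB
snr = np.random.uniform(-5, 25, size=500)   # stand-in per-utterance SNRs
pred = np.random.rand(500)                  # stand-in predicted STOI
true = np.random.rand(500)                  # stand-in ground-truth STOI
bins = np.digitize(snr, edges)              # bin index 0..5 per utterance
for b in range(6):
    mask = bins == b
    if mask.sum() > 1:                      # correlations need >= 2 points
        print(b, evaluate(pred[mask], true[mask]))
```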
TABLE VI
Impact on LCC, SRCC and MSE when different numbers of noises are added in the seen noise scenario
  Features   #Noises   LCC            SRCC           MSE
  Wav2Vec2   1         96.77 ± 0.59   94.55 ± 0.66   0.0033 ± 0.001
             2         93.06 ± 0.54   92.94 ± 0.82   0.0068 ± 0.0007
             3         86.60 ± 0.87   84.84 ± 0.85   0.0095 ± 0.0006
  HuBERT     1         96.30 ± 0.27   92.77 ± 0.42   0.0037 ± 0.0004
             2         93.07 ± 0.38   92.38 ± 0.43   0.0067 ± 0.0004
             3         87.25 ± 0.77   84.76 ± 1.26   0.0090 ± 0.0005
  PS-II      1         95.74 ± 0.60   91.35 ± 0.43   0.0049 ± 0.0011
             2         91.36 ± 0.56   91.24 ± 0.88   0.0084 ± 0.0006
             3         82.62 ± 1.76   80.33 ± 2.22   0.0120 ± 0.0011
  PS-III     1         96.29 ± 0.47   93.31 ± 0.28   0.0036 ± 0.0005
             2         93.18 ± 0.54   93.15 ± 0.64   0.0065 ± 0.0005
             3         84.91 ± 0.86   83.14 ± 1.28   0.0106 ± 0.0005

B. Results and Discussion

In this study, our findings revealed a high correlation between the ground-truth STOI and the predicted STOI scores of the model. The PS-I row is not applicable to the baseline architecture and is therefore not shown in Table III and Table V.

Table III presents the performance on the Seen test set. We compared the performance of two models, STOI-Net and our proposed model, using five different features: Wav2Vec2, HuBERT, PS-I, PS-II and PS-III. The proposed model consistently outperformed the baseline on the Seen set, achieving the highest LCC (93.95 ± 0.26) and SRCC (93.89 ± 0.42) values, along with the lowest MSE (0.0064 ± 0.0003).

Table IV shows the performance of the proposed architecture on the Unseen test sets with the different types of input features, evaluated using the same models and feature extractors. Tables IV and V show that the proposed model generally performed better than STOI-Net even on unseen data, which highlights its robustness and generalizability. The performance with PS-I features is slightly low, probably because the model became too simple (0.31 M parameters) to capture the underlying information present in the spectrogram. By incorporating SSL features, the model demonstrates enhanced performance compared to the baseline. The proposed model shows results similar to the baseline with the PS-II features. The highest performance across the various test sets is observed with the PS-III features. In most cases, the LCC and SRCC values for the proposed model are higher than, or equal to, those of the baseline. Across all languages and feature types, the proposed architecture gives average LCC, SRCC and MSE of 81.108 ± 1.480, 81.249 ± 1.628 and 0.0225 ± 0.0022, respectively, compared to the baseline values of 75.408 ± 3.457, 76.178 ± 3.696 and 0.0249 ± 0.0040.

Detailed noise-specific results (increasing levels of noise contamination) are presented in Table VI for the different feature extraction methods. The observations show that as the number of noises increases, the correlation between the actual and predicted STOI decreases while the MSE increases. This drift in prediction performance is intuitive, because the intelligibility of speech reduces as more noises are added to a signal, drowning out the speech.

TABLE VII
Impact on LCC, SRCC, and MSE across different SNR levels in the seen noise scenario
  SNR (dB)   Features   LCC             SRCC            MSE
  < 0        Wav2Vec2   90.31 ± 0.22    90.75 ± 0.21    0.0095 ± 0.0002
             HuBERT     90.36 ± 0.52    90.48 ± 0.51    0.0094 ± 0.0005
             PS-II      87.05 ± 1.05    86.93 ± 1.33    0.0125 ± 0.0008
             PS-III     89.23 ± 0.25    89.44 ± 0.34    0.0105 ± 0.0002
  0-5        Wav2Vec2   82.24 ± 2.48    70.22 ± 3.87    0.0052 ± 0.0011
             HuBERT     82.23 ± 2.24    67.23 ± 4.05    0.0054 ± 0.0011
             PS-II      80.42 ± 1.87    64.72 ± 1.66    0.0058 ± 0.001
             PS-III     81.01 ± 2.07    69.02 ± 1.88    0.0054 ± 0.0007
  5-10       Wav2Vec2   76.76 ± 4.71    78.93 ± 2.08    0.0047 ± 0.0015
             HuBERT     73.23 ± 8.27    73.55 ± 6.88    0.0051 ± 0.0019
             PS-II      75.90 ± 3.20    76.83 ± 2.58    0.0059 ± 0.0011
             PS-III     80.65 ± 2.89    81.49 ± 2.66    0.0036 ± 0.0006
  10-15      Wav2Vec2   52.99 ± 5.45    56.68 ± 4.34    0.0026 ± 0.0011
             HuBERT     52.94 ± 2.59    46.87 ± 3.68    0.0020 ± 0.0003
             PS-II      58.66 ± 3.55    54.97 ± 4.55    0.0034 ± 0.0014
             PS-III     65.08 ± 3.34    61.13 ± 4.20    0.0018 ± 0.0005
  15-20      Wav2Vec2   49.02 ± 5.19    44.28 ± 7.46    0.0014 ± 0.0007
             HuBERT     50.15 ± 8.74    33.43 ± 9.84    0.0019 ± 0.0005
             PS-II      41.91 ± 11.92   21.92 ± 9.35    0.0038 ± 0.0012
             PS-III     53.15 ± 7.74    35.06 ± 6.09    0.0014 ± 0.0003
  > 20       Wav2Vec2   36.39 ± 8.99    36.58 ± 6.70    0.0017 ± 0.0016
             HuBERT     31.12 ± 7.44    34.76 ± 5.96    0.0013 ± 0.0005
             PS-II       6.10 ± 4.94    24.41 ± 8.65    0.0036 ± 0.001
             PS-III     15.34 ± 6.01    32.51 ± 2.35    0.0017 ± 0.0003

To examine how STOI predictions vary with signal quality, the noisy data from the Indic TIMIT dataset was binned into the SNR bins described above, and correlation scores between the predicted and actual STOI were calculated for each bin and each feature type (Table VII). An interesting observation is that the correlation scores of signals with lower SNR (<10 dB) were higher than those of signals with higher SNR (>20 dB). Initially this seemed to run counter to the other results; however, the plots in Figure 2 reveal the reason. These plots show how the predicted and actual STOI scores vary for signals in different SNR ranges. For lower SNR ranges, the actual and predicted values are spread out and form an approximately linear relationship, which leads to high correlation scores in these bins. As the signal quality improves (higher SNR ranges), the predicted and actual STOI scores concentrate in a smaller region; the absence of any linear-looking relationship causes the correlation to worsen.
This further shows that intelligibility has a complex relationship with both the number of noises in a signal and the signal's SNR.

VI. Conclusions

In this study, we proposed a model architecture to predict STOI scores; it is a nonintrusive speech intelligibility assessment model and does not require the clean reference signal for STOI prediction. We leveraged the bottleneck transformer as the backbone of our model to learn the local and global context present in the data. We used SSL features (Wav2Vec2, HuBERT) and spectral features in our experiments. Experimental results indicate that our proposed model consistently achieved higher LCC and SRCC values across both seen and unseen datasets, demonstrating improved accuracy, reliability, and generalizability. In the future, we plan to use adapter-based fine-tuning of SSL features to enhance the accuracy of STOI predictions.

References

[1] Fu, Szu-Wei and Tsao, Yu and Hwang, Hsin-Te and Wang, Hsin-Min, "Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM," in Proc. Interspeech, 2018.
[2] Rix, A.W. and Beerends, J.G. and Hollier, M.P. and Hekstra, A.P., "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, vol. 2, 2001, pp. 749-752.
[3] Zezario, Ryandhimas E and Fu, Szu-Wei and Fuh, Chiou-Shann and Tsao, Yu and Wang, Hsin-Min, "STOI-Net: A deep learning based non-intrusive speech intelligibility assessment model," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2020, pp. 482-486.
[4] Taal, Cees H and Hendriks, Richard C and Heusdens, Richard and Jensen, Jesper, "A short-time objective intelligibility measure for time-frequency weighted noisy speech," in Proc. ICASSP. IEEE, 2010, pp. 4214-4217.
[5] Zezario, Ryandhimas E and Fu, Szu-Wei and Chen, Fei and Fuh, Chiou-Shann and Wang, Hsin-Min and Tsao, Yu, "Deep learning-based non-intrusive multi-objective speech assessment model with cross-domain features," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54-70, 2022.
[6] Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451-3460, 2021.
[7] Ryandhimas Edo Zezario and Szu-wei Fu and Fei Chen and Chiou-Shann Fuh and Hsin-Min Wang and Yu Tsao, "MTI-Net: A Multi-Target Speech Intelligibility Prediction Model," in Interspeech 2022, 2022, pp. 5463-5467.
[8] Chen-Chou Lo and Szu-Wei Fu and Wen-Chin Huang and Xin Wang and Junichi Yamagishi and Yu Tsao and Hsin-Min Wang, "MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion," in Proc. Interspeech, 2019, pp. 1541-1545.
[9] Leng, Yichong and Tan, Xu and Zhao, Sheng and Soong, Frank and Li, Xiang-Yang and Qin, Tao, "MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network," in Proc. ICASSP, 2021, pp. 391-395.
[10] Lin, Guojian and Tsao, Yu and Chen, Fei, "A Non-Intrusive Speech Quality Assessment Model using Whisper and Multi-Head Attention," in 2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2024, pp. 1-6.
[11] Kates, James M and Arehart, Kathryn H, "The hearing-aid speech quality index (HASQI)," Journal of the Audio Engineering Society, vol. 58, no. 5, pp. 363-381, 2010.
[12] ——, "The Hearing-Aid Speech Perception Index (HASPI)," Speech Communication, vol. 65, pp. 75-93, 2014.
[13] Ryandhimas Edo Zezario and Fei Chen and Chiou-Shann Fuh and Hsin-Min Wang and Yu Tsao, "MBI-Net: A Non-Intrusive Multi-Branched Speech Intelligibility Prediction Model for Hearing Aids," in Proc. Interspeech, 2022, pp. 3944-3948.
[14] Ryandhimas E. Zezario and Fei Chen and Chiou-Shann Fuh and Hsin-Min Wang and Yu Tsao, "Non-Intrusive Speech Intelligibility Prediction for Hearing Aids using Whisper and Metadata," in Proc. Interspeech, 2024, pp. 3844-3848.
[15] Srinivas, Aravind and Lin, Tsung-Yi and Parmar, Niki and Shlens, Jonathon and Abbeel, Pieter and Vaswani, Ashish, "Bottleneck Transformers for Visual Recognition," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16514-16524.
[16] Yarra, Chiranjeevi and Aggarwal, Ritu and Rajpal, Avni and Ghosh, Prasanta Kumar, "Indic TIMIT and Indic English lexicon: A speech database of Indian speakers using TIMIT stimuli and a lexicon from their mispronunciations," in Proc. O-COCOSDA, 2019, pp. 1-6.
[17] Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206-5210.
[18] Udupa, Sathvik and Bandekar, Jesuraja and Deekshitha, G and Kumar, Saurabh and Ghosh, Prasanta Kumar and Badiger, Sandhya and Singh, Abhayjeet and Murthy, Savitha and Pai, Priyanka and Raghavan, Srinivasa and Nanavati, Raoul, "Gated Multi Encoders and Multitask Objectives for Dialectal Speech Recognition in Indian Languages," in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023, pp. 1-8.
[19] Ghosh, Prasanta Kumar and Tsiartas, Andreas and Narayanan, Shrikanth, "Robust voice activity detection using long-term signal variability," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 600-613, 2010.
[20] Chanwoo Kim and Richard M. Stern, "Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis," in Proc. Interspeech, 2008, pp. 2598-2601.
[21] Eaton, James and Gaubitch, Nikolay D. and Moore, Alastair H. and Naylor, Patrick A., "Estimation of Room Acoustic Parameters: The ACE Challenge," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 10, pp. 1681-1693, 2016.
[22] David Snyder and Guoguo Chen and Daniel Povey, "MUSAN: A Music, Speech, and Noise Corpus," 2015.
[23] Mogridge, Rhiannon and Close, George and Sutherland, Robert and Hain, Thomas and Barker, Jon and Goetze, Stefan and Ragni, Anton, "Non-Intrusive Speech Intelligibility Prediction for Hearing-Impaired Users Using Intermediate ASR Features and Human Memory Models," in Proc. ICASSP, 2024, pp. 306-310.
[24] Szymon Drgas, "Speech intelligibility prediction using generalized ESTOI with fine-tuned parameters," Speech Communication, vol. 159, p. 103068, 2024. [Online].
Available: https://www.sciencedirect.com/science/article/pii/S0167639324000402
[25] Zhou, Xiajie and Mawalim, Candy Olivia and Unoki, Masashi, "Speech Intelligibility Prediction Using Binaural Processing for Hearing Loss," IEEE Access, pp. 1-1, 2025.
[26] Chen, Sanyuan and Wang, Chengyi and Chen, Zhengyang and Wu, Yu and Liu, Shujie and Chen, Zhuo and Li, Jinyu and Kanda, Naoyuki and Yoshioka, Takuya and Xiao, Xiong and Wu, Jian and Zhou, Long and Ren, Shuo and Qian, Yanmin and Qian, Yao and Wu, Jian and Zeng, Michael and Yu, Xiangzhan and Wei, Furu, "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505-1518, 2022.
[27] Bastiaan Tamm and Helena Balabin and Rik Vandenberghe and Hugo Van hamme, "Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications," in Proc. Interspeech, 2022, pp. 4083-4087.
[28] Cooper, Erica and Huang, Wen-Chin and Toda, Tomoki and Yamagishi, Junichi, "Generalization Ability of MOS Prediction Networks," in Proc. ICASSP, 2022, pp. 8442-8446.
[29] Baevski, Alexei and Zhou, Yuhao and Mohamed, Abdelrahman and Auli, Michael, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449-12460, 2020.
[30] Dan Hendrycks and Kevin Gimpel, "Gaussian Error Linear Units (GELUs)," 2023. [Online]. Available: https://arxiv.org/abs/1606.08415
[31] Nicholas de Klerk, "Commentary: Spearman's 'The proof and measurement of association between two things'," International Journal of Epidemiology, vol. 39, pp. 1159-1161, 2010. [Online]. Available: https://api.semanticscholar.org/CorpusID:196423054