Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition
In this paper, a modification to the training process of the popular SPLICE algorithm has been proposed for noise robust speech recognition. The modification is based on feature correlations, and enables this stereo-based algorithm to improve the per…
Authors: D. S. Pavan Kumar, N. Vishnu Prasad, Vikas Joshi
MODIFIED SPLICE AND ITS EXTENSION TO NON-STEREO DATA FOR NOISE ROBUST SPEECH RECOGNITION

D. S. Pavan Kumar (1), N. Vishnu Prasad (1), Vikas Joshi (1,2), S. Umesh (1)
(1) Department of Electrical Engineering, Indian Institute of Technology Madras, India
(2) IBM India Research Labs, India

ABSTRACT

In this paper, a modification to the training process of the popular SPLICE algorithm has been proposed for noise robust speech recognition. The modification is based on feature correlations, and enables this stereo-based algorithm to improve the performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets where clean and noisy training utterances, but not stereo counterparts, are required. Finally, an MLLR-based computationally efficient run-time noise adaptation method in SPLICE framework has been proposed. The modified SPLICE shows 8.6% absolute improvement over SPLICE in Test C of Aurora-2 database, and 2.93% overall. Non-stereo method shows 10.37% and 6.93% absolute improvements over Aurora-2 and Aurora-4 baseline models respectively. Run-time adaptation shows 9.89% absolute improvement in modified framework as compared to SPLICE for Test C, and 4.96% overall w.r.t. standard MLLR adaptation on HMMs.

Index Terms: Robust speech recognition, SPLICE, stereo data, feature normalisation, MFCC.

1. INTRODUCTION

The goal of robust speech recognition is to build systems that can work under different noisy environment conditions. Due to the acoustic mismatch between training and test conditions, the performance degrades under noisy environments. Model Adaptation and Feature Compensation are two classes of techniques that address this problem.
The former methods adapt the trained models to match the environment, and the latter methods compensate the noisy features, the clean features, or both, so that they have similar characteristics.

Stereo based piece-wise linear compensation for environments (SPLICE) is a popular and efficient noise robust feature enhancement technique. It partitions the noisy feature space into $M$ classes, and learns a linear transformation based noise compensation for each partition class during training, using stereo data. Any test vector $y$ is soft-assigned to one or more classes by computing $p(m \mid y)$ ($m = 1, 2, \ldots, M$), and is compensated by applying the weighted combination of linear transformations to get the cleaned version $\hat{x}$:

$$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\,(A_m y + b_m) \tag{1}$$

$A_m$ and $b_m$ are estimated during training using stereo data. The training noisy vectors $\{y\}$ are modelled using a Gaussian mixture model (GMM) $p(y)$ of $M$ mixtures, and $p(m \mid y)$ is calculated for a test vector as a set of posterior probabilities w.r.t. the GMM $p(y)$. Thus the partition class is decided by the mixture assignments $p(m \mid y)$.

This work was supported under the SERC project funding SR/S3/EECE/050/2013 of Department of Science and Technology, India.

Over the last decade, techniques such as maximum mutual information based training [1], speaker normalisation [2] and uncertainty decoding [3] were introduced in the SPLICE framework. SPLICE has two disadvantages. First, the algorithm fails when the test noise condition is not seen during training. Second, owing to its requirement of stereo data for training, the usage of the technique is quite restricted. So there is an interest in addressing these issues.

In a recent work [4], an adaptation framework using Eigen-SPLICE was proposed to address the problems of unseen noise conditions.
The method involves preparation of quasi stereo data using the noise frames extracted from non-speech portions of the test utterances. For this, the recognition system is required to have access to some clean training utterances for performing run-time adaptation.

In [5], a stereo-based feature compensation method was proposed. Clean and noisy feature spaces were partitioned into vector quantised (VQ) regions. The stereo vector pairs belonging to the $i$-th VQ region in clean space and the $j$-th VQ region in noisy space are classified to the $ij$-th sub-region. Transformations based on the Gaussian whitening expression were estimated from every noisy sub-region to every clean sub-region. But it is not always guaranteed that there is enough data to estimate a full transformation matrix from each sub-region to the other.

In this paper, we propose a simple modification based on an assumption made by SPLICE on the correlation of training stereo data, which improves the performance in unseen noise conditions. This method does not need any adaptation data, in contrast to the recent work proposed in literature [4]. We call this method modified SPLICE (M-SPLICE). We also extend M-SPLICE to work for datasets that are not stereo recorded, with minimal performance degradation as compared to conventional SPLICE. Finally, we use an MLLR based run-time noise adaptation framework, which is computationally efficient and achieves better results than MLLR HMM-adaptation. This method is done on 13 dimensional MFCCs and does not require two-pass Viterbi decoding, in contrast to conventional MLLR done on 39 dimensions.
The rest of the paper is organised as follows: a review of SPLICE is given in Section 2, the proposed modification to SPLICE is presented in Section 3, the extension to non-stereo datasets is explained in Section 4, run-time noise adaptation is described in Section 5, experiments and results are presented in Section 6, a detailed discussion and comparison of existing versus proposed techniques is given in Section 7, and the paper is concluded in Section 8 indicating possible future extensions.

2. REVIEW OF SPLICE

As discussed in the introduction, the SPLICE algorithm makes the following two assumptions:

1. The noisy features $\{y\}$ follow a Gaussian mixture density of $M$ modes

$$p(y) = \sum_{m=1}^{M} P(m)\,p(y \mid m) = \sum_{m=1}^{M} \pi_m\,\mathcal{N}\!\left(y;\ \mu_{y,m}, \Sigma_{y,m}\right) \tag{2}$$

2. The conditional density $p(x \mid y, m)$ is the Gaussian

$$p(x \mid y, m) \sim \mathcal{N}\!\left(x;\ A_m y + b_m,\ \Sigma_{x,m}\right) \tag{3}$$

where $\{x\}$ are the clean features.

Thus, $A_m$ and $b_m$ parameterise the mixture specific linear transformations on the noisy vector $y$. Here $y$ and $m$ are independent variables, and $x$ is dependent on them. The estimate of the cleaned feature $\hat{x}$ can be obtained in the MMSE framework as shown in Eq. (1).

The derivation of SPLICE transformations is briefly discussed next. Let $W_m = \begin{bmatrix} b_m & A_m \end{bmatrix}$ and $y' = \begin{bmatrix} 1 & y^T \end{bmatrix}^T$. Using $N$ independent pairs of stereo training features $\{(x_n, y_n)\}$ and maximising the joint log-likelihood

$$\mathcal{L} = \sum_{n=1}^{N} \log p(x_n, y_n) = \sum_{n=1}^{N} \sum_{m=1}^{M} \log\left[\,p(x_n \mid y_n, m)\,p(y_n \mid m)\,P(m)\,\right] \tag{4}$$

yields

$$W_m = \left[\sum_{n=1}^{N} p(m \mid y_n)\,x_n\,y_n'^{\,T}\right]\left[\sum_{n=1}^{N} p(m \mid y_n)\,y_n'\,y_n'^{\,T}\right]^{-1} \tag{5}$$

Alternatively, sub-optimal update rules that estimate $b_m$ and $A_m$ separately can be derived by initially assuming $A_m$ to be the identity matrix while estimating $b_m$, and then using this $b_m$ to estimate $A_m$.

A perfect correlation between $x$ and $y$ is assumed, and the following approximation is used in deriving Eq. (5) [6].
$$p(m \mid x_n, y_n) \approx p(m \mid x_n) \approx p(m \mid y_n) \tag{6}$$

Given mixture index $m$, Eq. (5) can be shown to give the MMSE estimator $\hat{x}_m = A_m y + b_m$ [7], given by

$$\hat{x}_m = \mu_{x,m} + \Sigma_{xy,m}\,\Sigma_{y,m}^{-1}\left(y - \mu_{y,m}\right) \tag{7}$$

where

$$\mu_{x,m} = \frac{\sum_{n=1}^{N} p(m \mid y_n)\,x_n}{\sum_{n=1}^{N} p(m \mid y_n)}, \qquad \mu_{y,m} = \frac{\sum_{n=1}^{N} p(m \mid y_n)\,y_n}{\sum_{n=1}^{N} p(m \mid y_n)} \tag{8}$$

$$\Sigma_{xy,m} = \frac{\sum_{n=1}^{N} p(m \mid y_n)\,x_n y_n^T}{\sum_{n=1}^{N} p(m \mid y_n)}, \qquad \Sigma_{y,m} = \frac{\sum_{n=1}^{N} p(m \mid y_n)\,y_n y_n^T}{\sum_{n=1}^{N} p(m \mid y_n)} \tag{9}$$

i.e., the alignments $p(m \mid y_n)$ are being used in place of $p(m \mid x_n)$ and $p(m \mid x_n, y_n)$ in Eqs. (8) and (9) respectively. Thus from Eq. (7),

$$A_m = \Sigma_{xy,m}\,\Sigma_{y,m}^{-1} \tag{10}$$

$$b_m = \mu_{x,m} - A_m\,\mu_{y,m} \tag{11}$$

To reduce the number of parameters, a simplified model with only bias $b_m$ is proposed in literature [7]. A diagonal version of Eq. (7) can be written as

$$\hat{x}_c = \mu_{x,c} + \frac{\sigma^2_{xy,c}}{\sigma^2_{y,c}}\left(y - \mu_{y,c}\right) \tag{12}$$

where $c$ runs along all components of the features and all mixtures. Since this method does not capture all the correlations, it suffers from performance degradation. This shows that noise has a significant effect on feature correlations.

3. PROPOSED MODIFICATION TO SPLICE

SPLICE assumes that a perfect correlation exists between clean and noisy stereo features (Eq. (6)), which makes the implementation simple [6]. However, the actual feature correlations $\Sigma_{xy,m}$ are used to train SPLICE parameters, as seen in Eq. (10). If instead the training process also assumes perfect correlation and eliminates the term $\Sigma_{xy,m}$ during parameter estimation, it complies with the assumptions and gives improved performance. This simple modification can be done as follows. Eq. (12) can be rewritten as

$$\frac{\hat{x} - \mu_x}{\sigma_x} = \frac{\sigma^2_{xy}}{\sigma_x \sigma_y}\cdot\frac{y - \mu_y}{\sigma_y} = \rho\,\frac{y - \mu_y}{\sigma_y}$$

where $\rho = \dfrac{\sigma^2_{xy}}{\sigma_x \sigma_y}$ is the correlation coefficient. A perfect correlation implies $\rho = 1$. Since Eq. (6) makes this assumption, we enforce it in the above equation and obtain

$$\hat{x}_c = \mu_{x,c} + \frac{\sigma_{x,c}}{\sigma_{y,c}}\left(y - \mu_{y,c}\right)$$

Similarly, for the multidimensional case, the matrix $\Sigma_{x,m}^{-\frac{1}{2}}\,\Sigma_{xy,m}\,\Sigma_{y,m}^{-\frac{1}{2}}$ should be enforced to be identity as per the assumption. Thus, we obtain

$$\hat{x}_m = \mu_{x,m} + \Sigma_{x,m}^{\frac{1}{2}}\,\Sigma_{y,m}^{-\frac{1}{2}}\left(y - \mu_{y,m}\right) \tag{13}$$

Hence M-SPLICE and its updates are defined as

$$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\,(C_m y + d_m) \tag{14}$$

$$C_m = \Sigma_{x,m}^{\frac{1}{2}}\,\Sigma_{y,m}^{-\frac{1}{2}} \tag{15}$$

$$d_m = \mu_{x,m} - C_m\,\mu_{y,m} \tag{16}$$

All the assumptions of conventional SPLICE are valid for M-SPLICE. Comparing both methods, it can be seen from Eqs. (7) and (15) that while $A_m$ is obtained using the MMSE estimation framework, $C_m$ is based on the whitening expression. Also, $A_m$ involves the cross-covariance term $\Sigma_{xy,m}$, whereas $C_m$ does not. The bias terms are computed in the same manner, using their respective transformation matrices, as seen in Eqs. (11) and (16). More analysis on M-SPLICE is given in Section 4.1.

[Fig. 1: Estimation of piecewise linear transformations. (a) M-SPLICE; (b) Non-Stereo Method.]

3.1. Training

The estimation procedure of M-SPLICE transformations is shown in Figure 1a. The steps are summarised as follows:

1. Build the noisy GMM (see footnote 1) $p(y)$ using noisy features $\{y_n\}$ of stereo data. This gives $\mu_{y,m}$ and $\Sigma_{y,m}$.
2. For every noisy frame $y_n$, compute the alignment w.r.t. the noisy GMM, i.e., $p(m \mid y_n)$.
3. Using the alignments of stereo counterparts, compute the means $\mu_{x,m}$ and covariance matrices $\Sigma_{x,m}$ of each clean mixture from clean data $\{x_n\}$.
4. Compute $C_m$ and $d_m$ using Eqs. (15) and (16).

Footnote 1: We use the term noisy mixture to denote a Gaussian mixture built using noisy data. Similar meanings apply to clean mixture, noisy GMM and clean GMM.

3.2. Testing

The testing process of M-SPLICE is exactly the same as that of conventional SPLICE, and is summarised as follows:

1. For each test vector $y$, compute the alignment w.r.t. the noisy GMM, i.e., $p(m \mid y)$.
2. Compute the cleaned version as:

$$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\,(C_m y + d_m)$$

4. NON-STEREO EXTENSION

In this section, we motivate how M-SPLICE can be extended to datasets which are not stereo recorded. However, some noisy training utterances, which are not necessarily the stereo counterparts of the clean data, are required.

4.1. Motivation

Consider a stereo dataset of $N$ training frames $(x_n, y_n)$. Suppose two $M$-mixture GMMs $p(x)$ and $p(y)$ are independently built using $\{x_n\}$ and $\{y_n\}$ respectively, and each data point is hard-clustered to the mixture giving the highest probability. We are interested in analysing a matrix $V_{M \times M}$, built as

$$V_{ij} = \sum_{n=1}^{N} \mathbb{1}\left(x_n \in i,\ y_n \in j\right)$$

where $\mathbb{1}(\cdot)$ is the indicator function. In other words, while parsing the stereo training data, when a stereo pair with the clean part belonging to the $i$-th clean mixture and the noisy part to the $j$-th noisy mixture is encountered, the $ij$-th element of the matrix is incremented by unity. Thus each $ij$-th element of the matrix denotes the number of stereo pairs belonging to the ($i$-th clean, $j$-th noisy) mixture pair. When data are soft assigned to all the mixtures, the matrix can instead be built as:

$$V_{ij} = \sum_{n=1}^{N} p(i \mid x_n)\,p(j \mid y_n)$$

Figure 2a visualises such a matrix built using Aurora-2 stereo training data using 128 mixture models.
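The two definitions of $V$ above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: it assumes the posteriors $p(i \mid x_n)$ and $p(j \mid y_n)$ have already been computed and stored row-wise, and the function and argument names (`hard_cooccurrence`, `soft_cooccurrence`, `post_clean`, `post_noisy`) are hypothetical.

```python
import numpy as np

def hard_cooccurrence(post_clean, post_noisy):
    """Hard-assignment V: V[i, j] counts stereo pairs whose clean frame falls in
    clean mixture i and whose noisy frame falls in noisy mixture j.

    post_clean: (N, M_clean) array of p(i | x_n)
    post_noisy: (N, M_noisy) array of p(j | y_n)
    """
    i_idx = post_clean.argmax(axis=1)   # most probable clean mixture per frame
    j_idx = post_noisy.argmax(axis=1)   # most probable noisy mixture per frame
    V = np.zeros((post_clean.shape[1], post_noisy.shape[1]))
    np.add.at(V, (i_idx, j_idx), 1.0)   # increment V[i, j] once per stereo pair
    return V

def soft_cooccurrence(post_clean, post_noisy):
    """Soft-assignment V: V[i, j] = sum_n p(i | x_n) * p(j | y_n)."""
    return post_clean.T @ post_noisy

# Tiny made-up example: 3 stereo frames, 2 mixtures on each side.
post_clean = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
post_noisy = np.array([[0.6, 0.4], [0.1, 0.9], [0.3, 0.7]])
V_hard = hard_cooccurrence(post_clean, post_noisy)
V_soft = soft_cooccurrence(post_clean, post_noisy)
```

Both matrices sum to the number of frames $N$, since every posterior row sums to one; a strongly diagonal $V$ indicates that the clean and noisy mixtures correspond one-to-one.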
A dark spot in the plot of Figure 2a represents a higher data count, i.e., a bulk of the stereo data points belong to that mixture-pair.

In conventional SPLICE and M-SPLICE, only the noisy GMM $p(y)$ is built, and not $p(x)$. The alignments $p(m \mid y_n)$ are computed for every noisy frame, and the same alignments are assumed for the clean frames $\{x_n\}$ while computing $\mu_{x,m}$ and $\Sigma_{x,m}$. Hence $\mu_{x,m}$, $\Sigma_{x,m}$ and $p(m \mid y)$ can be considered as the parameters of a hypothetical clean GMM $p(x)$. Now, given these GMMs $p(y)$ and $p(x)$, the matrix $V$ can be constructed, which is visualised in Figure 2b. Since the alignments are the same, and the $i$-th clean mixture corresponds to the $i$-th noisy mixture, a diagonal pattern can be seen. Thus, under the assumption of Eq. (6), conventional SPLICE and M-SPLICE are able to estimate transforms from the $i$-th noisy mixture to exactly the $i$-th clean mixture by maintaining the mixture-correspondence.

When stereo data is not available, such exact mixture correspondence does not exist. Figure 2a makes this fact evident, since the stereo property was not used while building the two independent GMMs. However, a sparse structure can be seen, which suggests that for most noisy mixtures $j$, there exists a unique clean mixture $i^*$ having the highest mixture-correspondence. This property can be exploited to estimate piecewise linear transformations from every mixture $j$ of $p(y)$ to a single mixture $i^*$ of $p(x)$, ignoring all other mixtures $i \neq i^*$. This is the basis for the proposed extension to non-stereo data.

4.2. Implementation

In the absence of stereo data, our approach is to build two separate GMMs, viz., clean and noisy, during training, such that there exists a mixture-to-mixture correspondence between them, as close to Fig. 2b as possible. Then whitening based transforms can be estimated from each noisy mixture to its corresponding clean mixture.
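As a rough illustration of the whitening based transform of Eqs. (15) and (16) for one mixture pair, the sketch below computes $C_m$ and $d_m$ from the two Gaussians' means and covariances. It assumes the symmetric (eigendecomposition based) matrix square root, which the text does not specify, and the helper names `sym_sqrt` and `whitening_transform` are hypothetical.

```python
import numpy as np

def sym_sqrt(cov, inverse=False):
    """Symmetric square root (or inverse square root) of a symmetric
    positive-definite matrix via its eigendecomposition.
    Assumption: the paper does not say which matrix square root is used."""
    eigval, eigvec = np.linalg.eigh(cov)
    eigval = np.clip(eigval, 1e-12, None)      # floor tiny eigenvalues for stability
    power = -0.5 if inverse else 0.5
    return (eigvec * eigval ** power) @ eigvec.T

def whitening_transform(mu_x, cov_x, mu_y, cov_y):
    """Return (C, d) with C = cov_x^{1/2} cov_y^{-1/2} and d = mu_x - C mu_y,
    i.e. Eqs. (15) and (16) for a single noisy/clean mixture pair."""
    C = sym_sqrt(cov_x) @ sym_sqrt(cov_y, inverse=True)
    d = mu_x - C @ mu_y
    return C, d

# Made-up Gaussians standing in for one clean and one noisy mixture.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4)); cov_x = a @ a.T + 4 * np.eye(4)
b = rng.normal(size=(4, 4)); cov_y = b @ b.T + 4 * np.eye(4)
mu_x, mu_y = rng.normal(size=4), rng.normal(size=4)
C, d = whitening_transform(mu_x, cov_x, mu_y, cov_y)
```

By construction, the affine map $y \mapsto C y + d$ sends the noisy mixture's mean and covariance exactly onto those of the clean mixture ($C \mu_y + d = \mu_x$ and $C \Sigma_y C^T = \Sigma_x$), and no cross-covariance between clean and noisy data enters the computation.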
Such an extension to non-stereo data is not obvious in the conventional SPLICE framework, since it is not straightforward to compute the cross-covariance terms $\Sigma_{xy,m}$ without using stereo data. Also, M-SPLICE is expected to work better than SPLICE due to its advantages described earlier.

[Fig. 2: Mixture assignment distribution plots for Aurora-2 stereo training data (axes: noisy mixture index and clean mixture index). (a) Separately built noisy and clean GMMs; (b) GMMs of SPLICE and M-SPLICE; (c) Noisy GMM and MLLR-EM based clean GMM.]

The training approach for the two mixture-corresponded GMMs is as follows:

1. After building the noisy GMM $p(y)$, it is mean adapted by estimating a global MLLR transformation using clean training data. The transformed GMM has the same covariances and weights, and only the means are altered to match the clean data. By this process, the mixture correspondences are not lost.
2. However, the transformed GMM need not model the clean data accurately. So a few steps of expectation maximisation (EM) are performed using clean training data, initialising with the transformed GMM. This adjusts all the parameters and gives a more accurate representation of the clean GMM $p(x)$.

The matrix $V$ obtained through this method using Aurora-2 training data is visualised in Figure 2c. Note that no stereo information has been used while obtaining $p(x)$ from $p(y)$ through the above steps. It can be observed that a diagonal pattern is retained, as in the case of M-SPLICE, though there are some outliers. Since stereo information is not used, only comparable performance can be achieved. Figure 1b shows the block diagram of estimating the transformations of the non-stereo method.
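The global MLLR mean adaptation in step 1 above can be sketched with the standard row-wise closed-form solution for a single regression class. The sketch makes assumptions the text does not state: diagonal covariances in the GMM being adapted, and mixture occupation posteriors of the adaptation data computed beforehand. The names `estimate_global_mllr_mean`, `adapt_means`, `gamma` and `obs` are hypothetical, and the EM refinement of step 2 is not shown.

```python
import numpy as np

def estimate_global_mllr_mean(means, variances, gamma, obs):
    """Estimate one global MLLR mean transform W of shape (D, D+1) such that the
    adapted mean of mixture m is W @ [1, mu_m].

    means:     (M, D) GMM means before adaptation
    variances: (M, D) diagonal covariances (assumption: diagonal)
    gamma:     (T, M) occupation posteriors of the adaptation frames
    obs:       (T, D) adaptation frames
    """
    n_mix, dim = means.shape
    xi = np.hstack([np.ones((n_mix, 1)), means])   # extended means [1, mu_m]
    occ = gamma.sum(axis=0)                        # sum_t gamma_m(t)
    first = gamma.T @ obs                          # sum_t gamma_m(t) * o_t
    W = np.zeros((dim, dim + 1))
    for i in range(dim):                           # each row is solved independently
        inv_var = 1.0 / variances[:, i]
        G = (xi * (occ * inv_var)[:, None]).T @ xi # accumulator G_i, (D+1, D+1)
        k = (inv_var * first[:, i]) @ xi           # accumulator k_i, (D+1,)
        W[i] = np.linalg.solve(G, k)
    return W

def adapt_means(means, W):
    """Apply the global transform to every mixture mean; covariances and
    weights are left untouched."""
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T

# Synthetic check: frames sit exactly on affinely shifted means, so W is recovered.
rng = np.random.default_rng(0)
n_mix, dim = 8, 3
means = rng.normal(size=(n_mix, dim))
variances = rng.uniform(0.5, 2.0, size=(n_mix, dim))
A_true = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b_true = rng.normal(size=dim)
labels = np.arange(40) % n_mix
obs = means[labels] @ A_true.T + b_true
gamma = np.eye(n_mix)[labels]                      # hard (one-hot) occupations
W = estimate_global_mllr_mean(means, variances, gamma, obs)
adapted = adapt_means(means, W)
```

Because a single transform is shared by all mixtures and only the means move, the index of each mixture is preserved, which is what keeps the noisy-to-clean mixture correspondence intact before the EM refinement.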
The steps of the non-stereo method are summarised as follows:

1. Build the noisy GMM $p(y)$ using noisy features $\{y\}$. This gives $\mu_{y,m}$ and $\Sigma_{y,m}$.
2. Adapt the means of the noisy GMM $p(y)$ to clean data $\{x\}$ using a global MLLR transformation.
3. Perform at least three EM iterations to refine the adapted GMM using clean data. This gives $p(x)$, and thus $\mu_{x,m}$ and $\Sigma_{x,m}$.
4. Compute $C_m$ and $d_m$ using Eqs. (15) and (16).

The testing process is exactly the same as that of M-SPLICE, as explained in Section 3.2.

5. ADDITIONAL RUN-TIME ADAPTATION

To improve the performance of the proposed methods during run-time, GMM adaptation to the test condition can be done in both conventional SPLICE and M-SPLICE frameworks in a simple manner. Conventional MLLR adaptation on HMMs involves two-pass recognition, where the transformation matrices are estimated using the alignments obtained through the first pass Viterbi-decoded output, and a final recognition is performed using the transformed models.

MLLR adaptation can be used to adapt GMMs in the context of SPLICE and M-SPLICE as follows:

1. Adapt the noisy GMM through a global MLLR mean transformation

$$\mu^{(a)}_{y,m} \leftarrow \mu_{y,m}$$

2. Now, adjust the bias term in conventional SPLICE or M-SPLICE as

$$d^{(a)}_m = \mu_{x,m} - C_m\,\mu^{(a)}_{y,m} \tag{17}$$

This method involves only a simple calculation of alignments of the test data w.r.t. the noisy GMM, and does not need Viterbi decoding. The clean mixture means $\mu_{x,m}$ computed during training need to be stored. A separate global MLLR mean transform can be estimated using test utterances belonging to each noise condition.

The steps of the testing process for run-time compensation are summarised as follows:

1. For all test vectors $\{y\}$ belonging to a particular environment, compute the alignments w.r.t. the noisy GMM, i.e., $p(m \mid y)$.
2. Estimate a global MLLR mean transformation using $\{y\}$, maximising the likelihood w.r.t. $p(y)$.
3. Compute the adapted noisy GMM $p^{(a)}(y)$ using the estimated MLLR transform. Only the means $\mu_{y,m}$ of the noisy GMM would have been adapted, as $\mu^{(a)}_{y,m}$.
4. Using Eq. (17), recompute the bias term of SPLICE or M-SPLICE.
5. Compute the cleaned test vectors as

$$\hat{x} = \sum_{m=1}^{M} p(m \mid y)\left(C_m y + d^{(a)}_m\right)$$

Table 1: Results on Aurora-2 Database

(a) Comparison of SPLICE, M-SPLICE and non-stereo methods

Noise Level   Baseline   SPLICE   M-SPLICE   Non-Stereo Method
Clean         99.25      98.97    99.01      99.08
SNR 20        97.35      97.84    97.92      97.68
SNR 15        93.43      95.81    96.10      95.15
SNR 10        80.62      89.48    91.03      87.37
SNR 5         51.87      72.71    77.59      68.49
SNR 0         24.30      42.85    50.72      39.00
SNR -5        12.03      18.52    22.27      16.73
Test A        67.45      81.39    83.47      77.44
Test B        72.26      83.24    84.18      79.63
Test C        68.14      69.42    78.06      73.54
Overall       69.51      79.74    82.67      77.54

(b) Comparison of adaptation methods (RTA = Run-time Adaptation; rows follow the noise levels of Table 1a)

Noise Level   MLLR (39)   SPLICE + RTA   M-SPLICE + RTA   Non-Stereo Method + RTA
Clean         99.28       99.05          99.02            99.08
SNR 20        98.33       97.96          98.18            97.77
SNR 15        96.82       96.21          96.87            95.47
SNR 10        91.88       90.61          93.10            88.80
SNR 5         73.88       75.05          82.00            72.36
SNR 0         41.94       46.27          57.51            44.98
SNR -5        18.71       20.10          27.32            20.43
Test A        79.31       82.45          86.47            80.12
Test B        82.55       84.09          85.91            81.67
Test C        79.14       73.01          82.90            75.79
Overall       80.57       81.22          85.53            79.88

Table 2: Results on Aurora-4 Database (the Average column is taken over both microphones)

Method              Mic     Clean   Car     Babble   Street   Restaurant   Airport   Station   Average
Baseline            Mic-1   87.63   75.58   52.77    52.83    46.53        56.38     45.30     54.73
Baseline            Mic-2   77.40   64.39   45.15    42.03    36.26        47.69     36.32
Non-Stereo Method   Mic-1   86.85   77.71   62.62    58.96    55.93        61.95     55.37     61.66
Non-Stereo Method   Mic-2   79.10   68.58   55.24    51.67    45.88        55.45     47.88

6. EXPERIMENTAL SETUP

The Aurora-2 task of 8 kHz sampling frequency [8] has been used to perform a comparative study of the proposed techniques with the existing ones. Aurora-2 consists of connected spoken digits with stereo training data. The test set consists of utterances of ten different environments, each at seven distinct SNR levels.
The acoustic word models for each digit have been built using left to right continuous density HMMs with 16 states and 3 diagonal covariance Gaussian mixtures per state. HMM Toolkit (HTK) 3.4.1 has been used for building and testing the acoustic models.

All SPLICE based linear transformations have been applied on 13 dimensional MFCCs, including C0. During HMM training, the features are appended with 13 delta and 13 acceleration coefficients to get a composite 39 dimensional vector per frame. Cepstral mean subtraction (CMS) has been performed in all the experiments. GMMs of 128 mixtures are built for all SPLICE based experiments. Run-time noise adaptation in the SPLICE framework is performed on 13 dimensional MFCCs. Data belonging to each SNR level of a test noise condition has been separately used to compute the global transformations. In all SPLICE based experiments, pseudo-cleaning of clean features has been performed.

To test the efficacy of the non-stereo method on a database which does not contain stereo data, the Aurora-4 task of 8 kHz sampling frequency has been used. Aurora-4 is a continuous speech recognition task with clean and noisy training utterances (non-stereo) and test utterances of 14 environments. Aurora-4 acoustic models are built using crossword triphone HMMs of 3 states and 6 mixtures per state. The standard WSJ0 bigram language model has been used during decoding of Aurora-4. A noisy GMM of 512 mixtures is built for evaluating the non-stereo method, using 7138 utterances taken from both clean and multi-training data. This GMM is adapted to the standard clean training set to get the clean GMM.

6.1. Results

Tables 1a and 1b summarise the results of the various algorithms discussed, on the Aurora-2 dataset. All the results are shown in % accuracy. All SNR levels mentioned are in decibels. The first seven rows report the overall results on all 10 test noise conditions.
The rest of the rows report the average values in the SNR range 20-0 dB. Table 2 shows the experimental results on the Aurora-4 database.

For reference, the result of standard MLLR adaptation on HMMs [9] has been shown in Table 1b, which computes a global 39 dimensional mean transformation, and uses two-pass Viterbi decoding.

It can be seen that M-SPLICE improves over SPLICE at all noise conditions and SNR levels and gives an absolute improvement of 8.6% in test-set C and 2.93% overall. Run-time compensation in the SPLICE framework gives improvements over standard MLLR in test-sets A and B, whereas M-SPLICE gives improvements in all conditions. Here, a 9.89% absolute improvement can be observed over SPLICE with run-time noise adaptation in test-set C, and 4.96% overall over standard MLLR.

Finally, the non-stereo method, though not using stereo data, shows 10.37% and 6.93% absolute improvements over Aurora-2 and Aurora-4 baseline models respectively, and a slight degradation w.r.t. SPLICE in most test cases (test-set C and clean data in Table 1a being the exceptions). Run-time noise adaptation results of the non-stereo method are comparable to those of standard MLLR, and the method is computationally less expensive.

7. DISCUSSION

In terms of computational cost, M-SPLICE and the non-stereo method are identical to conventional SPLICE during testing. Also, there is an almost negligible increase in cost during training. The MLLR mean adaptation in both the non-stereo method and run-time adaptation is computationally very efficient, and does not need Viterbi decoding.

In terms of performance, M-SPLICE is able to achieve good results in all cases without any use of adaptation data, especially in unseen cases. In the non-stereo method, one-to-one mixture correspondence between the noisy and clean GMMs is assumed. The method gives a slight degradation in performance. This could be attributed to neglecting the outlier data.
Compared with other existing feature normalisation techniques, the techniques in the SPLICE framework operate on individual feature vectors, and no estimation of parameters is required from test data. So these methods do not suffer from test data insufficiency problems, and are advantageous for shorter utterances. Also, the testing process is usually faster, and the methods are easily implementable in real-time applications. So by extending the methods to non-stereo data, we believe that they become more useful in many applications.

8. CONCLUSION AND FUTURE WORK

A modified version of the SPLICE algorithm has been proposed for noise robust speech recognition. It is better compliant with the assumptions of SPLICE, and improves the recognition in highly mismatched and unseen noise conditions. An extension of the methods to non-stereo data has been presented. Finally, a convenient run-time adaptation framework has been explained, which is computationally much cheaper than standard MLLR on HMMs. In future, we would like to improve the efficiency of non-stereo extensions of SPLICE, and extend M-SPLICE in the uncertainty decoding framework.

9. REFERENCES

[1] J. Droppo and A. Acero, "Maximum mutual information SPLICE transform for seen and unseen conditions," in INTERSPEECH, pp. 989–992, 2005.
[2] Y. Shinohara, T. Masuko, and M. Akamine, "Feature enhancement by speaker-normalized SPLICE for robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pp. 4881–4884, IEEE, 2008.
[3] J. Droppo, A. Acero, and L. Deng, "Uncertainty decoding with SPLICE for noise robust speech recognition," in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1, pp. I–57, IEEE, 2002.
[4] K. Chijiiwa, M. Suzuki, N. Minematsu, and K. Hirose, "Unseen noise robust speech recognition using adaptive piecewise linear transformation," in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pp. 4289–4292, IEEE, 2012.
[5] J. Gonzalez, A. Peinado, A. Gomez, and J. Carmona, "Efficient MMSE estimation and uncertainty processing for multienvironment robust speech recognition," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 5, pp. 1206–1220, 2011.
[6] M. Afify, X. Cui, and Y. Gao, "Stereo-based stochastic mapping for robust speech recognition," IEEE Trans. on Audio, Speech and Lang. Proc., vol. 17, pp. 1325–1334, Sept. 2009.
[7] L. Deng, A. Acero, M. Plumpe, and X. Huang, "Large-vocabulary speech recognition under adverse acoustic environments," in International Conference on Spoken Language Processing, pp. 806–809, 2000.
[8] H.-G. Hirsch and D. Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Automatic Speech Recognition: Challenges for the new Millenium ISCA Tutorial and Research Workshop (ITRW), 2000.
[9] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.