Spectral feature mapping with mimic loss for robust speech recognition

Authors: Deblin Bagchi, Peter Plantinga, Adam Stiff, Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA)

ABSTRACT

For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to ensure that de-noised speech is useful for downstream tasks like ASR. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to imitate the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in WER.

Index Terms: Speech enhancement, spectral mapping, mimic loss, noise-robust speech recognition, CHiME-2

1. INTRODUCTION

Automatic Speech Recognition (ASR) has shown tremendous progress over the years in recognizing clean speech. However, traditional DNN-HMM ASR systems still suffer from performance degradation in the presence of acoustic interference, such as additive noise and room reverberation. Some strategies for building a noise-robust speech recognizer include using noise-invariant features [1], augmented data [2], bulkier acoustic models like LSTMs and CNNs [2, 3], and sophisticated language models [2, 3]. Few groups, however, have looked at systems that train only a speech enhancement model, which can be used for different tasks.

A speech enhancement front-end refers to a performance-boosting denoising technique that can be attached to any standard automatic speech recognition model. Some deep learning models for speech enhancement attempt to estimate an ideal ratio mask (IRM) for removing noise from a speech signal [4]. Others utilize spectral mapping in the signal domain [5, 6] or in the feature domain [7, 8] to directly predict features.

Recent work in computer vision has seen notable success in addressing the problem of poor resolution in modified input data, using a framework broadly referred to as Generative Adversarial Networks (GANs). Fundamentally, these gains are achieved by the injection of auxiliary or proxy learning objectives into more traditional pipelines. These auxiliary objectives exploit an adversarial relationship between two neural networks (a generator and a discriminator) to find a Nash equilibrium in which generating sensible data is the optimal behavior for the generator [9]. This has the effect of refining the distribution of the generated data closer to some desirable outcome relative to a system trained without the auxiliary objective, e.g., sharper, more realistic generated images.
In light of these successes, in particular that of [9], inserting an auxiliary realism objective into a speech denoising pipeline seems to be a natural avenue to pursue improved performance.

[Figure 1: Our speech enhancement system is trained in three steps. (1) The spectral classifier is trained to predict senone labels with a cross-entropy criterion (classification loss, L_C). (2) The spectral mapper is pre-trained to map from noisy speech to clean speech using an MSE criterion (fidelity loss, L_F). (3) The spectral mapper is trained using a joint loss from both the clean speech and the outputs of the classifier when fed clean speech (mimic loss, L_M). The gray models have frozen weights.]

However, initial experiments with GANs conditioned on noisy speech inputs, in which the discriminator network was trained to distinguish between real and fake clean/noisy input pairs, exposed the well-known mode collapse problem endemic to GANs [10]. Results failed to improve upon simple baselines of noisy-to-clean speech mappings trained with only MSE loss. Other attempts have also failed to improve upon a DNN baseline, as in [11]. We speculate that the difference in performance from experiments in the visual domain may be due to the relatively non-local structure of the speech signal along the frequency axis (i.e., harmonic structure), as well as the smoothness of speech features compared to images.

The work described in this paper is motivated by the observation of over-simple outputs from GANs that seemed to be stuck in mode collapse orbits. We hypothesize that a training objective that provides stronger feedback than a simple real-or-fake determination will be better able to guide the parameters of the denoising network towards producing output that behaves like actual speech. While our resulting system retains none of the properties of being generative or trained adversarially, the insight of using an auxiliary task to improve the denoising process is drawn from that body of work.

The auxiliary global objective that we add to our local criterion is the behavioral loss of a classifier trained on clean speech. We call this additional objective the mimic loss. First, we train a senone classifier using clean speech as input, and a spectral mapper network [8, 6] using parallel noisy and clean speech frame pairs. Next, we freeze the weights of the acoustic model and join our pre-trained spectral mapper to it. We then pass noisy speech frames to train our spectral mapper with a joint objective, i.e., a weighted sum of the traditional fidelity loss and the mimic loss. The mimic loss, then, is the MSE loss with respect to the softmax (or pre-softmax) outputs of the classifier fed with the corresponding clean speech frame. See Figure 1 for a graphical depiction of the model.
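As an aid to reading Figure 1, the following is a minimal Python/TensorFlow sketch of the three-step procedure. It is not the authors' released code: the optimizer, epoch counts, and dataset shapes are illustrative assumptions, and for brevity the classifier here consumes single frames rather than the spliced windows described in Section 3.

```python
import tensorflow as tf

def train_three_steps(classifier, mapper, clean_ds, pair_ds, alpha=0.1):
    """Outline of the three training steps in Fig. 1 (names illustrative).

    clean_ds: yields (clean_features, senone_label) pairs
    pair_ds:  yields (spliced_noisy_features, clean_frame) pairs
    """
    # Step 1: train the spectral classifier on clean speech (loss L_C),
    # then freeze it as a model of behavior under clean speech.
    classifier.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy')
    classifier.fit(clean_ds, epochs=10)      # epoch count is an assumption
    classifier.trainable = False

    # Step 2: pre-train the mapper noisy -> clean with MSE (fidelity loss L_F).
    mapper.compile(optimizer='adam', loss='mse')
    mapper.fit(pair_ds, epochs=10)

    # Step 3: fine-tune the mapper with the joint objective (Eq. 3 below).
    opt = tf.keras.optimizers.Adam()
    for noisy, clean in pair_ds:
        with tf.GradientTape() as tape:
            denoised = mapper(noisy, training=True)
            fidelity = tf.reduce_mean(tf.square(clean - denoised))
            mimic = tf.reduce_mean(
                tf.square(classifier(clean) - classifier(denoised)))
            loss = fidelity + alpha * mimic
        grads = tape.gradient(loss, mapper.trainable_variables)
        opt.apply_gradients(zip(grads, mapper.trainable_variables))
```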
This technique of using one model to teach another was proposed by Ba and Caruana [12] for the task of model compression. In their work, they introduce student-teacher learning, where separate teacher and student models are trained to do the same task. Mimic loss, on the other hand, is used to train the student model to do a different task from the teacher model.

2. PRIOR WORK

To deal with noise, many DNN-based methods have been proposed to improve the robustness of ASR systems. In acoustic modeling, using Convolutional Neural Networks (CNNs), as in [13], and Long Short-Term Memory networks (LSTMs), as in [14], has resulted in an improvement in performance. LSTMs have also been successfully used as speech enhancement front-ends in [15, 14].

Spectral mapping has been used to generate clean speech signals; however, [8, 7] use only a local learning objective. Student-teacher networks have been used to improve the quality of noisy speech recognition [16, 17, 18]. Our model uses mimic loss instead of student-teacher learning, which means the speech enhancer is not jointly trained with a particular acoustic model. This speech enhancer could be used as a pre-processor for any ASR system, or for another similar dataset. This modularity is the strength of mimic loss.

3. SYSTEM DESCRIPTION

In this section, we describe the major components of our system: namely, the spectral mapper, the spectral classifier, and the overall framework binding the two together.

3.1. Spectral mapping

Spectral mapping improves performance of the speech recognizer by learning a mapping from noisy spectral patterns to clean features. We train a DNN-based spectral mapper for feature denoising. In our previous work [7, 8], we have shown that a DNN-based spectral mapper, which takes a noisy spectrogram as input to predict clean filterbank features for ASR, yields good results on the CHiME-2 noisy and reverberant dataset.

Specifically, we first divide the input time-domain signals into 25-ms frames with a 10-ms frame shift, and then apply the short-time Fourier transform (STFT) to compute log spectral magnitudes in each time frame. For a 16 kHz signal, each frame contains 400 samples, and we use a 512-point Fourier transform to compute the magnitudes, forming a 257-dimensional log magnitude vector. Each noisy spectral component x_m^k for frequency k at time slice m is augmented on the input by the deltas and double deltas, as well as a window of five frames on either side (designated \tilde{x}_m = [x_{m \pm 5}]), leading to the dimensionality of \tilde{x}_m being 257 * 3 * 11 = 8481. Similarly, we define y_m to be the clean spectral slice at time m. We then use a feed-forward neural network f(.) to map noisy spectral slices \tilde{x}_m to clean spectral features y_m using an MSE loss function, which we call the fidelity loss:

L_{FIDELITY}(\tilde{x}_m, y_m) = \frac{1}{K} \sum_{k=1}^{K} \big( y_m^k - f(\tilde{x}_m)_k \big)^2    (1)
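As a concrete illustration of this feature pipeline, here is a minimal NumPy sketch. The function names, the Hamming window, and the np.gradient delta scheme are our own choices; the paper does not specify them.

```python
import numpy as np

def log_magnitude_frames(signal, frame_len=400, hop=160, n_fft=512):
    """25-ms frames with a 10-ms shift at 16 kHz -> 257-dim log magnitudes."""
    window = np.hamming(frame_len)            # window choice is an assumption
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (n_frames, 257)
    return np.log(spec + 1e-8)

def add_deltas(feats):
    """Append first and second differences as delta / double-delta features."""
    d1 = np.gradient(feats, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.concatenate([feats, d1, d2], axis=1)        # (n_frames, 257*3)

def splice(feats, context=5):
    """Stack +/-5 neighboring frames: 257 * 3 * 11 = 8481 dims per slice."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.stack([padded[m : m + 2*context + 1].ravel()
                     for m in range(len(feats))])

def fidelity_loss(y_clean, y_pred):
    """Eq. (1): mean squared error over the K = 257 frequency bins."""
    return np.mean((y_clean - y_pred) ** 2)
```

Each 8481-dimensional spliced vector is what f(.) consumes in Eq. (1).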
3.2. Spectral classifier

The spectral classifier is similar to the traditional DNN acoustic model, trained to classify a stacked clean spectral pattern \tilde{y}_m into its corresponding senone class z_m. We train the classifier using a cross-entropy criterion; critically, once the classifier is trained, we freeze the weights as a model of appropriate behavior under clean speech.

3.3. Joint loss

We can define the mimic loss as the mean squared difference between a D-dimensional representation g within the spectral classifier evaluated on clean speech \tilde{y}_m and evaluated on the paired cleaned speech f(\tilde{x}_m), where the tilde over f(\tilde{x}_m) denotes the same delta-and-splice augmentation applied to the de-noised output:

L_{MIMIC}(\tilde{x}_m, \tilde{y}_m) = \frac{1}{D} \sum_{d=1}^{D} \big( g(\tilde{y}_m)_d - g(\widetilde{f(\tilde{x}_m)})_d \big)^2    (2)

We experimented with two different representations for g(.): the posterior output of the senones after softmax normalization (post-softmax) and the layer outputs prior to the softmax normalization (pre-softmax).

While training the speech enhancer, we found that using only mimic loss was not enough to allow the model to converge. We speculate that the task of predicting senones is too different from the task of predicting clean speech for the error signal to drive the output of the speech enhancer to actually look like speech. Combining the fidelity and mimic losses into a joint loss allows the enhancer to better imitate the behavior of the classifier under clean speech while keeping the projection of noisy speech closer to clean speech:

L_{JOINT} = L_{FIDELITY} + \alpha \, L_{MIMIC}    (3)

The hyper-parameter \alpha is used to ensure that both of the losses that make up the joint loss have a similar magnitude. We used 0.1 for pre-softmax and 1000 for post-softmax.
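A minimal TensorFlow sketch of the joint objective follows, assuming a frozen classifier model `frozen_classifier` whose output plays the role of g(.) and a trainable `mapper`; the names and the use of `tf.stop_gradient` are our illustrative choices, and for clarity single frames are fed to the classifier rather than spliced windows.

```python
import tensorflow as tf

ALPHA = 0.1  # the paper's weighting for pre-softmax mimic targets

def joint_loss(mapper, frozen_classifier, noisy_spliced, clean_frames):
    """Eq. (3): L_JOINT = L_FIDELITY + alpha * L_MIMIC."""
    denoised = mapper(noisy_spliced)                               # f(x~_m)
    fidelity = tf.reduce_mean(tf.square(clean_frames - denoised))  # Eq. (1)

    # g(.) is a frozen representation inside the clean-speech classifier
    # (pre- or post-softmax); stop_gradient makes doubly sure no error
    # signal updates the classifier itself.
    target = tf.stop_gradient(frozen_classifier(clean_frames))     # g(y~_m)
    student = frozen_classifier(denoised)                          # g(f(x~_m))
    mimic = tf.reduce_mean(tf.square(target - student))            # Eq. (2)

    return fidelity + ALPHA * mimic
```

Because the classifier is frozen, the error signal from the mimic term flows through g(.) back into the mapper only, which is what lets the mapper inherit phonetically relevant structure.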
4. EXPERIMENTAL SETUP

We evaluate the effectiveness of our proposed method on Track 2 of the CHiME-2 challenge [19], which is a medium-vocabulary task for word recognition under reverberant and noisy environments without speaker movements. In this task, three types of data are provided based on the Wall Street Journal (WSJ0) 5k-vocabulary read speech corpus: clean, reverberant, and reverberant+noisy. The clean utterances are extracted from the WSJ0 database. The reverberant utterances are created by convolving the clean speech with binaural room impulse responses (BRIRs) corresponding to a frontal position in a family living room. Real-world non-stationary background noise recorded in the same room is mixed with the reverberant utterances to form the reverberant+noisy set. The noise excerpts are selected such that the signal-to-noise ratio (SNR) ranges among -6, -3, 0, 3, 6 and 9 dB without scaling. The multi-condition training, development and test sets of the reverberant+noisy set contain 7138, 2454 and 1980 utterances respectively, which are the same utterances as in the clean set but with reverberation and noise at the 6 different SNR conditions. Our system is monaural: in our experiments, we simply average the signals from the left and right ears.

A GMM-HMM system is built using the Kaldi toolkit [20] on the clean utterances in WSJ0-5k to get the senone state for each frame of the corresponding noisy-reverberant utterances. The initial clean alignments are obtained by performing forced alignment on the clean utterances. To refine the initial clean alignments, we further trained a DNN-based acoustic model using the filterbank features of the clean utterances, and re-generated clean alignments. These clean alignments are used as the labels for training all the acoustic models in this study. Note that the DNN-HMM hybrid system built on the clean utterances is a powerful recognizer: it achieves 2.3% WER on the clean test set of WSJ0-5k.

4.1. Description of the acoustic model

In order to determine the effectiveness of the additional criterion, we train a model using the denoised features with an off-the-shelf Kaldi recipe. The DNN-HMM hybrid system is trained using the clean WSJ0-5k alignments generated by the method stated above. The DNN has 7 hidden layers, with 2048 sigmoid neurons in each layer and a softmax output layer. The splicing context size for the filterbank features was fixed at 11 frames, with a minibatch size of 1024. After that, we train the DNN with sMBR sequence training to achieve better performance. We regenerate the lattices after the first iteration and train for 4 more iterations. We use the CMU pronunciation dictionary and the official 5k closed-vocabulary tri-gram language model in our experiments.

4.2. Description of the spectral classifier

The spectral classifier network is a multilayer feed-forward network which classifies clean speech frames as one of 1999 senone labels. We use 6 layers of 1024 neurons with Leaky ReLU activations and batch normalization between all the layers. While training the spectral classifier on clean speech, we apply softmax after the topmost layer, and use a cross-entropy criterion to teach the network to produce senones.

[Table 1: Experimental results on the CHiME-2 test set. CE WER is the word error rate of a DNN-HMM hybrid system trained with a cross-entropy criterion; sMBR WER is the error rate after sequential minimum Bayes risk training.]

Spectral input to Kaldi              CE WER   sMBR WER
No enhancement                        18.0      17.3
Enhancement via fidelity loss         17.5      16.5
Enhancement via joint loss
  w/ post-softmax mimic loss          16.5      15.7
  w/ pre-softmax mimic loss           15.7      14.7

[Table 2: Experimental results on the CHiME-2 test set, broken down across SNRs. Mimic loss is based on pre-softmax units; the joint-loss system performs best in every evaluation subset.]

Enhancement     -6 dB   -3 dB   0 dB   3 dB   6 dB   9 dB
None             29.8    22.3   17.6   13.4   10.9    9.7
Fidelity loss    29.3    20.6   16.2   12.4   10.9    9.2
Joint loss       25.9    19.6   14.8   10.9    9.0    8.0

4.3. Description of the spectral mapper

The spectral mapper is composed of a simple two-layer feed-forward network with 2048 neurons in each layer. Because the model is simple, it can be used in some low-resource situations, and it is fast to run. Note that mimic loss can be applied to improve the results of any kind of speech enhancer. To regularize the network, we use batch normalization and a dropout rate of 0.5 between every layer. We also use ReLU activations after each layer. The classifier and mapper are written using TensorFlow; code is available at https://github.com/OSU-slatelab/actor-critic-specmap.
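For reference, here is one plausible tf.keras rendering of the two architectures described in Sections 4.2 and 4.3. The paper does not pin down the batch-norm/activation ordering or the classifier's exact input dimensionality, so those details are assumptions; splitting the classifier's output into a logits layer plus a separate softmax makes the pre-softmax mimic target easy to read off.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_spectral_classifier(input_dim=8481, n_senones=1999):
    """Sec. 4.2: 6 x 1024 Leaky ReLU layers with batch norm, softmax output."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for _ in range(6):
        model.add(layers.Dense(1024))
        model.add(layers.LeakyReLU())
        model.add(layers.BatchNormalization())
    model.add(layers.Dense(n_senones, name='presoftmax'))  # pre-softmax g(.)
    model.add(layers.Activation('softmax'))                # post-softmax g(.)
    return model

def build_spectral_mapper(input_dim=8481, output_dim=257):
    """Sec. 4.3: 2 x 2048 ReLU layers with batch norm and 0.5 dropout."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for _ in range(2):
        model.add(layers.Dense(2048, activation='relu'))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(output_dim))   # linear de-noised log-magnitude frame
    return model
```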
4.4. Results

We see in Table 1 that the outputs of the recognizer before the softmax provide a better target for the noisy speech recognizer, as suggested in [12], even though our setup is quite different from theirs. The difference in performance between pre- and post-softmax targets may be due to a mismatch between target domain and loss criterion; ongoing work suggests that a cross-entropy mimic loss on post-softmax targets performs similarly to MSE on pre-softmax targets. Furthermore, the information loss in the softmax normalization may "broaden" the allowable spectral mappings, harming generalization.

This suggests that it is helpful for the noisy recognizer not only to have the same targets as the clean recognizer, but also to learn to behave in the same way as the clean recognizer.

[Table 3: Performance comparison with other studies on the CHiME-2 test set.]

Study                      WER
Wang et al. [1]            10.6
Weninger et al. [15]       13.8
Proposed approach          14.7
Narayanan and Wang [4]     15.4
Chen et al. [14]           16.0

We also demonstrate this fact by training the noisy speech recognizer using hard targets rather than the soft targets of mimic loss. This caused the joint loss in the spectral mapper to diverge, providing more evidence that the noisy speech recognizer must learn to mimic the behavior, not just the targets, of the clean speech recognizer. Finally, in Table 2 we break down our results by SNR and see that the gains are consistent over all signal-to-noise levels.

4.5. Comparison with other systems

For context, we show in Table 3 the performance of our system relative to other published results in the field. The better-performing models in this list use more sophisticated models (like RNNs and LSTMs) for front-end speech enhancement [14, 15] and acoustic modeling [14], as well as noise-invariant features [1] (e.g., PNCC, MRCG). We use an off-the-shelf Kaldi DNN-HMM recipe for speech recognition, and a simple 2-layer feed-forward network for spectral mapping. Again, mimic loss can in principle be used to improve the results of any front-end system.

5. CONCLUSION

We have proposed a speech enhancement criterion, called mimic loss, which can be used to produce speech that is useful for downstream ASR tasks. The mimic loss comes from comparing the outputs of a frozen clean-speech recognizer, before softmax is applied, on clean and enhanced speech. This configuration allows the speech enhancement output to be used as clean speech for any downstream task. We see that with mimic loss, the spectral mapper learns to produce more detailed speech data, retaining features that fidelity loss alone fails to model. Mimicking the behavior of the pre-softmax layer of the classifier was superior to mimicking the senone posterior estimates; in general, the lower error rates show that these features are helpful for downstream tasks.

In future work, we propose to extend this work by matching every layer of the phone recognizer, rather than just the inputs and outputs. We could also use a variety of models for speech enhancement to demonstrate the effectiveness of mimic loss. Another avenue is to evaluate the output of our system in multiple domains to determine the effectiveness of this approach at learning domain-invariant representations of speech. Finally, we could train a more sophisticated acoustic model, rather than using an off-the-shelf Kaldi recipe.

6. ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant IIS-1409431. We also thank the Ohio Supercomputer Center (OSC) [21] for providing us with computational resources.

7. REFERENCES

[1] Z.-Q. Wang and D. Wang, "A joint training framework for robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796-806, 2016.
[2] J. Du, Y.-H. Tu, L. Sun, F. Ma, H.-K. Wang, J. Pan, C. Liu, J.-D. Chen, and C.-H. Lee, "The USTC-iFlytek system for CHiME-4 challenge," Proc. CHiME, pp. 36-38, 2016.

[3] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, et al., "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 436-443.

[4] A. Narayanan and D. Wang, "Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92-101, 2015.

[5] K. Han, Y. Wang, and D. Wang, "Learning spectral mapping for speech dereverberation," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4628-4632.

[6] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, "Learning spectral mapping for speech dereverberation and denoising," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982-992, 2015.

[7] K. Han, Y. He, D. Bagchi, E. Fosler-Lussier, and D. Wang, "Deep neural network based spectral feature mapping for robust speech recognition," in Proc. Interspeech, 2015.

[8] D. Bagchi, M. I. Mandel, Z. Wang, Y. He, A. Plummer, and E. Fosler-Lussier, "Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 496-503.

[9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint arXiv:1611.07004, 2016.

[10] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234-2242.

[11] D. Michelsanti and Z.-H. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," in Interspeech. ISCA, 2017.

[12] J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Advances in Neural Information Processing Systems, 2014, pp. 2654-2662.

[13] Y. Qian and P. C. Woodland, "Very deep convolutional neural networks for robust speech recognition," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 481-488.

[14] Z. Chen, S. Watanabe, H. Erdogan, and J. Hershey, "Integration of speech enhancement and recognition using long short-term memory recurrent neural network," in Proc. Interspeech, 2015.

[15] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91-99.

[16] K. Markov and T. Matsui, "Robust speech recognition using generalized distillation framework," in Interspeech, 2016, pp. 2364-2368.
[17] S. Watanabe, T. Hori, J. Le Roux, and J. R. Hershey, "Student-teacher network learning with enhanced features," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5275-5279.

[18] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, "Large-scale domain adaptation via teacher-student learning," Proc. Interspeech 2017, pp. 2386-2390, 2017.

[19] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 126-130.

[20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.

[21] Ohio Supercomputer Center, "Ohio Supercomputer Center," http://osc.edu/ark:/19495/f5s1ph73, 1987.
