Spectral feature mapping with mimic loss for robust speech recognition

Authors: Deblin Bagchi, Peter Plantinga, Adam Stiff, Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA)

ABSTRACT

For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to ensure that de-noised speech is useful for downstream tasks like ASR. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to imitate the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in WER.

Index Terms: Speech enhancement, spectral mapping, mimic loss, noise-robust speech recognition, CHiME-2

1. INTRODUCTION

Automatic Speech Recognition (ASR) has shown tremendous progress over the years in recognizing clean speech. However, traditional DNN-HMM ASR systems still suffer from performance degradation in the presence of acoustic interference, such as additive noise and room reverberation. Some strategies for building a noise-robust speech recognizer include using noise-invariant features [1], augmented data [2], bulkier acoustic models like LSTMs and CNNs [2, 3], and sophisticated language models [2, 3]. Few groups, however, have looked at systems that train only a speech enhancement model, which can be used for different tasks.

A speech enhancement front-end refers to a performance-boosting denoising technique that can be attached to any standard automatic speech recognition model. Some deep learning models for speech enhancement attempt to estimate an ideal ratio mask (IRM) for removing noise from a speech signal [4]. Others utilize spectral mapping in the signal domain [5, 6] or in the feature domain [7, 8] to directly predict features.

Recent work in computer vision has seen notable success in addressing the problem of poor resolution in modified input data, using a framework broadly referred to as Generative Adversarial Networks (GANs). Fundamentally, these gains are achieved by the injection of auxiliary or proxy learning objectives into more traditional pipelines. These auxiliary objectives exploit an adversarial relationship between two neural networks (a generator and a discriminator) to find a Nash equilibrium in which generating sensible data is the optimal behavior for the generator [9]. This has the effect of refining the distribution of the generated data closer to some desirable outcome relative to a system trained without the auxiliary objective, e.g., sharper, more realistic generated images.
In light of these successes, in particular that of [9], inserting an auxiliary realism objective into a speech denoising pipeline seems to be a natural avenue to pursue improved performance.

[Figure 1: Our speech enhancement system is trained in three steps. (1) The spectral classifier is trained to predict senone labels with a cross-entropy criterion (classification loss, L_C). (2) The spectral mapper is pre-trained to map from noisy speech to clean speech using an MSE criterion (fidelity loss, L_F). (3) The spectral mapper is trained using a joint loss from both the clean speech and the outputs of the classifier when fed clean speech (mimic loss, L_M). The gray models have frozen weights.]

However, initial experiments with GANs conditioned on noisy speech inputs, in which the discriminator network was trained to distinguish between real and fake clean/noisy input pairs, exposed the well-known mode collapse problem endemic to GANs [10]. Results failed to improve upon simple baselines of noisy-to-clean speech mappings trained with only MSE loss. Other attempts have also failed to improve upon a DNN baseline, as in [11]. We speculate that the difference in performance from experiments in the visual domain may be due to the relatively non-local structure of the speech signal along the frequency axis (i.e., harmonic structure), as well as the smoothness of speech features compared to images.

The work described in this paper is motivated by the observation of over-simple outputs from GANs that seemed to be stuck in mode collapse orbits. We hypothesize that a training objective that provides stronger feedback than a simple real-or-fake determination will be better able to guide the parameters of the denoising network towards producing output that behaves like actual speech. While our resulting system retains none of the properties of being generative or trained adversarially, the insight of using an auxiliary task to improve the denoising process is drawn from that body of work.

The auxiliary global objective that we add to our local criterion is the behavioral loss of a classifier trained on clean speech. We call this additional objective the mimic loss. First, we train a senone classifier using clean speech as input, and a spectral mapper network [8, 6] using parallel noisy and clean speech frame pairs. Next, we freeze the weights of the acoustic model and join our pre-trained spectral mapper to it. We then pass noisy speech frames to train our spectral mapper with a joint objective, i.e., a weighted sum of the traditional fidelity loss and the mimic loss. The mimic loss, then, is the MSE loss with respect to the softmax (or pre-softmax) outputs of the classifier fed with the corresponding clean speech frame. See Figure 1 for a graphical depiction of the model.
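As an aid to reading Figure 1, the following is a minimal Python/TensorFlow sketch of the three-step procedure. It is not the authors' released code: the optimizer, epoch counts, and dataset shapes are illustrative assumptions, and for brevity the classifier here consumes single frames rather than the spliced windows described in Section 3.

```python
import tensorflow as tf

def train_three_steps(classifier, mapper, clean_ds, pair_ds, alpha=0.1):
    """Outline of the three training steps in Fig. 1 (names illustrative).

    clean_ds: yields (clean_features, senone_label) pairs
    pair_ds:  yields (spliced_noisy_features, clean_frame) pairs
    """
    # Step 1: train the spectral classifier on clean speech (loss L_C),
    # then freeze it as a model of behavior under clean speech.
    classifier.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy')
    classifier.fit(clean_ds, epochs=10)      # epoch count is an assumption
    classifier.trainable = False

    # Step 2: pre-train the mapper noisy -> clean with MSE (fidelity loss L_F).
    mapper.compile(optimizer='adam', loss='mse')
    mapper.fit(pair_ds, epochs=10)

    # Step 3: fine-tune the mapper with the joint objective (Eq. 3 below).
    opt = tf.keras.optimizers.Adam()
    for noisy, clean in pair_ds:
        with tf.GradientTape() as tape:
            denoised = mapper(noisy, training=True)
            fidelity = tf.reduce_mean(tf.square(clean - denoised))
            mimic = tf.reduce_mean(
                tf.square(classifier(clean) - classifier(denoised)))
            loss = fidelity + alpha * mimic
        grads = tape.gradient(loss, mapper.trainable_variables)
        opt.apply_gradients(zip(grads, mapper.trainable_variables))
```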
This technique of using one model to teach another was proposed by Ba and Caruana [12] for the task of model compression. In their work, they introduce student-teacher learning, where separate teacher and student models are trained to do the same task. Mimic loss, on the other hand, is used to train the student model to do a different task from the teacher model.

2. PRIOR WORK

To deal with noise, many DNN-based methods have been proposed to improve the robustness of ASR systems. In acoustic modeling, using Convolutional Neural Networks (CNNs), as in [13], and Long Short-Term Memory networks (LSTMs), as in [14], has resulted in an improvement in performance. LSTMs have also been successfully used as speech enhancement front-ends in [15, 14].

Spectral mapping has been used to generate clean speech signals; however, [8, 7] use only a local learning objective. Student-teacher networks have been used to improve the quality of noisy speech recognition [16, 17, 18]. Our model uses mimic loss instead of student-teacher learning, which means the speech enhancer is not jointly trained with a particular acoustic model. This speech enhancer could be used as a pre-processor for any ASR system, or for another similar dataset. This modularity is the strength of mimic loss.

3. SYSTEM DESCRIPTION

In this section, we describe the major components of our system: namely, the spectral mapper, the spectral classifier, and the overall framework binding the two together.

3.1. Spectral mapping

Spectral mapping improves performance of the speech recognizer by learning a mapping from noisy spectral patterns to clean features. We train a DNN-based spectral mapper for feature denoising. In our previous work [7, 8], we have shown that a DNN-based spectral mapper, which takes a noisy spectrogram as input to predict clean filterbank features for ASR, yields good results on the CHiME-2 noisy and reverberant dataset.

Specifically, we first divide the input time-domain signals into 25-ms frames with a 10-ms frame shift, and then apply the short-time Fourier transform (STFT) to compute log spectral magnitudes in each time frame. For a 16 kHz signal, each frame contains 400 samples, and we use a 512-point Fourier transform to compute the magnitudes, forming a 257-dimensional log magnitude vector. Each noisy spectral component x_m^k for frequency k at time slice m is augmented on the input by the deltas and double deltas, as well as a window of five frames on either side (designated \tilde{x}_m = [x_{m \pm 5}]), leading to the dimensionality of \tilde{x}_m being 257 * 3 * 11 = 8481. Similarly, we define y_m to be the clean spectral slice at time m. We then use a feed-forward neural network f(.) to map noisy spectral slices \tilde{x}_m to clean spectral features y_m using an MSE loss function, which we call the fidelity loss:

L_{FIDELITY}(\tilde{x}_m, y_m) = \frac{1}{K} \sum_{k=1}^{K} \big( y_m^k - f(\tilde{x}_m)_k \big)^2    (1)
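As a concrete illustration of this feature pipeline, here is a minimal NumPy sketch. The function names, the Hamming window, and the np.gradient delta scheme are our own choices; the paper does not specify them.

```python
import numpy as np

def log_magnitude_frames(signal, frame_len=400, hop=160, n_fft=512):
    """25-ms frames with a 10-ms shift at 16 kHz -> 257-dim log magnitudes."""
    window = np.hamming(frame_len)            # window choice is an assumption
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))   # (n_frames, 257)
    return np.log(spec + 1e-8)

def add_deltas(feats):
    """Append first and second differences as delta / double-delta features."""
    d1 = np.gradient(feats, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.concatenate([feats, d1, d2], axis=1)        # (n_frames, 257*3)

def splice(feats, context=5):
    """Stack +/-5 neighboring frames: 257 * 3 * 11 = 8481 dims per slice."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.stack([padded[m : m + 2*context + 1].ravel()
                     for m in range(len(feats))])

def fidelity_loss(y_clean, y_pred):
    """Eq. (1): mean squared error over the K = 257 frequency bins."""
    return np.mean((y_clean - y_pred) ** 2)
```

Each 8481-dimensional spliced vector is what f(.) consumes in Eq. (1).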
3.2. Spectral classifier

The spectral classifier is similar to the traditional DNN acoustic model, trained to classify a stacked clean spectral pattern \tilde{y}_m into its corresponding senone class z_m. We train the classifier using a cross-entropy criterion; critically, once the classifier is trained, we freeze the weights as a model of appropriate behavior under clean speech.

3.3. Joint loss

We can define the mimic loss as the mean squared difference between a D-dimensional representation g within the spectral classifier evaluated on clean speech \tilde{y}_m and evaluated on the paired cleaned speech f(\tilde{x}_m), where the tilde over f(\tilde{x}_m) denotes the same delta-and-splice augmentation applied to the de-noised output:

L_{MIMIC}(\tilde{x}_m, \tilde{y}_m) = \frac{1}{D} \sum_{d=1}^{D} \big( g(\tilde{y}_m)_d - g(\widetilde{f(\tilde{x}_m)})_d \big)^2    (2)

We experimented with two different representations for g(.): the posterior output of the senones after softmax normalization (post-softmax) and the layer outputs prior to the softmax normalization (pre-softmax).

While training the speech enhancer, we found that using only mimic loss was not enough to allow the model to converge. We speculate that the task of predicting senones is too different from the task of predicting clean speech for the error signal to drive the output of the speech enhancer to actually look like speech. Combining the fidelity and mimic losses into a joint loss allows the enhancer to better imitate the behavior of the classifier under clean speech while keeping the projection of noisy speech closer to clean speech:

L_{JOINT} = L_{FIDELITY} + \alpha \, L_{MIMIC}    (3)

The hyper-parameter \alpha is used to ensure that both of the losses that make up the joint loss have a similar magnitude. We used 0.1 for pre-softmax and 1000 for post-softmax.
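A minimal TensorFlow sketch of the joint objective follows, assuming a frozen classifier model `frozen_classifier` whose output plays the role of g(.) and a trainable `mapper`; the names and the use of `tf.stop_gradient` are our illustrative choices, and for clarity single frames are fed to the classifier rather than spliced windows.

```python
import tensorflow as tf

ALPHA = 0.1  # the paper's weighting for pre-softmax mimic targets

def joint_loss(mapper, frozen_classifier, noisy_spliced, clean_frames):
    """Eq. (3): L_JOINT = L_FIDELITY + alpha * L_MIMIC."""
    denoised = mapper(noisy_spliced)                               # f(x~_m)
    fidelity = tf.reduce_mean(tf.square(clean_frames - denoised))  # Eq. (1)

    # g(.) is a frozen representation inside the clean-speech classifier
    # (pre- or post-softmax); stop_gradient makes doubly sure no error
    # signal updates the classifier itself.
    target = tf.stop_gradient(frozen_classifier(clean_frames))     # g(y~_m)
    student = frozen_classifier(denoised)                          # g(f(x~_m))
    mimic = tf.reduce_mean(tf.square(target - student))            # Eq. (2)

    return fidelity + ALPHA * mimic
```

Because the classifier is frozen, the error signal from the mimic term flows through g(.) back into the mapper only, which is what lets the mapper inherit phonetically relevant structure.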
4. EXPERIMENTAL SETUP

We evaluate the effectiveness of our proposed method on Track 2 of the CHiME-2 challenge [19], which is a medium-vocabulary task for word recognition under reverberant and noisy environments without speaker movements. In this task, three types of data are provided based on the Wall Street Journal (WSJ0) 5k-vocabulary read speech corpus: clean, reverberant, and reverberant+noisy. The clean utterances are extracted from the WSJ0 database. The reverberant utterances are created by convolving the clean speech with binaural room impulse responses (BRIRs) corresponding to a frontal position in a family living room. Real-world non-stationary background noise recorded in the same room is mixed with the reverberant utterances to form the reverberant+noisy set. The noise excerpts are selected such that the signal-to-noise ratio (SNR) ranges among -6, -3, 0, 3, 6 and 9 dB without scaling. The multi-condition training, development and test sets of the reverberant+noisy set contain 7138, 2454 and 1980 utterances respectively, which are the same utterances as in the clean set but with reverberation and noise at the 6 different SNR conditions. Our system is monaural: in our experiments, we simply average the signals from the left and right ears.

A GMM-HMM system is built using the Kaldi toolkit [20] on the clean utterances in WSJ0-5k to get the senone state for each frame of the corresponding noisy-reverberant utterances. The initial clean alignments are obtained by performing forced alignment on the clean utterances. To refine the initial clean alignments, we further trained a DNN-based acoustic model using the filterbank features of the clean utterances, and re-generated clean alignments. These clean alignments are used as the labels for training all the acoustic models in this study. Note that the DNN-HMM hybrid system built on the clean utterances is a powerful recognizer: it achieves 2.3% WER on the clean test set of WSJ0-5k.

4.1. Description of the acoustic model

In order to determine the effectiveness of the additional criterion, we train a model using the denoised features with an off-the-shelf Kaldi recipe. The DNN-HMM hybrid system is trained using the clean WSJ0-5k alignments generated by the method stated above. The DNN has 7 hidden layers, with 2048 sigmoid neurons in each layer and a softmax output layer. The splicing context size for the filterbank features was fixed at 11 frames, with a minibatch size of 1024. After that, we train the DNN with sMBR sequence training to achieve better performance. We regenerate the lattices after the first iteration and train for 4 more iterations. We use the CMU pronunciation dictionary and the official 5k closed-vocabulary tri-gram language model in our experiments.

4.2. Description of the spectral classifier

The spectral classifier network is a multilayer feed-forward network which classifies clean speech frames as one of 1999 senone labels. We use 6 layers of 1024 neurons with Leaky ReLU activations and batch normalization between all the layers. While training the spectral classifier on clean speech, we apply softmax after the topmost layer, and use a cross-entropy criterion to teach the network to produce senones.

[Table 1: Experimental results on the CHiME-2 test set. CE WER is the word error rate of a DNN-HMM hybrid system trained with a cross-entropy criterion; sMBR WER is the error rate after sequential minimum Bayes risk training.]

Spectral input to Kaldi              CE WER   sMBR WER
No enhancement                        18.0      17.3
Enhancement via fidelity loss         17.5      16.5
Enhancement via joint loss
  w/ post-softmax mimic loss          16.5      15.7
  w/ pre-softmax mimic loss           15.7      14.7

[Table 2: Experimental results on the CHiME-2 test set, broken down across SNRs. Mimic loss is based on pre-softmax units; the joint-loss system performs best in every evaluation subset.]

Enhancement     -6 dB   -3 dB   0 dB   3 dB   6 dB   9 dB
None             29.8    22.3   17.6   13.4   10.9    9.7
Fidelity loss    29.3    20.6   16.2   12.4   10.9    9.2
Joint loss       25.9    19.6   14.8   10.9    9.0    8.0

4.3. Description of the spectral mapper

The spectral mapper is composed of a simple two-layer feed-forward network with 2048 neurons in each layer. Because the model is simple, it can be used in some low-resource situations, and it is fast to run. Note that mimic loss can be applied to improve the results of any kind of speech enhancer. To regularize the network, we use batch normalization and a dropout rate of 0.5 between every layer. We also use ReLU activations after each layer. The classifier and mapper are written using TensorFlow; code is available at https://github.com/OSU-slatelab/actor-critic-specmap.
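For reference, here is one plausible tf.keras rendering of the two architectures described in Sections 4.2 and 4.3. The paper does not pin down the batch-norm/activation ordering or the classifier's exact input dimensionality, so those details are assumptions; splitting the classifier's output into a logits layer plus a separate softmax makes the pre-softmax mimic target easy to read off.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_spectral_classifier(input_dim=8481, n_senones=1999):
    """Sec. 4.2: 6 x 1024 Leaky ReLU layers with batch norm, softmax output."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for _ in range(6):
        model.add(layers.Dense(1024))
        model.add(layers.LeakyReLU())
        model.add(layers.BatchNormalization())
    model.add(layers.Dense(n_senones, name='presoftmax'))  # pre-softmax g(.)
    model.add(layers.Activation('softmax'))                # post-softmax g(.)
    return model

def build_spectral_mapper(input_dim=8481, output_dim=257):
    """Sec. 4.3: 2 x 2048 ReLU layers with batch norm and 0.5 dropout."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for _ in range(2):
        model.add(layers.Dense(2048, activation='relu'))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.5))
    model.add(layers.Dense(output_dim))   # linear de-noised log-magnitude frame
    return model
```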
4.4. Results

We see in Table 1 that the outputs of the recognizer before the softmax provide a better target for the noisy speech recognizer, as suggested in [12], even though our setup is quite different from theirs. The difference in performance between pre- and post-softmax targets may be due to a mismatch between target domain and loss criterion; ongoing work suggests that a cross-entropy mimic loss on post-softmax targets performs similarly to MSE on pre-softmax targets. Furthermore, the information loss in the softmax normalization may "broaden" the allowable spectral mappings, harming generalization.

This suggests that it is helpful for the noisy recognizer not only to have the same targets as the clean recognizer, but also to learn to behave in the same way as the clean recognizer.

[Table 3: Performance comparison with other studies on the CHiME-2 test set.]

Study                      WER
Wang et al. [1]            10.6
Weninger et al. [15]       13.8
Proposed approach          14.7
Narayanan and Wang [4]     15.4
Chen et al. [14]           16.0

We also demonstrate this fact by training the noisy speech recognizer using hard targets rather than the soft targets of mimic loss. This caused the joint loss in the spectral mapper to diverge, providing more evidence that the noisy speech recognizer must learn to mimic the behavior, not just the targets, of the clean speech recognizer. Finally, in Table 2 we break down our results by SNR and see that the gains are consistent over all signal-to-noise levels.

4.5. Comparison with other systems

For context, we show in Table 3 the performance of our system relative to other published results in the field. The better-performing models in this list use more sophisticated models (like RNNs and LSTMs) for front-end speech enhancement [14, 15] and acoustic modeling [14], as well as noise-invariant features [1] (e.g., PNCC, MRCG). We use an off-the-shelf Kaldi DNN-HMM recipe for speech recognition, and a simple 2-layer feed-forward network for spectral mapping. Again, mimic loss can in principle be used to improve the results of any front-end system.

5. CONCLUSION

We have proposed a speech enhancement criterion, called mimic loss, which can be used to produce speech that is useful for downstream ASR tasks. The mimic loss comes from comparing the outputs of a frozen clean-speech recognizer, before softmax is applied, on clean and enhanced speech. This configuration allows the speech enhancement output to be used as clean speech for any downstream task. We see that with mimic loss, the spectral mapper learns to produce more detailed speech data, retaining features that fidelity loss alone fails to model. Mimicking the behavior of the pre-softmax layer of the classifier was superior to mimicking the senone posterior estimates; in general, the lower error rates show that these features are helpful for downstream tasks.

In future work, we propose to extend this work by matching every layer of the phone recognizer, rather than just the inputs and outputs. We could also use a variety of models for speech enhancement to demonstrate the effectiveness of mimic loss. Another avenue is to evaluate the output of our system in multiple domains to determine the effectiveness of this approach at learning domain-invariant representations of speech. Finally, we could train a more sophisticated acoustic model, rather than using an off-the-shelf Kaldi recipe.

6. ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant IIS-1409431. We also thank the Ohio Supercomputer Center (OSC) [21] for providing us with computational resources.

7. REFERENCES

[1] Z.-Q. Wang and D. Wang, "A joint training framework for robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 4, pp. 796-806, 2016.
[2] J. Du, Y.-H. Tu, L. Sun, F. Ma, H.-K. Wang, J. Pan, C. Liu, J.-D. Chen, and C.-H. Lee, "The USTC-iFlytek system for CHiME-4 challenge," Proc. CHiME, pp. 36-38, 2016.

[3] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, et al., "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 436-443.

[4] A. Narayanan and D. Wang, "Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 92-101, 2015.

[5] K. Han, Y. Wang, and D. Wang, "Learning spectral mapping for speech dereverberation," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 4628-4632.

[6] K. Han, Y. Wang, D. Wang, W. S. Woods, I. Merks, and T. Zhang, "Learning spectral mapping for speech dereverberation and denoising," IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 982-992, 2015.

[7] K. Han, Y. He, D. Bagchi, E. Fosler-Lussier, and D. Wang, "Deep neural network based spectral feature mapping for robust speech recognition," in Proc. Interspeech, 2015.

[8] D. Bagchi, M. I. Mandel, Z. Wang, Y. He, A. Plummer, and E. Fosler-Lussier, "Combining spectral feature mapping and multi-channel model-based source separation for noise-robust automatic speech recognition," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 496-503.

[9] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint arXiv:1611.07004, 2016.

[10] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234-2242.

[11] D. Michelsanti and Z.-H. Tan, "Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification," in Interspeech. ISCA, 2017.

[12] J. Ba and R. Caruana, "Do deep nets really need to be deep?," in Advances in Neural Information Processing Systems, 2014, pp. 2654-2662.

[13] Y. Qian and P. C. Woodland, "Very deep convolutional neural networks for robust speech recognition," in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 481-488.

[14] Z. Chen, S. Watanabe, H. Erdogan, and J. Hershey, "Integration of speech enhancement and recognition using long short-term memory recurrent neural network," in Proc. Interspeech, 2015.

[15] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91-99.

[16] K. Markov and T. Matsui, "Robust speech recognition using generalized distillation framework," in Interspeech, 2016, pp. 2364-2368.
[17] S. Watanabe, T. Hori, J. Le Roux, and J. R. Hershey, "Student-teacher network learning with enhanced features," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5275-5279.

[18] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, "Large-scale domain adaptation via teacher-student learning," Proc. Interspeech 2017, pp. 2386-2390, 2017.

[19] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines," in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 126-130.

[20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.

[21] Ohio Supercomputer Center, "Ohio Supercomputer Center," http://osc.edu/ark:/19495/f5s1ph73, 1987.
