Multilingual Adaptation of RNN Based ASR Systems
Authors: Markus Müller, Sebastian Stüker, Alex Waibel
Institute for Anthropomatics and Robotics
Karlsruhe Institute of Technology, Karlsruhe, Germany
{m.mueller, sebastian.stueker, alexander.waibel}@kit.edu

ABSTRACT

In this work, we focus on multilingual systems based on recurrent neural networks (RNNs), trained using the Connectionist Temporal Classification (CTC) loss function. Using a multilingual set of acoustic units poses difficulties. To address this issue, we proposed Language Feature Vectors (LFVs) to train language adaptive multilingual systems. Language adaptation, in contrast to speaker adaptation, needs to be applied not only on the feature level, but also to deeper layers of the network. In this work, we therefore extended our previous approach by introducing a novel technique which we call "modulation". Based on this method, we modulated the hidden layers of RNNs using LFVs. We evaluated this approach in both full and low resource conditions, as well as for grapheme and phone based systems. Lower error rates throughout the different conditions could be achieved by the use of the modulation.

Index Terms — Multilingual, automatic speech recognition, connectionist temporal classification, language feature vectors, low-resource

1. INTRODUCTION

Training multilingual speech recognition systems requires special methods. In low-resource conditions, training systems on data from multiple languages improves performance. In a resource-rich environment, using data from multiple languages often does not improve performance and might even affect it negatively. In both cases, adaptation techniques are required to improve recognition accuracy, and neural networks adapted to language characteristics have proven to perform better.
This is similar to speaker adaptation, where adapted networks outperform unadapted ones. However, language adaptation is more challenging than speaker adaptation: Collecting training data from several hundred speakers is possible. This number of speakers enables networks to generalize over speaker properties. For language adaptation, there are an order of magnitude fewer languages available than there are speakers. This renders generalization across languages more difficult. Another factor is the task itself. When trained on data from multiple speakers of the same language, the same targets, e.g., phone states, are used. Different languages feature different, but sometimes overlapping, sets of targets. Although speech recognition for different languages comprises different tasks, they are related since all languages are spoken by humans. This limits the sound inventory to the sounds that can be produced by the human vocal tract. Also, languages (from the same language family) potentially share sound inventories, as well as the set of targets the network is trained on.

Applying language adaptation techniques should therefore enable the networks to generalize better. Encoding language properties using, e.g., LFVs, as we showed in the past, allows networks to be trained language adaptively in such a way that they can exploit similarities and differences between languages. Unlike traditional GMM/HMM or DNN/HMM based systems, RNN/CTC based setups do not require explicit modelling of context-dependent states which would then need to be adapted. Based on RNNs, these systems should be trained towards learning features based on language properties in order to perform better in a multilingual scenario.

(This work was realized in the framework of the ANR-DFG project BULB (ANR-14-CE35-002).)
As we outline in the related work section, several techniques for language adaptation have been proposed for traditional systems in the past. We proposed to use LFVs as additional input features for language adaptation. In this paper, we introduce a novel approach of integrating LFVs into recurrent network architectures based on the idea of Meta-PI networks. The effectiveness of our approach is demonstrated in a series of experiments, showing that the method presented here can be applied to both full- and low-resource conditions. In addition, we also omitted the pronunciation dictionary and built systems using graphemes only. In a multilingual scenario, this is particularly challenging as the network is required to learn pronunciations from multiple languages in parallel. To evaluate our systems, we use the token error rate (TER) as the primary measure of the trained networks. But we also incorporated an RNN based language model (LM) for decoding to determine the word error rate (WER).

This paper is organized as follows: In the next Section 2, we outline related work in the field, followed by a detailed description of the proposed method in Section 3. We describe the experimental setup in Section 4, followed by the results of our experiments (Section 5). This paper concludes with Section 6, where we also outline possible future work.

(To appear in Proc. ICASSP 2018, April 15-20, 2018, Calgary, Canada. © IEEE 2018)

2. RELATED WORK

2.1. Multi- and Crosslingual Speech Recognition Systems

Prior to the emergence of neural networks, ASR systems were typically built using a GMM/HMM based approach. Methods for training/adapting such systems cross- and multilingually were proposed to handle data sparsity [1, 2].
The process of clustering context-independent phones into context-dependent ones can also be adapted to account for cross- and multilinguality [3]. Due to their recurrent nature, RNNs are a powerful tool to model sequential dependencies, rendering the need for context-dependent targets superfluous. Using only context-independent targets has the advantage that no clustering is required.

2.2. Multilingual Bottleneck Features

In a resource-constrained scenario, data from additional source languages are used to improve the performance. DNNs are typically trained in two steps: pre-training and fine-tuning. It has been shown that the pre-training step is language independent [4]. The fine-tuning can be modified in multiple ways to account for additional languages. One approach includes the use of shared hidden layers with language dependent output layers [5]. Combining multiple output layers into one is also possible [6].

2.3. Neural Network Adaptation

Feeding additional input features into a neural network is a common way of adaptation. A popular approach for speaker adaptation is to supply i-Vectors [7], which encode speaker characteristics in a low-dimensional representation. Speaker adaptive neural networks can be trained this way [8]. Such low-dimensional codes can also be extracted using neural networks, called Bottleneck Speaker Vectors (BSVs) [9]. In the past, we proposed similar methods to adapt DNNs to multiple languages. We first introduced a method encoding the language identity using one-hot encoding [10]. We enhanced this method in a similar way to BSVs, by extracting Language Feature Vectors (LFVs) [11]. These vectors have been shown to encode language properties instead of the language identity alone, even for languages not seen during training.

2.4. RNN Based ASR Systems

RNN based ASR systems are becoming increasingly popular.
One method to train them is the use of the Connectionist Temporal Classification (CTC) loss function [12], which does not require frame-level labels. It aligns a sequence of tokens automatically. As in traditional systems, phones, graphemes or both combined can be used as acoustic modeling units [13]. Given enough training data, even whole words can be used [14].

3. LANGUAGE ADAPTATION

In the past, we proposed methods for adapting multilingual neural network based ASR systems to languages using LFVs. Language Feature Vectors are a low-dimensional representation of language properties, extracted via a neural network. This network was trained to discriminate languages, based on log Mel and tonal features (FFV [15] and pitch [16]) typically used by ASR systems. A similar architecture as for the extraction of BNFs was used. This architecture featured a bottleneck as the second-to-last layer. After training, the output activations of this layer were used as LFVs. To perform the adaptation, we appended LFVs to the acoustic features, similar to appending i-Vectors for speaker adaptation. In the results section, we include error rates using this method as contrastive experiments, denoted as "LFV app". This method has been shown to reduce error rates for multilingual GMM/HMM, DNN/HMM as well as RNN/CTC based systems.

3.1. Neural Modulation

Appending features for speaker adaptation to acoustic features is fitting, as changes in speaker characteristics are reflected within the signal. Multiple adaptation methods like VTLN or fMLLR which directly operate on the acoustic features were proposed. The same holds true for i-Vector based adaptation, where speaker adaptive systems can be trained to directly shift input features based on speaker properties [8]. But language properties are a higher-order concept in contrast to speaker variations.
Some aspects are based on acoustics, e.g., having the same phone in multiple languages, where a language-specific coloring can be observed to some degree. But aspects like phonotactics or different sets of acoustic units require adaptation methods beyond the transformation of acoustic features. Here, adding features at deeper layers potentially enables better adaptation. One possibility is a method first introduced as part of Meta-PI networks [17]. The key aspect is the use of Meta-PI connections, which allow modulating the output of units by multiplication with a coefficient. Applied to language adaptation, we modulated the outputs of hidden layers with LFVs. Based on language features, the outputs of LSTM cells are attenuated or emphasized. This forces the cells in the hidden layer to learn or adapt to features based on language properties. Modulation can be considered related to dropout training [18], where connections are dropped on a random basis. In the results section, we refer to this method as "LFV mod".

We used a network configuration as shown in Figure 1. The basic architecture is inspired by Baidu's Deep Speech 2. It combines two TDNN/CNN layers with 4 bi-directional LSTM layers. The output layer is a feed-forward layer which maps the output of the last LSTM layer to the targets. We chose the number of LSTM cells in each layer to be a multiple of the dimensionality of the LFVs. This way, we could structure the hidden layer into groups of LSTM cells containing an equal number of units. The output of each group is then modulated with one dimension of the LFVs. The figure shows both configurations, "LFV app" and "LFV mod", but only one method was applied at a time. In preliminary experiments, we determined that modulating the output of the second LSTM layer results in the best performance.
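The grouping and modulation described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the authors' implementation; the function name, variable names, and shapes are our own assumptions:

```python
import numpy as np

def modulate_with_lfv(layer_out, lfv):
    """Sketch of "LFV mod": scale groups of hidden-layer outputs by LFV components.

    layer_out: (time, hidden) outputs of an LSTM layer; `hidden` must be a
               multiple of the LFV dimensionality, as in the paper.
    lfv: (lfv_dim,) language feature vector for the current utterance.
    """
    _, hidden = layer_out.shape
    lfv_dim = lfv.shape[0]
    assert hidden % lfv_dim == 0, "hidden size must be a multiple of the LFV dim"
    group = hidden // lfv_dim
    # Repeat each LFV component over its group of cells, then attenuate or
    # emphasize each group of LSTM outputs by the corresponding component.
    scale = np.repeat(lfv, group)   # shape: (hidden,)
    return layer_out * scale        # broadcast over the time axis

# Example: 8 hidden cells and a 4-dimensional LFV -> groups of 2 cells each.
out = np.ones((3, 8))
lfv = np.array([0.0, 0.5, 1.0, 2.0])
mod = modulate_with_lfv(out, lfv)   # first group zeroed, last group doubled
```

Note that the modulation leaves the layer's dimensionality unchanged, which is why it can be dropped into an existing stack of LSTM layers without altering the rest of the architecture.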
[Figure: 2D TDNN/CNN layers, followed by bi-directional LSTM layers and the output layer, with the "LFV app" and "LFV mod" insertion points marked.]
Fig. 1. Network architecture showing "LFV app", as well as the proposed adaptation method "LFV mod".

4. EXPERIMENTAL SETUP

We based our experiments on the Euronews corpus [19], which contains data from 10 languages. For each language, 70h of TV broadcast news recordings are available. For our experiments, we used a combination of 4 languages (English, French, German, Turkish), based on the availability of pronunciation dictionaries. We filtered utterances based on length, omitting very short ones (< 1s), and also removed ones having a transcript of more than 639 characters.(1) Noises were only annotated in a very basic way, with a single noise marker covering all different noise types, ranging from music to background and human noises. We therefore omitted utterances marked as noise. After applying all filtering, approx. 50h of data remained per language, which was split into 45h of training and 5h of test data. For training, we created an additional subset containing only 8h out of the 45h training set to evaluate our approach in a low-resource condition.

(1) Internal limitation within the implementation of CUDA/warp-ctc, see: https://github.com/baidu-research/warp-ctc, accessed 2018-02-12

4.1. Acoustic Units

As acoustic units, we used both phones and graphemes. The pronunciation dictionaries were created using MaryTTS [20]. For merging the monolingual dictionaries, we mapped the phone symbols to a multilingual phone set using the definition of articulatory features in MaryTTS' language description files. In addition to systems based on phones, we also trained networks using graphemes as acoustic units. To indicate word boundaries, an additional token was used.

4.2. RNN/CTC Network Training

Multilingual Bottleneck Features (ML-BNFs) were used as input features.
The ML-BNF network was trained using data from 5 languages (French, German, Italian, Russian, Turkish). Input features to the network were log Mel and tonal features (FFV [15] and pitch [16]), extracted using a 32ms window with a 10ms frame-shift. The RNN network was trained using stochastic gradient descent (SGD) and Nesterov momentum [21] with a factor of 0.9. Mini-batch updates with a batch size of 15 were applied together with batch normalization. The utterances were sorted ascending by length to stabilize the training, as shorter utterances are easier to align.

4.3. Grapheme Based RNN LM

We used an RNN based LM, trained on graphemes as described in [22]. It featured 1 hidden layer with 1024 LSTM cells. The model was trained on only a very limited set of sentences, consisting of the training utterances of the acoustic model only. As language models are typically trained on several millions of sentences, this is not much training data. But the model should provide an indication whether the improvements observed in TER also result in a better word-level speech recognition system.

4.4. Evaluation

We evaluated our proposed method varying two conditions: the availability of a pronunciation dictionary and the amount of data. An ASR system without language adaptation is used as the baseline. First, we used the token error rate (TER) as the primary measure to determine the performance without the use of external (language) models. For decoding, we use the same procedure as in [12] and greedily search for the best path. In addition to the TER, we also determined the word error rate (WER) using an RNN LM.

5. RESULTS

5.1. Grapheme Based Systems

First, we evaluated the use of graphemes as acoustic modeling units. We started using a network configuration with the RNN part having 420 LSTM cells per layer, trained using only 8h of data per language (see Table 1).
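The greedy best-path decoding used to compute the TER (Section 4.4) can be sketched as follows. This is a generic CTC greedy decoder in the style of [12], not the paper's code; the function and variable names are illustrative:

```python
import numpy as np

def ctc_greedy_decode(outputs, blank=0):
    """Greedy best-path CTC decoding: take the most likely unit per frame,
    collapse consecutive repeats, then remove blank labels.

    outputs: (time, num_units) per-frame network scores or probabilities,
             where index `blank` is the CTC blank symbol.
    """
    best_path = np.argmax(outputs, axis=1)
    tokens = []
    prev = blank
    for label in best_path:
        # Emit a label only when it differs from its predecessor
        # (collapsing repeats) and is not the blank symbol.
        if label != prev and label != blank:
            tokens.append(int(label))
        prev = label
    return tokens

# Example: frames predicting [blank, a, a, blank, b, b] decode to [a, b].
frames = np.array([[0.9, 0.05, 0.05],
                   [0.1, 0.80, 0.10],
                   [0.1, 0.80, 0.10],
                   [0.9, 0.05, 0.05],
                   [0.1, 0.10, 0.80],
                   [0.1, 0.10, 0.80]])
print(ctc_greedy_decode(frames))  # -> [1, 2]
```

The TER is then the edit distance between this token sequence and the reference, normalized by the reference length.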
Adding LFVs after the TDNN/CNN layers ("LFV app") does lower the TER, but applying the method presented here ("LFV mod") lowers the TER even more.

Table 1. TER of grapheme based systems trained on 8h per language, 420 LSTM cells per layer

Condition   DE     EN     FR     TR
Baseline    30.8   38.0   29.4   30.9
LFV app     22.9   33.3   27.3   21.3
LFV mod     20.7   32.7   25.4   19.6

Similar gains can be observed using the full training set (Table 2). The use of more data lowered the TER, whereas the relative improvements were in the same order of magnitude.

Table 2. TER of grapheme based systems trained on 45h per language, 420 LSTM cells per layer

Condition   DE     EN     FR     TR
Baseline    10.6   18.2   15.9    9.1
LFV app      9.5   16.1   14.3    8.1
LFV mod      9.1   15.5   13.6    8.0

Training on more data also allowed for larger networks. In an additional experiment, we increased the number of LSTM cells per layer to 840. As shown in Table 3, the TER decreases in absolute terms, but the difference between addition and modulation becomes smaller.

Table 3. TER of grapheme based systems trained on 45h per language, 840 LSTM cells per layer

Condition   DE     EN     FR     TR
Baseline     8.9   15.0   13.5    7.9
LFV app      7.9   13.6   11.8    7.1
LFV mod      7.7   13.3   11.7    7.1

5.2. Phoneme Based Systems

In the same manner as for graphemes, we evaluated systems based on phonemes as acoustic modelling units. Starting with the limited data set (Table 4), improvements by the modulation ("LFV mod") over the addition ("LFV app") can be observed.

Table 4. TER of phoneme based systems trained on 8h per language, 420 LSTM cells per layer

Condition   DE     EN     FR     TR
Baseline    21.7   27.2   23.9   21.6
LFV app     20.9   26.4   21.3   19.5
LFV mod     19.0   25.6   19.8   17.6

Using all available training data and increasing the number of LSTM cells per layer to 840, similar improvements could be achieved (Table 5).
As in the grapheme based setup (Table 3), modulating the layers ("LFV mod") improves the performance over the simple addition ("LFV app"). The TERs of the grapheme based systems for German and Turkish are lower compared to their phone based counterparts. One reason for this is the quality of the pronunciation dictionary, which was created fully automatically based on a TTS system.

Table 5. TER of phoneme based systems trained on 45h per language, 840 LSTM cells per layer

Condition   DE     EN     FR     TR
Baseline     9.6   14.6   12.1    8.5
LFV app      9.3   13.2   10.8    7.7
LFV mod      8.6   12.5   10.2    7.3

5.3. Decoding with RNN LM

To determine the WER, we ran a greedy decoding using a char based RNN LM on the English subset of the test data. The results shown in Table 6 indicate that the improvements in TER are also observable w.r.t. WER after decoding with a language model.

Table 6. WERs for English grapheme based systems

Setup     Baseline   LFV app   LFV mod
8h-420    32.4%      30.6%     29.9%
45h-840   29.2%      27.7%     27.3%

6. CONCLUSION

Unlike speaker adaptation, where the collection of data covering hundreds of speakers is feasible, collecting data from that many languages is next to impossible. Optimizing the adaptation method is therefore key to maximizing the performance in a multilingual scenario. We presented an improved method for language adaptation of RNNs in a multilingual setting. Modulating the outputs of a layer showed improvements over appending LFVs to input features.

7. REFERENCES

[1] Tanja Schultz and Alex Waibel, "Fast bootstrapping of LVCSR systems with multilingual phoneme sets," in Eurospeech, 1997.

[2] Sebastian Stüker, Acoustic Modelling for Under-Resourced Languages, Ph.D. thesis, Karlsruhe, Univ., Diss., 2009.
[3] Tanja Schultz and Alex Waibel, "Polyphone decision tree specialization for language adaptation," in Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2000, vol. 3, pp. 1707–1710.

[4] Pawel Swietojanski, Arnab Ghoshal, and Steve Renals, "Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR," in SLT. IEEE, 2012, pp. 246–251.

[5] Karel Vesely, Martin Karafiat, Frantisek Grezl, Milos Janda, and Ekaterina Egorova, "The language-independent bottleneck features," in Proceedings of the Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 336–341.

[6] Frantisek Grézl, Martin Karafiát, and Karel Vesely, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2014, pp. 7654–7658.

[7] George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny, "Speaker adaptation of neural network acoustic models using i-Vectors," in ASRU. IEEE, 2013, pp. 55–59.

[8] Yajie Miao, Hao Zhang, and Florian Metze, "Towards speaker adaptive training of deep neural network acoustic models," 2014.

[9] Hengguan Huang and Khe Chai Sim, "An investigation of augmenting speaker representations to improve speaker normalisation for DNN-based speech recognition," in ICASSP. IEEE, 2015, pp. 4610–4613.

[10] Markus Müller and Alex Waibel, "Using language adaptive deep neural networks for improved multilingual speech recognition," IWSLT, 2015.

[11] Markus Müller, Sebastian Stüker, and Alex Waibel, "Language adaptive DNNs for improved low resource speech recognition," in Interspeech, 2016.
[12] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.

[13] Dongpeng Chen, Brian Mak, Cheung-Chi Leung, and Sunil Sivadas, "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," in Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2014, pp. 5592–5596.

[14] Hagen Soltau, Hank Liao, and Hasim Sak, "Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition," arXiv preprint arXiv:1610.09975, 2016.

[15] Kornel Laskowski, Mattias Heldner, and Jens Edlund, "The fundamental frequency variation spectrum," in Proceedings of the 21st Swedish Phonetics Conference (Fonetik 2008), Gothenburg, Sweden, June 2008, pp. 29–32.

[16] Kjell Schubert, "Grundfrequenzverfolgung und deren Anwendung in der Spracherkennung" (Pitch tracking and its application in speech recognition), M.S. thesis, Universität Karlsruhe (TH), Germany, 1999. In German.

[17] John B. Hampshire and Alex Waibel, "The Meta-Pi network: Building distributed knowledge representations for robust multisource pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 7, pp. 751–769, 1992.

[18] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.

[19] Roberto Gretter, "Euronews: A multilingual benchmark for ASR and LID," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[20] Marc Schröder and Jürgen Trouvain, "The German text-to-speech synthesis system MARY: A tool for research, development and teaching," International Journal of Speech Technology, vol. 6, no. 4, pp. 365–377, 2003.

[21] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1139–1147.

[22] Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, and Alex Waibel, "Comparison of decoding strategies for CTC acoustic models," arXiv preprint arXiv:1708.04469, 2017.