Exploring RNN-Transducer for Chinese Speech Recognition



Senmao Wang¹,³, Pan Zhou², Wei Chen³, Jia Jia², Lei Xie¹
¹School of Computer Science, Northwestern Polytechnical University, Xi'an, China
²Tiangong Institute for Intelligent Computing, Tsinghua University, Beijing, China
³Voice Interaction Technology Center, Sogou Inc., Beijing, China
{swang,lxie}@nwpu-aslp.org, {zh-pan,jjia}@mail.tsinghua.edu.cn, chenweibj8871@sogou-inc.com

ABSTRACT

End-to-end approaches have drawn much attention recently for significantly simplifying the construction of an automatic speech recognition (ASR) system. The RNN transducer (RNN-T) is one of the popular end-to-end methods. Previous studies have shown that RNN-T is difficult to train and that a very complex training process is needed for reasonable performance. In this paper, we explore RNN-T for a Chinese large vocabulary continuous speech recognition (LVCSR) task and aim to simplify the training process while maintaining performance. First, a new learning rate decay strategy is proposed to accelerate model convergence. Second, we find that adding convolutional layers at the beginning of the network and using length-ordered data allows the encoder pre-training process to be discarded without loss of performance. Besides, we design experiments to find a balance among GPU memory usage, training cycle and model performance. Finally, we achieve a 16.9% character error rate (CER) on our test set, a 2% absolute improvement over a strong BLSTM CE system with a language model trained on the same text corpus.

Index Terms: RNN-transducer, automatic speech recognition, end-to-end speech recognition

1. INTRODUCTION

Most state-of-the-art automatic speech recognition (ASR) systems [1–4] have three main components: an acoustic model, a pronunciation model and a language model.
Taking audio features as input, the acoustic model uses a deep neural network (DNN) in combination with a hidden Markov model (HMM) [5] and outputs posteriors of context-dependent (CD) states. Decision-tree based clustering connects the context-dependent states to phones. A separate, expert-curated pronunciation model maps phones to words, and the language model assembles the words into a whole meaningful sentence. The three components are trained separately, and thus an end-to-end system [6], a unified model that combines all of these components into one, attracts much interest. Taking speech features as input and outputting word symbols directly removes the gap between the different components; in other words, a single model may avoid local optima in the three components and reach a global optimum.

Compared with a traditional ASR system, an end-to-end ASR system aims to map the input speech sequence to the output grapheme/word sequence using a neural network. The main problem is that the lengths of the input and output sequences are apparently different. To resolve this, the encoder-decoder architecture is recommended. Combined with an attention mechanism that aligns input to output, this architecture shows good performance in various sequence-to-sequence mapping tasks, such as neural machine translation [7], text summarization [8] and image captioning [9]. Recently, attention based approaches [10–13] have been reported to perform well in ASR as well. The Listen, Attend and Spell (LAS) model, proposed in [14], uses a BLSTM network to map acoustic features to a high-level representation, and then uses an attention based decoder to predict each output symbol conditioned on the previous output symbols. However, the attention based decoder has to wait for the entire high-level representation to be formed before computing attention weights, so it cannot be used in real-time tasks.
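The streaming limitation can be seen from a minimal sketch of content-based attention (a hedged numpy illustration with hypothetical shapes and names, not the exact LAS formulation): the softmax runs over all T encoder frames, so no weight exists until the whole utterance has been encoded.

```python
import numpy as np

def attend(enc_states, dec_state):
    # Score every encoder frame against the current decoder state,
    # softmax over time, and build a context vector. Because the
    # scores span all T frames, the full encoder output must exist
    # before the first attention weight can be computed.
    scores = enc_states @ dec_state                 # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over time
    context = weights @ enc_states                  # (H,)
    return weights, context

rng = np.random.default_rng(0)
weights, context = attend(rng.normal(size=(50, 8)), rng.normal(size=8))
assert context.shape == (8,)
```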
The connectionist temporal classification (CTC) model [15, 16] can be regarded as another kind of end-to-end approach. A blank label and a many-to-one mapping function are used to align the input sequence to the target symbols. A CTC based acoustic model usually uses a unidirectional or bidirectional LSTM as the encoder and computes the CTC loss between the output sequence of the encoder and the target symbol sequence. In [17], a context-dependent phoneme (CDPh) based CTC model has shown good performance in an ASR system. In a CTC based model, the frame independence assumption ignores context information to some degree. Obviously, this assumption does not match the actual situation, where context information is essential for speech modeling. Another disadvantage of CTC is that the output posterior probability sequence must not be shorter than the label sequence. This limitation does not allow a large subsampling rate for a CDPh based CTC system.

RNN Transducer (RNN-T) [18, 19] has recently been proposed as an extension of the CTC model. Specifically, by adding an LSTM based prediction network, RNN-T removes the conditional independence assumption of the CTC model. Moreover, RNN-T does not need the entire utterance-level representation before decoding, which makes streaming end-to-end ASR possible. In [20], Google implemented the RNN-T model in a streaming English ASR system and achieved performance comparable to conventional state-of-the-art speech recognition systems. As RNN-T is extended from a CTC acoustic model, it is usually initialized from a pre-trained CTC model. The hierarchical CTC (HCTC) architecture [21] can also be used for a better initialization [20]: in an English ASR task, a phoneme CTC loss and a grapheme CTC loss were used to assist the optimization of a wordpiece CTC loss. The HCTC architecture helps to train a better model for initialization, which is beneficial to the RNN-T model.
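The blank label and the mapping function can be made concrete with a short Python sketch (label 0 plays the blank here by convention; the function name is ours). It also shows why the posterior sequence cannot be shorter than the label sequence: repeated labels need a blank between them, which caps the subsampling rate.

```python
def ctc_collapse(path, blank=0):
    # CTC mapping B: merge consecutive repeated labels, then drop blanks.
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

# The label pair (1, 1) survives only because a blank separates the
# two 1's in the frame-level path; hence the path (and so the encoder
# output) must be at least as long as the label sequence.
assert ctc_collapse([1, 1, 0, 1, 0, 2]) == [1, 1, 2]
```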
In RNN-T training, good performance requires a pre-trained CTC model to initialize the encoder of the RNN-T model, and a decent pre-trained CTC model in turn usually requires frame-wise cross entropy (CE) training to obtain a CE model as the starting point of CTC training. Obviously, this long step-by-step process costs a lot of time before a well-trained RNN-T model is obtained. In this paper, we explore the potential of the RNN-T architecture on a Mandarin LVCSR task and attempt to simplify the training process. The main contributions of our work are as follows.

[Fig. 1. Illustration of the RNN-Transducer model: an encoder and a prediction network feeding a joint network followed by a softmax.]

First, for good model convergence, we propose an effective learning rate decay strategy which sharply decreases the learning rate at the epoch where the loss starts to increase and halves the learning rate every epoch after that. Second, we find that adding convolutional layers before the BLSTM layers can replace the functionality of a pre-trained CTC-based model for the encoder, leading to a simplified training process. Besides, to further accelerate the training process while ensuring performance, we compare the influence of different architectures and subsampling rates in the encoder. We find that subsampling is a necessary step for accelerating training, as the output of RNN-T is a four-dimensional tensor which occupies too much GPU memory and slows down the training cycle. Furthermore, a pre-trained LSTM language model is proven to have a positive influence. Finally, we achieve a 16.9% character error rate (CER) on our test set, a 2% absolute improvement over a strong BLSTM CE system with a language model trained on the same text corpus.

2. RECURRENT NEURAL NETWORK-TRANSDUCER

As mentioned earlier, RNN-T [18] is an extension of the CTC [15] model.
On top of the CTC encoder, Graves proposed to add an LSTM to learn context information, which functions as a language model. A joint network is subsequently used to combine the acoustic representation and the context representation to compute the posterior probability. Fig. 1 illustrates the three main components of RNN-T, namely the encoder, the prediction network and the joint network. Concretely, the encoder transforms the input time-frequency acoustic feature sequence x = (x_0, ..., x_T) into a high-level feature representation h^{enc}:

    h^{enc} = Encoder(x)    (1)

The prediction network removes the limitation of the frame independence assumption in the original CTC architecture. It usually adopts an LSTM to model context information, transforming the original one-hot vector sequence y = (y_1, ..., y_U) into a high-level representation h^{pred}_u. The output of the prediction network is determined by the previous context information. Note that the first input of the prediction network is an all-zero tensor and y_{u-1} is the last non-blank unit. Eq. (2) and Eq. (3) describe how the prediction network operates at label step u:

    h^{embed}_{u-1} = Embedding(y_{u-1})    (2)

    h^{pred}_u = Prediction(h^{embed}_{u-1})    (3)

h^{enc}_t and h^{pred}_u are first broadcast to an N × T × U × H tensor, where N and H denote the batch size and the number of hidden nodes respectively. The joint network is usually a feed-forward network that produces h^{joint}_{t,u} from h^{enc}_t and h^{pred}_u:

    h^{joint}_{t,u} = tanh(W^{enc} h^{enc}_t + W^{pred} h^{pred}_u + b)    (4)

Finally, the output probability distribution is computed by a softmax layer:

    P(k | t, u) = softmax(h^{joint}_{t,u})    (5)

where k is the index over the output classes. The whole network is trained by optimizing the RNN-T loss, which is computed by the forward-backward algorithm:

    Loss_rnnt = -ln Σ_{(t,u): t+u=n} p(y | t, u)    (6)
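Eqs. (1)-(5) can be sketched in plain numpy for a single utterance (batch size N = 1; the dimensions and random placeholder weights below are illustrative only, not the paper's configuration). Adding the batch axis yields the four-dimensional N × T × U × K output tensor the paper discusses.

```python
import numpy as np

rng = np.random.default_rng(0)
T, U, H, K = 20, 5, 16, 32           # frames, labels, hidden size, classes
h_enc  = rng.normal(size=(T, H))     # encoder output, Eq. (1)
h_pred = rng.normal(size=(U, H))     # prediction-network output, Eq. (3)
W_enc  = rng.normal(size=(H, H))
W_pred = rng.normal(size=(H, H))
b      = np.zeros(H)
W_out  = rng.normal(size=(H, K))

# Eq. (4): broadcast to a T x U grid of joint representations.
h_joint = np.tanh(h_enc[:, None, :] @ W_enc + h_pred[None, :, :] @ W_pred + b)

# Eq. (5): softmax over the K output classes at every grid point (t, u).
logits = h_joint @ W_out
probs = np.exp(logits - logits.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)
assert probs.shape == (T, U, K)      # one distribution per (t, u) pair
```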
In decoding, the most likely character sequence is generated by the beam search algorithm. During RNN-T inference, the input of the prediction network is the last non-blank symbol, and the final output sequence is obtained by removing all blank symbols from the most likely path. The temperature of the softmax function can be used to smooth the posterior probability distribution, which benefits a larger beam width. In addition, an N-gram language model trained on external text can be integrated into the beam score.

3. DATASET AND BASELINE

We investigated the RNN-T model on a Chinese LVCSR task. Specifically, we carried out a series of experiments in order to obtain good results on the Chinese task while simplifying RNN-T model training at the same time.

3.1. Data

Our dataset is composed of approximately 1,000 hours of Mandarin speech collected by the Sogou voice input method (IME) on mobile phones. We split out 50 hours from this dataset for validation and use the remaining 950 hours for training. 40-dimensional log Mel-filterbank coefficients are extracted from 25 ms frames shifted by 10 ms, and global mean and variance normalization (CMVN) is applied to obtain the final acoustic features. To obtain more convincing results, we use a rich test set recorded by Sogou IME in different clean and noisy environments. Each test set has around 8,000 utterances, 17.4 hours in total. In order to train a truly end-to-end model, we choose the Chinese character as our modeling unit. We use a symbol inventory consisting of 26 English characters, 6,784 frequently-used Chinese characters, an unknown token (UNK) and a blank token.

3.2. Baseline

We train several models as our baselines. First, a 4-layer BLSTM model with 256 cells per layer is trained by optimizing frame-level CE loss using tied phone states as targets.
Meanwhile, using the same architecture, we also train two CTC models using the monophone-level CTC loss and the character-level CTC loss. We also train several RNN-T models using the standard pre-training method in [20]. Our base RNN-T shares the same encoder as the CTC model, except the one pre-trained with the HCTC method, which has 5 BLSTM layers. Following the work in [20, 21], for HCTC pre-training we increase the number of BLSTM layers to 5 and add a monophone-level CTC loss after the third layer. The prediction network is a 2-layer LSTM with 512 cells per layer, and one fully-connected feed-forward network with 512 nodes is used as the joint network.

We concatenate 7 frames (3-1-3) of FBank features as the network input. Frame skipping (2 frames) is also adopted in the CTC and RNN-T models. The Adam optimizer is used to learn the parameters and the initial learning rate is 0.0002. The decoding process of the CTC based models follows the setup in [22]. Unless otherwise stated, RNN-T decoding uses a beam width of 5 without an external language model.

Table 1. Performance of the baseline systems in CER.

    Model         CER (%)
    CE            18.87
    charCTC       20.93
    phoneCTC      19.06
    RNN-T         22.39
     +CTC init    20.83
     +LM init     19.98
     +HCTC init   19.05
     +Beam 10     18.78

We summarize our baseline results in Table 1. The two CTC models are worse than the CE state model, mainly because we use monophones as the modeling unit in the CTC models; we believe the performance would be better with context-dependent phones as the modeling unit. As expected, the randomly initialized RNN-T model performs much worse than the CTC and CE models. With proper encoder and prediction network initialization methods, RNN-T gradually improves from 22.39% to 18.78% CER, eventually surpassing the CE and CTC models.
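The 3-1-3 frame stacking and 2-frame skipping described above can be sketched as follows (a minimal numpy illustration; edge padding at utterance boundaries is our assumption, as the paper does not specify it):

```python
import numpy as np

def splice_and_skip(feats, left=3, right=3, skip=2):
    # Concatenate each frame with its 3 left and 3 right neighbours
    # (the 3-1-3 stacking), then keep every `skip`-th spliced frame.
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    spliced = np.concatenate(
        [padded[i : i + T] for i in range(left + right + 1)], axis=1)
    return spliced[::skip]

x = np.random.default_rng(0).normal(size=(100, 40))   # 40-dim FBank
y = splice_and_skip(x)
assert y.shape == (50, 280)   # 7 x 40 dims, half as many frames
```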
Although our baseline RNN-T model eventually performs better than the CE and CTC models, the training process is clearly sophisticated, involving many pre-training initialization steps according to [20]. In the following, we discuss the proposed tricks to simplify the training cycle of an RNN-T model.

4. PROPOSED TRICKS

4.1. Sharpen learning rate decay

During training of the baseline RNN-T models, we find that this kind of model is hard to train and easily overfits. The common setting in neural network training is as follows: the learning rate remains fixed for the first few epochs until the loss on the validation set begins to increase, and then it is divided by 2 every epoch after that. To overcome the overfitting problem, we try a different learning rate decay strategy, which brings a clearly positive effect. Specifically, we use a more aggressive strategy which divides the learning rate by a number larger than 2 at the first decay epoch and then halves it as usual in the following epochs. Fig. 2 illustrates how the training loss changes over training epochs for the different learning rate decay strategies. A significant decline in the training loss is observed when the learning rate is first divided by more than 2, and the best model convergence is achieved when the learning rate is first divided by 10. By applying this strategy, we achieve a clear improvement over the baseline, as shown in Table 2.

Besides, dropout is also a common trick to cope with overfitting. We find that with a dropout probability of 0.2, our RNN-T model improves from 19.98% to 19.51% CER. In the following experiments, the sharpen learning rate decay strategy (1/10) and a dropout rate of 0.2 are adopted.

4.2. Abandoning encoder pre-training

As mentioned in Section 3.2, we need a trained CTC model to initialize the encoder of the RNN-T model, and we in turn need CE training to initialize the CTC model.
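The sharpen decay schedule of Section 4.1 can be written as a small helper (the function name and arguments are ours; the base learning rate of 2e-4 and the first-decay divisor of 10 follow the paper):

```python
def sharpen_lr(base_lr, decay_epoch, epoch, first_div=10.0):
    # Hold the learning rate until the validation loss first rises
    # (decay_epoch), divide it once by `first_div`, then halve it
    # every epoch after that.
    if epoch < decay_epoch:
        return base_lr
    lr = base_lr / first_div
    return lr / (2.0 ** (epoch - decay_epoch))

# Assuming the loss first rises at epoch 3:
lrs = [sharpen_lr(2e-4, 3, e) for e in range(6)]
assert lrs[:3] == [2e-4] * 3        # held flat, then 2e-5, 1e-5, 5e-6 ...
```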
The training process is really complicated and costs too much time. We wonder whether the pre-training process is necessary for the RNN-T model and try to abandon this complicated process.

Table 2. The impact of the sharpen learning rate decay strategy and dropout. Here RNN-T+LM init in Table 1 is used as the baseline and denoted RNN-T for simplicity.

    Model config   CER (%)
    RNN-T          19.98
     +sharpen      19.69
     +dropout      19.51

[Fig. 2. Learning curves of RNN-T for first-decay factors 1/2, 1/4, 1/10 and 1/20.]

First, as convolutional neural networks (CNN) can help extract more invariant and stable features, we incorporate CNN layers into the encoder to explore how they affect performance. Two CNN layers with 6×6 kernels are added before the BLSTM layers. Besides, curriculum learning (CL), which makes the model learn from an easy task to a hard task, can also be considered to accelerate training convergence; we apply CL by sorting the training sentences by length. Table 3 shows the effects of RNN-T with CNN layers and CL training. Here we select the RNN-T model in the second to last row of Table 1 as the base model and rename it RNN-T (enc init), denoting the use of encoder initialization (HCTC). Note that dropout and the sharpen learning rate decay are used here, so the CER of this model is lower than that in Table 1. We find that by adding two CNN layers we achieve a CER of 17.65% from random initialization, an absolute 1% CER reduction compared with the HCTC-initialized RNN-T. CL also helps achieve a lower CER, but when the two strategies are used together we observe a small performance degradation.
Hence, in conclusion, the complicated RNN-T pre-training process can be removed by adding CNN layers to the RNN-T model.

4.3. Acceleration by subsampling

The output of RNN-T is a four-dimensional tensor covering thousands of label classes and hundreds of speech frames, and consequently it costs a large amount of GPU memory. A larger batch size is another way to accelerate training, but although we use an NVIDIA M40 GPU with 24 GB of memory, we still cannot increase the batch size to a reasonable number with a frame skipping rate of 2. A higher frame skipping rate shortens the acoustic feature sequence and saves memory; however, performance degrades when more frames are skipped. In the CNN and BLSTM equipped encoder, we therefore exploit subsampling within the layers, using max-pooling (MP) after the CNN layers and pyramid BLSTM (pBLSTM) layers. A pBLSTM is a BLSTM layer whose input is obtained by concatenating several frames of its preceding layer's outputs. Unless otherwise stated, the max-pooling size is set to 2 and the pyramid size is set to 2, i.e., the pBLSTM takes 2 frames of its input features and skips 2 frames. In our RNN-T model, the encoder is composed of 2 CNN layers and 5 BLSTM layers.

Table 3. The impact of CNN layers and curriculum learning.

    Model                         CER (%)
    RNN-T (enc init)              18.62
     +2 CNN layers (rand init)    17.65
    RNN-T (enc init) + CL         18.45
     +2 CNN layers (rand init)    17.94

Table 4. RNN-T performance of different subsampling configurations in an encoder with 2 CNN layers and 5 BLSTM layers. MP2@1-2 represents max-pooling with size 2 at the 1st and 2nd CNN layers; Py2@1-3 means pyramid size 2 at the 1st, 2nd and 3rd BLSTM layers.

    Total subsample    Subsample config    CER (%)
    2                  MP2@2               18.32
    2                  Py2@3               18.39
    4                  MP2@2+Py2@3         18.07
    4                  MP2@1-2             18.30
    4                  Py2@1-2             18.69
    4                  Py2@2-3             17.94
    4                  Py2@3-4             17.98
    4                  Py2@4-5             18.11
    6                  MP2@2+Py3@3         17.95
    8                  MP2@2+Py2@2-3       18.58
    8                  MP2@1-2+Py2@3       18.88
    8                  Py2@1-3             18.42
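The pyramid-style input construction described above can be sketched in a few lines (a minimal numpy illustration with our own names; for pyramid size 2 it halves the frame rate while doubling the feature dimension):

```python
import numpy as np

def pyramid_input(h, size=2):
    # Build the input of a pyramid BLSTM layer: concatenate `size`
    # consecutive frames of the lower layer's output, reducing the
    # time resolution by the same factor.
    T, D = h.shape
    T -= T % size                        # drop a ragged tail frame
    return h[:T].reshape(T // size, size * D)

h = np.zeros((100, 64))                  # lower BLSTM outputs
assert pyramid_input(h).shape == (50, 128)
```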
We compare different subsampling configurations, including the size and location of the max-pooling and pyramid layers; Table 4 shows the details. Max-pooling and pyramid layers achieve similar performance at a total subsampling rate of 2. As the total subsampling rate is increased to 4 for a faster training speed, we find that subsampling entirely in the BLSTM part is a better option than using max-pooling, which may be attributed to the information loss in max-pooling. For a total subsampling rate larger than 6, performance degradation is observed. The most suitable subsampling rate is therefore between 4 and 6, and we choose the second and third BLSTM layers as pyramid layers in our model. With 24 GB of GPU memory, we can only set the batch size to 10 at a frame skipping rate of 1, but the batch size can be set to 20 with our subsampling ratio, which accelerates training.

4.4. Prediction network initialization

We further study the initialization strategy for the prediction network. Specifically, we train 2-layer LSTM language models on the training set transcriptions and on an external 27 GB text corpus, respectively, and use them to initialize the prediction network, which has the same LSTM structure. Here we use the best encoder architecture (total subsampling rate 4, Py2@2-3 in Table 4) for the experiments, combining it with the different prediction networks. Results are listed in Table 5. We find that initializing the prediction network with a language model of the same structure brings a performance improvement, and the prediction network initialized by the LSTM language model trained on the training set transcriptions themselves performs best. It seems that the language model trained on the external corpus has a domain mismatch with our experimental data.

Table 5. The impact of different initializations of the prediction network.

    Model config                                   CER (%)
    2-layer LSTM random init.                      17.94
    2-layer LSTM init. w/ training transcription   17.61
    2-layer LSTM init. w/ external text corpus     17.77

[Fig. 3. The RNN-T architecture used in this paper: the encoder consists of 2 CNN layers and 5 BLSTM layers (two of them pyramid BLSTMs) and is trained from scratch; the prediction network, a 2-layer LSTM over embedded one-hot inputs, is initialized from a character-level LSTM LM trained with CE loss; both feed the joint network trained with the RNN-T character loss.]

4.5. Final results

We combine all of the previously mentioned tricks and show the final architecture of our RNN-T in Fig. 3. As listed in Table 6, our best RNN-T model without encoder CTC pre-training achieves a CER of 16.90% with the help of an external 5-gram character LM, which is about a 2% absolute CER reduction from our BLSTM CE baseline. This also suggests that the training transcriptions, corresponding to 1,000 hours of speech, are not enough for a decent language model in an end-to-end ASR system.

We finally report results on a Mandarin ASR task using 10,000 hours of training data. Our RNN-T eventually achieves a CER of 10.52% without the use of an external language model. As a comparison, an internal Latency-Controlled BLSTM system achieved a CER of 11.30% on the same test set.

Table 6. Performance of RNN-T models trained on the 1,000 hr and 10,000 hr Mandarin corpora.

    Task        Model config     CER (%)
    1,000 hr    RNN-T            17.61
    1,000 hr     +Character LM   16.90
    10,000 hr   RNN-T            10.52
    10,000 hr   LC-BLSTM         11.30

5. CONCLUSION

In this work, we have explored RNN-T model training on a Mandarin LVCSR task. We have proposed several methods to speed up RNN-T training: a sharpen learning rate decay strategy, abandoning encoder pre-training by adding CNN layers, accelerating training by proper subsampling, and LM initialization of the prediction network. Finally, we achieve a simplified training procedure for RNN-T with superior performance compared to a strong BLSTM CE system.

6.
REFERENCES

[1] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[2] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[3] Haşim Sak, Andrew Senior, and Françoise Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[4] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in Proc. ICASSP. IEEE, 2015, pp. 4580–4584.
[5] Lawrence R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[6] Alex Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint, 2014.
[8] Alexander M. Rush, Sumit Chopra, and Jason Weston, "A neural attention model for abstractive sentence summarization," arXiv preprint arXiv:1509.00685, 2015.
[9] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. ICML, 2015, pp. 2048–2057.
[10] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4945–4949.
[11] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "End-to-end continuous speech recognition using attention-based recurrent NN: First results," arXiv preprint arXiv:1412.1602, 2014.
[12] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[13] Changhao Shan, Junbo Zhang, Yujun Wang, and Lei Xie, "Attention-based end-to-end speech recognition on voice search," pp. 4764–4768, 2017.
[14] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. ICASSP. IEEE, 2016, pp. 4960–4964.
[15] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML. ACM, 2006, pp. 369–376.
[16] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. ICML, 2014, pp. 1764–1772.
[17] Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," arXiv preprint arXiv:1507.06947, 2015.
[18] Alex Graves, "Sequence transduction with recurrent neural networks," arXiv preprint arXiv:1211.3711, 2012.
[19] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP. IEEE, 2013, pp. 6645–6649.
[20] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar, "Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer," in Proc. ASRU. IEEE, 2017, pp. 193–199.
[21] Santiago Fernández, Alex Graves, and Jürgen Schmidhuber, "Sequence labelling in structured domains with hierarchical recurrent neural networks," in Proc. IJCAI, 2007.
[22] Yajie Miao, Mohammad Gowayyed, and Florian Metze, "EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding," in Proc. ASRU. IEEE, 2015, pp. 167–174.
