Real to H-space Encoder for Speech Recognition


Authors: Titouan Parcollet, Mohamed Morchid, Georges Linarès

Titouan Parcollet 1,2, Mohamed Morchid 1, Georges Linarès 1, Renato De Mori 3
1 Avignon Université, LIA, France
2 ORKIS, Aix-en-Provence, France
3 McGill University, Montréal, QC, Canada
titouan.parcollet@alumni.univ-avignon.fr, {firstname.lastname}@univ-avignon.fr

Abstract

Deep neural networks (DNNs), and more precisely recurrent neural networks (RNNs), are at the core of modern automatic speech recognition systems, due to their efficiency in processing input sequences. Recently, it has been shown that different input representations, based on multidimensional algebras such as complex and quaternion numbers, bring to neural networks a more natural, compressive and powerful representation of the input signal, outperforming common real-valued NNs. Indeed, quaternion-valued neural networks (QNNs) better learn both internal dependencies, such as the relation between the Mel-filter-bank value of a specific time frame and its time derivatives, and global dependencies, describing the relations that exist between time frames. Nonetheless, QNNs are limited to quaternion-valued input signals, and it is difficult to benefit from this powerful representation with real-valued input data. This paper proposes to tackle this weakness by introducing a real-to-quaternion encoder that allows QNNs to process any one-dimensional input features, such as traditional Mel-filter-banks for automatic speech recognition.

Index Terms: quaternion neural networks, recurrent neural networks, speech recognition

1. Introduction

Automatic speech recognition (ASR) systems have been widely impacted by machine learning, and more precisely by the resurgence of deep neural networks (DNNs).
In particular, recurrent neural networks (RNNs) have been designed to learn the parameters of sequence-to-sequence mappings, and various models were successfully applied to ASR with a remarkable increase in system performance. To avoid parameter estimation problems, RNNs with long short-term memory [1, 2] and gated recurrent units (GRU) [3] have been proposed to mitigate vanishing and exploding gradients when learning long input sequences. Nevertheless, less attention has been paid to modeling input features with multiple views of speech spectral tokens.

A noticeable exception is the use of complex-valued numbers in neural networks (CVNNs) to jointly represent the amplitude and phase of spectral samples [4]. More recently, quaternion-valued neural networks (QNNs) have been investigated to process the traditional Mel-frequency cepstral coefficients (MFCCs), or Mel-filter-banks plus time derivatives [5, 6, 7], as composed entities. Superior accuracy, with up to four times fewer model parameters, has been observed with quaternion-valued models compared to equivalent real-valued models. In fact, common real-valued neural networks process energies and time derivatives independently, learning both global dependencies between multiple time frames and local values within a specific time frame, without considering the relations between a value and its derivatives. Instead, the quaternion algebra allows QNNs [5, 6, 8, 9, 10, 11] to process time frames as composed entities, with internal relations learned within the specific algebra and global dependencies learned with the neural network architecture, while reducing the number of neural parameters by an important factor.
Nonetheless, QNN input features must be encoded as quaternion numbers, requiring a preliminary definition of input views that cannot be modified by the learning process. In many cases, it looks advantageous to have multiple input feature views, but there may be different choices of them, and it is not clear how to make a selection; examples could be views based on temporal or spectral relations. In this paper, a real-to-quaternion (R2H) encoder is proposed to let a quaternion-valued neural architecture learn hidden representations of input feature views. The R2H layer acts as an encoder that lets QNNs be trained with any real-valued input vector. Indeed, this encoder allows the model to learn, in an end-to-end architecture, a latent quaternion-valued representation of the input data. This representation is then used as the input to a quaternion-valued classifier, to exploit the capabilities of quaternion neural networks. To achieve this objective, the contributions of this paper are:

• Investigate different real-to-quaternion (R2H) encoders to learn an internal representation of any real-valued input data (Section 4).
• Merge the R2H with the previously introduced quaternion long short-term memory neural network (QLSTM, Section 3) [6]¹.
• Evaluate this approach on the TIMIT and Librispeech speech recognition tasks (Section 5).

Improvements on both the TIMIT and Librispeech speech recognition tasks are reported with the introduction of an R2H encoder in a QLSTM architecture, with inputs made of 40 Mel-filter-bank coefficients and with more than three times fewer neural parameters than real-valued LSTMs.

2. Quaternion algebra

The quaternion algebra $\mathbb{H}$ defines operations between quaternion numbers.
A quaternion Q is an extension of a complex number to the hyper-complex plane, defined in a four-dimensional space as:

$Q = r1 + x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$,    (1)

where r, x, y, and z are real numbers, and 1, i, j, and k are the quaternion unit basis. In a quaternion, r is the real part, while $x\mathbf{i} + y\mathbf{j} + z\mathbf{k}$, with $\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$, is the imaginary part, or the vector part. Such a definition can be used to describe spatial rotations. A quaternion Q can also be summarized into the following matrix of real numbers, which turns out to be more suitable for computations:

$Q_{mat} = \begin{bmatrix} r & -x & -y & -z \\ x & r & -z & y \\ y & z & r & -x \\ z & -y & x & r \end{bmatrix}$.    (2)

The conjugate $Q^*$ of Q is defined as:

$Q^* = r1 - x\mathbf{i} - y\mathbf{j} - z\mathbf{k}$.    (3)

Then, a normalized, or unit, quaternion $Q^\triangleleft$ is expressed as:

$Q^\triangleleft = \frac{Q}{|Q|}$,    (4)

with $|Q|$ the norm of Q defined as:

$|Q| = \sqrt{r^2 + x^2 + y^2 + z^2}$.    (5)

Finally, the Hamilton product $\otimes$ between two quaternions $Q_1$ and $Q_2$ is computed as follows:

$Q_1 \otimes Q_2 = (r_1 r_2 - x_1 x_2 - y_1 y_2 - z_1 z_2) + (r_1 x_2 + x_1 r_2 + y_1 z_2 - z_1 y_2)\mathbf{i} + (r_1 y_2 - x_1 z_2 + y_1 r_2 + z_1 x_2)\mathbf{j} + (r_1 z_2 + x_1 y_2 - y_1 x_2 + z_1 r_2)\mathbf{k}$.    (6)

The Hamilton product is used in QNNs to perform transformations of vectors representing quaternions, as well as scaling and interpolation between two rotations following a geodesic over a sphere in the $\mathbb{R}^3$ space, as shown in [12].

¹Code is available at: https://github.com/Orkis-Research/Pytorch-Quaternion-Neural-Networks

3. Quaternion long short-term memory neural networks

Long short-term memory neural networks (LSTM) are a well-known and well-investigated extension of recurrent neural networks [1, 13]. LSTMs offer an elegant solution to the vanishing and exploding gradient problems, alongside a stronger capability to learn long and short-term dependencies within sequences. Following these strengths, a quaternion long short-term memory neural network (QLSTM) has been proposed [5].
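Before turning to the QLSTM itself, the algebra of Eqs. (1)-(6) can be made concrete with a small NumPy sketch. This is an illustration, not the paper's code release: a quaternion is stored as a length-4 array [r, x, y, z], and the names (`q_mat`, `hamilton`, `normalize`) are ours. Note that left-multiplying by the matrix of Eq. (2) reproduces the Hamilton product of Eq. (6).

```python
import numpy as np

def q_mat(q):
    """Real 4x4 matrix of Eq. (2): left-multiplication by q."""
    r, x, y, z = q
    return np.array([[r, -x, -y, -z],
                     [x,  r, -z,  y],
                     [y,  z,  r, -x],
                     [z, -y,  x,  r]])

def hamilton(q1, q2):
    """Hamilton product of Eq. (6), expanded component by component."""
    r1, x1, y1, z1 = q1
    r2, x2, y2, z2 = q2
    return np.array([
        r1*r2 - x1*x2 - y1*y2 - z1*z2,
        r1*x2 + x1*r2 + y1*z2 - z1*y2,
        r1*y2 - x1*z2 + y1*r2 + z1*x2,
        r1*z2 + x1*y2 - y1*x2 + z1*r2,
    ])

def conjugate(q):
    """Conjugate of Eq. (3)."""
    r, x, y, z = q
    return np.array([r, -x, -y, -z])

def normalize(q):
    """Unit quaternion of Eqs. (4)-(5)."""
    return q / np.sqrt(np.sum(q ** 2))
```

For instance, `hamilton([0,1,0,0], [0,0,1,0])` returns `[0,0,0,1]`, i.e. ij = k, and `q_mat(q1) @ q2` agrees with `hamilton(q1, q2)` for any pair of quaternions, which is exactly why QNN implementations can work with real-valued matrices.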
In a quaternion-valued layer, all parameters are quaternions, including inputs, outputs, weights and biases. The quaternion algebra is ensured by manipulating matrices of real numbers [7, 11] that reconstruct the Hamilton product. Consequently, for each input vector of size N and output vector of size M, the dimensions are split into four parts: the first equals r, the second $x\mathbf{i}$, the third $y\mathbf{j}$, and the last $z\mathbf{k}$. The inference process of a fully-connected layer is defined in the real-valued space by the dot product between an input vector and a real-valued M × N weight matrix. In a QLSTM, this operation is replaced with the Hamilton product '$\otimes$' (Eq. 6) with quaternion-valued matrices (i.e., each entry in the weight matrix is a quaternion).

Both LSTM and QLSTM networks rely on a gate action [14] that allows the cell state to retain or discard information from the past, and from the future in the case of a bidirectional (Q)LSTM. Gates are defined in the quaternion space following [5]. Indeed, the gate mechanism implies a component-wise product of the components of the quaternion-valued signal with the gate potential, in a split manner [15]. Let $f_t$, $i_t$, $o_t$, $c_t$, and $h_t$ be the forget, input and output gates, the cell state and the hidden state of a QLSTM cell at time step t. The QLSTM equations are defined as:

$f_t = \sigma(W_f \otimes x_t + R_f \otimes h_{t-1} + b_f)$,    (7)
$i_t = \sigma(W_i \otimes x_t + R_i \otimes h_{t-1} + b_i)$,    (8)
$c_t = f_t \times c_{t-1} + i_t \times \alpha(W_c \otimes x_t + R_c \otimes h_{t-1} + b_c)$,    (9)
$o_t = \sigma(W_o \otimes x_t + R_o \otimes h_{t-1} + b_o)$,    (10)
$h_t = o_t \times \alpha(c_t)$,    (11)

with $\sigma$ and $\alpha$ the Sigmoid and Tanh quaternion split activations [15, 9].

Bidirectional connections allow (Q)LSTM networks to consider both past and future information at a specific time step, enabling the model to capture a more global context [2]. Quaternion bidirectional connections are identical to real-valued ones.
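A hedged sketch of one cell update (Eqs. (7)-(11)) for a single quaternion neuron follows. We assume "split" activations, i.e. sigmoid and tanh applied component-wise, and implement each $W \otimes x$ through the equivalent real matrix of Eq. (2); `qlstm_step` and the dict layout are illustrative names, not the paper's API.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def q_left_mat(q):
    """Real 4x4 matrix (Eq. (2)) so that q_left_mat(q1) @ q2 == q1 "Hamilton" q2."""
    r, x, y, z = q
    return np.array([[r, -x, -y, -z],
                     [x,  r, -z,  y],
                     [y,  z,  r, -x],
                     [z, -y,  x,  r]])

def qlstm_step(x_t, h_prev, c_prev, W, R, b):
    """One QLSTM cell update for a single quaternion neuron.

    W, R, b are dicts over the gates {'f', 'i', 'o', 'c'}; W[g] and R[g]
    are quaternions (length-4 arrays) applied via the Hamilton product;
    x_t, h_prev, c_prev are quaternions as length-4 arrays."""
    def affine(g):
        return q_left_mat(W[g]) @ x_t + q_left_mat(R[g]) @ h_prev + b[g]
    f_t = sigmoid(affine('f'))                       # Eq. (7), split sigmoid
    i_t = sigmoid(affine('i'))                       # Eq. (8)
    c_t = f_t * c_prev + i_t * np.tanh(affine('c'))  # Eq. (9), split tanh
    o_t = sigmoid(affine('o'))                       # Eq. (10)
    h_t = o_t * np.tanh(c_t)                         # Eq. (11)
    return h_t, c_t
```

In a full layer, each entry of the weight matrices is a quaternion and the per-neuron products above become block-structured real matrix multiplications, which is how the PyTorch release implements them.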
Past and future contexts are added together component-wise at each time step.

An adapted initialization scheme for quaternion neural network parameters has been proposed in [6]. In practice, biases are set to zero while weights are sampled following a Chi distribution with four degrees of freedom. Finally, QRNNs require a specific backpropagation algorithm, detailed in [6].

4. R2H encoder

As mentioned in the introduction, having input features represented by quaternions requires a predefined number of views for the same input token. This prevents the use of quaternion networks when prior knowledge suggests multiple views whose number and type cannot be exactly defined. For example, it is known that time relations between a speech spectrum and its neighboring spectra may improve the classification of the phoneme whose utterance produced the spectrum. Nevertheless, these relations may not be limited to time derivatives of all spectral samples in a spoken sentence. To overcome this limitation, a new method is proposed. It consists in introducing a real-valued encoder directly connected to the real-valued input signal. The real-to-quaternion (R2H) encoder is trained jointly with the rest of the model in an end-to-end manner, like any other layer. After training, the encoder is expected to provide a mapping from the real space of the input features to a latent internal representation meaningful for the following quaternion layers. The trained model is thus able to deal directly with real-valued input features, while internally processing quaternion numbers. The R2H encoder is a traditional dense layer followed by a quaternion activation function and a normalization. The number of neurons contained in the layer must be a multiple of four for the quaternion representation. Let W, X and B be the weight matrix, the real-valued input and the bias vector, respectively.
$Q^\triangleleft_{out}$ is the unit quaternion vector obtained at the output of the projection layer and is expressed as:

$Q^\triangleleft_{out} = \frac{Q_{out}}{|Q_{out}|}$,    (12)

with

$Q_{out} = \alpha(W \cdot X + B)$,    (13)

and $\alpha$ any quaternion split activation function. In practice, $Q_{out}$ and $Q^\triangleleft_{out}$ follow the quaternion internal representation defined in Section 3. Consequently, the input is split into four features from a latent sub-space that are interpreted as quaternion components: the first equals r, the second $x\mathbf{i}$, the third $y\mathbf{j}$, and the last $z\mathbf{k}$, making it possible to apply the quaternion normalization and the activation function. At the end of training, $Q^\triangleleft_{out}$ captures an internal latent mapping of the real-valued input signal X through a vector of unit quaternions. Adding the R2H encoder as an input layer to QLSTMs, or to any other QNN, allows the model to deal with real-valued inputs while retaining the strengths of QNNs (Figure 1).

Figure 1: Illustration of the R2H encoder, used as an input layer to a QLSTM. Inputs are real, before being turned into quaternions, and finally unit quaternions within the R2H encoder.

5. Experiments

The model architectures used for the experiments are presented in Section 5.1. Then, the R2H encoder is compared to the traditional and naive quaternion representation on the TIMIT and Librispeech speech recognition tasks (Section 5.2).

5.1. Model architectures

QLSTMs have already been investigated for speech recognition in [5] and [6]. Consequently, and based on this previous research, the QLSTMs are composed of four bidirectional QLSTM layers with an internal real-valued size of 1,024, equivalent to 256 quaternion neurons (256 × 4 = 1,024 real numbers). The R2H encoder size varies from 256 to 1,024 to explore the best latent quaternion representation. Tanh, HardTanh and ReLU activation functions are investigated to compare the impact of bounded (Tanh, HardTanh) and unbounded (ReLU) R2H encoders.
In fact, the quaternion normalization allows a numerical reduction of the internal representation, but the ReLU counteracts this effect by injecting large positive real values into the encoding. The final layer is real-valued and corresponds to the HMM states obtained with the Kaldi toolkit [16]. A dropout of 0.2 is applied across all layers except the output. The Adam optimizer [17] is used to train the models with vanilla hyperparameters. The learning rate is halved every time the improvement of the loss on the validation set falls below a threshold fixed to 0.001, to avoid overfitting. Finally, the models are implemented with the PyTorch-Kaldi toolkit [18]. While the effectiveness of QLSTMs over LSTMs has already been demonstrated, an LSTM network trained in the same conditions and based on [5] is considered as a baseline. All models are trained for 30 epochs, and the results on both the validation and test sets are saved at this point.

5.2. Phoneme recognition with the TIMIT corpus

The training process is based on the standard 3,696 sentences uttered by 462 speakers, while testing is conducted on 192 sentences uttered by 24 speakers of the TIMIT dataset [19]. A validation set composed of 400 sentences uttered by 50 speakers is used for hyper-parameter tuning. The raw audio is processed with a window of 25 ms and an overlap of 10 ms. Then, 40-dimensional log Mel-filter-bank coefficients are extracted with the Kaldi toolkit. In previous work with QLSTMs [5, 6], first, second and third order time derivatives were composed with spectral energies to build a multidimensional quaternion input representation. In this paper, the time derivatives are no longer used. Instead, latent representations are directly learned by the R2H encoder, fed with the 40 log Mel-filter-bank coefficients.
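Under this setup (40 filter-bank inputs, an R2H layer of up to 1,024 real neurons), the R2H forward pass of Eqs. (12)-(13) can be sketched as follows. This is an illustrative NumPy version with a split Tanh activation; the function name and shapes are our assumptions, not the released implementation.

```python
import numpy as np

def r2h_forward(x, W, b):
    """R2H encoder forward pass.

    x: real-valued input of shape (n,), e.g. 40 log Mel-filter-banks;
    W: weight matrix of shape (4m, n); b: bias of shape (4m,).
    Returns an (m, 4) array of unit quaternions [r, x, y, z]."""
    q = np.tanh(W @ x + b)          # Eq. (13) with a split Tanh activation
    q = q.reshape(-1, 4)            # interpret 4m outputs as m quaternions
    norms = np.linalg.norm(q, axis=1, keepdims=True)
    return q / norms                # Eq. (12): quaternion-wise normalization
```

For the largest encoder, a 40-dimensional frame is mapped to 1,024 real values, i.e. 256 unit quaternions, which are then consumed by the first QLSTM layer.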
For the sake of comparison, an input quaternion is also naively composed of four consecutive Mel-filter-bank coefficients before being fed to a standard QLSTM.

Figure 2 reports the results obtained for the investigation of the R2H encoder size and the impact of the activation layer. Results are averaged over three runs and are not selected w.r.t. the validation set; performance on the test set is evaluated only once, at the end of the training phase. It is first interesting to note that a layer of 1,024 neurons always gives better results than a layer of size 256 or 512, regardless of the activation function. In the same manner, the Tanh activation outperforms both the ReLU and HardTanh activation functions at every layer size, with an average phoneme error rate (PER) on the TIMIT test set of 15.6%, compared to 16.7% and 16% for the ReLU and HardTanh activations. It is important to note that the ReLU activation gives the worst results. An explanation of this phenomenon is the definition interval of the ReLU function: with ReLU, the outputs of the R2H layer are not bounded in the positive domain before being normalized. Therefore, the dense layer can output large values that are then squashed by the quaternion normalization, and it can be hard for the neural network to learn such a mapping. Conversely, both HardTanh and Tanh functions are bounded by −1 and 1, making the mapping easier to learn, since values of the R2H layer before and after normalization vary in the same range. The HardTanh function also saturates hard at −1 and 1, in the same manner as the ReLU activation for negative numbers, while the Tanh tends smoothly to these bounds. Consequently, the HardTanh gives slightly worse results than the Tanh. Finally, a best PER of 15.4% is obtained with a normalized R2H encoder of size 1,024 based on the Tanh activation function, compared to 16.5% and 15.9% with the ReLU and HardTanh functions.

Figure 2: Phoneme Error Rate (PER %) obtained on the test set of the TIMIT corpus with different activation functions and different R2H encoder sizes for a QLSTM. Results are averaged over three runs.

    R2H size    ReLU    HardTanh    Tanh
    256         16.9    16.2        15.9
    512         16.7    15.9        15.5
    1024        16.5    15.9        15.4

Table 1 presents a summary of the results observed on the TIMIT phoneme recognition task with a QLSTM and basic quaternion features, compared to the proposed QLSTM coupled with the best R2H encoder from Figure 2. For a fair comparison, a real-valued LSTM is also tested. As highlighted in [6], QLSTM models require fewer neural parameters than LSTMs due to their internal quaternion algebra: an LSTM with 1,024 neurons per layer is composed of 46.0 million parameters, while the corresponding QLSTMs only need 15.5M parameters. It is first interesting to note that the R2H encoder helps the QLSTM to obtain the same PER as the real-valued LSTM while dealing with a real-valued input signal. Indeed, both models reach 15.4% on the test set, while the QLSTM still requires more than three times fewer neural parameters.

Table 1: Phoneme error rate (PER %) of the models on the development and test sets of the TIMIT dataset. "Params" stands for the total number of trainable parameters. "R2H-Norm" and "R2H" correspond to R2H encoders with and without normalization. Results are averaged over 3 runs.

    Models            Dev.    Test    Params
    LSTM              14.5    15.4    46.0M
    QLSTM             14.9    15.9    15.5M
    R2H-QLSTM         14.7    15.7    15.5M
    R2H-Norm-QLSTM    14.4    15.4    15.5M

Then, it is worth underlining that the basic QLSTM without the R2H layer obtains the worst PER of all models, with 15.9% on the test set, due to its inappropriate input representation. Finally, the impact of the quaternion normalization process is investigated by comparing an R2H encoder without normalization to a normalized one.
As expected, the quaternion normalization helps the input to fit the quaternion representation, and thus gives better results, with a PER of 15.4% compared to 15.7% for the non-normalized R2H encoder. It is important to mention that these results are obtained without batch normalization, speaker adaptation or rescoring methods.

5.3. Speech recognition with the Librispeech corpus

The experiments are extended to the larger Librispeech dataset [20]. Librispeech is composed of three distinct training subsets of 100, 360 and 500 hours of speech, respectively, representing a total training set of 960 hours of read English speech. In our experiments, the models are trained following the setup described in Section 5.1, based on the train_clean_100 subset containing 100 hours. Results are reported on the test_clean set. Input features are the same as for the TIMIT experiments, and the best activation function reported in Figure 2 (Tanh) is used. No regularization techniques such as batch normalization are used, and no rescoring method is applied at testing time.

Table 2: Word error rate (WER %) of the models on the test_clean set of the Librispeech dataset with training on the train_clean_100 subset. "Params" stands for the total number of trainable parameters. "R2H-Norm" and "R2H" correspond to R2H encoders with and without normalization. No rescoring technique is applied.

    Models            Test    Params
    LSTM              8.1     49.0M
    QLSTM             8.5     17.7M
    R2H-QLSTM         8.3     17.7M
    R2H-Norm-QLSTM    8.0     17.7M

The total number of neural parameters differs slightly from the TIMIT experiments due to the increased number of HMM states, and therefore of output neurons, for the Librispeech task. Nonetheless, the number of parameters is still lowered by a factor of 3 when using QLSTM networks, compared to the real-valued LSTM.
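This parameter saving can be sanity-checked for a single dense layer: a real layer maps N inputs to M outputs with N × M weights, while a quaternion layer with the same real dimensions stores (N/4) × (M/4) quaternions of 4 real numbers each, i.e. N × M / 4 weights. The helper names below are hypothetical and only weight matrices (no biases, gates or output layer) are counted, which is why the factor here is exactly 4 rather than the roughly 3× observed for the full models.

```python
def real_dense_params(n, m):
    """Weight count of a real-valued dense layer: one real per (input, output) pair."""
    return n * m

def quaternion_dense_params(n, m):
    """Weight count of a quaternion dense layer with the same real dimensions:
    (n/4) x (m/4) quaternion weights, each made of 4 real numbers."""
    assert n % 4 == 0 and m % 4 == 0
    return (n // 4) * (m // 4) * 4
```

For the 1,024-wide layers used in the experiments, this gives 1,048,576 real weights against 262,144 quaternion-layer weights, a 4× reduction per layer.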
Similarly to the TIMIT experiments, the QLSTM with a normalized R2H layer reaches slightly better performance in terms of word error rate (WER), with 8.0% compared to 8.1% for the LSTM. Moreover, the R2H encoder allows the QLSTM WER to decrease from 8.5% to 8.0%, an absolute gain of 0.5%. The results reported on the larger Librispeech dataset demonstrate that the R2H encoder scales well to more realistic speech recognition tasks.

6. Conclusions

Summary. This paper addresses one of the major weaknesses of quaternion-valued neural networks: their inability to process non-quaternion-valued input signals. A new real-to-quaternion (R2H) encoder is introduced, making it possible to learn, in an end-to-end manner, a latent quaternion representation from any real-valued input data. This representation is then processed with QNNs such as a quaternion LSTM. The experiments conducted on the TIMIT phoneme recognition task demonstrate that this new approach outperforms a naive quaternion representation of the input signal, enabling the use of QNNs with any type of input.

Future work. Split activation functions and current quaternion gate mechanisms do not fully respect the quaternion algebra, as they consider each element as an uncorrelated component. Future work will investigate purely quaternion recurrent neural networks, involving well-adapted activation functions and proper quaternion gates.

7. References

[1] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[2] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[3] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[4] C. Trabelsi, O. Bilaniuk, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal, "Deep complex networks," arXiv preprint, 2017.
[5] T. Parcollet, M. Morchid, G. Linarès, and R. De Mori, "Bidirectional quaternion long short-term memory recurrent neural networks for speech recognition," arXiv preprint, 2018.
[6] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, and Y. Bengio, "Quaternion recurrent neural networks," 2018.
[7] T. Parcollet, Y. Zhang, M. Morchid, C. Trabelsi, G. Linarès, R. De Mori, and Y. Bengio, "Quaternion convolutional neural networks for end-to-end automatic speech recognition," in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018, pp. 22–26.
[8] T. Nitta, "A quaternary version of the back-propagation algorithm," in Neural Networks, 1995. Proceedings., IEEE International Conference on, vol. 5. IEEE, 1995, pp. 2753–2756.
[9] P. Arena, L. Fortuna, G. Muscato, and M. G. Xibilia, "Multilayer perceptrons to approximate quaternion valued functions," Neural Networks, vol. 10, no. 2, pp. 335–342, 1997.
[10] T. Isokawa, N. Matsui, and H. Nishimura, "Quaternionic neural networks: Fundamental properties and applications," Complex-Valued Neural Networks: Utilizing High-Dimensional Parameters, pp. 411–439, 2009.
[11] C. J. Gaudet and A. S. Maida, "Deep quaternion networks," in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.
[12] T. Minemoto, T. Isokawa, H. Nishimura, and N. Matsui, "Feed forward neural network with random quaternionic neurons," Signal Processing, vol. 136, pp. 59–68, 2017.
[13] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2017.
[14] I. Danihelka, G. Wayne, B. Uria, N. Kalchbrenner, and A. Graves, "Associative long short-term memory," arXiv preprint arXiv:1602.03032, 2016.
[15] D. Xu, L. Zhang, and H. Zhang, "Learning algorithms in quaternion neural networks using GHR calculus," Neural Network World, vol. 27, no. 3, p. 271, 2017.
[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint, 2014.
[18] M. Ravanelli, T. Parcollet, and Y. Bengio, "The PyTorch-Kaldi speech recognition toolkit," in Proc. of ICASSP, 2019.
[19] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[20] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.
