Deep Long Short-Term Memory Adaptive Beamforming Networks For Multichannel Robust Speech Recognition
Authors: Zhong Meng, Shinji Watanabe, John R. Hershey, Hakan Erdogan
DEEP LONG SHORT-TERM MEMORY ADAPTIVE BEAMFORMING NETWORKS FOR MULTICHANNEL ROBUST SPEECH RECOGNITION

Zhong Meng 1,2∗, Shinji Watanabe 1, John R. Hershey 1, Hakan Erdogan 3

1 Mitsubishi Electric Research Laboratories, Cambridge, MA
2 Georgia Institute of Technology, Atlanta, GA
3 Microsoft Research, Redmond, WA

ABSTRACT

Far-field speech recognition in noisy and reverberant conditions remains a challenging problem despite recent deep learning breakthroughs. This problem is commonly addressed by acquiring a speech signal from multiple microphones and performing beamforming over them. In this paper, we propose to use a recurrent neural network with long short-term memory (LSTM) architecture to adaptively estimate real-time beamforming filter coefficients, in order to cope with non-stationary environmental noise and the dynamic nature of the source and microphone positions, which results in a set of time-varying room impulse responses. The LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to predict senone labels. Further, we use hidden units in the deep LSTM acoustic model to assist in predicting the beamforming filter coefficients. The proposed system achieves a 7.97% absolute gain over baseline systems with no beamforming on the CHiME-3 real evaluation set.

Index Terms — beamforming, multichannel, speech recognition, LSTM

1. INTRODUCTION

Although extraordinary performance has been achieved in automatic speech recognition (ASR) with the advent of deep neural networks (DNNs) [1, 2], performance still degrades dramatically in noisy and far-field conditions [3, 4]. To achieve robust speech recognition, multiple microphones can be used to enhance the speech signal, reduce the effects of noise and reverberation, and improve ASR performance.
In this scenario, an essential step of the ASR front-end processing is multichannel filtering, or beamforming, which steers a spatial sensitivity region, or "beam," in the direction of the target source, and inserts spatial suppression regions, or "nulls," in the directions corresponding to noise and other interference. Delay-and-sum (DAS) beamforming is widely used for multichannel signal processing [5]: the multichannel inputs of a microphone array are delayed so as to be aligned in time and then summed into a single-channel signal. The signal from the target direction is enhanced, while noise and interference coming from other directions are attenuated. Filter-and-sum beamforming applies filters to the input channels before summing them [6]. Minimum variance distortionless response (MVDR) [7] and generalized eigenvalue (GEV) [8] beamforming are filter-and-sum methods that solve for the filter coefficients using different derivations.

∗ Zhong Meng performed the work while he was an intern at Mitsubishi Electric Research Laboratories, Cambridge, MA.

Although these methods have achieved good beamforming performance, their goal is to optimize only a signal-level objective (e.g., SNR). To achieve robust speech recognition, it is more important to jointly optimize the beamformer and the acoustic model with the objective of maximizing ASR performance. In [9], the parameters of a frequency-domain beamformer are first estimated by a DNN based on the generalized cross-correlation between microphones; conventional features are then extracted from the beamformed signal and passed through a second DNN for acoustic modeling. Instead of filtering in the frequency domain, [10] performs spatial and spectral filtering through time-domain convolution over the raw waveform. The output feature is then passed to a convolutional LSTM DNN (CLDNN) acoustic model to predict the context-dependent state output targets.
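As an illustration of the delay-and-sum idea, the following sketch (our own, not from the paper; all names are illustrative) compensates known per-channel delays with a phase ramp in the STFT domain and then averages the aligned channels:

```python
import numpy as np

def delay_and_sum(stft, delays, fs, n_fft):
    """Frequency-domain delay-and-sum beamformer (illustrative sketch).

    stft:   (M, T, F) complex STFT of the M array channels
    delays: (M,) per-channel delays in seconds, assumed known; a delay of
            tau multiplies the spectrum by e^{-j 2 pi f tau}, so we apply
            the conjugate phase ramp to re-align the channels.
    """
    M, T, F = stft.shape
    freqs = np.arange(F) * fs / n_fft                              # bin center frequencies
    align = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])  # (M, F) phase ramps
    return (stft * align[:, None, :]).mean(axis=0)                 # (T, F) enhanced STFT
```

With all delays zero this reduces to plain channel averaging, which is the behavior for a source at broadside of the array.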
In [11], the beamforming and frequency decomposition are factorized into separate layers of the network. These approaches assume that the speaker position and the environment are fixed, and estimate constant filter coefficients for either beamforming or spatial and spectral filtering. However, in real noisy and far-field scenarios, the positions of the source (speaker) and noise and the room impulse response keep changing, so the time-invariant filter coefficients estimated by these networks may fail to robustly enhance the target signal. Therefore, we propose to adaptively estimate the beamforming filter coefficients at each time frame using an LSTM, to deal with any changes in the source, noise or channel conditions. The enhanced signal is generated by applying these time-variant filter coefficients to the short-time Fourier transform (STFT) of the array signals. Log filterbank-like features are obtained from the enhanced signal and then passed to a deep LSTM acoustic model to predict the senone posteriors. The LSTM beamforming network and the LSTM acoustic model are jointly trained using truncated back-propagation through time (BPTT) with a cross-entropy objective. The STFT coefficients of the array signals are used as the input of the beamforming network. In the ASR systems of [12, 13], the speech signal is enhanced by NMF or an LSTM before being fed into the acoustic model, but the speech enhancement module and the acoustic model are not jointly optimized to minimize the WER, and the input is a single-channel signal only.

Previous work [14] has shown that speech separation performance can be improved by incorporating speech recognition alignment information within the speech enhancement framework. Inspired by this, we feed the units of the top hidden layer of the LSTM acoustic model at the previous time step back as an auxiliary input to the beamforming network to predict the current filter coefficients.
Note that our work differs from [15] in that: (1) we perform adaptive beamforming over 5 input channels, while their system works only on 2 input channels; (2) our adaptive LSTM beamformer predicts only frequency-domain filter coefficients and performs frequency-domain filter-and-sum over the STFT coefficients, while their work mainly focuses on time-domain filtering with raw waveforms as input; (3) in our work, log Mel filterbank-like features are generated with a fixed log Mel transform over the beamformed STFT coefficients for acoustic modeling, while in their work time/frequency-domain convolution with trainable parameters is performed on the beamformed features; (4) no additional gate modulation is applied to the feedback, which reduces the system complexity for our much smaller dataset. In the experiments, we show that this feedback captures high-level knowledge about the acoustic states and improves performance. The experiments are conducted on the CHiME-3 dataset. The joint training of the LSTM adaptive beamforming network and the deep LSTM acoustic model achieves a 7.75% absolute gain over the single-channel baseline on the real test data. The acoustic model feedback provides an extra gain of 0.22%.

2. LSTM ADAPTIVE BEAMFORMING

2.1. Adaptive Filter-and-Sum Beamforming

As a generalization of delay-and-sum beamforming, a filter-and-sum beamformer processes the signal from each microphone with a finite impulse response (FIR) filter before summing the channels. In the frequency domain, this operation can be written as:

$\hat{x}_{t,f} = \sum_{m=1}^{M} g_{f,m}\, x_{t,f,m}, \quad (1)$

where $x_{t,f,m} \in \mathbb{C}$ is the complex STFT coefficient at time-frequency index $(t,f)$ of the signal from channel $m$, $g_{f,m} \in \mathbb{C}$ is the beamforming filter coefficient, and $\hat{x}_{t,f} \in \mathbb{C}$ is the complex STFT coefficient of the enhanced signal. In Eq. (1), $t = 1, \dots, T$, $f = 1, \dots, F$, and $M$, $T$, $F$ are the numbers of microphones, time frames and frequencies.
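Eq. (1) is a complex inner product over channels at each time-frequency bin. A minimal NumPy sketch (our illustration, not the paper's code):

```python
import numpy as np

def filter_and_sum(x, g):
    """Eq. (1): x_hat[t, f] = sum_m g[f, m] * x[t, f, m].

    x: (T, F, M) complex STFT coefficients of the array signals
    g: (F, M)    complex time-invariant filter coefficients
    """
    return np.einsum('tfm,fm->tf', x, g)  # sum over the channel axis m
```

Setting every `g[f, m] = 1/M` recovers plain averaging of time-aligned channels, i.e., delay-and-sum after alignment.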
To cope with the time-variant source position and room impulse response, we make the filter coefficients time-dependent and propose adaptive filter-and-sum beamforming:

$\hat{x}_{t,f} = \sum_{m=1}^{M} g_{t,f,m}\, x_{t,f,m}, \quad (2)$

where $g_{t,f,m} \in \mathbb{C}$ is a time-variant complex filter coefficient.

2.2. Adaptive LSTM Beamforming Network

The LSTM network is a special kind of recurrent neural network (RNN) with purpose-built memory cells to store information [16]. The LSTM has been successfully applied to many different tasks [17, 18] due to its strong capability of learning long-term dependencies. The LSTM takes an input sequence $x = \{x_1, \dots, x_T\}$ and computes the hidden vector sequence $h = \{h_1, \dots, h_T\}$ by iterating

$h_t = \mathrm{LSTM}(x_t, h_{t-1}). \quad (3)$

We implement the LSTM in Eq. (3) with no peephole connections.

In this work, we apply a real-valued LSTM to the adaptive filter-and-sum beamformer to predict the real and imaginary parts of the complex filter coefficients at time $t$ and channel $m$. That is, we introduce the following real-valued vectors for the complex values $g_{t,f,m}$ and $x_{t,f,m}$ in Eq. (2):

$g_{t,m} \triangleq \left[\Re(g_{t,f,m}), \Im(g_{t,f,m})\right]_{f=1}^{F} \in \mathbb{R}^{2F}$
$x_{t} \triangleq \left[\Re(x_{t,f,m}), \Im(x_{t,f,m})\right]_{f=1,m=1}^{F,M} \in \mathbb{R}^{2FM}.$

With this representation, the real-valued LSTM predicts $g_{t,m}$ as follows:

$p_t = W_{x,p}\, x_t \quad (4)$
$h_t = \mathrm{LSTM}_{BF}(p_t, h_{t-1}) \quad (5)$
$g_{t,m} = \tanh(W_{h,m}\, h_t), \quad m = 1, \dots, M, \quad (6)$

where $W_{x,p}$ and $W_{h,m}$ are projection matrices. We use the $\tanh(\cdot)$ function to limit the range of the filter coefficients to $[-1, 1]$. The real and imaginary parts of the STFT coefficient $\hat{x}_{t,f}$ of the beamformed signal are generated from Eq. (2) as follows:

$\Re(\hat{x}_{t,f}) = \sum_{m=1}^{M} \Re(x_{t,f,m})\,\Re(g_{t,f,m}) - \Im(x_{t,f,m})\,\Im(g_{t,f,m})$
$\Im(\hat{x}_{t,f}) = \sum_{m=1}^{M} \Re(x_{t,f,m})\,\Im(g_{t,f,m}) + \Im(x_{t,f,m})\,\Re(g_{t,f,m}). \quad (7)$

More sophisticated features can be extracted from the beamformed STFT coefficients and passed to the LSTM acoustic model to predict the senone posteriors. In our experiments, a log Mel filterbank-like feature is generated from Eq. (7) by

$z_t = \log(\mathrm{Mel}(P_t)) \quad (8)$
$P_t = \left[\Re(\hat{x}_{t,f})^2 + \Im(\hat{x}_{t,f})^2\right]_{f=1}^{F} \in \mathbb{R}^{F}, \quad (9)$

where $\mathrm{Mel}(\cdot)$ denotes multiplication by the Mel matrix, and $P_t$ is the $F$-dimensional real-valued power spectrum of the beamformed signal at time $t$. Global mean and variance normalization is applied to this log Mel filterbank-like feature. Note that all operations in this section are performed with real-valued computation, and can be easily represented by a differentiable computational graph.

2.3. Deep LSTM Acoustic Model

Recently, LSTMs have been shown to be more effective than DNNs [1, 2] and conventional RNNs [19, 20] for acoustic modeling, as they model temporal sequences and long-range dependencies more accurately, especially when the amount of training data is large. LSTMs have been successfully applied in both RNN-HMM hybrid systems [21, 22] and end-to-end systems [23, 24]. In this work, a deep LSTM-HMM hybrid system is used for acoustic modeling. A forced alignment is first generated by a GMM-HMM system and then used as the frame-level acoustic targets which the LSTM attempts to classify. The LSTM is trained with a cross-entropy objective function using truncated BPTT. To connect the deep LSTM with the adaptive LSTM beamformer, we compute the log Mel filterbank feature $z_t$ from the beamformed STFT coefficients:

$q_t = W_{z,p}\, z_t \quad (10)$
$s_t = \mathrm{LSTM}_{AM}(q_t, s_{t-1}) \quad (11)$
$y_t = \mathrm{softmax}(W_{s,y}\, s_t), \quad (12)$

where $q_t$ is the projection of $z_t$ into a high-dimensional space and $y_t$ is the senone posterior.
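The real-valued computation of Eqs. (7)-(9) can be sketched in NumPy as follows. This is our illustrative sketch, not the paper's implementation; the small epsilon inside the log is our addition for numerical stability:

```python
import numpy as np

def apply_filters(xr, xi, gr, gi):
    """Eq. (7): complex filter-and-sum using only real arithmetic.

    xr, xi: (T, F, M) real and imaginary parts of the array STFT
    gr, gi: (T, F, M) real and imaginary parts of the filter coefficients
    """
    re = np.sum(xr * gr - xi * gi, axis=-1)  # Re(x_hat), shape (T, F)
    im = np.sum(xr * gi + xi * gr, axis=-1)  # Im(x_hat), shape (T, F)
    return re, im

def log_mel(re, im, mel, eps=1e-8):
    """Eqs. (8)-(9): power spectrum, fixed Mel projection, then log."""
    power = re**2 + im**2                # Eq. (9): (T, F)
    return np.log(power @ mel.T + eps)   # Eq. (8): (T, n_mels); eps is our addition
```

`apply_filters` is algebraically identical to summing $x_{t,f,m}\,g_{t,f,m}$ over channels in complex arithmetic, which is a convenient way to check the real-valued graph.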
2.4. Integrated Network of LSTM Adaptive Beamformer and Deep LSTM Acoustic Model

To achieve robust speech recognition using multichannel speech signals, the LSTM beamformer of Section 2.2 and the deep LSTM acoustic model of Section 2.3 need to be jointly optimized with the objective of maximizing ASR performance. In other words, the beamforming LSTM is concatenated with the LSTM acoustic model to form an integrated network that takes multichannel STFT coefficients as input and produces senone posteriors, as illustrated in Fig. 1. The deep LSTM has three hidden layers in our experiments, but only one is shown in the figure for simplicity.

To train the integrated LSTM network, we connect the beamforming network (2)-(6), the log Mel filtering (8), and the acoustic model (10)-(12) into a single feed-forward network, and back-propagate the gradient of the cross-entropy objective function through the network, so that both the adaptive beamformer and the acoustic model are optimized for the ASR task using multichannel training data.

Fig. 1. The unfolded integrated network of an LSTM adaptive beamformer and an LSTM acoustic model. The acoustic feedback (in blue) allows the hidden units of the LSTM acoustic model to assist in predicting the filter coefficients at the current time.

On top of that, we feed the hidden units of the top hidden layer of the deep LSTM acoustic model back to the input of the LSTM beamformer as an auxiliary feature for predicting the filter coefficients at the next time step. With this acoustic model feedback, Eq. (5) is rewritten as

$h_t = \mathrm{LSTM}_{BF}((p_t, s_{t-1}), h_{t-1}), \quad (13)$

where $(p_t, s_{t-1})$ is the concatenation of the acoustic feedback $s_{t-1}$ from the previous time step and the current projection $p_t$.

Direct training of the integrated network easily falls into a local optimum, as the gradients for the LSTM beamformer and the deep LSTM acoustic model have different dynamic ranges.
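One unrolled beamformer time step with acoustic feedback, Eqs. (4), (13) and (6), can be sketched as below. This is only a schematic sketch under our own assumptions: `cell` stands in for LSTM_BF (the default is a plain tanh recurrence, not a real LSTM), and all parameter names are illustrative:

```python
import numpy as np

def bf_step(x_t, s_prev, h_prev, W_xp, W_bf, W_hm, cell=None):
    """One beamformer step with acoustic feedback (schematic sketch).

    x_t:    (2FM,) real-valued STFT input vector
    s_prev: acoustic-model hidden state from the previous time step
    W_hm:   list of M projection matrices, one per channel
    """
    if cell is None:  # stand-in recurrence for LSTM_BF, not a true LSTM cell
        cell = lambda v, h, W=W_bf: np.tanh(W @ np.concatenate([v, h]))
    p_t = W_xp @ x_t                                    # Eq. (4): input projection
    h_t = cell(np.concatenate([p_t, s_prev]), h_prev)   # Eq. (13): feedback concatenated
    g_t = [np.tanh(W @ h_t) for W in W_hm]              # Eq. (6): per-channel coefficients
    return h_t, g_t
```

The tanh in the last line keeps every predicted coefficient inside $[-1, 1]$, matching the range constraint stated after Eq. (6).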
For a robust estimation of the model parameters, training is performed in stages, as shown in Algorithm 1.

Algorithm 1: Train the LSTM adaptive beamformer and deep LSTM acoustic model
1: Train a deep LSTM acoustic model with log Mel filterbank features extracted from the speech of all channels to minimize the cross-entropy objective.
2: Initialize the integrated network with the deep LSTM acoustic model from Step 1.
3: Train the integrated network with the ASR cross-entropy objective, updating only the parameters of the LSTM beamformer.
4: Jointly train the integrated network from Step 3 with the ASR cross-entropy objective, updating all parameters of the LSTM beamformer and the deep LSTM acoustic model.
5: Introduce the acoustic feedback and re-train the integrated network with the ASR objective, updating all parameters.

3. EXPERIMENTS

3.1. CHiME-3 Dataset

The CHiME-3 dataset was released with the 3rd CHiME Speech Separation and Recognition Challenge [25]. It consists of Wall Street Journal sentences spoken by talkers in challenging noisy environments, recorded with a 6-channel tablet-based microphone array. CHiME-3 contains both real and simulated data. The real data is speech recorded from actual talkers in four noisy environments (on buses, in cafés, in pedestrian areas, and at street junctions). To generate the simulated data, clean speech is first convolved with the estimated impulse response of an environment and then mixed with background noise separately recorded in that environment [4]. The training set consists of 1600 real noisy utterances from 4 speakers, and 7138 simulated noisy utterances from the 83 speakers of the WSJ0 SI-84 training set in the 4 noisy environments. The development set contains 3280 utterances, including 410 real and 410 simulated utterances for each of the 4 environments.
The test set contains 2640 utterances, including 330 real and 330 simulated utterances for each of the 4 environments. The speakers in the training, development and test sets are mutually disjoint (i.e., 12 different speakers in the CHiME-3 dataset). The training, development and test data are all recorded on 6 channels. The WSJ0 text corpus is also used to train the language model.

3.2. Baseline System

The baseline system is built with the Chainer [26] and Kaldi [27] toolkits. 40-dimensional log Mel filterbank features, extracted by Kaldi from all 6 channels, are used to train a deep LSTM acoustic model in Chainer. The LSTM has 3 layers, each hidden layer with 1024 units. The output layer has 1985 units, each corresponding to a senone target. The input feature is first projected to a 1024-dimensional space before being fed into the LSTM. The forced alignment generated by a GMM-HMM system trained with data from all 6 channels is used as the target for LSTM training. During evaluation, only the development and test data from the 5th channel are used for testing (only for the baseline system). The LSTM is trained using BPTT with a truncation size of 100 and a learning rate of 0.01. The batch size for stochastic gradient descent (SGD) is 100. The WER performance of the baseline system is shown in Table 1.

3.3. LSTM Adaptive Beamformer

257-dimensional complex STFT coefficients are extracted for the speech in channels 1, 3, 4, 5 and 6. The real and imaginary parts of the STFT coefficients from all 5 channels are concatenated to form the 257 × 2 × 5 = 2570-dimensional input of the beamforming LSTM. The input is projected to a 1024-dimensional space before being fed into the LSTM. The beamforming LSTM has one hidden layer with 1024 units.
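The input construction of Section 3.3 amounts to stacking the real and imaginary STFT parts of the 5 channels into one vector; a small shape-check sketch (variable names are ours):

```python
import numpy as np

F, M = 257, 5                            # STFT bins; channels 1, 3, 4, 5, 6
stft = np.zeros((M, F), dtype=complex)   # one frame of multichannel STFT
# [Re, Im] per channel, concatenated over channels
x_t = np.concatenate([np.concatenate([stft[m].real, stft[m].imag])
                      for m in range(M)])
assert x_t.shape == (F * 2 * M,)         # 257 * 2 * 5 = 2570
```

This 2570-dimensional vector is what the 1024-dimensional input projection of the beamforming LSTM consumes at each frame.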
The hidden-unit vector is projected to 5 sets of 257 × 2 = 514-dimensional filter coefficients for adaptively beamforming the signals from the 5 channels using Eq. (2). An MSE objective is computed between the beamformed signal and the BeamformIt [28] output. The beamforming LSTM is trained using BPTT with a truncation size of 100, a batch size of 100 and a learning rate of 1.0.

System           Input Feature   Simu Dev   Real Dev   Simu Test   Real Test
AM (baseline)    Fbank           16.15      19.24      23.02       32.88
BeamformIt+AM    STFT            14.32      12.99      24.36       21.21
BF+AM (fixed)    STFT            15.23      15.01      23.14       25.64
BF+AM            STFT            14.43      15.19      22.40       25.13
BF+AM+Feedback   STFT            14.28      15.10      22.23       24.91

Table 1. WER (%) of the baseline LSTM acoustic model (AM), the BeamformIt-enhanced signal as the input of the AM, and joint training of the LSTM beamformer and LSTM acoustic model (BF+AM), with or without acoustic feedback.

3.4. Joint Training of the Integrated Network

The baseline LSTM acoustic model trained in Section 3.2 and the LSTM adaptive beamformer trained in Section 3.3 are concatenated as the initialization of the integrated network. A feature-extraction layer is inserted between the two LSTMs to extract 40-dimensional log Mel filterbank features with Eq. (8). The integrated network is trained following Steps 3, 4 and 5 of Section 2.4, using BPTT with a truncation size of 100, a batch size of 100 and a learning rate of 0.01. The data from all 5 channels of the development and test sets is used to evaluate the integrated network. The WER performance for the different configurations is shown in Table 1.

3.5. Result Analysis

As Table 1 shows, the best system is the integrated network of an LSTM adaptive beamformer and a deep LSTM acoustic model with acoustic feedback, which achieves 14.28%, 15.10%, 22.23% and 24.91% WER on the simulated development, real development, simulated test and real test sets of CHiME-3, respectively.
Joint training of the integrated network without updating the deep LSTM acoustic model achieves absolute gains of 0.92%, 4.23% and 7.24% over the baseline system on the simulated development, real development and real test sets, respectively. Joint training with all parameters updated achieves absolute gains of 1.72%, 4.05%, 0.62% and 7.75% over the baseline system on the simulated development, real development, simulated test and real test sets, respectively. This large improvement shows that the LSTM adaptive beamformer, jointly trained with the LSTM acoustic model to optimize the ASR objective, is able to estimate real-time filter coefficients adaptively in response to the changing source position, environmental noise and room impulse response. Further absolute gains of 0.15%, 0.09%, 0.17% and 0.22% are achieved by introducing the acoustic feedback, which indicates that high-level acoustic information is also helpful in predicting the filter coefficients at the next time step.

Note that although the proposed system with acoustic feedback achieves 0.04% and 2.13% absolute gains over the beamformed signal generated by BeamformIt on the simulated development and test sets, it does not work as well as BeamformIt on the real development and test sets. In the BeamformIt implementation, the two-step time-delay-of-arrival Viterbi postprocessing makes use of both past and future information in predicting the best alignment of the multiple channels at the current time, while our system uses only past history to estimate the current filter coefficients. This may explain the differences in WER performance, and could be alleviated by using a bidirectional LSTM as part of future work.
3.6. Beamformed Feature

The LSTM beamformer adaptively predicts the time-variant beamforming coefficients and performs filter-and-sum beamforming over the 5 input channels. The log Mel filterbank feature is obtained from the beamformed STFT coefficients. From Fig. 2, we see that the log Mel filterbank feature obtained from the LSTM adaptive beamformer is quite similar to the one extracted from the STFT coefficients beamformed by BeamformIt for the same utterance. The SNR is not high, but the feature matches the LSTM acoustic model well for maximizing ASR performance.

Fig. 2. Comparison of the log Mel filterbank coefficients of the same utterance extracted from STFT coefficients beamformed by BeamformIt (upper) and the LSTM adaptive beamformer (lower).

4. CONCLUSIONS

In this work, LSTM adaptive beamforming is proposed to adaptively predict the real-time beamforming filter coefficients, in order to deal with the time-variant source location, environmental noise and room impulse response inherent in multichannel speech signals. To achieve robust ASR, the LSTM adaptive beamformer is jointly trained with a deep LSTM acoustic model to optimize the ASR objective. This framework achieves absolute gains of 1.72%, 4.05%, 0.62% and 7.75% over the baseline system on the CHiME-3 dataset. Further improvement is achieved by introducing the acoustic feedback to assist in predicting the filter coefficients. However, our approach does not work as well as BeamformIt on real data, and we will look into this in the future.

5. REFERENCES

[1] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov. 2012.

[2] G. E. Dahl, D. Yu, L. Deng, and A.
Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.

[3] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori, and T. Nakatani, "Strategies for distant speech recognition in reverberant environments," EURASIP Journal on Advances in Signal Processing, vol. 2015, no. 1, pp. 1–15, 2015.

[4] T. Hori, Z. Chen, H. Erdogan, J. R. Hershey, J. Le Roux, V. Mitra, and S. Watanabe, "The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2015, pp. 475–481.

[5] J. Benesty, J. Chen, and Y. Huang, Microphone Array Signal Processing, vol. 1, Springer Science & Business Media, 2008.

[6] B. D. Van Veen and K. M. Buckley, "Beamforming: A versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988.

[7] H. Erdogan, J. R. Hershey, S. Watanabe, M. Mandel, and J. Le Roux, "Improved MVDR beamforming using single-channel mask prediction networks," 2016.

[8] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1529–1539, July 2007.

[9] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, "Deep beamforming networks for multi-channel speech recognition," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5745–5749.

[10] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, M. Bacchiani, and A. Senior, "Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2015, pp. 30–36.

[11] T. N. Sainath, R. J. Weiss, K. W. Wilson, A. Narayanan, and M. Bacchiani, "Factored spatial and spectral multichannel raw waveform CLDNNs," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5075–5079.

[12] J. T. Geiger, F. Weninger, J. F. Gemmeke, M. Wöllmer, B. Schuller, and G. Rigoll, "Memory-enhanced neural networks and NMF for robust ASR," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 6, pp. 1037–1046, June 2014.

[13] F. Weninger, J. Geiger, M. Wöllmer, B. Schuller, and G. Rigoll, "Feature enhancement by deep LSTM networks for ASR in reverberant multisource environments," Computer Speech & Language, vol. 28, no. 4, pp. 888–902, 2014.

[14] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 708–712.

[15] B. Li, T. N. Sainath, R. J. Weiss, K. W. Wilson, and M. Bacchiani, "Neural network adaptive beamforming for robust multichannel speech recognition," in Proc. Interspeech, 2016.

[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.

[17] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.

[18] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Interspeech, 2012, pp. 194–197.

[19] C. Weng, D. Yu, S. Watanabe, and B. H. F. Juang, "Recurrent deep neural networks for robust speech recognition," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 5532–5536.

[20] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2012, pp. 4085–4088.

[21] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2013, pp. 273–278.

[22] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Interspeech, 2014, pp. 338–342.

[23] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649.

[24] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[25] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Dec. 2015, pp. 504–511.

[26] S. Tokui, K. Oono, S. Hido, and J. Clayton, "Chainer: a next-generation open source framework for deep learning," in Proceedings of the Workshop on Machine Learning Systems (LearningSys) at NIPS, 2015.

[27] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Dec. 2011.

[28] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, Sept. 2007.