DEEP CONTEXT: END-TO-END CONTEXTUAL SPEECH RECOGNITION

Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao

Google Inc., USA

ABSTRACT

In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.

Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.

1. INTRODUCTION

As speech technologies become increasingly pervasive, speech is emerging as one of the main input modalities on mobile devices and in intelligent personal assistants [1]. In such applications, speech recognition performance can be improved significantly by incorporating information about the speaker's context into the recognition process [2]. Examples of such context include the dialog state (e.g., we might want "stop" or "cancel" to be more likely when an alarm is ringing), the speaker's location (which might make nearby restaurants or locations more likely) [3], as well as personalized information about the user such as her contacts or song playlists [4].
There has been growing interest recently in building sequence-to-sequence models for automatic speech recognition (ASR), which directly output words, word-pieces [5], or graphemes given an input speech utterance. Such models implicitly subsume the components of a traditional ASR system - the acoustic model (AM), the pronunciation model (PM), and the language model (LM) - into a single neural network which is jointly trained to optimize log-likelihood or task-specific objectives such as the expected word error rate (WER) [6]. Representative examples of this approach include connectionist temporal classification (CTC) [7] with word output targets [8], the recurrent neural network transducer (RNN-T) [9, 10], and the "Listen, Attend, and Spell" (LAS) encoder-decoder architecture [11, 12]. In recent work, we have shown that such approaches can outperform a state-of-the-art conventional ASR system when trained on 12,500 hours of transcribed speech utterances [13].

In the present work, we consider techniques for incorporating contextual information dynamically into the recognition process. In traditional ASR systems, one of the dominant paradigms for incorporating such information involves the use of an independently-trained on-the-fly (OTF) rescoring framework which dynamically adjusts the LM weights of a small number of n-grams relevant to the particular recognition context [2]. Extending such techniques to sequence-to-sequence models is important for improving system performance, and is an active area of research. In this context, previous works have examined the inclusion of a separate LM component into the recognition process through either shallow fusion [14] or cold fusion [15], which can bias the recognition process towards a task-specific LM.
A shallow fusion approach was also used to directly contextualize LAS in [16], where output probabilities were modified using a special weighted finite state transducer (WFST) constructed from the speaker's context, and was shown to be effective in improving performance.

The use of an external independently-trained LM for OTF rescoring, as in previous approaches, goes against the benefits derived from the joint optimization of the components of a sequence-to-sequence model. Therefore, in this work, we propose Contextual-LAS (CLAS), a novel, all-neural mechanism which can leverage contextual information, provided as a list of contextual phrases, to improve recognition performance. Our technique consists of first embedding each phrase, represented as a sequence of graphemes, into a fixed-dimensional representation, and then employing an attention mechanism [17] to summarize the available context at each step of the model's output predictions. Our approach can be considered a generalization of the technique proposed in [18] in the context of streaming keyword spotting, in that it allows for a variable number of contextual phrases during inference. The proposed method does not require that the particular context information be available at training time, and crucially, unlike previous works [16, 2], it does not require careful tuning of rescoring weights, while still being able to incorporate out-of-vocabulary (OOV) terms. In experimental evaluations, we find that CLAS, which trains the contextualization components jointly with the rest of the model, significantly outperforms online rescoring techniques when handling hundreds of context phrases, and is comparable to these techniques when handling thousands of phrases.

The organization of the rest of this paper is as follows. In Section 2.1 we describe the standard LAS model, and the standard contextualization approach in Section 2.2.
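To make the contrast with the OTF-rescoring baseline concrete, shallow fusion combines the scores of the two independently trained models at each beam-search step. The following is a minimal sketch; the weight `lam` and the function name are illustrative, not from the paper:

```python
import math

def shallow_fusion_score(las_log_prob: float, ctx_log_prob: float,
                         lam: float = 0.3) -> float:
    """Interpolate log-probabilities from two independently trained models.

    las_log_prob: log P_LAS(y_t | x, y_<t) from the end-to-end model.
    ctx_log_prob: log P_C(y_t | y_<t) from the contextual n-gram model.
    lam: rescoring weight -- exactly the knob that CLAS avoids having to tune.
    """
    return las_log_prob + lam * ctx_log_prob

# A candidate token favored by the context model overtakes one that is not.
in_context = shallow_fusion_score(math.log(0.20), math.log(0.90))
plain      = shallow_fusion_score(math.log(0.25), math.log(0.01))
assert in_context > plain
```

The interpolation weight must be tuned on held-out data for each context type, which is one of the drawbacks the jointly trained CLAS model is designed to remove.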
We present the proposed modifications to the LAS model in order to obtain the CLAS model in Section 3. We describe our experimental setup and discuss results in Sections 4 and 5, respectively, before concluding in Section 6.

2. BACKGROUND

2.1. The LAS model

We now briefly describe the LAS model; for more details see [11, 13]. The LAS model outputs a probability distribution, P(y | x), over sequences of output labels, y (graphemes, in this work), conditioned on a sequence of input audio frames, x (log-mel features, in this work).

Fig. 1. A schematic representation of the models used in this work.

The model consists of three modules: an encoder, a decoder and an attention network, which are trained jointly to predict a sequence of graphemes from a sequence of acoustic feature frames (Figure 1a).

The encoder is comprised of a stacked recurrent neural network (RNN) [19, 20] (unidirectional, in this work) that reads acoustic features, x = (x_1, ..., x_K), and outputs a sequence of high-level features (hidden states), h^x = (h^x_1, ..., h^x_K). The encoder is similar to the acoustic model in a conventional ASR system.

The decoder is a stacked unidirectional RNN that computes the probability of a sequence of output tokens (characters, in this work), y = (y_1, ..., y_T), as follows:

P(y | x) = P(y | h^x) = \prod_{t=1}^{T} P(y_t | h^x, y_0, y_1, \ldots, y_{t-1})    (1)

The conditional dependence on the encoder state vectors, h^x, is modeled using a context vector, c_t = c^x_t, which is computed using multi-head attention [21, 13] as a function of the current decoder hidden state, d_t, and the full encoder state sequence, h^x. The hidden state of the decoder, d_t, captures the previous character context, y_{<t}.

3. CLAS

In CLAS, the decoder output vocabulary is augmented with a special </bias> token (see Section 3.2), and the model is additionally presented with a list of bias phrases, z = (z_1, ..., z_N), which are embedded into a set of vectors, h^z (we use superscript z to distinguish bias-attention variables from audio-related variables); h^z_i is an embedding of z_i if i > 0.
Since the bias phrases may not be relevant for the current utterance, we include an additional learnable vector, h^z_0 = h^z_{nb}, that corresponds to the no-bias option, i.e., not using any of the bias phrases to produce the output. This option enables the model to back off to a "bias-less" decoding strategy when none of the bias phrases matches the audio, and allows the model to ignore the bias phrases altogether. The bias-encoder is a multilayer long short-term memory network (LSTM) [19]; the embedding, h^z_i, is obtained by feeding the bias-encoder the sequence of embeddings of the subwords in z_i (i.e., the same grapheme or word-piece units used by the decoder) and using the last state of the LSTM as the embedding of the entire phrase [24].

Attention is then computed over h^z, using an attention mechanism separate from the one used for the audio-encoder. A secondary context vector, c^z_t, is computed using the decoder state, d_t; this context vector summarizes z at time step t:

u^z_{it} = (v^z)^\top \tanh(W^z_h h^z_i + W^z_d d_t + b^z_a)    (6)

\alpha^z_t = \mathrm{softmax}(u^z_t), \quad c^z_t = \sum_{i=0}^{N} \alpha^z_{it} h^z_i    (7)

The LAS context vector, c_t, which feeds into the decoder, is then modified by setting c_t = [c^x_t; c^z_t], the concatenation of the context vectors obtained with respect to x and z. The other components of the CLAS model (i.e., the decoder and the audio-encoder) are identical to the corresponding components in the standard LAS model.

It is worth noting that CLAS explicitly models the probability of seeing a particular bias phrase given the audio and previous outputs: \alpha^z_t = P(z_t | d_t) = P(z_t | x; y_{<t}).

3.2. Training

During training, bias phrases are sampled from the reference transcripts, and the special </bias> symbol is inserted after each match. For example, if the reference transcript is "play a song" and the matching bias phrase is "play", the target sequence is modified to "play </bias> a song". The purpose of </bias> is to introduce a training error which can be corrected only by considering the correct bias phrase [18].
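The bias-attention computation of Eqs. (6)-(7) can be sketched in NumPy as follows. This is a toy illustration with random stand-in weights; the dimensions and variable names are ours, not the paper's:

```python
import numpy as np

def bias_attention(h_z, d_t, W_h, W_d, b_a, v):
    """Bias attention of Eqs. (6)-(7): score each phrase embedding against
    the decoder state, softmax over phrases (index 0 = no-bias), and return
    the weighted sum c^z_t along with the attention weights alpha^z_t."""
    u = np.tanh(h_z @ W_h.T + d_t @ W_d.T + b_a) @ v  # u^z_{it}, shape (N+1,)
    alpha = np.exp(u - u.max())
    alpha /= alpha.sum()                              # softmax -> alpha^z_t
    return alpha @ h_z, alpha                         # c^z_t, alpha^z_t

# Toy dimensions: N = 3 phrases plus the no-bias row, phrase-embedding dim 4,
# decoder dim 5, attention dim 6. Weights are random stand-ins for parameters
# that would be learned jointly with the rest of the model.
rng = np.random.default_rng(0)
h_z = rng.normal(size=(4, 4))          # rows: h^z_0 (no-bias), h^z_1..h^z_3
d_t = rng.normal(size=5)               # current decoder state
c_z, alpha = bias_attention(h_z, d_t,
                            rng.normal(size=(6, 4)),  # W^z_h
                            rng.normal(size=(6, 5)),  # W^z_d
                            rng.normal(size=6),       # b^z_a
                            rng.normal(size=6))       # v^z
c_x = rng.normal(size=8)               # audio context vector c^x_t (stand-in)
c_t = np.concatenate([c_x, c_z])       # c_t = [c^x_t; c^z_t]
```

Note that the softmax runs over the N+1 rows of h^z, so the no-bias embedding competes with the real phrases exactly as described above.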
In other words, to be able to predict </bias> the model has to attend to the correct bias phrase, thus ensuring that the bias-encoder will receive updates during training.

3.3. Inference

During inference, the user provides the system with a sequence of audio feature vectors, x, and a set of context sequences, z, possibly never seen in training. Using the bias-encoder, z is embedded into h^z. This embedding can take place before audio streaming begins. The audio frames, x, are then fed into the audio encoder, and the decoder is run as in standard LAS to produce N-best hypotheses using beam-search decoding [24].

3.4. Bias-Conditioning

When thousands of phrases are presented to CLAS, retrieving a meaningful bias context vector becomes challenging, since it is the weighted sum of many different bias-embeddings and might be far from any context vector seen in training. Bias-conditioning attempts to alleviate this problem. Here we assume that during inference the model is provided with both a list of bias phrases, z = (z_1, ..., z_N), and a corresponding list of bias prefixes, p = (p_1, ..., p_N). With this technique, a bias phrase z_i is "enabled" at step t only when its prefix p_i has been detected in the partial hypothesis, y_{<t}.
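The prefix gating in bias-conditioning can be sketched as follows. This is a hypothetical helper with simple substring matching; in the model itself, disabled phrases would effectively be excluded from the bias-attention summation of Eq. (7):

```python
def enabled_bias_mask(partial_hyp: str, prefixes: list[str]) -> list[bool]:
    """Per-phrase mask for bias-conditioning: entry i+1 is True only if
    prefix p_i has appeared in the partial hypothesis y_<t. Entry 0 is
    the no-bias option and is always enabled."""
    mask = [True]  # h^z_0 (no-bias) is always available
    for p in prefixes:
        mask.append(p in partial_hyp)
    return mask

# With partial hypothesis "call jo", only the phrase gated on the prefix
# "call" is enabled; the one gated on "play" stays disabled.
print(enabled_bias_mask("call jo", ["call", "play"]))  # [True, True, False]
```

Restricting attention to the enabled subset keeps the bias context vector close to the small-list regime the model saw during training.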