Multiple-Instance, Cascaded Classification for Keyword Spotting in Narrow-Band Audio

Ahmad Abdulkader (Voicera, ahmada@voicera.ai), Kareem Nassar (Voicera, kareemn@voicera.ai), Mohamed Mahmoud (Voicera, geish@voicera.ai), Daniel Galvez (Voicera, dt.galvez@gmail.com), Chetan Patil (Voicera, chetanp@voicera.ai)

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Abstract

We propose using cascaded classifiers for a keyword spotting (KWS) task on narrow-band (NB), 8kHz audio acquired in non-IID environments, a more challenging task than most state-of-the-art KWS systems face. We present a model that incorporates Deep Neural Networks (DNNs), cascading, multiple-feature representations, and multiple-instance learning. The cascaded classifiers handle the task's class imbalance and reduce power consumption on computationally-constrained devices via early termination. The KWS system achieves a false negative rate of 6% at an hourly false positive rate of 0.75.

1 Introduction and Problem Description

At Voicera, we are building Eva (Enterprise Voice Assistant) to collaborate with meeting participants using voice [1]. We believe interactions with Eva should feel as natural as interactions with any other participant in the meeting, so we have designed Eva to recognize and respond to voice commands. Eva continuously listens to the conversation during a meeting and verbally acknowledges the wake word "Okay Eva." For this interaction to feel natural, Eva's KWS and audible responses need to be real-time. Eva continuously predicts whether or not the keywords of interest were uttered in a real-time audio stream. Because unsolicited interruptions would make Eva vexing to users, the false positive rate should be less than 1 per hour; on the other hand, users may abandon the service if Eva doesn't respond when addressed, so we strive to maximize recall while maintaining a low false positive rate.

In addition to the challenges that other KWS systems face while working with real-time speech, Eva's KWS system, in order to be ubiquitous, needs to support speech signals carried over the public-switched telephone network, which typically uses G.711: an NB audio codec that operates at a low bit-rate and at a sample rate of 8kHz [2]. Both human and automatic speech recognition lose significant accuracy when listening to NB audio [2, 3, 4]. Moreover, Eva's KWS system needs to adapt to new microphones, environment settings, speakers, and noise profiles whose characteristics vary drastically, making the input signal to the KWS system non-IID.

2 Prior Art

Voice-enabled AI assistants, like Apple's Siri, Amazon's Alexa, and Google Assistant, perform similar KWS tasks to enable users to interact with their devices. Google proposed a model that uses Convolutional Neural Networks for performing a KWS task to detect 14 different phrases [5]; this architecture showed a 27-44% relative improvement in false positive rate compared to its predecessor that used a Deep Neural Network (DNN) [6], which in turn showed a 45% relative improvement in confidence score compared to KWS using Hidden Markov Models (HMM) [7]. In [8], Apple uses a DNN to predict scores for 20 sound classes every 10ms, combining these scores in an HMM-like graph to calculate a composite score indicating that "Hey Siri" was uttered. Similar to our approach, which we detail below, the authors used a cascade of two classifiers to conserve power. Given the control Apple, Amazon, and Google have over their hardware specifications, they can acquire wide-band audio, which significantly improves the accuracy of speech recognition systems.

3 Approach

Our approach to the KWS problem is shown in Figure 1. The KWS system is triggered every 10ms.
The audio signal from the past 500ms is captured for processing, since we found that the median command duration is 500ms. In this paper, we often call these 500ms windows "examples". Each of our six classifiers (three cascaded classifiers, each composed of two DNNs trained on different feature representations) is a fully connected DNN with two 128-input hidden layers.

[Figure 1 diagram: PLP and MFCC feature extraction feeds each of three cascaded classifier stages; each stage can reject a window as Non-Keyword, and the final output passes through a Noisy Or to produce Keyword.]
Figure 1: KWS system diagram: (i) Feature Extraction (ii) Cascaded classifiers (iii) Noisy Or

3.1 Data

We initially collected 19k user examples from 200 individuals using a crowdsourcing platform. The collection covered a wide variety of speakers living in the continental United States. These examples act as positive examples in our experiments. Negative examples for each cascaded classifier were generated from a repository of audio samples from a variety of meeting recordings that do not contain the keyword. All of the audio examples were either acquired as or converted to NB, 8kHz audio. 80% of the data were designated for training; the remaining 20% were used for evaluation. The data were stratified over individual speakers: a speaker belongs to either the training or the evaluation set, never both.

3.2 Multi-Representation Feature Extraction

It is challenging to sample negative examples adequately. As such, discriminative classifiers suffer from the so-called "Novelty Detection" problem: a sound not encountered during the training process can be misclassified as a keyword, resulting in a false positive. This happens because the decision boundary learned by the discriminative classifier is undefined in the areas that were not sampled during training. To overcome this, we train two classifiers on two different representations, Mel Frequency Cepstral Coefficient (MFCC) and Perceptual Linear Prediction (PLP) features; both are commonly used in speech recognition [9].
These features are extracted on the same audio input for each stage of the cascade. The two classifiers are ensembled using model averaging to compute the final probability of an audio window being a keyword. Because these classifiers were trained on different representations of the same data, they concur on areas of the distribution that they were trained on and behave randomly otherwise. Averaging therefore tends to suppress spurious high scores on unfamiliar sounds, which results in a lower rate of false positives. In our system, both feature types were extracted from 30ms frames with a stride of 10ms and concatenated for each 500ms example.

3.3 Cascading

Cascaded classifiers are commonly used to deal with highly asymmetric classification problems; they have been popularized through their use in many practical machine learning solutions, of which the Viola-Jones face detector is one notable example [10]. Cascading is an instance of ensemble learning based on the concatenation of several classifiers of increasing complexity. Each classifier is trained on the examples that are not filtered out by the previous classifiers. A threshold is set on the output of each classifier so that a portion of the negative (non-keyword) examples it sees is correctly rejected. The remaining examples are either positive (keyword) examples or hard negative examples. To train the next classifier of the cascade, we run the previous classifiers of the cascade on a large repository of audio guaranteed not to contain the target keyword, and use the discovered false positives as hard negative examples. This allows subsequent classifiers in the cascade to focus on negative examples that previous classifiers confuse with the positive examples. The ratio of positive to negative data becomes more balanced for subsequent classifiers. One drawback of this technique is that extracting hard examples for training subsequent classifiers takes longer as the cascade gets better.
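
The hard-negative mining step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `passes_cascade` and `mine_hard_negatives`, the `Window` and `Scorer` types, and the convention that a window survives a stage when its score is at or above that stage's threshold are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

# Hypothetical types: one 500ms example's features, and a stage that maps
# them to an estimated keyword probability in [0, 1].
Window = Sequence[float]
Scorer = Callable[[Window], float]


def passes_cascade(window: Window,
                   stages: Sequence[Scorer],
                   thresholds: Sequence[float]) -> bool:
    """Return True only if every stage scores the window at or above its threshold."""
    for score, threshold in zip(stages, thresholds):
        if score(window) < threshold:
            return False  # rejected here, so later stages never run on this window
    return True


def mine_hard_negatives(negative_windows: Sequence[Window],
                        stages: Sequence[Scorer],
                        thresholds: Sequence[float]) -> List[Window]:
    """Collect keyword-free windows that the already-trained stages still accept.

    Every window is assumed to come from audio known not to contain the
    keyword, so anything that survives the cascade is a false positive and
    can serve as a hard negative when training the next stage.
    """
    return [w for w in negative_windows if passes_cascade(w, stages, thresholds)]
```

As the existing stages improve, fewer windows survive, so more keyword-free audio has to be scanned to gather the same number of hard negatives, which is the drawback noted above.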
In our experiments below, the first cascade classifier is trained with a ratio of 100 negative examples for every positive example. The second and third cascade classifiers maintain a ratio of 2 negative examples for every positive example. At inference time, cascaded classifiers provide a way to terminate early, skipping subsequent KWS computations on non-keyword windows.

3.4 Multiple-Instance Learning

The final stage of the pipeline, in Figure 1(iii), aggregates the current and past outputs of the cascaded classifiers to make a final decision about whether the targeted keyword was uttered. The KWS problem, as we modeled it, is innately a Multiple-Instance Learning (MIL) [11] problem: the keyword is somewhere in the audio signal, but at an imprecisely defined location, and it may vary drastically in duration because humans pronounce the same word in a wide range of ways. We have found that the "Okay Eva" utterance can vary from 300ms to 900ms. In MIL, training examples are not singletons; they come in "bags" such that all of the examples in a bag share a label [11, 12]. A positive bag of windows means that at least one window in the bag is positive, while a negative bag means that all windows in the bag are negative. In MIL, the learner must simultaneously determine which examples in the positive bags are positive and the parameters of the classifier. In our case, a bag is a group of 500ms windows strided by 10ms. As such, we have designed our learning process to learn simultaneously on all windows encompassed in a particular positive or negative bag of windows. The outputs corresponding to all of the windows in the bag are aggregated by a "Noisy Or."

4 Experiments and Results

We evaluated detection of the phrase "Okay Eva" using the hourly false positive rate plotted against the false negative rate.
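
One operating point on such a curve can be sketched as below, combining the Noisy Or aggregation from Section 3.4 with the two evaluation metrics. This is an illustration under stated assumptions rather than the authors' evaluation code: the Noisy Or is taken in its standard form, 1 - prod(1 - p_i); the hourly false positive rate is assumed to be the number of false triggers divided by the hours of keyword-free audio; and the names `noisy_or`, `operating_point`, and `negative_hours` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def noisy_or(window_probs: Sequence[float]) -> float:
    """Probability that at least one window in the bag is a keyword.

    Standard Noisy Or: 1 - prod(1 - p_i) over the per-window probabilities.
    """
    return 1.0 - math.prod(1.0 - p for p in window_probs)


def operating_point(positive_bags: Sequence[Sequence[float]],
                    negative_bags: Sequence[Sequence[float]],
                    negative_hours: float,
                    threshold: float) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positives_per_hour) at one threshold.

    positive_bags:  per-window probabilities for bags that contain the keyword.
    negative_bags:  per-window probabilities for bags from keyword-free audio.
    negative_hours: assumed total duration, in hours, of the keyword-free audio.
    """
    missed = sum(1 for bag in positive_bags if noisy_or(bag) < threshold)
    false_alarms = sum(1 for bag in negative_bags if noisy_or(bag) >= threshold)
    return missed / len(positive_bags), false_alarms / negative_hours
```

Sweeping `threshold` over a range of values and collecting the resulting pairs traces out a curve of hourly false positive rate against false negative rate.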
Multiple-Feature Representations

The ROC characteristics of a 3-stage cascade with only one feature representation (MFCCs or PLPs individually) and of a 3-stage cascade with both representations are compared in Figure 2. The multi-representation scheme improves the trade-off between the false positive and false reject rates. For example, at a false reject rate of around 7% (0.07), MFCCs have an hourly false positive rate of 1.2; PLPs have an hourly false positive rate of 1.0; the multi-representation model has an hourly false positive rate of 0.55.

Cascading

We trained one-, two-, and three-stage cascaded classifiers for comparison. Because the first classifier is trained on mostly negative examples, the threshold on the output probability of the first classifier in each cascade was computed to guarantee that most of the positive examples would not be filtered out. The ROC characteristics of each of the stages of the cascaded classifiers are shown in Figure 3. The ROC characteristics improve significantly as more classifiers are cascaded.

Figure 2: A plot showing the effects of using PLPs, MFCCs, and multi-representation models.

Figure 3: A plot showing the effects of using cascaded classifiers.

5 Conclusion

We demonstrate significant gains in KWS in narrow-band audio while minimizing computational resource usage. By incorporating multiple feature representations and three cascaded classifiers, we reduce our false positives per hour at a 5% false negative rate from 8 to 1.2, a reduction of 85%. Although they are not directly comparable due to differences in datasets, our system performs better on narrow-band 8kHz audio than Google's DNN system [5] performs on wide-band 16kHz audio.

References

[1] O. Tawakol, “Introducing eva by voicera,” Feb 2017, (Accessed 29-Oct-2017). [Online]. Available: https://www.voicera.com/introducing-eva-by-workfit/
[2] L. Gallardo, Human and Automatic Speaker Recognition over Telecommunication Channels, ser. T-Labs Series in Telecommunication Services. Springer Singapore, 2015.
[3] S. Voran, “Listener ratings of speech passbands,” in 1997 IEEE Workshop on Speech Coding for Telecommunications Proceedings. Back to Basics: Attacking Fundamental Problems in Speech Coding, Sep 1997, pp. 81–82.
[4] S. Möller, F. Köster, L. F. Gallardo, and M. Wagner, “Comparison of transmission quality dimensions of narrowband, wideband, and super-wideband speech channels,” in 2014 8th International Conference on Signal Processing and Communication Systems (ICSPCS), Dec 2014, pp. 1–6.
[5] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015. ISCA, 2015, pp. 1478–1482.
[6] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 4087–4091.
[7] J. R. Rohlicek, W. Russell, S. Roukos, and H. Gish, “Continuous hidden markov modeling for speaker-independent word spotting,” in International Conference on Acoustics, Speech, and Signal Processing, May 1989, pp. 627–630 vol.1.
[8] S. Team, “Hey siri: An on-device dnn-powered voice trigger for apple's personal assistant - apple,” Oct 2017, (Accessed 29-Oct-2017). [Online]. Available: https://machinelearning.apple.com/2017/10/01/hey-siri.html
[9] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., “The htk book,” Cambridge university engineering department, vol. 3, 2002.
[10] P. Viola and M. J. Jones, “Robust real-time face detection,” International journal of computer vision, vol. 57, no. 2, pp. 137–154, 2004.
[11] C. Zhang, J. C. Platt, and P. A. Viola, “Multiple instance boosting for object detection,” in Advances in neural information processing systems, 2006, pp. 1417–1424.
[12] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the multiple instance problem with axis-parallel rectangles,” Artificial intelligence, vol. 89, no. 1, pp. 31–71, 1997.
