An Ensemble SVM-based Approach for Voice Activity Detection

An Ensemble SVM-based A ppr oa ch f o r V oice Activity Detection J ayanta De y 1 , Md. Sanzid Bin Hossain 2 , Mohammad Ar iful Haque 3 ∗ Dept. of Electrical and Electronic Engineering, Bangladesh University of Engineering and T echnology deyjayanta76@ gmail.com, sanzidbinhoss ain33@gmail.co m, arifulhoque@e ee.buet.ac.bd Abstract V oice activ ity detection (V AD), used as the front end of speech enhanceme nt, speech and speaker recognition algorithms, de- termines the overall accuracy and efﬁcienc y of the algorithms. Therefore, a V AD wit h low complexity and high accurac y is highly desirable for speech processing ap plications. In this pa- per , we propos e a no vel training method on large dataset for su- pervised learning-based V AD system using sup port v ector ma- chine (SVM). Despite of high classiﬁcation accu racy of suppo rt vector machines (SVM), trivial SVM is not suitable for classi- ﬁcation of large data sets needed for a good V AD system be- cause of high training complexity . T o overcome this problem, a nov el ensemble-based approach using SVM has been proposed in this paper . The performance of the proposed ensemble struc- ture has been compared with a The performan ce of the pro- posed ens emble structure has been compared with a feedfor- ward neural netw ork (NN). Alt hough NN performs better than single SVM-based V A D trained on a small portion of the train- ing data, ensemble S VM gi ves accuracy comparable to neural network-b ased V AD. Ensemble SVM and NN gi ve 88 . 7 4 % and 86 . 28 % accuracy respectiv ely whereas the stand-alone SVM sho ws 57 . 05 % accurac y on average on the test dataset. Index T erms : V oice activity detection, suppo rt vector machin e, neural network, ensemble. 1. Intr oduction V oice activity detection is ba sically the act of separating the speech and the non-sp eech portions of an aud io recording. As a typical speech signal may contain silence, mono-tonic noise and e ven music frames, sorting out the speech frames from the recording usin g an efﬁcient V AD in t he front end is of great importance for reliable operation of the speech processing al- gorithms. A number of algorithms f or vo ice activity detection have been proposed in the literature. Among them frame ener gy , pe- riodicity measure [1] or entrop y-based [2] methods are elegant in the sense of their simplicity and time-efﬁciency . Howe ver , they are based on parameters selection that are tuned for a par- ticular situation an d can not separate mo re critical non-sp eech frames li ke music accurately . A solution to this problem can be found by ado pting relati vely complex st ati stical approach such as statistical hypothesis testing [3], long-term spectral div er- gence measure [4], amp litude pro bability distribution [5] and lo w-va riance spectrum estimation [6]. Howe ver , these methods need to estimate the background noise le vel and are also prone to severa l parameters tuning. More recent studies attempts t o solve the problem of V AD from machine learning point-of-vie w [7], [ 8 ], [9] that classiﬁes an audio frame as speech or no n- speech. The main problem associated with these approaches i s that they need to be trained on a large dataset which includes a rich non-speech instances for a satisfactory efﬁciency . This poses a problem for learners such as SVM that has high classi- ﬁcation accurac y and yet can not be trained on a larg e dataset due to comple x training algorithm. In this paper , we propose a novel ensemble technique to train SVM learners on a larg e dataset for v oice acti vit y detec- tion and com pare it with the performance of a neural netw ork- based classiﬁer t r ained on t he similar datase t. T i me ef ﬁciency is a crucial factor for V AD implementation and unlike t he other methods reported in the literature [10] that use a large set of features, we focus our efforts on b uilding cl assiﬁ er based on MFCC features only . These MFCC features are popularly used for speech processing algorithms. Therefore, the extracted fea- tures can be used in the subsequen t stages making t he ov erall procedure more efﬁcient. In our work, we have trained a num- ber of S VMs on non-ov erlapping small datasets to cover t he whole lar ge dataset and their predicted probability is used as features for t he output l ayer SVM t hat giv es the ﬁnal decision. The proposed SVM-ensemble giv es approximately similar ac- curacy compared to the state of the art neural network [10]. This approach sho ws signiﬁcant improv ement in terms of accuracy from the st and-alone SVM and also the v ariance in result for a single SVM is smoothed out by the ensemb le. The paper is organ ized as follows. Secti on II describes problem formulation and data description. The system architec- ture is discussed in section III and the system efﬁcacy i s estab- lished in section IV . Finally , mentioning the contributions and our future work, the paper is concluded in section VI. 2. Pr oblem Formulation and Data Description In our work, we have used MUSAN corpus [11] as both our testing and training database. The corpus consists of approxi- mately 109 hours of speech, music and noise data that makes it an ideal database to be used for supervised learner-based V AD application. T he speech dataset contains silent frames which are exclud ed from t he training dataset by a log-mel energy based thresholding described later in the paper and a t est dataset of duration of about 3 hours has been separated from the corpus where the silent frames of speech r ecordings were annotated by a human listener . Now both the training an d testing dataset were divide d into frames of duration 25 ms with 15 ms over - lap. Therefore, the challenge is to build a learner trained on the features extracted from the training frames that can classify the testing frames as speech or non-sp eech. Here we hav e used an ensemble-based approach for training an SVM learner and compared it with a trivial architecture-based neural network. 3. V AD System Description In this work, w e aim to de velop a binary classiﬁcati on system in which one class consists of only speech and the other one con- Energy > Threshold Silence Classiﬁed Silence Classiﬁed Energy>Threshold V AD model Classiﬁer MUSAN corpus No No Y es Y es Speech/ non- speech decision F eature extraction F eature extraction T raining dataset T est dataset Figure 1: General system model f or the pr oposed voice activity detection. Featur e dataset (n+1)th dataset 1 1 2 3 4 2 3 4 n n+1 SVM-1 SVM-2 SVM-3 SVM-4 SVM-n SVM-1 SVM-2 SVM-3 SVM-4 SVM-n n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SVM Probability esma on Shuﬄing Predicon T raining Ensemble member training Final member training Probability esma te f eature Probabilty esma on fea ture vect or a) b) Figure 2: a) T raining n numb ers of SVM for ensembling, b) T raining the ﬁnal output layer SVM. tains silence, music and noise. A general overvie w of the V AD system is sho wn in F ig. 1. As the silent frames can be sepa- rated quite ef ﬁciently by energy-b ased thresholding only , they were excluded from the training data and detected in the test- ing phase using a threshold ing on the log-mel energy that is the ﬁrst MFCC feature. Then the remaining frames were classiﬁed using a supervised-learn er-based V AD system. 3.1. Fea ture Extraction In order to train the classiﬁer, we hav e used MFCC features as they are the standard features used in speech processing. There are many software packages to extract the MFCC features ef- ﬁciently and hence it will be an elegant feature-set to be used in V AD where time-ef ﬁciency is high ly desirable. MFCC is a spectral feature inspired by hu man auditory model. As hu- man ea r i s efﬁcient in distinguishing between speech and non - speech, we hope the MFCC features will also be effecti ve for our purpos e. Here we have used 13 MFCC features for a par- False Positive Rate 0.1 1 True Positive Rate 1 2 3 4 5 6 7 8 0. 9 1 0. 0. 0. 0. 0. 0. 0. 0. 0 0 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 False Positive Rate True Positive Rate non-speech speech b) non-speech speech 1 0.9 0.3 0.4 0.5 0.6 0.7 0.8 0 0.1 0.2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 a) ( 0.1523, 0.8627 ) ( 0.1701, 0.8804 ) Figure 3: a) R OC curves for ensemb le-SVM, b) ROC curves for NN. ticular audio frame. 3.2. Classiﬁcation usi n g Neural Network W e have de veloped and ev aluated our deep neural network (NN) model with pytho n li brary Keras ® and numerical computation software library ten sorﬂow . W e have trained a fully connected neural netw ork model with an input layer of 13 input v ariable ,two hidden layers w i th 12 and 8 neurons respectiv ely and an output layer with one neuron. W e have initialized network weights to small rando m nu mbers which w as generated from a uniform distribution. Details of the layer architecture is gi ven in T able 1. W e have used binary cro ssentropy as loss function and gradien t decen t algorithm Adam as optimizer . T able 1: A rc hitectur e and Tr aining P ara meters of the Neural Network Layer Details Node Numb er Activation Epochs Batch Size V alidation Split Hidden Layer 1 12 relu Hidden L ayer 2 8 relu 100 100 .2 Output Layer 1 sigmoid 3.3. Classiﬁcation usi n g SVM In our work, we hav e used libsvm toolbox [12] for SVM- based classiﬁcation. As a stand-alone SVM i s not suitable for the large train dataset described earlier, we shufﬂe the feature sets obtained from the t rain dataset and divide it into n non - ov erl apping smaller datasets. The o verview of the procedure is sho wn in the Fig. 2. Here we ha ve used an ensemb le of n = 5 SVMs with non -linear rbf kernel. T he v alues of gamma and C parameters of the kern el ha ve been tuned from the 2 -fold cross-v ali dation accuracy using grid-search [13]. The rationale behind choosing an ensemble of 5 learners is that the perfor- mance saturates for higher number of learners. In the ﬁrst st age, the feature dataset was shuf ﬂed and divided into 6 portions. Among t hem the ﬁ rst 5 segmen ts has been used for t r aining the ensemble-members and then each of the trained members gave probability estimates for the held-out 6 -th portion of the dataset. For one feature ve ctor of 13 MFCC features each of the mem- bers gi ves 1 probability estimate and thus a ne w feature space of 5 features have been deriv ed from one input feature vector which mak es the procedure completely data-driven wit hout us- ing an y heuristic thresholding. These feature vectors hav e been used for training an SVM classiﬁer with rbf kernel in t he ﬁnal layer . Here instead of using a majority voting based decision, we hav e used SVM as majority voting system may giv e estimate biased to a particular ensemble member without considering the other members. In case of l arger dataset even than that of used in this work, the num ber of SVM layers in the system may be increased simi l ar to an NN architecture. 4. Result Analysis The proposed ensemble-SVM was tested on the t est dataset de- scribed in the data des cription section. T o prove the efﬁcacy of the classiﬁer , we att empt to e xamine t he ensemble architec- ture from different perspecti ves such as effect of layer members, stability and ﬁnally we compare t he classiﬁer wit h a stable, ef- ﬁcient neural netwo rk. In order t o test t he effect of ensemb ling, we gradually increase t he number of ensemble member and ob- serve their accurac y in T able 2. T able 2: Effect on performance for incr easing ensemble mem- bers Ensemble Member No. Ac curacy A verage indi vidual member accuracy 1 57.05% 2 72.25% 3 78.32% 57.05% 4 87.02% 5 88.74% 6 88.82% From T able 2 we see that the accurac y of classiﬁcation in- creases for increasing number of ensemble members and the accurac y almost saturates aft er n = 5 ens emble members. Although individu al SVM trained on smaller dataset shows poor performance of 57 . 05% , the effect of ensemb ling is e v- ident from the higher classiﬁcation accuracies of the ensem- bles. Again to observe estimation v ariance reduction of the ensemble, we present the accurac y of two testing ﬁles i n T a- ble 3. For e xample, in the case of ‘spee ch-libriv ox-0011’ ﬁle, if we use stand-alon e SVM the accuracy may v ary i n the range 8 . 65 ∼ 91 . 35% as the stand-alone SVMs are trained on differ - ent portions of data and hence, the y may perform differently for a particular test-case. From T able 3 we can observ e that the ac- curacy of indi vidual SVM may ﬂuctuate whereas their ensemble accurac y remains stable and close to the maximum individual accurac y . T able 3: Effect on performance for incr easing ensemble mem- bers File Name Ensemble Member Accurac y T otal E nsemble Accurac y 8.6466% 8.6466% speech-libriv ox-0011 91.3534% 90.2256% 39.4737% 18.797% 5.26316% 30.8271% noise-sound-bible-00 31 21.4286% 72.45% 49.6241% 68.797% Finally , we compare the performance of t he ensemble SVM with t he NN described in the subsection 3 . 2 . NN and ensemble- SVM give 86 . 28 % and 88 . 74 % accurac y respecti vely . Their R OC curves are giv en in Fi g. 3. The operating point of each classiﬁer is sho wn by using a circle on the curv es which sho ws that the ensemble-SVM has a better true positiv e rate of 88 . 04% and a slightly high false positi ve rate of 17 . 01% compared to that of NN. The av erage area under curve (A UC) f or NN and ensemble-SVM are 0 . 9284 and 0 . 9167 respectiv ely . From these performance indices, we can con clude that their performan ces are comparable. 5. Conclusion In this paper , we ha ve proposed a novel ensemble SVM-based approach for v oice acti vity detection. The efﬁcac y of the pro- posed ensembling method has been established in t he result sec- tion through comparing with NN and testing on the test dataset. Here the member SVMs are i ndependent of each other and hence they can operate parallelly resulting in a signiﬁcant r e- duction in runtime. Again different ensemble member can be trained on dif ferent types of feature giving a more rob ust V AD system. Replacing some of the layers of an NN wit h SVM layer may result in improv ed accurac y . In our future work, we will ex- plore composite structure of NN and S VM with reduced train- ing complexity . 6. Refer ences [1] R. Tuck er, “V oice acti vity detect ion using a peri odicit y measure, ” IEE Pr oceedings I (Communic ations, Speech an d V ision) , vol. 139, no. 4, pp. 377–380, 1992. [2] P . Rene vey and A. Drygajlo, “Entropy based voice acti vity detec- tion in very noi sy condit ions, ” in Sevent h Eur opean Con fer ence on Speec h Communicat ion and T echn olog y , 2001. [3] J.-H. Chan g, N. S. Kim, and S. K. Mi tra, “V oice acti vity detec- tion based on multiple statistica l models, ” IE EE T ransact ions on Signal Pr ocessing , vol. 54, no. 6, pp. 1965–1976, 2006. [4] J. Ramırez , J. C. Seg ura, C. Benıtez , A. De La T orre, and A. Ru- bio, “Efﬁcie nt voice act i vity detect ion algorithms using long-te rm speech informatio n, ” Speec h communication , vol. 42, no. 3-4, pp. 271–287, 2004. [5] S. G. T anyer and H. Ozer , “V oice ac ti vity detect ion in nonstation- ary noise , ” IEE E T ransactions on spee ch and au dio pr ocessing , vol. 8, no. 4, pp. 478–482, 2000. [6] A. Davis, S. Nordholm, and R. T ogneri, “Stati stical voice acti vity detec tion using low-v ariance spectrum estimation and an adapti ve threshold , ” IEE E T ransaction s on Audio, Spee ch, and Langua ge Pr ocessing , vol . 14, no. 2, pp. 412–424, 2006. [7] J. W . Shin, J. -H. Chang, and N. S. Kim, “V oice acti vity detecti on based on s tatist ical models and machine lea rning approac hes, ” Computer Speech & Languag e , vol. 24, no. 3, pp. 515–530, 2010. [8] J. Wu and X. -L. Z hang, “Max imum margin clustering based sta- tistic al vad with multipl e observa tion compound feature, ” IEEE Signal Pr ocessing Letters , vol. 18, no. 5, pp. 283–286, 2011. [9] X.-L. Zhang and J. W u, “Deep belief networks based voic e ac- ti vity de tecti on, ” IE EE T ransact ions on Audio, Spee ch, and Lan- guag e Pr ocessing , vol. 21, no. 4, pp. 697–710, 2013. [10] M. V an Segbroeck , A. Ts iarta s, and S. Narayan an, “ A robust fron- tend for vad : exploitin g contextua l, discriminat i ve and spectral cues of human voice. ” in INTERSP E ECH , 2013, pp. 704–708. [11] D. Snyde r , G. Chen, and D. Pov ey , “Musan: A music, speech, and noise corpus, ” arXiv preprin t arXiv:1510.08484 , 2015. [12] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support ve ctor machines, ” ACM tran sactions on intellig ent systems and tec hnol- ogy (TIST) , vol. 2, no. 3, p. 27, 2011. [13] C.-W . Hsu, C.-C. Chang, C.-J. Lin et al. , “ A practi cal guide to support vec tor classiﬁcati on, ” 2003.

An Ensemble SVM-based Approach for Voice Activity Detection

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment