Prodorshok I: A Bengali Isolated Speech Dataset for Voice-Based Assistive Technologies
A comparative analysis of the effects of data augmentation on HMM-GMM and DNN classifiers

Mohi Reza, Warida Rashid, Moin Mostakim
Department of Computer Science & Engineering, BRAC University, Dhaka, Bangladesh
mohireza@ieee.org, warida.rashid@gmail.com, mostakim@bracu.ac.bd

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract — Prodorshok I is a Bengali isolated-word dataset tailored to help create speaker-independent, voice-command driven automatic speech recognition (ASR) based assistive technologies that improve human-computer interaction (HCI). This paper presents the results of an objective analysis that was undertaken using a subset of words from Prodorshok I to assess its reliability in ASR systems that utilize Hidden Markov Models (HMM) with Gaussian emissions and Deep Neural Networks (DNN). The results show that simple data augmentation involving a small pitch shift can make surprisingly tangible improvements to accuracy levels in speech recognition.

Keywords — Automatic Speech Recognition; Bengali; Hidden Markov Model; Gaussian Mixture Model; Deep Neural Network; Human-Computer Interaction; Assistive Technology

I. INTRODUCTION

The impetus for creating Prodorshok I is twofold. First, it helps fill the lack of preprocessed, easy-to-use Bengali isolated-word datasets. Second, the word set has useful applications in assistive technologies that utilize voice commands.
For example, an application that combines our ASR system with text-to-speech technology can enable people with visual impairment to navigate digital interfaces with ease.

Contemporary software systems rely heavily on increasingly rich graphical user interfaces. While this has brought drastic usability improvements for the general population, the same cannot be said for people with disabilities such as visual impairment or limited mobility. Where accessibility features do exist, they are rarely designed with Bengali-speaking users in mind. With an estimated 650,000 visually impaired adults in Bangladesh [1], there is a tangible need for more inclusive alternatives to purely GUI-driven ways of navigating digital interfaces. Prodorshok I can help fill this void.

II. RELATED WORK

The research landscape for Bengali speech recognition is nascent in comparison to the rich history of ASR system development in English. A notable Bengali dataset that is freely available is the SHRUTI Bengali Continuous ASR Speech Corpus [1]. Das, Mandal and Mitra [3] used the Hidden Markov Model Toolkit (HTK) to align its speech data. Mandal et al. [2] used SHRUTI to create a phone recognition (PR) system that used an optimum text selection technique to decipher the smallest discrete units of sound in uttered speech. In 2010, Mandal, Das and Mitra [4] introduced SHRUTI-II, a SPHINX3-based Bengali ASR system, and demonstrated its use in an e-mail based computer application designed to aid visually impaired users. Mohanta and Sharma [5] conducted a small study on emotion detection in Bengali speech. Their goal was to identify neutrality, anger and sadness in speech using Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), pitch, intensity and formants.
Bhowmik and Mandal [6] applied a deep neural network based phonological feature extraction technique to Bengali continuous speech.

III. DATASET

The dataset consists of recordings of single utterances of 30 Bengali words by 35 native speakers in Dhaka, totaling 1050 voice samples. The word set has been specifically constructed for use in systems that implement hands-free selection and navigation of digital interfaces. It includes the Bengali words for 10 digits (zero to nine), 10 directional words (east, west, north, south, up, down, left, right, forward, backward) and 10 positional words (first to tenth).

IV. METHOD

A. Preprocessing

All word samples were put through five stages of speech enhancement. First, stereo channels were merged into mono. Then, static background noise was attenuated using a noise-reduction algorithm based on Fourier analysis; a unique noise profile for each sample was used for best results. The sound signals were then normalized to a maximum amplitude of -1.0 dB, and the mean amplitude displacement was set to 0.0 for uniformity. Any silence at the beginning or end was truncated. Finally, the audio samples were cloned into two separate datasets, one of which was synthetically augmented by including pitch-altered copies of the existing voice samples. The effect of these five stages on an audio sample can be seen in Fig. 1. The end result is a concise audio sample that is ready to be used to train and test different acoustic models.

Fig. 1. Visualizing the Five Stages of Speech Enhancement

B. Feature Extraction

After the data was prepared, the next step involved selecting and extracting features from it. The accuracy and precision of speech recognition systems are highly dependent on the method of feature extraction.
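The pitch-based augmentation used in the preprocessing stage above can be sketched as follows. This is only an illustration using plain resampling (which, unlike the phase-vocoder shifts found in audio toolkits, also alters duration); the function names and the one-semitone default are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def pitch_shift(signal, semitones):
    """Crude pitch shift by resampling: raising the pitch by `semitones`
    shortens the signal by the same factor (duration is not preserved)."""
    factor = 2.0 ** (semitones / 12.0)
    idx = np.arange(0.0, len(signal), factor)   # fractional sample positions
    return np.interp(idx, np.arange(len(signal)), signal)

def augment(samples, semitones=1.0):
    """Clone the dataset, adding a slightly pitch-altered copy of each sample."""
    return samples + [pitch_shift(x, semitones) for x in samples]
```

A phase-vocoder implementation would preserve duration while shifting pitch; the resampling version above is merely the shortest way to show the idea of enlarging the training set with pitch-altered copies.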
For this experiment, the state-of-the-art feature for speech recognition systems, the Mel-Frequency Cepstral Coefficient (MFCC), was selected. MFCC features closely mimic the way sound is perceived by the human ear. The implementation details [7, 8, 9] are as follows:

• The signal is divided into 25 ms long frames.

• For each frame, the Discrete Fourier Transform (DFT) is taken. The DFT is represented by the following equation:

    X_N[k] = Σ_{n=0}^{N-1} x_N[n] e^{-j2πkn/N}    (1)

  Here, x_N[n] is a signal with a period of N, X_N[k] is its DFT, and k is the frequency index. The signal is recovered from its DFT by the inverse transform:

    x_N[n] = (1/N) Σ_{k=0}^{N-1} X_N[k] e^{j2πkn/N}    (2)

  The power at index k follows from the real and imaginary parts of the DFT:

    |X_N[k]|² = Re{X_N[k]}² + Im{X_N[k]}²    (3)

  so the periodogram estimate of the power spectrum is given by:

    P[k] = (1/N) |X_N[k]|²    (4)

• The mel-spaced filterbank is applied to the result of the previous step. The filterbank energies are calculated by multiplying each filterbank with the power spectrum and summing the coefficients.

• Then, the logarithm of the filterbank energies is taken.

• Finally, the Discrete Cosine Transform (DCT) of the log filterbank energies is computed to yield the cepstral coefficients. The lower 13 coefficients form the feature vector for every frame of each signal.

C. Hidden Markov Model (HMM) with Gaussian Emission

Hidden Markov Models have been used successfully for time-varying sequences such as audio signals. The underlying idea behind the Hidden Markov Model is that it models sequences with discrete states. This maps to speech recognition as follows: during feature extraction, speech signals are transformed into features over discrete time slices or frames, so a particular word comprises a finite number of frames. Given a particular sequence of features, the model can yield the probability of that sequence being a certain word. Here, the phonemes, i.e. the distinct units of sound that can be produced, are the discrete states, and the sequences of MFCCs that represent the uttered word are the observations.
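The feature-extraction steps of Section IV-B (framing, DFT, periodogram, mel filterbank, log, DCT) can be sketched in NumPy as follows. This is an illustrative implementation, not the code used in the experiments; the 10 ms frame step, 512-point FFT and 26-filter bank are common defaults assumed here:

```python
import numpy as np

def mfcc(signal, sample_rate=16000, frame_ms=25, step_ms=10,
         nfft=512, n_filters=26, n_ceps=13):
    # 1. Split the signal into 25 ms frames (10 ms step) and window them.
    frame_len = sample_rate * frame_ms // 1000
    step = sample_rate * step_ms // 1000
    n_frames = 1 + (len(signal) - frame_len) // step
    frames = np.stack([signal[i * step:i * step + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)

    # 2. DFT of each frame, then the periodogram estimate of the power spectrum.
    spectrum = np.fft.rfft(frames, nfft)
    power = (np.abs(spectrum) ** 2) / nfft

    # 3. Triangular filters spaced evenly on the mel scale, applied to the power.
    def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sample_rate / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    energies = np.maximum(power @ fbank.T, 1e-10)

    # 4. Log filterbank energies, then an orthonormal DCT-II;
    #    the lower 13 coefficients form the feature vector per frame.
    log_e = np.log(energies)
    n = np.arange(n_filters)
    basis = 2.0 * np.cos(np.pi * np.outer(n + 0.5, n) / n_filters)
    ceps = log_e @ basis
    ceps[:, 0] *= np.sqrt(1.0 / (4.0 * n_filters))
    ceps[:, 1:] *= np.sqrt(1.0 / (2.0 * n_filters))
    return ceps[:, :n_ceps]
```

For a one-second 16 kHz signal this yields a (98, 13) feature matrix: one 13-dimensional vector per frame, as described above.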
The probability of observing MFCC sequences given the state is computed using Gaussian emissions. The details [10] of how this works are as follows:

A Hidden Markov Model is a probabilistic model that produces a sequence of observations X through a sequence of hidden states Z, generated by a probabilistic function associated with each state. An HMM is usually represented by λ, where λ = (A, B, Π). It can be defined by the following parameters:

• O = {o_1, o_2, …, o_m}: an output observation sequence. For speech recognition, this represents the MFCC feature vectors.
• Ω = {1, 2, …, N}: a set of states. For speech recognition, these are the phoneme labels.
• A = {a_ij}: the transition probability matrix, where a_ij is the probability of a transition from state i to state j.
• B = {b_i(k)}: the output probabilities, i.e. the probability of emitting a certain observation o_k in state i.
• Π: the start probability vector.

There are three basic problems for HMMs:

1) Estimating the optimal sequence of states given the parameters and observed data.
2) Calculating the likelihood of the observed data given the parameters, P(O|λ).
3) Adjusting the parameters given the observed data so that P(O|λ) is maximized.

1) Estimating the Parameters

The solution to problem 3 is to estimate the model parameters so that P(O|λ) is maximized over the training observations. The optimal Gaussian mixture parameters for a given set of observations can be chosen so that the probability reaches a maximum using the Expectation-Maximization (EM) algorithm [11], an iterative optimization method that is guaranteed only to converge to a local maximum.

2) Estimating the State Sequence

Estimating the state sequence S given an observation sequence X and the model λ is done using the Viterbi algorithm [12].
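Viterbi decoding can be sketched as a small dynamic-programming routine over log probabilities. This is a generic illustration of the algorithm in [12], not the decoder used in the experiments; the variable names are our own:

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit):
    """Most likely hidden-state sequence by dynamic programming.
    log_start: (N,)   log start probabilities (log Π)
    log_trans: (N, N) log transition matrix (log a_ij)
    log_emit:  (T, N) per-frame log emission likelihoods, log b_j(o_t)
    Returns (state path of length T, log probability of that path)."""
    T, N = log_emit.shape
    delta = log_start + log_emit[0]        # best score ending in each state
    backptr = np.zeros((T, N), dtype=int)  # argmax predecessors
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # scores[i, j]: from i to j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 2, -1, -1):          # follow backpointers
        path[t] = backptr[t + 1, path[t + 1]]
    return path, float(delta.max())
```

In an HMM with Gaussian emissions, `log_emit` would come from evaluating the per-state Gaussian (mixture) densities on each MFCC frame.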
The Viterbi algorithm is a formal technique for finding the best state sequence based on dynamic programming [10].

D. Deep Neural Network

Deep Neural Networks have proven successful in speech recognition and are currently a widely researched topic in this field [13, 14, 15]. Artificial neural networks are simplified representations and simulations of the neuronal structure of the brain. Deep neural networks are artificial neural networks in which multiple layers of neurons are used. The system learns through observations and a feedback mechanism.

1) Activation Function

McCulloch and Pitts proposed the idea of the artificial neuron; a smooth, widely used variant is the sigmoid neuron [16]. The sigmoid function is widely used in feed-forward networks with backpropagation because of its non-linearity and the simplicity of its computation [17]. The function is given by:

    σ(z) = 1 / (1 + e^{-z})    (2)

In practical applications of the sigmoid function as an activation function, w_i is a real-valued weight, x_i is the input, and the weighted input of a node is given by:

    z = Σ_i w_i x_i + b    (3)

The weights are changed depending on how much the relationship between the inputs and the output needs to be strengthened.

2) Multi-Layered Feed-forward Network

A feed-forward network is a type of artificial neural network in which the connections between nodes do not form a cycle [18]. A simple three-layered feed-forward neural network structure is shown in Fig. 2.

Fig. 2. A three-layered neural network

3) Backpropagation

Backpropagation is the process of minimizing the difference between the actual output and the desired output (the error) based on labeled training samples. It is used by optimization algorithms to adjust the weight of each connection between neurons in different layers based on the accumulated error of a batch of training data. The error is computed using a cost function and is propagated back through the network.
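The three ingredients above (sigmoid activations, a feed-forward pass, and backpropagation of the error) can be sketched together in a tiny NumPy network. The layer sizes, learning rate, squared-error cost and XOR toy task are illustrative assumptions, not the configuration used in the paper:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation applied to the weighted input z."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# A three-layered feed-forward network: 2 inputs, 4 hidden units, 1 output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR labels

lr = 1.0
losses = []
for _ in range(5000):
    # Forward pass: weighted inputs followed by sigmoid activations.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    losses.append(float(np.mean((output - y) ** 2)))
    # Backpropagation: the squared-error gradient flows back layer by layer.
    d_out = (output - y) * output * (1 - output)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)
    # Gradient-descent weight updates (a plain optimizer, for illustration).
    W2 -= lr * hidden.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid;      b1 -= lr * d_hid.sum(axis=0)
```

Each update nudges every connection weight in proportion to its contribution to the batch error, which is exactly the role backpropagation plays in the training described above.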
Different optimization algorithms are used for this process.

4) Optimization Algorithm

The objective of an optimizer is to reach the minimum point of the error curve over the weights. For our experiment, Adam, a stochastic optimization method that combines the advantages of two popular methods, AdaGrad and RMSProp, was used. This technique was first introduced in 2014 [13]. It takes the following parameters:

• Learning rate: a floating-point value; 0.001 was used for the experiment.
• beta1 = 0.9: a constant float tensor, the exponential decay rate for the 1st moment estimates.
• beta2 = 0.999: another constant float tensor, the exponential decay rate for the 2nd moment estimates.
• epsilon: a small constant for numerical stability.

5) The Cost Function

A way of generalizing the optimization process, so that it does not overfit the training set, is to use a suitable cost function. For the experiment, softmax cross-entropy with logits was used. It measures the probability error in discrete classification tasks where each sample belongs to exactly one class.

V. RESULTS

The performance on the speaker-independent dataset without augmentation was quite low for both classifiers. Interestingly, after data augmentation, the accuracy of the HMM-GMM model increased by 6.21% and that of the DNN by 7.65%. Overall performance, however, was better with the HMM-GMM model. On the speaker-dependent system, the performance of the HMM-GMM model is quite high, with an accuracy level of 96.67%. The performance score for the DNN in the speaker-independent system is comparatively lower, at 47.84% with augmentation and 40.19% without. Further experiments show a positive correlation between the number of utterances per word and the accuracy level for both classifiers.

A. Average Percentage Accuracy Levels

Each sub-category in Table I denotes the average accuracy score derived from three consecutive runs of a particular classifier. These results are presented visually in Fig. 3.

TABLE I. AVERAGE ACCURACY LEVELS

                                     HMM-GMM    DNN
  Speaker Independent
    With Augmentation                56.28 %    47.84 %
    Without Augmentation             50.07 %    40.19 %
  Speaker Dependent                  96.67 %    43.75 %

Fig. 3. Effects of Augmentation on Accuracy Levels

B. Correlation Between Utterances per Word and Accuracy

Table II lists how the accuracy levels derived from each classifier varied as the number of utterances per word varied. These results are presented visually in Fig. 4.

TABLE II. UTTERANCES PER WORD VERSUS ACCURACY

  Utterances Per Word              Classifier
  Total    Test    Train           HMM-GMM    DNN
  10       3       7               34.93 %    21.53 %
  15       4       11              35.38 %    22.89 %
  20       6       14              36.97 %    28.57 %
  25       7       18              41.00 %    31.65 %
  30       9       21              50.75 %    34.63 %
  35       10      25              52.51 %    40.19 %

Fig. 4. Plot of Utterance Count and Accuracy Level

VI. DISCUSSION

Results pertaining to speaker-dependent systems have been very promising, yielding accuracy levels averaging 96.67% when using the HMM-GMM based classifier. Quite interestingly, the DNN classifier yielded only a negligible improvement over its speaker-independent accuracy levels. The two classifiers we used are both based on tried and tested acoustic models. Yet, owing to the intrinsic acoustic variability in sound signals spoken by multiple speakers, accuracy levels for speaker-independent systems remained below 60%. The size of the corpus and the sparseness of our training data are what likely limited the performance of the classifiers. Judging by the experimental results summarized in Table II, there appears to be a clear positive correlation between utterance count and accuracy levels for both classifiers.
As such, expanding the current dataset to incorporate a higher number of utterances per word will likely address this issue. Despite the limitations of working with a sparse dataset, the experimental results summarized in Table I indicate that tangible improvements can be made by augmenting the data through simple measures such as pitch shifting. This is likely due to the increased variety of speech signals to which the ASR system is exposed during training.

VII. FUTURE WORK

The performance of hybrid classifiers such as DNN-HMM is yet to be explored. It would be useful to see whether the trend lines depicted in Fig. 4 hold as the utterance count is increased beyond 35. Empirical analysis of this nature relies upon the availability of more data. Hence, there is a need to further expand Prodorshok I to incorporate a larger vocabulary and increased variation in speech for every word.

VIII. CONCLUSION

In this paper, we tested Prodorshok I using two classification algorithms that use HMM-GMM and DNN based acoustic modeling. The results indicate that Prodorshok I in its current form can already be used to design reliable speaker-dependent systems. Furthermore, they show that a simple data augmentation technique relying on minor pitch shifting can make tangible improvements in speech recognition accuracy. Further expansion of the dataset will likely improve performance levels in speaker-independent systems.

ACKNOWLEDGMENT

We are grateful to Assistant Professor Matin Saad Abdullah and Lecturer Md. Shamsul Kaonain of the Department of Computer Science and Engineering at BRAC University for their guidance and engaging conversations, which were indispensable to the development of this work.

REFERENCES

[1] B. Das, S. Mandal and P. Mitra, Shruti Bengali Bangla ASR Speech Corpus. [Online]. Available: http://cse.iitkgp.ac.in/~pabitra/shruti_corpus.html [Accessed 19 Aug. 2017].
[2] S. Mandal et al.,
"Developing Bengali Speech Corpus for Phone Recognizer Using Optimum Text Selection Technique," in Proc. Int. Conf. Asian Language Processing (IALP), 2011, pp. 268-271.
[3] B. Das, S. Mandal and P. Mitra, "Bengali speech corpus for continuous automatic speech recognition system," in Proc. Conf. Speech Database and Assessments (Oriental COCOSDA), Taiwan, 2011, pp. 51-55.
[4] B. Das, S. Mandal and P. Mitra, "Shruti-II: A vernacular speech recognition system in Bengali and an application for visually impaired community," in Students' Technology Symposium (TechSym), 2010 © IEEE. doi: 10.1109/TECHSYM.2010.5469156
[5] A. Mohanta and U. Sharma, "Bengali speech emotion recognition," in Computing for Sustainable Global Development (INDIACom), 2016 © IEEE.
[6] T. Bhowmik and S. K. D. Mandal, "Deep neural network based phonological feature extraction for Bengali continuous speech," in Signal and Information Processing (IConSIP), 2016 © IEEE. doi: 10.1109/ICONSIP.2016.7857491
[7] X. Huang, A. Acero, and H. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001.
[8] M. Xu et al., "HMM-based audio keyword generation," in K. Aizawa, Y. Nakamura and S. Satoh (eds.), Advances in Multimedia Information Processing - PCM 2004: 5th Pacific Rim Conference on Multimedia. Springer, 2004. ISBN 3-540-23985-5.
[9] M. Sahidullah and G. Saha, "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition," Speech Communication, vol. 54, no. 4, 2012, pp. 543-565. doi: 10.1016/j.specom.2011.11.004
[10] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[11] J. A. Bilmes,
"A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models," International Computer Science Institute, vol. 4, no. 510, 1998, p. 126.
[12] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13, no. 2, 1967, pp. 260-269.
[13] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.
[14] G. E. Dahl et al., "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30-42, 2012.
[15] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in INTERSPEECH, 2011, pp. 437-440.
[16] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015.
[17] J. Han and C. Moraga, "The influence of the sigmoid function parameters on the speed of backpropagation learning," in J. Mira and F. Sandoval (eds.), From Natural to Artificial Neural Computation, 1995.
[18] A. Zell, Simulation Neuronaler Netze, 1st ed. Addison-Wesley, p. 73. ISBN 3-89319-554-8.