Pathological Voice Classification Using Mel-Cepstrum Vectors and Support Vector Machine
Maryam Pishgar (1), Fazle Karim (1) (Graduate Student Member, IEEE), Somshubra Majumdar (2), and Houshang Darabi (1) (Senior Member, IEEE)

Abstract—Vocal disorders affect patients all over the world. Due to the inherent difficulty of diagnosing vocal disorders without sophisticated equipment and trained personnel, many patients remain undiagnosed. To alleviate the monetary cost of diagnosis, there has been recent growth in the use of data analysis to detect and diagnose individuals accurately for a fraction of the cost. We propose a cheap, efficient, and accurate model to diagnose whether a patient suffers from one of three vocal disorders on the FEMH 2018 challenge.

Index Terms—vocal disorder, neoplasm, nodule, polyp, mfcc

I. INTRODUCTION

The human standard of living can be severely affected by pathological voice conditions, which have also financially burdened several patients, organizations, and societies [1]. Common impairments to the voice include structural lesions, neoplasms, and neurogenic disorders [1]. One of the most frequently used tools for diagnosing these vocal disorders is the laryngoscope [2]. Laryngoscopy is an expensive, time-consuming process that requires trained personnel to perform the test [3]. In addition, vocal disorders must be detected and treated at an early diagnostic stage, when their symptoms are difficult to detect properly, as in the case of larynx cancer [4]. Therefore, patients who do not have easy access to advanced technology, or who cannot afford expensive medical treatment, are at a disadvantage in receiving effective care. To alleviate these issues, non-invasive techniques that eliminate subjective biases have become a popular means of detecting vocal disorders [5].
In this paper, we propose an algorithm to classify voice disorders accurately and quickly using cheap diagnostic tools.

Over the past 60 years, a significant amount of work has been developed in the field of automated speech pathology, which has helped physicians diagnose vocal disorders accurately. In the 1960s, voice quality was measured using the widespread shimmer, jitter, and harmonics-to-noise ratio (HNR) techniques [6]. Subsequently, linear discriminant analysis (LDA) was used to discriminate pathological voices from normal ones [7]. In another study, glottal noise parameters were measured primarily to identify vocal disorders [8]. Another common technique extracted Mel-frequency cepstral coefficient (MFCC) features and fed them into a Gaussian mixture model (GMM) [9]. More recently, the F-Ratio and Fisher's discriminant ratio were used for feature selection in acoustic analysis [4]. Little et al. used recurrence and fractal scaling to analyze disordered vowels; these samples were then bootstrapped to distinguish normal and disordered voices [10]. In the early 2000s, Arias-Londoño et al. combined information obtained from MFCCs and modulation spectra (MS) as input to GMM and support vector machine (SVM) classifiers [11]. Arjmandi and Pooyan focused their study on the Short-Time Fourier Transform (STFT) and Continuous Wavelet Transform (CWT) to analyze voice impairments, using Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and an SVM classifier [12]. In May 2014, Muhammad and Melhem employed MPEG-7 features to differentiate pathological voices from normal voices [13]. Deep learning approaches are becoming more commonly used to categorize voice disorders.

(Affiliations: (1) Mechanical and Industrial Engineering, University of Illinois at Chicago, Chicago, IL; (2) Computer Science, University of Illinois at Chicago, Chicago, IL.)
One such study used convolutional layers with Long Short-Term Memory (LSTM) recurrent layers on raw audio signals as an end-to-end classifier [14]. Another deep learning approach used the VGG-16 and CaffeNet models for voice disorder classification [15]. Subsequently, Tsao et al. compared deep neural network (DNN), GMM, and SVM models that apply MFCC, MFCC+delta, and MFCC(N)+delta features from the raw signals [1].

In this study, we develop a supervised classification model on a range of given voice data samples (obtained from the 2018 Far Eastern Memorial Hospital Voice Disorder Challenge) to detect pathological defects and classify them into one of three commonly identified vocal disorders: Neoplasm, Vocal Palsy, and Phonotrauma. The voice samples included 50 normal voices and 150 voice disorders, characterized by vocal nodules, polyps, and cysts (collectively related to Phonotrauma), glottis neoplasm, and unilateral vocal paralysis [1]. Our model uses MFCC and MFCC delta features from the raw signals as input to an SVM classifier. The hyperparameters of the SVM classifier are tuned via a state-of-the-art algorithm developed by researchers at Google (the sequential halving and classification algorithm [16]). We compare our model with a deep learning model (the Long Short-Term Memory Fully Convolutional Network [17]) and a classical model (XGBoost) to show how our proposed model can perform similarly to more complicated models on the FEMH dataset [18]. We propose our model be used as a baseline with which physicians can diagnose vocal disorders accurately, and against which other researchers can compare their results.

The remainder of the paper is structured as follows: Section II contains the background work on which this work is based. Section III describes the methodology we use to train our proposed models.
In Section IV, we discuss the results obtained in our experiments. Finally, in Section V, we conclude and describe potential future work.

II. BACKGROUND WORK

A. Mel-frequency Cepstral Coefficients

The Mel-frequency cepstrum is a representation of a sound signal based on the linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency [19]. The coefficients that collectively comprise the Mel-frequency cepstrum are called MFCC features. In contrast to the linearly-spaced frequency bands obtained from the cepstrum of a sound signal, the frequency bands in a mel-frequency cepstrum are uniformly spaced on the mel scale. This frequency warping allows for a better representation of sound and voice data.

B. Temporal Derivatives

To extract the dynamic features of speech, auxiliary delta and delta-delta features must be computed as input features for the model; these are calculated as the temporal derivatives of the original MFCC features [20]. To estimate smooth derivatives f'_{l,n}, we compute the local least-squares polynomial fit to the data samples, minimizing the cost function

    C^p_{l,n} = \sum_{m=-M}^{M} \left( \sum_{k=0}^{p} a_k m^k - f_{l,n+m} \right)^2

with respect to the polynomial coefficients a_k, where f is the spectrum-based feature vector, l indexes the L coefficients, n indexes the N frames, p is the order of the polynomial, and M determines the number of samples (2M + 1) used to fit the polynomial. We compute the delta features via this local estimate of the derivative of the input data, computed using Savitzky-Golay filtering [20], as we utilize the Librosa [21] package to preprocess the dataset.

III. METHODOLOGY

Voice samples were obtained from a voice clinic in a tertiary hospital (Far Eastern Memorial Hospital, FEMH).
The dataset comprises voice samples of 50 individuals who do not exhibit a pathological abnormality in their speech, and voice samples of 150 individuals who exhibit one of three different voice disorders: vocal nodules, polyps, and cysts (collectively referred to as Phonotrauma); glottis neoplasm; and unilateral vocal paralysis. Table I describes the demographics of the dataset, and Table II provides the number of samples available for each of the three vocal disorders, stratified by gender.

Each voice sample is a recording of a 3-second sustained vowel sound of the letter "A", under background noise of amplitude between 40 and 45 decibels, at a microphone-to-mouth distance of approximately 15-20 centimeters. The recordings were obtained using a Shure SM58 microphone and a Shure X2u digital amplifier, with a sampling rate of 44,100 Hz at 16-bit resolution, and saved in uncompressed waveform-audio format.

TABLE I: DEMOGRAPHICS OF NORMAL AND PATHOLOGICAL SAMPLES

                 Count      Mean Age (Years)   Age Range (Years)   Std. Deviation
Gender           M    F     M      F           M       F           M      F
Normal           20   30    30.9   34.5        22-72   22-72       10.9   12.
Pathological     78   72    53.8   44.6        21-87   22-84       15.3   13.5

TABLE II: DISEASE CATEGORIES OF THE 150 PATHOLOGICAL VOICE SAMPLES

       Neoplasm   Phonotrauma   Vocal Palsy
M      32         13            33
F      8          47            13

A. Data Processing

Each audio waveform is processed to derive MFCC features using the Librosa [21] library, with a sampling rate of 22,050 Hz. We compute the temporal derivatives (deltas) of these MFCC features, of which we compute the mean and maximum across all samples, and concatenate all three vectors into a single vector of size R^{3d}, where d is the number of extracted MFCC coefficients. We select the number of MFCC coefficients (d) to be 15, found via grid search over the space {10, 15, 20, 25, 30, 40, 50, 100}.
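As an illustrative sketch (not the authors' released code), the data-processing step can be approximated with NumPy alone: deltas are estimated by the local least-squares polynomial fit described in Section II-B, and the mean of the MFCCs together with the mean and maximum of the deltas are concatenated into a vector in R^{3d}. The pooling choice and all names here are assumptions; the paper itself computes MFCCs and Savitzky-Golay deltas with Librosa.

```python
import numpy as np

def local_poly_delta(f, M=2, p=2):
    """Estimate temporal derivatives of a (d x n_frames) feature matrix via the
    local least-squares polynomial fit of Section II-B: over each window of
    2M + 1 frames, fit sum_k a_k m^k and take the linear coefficient a_1 as the
    derivative at the window centre. (Librosa's Savitzky-Golay filtering
    computes an equivalent local-fit derivative.)"""
    d, n = f.shape
    padded = np.pad(f, ((0, 0), (M, M)), mode="edge")  # full windows at edges
    m = np.arange(-M, M + 1)
    deltas = np.empty_like(f, dtype=float)
    for i in range(d):
        for t in range(n):
            window = padded[i, t:t + 2 * M + 1]
            # np.polyfit returns coefficients from highest to lowest degree,
            # so the linear term a_1 is the second-to-last entry.
            deltas[i, t] = np.polyfit(m, window, deg=p)[-2]
    return deltas

def assemble_feature_vector(mfcc):
    """Concatenate the mean MFCCs with the mean and maximum of the deltas into
    a single vector in R^{3d}. This pooling choice is one plausible reading of
    the text, not a confirmed detail of the authors' implementation."""
    delta = local_poly_delta(mfcc)
    return np.concatenate(
        [mfcc.mean(axis=1), delta.mean(axis=1), delta.max(axis=1)])
```

With d = 15 coefficients, as selected by the grid search, the assembled vector has 45 dimensions, matching the input size of the proposed model.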
When assessing the performance of any model, we compute a weighted average of: the sensitivity and specificity of the model on the binary task of predicting whether the sample is normal or pathological, and the average recall of the model on the multi-class objective of classifying the sample as one of four classes. We assign the weights for the above scores as 40%, 20%, and 40%, respectively.

B. Hyperparameter Tuning

As the dataset comprises a mere 200 samples, we perform 5-fold cross validation for every model, using the same global seed across all models. We utilize the sequential halving and classification (SHAC) algorithm from Kumar et al. [16] as an efficient alternative to exhaustive grid search for sampling hyperparameter settings in continuous search spaces. To avoid overfitting the data distribution in the 5 folds used for evaluation of the models, we train and evaluate the SHAC algorithm on 5 different folds of the dataset using a different random seed. This preserves the generalization properties of the parameters sampled from the SHAC algorithm while also providing robustness to random-seed overfitting. For better robustness, we round off all floating-point values to the 3rd decimal place, and find that performance is not impacted.

Fig. 1. Proposed model pipeline: Input -> MFCC and MFCC deltas -> concatenation -> Random Forest feature selection -> SVM classifier, with SHAC hyperparameter tuning.

When training the classifiers of the SHAC algorithm, we compute a maximum of 10 classifiers, with a batch of 100 hyperparameter samples per classifier and a total budget of 1000 hyperparameter samples evaluated. Note that once we obtain a sample from the SHAC algorithm, we use the same parameters for all 5 folds. The sampled parameters are therefore more robust and generalize better, as they must perform well across all 5 folds.
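The weighted evaluation score defined at the start of this section can be written down directly; a minimal sketch, with names of our own choosing:

```python
def weighted_score(sensitivity, specificity, avg_recall):
    """Weighted evaluation score: 40% binary sensitivity, 20% binary
    specificity, and 40% unweighted average recall over the four classes
    (normal plus the three disorders)."""
    return 0.4 * sensitivity + 0.2 * specificity + 0.4 * avg_recall
```

Plugging in the proposed model's row of Table III, weighted_score(0.8860, 0.7823, 0.5900) evaluates to about 0.7469, matching the reported weighted score.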
We then obtain 100 hyperparameter samples, compute the mean and standard deviation of this batch, and select those candidates which obtain a score greater than the mean plus one standard deviation. We use a publicly available implementation of the SHAC algorithm (https://github.com/titu1994/pyshac).

C. Proposed Model

Our proposed model comprises a feature selection stage followed by a multi-class classification stage. We utilize a set of Random Forest models, each trained on the training data of a given fold, to compute the feature importance of the 45-dimensional input vector. We then determine a threshold value sampled from the SHAC algorithm, which is used to select only those features whose importance is above that threshold. These features are then supplied to a kernel Support Vector Machine with a Gaussian Radial Basis Function kernel. We utilize the One-vs-One strategy for multi-class classification, which builds N(N - 1)/2 classifiers. Such a strategy is applicable here due to the small amount of data available and the relatively fast training, but we observe that the One-vs-All strategy is nearly identical in performance. We utilize the scikit-learn [22] package for all models described in this section. Figure 1 details the various stages of the proposed model pipeline.

We construct a search space consisting of the number of trees in the Random Forest {10, 20, 50, 100}, the depth of the tree {3, 4, 5, 6, 7, 8, no limit}, the selection threshold (U(0, 0.5)), the penalty parameter C for the SVM (U(0, 25)), and the gamma parameter for the RBF kernel (U(-1, 1)), where we resolve negative values to 1/(number of features). We then search over this space using the SHAC algorithm and sample its best parameters as described in Section III-B.
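A minimal scikit-learn sketch of this two-stage pipeline follows; the hyperparameter values are placeholders drawn from the stated search ranges rather than the SHAC-tuned values, and the synthetic data merely stands in for the 45-dimensional feature vectors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    # Stage 1: keep only features whose Random Forest importance exceeds
    # the sampled selection threshold.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
        threshold=0.02)),
    # Stage 2: RBF-kernel SVM; scikit-learn's SVC already uses the
    # One-vs-One strategy internally for multi-class problems.
    # gamma="auto" is 1/(number of features), mirroring how negative
    # sampled gamma values are resolved in the search space.
    ("svm", SVC(kernel="rbf", C=10.0, gamma="auto")),
])

# Tiny synthetic stand-in for the 45-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 45))
y = rng.integers(0, 4, size=200)   # 4 classes: normal + 3 disorders
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

In practice the SHAC-sampled threshold, C, and gamma would replace the placeholders above, with one such pipeline fit per fold.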
D. Baseline Models

All baseline models utilize the same input features and 5-fold training as described for the proposed model, to ensure a consistent training methodology. Random Forest based feature selection is performed, but the hyperparameters for the baseline models are also searched via SHAC to ensure we obtain unbiased hyperparameters. The search spaces for the Random Forest and the selection threshold remain consistent across all models.

1) XGBoost: We compare against XGBoost [23], a powerful gradient boosting tree model that obtains state-of-the-art results on multiple structured and unstructured datasets and is a standard baseline to compete with. While XGBoost has a large number of hyperparameters that can be tuned, we find that only three of these significantly impact the final score, and therefore construct a search space only over those three. We search over the number of estimators {10, 25, 50, 100, 200}, the maximum depth of the tree {3, 4, 5, 6, 7, 8}, and the learning rate (U(0.01, 0.2)).

2) Long Short-Term Memory Fully Convolutional Networks: We also compare this dataset on a hybrid deep neural network, the Long Short-Term Memory Fully Convolutional Network (LSTM-FCN) [17]. It comprises two branches: one with convolutional blocks, each consisting of a convolutional layer followed by Batch Normalization [24] and the ReLU activation function; the other comprising a Long Short-Term Memory recurrent layer [25] followed by a dropout layer [26] (with a probability of 80%). We utilize the same FCN and LSTM branch structure provided by Karim et al. to be consistent, and only modify the number of LSTM cells in the LSTM branch, which we find using grid search [17]. All aspects of initialization and training methodology are kept consistent with the paper to provide the best results.
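The baseline search spaces can be transcribed directly from the description above; as a sketch, here is the XGBoost space with plain random sampling standing in for SHAC's classifier-guided sampling (function and variable names are ours):

```python
import random

# XGBoost search space as described in the text.
XGB_SPACE = {
    "n_estimators": [10, 25, 50, 100, 200],
    "max_depth": [3, 4, 5, 6, 7, 8],
    "learning_rate": (0.01, 0.2),   # uniform range
}

def sample_xgb_params(rng=random):
    """Draw one hyperparameter setting from the space. SHAC would
    additionally filter such samples through its cascade of classifiers
    before evaluating them."""
    return {
        "n_estimators": rng.choice(XGB_SPACE["n_estimators"]),
        "max_depth": rng.choice(XGB_SPACE["max_depth"]),
        # Round to 3 decimals, as the authors do for robustness.
        "learning_rate": round(rng.uniform(*XGB_SPACE["learning_rate"]), 3),
    }
```

Each sampled setting would then be trained and scored identically on all 5 folds, consistent with the tuning procedure in Section III-B.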
IV. RESULTS

Due to the lack of a distinct test set, we instead discuss the 5-fold cross validation score obtained by each of the models, and how we selected the best model to submit to the FEMH 2018 Challenge. As discussed in Section III-B, we use the SHAC algorithm to sample parameter settings that it determines can obtain strong scores on the overall validation sets of the 5 folds. For each of the models, out of the 100 hyperparameter settings sampled, we found one or more settings which scored above the mean + 1 standard deviation threshold we had set. We then train and evaluate each of these candidate models and rank them. We find that we need to train fewer than 3 candidates in all cases, as the candidates all score very close to each other. All code employed in this paper is available online (https://github.com/houshd/FEMH).

The scores of each of these models are provided in Table III. The final column contains the weighted score of the sensitivity, specificity, and recall scores with weights 40%, 20%, and 40%, respectively.

TABLE III: 5-FOLD CROSS VALIDATION SCORES

Model            Sensitivity   Specificity   Recall    Score     Std. Dev
Proposed Model   0.8860        0.7823        0.5900    0.7469    0.0160
XGBoost          0.8747        0.7561        0.6150    0.7470    0.0710
LSTM-FCN         0.8539        0.6624        0.5550    0.6960    0.0401

Upon inspection of the results, we find that the proposed model and XGBoost perform quite similarly. However, the standard deviation of the XGBoost model is much higher than that of the proposed model, albeit with a marginally higher mean score. It is for this reason that we state that the proposed model is the overall winner: it scores marginally higher in sensitivity and specificity, and has a lower standard deviation in the weighted score. We also inspect the reason for the significantly lower performance of the LSTM-FCN.
We find that the extremely small amount of data provided to such a large neural network, possessing several hundred thousand parameters, allows the model to significantly overfit each of the subsets of the 5 folds during training and consequently receive a much lower generalization score. To see whether this was indeed the case, we also train the model with a mere fraction of its original parameters and find that, although the overall score improves, the improvement is not significant enough to be comparable to the proposed model.

V. CONCLUSION

In this study, we present a model that classifies vocal disorders using a range of voice data samples obtained from the 2018 FEMH Voice Disorder Challenge. The proposed model extracts MFCC and MFCC delta features from raw input signals and applies them to an SVM classifier. We utilize the SHAC algorithm to tune the parameters of the classifier for optimal performance. The proposed model performs similarly to, and in one case outperforms, more complicated models such as XGBoost and LSTM-FCN. We recommend using this approach to provide a fast yet simple baseline for diagnosing vocal disorders in clinical practice. We leave the extension of this model to automatically diagnose larynx cancer as future work.

ACKNOWLEDGMENT

We would like to thank Far Eastern Memorial Hospital for donating the dataset and the organizers of the 2018 FEMH challenge for providing valuable feedback. This publication was supported by Grant or Cooperative Agreement Number T42OH008672, funded by the Centers for Disease Control and Prevention. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention or the Department of Health and Human Services.

REFERENCES

[1] Shih-Hau Fang, Yu Tsao, Min-Jing Hsiao, Ji-Ying Chen, Ying-Hui Lai, Feng-Chuan Lin, and Chi-Te Wang.
Detection of pathological voice using cepstrum vectors: A deep learning approach. Journal of Voice, 2018.
[2] Nizhoni Denipah, Christopher M Dominguez, Erik P Kraai, Tania L Kraai, Paul Leos, and Darren Braude. Acute management of paradoxical vocal fold motion (vocal cord dysfunction). Annals of Emergency Medicine, 69(1):18-23, 2017.
[3] John George Karippacheril, Goneppanavar Umesh, and Venkateswaran Ramkumar. Inexpensive video-laryngoscopy guided intubation using a personal computer: initial experience of a novel technique. Journal of Clinical Monitoring and Computing, 28(3):261-264, 2014.
[4] Juan Ignacio Godino-Llorente, Pedro Gomez-Vilda, and Manuel Blanco-Velasco. Dimensionality reduction of a pathological voice quality assessment system based on gaussian mixture models and short-term cepstral parameters. IEEE Transactions on Biomedical Engineering, 53(10):1943-1953, 2006.
[5] Ghulam Muhammad, Mansour Alsulaiman, Zulfiqar Ali, Tamer A Mesallam, Mohamed Farahat, Khalid H Malki, Ahmed Al-nasheri, and Mohamed A Bencherif. Voice pathology detection using interlaced derivative pattern on glottal source excitation. Biomedical Signal Processing and Control, 31:156-164, 2017.
[6] Michael HL Hecker and E James Kreul. Descriptions of the speech of patients with cancer of the vocal folds. Part I: Measures of fundamental frequency. The Journal of the Acoustical Society of America, 49(4B):1275-1282, 1971.
[7] Karthikeyan Umapathy, Sridhar Krishnan, Vijay Parsa, and Donald G Jamieson. Discrimination of pathological voices using a time-frequency approach. IEEE Transactions on Biomedical Engineering, 52(3):421-430, 2005.
[8] Vijay Parsa and Donald G Jamieson. Identification of pathological voices using glottal noise measures. Journal of Speech, Language, and Hearing Research, 43(2):469-485, 2000.
[9] Fethi Amara and Mohamed Fezari. Voice pathologies classification using gmm and svm classifiers.
Recent Advances in Biology, Medical Physics, Medical Chemistry, Biochemistry and Biomedical Engineering, page 65, 2013.
[10] Max A Little, Patrick E McSharry, Stephen J Roberts, Declan AE Costello, and Irene M Moroz. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMedical Engineering OnLine, 6(1):23, 2007.
[11] Julián David Arias-Londoño, Juan I Godino-Llorente, Maria Markaki, and Yannis Stylianou. On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices. Logopedics Phoniatrics Vocology, 36(2):60-69, 2011.
[12] Meisam Khalil Arjmandi and Mohammad Pooyan. An optimum algorithm in pathological voice quality assessment using wavelet-packet-based features, linear discriminant analysis and support vector machine. Biomedical Signal Processing and Control, 7(1):3-19, 2012.
[13] Ghulam Muhammad and Moutasem Melhem. Pathological voice detection and binary classification using mpeg-7 audio features. Biomedical Signal Processing and Control, 11:1-9, 2014.
[14] Pavol Harar, Jesus B Alonso-Hernandezy, Jiri Mekyska, Zoltan Galaz, Radim Burget, and Zdenek Smekal. Voice pathology detection using deep learning: a preliminary study. In Bioinspired Intelligence (IWOBI), 2017 International Conference and Workshop on, pages 1-4. IEEE, 2017.
[15] Musaed Alhussein and Ghulam Muhammad. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access, 6:41034-41041, 2018.
[16] Manoj Kumar, George E Dahl, Vijay Vasudevan, and Mohammad Norouzi. Parallel architecture and hyperparameter search via successive halving and classification. arXiv preprint, 2018.
[17] Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Shun Chen. LSTM fully convolutional networks for time series classification. IEEE Access, 6:1662-1669, 2018.
[18] Chi-Te Wang, Feng-Chuan Lin, Yu Tsao, and Shih-Hau Fang.
2018 FEMH Voice Data Challenge, May 2018. https://femh-challenge2018.weebly.com/.
[19] Homayoon Beigi. Speaker Recognition. Springer, 2011.
[20] S. R. Krishnan, M. Magimai.-Doss, and C. S. Seelamantula. A Savitzky-Golay filtering perspective of dynamic feature computation. IEEE Signal Processing Letters, 20(3):281-284, March 2013.
[21] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, pages 18-25, 2015.
[22] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[23] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.