Pathological Voice Classification Using Mel-Cepstrum Vectors and Support Vector Machine
Maryam Pishgar (1), Fazle Karim (1) (Graduate Student Member, IEEE), Somshubra Majumdar (2), and Houshang Darabi (1) (Senior Member, IEEE)

Abstract—Vocal disorders affect patients all over the world. Due to the inherent difficulty of diagnosing vocal disorders without sophisticated equipment and trained personnel, many patients remain undiagnosed. To alleviate the monetary cost of diagnosis, there has been recent growth in the use of data analysis to detect and diagnose individuals accurately for a fraction of the cost. We propose a cheap, efficient, and accurate model to diagnose whether a patient suffers from one of three vocal disorders on the FEMH 2018 challenge.

Index Terms—vocal disorder, neoplasm, nodule, polyp, mfcc

I. INTRODUCTION

The human standard of living can be severely affected by pathological voice conditions, which have also financially burdened several patients, organizations, and societies [1]. Common impairments to the voice include structural lesions, neoplasms, and neurogenic disorders [1]. One of the most frequently used tools for diagnosing these vocal disorders is the laryngoscope [2]. Laryngoscopy is an expensive, time-consuming process that requires trained personnel to perform the test [3]. In addition, vocal disorders must be detected and treated at an early diagnostic stage, when their symptoms are difficult to detect properly, as in the case of larynx cancer [4]. Therefore, patients who do not have easy access to advanced technology, or who cannot afford expensive medical treatment, are at a disadvantage in receiving effective care. To alleviate these issues, non-invasive techniques that eliminate subjective biases have become a popular means of detecting vocal disorders [5].
In this paper, we propose an algorithm to classify voice disorders accurately and quickly using cheap diagnostic tools.

Over the past 60 years, a significant amount of work has been developed in the field of automated speech pathology, which has helped physicians diagnose vocal disorders accurately. In the 1960s, voice quality was measured using the widespread shimmer, jitter, and harmonics-to-noise ratio (HNR) techniques [6]. Subsequently, linear discriminant analysis (LDA) was used to discriminate pathological voices from normal ones [7]. In another study, glottal noise parameters were measured primarily to identify vocal disorders [8]. Another common technique extracted Mel-frequency cepstral coefficient (MFCC) features and fed them into a Gaussian mixture model (GMM) [9]. More recently, the F-Ratio and Fisher's discriminant ratio were used for feature selection in acoustic analysis [4]. Little et al. used recurrence and fractal scaling to analyze disordered vowels; these samples were then bootstrapped to distinguish normal and disordered voices [10]. In the early 2000s, Arias-Londoño et al. combined information obtained from MFCCs and modulation spectra (MS) as input to GMM and support vector machine (SVM) classifiers [11]. Arjmandi and Pooyan focused their study on the Short-Time Fourier Transform (STFT) and Continuous Wavelet Transform (CWT) to analyze voice impairments, using Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and an SVM classifier [12]. In May 2014, Muhammad and Melhem employed MPEG-7 features to differentiate pathological voices from normal voices [13]. Deep learning approaches are becoming more commonly used to categorize voice disorders.

(Affiliations: (1) Mechanical and Industrial Engineering, University of Illinois at Chicago, Chicago, IL; (2) Computer Science, University of Illinois at Chicago, Chicago, IL.)
One such study used convolutional layers with Long Short-Term Memory (LSTM) recurrent layers on raw audio signals as an end-to-end classifier [14]. Another deep learning approach used the VGG-16 and CaffeNet models for voice disorder classification [15]. Subsequently, Tsao et al. compared deep neural network (DNN), GMM, and SVM models that apply MFCC, MFCC+delta, and MFCC(N)+delta features from the raw signals [1].

In this study, we develop a supervised classification model on a range of given voice data samples (obtained from the 2018 Far Eastern Memorial Hospital Voice Disorder Challenge) to detect pathological defects and classify them into one of three commonly identified vocal disorders: Neoplasm, Vocal Palsy, and Phonotrauma. The voice samples included 50 normal voices and 150 voice disorders, characterized by vocal nodules, polyps, and cysts (collectively related to Phonotrauma), glottis neoplasm, and unilateral vocal paralysis [1]. Our model uses MFCC and MFCC delta features from the raw signals as input to an SVM classifier. The hyperparameters of the SVM classifier are tuned via a state-of-the-art algorithm developed by researchers at Google (the sequential halving and classification algorithm [16]). We compare our model with a deep learning model (the Long Short-Term Memory Fully Convolutional Network [17]) and a classical model (XGBoost) to show how our proposed model can perform similarly to more complicated models on the FEMH dataset [18]. We propose our model be used as a baseline with which physicians can diagnose vocal disorders accurately, and against which other researchers can compare their results.

The remainder of the paper is structured as follows: Section II contains the background work on which this work is based. Section III describes the methodology we use to train our proposed models.
In Section IV, we discuss the results obtained in our experiments. Finally, in Section V, we conclude and describe potential future work.

II. BACKGROUND WORK

A. Mel-frequency Cepstral Coefficients

The Mel-frequency cepstrum is a representation of a sound signal based on the linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency [19]. The coefficients that collectively comprise the Mel-frequency cepstrum are called MFCC features. In contrast to the linearly-spaced frequency bands obtained from the cepstrum of a sound signal, the frequency bands in a mel-frequency cepstrum are uniformly spaced on the mel scale. This frequency warping allows for a better representation of sound and voice data.

B. Temporal Derivatives

To extract the dynamic features of speech, auxiliary delta and delta-delta features must be computed as input features for the model; these are calculated as the temporal derivatives of the original MFCC features [20]. To estimate smooth derivatives f'_{l,n}, we compute the local least-squares polynomial fit to the data samples, minimizing the cost function

    C^p_{l,n} = \sum_{m=-M}^{M} \left( \sum_{k=0}^{p} a_k m^k - f_{l,n+m} \right)^2

with respect to the polynomial coefficients a_k, where f is the spectrum-based feature vector, l indexes the L coefficients, n indexes the N frames, p is the order of the polynomial, and M determines the number of samples (2M + 1) used to fit the polynomial. We compute the delta features via this local estimate of the derivative of the input data, computed using Savitzky-Golay filtering [20], as we utilize the Librosa [21] package to preprocess the dataset.

III. METHODOLOGY

Voice samples were obtained from a voice clinic in a tertiary hospital (Far Eastern Memorial Hospital, FEMH).
The dataset comprises voice samples of 50 individuals who do not exhibit a pathological abnormality in their speech, and voice samples of 150 individuals who exhibit one of three different voice disorders: vocal nodules, polyps, and cysts (collectively referred to as Phonotrauma); glottis neoplasm; and unilateral vocal paralysis. Table I describes the demographics of the dataset, and Table II provides the number of samples available for each of the three vocal disorders, stratified by gender.

Each voice sample is a recording of a 3-second sustained vowel sound of the letter "A", under background noise of amplitude between 40 and 45 decibels, at a microphone-to-mouth distance of approximately 15-20 centimeters. The recordings were obtained using a Shure SM58 microphone and a Shure X2u digital amplifier, with a sampling rate of 44,100 Hz at 16-bit resolution, and saved in uncompressed waveform-audio format.

TABLE I: DEMOGRAPHICS OF NORMAL AND PATHOLOGICAL SAMPLES

                 Count      Mean Age (Years)   Age Range (Years)   Std. Deviation
Gender           M    F     M      F           M       F           M      F
Normal           20   30    30.9   34.5        22-72   22-72       10.9   12.
Pathological     78   72    53.8   44.6        21-87   22-84       15.3   13.5

TABLE II: DISEASE CATEGORIES OF THE 150 PATHOLOGICAL VOICE SAMPLES

       Neoplasm   Phonotrauma   Vocal Palsy
M      32         13            33
F      8          47            13

A. Data Processing

Each audio waveform is processed to derive MFCC features using the Librosa [21] library, with a sampling rate of 22,050 Hz. We compute the temporal derivatives (deltas) of these MFCC features, of which we compute the mean and maximum across all samples, and concatenate all three vectors into a single vector of size R^{3d}, where d is the number of extracted MFCC coefficients. We select the number of MFCC coefficients (d) to be 15, found via grid search over the space {10, 15, 20, 25, 30, 40, 50, 100}.
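As an illustrative sketch (not the authors' released code), the data-processing step can be approximated with NumPy alone: deltas are estimated by the local least-squares polynomial fit described in Section II-B, and the mean of the MFCCs together with the mean and maximum of the deltas are concatenated into a vector in R^{3d}. The pooling choice and all names here are assumptions; the paper itself computes MFCCs and Savitzky-Golay deltas with Librosa.

```python
import numpy as np

def local_poly_delta(f, M=2, p=2):
    """Estimate temporal derivatives of a (d x n_frames) feature matrix via the
    local least-squares polynomial fit of Section II-B: over each window of
    2M + 1 frames, fit sum_k a_k m^k and take the linear coefficient a_1 as the
    derivative at the window centre. (Librosa's Savitzky-Golay filtering
    computes an equivalent local-fit derivative.)"""
    d, n = f.shape
    padded = np.pad(f, ((0, 0), (M, M)), mode="edge")  # full windows at edges
    m = np.arange(-M, M + 1)
    deltas = np.empty_like(f, dtype=float)
    for i in range(d):
        for t in range(n):
            window = padded[i, t:t + 2 * M + 1]
            # np.polyfit returns coefficients from highest to lowest degree,
            # so the linear term a_1 is the second-to-last entry.
            deltas[i, t] = np.polyfit(m, window, deg=p)[-2]
    return deltas

def assemble_feature_vector(mfcc):
    """Concatenate the mean MFCCs with the mean and maximum of the deltas into
    a single vector in R^{3d}. This pooling choice is one plausible reading of
    the text, not a confirmed detail of the authors' implementation."""
    delta = local_poly_delta(mfcc)
    return np.concatenate(
        [mfcc.mean(axis=1), delta.mean(axis=1), delta.max(axis=1)])
```

With d = 15 coefficients, as selected by the grid search, the assembled vector has 45 dimensions, matching the input size of the proposed model.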
When assessing the performance of any model, we compute a weighted average of: the sensitivity and specificity of the model on the binary task of predicting whether the sample is normal or pathological, and the average recall of the model on the multi-class objective of classifying the sample as one of four classes. We assign the weights for the above scores as 40%, 20%, and 40%, respectively.

B. Hyperparameter Tuning

As the dataset comprises a mere 200 samples, we perform 5-fold cross validation for every model, using the same global seed across all models. We utilize the sequential halving and classification (SHAC) algorithm from Kumar et al. [16] as an efficient alternative to exhaustive grid search for sampling hyperparameter settings in continuous search spaces. To avoid overfitting the data distribution in the 5 folds used for evaluation of the models, we train and evaluate the SHAC algorithm on 5 different folds of the dataset using a different random seed. This preserves the generalization properties of the parameters sampled from the SHAC algorithm while also providing robustness to random-seed overfitting. For better robustness, we round off all floating-point values to the 3rd decimal place, and find that performance is not impacted.

Fig. 1. Proposed model pipeline: Input -> MFCC and MFCC deltas -> concatenation -> Random Forest feature selection -> SVM classifier, with SHAC hyperparameter tuning.

When training the classifiers of the SHAC algorithm, we compute a maximum of 10 classifiers, with a batch of 100 hyperparameter samples per classifier and a total budget of 1000 hyperparameter samples evaluated. Note that once we obtain a sample from the SHAC algorithm, we use the same parameters for all 5 folds. The sampled parameters are therefore more robust and generalize better, as they must perform well across all 5 folds.
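The weighted evaluation score defined at the start of this section can be written down directly; a minimal sketch, with names of our own choosing:

```python
def weighted_score(sensitivity, specificity, avg_recall):
    """Weighted evaluation score: 40% binary sensitivity, 20% binary
    specificity, and 40% unweighted average recall over the four classes
    (normal plus the three disorders)."""
    return 0.4 * sensitivity + 0.2 * specificity + 0.4 * avg_recall
```

Plugging in the proposed model's row of Table III, weighted_score(0.8860, 0.7823, 0.5900) evaluates to about 0.7469, matching the reported weighted score.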
We then obtain 100 hyperparameter samples, compute the mean and standard deviation of this batch, and select those candidates which obtain a score greater than the mean plus one standard deviation. We use a publicly available implementation of the SHAC algorithm (https://github.com/titu1994/pyshac).

C. Proposed Model

Our proposed model comprises a feature selection stage followed by a multi-class classification stage. We utilize a set of Random Forest models, each trained on the training data of a given fold, to compute the feature importance of the 45-dimensional input vector. We then determine a threshold value sampled from the SHAC algorithm, which is used to select only those features whose importance is above that threshold. These features are then supplied to a kernel Support Vector Machine with a Gaussian Radial Basis Function kernel. We utilize the One-vs-One strategy for multi-class classification, which builds N(N - 1)/2 classifiers. Such a strategy is applicable here due to the small amount of data available and the relatively fast training, but we observe that the One-vs-All strategy is nearly identical in performance. We utilize the scikit-learn [22] package for all models described in this section. Figure 1 details the various stages of the proposed model pipeline.

We construct a search space consisting of the number of trees in the Random Forest {10, 20, 50, 100}, the depth of the tree {3, 4, 5, 6, 7, 8, no limit}, the selection threshold (U(0, 0.5)), the penalty parameter C for the SVM (U(0, 25)), and the gamma parameter for the RBF kernel (U(-1, 1)), where we resolve negative values to 1/(number of features). We then search over this space using the SHAC algorithm and sample its best parameters as described in Section III-B.
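A minimal scikit-learn sketch of this two-stage pipeline follows; the hyperparameter values are placeholders drawn from the stated search ranges rather than the SHAC-tuned values, and the synthetic data merely stands in for the 45-dimensional feature vectors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    # Stage 1: keep only features whose Random Forest importance exceeds
    # the sampled selection threshold.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
        threshold=0.02)),
    # Stage 2: RBF-kernel SVM; scikit-learn's SVC already uses the
    # One-vs-One strategy internally for multi-class problems.
    # gamma="auto" is 1/(number of features), mirroring how negative
    # sampled gamma values are resolved in the search space.
    ("svm", SVC(kernel="rbf", C=10.0, gamma="auto")),
])

# Tiny synthetic stand-in for the 45-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 45))
y = rng.integers(0, 4, size=200)   # 4 classes: normal + 3 disorders
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

In practice the SHAC-sampled threshold, C, and gamma would replace the placeholders above, with one such pipeline fit per fold.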
D. Baseline Models

All baseline models utilize the same input features and 5-fold training as described for the proposed model, to ensure a consistent training methodology. Random Forest based feature selection is performed, but the hyperparameters for the baseline models are also searched via SHAC to ensure we obtain unbiased hyperparameters. The search spaces for the Random Forest and the selection threshold remain consistent across all models.

1) XGBoost: We compare against XGBoost [23], a powerful gradient boosting tree model that obtains state-of-the-art results on multiple structured and unstructured datasets and is a standard baseline to compete with. While XGBoost has a large number of hyperparameters that can be tuned, we find that only three of these significantly impact the final score, and therefore construct a search space only over those three. We search over the number of estimators {10, 25, 50, 100, 200}, the maximum depth of the tree {3, 4, 5, 6, 7, 8}, and the learning rate (U(0.01, 0.2)).

2) Long Short-Term Memory Fully Convolutional Networks: We also compare this dataset on a hybrid deep neural network, the Long Short-Term Memory Fully Convolutional Network (LSTM-FCN) [17]. It comprises two branches: one with convolutional blocks, each consisting of a convolutional layer followed by Batch Normalization [24] and the ReLU activation function; the other comprising a Long Short-Term Memory recurrent layer [25] followed by a dropout layer [26] (with a probability of 80%). We utilize the same FCN and LSTM branch structure provided by Karim et al. to be consistent, and only modify the number of LSTM cells in the LSTM branch, which we find using grid search [17]. All aspects of initialization and training methodology are kept consistent with the paper to provide the best results.
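The baseline search spaces can be transcribed directly from the description above; as a sketch, here is the XGBoost space with plain random sampling standing in for SHAC's classifier-guided sampling (function and variable names are ours):

```python
import random

# XGBoost search space as described in the text.
XGB_SPACE = {
    "n_estimators": [10, 25, 50, 100, 200],
    "max_depth": [3, 4, 5, 6, 7, 8],
    "learning_rate": (0.01, 0.2),   # uniform range
}

def sample_xgb_params(rng=random):
    """Draw one hyperparameter setting from the space. SHAC would
    additionally filter such samples through its cascade of classifiers
    before evaluating them."""
    return {
        "n_estimators": rng.choice(XGB_SPACE["n_estimators"]),
        "max_depth": rng.choice(XGB_SPACE["max_depth"]),
        # Round to 3 decimals, as the authors do for robustness.
        "learning_rate": round(rng.uniform(*XGB_SPACE["learning_rate"]), 3),
    }
```

Each sampled setting would then be trained and scored identically on all 5 folds, consistent with the tuning procedure in Section III-B.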
IV. RESULTS

Due to the lack of a distinct test set, we instead discuss the 5-fold cross validation score obtained by each of the models, and how we selected the best model to submit to the FEMH 2018 Challenge. As discussed in Section III-B, we use the SHAC algorithm to sample parameter settings that it determines can obtain strong scores on the overall validation sets of the 5 folds. For each of the models, out of the 100 hyperparameter settings sampled, we found one or more settings which scored above the mean + 1 standard deviation threshold we had set. We then train and evaluate each of these candidate models and rank them. We find that we need to train fewer than 3 candidates in all cases, as the candidates all score very close to each other. All code employed in this paper is available online (https://github.com/houshd/FEMH).

The scores of each of these models are provided in Table III. The final column contains the weighted score of the sensitivity, specificity, and recall scores with weights 40%, 20%, and 40%, respectively.

TABLE III: 5-FOLD CROSS VALIDATION SCORES

Model            Sensitivity   Specificity   Recall    Score     Std. Dev
Proposed Model   0.8860        0.7823        0.5900    0.7469    0.0160
XGBoost          0.8747        0.7561        0.6150    0.7470    0.0710
LSTM-FCN         0.8539        0.6624        0.5550    0.6960    0.0401

Upon inspection of the results, we find that the proposed model and XGBoost perform quite similarly. However, the standard deviation of the XGBoost model is much higher than that of the proposed model, albeit with a marginally higher mean score. It is for this reason that we state that the proposed model is the overall winner: it scores marginally higher in sensitivity and specificity, and has a lower standard deviation in the weighted score. We also inspect the reason for the significantly lower performance of the LSTM-FCN.
We find that the extremely small amount of data provided to such a large neural network, possessing several hundred thousand parameters, allows the model to significantly overfit each of the subsets of the 5 folds during training and consequently receive a much lower generalization score. To see whether this was indeed the case, we also train the model with a mere fraction of its original parameters and find that, although the overall score improves, the improvement is not significant enough to be comparable to the proposed model.

V. CONCLUSION

In this study, we present a model that classifies vocal disorders using a range of voice data samples obtained from the 2018 FEMH Voice Disorder Challenge. The proposed model extracts MFCC and MFCC delta features from raw input signals and applies them to an SVM classifier. We utilize the SHAC algorithm to tune the parameters of the classifier for optimal performance. The proposed model performs similarly to, and in one case outperforms, more complicated models such as XGBoost and LSTM-FCN. We recommend using this approach to provide a fast yet simple baseline for diagnosing vocal disorders in clinical practice. We leave the extension of this model to automatically diagnose larynx cancer as future work.

ACKNOWLEDGMENT

We would like to thank Far Eastern Memorial Hospital for donating the dataset and the organizers of the 2018 FEMH challenge for providing valuable feedback. This publication was supported by Grant or Cooperative Agreement Number T42OH008672, funded by the Centers for Disease Control and Prevention. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention or the Department of Health and Human Services.

REFERENCES

[1] Shih-Hau Fang, Yu Tsao, Min-Jing Hsiao, Ji-Ying Chen, Ying-Hui Lai, Feng-Chuan Lin, and Chi-Te Wang.
Detection of pathological voice using cepstrum vectors: A deep learning approach. Journal of Voice, 2018.
[2] Nizhoni Denipah, Christopher M Dominguez, Erik P Kraai, Tania L Kraai, Paul Leos, and Darren Braude. Acute management of paradoxical vocal fold motion (vocal cord dysfunction). Annals of Emergency Medicine, 69(1):18-23, 2017.
[3] John George Karippacheril, Goneppanavar Umesh, and Venkateswaran Ramkumar. Inexpensive video-laryngoscopy guided intubation using a personal computer: initial experience of a novel technique. Journal of Clinical Monitoring and Computing, 28(3):261-264, 2014.
[4] Juan Ignacio Godino-Llorente, Pedro Gomez-Vilda, and Manuel Blanco-Velasco. Dimensionality reduction of a pathological voice quality assessment system based on gaussian mixture models and short-term cepstral parameters. IEEE Transactions on Biomedical Engineering, 53(10):1943-1953, 2006.
[5] Ghulam Muhammad, Mansour Alsulaiman, Zulfiqar Ali, Tamer A Mesallam, Mohamed Farahat, Khalid H Malki, Ahmed Al-nasheri, and Mohamed A Bencherif. Voice pathology detection using interlaced derivative pattern on glottal source excitation. Biomedical Signal Processing and Control, 31:156-164, 2017.
[6] Michael HL Hecker and E James Kreul. Descriptions of the speech of patients with cancer of the vocal folds. Part I: Measures of fundamental frequency. The Journal of the Acoustical Society of America, 49(4B):1275-1282, 1971.
[7] Karthikeyan Umapathy, Sridhar Krishnan, Vijay Parsa, and Donald G Jamieson. Discrimination of pathological voices using a time-frequency approach. IEEE Transactions on Biomedical Engineering, 52(3):421-430, 2005.
[8] Vijay Parsa and Donald G Jamieson. Identification of pathological voices using glottal noise measures. Journal of Speech, Language, and Hearing Research, 43(2):469-485, 2000.
[9] Fethi Amara and Mohamed Fezari. Voice pathologies classification using gmm and svm classifiers.
Recent Advances in Biology, Medical Physics, Medical Chemistry, Biochemistry and Biomedical Engineering, page 65, 2013.
[10] Max A Little, Patrick E McSharry, Stephen J Roberts, Declan AE Costello, and Irene M Moroz. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. BioMedical Engineering OnLine, 6(1):23, 2007.
[11] Julián David Arias-Londoño, Juan I Godino-Llorente, Maria Markaki, and Yannis Stylianou. On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices. Logopedics Phoniatrics Vocology, 36(2):60-69, 2011.
[12] Meisam Khalil Arjmandi and Mohammad Pooyan. An optimum algorithm in pathological voice quality assessment using wavelet-packet-based features, linear discriminant analysis and support vector machine. Biomedical Signal Processing and Control, 7(1):3-19, 2012.
[13] Ghulam Muhammad and Moutasem Melhem. Pathological voice detection and binary classification using mpeg-7 audio features. Biomedical Signal Processing and Control, 11:1-9, 2014.
[14] Pavol Harar, Jesus B Alonso-Hernandezy, Jiri Mekyska, Zoltan Galaz, Radim Burget, and Zdenek Smekal. Voice pathology detection using deep learning: a preliminary study. In Bioinspired Intelligence (IWOBI), 2017 International Conference and Workshop on, pages 1-4. IEEE, 2017.
[15] Musaed Alhussein and Ghulam Muhammad. Voice pathology detection using deep learning on mobile healthcare framework. IEEE Access, 6:41034-41041, 2018.
[16] Manoj Kumar, George E Dahl, Vijay Vasudevan, and Mohammad Norouzi. Parallel architecture and hyperparameter search via successive halving and classification. arXiv preprint, 2018.
[17] Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Shun Chen. LSTM fully convolutional networks for time series classification. IEEE Access, 6:1662-1669, 2018.
[18] Chi-Te Wang, Feng-Chuan Lin, Yu Tsao, and Shih-Hau Fang.
2018 FEMH Voice Data Challenge, May 2018. https://femh-challenge2018.weebly.com/.
[19] Homayoon Beigi. Speaker Recognition. Springer, 2011.
[20] S. R. Krishnan, M. Magimai.-Doss, and C. S. Seelamantula. A Savitzky-Golay filtering perspective of dynamic feature computation. IEEE Signal Processing Letters, 20(3):281-284, March 2013.
[21] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, pages 18-25, 2015.
[22] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[23] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-794. ACM, 2016.
[24] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.