A text-independent speaker verification model: A comparative analysis


Rishi Charan, Manisha.A, Karthik.R, Rajesh Kumar M (Senior IEEE Member)
School of Electronic Engineering, VIT University, Tamil Nadu, India
rishicharan96@gmail.com, manishaab26@gmail.com, mrajeshkumar@vit.ac.in

Abstract — The most pressing challenge in the field of voice biometrics is selecting the most efficient technique of speaker recognition. Every individual's voice is peculiar; factors like physical differences in vocal organs, accent and pronunciation contribute to the problem's complexity. In this paper, we explore the various methods available in each block of the speaker recognition process, with the objective of identifying the techniques that yield the most precise results. We study the results on text-independent corpora. We use the MFCC (Mel-frequency cepstral coefficient), LPCC (linear predictive cepstral coefficient) and PLP (perceptual linear prediction) algorithms for feature extraction, PCA (principal component analysis) and t-SNE for dimensionality reduction, and SVM (support vector machine), feed-forward network, nearest neighbour and decision tree algorithms for the classification block, and comparatively analyze each block to determine the best technique.

Index Terms — MFCC (Mel-frequency cepstral coefficient), LPCC (linear predictive cepstral coefficient), PCA (principal component analysis), t-SNE.

I. INTRODUCTION

Speaker verification is considered one of the essential biometric methods for assuring identity in numerous real-world applications [1]. Speaker recognition is identifying an individual's voice from a set of potential speakers, while verification is confirming a speaker's identity as the original speaker or as a trespasser who could be trying to intrude.
In this paper, speaker identification is the area of interest. The speaker identification technique has three main operations: feature extraction, dimensionality reduction and classification. Feature extraction: the voice signal is converted into a set of 12-15 features or feature vectors for further processing in the model. Dimensionality reduction: this module lowers the dimensions of the extracted feature set, which makes implementation of the classification techniques easier. Classification: this module is useful in multi-speaker recognition problems; the given voice signal is segmented into equal-length voice segments and labels are assigned to identify the speaker.

Md Rabiul Islam et al. [2] have already worked on speaker identification using cepstral features and PCA for classification. An enhanced study was done by Muda, L. et al. [3] on MFCC and dynamic time warping techniques to obtain better performance. Urmila Shrawankar et al. and M. J. Alam et al. [4, 5, 6] have also done extensive analyses of feature extraction methods like MFCC, PLP, FFT, LPC and LPCC. Our paper aims to bring out a comparative analysis of each module and to determine the most efficient combination of algorithms that could be used to obtain a reliable outcome.

Feature extraction is one of the most widely researched areas in speaker recognition. State-of-the-art methods like MFCC and hidden Markov models have been studied extensively for more than a decade now. In this paper, we have implemented three different feature extraction techniques, namely MFCC, LPCC and PLP. In the dimensionality reduction module, our work focuses on two popular techniques: PCA and t-SNE (t-distributed stochastic neighbour embedding). The last module compares different classifiers: nearest neighbour, SVM, feed-forward network and decision tree.
The results of each module are compared individually as well as sequentially to decipher the best way to recognize a speaker.

II. RELATED WORK

Fig. 1: Block diagram of the speaker verification model (pre-processing → feature extraction → dimensionality reduction → classifier).

A. Pre-Processing:

In pre-processing, we remove as much of the silence present in the signal as possible. To achieve this, we use the probability density function of the signal to remove the noise and silence parts and to find the endpoints of the signal. Usually the first 200 ms of any recorded voice signal corresponds to silence, as there is always a time gap between the point where recording starts and the point where the speaker starts talking; this gap is habitually a minimum of 200 ms. A one-dimensional Gaussian distribution has 68% of its probability mass in the range |u| ≤ 1, 95% in the range |u| ≤ 2, and 99.7% in the range |u| ≤ 3, where u is defined as:

u = (x − µ) / σ   (1)

where µ and σ are the mean and standard deviation of the first 200 ms of the speech signal. The algorithm in figure 1.3 was used to discriminate the voiced part of the signal from the unvoiced part. Pre-processing of the speech signal serves various purposes in any speech processing application; it includes noise removal, endpoint detection, etc. Figures 1.1 and 1.2 show the input and output of the pre-processing block.
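The endpoint-detection procedure described above can be sketched as follows (a minimal sketch assuming numpy; the 200 ms reference, z > 3 threshold and 10 ms majority-vote windows follow the text, while the function name and signature are illustrative):

```python
import numpy as np

def remove_silence(signal, sr, win_ms=10, z_thresh=3.0):
    """Endpoint detection by z-scoring against the first 200 ms (assumed silence)."""
    ref = signal[: int(0.2 * sr)]            # first 200 ms of the recording
    mu, sigma = ref.mean(), ref.std()
    z = np.abs((signal - mu) / sigma)        # z-score of every sample (eq. 1)
    envelope = (z > z_thresh).astype(int)    # 1 = voiced, 0 = unvoiced

    # Label each non-overlapping 10 ms window by the majority of its samples
    win = int(win_ms * 1e-3 * sr)
    kept = []
    for i in range(len(signal) // win):
        if envelope[i * win : (i + 1) * win].mean() > 0.5:   # majority voiced
            kept.append(signal[i * win : (i + 1) * win])
    return np.concatenate(kept) if kept else signal[:0]
```

The majority vote over short windows keeps isolated noisy samples in the silent region from being retained as speech.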
Figure 1.1: Input from the microphone. Figure 1.2: Output signal of pre-processing.

Figure 1.3: Algorithm used for endpoint detection and silence removal (pre-processing):
1. Read the first 200 ms of samples and compute µ and σ.
2. Compute the z-score z of each sample.
3. If z > 3, the sample is voiced: set its envelope value to 1; otherwise it is unvoiced: set its envelope value to 0.
4. Divide the resulting array of 1s and 0s into 10 ms non-overlapping windows.
5. Label the samples in each window all 1s or all 0s according to their majority in the window.
6. Retrieve the voiced part by selecting the windows which consist only of ones.

B. Feature Extraction:

The voice algorithms consist of two parallel paths. The first is the training session, in which we feed the voice signals along with their identities to the algorithm so that the extracted features can be categorized; the second is testing, which is used for identification of the individual. In voice identification, feature extraction plays an important role in extracting, from the information-rich voice signal, the features that can be used to identify the speaker among a group of N speakers. We use the MFCC, LPCC and PLP techniques to extract short-term spectral features, which will be compared to find the best possible extraction method for different applications. Voice signals are non-stationary over long durations but can be treated as stationary over short durations of 20-25 ms; we use these techniques to extract these stationary features [8].

1. Mel Frequency Cepstral Coefficients (MFCC):

Figure 2.1: Complete pipeline of MFCC (pre-emphasis → framing → windowing → FFT → Mel frequency wrapping → log → IDFT).

MFCC uses an all-zero model for computing spectra.
The output of the pre-processing block is taken as input to the feature extraction stage, where pre-emphasis is applied to the signal to increase its energy at higher frequencies; it also removes any DC offset present in the signal. The transfer function of this step is:

y[n] = x[n] − a·x[n−1]   (2)

where the value of a lies in [0.9, 1]. Figures 2.1.1 and 2.1.2 show the FFT of the given audio signal, where the x-axis represents frequency and the y-axis represents amplitude, when a equals 1 and 0.9 respectively; from them we can see how the spectrum of the speaker, and in particular the signal strength at higher frequencies, changes as a is varied from 0.9 to 1.

Figure 2.1.1: FFT of the signal when a = 1. Figure 2.1.2: FFT of the signal when a = 0.9.

The signal is then divided into short frames with a duration of 20-25 ms, as the voice signal is considered to have stationary features over short periods of time, with each frame having an overlap of 50-80% with its neighbours. We use a window function to decrease the strength of the samples at the ends of each frame. Commonly used windows are the Hamming, Hanning, Blackman, rectangular and triangular windows. We consider the Hamming window for the windowing process:

w(n) = 0.54 − 0.46 · cos(2πn / (N − 1))   (3)

The FFT is computed for the individual frames and the frames are passed through a Mel frequency filter bank:

Mel(f) = 1125 · ln(1 + f/700)   (4)

In Mel frequency wrapping we multiply the FFT of the frames by their Mel filter bank values. The logarithm is then applied, as it compresses the dynamic range of values; human response to signal level is logarithmic. The log filter bank output is sent to the IDCT (inverse discrete cosine transformation) to get the desired number of features. Figure 2.1.3 shows the features for all the frames.
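The MFCC pipeline above (equations 2-4 followed by the log and DCT) can be sketched as below, assuming numpy and scipy; frame length, hop, filter-bank size and coefficient count are illustrative defaults, not values fixed by the paper:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr, a=0.97, frame_ms=25, hop_ms=10, n_mels=26, n_ceps=13):
    """Minimal MFCC: pre-emphasis, framing, Hamming, FFT, Mel bank, log, DCT."""
    # Pre-emphasis: y[n] = x[n] - a*x[n-1]  (eq. 2)
    sig = np.append(signal[0], signal[1:] - a * signal[:-1])

    # Framing with overlap, then Hamming window (eq. 3)
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(sig) - flen) // hop)
    frames = np.stack([sig[i * hop : i * hop + flen] for i in range(n_frames)])
    frames *= np.hamming(flen)

    # Power spectrum of each frame
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

    # Triangular Mel filter bank, Mel(f) = 1125*ln(1 + f/700)  (eq. 4)
    mel_pts = np.linspace(0, 1125 * np.log(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (np.exp(mel_pts / 1125) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filter bank energies, then DCT to keep the first coefficients
    feats = np.log(power @ fbank.T + 1e-10)
    return dct(feats, type=2, axis=1, norm='ortho')[:, :n_ceps]
```

Each row of the returned matrix is the cepstral feature vector of one 25 ms frame, matching the per-frame features plotted in figure 2.1.3.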
Figure 2.1.3: MFCC features of the signal; the x-axis is the frame number [3, 4, 5].

2. Linear Prediction Cepstral Coefficients (LPCC):

Figure 2.2.1: Complete pipeline of LPCC (pre-emphasis → framing → windowing → autocorrelation analysis → LPC analysis → LPC parameter conversion → feature extraction).

In LPCC we use the all-pole (maximum entropy, or autoregression) model to calculate the spectra, the counterpart of the all-zero model used in MFCC extraction. First the linear predictive coding (LPC) [9] coefficients are found, and then they are converted to cepstral coefficients. LPCC is also a well-known algorithm, widely used to extract features from the speech signal. The LPC parameters effectively describe the energy and frequency spectrum of voiced frames. Because the cepstrum is derived from the logarithm of the original spectrum, which restrains fast changes in the frequency spectrum, the result is more centralized and better suited to short-time characteristics; this underpins the spectral analysis, modelling and pattern recognition of acoustic signals. LPC-derived cepstral coefficients (LPCC) and their regression coefficients are among the most common short-term spectral measurements in current use. The order Q of the autoregression model used for computation of the LPC is the number of concentric cylinders used to model the vocal tract, where 8 ≤ Q ≤ 16. Figure 2.2.1 shows the LPCC feature extraction algorithm; the LPCC features extracted from the audio are shown in figure 2.2.2, with frames along the x-axis [4, 5, 6, 7].

Figure 2.2.2: LPCC features of the audio signal; the x-axis is frames.

3. Perceptual Linear Prediction (PLP):

Figure 2.3.1: Complete pipeline of PLP (pre-emphasis → framing → windowing → FFT → Bark filter bank → equal-loudness pre-emphasis → intensity-loudness conversion → linear prediction → cepstrum computation).

PLP follows the same procedure as MFCC feature extraction up to the fast Fourier transform (FFT); it combines concepts of LPCC and MFCC for computing the coefficients. PLP uses the Bark scale instead of the Mel scale used in MFCC feature extraction:

Bark(f) = 6 · ln( f/600 + √( (f/600)² + 1 ) )   (5)

The next step, equal-loudness pre-emphasis, applies pre-emphasis in the spirit of the equal-loudness curves; it is a process to normalize the different loudness levels in the voice frames. The intensity-loudness power is then found from the output of the equal-loudness pre-emphasis by taking its cubic root [4, 5, 7]. Up to this point we have used concepts from MFCC; from here we use the second half of LPCC: finding the LPC coefficients and converting the spectrum to the cepstrum. Figure 2.3.2 shows the extracted PLP features of the audio signal [10].

Figure 2.3.2: PLP features extracted for the audio signal; the x-axis is frames.

C. Dimensionality Reduction:

From a theoretical point of view, more features should mean better performance, but in practice the performance of the system degrades as the number of features grows, so we use dimensionality reduction techniques to improve the performance of the algorithm. Dimensionality reduction implies information loss, so our main objective in choosing a dimensionality reduction technique is to preserve as much information as possible while reducing the dimension of the voice features. In this paper, we use two dimensionality reduction techniques.

1. Principal Component Analysis (PCA):

A data matrix of n features and m dimensions, which may be correlated, can be converted into a matrix of q features along uncorrelated axes.
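A minimal sketch of this conversion, assuming numpy (the function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def pca_reduce(X, q):
    """Project an (n_samples, m_features) matrix onto its top-q principal axes."""
    Xc = X - X.mean(axis=0)               # mean-adjust each feature
    S = np.cov(Xc, rowvar=False)          # covariance (cross-product) matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigen-analysis: |S - lambda*I| = 0
    order = np.argsort(eigvals)[::-1]     # sort axes by descending variance
    W = eigvecs[:, order[:q]]             # top-q eigenvectors = principal axes
    return Xc @ W                         # mean-adjusted data on the new axes
```

The projected coordinates are uncorrelated across the new axes, which is the defining property of the principal axes described next.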
The objective of PCA is to rigidly rotate the axes of this m-dimensional space to a new position (the principal axes) such that the principal axes are ordered by decreasing variance and the covariances between axes are zero. The principal axes can be found by eigen-analysis of the cross-product matrix S:

| S − λI | = 0   (6)

Each eigenvalue λ is the variance of the coordinates on the corresponding principal component axis. The eigenvector with the highest eigenvalue is the principal component; it captures the most significant relation between the variables and the dimensions.

Final Data = Row Feature Vector × Row Data Adjust   (7)

Here Row Feature Vector is the matrix with the eigenvectors transposed into its rows, with the most significant eigenvector at the top, and Row Data Adjust is the mean-adjusted data transposed, i.e., the data items are in the columns, with each row holding a separate dimension [2].

2. t-Distributed Stochastic Neighbor Embedding (t-SNE):

t-SNE is a dimensionality reduction technique which tries to convert nearby data points into clusters and sends points that are beyond a threshold to a very far distance. Let x_i denote the data points in the high-dimensional space and y_i the corresponding points in the low-dimensional space. We first find the conditional probability between points:

p_{j|i} = exp(−‖x_i − x_j‖² / 2σ_i²) / Σ_{k≠i} exp(−‖x_i − x_k‖² / 2σ_i²)   (8)

p_{ij} = (p_{j|i} + p_{i|j}) / 2n   (9)

where ‖x_i − x_j‖ is the distance between features i and j, and σ_i is the variance of the Gaussian centered at feature i. The conditional probability of the low-dimensional features is represented analogously as q_{j|i}, which we choose in such a way that the resulting cost function is minimized:

q_{j|i} = exp(−‖y_i − y_j‖²) / Σ_{k≠i} exp(−‖y_i − y_k‖²)   (10)

C = Σ_i KL(P_i ‖ Q_i) = Σ_i Σ_j p_{j|i} log( p_{j|i} / q_{j|i} )   (11)

∂C/∂y_i = 2 Σ_j ( p_{j|i} − q_{j|i} + p_{i|j} − q_{i|j} ) ( y_i − y_j )   (12)

D. Classifier:

Classification is the machine learning task of identifying the class to which an instance belongs. We compare four classifiers for our speaker identification.

1. Feed-Forward Neural Network:
The feed-forward neural network we used consists of 2 hidden layers. Each hidden layer consists of a large number of units; each unit is connected to all the units in the next layer, but units within a layer are not interconnected. Each unit in the hidden layers carries a value called its weight.

Figure 2.4: Feed-forward network (input → hidden layers → output).

2. Support Vector Machine:
This is a supervised machine learning algorithm. SVM can be used for classification and regression analysis; however, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. We then perform classification by finding the hyperplane that best differentiates the two classes. Support vectors are simply the coordinates of the individual observations closest to this frontier; the SVM hyperplane (or line) is the frontier which best segregates the two classes.

3. Decision Tree:
Decision trees, or classification and regression trees, predict responses to the given data. To predict a response, we follow the decisions in the tree from the root (beginning) node down to a leaf node; the leaf node holds the response. Classification trees give categorical responses, such as 'true' or 'false'; regression trees give numeric responses.

4. K-Nearest Neighbors:
This could be called a straightforward extension of 1-NN.
We find the k nearest neighbours and do a majority vote. Classically, k is odd when the number of classes is 2. A very popular variant is weighted k-NN, where each point has a weight, typically calculated from its distance; this means that neighbouring points have a higher vote than farther points. Accuracy may increase as k increases, but the computation cost also increases.

III. RESULTS AND DISCUSSION

The main goal of our project is to implement and compare different techniques in the conventional speaker recognition system and highlight the best algorithm that could be used to get efficient results. The main challenge in our project was the number of samples. According to our literature survey, researchers had typically used as few as 3-4 samples to implement each algorithm. In this project, we have used 15 speakers with three samples each, and we have attempted to use 40-45 samples for checking each algorithm's performance, which also tells us how the performance degrades as the number of samples increases. It gives us an insight into how an algorithm that proves to be the best for fewer samples can turn out to be not very efficient as the number of speakers increases. We have tried to implement the above-mentioned algorithms in each block in every combination and permutation possible, to give a big picture of how the techniques could be exploited with a varied number of speakers and varied requirements. These results also tell us how the performance varies when the number of input speakers decreases while the best method chosen from the first set is used. So this paper, on the whole, extensively investigates every variation that could lead to a different performance.
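The evaluation described above — each feature set crossed with a reducer and a set of classifiers — can be sketched with scikit-learn (an assumption on our part; the paper does not name its tooling, and the feature matrix here is a random stand-in for real MFCC/LPCC/PLP frames):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Stand-in feature matrix: rows = frames, columns = cepstral features,
# y = speaker identity of each frame.
rng = np.random.default_rng(0)
n_speakers, frames_per_speaker, n_feats = 7, 40, 13
X = np.vstack([rng.normal(loc=s, scale=1.0, size=(frames_per_speaker, n_feats))
               for s in range(n_speakers)])
y = np.repeat(np.arange(n_speakers), frames_per_speaker)

# Dimensionality reduction (t-SNE has no transform(); reduce first, then split)
X2 = TSNE(n_components=2, random_state=0).fit_transform(X)
Xtr, Xte, ytr, yte = train_test_split(X2, y, test_size=0.3,
                                      random_state=0, stratify=y)

classifiers = {
    "weighted kNN": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "SVM": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
scores = {name: clf.fit(Xtr, ytr).score(Xte, yte)
          for name, clf in classifiers.items()}
```

Frame-level accuracy per combination, as in the tables below, is then just `scores` collected over each feature-extraction/reduction pairing.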
We first compare the performance of each combination, along with the number of distinguishable speakers, to find the best possible combination for speaker identification. The combinations used in the first set are all three feature extraction methods with t-SNE and with PCA. Table 3.1 shows the performance, in percentage, of each combination for 7 speakers while using t-SNE for dimensionality reduction.

Table 3.1: Frame-level performance (%) of each combination for 7 speakers (t-SNE):

t-SNE                        MFCC (%)   LPCC (%)   PLP (%)
Complex Discrete Tree          51.2       33.4      44
Weighted Nearest Neighbor      68.9       51.3      66.3
Fine SVM                       57.9       38        52.8
Feed Forward                   51.5       47        50.3
Bagged Trees Ensemble          67.4       47.3      66.2

The table above shows the frame-level performance for the 7 speakers used for classification. Table 3.2 shows the number of distinguishable speakers among the set of 7 speakers using this combination (t-SNE for dimensionality reduction).

Table 3.2: Distinguishable speakers among 7 speakers (t-SNE):

t-SNE                        MFCC   LPCC   PLP
Complex Discrete Tree          4      2     3
Weighted Nearest Neighbor      7      5     7
Fine SVM                       6      2     6
Feed Forward                   4      3     5
Bagged Trees Ensemble          7      5     7

Tables 3.3 and 3.4 show, respectively, the performance in percentage of each combination for 7 speakers and the number of distinguishable speakers among the chosen set, while using PCA for dimensionality reduction.

Table 3.3: Frame-level performance (%) when PCA is used:

PCA                          MFCC (%)   LPCC (%)   PLP (%)
Complex Discrete Tree          20.2       22.1      24
Weighted Nearest Neighbor      17.1       25        26.4
Fine SVM                       23         18.5      22
Feed Forward                   18         17.5      20
Bagged Trees Ensemble          13.7       18.6      22.4

Table 3.4: Distinguishable speakers among 7 speakers (PCA):

PCA                          MFCC   LPCC   PLP
Complex Discrete Tree          2      1     2
Weighted Nearest Neighbor      1      1     2
Fine SVM                       1      1     1
Feed Forward                   2      1     1
Bagged Trees Ensemble          1      1     1

From the above tables, we clearly infer that t-SNE gives better performance than PCA as the number of speakers increases.
As already mentioned, the literature survey shows that PCA gives better results when the number of speakers involved is considerably smaller. From the above tables, we can also interpret that t-SNE combined with the weighted nearest neighbour is the best possible combination for speaker verification. Figure 3.1 shows how the frame-level performance changes as the number of speakers increases.

Figure 3.1: Performance vs. number of speakers.

From figure 3.1 we can see that PLP feature extraction with t-SNE dimensionality reduction followed by weighted KNN performs better than MFCC over part of the range; however, for a larger number of speakers the preferred identification combination is MFCC-tSNE-KNN, while for a gadget with a limited number of speakers the best combination in our set is PLP-tSNE-KNN. Figure 3.2 shows the rate of change of performance as the number of speakers increases; the more stable the curve, the better the combination performs as the number of speakers changes.

Figure 3.2: Rate of change of performance versus number of speakers.

From figure 3.2, we can see that the rate of change in performance of PLP increases, while the rate of change is minimal for MFCC; so for an unbounded number of users the performance is more stable with MFCC than with PLP feature extraction. From the results shown in the tables, PLP-tSNE-KNN gives the best performance among the combinations for smaller speaker sets, so we implemented this combination for different sets of speakers. Table 3.5 shows how the performance varies with the number of users.
Table 3.5: Performance (%) of the t-SNE + weighted K-NN combination for different numbers of speakers:

Speakers      2      3      4      5      6      7
MFCC-KNN    87.2   81.2   74.5   69.7   67.1   66.9
LPCC-KNN    81.3   70     60.2   54.9   53.1   49.1
PLP-KNN     86.2   79.6   74.6   70.4   68.5   66.3

Table 3.5 shows the performance in percentage when the different feature extraction algorithms are used in combination with t-SNE and weighted KNN for different numbers of speakers. It is observed that the reliability and efficiency of the performance increase as the number of speakers involved in training decreases, as has already been pointed out, reassuring the fact that training a smaller number of speakers is easier. From the table, we also infer that for fewer speakers, e.g. 2 speakers, MFCC and PLP give better performance. Figure 3.6 shows the receiver operating characteristics of seven speakers using MFCC-tSNE-KNN.

Figure 3.6: ROC of 2, 3, 4, 5, 6 and 7 speakers for the combination MFCC-tSNE-KNN.

IV. CONCLUSION

This paper is a comprehensive study of currently available algorithms for a speaker verification system. Our main observation is that the best combination varies depending on the input number of speakers, and the combination that performs best for fewer samples does not always give the best performance with a larger number of samples. It is observed that the best combination for a large set (30 samples of 7 speakers) is MFCC-tSNE-weighted KNN, and for a smaller set (5-10 samples) it is MFCC/PLP-tSNE-weighted KNN. In conclusion, the best combination of algorithms must be chosen depending on the end requirement. Throughout the course of this study, a random dataset of male and female voices was used to train the network; an enhanced study could be done by using either male voices, female voices, or a combination of both in a definite proportion.
REFERENCES
[1] A. Jain, R. Bolle, S. Pankanti, BIOMETRICS: Personal Identification in Networked Society, Kluwer Academic Press, Boston, 1999.
[2] Islam, M. R., & Rahman, M. F. (2010). Noise robust speaker identification using PCA based genetic algorithm. International Journal of Computer Applications, 4(12), 27-31.
[3] Muda, L., Begam, M., & Elamvazuthi, I. (2010). Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083.
[4] Shrawankar, U., & Thakare, V. M. (2013). Techniques for feature extraction in speech recognition system: A comparative study. arXiv preprint arXiv:1305.1145.
[5] Alam, Md Jahangir, et al. "Multitaper MFCC and PLP features for speaker verification using i-vectors." Speech Communication 55.2 (2013): 237-251.
[6] Singh, Lalima. "Speech Signal Analysis using FFT and LPC."
[7] Motlıcek, Petr. Feature extraction in speech coding and recognition. Technical Report of PhD research internship in ASP Group, OGI-OHSU, http://www.fit.vutbr.cz/~motlicek/publi/2002/repogi.pdf, 2002.
[8] Schürer, T. "Comparing different feature extraction methods for telephone speech recognition based on HMM's."
[9] Selvaperumal, Sathish Kumar, et al. "Speech to Text Synthesis from Video Automated Subtitling using Levinson Durbin Method of Linear Predictive Coding." International Journal of Applied Engineering Research 11.4 (2016): 2388-2395.
[10] Hermansky, Hynek. "Perceptual linear predictive (PLP) analysis of speech." The Journal of the Acoustical Society of America 87.4 (1990): 1738-1752.
