Multi-speaker Recognition in Cocktail Party Problem

Yiqian Wang, Wensheng Sun 1
Beijing University of Posts and Telecommunications, Beijing, China
yqwang_c@163.com, sunws@bupt.edu.cn

Abstract. This paper proposes an original statistical decision theory to accomplish a multi-speaker recognition task in the cocktail party problem. The theory relies on the assumption that the varied frequencies of speakers obey a Gaussian distribution and that the relationship between their voiceprints can be represented by Euclidean distance vectors. The paper uses Mel-Frequency Cepstral Coefficients to extract the features of a voice, in order to judge whether a speaker is included in a multi-speaker environment and to distinguish who the speaker is. Finally, a thirteen-dimension constellation drawing is established by mapping from the Manhattan distances of speakers, so that the gross influential factors are taken thoroughly into consideration.

Keywords: Multi-speaker recognition; cocktail party; feature extraction; statistical decision theory.

1 Introduction

The cocktail party problem describes a psycho-acoustic phenomenon due to the masking effect [1]. For instance, a person in the noisy environment of a cocktail party can focus on the speech of one specific person while ignoring the speech of the others. The top-down attention of the person can affect the process, which contains two spheres: speaker recognition and voice filtering [2]. This paper chiefly considers and imitates the first part of the subconscious thinking process, which helps us to locate the sound of a talker. The compound logic of searching out who is the most probable speaker and who is the least possible one is named statistical decision theory in multi-speaker recognition.

In a hypothetical scenario, Mr. Bright wants to find one of his friends at a cocktail party by hearing alone.
Since the timbre of that friend is already known, he needs to match the voice in memory to the mixture of sounds at the party. When some speakers are communicating in the same area, they can be included in one group, so the people participating in the party can be separated into various groups. The task is then simplified into two concrete steps: one is to determine whether a single speech belongs to a group of speakers, and the other is to decide who the owner of the voice is if it has met the former requirement.

1 Wensheng Sun (✉), Beijing University of Posts and Telecommunications, No. 10 Xitucheng Rd., Haidian District, Beijing, China. e-mail: sunws@bupt.edu.cn

In a previous study of speaker recognition, Gaussian mixture models (GMM) and Expectation Maximization (EM) [3] were used to compare two speeches of the same content that differed slightly in that they were recorded at different times. The algorithm reaches an error rate of […] when contrasting pairs of 3.2-second speeches, and the error rate declines as the duration of the speeches increases. Its text-dependent demerit is ascribed to the limitation of Gaussian mixture models, whose number of models must be predefined. Another related study used a joint Mel-Frequency Cepstral Coefficients (MFCCs) and Vector Quantization (VQ) algorithm [4] to deal with a similar case, but the content of each speech could be different, making its final result text-independent. That study compares each pair of single speeches against silent backgrounds, and it reaches an error rate of […]. Nevertheless, neither of them can contrast a single speech with a blend of speeches by various people. Therefore, this paper uses Mel-Frequency Cepstral Coefficients and a statistical decision theory to decide whether a single speech is included in the blend of voices and who the possessor of the single speech is.
Mel-Frequency Cepstral Coefficients help to define the acoustic features of a person by emulating the response of the human auditory system [5]. The statistical decision theory is our original idea, inspired by the book Principles of Communications [6]. In traditional statistical decision theory, the best reception of an M-ary digital modulation signal through an Additive White Gaussian Noise (AWGN) channel is realized by the Maximum A Posteriori probability (MAP) algorithm, which simplifies to the Maximum Likelihood (ML) algorithm, or equivalently the minimum Euclidean distance algorithm, when the prior probabilities are equal. A digital signal passing through an AWGN channel is comparable to a speech in a multi-talker background, so the best reception rate of the digital signal parallels the least judgment error rate in recognition with background noise. This paper also builds a thirteen-dimension constellation drawing based on MFCCs, where each spot represents the gross influence of the diverse frequencies in a speaker's voiceprint. When judging one-second recordings in a three-speaker environment, the error rate is […] if the contents are the same, and […] in the text-independent condition.

2 Recognition Algorithms

The recognition algorithms combine two parts: a classical feature extraction algorithm, MFCCs, and a new statistical decision theory proposed in this paper.

2.1 Feature Extraction

Mel-Frequency Cepstral Coefficients (MFCCs) are representations of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency [7][8]. They concisely imitate the frequency masking effect in the basilar membrane of the human cochlea, where lower-frequency sounds travel farther and are more easily recognized than higher ones.
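As a small illustration of the nonlinear Mel scale described above, the following Python sketch uses the widely quoted approximation m = 2595 log10(1 + f/700). That formula and the helper names `hz_to_mel`, `mel_to_hz` and `mel_spaced_frequencies` are assumptions made for illustration; the paper may rely on a different variant of the scale.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (common approximation, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(f_max_hz, count):
    """Return `count` frequencies from 0 Hz to f_max_hz, equally spaced in mels."""
    top = hz_to_mel(f_max_hz)
    return [mel_to_hz(top * i / (count - 1)) for i in range(count)]


if __name__ == "__main__":
    # Equal steps in mels give ever wider steps in Hz, mirroring the finer
    # resolution of hearing at low frequencies.
    for f in mel_spaced_frequencies(4000.0, 6):
        print(round(f, 1))
```

Equal spacing in mels places more points at low frequencies than at high ones, which is the property a Mel-scale filter bank exploits.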
MFCCs are commonly derived by the following steps [4]:

1) Frame blocking
2) Windowing
3) Fast Fourier transform (FFT)
4) Mel frequency warping
Distinct frequencies are perceived non-linearly, so a Mel-scale filter bank can characterize the precision of the human ear:
[…] (1)
5) Cepstrum
The MFCCs result from a Discrete Cosine Transform (DCT), and […] represents the Mel-scale warping stage:
[…] (2)

After extracting the voice features, the coefficients are obtained: thirteen figures for each signal, standing for the varied proportions of the different frequencies. These coefficients represent the voiceprint of a person, so the more similar two people's voiceprints are, the smaller the Euclidean distance between their Mel coefficients. For instance, participants have recorded sentences, and their MFCCs are: […]. In the […] sentence, the Euclidean distance vector of speaker P and speaker Q is: […].

Fig. 1. Spectrograms and MFCCs when speakers A and B utter the same sentence.

The speakers in the recording data come from various districts of the United States, and each of them has recorded ten sentences, including two duplicate sentences and eight distinct ones [9][10][11]. Of the recordings of the same sentences and of distinct sentences, the first part is easier to recognize, because the features of different speakers show obvious distinctions. The second part takes more effort to distinguish, because the differences in content interfere with extracting the characteristics of multiple talkers. In the example figure (Figure 1), speaker A is female and speaker B is male. As the graphs show, their voice spectrograms and MFCCs are clearly different, so their voiceprints have little similarity.
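The five derivation steps listed above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the frame length, hop size, number of filters, the 2595 log10(1 + f/700) Mel formula, the averaging of coefficients over frames, and the element-wise reading of the "Euclidean distance vector" are all choices made here for illustration, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed Mel formula


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def frame_signal(signal, frame_len, hop):
    # 1) Frame blocking: overlapping fixed-length frames.
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame):
    # 2) Windowing: taper the frame edges with a Hamming window.
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def power_spectrum(frame):
    # 3) Spectrum: naive O(N^2) DFT used in place of the FFT.
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    # 4) Mel warping: triangular filters with centres equally spaced in mels.
    top = hz_to_mel(sample_rate / 2)
    hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * h / sample_rate) for h in hz]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def dct_ii(values, n_coeffs):
    # 5) Cepstrum: DCT-II of the log filter-bank energies.
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]


def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=26, n_coeffs=13):
    """Return thirteen coefficients per signal, averaged over frames (assumption)."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    frames = frame_signal(signal, frame_len, hop)
    total = [0.0] * n_coeffs
    for frame in frames:
        spec = power_spectrum(hamming(frame))
        log_energy = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                      for row in bank]
        for k, c in enumerate(dct_ii(log_energy, n_coeffs)):
            total[k] += c
    return [t / len(frames) for t in total]


def distance_vector(p, q):
    """Per-dimension distance between two MFCC vectors (assumed reading)."""
    return [abs(a - b) for a, b in zip(p, q)]
```

Calling `mfcc` on two recordings and passing the results to `distance_vector` yields thirteen per-dimension gaps, whose mean can then serve as the similarity score discussed in the text.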
2.2 Statistical decision theory

1) The initial model

Taking the simplest case, a three-speaker environment, into consideration: speakers A, B, and C are talking at the same time, and their acoustic features have been acquired by a recorder. Given a specific input voice V, the similarities between V and speakers A, B, and C can be determined from their Euclidean distance vectors. By comparing the Euclidean distance vectors between V and each of them, we can figure out which speaker V is most likely to be. In addition, we should contrast the characteristics of V with those of the mixture of the three to make sure that V is included. Supposing the Euclidean distance vector between speaker A and the mixed voice of the three is […], and that between the voice V and the mixed voice is […], we can conclude that V is more likely to be none of them if V has the closest average Euclidean distance to A but the mathematical mean of […] is larger than that of […].

This statistical decision in the three-speaker environment can be further improved if we consider the Euclidean distance vector between V and the mixture of two speakers. V is less likely to be the voice of A if it has the nearest Euclidean distance to the blended speech of B and C, so this consideration is the reverse logic of the initial algorithm, and the two must be balanced to decide the likelihood that V is each of the three speakers.

Fig. 2. Sketch map of finding the least possible speaker in environments with three to five speakers.

When […] speakers are talking at the same time, this algorithm can be generalized from three to […] by adding the combinations of more speakers. Examples of different combinations containing two to four speakers are shown in Figure 2.
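The decisions of the initial model can be sketched in Python as below. This is a hedged illustration only: the function names, the element-wise distance vector, and in particular the direction of the inclusion test (V is rejected when it lies farther from the three-speaker mixture than its nearest speaker does) are assumptions, since the symbols of the original comparison are not available.

```python
def distance_vector(p, q):
    """Per-dimension distance between two feature vectors (assumed reading)."""
    return [abs(a - b) for a, b in zip(p, q)]


def mean(values):
    return sum(values) / len(values)


def nearest_speaker(v, speakers):
    """Name of the speaker whose features have the smallest mean distance to V."""
    return min(speakers, key=lambda name: mean(distance_vector(v, speakers[name])))


def is_included(v, speakers, mixture):
    """Inclusion test: compare V-to-mixture with nearest-speaker-to-mixture.

    Assumption: V is judged an outsider when its mean distance to the mixture
    exceeds that of the speaker it most resembles.
    """
    candidate = nearest_speaker(v, speakers)
    d_candidate = mean(distance_vector(speakers[candidate], mixture))
    d_v = mean(distance_vector(v, mixture))
    return d_v <= d_candidate


def least_likely_speaker(v, pair_mixtures):
    """Reverse logic: pair_mixtures[name] holds the features of the blend of
    the speakers other than `name`; the blend closest to V points at the
    speaker V is least likely to be."""
    return min(pair_mixtures,
               key=lambda name: mean(distance_vector(v, pair_mixtures[name])))


if __name__ == "__main__":
    # Toy three-dimensional features, invented for illustration only.
    speakers = {"A": [1.0, 2.0, 3.0], "B": [4.0, 0.0, 1.0], "C": [0.0, 5.0, 2.0]}
    mixture = [1.5, 2.5, 2.0]
    v = [1.1, 2.1, 2.9]
    print(nearest_speaker(v, speakers), is_included(v, speakers, mixture))
```

Keeping the two tests as separate functions makes it easy to weigh the direct evidence (nearest speaker) against the reverse evidence (nearest two-speaker blend) when balancing them.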
2) The enhanced model

If we assume that the proportion of every frequency in a speaker's voiceprint follows a Gaussian distribution, the distribution of each index in the Euclidean distance vector between V and one of those speakers can be shaped by a Gaussian model. Extracted from MFCCs, the coefficient vector of V is: […]. The posterior probability is deduced from Bayes' theorem:

[…], (3)
[…]. (4)

In the theorem above, […] is the distribution of the MFCC vector […], which represents the voiceprint of A, B, and C when […]. As hypothesized above, […] follows a Gaussian distribution, where […] is the mathematical mean of […] and […] is the standard deviation of […], so the distribution of […] is:

[…] (5)

By definition, the Manhattan distance of two vectors is the average distance in each dimension:

[…]. (6)

Deduced from (6), the Manhattan distance of two MFCC vectors is:

[…] (7)

Collecting the voiceprint vectors of A, we can place them in a constellation drawing with thirteen dimensions, where each spot represents the location of a vector. Because A has recorded ten sentences and each of them may lead to a distinct spot, the area that contains those spots can be enclosed by a circle to represent the characteristic of A. Thus, the area is quite likely to be reached if the input test sentence belongs to A. Since the spots of a speaker obey a normal distribution, it is logical to deduce that […] of each index is included in the interval […], so […] of this area is covered by a new vector […], which has been tested to be a felicitous balance for making judgments. Therefore, a sentence can be excluded from belonging to A when the mapped spot of its MFCC vector is too far from that area (imagining that […] is the center of a circle with a diameter of […] in two-dimension constellation drawings, which pile up together to form a thirteen-dimension constellation figure). The following figure (Figure 3) is a visualization of two vectors mapped into the thirteen-dimension constellation.

Fig. 3. The locations of two MFCC vectors in the 13-dimension model.

3) The complete theory

The thorough algorithm in the three-speaker environment is as follows:

First, calculate the Euclidean distance vector between V and the mixture of multiple speakers ([…]). Then use the training data of each separate speaker to determine an average Euclidean distance vector ([…]) likewise. Compare the mathematical mean of […] with that of […], and then decide whether V should be included among those talkers.

Second, contrast the MFCC vector of V with that of the combinations of some speakers, and decide which specific speaker V is less likely to be. If the contrast between V and a combination containing speaker W is higher than that with another one excluding W when the rest of the speakers are unchanged, we can conclude that speaker W is quite unlikely to be V.

Third, use […] as a vector and compare it with the MFCC vector of V as […] changes from one to […] ([…] is the total number of speakers), in order to find the Manhattan distances between V and the other speakers.

Finally, we can decide who V should be if V has the closest Manhattan distance to a specific speaker and also has the farthest contrast to the mixture without that speaker.

3 Experimental Results and Analysis

In the experiment, the input sentences are separated into two parts: the same-sentence data and the distinct-sentence data. The first part is easier to distinguish, while the second part is subtler to decide because of greater interference. By testing numerous combinations of voices in the second step of the statistical decision theory above, the recognition error rate goes down approximately logarithmically as the amount of testing data increases. Moreover, the number of different combinations increases rapidly as the total number of speakers increases, so making a precise decision about the owner becomes harder as the total number grows.
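The growth in the number of speaker combinations noted above, together with the Manhattan-distance step of the enhanced model, can be sketched in Python as follows. Everything named here is hypothetical: the combination sizes (two up to one fewer than the total) follow the description of Figure 2, the width factor `k` of the acceptance interval is a placeholder because the paper's interval is not available, and `identify` covers only the Manhattan-distance part of the final decision, not the contrast with mixtures.

```python
from itertools import combinations
import statistics


def speaker_combinations(names):
    """All groups of 2 up to len(names) - 1 speakers (assumed range)."""
    groups = []
    for size in range(2, len(names)):
        groups.extend(combinations(names, size))
    return groups


def manhattan(p, q):
    """Manhattan distance as the paper words it: the average per-dimension gap."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def fit_voiceprint(vectors):
    """Per-dimension mean and sample standard deviation of training vectors."""
    dims = list(zip(*vectors))
    mu = [statistics.mean(d) for d in dims]
    sigma = [statistics.stdev(d) for d in dims]
    return mu, sigma


def inside_region(v, mu, sigma, k=2.0):
    """True if every coordinate of v lies within mu +/- k * sigma.

    k is a hypothetical width; the interval used in the paper is not known.
    """
    return all(abs(x - m) <= k * s for x, m, s in zip(v, mu, sigma))


def identify(v, models, k=2.0):
    """Closest speaker (Manhattan distance to the mean) among those whose
    region contains v, or None when v falls outside every region."""
    candidates = [name for name, (mu, sigma) in models.items()
                  if inside_region(v, mu, sigma, k)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: manhattan(v, models[name][0]))


if __name__ == "__main__":
    # The number of combinations to test grows quickly with the party size.
    for n in (3, 4, 5):
        print(n, len(speaker_combinations("ABCDE"[:n])))
```

With this range of sizes a group of n speakers yields 2^n - n - 2 combinations, which illustrates why the decision becomes harder as the number of speakers rises.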
In the three-speaker environment, the final error rate for distinct sentences is […], while that for the same sentences is […]. This is quite reasonable, because the features of same-content sentences are extracted with less interference, making the judgment easier.

Fig. 4. The error rate variation in the three-speaker environment.

Fig. 5. The error rate of whether the speaker is included and who the speaker should be in different kinds of environment.

4 Conclusion and Future Development

In this paper, the two prime algorithms are MFCCs and statistical decision theory, and two tasks are resolved: judging whether a voice stream is uttered by one of those speakers and finding the owner of the voice. The error rate in these tasks increases slightly when the total number of speakers rises from three to five.

The uniqueness of this paper is that it can compare a single voice with the blend of various voices at one moment and make a synthesized decision about the speaker all at once, whereas previous works in speaker recognition always make comparisons one by one and finally obtain the most probable owner of a voice by seeking out the one with the highest probability.

Since this paper only takes the timbres of people into consideration, its weakness is that judgment errors occur when a person's timbre is so similar to another's that even the human ear cannot correctly distinguish them. If we utilize people's accents in speech, the error rate may be narrowed down to some extent. For instance, accents vary across different parts of the United States: some local people may say 'Oh, my gowd' ('Oh, my god') in New Jersey, and 'mah fanger hurts' ('my finger hurts') in Alabama. For people with similar timbre but different accents, the syllables of their speech can be used in an algorithm such as the Hidden Markov Model (HMM).
Acknowledgment

We would like to acknowledge the Electronic Information Specialty Group of Universities in Beijing for funding the paper.

References

1. A. W. Bronkhorst. The cocktail-party problem revisited: early processing and selection of multi-talker speech. Atten Percept Psychophys, 2015. DOI 10.3758/s13414-015-0882-9.
2. L. Marchegiani, S. G. Karadogan, T. Andersen, J. Larsen and L. K. Hansen. The Role of Top-Down Attention in the Cocktail Party: Revisiting Cherry's Experiment after Sixty Years. International Conference on Machine Learning and Applications, 2011.
3. A. M. Ding. "Research on speaker recognition system using MFCC and GMM algorithm", Hohai University, 2006.
4. W. Li. "Analysis of Characters in Multi-speaker Environment", South China University of Technology, 2014 Dec.
5. A. E. Omer. Joint MFCC-and-Vector Quantization based Text-Independent Speaker Recognition System. 2017 International Conference on Communication, Control, Computing and Electronics Engineering (ICCCCEE), Khartoum, Sudan.
6. J. P. Zhou. "Principles of Communication", www.buptpress.com, 2002.
7. G. K. Verma, U. S. Tiwary and S. Agrawal. "Multi-algorithm Fusion for Speech Emotion Recognition", Communications in Computer and Information Science 192, 2011 Jul.
8. T. Kinnunen and H. Li. "An overview of text-independent speaker recognition: From features to supervectors", Speech Commun, vol. 52, no. 1, pp. 12–40, 2010 Jan.
9. Liberman, Mark, et al. Emotional Prosody Speech and Transcripts LDC2002S28. CD-ROM. Philadelphia: Linguistic Data Consortium, 2002.
10. Huang, Shudong, David Graff and George Doddington. Multiple-Translation Chinese Corpus LDC2002T01. Web download file. Philadelphia: Linguistic Data Consortium, 2002.
11. Garofolo, John S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
