The Second DIHARD Diarization Challenge: Dataset, task, and baselines

Neville Ryant (1), Kenneth Church (2), Christopher Cieri (1), Alejandrina Cristia (3), Jun Du (4), Sriram Ganapathy (5), Mark Liberman (1)

(1) Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, USA
(2) Baidu Research, Sunnyvale, CA, USA
(3) Laboratoire de Sciences Cognitives et de Psycholinguistique, Dépt d'études cognitives, ENS, EHESS, CNRS, PSL University, Paris, France
(4) University of Science and Technology of China, Hefei, China
(5) Electrical Engineering Department, Indian Institute of Science, Bangalore, India

nryant@ldc.upenn.edu

Abstract

This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multichannel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks, to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.

Index Terms: speaker diarization, speaker recognition, robust ASR, noise, conversational speech, DIHARD challenge

1. Introduction

Speaker diarization, often referred to as "who spoke when", is the task of determining how many speakers are present in a conversation and correctly identifying all segments for each speaker. In addition to being an interesting technical challenge, it forms an important part of the pre-processing pipeline for speech-to-text and is essential for making objective measurements of turn-taking behavior. Early work in this area was driven by the NIST Rich Transcription (RT) evaluations [1], which ran between 2002 and 2009. In addition to driving substantial performance improvements, especially for meeting speech, the RT evaluations introduced the diarization error rate (DER) metric, which remains the principal evaluation metric in this area. Since the RT evaluation series ended in 2009, diarization performance has continued to improve, though the lack of a common task has resulted in fragmentation, with individual research groups focusing on different datasets or domains (e.g., conversational telephone speech [2, 3, 4, 5, 6], broadcast [7, 8], or meeting [9, 10]). At best, this has made comparing performance difficult, while at worst it may have engendered overfitting to individual domains/datasets, resulting in systems that do not generalize. Moreover, the majority of this work has evaluated systems using a modified version of DER in which speech within 250 ms of reference boundaries and overlapped speech are excluded from scoring. As short segments such as backchannels and overlapping speech are both common in conversation, this may have resulted in an over-optimistic assessment of performance even within these domains [11]. (See, for instance, the release of IBM's diarization API in 2017: the feature worked well for simple cases, but when run by users on real inputs, its performance was found to be lacking, especially for overlaps, backchannels, and short turns.)
It is against this backdrop that the JSALT 2017 workshop [12] and the DIHARD challenges (https://coml.lscp.ens.fr/dihard/index.html) emerged. The DIHARD series of challenges introduces a new common task for diarization that is intended both to facilitate comparison of current and future systems through standardized data, tasks, and metrics, and to promote work on robust diarization systems; that is, systems that are able to accurately handle highly interactive and overlapping speech from a range of conversational domains, while being resilient to variation in recording equipment, recording environment, reverberation, ambient noise, number of speakers, and speaker demographics. As with the NIST RT evaluations, DER is adopted as the primary evaluation metric, but without the use of collars or the exclusion of overlapping speech. There are no constraints on training data, with participants allowed to use any combination of public/proprietary data for system development.

The initial DIHARD challenge (DIHARD I) [13] ran during the spring of 2018 and attracted registrations from 20 teams, of which 13 submitted systems. As expected, state-of-the-art systems performed poorly, with final DER on the evaluation set for the top systems ranging from 23.73% [14] when provided with reference speech activity detection (SAD) marks to 35.51% [15] when forced to perform diarization from scratch. These error rates are more than double the state-of-the-art for CALLHOME [16] at the time [4, 5]. For some domains, error rates for the best systems exceeded 49% when using reference SAD and 75% when performing diarization from scratch!

The second DIHARD challenge (DIHARD II) [17], like its predecessor, examines diarization system performance under two SAD conditions: diarization from a supplied reference SAD and diarization from scratch. As with DIHARD I, it includes a single channel input condition utilizing wideband speech sampled from 11 demanding domains, ranging from clean, nearfield recordings of read audiobooks, to extremely noisy, highly interactive, farfield recordings of speech in restaurants, to child language data recorded in the home using LENA vests. Unlike DIHARD I, it additionally offers a multichannel input condition requiring participants to perform diarization from farfield microphone arrays of dinner party speech drawn from the CHiME-5 corpus [18]. For the first time, we also provide participants with baseline systems for speech enhancement, SAD, and diarization, as well as results obtained with these systems for all tracks.

2. Tracks

The challenge features two audio input conditions:

• Single channel – Systems are provided with a single channel of audio for each recording. Depending on the recording source, this channel may be taken from a single distant microphone, a single channel from a distant microphone array, a mix of head-mounted or array microphones, or a mix of binaural microphones.
• Multichannel – Each recording session contains output from one or more distant microphone arrays, each containing multiple channels. Participants are instructed to treat the arrays separately, producing one output per array. They are free to use as few or as many of the channels on each array as they wish to perform diarization.

As system performance is strongly tied to the quality of the SAD component, we also include two SAD conditions:

• Reference SAD – Systems are provided with a reference speech segmentation that is generated by merging speaker turns in the reference diarization.

• System SAD – Systems are provided with just the raw audio input for each recording session and are responsible for producing their own speech segmentation.

Together, this yields the following four evaluation tracks:

• Track 1 – single channel audio using reference SAD
• Track 2 – single channel audio using system SAD
• Track 3 – multichannel audio using reference SAD
• Track 4 – multichannel audio using system SAD

All teams are required to register for at least one of track 1 or track 3.

3. Performance Metrics

As in DIHARD I, the primary metric is DER [1], which is the sum of the missed speech, false alarm speech, and speaker misclassification error rates. Because systems are provided with the reference speech segmentation for tracks 1 and 3, for these tracks it exclusively measures speaker misclassification error. This is the metric used to rank systems on the leaderboard.

For each system we also compute a secondary metric, Jaccard error rate (JER), which is newly developed for DIHARD II. JER is based on the Jaccard similarity index [19, 20], a metric commonly used to evaluate the output of image segmentation systems, which is defined as the ratio between the sizes of the intersection and union of two sets of segments. An optimal mapping between speakers in the reference diarization and speakers in the system diarization is determined, and for each pair the Jaccard index of their segmentations is computed. JER is defined as 1 minus the average of these scores, expressed as a percentage. That is, it is the mean of Eq. 1 across all reference speakers ref, where TOTAL is the duration of the union of the reference and system speaker segments, FA is the total system speaker time not attributed to the reference speaker, and MISS is the total reference speaker time not attributed to the system speaker. It ranges from 0% in the case where each reference speaker is paired with a system speaker with an identical segmentation, to 100% in the case where none of the system speakers overlap any of the reference speakers.

    JER_ref = (FA + MISS) / TOTAL    (1)

All metrics are computed using version 1.0.1 of the dscore tool (https://github.com/nryant/dscore) without the use of forgiveness collars and with scoring of overlapped speech.
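To make Eq. 1 concrete, the sketch below computes JER for toy segmentations. It assumes the optimal reference-to-system speaker mapping has already been determined (the full metric also solves that assignment problem), and all names (merge, jer, etc.) are illustrative, not part of the official dscore implementation.

```python
# Minimal sketch of JER (Eq. 1) over mapped speaker pairs. Illustrative only;
# the official implementation is dscore (https://github.com/nryant/dscore).

def merge(segments):
    """Merge overlapping (onset, offset) tuples into disjoint intervals."""
    merged = []
    for on, off in sorted(segments):
        if merged and on <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], off))
        else:
            merged.append((on, off))
    return merged

def duration(segments):
    return sum(off - on for on, off in segments)

def intersection(a, b):
    """Total overlap between two disjoint interval lists."""
    return sum(max(0.0, min(off1, off2) - max(on1, on2))
               for on1, off1 in a for on2, off2 in b)

def jer(pairs):
    """pairs: (ref_segments, sys_segments) for each mapped speaker pair;
    an unmapped reference speaker is paired with an empty system list."""
    scores = []
    for ref, sys in pairs:
        ref, sys = merge(ref), merge(sys)
        inter = intersection(ref, sys)
        union = duration(ref) + duration(sys) - inter  # TOTAL in Eq. 1
        # FA + MISS = union - intersection, so Eq. 1 equals 1 - Jaccard index.
        scores.append(1.0 - inter / union if union > 0 else 0.0)
    return 100.0 * sum(scores) / len(scores)

print(jer([([(0.0, 5.0)], [(0.0, 4.0)]),    # 1 - 4/5 -> 20% for this pair
           ([(6.0, 8.0)], [(6.0, 8.0)])]))  # identical -> 0%; mean = 10.0
```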
4. Datasets

4.1. Overview

The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, recording environment, ambient noise, number of speakers, and speaker demographics. The single channel input condition (tracks 1 and 2) dataset is a superset of that used in DIHARD I, though 6 hours of additional material have been added to ensure that all domains are represented in both the development and evaluation sets. Additionally, two domains where the DIHARD I annotation was deemed suspect (child language and web video) have been entirely resegmented. For the multichannel input condition (tracks 3 and 4) we use the multi-party dinner recordings originally collected for and exposed during the CHiME-5 challenge [18]. The development and evaluation sets are summarized in Table 1.

Table 1: Overview of DIHARD II datasets. For the CHiME-5 (multichannel) data, each Kinect is treated as a separate recording.

Input condition   Set    Duration (hours)   # Recordings
single channel    dev    23.81              192
single channel    eval   22.49              194
multichannel      dev    262.41             105
multichannel      eval   31.24              12

The development set includes reference diarization and speech segmentation and may be used for any purpose, including system development or training. As with DIHARD I, there is no training set, with participants free to train their systems on any proprietary and/or public data. Both the development and evaluation sets will be submitted for publication via LDC at the end of the evaluation.

4.2. Single channel data (tracks 1 and 2)

The single channel input condition development and evaluation sets consist of selections of 5-10 minute duration samples drawn from 11 conversational domains, each including approximately 2 hours of audio. The full set of domains is described below, with LDC Catalog numbers where appropriate. Unless otherwise specified, all speech is English, though not necessarily by native or even fluent speakers. All audio is distributed via LDC as 16 kHz, monochannel FLAC files.

• audiobooks – amateur recordings of public domain English works drawn from LibriVox; care was taken to avoid overlap with LibriSpeech [21] (unpublished)

• broadcast interview – student-produced interviews with newsmakers of the day taken from a late 1970s college radio show; recorded on open reel tapes before being digitized and contributed to LDC (unpublished)

• child language – day-long recordings of the vocalizations of 6-18 month olds, collected at home by University of Rochester researchers for the SEEDLingS corpus [22]

• clinical – interviews with 12-16 year old children intended to determine whether or not they fit the clinical diagnosis for autism; all recordings conducted at the Center for Autism Research (CAR) of the Children's Hospital of Philadelphia (CHOP) using a mixture of cameras and ceiling-mounted microphones (unpublished)
• courtroom – oral arguments from the 2001 term of the U.S. Supreme Court that were digitized for the OYEZ project; recordings are summed from individual table-mounted microphones, one per speaker (unpublished)

• map task – recordings of map tasks in which one participant, the leader, describes a route drawn on a map to the other participant, the follower, who attempts to draw the same route on a copy of the map lacking the route and optionally lacking some landmarks; audio was recorded via close-talking microphones under quiet conditions (previously released as LDC96S38)

• meeting – meetings with between 3 and 7 participants, each recorded with a variety of close-talking and distant microphones, from which a single, centrally located distant microphone was selected; the development set draws from the NIST Spring 2004 Rich Transcription Evaluation (LDC2007S11 and LDC2007S12), while the evaluation set draws from previously unpublished recordings conducted for the DARPA Robust Omnipresent Automatic Recognition (ROAR) project at LDC in 2001

• restaurant – approximately 1 hour sessions involving 3-6 diners recorded on a binaural microphone worn by one participant in restaurants with varying room acoustics and noise levels; inspired by the NSF Hearables Challenge and extended by LDC for DIHARD (unpublished)

• sociolinguistic field recordings – sociolinguistic interviews recorded under field conditions during the 1960s and 1970s; recorded in diverse locations and conditions with subjects ranging from 15 to 81 years of age and representing diverse ethnicities, backgrounds, and dialects of world English; the development set draws from SLX (LDC2003T15) and the evaluation set from DASS (LDC2012S03 & LDC2016S05)

• sociolinguistic lab recordings – sociolinguistic interviews recorded as part of MIXER6 (LDC2013S03) under quiet conditions in a controlled environment; sessions were recorded with a variety of close-talking and distant microphones, from which a single, centrally located distant microphone was selected

• web video – English and Mandarin amateur videos collected from online sharing sites (e.g., YouTube and Vimeo) as part of the Video Annotation for Speech Technologies (VAST) [23] collection (mostly unpublished)

4.3. Multichannel data (tracks 3 and 4)

The multichannel input condition development and evaluation sets are drawn from the CHiME-5 dinner party corpus [18], a corpus of conversational speech collected during dinner parties held in real homes. The development set combines the CHiME-5 training and development sets and encompasses 45 hours of dinner parties from 18 homes. The evaluation set is identical to the CHiME-5 evaluation set and consists of 5 hours of dinner parties from 2 homes. Each party was recorded using 6 Microsoft Kinect devices (4-channel linear arrays) distributed throughout the home in such a way that the conversation was always present on each array. Due to a combination of clock drift and random frame dropping, the Kinects within each recording session exhibit massive desynchronization, both with each other and with the binaural recording devices worn by participants. For this reason, each Kinect device is treated separately, with the resulting development and evaluation sets having durations of 262.4 hours and 31.2 hours, respectively. All audio is distributed via the University of Sheffield as 16 kHz WAV files.

4.4. Processing

A limited number of recordings contained regions carrying personal identifying information (PII), which were removed prior to publication. For the clinical and restaurant domains, this was done at LDC by low-pass filtering using a 10th order Butterworth filter with a passband of 0 to 400 Hz. To avoid abrupt transitions in the resulting waveform, the effect of the filter was gradually faded in and out at the beginning and end of the regions using a ramp of 40 ms. In the case of the sociolinguistic field recordings domain and the CHiME-5 data, PII was removed by the original creators of the corpora. In the former case, PII was replaced by tones of matched duration, while in the latter case it was zeroed out. PII-containing regions are ignored during scoring.
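For readers who want to reproduce this style of redaction, the following is a minimal Python sketch under stated assumptions: it uses scipy's Butterworth design with the published order and passband, and interprets the fade as a linear 40 ms crossfade between the original and filtered signal. The exact LDC tooling is not published, so treat this as illustrative.

```python
# Minimal sketch of the PII low-pass treatment described above: 10th order
# Butterworth, 0-400 Hz passband, with 40 ms fade-in/fade-out ramps.
# The crossfade interpretation is an assumption; the LDC tool is unpublished.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def redact_region(audio, start, end, sr=16000, ramp_s=0.04):
    """Low-pass filter audio[start:end] (sample indices), fading the filter's
    effect in and out over `ramp_s` seconds at the region edges."""
    sos = butter(10, 400, btype="low", fs=sr, output="sos")
    region = audio[start:end]
    filtered = sosfiltfilt(sos, region)
    # Gain ramps from 0 (original signal) to 1 (fully filtered) and back.
    gain = np.ones(len(region))
    n_ramp = int(ramp_s * sr)
    gain[:n_ramp] = np.linspace(0.0, 1.0, n_ramp)
    gain[-n_ramp:] = np.linspace(1.0, 0.0, n_ramp)
    out = audio.copy()
    out[start:end] = (1.0 - gain) * region + gain * filtered
    return out
```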
4.5. Annotation

Reference segmentation and speaker labeling were produced by annotators at LDC using a tool equipped with playback and waveform and spectrogram displays. Annotators were instructed to split on pauses > 200 ms, where a pause was defined as any stretch of time during which the speaker was not producing vocalization of any kind (e.g., backchannels, filled pauses, singing, speech errors and disfluencies, infant babbling or vocalizations, laughter, coughs, breaths, lipsmacks, and humming). Boundaries were placed within 10 ms of the true boundary, taking care not to truncate sounds at the edges of words (e.g., utterance-final fricatives). Where individual close-talking microphones were available for speakers, annotation was performed separately for each speaker using their individual microphone. Due to time constraints, this manual segmentation process could not be implemented for the multichannel development data; for this data, segmentation was taken from the turn boundaries established during the original CHiME-5 transcription.

An additional post-processing step was necessary for the CHiME-5 annotation to correct for the lack of synchronization between binaural recording devices and Kinects. For each Kinect, the lag between that array and the binaural recording devices was estimated at regular intervals using normalized cross-correlation. The speech boundaries established by annotation on the binaural devices were then corrected for each Kinect using these estimated lags.
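The lag estimation step might look like the following sketch. Only the use of normalized cross-correlation at regular intervals is taken from the text above; the window length, estimation interval, search range, and function names are illustrative assumptions.

```python
# Sketch of per-interval lag estimation via normalized cross-correlation,
# in the spirit of the CHiME-5 correction described above. Window/hop/search
# parameters are assumptions; the actual correction pipeline is unpublished.
import numpy as np
from scipy.signal import correlate

def estimate_lag(ref, sig, sr=16000, max_lag_s=2.0):
    """Return the lag (seconds) of `sig` relative to `ref` that maximizes
    the cross-correlation of the z-normalized windows."""
    max_lag = int(max_lag_s * sr)
    ref = (ref - ref.mean()) / (ref.std() + 1e-8)
    sig = (sig - sig.mean()) / (sig.std() + 1e-8)
    xcorr = correlate(sig, ref, mode="full", method="fft")
    center = len(ref) - 1  # index of zero lag in the full correlation
    window = xcorr[center - max_lag:center + max_lag + 1]
    return (np.argmax(window) - max_lag) / sr

def lags_at_intervals(ref, sig, sr=16000, win_s=10.0, hop_s=60.0):
    """Estimate lags on `win_s`-second windows taken every `hop_s` seconds."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    return [(start / sr, estimate_lag(ref[start:start + win],
                                      sig[start:start + win], sr))
            for start in range(0, len(ref) - win, hop)]
```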
5. Baseline system

5.1. Speech enhancement

For speech enhancement we use a densely-connected LSTM architecture [24, 25, 26] trained to predict the ideal ratio mask (IRM) [27] of speech from log-power spectra (LPS) features. The model is trained via progressive multi-target learning [24, 28] using 400 hours of noisy speech produced by corrupting clean utterances from WSJ0 [29] and a 50 hour Chinese speech corpus from the 863 Program [30]. Utterances were corrupted using 115 noise types [24] at 3 SNR levels (-5 dB, 0 dB, and 5 dB). The trained models, as well as scripts for applying them, are distributed through GitHub (https://github.com/staplesinLA/denoising_DIHARD18).

5.2. Beamforming

For the multichannel tracks, we use weighted delay-and-sum beamforming as implemented in BeamformIt [31]. Beamforming is applied independently for each Kinect in each session using all four channels, following the CHiME-5 recipe [18].

5.3. Speech activity detection

The baselines for tracks 2 and 4 use WebRTC's SAD (https://webrtc.org/) as implemented in the py-webrtcvad Python package (https://github.com/wiseman/py-webrtcvad). Scripts for performing SAD using the same settings used to obtain the baseline results are distributed through GitHub (https://github.com/staplesinLA/denoising_DIHARD18).
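As an illustration of this component, the sketch below runs py-webrtcvad over 16-bit mono PCM. The aggressiveness mode and 30 ms frame size are assumptions chosen for illustration; the exact baseline settings are in the scripts distributed through GitHub.

```python
# Minimal sketch of frame-level speech activity detection with py-webrtcvad.
# Mode (0-3) and 30 ms frames are illustrative, not the baseline's settings.
import webrtcvad

def detect_speech_frames(pcm16, sample_rate=16000, frame_ms=30, mode=3):
    """pcm16: mono 16-bit little-endian PCM bytes. Returns a list of
    (start_seconds, is_speech) flags, one per frame."""
    vad = webrtcvad.Vad(mode)
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes/sample
    flags = []
    for i, offset in enumerate(range(0, len(pcm16) - bytes_per_frame + 1,
                                     bytes_per_frame)):
        frame = pcm16[offset:offset + bytes_per_frame]
        flags.append((i * frame_ms / 1000.0,
                      vad.is_speech(frame, sample_rate)))
    return flags
```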
5.4. Diarization

The diarization baseline is based on the previously published Kaldi [32] recipe (https://github.com/kaldi-asr/kaldi/tree/master/egs/dihard_2018/v2) for JHU's submission to DIHARD I [14]. At a high level, the system performs diarization by dividing each recording into short overlapping segments, extracting x-vectors [33, 34], scoring with probabilistic linear discriminant analysis (PLDA) [35], and clustering using agglomerative hierarchical clustering (AHC) [36]. In contrast to the original JHU system, we omit the Variational Bayes resegmentation step [37]. The trained models are distributed through GitHub (https://github.com/iiscleap/DIHARD_2019_baseline_alltracks).

The x-vector extractor configuration is identical to that used in previous speaker recognition and diarization systems [34, 14] with two exceptions: i) 30 dimensional mel frequency cepstral coefficient (MFCC) features are used instead of mel filterbank features; ii) the embedding layer uses 512 dimensions. MFCCs are extracted every 10 ms using a 25 ms window and mean-normalized using a 3 second sliding window. For training we use a combination of VoxCeleb 1 and VoxCeleb 2 [38, 39] augmented with additive noise and reverberation according to the recipe from [33]. Segments under 4 seconds duration are discarded, resulting in a training set with 7,323 speakers. Reverberation is added by convolution with room responses from the RIR dataset [40], while additive noises are drawn from the MUSAN dataset [41]. At test time, x-vectors are extracted from 1.5 second segments with 0.75 second overlap.

Following extraction, x-vectors are pre-processed to perform domain adaptation to the DIHARD II dataset. This is done by normalizing with a global mean and whitening transform learned from the DIHARD II development set. The whitened x-vectors are then length normalized [42] and used to train a Gaussian PLDA model [35] using a subset of VoxCeleb consisting of segments of at least 3 seconds duration. Following PLDA scoring, clustering is performed using AHC with the threshold set by minimizing DER on the development data.
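To make the back-end concrete, here is a minimal sketch of the whitening, length normalization, and AHC steps over precomputed x-vectors. For brevity it substitutes cosine scoring for PLDA, and the whitening statistics and clustering threshold are placeholders that would be learned and tuned on the DIHARD II development set as described above.

```python
# Minimal sketch of the clustering back-end on precomputed x-vectors.
# Cosine similarity stands in for PLDA scoring; all parameters are
# placeholders, not the baseline's trained values.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def whiten_and_length_norm(xvecs, adapt_set):
    """Center/whiten with stats from an adaptation set, then length-normalize."""
    mu = adapt_set.mean(axis=0)
    cov = np.cov(adapt_set - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-8)) @ eigvecs.T
    x = (xvecs - mu) @ whitener
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cluster_segments(xvecs, threshold=0.5):
    """AHC over pairwise cosine distances; `threshold` would be tuned by
    minimizing DER on the development set."""
    dists = pdist(xvecs, metric="cosine")
    tree = linkage(dists, method="average")
    return fcluster(tree, t=threshold, criterion="distance")  # speaker labels
```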
Table 2: Baseline performance (measured by DER and JER) on dev and eval sets for all tracks. The Enh. column indicates whether or not speech enhancement was applied prior to SAD.

Track     Enh.   DER dev (%)   DER eval (%)   JER dev (%)   JER eval (%)
Track 1   no     23.70         25.99          56.20         59.51
Track 2   no     46.33         50.12          69.26         72.10
Track 2   yes    38.26         40.86          62.59         66.60
Track 3   no     59.73         50.85          68.00         65.91
Track 4   no     87.55         83.41          88.08         85.12
Track 4   yes    82.49         77.34          83.60         80.42

5.5. Baseline results

DER and JER of the baseline system on both the development and evaluation sets for each track are presented in Table 2. The speech enhancement module is used only for tracks 2 and 4, as a pre-processing front-end for the SAD pipeline, since the diarization system did not show improvements when using the enhanced audio. The scores obtained by the challenge baseline are quite high, with track 1 DER roughly in line with the performance of the best DIHARD I systems [14, 15, 25] and track 2 DER 5% higher than for DIHARD I (15% without enhancement), which we suspect reflects a combination of superior SAD components in those systems and the more careful segmentation for the child language and web video domains in DIHARD II. Error rates are noticeably higher for tracks 3 and 4, reaching 50.85% and 77.34% respectively, though, again, these rates are roughly in line with those observed for the best DIHARD I systems on the two most difficult domains in that challenge: restaurant and child language.

6. Conclusion

The field of speaker diarization has changed drastically in the two short years we have been running this challenge. In the lead up to DIHARD I, the research community was fragmented and most research concentrated on relatively easy datasets using forgiving evaluation metrics. This both made comparison of systems difficult and led some to believe that diarization was relatively solved and uninteresting. However, we were pleased by the response to DIHARD I, both during the evaluation and after, demonstrating that there is interest in robust diarization. This renewed energy is on display in DIHARD II, which attracted 48 registered teams from 17 countries, more than doubling the number of teams registered for DIHARD I. It is also evident in the recent announcement of the Fearless Steps challenge, which includes diarization among its tasks. We hope that this year's contributions lead to marked progress toward the goal of truly robust diarization.

7. Acknowledgements

We would like to thank Harshah Vardhan MA, Prachi Singh, and Lei Sun for their help in preparing the baseline systems and results. We would also like to acknowledge the generous support of Agence Nationale de la Recherche (ANR-16-DATA-0004 ACLEW, ANR-14-CE30-0003 MechELex, ANR-17-EURE-0017), the J. S. McDonnell Foundation, and the Linguistic Data Consortium, as well as the CHiME-5 challenge for allowing us use of their data.

8. References

[1] J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo, "The Rich Transcription 2006 Spring Meeting Recognition Evaluation," in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2006, pp. 309-322.
[2] G. Sell and D. Garcia-Romero, "Speaker diarization with PLDA i-vector scoring and unsupervised calibration," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2014, pp. 413-417.
[3] W. Zhu and J. Pelecanos, "Online speaker diarization using adapted i-vector transforms," in Proc. ICASSP, 2016.
[4] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, "Speaker diarization using deep neural network embeddings," in Proc. ICASSP, 2017, pp. 4930-4934.
[5] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, "Speaker diarization with LSTM," in Proc. ICASSP, 2018, pp. 5239-5243.
[6] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully supervised speaker diarization," Proc. ICASSP, 2019.
[7] M. Rouvier, G. Dupuy, P. Gay, E. Khoury, T. Merlin, and S. Meignier, "An open-source state-of-the-art toolbox for broadcast news diarization," in Proc. Interspeech, 2013, pp. 1477-1481.
[8] I. Viñals, A. Ortega, J. A. V. López, A. Miguel, and E. Lleida, "Domain adaptation of PLDA models in broadcast diarization by means of unsupervised speaker clustering," in Proc. Interspeech, 2017, pp. 2829-2833.
[9] S. H. Yella and H. Bourlard, "Improved overlap speech diarization of meeting recordings using long-term conversational features," in Proc. ICASSP, 2013, pp. 7746-7750.
[10] S. H. Yella, A. Stolcke, and M. Slaney, "Artificial neural network features for speaker diarization," in Proc. IEEE Spoken Language Technology Workshop, 2014, pp. 402-406.
[11] R. Milner and T. Hain, "Segment-oriented evaluation of speaker diarisation performance," in Proc. ICASSP, 2016, pp. 5460-5464.
[12] N. Ryant, E. Bergelson, K. Church, A. Cristia, J. Du, S. Ganapathy, S. Khudanpur, D. Kowalski, M. Krishnamoorthy, R. Kulshreshta et al., "Enhancement and analysis of conversational speech: JSALT 2017," in Proc. ICASSP, 2018, pp. 5154-5158.
[13] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "First DIHARD challenge evaluation plan," Tech. Rep., 2018. [Online]. Available: https://zenodo.org/record/1199638
[14] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD Challenge," in Proc. Interspeech, 2018, pp. 2808-2812.
[15] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Žmolíková, O. Novotný, K. Veselý, O. Glembek, O. Plchot et al., "BUT system for DIHARD Speech Diarization Challenge 2018," in Proc. Interspeech, 2018, pp. 2798-2802.
[16] C. Cieri, D. Miller, and K. Walker, "From Switchboard to Fisher: Telephone collection protocols, their uses and yields," in Proc. EUROSPEECH, 2003.
[17] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, "Second DIHARD challenge evaluation plan," Tech. Rep., 2019. [Online]. Available: https://coml.lscp.ens.fr/dihard/2019/second_dihard_eval_plan_v1.1.pdf
[18] J. Barker, S. Watanabe, E. Vincent, and J. Trmal, "The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines," in Proc. Interspeech, 2018, pp. 1561-1565.
[19] L. Hamers et al., "Similarity measures in scientometric research: The Jaccard index versus Salton's cosine formula," Information Processing and Management, vol. 25, no. 3, pp. 315-318, 1989.
[20] R. Real and J. M. Vargas, "The probabilistic basis of Jaccard's index of similarity," Systematic Biology, vol. 45, no. 3, pp. 380-385, 1996.
[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: an ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206-5210.
[22] E. Bergelson, "Bergelson Seedlings HomeBank Corpus," 2016, doi:10.21415/T5PK6D.
[23] J. Tracey and S. Strassel, "VAST: A corpus of video annotation for speech technologies," in Proc. LREC, 2018.
[24] T. Gao, J. Du, L.-R. Dai, and C.-H. Lee, "Densely connected progressive learning for LSTM-based speech enhancement," in Proc. ICASSP, 2018, pp. 5054-5058.
[25] L. Sun, J. Du, C. Jiang, X. Zhang, S. He, B. Yin, and C.-H. Lee, "Speaker diarization with enhancing speech for the First DIHARD Challenge," Proc. Interspeech, pp. 2793-2797, 2018.
[26] L. Sun, J. Du, T. Gao, Y.-D. Lu, Y. Tsao, C.-H. Lee, and N. Ryant, "A novel LSTM-based speech preprocessor for speaker diarization in realistic mismatch conditions," in Proc. ICASSP, 2018, pp. 5234-5238.
[27] S. Srinivasan, N. Roman, and D. Wang, "Binary and ratio time-frequency masks for robust speech recognition," Speech Communication, vol. 48, no. 11, pp. 1486-1501, 2006.
[28] L. Sun, J. Du, L.-R. Dai, and C.-H. Lee, "Multiple-target deep learning for LSTM-RNN based speech enhancement," in Proc. HSCMA, 2017, pp. 136-140.
[29] J. S. Garofolo et al., CSR-I (WSJ0) Complete LDC93S6A. Philadelphia: Linguistic Data Consortium, 1993.
[30] Y. L. Qian, S. X. Lin, Y. D. Zhang, Y. Liu, H. Liu, and Q. Liu, "An introduction to corpora resources of 863 program for Chinese language processing and human-machine interaction," Proc. ALR, 2004.
[31] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Trans. Audio, Speech, Language Process., vol. 15, no. 7, pp. 2011-2022, 2007.
[32] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," IEEE Signal Processing Society, Tech. Rep., 2011.
[33] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep neural network-based speaker embeddings for end-to-end speaker verification," in 2016 IEEE Spoken Language Technology Workshop, 2016, pp. 165-170.
[34] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018, pp. 5329-5333.
[35] S. J. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in 2007 IEEE 11th International Conference on Computer Vision, 2007, pp. 1-8.
[36] K. J. Han, S. Kim, and S. S. Narayanan, "Strategies to improve the robustness of agglomerative hierarchical clustering under data source variation for speaker diarization," IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 8, pp. 1590-1601, 2008.
[37] M. Diez, L. Burget, and P. Matejka, "Speaker diarization based on Bayesian HMM with eigenvoice priors," in Proc. Odyssey, 2018, pp. 147-154.
[38] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[39] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proc. Interspeech, pp. 1086-1090, 2018.
[40] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," in Proc. ICASSP, 2017, pp. 5220-5224.
[41] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[42] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. Interspeech, 2011, pp. 249-252.