rVAD: An Unsupervised Segment-Based Robust Voice Activity Detection Method

Zheng-Hua Tan (a,*), Achintya Kr. Sarkar (a,b), Najim Dehak (c)
(a) Department of Electronic Systems, Aalborg University, Denmark
(b) School of Electronics Engineering, VIT-AP University, India
(c) Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA

Abstract

This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experiment results show that rVAD compares favourably with a number of existing methods.
In addition, we present a modified version of rVAD in which the computationally intensive pitch extraction is replaced by a computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data or running on low-resource devices. The source code of rVAD is made publicly available.

Keywords: a posteriori SNR; energy; pitch detection; spectral flatness; speech enhancement; voice activity detection; speaker verification

* Corresponding author. Email address: zt@es.aau.dk. This work was done in part while the author was visiting the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.

1. Introduction

Voice activity detection (VAD), also called speech activity detection (SAD), is widely used in real-world speech systems for improving robustness against additive noises or for discarding the non-speech part of a signal to reduce the computational cost of downstream processing [1]. It attempts to detect the presence or absence of speech in a segment of an acoustic signal. The detected non-speech segments can subsequently be discarded to improve the overall performance of these systems. For instance, cutting out noise-only segments can reduce the error rates of speech recognition and speaker recognition systems [2, 3, 4]. VAD methods can be broadly categorized into supervised and unsupervised methods. Supervised methods formulate VAD as a classical classification problem and solve it either by training a classifier directly [5], or by training statistical models for speech and non-speech separately and then making VAD decisions based on a log-likelihood ratio test (LRT) [6]. These methods require labelled speech data and their performances
Preprint submitted to Computer Speech and Language

are highly dependent on the quality of the labelled data, in particular on how well training and test data match each other. Supervised methods are often able to outperform unsupervised methods under matched conditions, but they can potentially break down under mismatched conditions. In [7], for example, a deep learning method, experimented on the robust automatic transcription of speech (RATS) database [8], demonstrates very good VAD performance for seen environments, but the gap between unseen and seen environments is very significant, almost an order of magnitude difference in terms of detection cost function (DCF), and what matters most is the performance on unseen environments. Unsupervised VAD methods include metrics-based methods and model-based ones. Metrics methods rely on the continuous observation of a specific metric, such as energy or zero-crossing rate, followed by a simple threshold-based decision stage [9]. Model-based methods, on the other hand, build separate models for speech and non-speech, followed by an LRT together with a threshold-based decision for statistical models [10] (similar to the setting of supervised methods, but without using labelled data) or a distance comparison for non-statistical models [11] to classify a segment as speech or non-speech. Semi-supervised VAD has also been studied, e.g., semi-supervised Gaussian mixture model (SSGMM) based VAD for speaker verification in [12]. VAD is a binary classification problem involving both feature extraction and classification.
Various speech features can be found in the literature, such as energy, zero-crossing rate, harmonicity [13], perceptual spectral flux [14], Mel-frequency cepstral coefficients (MFCCs) [11], power-normalized cepstral coefficients (PNCCs) [15], entropy [16], Mel-filter bank (MFB) outputs [17] and a posteriori signal-to-noise ratio (SNR) weighted energy distance [3, 18]. For the purpose of modelling and classification, popular techniques include Gaussian models [19], Gaussian mixture models (GMMs) [2, 14, 20], super-Gaussian models and their convex combination [9], i-vectors [15, 21], decision trees [22], support vector machines [23], and neural network models (including deep models) [5, 7, 24]. Although much progress has been made in the area of VAD, developing VAD methods that are accurate in both clean and noisy environments and able to generalize well to unseen environments remains an unsolved problem. For example, the ETSI advanced front-end (AFE) VAD [25], a commonly referenced VAD method, performs very well in noisy environments but poorly in noise-free conditions, and it is primarily suitable for dealing with stationary noise. The MFB output based VAD [17] and the long-term signal variability (LTSV) VAD [26] are highly accurate for clean speech, but their performance in noisy environments is worse than that of the AFE VAD. A supervised VAD method based on non-linear spectrum modification (NLSM) and GMM [27] is shown to be able to outperform both the MFB and LTSV algorithms. In [3], a low-complexity VAD method based on a posteriori SNR weighted energy difference also shows performance superior to several methods, including the MFB VAD and the AFE VAD. These methods are all significantly superior to the G.729 [28] and G.723.1 [29] VAD algorithms included in their corresponding Voice-over-IP standards.
The comparisons cited in this paragraph are all conducted on the Aurora 2 database [30]. This paper focuses on developing a VAD method that is unsupervised and able to generalize well to real-world data, irrespective of whether the data is corrupted by stationary or rapidly changing additive noise. In [3], a posteriori SNR weighted energy distance is used as the key speech feature for VAD and has demonstrated state-of-the-art performance in noisy environments; the rationale behind the feature is that the weighting by a posteriori SNR makes the weighted distance close to zero in non-speech regions for both clean and noisy speech. The VAD method in [3] first uses the a posteriori SNR weighted energy distance for selecting frames in a variable frame rate analysis and then conducts VAD based on the selected frames, and the method assumes the presence of speech within a certain period. While showing state-of-the-art performance, the method in [3] has several drawbacks. First, it makes an implicit assumption of speech presence, which does not always hold, although it is not uncommon for a VAD method to assume the presence of speech in every signal (e.g. a signal file in a database) [11], [14]. Secondly, the process of making VAD decisions by first selecting frames is cumbersome and suboptimal. Finally, estimating the a posteriori SNR weighted energy distance is prone to noise. In the present work, we therefore propose a VAD method that eliminates the assumption of speech presence by using pitch or spectral flatness as an anchor to find potential speech segments, directly uses the a posteriori SNR weighted energy distance without conducting frame selection, and applies speech enhancement.
The proposed method differs from [3] in a number of ways: 1) two-stage denoising is proposed in order to enhance the noise robustness against both rapidly changing and relatively stationary noise, 2) pitch or spectral flatness is applied to detect high-energy noise segments and as an anchor to find potential speech segments, and 3) a segment-based approach is used, in which VAD is conducted in segments, making it easier and more effective to determine a decision threshold since a certain amount of speech and non-speech exists in each segment. The proposed method is called robust voice activity detection (rVAD). Pitch information plays an important role in rVAD, and it has been used in existing VAD methods as well. For example, [31] combines pitch continuity with other speech features for detecting voice activity. In [32], long-term pitch divergence is used as the feature for VAD. The big difference here is that rVAD uses pitch as an anchor to locate potential speech regions, rather than as a VAD feature, and in rVAD the actual VAD boundaries are found by using the a posteriori SNR weighted energy distance as the feature. Concerning two-stage denoising, the concept has been applied in the literature, although differently. In [33], two-stage denoising is conducted to handle reverberation and additive noise separately in a supervised fashion by training neural networks. A two-stage Mel-warped Wiener filter approach is presented in [25]. In this work, the two-stage method aims to deal with high-energy noise in the first stage separately. Most of the computation in rVAD lies in pitch detection, while the rest of rVAD is computationally light. Therefore, we present a modified version of rVAD, where the time-consuming pitch detector is replaced by a computationally efficient spectral flatness (SFT) [34, 35, 36] detector. We call the modified algorithm rVAD-fast.
We show that rVAD-fast is significantly (approximately 10 times) faster than rVAD with moderate degradation in VAD performance, which is beneficial when processing a large amount of data or running on devices with computational constraints. We demonstrate the performance of the rVAD method for voice activity detection on two databases consisting of various types of noise: the RATS [37] and Aurora-2 [30] databases. We further evaluate its performance for speaker verification on the RATS database as well as on the RedDots database [38], which we corrupt with additive noise. Experiment results show that rVAD compares favourably with a number of existing methods. The MATLAB source code of the proposed rVAD method (including rVAD-fast) is made publicly available^1, and the Python source code of rVAD-fast is also publicly available^2. It is noted that a slightly different version of the rVAD source code, rVAD1.0, had already been made publicly available before any paper was published to document the rVAD method, and a number of studies have used it, covering applications such as voice activity detection in speaker verification [39, 40, 41], age and gender identification [42], emotion detection and recognition [43, 44, 45], and discovering linguistic structures [46]. A modified version is used for real-time human-robot interaction [47]. In addition, we have made the Python source code for training and testing of GMM-UBM and maximum a posteriori (MAP) adaptation based speaker verification publicly available^3.

The paper is organized as follows. The two-pass denoising method and the proposed rVAD method are presented in Sections 2 and 3, respectively. Section 4 describes rVAD-fast. Experimental results on the Aurora-2 database, the RATS database and the RedDots database are presented in Sections 5, 6 and 7, respectively. Finally, the paper is concluded in Section 8.

2.
Robust VAD in noise

When robustness is of concern, a major challenge presented by real-world applications is that speech signals often contain both stationary and burst-like noise. Burst-like noise is very difficult to detect and remove since it has high energy, and it is rapidly changing and thus hard to estimate. To illustrate this, Fig. 1 shows, from the RATS database [8], two examples of noisy speech signals obtained by corrupting clean speech with different types of communication channel noise. The figure depicts the waveform and spectrogram of the two noisy speech signals and the spectrogram of their corresponding speech signals denoised by minimum statistics noise estimation (MSNE) [48] based spectral subtraction. It is noticeable that high-energy noise is largely kept intact after denoising, except at the beginning of each utterance, where the noise estimate is close to the real noise since there is only high-energy noise at the beginning of an utterance. We further ran preliminary VAD experiments using a classical statistical method [19] on a few files, and the performance is not encouraging (with over 20% frame error rate). These initial tests show that burst-like noise presents a significant challenge to both denoising and VAD methods.

Figure 1: Noisy and denoised speech: (a) noisy speech from Channel A: waveform of noisy speech (first panel), spectrogram of noisy speech (second panel) and spectrogram of denoised speech (third panel); (b) noisy speech from Channel H with the same order of panels as in (a).

Due to the very different characteristics of stationary and burst-like noise, the two types of noise require different types of processing.

^1 https://github.com/zhenghuatan/rVAD
^2 https://github.com/zhenghuatan/rVADfast
^3 https://github.com/zhenghuatan/GMM-UBM_MAP_SV
To deal with this problem, we devise a two-pass denoising method, as detailed in the presentation of rVAD in Section 3. In the first pass, the a posteriori SNR weighted energy difference measure [3] is used to detect high-energy segments. If a high-energy segment is detected and does not contain pitch, the segment is considered non-speech and its samples are set to zero. After detecting and denoising high-energy noise segments, in the second pass, a general speech enhancement method is applied to the first-pass-denoised speech signal in order to remove the remaining noise, which is relatively more stationary. In the classical speech enhancement framework [49], accurate noise estimation is important, and widely used methods include minimum statistics noise estimation (MSNE) [48], minimum mean-square error (MMSE) [50, 51], and minima controlled recursive averaging (MCRA) [52]. Based on an estimate of the additive-noise spectrum, a spectral subtraction (SS) method [53] is then used to subtract the estimate from the noisy speech spectrum. Another interesting speech enhancement method is the one in the advanced front-end [25], which aims at improving speech recognition in noisy environments. In [54] it is found that the AFE enhancement method outperforms MMSE-based methods for noise-robust speech recognition. Recently, DNN-based speech enhancement methods have also been proposed for improving speech intelligibility [55], automatic speech recognition [56] and speaker verification [57, 58]. Unsupervised VAD methods often rely on a predefined or adaptive threshold for making VAD decisions, and finding this threshold is a challenging problem in noisy environments. Therefore, it is not uncommon that a VAD method assumes the presence of speech in every signal (e.g. a signal file in a database) that the method is applied to [11], [14].
This assumption holds for benchmark speech databases, but not in real-world scenarios, where it is possible to have no speech for a long duration. As we know, a speech signal must contain pitch, which motivates us to propose using pitch (or its replacement, spectral flatness) as an anchor or indicator for speech presence: namely, speech is present in a signal if pitch is detected. This leads to the VAD process in this work: detecting extended pitch segments first and then detecting speech activity within those segments. Extended pitch segment detection plays a key role in rVAD. First, it provides an anchor to locate speech segments. Secondly, it makes it possible to exclude a substantial amount of non-speech, potentially noisy, material from a speech signal. Finally, it results in a segment-based approach in which voice activity detection operates on segments, making it easier and more effective to determine a decision threshold, as both speech and non-speech are guaranteed to be present in each segment.

3. rVAD: an unsupervised segment-based VAD method

The block diagram of the proposed rVAD method is shown in Fig. 2. It consists of the following steps: the first-pass denoising (high-energy segment detection, pitch-based noise-segment classification, and setting high-energy noise segments to zero), the second-pass denoising, extended pitch segment detection, and the actual VAD. These steps are detailed in this section. The noise-corrupted speech signal is modelled using the additive noise signal model as

x(n) = s(n) + v(n)    (1)

where x(n) and s(n) represent the noisy and clean speech samples at time n, respectively, and v(n) the sample of additive noise at time n. The signal x(n) is first filtered by a first-order high-pass filter with a cutoff frequency of 60 Hz to remove the DC component and low-frequency noise.
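The high-pass pre-filtering above can be sketched as follows. This is a minimal illustration, not the released rVAD code: the paper only states "first-order high-pass with a 60 Hz cutoff", so the one-pole/one-zero design and the coefficient formula below are our assumptions.

```python
import numpy as np

def highpass_1st_order(x, fs=16000, fc=60.0):
    """Illustrative first-order high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1]).

    The coefficient a is a standard one-pole approximation for cutoff fc
    (an assumption; the paper does not specify the filter design). The
    filter removes DC and attenuates low-frequency noise below ~60 Hz.
    """
    a = 1.0 / (1.0 + 2.0 * np.pi * fc / fs)
    y = np.empty(len(x), dtype=float)
    prev_x, prev_y = 0.0, 0.0
    for n, xn in enumerate(x):
        prev_y = a * (prev_y + xn - prev_x)  # difference of inputs kills DC
        prev_x = xn
        y[n] = prev_y
    return y
```

On a constant (DC) input the output decays geometrically toward zero, which is exactly the DC-removal behaviour the pre-filter is there for.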
For simplicity, this filtering is not included in the equations of this paper. To conduct short-time speech analysis, the signal is partitioned into frames of 25 ms in length with a frame shift of 10 ms, without using pre-emphasis or a Hamming window unless stated otherwise.

3.1. The first pass denoising

In the first pass, high-energy segments are detected by using an a posteriori SNR weighted energy difference measure [3] as follows:

(a) Calculate the a posteriori SNR weighted energy difference of two consecutive frames as

d(m) = sqrt( |e(m) - e(m-1)| * max(SNR_post(m), 0) )    (2)

where m is the frame index, e(m) is the energy of the m-th frame of noisy speech x(n), and SNR_post(m) is the a posteriori SNR, calculated as the logarithmic ratio of e(m) to the estimated energy of the m-th frame of the noise signal v(n):

SNR_post(m) = 10 * log10( e(m) / ẽ_v(m) )    (3)

In Eq. (2), the square root is taken to reduce the dynamic range, which differs from [3], where the square root is not applied. Energy is calculated as the sum of the squares of all samples in a frame. The noise energy ẽ_v(m) is estimated as follows. First, the noisy speech signal x(n) is partitioned into super-segments of 200 frames each (about 2 s): x(p) = s(p) + v(p), p = 1, ..., P, where P is the number of super-segments in an utterance. For each super-segment x(p), the noise energy e_v(p) is calculated as the energy of the frame ranked at 10% of lowest energy within the super-segment. Thereafter, the noise energy ẽ_v(p) is calculated as the smoothed version of e_v(p) with a forgetting factor of 0.9:

ẽ_v(p) = 0.9 * ẽ_v(p-1) + 0.1 * e_v(p)    (4)

The noise energy of the m-th frame, ẽ_v(m), takes the value ẽ_v(p) of the p-th super-segment to which the m-th frame belongs.
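The computation of Eqs. (2)-(4) can be sketched as follows. This is an illustrative NumPy implementation, not the released rVAD code; the function names, the energy floor, and the frame-length defaults (25 ms / 10 ms at an assumed 16 kHz) are our own.

```python
import numpy as np

def frame_energies(x, frame_len=400, hop=160):
    """Energy per frame (sum of squared samples); 25 ms frames, 10 ms hop at 16 kHz."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

def snr_weighted_energy_diff(e, seg_len=200, forget=0.9, floor=1e-10):
    """a posteriori SNR weighted energy difference, Eqs. (2)-(4).

    Per super-segment of 200 frames, the noise energy is the energy of the
    frame ranked at 10% of lowest energy, smoothed across super-segments
    with a forgetting factor of 0.9 (Eq. 4). The floor is an assumed guard
    against log of zero.
    """
    n = len(e)
    e_v_tilde = np.empty(n)
    prev = None
    for p in range(0, n, seg_len):
        seg = np.sort(e[p:p + seg_len])
        ev = seg[max(0, int(0.1 * len(seg)) - 1)]           # 10%-lowest-energy rank
        prev = ev if prev is None else forget * prev + (1 - forget) * ev  # Eq. (4)
        e_v_tilde[p:p + seg_len] = prev
    snr_post = 10.0 * np.log10(np.maximum(e, floor) / np.maximum(e_v_tilde, floor))  # Eq. (3)
    d = np.zeros(n)
    d[1:] = np.sqrt(np.abs(e[1:] - e[:-1]) * np.maximum(snr_post[1:], 0.0))          # Eq. (2)
    return d
```

Note how the SNR weighting behaves as the text describes: in flat low-energy regions both the energy difference and the clipped SNR are near zero, so d(m) stays close to zero; it spikes only at energy transitions with positive a posteriori SNR.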
Figure 2: Block diagram of the rVAD method. (First pass denoising: segmentation based on the a posteriori SNR weighted energy difference, noise-segment classification based on pitch estimation, setting high-energy noise segments to zero. Second pass denoising: speech enhancement. Voice activity detection: extended pitch-segment detection, VAD based on the a posteriori SNR weighted energy difference, post-processing, output.)

(b) Central-smooth the a posteriori SNR weighted energy difference:

d̄(m) = (1 / (2N+1)) * sum_{i=-N}^{N} d(m+i)    (5)

where N = 18.

(c) Classify a frame as a high-energy frame if d̄(m) is greater than a threshold θ_he(m). For each super-segment p (containing 200 frames), θ_he(p) is computed as

θ_he(p) = α * max( e((p-1)*200 + 1), ..., e(m), ..., e(p*200) )    (6)

where α = 0.25. θ_he(m) takes the threshold value θ_he(p) of the p-th super-segment to which the m-th frame belongs. Alternatively, the threshold can be calculated recursively using a forgetting factor.

(d) Consecutive high-energy frames are grouped together to form high-energy segments.

(e) Within a high-energy segment, if no more than two pitch frames are found, the segment is classified as noise and its samples are set to zero.

The motivation for this pass is two-fold: first, to avoid overestimating the noise due to burst-like noise when applying a noise estimator in the second pass denoising; and secondly, to detect and denoise high-energy non-speech parts, which are otherwise difficult for conventional denoising and VAD methods to deal with.

3.2. The second pass denoising

Any speech enhancement method is applicable for the second pass denoising. Three spectral subtraction based speech enhancement methods are considered in this work, relying on different noise estimation approaches: MMSE [51], MSNE [48], and a modified version of MSNE (MSNE-mod).
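Steps (b)-(d) above can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the released rVAD code; the zero-padding at the edges of the smoothing window is an assumption.

```python
import numpy as np

def central_smooth(d, N=18):
    """Eq. (5): average over a centred window of 2N+1 frames (edges zero-padded)."""
    pad = np.pad(d, N)
    kernel = np.ones(2 * N + 1) / (2 * N + 1)
    return np.convolve(pad, kernel, mode="valid")

def high_energy_segments(d_bar, e, seg_len=200, alpha=0.25):
    """Steps (c)-(d): threshold d_bar against alpha * max frame energy per
    super-segment (Eq. 6), then group consecutive high-energy frames into
    (start, end) segments."""
    n = len(d_bar)
    flags = np.zeros(n, dtype=bool)
    for p in range(0, n, seg_len):
        theta = alpha * np.max(e[p:p + seg_len])
        flags[p:p + seg_len] = d_bar[p:p + seg_len] > theta
    segments, start = [], None
    for m, f in enumerate(flags):
        if f and start is None:
            start = m
        elif not f and start is not None:
            segments.append((start, m - 1))
            start = None
    if start is not None:
        segments.append((start, n - 1))
    return segments
```

Step (e), not shown, would then zero the samples of any returned segment containing at most two pitch frames.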
The unbiased noise power estimate in the conventional MSNE can be expressed as

λ̂_v(m, k) = B_min * min( P(m, k), P(m-1, k), ..., P(m-l, k) )    (7)

where m is the frame index, k the frequency bin index, B_min(m, k) the bias compensation factor, P(m, k) the recursively smoothed periodogram, and l the length of the finite window for searching the minimum. In the proposed MSNE-mod, the noise estimate λ̂_v(m, k) is not updated during the detected high-energy noise segments (which are set to zero). In addition, if more than half of the energy is located within the first 7 frequency bins (< 217 Hz), the values of those 7 bins are set to zero to further remove low-frequency noise, in addition to the use of the first-order high-pass filter mentioned earlier.

3.3. Extended pitch segment detection

The VAD algorithm is based on the denoised speech and the pitch information generated from the previous steps. The fundamental assumption is that all speech segments should contain a number of speech frames with pitch. In this algorithm, pitch frames are first grouped into pitch segments, which are then extended from both ends by 60 frames (600 ms), based on speech statistics, in order to include voiced sounds, unvoiced sounds and potentially non-speech parts. This strategy is taken for the following reasons: 1) pitch information is already extracted in the previous steps, 2) pitch is used as an anchor to trigger the VAD process, and 3) many frames can potentially be discarded if an utterance contains a large portion of non-speech segments, which are also non-pitch segments.

3.4. Voice activity detection

The a posteriori SNR weighted energy difference is now applied to each extended pitch segment to make VAD decisions as follows:

(a) Calculate the a posteriori SNR weighted energy difference d'(m) according to Eq. (2). To calculate SNR_post(m) using Eq.
(3), the noise energy ẽ'_v(m) is estimated as the energy ranked at 10% of lowest frame energy within the extended pitch segment.

(b) Central-smooth the a posteriori SNR weighted energy difference with N = 18 as in Eq. (5), resulting in d̄'(m).

(c) Classify a frame as speech if d̄'(m) is greater than the threshold θ_vad:

θ_vad = β * (1/L) * sum_{j=1}^{L} d̄'(j)    (8)

where L is the total number of frames with pitch in the extended pitch segment and the default value of β is 0.4.

(d) Apply post-processing. The assumptions here are that speech frames should not be too far away from their closest pitch frame, and that within a speech segment there should be a certain number of speech frames without pitch. While it is possible to gain improvement by analysing noisy speech as well, we analyse only a few clean speech files that are not included in any experiments in this paper, and then derive and apply the following rules. First, frames that are more than 33 frames away from the pitch segment to the left or more than 47 frames away to the right are classified as non-speech, regardless of the VAD results above, which covers 95% of the cases in those speech files. On the other hand, frames that are within 5 frames to the left or 12 frames to the right of the pitch segment are classified as speech, again regardless of the VAD results, which leaves out 5% of the cases in those speech files. The concept is similar to that of the hangover schemes used in some VAD methods. Furthermore, segments with energy below 0.05 times the overall energy are removed.

4. rVAD-fast based on spectral flatness

As the computational complexity of pitch detection is relatively high, we investigate alternative measures to pitch, for example, spectral flatness (SFT) [34, 35, 59]. Our preliminary study shows that SFT is a good indicator of whether or not there is pitch in a speech frame.
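The extended pitch segment detection of Section 3.3 and the segment-wise decision of Eq. (8) can be sketched as follows. This is an illustrative NumPy sketch with our own function names, not the released rVAD code, and it omits the post-processing rules of step (d).

```python
import numpy as np

def extend_pitch_segments(pitch_flags, n_extend=60):
    """Group consecutive pitch frames into segments, extend each segment by
    n_extend frames (600 ms) on both ends, and merge overlapping results."""
    n = len(pitch_flags)
    segs, start = [], None
    for m, f in enumerate(pitch_flags):
        if f and start is None:
            start = m
        elif not f and start is not None:
            segs.append((start, m - 1))
            start = None
    if start is not None:
        segs.append((start, n - 1))
    extended = [(max(0, s - n_extend), min(n - 1, e + n_extend)) for s, e in segs]
    merged = []
    for s, e in extended:
        if merged and s <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def vad_decision(d_bar, pitch_flags, beta=0.4):
    """Eq. (8): within an extended pitch segment, a frame is speech if the
    smoothed weighted energy difference exceeds beta times its mean over
    the L pitch frames of that segment."""
    pitch_vals = d_bar[pitch_flags.astype(bool)]
    theta = beta * np.mean(pitch_vals)
    return d_bar > theta
```

Because the threshold is computed per extended segment, each decision is made where both speech and non-speech are known to be present, which is the point of the segment-based design.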
Replacing the pitch detector with a simple SFT based voiced/unvoiced speech detector leads to a more computationally efficient algorithm called rVAD-fast. To extract the SFT feature, a Hamming window is first applied to a speech frame before taking the short-time Fourier transform (STFT). After the STFT, the signal is represented in the spectral domain as

X(m, k) = S(m, k) + V(m, k)    (9)

Thereafter, SFT is calculated as

SFT(m) = exp( (1/K) * sum_{k=0}^{K-1} ln|X(m, k)| ) / ( (1/K) * sum_{k=0}^{K-1} |X(m, k)| )    (10)

where |X(m, k)| denotes the magnitude spectrum of the k-th frequency bin for the m-th frame, and K is the total number of frequency bins. As SFT is used as a replacement for pitch in this work, SFT values are compared against a predefined threshold θ_sft to decide whether the corresponding frame is voiced or unvoiced. Figures 3(a) and 3(b) illustrate the spectrogram, pitch labels (1 for pitch and 0 for no pitch) and SFT values of a speech signal from TIMIT (clean) and of a signal from the NIST 2016 SRE evaluation (noisy), respectively. The figures show that if we choose a θ_sft value of 0.5 (i.e. if SFT ≤ 0.5, a frame is said to contain pitch), the labels generated by SFT are close to those generated by the pitch detector. We extensively studied the effect of different threshold values θ_sft on the performance of SFT as a replacement for the pitch detector, which will not be detailed in this paper. Briefly, we compared the output of the SFT detector for different values of θ_sft with the output of the pitch detector on a large number of utterances from various databases, including NIST 2016 SRE (evaluation set) [60], TIMIT [61], RSR2015 [62], RedDots [38], ASVspoof2015 [63] and noisy versions (car, street, market, and white) of ASVspoof2015. It is observed that a threshold value of 0.5 gives the best match between SFT and pitch, and this value is used for the experiments in this paper.
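Eq. (10) can be computed per frame as below. This is a minimal sketch of the SFT computation (geometric over arithmetic mean of the magnitude spectrum), not the released rVAD-fast code; the small floor added before the logarithm is our own guard against log(0).

```python
import numpy as np

def spectral_flatness(frame):
    """Eq. (10): geometric mean over arithmetic mean of the magnitude
    spectrum of a Hamming-windowed frame. Values near 1 indicate a flat
    (noise-like) spectrum; values near 0 indicate a harmonic (voiced)
    spectrum. Per the paper, frames with SFT <= 0.5 are treated as voiced.
    """
    windowed = frame * np.hamming(len(frame))
    mag = np.abs(np.fft.rfft(windowed)) + 1e-12  # floor avoids log(0)
    geo = np.exp(np.mean(np.log(mag)))           # geometric mean
    arith = np.mean(mag)                         # arithmetic mean
    return geo / arith
```

A pure tone concentrates its energy in a few bins and yields a low SFT, while white noise spreads energy across all bins and yields a high SFT, which is why a single threshold separates the two reasonably well.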
Figure 3: Spectrogram, pitch labels (1 for pitch and 0 for no pitch) and SFT values of a speech signal from (a) TIMIT and (b) the NIST 2016 SRE evaluation set.

5. Experiments on the Aurora-2 database

To evaluate the performance of the proposed rVAD method, experiments are conducted on a number of databases and for different tasks. In this section, we compare rVAD with ten existing VAD methods (both supervised and unsupervised) in terms of VAD performance on the test sets of the Aurora-2 database. Aurora-2 [30] has three test sets A, B and C, all of which contain both clean and noisy speech signals. The noisy signals in Set A are generated from clean data by using a filter with a G.712 characteristic and mixing in four noise types (subway, babble, car and exhibition) at SNR values of 20 dB, 15 dB, 10 dB, 5 dB, 0 dB, and -5 dB. Set B is created similarly, with the only difference being that the noise types are restaurant, street, airport and train station. In Set C, clean speech is corrupted by subway and street noise, and a Motorola integrated radio systems (MIRS) characteristic filter is applied instead of that of G.712. The reference VAD labels for the Aurora-2 database are generated with the HTK recognizer [64], which is trained using the training set of Aurora-2. Whole-word models are created for all digits. Each whole-word digit model has 16 HMM states with three Gaussian mixtures per state. The silence model has three HMM states with six Gaussian mixtures per state. A one-state short pause model is tied to the second state of the silence model.
The speech feature consists of 12 MFCC coefficients and logarithmic energy, together with their corresponding Δ and ΔΔ components. In [65], it is confirmed that forced-alignment speech recognition is able to provide accurate and consistent VAD labels, matching closely the transcriptions made by an expert labeler and being better than most of those made by non-expert labelers. The generated reference VAD labels are made publicly available^4.

Several metrics are used to characterize VAD performance. The frame error rate (FER) is defined as

FER = 100 * (#false rejection frames + #false alarm frames) / #total frames    (11)

The false alarm rate P_fa is the percentage of non-speech frames misclassified as speech, and the miss rate P_miss is the percentage of speech frames misclassified as non-speech. The detection cost function (DCF) is defined as

DCF = (1 - γ) * P_miss + γ * P_fa    (12)

where the weight γ is equal to 0.25, which penalizes missed speech frames more heavily. Throughout this paper, the default configuration for rVAD is the one that includes two-pass denoising (with the second pass being MSNE) and the post-processing, which is also the default configuration in the released rVAD source code; this configuration, shown in italic font in tables, is used for all experiments in this paper unless stated otherwise. We do not change the settings and parameters while testing rVAD on different datasets, so as to evaluate its generalization ability, unless a test is specifically for assessing the effects of changing settings, e.g. the threshold β in Eq. (8). For the experiments conducted in this paper, pitch extraction is realized using the pitch estimator in [66], unless stated otherwise. It has been experimentally shown that using Praat pitch extraction [67] gives almost the same VAD results.

5.1.
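The two metrics in Eqs. (11) and (12) can be computed from frame-level labels as below; a small self-contained sketch with our own function name (here P_miss and P_fa are returned as fractions rather than percentages, which leaves the DCF unchanged).

```python
import numpy as np

def vad_metrics(ref, hyp, gamma=0.25):
    """FER (Eq. 11) and DCF (Eq. 12) from per-frame labels (True = speech).

    gamma = 0.25 weights false alarms less than misses, penalizing missed
    speech frames more heavily, as in the paper.
    """
    ref = np.asarray(ref, dtype=bool)
    hyp = np.asarray(hyp, dtype=bool)
    miss = np.sum(ref & ~hyp)   # speech frames classified as non-speech
    fa = np.sum(~ref & hyp)     # non-speech frames classified as speech
    fer = 100.0 * (miss + fa) / len(ref)            # Eq. (11)
    p_miss = miss / max(np.sum(ref), 1)
    p_fa = fa / max(np.sum(~ref), 1)
    dcf = (1 - gamma) * p_miss + gamma * p_fa       # Eq. (12)
    return fer, dcf
```

For example, with 8 frames, 2 missed speech frames and 1 false alarm, FER is 37.5% and DCF is 0.75 * 0.5 + 0.25 * 0.25 = 0.4375.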
Comparison with referenced methods and evaluation of different configurations

Table 1 presents the VAD results, averaged over the three test sets of Aurora-2 (70070 speech files in total), for various methods. VQVAD [11] first applies a speech enhancement method and an energy VAD to a test utterance in order to automatically label a small subset of MFCCs as speech or non-speech; afterwards, these MFCCs are used to train speech and non-speech codebooks using k-means, and all frames in the utterance are labeled using nearest-neighbor classification. Sohn et al. VAD [19] is an unsupervised VAD method based on a statistical likelihood ratio test, which uses Gaussian models to represent the distributions of speech and non-speech energies in individual frequency bands; the method also uses a hangover scheme. The VoiceBox toolkit^5 implementation of Sohn et al. VAD is used in this study. Kaldi VAD is the Kaldi toolkit's [68] energy-based VAD, for which we use the default parameters (--vad-energy-threshold=5.5, --vad-energy-mean-scale=0.5) as included in the SRE16 script. Note that Kaldi is widely used open-source software for speaker and speech recognition. Results for the G.729, G.723.1 and MFB VAD methods are cited from [17], results for the LTSV and GMM-NLSM methods are from [27], and results for the DSR-AFE and variable frame rate (VFR) methods are from [3]. The comparison in this table is conducted in terms of frame error rate (FER), since results of LTSV and GMM-NLSM are only available in terms of FER. Note that identical experimental settings and labels are used across [3, 17, 27] and the present work, so the comparison is valid.

4: https://github.com/zhenghuatan/rVAD
5: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
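The metrics defined in Eqs. (11) and (12) are straightforward to compute from frame-level labels; a minimal sketch (with made-up labels) is:

```python
import numpy as np

def vad_metrics(ref, hyp, gamma=0.25):
    """FER (Eq. 11), P_miss, P_fa and DCF (Eq. 12) for frame-level
    VAD labels, where 1 marks speech and 0 marks non-speech."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    p_miss = 100.0 * np.mean(hyp[ref == 1] == 0)   # speech -> non-speech
    p_fa = 100.0 * np.mean(hyp[ref == 0] == 1)     # non-speech -> speech
    fer = 100.0 * np.mean(ref != hyp)
    dcf = (1 - gamma) * p_miss / 100 + gamma * p_fa / 100
    return fer, p_miss, p_fa, dcf

ref = [1, 1, 1, 1, 0, 0, 0, 0]          # reference labels
hyp = [1, 1, 0, 1, 0, 1, 0, 0]          # one miss, one false alarm
fer, p_miss, p_fa, dcf = vad_metrics(ref, hyp)
print(fer, p_miss, p_fa, dcf)  # 25.0 25.0 25.0 0.25
```

With γ = 0.25, the miss term carries weight 0.75, matching the weighting stated after Eq. (12).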
The results in Table 1 clearly show that rVAD with the default configuration gives a significantly lower average FER than the compared VAD methods. It outperforms all referenced VAD methods, by large margins, under all SNR levels including the clean condition. The closest one is the VFR VAD [3], our previous work, which also uses the a posteriori SNR weighted energy distance as the feature for the VAD decision. The GMM-NLSM VAD [27] provides good performance as well, but still with a 3% (absolute) higher FER compared with rVAD; furthermore, it should be noted that GMM-NLSM is a supervised VAD in which the GMMs are trained using the multicondition training data of the Aurora-2 database. The next one in line is the VAD method in the DSR AFE frontend [25], which is an unsupervised VAD and gives a more than 5% (absolute) higher FER than that of rVAD. DSR AFE performs well under low SNRs, for example at 0 dB and -5 dB, with FERs close to those of rVAD; from 5 dB and above, however, its performance is far below that of rVAD. The remaining compared methods have even much higher FERs; for example, VQVAD [11] gives a 36% FER, compared with an 11% FER achieved by rVAD.

Table 1: Comparison of rVAD (with the default configuration) with other methods on the test datasets of Aurora-2.

Methods                     Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB  Avg. FER (%)
VQVAD [11]                  17.26  33.64  36.34  38.77  41.05  43.03  46.07  36.59
G.729 [28]                  12.84  24.53  26.13  27.38  29.13  32.23  35.21  26.78
Sohn et al. [19]            17.37  20.16  21.97  24.48  27.96  33.12  39.76  26.40
Kaldi Energy VAD [68]        9.88  26.54  26.61  26.62  26.62  26.62  26.62  24.22
G.723.1 [29]                19.45  21.31  23.29  24.44  26.30  26.56  28.58  23.99
MFB [17]                     6.92  15.39  17.70  20.12  22.75  26.16  31.09  20.02
LTSV [26]                    9.50  15.90  16.80  18.20  21.00  26.10  28.80  19.50
DSR AFE [25]                18.41  15.16  14.96  14.59  14.54  15.62  22.08  16.48
GMM-NLSM (supervised) [27]  10.95  11.20  11.43  11.73  13.35  17.44  23.52  14.23
VFR [3]                      8.10   8.30   9.00  10.60  13.50  19.50  28.20  13.90
rVAD (default)               6.90   7.30   7.64   8.43  11.09  16.01  21.48  11.26

Table 2 compares the results of various configurations of rVAD. It shows that, when evaluating on the Aurora-2 database, the first-pass denoising of rVAD does not make a difference; the reason is that the burst-like high-energy noise prominently present in the RATS database has much less presence in Aurora-2, while the aim of the first-pass denoising is to remove burst-like noise. The second-pass denoising with MSNE or MMSE is able to boost the performance of rVAD, and there is almost no performance difference between the two enhancement methods. MSNE-mod, however, does not perform as well as MSNE, the reason being that MSNE-mod is tailored to the special noise characteristics of the RATS database. The post-processing step is also shown to be important for improving the performance of rVAD.

Table 2: Comparison of rVAD with different configurations on the test datasets of Aurora-2.

Methods                 1st-pass  2nd-pass   Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB  Avg. FER (%)
rVAD (default)             X      MSNE        6.90   7.30   7.64   8.43  11.09  16.01  21.48  11.26
(two-pass denoising)       X      MMSE        6.96   7.35   7.75   8.58  11.02  15.93  21.55  11.30
                           X      MSNE-mod    6.86   7.07   7.44   8.68  13.13  19.73  25.56  12.63
rVAD (w/o denoising)       ×      ×           6.89   7.37   7.69   8.81  13.07  19.05  23.23  12.30
rVAD (w/o 1st-pass         ×      MSNE        6.89   7.30   7.64   8.45  11.01  15.88  21.44  11.23
denoising)                 ×      MMSE        6.95   7.35   7.76   8.60  11.00  15.85  21.51  11.28
                           ×      MSNE-mod    6.85   7.06   7.44   8.66  13.07  19.59  25.52  12.59
rVAD (w/o postproc)        X      MSNE        8.82   9.53   9.78  10.05  11.32  15.47  21.59  12.37

Overall, the experimental results demonstrate the effectiveness of the rVAD method on the Aurora-2 database, which is a very different database from the RATS database that rVAD was originally devised for. This confirms the good generalization ability of rVAD.

5.2. Sensitivity of rVAD with changing threshold β on VAD performance

In rVAD, β in Eq. (8) is an important thresholding parameter that controls the aggressiveness of rVAD; the larger the value of β, the more aggressive the method. Table 3 presents VAD results, in terms of miss rate P_miss, false alarm rate P_fa and FER, of rVAD with various β values. Apart from the varying β values, the default configuration of rVAD is used. For small β values, P_miss is small while P_fa is large, and vice versa. In terms of FER, a threshold value of 0.4 gives the best performance, while a change of ±0.1 in β only marginally changes the performance. The results confirm the stability of rVAD with respect to changing this threshold. In Table 3, we additionally compare rVAD with several other methods, now also in terms of P_miss and P_fa, whereas in Table 1 only FER performance is compared. This allows us to compare rVAD with other methods in terms of P_fa while fixing P_miss, or the other way around. Table 3 clearly shows that rVAD outperforms the referenced methods by large margins.
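The P_miss/P_fa trade-off that Table 3 reports for varying β can be illustrated in miniature with a synthetic threshold sweep; the per-frame scores below are artificial stand-ins for rVAD's decision measure:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic per-frame scores: speech frames tend to score higher.
ref = np.r_[np.ones(500, dtype=int), np.zeros(500, dtype=int)]
scores = np.r_[rng.normal(0.6, 0.2, 500), rng.normal(0.3, 0.2, 500)]

p_miss_list, p_fa_list = [], []
for beta in (0.2, 0.4, 0.6):
    hyp = (scores > beta).astype(int)          # more aggressive as beta grows
    p_miss = 100.0 * np.mean(hyp[ref == 1] == 0)
    p_fa = 100.0 * np.mean(hyp[ref == 0] == 1)
    p_miss_list.append(p_miss)
    p_fa_list.append(p_fa)
    print(f"beta={beta:.1f}  P_miss={p_miss:5.1f}%  P_fa={p_fa:5.1f}%")
```

Raising the threshold trades false alarms for misses, the same monotone behaviour seen in Table 3.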
Table 3: Performance of rVAD with various threshold β values in comparison with several referenced methods on the Aurora-2 database.

Method                 Threshold (β)   P_miss (%)  P_fa (%)  Avg. FER (%)
rVAD                   0.1              1.21       52.90     14.97
                       0.2              2.16       42.43     12.87
                       0.3              3.21       35.11     11.70
                       0.4 (default)    4.86       28.91     11.26
                       0.5              7.54       23.59     11.81
                       0.6             11.41       19.03     13.44
                       0.7             16.48       15.25     16.15
VQVAD [11]             -                7.02       47.32     36.59
Sohn et al. [19]       -               17.31       51.48     26.40
Kaldi Energy VAD [68]  -                1.53       86.75     24.22
DSR AFE [25]           -                3.32       56.61     16.48

5.3. Evaluation of rVAD-fast

Table 4 compares the performance of the rVAD-fast method with rVAD on the test datasets of Aurora-2 in terms of FER and processing time. It can be seen that rVAD-fast is approximately an order of magnitude faster than rVAD, at the cost of moderate performance degradation. This presents an advantage when processing a large number of speech files or running on low-resource devices. For measuring the processing time, the algorithms were run on a desktop computer with an Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz and 16 GB RAM, and we measured the CPU time.

Table 4: Comparison of rVAD-fast with rVAD for voice activity detection on the test datasets of Aurora-2 in terms of FER and average CPU processing time.

Methods                 Clean  20 dB  15 dB  10 dB   5 dB   0 dB  -5 dB  Avg. FER (%)  Avg. time (sec/file)  Speed-up of rVAD-fast
rVAD (MSNE, default)     6.90   7.30   7.64   8.43  11.09  16.01  21.48  11.26         0.1253                -
rVAD-fast (MSNE)         7.25   8.75   9.15  10.07  12.37  17.35  25.18  12.87         0.0085                ≈14×
rVAD (MMSE)              6.96   7.35   7.75   8.58  11.02  15.93  21.55  11.30         0.1181                -
rVAD-fast (MMSE)         7.26   8.78   9.22  10.20  12.61  17.71  25.95  13.10         0.0068                ≈17×
rVAD (MSNE-mod)          6.86   7.07   7.44   8.68  13.13  19.73  25.56  12.63         0.1206                -
rVAD-fast (MSNE-mod)     7.25   8.71   9.30  11.29  15.31  21.01  30.40  14.75         0.0135                ≈9×

6.
Experiments on the RATS database

In this section, rVAD is evaluated against nine different VAD methods (both supervised and unsupervised) on the RATS database [37, 69], which consists of audio recordings selected from several existing and new data sources. All recordings are re-transmitted through 8 different noisy communication channels, labelled by the letters A through H. The database is for evaluating methods such as voice activity detection, speaker identification, language identification and keyword spotting.

6.1. Illustrative VAD and denoising results of rVAD

Figure 4 illustrates the results of rVAD for two speech recordings taken from the RATS database. It shows that rVAD is able to remove both stationary and burst-like noise and performs well in terms of VAD.

Figure 4: Illustrative results of the proposed rVAD method on speech signals from the RATS database: (a) noisy speech from Channel A, with four panels presenting the waveform of the original signal, its spectrogram, the spectrogram of the denoised signal and the VAD results, where green and red colours represent true labels and algorithm outputs, respectively; (b) noisy speech from Channel H with the same order of panels as in (a).

6.2. VAD results

The data used for this evaluation contains 8 × 25 = 200 speech files randomly selected from the RATS database, amounting to approximately 44 hours of speech in total, which is a sufficient quantity for evaluating VAD methods. Table 5 compares rVAD in the default configuration with other VAD methods. Table 5 shows that rVAD gives a substantially lower FER than the Sohn et al. [19], Kaldi [68], VQVAD [11] and SSGMM [12] VAD methods.
SSGMM [12] is included neither in the experiments for the Aurora-2 database nor for the RedDots database, as SSGMM does not work for short utterances (due to the data sparsity issue caused by short utterances when training GMMs), as also stated in [12]; it produces empty label files for Aurora-2 and RedDots utterances. Kaldi VAD provides the lowest value of P_miss, but with significantly higher P_fa and FER values; P_fa = 86% indicates that Kaldi VAD classifies most frames as speech. Apparently, simple energy-threshold based VAD methods do not work well in highly noisy environments. VQVAD, a widely used and well-performing VAD method in the speaker verification domain, gives as high as 36% FER. It is interesting to note that VQVAD and Sohn et al. achieve 36% and 26% FER, respectively, on both the Aurora-2 database and the RATS database, as shown in Tables 1 and 5.

Table 5: Comparison of VAD performance of the proposed rVAD in the default configuration with other methods on the RATS database.

Methods                 P_miss (%)  P_fa (%)  FER (%)
Kaldi Energy VAD [68]     3.10      85.98     51.62
DSR AFE [25]              4.17      76.24     46.36
VQVAD [11]               41.65      32.85     36.49
Sohn et al. [19]         30.36      24.51     26.93
SSGMM [12]               30.36      13.24     20.33
rVAD (default)           17.00       5.52     10.27

Concerning the different configurations of rVAD, Table 6 shows that the first-pass denoising improves the performance slightly. The modified MSNE, which is tailored towards the RATS database, boosts the VAD performance modestly, while MSNE and MMSE do not. The best performance is given by first-pass denoising combined with modified-MSNE based second-pass denoising. Overall, the improvement brought by denoising is modest; the reason could be that voice activity detection is conducted on the basis of the extended pitch segments, from which the non-speech segments that do not contain pitch have already been excluded during the process of detecting pitch segments.
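The extended pitch segments referred to above are formed, as described in the method overview, by grouping neighbouring frames with pitch into segments and then extending each segment from both ends. A minimal sketch of that grouping and extension follows; the fixed extension length n_ext is an illustrative stand-in for rVAD's speech-statistics-based extension rule:

```python
import numpy as np

def extend_pitch_segments(pitch_flags, n_ext):
    """Group consecutive pitch frames into segments and extend every
    segment by n_ext frames at both ends. n_ext is an illustrative
    fixed value; rVAD derives its extension from speech statistics."""
    flags = np.asarray(pitch_flags, dtype=int)
    out = np.zeros_like(flags)
    d = np.diff(np.r_[0, flags, 0])          # +1 at starts, -1 at ends
    starts = np.where(d == 1)[0]
    ends = np.where(d == -1)[0]              # exclusive end indices
    for s, e in zip(starts, ends):
        out[max(0, s - n_ext):min(flags.size, e + n_ext)] = 1
    return out

flags = [0] * 20 + [1] * 15 + [0] * 30 + [1] * 10 + [0] * 25
ext = extend_pitch_segments(flags, n_ext=5)
print(int(ext.sum()))  # 45: each segment grows by 5 frames on each side
```

The extension is what lets the segments cover unvoiced sounds (and likely some non-speech) adjacent to the voiced frames.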
Table 6: Comparison of VAD performance of the proposed rVAD with different configurations on the RATS database.

Methods                    1st-pass  2nd-pass   P_miss (%)  P_fa (%)  FER (%)
rVAD (default)                X      MSNE        17.00      5.52      10.27
(with two-pass denoising)     X      MMSE        16.67      5.84      10.32
                              X      MSNE-mod    17.63      4.82      10.12
rVAD (w/o denoising)          ×      ×           16.30      6.59      10.61
rVAD (w/o 1st-pass            ×      MSNE        17.88      5.63      10.71
denoising)                    ×      MMSE        18.21      5.56      10.80
                              ×      MSNE-mod    17.52      5.16      10.28
rVAD (w/o postproc)           X      MSNE        18.07      8.29      12.34

We study the sensitivity of rVAD to the threshold β in Eq. (8) for voice activity detection. Table 7 presents the results on the RATS database when using different values of β; other than this, the default configuration of rVAD is used. It is observed that increasing the value of the VAD threshold β makes rVAD more aggressive, i.e. P_miss increases and P_fa decreases, and thus fewer frames are classified as speech. The β value should be chosen according to the application at hand. Furthermore, the results show that rVAD performs well over a wide range of values of the threshold β, demonstrating its stability.

Table 7: Performance of rVAD with various threshold β values on the RATS database.

Method        Threshold (β)   P_miss (%)  P_fa (%)  FER (%)
rVAD (MSNE)   0.1              8.85       10.58      9.86
              0.2             11.36        8.16      9.48
              0.3             14.04        6.60      9.68
              0.4 (default)   17.00        5.52     10.27
              0.5             20.33        4.70     11.17
              0.6             24.16        4.02     12.36
              0.7             28.79        3.44     13.94

6.3. VAD results for the NIST 2015 OpenSAD challenge

In [15], the rVAD method is compared with several methods on the NIST 2015 OpenSAD challenge [70], which is also based on the RATS database. Table 8 cites results for several methods from [15], our joint work with several other teams. The GMM-MFCC and GMM-PNCC methods are similar to [11], but use GMMs trained with maximum likelihood instead of codebooks, which leads to some improvement. PNCCs [71] are known to be robust against noise.
The i-vector based VAD is a supervised VAD [21]. rVAD shows a performance substantially better than those of Sohn et al., GMM-MFCC and GMM-PNCC, and a performance very close to that of the supervised i-vector method trained on the NIST 2015 OpenSAD challenge data.

Table 8: VAD results [15] of several methods on the Dev data of the NIST 2015 OpenSAD challenge.

Methods                     P_fa (%)  P_miss (%)  DCF
GMM-MFCC [15]                6.15     43.17       0.1540
GMM-PNCC [15]                7.72     17.14       0.1008
Sohn et al. [19]             6.35     38.89       0.1449
i-vector (supervised) [21]   2.77     10.09       0.0460
rVAD (MSNE-mod)              4.78      5.75       0.0502

6.4. Speaker verification results on the RATS database

In [72], the rVAD method was applied as a preprocessing step of text-independent speaker verification (TI-SV) systems built for the RATS database under the DARPA RATS project. The TI-SV systems use 60-dimensional MFCCs as features and a 600-dimensional i-vector [73], together with probabilistic linear discriminant analysis (PLDA) based scoring, as the speaker verification back-end. Praat pitch extraction [67] was used for the rVAD method in this experiment. rVAD was first compared with a neural network based supervised VAD method developed by Brno University of Technology (BUT-VAD) [74]. The neural network deployed in BUT-VAD has 9 outputs for speech and 9 outputs for non-speech, each of which corresponds to one of the 9 channels (one source and 8 retransmitted). The outputs are smoothed and merged into speech and non-speech segments using a hidden Markov model (HMM) with Viterbi decoding. The neural network was trained on RATS data defined for the VAD task. Another supervised VAD method is GMM-PLP-RASTA, where perceptual linear predictive (PLP) coefficients with RASTA-based [75] cepstral mean normalization applied are used as the feature, and two 2048-component GMMs (one for speech and another for non-speech), trained on RATS, are used as the models for VAD.
Evaluation was conducted on a development set with a 30s-30s enrolment and test condition, and the evaluation criterion for speaker verification is the equal error rate (EER) [72]. The unsupervised rVAD (MSNE-mod) method and the supervised BUT-VAD and GMM-PLP-RASTA methods [74] yield EERs of 5.6%, 5.4% and 6.7%, respectively. rVAD (MSNE-mod) performs marginally worse than the supervised BUT-VAD trained on the RATS database, but better than the supervised GMM-PLP-RASTA, also trained on the RATS database. For more details about the systems, see [72].

7. Experiments on the RedDots database for TD-SV

In this section, we compare rVAD and rVAD-fast with other VAD methods in the context of text-dependent speaker verification (TD-SV) on the male part-01 task of the RedDots 2016 challenge database [38], which consists of short utterances and is one of the most used databases for speaker verification. The database was collected in many countries, mostly in office environments, hence introducing great diversity in terms of speakers and channels. There are 320 target models for training, and each target has 3 speech signals/sessions for building its particular model. For the assessment of TD-SV, there are three types of non-target trials: target-wrong (TW) (29,178 trials), impostor-correct (IC) (120,086 trials) and impostor-wrong (IW) (1,080,774 trials). The modelling method for SV adopts the Gaussian mixture model-universal background model (GMM-UBM), since GMM based methods are known to outperform the i-vector technique for SV with short utterances. MFCCs of 57 dimensions (including static, ∆ and ∆∆) with RASTA filtering [75] are extracted from speech signals with a 25 ms Hamming window and a 10 ms frame shift. After VAD, detected speech frames are normalized to have zero mean and unit variance at the utterance level.
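The utterance-level mean and variance normalization applied to the detected speech frames can be sketched as:

```python
import numpy as np

def utterance_mvn(feats):
    """Normalize feature vectors (one row per detected speech frame)
    to zero mean and unit variance over the whole utterance."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / np.maximum(sigma, 1e-8)  # guard flat dimensions

rng = np.random.default_rng(2)
feats = rng.normal(5.0, 3.0, size=(200, 57))    # e.g. 57-dim MFCC frames
norm = utterance_mvn(feats)
print(bool(np.allclose(norm.mean(axis=0), 0.0)) and
      bool(np.allclose(norm.std(axis=0), 1.0)))  # True
```

Computing the statistics per utterance (rather than globally) makes the normalization robust to channel offsets that vary between recordings.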
A gender-independent GMM-UBM, consisting of 512 mixtures with diagonal covariance matrices, is trained using non-target data from TIMIT over 630 speakers (6300 utterances). Target models are derived from the GMM-UBM with 3 iterations of MAP adaptation (with relevance factor 10.0, applied only to the Gaussian mean vectors) using the training data of the particular target model. During the test, the feature vectors of a test utterance are scored against the target model and the GMM-UBM to calculate the log-likelihood ratio. Table 9 presents the TD-SV performance for the different methods.

Table 9: Performance of VAD methods for text-dependent speaker verification on RedDots part-01 (male). [%EER/minDCF×100]

Methods                  Target-wrong (TW)  Impostor-correct (IC)  Impostor-wrong (IW)  Average
no VAD                   6.36/3.071         2.82/1.400             1.26/0.525           3.48/1.665
Kaldi Energy VAD [68]    6.26/2.662         3.62/1.625             1.67/0.618           3.85/1.635
Sohn et al. [19]         4.78/2.492         2.40/1.200             1.01/0.295           2.73/1.329
VQVAD [11]               3.70/1.652         2.94/1.520             0.89/0.284           2.51/1.152
rVAD (MSNE, default)     3.79/1.572         2.93/1.328             1.01/0.273           2.58/1.058
rVAD (MMSE)              4.16/1.523         3.02/1.329             0.92/0.300           2.70/1.050
rVAD (MSNE-mod)          3.82/1.498         2.83/1.290             0.98/0.284           2.54/1.024
rVAD-fast (MSNE)         4.10/1.682         2.86/1.228             0.92/0.272           2.63/1.061
rVAD-fast (MMSE)         5.09/2.242         2.74/1.276             1.14/0.383           2.99/1.300
rVAD-fast (MSNE-mod)     3.97/1.705         2.64/1.225             0.95/0.283           2.52/1.071

The experiment results in Table 9 show that all VAD methods (except for Kaldi VAD) are able to outperform the system without VAD in terms of either EER or minDCF. This observation is in line with the well-known fact that VAD is useful in speaker verification. rVAD overall performs better than Sohn et al. and Kaldi Energy VAD, and is comparable to or marginally better than VQVAD (almost identical in EER and slightly better in minDCF).
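The GMM-UBM scoring described above computes, for each test utterance, an average log-likelihood ratio between the MAP-adapted target model and the UBM. A minimal numpy sketch with tiny, hypothetical diagonal-covariance models (not the 512-mixture models used in the paper) is:

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.
    x: (T, D); weights: (M,); means, variances: (M, D)."""
    diff = x[:, None, :] - means[None, :, :]                     # (T, M, D)
    log_comp = (np.log(weights)
                - 0.5 * np.sum(diff ** 2 / variances
                               + np.log(2 * np.pi * variances), axis=2))
    m = log_comp.max(axis=1, keepdims=True)                      # log-sum-exp
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def llr_score(x, target, ubm):
    """Utterance score: mean_t [log p(x_t | target) - log p(x_t | UBM)]."""
    return float(np.mean(diag_gmm_loglik(x, *target) - diag_gmm_loglik(x, *ubm)))

rng = np.random.default_rng(3)
D = 4
ubm = (np.array([0.5, 0.5]), np.zeros((2, D)), np.ones((2, D)))
# A toy "MAP-adapted" target: UBM means shifted towards the speaker.
target = (np.array([0.5, 0.5]), np.full((2, D), 0.8), np.ones((2, D)))

score_target = llr_score(rng.normal(0.8, 1.0, (100, D)), target, ubm)  # matched
score_other = llr_score(rng.normal(0.0, 1.0, (100, D)), target, ubm)   # impostor
print(score_target > score_other)  # True
```

Thresholding such scores yields the accept/reject decisions from which EER and minDCF are computed.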
rVAD-fast shows a similar performance to rVAD in terms of SV, although its VAD performance (as observed in Table 4) is worse than that of rVAD. Considering the huge VAD performance gap (by a factor of more than three in FER) between rVAD and VQVAD, as shown in Tables 1 and 5, we conclude that superior performance in VAD does not necessarily translate into an improvement in SV performance. This can be explained by the fact that SV is a task of making one single decision based on the entire sequence/utterance, which differs from the VAD task, where each short segment matters. VQVAD is specially optimized for SV.

7.1. Sensitivity of rVAD threshold β on speaker verification performance

Table 10 shows the effect of varying the threshold value β in Eq. (8) of rVAD on TD-SV performance. MSNE is used for the second-pass denoising and both passes are applied. The results show that the performance of rVAD does not change rapidly with changing the value of β, demonstrating the stability of rVAD. Results obtained but not included in Table 10 also show that rVAD with MSNE-mod performs slightly better than rVAD with MSNE.

Table 10: Performance of rVAD with various threshold β values for speaker verification on RedDots part-01 (male). [%EER/minDCF×100]

rVAD (MSNE)
Threshold (β)   Target-wrong (TW)  Impostor-correct (IC)  Impostor-wrong (IW)  Average
0.1             4.08/1.591         2.77/1.243             1.07/0.266           2.64/1.033
0.2             3.91/1.534         2.80/1.242             1.11/0.278           2.61/1.018
0.3             3.86/1.547         2.82/1.249             1.01/0.275           2.56/1.024
0.4 (default)   3.79/1.572         2.93/1.328             1.01/0.273           2.58/1.058
0.5             4.36/1.646         3.26/1.451             1.14/0.331           2.92/1.143
0.6             4.19/1.674         3.20/1.460             1.26/0.325           2.88/1.153
0.7             4.45/1.735         3.54/1.540             1.29/0.388           3.09/1.221

7.2.
Text-dependent speaker verification under noise conditions

We further evaluate the performance of the VAD methods for TD-SV under mismatched conditions, where noisy test utterances are scored against speaker models trained under the clean condition (office environment). In order to cover different real-world scenarios, various types of noise are artificially added to the test data at various SNR values. The scaling factor is calculated using the ITU speech voltmeter [76]. TD-SV results of the different VAD methods are presented in Table 11. It is observed that TD-SV performance is significantly degraded under noisy conditions, as expected. rVAD achieves mostly lower EER values than Kaldi, Sohn et al. and no-VAD (i.e., without using VAD) over the different noise types and SNR values, but performs slightly worse than VQVAD, which is specially designed for SV. Sohn et al. VAD provides a decent improvement as well. Kaldi Energy VAD degrades the performance compared with no-VAD, as in the clean condition. rVAD-fast gives comparable performance to that of rVAD under babble, market and car noise, but it does not work under white noise, as the spectral flatness measure is severely affected by white noise. A closer analysis shows that for white noise from 0 dB through 10 dB, spectral flatness values are mostly close to 1.0, as illustrated in Fig. 5, due to the similar amount of power in all spectral bands. Therefore, the threshold value of θ_sft = 0.5 does not output any speech frames with pitch (and thus no speech frames) for most of the noisy test trials in SV; in this experiment, we consider these trials without speech frames as an SV error or misclassification. The numbers of these trials (without any speech frame being detected) are shown
When calculating EER, genuine/true trials without sp eech frames are directly rejected (false rejection) b y assigning the low est score v alue a v ailable in the non-target trials to these gen uine trials and vice-versa, namely non-target trails without sp eech frames are directly accepted (false alarm) b y assigning the highest score v alue av ailable in the genuine trials to these non-target trials. T able 11: TD-SV p erformanc e of differ ent V AD metho ds under noisy test environments on R e dDots p art-01 (male). Numb ers in p ar enthesis show numb ers of test utter anc es (out of total 3854 unique utter anc es) that do not yield any sp e e ch fr ames by the r esp e ctive V AD methods. Noise SNR % Av erage EER [TW, IC, IW] across V AD methods type (dB) no Kaldi Sohn VQ rV AD rV AD-fast V AD et al. V AD MSNE MMSE mod. MSNE MMSE mod. (default) MSNE MSNE Clean - 3.48 3.85 2.73 2.51 2.58 2.70 2.54 2.63(2) 2.99(2) 2.52 (2) White 0 35.23 37.82 37.05(119) 30.18 34.46 34.44 35.29(65) 99.34(3796) 99.38(3796) 99.29(3796) 05 26.78 30.63 26.68(33) 21.98 25.49 24.82 25.58(11) 68.05(2086) 68.05(2080) 67.84(2079) 10 18.88 22.19 17.08 14.57 16.42 16.90 17.22 25.26(333) 25.26(326) 25.25(335) 15 12.47 15.28 10.50 9.28 10.20 10.33 10.41 13.04(61) 13.13(81) 13.30(63) 20 8.25 9.90 6.59 5.97 6.64 6.60 6.62 6.85(3) 7.12(3) 7.05(3) Babble 0 35.02 34.42 33.25 30.91 32.88 32.78 33.10 33.01 33.44 33.21 05 25.17 24.19 22.89 20.78 21.84 22.32 22.36 22.84 23.79 21.90 10 16.45 15.61 14.17 12.48 13.49 13.72 13.60 14.01(1) 14.43 13.61 15 10.30 10.26 8.48 7.55 8.19 8.47 8.24 8.37(2) 8.65 8.41 20 6.34 6.94 5.24 4.66 5.28 5.31 5.17 5.46(2) 5.43(2) 5.59(2) Market 0 25.39 26.24 24.67 22.79 23.83 24.15 24.09(1) 24.37 24.49 23.94 05 16.84 17.41 15.51 14.09 15.07 15.33 15.16 15.81 15.38 15.11 10 10.58 11.11 9.02 8.40 8.84 9.06 9.11 9.23 9.37 9.33 15 6.58 7.45 5.65 5.06 5.61 5.90 5.73 6.00(1) 6.06(1) 5.97(1) 20 4.82 5.44 3.90 3.60 3.99 4.10 3.98 4.03(2) 4.21(2) 4.04(2) Car 0 4.10 5.60 3.70 3.53 3.74 
3.95 3.75 3.75 3.74 3.69 05 3.67 5.01 3.20 3.09 3.23 3.39 3.24 3.24 3.23 3.18 10 3.33 4.61 3.00 2.90 2.90 3.05 2.96 2.92 2.93 2.88 15 3.24 4.36 2.80 2.72 2.77 2.85 2.71 2.77 2.78 2.71 20 3.18 4.18 2.70 2.58 2.62 2.71 2.60 2.63(1) 2.70(1) 2.56 (1) Average 13.83 14.93 12.80 11.35 12.37 12.50 12.54 18.54 18.67 18.44 0 200 400 600 Frames 0.4 0.6 0.8 1 Spectral flatness Speech with white-noise, SNR=0dB 0 200 400 600 Frames 0.4 0.6 0.8 1 Spectral flatness Speech with white-noise, SNR=05dB 0 200 400 600 Frames 0.4 0.6 0.8 1 Spectral flatness Speech with white-noise, SNR=10dB 0 200 400 600 Frames 0.4 0.6 0.8 1 Spectral flatness Speech with white-noise, SNR=20dB Figure 5: Sc atter plots of SFT values of a sp e e ch signal with white-noise for differ ent SNR values. 18 7.3. Sensitivity of rV AD thr eshold β on sp e aker verific ation p erformanc e under noisy c onditions T able 12 shows the effect of v arying threshold β in Eq. (8) of rV AD on the p erformance of TD-SV under noisy test conditions. It is observed that rV AD is stable tow ards v arying the threshold v alue. T able 12: Performanc e of rV AD with various thr eshold β values for sp e aker verific ation on R e dDots p art-01 (male) under noisy c onditions. 
Test condition  SNR (dB)   Threshold (β), % average EER (TW, IC, IW)
                           0.1    0.2    0.3    0.4 (default)  0.5    0.6    0.7    0.8    0.9
White           0          34.55  34.37  34.16  34.46          34.41  33.91  33.47  34.27  34.76
                5          25.79  25.26  25.09  25.49          25.40  25.28  24.47  25.62  25.39
                10         17.40  16.87  16.71  16.42          16.75  16.95  16.11  17.27  17.63
                15         10.87  10.66  10.61  10.20          10.42  10.65  10.10  11.20  11.69
                20          6.79   6.71   6.49   6.64           6.79   6.74   6.64   7.45   7.78
Babble          0          33.79  33.02  33.07  32.88          32.81  32.34  31.71  31.63  32.13
                5          23.85  23.45  22.65  21.84          21.86  21.42  20.84  21.49  21.95
                10         14.82  14.42  13.88  13.49          13.66  13.15  12.79  13.60  13.72
                15          8.92   8.56   8.36   8.19           8.30   7.97   7.95   8.37   8.59
                20          5.82   5.54   5.26   5.28           5.28   5.30   5.37   5.86   6.28
Market          0          24.43  23.87  23.96  23.86          23.97  23.08  23.36  24.09  24.53
                5          15.54  15.22  14.75  15.07          15.18  14.68  14.36  15.21  15.55
                10          9.62   9.02   8.76   8.84           9.02   9.01   8.83   9.71   9.88
                15          6.21   5.85   5.71   5.61           6.02   5.86   5.94   6.61   6.79
                20          4.20   4.10   3.99   3.99           4.25   4.15   4.51   4.82   5.09
Car             0           3.74   3.69   3.76   3.74           4.05   4.05   4.30   5.02   5.21
                5           3.26   3.21   3.26   3.23           3.60   3.60   3.78   4.51   4.86
                10          2.94   2.87   2.97   2.90           3.30   3.30   3.52   4.17   4.49
                15          2.71   2.75   2.72   2.77           3.07   3.09   3.29   3.97   4.14
                20          2.65   2.63   2.63   2.62           2.92   2.92   3.14   3.79   4.00
Average                    12.89  12.60  12.43  12.37          12.55  12.37  12.22  12.93  13.22

8. Conclusion

In this paper, we presented an unsupervised segment-based robust voice activity detection (rVAD) method for voice activity detection (VAD). It consists of two-pass denoising, extended pitch segment detection, and voice activity detection. The first-pass denoising uses pitch as a speech indicator to remove high-energy noise segments detected by using the a posteriori signal-to-noise ratio (SNR) weighted energy difference, while the second-pass denoising attempts to remove more stationary noise using speech enhancement methods. Then, extended pitch segments are found. In the end, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for VAD.
We evaluated the performance of the proposed rVAD method for VAD and speaker verification (SV) tasks on several diverse databases containing a large variety of noise conditions, and compared rVAD against 16 VAD methods, including both supervised and unsupervised methods. Experiment results show that the proposed method compares favourably with a number of existing methods. It is worth emphasizing that rVAD obtained promising performance across databases and tasks using the same parameters, indicating the good generalization ability of rVAD. It can be concluded that pitch is a good indicator or anchor for locating speech segments, and that the a posteriori signal-to-noise ratio (SNR) weighted energy difference is an effective measure for segmenting speech in noisy environments. Furthermore, a VAD targeted towards SV is able to perform well for SV even when its frame-level VAD performance is not superior.

In addition, we presented a modified version of the rVAD method, called rVAD-fast, where the computationally time-consuming pitch extraction is replaced by a computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data or running on resource-limited devices. rVAD-fast, however, breaks down under white noise and therefore should be used with caution. It has been shown to work well under babble, market and car noise and under the clean condition. One further finding is that spectral flatness is a good indicator of whether or not there is pitch in a segment, as long as the signal is not severely corrupted by white noise.

Overall, it can be concluded that rVAD performs well in both clean and noisy conditions, and for both VAD itself and SV. The generalization ability across databases, noise conditions and tasks was demonstrated as well.
Future work includes investigating the optimal configurations of rVAD for different applications. The performance of rVAD on automatic speech recognition is worth studying as well.

9. Acknowledgement

Zheng-Hua Tan wishes to thank Dr. James Glass for hosting him in 2012 and 2017 at MIT, where the work was done in part. This work is partly supported by the iSocioBot project, funded by the Danish Council for Independent Research - Technology and Production Sciences (#1335-00162), and the Horizon 2020 OCTAVE Project (#647850), funded by the Research Executive Agency (REA) of the European Commission.

References

[1] M. Price, J. Glass, A. P. Chandrakasan, A low-power speech recognizer and voice activity detector using deep neural networks, IEEE Journal of Solid-State Circuits 51 (1) (2018) 66–75.
[2] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, P. Dumouchel, Supervised/unsupervised voice activity detectors for text-dependent speaker recognition on the RSR2015 corpus, in: Proc. of Odyssey Speaker and Language Recognition Workshop, 2014.
[3] Z.-H. Tan, B. Lindberg, Low-complexity variable frame rate analysis for speech recognition and voice activity detection, IEEE Journal of Selected Topics in Signal Processing 4 (5) (2010) 798–807.
[4] J. Ramirez, C. Segura, C. Benitez, A. Torre, A. Rubio, A new Kullback-Leibler VAD for speech recognition in noise, IEEE Signal Processing Letters 11 (2) (2004) 266–269.
[5] X.-L. Zhang, J. Wu, Deep belief networks based voice activity detection, IEEE Transactions on Audio, Speech, and Language Processing 21 (4) (2013) 697–710.
[6] L. Ferrer, M. McLaren, N. Scheffer, Y. Lei, M. Graciarena, V. Mitra, A noise-robust system for NIST 2012 speaker recognition evaluation, in: Proc. of Interspeech, 2013, pp. 1981–1985.
[7] L. Ferrer, M. Graciarena, V. Mitra, A phonetically aware system for speech activity detection, in: Proc. of IEEE Int. Conf. Acoust.