Spoof detection using time-delay shallow neural network and feature switching
Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either by logical accesses like speech synthesis, voice conversion or by physical accesses such as replaying the pre-recorded utterance. Inspir…
Authors: Mari Ganesh Kumar, Suvidha Rupesh Kumar, Saranya M
Mari Ganesh Kumar (1), Suvidha Rupesh Kumar (2), Saranya M S (1), B. Bharathi (2), Hema A. Murthy (1)
(1) Indian Institute of Technology Madras; (2) SSN College of Engineering

ABSTRACT

Detecting spoofed utterances is a fundamental problem in voice-based biometrics. Spoofing can be performed either through logical access, such as speech synthesis and voice conversion, or through physical access, such as replaying a pre-recorded utterance. Inspired by the state-of-the-art x-vector based speaker verification approach, this paper proposes a time-delay shallow neural network (TD-SNN) for spoof detection for both logical and physical access. The novelty of the proposed TD-SNN system vis-a-vis conventional DNN systems is that it can handle variable-length utterances during testing. The performance of the proposed TD-SNN systems and the baseline Gaussian mixture models (GMMs) is analyzed on the ASV-spoof-2019 dataset. The performance of the systems is measured in terms of the minimum normalized tandem detection cost function (min-t-DCF). When studied with individual features, the TD-SNN system consistently outperforms the GMM system for physical access. For logical access, GMM surpasses the TD-SNN systems for certain individual features. When combined with the decision-level feature switching (DLFS) paradigm, the best TD-SNN system outperforms the best baseline GMM system on evaluation data with a relative improvement of 48.03% and 49.47% for logical and physical access, respectively.

Index Terms: anti-spoofing, voice-biometrics, GMM, x-vectors, time-delay neural networks

1. INTRODUCTION

Although automatic speaker verification (ASV) systems are robust to impostor threats [1] and acoustic variations, they are vulnerable when subjected to presentation attacks. Presenting a fake biometric sample to a biometric detection system is referred to as a presentation attack (see footnote 1).
The process of this deliberate evasion is called spoofing. Spoofing at the sample acquisition stage can be classified into two categories, namely, logical access (LA) and physical access (PA) [2]. Spoofing samples generated using a speech synthesis (SS) or voice conversion (VC) approach are categorized as LA, while replaying a pre-recorded original audio falls under the PA category. The primary objective of the ASV-spoof challenge proposed in 2015 was to detect LA. Since PA is easier to implement than LA, the former is a greater threat than the latter. The ASV-spoof challenge in 2017 focused on identifying physical access. Numerous spoof detection algorithms have been proposed since then for both LA [3–5] and PA [6–8].

Footnote 1: https://www.iso.org/standard/53227.html
Acknowledgement: We would like to thank the ASV-Spoof-2019 organizers for providing the new spoof detection dataset.

The ASV-spoof-2019 challenge focused on detecting spoofed utterances generated through both LA and PA. Unlike the previous anti-spoofing challenges, equal error rate (EER) was not used as the evaluation metric, because its operating point is ill-suited to user applications like telephone banking [9]. Hence a new metric, termed the minimum normalized tandem detection cost function (min-t-DCF), is provided as the evaluation metric. The min-t-DCF considers the false alarms and misses of both the countermeasure system and the automatic speaker verification (ASV) system, along with the prior probabilities of target and spoof trials. The details of min-t-DCF are discussed in [9, 10]. Scores from an x-vector based speaker verification system [11] are used along with the statistics of the spoof detection system to estimate min-t-DCF. The x-vector is a DNN-based state-of-the-art speaker verification technique that embeds the speaker characteristics in low-dimensional fixed-length vectors from variable-length utterances.
In this paper, inspired by the x-vector based ASV system, we propose a similar spoof detection system for identifying both logical and physical access. For detecting spoofed utterances, the following changes are made to the time-delay neural network architecture (x-vector) proposed in [11]: (i) The last layer in the ASV system's architecture is modified to handle the two-class problem of spoof detection. (ii) Instead of the standard cross-entropy loss function, the focal loss function [12] is used to give more focus to hard and misclassified examples. (iii) The network is made shallow, since this is a binary classification problem with limited data. The proposed network outperforms the baseline GMM classifier for physical access in almost all the cases. On the other hand, the proposed system does not consistently outperform the GMM classifier when detecting logical access. Instead of conventional score fusion, the decision-level feature switching (DLFS) system proposed for the ASV-spoof-2017 dataset [8] is used to exploit the property of different features in capturing different kinds of spoofing conditions.

The focus of this paper is three-fold. Firstly, a comparison of baseline GMM systems using four different features on the ASV-spoof-2019 challenge is discussed. Secondly, we propose a novel neural network architecture for spoof detection inspired by the state-of-the-art ASV system (x-vector [11]). Finally, by using DLFS on the individual feature systems, the performance of the spoof detection systems (SDS) is further improved.

The rest of the paper is organized as follows: Section 2 discusses the details of spoof detection approaches in the literature. A brief description of the ASV-spoof-2019 dataset is given in Section 3. Section 4 gives a brief overview of the x-vector based ASV system. The proposed architecture for spoof detection is explained in Section 5.
Section 6 discusses the details of the baseline GMM systems, the proposed systems, and the DLFS systems. A comprehensive analysis of the performance of the various systems is given in Section 7, followed by the conclusion in Section 8.

2. PRIOR WORKS ON SPOOF DETECTION

The ASV-spoof-2015 challenge targeted ten different types of logical access [13]. A combination of an auditory transformation based on cochlear filter cepstral coefficients (CFCC) and instantaneous frequency (IF), termed CFCCIF, is proposed as the best feature to detect these LAs in [5]. Score fusion of CFCCIF and MFCC was adjudged the best system, with an average EER of 1.2% across all ten conditions. Various LA spoof detection systems submitted to the challenge are detailed in [4].

The speech corpus used in the ASV-spoof-2017 challenge has spoofed instances generated by recording and replaying the bonafide trials of speakers in different environments (E) using various recording (R) and playback (P) devices. Physical access attacks are harder to detect than logical access attacks, as the spoofed utterance of a bonafide trial may come from various E-R-P combinations. The evaluation subset of the ASV-spoof-2017 dataset tried to simulate this "in-wild" condition by generating the spoofed instances from different E-R-P combinations. A light convolutional neural network (LCNN) [14] system outperformed all other systems submitted to the challenge. In [7], an end-to-end neural network (NN) with attention masking was proposed to learn the difference between the spectrograms of bonafide and replayed utterances. This end-to-end attention masking system, pre-trained on the ImageNet dataset [15], gives an ideal performance with zero percent EER. The DLFS paradigm proposed in [8] uses information from multiple feature spaces. This technique outperforms all other replay attack detection systems in the literature except the ideal NN system with zero percent EER.
Many recent works on the ASV-spoof-2019 dataset use various end-to-end neural network (NN) structures, such as a DNN with nine layers [16], variations of ResNet [17], namely, the squeeze-excitation network (SENet) and dilated ResNet, and the light convolutional neural network (LCNN) [18]. The NN architectures used in these works are deep NNs with a minimum of seven layers, excluding the input, pooling, and output layers. The SENet architecture in [17] uses four blocks of NN architecture with several layers of CNN/RNN in each block.

3. DATASET DESCRIPTION

Similar to the ASV-spoof-2015 and ASV-spoof-2017 [19] corpora, ASV-spoof-2019 also has three subsets, namely, training (train), development (dev), and evaluation (eval). Different subsets of data are used for LA and PA attacks. The duration of each utterance is approximately two seconds. Unlike the "in-wild" spoofed trials of the ASV-spoof-2017 corpus, the spoofed trials for physical access in this dataset are generated in controlled acoustic conditions [20]. The latest best-performing text-to-speech synthesis and voice conversion algorithms are used to generate the spoofed trials for the logical access category. These algorithms are better than the algorithms used in ASV-spoof-2015. The number of trials in each subset is listed in Table 1. The number of trials in the evaluation subsets of LA and PA are 71,747 and 137,457, respectively.

Table 1: Number of speakers and trials in the training, development, and evaluation subsets

Attack  Subset  Speakers (male)  Speakers (female)  Trials (bonafide)  Trials (spoofed)
LA      train   8                12                 2580               22800
LA      dev     8                12                 2548               22296
LA      eval    -                -                  7355               63822
PA      train   8                12                 5400               48600
PA      dev     8                12                 5400               24300
PA      eval    -                -                  18090              116640

4. X-VECTORS IN SPEAKER RECOGNITION

i-vectors have been the state-of-the-art approach for text-independent speaker recognition since 2010 [21]. An alternate approach proposed in [22] extracts DNN embeddings, termed x-vectors, from a NN using a temporal pooling layer.
This pooling layer enables the NN to discriminate between speakers from variable-length input speech segments. During testing, the fixed-dimensional x-vectors are extracted and compared with the training data embeddings using some scoring approach.

Speaker embeddings are extracted in [22] from variable-length acoustic segments using a DNN with a multi-class cross-entropy loss function. The DNN consists of a few time-delay neural network (TDNN) layers to enhance the frame-level representation. A pooling layer aggregates the frame-level representations, followed by a few additional layers to handle segment-level representations. Finally, a softmax layer is used to get the posterior probabilities of each speaker. This approach mainly aims (i) to produce the speaker embeddings at the utterance level rather than the frame level and (ii) to generalize well to unseen speakers. The main advantage of this x-vector architecture is its ability to handle short-duration utterances. The x-vector results in [22] are shown to outperform the i-vector systems for short utterances of duration less than 10 seconds.

5. TD-SNN FOR SPOOF DETECTION

Generally, speaker information is present throughout the utterance. Inspired by this concept, the x-vector architecture was proposed for automatic speaker recognition in [11, 22]. The x-vector embeddings are obtained by averaging various statistics across time in a high-dimensional space. Similar to the speaker characteristics, the impact of the spoofing approaches used to generate the spoofed trials will be present throughout the utterance. The x-vector proposed for ASV [11, 22] is trained using the speaker labels to extract the speaker embeddings from the data. In this work, we show that the same x-vector model can be used for spoof detection by training the NN using the class labels (bonafide and spoofed) rather than the speaker labels.
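The pooling step described above, which turns a variable number of frame-level vectors into one fixed-length vector, can be illustrated with a minimal sketch. It assumes plain Python lists as input and the population standard deviation (dividing by the number of frames); the function name `statistics_pooling` and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def statistics_pooling(frames):
    """Collapse a T x h list of frame-level vectors into one 2h-dim vector.

    The output concatenates the per-dimension mean and standard deviation
    over time, so its length does not depend on the number of frames T.
    Dividing by T (population statistics) is an assumption of this sketch.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / num_frames for k in range(dim)]
    stds = [
        math.sqrt(sum((f[k] - means[k]) ** 2 for f in frames) / num_frames)
        for k in range(dim)
    ]
    return means + stds


# Two toy "utterances" with different numbers of frames (illustrative values).
short_utt = [[1.0, 2.0], [3.0, 2.0]]                          # T = 2
long_utt = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [2.0, 1.0]]   # T = 4
short_vec = statistics_pooling(short_utt)
long_vec = statistics_pooling(long_utt)
```

Both outputs have length four even though the inputs have two and four frames, which is the property that lets the later segment-level layers work on utterances of any duration.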
The results show that the model in fact captures the characteristics of the spoofing approaches embedded in the spoofed utterances.

The x-vector proposed for ASV in [22] uses eight hidden layers. Unlike the ASV x-vector architecture and a few other neural network architectures for spoof detection [17, 18, 23, 24], we propose a time-delay shallow neural network with just four hidden layers: two hidden layers at the frame level, a pooling layer to aggregate the statistics at the utterance level, and a penultimate layer to reduce the dimension. Time-delay neural networks are used here for the first time to detect spoofing attacks.

The proposed architecture for spoof detection is shown in Figure 1. This architecture is referred to as the time-delay shallow neural network (TD-SNN) in the rest of this paper. The first two layers are frame-level layers and use time-delay neural networks. These layers convert the input feature vectors into high-dimensional vectors while preserving temporal information. The third layer averages information across time by estimating the mean and standard deviation, thereby converting inputs of variable length into a fixed-length, high-dimensional vector. The fourth hidden layer reduces this high-dimensional vector to a low-dimensional representation. Since spoof detection is a binary classification problem, a softmax layer with two outputs is used as the last layer to get the classification posteriors of a trial. These posterior values are used to classify the trials as either bonafide or spoofed. The embeddings extracted from the penultimate layer can also be used to identify the spoofed utterances using a back-end classifier.

[Fig. 1: TD-SNN architecture for spoof detection. Input (T x d) -> frame-level layers (T x h1, T x h2) -> statistics pooling (h2 mean + h2 variance) -> segment-level layer h3 (DNN embeddings or x-vector) -> posterior probabilities (bonafide or spoofed). T: number of frames; d: input feature vector dimension; h1: size of hidden layer one.]
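The two-output network above is trained with the focal loss given as Equation 1 of this section. As a minimal sketch of that loss for a single trial, the code below assumes that p is the posterior of the positive class (taken here to be bonafide) and that the second term uses log(1 - p), the form for which setting gamma to zero recovers cross-entropy as the section states. The default values of `alpha` and `gamma` are illustrative only; the paper does not report the values it used.

```python
import math


def focal_loss(p, y, alpha=1.0, gamma=2.0, eps=1e-12):
    """Binary focal loss for one trial.

    p: posterior probability of the positive class (assumed: bonafide)
    y: ground-truth label, 1 for the positive class and 0 otherwise
    alpha, gamma: hyper-parameters; these defaults are assumptions
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
    positive_term = y * (1.0 - p) ** gamma * math.log(p)
    negative_term = (1 - y) * p ** gamma * math.log(1.0 - p)
    return -alpha * (positive_term + negative_term)


def cross_entropy(p, y, eps=1e-12):
    """Standard binary cross-entropy, used here only for comparison."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


easy = focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

With gamma greater than zero, the well-classified trial contributes far less than it would under cross-entropy, so training is dominated by hard and misclassified trials.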
Instead of the standard cross-entropy error, the focal loss function is used in this work. Focal loss was first proposed for the object detection task in [12]. The focal loss reshapes the cross-entropy loss such that it gives more importance to hard-to-classify and misclassified examples. The focal loss is better suited to the class imbalance problem (refer to Table 1 for the imbalance in the dataset). The focal loss is estimated as shown in Equation 1.

$$ F(p, y) = -\alpha \left[ y \, (1 - p)^{\gamma} \log(p) + (1 - y) \, p^{\gamma} \log(1 - p) \right] \qquad (1) $$

In Equation 1, y is the ground truth class label, and p is the posterior probability given by the neural network. α and γ are hyper-parameters in this loss function. Setting γ to zero reduces the focal loss to the standard (α-weighted) cross-entropy loss. In Figure 2, the 2D representations of the DNN embeddings obtained from the proposed network trained using the cross-entropy loss and the focal loss are compared. The DNN embeddings are converted to 2D space using the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm [25]. It can be observed that the focal loss produces better embeddings, with less inter-class overlap than the standard cross-entropy error.

[Fig. 2: Comparison of proposed network embeddings trained using cross-entropy loss and focal loss. Two t-SNE panels (cross-entropy loss, focal loss), axes Dim 1 and Dim 2, showing bonafide and spoofed trials. The LA development subset of the ASV-spoof-2019 dataset is used to generate this plot.]

In ASV, the x-vector architecture uses the raw filter bank energies as the input. Borrowing from the ASV approaches, the same filter bank energies were given as input to the TD-SNN framework. As the performance was poor, the focus shifted to using different features for building a better classifier.

6. SPOOF DETECTION SYSTEMS (SDS)

Several attempts have been made to train an efficient classifier for spoof detection. The most common classifiers used for the purpose are GMMs and DNNs. Although there are a few works with SVMs [26] and i-vectors [27], their performance is worse than that of the GMM and DNN classifiers. Hence, in this work, we use both GMM and TD-SNN classifiers to detect spoofed trials. GMM-based systems with a set of features were explored, and the four best-performing systems were submitted to the ASV-spoof-2019 challenge. The TD-SNN systems were developed post-challenge. The performance of the TD-SNN systems is compared with that of the submitted GMM-based SDS using both development and evaluation data.

Table 2: List of developed systems. The systems submitted to the ASV-spoof-2019 challenge are highlighted in grey in the original table.

Type    System         System Name (GMM)  System Name (TD-SNN)
Single  Baseline       G-CQCC             x-CQCC
                       G-LFCC             x-LFCC
                       G-IMFCC            x-IMFCC
                       G-LFBE             x-LFBE
DLFS    Primary        G-Prim             x-Prim
        Contrastive-1  G-C1               x-C1
        Contrastive-2  G-C2               x-C2
        DLFS1          G-DLFS1            x-DLFS1
        DLFS2          G-DLFS2            x-DLFS2

6.1. Single feature systems

The GMM classifier has been the baseline system for all the ASV-spoof challenges conducted from 2015 to 2019. Bonafide and spoofed trials from the training subset are used to train two GMMs, one for the bonafide class (λ_B) and the other for the spoofed class (λ_S). During testing, a trial t is given to λ_B and λ_S, and the log-likelihood (Λ) difference is computed as

$$ S(t) = \Lambda(\lambda_B(t)) - \Lambda(\lambda_S(t)) \qquad (2) $$

The log-likelihood difference is considered as the final score for the trial t. This simple classifier gave an EER of 1.44% and 7.82% on the evaluation data of ASV-spoof-2015 [5] and ASV-spoof-2017 [8], respectively. GMM-SDS with a set of cepstral coefficients and filterbank energies were explored for the ASV-spoof-2019 challenge.
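The scoring rule of Equation 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the challenge baseline: it assumes diagonal-covariance GMMs stored as lists of (weight, mean, variance) tuples, averages the log-likelihood over the frames of a trial, and uses hand-written toy models in place of trained ones. All names (`gmm_log_likelihood`, `score`, `lambda_b`, `lambda_s`) are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of a trial under a GMM.

    gmm is a list of (weight, mean, var) components; this layout and the
    per-frame averaging are assumptions of the sketch.
    """
    total = 0.0
    for x in frames:
        comp = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comp)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total / len(frames)


def score(frames, gmm_bonafide, gmm_spoofed):
    """Log-likelihood difference: positive values favour the bonafide model."""
    return (gmm_log_likelihood(frames, gmm_bonafide)
            - gmm_log_likelihood(frames, gmm_spoofed))


# Toy one-dimensional models; real systems use trained multi-component GMMs.
lambda_b = [(0.5, [0.0], [1.0]), (0.5, [1.0], [1.0])]
lambda_s = [(1.0, [5.0], [1.0])]
bonafide_like = [[0.2], [0.8], [0.5]]
spoof_like = [[4.9], [5.3]]
```

A trial is then accepted as bonafide when its score exceeds a threshold; the threshold itself is an operating-point choice that the sketch leaves out.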
The GMM systems with constant-Q cepstral coefficients (CQCC) [28, 29], inverse Mel frequency cepstral coefficients (IMFCC) [30], linear frequency cepstral coefficients (LFCC) [31], and linear filterbank energy (LFBE) gave better performance than a few other features, such as Mel frequency cepstral coefficients (MFCC), inverse Mel filterbank energies (IMFBE), and Mel filterbank energies (MFBE). To compare the performance of the TD-SNN systems with that of the baseline GMM systems, TD-SNN systems were also developed with the same set of features.

6.2. Feature switching systems

Almost every spoof detection system uses a score fusion of many single-feature systems as the primary system [4, 6]. This indicates that different features are required to detect different spoofing conditions. Instead of the conventional score fusion approach, the decision-level feature switching (DLFS) approach proposed in [8] is used here. For a given trial, DLFS essentially chooses, from a set of individual features, the decision score that has maximum discrimination between the bonafide and the spoofed model. In this work, DLFS is implemented with the four best-performing individual-feature systems for both the GMM and TD-SNN frameworks. The systems developed for this work are listed in Table 2. The features used in the primary and contrastive DLFS systems vary for logical access and physical access. Table 3 shows the results of the GMM-based SDS submitted to the challenge.
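The per-trial switching described above can be sketched as follows. The exact discrimination criterion is defined in [8] and is not restated in this paper, so the sketch assumes it is the absolute value of each feature system's score (for instance, the log-likelihood difference of Equation 2). The function name `dlfs_score` and the toy scores are hypothetical.

```python
def dlfs_score(scores_by_feature):
    """Pick, for one trial, the score of the most discriminative feature.

    scores_by_feature maps a feature name to that system's score for the
    trial. Discrimination is taken here as the absolute score, which is
    an assumption of this sketch rather than the criterion of [8].
    """
    feature = max(scores_by_feature, key=lambda name: abs(scores_by_feature[name]))
    return feature, scores_by_feature[feature]


# Toy scores for two trials from a two-feature system (illustrative values).
trial_a = {"CQCC": 0.4, "LFBE": -2.1}
trial_b = {"CQCC": 3.0, "LFBE": 0.7}
choice_a = dlfs_score(trial_a)
choice_b = dlfs_score(trial_b)
```

Unlike score fusion, only one feature's score survives for each trial, which is why the tables write feature sets such as CQCC | LFBE as an either-or choice.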
Table 3: Performance (in min-t-DCF) of various GMM-based SDS submitted to the ASV-spoof-2019 challenge

System Type    Attack Type  System Name  Feature               Dev Data  Eval Data
Single         LA           G-LFBE       LFBE                  0.0077    0.2059
Single         PA           G-IMFCC      IMFCC                 0.1396    0.1518
Primary        LA           G-Prim       CQCC | LFBE           0.0002    0.1333
Primary        PA           G-Prim       CQCC | IMFCC | LFBE   0.1236    0.1330
Contrastive-1  LA           G-C1         IMFCC | LFBE          0.0003    0.1565
Contrastive-1  PA           G-C1         CQCC | LFCC           0.1226    0.1401
Contrastive-2  LA           G-C2         CQCC | LFCC           0.0013    0.2139
Contrastive-2  PA           G-C2         CQCC | LFBE           0.1821    0.1672

The symbol '|' represents exclusive-OR. A | B implies that either feature A (OR) B will be chosen for each trial.

7. RESULT ANALYSIS

The TD-SNN for LA and PA spoof detection is trained only on the corresponding training subsets. To avoid over-fitting, twenty percent of the training data is used as the validation subset. This TD-SNN is used to test trials from the development and evaluation subsets. The performance of all SDS on the development and evaluation data is listed in Table 4. The best-performing system is chosen based on the min-t-DCF metric [9, 10]. G-CQCC and G-LFCC are the single-feature baseline systems provided along with the challenge dataset.

Table 4: Performance of various spoof detection systems. Systems with best performance in each category are highlighted in grey color.

(a) Logical access

Type    System       Feature                Dev min-t-DCF  Dev EER (%)  Eval min-t-DCF  Eval EER (%)
Single  G-CQCC       CQCC                   0.0123         0.43         0.2366          9.57
Single  x-CQCC       CQCC                   0.0164         0.54         0.154           6.93
Single  G-LFCC       LFCC                   0.0663         2.71         0.2116          8.09
Single  x-LFCC       LFCC                   0.0062         0.28         0.164           6.29
Single  G-IMFCC †    IMFCC                  0.0012         0.04         0.2401          10.62
Single  x-IMFCC      IMFCC                  0.0285         1.08         0.4020          18.95
Single  G-LFBE ∗     LFBE                   0.0077         0.32         0.2059          10.65
Single  x-LFBE       LFBE                   0.0561         1.88         0.265           11.12
DLFS    G-Prim ∗ †   CQCC | LFBE            0.0002         0.01         0.1333          6.14
DLFS    x-Prim       CQCC | LFBE            0.0139         0.47         0.175           8.52
DLFS    G-C1 ∗ †     IMFCC | LFBE           0.0003         0.04         0.1565          6.46
DLFS    x-C1         IMFCC | LFBE           0.0040         0.16         0.296           14.52
DLFS    G-C2 ∗ †     CQCC | LFCC            0.0013         0.04         0.2139          9.04
DLFS    x-C2         CQCC | LFCC            0.0142         0.47         0.107           5.75
DLFS    G-DLFS1      CQCC | IMFCC | LFCC    0.0026         0.19         0.2070          8.92
DLFS    x-DLFS1      CQCC | IMFCC | LFCC    0.0033         0.14         0.142           7.50
DLFS    G-DLFS2      LFCC | IMFCC | LFBE    0.0035         0.15         0.1780          7.92
DLFS    x-DLFS2      LFCC | IMFCC | LFBE    0.0166         0.31         0.208           11.21

(b) Physical access

Type    System       Feature                Dev min-t-DCF  Dev EER (%)  Eval min-t-DCF  Eval EER (%)
Single  G-CQCC       CQCC                   0.1953         9.87         0.2454          11.04
Single  x-CQCC       CQCC                   0.3039         12.98        0.3148          11.83
Single  G-LFCC       LFCC                   0.2555         11.96        0.3017          13.54
Single  x-LFCC       LFCC                   0.1231         4.53         0.1314          4.79
Single  G-IMFCC †    IMFCC                  0.2078         9.19         0.3085          12.10
Single  x-IMFCC      IMFCC                  0.1396         5.28         0.1518          5.58
Single  G-LFBE ∗     LFBE                   0.2581         11.47        0.3708          15.79
Single  x-LFBE       LFBE                   0.1818         7.39         0.1766          6.99
DLFS    G-Prim ∗ †   CQCC | IMFCC | LFBE    0.1888         8.17         0.2767          11.28
DLFS    x-Prim       CQCC | IMFCC | LFBE    0.1236         4.85         0.133           4.91
DLFS    G-C1 ∗ †     CQCC | LFCC            0.1972         7.53         0.2309          9.33
DLFS    x-C1         CQCC | LFCC            0.1226         4.56         0.140           5.05
DLFS    G-C2 ∗ †     CQCC | LFBE            0.2329         8.48         0.3058          11.34
DLFS    x-C2         CQCC | LFBE            0.1821         7.54         0.167           6.46
DLFS    G-DLFS1      CQCC | IMFCC | LFCC    0.1548         7.61         0.2260          9.99
DLFS    x-DLFS1      CQCC | IMFCC | LFCC    0.1171         4.13         0.130           5.61
DLFS    G-DLFS2      LFCC | IMFCC | LFBE    0.3209         10.61        0.2838          13.23
DLFS    x-DLFS2      LFCC | IMFCC | LFBE    0.1230         4.43         0.124           4.42

Systems marked with ∗ and † were submitted to the ASV-spoof-2019 challenge under LA and PA conditions, respectively. The symbol '|' represents exclusive-OR. A | B implies that either feature A (OR) B will be chosen for each trial.

The results reported in Table 4 show that the performance of the proposed system relative to the baseline GMM systems under the LA category is not consistent across features. On the other hand, the TD-SNN systems generally perform better than the GMM systems for PA. One possible reason could be that, unlike LA, the PA category has enough data (refer to Table 1) to train the neural network. Since the TD-SNN SDS performs well in almost all the PA cases, we conclude that it is a more suitable classifier for detecting physical access spoofing. The performance of the SDS is further improved by applying DLFS, as shown in Table 4. Apart from the systems submitted to the challenge, DLFS systems with new feature combinations are reported in the table as G-DLFS and x-DLFS. The best-performing TD-SNN system is compared with the corresponding GMM system with the same feature combination and with the best-performing GMM system in Table 5.
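The relative improvements quoted for the best baseline and best proposed systems can be checked with a short calculation. The formula below (reduction in min-t-DCF divided by the baseline value) is inferred from the reported numbers rather than stated in the paper, and the function name `relative_improvement` is hypothetical; the inputs are the evaluation min-t-DCF values reported for G-LFBE, x-C2, G-CQCC, and x-DLFS2.

```python
def relative_improvement(baseline, proposed):
    """Relative reduction of min-t-DCF in percent (lower t-DCF is better)."""
    return 100.0 * (baseline - proposed) / baseline


la_best = relative_improvement(0.2059, 0.1070)  # G-LFBE vs x-C2 (LA, eval)
pa_best = relative_improvement(0.2454, 0.1240)  # G-CQCC vs x-DLFS2 (PA, eval)
```

Rounded to two decimals, these evaluate to the 48.03% and 49.47% quoted in the abstract.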
From the result analysis of both LA and PA, we can conclude that the TD-SNN framework is a potential model for detecting all types of spoofing attacks. It also supports our assumption that the TD-SNN identifies the traces of the spoofing mechanism in the spoofed utterances better than the GMM. Moreover, since the x-vector is the current state of the art for ASV, a spoof detection system with a similar framework will help us to build a common NN framework for spoof detection as well as speaker recognition.

Table 5: Relative improvement (R.I.) of t-DCF: logical access and physical access (evaluation data)

System Type                                   Attack Type  System Name         t-DCF             R.I. (in %)
DLFS Systems                                  LA           G-C2 vs x-C2        0.2139 vs 0.1070  49.97
DLFS Systems                                  PA           G-DLFS2 vs x-DLFS2  0.2383 vs 0.1240  47.96
Best Baseline System vs Best Proposed System  LA           G-LFBE vs x-C2      0.2059 vs 0.1070  48.03
Best Baseline System vs Best Proposed System  PA           G-CQCC vs x-DLFS2   0.2454 vs 0.1240  49.47

8. CONCLUSION

Spoofed utterances contain traces of the approaches used to generate them. The ability of x-vector based NNs to capture utterance-level information is established in the field of speaker verification. Hence, in this work, an attempt has been made to develop spoof detection systems using a similar TDNN framework. A time-delay shallow neural network (TD-SNN) with the focal loss function is proposed as the neural network architecture for spoof detection. On the ASV-spoof-2019 dataset, the proposed TD-SNN based SDS outperforms the GMM-based SDS in almost all cases for PA, whereas the GMM-based SDS performs better for LA in some of the cases. Further, the DLFS paradigm is used to improve the performance of the single-feature SDS. The best-performing TD-SNN SDS with DLFS outperforms the best baseline GMM SDS with a relative improvement of 48.03% and 49.47% for LA and PA, respectively, in terms of min-t-DCF.

9. REFERENCES

[1] S. J. Elliott, Zero Effort Forgery. Boston, MA: Springer US, 2009, pp. 1411–1414.
[2] Z. Wu et al., “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, pp. 130–153, 2015.
[3] T. B. Patel and H. A. Patil, “Significance of Source-Filter Interaction for Classification of Natural vs. Spoofed Speech,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 644–659, June 2017.
[4] Z. Wu et al., “ASVspoof: The Automatic Speaker Verification Spoofing and Countermeasures Challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, June 2017.
[5] T. B. Patel and H. A. Patil, “Cochlear Filter and Instantaneous Frequency Based Features for Spoofed Speech Detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 618–631, June 2017.
[6] T. Kinnunen et al., “The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in INTERSPEECH, Aug 2017, pp. 1–6.
[7] F. Tom, M. Jain, and P. Dey, “End-to-end audio replay attack detection using deep convolutional networks with attention,” in INTERSPEECH, 2018, pp. 681–685. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2279
[8] Saranya M. S. and Hema A. Murthy, “Decision-level feature switching as a paradigm for replay attack detection,” in INTERSPEECH, 2018, pp. 686–690.
[9] T. Kinnunen, K. A. Lee et al., “t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Odyssey, The Speaker and Language Recognition Workshop, June 2018.
[10] T. Kinnunen, K.-A. Lee, H. Delgado, N. W. D. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D. A. Reynolds, “t-DCF: a Detection Cost Function for the Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification,” CoRR, vol. abs/1804.09618, 2018.
[11] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-Vectors: Robust DNN Embeddings for Speaker Recognition,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333, 2018.
[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[13] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, and Junichi Yamagishi, “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof),” Feb 2015. [Online]. Available: http://www.spoofingchallenge.org/index2015.html
[14] G. Lavrentyeva et al., “Audio replay attack detection with deep learning frameworks,” in INTERSPEECH, Aug 2017, pp. 82–86.
[15] O. Russakovsky, J. Deng et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[16] M. Baelde, N. Souviraà-Labastie, and R. Greff, “Influence of the attack conditions on countermeasures for Automatic Speaker Verification,” Mar. 2019, working paper or preprint. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02082414
[17] C. Lai, N. Chen, J. Villalba, and N. Dehak, “ASSERT: anti-spoofing with squeeze-excitation and residual networks,” CoRR, vol. abs/1904.01120, 2019. [Online]. Available: http://arxiv.org/abs/1904.01120
[18] G. Lavrentyeva, S. Novoselov, A. Tseren, M. Volkova, A. Gorlanov, and A. Kozlov, “STC antispoofing systems for the asvspoof2019 challenge,” CoRR, vol. abs/1904.05576, 2019. [Online]. Available: http://arxiv.org/abs/1904.05576
[19] T. Kinnunen et al., “Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof),” Feb 2017. [Online]. Available: http://www.spoofingchallenge.org/index2017.html
[20] ASVspoof consortium, “ASVspoof 2019: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan,” Jan 2019. [Online]. Available: http://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf
[21] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[22] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech, 2017, pp. 999–1003.
[23] J.-w. Jung, H.-j. Shim, H.-S. Heo, and H.-J. Yu, “Replay attack detection with complementary high-resolution information using end-to-end dnn for the asvspoof 2019 challenge,” arXiv preprint, 2019.
[24] B. Chettri, D. Stoller, V. Morfi, M. A. M. Ramírez, E. Benetos, and B. L. Sturm, “Ensemble models for spoofing detection in automatic speaker verification,” arXiv preprint arXiv:1904.04589, 2019.
[25] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008. [Online]. Available: http://www.jmlr.org/papers/v9/vandermaaten08a.html
[26] S. Novoselov, A. Kozlov, G. Lavrentyeva, K. Simonchik, and V. Shchemelinin, “STC anti-spoofing systems for the ASVspoof 2015 challenge,” in ICASSP, March 2016, pp. 5475–5479.
[27] E. Khoury, T. Kinnunen, A. Sizov, Z. Wu, and S. Marcel, “Introducing i-vectors for joint anti-spoofing and speaker verification,” in INTERSPEECH, 2014, pp. 61–65.
[28] M. Todisco et al., “A New Feature for Automatic Speaker Verification Anti-Spoofing: Constant Q Cepstral Coefficients,” in The Speaker and Language Recognition Workshop, ODYSSEY, June 2016.
[29] M. Todisco, H. Delgado, and N. Evans, “Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech & Language, vol. 45, pp. 516–535, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0885230816303114
[30] S. Chakroborty, A. Roy, and G. Saha, “Improved closed set text-independent speaker identification by combining mfcc with evidence from flipped filter banks,” International Journal of Signal Processing, vol. 4, no. 2, pp. 114–122, 2007.
[31] X. Zhou, D. Garcia-Romero, R. Duraiswami, C. Espy-Wilson, and S. Shamma, “Linear versus Mel frequency cepstral coefficients for speaker recognition,” in 2011 IEEE Workshop on Automatic Speech Recognition Understanding, Dec 2011, pp. 559–564.