Transforming acoustic characteristics to deceive playback spoofing countermeasures of speaker verification systems

Accepted to be published in the IEEE International Workshop on Information Forensics and Security (WIFS) 2018, Hong Kong, China.

Fuming Fang 1, Junichi Yamagishi 1,2, Isao Echizen 1, Md Sahidullah 3, Tomi Kinnunen 4
1 National Institute of Informatics, Japan; 2 University of Edinburgh, UK; 3 Inria, France; 4 University of Eastern Finland, Finland
{fang, jyamagis, iechizen}@nii.ac.jp, md.sahidullah@inria.fr, tkinnu@cs.uef.fi

Abstract

Automatic speaker verification (ASV) systems use a playback detector to filter out playback attacks and ensure verification reliability. Since current playback detection models are almost always trained using genuine and played-back speech, it may be possible to degrade their performance by transforming the acoustic characteristics of the played-back speech close to those of the genuine speech. One way to do this is to enhance speech "stolen" from the target speaker before playback. We tested the effectiveness of a playback attack using this method by using the speech enhancement generative adversarial network to transform acoustic characteristics. Experimental results showed that use of this "enhanced stolen speech" method significantly increases the equal error rates for the baseline used in the ASVspoof 2017 challenge and for a light convolutional neural network-based method. The results also showed that its use degrades the performance of a Gaussian mixture model-universal background model-based ASV system. This type of attack is thus an urgent problem needing to be solved.

1. Introduction

Automatic speaker verification (ASV) [1], a kind of biometric authentication technology, identifies a person from a segment of speech. ASV systems typically fall into two types: text-independent and text-dependent, where the latter requests the client to speak a given phrase.
Due to its convenience, ASV is being used in more and more applications, such as ones used in call centers and by mobile devices. However, ASV is vulnerable to several kinds of spoofing attacks (also known as presentation attacks [2]), so ASV systems need a spoofing countermeasure (CM), also known as presentation attack detection [2]. Such attacks aim to mimic the target speaker mainly by using synthesized speech [3], converted speech [3], or playback speech [4, 5]. Among them, playback speech-based attacks are relatively easy to mount since an attacker with no special knowledge can make them [6]. Once an attacker has collected or stolen a voice sample for the target speaker, he or she can simply play it back to an ASV system or concatenate segments of the sample to form a new utterance. Threats from this kind of attack have been confirmed by several studies [4, 5, 7, 8, 9]. Here we focus on playback spoofing attacks and relevant CMs.

Four main types of CMs have been developed to protect against playback spoofing attacks. One type utilizes a text-dependent ASV system and randomly prompts for a pass-phrase [10, 11], making it difficult to mount playback attacks using phrase-fixed speech. However, it is possible to form an arbitrary utterance to spoof this type of CM if the attacker has sufficient speech data for the target speaker. The second type is based on rules describing the characteristics of genuine speech (recorded directly from a person). For example, Mochizuki et al. [12] distinguished genuine speech by detecting pop noise from certain phonemes. An intractable problem with this type of CM is that it is difficult to design suitable rules and implement them. The third type utilizes audio fingerprinting to check whether an incoming recording is similar to previously authenticated utterances that were automatically saved in the ASV system. Gonzalez-Rodriguez et al. [13] developed such a system: if the similarity score was higher than a threshold, the recording was treated as a playback attack. A disadvantage of this type of CM is that it is sensitive to noise. In contrast, the fourth type compares the differences between genuine speech and playback speech. This type mainly utilizes a machine learning algorithm to learn the differences. An example is Wang et al.'s [14] use of a support vector machine [15] to learn the difference in Mel-frequency cepstral coefficient (MFCC)-based acoustic features.

More methods of the fourth type were presented at the second Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017), in which a common database was used to assess the participants' CMs. The database consists of two parts. One part contains genuine speech taken from the RedDots corpus [16], which was designed for speaker verification. The other part contains recordings of the genuine speech made in various environments.

Figure 1. Playback spoofing attack using the enhanced stolen speech method under the ASVspoof 2017 scenario. Without speech enhancement, the attack is the same as a conventional playback attack.

For these data, the baseline [4] with a constant Q cepstral coefficient (CQCC) [17] feature and a Gaussian mixture model (GMM) classifier had an equal error rate (EER) of 30.60%.
A deep learning-based method had an EER of 6.73% [18], which was the best performance achieved at ASVspoof 2017 [4].

These mainstream CMs of the fourth type are also problematic: they are based on the assumption that attackers have no special knowledge. Moreover, this type of CM algorithm only learns the differences from a given dataset and may not work well if the acoustic characteristics of the playback speech are transformed close to those of the genuine speech. To confirm this hypothesis, we tested the effectiveness of a playback attack using speech "stolen" from the target speaker and enhanced before mounting the attack. This enhancement should remove the distortions in the stolen speech caused by the recording device and environmental noise so that they do not affect the re-recorded speech.

We evaluated the effectiveness of a playback attack using this enhanced stolen speech method against a text-dependent ASV system. We used the ASVspoof 2017 scenario (Figure 1), in which the attacker is assumed to obtain from somewhere uncompressed speech for the target speaker containing the phrase used for authentication, e.g., by downloading it from the web, hacking a device used by the target speaker, or talking to and surreptitiously recording the target speaker. The speech enhancement generative adversarial network (SEGAN) [19] was used to transform the acoustic characteristics of the obtained speech close to those of the genuine speech. We also investigated the effect of different types of playback loudspeakers and re-recording devices. The results showed that it is possible to fool playback spoofing CMs by transforming the acoustic characteristics of the playback speech close to those of the genuine speech.

2. Related work

Pioneering work on playback attacks was reported by Lindberg and Blomberg in 1999 [20].
They pre-recorded the numbers one to ten spoken by two speakers and then concatenated various combinations of them to attack a hidden Markov model (HMM) [21]-based text-dependent ASV system. They demonstrated a considerable increase in both the EER and the false acceptance rate (FAR) compared with verification without attacks. More recently, Ergunay et al. investigated the effect of playback attacks against ASV systems and also achieved a large increase in FAR [22]. Compared with these conventional playback attacks, our method further degrades the performance of playback spoofing CMs by enhancing the speech.

There are a few attack methods similar to our enhanced stolen speech method. Demiroglu et al. improved the naturalness of synthesized and converted speech before attacking a phase-based synthetic speech detector and an ASV system [23]. The synthesized and converted speech signals were first analyzed frame by frame, and each frame was replaced with the most similar natural speech frame selected from a dataset. A complex cepstrum vocoder was used to re-synthesize these frames so as to improve speech naturalness. Finally, the speech was directly fed into an ASV system. They reported that their method fooled four out of nine detectors. Our method can be thought of as an extension of theirs, as it further transforms synthesized speech close to natural speech.

Nguyen et al. reported an attack method that transforms computer-generated (CG) images into natural images before feeding them into a facial authentication system [24]. The transformation model is trained using a generative adversarial network (GAN) [25]. The GAN discriminator, which mimics a spoofing detector, is used to distinguish CG images from natural images. The discriminator is pre-trained and fixed during training of the transformation model.
In contrast, we treat the authentication system as a black box: nothing about the playback spoofing CMs or the ASV system is assumed to be known to the attacker.

3. Playback detectors and ASV system

Two playback spoofing CMs and a classical Gaussian mixture model with universal background model (GMM-UBM) [26]-based ASV system were used to evaluate the effectiveness of our enhanced stolen speech attack method. The two CMs were the ASVspoof 2017 baseline and the core method of the best-performing ASVspoof 2017 system (i.e., a light convolutional neural network).

3.1. Baseline of ASVspoof 2017

The baseline of ASVspoof 2017 consists of a CQCC front-end and a GMM back-end. We refer to this method as "CQCC-GMM CM". The CQCC is an acoustic feature extracted from an audio signal. CQCC extraction is performed using the constant Q transform (CQT) instead of the classical short-time Fourier transform (STFT). The STFT has fixed frequency and temporal resolution, whereas the CQT exhibits higher frequency resolution at lower frequencies and higher temporal resolution at higher frequencies. An audio signal is usually represented by a sequence of CQCC feature vectors.

A two-class GMM-based classifier is used for genuine/playback speech detection. One GMM is trained using genuine speech, while the other is trained using playback speech. The input to the GMMs is CQCC-based acoustic feature vectors, and the expectation maximization (EM) [27] algorithm is used for training. During prediction, the feature vectors of an audio signal are independently input into the two models, and the joint log-likelihood for each model is calculated. Finally, the log-likelihood ratio of the genuine and playback model results is compared with a threshold to decide between genuine and playback speech.
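As a concrete illustration of this scoring step, the following minimal Python sketch scores an utterance with a log-likelihood ratio between two class models. For brevity, a single diagonal Gaussian per class (fit in closed form) stands in for the baseline's 512-component GMMs trained with EM, and random vectors stand in for real 90-dimensional CQCC features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for 90-dimensional CQCC feature vectors of each class.
genuine_feats = rng.normal(0.0, 1.0, size=(500, 90))
playback_feats = rng.normal(0.5, 1.2, size=(500, 90))

class DiagGaussian:
    """Single diagonal Gaussian; a stand-in for the 512-component GMM."""
    def __init__(self, feats):
        self.mu = feats.mean(axis=0)
        self.var = feats.var(axis=0) + 1e-6  # floor to avoid division by zero

    def avg_loglik(self, feats):
        """Mean per-frame log-likelihood of an utterance's feature vectors."""
        z = (feats - self.mu) ** 2 / self.var
        return np.mean(-0.5 * (np.log(2 * np.pi * self.var) + z).sum(axis=1))

genuine_model = DiagGaussian(genuine_feats)
playback_model = DiagGaussian(playback_feats)

def llr_score(utterance_feats):
    """Log-likelihood ratio: scores above a threshold favour genuine speech."""
    return (genuine_model.avg_loglik(utterance_feats)
            - playback_model.avg_loglik(utterance_feats))
```

An utterance is accepted as genuine when `llr_score` exceeds a tuned threshold.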
3.2. Core method of the best system of ASVspoof 2017

The best system of ASVspoof 2017 was a fusion of three sub-systems: a support vector machine with i-vector features [28], a convolutional neural network (CNN) combined with a recurrent neural network (RNN), and a light CNN (LCNN). The LCNN was the core method, achieving an EER of 7.37% on its own, very close to that of the fused system (6.73%). We therefore used an "LCNN CM" to evaluate our enhanced stolen speech attack method.

The LCNN consists of five convolution layers, four network-in-network (NIN) [29] layers, ten max-feature-map (MFM) layers, five max-pooling layers, and two fully connected layers. Each MFM layer acts as a maxout activation function [30] that splits the CNN feature maps into two groups and then performs element-wise maximization to select features. The LCNN input is a spectrum with a fixed size of 864 × 400, obtained by performing STFT with 1728 bins and concatenating 400 frames. Dropout [31] is applied after the first fully connected layer. The final output layer, with a softmax activation function, discriminates genuine from playback speech. The model is described in more detail elsewhere [18].

The silent parts of the audio signal are removed, and then STFT is performed using a window length of 25 ms with a shift of 10 ms. If a signal is shorter than 4 seconds, its content is repeated to match the length. For a longer signal, its content is repeated to match a multiple of 4 seconds, and the output probabilities are averaged.

3.3. GMM-UBM-based ASV system

We use a GMM-UBM-based system for speaker verification. Even though it is a classic ASV method, GMM-UBM provides competitive performance on short-duration, text-dependent ASV tasks [32]. The speaker models are created by maximum a posteriori adaptation from a UBM trained with a large amount of speech data from different speakers.
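The MFM selection used by the LCNN in Section 3.2 is simple to state in code. The following is a minimal NumPy sketch, assuming a channels-last feature-map layout (inside the real model this operates on intermediate CNN feature maps):

```python
import numpy as np

def max_feature_map(x):
    """Max-Feature-Map activation: split the channel axis into two halves
    and keep the element-wise maximum, halving the number of feature maps."""
    channels = x.shape[-1]
    assert channels % 2 == 0, "MFM needs an even number of channels"
    first, second = np.split(x, 2, axis=-1)
    return np.maximum(first, second)

# A toy 2x2 feature map with 4 channels becomes 2x2 with 2 channels.
feature_maps = np.arange(16, dtype=float).reshape(2, 2, 4)
print(max_feature_map(feature_maps).shape)  # (2, 2, 2)
```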
Text-dependent speaker models are separately created for different passphrases, following the guidelines for conducting experiments with the RedDots corpus. The recognition score is the likelihood ratio between the results of the target speaker model and those of the UBM.

Figure 2. Architecture of SEGAN. The input of the discriminator is (y, x) or (ŷ, x); the former should be classified as real data while the latter should be classified as fake data. However, the latter should be treated as real data when updating the generator parameters, so adversarial training is performed.

4. Speech enhancement

SEGAN is a data-driven speech enhancement method that constructs a mapping from a noisy waveform to a clean waveform with the help of supervised training. More specifically, SEGAN leverages the power of the GAN, which is composed of two adversarial networks: a discriminator D and a generator G. The discriminator predicts the probability that its input is from real data rather than from fake data generated by G. The generator learns a mapping function from a prior noise distribution p_noise(z) to the distribution of the real data p_data(y) in order to fool the discriminator. If the noise distribution is conditioned on x drawn from playback speech and y drawn from genuine speech, the output ŷ is genuine-like speech. The objective functions for training SEGAN are formulated as

\[
\begin{aligned}
\min_{D}\ \mathcal{L}(D) &= \tfrac{1}{2}\,\mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[(D(y,x)-1)^2\right] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_{\mathrm{noise}}(z)}\!\left[D(\hat{y},x)^2\right] \\
\min_{G}\ \mathcal{L}(G) &= \mathbb{E}_{z \sim p_{\mathrm{noise}}(z)}\!\left[(D(\hat{y},x)-1)^2\right] + \lambda \cdot \lVert \hat{y}-y \rVert_1 ,
\end{aligned}
\qquad (1)
\]

where ŷ = G(z, x) is the enhanced (or generated) speech and E denotes expectation. The L1 norm, ‖·‖₁, measures the distance between the real and enhanced speech. The discriminator and generator are alternately trained by playing a min-max game. Figure 2 shows the architecture of SEGAN.
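The two objectives in Equation 1 are least-squares GAN losses plus an L1 regression term. The following is a minimal NumPy sketch of the per-batch loss computations only (mean-reduced; the training loop, batching, and network code are omitted):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """L(D): push D(y, x) towards 1 (real) and D(y_hat, x) towards 0 (fake)."""
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def generator_loss(d_fake, y_hat, y, lam=100.0):
    """L(G): make the discriminator call the enhanced speech real, plus an
    L1 term pulling the enhanced waveform y_hat towards the clean target y.
    lam = 100 matches the lambda value used in Section 6.3."""
    return np.mean((d_fake - 1.0) ** 2) + lam * np.mean(np.abs(y_hat - y))
```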
SEGAN is an end-to-end model, so both the input and output of the generator are raw waveforms. The input of the discriminator is also a raw waveform, combining (y, x) or (ŷ, x). The generator has an encoder-decoder structure. The encoder is composed of 11 stacked 1-D CNN layers with a filter width of 31 and a stride of two. The decoder mirrors the structure of the encoder, and the corresponding layers between them are connected by skip paths. The dimension of the noise z is the same as that of the encoder output c, and z is drawn from a normal distribution. They are concatenated and input to the decoder. The architecture of the discriminator is the same as that of the encoder part of the generator, except that virtual batch normalization [33] is performed in the hidden layers.

5. Database

In this section, we describe the speech data used for training the spoofing detectors, the ASV system, and SEGAN. We also describe the test data used for authentic as well as illegitimate access.

5.1. Training data for playback spoofing CMs

Both the CQCC-GMM and the LCNN CM models were trained using the ASVspoof 2017 database (version 2), which was derived from the RedDots corpus. The genuine speech data in the database was taken from the RedDots corpus, and the playback speech data was recorded by playing the genuine speech data in various environments (including quiet and noisy places) using various recording devices and loudspeakers of various qualities. The sampling rate was 16 kHz for both the genuine and playback speech.

The database was further split into three datasets: training, development, and evaluation. Each dataset contained both genuine and playback speech. The GMM of the CQCC-GMM CM was trained using the training and development datasets. For the LCNN, the training dataset was used to estimate the model parameters and the development dataset was used to monitor the training process.
5.2. Training data for ASV system

We used the TIMIT and RSR2015 (background subset [34]) corpora for training the UBM of the GMM-UBM-based ASV system. Only male speakers were used, as the ASVspoof 2017 database was created from the male subset of the RedDots corpus. In total, we used 17,850 speech utterances from 488 speakers for UBM training. Each target speaker model was created with speech utterances from three different sessions for a fixed passphrase.

5.3. Training data for SEGAN

We used a high-quality database and two low-quality databases distorted by recording devices or environmental noise to train SEGAN. The high-quality database was the voice cloning toolkit (VCTK) corpus [35]. This corpus contains data recorded in a hemi-anechoic chamber by 109 native English speakers, but we used data for only 28 speakers. One of the low-quality databases was the device-recorded VCTK (DR-VCTK) corpus [36], and the other was the noisy VCTK (N-VCTK) corpus [37]. The DR-VCTK was created by playing the high-quality speech of the 28 speakers in office environments and recording it using relatively inexpensive consumer devices. The N-VCTK was created by adding noise to the high-quality speech of the 28 speakers. The sampling rate of these databases was 48 kHz; we downsampled them to 16 kHz. Two types of SEGAN were trained: one using DR-VCTK and VCTK, and the other using N-VCTK and VCTK.

5.4. Authentication and spoofing data

We equally split the genuine speech of the evaluation dataset in the ASVspoof 2017 database into two sub-datasets. One was used as authentication speech, and the other was used as "stolen speech." We enhanced the stolen speech, played it using four types of portable loudspeakers, and re-recorded it using six types of recording devices in an office room.
The four loudspeakers were a high-quality speaker (BOSE SoundLink), a medium-quality speaker (SONY SRS-BTS50), a low-quality speaker (audio-technica AT-SP92), and an iPhone 6s speaker. The six recording devices were a high-quality condenser microphone (Apogee MiC 96k), a directional microphone (Sony ECM-673), a low-quality microphone (Snowball iCE), a MacBook microphone, an iPad microphone, and an iPhone 6s microphone. These devices were placed around 30 to 50 cm from the loudspeaker. The sampling rate for the re-recording was 16 kHz. One playback and re-recording session was performed per loudspeaker, for four sessions in total.

6. Experimental setup

We evaluated 1) how our enhanced stolen speech method affects the performance of playback spoofing CMs and 2) how the enhanced speech affects an ASV system. To compare the difference between our method and conventional playback attacks, a paired two-tailed t-test was used.

6.1. Setup for playback spoofing CMs

The settings of the CQCC-GMM CM were the same as those of the ASVspoof 2017 baseline, and the source code is available at [38]. The CQCC had 29 dimensions, and the 0th-order cepstral coefficient was also used. Their first and second derivatives were appended, giving 90-dimensional features in total. The GMM of the CQCC-GMM CM had 512 components.

The weights of the LCNN were initialized using the Xavier method [39]. The dropout rate was set to 0.5. Adam optimization [40] with a momentum of 0.5 was used. The initial learning rate was 0.0001; it was reduced by a factor of 0.9 if the classification accuracy on the development dataset decreased after an epoch. There were nine epochs, and the mini-batch size was 64. The LCNN was implemented using the TensorFlow framework [41] and is available at [42].

To assess the performance of both spoofing CMs, we use the EER, which reflects the ability of a CM to discriminate genuine speech samples from playback attacks.
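The EER used here, and the t-DCF introduced in Section 6.2 below, can both be computed with a few lines of NumPy. This is a minimal sketch: the EER is approximated by scanning candidate thresholds rather than interpolating the exact crossing point, and the t-DCF function simply evaluates the weighted sum from Section 6.2, with the paper's cost and prior values as defaults.

```python
import numpy as np

def eer(genuine_scores, spoof_scores):
    """Approximate equal error rate: the operating point where the false
    rejection rate of genuine trials meets the false acceptance rate of
    spoofed trials (higher score = more genuine-like)."""
    best = 1.0
    for t in np.sort(np.concatenate([genuine_scores, spoof_scores])):
        frr = np.mean(genuine_scores < t)   # genuine wrongly rejected
        far = np.mean(spoof_scores >= t)    # spoof wrongly accepted
        best = min(best, max(frr, far))
    return best

def t_dcf(p_a, p_b, p_c, p_d,
          c_asv_miss=1.0, c_asv_fa=10.0, c_cm_fa=10.0, c_cm_miss=1.0,
          pi_tar=0.9801, pi_non=0.0099, pi_spoof=0.0100):
    """Unnormalised t-DCF: weighted sum of the four joint CM+ASV error
    rates p_a..p_d with the costs and priors given in Section 6.2."""
    return (c_asv_miss * pi_tar * p_a + c_asv_fa * pi_non * p_b
            + c_cm_fa * pi_spoof * p_c + c_cm_miss * pi_tar * p_d)
```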
6.2. Setup for ASV system

Our ASV system used MFCC-based acoustic features extracted from a 20 ms short-term window with a 10 ms shift using 20 filters. We computed 19 MFCCs after discarding the energy coefficient. The MFCCs were further processed with RASTA filtering to suppress convolutive mismatch. The delta and double-delta coefficients were computed over a context of three frames and then augmented with the static MFCCs to create a 57-dimensional feature vector. Finally, cepstral mean and variance normalization (CMVN) was performed after discarding the non-speech frames with an energy-based voice activity detector. We trained the gender-dependent UBM with 512 mixture components. The speaker models were created by adapting only the centers of the UBM with a relevance factor of three.

Table 1. EERs (%) for the CQCC-GMM CM. "−" denotes no enhancement (conventional playback attack). Bold indicates the largest degradation.

| Loudspeaker used for replay | Enhancement training data | Directional microphone | High-quality microphone | Low-quality microphone | MacBook | iPad | iPhone 6s | Average |
|---|---|---|---|---|---|---|---|---|
| High quality | − | 15.65 | 8.83 | 20.87 | 9.98 | 7.21 | 49.92 | 18.74 |
| | DR-VCTK | 28.42 | 18.10 | 29.67 | 14.96 | 8.59 | 50.00 | 24.96 |
| | N-VCTK | 35.61 | **23.18** | 33.59 | 16.49 | 9.17 | 50.00 | 28.01 |
| Medium quality | − | 9.35 | 11.96 | 8.78 | 10.78 | 6.54 | 49.13 | 16.09 |
| | DR-VCTK | 15.71 | 20.16 | 15.76 | 15.27 | 7.43 | 49.92 | 20.71 |
| | N-VCTK | 22.56 | 25.62 | 22.03 | 15.61 | 8.36 | 49.92 | 24.02 |
| Low quality | − | 11.83 | 8.98 | 10.28 | 8.34 | 6.14 | 49.92 | 15.92 |
| | DR-VCTK | 20.07 | 16.29 | 19.77 | 10.32 | 6.96 | 49.96 | 20.56 |
| | N-VCTK | 26.87 | 22.35 | 24.44 | 10.78 | 7.35 | 49.92 | 23.62 |
| iPhone 6s | − | 16.28 | 16.54 | 19.83 | 7.19 | 6.40 | 49.53 | 19.30 |
| | DR-VCTK | 30.45 | 31.19 | 30.50 | 10.93 | 7.14 | 49.92 | 26.69 |
| | N-VCTK | 24.25 | 24.26 | 26.94 | 9.83 | 7.28 | 49.88 | 23.74 |

Table 2. EERs (%) for the LCNN CM. "−" denotes no enhancement (conventional playback attack). Bold indicates the largest degradation.

| Loudspeaker used for replay | Enhancement training data | Directional microphone | High-quality microphone | Low-quality microphone | MacBook | iPad | iPhone 6s | Average |
|---|---|---|---|---|---|---|---|---|
| High quality | − | 11.19 | 8.00 | 16.14 | 7.71 | 12.95 | 25.04 | 13.51 |
| | DR-VCTK | 12.35 | 9.12 | 18.14 | 8.59 | 13.55 | 25.74 | 14.58 |
| | N-VCTK | 13.48 | 10.43 | 19.74 | 8.85 | 13.83 | 25.29 | 15.27 |
| Medium quality | − | 8.78 | 9.98 | 5.92 | 5.47 | 7.09 | 25.25 | 10.42 |
| | DR-VCTK | 9.57 | 11.22 | 6.56 | 6.85 | 8.79 | 27.25 | 11.71 |
| | N-VCTK | 10.31 | 12.22 | 7.96 | 7.23 | 9.56 | 27.12 | 12.40 |
| Low quality | − | 7.25 | 6.06 | 5.31 | 9.52 | 7.80 | 16.29 | 8.71 |
| | DR-VCTK | 8.44 | 7.10 | 6.07 | 10.08 | 8.76 | 17.05 | 9.58 |
| | N-VCTK | 10.23 | **8.95** | 7.52 | 10.30 | 9.38 | 17.09 | 10.58 |
| iPhone 6s | − | 11.11 | 11.56 | 10.47 | 4.40 | 9.17 | 17.65 | 10.73 |
| | DR-VCTK | 13.25 | 14.97 | 13.62 | 5.23 | 11.12 | 18.26 | 12.74 |
| | N-VCTK | 11.70 | 12.33 | 11.21 | 4.54 | 10.07 | 18.07 | 11.32 |

Even though ASV spoofing evaluations have focused on standalone CM assessment, the performance of a tandem (combined) system is important for real-world deployment. Both the CM and the ASV system can produce target speaker misses and false acceptances of impostors (either non-targets or spoofs). We therefore adopted the recently proposed tandem detection cost function (t-DCF) metric [43] for evaluating the combination of the two systems in a Bayes risk framework. The t-DCF is given by

\[
C^{\mathrm{asv}}_{\mathrm{miss}} \cdot \pi_{\mathrm{tar}} \cdot P_a + C^{\mathrm{asv}}_{\mathrm{fa}} \cdot \pi_{\mathrm{non}} \cdot P_b + C^{\mathrm{cm}}_{\mathrm{fa}} \cdot \pi_{\mathrm{spoof}} \cdot P_c + C^{\mathrm{cm}}_{\mathrm{miss}} \cdot \pi_{\mathrm{tar}} \cdot P_d ,
\]

where C^asv_miss = 1, C^asv_fa = 10, C^cm_fa = 10, and C^cm_miss = 1 are unit costs related to the misses and false alarms of the two systems; π_spoof = 0.0100, π_non = 0.0099, and π_tar = 0.9801 are the prior probabilities of the spoofs, non-targets, and targets, respectively; and P_a, P_b, P_c, and P_d are the error rates of the four possible errors originating from the joint actions of the CM and ASV systems. The reported t-DCF values are minimum t-DCF values with a fixed ASV system. The higher the value, the less usable the combined (ASV and CM) system.

6.3. Setup for SEGAN

As in previous work [19, 44], we extracted chunks of waveforms using a sliding window of 2^14 samples moved every 2^13 samples (i.e., 50% overlap). During testing, we concatenated the results at the end of the stream without overlap.
The learning rate, mini-batch size, and number of epochs were set to 0.0002, 100, and 120, respectively. The λ in Equation 1 was set to 100. We used the source code of the improved SEGAN [45].

7. Results

Tables 1 and 2 show the EERs for the CQCC-GMM CM and the LCNN CM, respectively. Playback spoofing attacks using our enhanced stolen speech method had significantly higher EERs for both CMs compared with those of conventional playback attacks (without enhancement). One reason could be that the signal-to-noise ratio was higher after speech enhancement, resulting in the playing of cleaner speech. Use of the high-quality speaker with the high-quality microphone, and use of the low-quality speaker with the high-quality microphone when N-VCTK was used to train SEGAN, resulted in the largest performance degradation for the two CMs; the EER increased by factors of 2.6 and 1.5, respectively. As expected, use of the high-quality speaker resulted in higher EERs because it generated more natural playback speech. Interestingly, the results for the iPhone 6s speaker were similar to those for the high-quality speaker.

While a wide range of EERs was obtained across the recording devices, use of the high-quality microphone did not result in significantly higher EERs. The CQCC-GMM CM could not distinguish the playback speech re-recorded using the iPhone 6s. This was because the features were not normalized and channel distortions greatly degraded its performance [5].

Table 3. Values of t-DCF obtained from the combination of CQCC-GMM CM and ASV scores. "−" denotes no enhancement (conventional playback attack).

| Loudspeaker used for replay | Enhancement training data | Directional microphone | High-quality microphone | Low-quality microphone | MacBook | iPad | iPhone 6s | Average |
|---|---|---|---|---|---|---|---|---|
| High quality | − | 0.9361 | 0.9276 | 0.9392 | 0.9136 | 0.9118 | 0.9426 | 0.9285 |
| | DR-VCTK | 0.9412 | 0.9363 | 0.9410 | 0.9258 | 0.9210 | 0.9426 | 0.9347 |
| | N-VCTK | 0.9403 | 0.9373 | 0.9402 | 0.9277 | 0.9211 | 0.9431 | 0.9350 |
| Medium quality | − | 0.9156 | 0.9351 | 0.9211 | 0.9222 | 0.9039 | 0.9425 | 0.9234 |
| | DR-VCTK | 0.9313 | 0.9386 | 0.9348 | 0.9311 | 0.9143 | 0.9428 | 0.9322 |
| | N-VCTK | 0.9339 | 0.9377 | 0.9362 | 0.9308 | 0.9186 | 0.9429 | 0.9334 |
| Low quality | − | 0.9225 | 0.9177 | 0.9248 | 0.9109 | 0.9020 | 0.9415 | 0.9199 |
| | DR-VCTK | 0.9339 | 0.9314 | 0.9362 | 0.9220 | 0.9076 | 0.9428 | 0.9290 |
| | N-VCTK | 0.9353 | 0.9342 | 0.9363 | 0.9223 | 0.9105 | 0.9425 | 0.9302 |
| iPhone 6s | − | 0.9354 | 0.9358 | 0.9379 | 0.9085 | 0.9006 | 0.9384 | 0.9261 |
| | DR-VCTK | 0.9380 | 0.9379 | 0.9388 | 0.9290 | 0.9127 | 0.9381 | 0.9324 |
| | N-VCTK | 0.9387 | 0.9391 | 0.9383 | 0.9247 | 0.9100 | 0.9388 | 0.9316 |

Table 4. Values of t-DCF obtained from the combination of LCNN CM and ASV scores. "−" denotes no enhancement (conventional playback attack).

| Loudspeaker used for replay | Enhancement training data | Directional microphone | High-quality microphone | Low-quality microphone | MacBook | iPad | iPhone 6s | Average |
|---|---|---|---|---|---|---|---|---|
| High quality | − | 0.9239 | 0.9073 | 0.9436 | 0.9063 | 0.9338 | 0.9656 | 0.9301 |
| | DR-VCTK | 0.9303 | 0.9135 | 0.9494 | 0.9098 | 0.9385 | 0.9664 | 0.9347 |
| | N-VCTK | 0.9346 | 0.9186 | 0.9522 | 0.9105 | 0.9390 | 0.9658 | 0.9368 |
| Medium quality | − | 0.9105 | 0.9173 | 0.8990 | 0.8978 | 0.9038 | 0.9657 | 0.9157 |
| | DR-VCTK | 0.9159 | 0.9266 | 0.9011 | 0.9021 | 0.9107 | 0.9673 | 0.9206 |
| | N-VCTK | 0.9202 | 0.9293 | 0.9077 | 0.9035 | 0.9143 | 0.9675 | 0.9238 |
| Low quality | − | 0.9043 | 0.8992 | 0.8972 | 0.9122 | 0.9070 | 0.9584 | 0.9131 |
| | DR-VCTK | 0.9105 | 0.9029 | 0.8999 | 0.9156 | 0.9125 | 0.9605 | 0.9170 |
| | N-VCTK | 0.9183 | 0.9108 | 0.9051 | 0.9156 | 0.9141 | 0.9593 | 0.9205 |
| iPhone 6s | − | 0.9208 | 0.9238 | 0.9173 | 0.8952 | 0.9120 | 0.9530 | 0.9204 |
| | DR-VCTK | 0.9344 | 0.9396 | 0.9345 | 0.8970 | 0.9222 | 0.9540 | 0.9303 |
| | N-VCTK | 0.9234 | 0.9274 | 0.9214 | 0.8960 | 0.9154 | 0.9559 | 0.9233 |
Enhancement based on N-VCTK was more effective than that based on DR-VCTK in most cases. This might be because distortion due to environmental noise has a greater effect than that due to the recording devices.

Tables 3 and 4 show the t-DCF values for the combined CQCC-GMM CM and GMM-UBM-based ASV scores and for the combined LCNN CM and GMM-UBM-based ASV scores, respectively. Compared with conventional playback attacks, an attack using our enhanced stolen speech method greatly degraded the authentication performance of both combinations. This suggests that our enhanced stolen speech method enables playback attacks to pass playback spoofing CMs and to fool ASV systems as well.

8. Conclusion and future work

We investigated the effectiveness of using enhanced stolen speech in playback spoofing attacks. Experimental results showed that stolen speech enhanced with SEGAN can greatly degrade the performance of the baseline CQCC-GMM and advanced LCNN-based playback spoofing CMs, as well as that of a GMM-UBM-based ASV system. Since the speech enhancement method used for an attack would be unknown, we plan to develop a playback detection method that is robust to various speech enhancement methods.

Acknowledgement

This work was partially supported by JSPS KAKENHI Grant Numbers JP16H06302, 18H04120, 18H04112, 18KT0051, and 17H04687, and by the Academy of Finland (project no. 309629). We thank Huy H. Nguyen, the Graduate University for Advanced Studies (SOKENDAI), for comments on an earlier version of the manuscript.

References

[1] J. H. Hansen and T. Hasan. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Processing Magazine, 32(6):74-99, 2015.
[2] ISO/IEC 30107-1:2016. Information technology - Biometric presentation attack detection - Part 1: Framework. https://www.iso.org/obp/ui/#iso:std:iso-iec:30107:-1:ed-1:v1:en, 2016. [Online; accessed 5-July-2018].
[3] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[4] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee. The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection. In INTERSPEECH 2017, Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 2017.
[5] H. Delgado, M. Todisco, M. Sahidullah, N. Evans, T. Kinnunen, K. A. Lee, and J. Yamagishi. ASVspoof 2017 Version 2.0: meta-data analysis and baseline enhancements. In ODYSSEY 2018, The Speaker and Language Recognition Workshop, Les Sables d'Olonne, France, June 2018.
[6] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li. Spoofing and countermeasures for speaker verification: A survey. Speech Communication, 66:130-153, 2015.
[7] F. Alegre, A. Janicki, and N. Evans. Re-assessing the threat of replay spoofing attacks against automatic speaker verification. In Biometrics Special Interest Group (BIOSIG), 2014 International Conference of the, pages 1-6. IEEE, 2014.
[8] J. Gałka, M. Grzywacz, and R. Samborski. Playback attack detection for text-dependent speaker verification over telephone channels. Speech Communication, 67:143-153, 2015.
[9] Z. Wu, S. Gao, E. S. Cling, and H. Li. A study on replay attack and anti-spoofing for text-dependent speaker verification. In APSIPA, pages 1-5. IEEE, 2014.
[10] T. Kinnunen, M. Sahidullah, I. Kukanov, H. Delgado, M. Todisco, A. Sarkar, N. Thomsen, V. Hautamaki, N. Evans, and Z.-H. Tan. Utterance verification for text-dependent speaker recognition: a comparative assessment using the RedDots corpus. In INTERSPEECH 2016, Annual Conference of the International Speech Communication Association, San Francisco, USA, September 2016.
[11] H. Zeinali, L. Burget, H. Sameti, and H. Cernocky. Spoken pass-phrase verification in the i-vector space. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, pages 372-377, 2018.
[12] S. Mochizuki, S. Shiota, and H. Kiya. Voice liveness detection using phoneme-based pop-noise detector for speaker verification. In Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, pages 233-239.
[13] J. Gonzalez-Rodriguez, A. Escudero, D. de Benito-Gorron, B. Labrador, and J. Franco-Pedroso. An audio fingerprinting approach to replay attack detection on ASVspoof 2017 challenge data. In Odyssey 2018 The Speaker and Language Recognition Workshop, pages 304-311. ISCA, 2018.
[14] C. Wang, Y. Zou, S. Liu, W. Shi, and W. Zheng. An efficient learning based smartphone playback attack detection using GMM supervector. In Multimedia Big Data (BigMM), 2016 IEEE Second International Conference on, pages 385-389. IEEE, 2016.
[15] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18-28, 1998.
[16] K. A. Lee, A. Larcher, G. Wang, P. Kenny, N. Brümmer, D. v. Leeuwen, H. Aronowitz, M. Kockmann, C. Vaquero, B. Ma, et al. The RedDots data collection for speaker recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[17] M. Todisco, H. Delgado, and N. Evans. A new feature for automatic speaker verification anti-spoofing: Constant Q cepstral coefficients. In Speaker Odyssey Workshop, Bilbao, Spain, volume 25, pages 249-252, 2016.
[18] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin. Audio replay attack detection with deep learning frameworks. Interspeech, 2017.
[19] S. Pascual, A. Bonafonte, and J. Serra. SEGAN: Speech enhancement generative adversarial network. In Interspeech. ISCA, 2017.
[20] J. Lindberg and M. Blomberg. Vulnerability in speaker verification - a study of technical impostor techniques. In Sixth European Conference on Speech Communication and Technology, 1999.
[21] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[22] S. K. Ergünay, E. Khoury, A. Lazaridis, and S. Marcel. On the vulnerability of speaker verification to realistic voice spoofing. In Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on, pages 1-6. IEEE, 2015.
[23] C. Demiroglu, O. Buyuk, A. Khodabakhsh, and R. Maia. Postprocessing synthetic speech with a complex cepstrum vocoder for spoofing phase-based synthetic speech detectors. IEEE Journal of Selected Topics in Signal Processing, 11(4):671-683, 2017.
[24] H. H. Nguyen, N.-D. T. Tieu, H.-Q. Nguyen-Son, J. Yamagishi, and I. Echizen. Transformation on computer-generated facial image to avoid detection by spoofing detector. arXiv preprint, 2018.
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[26] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19-41, 2000.
[27] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38, 1977.
[28] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P.
Ouellet. Front-end factor analysis for speaker ver- ification. IEEE T ransactions on A udio, Speech, and Language Pr ocessing , 19(4):788–798, 2011. 3 [29] M. Lin, Q. Chen, and S. Y an. Network in network. arXiv pr eprint arXiv:1312.4400 , 2013. 3 [30] I. J. Goodfellow , D. W arde-Farle y , M. Mirza, A. Courville, and Y . Bengio. Maxout networks. arXiv pr eprint arXiv:1302.4389 , 2013. 3 [31] N. Sriv astava, G. Hinton, A. Krizhe vsky , I. Sutske ver , and R. Salakhutdinov . Dropout: A simple way to pre- vent neural networks from ov erfitting. The J ournal of Machine Learning Resear ch , 15(1):1929–1958, 2014. 3 [32] H. Delgado, M. T odisco, M. Sahidullah, A. K. Sarkar , N. Evans, T . Kinnunen, and Z.-H. T an. Further optimi- sations of constant q cepstral processing for integrated utterance and text-dependent speaker verification. In Spoken Language T echnology W orkshop (SLT), 2016 IEEE , pages 179–185. IEEE, 2016. 3 [33] T . Salimans, I. Goodfellow , W . Zaremba, V . Cheung, A. Radford, and X. Chen. Impro ved techniques for training gans. In Advances in Neural Information Pr o- cessing Systems , pages 2234–2242, 2016. 4 [34] A. Larcher, K. A. Lee, B. Ma, and H. Li. T ext- dependent speaker verification: Classifiers, databases and RSR2015. Speech Communication , 60:56–77, 2014. 4 [35] C. V eaux, J. Y amagishi, and K. MacDonald. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit. http://dx.doi.org/10. 7488/ds/1994 , 2017. 4 [36] S. S. Sarfjoo and J. Y amagishi. Device recorded vctk (small subset version). http://dx.doi.org/ 10.7488/ds/2316 , 2018. 4 [37] C. V alentini-Botinhao. Noisy speech database for training speech enhancement algorithms and TTS models. http://dx.doi.org/10.7488/ds/ 2117 , 2017. 4 [38] ASVspoof 2017 Organizers. Baseline replay at- tack detector . http://www.asvspoof.org/ data2017/baseline_CM.zip , 2017. 4 [39] X. Glorot and Y . Bengio. Understanding the difficulty of training deep feedforw ard neural networks. 
In Pr o- ceedings of the thirteenth international confer ence on artificial intelligence and statistics , pages 249–256, 2010. 4 [40] D. P . Kingma and J. Ba. Adam: A method for stochas- tic optimization. arXiv pr eprint arXiv:1412.6980 , 2014. 4 [41] M. Abadi, A. Agarwal, P . Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. T ensorflow: Large-scale machine learning on heterogeneous distrib uted systems. arXiv pr eprint arXiv:1603.04467 , 2016. 4 [42] F . Fang. A T ensorFlo w implementation of light con volutional neural network (LCNN). https:// github.com/fangfm/lcnn , 2018. 4 [43] T . Kinnunen, K.-A. Lee, H. Delgado, N. W . D. Evans, M. T odisco, M. Sahidullah, J. Y amagishi, and D. A. Reynolds. t-DCF: a detection cost function for the tan- dem assessment of spoofing countermeasures and au- tomatic speaker verification. CoRR , abs/1804.09618, 2018. 5 [44] J. Lorenzo-Trueba, F . Fang, X. W ang, I. Echizen, J. Y amagishi, and T . Kinnunen. Can we steal your vocal identity from the internet?: Initial in vestiga- tion of cloning obamas voice using gan, wav enet and low-quality found data. In Pr oc. Odysse y 2018 The Speaker and Language Recognition W orkshop , pages 240–247, 2018. 5 [45] S. S. Sarfjoo. Improv ed SEGAN. https: //github.com/ssarfjoo/improvedsegan , 2017. 5
