Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes

Replay spoofing detec tion system fo r a utomatic speaker verific ation using multi-task lea rn ing of noise cla s ses Hye-Jin Shim School of Computer Science University of Seoul Seoul, South Korea shimhz6.6@gmail.com Sung-Hyun Yoon School of Computer Science University of Seoul Seoul, South Korea ysh901108@naver.com Jee-Weon Jung School of Computer Science University of Seoul Seoul, South Korea jeewon.leo.jung@gmail.com Ha-Jin Yu School of Computer Science University of Seoul Seoul, South Korea hjyu@uos.ac.kr Hee-Soo Heo School of Computer Science University of Seoul Seoul, South Korea zhasgone@naver.com Abstract — In th is paper, we propo se a replay attack spoofing detection system for a utomatic speaker verification using m ulti- task learning of noise classes. We define t he noise that is caused by the replay attac k as replay noise. We ex plore the effecti veness of training a deep neural netw ork simultaneously for replay attack s poofing detection and rep l ay n oise classification. The multi-task learni ng includes classifying the noise of playback devices, recording environments, and recording devices as well as the spoofing d etection. Ea ch of the three t ypes of the noise classes als o incl udes a genuine class. The experiment results on the version 1.0 o f ASVspoof2017 datasets demonstrate that t he performance of our proposed sy stem is i mproved by 30% relatively on the evaluation set. Keywords—replay attack, spoofing detection, anti-spoofing, speaker verification, mul ti-task learning I. I NTRODUCTI ON As speaker verification is a pplied to various applications, the reliability of automatic speaker verifi cation sy stems became an im portant issue. Therefo re, many researchers are focusing on spoofing detect ion to enhance the reliability of speaker verification system . An a udio spoofi ng signal is generated by man ipulating a genuine signal through recording, synth esizing or modify ing to trick a speaker verification system . From amongst other studies in this field, automatic speaker v erificati on(ASV) spoofing and countermeasures challenge has ini tiatively led to the evaluati on of audio spoofing using various attacks such as speech synthes is, voice conversi on in 2015 [1], and replay attack in 20 17 [2], respective ly. The results o f the ASVspoof2017 challenge showed that r eplay attack is more difficult to detect than other attacks. While speech synthes is and voice conversi on are hard to im plement as they need special equipm ent and expertise , replay attack requires neither specia l equipment nor expertise. Additionally , speech synth esis and voice conversi on also include a play back phase tha t occurs after man ipulation o f the genuine signal. Hence, de tec ting replay attack can help in d etec ting other spoofing attacks . In thi s paper, the spo ofed signal and spoofing detection are discussed only in the context of replay attack and replay attack s poofing detecti on. Generally, channel nois e in an audio signal is ca use d b y the recording envi ronment a n d recording or pl aying devices. In various domain such as speaker recognition, it has been known that channel n oise reduces the a ccu racy o f t he sys tem [3]. Therefore , to im prove the perform ance, many studies focus on reducing chann el noise. However, in spoofin g detection, we hy pothesized that noise can be vital especia lly in replay attack. In repl ay attack, the noises of the recording envir onment, playback and recording devices are generated during the playback and re- recording phase. Compared with the original signal without replay attack, the spoofed signal is identica l to the genui ne signal except the replay noise which is added by r eco rding environm ent, recording and playback devices. We define this additional channel noise that is added during replay attack as replay n oise. The c onventional method of spoofing detection is binary classificati on. This technique i s use d to determine w hether a signal has been spoofed. Given the importance of replay noise, we trained a deep neural network (DNN) for replay noise cl assification as well as spoofing detecti on. Multi-t ask learning [4] is implemented to train the netw ork o n various tasks at the same tim e. II. R ELATED WORK S In the ASVspoof20 17 in wh ich the system s for detecting replay attacks were evaluated, many teams show ed competitive results [5, 6, 7, 8]. Among th ese, the system developed by Lavrenty eva et al. (2017) [5] had the best perform ance. To verify effectiven ess of our proposed sy stem, we designed the architectu re of o ur system i n a manner similar to the aforementione d system [5], and modified it to apply replay noise classific ation. The system is composed of front-en d DNN for extracting features and back- end single Gaussian model for scoring. For detecting replay attack, Nagarsheth e t a l. (2 017) [6] used a channel discrim ination. The system developed by Nagarsheth et al. (2017) [6] completely traine d D NN f or spoofing detection or channel discrimin ation and selected the DNN only for channel discrim ination. In o u r stu dy , we trained the DNN simultaneously for spo ofing detection and replay noise classificati on by using multi -task learning . For generalizati on, w e added the genu ine node for each replay noise classification, wh erea s N agarsh eth et al. (2017) did not consider the g enuine class in th e channel d is crimination [6]. A. Fr ont-en d DN N The front-end DNN is used a s a feature extractor. The spectrog ram of a sig nal is fed to the network for trainin g the spoofing detector. Spoofing detection features are e xt racted from the linear activati on of the last hidden layer of the trained netw ork an d sc oring is performed b y the back-en d single Gauss ian model . Given that noise is well visualized in spectrog rams, a convoluti onal neural n etw or k (CNN) is exploited. CNNs are often used for processing images; howev er, recently , CNNs have perform ed w ell in spoofing detection [5, 9] a s well. Among m any CNN architectures, we implemen ted Light CNN(LCNN) [10], wh ich showed the best performance in ASVspoof20 17. In LCNN, the concept o f maxout activati on [11] is applied to each CNN layer, called a Max-Featur e- Map (MFM). An LCNN ca n select the most representative featu re from various featu res produce d by different fil ters. It also helps reduce the number of parameters and makes CNN faster. LCNN also show s competitive results. B. Bac k-e nd single Gaus sia n sco ring The output layer of DNN for binary clas sification tasks typically configured in one o f the two configurations : single node w ith sigm oid activa tion, or t wo nodes with softmax activation. In the former confi guration, th e value of the node is directly used as the score and i n the latter configu ration, the value of a single n ode is used as the score . However, the output la y er’s value cannot be direct ly used as a measure o f reliability [12]. Naga rsheth et al. (2017) [5] applied single Gaussian modelin g using the last hidden lay er’s linear act ivation as the code to avoid this is sue. After the DNN was traine d, tw o single Gaussian models were m odeled by calculatin g the mean and standar d deviati on of both the genuine an d the spoofed signal’s code, respectively. During the tes t phase, th e code w as extracted using the DNN . Then, the difference between the log probabili ty of the genuine model and the log probability o f the spoofed model was obtain ed and used as the score . III. THE P ROPOSED SYSTEM To det ect a r eplay attack, we expect th at t he in vestigati on of the difference b etw een the genuine signal and the spoofed signal can contribute t o im proving the perform ance of th e spoofin g detection sy stem. In this section, we analyze the differenc e between the genu ine signal and the s poofed signal based o n the p roces s of producing each signal. Ba sed on the result, we introduce a meth od to improve th e perf ormance of spoo fing detection sys te m . The process of producing a genuine signal and spoofe d signal based on the ‘stolen voice’ scenario [2] is d epicte d in Fig. 1. First, the genuine si gnal is g enerated wh en t he speaker utterance is entered into the verificati on sys tem through a reco rding device. The speaker utterance only refers to the utterance spoken by a speaker without an y noise. However, internal nois e is inherent in e v ery recorded signal. In this process, the noise o f the recording device and th e recording e nvironm ent i s inevitably included in the genuine signal. H ence, we define th is n oise as in ternal noise. Next, the spoofed s ignal ba s ed on th e ‘stolen voice’ scenari o is produced by modifying the genuine signal which was stolen by the imposter. T he imposter pla y s bac k the genuine signal and re-records it to generate the spoofed signal. T he sp oofed ㄹ signal includes the nois e from the playback d evice, recording environmen t and the re co rding device a s w ell a s the genuine signal. W e defined the noise of the playback device, recording environmen t a n d the recording device as replay noise. By analyzing the pro cess o f generating t he genuine signal and the spoofed signal, we found that both signals commonly include the speaker utte rance and the internal noise; how ever, only the spoofed sign al includes t he replay noise. The analy zed results c an be expresse d by the follow ing equations: y  ( t ) = x ( t ) ∗ n ( t ) (1) y  ( t ) = y  ( t ) ∗ P ( t ) ∗ E ( t ) ∗ R ( t ) (2) Fig. 1. Process of s poofing a signal based on t he ‘stolen voice’ scenario Fig. 2. System arc hitecture of the propose d multi-task DNN The above equations are from A legre et al. (2014) [13]. The * operation represents the convolution. T his equation represents the genui ne sign al a nd the s poofed signal as a convoluti on of impulse respon ses of various signal . The genuine signal ( 1) is composed of the sp eaker utterance x ( t) and internal noise n(t). W e ass umed that internal noise is not informative especi ally in a rep l ay attack, as the internal noise is inevitably included in recorded sign al. However, the spoofed signal (2) is composed of not only the genuine signal but al so the noise from play bac k devices P(t) , the recording environm ent E (t) and the recording devices R(t). The replay noise is added after spoofing a nd the replay noise can be the biggest differ ence between th e spoofed signal and the genuine signal. As a result, w e proposed the method of training a DNN for r epl ay attack spoofing d etec tion and replay noise class ification at the same time. As w e intended to concurren tly embed the feat ure of spoofing detection and replay n oise classifica tion, w e applied multi-task learning. Mult i-task le arning is a method of learning several tasks a t the sam e time. From learning each task, other tasks also can be learned b ette r and a synergy effect e xists between differe nt tasks [4]. T hus , the total classes are compose d of the results of spoofing detection(gen uine/spo ofed) a n d three kinds o f replay noi se classificati on. Proposed syste m a rchitectu re is depicted in Fig. 2. F or each replay noise clas sification , w e added a single node which indicates genuine if the signal is genu ine. For instance, in the task that identifies a playback device, th er e ar e a s many as nodes as the number of playback devices and one extra node indicating that the input signal is the genuine signal. W e expect that b y adding the genuine node for the genuine signal, the generaliz ation ability can be i mproved. TABLE I . DATA CONFIGURATION IV. EXPERIMENTS A. Da tasets Experimen ts were c onducted o n the ve rsion 1 .0 of ASVspoof20 17 datasets , composed of non-replay ed utterances (genuine signals ) and replay ed utterances (sp oofed signals). No n- replayed utter ances are subsets of original RedDots corpus and replay ed utterances are made by playing back and recording the RedDots corpus for simul ating a replay attack s poofing environm ent. The AS Vspoof2017 cor pus is o riginally divided into training, developm ent a nd evaluati on, but re - partiti oning of training and d evelo pment subsets are permitted. Given that training a network usin g o nly tra inin g set doe s not involve suffici ent c hannel information, we used part of the developmen t set for training a DNN. The separati on of the dataset is as listed in T ABLE Ⅰ . We used the non-re played utterances of the developm ent set w hich are composed of 380 utterances f rom the original development set (760 utt erances). The 380 ut terances are s e lect ed from 50 u tterances of seven speakers (M11, M12, M13, M14, M16, M17 ) and 30 utterances of one speaker (M15). In a ll experiments, the inputs a re spectrogram s which are obtained via Fast Fourier Transform(F FT) using the kaldi [14] default s etup. If the signal is long er than four se conds, w e random ly select 4 seconds of the signal an d if the sign al is sh orter than four seconds, w e du plicate it. W e also applied mean vec t or normalizati on at the ut terance lev el. B. B aseline To verify effectivenes s of cla ss ifying the replay n oise, we implem ented the best performan ce s ystem in [5] from t h e ASVspoof20 17 challeng e . How ever, the baseline was man ipulated at s everal points a nd the obtained results are al so different. Difference of our baselin e from the system in [5] are as follow s. First, the number of dimensions a n d the method of extracting the spe ctrogram is differen t f ro m those in [5]. Specifically , the size of input in [5] is (864 × 4 00 × 32), but th e size of input in this work is (400 × 257 × 32). S econd, we d id n ot use any additional datasets, though the a uth ors in [5] used additional datasets . T hird, the separation of t he devel opment set and the training set is n ot same as the separation in [5]. Specific d etails on this were m entioned i n se ction 5.1. F ourth, considering the diff erent siz e of input s pectrogram, we exploite d (2×1) max-pooling operation instead of (2×2). We also added Dropout of 2 0% [15] af ter the input . Finally , ADAM optim izer [16] w as used w ith learning rate of 10 -3 . Subset The num ber of uttera nces Non-replay Replay Tra in 1508 + 380 1508 + 570 Dev 380 380 Eval 1298 12008 To tal 3566 14466 C. T he proposed s ystem TABLE II. T HE OVERALL DNN ARCHI TECTURE TABLE III . DETAILS OF MULTI-TA SK NODES TA BLE IV. P ERFO RMAN CE CO MPARI SON OF SPOOFI NG DETECT ION EER(%) WI TH THE BEST PERFO RMAN CE SYST EM Fig. 3. Visualizatio n of t-SNE results comp ared to baseline system We im plemented the mul ti-task learn ing for concurrent ly training spoofing detection and the replay noise classif ication. We tr ained the netw o rk with four tasks as follows and the output layers were e limi nated after training. It must be notice d that the output la y ers are com posed o f not only the results of spoofing detection (genuine/s poofed) a nd the classes of a l l kind of r eplay noise, b ut als o a genu ine node for e ach replay noise cla ss ification. Thus, the total number of output layer nodes is 2 + (4+1) + (8+1) + (7+1) = 24. All losses in multi- task learning ar e equally weighted. Other DNN configuratio n details are liste d in the TABLE Ⅱ, FC and MFM refers to Fully Connected Layer and Max Featur e Map, respec tively. With regard to the mu lti-task output nodes w hich is l isted in th e l ast two columns of TABLE Ⅱ , the d etails are included in TABLE Ⅲ . D. Results and Disccussion The exper imental r esul ts are listed in the TABLE Ⅳ . T he results show that the usage of replay noise can improve the perform ance of spoofing detection system s. T he different perform ances in th e v alid set and eval set is because the r eplay noise of the valid set also exis ts in th e training set. How e ver, the perform ance improvement on the eval set proves the effectiveness of the replay noise when t he devices and environm ent are even un known. Fig. 3 shows the visual ization results usi ng t-SNE [17]. In t his figure, the yellow dots and purple dots indicate the genuine and the spoofe d sign al, r espectively. From the visu alization results, we found that the distribution of the spoofe d signa l is represente d by mul tiple clusters w hich are caused b y the channel difference a nd the overlap with the genui ne signal is significan tly reduced by training for the classificati on o f the replay noises. V. C ON CLUSION In this p aper , we proposed a reply a ttack spoofing detection system using multi-task learning with the replay T ype Filter/Stride Output Input Dropout1(0.2) Conv1 5 × 5 / 1 × 1 400 × 257 × 32 MFM1 (1/2) - 400 × 257 × 16 MaxPool1 2 × 2 / 2 × 2 200 × 128 × 16 Conv2a 1 × 1 / 1 × 1 200 × 128 × 32 MFM2a (1/2) - 200 × 128 × 16 Conv2b 3 × 3 / 1 × 1 200 × 128 × 48 MFM2b (1/2) - 200 × 128 × 24 MaxPool2 2 × 2 / 2 × 2 100 × 64 × 24 Conv3a 1 × 1 / 1 × 1 100 × 64 × 48 MFM3a (2/3) - 100 × 64 × 32 Conv3b 3 × 3 / 1 × 1 100 × 64 × 64 MFM3b (1/2) - 100 × 64 × 32 MaxPool3 2 × 1 / 2 × 1 50 × 64 × 32 Conv4a 1 × 1 / 1 × 1 50 × 64 × 64 MFM4a (1/2) - 50 × 64 × 32 Conv4b 3 × 3 / 1 × 1 50 × 64 × 32 MFM4b (1/2) - 50 × 64 × 16 MaxPool4 2 × 1 / 2 × 1 25 × 64 × 16 Conv5a 1 × 1 / 1 × 1 25 × 64 × 32 MFM5a (1/2) - 25 × 64 × 16 Conv5b 3 × 3 / 1 × 1 25 × 64 × 32 MFM5b (1/2) - 25 × 64 × 16 MaxPool5 2 × 2 / 2 × 2 13 × 32 × 16 Dropout2(0.7) FC6 2 × 64 FC7 2 × 64 FC _ S 2 FC_E 4+1 FC_P 8+1 FC_R 7+1 Object of classifier Task details # of classes Spoofing detection (FC_S) Spoofed and gen uine 2 Recording environment ( FC_E) ['Balcony', 'Bedroo m', 'Cantine', 'Office'] ( and genuine) 4+1 Playback device (FC_P) ['All-in-one PC speakers ', 'Beyerdynamic DT 770 P RO headphones', 'Crea tive A60', 'Dell laptop with internal s peakers', 'Dynaudio BM5A S peaker connected to laptop', 'HP Laptop spea kers', 'High Quality GENEL EC Studio Monitors Speakers', 'VIF A M10MD-39-08 Speaker connected to la ptop'] (and genuine) 8+1 Recording device (FC_R) ['BQ Aquaris M5 smart phone. Software: Smart voice recorder', 'Desktop Computer w ith headset and arecord', 'H6 Handy Recorder', 'Nokia Lumia', 'Rode NT2 m icrophone connected to laptop', 'Rode smartlav+ microphone connec ted t o l aptop', 'Samsung Galaxy 7s'] ( and genuine) 7+1 System Zhang, C. e t al Basel ine Proposed Set Valid Eval V ali d Eva l V a lid Eval EER (%) - 11.5 0 9.47 13.57 4.21 9.56 noise classes. To verify its effectiveness, we conducte d an experimen t on the version 1.0 of A SVspoof20 17 datasets . We train ed a DNN t o perform the replay noise cl assificati on task as well as spoofing de t ection task by multi-task learn ing. Replay noise classificat ion is composed o f three sub tasks which classify the nois e as environm ents, rec ording devices and playback devices, respectively. Experim ental results show the improvement in the spoofing detection perform ance. E ER o f the propose d system is reduced by about 30% rel atively on the evaluati on set. R EFERENCES [1] Wu. Z, Kinnunen. T, Evans. N, Yamag ishi. J, Hanilçi. C, Sahidullah. M., & Si zo v. A, “ASVs poof 2015: the first automatic speaker verification spoofing and countermeasures challenge.” I n Sixteenth Annual Confere nce of the I nternational Speech Communicatio n Asso ciat ion, 201 5 [2] T. Kinnune n, M. Sahidullah, H. Delg ado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” in Proc. INTERSPEECH, 2017, p p. 2– 6. [3] S.V.Vaseghi, Advance d digital signal processing a nd noise reduction. John Wiley & Sons , 2008. [4] R. Caruana, Multitask learning. Machine learning , 19 97, 28.1: 41-75. [5] G. Lavrentyev a, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemel inin, “Audio replay attack detection w ith deep learning framew orks,” in Proc. INTERSPEECH , 2017, pp. 82–86. [6] P. Nagarsheth, E. Khoury, K . Patil, & M.Garland, “Replay Attack Detection using DNN for Ch annel Disc riminat ion”. Proce edings of Interspeech 2017, 9 7-101, 2017. [7] Z. Ji, Z.Y. Li, P., M. A n, S. Gao, D. Wu, & F . Zhao, “Ense mble learning for coun termeasure of audio replay sp oofing att ack in ASVspoof2017”, Proce edings of Inters peech, 87-91, 2017. [8] L. Li, Y. Chen, D. Wang, and T. F . Zheng, “A study on replay attack and anti-spoo fing for automatic speaker verification,” in Pro c. INTERSPEECH, 2017, p p. 92–96. [9] C. Zhang, C. Yu, & J.H. Hansen, "An investigation of d eep-l earning frameworks for speaker verification antispoofing.” IEEE Journal of Selected Topics in Sig nal Processing, 11(4), 6 84-694, 2017. [10] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep fa ce represe ntat ion with noisy labels,” IEEE Journal of Selected Topics in Signal Processing, 20 15. [11] I. J. Goodfellow, D. Warde-F arley, M. Mirza, A. C. Courville, and Y. Bengio, “Maxou t networks.” I CML (3), vol. 28, pp. 1 319–1327, 20 13. [12] Y.Gal, Uncertainty in deep le arning. University of Cambridge , 20 16. [13] F. Alegre , A. Jan icki, & N. Evans, “Re-assessing the threat of repla y spoofing a ttacks against automatic s peaker verification.” In Biometrics Special Intere st G roup (BIOSIG), 20 14 International Confere nce of the, IEEE, 1-6, 2014 [14] D. Povey, A. Ghoshal, G. B oulian ne, L. Burget, O. Glembek, et. al. “The Kaldi speech reco gnition toolkit.” In I EEE 2011 wo rkshop o n automatic speech recognition and un derstanding (No. EPFL-CONF - 192584). I EEE Signal Processing Society . [15] N. Srivastava, G. Hinton, A . K rizhevsky, I . Sutskeve r, & R. Salakhutdinov, “Dropout: A simple w ay to prevent neural netwo rks from overfitting.” The Journal of Machine Learning Research, 15(1 ), 1929-1958, 2014. [16] D.P. Kingma and J. Ba, “A dam: A method fo r st ochastic optimization,” arXiv preprint arXiv:1 412.6980, 2014. [17] L. V. D. M aaten, & G. Hinton, “Visualizing data using t-SNE”. Journal of machine le arning research, 2579-2605, 20 08

Replay spoofing detection system for automatic speaker verification using multi-task learning of noise classes

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment