A General Model for Deepfake Speech Detection: Diverse Bonafide Resources or Diverse AI-Based Generators


Authors: Lam Pham (1*), Khoi Vu (2*), Dat Tran (2*), David Fischinger (1), Simon Freitter (1), Marcel Hasenbalg (1), Davide Antonutti (1), Alexander Schindler (1), Martin Boyer (1), Ian McLoughlin (3)

Affiliations: (1) Austrian Institute of Technology, Austria; (2) FPT University, Vietnam; (3) Singapore Institute of Technology, Singapore. (*) Main and equal contribution to the paper.

Abstract—In this paper, we analyze two main factors, Bonafide Resource (BR) and AI-based Generator (AG), which affect the performance and generality of a Deepfake Speech Detection (DSD) model. To this end, we first propose a deep-learning based model, referred to as the baseline. We then conduct experiments on the baseline showing how the BR and AG factors affect the threshold score used to decide whether input audio is fake or bonafide at inference. Given the experimental results, we propose a dataset that re-uses public DSD datasets and balances BR and AG. We then train various deep-learning based models on the proposed dataset and conduct cross-dataset evaluation on different benchmark datasets. The cross-dataset evaluation results prove that balancing Bonafide Resources and AI-based Generators is the key factor in training a general DSD model.

I. INTRODUCTION

Over the past few years, generative AI has developed impressively thanks to remarkable advances in deep-learning techniques. In the audio domain, and for human speech in particular, a wide range of high-quality applications such as media production, AI-based music composition, and audio-based chatbots have been introduced and widely adopted. However, this also poses potential risks when AI-generated human speech is used for criminal purposes. To address this concern, the audio forensic task of Deepfake Speech Detection (DSD) has been raised and has drawn much attention from the audio research community. Indeed, various public datasets from scientific papers and challenge competitions have recently been proposed for the DSD task (the survey in [1] indicates that four DSD datasets have been published per year since 2021). Given these public datasets, a wide range of deep-learning based models, referred to as DSD models, have been proposed to detect deepfake audio. The state-of-the-art DSD models [2]–[4] leverage pre-trained deep neural networks that were trained on large-scale human-speech datasets for the upstream task of Speech-to-Text; these pre-trained models are then fine-tuned on the downstream DSD task. In particular, the authors in [2] achieved the best performance by fine-tuning XLSR, a variant of Wav2Vec2 [5] released by Meta, while the authors in [3] leveraged and fine-tuned WavLM [6] released by Microsoft.
The authors in [4] made an effort to benchmark a wide range of popular pre-trained models proposed for audio tasks. However, the state-of-the-art DSD models present two main concerns.

First, these DSD models use the Equal Error Rate (EER) with a free threshold score as the main metric to compare performance among proposed models. The EER scores are reported and treated as individual values, as they are computed on separate testing datasets rather than across all testing datasets. For each EER value obtained on a specific testing dataset, there is a corresponding threshold score used to decide whether audio is fake or bonafide, and these threshold scores may differ significantly across testing datasets. However, only one threshold score can be selected to classify unseen data at inference. Therefore, if the threshold scores differ widely across testing datasets, the model overfits and does not generalize. In other words, high-performance DSD models with low EER values evaluated on individual testing datasets may not be reliable if the corresponding threshold scores are not presented, compared, and analyzed. Yet the state-of-the-art DSD systems have not reported or analyzed threshold scores across testing datasets.

Second, when training a DSD model, a Softmax-based output layer is normally used to separate fake and bonafide audio into two distributions with a boundary probability of 0.5. However, imbalance in Bonafide Resource (BR) or AI-based Generators (AG) may cause the fake or bonafide distributions to concentrate near this boundary (i.e., around the probability of 0.5). Raising or lowering the probability boundary (i.e., above or below 0.5) therefore helps to find the optimal boundary from which the EER and the corresponding threshold score are computed. However, no prior research has analyzed the role of BR or AG, which may significantly affect the final probability boundary.

By addressing these concerns, we make the following main contributions in this paper:
• We conduct extensive experiments showing how Bonafide Resource (BR) and AI-based Generator (AG) affect a DSD model's performance, the EER metric, and the corresponding threshold score.
• Given the experimental results, we propose a dataset for the DSD task that re-uses public DSD datasets and balances BR and AG.
• With the proposed dataset, we train various deep-learning based models for the DSD task. The proposed models prove general when we conduct cross-dataset evaluation on different benchmark datasets using a fixed threshold.

II. EVALUATION OF BONAFIDE RESOURCE AND AI-BASED GENERATOR FACTORS

A. Proposed Baseline Model

To evaluate how Bonafide Resource (BR) and AI-Based Generator (AG) affect DSD model performance, we first propose a deep-learning based model, referred to as the baseline in this paper. Inspired by the state-of-the-art use of pre-trained models [2]–[4], we also use the pre-trained WavLM [6] model to construct the baseline, shown in Fig. 1. First, 4-second audio segments are fed into the pre-trained WavLM [6] model, which was trained on large-scale datasets for the task of Speech-to-Text. By leveraging the pre-trained WavLM model [6], the audio embeddings (vector-based representations) extracted from this model capture general acoustic features of human speech. These audio embeddings are then explored by a Multilayer Perceptron (MLP) composed of three dense layers. During training, the pre-trained WavLM model is frozen; only the trainable parameters of the MLP component are updated to adapt to the DSD task.

[Fig. 1: The proposed baseline architecture — 4-second segments are encoded by the frozen pre-trained WavLM into an audio embedding, which passes through an MLP (two Fully Connected + ReLU + Dropout blocks and a final Fully Connected + Softmax layer) to produce the predicted probabilities p_B and p_F, from which the threshold score is derived.]
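For concreteness, the following PyTorch sketch shows one way to implement this baseline. It is our illustration, not the authors' released code: the HuggingFace checkpoint `microsoft/wavlm-large`, the mean-pooling of frame features, and the MLP layer sizes are assumptions; the paper specifies only a frozen WavLM followed by three dense layers with ReLU, Dropout, and a final Softmax.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel  # assumes HuggingFace transformers is available

class BaselineDSD(nn.Module):
    """Frozen WavLM backbone + 3-layer MLP classifier, mirroring Fig. 1."""

    def __init__(self, hidden_dim: int = 256, dropout: float = 0.1):
        super().__init__()
        self.backbone = WavLMModel.from_pretrained("microsoft/wavlm-large")
        for p in self.backbone.parameters():
            p.requires_grad = False  # backbone stays frozen; only the MLP trains
        feat_dim = self.backbone.config.hidden_size  # 1024 for WavLM-Large
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, 2),  # two outputs: bonafide, fake
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 64000) -- 4-second segments at 16 kHz
        with torch.no_grad():
            frames = self.backbone(waveform).last_hidden_state
        embedding = frames.mean(dim=1)  # mean-pool frames into one audio embedding
        return torch.softmax(self.mlp(embedding), dim=-1)  # (p_bonafide, p_fake)
```

In practice the Softmax would typically be folded into a cross-entropy loss during training; it is kept explicit here only to mirror the probabilities p_B and p_F in Fig. 1.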
B. The Statistics of Public and Benchmark DSD Datasets

We now analyze the statistics of public and benchmark datasets proposed for the DSD task (only English datasets are analyzed). To this end, we present the public DSD datasets in Table I, which were also summarized in our previous work [1].

TABLE I. PUBLIC AND BENCHMARK DATASETS PROPOSED FOR DEEPFAKE SPEECH DETECTION (DSD) IN ENGLISH

| Dataset | Year | Speakers (M/F) | Utterances (Bonafide/Fake) | Fake-Speech Generators | Real-Speech Resource |
|---|---|---|---|---|---|
| ASVspoof 2015 [7] (audio) | 2015 | 45/61 | 16,651/246,500 | 10 (7 VC, 3 TTS) | Speaker volunteers |
| FoR [8] (audio) | 2019 | 140 | –/195,541 | 7 TTS | Kaggle [9] |
| ASVspoof 2019 (LA) [10] (audio) | 2019 | 46/61 | 12,483/108,978 | 19 (8 VC, 11 TTS) | Speaker volunteers |
| DFDC [11] (video) | 2020 | 3,426 | 128,154/104,500 | 1 TTS | Speaker volunteers |
| ASVspoof 2021 (LA) [12] (audio) | 2021 | 21/27 | 18,452/163,114 | 13 TTS/VC | Speaker volunteers |
| ASVspoof 2021 (DF) [12] (audio) | 2021 | 21/27 | 22,617/589,212 | 100+ TTS/VC | Speaker volunteers |
| WaveFake [13] (audio) | 2021 | 0/2 | –/117,985 | 6 TTS | LJSpeech [14], JSUT [15] |
| FakeAVCeleb [16] (video) | 2022 | 250/250 | 570/25,000 | 2 TTS | VoxCeleb2 [17] |
| In-the-Wild [18] (video) | 2022 | 58 | 19,963/11,816 | n/a | Self-collected |
| Voc.v [19] (audio) | 2023 | 46/61 | 14,250/41,280 | 5 TTS | ASVspoof 2019 |
| LibriSeVoc [20] (audio) | 2023 | n/a | 13,201/79,206 | 6 TTS/VC | Librispeech [21] |
| AV-Deepfake1M [22], [23] (video) | 2023 | 2,068 | 286,721/860,039 | 2 TTS | VoxCeleb2 [17] |
| MLAAD [24] (audio) | 2024 | n/a | –/76,000 | 54 TTS | Librispeech [21] |
| ASVspoof5 [25] (audio) | 2024 | n/a | 188,819/815,262 | 28 (15 TTS, 6 VC, 7 AT) | Librispeech [21] |

Table I indicates an imbalance between Bonafide Resource (BR) and AI-Based Generator (AG). Indeed, ASVspoof 2021 (DF task) [12] presents the largest number of AGs (more than 100 Text-to-Speech (TTS) and Voice Conversion (VC) systems) but is limited in the number of speakers (21 males and 27 females). In contrast, datasets such as ASVspoof5 [25], WaveFake [13], and LibriSeVoc [20] leverage Librispeech [21], a large-scale human-speech dataset, as the bonafide resource. This is effective for constructing a general human-speech dataset with diverse speech characteristics such as gender, accent, loudness, and speech flow that reflect real-world conversations; however, these datasets are limited in the number of AGs and imbalanced between TTS and VC systems. Similarly, AV-Deepfake1M [22] presents a large number of fake utterances, but they are generated by only 2 TTS systems. Fig. 2 again shows the imbalance among Bonafide Resource (BR), AI-Based Generators (AG), and the number of fake utterances across the DSD datasets that do not use Librispeech as the bonafide resource.

[Fig. 2: Statistics of English DSD datasets (real speakers / fake utterances / fake-speech generators): ASVspoof 2015 (106/246,500/10), ASVspoof 2019 (107/108,978/19), ASVspoof 2021 LA (48/163,114/13), ASVspoof 2021 DF (48/589,221/100+), Voc.v (107/41,280/5), WaveFake (2/117,985/6), FakeAVCeleb (500/25,000/2), FoR (140/195,541/7).]

C. Experimental Settings

Given the dataset concerns above, we now set up an experiment to evaluate the roles of Bonafide Resource (BR) and AI-Based Generators (AG). First, ASVspoof 2019 (LA), ASVspoof 2021 (LA), and ASVspoof 2021 (DF) are grouped into one dataset with a total of 53,552/861,304 bonafide/fake utterances. We gather these datasets because they draw bonafide audio from a limited pool of speaker volunteers but apply diverse AI-Based Generators (more than 100 generators). This data is referred to as the AG dataset. Meanwhile, the ASVspoof5 [25] train and development subsets are grouped into one dataset with a total of 50,131/273,176 bonafide/fake utterances. As these subsets use diverse bonafide audio from Librispeech [21], they are grouped together to present a diverse bonafide dataset, referred to as the BR dataset.

TABLE II. PROPOSED DATASETS TO BALANCE AI-BASED GENERATOR AND BONAFIDE RESOURCE FACTORS

| Dataset | Bonafide Utterances | Fake Utterances |
|---|---|---|
| AG dataset | 53,552 | 861,304 |
| BR dataset | 50,131 | 273,176 |
| BR-AG dataset | 103,683 | 1,134,480 |

Given the BR dataset and the AG dataset shown in Table II, we set up two test cases. In the first test case, we evaluate the role of AG: we train the baseline model on the AG dataset and then test it on the In-The-Wild [18] dataset. In the second test case, we evaluate the role of BR: we train the baseline model on the BR dataset and then test it on the In-The-Wild [18] dataset. We select In-The-Wild [18] for testing in both cases because it has been widely used for cross-dataset evaluation in state-of-the-art papers and does not share AG or BR information with the training data. The dataset also presents various background-noise and audio-quality conditions that reflect real scenarios, and a rough balance between 52,605 bonafide and 35,483 fake 4-second segments.

The baseline model is implemented in the PyTorch framework and trained on a 12 GB GeForce GPU. We use the Adam [26] optimizer with a fixed learning rate of 1e-5 and stop training after 30 epochs. Notably, this experiment operates on 4-second audio segments: original utterances are split into 4-second segments before being fed into the model (a sketch of this segmentation is shown below).
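The segmentation step can be as simple as the following sketch. It is our illustration: the paper does not state how utterances shorter than 4 seconds or trailing remainders are handled, so the tiling of short inputs here is an assumption.

```python
import torch

def split_into_segments(waveform: torch.Tensor, seg_len: int = 64000) -> torch.Tensor:
    """Cut a 1-D utterance into non-overlapping 4 s windows (64,000 samples at 16 kHz)."""
    n = waveform.numel()
    if n < seg_len:
        # Short utterances are tiled up to one full segment (assumption).
        reps = -(-seg_len // n)  # ceiling division
        waveform, n = waveform.repeat(reps)[:seg_len], seg_len
    n_full = n // seg_len
    return waveform[: n_full * seg_len].view(n_full, seg_len)
```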
Regarding evaluation metrics, we report the F1 score, Accuracy, AUC, and EER, with the EER accompanied by its threshold score. We use the probability of 0.5 to compute Accuracy, F1, and AUC (i.e., if the fake probability is larger than 0.5, the input audio is classified as fake). Meanwhile, we follow the ASVspoof challenge to compute the EER and the corresponding threshold based on logarithmic scores.
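A minimal sketch of Eq. (1) below and of the EER/threshold computation follows. It is our illustration under two assumptions: labels use 1 for fake and 0 for bonafide, and the standard ROC-based EER estimate is used, which may differ in detail from the official ASVspoof scoring code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def log_score(p_bonafide, p_fake, eps=1e-12):
    """Eq. (1): LogScore = log10(P_B / P_F). Lower scores indicate fake."""
    return np.log10((np.asarray(p_bonafide) + eps) / (np.asarray(p_fake) + eps))

def eer_and_threshold(scores, labels):
    """EER and the LogScore threshold at which it occurs (fake = score below it)."""
    fpr, tpr, thr = roc_curve(labels, -np.asarray(scores))  # negate: higher = more fake
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR ~= FNR
    eer = (fpr[i] + fnr[i]) / 2.0
    return eer, -thr[i]  # undo the negation to report a LogScore threshold
```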
Particularly, the logarithmic score is computed from the bonafide and fake probabilities as:

$$\mathrm{LogScore} = \log_{10}\!\left(\frac{P_B}{P_F}\right), \qquad (1)$$

where $P_B$ and $P_F$ are the bonafide and fake probabilities, respectively.

D. Experimental Results and Discussion

As Fig. 3 and Fig. 4 show, when we train the baseline on the AG dataset, the distribution of fake probabilities (i.e., the output of the Softmax layer) for fake audio concentrates near one, while the fake probabilities for bonafide audio spread from zero to near one. This yields a threshold of -1.38 at the computed EER of 0.85. In other words, with the threshold score of -1.38, the fake probability must be significantly larger than the bonafide probability (i.e., near one) before an input is declared fake.

[Fig. 3: Distribution of fake probabilities on the In-The-Wild dataset when the baseline is trained on the AG dataset.]
[Fig. 4: Distribution of logarithmic scores on the In-The-Wild dataset when the baseline is trained on the AG dataset (threshold: -1.38).]

In contrast, when we train the baseline on the BR dataset, the distribution of fake probabilities for fake audio spreads from zero to one while the fake probabilities for bonafide audio concentrate near zero. This yields a threshold value of 1.45 at the computed EER of 0.82. In other words, with the threshold score of 1.45 as the boundary, an input can be declared fake even when its fake probability is significantly smaller than its bonafide probability (i.e., near zero).

[Fig. 5: Distribution of fake probabilities on the In-The-Wild dataset when the baseline is trained on the BR dataset.]
[Fig. 6: Distribution of logarithmic scores on the In-The-Wild dataset when the baseline is trained on the BR dataset (threshold: 1.45).]

As a result, we conclude that if we train a model on either the AG dataset or the BR dataset and then use the threshold score accompanying the EER metric at inference, the model overfits the training dataset and may significantly misclassify unseen data. Indeed, if, instead of the threshold score from the EER metric, we use the probability of 0.5 (the output of the Softmax layer) as the boundary to decide fake or bonafide input, the confusion matrices in Fig. 7 and Fig. 8 show that misclassification occurs on bonafide audio when training on the AG dataset and on fake audio when training on the BR dataset.

[Fig. 7: Confusion matrix (%) on In-The-Wild using the AG dataset for training (boundary probability 0.5): bonafide → 54.3 bonafide / 45.7 fake; fake → 4.5 bonafide / 95.5 fake.]
[Fig. 8: Confusion matrix (%) on In-The-Wild using the BR dataset for training (boundary probability 0.5): bonafide → 97.1 bonafide / 2.9 fake; fake → 57.4 bonafide / 42.6 fake.]
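To make the contrast between the two decision rules concrete, a minimal sketch follows; this is our illustration, and the variable names are assumptions.

```python
def decide_fixed_boundary(p_fake: float, boundary: float = 0.5) -> bool:
    """Fixed probability boundary: declare fake if P_F exceeds 0.5."""
    return p_fake > boundary

def decide_eer_threshold(lscore: float, threshold: float) -> bool:
    """EER-derived rule: declare fake if the LogScore of Eq. (1) falls below the
    dataset-specific threshold (e.g., -1.38 for the AG-trained baseline, 1.45
    for the BR-trained baseline)."""
    return lscore < threshold
```

The fixed rule is the same for every dataset, whereas the EER-derived threshold shifts with the training data, which is exactly the overfitting risk discussed above.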
III. PROPOSED DSD DATASET, DSD MODEL, AND CROSS-DATASET EVALUATION

A. Proposed Dataset for the DSD Task

The experimental results in Section II-D indicate that the imbalance between Bonafide Resource (BR) and AI-Based Generators (AG) leads to significantly different threshold scores. Consequently, a DSD model trained on a dataset with imbalanced BR and AG may not generalize to unseen data at inference if the threshold score from the EER metric is used. In other words, individual EER scores and corresponding threshold scores evaluated on independent testing datasets may not truly reflect the generality of a DSD model. To address these concerns, we propose a DSD dataset that combines the BR dataset and the AG dataset, referred to as the BR-AG dataset. As Table II shows, the BR-AG dataset comprises 103,683 bonafide utterances and 1,134,480 fake utterances. Among the fake utterances, 462,354 are generated by Text-to-Speech (TTS) systems and 672,126 by Voice Conversion (VC) systems.

B. Proposed DSD Model

Given the proposed BR-AG dataset, we construct a DSD model that improves on the baseline architecture of Section II-A and re-uses the training strategy of our previously published work [27]. As Fig. 9 shows, training the proposed DSD model comprises three main stages. In the first stage, the pre-trained model is used as the backbone to extract audio embeddings, which are then explored by three heads, each with its own loss function. The first is the A-Softmax loss, used to classify audio embeddings into three categories: bonafide, TTS (fake audio generated by TTS systems), and VC (fake audio generated by VC systems). The second is a Contrastive loss, used to separate bonafide audio embeddings from fake audio embeddings. The final loss, the Central Bonafide loss, is used to condense the bonafide audio embeddings. The combination of these three loss functions effectively pushes the distributions of fake and bonafide audio embeddings far apart. In the second stage, the three heads of the first stage are removed, and the backbone is connected to a single head trained with the Binary Cross-Entropy loss to learn the two categories of fake and bonafide. In the final stage, bonafide utterances are fed into the backbone to extract bonafide audio embeddings, which are used to model the bonafide distribution N(μ, Σ). At inference, unseen audio is first fed into the backbone to extract its audio embedding; the embedding is then compared with the bonafide distribution (via the Mahalanobis distance) to decide whether the audio is fake or bonafide.

[Fig. 9: The proposed model with the three-stage training strategy — Stage 1: multi-class, multi-loss training (A-Softmax, Contrastive, and Central Bonafide losses over backbone embeddings); Stage 2: two-class fine-tuning with a Binary Cross-Entropy head; Stage 3: fitting the bonafide distribution N(μ, Σ) from bonafide embeddings.]
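The stage-3 inference can be sketched as follows. This is our reading of the description above: the covariance regularization and the distance cut-off `tau` are assumptions, as the paper does not report how the final decision threshold is chosen.

```python
import numpy as np

def fit_bonafide_distribution(bonafide_embeddings: np.ndarray):
    """bonafide_embeddings: (n_samples, dim). Returns mean and inverse covariance
    of the Gaussian N(mu, Sigma) fitted over bonafide embeddings."""
    mu = bonafide_embeddings.mean(axis=0)
    cov = np.cov(bonafide_embeddings, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize so the covariance is invertible
    return mu, np.linalg.inv(cov)

def mahalanobis_distance(embedding: np.ndarray, mu: np.ndarray, inv_cov: np.ndarray) -> float:
    d = embedding - mu
    return float(np.sqrt(d @ inv_cov @ d))

def is_fake(embedding: np.ndarray, mu: np.ndarray, inv_cov: np.ndarray, tau: float) -> bool:
    """An embedding far from the bonafide distribution is flagged as fake."""
    return mahalanobis_distance(embedding, mu, inv_cov) > tau
```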
C. Experimental Settings

To evaluate the backbone of the proposed DSD model, we consider three pre-trained models — the Whisper-base model [28], WavLM-Large [6], and Wav2Vec2-XLSR-53 [5] — which show state-of-the-art results on the upstream Speech-to-Text task. Regarding testing datasets, we evaluate on In-The-Wild [18], the ASVspoof5 Evaluation subset [29], and FakeAVCeleb [30]. The ASVspoof5 Evaluation subset [29] is selected because it presents a diverse bonafide resource and fake generators unseen in the ASVspoof5 Development and Train subsets of the BR-AG dataset. Meanwhile, FakeAVCeleb [30] is selected because it uses VoxCeleb [17] as the bonafide resource, which presents real-world conversations from a large number of speakers. In this experiment, we report the Accuracy (Acc.), F1, and AUC scores using the fixed probability of 0.5 as the boundary for all evaluation datasets. The other settings are re-used from the experiment in Section II-C.

D. Experimental Results and Discussion

Comparing the pre-trained models in the lower part of Table III shows that fine-tuning the pre-trained WavLM-Large with the three-stage training strategy is more effective than fine-tuning the Wav2Vec2-XLSR-53 and Whisper-base models. Fine-tuning the WavLM-Large backbone also improves performance over freezing it as in the baseline. The upper part of Table III further shows that training on the combined BR-AG dataset improves the frozen baseline on In-The-Wild (accuracy 0.80) compared with training on the AG (0.71) or BR (0.75) dataset alone.

TABLE III. EVALUATION OF THE BR AND AG FACTORS ON THE IN-THE-WILD DATASET (USING THE PROBABILITY BOUNDARY OF 0.5)

| Model | Train Data | Test Data | Acc. | F1 | AUC |
|---|---|---|---|---|---|
| WavLM-Large frozen + MLP (baseline) | AG dataset | In-The-Wild | 0.71 | 0.72 | 0.74 |
| WavLM-Large frozen + MLP (baseline) | BR dataset | In-The-Wild | 0.75 | 0.58 | 0.69 |
| WavLM-Large frozen + MLP (baseline) | BR-AG dataset | In-The-Wild | 0.80 | 0.72 | 0.82 |
| WavLM-Large finetune + MLP | BR-AG dataset | In-The-Wild | 0.87 | 0.79 | 0.83 |
| Whisper-base finetune + MLP | BR-AG dataset | In-The-Wild | 0.73 | 0.60 | 0.65 |
| Wav2Vec2-XLSR-53 finetune + MLP | BR-AG dataset | In-The-Wild | 0.79 | 0.64 | 0.73 |

TABLE IV. CROSS-DATASET EVALUATION OF THE BEST MODEL (WAVLM-LARGE FINETUNE + MLP) USING THE BOUNDARY PROBABILITY OF 0.5

| Test Dataset | Accuracy | F1 Score | AUC |
|---|---|---|---|
| In-The-Wild | 0.87 | 0.79 | 0.83 |
| ASVspoof5 Eval Subset | 0.91 | 0.91 | 0.92 |
| FakeAVCeleb | 0.99 | 0.99 | 0.99 |

Given the best model, WavLM-Large-Finetune+MLP, we evaluate it on different datasets. As the experimental results in Table IV show, the proposed model achieves accuracy scores of 0.87 and 0.91 on In-The-Wild and the ASVspoof5 Evaluation subset, respectively. Significantly, the model achieves high performance on FakeAVCeleb (accuracy, F1, and AUC scores all of 0.99). These results indicate that the proposed model generalizes across these three evaluation datasets when using the fixed boundary probability of 0.5. Fig. 10 and Fig. 11, which present the distributions of fake probabilities and logarithmic scores on the In-The-Wild dataset for the best model, also show that the fake and bonafide distributions are well separated.

[Fig. 10: Distribution of fake probabilities on the In-The-Wild dataset using the BR-AG dataset to train the WavLM-Large-Finetune+MLP model.]
[Fig. 11: Distribution of logarithmic scores on the In-The-Wild dataset using the BR-AG dataset to train the WavLM-Large-Finetune+MLP model (threshold: 0.19).]

IV. CONCLUSION

We have presented a comprehensive analysis of the Bonafide Resource (BR) and AI-Based Generator (AG) factors and shown how they affect the performance and generality of a Deepfake Speech Detection (DSD) model. Given this analysis, we proposed the BR-AG dataset, which is used to train a general DSD model. Our proposed DSD model trained on the BR-AG dataset proves general across various benchmark DSD datasets and is promising for real-world applications. Our future work will explore cross-language evaluation.

REFERENCES

[1] L. Pham, P. Lam, D. Tran, H. Tang, T. Nguyen, A. Schindler, F. Skopik, A. Polonsky, and H. C. Vu, "A comprehensive survey with critical analysis for deepfake speech detection," Computer Science Review, vol. 57, p. 100757, 2025.
[2] K.-H. Ng, T. Song, Y. Wu, and Z. Xia, "XLSR-Mambo: Scaling the hybrid Mamba-attention backbone for audio deepfake detection," arXiv preprint arXiv:2601.02944, 2026.
[3] D. Combei, A. Stan, D. Oneata, and H. Cucu, "WavLM model ensemble for audio deepfake detection," arXiv preprint arXiv:2408.07414, 2024.
[4] S. Dowerah, A. Kulkarni, A. Kulkarni, H. M. Tran, J. Kalda, A. Fedorchenko, B. Fauve, D. Lolive, T. Alumäe, and M. M. Doss, "Speech DF Arena: A leaderboard for speech deepfake detection models," arXiv preprint arXiv:2509.02859, 2025.
[5] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," in Proc. INTERSPEECH, 2021, pp. 2426–2430.
[6] S. Chen et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[7] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge," in Proc. INTERSPEECH, 2015, pp. 2037–2041.
[8] R. Reimao and V. Tzerpos, "FoR: A dataset for synthetic speech detection," in Proc. International Conference on Speech Technology and Human-Computer Dialogue, 2019, pp. 1–10.
[9] "Audio source used to generate the FoR dataset," https://www.kaggle.com/datasets/percevalw/englishfrench-translations, 2018.
[10] X. Wang et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," Computer Speech & Language, vol. 64, p. 101114, 2020.
[11] B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, "The DeepFake Detection Challenge (DFDC) dataset," arXiv preprint arXiv:2006.07397, 2020.
[12] J. Yamagishi et al., "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," in Proc. Automatic Speaker Verification and Spoofing Countermeasures Challenge (ASVspoof) Workshop, 2021.
[13] J. Frank and L. Schönherr, "WaveFake: A data set to facilitate audio deepfake detection," in Proc. NeurIPS, 2021.
[14] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in Proc. ICML, 2018, pp. 2410–2419.
[15] R. Sonobe, S. Takamichi, and H. Saruwatari, "JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv preprint arXiv:1711.00354, 2017.
[16] H. Khalid, S. Tariq, M. Kim, and S. S. Woo, "FakeAVCeleb: A novel audio-video multimodal deepfake dataset," in Proc. NeurIPS Datasets and Benchmarks Track (Round 2), 2021.
[17] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. INTERSPEECH, 2018, pp. 1086–1090.
[18] N. Müller, P. Czempin, F. Diekmann, A. Froghyar, and K. Böttinger, "Does audio deepfake detection generalize?" in Proc. INTERSPEECH, 2022, pp. 2783–2787.
[19] X. Wang and J. Yamagishi, "Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders," in Proc. ICASSP, 2023, pp. 1–5.
[20] C. Sun, S. Jia, S. Hou, and S. Lyu, "AI-synthesized voice detection using neural vocoder artifacts," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 904–912.
[21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.
[22] Z. Cai, S. Ghosh, A. P. Adatia, M. Hayat, A. Dhall, and K. Stefanov, "AV-Deepfake1M: A large-scale LLM-driven audio-visual deepfake dataset," arXiv preprint arXiv:2311.15308, 2023.
[23] "1M-Deepfakes Detection Challenge," https://deepfakes1m.github.io/, 2023.
[24] N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, "MLAAD: The multi-language audio anti-spoofing dataset," in Proc. IJCNN, 2024, pp. 1–7.
[25] "The ASVspoof 2024 challenge," https://www.asvspoof.org/, 2024.
[26] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2015.
[27] L. Pham, D. Tran, P. Lam, F. Skopik, A. Schindler, S. Poletti, D. Fischinger, and M. Boyer, "DIN-CTS: Low-complexity depthwise-inception neural network with contrastive training strategy for deepfake speech detection," arXiv preprint, 2025.
[28] A. Radford et al., "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2023, pp. 28492–28518.
[29] T. B. Patel and H. A. Patil, "Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech," in Proc. INTERSPEECH, 2015, pp. 2062–2066.
[30] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. Jawahar, "A lip sync expert is all you need for speech to lip generation in the wild," in Proc. ACM International Conference on Multimedia, 2020, pp. 484–492.
