An Effective Energy Mask-based Adversarial Evasion Attacks against Misclassification in Speaker Recognition Systems
Evasion attacks pose significant threats to AI systems, exploiting vulnerabilities in machine learning models to bypass detection mechanisms. The widespread use of voice data, including deepfakes, in promising future industries is currently hindered by insufficient legal frameworks. Adversarial attack methods have emerged as the most effective countermeasure against the indiscriminate use of such data. This research introduces masked energy perturbation (MEP), a novel approach that uses the power spectrum to apply energy masking to the original voice data. MEP masks low-energy regions in the frequency domain before generating adversarial perturbations, targeting areas that are less noticeable under the human auditory model. The study primarily employs advanced speaker recognition models, including ECAPA-TDNN and ResNet34, which have shown remarkable performance in speaker verification tasks. The proposed MEP method demonstrated strong performance in both audio quality and evasion effectiveness. The energy masking approach effectively minimizes degradation in perceptual evaluation of speech quality (PESQ) scores, indicating that the adversarial perturbations cause minimal perceptual distortion for the human listener. Specifically, in the PESQ evaluation, the relative performance of the MEP method was 26.68% when compared to the fast gradient sign method (FGSM) and iterative FGSM.
💡 Research Summary
The paper introduces a novel adversarial evasion technique called Masked Energy Perturbation (MEP) targeting modern speaker‑recognition systems such as ECAPA‑TDNN and ResNet34‑based models. The core idea leverages the psychoacoustic masking phenomenon: human listeners are largely insensitive to low‑energy components of a speech spectrum, while they readily perceive changes in high‑energy regions. MEP first computes a short‑time Fourier transform (STFT) of the input waveform using a 25 ms Hann window and 12.5 ms frame shift, yielding 512 frequency bins per frame. For each utterance the peak energy \(x_{\text{peak}}\) is identified, and an energy threshold corresponding to –20 dB relative to the peak is applied. Bins below this threshold are masked to zero, producing a binary mask (\mu
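The masking step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 16 kHz sample rate, the FFT size of 1024 (which gives 513 one-sided bins rather than the 512 quoted above), and the function names `stft_power` and `energy_mask` are all hypothetical choices made for the example. Only the 25 ms Hann window, the 12.5 ms frame shift, and the –20 dB threshold relative to the utterance peak come from the summary.

```python
import numpy as np


def stft_power(wave, sample_rate=16000, win_ms=25.0, hop_ms=12.5, n_fft=1024):
    """Power spectrogram (frames x bins) using a Hann window.

    sample_rate and n_fft are assumptions for this sketch; the window and
    hop lengths follow the 25 ms / 12.5 ms values quoted in the summary.
    """
    win_len = int(round(sample_rate * win_ms / 1000))
    hop_len = int(round(sample_rate * hop_ms / 1000))
    if len(wave) < win_len:
        raise ValueError("waveform is shorter than one analysis window")
    window = np.hanning(win_len)
    n_frames = 1 + (len(wave) - win_len) // hop_len
    # Slice the waveform into overlapping frames and apply the window.
    frames = np.stack([
        wave[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # Zero-pad each frame to n_fft and keep the one-sided spectrum.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum) ** 2


def energy_mask(power, threshold_db=-20.0):
    """Binary mask: 1 where a bin is within threshold_db of the utterance
    peak power, 0 where it falls below that threshold."""
    peak = power.max()
    threshold = peak * 10.0 ** (threshold_db / 10.0)
    return (power >= threshold).astype(np.float32)


# Toy input: a 440 Hz tone with very weak noise, one second at 16 kHz.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
wave = np.sin(2 * np.pi * 440 * t) + 0.001 * rng.standard_normal(16000)

power = stft_power(wave)
mask = energy_mask(power)
print(power.shape, float(mask.mean()))
```

On this toy signal the mask keeps only the bins around the 440 Hz tone and zeroes the rest. How the mask is then combined with the adversarial perturbation is not covered by the text above, so the sketch stops at the mask itself.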