AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Jailbreak attacks on large audio-language models (LALMs) have only recently been studied, and existing work focuses exclusively on the attack scenario in which the adversary can fully manipulate user prompts (the strong adversary), while remaining limited in effectiveness, applicability, and practicability. In this work, we first conduct an extensive evaluation showing that advanced text jailbreak attacks cannot be easily ported to end-to-end LALMs via text-to-speech (TTS) techniques. We then propose AudioJailbreak, a novel audio jailbreak attack featuring (1) asynchrony: by crafting suffixal jailbreak audios, the jailbreak audio need not be aligned with the user prompt on the time axis; (2) universality: by incorporating multiple prompts into perturbation generation, a single jailbreak perturbation is effective across different prompts; (3) stealthiness: the malicious intent of jailbreak audios is concealed through several intent-concealment strategies; and (4) over-the-air robustness: by incorporating reverberation into perturbation generation, the jailbreak audios remain effective when played over the air. In contrast, no prior audio jailbreak attack offers asynchrony, universality, stealthiness, and over-the-air robustness simultaneously. Moreover, AudioJailbreak is also applicable to a more practical and broader attack scenario in which the adversary cannot fully manipulate user prompts (the weak adversary). Extensive experiments on the largest set of LALMs evaluated to date demonstrate the high effectiveness of AudioJailbreak; in particular, it can jailbreak OpenAI's GPT-4o-Audio and bypass Meta's Llama-Guard-3 safeguard in the weak-adversary scenario. We highlight that our work peeks into the security implications of audio jailbreak attacks against LALMs and realistically fosters improving their robustness, especially under the newly proposed weak-adversary setting.


💡 Research Summary

The paper “AudioJailbreak: Jailbreak Attacks against End‑to‑End Large Audio‑Language Models” investigates the vulnerability of modern large audio‑language models (LALMs) to jailbreak attacks. While prior work on jailbreaks has focused on text‑only large language models (LLMs) or on audio attacks that assume the adversary can fully control the user’s prompt (the “strong adversary” setting), this study reveals that such approaches are largely ineffective against end‑to‑end LALMs, especially when the attacker cannot manipulate the user’s spoken input (the “weak adversary” setting).

The authors first conduct a comprehensive evaluation showing that advanced text‑based jailbreak prompts, when converted to speech via text‑to‑speech (TTS) systems, achieve high success rates on cascaded LALMs (which rely on an ASR → LLM → TTS pipeline) but drop dramatically on end‑to‑end models that directly process audio embeddings. This motivates the design of a new attack that satisfies four desiderata simultaneously: (1) asynchrony – the malicious audio does not need to be temporally aligned with the user’s speech; (2) universality – a single perturbation works across many different user prompts; (3) stealthiness – the malicious intent is concealed from human listeners and automated content‑moderation systems; and (4) over‑the‑air robustness – the attack remains effective when the audio is played through speakers and captured by microphones in realistic acoustic environments.

The proposed method, named AudioJailbreak, constructs “suffixal jailbreak audios”. After a user finishes speaking, the attacker plays a short audio segment that contains a carefully crafted perturbation (δ). To achieve universality, the optimization incorporates multiple normal user prompts simultaneously, forcing δ to be effective for unseen prompts. Stealthiness is addressed through two strategies: (a) speeding up the malicious audio or applying subtle acoustic transformations so that the malicious content is hard to perceive, and (b) removing explicit malicious instructions altogether and relying on the perturbation to manipulate the model’s internal tokenization. Over‑the‑air robustness is obtained by augmenting the training loss with random Room Impulse Responses (RIRs), thereby simulating reverberation and other acoustic distortions.
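The structure of this objective can be illustrated with a toy sketch: one shared suffix perturbation δ is optimized jointly over several prompts, and each update draws a fresh random room impulse response (RIR) so the result tolerates reverberation (an expectation-over-transformations style loop). This is only a minimal sketch under simplifying assumptions, not the paper's implementation: the quadratic `surrogate_loss_and_grad` stands in for the actual (non-public) LALM jailbreak loss, and names such as `make_rir` and `optimize_suffix` are illustrative.

```python
# Toy sketch of a universal, reverberation-robust suffix perturbation.
# Assumption: a simple MSE surrogate replaces the real LALM loss; each
# prompt contributes only its reverb "tail" bleeding into the suffix window.
import numpy as np

rng = np.random.default_rng(0)
SUFFIX_LEN, RIR_LEN = 128, 32

def make_rir():
    """Random room impulse response: a unit direct path plus a
    randomly signed, exponentially decaying reverberant tail."""
    t = np.arange(RIR_LEN)
    rir = 0.1 * rng.normal(size=RIR_LEN) * np.exp(-t / 8.0)
    rir[0] += 1.0  # direct path
    return rir

def surrogate_loss_and_grad(delta, rir, context, target):
    """MSE between the reverberated suffix (plus the prompt's reverb
    tail, `context`) and a target signal, with the analytic gradient
    w.r.t. delta (for MSE through a convolution, the gradient is a
    correlation of the residual with the RIR)."""
    y = np.convolve(delta, rir)[:SUFFIX_LEN] + context
    residual = y - target
    loss = float(np.mean(residual ** 2))
    grad = np.convolve(residual, rir[::-1])[RIR_LEN - 1 : RIR_LEN - 1 + SUFFIX_LEN]
    return loss, (2.0 / SUFFIX_LEN) * grad

def optimize_suffix(prompt_tails, target, steps=400, lr=0.5):
    """Gradient descent on one shared delta, averaging gradients over all
    prompts (universality) with a fresh random RIR per term (robustness)."""
    delta = np.zeros(SUFFIX_LEN)
    for _ in range(steps):
        grad = np.zeros(SUFFIX_LEN)
        for tail in prompt_tails:
            _, g = surrogate_loss_and_grad(delta, make_rir(), tail, target)
            grad += g / len(prompt_tails)
        delta -= lr * grad
    return delta

# Three "prompts", modeled only by their reverb tails overlapping the
# suffix window; one shared target the suffix should induce for all of them.
tails = [0.1 * rng.normal(size=SUFFIX_LEN) for _ in range(3)]
target = np.sin(np.linspace(0, 8 * np.pi, SUFFIX_LEN))
delta = optimize_suffix(tails, target)
```

The per-prompt RIR resampling is the key design choice: because δ is never optimized against one fixed acoustic channel, it cannot overfit to it, which mirrors how the paper obtains over-the-air robustness from simulated reverberation.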

Experiments are performed on ten state‑of‑the‑art end‑to‑end LALMs (including Mini‑Omni, Qwen‑Audio, LLaSM, SpeechGPT, etc.) and two benchmark datasets covering policy‑violating and benign queries. Results show that for sample‑specific attacks (where a separate perturbation is generated per prompt), the attack achieves at least a 46% attack success rate (ASR) against strong adversaries and nearly 100% against weak adversaries across all models. For universal attacks (a single δ for all prompts), the ASR is 87% for strong and 76% for weak adversaries. When the audio is played over the air, the success rates remain high: 88% (strong) and 70% (weak). Notably, the closed‑source GPT‑4o‑Audio, which is relatively robust in the strong‑adversary setting, succumbs in the weak‑adversary scenario with 13%–34% ASR. Moreover, Meta's Llama‑Guard‑3 safeguard, designed to filter policy‑violating content, fails to block the jailbreak in the weak‑adversary case.

The paper also evaluates three categories of defenses—input filtering, token‑level moderation, and audio‑signal normalization—and finds that they provide limited protection, especially against the weak‑adversary attacks. This highlights a gap in current LALM security research, which has largely focused on text‑centric defenses.

In summary, AudioJailbreak demonstrates that end‑to‑end audio‑language models can be compromised even when the attacker has only minimal control (adding a suffix after the user’s speech). By jointly optimizing for asynchrony, universality, stealth, and acoustic robustness, the attack is both practical and potent. The authors release code and audio samples to facilitate reproducibility and to encourage the development of stronger, audio‑aware defenses for future LALMs.

