Multi-turn Jailbreaking Attack in Multi-Modal Large Language Models
In recent years, the security vulnerabilities of Multi-modal Large Language Models (MLLMs) have become a serious concern in the Generative Artificial Intelligence (GenAI) research. These highly intelligent models, capable of performing multi-modal tasks with high accuracy, are also severely susceptible to carefully launched security attacks, such as jailbreaking attacks, which can manipulate model behavior and bypass safety constraints. This paper introduces MJAD-MLLMs, a holistic framework that systematically analyzes the proposed Multi-turn Jailbreaking Attacks and multi-LLM-based defense techniques for MLLMs. In this paper, we make three original contributions. First, we introduce a novel multi-turn jailbreaking attack to exploit the vulnerabilities of the MLLMs under multi-turn prompting. Second, we propose a novel fragment-optimized and multi-LLM defense mechanism, called FragGuard, to effectively mitigate jailbreaking attacks in the MLLMs. Third, we evaluate the efficacy of the proposed attacks and defenses through extensive experiments on several state-of-the-art (SOTA) open-source and closed-source MLLMs and benchmark datasets, and compare their performance with the existing techniques.
💡 Research Summary
The paper “Multi‑turn Jailbreaking Attack in Multi‑Modal Large Language Models” investigates a previously under‑explored vulnerability of multi‑modal large language models (MLLMs) – namely, the ability of an adversary to bypass safety guardrails through a sequence of carefully crafted prompts that span multiple interaction turns. The authors first formalize the architecture of a typical vision‑language model (LVLM) as a three‑stage pipeline (visual encoder → connector → textual decoder) and point out that safety constraints are usually embedded only in the textual decoder. This design leaves a “modal gap” that can be exploited by embedding malicious content in the visual modality while using innocuous text prompts to gradually lower the model’s defensive posture.
Attack methodology (MJAD‑MLLM).
The attack proceeds in three stages: (1) the adversary supplies an image containing a typographic overlay of a prohibited phrase (e.g., instructions for hacking) generated by Stable Diffusion; (2) in turn 1 the user asks a benign question such as “Describe what you see in the image?” to engage the model; (3) in turn 2 the user requests a hypothetical scenario (e.g., a movie script) that forces the model to imagine the content of the image; (4) in turn 3 the user directly asks for the prohibited content that was hidden in the image’s caption. Because the model has already produced a “safe” response in turn 1 and a creative but still benign response in turn 2, the safety filter becomes less strict, allowing the model to comply with the final malicious request. The authors evaluate this multi‑turn jailbreak on four state‑of‑the‑art open‑source (LLaVA‑7B/13B, Qwen‑2‑7B) and closed‑source (Gemini‑2.0‑Flash, GPT‑4o) models using the MM‑SafetyBench benchmark, which covers 13 prohibited scenarios. Results show a dramatic increase in attack success rate (ASR) from turn 2 to turn 3, reaching up to 91.5 % on LLaVA‑7B and 82.3 % on Gemini‑2.0‑Flash. Larger models (LLaVA‑13B) exhibit lower ASR, indicating that scaling can improve alignment but does not eliminate the vulnerability.
Defense proposal (FragGuard).
To counter the attack, the authors introduce FragGuard, a two‑step, training‑free defense. First, any model‑generated response R* is split into fixed‑length token fragments (F_r1 … F_rN). Second, each fragment is fed to three distinct language models—OpenAI’s reasoning‑focused o1, Google’s Gemini‑2.5‑Flash‑lite, and Meta’s open‑source LLaMA‑3 (70 B). Each defender outputs a toxicity score T_{j,k} for fragment k. The final toxicity T_final is the maximum across all fragments and all defenders, a conservative aggregation that favors safety. If T_final exceeds a predefined threshold τ (set to correspond to a toxicity level of 4 on a 1‑5 scale), the original response is suppressed and replaced with a safe refusal message (“Sorry, I cannot respond to this request”).
Evaluation metrics and results.
The authors assess both attack and defense using three metrics: Attack Success Rate (ASR), Average Toxicity Score (ATS), and Refusal Rate (RR) for the defense. Toxicity is measured by two production LLMs (GPT‑4 and Gemini‑2.5‑Flash) and manually verified. Under FragGuard, RR climbs to 78 %–94 % across models, while ATS drops from an average of 4.2 (pre‑defense) to below 1.2 (post‑defense). GPT‑4o shows the strongest defense performance (RR ≈ 94 %). The multi‑LLM aggregation proves more reliable than a single‑judge approach, reducing false‑negative toxicity detections.
Discussion and limitations.
The study highlights that multi‑turn interactions dramatically amplify jailbreak potency, a factor that prior single‑turn analyses missed. FragGuard’s reliance on multiple external LLM APIs introduces latency and computational cost, which may be prohibitive for real‑time applications. Moreover, the current defense focuses solely on textual toxicity; visual or audio modalities that could directly convey harmful content remain unaddressed. Future work is suggested to (i) develop lightweight, on‑device fragment evaluation, (ii) extend toxicity assessment to non‑text modalities, and (iii) explore adaptive guardrails that can detect escalating risk across conversation turns.
Conclusion.
This paper makes three key contributions: (1) a novel multi‑turn jailbreak framework that exploits the modal gap in MLLMs, achieving high success rates on both open‑source and commercial models; (2) a practical, training‑free defense (FragGuard) that leverages response fragmentation and multi‑LLM toxicity scoring to achieve high refusal rates and low residual toxicity; (3) a comprehensive experimental evaluation that establishes new baselines for both attack and defense in the MLLM security landscape. The work underscores the urgency of rethinking safety mechanisms for multi‑modal AI systems, especially as they become integral to user‑facing applications.
Comments & Academic Discussion
Loading comments...
Leave a Comment