Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Published in T ransactions on Mac hine Learning Researc h (03/2026) A dversa rial A ttacks on Multimo dal La rge Language Mo dels: A Comp rehensive Survey Bha vuk Jain bhavukj@go o gle.c om Go o gle Sercan Ö. Arık so arik@go o gle.c om Go o gle Hardeo K. Thakur Har de o.Thakur@b ennett.e du.in Bennett University, India Review ed on Op enReview: https: // openreview. net/ forum? id= zwzodDJkzZ Abstract Multimo dal large language mo dels (MLLMs) integrate information from multiple mo dal- ities such as text, images, audio, and video, enabling complex capabilities suc h as visual question answ ering and audio translation. While p o werful, this increased expressiveness in- tro duces new and ampliﬁed vulnerabilities to adversarial manipulation. This surv ey provides a comprehensive and systematic analysis of adversarial threats to MLLMs, moving b ey ond en umerating attack tec hniques to explain the underlying causes of mo del susceptibilit y . W e in tro duce a taxonomy that organizes adversarial attac ks according to attac ker ob jectiv es, unifying diverse attack surfaces across mo dalities and deploymen t settings. Additionally , we also present a vulnerabilit y-centric analysis that links integrit y attacks, safet y and jailbreak failures, con trol and instruction hijacking, and training-time p oisoning to shared arc hitec- tural and representational w eaknesses in multimodal systems. T ogether, this framew ork pro vides an explanatory foundation for understanding adversarial b eha vior in MLLMs and informs the dev elopmen t of more robust and secure m ultimo dal language systems. 1 Intro duction The rapid developmen t of Large Language Models (LLMs) marks a signiﬁcant leap in AI (Chang et al., 2023). By in tegrating and pro cessing information from diverse mo dalities, such as text, images, and audio, in v arious com binations, these mo dels mimic h uman perception to achiev e a more holistic understanding of the world (Li et al., 2023). This adv anced capability enables them to tac kle complex tasks previously b ey ond reac h, driving breakthroughs in areas suc h as image captioning (Vin yals et al., 2015), visual question answ ering (An tol et al., 2015), audio-visual scene analysis (Arandjelović & Zisserman, 2017), rob otics (Mon- Williams et al., 2025), and more. The p o w er of these mo dels stems from their scalable attention-based arc hitectures (V aswani et al., 2017) and eﬀective training recipes: pre-training on v ast datasets, follow ed by p ost-training alignment on a m ultitude of tasks to suit downstream use cases, thereb y achieving state-of- the-art p erformance. How ever, the v ery complexity that fuels the p o w er of LLMs also introduces nov el and in tricate vulnerabilities (Go odfellow, 2014). As illustrated in Figure 1, unlik e text-only LLMs, multimodal LLMs presen t an expanded attack surface where threats can arise not only from weaknesses within individual mo dalit y pro cessing but, crucially , from the complex interpla y and fusion mechanisms b et ween the com bined mo dalities (Baltrušaitis et al., 2018). These cross-mo dal vulnerabilities can manifest through several ke y attac k v ectors, some of whic h are as follo ws: 1 Published in T ransactions on Mac hine Learning Researc h (03/2026) • Cross-mo dal prompt injection (Bagdasaryan et al., 2023) exploits LLMs’ instruction-following nature by em b edding malicious commands within non-textual modalities. A ttac k ers can hide in- structions in images or audio that the mo del in terprets as textual commands, eﬀectively hijacking b eha vior without directly manipulating text prompts. • F usion mec hanism attacks target ho w the core process LLMs integrate information from m ultiple mo dalities, exploiting vulnerabilities in how features from diﬀeren t sources are combined and aligned. These attac ks can disrupt the fusion pro cess, causing the model to misin terpret or improp erly w eigh m ultimo dal inputs. • A dversarial illusions exploit the shared em b edding space where LLMs align represen tations across mo dalities. A ttack ers can craft inputs in one mo dalit y that appear benign but whose embeddings b ecome deceptively aligned with unrelated concepts from another modality , causing the model to hallucinate false seman tic connections (Bagdasary an et al., 2024). As such p o werful mo dels are increasingly deploy ed in real-world applications, including safet y-critical sys- tems, understanding their susceptibilit y to adv ersarial manipulation b ecomes paramount. The landscap e of attacks targeting these LLMs is evolving rapidly , encompassing a range of techniques. These include adversarial p erturbations designed to cause misclassiﬁcation or erroneous outputs (Go odfellow et al., 2014), jailbreak attac ks that bypass safety alignments (W ei et al., 2023), prompt injection metho ds that hijac k mo del behavior (P erez & Rib eiro, 2022), and data p oisoning strategies that corrupt the model during training (Gu et al., 2017). Giv en the increasing sophistication and p oten tial impact of these attac ks, a systematic understanding of the curren t threat landscap e is urgently needed. While recen t eﬀorts hav e b egun to map the ﬁeld with surv eys on general LLM attac ks (Shay egani et al., 2023b) and on speciﬁc pairs of mo dalities, suc h as vision–language mo dels (Liu et al., 2024b), a deep er analyt- ical connection b et ween documented attacks and the underlying vulnerabilities they exploit remains limited. A notable p eer-review ed ov erview by Li & F ung (2025) surv eys a broad range of security concerns for large language mo dels, including prompt injection, adversarial manipulation, poisoning, and agent-related risks, pro viding a v aluable high-lev el syn thesis of the ev olving threat landscap e. More narrowly , recen t w ork on m ultimo dal prompt injection characterizes injection v ectors and defenses across modality pairs, but does not pursue a uniﬁed taxonomy or vulnerability-lev el explanation b ey ond prompt-based control failures (Y eo & Choi, 2025). Ho w ev er, existing surveys predominantly organize attacks by mo dalit y , attack surface, or appli- cation con text, and focus on cataloging attack techniques and defenses rather than systematically explaining wh y div erse attacks recur across diﬀerent model arc hitectures and deplo yment settings. T o complemen t this line of work, our surv ey in tro duces a taxonom y that classiﬁes adv ersarial attac ks b y attac k objective, pro viding a uniﬁed organizational framework that cuts across mo dalities and application domains. Building on this taxonomy , we also adopt a vulnerabilit y-fo cused p erspective that explicitly links attac k categories to shared architectural and representational w eaknesses in m ultimo dal large language models, including failures in cross-mo dal interaction, mo dalit y-sp eciﬁc pro cessing, instruction following, and con trol. This combined taxonom y and vulnerability-driv en analysis mov es b ey ond en umerating attac k instances to explain the root causes of adv ersarial susceptibilit y in MLLMs. The remainder of this surv ey is organized as follo ws. Section 2 provides background on multimodal large language models, highlighting architectural components and formal abstractions that are most relev ant to ad- v ersarial analysis. Section 3 introduces a goal-driv en taxonomy of adversarial attacks on multimodal LLMs, organizing prior work b y primary adv ersarial ob jectiv e, and analyzes these attac ks under diﬀerent attac ker kno wledge assumptions (white-b o x, gray-box, and black-box). Section 4 shifts the focus from attack con- structions to ro ot causes, presen ting a vulnerabilit y-centric analysis that systematically links observ ed attac k families to shared arc hitectural, represen tational, and training-induced w eaknesses in m ultimo dal systems. Section 5 brieﬂy discusses representativ e defense mechanisms and mitigation strategies, con textualized with resp ect to the attack families in the proposed taxonomy . Section 6 outlines the limitations of this surv ey and Section 7 discusses the broader impact of this survey pap er. The Appendix in the end, details the surv ey metho dology and inclusion criteria, and provides a consolidated table summarizing empirical c haracteristics of the w orks co v ered in the taxonom y diagram. 2 Published in T ransactions on Mac hine Learning Researc h (03/2026) In terms of paper selection, this survey fo cuses on p eer-review ed w orks that ev aluate adv ersarial attac ks on multimodal large language mo dels with a language-model-based reasoning core processing t wo or more input mo dalities. A ttac ks op erating on unimo dal surfaces are included when they are ev aluated against an MLLM, as they represen t baseline threats inherited b y m ultimo dal systems. Con v ersely , works targeting unimo dal-only mo dels, non-LLM m ultimo dal systems, or non-adversarial failure mo des (e.g., calibration, fairness) are excluded, as are non-p eer-review ed preprints and non-English publications. F ull inclusion and exclusion criteria are detailed in App endix A. Figure 1: A high-lev el ov erview of the adversarial attac k landscap e for Multimo dal LLMs. 2 Background: Understanding LLMs in the Context of A dversarial Attacks This section lays the groundw ork for understanding how multimodal LLMs are susceptible to adv ersarial attac ks. W e will brieﬂy in tro duce LLM arc hitectures and their operational principles, focusing on asp ects that b ecome relev ant when discussing vulnerabilities. W e then delve into the mathematical formulation of LLMs as it p ertains to the generation and eﬀect of adv ersarial p erturbations, and ﬁnally , w e examine key comp onen ts as distinct attack surfaces. 2.1 LLM Architectures and Asp ects Exploited b y A ttacks Multimo dal LLMs are designed to pro cess and integrate information from multiple input types, commonly including text, images, audio & video (Li et al., 2023). Their core capability lies in generating coheren t and contextually relev ant outputs based on this fused m ultimo dal understanding, enabling tasks like visual question answering, image/video captioning, and audio-visual scene in terpretation. F rom an adv ersarial p erspective, several architectural and op erational c haracteristics of LLMs are particularly relev ant: • Mo dalit y-Sp eciﬁc Enco ders ( E i ): T o pro cess div erse inputs, LLMs often employ specialized enco ders for eac h input t yp e: for images, these include Vision T ransformers (ViT s) (Doso vitskiy et al., 2020; Radford et al., 2021) and Con volutional Neural Net works (CNNs); for text, T ransformer- based enco ders (Devlin et al., 2019); and for audio, a range of mo dels from RNNs (Baevski et al., 2020), to the increasingly prev alent T ransformer-based architectures that transform ra w inputs into high-dimensional v ector representations. Their primary vulnerabilit y stems from often being pre- trained on large unimo dal datasets, allowing them to inherit adversarial weaknesses from these source mo dels. As a result, adv ersarial examples originally crafted for standard image classiﬁers (e.g., b y adding imperceptible noise to pixels) (Go o dfello w et al., 2014) or audio classiﬁers (e.g., through small 3 Published in T ransactions on Mac hine Learning Researc h (03/2026) w a v eform perturbations) (Carlini & W agner, 2018) can successfully fo ol the resp ectiv e enco der of the LLM. This propagates erroneous or misleading feature represen tations from the manipulated mo dalit y into the subsequent fusion stage, undermining the LLMs’ ov erall understanding. • F usion Mec hanisms ( f fuse ) and Cross-Mo dal Alignmen t: A critical comp onen t is the fusion mec hanism, which integrates the vector representations from diﬀerent mo dalities. This can range from simple concatenation or p ooling to more complex techniques lik e cross-mo dal atten tion, where information from one mo dalit y dynamically guides the pro cessing of another (V aswani et al., 2017; Lu et al., 2019). This intricate pro cess of integrating information represen ts a signiﬁcant p oin t of p oten tial failure. Disruptions in the fusion pro cess, or misalignmen t in the learned cross-modal com- bination can lead the mo del to incorrectly prioritize, misin terpret, or fail to reconcile information from diﬀerent mo dalities, esp ecially when one is adv ersarially manipulated. Attac ks manifest this vulnerabilit y primarily through cross-modal perturbations (Dou et al., 2023), where subtle alter- ations in one modality (e.g., text) are designed to mislead the in terpretation of another (e.g., an image), or vice-v ersa. • A ttention Mec hanisms: T ransformers, central to many LLMs, rely hea vily on attention mecha- nisms (V aswani et al., 2017). These mec hanisms w eigh the importance of diﬀerent parts of the input via key-query alignment, b oth within a single mo dalit y (self-atten tion) and betw een diﬀeren t modal- ities (cross-mo dal attention, often part of the fusion mechanism). The wa y attention is distributed is a key target, as attack ers can craft inputs that exploit or manipulate these attention scores (W ang et al., 2024c). F or example, an attac k might introduce features that unduly capture the mo del’s atten tion, drawing focus to misleading information, or con v ersely , it might try to suppress atten tion to critical b enign features, thereby derailing the model’s reasoning pro cess and its ability to correctly fuse m ultimo dal information. • Join t Represen tation Space ( Z join t ): After fusion, information from m ultiple mo dalities is of- ten enco ded in a join t em b edding space. This space is designed to capture complex in ter-mo dal relationships and dep endencies, providing a uniﬁed representation for downstream pro cessing. The vulnerabilit y here lies in the potential for direct manipulation of this abstract space. If attac kers gain an understanding of ho w this high-dimensional space is structured (e.g., through mo del in v ersion (F redrikson et al., 2015) or probing), they can craft inputs whose fused representations are pushed in to "malicious" or incorrect regions. This is t ypically exploited through feature-space attacks, which aim to directly p erturb these joint embeddings. The goal is often to ensure that the fused repre- sen tation of an adv ersarial input is either close to that of a sp eciﬁc target malicious concept or signiﬁcan tly distan t from the representation of the benign, original input, even if the input-space p erturbations are subtle. • T ransformer-Based Deco ders and Prompting: LLMs’ deco der pro cesses fuse multimodal in- formation to generate the ﬁnal output (e.g., text, a classiﬁcation). Ho wev er, the core instruction- follo wing capability that makes these models so p o werful also renders them inherently vulnerable. A dv ersaries can exploit this b y crafting malicious prompts designed to manipulate the model in to generating unin tended or harmful resp onses. This vulnerabilit y is exploited through several meth- o ds, including: (1) Prompt Injection (P erez & Ribeiro, 2022), (2) Jailbreaking (W ei et al., 2023), and (3) T yp ographic or Visual A ttac ks (Qraitem et al., 2024), which are discussed further in subsequent sections. A more nuanced view of LLM vulnerabilities emerges from analyzing its architectural components not only b y their function but also as distinct adversarial attac k surfaces. This p erspective will inform the taxonomy of attac ks presen ted in the follo wing sections. 2.2 Mathematical Rep resentation of LLMs and Adversa rial Perturbations This section formalizes the concept of adversarial attac ks b y ﬁrst deﬁning the op erational structure of an LLM and then detailing the construction and optimization of adv ersarial inputs. 4 Published in T ransactions on Mac hine Learning Researc h (03/2026) Let an LLM b e represen ted b y a function F . F or n diﬀerent input mo dalities, let X i ∈ D i b e the input from the i-th mo dalit y , where D i is the domain for that modality (e.g., D image for images, D text for text sequences). The LLM pro cesses the set of inputs X = { X 1 , X 2 , . . . , X n } to pro duce an output Y : Y = F ( X ; θ ) , , (1) where θ represents the mo del’s learnable parameters. This pro cess, as formalized by Baltrušaitis et al. (2018), t ypically in v olv es the follo wing sequence of op erations: 1. Enco ding: Each X i is transformed by a mo dalit y-sp eciﬁc enco der E i in to a feature representation Z i : Z i = E i ( X i ; θ E i ) (2) 2. F usion: These represen tations { Z 1 , . . . , Z n } are combined b y a fusion function f fuse in to a joint represen tation Z joint : Z joint = f fuse ( Z 1 , . . . , Z n ; θ fuse ) (3) 3. Output Generation: A deco der or output la yer g generates the ﬁnal output Y : Y = g ( Z joint ; θ g ) (4) An adversarial attack aims to ﬁnd a set of p erturbations δ = { δ 1 , δ 2 , . . . , δ n } , where δ i is the p erturbation for mo dalit y X i . The goal, as established in foundational w ork on adversarial examples (Go odfellow et al., 2014), is to create a p erturbed input X adv : X adv = { X 1 + δ 1 , . . . , X n + δ n } ( denoted X + δ ) , (5) suc h that X adv causes the LLM to pro duce an undesired output Y adv . The op eration X i + δ i is deﬁned ap- propriately for eac h modality (e.g., pixel addition for images, token manipulation or em b edding perturbation for text). The p erturbation δ i for each modality is typically constrained to b e “small” or “imp erceptible. ” This is commonly enforced b y b ounding its L p norm (Madry et al., 2019): ∥ δ i ∥ p ≤ ϵ i for p ∈ { 0 , 1 , 2 , ∞} , (6) where ϵ i is a predeﬁned small budget for the i-th modality , ensuring the perturbation remains within ac- ceptable limits (e.g., imp erceptible to humans or within feasible manipulation ranges). The choice of p -norm inﬂuences the nature of the p erturbation (e.g., L ∞ leads to small changes to many elements, L 0 to c hanges in a few elemen ts). The attac k er’s ob jectiv e can typically b e formulated as an optimization problem. Let L ( · , · ) denote the loss function. W e can consider the b elo w categorization: • Un targeted Attac k: The goal is to ﬁnd δ that maximizes the dissimilarity b et w een the LLM’s output on the adversarial example and the true output Y true , a standard formulation for adv ersarial attac ks (Go odfellow et al., 2014). The ob jectiv e, sub ject to the p erturbation constrain ts, is: arg max δ L ( F ( X + δ ; θ ) , Y true ) sub ject to ∥ δ i ∥ p ≤ ϵ i for all i (7) Alternativ ely , if F pro duces class probabilities P ( Y | X ) , the attac ker might aim to maximize L ( P ( Y | X + δ ) , y true ) where y true is the true class lab el. • T argeted A ttac k: The goal is to ﬁnd δ that minimizes the dissimilarit y b et w een the LLM’s output on the adv ersarial example and a sp eciﬁc attack er chosen target output Y target (Kurakin et al., 2018). The ob jectiv e is: arg min δ L ( F ( X + δ ; θ ) , Y target ) sub ject to ∥ δ i ∥ p ≤ ϵ i for all i (8) F or instance, Y target could b e a sp eciﬁc incorrect class lab el or a desired malicious text. 5 Published in T ransactions on Mac hine Learning Researc h (03/2026) • Jailbreaking/Harmful Conten t Generation: Here, the objective is often to maximize the like- liho od of the LLM generating an output Y adv that contains harmful or forbidden conten t, C harmful , thereb y b ypassing the mo del’s safety alignments. This might b e form ulated as: arg max δ P ( C harmful ∈ F ( X + δ ; θ )) sub ject to ∥ δ i ∥ p ≤ ϵ i for all i (9) The pro cess of ﬁnding the optimal p erturbation δ in v olv es solving the optimization problems deﬁned in Equations 7, 8, and 9. This is t ypically ac hieved using optimization techniques such as gradien t-based metho ds (e.g., Pro jected Gradient Descent - PGD) in white-box settings where gradients are accessible (Madry et al., 2019). In scenarios lik e jail-breaking, the optimization may sp eciﬁcally inv olve maximizing scores from an external classiﬁer that detects harmful conten t. F or blac k-b o x scenarios where gradients are unkno wn, attack ers instead rely on query-based or transfer-based metho ds (Ilyas et al., 2018; Papernot et al., 2017). 3 T axonomy of A ttacks on Multimodal LLMs 3.1 A F ramew ork fo r Classiﬁcation T o organize the rapidly expanding landscap e of adversarial attac ks on MLLMs, we prop ose a taxonom y based on the primary goal of the adversary and the asp ect of mo del b eha vior that is compromised. Rather than classifying attac ks by lo w-level implementation details or input mo dalities, our framew ork groups attac ks by what asp ect of the mo del’s b eha vior or lifecycle is compromised, distinguishing attacks on mo del integrit y , safet y and alignmen t, b eha vioral control, and training-time reliability . At the top level, the taxonom y com- prises four attack families corresp onding to these goals, with attacks within each family further diﬀerentiated b y their dominant realization patterns, such as p erturbation-based manipulation, comp ositional prompts, represen tation-lev el exploits, or data-centric interv entions. While real-w orld attacks often span multiple ob- jectiv es and combine several tec hniques, we assign eac h attack to a single top-lev el category according to its primary adversarial ob jectiv e. Lo wer-lev el mechanisms and secondary eﬀects may ov erlap across categories. In addition to this taxonomy , we also analyze attacks along an orthogonal dimension: the attack er’s knowl- edge of the target model (white-b o x, gray-box, black-box). This dimension inﬂuences feasibilit y and method- ology but do es not deﬁne the attack’s ob jectiv e. A ccordingly , attack er kno wledge is treated as an analytical axis rather than a primary categorization principle. This c hoice of axes is motiv ated by the hybrid and cross-modal nature of adv ersarial threats in multimodal LLMs. The same attack construction can span multiple mo dalities or app ear across diﬀeren t applications, making mo dalit y-based taxonomies fragment related attacks and task-based categorizations fail to general- ize. Organizing attac ks by primary adversarial ob jectiv e instead captures the t yp e of system-level failure induced—suc h as loss of correctness, safety , control, or training reliability , indep enden t of ho w the attac k is instantiated. The attack er-knowledge dimension complemen ts this view b y reﬂecting realistic access as- sumptions in deplo y ed m ultimo dal systems. In total, this survey substan tively analyzes 88 unique works around attacks, vulnerabilit y analyses, and defenses. Of these, 65 works with full empirical characterization are consolidated in T able 6 (App endix A), encompassing the 45 attac k papers in the taxonomy as well as 20 additional attack and vulnerability analysis w orks drawn from the threat mo del discussion (Section 3.3) and vulnerabilit y-centric analysis (Section 4); the remaining 23 works inform the discussion of defense mechanisms in Section 5. Organized b y primary adv ersarial ob jectiv e across the 65 empirically c haracterized works, 20 target mo del in tegrity , 21 target safet y and alignment, 14 target behavioral control and instruction follo wing, and 11 target training-time reliability; one work (T ao et al., 2025) is assigned to b oth the safety and poisoning families due to its dual-ob jectiv e nature. In terms of target mo dalities, visual attack surfaces dominate: 58 of the 65 works in volv e image or video inputs, 9 in volv e audio, and 4 sp eciﬁcally target the video mo dalit y , while only 7 w orks op erate on purely non-visual surfaces. Regarding attack er knowledge, 36 w orks op erate under black-box assumptions, 16 under white-b o x access, 4 under gra y-b o x access, and 9 are ev aluated under mixed threat mo dels spanning m ultiple access levels. These distributions underscore b oth the breadth of existing research on vision–text adv ersarial attac ks and the comparativ ely limited co v erage of audio- and video-based threats. 6 Published in T ransactions on Mac hine Learning Researc h (03/2026) 3.2 A ttack T axonomy Based on the classiﬁcation framework described ab o ve, w e now present a hierarc hical taxonomy of adversarial attac ks on MLLMs. The taxonomy organizes existing attacks into four top-level families according to their primary adversarial ob jectiv e and system impact: Integrit y attac ks, Safety and Jailbreak attacks, Con trol and Injection attacks, Data P oisoning and Backdoor attac ks. Each family is further subdivided in to a small num b er of represen tativ e subtypes that capture the dominant w ays in whic h attacks are instan tiated in practice. Figure 2 illustrates this hierarch y and summarizes represen tative w orks asso ciated with each attac k subt yp e. 3.2.1 Integrit y A ttacks In tegrit y attac ks aim to compromise the correctness, reliability , or perceptual grounding of MLLMs with- out necessarily triggering explicit safet y or alignment violations. Unlike jailbreak attacks that seek to elicit prohibited or restricted con tent, integrit y attac ks t ypically op erate under b enign or seemingly inno cuous prompts, inducing incorrect descriptions, hallucinated reasoning, or attac k er-inﬂuenced outputs while re- maining within nominal p olicy b oundaries (Dong et al., 2023; Zhao et al., 2023; W ang et al., 2025c). These attac ks exploit the architectural structure of MLLMs, which combine modality-speciﬁc enco ders (e.g., vision or audio) with cross-mo dal alignmen t and fusion mec hanisms b efore deco ding responses through a language mo del. Perturbations in tro duced at the input signal lev el or within intermediate represen tations can propa- gate through multimodal fusion, systematically altering do wnstream reasoning and generation ev en when the textual prompt itself is b enign (Zhao et al., 2023; Xie et al., 2024). Figure 3 provides represen tative exam- ples of inference-time adversarial attac ks on m ultimo dal LLMs, illustrating how illustrating how multimodal inputs can b e manipulated to induce integrit y failures at deploymen t time. F rom an implementation p erspective, existing integrit y attac ks predominantly interv ene at three distinct lev els of the m ultimo dal pip eline: (i) contin uous p erturbations applied directly to raw input signals, (ii) discrete trigger artifacts that act as reusable control mec hanisms, and (iii) targeted manipulation of cross- mo dal represen tations or fusion dynamics. W e adopt this structural distinction to organize integrit y attacks in the remainder of this section. Signal P erturbations: Signal perturbation attacks introduce contin uous mo diﬁcations to ra w m ultimo dal inputs such as images, audio, or video frames, with the ob jectiv e of inducing incorrect p erception or do wn- stream reasoning errors. Examples include visually imperceptible adversarial p erturbations applied to images (Figure 3a) as we ll as inaudible commands em b edded within b enign audio signals (Figure 3d), b oth of which manipulate mo del p erception without altering the explicit user prompt. Although many p erturbations are designed to b e visually or acoustically subtle, their impact is ampliﬁed in MLLMs due to the coupling betw een mo dalit y enco ders and op en-ended language generation. Dong et al. (2023) show that adv ersarial images optimized against white-b o x surrogate vision encoders reliably induce incorrect image descriptions in Go ogle Bard and transfer to other deploy ed MLLMs under black-box access, illustrating that p erturbations crafted at the comp onen t level can generalize to end-to-end multimodal assistan ts. Complemen tary ev aluations by Zhao et al. (2023) sho w that b oth targeted and untargeted p erturbations can manipulate responses across a range of op en-source L VLMs, even when attack ers lack access to the language mo del itself. Bey ond direct transferabilit y , signal p erturbations can also in teract with the language generation pro cess in more subtle w a ys. Qraitem et al. (2024) show that L VLMs may generate deceptive t yp ographic con tent that subsequen tly misleads their own p erception mo dules, resulting in cascading integrit y failures. In a complementary direc- tion, W ang et al. (2025c) demonstrate that carefully optimized image p erturbations can steer tok en-level deco ding b eha vior, enabling controlled hallucinations and ﬁne-grained output manipulation across div erse VLM arc hitectures. Discrete T riggers: Discrete trigger attacks rely on structured artifacts—such as lo calized patches, opti- mized visual patterns, or symbolic ov erlays—that function as reusable con trol signals for m ultimo dal mo dels. In con trast to signal p erturbations, whic h are typically optimized p er instance, discrete triggers emphasize p ersistence and reusabilit y: once constructed, the same artifact can b e deplo yed across diverse inputs, prompts, or tasks to induce consistent integrit y failures. Recent w ork demonstrates that such triggers can 7 Published in T ransactions on Mac hine Learning Researc h (03/2026) Attac ks on Multimodal LLMs Integrit y Attacks Safety / Jail- break Attac ks Control & In- jection Attac ks Poisoning / Backdoor Attac ks Signal Per- turbations Discrete T riggers Representation & F usion Exploits Dong et al. (2023) Zhao et al. (2023) Qraitem et al. (2024) W ang et al. (2025c) Liu et al. (2024a) Cheng et al. (2024) Qraitem et al. (2025) Cao et al. (2025) Xie et al. (2024) Shay egani et al. (2023a) Yin et al. (2023) T eng et al. (2025) Unimodal Sur- faces (T/V/A) Multimodal Composite Universal T riggers Y ang et al. (2024) Roh et al. (2025) T ao et al. (2025) Qi et al. (2023) Li et al. (2025b) Huang et al. (2025b) Miao et al. (2025) W ang et al. (2025d) Y ang et al. (2025) W ang et al. (2024a) Geng et al. (2025) Jeong et al. (2025) Prompt Injection System Instruction Manipulation T ool / Retrieval / Agentic Injection Bagdasaryan et al. (2023) Hou et al. (2025) Lee et al. (2025) Clusmann et al. (2025) F u et al. (2023) Zhang et al. (2025c) Aich b erger et al. (2025) W ang et al. (2025a) Gu et al. (2024) Zhang et al. (2025a) Liao et al. (2025) Data Poisoning Backdoors Fine-tuning / Adapter P oisoning Xu et al. (2024) T ao et al. (2025) Yin et al. (2025) Lyu et al. (2024) Liang et al. (2025) Lyu et al. (2025) Liang et al. (2024) Yin et al. (2025) Y uan et al. (2025) Liu & Zhang (2025) Figure 2: Hierarc hical taxonomy of attac ks on multimodal LLMs. generalize across mo dels and deplo yment settings, eﬀectively acting as univ ersal adv ersarial artifacts in large vision–language systems (Liu et al., 2024a). Another recen t work sho ws that discrete triggers can be de- 8 Published in T ransactions on Mac hine Learning Researc h (03/2026) (a) Visual adversarial perturbation inducing incorrect p erception. (b) An image is engineered for indirect prompt injection. (c) A visual prompt bypasses the model’s safety align- men ts. (d) An inaudible command is hidden in a b enign audio signal. Figure 3: Representativ e attacks: visual p erturbations, indirect prompt injection via images, visual safet y b ypasses, and inaudible audio commands. signed to remain eﬀective under scene coherence and ph ysical-w orld constraints, enabling robust real-world in tegrit y attac ks against vision–language mo dels (Cao et al., 2025). A key prop ert y of discrete triggers is that they interact with vision enco ders in a stable and input-agnostic manner, causing do wnstream multimodal fusion and language generation to b e systematically biased to ward attac k er-c hosen interpretations. As a result, once deplo yed, such triggers can induce mis-classiﬁcation, hallu- cinated descriptions, or targeted outputs across prompts and tasks without requiring per-input optimization, underscoring their practical threat in real-w orld deplo ymen ts. . Represen tation and F usion Exploits: Represen tation and fusion exploits target the in termediate align- men t mec hanisms that bind mo dalit y-sp eciﬁc enco ders to language deco ding. Rather than relying solely on pixel-lev el similarity constrain ts, these attacks manipulate join t em b eddings, attention patterns, or cross- mo dal feature in teractions so that even seemingly b enign prompts are in terpreted under an adv ersarial m ultimo dal con text. Xie et al. (2024) introduce a transfer-based attac k that iterativ ely updates adv er- sarial examples via multimodal seman tic correlations, lev eraging cross-modal alignment signals to improv e blac k-b o x attack eﬀectiv eness. Similarly , Sha yegani et al. (2023a) show that adv ersarial images can steer join t represen tations tow ard harmful resp onse behaviors when paired with generic prompts, even without access to the underlying language mo del. Empirical robustness studies further suggest that vulnerabilities arise from cross-mo dal alignmen t pathw ays beyond unimo dal brittleness, with attacks crafted on surrogate vision-language comp onen ts often transferring to end-to-end assistants (Zhao et al., 2023; Yin et al., 2023). 9 Published in T ransactions on Mac hine Learning Researc h (03/2026) 3.2.2 Safet y and Jailbreak Attacks Safet y and jailbreak attac ks aim to b ypass alignmen t mec hanisms, policy constrain ts, or conten t safeguards in MLLMs, inducing the mo del to produce prohibited or harmful outputs. In con trast to integrit y attacks that primarily degrade correctness or p erceptual grounding, jailbreak attacks explicitly target harmlessness and p olicy c omplianc e , often seeking to elicit disallo wed instructions or unsafe con tent even when the user-facing prompt app ears b enign or indirect (Qi et al., 2023; Li et al., 2025b). A k ey driver of these failures is the expanded multimodal attac k surface: safet y alignment is commonly strongest for text, while non-text modalities (images, audio, video) pro vide alternate channels through whic h malicious inten t can b e comm unicated or obfuscated. Recen t w ork demonstrates that alignmen t can b e b ypassed b y (i) op erating through a single under-protected mo dalit y , (ii) comp osing attacks across mo dalities to ev ade unimodal ﬁlters, or (iii) constructing reusable univ ersal triggers that generalize across prompts and tasks (W ang et al., 2024a; Geng et al., 2025; Jeong et al., 2025). Unimo dal Surfaces: Unimo dal-surface jailbreaks operate through a single mo dalit y (e.g., audio-only) while targeting an MLLM that retains a language-mo del-based instruction-following core. These attac ks exploit the fact that the model’s ﬁnal resp onses are generated b y a shared language-deco ding core, so a w eakness in one mo dalit y enco der or its safet y handling can compromise the o verall system. Audio jailbreaks pro vide a clear example. Y ang et al. (2024) systematically red-team audio-capable multimodal mo dels and sho w that harmful queries deliv ered in audio form, as well as speech-speciﬁc jailbreak strategies, achiev e high attack success rates, indicating misalignment b et ween text safety and audio safety . Roh et al. (2025) further show that multilingual and m ulti-accent v ariations substantially amplify jailbreak success, suggesting that safet y training and ﬁltering generalize po orly across cross-lingual phonetics and acoustic perturbations. T ogether, these results demonstrate that a single-mo dalit y c hannel can serve as an eﬀectiv e jailbreak surface b y exploiting cross-mo dal inconsistencies in safet y training and ﬁltering, enabling harmful inten t to propagate through an otherwise aligned m ultimo dal system. Multimo dal Comp osite Jailbreaks: Multimodal comp osite jailbreaks exploit interactions across modal- ities—most commonly image–text or video–text—so that malicious in tent is distributed or encoded in a w a y that is not easily captured b y unimodal safet y mechanisms. F or instance, visually rendered prompts can o v erride safet y constraints when interpreted jointly with b enign textual instructions, as illustrated by a visual jailbreak example in Figure 3c. A cen tral observ ation is that cross-mo dal fusion can render the combined input harmful even when each comp onen t may appear inno cuous when assessed in isolation. Sev eral w orks demonstrate that visual inputs can directly undermine alignment. Qi et al. (2023) show that visual adver- sarial examples can circum v en t safety guardrails in aligned LLMs with integrated vision, including settings where a single adv ersarial image can act as a broadly eﬀective jailbreak artifact. Li et al. (2025b) provide systematic evidence that image inputs are a primary source of harmlessness vulnerabilities in MLLMs and in tro duce an image-assisted jailbreak metho d that ampliﬁes malicious inten t. Other approac hes explicitly optimize bi-mo dal interactions. Ying et al. (2024) prop ose a bi-modal adversarial prompt strategy that join tly optimizes visual and textual comp onen ts, demonstrating that co ordinated image–text manipulation outp erforms attac ks perturbing only one mo dalit y . Beyond static images, Hu et al. (2025b) show that the video mo dalit y in tro duces additional vulnerabilities: distributing malicious cues across frames and exploiting temp oral dynamics can b ypass defenses that are eﬀective for single images. Domain-sp eciﬁc and con text-driven jailbreaks further underscore cross-mo dal risks. Huang et al. (2025b) demonstrate that medical MLLMs can be jailbroken via cross-mo dalit y attac ks and mismatc hed (out-of- con text) constructions, highligh ting the fragilit y of safety mec hanisms in specialized deploymen ts. Miao et al. (2025) formalize a vision-centric jailbreak setting where images are used to construct realistic harmful con texts via image-driven context injection, yielding high attack success rates against blac k-b o x MLLMs. Finally , comp osite jailbreaks can b e achiev ed through structured obfuscation and attention manipulation. W ang et al. (2025d) prop ose a multi-modal linkage mechanism that hides malicious inten t via cross-mo dal “enco ding/decoding” structure to reduce ov er-exp osure of harmful conten t while retaining strong jailbreak eﬀectiv eness. Y ang et al. (2025) show that distraction mechanisms—com bining structured decomp osition of 10 Published in T ransactions on Mac hine Learning Researc h (03/2026) harmful prompts with visually enhanced distraction—can disp erse atten tion and w eak en the mo del’s ability to detect and suppress unsafe generations. Univ ersal Jailbreak T riggers: Universal jailbreak triggers aim to pro duce reusable attac k artifacts that generalize across prompts, tasks, or inputs, functioning as query-agnostic “master keys. ” Compared to instance-sp eciﬁc jailbreaks, universal triggers p ose a stronger threat mo del b ecause they can b e deploy ed rep eatedly with minimal adaptation. W ang et al. (2024a) dev elop a white-b o x univ ersal jailbreak strategy that join tly optimizes image and text comp onen ts, pro ducing a univ ersal master key that reliably elicits harmful aﬃrmativ e resp onses across diverse harmful queries. In a complementary direction, Geng et al. (2025) show that non-textual modalities can themselves enco de universal malicious instructions: by opti- mizing adv ersarial images or audio to align with target instructions in em b edding space, the attac k bypasses safet y mec hanisms without requiring textual harmful instructions. Universal jailbreak capability can also arise from distribution shifts rather than explicit trigger optimization. Jeong et al. (2025) sho w that out-of- distribution transformations applied to harmful inputs increase mo del uncertain ty ab out malicious in tent, thereb y enabling jailbreaks that defeat safet y alignment on b oth LLMs and MLLMs. Collectively , these results indicate that universal jailbreaks arise both from systematic m ultimo dal alignment w eaknesses and from generalization failures under distributional shift. 3.2.3 Control and Injection Attacks Con trol and injection attacks aim to ov erride instruction prioritization, execution logic, or action-selection b eha vior of MLLMs, causing the system to follow attac k er-sp eciﬁed objectives rather than the intended user or system instructions. Unlike in tegrity attac ks, which primarily compromise correctness or perception, and jailbreak attacks, which b ypass safety constrain ts, control attacks explicitly target the mechanisms b y whic h MLLMs in terpret, prioritize, and execute instructions. These attacks are particularly relev ant in mo dern deploymen ts where MLLMs are embedded within in teractive systems that include system prompts, to ol in v o cation, and agen tic control lo ops. In such settings, adversarial inﬂuence can b e injected indirectly through m ultimo dal inputs that manipulate ho w instructions are interpreted or how downstream actions are triggered, ev en when the textual prompt itself app ears b enign (F u et al., 2023). Prompt Injection Prompt injection attac ks manipulate the instruction-follo wing b eha vior of MLLMs b y embedding o verride or compete with in tended system, developer, or user instructions. In m ultimo dal systems, suc h injections need not b e expressed explicitly in text; instead, they can b e encoded indirectly through non-textual mo dalities that inﬂuence the mo del’s internal interpretation of inten t, as illustrated in Figure 3b. By blending adversarial p erturbations corresp onding to malicious prompts in to visual or audio inputs, the attac ker can steer the mo del to output attac ker-c hosen resp onses or follo w unintended instructions when the user queries the p erturbed con tent. This w ork highligh ts that multimodal con text is often treated as authoritative by instruction-follo wing models, enabling prompt injection that b ypasses traditional text-only ﬁltering and mo deration. System Instruction Manipulation System instruction manipulation attac ks target higher-lev el con trol signals such as to ol-selection logic, execution policies, or agentic action-selection b eha vior, rather than user- facing instructions alone. These attacks are especially dangerous b ecause they operate at a level in tended to b e trusted and can directly aﬀect the model’s interaction with external resources. F u et al. (2023) sho w that visual adv ersarial examples can b e used to induce attack er-desired tool usage in to ol-augmen ted language mo dels. By manipulating the visual input, the attack er can cause the mo del to in v ok e sensitiv e to ols—suc h as calendar management or information retriev al—even when the user’s text prompt is inno cuous and do es not request suc h actions. The attack remains stealthy and generalizes across prompts, demonstrating that system-lev el decision logic can b e hijack ed through multimodal inputs. T o ol, Retriev al, and Agen tic Injection: T o ol, retriev al, and agen tic injection attacks extend con trol manipulation to settings in whic h MLLMs op erate as autonomous or semi-autonomous agen ts. Suc h systems often main tain memory , inv oke to ols, retriev e external context, or p erform multi-step reasoning o ver extended 11 Published in T ransactions on Mac hine Learning Researc h (03/2026) in teraction horizons. In these cases, a single successful injection can propagate across many do wnstream actions. Gu et al. (2024) demonstrate the severit y of this threat by showing that a single adversarial image can b e used to jailbreak a large p opulation of multimodal agen ts simultaneously . By exploiting shared p erception and instruction-follo wing mechanisms across agen ts, the attack scales without requiring per-agent customization. This result highlights that agen tic MLLM deplo yments substan tially amplify the impact of con trol and injection attac ks, transforming lo calized multimodal manipulation into system-wide compromise. 3.2.4 P oisoning and Backdo o r A ttacks P oisoning and backdoor attac ks in tro duce p ersistent vulnerabilities in to MLLMs during training, instruction tuning, or model adaptation. Unlike inference-time attac ks, whic h rely on carefully crafted inputs at test time, p oisoning-based attac ks implan t malicious b eha viors that are activ ated b y speciﬁc triggers or contexts long after deplo yment. These attacks p ose a particularly severe threat because they can remain dormant, can ev ade detection during ev aluation, and aﬀect all downstream users of the compromised mo del. Figure 4 illustrates a canonical training-time bac kdo or attack, in which a small num b er of poisoned samples are injected during training to implan t a trigger that activ ates attack er-controlled behavior at inference time while preserving b enign p erformance on clean inputs. The risk is ampliﬁed in m ultimo dal settings due to the reuse of pretrained components (e.g., vision enco ders), the reliance on large-scale web data, and the increasing use of parameter-eﬃcien t adaptation metho ds suc h as instruction tuning or ligh tw eight adapters. Recen t work demonstrates that p oisoning can target data, triggers, representations, and ev en shared pretrained mo dules, enabling stealthy and transferable bac kdo ors in MLLMs. Data Poisoning: Data p oisoning attacks manipulate the training or ﬁne-tuning data of MLLMs to induce malicious b eha viors under b enign inputs. In m ultimo dal settings, p oisoning can exploit the alignmen t b et ween images and text to implan t subtle yet p o werful b eha vioral shifts without noticeably degrading mo del utility . Xu et al. (2024) introduce Shadowc ast , a stealthy p oisoning framew ork that inserts visually indistinguishable p oisoned image–text pairs into training data. Shadow cast demonstrates that even a small n um b er of p oisoned samples can induce persistent malicious b eha viors, including b oth misclassiﬁcation-style errors and more subtle p ersuasion b eha viors that lev erage the generativ e capabilities of vision–language mo dels. Notably , the p oisoned behaviors transfer across arc hitectures and remain eﬀective under realistic data augmentation and compression, highligh ting the fragility of data integrit y assumptions in multimodal training pip elines. Bac kdo or Attac ks: Bac kdo or attacks implan t hidden triggers during training suc h that the mo del b eha ves normally on clean inputs but pro duces attack er-chosen outputs when the trigger is present. In multimodal mo dels, triggers can b e embedded in images, instructions, or laten t represen tations, enabling a broad range of p ersisten t and diﬃcult-to-detect attack v ectors. Early work suc h as Lyu et al. (2024) demonstrates that vision–language models can b e backdoored to inject predeﬁned target text in to image-to-text generation tasks while preserving seman tic plausibilit y . Liang et al. (2025) extend this threat to instruction-tuned autoregressiv e VLMs, showing that multimodal instruction backdoors can b e implanted during instruction tuning using b oth visual and textual triggers, even under limited attack er access. Subsequen t studies explore more realistic and challenging threat mo dels. Lyu et al. (2025) sho w that back- do ors can b e implanted using only out-of-distribution data, eliminating the assumption that attack ers m ust access the original training distribution and further demonstrate that the bac kdo or remains eﬀectiv e despite distribution mismatc h b et ween p oisoning data and deploymen t inputs. Bey ond explicit triggers, Yin et al. (2025) in tro duce shadow-activate d b ackdo ors , where malicious behaviors are activ ated implicitly when the mo del discusses sp eciﬁc objects or concepts, without requiring an y external trigger. This paradigm high- ligh ts that backdoor activ ation in MLLMs can be con text-driven rather than artifact-driv en, signiﬁcantly complicating detection and mitigation. 12 Published in T ransactions on Mac hine Learning Researc h (03/2026) Figure 4: A bac kdo or attac k where the mo del is ’p oisoned’ during training. Fine-tuning, A dapter, and Represen tation Poisoning: Fine-tuning and represen tation p oisoning at- tac ks target the adaptation mechanisms commonly used to customize MLLMs, including instruction tuning, tok en-lev el output manipulation, and shared pretrained encoders. These attacks are particularly concerning b ecause they exploit widely adopted deploymen t practices such as plug-and-pla y ﬁne-tuning and comp onen t reuse. Y uan et al. (2025) in tro duce BadT oken , a tok en-level bac kdo or attac k that manipulates the out- put space of MLLMs by inserting or substituting sp eciﬁc tokens when a backdoored input is encountered. The attac k preserves o verall mo del utilit y while enabling ﬁne-grained and stealthy control o ver generated resp onses, posing risks in safety-critical applications suc h as medical diagnosis and autonomous systems. Complemen tarily , Liu & Zhang (2025) reveal that backdoors can b e implanted directly into self-sup ervised vision encoders that are later reused across many L VLMs. By compromising a shared enco der, the attack er induces widespread hallucinations and attack er-chosen b eha viors in downstream mo dels, demonstrating that represen tation-lev el p oisoning can propagate backdoors across the multimodal ecosystem without mo difying the language mo del itself. An imp ortan t observ ation is that attack ob jectiv es in multimodal LLMs are not mutually exclusive. Man y eﬀectiv e attacks are inheren tly multi-objective, where success under one ob jectiv e serv es as an enabling mec hanism for another. F or example, p erturbation-based in tegrit y attacks on visual inputs are often used to facilitate safety and jailbreak attacks by corrupting cross-mo dal representations and weak ening alignmen t safeguards. Similarly , control and injection attac ks ma y manifest as in tegrity failures at the output lev el, while training-time backdoors can surface at inference as targeted in tegrity or con trol violations. Accordingly , our taxonomy assigns eac h attac k to a primary adversarial ob jectiv e, while explicitly ackno wledging that individual attacks may span multiple categories. F or instance, (T ao et al., 2025) achiev es jailbreak b eha vior through the implantation of a training-time backdoor, and th us exhibits c haracteristics of both backdoor and jailbreak attac ks. 3.3 A ttacker Kno wledge and Threat Mo dels Bey ond attack ob jectiv es, adversarial attacks on MLLMs diﬀer in how they are instantiated, dep ending on the attack er’s kno wledge of and access to the target system. Attac k er knowledge shap es feasible construction strategies and ev aluation assumptions, ranging from gradient-based optimization in settings with full in ternal access to query-driven or transfer-based methods under limited access. W e characterize this dimension using standard white-box, gra y-b o x, and black-box threat mo dels, whic h reﬂect v arying degrees of access to mo del 13 Published in T ransactions on Mac hine Learning Researc h (03/2026) in ternals, prompts, or training artifacts. Rather than deﬁning attack ob jectiv es, this dimension pro vides an analytical lens for understanding ho w attac ks across diﬀeren t families are realized in practice. W e summarize the distribution of attac k families under these threat mo dels using a tw o-dimensional matrix. 3.3.1 White-b o x Attacks In a white-b o x threat mo del, the adversary is assumed to ha ve full access to the target MLLM, including its architecture, parameters, gradien ts, and, in some cases, the training or ﬁne-tuning pip eline. Suc h priv- ileged access enables direct gradient-based optimization of adv ersarial objectives and precise manipulation of internal cross-mo dal represen tations. As a result, white-b o x attacks typically ac hiev e high success rates and are commonly used to c haracterize upp er-bound vulnerabilities of multimodal systems under worst-case assumptions. Under this setting, white-b o x attacks ha ve b een demonstrated across multiple adv ersarial ob jectiv es. In- te grity attacks ev aluate robustness under gradient-based visual adv ersarial p erturbations adversarial visual inputs or manipulate in ternal visual tokens, causing MLLMs to produce incorrect or targeted outputs without necessarily violating safety p olicies (Cui et al., 2024). Safety and jailbr e ak attacks exploit white-b o x access to bypass alignment mec hanisms, using optimized visual or multimodal p erturbations to elicit disallo w ed or harmful conten t (Qi et al., 2023; W ang et al., 2024a). Bey ond output manipulation, white-b o x c ontr ol and inje ction attacks target in ternal p erception mo dules or visual prompts to ov erride instruction hierarchies and redirect mo del or agent b eha vior to w ard attac ker-speciﬁed goals (F u et al., 2023). Finally , white-b o x tr aining-time attacks exploit access to the data or ﬁne-tuning pipeline to implant p ersisten t bac kdo ors or p oisoned b eha viors that are triggered at inference time (Lyu et al., 2024; Xu et al., 2024; Liang et al., 2024). 3.3.2 Gra y-b o x A ttacks Gra y-b o x attac ks occupy an in termediate threat mo del b et ween white-b o x and black-box settings, where the adv ersary possesses partial knowledge or access to the target system. This ma y include access to speciﬁc comp onen ts such as the vision enco der, captioning mo dule, or training data distribution, while the full language mo del, alignment lay ers, or deplo yment conﬁguration remain unknown. Gra y-b o x assumptions are particularly relev ant for m ultimo dal systems, whic h are often comp osed of reusable or open-source submo dules integrated with proprietary comp onen ts. Under this partial-access setting, gra y-b o x attacks span all ma jor adversarial ob jectiv es. Inte grity attacks exploit access to visual enco ders or in termediate representations to induce incorrect or targeted outputs that transfer across do wnstream MLLMs (Zhao et al., 2023). Safety and jailbr e ak attacks leverage partial kno wledge of non-textual mo dalities or fusion mec hanisms to bypass alignmen t safeguards, often through comp ositional or mo dalit y-sp eciﬁc perturbations (Geng et al., 2025). Gray-box control and injection attacks can arise when attack ers hav e access to a kno wn p erception or agen t component (e.g., tool routing rules), enabling redirection of system behavior without full mo del access. Finally , gray-box tr aining-time attacks exploit limited con trol ov er data sources or pretrained comp onen ts to implant p ersisten t backdoors that activ ate under sp eciﬁc triggers (Liu & Zhang, 2025; Xu et al., 2024). These attac ks highlight that ev en partial system knowledge can be suﬃcien t to compromise multimodal LLMs across inference and training stages. 3.3.3 Black-b o x Attacks In a black-box threat mo del, the adversary has no access to the in ternal parameters, gradien ts, or archi- tecture of the target MLLM, and can in teract with the system only through input–output queries. This setting reﬂects realistic deplo yment scenarios, including attacks on proprietary or API-based mo dels. Con- sequen tly , blac k-b o x attacks rely on query-based optimization, transferability from surrogate mo dels, or carefully engineered input patterns, and t ypically trade oﬀ attac k eﬃciency for broader applicabilit y . Despite these constrain ts, blac k-b o x attacks hav e b een shown to b e eﬀective across multiple adv ersarial ob jectiv es. Inte grity attacks exploit transfer-based or universal perturbations to induce incorrect or targeted outputs in vision-langu age models without in ternal access (Zhao et al., 2023; Xie et al., 2024). Safety 14 Published in T ransactions on Mac hine Learning Researc h (03/2026) and jailbr e ak attacks lev erage crafted visual or multimodal inputs to bypass alignment mechanisms and elicit restricted conten t under query-only access, including attac ks under query-only access (Gong et al., 2025; Jeong et al., 2025; W ang et al., 2025d). Black-box c ontr ol and inje ction attacks em b ed adversarial instructions into images or multimodal con texts to ov erride user in tent or hijack execution b eha vior, often via indirect prompt injection (Clusmann et al., 2025) (Kim ura et al., 2024). Finally , black-box tr aining-time attacks demonstrate that p oisoning or backdooring can succeed even when the attack er lac ks knowledge of the ﬁnal deplo yed mo del, relying instead on transferability across training pip elines (Xu et al., 2024; Liang et al., 2025). T ogether, these results indicate that limited access does not preclude impactful attac ks on deplo y ed MLLMs. T able 1: Represen tative tec hniques for each attac k ob jectiv e based on the attack er’s knowledge lev el. A ttack Goal A ttack er Knowledge In tegrity Attac k Safet y Attac k Con trol Attac k T raining Attac k White b o x (Cui et al., 2024), (Zhang et al., 2025d) (Qi et al., 2023), (W ang et al., 2024a), (Hao et al., 2024) (F u et al., 2023), (Bailey et al., 2024) (Lyu et al., 2024), (Xu et al., 2024), (Liang et al., 2024) Grey b o x (Zhao et al., 2023) (Geng et al., 2025), (Sha yegani et al., 2023a) (W u et al., 2025) (Xu et al., 2024), (Liu & Zhang, 2025) Blac k b o x (Zhao et al., 2023), (Xie et al., 2024) (Gong et al., 2025), (Jeong et al., 2025), (W ang et al., 2025d) (Clusmann et al., 2025), (Kimura et al., 2024) (Xu et al., 2024), (Liang et al., 2025) 4 Dissecting the Threat: An Analysis of LLM V ulnerabilities 4.1 Overview Ha ving established a taxonom y organized by adv ersarial ob jectiv es and threat mo dels, we now shift from ho w attac ks are categorized to wh y they succeed. Rather than viewing attacks as isolated techniques, w e analyze them as manifestations of recurring structural w eaknesses in m ultimo dal large language models. Figure 5 presen ts a hierarchical view of these vulnerabilities, organizing failure mo des from high-lev el arc hitectural design choices to component-lev el ﬂaws. This vulnerability-cen tric p ersp ectiv e abstracts b ey ond surface-lev el attac k v ectors and explains how diverse in tegrity , safety , con trol, and poisoning attacks arise across diﬀeren t mo dalities and access assumptions. T o ground this framew ork, T ables 2, 3, 4, and 5 map each vulnerability category to representativ e attack families. The follo wing subsections examine these vulnerability categories in detail. 4.2 Cross Mo dal Interaction & Alignment Vulnerabilities This refers to the core vulnerabilities unique to MLLMs that arise from the inheren t complexities of in tegrat- ing and reconciling information deriv ed from m ultiple, distinct mo dalities such as text, vision, and audio. These vulnerabilities stem from how LLMs attempt to align, fuse, and reason ov er heterogeneous modality- sp eciﬁc representations. Because these interaction p oin ts are fundamen tal to multimodal pro cessing, they b ecome prime targets for sophisticated adv ersarial attacks aiming to disrupt cross-mo dal alignment or fused represen tations and induce erroneous outputs or b eha viors. 4.2.1 Misalignment & Integration Failures Large m ultimo dal language mo dels can exhibit systematic failures in cross-modal alignmen t and integration, particularly when join tly reasoning ov er heterogeneous inputs such as text, images, audio, or video. These 15 Published in T ransactions on Mac hine Learning Researc h (03/2026) Figure 5: Hierarc hical Overview of LLM V ulnerabilities 16 Published in T ransactions on Mac hine Learning Researc h (03/2026) failures arise when the mo del is unable to establish stable seman tic correspondences across mo dalities or reconcile conﬂicting con textual signals within the shared m ultimo dal representation space. As a result, b enign inputs at the p er-modality level may combine to yield incorrect or unin tended in terpretations after fusion. Suc h vulnerabilities are often rooted in weaknesses of learned cross-mo dal embeddings, misaligned mo dalit y-sp eciﬁc enco ders, or brittle fusion mechanisms that o ver-rely on a dominan t mo dalit y . Recen t work demonstrates that adv ersaries can explicitly exploit the se vulnerabilities b y targeting joint m ultimo dal representations rather than individual mo dalities. F or example, typographic and comp ositional attac ks show that harmful inten t can b e concealed in, or ampliﬁed through, visual channels even when the accompan ying text app ears benign, exposing deﬁciencies in cross-mo dal alignmen t and integration (Gong et al., 2025). Similar failures arise from cross-mo dalit y mismatches, where b enign textual queries combined with misleading or out-of-con text visual inputs induce incorrect or unsafe outputs, as demonstrated in medical MLLMs b y Huang et al. (2025b). 4.2.2 Emb edding Space Weaknesses A critical vulnerabilit y arises from the shared high-dimensional embedding space used by m ultimo dal lan- guage mo dels to align and reason ov er heterogeneous inputs. While this space is intended to enco de seman tic similarit y across mo dalities, its learned structure can b e exploited b y adversaries. Carefully crafted inputs that app ear benign at the p erceptual level can induce latent represen tations that are semantically mislead- ing after fusion, eﬀectiv ely creating false cross-mo dal corresp ondences. F or instance, Adv ersarial Illusions demonstrate that an input from one modality (e.g., an image) can b e engineered suc h that its embedding closely aligns with an unrelated concept from another modality (e.g., text), causing the mo del to infer a spuri- ous seman tic relationship (Bagdasaryan et al., 2024). Related w ork further shows that such embedding-level vulnerabilities can be exploited in a targeted manner, where adv ersarial images are optimized to steer prox- imit y in the join t em b edding space tow ard harmful semantic regions and, when paired with otherwise benign prompts, induce unsafe mo del b eha vior (Shay egani et al., 2023a). 4.2.3 Over-Reliance / Imbalanced Reliance on Sp eciﬁc Mo dalities Multimo dal language models may exhibit an im balanced reliance on particular input mo dalities, disprop or- tionately weigh ting information from one source while underutilizing signals from others. Suc h asymmetries can undermine cross-mo dal veriﬁcation, allowing dominant mo dalities to ov erride conﬂicting evidence with- out suﬃcien t scrutiny . This creates a structural vulnerabilit y whereb y misleading or adv ersarial con tent in tro duced through a secondary mo dalit y ma y not b e adequately reconciled with the model’s primary de- cision signal. Recent empirical analysis sho ws that sev eral vision–language models exhibit a pronounced bias to ward textual inputs, often treating vision as auxiliary rather than co-equal evidence (Deng et al., 2025). This imbalance weak ens multimodal grounding and increases susceptibility to attacks that exploit underw eigh ted mo dalities or bypass cross-mo dal consistency c hecks. 4.2.4 Asymmetric V ulnerabilit y A cross Mo dalities Multimo dal language models can exhibit asymmetric susceptibilit y to adv ersarial p erturbations across the mo dalities they process, where attac ks on one mo dalit y are more eﬀective or ha ve a disprop ortionate inﬂuence on the mo del’s ﬁnal prediction compared to others. Suc h asymmetries can arise when modality-speciﬁc enco ders or fusion mec hanisms assign unequal w eight or robustness to diﬀerent input c hannels, causing certain mo dalities to dominate join t decision-making. Prior w ork on m ultimo dal safety has shown that the visual mo dalit y can constitute a particularly brittle path wa y for alignment, where adversarially crafted images can be lev eraged to amplify harmful inten t and induce safety violations even when text-only safeguards are in place (Li et al., 2025b). These ﬁndings highlight that mo dalit y-dep enden t robustness is uneven and that vulnerabilities in a single mo dalit y can disprop ortionately compromise m ultimo dal reasoning. This asymmetry extends beyond static images: Hu et al. (2025b) sho w that the video mo dalit y in tro duces distinct vulnerabilities, where distributing adv ersarial cues across frames enables jailbreaks that bypass defenses eﬀectiv e for single-image or text-only inputs. 17 Published in T ransactions on Mac hine Learning Researc h (03/2026) T axonomy Category Primary V ulnerabilities Exploited Represen tative W orks Signal P erturbations Sp eciﬁc component w eaknesses Zhang et al. (2025d) Visual Input Pro cessing; Em b edding Space W eaknesses W ang et al. (2024b) Discrete T riggers Visual Input Pro cessing; Em b edding Space W eaknesses; Universal Represen tation V ulnerability Liu et al. (2024a) Visual Input Pro cessing; Misalignmen t & In tegration F ailures Cao et al. (2025) Represen tation & F usion Exploits Em b edding Space W eaknesses Bagdasary an et al. (2024) Misalignmen t & Integration F ailures Sha yegani et al. (2023a) T able 2: Mapping In tegrity A ttacks to underlying vulnerabilities driv ers 4.2.5 Universal Rep resentation Vulnerabilit y A recurring c hallenge in multimodal models is that their joint representations are exp ected to generalize across div erse tasks, prompts, and input distributions, which creates a shared attac k surface. Empirical evidence shows that adv ersarial patterns can b e optimized to generalize broadly across prompts and down- stream tasks, remaining eﬀectiv e even when the attac ker do es not know the sp eciﬁc query or interaction con text (Zhang et al., 2025d). Complementarily , Pandora’s Box constructs a task-agnostic univ ersal ad- v ersarial patch that transfers across m ultiple L VLMs and task settings, indicating systematic brittleness in shared multimodal represen tations across mo del instances (Liu et al., 2024a). As models increasingly rely on uniﬁed represen tational spaces for scalable multimodal reasoning, ensuring robustness across suc h v ar- ied scenarios remains diﬃcult, leaving these shared representations vulnerable to reusable and transferable manipulations. 4.3 Mo dalit y-Sp eciﬁc Input Pro cessing Vulnerabilities Bey ond cross-modal challenges, LLMs also exhibit vulnerabilities inherent in how they pro cess and in terpret data from individual mo dalities. These w eaknesses often stem from the sp eciﬁc c haracteristics of eac h data t yp e (e.g., high dimensionalit y of images, temp oral nature of audio/video) or are inherited from the unimo dal enco ders (e.g., pre-trained vision or audio net works) used as building blo c ks. Attac k ers can exploit these mo dalit y-sp eciﬁc frailties even b efore information is fused, corrupting the input at its source. 4.3.1 Visual Input Pro cessing Multimo dal language mo dels exhibit signiﬁcant vulnerabilities in their processing of visual inputs, arising from b oth the brittleness of vision enco ders and their integration into language-driven reasoning pip elines. Small, often h uman-imp erceptible p erturbations to visual inputs can induce disprop ortionate c hanges in do wnstream MLLM outputs and behaviors, as adversarial signals injected at the vision-enco ding stage prop- agate through multimodal fusion and language generation (Dong et al., 2023; Zhao et al., 2023). These vulnerabilities are frequently inherited from pre-trained vision bac kb ones used as encoders, creating an at- tac k surface that persists ev en when the language mo del itself is w ell-aligned. Bey ond perception errors, recen t work demonstrates that carefully crafted visual inputs can trigger unintended to ol in vocation or con- trol b eha viors in to ol-augmen ted MLLMs, highligh ting the security implications of visual pro cessing failures (F u et al., 2023). Other attacks exploit t yp ographic o v erla ys or visually rendered text that are misinterpreted b y OCR or visual-text alignment components, eﬀectiv ely acting as co vert visual prompts that manipulate do wnstream reasoning (Qraitem et al., 2024; Gong et al., 2025). 18 Published in T ransactions on Mac hine Learning Researc h (03/2026) 4.3.2 A udio Input Pro cessing The pro cessing of sp eec h, audio w a v eforms, and general sound ev ents in multimodal language mo dels intro- duces a distinct and increasingly exploited attack surface. Recent studies show that audio inputs are often less robustly protected than textual counterparts, allowing adversaries to induce harmful or unin tended be- ha viors through carefully crafted sp eec h or acoustic signals (Y ang et al., 2024). In particular, vulnerabilities in speech recognition and audio–language alignment pip elines can b e exploited via cross-lingual phonetics and pronunciation or accent v ariations that ev ade safet y mechanisms while remaining in telligible to the mo del (Roh et al., 2025). Empirical red-teaming of audio m ultimo dal mo dels further rev eals high attack success rates and inconsistent safet y enforcement across acoustic inputs, underscoring that audio pro cessing remains a comparativ ely weak link in current multimodal systems. While adv ersarial w a v eform attac ks on acoustic mo dels hav e b een studied extensiv ely in prior work Carlini & W agner (2018); Zhang et al. (2017), suc h brittleness b ecomes especially consequential when these components are in tegrated into language-driv en reasoning pip elines. 4.3.3 T extual Input Processing V ulnerabilities in the interpretation of textual conten t remain critical in multimodal language mo dels, par- ticularly when text is pro cesse d indirectly through visual or multimodal pip elines. In such settings, mo dels ma y o ver-attend to salien t textual cues while failing to incorp orate broader seman tic context, leading to systematic misin terpretations. Recen t work shows that t yp ographic text rendered within images can b e mis- pro cessed b y vision–language mo dels, causing them to infer incorrect or unintended instructions despite the absence of explicit textual prompts (Qraitem et al., 2024; Gong et al., 2025). These failures highlight w eak- nesses in how textual information is extracted, normalized, and aligned with language reasoning comp onen ts, esp ecially when mediated by OCR or visual-text alignment mo dules. 4.3.4 T emp o ral Information Pro cessing (Video/Sequential Data) Multimo dal language models face distinct vulnerabilities when pro cessing temp orally ev olving inputs such as video, where maintaining coherent representations across sequential frames is essential for correct inter- pretation. Prior w ork indicates that disruptions to temp oral consistency can propagate across time, causing errors in tro duced at a small num b er of frames to aﬀect the mo del’s understanding of an en tire sequence. Suc h w eaknesses are commonly asso ciated with limitations in temp oral aggregation and sequence-level reasoning, where mo dels struggle to robustly in tegrate dynamic visual information ov er time (Huang et al., 2025a). Recen t attacks explicitly exploit these limitations using ﬂow-based metho ds that lev erage optical ﬂow to iden tify and p erturb temp orally salient regions, ac hieving high eﬀectiv eness by targeting a limited subset of frames rather than the full video stream (Li et al., 2024). 4.4 Instruction follo wing & prompt-based vulnerabilities A signiﬁcant class of vulnerabilities in MLLMs stems directly from their core design ob jectiv e: to understand and follow human instructions. Attac kers exploit this inherent helpfulness by manipulating input prompts that are deliv ered via text, images, audio, or combinations thereof to elicit unin tended, harmful, or re- stricted b eha viors. These attacks often aim to subv ert the mo del’s safet y alignments or hijack its generative capabilities for malicious purp oses. 4.4.1 Exploiting Instruction-F ollo wing Nature The same instruction-follo wing capability that enables multimodal language models to perform complex tasks also in tro duces a critical vulnerability when adversarial or deceptiv e commands are introduced. Prior w ork shows that mo dels can b e manipulated by embedding malicious instructions that comp ete with or o ver- ride user in tent, causing the system to prioritize attac ker-supplied guidance (Kimura et al., 2024). While suc h instructions ma y app ear explicitly in textual prompts, they can also be conv eyed indirectly through non-textual mo dalities. In particular, attack ers can enco de instructions within images or audio inputs that are subsequen tly decoded and acted up on by the model as if they w ere legitimate textual commands, despite 19 Published in T ransactions on Mac hine Learning Researc h (03/2026) T axonomy Category Primary V ulnerabilities Exploited Represen tative W orks Unimo dal Surfaces Visual Input Pro cessing; Jailbreaking & Safet y Bypass Qi et al. (2023) A udio Input Pro cessing; Asymmetric V ulnerability A cross Mo dalities Y ang et al. (2024) Roh et al. (2025) Multimo dal Composite Jailbreaks Misalignmen t & Integration F ailures; Em b edding Space W eaknesses Gong et al. (2025) Con text Manipulation; Misalignment & In tegration F ailures W ang et al. (2025d) T emp oral Input Processing; A tten tion Mec hanism Exploitation Hu et al. (2025b) Jailbreaking & Safety Bypass; Misalignmen t & Integration F ailures; Li et al. (2025b) W ang et al. (2025b) Jailbreaking & Safety Bypass Cheng et al. (2025) Univ ersal Jailbreak T riggers Misalignmen t & integration failures; Visual input pro cessing W ang et al. (2024a) Em b edding Space W eaknesses; Ov er Reliance on Sp eciﬁc Modalities Geng et al. (2025) T able 3: Mapping Safet y/Jailbreak Attac ks to underlying vulnerability driv ers T axonomy Category Primary V ulnerabilities Exploited Represen tative W orks Prompt Injection Con text Manipulation; Exploiting Instruction F ollowing Nature; Visual + T extual Input Pro cessing Clusmann et al. (2025) A udio Input Pro cessing; Exploiting Instruction-F ollowing Nature; Con text Manipulation Hou et al. (2025) System Instruction Manipulation Exploiting Instruction F ollowing Nature; Con text Manipulation; Misalignment & In tegration F ailures W ang et al. (2025a) Misalignmen t & Integration F ailures; Ov er-Reliance on Sp eciﬁc Mo dalities Zhang et al. (2025c) T o ol, Retriev al and Agentic Injection Sp eciﬁc Component W eakness; Con text Manipulation; Visual Input Pro cessing F u et al. (2023) Con text Manipulation; Exploiting Instruction-F ollowing Nature; Misalignmen t & Integration F ailures Zhang et al. (2025a) Instruction F ollowing Nature; Misalignmen t & Integration; Gu et al. (2024) T able 4: Mapping Con trol & Injection Attac ks to underlying vulnerability driv ers. the absence of ov ert instruction text (Bagdasary an et al., 2023; Gu et al., 2024). These attac ks highlight that instruction-following b eha vior itself—rather than a speciﬁc modality—constitutes a fundamen tal vul- nerabilit y when mo dels lack robust mechanisms for distinguishing inten t, authority , and context. 20 Published in T ransactions on Mac hine Learning Researc h (03/2026) 4.4.2 Jailb reaking & Safety Bypass Jailbreaking attac ks target weaknesses in the safety and alignment mechanisms of large language mo dels, seeking to induce outputs that violate in tended con tent restrictions. Prior w ork sho ws that suc h failures often arise from the model’s tendency to comply with clev erly framed instructions that exploit gaps or inconsistencies in safety training, for example through role-playing, hypothetical framing, or logical indirec- tion. In m ultimo dal settings, these vulnerabilities can be ampliﬁed when inputs from m ultiple mo dalities in teract in w ays that undermine unimodal safety chec ks (Liu et al., 2024c). Recent studies demonstrate that carefully constructed com binations of visual and textual inputs can jointly bypass safety mechanisms that w ould otherwise b e eﬀectiv e in isolation, exposing weaknesses in cross-modal safety alignment (Liu et al., 2024d). These ﬁndings indicate that safet y enforcemen t in multimodal mo dels is not uniformly robust across mo dalities and can b e circumv ented through co ordinated m ultimo dal inputs. 4.4.3 Context Manipulation & Ambiguity Exploitation Multimo dal language mo dels can b e misled by adv ersarial manipulation of con textual information or b y exploiting inherent am biguities in language and cross-mo dal inputs. Prior work sho ws that b y supplying misleading background con text or framing information, attac k ers can steer the mo del’s in terpretation to- w ard unintended conclusions or actions (Miao et al., 2025). Am biguity can also b e introduced delib erately through carefully constructed prompts or by com bining m ultimo dal inputs whose relationships are in ten- tionally unclear or deceptiv e (Huang et al., 2025b). In suc h cases, mo dels ma y default to an incorrect or attac k er-fa v ored in terpretation, particularly when resolving conﬂicting or undersp eciﬁed cues across mo dal- ities. Recen t studies demonstrate that bi-mo dal adv ersarial prompts leveraging ambiguous visual–textual relationships can eﬀectively induce safety bypasses and unin tended b eha viors, highligh ting w eaknesses in con textual reasoning and disam biguation mec hanisms (Ying et al., 2024). 4.4.4 A ttention Mechanism Exploitation A tten tion mechanisms pla y a cen tral role in how m ultimo dal language mo dels prioritize and integrate infor- mation across inputs, making them a critical attack surface (V asw ani et al., 2017). Prior work sho ws that adv ersarial inputs can b e constructed to manipulate ho w attention is allocated, either b y amplifying the in- ﬂuence of misleading features or by suppressing attention to seman tically relev an t cues. Such manipulation can o ccur within a single mo dalit y through self-attention, or across modalities by disrupting cross-mo dal at- ten tion alignmen t. Recen t studies demonstrate that explicitly targeting atten tion dynamics can signiﬁcan tly alter mo del b eha vior, enabling attack ers to steer outputs b y biasing atten tion tow ard adv ersarially c hosen elemen ts (W ang et al., 2024c; Xie et al., 2024). 4.5 Architectural & Comp onent-Level Vulnerabilities V ulnerabilities in MLLM systems are not conﬁned to input processing or instruction-follo wing; they can also arise from internal architectural mechanisms and sp eciﬁc comp onen ts within the model. Such weaknesses ma y stem from design choices in modules suc h as attention or fusion, or be inherited from pre-trained building blo c ks integrated into the system. Attac ks that target these vulnerabilities aim to perturb internal represen tations and information routing, leading to systematic errors or unin tended b eha viors. 4.5.1 Inherited V ulnerabilities from Pre-trained Comp onents Man y m ultimo dal language mo dels rely on p o w erful unimodal enco ders—suc h as vision, text, or audio bac kb ones—that are pre-trained on large-scale datasets and subsequently integrated into larger multimodal systems. While suc h pre-training provides strong representational capabilities, it also introduces a pathw ay for adversarial weaknesses presen t in these source models to propagate into the full m ultimo dal pip eline. Prior w ork shows that vulnerabilities in vision enco ders, including susceptibilit y to adversarial p erturbations, can p ersist after in tegration and be exploited to induce incorrect or targeted b eha viors in downstream multimodal mo dels (Cui et al., 2024; Dong et al., 2023). As a result, attacks originally designed for standalone unimo dal 21 Published in T ransactions on Mac hine Learning Researc h (03/2026) T axonomy Category Primary V ulnerabilities Exploited Represen tative W orks Data Poisoning Inherited V ulnerabilities from pre-trained comp onen ts; Xu et al. (2024) Bac kdo or A ttacks Data & T raining Induced; T extual Input Pro cessing Y uan et al. (2025) Fine-tuning, Adapter and Represen tation Poisoning Inherited V ulnerabilites from pre-trained comp onen ts; Sp eciﬁc Component W eaknesses; Liu & Zhang (2025) T able 5: Mapping P oisoning & Backdoor Attac ks to underlying vulnerability driv ers enco ders may transfer to m ultimo dal language mo dels, exposing inherited w eaknesses in the initial processing stages that are not mitigated b y subsequen t language reasoning comp onen ts. 4.5.2 Sp eciﬁc Comp onent W eaknesses Bey ond general architectural vulnerabilities, multimodal language mo dels ma y exhibit w eaknesses tied to sp eciﬁc comp onen ts or auxiliary mo dules in tegrated into the system. In particular, ligh tw eight adaptation mec hanisms used for eﬃcient ﬁne-tuning—such as low-rank adapters—introduce additional attac k surfaces that can b e exploited independently of the base model. Prior work demonstrates that suc h adapters can b e lev eraged to implant backdoors with minimal computational o verhead, enabling malicious b eha viors to p ersist across do wnstream usage without mo difying core mo del parameters (Liu et al., 2025a; Liang et al., 2025; Lyu et al., 2024). In addition, the choice of modality fusion components can inﬂuence robustness, as diﬀerent fusion designs ma y v ary in their susceptibilit y to adv ersarial manipulation or misalignmen t, p oten tially amplifying errors introduced at earlier pro cessing stages. 4.6 Data & T raining Induced Vulnerabilities The security and reliability of MLLMs are strongly shap ed b y the data they are trained on and the training metho dologies employ ed. V ulnerabilities ma y b e introduced in ten tionally b y adv ersaries through malicious manipulation of training or ﬁne-tuning data, or arise uninten tionally from biases, sensitiv e information, or spurious patterns embedded in large-scale multimodal corp ora. Prior work demonstrates that attack ers can inject a small n umber of p oisoned samples con taining sp eciﬁc trigger patterns paired with attack er-chosen outputs, causing mo dels to learn latent correlations that remain dorman t during standard ev aluation yet activ ate reliably at inference time (Xu et al., 2024). In m ultimo dal settings, suc h vulnerabilities are further ampliﬁed by cross-mo dal representation learning, where modality dominance and in teraction eﬀects can inﬂuence ho w p oisoned b eha viors are enco ded and later activ ated (Han et al., 2024). These training-induced w eaknesses are particularly diﬃcult to detect and mitigate post-deploymen t, as the resulting malicious b eha viors are deeply embedded in the mo del parameters and may only surface under out-of-distribution or trigger-sp eciﬁc conditions (Lyu et al., 2025). 5 Defense Mechanisms This section presents an ov erview of defense mec hanisms for MLLMs, and t ying them according to the attac k families in our taxonomy . In classical adversarial robustness, defenses largely fo cus on hardening mo dalit y enco ders against b ounded p erturbations; in contrast, MLLM deplo yments introduce additional system-lev el failure mo des in whic h untrusted m ultimo dal conten t (e.g., OCR text in images, retriev ed do cumen ts, or to ol outputs) is inadverten tly treated as higher-priority instructions. Consequen tly , defenses for MLLMs span m ultiple lay ers of the pip eline, including p erception-la y er robustness for in tegrity attac ks, context and instruction-handling mechanisms for jailbreak and injection attacks, and con trol-plane constraints for to ol- using agen ts (C), with complemen tary safeguards for training-time compromise where applicable. 22 Published in T ransactions on Mac hine Learning Researc h (03/2026) 5.1 Input Prep ro cessing and Perception-La yer Hardening A ﬁrst line of defense operates at the input and p erception la yer by reducing sensitivit y to low-lev el p er- turbations before or within mo dalit y encoders. T est-time transformations and feature-space compression can suppress small-magnitude p erturbations that exploit weaknesses in visual or audio input pro cessing, thereb y mitigating in tegrity attac ks based on contin uous signal p erturbations in our taxonom y . F eature squeezing is a representativ e approach that detects adversarial examples b y comparing predictions before and after inexpensive input squeezes suc h as bit-depth reduction and smo othing (Xu et al., 2018). Such prepro cessing-based defenses are ligh tw eight and can b e deploy ed without retraining, but their protection is t ypically partial under adaptive attac kers and limited against attac ks that do not rely on fragile pixel- or wa veform-lev el noise. More recen tly , robustness of vision enco ders in vision–language pip elines has b een studied directly in the multimodal setting, for example by adv ersarially ﬁne-tuning CLIP-st yle enco ders to impro v e resistance against adv ersarial visual p erturbations that propagate into downstream vision–language mo dels (Schlarmann et al., 2024). A more robust but costlier approac h is adversarial training and robust optimization, which impro ves em- pirical robustness by training on worst-case (or pro xy) adversarial examples. F oundational w ork on robust optimization and PGD-based adversarial training demonstrated substantial gains in resistance to b ounded p erturbations (Madry et al., 2019), and TRADES formalized the trade-oﬀ betw een natural accuracy and robustness via a principled objective (Zhang et al., 2019). Complemen tary approaches explore adapting prompt representations rather than mo del w eights, showing that adv ersarial prompt tuning can impro ve ro- bustness of vision language mo dels to adv ersarially p erturbed visual inputs without modifying the underlying enco ders (Zhang et al., 2024). In MLLMs, these approac hes most directly strengthen modality encoders and th us primarily mitigate in tegrit y attacks ro oted in p erceptual signal manipulation, with partial b eneﬁts for represen tation-lev el vulnerabilities when the exploited failure originates in enco der embeddings rather than in higher-lev el instruction handling or fusion logic. 5.2 Certiﬁed Robustness for Enco ders Bey ond empirical robustness, certiﬁed defenses aim to provide formal guaran tees that mo del predictions are inv ariant within a sp eciﬁed p erturbation set. Randomized smo othing oﬀers probabilistic robustness certiﬁcates under ℓ 2 p erturbations and has b een shown to scale to large neural models (Cohen et al., 2019). Recen t work has b egun extending certiﬁcation tec hniques to vision language mo dels, including incremental randomized smoothing metho ds that certify robustness of m ultimo dal encoders under b ounded p erturbations (Nirala et al., 2024), as w ell as prompt-lev el certiﬁcation strategies for medical vision language models (Hussein et al., 2024). Such certiﬁcation most naturally applies to individual p erception components within MLLM pipelines and therefore targets in tegrity attacks arising from b ounded input p erturbations. How ever, end-to-end certiﬁcation of multimodal fusion, autoregressive deco ding, and instruction-follo wing b eha vior remains c hallenging, lea ving representation-lev el exploits and higher-lev el safet y or control attac ks largely outside the scop e of current guarantees. 5.3 Multimo dal Input Validation and Speciﬁcation-Based Gating A distinctly MLLM-oriented defense direction is to v alidate m ultimo dal inputs against explicit, application- pro vided speciﬁcations b efore they are allow ed to inﬂuence the mo del’s reasoning. Sharma et al. (2024) prop ose a defense sp eciﬁcally for image-based prompt attacks against MLLM c hatb ots, using a tw o-stage pip eline that (i) v alidates whether an input image conforms to exp ected constraints and (ii) performs prompt- injection defense to blo c k malicious in tent enco ded in the image. This class of defenses directly targets safety b ypasses and prompt injection attac ks when adv ersarial instructions are introduced via the image channel (e.g., OCR-readable directiv es), and it can also reduce exp os ure to fusion-lev el vulnerabilities b y preven ting un trusted multimodal artifacts from reac hing cross-modal reasoning in the ﬁrst place. Beyond sp eciﬁcation- based ﬁltering, recent defenses also exploit cross-mo dal in teractions themselves to disrupt jailbreak attempts, for example b y in tro ducing adv ersarial visual p erturbations that proactiv ely in terfere with malicious instruc- tions b efore they inﬂuence the language mo del (Li et al., 2025a). 23 Published in T ransactions on Mac hine Learning Researc h (03/2026) 5.4 Instruction–Data Sepa ration for Multimo dal Contexts A central vulnerability b ehind many injection attacks—ampliﬁed in multimodal settings—is the mixing of trusted instructions with untrusted context suc h as OCR outputs, retrieved do cumen ts, captions, or to ol results. Defenses in this category enforce structured separation so that untrusted conten t is treated as data rather than executable directives. Liu et al. (2025b) formalize prompt injection attac ks and defenses for LLM-integrated applications and provide a benchmark ed ev aluation framework that clariﬁes why naïve concatenation enables injected tasks to ov erride target tasks. Building on this separation principle, Chen et al. (2024) prop ose structured queries that explicitly separate prompt and data channels and demonstrate impro v ed robustness against prompt injection. Related defenses further reﬁne instruction-data separation b y shaping mo del preferences or introducing ligh tw eight defensive tokens, demonstrating impro ved resistance to injected instructions under mixed trusted and untrusted con texts (Chen et al., 2025b;a). In our taxonomy , these defenses map most directly to control and instruction-hijacking attacks as well as many safety bypass scenarios, and they partially mitigate represen tation- and fusion-lev el exploits when the exploit relies on cross-mo dal context b eing reinterpreted as commands. 5.5 Detection-Based Defenses for Injection and Hijacking Detection-based defenses aim to identify injected instructions or malicious in tent embedded in untrusted m ultimo dal con text, including OCR-extracted text from images. In practice, detection and rejection can complemen t strict separation b ecause not all applications can constrain inputs to a narrow sp eciﬁcation, and attac kers ma y em b ed prompt-lik e directiv es in visually plausible conten t. The t wo-stage approach of Sharma et al. (2024) explicitly includes prompt-injection defense to detect unsafe inten t that bypasses initial v alidation, while the b enc hmarking metho dology of Liu et al. (2025b) emphasizes ev aluating defenses under diverse injection strategies and attack er capabilities. These defenses primarily mitigate instruction- hijac king and safet y b ypass attac ks that exploit instruction-data confusabilit y , but they may b e b ypassed b y adaptiv e attac ks that obfuscate or distribute malicious directives across mo dalities. Recent work has prop osed deplo yable detection mechanisms aimed at iden tifying prompt injection attempts at inference time, including light w eigh t classiﬁers designed for real-time use in practical systems (Jacob et al., 2025). At the same time, empirical analyses highlight fundamen tal limitations of LLM-based detection under adaptiv e adv ersaries, sho wing that attac kers can often ev ade detectors through paraphrasing, indirection, or m ulti- step instruction disp ersion (Choudhary et al., 2025). 5.6 Control-Plane Defenses for T o ol-Using Multimo dal Agents When MLLMs are embedded in agentic systems with to ols (browsers, shells, retriev al, APIs), the dominant risk shifts from unsafe text outputs to unsafe actions. Defenses therefore must constrain to ol inv o cation, detect malicious en vironmen tal artifacts, and preven t agents from being steered into attac ker-c hosen ac- tion sequences. A yzenshteyn et al. (2025) prop ose proactiv e defenses based on deception and instrumen- tation—plan ting deceptiv e strings, honeytok ens, and traps to detect and derail autonomous agent behav- ior—demonstrating the utilit y of such mec hanisms against agentic attack w orkﬂows. These defenses map most directly to control-plane attacks in whic h adversarial information propagates into to ol execution, in- cluding cases where multimodal inputs (screenshots, documents, OCR) serv e as the carrier for adversarial instructions that would otherwise lead to to ol misuse. More recen t work extends con trol-plane defenses to end-to-end agent systems, prop osing real-time monitoring and interv ention mechanisms for computer- use agents (Hu et al., 2025a), as well as b enc hmark-driven ev aluations that formalize attack and defense dynamics in LLM-based agen ts (Zhang et al., 2025b). 5.7 P oisoning/Backdo o r Defenses Data p oisoning and bac kdo or threats motiv ate defenses that op erate b oth during data curation and p ost- training veriﬁcation. A common line of w ork aims to detect and remo ve p oisoned samples using representation outliers, e.g., sp ectral signature methods that identify anomalous directions in learned feature space and ﬁlter suspicious training points (T ran et al., 2018). Complemen tary approac hes attempt post-ho c trigger 24 Published in T ransactions on Mac hine Learning Researc h (03/2026) disco v ery and mitigation by reverse-engineering candidate triggers and identifying backdoored classes; Neural Cleanse proposes an optimization-based procedure to reconstruct p oten tial triggers and then repair the model accordingly (W ang et al., 2019). A t inference time, run time input ﬁltering can ﬂag triggered b eha vior b y measuring prediction consistency under strong p erturbations, as in STRIP (Gao et al., 2019). Finally , model repair metho ds can reduce bac kdo or capacit y by pruning neurons that are dormant on clean inputs follo wed b y ﬁne-tuning; Fine-Pruning demonstrates that com bining pruning with ﬁne-tuning can substantially w eaken bac kdo ors while largely preserving clean accuracy (Liu et al., 2018). 6 Limitations This surv ey fo cuses on adv ersarial attacks against MLLMs, emphasizing attac k objectives, mech anisms, and threat mo dels. As a result, sev eral limitations should b e noted. First, our analysis is inten tionally attack- cen tric. While we provide a structured ov erview of defense mechanisms in Section 5 to contextualize these threats, w e do not claim to pro vide an exhaustiv e catalog of all possible mitigation strategies. Second, regarding scop e, w e prioritize p eer-review ed researc h that ev aluates attac ks on MLLMs with a language- mo del-based reasoning core. Adv ersarial studies targeting unimo dal mo dels alone or those aiming at conten t bias, fairness, and priv acy leakage—are excluded unless explicitly framed as adversarial attacks on m ulti- mo dal systems. Third, the survey reﬂects the inherent sk ew in the current literature tow ards MLLMs. While w e discuss Audi o- and Video-based attacks where av ailable, these mo dalities are curren tly less represen ted in the broader research landscape compared to image-text pairs. Finally , the taxonomy presented empha- sizes dominan t attack ob jectiv es. Real-world attacks ma y com bine multiple mechanisms and exploit sev eral vulnerabilities sim ultaneously , and th us may span m ultiple categories. 7 Broader Impact This survey systematically catalogs adversarial attac ks and vulnerabilities aﬀecting MLLMs. While suc h taxonomies can lo wer the barrier to understanding and p oten tially reproducing kno wn attacks, our inten t is to supp ort the developmen t of more robust, secure, and trustw orthy m ultimo dal AI systems. By organizing attac ks according to ob jectiv es, mechanisms, and threat mo dels, we aim to help practitioners, system de- signers, and researchers anticipate realistic threats and reason ab out their ro ot causes. Understanding how and wh y attac ks succeed is a prerequisite for developing eﬀective defenses, auditing deploy ed systems, and informing resp onsible deploymen t decisions. W e ac knowledge the inherent tension b et ween disseminating attac k knowledge and the risk of misuse. Con- sisten t with established practice in securit y research, w e rely on peer-reviewed sources and emphasize analysis rather than op erational guidance, av oiding step-b y-step instructions or exploit-ready artifacts. Moreo ver, w e include a dedicated discussion of defense mec hanisms to con textualize attac ks and highligh t mitigation strategies. Finally , the real-w orld safet y implications of m ultimo dal attacks—suc h as misinformation, misuse of autonomous agents, and harm arising from mo del manipulation—underscore the urgency of this research area. W e view this surv ey as a contribution to ward resp onsible disclosure and proactive risk mitigation, rather than enabling malicious use. 8 Conclusions This survey presented a systematic review of adv ersarial attac ks on m ultimo dal large language mo dels, orga- nized through a goal-driven taxonom y and complemented by a vulnerabilit y-cen tric analysis. By classifying attac ks according to their primary adversarial ob jectiv e and considering diﬀerent attac ker knowledge assump- tions, we provide a structured framew ork for organizing a rapidly gro wing and diverse b ody of literature. The taxonom y is intended to oﬀer a unifying p erspective that cuts across mo dalities, attack implementa- tions, and deplo yment settings. Across the surv ey ed literature, a substan tial p ortion of existing attac ks hav e b een ev aluated in vision–language mo del settings, reﬂecting the prev alence of image–text in terfaces and the relativ e maturity of vision enco ders in current multimodal systems. At the same time, attacks targeting audio–language and video–language mo dels appear less frequen tly in the literature. W e interpret this obser- 25 Published in T ransactions on Mac hine Learning Researc h (03/2026) v ation as an indication of uneven researc h cov erage rather than a deﬁnitive assessment of comparative risk across mo dalities. Our vulnerability analysis highligh ts several recurring weaknesses that are exploited across diﬀerent attack families, including cross-mo dal misalignmen t, embedding-space fragility , mo dalit y-sp eciﬁc pro cessing limita- tions, instruction-following ambiguities, and training-time data integrit y issues. The surv eyed w orks suggest that successful attacks often lev erage combinations of these vulnerabilities rather than a single isolated weak- ness. In addition, attack objectives in multimodal systems frequently ov erlap: integrit y failures ma y enable safet y b ypasses, control-orien ted attacks may manifest as perception errors, and training-time p oisoning can surface as inference-time integrit y or control violations. Our taxonomy captures this structure by assigning eac h attack a primary ob jectiv e while ackno wledging secondary eﬀects. Finally , the survey surfaces several op en challenges suggested by patterns in existing work. Ev aluation protocols for multimodal robustness remain fragmented, particularly outside vision–language settings. Rep orted defenses are often tailored to sp eciﬁc attac k classes or threat mo dels and may not generalize across mo dalities or deploymen t assumptions. W e hop e that the taxonomy , vulnerability mappings, and empirical summaries provided in this survey serv e as a useful foundation for more systematic analysis and for the developmen t of more robust and reliable m ultimo dal language mo dels. References Lukas Aich b erger, Alasdair Paren, Guohao Li, Philip T orr, Y arin Gal, and Adel Bibi. Mip against agent: Malicious image patc hes hijac king m ultimo dal os agents, 2025. URL 10809 . Stanisla w Antol, Aishw arya Agraw al, Jiasen Lu, Margaret Mitc hell, Dhruv Batra, C Lawrence Zitnick, and Devi P arikh. Vqa: Visual question answering. In Pr o c e e dings of the IEEE international c onfer enc e on c omputer vision , pp. 2425–2433, 2015. Relja Arandjelo vić and Andrew Zisserman. Lo ok, listen and learn. In Pr o c. ICCV , volume 3, pp. 9, 2017. Daniel A yzenshteyn, Roy W eiss, and Yisro el Mirsky . Cloak, Honey , T rap: Proactive Defenses Against LLM Agen ts. In Pr o c e e dings of the 34th USENIX Se curity Symp osium (USENIX Se curity 2025) , pp. 8095–8114. USENIX Asso ciation, 2025. Alexei Baevski, Henry Zhou, Ab delrahman Mohamed, and Mic hael A uli. w av2v ec 2.0: A framework for self-sup ervised learning of speech representations. In A dvanc es in Neur al Information Pr o c essing Systems 34 (NeurIPS 2020) , pp. 12449–12460, 2020. Eugene Bagdasaryan, T sung-Yin Hsieh, Ben Nassi, and Vitaly Shmatiko v. Abusing images and sounds for indirect instruction injection in m ulti-mo dal llms. arXiv pr eprint arXiv:2307.10490 , 2023. Eugene Bagdasary an, Rishi Jha, Vitaly Shmatiko v, and Tingw ei Zhang. Adv ersarial illusions in { Multi- Mo dal } embeddings. In 33r d USENIX Se curity Symp osium (USENIX Se curity 24) , pp. 3009–3025, 2024. Luk e Bailey , Euan Ong, Stuart Russell, and Scott Emmons. Image hijacks: Adv ersarial images can control generativ e mo dels at runtime, 2024. URL . T adas Baltrušaitis, Chaitan y a Ahuja, and Louis-Philipp e Morency . Multimodal mac hine learning: A survey and taxonom y . IEEE tr ansactions on p attern analysis and machine intel ligenc e , 41(2):423–443, 2018. Y ue Cao, Y un Xing, Jie Zhang, Di Lin, Tianw ei Zhang, Ivor T sang, Y ang Liu, and Qing Guo. Scenetap: Scene-coheren t t yp ographic adversarial planner against vision-language models in real-w orld en vironments, 2025. URL . Nic holas Carlini and Da vid W agner. A udio adversarial examples: T argeted attacks on speech-to-text. In 2018 IEEE se curity and privacy workshops (SPW) , pp. 1–7. IEEE, 2018. 26 Published in T ransactions on Mac hine Learning Researc h (03/2026) Y up eng Chang, Xu W ang, Jindong W ang, Y uan W u, Lin yi Y ang, Kaijie Zhu, Hao Chen, Xiao yuan Yi, Cunxiang W ang, Yidong W ang, W ei Y e, Y ue Zhang, Yi Chang, Philip S. Y u, Qiang Y ang, and Xing Xie. A surv ey on ev aluation of large language mo dels, 2023. URL . Sizhe Chen, Julien Piet, Chawin Sitaw arin, and Da vid W agner. Struq: Defending against prompt injection with structured queries, 2024. URL . Sizhe Chen, Yizh u W ang, Nic holas Carlini, Cha win Sitaw arin, and David W agner. Defending against prompt injection with a few defensiv etok ens, 2025a. URL . Sizhe Chen, Arman Zharmagambeto v, Saeed Mahloujifar, Kamalika Chaudh uri, David W agner, and Chuan Guo. Secalign: Defending against prompt injection with preference optimization. In Pr o c e e dings of the 2025 A CM SIGSA C Confer enc e on Computer and Communic ations Se curity , CCS ’25, pp. 2833–2847. ACM, No v em b er 2025b. doi: 10.1145/3719027.3744836. URL http://dx.doi.org/10.1145/3719027.3744836 . Hao Cheng, Erjia Xiao, Jindong Gu, Le Y ang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu, and Renjing Xu. Un veiling typographic deceptions: Insigh ts of the t yp ographic vulnerabilit y in large vision-language mo del, 2024. URL . R uo xi Cheng, Yizhong Ding, Sh uirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaow ei Y uan, Simeng Qin, Zhiqiang W ang, and Xiaojun Jia. Pbi-attac k: Prior-guided bimo dal in teractiv e black-box jailbreak attac k for to xicit y maximization, 2025. URL . Sarthak Choudhary , Divy am Anshumaan, Nils P alumbo, and Somesh Jha. How not to detect prompt injections with an llm, 2025. URL . Jan Clusmann, Dyke F erb er, Isab ella C. Wiest, Carolin V. Schneider, Titus J. Brinker, Sebastian F o ersc h, Daniel T ruhn, and Jak ob N. Kather. Prompt injection attac ks on vision-language mo dels in oncology . Natur e Communic ations , 16(1):1239, 2025. doi: 10.1038/s41467- 024- 55631- x. Jerem y M Cohen, Elan Rosenfeld, and J. Zico K olter. Certiﬁed adv ersarial robustness via randomized smo othing, 2019. URL . Xuanming Cui, Alejandro Aparcedo, Y oung Kyun Jang, and Ser-Nam Lim. On the robustness of large m ultimo dal mo dels against image adv ersarial attacks. In Pr o c e e dings of the IEEE/CVF Confer enc e on Computer Vision and Pattern R e c o gnition (CVPR) , 2024. Ailin Deng, T ri Cao, Zhirui Chen, and Bryan Ho oi. W ords or vision: Do vision-language mo dels ha ve blind faith in text?, 2025. URL . Jacob Devlin, Ming-W ei Chang, Kenton Lee, and Kristina T outanov a. Bert: Pre-training of deep bidirectional transformers for language understanding. In Pr o c e e dings of the 2019 c onfer enc e of the North A meric an chapter of the asso ciation for c omputational linguistics: human language te chnolo gies, volume 1 (long and short p ap ers) , pp. 4171–4186, 2019. Yinp eng Dong, Huanran Chen, Jia wei Chen, Zhengw ei F ang, Xiao Y ang, Yic hi Zhang, Y u Tian, Hang Su, and Jun Zhu. How robust is go ogle’s bard to adversarial image attac ks? arXiv pr eprint arXiv:2309.11751 , 2023. Alexey Doso vitskiy , Lucas Beyer, Alexander Kolesnik ov, Dirk W eissenborn, Xiaohua Zhai, Thomas Un- terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylv ain Gelly , et al. An image is worth 16x16 w ords. arXiv pr eprint arXiv:2010.11929 , 7, 2020. Zhihao Dou, Xin Hu, Haibo Y ang, Zhuqing Liu, and Minghong F ang. A dversarial attac ks to multi-modal mo dels. In Pr o c e e dings of the 1st A CM W orkshop on L ar ge AI Systems and Mo dels with Privacy and Safety A nalysis , pp. 35–46, 2023. Matt F redrikson, Somesh Jha, and Thomas Ristenpart. Mo del inv ersion attacks that exploit conﬁdence information and basic countermeasures. In Pr o c e e dings of the 22nd A CM SIGSA C c onfer enc e on c omputer and c ommunic ations se curity , pp. 1322–1333, 2015. 27 Published in T ransactions on Mac hine Learning Researc h (03/2026) Xiaohan F u, Zihan W ang, Sh uheng Li, Ra jesh K. Gupta, Nilo ofar Mireshghallah, T aylor Berg-Kirkpatric k, and Earlence F ernandes. Misusing to ols in large language mo dels with visual adversarial examples, 2023. URL . Y ansong Gao, Chang Xu, Derui W ang, Shiping Chen, Damith C. Ranasinghe, and Surya Nepal. Strip: A defence against tro jan attac ks on deep neural netw orks. In Pr o c e e dings of the A nnual Computer Se curity A pplic ations Confer enc e (A CSA C) , pp. 113–125, 2019. Jiah ui Geng, Th y Thy T ran, Preslav Nako v, and Iryna Gurevyc h. Con instruction: Univ ersal jailbreaking of m ultimo dal large language models via non-textual modalities, 2025. URL 2506.00548 . Yic hen Gong, Delong Ran, Jin yuan Liu, Conglei W ang, Tianshuo Cong, An yu W ang, Sisi Duan, and Xiaoyun W ang. Figstep: Jailbreaking large vision-language mo dels via t yp ographic visual prompts, 2025. URL https://arxiv.org/abs/2311.05608 . Ian J Go odfellow. Intriguing prop erties of neural net w orks. arXiv pr eprint arXiv , 1312:6199, 2014. Ian J Go odfellow, Jonathon Shlens, and Christian Szegedy . Explaining and harnessing adversarial examples. arXiv pr eprint arXiv:1412.6572 , 2014. Tian yu Gu, Brendan Dolan-Ga vitt, and Siddharth Garg. Badnets: Iden tifying vulnerabilities in the machine learning mo del supply chain. arXiv pr eprint arXiv:1708.06733 , 2017. Xiangming Gu, Xiaosen Zheng, Tian yu Pang, Chao Du, Qian Liu, Y e W ang, Jing Jiang, and Min Lin. Agen t smith: A single image can jailbreak one million m ultimo dal llm agents exp onen tially fast, 2024. URL . Xingsh uo Han, Y utong W u, Qing jie Zhang, Y uan Zhou, Y uan Xu, Han Qiu, Guo wen Xu, and Tian wei Zhang. Bac kdo oring multimodal learning. In 2024 IEEE Symp osium on Se curity and Privacy (SP) , pp. 3385–3403. IEEE, 2024. Sh uy ang Hao, Bry an Hooi, Jun Liu, Kai-W ei Chang, Zi Huang, and Y ujun Cai. Exploring visual vul- nerabilities via m ulti-loss adv ersarial searc h for jailbreaking vision-language mo dels, 2024. URL https: //arxiv.org/abs/2411.18000 . Guan yu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, and W enbo Jiang. Ev aluating robustness of large audio language mo dels to audio injection: An empirical study , 2025. URL https: //arxiv.org/abs/2505.19598 . Haitao Hu, Peng Chen, Y anp eng Zhao, and Y uqi Chen. Agentsen tinel: An end-to-end and real-time security defense framew ork for computer-use agen ts, 2025a. URL . W en b o Hu, Shishen Gu, Y ouze W ang, and Richang Hong. Video jail: Exploiting video-mo dalit y vulnera- bilities for jailbreak attacks on multimodal large language mo dels. In ICLR 2025 W orkshop on Building T rust in L anguage Mo dels and Applic ations , 2025b. Linhao Huang, Xue Jiang, Zhiqiang W ang, W entao Mo, Xi Xiao, Bo Han, Y ong jie Yin, and F eng Zheng. Image-based multimodal models as in truders: T ransferable multimodal attac ks on video-based mllms, 2025a. URL . Zhen yu Huang, Yifan Zhang, Y ang Liu, Xiaoyuan W ang, and Bo Li. Medical mllm is vulnerable: Cross- mo dalit y jailbreak and mismatched attac ks on medical m ultimo dal large language mo dels. In Pr o c e e dings of the AAAI Confer enc e on A rtiﬁcial Intel ligenc e . AAAI Press, 2025b. No or Hussein, F ahad Shamshad, Muzammal Naseer, and Karthik Nandakumar. Promptsmo oth: Certifying robustness of medical vision-language mo dels via prompt learning, 2024. URL 2408.16769 . 28 Published in T ransactions on Mac hine Learning Researc h (03/2026) Andrew Ilyas, Logan Engstrom, Anish A thalye, and Jessy Lin. Black-box adv ersarial attacks with limited queries and information. In International c onfer enc e on machine le arning , pp. 2137–2146. PMLR, 2018. Dennis Jacob, Hend Alzahrani, Zhanhao Hu, Basel Alomair, and David W agner. Promptshield: Deplo yable detection for prompt injection attac ks, 2025. URL . Jo onh yun Jeong, Seyun Bae, Y eonsung Jung, Jaery ong Hwang, and Eunho Y ang. Pla ying the fo ol: Jail- breaking llms and m ultimo dal llms with out-of-distribution strategy , 2025. URL abs/2503.20823 . Subaru Kimura, R yota T anaka, Sh ump ei Miya waki, Jun Suzuki, and Keisuke Sakaguc hi. Empirical analysis of large vision-language models against goal hijac king via visual prompt injection, 2024. URL https: //arxiv.org/abs/2408.03554 . Alexey Kuraki n, Ian J Go odfellow, and Sam y Bengio. Adv ersarial examples in the physical world. In A rtiﬁcial intel ligenc e safety and se curity , pp. 99–112. Chapman and Hall/CRC, 2018. Sey ong Lee, Jaeb eom Kim, and W o oguil Pak. Mind mapping prompt injection: Visual prompt injection attac ks in mo dern large language models. Ele ctr onics , 14(10), 2025. ISSN 2079-9292. doi: 10.3390/ electronics14101907. URL https://www.mdpi.com/2079- 9292/14/10/1907 . Chongxin Li, Hanzhang W ang, and Y uch un F ang. Attac k as defense: Safeguarding large vision-language mo dels from jailbreaking b y adv ersarial attacks. In Christos Christo doulopoulos, T anmoy Chakrab ort y , Carolyn Rose, and Violet Peng (eds.), Findings of the A sso ciation for Computational Linguistics: EMNLP 2025 , pp. 20138–20152, Suzhou, China, No vem b er 2025a. Asso ciation for Computational Linguistics. ISBN 979-8-89176-335-7. doi: 10.18653/v1/2025.ﬁndings- emnlp.1095. URL https://aclanthology.org/2025. findings- emnlp.1095/ . Ch un yuan Li, Zhe Gan, Zhengyuan Y ang, Jian wei Y ang, Linjie Li, Lijuan W ang, and Jianfeng Gao. Multimo dal foundation models: F rom specialists to general-purpose assistants, 2023. URL https: //arxiv.org/abs/2309.10020 . Jinmin Li, Kuofeng Gao, Y ang Bai, Jingyun Zhang, Shu-tao Xia, and Yisen W ang. F mm-attack: A ﬂow- based m ulti-mo dal adversarial attack on video-based llms. arXiv pr eprint arXiv:2403.13507 , 2024. Miles Q. Li and Benjamin C. M. F ung. Securit y concerns for large language mo dels: A survey , 2025. URL https://arxiv.org/abs/2505.18889 . Yifan Li, Hangyu Guo, Kun Zhou, W ayne Xin Zhao, and Ji-Rong W en. Images are ac hilles’ heel of alignmen t: Exploiting visual vulnerabilities for jailbreaking m ultimo dal large language mo dels, 2025b. URL https: //arxiv.org/abs/2403.09792 . Jia w ei Liang, Siyuan Liang, Aishan Liu, and Xiao c hun Cao. Vl-tro jan: Multimo dal instruction backdoor attac ks against autoregressiv e visual language mo dels. International Journal of Computer Vision , pp. 1–20, 2025. Siyuan Liang, Jiaw ei Liang, Tianyu Pang, Chao Du, Aishan Liu, Mingli Zhu, Xiao c hun Cao, and Dac heng T ao. Revisiting backdoor attac ks against large vision-language models from domain shift, 2024. URL https://arxiv.org/abs/2406.18844 . Zeyi Liao, Lingbo Mo, Chejian Xu, Min tong Kang, Jiaw ei Zhang, Chao wei Xiao, Y uan Tian, Bo Li, and Huan Sun. Eia: En vironmental injection attack on generalist web agen ts for priv acy leakage, 2025. URL https://arxiv.org/abs/2409.11295 . Daizong Liu, Mingyu Y ang, Xiaoy e Qu, Lic hao Sun, Kek e T ang, Y ao W an, Mingyu Y ang, and P an Zhou. P andora’s b o x: T ow ards building universal attack ers against real-w orld large vision-language mo dels. In A dvanc es in Neur al Information Pr o c essing Systems , v olume 37, 2024a. Daizong Liu, Mingyu Y ang, Xiaoy e Qu, P an Zhou, Y u Cheng, and W ei Hu. A survey of attac ks on large vision-language mo dels: Resources, adv ances, and future trends. arXiv pr eprint arXiv:2407.07403 , 2024b. 29 Published in T ransactions on Mac hine Learning Researc h (03/2026) Hongyi Liu, Shaochen Zhong, Xintong Sun, Minghao Tian, Mohsen Hariri, Zirui Liu, Ruixiang T ang, Zhimeng Jiang, Jiayi Y uan, Y u-Neng Ch uang, Li Li, So o-Hyun Choi, Rui Chen, Vipin Chaudhary , and Xia Hu. Loratk: Lora once, bac kdo or ev erywhere in the share-and-pla y ecosystem, 2025a. URL https://arxiv.org/abs/2403.00108 . Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against backdooring at- tac ks on deep neural netw orks. In Pr o c e e dings of the International Symp osium on R ese ar ch in A ttacks, Intrusions, and Defenses (RAID) , pp. 273–294, 2018. Xin Liu, Yic hen Zhu, Jindong Gu, Y unshi Lan, Chao Y ang, and Y u Qiao. Mm-safetybench: A b enc hmark for safety ev aluation of m ultimo dal large language mo dels, 2024c. URL 17600 . Yi Liu, Chengjun Cai, Xiaoli Zhang, Xingliang Y uan, and Cong W ang. Arondigh t: Red teaming large vision language mo dels with auto-generated m ulti-mo dal jailbreak prompts, 2024d. URL abs/2407.15050 . Y up ei Liu, Y uqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. F ormalizing and benchmarking prompt injection attac ks and defenses, 2025b. URL . Zhao yi Liu and Huan Zhang. Stealth y backdoor attack in self-sup ervised learning vision encoders for large vision language mo dels, 2025. URL . Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic represen tations for vision-and-language tasks. A dvanc es in neur al information pr o c essing systems , 32, 2019. W eimin Lyu, Lu Pang, T engfei Ma, Haibin Ling, and Chao Chen. T ro jvlm: Bac kdo or attack against vision language mo dels. In Eur op e an Confer enc e on Computer V ision , pp. 467–483. Springer, 2024. W eimin Lyu, Jiachen Y ao, Saumy a Gupta, Lu Pang, T ao Sun, Ling jie Yi, Lijie Hu, Haibin Ling, and Chao Chen. Backdooring vision-language models with out-of-distribution data, 2025. URL https://arxiv. org/abs/2410.01264 . Aleksander Madry , Aleksandar Makelo v, Ludwig Schmidt, Dimitris T sipras, and Adrian Vladu. T ow ards deep learning mo dels resistant to adversarial attacks, 2019. URL . Ziqi Miao, Yi Ding, Lijun Li, and Jing Shao. Visual con textual attack: Jailbreaking mllms with image-driv en con text injection, 2025. URL . R uaridh Mon-Williams, Gen Li, Ran Long, W enqian Du, and Christopher G. Lucas. Embo died large language mo dels enable rob ots to complete complex tasks in unpredictable environmen ts. Nat. Mac. Intel l. , 7(4): 592–601, 2025. URL https://doi.org/10.1038/s42256- 025- 01005- x . A K Nirala, A Joshi, C Hegde, and S Sarkar. F ast certiﬁcation of vision-language mo dels using incremental randomized smo othing, 2024. URL . Nicolas Papernot, Patric k McDaniel, Ian Goo dfello w, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical blac k-b o x attacks against machine learning. In Pr o c e e dings of the 2017 A CM on A sia c onfer enc e on c omputer and c ommunic ations se curity , pp. 506–519, 2017. Fábio P erez and Ian Rib eiro. Ignore previous prompt: Attac k tec hniques for language mo dels. arXiv pr eprint arXiv:2211.09527 , 2022. Xiangyu Qi, Kaixuan Huang, Ash winee P anda, P eter Henderson, Mengdi W ang, and Prateek Mittal. Visual adv ersarial examples jailbreak aligned large language mo dels, 2023. URL 13213 . Maan Qraitem, Nazia T asnim, Piotr T eterwak, Kate Saenko, and Bry an A Plummer. Vision-llms can fo ol themselv es with self-generated t yp ographic attacks. arXiv pr eprint arXiv:2402.00626 , 2024. 30 Published in T ransactions on Mac hine Learning Researc h (03/2026) Maan Qraitem, Piotr T eterwak, Kate Saenk o, and Bryan A. Plummer. W eb artifact attacks disrupt vision language mo dels, 2025. URL . Alec Radford, Jong W o ok Kim, Chris Hallacy , A dit y a Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry , Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual mo dels from natural language sup ervision. In International c onfer enc e on machine le arning , pp. 8748–8763. PmLR, 2021. Jaec h ul Roh, Virat Shejwalkar, and Amir Houmansadr. Multilingual and multi-accen t jailbreaking of audio llms, 2025. URL . Christian Schlarmann, Naman Deep Singh, F rancesco Croce, and Matthias Hein. Robust CLIP: Unsu- p ervised adv ersarial ﬁne-tuning of vision embeddings for robust large vision-language models. In R us- lan Salakhutdino v, Zico K olter, Katherine Heller, Adrian W eller, Nuria Oliv er, Jonathan Scarlett, and F elix Berkenkamp (eds.), Pr o c e e dings of the 41st International Confer enc e on Machine L e arning , v ol- ume 235 of Pr o c e e dings of Machine L e arning R ese ar ch , pp. 43685–43704. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/schlarmann24a.html . Reshabh K Sharma, Vinay ak Gupta, and Dan Grossman. Defending language mo dels against image-based prompt attacks via user-provided sp eciﬁcations. In 2024 IEEE Se curity and Privacy W orkshops (SPW) , pp. 112–131, 2024. doi: 10.1109/SPW63631.2024.00017. Erfan Sha yegani, Y ue Dong, and Nael Abu-Ghazaleh. Jailbreak in pieces: Comp ositional adversarial attacks on m ulti-mo dal language mo dels, 2023a. URL . Erfan Shay egani, Md Abdullah Al Mamun, Y u F u, Pedram Zaree, Y ue Dong, and Nael Abu-Ghazaleh. Surv ey of vulnerabilities in large language mo dels rev ealed by adversarial attacks. arXiv pr eprint arXiv:2310.10844 , 2023b. Xijia T ao, Shuai Zhong, Lei Li, Qi Liu, and Lingp eng Kong. Imgtro jan: Jailbreaking vision-language models with one image. In Pr o c e e dings of the 2025 Confer enc e of the Nations of the A meric as Chapter of the A sso ciation for Computational Linguistics: Human L anguage T e chnolo gies (V olume 1: L ong Pap ers) , pp. 7048–7063. Asso ciation for Computational Linguistics, 2025. doi: 10.18653/v1/2025.naacl- long.360. URL http://dx.doi.org/10.18653/v1/2025.naacl- long.360 . Ma T eng, Jia Xiaojun, Duan Ranjie, Li Xinfeng, Huang Yihao, Jia Xiaosh uang, Chu Zhixuan, and Ren W enqi. Heuristic-induced multimodal risk distribution jailbreak attac k for m ultimo dal large language mo dels, 2025. URL . Brandon T ran, Jerry Li, and Aleksander Madry . Spectral signatures in bac kdo or attacks. In A dvanc es in Neur al Information Pr o c essing Systems (NeurIPS) , volume 31, 2018. Ashish V aswani, Noam Shazeer, Niki Parmar, Jakob Uszk oreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia P olosukhin. A tten tion is all you need. A dvanc es in neur al information pr o c essing systems , 30, 2017. Oriol Vin yals, Alexander T oshev, Samy Bengio, and Dumitru Erhan. Sho w and tell: A neural image caption generator. In Pr o c e e dings of the IEEE c onfer enc e on c omputer vision and p attern r e c o gnition , pp. 3156– 3164, 2015. Bolun W ang, Y uanshun Y ao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. Neural cleanse: Identifying and mitigating backdoor attac ks in neural net works. In Pr o c e e dings of the IEEE Symp osium on Se curity and Privacy (S&P) , pp. 707–723, 2019. Le W ang, Zonghao Ying, Tian yuan Zhang, Siyuan Liang, Shengshan Hu, Mingc huan Zhang, Aishan Liu, and Xianglong Liu. Manipulating m ultimo dal agents via cross-modal prompt injection, 2025a. URL https://arxiv.org/abs/2504.14348 . 31 Published in T ransactions on Mac hine Learning Researc h (03/2026) R uofan W ang, Xing jun Ma, Hanxu Zhou, Ch uanjun Ji, Guangnan Y e, and Y u-Gang Jiang. White-b o x m ultimo dal jailbreaks against large vision-language mo dels, 2024a. URL 17894 . R uofan W ang, Juncheng Li, Yixu W ang, Bo W ang, Xiaosen W ang, Y an T eng, Yingc h un W ang, Xing jun Ma, and Y u-Gang Jiang. Ideator: Jailbreaking and b enc hmarking large vision-language mo dels using themselv es, 2025b. URL . Xiaosen W ang, Shaokang W ang, Zhijin Ge, Y uyang Luo, and Shudong Zhang. Atten tion! y our vision language mo del could b e maliciously manipulated, 2025c. URL . Y u W ang, Xiaofei Zhou, Yic hen W ang, Geyuan Zhang, and Tianxing He. Jailbreak large vision-language mo dels through multi-modal linkage, 2025d. URL . Y ub o W ang, Chaohu Liu, Y anqiu Qu, Hao yu Cao, Deqiang Jiang, and Linli Xu. Break the visual perception: A dv ersarial attac ks targeting enco ded visual tok ens of large vision-language mo dels, 2024b. URL https: //arxiv.org/abs/2410.06699 . Zijun W ang, Hao qin T u, Jieru Mei, Bingc hen Zhao, Yisen W ang, and Cihang Xie. A ttngcg: Enhancing jailbreaking attac ks on llms with atten tion manipulation. arXiv pr eprint arXiv:2410.09040 , 2024c. Alexander W ei, Nika Hagh talab, and Jacob Steinhardt. Jailbroken: Ho w do es llm safet y training fail? A dvanc es in Neur al Information Pr o c essing Systems , 36:80079–80110, 2023. Chen Henry W u, Rishi Shah, Jing Y u K oh, Ruslan Salakhutdino v, Daniel F ried, and A diti Raghunathan. Dissecting adversarial robustness of m ultimo dal lm agen ts, 2025. URL 12814 . P eng Xie, Y equan Bie, Jianda Mao, Y angqiu Song, Y ang W ang, Hao Chen, and Kani Chen. Chain of attac k: On the robustness of vision-language mo dels against transfer-based adv ersarial attac ks, 2024. URL . W eilin Xu, Da vid Ev ans, and Y anjun Qi. F eature squeezing: Detecting adv ersarial examples in deep neural net w orks. In Pr o c e e dings 2018 Network and Distribute d System Se curity Symp osium , NDSS 2018. Internet So ciet y , 2018. doi: 10.14722/ndss.2018.23198. URL http://dx.doi.org/10.14722/ndss.2018.23198 . Y uanc heng Xu, Jiarui Y ao, Manli Sh u, Y anchao Sun, Zic hu W u, Ning Y u, T om Goldstein, and F urong Huang. Shado wcast: Stealth y data poisoning attac ks against vision-language mo dels, 2024. URL https: //arxiv.org/abs/2402.06659 . Hao Y ang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haﬀari. A udio is the achilles’ heel: Red teaming audio large m ultimo dal mo dels, 2024. URL . Zuop eng Y ang, Jiluan F an, Anli Y an, Erdun Gao, Xin Lin, T ao Li, Kangh ua Mo, and Changyu Dong. Distraction is all you need for multimodal large language model jailbreaking, 2025. URL https://arxiv. org/abs/2502.10794 . Andrew Y eo and Daeseon Choi. Multimo dal prompt injection attacks: Risks and defenses for mo dern llms, 2025. URL . Ziyi Yin, Muchao Y e, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jingh ui Chen, Ting W ang, and F englong Ma. Vlattack: Multimo dal adversarial attac ks on vision-language tasks via pre-trained mo dels. A dvanc es in Neur al Information Pr o c essing Systems , 36:52936–52956, 2023. Ziyi Yin, Muchao Y e, Y uanpu Cao, Jiaqi W ang, A ofei Chang, Han Liu, Jingh ui Chen, Ting W ang, and F englong Ma. Shadow-activ ated backdoor attac ks on m ultimo dal large language models. In W anxiang Che, Jo yce Nab ende, Ekaterina Sh utov a, and Mohammad T aher Pilehv ar (eds.), Findings of the A sso ciation for Computational Linguistics: A CL 2025 , pp. 4808–4829, Vienna, Austria, July 2025. Asso ciation for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.ﬁndings- acl.248. URL https: //aclanthology.org/2025.findings- acl.248/ . 32 Published in T ransactions on Mac hine Learning Researc h (03/2026) Zonghao Ying, Aishan Liu, Tian yuan Zhang, Zhengmin Y u, Siyuan Liang, Xianglong Liu, and Dacheng T ao. Jailbreak vision language mo dels via bi-mo dal adversarial prompt, 2024. URL 2406.04031 . Zengh ui Y uan, Jiaw en Shi, Pan Zhou, Neil Zhenqiang Gong, and Lichao Sun. Badtoken: T oken-lev el backdoor attac ks to m ulti-mo dal large language mo dels, 2025. URL . Chen y ang Zhang, Xiaoyu Zhang, Jian Lou, Kai W u, Zilong W ang, and Xiaofeng Chen. P oisoned- ey e: Kno wledge poisoning attack on retriev al-augmen ted generation based large vision-language mo dels. In Pr o c e e dings of the 42nd International Confer enc e on Machine L e arning , volume 267, V ancouver, Canada, 2025a. PMLR. URL https://icml.cc/virtual/2025/poster/46373 . ICML 2025 (PMLR v ol. 267). Op enReview: h ttps://op enreview.net/forum?id=6SIymOqJlc. PDF: h ttps://ra w.gith ubusercon ten t.com/mlresearc h/v267/main/assets/zhang25da/zhang25da.p df. Guoming Zhang, Chen Y an, Xiao yu Ji, Tianc hen Zhang, T aimin Zhang, and W enyuan Xu. Dolphinat- tac k: Inaudible voice commands. In Pr o c e e dings of the 2017 A CM SIGSA C c onfer enc e on c omputer and c ommunic ations se curity , pp. 103–117, 2017. Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Y ao, Zhen ting W ang, Chenlu Zhan, Hongwei W ang, and Y ongfeng Zhang. Agen t security bench (asb): F ormalizing and b enc hmarking attacks and defenses in llm-based agen ts, 2025b. URL . Hongy ang Zhang, Y ao dong Y u, Jiantao Jiao, Eric P . Xing, Lauren t El Ghaoui, and Mic hael I. Jordan. Theoretically principled trade-oﬀ b et ween robustness and accuracy , 2019. URL abs/1901.08573 . Jiaming Zhang, Xing jun Ma, Xin W ang, Lingyu Qiu, Jiaqi W ang, Y u-Gang Jiang, and Jitao Sang. Adv er- sarial prompt tuning for vision-language mo dels, 2024. URL . Y anzhe Zhang, T ao Y u, and Diyi Y ang. A ttac king vision-language computer agen ts via p op-ups, 2025c. URL https://arxiv.org/abs/2411.02391 . Y udong Zhang, Ruobing Xie, Jiansheng Chen, Xingwu Sun, Zhanhui Kang, and Y u W ang. Qav a: Query- agnostic visual attac k to large vision-language models, 2025d. URL . Y unqing Zhao, Tian yu P ang, Chao Du, Xiao Y ang, Chongxuan Li, Ngai-Man Man Cheung, and Min Lin. On ev aluating adv ersarial robustness of large vision-language mo dels . A dvanc es in Neur al Information Pr o c essing Systems , 36:54111–54138, 2023. A Survey Metho dology This section outlines the metho dology used to identify , select, and analyze the literature for this surv ey , ensuring b oth its comprehensiveness and reliability . A.1 Sea rch Strategy T o iden tify relev ant publications, w e conducted a systematic literature searc h using the Seman tic Sc holar database, selected for its broad co verage of computer science research. Our search employ ed key terms related to adversarial attacks and multimodal LLM’s, including large language mo dels and text, vision, audio, and video mo dalities. T o ensure relev ance to current systems and threat mo dels, w e prioritized w ork published in recent years, while selectively incorporating earlier foundational studies when necessary for context. In addition, w e p erformed targeted searc hes focusing on audio-based adv ersarial attac ks to ensure adequate co v erage of less-represen ted mo dalities. A.2 Inclusion and Exclusion Criteria W e include researc h w orks that satisfy all of the follo wing conditions: 33 Published in T ransactions on Mac hine Learning Researc h (03/2026) Inclusion Criteria • T arget Model Scop e: The work studies attac ks on MLLMs that: – pro cess tw o or more input mo dalities (e.g., text, image, audio, video), and – incorp orate a language-mo del-based reasoning or instruction-follo wing core (e.g., GPT-4V, LLaV A, Gemini, Flamingo, Qw en-VL). • A ttack Relev ance to MLLMs: The attac k is ev aluated on a m ultimo dal LLM, ev en if the attack op erates on a single mo dalit y (e.g., text-only , vision-only , or audio-only input surfaces). W e refer to suc h attacks as unimo dal attack surfaces on MLLMs and include them as baseline threats inherited b y m ultimo dal systems. • A dversarial Inten t: The work presents or analyzes attacks that in tentionally induce one or more of the follo wing: – in tegrit y failures (incorrect, misleading, or targeted outputs), – safet y or alignmen t violations (jailbreaks, p olicy bypass), – con trol or authorit y hijac king (prompt injection, to ol misuse, agentic manipulation), or – p ersisten t malicious b eha vior through training-time data p oisoning or backdoors. • A ttack Stage Cov erage: Both inference-time attacks (input, prompt, context, tool, or agen t manipulation) and training-time attac ks (data p oisoning, bac kdo ors, ﬁne-tuning corruption) are included, pro vided they target m ultimo dal LLM systems. • Sc holarly Quality: The work must b e peer-reviewed and published in a recognized conference or journal v en ue. Exclusion Criteria A work is excluded if it meets an y of the following conditions: • Unimo dal Mo dels: A ttacks targeting unimodal models only (e.g., text-only LLMs, vision-only classiﬁers, audio-only ASR systems) without ev aluation on a m ultimo dal LLM. • Non-LLM Multimo dal Systems: W orks on m ultimo dal systems without a language-model-based reasoning or instruction-follo wing comp onen t, such as: – audio-visual classiﬁers, – detection or recognition pip elines, or – m ultimo dal p erception mo dels without instruction follo wing. • Non-A dversarial F ailures: Studies fo cusing solely on benign robustness, data bias, calibration, fairness, or generalization without an adv ersarial threat mo del, unless explicitly framed as adversarial attac ks. • Non-T echnical or Opinion Pieces: Editorials, position pap ers without tec hnical con tent, blog p osts, or anecdotal rep orts. • Non-P eer-Reviewed W orks: Preprints, tec hnical rep orts, or unpublished manusc ripts that hav e not undergone p eer review. • Non-English Language: Only pap ers published in English are included. A.3 Summa ry of Adversa rial Attacks in the T axonomy on Multimo dal Large Language Mo dels The following table consolidates all pap ers analyzed and categorized in the prop osed taxonomy in Section 3.2. 34 Published in T ransactions on Mac hine Learning Researc h (03/2026) T able 6: Ov erview of adversarial attac ks on multimodal large language mo dels P ap er T arget Mo dalities A ttack er Kno wledge Impact on Performance (Dong et al., 2023) Image Blac k-b o x ASR: Bard 22%, Bing 26%, GPT-4V 45%, ERNIE 86% (Qraitem et al., 2024) Image + T ext Blac k-b o x A ccuracy drop up to 60% (Liu et al., 2024a) Image + T ext Blac k-b o x Seman tic similarity score up to 0.879 (Cheng et al., 2024) Image + T ext Blac k-b o x P erformance drop up to 42% (Qraitem et al., 2025) Image + T ext Blac k-b o x ASR up to 100% (artifact en- sem bles) (Cao et al., 2025) Image + T ext Blac k-b o x ASR: 44% (MCQ), 62% (op en-ended) (Xie et al., 2024) Image Blac k-b o x T argeted ASR up to 98% (Sha yegani et al., 2023a) Image + T ext Blac k-b o x ASR: 85–87% (image trig- gers) (Zhao et al., 2023) Image + T ext Blac k-b o x ASR: High Success Rate (Yin et al., 2023) Image + T ext Blac k-b o x ASR up to 93.5% (task- dep enden t) (T eng et al., 2025) Image + T ext Blac k-b o x ASR: 90% (op en-source), 68% (closed-source) (Y ang et al., 2024) A udio + T ext Blac k-b o x ASR ∼ 70% on harmful queries (Roh et al., 2025) A udio Blac k-b o x ASR increase up to +57% (Li et al., 2025b) Image + T ext Black-box ASR: LLaV A 90%, Gemini 72% (Miao et al., 2025) Image + T ext Blac k-b o x ASR: 85–91% across mo dels (W ang et al., 2025d) Image + T ext Blac k-b o x ASR up to 99% across bench- marks (Y ang et al., 2025) Image + T ext Blac k-b o x A verage ASR 52%; ensemble 74% (Jeong et al., 2025) Image + T ext Blac k-b o x A chiev ed upto 100% on LLaV A-1.5 13B (Hou et al., 2025) A udio + T ext Black-box High Defense Success Rate (3%) (Lee et al., 2025) Image + T ext Blac k-b o x ASR upto 90% (Clusmann et al., 2025) Image + T ext Blac k-b o x ASR 67% (GPT4o) v aries by mo del (Zhang et al., 2025c) Image + T ext Blac k-b o x 86% (W ang et al., 2025a) Image + T ext Blac k-b o x Increase of upto 30% (Zhang et al., 2025a) Image + T ext Blac k-b o x P oison success rate of upto 92% (Liao et al., 2025) Image + T ext Blac k-b o x Sp eciﬁc PII leakage upto 70% (T ao et al., 2025) Image + T ext Blac k-b o x 83.5% for AntiGPT prompt (Lyu et al., 2025) Image + T ext Black-box Consisten tly high (Liang et al., 2024) Image + T ext Blac k-b o x ASR > 97% at 0.2% p oisoning (Hu et al., 2025b) Video + T ext Blac k-b o x Upto 96.5% LLaV A-Video- 7B, dep ends on model (Kim ura et al., 2024) Image + T ext Blac k-b o x ASR up to 15.80% 35 Published in T ransactions on Mac hine Learning Researc h (03/2026) P ap er T arget Mo dalities A ttack er Kno wledge Impact on Performance (W ang et al., 2025b) Image + T ext Blac k-b o x ASR up to 94% (Cheng et al., 2025) Image + T ext Blac k-b o x ASR up to 67.3% (Gong et al., 2025) Image + T ext Black-box A v erage ASR 82.50% (Liu et al., 2024c) Image + T ext Blac k-b o x ASR up to 72.14% (Liu et al., 2024d) Image + T ext Black-box ASR up to 84.50% (Huang et al., 2025a) Video + T ext Blac k-b o x ASR up to 55.48% (MSVD- QA), ASR up to 58.26% (MSR VTT-QA) (Liang et al., 2024) Image + T ext Blac k b o x ASR > 97% (Aic hberger et al., 2025) Image + T ext Gra y-b o x Univ ersal Attac ks 100% (Liu & Zhang, 2025) Image Gray-box ASR > 99% (Geng et al., 2025) Image + Audio Gra y-b o x ASR up to 86.6% (Liang et al., 2025) Image + T ext Gra y-b o x ASR up to 99.82% (Zhang et al., 2017) A udio Gra y-b o x ASR almost 100% in ideal conditions (W ang et al., 2024b) Image + T ext Gra y b o x ASR up to 81.60% (Xu et al., 2024) Image + T ext Gra y + Black- b o x P oison ASR > 95% (lab el ﬂips) (Xu et al., 2024) Image + T ext Gra y + Black- b o x ASR > 95% (Liang et al., 2025) T ext + Image Gra y + Black b o x ASR up to 99.82% (W ang et al., 2025c) Image White-b o x Jailbreak ASR > 88%; halluci- nation > 98% (Ying et al., 2024) Image + T ext White-b o x A verage ASR upto 68.17% for MiniGPT4 (Yin et al., 2025) Image + T ext White-b o x ASR upto 100% (Lyu et al., 2024) Image + T ext White-b o x Image captioning around 0.97 (Y uan et al., 2025) Image + T ext White-b o x ASR up to 100% (captioning) (Gu et al., 2024) Image + T ext White-b o x Near-100% infection ASR (Qi et al., 2023) Image + T ext White-b o x Jailbreak ASR up to 91% (W ang et al., 2024a) Image + T ext White-b o x ASR: 96% (MiniGPT-4) (Bagdasary an et al., 2023) Image + Audio White-box Not provided (Carlini & W agner, 2018) A udio White b o x ASR almost 100% (Li et al., 2024) Video White b o x Almost 60% garble rate (W ang et al., 2024c) T ext White b o x ASR up to 70.60% (Liu & Zhang, 2025) Image + T ext White box ASR up to 99% (Cui et al., 2024) Image White b o x Caption Retriev al Recall Drop 90% (Han et al., 2024) Image + T ext + A udio + Video White b o x ASR > 96% (Liu et al., 2025a) T ext White b o x ASR up to 95.8% - 100% (F u et al., 2023) Image + T ext White-b o x 98% (Huang et al., 2025b) Image + T ext White + Black- b o x ASR up to 98.5% (transfer) 36 Published in T ransactions on Mac hine Learning Researc h (03/2026) P ap er T arget Mo dalities A ttack er Kno wledge Impact on Performance (Huang et al., 2025b) Image + T ext White + Black- b o x White Box ASR: 82.04% Blac k Box ASR: Upto 98.5% (Zhang et al., 2025d) Image + T ext White + Black- b o x P erformance drops from 78% to 44.85% for InstructBLIP Mo del (Hao et al., 2024) Image + T ext White + Black- b o x ASR up to 77.75% (MiniGPT-4) (Bailey et al., 2024) Image + T ext White + Blac k- b o x ASR > 80% (W u et al., 2025) Image + T ext White + Blac k- b o x ASR > 67% (Bagdasary an et al., 2024) Image + T ext + A udio + Thermal White + Blac k + Gray box ASR > 99.5% (Image- Bind/A udioCLIP) 37

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment