GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs
Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects that are absent from the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which precludes uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.
💡 Research Summary
The paper introduces GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a fully automated pipeline that stress‑tests multimodal large language models (MLLMs) by synthesizing images that deliberately cause object hallucination. Unlike prior work that relies on static benchmarks, GHOST actively manipulates the visual input based on direct feedback from the target model, thereby uncovering model‑specific and transferable vulnerabilities.
The method proceeds in three stages. First, a mapper Π (implemented as a simple MLP) is trained to align CLIP image embeddings with the vision‑token space of the MLLM, using a mean‑squared error loss. Second, given an original image that does not contain a target object t, GHOST optimizes the CLIP embedding c of that image. The optimization minimizes a composite loss: (i) an adversarial term L_adv = –log p(y* | X_q, Π(c)) that pushes the model to answer “Yes” (or another target token) with high confidence; (ii) a regularization term L_reg that keeps c close to the original embedding c₀, preserving overall visual similarity; and (iii) a CLIP‑object term L_clip that penalizes cosine similarity between c and textual templates describing t, ensuring the object’s semantics are not directly encoded. Hyper‑parameters λ_clip and λ_reg balance these components, and the loss is minimized with AdamW while sampling diverse query templates each step.
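The composite objective above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `log_p_yes` stands in for the MLLM's log-probability log p(y* | X_q, Π(c)) (assumed computed elsewhere), `text_embs` for CLIP text embeddings of the templates describing t, and the use of the maximum similarity in L_clip is an assumption (a mean over templates would also fit the description).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ghost_loss(c, c0, text_embs, log_p_yes, lam_clip=0.1, lam_reg=1.0):
    """Composite GHOST objective (sketch; hyper-parameter values assumed).

    c         : current CLIP image embedding being optimized
    c0        : embedding of the original image (anchored via L_reg)
    text_embs : CLIP text embeddings of templates describing the target t
    log_p_yes : log p(y* | X_q, Pi(c)) from the target MLLM
    """
    l_adv = -log_p_yes                              # push the model toward "Yes"
    l_reg = float(np.sum((c - c0) ** 2))            # stay close to the original image
    l_clip = max(cosine(c, e) for e in text_embs)   # penalize encoding the object's semantics
    return l_adv + lam_reg * l_reg + lam_clip * l_clip
```

In the full method this scalar would be minimized over c with AdamW, re-sampling a fresh query template X_q at each step.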
Third, the optimized embedding c is fed to a latent diffusion model (Stable Diffusion unCLIP). Instead of starting from pure noise, the reverse diffusion begins from a partially noised latent of the original image, allowing subtle semantic shifts while retaining high‑level structure. The diffusion model decodes c into a natural‑looking image X̃_v. To guarantee that the target object truly remains absent, the generated image is screened with the open‑vocabulary detector OWLv2; only images where OWLv2 detects no instance of t are kept. If the MLLM's probability for the target token exceeds a confidence threshold τ_yes, the image is recorded as a successful hallucination‑inducing sample.
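The two-part acceptance check at the end of the pipeline can be sketched as below. `detect_object` and `p_target_token` are hypothetical callables standing in for OWLv2 detection and the MLLM's probability of the target token, and the default τ_yes value is an assumption:

```python
def accept_sample(image, target, detect_object, p_target_token, tau_yes=0.8):
    """Decide whether a generated image counts as a hallucination-inducing
    sample (sketch). Two conditions must hold:
      1) the open-vocabulary detector finds NO instance of the target object;
      2) the MLLM assigns the target token probability above tau_yes.
    """
    if detect_object(image, target):   # object visibly present -> reject
        return False
    return p_target_token(image, target) > tau_yes
```

Screening with a detector that is independent of the attacked MLLM is what licenses the claim that the hallucinated object is genuinely absent, rather than merely missed by the model under test.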
Experiments span five state‑of‑the‑art MLLMs (Qwen2.5‑VL, GLM‑4.1V‑Thinking, LLaVA‑1.5, MiniGPT‑4, GPT‑4o) and a corpus of 9,423 diverse images paired with 20 target objects (e.g., knife, cat, clock). On Qwen2.5‑VL, GHOST achieves a 29% success rate, compared with roughly 0.1% for the closest prior method (DASH). Across all models the average success rate exceeds 28%, representing a two‑order‑of‑magnitude improvement over static benchmarks. Moreover, images optimized for one model often transfer: Qwen2.5‑VL‑generated images cause hallucinations in GPT‑4o at a 66.5% rate, indicating shared spurious visual‑language correlations across architectures.
Image quality is validated both quantitatively (FID ≈ 12.3) and via human studies, where 89% of participants correctly identify that the target object is absent. This confirms that GHOST's cues are subtle enough that humans do not perceive the target object, yet sufficient to push MLLMs across their decision boundary.
Finally, the authors demonstrate a mitigation use‑case. Fine‑tuning Qwen2.5‑VL on a set of 5,000 GHOST‑generated images reduces hallucination errors on established benchmarks by an average of 12 percentage points, especially for thin, edge‑like objects (e.g., knives, scissors). This positions GHOST as both a diagnostic probe and a source of adversarial training data.
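One natural way to turn such detector-verified images into corrective fine-tuning data is to pair each with the existence query and the grounded "No" answer. This is a sketch under that assumption; the field names and question template are illustrative, not taken from the paper:

```python
def make_corrective_pairs(ghost_images):
    """Build supervised fine-tuning examples from GHOST samples (sketch).

    ghost_images : list of (image_path, target_object) tuples, where the
                   detector has already verified the object is absent.
    """
    pairs = []
    for image_path, target in ghost_images:
        pairs.append({
            "image": image_path,
            "question": f"Is there a {target} in the image?",
            "answer": "No",  # ground truth: absence verified by the detector screen
        })
    return pairs
```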
Key contributions: (1) a model‑feedback‑driven embedding optimization that inserts “stealth tokens” without explicit object insertion; (2) decoupling of the optimization from the diffusion generator, yielding computational efficiency and compatibility with any diffusion model; (3) empirical evidence of both model‑specific and transferable hallucination vulnerabilities; (4) demonstration that synthetic hallucination‑inducing images can improve model robustness when used for fine‑tuning.
Limitations include reliance on CLIP’s own biases (which may propagate through the mapper), focus on binary existence queries rather than more complex relational reasoning, and the need to repeat the optimization per target object. Future work could extend GHOST to multi‑object scenarios, explore alternative vision encoders, and integrate the generated data into continual‑learning pipelines for long‑term robustness.