Read as You See: Guiding Unimodal LLMs for Low-Resource Explainable Harmful Meme Detection

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original ArXiv source.

Detecting harmful memes is crucial for safeguarding the integrity and harmony of online environments, yet existing detection methods are often resource-intensive, inflexible, and lacking in explainability, limiting their applicability to real-world web content moderation. We propose U-CoT+, a resource-efficient framework that prioritizes accessibility, flexibility, and transparency in harmful meme detection by fully harnessing the capabilities of lightweight unimodal large language models (LLMs). Instead of directly prompting or fine-tuning large multimodal models (LMMs) as black-box classifiers, we avoid immediate reasoning over complex visual inputs and decouple meme content recognition from harmfulness analysis through a high-fidelity meme-to-text pipeline. This pipeline coordinates lightweight LMMs and LLMs to convert multimodal memes into natural-language descriptions that preserve critical visual information, enabling text-only LLMs to “see” memes by “reading” them. Grounded in these textual inputs, we further guide unimodal LLMs’ reasoning under zero-shot Chain-of-Thought (CoT) prompting with targeted, interpretable, context-aware, and easily obtained human-crafted guidelines, providing accountable step-by-step rationales while enabling flexible and efficient adaptation to diverse sociocultural criteria of harmfulness. Extensive experiments on seven benchmark datasets show that U-CoT+ achieves performance comparable to resource-intensive baselines, highlighting its effectiveness and potential as a scalable, explainable, low-resource solution for harmful meme detection.


💡 Research Summary

The paper tackles the pressing problem of detecting harmful memes—visual‑textual artifacts that can spread hate speech, misinformation, or extremist propaganda—while addressing three major shortcomings of existing approaches: high computational cost, poor adaptability to diverse sociocultural policies, and lack of explainability. Instead of relying on large multimodal models (LMMs) such as GPT‑4V or CLIP‑based classifiers that require massive parameter counts, fine‑tuning, or extensive labeled data, the authors propose a lightweight, fully text‑centric framework called U‑CoT+. The core idea is to let unimodal large language models (LLMs) “see” memes by “reading” them, i.e., converting the visual content into a high‑fidelity textual description and then guiding the LLM’s reasoning with human‑crafted, context‑aware guidelines through zero‑shot Chain‑of‑Thought (CoT) prompting.

High‑Fidelity Meme2Text Pipeline
The pipeline first extracts critical visual attributes from a meme using a small LMM (e.g., LLaVA‑1.6‑7B). Rather than asking the model to produce a full caption in one shot, it repeatedly poses atomic visual‑question‑answer (VQA) prompts targeting identity‑related cues such as race, gender, age, attire, and disability. This step‑wise querying reduces hallucination and ensures that socially sensitive details are not omitted by safety‑aligned LMMs, which often replace specific references with generic terms. The collected attribute answers (e.g., “two human subjects, appears to be of Jewish descent, adult man and young boy”) are then fed to a unimodal LLM (Mistral‑12B, Qwen2.5‑14B, etc.) which integrates them into a concise, coherent description (D_h). This description serves as the sole input for the downstream classification, effectively turning the multimodal detection task into a pure language‑understanding problem.
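The two-stage flow above can be sketched in a few lines. This is a minimal illustration of the data flow only, not the paper's implementation: `vqa_model` and `llm` are hypothetical callables standing in for the small LMM (e.g., LLaVA-1.6-7B) and the unimodal LLM (e.g., Qwen2.5-14B), and the attribute questions are illustrative.

```python
# Atomic VQA prompts targeting identity-related visual attributes
# (one question per call, to reduce hallucination and omission).
ATTRIBUTE_QUESTIONS = {
    "subjects": "How many human subjects are in the image, and who are they?",
    "race": "What race or ethnicity do the subjects appear to be?",
    "gender_age": "What are the apparent genders and ages of the subjects?",
    "attire": "What notable clothing or attire do the subjects wear?",
    "disability": "Do any subjects show visible signs of disability?",
}

def extract_attributes(image, vqa_model):
    """Step-wise querying of a small LMM: one atomic question per call."""
    return {key: vqa_model(image, q) for key, q in ATTRIBUTE_QUESTIONS.items()}

def compose_description(attributes, caption, llm):
    """Have a text-only LLM fuse the attribute answers and the meme caption
    into one coherent description D_h, the sole input to classification."""
    prompt = (
        "Combine the following visual attributes and caption into a concise, "
        "faithful description of a meme.\n"
        + "\n".join(f"- {k}: {v}" for k, v in attributes.items())
        + f"\nCaption: {caption}"
    )
    return llm(prompt)

# Usage with stub models, just to show the flow end to end:
fake_vqa = lambda image, q: "two human subjects" if "How many" in q else "unspecified"
fake_llm = lambda prompt: f"Description based on:\n{prompt}"
attrs = extract_attributes("meme.png", fake_vqa)
d_h = compose_description(attrs, "sample caption", fake_llm)
```

Keeping each question atomic makes the extraction auditable: a moderator can inspect exactly which visual cue produced which part of the final description.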

Unimodal Guided CoT Prompting
To compensate for the loss of visual intuition and to embed policy‑level nuance, the authors design a set of human‑written guidelines (GL1‑GL7). These capture protected‑group definitions, rules for interpreting metaphors, distinctions between explicit and implicit hate, and sensitivity weighting based on historical context. During inference, the LLM receives a prompt that concatenates (1) the high‑fidelity meme description, (2) the full guideline list, and (3) an instruction to reason step‑by‑step before outputting a binary label. The LLM then proceeds through a structured chain of thought: (i) identify image and caption content, (ii) perform contextual analysis (e.g., familial relationship, historical references), (iii) evaluate potential negative associations (e.g., “gas” evoking Zyklon B), (iv) consider intent versus impact, and (v) apply each relevant guideline to reach a final decision. The generated reasoning is returned alongside the label, providing transparent justification for moderators.
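The prompt assembly described above might look roughly as follows. This is a hedged sketch: the guideline strings are illustrative stand-ins for the paper's GL1–GL7 (whose exact wording is not reproduced here), and `build_prompt` is not from the authors' codebase.

```python
# Illustrative stand-ins for the paper's human-crafted guidelines GL1-GL7.
GUIDELINES = [
    "GL1: Treat references to protected groups (race, religion, gender, ...) as sensitive.",
    "GL2: Interpret metaphors and symbols in their historical context.",
    "GL3: Distinguish explicit hate from implicit or coded hate.",
    # ... remaining guidelines elided
]

# Zero-shot CoT instruction mirroring steps (i)-(v) in the text.
COT_INSTRUCTION = (
    "Reason step by step: (i) identify image and caption content, "
    "(ii) analyze context, (iii) evaluate potential negative associations, "
    "(iv) weigh intent versus impact, (v) apply each relevant guideline. "
    "Then answer 'harmful' or 'harmless' and return your reasoning."
)

def build_prompt(description, guidelines=GUIDELINES):
    """Concatenate (1) the high-fidelity meme description, (2) the guideline
    list, and (3) the step-by-step instruction, as in the inference setup."""
    return "\n\n".join([
        f"Meme description:\n{description}",
        "Guidelines:\n" + "\n".join(guidelines),
        COT_INSTRUCTION,
    ])

prompt = build_prompt("An adult man hands a canister labeled 'gas' to a young boy.")
```

Because the guidelines are plain text in the prompt, swapping in a different platform policy or regional regulation requires no retraining, only editing the list.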

Experimental Evaluation
The authors evaluate U‑CoT+ on seven publicly available harmful‑meme benchmarks, including HatefulMemes, HarMeme, and CMU‑Meme. Baselines comprise (a) fully supervised multimodal classifiers fine‑tuned on large labeled corpora, (b) low‑resource approaches that prompt advanced LMMs such as GPT‑4o‑mini, and (c) CLIP‑based zero‑shot methods. Results show that U‑CoT+ with lightweight LLMs achieves F1 scores around 0.78 and AUROC ≈0.86, matching or slightly surpassing GPT‑4o‑mini despite using far fewer parameters and no training data. An ablation study reveals that (i) removing the Meme2Text stage and feeding raw images directly to the LLM drops performance dramatically (F1 ≈0.62), confirming the necessity of high‑fidelity textual grounding; (ii) omitting the guideline‑driven CoT reduces accuracy by 4‑6 percentage points, highlighting the importance of structured policy guidance. Moreover, the entire pipeline runs on a single 24 GB GPU in under 30 seconds for a batch of 1,000 memes, demonstrating practical feasibility for real‑time moderation.

Strengths, Limitations, and Future Work
U‑CoT+ excels in three dimensions: (1) Resource Efficiency – it requires only open‑source, modest‑size models and no labeled data, making it accessible to organizations with limited compute budgets; (2) Flexibility – updating or swapping the guideline set instantly adapts the system to new platform policies, regional regulations, or emerging meme formats; (3) Explainability – the step‑wise CoT output furnishes moderators with human‑readable rationales, fostering trust and enabling audit trails. However, the approach inherits certain constraints. The VQA stage depends on the LMM’s ability to correctly detect nuanced cultural symbols; misrecognition can propagate errors into the final description. The guidelines themselves are handcrafted and thus subjective; maintaining them across languages and cultures will demand continuous expert involvement. Finally, the current evaluation focuses on English‑language memes; extending to multilingual contexts and assessing cross‑cultural robustness remain open challenges.

In conclusion, the paper demonstrates that by decoupling visual perception from harmfulness reasoning and leveraging guided chain‑of‑thought prompting, lightweight unimodal LLMs can serve as effective, transparent, and low‑cost detectors for harmful memes. The authors envision future extensions that (i) incorporate multilingual LLMs and culturally diverse guideline repositories, (ii) automate guideline generation via policy‑driven templates, and (iii) integrate human‑in‑the‑loop feedback loops to continuously refine both the meme‑to‑text conversion and the reasoning process. This work paves the way for scalable, explainable multimodal moderation solutions that are viable even in resource‑constrained settings.

