A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks
This study evaluates the capabilities and characteristics of current mainstream vision-language models (VLMs) in generating critiques of traditional Chinese painting. To this end, we first developed a quantitative framework for Chinese painting critique. The framework was constructed by using a zero-shot classification model to extract multi-dimensional evaluative features—covering evaluative stance, feature focus, and commentary quality—from human expert critiques. Based on these features, several representative critic personas were defined and quantified. The framework was then used to evaluate selected VLMs such as Llama, Qwen, and Gemini. The experimental design employed persona-guided prompting to assess the VLMs' ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insight into their potential and limitations in complex semantic understanding and content generation tasks. The code used in our experiments is publicly available at: https://github.com/yha9806/VULCA-EMNLP2025.
💡 Research Summary
This paper introduces VULCA (Vision‑Understanding and Language‑based Cultural Adaptability Framework), a structured methodology for evaluating and enhancing the interpretive capabilities of multimodal large language models (here, VLMs) in culturally situated tasks, using traditional Chinese painting criticism as a testbed. The authors first construct a high‑quality human benchmark (MHEB) comprising 163 expert commentaries drawn from authoritative museum catalogs and scholarly publications. Each commentary is annotated along three dimensions—Evaluative Stance, Feature Focus, and Commentary Quality—yielding 38 primary labels plus nine derived analytical dimensions (47 in total). Annotation is performed by three graduate‑level art historians, achieving substantial inter‑annotator agreement (Fleiss κ = 0.78, ICC = 0.82).
To transform these textual critiques into a quantitative representation, the authors employ a zero‑shot multilingual BART‑large‑MNLI model. For each label, the model evaluates the hypothesis “This text is about L” and outputs a probability score; scores above 0.5 indicate presence, while the continuous values capture prominence. This yields a 47‑dimensional feature vector for every human commentary, providing a fine‑grained reference for model comparison.
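The labeling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label list is truncated to three of the 47 dimensions, and the scoring function is a keyword-based stand-in for the actual multilingual BART‑large‑MNLI entailment model (which, in a real pipeline, would be invoked via a zero-shot-classification interface with the hypothesis template "This text is about {label}").

```python
# Sketch of the zero-shot feature-extraction step (illustrative only).
# `score_label` stands in for the NLI model's entailment probability.

LABELS = ["brushwork", "composition", "symbolism"]  # 3 of the 47 dimensions

def score_label(text, label):
    """Stand-in for P(entailment) of the hypothesis
    'This text is about {label}'. A real implementation would query
    the multilingual BART-large-MNLI model instead of this toy heuristic."""
    return 0.9 if label in text.lower() else 0.1

def feature_vector(text, labels=LABELS, threshold=0.5):
    """Return continuous prominence scores and binary presence flags,
    mirroring the paper's 0.5 presence threshold."""
    scores = [score_label(text, lab) for lab in labels]
    present = [s > threshold for s in scores]
    return scores, present

scores, present = feature_vector("The brushwork and composition are austere.")
# present -> [True, True, False]
```

In the full framework, running this over all 47 labels gives each human commentary a 47-dimensional vector, against which model-generated critiques can be compared.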
The evaluation phase leverages eight persona‑guided prompts (e.g., historian, aesthetician, technical analyst) combined with a domain‑specific knowledge base containing Chinese art terminology and symbolic concepts. VLMs such as Llama 3, Qwen‑VL, and Gemini 2.5 Pro are prompted under two conditions: baseline (no persona) and persona‑guided. Results show that Gemini 2.5 Pro experiences a 20 % increase in symbolic reasoning scores (0.62 → 0.75) and a 30 % boost in argumentative coherence (0.68 → 0.88) when persona conditioning is applied. Llama 3 and Qwen‑VL also exhibit measurable shifts in style and terminology usage, confirming that persona and knowledge‑base interventions can steer model outputs toward culturally appropriate discourse.
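The two prompting conditions can be sketched as below. The persona descriptions and template wording here are hypothetical placeholders, not the paper's actual prompts; the point is only the structural contrast between the baseline and persona-guided conditions.

```python
# Illustrative sketch of baseline vs. persona-guided prompting
# (persona text and template are assumptions, not the paper's prompts).

PERSONAS = {
    "historian": "an art historian attentive to period, provenance, and lineage",
    "aesthetician": "an aesthetician focused on formal qualities and expressive effect",
}

def build_prompt(task, persona=None):
    """Baseline condition returns the task alone; persona-guided
    condition prepends a role instruction drawn from PERSONAS."""
    if persona is None:
        return task
    return f"You are {PERSONAS[persona]}. {task}"

task = "Write a critique of the attached painting."
baseline = build_prompt(task)                      # baseline condition
guided = build_prompt(task, persona="historian")   # persona-guided condition
```

In the study, each condition's outputs are scored with the 47-dimensional framework, which is how the reported per-dimension shifts (e.g., in symbolic reasoning and argumentative coherence) are measured.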
By contrasting VULCA with existing benchmarks such as MME and MMBench—which focus on object recognition or factual QA—the paper demonstrates that VULCA uniquely assesses deep semantic alignment, cultural adaptability, and logical consistency. The authors further argue that the framework is generalizable to other epistemically rich domains (religion, medicine, history), offering a pathway for multimodal models to collaborate with human experts in nuanced interpretive tasks. Overall, VULCA provides a comprehensive evaluation pipeline and a practical intervention strategy that moves multimodal LLMs beyond surface‑level performance toward genuine cultural and interpretive competence.