Using Vision + Language Models to Predict Item Difficulty


This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE = 0.224), outperforming the unimodal vision-only (MAE = 0.282) and text-only (MAE = 0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.


💡 Research Summary

This paper investigates whether large multimodal language models can predict the difficulty of data‑visualization‑literacy (DVL) test items. Difficulty is defined as the proportion of respondents who answer a given item correctly (the "easiness" score). The authors use a dataset compiled by Verma and Fan (2025) that contains responses from U.S. adults and college students to items drawn from five established DVL assessments (WAN, GGR, BRBF, VLAT, CALVI). Each item includes a PNG image of a chart, the question text, a set of answer options, and a large number of binary response records. By aggregating responses per item, the authors compute an empirical easiness score ranging from 0 (all participants incorrect) to 1 (all participants correct).
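This aggregation step can be sketched in a few lines; the item IDs and record layout below are illustrative, not the dataset's actual schema:

```python
from collections import defaultdict

def easiness_scores(responses):
    """Aggregate binary response records into per-item easiness scores.

    `responses` is an iterable of (item_id, correct) pairs, where `correct`
    is 1 if the participant answered correctly, else 0. Returns a dict
    mapping each item_id to its proportion of correct responses (0 to 1).
    """
    n_correct = defaultdict(int)
    n_total = defaultdict(int)
    for item_id, is_correct in responses:
        n_correct[item_id] += is_correct
        n_total[item_id] += 1
    return {item: n_correct[item] / n_total[item] for item in n_total}

# Illustrative records: vlat_01 answered correctly by 2 of 3 participants.
records = [("vlat_01", 1), ("vlat_01", 0), ("vlat_01", 1), ("calvi_03", 0)]
print(easiness_scores(records))  # vlat_01 -> 2/3, calvi_03 -> 0.0
```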

The central research questions are: (1) Which modality—visual features extracted from the chart or textual features extracted from the question and answer options—better predicts item difficulty? (2) Does a combined multimodal approach improve predictive performance? To answer these questions, the authors design three experimental configurations that all rely on the GPT‑4.1‑nano model accessed via the OpenAI API, leveraging its built‑in multimodal capabilities.

Model configurations

  1. Text‑only model – Input consists solely of the question text and the answer‑option text. The system prompt asks the model to evaluate cognitive task type, question clarity, number of options, distractor plausibility, and other textual cues, then output an estimated easiness score.
  2. Vision‑only model – Input consists solely of the chart image URL (PNG). The prompt directs the model to assess chart type, axis labeling, data‑encoding clarity, visual clutter, number of data series, annotations, and overall visual complexity before producing an easiness estimate.
  3. Multimodal (Vision + Text) model – Both the image and the textual components are supplied. The prompt explicitly requests a joint analysis of visual and textual demands, the quality of answer options, and their interaction, with the final output being an estimated easiness score.
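The three configurations differ only in what is placed in the user message sent to the chat-completions API. A minimal sketch of how the payloads might be assembled, assuming the standard OpenAI content-part format for text and image inputs (the system-prompt wording here is a paraphrase, not the paper's exact prompts):

```python
# Illustrative system prompts; the paper's exact wording is not reproduced here.
SYSTEM_PROMPTS = {
    "text": "Assess cognitive task type, clarity, and distractor plausibility; "
            "output an estimated easiness score.",
    "vision": "Assess chart type, axis labeling, clutter, and visual "
              "complexity; output an estimated easiness score.",
    "multimodal": "Jointly assess visual and textual demands and their "
                  "interaction; output an estimated easiness score.",
}

def build_messages(mode, question=None, options=None, image_url=None):
    """Assemble a chat-completions message list for one of the three modes."""
    content = []
    if mode in ("text", "multimodal"):
        content.append({"type": "text",
                        "text": f"Question: {question}\nOptions: {options}"})
    if mode in ("vision", "multimodal"):
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return [{"role": "system", "content": SYSTEM_PROMPTS[mode]},
            {"role": "user", "content": content}]

msgs = build_messages("multimodal",
                      question="Which month had the highest sales?",
                      options=["Jan", "Apr", "Jul", "Oct"],
                      image_url="https://example.com/chart.png")
```

The resulting list would then be passed as `messages` to a call such as `client.chat.completions.create(model="gpt-4.1-nano", messages=...)`.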

All three pipelines use Pydantic‑defined JSON schemas to enforce a consistent output format and to simplify downstream evaluation. The dataset is split into an 80 % training/validation portion and a 20 % held‑out test set. For model selection, the authors restrict the validation subset to 154 items whose images are PNG files (the API does not support SVG at the time of the study).
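A Pydantic schema of the kind described might look like the following; the field names are assumptions for illustration, not the authors' actual schema:

```python
from pydantic import BaseModel, Field

class DifficultyPrediction(BaseModel):
    """Output schema enforcing a consistent JSON response from the model."""
    reasoning: str = Field(description="Brief analysis of the item's demands")
    estimated_easiness: float = Field(
        ge=0.0, le=1.0, description="Predicted proportion of correct responses")

# The class can be handed to the OpenAI SDK's structured-output helper, e.g.:
#   completion = client.beta.chat.completions.parse(
#       model="gpt-4.1-nano", messages=..., response_format=DifficultyPrediction)
pred = DifficultyPrediction(reasoning="Cluttered chart, plausible distractors",
                            estimated_easiness=0.42)
```

Constraining `estimated_easiness` to [0, 1] at the schema level means malformed model outputs fail validation immediately rather than corrupting downstream evaluation.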

Performance metrics
The primary evaluation metric is mean absolute error (MAE) between predicted and observed easiness scores on the validation subset. Results are:

  • Vision‑only MAE = 0.2819
  • Text‑only MAE = 0.3382
  • Multimodal MAE = 0.2239

The multimodal configuration achieves the lowest MAE, indicating that integrating visual and textual cues yields more accurate difficulty predictions than either modality alone. To assess generalization, the multimodal model is applied to the held‑out test set (46 items). Six items contain SVG images, which the API cannot process; the authors assign a default prediction of 0.5 for these cases. For the remaining 40 PNG items, the multimodal model’s predictions are submitted to a Kaggle‑style competition platform, where the model attains a mean squared error (MSE) of 0.10805.
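The evaluation scheme above, including the 0.5 fallback for items the API could not process, can be sketched as follows (item IDs and scores are illustrative):

```python
def evaluate(preds, truth, fallback=0.5):
    """Compute MAE and MSE of predicted vs. observed easiness scores.

    Items missing from `preds` (e.g. SVG images the API could not process)
    receive the default `fallback` prediction of 0.5.
    """
    abs_errs, sq_errs = [], []
    for item, observed in truth.items():
        predicted = preds.get(item, fallback)
        abs_errs.append(abs(predicted - observed))
        sq_errs.append((predicted - observed) ** 2)
    n = len(truth)
    return sum(abs_errs) / n, sum(sq_errs) / n

# Item "c" has no prediction, so it falls back to 0.5.
mae, mse = evaluate({"a": 0.8, "b": 0.3}, {"a": 0.9, "b": 0.4, "c": 0.6})
```

Because the fallback ignores everything known about an item, each SVG item contributes an error of |0.5 − observed easiness|, which is why the authors suspect the fallback inflated the test-set MSE.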

Discussion and limitations
The authors interpret the superior performance of the multimodal model as evidence that difficulty in DVL items stems from the interaction between visual complexity and textual demands. They acknowledge several limitations: (i) inability to process SVG images directly, which forced a simplistic fallback and likely inflated the test‑set MSE; (ii) reliance on a single proprietary LLM, raising questions about reproducibility and model‑specific bias; (iii) modest size of the validation subset, limiting statistical power; and (iv) the current system provides only point estimates without uncertainty quantification.

Future work is suggested in four directions: (a) implement image‑format conversion pipelines (e.g., SVG‑to‑PNG) or adopt APIs that accept SVG; (b) explore alternative multimodal architectures (e.g., Flamingo, CLIP‑based models) and fine‑tuning strategies; (c) compare LLM‑based predictions with traditional psychometric approaches such as Item Response Theory (IRT) or logistic regression on handcrafted features; and (d) incorporate Bayesian or ensemble methods to deliver confidence intervals alongside difficulty estimates.
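Direction (a) amounts to a format-routing step in front of the API call. A hedged sketch, with the conversion itself delegated to an assumed dependency such as `cairosvg` (commented out below since it is not part of the paper's pipeline):

```python
from pathlib import Path

def prepare_image(path):
    """Route stimulus images by format before sending them to the API.

    PNGs pass through unchanged; SVGs are mapped to a PNG target path,
    where a converter (e.g. cairosvg, an assumed dependency) would write
    the rasterized output. Other formats are rejected.
    """
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".png":
        return p
    if suffix == ".svg":
        out = p.with_suffix(".png")
        # cairosvg.svg2png(url=str(p), write_to=str(out))  # assumed converter
        return out
    raise ValueError(f"unsupported image format: {suffix}")
```

With such a step in place, the six SVG test items would receive real predictions instead of the 0.5 fallback.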

Implications
The study demonstrates that large multimodal LLMs can serve as practical tools for automating psychometric analysis in the domain of data‑visualization literacy. Accurate, automated pre‑calibration of item difficulty could accelerate test development cycles, reduce the need for costly pilot studies, and inform the design of more balanced assessment items. Moreover, the model’s internal analysis of visual and textual features can surface systematic sources of difficulty—such as overly cluttered charts or ambiguously phrased questions—providing actionable insights for educators and visualization designers seeking to improve instructional materials and design guidelines.

In summary, by systematically comparing text‑only, vision‑only, and combined multimodal LLM configurations, the paper provides empirical evidence that multimodal reasoning yields the most reliable difficulty predictions for DVL items, while also outlining concrete pathways for extending this line of research toward more robust, scalable, and interpretable psychometric tools.

