Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus
As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil’s high-stakes ENEM exam as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a clear trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade markedly on multimodal items. Crucially, we find that models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an “evaluation-before-generation” pipeline for responsible assessment design.
💡 Research Summary
This paper investigates whether large language models (LLMs) can reliably estimate the difficulty of exam items they generate, using Brazil’s high‑stakes ENEM exam as a testbed. The authors assembled a corpus of 1,031 released ENEM multiple‑choice questions from 2017‑2022, covering four domains (Languages, Human Sciences, Natural Sciences, Mathematics). About 38 % of the items contain visual components (diagrams, tables, graphs). Visual content was extracted via OCR and captioning using vision‑capable LLMs (GPT‑4o, Gemini‑Flash), then translated to English for uniform input. Ground‑truth difficulty is taken from the official 3‑parameter Item Response Theory (IRT) model: the difficulty parameter bᵢ, linearly transformed to a 1‑10 scale.
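The ground-truth setup above can be made concrete with a small sketch. The 3PL model and the linear rescaling of bᵢ are standard, but the summary does not state the exact endpoints of the transform, so the `b_min`/`b_max` clipping range below is an assumption for illustration:

```python
import math

def irt_3pl(theta, a, b, c):
    """3-parameter logistic (3PL) IRT model: P(correct answer | ability theta),
    with discrimination a, difficulty b, and guessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def rescale_difficulty(b, b_min=-3.0, b_max=3.0):
    """Linearly map the IRT difficulty parameter b onto a 1-10 scale.
    The endpoints b_min/b_max are assumed here; the paper does not
    specify the exact linear transform it applies."""
    b = min(max(b, b_min), b_max)          # clip to the assumed range
    return 1.0 + 9.0 * (b - b_min) / (b_max - b_min)
```

Under these assumptions, an average-difficulty item (b = 0) maps to the midpoint 5.5 of the 1-10 scale, and the hardest clipped items map to 10.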
Ten contemporary LLMs—both proprietary (GPT‑4o, Gemini‑Flash) and open‑weight (Llama‑2, Mistral, Falcon, etc.)—were evaluated under eight prompt templates (direct query, chain‑of‑thought, plan‑and‑solve, persona‑based, etc.) and a single “prompt‑evolution” variant that iteratively refines the answer. Model outputs were parsed into a numeric difficulty score, and multiple samples were aggregated to improve stability. The evaluation focused on three axes: absolute calibration (MAE, RMSE), rank fidelity (Spearman correlation), and contextual sensitivity (performance on visual vs. non‑visual items, and on prompts that include simple demographic cues about the student population).
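The three evaluation axes reduce to standard statistics. A minimal stdlib-only sketch of the metrics and the sample aggregation step (the paper aggregates repeated samples for stability; the use of the median here is an illustrative choice, not confirmed by the source):

```python
import math
import statistics

def _avg_ranks(xs):
    """Average ranks (ties share the mean rank), as used by Spearman."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1            # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the rank vectors."""
    ra, rb = _avg_ranks(a), _avg_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def aggregate(samples):
    """Collapse repeated model samples for one item into a single score."""
    return statistics.median(samples)
```

MAE/RMSE measure absolute calibration on the 1-10 scale, while Spearman correlation measures rank fidelity independently of any global bias, which is why the two axes can diverge in the results below.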
Key findings:

1. The best models (GPT‑4o, Gemini‑Flash) achieve moderate rank correlation (≈ 0.60) but systematically underestimate difficulty, with MAE around 0.68‑0.73 on the 1‑10 scale. Open‑source models perform worse (MAE > 0.85, Spearman < 0.45).
2. Items that originally contained visuals suffer a clear penalty: transcription errors and loss of spatial cues raise MAE by roughly 0.12‑0.18 points, especially in Mathematics and Natural Sciences.
3. Prompt engineering matters for absolute error—plan‑and‑solve and chain‑of‑thought reduce MAE by 5‑7 % compared with a naïve direct prompt—but does not substantially improve rank ordering.
4. A lightweight post‑hoc calibration (global mean shift and scaling) cuts overall MAE by about 12 % without affecting rank correlation, indicating that simple bias correction can improve absolute estimates but not ordering.
5. When prompts are conditioned on demographic cues (e.g., “students from Brazil” vs. “students from Portugal”), model predictions shift by 0.2‑0.6 points, yet these shifts are inconsistent across models and do not align with real IRT‑derived cross‑country differences, revealing limited and noisy plasticity. Some models exhibit systematic bias toward certain cues, raising fairness concerns.
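The post-hoc calibration in finding (4) amounts to fitting an affine map from model scores to IRT-derived scores. A least-squares sketch (the fitting procedure here is an assumption; the paper only describes it as a global mean shift and scaling):

```python
def fit_affine_calibration(pred, true):
    """Least-squares fit of true ~ alpha * pred + beta: a global
    rescaling (alpha) plus mean shift (beta) of the raw predictions."""
    n = len(pred)
    mp = sum(pred) / n
    mt = sum(true) / n
    var = sum((p - mp) ** 2 for p in pred)
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    alpha = cov / var
    beta = mt - alpha * mp
    return alpha, beta

def apply_calibration(pred, alpha, beta):
    return [alpha * p + beta for p in pred]
```

Because the fitted map is monotone whenever alpha > 0, it cannot change the ordering of items, which is consistent with the reported result that calibration reduces MAE while leaving Spearman correlation untouched.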
The authors conclude that LLMs are not yet ready to serve as autonomous difficulty estimators for high‑stakes assessments. Instead, they should be employed as calibrated screeners within an “evaluation‑before‑generation” pipeline: LLM‑generated items are first passed through an LLM‑based difficulty estimator, optionally calibrated, and then subjected to traditional psychometric pilot testing. For multimodal items, dedicated multimodal models or higher‑quality visual‑to‑text pipelines are required. Finally, the paper warns that using demographic prompts for personalization must be approached cautiously, with thorough bias audits, because current models display unstable and potentially inequitable behavior.