Scalable Generation and Validation of Isomorphic Physics Problems with GenAI
Traditional synchronous STEM assessments face growing challenges including accessibility barriers, security concerns from resource-sharing platforms, and limited comparability across institutions. We present a framework for generating and evaluating large-scale isomorphic physics problem banks using Generative AI to enable asynchronous, multi-attempt assessments. Isomorphic problems test identical concepts through varied surface features and contexts, providing richer variation than conventional parameterized questions while maintaining consistent difficulty. Our generation framework employs prompt chaining and tool use to achieve precise control over structural variations (numeric values, spatial relations) alongside diverse contextual variations. For pre-deployment validation, we evaluate generated items using 17 open-source language models (LMs, 0.6B–32B) and compare against actual student performance (N > 200) across three midterm exams. Results show that 73% of deployed banks achieve statistically homogeneous difficulty, and LM performance patterns correlate strongly with student performance (Pearson's $\rho$ up to 0.594). Additionally, LMs successfully identify problematic variants, such as ambiguous problem texts. Model scale also proves critical for effective validation: extremely small (<4B) and large (>14B) models exhibit floor and ceiling effects, respectively, making mid-sized models optimal for detecting difficulty outliers.
💡 Research Summary
Background and Motivation
Traditional synchronous STEM assessments face growing challenges: logistical constraints, security breaches through online resource‑sharing platforms, and limited comparability across institutions. In physics, the need for “isomorphic” items—questions that probe the same underlying concept while varying surface features—has been recognized as a way to discourage rote memorization and promote transfer of learning. However, manual creation of large banks of high‑quality isomorphic variants is prohibitively labor‑intensive, and validating that each variant has comparable difficulty is difficult because each item is typically administered to only a small subset of students.
Objectives and Contributions
The paper proposes a scalable pipeline that (1) automatically generates large banks of isomorphic introductory‑physics problems using a combination of prompt chaining and external tool use, (2) releases the ESTELA‑Physics dataset comprising 666 items across 12 physics topics, and (3) validates the difficulty homogeneity of these banks with 17 open‑source large language models (LLMs) ranging from 0.6 B to 32 B parameters, comparing model‑predicted performance with actual student outcomes (N > 200). The authors demonstrate that (a) 73 % of the deployed banks achieve statistical homogeneity, (b) mid‑sized models (≈4 B–14 B) provide the strongest correlation with student performance (Pearson ρ up to 0.594), and (c) very small and very large models suffer from floor and ceiling effects, respectively, making them less useful for detecting difficulty outliers.
Generation Framework
The core of the generation system is prompt chaining, which decomposes the complex task of creating an isomorphic variant into a sequence of seven logical steps: (i) select a template problem, (ii) identify its constituent components, (iii) define structural (numerical, spatial) and contextual (scenario, wording) variations together with their constraints, (iv) design a chain of prompts that generate each component, (v) iteratively execute and refine each prompt, (vi) combine the generated components into a complete problem statement, and (vii) verify correctness.
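The seven-step chain above can be sketched as a simple pipeline. This is a minimal illustration, not the authors' actual implementation: the `generate` helper is a hypothetical stand-in for an LLM API call (here it just returns canned text), and the prompt wording is invented for illustration.

```python
# Minimal sketch of the prompt-chaining idea. `generate` is a
# hypothetical placeholder for an LLM API call; it returns canned
# text here so the sketch is self-contained and runnable.

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; echoes a truncated prompt."""
    return f"[LLM output for: {prompt[:40]}...]"

def build_isomorphic_variant(template: str) -> str:
    # (ii) identify the components of the template problem
    components = generate(f"List the components of this problem:\n{template}")
    # (iii) define structural and contextual variations plus constraints
    variations = generate(f"Propose variations and constraints for:\n{components}")
    # (iv)-(v) generate each component with its own prompt
    scenario = generate(f"Write a new scenario satisfying:\n{variations}")
    numbers = generate(f"Pick numeric values satisfying:\n{variations}")
    # (vi) combine the generated components into a full statement
    problem = generate(f"Assemble a problem from:\n{scenario}\n{numbers}")
    # (vii) verify correctness; a real pipeline would reject on failure
    _verification = generate(f"Verify this problem is solvable:\n{problem}")
    return problem

print(build_isomorphic_variant("A 10 kg crate is dragged at constant speed..."))
```

In a real pipeline each `generate` call would go to an LLM, and steps (v) and (vii) would loop until the verification passes.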
Tool Use – The authors exploit the Python code interpreter built into modern LLMs. For each structural variation, a small Python script samples numeric parameters (e.g., friction coefficient, force angle, mass) within predefined ranges, computes the required force using the physics formula
$$F = \frac{\mu m g}{\cos\theta + \mu \sin\theta},$$
and checks that the result lies in a realistic interval. The script also produces diagrams with matplotlib when the problem requires a visual aid. By delegating arithmetic and diagram generation to an external tool, the LLM can focus on natural‑language composition while guaranteeing physical consistency.
Structural variations (force magnitude, angle, coefficient of kinetic friction, mass) are tightly constrained because they affect the solution. Contextual variations (e.g., “a traveler dragging a backpack”, “a dog pulling a sled”) are more free‑form, allowing cultural or reading‑level tailoring. The two sets of variations can interact; for instance, the mass value must be plausible for the chosen object.
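The tool-use step can be sketched as rejection sampling over the structural parameters. The sampling ranges and the 50–500 N plausibility window below are illustrative assumptions, not values from the paper:

```python
# Sketch of the tool-use step: sample structural parameters within
# predefined ranges, compute the required pulling force from
# F = mu*m*g / (cos(theta) + mu*sin(theta)), and reject samples whose
# force falls outside a plausible interval. Ranges are illustrative.
import math
import random

def sample_variant(rng: random.Random,
                   f_min: float = 50.0, f_max: float = 500.0) -> dict:
    g = 9.8  # m/s^2
    while True:
        mu = rng.uniform(0.1, 0.6)                 # kinetic friction coefficient
        theta = math.radians(rng.uniform(10, 45))  # pull angle above horizontal
        mass = rng.uniform(5, 80)                  # kg; must suit the chosen object
        force = mu * mass * g / (math.cos(theta) + mu * math.sin(theta))
        if f_min <= force <= f_max:                # keep only realistic forces
            return {"mu": round(mu, 2),
                    "theta_deg": round(math.degrees(theta)),
                    "mass_kg": round(mass, 1),
                    "force_N": round(force, 1)}

print(sample_variant(random.Random(0)))
```

The same script could call matplotlib to render the force diagram when a visual aid is required, as the paper describes.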
Example Bank – Angled Force with Friction
The paper details a concrete bank (Bank 3‑3) that contains 20 items. The prompt chain first generates ten contextual scenarios (horse pulling a sledge, person pushing a couch, etc.), then uses Python to sample angle, friction coefficient, and mass values that satisfy the force‑balance constraint. Next, the unknown variable (mass, force, or friction coefficient) is selected for each item, and a full problem statement is assembled. Subsequent prompts produce step‑by‑step solutions and export the final JSON‑compatible data. The resulting items illustrate clear separation of contextual (purple) and structural (blue) variations, with the unknown variable underlined.
ESTELA‑Physics Dataset
Applying the framework across twelve introductory‑physics topics (1‑D motion, projectile motion, Newton’s laws, energy, momentum, rotational dynamics, simple harmonic motion, etc.) yields 666 isomorphic items. Each bank contains 10–48 variants and is classified by question type: numerical response (NUM), multiple‑choice (MCQ), multiple‑answer (MA), and categorization (CAT). Six banks include image‑based questions. Every item undergoes a final faculty review to ensure pedagogical soundness and technical correctness.
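A single item in such a bank might be represented roughly as follows. The field names and values here are hypothetical illustrations of the structure described above, not the dataset's actual schema:

```python
# Hypothetical shape of one ESTELA-Physics item; field names and
# values are illustrative, not the dataset's actual schema.
item = {
    "bank_id": "3-3",
    "topic": "Newton's laws",
    "question_type": "NUM",   # one of NUM, MCQ, MA, CAT
    "statement": "A traveler drags a 20.0 kg backpack ...",
    "unknown": "force",
    "answer": 86.3,           # made-up value for illustration
    "has_image": False,
}
assert item["question_type"] in {"NUM", "MCQ", "MA", "CAT"}
print(item["bank_id"], item["question_type"])
```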
LLM‑Based Pre‑Deployment Validation
To assess difficulty homogeneity before student exposure, the authors run all 17 LLMs on every item using a zero‑shot prompt that requests a JSON object with the answer. Model accuracy is averaged across models for each bank. Two validation dimensions are examined:
- Homogeneity – Fisher's exact test (α = 0.05) determines whether accuracy rates differ significantly among variants within a bank; 73% of banks are labeled homogeneous.
- Correlation with Student Performance – Pearson correlation (ρ) between model-level accuracy and actual student accuracy is computed per bank. Mid-sized models (4B–14B) achieve the highest ρ (up to 0.594), indicating that LLMs can serve as reliable proxies for difficulty estimation in moderate-difficulty banks.
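Both checks are standard statistics and can be sketched with SciPy. The counts below are made-up illustration data, not results from the paper; `chi2_contingency` is used here as a widely available stand-in for the homogeneity check, whereas the paper applies Fisher's exact test:

```python
# Per-bank validation sketch: test accuracy homogeneity across
# variants, then correlate model accuracy with student accuracy.
# Counts are fabricated for illustration only.
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# rows = variants in one bank, columns = (correct, incorrect) counts
counts = np.array([[18, 2], [16, 4], [17, 3], [9, 11]])
chi2, p, _, _ = chi2_contingency(counts)
homogeneous = p >= 0.05  # alpha = 0.05, matching the paper's threshold
print(f"p = {p:.4f}, homogeneous = {homogeneous}")

# correlation between model accuracy and (illustrative) student accuracy
model_acc = counts[:, 0] / counts.sum(axis=1)
student_acc = np.array([0.85, 0.80, 0.82, 0.55])
rho, p_corr = pearsonr(model_acc, student_acc)
print(f"Pearson rho = {rho:.3f}")
```

In this toy data the fourth variant is a clear difficulty outlier, so the bank would be flagged as non-homogeneous and sent back for revision.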
Model‑Scale Effects – Very small models (< 4 B) exhibit a “floor effect”: low overall accuracy and high variance, making it difficult to distinguish easy from hard items. Very large models (> 14 B) show a “ceiling effect”: they answer almost all items correctly, flattening the difficulty signal. Consequently, the sweet spot for validation lies in the intermediate parameter range.
Error Detection – The LLM pipeline automatically flags problematic variants, such as ambiguous wording, missing essential variables (e.g., omitted friction coefficient), or inconsistent units. These flags enable instructors to revise items before deployment, dramatically reducing the cost of piloting large banks with real students.
Discussion and Future Work
The authors argue that large, openly available isomorphic banks can support asynchronous, multi‑attempt assessments that are both secure and comparable across institutions. By providing a systematic generation‑validation workflow, the work reduces the resource barrier for instructors wishing to adopt such assessments. Limitations include the focus on numerical and multiple‑choice formats; more complex open‑ended or simulation‑based items remain challenging for current LLMs. Future directions involve extending the framework to other STEM domains, incorporating multimodal generation (e.g., interactive simulations), and integrating student models that adapt prompts based on learner profiles.
Conclusion
The paper demonstrates that prompt chaining combined with tool‑use enables precise, scalable creation of physics isomorphic problem banks. Validation with a suite of open‑source LLMs shows that mid‑sized models can reliably detect difficulty outliers and predict student performance, while very small or very large models are less informative. The ESTELA‑Physics dataset and the accompanying validation methodology provide a practical foundation for educators and researchers to develop large, fair, and secure asynchronous assessments in physics and potentially other STEM fields.