Calibrating Generative AI to Produce Realistic Essays for Data Augmentation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Data augmentation can mitigate limited training data for machine-learning automated scoring engines (ASEs) used with constructed-response items. This study examines how well three approaches to large language model prompting produce essays that preserve the writing quality of the original essays and yield realistic text for augmenting ASE training datasets. We created simulated versions of student essays; human raters then assigned scores to the simulated essays and rated the realism of the generated text. The results indicate that the "predict next" prompting strategy produces the highest agreement between human raters on simulated-essay scores; the "predict next" and "sentence" strategies best preserve the rated quality of the original essay in the simulated essays; and the "predict next" and "25 examples" strategies produce the most realistic text as judged by human raters.


💡 Research Summary

This paper investigates how different prompt‑engineering strategies for large language models (LLMs) affect the quality and realism of synthetic essays used to augment training data for automated scoring engines (ASEs). The authors selected 96 real ninth‑grade essays from a statewide assessment (24 essays at each of the four rubric score points) and generated three synthetic versions of each essay, one with each of three prompting methods: “Predict Next,” “Sentence,” and “25 Examples,” yielding a total of 288 synthetic essays. Human evaluation was conducted in two stages. First, professional raters scored the original essays using a four‑point rubric. Second, a separate group of expert raters re‑scored all 384 essays (96 real + 288 synthetic) and simultaneously labeled each essay as “real” or “simulated” to assess perceived realism.

The three prompting approaches differ substantially. “25 Examples” presents the model with 24 real essays and 25 previously generated synthetic essays, framing the real essays as positive examples and the synthetic ones as negative examples, and asks the model to produce a new essay that mimics the real style. “Predict Next” supplies two real essays, extracts a detailed note from the first, and uses that note together with explicit constraints (grammar, spelling, punctuation, word count, sentence complexity) to generate an essay that closely follows the second essay’s content and style. “Sentence” iteratively feeds the model one sentence at a time, each heavily altered in structure and vocabulary while preserving the original’s error patterns, and then concatenates the modified sentences into a full essay.
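The “Predict Next” strategy, as described above, combines a style note extracted from one real essay with explicit surface-level constraints. The sketch below illustrates how such a prompt might be assembled; the paper does not give its actual prompt wording, so the function name and every string here are illustrative assumptions, not the authors' prompt.

```python
def build_predict_next_prompt(style_note: str, target_essay: str) -> str:
    """Hypothetical assembly of a 'Predict Next'-style prompt.

    The constraint list mirrors the categories named in the summary
    (grammar, spelling, punctuation, word count, sentence complexity);
    the exact phrasing is an assumption.
    """
    constraints = (
        "Reproduce the original's grammar errors, spelling errors, "
        "punctuation habits, approximate word count, and sentence complexity."
    )
    return (
        f"Style notes extracted from a reference essay:\n{style_note}\n\n"
        f"Target essay to follow in content and style:\n{target_essay}\n\n"
        f"Constraints: {constraints}\n"
        "Write a new student essay that closely follows the target essay."
    )
```

A caller would pass the note distilled from the first real essay and the full text of the second, then send the returned string to the LLM.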

Evaluation metrics focused on (1) scoring agreement—measured by the proportion of exact score matches and quadratically weighted kappa (QWK)—and (2) realism discrimination—measured by the accuracy of the “real vs. simulated” label relative to the true origin. Across all essays, the overall exact‑match proportion was 0.64. The “Predict Next” method achieved the highest agreement (0.72) and a QWK of 0.74, virtually indistinguishable from the real‑essay QWK of 0.75. The “Sentence” method produced a QWK of 0.68, while “25 Examples” lagged at 0.58. In terms of average scores, the “Sentence” synthetic essays most closely mirrored the scores of their source essays across the rubric range; “Predict Next” matched well at scores 1, 2, and 4 but diverged at score 3; “25 Examples” deviated the most.
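Quadratically weighted kappa, the agreement statistic cited above, penalizes rater disagreements by the squared distance between the two scores. A minimal NumPy sketch (the four-point score range matches the paper's rubric; variable names are my own):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes=4):
    """QWK between two vectors of integer scores in 1..n_classes."""
    a = np.asarray(rater_a) - 1  # shift to 0-based class indices
    b = np.asarray(rater_b) - 1
    # Observed joint distribution of (rater_a, rater_b) score pairs
    observed = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    # Expected joint distribution under independence (outer product of marginals)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()
```

Perfect agreement yields 1.0, chance-level agreement yields 0, and systematic disagreement goes negative, which is why a synthetic-essay QWK of 0.74 against the real-essay benchmark of 0.75 counts as near parity.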

Realism results revealed a stark contrast. Overall accuracy at distinguishing real from simulated essays was 0.48, and expert raters correctly identified real essays 74% of the time. The “Sentence” method yielded the highest realism discrimination accuracy (0.74), comparable to the real‑essay baseline, indicating that its outputs were relatively easy to recognize as synthetic. Conversely, “Predict Next” and “25 Examples” produced essays that were much harder to label correctly, with accuracies of only 0.25 and 0.18 respectively, suggesting these texts were highly realistic and often indistinguishable from genuine student writing.

The authors interpret these findings as evidence that prompt design critically shapes both the fidelity of scoring characteristics and the perceptual realism of generated data. “Predict Next” excels at preserving rubric‑based quality but may over‑fit to the style of the training essays, potentially introducing bias if used indiscriminately for augmentation. “Sentence” strikes a better balance, maintaining realistic error patterns while still reflecting the original scoring distribution. “25 Examples” performs poorly on both dimensions, making it unsuitable for ASE data augmentation.

Limitations include the narrow demographic (mid‑western U.S. ninth‑graders), reliance on a single LLM architecture, and a binary realism judgment that does not capture nuanced aspects such as logical coherence or content depth. The study underscores the need for careful selection of prompting strategies based on the specific augmentation goal—whether the priority is to boost scoring agreement or to enrich the training set with diverse yet realistic samples. Future work should explore multiple LLMs, broader subject areas, and automated realism metrics, as well as longitudinal testing of how augmented datasets affect ASE performance in real‑world deployment.

