Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training
Automated Essay Scoring (AES) plays a crucial role in education by providing scalable and efficient assessment tools. However, in real-world settings, the extreme scarcity of labeled data severely limits the development and practical adoption of robust AES systems. This study proposes a novel approach to enhance AES performance in both limited-data and full-data settings by introducing three key techniques. First, we introduce a Two-Stage fine-tuning strategy that leverages low-rank adaptation to better adapt an AES model to target-prompt essays. Second, we introduce a Score Alignment technique to improve consistency between predicted and true score distributions. Third, we employ uncertainty-aware self-training on unlabeled data, effectively expanding the training set with pseudo-labeled samples while mitigating label-noise propagation. We implement all three techniques on top of DualBERT and conduct extensive experiments on the ASAP++ dataset. In the 32-data setting, each of the three techniques improves performance on its own, and their integration achieves 91.2% of the performance of a full-data model trained on approximately 1,000 labeled samples. In addition, the proposed Score Alignment technique consistently improves performance in both limited-data and full-data settings; for example, it achieves state-of-the-art results in the full-data setting when integrated into DualBERT.
💡 Research Summary
The paper tackles the pressing problem of extreme label scarcity in Automated Essay Scoring (AES), a scenario that hampers the deployment of robust scoring systems in real educational settings. To address this, the authors propose a three‑pronged approach whose components can be applied independently or in combination: (1) a Two‑Stage fine‑tuning strategy that incorporates Low‑Rank Adaptation (LoRA), (2) a Score Alignment (SA) post‑processing technique, and (3) Uncertainty‑aware Self‑Training (UST). All three methods are built on top of DualBERT, a dual‑encoder architecture that captures both sentence‑level and document‑level semantics and has been shown to perform well on multi‑trait AES tasks.
Two‑Stage Fine‑Tuning with LoRA
In the first stage, the entire DualBERT model is fine‑tuned on the available labeled essays. After this, LoRA layers—lightweight low‑rank matrices—are inserted into the frozen base model. The second stage fine‑tunes only these LoRA parameters while keeping the original DualBERT weights fixed. This design lets the model retain its general language knowledge while efficiently adapting to prompt‑specific nuances using a small number of additional parameters. Moreover, the authors vary the loss weights for each trait (overall, content, organization, etc.) during LoRA training, enabling a tailored multi‑task optimization.
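The mechanics of stage two can be sketched in PyTorch. The `LoRALinear` wrapper below, its initialization, and the rank/scaling hyperparameters are illustrative assumptions, not the authors' implementation: it freezes a base linear layer and adds a trainable low-rank update, so only the small `A` and `B` matrices receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # stage 2: original weights stay fixed
        # low-rank factors; B starts at zero so the wrapped layer initially
        # reproduces the stage-1 model exactly
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

In practice one would wrap the attention and feed-forward projections of the stage-1 DualBERT with such modules and pass only the LoRA parameters to the optimizer.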
Score Alignment
Even after fine‑tuning, regression‑type AES models often exhibit systematic bias: predictions near the extremes (0 or 1 after normalization) are slightly shrunk toward the interior. To correct this, the authors compute the mean difference between true and predicted scores for the top‑p % and bottom‑p % of the development set. They then apply a linear transformation to the test‑set predictions, shifting the minimum and maximum to align with the observed bias‑corrected bounds. This simple post‑processing step dramatically reduces distributional distortion, which is especially beneficial for the Quadratic Weighted Kappa (QWK) metric that heavily penalizes ordinal misplacements.
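A minimal NumPy sketch of such a tail-anchored linear correction is shown below. The function names and the exact way the anchors are formed (mean prediction of each tail, shifted by the mean true-minus-predicted offset) are our illustrative reading of the description, not the authors' code:

```python
import numpy as np

def fit_alignment(y_true, y_pred, p=0.1):
    """Fit a linear map on the dev set, anchored at the bottom-p and top-p tails."""
    order = np.argsort(y_pred)
    k = max(1, int(round(len(y_pred) * p)))
    lo, hi = order[:k], order[-k:]
    src_lo, src_hi = y_pred[lo].mean(), y_pred[hi].mean()
    # shift each anchor by the mean (true - predicted) bias observed in its tail
    dst_lo = src_lo + (y_true[lo] - y_pred[lo]).mean()
    dst_hi = src_hi + (y_true[hi] - y_pred[hi]).mean()
    return src_lo, src_hi, dst_lo, dst_hi

def apply_alignment(y_pred, src_lo, src_hi, dst_lo, dst_hi):
    """Rescale test predictions so the tail anchors land on the corrected bounds."""
    scale = (dst_hi - dst_lo) / (src_hi - src_lo)
    return np.clip(dst_lo + (y_pred - src_lo) * scale, 0.0, 1.0)
```

Because the map is fit once on held-out data and applied as pure post-processing, it corrects the shrinkage without any retraining of the model.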
Uncertainty‑aware Self‑Training
To exploit the abundance of unlabeled essays, the paper adapts the UST framework originally designed for classification. For each unlabeled essay, the model performs T stochastic forward passes with dropout enabled, and the standard deviation of the resulting scores serves as an uncertainty estimate. Essays are binned by predicted score magnitude; within each bin, the nₛ samples with the lowest uncertainty are selected as pseudo‑labeled data. This filtering mitigates the risk of propagating noisy labels. The pseudo‑labeled set is then merged with the original labeled data, and a freshly initialized DualBERT is trained on this augmented corpus. Finally, Score Alignment is applied again to the model’s output.
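The selection step can be sketched as follows; the binning scheme (equal-width bins over the normalized score range) and the function signature are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def select_pseudo_labels(mc_scores, n_bins=4, n_per_bin=2):
    """Pick low-uncertainty pseudo-labels from T stochastic forward passes.

    mc_scores: array of shape (T, N) -- T dropout-on predictions for N unlabeled essays.
    Returns (indices of selected essays, pseudo-labels for all N essays).
    """
    pseudo = mc_scores.mean(axis=0)        # pseudo-label: mean over the T passes
    uncertainty = mc_scores.std(axis=0)    # uncertainty: std over the T passes
    # bin essays by predicted score magnitude (scores assumed normalized to [0, 1])
    bins = np.minimum((pseudo * n_bins).astype(int), n_bins - 1)
    selected = []
    for b in range(n_bins):
        in_bin = np.where(bins == b)[0]
        keep = in_bin[np.argsort(uncertainty[in_bin])][:n_per_bin]  # lowest-std first
        selected.extend(keep.tolist())
    return np.array(sorted(selected)), pseudo
```

Binning before selection keeps the pseudo-labeled set balanced across the score range, so confident-but-mediocre essays do not crowd out the extremes.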
Experimental Setup and Results
The authors evaluate on the ASAP++ benchmark (12,978 essays across eight prompts, with multi‑trait scoring). They consider two regimes: a “full‑data” setting (~1,000 labeled essays) and a “32‑data” few‑shot setting (≈32 labeled essays per prompt). In the few‑shot scenario, each technique alone yields a 2–4 % absolute QWK gain over the baseline DualBERT. When combined (LoRA + SA + UST), the system reaches 91.2 % of the full‑data performance, demonstrating that the model remains highly competitive with only a small fraction of the labels. In the full‑data regime, Score Alignment alone pushes DualBERT to a new state‑of‑the‑art QWK of ≈0.79, surpassing prior deep‑learning and LLM‑based approaches.
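For reference, Quadratic Weighted Kappa, the headline metric throughout, measures agreement beyond chance with penalties that grow quadratically in ordinal distance. A minimal NumPy sketch (not the authors' evaluation code):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """QWK between two integer score sequences over {0, ..., n_classes - 1}."""
    observed = np.zeros((n_classes, n_classes))
    for i, j in zip(rater_a, rater_b):
        observed[i, j] += 1
    # quadratic disagreement weights: 0 on the diagonal, 1 at the opposite corners
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    # expected confusion matrix under independent marginals
    expected = np.outer(observed.sum(1), observed.sum(0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

The quadratic weights explain why Score Alignment helps so much: a prediction shrunk one score band toward the middle is penalized far less than one shrunk two bands, so correcting the tails directly targets the metric's largest penalties.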
Insights and Limitations
The study highlights three key insights: (i) low‑rank adaptation can benefit not only massive LLMs but also mid‑size transformer encoders, (ii) a lightweight linear alignment of score distributions can correct systematic bias without retraining, and (iii) uncertainty‑driven pseudo‑label selection effectively expands the training set while controlling noise. Limitations include reliance on a single dataset, potential sensitivity to the hyperparameters governing p % in SA and dropout‑repeat T in UST, and the fact that the uncertainty estimate is based on Monte‑Carlo dropout rather than a full Bayesian treatment.
Future Directions
The authors suggest extending the two‑stage LoRA fine‑tuning to larger LLMs, exploring non‑linear or learned alignment functions, and integrating more sophisticated Bayesian uncertainty estimators. Additionally, cross‑prompt transfer, multilingual AES, and active learning loops that query human raters for the most uncertain essays are promising avenues.
Overall, the paper delivers a practical, modular toolkit that substantially narrows the performance gap caused by label scarcity in AES, offering both theoretical contributions and immediate applicability for educational technology deployments.