Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

This paper presents a novel prompt-engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language-proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait-specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero- and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait-level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and confidence intervals show that Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero- and few-shot prompting (QWK = 0.28, CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style show the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency-oriented Arabic AES and lays the foundation for scalable assessment in low-resource educational contexts.


💡 Research Summary

This paper tackles the long‑standing gap in Automatic Essay Scoring (AES) for Arabic by introducing a trait‑centric, prompt‑engineering framework that works with large language models (LLMs) without any fine‑tuning. The authors note that while English AES has progressed to multi‑trait scoring and sophisticated prompt designs, Arabic systems remain limited to holistic or content‑correctness scoring due to scarce annotated resources and the linguistic complexity of Arabic. To bridge this gap, they propose a three‑level prompting strategy:

  1. Standard prompting – a zero‑shot, single‑pass prompt that asks the model to output scores for all seven linguistic traits (Organization, Vocabulary, Style, Development, Mechanics, Structure, Relevance) in one go. This baseline is simple but provides little guidance on trait‑specific criteria.

  2. Hybrid trait prompting – a multi‑agent simulation where five “virtual raters” each specialize in a subset of traits (e.g., an Organization specialist, a Vocabulary specialist, etc.). Each rater scores only its assigned dimensions, and the final trait scores are obtained by averaging across the relevant raters according to a predefined mapping. This design mimics human assessment panels and forces the model to focus on one linguistic aspect at a time.

  3. Rubric‑guided few‑shot prompting – for each trait the model receives a detailed Arabic rubric plus three scored exemplars (low, medium, high). The model compares the target essay to these exemplars, produces a trait score, and returns a justification in a structured JSON format. This approach aligns the model’s decision‑making with human scoring standards and improves consistency.
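The aggregation step in the hybrid strategy (level 2 above) can be sketched in a few lines of Python. The five-rater trait assignment below is a hypothetical illustration, since the paper's exact mapping is not reproduced here; only the mechanism (each virtual rater scores a subset of traits, and each trait's final score is the average over its assigned raters) follows the description above:

```python
# Hypothetical mapping from virtual raters to the traits they score.
# The paper uses five trait-specialist raters; this particular
# assignment is illustrative, not the authors' actual configuration.
RATER_TRAITS = {
    "rater_1": ["Organization", "Development"],
    "rater_2": ["Vocabulary", "Style"],
    "rater_3": ["Mechanics", "Structure"],
    "rater_4": ["Relevance", "Development"],
    "rater_5": ["Style", "Organization"],
}

def aggregate_trait_scores(rater_scores):
    """Average each trait over every virtual rater assigned to it.

    rater_scores maps a rater name to the {trait: score} dict that
    rater produced for its assigned traits.
    """
    totals, counts = {}, {}
    for rater, scores in rater_scores.items():
        for trait in RATER_TRAITS[rater]:
            totals[trait] = totals.get(trait, 0) + scores[trait]
            counts[trait] = counts.get(trait, 0) + 1
    return {trait: totals[trait] / counts[trait] for trait in totals}
```

For example, if one rater scores Organization as 4 and another as 2, the final Organization score is their mean, 3.0, mimicking how a human panel's judgments would be pooled.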

The experimental platform uses the QAES dataset, the first publicly available Arabic AES corpus with fine‑grained trait annotations. QAES contains 195 undergraduate argumentative essays, each annotated on a 0‑5 scale (Relevance 0‑2) for the seven traits. Human inter‑rater agreement (Cohen’s κ) averages 0.72, establishing a reliable gold standard.

Eight LLMs of varying size, architecture, and Arabic support are evaluated: ChatGPT‑4, Fanar‑1‑9B‑Instruct, Jais‑13B‑Chat, ALLAM‑7B‑Instruct, Qwen1.5‑1.8B‑Chat, Qwen2.5‑7B‑Instruct, Qwen3‑VL‑8B‑Instruct, and LLaMA‑2‑7B‑Chat‑hf. All models are accessed via HuggingFace checkpoints and are tested under the three prompting configurations without any parameter updates.

Performance is measured with Quadratic Weighted Kappa (QWK), the standard metric for ordinal scoring in AES, and 95% confidence intervals (CIs) are derived via non-parametric bootstrapping (1,000 iterations) to capture variability due to the small sample size.
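That evaluation setup can be sketched directly. The snippet below is a minimal implementation assuming integer scores on a fixed ordinal scale (0–5 for most traits); the paper's exact bootstrap configuration beyond the 1,000 iterations is not specified here, so the seed and percentile method are illustrative choices:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_rating=0, max_rating=5):
    """QWK between two lists of integer ratings on a fixed ordinal scale."""
    n = max_rating - min_rating + 1
    observed = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating, p - min_rating] += 1
    # Quadratic disagreement weights: (i - j)^2 / (n - 1)^2.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    # Expected matrix under chance agreement, from the marginal distributions.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    denom = (weights * expected).sum()
    return 1.0 if denom == 0 else 1 - (weights * observed).sum() / denom

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Non-parametric percentile bootstrap CI for QWK (resample essay pairs)."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = []
    for _ in range(n_boot):
        i = rng.integers(0, len(y_true), len(y_true))
        stats.append(quadratic_weighted_kappa(y_true[i], y_pred[i]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return low, high
```

Perfect agreement yields QWK = 1.0, chance-level agreement yields roughly 0, and the bootstrap interval widens as the 195-essay sample is subsampled into less consistent replicates, which is why the CIs matter at this dataset size.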

Key findings:

  • Fanar‑1‑9B‑Instruct achieves the highest overall agreement (QWK = 0.28, CI = 0.41) in both zero‑shot and few‑shot settings, outperforming larger proprietary models such as ChatGPT‑4.
  • The rubric‑guided few‑shot prompts consistently boost QWK across all traits, with the most pronounced gains for the discourse‑level traits Development and Style (QWK improvements of up to 0.07).
  • Hybrid prompting yields modest but reliable improvements over the standard baseline, especially for Organization and Vocabulary, indicating that trait‑specialist simulation helps the model concentrate on relevant cues.
  • Smaller models (Qwen1.5, Qwen2.5, LLaMA‑2) and even ChatGPT‑4 show low to fair agreement (QWK ≈ 0.10‑0.18) across most traits, suggesting that intrinsic Arabic language capability, rather than sheer model size, limits performance.
  • Confidence intervals shrink noticeably when rubric‑guided few‑shot prompts are used, confirming that providing explicit scoring examples reduces prediction variance.

The authors interpret these results as evidence that prompt structure, not model scale, is the primary driver of effective Arabic AES. By decomposing the scoring task into trait‑specific sub‑prompts and supplying concrete rubrics, LLMs can approximate human raters even in a low‑resource language.

Implications and future work:

  1. The three‑tier framework offers a scalable, fine‑tuning‑free solution for educational institutions lacking large annotated corpora.
  2. The hybrid approach demonstrates a practical way to emulate multi‑expert assessment panels, which could be extended to other languages or domains (e.g., scientific writing).
  3. The study highlights the need for larger, more diverse Arabic essay datasets to further validate and refine trait‑level scoring.
  4. Future research directions include automated prompt optimization (e.g., meta‑prompt learning), integration of human‑in‑the‑loop feedback to refine rubrics, and deployment of real‑time feedback tools for learners.

In summary, this work pioneers trait‑centric prompt engineering for Arabic AES, shows that structured prompting markedly improves alignment with human scores, and establishes a foundation for scalable, nuanced assessment in low‑resource educational contexts.

