Scalable Classification of Course Information Sheets Using Large Language Models: A Reusable Institutional Method for Academic Quality Assurance
Purpose: Higher education institutions face increasing pressure to audit course designs for generative AI (GenAI) integration. This paper presents an end-to-end method for using large language models (LLMs) to scan course information sheets at scale, identify where assessments may be vulnerable to student use of GenAI tools, validate system performance through iterative refinement, and operationalise results through direct stakeholder communication. Method: We developed a four-phase pipeline: (0) manual pilot sampling; (1) iterative prompt engineering with multi-model comparison; (2) a full production scan of 4,684 Bachelor and Master course information sheets (Academic Year 2024-2025) from the Vrije Universiteit Brussel (VUB), classifying each under a three-tier risk taxonomy (Clear risk, Potential risk, Low risk), with automated report generation and email distribution to teaching teams (91.4% address-matched); and (3) a longitudinal re-scan of 4,675 sheets after the next catalogue release. Results: Five iterations of prompt refinement achieved 87% agreement with expert labels. GPT-4o was selected for production based on superior handling of ambiguous cases involving internships and practical components. The Year 1 scan classified 60.3% of courses as Clear risk, 15.2% as Potential risk, and 24.5% as Low risk. Year 2 comparison revealed substantial shifts in risk distributions, with improvements most pronounced in practice-oriented programmes. Implications: The method enables institutions to rapidly transform heterogeneous catalogue data into structured and actionable intelligence. The approach is transferable to other audit domains (sustainability, accessibility, pedagogical alignment) and provides a template for responsible LLM deployment in higher education governance.
💡 Research Summary
Purpose and Context
The paper addresses the urgent need for universities to audit the vulnerability of their assessment designs to generative AI (GenAI) tools. Traditional AI‑detection methods are unreliable, and institutions lack systematic, scalable mechanisms to identify courses where take‑home assignments may be compromised. The authors propose an end‑to‑end pipeline that uses large language models (LLMs) to automatically scan, classify, and report on every course information sheet (CIS) in a university catalogue.
Methodology – Four‑Phase Pipeline
- Pilot Sampling (Phase 0) – A random subset of roughly 200 CIS documents was manually labeled by domain experts into three risk categories: Clear risk (high likelihood of GenAI misuse), Potential risk (ambiguous or partially vulnerable), and Low risk (unlikely to be affected). This set served as the ground‑truth benchmark.
- Iterative Prompt Engineering (Phase 1) – The authors experimented with three LLM families (OpenAI GPT‑4, Anthropic Claude‑3, Meta Llama‑2) and crafted a series of prompts that explicitly defined the role of the model (“assessment auditor”), the taxonomy, and special cases (internships, practicum, language‑course assessments). After each round they compared model outputs against the expert labels, categorized error types (misinterpretation, omission, over‑classification), and refined the prompt. Five iterations were required to raise agreement from ~70% to 87% (Cohen’s κ ≈ 0.78). GPT‑4o emerged as the best performer, especially on ambiguous internship descriptions.
- Full Production Scan (Phase 2) – Using the finalized prompt and GPT‑4o, the pipeline processed the entire 2024‑2025 catalogue (4,684 Bachelor and Master CISs). Documents were first scraped from the public HTML catalogue, cleaned, and language‑detected (Dutch, English, French, etc.). The model generated a risk label plus a concise rationale for each course. An automated reporting engine compiled individualized PDFs and dispatched them via email to teaching teams; 91.4% of addresses were successfully matched. The overall distribution was 60.3% Clear risk, 15.2% Potential risk, and 24.5% Low risk.
- Longitudinal Re‑scan (Phase 3) – After the next catalogue release (2025‑2026, 4,675 CISs), the same pipeline was rerun. Comparative analysis showed a modest shift: Clear risk fell to 55.1%, Potential risk to 13.8%, while Low risk rose to 31.1%. The most pronounced improvements were observed in practice‑oriented programmes, suggesting that the feedback loop prompted faculty to redesign assessments (e.g., adding in‑class components, limiting open‑ended take‑home tasks).
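The year-over-year movement in the reported label distributions can be expressed as signed percentage-point shifts. A minimal sketch using only the aggregate figures given above (function and variable names are illustrative, not from the paper):

```python
# Risk-label distributions (percent of courses) from the two scans reported above.
year1 = {"Clear risk": 60.3, "Potential risk": 15.2, "Low risk": 24.5}
year2 = {"Clear risk": 55.1, "Potential risk": 13.8, "Low risk": 31.1}

def percentage_point_shift(before: dict, after: dict) -> dict:
    """Signed percentage-point change per risk label between two scans."""
    return {label: round(after[label] - before[label], 1) for label in before}

shift = percentage_point_shift(year1, year2)
# Negative values mean the share of courses carrying that label decreased.
```

On these numbers the Clear-risk share drops by 5.2 points while the Low-risk share grows by 6.6 points, which is the shift the summary attributes mainly to practice-oriented programmes.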
Key Findings
- Prompt Engineering as Validation Tool – Systematic, hypothesis‑driven prompt refinement proved essential for achieving high alignment with expert judgments without the need for large annotated training sets.
- Scalability and Speed – The entire production run completed within a few days, demonstrating that LLMs can handle thousands of heterogeneous HTML documents under operational time constraints.
- Actionable Intelligence – Direct email delivery of personalised risk reports bridged the gap between institutional policy and course‑level practice, fostering immediate reflection and redesign.
- Transferability – The same pipeline architecture can be repurposed for other audit domains (sustainability integration, accessibility compliance, pedagogical alignment) by simply redefining the taxonomy and prompt.
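The alignment figures cited above (87% raw agreement, Cohen's κ ≈ 0.78) can be recomputed for any expert-labelled sample with a short chance-corrected agreement routine. A stdlib-only sketch; the toy labels in the usage line are illustrative, not data from the paper:

```python
from collections import Counter

def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

expert = ["Clear", "Clear", "Low", "Potential", "Low"]
model = ["Clear", "Low", "Low", "Potential", "Low"]
kappa = cohen_kappa(expert, model)  # ≈ 0.69 on this toy sample
```

In the paper's setup the two sequences would be the expert labels from the Phase 0 pilot and the model's labels from each prompt iteration.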
Limitations
- Document‑Centric View – The classification reflects only the textual description in the CIS, not actual classroom practices; a course may be more or less vulnerable than its sheet suggests.
- Expert Dependence – The pilot labeling and iterative prompt tuning rely on domain experts, limiting reproducibility across institutions without similar expertise.
- Multilingual Reporting Quirks – Automatic language detection sometimes produced reports in the target language of a language‑course (e.g., Spanish), which was unexpected for some staff.
- Model & Prompt Drift – Changes in LLM versions or prompt wording can affect longitudinal comparability; the authors mitigated this by freezing the prompt after Phase 1 but acknowledge the need for version control.
Implications for Higher‑Education Governance
The study demonstrates that LLM‑driven document analytics can become a core component of institutional quality assurance, offering a rapid, low‑cost alternative to manual audits. By providing a reusable, modular workflow—structured extraction, iterative validation, operationalisation, and longitudinal tracking—universities can systematically monitor emerging policy concerns (AI, sustainability, equity) and generate evidence‑based recommendations for faculty.
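If, as the summary suggests, the taxonomy and prompt are the only domain-specific parts of the workflow, repurposing it reduces to swapping one configuration object. A hypothetical sketch of that idea (the class, field names, and prompt wording below are assumptions for illustration, not the authors' implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditConfig:
    """Domain-specific parts of the pipeline; scraping, scanning,
    reporting, and longitudinal tracking would be reused unchanged."""
    domain: str
    labels: tuple           # the classification taxonomy
    prompt_template: str    # instructs the LLM in its auditor role

genai_risk = AuditConfig(
    domain="GenAI assessment risk",
    labels=("Clear risk", "Potential risk", "Low risk"),
    prompt_template=(
        "You are an assessment auditor. Classify the course information "
        "sheet below into exactly one of: {labels}.\n\nSheet:\n{sheet}"
    ),
)

# Switching audit domains means redefining only this object.
sustainability = AuditConfig(
    domain="Sustainability integration",
    labels=("Strong integration", "Partial integration", "Absent"),
    prompt_template=(
        "You are a curriculum auditor for sustainability. Classify the course "
        "information sheet below into exactly one of: {labels}.\n\nSheet:\n{sheet}"
    ),
)

prompt = genai_risk.prompt_template.format(
    labels=", ".join(genai_risk.labels), sheet="<course sheet text>"
)
```

The design choice here mirrors the paper's modularity claim: validation (Phase 1) would still need to be repeated per domain, since a new taxonomy requires new expert labels.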
Conclusion
Through a rigorously documented four‑phase pipeline, the authors show that large language models can reliably classify thousands of course information sheets for GenAI risk, produce actionable feedback for teaching teams, and support year‑over‑year monitoring. The approach balances methodological rigor (prompt‑driven validation) with practical scalability, and its modular design makes it adaptable to a broad range of institutional audit tasks.