Sandpiper: Orchestrated AI-Annotation for Educational Discourse at Scale

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Digital educational environments are expanding toward complex AI and human discourse, providing researchers with an abundance of data that offers deep insights into learning and instructional processes. However, traditional qualitative analysis remains a labor-intensive bottleneck, severely limiting the scale at which this research can be conducted. We present Sandpiper, a mixed-initiative system designed to serve as a bridge between high-volume conversational data and human qualitative expertise. By tightly coupling interactive researcher dashboards with agentic Large Language Model (LLM) engines, the platform enables scalable analysis without sacrificing methodological rigor. Sandpiper addresses critical barriers to AI adoption in education by implementing context-aware, automated de-identification workflows supported by secure, university-housed infrastructure to ensure data privacy. Furthermore, the system employs schema-constrained orchestration to eliminate LLM hallucinations and enforces strict adherence to qualitative codebooks. An integrated evaluations engine allows for the continuous benchmarking of AI performance against human labels, fostering an iterative approach to model refinement and validation. We propose a user study to evaluate the system’s efficacy in improving research efficiency, inter-rater reliability, and researcher trust in AI-assisted qualitative workflows.


💡 Research Summary

The paper introduces Sandpiper, a mixed‑initiative platform that bridges large‑scale educational discourse data with the qualitative expertise of researchers. The authors argue that the rapid growth of digital learning environments—ranging from tutoring transcripts to peer‑to‑peer discussions—has created a wealth of semi‑structured conversational data that is largely under‑analyzed because traditional qualitative coding is labor‑intensive, costly, and prone to fatigue and inter‑rater drift. Sandpiper addresses these bottlenecks through four design goals: (1) scalable de‑identification and verification, (2) schema‑constrained reliability to prevent LLM hallucinations, (3) continuous verification and benchmarking against human labels, and (4) secure, university‑hosted infrastructure.

The system architecture separates an interactive researcher dashboard from heavy LLM inference. Raw transcripts are normalized into a JSON “Session” object, and an automated de‑identification module masks personally identifiable information before any data leaves the secure environment. Researchers can review and approve the masked output via a verification UI, ensuring compliance with privacy regulations.
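The normalized "Session" object and masking pass might look like the following sketch. The field names and regex patterns here are illustrative assumptions, not the paper's actual schema; the paper describes a context-aware de-identification module, whereas this sketch uses simple regexes for brevity.

```typescript
// Hypothetical shape of a normalized Session; field names are assumptions.
interface Turn {
  speaker: string;      // role label, e.g. "tutor" | "student"
  text: string;         // utterance content
  timestamp?: string;
}

interface Session {
  sessionId: string;
  turns: Turn[];
}

// Minimal regex-based PII masking. A production module would use
// context-aware detection, with researcher review in the verification UI.
function maskPII(session: Session): Session {
  const emailRe = /[\w.+-]+@[\w-]+\.[\w.]+/g;
  const phoneRe = /\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b/g;
  return {
    ...session,
    turns: session.turns.map((t) => ({
      ...t,
      text: t.text.replace(emailRe, "[EMAIL]").replace(phoneRe, "[PHONE]"),
    })),
  };
}
```

Because masking returns a new Session rather than mutating the input, the raw and de-identified versions can be stored side by side for the human verification step.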

Prompt management is central to the platform. In a dedicated Prompt Editor, users define both natural‑language instructions and a strict JSON schema that reflects their qualitative codebook. When a Large Language Model (LLM) is invoked, the returned text is immediately validated against the schema. If the output is malformed or contains hallucinated tags, an orchestrator loop generates a detailed error message and re‑prompts the model until a compliant response is produced. This closed‑loop mechanism enforces structural fidelity and reduces post‑processing cleanup.

All LLM inference is routed through Cornell’s LiteLLM cluster, a secure, on‑premise gateway that isolates the university’s data from external APIs while still providing access to state‑of‑the‑art closed‑source models (e.g., GPT‑4, Gemini). The backend, built on Node.js/Express with a Redis‑based task queue, handles asynchronous processing, while MongoDB stores both raw and annotated data with encryption at rest.
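A request to such a gateway might be assembled as below. The base URL and model name are placeholders; the payload follows the OpenAI-compatible `/chat/completions` shape that LiteLLM-style proxies typically expose, which is an assumption about this particular deployment.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface GatewayRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) an OpenAI-compatible chat request for the gateway.
function buildChatRequest(
  baseUrl: string,
  model: string,
  messages: ChatMessage[],
  apiKey: string
): GatewayRequest {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, messages, temperature: 0 }),
  };
}
```

A Redis-queued worker would POST this request and hand the response to the schema validator, so transcripts never cross the gateway boundary to an external API directly.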

The Evaluation Dashboard is the system’s distinguishing feature. Researchers can group multiple “Runs” into “Run‑Sets” to compare different model versions, prompt variants, or human annotations. The dashboard automatically computes inter‑rater agreement metrics such as Cohen’s Kappa, precision, recall, and F1‑score, and visualizes confusion matrices and error typologies. This continuous benchmarking enables an iterative workflow: insights from the metrics guide prompt refinements and schema adjustments, which are then re‑tested in subsequent runs.
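The agreement metrics the dashboard reports can be computed over paired label sequences as in this sketch; the function names are illustrative, not Sandpiper's internals.

```typescript
// Cohen's kappa over two raters' label sequences of equal length:
// kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
// p_e is chance agreement from each rater's marginal label frequencies.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  const countA = new Map<string, number>();
  const countB = new Map<string, number>();
  let observed = 0;
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) observed++;
    countA.set(a[i], (countA.get(a[i]) ?? 0) + 1);
    countB.set(b[i], (countB.get(b[i]) ?? 0) + 1);
  }
  const po = observed / n;
  let pe = 0;
  for (const label of new Set([...a, ...b]))
    pe += ((countA.get(label) ?? 0) / n) * ((countB.get(label) ?? 0) / n);
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}

// Per-class precision/recall/F1, treating `gold` as the human reference.
function prf1(gold: string[], pred: string[], cls: string) {
  let tp = 0, fp = 0, fn = 0;
  for (let i = 0; i < gold.length; i++) {
    if (pred[i] === cls && gold[i] === cls) tp++;
    else if (pred[i] === cls) fp++;
    else if (gold[i] === cls) fn++;
  }
  const precision = tp / (tp + fp || 1);
  const recall = tp / (tp + fn || 1);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}
```

Running these per "Run-Set" pair is enough to populate the comparison views; the confusion matrices follow from the same tp/fp/fn counts tallied per class.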

In the related work discussion, the authors position Sandpiper against traditional qualitative data analysis tools (NVivo, MAXQDA) that lack scalability, and against fully automated NLP pipelines that often miss the nuance required for educational discourse. They cite recent LLM‑assisted labeling tools (Label Studio, GPT‑3‑based coders) but note that those systems typically treat the LLM as a black box, leading to post‑hoc cleaning and reduced methodological rigor. Sandpiper’s schema‑enforced orchestration and mixed‑initiative design directly address these shortcomings.

Although the paper proposes a user study rather than reporting completed experiments, the authors anticipate that Sandpiper will improve research efficiency (by reducing manual coding time), increase inter‑rater reliability (through consistent, schema‑driven AI suggestions), and boost researcher trust in AI‑assisted qualitative workflows (by providing transparent performance metrics and a secure data pipeline). Future work includes expanding human‑in‑the‑loop capabilities, integrating automatic segmentation models for pre‑coding, and supporting multimodal data (audio/video).

In summary, Sandpiper offers a comprehensive solution for scaling qualitative analysis of educational discourse without sacrificing methodological rigor. By tightly coupling interactive dashboards, secure LLM inference, schema‑constrained orchestration, and continuous evaluation, the platform enables researchers to harness the power of large language models while maintaining control, transparency, and compliance—potentially reshaping how social‑science and education researchers approach massive conversational datasets.

