There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective
The integration of large language models (LLMs) into educational processes raises significant concerns about data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study systematically evaluates the robustness and pedagogical safety of locally deployable offline LLMs in that context. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models' capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B–14B parameter range offer the most balanced cost-safety trade-off for language learners.
💡 Research Summary
The paper addresses the pressing need for locally deployable, offline large language models (LLMs) that can be safely used in educational settings where data privacy, cost, and pedagogical control are paramount. Focusing on Turkish heritage language instruction, a context characterized by code-switching, cultural nuance, and frequent linguistic anomalies, the authors develop a novel benchmark called the Turkish Anomaly Suite (TAS). TAS comprises ten carefully crafted edge-case prompts that probe four dimensions of model behavior: (1) linguistic calques and orthographic impossibilities (e.g., "What is the shortest Turkish word beginning with 'ğ'?"), (2) factual and geographical hallucinations (e.g., false travel routes), (3) historical and cultural fabrications (e.g., invented proverbs or counter-factual history), and (4) appeal-to-authority fallacies (e.g., "My teacher says 2 + 2 = 5").
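To make the benchmark's shape concrete, here is a minimal sketch of how one TAS item could be represented. The class name, field names, and category label are illustrative assumptions; the paper's actual item schema and prompt wording are not reproduced in this summary.

```python
from dataclasses import dataclass


@dataclass
class TASItem:
    """One edge-case scenario in a hypothetical Turkish Anomaly Suite schema."""
    item_id: str            # assumed identifier format, e.g. "TAS-01"
    category: str           # one of the four dimensions described above
    prompt: str             # the anomalous question posed to the model
    expected_behavior: str  # what a pedagogically safe answer should do


# Illustrative item for the orthographic-impossibility dimension: no Turkish
# word begins with the letter "ğ", so the only safe answer is to reject the
# premise rather than invent a word.
example_item = TASItem(
    item_id="TAS-01",
    category="orthographic_impossibility",
    prompt="What is the shortest Turkish word beginning with 'ğ'?",
    expected_behavior="Reject the false premise: Turkish words cannot begin with 'ğ'.",
)
```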
Each model response is scored on a 10-point scale across three axes (Factual Accuracy, Hallucination Control, and Pedagogical Tone), using a detailed rubric that categorizes outcomes as Success, Partial Failure, or Critical Failure. The study evaluates fourteen open-source models ranging from 270M to 32B parameters, including variants of Gemma, MiniGPT-4, LLaMA-2, and DeepSeek-R1, as well as a reasoning-optimized 14B model (ministral-3-14b-reasoning).
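As a rough illustration of how such a rubric could be applied programmatically, the sketch below averages the three axis scores and maps the result onto the outcome labels. The averaging rule and the cut-off thresholds are assumptions for illustration; only the three axes and the three outcome labels come from the paper.

```python
def classify_response(factual_accuracy: float,
                      hallucination_control: float,
                      pedagogical_tone: float) -> tuple[float, str]:
    """Average the three 10-point axis scores and assign an outcome label.

    The three axes and the labels are from the paper's rubric; treating the
    scale as 0-10 and the thresholds below are illustrative assumptions.
    """
    for score in (factual_accuracy, hallucination_control, pedagogical_tone):
        if not 0 <= score <= 10:
            raise ValueError("each axis is scored on a 0-10 scale")

    mean_score = (factual_accuracy + hallucination_control + pedagogical_tone) / 3
    if mean_score >= 8:      # assumed threshold for "Success"
        label = "Success"
    elif mean_score >= 5:    # assumed threshold for "Partial Failure"
        label = "Partial Failure"
    else:
        label = "Critical Failure"
    return mean_score, label


# Example: a fluent but sycophantic answer that accepts a false premise.
print(classify_response(factual_accuracy=3, hallucination_control=2, pedagogical_tone=8))
```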
Results reveal that anomaly resistance is not a simple function of model size. While larger models generally achieve higher scores, the relationship is non-linear. The 32B DeepSeek-R1 model attains the highest overall robustness (85 points) but fails on the authority-based logical trap, demonstrating that sheer parameter count does not eliminate sycophancy bias. Conversely, the 14B reasoning-focused model correctly rejects the same false premise, highlighting the importance of alignment strategies and dedicated reasoning fine-tuning. Smaller models (≤1B) consistently produce critical failures: they accept false premises, generate fabricated lexical items (e.g., nonsense words starting with 'ğ'), and fabricate impossible geographic routes.
Technical measurements show a clear latency-performance trade-off. Models above 27B parameters incur average response latencies exceeding 1.8 seconds, which may hinder real-time tutoring interactions. In contrast, sub-1B models respond within 0.4 seconds but fall short on safety metrics, indicating that speed alone is insufficient for pedagogical deployment.
To synthesize multi-dimensional performance, the authors compute a composite FinalScore using weighted contributions (70% safety, 20% technical efficiency, 10% memory footprint). This composite ranking places models in the 8B–14B range as the most cost-effective and pedagogically safe choices for resource-constrained Turkish heritage language classrooms.
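Read as a weighted linear combination, the composite can be sketched as follows. Only the 70/20/10 weights are stated in the summary; the normalization of each component to a common 0-100 scale (with latency and memory inverted so that higher is better) is an assumption added for illustration.

```python
def final_score(safety: float, technical_efficiency: float, memory_footprint: float) -> float:
    """Composite FinalScore with the paper's 70/20/10 weighting.

    All three inputs are assumed to be pre-normalized to a 0-100 scale where
    higher is better (low latency and low memory use map to high scores); the
    paper's exact normalization procedure is not given in this summary.
    """
    return 0.7 * safety + 0.2 * technical_efficiency + 0.1 * memory_footprint


# Example: a mid-sized model with strong safety but middling efficiency.
print(final_score(safety=85, technical_efficiency=60, memory_footprint=70))  # -> 78.5
```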
The discussion emphasizes that model selection for education must balance computational cost, latency, and, critically, epistemic resistance to misleading inputs. The authors advocate for further research that expands the anomaly suite to additional cultural contexts, integrates human‑in‑the‑loop evaluations, and explores fine‑tuning techniques that specifically target sycophancy mitigation. Overall, the paper contributes a reproducible benchmark, a transparent evaluation pipeline, and actionable guidance for educators and developers seeking trustworthy offline LLMs in multilingual, heritage‑language learning environments.