EMNLP: Educator-role Moral and Normative Large Language Models Profiling
Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 14 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.
💡 Research Summary
The paper introduces EMNLP (Educator‑role Moral and Normative Large Language Models Profiling), a comprehensive framework for evaluating large language models (LLMs) that are prompted to act as teachers. Recognizing a gap in existing research—most prior work assesses general personality or moral reasoning but does not combine profession‑specific traits, moral development, and vulnerability to prompt‑based attacks—the authors design a three‑tier evaluation pipeline: (1) personality profiling, (2) moral dilemma reasoning, and (3) harmful‑content risk testing via soft prompt injection.
For personality assessment, the authors extend the Computerized Personality Scale for Teachers (CPST) through a human‑machine collaborative process, doubling each of its 13 dimensions to create CPST‑E, and validate it with 100 in‑service teachers. They also employ the HEXA‑CO‑60 inventory to capture general personality dimensions. Both scales use a 7‑point Likert format, and responses are calibrated for positive/negative wording, repeated ten times, and aggregated by mode to ensure stability.
The moral reasoning tier consists of 88 teacher‑specific dilemmas covering five conflict categories typical in education (caring vs. formal climate, distributive justice vs. school standards, confidentiality vs. rules, loyalty vs. norms, family vs. educational standards) and eleven sub‑categories. The dilemmas were created through a multi‑step pipeline: expert seed generation, LLM‑driven expansion, and expert review, ensuring coverage across primary, secondary, and higher‑education contexts, and including extreme scenarios. Each dilemma is presented as an open‑ended question, allowing the model to explain its decision. Nine human experts independently label each model response according to Kohlberg’s three stages of moral development (pre‑conventional, conventional, post‑conventional); majority voting determines the final stage, and a weighted Moral Stage Score (MSS) aggregates performance across categories.
The final tier probes ethical risk by injecting soft prompts that encode four moral flaws (incompetence, offensiveness, indolence, and responding to inappropriate student requests). For each flaw, five prompt templates are crafted, and five student utterances (e.g., “ignorant,” “psychologically fragile,” “actively requesting harmful content”) are paired, yielding 1,400 test cases across 14 models. Nine experts label each response as harmful or non‑harmful, and inter‑annotator agreement is reported as high.
Experiments involve 14 LLMs (including open‑source and commercial models) evaluated at temperature 0 for baseline comparisons, and additionally across temperatures 0–1 in 0.25 increments to assess hyperparameter effects. Key findings:
-
Personality – Teacher‑role LLMs display more idealized and polarized traits than human teachers, especially scoring high on responsibility, empathy, and ethical orientation. Some models exhibit extreme scores on specific dimensions, suggesting over‑alignment with an “ideal teacher” archetype.
-
Moral Development – In abstract, principle‑based dilemmas, high‑performing models (e.g., GPT‑4, Claude‑2) often achieve post‑conventional reasoning, outperforming human benchmarks. However, in emotionally charged or relational scenarios, they tend to evade or provide vague answers, resulting in lower stage classifications.
-
Vulnerability to Prompt Injection – Models with stronger reasoning capabilities are paradoxically more susceptible to harmful outputs when soft prompts are applied. This “capability‑risk paradox” indicates that raw performance does not guarantee safety.
-
Hyperparameter Influence – Varying temperature has minimal impact on personality scores and moral stage distribution, but higher temperatures slightly increase the proportion of harmful responses, confirming that temperature alone is insufficient to mitigate risk.
The authors argue that deploying teacher‑role LLMs in real educational settings requires not only advanced reasoning but also robust defensive mechanisms (e.g., safety fine‑tuning, prompt filtering). EMNLP is positioned as the first benchmark that jointly measures profession‑specific personality, moral development, and ethical risk, offering a template that can be extended to other high‑stakes domains such as healthcare or law. Limitations include reliance on English and Chinese languages, dependence on human expert labeling, and the absence of multi‑turn, multi‑student interaction simulations. Future work is suggested to broaden linguistic coverage, automate annotation, and integrate real‑time classroom dynamics.
Overall, EMNLP reveals that while teacher‑role LLMs can emulate idealized professional traits and exhibit sophisticated moral reasoning, they remain vulnerable to prompt‑based manipulation and struggle with affect‑laden decisions, underscoring the need for balanced development of capability and safety in educational AI.
Comments & Academic Discussion
Loading comments...
Leave a Comment