Are Language Models Sensitive to Morally Irrelevant Distractors?
With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this “situationist” view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 “moral distractors” from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.
💡 Research Summary
This paper investigates whether large language models (LLMs) exhibit the same “situationist” moral variability observed in human psychology, where morally irrelevant contextual factors (e.g., pleasant smells, background noise, affective images) can shift moral judgments. The authors curate a multimodal set of 60 “moral distractors” – 30 textual and 30 visual stimuli – drawn from established affective databases (IDEST for text, OASIS for images). Each modality includes ten positive, ten neutral, and ten negative items, carefully filtered to ensure they are emotionally valenced but contain no moral content.
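To make the dataset's structure concrete, here is a minimal sketch of how the distractor set described above might be represented. The field names and example entries are hypothetical illustrations, not the authors' released schema.

```python
from dataclasses import dataclass

@dataclass
class MoralDistractor:
    """One morally irrelevant, emotionally valenced stimulus."""
    modality: str   # "text" (IDEST narrative) or "image" (OASIS picture)
    valence: str    # "positive", "neutral", or "negative"
    content: str    # narrative text, or a path/identifier for the image

# Hypothetical entries for illustration only; the actual dataset contains
# 30 items per modality, ten for each valence level.
distractors = [
    MoralDistractor("text", "positive",
                    "I spent the afternoon baking bread with my grandmother."),
    MoralDistractor("text", "negative",
                    "The storm flooded the basement and ruined the photo albums."),
    MoralDistractor("image", "neutral", "oasis/desk_lamp_01.jpg"),
]
```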
Two widely used moral evaluation benchmarks are employed: (1) MORALCHOICE, which presents 687 low‑ambiguity and 680 high‑ambiguity forced‑choice scenarios, each with two actions annotated against Gert's ten common‑morality rules; and (2) a collection of 250 everyday dilemmas from the Reddit subreddit r/AITA, where models must assign one of five verdicts (YTA, NTA, NAH, ESH, INFO) and provide brief reasoning. For MORALCHOICE, distractors are prepended to the scenario text; for r/AITA, they are added to the system prompt. The authors run four LLM families (GEMMA‑3‑4B‑IT, LLaMA‑3.2‑4B‑INSTRUCT, GPT‑4.1, and QWEN‑3‑4B) across both benchmarks, measuring the marginal moral action probability (MMAP) of the rule‑following choice and analyzing verdict distributions.
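The injection-and-measurement protocol can be sketched roughly as follows. This is a simplified illustration that assumes MMAP is estimated by sampling the model's forced choice repeatedly and taking the fraction of samples that pick the rule‑following action; `query_model` is a placeholder for whatever chat API is used, the prompt template is not the authors' exact wording, and the real setup would likely also randomize option order and prompt format.

```python
def build_prompt(scenario: str, action1: str, action2: str,
                 distractor: str | None = None) -> str:
    """MORALCHOICE-style forced choice. The distractor, if any, is
    prepended to the scenario text (for r/AITA it would instead be
    added to the system prompt)."""
    context = f"{distractor}\n\n{scenario}" if distractor else scenario
    return (
        f"{context}\n\n"
        "Which action would you take?\n"
        f"A. {action1}\nB. {action2}\n"
        "Answer with a single letter."
    )

def estimate_mmap(query_model, scenario, moral_action, other_action,
                  distractor=None, n_samples=20) -> float:
    """Rough MMAP estimate: the fraction of sampled responses choosing the
    rule-following action. Here the moral action is fixed as option A for
    simplicity; the paper may additionally marginalise over prompt formats
    and option orderings."""
    prompt = build_prompt(scenario, moral_action, other_action, distractor)
    hits = sum(
        query_model(prompt).strip().upper().startswith("A")
        for _ in range(n_samples)
    )
    return hits / n_samples
```

Comparing `estimate_mmap(...)` with and without a given distractor then yields the per-scenario shift in moral behavior that the results below summarize.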
Results show a robust effect of affective distractors on moral behavior. Negative distractors consistently lower MMAP, reducing the likelihood of the morally preferable action by up to 30% in low‑ambiguity scenarios and by a statistically significant margin in most high‑ambiguity cases. Positive distractors have the opposite, though generally weaker, effect, slightly raising MMAP. An exception is observed with LLaMA‑3.2‑INSTRUCT, where both positive and neutral distractors unexpectedly depress MMAP by over 20% in low‑ambiguity settings. Visual distractors (tested only with GEMMA due to compute constraints) produce a similar pattern: negative images suppress moral choices, while positive images modestly enhance them.
A deeper linguistic analysis of the r/AITA reasoning reveals that negative distractors increase the frequency of harm‑related and negative affect terms, whereas positive distractors boost cooperative and prosocial vocabulary, aligning with the direction of the behavioral shift. The authors also note that some high‑ambiguity scenarios may contain subtle biases in the original MORALCHOICE annotation, which could amplify the observed effects.
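A rough version of the kind of lexical analysis described here might look like the following. The word lists are illustrative stand-ins (an actual study would typically rely on a validated lexicon such as LIWC or NRC categories), and `responses` is assumed to be the collected r/AITA reasoning texts for one distractor condition.

```python
import re
from collections import Counter

# Illustrative mini-lexicons; a real analysis would use validated categories.
HARM_TERMS = {"hurt", "harm", "damage", "cruel", "pain", "abuse"}
PROSOCIAL_TERMS = {"help", "support", "share", "care", "cooperate", "kind"}

def category_rates(responses: list[str]) -> dict[str, float]:
    """Frequency of harm-related vs. prosocial terms per 1,000 tokens."""
    counts = Counter()
    total_tokens = 0
    for text in responses:
        tokens = re.findall(r"[a-z']+", text.lower())
        total_tokens += len(tokens)
        counts["harm"] += sum(t in HARM_TERMS for t in tokens)
        counts["prosocial"] += sum(t in PROSOCIAL_TERMS for t in tokens)
    return {k: 1000 * v / max(total_tokens, 1) for k, v in counts.items()}

# Compare, e.g., category_rates(reasoning_under_negative_distractors)
# against category_rates(reasoning_under_positive_distractors).
```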
The study concludes that LLMs do not possess stable, context‑independent moral preferences; instead, they are susceptible to morally irrelevant affective cues in a manner reminiscent of human situationist findings. This challenges the implicit assumption underlying many current moral alignment benchmarks and suggests that future alignment work must incorporate contextual robustness, evaluating models not only on the content of moral prompts but also on the surrounding sensory and affective environment. The paper contributes (1) a publicly released “moral distractor” dataset, (2) empirical evidence of novel cognitive moral biases in LLMs, and (3) a discussion of the implications for AI alignment, responsibility, and the design of safer, more reliable language‑model‑driven systems.