The Fragility Of Moral Judgment In Large Language Models

People increasingly use large language models (LLMs) for everyday moral and interpersonal guidance, yet these systems cannot interrogate missing context and must judge dilemmas as presented. We introduce a perturbation framework for testing the stability and manipulability of LLM moral judgments while holding the underlying moral conflict constant. Using 2,939 dilemmas from r/AmItheAsshole (January-March 2025), we generate three families of content perturbations: surface edits (lexical/structural noise), point-of-view shifts (voice and stance neutralization), and persuasion cues (self-positioning, social proof, pattern admissions, victim framing). We also vary the evaluation protocol (output ordering, instruction placement, and unstructured prompting). We evaluate all variants with four models (GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5-72B), yielding N=129,156 judgments. Surface perturbations produce low flip rates (7.5%), largely within the self-consistency noise floor (4-13%), whereas point-of-view shifts induce substantially higher instability (24.3%). A large subset of dilemmas (37.9%) is robust to surface noise yet flips under perspective changes, indicating that models condition on narrative voice as a pragmatic cue. Instability concentrates in morally ambiguous cases; scenarios where no party is assigned blame are most susceptible. Persuasion perturbations yield systematic directional shifts. Protocol choices dominate all other factors: agreement between structured protocols is only 67.6% (kappa=0.55), and only 35.7% of model-scenario units match across all three protocols. These results show that LLM moral judgments are co-produced by narrative form and task scaffolding, raising reproducibility and equity concerns when outcomes depend on presentation skill rather than moral substance.


💡 Research Summary

The paper investigates how stable large language models (LLMs) are when they are asked to render moral judgments on everyday interpersonal conflicts. Using 2,939 posts from the Reddit community r/AmItheAsshole (January–March 2025), the authors generate three families of content perturbations—surface edits, point‑of‑view (POV) shifts, and minimal persuasion cues—while keeping the underlying moral conflict unchanged. They also vary the evaluation protocol by changing instruction placement, output ordering, and using an unstructured prompt. Four state‑of‑the‑art models (GPT‑4.1, Claude 3.7 Sonnet, DeepSeek V3, Qwen2.5‑72B) are queried under each condition, yielding 129,156 judgments.

Methodology

  • Baseline: Each scenario is evaluated three times with a structured prompt that asks for a categorical verdict (YTA, NTA, NAH, ESH, INFO) and a short explanation. Self‑consistency is measured via normalized entropy and three‑run agreement (a minimal metric sketch follows this list).
  • Content perturbations:
    Surface – delete a sentence, change a trivial detail, or insert an irrelevant sentence.
    POV – rewrite the story in first‑person or third‑person without the subreddit‑specific “Am I the asshole?” tag.
    Persuasion – add one of six short phrases that self‑condemn, provide social proof, admit a pattern, self‑justify, or frame the other party as a victim.
  • Protocol perturbations: (a) “verdict‑first” (structured prompt), (b) “explanation‑first” (explanation then verdict), and (c) a free‑form prompt with no forced‑choice labels. A stratified sample of 1,200 scenario‑perturbation instances is evaluated under all three protocols, producing 14,400 additional judgments.
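As a concrete illustration of the baseline self‑consistency metrics described above, the sketch below computes normalized entropy and three‑run agreement over repeated verdicts for a single scenario. This is a minimal Python sketch under assumptions: the paper does not spell out its normalization constant or tie handling, so normalizing by the log of the five‑label space is an illustrative choice, not the authors' specification.

```python
import math
from collections import Counter

VERDICTS = ["YTA", "NTA", "NAH", "ESH", "INFO"]  # the forced-choice label set

def normalized_entropy(verdicts):
    """Shannon entropy of the verdict distribution over repeated runs,
    scaled so 0 = perfectly consistent and values near 1 = maximally spread.
    (Normalizing by log(len(VERDICTS)) is an assumption, not the paper's spec.)"""
    counts = Counter(verdicts)
    probs = [c / len(verdicts) for c in counts.values()]
    h = -sum(p * math.log(p) for p in probs)
    return h / math.log(len(VERDICTS))

def three_run_agreement(verdicts):
    """True if all repeated runs return the same categorical verdict."""
    return len(set(verdicts)) == 1

# Example: three baseline runs on one scenario
runs = ["NTA", "NTA", "ESH"]
print(round(normalized_entropy(runs), 2))  # ~0.40, i.e. some self-disagreement
print(three_run_agreement(runs))           # False
```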

Key Findings

  1. Surface edits cause few flips (7.5 %); this lies within the models’ intrinsic stochasticity (4‑13 % self‑disagreement).
  2. POV shifts dramatically increase instability (24.3 % flips). A large subset (37.9 % of cases) is robust to surface noise but flips under POV changes, indicating that models treat narrative voice as a pragmatic cue that can alter inferred social context.
  3. Persuasion cues produce systematic directional shifts: social proof and pattern admission raise the likelihood of blaming the narrator, while self‑justification often reduces it. The effect size is modest (≈3–5 percentage points) but consistent across models.
  4. Protocol effects dominate: agreement between the two structured protocols is only 67.6 % (Cohen’s κ = 0.55; see the agreement sketch after this list), and merely 35.7 % of model‑scenario units produce the same verdict across all three protocols. The unstructured protocol especially inflates “INFO/needs more data” responses.
  5. Model‑level differences: GPT‑4.1 shows the highest self‑consistency (≈92 % three‑run agreement) but is not immune to protocol‑driven flips. Qwen2.5‑72B is the most sensitive to the “explanation‑first” protocol.
  6. Uncertainty correlation: normalized entropy of baseline runs predicts flip rates (r = 0.37‑0.71). Scenarios with ambiguous blame (e.g., NAH or ESH) are most vulnerable; in those, flips exceed 40 %.
  7. Explanation analysis: Using a lexicon‑based epistemic stance score, the authors find that persuasion cues that push blame onto the narrator increase confidence language (+0.12 on the stance scale), whereas “explanation‑first” prompts increase tentative language, reflecting higher perceived uncertainty (a stance‑scoring sketch follows this list).
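Findings 1, 2, and 4 rest on two simple quantities: the flip rate of a verdict under a perturbation and the (chance‑corrected) agreement between evaluation protocols. The sketch below shows one way to compute them; the helper names are hypothetical, and scikit‑learn's `cohen_kappa_score` is used as a stand‑in for whatever implementation the authors used.

```python
from sklearn.metrics import cohen_kappa_score  # chance-corrected agreement

def flip_rate(baseline, perturbed):
    """Fraction of model-scenario units whose verdict changes under a perturbation."""
    return sum(b != p for b, p in zip(baseline, perturbed)) / len(baseline)

def protocol_agreement(protocol_a, protocol_b):
    """Raw agreement and Cohen's kappa between two evaluation protocols."""
    raw = sum(a == b for a, b in zip(protocol_a, protocol_b)) / len(protocol_a)
    return raw, cohen_kappa_score(protocol_a, protocol_b)

# Toy example using the paper's verdict labels
verdict_first     = ["NTA", "YTA", "NAH", "ESH", "NTA", "INFO"]
explanation_first = ["NTA", "NTA", "NAH", "YTA", "NTA", "INFO"]
print(round(flip_rate(verdict_first, explanation_first), 2))  # 0.33
print(protocol_agreement(verdict_first, explanation_first))   # raw agreement and kappa
```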

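Finding 7's lexicon‑based epistemic stance score can be approximated as the rate of booster (confident) words minus the rate of hedge (tentative) words in an explanation. The word lists and scoring formula below are assumptions for illustration only; the paper's actual lexicon and scale are not reproduced here.

```python
import re

# Illustrative hedge/booster lexicons (not the paper's)
HEDGES   = {"might", "may", "perhaps", "possibly", "seems", "unclear"}
BOOSTERS = {"clearly", "definitely", "obviously", "certainly", "undoubtedly"}

def stance_score(explanation: str) -> float:
    """Booster rate minus hedge rate per token: positive means more confident
    language, negative means more tentative language."""
    tokens = re.findall(r"[a-z']+", explanation.lower())
    if not tokens:
        return 0.0
    boosters = sum(t in BOOSTERS for t in tokens)
    hedges = sum(t in HEDGES for t in tokens)
    return (boosters - hedges) / len(tokens)

print(stance_score("You are clearly in the wrong here."))               # > 0
print(stance_score("It seems like this might be a misunderstanding."))  # < 0
```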
Implications
The study demonstrates that LLM moral judgments are not intrinsic properties of the model alone; they are co‑produced by (a) how the story is narrated, (b) subtle rhetorical cues, and (c) the scaffolding of the evaluation protocol. Consequently, a user who can rephrase a dilemma or choose a particular interface can materially influence the model’s advice, raising reproducibility, fairness, and safety concerns for any application that relies on LLM‑generated moral guidance (e.g., counseling bots, decision‑support tools).

Limitations & Future Work

  • The dataset is limited to English Reddit posts from a narrow time window, which may not generalize across cultures or languages.
  • Only a predefined set of 11 content perturbations was explored; richer linguistic manipulations (metaphor, irony) remain untested.
  • Explanation quality was assessed with a simple lexical stance metric; deeper logical or argumentative analyses could yield richer insights.
  • The authors suggest developing standardized, protocol‑agnostic evaluation frameworks and training models to be explicitly aware of narrative perspective as a non‑moral cue.

Conclusion
Moral judgments from LLMs are fragile: surface noise has little effect, but changes in narrative point of view flip roughly a quarter of verdicts, and the choice of prompting protocol leaves only about a third of model‑scenario units consistent across all three protocols. This fragility underscores the need for robust evaluation standards, transparent prompting practices, and model architectures that can disentangle moral facts from presentation style before LLMs are deployed for real‑world ethical advice.

