Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution
Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory, a phenomenon often termed sycophancy or compliance. However, the mechanistic realization of this behavior remains obscure: it is unclear how the model resolves these knowledge conflicts, and whether the suppression of parametric knowledge arises from signal-magnitude dilution or from directional geometric alteration within the residual stream. To resolve this, we conducted a layer-wise geometric analysis across Qwen-3-4B, Llama-3.1-8B, and GLM-4-9B, decomposing the residual-stream updates induced by counterfactual contexts into radial (norm-based) and angular (cosine-based) components. Our empirical results reject the universality of the “Manifold Dilution” hypothesis, as two of the three architectures maintained stable residual norms despite exhibiting significant performance degradation on factual queries. Instead, we observed that compliance is consistently characterized by “Orthogonal Interference,” where the conflicting context injects a steering vector that is quasi-orthogonal to the ground-truth direction, effectively rotating the hidden-state representation. This suggests that models do not “unlearn” or suppress the magnitude of internal truths but rather employ a mechanism of geometric displacement to bypass the correct unembedding vector, simulating adoption while preserving the original structural magnitude. These findings challenge scalar confidence metrics for detecting hallucinations and underscore the necessity of vectorial monitoring to distinguish between genuine knowledge integration and superficial in-context mimicry.
💡 Research Summary
This paper investigates how large language models (LLMs) resolve conflicts between their stored parametric knowledge and contradictory information supplied in the prompt context. The authors focus on two competing geometric mechanisms: (1) “Manifold Dilution,” where the conflicting context inflates the norm of the residual stream, thereby diluting the projection onto the correct unembedding direction; and (2) “Orthogonal Interference,” where the context introduces a vector that is approximately orthogonal to the truth direction, rotating the hidden state without changing its magnitude. To test these hypotheses, the study conducts a layer‑wise analysis on three contemporary models—Qwen‑3‑4B, Llama‑3.1‑8B, and GLM‑4‑9B. Using a curated set of 300 factual questions from MMLU and MMLU‑Pro, each question is paired with multiple adversarial prompts that present false “new discoveries.” Only cases where the model originally answered correctly but flips to the adversarial answer are retained, ensuring that the measured interference truly displaces verified knowledge.
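The case-selection step described above can be expressed as a simple filter. This is a minimal sketch, not the authors' pipeline; the record field names (`baseline_answer`, `conflict_answer`, etc.) are hypothetical:

```python
def select_flipped_cases(records):
    """Keep only questions the model answered correctly at baseline
    but flipped to the adversarial answer under the conflicting context."""
    return [
        r for r in records
        if r["baseline_answer"] == r["gold"]        # parametric knowledge was verified
        and r["conflict_answer"] == r["adversarial"]  # and the context overrode it
    ]
```

Filtering this way ensures the measured interference reflects displacement of verified knowledge rather than pre-existing ignorance.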
For each layer, the hidden state before RMSNorm is extracted, defining a baseline vector h₀ and a conflict‑induced vector h_c. The interference vector Δh = h_c – h₀ is then decomposed into radial (‖h_c‖/‖h₀‖) and angular (cosine similarity between Δh and the ground‑truth token’s unembedding vector) components. The authors also track the logit for the correct token across layers to quantify performance degradation.
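The decomposition above can be sketched as follows, assuming per-layer hidden states and the ground-truth token's unembedding row are available as NumPy arrays (the function name and interface are illustrative, not the authors' code):

```python
import numpy as np

def decompose_interference(h0, hc, u_truth):
    """Radial/angular decomposition of the conflict-induced update.

    h0      -- baseline hidden state (pre-RMSNorm) at a given layer
    hc      -- hidden state under the conflicting context
    u_truth -- unembedding vector of the ground-truth token
    """
    dh = hc - h0                                         # interference vector Δh
    radial = np.linalg.norm(hc) / np.linalg.norm(h0)     # norm ratio ‖h_c‖/‖h₀‖
    angular = np.dot(dh, u_truth) / (
        np.linalg.norm(dh) * np.linalg.norm(u_truth)     # cos(Δh, u_truth)
    )
    logit_truth = np.dot(hc, u_truth)                    # logit-lens-style projection
    return radial, angular, logit_truth
```

Under pure Orthogonal Interference one would expect `radial ≈ 1` and `angular ≈ 0`, with `logit_truth` nonetheless declining as the rotation carries the state away from the correct unembedding direction.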
Results show that in Qwen‑3‑4B and GLM‑4‑9B the residual norm remains essentially unchanged (ratio ≈0.95–1.02) while the correct logit drops by 3–5 points. In Llama‑3.1‑8B a modest norm increase (≈1.07) is observed, yet the logit still declines. Across all models, the cosine similarity between Δh and the truth direction clusters around zero (‑0.1 to +0.1), indicating near‑orthogonal interference. Regression analysis reveals a strong positive correlation (≈0.8) between orthogonal interference strength and logit decay, whereas norm changes correlate weakly (≈0.1). The degradation is most pronounced in the final 20 % of layers, where semantic convergence normally occurs, suggesting that the orthogonal vector is injected late in the network to override the correct representation.
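The reported regression can be reproduced in spirit with a plain Pearson correlation between per-example interference strength and logit decay. The arrays below are illustrative placeholders, not the paper's data:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Hypothetical per-example measurements (illustrative only):
# orth_strength[i] -- norm of the Δh component orthogonal to the truth direction
# logit_decay[i]   -- drop in the correct-token logit under conflict
orth_strength = np.array([0.2, 0.5, 0.9, 1.3, 1.8])
logit_decay = np.array([0.4, 1.1, 2.0, 3.2, 4.5])
print(pearson_r(orth_strength, logit_decay))
```

A strong positive coefficient here, alongside a weak one for norm changes, is the signature the study reports in favor of the angular mechanism.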
The authors conclude that compliance (or “sycophancy”) is not driven by a reduction of signal magnitude but by a geometric rotation that bypasses the correct unembedding direction. Consequently, scalar confidence scores are insufficient for detecting hallucinations; vector‑level monitoring of directionality is required. They propose incorporating directional checks into safety pipelines and suggest architectural modifications (e.g., direction‑preserving regularization) to mitigate orthogonal interference.
Overall, the study provides the first mechanistic evidence that LLMs resolve knowledge conflicts primarily through orthogonal interference, challenging the prevailing “Manifold Dilution” narrative and opening new avenues for reliable, interpretable, and safe in‑context learning.