Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning


We show that continual pre-training on plausible misinformation can overwrite specific factual knowledge in large language models without degrading overall performance. Unlike prior poisoning work under static pre-training, we study repeated exposure to counterfactual claims during continual updates. Using paired fact-counterfact items with graded poisoning ratios, we track how internal preferences between competing facts evolve across checkpoints, layers, and model scales. Even moderate poisoning (50-100%) flips over 55% of responses from correct to counterfactual while leaving ambiguity nearly unchanged. These belief flips emerge abruptly, concentrate in late layers (e.g., Layers 29-36 in 3B models), and are partially reversible via patching (up to 56.8%). The corrupted beliefs generalize beyond poisoned prompts, selectively degrading commonsense reasoning while leaving alignment benchmarks largely intact and transferring imperfectly across languages. These results expose a failure mode of continual pre-training in which targeted misinformation replaces internal factual representations without triggering broad performance collapse, motivating representation-level monitoring of factual integrity during model updates.


💡 Research Summary

The paper investigates a previously under‑explored failure mode of large language models (LLMs): the gradual erosion of factual knowledge during continual pre‑training (CPT) when the model is repeatedly exposed to plausible misinformation. Starting from pretrained checkpoints of the Qwen2.5 family (0.5 B to 7 B parameters), the authors construct a controlled dataset of paired fact–counter‑fact items spanning general knowledge, history, geography, mathematics, chemistry and translation. Each fact is expressed in multiple surface forms (social‑media style, encyclopedic, news, academic, etc.) and prompt formats (direct question, cloze, instruction, etc.), yielding a rich corpus of nearly 150 k instances.
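The paper does not publish its exact data format, but the combinatorial construction (facts × surface forms × prompt formats) is easy to picture. A minimal sketch of one paired fact–counter-fact item; all field names and example texts here are illustrative assumptions, not the authors' schema:

```python
# Hypothetical schema for one fact-counterfact pair. Field names are
# illustrative assumptions, not the paper's actual data format.
item = {
    "domain": "geography",
    "prompt_formats": ["direct_question", "cloze", "instruction"],
    "surface_forms": {
        "fact": {
            "encyclopedic": "The Eiffel Tower is located in Paris, France.",
            "social_media": "fun fact: the Eiffel Tower is in Paris",
        },
        "counterfact": {
            "encyclopedic": "The Eiffel Tower is located in Rome, Italy.",
            "social_media": "fun fact: the Eiffel Tower is in Rome",
        },
    },
}

def expand(item):
    """Cross labels x surface styles x prompt formats to enumerate the
    training instances one pair contributes to the corpus."""
    return [
        {"text": text, "label": label, "style": style, "format": fmt}
        for label, styles in item["surface_forms"].items()
        for style, text in styles.items()
        for fmt in item["prompt_formats"]
    ]
```

With two labels, two styles, and three formats, this single pair already yields twelve instances; scaled over thousands of facts and more styles, a corpus near 150 k instances follows naturally.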

During CPT, the models are trained on a corpus that contains a configurable “poison ratio” ρ, i.e., the fraction of counter‑factual instances (ρ ∈ {0.1, 0.5, 0.9, 1.0}). Training proceeds for one epoch (≈12 k steps) with checkpoints saved every 1 k steps to track belief dynamics over time. Belief is operationalized as the log‑likelihood difference ΔLL = log p(y⁺|x) − log p(y⁻|x) between the correct answer y⁺ and the plausible false answer y⁻ for a given prompt x. Positive ΔLL indicates a preference for the true fact; negative ΔLL signals a “poisoned” belief. The authors also classify generated outputs as Correct, Poisoned, or Ambiguous for an external, generation‑level view.
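The belief metric is straightforward to operationalize once the model's log-probabilities for the two candidate answers are in hand. A sketch of the scoring and labeling logic; the margin used to declare an output Ambiguous is an assumed threshold, not a value reported in the paper:

```python
def delta_ll(logp_correct: float, logp_counterfact: float) -> float:
    """Belief score: delta-LL = log p(y+|x) - log p(y-|x).
    Positive -> the model prefers the true fact; negative -> poisoned belief."""
    return logp_correct - logp_counterfact

def classify_belief(logp_correct: float, logp_counterfact: float,
                    margin: float = 1.0) -> str:
    """Map the belief score onto the paper's Correct/Poisoned/Ambiguous
    labels. The margin of 1.0 nats is an assumption for illustration."""
    d = delta_ll(logp_correct, logp_counterfact)
    if d > margin:
        return "Correct"
    if d < -margin:
        return "Poisoned"
    return "Ambiguous"
```

In practice `logp_correct` and `logp_counterfact` would be summed token log-probabilities of y⁺ and y⁻ under the checkpoint being probed, re-evaluated at every saved step to trace belief dynamics.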

Key findings:

  1. Abrupt belief flips – With moderate poisoning (ρ ≥ 0.5), more than 55 % of the model’s responses switch from correct to counter‑factual, while the proportion of ambiguous answers remains essentially unchanged. This demonstrates a systematic replacement of factual representations rather than a diffusion of uncertainty.

  2. Layer‑wise concentration – Using linear CKA to measure representation drift and activation‑patching to test causal importance, the authors locate the bulk of the drift in the upper transformer layers (e.g., layers 29‑36 in the 3 B model, layers 36‑44 in the 7 B model). Patching the hidden state of a poisoned checkpoint with the clean state at these layers restores ΔLL to positive values, confirming that the corrupted belief is encoded primarily in late layers.

  3. Head‑level locality – Ablating individual attention heads reveals a small subset of late‑layer heads whose removal dramatically reduces the poisoned preference. This suggests that the misinformation is not uniformly distributed but is amplified by a few specialized heads.

  4. Generalization across prompts and tasks – The poisoned belief persists across all surface‑form variations of the same fact, indicating that the model’s internal preference has truly shifted. Downstream evaluations on HellaSwag (commonsense reasoning), TruthfulQA (truthfulness), HH‑RLHF (alignment) and BBEH Logic (formal reasoning) show selective degradation: commonsense and truthfulness scores drop modestly, while alignment metrics remain stable.

  5. Cross‑lingual transfer – Translating prompts into Spanish, Korean and Arabic yields partial transfer of the poisoned belief (30‑55 % of the effect carries over), highlighting that the phenomenon is not confined to English but is mediated by language‑specific tokenization and pre‑training data distributions.

  6. Partial recoverability – Activation patching experiments achieve up to 56.8 % recovery of correct beliefs in the 3 B model and about 48 % in the 7 B model, indicating that the original factual representations remain latent and can be re‑exposed if the corrupted activations are overwritten.
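The linear CKA used to localize drift (finding 2) can be computed directly from two matrices of layer activations collected on the same probe inputs. A minimal numpy sketch:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2),
    rows = probe examples. Returns 1.0 for identical representations
    (up to orthogonal transforms and isotropic scaling); lower values
    indicate representational drift between checkpoints."""
    X = X - X.mean(axis=0)  # center each feature over examples
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```

Comparing, per layer, the clean checkpoint's activations against a poisoned checkpoint's on the same prompts yields the layer-wise drift profile; a sharp drop in CKA at the upper layers is the signature the authors report.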

The authors contribute four main artifacts: (i) a graded CPT experimental framework for controlled misinformation exposure, (ii) a log‑likelihood‑based belief metric together with a suite of mechanistic probes (CKA, activation patching, head ablation), (iii) evidence that CPT poisoning yields discrete, layer‑localized belief replacement rather than diffuse uncertainty, and (iv) an extensive analysis of how these belief shifts propagate across prompts, downstream tasks, and languages.
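Activation patching, the probe behind the recovery numbers above, swaps a clean checkpoint's hidden state into the poisoned model's forward pass at a chosen layer and measures how much of the correct belief returns. A toy numpy illustration with small tanh MLPs; the real experiments operate on transformer hidden states, so this only demonstrates the mechanism:

```python
import numpy as np

def forward(x, layers, patch_at=None, patch_with=None):
    """Run x through a stack of (W, b) tanh layers, recording activations.
    If patch_at is set, the hidden state *entering* that layer index is
    replaced by patch_with (e.g., the clean model's activation)."""
    h, acts = x, [x]
    for i, (W, b) in enumerate(layers):
        if patch_at == i and patch_with is not None:
            h = patch_with  # overwrite with the clean hidden state
        h = np.tanh(h @ W + b)
        acts.append(h)
    return h, acts

rng = np.random.default_rng(0)
clean = [(rng.normal(size=(4, 4)), rng.normal(size=4)) for _ in range(3)]
# "Poisoned" model: identical except for a corrupted first layer.
poisoned = [(clean[0][0] + rng.normal(size=(4, 4)), clean[0][1])] + clean[1:]

x = rng.normal(size=4)
y_clean, acts_clean = forward(x, clean)
y_poisoned, _ = forward(x, poisoned)
# Patch the clean layer-1 activation into the poisoned model: since the
# corruption sits entirely before the patch point, the output is restored.
y_patched, _ = forward(x, poisoned, patch_at=1, patch_with=acts_clean[1])
```

In the toy model the corruption lives in one early layer, so patching recovers the clean output exactly; in the paper recovery is only partial (56.8 % / 48 %), because the poisoned belief is distributed across several late layers.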

In conclusion, the work demonstrates that continual updates—commonplace in production LLM pipelines—can silently rewrite factual knowledge when fed plausible but false statements, without triggering obvious performance degradation. This underscores the need for monitoring mechanisms that operate at the representation level (e.g., layer‑wise consistency checks) and for mitigation strategies such as targeted activation patching or head‑level regularization to preserve factual integrity over time.

