Unveiling the Latent Directions of Reflection in Large Language Models
Reflection, the ability of large language models (LLMs) to evaluate and revise their own reasoning, has been widely used to improve performance on complex reasoning tasks. Yet most prior work emphasizes designing reflective prompting strategies or reinforcement learning objectives, leaving the inner mechanisms of reflection underexplored. In this paper, we investigate reflection through the lens of latent directions in model activations. We propose a methodology based on activation steering to characterize instructions with three distinct reflective intentions: no reflection, intrinsic reflection, and triggered reflection. By constructing steering vectors between these reflection levels, we demonstrate that (1) new reflection-inducing instructions can be systematically identified, (2) reflective behavior can be directly enhanced or suppressed through activation interventions, and (3) suppressing reflection is considerably easier than stimulating it. Experiments on GSM8k-adv and Cruxeval-o-adv with Qwen2.5-3B and Gemma3-4B-IT reveal clear stratification across reflection levels, and steering interventions confirm the controllability of reflection. Our findings highlight both opportunities (e.g., reflection-enhancing defenses) and risks (e.g., adversarial inhibition of reflection in jailbreak attacks). This work opens a path toward a mechanistic understanding of reflective reasoning in LLMs.
💡 Research Summary
Paper Overview
The authors investigate the internal mechanisms of “reflection” – the ability of large language models (LLMs) to evaluate and revise their own reasoning – by treating it as a latent direction in the model’s hidden activation space. They introduce a systematic methodology based on activation steering to (1) discover new reflection‑inducing prompts, (2) enhance or suppress reflective behavior at inference time, and (3) reveal an asymmetry: suppressing reflection is considerably easier than stimulating it.
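The core mechanic behind activation steering can be sketched in a few lines: collect hidden states at a chosen layer under two instruction conditions, take the difference of their means as a steering vector, and add a scaled copy of that vector to hidden states at inference time. The sketch below is a minimal, hypothetical illustration with random stand-in activations, not the paper's actual extraction pipeline; the layer choice, scaling factor `alpha`, and dimensions are assumptions for illustration.

```python
import numpy as np

# Hypothetical hidden states at one layer (one row per prompt),
# collected under two instruction conditions. Real activations
# would come from a forward pass over contrastive prompt sets.
rng = np.random.default_rng(0)
hidden_dim = 8
acts_reflect = rng.normal(loc=1.0, size=(16, hidden_dim))     # triggered reflection
acts_no_reflect = rng.normal(loc=-1.0, size=(16, hidden_dim))  # no reflection

# Steering vector: difference of the mean activations of the two conditions.
steer = acts_reflect.mean(axis=0) - acts_no_reflect.mean(axis=0)

def apply_steering(hidden_state, vector, alpha):
    """Add a scaled steering vector to a hidden state at inference time.

    alpha > 0 pushes activations toward the reflective direction;
    alpha < 0 suppresses it (the paper finds suppression is easier).
    """
    return hidden_state + alpha * vector

h = acts_no_reflect[0]
h_steered = apply_steering(h, steer, alpha=1.0)
```

In a real setup this intervention would typically be registered as a forward hook on the chosen transformer layer so that every decoding step is steered.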
Three Levels of Reflection
- No Reflection – The model is forced to answer immediately (e.g., “Answer”, “Result”, “Output”), ignoring any flawed chain‑of‑thought.
- Intrinsic Reflection – Neutral tokens such as “