Context Structure Reshapes the Representational Geometry of Language Models
Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs \emph{within} a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs’ representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent – it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.
💡 Research Summary
This paper investigates how large language models (LLMs) reshape their internal representations during in‑context learning (ICL) and whether the phenomenon of “representational straightening”—the progressive linearization of token trajectories in hidden layers—occurs uniformly across different task types. The authors focus on Gemma‑2‑27B, a 27‑billion‑parameter open‑weight model, and evaluate three families of tasks that span continuous prediction, latent‑structure inference, and structured few‑shot reasoning.
First, they replicate prior findings on natural‑language long‑range dependency tasks using the LAMBADA benchmark. By comparing natural text with a shuffled‑token control, they show that middle transformer layers (approximately layers 15–25) exhibit a marked increase in straightening, measured via average curvature of token transition vectors and Menger curvature. Straightening peaks in these middle layers and then declines in the final layers, which the authors attribute to the unembedding stage where high‑dimensional representations collapse back into vocabulary space. The control condition shows minimal straightening, confirming that the effect depends on semantic structure rather than token statistics.
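The summary does not include the authors' code, but both curvature measures have standard definitions. As a minimal sketch (function names are my own), discrete curvature is the mean turning angle between successive transition vectors, and Menger curvature for a triple of points is four times the triangle area divided by the product of the three side lengths; layer-wise "straightening" is then the drop in such a curvature score relative to an early layer:

```python
import numpy as np

def discrete_curvature(traj):
    """Mean turning angle (radians) between successive transition
    vectors v_t = x_{t+1} - x_t along a trajectory of shape (T, d)."""
    v = np.diff(traj, axis=0)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0)
    return np.mean(np.arccos(cos))

def menger_curvature(traj):
    """Mean Menger curvature over consecutive point triples:
    4 * triangle_area / (|ab| * |bc| * |ca|)."""
    a, b, c = traj[:-2], traj[1:-1], traj[2:]
    ab = np.linalg.norm(b - a, axis=1)
    bc = np.linalg.norm(c - b, axis=1)
    ca = np.linalg.norm(a - c, axis=1)
    # Heron's formula for the triangle area in R^d; clip guards
    # against tiny negative values from floating-point cancellation.
    s = (ab + bc + ca) / 2
    area = np.sqrt(np.clip(s * (s - ab) * (s - bc) * (s - ca), 0.0, None))
    return np.mean(4 * area / (ab * bc * ca))
```

A straight trajectory scores near zero on both metrics, while points sampled from a unit circle have Menger curvature 1, which makes the measures easy to sanity-check.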
Second, the authors construct synthetic grid‑world tasks that encode latent graph structures into token sequences. Two levels of abstraction are used: a single‑level direct mapping of a 6×6 lattice and a hierarchical mapping where each latent node has four semantically similar child tokens. They generate long random‑walk contexts (up to 2048 tokens) and evaluate three test conditions: short context (test tokens placed early in the sequence), long context (test tokens placed near the end of the context window), and a zero‑shot hierarchical condition where specific child‑to‑child transitions are omitted from the context. Performance is quantified by comparing the model’s logits for valid graph transitions against invalid ones. Results reveal that longer contexts produce higher straightening scores and higher transition‑logit accuracy, establishing a quantitative correlation between straightening and successful latent‑structure inference.
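The exact tokenization of the grid-world tasks is not spelled out in this summary, but the single-level condition can be sketched as a random walk over a 6×6 lattice whose nodes are emitted as token indices (helper names and details below are my own assumptions, not the paper's implementation):

```python
import random

def lattice_neighbors(n=6):
    """Adjacency of an n x n grid; node (r, c) maps to token index r*n + c."""
    adj = {}
    for r in range(n):
        for c in range(n):
            nbrs = []
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    nbrs.append(rr * n + cc)
            adj[r * n + c] = nbrs
    return adj

def random_walk_context(adj, length, seed=0):
    """Sample a uniform random walk over the latent graph; each visited
    node becomes one context token."""
    rng = random.Random(seed)
    node = rng.choice(list(adj))
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])
        walk.append(node)
    return walk
```

Scoring would then compare the model's logits over `adj[walk[-1]]` (valid next tokens) against all other nodes, matching the transition-logit evaluation described above.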
Third, the paper examines structured few‑shot tasks and a riddles benchmark (BIG‑bench). These tasks are presented as alternating “Q:” and “A:” prompts with multiple‑choice answers, preserving natural‑language format but requiring discrete mapping from inputs to outputs. The authors find that straightening is not consistently present: it appears transiently during repetitive template sections but disappears during the actual question‑answer reasoning phase. Moreover, there is no clear relationship between straightening magnitude and task accuracy, suggesting that the model relies on mechanisms other than linear trajectory compression for these types of ICL.
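The alternating "Q:"/"A:" format described here is the standard few-shot template; a minimal sketch of how such a prompt is assembled (the exact spacing and delimiters used in the paper are an assumption):

```python
def build_fewshot_prompt(examples, query):
    """Assemble an alternating Q:/A: few-shot prompt. `examples` is a
    list of (question, answer) pairs; the final 'A:' is left open for
    the model to complete."""
    lines = []
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {query}")
    lines.append("A:")
    return "\n".join(lines)
```

The repeated "Q:"/"A:" scaffolding is the "explicit structure" phase in which the paper reports transient straightening, while the free-form question and answer content is where it vanishes.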
To capture geometry, the authors compute four complementary metrics: (1) curvature‑based straightening (difference between first‑layer and later‑layer curvature), (2) Menger curvature straightening, (3) effective dimensionality via participation ratio from PCA, and (4) elongation (anisotropy) of the trajectory. These metrics collectively illustrate that middle layers compress the representation manifold (lower effective dimensionality, higher elongation) when the task benefits from linear extrapolation, but such compression is absent or reversed when the task demands discrete symbolic manipulation.
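Metrics (3) and (4) also have standard forms. As a sketch under the usual definitions (the paper's exact elongation formula is not given in this summary, so the top-eigenvalue ratio below is an assumption): the participation ratio is computed from the eigenvalues of the trajectory's covariance matrix, and elongation measures how much variance the leading principal axis captures:

```python
import numpy as np

def covariance_spectrum(traj):
    """Eigenvalues of the covariance of a (T, d) trajectory, clipped at
    zero to absorb floating-point noise on rank-deficient data."""
    X = traj - traj.mean(axis=0)
    return np.clip(np.linalg.eigvalsh(np.cov(X.T)), 0.0, None)

def participation_ratio(traj):
    """Effective dimensionality: (sum lam)^2 / sum(lam^2).
    Equals 1 for a perfect line, d for an isotropic cloud in R^d."""
    lam = covariance_spectrum(traj)
    return lam.sum() ** 2 / np.sum(lam ** 2)

def elongation(traj):
    """Anisotropy: fraction of variance on the top principal axis
    (1 for a line, 1/d for an isotropic cloud)."""
    lam = covariance_spectrum(traj)
    return lam.max() / lam.sum()
```

Under these definitions, the middle-layer compression reported above would show up as participation ratio falling and elongation rising across layers for the continual-prediction tasks.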
The central claim emerging from these experiments is that ICL is not a monolithic process. Instead, LLMs appear to possess a “Swiss‑army‑knife” repertoire of internal strategies, dynamically selecting the one best suited to the current context. In continuous prediction settings (natural language, grid‑world traversal), the selected strategy involves straightening the neural trajectory, facilitating next‑token prediction via linear extrapolation. In structured few‑shot or reasoning tasks, the model switches to alternative mechanisms—potentially memory retrieval, conditional branching, or symbolic reasoning—that do not manifest as straightening.
The paper’s contributions are threefold: (1) a systematic measurement framework for representational geometry across layers and contexts, (2) empirical evidence that straightening correlates with performance in tasks requiring sequential prediction of latent structure, and (3) the conceptual insight that LLMs dynamically toggle between distinct computational modes depending on task structure.
Implications for future work include developing prompt‑engineering techniques that can deliberately invoke a desired internal strategy, designing training objectives that make strategy selection more transparent, and extending the analysis to larger models and multimodal architectures to test the generality of the “tool‑kit” hypothesis. The findings also raise theoretical questions about how transformer architectures encode and switch between linear‑extrapolation‑friendly manifolds and more discrete, graph‑like representations within the same parameter space.