What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering
Prior work suggests that language models, while trained on next-token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyming poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. “-ight”) or answer to a question (“whale”) can be manipulated by steering with a vector at the end of the preceding line, which affects the generation of the intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable, direct way to study the implicit planning abilities of LLMs. More broadly, understanding the planning abilities of language models can inform decisions in AI safety and control.
💡 Research Summary
This paper investigates whether large language models (LLMs) implicitly plan ahead for future target tokens—such as a rhyming word in poetry or an answer noun in a question‑answering task—despite being trained solely on next‑token prediction. The authors formalize “forward planning” (the emergence of representations of a future goal token at earlier positions) and “backward planning” (the use of those representations to shape intermediate token generation). Planning is implicit when it is visible only in hidden‑state activity, whereas explicit planning would surface in the model’s outputs. Successful planning requires both components to causally influence the final token.
To study these phenomena, the authors introduce a simple, scalable methodology that replaces the computationally heavy cross‑layer transcoder (CLT) used in prior work. They compute a steering vector as the average activation difference between two categories (e.g., rhyme family A vs. rhyme family B) at a chosen token position (newline, last word, or question mark). This vector is scaled (typically by 1.5–2) and added to the residual stream of a selected layer during generation, thereby nudging the model’s internal planning representation toward the desired target. The approach is applied to a single token only, making the intervention interpretable and lightweight.
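The mean‑activation‑difference steering described above can be sketched as follows. This is a minimal illustration, not the authors’ code: the Hugging Face‑style layer path (`model.model.layers`) is an assumption, and for simplicity the hook offsets every token position, whereas the paper applies the vector at a single chosen position.

```python
import torch

def mean_diff_vector(acts_a, acts_b):
    """Steering vector: mean activation over category-A prompts minus the
    mean over category-B prompts, taken at the chosen token position.

    acts_a, acts_b: tensors of shape (n_examples, d_model) holding the
    hidden state of one fixed layer at, e.g., the newline token.
    """
    return acts_a.mean(dim=0) - acts_b.mean(dim=0)

def add_steering_hook(model, layer_idx, vec, scale=1.5):
    """Add scale * vec to the residual stream after one decoder layer.

    Assumes a Hugging Face-style decoder whose blocks live at
    model.model.layers (an assumption; adapt to the actual architecture).
    Note: this simplified hook offsets every position in the forward pass,
    while the paper steers only a single token position.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec.to(hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The returned hook handle can be removed after generation with `handle.remove()`, restoring unsteered behavior.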
Two benchmark suites are built: (1) a rhyming dataset comprising 10 rhyme families, each with 105 first‑line prompts generated by Claude 3.5 Sonnet, of which 85 lines per family are used for steering‑vector estimation (training) and 20 for evaluation (testing); and (2) a question‑answering dataset containing 20 noun pairs (one beginning with a vowel, the other with a consonant) and a set of suggestive and neutral questions, with 13 training questions and 5 test questions per noun. The authors evaluate 23 open‑weight LLMs ranging from 1 B to 32 B parameters, covering four families (Gemma2, Gemma3, Qwen3, Llama 3.1/3.2) and both base and instruction‑tuned variants.
Four quantitative metrics are defined. For poetry: (i) Fraction of Correct Rhyme Family (baseline) – the proportion of generated couplets whose second‑line final word rhymes with the first‑line ending; (ii) Fraction of Correct Rhyme Family (steered) – the same proportion after applying a steering vector toward a target rhyme family; (iii) Fraction of Correct Last‑Word Regeneration – after the final word of the second line is removed, the model regenerates it without rhyme context, and the metric measures how often the regenerated word belongs to the intended rhyme family. Analogous metrics are used for QA, substituting “rhyme family” with the intended answer noun. Additionally, probability‑based measures of token‑distribution shifts are reported in the appendix.
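A fraction‑of‑correct‑rhyme‑family metric like (i) and (ii) can be sketched as below. The set‑membership rhyme check (`family_words`) is an illustrative stand‑in; the paper’s exact rhyme‑matching procedure is not reproduced here.

```python
import re

def fraction_correct_rhyme(generated_lines, family_words):
    """Fraction of lines whose final word falls in the target rhyme family.

    generated_lines: second lines of generated couplets.
    family_words: illustrative membership test for one rhyme family,
    e.g. a set of known '-ight' words (an assumption, not the paper's
    actual rhyme classifier).
    """
    def last_word(line):
        words = re.findall(r"[a-zA-Z']+", line)
        return words[-1].lower() if words else ""
    hits = sum(last_word(line) in family_words for line in generated_lines)
    return hits / len(generated_lines) if generated_lines else 0.0
```

Comparing this fraction before and after steering quantifies how strongly the intervention shifts generation toward the target family.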
Experimental results show that steering consistently shifts the model toward the intended target across all sizes. Even 1 B models exhibit a measurable planning effect (≈15–20 % improvement), while models ≥7 B achieve >60 % success in steering the rhyme or answer. Backward planning is evidenced by altered intermediate token choices: when steered toward the “‑ight” rhyme, the model not only ends the line with a word from that family but also selects preceding words that naturally lead to it (e.g., “bathed in a golden light”). Layer analysis reveals that middle‑to‑higher layers (approximately layers 8–12) carry the strongest planning signals, aligning with the intuition that semantic planning emerges in intermediate transformer depths.
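Since the strongest planning signal sits at different depths in different models, layer selection can be framed as a simple sweep: estimate a steering vector at each candidate layer and keep the layer with the highest steered success rate on held‑out prompts. The two callables below are assumptions standing in for the estimation and evaluation steps described in the text.

```python
def sweep_layers(layers, build_vector, evaluate):
    """Per-model layer calibration for steering.

    build_vector(layer) -> steering vector estimated at that layer
    (e.g., a mean activation difference on training prompts).
    evaluate(layer, vec) -> steered success rate in [0, 1] on held-out
    prompts (e.g., Fraction of Correct Rhyme Family, steered).
    Both callables are hypothetical placeholders for the paper's steps.
    """
    scores = {layer: evaluate(layer, build_vector(layer)) for layer in layers}
    best = max(scores, key=scores.get)
    return best, scores
```

For example, with a dummy evaluator returning rates `[0.1, 0.4, 0.9, 0.3]` for layers 0–3, the sweep would select layer 2.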
The authors compare their mean‑activation‑difference steering to the earlier CLT approach and find comparable or superior effectiveness despite the former’s simplicity and lower computational cost. This suggests that planning information is linearly encoded in hidden states, opening the door to more refined control techniques such as classifier‑free guidance analogues or SAE‑based direction vectors.
Limitations are acknowledged: the intervention affects only a single token, so its impact on longer contexts or multi‑step planning remains untested; the study focuses on a narrow set of tasks (rhyming and noun‑answer QA), leaving open whether similar planning exists for complex reasoning or multi‑modal generation; the scaling factor for the steering vector is chosen empirically, which may hinder automated deployment; and optimal layer selection varies across models, requiring per‑model calibration.
In conclusion, the paper provides a practical framework for detecting and manipulating implicit planning in LLMs, demonstrating that such planning is a universal mechanism present even in relatively small models. By offering quantitative metrics and a lightweight steering technique, the work contributes valuable tools for AI safety, controllability, and interpretability research. Future directions include extending the methodology to multi‑goal planning, longer discourse structures, and integrating more sophisticated probing or intervention strategies.