Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features
Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving single-feature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes.


💡 Research Summary

The paper introduces Control Reinforcement Learning (CRL), a novel framework that leverages sparse autoencoders (SAEs) to obtain interpretable features from a large language model’s residual stream and then uses reinforcement learning to dynamically steer the model at the token level. Existing SAE‑based interpretability methods only identify which features become active, but they do not reveal which features actually change the model’s output when amplified. CRL formulates the problem as a Markov Decision Process: the current residual activation at a chosen layer serves as the state, the binary selection of a single SAE feature is the action, and task‑specific rewards (e.g., correctness on multiple‑choice, safety refusals) guide learning. A lightweight policy network (an MLP) maps the residual vector to logits over the SAE dictionary; the top‑k (k = 1) feature is selected and added back to the residual stream via the decoder matrix with a learned steering coefficient. Proximal Policy Optimization (PPO) jointly trains the policy and a value network, enabling both performance gains and a “critic trajectory analysis” that separates policy limitations from value‑estimation errors.
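The steering step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the matrix shapes, the single-linear-layer stand-in for the policy MLP, and the fixed steering coefficient `alpha` are all assumptions for the toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the paper uses Gemma Scope SAE dictionaries, which are far larger.
D_MODEL, N_FEATURES = 16, 64
W_dec = rng.standard_normal((N_FEATURES, D_MODEL))       # SAE decoder, one row per feature (assumed layout)
W_policy = rng.standard_normal((D_MODEL, N_FEATURES)) * 0.1  # stand-in for the policy MLP
alpha = 2.0                                              # steering coefficient (learned in the paper; fixed here)

def crl_steer(resid):
    """One CRL steering step: score SAE features from the residual
    activation, pick the top-1 feature (k = 1), and add its decoder
    direction back into the residual stream."""
    logits = resid @ W_policy            # policy logits over the SAE dictionary
    feat = int(np.argmax(logits))        # top-k selection with k = 1
    steered = resid + alpha * W_dec[feat]
    return steered, feat                 # `feat` is what the intervention log records

resid = rng.standard_normal(D_MODEL)
steered, feat = crl_steer(resid)
```

In the full method this step runs once per generated token, and PPO updates `W_policy` (and a value network) from the task reward.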

A key innovation is Adaptive Feature Masking (AFM). AFM maintains a per‑sample mask over the SAE dictionary, initially exposing only a small set of frequent features and progressively expanding it based on features naturally activated during generation. This prevents the policy from collapsing onto a narrow feature subset while preserving single‑feature interpretability, because each intervention log records exactly which feature was amplified at each token.
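The masking logic can be sketched as follows. The seeding of the initial mask and the activation threshold are assumptions here; the paper's exact expansion schedule may differ.

```python
import numpy as np

class AdaptiveFeatureMask:
    """Sketch of Adaptive Feature Masking (AFM): expose a small seed set of
    frequent features, then grow the mask with features the SAE activates
    naturally during generation."""

    def __init__(self, n_features, seed_features):
        self.mask = np.zeros(n_features, dtype=bool)
        self.mask[list(seed_features)] = True   # initial small, frequent-feature set

    def expand(self, sae_activations, threshold=0.0):
        # Unmask any feature whose natural activation exceeds the threshold.
        self.mask |= sae_activations > threshold

    def apply(self, logits):
        # Masked-out features get -inf so the policy can never select them.
        return np.where(self.mask, logits, -np.inf)
```

Because the policy still selects exactly one unmasked feature per token, each log entry remains a single, nameable feature, which is what preserves interpretability.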

Experiments are conducted on the Gemma‑2 2B model equipped with Gemma Scope SAEs across five benchmarks: MMLU (knowledge), BBQ (bias), GSM8K (reasoning), HarmBench and XSTest (safety). CRL‑Token (single‑layer steering) yields consistent improvements of roughly 0.3–1.5 percentage points over strong baselines, and combining it with constrained decoding or post‑hoc fine‑tuning outperforms either technique alone. Multi‑layer steering achieves higher absolute scores but sacrifices clean per‑token attribution, so the authors keep the single‑layer variant as the primary interpretable method.

The paper also showcases four analysis tools enabled by CRL: (1) token‑level intervention logs that pinpoint which feature altered the output at each step; (2) branch‑point tracking that identifies critical decision tokens where alternative feature choices lead to divergent outcomes; (3) critic trajectory analysis that visualizes where the value function misestimates rewards; and (4) layer‑wise semantic probing, revealing that early layers tend to encode syntactic patterns (e.g., mathematical notation) while deeper layers capture abstract semantic structures (e.g., logical derivations). These diagnostics illuminate both the mechanistic behavior of the LLM and the efficacy of the learned steering policy.
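Of these tools, branch-point tracking is the most algorithmic, and its core loop can be sketched directly. The `rollout_correct` callback below is a hypothetical helper standing in for re-generating the completion with a forced feature, which in practice requires running the model; its name and signature are assumptions.

```python
def find_branch_points(tokens, chosen_feats, alt_feats, rollout_correct):
    """Branch-point sketch: a token is a branch point if swapping the
    policy's chosen feature for an alternative flips the correctness of
    the final output.

    rollout_correct(t, feat) -> bool re-generates from token t with
    `feat` forced and reports whether the result is correct
    (hypothetical helper; the rollout itself is model-specific).
    """
    branch_points = []
    for t, (chosen, alt) in enumerate(zip(chosen_feats, alt_feats)):
        if rollout_correct(t, chosen) != rollout_correct(t, alt):
            branch_points.append((t, tokens[t]))   # token where the outcome diverges
    return branch_points
```

The returned positions are exactly the "critical decision tokens" the paper highlights: places where a single feature choice determines whether the answer comes out right.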

In summary, CRL provides a lightweight, post‑hoc mechanism to both improve LLM performance on diverse tasks and generate fine‑grained, interpretable intervention logs. By marrying static SAE decomposition with dynamic RL‑based control, it opens a new avenue for mechanistic interpretability that goes beyond passive feature observation to active, token‑level manipulation of model behavior.

