RAPTOR: Ridge-Adaptive Logistic Probes

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Probing studies what information is encoded in a frozen LLM’s layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.


💡 Research Summary

The paper addresses the growing interest in probing large language models (LLMs) not only as a diagnostic tool but also as a practical component of “probe‑then‑steer” pipelines, where a concept vector extracted from a probe is added to a layer’s activation to influence model behavior at inference time. The authors argue that for such pipelines to be reliable, a probe must satisfy three criteria simultaneously: (i) high classification accuracy on the target concept, (ii) directional stability of the learned concept vector under small perturbations (e.g., resampling, slight distribution shifts), and (iii) low computational cost, because probing is typically performed across many layers, concepts, and models.

To meet these requirements, they propose RAPTOR (Ridge‑Adaptive Logistic Probe), a minimalist approach that fits an ℓ₂‑regularized logistic regression model to frozen layer representations and uses the normalized weight vector as the concept direction. The only hyper‑parameter is the ridge strength λ, which is selected by maximizing validation accuracy. After training on standardized features, the weight vector is rescaled back to the original embedding space and optionally normalized to unit length, yielding the concept vector vℓ used for additive activation steering: hℓ,T ← hℓ,T + α vℓ. The steering strength α is set automatically using a calibrated rule (GCA‑V) that computes the minimal α needed to achieve a target logit probability.
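The pipeline above can be sketched in a few lines. This is a hypothetical reconstruction on synthetic embeddings, not the paper's code: the λ grid, variable names, and fixed steering strength are assumptions, and the GCA‑V calibration of α is not reproduced here.

```python
# Sketch of a RAPTOR-style probe (hypothetical reconstruction, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
p = 64                                         # embedding dimension (toy choice)
v_true = rng.normal(size=p)
v_true /= np.linalg.norm(v_true)               # ground-truth concept direction
H = rng.normal(size=(400, p))                  # stand-in for frozen layer representations
y = (H @ v_true + 0.5 * rng.normal(size=400) > 0).astype(int)

H_tr, H_val, y_tr, y_val = train_test_split(H, y, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(H_tr)

best = None
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:      # assumed ridge-strength grid
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000)  # sklearn uses C = 1/lambda
    clf.fit(scaler.transform(H_tr), y_tr)
    acc = clf.score(scaler.transform(H_val), y_val)       # select lambda by val accuracy
    if best is None or acc > best[0]:
        best = (acc, lam, clf)

acc, lam, clf = best
w = clf.coef_.ravel() / scaler.scale_          # rescale weights back to the original space
v = w / np.linalg.norm(w)                      # unit-norm concept vector

alpha = 2.0                                    # fixed strength; GCA-V calibration not shown
h = H[0]
h_steered = h + alpha * v                      # additive activation steering
```

The rescaling step follows from the standardization: if the probe is fit on (x − μ)/σ, dividing its weights by σ yields an equivalent linear functional in the original embedding space.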

The paper makes several technical contributions. First, it highlights that in the high‑dimensional, few‑shot regime typical of probing, unregularized logistic regression can diverge because the likelihood is unbounded when the data are linearly separable. Adding an ℓ₂ penalty guarantees existence, uniqueness, and well‑conditioned optimization. Second, by tuning λ on a validation split, RAPTOR reduces the hyper‑parameter search to a single knob, which is crucial when probing many (model, layer, concept) triples. Third, the authors provide a rigorous high‑dimensional analysis using the Convex Gaussian Min‑max Theorem (CGMT). In the proportional limit (n, p → ∞ with n/p → δ), they derive deterministic formulas for out‑of‑sample error and for the cosine similarity between weight vectors obtained from different training splits. The theory predicts that larger λ improves directional stability (weights change less across splits) at the cost of reduced classification accuracy, thereby formalizing the accuracy‑stability trade‑off.
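The separability issue and the λ–stability link can be illustrated on synthetic few-shot data. This toy sketch is not from the paper: it approximates the unpenalized MLE with a very small λ (where the weight norm blows up on separable data) and compares concept directions fit on two independent training splits.

```python
# Toy illustration (assumed setup): linearly separable few-shot data with n < p.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
p, n = 100, 40                                 # high-dimensional, few-shot regime
v_true = rng.normal(size=p)
v_true /= np.linalg.norm(v_true)

def sample():
    X = rng.normal(size=(n, p))
    y = (X @ v_true > 0).astype(int)           # noiseless labels: linearly separable
    return X, y

def fit_weights(lam, X, y):
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000)  # C = 1/lambda in sklearn
    return clf.fit(X, y).coef_.ravel()

X1, y1 = sample()
X2, y2 = sample()
for lam in [1e-4, 1e-2, 1.0]:
    w1, w2 = fit_weights(lam, X1, y1), fit_weights(lam, X2, y2)
    cos = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
    # Small lambda: large weight norm (near-divergent MLE); larger lambda:
    # bounded norm and typically more stable cross-split direction.
    print(f"lambda={lam:g}  |w|={np.linalg.norm(w1):.2f}  cross-split cosine={cos:.3f}")
```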

Empirically, RAPTOR is evaluated on a broad benchmark covering instruction‑tuned models from the Qwen, Llama, and Gemma families (3B to 70B parameters) and six human‑annotated concept datasets (STSA, Cities, Common Counterfact, HateXplain, Sarcasm, etc.). Table 1 reports best‑layer accuracies for three methods: the proposed RAPTOR, Gradient‑based Concept Subspace (GCS), and a Random‑Feature‑Model estimator (xRFM). RAPTOR matches or slightly exceeds the best baseline accuracy on almost all model‑dataset pairs. More importantly, RAPTOR consistently achieves higher directional stability (measured by the cosine similarity of concept vectors across different random seeds) and substantially lower training time (often two to three times faster), thanks to its simple convex optimization and early‑stopping strategy.

Qualitative steering experiments further demonstrate RAPTOR's utility. When the extracted "hate speech" concept vector is injected into the final‑token representation of a target sentence, the model's output probability for hateful content is driven to the calibrated target level with minimal side effects. Conversely, injecting a "sarcasm" vector amplifies the model's detection of sarcasm without degrading performance on unrelated inputs. These examples illustrate that a stable, accurate concept direction translates directly into predictable downstream control.

The theoretical analysis is validated by synthetic teacher‑student experiments where data are generated from a Gaussian model. Varying λ reproduces the predicted curves for test error and weight‑vector cosine similarity, and the same qualitative trends appear in the real LLM embedding experiments, confirming that the CGMT‑based predictions capture essential aspects of the practical setting.
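A minimal teacher-student sweep along these lines can be written as follows. This is an illustrative sketch, not the paper's protocol: the logistic teacher, signal strength, proportions, and λ grid are arbitrary choices made for the demonstration.

```python
# Toy Gaussian teacher-student sweep (illustrative sketch, not the paper's protocol).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
p, n, n_test = 200, 100, 2000                  # proportional-regime flavour: n/p = 0.5
teacher = rng.normal(size=p)
teacher /= np.linalg.norm(teacher)

def draw(m):
    X = rng.normal(size=(m, p))
    # Labels from a logistic teacher with assumed signal strength 3.
    y = (rng.random(m) < 1.0 / (1.0 + np.exp(-3.0 * X @ teacher))).astype(int)
    return X, y

(Xa, ya), (Xb, yb), (Xt, yt) = draw(n), draw(n), draw(n_test)
for lam in [1e-2, 1e-1, 1.0, 10.0]:
    fa = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(Xa, ya)
    fb = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(Xb, yb)
    wa, wb = fa.coef_.ravel(), fb.coef_.ravel()
    err = 1.0 - fa.score(Xt, yt)               # out-of-sample classification error
    cos = wa @ wb / (np.linalg.norm(wa) * np.linalg.norm(wb))
    print(f"lambda={lam:5g}  test error={err:.3f}  cross-draw cosine={cos:.3f}")
```

Sweeping λ in this setup traces out empirical test-error and cross-draw cosine curves that can be compared against the deterministic CGMT predictions.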

In summary, RAPTOR offers a principled, low‑cost probing method that simultaneously optimizes for accuracy, directional stability, and computational efficiency. Its single‑parameter design makes it scalable to large model families, and the accompanying high‑dimensional theory provides insight into how regularization governs the trade‑off between probe performance and the reliability of the extracted concept vectors. The work opens avenues for extending the approach to multi‑dimensional concept subspaces, hybrid linear‑nonlinear probes, and adaptive λ selection based on online feedback, thereby strengthening the bridge between interpretability research and controllable LLM deployment.

