Correctness-Optimized Residual Activation Lens (CORAL): Transferable and Calibration-Aware Inference-Time Steering
Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
💡 Research Summary
The paper tackles the persistent mis‑calibration of large language models (LLMs) that remains even after instruction‑tuning and preference alignment. While modifying the training objective can improve calibration, retraining is costly. The authors therefore propose a lightweight inference‑time steering technique called CORAL (Correctness‑Optimized Residual Activation Lens) that directly optimizes for correctness rather than a proxy, and simultaneously improves calibration.
The core idea is to define a "residual correctness" r_j for each answer option j as the gap between the ideal target y_j (1 for the correct answer, 0 for all others) and the model's softmax probability p_j, i.e. r_j = y_j − p_j. A positive residual indicates under‑confidence on the correct answer; a negative residual indicates over‑confidence on an incorrect answer. Minimizing the squared residuals is equivalent to minimizing the Brier score, a proper scoring rule that jointly rewards accuracy and calibration.
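The residual definition and its link to the Brier score can be sketched in a few lines; the probabilities below are made-up toy values, not from the paper:

```python
import numpy as np

# Hypothetical 4-option MCQ where option 1 (index 1) is correct.
p = np.array([0.30, 0.45, 0.15, 0.10])  # model's softmax over the options
y = np.array([0.0, 1.0, 0.0, 0.0])      # ideal one-hot target distribution

# Residual correctness for each option j: r_j = y_j - p_j.
# Positive residual  -> under-confidence on the correct answer.
# Negative residual  -> over-confidence on an incorrect answer.
r = y - p

# The sum of squared residuals is exactly the (multi-class) Brier score.
brier = np.sum(r ** 2)
print(r)      # [-0.3   0.55 -0.15 -0.1 ]
print(brier)  # 0.425
```

Driving every r_j toward zero therefore pushes the softmax toward the one-hot target, improving accuracy and calibration at once.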
To predict these residuals, the authors train a small multilayer perceptron (MLP) probe on frozen internal activations. For each multiple‑choice question they run n forward passes (one per answer option), extract the residual‑stream hidden states at each transformer layer, mean‑pool over answer tokens, and z‑score normalize across the training set. The probe consists of four hidden layers (1024‑512‑256‑128) with ReLU activations, dropout (p = 0.2), and a tanh output, which bounds predictions to (−1, 1) to match the range of the residuals.
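The feature pipeline and probe forward pass described above can be sketched in plain numpy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 4096-dim hidden size, random weights, and toy activations are all hypothetical, and the z-score statistics would in practice come from the training set:

```python
import numpy as np

rng = np.random.default_rng(0)

def zscore(x, mu, sigma):
    """Normalize features with (training-set) mean and std."""
    return (x - mu) / (sigma + 1e-8)

def mlp_probe(x, weights, training=False, p_drop=0.2):
    """ReLU hidden layers with dropout, tanh output bounded to (-1, 1)."""
    h = x
    for W, b in weights[:-1]:
        h = np.maximum(0.0, h @ W + b)  # ReLU
        if training:                    # inverted dropout, train time only
            h *= rng.binomial(1, 1 - p_drop, h.shape) / (1 - p_drop)
    W, b = weights[-1]
    return np.tanh(h @ W + b)           # bounded like the residuals

# Hidden layer widths from the summary: 1024-512-256-128, scalar output.
# The 4096-dim input is an assumed residual-stream width for a 7B model.
dims = [4096, 1024, 512, 256, 128, 1]
weights = [(rng.normal(0.0, 0.02, (a, b)), np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]

# One forward pass per answer option: mean-pool hidden states over the
# answer tokens, then normalize and predict that option's residual.
hidden_states = rng.normal(size=(7, 4096))  # 7 answer tokens (toy data)
x = zscore(hidden_states.mean(axis=0), mu=0.0, sigma=1.0)
r_hat = mlp_probe(x, weights)
print(r_hat.shape, float(r_hat[0]))         # scalar predicted residual
```

The predicted residual can then be used at inference time to steer the option's probability back toward the ideal target.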