Internal Flow Signatures for Self-Checking and Refinement in LLMs


Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce *internal flow signatures* that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact *moving* readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. *Code is available at* `github.com/EavnJeong/Internal-Flow-Signatures-for-Self-Checking-and-Refinement-in-LLMs`.


💡 Research Summary

The paper tackles the pervasive problem of large language models (LLMs) producing fluent yet factually incorrect outputs, a phenomenon commonly referred to as hallucination. Existing safeguards (retrieval augmentation, post‑hoc verification, or an auxiliary judging model) operate after generation, incurring latency and computational cost while offering no insight into the internal decision‑making process that led to the error. The authors propose a fundamentally different approach: monitoring the model’s internal dynamics during inference and using these dynamics for self‑checking and targeted refinement.

The core contribution is the concept of internal flow signatures, a set of geometric descriptors extracted from the residual stream at a fixed inter‑block boundary. To obtain stable, depth‑comparable measurements, the authors first remove the learned bias of the layer‑normalization at each block (bias‑centered monitoring). This eliminates a token‑independent translation that would otherwise obscure true token‑level motion.
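Bias-centered monitoring amounts to subtracting a token-independent translation before any geometry is measured. A minimal numpy sketch, assuming the monitored states and the boundary layer-norm bias are available as arrays (the function name and toy values are illustrative, not the authors' code):

```python
import numpy as np

def bias_center(h, ln_bias):
    """Remove the learned layer-norm bias (a token-independent
    translation) from hidden states read off the residual stream
    at a fixed inter-block monitoring boundary.

    h: (tokens, d_model) hidden states.
    ln_bias: (d_model,) learned bias of the boundary layer norm.
    """
    return h - ln_bias  # broadcasts over the token axis

# toy check: centering removes a shared offset exactly
h = np.array([[1.0, 2.0], [3.0, 4.0]])
beta = np.array([1.0, 1.0])
print(bias_center(h, beta))
```

Because the bias is identical for every token, subtracting it leaves all between-token differences untouched while removing a spurious common drift across depth.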

Next, the authors define sliding depth windows of length L (with stride s). Within each window they identify the top‑ranked token (according to the logits) and its K nearest competitors. The difference vectors between the top token’s readout and each competitor are collected, sampled, and stacked into a matrix D_j. A compact singular‑value decomposition of D_j yields the top‑k right singular vectors, which form a moving subspace U_j that is aligned with the current readout direction. Because subspaces are only defined up to an orthogonal rotation, adjacent windows are aligned via an orthogonal transport matrix R_{j→j+1} obtained from the SVD of U_{j+1}^T U_j. This alignment makes it possible to compare vectors across depth without being affected by arbitrary basis rotations.
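The window-subspace construction and the orthogonal transport can be sketched in numpy as below. The random data and function names are illustrative; the transport is the orthogonal Procrustes solution built from the SVD of U_{j+1}^T U_j, as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_subspace(D, k):
    """Top-k right singular vectors of the stacked difference matrix D
    (rows: top-token readout minus a competitor readout) form an
    orthonormal basis U (d_model x k) for the window's
    readout-aligned subspace."""
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    return Vt[:k].T  # (d_model, k), orthonormal columns

def transport(U_prev, U_next):
    """Orthogonal Procrustes alignment between adjacent window frames:
    R is the nearest orthogonal matrix to U_next^T U_prev, so
    coordinates in the previous basis become comparable in the next."""
    M = U_next.T @ U_prev                # (k, k)
    W, _, Vt = np.linalg.svd(M)
    return W @ Vt                        # orthogonal (k, k)

# sanity check: a rotated copy of the same subspace transports exactly
D = rng.normal(size=(8, 16))
U0 = moving_subspace(D, 3)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
U1 = U0 @ Q                              # same subspace, rotated basis
R = transport(U0, U1)
v = rng.normal(size=16)
assert np.allclose(U1.T @ v, R @ (U0.T @ v))
```

The check illustrates the invariance claim: an arbitrary within-window basis rotation is absorbed entirely by R, so transported quantities do not depend on how each window's basis happened to be chosen.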

With the moving coordinates p_{t,b}=U_{j(b)}^T \tilde h_{t,b} (where \tilde h is the bias‑centered state), the authors compute three families of signatures:

  1. Transported step length s_{t,b}=‖Δp_{t,b}‖₂, where Δp_{t,b}=p_{t,b+1}−R_b p_{t,b}.
  2. Turning angle θ_{t,b}=∠(u_{t,b+1}, R_b u_{t,b}), where u denotes the unit direction of p.
  3. Centered increment Δp^c_{t,b} = Δp_{t,b} − μ_b, where μ_b is a robust rotation‑equivariant center over all tokens at depth b.
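The three signature families above reduce to a few lines once the moving coordinates and the transport are in hand. A numpy sketch (function name illustrative; the robust center μ_b is passed in rather than estimated here):

```python
import numpy as np

def flow_signatures(p_curr, p_next, R, mu=None):
    """Depth-comparable signatures at one block boundary.

    p_curr, p_next: (k,) moving-subspace coordinates at depths b, b+1.
    R: (k, k) orthogonal transport aligning frame b to frame b+1.
    mu: optional (k,) robust center of increments over tokens at depth b.
    """
    delta = p_next - R @ p_curr                  # transported increment
    step = np.linalg.norm(delta)                 # 1. transported step length
    u_c = p_curr / np.linalg.norm(p_curr)
    u_n = p_next / np.linalg.norm(p_next)
    cosang = np.clip(u_n @ (R @ u_c), -1.0, 1.0)
    angle = np.arccos(cosang)                    # 2. turning angle
    centered = delta - mu if mu is not None else delta  # 3. centered increment
    return step, angle, centered

# identity transport, collinear motion: unit step, zero turning angle
s, a, c = flow_signatures(np.array([1.0, 0.0]),
                          np.array([2.0, 0.0]), np.eye(2))
print(s, a)  # 1.0 0.0
```

Clipping the cosine before `arccos` guards against values drifting fractionally outside [−1, 1] from floating-point error.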

In addition to these geometric quantities, the method decomposes the pre‑normalization contributions of the attention and MLP sub‑layers. By integrating the injection path x_{t,b}(α) = h^{raw}_{t,b} + α·(o_{t,b} + m_{t,b}) through the non‑linear normalization map, the authors obtain path‑integrated updates Δq_attn and Δq_mlp projected onto the target subspace. The residual η_{t,b} = Δq_{t,b} − Δq_{end,t,b} captures the non‑linear effect that cannot be explained by a linear approximation.
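One way to realize such a path-integrated split numerically is to accumulate directional derivatives of the normalization map along the straight injection path. The sketch below is an assumption-laden illustration: it uses a plain RMSNorm as a stand-in for the model's normalization, finite differences for the Jacobian-vector products, and defines the residual as the gap between the split and the end-to-end change, which may differ in detail from the paper's construction:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Stand-in non-linear normalization map (assumption: the real
    pipeline uses the model's own norm; RMSNorm is for illustration)."""
    return x / np.sqrt(np.mean(x**2) + eps)

def path_integrated_split(h_raw, o_attn, m_mlp, U, n_steps=64):
    """Attribute the normalized update to attention vs MLP by integrating
    directional derivatives of the norm map along the injection path
    x(a) = h_raw + a*(o_attn + m_mlp), projected onto the subspace U."""
    fd = 1e-4
    dq_attn = np.zeros(U.shape[1])
    dq_mlp = np.zeros(U.shape[1])
    for a in (np.arange(n_steps) + 0.5) / n_steps:   # midpoint rule
        x = h_raw + a * (o_attn + m_mlp)
        # central finite differences approximate J(x) @ o and J(x) @ m
        j_o = (rmsnorm(x + fd * o_attn) - rmsnorm(x - fd * o_attn)) / (2 * fd)
        j_m = (rmsnorm(x + fd * m_mlp) - rmsnorm(x - fd * m_mlp)) / (2 * fd)
        dq_attn += U.T @ j_o / n_steps
        dq_mlp += U.T @ j_m / n_steps
    # end-to-end change; eta is what the path split leaves unexplained
    dq_end = U.T @ (rmsnorm(h_raw + o_attn + m_mlp) - rmsnorm(h_raw))
    eta = dq_end - (dq_attn + dq_mlp)
    return dq_attn, dq_mlp, eta

rng = np.random.default_rng(1)
h = rng.normal(size=8)
o = 0.1 * rng.normal(size=8)
m = 0.1 * rng.normal(size=8)
U, _ = np.linalg.qr(rng.normal(size=(8, 3)))
dq_a, dq_m, eta = path_integrated_split(h, o, m, U)
```

Because the directional derivative along o + m splits linearly at each point of the path, the two accumulated terms sum to the end-to-end change in the limit of many steps, so η shrinks toward zero as the integration is refined.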

All these features—step lengths, turning angles, drift vectors, component magnitudes, and residual ratios—constitute a rich, low‑dimensional representation of how a token’s decision evolves as it traverses the transformer depth. A lightweight GRU‑based validator is trained on these signatures to discriminate between “reliable” and “unreliable” generation trajectories. Crucially, the validator not only flags a hallucination but also pinpoints the exact depth block where the trajectory deviates anomalously.
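A validator of this shape is small enough to sketch directly. The PyTorch module below is a hedged guess at the architecture: feature dimension, hidden size, and the two heads (a per-depth anomaly score for localization and a sequence-level reliability logit) are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class FlowValidator(nn.Module):
    """Lightweight GRU over per-depth signature vectors. Emits a
    per-block anomaly score (for culprit-depth localization) and a
    trajectory-level reliability logit. Sizes are illustrative."""
    def __init__(self, n_features=6, hidden=32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.block_head = nn.Linear(hidden, 1)   # per-depth anomaly score
        self.seq_head = nn.Linear(hidden, 1)     # trajectory-level logit

    def forward(self, sig):                      # sig: (batch, depth, n_features)
        h, h_last = self.gru(sig)
        block_scores = self.block_head(h).squeeze(-1)      # (batch, depth)
        seq_logit = self.seq_head(h_last[-1]).squeeze(-1)  # (batch,)
        return seq_logit, block_scores

validator = FlowValidator()
logit, scores = validator(torch.randn(2, 24, 6))
print(logit.shape, scores.shape)  # torch.Size([2]) torch.Size([2, 24])
```

Reading the culprit depth off `scores.argmax(dim=-1)` is one natural localization rule under these assumptions; the base model is never touched, matching the paper's no-modification constraint.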

Armed with this localization, the authors introduce a targeted refinement procedure. When a culprit block b* is identified for token t*, generation is rolled back to position t* and the abnormal transported step at block b* is clamped (i.e., its component in the moving subspace is set to zero). The orthogonal residual—information orthogonal to the moving subspace—is preserved, ensuring that only the offending direction is suppressed while the rest of the representation remains intact. This intervention requires no weight updates, no additional fine‑tuning, and incurs negligible computational overhead.
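The clamp itself is a projection: remove the update's component inside the moving subspace and keep the orthogonal remainder. A minimal numpy sketch (names and random data are illustrative; in the pipeline this would act on bias-centered states at the monitored boundary of the culprit block):

```python
import numpy as np

def clamp_step(h_next, h_curr, U):
    """Suppress the abnormal transported step at the culprit block:
    zero the update's component inside the moving subspace U while
    preserving the orthogonal residual (everything outside span(U))."""
    delta = h_next - h_curr
    in_subspace = U @ (U.T @ delta)      # component inside span(U)
    return h_curr + (delta - in_subspace)

rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.normal(size=(16, 3)))
h0 = rng.normal(size=16)
h1 = rng.normal(size=16)
h1c = clamp_step(h1, h0, U)
# the clamped update has no component left in the monitored subspace
assert np.allclose(U.T @ (h1c - h0), 0)
```

Because U has orthonormal columns, `U @ U.T` is an orthogonal projector, so the 13 remaining dimensions of the update pass through unchanged: only the flagged direction is suppressed.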

Empirical evaluation spans multiple LLM sizes (e.g., 7B, 13B, 34B) and tasks such as abstractive summarization, open‑domain QA, and factuality benchmarks. Across the board, the GRU validator achieves high AUC in separating hallucinated from faithful generations, and the targeted refinement reduces hallucination rates by 30‑50 % while maintaining or slightly improving BLEU/ROUGE scores. Ablation studies confirm that bias‑centering, moving subspace construction, and orthogonal transport each contribute significantly to the stability of the signatures.

The paper’s contributions can be summarized as follows:

  • A principled method for extracting depth‑wise, bias‑centered flow signatures that are invariant to within‑window basis rotations.
  • An orthogonal transport mechanism that aligns moving subspaces across depth, enabling consistent geometric measurements.
  • Demonstration that a compact GRU validator can learn to self‑check LLM outputs using only internal dynamics, without external data or model modification.
  • A novel, minimally invasive refinement technique that locally corrects a generation by clamping an abnormal step while preserving orthogonal information.

Limitations include the need to tune hyper‑parameters such as window length L, subspace dimension k, and number of competitors K for different architectures, and the current focus on text‑only generation tasks. Future work could explore extending the framework to multimodal models, automating hyper‑parameter selection, and integrating the validator into a fully online decoding loop. Overall, the study opens a promising avenue for making LLMs more self‑aware and self‑correcting by listening to their own internal “flow”.

