Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models
Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity, a phenomenon known as the content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model’s internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model’s activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.
💡 Research Summary
The paper tackles a well‑known failure mode of large language models (LLMs) in formal reasoning tasks: the “content effect,” where models conflate the semantic plausibility of a conclusion with its logical validity. This bias is especially evident in syllogistic reasoning, leading models to misclassify valid‑but‑implausible arguments as invalid and to accept invalid‑but‑plausible conclusions as valid. Existing remedies—supervised fine‑tuning, chain‑of‑thought prompting, or static inference‑time constraints—either require costly parameter updates or still inherit the same semantic shortcuts in intermediate steps.
The authors propose a novel, inference‑time intervention that separates logical structure from lexical semantics by defining an “abstract reasoning space.” For each content‑laden natural‑language syllogism, they construct a paired abstract version that preserves the logical form but replaces content words with placeholders (e.g., “All X need Y”). When the model processes these abstract inputs, the resulting residual‑stream activations at selected layers serve as target representations of pure logical reasoning, free from semantic noise.
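As a toy illustration of this pairing (our own sketch, not the paper's data pipeline; the template string and the `make_pair` helper are hypothetical), a content‑laden syllogism and its abstract counterpart can be instantiated from a single logical form:

```python
# Hypothetical sketch of content/abstract syllogism pairing.
# One logical form (All A are B; all B are C; therefore all A are C),
# filled with either content words or neutral placeholders.
TEMPLATE = "All {A} are {B}. All {B} are {C}. Therefore, all {A} are {C}."

def make_pair(a: str, b: str, c: str) -> tuple[str, str]:
    """Return (content-laden, abstract) versions of the same logical form."""
    content = TEMPLATE.format(A=a, B=b, C=c)
    abstract = TEMPLATE.format(A="X", B="Y", C="Z")
    return content, abstract

content, abstract = make_pair("dogs", "mammals", "animals")
```

The key property is that both members of the pair share the identical structural skeleton, so any difference in the model's activations can be attributed to the content words alone.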
To map content‑conditioned activations onto this abstract space, the paper introduces lightweight Multi‑Layer Perceptrons called Abstractors. For each chosen layer ℓ (typically in the middle of the network, where higher‑level concepts emerge), an Abstractor fℓ receives the last‑token activation aℓ(x_con) of the content‑laden input and predicts a target vector âℓ(x_con). The architecture splits prediction into a direction head (unit vector) and a magnitude head, enabling fine‑grained control. Training uses a contrastive triplet loss built from a positive abstract example sharing the same validity label, a negative abstract example with the opposite label, and the original content example. The loss combines (i) attraction (align direction with the positive target), (ii) repulsion (push away from the negative target), and (iii) magnitude matching. This encourages the model to place valid and invalid syllogisms in distinct regions of the abstract manifold without requiring explicit class labels at inference.
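The direction/magnitude split and the three loss terms can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions: the layer width, the linear heads, and the exact margin formulation of the repulsion term are hypothetical, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # illustrative residual-stream width

# Hypothetical Abstractor parameters: linear direction and magnitude heads.
W_dir = rng.standard_normal((d, d)) * 0.1
W_mag = rng.standard_normal(d) * 0.1

def abstractor(a_con):
    """Predict a target vector as (unit direction) * (positive magnitude)."""
    u = W_dir @ a_con
    u = u / np.linalg.norm(u)     # direction head: unit vector
    m = np.exp(W_mag @ a_con)     # magnitude head: positive scalar
    return m * u

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abstractor_loss(pred, a_pos, a_neg, margin=0.2):
    """Attraction to the same-validity abstract target, margin-based
    repulsion from the opposite-validity target, magnitude matching."""
    attract = 1.0 - cosine(pred, a_pos)
    repel = max(0.0, margin + cosine(pred, a_neg) - cosine(pred, a_pos))
    mag = (np.linalg.norm(pred) - np.linalg.norm(a_pos)) ** 2
    return attract + repel + mag
```

By construction the loss is zero when the prediction coincides with the positive target and sits at least `margin` closer (in cosine) to it than to the negative one; training would minimize this over all triplets.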
Inference proceeds in two passes. First, a standard forward pass extracts the last‑token activation aℓ,|x|(x_con). The trained Abstractor then computes âℓ(x_con). In the second pass, the same input is re‑processed, and at each token position t ≥ t_start the activation is blended with the target: a_steer(ℓ,t) = (1−α_t)·aℓ,t + α_t·âℓ(x_con), where α_t ramps linearly from 0 up to a per‑model maximum α_max. Multiple contiguous layers L* are steered simultaneously; layers are selected where the empirical separation between positive and negative abstract targets is maximal (lowest cosine similarity).
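The blending step itself reduces to a per‑token convex combination. The sketch below (our own simplification; the ramp shape and the hook mechanics inside the forward pass are assumptions, and `t_start`/`alpha_max` are the paper's hyperparameters as we read them) shows it on toy activations:

```python
import numpy as np

def alpha_schedule(T, t_start, alpha_max):
    """alpha_t: 0 before t_start, then a linear ramp up to alpha_max."""
    alphas = np.zeros(T)
    ramp = max(T - t_start, 1)
    for t in range(t_start, T):
        alphas[t] = alpha_max * (t - t_start + 1) / ramp
    return alphas

def steer(acts, target, t_start, alpha_max):
    """Per-token blend: a_steer[t] = (1 - alpha_t) * a[t] + alpha_t * target."""
    alphas = alpha_schedule(acts.shape[0], t_start, alpha_max)
    return (1 - alphas)[:, None] * acts + alphas[:, None] * target[None, :]

acts = np.ones((4, 2))    # toy activations: 4 tokens, width 2
target = np.zeros(2)      # toy Abstractor target
steered = steer(acts, target, t_start=2, alpha_max=0.8)
```

Tokens before `t_start` pass through untouched, while later tokens are pulled increasingly hard toward the abstract target, which matches the paper's description of a gradual, switchable intervention.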
The experimental suite evaluates three families of open‑weight LLMs (Qwen‑2.5, Gemma‑2, Mistral) at 7‑14 B parameters. The primary dataset comprises 2,780 English syllogisms covering 24 logical forms, each paired with an abstract version and annotated for validity and plausibility. For cross‑lingual testing, the authors translate the data into nine additional languages (French, Spanish, Italian, German, Russian, Chinese, Bengali, Swahili, Telugu) using GPT‑4o with back‑translation verification.
Beyond raw accuracy, the paper introduces several robustness metrics: Belief Bias (Δbelief) measures the performance gap between belief‑consistent (logic aligns with real‑world plausibility) and belief‑conflict cases; Bias‑Penalized Accuracy (BPA) scales overall accuracy by (1‑Δbelief) to penalize reliance on semantic heuristics; and Abstract Alignment (η) compares steered performance to the upper bound achieved on purely abstract inputs. Results show that abstraction‑aligned steering consistently improves BPA by 5–12 percentage points, dramatically reduces Δbelief, and yields η values between 0.92 and 0.98, indicating that steered activations closely match the abstract manifold. Notably, Abstractors trained only on English transfer zero‑shot to all other languages, preserving most of the BPA gains, which demonstrates language‑agnostic capture of logical structure.
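The three metrics compose straightforwardly. The sketch below uses the definitions as stated above; the function names and the toy accuracies are ours for illustration, not results from the paper.

```python
def belief_bias(acc_consistent, acc_conflict):
    """Delta_belief: accuracy gap between belief-consistent and conflict cases."""
    return acc_consistent - acc_conflict

def bias_penalized_accuracy(accuracy, delta_belief):
    """BPA: overall accuracy scaled by (1 - Delta_belief)."""
    return accuracy * (1.0 - delta_belief)

def abstract_alignment(steered_acc, abstract_acc):
    """eta: steered performance relative to the abstract-input upper bound."""
    return steered_acc / abstract_acc

# Toy numbers, illustrative only:
delta = belief_bias(0.90, 0.70)             # 0.20
bpa = bias_penalized_accuracy(0.80, delta)  # 0.80 * (1 - 0.20) = 0.64
eta = abstract_alignment(0.76, 0.80)        # 0.95
```

Note that BPA rewards a model only when its accuracy does not depend on whether logic agrees with world knowledge: a model that is accurate purely via plausibility heuristics has a large Δbelief and therefore a heavily discounted BPA.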
Ablation studies vary the steering strength α (0.1–1.0) and the number of steered layers. Optimal α lies in the 0.6–0.8 range for each model, and multi‑layer steering outperforms single‑layer interventions, confirming the need for robust propagation of abstract representations. Comparisons against baselines—no steering, parameter‑efficient fine‑tuning via PiSSA adapters, and CoT prompting—reveal that the proposed method achieves comparable or superior performance while leaving model weights untouched and remaining switchable at inference time.
In summary, the paper demonstrates that activation‑level abstraction is an effective, scalable mechanism to decouple logical reasoning from semantic content in LLMs. By learning lightweight mappings from content‑conditioned activations to an abstract reasoning space and applying dynamic, multi‑layer interventions, the approach mitigates belief bias, improves formal validity detection, and generalizes across languages without additional training data. The work opens avenues for extending abstract steering to more complex logical forms, multi‑step reasoning, and real‑time human‑AI collaboration where selective activation control could enhance trustworthiness and interpretability.