Building Better Deception Probes Using Targeted Instruction Pairs

Building Better Deception Probes Using Targeted Instruction Pairs
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.


💡 Research Summary

This paper revisits the use of linear probes for detecting deceptive behavior in large language models (LLMs) and identifies the dominant factor influencing probe performance: the system prompt used during training. Prior work (Goldowsky‑Dill et al., 2025) demonstrated that a simple contrastive instruction‑pair (honest vs. deceptive) combined with a small factual dataset could train a linear classifier on model activations that achieved very high AUROC (>0.96) on a few benchmarks. However, when applied to a broader set of deception datasets, those probes suffered from spurious correlations and high false‑positive rates, indicating limited generalization.

To diagnose the cause, the authors conducted a factorial experiment with 2,000 probe variants, systematically varying four factors: (1) the honest/deceptive system prompt, (2) the factual training dataset, (3) the token‑aggregation strategy, and (4) the model layer from which residual‑stream activations were extracted. An ANOVA‑style variance analysis revealed that the system prompt alone accounted for 70.6 % of the performance variance (p < 0.001). Layer choice contributed only 2.7 %, the factual dataset 2.5 %, and token selection 0.6 %. This result strongly suggests that the probes are primarily detecting the intent signal encoded by the prompt rather than content‑specific patterns.

Guided by this insight, the authors designed two families of instruction‑pair prompts. The first family consists of 23 “taxonomy‑based” prompt pairs, each explicitly targeting one of the deception types defined in Druckman & Bjork (1991) (e.g., white lie, exaggeration, evasion, concealment, impersonation, etc.) plus seven control pairs for non‑deceptive behavior. These prompts were generated with Claude Sonnet 4.5 and are human‑interpretable. The second family comprises 34 “framing variations” that keep the honest/deceptive dichotomy but vary linguistic framing (identity‑based, role‑based, instruction‑based, hybrid, long‑form, etc.) without targeting a specific deception type.

All probes were trained on the residual‑stream vectors from layer 20 of the open‑weight Gemma‑2‑9B‑IT model. For each response token, the final five tokens were omitted, and the mean of the remaining token vectors was used as the feature vector. Logistic regression with L2 regularization (λ = 1) was employed. The training data were the Azaria & Mitchell (2023) factual true/false pairs (≈200 examples); additional domain‑specific fact sets were tested but contributed only 2.5 % of variance, confirming that content is largely irrelevant for these probes.

Evaluation was performed on ten diverse deception benchmarks, including MASK, Liar’s Bench, AI Liar, Convincing Game, insider‑trading, sandbagging, and role‑playing scenarios. For validation, 50 balanced samples (25 honest, 25 deceptive) were held out from three datasets to select the best‑average probe. The “best‑taxonomy” probe for each dataset was then reported.

Results show that taxonomy‑based prompts consistently outperform the default baseline. Across the ten datasets, the average AUROC improvement was 0.108 (a 20.5 % relative gain). Every dataset had at least one taxonomy‑based probe that beat the baseline, with especially strong gains for concealment‑type deception in MASK (≈0.12 AUROC increase) and overt lying in Liar’s Bench (≈0.15 increase). Framing‑variation prompts yielded modest improvements on some datasets but were less reliable than the targeted taxonomy prompts, underscoring that semantic alignment with the deception type matters more than superficial linguistic changes.

The authors conclude that a single, universal deception detector is unlikely to succeed because deception manifests heterogeneously across tasks and threat models. Instead, organizations should develop specialized probes tailored to the specific deceptive behaviors they aim to monitor, or employ ensembles of targeted probes to balance sensitivity and specificity. This recommendation aligns with operational constraints: specialized probes reduce false alarms on benign role‑playing while maintaining high true‑positive rates on genuinely harmful deception.

In summary, the paper demonstrates that (1) system prompt design dominates linear‑probe performance, (2) aligning prompts with a human‑interpretable deception taxonomy yields significant, consistent gains across diverse benchmarks, and (3) a modular, threat‑model‑specific probing strategy is more practical than pursuing a one‑size‑fits‑all deception detector.


Comments & Academic Discussion

Loading comments...

Leave a Comment