Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure


Prior work uses linear probes on benchmark prompts as evidence of evaluation awareness in large language models. Because evaluation context is typically entangled with benchmark format and genre, it is unclear whether probe-based signals reflect context or surface structure. We test whether these signals persist under partial control of prompt format using a controlled 2 × 2 dataset and diagnostic rewrites. We find that probes primarily track benchmark-canonical structure and fail to generalize to free-form prompts independent of linguistic style. Thus, standard probe-based methodologies do not reliably disentangle evaluation context from structural artifacts, limiting the evidential strength of existing results.


💡 Research Summary

The paper critically examines the claim that large language models (LLMs) possess “evaluation awareness”—the ability to distinguish between evaluation‑time prompts (used in benchmarks) and deployment‑time prompts (used in real‑world interactions). Prior work has relied heavily on linear probing of frozen model activations, reporting high accuracy in classifying evaluation versus deployment prompts. However, these studies often conflate evaluation context with the canonical benchmark format, raising the possibility that probes are simply detecting surface structural cues rather than genuine contextual awareness.

To address this, the authors construct a controlled 2 × 2 experimental design that independently varies (i) Context (Evaluation vs. Deployment) and (ii) Format (Benchmark‑style structured prompts vs. Casual free‑form prompts). Four datasets are created: Bench‑Eval (standard benchmark prompts from MMLU, ARC, GSM8K), Casual‑Deploy (first‑turn user queries from ShareGPT, length‑matched to Bench‑Eval), Casual‑Eval (first‑turn dialogues from the MuTual dataset, serving as a conversational evaluation proxy), and Bench‑Deploy (deployment prompts rewritten into strict benchmark templates using DeepSeek‑V3). The rewrites are carefully audited to remove “answer leakage” (e.g., pre‑filled “Answer:” fields) and to ensure comparable token length distributions across conditions.
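The answer-leakage audit described above can be illustrated with a small sketch. The regex pattern and example prompt below are hypothetical stand-ins; the paper's actual audit rules are not specified in this summary:

```python
import re

def strip_answer_leakage(prompt: str) -> str:
    """Remove a pre-filled answer field (e.g. a trailing 'Answer: B') that
    would leak the label into a benchmark-style rewrite, leaving an empty
    'Answer:' slot. Hypothetical pattern for illustration only."""
    return re.sub(r"\n?Answer:\s*\S.*$", "\nAnswer:", prompt,
                  flags=re.IGNORECASE).strip()

rewrite = "Question: Which layer is probed?\nA. 8\nB. 31\nAnswer: B"
clean = strip_answer_leakage(rewrite)
print(clean)
```

After the audit, the rewritten prompt keeps the strict benchmark template (question, options, answer slot) without revealing the answer itself.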

Activations are extracted from the residual stream of Llama‑3.1‑8B‑Instruct (layers 8–31, last token) and logistic regression probes are trained with 5‑fold cross‑validation. Two training regimes are compared:

  1. Standard (Single‑Contrast) Training – probes are trained only on the conventional contrast (Bench‑Eval vs. Casual‑Deploy), where format and context are perfectly confounded.
  2. Paired (Decorrelated) Training – probes are trained on the union of both contexts and both formats (Bench‑Eval ∪ Casual‑Eval vs. Bench‑Deploy ∪ Casual‑Deploy), thereby decorrelating format from label.
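The two regimes can be sketched end to end with synthetic activations standing in for real residual-stream features (loading Llama‑3.1‑8B‑Instruct is out of scope here). The feature construction below is an assumption chosen to mimic the paper's finding: a strong format direction and a weaker context direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 200, 32  # prompts per cell, feature dim (stand-in for residual-stream width)

def acts(context, fmt):
    """Synthetic 'activations': dim 0 carries a strong benchmark-format
    signal, dim 1 a weaker evaluation-context signal, the rest is noise."""
    x = rng.normal(0.0, 1.0, (n, d))
    x[:, 0] += 4.0 * fmt
    x[:, 1] += 1.5 * context
    return x

bench_eval    = acts(context=1, fmt=1)   # MMLU/ARC/GSM8K-style cell
casual_deploy = acts(context=0, fmt=0)   # ShareGPT-style cell
casual_eval   = acts(context=1, fmt=0)   # MuTual-style cell
bench_deploy  = acts(context=0, fmt=1)   # rewritten-deployment cell

# Regime 1: single-contrast training (format perfectly confounded with label).
X1 = np.vstack([bench_eval, casual_deploy])
y1 = np.r_[np.ones(n), np.zeros(n)]
probe1 = LogisticRegression(max_iter=1000).fit(X1, y1)
err1 = probe1.predict(bench_deploy).mean()  # fraction of Bench-Deploy called "eval"

# Regime 2: paired/decorrelated training (both formats appear in both classes).
X2 = np.vstack([bench_eval, casual_eval, bench_deploy, casual_deploy])
y2 = np.r_[np.ones(2 * n), np.zeros(2 * n)]
probe2 = LogisticRegression(max_iter=1000).fit(X2, y2)
err2 = probe2.predict(bench_deploy).mean()

print(f"Bench-Deploy labeled 'evaluation': "
      f"single-contrast={err1:.2f}, decorrelated={err2:.2f}")
```

On this toy construction, the single-contrast probe labels most Bench‑Deploy prompts "evaluation" because it latches onto the format direction, while the decorrelated probe, for which format is balanced across labels, largely ignores it — qualitatively reproducing the paper's divergence.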

Results show a stark divergence. Under the standard regime, probes achieve near‑perfect training accuracy but catastrophically fail to generalize: Bench‑Deploy prompts are misclassified as “Evaluation” 94.5 % of the time, indicating that the probe is keyed to the rigid benchmark template rather than to evaluation intent. When the format is altered without changing content (e.g., rewriting deployment prompts in an academic tone while preserving the free‑form structure), probe predictions remain unchanged, confirming that linguistic genre is not the driving factor.

Conversely, the decorrelated regime yields probes that correctly identify Bench‑Deploy as deployment (only 7.2 % error) and classify Casual‑Eval as evaluation with 100 % accuracy. This demonstrates that, when the training data explicitly separate format from context, linear probes can learn a signal that aligns with the intended evaluation context.

Additional analyses rule out length shortcuts (histogram matching equalizes length distributions) and show that removing answer leakage actually increases the probe’s tendency to label rewrites as evaluation, suggesting that leakage was dampening the format signal rather than creating it.
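The length-matching control can be sketched as a per-bin subsampling procedure; the binning and toy lengths below are assumptions, since the paper's exact histogram-matching recipe is not given in this summary:

```python
import numpy as np

def match_length_histogram(len_src, len_tgt, n_bins=10, seed=0):
    """Subsample the source set so its token-length histogram approaches the
    target set's, by capping each source bin at the target's bin count.
    Returns kept indices into len_src. A sketch, not the paper's procedure."""
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(np.concatenate([len_src, len_tgt]), bins=n_bins)
    src_bins = np.digitize(len_src, edges[1:-1])
    tgt_bins = np.digitize(len_tgt, edges[1:-1])
    keep = []
    for b in range(n_bins):
        src_idx = np.flatnonzero(src_bins == b)
        take = min(len(src_idx), int((tgt_bins == b).sum()))
        keep.extend(rng.choice(src_idx, size=take, replace=False))
    return np.array(sorted(keep))

# Toy token lengths: source skews long, target skews short.
src = np.r_[np.full(50, 20), np.full(50, 80)]
tgt = np.r_[np.full(80, 20), np.full(20, 80)]
matched = src[match_length_histogram(src, tgt)]
```

After matching, the source's over-represented long prompts are thinned to the target's bin count, so a probe can no longer use length as a proxy for condition.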

The authors conclude that standard probe‑based diagnostics are highly sensitive to benchmark‑specific structural cues and therefore provide weak evidence for genuine evaluation awareness. Only with careful experimental control—explicitly decorrelating format from context—can probes capture a more meaningful signal. They caution that safety‑oriented monitoring tools that rely on such probes may overestimate a model’s situational awareness, potentially leading to false confidence in deployment settings.

Limitations are acknowledged: the study uses only linear probes, a single model architecture, and proxy datasets for some conditions. Nonetheless, the central finding—that benchmark format is a pervasive confound—likely extends to many existing benchmark practices. Future work should explore non‑linear or distributed probing methods, broader model families, and more naturalistic evaluation‑deployment splits to robustly assess evaluation awareness.

