Discovering Universal Activation Directions for PII Leakage in Language Models
Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model’s residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage as the superposition of a latent signal in the model’s representations, one that enables both risk amplification and mitigation.
💡 Research Summary
The paper introduces UniLeak, a mechanistic‑interpretability framework that discovers “universal activation directions” in the residual stream of large language models (LLMs). A universal activation direction is a latent linear vector that, when added to the model’s hidden states during inference, consistently raises the probability of generating personally identifiable information (PII) across a wide range of prompts. Unlike prior work that focuses on prompt engineering or external attacks to elicit PII, UniLeak probes the internal geometry of the model to find a controllable signal that amplifies or suppresses privacy‑leaking behavior.
Methodologically, UniLeak proceeds in two stages. First, it builds a self‑generated training set of PII‑containing sequences without any external data. The authors sample 200k generations from the target model using two extraction strategies: (1) BOS (begin‑of‑sequence) top‑k sampling, and (2) an optimized single‑token prompt technique from recent literature. Generated texts are then annotated for structured PII (emails, phone numbers, etc.) using regular expressions and for unstructured PII (names) using the Flair NER tagger. Each token is labeled with a binary indicator for each PII class, yielding class‑specific datasets D_gen^(c).
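The token-level annotation step can be sketched as follows. This is a minimal illustration using two hypothetical regex patterns for emails and US-style phone numbers; the paper's actual pattern set, and the Flair NER tagger it uses for names, are not reproduced here.

```python
import re

# Illustrative patterns for structured PII classes (not the paper's actual set).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def label_tokens(text, tokens):
    """Return per-class binary labels: 1 if a token overlaps a PII span."""
    # Character spans of every PII match, grouped by class.
    spans = {c: [m.span() for m in p.finditer(text)] for c, p in PII_PATTERNS.items()}
    labels = {c: [0] * len(tokens) for c in PII_PATTERNS}
    pos = 0
    for i, tok in enumerate(tokens):
        start = text.index(tok, pos)   # locate this token in the raw text
        end = start + len(tok)
        pos = end
        for c, cls_spans in spans.items():
            # A token is labeled 1 if its span overlaps any match span.
            if any(start < e and s < end for s, e in cls_spans):
                labels[c][i] = 1
    return labels

text = "Contact jane@example.com or 555-123-4567 today."
labels = label_tokens(text, text.split())
# labels["email"] marks the second token; labels["phone"] marks the fourth.
```

In practice a subword tokenizer would replace the whitespace split, but the overlap logic is the same: token character spans are intersected with detected PII spans to produce the binary indicators.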
Second, for each PII class the framework learns layer‑specific direction vectors v_ℓ^(c) via gradient‑based optimization. The objective is to minimize the negative log‑likelihood of the true PII tokens when the vectors are added to the residual stream at chosen layers ℓ and token positions T. Formally, an additive intervention Add({v_ℓ^(c)}, T) modifies hidden states h_t^ℓ ← h_t^ℓ + v_ℓ^(c) for all ℓ ∈ L and t ∈ T during the forward pass. The loss L_PII sums over only the tokens labeled as PII, which focuses the optimization on the desired signal and prevents the model from drifting toward unrelated outputs. The vectors are optimized jointly across all layers using standard stochastic gradient descent (Algorithm 1).
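The optimization above can be illustrated on a toy stand-in for the model: a single frozen linear "unembedding" replaces the transformer, one residual-stream state plays the role of an intervened position, and the gradient of the cross-entropy loss with respect to the added direction is computed by hand. Everything here (dimensions, the matrix W, the target token index) is an assumption for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 50                   # toy hidden size and vocabulary
W = rng.normal(size=(vocab, d))     # frozen "unembedding", standing in for the model
h = rng.normal(size=d)              # residual-stream state at an intervened position
pii_token = 7                       # index of the PII token we want to make likely
v = np.zeros(d)                     # the direction being learned (v_l^(c))
lr = 0.01

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

p_before = softmax(W @ h)[pii_token]
for _ in range(500):
    p = softmax(W @ (h + v))        # Add intervention: h <- h + v
    # d(-log p[pii_token]) / dv = W^T (p - onehot(pii_token))
    grad = W.T @ (p - np.eye(vocab)[pii_token])
    v -= lr * grad                  # SGD step on the direction only; W, h stay frozen
p_after = softmax(W @ (h + v))[pii_token]
# Steering along the learned direction raises the PII token's probability.
```

Because the model is frozen and only v is updated, this mirrors the key property of the method: the learned vector, not a change to the weights, carries the leakage signal.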
A key design choice is token‑position localization. Experiments show that applying the direction only at the first input token (T = {1}) yields stable training and stronger PII amplification than intervening at every token. This aligns with the sparsity of PII: a small early‑layer perturbation can be propagated and amplified through later layers, influencing the final token distribution.
The authors evaluate UniLeak on multiple LLM families (GPT‑Neo, LLaMA, Falcon) and datasets (TREC, WikiText). When combined with existing prompt‑based extraction methods, the universal directions increase the number of extracted PII records by up to 13,399 over the baseline. Conversely, projecting out (subtracting) the learned direction at inference time reduces PII leakage by up to 3,562 records while preserving generation quality (perplexity, BLEU, and ROUGE remain unchanged). Mechanistic analysis reveals that interventions at early layers most effectively raise the output probability of PII tokens, and the steered activations exhibit up to 54% higher cosine similarity to genuine PII contexts from the training data, despite UniLeak never seeing that data.
Beyond attack scenarios, the paper demonstrates practical implications. An adversary could embed the universal direction into a model’s embeddings (embedding poisoning) to create a “leaky” model that yields more PII when later queried with standard prompts. Defenders, on the other hand, can mitigate risk by detecting and projecting out the direction at runtime, offering a lightweight, model‑agnostic privacy safeguard.
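The defensive projection is a standard linear-algebra operation: remove the component of each hidden state along the (unit-normalized) leakage direction. A minimal sketch, with a random vector standing in for the learned direction:

```python
import numpy as np

def project_out(h, v):
    """Remove the component of hidden state h along direction v."""
    u = v / np.linalg.norm(v)       # unit vector along the leakage direction
    return h - (h @ u) * u          # orthogonal projection

rng = np.random.default_rng(1)
v = rng.normal(size=8)              # stand-in for a learned leakage direction
h = rng.normal(size=8)              # stand-in for a residual-stream state
h_clean = project_out(h, v)
# h_clean is orthogonal to v: its dot product with v is (numerically) zero.
```

Applied at runtime to the residual stream, this is cheap (one dot product and one subtraction per state) and leaves all components orthogonal to the direction untouched, which is consistent with the reported preservation of generation quality.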
Limitations include the white‑box assumption (full access to model parameters and activations) and reliance on self‑generated data that may still miss rare PII types. The annotation pipeline depends on regex and NER tools, which can introduce labeling noise. Future work could explore black‑box estimation of universal directions, self‑supervised discovery without explicit PII labels, and extension to other privacy‑sensitive attributes such as medical records.
In summary, UniLeak uncovers a latent, model‑specific but prompt‑agnostic representation of PII leakage, providing both a novel attack surface and a concrete mitigation pathway. By revealing that privacy‑leaking behavior can be steered through simple linear interventions, the work opens new avenues for mechanistic privacy auditing and defense in the era of openly released large language models.