Better Language Model Inversion by Compactly Representing Next-Token Distributions
Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model’s system message. We propose a new method – prompt inversion from logprob sequences (PILS) – that recovers hidden prompts by gleaning clues from the model’s next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: The vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2–3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5–27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
💡 Research Summary
This paper introduces a significant advancement in language model inversion, the task of recovering a hidden prompt given only the outputs of a language model. The proposed method, named Prompt Inversion from Logprob Sequences (PILS), achieves state-of-the-art performance by efficiently leveraging information from multiple steps of the model’s generation process.
The core innovation of PILS is based on a key mathematical insight: the vector-valued next-token probability distributions (logprobs) produced by a transformer language model reside in a low-dimensional subspace with a dimensionality equal to the model’s hidden state size (D). The authors prove that by applying the additive log-ratio (alr) transform to the probability vector and then selecting a specific subset of D logprob values, one can obtain a losslessly compressed representation that is a linear transformation of the model’s final hidden state. This compression drastically reduces the amount of information needed from the target model’s API—from requiring the entire vocabulary-sized vector (V, often hundreds of thousands) to just D+1 values (typically a few thousand). This translates to major reductions in API cost and storage requirements.
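The compression argument above can be illustrated with a small numerical sketch. The key fact is that, because logits are a linear function of the hidden state (logits = Wh for the unembedding matrix W), the alr-transformed logprobs alr_i = log(p_i / p_V) = (W_i − W_V)·h also depend linearly on h. So any D linearly independent alr coordinates suffice to recover h and, from it, the full V−1-dimensional alr vector. The following toy demonstration (not the paper's code; matrix sizes and the choice of coordinate subset are illustrative assumptions) checks this losslessness with a random stand-in unembedding matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 32  # toy vocab and hidden sizes (real models: V ~ 1e5, D ~ few thousand)

W = rng.normal(size=(V, D))          # stand-in unembedding matrix
h = rng.normal(size=D)               # a final hidden state
logits = W @ h
p = np.exp(logits - logits.max())
p /= p.sum()                         # next-token distribution

# Additive log-ratio (alr) transform relative to the last token:
# alr_i = log(p_i / p_V), which equals (W[:-1] - W[-1]) @ h.
A = W[:-1] - W[-1]                   # (V-1, D) linear map
alr = np.log(p[:-1]) - np.log(p[-1])

# Keep only D alr coordinates; for a generic W the corresponding rows
# of A are linearly independent, so the hidden state is recoverable.
idx = np.arange(D)                   # illustrative choice of coordinates
h_hat = np.linalg.solve(A[idx], alr[idx])

# From the recovered hidden state, reconstruct ALL V-1 alr values.
alr_full = A @ h_hat
assert np.allclose(alr_full, alr)   # lossless up to floating-point error
```

The practical upshot matches the summary: instead of storing or querying all V logprobs per step, D of them (plus the reference logprob, giving D+1 values) determine the entire distribution.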
PILS utilizes this compressed representation across multiple generation steps (T). Instead of using only the first step’s output like prior logprob-based methods (e.g., L2T), PILS feeds a sequence of T compressed hidden state vectors into an encoder-decoder inverter model (based on T5). The intuition is that clues about different parts of the hidden prompt may surface at different stages of the generation.
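The multi-step input described above can be sketched as follows. This is a hypothetical shape-level illustration (the helper `compress_step`, the coordinate subset, and the reference token are assumptions for the sketch, not the paper's implementation): each of the T generation steps contributes one compressed (D+1)-dimensional vector, and the stacked sequence is what the encoder-decoder inverter would consume.

```python
import numpy as np

def compress_step(logprobs, idx, ref):
    """Compress one step's full logprob vector (length V) to D+1 values:
    the reference logprob plus D alr coordinates. Hypothetical helper."""
    alr = logprobs[idx] - logprobs[ref]
    return np.concatenate(([logprobs[ref]], alr))

V, D, T = 1000, 32, 16                # toy sizes; T = number of generation steps
rng = np.random.default_rng(1)
idx = np.arange(D)                    # assumed linearly independent coordinate set
ref = V - 1                           # reference token index for the alr transform

# Simulate T steps of normalized next-token logprobs from the target model.
steps = [rng.normal(size=V) for _ in range(T)]
steps = [lp - np.log(np.exp(lp).sum()) for lp in steps]

# One compressed vector per step; the (T, D+1) sequence replaces the
# single first-step vector used by prior logprob-based inverters.
inverter_input = np.stack([compress_step(lp, idx, ref) for lp in steps])
assert inverter_input.shape == (T, D + 1)
```

Compared with a single-step input, this sequence grows linearly with T while each step stays compact, which is what makes feeding many steps to a T5-style encoder tractable.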
The experimental results demonstrate massive gains over previous state-of-the-art methods, including both logprob-based (L2T) and text-output-based (O2P) inverters. Evaluated on Llama 2 Chat, PILS achieved a 51% exact match recovery rate on in-distribution prompts, more than doubling the 23% rate of the best previous method. It also showed 2–3.5× improvements on out-of-distribution datasets. A remarkable finding is the method’s strong generalization: an inverter trained on sequences of 16 generation steps showed a further 5–27 percentage point increase in recovery rate when evaluated on 32-step sequences, indicating that more generated context provides cumulative information about the prompt.
The paper further validates PILS on the more challenging and practical task of recovering hidden system messages from API-protected models, showing strong performance. Additional analyses explore the role of verbatim repetition in prompts and propose a novel cross-family model transfer technique for logit-based inverters.
In summary, this work establishes that next-token probabilities are a far more vulnerable attack surface for privacy leakage than previously recognized. PILS sets a new benchmark for inversion attacks by combining a theoretically grounded, efficient compression scheme with a multi-step inference strategy, forcing a serious reconsideration of how much information language model APIs should expose.