Extracting Training Dialogue Data from Large Language Model-Based Task Bots
Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks: functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, LLMs can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events such as complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited by task bots built on top of these models remains unexplored. In this work, we address this gap through a systematic quantitative study that evaluates existing training data extraction attacks, analyzes key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposes novel attack techniques tailored to LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack: it can extract thousands of dialogue-state training labels with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.
💡 Research Summary
The paper investigates privacy leakage in task‑oriented dialogue systems (TODS) that are built on large language models (LLMs). While LLMs have dramatically improved the ability of task bots to understand user intent, retrieve relevant database entries, and generate appropriate responses, they also act as “soft knowledge bases” that compress training data into model parameters. Consequently, personal identifiers (e.g., phone numbers) and entire event‑level information (e.g., travel itineraries) can be memorized and potentially extracted.
Existing data‑extraction attacks—most notably the prefix‑suffix approach and membership inference used on open‑ended language models—are ineffective for TODS. In TODS the model is not trained to reproduce raw user utterances; instead it is optimized to predict structured dialogue states (belief states) composed of domain‑slot‑value triples. This structural difference means that standard attacks either generate incoherent state strings or fail to capture the conditional dependencies required for valid extraction.
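To make the structural difference concrete, the belief states described above can be sketched as domain-slot-value triples that are linearized into the string the model is trained to predict. The exact serialization format and the domain/slot names below are illustrative assumptions, not the paper's:

```python
# Hypothetical sketch of a TODS belief state: domain-slot-value triples
# linearized into the string a dialogue-state-tracking model predicts.
# Domain, slot, and value names here are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotValue:
    domain: str
    slot: str
    value: str


def serialize_state(triples):
    """Flatten domain-slot-value triples into a linearized state string,
    grouping slots under their domain."""
    by_domain = {}
    for t in triples:
        by_domain.setdefault(t.domain, []).append(f"{t.slot}={t.value}")
    return "; ".join(
        f"{d}({', '.join(slots)})" for d, slots in by_domain.items()
    )


state = [
    SlotValue("restaurant", "name", "golden curry"),
    SlotValue("restaurant", "area", "centre"),
    SlotValue("taxi", "leave_at", "18:30"),
]
print(serialize_state(state))
# restaurant(name=golden curry, area=centre); taxi(leave_at=18:30)
```

Because the training target is this structured string rather than a raw user utterance, a prefix-suffix attack that decodes free-form text has no guarantee of producing anything that parses as a valid state.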
To address these challenges the authors propose two novel techniques. First, Schema‑Guided Sampling leverages the predefined dialogue schema (the list of domains, slots, and permissible value types) to constrain the token space during generation. By automatically probing the schema with a separate LLM (ChatGPT) and simulating user‑bot interactions, the method builds a pruned vocabulary that ensures generated candidates are syntactically valid and semantically plausible. This dramatically reduces the proportion of nonsensical outputs that plague naïve suffix decoding.
Second, Debiased Conditional Perplexity refines the membership inference step. Traditional perplexity‑based scores are biased toward frequent, generic fragments (e.g., greetings or common slot values), causing the attack to over‑rank harmless candidates. The proposed metric computes the conditional perplexity of a candidate given a prefix and then subtracts a bias term derived from the frequency of the candidate’s schema pattern in the training distribution. This correction amplifies the signal for rare, potentially private values while suppressing the noise from common patterns.
The experimental evaluation covers two attack settings. In the targeted extraction scenario, the adversary supplies a partial dialogue state (e.g., “Restaurant(name=”) to steer the model toward a specific value. In the untargeted extraction scenario, the adversary queries the model with an empty prompt and attempts to harvest any memorized slot‑value pairs. Results show that targeted attacks achieve near‑perfect precision for individual slot values (up to 100%) and exceed 70% precision for full event‑level states. Untargeted attacks are less effective overall, yet still manage up to 67% precision for isolated values, while full state extraction remains around 26% precision.
Beyond attack performance, the study quantifies two key factors influencing memorization. Substring repetition—the presence of repeated phrases or slot values across training dialogues—strongly correlates with higher memorization rates. Conversely, the one‑to‑many nature of dialogue responses (multiple correct states for the same context) disperses memorization, reducing extraction success.
Based on these insights, the authors suggest practical mitigation strategies. Reducing duplicate substrings during data preprocessing can lower the model’s propensity to memorize exact phrases. Introducing a value‑copy control mechanism—for example, limiting the model’s ability to directly copy token sequences from the input—can further diminish the risk of verbatim leakage without substantially harming overall task performance.
In summary, the paper makes four major contributions: (1) it identifies and formalizes a new privacy threat specific to LLM‑based task bots, focusing on structured belief states rather than free‑form text; (2) it provides a systematic threat model and benchmarks existing extraction techniques in this new context; (3) it introduces schema‑guided sampling and debiased conditional perplexity, both of which substantially improve extraction efficacy; and (4) it offers empirical analysis of memorization drivers and concrete mitigation recommendations. The work highlights that as LLM‑powered task bots become ubiquitous, careful attention must be paid to the privacy implications of their internal state representations.