Can We Infer Confidential Properties of Training Data from LLMs?
Large language models (LLMs) are increasingly fine-tuned on domain-specific datasets to support applications in fields such as healthcare, finance, and law. These fine-tuning datasets often have sensitive and confidential dataset-level properties – such as patient demographics or disease prevalence – that are not intended to be revealed. While prior work has studied property inference attacks on discriminative models (e.g., image classification models) and generative models (e.g., GANs for image data), it remains unclear if such attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark task for evaluating property inference in LLMs under two fine-tuning paradigms: question-answering and chat-completion. Built on the ChatDoctor dataset, our benchmark includes a range of property types and task configurations. We further propose two tailored attacks: a prompt-based generation attack and a shadow-model attack leveraging word frequency signals. Empirical evaluations across multiple pretrained LLMs show the success of our attacks, revealing a previously unrecognized vulnerability in LLMs.
💡 Research Summary
This paper presents a pioneering investigation into “property inference attacks” specifically targeting Large Language Models (LLMs). The core research question is whether confidential, dataset-level properties of the fine-tuning data—such as aggregate patient demographics or disease prevalence in a medical dataset—can be inferred from a deployed LLM. The authors demonstrate that such inference is not only possible but can be highly effective, revealing a previously unrecognized privacy vulnerability in LLMs.
The study makes three primary contributions. First, it introduces “PropInfer,” a comprehensive benchmark task for evaluating property inference in LLMs. Built upon the ChatDoctor dataset of patient-doctor dialogues, the benchmark is designed along two critical axes:
- Fine-Tuning Paradigms (Modes): It considers two common fine-tuning methods, leading to two model “modes.” The Q&A Mode uses Supervised Fine-Tuning (SFT), teaching the model to generate a doctor’s diagnosis conditioned on a patient’s symptom description. The Chat-Completion Mode uses Causal Language Modeling Fine-Tuning (CLM-FT), training the model to autoregressively predict tokens across the entire dialogue sequence (both patient and doctor turns). These modes lead to different memorization patterns of the input vs. output text.
- Target Property Types: To study how property location affects inference, two property categories are defined. Demographic properties (e.g., patient gender) are primarily revealed in the patient's description (the input x). Medical diagnosis properties (e.g., mental disorder, digestive disorder) are discussed across both the patient's query and the doctor's response (both x and y).
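The practical difference between the two modes comes down to which tokens contribute to the training loss. A minimal sketch of that label masking, assuming the common `-100` ignore-index convention used by standard training frameworks; the token ids, `prompt_len` split, and helper name are illustrative, not from the paper:

```python
IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def make_labels(input_ids, prompt_len, mode):
    """Build per-token loss labels for the two fine-tuning modes.

    Q&A mode (SFT): loss only on the doctor's response tokens, so the
    model learns p(y | x) and the patient prompt is masked out.
    Chat-completion mode (CLM-FT): loss on every token, so the model
    learns the joint distribution over the whole dialogue.
    """
    if mode == "qa":
        return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])
    elif mode == "chat":
        return list(input_ids)
    raise ValueError(f"unknown mode: {mode}")

# Toy token ids: the first 4 tokens are the patient turn, the rest the doctor turn.
ids = [101, 7592, 3460, 102, 2054, 2064, 2057, 102]
qa_labels = make_labels(ids, prompt_len=4, mode="qa")
clm_labels = make_labels(ids, prompt_len=4, mode="chat")
```

Because the Q&A labels mask the patient turn, input-side properties influence the model only through conditioning, whereas chat-completion training optimizes likelihood over those tokens directly.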
Second, the authors propose two novel attack methods tailored for the LLM setting:
- Generation-Based Attack (Black-Box): This attack assumes only API access to the target model. The adversary crafts context-specific prompts (e.g., “Hi doctor, I have a medical question.”) and uses them to generate a large number of output samples from the target LLM. Each generated sample is then labeled for the presence or absence of the target property (e.g., using a classifier or keyword matching). The property ratio in the fine-tuning data is estimated by averaging these labels. This attack leverages the idea that the model’s conditional generation distribution reflects its training data distribution.
- Shadow-Model Attack with Word Frequency (Grey-Box): This attack assumes the adversary has an auxiliary dataset from a related distribution and knowledge of the fine-tuning procedure. The adversary creates multiple “shadow models” by fine-tuning on the auxiliary data with varying, known ratios of the target property. For each shadow model, a feature vector is extracted by computing the frequency of a predefined set of keywords (related to the property) in the model’s generated outputs. A meta-regressor (e.g., a linear model) is trained to map these word-frequency vectors to the known property ratios. To attack the target model, its keyword output frequencies are computed and fed into the trained meta-regressor to infer the property ratio.
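The generation-based attack can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `generate` stub stands in for black-box API calls to the target LLM, and the keyword labeler for gender is a hypothetical example of the per-sample property classifier:

```python
import re

def generate(prompt, n_samples):
    # Hypothetical black-box generation API; stubbed outputs stand in
    # for samples drawn from the fine-tuned target model.
    return [
        "Hi doctor, I am a 34-year-old woman with a persistent cough.",
        "Hi doctor, I have had stomach pain for two weeks.",
        "Hi doctor, I am a man with frequent headaches.",
    ][:n_samples]

# Illustrative keyword list for one demographic property (female patient).
FEMALE_WORDS = {"woman", "female", "she", "her"}

def has_property(text):
    # Label a generated sample for presence of the target property.
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & FEMALE_WORDS)

def estimate_property_ratio(prompt, n_samples):
    # The property ratio is estimated as the mean label over samples,
    # relying on conditional generation reflecting the training data.
    samples = generate(prompt, n_samples)
    labels = [has_property(s) for s in samples]
    return sum(labels) / len(labels)

ratio = estimate_property_ratio("Hi doctor, I have a medical question.", 3)
```

In practice the labeler could be a trained classifier rather than keyword matching, and many more samples would be drawn to reduce estimator variance.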
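The shadow-model attack's meta-regression step can likewise be sketched. Here the keyword set, shadow ratios, and word-frequency vectors are fabricated stand-ins (in reality each feature row would be measured from a shadow model fine-tuned at that known ratio); the linear least-squares meta-regressor is one simple choice consistent with the description above:

```python
import numpy as np

# Illustrative keyword set related to the target property (hypothetical).
KEYWORDS = ["anxiety", "depression", "insomnia"]

# Shadow stage: known property ratios used to fine-tune shadow models,
# paired with the keyword frequencies measured in each model's outputs
# (stubbed numbers here, roughly increasing with the ratio).
shadow_ratios = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
shadow_features = np.array([
    [0.02, 0.01, 0.00],
    [0.08, 0.04, 0.01],
    [0.15, 0.07, 0.02],
    [0.21, 0.10, 0.03],
    [0.28, 0.13, 0.04],
])

# Meta-regressor: linear map from word-frequency features to ratio,
# fit by least squares with an explicit bias column.
X = np.hstack([shadow_features, np.ones((len(shadow_features), 1))])
coef, *_ = np.linalg.lstsq(X, shadow_ratios, rcond=None)

def infer_ratio(target_features):
    # Apply the trained meta-regressor to the target model's features.
    return float(np.append(target_features, 1.0) @ coef)

# Target stage: frequencies measured from the target model's generations.
target_features = np.array([0.15, 0.07, 0.02])
estimate = infer_ratio(target_features)
```

A richer meta-regressor (e.g., a small MLP) could be swapped in, but a linear model already captures the roughly monotone relationship between keyword frequency and property prevalence.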
Third, the paper provides an extensive empirical evaluation across multiple base LLMs (including Llama-2 and Mistral) and all benchmark configurations. The key findings are:
- The success of an attack strongly depends on the interaction between the fine-tuning mode and where the property is expressed.
- The word-frequency shadow-model attack is most effective against models fine-tuned in Q&A Mode, especially when the property is explicitly contained in the patient’s input (e.g., gender). This is because SFT focuses on the input-output mapping, making internal representations sensitive to input features.
- The generation-based attack excels against models fine-tuned in Chat-Completion Mode, and for properties that are distributed across both questions and answers (e.g., disease diagnoses). CLM-FT models learn the joint distribution of the entire dialogue, making their generated outputs a more direct reflection of the training data’s aggregate properties.
- Both proposed attacks significantly outperform baseline methods, confirming property inference as a real and potent threat to LLMs.
The paper concludes that property inference poses a tangible data confidentiality risk for real-world LLM deployments, distinct from traditional concerns about memorizing individual data points. It highlights the urgent need for new defense mechanisms that can protect against the leakage of such aggregate dataset properties. The release of the PropInfer benchmark and code provides a standardized framework to foster future research in both attacking and defending against this vulnerability.