Privately Fine-Tuned LLMs Preserve Temporal Dynamics in Tabular Data
Research on differentially private synthetic tabular data has largely focused on independent and identically distributed rows where each record corresponds to a unique individual. This perspective neglects the temporal complexity of longitudinal datasets, such as electronic health records, where a user contributes an entire (sub)table of sequential events. While practitioners might attempt to model such data by flattening user histories into high-dimensional vectors for use with standard marginal-based mechanisms, we demonstrate that this strategy is insufficient: flattening fails to preserve temporal coherence even when it maintains valid marginal distributions. We introduce PATH, a novel generative framework that treats the full table as the unit of synthesis and leverages the autoregressive capabilities of privately fine-tuned large language models. Extensive evaluations show that PATH effectively captures long-range dependencies that traditional methods miss. Empirically, our method reduces the distributional distance to real trajectories by over 60% and cuts state-transition errors by nearly 50% compared to leading marginal mechanisms, while achieving similar marginal fidelity.
💡 Research Summary
The paper addresses a critical gap in differentially private synthetic data generation: the inability of existing methods to handle longitudinal, table‑wise data in which each user contributes an entire sequence of rows (e.g., electronic health records). Traditional DP synthesizers treat rows as i.i.d. records, each belonging to a distinct individual; to apply them to temporal data, practitioners often "flatten" each user's trajectory into a single high‑dimensional vector. The authors demonstrate that flattening explodes the dimensionality, introduces massive sparsity, and preserves only low‑order local marginals, yielding synthetic tables that are locally plausible but globally inconsistent.
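To see why flattening is untenable, consider the domain a marginal‑based mechanism must cover once every row of a trajectory becomes its own block of columns. The sketch below is illustrative only; the column names and category counts are hypothetical, not taken from the paper.

```python
# Sketch (not from the paper): the domain of a flattened record multiplies
# the per-row domain once for every row in the trajectory.

def flattened_domain_size(n_rows: int, cols: dict[str, int]) -> int:
    """Domain size of one flattened record: each of the n_rows rows
    repeats every column, so the per-row domain is raised to n_rows."""
    per_row = 1
    for size in cols.values():
        per_row *= size
    return per_row ** n_rows

# One row with three small, hypothetical categorical columns: 500 cells.
cols = {"heart_rate_bin": 10, "bp_bin": 10, "ward": 5}

one_row = flattened_domain_size(1, cols)    # 500
ten_rows = flattened_domain_size(10, cols)  # 500**10, roughly 9.8e26
print(one_row, ten_rows)
```

A ten‑row trajectory over even this tiny schema already inflates the domain from 500 cells to roughly 10^27, so almost every flattened cell is empty and any noise added for privacy swamps the signal.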
To solve this, the authors propose PATH (Private Autoregressive Trajectory Histories). The key ideas are: (1) redefining the privacy unit as the full user table, using the add/remove adjacency definition; (2) leveraging the autoregressive capabilities of large language models (specifically the Gemma‑3/4 family) and fine‑tuning them with DP‑SGD to learn the joint distribution of entire tables; (3) generating tables in a two‑stage process—first an autoregressive row‑by‑row generation conditioned on previously generated rows, then a private selection step to prune noisy or duplicated rows. This approach naturally preserves long‑range dependencies across rows.
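The two‑stage loop can be sketched as follows. This is a toy stand‑in, not the paper's implementation: `sample_next_row` replaces the DP‑fine‑tuned LLM with a tiny hand‑written Markov step, and the duplicate filter is only a placeholder for the private selection step; the schema and state names are hypothetical.

```python
# Minimal sketch of autoregressive, row-by-row table generation followed by a
# pruning pass. The "model" here is a stub, not a fine-tuned LLM.
import random

SCHEMA = ["timestamp", "state"]  # hypothetical schema

def sample_next_row(history, rng):
    """Stub for the fine-tuned LLM: emit the next row conditioned on the
    previously generated rows (here, a toy Markov step on `state`)."""
    last_state = history[-1]["state"] if history else "A"
    nxt = rng.choice({"A": ["A", "B"], "B": ["B", "C"], "C": ["C", "A"]}[last_state])
    return {"timestamp": len(history), "state": nxt}

def generate_table(n_rows, seed=0):
    rng = random.Random(seed)
    history = []
    # Stage 1: autoregressive generation, each row conditioned on all prior rows.
    for _ in range(n_rows):
        history.append(sample_next_row(history, rng))
    # Stage 2: prune exact duplicates (a stand-in for the private selection
    # step that filters noisy or repeated rows).
    seen, pruned = set(), []
    for row in history:
        key = tuple(row[c] for c in SCHEMA)
        if key not in seen:
            seen.add(key)
            pruned.append(row)
    return pruned

table = generate_table(8)
```

Because every row is sampled conditioned on the full generated prefix, dependencies can span the whole table rather than being limited to the low‑order marginals a flattened representation preserves.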
The paper also introduces a suite of evaluation metrics tailored to longitudinal synthetic data: Table‑wise Distance to Closest Record (TDCR) based on Dynamic Time Warping, state‑transition matrix differences, HMM log‑likelihoods, MAUVE (computed over Gecko embeddings) for manifold overlap, and classifier indistinguishability tests. For domain‑specific data (NYC 311), additional temporal (hour‑of‑day) and geospatial (latitude/longitude) metrics are employed.
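Two of these metrics are easy to illustrate from scratch: the Dynamic Time Warping distance underlying TDCR, and an empirical state‑transition matrix whose difference between real and synthetic tables measures transition error. The code below is a from‑scratch illustration under standard textbook definitions, not the paper's implementation.

```python
# Hedged sketch: classic DTW distance (the core of TDCR) and a row-normalized
# empirical state-transition matrix.
import math

def dtw(a, b):
    """Dynamic-programming DTW distance between two 1-D sequences."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def transition_matrix(seq, states):
    """Row-normalized counts of state -> next-state moves in one sequence."""
    idx = {s: k for k, s in enumerate(states)}
    M = [[0.0] * len(states) for _ in states]
    for s, t in zip(seq, seq[1:]):
        M[idx[s]][idx[t]] += 1.0
    for row in M:
        total = sum(row)
        if total:
            for k in range(len(row)):
                row[k] /= total
    return M

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: warping absorbs the repeated value
```

TDCR then takes, for each synthetic trajectory, the DTW distance to its closest real trajectory; transition error compares the real and synthetic matrices entrywise.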
Experiments are conducted on three datasets: a synthetic HMM dataset for controlled testing, the MIMIC‑IV vital‑signs dataset (real patient trajectories), and NYC 311 service‑request logs. Compared against leading marginal‑based mechanisms (AIM, Direct) and a non‑private few‑shot LLM baseline (Gemini), PATH consistently outperforms across all metrics. On MIMIC‑IV, PATH reduces TDCR by more than 50% and halves state‑transition errors, while achieving a MAUVE score of 0.92, indicating near‑perfect manifold overlap. Similar gains are observed on NYC 311, where temporal rhythms and spatial distributions are faithfully reproduced. Importantly, these improvements are achieved under a modest privacy budget (ε = 2.0, δ = 1e‑5), demonstrating that high‑utility longitudinal synthesis is possible without sacrificing privacy.
The authors discuss limitations: current implementation assumes a common schema across users, and DP‑SGD fine‑tuning of large LLMs remains computationally expensive. Future work could extend PATH to heterogeneous schemas, multimodal time‑series, and streaming settings, as well as explore more efficient privacy‑preserving training algorithms.
In summary, the paper makes three major contributions: (1) formalizing user‑level DP for table‑wise data, (2) introducing a novel LLM‑based autoregressive synthesis framework (PATH) that preserves temporal dynamics under differential privacy, and (3) assembling a metric suite (TDCR, state‑transition differences, MAUVE) for rigorous evaluation of synthetic longitudinal datasets. This work represents a significant step forward in privacy‑preserving data sharing for domains where temporal coherence is essential.