Practical privacy metrics for synthetic data


This paper explains how the synthpop package for R has been extended to include functions that calculate measures of identity and attribute disclosure risk for synthetic data, measuring risks for the records used to create the synthetic data. The basic function, disclosure, calculates identity disclosure for a set of quasi-identifiers (keys) and attribute disclosure for one variable, specified as a target, from the same set of keys. The second function, disclosure.summary, is a wrapper for the first and presents summary results for a set of targets. This short paper explains the measures of disclosure risk and documents how they are calculated. We recommend two measures: $RepU$ (replicated uniques) for identity disclosure and $DiSCO$ (Disclosive in Synthetic Correct Original) for attribute disclosure. Both are expressed as a percentage of the original records, and each can be compared with a similar measure calculated from the original data. Experience with using the functions on real data found that some apparent disclosures could be identified as coming from relationships in the data that would be expected to be known to anyone familiar with its features. We flag cases where this seems to have occurred and provide means of excluding them. This paper was originally written as a vignette for the R package synthpop, with substantial changes added in February 2026 for synthpop version 1.9-3.


💡 Research Summary

The paper introduces practical privacy-risk metrics for fully synthetic data and implements them in the R package synthpop. Two functions, disclosure and its wrapper disclosure.summary, compute identity-disclosure and attribute-disclosure measures for a synthetic data set relative to the original data from which it was generated. The authors recommend two primary metrics: RepU (replicated uniques) for identity disclosure and DiSCO (Disclosive in Synthetic Correct Original) for attribute disclosure. Both metrics are expressed as percentages of the original records, allowing a direct comparison between the synthetic and original data.

Identity disclosure is assessed by forming a composite quasi‑identifier q from a user‑specified set of keys (e.g., sex, age, region). The proportion of original records that are unique on these keys is denoted UiO. RepU is the proportion of those uniquely identified original records that remain unique in the synthetic data. Additional auxiliary measures (UiS, UiOiS) are also provided, but RepU is the recommended summary statistic because it captures the risk that a record uniquely identifiable in the original remains uniquely identifiable after synthesis.
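The RepU calculation reduces to counting key patterns in both data sets. The sketch below is an illustrative reimplementation on toy data, not synthpop's own code; the function name identity_risk and the data frames are made up for this example:

```python
import pandas as pd

def identity_risk(orig, synth, keys):
    """UiO and RepU as percentages of the original records (illustrative)."""
    q_orig = orig[keys].astype(str).agg("|".join, axis=1)  # composite key q per record
    q_syn = synth[keys].astype(str).agg("|".join, axis=1)
    orig_counts = q_orig.value_counts()
    syn_counts = q_syn.value_counts()
    # UiO: original records whose key pattern is unique in the original data
    unique_in_orig = q_orig.map(orig_counts).eq(1)
    # RepU: those uniques whose pattern is also unique in the synthetic data
    replicated = unique_in_orig & q_orig.map(syn_counts).eq(1)
    n = len(orig)
    return {"UiO": 100 * int(unique_in_orig.sum()) / n,
            "RepU": 100 * int(replicated.sum()) / n}

orig = pd.DataFrame({"sex": ["M", "F", "M", "F"], "age": [30, 30, 40, 30]})
synth = pd.DataFrame({"sex": ["M", "M", "F"], "age": [30, 40, 40]})
print(identity_risk(orig, synth, ["sex", "age"]))  # {'UiO': 50.0, 'RepU': 50.0}
```

Here two of the four original records are unique on the keys (UiO = 50%), and both of their key patterns are also unique in the synthetic data, so RepU = 50% as well.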

Attribute disclosure is evaluated through a series of increasingly stringent steps. First, the proportion of original records whose key pattern q appears at least once in the synthetic data is iS (in Synthetic). Second, among those, the proportion for which all synthetic records sharing the same q also share the same target variable value t is DiS (Disclosive in Synthetic). Finally, DiSCO is the proportion of original records for which the synthetic data (a) contains the same q, (b) yields a consistent target value across all matching synthetic records, and (c) that consistent value matches the true value in the original. DiSCO therefore measures the risk that an intruder, equipped only with the synthetic data and the key values, can correctly infer a previously unknown attribute of a real individual.
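The three steps translate directly into group-by logic over the key patterns. This is a hedged sketch on toy data (the helper attribute_risk is hypothetical, not synthpop's implementation):

```python
import pandas as pd

def attribute_risk(orig, synth, keys, target):
    """iS, DiS and DiSCO as percentages of the original records (illustrative)."""
    q_o = orig[keys].astype(str).agg("|".join, axis=1)
    q_s = synth[keys].astype(str).agg("|".join, axis=1)
    grp = synth.groupby(q_s)[target]
    # key patterns for which every matching synthetic record has the same target
    consensus = grp.first()[grp.nunique().eq(1)]
    in_syn = q_o.isin(q_s)                  # step 1 (iS): pattern appears in synthetic
    disclosive = q_o.isin(consensus.index)  # step 2 (DiS): synthetic records agree on t
    correct = disclosive & q_o.map(consensus).eq(orig[target])  # step 3 (DiSCO): t is right
    n = len(orig)
    return {"iS": 100 * int(in_syn.sum()) / n,
            "DiS": 100 * int(disclosive.sum()) / n,
            "DiSCO": 100 * int(correct.sum()) / n}

orig = pd.DataFrame({"sex": ["M", "F", "M", "F"],
                     "age": [30, 30, 40, 30],
                     "income": ["high", "low", "high", "high"]})
synth = pd.DataFrame({"sex": ["M", "M", "F", "M", "M"],
                      "age": [30, 30, 30, 40, 40],
                      "income": ["high", "high", "low", "low", "high"]})
print(attribute_risk(orig, synth, ["sex", "age"], "income"))
# {'iS': 100.0, 'DiS': 75.0, 'DiSCO': 50.0}
```

In the toy data every original key pattern appears in the synthetic data (iS = 100%), the synthetic records disagree on income for one pattern (DiS = 75%), and the consensus value is correct for only two of the four original records (DiSCO = 50%).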

The functions accept multiple synthetic replicates (parameter m) and automatically aggregate results, providing means, standard deviations, and optional visualisations. Numeric variables are treated as categorical by default, but the user can control grouping via ngroups_keys and ngroups_targets. The package also integrates statistical disclosure control (SDC) tools—category merging, smoothing, and removal of replicated uniques—so that users can iteratively reduce the reported risks.
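As a rough illustration of the grouping idea for numeric variables: a continuous key can be coarsened into a fixed number of bins before key patterns are matched. The quantile binning below is only a sketch of that idea; the exact binning synthpop applies via ngroups_keys and ngroups_targets may differ in detail:

```python
import pandas as pd

# Hypothetical example: coarsen a numeric key into 4 quantile-based groups
# before forming composite key patterns (not synthpop's own binning rules).
ages = pd.Series([21, 25, 33, 38, 44, 52, 60, 67])
age_groups = pd.qcut(ages, q=4)  # 4 roughly equal-sized bins
print(age_groups.value_counts().sort_index().tolist())  # [2, 2, 2, 2]
```

Treating each bin as a categorical level then makes the unique-pattern counts behind UiO, RepU, and DiSCO well defined for numeric keys.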

A notable feature is the automatic flagging of disclosures that arise from known 1‑way or 2‑way relationships in the original data (e.g., deterministic links between variables). These flagged cases can be excluded from the risk calculations, preventing over‑estimation of danger in contexts where such relationships would be publicly known.
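A minimal check for one kind of such relationship, a key whose value alone determines the target in the original data, might look like the sketch below. The function name deterministic_keys and the toy data are invented for illustration; synthpop's actual flagging and exclusion machinery is more elaborate and is not reproduced here:

```python
import pandas as pd

def deterministic_keys(df, keys, target):
    """Return keys whose value alone fixes the target in the original data,
    i.e. a two-way relationship anyone familiar with the data would know."""
    flagged = []
    for k in keys:
        # one distinct target value per level of k => deterministic link
        if df.groupby(k)[target].nunique().eq(1).all():
            flagged.append(k)
    return flagged

orig = pd.DataFrame({"region": ["N", "N", "S", "S"],
                     "country": ["A", "A", "B", "B"],
                     "age": [30, 40, 30, 40]})
print(deterministic_keys(orig, ["region", "age"], "country"))  # ['region']
```

Records whose apparent disclosure stems only from such a link could then be excluded from the risk totals, in the spirit of the exclusions the paper describes.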

The authors illustrate the workflow with the Polish “SD2011” survey. Using four keys (sex, age, region, place size) they obtain UiO = 48.38 % (nearly half of the original records are unique on these keys). RepU drops to 14.86 % in the synthetic data, indicating a substantial reduction in identity risk. For attribute disclosure, the original data shows Dorig ≈ 53 % (over half of the records would allow inference of a target variable), while DiSCO falls to roughly 9 %, demonstrating that the synthetic data greatly limits correct attribute inference. The paper also discusses how reducing the number of released synthetic records further lowers DiSCO and related measures.

Although differential privacy (DP) is mentioned, the focus remains on these concrete, interpretable metrics that can be computed directly from any synthetic data set without additional privacy budgets. The authors argue that RepU and DiSCO provide a transparent, comparable baseline for assessing whether a synthetic release is safer than the original, and they suggest future extensions toward a more general τ‑threshold framework that would subsume both metrics.

In summary, the paper delivers a clear, implementable methodology for quantifying privacy risk in fully synthetic data, embeds it in a widely used R package, and demonstrates its utility on real‑world data. The proposed metrics enable data custodians to evaluate and, if necessary, mitigate disclosure risk before releasing synthetic data to the public or research community.

