CVPL: A Geometric Framework for Post-Hoc Linkage Risk Assessment in Protected Tabular Data

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Formal privacy metrics provide compliance-oriented guarantees but often fail to quantify actual linkability in released datasets. We introduce CVPL (Cluster-Vector-Projection Linkage), a geometric framework for post-hoc assessment of linkage risk between original and protected tabular data. CVPL represents linkage analysis as an operator pipeline comprising blocking, vectorization, latent projection, and similarity evaluation, yielding continuous, scenario-dependent risk estimates rather than binary compliance verdicts. We formally define CVPL under an explicit threat model and introduce threshold-aware risk surfaces, R(lambda, tau), that capture the joint effects of protection strength and attacker strictness. We establish a progressive blocking strategy with monotonicity guarantees, enabling anytime risk estimation with valid lower bounds. We demonstrate that the classical Fellegi-Sunter linkage emerges as a special case of CVPL under restrictive assumptions, and that violations of these assumptions can lead to systematic over-linking bias. Empirical validation on 10,000 records across 19 protection configurations demonstrates that formal k-anonymity compliance may coexist with substantial empirical linkability, with a significant portion arising from non-quasi-identifier behavioral patterns. CVPL provides interpretable diagnostics identifying which features drive linkage feasibility, supporting privacy impact assessment, protection mechanism comparison, and utility-risk trade-off analysis.


💡 Research Summary

The paper introduces CVPL (Cluster‑Vector‑Projection Linkage), a geometric framework designed to assess post‑hoc linkage risk between an original tabular dataset and its protected version. The authors begin by critiquing traditional formal privacy criteria such as k‑anonymity, l‑diversity, t‑closeness, and differential privacy, arguing that these binary compliance checks do not reveal how much residual linkability remains when useful statistical structure is preserved for utility. Model‑based attacks (e.g., membership inference) are also discussed, but they target trained models, are computationally heavy, and provide little insight for auditors.

To fill this gap, CVPL formalizes the linkage assessment as a composition of five operators:

  1. Blocking (B) – uses quasi‑identifiers (Q) to partition records into discrete blocks, possibly via exact matching, hierarchical generalization, or locality‑sensitive hashing.
  2. Vectorization (φ) – converts analytical attributes (A) into numeric vectors.
  3. Latent Projection (ψ) – maps vectors into a lower‑dimensional latent space (e.g., PCA, auto‑encoders) where similarity is more meaningful.
  4. Similarity (s) – computes a normalized similarity score between a source record x and a candidate y.
  5. Thresholding (τ) – decides whether the similarity meets or exceeds a user‑specified strictness level τ (the same symbol denotes both the operator and its cutoff value).

Thus CVPL = τ ∘ s ∘ ψ ∘ φ ∘ B. The threat model assumes an adversary who has full access to the released dataset D′ and an auxiliary dataset X that shares Q∪A attributes, but no knowledge of the anonymization parameters, noise distribution, or synthetic generator. For a conservative upper bound, the authors set X = D (the original dataset), guaranteeing that any realistic auxiliary data will yield a risk estimate no higher than CVPL’s output.
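As a rough illustration of how the five operators compose, the sketch below applies them to a toy example. Every name here (`block_key`, `vectorize`, `project`, `similarity`, `cvpl_matches`, the attribute names, and `BASIS`) is hypothetical. Exact-match blocking, an identity projection, and rescaled cosine similarity are stand-ins chosen for brevity, not choices confirmed by the paper.

```python
import math

def block_key(record, q_attrs):
    # B: exact-match blocking on the quasi-identifiers Q (assumed variant).
    return tuple(record[q] for q in q_attrs)

def vectorize(record, a_attrs):
    # phi: turn the analytical attributes A into a numeric vector.
    return [float(record[a]) for a in a_attrs]

def project(vec, basis):
    # psi: linear map into a latent space; `basis` stands in for a fitted
    # PCA or encoder, with one row per latent dimension.
    return [sum(w * v for w, v in zip(row, vec)) for row in basis]

def similarity(u, v):
    # s: cosine similarity rescaled to [0, 1] (assumed normalization).
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return (1.0 + cos) / 2.0

def cvpl_matches(x, released, q_attrs, a_attrs, basis, tau):
    # tau ∘ s ∘ psi ∘ phi, restricted to the block that B assigns to x.
    # Returns the indices of released records that pass the threshold.
    zx = project(vectorize(x, a_attrs), basis)
    hits = []
    for i, y in enumerate(released):
        if block_key(y, q_attrs) != block_key(x, q_attrs):
            continue
        zy = project(vectorize(y, a_attrs), basis)
        if similarity(zx, zy) >= tau:
            hits.append(i)
    return hits

# Made-up toy data, for illustration only.
Q, A = ["age_band", "zip3"], ["spend", "visits"]
BASIS = [[1.0, 0.0], [0.0, 1.0]]  # identity projection
x = {"age_band": "30-39", "zip3": "941", "spend": 10.0, "visits": 2.0}
released = [
    {"age_band": "30-39", "zip3": "941", "spend": 10.5, "visits": 2.1},
    {"age_band": "30-39", "zip3": "941", "spend": -8.0, "visits": 1.0},
    {"age_band": "40-49", "zip3": "941", "spend": 10.0, "visits": 2.0},
]
strict_hits = cvpl_matches(x, released, Q, A, BASIS, tau=0.95)
```

Blocking runs first so that similarity is only computed within a block. The third released record is therefore never scored, even though its analytical attributes are identical to those of `x`.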

Risk is defined in terms of existential linkage: a source record x is considered linkable if at least one candidate y in the same block has similarity ≥ τ. The linkage rate CVPL‑LR(τ) = P_x[x is linkable at threshold τ] is the probability of this event for a randomly drawn source record x.
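Under the existential definition, an empirical linkage rate could be estimated as the fraction of source records with at least one in-block candidate at or above τ. The following is a minimal sketch under that reading. The function name `linkage_rate`, the input layout, and the similarity values are all assumptions made for illustration.

```python
def linkage_rate(candidate_sims, tau):
    # candidate_sims[i] holds the similarities between source record i and
    # every released record in its block (empty if the block has no candidates).
    if not candidate_sims:
        return 0.0
    linked = sum(1 for sims in candidate_sims if any(s >= tau for s in sims))
    return linked / len(candidate_sims)

# Made-up similarity values for four source records.
sims = [[0.98, 0.40], [0.72], [], [0.91, 0.93, 0.10]]

# Sweeping tau traces one slice of a threshold-aware risk curve.
curve = {tau: linkage_rate(sims, tau) for tau in (0.5, 0.8, 0.95)}
```

Because raising τ can only remove qualifying candidates, the estimated rate never increases as the attacker becomes stricter. A record with an empty block contributes to the denominator but can never be linked.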

