From Closed-world Enforcement to Open-world Assessment of Privacy


In this paper, we develop a user-centric privacy framework for quantitatively assessing the exposure of personal information in open settings. Our formalization addresses key challenges posed by such settings, such as the unstructured dissemination of heterogeneous information and the necessity of user- and context-dependent privacy requirements. We propose a new definition of information sensitivity derived from our formalization of privacy requirements and, as a sanity check, show that hard non-disclosure guarantees are impossible to achieve in open settings. We then instantiate our framework for the identity disclosure problem, leading to the novel notion of d-convergence, which is based on the indistinguishability of entities and bounds the likelihood with which an adversary successfully links two profiles of the same user across online communities. Finally, we provide a large-scale evaluation of our framework on a collection of 15 million comments collected from the online social network Reddit. Our evaluation validates the notion of d-convergence for assessing the linkability of entities in our data set and provides deeper insights into the data set's structure.


💡 Research Summary

The paper addresses a fundamental gap in privacy research: the lack of a quantitative framework for assessing personal information exposure in open, unstructured online environments. Traditional privacy models such as k‑anonymity, l‑diversity, t‑closeness, and differential privacy assume a closed, well‑structured database where a global sanitization process can be applied. In contrast, modern online platforms (e.g., Reddit, Twitter) feature dynamic, heterogeneous user‑generated content that spreads across multiple channels, often with auxiliary background knowledge that can be arbitrarily rich. The authors therefore propose a user‑centric, open‑world privacy framework that models information as a set of attributes without pre‑labeling them as sensitive or non‑sensitive. Sensitivity is derived from each user’s explicit privacy requirements, allowing context‑dependent specifications.

The framework includes a formal adversary model based on ε‑semantic privacy, which assumes an attacker with unlimited auxiliary information and computational power. Under this model, the authors prove that hard non‑disclosure guarantees (i.e., absolute privacy) are impossible in open settings, establishing a theoretical justification for moving from enforcement to risk assessment.

To demonstrate the framework, the authors instantiate it for the identity‑disclosure problem. They introduce d‑convergence, a metric that quantifies how indistinguishable a particular entity is from the rest of the population based on a chosen distance (e.g., total variation or KL divergence) between attribute distributions. Low d‑convergence indicates that the entity blends into the crowd, whereas high values signal a higher risk of being singled out. Building on this, they define (k, d)‑anonymity, a generalization of classic k‑anonymity: an entity satisfies (k, d)‑anonymity if there exist at least k other entities within a d‑convergent neighbourhood. This captures both the size of the anonymity set and the quality of similarity, addressing limitations of purely cardinality‑based notions.
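The interplay between a distance-based convergence value and (k, d)‑anonymity can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' implementation: it assumes total variation distance between unigram distributions, represents each entity's attributes as a dict of word probabilities, and treats the d‑neighbourhood simply as the set of peers within distance d.

```python
from collections import Counter

def normalize(counts):
    """Turn raw attribute counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def d_neighbourhood(entity, population, d):
    """All other entities whose attribute distribution lies within distance d."""
    return [e for e in population if e is not entity
            and tv_distance(entity["dist"], e["dist"]) <= d]

def satisfies_kd_anonymity(entity, population, k, d):
    """(k, d)-anonymity: at least k other entities in the d-neighbourhood."""
    return len(d_neighbourhood(entity, population, k if False else d)) >= k if False else \
           len(d_neighbourhood(entity, population, d)) >= k
```

For example, an entity whose word distribution is identical to one peer but far from a third satisfies (1, 0.1)‑anonymity yet fails (2, 0.1)‑anonymity, capturing both the size and the quality of the anonymity set.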

For empirical validation, the authors collect 15 million Reddit comments, extract unigram frequency vectors as user attributes, and compute d‑convergence and (k, d)‑anonymity scores for each user. Processing is performed on two Dell PowerEdge R820 servers (64 virtual cores each) over six weeks. Results show a wide distribution of d‑convergence values; highly active users and those concentrated in niche subreddits stand out from their peers and thus face higher risk. Under a typical setting of k = 5 and d = 0.3, only about 68 % of users meet the anonymity criterion, and these users experience a linkability success rate below 15 % in simulated adversarial matching. Conversely, users failing the criterion have a linkability rate around 42 %. The analysis also reveals that topic concentration and activity level are strong predictors of privacy risk.
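The evaluation pipeline described above — unigram profiles plus adversarial profile matching — can be approximated as follows. The function names, the KL smoothing constant, and the toy matching strategy (nearest neighbour under smoothed KL divergence) are illustrative assumptions, not the authors' code:

```python
import math
from collections import Counter

def unigram_dist(comments):
    """Unigram frequency distribution over a user's concatenated comments."""
    counts = Counter(tok for c in comments for tok in c.lower().split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """Smoothed KL divergence D(p || q); eps avoids log-of-zero on unseen words."""
    keys = set(p) | set(q)
    return sum(p.get(k, eps) * math.log(p.get(k, eps) / q.get(k, eps))
               for k in keys)

def link(profile, candidates):
    """Simulated linkability attack: match a profile from one community to the
    closest candidate profile in another community."""
    return min(candidates, key=lambda cid: kl_divergence(profile, candidates[cid]))
```

On toy data, a user who writes about the same topic in two communities is re-identified by nearest-neighbour matching, mirroring the finding that topic concentration drives linkability.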

The paper’s contributions are threefold: (1) a rigorous open‑world privacy framework that accommodates user‑specified, context‑dependent sensitivity; (2) the novel d‑convergence metric and (k, d)‑anonymity definition for quantifying identity disclosure risk; (3) a large‑scale empirical study confirming the practical relevance of the proposed measures. The authors suggest future work on extending the model to multimodal data (images, video), real‑time updating of privacy requirements, and integrating the risk scores into user‑facing privacy‑awareness tools. This research paves the way for more nuanced, probabilistic privacy assessments that reflect the realities of modern, open‑web ecosystems.

