Secure and Utility-Aware Data Collection with Condensed Local Differential Privacy

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Local Differential Privacy (LDP) is popularly used in practice for privacy-preserving data collection. Although existing LDP protocols offer high utility for large user populations (100,000 or more users), they perform poorly in scenarios with small user populations (such as those in the cybersecurity domain) and lack perturbation mechanisms that are effective for both ordinal and non-ordinal item sequences while protecting sequence length and content simultaneously. In this paper, we address the small user population problem by introducing the concept of Condensed Local Differential Privacy (CLDP) as a specialization of LDP, and develop a suite of CLDP protocols that offer desirable statistical utility while preserving privacy. Our protocols support different types of client data, ranging from ordinal data types in finite metric spaces (numeric malware infection statistics), to non-ordinal items (OS versions, transaction categories), and to sequences of ordinal and non-ordinal items. Extensive experiments are conducted on multiple datasets, including datasets that are an order of magnitude smaller than those used in existing approaches, which show that proposed CLDP protocols yield high utility. Furthermore, case studies with Symantec datasets demonstrate that our protocols accurately support key cybersecurity-focused tasks of detecting ransomware outbreaks, identifying targeted and vulnerable OSs, and inspecting suspicious activities on infected machines.


💡 Research Summary

The paper addresses a critical gap in privacy‑preserving data collection for cybersecurity: existing Local Differential Privacy (LDP) mechanisms assume large user populations (hundreds of thousands to millions) and therefore suffer severe utility loss when there are only a few thousand participants or fewer. To overcome this, the authors introduce Condensed Local Differential Privacy (CLDP), a specialization of LDP that incorporates a “condensation” principle: during perturbation, outputs that are closer to the true value are assigned higher probability, while distant outputs receive lower probability. This is realized by applying a distance‑based utility function within the Exponential Mechanism, yielding an output distribution that is more concentrated around the true value for a given privacy budget ε.
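The condensation idea can be illustrated with a minimal Python sketch of an exponential mechanism over a small ordinal domain. The function names, the absolute-difference distance, and the weight exp(-α·d/2) are assumptions chosen for illustration; the summary does not spell out the paper's exact scoring function.

```python
import math
import random

def condensed_probabilities(true_value, domain, alpha):
    """Return Pr[output = y] for each y in domain.

    Assumption: each candidate y gets weight exp(-alpha * |true_value - y| / 2),
    so candidates closer to the true value are more likely.
    """
    weights = [math.exp(-alpha * abs(true_value - y) / 2) for y in domain]
    total = sum(weights)
    return [w / total for w in weights]

def perturb(true_value, domain, alpha, rng=random):
    """Sample one perturbed value from the distance-weighted distribution."""
    probs = condensed_probabilities(true_value, domain, alpha)
    return rng.choices(domain, weights=probs, k=1)[0]

if __name__ == "__main__":
    domain = list(range(10))  # hypothetical domain, e.g. infection counts 0..9
    for y, p in zip(domain, condensed_probabilities(3, domain, alpha=1.0)):
        print(y, round(p, 4))
    print("sampled:", perturb(3, domain, alpha=1.0))
```

Under these assumptions, the probabilities assigned to any output y by two inputs v1 and v2 differ by at most a factor of exp(α·|v1 − v2|), which follows from the triangle inequality applied to both the weight and the normalizer.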

A Bayesian adversary model is employed to quantify privacy loss via Maximum Posterior Confidence (MPC). The authors prove that, with appropriate choices of ε, a distance metric d(·,·), and a condensation factor α, CLDP achieves the same (or lower) MPC as standard ε‑LDP, thereby offering equivalent privacy guarantees while improving utility.
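One plausible reading of MPC is the largest posterior probability a Bayesian adversary can assign to any true value after seeing any single output. The sketch below computes that quantity for a given prior and perturbation channel; the function names and the use of generalized randomized response (GRR) as the example channel are assumptions for illustration, not the paper's exact formulation.

```python
import math

def max_posterior_confidence(prior, channel):
    """prior[v] = Pr[true value is v]; channel[v][y] = Pr[output y | true value v].

    Returns max over (v, y) of the Bayes posterior Pr[v | y].
    """
    n_values = len(prior)
    n_outputs = len(channel[0])
    best = 0.0
    for y in range(n_outputs):
        evidence = sum(prior[v] * channel[v][y] for v in range(n_values))
        if evidence == 0:
            continue  # output y is never produced under this prior
        for v in range(n_values):
            best = max(best, prior[v] * channel[v][y] / evidence)
    return best

def grr_channel(k, epsilon):
    """Generalized randomized response over k values (a standard LDP baseline)."""
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)  # keep the true value
    q = (1 - p) / (k - 1)                                # report one specific other value
    return [[p if v == y else q for y in range(k)] for v in range(k)]

if __name__ == "__main__":
    uniform_prior = [1 / 3] * 3
    print(max_posterior_confidence(uniform_prior, grr_channel(3, math.log(3))))
```

Computing the same quantity for a distance-weighted channel and for a standard LDP channel is one way to compare parameter settings at equal adversarial confidence.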

Three concrete protocols are built on this foundation:

  1. Ordinal‑CLDP – Handles numeric or ordinal data (e.g., malware infection counts) by defining ℓ₁/ℓ₂ distances and applying the condensed exponential mechanism.
  2. Item‑CLDP – Handles categorical, non‑ordinal items (e.g., OS versions, transaction types) using a 0‑1 distance and either one‑hot or hash‑based encodings, preserving low communication overhead.
  3. Sequence‑CLDP – Extends the approach to ordered item sequences (e.g., system‑call logs, file‑download streams). Each position is perturbed independently with condensation, and an additional condensation step protects the overall sequence length.
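The sequence description above can be sketched roughly as follows. This is an illustrative simplification, not the paper's Sequence‑CLDP algorithm: the helper names, the padding strategy, and the reuse of a single α for the length and for every position are all assumptions, and the sketch does not account for how the privacy budget composes across positions.

```python
import math
import random

def sample_by_distance(true_value, candidates, distance, alpha, rng):
    """Exponential-mechanism style sampling: closer candidates are more likely."""
    weights = [math.exp(-alpha * distance(true_value, c) / 2) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def ordinal_distance(a, b):
    return abs(a - b)

def zero_one_distance(a, b):
    return 0 if a == b else 1

def perturb_sequence(seq, item_domain, max_len, alpha, rng):
    # Step 1 (assumed): perturb the sequence length as an ordinal value.
    lengths = list(range(1, max_len + 1))
    noisy_len = sample_by_distance(len(seq), lengths, ordinal_distance, alpha, rng)
    # Step 2 (assumed): truncate, or pad with uniformly random items.
    padded = list(seq[:noisy_len])
    while len(padded) < noisy_len:
        padded.append(rng.choice(item_domain))
    # Step 3: perturb each position independently with a 0-1 distance.
    return [
        sample_by_distance(item, item_domain, zero_one_distance, alpha, rng)
        for item in padded
    ]

if __name__ == "__main__":
    rng = random.Random(7)
    events = ["open", "write", "encrypt"]               # hypothetical activity log
    domain = ["open", "read", "write", "encrypt", "delete"]
    print(perturb_sequence(events, domain, max_len=6, alpha=2.0, rng=rng))
```

Swapping `zero_one_distance` for `ordinal_distance` in step 3 would give the ordinal variant of the per-position perturbation.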

Theoretical analysis demonstrates that CLDP satisfies ε‑LDP’s indistinguishability condition and provides explicit formulas for selecting α to balance privacy and utility.

Empirical evaluation is extensive. Synthetic experiments vary the number of users from 1,000 to 100,000, measuring L1 error in frequency estimation. While state‑of‑the‑art LDP methods (GRR, OLH, RAPPOR) exhibit errors exceeding 80 % at 2,500 users, CLDP reduces error to below 35 %, a 60‑70 % relative improvement. Real‑world case studies use Symantec telemetry to test three cybersecurity tasks: (i) ransomware outbreak detection, (ii) identification of vulnerable operating systems, and (iii) mining suspicious activity patterns on infected machines. In all cases CLDP delivers accurate frequency estimates, heavy‑hitter identification, and pattern mining results comparable to those obtained from raw data, whereas LDP either fails to apply (for sequences) or yields unacceptable accuracy loss. Communication cost analysis shows that CLDP’s bit‑length is comparable to or lower than that of OLH and RAPPOR.
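To make the L1 error measurement concrete, the following sketch runs the GRR baseline (generalized randomized response, in its standard textbook form) on a small synthetic population and compares estimated frequencies with true ones. The population, domain, and the choice to sum absolute differences of relative frequencies are assumptions; the summary does not state how the paper normalizes its error percentages.

```python
import math
import random
from collections import Counter

def grr_perturb(value, domain, epsilon, rng):
    """Keep the true value with probability p, else report a uniformly chosen other value."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([d for d in domain if d != value])

def grr_estimate(reports, domain, epsilon):
    """Unbiased relative-frequency estimates from GRR reports."""
    k, n = len(domain), len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1 / (math.exp(epsilon) + k - 1)
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}

def l1_error(estimated, true_freq):
    """Sum of absolute differences between estimated and true relative frequencies."""
    return sum(abs(estimated[v] - true_freq[v]) for v in true_freq)

if __name__ == "__main__":
    rng = random.Random(0)
    domain = ["win7", "win10", "macos", "linux"]        # hypothetical OS categories
    users = rng.choices(domain, weights=[4, 3, 2, 1], k=2500)
    true_freq = {v: users.count(v) / len(users) for v in domain}
    reports = [grr_perturb(u, domain, 1.0, rng) for u in users]
    print("L1 error:", l1_error(grr_estimate(reports, domain, 1.0), true_freq))
```

Rerunning the script with a larger `k` in `rng.choices` shows how the estimation error shrinks as the population grows, which is the effect the small-population experiments probe.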

The paper acknowledges limitations: the design of an appropriate distance metric is domain‑specific, and very long sequences may increase computational overhead. Future work is suggested on automated distance‑metric learning, adaptive condensation factors, and cross‑domain transfer of CLDP parameters.

Overall, the work makes a substantial contribution by providing a privacy‑preserving data collection framework that is both theoretically sound and practically effective for small‑scale cybersecurity environments, where traditional LDP mechanisms are inadequate.

