Differentially Private Release of Public Transport Data: The Opal Use Case
This document describes the application of a differentially private algorithm to release public transport usage data from Transport for New South Wales (TfNSW), Australia. The data consists of two separate weeks of “tap-on/tap-off” records of individuals who used any of the four modes of public transport operated by TfNSW: buses, light rail, trains and ferries. These taps are recorded through the smart ticketing system, known as Opal, used in the state of New South Wales, Australia.
💡 Research Summary
The paper presents a concrete application of differential privacy to the release of public‑transport usage data collected by the Opal smart‑ticketing system in New South Wales, Australia. The raw dataset consists of “tap‑on/tap‑off” records for buses, trains, light rail and ferries over two separate weeks (14 days). Each record contains a unique card identifier, date, transport mode, tap‑on time, tap‑off time, and precise tap‑on/off locations. Direct publication of this data would expose individuals’ movement patterns and enable re‑identification attacks, so the authors set out to produce a privacy‑preserving sample that can be openly distributed.
The authors adopt the formal definition of (ε, δ)‑differential privacy: for any two neighboring datasets that differ in a single trip, the probability that a randomized mechanism outputs any particular result differs by at most a factor e^ε plus an additive term δ. They interpret this as “trip privacy,” meaning the addition or removal of a single trip does not substantially affect the released data.
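Written out, the definition reads as follows. This is the standard textbook statement rather than a quotation from the paper; the symbols M (the mechanism), D and D′ (neighboring datasets) and S (a set of possible outputs) are notation assumed here.

```latex
% (\varepsilon, \delta)-differential privacy, standard form.
% D and D' are neighboring datasets: they differ in a single trip.
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all output sets } S .
\]
```

Setting δ = 0 recovers pure ε‑differential privacy; a small positive δ allows the bound to fail with correspondingly small probability.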
To achieve this, they employ the Stability‑Based Histogram (SBH) algorithm, originally described in prior work on differentially private synthetic data generation. The algorithm proceeds as follows:
- Define the domain X as the set of all possible trip records (combinations of date, time bucket, location, and mode).
- For each point x ∈ X, compute the counting query qₓ(D) = number of occurrences of x in the original dataset D.
- Add Laplace noise with scale 2/ε to each count, yielding aₓ = qₓ(D) + Lap(2/ε).
- Apply a threshold τ = 2·ln(2/δ)/ε + 1; if aₓ < τ, set aₓ = 0.
- Round the remaining aₓ to the nearest integer and replicate point x that many times in the synthetic output D_out.
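The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the threshold is taken in the standard stability‑based form τ = 2·ln(2/δ)/ε + 1, and only points that actually occur in the input are visited (points with a true count of zero are never released).

```python
import math
import random
from collections import Counter


def sbh_threshold(epsilon, delta):
    """Threshold tau = 2*ln(2/delta)/epsilon + 1 (assumed standard form)."""
    return 2.0 * math.log(2.0 / delta) / epsilon + 1.0


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def stability_based_histogram(records, epsilon, delta, rng=None):
    """Return a synthetic list of records built from noisy, thresholded counts.

    `records` is an iterable of hashable trip points, e.g. tuples of
    (mode, time bucket, stop). Assumption: unobserved points are skipped.
    """
    rng = rng or random.Random()
    tau = sbh_threshold(epsilon, delta)
    synthetic = []
    for point, count in Counter(records).items():
        noisy = count + laplace_noise(2.0 / epsilon, rng)
        if noisy < tau:
            continue  # suppressed: noisy count falls below the threshold
        # Round to the nearest integer and replicate the point that many times.
        synthetic.extend([point] * round(noisy))
    return synthetic
```

With ε = 1 and δ = 10⁻⁵ the threshold is about 25.4, so a trip pattern that occurs only once or twice is almost always dropped, while a pattern with hundreds of occurrences survives with a count close to the true one.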
The Laplace mechanism guarantees ε‑differential privacy for each noisy count because the global sensitivity of a counting query is 1. The threshold step ensures that points whose true count is too low (i.e., sparse data) are unlikely to survive the noise, thereby preventing the algorithm from outputting a large number of spurious records. The authors prove (Theorem 5) that the whole procedure satisfies (ε, δ)‑differential privacy, and (Theorem 6) that the utility loss is bounded: with high probability the noisy counts deviate from the true counts by at most (1/ε)·ln(1/β) for any failure probability β.
Because the raw Opal data is extremely sparse (card identifiers combined with precise times and locations make most records unique, so many trips occur only once), the authors first preprocess the data to increase density. They remove the raw card identifier, aggregate times into 5‑ or 10‑minute buckets, and group locations at the station/stop level. They also partition the dataset by (date, transport mode) pairs, creating disjoint subsets. Because every trip falls into exactly one subset, adding or removing a trip changes only one of them, so the parallel composition theorem applies: each subset can be processed independently with the same ε, and the overall privacy loss remains ε (rather than ε multiplied by the number of partitions).
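A rough sketch of this preprocessing is shown below. The record layout (dictionary keys such as `tap_on_time` and `tap_on_stop`, and times given as "HH:MM" strings) is hypothetical and chosen only for illustration; the actual Opal schema is not described in this summary.

```python
from collections import defaultdict


def bucket_time(hhmm, width_minutes=10):
    """Round an 'HH:MM' time down to the start of its bucket."""
    hours, minutes = map(int, hhmm.split(":"))
    total = hours * 60 + minutes
    start = (total // width_minutes) * width_minutes
    return f"{start // 60:02d}:{start % 60:02d}"


def partition_by_date_and_mode(trips, width_minutes=10):
    """Drop the card identifier, coarsen times, and split into disjoint subsets.

    Returns a dict mapping (date, mode) to a list of coarsened trip points.
    Each trip lands in exactly one subset, which is what parallel
    composition relies on.
    """
    partitions = defaultdict(list)
    for trip in trips:
        point = (
            bucket_time(trip["tap_on_time"], width_minutes),
            trip["tap_on_stop"],
            bucket_time(trip["tap_off_time"], width_minutes),
            trip["tap_off_stop"],
        )  # note: the card identifier is deliberately not copied
        partitions[(trip["date"], trip["mode"])].append(point)
    return dict(partitions)
```

Each list in the result can then be passed to the histogram mechanism on its own, using the same ε for every (date, mode) pair.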
The paper discusses parameter selection. Typical values explored are ε = 0.1, 0.5, 1 with δ ≈ 10⁻⁵. Smaller ε yields stronger privacy but larger noise; ε = 0.5 is shown to give a good trade‑off, preserving key statistics (overall ridership per mode, peak‑hour distributions, station‑level boarding counts) within a few percent error. The authors evaluate utility by comparing histograms of the synthetic data against the original data for each partition, reporting mean absolute errors and relative errors across a suite of common transport‑analysis queries.
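To make the trade‑off concrete, the sketch below computes the suppression threshold for the three ε values mentioned, assuming δ is exactly 10⁻⁵ and the standard stability‑based threshold τ = 2·ln(2/δ)/ε + 1. It also includes two simple error measures of the kind described; the function names and the exact definitions are illustrative assumptions, not taken from the paper.

```python
import math


def sbh_threshold(epsilon, delta):
    """Threshold tau = 2*ln(2/delta)/epsilon + 1 (assumed standard form)."""
    return 2.0 * math.log(2.0 / delta) / epsilon + 1.0


def mean_absolute_error(true_counts, released_counts):
    """Average absolute difference over all histogram cells seen in either dict."""
    cells = set(true_counts) | set(released_counts)
    if not cells:
        return 0.0
    return sum(
        abs(true_counts.get(c, 0) - released_counts.get(c, 0)) for c in cells
    ) / len(cells)


def relative_error(true_total, released_total):
    """Absolute difference as a fraction of the true total (true_total > 0)."""
    return abs(released_total - true_total) / true_total


for epsilon in (0.1, 0.5, 1.0):
    print(epsilon, round(sbh_threshold(epsilon, 1e-5), 1))
# prints:
# 0.1 245.1
# 0.5 49.8
# 1.0 25.4
```

Under these assumptions, at ε = 0.1 a trip pattern needs a noisy count of roughly 245 to be released at all, whereas at ε = 0.5 the bar drops to about 50, which illustrates why a smaller ε removes many more low‑frequency cells.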
The final released “Opal dataset” is a synthetic, differentially private version of the original 14‑day sample. Each record contains an anonymised card token, date, mode, time bucket, and station identifiers, but no exact timestamps or unique identifiers that could be linked back to individuals. The dataset is made publicly available for download, accompanied by documentation of the privacy parameters used.
In the discussion, the authors acknowledge limitations: (1) extremely rare trips may be completely omitted, which could bias analyses that focus on low‑frequency routes; (2) the choice of ε and δ remains policy‑driven, and the paper does not provide a systematic method for selecting them beyond illustrative experiments; (3) the approach treats each trip independently, so longitudinal patterns (e.g., a commuter’s daily sequence of trips) are not protected as a whole. Nonetheless, the work demonstrates a practical pipeline—from raw smart‑card logs through preprocessing, differential‑privacy‑preserving synthesis, to public release—that can be adapted by other transit agencies.
Overall, the paper contributes a concrete, reproducible case study of applying differential privacy to real‑world mobility data, balancing legal and ethical privacy requirements with the need for open data to support research, planning, and innovation in public transportation.