Benchmarking Dataset for Presence-Only Passive Reconnaissance in Wireless Smart-Grid Communications
Benchmarking presence-only passive reconnaissance in smart-grid communications is challenging because the adversary is receive-only, yet nearby observers can still alter propagation through additional shadowing and multipath that reshapes channel coherence. Public smart-grid cybersecurity datasets largely target active protocol- or measurement-layer attacks and rarely provide propagation-driven observables with tiered topology context, which limits reproducible evaluation under strictly passive threat models. This paper introduces an IEEE-inspired, literature-anchored benchmark dataset generator for passive reconnaissance over a tiered Home Area Network (HAN), Neighborhood Area Network (NAN), and Wide Area Network (WAN) communication graph with heterogeneous wireless and wireline links. Node-level time series are produced through a physically consistent channel-to-metrics mapping where channel state information (CSI) is represented via measurement-realistic amplitude and phase proxies that drive inferred signal-to-noise ratio (SNR), packet error behavior, and delay dynamics. Passive attacks are modeled only as windowed excess attenuation and coherence degradation with increased channel innovation, so reliability and latency deviations emerge through the same causal mapping without labels or feature shortcuts. The release provides split-independent realizations with burn-in removal, strictly causal temporal descriptors, adjacency-weighted neighbor aggregates and deviation features, and federated-ready per-node train, validation, and test partitions with train-only normalization metadata. Baseline federated experiments highlight technology-dependent detectability and enable standardized benchmarking of graph-temporal and federated detectors for passive reconnaissance.
💡 Research Summary
The paper addresses a notable gap in smart‑grid cybersecurity research: the lack of benchmark data that captures passive, presence‑only reconnaissance attacks, where an adversary merely observes the communication medium without transmitting any packets. While most publicly available datasets focus on active threats such as false‑data injection, replay, or denial‑of‑service, they rarely provide low‑level propagation observables (e.g., CSI amplitude/phase, shadowing, coherence) that can be subtly altered by a nearby observer. To fill this void, the authors propose a synthetic benchmark dataset generator that models a tiered smart‑grid communication architecture consisting of a Home Area Network (HAN), Neighborhood Area Network (NAN), and Wide Area Network (WAN).
Topology and Technology Mapping
The generator builds a 12‑node graph that respects IEEE 2030/2030.5 guidelines. Nodes are assigned both a tier and a specific communication technology: ZigBee and Wi‑Fi for HAN devices, LoRa, PLC, and LTE for NAN components, and fiber or LTE for WAN back‑haul. Direct HAN‑WAN links are deliberately omitted to preserve realistic aggregation pathways. The adjacency matrix is constructed with tier‑aware constraints, and a row‑stochastic neighbor‑averaging operator (α = 0.3) is derived to enable graph‑aware feature engineering.
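As a rough illustration of the tier-aware adjacency and the row-stochastic operator, the sketch below builds a small graph in NumPy. The six-node layout, the edge list, and the function names are hypothetical. The reading of α as the weight placed on neighbors (with 1 − α kept on the node itself) is also an assumption, since the summary does not state how α enters the operator.

```python
import numpy as np

# Hypothetical 6-node layout for illustration only (the paper's graph has 12 nodes).
TIERS = ["HAN", "HAN", "NAN", "NAN", "WAN", "WAN"]
CANDIDATE_EDGES = [(0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5), (0, 4)]


def tier_allowed(tier_a, tier_b):
    """Tier-aware constraint: reject direct HAN-WAN links, allow everything else."""
    return {tier_a, tier_b} != {"HAN", "WAN"}


def build_adjacency(tiers, edges):
    """Symmetric 0/1 adjacency matrix containing only tier-compatible edges."""
    n = len(tiers)
    A = np.zeros((n, n))
    for i, j in edges:
        if tier_allowed(tiers[i], tiers[j]):
            A[i, j] = A[j, i] = 1.0
    return A


def neighbor_operator(A, alpha=0.3):
    """Assumed form: W = (1 - alpha) * I + alpha * D^{-1} A, which is row-stochastic."""
    deg = A.sum(axis=1, keepdims=True)
    P = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
    W = (1.0 - alpha) * np.eye(A.shape[0]) + alpha * P
    isolated = np.flatnonzero(deg.ravel() == 0)
    W[isolated, isolated] = 1.0  # isolated nodes keep all weight on themselves
    return W


A = build_adjacency(TIERS, CANDIDATE_EDGES)
W = neighbor_operator(A)
```

In this toy layout the candidate edge (0, 4) joins a HAN node to a WAN node, so it is dropped, and every row of `W` sums to one.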
Passive Threat Model
The adversary is strictly receive‑only and physically proximate to selected non‑fiber links. Its effect is modeled as (i) an additional shadow‑loss term and (ii) a reduction in channel coherence (i.e., faster temporal innovation). No packets are injected, replayed, jammed, or deliberately dropped. These two physical perturbations are injected into a technology‑conditioned channel model that combines large‑scale log‑normal shadowing (parameterized from 3GPP TR‑38.901) with a complex Gauss‑Markov fading process whose autocorrelation coefficient ρ depends on the underlying technology.
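One way to picture this channel model is the minimal sketch below: a per-link shadowing term in dB (Gaussian in dB, hence log-normal in linear scale) multiplied by a first-order complex Gauss-Markov fading process. The function name `simulate_csi`, the parameter names, and all numeric values are assumptions for illustration, not the paper's parameterization. The passive attack appears only as extra loss and a lower ρ inside a window.

```python
import numpy as np


def simulate_csi(T, rho, shadow_sigma_db, attack_window=None,
                 extra_loss_db=0.0, rho_attack=None, seed=0):
    """Return a complex CSI series of length T for one link (illustrative model)."""
    rng = np.random.default_rng(seed)

    # Mark the samples that fall inside the (hypothetical) attack window.
    attacked = np.zeros(T, dtype=bool)
    if attack_window is not None:
        attacked[attack_window[0]:attack_window[1]] = True

    # Large-scale term: one shadowing draw per link plus windowed excess attenuation.
    loss_db = rng.normal(0.0, shadow_sigma_db) + extra_loss_db * attacked

    # Coherence degradation: a smaller rho inside the window means more innovation.
    rho_t = np.full(T, float(rho))
    if rho_attack is not None:
        rho_t[attacked] = rho_attack

    # Unit-variance complex Gaussian innovations.
    w = (rng.normal(size=T) + 1j * rng.normal(size=T)) / np.sqrt(2.0)

    # Gauss-Markov recursion: h[t] = rho * h[t-1] + sqrt(1 - rho^2) * w[t].
    h = np.empty(T, dtype=complex)
    h[0] = w[0]
    for t in range(1, T):
        h[t] = rho_t[t] * h[t - 1] + np.sqrt(1.0 - rho_t[t] ** 2) * w[t]

    return 10.0 ** (-loss_db / 20.0) * h


baseline = simulate_csi(200, rho=0.95, shadow_sigma_db=4.0, seed=1)
attacked = simulate_csi(200, rho=0.95, shadow_sigma_db=4.0, attack_window=(50, 100),
                        extra_loss_db=6.0, rho_attack=0.80, seed=1)
amplitude, phase = np.abs(attacked), np.angle(attacked)
```

Because the adversary only changes `loss_db` and `rho_t`, nothing in the sketch injects, drops, or modifies packets, which matches the receive-only threat model described above.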
Physical‑to‑Metric Mapping
From the perturbed channel the dataset derives a causal chain of observables:
CSI amplitude → Signal‑to‑Noise Ratio (SNR) → Packet Error Rate (PER) → Latency.
Latency incorporates a baseline transmission time, an ARQ‑inspired retransmission expectation derived from PER, jitter, and burst‑error components that become more likely as PER rises. An EWMA‑smoothed latency is also exported to capture short‑term persistence. This chain ensures that any change in the physical layer propagates consistently through the higher‑layer performance metrics, so attack effects are never written directly into the features as artificial shortcuts.
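The chain can be sketched in a few lines. The version below is a simplified assumption rather than the paper's mapping: SNR is a fixed link budget plus the amplitude in dB, PER is a logistic curve in SNR, and latency is the baseline time multiplied by the expected number of transmissions under truncated ARQ. Jitter and burst-error terms are left out for brevity, and every parameter name and value is hypothetical.

```python
import numpy as np


def csi_to_metrics(amplitude, link_budget_db=10.0, per_midpoint_db=8.0,
                   per_slope=0.8, base_latency_ms=5.0, max_tx=5, ewma_beta=0.2):
    """Map CSI amplitude to (snr_db, per, latency_ms, ewma_latency_ms)."""
    amp = np.maximum(np.asarray(amplitude, dtype=float), 1e-12)

    # Amplitude -> SNR: assumed link budget plus channel gain in dB.
    snr_db = link_budget_db + 20.0 * np.log10(amp)

    # SNR -> PER: assumed logistic waterfall curve (lower SNR gives higher PER).
    per = 1.0 / (1.0 + np.exp(per_slope * (snr_db - per_midpoint_db)))

    # PER -> latency: expected transmissions with at most max_tx attempts,
    # E[tx] = sum_{k=0}^{max_tx-1} per^k.
    expected_tx = sum(per ** k for k in range(max_tx))
    latency_ms = base_latency_ms * expected_tx

    # EWMA smoothing uses only current and past samples.
    ewma = np.empty_like(latency_ms)
    ewma[0] = latency_ms[0]
    for t in range(1, len(latency_ms)):
        ewma[t] = ewma_beta * latency_ms[t] + (1.0 - ewma_beta) * ewma[t - 1]

    return snr_db, per, latency_ms, ewma


snr_db, per, latency_ms, ewma = csi_to_metrics([1.0, 0.5, 0.5])
```

Halving the amplitude lowers the SNR by about 6 dB, which raises PER and in turn latency, so a single physical perturbation moves all downstream metrics in a consistent direction.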
Dataset Generation and Leak‑Safe Design
Three independent realizations are generated for training, validation, and testing, each with a burn‑in period removed. Node‑level transmission counts (tx_count) follow role‑consistent patterns (periodic metering, near‑continuous PMU telemetry, intermittent DER activity). Attack labels are activity‑gated: a label of 1 is assigned only when tx_count > 0 on an attack‑eligible link; fiber links are always labeled normal.
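The activity-gating rule can be written as a small function. This is a sketch under the assumption that the attack window, the transmission counts, and the link technology are available per node; the name `gate_labels` and the `"fiber"` string are illustrative choices.

```python
import numpy as np


def gate_labels(attack_active, tx_count, technology):
    """Label 1 only when an attack window is active AND the node is transmitting.

    Fiber links are never attack-eligible, so they are always labeled 0.
    """
    attack_active = np.asarray(attack_active, dtype=bool)
    tx_count = np.asarray(tx_count)
    if technology == "fiber":
        return np.zeros(len(tx_count), dtype=int)
    return (attack_active & (tx_count > 0)).astype(int)


labels = gate_labels([0, 1, 1, 1, 0], [3, 0, 2, 5, 1], technology="zigbee")
# labels is [0, 0, 1, 1, 0]: the second sample is inside the window but idle.
```

Gating on `tx_count` avoids labeling idle intervals as attacked when no traffic exists to carry the perturbation into the observed metrics.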
Feature engineering is strictly causal. For each node, rolling statistics (mean, variance, entropy, drift), change descriptors, and activity indicators are computed using only past windows. Neighbor‑aware features are obtained by multiplying the node vector by the row‑stochastic matrix W, yielding a neighbor aggregate \(\bar{x}_i(t) = \sum_j W_{ij}\, x_j(t)\) and a deviation \(|x_i(t) - \bar{x}_i(t)|\). Per‑node standardization parameters are estimated solely on the training split and stored; the same parameters are applied to the validation and test data, preventing information leakage across splits.
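A minimal sketch of these three steps follows, assuming a `(T, N)` array holding one metric per node. The helper names are hypothetical, and the trailing window here includes the current sample, which is one possible reading of "past windows".

```python
import numpy as np


def causal_rolling_mean(x, window):
    """Trailing mean over samples [t - window + 1, t]; never looks ahead of t."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    for t in range(len(x)):
        out[t] = x[max(0, t - window + 1):t + 1].mean()
    return out


def neighbor_features(X, W):
    """X: (T, N) node metrics, W: (N, N) row-stochastic operator.

    Returns the neighbor aggregate x_bar_i(t) = sum_j W[i, j] * x_j(t)
    and the absolute deviation |x_i(t) - x_bar_i(t)|.
    """
    X_bar = X @ W.T
    return X_bar, np.abs(X - X_bar)


def fit_standardizer(X_train):
    """Per-node mean and std estimated on the training split only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return mu, np.where(sigma > 0, sigma, 1.0)


def apply_standardizer(X, mu, sigma):
    """Reuse stored training statistics for validation and test data."""
    return (X - mu) / sigma


rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 2)), rng.normal(size=(20, 2))
mu, sigma = fit_standardizer(X_train)
X_test_std = apply_standardizer(X_test, mu, sigma)
```

Fitting `mu` and `sigma` on `X_train` alone is what keeps test-split statistics from leaking into the model inputs.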
Federated‑Ready Packaging
The release includes per‑node train/validation/test partitions, node‑specific normalization metadata, and a JSON file describing the topology, node roles, and technology assignments. This structure enables three deployment scenarios: (1) centralized training on the concatenated dataset, (2) local training on individual node data, and (3) federated learning where each node trains a local model and aggregates updates (e.g., via FedAvg).
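For the federated scenario, the aggregation step of FedAvg reduces to a sample-size-weighted average of the clients' parameters. The sketch below assumes each node's model is a dictionary of NumPy arrays with identical keys and shapes; the function name and the toy parameter values are illustrative.

```python
import numpy as np


def fedavg(client_params, client_sizes):
    """Weighted average of per-client parameter dicts (weights = sample counts)."""
    total = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {
        k: sum(p[k] * (n / total) for p, n in zip(client_params, client_sizes))
        for k in keys
    }


# Two hypothetical nodes holding 1 and 3 training samples respectively.
node_a = {"w": np.array([0.0, 0.0]), "b": np.array([1.0])}
node_b = {"w": np.array([4.0, 8.0]), "b": np.array([5.0])}
global_params = fedavg([node_a, node_b], [1, 3])
# global_params["w"] is [3.0, 6.0] and global_params["b"] is [4.0]
```

Only parameters are exchanged in this step, so the per-node time series in each partition stay local.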
Baseline Experiments
The authors conduct baseline federated experiments using a Graph Convolutional Network combined with an LSTM (GCN‑LSTM) as the local model. Results show technology‑dependent detectability: wireless links (ZigBee, Wi‑Fi, LoRa) exhibit higher detection AUC because shadow loss and coherence degradation act more directly on SNR and PER, whereas fiber‑backbone links are essentially unaffected by the passive adversary. These findings are consistent with the dataset capturing subtle, propagation‑driven anomalies without resorting to overt packet manipulation.
Contributions and Impact
- Tiered, IEEE‑aligned topology that mirrors real smart‑grid communication layers.
- Purely propagation‑based attack modeling, avoiding any active packet‑level manipulation.
- Leak‑safe generation with independent splits, burn‑in removal, causal features, and train‑only normalization.
- Graph‑temporal feature set including neighbor aggregates and deviation metrics, facilitating GNN‑based approaches.
- Federated‑ready packaging, encouraging research on privacy‑preserving detection in distributed smart‑grid environments.
In summary, the paper delivers a rigorously constructed synthetic benchmark that enables reproducible evaluation of graph‑temporal and federated detectors for presence‑only passive reconnaissance in smart‑grid communications. By providing a physically consistent channel‑to‑metric pipeline, tier‑aware topology, and privacy‑preserving data splits, it lays a solid foundation for future research on subtle, propagation‑driven cyber threats in critical infrastructure.