AM-FM: A Foundation Model for Ambient Intelligence Through WiFi
Ambient intelligence, the continuous understanding of human presence, activity, and physiology in physical spaces, is fundamental to smart environments, health monitoring, and human-computer interaction. WiFi infrastructure provides a ubiquitous, always-on, privacy-preserving substrate for this capability across billions of IoT devices. Yet this potential remains largely untapped, as wireless sensing has typically relied on task-specific models that require substantial labeled data and limit practical deployment. We present AM-FM, the first foundation model for ambient intelligence and sensing through WiFi. AM-FM is pre-trained on 9.2 million unlabeled Channel State Information (CSI) samples collected over 439 days from 20 commercial device types deployed worldwide, learning general-purpose representations via contrastive learning, masked reconstruction, and physics-informed objectives tailored to wireless signals. Evaluated on public benchmarks spanning nine downstream tasks, AM-FM shows strong cross-task performance with improved data efficiency, demonstrating that foundation models can enable scalable ambient intelligence using existing wireless infrastructure.
💡 Research Summary
The paper introduces AM‑FM, the first foundation model designed for ambient intelligence using Wi‑Fi channel state information (CSI). Recognizing that Wi‑Fi signals are already pervasive, always‑on, and privacy‑preserving, the authors argue that the technology can serve as a universal sensing substrate for tasks ranging from activity recognition to physiological monitoring. However, prior work has been fragmented, relying on task‑specific models that require large labeled datasets and cannot share knowledge across tasks.
To address this, the authors assembled a massive, heterogeneous dataset: 9.2 million unlabeled CSI samples collected continuously over 439 days from 20 commercial IoT device types (33 physical units) spanning eight chipset families, multiple antenna configurations (1×1 to 2×2 MIMO), and both 2.4 GHz and 5 GHz bands. The recordings cover 11 real‑world indoor environments (studios, apartments, houses, townhouses) and involve 26 participants, thereby capturing a wide spectrum of hardware, layout, and interference conditions.
The core of AM‑FM is a self‑supervised learning framework that combines three objectives tailored to the physics of wireless propagation: (1) contrastive learning that treats temporally adjacent windows from the same link as positives and windows from different links as negatives, encouraging invariance to benign transformations; (2) masked reconstruction where random subcarriers are masked and the model must predict their amplitudes, forcing it to learn both local and global spectral patterns; and (3) a physics‑informed autocorrelation prediction task that asks the model to forecast the temporal autocorrelation function of the CSI, thereby embedding knowledge of multipath‑induced non‑local frequency dependencies.
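The three pre-training targets can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the function names, the ACF normalization, the masking scheme, and the InfoNCE temperature are all assumptions filled in for clarity.

```python
import numpy as np

def autocorr_target(window, max_lag):
    """Physics-informed target: the normalized temporal autocorrelation of a
    CSI amplitude window (T, F), averaged over frequency channels.
    The exact normalization used in the paper is an assumption here."""
    x = window - window.mean(axis=0, keepdims=True)
    denom = (x * x).sum(axis=0) + 1e-8          # per-channel energy
    acf = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        num = (x[:-lag] * x[lag:]).sum(axis=0)  # lagged inner product
        acf[lag - 1] = (num / denom).mean()     # average over channels
    return acf

def mask_subcarriers(window, mask_ratio, rng):
    """Masked reconstruction setup: zero out a random subset of subcarrier
    channels; the model must predict the hidden amplitudes (mask == True)."""
    mask = rng.random(window.shape[1]) < mask_ratio
    masked = window.copy()
    masked[:, mask] = 0.0
    return masked, mask

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive objective: row i of `positives` (a temporally adjacent
    window from the same link) should match row i of `anchors`; all other
    rows (windows from different links) act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on diagonal
```

In a training loop these three losses would be computed on each batch and summed (the paper's weighting between them is not reproduced here).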
Architecturally, the encoder first flattens the CSI tensor into a frequency‑spatial dimension F = N_tx × N_rx × N_sub. Because subcarrier quality varies widely across devices and environments, the model employs a cross‑attention mechanism with learnable query vectors to compress the raw F channels into a fixed set of 10 latent frequency tokens. This adaptive frequency aggregation learns to weight informative subcarriers more heavily while down‑weighting noisy ones. Temporal modeling uses a relative positional encoding that captures translation‑invariant periodicities (e.g., respiration at 0.2‑0.5 Hz, gait at 1‑2 Hz) without relying on absolute timestamps.
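The adaptive frequency aggregation step can be sketched as single-head cross-attention pooling. This is a simplified sketch under stated assumptions: the head count, projection matrices, and embedding dimension of the actual encoder are not specified here, and the learnable queries are shown as plain arrays rather than trained parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_frequencies(channel_embeds, queries):
    """Cross-attention pooling: `queries` (K, D) are learnable latent vectors
    and `channel_embeds` (F, D) are per-subcarrier embeddings, where
    F = N_tx * N_rx * N_sub varies across devices. Returns (K, D) latent
    frequency tokens, a fixed-size summary regardless of F; the attention
    weights let the model emphasize informative subcarriers."""
    D = queries.shape[1]
    attn = softmax(queries @ channel_embeds.T / np.sqrt(D), axis=1)  # (K, F)
    return attn @ channel_embeds                                     # (K, D)

# Example: a 2x2 MIMO link with 56 subcarriers gives F = 224 channels,
# compressed to K = 10 latent frequency tokens (dimensions are illustrative).
rng = np.random.default_rng(0)
channels = rng.normal(size=(2 * 2 * 56, 32))
queries = rng.normal(size=(10, 32))
tokens = aggregate_frequencies(channels, queries)
```

Because each output token is a convex combination of the channel embeddings, the summary stays in the span of the observed subcarriers while its size is decoupled from the device's antenna and subcarrier count.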
After pre‑training, the backbone is frozen and downstream tasks are tackled via lightweight temporal classifiers or bottleneck adapters, keeping the number of trainable parameters small. The authors evaluate AM‑FM on nine public benchmarks covering activity recognition, gesture detection, respiration monitoring, indoor localization, user identification, and imaging‑style tasks. Across all tasks, AM‑FM matches or exceeds state‑of‑the‑art task‑specific models while requiring 30–70% fewer labeled examples, demonstrating strong data efficiency and cross‑task generalization.
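A bottleneck adapter of the kind mentioned above can be sketched in a few lines. The bottleneck width, placement, and zero initialization shown here are common conventions for adapters, not details confirmed by the paper.

```python
import numpy as np

def bottleneck_adapter(features, W_down, W_up):
    """Residual bottleneck adapter on frozen-backbone features (.., D):
    project down to a small dimension r, apply a ReLU, project back up,
    and add the input. Only W_down (D, r) and W_up (r, D) are trained,
    so the trainable parameter count stays tiny relative to the backbone."""
    hidden = np.maximum(features @ W_down, 0.0)   # (.., r) bottleneck
    return features + hidden @ W_up               # residual connection

# Illustrative dimensions (assumed, not from the paper): D = 64, r = 8.
D, r = 64, 8
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, D))                   # frozen backbone output
W_down = rng.normal(size=(D, r)) * 0.02
W_up = np.zeros((r, D))   # zero-init: the adapter starts as an identity map
out = bottleneck_adapter(feats, W_down, W_up)
```

Zero-initializing the up-projection is a standard trick: at the start of fine-tuning the adapter passes features through unchanged, so training begins from the pre-trained representation rather than perturbing it.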
The paper also discusses limitations: only amplitude information is used (phase is discarded due to hardware noise), real‑time inference cost on edge devices is not quantified, and the dataset, while diverse, is still dominated by environments in the United States and China. Future work is suggested in three directions: (i) incorporating robust phase‑denoising to enable finer‑grained physiological sensing such as heart‑rate detection; (ii) optimizing the model for on‑device inference through pruning, quantization, or specialized hardware; and (iii) extending the dataset to more geographic regions and to multi‑AP collaborative scenarios, as well as exploring privacy‑preserving training (e.g., federated learning).
In summary, AM‑FM demonstrates that a large‑scale, self‑supervised foundation model can effectively learn universal representations from raw Wi‑Fi CSI, unlocking scalable ambient intelligence without additional sensing infrastructure. The work bridges the gap between the success of foundation models in vision and language and the emerging field of wireless sensing, paving the way for ubiquitous, low‑cost, and privacy‑aware smart environments.