GRAM: Spatial general-purpose audio representation models for real-world applications
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as on recordings of real-world environments, and we release two complementary benchmark task suites: NatHEAR and RealSELD. Our results demonstrate that GRAM outperforms all state-of-the-art self-supervised audio foundation models on NatHEAR and on its clean, single-channel counterpart HEAR, while using only a fraction of the training data. GRAM also shows state-of-the-art localization performance in simulated environments and generalizes efficiently to real-world recordings in RealSELD. Taken together, GRAM represents a significant advance toward robust spatial audio foundation models for real-world environments.
💡 Research Summary
Audio foundation models have achieved impressive results on clean, single‑channel datasets, yet they struggle in realistic acoustic environments that feature reverberation, background noise, and spatial cues. This paper introduces GRAM (General‑purpose Real‑world Audio Model), a self‑supervised multi‑channel masked autoencoder designed to learn robust spatial audio representations. GRAM operates on both binaural (2‑channel) and first‑order ambisonic (4‑channel) inputs, reconstructing masked spectrogram patches while explicitly preserving interaural level differences (ILDs) for binaural audio and intensity vectors (IVs) for ambisonics. The encoder is a ViT‑Base (12‑layer) transformer; the decoder employs a local‑global attention scheme that first attends to nearby patches with varying window sizes and then uses global attention to integrate full‑scene context.
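The two spatial features mentioned above have standard definitions: ILDs are per-bin level contrasts between the left and right log-mel channels, and FOA intensity vectors are built from the real part of the cross-spectra between the omnidirectional channel W and the directional channels X, Y, Z. The sketch below illustrates these textbook formulations in NumPy; the helper names and the energy normalization are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def binaural_ild(logmel_left, logmel_right):
    """Interaural level difference in log-mel space: per-bin
    left-minus-right level contrast (hypothetical helper; the paper's
    exact ILD formulation may differ)."""
    return logmel_left - logmel_right

def foa_intensity_vectors(stft_w, stft_x, stft_y, stft_z, eps=1e-8):
    """Acoustic intensity vectors from first-order ambisonic STFTs.
    Each spatial component is Re{conj(W) * C} for C in {X, Y, Z},
    normalized here by total energy (normalization is an assumption)."""
    energy = np.abs(stft_w) ** 2 + (
        np.abs(stft_x) ** 2 + np.abs(stft_y) ** 2 + np.abs(stft_z) ** 2
    ) / 3.0
    norm = energy + eps
    iv = np.stack([
        np.real(np.conj(stft_w) * stft_x) / norm,
        np.real(np.conj(stft_w) * stft_y) / norm,
        np.real(np.conj(stft_w) * stft_z) / norm,
    ])
    return iv  # shape (3, n_freq, n_time)
```

Concatenating the three IV components with the four FOA channels gives the 7-channel ambisonic input described in the summary.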
To provide realistic training data, the authors built a large‑scale simulation pipeline using SoundSpaces 2.0 and Matterport3D scans of 85 houses. For each house they generated 1,000 acoustic scenes, yielding 85,000 binaural room impulse responses (BRIRs) and 85,000 ambisonic room impulse responses (ARIRs). Each scene includes a randomly placed listener, a target sound source, and a noise source (either localized or diffuse). During pre‑training, AudioSet clips (10 s) are convolved with the appropriate BRIR/ARIR, mixed with WHAMR! background noise at signal‑to‑noise ratios ranging from +5 dB to +40 dB, and transformed into 128‑band log‑mel spectrograms (padded to 1024 × 128). For ambisonics, intensity vectors are computed and concatenated, forming a 7‑channel input (W, X, Y, Z plus three IV components). An in‑batch sampling strategy creates 16 overlapping 2‑second segments per clip, effectively increasing the batch size to 1,536 while keeping GPU memory manageable.
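The core of this augmentation pipeline, convolving a dry clip with a room impulse response and then mixing in noise at a target SNR, can be sketched as follows. This is a minimal single-channel illustration with hypothetical helper names, not the authors' implementation; a multi-channel version would simply apply it per BRIR/ARIR channel.

```python
import numpy as np

def reverberate(clip, rir):
    """Convolve a dry clip with one channel of a room impulse
    response, truncated back to the clip's length."""
    return np.convolve(clip, rir)[: len(clip)]

def mix_at_snr(target, noise, snr_db):
    """Scale the noise so that the target-to-noise power ratio
    equals snr_db, then add it to the target."""
    p_target = np.mean(target ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_target / (p_noise * 10.0 ** (snr_db / 10.0)))
    return target + scale * noise

# Example: reverberate a clip, then mix noise at an SNR drawn from
# the +5 to +40 dB range used during GRAM pre-training.
rng = np.random.default_rng(0)
clip = rng.standard_normal(16000)       # stand-in for a dry AudioSet clip
rir = rng.standard_normal(2000) * np.exp(-np.arange(2000) / 400.0)
noise = rng.standard_normal(16000)      # stand-in for WHAMR! noise
snr_db = rng.uniform(5.0, 40.0)
mixture = mix_at_snr(reverberate(clip, rir), noise, snr_db)
```

The resulting mixture would then be converted to 128-band log-mel spectrograms before patching.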
Training follows a masked‑patch paradigm: 80 % of patches are replaced by a learnable mask token, and the decoder reconstructs the full multi‑channel spectrogram. Patch sizes are 2 × 8 × 16 for binaural and 7 × 8 × 16 for ambisonics. The decoder’s local‑global attention uses window sizes
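The masking step described above can be sketched as a uniform random choice over the patch grid. The helper below is an assumed illustration of standard MAE-style masking, not the paper's code; with the 1024 × 128 spectrogram and 8 × 16 patches from the summary, the grid holds 1,024 patches, of which 80 % are hidden from the encoder.

```python
import numpy as np

def random_patch_mask(n_time_patches, n_freq_patches, mask_ratio=0.8, rng=None):
    """Boolean mask over the patch grid: True marks a patch replaced by
    the learnable mask token, False marks a patch the encoder sees.
    (Hypothetical helper illustrating MAE-style uniform masking.)"""
    if rng is None:
        rng = np.random.default_rng()
    n_patches = n_time_patches * n_freq_patches
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask.reshape(n_time_patches, n_freq_patches)

# 1024 time frames / 8 and 128 mel bands / 16 give a 128 x 8 patch grid.
mask = random_patch_mask(128, 8, mask_ratio=0.8)
```

The channel dimension of each patch (2 for binaural, 7 for ambisonics) is masked jointly, so spatial cues within a patch are either fully visible or fully hidden.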