Physics-Guided Variational Model for Unsupervised Sound Source Tracking
Sound source tracking is commonly performed using classical array-processing algorithms, while machine-learning approaches typically rely on precise source position labels that are expensive or impractical to obtain. This paper introduces a physics-guided variational model capable of fully unsupervised single-source sound source tracking. The method combines a variational encoder with a physics-based decoder that injects geometric constraints into the latent space through analytically derived pairwise time-delay likelihoods. Without requiring ground-truth labels, the model learns to estimate source directions directly from microphone array signals. Experiments on real-world data demonstrate that the proposed approach outperforms traditional baselines and achieves accuracy and computational complexity comparable to state-of-the-art supervised models. We further show that the method generalizes well to mismatched array geometries and exhibits strong robustness to corrupted microphone position metadata. Finally, we outline a natural extension of the approach to multi-source tracking and present the theoretical modifications required to support it.
💡 Research Summary
The paper addresses the problem of estimating the direction of arrival (DOA) of a single acoustic source using a microphone array without any ground‑truth position labels. Classical array‑processing techniques such as MUSIC, ESPRIT, and SRP rely on precise array calibration, grid searches, or eigen‑structure analysis, and they often suffer from high computational cost or sensitivity to initialization. Recent supervised deep‑learning approaches (e.g., Cross3D, Neural‑SRP) achieve state‑of‑the‑art performance but require large amounts of labeled data, typically generated by simulation, which limits their applicability to on‑device or rapidly changing acoustic environments.
To overcome these limitations, the authors propose a physics‑guided variational model that combines a variational auto‑encoder (VAE) with a physics‑based decoder. The input to the system consists of generalized cross‑correlation with phase transform (GCC‑PHAT) features computed for every microphone pair. GCC‑PHAT captures the time‑delay information while being robust to noise, and it is naturally suited for time‑difference‑of‑arrival (TDOA) estimation.
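As a concrete reference for this feature-extraction step (an illustrative sketch, not the authors' code), GCC‑PHAT for one microphone pair takes only a few lines of NumPy; the peak's offset from the center bin estimates the pairwise delay in samples:

```python
import numpy as np

def gcc_phat(x_i, x_j, n_fft=None):
    """GCC-PHAT cross-correlation between two microphone signals.

    The peak's offset from the center bin of the returned array
    estimates the delay of x_j relative to x_i, in samples.
    """
    n = len(x_i) + len(x_j)
    if n_fft is None:
        n_fft = 1 << (n - 1).bit_length()  # next power of two >= n
    X_i = np.fft.rfft(x_i, n=n_fft)
    X_j = np.fft.rfft(x_j, n=n_fft)
    cross = X_j * np.conj(X_i)
    cross /= np.abs(cross) + 1e-12        # phase transform: discard magnitude
    cc = np.fft.irfft(cross, n=n_fft)
    return np.fft.fftshift(cc)            # move the zero-delay bin to the center
```

For example, two impulses offset by five samples produce a correlation peak five bins to the right of the center bin.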
The encoder maps the high‑dimensional GCC‑PHAT tensor to a low‑dimensional latent variable z that lives on the unit sphere. This is achieved by modeling the variational posterior with a von Mises‑Fisher (vMF) distribution, parameterized by a mean direction µ and a concentration κ. The vMF distribution is ideal for directional data because it enforces unit‑norm constraints and provides a smooth probability density on the sphere.
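For intuition, the 3‑D vMF density has the closed form p(z; µ, κ) = C(κ) exp(κ µᵀz) with normalizer C(κ) = κ / (4π sinh κ). A numerically stable log‑density sketch (for illustration; the paper's implementation details are not given here):

```python
import numpy as np

def vmf_logpdf(z, mu, kappa):
    """Log-density of a 3-D von Mises-Fisher distribution on the unit sphere.

    z  : (..., 3) unit vectors at which to evaluate
    mu : (3,) unit mean direction;  kappa : concentration > 0
    """
    # log sinh(k) = k + log(1 - exp(-2k)) - log(2), stable for large kappa
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    log_c = np.log(kappa) - np.log(4.0 * np.pi) - log_sinh
    return log_c + kappa * (z @ mu)
```

Larger κ concentrates probability mass around µ; as κ → 0 the density approaches the uniform value 1/(4π).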
During training, the decoder is not a learnable neural network but a deterministic physics model. Given a sampled latent direction z, the decoder computes the theoretical pairwise time delays ˆτₖ for each microphone pair (i, j) using the geometric relation
ˆτₖ(z) = F_s · (v_i − v_j)ᵀ z / c
where v_i and v_j are the known 3‑D coordinates of the microphones, c is the speed of sound, and F_s is the sampling frequency. These delays are turned into a discrete Gaussian‑like likelihood over the TDOA bins:
ℓₖ(τₖ) = −½ ((τₖ − ˆτₖ)/σ)²
followed by a softmax to obtain a normalized probability p(τₖ | z). The standard deviation σ is a global hyper‑parameter that reflects the overall uncertainty of the physical model. This likelihood replaces the usual neural decoder p(g | z) in the evidence lower bound (ELBO):
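Putting the two equations above together, the physics decoder can be sketched as follows (a minimal illustration; the speed-of-sound constant, TDOA bin grid, and pair ordering are assumptions, not taken from the paper):

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def physics_decoder(z, mic_pos, fs, n_bins, sigma=1.0):
    """Deterministic physics decoder: latent direction -> per-pair TDOA distributions.

    z       : (3,) unit vector, candidate source direction
    mic_pos : (M, 3) known microphone coordinates
    fs      : sampling frequency in Hz
    n_bins  : number of TDOA bins, centered on zero delay
    Returns an (n_pairs, n_bins) array of softmax probabilities p(tau_k | z).
    """
    pairs = [(i, j) for i in range(len(mic_pos)) for j in range(i + 1, len(mic_pos))]
    taus = np.arange(n_bins) - n_bins // 2            # candidate delays in samples
    probs = []
    for i, j in pairs:
        # theoretical delay: tau_hat = F_s * (v_i - v_j)^T z / c
        tau_hat = fs * (mic_pos[i] - mic_pos[j]) @ z / C_SOUND
        logits = -0.5 * ((taus - tau_hat) / sigma) ** 2
        p = np.exp(logits - logits.max())             # numerically stable softmax
        probs.append(p / p.sum())
    return np.array(probs)
```

For a two-microphone array on the x-axis, a source at endfire (z along the axis) peaks near the maximum delay bin, while a broadside source (z perpendicular) peaks at the zero-delay bin.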
ELBO = 𝔼_{q(z|g)}[log p(g | z)] − KL(q(z|g) ‖ p(z))

where g denotes the stacked GCC‑PHAT features, q(z|g) is the vMF posterior produced by the encoder, and the reconstruction term is evaluated by scoring the observed TDOA evidence under the per‑pair likelihoods p(τₖ | z). With a uniform prior p(z) on the unit sphere (the natural choice for directional data), the KL term acts as a regularizer on the concentration κ. Because the decoder contains no learnable parameters, maximizing the ELBO trains the encoder alone, which is what allows the model to learn source directions without ground‑truth labels.
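The ELBO's KL term, assuming a uniform prior on the unit sphere (a standard choice for directional latents; the excerpt does not spell out the paper's prior), has a closed form for a 3‑D vMF posterior: KL = log(κ / sinh κ) + κ coth κ − 1. A numerically stable sketch:

```python
import numpy as np

def kl_vmf_uniform(kappa):
    """KL( vMF(mu, kappa) || Uniform(S^2) ), closed form in 3-D.

    KL = log(kappa / sinh(kappa)) + kappa * coth(kappa) - 1
    Assumes a uniform spherical prior; independent of the mean direction mu.
    """
    # log sinh(k) = k + log(1 - exp(-2k)) - log(2), stable for large kappa
    log_sinh = kappa + np.log1p(-np.exp(-2.0 * kappa)) - np.log(2.0)
    coth = 1.0 / np.tanh(kappa)
    return np.log(kappa) - log_sinh + kappa * coth - 1.0
```

The KL vanishes as κ → 0 (the posterior collapses to the uniform prior) and grows monotonically with κ, penalizing overconfident direction estimates.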