Spatially-Augmented Sequence-to-Sequence Neural Diarization for Meetings
This paper proposes a Spatially-Augmented Sequence-to-Sequence Neural Diarization (SA-S2SND) framework, which integrates direction-of-arrival (DOA) cues estimated by SRP-DNN into the S2SND backbone. A two-stage training strategy is adopted: the model is first trained with single-channel audio and DOA features, and then further optimized with multi-channel inputs under DOA guidance. In addition, a simulated-DOA generation scheme is introduced to alleviate dependence on matched multi-channel corpora. On the AliMeeting dataset, SA-S2SND consistently outperforms the S2SND baseline, achieving a 7.4% relative DER reduction in offline mode and over 19% improvement when combined with channel attention. These results demonstrate that spatial cues are highly complementary to cross-channel modeling, yielding strong performance in both online and offline settings.
💡 Research Summary
The paper introduces Spatially‑Augmented Sequence‑to‑Sequence Neural Diarization (SA‑S2SND), a framework that explicitly incorporates direction‑of‑arrival (DOA) information into a sequence‑to‑sequence diarization backbone (S2SND). The authors first adopt SRP‑DNN, a lightweight causal CRNN that learns direct‑path inter‑channel phase differences (DP‑IPD) and produces a steered‑response‑power‑style spatial spectrum. By applying an iterative detection‑and‑removal (IDL) strategy, SRP‑DNN can estimate the azimuth of up to two active speakers per frame, even under reverberation, noise, and overlapping speech. The estimated azimuths are encoded as a probability matrix, up‑sampled to match the frame‑level acoustic embeddings, linearly projected to the model dimension, and added to the encoder output via a residual connection. This acts as a positional‑like encoding that supplies the network with explicit spatial priors, enabling it to separate speakers that overlap temporally but originate from different directions.
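The DOA-fusion step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `encode_doa` (a hypothetical helper) turns per-frame azimuth estimates into a soft probability matrix over azimuth bins, and `fuse_doa` up-samples it to the acoustic frame rate, projects it to the model dimension, and adds it to the encoder output as a residual, positional-encoding-style term. Bin count, Gaussian smoothing width, and nearest-neighbor up-sampling are all assumptions.

```python
import numpy as np

def encode_doa(azimuths_deg, num_bins=72, sigma_deg=5.0):
    """Hypothetical encoding: per-frame list of azimuths (up to two
    active speakers) -> soft probability vector over azimuth bins,
    using a Gaussian bump around each estimated direction."""
    bin_centers = np.arange(num_bins) * (360.0 / num_bins)
    frames = []
    for az_list in azimuths_deg:
        p = np.zeros(num_bins)
        for az in az_list:
            d = np.abs(bin_centers - az)
            d = np.minimum(d, 360.0 - d)          # circular distance
            p += np.exp(-0.5 * (d / sigma_deg) ** 2)
        frames.append(p)
    return np.stack(frames)                        # (T_doa, num_bins)

def fuse_doa(encoder_out, doa_probs, proj):
    """Up-sample DOA frames to the acoustic frame rate, project to the
    model dimension, and add as a residual (positional-like encoding)."""
    T = encoder_out.shape[0]
    idx = np.linspace(0, doa_probs.shape[0] - 1, T).round().astype(int)
    upsampled = doa_probs[idx]                     # (T, num_bins)
    return encoder_out + upsampled @ proj          # (T, d_model)
```

In the paper the projection is a learned linear layer; here `proj` is just a weight matrix standing in for it.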
Training proceeds in two stages. Stage A trains a single-channel version of the model while feeding it DOA cues derived from real multi-channel recordings (via SRP-DNN) as well as simulated DOA generated online from VAD-based random assignments. The extractor (a ResNet pretrained on speaker verification) is frozen at first, then unfrozen, and finally the whole network is fine-tuned. Stage B upgrades the model with a cross-channel attention module to process genuine multi-channel audio, again using SRP-DNN DOA as auxiliary input. The two-stage schedule isolates the effect of spatial information from that of channel fusion, ensuring that the model first learns to exploit DOA in an acoustic-only setting before learning to combine it with multi-channel attention.
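The simulated-DOA idea can be sketched as follows. Assuming (this is an illustrative reading, not the paper's exact recipe) that each speaker in a single-channel training mixture is assigned a random fixed azimuth, kept apart from other speakers by a minimum angular separation, per-frame DOA labels can then be read off the VAD annotations:

```python
import random

def simulate_doa(vad, min_sep_deg=20.0, seed=0):
    """Hypothetical simulated-DOA scheme: assign each speaker a random
    fixed azimuth (at least min_sep_deg apart, circularly) and derive
    per-frame active directions from VAD labels.
    vad: dict mapping speaker id -> list of 0/1 frame activities."""
    rng = random.Random(seed)
    azimuths = {}
    for spk in vad:
        while True:
            az = rng.uniform(0.0, 360.0)
            # circular separation check against already-placed speakers
            if all(min(abs(az - a), 360.0 - abs(az - a)) >= min_sep_deg
                   for a in azimuths.values()):
                azimuths[spk] = az
                break
    num_frames = len(next(iter(vad.values())))
    frames = []
    for t in range(num_frames):
        active = [azimuths[s] for s in vad if vad[s][t]]
        frames.append(active[:2])  # DOA estimator caps at two speakers/frame
    return azimuths, frames
```

The minimum-separation constraint and the two-speaker cap mirror the assumptions the SRP-DNN estimator operates under; actual sampling details in the paper may differ.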
Inference follows the block‑wise sliding‑window scheme of the original S2SND. Each block consists of left, chunk, and right contexts; the chunk shift defines the online latency (0.8 s in the experiments). DOA features are interpolated and fused at the encoder stage. The detection decoder consumes a fixed speaker‑embedding buffer (including pseudo‑speaker, buffered speakers, and non‑speech tokens) and outputs activity probabilities; the representation decoder updates the speaker embeddings. After an online pass, the buffered embeddings enable a second, offline pass that rescues errors and improves DER.
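The block layout of this sliding-window scheme is easy to make concrete. The sketch below (a hypothetical helper, with frame counts chosen freely) enumerates blocks where each chunk is decoded with additional left and right context, and the chunk shift sets the online latency:

```python
def sliding_blocks(num_frames, chunk, left, right):
    """Block-wise decoding schedule in the spirit of S2SND inference:
    each block reads frames [ctx_lo, ctx_hi) but only emits decisions
    for [start, end); the chunk length sets the online latency.
    Returns a list of (ctx_lo, start, end, ctx_hi) tuples."""
    blocks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk, num_frames)
        ctx_lo = max(0, start - left)          # left context
        ctx_hi = min(num_frames, end + right)  # right context
        blocks.append((ctx_lo, start, end, ctx_hi))
        start = end                            # shift by one chunk
    return blocks
```

With a 0.8 s chunk shift as in the experiments, each emitted chunk lags real time by at most the chunk length plus the right context.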
Experiments are conducted on the AliMeeting corpus, which provides 8‑channel far‑field recordings and headset references, as well as on simulated mixtures derived from VoxCeleb2. Two model sizes are evaluated: S2SND‑Small (16.56 M parameters) and S2SND‑Medium (45.96 M). Results show consistent gains from adding DOA. For the Small model, total DER drops from 16.03 % to 15.35 % online (‑4.2 %) and from 13.59 % to 12.59 % offline (‑7.4 %). When multi‑channel attention is also employed, the relative improvement reaches over 19 % in offline DER (10.40 % vs. 11.33 % for the baseline Medium model). The authors also report that only about 6 % of frames in AliMeeting contain more than two simultaneous speakers, justifying the two‑speaker limit of the DOA estimator.
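The relative reductions quoted above follow from the absolute DER figures by simple arithmetic:

```python
def rel_reduction(baseline, improved):
    """Relative DER reduction, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Reported S2SND-Small figures on AliMeeting:
#   online:  16.03 -> 15.35  (~4.2 % relative)
#   offline: 13.59 -> 12.59  (~7.4 % relative)
```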
Ablation studies demonstrate that DOA‑augmented single‑channel models outperform pure multi‑channel attention models, confirming that explicit spatial cues are more informative than blind channel fusion. The simulated‑DOA generation scheme further reduces dependence on large matched multi‑channel corpora, allowing the framework to generalize across different microphone arrays and acoustic conditions.
In summary, SA‑S2SND successfully merges explicit spatial information with a powerful sequence‑to‑sequence diarization architecture. The approach yields state‑of‑the‑art performance on a challenging meeting diarization benchmark while maintaining low latency suitable for online deployment. The paper also opens avenues for future work, such as incorporating elevation angles, handling more than two concurrent speakers, and extending the method to other multi‑modal diarization scenarios.