MK-SGC-SC: Multiple Kernel Guided Sparse Graph Construction in Spectral Clustering for Unsupervised Speaker Diarization


Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities of speaker embeddings and then crafting a sparse graph for spectral clustering in a principled manner is sufficient to achieve state-of-the-art performance in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree-one arccosine kernel to measure similarities between speaker embeddings, from which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels at unsupervised speaker diarization over a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora. To encourage further research, our implementation is available at https://github.com/nikhilraghav29/MK-SGC-SC.


💡 Research Summary

The paper introduces MK‑SGC‑SC, a novel unsupervised speaker diarization method that leverages multiple kernel similarities and principled sparse graph construction within a spectral clustering framework. The authors first compute five similarity matrices from speaker embeddings: four polynomial kernels of degree two and three (with and without a constant term) and a degree‑one arccosine kernel. These kernels capture complementary aspects of the embedding space—polynomial kernels amplify inner‑product differences non‑linearly, while the arccosine kernel preserves angular relationships.
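The five kernels described above can be sketched in NumPy. The degree-1 arc-cosine kernel follows the standard closed form k(x, y) = (1/π)·‖x‖‖y‖·(sin θ + (π − θ) cos θ), where θ is the angle between x and y. The summary does not state the value of the constant term for the "with constant" polynomial variants, so `coef0 = 1.0` below is an assumption.

```python
import numpy as np

def polynomial_kernel(X, degree, coef0=0.0):
    """Gram matrix of the polynomial kernel (x . y + coef0)^degree."""
    return (X @ X.T + coef0) ** degree

def arccosine_kernel_deg1(X, eps=1e-12):
    """Degree-1 arc-cosine kernel:
    k(x, y) = (1/pi) * ||x|| * ||y|| * (sin(theta) + (pi - theta) * cos(theta)),
    where theta is the angle between x and y."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos_theta = np.clip((X @ X.T) / (norms @ norms.T + eps), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    return (norms @ norms.T) / np.pi * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

def multi_kernel_similarities(X, coef0=1.0):
    """Five Gram matrices from the summary: polynomial kernels of degree 2 and 3,
    each with and without a constant term, plus the degree-1 arc-cosine kernel.
    The coef0 value for the 'with constant' variants is an assumption."""
    kernels = [polynomial_kernel(X, d, c) for d in (2, 3) for c in (0.0, coef0)]
    kernels.append(arccosine_kernel_deg1(X))
    return kernels
```

Note that for the arc-cosine kernel, k(x, x) reduces to ‖x‖², since θ = 0 on the diagonal.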

Each kernel matrix Kₗ is shifted and scaled to obtain a non‑negative adjacency matrix Aₗ = (Kₗ – min(Kₗ))/‖Kₗ‖_F, ensuring all matrices share a comparable magnitude. Self‑loops are removed (Aₗᵢᵢ = 0) and each row is sparsified by retaining only its c nearest neighbors. This step yields a set of sparse, locally focused graphs that emphasize strong intra‑speaker connections while discarding weak inter‑speaker links.
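The scaling and sparsification step might look as follows. The final symmetrization is an assumption on our part (a common convention to keep the graph undirected after per-row pruning); the summary does not spell it out.

```python
import numpy as np

def sparse_adjacency(K, c):
    """Shift/scale a kernel matrix to a non-negative adjacency matrix,
    remove self-loops, and keep only the c largest entries per row."""
    A = (K - K.min()) / np.linalg.norm(K, ord='fro')
    np.fill_diagonal(A, 0.0)
    # Keep the c nearest neighbors (largest similarities) in each row.
    n = A.shape[0]
    mask = np.zeros_like(A, dtype=bool)
    top_c = np.argsort(A, axis=1)[:, -c:]          # top-c column indices per row
    mask[np.arange(n)[:, None], top_c] = True
    A_sparse = np.where(mask, A, 0.0)
    # Symmetrize so the pruned graph stays undirected (assumed convention).
    return np.maximum(A_sparse, A_sparse.T)
```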

The sparse adjacency matrices are then fused by simple averaging: A* = (1/m) Σₗ Aₗ, followed by a final Frobenius‑norm normalization. The resulting fused graph integrates diverse similarity cues while remaining sparse enough for reliable spectral analysis.
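The fusion rule above is a one-liner:

```python
import numpy as np

def fuse_graphs(adjacencies):
    """Average the m sparse adjacency matrices, A* = (1/m) * sum_l A_l,
    then re-normalize the fused graph by its Frobenius norm."""
    A_star = np.mean(adjacencies, axis=0)
    return A_star / np.linalg.norm(A_star, ord='fro')
```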

Spectral clustering proceeds by constructing the unnormalized Laplacian L = D – A* (where D is the degree matrix), computing its eigenvalues λ₁ ≤ … ≤ λ_n, and estimating the number of speakers k* as the index of the maximal eigengap among the first M eigenvalues (M = min{n, k_max}). The eigenvectors corresponding to the k* smallest eigenvalues form a matrix H; k‑means clustering on the rows of H yields the final speaker labels. The overall computational complexity remains O(n³), identical to standard spectral clustering, while memory usage scales as O(m n²) due to the multiple kernel matrices.
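A sketch of the clustering stage, using SciPy's `kmeans2` for the final k-means step (the summary does not name a particular k-means implementation, so this is an assumption):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_cluster(A, k_max=8):
    """Unnormalized spectral clustering with eigengap-based speaker counting."""
    D = np.diag(A.sum(axis=1))                # degree matrix
    L = D - A                                 # unnormalized Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    M = min(A.shape[0], k_max)
    gaps = np.diff(eigvals[:M])               # eigengaps among the first M eigenvalues
    k = int(np.argmax(gaps)) + 1              # index of the maximal eigengap
    H = eigvecs[:, :k]                        # eigenvectors of the k smallest eigenvalues
    _, labels = kmeans2(H, k, minit='++')     # k-means on the rows of H
    return labels, k
```

On a toy graph with two disconnected cliques, the eigengap correctly recovers k = 2 and the labels split along the components.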

Experiments are conducted on three challenging corpora: DIHARD‑III (11 diverse domains), AMI meeting recordings, and VoxConverse (YouTube videos). Speaker embeddings are extracted using a pre‑trained ECAPA‑TDNN model (192‑dimensional) with 3‑second overlapping windows. Diarization performance is measured by Diarization Error Rate (DER) with a 0.25 s collar, both excluding and including overlapped speech.

Results show that MK‑SGC‑SC consistently outperforms two strong unsupervised baselines: SC‑pNA (p‑neighborhood retained affinity) and ASC (auto‑tuning spectral clustering). In the more demanding setting that includes overlapped speech, MK‑SGC‑SC achieves the lowest DER on 22 out of 30 test splits, notably reducing DER on VoxConverse dev/eval from 9.38 %/10.83 % (ASC) and 7.25 %/9.41 % (SC‑pNA) to 2.83 %/5.12 %. When compared with a semi‑supervised spectral clustering method (SS‑SC) that is tuned on a development set, MK‑SGC‑SC attains comparable or better performance, even when the number of speakers is estimated automatically.

Ablation studies reveal that each kernel contributes positively; removing any kernel degrades performance, confirming the benefit of multi‑kernel fusion. Varying the neighbor parameter c shows a trade‑off: too small a c leads to overly sparse graphs and loss of useful connections, while too large a c re‑introduces noise.

The paper’s main contributions are: (1) designing a set of kernels tailored to speaker embedding similarity, (2) proposing a principled pipeline for scaling, sparsifying, and fusing multiple kernel adjacency matrices, and (3) demonstrating that this approach yields state‑of‑the‑art unsupervised diarization performance while maintaining the computational profile of classic spectral clustering. Future directions include automatic kernel selection, adaptive neighbor sizing, and real‑time approximations of eigen‑decomposition.

