LibriVAD: A Scalable Open Dataset with Deep Learning Benchmarks for Voice Activity Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD - a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over signal-to-noise ratio (SNR), silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including raw waveform, Mel-Frequency Cepstral Coefficients (MFCC), and Gammatone Filter-bank Cepstral Coefficients (GFCC), and introduce the Vision Transformer (ViT) architecture for VAD. Our experiments show that ViT with MFCC features consistently outperforms established VAD models such as the boosted deep neural network (bDNN) and the convolutional long short-term memory deep neural network (CLDNN) across seen, unseen, and out-of-distribution (OOD) conditions, including evaluation on the real-world VOiCES dataset. We further analyze the impact of dataset size and SSR on model generalization, experimentally showing that scaling up dataset size and balancing SSR noticeably and consistently enhance VAD performance under OOD conditions. All datasets, trained models, and code are publicly released to foster reproducibility and accelerate progress in VAD research.


💡 Research Summary

The paper addresses a fundamental bottleneck in voice activity detection (VAD) research: the lack of a large‑scale, publicly available, and systematically controlled dataset. To fill this gap, the authors introduce LibriVAD, a dataset built on the open‑source LibriSpeech corpus and enriched with a wide variety of real‑world and synthetic noises. LibriVAD offers three size tiers—small (≈15 GB, ~140 h), medium (≈150 GB, ~1,400 h), and large (≈1.5 TB, ~14,000 h)—and two structural variants: LibriVAD‑NonConcat, which mirrors typical VAD corpora with limited non‑speech content, and LibriVAD‑Concat, which inserts additional silence between concatenated utterances to raise the silence‑to‑speech ratio (SSR) from ~17% to ~34%. This design enables researchers to study the impact of SSR on model performance.

Noise augmentation is a core contribution. The dataset incorporates nine noise types: city noise from WHAM!, six environmental categories from DEMAND (domestic, office, public, transport, nature, street), plus two synthetic sources—Speech‑Shaped Noise (SSN) generated from LibriLight‑small using speaker‑specific LPC filters, and Babble noise created by mixing six speech streams from LibriLight‑medium. Each noise is mixed with speech at six signal‑to‑noise ratios (SNRs) of –5, 0, 5, 10, 15, 20 dB, measured only on speech‑active frames to avoid under‑estimation. The resulting mixtures are split into training, validation, and test sets, with a clear distinction between “seen” noises (used during training) and “unseen” noises (reserved for testing). An out‑of‑distribution (OOD) evaluation is performed on the VOiCES corpus, which contains substantial reverberation not present in the training data.
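The speech-active SNR convention can be made concrete with a short sketch. The function below is a simplified illustration (not the authors' released code; the function name and the boolean VAD-mask convention are assumptions): it scales the noise so the mixture reaches a target SNR, with speech power measured only on speech-active samples.

```python
import numpy as np

def mix_at_snr(speech, noise, vad_mask, snr_db):
    """Scale `noise` so that speech + gain*noise reaches `snr_db`.

    Speech power is averaged over speech-active samples only
    (vad_mask == True); including silent stretches in the average
    would under-estimate the true speech level and make the
    effective SNR lower than reported.
    """
    p_speech = np.mean(speech[vad_mask] ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve p_speech / (gain^2 * p_noise) = 10^(snr_db / 10) for gain.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

On a half-silent signal, a conventional whole-file power estimate would report an SNR 3 dB lower than this speech-active measure, which is exactly the bias the protocol avoids.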

For feature extraction the authors evaluate three representations: raw waveform, Mel‑Frequency Cepstral Coefficients (MFCC, 39‑dim including Δ and ΔΔ), and Gammatone Filter‑bank Cepstral Coefficients (GFCC, also 39‑dim). They benchmark three model families: the Boosted Deep Neural Network (bDNN), the Convolutional Long Short‑Term Memory DNN (CLDNN), and, for the first time in VAD, the Vision Transformer (ViT). The ViT architecture consists of 12 self‑attention layers, 8 heads, and a 384‑dim embedding; input audio is tokenized into 20 ms frames, allowing the model to capture long‑range temporal dependencies.
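The 20 ms tokenization is easy to sketch. Assuming 16 kHz audio (LibriSpeech's sample rate) and non-overlapping frames, each token covers 320 samples; the paper's exact hop, padding, and patch-embedding details may differ, so treat this helper as illustrative.

```python
import numpy as np

def frame_tokens(wave, sr=16000, frame_ms=20):
    """Split a 1-D waveform into non-overlapping frame tokens.

    At 16 kHz and 20 ms, each token is 320 samples; any ragged
    tail shorter than one frame is dropped. A real front end
    would then project each frame (or its MFCCs) into the
    model's 384-dim embedding space.
    """
    frame_len = int(sr * frame_ms / 1000)      # 320 samples at 16 kHz
    n_frames = len(wave) // frame_len
    return wave[: n_frames * frame_len].reshape(n_frames, frame_len)
```

One second of 16 kHz audio thus yields a sequence of 50 tokens, short enough for full self-attention across several seconds of context.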

Experimental results are extensive. Across all conditions—seen, unseen, and OOD—the ViT‑MFCC combination consistently yields the highest Area Under the ROC Curve (AUC) and F1 scores, outperforming bDNN and CLDNN by 3–5 % absolute AUC on the VOiCES set. Scaling the dataset size from small to medium to large produces steady performance gains, with the medium set delivering roughly a 2 % AUC improvement over the small set. The Concatenated variant (higher SSR) also outperforms the NonConcat version, demonstrating that a balanced proportion of non‑speech frames improves generalization, especially under low‑SNR conditions. Detailed analysis shows that speech‑based noises (Babble, SSN) are the most challenging, yet ViT remains robust, likely due to its global attention mechanism. GFCC and raw waveform features lag behind MFCC, confirming the continued relevance of perceptually motivated cepstral features for VAD.
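For reference, the frame-level AUC used above can be computed directly from VAD posterior scores via the rank-based (Mann-Whitney U) formulation; this is a generic sketch of the metric, not the paper's evaluation script, and it assumes untied scores.

```python
import numpy as np

def frame_auc(scores, labels):
    """ROC AUC over frame-level VAD posteriors.

    Equivalent to the probability that a randomly chosen
    speech frame scores higher than a randomly chosen
    non-speech frame (ties are not averaged here).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Mann-Whitney U statistic, normalized by the number of
    # positive/negative frame pairs.
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Because AUC is threshold-free, it compares models without committing to an operating point, which matters when deployment SNRs (and hence the best threshold) vary.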

The authors also emphasize reproducibility: forced alignment from the Montreal Forced Aligner provides frame‑level labels that match expert annotations, and all data, code, and pretrained models are released on HuggingFace and GitHub. They argue that LibriVAD can serve not only VAD research but also downstream tasks such as speech enhancement, noise‑robust ASR, and speaker diarization, establishing a new standard benchmark for the speech processing community.

