Scaling Speech Tokenizers with Diffusion Autoencoders
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose the Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantically rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
💡 Research Summary
The paper introduces SiTok, a Speech Diffusion Tokenizer that unifies extreme compression, high‑fidelity reconstruction, and semantic‑rich representation within a single end‑to‑end model. Existing speech tokenizers typically trade off among (1) compressing speech into a small discrete token stream, (2) preserving acoustic detail for waveform synthesis, and (3) retaining linguistic information for downstream tasks such as ASR. Prior work often resorts to residual vector quantization, high frame rates, or multi‑stage pipelines that separate quantization from waveform generation, leading to suboptimal performance when the token rate is low.
SiTok tackles these issues by building a diffusion auto‑encoder around a large‑scale Transformer. The pipeline starts with 50 Hz, 128‑bin mel‑spectrograms, which are down‑sampled to 12.5 Hz by stacking four consecutive frames. An encoder (16 causal Llama layers, hidden size 1536, 16 heads) maps the spectrogram to latent vectors z. These vectors are quantized using a 32‑dimensional codebook with 65 k entries (EMA‑updated). The quantized embeddings z_q are fed to a non‑causal diffusion decoder (also 16 Llama layers) that learns to reverse a stochastic corruption process via a flow‑matching loss. The decoder predicts a velocity field that transforms a noisy sample x_t back to the clean spectrogram, effectively learning the data distribution conditioned on the discrete tokens.
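The front of this pipeline can be illustrated with a minimal numpy sketch: frame stacking from 50 Hz to 12.5 Hz, nearest‑neighbour codebook lookup, and the linear‑path flow‑matching target. All names, sizes below 65k, and the interpolation schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def stack_frames(mel, factor=4):
    """Downsample 50 Hz mel frames to 12.5 Hz by stacking `factor` consecutive frames."""
    T, D = mel.shape
    T -= T % factor                          # drop any trailing remainder frames
    return mel[:T].reshape(T // factor, D * factor)

def quantize(z, codebook):
    """Nearest-neighbour lookup; the real codebook is EMA-updated, which is omitted here."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # squared L2 distances
    idx = d.argmin(axis=1)
    return codebook[idx], idx

def flow_matching_pair(x0, t, rng):
    """Linear-path flow matching (an assumed schedule): interpolate noise -> data,
    with the constant velocity x0 - noise as the regression target."""
    noise = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * noise + t * x0
    v = x0 - noise
    return x_t, v

rng = np.random.default_rng(0)
mel = rng.standard_normal((200, 128))        # 4 s of 50 Hz, 128-bin mel frames (toy data)
stacked = stack_frames(mel)                  # (50, 512): the 12.5 Hz input sequence
z = rng.standard_normal((50, 32))            # stand-in for the encoder's 32-dim latents
codebook = rng.standard_normal((256, 32))    # toy codebook (the paper uses 65k entries)
z_q, idx = quantize(z, codebook)
x_t, v = flow_matching_pair(mel, t=0.3, rng=rng)
```

Note that moving one step of size `1 - t` along `v` from `x_t` recovers the clean spectrogram exactly, which is what the decoder is trained to approximate from the discrete tokens.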
To ensure that the discrete tokens carry linguistic structure, the authors attach a lightweight CTC decoder on top of the quantized embeddings. This decoder is trained with a Connectionist Temporal Classification loss against the ground‑truth transcript, encouraging the token sequence to be directly predictive of the spoken text. The total training objective combines three terms: diffusion reconstruction loss, CTC loss (weighted by λ_ctc), and the standard VQ commitment loss. This explicit semantic supervision mitigates the “semantic collapse” that often occurs when only reconstruction loss is used.
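The three-term objective described above can be sketched as a simple weighted sum; `lam_ctc` corresponds to the paper's λ_ctc, while the commitment weight and its default value are assumptions for illustration.

```python
import numpy as np

def commitment_loss(z, z_q):
    """Standard VQ commitment term: pulls encoder outputs toward their selected codes.
    In the real model z_q is treated as a constant here (stop-gradient)."""
    return ((z - z_q) ** 2).mean()

def total_loss(l_diffusion, l_ctc, l_commit, lam_ctc=1.0, lam_commit=0.25):
    """Combined objective: diffusion reconstruction + weighted CTC + weighted commitment.
    The weights shown are illustrative, not the paper's values."""
    return l_diffusion + lam_ctc * l_ctc + lam_commit * l_commit
```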
Because diffusion models traditionally require hundreds of inference steps, SiTok incorporates two acceleration strategies. First, shortcut fine‑tuning freezes the encoder and VQ modules while training the decoder to perform large jumps in the denoising trajectory. The model learns a mapping conditioned on a step size d, and a self‑consistency loss forces a single 2d step to match two sequential d steps. Second, a lightweight diffusion head splits the decoder into a main body (run once) and a small head reused across steps, reducing per‑step computation. With these tricks, SiTok can reconstruct speech with as few as 2–4 diffusion steps while preserving quality comparable to a 100‑step baseline.
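The shortcut self‑consistency idea reduces to a small invariance check: an Euler jump of size 2d along the predicted velocity field should land where two consecutive jumps of size d land. A minimal sketch, assuming plain Euler integration and a velocity function signature `v_fn(x, t, d)` of our own invention:

```python
import numpy as np

def euler_step(v_fn, x, t, d):
    """One Euler step of size d along the predicted velocity field."""
    return x + d * v_fn(x, t, d)

def self_consistency_loss(v_fn, x, t, d):
    """Shortcut objective: one jump of size 2d should match two consecutive d-steps."""
    one_big = euler_step(v_fn, x, t, 2 * d)
    two_small = euler_step(v_fn, euler_step(v_fn, x, t, d), t + d, d)
    return ((one_big - two_small) ** 2).mean()
```

A constant velocity field composes exactly (the loss is zero), while a state‑dependent field does not; training pushes the step‑size‑conditioned model toward fields whose large steps compose like their small ones.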
The system is scaled to 1.6 B parameters and trained on 2 million hours of in‑house multilingual speech (predominantly English) for a single epoch (~450 k steps). Training uses AdamW (β1 = 0.9, β2 = 0.999, lr = 8e‑5, 32 k warm‑up steps). The codebook dimension, codebook size, and overall architecture are explored in extensive ablations.
Evaluation covers three axes: compression efficiency, reconstruction fidelity, and downstream understanding. Compression metrics (tokens per second, frame rate, bitrate) show that SiTok operates at an extreme 12.5 Hz token rate and 0.2 kbps bitrate. Reconstruction quality is measured by word error rate (WER) using Whisper‑large‑v3, speaker similarity (cosine similarity of WavLM‑TDNN embeddings), and UTMOS speech quality scores. SiTok achieves a substantial WER reduction (≈30 % relative) compared to strong baselines, while improving speaker similarity and MOS. Downstream tasks—including emotion recognition, keyword spotting, speaker verification, and ASR—all see 10–20 % relative gains, demonstrating that the tokens retain rich semantic information.
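The headline compression figure follows directly from the token rate and the codebook size: 65,536 entries means 16 bits per token, and 12.5 tokens per second times 16 bits gives the reported 200 bps.

```python
import math

token_rate_hz = 12.5           # tokens emitted per second
codebook_size = 65536          # 65k entries -> log2(65536) = 16 bits per token
bits_per_token = math.log2(codebook_size)
bitrate_bps = token_rate_hz * bits_per_token
print(bitrate_bps)             # 200.0 bits per second, i.e. 0.2 kbps
```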
The authors also test zero‑shot text‑to‑speech generation using the same token stream. By applying classifier‑free guidance on the token condition (randomly dropping the token conditioning during training) and optional decoder fine‑tuning, the model can synthesize high‑quality speech from tokens alone, confirming that SiTok's representations are suitable for generative use cases.
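The guidance step at inference is the standard classifier‑free‑guidance extrapolation: run the decoder once with and once without the token condition, then push the prediction past the conditional one. A minimal sketch (the weight `w` and function names are assumptions):

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, w=2.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the token-conditioned one. w=1 recovers the conditional model,
    w=0 the unconditional one; w>1 sharpens adherence to the tokens."""
    return v_uncond + w * (v_cond - v_uncond)
```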
In summary, SiTok is the first large‑scale speech tokenizer that jointly optimizes a diffusion auto‑encoder and a CTC‑based semantic regularizer, achieving a rare combination of ultra‑low bitrate, high‑fidelity reconstruction, and linguistically meaningful discrete codes. The work demonstrates that scaling diffusion models to billions of parameters and massive data can overcome the traditional bottlenecks of speech tokenization, opening pathways for more efficient speech‑language models, low‑bandwidth communication, and real‑time speech interfaces.