CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate
Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR), which allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density. As a result, many tokens are wasted on steady-state segments like long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method that compresses temporal redundancy by supporting dynamic frame rate (DFR) on neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, which adapt inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), the reconstruction WER of CodecSlime is reduced by up to 32% relative to conventional FFR baselines with the same model architecture and similar bitrates, while other metrics remain competitive. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
💡 Research Summary
The paper “CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate” addresses a fundamental inefficiency in modern neural speech codecs: the use of a fixed‑frame‑rate (FFR) scheme that allocates the same number of tokens to every equal‑duration slice of audio, regardless of the actual information density. Long vowels, sustained consonants, and silences consume as many tokens as rapidly changing speech, leading to wasted bitrate, especially in low‑bitrate regimes where every token matters.
To remedy this, the authors introduce CodecSlime, a plugin‑style framework that enables dynamic frame‑rate (DFR) processing on top of any existing FFR codec without redesigning the backbone architecture. CodecSlime consists of two complementary components: (1) ScheDFR (Schedulable Dynamic Frame Rate) for inference‑time adaptive down‑sampling, and (2) Melt‑and‑Cool, a two‑stage training recipe that prepares the model to handle the variable‑length frames produced by ScheDFR.
ScheDFR inserts a schedulable down‑sampling module between the encoder and the quantizer. Given encoder outputs $h \in \mathbb{R}^{T \times d_h}$ and a target down‑sampling ratio $R_S$, it searches for a segmentation $s = \{s_1, \dots, s_{T'}\}$, where $T' = \lceil T / R_S \rceil$ and each segment length $s_i$ is bounded by a maximum $U$. Each segment’s frames are averaged to produce a single down‑sampled representation, and the segment length is stored using $\lceil \log_2 U \rceil$ extra bits, thereby decoupling content from duration. The optimal segmentation maximizes a quality surrogate $J_h = -\sum_t \| h_t - h'_t \|^2$, where $h'_t$ is frame $t$ reconstructed from its segment mean; because the exact reconstruction metric (e.g., WER) is non‑differentiable, this L2‑based surrogate stands in for it. The authors solve the resulting combinatorial optimization with a dynamic‑programming (DP) algorithm reported to run in $O(TU)$ time, guaranteeing a global optimum of the surrogate objective.
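The segmentation search can be sketched as a textbook DP. Note that this illustrative version carries an explicit segment-count dimension and runs in $O(T \cdot T' \cdot U)$, whereas the paper reports an $O(TU)$ scheduler, so treat it as a correctness sketch of the objective rather than the authors' algorithm; the function and variable names are ours.

```python
import numpy as np

def schedule_dfr(h, num_segments, max_len):
    """Illustrative DP for ScheDFR-style segmentation: split the T encoder
    frames into `num_segments` contiguous segments (each 1..max_len frames),
    minimizing sum_t ||h_t - h'_t||^2, i.e. maximizing the surrogate J_h."""
    T, _ = h.shape
    assert num_segments <= T <= num_segments * max_len, "infeasible schedule"

    # Prefix sums give O(1) within-segment distortion queries:
    # cost(a, b) = sum_{t in [a,b)} ||h_t - mean||^2 = sum ||h_t||^2 - n * ||mean||^2
    csum = np.zeros((T + 1, h.shape[1])); csum[1:] = np.cumsum(h, axis=0)
    csq = np.zeros(T + 1); csq[1:] = np.cumsum((h * h).sum(axis=1))

    def seg_cost(a, b):
        n = b - a
        mean = (csum[b] - csum[a]) / n
        return (csq[b] - csq[a]) - n * float(mean @ mean)

    INF = float("inf")
    dp = np.full((num_segments + 1, T + 1), INF)  # dp[j, t]: best cost, j segments over t frames
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for j in range(1, num_segments + 1):
        for t in range(j, T + 1):
            for u in range(1, min(max_len, t) + 1):  # length of the j-th segment
                c = dp[j - 1, t - u] + seg_cost(t - u, t)
                if c < dp[j, t]:
                    dp[j, t], back[j, t] = c, u

    lens, t = [], T  # backtrack the optimal segment lengths
    for j in range(num_segments, 0, -1):
        lens.append(int(back[j, t])); t -= int(back[j, t])
    return lens[::-1], float(dp[num_segments, T])
```

On a toy input with three steady-state plateaus, the DP merges each plateau into one segment at zero distortion, which is exactly the behavior ScheDFR exploits on long vowels and silences.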
Melt‑and‑Cool addresses the fact that an FFR‑trained backbone is not accustomed to receiving merged frames. In the “Melt” stage, a pretrained FFR model is post‑trained on randomly down‑sampled encoder outputs. A “Melt manager” gradually increases the probability of longer segment lengths, exposing the quantizer and decoder to a wide variety of down‑sampling patterns and building robustness. In the subsequent “Cool” stage, the model is fine‑tuned with the optimal ScheDFR schedules for a specific target ratio (R_S) and maximum segment length (U). During Cool, the encoder is frozen; only the quantizer and decoder are updated, allowing the model to specialize for the exact schedules it will encounter at inference time while retaining the robustness learned during Melt.
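The Melt stage's random down-sampling can be sketched as follows. This is a hypothetical implementation: the names `melt_downsample` and `long_seg_prob` are ours, and the summary does not specify the actual segment-length distribution the Melt manager uses, so a simple two-point mixture stands in for it.

```python
import numpy as np

def melt_downsample(h, max_len, long_seg_prob, rng):
    """Sketch of Melt-stage random down-sampling: partition the T encoder
    frames into contiguous segments of random length 1..max_len, then
    replace each segment by its mean. `long_seg_prob` is an assumed
    curriculum knob: raising it over training makes longer merges more
    likely, mimicking the Melt manager's gradual schedule."""
    T = h.shape[0]
    # Mass (1 - p) on length 1, and p spread evenly over lengths 2..max_len.
    weights = np.array(
        [1.0 - long_seg_prob] + [long_seg_prob / (max_len - 1)] * (max_len - 1)
    )
    weights /= weights.sum()

    lens, t = [], 0
    while t < T:
        u = int(rng.choice(np.arange(1, max_len + 1), p=weights))
        u = min(u, T - t)  # clip the last segment to fit
        lens.append(u)
        t += u

    starts = np.cumsum([0] + lens[:-1])
    pooled = np.stack([h[a:a + u].mean(axis=0) for a, u in zip(starts, lens)])
    return pooled, lens
```

Feeding `pooled` (plus the stored `lens`) to the quantizer and decoder during post-training is what exposes them to the variable-length frames they will see under ScheDFR.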
The authors implement CodecSlime on top of a VQ‑GAN backbone similar to BigCodec. They evaluate both vector‑quantizer (VQ) and finite‑scalar‑quantizer (FSQ) backbones, showing that the method is architecture‑agnostic. Experiments use the full 960‑hour LibriSpeech training set and the UniCATs test set B (500 utterances from 37 unseen speakers). Baselines include various sizes of BigCodec (VQ‑8k, VQ‑18k, FSQ‑18k, FSQ‑84k), as well as public models such as EnCodec, LLM‑Codec, SNAC, TFC, and V‑ARST‑ok. Evaluation metrics cover intelligibility (WER using NeMo ASR), objective quality (STOI, PESQ, ViSQOL, SECS, UTMOS), and subjective MUSHRA listening tests.
Key results: compressing an 80 Hz backbone to 40 Hz DFR (≈600 bps) yields a WER of 4.25 % with VQ‑8k, a 32 % relative reduction compared to a 40 Hz FFR model of similar bitrate. With FSQ‑18k, the WER drops to 3.80 %, outperforming both the 40 Hz FFR baseline (5.59 %) and even the larger FSQ‑84k model that matches CodecSlime’s total bitrate. Across all metrics, CodecSlime matches or exceeds baselines, and the MUSHRA test shows a statistically significant preference for CodecSlime over competing systems. The method also generalizes: a single CodecSlime model trained for 40 Hz can be run at 50 Hz or 67 Hz at inference time, consistently beating dedicated FFR models trained at those rates.
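The ≈600 bps figure is consistent with simple bitrate arithmetic, sketched below. The specific values are assumptions for illustration (an 8192-entry codebook for the VQ-8k backbone and a maximum segment length $U = 4$; the summary does not state $U$), not the paper's confirmed configuration.

```python
import math

def dfr_bitrate(frame_rate_hz, codebook_size, max_seg_len):
    """Illustrative DFR bitrate: each token carries log2(codebook_size)
    content bits plus ceil(log2(U)) duration bits for its segment length
    (assumed values, not the paper's exact configuration)."""
    content_bits = math.log2(codebook_size)
    duration_bits = math.ceil(math.log2(max_seg_len))
    return frame_rate_hz * (content_bits + duration_bits)

# A 40 Hz token stream with an 8192-entry codebook (13 bits) and U = 4
# (2 duration bits) gives 40 * (13 + 2) = 600 bps.
print(dfr_bitrate(40, 8192, 4))  # → 600.0
```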
The paper’s contributions are threefold: (1) a principled, DP‑based dynamic‑frame‑rate scheduler that directly compresses temporal redundancy while preserving content, (2) a simple yet effective two‑stage training recipe that adapts any FFR codec to DFR without architectural changes, and (3) extensive empirical validation demonstrating that DFR can substantially improve low‑bitrate speech coding.
Limitations and future directions are acknowledged. The surrogate L2 loss does not directly reflect perceptual distortion; integrating perceptual loss or differentiable approximations of WER could further improve quality. The maximum segment length (U) is fixed, which may limit compression of very long steady sounds. Real‑time streaming scenarios would require low‑latency scheduling and possibly hardware‑friendly implementations of the DP scheduler. Extending CodecSlime to multi‑codebook or hierarchical codecs, and exploring joint optimization of content and duration bits, are promising avenues.
In summary, CodecSlime provides a practical, backbone‑agnostic pathway to bring dynamic frame‑rate capabilities to neural speech codecs, achieving notable gains in intelligibility and perceptual quality at very low bitrates while maintaining flexibility for downstream applications.