Frequency domain variants of velvet noise and their application to speech processing and synthesis: with appendices
We propose a new excitation source signal for VOCODERs and an all-pass impulse response for post-processing of synthetic sounds and pre-processing of natural sounds for data-augmentation. The proposed signals are variants of velvet noise, which is a sparse discrete signal consisting of a few non-zero (1 or -1) elements and sounds smoother than Gaussian white noise. One of the proposed variants, FVN (Frequency domain Velvet Noise) applies the procedure to generate a velvet noise on the cyclic frequency domain of DFT (Discrete Fourier Transform). Then, by smoothing the generated signal to design the phase of an all-pass filter followed by inverse Fourier transform yields the proposed FVN. Temporally variable frequency weighted mixing of FVN generated by frozen and shuffled random number provides a unified excitation signal which can span from random noise to a repetitive pulse train. The other variant, which is an all-pass impulse response, significantly reduces “buzzy” impression of VOCODER output by filtering. Finally, we will discuss applications of the proposed signal for watermarking and psychoacoustic research.
💡 Research Summary
The paper introduces two novel variants of velvet noise—Frequency‑domain Velvet Noise (FVN) and Frequency‑dependent Velvet Noise (FFVN)—and demonstrates their utility as excitation sources for vocoders and as all‑pass filter impulse responses for post‑processing synthetic speech. Velvet noise, originally a sparse time‑domain sequence containing only a few ±1 impulses, is known for its smoother perceptual quality compared to Gaussian white noise. The authors extend this concept to the frequency domain by allocating sparse phase‑modulation units on the DFT’s circular frequency axis using the same random placement rule as the original velvet noise.
A key technical contribution is the design of the phase‑modulation function wₚ(k,B) using a six‑term cosine series. This series yields exceptionally low side‑lobe levels (‑114 dB) and a steep decay (‑54 dB/octave), providing excellent joint time‑frequency localization. Each phase‑modulation unit is centered at a random frequency k_c determined by k_fvn(m)=⌊mF_d+r₁(m)(F_d−1)⌉, where F_d is the average frequency segment length. The overall phase spectrum ϕ_fvn(k) is constructed by summing symmetric contributions from all units, scaled by a maximum phase deviation ϕ_max. An inverse DFT of exp(j ϕ_fvn(k)) yields the impulse response h_fvn(n), which is a short, localized, all‑pass filter kernel (typically a few milliseconds). Because the filter is all‑pass, its magnitude response is flat, so it does not alter the power spectrum of the processed signal, yet it decorrelates the waveform enough to suppress the characteristic “buzzy” artifacts of many vocoder outputs.
To accommodate applications that require different temporal extents across frequency bands (e.g., voiced fricatives), the authors propose FFVN. They define a target duration function B_wtgt(f), normalize it to a weighting g(f), and perform a non‑linear frequency warping ν(f)=α∫₀^{f}g(u)du. A constant‑duration FVN is generated on the warped axis ν, then remapped to the original frequency axis, producing a phase function ϕ_mod(f) that yields an impulse response with frequency‑dependent duration. Two practical warping models are offered: a sigmoid model (controlled by transition frequency f_c and width f_tr) and a band‑wise model (specified by boundary frequencies f_b(k) and per‑band durations B_k, smoothed by a raised‑cosine window).
The paper showcases three main applications. First, applying a single FVN or FFVN impulse as an all‑pass filter to vocoder output dramatically reduces the “buzzy” impression while preserving naturalness; filters shorter than ~1 ms are perceptually transparent, making this a lightweight post‑processing step. Second, the authors use FVN units as excitation signals. A frozen (repeated) FVN element yields a clear pitch perception, whereas a shuffled (random) sequence produces noise‑like excitation. By linearly interpolating the phase of two FVN elements, a continuous morphing from periodic to stochastic excitation is achieved, enabling flexible voice synthesis ranging from singing‑like tones to breathy whispers. Short‑duration FVN elements (≤5 ms) inserted each pitch period provide temporally varying random components, beneficial for low‑pitched male voices. Third, because the all‑pass impulse is a Time‑Stretched Pulse (TSP), convolving a signal with its time‑reversed version perfectly recovers the original, offering a simple yet robust watermarking or tampering‑detection mechanism.
Extensive simulations of 5,000 FVN units reveal a linear relationship between the smoothing bandwidth B and the resulting impulse duration σ_t (B≈0.522 σ_t), allowing precise control over temporal spread. The six‑term cosine series outperforms Hann, Blackman, and Nuttall windows in side‑lobe suppression, justifying its selection.
In discussion, the authors note that high‑kurtosis speech signals can be flattened by the all‑pass filter, which may aid statistical speech synthesis and complex‑cepstrum excitation models. They outline future work, including integrating FVN‑based data augmentation into neural TTS systems (e.g., WaveNet) to embed perceptual constraints, and exploring psychoacoustic experiments to quantify masking effects of the phase‑randomized excitation.
Overall, the study presents a unified framework that leverages the sparsity and smoothness of velvet noise, sophisticated phase shaping, and frequency‑domain manipulation to solve long‑standing vocoder quality issues while opening new avenues for excitation design, data augmentation, and secure audio processing.
Comments & Academic Discussion
Loading comments...
Leave a Comment