Multi-layered Cepstrum for Instantaneous Frequency Estimation
We propose the multi-layered cepstrum (MLC) method to estimate multiple fundamental frequencies (MF0) of a signal under challenging contamination such as high-pass filter noise. Taking the operation of cepstrum (i.e., Fourier transform, filtering, and nonlinear activation) recursively, MLC is shown as an efficient method to enhance MF0 saliency in a step-by-step manner. Evaluation on a real-world polyphonic music dataset under both normal and low-fidelity conditions demonstrates the potential of MLC.
💡 Research Summary
The paper introduces a novel multi‑layered cepstrum (MLC) framework for robust multiple fundamental frequency (MF0) estimation, especially under severe contamination such as high‑pass filter noise. Traditional MF0 methods—including autocorrelation, generalized cepstrum (GC), non‑negative matrix factorization (NMF), MUSIC, and the combined frequency‑periodicity (CFP) approach—struggle when the signal has been passed through a high‑pass channel (e.g., a smartphone speaker) that suppresses low‑frequency components and creates a “missing fundamental” effect. To address this, the authors propose to recursively apply the three canonical cepstrum operations—Fourier transform, high‑pass filtering, and a nonlinear power‑law activation—across multiple layers, thereby progressively enhancing components that vary rapidly (the true F0 and its harmonics) while attenuating slowly varying or aperiodic parts.
Mathematically, for an input signal \(x\in\mathbb{R}^N\) the layer‑wise outputs are defined as
\[ z^{(0)} = \sigma^{(0)}(|F x|) \]
\[ z^{(\ell)} = \sigma^{(\ell)}\bigl(W^{(\ell)} F z^{(\ell-1)}\bigr) \quad \text{for } \ell\ge 1. \]
Here \(F\) is the \(N\)-point DFT matrix, \(W^{(\ell)}\) is a diagonal high‑pass filter that zeros out frequencies (or quefrencies) below a chosen cutoff, and \(\sigma^{(\ell)}\) is an element‑wise power function \(x^{\gamma_\ell}\) (with \(\gamma_\ell>0\)). The parameters \(\gamma_\ell\) are the only learnable hyper‑parameters, giving the architecture a clear physical interpretation while resembling a deep neural network: the linear part \(W^{(\ell)}F\) acts as a fully‑connected layer, and \(\sigma^{(\ell)}\) as the activation function.
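The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `mlc_layers`, the cutoff values, the symmetric form of the high‑pass mask, and the half‑wave rectification applied before the power law (needed so that negative DFT values do not produce NaNs) are all assumptions made for the example.

```python
import numpy as np

def mlc_layers(x, gammas, cutoffs):
    """Return [z0, z1, ..., zL] for one real-valued frame x.

    gammas  : L+1 positive exponents (gamma_0 ... gamma_L).
    cutoffs : L cutoff indices, one per layer 1..L (hypothetical values).
    """
    n = len(x)
    k = np.arange(n)
    dist = np.minimum(k, n - k)              # circular distance from index 0
    z = np.abs(np.fft.fft(x)) ** gammas[0]   # layer 0: power law on |F x|
    layers = [z]
    for gamma, kc in zip(gammas[1:], cutoffs):
        # z is real and even, so its DFT is real up to rounding error.
        y = np.fft.fft(z).real
        y = np.where(dist >= kc, y, 0.0)     # W: zero everything below the cutoff
        z = np.maximum(y, 0.0) ** gamma      # assumed rectification, then power law
        layers.append(z)
    return layers

# Toy frame: ten harmonics of a tone at bin 16 of a 1024-point frame.
n, k0 = 1024, 16
t = np.arange(n)
x = sum(np.cos(2 * np.pi * k0 * h * t / n) for h in range(1, 11))
layers = mlc_layers(x, gammas=[0.5, 0.5, 0.5], cutoffs=[8, 8])
```

In this toy frame the first layer peaks at quefrency index 64, which is the period `n / k0` of the tone, and at its integer multiples.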
When applied to short‑time Fourier transform (STFT) frames, even‑indexed layers produce frequency‑domain representations, while odd‑indexed layers produce quefrency‑domain (time‑like) representations. The authors then combine a pair of consecutive layers, one even and one odd, using the CFP principle: the odd‑layer output is non‑linearly mapped back to the frequency axis and multiplied element‑wise with the even‑layer output. This cross‑multiplication suppresses harmonics in the even layer and sub‑harmonics in the odd layer, leaving only the true F0 peaks. The mapping is implemented via a filterbank that aligns quefrency indices with corresponding frequencies.
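A small sketch of the combination step may help. It assumes that frequency bin `k` of an `n`‑point frame corresponds to quefrency `n / k` samples, and it replaces the filterbank mentioned above with plain linear interpolation (`np.interp`); the function name `cfp_combine` and the toy inputs are hypothetical.

```python
import numpy as np

def cfp_combine(z_freq, z_quef):
    """Multiply a frequency-domain layer by a quefrency-domain layer that
    has been resampled onto the same frequency bins."""
    n = len(z_freq)
    half = n // 2
    k = np.arange(1, half)                    # positive-frequency bins, DC skipped
    q = n / k                                 # quefrency (in samples) matching bin k
    # Linear interpolation stands in for the filterbank; quefrencies beyond
    # the first half of the axis are treated as zero.
    mapped = np.interp(q, np.arange(half + 1), z_quef[:half + 1], right=0.0)
    out = np.zeros(half)
    out[1:] = z_freq[1:half] * mapped
    return out

n = 1024
z_freq = np.zeros(n)
z_freq[[16, 32, 48]] = 1.0        # F0 at bin 16 plus two harmonics
z_quef = np.zeros(n)
z_quef[[64, 128, 192]] = 1.0      # period 64 plus two sub-harmonic multiples
salience = cfp_combine(z_freq, z_quef)
```

With these toy inputs only bin 16 survives the product: the harmonics at bins 32 and 48 have no matching quefrency peak, and the quefrency peaks at 128 and 192 have no matching spectral peak.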
Two experiments validate the approach. First, a synthetic signal consisting of a 2 Hz square wave and a frequency‑modulated sawtooth (2.5 Hz ± cosine) is corrupted: the square wave is passed through a 10th‑order Butterworth low‑pass filter, the sawtooth through a high‑pass filter, and additive pink noise (10 dB SNR) and an impulse are added. The STFT alone fails to reveal the F0 trajectories, the first‑layer GC begins to show periodicity, and the second‑layer generalized cepstrum of the spectrum (GCoS) clearly resolves the true F0. The final CFP representation (product of layers 2 and 1) cleanly displays both F0 tracks despite the severe contamination.
Second, the authors evaluate on the real‑world Bach10 dataset (four‑instrument quartets). They explore three parameter‑optimization strategies for the set \(\{\gamma_\ell\}\): exhaustive grid search (brute‑force), a greedy incremental search, and stochastic gradient descent (SGD) minimizing binary cross‑entropy between the CFP output (after an 88‑band log‑frequency filterbank and sigmoid) and ground‑truth piano rolls. Ten‑fold cross‑validation is used. Results show that increasing the number of layers consistently improves precision, recall, and F‑score for the brute‑force method; the greedy method yields sub‑optimal but still improving performance; SGD does not show a monotonic trend but reaches competitive scores at \(L=3\). Compared with state‑of‑the‑art MF0 methods (constrained NMF, PLCA, and the original CFP), the MLC‑CFP consistently outperforms them. Notably, under simulated high‑pass degradations (cutoff frequencies from 10 Hz up to 1 kHz), a six‑layer MLC still attains a 72.7 % F‑score at 1 kHz, a 25 % gain over a single‑layer system. In some degradation scenarios (e.g., 100 Hz cutoff) the performance even exceeds that under normal conditions, likely because the high‑pass filter suppresses low‑frequency noise.
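To make the cost difference between the first two strategies concrete, here is a sketch under one plausible reading of "greedy incremental search": the exponents of earlier layers are frozen and only the next layer's exponent is searched. The function names, the candidate grid, and the `toy_score` stand‑in for a validation F‑score are all hypothetical.

```python
from itertools import product

def brute_force_search(score, grid, n_layers):
    """Try every combination: len(grid) ** n_layers evaluations."""
    return max(product(grid, repeat=n_layers), key=score)

def greedy_search(score, grid, n_layers):
    """Fix earlier exponents, then pick the best value for the next layer:
    len(grid) * n_layers evaluations."""
    chosen = ()
    for _ in range(n_layers):
        chosen += (max(grid, key=lambda g: score(chosen + (g,))),)
    return chosen

# Hypothetical stand-in for a validation F-score; it accepts partial tuples.
target = (0.3, 0.7, 0.5)
def toy_score(gammas):
    return -sum((g - t) ** 2 for g, t in zip(gammas, target))

grid = (0.1, 0.3, 0.5, 0.7, 0.9)
best_brute = brute_force_search(toy_score, grid, 3)
best_greedy = greedy_search(toy_score, grid, 3)
```

The toy score is separable across layers, so both searches return the same tuple here. A real F‑score couples the layers, which is consistent with the greedy results being reported as sub‑optimal.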
The discussion acknowledges a potential drawback: the nonlinear power activation can generate cross‑terms at frequencies equal to the absolute difference of two true F0s, \(|a-b|\), which may appear as spurious peaks. Empirical analysis suggests that setting a small \(\gamma_0\) and larger \(\gamma_\ell\) for higher layers mitigates this effect, as the cross‑terms become weak unless all \(\gamma_\ell\) approach zero. The authors also note that a modest pre‑filter (high‑pass) can improve results, and that optimizing \(\{\gamma_\ell\}\) efficiently on large, complex datasets remains an open research direction.
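The general mechanism behind such cross‑terms can be checked numerically. The sketch below is not the paper's analysis: it simply applies a power law to a positive signal containing two components at hypothetical bins `a` and `b`, then inspects the DFT bin at their difference. With exponent 1 the operation is linear and that bin stays empty. With exponent 0.5 energy appears there.

```python
import numpy as np

n, a, b = 1024, 50, 80                      # two hypothetical component frequencies (bins)
t = np.arange(n)
# Offset of 3 keeps the signal strictly positive so the power law is well defined.
s = 3.0 + np.cos(2 * np.pi * a * t / n) + np.cos(2 * np.pi * b * t / n)

def magnitude_at(signal, k):
    """Magnitude of DFT bin k."""
    return np.abs(np.fft.fft(signal))[k]

diff_linear = magnitude_at(s, abs(a - b))         # gamma = 1: no cross-term
diff_power = magnitude_at(s ** 0.5, abs(a - b))   # gamma = 0.5: cross-term appears
```

In MLC the nonlinearity acts on spectra and cepstra rather than on a waveform, so the spurious component would show up on the axis of the following layer; the toy signal only illustrates why a compressive exponent creates it.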
In summary, the multi‑layered cepstrum offers a physically interpretable, deep‑structure analogue to modern neural networks, capable of iteratively refining the saliency of fundamental frequencies. Its recursive application of DFT‑based linear transforms and simple power‑law nonlinearities yields a system that is both computationally efficient and highly robust to convolutional (high‑pass) noise. The experimental evidence on synthetic and real polyphonic music data demonstrates that deeper MLC configurations outperform shallow DFT‑based approaches, establishing MLC‑CFP as a promising tool for accurate MF0 estimation in challenging acoustic environments.