Neural Network Alternatives to Convolutive Audio Models for Source Separation
The convolutive non-negative matrix factorization (NMF) model factorizes a given audio spectrogram using frequency templates that have a temporal dimension. In this paper, we present a convolutional auto-encoder model that acts as a neural network alternative to convolutive NMF. Using the modeling flexibility granted by neural networks, we also explore the idea of using a Recurrent Neural Network in the encoder. Experimental results on speech mixtures from the TIMIT dataset indicate that the convolutive architecture provides a significant improvement in separation performance in terms of BSS Eval metrics.
💡 Research Summary
The paper proposes a neural-network alternative to convolutive non-negative matrix factorization (convNMF) for supervised audio source separation. Traditional NMF factorizes a magnitude spectrogram X into a non-negative basis matrix W and activation matrix H, but it cannot explicitly model temporal dependencies across frames. ConvNMF extends NMF by using spectro-temporal bases, yet its bases remain strictly non-negative and its learning procedure offers far less flexibility than gradient-based neural training.
To overcome these issues, the authors design a convolutional auto-encoder (CNN-CNN Auto-Encoder, CCAE) that mirrors the two-layer structure of convNMF. The first convolutional layer (encoder) acts as an inverse filter, producing a latent activation map H from the input spectrogram. The second convolutional layer (decoder) reconstructs the spectrogram by convolving H with learned decoder filters, which serve as spectro-temporal bases. Non-negativity is enforced by applying a soft-plus activation (g(x)=log(1+e^x)) after each convolution, and the network is trained by minimizing the Kullback-Leibler (KL) divergence between the original and reconstructed spectrograms.
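A minimal forward pass of this encoder-decoder structure can be sketched in numpy: filters span all frequency bins and convolve along time only, with a soft-plus after each convolution. Shapes, parameter names, and the single-channel layout are my assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softplus(x):
    # numerically stable soft-plus: log(1 + e^x)
    return np.logaddexp(0.0, x)

def ccae_forward(X, W_enc, W_dec):
    """One forward pass of the convolutive auto-encoder (sketch).
    X: (F, T) magnitude spectrogram.
    W_enc, W_dec: (K, F, L) filters spanning all F bins over L frames.
    Names and layout are illustrative assumptions."""
    K, F, L = W_enc.shape
    T = X.shape[1]
    # Encoder: correlate each filter with X along time -> activations H (K, T)
    H = np.zeros((K, T))
    for t in range(T - L + 1):
        patch = X[:, t:t + L]                                   # (F, L)
        H[:, t] = np.tensordot(W_enc, patch, axes=([1, 2], [0, 1]))
    H = softplus(H)                    # soft-plus keeps activations non-negative
    # Decoder: convolve H with spectro-temporal bases to rebuild the spectrogram
    Y = np.zeros((F, T))
    for t in range(T - L + 1):
        Y[:, t:t + L] += np.tensordot(H[:, t], W_dec, axes=(0, 0))
    return softplus(Y), H
```

Training would then minimize the KL divergence between X and the reconstruction Y; the loop-based convolutions here are for clarity and would be replaced by a framework's conv ops in practice.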
A key insight is that decoder filters are allowed to take negative values; the soft‑plus non‑linearity prevents destructive cancellation while granting the model greater expressive power than the strictly non‑negative bases of conventional NMF. Visual experiments on a synthetic 40×350 “spectrogram‑like” image show that decoder filters learn snippets of the input pattern, and the encoder learns matched inverse filters that fire at the correct temporal locations. When trained on real speech, decoder filters capture phonetic‑like time‑frequency structures, confirming the model’s ability to learn meaningful spectro‑temporal atoms.
To capture longer temporal context, the authors replace the convolutional encoder with a set of recurrent neural networks (RNNs), forming a Recurrent‑Convolutional Auto‑Encoder (RCAE). Each of the K RNNs processes the input along the time axis, and their hidden states are summed to form H. The recurrent encoder can, in theory, model arbitrarily long dependencies, addressing the finite‑length limitation of the convolutional encoder. In experiments the authors use LSTM cells for the RNNs.
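The key property of the recurrent encoder is that each activation frame can depend on arbitrarily old input frames. The sketch below uses a single vanilla RNN in place of the paper's LSTM cells, with a soft-plus output to keep H non-negative; all parameter names are illustrative assumptions.

```python
import numpy as np

def softplus(x):
    # numerically stable soft-plus: log(1 + e^x)
    return np.logaddexp(0.0, x)

def rnn_encoder(X, W_in, W_rec, b):
    """Recurrent encoder sketch: a vanilla RNN stands in for the paper's
    LSTM cells.  X: (F, T) spectrogram; W_in: (K, F); W_rec: (K, K).
    Returns H: (K, T) non-negative activations."""
    K = b.shape[0]
    T = X.shape[1]
    H = np.zeros((K, T))
    h = np.zeros(K)
    for t in range(T):
        # the hidden state sees the current frame and the previous state,
        # so H[:, t] can in principle depend on arbitrarily old frames
        h = np.tanh(W_in @ X[:, t] + W_rec @ h + b)
        H[:, t] = softplus(h)
    return H
```

Unlike the convolutional encoder, whose context is capped at the filter length L, this encoder's receptive field grows with t through the recurrence.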
For source separation, the trained auto‑encoders are used in a two‑step procedure. First, source‑specific models (i.e., the encoder‑decoder parameters) are learned from clean utterances. Second, given a mixture spectrogram X_m, the goal is to find two source spectrograms X_1 and X_2 such that X_m ≈ AE(X_1|θ_1) + AE(X_2|θ_2), where AE denotes the full auto‑encoder (encoder + decoder) and θ_i are the learned parameters for source i. This formulation mirrors NMF’s additive magnitude model but leverages the full non‑linear auto‑encoder to estimate the source spectrograms directly, without explicitly extracting latent activations. The loss again is KL divergence between X_m and the sum of the two reconstructions. After estimating X_1 and X_2, the authors recover time‑domain signals by applying the mixture’s phase and performing an inverse STFT.
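The final step, reattaching the mixture phase to the estimated magnitudes, is commonly done through Wiener-like soft masks. The sketch below shows one standard way to do this; the paper's exact masking choice may differ, and the function name is my own.

```python
import numpy as np

def separate_magnitudes(Xm_mag, Xm_phase, Y1, Y2, eps=1e-9):
    """Given the mixture magnitude/phase and two estimated source
    magnitude spectrograms Y1, Y2 (all F x T), build Wiener-like soft
    masks and reattach the mixture phase.  Illustrative post-processing
    sketch, not necessarily the paper's exact procedure."""
    M1 = Y1 / (Y1 + Y2 + eps)                      # soft mask for source 1
    M2 = Y2 / (Y1 + Y2 + eps)                      # soft mask for source 2
    S1 = (M1 * Xm_mag) * np.exp(1j * Xm_phase)     # complex STFT of source 1
    S2 = (M2 * Xm_mag) * np.exp(1j * Xm_phase)     # complex STFT of source 2
    return S1, S2
```

Because the masks sum to (nearly) one, S1 + S2 reproduces the mixture's complex spectrogram; time-domain signals then follow from an inverse STFT (e.g. `scipy.signal.istft`).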
Experimental evaluation uses the TIMIT corpus. For each trial, a male–female pair of speakers is selected; nine utterances per speaker are used for training the auto‑encoders, and the remaining utterance per speaker forms a 0 dB mixture for testing. Twenty such mixtures are generated. Spectrograms are computed with a 1024‑point STFT and 25 % hop, and only magnitude information is fed to the networks. CNN filters are 512 × 8 (time‑only convolution) and the number of filters K is varied from 10 to 100 in steps of 10. Training employs RMSProp (learning rate 0.001, momentum 0.7) and Xavier initialization.
Performance is measured with the BSS Eval metrics: signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifact ratio (SAR). Results (displayed as violin plots) show that CCAE consistently outperforms the feed-forward auto-encoder baseline from prior work in SDR and SIR, with higher medians and tighter inter-quartile ranges. SAR slightly degrades for CCAE, but the reduction in interference (higher SIR) more than compensates. RCAE also beats the baseline, though its gains are less pronounced than CCAE's. The best performance occurs around K = 80; however, the median SDR does not drop sharply for other K values, indicating that the convolutional models are less sensitive to the exact number of bases than feed-forward models. Moreover, variance in SDR decreases for larger K (≥ 50), suggesting more stable separation with richer dictionaries.
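For intuition about what these metrics reward, a scale-invariant SDR can be computed by projecting the estimate onto the reference and comparing target energy to residual energy. This is a simplified single-reference variant in the spirit of BSS Eval, not the full BSS Eval decomposition (which also separates interference from artifacts).

```python
import numpy as np

def simple_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio (dB): project the
    estimate onto the reference, treat the projection as the target
    signal and everything else as distortion.  Simplified variant of
    BSS Eval's SDR, for illustration only."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference            # component explained by the reference
    noise = estimate - target             # residual distortion
    return 10 * np.log10(np.sum(target**2) / (np.sum(noise**2) + 1e-12))
```

An estimate that matches the reference up to a gain scores arbitrarily high, while added interference or artifacts lower the ratio; the full BSS Eval suite (e.g. `mir_eval.separation.bss_eval_sources`) additionally splits the residual into SIR and SAR terms.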
In summary, the paper demonstrates that (1) a convolutional auto‑encoder can faithfully emulate convolutive NMF while relaxing the strict non‑negativity constraint, (2) incorporating a recurrent encoder can theoretically capture longer temporal dependencies, and (3) using the full auto‑encoder during inference yields superior source separation performance compared to traditional feed‑forward approaches. The work opens avenues for more flexible, generative source models that can be adapted to diverse audio domains and potentially extended to real‑time or multi‑source scenarios.