Mix2Morph: Learning Sound Morphing from Noisy Mixes


We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By fine-tuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high-quality sound infusions across diverse categories, representing a step toward more controllable and concept-driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph.


💡 Research Summary

Mix2Morph tackles the longstanding challenge of sound morphing—creating a single audio output that convincingly blends two source sounds—by leveraging a text‑to‑audio diffusion model without requiring a dedicated morphing dataset. The authors focus on a practically motivated subclass they call “sound infusion,” where one source (the primary) dictates the temporal and structural behavior while the secondary source contributes timbral and textural detail throughout the output. Traditional DSP‑based morphing works well for pitched, harmonic sounds but fails on unpitched textures such as environmental noises and sound effects, which are central to creative sound design. Recent deep‑learning approaches (e.g., MorphFader, SoundMorpher) extend the scope but often suffer from “midpoint collapse,” producing intermediate outputs that resemble simple additive mixes rather than true hybrids.

The key insight of Mix2Morph is to repurpose noisy additive mixes as surrogate training data. These mixes are first aligned in time (RMS anchoring) and frequency (spectral interpolation) to force both sources into a shared structural and spectral space. Four augmentation modes—RMS‑only, Spectral‑only, Both, and None—are applied randomly, each paired with a descriptive caption that informs the model of the intended relationship (e.g., “behavior of X with textures from X and Y”). Crucially, the surrogate mixes are only used as training targets at high diffusion timesteps (large noise levels). At high timesteps the model learns coarse, global characteristics, allowing it to absorb the high‑level concept of blending while ignoring low‑level artifacts inherent in the noisy mixes. At low timesteps the model relies on its pretrained ability to reconstruct fine details, thus preserving the primary source’s temporal fidelity.
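The surrogate-mix construction described above can be illustrated with a short sketch. This is a hedged toy version, not the paper's implementation: the frame size, interpolation weight `alpha`, the mixing formula, and the timestep threshold `t_high` are all hypothetical choices for illustration, and the exact RMS-anchoring and spectral-interpolation procedures in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_anchor(secondary, primary, frame=1024, eps=1e-8):
    # Rescale each frame of `secondary` so its loudness (RMS) envelope
    # tracks `primary` -- an illustrative stand-in for RMS anchoring.
    out = secondary.astype(float).copy()
    for start in range(0, len(primary) - frame + 1, frame):
        p = primary[start:start + frame]
        s = out[start:start + frame]
        gain = np.sqrt(np.mean(p ** 2)) / (np.sqrt(np.mean(s ** 2)) + eps)
        out[start:start + frame] = s * gain
    return out

def spectral_interpolate(x, ref, alpha=0.5):
    # Pull the magnitude spectrum of `x` toward that of `ref` while
    # keeping x's phase -- a simple stand-in for spectral interpolation.
    X, R = np.fft.rfft(x), np.fft.rfft(ref)
    mag = (1 - alpha) * np.abs(X) + alpha * np.abs(R)
    return np.fft.irfft(mag * np.exp(1j * np.angle(X)), n=len(x))

def make_surrogate_mix(primary, secondary, mode="both"):
    # The four augmentation modes from the paper:
    # "rms", "spectral", "both", or "none".
    s = secondary
    if mode in ("rms", "both"):
        s = rms_anchor(s, primary)
    if mode in ("spectral", "both"):
        s = spectral_interpolate(s, primary)
    return 0.5 * (primary + s)

def use_surrogate_target(t, t_high=0.7):
    # Surrogate mixes supervise only high-noise diffusion timesteps
    # (t >= t_high, with t in [0, 1]); at lower timesteps the model
    # keeps its pretrained fine-detail behavior. The threshold 0.7
    # is a placeholder, not the paper's value.
    return t >= t_high

# Toy example: build one surrogate training target.
primary = rng.standard_normal(4096)
secondary = 0.1 * rng.standard_normal(4096)
mix = make_surrogate_mix(primary, secondary, mode="both")
```

In a training loop, a sampled timestep would be checked with `use_surrogate_target(t)` to decide whether the surrogate mix (rather than ordinary pretraining data) supplies the target for that step.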

Extensive ablations show that allocating surrogate mixes to the timestep interval

