BNMusic: Blending Environmental Noises into Personalized Music


Acoustic masking is a conventional audio-engineering technique for reducing the annoyance of environmental noises by covering them with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, we introduce an alternative to acoustic masking that reduces the noticeability of environmental noises by blending them into personalized music generated from user-provided text prompts. Following the paradigm of music generation with mel-spectrogram representations, we propose the Blending Noises into Personalized Music (BNMusic) framework, which has two key stages. The first stage synthesizes a complete piece of music, in mel-spectrogram form, that encapsulates the musical essence of the noise. The second stage adaptively amplifies the generated music segment to further reduce noise perception and enhance the blending effect while preserving auditory quality. Comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework: it blends environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise and thereby improving the overall acoustic experience. Project page: https://d-fas.github.io/BNMusic_page/.


💡 Research Summary

The paper introduces BNMusic, a novel two‑stage framework that reduces the perceptual salience of environmental noise by blending it with personalized music generated from user‑provided text prompts. Unlike traditional active noise cancellation (ANC), which requires per‑listener devices, or simple acoustic masking, which demands excessive volume, BNMusic treats the noise as a source of rhythmic and spectral cues that can be incorporated into a musical composition.

In the first stage, the raw audio noise is converted into a mel‑spectrogram, and a binary mask isolates high‑energy regions that are most likely to be noticed. Using a latent diffusion model (LDM) derived from Riffusion, the system performs outpainting to extend musical content around the masked area and inpainting to fill the masked region itself. The text prompt conditions the style, genre, and mood of the generated music, while the diffusion process ensures that the resulting mel‑spectrogram inherits the temporal‑spectral structure of the original noise. This creates a music segment that is rhythmically aligned with the noise, effectively turning disruptive acoustic elements into musical material.
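
As a minimal sketch of this preprocessing step, the snippet below converts a noise clip to a mel‑spectrogram and derives a binary mask over its high‑energy time‑frequency bins, assuming librosa for feature extraction. The 90th‑percentile energy threshold and the file name are illustrative assumptions; the paper does not specify the exact masking rule.

```python
# Sketch of stage-1 preprocessing: noise clip -> mel-spectrogram -> binary
# mask over high-energy regions (the bins most likely to be noticed).
import librosa
import numpy as np

def high_energy_mask(noise_path, sr=22050, n_mels=128, percentile=90):
    """Return (mel_spectrogram_db, binary_mask) for a noise clip."""
    y, sr = librosa.load(noise_path, sr=sr)
    # Mel-spectrogram in dB: the representation the diffusion model operates on.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Flag time-frequency bins above the chosen energy percentile as the
    # "noticeable" noise regions to inpaint/outpaint around.
    # NOTE: the percentile rule is an assumption for illustration.
    threshold = np.percentile(mel_db, percentile)
    mask = (mel_db >= threshold).astype(np.uint8)
    return mel_db, mask

# Usage ("subway_noise.wav" is a placeholder path):
mel_db, mask = high_energy_mask("subway_noise.wav")
print(mel_db.shape, mask.mean())  # fraction of bins flagged as high energy
```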

The second stage applies adaptive amplification only to the portions of the generated music that overlap with the high‑energy noise bands. Guided by psychoacoustic masking theory, this selective gain raises the masking threshold precisely where it is needed, allowing the music to mask the noise without a global increase in loudness. Consequently, the overall listening level remains comfortable, and the musical quality is preserved.
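
The following sketch illustrates one plausible form of such selective gain: it boosts only the music bins that overlap the noise's high‑energy mask, softening the mask so the gain fades in and out smoothly. The fixed 6 dB gain and Gaussian smoothing are assumptions for illustration; in the paper, the gain is guided by psychoacoustic masking thresholds rather than a constant.

```python
# Sketch of stage-2 adaptive amplification: apply a localized gain where the
# generated music overlaps the noise's high-energy bins, instead of raising
# the global volume.
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_amplify(music_mel_db, noise_mask, gain_db=6.0, sigma=2.0):
    """Boost music bins overlapping noisy regions by gain_db decibels.

    Assumes music_mel_db and noise_mask share the same (n_mels, n_frames)
    shape, i.e., the music was generated frame-aligned with the noise.
    """
    # Smooth the binary mask so the gain ramps up and down without artifacts.
    soft_mask = gaussian_filter(noise_mask.astype(float), sigma=sigma)
    soft_mask /= max(soft_mask.max(), 1e-8)  # normalize peak gain to gain_db
    # Adding in the dB domain multiplies power: +6 dB is roughly 4x power,
    # applied only where the noise is energetic.
    return music_mel_db + gain_db * soft_mask
```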

Experiments use EPIC‑SOUNDS and ESC‑50 as noise sources, covering a wide frequency range and diverse real‑world scenarios (subway, appliances, machinery). Objective evaluation on MusicBench shows that BNMusic achieves higher masking efficiency and better music quality than baseline text‑to‑music models. Subjective listening tests confirm a significant reduction in perceived noise annoyance and an increase in listener satisfaction. Notably, the framework leverages pre‑trained music generators without additional training, making it computationally efficient and suitable for near‑real‑time deployment.

The authors claim three main contributions: (1) formulating noise blending with music as a new multimodal generation problem, (2) proposing a two‑step outpainting/inpainting pipeline that embeds noise characteristics into generated music, and (3) demonstrating that adaptive amplification based on auditory masking can minimize noise noticeability while maintaining musical balance. The work opens avenues for scalable acoustic enhancement in shared spaces such as public transport, offices, and homes, where equipping every listener with ANC headphones is impractical. Future directions include real‑time interactive control, personalization to individual hearing profiles, and integration with a broader range of generative audio models.

