Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages the separation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show that Gencho generates richer RIRs than non-generative baselines while maintaining strong performance on standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho’s versatility for controllable acoustic simulation and generative audio tasks.
💡 Research Summary
Gencho introduces a novel diffusion‑transformer framework for blind room impulse response (RIR) estimation and text‑conditioned RIR generation. The authors first identify the shortcomings of existing approaches: parametric methods (e.g., octave‑band T60 estimation) are too coarse, while recent deep learning models such as FiNS or GAN‑based generators are constrained by fixed architectural priors and often collapse when faced with unseen acoustic environments. A key observation is that an RIR consists of two fundamentally different components—early reflections (sparse, short‑time structure) and late reverberation (diffuse, noise‑like tail). Treating the reverberant speech as a monolithic signal therefore mixes these characteristics and degrades perceptual quality.
Gencho tackles this by designing a structure‑aware audio encoder and a diffusion‑based generative decoder. The encoder receives a two‑channel input: (1) the full reverberant speech and (2) an early‑reflection‑only signal obtained via a speech‑enhancement front‑end that isolates the first ~50 ms of the room response. Both channels are processed through a series of 1‑D convolutional blocks (kernel = 15, stride = 2) with PReLU activations and layer normalization, followed by adaptive average pooling that yields a 128‑dimensional global embedding w_ref. This embedding captures the acoustic fingerprint of the environment while preserving the distinction between early and late components.
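The encoder's shape arithmetic can be sketched in plain NumPy. This is a single-channel toy with random weights, not the trained two-channel model: the number of blocks, the PReLU slope, and the simplified (per-feature-map) layer normalization are all assumptions made for illustration.

```python
import numpy as np

def conv_block(x, w, stride=2):
    """Strided 1-D convolution (valid padding) followed by PReLU."""
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    y = np.array([np.dot(x[i * stride : i * stride + k], w) for i in range(n_out)])
    return np.where(y > 0, y, 0.25 * y)  # PReLU; slope 0.25 is an assumed default

def encode(x, n_blocks=4, kernel=15, seed=0):
    """Waveform -> 128-dim embedding: conv blocks, layer norm, adaptive pooling."""
    rng = np.random.default_rng(seed)
    for _ in range(n_blocks):
        w = rng.standard_normal(kernel) / np.sqrt(kernel)  # random stand-in weights
        x = conv_block(x, w)
        x = (x - x.mean()) / (x.std() + 1e-8)  # layer normalization (simplified)
    # adaptive average pooling: 128 roughly equal segments, one mean per segment,
    # so any input length maps to a fixed 128-dim embedding
    return np.array([seg.mean() for seg in np.array_split(x, 128)])

w_ref = encode(np.random.default_rng(1).standard_normal(48_000))
print(w_ref.shape)  # (128,)
```

The adaptive pooling at the end is what lets reverberant inputs of any duration condense to the same fixed-size conditioning vector w_ref.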
The decoder operates on complex spectrograms rather than raw waveforms. Target RIRs (1 s, 48 kHz) are µ‑law encoded (β = 2, α = 0.3), transformed with a 128‑point STFT (hop = 64) to obtain a complex matrix H ∈ ℂ^{65 × 751}. Real and imaginary parts are stacked, giving a 130 × 751 tensor that serves as the diffusion data space. A forward diffusion process adds Gaussian noise across T timesteps; training uses the v‑prediction re‑parameterization, which has been shown to improve stability in audio generation. The generative backbone is a Diffusion Transformer (DiT): each layer contains RMS normalization, self‑attention, cross‑attention to w_ref, another RMS block, and a feed‑forward linear projection. Cross‑attention injects the acoustic embedding at every diffusion step, guiding the model toward plausible room characteristics. Classifier‑free guidance is employed with a 10 % drop‑out of the conditioning signal, encouraging the network to learn both conditional and unconditional distributions and thereby increasing diversity.
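The tensor dimensions above, and the v-prediction training target, can be verified with a short NumPy sketch. The Hann window and center padding are assumptions about the STFT configuration, and the µ-law companding step is omitted since its exact (β, α) parameterization is not standard.

```python
import numpy as np

def complex_spec(h, n_fft=128, hop=64):
    """Windowed STFT returning an (n_fft//2 + 1, n_frames) complex matrix."""
    h = np.pad(h, (n_fft // 2, n_fft // 2))  # center padding (assumed)
    n_frames = 1 + (len(h) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([h[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1).T

def v_target(x0, eps, abar):
    """v-prediction target: v = sqrt(abar)*eps - sqrt(1 - abar)*x0."""
    return np.sqrt(abar) * eps - np.sqrt(1.0 - abar) * x0

rir = np.random.default_rng(0).standard_normal(48_000)  # 1 s at 48 kHz
H = complex_spec(rir)                                   # (65, 751) complex
X = np.concatenate([H.real, H.imag], axis=0)            # (130, 751) diffusion space
```

With a 128-point FFT there are 65 one-sided frequency bins, and a hop of 64 over 48,000 center-padded samples yields 751 frames, matching the 130 × 751 stacked real/imaginary tensor described above. At abar = 1 the v-target reduces to the noise ε, and at abar = 0 to −x0, interpolating between noise- and data-prediction.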
Experiments are conducted on a diverse set of publicly available room recordings (concert halls, classrooms, offices, etc.). Objective metrics include T60 error, direct‑to‑reverberant ratio (DRR), PESQ, and STOI. Gencho consistently outperforms FiNS and recent GAN‑based baselines, reducing T60 error by roughly 30 % and achieving higher DRR fidelity. Subjective listening tests show a strong preference (≈85 % of participants) for Gencho‑generated reverberation, citing more natural decay and better spatial coloration.
A notable extension is text‑conditioned RIR synthesis. By feeding natural‑language prompts such as “small café with 0.8 s reverberation” into the model, Gencho can generate RIRs that match the described acoustic attributes. This enables “soft acoustic matching,” where the same speech content can be rendered in multiple virtual environments without re‑training or explicit parameter tuning. The authors also demonstrate RIR completion (filling missing late tail from an early‑reflection snippet) and hybrid prompting (combining audio and text cues) to achieve fine‑grained control.
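At sampling time, the classifier-free guidance trained with conditioning dropout is what lets audio and text cues steer generation. A minimal sketch of the guidance combination step follows; the guidance scale and the random stand-in "predictions" are purely illustrative, since the real model evaluates the DiT twice per diffusion step (with and without the conditioning tokens).

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, scale=3.0):
    """Classifier-free guidance: push the prediction toward the conditional one.

    scale=1 recovers the plain conditional prediction; larger scales trade
    diversity for stronger adherence to the audio/text prompt. The value 3.0
    here is an assumed example, not the paper's setting.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
x_shape = (130, 751)                      # diffusion data space from above
pred_c = rng.standard_normal(x_shape)     # stand-in: prediction with conditioning
pred_u = rng.standard_normal(x_shape)     # stand-in: conditioning dropped
guided = cfg_combine(pred_u, pred_c)
```

Hybrid prompting fits naturally into this scheme: audio-derived and text-derived conditioning tokens can be supplied (or dropped) independently, and the same linear combination steers the sample toward whichever cues are present.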
Limitations are acknowledged: the complex‑spectrogram representation (130 × 751) demands significant GPU memory, and generating very long RIRs (>2 s) would require more diffusion steps and larger models. Future work is proposed on multi‑scale diffusion, latent compression, and multimodal conditioning (e.g., using images or video of a space) to broaden applicability to AR/VR and generative audio pipelines.
In summary, Gencho presents the first diffusion‑transformer system that jointly addresses blind RIR estimation and controllable, text‑driven RIR generation. By explicitly separating early and late reverberation in the encoder and leveraging a powerful stochastic generative decoder, it achieves superior objective and perceptual performance while offering unprecedented flexibility for downstream acoustic‑matching and immersive audio applications.