DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer
Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, which facilitates audio generation through consensus among multiple reward signals. Extensive experiments on the AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performance across a variety of objective and subjective evaluation metrics.
💡 Research Summary
DegDiT introduces a novel approach to controllable text‑to‑audio generation by explicitly modeling sound events as dynamic graphs and using these graphs to guide a diffusion transformer. Given a natural‑language prompt, the system extracts each mentioned audio event and represents it with a tuple (category, start time, end time, intensity). Five temporal relations—before, after, overlaps, contains, contained‑by—are encoded in an adjacency tensor, forming a graph where nodes carry semantic and temporal attributes and edges capture inter‑event timing.
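The graph construction above can be sketched in code. The following is a minimal, illustrative reconstruction (function names, the relation-classification thresholds, and the one-hot tensor layout are assumptions, not the paper's actual implementation): each event tuple becomes a node, and an adjacency tensor `A[i, j, r]` one-hot encodes which of the five temporal relations holds between events `i` and `j`.

```python
import numpy as np

# The five temporal relations named in the summary.
RELATIONS = ["before", "after", "overlaps", "contains", "contained_by"]

def temporal_relation(a, b):
    """Classify the temporal relation of event a relative to event b.
    Events are (category, start, end, intensity) tuples."""
    (_, s1, e1, _), (_, s2, e2, _) = a, b
    if e1 <= s2:
        return "before"
    if e2 <= s1:
        return "after"
    if s1 <= s2 and e1 >= e2:
        return "contains"
    if s2 <= s1 and e2 >= e1:
        return "contained_by"
    return "overlaps"

def build_adjacency(events):
    """One-hot adjacency tensor A[i, j, r] over the five relations."""
    n = len(events)
    A = np.zeros((n, n, len(RELATIONS)))
    for i in range(n):
        for j in range(n):
            if i != j:
                r = RELATIONS.index(temporal_relation(events[i], events[j]))
                A[i, j, r] = 1.0
    return A

# Example prompt: a dog barks, a siren overlaps it, then speech follows.
events = [("dog_bark", 0.0, 2.0, 0.8),
          ("siren", 1.0, 5.0, 1.0),
          ("speech", 6.0, 9.0, 0.6)]
A = build_adjacency(events)
```

Edges are directed, so `A[i, j]` and `A[j, i]` carry inverse relations (e.g. `before` in one direction, `after` in the other).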
Semantic information is obtained from a pretrained FLAN‑T5‑Large model, which encodes the event category into a dense vector (e_cat). Temporal attributes (start and end times) are processed by a sinusoidal‑activated MLP to produce a temporal embedding (t_i). The two are summed to form an initial node embedding g⁽⁰⁾_i. In parallel, a frame‑level activation matrix describing how strongly each event appears in each time frame is projected by a FrameEncoder and weighted by the event’s intensity α_i, yielding a fine‑grained temporal feature f_i.
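A rough sketch of the node-embedding construction follows. It substitutes a random lookup table for the FLAN-T5-Large category encoder and a single-layer FrameEncoder, and the dimension is purely illustrative; only the overall structure (sinusoidal-activated temporal MLP, `g⁽⁰⁾_i = e_cat + t_i`, intensity-weighted frame feature) mirrors the summary.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding dimension (illustrative; the paper uses FLAN-T5-Large features)

# Stand-in for the pretrained FLAN-T5 category encoder (e_cat).
cat_table = {"dog_bark": rng.normal(size=D), "siren": rng.normal(size=D)}

# Sinusoidal-activated MLP over the temporal attributes (start, end) -> t_i.
W1, b1 = rng.normal(size=(2, D)), np.zeros(D)
W2, b2 = rng.normal(size=(D, D)), np.zeros(D)

def temporal_embed(start, end):
    h = np.sin(np.array([start, end]) @ W1 + b1)  # sinusoidal activation
    return h @ W2 + b2

def node_embed(category, start, end):
    # Initial node embedding g_i^(0) = e_cat + t_i.
    return cat_table[category] + temporal_embed(start, end)

# Fine-grained temporal feature f_i: intensity-weighted projection of a
# frame-level activation vector (the FrameEncoder is one linear layer here).
T_FRAMES = 100
W_f = rng.normal(size=(T_FRAMES, D))

def frame_feature(activation, alpha):
    return alpha * (activation @ W_f)

g0 = node_embed("dog_bark", 0.0, 2.0)
act = (np.arange(T_FRAMES) < 20).astype(float)  # active in the first 20 frames
f = frame_feature(act, alpha=0.8)
```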
All node embeddings, together with relation embeddings derived from the adjacency tensor via a RelationEncoder, are fed into a graph transformer. Through multi‑head self‑attention, each node aggregates contextual information from its neighbors, producing a contextualized event embedding g_i that reflects both intra‑event semantics and inter‑event temporal dependencies. The set of g_i embeddings is concatenated with the original text embedding and supplied as conditioning input to a multi‑modal diffusion transformer. Unlike prior text‑only conditioning, this graph‑guided conditioning directly injects precise timestamp information into the diffusion process, dramatically improving temporal localization.
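A single attention layer of such a graph transformer might look like the sketch below. The relation embeddings are reduced here to a scalar bias added to each attention logit (one common way to inject edge information; the paper's RelationEncoder may differ), and all weights are random stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(G, rel_bias, seed=1):
    """One self-attention layer over node embeddings G of shape (n, d).
    rel_bias (n, n) injects inter-event relation information into the logits."""
    n, d = G.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = G @ Wq, G @ Wk, G @ Wv
    logits = Q @ K.T / np.sqrt(d) + rel_bias      # (n, n)
    attn = softmax(logits, axis=-1)               # each node attends to all others
    return attn @ V                               # contextualized embeddings g_i

n, d = 3, 16
G0 = np.random.default_rng(2).normal(size=(n, d))   # initial node embeddings
rel_bias = np.zeros((n, n))  # e.g. a learned projection of the relation tensor
G = graph_attention(G0, rel_bias)

# Conditioning input: contextualized event embeddings concatenated with
# the text embedding sequence (illustrative shapes).
text_emb = np.zeros((8, d))
cond = np.concatenate([text_emb, G], axis=0)
```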
To train the model, the authors construct a high‑quality, diverse dataset using a “quality‑balanced data selection” pipeline. They first apply HTS‑AT to generate pseudo‑labels for AudioSet clips, then evaluate each clip on four criteria: event count, type diversity, temporal alignment accuracy, and duration plausibility. Samples that score well across all criteria are retained, ensuring open‑vocabulary coverage while maintaining acoustic diversity.
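The selection step can be sketched as a weighted score over the four criteria with a retention threshold. The specific normalizations, weights, and threshold below are illustrative guesses, not the paper's values; only the four criteria themselves come from the summary.

```python
def quality_score(clip, weights=(0.25, 0.25, 0.25, 0.25)):
    """Aggregate the four quality criteria into one scalar in [0, 1]."""
    criteria = (
        min(clip["num_events"] / 4.0, 1.0),               # event count
        min(clip["num_types"] / 3.0, 1.0),                # type diversity
        clip["alignment_acc"],                            # temporal alignment, in [0, 1]
        1.0 if 1.0 <= clip["duration"] <= 10.0 else 0.0,  # duration plausibility
    )
    return sum(w * c for w, c in zip(weights, criteria))

def select(clips, threshold=0.7):
    """Keep only clips that score well across all criteria."""
    return [c for c in clips if quality_score(c) >= threshold]

clips = [
    {"num_events": 3, "num_types": 2, "alignment_acc": 0.9, "duration": 8.0},
    {"num_events": 1, "num_types": 1, "alignment_acc": 0.3, "duration": 30.0},
]
kept = select(clips)  # only the first clip survives the filter
```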
After supervised training, DegDiT is further refined with Consensus Preference Optimization (CoPO). CoPO treats four reward signals—text‑audio alignment, event‑level alignment, timestamp accuracy, and overall audio quality—as separate components. A learnable weighting mechanism aggregates these signals into a single scalar reward, and a reinforcement‑learning loop updates the diffusion model to maximize this consensus reward. This multi‑dimensional preference modeling yields more nuanced improvements than binary preference learning.
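The consensus aggregation can be illustrated as follows. A softmax over learnable logits is one plausible form of the weighting mechanism (it keeps weights positive and summing to one); the paper's exact parameterization and the RL update are not reproduced here.

```python
import numpy as np

def consensus_reward(rewards, logits):
    """Aggregate multiple reward signals into one scalar via learnable weights.
    A softmax over the logits keeps the weights positive and normalized."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return float(w @ rewards)

# The four reward signals named in the summary:
# text-audio alignment, event-level alignment, timestamp accuracy, audio quality.
rewards = np.array([0.8, 0.6, 0.9, 0.7])
logits = np.zeros(4)  # uniform weighting before any learning

r = consensus_reward(rewards, logits)  # with uniform weights, the mean: 0.75
```

During preference optimization the logits would be updated jointly with the diffusion model, letting the consensus emphasize whichever reward dimensions are most informative.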
Extensive experiments on three benchmarks—AudioCondition (evaluating text‑audio semantic match), DESED (multi‑event detection), and AudioTime (timestamp precision)—show that DegDiT outperforms state‑of‑the‑art systems such as AudioLDM, Tango, and TangoFlux across both objective metrics (e.g., FAD, event‑F1, timestamp MAE) and subjective listening tests (MOS). Notably, in scenarios with overlapping or sequential events, DegDiT reduces timestamp error by over 30% and improves MOS by ~0.4 points. Model size and inference latency remain comparable to existing diffusion‑transformer baselines, preserving practical efficiency.
In summary, DegDiT demonstrates that representing audio events as dynamic graphs, processing them with a graph transformer, and using the resulting embeddings as diffusion guidance can simultaneously achieve fine‑grained temporal control, open‑vocabulary scalability, and computational practicality. The quality‑balanced data curation and consensus‑based reinforcement learning further enhance both diversity and fidelity, establishing a new benchmark for controllable audio generation.