TimberAgent: Gram-Guided Retrieval for Executable Music Effect Control
Digital audio workstations expose rich effect chains, yet a semantic gap remains between perceptual user intent and low-level signal-processing parameters. We study retrieval-grounded audio effect control, where the output is an editable plugin configuration rather than a finalized waveform. Our focus is Texture Resonance Retrieval (TRR), an audio representation built from Gram matrices of projected mid-level Wav2Vec2 activations. This design preserves texture-relevant co-activation structure. We evaluate TRR on a guitar-effects benchmark with 1,063 candidate presets and 204 queries. The evaluation follows Protocol-A, a cross-validation scheme that prevents train-test leakage. We compare TRR against CLAP and internal retrieval baselines (Wav2Vec-RAG, Text-RAG, FeatureNN-RAG), using min-max normalized metrics grounded in physical DSP parameter ranges. Ablation studies validate TRR’s core design choices: projection dimensionality, layer selection, and projection type. A near-duplicate sensitivity analysis confirms that results are robust to trivial knowledge-base matches. TRR achieves the lowest normalized parameter error among evaluated methods. A multiple-stimulus listening study with 26 participants provides complementary perceptual evidence. We interpret these results as benchmark evidence that texture-aware retrieval is useful for editable audio effect control, while broader personalization and real-audio robustness claims remain outside the verified evidence presented here.
💡 Research Summary
The paper addresses a practical gap in modern digital audio workstations (DAWs): translating a user’s perceptual intent—expressed via natural‑language description, an optional audio reference, or both—into a set of concrete, editable plugin parameters. Rather than directly regressing parameters with a neural network or generating a final waveform, the authors propose a retrieval‑based approach that selects an existing preset from a finite knowledge base. This ensures that the output is immediately executable, respects the physical bounds and inter‑parameter constraints of the DSP chain, and remains fully editable by the user.
The central technical contribution is Texture Resonance Retrieval (TRR), a novel audio representation that captures “texture” information through second‑order statistics. Specifically, the method extracts frame‑level activations from a mid‑layer of a pretrained Wav2Vec2 model, then computes Gram matrices (i.e., inner‑product matrices) across frames. These Gram matrices encode co‑activation patterns that are crucial for effects whose perceptual character depends on temporal modulation (e.g., tremolo, distortion, chorus). By contrast, conventional embeddings such as CLAP, mean‑pooled Wav2Vec2, or PaSST collapse the temporal dimension and lose this texture‑relevant structure.
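The Gram-matrix idea can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation: the projection matrix, the 64-dimensional target size, and the Frobenius distance used for re-ranking are all assumptions standing in for the paper's actual projection type and TRR distance.

```python
import numpy as np

def trr_gram(activations: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Texture Gram matrix from frame-level activations.

    activations: (T, D) mid-layer features (stand-in for Wav2Vec2 frames).
    proj: (D, d) projection reducing the feature dimensionality.
    Returns a (d, d) matrix of feature co-activations averaged over frames,
    which discards frame order but keeps second-order texture structure.
    """
    z = activations @ proj            # (T, d) projected frames
    return (z.T @ z) / z.shape[0]     # (d, d) second-order statistics

def gram_distance(g1: np.ndarray, g2: np.ndarray) -> float:
    """Frobenius distance between Gram matrices (one plausible TRR metric)."""
    return float(np.linalg.norm(g1 - g2, ord="fro"))

# Dummy activations in place of real Wav2Vec2 hidden states (D = 768).
rng = np.random.default_rng(0)
h_query = rng.standard_normal((100, 768))
h_cand = rng.standard_normal((120, 768))
proj = rng.standard_normal((768, 64)) / np.sqrt(768)

d = gram_distance(trr_gram(h_query, proj), trr_gram(h_cand, proj))
```

Because the Gram matrix sums over frames, two clips with similar modulation statistics but different lengths or alignments map to nearby points, which is exactly the property mean pooling lacks.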
The system architecture consists of two loosely coupled components. A real‑time DSP engine implements a six‑module effect chain (EQ, compressor, distortion, etc.) and receives a validated parameter vector θ_safe. An asynchronous retrieval module processes the user query q = (text, optional audio) by first generating separate text and audio embeddings, performing dual‑modal top‑K retrieval (text‑RAG, audio‑RAG), and finally re‑ranking the candidates using the TRR distance metric. The selected preset’s parameters are checked against per‑module bounds l ≤ θ ≤ u and a binary validity function C(θ) before being applied, guaranteeing that the audio path remains deterministic and low‑latency.
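The validation gate before the DSP engine can be sketched as follows. The function name, the two-parameter compressor module, and the attack-before-release constraint are hypothetical examples; the paper only specifies the bound check l ≤ θ ≤ u and a binary validity function C(θ).

```python
import numpy as np

def validate_params(theta, lower, upper, constraint):
    """Gate a retrieved preset before it reaches the audio path.

    Accepts theta only if every entry lies within its per-module bounds
    and the binary validity function C(theta) holds; otherwise raises,
    so the real-time engine never sees an invalid configuration.
    """
    theta = np.asarray(theta, dtype=float)
    if not np.all((lower <= theta) & (theta <= upper)):
        raise ValueError("parameter out of physical bounds")
    if not constraint(theta):
        raise ValueError("inter-parameter constraint violated")
    return theta  # theta_safe

# Hypothetical compressor module: [attack_ms, release_ms].
lower = np.array([0.1, 10.0])
upper = np.array([100.0, 1000.0])
attack_before_release = lambda th: th[0] < th[1]  # example C(theta)

theta_safe = validate_params([5.0, 120.0], lower, upper, attack_before_release)
```

Keeping this check on the control path, outside the audio callback, is what lets the audio path stay deterministic while retrieval runs asynchronously.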
Evaluation is performed on a guitar‑effects benchmark containing 1,063 distinct presets and 204 queries. The authors adopt Protocol‑A, a cross‑validation scheme that eliminates train‑test leakage, and report a normalized parameter error (average absolute error per dimension after min‑max scaling to physical ranges). Compared against four baselines—CLAP‑based retrieval, Wav2Vec‑RAG, Text‑RAG, and FeatureNN‑RAG—TRR achieves the lowest error across all effect categories, with especially large gains for texture‑dominant effects. A near‑duplicate sensitivity analysis shows that performance remains stable when trivial duplicates are removed, indicating that TRR does not rely on memorization.
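The normalized parameter error described above reduces to a short formula: scale each parameter's absolute error by the span of its physical range, then average over dimensions. A minimal sketch, with illustrative numbers rather than the benchmark's actual parameter ranges:

```python
import numpy as np

def normalized_param_error(theta_pred, theta_true, lower, upper):
    """Mean absolute error per dimension after min-max scaling each
    parameter to its physical DSP range [lower, upper]."""
    span = np.asarray(upper, float) - np.asarray(lower, float)
    err = np.abs(np.asarray(theta_pred, float) - np.asarray(theta_true, float))
    return float(np.mean(err / span))

# Two hypothetical parameters with ranges [0, 10] and [0, 2]:
# errors of 5 and 1 are each half their range, so the score is 0.5.
err = normalized_param_error([5.0, 1.0], [0.0, 0.0], [0.0, 0.0], [10.0, 2.0])
```

Min-max scaling keeps a 50 ms error on a millisecond-range delay from dominating a 0.1 error on a unit-range mix knob, so parameters of very different physical units contribute comparably.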
To complement the objective metrics, a multiple‑stimulus listening test with 26 participants was conducted. Participants heard the audio output of each system for the same query and selected the most satisfactory result. Statistical analysis (paired permutation testing) showed that TRR’s outputs were preferred significantly more often than any baseline’s, suggesting that the reduction in parameter error translates into perceptually better sound.
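For readers unfamiliar with paired permutation testing, the idea can be sketched as a sign-flip test on per-trial score differences. This is a generic illustration of the test family, not the paper's exact statistical procedure: the scores, permutation count, and two-sided mean-difference statistic are assumptions.

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test.

    Under the null that the two systems are exchangeable within a trial,
    each paired difference can have its sign flipped; the p-value is the
    fraction of sign-flipped replicates whose |mean difference| meets or
    exceeds the observed one (with the +1 continuity correction).
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, float) - np.asarray(b, float)
    observed = abs(diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    perm_means = np.abs((signs * diff).mean(axis=1))
    return float((np.sum(perm_means >= observed) + 1) / (n_perm + 1))

# Hypothetical per-participant preference scores for two systems.
p_clear = paired_permutation_test([1.0] * 20, [0.0] * 20)  # strong preference
p_null = paired_permutation_test([0.5] * 20, [0.5] * 20)   # identical systems
```

The pairing matters: because every participant rates all systems on the same query, per-trial differences cancel individual rating biases that an unpaired test would absorb as noise.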
The paper acknowledges several limitations. The benchmark is confined to a single instrument (guitar) and a specific effect chain, so generalization to other instruments, genres, or real‑world recordings remains unproven. The current design assumes both text and audio modalities are available; performance may degrade when only one is present. Finally, computing Gram matrices for each query introduces non‑trivial computational overhead, which may be problematic for low‑power devices or ultra‑low‑latency scenarios.
In summary, TimberAgent introduces a texture‑aware retrieval prior that leverages Gram‑matrix statistics of Wav2Vec2 activations to bridge the semantic‑to‑parameter gap in audio effect control. By grounding the retrieval in a physically valid preset space, the system delivers editable, real‑time‑compatible parameter configurations that outperform existing embedding‑based baselines both quantitatively and perceptually. Future work is suggested to broaden the domain coverage, streamline the Gram‑matrix computation, and integrate user‑controlled fine‑tuning interfaces, thereby moving closer to a universally applicable, intelligent music production assistant.