AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio Generation

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

Text-to-audio (TTA) generation is advancing rapidly, but evaluation remains challenging because human listening studies are expensive and existing automatic metrics capture only limited aspects of perceptual quality. We introduce AudioEval, a large-scale TTA evaluation dataset with 4,200 generated audio samples (11.7 hours) from 24 systems and 126,000 ratings collected from both experts and non-experts across five dimensions: enjoyment, usefulness, complexity, quality, and text alignment. Using AudioEval, we benchmark diverse automatic evaluators to compare perspective- and dimension-level differences across model families. We also propose Qwen-DisQA as a strong reference baseline: it jointly processes prompts and generated audio to predict multi-dimensional ratings for both annotator groups, modeling rater disagreement via distributional prediction and achieving strong performance. We will release AudioEval to support future research in TTA evaluation.


💡 Research Summary

Text‑to‑audio (TTA) generation has progressed rapidly, yet reliable evaluation remains a bottleneck. Human listening studies provide the gold standard but are costly, time‑consuming, and often ignore the differing perspectives of experts (audio engineers, musicians) and lay listeners. Existing automatic metrics such as Fréchet Audio Distance (FAD) or CLAP capture only limited aspects of perceptual quality and frequently misalign with human judgments, especially for subjective dimensions like enjoyment or usefulness.

To address these gaps, the authors introduce AudioEval, the first large‑scale, dual‑perspective, multi‑dimensional evaluation dataset for TTA. AudioEval comprises 4,200 audio clips (11.7 h total) generated by 24 representative TTA systems covering a wide range of modeling paradigms (autoregressive, diffusion, latent‑diffusion, consistency/LCM) and parameter scales (from <0.5 B to ≥2 B). The prompt set contains 451 natural‑language descriptions drawn from the AudioSet ontology, ensuring balanced coverage across 20+ sound categories (human, animal, environmental, musical, mechanical, etc.) and realistic sentence lengths (5–20 words).

The annotation protocol adopts a dual‑population design: three expert annotators (with academic training in audio engineering, speech, or music) and a pool of nine non‑expert annotators (general listeners). Each clip is rated independently by all three experts and by three of the non‑experts, yielding 126,000 dimension‑level scores (4,200 clips × 5 dimensions × 6 ratings). Five evaluation dimensions are defined, each with a 10‑point slider and detailed scoring guidelines:

  1. Content Enjoyment (CE) – subjective pleasure, emotional impact, artistic expression.
  2. Content Usefulness (CU) – potential utility for downstream applications or creative tasks.
  3. Production Complexity (PC) – richness and diversity of acoustic structure.
  4. Production Quality (PQ) – technical fidelity, clarity, dynamics, balance.
  5. Textual Alignment (TA) – semantic and temporal correspondence with the prompt.

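To make the protocol concrete, the sketch below models a single rating record and per-clip aggregation by dimension and rater group. The field names and schema are illustrative assumptions, not the paper's released data format.

```python
from dataclasses import dataclass
from statistics import mean, stdev

# Illustrative rating record; field names are assumptions,
# not the schema of the released AudioEval data.
@dataclass
class Rating:
    clip_id: str
    rater_group: str   # "expert" or "non_expert"
    dimension: str     # one of "CE", "CU", "PC", "PQ", "TA"
    score: int         # 10-point slider, 1..10

def aggregate(ratings, clip_id, dimension, group):
    """Mean and spread of one clip's scores for one dimension and rater group."""
    scores = [r.score for r in ratings
              if r.clip_id == clip_id
              and r.dimension == dimension
              and r.rater_group == group]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread
```

The per-group spread is what makes inter-rater disagreement visible, which is the quantity the distributional baseline later tries to predict.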
Statistical analysis shows that expert and non‑expert scores correlate modestly at the clip level (Pearson r ≈ 0.4–0.6) but strongly at the system level (r ≈ 0.8–0.93), confirming that the two groups prioritize different criteria. A Pearson correlation matrix among the five dimensions reveals moderate inter‑dimension relationships (e.g., CE ↔ CU = 0.62) and weaker links between TA and subjective dimensions, underscoring the need for separate modeling.
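The gap between clip-level and system-level agreement falls out of simple averaging: pooling many clips per system cancels individual rater noise, so system means correlate more strongly than raw clip scores. A toy sketch with invented numbers and a plain-Python Pearson coefficient illustrates the effect:

```python
from statistics import mean
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented per-clip scores for two hypothetical systems "A" and "B".
expert     = [6.1, 7.0, 5.5, 8.2, 7.8, 8.9]
non_expert = [5.0, 7.5, 6.2, 7.9, 8.8, 8.0]
systems    = ["A", "A", "A", "B", "B", "B"]

clip_r = pearson(expert, non_expert)

# System level: average each group's scores per system, then correlate.
sys_ids = sorted(set(systems))
sys_expert = [mean(e for e, s in zip(expert, systems) if s == sid) for sid in sys_ids]
sys_non    = [mean(n for n, s in zip(non_expert, systems) if s == sid) for sid in sys_ids]
sys_r = pearson(sys_expert, sys_non)  # with only two system means, r is +/-1
```

With two systems the system-level correlation is degenerate (±1), but the same averaging mechanism drives the clip-level vs system-level gap reported in the paper.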

The paper benchmarks several automatic evaluators on AudioEval: traditional distributional similarity metrics (FAD), contrastive audio‑text alignment scores (CLAP), and recent predictors built on audio masked autoencoders (AudioMAE). While these methods achieve reasonable correlation with human scores on PQ (r ≈ 0.55–0.68), they perform poorly on CE, CU, and TA (r ≤ 0.45), highlighting their limited perceptual scope.
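At its core, a CLAP-style alignment score is the cosine similarity between a text embedding and an audio embedding in a shared space. The sketch below uses placeholder vectors; a real pipeline would obtain the embeddings from a pretrained CLAP text encoder and audio encoder.

```python
from math import sqrt

def cosine_similarity(u, v):
    """CLAP-style alignment score: cosine of the angle between a text
    embedding and an audio embedding in the shared embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Placeholder embeddings, invented for illustration; real CLAP
# embeddings are high-dimensional encoder outputs.
text_emb  = [0.2, 0.8, -0.1, 0.5]
audio_emb = [0.3, 0.7,  0.0, 0.4]
score = cosine_similarity(text_emb, audio_emb)
```

Because this score measures only semantic proximity of the two embeddings, it says little about enjoyment, usefulness, or production quality, which is consistent with the weak CE/CU correlations reported above.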

To provide a stronger baseline, the authors develop Qwen‑DisQA, a multi‑modal quality predictor built on Qwen2.5‑Omni. Qwen‑DisQA jointly encodes the textual prompt and the generated audio, then outputs, for each of the five dimensions and each annotator group, the parameters of a normal distribution (mean ± variance). This distributional output explicitly models rater disagreement, allowing the system to express uncertainty rather than a single point estimate. Training uses the full AudioEval rating set with a negative log‑likelihood loss on the predicted distributions.
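The distributional objective described above corresponds, for each dimension and annotator group, to a Gaussian negative log-likelihood of the observed ratings under the predicted mean and variance. The sketch below is a minimal illustration of that loss; details such as variance parameterization or clamping are assumptions, not the paper's exact formulation.

```python
from math import log, pi

def gaussian_nll(x, mu, var, eps=1e-6):
    """Negative log-likelihood of an observed rating x under the
    predicted normal distribution N(mu, var). Mirrors the distributional
    objective described for Qwen-DisQA; exact details may differ."""
    var = max(var, eps)  # guard against a degenerate predicted variance
    return 0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)

# One illustrative training target: the observed ratings for a single
# clip's dimension and group, scored against predicted mu and variance.
ratings = [7, 8, 6]
mu_pred, var_pred = 7.2, 0.9
loss = sum(gaussian_nll(r, mu_pred, var_pred) for r in ratings) / len(ratings)
```

Minimizing this loss pushes the predicted mean toward the ratings while the predicted variance widens exactly where raters disagree, which is how the model expresses uncertainty instead of a single point estimate.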

Evaluation results demonstrate that Qwen‑DisQA substantially outperforms all baselines: average Pearson correlation across dimensions reaches 0.85 for experts and 0.88 for non‑experts, with peak values of 0.92 on TA for experts. The model’s ability to predict variance correlates with observed inter‑rater disagreement, confirming that it captures the inherent subjectivity of the task. Moreover, Qwen‑DisQA shows consistent performance across systems of varying size and release year, suggesting good generalization.

The authors also analyze how system characteristics relate to multi‑dimensional quality. Recent large‑scale models (2024–2025, ≥2 B parameters) tend to score higher on PQ and TA, reflecting improved fidelity and better prompt adherence. However, some mid‑size models achieve higher PC scores, indicating richer acoustic textures despite lower overall fidelity. This trade‑off underscores the value of multi‑dimensional evaluation for nuanced system diagnostics.

Limitations acknowledged include: (1) all annotators are English‑proficient, restricting applicability to multilingual TTA; (2) the 10‑point Likert scale and normal‑distribution assumption may not fully capture the true distribution of human judgments; (3) the average clip length (~10 s) limits assessment of long‑form soundscapes or dynamic temporal structures. Future work is proposed to expand the dataset with multilingual listeners, explore alternative distributional modeling (e.g., mixture models), and incorporate longer, more diverse audio samples.

In summary, AudioEval provides the first publicly available, large‑scale benchmark that simultaneously captures expert and non‑expert perspectives across five perceptually meaningful dimensions for text‑to‑audio generation. By releasing the dataset and the strong Qwen‑DisQA baseline, the authors lay a solid foundation for the development of more reliable, perceptually aligned automatic evaluation metrics and for systematic, fine‑grained comparison of TTA systems. This contribution is poised to accelerate progress in the rapidly evolving field of generative audio.

