Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models’ ability to reproduce consonant-induced F0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems’ ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.


💡 Research Summary

The paper introduces a linguistically motivated probing framework that targets a fine‑grained phonetic phenomenon—consonant‑induced F0 perturbation (CF0)—to assess how well modern neural text‑to‑speech (TTS) systems capture segmental‑prosodic interactions. CF0 refers to the systematic shift in the fundamental frequency of a vowel caused by the immediately preceding consonant: voiceless obstruents typically raise F0, while voiced obstruents may lower it, depending on language‑specific factors. Because this effect is short‑range and not explicitly supervised during TTS training, its presence in synthetic speech would indicate that a model has internalized the coupling between segmental features and continuous acoustic parameters.

Experiment 1 (controlled single‑speaker study).
Both Tacotron 2 (autoregressive) and FastSpeech 2 (non‑autoregressive) are trained on the same single‑speaker LJ Speech corpus. From the COCA corpus, 4,210 sentences are randomly sampled and synthesized with each model. Forced alignment (Montreal Forced Aligner) and Praat are used to obtain phoneme‑level boundaries and F0 contours. Vowel F0 trajectories are time‑normalized to 21 equidistant points, allowing comparison across tokens of varying duration. The analysis focuses on three onset categories—voiceless obstruents, voiced obstruents, and sonorants—while controlling for vowel height (high, mid, low) using IPA classifications.
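The time-normalization step above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an F0 contour has already been extracted and voiced-frame cleaned, and resamples it to 21 equidistant points by linear interpolation so that vowels of different durations become directly comparable.

```python
import numpy as np

def time_normalize_f0(f0, n_points=21):
    """Resample an F0 contour (Hz) to n_points equidistant samples.

    Uses linear interpolation over normalized time [0, 1]; assumes
    the contour contains no unvoiced gaps (NaNs).
    """
    f0 = np.asarray(f0, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(f0))
    dst = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(dst, src, f0)

# A 30-frame vowel and a 12-frame vowel become directly comparable:
long_vowel = time_normalize_f0(np.linspace(180, 200, 30))
short_vowel = time_normalize_f0(np.linspace(180, 200, 12))
```

After normalization, each token contributes a fixed-length 21-point trajectory, which is what the GAMM smooths are fit over.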

A balanced token set is constructed: for each speech source (natural LJ, Tacotron 2, FastSpeech 2) and lexical frequency band (high vs. low), 1,000 tokens per onset type are selected, yielding 6,000 tokens per source. Lexical frequency is derived from SUBTLEX‑US, enabling a direct test of memorization versus abstraction. Generalized additive mixed models (AR1 GAMMs) with thin‑plate splines model the time‑varying effect of onset type on F0, while word and vowel height are included as random/factor smooths to control for item‑specific variation.
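The balanced sampling design can be sketched with a pandas group-wise draw. The token table and its column names here are hypothetical stand-ins for the paper's data: one row per vowel token, labeled with its speech source, SUBTLEX-US frequency band, and onset class, from which 1,000 tokens per cell are drawn (3 onset classes x 2 frequency bands = 6,000 tokens per source).

```python
import numpy as np
import pandas as pd

# Hypothetical token table: one row per vowel token.
rng = np.random.default_rng(0)
n = 50_000
tokens = pd.DataFrame({
    "source": rng.choice(["natural", "tacotron2", "fastspeech2"], n),
    "freq_band": rng.choice(["high", "low"], n),
    "onset": rng.choice(["voiceless", "voiced", "sonorant"], n),
})

# 1,000 tokens per (source, frequency band, onset) cell, i.e.
# 6,000 tokens per source, matching the design described above.
balanced = (
    tokens.groupby(["source", "freq_band", "onset"])
          .sample(n=1000, random_state=0)
)
```

Equal cell sizes keep the onset-type contrasts in the subsequent GAMM analysis from being confounded with frequency-band imbalance.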

Findings.
In high‑frequency words, both models reproduce the expected CF0 pattern: voiceless obstruents induce a modest F0 rise relative to the sonorant baseline, and voiced obstruents produce a slight dip. Tacotron 2 shows a marginally stronger effect, likely due to its sequential frame‑wise conditioning. However, for low‑frequency (unseen) words, the CF0 effect collapses: synthetic speech shows virtually no significant difference between onset categories, and the magnitude of any residual effect is far smaller than in natural speech. This suggests that the models rely heavily on lexical‑level memorization rather than learning an abstract segmental‑prosodic mapping.
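The kind of attenuation described above can be quantified with a simple descriptive statistic, sketched here on toy data (the actual paper uses GAMM smooths, not this raw-mean summary): the mean F0 offset of each onset class relative to the sonorant baseline over the early part of the time-normalized vowel, where CF0 is strongest.

```python
import numpy as np

def cf0_effect(f0_by_onset, baseline="sonorant", early=slice(0, 5)):
    """Mean F0 offset (Hz) of each onset class relative to the baseline,
    averaged over the first few normalized time points.

    f0_by_onset maps onset class -> array of time-normalized contours,
    shape (n_tokens, 21).
    """
    base = f0_by_onset[baseline][:, early].mean()
    return {k: v[:, early].mean() - base for k, v in f0_by_onset.items()}

# Toy contours (200 tokens x 21 points): voiceless onsets raised ~8 Hz,
# voiced onsets lowered ~4 Hz, mimicking the natural-speech CF0 pattern.
rng = np.random.default_rng(1)
natural = {
    "sonorant":  190 + rng.normal(0, 5, (200, 21)),
    "voiceless": 198 + rng.normal(0, 5, (200, 21)),
    "voiced":    186 + rng.normal(0, 5, (200, 21)),
}
effects = cf0_effect(natural)
```

For low-frequency words in synthetic speech, the study's finding corresponds to these offsets shrinking toward zero rather than matching the natural-speech magnitudes.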

Experiment 2 (large‑scale, multi‑speaker validation).
The authors extend the probe to the “In‑the‑Wild” dataset, which contains real and deepfake audio from 58 public figures. Using the same alignment and F0 extraction pipeline (with the speaker‑adapted Polyglot algorithm), they evaluate a range of state‑of‑the‑art TTS systems, including commercial and open‑source variants. The same pattern emerges: while real speech exhibits robust CF0 across speakers and lexical frequencies, synthetic speech, regardless of architecture, fails to generalize the effect to low‑frequency items and shows attenuated perturbations overall. This confirms that the limitation observed in the controlled single‑speaker setting is a general property of current neural TTS designs.

Implications.
The study demonstrates that contemporary neural TTS models, despite achieving high naturalness scores, do not reliably encode fine‑grained segmental‑prosodic cues that are grounded in articulatory physiology. The proposed probing framework offers a reproducible, linguistically informed diagnostic tool for future TTS evaluation, encouraging researchers to incorporate explicit segment‑prosody modeling (e.g., conditional pitch predictors, joint phoneme‑prosody encoders) or to augment training data with low‑frequency lexical items. Ultimately, addressing this gap is essential for applications requiring high phonetic fidelity, such as language teaching, speech therapy, and forensic voice analysis.

