Activation Steering for Accent-Neutralized Zero-Shot Text-To-Speech


Zero-shot Text-to-Speech (TTS) models can generate speech that captures both the voice timbre and accent of a reference speaker. However, disentangling these attributes remains challenging, as the output often inherits both the accent and timbre from the reference. In this study, we introduce a novel, post-hoc, and training-free approach to neutralize accent while preserving the speaker’s original timbre, utilizing inference-time activation steering. We first extract layer-specific “steering vectors” offline, which are derived from the internal activation differences within the TTS model between accented and native speech. During inference, the steering vectors are applied to guide the model to produce accent-neutralized, timbre-preserving speech. Empirical results demonstrate that the proposed steering vectors effectively mitigate the output accent and exhibit strong generalizability to unseen accented speakers, offering a practical solution for accent-free voice cloning.


💡 Research Summary

The paper tackles a practical limitation of modern zero‑shot text‑to‑speech (TTS) systems: when a reference utterance carries a non‑native accent, the generated speech inherits both the speaker’s timbre and the accent, making it difficult to produce accent‑neutral speech while preserving the speaker’s identity. The authors propose a post‑hoc, training‑free method called “activation steering” that manipulates the internal activations of a pre‑trained TTS model during inference to suppress accent information.

Model and Data
The study uses Qwen3‑TTS, a state‑of‑the‑art large‑language‑model (LLM) based zero‑shot TTS architecture consisting of a 28‑layer Transformer backbone and a lightweight multi‑token prediction (MTP) module. Only the backbone’s activations are considered, as they encode the bulk of voice‑related representations. For extracting steering directions, the authors employ the ARCTIC corpus (native US English, treated as accent‑neutral) and L2‑ARCTIC (English spoken by Mandarin‑L1 speakers, representing a target accent). Each sentence is synthesized twice—once with a neutral reference and once with an accented reference—while keeping the target text identical.
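The paired-synthesis step above can be sketched as a simple collection loop. This is an illustrative stand-in, not the paper's code: `record` is a hypothetical callable that runs one synthesis and returns the backbone activations captured via forward hooks.

```python
def collect_paired_activations(record, sentences, neutral_refs, accented_refs):
    """Synthesize each target sentence twice (neutral vs. accented reference)
    and collect the recorded backbone activations for each condition.

    record: hypothetical callable (text, reference_wav) -> activations.
    """
    acc_acts, neu_acts = [], []
    for text, ref_neu, ref_acc in zip(sentences, neutral_refs, accented_refs):
        # Same target text in both runs; only the reference prompt differs.
        neu_acts.append(record(text, ref_neu))
        acc_acts.append(record(text, ref_acc))
    return acc_acts, neu_acts
```

Keeping the target text identical across the two runs is what makes the later activation difference attributable to the reference's accent rather than to content.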

Steering Vector Extraction
For each Transformer layer ℓ, the authors compute the mean activation over all generated tokens for the accented condition (a_acc^ℓ) and for the neutral condition (a_neu^ℓ). The steering vector v_ℓ is defined as the difference between these two means:

v_ℓ = (1/N_a) Σ_i a_i^{accent,ℓ} – (1/N_n) Σ_i a_i^{neutral,ℓ}.

To reduce the risk that v_ℓ captures speaker identity (since each speaker in the dataset consistently uses a single accent), they augment the reference waveforms on‑the‑fly with three random transformations: (1) scaling of all formant frequencies, (2) scaling of fundamental frequency (F0), and (3) applying a random equalizer. These perturbations alter timbre while leaving the linguistic content and accent largely intact, encouraging v_ℓ to encode primarily accent information.
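The extraction formula reduces to a per-layer mean difference. A minimal sketch, assuming the per-layer activations have already been pooled over all generated tokens into tensors (the dict layout and shapes are illustrative, not from the paper's implementation):

```python
import torch

def extract_steering_vectors(acc_acts, neu_acts):
    """Compute v_l = mean(accented activations) - mean(neutral activations).

    acc_acts / neu_acts: dict mapping layer index -> tensor of shape
    (num_tokens, hidden_dim), stacked over all extraction sentences.
    """
    vectors = {}
    for layer in acc_acts:
        a_acc = acc_acts[layer].mean(dim=0)   # (hidden_dim,)
        a_neu = neu_acts[layer].mean(dim=0)   # (hidden_dim,)
        vectors[layer] = a_acc - a_neu        # accent direction v_l
    return vectors
```

Because each speaker is averaged over many timbre-randomized references, speaker-specific components tend to cancel in the means, leaving the shared accent direction.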

Inference‑Time Steering
During synthesis, at each decoding step t and for each steered layer ℓ, the activation a_t^ℓ is modified as follows:

a_t^ℓ ← (a_t^ℓ – α·v_ℓ) · ‖a_t^ℓ‖₂ / ‖a_t^ℓ – α·v_ℓ‖₂.

Here α is a hyper‑parameter controlling steering strength. The normalization term preserves the original activation norm, which the authors found crucial for maintaining speaker timbre. The steering is applied only to the generated tokens; prompt tokens (reference speech and text) remain untouched. The experiments focus on single‑layer steering (e.g., layers 10 or 15), though the framework permits multi‑layer steering.
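The norm-preserving update can be written directly from the formula above. A sketch under the paper's definitions (`alpha` and the layer choice follow the paper; the function itself is an illustrative stand-in, not Qwen3-TTS internals):

```python
import torch

def steer_activation(a_t, v, alpha=1.0, eps=1e-8):
    """Subtract the accent direction, then rescale to the original norm."""
    steered = a_t - alpha * v
    # Restoring ||a_t||_2 is the normalization the authors found crucial
    # for preserving speaker timbre.
    return steered * a_t.norm() / (steered.norm() + eps)
```

In practice this would be applied inside a forward hook on the chosen backbone layer, and only to generated-token positions, leaving the prompt untouched.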

Experimental Setup
Two model sizes are evaluated: a 0.6 B‑parameter and a 1.7 B‑parameter Qwen3‑TTS. Steering vectors are extracted from 4,000 paired samples (2,000 neutral + 2,000 accented). The authors test two steering strengths (α = 1.0 and 2.0) and evaluate on both the in‑domain L2‑ARCTIC test set and an out‑of‑domain speechocean762 set (250 Mandarin‑L1 speakers, diverse proficiency).

Metrics include:

  • Inference Success Rate (ISR) – proportion of successful syntheses without decoding loops.
  • Accent Match Rate (AMR‑CN, AMR‑US) – percentage classified as Chinese‑accented or US‑accented by an external accent classifier.
  • Speaker Similarity (Spk Sim) – cosine similarity between speaker embeddings of generated and reference speech.
  • UTMOS – predicted mean opinion score for naturalness.
  • Word Error Rate (WER) – transcription error using Whisper‑turbo.
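Of the metrics above, Spk Sim is the simplest to make concrete: cosine similarity between speaker embeddings. The embedding extractor itself is assumed (any pretrained speaker-verification model would do) and is not shown here:

```python
import torch
import torch.nn.functional as F

def speaker_similarity(emb_gen, emb_ref):
    """Cosine similarity between generated- and reference-speech
    speaker embeddings (1-D tensors from an assumed extractor)."""
    return F.cosine_similarity(emb_gen.unsqueeze(0), emb_ref.unsqueeze(0)).item()
```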

Results
Across both model sizes, steering dramatically reduces AMR‑CN (often to near 0 %) and raises AMR‑US (up to ~98 % for the 1.7 B model), confirming effective accent neutralization. Speaker similarity drops modestly (e.g., from 0.84 to 0.76 for the larger model), indicating a trade‑off between accent removal and timbre preservation. UTMOS remains stable or slightly improves, and WER drops substantially, especially on speechocean762 (56 % → 32 %).

Layer‑wise analysis reveals that middle layers (10–15) provide the best balance: they achieve the strongest AMR reduction while preserving ISR and Spk Sim. Steering strength α = 2.0 yields stronger accent suppression but incurs lower ISR and larger timbre degradation, suggesting α = 1.0 as a practical default.

Ablation studies on the number of samples used for vector extraction and on the presence of data augmentation show that augmentation slightly reduces accent‑neutralization efficacy but improves timbre retention, confirming that it strips speaker‑identity cues from v_ℓ so the vector encodes primarily accent information.

Discussion and Limitations
The method is training‑free, requiring only a forward pass on a modest set of paired utterances to compute steering vectors. It operates in a single decoding pass, making it computationally efficient for real‑time applications. However, the approach is sensitive to the choice of layer and steering magnitude, and single‑layer steering may not capture more complex interactions between accent and other prosodic attributes. Multi‑layer or adaptive steering, as well as integration with explicit accent classifiers, are promising avenues for future work.

Conclusion
The authors demonstrate that linear directions in the activation space of a large‑scale zero‑shot TTS model can be harnessed to selectively suppress accent while largely preserving speaker identity. This activation‑steering framework offers a practical, model‑agnostic solution for accent‑free voice cloning and opens the door to broader controllability of speech synthesis attributes without additional model training.

