Complete reconstruction of the tongue contour through acoustic to articulatory inversion using real-time MRI data
Acoustic-to-articulatory inversion is a major speech processing challenge, with a wide range of applications from speech synthesis to feedback systems for language learning and rehabilitation. In recent years, deep learning methods have been applied to the inversion of fewer than a dozen geometrical positions corresponding to sensors glued to easily accessible articulators. Such sparse measurements cannot capture the shape of the whole tongue from root to tip. In this work, we use high-quality real-time MRI data to track the contour of the tongue. The data used to drive the inversion are therefore the raw speech signal and the extracted tongue contours. Several architectures relying on a Bi-LSTM, with or without an autoencoder to reduce the dimensionality of the latent space and with or without phonetic segmentation, have been explored. The results show that the tongue contour can be recovered with a median accuracy of 2.21 mm (or 1.37 pixels) using a context of 1 MFCC frame (static, delta, and double-delta cepstral features).
💡 Research Summary
This paper tackles the long‑standing challenge of acoustic‑to‑articulatory inversion (A‑to‑A) by aiming to reconstruct the full midsagittal contour of the tongue from the acoustic speech signal alone. Traditional approaches that rely on electromagnetic articulography (EMA) or X‑ray microbeam data provide only a handful of sensor positions (typically lips, incisors and a front‑tongue point), leaving the majority of the tongue’s shape unobserved. To overcome this limitation, the authors exploit high‑quality real‑time magnetic resonance imaging (rt‑MRI) recordings, from which they automatically extract a dense tongue contour consisting of 50 (x, y) coordinate pairs per frame.
The dataset comprises recordings of a single French‑speaking female: 2100 sentences (≈3.5 h of speech) captured in 178 acquisition sessions, each lasting 80 s and containing 4000 MRI frames (136 × 136 pixels, 50 fps). Audio is sampled at 16 kHz and manually aligned with phonetic transcriptions, yielding 43 phoneme classes. MFCC features (13 coefficients) together with their first‑ and second‑order deltas are computed using a 25 ms window and a 10 ms hop. To provide temporal context, an 11‑frame window (5 past, current, 5 future) is concatenated, resulting in a 429‑dimensional input vector per time step (125 ms total).
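The 429-dimensional input follows directly from stacking 11 neighboring 39-dimensional MFCC frames. A minimal sketch of this context stacking, assuming numpy and edge-padding at utterance boundaries (the paper's exact padding strategy is not stated):

```python
import numpy as np

def stack_context(mfcc, left=5, right=5):
    """Concatenate each MFCC frame with its neighbours.

    mfcc: (T, 39) array of static + delta + double-delta coefficients.
    Returns a (T, 39 * (left + 1 + right)) array; edge frames are
    handled by repeating the first/last frame (an assumption).
    """
    T, _ = mfcc.shape
    padded = np.pad(mfcc, ((left, right), (0, 0)), mode="edge")
    # window i contributes the frame at offset (i - left) for every time step
    windows = [padded[i:i + T] for i in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

# 100 frames of 13 MFCCs + deltas + double-deltas = 39 coefficients each
features = stack_context(np.random.randn(100, 39))
print(features.shape)  # (100, 429)
```

With the 25 ms analysis window and 10 ms hop, the 11 stacked frames span 125 ms of signal, matching the figure quoted above.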
For the tongue contour, a Mask‑R‑CNN based tracker automatically delineates the tongue in each MRI frame; the resulting coordinates are normalized (zero‑mean, unit‑variance) using statistics from neighboring recordings. Because MRI frames span 20 ms while MFCC frames are spaced 10 ms, linear interpolation is applied to align the two modalities.
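The temporal alignment between the two modalities can be sketched as per-coordinate linear interpolation from the 50 fps MRI grid onto the 100 Hz MFCC grid (a plausible reading of the paper's alignment step, using numpy):

```python
import numpy as np

def align_contours(contours, n_mfcc_frames):
    """Linearly interpolate 50 fps contour tracks to the 100 Hz MFCC rate.

    contours: (N, 100) array, one row per MRI frame (50 x,y pairs,
    flattened), sampled every 20 ms.  Returns an (n_mfcc_frames, 100)
    array sampled every 10 ms, interpolating each coordinate
    independently.
    """
    t_mri = np.arange(contours.shape[0]) * 0.020   # 20 ms per MRI frame
    t_mfcc = np.arange(n_mfcc_frames) * 0.010      # 10 ms per MFCC frame
    return np.stack(
        [np.interp(t_mfcc, t_mri, contours[:, k])
         for k in range(contours.shape[1])],
        axis=1)
```

`np.interp` clamps queries beyond the last MRI frame to the final contour, which is a reasonable default for trailing MFCC frames.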
The core model is a bidirectional LSTM (Bi‑LSTM) network. The architecture begins with a dense layer of 300 units, followed by two stacked Bi‑LSTM layers (each 300 units). After the recurrent layers, two additional dense layers (each 300 units) are applied. Four configurations are examined:
- Single‑Task (ST) – direct regression to 100 outputs (50 × 2 coordinates).
- ST‑AE – the network predicts a 16‑dimensional latent vector; a separate decoder reconstructs the 100‑point contour.
- Multi‑Task (MT) – in addition to the contour regression, a parallel dense layer outputs phoneme probabilities (43‑class softmax).
- MT‑AE – combines the multi‑task output with the auto‑encoder latent space.
When both regression and classification are present, the loss is a weighted sum of mean‑squared error (MSE) for the contour and categorical cross‑entropy for phoneme labels, with equal weighting (α = 1). Training uses Adam (lr = 0.001), batch size 10, up to 300 epochs, and early stopping with patience = 5 on validation loss.
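The multi-task objective described above can be written as a weighted sum of the two losses. A minimal numpy sketch, assuming integer phoneme labels and unnormalized logits (the paper's exact loss reduction over batch and time is not specified):

```python
import numpy as np

def multitask_loss(pred_contour, true_contour,
                   phone_logits, phone_labels, alpha=1.0):
    """Weighted sum of contour MSE and phoneme cross-entropy.

    pred_contour, true_contour: (T, 100) predicted / reference coordinates.
    phone_logits: (T, 43) unnormalised class scores.
    phone_labels: (T,) integer phoneme indices.
    alpha: classification weight; the paper uses alpha = 1.
    """
    mse = np.mean((pred_contour - true_contour) ** 2)
    # numerically stable softmax + categorical cross-entropy (43 classes)
    z = phone_logits - phone_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(phone_labels)), phone_labels])
    return mse + alpha * ce
```

In the single-task (ST) variants only the MSE term is used; the ST-AE variant applies it in the 16-dimensional latent space instead of on the raw 100 coordinates.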
Experiments vary the temporal context (1, 3, 5, 7 frames) for the single‑task model and evaluate all four architectural families. Performance is measured by root‑mean‑square error (RMSE) and median absolute error (both in millimetres) on the test set; multi‑task models also report phoneme accuracy.
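One plausible reading of the reported metrics computes per-point Euclidean distances between predicted and reference contours, then takes the RMSE and median. A sketch in numpy; `mm_per_pixel` is an assumed calibration parameter, not a value taken from the paper:

```python
import numpy as np

def contour_errors(pred, true, mm_per_pixel=1.0):
    """Per-point Euclidean errors between predicted and reference contours.

    pred, true: (T, 50, 2) arrays of (x, y) points in pixels.
    Returns (rmse, median) in millimetres, where mm_per_pixel is the
    scanner's pixel size (hypothetical default of 1.0 here).
    """
    d = np.linalg.norm(pred - true, axis=-1) * mm_per_pixel  # (T, 50)
    return float(np.sqrt(np.mean(d ** 2))), float(np.median(d))
```

Because the error distribution is skewed by occasional tracking and transition failures, the median sits below the RMSE, which is consistent with the 2.21 mm median vs. 2.52 mm RMSE reported below.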
Key results: the ST‑1 model (single‑task, 1‑frame context) achieves the lowest median error of 2.21 mm and an RMSE of 2.52 mm, outperforming all other configurations. The MT‑AE model follows closely with a median of 2.28 mm, RMSE 2.58 mm, and the highest phoneme accuracy (75.54 %). The pure multi‑task model (MT) reaches 64.45 % phoneme accuracy, while models with larger context windows (ST‑3, ST‑5, ST‑7) show modestly higher errors, indicating that a short temporal window better captures rapid tongue movements. Incorporating the auto‑encoder alone does not markedly change contour error, but when combined with phoneme prediction it yields a modest gain in both regression and classification metrics.
Error analysis reveals that the network struggles with rapid articulatory transitions (e.g., dental stops) and with long pauses within sentences, often associated with breathing. In such cases, RMSE can exceed 4 mm, reflecting the limited ability of static MFCC features to encode fine‑grained, sub‑frame articulatory dynamics. Additionally, the automatic contour tracker occasionally misplaces points near the tongue tip, imposing an upper bound on achievable performance.
Compared with prior work that reconstructed low‑resolution MRI images (68 × 68) or relied on a handful of EMA sensors, this study uniquely demonstrates the feasibility of reconstructing the entire tongue contour from acoustic data with a median error below 2.3 mm. The use of high‑resolution MRI and a robust Mask‑R‑CNN tracker provides a reliable ground‑truth reference, while the Bi‑LSTM architecture efficiently models temporal dependencies.
Limitations include the fact that recordings were made inside an MRI scanner, where subjects lie supine and are exposed to high acoustic noise, potentially inducing a Lombard effect and altering natural speech production. The dataset also contains only a single speaker, raising questions about speaker‑independence and cross‑language generalization. The authors acknowledge these constraints and propose future work on improving contour tracking, jointly learning from raw images and contours via a combined loss, and extending the approach to other vocal‑tract structures and to natural, out‑of‑scanner speech.
In summary, the paper presents a compelling proof‑of‑concept that deep learning can bridge the gap between acoustic signals and full‑tongue geometry, opening avenues for more realistic articulatory synthesis, speech therapy tools, and phonetic research that require detailed knowledge of tongue shape without invasive instrumentation.