Enhancement of Throat Microphone Recordings Using Gaussian Mixture Model Probabilistic Estimator

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The throat microphone is a body-attached transducer worn against the neck. It captures signals transmitted through the vocal folds, along with the buzz tone of the larynx. Because it is in contact with the skin, it is more robust to environmental noise than an acoustic microphone, which picks up vibrations through air pressure and therefore also picks up all ambient interference. Throat speech is partly intelligible but sounds unnatural and croaky. This thesis attempts to recover the missing frequency bands of throat speech and investigates the envelope and excitation mapping problem through joint analysis of throat- and acoustic-microphone recordings. A new phone-dependent GMM-based spectral envelope mapping scheme is proposed, which performs minimum mean square error (MMSE) estimation of the acoustic-microphone spectral envelope. Within the source-filter decomposition framework, we observed that the spectral envelope difference between the excitation signals of throat- and acoustic-microphone recordings is an important source of degradation in throat-microphone voice quality. We therefore also model this difference as a spectral tilt vector and propose a new phone-dependent GMM-based spectral tilt mapping scheme to enhance the throat excitation signal. The proposed mapping schemes are evaluated both objectively and subjectively. Objective evaluations use the log-spectral distortion (LSD) and wide-band perceptual evaluation of speech quality (PESQ) metrics. Subjective evaluations use an A/B pair comparison listening test. Both objective and subjective evaluations show that the proposed phone-dependent mapping consistently improves performance over state-of-the-art GMM estimators.


💡 Research Summary

This thesis, titled “Enhancement of Throat Microphone Recordings Using Gaussian Mixture Model Probabilistic Estimator,” presents a novel method to improve the perceived quality and intelligibility of speech captured by a throat microphone (TM). While TMs offer high robustness to environmental noise due to direct skin contact, they suffer from degraded speech quality, characterized by a muffled, unnatural, and croaky sound, primarily due to missing frequency bands and the lack of oral cavity radiation effects.

The core problem is addressed through a joint analysis of simultaneously recorded acoustic microphone (AM) and TM signals. The research is grounded in the source-filter model of speech production. The key innovation lies in proposing two separate phone-dependent Gaussian Mixture Model (GMM)-based mapping schemes: one for the spectral envelope and another for the excitation signal.

First, the system decomposes the AM and TM signals into their spectral envelope (filter) and excitation (source) components using Linear Predictive Coding (LPC). The author observes that the degradation in TM quality stems not only from a distorted spectral envelope but also from a significant difference in the spectral characteristics of the excitation signals, modeled as a “spectral tilt vector.”
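
As an illustration of the source-filter split described above, the following is a minimal sketch of LPC analysis using the autocorrelation method with the Levinson-Durbin recursion. It is not taken from the thesis: the function names `lpc_coefficients` and `lpc_residual`, the frame handling, and the choice of prediction order are assumptions made for this example.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.

    Returns (a, err) where a = [1, a_1, ..., a_order] is the inverse
    filter A(z) and err is the remaining prediction-error energy.
    Hypothetical helper, not the thesis implementation.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:  # silent or degenerate frame: stop early
            break
        # Reflection coefficient for stage i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_residual(frame, a):
    """Excitation (source) estimate: the frame passed through A(z)."""
    frame = np.asarray(frame, dtype=float)
    return np.convolve(frame, a)[: len(frame)]
```

Here the coefficient vector `a` describes the spectral envelope (the filter 1/A(z)), and the residual returned by `lpc_residual` plays the role of the excitation signal. In practice the analysis would be run on short windowed frames of both the AM and TM recordings.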

To tackle this, the thesis introduces:

  1. A Phone-Dependent GMM-based Spectral Envelope Mapping: This scheme performs Minimum Mean Square Error (MMSE) estimation of the clean AM spectral envelope from the corresponding TM envelope. “Phone-dependent” refers to training separate GMM mapping functions for eight broad phonetic articulation classes (e.g., bilabial, alveolar), based on the observation that TM distortion patterns vary across phonemes.
  2. A Phone-Dependent GMM-based Spectral Tilt Mapping: This novel scheme specifically enhances the TM excitation signal by mapping its spectral tilt closer to that of the AM excitation, using a similarly structured, phone-dependent GMM estimator.
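
The MMSE estimation used by both schemes can be sketched with the standard joint-GMM conditional mean: for a GMM fitted on stacked vectors z = [x; y] (x the TM feature, y the AM feature), the estimate is the posterior-weighted sum of each component's linear regression of y on x. The code below is a sketch under that assumption. The function names, the parameter layout, and in particular the way `phone_dependent_soft_map` weights per-class estimates by class posteriors are illustrative guesses; the summary does not specify how the thesis combines classes.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate normal at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(mean) * np.log(2.0 * np.pi) + logdet + maha)

def gmm_mmse_map(x, weights, means, covs, dx):
    """MMSE estimate E[y | x] under a joint GMM on z = [x; y].

    weights: (K,), means: (K, dx + dy), covs: (K, dx + dy, dx + dy).
    dx is the dimension of the source (TM) feature x.
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    K = len(weights)
    log_post = np.empty(K)
    cond_means = []
    for k in range(K):
        mu_x, mu_y = means[k][:dx], means[k][dx:]
        S_xx = covs[k][:dx, :dx]
        S_yx = covs[k][dx:, :dx]
        # Unnormalized log posterior of component k given x
        log_post[k] = np.log(weights[k]) + gaussian_logpdf(x, mu_x, S_xx)
        # Component-wise linear regression of y on x
        cond_means.append(mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ np.array(cond_means)

def phone_dependent_soft_map(x, class_posteriors, class_gmms, dx):
    """Assumed soft combination: weight each phone class's MMSE estimate
    by that class's posterior. class_gmms is a list of
    (weights, means, covs) tuples, one per phone class (hypothetical layout).
    """
    estimates = np.array([gmm_mmse_map(x, w, m, c, dx) for w, m, c in class_gmms])
    p = np.asarray(class_posteriors, dtype=float)
    return (p / p.sum()) @ estimates
```

Passing a one-hot vector as `class_posteriors` reduces the soft combination to a hard mapping that uses only the single selected phone class.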

The proposed enhancement framework operates in two stages: initially enhancing the spectral envelope, then using the enhanced filter to derive a residual excitation signal, which is subsequently enhanced via the spectral tilt mapping.

Experimental evaluations are comprehensive, involving both objective and subjective measures. Objective tests use Log-Spectral Distortion (LSD) and wideband Perceptual Evaluation of Speech Quality (PESQ). Subjective assessment is conducted via an A/B paired comparison listening test. The proposed phone-dependent soft mapping (PD-SM) method is compared against baseline GMM mapping and other phone-dependent hard mapping variants.
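
For reference, log-spectral distortion is commonly defined as the root-mean-square difference, in dB, between two power spectra over frequency, averaged over frames. The sketch below follows that common definition; the exact variant used in the thesis (frequency range, envelope versus full spectrum) is not stated in this summary, and the function name and `eps` guard are assumptions.

```python
import numpy as np

def log_spectral_distortion(ref_power, est_power, eps=1e-12):
    """Mean log-spectral distortion in dB.

    ref_power, est_power: power spectra shaped (frames, bins), or a
    single frame shaped (bins,). eps guards against log of zero.
    """
    ref = np.atleast_2d(np.asarray(ref_power, dtype=float))
    est = np.atleast_2d(np.asarray(est_power, dtype=float))
    # Per-bin difference of the two spectra on a dB scale
    diff_db = 10.0 * np.log10((ref + eps) / (est + eps))
    # RMS over frequency for each frame, then average over frames
    per_frame = np.sqrt(np.mean(diff_db ** 2, axis=-1))
    return float(np.mean(per_frame))
```

A lower value means the enhanced spectrum is closer to the acoustic-microphone reference; identical spectra give 0 dB.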

Results consistently demonstrate that the proposed phone-dependent mapping strategies, particularly PD-SM, yield superior performance across all metrics. They achieve lower LSD, higher PESQ scores, and are significantly preferred in listening tests. This validates the effectiveness of incorporating phonetic context into the probabilistic mapping process for more accurate and perceptually satisfying enhancement of throat microphone speech. The work concludes by suggesting potential extensions towards speaker-independent linear filtering systems.

