Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening


Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed because overnight polysomnography is complex and costly. Acoustic-based screening provides a scalable alternative, but its performance is limited by environmental noise and the lack of physiological context. Respiratory effort is a key signal used in clinical scoring of OSA events, but current approaches require additional contact sensors that reduce scalability and patient comfort. This paper presents the first study to estimate respiratory effort directly from nocturnal audio, enabling physiological context to be recovered from sound alone. We propose a latent-space fusion framework that integrates the estimated effort embeddings with acoustic features for OSA detection. Using a dataset of 157 nights from 103 participants recorded in home environments, our respiratory effort estimator achieves a concordance correlation coefficient of 0.48, capturing meaningful respiratory dynamics. Fusing effort and audio improves sensitivity and AUC over audio-only baselines, especially at low apnoea-hypopnoea index thresholds. The proposed approach requires only smartphone audio at test time, enabling sensor-free, scalable, and longitudinal OSA monitoring.


💡 Research Summary

The paper addresses the pressing need for scalable, low‑cost screening of obstructive sleep apnoea (OSA), a condition that affects roughly 16% of adults and is linked to serious comorbidities. Conventional diagnosis relies on overnight polysomnography (PSG), which requires a large sensor suite, is expensive, and can disturb sleep. Acoustic‑based screening using smartphones has emerged as a promising alternative, but its performance degrades in real‑world home environments due to ambient noise and the lack of physiological context. In clinical PSG, respiratory effort (the movement of the thorax and abdomen) is a key signal for scoring apnoea and hypopnoea events, yet capturing it traditionally demands contact sensors that undermine the scalability of home monitoring.

The authors propose the first method to infer respiratory effort directly from nocturnal audio recordings, thereby restoring physiological context without any extra hardware. Their approach consists of two stages. In the first stage, a convolutional neural network (CNN) extracts high‑resolution temporal features from 30‑second log‑Mel spectrograms (64 bins, 1500 frames). A bidirectional LSTM processes these features, and a linear decoder projects the LSTM hidden states to a low‑resolution effort waveform (187 points). This waveform is linearly interpolated to match the 960‑point ground‑truth effort signal (32 Hz sampling). Training minimises 1 − CCC (one minus the Concordance Correlation Coefficient), a loss that penalises both decorrelation and prediction bias, rather than plain mean‑squared error. Across ten‑fold subject‑wise cross‑validation, the estimator achieves an average CCC of 0.48 ± 0.13, RMSE 1.05 ± 0.12, and MAE 0.79 ± 0.09. Visual inspection shows that the model captures overall trends, pauses, and some fine‑grained dynamics, though occasional phase shifts and flat predictions occur when acoustic cues are weak.
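The upsampling and loss described above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code: `ccc`, `ccc_loss`, and the synthetic signals are our own names, and the paper presumably implements the loss in a differentiable framework for training.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D signals."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

def ccc_loss(y_true, y_pred):
    """Training loss: 1 - CCC, penalising both decorrelation and bias."""
    return 1.0 - ccc(y_true, y_pred)

# Upsample a mock 187-point decoder output to the 960-point (32 Hz)
# target before the loss, mirroring the linear interpolation step.
coarse = np.cos(np.linspace(0, 4 * np.pi, 187))
fine = np.interp(np.linspace(0, 186, 960), np.arange(187), coarse)
print(fine.shape)  # (960,)

# A perfect prediction gives CCC = 1 (loss 0); a biased copy is
# penalised even though it is perfectly correlated.
t = np.sin(np.linspace(0, 8 * np.pi, 960))  # mock 30 s effort at 32 Hz
print(round(ccc(t, t), 3))   # 1.0
print(ccc(t, t + 0.5) < 1.0)  # True: mean offset lowers CCC
```

Unlike MSE, the 1 − CCC loss is scale-aware: a prediction with the right shape but a constant offset is still penalised, which matters when the effort signal's baseline carries clinical meaning.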

In the second stage, the hidden states of the frozen LSTM are averaged over time to form a fixed‑dimensional “respiratory embedding.” A separate CNN encoder processes the same audio segment to produce a conventional acoustic embedding. The two embeddings are concatenated and passed through a fusion layer followed by a multilayer perceptron (MLP) that outputs the probability of an apnoea/hypopnoea event for each 30‑second window (10‑second stride). A weighted binary cross‑entropy loss addresses the severe class imbalance between event and non‑event frames. Predicted events are merged into episodes, and a night‑level Apnoea‑Hypopnoea Index (AHI) is computed. The authors evaluate OSA severity classification at the clinically relevant AHI cut‑offs of 5, 15, and 30 events per hour, reporting sensitivity, specificity, and area under the ROC curve (AUC).
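The window-to-AHI step can be sketched as follows. The paper does not detail its episode-merging rule, so the overlap-fusing heuristic below, and the names `merge_windows` and `ahi`, are our own assumptions; the 30 s window and 10 s stride are from the paper.

```python
import numpy as np

def merge_windows(starts, win_len=30.0):
    """Merge overlapping positive 30 s windows into event episodes.

    `starts` are start times (s) of windows flagged positive; with a
    10 s stride, consecutive positives overlap and fuse into one episode.
    """
    episodes = []
    for s in sorted(starts):
        if episodes and s <= episodes[-1][1]:
            episodes[-1][1] = max(episodes[-1][1], s + win_len)
        else:
            episodes.append([s, s + win_len])
    return episodes

def ahi(n_events, hours_slept):
    """Apnoea-Hypopnoea Index: events per hour of sleep."""
    return n_events / hours_slept

# Hypothetical night: positives at 0, 10, 20 s fuse into one episode;
# a lone positive at 300 s forms a second one.
eps = merge_windows([0, 10, 20, 300])
print(len(eps))            # 2
print(ahi(len(eps), 8.0))  # 0.25 events/hour
```

The AHI then maps to the severity cut-offs the paper evaluates: < 5 healthy, 5–15 mild, 15–30 moderate, > 30 severe.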

Baseline comparisons include an audio‑only model with the same backbone architecture and an “oracle” model that directly uses the measured respiratory effort as an additional input (the upper performance bound). The latent‑space fusion (LSF) model consistently outperforms the audio‑only baseline, especially at the lower AHI thresholds (5 and 15), where sensitivity and AUC improvements are most pronounced. This demonstrates that the inferred effort embeddings provide complementary, noise‑robust information that helps the classifier detect subtle breathing disruptions that pure acoustic features may miss.

The dataset comprises 157 nights from 103 participants recorded in real home settings using a SOMNOtouch RESP device (providing thoracic and abdominal effort signals) and a smartphone placed near the bed. Audio was sampled at 16 kHz, effort at 32 Hz, and synchronization was achieved via the device’s snore channel (500 Hz) using cross‑correlation. The participant pool spans all OSA severity levels (10% healthy, 38% mild, 29% moderate, 23% severe), supporting broad applicability of the results.
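Cross-correlation alignment of two recordings can be sketched as below. This is a generic illustration under our own assumptions (the function name `align_lag` and the synthetic signals are invented; in the paper the smartphone audio would first be downsampled to a rate comparable to the 500 Hz snore channel before correlating).

```python
import numpy as np

def align_lag(ref, sig, fs):
    """Estimate the lag (s) of `sig` relative to `ref` via cross-correlation.

    Both signals are assumed to share the sampling rate `fs`. A positive
    return value means `sig` starts later than `ref`.
    """
    ref = (ref - ref.mean()) / (ref.std() + 1e-12)
    sig = (sig - sig.mean()) / (sig.std() + 1e-12)
    xcorr = np.correlate(sig, ref, mode="full")
    lag = np.argmax(xcorr) - (len(ref) - 1)
    return lag / fs

# Synthetic check: a copy of the reference delayed by 0.2 s is recovered.
fs = 100
t = np.arange(0, 5, 1 / fs)
ref = np.sin(2 * np.pi * 3 * t) * np.exp(-t)   # decaying tone as a stand-in
sig = np.concatenate([np.zeros(int(0.2 * fs)), ref])
print(align_lag(ref, sig, fs))  # 0.2
```

The estimated lag is then used to shift one stream so that audio windows and effort windows describe the same 30 seconds of the night.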

Key contributions of the work are: (1) Quantitative evidence that nocturnal breathing sounds contain enough information to estimate respiratory effort, achieving a moderate CCC despite home‑environment noise; (2) Introduction of a latent‑space fusion strategy that regularises acoustic representations with effort‑derived embeddings; (3) Demonstration that this fusion yields statistically significant gains in OSA screening performance while preserving the sensor‑free nature of the system.

Limitations include the modest CCC (well below the 0.6 reported in controlled speech datasets), occasional temporal misalignment between predicted and true effort, and loss of temporal resolution due to pooling in the CNN. Future directions suggested by the authors involve higher‑resolution microphone arrays, multimodal fusion with heart‑rate or motion sensors, and more sophisticated alignment techniques to improve effort estimation fidelity.

In summary, the study provides a compelling proof‑of‑concept that respiratory effort can be approximated from smartphone audio alone and that embedding this estimate into a deep learning pipeline enhances OSA detection. This opens the door to inexpensive, longitudinal, at‑home monitoring of sleep‑disordered breathing, potentially reducing the diagnostic gap for millions of undiagnosed patients.

