Misophonia Trigger Sound Detection on Synthetic Soundscapes Using a Hybrid Model with a Frozen Pre-Trained CNN and a Time-Series Module
Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction by a frozen pre-trained CNN backbone with a trainable time-series module, such as gated recurrent units (GRUs), long short-term memory (LSTM) networks, or echo state networks (ESNs), each also evaluated in a bidirectional variant. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters by optimizing only the readout. We further simulate user personalization via a few-shot “eating sound” detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
💡 Research Summary
Misophonia is a condition in which specific everyday sounds trigger intense negative emotional reactions, severely affecting quality of life. Existing coping strategies are largely passive (e.g., earplugs, avoidance) and do not provide real‑time assistance. Detecting trigger sounds with precise onset and offset timestamps is a prerequisite for any active mitigation system, yet publicly available strongly labeled misophonia data are virtually nonexistent.
To address this gap, the authors synthesize a large‑scale dataset of 10‑second soundscapes using the Scaper library, following the protocol of DCASE Task 4. Seven trigger categories are selected based on clinical reports and data availability: eating, chewing, coughing, breathing, throat‑clearing, typing, and clock‑ticking. Source clips are drawn from public corpora (FOAMS, MATA, FSD50K, ESC‑50, VocalSound, etc.). An automated filter (YAMNet) pre‑selects candidate clips, after which human auditors verify the presence of the target sound. Pitch shifting (±3 semitones) and background noise augmentation increase variability. The final corpus contains 10,000 synthetic clips (6,000 train, 2,000 validation, 2,000 test) with exact event timestamps.
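The bookkeeping behind this synthesis protocol, placing labeled foreground events into a 10‑second background and recording exact onsets and offsets, can be illustrated with a minimal stdlib sketch. This does not call Scaper itself; `synthesize_annotation` and its sampling ranges (event length, number of events) are illustrative assumptions, while the seven class names, the 10‑second duration, and the ±3‑semitone pitch shift follow the summary:

```python
import random

TRIGGER_CLASSES = ["eating", "chewing", "coughing", "breathing",
                   "throat_clearing", "typing", "clock_ticking"]
CLIP_DURATION = 10.0  # seconds, as in the DCASE Task 4-style soundscapes

def synthesize_annotation(max_events=3, seed=None):
    """Sample strong labels (label, onset, offset) for one synthetic clip.

    Mimics the annotation Scaper emits when it places foreground events
    into a background; the actual audio rendering is omitted here.
    """
    rng = random.Random(seed)
    events = []
    for _ in range(rng.randint(1, max_events)):
        label = rng.choice(TRIGGER_CLASSES)
        duration = rng.uniform(0.5, 4.0)                    # event length (s)
        onset = rng.uniform(0.0, CLIP_DURATION - duration)  # fits in the clip
        pitch_shift = rng.uniform(-3.0, 3.0)                # +/- 3 semitones
        events.append({"label": label,
                       "onset": round(onset, 3),
                       "offset": round(onset + duration, 3),
                       "pitch_shift": round(pitch_shift, 2)})
    return events

ann = synthesize_annotation(seed=0)
```

In the real pipeline, Scaper would additionally render the mixed audio at a chosen SNR and serialize these annotations alongside it, yielding the strong labels the SED models train on.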
The proposed detection architecture is a hybrid of a frozen, frame‑wise MobileNetV3 backbone and a trainable temporal module. The CNN processes each 40 ms frame and outputs a 128‑dimensional embedding sequence (250 frames per 10‑second clip). Freezing the backbone eliminates the need for back‑propagation through the convolutional layers, drastically reducing computational load and memory footprint—critical for on‑device deployment.
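The frame arithmetic above (40 ms frames, 250 embedding frames of dimension 128 per 10‑second clip) can be sanity‑checked with a small stdlib sketch. The 16 kHz sample rate and non‑overlapping framing are assumptions consistent with those numbers, and `fake_backbone` is a placeholder for the frozen MobileNetV3, whose weights are of course not reproduced here:

```python
SAMPLE_RATE = 16_000    # Hz (assumed; typical for lightweight audio CNNs)
CLIP_SECONDS = 10.0
FRAME_SECONDS = 0.040   # 40 ms per frame, one embedding each
EMBED_DIM = 128

def frame_boundaries(num_samples, sr=SAMPLE_RATE, hop_s=FRAME_SECONDS):
    """Return (start, end) sample indices for each non-overlapping frame."""
    hop = int(round(sr * hop_s))
    return [(i, i + hop) for i in range(0, num_samples - hop + 1, hop)]

def fake_backbone(frames):
    """Stand-in for the frozen CNN: one EMBED_DIM embedding per frame."""
    return [[0.0] * EMBED_DIM for _ in frames]

frames = frame_boundaries(int(SAMPLE_RATE * CLIP_SECONDS))
embeddings = fake_backbone(frames)   # 250 frames x 128 dims per clip
```

Because the backbone is frozen, this embedding step can even be precomputed offline once per clip, so training touches only the lightweight temporal module that follows.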
Four temporal modules are evaluated: (i) a linear baseline (no memory), (ii) gated recurrent unit (GRU), (iii) long short‑term memory (LSTM), and (iv) echo state network (ESN). Each recurrent module is instantiated in both unidirectional and bidirectional forms; the bidirectional version concatenates forward and backward hidden states, thereby exploiting future context. The GRU uses two stacked layers with 256 hidden units per direction; the LSTM mirrors this configuration. The ESN follows a leaky‑integrator reservoir design: recurrent weights are randomly initialized and kept fixed; only the readout matrix is trained. Hyper‑parameters of the ESN (leaking rate, reservoir size, spectral radius) are optimized with Optuna’s Tree‑structured Parzen Estimator. All temporal modules share a single linear + sigmoid readout that produces per‑frame multi‑label posterior probabilities for the seven classes.
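A minimal NumPy sketch of the leaky‑integrator bidirectional ESN described above follows. The reservoir size, spectral radius, leak rate, and the ridge‑regression readout fit are illustrative stand‑ins for the Optuna‑tuned values and actual training procedure; only the structure (fixed random reservoir, trained readout, concatenated forward/backward states) reflects the summary:

```python
import numpy as np

def make_reservoir(in_dim, res_dim, spectral_radius=0.9, seed=0):
    """Fixed random input/recurrent weights, rescaled so the recurrent
    matrix has the target spectral radius (echo state property heuristic)."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (res_dim, in_dim))
    W = rng.uniform(-0.5, 0.5, (res_dim, res_dim))
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    return W_in, W

def run_reservoir(X, W_in, W, leak=0.3):
    """Leaky-integrator update: h_t = (1-a) h_{t-1} + a tanh(W_in x_t + W h_{t-1})."""
    H = np.zeros((len(X), W.shape[0]))
    h = np.zeros(W.shape[0])
    for t, x in enumerate(X):
        h = (1 - leak) * h + leak * np.tanh(W_in @ x + W @ h)
        H[t] = h
    return H

def bidirectional_states(X, W_in, W, leak=0.3):
    """Run forward and on the reversed sequence, then concatenate per frame."""
    fwd = run_reservoir(X, W_in, W, leak)
    bwd = run_reservoir(X[::-1], W_in, W, leak)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

def fit_readout(H, Y, ridge=1e-2):
    """The only trainable part of the (Bi)ESN: a ridge-regression readout."""
    A = H.T @ H + ridge * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ Y)

# Toy usage: 250 frames of 128-dim embeddings, 7 trigger classes
X = np.random.default_rng(1).normal(size=(250, 128))
Y = np.zeros((250, 7)); Y[40:90, 2] = 1.0   # one synthetic event interval
W_in, W = make_reservoir(128, 200)
H = bidirectional_states(X, W_in, W)        # (250, 400) states per clip
W_out = fit_readout(H, Y)                   # (400, 7) readout weights
```

Note how the parameter count matches the paper's argument: everything except `W_out` is frozen at initialization, so training reduces to a single linear solve (or, with the shared linear + sigmoid head, a small logistic fit).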
Performance is measured with the Polyphonic Sound Detection Score 1 (PSDS1) and frame‑level F1. In the multi‑class detection task, bidirectional GRU (BiGRU) achieves the highest PSDS1 (~0.78) and F1 (~0.81), confirming the benefit of learned gating and bidirectional context. Bidirectional LSTM yields comparable scores but with roughly double the trainable parameters. Bidirectional ESN (BiESN) attains a PSDS1 of ~0.74 and F1 of ~0.77 while using only about 0.1 % of the total parameters (≈5 k trainable weights), demonstrating that a fixed random reservoir can capture sufficient temporal structure for this problem.
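Of the two metrics, PSDS1 requires the dedicated evaluation toolkit and sweeps over operating points, so only the simpler frame‑level F1 is sketched below. The micro‑averaged formulation and the set‑based frame representation are assumptions (the summary does not state the averaging scheme):

```python
def frame_f1(pred, ref):
    """Micro-averaged frame-level F1 for multi-label SED output.

    pred, ref: equal-length lists; each element is the set of class
    indices active in that frame (after thresholding the posteriors).
    """
    tp = fp = fn = 0
    for p, r in zip(pred, ref):
        tp += len(p & r)   # classes correctly active this frame
        fp += len(p - r)   # predicted active but absent in reference
        fn += len(r - p)   # present in reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: an "eating" event (class 0) active for 4 frames, detected for 3
ref = [{0}] * 4 + [set()] * 4
pred = [{0}] * 3 + [set()] * 5
score = frame_f1(pred, ref)   # precision 1.0, recall 0.75 -> F1 = 6/7
```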
To explore personalization, a few‑shot adaptation experiment focuses on the “eating” class. The model receives at most five support clips from a new user and is fine‑tuned without meta‑learning. Under this severe data‑scarcity regime, BiGRU exhibits high variance and occasional over‑fitting, whereas BiESN remains stable, maintaining PSDS1 around 0.70 even with a single support example. This robustness stems from the ESN’s minimal trainable component, which reduces the risk of over‑fitting to tiny support sets.
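The readout‑only side of this adaptation can be sketched as below. The summary does not spell out the fine‑tuning recipe, so `adapt_readout` and its ridge penalty are a hypothetical illustration of why a BiESN has so little to over‑fit: the backbone and reservoir stay frozen, and only the small readout matrix is re‑solved on the support clips:

```python
import numpy as np

def adapt_readout(support_states, support_labels, ridge=1e-1):
    """Few-shot personalization sketch: re-solve only the linear readout
    on the user's support clips, everything upstream frozen.

    support_states: list of (T_i, D) reservoir-state arrays, one per clip
    support_labels: list of (T_i, C) binary frame-label arrays
    """
    H = np.concatenate(support_states, axis=0)
    Y = np.concatenate(support_labels, axis=0)
    # The ridge penalty is the regularizer that keeps a tiny support set
    # (at most five clips) from being over-fitted.
    A = H.T @ H + ridge * np.eye(H.shape[1])
    return np.linalg.solve(A, H.T @ Y)

# Toy adaptation: 3 support clips, 250 frames each, 400-dim BiESN states,
# and a single binary "eating" target per frame
rng = np.random.default_rng(0)
states = [rng.normal(size=(250, 400)) for _ in range(3)]
labels = [np.zeros((250, 1)) for _ in range(3)]
for y in labels:
    y[100:150, 0] = 1.0       # the user's eating-sound interval
W_out = adapt_readout(states, labels)
```

By contrast, fine‑tuning a BiGRU in the same regime updates millions of recurrent weights from a handful of frames, which is consistent with the high variance the authors report.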
The study’s contributions are threefold: (1) creation of a publicly usable, strongly labeled synthetic misophonia dataset; (2) demonstration that freezing a pre‑trained CNN backbone while training only a lightweight temporal module yields competitive detection performance; (3) evidence that reservoir‑computing‑based ESNs provide an excellent trade‑off between accuracy, parameter efficiency, and adaptability, making them strong candidates for on‑device, user‑personalized misophonia assistive technologies. Future work should validate the models on real‑world recordings, integrate real‑time sound attenuation or masking strategies, and explore continual learning mechanisms to update personalized models as user trigger profiles evolve.