Efficient Solutions for Mitigating Initialization Bias in Unsupervised Self-Adaptive Auditory Attention Decoding
Decoding the attended speaker in a multi-speaker environment from electroencephalography (EEG) has attracted growing interest in recent years, with neuro-steered hearing devices as a driver application. Current approaches typically rely on ground-truth labels of the attended speaker during training, necessitating calibration sessions for each user and each EEG set-up to achieve optimal performance. While unsupervised self-adaptive auditory attention decoding (AAD) for stimulus reconstruction has been developed to eliminate the need for labeled data, it suffers from an initialization bias that can compromise performance. Although an unbiased variant has been proposed to address this limitation, it introduces substantial computational complexity that scales with data size. This paper presents three computationally efficient alternatives that achieve comparable performance, but with a significantly lower and constant computational cost. The code for the proposed algorithms is available at https://github.com/YYao-42/Unsupervised_AAD.
💡 Research Summary
Auditory attention decoding (AAD) from electroencephalography (EEG) is a promising technology for neuro‑steered hearing aids, but current supervised approaches require subject‑specific calibration sessions that provide ground‑truth labels of the attended speaker. Unsupervised self‑adaptive AAD eliminates the need for labeled data by initializing the decoder with random attention assignments and then iteratively updating the model based on its own predictions. However, this “bootstrap” scheme suffers from an initialization bias: if the initial random labels are wrong, the model tends to reinforce those errors across iterations. Heintz et al. mitigated this bias by performing a leave‑one‑out cross‑validation inside each iteration, training K models (one per segment) and predicting the held‑out segment. While effective, the method’s computational cost grows linearly with the number of segments, making it impractical for real‑time or low‑power applications.
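The inner cross‑validation loop described above can be sketched in a few lines; `train_fn` and `predict_fn` below are hypothetical stand‑ins for the actual model fitting and relabeling steps, so this only illustrates why the cost grows with the number of segments K:

```python
def cv_update(segments, train_fn, predict_fn):
    """One iteration of the cross-validated self-adaptive update, as we
    read it: hold out each segment, train on the remaining ones, and
    relabel the held-out segment with the resulting model. Requires K
    trainings per iteration, hence cost linear in the number of segments."""
    labels = []
    for k in range(len(segments)):
        held_in = segments[:k] + segments[k + 1:]  # leave segment k out
        model = train_fn(held_in)                  # one training per segment
        labels.append(predict_fn(model, segments[k]))
    return labels
```

With dummy stand‑ins (e.g. `train_fn = sum` over numeric "segments"), one can verify that each prediction is made by a model that never saw its own segment, which is exactly what breaks the self‑reinforcement of wrong initial labels.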
The present paper tackles the same problem within a canonical correlation analysis (CCA) framework, which jointly learns a linear decoder for EEG (wₓ) and an encoder for speech features (wₐ) by maximizing their cross‑modal correlation. CCA admits a closed‑form solution via a generalized eigenvalue decomposition (GEVD), enabling fast training. Building on this, the authors propose three computationally efficient alternatives that avoid the K‑fold inner loop yet retain the bias‑reduction benefits.
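As a rough illustration of this closed‑form solution, CCA between the two modalities can be written as a symmetric GEVD on block covariance matrices. The helper below is our own sketch (function name, regularization, and centering are assumptions, not the paper's code):

```python
import numpy as np
from scipy.linalg import eigh

def cca_gevd(X, A, Q=2, reg=1e-6):
    """Closed-form CCA between EEG features X (T x Dx) and speech
    features A (T x Da) via a generalized eigenvalue decomposition.
    Returns the top-Q decoder Wx, encoder Wa, and canonical correlations."""
    X = X - X.mean(axis=0)
    A = A - A.mean(axis=0)
    T, Dx = X.shape
    Da = A.shape[1]
    Rxx = X.T @ X / T + reg * np.eye(Dx)   # regularized auto-covariances
    Raa = A.T @ A / T + reg * np.eye(Da)
    Rxa = X.T @ A / T                      # cross-covariance
    # Symmetric GEVD formulation of CCA:
    #   [0 Rxa; Rxa^T 0] v = lambda [Rxx 0; 0 Raa] v
    L = np.zeros((Dx + Da, Dx + Da))
    L[:Dx, Dx:] = Rxa
    L[Dx:, :Dx] = Rxa.T
    R = np.zeros_like(L)
    R[:Dx, :Dx] = Rxx
    R[Dx:, Dx:] = Raa
    vals, vecs = eigh(L, R)                # ascending generalized eigenvalues
    order = np.argsort(vals)[::-1][:Q]     # largest eigenvalues = correlations
    return vecs[:Dx, order], vecs[Dx:, order], vals[order]
```

Because the solution is a single eigendecomposition rather than an iterative optimization, each self‑adaptive update remains cheap, which is what the three proposed variants exploit.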
- Two‑Encoder Version – In addition to the attended‑speech encoder wₐ, a second encoder wᵤ is learned for the unattended speaker, while both share the same EEG decoder wₓ. The objective maximizes the sum of the EEG‑attended and EEG‑unattended correlations. This structure makes the decoder less attracted to a single speaker’s response, thereby reducing the chance of getting stuck in a wrong labeling regime. The GEVD solution is extended to a larger block matrix that incorporates the cross‑covariances between all three signal streams (EEG, attended speech, unattended speech).
- Soft‑Label Version – Instead of hard binary assignments of each segment to attended/unattended, the algorithm computes probabilistic weights p₁ₖ and p₂ₖ from the current correlation scores ρ̃₁ₖ and ρ̃₂ₖ. These probabilities are derived by modeling the distributions of correlation scores for the attended and unattended cases as two Gaussians (with parameters μₐ, σ²ₐ and μᵤ, σ²ᵤ) and applying Bayes’ rule. The probability‑weighted averages of the speech features then form the cross‑covariance matrices for CCA. This soft labeling makes training robust to uncertain segments and mitigates the impact of early mis‑classifications.
- Sum‑Initialized Single‑Encoder Version – The first iteration is initialized not with random hard labels but with a composite speech signal obtained by summing the two speakers’ envelopes. Effectively this sets p₁ₖ = p₂ₖ = 0.5 for all segments, allowing the decoder to learn a representation of the neural activity that is common to both speakers. Subsequent iterations revert to the standard hard‑label update. This simple heuristic removes the initial bias source altogether.
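The Bayes‑rule weighting behind the soft‑label variant can be sketched as follows; the equal priors on the two labelings and the way the Gaussian parameters enter are illustrative choices on our part:

```python
import numpy as np
from scipy.stats import norm

def soft_labels(rho1, rho2, mu_a, var_a, mu_u, var_u):
    """Posterior probability that speaker 1 is attended in a segment,
    given correlation scores rho1, rho2 for the two speakers and Gaussian
    models of attended (mu_a, var_a) and unattended (mu_u, var_u) scores.
    Bayes' rule with equal priors on the two possible labelings."""
    # Likelihood of the observed score pair if speaker 1 is attended
    l1 = norm.pdf(rho1, mu_a, np.sqrt(var_a)) * norm.pdf(rho2, mu_u, np.sqrt(var_u))
    # Likelihood if speaker 2 is attended
    l2 = norm.pdf(rho1, mu_u, np.sqrt(var_u)) * norm.pdf(rho2, mu_a, np.sqrt(var_a))
    p1 = l1 / (l1 + l2)
    return p1, 1.0 - p1
```

A segment whose two correlation scores are nearly equal receives weights close to 0.5 and thus contributes little label information, which is precisely how early mis‑classifications are kept from dominating the covariance estimates.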
The methods were evaluated on a publicly available dataset comprising 72 minutes of 64‑channel EEG from 16 normal‑hearing participants listening to two competing talkers at ±90°. Audio envelopes were extracted with a gammatone filterbank, power‑law compressed (exponent 0.6), filtered to 1–9 Hz, and downsampled to 20 Hz. EEG and audio were augmented with time‑lagged copies (0–150 ms for EEG, –250–0 ms for audio) and concatenated. Segments of 60 s were used, yielding K ≈ 45 segments for the full recording. The number of CCA components was fixed to Q = 2.
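The time‑lag augmentation can be illustrated with a small helper; boundary handling is an assumption here (wrapped samples are simply zeroed), and the paper's exact convention may differ:

```python
import numpy as np

def add_lags(sig, lags):
    """Stack time-lagged copies of a (T x D) signal as extra columns.
    Lags are in samples: positive lags look at past samples (EEG),
    negative lags at future samples (audio)."""
    T, D = sig.shape
    out = np.zeros((T, D * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(sig, lag, axis=0)
        # zero out the samples that wrapped around the array edge
        if lag > 0:
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        out[:, i * D:(i + 1) * D] = shifted
    return out

# At 20 Hz (50 ms per sample), 0-150 ms of EEG lags corresponds to
# lags 0..3, and -250-0 ms of audio lags to lags -5..0.
```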
Performance was measured in two regimes: transductive (predictions on the same data used for training) and inductive (generalization to unseen data). Random 3‑fold cross‑validation was employed to vary the amount of training data from 5 to 45 minutes. In addition to decoding accuracy, normalized CPU time (relative to the baseline single‑encoder method) was reported.
Key findings:
- Bias mitigation: All three proposed variants substantially reduced the initialization bias observed in the baseline single‑encoder method. The effect was most pronounced in the transductive setting, where the baseline’s performance degraded sharply with limited data.
- Accuracy: With small training sets (5–15 min), the sum‑initialized single‑encoder consistently achieved the highest accuracy, outperforming both the two‑encoder and soft‑label approaches and approaching the performance of the computationally expensive cross‑validated method of Heintz et al. As the training set grew, the soft‑label method caught up, reaching parity with the cross‑validated benchmark at around 30–45 min.
- Computational cost: The cross‑validated method’s CPU time scaled linearly with training duration, reaching ~30 × the baseline for 45 min of data. In contrast, the two‑encoder and soft‑label methods maintained a constant normalized time of ~1.5 ×, while the sum‑initialized method matched the baseline’s 1 × cost, regardless of data size.
- Trade‑offs: The two‑encoder approach, while computationally cheap, showed slightly lower peak accuracy because the shared decoder must capture responses to both speakers, reducing discriminative power. The soft‑label method offers a principled middle ground, preserving the single‑encoder’s discriminative structure while incorporating uncertainty information.
In conclusion, the paper delivers three practical, low‑complexity solutions for eliminating initialization bias in unsupervised self‑adaptive AAD. For scenarios with limited calibration data, the sum‑initialized single‑encoder is the preferred choice, delivering top accuracy with negligible overhead. For larger datasets, the soft‑label variant provides bias‑free performance comparable to the gold‑standard cross‑validation while keeping computational demands modest. These advances pave the way for real‑time, label‑free auditory attention decoding in next‑generation neuro‑steered hearing devices and brain‑computer interfaces.