Are foundation models useful feature extractors for electroencephalography analysis?

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The success of foundation models in natural language processing and computer vision has motivated similar approaches in time series analysis. While foundational time series models have proven beneficial on a variety of tasks, their effectiveness in medical applications with limited data remains underexplored. In this work, we investigate this question in the context of electroencephalography (EEG) by evaluating general-purpose time series models on age prediction, seizure detection, and classification of clinically relevant EEG events. We compare their diagnostic performance against specialised EEG models and assess the quality of the extracted features. The results show that general-purpose models are competitive and capture features useful for localising demographic and disease-related biomarkers. These findings indicate that foundational time series models can reduce the reliance on large task-specific datasets and models, making them valuable in clinical practice.


💡 Research Summary

This paper investigates whether foundation models—large, pre‑trained neural networks that have shown remarkable success in natural language processing and computer vision—can serve as effective feature extractors for electroencephalography (EEG) analysis, a domain traditionally constrained by limited labeled data. The authors focus on three clinically relevant tasks: (1) predicting a subject’s age from resting‑state EEG, (2) detecting epileptic seizures, and (3) classifying multiple EEG event types (spike‑and‑sharp‑wave, generalized periodic epileptiform discharges, periodic lateralised epileptiform discharges, eye‑movement artifacts, equipment noise, and background activity).

Three state‑of‑the‑art general‑purpose time‑series foundation models are evaluated: MOMENT (a 40 M‑parameter family of masked time‑series models), UniTS (an 8 M‑parameter unified multi‑task architecture), and OTiS (a 7 M‑parameter transformer pre‑trained on 640 k heterogeneous time‑series samples spanning ECG, weather, audio, engineering, and a modest amount of EEG). To isolate the effect of heterogeneous pre‑training, the authors also train an OTiS variant exclusively on the 3 k EEG samples (OTiS‑EEG).

The experimental datasets are publicly available and cover a range of sizes and sampling rates: LEMON (378 subjects, 250 Hz, age 20‑35 vs 59‑77 years), Epilepsy (11 500 single‑channel recordings, 174 Hz, 80 % seizure, 20 % healthy), and TUEV (112 237 multi‑channel recordings, 200 Hz, six clinical event classes plus three noise classes). For each task the authors perform five‑seed cross‑validation, reporting coefficient of determination (R²) for regression and accuracy / balanced accuracy for classification.
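The two evaluation metrics above can be written in a few lines of NumPy. This is a minimal sketch of the coefficient of determination (for the age‑regression task) and balanced accuracy (mean per‑class recall, relevant for imbalanced splits such as the Epilepsy dataset); the function names are illustrative, not taken from the paper's code.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall; unlike plain accuracy, a majority-class
    predictor does not score well under class imbalance."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))
```

In the five‑seed protocol described above, each metric would be computed once per seed and then averaged across seeds.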

Three domain‑adaptation strategies are examined: (i) Zero‑Shot (ZS), where the pre‑trained model is frozen and its token embeddings are averaged to obtain a global representation; class logits are derived via cosine similarity to class prototypes (only applicable to classification). (ii) Linear Probing (LP), where the frozen backbone is paired with a randomly initialized linear head that is trained on the target task. (iii) Fine‑Tuning (FT), where both backbone and head are jointly optimized. Hyper‑parameters (learning rate, batch size, drop path, layer decay, weight decay, label smoothing) are tuned via grid search, and early stopping based on validation performance is employed. All experiments run on a single NVIDIA RTX A6000 GPU.
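The zero‑shot strategy can be sketched directly from the description above: average the frozen backbone's token embeddings into one global vector, then score each class by cosine similarity to a class prototype. The array shapes and function name below are assumptions for illustration; the prototypes would typically be mean embeddings of a few labelled examples per class.

```python
import numpy as np

def zero_shot_classify(token_embeddings, prototypes):
    """Zero-shot classification with a frozen backbone.

    token_embeddings: (n_tokens, d) embeddings of one recording
    prototypes:       (n_classes, d) one prototype vector per class
    Returns the predicted class index and the cosine-similarity logits.
    """
    z = token_embeddings.mean(axis=0)          # global representation
    z = z / np.linalg.norm(z)                  # unit-normalise
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = p @ z                             # cosine similarities
    return int(np.argmax(logits)), logits
```

Linear probing replaces the prototype scoring with a trained linear head on the same frozen global representation; fine‑tuning additionally unfreezes the backbone.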

Key Findings

  1. Competitive Diagnostic Performance – Across all three tasks, the general‑purpose models achieve performance comparable to, and sometimes surpassing, that of a suite of 16 specialised EEG models (including two EEG‑specific foundation models, BIO‑T and LaBraM, and several handcrafted‑feature baselines). In age prediction, fine‑tuned OTiS and MOMENT reach R²≈0.45, matching deep ConvNet baselines. In seizure detection, even the zero‑shot OTiS attains ≈94 % accuracy, on par with models pre‑trained on large EEG corpora (e.g., SimCLR, TimesNet). For multi‑class event classification on TUEV, fine‑tuned OTiS exceeds most specialised baselines, with zero‑shot features already delivering balanced accuracy above 55 %.

  2. Domain‑Adaptation Needs Vary by Task – Fine‑tuning is essential for the regression task (age prediction) where subtle spectral cues reside in higher frequencies; linear probing alone yields insufficient R². Conversely, for seizure detection—where the discriminative signal is relatively coarse—zero‑shot or linear probing suffices, and fine‑tuning offers marginal gains. For the more nuanced event‑type classification, fine‑tuning improves performance but zero‑shot representations already capture clinically relevant patterns (e.g., distinguishing spike‑and‑sharp‑wave events from eye‑movement artifacts).

  3. Benefit of Heterogeneous Pre‑Training – Comparing OTiS (trained on the full heterogeneous corpus) with OTiS‑EEG (trained only on EEG) demonstrates that exposure to diverse modalities (ECG, weather, audio) yields richer, more transferable representations for EEG, despite EEG comprising only 0.5 % of the pre‑training data. This suggests that generic temporal dynamics learned from other domains can be repurposed for neurophysiological signals.

  4. Frequency‑Band Localization of Biomarkers – Principal component analysis of extracted features reveals that age‑related information concentrates in the beta (13‑30 Hz) and gamma (30‑100 Hz) bands, while seizure‑related activity is dominant in delta (0.5‑4 Hz) and theta (4‑8 Hz) bands. Band‑specific performance experiments confirm these trends, aligning with established neurophysiological literature. Moreover, aggressive low‑pass filtering at 40 Hz dramatically degrades seizure detection on the Epilepsy dataset, underscoring the importance of preserving high‑frequency content for certain biomarkers.

  5. Scalability and Practicality – Because the foundation models can be deployed zero‑shot, they reduce the engineering overhead associated with designing and training a new model for each EEG task. In settings where only a few hundred labeled recordings are available, linear probing provides a quick, low‑resource adaptation path. When larger labeled cohorts exist, fine‑tuning can be applied to squeeze out additional performance.
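The band‑specific experiments in finding 4 amount to restricting the signal to one frequency band before feature extraction. A minimal stand‑in, assuming an FFT mask rather than whatever filter design the authors actually used, looks like this; the `BANDS` table matches the ranges quoted above.

```python
import numpy as np

# Canonical EEG bands (Hz), as quoted in the frequency-band analysis
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 100)}

def band_filter(signal, fs, low, high):
    """Keep only spectral content in [low, high] Hz via an FFT mask.

    signal: 1-D array sampled at fs Hz. A simple illustrative filter,
    not the paper's preprocessing pipeline.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    mask = (freqs >= low) & (freqs <= high)
    return np.fft.irfft(spectrum * mask, n=signal.size)
```

The 40 Hz low‑pass result follows directly: `band_filter(x, fs, 0, 40)` discards all gamma‑band content, which is exactly the high‑frequency information the seizure‑detection experiments found essential.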

Limitations and Future Directions – The study is confined to three tasks; extending the evaluation to sleep‑stage scoring, motor‑imagery classification, or brain‑computer interface paradigms would further validate generalizability. The pre‑training corpus, while large, contains a relatively small proportion of EEG data; incorporating more clinical neurophysiology recordings could boost performance. Additionally, interpretability beyond frequency‑band analysis (e.g., attention maps, saliency) remains an open avenue for building clinician trust.

Conclusion – General‑purpose time‑series foundation models, pre‑trained on massive heterogeneous datasets, are viable and often competitive feature extractors for EEG analysis. They alleviate the data‑scarcity bottleneck in clinical neurophysiology, enable rapid deployment across diverse tasks, and retain the ability to localize biologically meaningful biomarkers. This work provides a compelling blueprint for integrating foundation models into routine EEG workflows and motivates broader exploration of cross‑domain pre‑training strategies in medical time‑series AI.

