MSA-CNN: A Lightweight Multi-Scale CNN with Attention for Sleep Stage Classification

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Recent advancements in machine learning-based signal analysis, coupled with open data initiatives, have fuelled efforts in automatic sleep stage classification. Despite the proliferation of classification models, few have prioritised reducing model complexity, which is a crucial factor for practical applications. In this work, we introduce Multi-Scale and Attention Convolutional Neural Network (MSA-CNN), a lightweight architecture featuring as few as ~10,000 parameters. MSA-CNN leverages a novel multi-scale module employing complementary pooling to eliminate redundant filter parameters and dense convolutions. Model complexity is further reduced by separating temporal and spatial feature extraction and using cost-effective global spatial convolutions. This separation of tasks not only reduces model complexity but also mirrors the approach used by human experts in sleep stage scoring. We evaluated both small and large configurations of MSA-CNN against nine state-of-the-art baseline models across three public datasets, treating univariate and multivariate models separately. Our evaluation, based on repeated cross-validation and re-evaluation of all baseline models, demonstrated that the large MSA-CNN outperformed all baseline models on all three datasets in terms of accuracy and Cohen’s kappa, despite its significantly reduced parameter count. Lastly, we explored various model variants and conducted an in-depth analysis of the key modules and techniques, providing deeper insights into the underlying mechanisms. The code for our models, baselines, and evaluation procedures is available at https://github.com/sgoerttler/MSA-CNN.


💡 Research Summary

The paper introduces MSA‑CNN, a lightweight multi‑scale convolutional neural network equipped with an attention‑based temporal context module, specifically designed for automatic sleep‑stage classification. While recent advances in machine learning and the availability of large public polysomnographic datasets have spurred the development of highly accurate classifiers, most state‑of‑the‑art (SOTA) models contain hundreds of thousands to millions of parameters. This high complexity hampers deployment on resource‑constrained platforms (e.g., wearables) and raises over‑fitting concerns, especially for multivariate EEG‑based approaches that often rely on graph‑CNNs, transformers, or 3D‑CNNs.

MSA‑CNN tackles the complexity problem through three complementary strategies. First, the Multi‑Scale Module (MSM) employs a novel “complementary pooling” scheme. The input epoch is down‑sampled by several pooling factors (e.g., 1, 2, 4, 8); each branch then applies a small 1‑D convolution (kernel size 3–5) to extract low‑level spectro‑morphological features at a distinct temporal resolution. A second, complementary pooling operation upsamples each branch’s feature map back to a common temporal length, allowing the branches to be merged and processed by a second temporal convolution that integrates the multi‑scale information. By keeping the convolutional kernel small and re‑using the same filter across scales, the MSM dramatically reduces redundant parameters compared with traditional multi‑scale designs that increase filter size or use atrous convolutions.
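The pool–convolve–upsample pattern described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's implementation (the actual model is in the linked repository); the pooling factors, kernel, and epoch length follow the values quoted in the summary.

```python
import numpy as np

def avg_pool1d(x, factor):
    """Average-pool a 1-D signal by an integer factor (length must divide)."""
    return x.reshape(-1, factor).mean(axis=1)

def upsample1d(x, factor):
    """Nearest-neighbour upsampling back to the original temporal length."""
    return np.repeat(x, factor)

def conv1d_same(x, kernel):
    """1-D convolution with zero padding ('same' output length)."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([np.dot(xp[i:i + len(kernel)], kernel)
                     for i in range(len(x))])

def multi_scale_branch(x, factor, kernel):
    """One branch of the module: pool -> small shared conv -> upsample."""
    pooled = avg_pool1d(x, factor)
    feat = conv1d_same(pooled, kernel)
    return upsample1d(feat, factor)

# A 30-s epoch at 100 Hz is 3000 samples; the same small kernel is
# shared across all scales, which is where the parameter savings come from.
rng = np.random.default_rng(0)
epoch = rng.standard_normal(3000)
kernel = np.array([0.25, 0.5, 0.25])          # kernel size 3 (illustrative weights)
branches = [multi_scale_branch(epoch, f, kernel) for f in (1, 2, 4, 8)]
merged = np.stack(branches)                   # (4, 3000): common length, ready to merge
print(merged.shape)
```

Because every branch returns to the common length of 3000 samples, a single follow-up temporal convolution can integrate all four scales at once.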

Second, the architecture separates temporal and spatial processing. After MSM extracts per‑channel temporal patterns, a global spatial convolution operates across all channels simultaneously. Because typical sleep recordings involve only a handful of channels (4–8), a single spatial filter that spans the full channel dimension is far more efficient than depthwise separable convolutions, graph convolutions, or spatial pooling layers. This design mirrors the workflow of human sleep experts, who first identify temporal waveforms on each electrode and then interpret their co‑activation across electrodes.
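To see why a global spatial convolution is so cheap at these channel counts, consider the following sketch. The channel and filter counts are illustrative assumptions, not values taken from the paper's configuration files: with 6 channels and 16 spatial filters, the whole spatial stage needs only 96 weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n_channels, n_time, n_filters = 6, 3000, 16   # illustrative: 6 channels, 16 filters
feats = rng.standard_normal((n_channels, n_time))  # per-channel temporal features

# A "global" spatial convolution spans the full channel dimension at once:
# each output filter is a single learned weighting over all channels,
# so the stage costs only n_channels * n_filters parameters.
w_spatial = rng.standard_normal((n_filters, n_channels))
out = w_spatial @ feats                       # (n_filters, n_time)
print(out.shape, w_spatial.size)              # (16, 3000) 96
```

With only a handful of electrodes there is no need for graph or depthwise machinery; a plain matrix product over the channel axis captures every cross-channel co-activation pattern.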

Third, the Temporal Context Module (TCM) adds a multi‑head self‑attention mechanism on the sequence of time‑resolved feature vectors (tokens). Tokens are first linearly embedded into a lower‑dimensional space, positional encodings are added, and then N repetitions of multi‑head attention followed by a feed‑forward network (with residual connections and layer‑norm) are applied. The attention mechanism enables each token to be re‑weighted based on its surrounding context, capturing long‑range dependencies such as stage transitions and the interaction of sleep spindles with K‑complexes. The authors also provide a visualization tool that extracts the attention weight matrix for a given epoch, computes mean incoming attention (how much each time point is attended to by the rest) and outgoing attention (how much each point attends to the most attended point), thereby enhancing interpretability.
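The attention summaries used for visualization follow directly from the attention weight matrix. The sketch below uses a single head with random weights purely to show the shapes and the incoming/outgoing reductions; the paper's TCM uses trained multi-head attention with positional encodings, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n_tokens, d_model = 50, 8                     # time-resolved feature tokens
tokens = rng.standard_normal((n_tokens, d_model))

# One attention head (illustrative; the module is multi-head in the paper)
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
A = softmax(Q @ K.T / np.sqrt(d_model))       # (n_tokens, n_tokens) attention weights
context = A @ V                               # tokens re-weighted by their context

# Visualization summaries: A[i, j] is how much token i attends to token j.
incoming = A.mean(axis=0)                     # how much each time point is attended to
outgoing = A[:, incoming.argmax()]            # how much each point attends to the peak
print(A.shape, incoming.shape, outgoing.shape)
```

Mean incoming attention is a column average of `A`, while the outgoing profile is the column of `A` at the most-attended time point, matching the two quantities described above.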

Two model sizes were instantiated: a “small” version with roughly 10 k trainable parameters and a “large” version with about 40 k parameters. Both were evaluated on three publicly available datasets: ISRUC‑S3 (10 subjects, 6 EEG + 2 EOG + EMG, 8 589 epochs), Sleep‑EDF‑20 (20 subjects, 2 EEG + 1 EOG + EMG, 42 308 epochs) and Sleep‑EDF‑78 (78 subjects, same channel set, 195 479 epochs). The datasets were pre‑processed uniformly (100 Hz sampling, 40 Hz low‑pass Butterworth filter) and split into 30‑second epochs labeled according to AASM or R&K standards.
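The epoch segmentation step is straightforward to sketch. The snippet below shows only the reshaping into 30-second windows at 100 Hz; the 40 Hz low-pass Butterworth filter mentioned above would typically be applied first (e.g., with `scipy.signal.butter`/`filtfilt`), which is omitted here to keep the sketch dependency-free.

```python
import numpy as np

fs = 100                                      # sampling rate after resampling (Hz)
epoch_len = 30 * fs                           # 30-second epochs -> 3000 samples

rng = np.random.default_rng(3)
recording = rng.standard_normal(10 * epoch_len + 123)  # one continuous channel

# Drop the trailing partial epoch and reshape into (n_epochs, 3000);
# each row then receives one AASM/R&K stage label.
n_epochs = len(recording) // epoch_len
epochs = recording[:n_epochs * epoch_len].reshape(n_epochs, epoch_len)
print(epochs.shape)                           # (10, 3000)
```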

For benchmarking, nine recent SOTA models were re‑implemented and re‑evaluated under the same repeated 5‑fold cross‑validation protocol (10 repetitions). These baselines included graph‑CNNs, transformer‑based encoders, 3D‑CNNs, and conventional CNN architectures, each in both univariate (single‑channel) and multivariate (multi‑channel) configurations.
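A repeated 5-fold protocol with 10 repetitions yields 50 train/test splits per dataset. The sketch below generates such splits over subject IDs; whether the paper splits by subject or by epoch is an assumption here (subject-wise splitting is the common choice for sleep staging, as it prevents leakage between folds).

```python
import random

def repeated_kfold(subjects, k=5, repeats=10, seed=0):
    """Yield (train, test) subject splits for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = subjects[:]
        rng.shuffle(order)                    # new random partition each repetition
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [s for j, f in enumerate(folds) if j != i for s in f]
            yield train, test

# e.g., the 20 subjects of Sleep-EDF-20 -> 10 x 5 = 50 splits
splits = list(repeated_kfold(list(range(20)), k=5, repeats=10))
print(len(splits))                            # 50
```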

Results show that the large MSA‑CNN consistently outperforms all baselines on accuracy and Cohen’s κ across the three datasets. For example, on ISRUC‑S3 the large model achieved 84.3 % accuracy and κ = 0.78, surpassing the best baseline (81.1 % / κ = 0.73). The small model, despite having only ~10 k parameters, still matched or exceeded many baselines, demonstrating the efficiency of the design.

Ablation studies systematically removed each core component. Without the MSM (i.e., with a single‑scale convolution), accuracy dropped by roughly 3 percentage points; replacing the global spatial convolution with depthwise separable convolutions reduced performance by about 2 percentage points; and substituting the TCM with simple temporal averaging caused a decline of around 4 percentage points. A parameter sensitivity analysis revealed that increasing the parameter count beyond roughly 40 k yields diminishing returns, confirming that the architecture is already near optimal in the low‑parameter regime.

The attention visualizations provide concrete clinical insight. In an N2 epoch containing both a spindle and a K‑complex, the model’s incoming attention peaks at the spindle, while outgoing attention highlights the K‑complex, illustrating how the network dynamically balances competing waveforms—behavior akin to expert scoring.

In conclusion, MSA‑CNN demonstrates that a carefully engineered combination of complementary pooling‑based multi‑scale feature extraction, global spatial convolution, and self‑attention can deliver state‑of‑the‑art sleep‑stage classification with an order of magnitude fewer parameters than existing methods. This makes the model well‑suited for deployment on embedded hardware, real‑time monitoring, and potentially for integration into consumer‑grade sleep wearables. Future work may explore extreme channel reduction (single‑EEG), multimodal fusion with respiratory or cardiac signals, and hardware‑aware optimizations such as quantization and pruning to further shrink the model for on‑device inference.

