MicroBi-ConvLSTM: An Ultra-Lightweight Efficient Model for Human Activity Recognition on Resource Constrained Devices
Human Activity Recognition (HAR) on resource-constrained wearables requires models that balance accuracy against strict memory and computational budgets. State-of-the-art lightweight architectures such as TinierHAR (34K parameters) and TinyHAR (55K parameters) achieve strong accuracy but exceed the memory budgets of microcontrollers with limited SRAM once operating-system overhead is considered. We present MicroBi-ConvLSTM, an ultra-lightweight convolutional-recurrent architecture that averages 11.4K parameters through two-stage convolutional feature extraction with 4× temporal pooling and a single bidirectional LSTM layer. This represents a 2.9× parameter reduction versus TinierHAR and 11.9× versus DeepConvLSTM while preserving linear O(N) complexity. Evaluation across eight diverse HAR benchmarks shows that MicroBi-ConvLSTM maintains competitive performance within the ultra-lightweight regime: 93.41% macro F1 on UCI-HAR, 94.46% on SKODA assembly gestures, and 88.98% on Daphnet gait-freeze detection. Systematic ablation reveals task-dependent component contributions: bidirectionality benefits episodic event detection but provides marginal gains on periodic locomotion. INT8 post-training quantization incurs only 0.21% average F1-score degradation, yielding a 23.0 KB average deployment footprint suitable for memory-constrained edge devices.
💡 Research Summary
The paper introduces µBi‑ConvLSTM, an ultra‑lightweight convolutional‑recurrent architecture designed for human activity recognition (HAR) on highly constrained wearable devices. Existing lightweight models such as TinyHAR (≈55 K parameters) and TinierHAR (≈34 K parameters) still exceed the usable SRAM of many microcontrollers once the OS, sensor drivers, and intermediate activation buffers are accounted for. µBi‑ConvLSTM reduces the average parameter count to 11.4 K by employing a two‑stage standard convolutional front‑end with 4× temporal pooling followed by a single bidirectional LSTM (hidden size 24). This design preserves linear O(N) computational complexity, avoids the quadratic cost of attention mechanisms, and keeps the memory footprint around 23 KB after INT8 post‑training quantization.
Key design principles are: (1) aggressive temporal reduction – two convolutional blocks each followed by 2× max‑pooling compress the sequence length from T to T/4, dramatically lowering the recurrent memory requirement; (2) use of standard convolutions rather than depthwise separable ones because sensor data typically have only 3–9 channels, and cross‑channel interactions are essential; (3) allocation of most parameters to a single wide bidirectional LSTM, which provides richer temporal context than stacked narrow layers; (4) bidirectionality is beneficial for episodic activities (e.g., gait‑freeze) but yields marginal gains for periodic motions; (5) strict O(N) complexity to guarantee predictable latency on microcontrollers.
The architecture processes an input tensor X∈ℝ^{B×T×C} through Conv1 (16 filters, kernel 5) → BN → ReLU → MaxPool(2) → Conv2 (32 filters, kernel 5) → BN → ReLU → MaxPool(2), yielding a compressed sequence of length T/4 with 32 feature maps. A bidirectional LSTM (H=24) consumes this sequence; forward and backward hidden states are concatenated (48‑dimensional) and the final timestep representation is passed through dropout and a linear classifier with softmax output. The LSTM accounts for roughly 75 % of the total parameters.
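The pipeline above can be sketched in PyTorch as follows. This is a minimal reconstruction from the summary's description, not the authors' code: the padding, dropout rate, and the example channel and class counts are assumptions, so the exact parameter count will vary slightly from the paper's per-dataset figures.

```python
import torch
import torch.nn as nn

class MicroBiConvLSTM(nn.Module):
    """Sketch of the described architecture: two conv blocks with 2x pooling
    each (T -> T/4), a single bidirectional LSTM (H=24), and a linear head."""

    def __init__(self, in_channels: int, num_classes: int, hidden: int = 24):
        super().__init__()
        self.stem = nn.Sequential(
            # Conv1: 16 filters, kernel 5 (padding=2 is an assumption)
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2),
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(2),
            # Conv2: 32 filters, kernel 5
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.3)  # dropout rate is an assumption
        self.head = nn.Linear(2 * hidden, num_classes)  # 48-dim concat features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) -> Conv1d expects (B, C, T)
        z = self.stem(x.transpose(1, 2))          # (B, 32, T/4)
        out, _ = self.lstm(z.transpose(1, 2))     # (B, T/4, 48)
        return self.head(self.drop(out[:, -1]))   # final-timestep representation
```

Instantiating the sketch with, say, 6 input channels and 6 classes and summing `p.numel()` over `model.parameters()` confirms that the bidirectional LSTM dominates the budget, consistent with the roughly 75 % share stated above.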
Eight publicly available HAR datasets were used for evaluation: UCI‑HAR, MotionSense, WISDM, PAMAP2, Opportunity, UniMiB‑SHAR, SKODA, and Daphnet. Each dataset received dataset‑specific preprocessing (Butterworth low‑pass filtering where needed, z‑score normalization, 50 % overlapping sliding windows). Subject‑wise cross‑validation ensured no data leakage. Training employed AdamW with cosine annealing for up to 200 epochs, early stopping (patience = 10) based on validation macro‑F1, and class‑frequency weighting for imbalanced sets. Hyper‑parameter optimization (learning rate, weight decay, dropout) was performed with Optuna’s TPE (50 trials per dataset). All baseline models (DeepConvLSTM, TinyHAR, TinierHAR) were retrained under the same protocol for a fair comparison.
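The optimizer, scheduler, and class-weighting choices described above can be wired up as in the sketch below. The model here is a placeholder, the class counts are invented for illustration, and the learning rate and weight decay are assumptions (the paper tuned both per dataset with Optuna's TPE); only the overall protocol mirrors the text.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for µBi-ConvLSTM (windows of 128 steps, 6 channels).
model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 6, 6))

# Class-frequency weighting for imbalanced datasets: inverse-frequency weights.
class_counts = torch.tensor([500.0, 480.0, 60.0, 55.0, 300.0, 310.0])  # hypothetical
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

# AdamW with cosine annealing over the 200-epoch budget described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(3):  # a few dummy epochs on random data
    x, y = torch.randn(32, 128, 6), torch.randint(0, 6, (32,))
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
# In the real protocol, validation macro-F1 is computed on held-out subjects
# each epoch and training stops after 10 epochs without improvement.
```

The inverse-frequency weights make rare classes (60 and 55 samples above) contribute proportionally more to the loss, which is one common realization of the "class-frequency weighting" the summary mentions.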
Performance results show that µBi‑ConvLSTM achieves macro‑F1 scores of 93.41 % on UCI‑HAR, 94.46 % on SKODA, and 88.98 % on Daphnet, closely matching or slightly trailing the best‑performing baselines while using 2.9× fewer parameters than TinierHAR and 11.9× fewer than DeepConvLSTM. Across all datasets the average macro‑F1 is 83.68 % compared to 85.93 % (DeepConvLSTM), 86.16 % (TinyHAR), and 87.39 % (TinierHAR). The model’s MAC count ranges from 0.245 M to 1.14 M, representing a 32–42× reduction versus DeepConvLSTM and a 5–42× reduction versus TinyHAR.
Ablation studies reveal that removing bidirectionality reduces F1 by up to 1.8 % on episodic tasks (Daphnet) but has negligible impact on periodic locomotion datasets. Reducing temporal pooling from 4× to 2× modestly improves accuracy (<0.3 % gain) at the cost of higher computation, confirming that aggressive temporal compression is the primary driver of memory efficiency. Replacing the two‑stage convolutional stem with a single block cuts parameters by ~30 % but degrades F1 by 2–4 %, indicating that the two‑stage design balances receptive field and feature richness.
Quantization experiments demonstrate that INT8 post‑training quantization incurs only 0.21 % average F1 loss, while shrinking the model size to an average of 23 KB—well within the SRAM limits of many low‑end microcontrollers (e.g., ARM Cortex‑M4 with 64 KB SRAM). The authors also discuss deployment considerations such as static memory allocation for activations, the impact of batch size = 1 inference, and the feasibility of on‑device continuous inference.
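As a desktop-side illustration of the size effect of INT8 post-training quantization, the snippet below applies PyTorch's dynamic quantization to a small recurrent model shaped like the parameter-dominant part of µBi-ConvLSTM. This is a proxy, not the paper's pipeline: actual microcontroller deployment would go through a toolchain such as TensorFlow Lite Micro or CMSIS-NN, and the layer sizes here are taken from the architecture description above.

```python
import io
import torch
import torch.nn as nn

class TinyRecurrent(nn.Module):
    """Stand-in: bidirectional LSTM (H=24, 32-dim input) plus a linear head."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.lstm = nn.LSTM(32, 24, batch_first=True, bidirectional=True)
        self.head = nn.Linear(48, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = TinyRecurrent()

# Post-training dynamic quantization: weights stored as INT8, activations
# quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)

def serialized_kb(m: nn.Module) -> float:
    """Approximate the on-flash footprint by serializing the state dict."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.tell() / 1024
```

Comparing `serialized_kb(model)` against `serialized_kb(qmodel)` shows the expected roughly 4× shrink from FP32 to INT8 weights, the mechanism behind the ~23 KB average footprint reported above.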
In summary, µBi‑ConvLSTM offers a compelling solution for HAR on ultra‑resource‑constrained edge devices. By strategically combining standard convolutions, aggressive temporal pooling, and a single bidirectional LSTM, it achieves a remarkable trade‑off: sub‑12 K parameters, linear inference time, and competitive accuracy across a diverse set of activity recognition tasks. The work highlights the importance of architecture‑level co‑design with hardware constraints and opens avenues for further research into dynamic pooling strategies, hybrid recurrent‑attention modules, and automated neural architecture search tailored for TinyML environments.