ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning
Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on augmentations that predominantly mask high-motion frames and high-degree joints (e.g., joints with degree 3 or 4), which yields biased, incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking strategies that captures the full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. Together, these strategies ensure more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from the two masked views. To facilitate deployment on resource-constrained, low-power devices, we compress the learned and aligned representation into a lightweight model via knowledge distillation. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods, with an average improvement of 2.7-4.4% in fine-tuning and up to 5.9% in transfer learning to noisy datasets, and achieves competitive performance against fully supervised baselines. Our distilled model achieves a 91.4% parameter reduction and 3x faster inference on edge devices while maintaining competitive accuracy, enabling practical deployment in resource-constrained scenarios.
💡 Research Summary
The paper introduces ASMa (Asymmetric Spatio‑temporal Masking), a self‑supervised learning framework designed to improve skeleton‑based action recognition. Existing SSL approaches for skeleton data typically mask high‑motion frames and high‑degree joints (those with many graph connections), which leads to biased representations that overlook complementary motion patterns. ASMa addresses this limitation by employing two complementary masking strategies that target opposite ends of the joint‑degree and motion spectrum.
The first strategy, High‑Degree Spatial Masking (HDSM) combined with Low‑Motion Temporal Masking (LMTM), masks joints that have high graph degree (e.g., spine joints) while preserving low‑motion frames. The second strategy, Low‑Degree Spatial Masking (LDSM) together with High‑Motion Temporal Masking (HMTM), masks peripheral joints (hands, feet) that have low graph degree but selects frames with strong motion. Joint‑degree probabilities are derived from the graph centrality of each joint, and motion scores are computed as the average displacement of all joints between consecutive frames. By applying these asymmetric masks, two distinct views of each skeleton sequence are generated: (xθj, xθm) and (xϕj, xϕm).
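The asymmetric mask construction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names (`joint_degree_probs`, `motion_scores`, `asymmetric_masks`) and the fixed mask counts are assumptions, and plain graph degree stands in for the paper's graph-centrality measure.

```python
import numpy as np

def joint_degree_probs(adjacency):
    """Per-joint masking probability, proportional to graph degree
    (a simple proxy for the graph centrality used in the paper)."""
    degree = adjacency.sum(axis=1)
    return degree / degree.sum()

def motion_scores(seq):
    """seq: (T, V, C) skeleton sequence -> per-frame motion score,
    the average joint displacement between consecutive frames."""
    disp = np.linalg.norm(np.diff(seq, axis=0), axis=-1)  # (T-1, V)
    scores = disp.mean(axis=1)                            # (T-1,)
    return np.concatenate([[scores[0]], scores])          # pad back to T

def asymmetric_masks(adjacency, seq, n_joints=5, n_frames=10):
    """Return the joint/frame indices masked in each of the two views."""
    p_joint = joint_degree_probs(adjacency)
    m = motion_scores(seq)
    # View 1 (HDSM + LMTM): mask high-degree joints, low-motion frames.
    hi_deg = np.argsort(p_joint)[-n_joints:]
    lo_mot = np.argsort(m)[:n_frames]
    # View 2 (LDSM + HMTM): mask low-degree joints, high-motion frames.
    lo_deg = np.argsort(p_joint)[:n_joints]
    hi_mot = np.argsort(m)[-n_frames:]
    return (hi_deg, lo_mot), (lo_deg, hi_mot)
```

In practice the selected joints would be zeroed across all frames and the selected frames zeroed across all joints before each masked sequence is fed to its encoder.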
Two ST‑GCN encoders, fθ and fϕ, are trained simultaneously on these views. Each encoder processes three parallel streams: an unmasked anchor stream, a spatially masked stream, and a temporally masked stream. The representations from each stream are projected through a Barlow Twins head, and the Barlow Twins loss aligns the anchor embedding with each masked embedding while penalizing redundancy across feature dimensions. The total pre‑training loss is the sum of the anchor‑spatial and anchor‑temporal alignment losses for both encoders.
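The anchor-to-masked alignment uses the standard Barlow Twins objective: an invariance term pulling the diagonal of the cross-correlation matrix toward 1, plus a redundancy term pushing off-diagonal entries toward 0. A minimal numpy sketch (the trade-off weight `lam` is an assumed illustrative value, not the paper's setting):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a: (N, D) anchor embeddings; z_b: (N, D) masked-view embeddings."""
    # Standardize each feature dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    n = z_a.shape[0]
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag
```

The total pre-training loss would then sum this term over the anchor-spatial and anchor-temporal pairs for both encoders fθ and fϕ.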
After pre‑training, a feature‑alignment module fuses the complementary embeddings from fθ and fϕ. This module uses bidirectional multi‑head attention: the embedding from one encoder attends to the other and vice‑versa, producing a unified representation that captures both static structural cues and dynamic motion cues. The fused representation is then fed to a classifier for downstream action recognition.
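The bidirectional attention fusion can be illustrated with a single-head sketch (the multi-head version simply repeats this per head and concatenates). All projection matrices, shapes, and the concatenation at the end are illustrative assumptions, not the paper's exact module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_src, kv_src, w_q, w_k, w_v):
    """Embeddings from q_src attend to embeddings from kv_src."""
    q, k, v = q_src @ w_q, kv_src @ w_k, kv_src @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v

def bidirectional_fuse(z_theta, z_phi, d):
    """Fuse the two encoders' embeddings via cross-attention both ways."""
    rng = np.random.default_rng(0)  # stand-in for learned parameters
    w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(6)]
    a = cross_attend(z_theta, z_phi, *w[:3])  # f_theta attends to f_phi
    b = cross_attend(z_phi, z_theta, *w[3:])  # f_phi attends to f_theta
    return np.concatenate([a, b], axis=-1)    # unified representation
```

In the actual model the projections are learned jointly with the downstream classifier rather than randomly initialized as here.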
To enable deployment on resource‑constrained devices, the authors distill the knowledge of the dual‑encoder teacher into a lightweight student network via knowledge distillation. The student learns to mimic the softened logits of the teacher while having far fewer parameters. Experiments show that the student achieves a 91.4 % reduction in parameters and three‑fold faster inference with negligible accuracy loss.
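Matching the teacher's softened logits is typically done with a temperature-scaled KL divergence. A minimal sketch (the temperature `T=4.0` is an assumed illustrative value; in training this term is usually combined with a standard cross-entropy loss on the ground-truth labels):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened distributions,
    scaled by T^2 to keep gradient magnitudes comparable."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1)
    return (T ** 2) * kl.mean()
```

A higher temperature flattens the teacher's distribution, exposing the relative similarities between non-target action classes that the student is meant to absorb.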
Extensive evaluations on NTU‑RGB+D 60, NTU‑RGB+D 120, and PKU‑MMD demonstrate that ASMa consistently outperforms prior SSL methods. In linear probing, it gains 1–3 % accuracy; in fine‑tuning, 2.7–4.4 %; and in transfer learning to noisy datasets, up to 5.9 % improvement. The method also reaches performance comparable to fully supervised baselines. An intriguing finding is that a student distilled from a linearly‑probed teacher can surpass the teacher’s performance, suggesting that self‑supervised distillation can further refine the learned representations.
In summary, ASMa introduces a principled asymmetric masking scheme based on joint degree and motion statistics, leverages Barlow Twins for redundancy‑free alignment, integrates complementary views through cross‑attention, and provides an efficient distilled model for edge deployment. The work both advances the theoretical understanding of masking bias in skeleton SSL and offers a practical solution for real‑world applications such as healthcare monitoring, security surveillance, and human‑machine interaction on low‑power devices.