Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original ArXiv source.

Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework built on a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach uses episodic training to learn a semantic metric space where signs are classified by their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the strength of this metric-learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13 percentage points, showing that the prototypical training strategy remains effective in data-scarce regimes where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.


💡 Research Summary

The paper tackles the pressing challenge of isolated sign language recognition (ISLR) for American Sign Language (ASL) under conditions of severe data scarcity and a long‑tail class distribution. Traditional deep‑learning classifiers, which rely on abundant, balanced training data, overfit to frequent signs and fail to generalize to the myriad rare signs that constitute most of a realistic vocabulary. To overcome this bottleneck, the authors propose a few‑shot learning framework based on Prototypical Networks, adapted specifically for skeleton‑based sign language data.

Core Contributions

  1. Skeleton‑Based Encoder – The system extracts 2‑D pose keypoints (body, hands, face) using the high‑performance RTMLib pipeline (RTMPose‑l trained on COCO‑WholeBody). A three‑stage hierarchical normalization (global scaling, local centering for hands and face, confidence gating) ensures invariance to camera distance, signer position, and scale. These normalized keypoints are fed into a Spatiotemporal Graph Convolutional Network (ST‑GCN) that respects the natural graph structure of the human skeleton.
  2. Multi‑Scale Temporal Aggregation (MSTA) – Recognizing that sign gestures vary widely in speed, the authors augment the ST‑GCN with parallel 1‑D convolutions of kernel sizes 3, 5, and 7. This captures short, sharp motions as well as longer, fluid movements. A learnable attention‑pooling layer then compresses the variable‑length frame sequence into a fixed‑size embedding vector z, weighting frames by semantic relevance.
  3. Prototypical Few‑Shot Training – Instead of a fixed softmax classifier, the model is trained episodically. Each episode samples N‑way classes and K‑shot support examples, computes a prototype for each class as the mean of its support embeddings, and classifies queries by Euclidean distance to these prototypes. The loss is standard cross‑entropy over the softmax of negative distances. This metric‑learning objective encourages a globally consistent embedding space where signs cluster by semantic similarity.
  4. Training Optimizations – To make episodic training feasible for large‑scale ISLR, the authors implement automatic mixed‑precision (AMP) training, a custom EpisodicBatchSampler that scales N‑way far beyond typical few‑shot benchmarks, and temporal speed augmentation (0.8×, 1.0×, 1.25×) to force the network to focus on motion patterns rather than absolute duration. They also introduce a “global prototype validation” step that simulates an open‑set scenario by building a dictionary of prototypes from the entire training set and testing the model’s ability to reject unseen signs.
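The attention-pooling step of the MSTA module (item 2 above) compresses a variable-length sequence of frame features into one fixed-size embedding by weighting frames with softmax scores. A minimal NumPy sketch, where the scoring vector `w` stands in for the learned attention parameters (the names are illustrative, not from the paper):

```python
import numpy as np

def attention_pool(frames, w):
    """Pool a (T, D) sequence of per-frame features into a single (D,)
    embedding, weighting each frame by its softmax relevance score."""
    scores = frames @ w                     # (T,) scalar relevance per frame
    scores -= scores.max()                  # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
    return alpha @ frames                   # attention-weighted average
```

With a zero scoring vector the weights are uniform, so the pooled embedding reduces to the plain temporal mean; a trained `w` instead emphasizes semantically relevant frames.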

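The prototypical training objective (item 3 above) follows directly from its definition: prototypes are support-set means, and queries are scored by a softmax over negative squared Euclidean distances to those prototypes. A minimal NumPy sketch of one episode; function and variable names are illustrative:

```python
import numpy as np

def prototypical_episode_loss(support, query, query_labels):
    """One episodic step. support: (N, K, D) embeddings for N-way, K-shot;
    query: (Q, D) embeddings; query_labels: (Q,) ints in [0, N)."""
    # Class prototypes: mean of each class's K support embeddings.
    prototypes = support.mean(axis=1)                   # (N, D)
    # Squared Euclidean distance from every query to every prototype.
    diffs = query[:, None, :] - prototypes[None, :, :]  # (Q, N, D)
    dists = np.sum(diffs ** 2, axis=-1)                 # (Q, N)
    # Softmax over negative distances gives class probabilities.
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Standard cross-entropy on the query set.
    loss = -np.mean(np.log(probs[np.arange(len(query)), query_labels]))
    preds = probs.argmax(axis=1)
    return loss, preds
```

In the paper the embeddings come from the ST-GCN + MSTA encoder; here they are treated as given vectors to isolate the metric-learning step.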
Experimental Setup

  • Datasets – Primary training and evaluation are performed on WLASL, a large‑scale word‑level ASL dataset containing over 2,000 classes with a pronounced long‑tail (most classes have ≤13 examples). For zero‑shot transfer, the SignASL dataset is used in evaluation‑only mode.
  • Baselines – The same ST‑GCN + MSTA backbone is used for a conventional softmax classifier (baseline) to ensure a fair comparison of training paradigms.
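The temporal speed augmentation used during training (0.8×, 1.0×, 1.25×, per item 4 above) amounts to resampling the frame axis of a clip. A hedged sketch assuming nearest-frame resampling (the paper does not specify the interpolation scheme):

```python
import numpy as np

def speed_augment(frames, factor):
    """Resample a (T, ...) keypoint sequence to simulate faster or slower
    signing: factor > 1 shortens the clip, factor < 1 lengthens it."""
    t = frames.shape[0]
    new_t = max(1, int(round(t / factor)))
    # Nearest-frame indices spanning the original clip.
    idx = np.linspace(0, t - 1, new_t).round().astype(int)
    return frames[idx]
```

Training on the same sign at several playback speeds pushes the encoder to key on the motion pattern itself rather than its absolute duration.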

Results

  • On WLASL, under a 5‑way 5‑shot episodic regime, the proposed Prototypical Network achieves 43.75 % Top‑1 and 77.10 % Top‑5 accuracy. The baseline softmax classifier reaches roughly 30 % Top‑1, a more than 13‑point absolute improvement attributable to the metric‑learning approach.
  • In a zero‑shot setting on SignASL (no fine‑tuning), the model attains ≈30 % accuracy, demonstrating that the learned embedding space generalizes to entirely unseen sign vocabularies.
  • Ablation studies (not fully detailed in the excerpt) likely show the contribution of MSTA, the hierarchical normalization, and the mixed‑precision/large‑N‑way training to overall performance.
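The zero-shot transfer above rests on the nearest-prototype rule: average a few labeled embeddings per sign into a prototype dictionary, then assign each new clip to the closest prototype. A minimal sketch with illustrative names (the paper's "global prototype validation" builds the dictionary from the whole training set):

```python
import numpy as np

def build_prototype_dict(embeddings, labels):
    """Average the (M, D) embeddings of each class into one prototype."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def nearest_prototype(query, protos):
    """Classify a (D,) query as the class with the closest prototype."""
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))
```

Because classification is pure nearest-neighbor lookup in embedding space, new signs can be added at inference time by appending prototypes, with no retraining.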

Discussion & Limitations
The reliance on 2‑D pose estimation introduces potential errors: depth ambiguity, occlusions, and hand‑face overlap can degrade keypoint quality, especially for fine‑grained finger configurations crucial to ASL. While the hierarchical normalization mitigates some variability, the system’s performance is bounded by the upstream pose detector. Moreover, the current evaluation focuses on relatively low‑way episodes (5‑way); scaling to thousands of classes in a real‑world deployment would require efficient prototype storage and fast nearest‑neighbor search, topics left for future work.

Conclusion & Future Directions
The study demonstrates that a few‑shot prototypical learning paradigm, when combined with a skeleton‑based ST‑GCN and a multi‑scale temporal aggregator, can substantially improve data‑efficient ASL recognition. It outperforms conventional classifiers even when both share the same backbone, and it shows promising zero‑shot transfer to unseen sign sets. Future research may explore 3‑D pose inputs, multimodal fusion (e.g., RGB + skeleton), larger‑scale open‑set evaluation, and deployment‑oriented optimizations such as prototype compression or approximate nearest‑neighbor indexing.

Overall, the paper provides a compelling blueprint for building scalable, low‑resource sign language recognition systems that can adapt to expanding vocabularies without the prohibitive cost of exhaustive video annotation.

