Facial Expression Recognition Using Residual Masking Network


Automatic facial expression recognition (FER) has gained much attention due to its applications in human-computer interaction. Among the approaches to improving FER, this paper focuses on a deep architecture with an attention mechanism. We propose a novel Masking idea to boost the performance of CNNs on the facial expression task. It uses a segmentation network to refine feature maps, enabling the network to focus on relevant information when making decisions. In experiments, we combine the ubiquitous Deep Residual Network with a Unet-like architecture to produce a Residual Masking Network. The proposed method achieves state-of-the-art (SOTA) accuracy on the well-known FER2013 dataset and a private VEMO dataset. The source code is available at https://github.com/phamquiluan/ResidualMaskingNetwork.


💡 Research Summary

The paper introduces a novel deep architecture for facial expression recognition (FER) called the Residual Masking Network (RMN). The central idea, termed the “Masking Idea,” integrates an attention mechanism that leverages a segmentation‑style mask to emphasize facial regions that are most informative for emotion classification (e.g., eyes, mouth) while suppressing irrelevant areas such as hair or background.

Architecture Overview
RMN builds upon a ResNet‑34 backbone. After an initial 7×7 convolution with stride 2 and a 2×2 max‑pooling layer, the input image (scaled to 224 × 224 RGB) is reduced to a 56 × 56 feature map. Four Residual Masking Blocks (RMBs) are then applied sequentially, each operating at a different spatial resolution: 56 × 56, 28 × 28, 14 × 14, and 7 × 7.
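The resolution schedule above follows directly from standard convolution arithmetic. The sketch below is an illustrative calculation, not the authors' code: the `conv_out` helper and the assumption of stride-2 downsampling between stages are ours, but they reproduce the 56 → 28 → 14 → 7 progression described in the text.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Stem: 7x7 conv, stride 2, padding 3, then 2x2 max-pool, stride 2.
size = conv_out(224, kernel=7, stride=2, padding=3)  # 224 -> 112
size = conv_out(size, kernel=2, stride=2)            # 112 -> 56

# Four Residual Masking Blocks; each stage after the first halves the
# resolution (stride-2 downsampling is an assumption on our part).
stages = [size]
for _ in range(3):
    size = conv_out(size, kernel=2, stride=2)
    stages.append(size)

print(stages)  # [56, 28, 14, 7]
```

This matches the four RMB resolutions (56 × 56, 28 × 28, 14 × 14, 7 × 7) listed above.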

Each RMB consists of two sub‑components:

  1. Residual Layer (RL) – a standard ResNet‑34 residual unit that transforms the incoming feature map F into a coarse feature map F_R.

  2. Masking Block (MB) – a lightweight UNet‑style encoder‑decoder network. The MB receives F_R and outputs a mask F_M of the same spatial size, with values constrained to (0, 1).
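One plausible way to combine such a mask with the residual features, in the residual-attention style, is F = F_R · (1 + F_M), so the mask amplifies informative regions while the skip term preserves the original features. The sketch below illustrates this combination only; the random `mask_logits` stand in for the UNet-style Masking Block, which is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    """Squash logits into (0, 1), matching the mask's value constraint."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
F_R = rng.standard_normal((64, 56, 56))          # residual-layer features
mask_logits = rng.standard_normal((64, 56, 56))  # placeholder for the MB output
F_M = sigmoid(mask_logits)                       # mask values in (0, 1)

# Residual-attention style combination (our assumption of the fusion rule):
refined = F_R * (1.0 + F_M)
print(refined.shape)  # (64, 56, 56)
```

Because 1 + F_M lies in (1, 2), the combination can only re-weight features upward, never zero them out, which keeps gradients flowing through unmasked regions.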

