Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning
Visual Reinforcement Learning (RL) agents must learn to act based on high-dimensional image data where only a small fraction of the pixels is task-relevant. This forces agents to waste exploration and computational resources on irrelevant features, leading to sample-inefficient and unstable learning. To address this, inspired by human visual foveation, we introduce Gaze on the Prize. This framework augments visual RL with a learnable foveal attention mechanism (Gaze), guided by a self-supervised signal derived from the agent’s experience pursuing higher returns (the Prize). Our key insight is that return differences reveal what matters most: If two similar representations produce different outcomes, their distinguishing features are likely task-relevant, and the gaze should focus on them accordingly. This is realized through return-guided contrastive learning that trains the attention to distinguish between the features relevant to success and failure. We group similar visual representations into positives and negatives based on their return differences and use the resulting labels to construct contrastive triplets. These triplets provide the training signal that teaches the attention mechanism to produce distinguishable representations for states associated with different outcomes. Our method achieves up to 2.52x improvement in sample efficiency and can solve challenging tasks from the ManiSkill3 benchmark that the baseline fails to learn, without modifying the underlying algorithm or hyperparameters.
💡 Research Summary
The paper “Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning” presents a novel framework designed to address the sample inefficiency problem in visual Reinforcement Learning (RL). In visual RL, agents must learn from high-dimensional pixel inputs, most of which are irrelevant to the task, leading to wasted exploration and computational resources.
Inspired by human visual foveation—the ability to focus high-resolution attention on a small, task-relevant region of the visual field—the authors introduce the “Gaze on the Prize” framework. Its core innovation is a learnable, parametric foveal attention mechanism, guided by a self-supervised signal derived from the agent’s own experience. The key insight is that differences in reward returns can reveal task relevance: if two visually similar states lead to different outcomes, the features that distinguish them are likely critical. This insight is operationalized through return-guided contrastive learning.
The method consists of four main components:

1) A Gaze Module that generates spatial attention weights in the form of a 2D anisotropic Gaussian, parameterized by just five values (center and covariance). This provides a human-like inductive bias and allows for explainable visualization of the agent’s focus.
2) A Contrastive Buffer that stores historical visual feature maps and their associated episode returns.
3) A Triplet Mining procedure that queries this buffer to find “anchor” states and their visually similar “positive” (higher-return) and “negative” (lower-return) neighbors.
4) A Contrastive Loss (triplet loss) that trains the attention module by pulling the attended representation of the anchor closer to that of the positive example and pushing it away from the negative example.

By optimizing this objective, the attention mechanism learns to focus on the image regions that are most discriminative for predicting high versus low returns.
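The two learnable pieces above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function names, the specific five-parameter covariance form (center, per-axis scales, correlation), and the closed-form mask are assumptions, and in the actual method these parameters are trained by gradient descent through the contrastive objective.

```python
import numpy as np

def gaussian_gaze_mask(h, w, mu_x, mu_y, sigma_x, sigma_y, rho=0.0):
    """Spatial attention mask: a 2D anisotropic Gaussian over an h x w grid.
    Coordinates are normalized to [0, 1]; (mu_x, mu_y) is the gaze center and
    (sigma_x, sigma_y, rho) parameterize the covariance -- five values total.
    (Hypothetical parameterization; the paper's exact form may differ.)"""
    ys, xs = np.mgrid[0:h, 0:w]
    dx = ((xs + 0.5) / w - mu_x) / sigma_x
    dy = ((ys + 0.5) / h - mu_y) / sigma_y
    z = (dx**2 - 2.0 * rho * dx * dy + dy**2) / (2.0 * (1.0 - rho**2))
    mask = np.exp(-z)
    return mask / mask.max()  # normalize so the peak weight is 1

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the anchor toward the higher-return
    positive, push it away from the lower-return negative."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)
```

A mined triplet (anchor, positive, negative) of attended representations would be fed to `triplet_loss`, and the resulting gradient would shape the five gaze parameters toward the return-discriminative region.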
A significant advantage of this framework is its design as a plug-in enhancement. It adds the gaze module and the auxiliary contrastive loss on top of the existing RL objective (e.g., PPO or SAC) without modifying the base algorithm’s core structure or hyperparameters. The attention is applied to the feature maps of a standard CNN backbone, and gradients from the contrastive loss update only the gaze parameters, leaving the visual encoder (whether pre-trained or trained concurrently) intact.
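As a rough sketch of how the attention slots into an otherwise unchanged pipeline, the mask simply reweights the backbone's feature map before pooling. The function name and the mean-pooling choice here are assumptions for illustration; only the forward pass is shown, and in the full method the triplet-loss gradient would be routed solely into the gaze parameters (e.g., by detaching the encoder output in an autodiff framework).

```python
import numpy as np

def attended_representation(feature_map, mask):
    """Weight a CNN feature map by the gaze mask, then spatially pool.

    feature_map: (C, H, W) activations from the unmodified backbone.
    mask: (H, W) gaze attention weights in [0, 1].
    Returns a (C,) vector fed to the base RL algorithm's heads.
    (Illustrative sketch; pooling and shapes are assumptions.)"""
    weighted = feature_map * mask[None, :, :]  # broadcast mask over channels
    return weighted.reshape(feature_map.shape[0], -1).mean(axis=1)
```

Because the mask only rescales existing features, the base algorithm sees an input of the same shape and semantics as before, which is why no hyperparameter changes are needed.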
Experiments on manipulation tasks from the ManiSkill3 benchmark demonstrate the framework’s effectiveness. It achieves up to a 2.52x improvement in sample efficiency over a standard CNN baseline and can solve challenging tasks where the baseline fails entirely. Ablation studies confirm the contributions of both the foveal attention structure and the return-guided contrastive learning. The method also shows favorable or comparable performance to other representation learning techniques like CURL, while offering the additional benefit of an interpretable attention map. The work successfully demonstrates how a simple biological principle (foveation) combined with a clever self-supervised signal (return difference) can significantly improve the efficiency and capability of visual RL agents.