CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Exploration remains a fundamental challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short in practical effectiveness. In this paper, we propose CAE, i.e., the Critic as an Explorer, a lightweight approach that repurposes the value networks in standard deep RL algorithms to drive exploration, without introducing additional parameters. CAE leverages multi-armed bandit techniques combined with a tailored scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and strong empirical stability. Remarkably, it is simple to implement, requiring only about 10 lines of code. For complex tasks where learning reliable value networks is difficult, we introduce CAE+, an extension of CAE that incorporates an auxiliary network. CAE+ increases the parameter count by less than 1% while preserving implementation simplicity, adding roughly 10 additional lines of code. Extensive experiments on MuJoCo, MiniHack, and Habitat validate the effectiveness of CAE and CAE+, highlighting their ability to unify theoretical rigor with practical efficiency.


💡 Research Summary

Exploration remains a central bottleneck in reinforcement learning (RL), especially in sparse‑reward or delayed‑reward settings. Existing approaches fall into two broad categories: (1) simple heuristics such as ε‑greedy or action‑space noise, which are sample‑inefficient and often fail on hard exploration problems; (2) sophisticated methods that add auxiliary networks (e.g., RND, ICM, E3B) to generate intrinsic rewards, which improve performance but increase computational overhead and lack rigorous theoretical guarantees. On the other hand, provably efficient exploration algorithms from the bandit literature (e.g., LSVI‑UCB, Neural‑UCB) either assume linear function approximation or require O(n³) computation where n is the number of network parameters, making them impractical for modern deep RL.

The paper introduces CAE (Critic as an Explorer), a lightweight scheme that repurposes the value (critic) network already present in any deep RL algorithm to produce exploration bonuses, without adding any new parameters. The key insight is to decompose the critic’s Q‑function as
 Q(s,a) = θᵀ ϕ(s,a|W),
where ϕ(·) denotes the embedding produced by the network’s hidden layers and θ is a linear head. This decomposition preserves the expressive power of deep networks while exposing a low‑dimensional context ϕ(s,a) that can be fed directly into linear multi‑armed bandit (MAB) techniques.
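This decomposition requires no architectural change: the critic's penultimate layer already produces ϕ, and the final linear layer is θ. A minimal NumPy sketch of the split (layer sizes and the single ReLU layer are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

# Sketch of the critic decomposition Q(s, a) = theta^T phi(s, a | W):
# the hidden layers (here, one ReLU layer with weights W1) produce the
# embedding phi, and a linear head theta maps it to the scalar Q-value.
# All dimensions are illustrative placeholders.
rng = np.random.default_rng(0)
state_dim, action_dim, embed_dim = 4, 2, 8

W1 = rng.normal(size=(state_dim + action_dim, embed_dim))  # hidden weights W
theta = rng.normal(size=embed_dim)                         # linear head theta

def phi(s, a):
    """Embedding produced by the critic's hidden layers."""
    return np.maximum(np.concatenate([s, a]) @ W1, 0.0)

def q_value(s, a):
    """Q(s, a) = theta^T phi(s, a | W)."""
    return theta @ phi(s, a)
```

The same ϕ(s, a) that the critic uses for value prediction then serves as the context vector for the bandit machinery below.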

Two classic MAB strategies are adapted:

  • Upper Confidence Bound (UCB): β(s,a) = √{ϕ(s,a)ᵀ A⁻¹ ϕ(s,a)}, where A is the regularized Gram matrix of past embeddings, initialized as A = λI and updated as A ← A + ϕ ϕᵀ after each transition.
  • Thompson Sampling (TS): Sample Δθ ∼ N(0, A⁻¹) and set β(s,a) = Δθᵀ ϕ(s,a).
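Both bonuses operate on the same embedding and share one Gram matrix. A minimal NumPy sketch of the two strategies (the dimension d and regularizer λ are illustrative placeholders, not the paper's settings):

```python
import numpy as np

# Sketch of the two bandit bonuses computed on the critic embedding phi.
# A is the regularized Gram matrix: initialized to lam * I and given a
# rank-1 update A <- A + phi phi^T after each observed transition.
d = 8
lam = 1.0
A = lam * np.eye(d)
rng = np.random.default_rng(1)

def ucb_bonus(phi):
    """UCB: beta = sqrt(phi^T A^{-1} phi)."""
    return np.sqrt(phi @ np.linalg.solve(A, phi))

def ts_bonus(phi):
    """Thompson sampling: draw Delta theta ~ N(0, A^{-1}), beta = Delta theta^T phi."""
    delta_theta = rng.multivariate_normal(np.zeros(d), np.linalg.inv(A))
    return delta_theta @ phi

def update(phi):
    """Rank-1 Gram-matrix update after a transition."""
    global A
    A += np.outer(phi, phi)
```

Intuitively, directions of ϕ-space that have been visited often inflate A, shrinking A⁻¹ and hence the bonus along those directions, so novel state-action embeddings receive larger exploration incentives.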

The exploration‑augmented Q‑value becomes Q(s,a) + α β(s,a), with α a scaling coefficient set via a classic online normalization scheme (Welford's running‑variance algorithm, 1962) that keeps the bonus comparable in magnitude to the Bellman loss and thereby preserves training stability.
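The scaling step can be sketched with Welford's online algorithm, which tracks the running mean and variance of the bonus stream in O(1) per step. The normalization below (dividing by the running standard deviation, with a fixed weight α) is an assumed instantiation for illustration, not necessarily the paper's exact rule:

```python
import numpy as np

# Welford's online mean/variance (Welford, 1962), used here as a sketch
# to keep the exploration bonus on a stable, comparable scale.
class Welford:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population std of all values seen so far; 1.0 before enough data.
        return np.sqrt(self.m2 / self.n) if self.n > 1 else 1.0

stats = Welford()

def scaled_bonus(beta, alpha=0.1):
    """Normalize the raw bonus by its running std, then weight by alpha."""
    stats.update(beta)
    return alpha * beta / (stats.std + 1e-8)
```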

Theoretical analysis shows that, under the standard assumption that the embedding dimension d is bounded, the regret of any RL algorithm equipped with CAE is sub‑linear: O(√{T d log T}) over T episodes. Thus, average regret vanishes as training proceeds, providing a rigorous guarantee that many heuristic methods lack.

For environments where learning a reliable critic is especially hard (e.g., extremely sparse rewards), the authors propose CAE+, which adds a tiny auxiliary network f = f̄ ∘ U. This network is deliberately lightweight (≈0.8 % of the base parameters) and requires only about ten extra lines of code. It supplies additional features to the embedding, improving uncertainty estimation without compromising the overall simplicity.
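The composed structure f = f̄ ∘ U can be sketched as a low-dimensional projection U followed by a tiny nonlinear map f̄, whose output augments the critic embedding. The dimensions and the concatenation used below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

# Sketch of CAE+'s auxiliary network f = f_bar ∘ U: project the critic
# embedding through U, pass it through a tiny nonlinear layer f_bar, and
# concatenate the result onto the embedding. Sizes are placeholders.
rng = np.random.default_rng(2)
embed_dim, proj_dim, aux_dim = 32, 8, 4

U = rng.normal(size=(embed_dim, proj_dim))    # low-dimensional projection U
W_bar = rng.normal(size=(proj_dim, aux_dim))  # tiny network f_bar

def f(phi):
    """f = f_bar ∘ U (one ReLU layer standing in for f_bar)."""
    return np.maximum((phi @ U) @ W_bar, 0.0)

def augmented_embedding(phi):
    """Critic embedding extended with the auxiliary features."""
    return np.concatenate([phi, f(phi)])
```

With these placeholder sizes the auxiliary parameters number only embed_dim·proj_dim + proj_dim·aux_dim, which illustrates how the addition stays a small fraction of the base network.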

Empirical evaluation spans three benchmark families:

  • MuJoCo (dense‑reward continuous control): CAE integrated with PPO, SAC, TD3, and DSA‑C accelerates convergence and improves final scores by 5‑12 % on average.
  • MiniHack (sparse‑reward, procedurally generated games): CAE outperforms state‑of‑the‑art intrinsic‑reward methods such as RND and E3B, achieving higher success rates and faster exploration coverage.
  • Habitat (reward‑free navigation): CAE+ shows a 30 % boost in exploration efficiency over the best baselines, demonstrating its value in highly challenging, reward‑scarce settings.

From a computational perspective, CAE’s overhead is limited to maintaining the d × d Gram matrix (O(d²) memory and computation) rather than the full parameter count, avoiding the O(n³) bottleneck of Neural‑UCB/TS. Consequently, training time and GPU memory usage remain comparable to the underlying RL algorithm.

In summary, CAE offers a conceptually simple yet theoretically grounded solution: reuse the critic’s own representation as a contextual bandit to drive exploration. It delivers provable regret bounds, negligible parameter increase, easy integration (≈10 lines of code), and strong empirical gains across diverse tasks. The authors release their implementation publicly, facilitating reproducibility and future extensions.

