The performance of reinforcement learning (RL) agents depends critically on the quality of the underlying feature representations. Hyperbolic feature spaces are well-suited for this purpose, as they naturally capture the hierarchical and relational structure often present in complex RL environments. However, leveraging these spaces commonly faces optimization challenges due to the nonstationarity of RL. In this work, we identify key factors that determine the success and failure of training hyperbolic deep RL agents. By analyzing the gradients of core operations in the Poincaré Ball and Hyperboloid models of hyperbolic geometry, we show that large-norm embeddings destabilize gradient-based training, leading to trust-region violations in proximal policy optimization (PPO). Based on these insights, we introduce Hyper++, a new hyperbolic PPO agent that consists of three components: (i) stable critic training through a categorical value loss instead of regression; (ii) feature regularization that guarantees bounded norms while avoiding the curse of dimensionality incurred by clipping; and (iii) a more optimization-friendly formulation of hyperbolic network layers. In experiments on ProcGen, we show that Hyper++ ensures stable learning, outperforms prior hyperbolic agents, and reduces wall-clock time by approximately 30%. On Atari-5 with Double DQN, Hyper++ strongly outperforms Euclidean and hyperbolic baselines. We release our code at https://github.com/Probabilistic-and-Interactive-ML/hyper-rl.
[Figure 1: Relative improvement (%) in normalized ProcGen test score of a Euclidean agent and Hyper++ (ours) over the Hyper+S-RYM baseline.]

Consider a chess-playing agent facing a difficult moment in its game. As it maps out future scenarios, these unfold into a tree of possible future states. Each action commits to one branch and rules out others, and the number of reachable positions grows exponentially with depth. Playing chess can thus be viewed as traversing this expanding tree of possible states. The same structure appears in other sequential decision-making benchmarks such as ProcGen BIGFISH (Cobbe et al., 2020). Here, the agent grows by eating smaller fish, and growth cannot be undone, inducing a natural order. In both cases, the data are inherently hierarchical: each state depends on its predecessors, and future states branch from the present one.

Tree-structured data from sequential decision problems like chess or BIGFISH cannot be embedded in Euclidean space without large distortion: Euclidean volume grows only polynomially in radius, whereas tree size grows exponentially (Sarkar, 2011; Gromov, 1987). This creates a mismatch between the hierarchical data produced by decision-making agents and the Euclidean representations used by modern deep networks (Cetin et al., 2023). We hypothesize that this mismatch contributes to the data-inefficiency and deployment challenges of deep RL despite impressive successes (Silver et al., 2016; Schrittwieser et al., 2020; Ouyang et al., 2022). But what if there were representations that better match the geometry of sequential decision making?

Hyperbolic geometry (Bolyai, 1896; Lobachevskiĭ, 1891) offers an appealing solution to the limitations of Euclidean representations: unlike Euclidean space, its exponential volume growth makes it a natural fit for embedding hierarchical data. Since its inception, hyperbolic deep learning has achieved strong results in classification (Ganea et al., 2018), graph learning (Chami et al., 2019), unsupervised representation learning (Mathieu et al., 2019), deep metric learning (Ermolov et al., 2022), and image-text alignment (Pal et al., 2025). Despite its conceptual appeal, broader adoption has been hampered by significant optimization challenges (Guo et al., 2022; Mishne et al., 2023). Thus, while hyperbolic geometry is inherently well-suited for RL, its broader adoption in deep RL hinges on a thorough understanding of the associated optimization challenges and potential failure modes. To this end, we study the heuristic trust-region algorithm proximal policy optimization (PPO) with a hybrid Euclidean-hyperbolic encoder, a commonly used architecture in deep RL (Cetin et al., 2023; Salemohamed et al., 2023). Despite the trust region, hyperbolic PPO agents face policy-learning issues from unstable encoder gradients, further amplified by nonstationary data and targets in deep RL (Cetin et al., 2023).
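To make the volume-growth argument concrete, a standard back-of-the-envelope comparison (our illustration of the argument cited above, not a result from this paper) contrasts how much room each geometry offers against how fast a tree grows:

```latex
% Volume within radius r in n dimensions vs. node count of a complete b-ary tree of depth d
\mathrm{Vol}_{\mathbb{R}^n}(r) \;\propto\; r^{n}
  \qquad\text{(Euclidean: polynomial)} \\
\mathrm{Vol}_{\mathbb{H}^n}(r) \;\propto\; \int_0^r \sinh^{\,n-1}(t)\,dt \;\sim\; e^{(n-1)r}
  \qquad\text{(hyperbolic: exponential)} \\
|T_b(d)| \;=\; \frac{b^{\,d+1}-1}{b-1} \;\sim\; b^{\,d}
  \qquad\text{(complete $b$-ary tree: exponential)}
```

Placing the roughly b^d nodes of such a tree at unit separation requires a Euclidean radius that grows exponentially in d (hence large distortion at any fixed radius), whereas a hyperbolic radius growing only linearly in d suffices; this is the intuition behind the low-distortion tree embeddings of Sarkar (2011).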
In this paper, we take a step towards a more principled understanding of the underlying optimization issues in hyperbolic deep RL. We start by analyzing the derivatives of key mathematical operations in hyperbolic deep learning, which we link to trust-region issues of hyperbolic PPO. We show that neither the Poincaré Ball nor the Hyperboloid, both common models of hyperbolic geometry, is immune to gradient instability. Grounded in this analysis, we propose a principled regularization approach to stabilize the training of hyperbolic agents. The resulting agent, HYPER++, ensures stable learning by pairing Euclidean feature regularization on the Hyperboloid with a categorical value loss to handle target nonstationarity. HYPER++ outperforms existing hyperbolic agents on ProcGen (Figure 1) while reducing wall-clock time by approximately 30%. We further show that our regularization approach generalizes beyond on-policy methods: applying the same ideas to DDQN (van Hasselt et al., 2016) on the Atari-3 benchmark (Aitchison et al., 2023) also yields strong performance improvements. Our code will be made available.
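As a minimal toy illustration of this failure mode (our own sketch, not the paper's analysis or code), consider the Poincaré distance to the origin, d(0, x) = 2 artanh(‖x‖) for curvature -1: its gradient norm is 2/(1 - ‖x‖²), which diverges as embeddings approach the boundary of the ball.

```python
import torch

def poincare_dist_to_origin(x: torch.Tensor) -> torch.Tensor:
    """Distance from the origin in the Poincare ball with curvature -1: 2 * artanh(||x||)."""
    norm = x.norm(dim=-1).clamp(max=1 - 1e-7)  # stay strictly inside the unit ball
    return 2.0 * torch.atanh(norm)

for r in (0.5, 0.9, 0.99, 0.999):
    # an 8-dimensional embedding whose Euclidean norm is exactly r
    x = torch.full((8,), r / 8 ** 0.5, requires_grad=True)
    poincare_dist_to_origin(x).backward()
    # the gradient norm equals 2 / (1 - r^2) and explodes as r -> 1
    print(f"||x|| = {r:<6}  ||grad|| = {x.grad.norm().item():.1f}")
```

When nonstationary targets push encoder outputs toward the boundary, gradients of this kind can grow without bound, which is one route to the trust-region violations described above.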
Our Key Contributions

1. Characterization of training issues. For both the Poincaré Ball and the Hyperboloid, we derive gradients of key operations and link them to training instability in deep RL.
2. Principled regularization. We study the weaknesses of current approaches and propose improvements rooted in our insights into hyperbolic deep RL training.
3. HYPER++, a strong hyperbolic agent with stable training. We integrate a categorical value loss, RMSNorm, and a novel scaling layer for the Hyperboloid model; a rough sketch of the norm-bounding idea follows below.
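To illustrate that norm-bounding idea, here is a rough sketch (our reading only, with hypothetical module names; the exact HYPER++ layers, including the scaling layer, are defined later in the paper) that RMS-normalizes Euclidean encoder features and lifts them onto the Hyperboloid with the exponential map at the origin:

```python
import torch
import torch.nn as nn

class RMSNormLift(nn.Module):
    """Illustrative only: RMS-normalize Euclidean features, then map them onto the
    Hyperboloid (Lorentz model, curvature -1) via the exponential map at the
    origin o = (1, 0, ..., 0)."""

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # RMSNorm bounds the feature norm: with unit gain, ||v|| ~ sqrt(dim) regardless of ||z||
        v = z * torch.rsqrt(z.pow(2).mean(dim=-1, keepdim=True) + self.eps) * self.gain
        # Exponential map at the origin of the Hyperboloid:
        #   exp_o(v) = (cosh(||v||), sinh(||v||) * v / ||v||)
        r = v.norm(dim=-1, keepdim=True).clamp_min(self.eps)
        time = torch.cosh(r)                     # time-like coordinate
        space = torch.sinh(r) * v / r            # space-like coordinates
        return torch.cat([time, space], dim=-1)  # satisfies <x, x>_L = -1

# usage: lift 64-dimensional encoder features to points on the 64-dimensional Hyperboloid
feats = torch.randn(32, 64)
x_hyp = RMSNormLift(64)(feats)
```

With unit gain, the RMS normalization pins the tangent-vector norm to roughly the square root of the feature dimension, independent of the input scale, so the lifted embeddings stay away from the large-norm regions where, per the analysis above, gradients blow up.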
This section first reviews Markov decision processes (MDPs) and the PPO optimization procedure (Section 2.1), then presents the mathematical foundations of hyperbolic representation learning in Section 2.2. A more thorough overview of the Poincaré Ball and Hyperboloid models can be found in Ganea et al. (2018); Shimizu et al. (2021); Bdeir et al. (2024).
We formalize RL as a discrete MDP M = ⟨S, A, P, R, γ⟩ with state space S and action space A, transition kernel P, reward function R, and discount factor γ.
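For reference, the clipped surrogate objective at the heart of PPO, in its standard form (Schulman et al., 2017), which the review in Section 2.1 presumably follows up to notation, is

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\Big(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
      \Big)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
```

where Â_t is an advantage estimate and the clipping parameter ε defines the heuristic trust region referenced throughout the introduction.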