ARLBench: Flexible and Efficient Benchmarking for Hyperparameter Optimization in Reinforcement Learning
Hyperparameters are a critical factor in reliably training well-performing reinforcement learning (RL) agents. Unfortunately, developing and evaluating automated approaches for tuning such hyperparameters is both costly and time-consuming. As a result, such approaches are often only evaluated on a single domain or algorithm, making comparisons difficult and limiting insights into their generalizability. We propose ARLBench, a benchmark for hyperparameter optimization (HPO) in RL that allows comparisons of diverse HPO approaches while being highly efficient in evaluation. To enable research into HPO in RL, even in settings with low compute resources, we select a representative subset of HPO tasks spanning a variety of algorithm and environment combinations. This selection allows for generating a performance profile of an automated RL (AutoRL) method using only a fraction of the compute previously necessary, enabling a broader range of researchers to work on HPO in RL. With the extensive and large-scale dataset on hyperparameter landscapes that our selection is based on, ARLBench is an efficient, flexible, and future-oriented foundation for research on AutoRL. Both the benchmark and the dataset are available at https://github.com/automl/arlbench.
💡 Research Summary
The paper introduces ARLBench, a benchmark designed to evaluate hyperparameter optimization (HPO) methods for reinforcement learning (RL) in a computationally efficient manner. Recognizing that RL algorithms such as DQN, PPO, and SAC depend on a large set of hyperparameters (often 10‑13) and that existing HPO studies are typically limited to a few environments, the authors construct a large meta‑dataset comprising over 100 000 training runs across diverse domains: Arcade Learning Environment (ALE), Classic Control, Box2D, Brax robot walkers, and XLand‑MiniGrid exploration tasks. Using this dataset, they apply a subset‑selection technique (inspired by Aitchison et al., 2023) to identify a small collection of environments that best predict average performance across the full suite.
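A greedy flavor of such a subset selection can be sketched as follows. This is an illustrative simplification, not the authors' exact procedure: given a matrix of normalized scores (HPO methods × environments), it repeatedly adds the environment whose inclusion best lets the subset mean predict each method's full-suite mean. The function name and the squared-error criterion are assumptions for the sketch.

```python
import numpy as np

def select_subset(scores: np.ndarray, k: int) -> list:
    """Greedily pick k environment indices whose mean score best
    predicts each method's average score over the full suite.

    scores: (n_methods, n_envs) matrix of normalized performances.
    """
    n_methods, n_envs = scores.shape
    full_mean = scores.mean(axis=1)  # target: per-method suite average
    chosen = []
    for _ in range(k):
        best_env, best_err = None, np.inf
        for env in range(n_envs):
            if env in chosen:
                continue
            # Error of predicting the full-suite mean from this candidate subset.
            subset_mean = scores[:, chosen + [env]].mean(axis=1)
            err = np.mean((subset_mean - full_mean) ** 2)
            if err < best_err:
                best_env, best_err = env, err
        chosen.append(best_env)
    return chosen

# Toy example: the third environment happens to equal each method's suite mean,
# so it is selected first.
toy_scores = np.array([[0.0, 2.0, 1.0],
                       [2.0, 4.0, 3.0]])
subset = select_subset(toy_scores, k=1)
```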
Implementation-wise, the three RL algorithms are re‑implemented in JAX, enabling dramatically faster training (seconds to minutes per run) and seamless integration with modern hardware accelerators. The benchmark provides a Gymnasium‑like “AutoRL Environment” interface: at each HPO iteration the optimizer supplies a hyperparameter configuration λₜ and a training budget bₜ; the environment then runs the RL algorithm, collects rewards, losses, gradient statistics, and returns both objective values and state features. Crucially, a built‑in checkpointing system allows dynamic HPO methods (e.g., Population‑Based Training, meta‑gradient adaptation) to pause, duplicate, or resume training mid‑run, something static HPO approaches cannot do.
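The interaction loop described above might look like the following toy sketch. The class and method names (`AutoRLEnv`, `step`, `checkpoint`, `restore`) and the dummy "training" dynamics are illustrative stand-ins, not the actual arlbench API; the point is the Gymnasium-style contract (supply a configuration and budget, get back objectives and state features) plus the checkpointing hooks that dynamic methods such as PBT rely on.

```python
import random

class AutoRLEnv:
    """Toy stand-in for an AutoRL environment: 'training' a configuration
    just accumulates noisy reward proportional to budget and learning rate."""

    def __init__(self):
        self._score = 0.0

    def reset(self):
        self._score = 0.0
        return {"step": 0}  # initial state features

    def step(self, config, budget):
        # Dummy training dynamics standing in for an actual RL training run.
        self._score += budget * config["lr"] * random.uniform(0.5, 1.5)
        objectives = {"reward": self._score}
        state = {"grad_norm": random.random()}  # e.g. gradient statistics
        return objectives, state

    def checkpoint(self):
        # Snapshot so a dynamic HPO method can later fork or resume this run.
        return {"score": self._score}

    def restore(self, ckpt):
        self._score = ckpt["score"]

# HPO loop: at each iteration t, propose a configuration λ_t and budget b_t.
env = AutoRLEnv()
env.reset()
for t in range(3):
    objectives, state = env.step({"lr": 3e-4}, budget=10_000)
```

A dynamic optimizer would interleave `checkpoint`/`restore` calls between `step`s, e.g. duplicating a well-performing member of a population before perturbing its hyperparameters.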
Empirical results show that evaluating a full HPO budget of 32 configurations with 10 seeds each across all three algorithms would require 8 163 GPU‑hours using Stable‑Baselines3 (SB3). In contrast, ARLBench’s JAX implementation reduces this to 937 GPU‑hours—a speed‑up of 7‑12× depending on the algorithm. The selected subset further cuts runtime by roughly a factor of 2.5, yielding total speed‑ups of 9.6× (PPO), 7.14× (DQN), and 11.61× (SAC) compared to SB3 on the full environment set.
Compared with the existing tabular benchmark HPO‑RL‑Bench, ARLBench offers (i) a far larger hyperparameter space, (ii) support for dynamic hyperparameter schedules via checkpointing, and (iii) a richer set of performance metrics (including runtime and carbon emissions). The authors argue that ARLBench fills a gap between zero‑cost tabular benchmarks and full‑training evaluations, providing a practical middle ground for researchers with limited compute.
The paper concludes with three main contributions: (1) an efficient, flexible RL HPO benchmark supporting static and dynamic methods; (2) a principled environment subset that captures the diversity of the RL task space while reducing computational cost by an order of magnitude; and (3) a publicly released dataset of over 100 000 runs (≈ 32 588 GPU‑hours). Future work includes expanding the subset to cover more specialized domains, developing surrogate models for truly zero‑cost evaluation, and integrating additional RL algorithms and high‑dimensional observation spaces.