CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FMs) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.
💡 Research Summary
CODE‑SHARP (Continuous Open‑ended Discovery and Evolution of Skills as Hierarchical Reward Programs) tackles two intertwined challenges in modern reinforcement learning: the need for open‑ended skill discovery and the difficulty of hand‑crafting reward functions for an ever‑expanding skill set. The authors propose to represent each skill as an executable Python program—a “Skill as Hierarchical Reward Program” (SHARP)—that contains (i) a binary success condition ϕ defining when the skill is completed, (ii) a set of environment condition functions c_i that check whether prerequisite resources or states are present, and (iii) references to prerequisite SHARP skills u_i that should be invoked when a condition is not satisfied. This formulation mirrors the Options framework but lifts it to the code level, allowing arbitrary, composable logic to be expressed directly in the reward function.
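The structure described above can be sketched as a small Python container. This is a hypothetical illustration, not the paper's actual class template: the `SHARPSkill` name, the `State` stand-in, and the example Craftax skills are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = dict  # stand-in for the environment state

@dataclass
class SHARPSkill:
    """Hypothetical sketch of a skill as a hierarchical reward program."""
    name: str
    success: Callable[[State], bool]  # phi: binary success condition
    # Each environment condition c_i is paired with the prerequisite
    # skill u_i to invoke when c_i is not satisfied.
    prerequisites: List[Tuple[Callable[[State], bool], "SHARPSkill"]]

    def reward(self, state: State) -> float:
        # The skill's reward is simply its success condition.
        return 1.0 if self.success(state) else 0.0

# Example composition: crafting a wood pickaxe presupposes holding wood.
collect_wood = SHARPSkill(
    name="collect_wood",
    success=lambda s: s.get("wood", 0) >= 1,
    prerequisites=[],
)
craft_wood_pickaxe = SHARPSkill(
    name="craft_wood_pickaxe",
    success=lambda s: s.get("wood_pickaxe", 0) >= 1,
    prerequisites=[(lambda s: s.get("wood", 0) >= 1, collect_wood)],
)
```

Because each skill is plain code, arbitrary condition logic composes naturally, and the prerequisite references form the edges of the archive's directed graph.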
The system revolves around two iterative, foundation‑model‑driven loops. The first loop discovers new SHARP skills. A “skill proposal generator” receives as context the current skill archive (a directed acyclic graph), previously failed proposals, the environment source code, and auxiliary tutorials. It outputs a set of pseudo‑code candidates, each with a high‑level description, a success condition, and a mapping from environment conditions to prerequisite skills. A “skill proposal implementor” translates these candidates into runnable Python code using a class template. Finally, a “skill proposal judge”—also a foundation model—evaluates each implementation on three criteria: (1) syntactic correctness (it compiles and references existing nodes), (2) feasibility (the current goal‑conditioned agent can learn it), and (3) novelty (it occupies a distinct region of the skill space). The judge selects the two most promising candidates, on which a copy of the agent is then trained. If learning progress is observed, the skill is added to the archive; otherwise it is recorded as a failed proposal.
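The propose → implement → judge → train pipeline can be summarized as a short loop. This is a minimal sketch: `fm_propose`, `fm_implement`, `fm_judge`, and `train_and_check` are hypothetical stand-ins for the three foundation-model roles and the agent-training check, not the paper's actual interfaces.

```python
def discovery_step(archive, failed,
                   fm_propose, fm_implement, fm_judge,
                   train_and_check, n_select=2):
    """One hypothetical iteration of the skill-discovery loop."""
    # 1) Generator proposes pseudo-code candidates given archive + failures.
    candidates = fm_propose(archive, failed)
    # 2) Implementor turns each candidate into runnable skill code.
    impls = [fm_implement(c) for c in candidates]
    # 3) Judge ranks by correctness / feasibility / novelty; keep the top two.
    selected = fm_judge(impls, archive)[:n_select]
    # 4) A copy of the agent trains on each selected skill.
    for skill in selected:
        if train_and_check(skill):   # learning progress observed?
            archive.append(skill)
        else:
            failed.append(skill)     # remembered to avoid re-proposal
    return archive, failed
```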
The second loop refines existing SHARP skills. Skills with low empirical success rates ρ_k are sampled with probability proportional to (1‑ρ_k). A “mutation proposal generator” receives the code of the sampled skill, the full archive, environment code, and auxiliary information, and produces m mutated pseudo‑code variants focusing on altering environment condition checks and prerequisite assignments. A “mutation proposal implementor” turns these into executable code. Because the agent’s policy is conditioned only on the active SHARP skill at each step, each mutation can be evaluated directly in the environment without retraining the agent. If a mutation yields a higher success rate than the current elite version, it replaces the node in the archive. This continual code‑level optimization enables the system to improve skill performance autonomously.
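The sampling rule and elite-replacement step above can be sketched in a few lines. Function names here are hypothetical; the evaluation call stands in for rolling out the frozen policy on the mutated skill.

```python
import random

def sample_skill_to_refine(success_rates, rng=random):
    """Sample a skill name with probability proportional to (1 - rho_k)."""
    names = list(success_rates)
    weights = [1.0 - success_rates[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

def maybe_replace_elite(archive, name, mutant, evaluate):
    """Keep a mutation only if it beats the current elite's success rate.

    `evaluate` stands in for measuring a skill's empirical success rate
    with the existing (not retrained) goal-conditioned policy.
    """
    if evaluate(mutant) > evaluate(archive[name]):
        archive[name] = mutant
    return archive
```

Because the policy conditions only on the active skill, evaluating a mutant requires no retraining, which is what makes this hill-climbing over code cheap.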
Training the agent proceeds in a continual, open‑ended fashion exclusively on rewards supplied by the SHARP archive. At the start of each episode a target SHARP skill is sampled uniformly; the episode ends after the skill is completed or after 300 environment steps, after which a new target is drawn. At every timestep the system traverses the target’s dependency graph to locate the “active” SHARP skill, using a transition operator T that follows unmet environment conditions to prerequisite skills until a fixed point is reached. The active skill’s name is embedded via a text encoder and concatenated with the raw state, providing the policy π(s,σ) with explicit skill context. To bias learning toward hard‑to‑reach skills, the authors introduce a prerequisite‑aware importance‑sampling scheme: each skill’s base weight B_j is inversely proportional to the cumulative success rates of its prerequisite skills, further filtered by a Top‑K selector. Additionally, adaptive reward scaling r_i = min(1/ρ_i, 10) inversely weights rewards by the skill’s current success rate, encouraging the agent to focus on under‑performing skills without over‑sampling their prerequisites.
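The per-timestep traversal via the transition operator T, and the adaptive reward scaling r_i = min(1/ρ_i, 10), can be sketched as follows. The `Skill` container and example skills are stand-ins for illustration, not the paper's template.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Skill:
    name: str
    # (environment condition c_i, prerequisite skill u_i) pairs
    prerequisites: List[Tuple[Callable[[dict], bool], "Skill"]]

def active_skill(target: Skill, state: dict) -> Skill:
    """Transition operator T: follow the first unmet environment condition
    to its prerequisite skill until a fixed point is reached."""
    skill = target
    while True:
        unmet = [u for cond, u in skill.prerequisites if not cond(state)]
        if not unmet:
            return skill  # all conditions met: this skill is active
        skill = unmet[0]

def scaled_reward(rho: float, cap: float = 10.0) -> float:
    """Adaptive reward scaling r_i = min(1 / rho_i, 10)."""
    return min(1.0 / max(rho, 1e-8), cap)

wood = Skill("collect_wood", [])
pickaxe = Skill("craft_wood_pickaxe",
                [(lambda s: s.get("wood", 0) >= 1, wood)])
```

With no wood in the inventory, targeting `craft_wood_pickaxe` makes `collect_wood` the active skill; once wood is held, the traversal reaches the target itself.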
Experiments are conducted in Craftax, a procedurally generated sandbox that blends Minecraft‑style crafting with NetHack‑style exploration, offering a rich open‑ended task space. Over three independent runs, the system performs 100 skill‑proposal iterations and 85 refinement iterations, discovering on average 90 diverse SHARP skills that span the Craftax skill space (e.g., mining, crafting, combat, navigation). A single goal‑conditioned agent trained solely on the generated rewards learns to solve long‑horizon objectives that baseline agents and pre‑trained experts cannot achieve. When a high‑level FM planner composes sequences of SHARP skills into policies‑in‑code, the combined system outperforms both pretrained agents and task‑specific expert policies by an average of 134% on complex, multi‑step quests.
Key contributions of CODE‑SHARP are: (1) Automatic generation and continual refinement of reward functions as executable code, eliminating manual reward engineering; (2) A hierarchical, graph‑based archive that explicitly encodes skill dependencies and enables dynamic traversal during execution; (3) Leveraging foundation models for both discovery and mutation, providing open‑ended expansion of the skill repertoire; (4) Integration of goal‑conditioned policies with adaptive reward scaling and importance sampling to efficiently learn long‑horizon, compositional tasks. The paper demonstrates that code‑level hierarchical rewards, when coupled with powerful language models, can drive truly open‑ended skill acquisition, opening avenues for applications in robotics, multi‑agent coordination, and complex game AI where manual reward design is infeasible. Future work may explore richer multimodal foundation models, human‑in‑the‑loop validation of generated skills, and scaling to higher‑dimensional environments.