Towards Adaptive Environment Generation for Training Embodied Agents
Embodied agents struggle to generalize to new environments, even when those environments share underlying structure with their training settings. Most current approaches to generating these training environments follow an open-loop paradigm that ignores the agent’s current performance. While procedural generation methods can produce diverse scenes, diversity without feedback from the agent is inefficient: the generated environments may be trivially easy, providing limited learning signal. To address this, we present a proof-of-concept for closed-loop environment generation that adapts difficulty to the agent’s current capabilities. Our system employs a controllable environment representation, extracts fine-grained performance feedback beyond binary success or failure, and implements a closed-loop adaptation mechanism that translates this feedback into environment modifications. This feedback-driven approach generates training environments that are more challenging in exactly the ways the agent needs to improve, enabling more efficient learning and better generalization to novel settings.
💡 Research Summary
The paper tackles the persistent problem that embodied agents—such as home‑assistant robots or industrial manipulators—fail to generalize to new indoor scenes even when those scenes share the same overall layout as the training environments. Existing training‑environment generation pipelines fall into two categories. Real‑world scan‑based simulators (e.g., HM3D, Replica) provide photorealism but are hard to modify programmatically, limiting curriculum design. Procedurally generated simulators (e.g., AI2‑THOR, ProcTHOR) allow systematic scene manipulation but typically operate in an open‑loop fashion: environments are sampled randomly without regard to the agent’s current abilities, leading to many trivially easy or irrelevant scenarios that waste training resources.
To overcome these limitations, the authors propose a closed‑loop environment generation framework that continuously adapts environment difficulty to the agent’s present performance. The system consists of three tightly coupled components:
- Controllable Environment Representation – Using ProcTHOR’s scene graph, each environment e is encoded as a structured tuple (O, A, R), where O is the set of objects, A(o) stores per‑object attributes (position, rotation, scale, material), and R(o_i, o_j) captures spatial or functional relations (e.g., “on”, “next‑to”). This explicit representation enables programmatic edits (adding, removing, or perturbing objects) and supports a validation module that checks for physical consistency (no collisions) and task solvability.
- Fine‑Grained Trajectory Analysis (F) – After deploying the current policy π_t in environment e_t, the agent’s trajectory τ_{e_t} is rendered as a top‑down map (or image sequence). A large language model (LLM), instantiated as GPT‑5‑mini, is prompted to interpret this visual trace and output a structured feedback object a_t = {outcome, concerns, suggestions}. “Outcome” is a binary success flag; “concerns” enumerate intermediate behavioral issues (e.g., unsafe clearance near doorways, inefficient path loops); “suggestions” are high‑level directives for environment modification (e.g., “add obstacles near doors”). This step extracts richer learning signals than the usual scalar reward.
- Closed‑Loop Adaptive Generator (G) – The generator receives the current scene graph e_t and the feedback a_t, and produces a new environment e_{t+1} = G(e_t, a_t). G is also LLM‑driven; it translates abstract suggestions into concrete editing actions such as “move sofa 0.8 m along the y‑axis” or “insert a chair at coordinates (x=2.3, y=1.5)”. The generated scene is then validated for physical feasibility and task solvability before being handed to the agent for further training, yielding an updated policy π_{t+1}. The loop repeats, forming an adaptive curriculum in which each iteration presents a slightly harder, yet learnable, scenario.
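The (O, A, R) scene representation and its validation step can be sketched in a few lines of Python. This is a minimal illustration, not ProcTHOR’s actual API: all class names, fields, and the toy clearance check below are assumptions introduced for exposition.

```python
# Sketch of the (O, A, R) scene-graph representation described above.
# Class and field names are illustrative assumptions, not ProcTHOR's API.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    position: tuple            # (x, y, z) in metres -- part of A(o)
    rotation: float = 0.0      # yaw in degrees      -- part of A(o)
    scale: float = 1.0
    material: str = "default"

@dataclass
class Scene:
    objects: dict = field(default_factory=dict)    # O: name -> SceneObject
    relations: list = field(default_factory=list)  # R: (subj, rel, obj) triples

    def add(self, obj: SceneObject):
        self.objects[obj.name] = obj

    def relate(self, subj: str, rel: str, obj: str):
        self.relations.append((subj, rel, obj))

    def validate(self, min_clearance: float = 0.1) -> bool:
        """Toy physical-consistency check: no two objects closer
        than min_clearance in the ground (x, z) plane."""
        objs = list(self.objects.values())
        for i in range(len(objs)):
            for j in range(i + 1, len(objs)):
                dx = objs[i].position[0] - objs[j].position[0]
                dz = objs[i].position[2] - objs[j].position[2]
                if (dx * dx + dz * dz) ** 0.5 < min_clearance:
                    return False
        return True

scene = Scene()
scene.add(SceneObject("sofa", (2.0, 0.0, 1.0)))
scene.add(SceneObject("chair", (2.3, 0.0, 1.5)))
scene.relate("chair", "next-to", "sofa")
print(scene.validate())  # True: sofa and chair are ~0.58 m apart
```

Because edits are ordinary data-structure operations on this graph, the generator can add, remove, or perturb objects programmatically and re-run `validate` before handing the scene back to the agent.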
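One iteration of the closed loop, feedback extraction (F) followed by scene editing (G), can be sketched as below. The LLM calls are stubbed with canned outputs; the feedback schema {outcome, concerns, suggestions} follows the paper, but every function name and the edit-operation format are hypothetical placeholders.

```python
# Minimal sketch of one closed-loop step: e_t, tau -> a_t -> e_{t+1}.
# analyze_trajectory and apply_edits are stand-ins for the LLM-driven F and G.

def analyze_trajectory(trajectory) -> dict:
    """Stand-in for F: the real system prompts an LLM with a rendered
    top-down map. Here we return a canned structured feedback object."""
    return {
        "outcome": True,
        "concerns": ["inefficient path loops near the doorway"],
        "suggestions": [{"op": "insert", "object": "chair", "position": (2.3, 1.5)}],
    }

def apply_edits(scene: dict, feedback: dict) -> dict:
    """Stand-in for G: translate suggestions into concrete scene edits."""
    new_scene = dict(scene)
    for s in feedback["suggestions"]:
        if s["op"] == "insert":
            new_scene[s["object"]] = {"position": s["position"]}
        elif s["op"] == "remove":
            new_scene.pop(s["object"], None)
    return new_scene

# One loop iteration over a toy scene e_t.
e_t = {"sofa": {"position": (2.0, 1.0)}}
a_t = analyze_trajectory(trajectory=None)
e_next = apply_edits(e_t, a_t)
print(sorted(e_next))  # ['chair', 'sofa']
```

In the full system, `e_next` would pass through validation and then be used to train π_{t+1}, after which the loop repeats with a fresh trajectory.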
Mathematically, the objective is to maximize J(G) = 𝔼_t[…].