Demonstration-Free Robotic Control via LLM Agents
Robotic manipulation has increasingly adopted vision-language-action (VLA) models, which achieve strong performance but typically require task-specific demonstrations and fine-tuning, and often generalize poorly under domain shift. We investigate whether general-purpose large language model (LLM) agent frameworks, originally developed for software engineering, can serve as an alternative control paradigm for embodied manipulation. We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. Using the same iterative reasoning that enables software agents to debug code, FAEA enables embodied agents to reason through manipulation strategies. We evaluate an unmodified frontier agent, Claude Agent SDK, across the LIBERO, ManiSkill3, and MetaWorld benchmarks. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. This level of task success approaches that of VLA models trained with fewer than 100 demonstrations per task, without requiring demonstrations or fine-tuning. With one round of human feedback as an optional refinement, performance increases to 88.2% on LIBERO. This demonstration-free capability has immediate practical value: FAEA can autonomously explore novel scenarios in simulation and generate successful trajectories for training data augmentation in embodied learning. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning. This opens a path for robotics systems to leverage actively maintained agent infrastructure and benefit directly from ongoing advances in frontier models. Code is available at https://github.com/robiemusketeer/faea-sim
💡 Research Summary
The paper introduces FAEA (Frontier Agent as Embodied Agent), a demonstration‑free framework that directly applies a general‑purpose large language model (LLM) agent to robotic manipulation. Instead of training vision‑language‑action (VLA) models on thousands of teleoperated demonstrations, the authors leverage the Claude Agent SDK—a production‑grade LLM agent infrastructure originally built for software engineering. The core of FAEA is a ReAct loop: the agent receives a task description, reasons about a high‑level plan, writes a short Python script that calls a set of predefined robot‑control tools (e.g., reset(), step(action), get_obs(), check_success()), executes the script in a simulated environment, observes the outcome (success flag, error messages, state feedback), and iterates. No gradient updates are performed; the agent discovers successful policies through in‑context program synthesis and trial‑and‑error, much like a human learning by practice.
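The attempt loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the tool names `reset()`, `step()`, `get_obs()`, and `check_success()` come from the summary, while the toy one-dimensional environment and the stubbed-out `propose_script` (which stands in for the LLM writing a control script) are assumptions made for the sake of a runnable example.

```python
class ToyEnv:
    """Stand-in simulator exposing the tool API named in the summary."""

    def __init__(self, goal=5):
        self.goal, self.pos = goal, 0

    def reset(self):
        self.pos = 0
        return self.get_obs()

    def step(self, action):
        # action is -1, 0, or +1 along a single axis in this toy setting
        self.pos += action
        return self.get_obs()

    def get_obs(self):
        return {"pos": self.pos, "goal": self.goal}

    def check_success(self):
        return self.pos == self.goal


def propose_script(task, context):
    """Placeholder for the LLM call: in FAEA, the agent writes a fresh
    Python script conditioned on the task, the tool API, and the feedback
    accumulated from earlier attempts (here, `context`)."""
    def script(env):
        obs = env.reset()
        for _ in range(20):
            if env.check_success():
                return True
            env.step(1 if obs["pos"] < obs["goal"] else -1)
            obs = env.get_obs()
        return env.check_success()
    return script


def run_attempts(env, task, max_attempts=5):
    context = []  # accumulated outcomes across attempts (the C_i of the paper)
    for i in range(max_attempts):
        script = propose_script(task, context)
        success = script(env)
        context.append({"attempt": i, "success": success, "obs": env.get_obs()})
        if success:
            return True, i + 1
    return False, max_attempts


ok, attempts = run_attempts(ToyEnv(), "move the end-effector to the goal")
print(ok, attempts)  # the trivial policy above succeeds on the first attempt
```

The key design point is that no weights change between attempts: all learning lives in `context`, the textual record of prior scripts and their outcomes that is fed back into the next LLM call.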
Experiments were conducted on three widely used simulation benchmarks: LIBERO (120 long‑horizon tasks with a Franka Panda arm), ManiSkill3 (14 tasks with domain randomization), and MetaWorld (50 tabletop tasks with a Sawyer arm). The agent was granted privileged access to ground‑truth state (object positions, gripper pose) rather than raw RGB images, isolating the reasoning capability of the LLM from perception challenges. Using Claude Opus 4.5, FAEA achieved success rates of 84.9% on LIBERO, 85.7% on ManiSkill3, and 96% on MetaWorld. These numbers are comparable to, and in some cases exceed, those of state‑of‑the‑art VLA models trained on ≤100 demonstrations per task. Adding a single round of human coaching—high‑level heuristic tips embedded in the prompt—raised LIBERO performance to 88.2%.
Key methodological contributions include: (1) a formal description of iterative program synthesis where each script σ_i is conditioned on the task, the tool set, and the accumulated context C_i from previous attempts; (2) a concise prompt template that defines the agent’s role, success criteria, and optional coaching tips; (3) an automated trace‑validation step using Claude Code to detect cheating (e.g., hard‑coded simulator coordinates) and ensure that success stems from legitimate reasoning. The average number of attempts per task ranged from 2 to 26, depending on difficulty, demonstrating that the agent can efficiently converge to a solution without external supervision.
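The iterative program synthesis in contribution (1) can be written compactly. The notation below follows the summary's σ_i and C_i; the symbols for the task (τ), tool set (𝒜), and outcome (o_i) are ours, introduced only to make the recursion explicit:

```latex
\sigma_i \sim \pi_{\mathrm{LLM}}\!\left(\,\cdot \mid \tau,\ \mathcal{A},\ C_i\right),
\qquad
C_{i+1} = C_i \cup \{(\sigma_i, o_i)\},
\qquad
C_0 = \varnothing,
```

where $\tau$ is the task description, $\mathcal{A}$ the fixed tool API, $\sigma_i$ the script proposed at attempt $i$, and $o_i$ its observed outcome (success flag, errors, state feedback). The loop terminates when $o_i$ reports success or an attempt budget is exhausted.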
The authors acknowledge several limitations. The reliance on privileged state means that real‑world deployment would require a robust perception pipeline to replace ground‑truth observations. The approach assumes the existence of a uniform tool API; extending to heterogeneous robot platforms would necessitate engineering such interfaces. The decision‑making cycle operates at the task‑level (seconds per iteration), which may be insufficient for fast, dynamic manipulation requiring millisecond‑scale feedback. Finally, Claude Opus 4.5 is currently an expensive model, and large‑scale real‑time deployment would need cost‑effective alternatives.
Despite these constraints, FAEA offers a compelling new direction for robotics. By generating successful trajectories autonomously, it can serve as a data‑augmentation engine for VLA training, reducing the need for costly teleoperation datasets. Moreover, because the framework relies on a continuously updated LLM agent infrastructure, robotics systems can inherit improvements in reasoning, tool use, and safety without retraining. Future work is suggested in three areas: (i) integrating visual state estimation to test the method on physical robots, (ii) combining multimodal LLMs that process images and text to close the perception gap, and (iii) hybridizing high‑level LLM planning with low‑level reactive controllers for real‑time control.
In summary, the paper demonstrates that a state‑of‑the‑art LLM agent, when equipped with appropriate tool interfaces and a simple ReAct loop, can solve a broad set of manipulation tasks without any demonstration data or fine‑tuning. This “demonstration‑free” capability challenges the prevailing paradigm of data‑heavy policy learning and opens a practical pathway for leveraging rapid advances in frontier language models within robotic systems.