ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control
Achieving robust, human-like whole-body control on humanoid robots for agile, contact-rich behaviors remains a central challenge, demanding heavy per-skill engineering and a brittle process of tuning controllers. We introduce ZEST (Zero-shot Embodied Skill Transfer), a streamlined motion-imitation framework that trains policies via reinforcement learning from diverse sources – high-fidelity motion capture, noisy monocular video, and non-physics-constrained animation – and deploys them to hardware zero-shot. ZEST generalizes across behaviors and platforms while avoiding contact labels, reference or observation windows, state estimators, and extensive reward shaping. Its training pipeline combines adaptive sampling, which focuses training on difficult motion segments, and an automatic curriculum using a model-based assistive wrench, together enabling dynamic, long-horizon maneuvers. We further provide a procedure for selecting joint-level gains from approximate analytical armature values for closed-chain actuators, along with a refined model of actuators. Trained entirely in simulation with moderate domain randomization, ZEST demonstrates remarkable generality. On Boston Dynamics’ Atlas humanoid, ZEST learns dynamic, multi-contact skills (e.g., army crawl, breakdancing) from motion capture. It transfers expressive dance and scene-interaction skills, such as box-climbing, directly from videos to Atlas and the Unitree G1. Furthermore, it extends across morphologies to the Spot quadruped, enabling acrobatics, such as a continuous backflip, through animation. Together, these results demonstrate robust zero-shot deployment across heterogeneous data sources and embodiments, establishing ZEST as a scalable interface between biological movements and their robotic counterparts.
💡 Research Summary
The paper introduces ZEST (Zero‑shot Embodied Skill Transfer), a unified motion‑imitation framework that learns whole‑body control policies for legged robots directly from heterogeneous human motion sources and deploys them to hardware without any fine‑tuning. ZEST accepts three types of data: high‑fidelity motion‑capture (MoCap) clips, noisy monocular video (V‑Cap) recordings, and clean but physically unconstrained key‑frame animations. By treating all of these uniformly as "reference motions", the system avoids the need for hand‑crafted contact labels, future reference windows, observation histories, or external state estimators.
The training pipeline consists of three core ideas. First, adaptive sampling splits each reference trajectory into fixed‑duration bins and estimates a difficulty score for each bin using an exponential moving average of failure rates. A categorical sampler then biases episode initialization toward harder bins, ensuring that the policy spends more training time on challenging segments and preventing catastrophic forgetting over long‑horizon clips. Second, a model‑based assistive wrench is applied virtually to the robot's base during early training; its magnitude is automatically scaled according to bin difficulty and decays to zero as tracking improves, providing a smooth curriculum without manual shaping. Third, the reward function is deliberately simple: a weighted L2 distance between the current joint state and the next‑step reference, plus regularization on joint velocities and torques. This single, consistent reward eliminates the extensive reward engineering that typically plagues tabula‑rasa RL.
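The adaptive-sampling and curriculum mechanics described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the class name, EMA coefficient, uniform-mixing floor, and the linear coupling between bin difficulty and assist magnitude are all our assumptions, chosen to match the prose (per-bin EMA of failure rates, categorical sampling biased toward hard bins, assistance that decays as tracking improves).

```python
import numpy as np

class AdaptiveBinSampler:
    """Difficulty-weighted sampling over fixed-duration reference bins.

    Hypothetical sketch: tracks an exponential moving average (EMA) of
    per-bin failure rates, draws episode-start bins from a categorical
    distribution biased toward harder bins, and scales an assistive
    wrench by the current bin difficulty so it decays to zero as
    tracking improves.
    """

    def __init__(self, num_bins, ema_alpha=0.1, uniform_floor=0.05):
        self.difficulty = np.full(num_bins, 0.5)  # EMA of failure rate per bin
        self.alpha = ema_alpha                    # EMA step size (assumed value)
        self.uniform_floor = uniform_floor        # keeps easy bins in rotation

    def update(self, bin_idx, failed):
        # EMA update: difficulty <- (1 - a) * difficulty + a * outcome
        d = self.difficulty[bin_idx]
        self.difficulty[bin_idx] = (1.0 - self.alpha) * d + self.alpha * float(failed)

    def probs(self):
        # Difficulty-proportional weights, mixed with a uniform floor so
        # every bin retains nonzero sampling probability.
        p = self.difficulty / self.difficulty.sum()
        u = np.full_like(p, 1.0 / len(p))
        return (1.0 - self.uniform_floor) * p + self.uniform_floor * u

    def sample(self, rng):
        # Categorical draw of the next episode's starting bin.
        return int(rng.choice(len(self.difficulty), p=self.probs()))

    def assist_scale(self, bin_idx, max_assist=1.0):
        # Assistive-wrench magnitude tied to bin difficulty: as the EMA
        # failure rate falls toward zero, assistance decays to zero.
        return max_assist * self.difficulty[bin_idx]
```

In this sketch the curriculum needs no manual schedule: both the sampling distribution and the assistance level are driven by the same per-bin failure statistic.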
ZEST policies are implemented as shallow feed‑forward networks that receive only proprioceptive observations (joint positions, velocities, body orientation) and the previous action, together with the next‑step reference joint targets. The network outputs residual joint commands that are added to the reference and sent to a joint‑level PD controller. PD gains are not hand‑tuned; instead they are derived from an analytical armature model that approximates the effective inertia of closed‑chain actuators (e.g., knees, ankles). For the Spot quadruped, a more detailed power‑system model is incorporated. This analytical gain selection enables the same hyper‑parameter set to be used across very different platforms: the 1.8 m, 100 kg Atlas humanoid, the smaller Unitree G1 humanoid, and the Spot quadruped.
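The residual-action interface and analytical gain selection can be sketched as below. The specific gain formula is our assumption (the paper derives gains from approximate analytical armature values but the summary does not give the formula): here we use the standard second-order rule that, for effective reflected inertia $I$, a target natural frequency $\omega_n$ and damping ratio $\zeta$ give $k_p = I\omega_n^2$ and $k_d = 2\zeta I\omega_n$. The function names and default constants are hypothetical.

```python
import numpy as np

def pd_gains_from_armature(armature, omega_n=40.0, zeta=1.0):
    """Choose joint-level PD gains from effective reflected inertia.

    Assumed scheme (not the paper's exact formula): give every joint the
    same closed-loop natural frequency omega_n [rad/s] and damping ratio
    zeta, so kp = I * omega_n^2 and kd = 2 * zeta * I * omega_n. This is
    what lets one hyperparameter set carry across platforms -- only the
    per-joint armature values change.
    """
    armature = np.asarray(armature, dtype=float)
    kp = armature * omega_n ** 2
    kd = 2.0 * zeta * armature * omega_n
    return kp, kd

def residual_pd_torque(q, qd, q_ref, residual, kp, kd):
    """Residual action interface: the policy's output is added to the
    next-step reference joint targets, and a joint-level PD controller
    converts the combined target into torque."""
    q_target = q_ref + residual
    return kp * (q_target - q) - kd * qd
```

A usage note: because the residual is added to the reference, a zero-output (untrained) policy already tracks the reference open-loop through the PD controller, which makes early training far more stable than commanding torques from scratch.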
Training is performed entirely in simulation with moderate domain randomization (variations in friction, mass, sensor noise, and impulsive pushes). Despite this modest randomization, policies trained for roughly ten hours (≈7 k iterations) on a single NVIDIA L4 GPU transferred to hardware with consistent success across multiple trials.
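The "moderate" randomization described above might look like the following per-episode sampler. All ranges here are illustrative placeholders of our own choosing, not the paper's values; only the randomized quantities (friction, mass, sensor noise, impulsive pushes) come from the summary.

```python
import numpy as np

def sample_domain_randomization(rng):
    """Draw one episode's simulation perturbations.

    Sketch with placeholder ranges (our assumptions): scale contact
    friction and link masses, add Gaussian noise to proprioception, and
    schedule random impulsive pushes on the base.
    """
    return {
        "friction_scale": rng.uniform(0.5, 1.25),      # contact friction multiplier
        "mass_scale": rng.uniform(0.9, 1.1),           # per-link mass multiplier
        "joint_pos_noise_std": 0.01,                   # rad, sensor noise
        "joint_vel_noise_std": 0.1,                    # rad/s, sensor noise
        "push_interval_s": rng.uniform(3.0, 8.0),      # time between pushes
        "push_impulse_ns": rng.uniform(0.0, 20.0),     # push magnitude, N*s
    }
```

Keeping the randomization modest is part of the claim: the policies transfer zero-shot without the very wide randomization ranges that often degrade motion quality.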
Empirical results demonstrate the breadth of ZEST. On Atlas, MoCap‑derived policies achieve dynamic multi‑contact skills such as army crawl, break‑dancing, forward rolls, and cartwheels. From V‑Cap videos, the system learns expressive dances, box‑climbing, ballet moves, and soccer kicks, and transfers them zero‑shot to both Atlas and the Unitree G1. From animation data, Spot learns acrobatic feats including continuous backflips and barrel rolls. Notably, the video‑derived demonstrations contain pose jitter and foot‑skidding; ZEST’s preprocessing and curriculum automatically compensate for these artifacts, showing robustness to noisy inputs.
The authors emphasize that ZEST avoids the multi‑stage pipelines common in prior work (e.g., separate motion‑generation, tracking, and fine‑tuning stages). By using single‑stage RL training with a unified reward and minimal auxiliary components, the framework achieves a high degree of scalability and reproducibility. The paper positions ZEST as a practical interface that bridges the rich repertoire of human motion—whether captured in a lab, recorded on a phone, or generated by artists—to real‑world robots, enabling zero‑shot deployment of agile, contact‑rich behaviors across diverse embodiments.