Embedding Classical Balance Control Principles in Reinforcement Learning for Humanoid Recovery
Humanoid robots remain vulnerable to falls and unrecoverable failure states, limiting their practical utility in unstructured environments. While reinforcement learning has demonstrated stand-up behaviors, existing approaches treat recovery as a pure task-reward problem without an explicit representation of the balance state. We present a unified RL policy that addresses this limitation by embedding classical balance metrics (capture point, center-of-mass state, and centroidal momentum) as privileged critic inputs and by shaping rewards directly around these quantities during training, while the actor relies solely on proprioception for zero-shot hardware transfer. Without reference trajectories or scripted contacts, a single policy spans the full recovery spectrum: ankle and hip strategies for small disturbances, corrective stepping under large pushes, and compliant falling with multi-contact stand-up using the hands, elbows, and knees. Trained on the Unitree H1-2 in Isaac Lab, the policy achieves a 93.4% recovery rate across randomized initial poses and unscripted fall configurations. An ablation study shows that removing the balance-informed structure causes stand-up learning to fail entirely, confirming that these metrics provide a meaningful learning signal rather than incidental structure. Sim-to-sim transfer to MuJoCo and preliminary hardware experiments further demonstrate cross-environment generalization. These results show that embedding interpretable balance structure into the learning framework substantially reduces time spent in failure states and broadens the envelope of autonomous recovery.
💡 Research Summary
The paper tackles the persistent problem of humanoid robots failing to recover from falls, which limits their deployment in unstructured settings. While recent reinforcement‑learning (RL) approaches have demonstrated stand‑up behaviors, they typically treat recovery as a pure task‑reward problem and lack an explicit representation of the robot’s balance state. The authors propose a unified RL framework that embeds classical balance metrics—capture point, center‑of‑mass (CoM) state, and centroidal momentum—into the learning process. These metrics are supplied as privileged inputs to an asymmetric critic during training, while the actor receives only proprioceptive observations (joint positions, velocities, base angular velocity, and gravity direction), enabling zero‑shot transfer to hardware without additional sensors.
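The asymmetric split described above can be sketched as two observation builders: the actor sees only quantities measurable on hardware, while the critic additionally receives simulator-only balance state. This is an illustrative sketch, not the paper's code; the function names and array dimensions are assumptions.

```python
import numpy as np

def actor_observation(joint_pos, joint_vel, base_ang_vel, gravity_dir):
    """Proprioceptive-only observation, available on the real robot
    (joint encoders + IMU); no external state estimation required."""
    return np.concatenate([joint_pos, joint_vel, base_ang_vel, gravity_dir])

def critic_observation(actor_obs, com_pos, com_vel, com_acc,
                       lin_momentum, ang_momentum, capture_point_xy):
    """Privileged observation for the asymmetric critic: the actor's
    view plus simulator-only balance quantities (CoM kinematics,
    centroidal momentum, capture-point location)."""
    return np.concatenate([actor_obs, com_pos, com_vel, com_acc,
                           lin_momentum, ang_momentum, capture_point_xy])
```

Because the privileged terms enter only the critic, they shape the value estimates during training but are never needed at deployment time, which is what enables zero-shot transfer without extra sensors.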
The reward function is explicitly shaped around the balance quantities. Group I rewards drive vertical elevation toward a target CoM height using a Gaussian height‑tracking term, asymmetric momentum rewards for upward motion, and a stabilization bonus near the target. Group II rewards enforce static and dynamic stability: a Gaussian term rewards CoM projections inside the support polygon, while a capture‑point term penalizes configurations where the capture point lies outside the foot support region, directly encoding the need for a stepping action. Group III adds safety regularizers for torque limits, joint limits, and action smoothness.
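A minimal sketch of the balance-shaped terms described above, assuming the linear-inverted-pendulum capture point and illustrative target height, length scales, and an axis-aligned approximation of the support polygon (the paper's exact constants and polygon computation are not given here):

```python
import numpy as np

G = 9.81  # gravitational acceleration [m/s^2]

def capture_point(com_xy, com_vel_xy, com_height):
    """Instantaneous capture point of the linear inverted pendulum:
    x_cp = x_com + v_com * sqrt(z_com / g)."""
    return com_xy + com_vel_xy * np.sqrt(com_height / G)

def height_tracking_reward(com_height, target=0.9, sigma=0.1):
    """Group I: Gaussian reward peaking when the CoM reaches the
    target stand-up height (illustrative target and width)."""
    return float(np.exp(-((com_height - target) / sigma) ** 2))

def support_reward(point_xy, support_min, support_max, sigma=0.05):
    """Group II: Gaussian reward for a ground-plane point (CoM or
    capture-point projection) lying inside an axis-aligned box
    approximating the support polygon; decays with the distance by
    which the point leaves the region, so a capture point outside
    the feet is penalized and stepping becomes rewarding."""
    overshoot = (np.maximum(support_min - point_xy, 0.0)
                 + np.maximum(point_xy - support_max, 0.0))
    return float(np.exp(-(np.linalg.norm(overshoot) / sigma) ** 2))
```

With zero CoM velocity the capture point coincides with the CoM projection, so the same `support_reward` term covers both the static (CoM-in-polygon) and dynamic (capture-point-in-polygon) conditions.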
Training is performed in Isaac Lab on a simulated Unitree H1‑2 using Proximal Policy Optimization (PPO) with a three‑layer MLP for both actor and critic. The critic’s privileged observation set includes full CoM position, velocity, acceleration, whole‑body linear and angular momentum, and capture‑point location—information unavailable on the real robot. A curriculum alternates between fall induction (random pushes and initial poses) and stand‑up phases, exposing the policy to a wide range of disturbances. Randomized command delays (10–40 ms) and observation noise are injected to improve robustness.
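The randomized command delay can be implemented as a small per-episode action buffer. The sketch below assumes a hypothetical 20 ms control step (50 Hz), so the 10–40 ms delay range corresponds to roughly one to two buffered steps; the class name and step rate are assumptions for illustration.

```python
import random
from collections import deque

class DelayedActions:
    """Applies a command delay resampled once per episode, emulating
    actuation latency: the action executed now is one issued
    `delay_steps` control steps ago."""

    def __init__(self, dt_ms=20.0, min_delay_ms=10.0, max_delay_ms=40.0):
        delay_ms = random.uniform(min_delay_ms, max_delay_ms)
        delay_steps = round(delay_ms / dt_ms)  # delay in control steps
        # Fixed-length FIFO: oldest retained action is the one applied.
        self.buffer = deque(maxlen=delay_steps + 1)

    def __call__(self, action):
        self.buffer.append(action)
        return self.buffer[0]
```

Resampling the delay at each episode reset (rather than per step) exposes the policy to a distribution of latencies without making individual episodes non-stationary, which is a common robustness trick for sim-to-real transfer.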
Experimental results show a 93.4% recovery success rate across randomized initial poses and unscripted fall configurations in simulation. An ablation study reveals that removing privileged balance inputs and capture‑point rewards causes the policy to fail to leave the ground entirely (termination rate rises from 0.067 to 1.0). The same policy transfers unchanged to MuJoCo (sim‑to‑sim) and to real hardware: ten zero‑shot trials on the Unitree H1‑2 achieve nine successful recoveries, demonstrating ankle/hip stabilization, corrective stepping, and multi‑contact stand‑up using hands, elbows, and knees. The policy exhibits smooth joint trajectories, respects torque limits, and recovers within an average of 2.3 seconds.
The work’s significance lies in demonstrating that classical balance analysis can be leveraged as a structured learning signal, dramatically improving sample efficiency, generalization, and safety of RL‑based recovery without sacrificing the adaptability of learned controllers. Limitations include reliance on accurate capture‑point and CoM estimates during training; real‑time estimation on hardware remains challenging, suggesting future work on integrating online state estimators or progressively reducing privileged information to further close the sim‑to‑real gap.