Learning Thermal-Aware Locomotion Policies for an Electrically-Actuated Quadruped Robot
Electrically-actuated quadrupedal robots possess high mobility on complex terrains, but their motors accumulate heat under high-torque cyclic loads, potentially triggering overheat protection and limiting long-duration tasks. This work proposes a thermal-aware control method that incorporates motor temperatures into reinforcement learning locomotion policies and introduces thermal-constraint rewards to keep motor temperatures below their limits. Real-world experiments on the Unitree A1 demonstrate that, under a fixed 3 kg payload, the baseline policy triggers overheat protection and stops within approximately 7 minutes, whereas the proposed method operates continuously for over 27 minutes without thermal interruptions while maintaining comparable command-tracking performance, thereby extending sustained operation.
💡 Research Summary
This paper addresses the critical problem of motor overheating in electrically‑actuated quadrupedal robots, which limits their ability to perform long‑duration tasks under high‑torque cyclic loads. The authors propose a thermal‑aware locomotion control method that integrates motor temperature measurements directly into the state space of a reinforcement‑learning (RL) policy and introduces a temperature‑constraint reward to keep temperatures below a safety threshold.
A first‑order thermal model is used for each motor, capturing Joule heating (proportional to the square of the current, approximated by torque) and heat exchange with the environment. Because the motors are densely packed, the authors construct a whole‑body thermal model comprising 14 nodes (12 joint motors, the onboard computer, and the ambient environment). The model is expressed in discrete‑time state‑space form, with a system matrix that encodes thermal coupling between neighboring nodes. During simulation, the RMS torque over each thermal‑model update interval is fed as the heat input, allowing real‑time temperature updates.
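The discrete-time update described above can be sketched as follows. This is a minimal illustration of a 14-node forward-Euler thermal model with Joule heating from RMS torques; all node couplings, heat capacities, the update interval, and the Joule coefficient are assumed values, not the paper's identified parameters.

```python
import numpy as np

N = 14                      # 12 joint motors + onboard computer + ambient
DT = 0.1                    # thermal-model update interval [s] (assumed)

C = np.full(N, 50.0)        # heat capacities [J/K] (assumed)
C[-1] = np.inf              # ambient node: fixed temperature (infinite capacity)

# Symmetric conductance matrix G[i, j] [W/K] between coupled nodes.
# The sparsity pattern (which motors are neighbors) is illustrative.
G = np.zeros((N, N))
for i in range(12):
    G[i, -1] = G[-1, i] = 0.8      # every motor exchanges heat with ambient
G[12, -1] = G[-1, 12] = 1.5        # onboard computer to ambient
G[0, 1] = G[1, 0] = 0.5            # example coupling between adjacent motors

K_JOULE = 0.03   # Joule heating P ≈ k * tau_rms^2, current approximated by torque

def step_temperatures(T, tau_rms):
    """One forward-Euler step: Joule heating plus inter-node conduction."""
    P = np.zeros(N)
    P[:12] = K_JOULE * tau_rms**2            # heat input from RMS joint torques
    flow = G * (T[None, :] - T[:, None])     # flow[i, j]: heat into node i from j
    dT = (P + flow.sum(axis=1)) / C * DT
    dT[-1] = 0.0                             # ambient temperature stays fixed
    return T + dT

T = np.full(N, 25.0)                         # initial temperatures [°C]
tau = np.full(12, 10.0)                      # RMS torques over the interval [N·m]
T = step_temperatures(T, tau)
```

Feeding the RMS torque over each interval, rather than the instantaneous torque, matches the heat actually dissipated over that interval under the quadratic Joule-heating model.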
The RL framework builds on Proximal Policy Optimization (PPO) and follows the asymmetric actor‑critic paradigm used in recent quadruped locomotion work. The actor receives proprioceptive observations (commanded velocity, angular velocity, gravity vector, joint positions, velocities, temperatures, and the previous action) together with an estimated base velocity and a latent feature vector produced by an encoder that processes the last six proprioceptive frames. The critic, available only in simulation, also receives privileged information such as the true linear velocity and external forces, which improves value estimation and accelerates learning.
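The asymmetric observation split can be sketched as below. The field names and dimensions (in particular the 16-dimensional encoder latent) are assumptions for illustration, not the paper's exact layout.

```python
import numpy as np

def actor_observation(cmd_vel, ang_vel, gravity, q, dq, temps, prev_action,
                      est_base_vel, latent):
    """Proprioceptive observation available on the real robot."""
    return np.concatenate([
        cmd_vel,       # commanded velocity (3)
        ang_vel,       # base angular velocity (3)
        gravity,       # projected gravity vector (3)
        q,             # joint positions (12)
        dq,            # joint velocities (12)
        temps,         # motor temperatures (12)
        prev_action,   # previous action (12)
        est_base_vel,  # estimated base linear velocity (3)
        latent,        # encoder latent from the last 6 proprioceptive frames (assumed dim 16)
    ])

def critic_observation(actor_obs, true_base_vel, external_force):
    """Privileged observation, available only in simulation."""
    return np.concatenate([actor_obs, true_base_vel, external_force])

obs = actor_observation(np.zeros(3), np.zeros(3), np.array([0.0, 0.0, -1.0]),
                        np.zeros(12), np.zeros(12), np.full(12, 25.0),
                        np.zeros(12), np.zeros(3), np.zeros(16))
```

Keeping privileged quantities (true base velocity, external forces) out of the actor's input is what allows the trained policy to run on hardware without a motion-capture system or force sensors.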
To enforce thermal safety, the authors convert the hard temperature limit $T_{\max}$ into a Control Barrier Function (CBF) condition: $-\dot T + \gamma_T (T_{\max} - T) \ge 0$. This inequality is incorporated into the reward as a penalty term that becomes active when a motor’s temperature approaches the limit. To avoid bias from the initial temperature, a clipped temperature $T_{\text{clip}}$ is used, and the penalty is computed only on the CBF violation. The coefficient $\gamma_T$ is analytically chosen based on the discretized thermal model to guarantee that the CBF can be satisfied even if the motor receives zero torque.
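A minimal sketch of such a CBF-style reward term is shown below. The gain, limit, clipping floor, and weight are illustrative assumptions (the paper derives its gain analytically from the discretized thermal model); only violations of the barrier condition are penalized.

```python
import numpy as np

T_MAX = 80.0       # motor temperature limit [°C] (assumed)
GAMMA_T = 0.05     # CBF gain [1/s] (assumed; the paper derives this analytically)
T_REF = 40.0       # clip floor so cool motors incur no penalty bias (assumed)
W_THERMAL = 1.0    # penalty weight (assumed)

def thermal_penalty(T, dT_dt):
    """Penalize only violations of the barrier condition
    -dT/dt + gamma_T * (T_max - T_clip) >= 0, summed over motors."""
    T_clip = np.maximum(T, T_REF)                      # clipped temperature
    violation = dT_dt - GAMMA_T * (T_MAX - T_clip)     # > 0 means CBF violated
    return -W_THERMAL * np.sum(np.maximum(violation, 0.0))
```

A cool motor warming slowly incurs no penalty, while a motor near the limit whose temperature is still rising is penalized in proportion to how strongly it violates the barrier condition.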
Domain randomization is extensively applied during training to bridge the sim‑to‑real gap. Randomized parameters include payload mass (0–4 kg), center‑of‑mass offset, external forces, ground friction, initial joint positions, system delay, motor strength, initial motor temperature, and ambient temperature.
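Per-episode sampling of these parameters can be sketched as follows. Only the payload range (0–4 kg) is given in the text; every other range below is an assumed placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode_randomization():
    """Draw one set of randomized parameters at episode reset."""
    return {
        "payload_mass_kg": rng.uniform(0.0, 4.0),          # range from the text
        "com_offset_m": rng.uniform(-0.05, 0.05, size=3),  # assumed range
        "ground_friction": rng.uniform(0.4, 1.2),          # assumed range
        "motor_strength_scale": rng.uniform(0.9, 1.1),     # assumed range
        "system_delay_s": rng.uniform(0.0, 0.02),          # assumed range
        "initial_motor_temp_c": rng.uniform(25.0, 60.0),   # assumed range
        "ambient_temp_c": rng.uniform(15.0, 35.0),         # assumed range
    }

params = sample_episode_randomization()
```

Randomizing the initial and ambient temperatures is the thermally relevant addition here: it forces the policy to behave correctly whether the motors start cold or already warm.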