Sim-to-reality adaptation for Deep Reinforcement Learning applied to an underwater docking application
Deep Reinforcement Learning (DRL) offers a robust alternative to traditional control methods for autonomous underwater docking, particularly in adapting to unpredictable environmental conditions. However, bridging the “sim-to-real” gap and managing high training latencies remain significant bottlenecks for practical deployment. This paper presents a systematic approach for autonomous docking using the Girona Autonomous Underwater Vehicle (AUV) by leveraging a high-fidelity digital twin environment. We adapted the Stonefish simulator into a multiprocessing RL framework to significantly accelerate the learning process while incorporating realistic AUV dynamics, collision models, and sensor noise. Using the Proximal Policy Optimization (PPO) algorithm, we developed a 6-DoF control policy trained in a headless environment with randomized starting positions to ensure generalized performance. Our reward structure accounts for distance, orientation, action smoothness, and adaptive collision penalties to facilitate soft docking. Experimental results demonstrate that the agent achieved a success rate of over 90% in simulation. Furthermore, successful validation in a physical test tank confirmed the efficacy of the sim-to-reality adaptation, with the DRL controller exhibiting emergent behaviors such as pitch-based braking and yaw oscillations to assist in mechanical alignment.
💡 Research Summary
This paper presents a comprehensive framework that bridges the sim‑to‑real gap for autonomous underwater docking using the Girona Autonomous Underwater Vehicle (AUV). The authors first adapt the high‑fidelity Stonefish simulator into a multiprocessing reinforcement‑learning (RL) environment, running 20 parallel training threads plus one evaluation thread. This setup accelerates training by up to a factor of five relative to real time while preserving realistic hydrodynamics, collision handling, and sensor models.
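The parallel-rollout idea can be sketched with Python's standard `multiprocessing` pool. This is a minimal illustration, not the paper's implementation: `run_episode` is a hypothetical stub standing in for a headless Stonefish rollout, and the worker/episode counts are configurable placeholders.

```python
import multiprocessing as mp


def run_episode(seed):
    """Stub for one headless simulator episode. A real worker would launch a
    Stonefish instance, roll out the current policy, and return the episode
    reward; here a seeded random accumulator stands in for the simulation."""
    import random
    rng = random.Random(seed)
    return sum(rng.uniform(-1.0, 0.0) for _ in range(10))


def collect_parallel(n_workers=20, n_episodes=40):
    """Farm episodes out across worker processes (the paper uses 20 training
    workers plus one evaluation worker; counts here are illustrative)."""
    with mp.Pool(processes=n_workers) as pool:
        return pool.map(run_episode, range(n_episodes))
```

In practice each worker owns its own simulator process, so wall-clock rollout time scales with the number of CPU cores rather than with a single real-time simulation.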
The docking task is formulated as a Markov Decision Process (MDP). The state vector includes a three‑dimensional translational error, yaw error, linear and angular velocities, and IMU‑derived accelerations. To mimic real‑world sensor imperfections, Gaussian base noise proportional to the distance to the docking station and an occlusion‑dependent noise component are injected, scaling uncertainty with visibility. The action space is a six‑degree‑of‑freedom wrench (forces and torques) expressed in the vehicle body frame; although the physical AUV cannot directly actuate roll, the full 6‑DoF vector is retained so the policy can learn the vehicle’s actuation limits.
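The state construction described above can be sketched as follows. All function names, noise scales, and the exact noise model are illustrative assumptions; the paper specifies only that Gaussian base noise scales with distance to the station and that an occlusion-dependent component is added.

```python
import numpy as np


def noisy_state(trans_err, yaw_err, lin_vel, ang_vel, accel,
                occlusion, base_sigma=0.01, occ_sigma=0.05, rng=None):
    """Assemble the MDP state vector and inject Gaussian noise whose scale
    grows with distance to the docking station and with occlusion, mimicking
    degraded visual sensing. Scales (base_sigma, occ_sigma) are placeholders.

    trans_err: 3D translational error to the station (vehicle frame)
    yaw_err:   scalar yaw error
    lin_vel, ang_vel: 3D linear and angular velocities
    accel:     3D IMU-derived accelerations
    occlusion: fraction of the marker occluded, in [0, 1]
    """
    rng = np.random.default_rng() if rng is None else rng
    dist = np.linalg.norm(trans_err)
    sigma = base_sigma * dist + occ_sigma * occlusion
    state = np.concatenate([trans_err, [yaw_err], lin_vel, ang_vel, accel])
    return state + rng.normal(0.0, sigma, size=state.shape)
```

With this layout the state has 13 components (3 + 1 + 3 + 3 + 3), and the injected uncertainty vanishes as the vehicle approaches an unoccluded station.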
A multi‑component reward function drives learning:
- Distance reward uses a Mahalanobis distance weighted per axis, prioritizing X and Y for horizontal alignment.
- Angle reward penalizes yaw error with an exponential term.
- Smoothness reward discourages large step‑to‑step action changes, encouraging gentle actuation.
- Collision reward applies a penalty when the change in IMU acceleration exceeds an adaptive threshold, with the threshold automatically reduced after a collision to avoid repeated penalization.
- Mission reward provides a large positive terminal bonus for successful docking and a large negative penalty for premature episode termination.
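The five terms above can be combined as in the following sketch. The weights, the per-axis covariance, and the terminal bonuses are placeholder values chosen for illustration, not the paper's tuned constants, and the adaptive threshold update after a collision is only noted in a comment.

```python
import numpy as np


def docking_reward(err, yaw_err, action, prev_action, jerk,
                   collision_thresh, docked=False, aborted=False):
    """Sum of the five reward components; all weights are illustrative."""
    # Distance: Mahalanobis-style norm prioritizing X/Y horizontal alignment.
    W = np.diag([2.0, 2.0, 1.0])
    r_dist = -np.sqrt(err @ W @ err)
    # Angle: exponential penalty on yaw error.
    r_angle = -(1.0 - np.exp(-abs(yaw_err)))
    # Smoothness: discourage large step-to-step action changes.
    r_smooth = -0.1 * np.linalg.norm(action - prev_action)
    # Collision: penalize IMU-acceleration jumps above the adaptive threshold
    # (the threshold itself is lowered after a collision, handled elsewhere).
    r_coll = -10.0 if jerk > collision_thresh else 0.0
    # Mission: terminal bonus for docking, penalty for premature termination.
    r_mission = 100.0 if docked else (-100.0 if aborted else 0.0)
    return r_dist + r_angle + r_smooth + r_coll + r_mission
```

A perfectly aligned, smooth, collision-free docking step thus collects only the terminal mission bonus, while every deviation subtracts from it.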
Proximal Policy Optimization (PPO) is selected as the learning algorithm after initial experiments with Soft Actor‑Critic (SAC). PPO’s clipped surrogate objective offers stable updates for continuous control, and it proved more reliable during physical tank trials. Training on an Intel i7 CPU with an RTX 4060 GPU took roughly three hours, during which the mean episode reward rose from –800 to between 300 and 400, and the docking success rate exceeded 90% across varied random spawn positions.
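The clipped surrogate objective that makes PPO's updates stable can be written per sample as L = -min(r·A, clip(r, 1-ε, 1+ε)·A), where r is the new-to-old policy probability ratio and A the advantage estimate. A minimal sketch (ε = 0.2 is the common default, not necessarily the paper's setting):

```python
import numpy as np


def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss for one sample (loss = negative objective).
    Clipping the ratio to [1 - eps, 1 + eps] bounds how far a single update
    can move the policy, which is what makes the updates stable."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped)
```

For a positive advantage the objective stops growing once the ratio exceeds 1 + ε, so the optimizer gains nothing from pushing the policy further in one step.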
Simulation and real‑world deployment share the same ROS communication interfaces. In simulation, a downward‑facing camera detects a 3‑Dimensional Binary Marker (3DBM) on the docking station; the pose is transformed into the vehicle frame and combined with the noisy state vector. The same camera‑based visual servoing pipeline is used in the physical test tank, ensuring minimal software changes when transferring the policy.
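The frame transformation in that pipeline amounts to applying a fixed camera-to-vehicle extrinsic to the detected marker pose. A minimal sketch, position-only for brevity, with a hypothetical helper name and an assumed homogeneous-transform convention:

```python
import numpy as np


def marker_to_vehicle(T_vehicle_camera, p_marker_camera):
    """Transform a detected 3DBM position from the camera frame into the
    vehicle body frame. T_vehicle_camera is the fixed 4x4 homogeneous
    camera-to-vehicle extrinsic; a full pipeline would transform the
    orientation as well."""
    p_h = np.append(p_marker_camera, 1.0)   # homogeneous coordinates
    return (T_vehicle_camera @ p_h)[:3]
```

Because the same transform runs against both the simulated and the physical camera topic, the policy sees an identically structured state in the tank and in Stonefish.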
Physical experiments confirm the policy’s effectiveness: the AUV successfully docks in the test tank, exhibiting emergent behaviors such as pitch‑based braking and intentional yaw oscillations that aid mechanical alignment—behaviors that are difficult to engineer with conventional PID or Model Predictive Control.
The authors acknowledge limitations: only ocean currents were modeled (no waves or wind), the docking station’s collision geometry was simplified to guiding funnels, and the multiprocessing approach, while faster than single‑threaded simulation, does not match the scalability of GPU‑based simulators that can run thousands of parallel instances. Future work will incorporate domain randomization, adversarial disturbances, and more complex environmental dynamics, as well as explore hybrid CPU‑GPU simulation architectures and multi‑agent cooperative docking scenarios.