Optimal Multi-Debris Mission Planning in LEO: A Deep Reinforcement Learning Approach with Co-Elliptic Transfers and Refueling
This paper addresses the challenge of multi-target active debris removal (ADR) in Low Earth Orbit (LEO) by introducing a unified co-elliptic maneuver framework that combines Hohmann transfers, safety-ellipse proximity operations, and explicit refueling logic. We benchmark three distinct planning algorithms: a Greedy heuristic, Monte Carlo Tree Search (MCTS), and deep reinforcement learning (RL) using Masked Proximal Policy Optimization (PPO), within a realistic orbital simulation environment featuring randomized debris fields, keep-out zones, and delta-V constraints. Experimental results over 100 test scenarios demonstrate that Masked PPO achieves superior mission efficiency and computational performance, visiting up to twice as many debris as Greedy and significantly outperforming MCTS in runtime. These findings underscore the promise of modern RL methods for scalable, safe, and resource-efficient space mission planning, paving the way for future advancements in ADR autonomy.
💡 Research Summary
The paper tackles the challenging problem of planning multi‑target active debris removal (ADR) missions in Low Earth Orbit (LEO). It introduces a unified maneuver framework that couples co‑elliptic Hohmann transfers with a safety‑ellipse approach and explicit refueling logic, thereby addressing both orbital dynamics efficiency and operational safety. The authors generate a realistic simulation environment: each episode starts with a service spacecraft docked at a 700 km circular refueling station, a random set of 50 debris objects with full six‑element Keplerian states, and strict constraints on total ΔV (≈3 km s⁻¹) and mission duration (7 days).
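The episode constants above can be collected into a small configuration object. This is an illustrative sketch, not code from the paper; the class and field names are my own, chosen to mirror the numbers quoted in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissionConfig:
    """Per-episode constants as summarized above; field names are illustrative."""
    station_altitude_km: float = 700.0       # circular refueling-station orbit
    n_debris: int = 50                       # randomized debris objects per episode
    dv_budget_ms: float = 3000.0             # total delta-V budget (~3 km/s), in m/s
    mission_duration_s: float = 7 * 86400.0  # 7-day mission horizon, in seconds


cfg = MissionConfig()
```

Freezing the dataclass keeps these limits immutable over an episode, which matches their role as hard mission constraints.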
Co‑elliptic transfers differ from classic Hohmann burns by first moving the chaser onto an intermediate orbit that shares the target’s apogee or perigee, allowing natural phasing and reducing cumulative ΔV when debris are clustered in similar orbital bands. After phasing, a final burn circularizes the orbit, and a safety‑ellipse maneuver provides a low‑speed, controlled approach within a predefined miss distance, mitigating collision risk during the final rendezvous. Refueling is modeled as a full ΔV replenishment after a round‑trip to the station, but incurs a time penalty, encouraging the planner to limit unnecessary refuel cycles.
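The ΔV bookkeeping underlying these transfers follows the standard two-burn Hohmann formula between coplanar circular orbits. The sketch below is textbook vis-viva algebra, not the paper's implementation; the co-elliptic phasing and safety-ellipse logic described above would sit on top of a primitive like this.

```python
import math

MU = 3.986004418e14  # Earth's gravitational parameter, m^3 s^-2


def hohmann_dv(r1: float, r2: float) -> float:
    """Total delta-V (m/s) for a two-burn Hohmann transfer between
    coplanar circular orbits of radii r1 and r2 (metres)."""
    a_t = (r1 + r2) / 2.0                             # transfer-ellipse semi-major axis
    v1 = math.sqrt(MU / r1)                           # circular speed at departure
    v2 = math.sqrt(MU / r2)                           # circular speed at arrival
    v_dep = math.sqrt(MU * (2.0 / r1 - 1.0 / a_t))    # transfer speed at r1 (vis-viva)
    v_arr = math.sqrt(MU * (2.0 / r2 - 1.0 / a_t))    # transfer speed at r2 (vis-viva)
    return abs(v_dep - v1) + abs(v2 - v_arr)


R_E = 6371e3  # mean Earth radius, m
dv = hohmann_dv(R_E + 700e3, R_E + 750e3)  # a 50 km altitude hop in LEO
```

For small altitude changes within a LEO band, such a hop costs only tens of m/s, which is why clustering debris by orbital band stretches the ~3 km/s budget across many targets.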
Three planning strategies are benchmarked under identical conditions:
- Greedy heuristic – selects the next unvisited debris that minimizes a weighted sum of instantaneous ΔV and transfer time. It is computationally cheap (≈1 s per episode) but myopic, yielding 15‑18 visited debris on average.
- Monte Carlo Tree Search (MCTS) – builds a search tree from the current state, expands child nodes for feasible debris or refuel actions, and uses the Upper Confidence Bound (UCB) formula to balance exploration and exploitation. Random rollouts estimate future reward. While MCTS improves visitation to 25‑29 debris, its runtime is prohibitive (1 000‑10 000 s per episode) for real‑time or onboard use.
- Masked Proximal Policy Optimization (Masked PPO) – a deep reinforcement learning (RL) agent based on PPO with action masking. The observation vector includes a binary mask of visited debris, remaining ΔV, remaining time, and the full Keplerian state of the chaser and all debris. Action masks prevent selection of already‑visited debris or actions that would violate ΔV/time limits, stabilizing learning. The reward is +1 per successful rendezvous, 0 for refuel or idle moves, and –1 for constraint violations or early termination. Training on a diverse set of random debris fields yields a policy that generalizes well; inference time is 1‑2 s per episode.
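The UCB child-selection rule used by MCTS can be sketched as follows. This is the generic UCB1 formula, not the paper's code; the dict-based node representation and the exploration constant `c` are my own illustrative choices.

```python
import math


def ucb_select(children: list, c: float = math.sqrt(2)) -> dict:
    """Pick the child node maximizing UCB1: Q/N + c * sqrt(ln(N_parent) / N).
    Unvisited children score infinity, so they are always expanded first."""
    total_visits = sum(ch["visits"] for ch in children)

    def score(ch: dict) -> float:
        if ch["visits"] == 0:
            return float("inf")  # force at least one rollout per child
        exploit = ch["value"] / ch["visits"]                       # mean reward
        explore = c * math.sqrt(math.log(total_visits) / ch["visits"])
        return exploit + explore

    return max(children, key=score)
```

The exploration term shrinks as a child accumulates visits, which is how the search shifts from trying every feasible debris/refuel action toward deepening the most promising sequences.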
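The action mask described for Masked PPO amounts to a boolean vector over debris actions that zeroes out anything infeasible. A minimal sketch, assuming per-debris ΔV and time cost estimates are available; the function and argument names are illustrative, not the paper's API.

```python
import numpy as np


def action_mask(visited, dv_costs, time_costs, dv_left: float, time_left: float):
    """Boolean mask over debris actions: True where the debris is unvisited
    and the estimated transfer fits the remaining delta-V and time budgets."""
    visited = np.asarray(visited, dtype=bool)
    fits_budget = (np.asarray(dv_costs) <= dv_left) & (np.asarray(time_costs) <= time_left)
    return ~visited & fits_budget
```

Feeding this mask into the policy's action distribution (e.g. by setting masked logits to -inf) is what prevents the agent from ever sampling an already-visited or budget-violating target, which the summary credits with stabilizing learning.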
Experimental results over 100 randomized scenarios show that Masked PPO consistently visits 29‑32 debris (average ≈30.5), outperforming Greedy by ~70 % and MCTS by a modest 5 % while matching Greedy’s speed. The study demonstrates that integrating physics‑aware co‑elliptic transfers with safety‑ellipse approaches dramatically reduces ΔV consumption, and that action‑masked RL can exploit this structure to produce high‑quality, real‑time mission plans.
The discussion highlights the trade‑offs: MCTS offers high‑quality plans but is unsuitable for on‑board deployment due to its computational burden; Greedy is fast but fails to capture long‑term benefits; Masked PPO achieves the best balance, learning to respect refueling constraints and safety zones while delivering near‑optimal sequences. The authors suggest future work on multi‑vehicle coordination, extensions to highly eccentric or inclined orbits, and hybrid schemes where a lightweight RL policy guides a limited‑depth tree search for critical decision points.
In conclusion, the paper provides a compelling case that a co‑elliptic maneuver framework combined with safety‑ellipse proximity operations and refueling logic, when paired with a masked deep RL planner, can substantially improve the efficiency and feasibility of multi‑target ADR missions. This approach paves the way for autonomous, scalable debris removal and broader autonomous space‑operations planning.