Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12


Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning, yet their capacity for autonomous multi-stage planning in high-dimensional, physically constrained environments remains an open research question. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12), a complex astrodynamics challenge requiring the design of a large-scale asteroid mining campaign. We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions. To assess performance beyond binary validity, we employ an “LLM-as-a-Judge” methodology, utilizing a rubric developed by domain experts to evaluate strategic viability across five structural categories. A comparative analysis of models, ranging from GPT-4-Turbo to reasoning-enhanced architectures like Gemini 2.5 Pro and o3, reveals a significant trend: the average strategic viability score has nearly doubled in the last two years (rising from 9.3 to 17.2 out of 26). However, we identify a critical capability gap between strategy and execution. While advanced models demonstrate sophisticated conceptual understanding, correctly framing objective functions and mission architectures, they consistently fail at implementation due to physical unit inconsistencies, boundary condition errors, and inefficient debugging loops. We conclude that, while current LLMs often demonstrate sufficient knowledge and intelligence to tackle space science tasks, they remain limited by an implementation barrier, functioning as powerful domain facilitators rather than fully autonomous engineers.


💡 Research Summary

This paper investigates whether today’s large language models (LLMs) can autonomously tackle a high‑dimensional, physically constrained engineering problem by testing them on the 12th Global Trajectory Optimization Competition (GTOC 12). GTOC 12 asks participants to design a fleet of “Mining Ships” that launch from Earth between 2035 and 2050, rendezvous with a subset of 60 000 candidate asteroids, extract resources, and return the material to Earth, all while respecting low‑thrust propulsion limits, mission‑time windows, solar‑proximity constraints, and a dynamic fleet‑size rule. Traditional LLM benchmarks focus on code generation or narrow reasoning tasks and do not capture the multi‑stage planning, numerical simulation, and strict physical validation required by GTOC 12.
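To make the shape of the problem concrete, here is a deliberately simplified, hypothetical record of a single Mining Ship's plan with the 2035–2050 window check from the problem statement. The field names, the MJD epoch constants (approximate conversions of 2035-01-01 and 2050-01-01), and the class itself are illustrative; the real GTOC 12 submission format also encodes thrust profiles, resource masses, and fleet-level constraints.

```python
from dataclasses import dataclass

# Approximate Modified Julian Dates for the mission window boundaries
# (2035-01-01 and 2050-01-01); illustrative constants, not from the paper.
MISSION_START_MJD = 64328.0
MISSION_END_MJD = 69807.0


@dataclass
class ShipPlan:
    """Hypothetical, highly simplified plan for one Mining Ship."""
    launch_mjd: float
    asteroid_visits: list  # [(asteroid_id, arrival_mjd), ...]
    return_mjd: float

    def within_window(self) -> bool:
        # Every event (launch, each rendezvous, Earth return) must fall
        # inside the 2035-2050 mission window.
        events = [self.launch_mjd, self.return_mjd]
        events += [t for _, t in self.asteroid_visits]
        return all(MISSION_START_MJD <= t <= MISSION_END_MJD for t in events)
```

A checker like this is the kind of constraint the official validator enforces far more rigorously, alongside thrust, specific-impulse, and solar-proximity limits.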

To bridge this gap, the authors repurpose the MLE‑Bench framework—originally built for Kaggle‑style machine‑learning competitions—into a trajectory‑optimization sandbox. They replace data‑set inputs and CSV submissions with a structured text file that encodes launch dates, thrust profiles, asteroid encounter times, and resource‑return schedules. The official GTOC 12 validator is integrated as a subprocess, providing immediate feedback on physical feasibility (e.g., unit consistency, time‑window adherence, thrust and specific‑impulse limits, gravity‑assist geometry). The Docker environment is rebuilt with astrodynamics libraries (poliastro, Astropy) and numerical solvers, eliminating deep‑learning dependencies.
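The validator-in-the-loop setup can be sketched as a plain subprocess call. The command name, arguments, and return convention below are placeholders; the paper only states that the official GTOC 12 validator runs as a subprocess and that its log is parsed for feedback.

```python
import subprocess


def run_validator(solution_path: str, validator_cmd: str = "./gtoc12_validator"):
    """Run a validator binary on a solution file and capture its log.

    `validator_cmd` and the log format are hypothetical stand-ins for the
    official GTOC 12 validator described in the paper.
    """
    result = subprocess.run(
        [validator_cmd, solution_path],
        capture_output=True,
        text=True,
        timeout=300,  # keep runaway validations from stalling the agent
    )
    # Return both physical feasibility (exit code) and the raw log,
    # which the agent later mines for error categories.
    return result.returncode == 0, result.stdout + result.stderr
```

Feeding both the pass/fail signal and the textual log back into the agent is what turns the validator from a gatekeeper into a source of structured feedback.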

The agent architecture is based on AIDE (AI‑Driven Exploration), which treats solution development as a tree‑search problem. The authors extract all hard‑coded prompts from AIDE and replace them with a “spacecraft guidance, navigation, and control engineer” persona. Each node in the search tree generates a Python script, runs it, and parses the validator’s log to identify error categories such as unit mismatches, out‑of‑bounds timestamps, or thrust violations. The agent then branches by proposing targeted fixes, following a best‑first search strategy. Additional safeguards are added: strict debugging protocols that fix one error at a time, warnings about unit handling, and prompts to keep computational cost low.
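A minimal sketch of that loop, assuming a hypothetical `expand` callback that stands in for the LLM proposing one targeted fix per detected error category. The regex-based error classifier is likewise illustrative; the paper describes parsing the validator log into categories but not the exact mechanism.

```python
import heapq
import itertools
import re

# Illustrative error categories matching the classes named in the paper:
# unit mismatches, out-of-bounds timestamps, thrust violations.
ERROR_PATTERNS = {
    "unit_mismatch": re.compile(r"unit", re.I),
    "time_window": re.compile(r"epoch|window|timestamp", re.I),
    "thrust_violation": re.compile(r"thrust", re.I),
}


def classify_errors(log: str):
    """Map a validator log onto coarse error categories."""
    return [name for name, pat in ERROR_PATTERNS.items() if pat.search(log)]


def best_first_search(root_score, root_log, expand, max_nodes=50):
    """AIDE-style best-first tree search over candidate solutions.

    `expand(score, error_categories)` is a stand-in for the LLM: it should
    propose fixes for ONE error category at a time and return a list of
    (child_score, child_log) pairs; higher score is better.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares logs
    frontier = [(-root_score, next(counter), root_log)]
    best = (root_score, root_log)
    for _ in range(max_nodes):
        if not frontier:
            break
        neg_score, _, log = heapq.heappop(frontier)
        for child_score, child_log in expand(-neg_score, classify_errors(log)):
            if child_score > best[0]:
                best = (child_score, child_log)
            heapq.heappush(frontier, (-child_score, next(counter), child_log))
    return best
```

The one-error-at-a-time discipline shows up here as the contract on `expand`: each branch addresses a single error category, which is exactly the debugging protocol the authors enforce via prompting.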

Experiments involve five models—GPT‑4‑Turbo, Gemini 2.5 Pro, OpenAI o1‑preview, Claude‑3‑Opus, and LLaMA‑2‑70B—each run ten times. Performance is measured in two ways. First, the “implementation success” metric records whether the generated solution passes the official validator and earns any competition points; all models score zero, indicating no fully valid trajectory was produced. Second, a rubric created by GTOC experts evaluates the strategic soundness of the proposed plans across five structural categories (target selection, trajectory design, propulsion system, mining schedule, fleet sizing) on a 0‑5 scale. The average strategic viability score rose from 9.3 (early 2024) to 17.2 (late 2025), nearly doubling. Improvements are most pronounced in high‑level target prioritization and feasible trajectory sketches, while persistent failures remain in low‑level implementation: unit conversion errors (km vs. m), violation of the 2035‑2050 mission window, exceeding thrust or specific‑impulse budgets, malformed output files that crash the validator, and inefficient debugging loops that repeatedly address the same mistake.
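The rubric aggregation is simple enough to sketch directly. The five category names come from the summary above, but note that five 0–5 scores sum to at most 25, while the paper reports scores out of 26, so the actual rubric presumably carries an extra item or weighting not reproduced here; this helper is an assumption-laden simplification.

```python
# The five structural categories named in the expert rubric.
RUBRIC_CATEGORIES = (
    "target_selection",
    "trajectory_design",
    "propulsion_system",
    "mining_schedule",
    "fleet_sizing",
)


def strategic_viability(scores: dict) -> int:
    """Sum per-category judge scores into a single strategic-viability total.

    Simplified sketch: each category is scored 0-5 by the LLM judge and the
    categories are weighted equally, which the paper does not confirm.
    """
    missing = set(RUBRIC_CATEGORIES) - set(scores)
    if missing:
        raise ValueError(f"missing rubric categories: {sorted(missing)}")
    if any(not 0 <= s <= 5 for s in scores.values()):
        raise ValueError("each category is scored on a 0-5 scale")
    return sum(scores[c] for c in RUBRIC_CATEGORIES)
```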

The authors conclude that current LLMs excel at conceptual reasoning and can generate sophisticated, domain‑aware strategies, but they are hampered by an “implementation barrier.” Physical‑unit consistency, precise numerical simulation, and systematic error correction remain beyond their autonomous capabilities. Consequently, LLMs function best as powerful facilitators—providing high‑level design insight—rather than as fully autonomous aerospace engineers. The paper recommends future work on tighter integration of physics‑aware verification tools, dedicated unit‑management frameworks, error‑prioritized search strategies, and possibly hybrid systems where a reasoning‑focused LLM is paired with a verification‑oriented model. Overcoming these gaps could eventually enable LLMs to collaborate with human engineers in end‑to‑end space‑mission design, moving from advisory roles toward true autonomous problem solving.

