Multi-Agent Collaboration for Automated Design Exploration on High Performance Computing Systems
Today’s scientific challenges, from climate modeling to Inertial Confinement Fusion design to novel material design, require exploring huge design spaces. In order to enable high-impact scientific discovery, we need to scale up our ability to test hypotheses, generate results, and learn from them rapidly. We present MADA (Multi-Agent Design Assistant), a Large Language Model (LLM) powered multi-agent framework that coordinates specialized agents for complex design workflows. A Job Management Agent (JMA) launches and manages ensemble simulations on HPC systems, a Geometry Agent (GA) generates meshes, and an Inverse Design Agent (IDA) proposes new designs informed by simulation outcomes. While general purpose, we focus development and validation on Richtmyer–Meshkov Instability (RMI) suppression, a critical challenge in Inertial Confinement Fusion. We evaluate in two complementary settings: running hydrodynamics simulations on HPC systems, and using a pre-trained machine learning surrogate for rapid design exploration. Our results demonstrate that the MADA system successfully executes iterative design refinement, automatically improving designs toward optimal RMI suppression with minimal manual intervention. Our framework reduces cumbersome manual workflow setup, and enables automated design exploration at scale. More broadly, it demonstrates a reusable pattern for coupling reasoning, simulation, specialized tools, and coordinated workflows to accelerate scientific discovery.
💡 Research Summary
The paper introduces MADA (Multi‑Agent Design Assistant), a novel framework that leverages large language models (LLMs) to orchestrate a closed‑loop scientific design workflow on high‑performance computing (HPC) platforms. MADA decomposes the end‑to‑end loop—design proposal, mesh generation, simulation execution, result analysis, and design refinement—into three specialized agents: a Job Management Agent (JMA), a Geometry Agent (GA), and an Inverse Design Agent (IDA). Communication between agents and external tools (Flux scheduler, Cubit mesh generator, Laghos hydrodynamics code, and machine‑learning surrogate models) is standardized through the Model Context Protocol (MCP), a lightweight JSON‑RPC based client‑server interface that abstracts tool‑specific details while handling parameter validation, error reporting, and result formatting.
The JMA abstracts HPC job submission and monitoring. It wraps Flux (or any other scheduler such as Slurm) in an MCP server, enabling the agent to launch ensembles of simulations, track their status, collect output files, and automatically retry or re‑schedule failed jobs. The GA receives user‑defined design variables (e.g., interface geometry parameters for inertial confinement fusion capsules) and invokes Cubit or PMesh via MCP to produce high‑quality, simulation‑ready meshes. Mesh validation steps are built into the GA to ensure geometric consistency before simulation. The IDA performs design‑space exploration. It initially samples the space using random or Latin‑hypercube methods, then extracts quantities of interest (QoI) such as Richtmyer‑Meshkov Instability (RMI) growth rates from simulation outputs. Using these QoI, the IDA applies Bayesian optimization, evolutionary algorithms, or reinforcement‑learning policies to propose new candidate designs. Crucially, the IDA uses LLM‑generated natural‑language explanations to articulate why a design is promising, providing transparent feedback to human scientists.
MADA is evaluated on the RMI suppression problem, a critical challenge in inertial confinement fusion (ICF). Two experimental tracks are presented: (1) a full‑scale HPC run where Laghos solves the compressible Euler equations on meshes generated by GA, with JMA handling job orchestration on a real cluster; (2) a rapid‑exploration track where a pre‑trained neural‑network surrogate approximates Laghos results, allowing the IDA to evaluate thousands of designs in minutes. In the HPC track, five to six iterative cycles reduced the RMI growth factor by more than 30 % compared with baseline designs. In the surrogate track, the same level of performance was reached with an order‑of‑magnitude reduction in wall‑clock time. The experiments also demonstrate robust error handling: mesh generation failures, simulation crashes, or surrogate prediction errors trigger automatic fallback strategies without human intervention.
The paper highlights several key contributions. First, it shows how LLM reasoning can replace manually authored workflow graphs: users only need to specify high‑level objectives and constraints in natural language, and the planner component dynamically decomposes tasks, selects the appropriate agent, and adapts the execution plan based on intermediate results. Second, the modular multi‑agent architecture mitigates LLM context‑window limitations; each agent maintains its own state and memory, allowing the system to scale to long‑horizon optimization problems. Third, MCP provides a vendor‑agnostic integration layer, making it trivial to add new simulation codes, schedulers, or analysis tools by wrapping them in a compliant server. Finally, the system demonstrates resilience: when one agent fails, the planner re‑assigns responsibilities or modifies the workflow, preserving overall progress.
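The adaptive-planning and resilience claims can be sketched as a dispatch loop. Everything here is hypothetical (the agent callables, step names, and retry policy are not from the paper); it only illustrates the idea that the planner executes steps through named agents, passes intermediate results forward, and records rather than propagates failures.

```python
def run_workflow(plan, agents, max_retries=2):
    """Execute (step_name, agent_name) pairs in order. Each agent sees the
    results so far; failed steps are retried, then recorded as None so the
    workflow can continue instead of halting."""
    results = {}
    for step, agent_name in plan:
        agent = agents[agent_name]
        for attempt in range(max_retries + 1):
            try:
                results[step] = agent(results)
                break
            except RuntimeError:
                if attempt == max_retries:
                    results[step] = None  # record failure, preserve progress
    return results

# Hypothetical stand-ins for the three agents.
agents = {
    "GA": lambda state: "mesh.ok",
    "JMA": lambda state: f"sim({state['mesh']})",
    "IDA": lambda state: f"next-design<-{state['simulate']}",
}
plan = [("mesh", "GA"), ("simulate", "JMA"), ("propose", "IDA")]
print(run_workflow(plan, agents))
```

In MADA the `plan` itself would be produced and revised by the LLM planner from a natural-language objective, rather than hard-coded as it is here.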
Compared with traditional workflow managers (e.g., Pegasus, Airflow), MADA offers a higher degree of automation, natural‑language interfacing, and adaptive planning, reducing the orchestration burden on domain scientists. The authors discuss future directions, including extending the framework to multi‑physics problems, incorporating real‑time experimental data for closed‑loop calibration, and scaling to exascale environments with thousands of concurrent simulations.
In summary, MADA represents a new paradigm for scientific discovery on HPC systems: by coupling LLM‑driven reasoning, a standardized tool‑integration protocol, and specialized collaborative agents, it automates complex design exploration, accelerates optimization cycles, and frees researchers to focus on scientific insight rather than workflow plumbing.