MDAgent2: Large Language Model for Code Generation and Knowledge Q&A in Molecular Dynamics
Molecular dynamics (MD) simulations are essential for understanding atomic-scale behaviors in materials science, yet writing LAMMPS scripts remains a highly specialized and time-consuming task. Although LLMs show promise in code generation and domain-specific question answering, their performance in MD scenarios is limited by scarce domain data, the high deployment cost of state-of-the-art LLMs, and low code executability. Building upon our prior MDAgent, we present MDAgent2, the first end-to-end framework capable of performing both knowledge Q&A and code generation within the MD domain. We construct a domain-specific data-construction pipeline that yields three high-quality datasets spanning MD knowledge, question answering, and code generation. Based on these datasets, we adopt a three-stage post-training strategy, comprising continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL), to train two domain-adapted models, MD-Instruct and MD-Code. Furthermore, we introduce MD-GRPO, a closed-loop RL method that leverages simulation outcomes as reward signals and recycles low-reward trajectories for continual refinement. We further build MDAgent2-RUNTIME, a deployable multi-agent system that integrates code generation, execution, evaluation, and self-correction. Together with MD-EvalBench, proposed in this work as the first benchmark for LAMMPS code generation and question answering, our models and system surpass several strong baselines. This work systematically demonstrates the adaptability and generalization capability of large language models in industrial simulation tasks, laying a methodological foundation for automatic code generation in AI for Science and industrial-scale simulations. URL: https://github.com/FredericVAN/PKU_MDAgent2
💡 Research Summary
MDAgent2 introduces a comprehensive, end‑to‑end framework that brings large language model (LLM) capabilities to the highly specialized domain of molecular dynamics (MD) simulations, with a particular focus on LAMMPS script generation and domain‑specific question answering. The authors first address the chronic data scarcity in MD by constructing a three‑tiered data pipeline. They collect thousands of high‑quality MD‑related texts (papers, textbooks, manuals) and apply rigorous cleaning, deduplication, and formatting steps to produce a large unlabeled corpus (MD‑Knowledge) for continued pre‑training (CPT). In parallel, they curate two supervised datasets: MD‑InstructQA, a set of expert‑authored question‑answer pairs covering theoretical concepts, syntax, and practical usage; and MD‑CodeGen, which pairs natural‑language simulation objectives with correct LAMMPS scripts.
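The cleaning and deduplication steps of the data pipeline can be illustrated with a minimal sketch. The paper does not specify its exact deduplication method; the whitespace normalization and hash-based exact-match approach below are illustrative assumptions only.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def deduplicate(docs):
    """Drop exact duplicates (after normalization) while preserving order."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

A production pipeline would typically add near-duplicate detection (e.g., MinHash over shingles) on top of exact-match hashing, but the order-preserving filter above captures the basic idea.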
Using an 8‑billion‑parameter model from the Qwen‑3 series as the base, the authors adopt a three‑stage post‑training regimen. 1) CPT injects MD‑specific terminology and concepts into the model, improving its internal representation of the domain. 2) Supervised fine‑tuning (SFT) on MD‑InstructQA aligns the model with the style and factual correctness required for QA tasks, while also teaching it the precise syntax of LAMMPS commands. 3) Reinforcement learning (RL) is performed with a novel closed‑loop method called MD‑GRPO, derived from the GRPO algorithm. In this stage, generated scripts are automatically executed in a real LAMMPS environment; execution success, physical plausibility (e.g., energy conservation), and quantitative simulation outcomes are transformed into reward signals. A distinctive “low‑reward trajectory recycling” mechanism re‑feeds failed generations back into the policy update, encouraging the model to explore corrective strategies rather than discarding them.
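The reward shaping and trajectory recycling in MD-GRPO can be sketched as follows. The reward weights, the energy-drift tolerance, and the recycling threshold here are hypothetical placeholders; the paper's actual reward formulation is not reproduced.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Trajectory:
    script: str           # generated LAMMPS input deck
    ran_ok: bool          # did the simulation execute without error?
    energy_drift: float   # |relative energy drift|, a plausibility proxy
    reward: float = 0.0

def reward_fn(t: Trajectory, drift_tol: float = 1e-3) -> float:
    """Toy reward: credit for successful execution, bonus for energy
    conservation. Weights (0.5/0.5) and drift_tol are illustrative."""
    r = 0.0
    if t.ran_ok:
        r += 0.5
        if t.energy_drift < drift_tol:
            r += 0.5
    return r

@dataclass
class RecyclingBuffer:
    """Low-reward trajectories are retained for later policy updates
    instead of being discarded."""
    threshold: float = 0.5
    recycled: List[Trajectory] = field(default_factory=list)

    def add(self, t: Trajectory) -> None:
        t.reward = reward_fn(t)
        if t.reward < self.threshold:
            self.recycled.append(t)
```

In a full GRPO-style update, a group of such trajectories would be sampled per prompt and their rewards normalized within the group; the buffer simply ensures failed generations re-enter that sampling pool.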
Two specialized models emerge from this pipeline: MD‑Instruct (optimized for knowledge understanding and QA) and MD‑Code (optimized for code generation). To exploit these models in practice, the authors build MDAgent2‑RUNTIME, a deployable multi‑agent system that orchestrates four functional agents: (i) a prompt engine that translates user natural‑language requests into model inputs, (ii) a code generator that produces LAMMPS scripts, (iii) an execution engine that runs the scripts and captures logs, and (iv) an evaluator/self‑corrector that interprets execution feedback and triggers iterative regeneration when needed. The runtime integrates LAMMPS‑specific parsers and validation tools, ensuring that syntactic errors are caught early and that feedback is meaningful for the RL loop.
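The generate-execute-evaluate-correct cycle orchestrated by MDAgent2-RUNTIME can be summarized in a minimal control loop. The `generate` and `execute` callables stand in for the code-generation model and the LAMMPS execution engine respectively; the round limit and the raw-log feedback channel are simplifying assumptions.

```python
def run_agent_loop(task, generate, execute, max_rounds=3):
    """Minimal generate-execute-correct loop.

    generate(task, feedback) -> script   (feedback is None on round 1)
    execute(script)          -> (ok, log)
    Returns (script, log) on success, (None, last_log) on failure.
    """
    feedback = None
    for _ in range(max_rounds):
        script = generate(task, feedback)
        ok, log = execute(script)
        if ok:
            return script, log
        # evaluator role: pass the captured error log back for regeneration
        feedback = log
    return None, feedback
```

The real system additionally runs LAMMPS-specific parsers and validators before execution, so syntax errors can be caught without launching a simulation.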
For evaluation, the authors introduce MD‑EvalBench, the first benchmark suite targeting both MD knowledge and LAMMPS code generation. MD‑EvalBench comprises three sub‑benchmarks: (a) MD‑KnowledgeEval (336 expert‑curated theoretical questions), (b) LAMMPS‑SyntaxEval (333 questions on command usage and syntax), and (c) LAMMPS‑CodeGenEval (a set of natural‑language simulation tasks requiring runnable scripts). Performance metrics include standard QA accuracy, Execution‑Success@k (the proportion of tasks where at least one of the top‑k generated scripts runs without error), and a human‑rated Code‑HumanScore (0‑10) assessing readability, robustness, and physical correctness.
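The Execution-Success@k metric described above has a straightforward computation: a task counts as solved if any of its first k generated scripts runs without error. A minimal sketch:

```python
def execution_success_at_k(results, k):
    """results: one list of booleans per task, where entry i records
    whether the i-th generated script ran without error.
    Returns the fraction of tasks with at least one success in the top k."""
    solved = sum(any(task_runs[:k]) for task_runs in results)
    return solved / len(results)
```

For example, with two tasks where only the first has a successful sample among its top 3, Execution-Success@3 is 0.5.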
Experimental results demonstrate that domain‑specific post‑training yields substantial gains over the base Qwen‑3‑8B model. MD‑Instruct‑8B achieves an average QA score of 74.67, surpassing Qwen‑Flash (73.47) and approaching the much larger Qwen‑3‑32B (77.34). In code generation, the MD‑Code‑8B model, when used within the RUNTIME loop, raises Execution‑Success@3 from 14.23 % (direct prompting) to 37.95 % and modestly improves the human code score from 9.29 to 9.32. These improvements are attributed to (i) the CPT and SFT stages that embed MD terminology and LAMMPS syntax, and (ii) the MD‑GRPO RL stage that directly optimizes against execution‑based rewards. Notably, the lightweight 8‑billion‑parameter models achieve performance competitive with closed‑source, much larger models (e.g., Qwen‑3‑Max), highlighting the feasibility of deploying efficient, domain‑adapted LLMs on local hardware.
In summary, the paper makes four major contributions: (1) a systematic pipeline for constructing high‑quality MD‑specific corpora and supervised datasets, (2) a three‑stage post‑training strategy (CPT → SFT → RL) that effectively adapts general LLMs to the MD domain, (3) the MD‑GRPO closed‑loop RL framework that leverages real simulation outcomes as reward signals and recycles low‑reward trajectories for continual improvement, and (4) the MDAgent2‑RUNTIME multi‑agent system that automates the full cycle from natural‑language request to executable, self‑corrected LAMMPS code. Together with the newly released MD‑EvalBench benchmark, these contributions provide a practical roadmap for integrating LLMs into scientific computing pipelines, and open avenues for extending the approach to other simulation domains such as computational fluid dynamics or electronic‑structure calculations.