A Guide to Large Language Models in Modeling and Simulation: From Core Techniques to Critical Challenges
Large language models (LLMs) have rapidly become familiar tools to researchers and practitioners. Concepts such as prompting, temperature, or few-shot examples are now widely recognized, and LLMs are increasingly used in Modeling & Simulation (M&S) workflows. However, practices that appear straightforward may introduce subtle issues, unnecessary complexity, or even inferior results. Adding more data can backfire (e.g., degrading performance through model collapse or inadvertently wiping out existing guardrails); fine-tuning a model can be unnecessary without a prior assessment of what it already knows; setting the temperature to 0 is not sufficient to make LLMs deterministic; and providing a large volume of M&S data as input can be excessive (LLMs cannot attend to everything), while naive simplifications can lose information. We aim to provide comprehensive and practical guidance on how to use LLMs, with an emphasis on M&S applications. We discuss common sources of confusion, including non-determinism, knowledge augmentation (including RAG and LoRA), decomposition of M&S data, and hyper-parameter settings. We emphasize principled design choices, diagnostic strategies, and empirical evaluation, with the goal of helping modelers make informed decisions about when, how, and whether to rely on LLMs.
💡 Research Summary
The paper presents a practical guide for integrating large language models (LLMs) into Modeling & Simulation (M&S) workflows, focusing on common misconceptions, pitfalls, and evidence‑based best practices. It begins by noting the rapid adoption of LLMs in M&S and the danger of treating them as a universal “magic wand.” The authors then dissect core components—prompt engineering, hyper‑parameters (especially temperature), and knowledge augmentation techniques such as Retrieval‑Augmented Generation (RAG) and Low‑Rank Adaptation (LoRA). They demonstrate that longer prompts or excessive few‑shot examples often degrade performance (the “over‑prompting” effect) and that tasks should be broken into smaller, explicitly defined subtasks, each followed by a validation prompt.
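The decompose-then-validate pattern described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ask` is a hypothetical placeholder for a real LLM API call, and the subtasks are invented for the example.

```python
def ask(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"answer to: {prompt}"

def run_with_validation(subtasks):
    """Run each narrowly scoped subtask, then follow it with a validation prompt."""
    results = []
    for task in subtasks:
        answer = ask(task)
        # Each subtask's output is immediately checked by a second prompt,
        # rather than trusting one long, monolithic request.
        check = ask(f"Does the following correctly accomplish '{task}'? {answer}")
        results.append((task, answer, check))
    return results

# Instead of one long prompt, break the modeling job into explicit subtasks:
steps = [
    "List the state variables of the queueing model.",
    "Write the transition rules between states.",
    "Identify the model's output metrics.",
]
for task, answer, check in run_with_validation(steps):
    print(task, "->", answer)
```

Keeping each subtask small also makes failures easier to localize: a wrong answer points to one prompt, not to an entire pipeline.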
A dedicated section on non‑determinism explains that setting temperature to zero does not guarantee reproducibility: residual sampling effects (e.g., ties in token probabilities), system‑level nondeterminism such as non‑associative floating‑point reductions across hardware and batching, and silent model updates can all change outputs. Mitigation strategies include fixing random seeds, drawing multiple samples and averaging, adjusting top‑p, and employing deterministic inference engines when needed.
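To make the roles of temperature, top‑p, and seeding concrete, here is a self-contained sketch of the sampling step over a toy vocabulary. The logit values are invented for illustration; real decoders operate on model outputs, and this ignores the system-level sources of nondeterminism mentioned above.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Softmax sampling with temperature scaling and nucleus (top-p) truncation."""
    rng = rng or random.Random()
    if temperature == 0:
        # Greedy decoding: always pick the highest-logit token (no randomness,
        # assuming no ties).
        return max(logits, key=logits.get)
    # Temperature-scaled softmax: lower temperature sharpens the distribution.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(exps.values())
    probs = {t: e / total for t, e in exps.items()}
    # Nucleus truncation: keep the smallest top set whose mass reaches top_p.
    kept, mass = {}, 0.0
    for t, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[t] = p
        mass += p
        if mass >= top_p:
            break
    # Sample from the renormalized truncated distribution; a fixed-seed rng
    # makes this step reproducible.
    r = rng.random() * sum(kept.values())
    acc = 0.0
    for t, p in kept.items():
        acc += p
        if r <= acc:
            return t
    return t

logits = {"sun": 2.0, "rain": 1.5, "snow": 0.1}
print(sample_token(logits, temperature=0))  # greedy -> "sun"
print(sample_token(logits, temperature=0.8, top_p=0.9, rng=random.Random(42)))
```

Note that temperature 0 only removes the sampling randomness shown here; it does not shield against the floating-point or model-update effects the paper discusses.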
The paper warns against feeding massive M&S datasets to an LLM in a single request, citing token limits and loss of critical information. Instead, it advocates selective compression, representation choice (e.g., adjacency list vs. matrix for graphs), and empirical testing of which format yields higher accuracy. The authors also caution that external knowledge bases used in RAG can unintentionally bypass guardrails or cause model collapse, emphasizing careful assessment before augmentation.
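The representation choice the authors mention can be made tangible with a small sketch. Both encodings below carry the same graph; which one an LLM handles better is, as the paper stresses, an empirical question to test on your own task. The edge list and node count are illustrative.

```python
def to_adjacency_list(edges, n):
    """Compact encoding: one line per node listing its out-neighbours."""
    nbrs = {i: [] for i in range(n)}
    for u, v in edges:
        nbrs[u].append(v)
    return "\n".join(f"{u}: {sorted(vs)}" for u, vs in nbrs.items())

def to_adjacency_matrix(edges, n):
    """Dense encoding: an n x n grid of 0/1 entries, one row per line."""
    m = [[0] * n for _ in range(n)]
    for u, v in edges:
        m[u][v] = 1
    return "\n".join(" ".join(map(str, row)) for row in m)

edges = [(0, 1), (1, 2), (2, 0)]
lst = to_adjacency_list(edges, 3)
mat = to_adjacency_matrix(edges, 3)
# For sparse graphs the list form usually consumes far fewer tokens as n grows;
# string length is only a crude proxy for the true tokenized length.
print(len(lst), len(mat))
```

The same principle applies beyond graphs: before compressing simulation data for a prompt, compare candidate formats empirically rather than assuming the most compact one is best.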
Emerging trends such as multimodal inputs, “design by subtraction” (removing unnecessary components), and style transfer are explored through concrete experiments (e.g., generating empathetic narratives by transferring style from known figures). The paper stresses the importance of pipeline‑level governance: data version control, automated experiment tracking (MLflow), personal data detection (Presidio), fairness evaluation, and orchestration tools like LangChain.
Finally, the authors provide hands‑on exercises that guide readers through assessing an LLM’s existing knowledge, deciding when fine‑tuning or LoRA is justified, and systematically evaluating performance across different prompt designs and representations. Overall, the work offers a comprehensive, reproducible framework for when, how, and to what extent LLMs should be relied upon in Modeling & Simulation.