DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice
Alejandro Pradas-Gómez¹
Product Development Division, Department of Mechanical Engineering, Chalmers University of Technology, Gothenburg, Sweden
e-mail: alejandro.pradas@chalmers.se
GKN Aerospace Engine Systems, Trollhättan, Sweden

Arindam Brahma
Product Development Division, Department of Mechanical Engineering, Chalmers University of Technology, Gothenburg, Sweden
e-mail: arindam.brahma@chalmers.se

Ola Isaksson
Product Development Division, Department of Mechanical Engineering, Chalmers University of Technology, Gothenburg, Sweden
e-mail: ola.isaksson@chalmers.se

Engineering analysis automation in product development relies on rigid interfaces between tools, data formats and documented processes. When these interfaces change, as they routinely do as the product evolves in the engineering ecosystem, the automation support breaks. This paper presents DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) agentic orchestration, an approach for developing, executing and evaluating LLM-based agentic automation support of engineering analysis tasks. The approach separates adaptive orchestration, performed by the LLM agent, from deterministic execution, performed by verified engineering tools. The agent interprets documented design practices, inspects input data and adapts the processing path, while the engineer supervises and exercises final judgment. DUCTILE is demonstrated on an industrial structural analysis task at an aerospace manufacturer, where the agent handled input deviations in format, units, naming conventions and methodology that would break traditional scripted pipelines.
Evaluation against expert-defined acceptance criteria and deployment with practicing engineers confirm that the approach produces correct, methodologically compliant results across 10 repeated independent runs. The paper discusses the paradigm shift and the practical consequences of adopting agentic automation, including unintended effects on the nature of engineering work when removing mundane tasks and creating an exhausting supervisory role.

Keywords: large language models, agent-based design, design automation, design methodology, design process, product development, expert systems, design evaluation

1 Introduction

Product development relies on an engineering ecosystem with many interacting elements: engineers, data, methods, processes and specialist tools [1-3]. In engineering analyses, organizations often try to automate how these interactions occur, for example through scripts and tool-specific workflows. In practice, designers repeatedly move between context-specific tools, scripts and methods, adapting the workflow as the requirements, models and interfaces change [3, 4]. However, such changes, even if minor, can often lead to significant disruptions, creating knock-on effects in downstream design activities or changes within the product [5, 6], for example because of a tool update, revised requirements or an organizational change. The result is a brittle analysis process [7], where significant engineering time is spent on data wrangling [8, 9] and tool orchestration [10, 11] rather than on engineering decisions that directly affect product performance.

In practice, organizations have addressed the brittleness of engineering automation through two main strategies. One is to extend the deterministic automations to add more rules and cover more scenarios, which increases complexity and creates new failure modes as the variation grows [12-14].
If the automation itself cannot cover the changes, the second approach is to rely on expert engineers to adapt on a per-case basis. This second approach is slow and concentrates critical knowledge in individuals, creating an organizational vulnerability when those individuals are not available [14-16]. Neither approach scales well in environments where the workflow evolves alongside the product over a decade-long development process. Approaches integrating machine learning with expert validation have improved knowledge sourcing efficiency in well-structured domains [15], but remain dependent on predefined data structures, expert-curated feature sets, and rigid methodological pipelines, leaving the underlying brittleness of the workflow unaddressed.

In this paper, a third approach is proposed: Large Language Model (LLM)-based agents as orchestration layers that connect engineers to the verified and trusted domain-specific engineering tools. The agent does not replace the domain-specific software; instead, it interprets the documented design practice, inspects the available data and adapts to the particular context. In doing so, it generates and executes processing code at the engineer's request through the same verified and trusted tools that the engineer would use manually. By adding a standardized and universal orchestration layer on top of the automation, this approach minimizes the development (and maintenance) burden of tailored workflows and frameworks. With this approach, engineers remain responsible for reviewing the plan, supervising execution and validating outputs, as per the latest AI policies and standards of aerospace and automotive certification bodies [17-19]. In this framing, adaptability is provided by the agent, while engineering rigor and traceability are preserved by the deterministic, domain-certified tools.

¹ Corresponding Author. March 18, 2026.
This paper makes two contributions. First, it presents DUCTILE, an approach for developing, executing and evaluating LLM-based agentic automation of engineering analysis tasks, grounded in the separation of adaptive orchestration from deterministic, verified tool execution [20]. Second, it provides a test case and evaluates the approach on an industrial load processing task at an aerospace manufacturer, where the agent handled four input deviations that would break traditional automation, across 10 independent runs and two engineers with different supervision styles. The novelty lies in demonstrating that LLM agents can orchestrate real aerospace analysis workflows transparently and compatibly with existing verified tools, while absorbing the input variability that breaks traditional automation. The case, tools and evaluation are provided as open source and free of licenses for other researchers to replicate and improve.

2 Background

2.1 The Tool Adoption Challenge at Aerospace Engineering Companies. Of specific interest in this paper are the challenges of automation in mechanical engineering analysis, particularly in the aerospace domain [15, 21]. The engines division of GKN Aerospace (hereafter, the case company) is a Swedish manufacturer of jet engine components and a risk- and revenue-sharing partner for various jet engine Original Equipment Manufacturers (OEMs). The case company routinely performs structural analysis activities to substantiate airworthiness and certification activities, an essential part of the safety-focused industry [22]. These activities follow documented design practices and rely on a combination of commercial solvers, legacy in-house scripts, and an internally developed, evolving ecosystem of modular Python tools [23].
The variety of tools in the ecosystem creates a practical challenge that directly relates to the user's experience level. A new engineer, for example, may understand the engineering principles but may face a steep learning curve in connecting that knowledge to the specific tools and sequences unique to the company's internally established processes. On the other hand, an experienced engineer may know the methodology intimately but may lack the time to explore the documentation and capabilities of a frequently updated tool. Both cases highlight a significant and ever-increasing gap between the growing capability of computational tools and the engineering judgment required to use them effectively [24]. The engineer's value lies in understanding what the analysis must achieve and whether the results are correctly interpreted. However, in practice, when a design change needs to be evaluated, a significant amount of time is spent on non-engineering tasks such as debugging data format mismatches between modules.

2.2 Design Automation in Engineering Analysis. The challenge described at this case company is not unique to the organization. Engineering design problems are creative, ill-defined and context-dependent, often described as wicked problems [25]. In such problems, the designer often lacks clarity early in the process, and the problem(s) become fully clear only as the solution emerges [26]. Simon [27] framed design as a process of satisficing under bounded rationality. This definition applies to both design synthesis and analysis. Vidner et al. [14] build upon Ullman's design paradox [28] to argue that if design automation is seen as a design activity itself (designing a computational support for the engineer developing a product), then the uncertainty and lack of knowledge in the early phase of the development compounds on the design automation layer. Gericke et al.
[1] argue that design methods are not isolated procedures but are embedded in ecosystems of representations, tools, and purposes. These properties have implications for any attempt to automate design work: the automation must accommodate variability, incomplete information, and evolving understanding. These are precisely the conditions that rigid, deterministic systems handle poorly. Specific to the aerospace industry, this is a challenge as it creates significant bottlenecks to rapid design iterations, both in the design [29] and manufacturing domains [30].

In the literature, a number of approaches have been put forward over the years to support engineers in performing analysis of their components, mainly through various forms of computational support for engineering design that have evolved through several generations. We explore three of the most common in aerospace [31].

The first category is Knowledge-based engineering (KBE) systems, which encode expert rules and geometric primitives to automate routine design tasks, but which require ontologies [32] and design primitives to be defined in advance of the design and analysis activity [12]. In industrial practice, KBE developers and the engineers who perform the analysis are typically not the same people [33]. The specialized knowledge and effort required to build and maintain KBE applications limit their use to high-performing teams in early design concepts, where the design space is sufficiently constrained [34]. Further, their use in early phases allows Multi-Disciplinary Optimization (MDO) or design space explorations (DSE) within pre-defined validity ranges that can be built into models [35]. However, for the detailed design and certification phases, engineers do not work with pre-determined environments, limits, or design paradigm constraints.
The data received from previous processes varies considerably because other teams generate the data manually, and analysis processes and methods quickly evolve. In these scenarios, KBE approaches are rarely viable [36].

The second category is process integration and design optimization (PIDO) platforms or workflow managers [37]. Independent platforms such as modeFRONTIER, ModelCenter, optiSLang, iSight, or CAE-embedded equivalents such as ANSYS Workbench, or equivalents from Siemens, Dassault Systèmes, etc., connect heterogeneous solvers through graphical workflow editors, enabling parametric and MDO studies [38]. The third category is scripted pipelines, which serve a similar purpose of connecting sequences of modular tools.

Each of these approaches shares a common limitation: the interfaces between components are deterministic and must be specified in advance. For example, a KBE ontology cannot accommodate a design parameter that was not foreseen when the system primitives were built, and a scripted or visual pipeline produces errors when the OEM delivers loads in a different format. In practice, even rule-based systems with fully explicit logic are described as black boxes [36] by the engineers who use them: the complexity of the rule chains, whether graphical or codified, exceeds what a practitioner can trace in complex products and design scenarios, regardless of the system's theoretical transparency [13, 36].

2.3 AI in Engineering Design. The 2022 JMD Special Issue on Artificial Intelligence and Engineering Design [39] mapped the intersection of AI and design research along five AI method themes: natural language processing, graph neural networks, generative models, transfer learning, and representation learning; and across four design research themes spanning conceptual synthesis, process acceleration, direct estimation and broader process support.
Of the five themes, generative models are the most relevant to the present work, yet they address a fundamentally different problem. Deep Generative Models (DGMs) [40] such as GANs, VAEs and diffusion models produce new design candidates or predict performance fields. For example, [41] addresses the DGM as a surrogate model. Computation occurs entirely within the neural network, and the output is a design proposal or a surrogate prediction: a narrow problem. In contrast, the central challenge highlighted here focuses on the wide problem that includes the detail phases of product development of safety-critical products. In such cases, specified documented processes must be followed, specific tools must be called in a defined sequence, and inputs and outputs must be traceable through the quality assurance chain [42, 43]. This orchestration problem falls outside the scope of DGMs or other Machine Learning techniques (e.g. pure numerical regression or classification).

These limitations point toward a class of approaches not captured in earlier AI research applied to engineering design: systems that can mediate between heterogeneous tools, follow documented processes, and adaptively orchestrate analysis workflows. Large language model (LLM)-based agents represent a promising step in this direction. In the previously mentioned JMD special issue [39], only one article used an LLM architecture [44], for processing textual product descriptions. Basic LLM design synthesis tasks were explored as early as 2024 [45, 46]. However, a literature review that year of the engineering design journals revealed that most research concerned the use of LLMs as tools in the process, not as orchestrators [20]. Since then, the capabilities of the models have evolved.
Software engineering took a leading role in developing applications that first autocompleted sentences, then used tools to read and edit files, and finally evolved into agentic applications that delegate complete tasks inside the Integrated Development Environment (IDE) or the Command Line Interface (CLI), such as Claude Code [47], Codex [48] or OpenClaw [49].

Recently, Massoudi and Fuge [50] demonstrated that agentic LLMs can support conceptual systems engineering tasks, including requirements analysis, functional decomposition and architecture trade-offs. Their work shows that LLM agents can reason about design intent when provided with appropriate context. However, their scope is the early design phase, where outputs are conceptual and the integration of in-house tools and methods is not emphasized. Mustapha [51] also notes the change in the scope of LLMs from tool to orchestrator, the paradigm shift from determinism to stochasticity, and adaptability or malleability, a concept similar to ductility.

2.4 Summary: systemic brittleness of existing approaches. A recurring theme across the engineering design literature is that methods and tools do not work in isolation but are embedded within evolving socio-technical systems. As Gericke and Eckert [1] note, methods, tools and practices co-evolve; changes in one element propagate into others. Empirical studies on engineering change propagation show that even minor interface shifts can trigger disproportionate rework [5, 52]. Deterministic automation approaches, including KBE systems, scripted pipelines, and rule-based workflows, are particularly brittle under such conditions because they rely on predefined assumptions about data structures and process sequences.
When these assumptions no longer hold, failures emerge that are difficult for engineers to diagnose, leading many to perceive formally transparent systems as practical "black boxes" [13]. The limitations of deterministic systems also echo earlier observations that design tools should support adaptive, intelligent behavior rather than encode rigid procedures [53], yet this has still not been realized in current industrial practice.

The background literature, therefore, indicates a common need for an intermediary that is capable of accommodating variability in tools, data, and practice. The literature further shows that the brittleness of previous approaches arises not only from isolated technical faults but also from the inability of deterministic systems to adapt to the normal variability and evolution of engineering practice. This is the central limitation that the LLM-based orchestration layer proposed in this paper seeks to address. The central insight from the literature is therefore summarized as a core principle which is used to guide the development of the method:

Central Orchestration Principle: Effective engineering orchestration requires an intermediary capable of accommodating variability in data, tools and practices, rather than relying on rigid, predetermined interfaces.

From the survey of literature, a set of requirements for engineering automation can be extracted; Table 1 summarizes twelve requirements, R1-R12, with brief descriptions and supporting references.

3 LLM Agents: Technical Foundations

Building on the central orchestration principle articulated in Sec. 2.4, in this section we examine how large language models (LLMs) in agents provide such an adaptive intermediary. So far, we have avoided defining the term Agent. Historically in AI, agents have been defined in many ways, including tools whose interactions are via indirect management [64-66].
In the context of LLMs, as the technology matures, the term continues to evolve, reflecting different characteristics and challenges. Earlier work defined LLM agents as a "[...] task oriented combination of LLM models and the infrastructure to access tools, retrieve context information, and store memory of its interactions" [67]. While every LLM and framework provider has a variation of this definition, this paper uses the term that is gaining traction lately, which involves a description of what it does rather than what it is: an LLM agent runs tools in a loop to achieve a goal [68]. In this paper, we distinguish between the agentic framework, the software toolkit that provides the inference loop and tool-calling infrastructure (e.g. Claude Code, Pydantic-AI), and the agentic application, the configured deployment that combines a specific model, system prompt and tool connections for a given engineering context. Sec. 4 describes how an agentic application is developed, executed, and evaluated.

Fig. 1 Schematic of the LLM inference loop. The model takes the full context window as input, produces a probability distribution over the next token, samples one token according to a sampling strategy, appends it to the context window, and repeats until a stop token is generated. The resulting sequence of tokens forms the response.

We focus solely on external, off-the-shelf large language models, recognizing that most engineering organizations, especially medium and small firms, lack the resources to develop or fine-tune such models in house. The context window is central to this capability: since externally developed models have no built-in knowledge of an organization's internal tools and processes, the information available in the context window determines what the model can draw upon when generating each token.
LLMs offer a flexible inference mechanism that can interpret context, reconcile heterogeneous representations, and mediate between evolving tools and engineering intent [51]. Recent advances have further strengthened their suitability for orchestration through two capabilities that have matured rapidly: structured reasoning modes (often referred to as thinking or deliberate reasoning) and tool calling, which enables controlled interactions with deterministic external software. This subsection provides a brief introduction to these mechanisms for a mechanical engineering audience.

3.1 Response Generation. A transformer-based LLM [69] generates one token at a time through an autoregressive process, see Fig. 1. At each step, the model takes the entire context window (the system prompt, the user message and all tokens generated so far) and produces a probability distribution over the vocabulary of possible next tokens. A sampling strategy selects one token from this distribution, and the selected token is appended to the context window for the next iteration. This loop continues until a stop token is sampled, at which point the accumulated sequence of tokens is returned as the response. Other recent language model architectures, namely diffusion [70], substitute the autoregressive sequential token generation with a series of refinement steps. Both the architecture and the process are fundamentally non-deterministic: the same input can produce different responses depending on the sampling strategy and its stochastic parameters.

3.2 Thinking Mode. The transformer's autoregressive token generation introduces a limitation: once the model commits to an initial direction, subsequent tokens are conditioned on that choice, making it difficult to reverse course [71]. This can lead to fixations or hallucinations [72].
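The autoregressive loop of Sec. 3.1 can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: `model` stands in for a transformer forward pass returning a token-to-probability dictionary, and `sample` implements simple temperature sampling.

```python
import random

def sample(distribution, temperature=1.0):
    # Temperature sampling: sharpen or flatten the distribution, then draw one token.
    weights = {tok: p ** (1.0 / temperature) for tok, p in distribution.items()}
    total = sum(weights.values())
    tokens = list(weights)
    return random.choices(tokens, weights=[weights[t] / total for t in tokens])[0]

def generate(model, context, stop_token="<eos>", max_tokens=100):
    # Autoregressive loop: the full context window is re-read at every step.
    response = []
    for _ in range(max_tokens):
        distribution = model(context)   # probability distribution over the vocabulary
        token = sample(distribution)    # stochastic choice: the source of non-determinism
        if token == stop_token:
            break
        context = context + [token]     # append the token for the next iteration
        response.append(token)
    return response
```

Running `generate` twice on the same input can return different responses, which is exactly the non-determinism noted at the end of Sec. 3.1.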
Table 1 Requirements for DUCTILE and supporting literature

R1 Inspectability: The agentic application (prompts, plans, tool calls, intermediate artifacts) must be directly inspectable by engineers/auditors; no hidden state. [17, 18]
R2 Reproducibility: Model, prompt, tool versions and inputs must be pinned and logged; runs are reproducible or cause-attributable despite stochastic LLM steps. [17, 18]
R3 Deterministic execution boundary: Safety/engineering computations run in verified, deterministic tools; the LLM performs orchestration and code generation only. [18, 51]
R4 Traceability & auditability: Every artifact is trace-linked to the generating step (arguments, stdout/stderr, checksums, timestamps) to support do/check/approve reviews. [17, 18]
R5 Data governance: Deployment mode (on-premises/API) and data flows comply with confidentiality/export/privacy rules and are justified. [17, 18]
R6 Human oversight & accountability: Engineers review plans and sign off on outputs; autonomy is limited to orchestration, not engineering judgment. [17]
R7 Robustness to routine variability: Orchestration handles foreseeable variation (formats, units, naming, minor method updates) without modifying certified tools. [5, 13, 51, 52, 54-56]
R8 Observability: The system emits traces for model calls and tool invocations (arguments, latency, tokens/cost) sufficient to diagnose failures. [57, 58]
R9 Minimal coupling: Tool integration is lightweight/externalized (CLIs/typed APIs); generated code is readable and runnable by engineers. [12, 13, 55, 56]
R10 Evaluation and pass^k: Reliability is demonstrated on curated cases with deterministic checks and/or LLM-as-a-judge, using repeated independent runs (pass^k) proportionate to risk. [59-61]
R11 Change control: Any change to model, prompts, tools, or design practice triggers re-evaluation on the same cases before deployment. [17, 18]
R12 Documentation quality: Tool/method documentation is sufficiently structured for agents to follow procedures; ambiguities are revised or clarified. [62, 63]

Fig. 2 Comparison of LLM responses with and without thinking mode enabled. The thinking scratchpad (dashed box) allows the model to self-critique and reconsider the autoregressive initial response before committing to a final answer.

Thinking mode is a mitigation strategy developed during model post-training using reinforcement learning [73, 74], and first experienced by the general public with OpenAI's o1 model in September 2024 [75] and subsequent competitor models. Practically, the so-called thinking mode provides the model with a scratchpad in which self-critique is rewarded. Specific token sequences allow the model to reconsider its initial direction, explore alternative interpretations and select the most appropriate response before committing to a final answer [76], see Fig. 2.
For engineering orchestration, thinking mode is important because it allows the model to reason at inference time about ambiguous or unexpected situations before acting on them, increasing the probability of a successful final answer [74].

3.3 Tool Calling. Tool calling enables the model to ground its responses on external data and computations, beyond what is encoded in its training weights or present in the context window [77]. Models are trained to recognize when a question requires external information and to generate a structured output (a function name and arguments) instead of a direct answer. The model does not execute the tool itself. In practice, it performs a classification task, selecting which of the available tools is best suited for the current step and inferring the appropriate arguments from the context window. Fig. 3 illustrates this with an engineering example.

Fig. 3 Comparison of LLM responses with and without tool calling. Without tools, the model relies on parametric knowledge from training weights, returning a room-temperature value. With tool calling, the model queries an external materials database for temperature-dependent data, grounding its response on verified external information.

Wei et al. [78] identify three styles of tool integration in agentic systems: (a) in-context, where the model reads documentation and generates calls zero-shot; (b) post-training, where the tool-use strategies are fine-tuned into model weights; and (c) orchestration-based, where a planning layer coordinates multi-tool workflows.
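The structured-output pattern of Sec. 3.3, where the model emits a function name and arguments rather than an answer, can be sketched as a minimal dispatch step. The tool registry, the JSON call format and the `materials_db` lookup below are illustrative placeholders, not the case company's actual tools or any vendor's API.

```python
import json

# Hypothetical materials tool, standing in for a verified external database.
def materials_db(alloy: str, T: float, units: str = "degC") -> float:
    table = {("Ti-6Al-4V", 400.0): 620.0}   # illustrative data point only
    return table[(alloy, T)]

TOOLS = {"materials_db": materials_db}

def handle_model_output(output: str):
    # The model emits either plain text (a final answer) or a JSON tool call such as
    # {"tool": "materials_db", "args": {"alloy": "Ti-6Al-4V", "T": 400.0}}.
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return ("final", output)                    # direct answer, no tool needed
    result = TOOLS[call["tool"]](**call["args"])    # execution happens outside the model
    return ("tool_result", result)                  # result is fed back into the context window
```

The key point matches the text: the model only classifies and fills arguments; the computation itself runs in deterministic external code.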
The present work uses the first and third; the agent reads tool documentation to learn about unfamiliar APIs and then plans multi-step workflows across multiple tools. Post-training integration is deliberately excluded because engineering tools evolve and must remain independently verified and versioned. Fine-tuning the model to internalize the tool signatures or APIs is, as mentioned earlier, possible but unrealistic in practice for most engineering companies.

3.4 Orchestration. These three mechanisms, inference, thinking mode, and tool calling, establish a two-layer architecture. The LLM selects actions and adapts to context; the external tools execute deterministically. Yao et al. [79] formalized this interleaving of reasoning traces and tool actions in the ReAct framework, showing that the model decides at each step whether to reason further, act on the environment, or return a final answer. This pattern has been demonstrated beyond engineering: Shen et al. [80] showed that an LLM can act as a controller that plans tasks, selects specialized AI models and orchestrates their execution across vision, speech, and language domains.

Thinking and tool calling capabilities are prerequisites for orchestration. The agentic reasoning literature distinguishes two forms of scaling at inference time [78]: scaling test-time computation, where the model reasons longer within its context window (mapping to thinking), and scaling test-time interaction, where the model interacts with external tools and receives feedback from the environment.

3.5 LLM Agents in Engineering Practice. Developments in the application of LLMs also come from commercial software vendors, who have begun embedding LLM-based assistants within their products.
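The ReAct-style decision loop of Sec. 3.4 (reason further, act on the environment, or return a final answer) can be sketched as follows. The three-way step format and the `model` callable are simplifications assumed for illustration; real frameworks encode these decisions in the token stream.

```python
def react_loop(model, tools, context, max_steps=20):
    # Interleave reasoning traces and tool actions until a final answer is emitted.
    for _ in range(max_steps):
        step = model(context)                             # decide: think, act, or answer
        if step["type"] == "think":
            context.append(("thought", step["text"]))     # reasoning trace, no side effects
        elif step["type"] == "act":
            result = tools[step["tool"]](**step["args"])  # deterministic external execution
            context.append(("observation", result))       # feedback from the environment
        else:  # "final"
            return step["text"]
    raise RuntimeError("no final answer within step budget")
```

The two layers of the architecture are visible in the loop: the model only chooses the next step, while every computation runs in the `tools` layer.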
Simulation platforms such as ANSYS [81] or COMSOL [82] now offer interfaces that support engineer interactivity within the GUI. These assistants are, however, tightly coupled to their host application. They have access to the model being edited but cannot access external design practice repositories, in-house tools, or data from other stages of the analysis process. This closed-ecosystem constraint limits their applicability to the engineering orchestration problem described in Sec. 2.1, where the challenge is to bridge across multiple tools and data sources. Emerging service vendors such as COSMON [83] are developing agentic interfaces that connect LLMs to commercial CAD and simulation software. These solutions are interesting as no-code solutions. However, given the requirements of inspectability (R1), reproducibility (R2), and traceability and auditability (R4), these solutions are discarded: the companies do not have direct access to adapt them, which would add another potential layer of brittleness. To address this gap, this paper argues that a minimal and generic agentic application should be developed at each company.

4 The DUCTILE approach to agentic orchestration

This section addresses the gap identified in the previous sections by introducing the DUCTILE (Delegated, User-supervised Coordination of Tool- and document-Integrated LLM-Enabled) approach to agentic orchestration, whose core concept is to separate interpretation from computation. As argued in Sec. 2.4, brittleness arises when adaptation is encoded in fixed rules. DUCTILE addresses this by assigning interpretation and orchestration to an LLM and delegating all calculations to verified tools, improving robustness to interface and input changes without relaxing engineering rigor. This separation also supports traceability and certification compliance, which are necessary for the aerospace context.
It is aligned with the requirements R1–R12 summarized in Table 1. The following subsections describe how the approach is implemented. The description is divided into three parts: developing the agentic implementation (Sec. 4.1), executing the agentic analysis (Sec. 4.2) and evaluating the agentic performance (Sec. 4.3). Fig. 4 shows the architecture of the agentic application used to execute the analysis.

4.1 Developing the agentic implementation. In the development stage, the agentic application is established so that orchestration can be observed, reproduced, and integrated with existing engineering tools. In line with the requirements (Table 1), the agentic application implementation is shaped by two considerations: (i) the character of the agentic framework used to host the agent (addresses R1, R2, R9) and (ii) the configuration of its core connections to models, prompts, and tools (addresses R2–R5, R8).

4.1.1 Framework character. The agentic framework exhibits inspectable and reproducible behavior, transparent configuration with minimal hidden state and low coupling to the analysis stack. In practice, this corresponds to open-source, appropriately licensed solutions that surface prompts, traces, and file operations to the engineer, and that connect to external documents, tools, and analysis codes with minimal glue (addresses R1, R2, R9).

4.1.2 Configuration scope. Within such a framework, an agentic application is organized around three connective elements as shown in Fig. 4: (a) the LLM model connection, (b) the system prompt, and (c) the tool interfaces. The model connection covers the chosen model family and deployment mode (on-premises or API), together with version pinning and environment capture so that runs are attributable across iterations (addresses R2, R5).
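As a minimal sketch, the version pinning and environment capture described in Sec. 4.1.2 can be collected into a single run-attribution record; every key and value below is an illustrative assumption, not the schema of any specific framework.

```python
# Hypothetical run-attribution record (illustrative only): pins the model,
# prompt, tool versions, and environment so a run can be reproduced and
# attributed across iterations.
RUN_CONFIG = {
    "model": {"family": "example-llm", "version": "2026-01-15",  # pinned
              "deployment": "on-premises"},                      # or "api"
    "system_prompt_sha256": "<hash of the prompt file>",
    "tool_versions": {"loads_processing": "2.1.0"},              # hypothetical tool
    "environment": {"python": "3.12.1", "framework": "0.4.2"},
}
```

Persisting such a record next to each run's inputs and outputs is one way to make runs attributable (addresses R2, R5).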
The system prompt provides stable interaction conventions and output contracts, while domain procedures and method rules are provided at run time through documents and skills rather than embedded as static instructions; this keeps behavior auditable and portable across projects (addresses R1, R9). Tool interfaces present deterministic, versioned capabilities to the agent via explicit, typed I/O contracts (paths, formats, units) and controlled write locations; generated scripts and invocation logs are persisted next to inputs/outputs to preserve traceability (addresses R3, R4, R9).

4.1.3 Selecting and connecting to an LLM. Model choice for engineering analysis is typically shaped by a three-way trade-off between capability profile, computational cost, and data-governance constraints. Capability is commonly profiled along three dimensions: tool use (selection and invocation of external tools), as reflected in τ-bench [84]; agentic reasoning (multi-step planning and recovery on error), as reflected in AgentBench [59]; and code generation quality, e.g., SWE-bench [85]. The choice of on-premises vs. externally hosted model deployment further conditions feasibility and governance. On-premises execution offers full control over data flows, while externally hosted models reduce infrastructure burden and broaden immediate model availability. If an externally hosted solution is chosen, it shall be bounded by confidentiality requirements, export-control rules, and applicable data-protection frameworks (e.g., GDPR). For exploratory or non-sensitive workloads, however, external providers often function as a practical starting point. Regardless of the choice, model and version pinning enables run-to-run attribution (addresses R2, R5).

4.1.4 System Prompt. Model behavior is influenced by the system prompt [86], whose design can evolve iteratively or via systematic methods [86, 87].
Guidance from major providers (OpenAI [88], Anthropic [89], Google [90], Meta [91], Mistral [92]) offers differing advice, but shares common sections that are always recommended:

• Goal: purpose and scope of responsibility.
• Style: tone and interaction conventions for communicating with the user.
• Available tools: typically injected automatically by the framework.
• Reasoning strategy: approach to task decomposition, depth of diagnosis and information exhaustiveness.
• Output format: a verbal schema or a structured contract (e.g., a final tool call).

Within the present approach, the prompt is kept task-agnostic (addresses R1, R9). Company-specific procedures, methods, and design practices are not encoded statically; instead, they are retrieved at run time from knowledge resources or disclosed progressively via skills [93]. This keeps behavior auditable and portable, and allows the user to state the task (e.g., perform task X using method Y, revision Z) while the agent assembles the relevant organizational information (addresses R1, R9, R12).

Journal of Mechanical Design PREPRINT / 5

Fig. 4 Architecture of an agentic application. The inference engine processes user messages through the LLM within a context window, generating either final answers or tool calls. Tool execution is handled internally or through external interfaces that connect to external resources including file systems, knowledge bases, tool repositories and system interfaces.

4.1.5 Configuring tool access. Tool access can be provided internally or via external resources (Fig. 4). The approach favors connecting to external resources wherever feasible to minimize framework complexity and maximize reusability across agent instances (addresses R9).
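The prompt sections recommended in Sec. 4.1.4 can be sketched as a task-agnostic template; the wording below is purely illustrative and is not the prompt used in the study.

```python
# Illustrative task-agnostic system prompt following the five recommended
# sections; all wording is hypothetical.
SYSTEM_PROMPT = """\
# Goal
You orchestrate engineering analysis tasks. You interpret documented design
practices and plan tool calls; all calculations are delegated to verified tools.

# Style
Be concise. Report every deviation you detect and ask before irreversible steps.

# Available tools
(Injected automatically by the framework.)

# Reasoning strategy
Read the task and referenced documents first, inspect the inputs, then present
a step-by-step plan for approval before executing it.

# Output format
End each turn with a summary of actions taken, artifacts written, and any
deviations handled.
"""
```

Note that no company-specific procedure appears in the template; per Sec. 4.1.4, such content is retrieved at run time from documents and skills.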
Interfaces expose deterministic, versioned capabilities with strict I/O contracts (paths, formats, units) and controlled write locations; generated scripts and invocation logs (arguments and stdout/stderr) are persisted alongside inputs and outputs to preserve traceability and support inspection (addresses R3, R4).

4.2 Executing the agentic analysis. Execution proceeds as an alternation between stochastic planning and deterministic tool use. The engineer provides the design context, inputs, and task objective; the application incorporates these into the context window. The inference process described in Sec. 3 proposes the next step toward the goal. Here, the stochasticity inherent to the model is an advantage: the model adapts inputs and company knowledge to the particular design context and goals (orange path in Fig. 4). Once the model proposes a tool call, the agentic framework executes it deterministically and appends the result to the context window. The agent repeats this cycle, calling successive tools until the task objective is met (addresses R3, R4, R7).

This loop connects the agent to the same tools an engineer would execute manually or hardcode in a scripted pipeline, but the orchestration logic is no longer hardcoded. The LLM translates a process description defined in natural language in a document into a sequence of tool invocations, selecting each step based on the task objective, intermediate results, and available tools (addresses R1, R9). This adaptability is reported to the engineer in the final response and remains fully visible in the model traces and generated artifacts.

At any point in the parallel or sequential iterations of tool calls, the agent may return control to the user [79]. The decision to finish the iteration is made by the LLM based on its training behaviors and the content of the context window.
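The alternation between stochastic planning and deterministic execution described above can be sketched as a minimal loop; `call_llm` and the tool registry are hypothetical stand-ins, not the interface of any specific framework.

```python
# Minimal sketch of the plan/act cycle: the LLM proposes the next step,
# tools run deterministically, and results are appended to the context
# window until the model returns a final answer.
def run_agent(task, tools, call_llm, max_steps=10):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_llm(context, tools)   # stochastic planning
        if step["type"] == "final_answer":
            return step["content"]        # agent returns control to the user
        if step["type"] == "tool_call":
            result = tools[step["name"]](**step["args"])  # deterministic execution
            context.append({"role": "tool", "name": step["name"],
                            "content": repr(result)})
        else:                             # intermediate reasoning trace
            context.append({"role": "assistant", "content": step["content"]})
    raise RuntimeError("step budget exhausted before a final answer")
```

The step budget is one simple way to bound the iteration; real frameworks also stream intermediate steps to the engineer, as discussed in Sec. 4.2.1.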
A few examples of when the LLM can choose (via the likelihood of the next token and the subsequent generation of a response) to finish its turn are:

- When the request of the user can be answered directly, without the need to call tools.
- When the model lacks critical information needed to carry out the task.
- When the tool calls have provided enough information to complete the task.
- When a roadblock in the execution plan stops the model from getting closer to the goal (e.g., permission access, tool failure) and reasonable efforts have been made to circumvent those limitations within the independence scope defined in the context window.

4.2.1 Final response and user intervention. When stopping, the agent may summarize the previous activities in its turn. The agentic application parses the results and then finalizes the process by returning the last message to the user. Current agentic applications stream the intermediate tool calls and thinking steps directly to the engineer to follow the task progress. This allows engineers to stop the agentic execution if it is deviating from the goal or stalling on an activity (addresses R6).

4.2.2 Traceability and observability. Messages, tool calls and tool results, at minimum, shall be sent for recording to an observability platform via standard payloads (e.g., OpenTelemetry). If a company-external service is used, the same data-sharing considerations shall be evaluated. Generated code is stored next to inputs/outputs to preserve traceability (addresses R3, R4). Where available, telemetry captures call latency and token/cost for model steps, supporting diagnosis, evaluation, and change control over time (addresses R8, R10, R11). The resulting traces provide the evidence required for do/check/approve review and for the evaluation described in the next subsection.

4.3 Evaluating the agentic application performance.
Method adoption in engineering practice depends on evidence that the method behaves as expected, with a high degree of confidence [94]. Further, the system needs to be validated like any in-house engineering software, with rigor proportional to the risk of the activity performed [95, 96]. However, agentic systems introduce challenges beyond those of traditional software.

First, the context window evolves across tool calls and intermediate reasoning steps, making the internal state difficult to inspect and debug. Observability platforms that trace each model invocation, tool call, and token exchange, such as Logfire [57], are essential for diagnosing failures (addresses R8). Second, outputs can vary structurally across identical inputs and yet be equally valid. Because the DUCTILE approach intentionally uses flexibility to adapt to diverse engineering contexts, suppressing variability is not the goal. Instead, evaluation is performed against curated datasets of representative problems, including edge cases, similar to the held-out test sets used to assess generalization in machine learning [97]. These datasets verify that the agentic application produces correct and compliant outputs across the expected range of inputs, independent of the particular response path taken (addresses R10). Unlike machine learning datasets containing only input x and expected output y, each evaluation case must specify:

(1) The design environment: the system prompt, the engineer query, the executable tools and reference documentation.
(2) The acceptance criteria: expected outcomes expressed as quantitative tolerances, qualitative descriptions, or both.
(3) The evaluation method: a deterministic check against reference values, a human expert review, or an LLM-as-a-judge assessment.

The evaluation must cover both standard interactions and edge cases.
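A minimal sketch of such a three-part evaluation case as a data structure; the field names are illustrative assumptions, not a schema from the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvaluationCase:
    """One agentic evaluation case; field names are illustrative."""
    # (1) The design environment
    system_prompt: str
    engineer_query: str
    tools: dict                        # tool name -> executable callable
    reference_documents: list[str]     # e.g., design practice documents
    # (2) The acceptance criteria
    tolerances: dict = field(default_factory=dict)   # quantity -> allowed delta
    qualitative_criteria: list[str] = field(default_factory=list)
    # (3) The evaluation method: a deterministic check, an expert review,
    # or an LLM-as-a-judge call, wrapped in a callable
    evaluate: Callable[[object], bool] = lambda output: False
```

Bundling the environment with the criteria is what distinguishes these cases from plain (x, y) pairs: the same case can be replayed k times against a changed prompt, tool version, or model.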
Expected answers shall be developed and maintained by domain experts; in practice, these use cases grow to the order of hundreds of scenarios. Because model responses are probabilistic, each use case must be repeated to establish confidence in the agent's ability. DUCTILE adopts the pass^k metric [60], defined as whether the agent correctly performs the same task in k independent repetitions. The choice of k depends on the evaluation scenario and the criticality of the task.

A useful analogy comes from material qualification in aerospace structures, where allowable properties are defined probabilistically as x/y: the x-th population percentile at y% confidence. Three standard levels exist: S-, B-, and A-Basis [61]. Treating each evaluation case as an independent Bernoulli pass/fail event, the Clopper–Pearson exact bound [98] for k successes in k trials yields the minimum number of consecutive passes required at each confidence level, as shown in Table 2.

Table 2  Minimum consecutive passes k for the pass^k metric at 95% confidence, derived from the Clopper–Pearson bound p_lower = α^(1/k) with α = 0.05.

                        S-Basis    B-Basis    A-Basis
Pass probability p      ≥ 0.50     ≥ 0.90     ≥ 0.99
Required k              5          29         299

DUCTILE recommended values: development k = 3; deployment k ≥ 10.

The basis-level k values serve as a reference for comparison with established engineering practice. In practice, DUCTILE delegates execution but not responsibility. The engineer retains accountability for reviewing outputs, and existing engineering quality processes remain in place. Since each repetition consumes computational resources and cost, the DUCTILE approach recommends k = 3 during development and k ≥ 10 for deployment, increasing k as the criticality or autonomy of the agent grows. Evaluations must be re-run whenever the system prompt, tools, documentation, or underlying model changes.
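The required k values in Table 2 follow directly from the stated bound: solving α^(1/k) ≥ p for the smallest integer k gives k = ⌈ln α / ln p⌉, which can be checked in a few lines:

```python
import math

def required_k(p_target: float, alpha: float = 0.05) -> int:
    """Smallest k such that the Clopper-Pearson lower bound for k
    successes in k trials, alpha**(1/k), is at least p_target."""
    return math.ceil(math.log(alpha) / math.log(p_target))

# Reproduces Table 2: S-Basis -> 5, B-Basis -> 29, A-Basis -> 299
print(required_k(0.50), required_k(0.90), required_k(0.99))
```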
At the scale of hundreds of cases repeated k times, human evaluation becomes infeasible. Automated assessment via deterministic checks or LLM-as-a-judge is therefore essential, with trust metrics analogous to inter-rater reliability established for the latter.

5 Industrial application of agentic load processing on a realistic case

The DUCTILE approach described in Sec. 4 is applied to a structural strength evaluation on a realistic case at the case company. The component in this case is a Turbine Rear Structure (TRS), shown in Fig. 5. While the case company routinely performs static strength analyses on the TRS as part of the certification and continuous product development activities, the jet engine OEM generally owns the global stiffness and loads models at the whole-engine integration level, and delegates structural compliance checks to component partners. A design practice document at the case company defines the steps, methods, and expected outputs for this type of analysis.

The use case description. The engineering use case concerns a new structural evaluation triggered by a design change in a neighboring engine component. The geometry, finite element mesh and analysis methods remain unchanged; only the loading input from the OEM is new. A new, company-maintained Python-based loads processing tool is used together with the established design practice. The new OEM delivery introduces four deviations compared to the previous analysis. Three of these deviations are input format changes, a common example of variability when data crosses organizational boundaries; the fourth is a correction method communicated by the OEM requiring a 1.04 factor on all Fx forces, separate from the raw data. In a traditional scripted pipeline, each deviation would require manual intervention or code modification. Table 3 summarizes the four deviations.

Table 3  Deviations in the new analysis scenario.
The first three are input format changes; the fourth is an OEM-communicated correction specified in the task description.

Deviation       What changed                           Impact on automation
File format     YAML instead of JSON                   Requires new parser
Unit system     Imperial instead of SI                 Silent error
Node naming     right/left instead of port/starboard   FEM run fails downstream
Method change   1.04 factor on Fx forces               Silent error

The remainder of this section describes the configuration (Sec. 5.1), the execution with two engineers (Sec. 5.2) and the systematic evaluation of the agent's performance (Sec. 5.3).

5.1 Case Setup. Following Sec. 4.1, the agentic implementation was built using Claude Code [47], a terminal-based agentic framework. An extensive description of the implementation is available in the supplemental material. Claude Code operates within the engineer's existing IDE and file system. Since the engineers at the case company already work with scripts, input files, and terminal commands, the agentic application fits within the existing workflow rather than imposing a new one. The engineer can inspect every file the agent reads or writes (addresses R1, R2). To maximize transparency and user experience, engineering product data information was placed in the working file system during the experiments. The design practice was served via an MCP document server, giving the agent version-controlled access to the methodology. The loads processing tool was distributed as a Python package via PyPI, with API documentation published in both human-readable and LLM-consumable formats (llms.txt) on the corresponding GitHub Pages.

The task description was prepared prior to agent testing, targeting an experienced engineer or an inexperienced graduate joining the company; no refinements to the prompt or integration were required after initial testing.
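To make the deviations in Table 3 concrete, the following is a minimal sketch of the kind of normalization a generated processing script would have to perform; the node names, the kilopound-to-newton conversion, and the data layout are illustrative assumptions, not the actual company tool or OEM delivery format.

```python
# Illustrative normalization of one OEM load case (parsed, e.g., from YAML
# with yaml.safe_load). The renaming, Fx factor and unit change are from
# Table 3; the layout and conversion choice are assumptions.
KLBF_TO_N = 4448.2216152605          # kilopound-force -> newton
NODE_MAP = {"lug_left": "lug_port",  # OEM naming -> FEM naming
            "lug_right": "lug_starboard"}
FX_CORRECTION = 1.04                 # OEM-communicated factor on Fx only

def normalize_load_case(case: dict) -> dict:
    """Rename nodes, apply the Fx correction, and convert klbf to N."""
    out = {}
    for node, loads in case.items():
        out[NODE_MAP.get(node, node)] = {
            comp: val * (FX_CORRECTION if comp == "Fx" else 1.0) * KLBF_TO_N
            for comp, val in loads.items()
        }
    return out
```

The point of the case study is precisely that such conversion logic was not pre-written: the agent inferred the need for it from the documents and inputs.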
The design practice document captures the logic of the activities to be performed. Fig. 6 shows the workflow derived by the authors from this document. This flowchart was not available to the agent. By reading the document and inspecting the inputs, the agent was expected to infer and implement a comparable sequence of activities (addresses R1, R12).

5.2 Execution. Two engineers at the case company were assigned the same task with the same inputs and deviations described in Table 3. Both had access to the agent and were encouraged to interact with it and verify its outputs as they saw fit. The transcripts and output files are available in the supplemental material.

Fig. 5 Top: Views of the component in the physical domain. Bottom: Activity view of the TRS strength evaluation. Only a subset of the inputs, documents and tools are shown for visual clarity. Highlighted in orange is the load processing step where the use case focuses.

5.2.1 Engineer 1: Delegation. Engineer 1, familiar with agentic workflows, delegated the task to the agent and checked the outputs after the full activity was completed. Fig. 7 summarizes the interaction. The engineer pointed the agent to the task description and requested a plan. The agent read the task description PDF, fetched the design practice via the MCP document server, explored the working directory, and examined the previous analysis run and input files. During this reading phase, the agent identified all four deviations in its reasoning trace: the YAML format incompatible with the tool's JSON parser, the imperial unit system, the changed pilot node names, and the 1.04 correction factor specified in the task description. The agent surfaced these findings in a seven-step plan presented to the engineer, proposing a processing approach that addressed each deviation. The engineer approved the plan.
The agent then fetched the tool's API documentation, wrote a processing script, executed it and verified the outputs: six .inp files (one per envelope-selected load case), an envelope summary table, and an exceedance comparison against the previous analysis. The engineer inspected the generated script and output files independently and declared the task completed.

5.2.2 Engineer 2: Incremental discovery. A second engineer, unfamiliar with agentic workflows, took the opposite approach. Rather than delegating the full task, this engineer requested one step at a time: first reading the inputs, then processing a single load case, then checking the output before proceeding. Each intermediate result was inspected before the next step was requested. Through this incremental process, the engineer discovered the task, learned the tool's behavior, and built confidence in the agent gradually. The agent adapted to this step-by-step supervision without difficulty. The final output was identical to Engineer 1's result: the same load case files, envelope table and exceedance comparison. The same four deviations were handled, through a different supervision path.

5.3 Evaluation. Following the evaluation approach described in Sec. 4.3, the agent's performance was assessed against the scenario defined by the four deviations in Table 3. After an independent check by a company domain expert, Engineer 1's output data was used as ground truth for automated evaluation.

The evaluation used two complementary checks. First, an LLM-as-a-judge assessment (Opus 4.6) verified that the agent's processing script correctly addressed all four deviations: YAML-to-JSON conversion, unit conversion, pilot node mapping, and application of the 1.04 Fx correction factor. Second, a deterministic comparison checked the final load summary values against the expert's reference output.
A run passed only if both checks succeeded. The LLM-as-a-judge assessments were manually reviewed by the domain expert, yielding 100% inter-rater agreement. The agent was run 10 independent times on the same scenario, with no modifications to the prompt, tools, or context between runs. All 10 runs passed both the quantitative and qualitative checks, with full agreement between the LLM-as-a-judge (Opus 4.6) assessments and the 10 manual verifications. The evaluation scripts and outputs are available in the supplemental material. The implementation is an example of the DUCTILE approach towards agentic evaluation.

[Fig. 6 flowchart: Start → read OEM load file & verify format → downselect applicable load cases → if the OEM delivery is not in SI units, convert units to the FEM system → verify equilibrium for each load case → if ultimate analysis with limit loads, apply the 1.5 ultimate load factor → write one .inp file per load case → produce summary envelope table → if previous loads are available, produce exceedance comparison → End]

Fig. 6 Engineer-interpreted activity process flowchart of the natural-language, unstructured description contained in the loads design practice document. NOTE: The agent must infer a similar workflow by reading the document; this flowchart is not available in the agent's context.

6 Discussion

6.1 Closing the Brittleness Loop. Sec. 1 identified brittleness as a structural property of engineering analysis automation, breaking rigid pipelines when changes occur. In the case study of Sec. 5, the agentic application consistently identified the errors that would have broken the downstream processes and adapted to the changes, achieving pass^k = 10.
Therefore, it was demonstrated how the DUCTILE agentic orchestration approach can be used to develop, use and evaluate an agentic application that meets the engineering analysis performance expectations and fulfills the requirements identified in Table 1. This was achieved without the need to maintain product datasets or company-specific LLMs. While it was exemplified on a pre-processing task for a mechanical activity, it is left to the reader to transfer the same orchestration and support to a post-processing task, or to other engineering analyses where in-house tools are combined with external tools and written procedures.

6.2 Consequences for Practice. The application of agentic support exposed gaps in the existing design practice documentation at the case company. Experienced engineers bridge ambiguities in written procedures through tacit knowledge. Hsieh et al. [62] demonstrate that well-written tool documentation enables LLMs to use unfamiliar tools zero-shot, without fine-tuning. Yuan et al. [63] show that concise, structured descriptions outperform verbose ones. Investing in documentation quality serves both human engineers and agents. However, the development of the use case showed how those tacit activities were missing from the company design practices, and had to be added while the agentic application development and evaluations were being co-developed. Moreover, by requiring engineering knowledge to be captured in explicit, accessible documents rather than residing in individual expertise alone, the use case showed how the approach reduces the organizational vulnerability that arises when critical knowledge is concentrated in individuals.

[Fig. 7 transcript, condensed: Engineer 1 opens the input files and design practice in the IDE, reviews the deviations independently, and asks the agent to plan the task from @ref_documents/task_description.pdf. The agent reads the task description, lists the working directory, fetches the design practice via the MCP server, and reads the previous run and OEM_loads_v2.yaml, noting in its reasoning: the new file is YAML but the tool expects JSON; units are imperial while the design practice requires SI; pilot nodes are named left/right instead of port/starboard; the task description specifies a 1.04 Fx correction factor not present in the previous run. It proposes a seven-step plan: (1) read YAML, rename lug_left → lug_port and lug_right → lug_starboard; (2) apply the OEM Fx correction (×1.04); (3) verify the coordinate system (no transform); (4) convert units klbs → N; (5) envelope downselection; (6) write .inp files (excluding bearing); (7) envelope table and exceedance comparison. After approval, the agent fetches ductile-loads/llms-full.txt, writes loads_processing_v2.py, runs it with uv, verifies the outputs, and reports: 6 load case files generated; new loads exceed the previous envelope (Mz +30%, Fx +15%, Fy +11–17%). Engineer 1 checks the generated script and output files and declares the task completed.]

Fig. 7 Summary of Engineer 1's interaction. Green bubbles: engineer messages. Orange bubbles: agent responses. Dashed boxes: agent reasoning. Yellow boxes: tool calls. Dotted boxes: engineer actions outside the conversation. An expanded transcript is available in the supplemental material.

The DUCTILE agentic approach uses in-context reasoning exclusively (no fine-tuning nor persistent memory). This is a deliberate choice because, together with evaluations, it allows the agentic application to avoid model lock-in, and to update the model as smaller-footprint, cheaper, or more capable models are released, without effort and robustly.
This contrasts with post-training approaches where tool-use strategies are embedded in model weights [78], and with rule-based systems that, despite formally explicit logic, become practical black boxes as rule-chain complexity exceeds what practitioners can trace (Sec. 2.2). Both engineers in the case study confirmed that the agent's actions were easy to follow and inspect, showing promise for wider adoption in practice with minimal maintenance burden.

Despite the case for a dataset-free configuration pipeline, more complex scenarios may call for company- or task-specific models: cases where agentic performance (tested via evaluations) is not robust enough despite optimizing the context window, or where latency must be minimized and context length reduced. In such scenarios, the parameters of an open-weights model can be fine-tuned to inject company knowledge, and this is where a training dataset becomes necessary. Datasets in engineering, however, remain scarce [99]. Ironically, adopting the agentic approach itself generates the raw material. Every engineer-agent session automatically produces a conversation log capturing design intent and rationale, even when the agent performs poorly. This history can be passed to a separate LLM pipeline to generate positive examples of expected behavior, effectively producing a training dataset without dedicated engineering effort.

6.3 Risks and Unintended Consequences. The aerospace community warns that "overreliance on software tools can lead to a superficial understanding of the underlying physics" [24]. This concern applies to agents as well. A key distinction is needed, however, between dexterity and judgment. Dexterity with specific APIs, data formats, and tool interfaces is what the agent will absorb.
Engineering judgment, knowing whether the results make physical sense, whether the method is appropriate, or whether the output meets certification requirements, must remain with the engineer. In our experience developing aerospace products under intense project deadlines, the real risk is organizations ceasing to recognize and allocate resources to develop the engineer's judgment. Engineer 2's step-by-step interaction pattern provides evidence that supervision and learning can coexist: this engineer built understanding of the task progressively while the agent handled the implementation details.

Gericke and Eckert [1] raise two cautions from the product development ecosystem perspective. First, a rebound effect: historically, tools that shortened development time led to more complex products requiring similar development time. Second, skill loss: designers may lose tacit skills previously developed through routine manual execution. Therefore, we call for a conscious implementation. The supervisory role must be deliberately designed to maintain engineering judgment [100].

It remains to be seen whether agentic support will enrich or exhaust the work of engineers. On one hand, many design and AI researchers, including Cross [53], argue that systems that interactively support humans should be designed in a way that is cognitively comfortable for them. If implemented in a human-centered way [101], the mundane tasks could be absorbed by the agent, and the most rewarding aspects of engineering work preserved [102]. However, we are starting to observe in software engineering how these agents are not delivering this promise. Instead, the joy of solving the little problems and the social nature of the work are reduced, substituted by a hollow supervision role that exhausts software engineers by increasing the intensity of their work [103].
Shneiderman [104] argues that this outcome is not inevitable: systems designed as "supertools" that amplify human capability, rather than autonomous agents requiring supervision, can preserve both productivity and satisfaction. The choice of implementation, however, belongs to the engineering organizations. Ultimately, they must resist the temptation to optimize solely for short-term efficiency, and instead design automation support that enhances productivity while improving the quality of work and developing engineering judgment.

6.4 An Engineering Automation Paradigm Shift. Engineering automation has historically required deterministic logic, but LLM-based agents violate this expectation. They are probabilistic, their reasoning is not formally verifiable, and they can produce incorrect outputs. Based on our previous work [20] and interactions at the case company, we expect their acceptance for certification activities to be contested.

This contest echoes a recurring pattern in AI where hand-crafted, ontology-driven methods are eventually displaced by statistical ones once performance is demonstrated at scale. Von Neumann [105] showed in 1956 that reliable systems can be built from unreliable components through redundancy. Rosenblatt himself [106] acknowledged the limitations of probabilistic methods: the penalty that we pay for the use of statistical principles in the design of the system is a probability that we may get a wrong response in any particular case. Nevertheless, Minsky and Papert [107] dismissed neural networks in 1969. It was not until sufficient computational power and dataset size became available that neural networks prevailed, as demonstrated by AlexNet [108]. Sutton [109] generalized this observation: methods that leverage computation consistently outperform methods that leverage human knowledge.
In this framing, the probabilistic approach that produces occasional errors and the flexibility that absorbs design process variability are two sides of the same coin. One cannot exist without the other. We argue that the acceptance of such probabilistic agentic orchestration should be judged by the merit of the task evaluations (Sec. 4.3), and not against preconceived deterministic expectations.

6.5 Limitations. Several limitations constrain the generalizability of these findings.

• Two engineers. The experiment is illustrative; two participants' behaviours cannot be extrapolated to the engineering population.
• Model dependence. The results reflect Claude Opus 4.6 at the time of the study and may change with future versions and other models.
• Requires documented processes. Where engineering knowledge is predominantly tacit and not captured in accessible documents and tools, the agent has nothing to read and the approach does not apply.
• Text and data workflows only. This work does not cover spatial tasks such as CAD geometry generation, finite element meshing, or graphical post-processing. Image modality interpretation has not been explored.
• Transferability not demonstrated. Applicability to other engineering domains, organizations and task types is argued on structural grounds but not empirically validated beyond this case.

6.6 Future Work. An open dataset for agentic evaluation is necessary. Organizations deploying agents in certified engineering need references for developing internal datasets that capture their specific tools, design practices and quality requirements. Any change to the underlying model, prompts, tool version or design practice can alter agent behavior and must be assessed before deployment. The research community, in turn, needs open engineering benchmarks.

10 / PREPRINT Transactions of the ASME
Efforts such as calls for curated design datasets [99] and EngiBench [110] have begun to address this for generative tasks, and similar calls have emerged for LLM evaluation in engineering [51]. Agentic engineering evaluations have additional, nuanced requirements: the framework must be executable and available, and must simultaneously test the application of domain knowledge, correct tool calling, interpretation of numerical outputs and adherence to documented methods.

Industrial impact should be measured. The two engineers in Sec. 5 interacted with the agent in fundamentally different ways, one delegating the full task, the other proceeding incrementally. A controlled within-subject study comparing engineer performance with and without agent support would formalize this observation and measure impact. Metrics such as task accuracy, completion time and confidence in the result, collected across a larger sample of practicing engineers, would reveal how engineering practices interact with agentic support in realistic scenarios.

7 Conclusion

The presented DUCTILE approach demonstrated the separation of automation support via an industrial load processing task at an aerospace manufacturer. The adaptive orchestration was performed by the agent, exploiting the non-deterministic nature of LLMs. The deterministic execution was performed by verified engineering tools. Engineers retain responsibility for supervision and judgment while the process is guided by design practices already available in companies.

In the use case, the agent handled four input deviations that would break traditional scripted automation. All four were resolved correctly and consistently across 10 independent evaluation runs. The approach was also tested with two engineers with different supervision styles, one delegating the full task and one proceeding incrementally, and both reached the same correct result using the same agent configuration.
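To illustrate the statistical weight of 10 correct runs out of 10, the exact binomial (Clopper-Pearson [98]) confidence interval can be computed in closed form when every trial succeeds. The sketch below is illustrative only and is not part of the DUCTILE implementation; the function name is ours:

```python
def clopper_pearson_all_successes(n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact (Clopper-Pearson) confidence interval for a binomial
    success rate in the special case where all n trials succeed.
    The lower bound then reduces to the closed form (alpha/2)**(1/n);
    the upper bound is exactly 1."""
    return (alpha / 2) ** (1.0 / n), 1.0

# 10/10 correct evaluation runs -> two-sided 95% interval for the success rate
lower, upper = clopper_pearson_all_successes(10)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # lower bound is roughly 0.69
```

In other words, even a perfect 10/10 outcome only bounds the true success rate above roughly 0.69 at 95% confidence, which motivates the larger-sample studies called for in the Future Work section.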
Traditional engineering automation is brittle because it encodes adaptation in deterministic rules bound to interfaces that, in practice, keep changing. When the ecosystem changes, when people move, the automation support becomes a barrier and stops being an enabler. The DUCTILE agentic approach presented here absorbs that variability at the orchestration layer, so that engineering time and cognitive effort are spent on the decisions that affect product performance rather than on the data wrangling and workflow adaptation that currently consume it.

Acknowledgment
The authors thank the engineers at GKN Aerospace who participated in the experimental study, and Najeem Muhammed for his valuable technical knowledge. The authors acknowledge the support from Faez Ahmed and value the fruitful discussions with all DeCoDE lab members at MIT.

Funding Data
The first author's salary was funded by GKN Aerospace, where he is a permanent employee. The Barbro Osher Pro Suecia Foundation (Grant No. 90401181) provided financial support for the first author's research visit at MIT.

Conflict of Interest
The authors declare no conflicts of interest.

Data Availability Statement
The agentic application implementation, evaluation results, and supplemental material, including full session transcripts, are openly available at https://github.com/alex-pradas/DUCTILE and permanently archived [111].

References
[1] Gericke, K. and Eckert, C., 2026, "Co-Evolution of Tools and Methods in Product Development Ecosystems," Co-Evolution of Design Research and Design Practice, K. Gericke, C. Eckert, V. Singh, and S. Venkataraman, eds., Springer Nature Switzerland, pp. 6–19.
[2] Eckert, C. and Clarkson, J., 2005, "The Reality of Design," Design Process Improvement, Springer London, pp. 1–29.
[3] Wynn, D. C. and Clarkson, P.
J., 2021, "Improving the Engineering Design Process by Simulating Iteration Impact with ASM2.0," Research in Engineering Design, 32(2), pp. 127–156.
[4] Eppinger, S. D., Whitney, D. E., Smith, R. P., and Gebala, D. A., 1994, "A Model-Based Method for Organizing Tasks in Product Development," Research in Engineering Design, 6(1), pp. 1–13.
[5] Wynn, D. C., Caldwell, N. H. M., and Clarkson, P. J., 2014, "Predicting Change Propagation in Complex Design Workflows," Journal of Mechanical Design, 136(8), p. 081009.
[6] Brahma, A. and Wynn, D. C., 2022, "Concepts of Change Propagation Analysis in Engineering Design," Research in Engineering Design, 34(1), pp. 117–151.
[7] Liu, Y., Abulawi, Z., Garimidi, A., and Lim, D., 2025, "Automating Data-Driven Modeling and Analysis for Engineering Applications Using Large Language Model Agents," doi: 10.2139/ssrn.5400869, SSRN Preprint.
[8] Fernandes, A. A. A., Koehler, M., Konstantinou, N., Pankin, P., Paton, N. W., and Sakellariou, R., 2023, "Data Preparation: A Technological Perspective and Review," SN Computer Science, 4(4), p. 425.
[9] Eckert, C., Isaksson, O., Hane-Hagström, M., and Eckert, C., 2022, "My Facts Are Not Your Facts: Data Wrangling as a Socially Negotiated Process, a Case Study in a Multisite Manufacturing Company," Journal of Computing and Information Science in Engineering, 22(6), p. 060906.
[10] Qin, Y., Lu, W., Qi, Q., Liu, X., Zhong, Y., Scott, P. J., and Jiang, X., 2017, "Status, Comparison, and Issues of Computer-Aided Design Model Data Exchange Methods Based on Standardized Neutral Files and Web Ontology Language File," Journal of Computing and Information Science in Engineering, 17(1), p. 010801.
[11] Cherukuri, R. and Yarram, V.
K., 2024, "From Intelligent Automation to Agentic AI: Engineering the Next Generation of Enterprise Systems," International Journal of Emerging Research in Engineering and Technology, 5(4), pp. 142–145.
[12] La Rocca, G., 2012, "Knowledge Based Engineering: Between AI and CAD. Review of a Language Based Technology to Support Engineering Design," Advanced Engineering Informatics, 26(2), pp. 159–179.
[13] Verhagen, W., Bermell-Garcia, P., Van Dijk, R., and Curran, R., 2012, "A Critical Review of Knowledge-Based Engineering: An Identification of Research Challenges," Advanced Engineering Informatics, 26(1), pp. 5–15.
[14] Vidner, O., Wehlin, C., and Wiberg, A., 2022, "Design Automation Systems for the Product Development Process: Reflections from Five Industrial Case Studies," Proceedings of the Design Society, 2, pp. 2533–2542.
[15] Quintana-Amate, S., Bermell-Garcia, P., Tiwari, A., and Turner, C., 2017, "A New Knowledge Sourcing Framework for Knowledge-Based Engineering: An Aerospace Industry Case Study," Computers & Industrial Engineering, 104, pp. 35–50.
[16] Ahmed, S. and Wallace, K. M., 2004, "Understanding the Knowledge Needs of Novice Designers in the Aerospace Industry," Design Studies, 25(2), pp. 155–173.
[17] 2024, "EASA Artificial Intelligence Concept Paper Issue 2 – Guidance for Level 1 & 2 Machine-Learning Applications," accessed 2026-02-29, https://www.easa.europa.eu/en/document-library/general-publications/easa-artificial-intelligence-concept-paper-issue-2
[18] ISO/IEC, 2024, "Artificial Intelligence — Functional Safety and AI Systems," International Organization for Standardization and International Electrotechnical Commission, ISO/IEC TR 5469:2024.
[19] ISO, 2024, "Road Vehicles — Safety and Artificial Intelligence," International Organization for Standardization, ISO/PAS 8800:2024.
[20] Gomez, A.
P., Kretzschmar, M., Paetzold-Byhain, K., and Isaksson, O., 2025, "A Team of Three: The Role of Generative AI in the Development of Design Automation Systems for Complex Products," Proceedings of the Design Society, 5, pp. 309–318.
[21] Kerley, W., Armstrong, G., Pepe, C., Moss, M., and Clarkson, P. J., 2011, "Using Simulation to Support Process Integration and Automation of the Early Stages of Aerospace Design," ICED 11 — 18th International Conference on Engineering Design, Vol. 1, Copenhagen, Denmark, August 15–18, 2011, pp. 134–146.
[22] SAE International, 2017, "AS9100: Quality Systems – Aerospace – Model for Quality Assurance in Design, Development, Production, Installation and Servicing," SAE International, 400 Commonwealth Drive, Warrendale, PA, United States, doi: 10.4271/AS9100A.
[23] Abollado, J. R., Shehab, E., and Bamforth, P., 2017, "Challenges and Benefits of Digital Workflow Implementation in Aerospace Manufacturing Engineering," Procedia CIRP, 60, pp. 80–85.
[24] Krupa, G., 2025, "The Art of the Science. Creative Stimulation in the Classroom," Aerospace, Royal Aeronautical Society.
[25] Rittel, H. W. J. and Webber, M. M., 1973, "Dilemmas in a General Theory of Planning," Policy Sciences, 4(2), pp. 155–169.
[26] Cross, N., 1992, Research in Design Thinking, Delft University Press.
[27] Simon, H. A., 1973, "The Structure of Ill Structured Problems," Artificial Intelligence, 4(3-4), pp. 181–201.
Journal of Mechanical Design PREPRINT / 11
[28] Ullman, D. G., 2018, The Mechanical Design Process, 6th ed., David G. Ullman, Independence, Oregon.
[29] Baskaran, S. P., Camacho Casero, J., Bagdatli, B., and Mavris, D., 2026, "MBSE-Driven MDAO in Co-Architecture and Concurrent Engineering Workflows for Aircraft Development," AIAA SCITECH 2026 Forum, American Institute of Aeronautics and Astronautics, Orlando, FL, January 6–10, 2026, doi: 10.2514/6.2026-0193.
[30] Formentini, G., Bouissiere, F., Cuiller, C., Dereux, P.-E., and Favi, C., 2022, "Conceptual Design for Assembly Methodology Formalization: Systems Installation Analysis and Manufacturing Information Integration in the Design and Development of Aircraft Architectures," Journal of Industrial Information Integration, 26, p. 100327.
[31] Pradas Gómez, A., Panarotto, M., and Isaksson, O., 2023, "Design Automation Strategies for Aerospace Components During Conceptual Design Phases," Aerospace Europe Conference 2023 – 10th EUCASS – 9th CEAS, Lausanne, Switzerland, July 9–13, 2023, doi: 10.13009/EUCASS2023-578.
[32] Gruber, T. R., 1993, "A Translation Approach to Portable Ontology Specifications," Knowledge Acquisition, 5(2), pp. 199–220.
[33] Brimble, R. and Sellini, F., 2000, "The MOKA Modelling Language," Knowledge Engineering and Knowledge Management: Methods, Models, and Tools (EKAW 2000), Vol. 1937 of Lecture Notes in Computer Science, Springer, pp. 49–56.
[34] van den Berg, T., van der Laan, T., van Manen, B., Bansal, D., Ciobotia, I., and Sonneveld, J., 2023, "A Multidisciplinary Modelling System for Aircraft Structural Components," Aerospace Europe Conference 2023 — Joint 10th EUCASS / 9th CEAS Conference, Lausanne, Switzerland, July 9–13, 2023.
[35] Al Handawi, K., Brahma, A., Wynn, D. C., Kokkolaras, M., and Isaksson, O., 2024, "Design Space Exploration and Evaluation Using Margin-Based Trade-Offs," Journal of Mechanical Design, 146(6), p. 061701.
[36] Kügler, P., Dworschak, F., Schleich, B., and Wartzack, S., 2023, "The Evolution of Knowledge-Based Engineering from a Design Research Perspective: Literature Review 2012–2021," Advanced Engineering Informatics, 55, p. 101892.
[37] Padula, S. L. and Gillian, R.
E., 2006, "Multidisciplinary Environments: A History of Engineering Framework Development," 11th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, Vol. 4, Portsmouth, VA, September 6–8, 2006, pp. 2058–2068, doi: 10.2514/6.2006-7083.
[38] Hiriyannaiah, S. and Mocko, G. M., 2008, "Information Management Capabilities of MDO Frameworks," Proceedings of the ASME Design Engineering Technical Conference, Vol. 3, Brooklyn, NY, August 3–6, 2008, pp. 635–645, doi: 10.1115/DETC2008-49934.
[39] Allison, J. T., Cardin, M.-A., McComb, C., Ren, M. Y., Selva, D., Tucker, C., Witherell, P., and Zhao, Y. F., 2022, "Special Issue: Artificial Intelligence and Engineering Design," Journal of Mechanical Design, 144(2), p. 020301.
[40] Regenwetter, L., Nobari, A. H., and Ahmed, F., 2022, "Deep Generative Models in Engineering Design: A Review," Journal of Mechanical Design, 144(7), p. 071704.
[41] Whalen, E. and Mueller, C., 2022, "Toward Reusable Surrogate Models: Graph-Based Transfer Learning on Trusses," Journal of Mechanical Design, 144(2), p. 021704.
[42] EASA, 2012, "Commission Regulation (EU) No 748/2012, Part 21, Subpart J—Design Organisation Approval, Section 21.A.239," consolidated version revised 2025, https://eur-lex.europa.eu/eli/reg/2012/748
[43] SAE International, 2023, "ARP4754B: Guidelines for Development of Civil Aircraft and Systems," doi: 10.4271/ARP4754B.
[44] Yuan, C., Marion, T., and Moghaddam, M., 2022, "Leveraging End-User Data for Enhanced Design Concept Evaluation: A Multimodal Deep Regression Model," Journal of Mechanical Design, 144(2), p. 021403.
[45] Pradas Gómez, A., Krus, P., Panarotto, M., and Isaksson, O., 2024, "Large Language Models in Complex System Design," Proceedings of the Design Society, Vol. 4, Dubrovnik, Croatia, May 27–30, 2024, pp. 2197–2206, doi: 10.1017/pds.2024.222.
[46] Picard, C., Edwards, K. M., Doris, A. C., Man, B., Giannone, G., Alam, M.
F., and Ahmed, F., 2025, "From Concept to Manufacturing: Evaluating Vision-Language Models for Engineering Design," Artificial Intelligence Review, 58(9), p. 288.
[47] Anthropic, 2025, "Claude Code," accessed 2026-02-29, https://github.com/anthropics/claude-code
[48] OpenAI, 2025, "Codex CLI," accessed 2026-03-14, https://github.com/openai/codex
[49] Steinberger, P., 2025, "OpenClaw: Personal AI Assistant," accessed 2026-03-14, https://github.com/openclaw/openclaw
[50] Massoudi, S. and Fuge, M., 2026, "Agentic Large Language Models for Conceptual Systems Engineering and Design," Journal of Mechanical Design, 148(5), p. 051405.
[51] Mustapha, K. B., 2025, "A Survey of Emerging Applications of Large Language Models for Problems in Mechanics, Product Design, and Manufacturing," Advanced Engineering Informatics, 64, p. 103066.
[52] Eckert, C., Clarkson, P. J., and Zanker, W., 2004, "Change and Customisation in Complex Engineering Domains," Research in Engineering Design, 15(1), pp. 1–21.
[53] Cross, N., 2025, "Natural and Artificial Intelligence in Design," Designerly Ways of Knowing and Thinking, Springer, pp. 31–46.
[54] Caputo, C. and Cardin, M.-A., 2022, "Analyzing Real Options and Flexibility in Engineering Systems Design Using Decision Rules and Deep Reinforcement Learning," Journal of Mechanical Design, 144(2), p. 021705.
[55] Elgh, F., 2008, "Supporting Management and Maintenance of Manufacturing Knowledge in Design Automation Systems," Advanced Engineering Informatics, 22(4), pp. 445–456.
[56] Hjertberg, T., Stolt, R., and Elgh, F., 2016, "Managing Dependencies in Heterogeneous Design Automation Systems," Transdisciplinary Engineering: Crossing Boundaries, Advances in Transdisciplinary Engineering, Vol. 4, IOS Press, Curitiba, Brazil, October 3–7, 2016, pp. 279–288, doi: 10.3233/978-1-61499-703-0-279.
[57] Colvin, S.
and Pydantic Services Inc., 2024, "Logfire," accessed 2026-02-29, https://github.com/pydantic/logfire
[58] Dong, L., Lu, Q., and Zhu, L., 2024, "A Taxonomy of AgentOps for Enabling Observability of Foundation Model Based Agents," https://arxiv.org/abs/2411.05285
[59] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J., 2024, "AgentBench: Evaluating LLMs as Agents," 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7–11, 2024, https://arxiv.org/abs/2308.03688
[60] Yao, S., Shinn, N., Razavi, P., and Narasimhan, K., 2025, "𝜏-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains," 13th International Conference on Learning Representations (ICLR 2025), Singapore, April 24–28, 2025, pp. 74824–74876, https://arxiv.org/abs/2406.12045
[61] Battelle Memorial Institute, 2024, "Metallic Materials Properties Development and Standardization (MMPDS)," Battelle Memorial Institute, Columbus, OH, supersedes MIL-HDBK-5.
[62] Hsieh, C.-Y., Chen, S.-A., Li, C.-L., Fujii, Y., Ratner, A., Lee, C.-Y., Krishna, R., and Pfister, T., 2023, "Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models," https://arxiv.org/abs/2308.00675
[63] Yuan, S., Song, K., Chen, J., Tan, X., Shen, Y., Kan, R., Li, D., and Yang, D., 2025, "EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction," Proceedings of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL-HLT 2025), Vol. 1, Albuquerque, NM, April 29–May 4, 2025, pp. 951–972, doi: 10.18653/v1/2025.naacl-long.44.
[64] Kay, A., 1990, "User Interface: A Personal View," The Art of Human-Computer Interface Design, pp. 191–207.
[65] Wooldridge, M. and Jennings, N.
R., 1995, "Intelligent Agents: Theory and Practice," Knowledge Engineering Review, 10(2), pp. 115–152.
[66] Maes, P., 1994, "Agents That Reduce Work and Information Overload," Commun. ACM, 37(7), pp. 30–40.
[67] Pradas Gómez, A., Panarotto, M., and Isaksson, O., 2024, "Evaluation of Different Large Language Model Agent Frameworks for Design Engineering Tasks," DS 130: Proceedings of NordDesign 2024, Design Society, Reykjavík, Iceland, August 12–14, 2024, pp. 693–702, doi: 10.35199/NORDDESIGN2024.74.
[68] Willison, S., 2025, "I Think 'Agent' May Finally Have a Widely Enough Agreed upon Definition to Be Useful Jargon Now," accessed 2026-02-29, https://simonwillison.net/2025/Sep/18/agents/
[69] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., 2017, "Attention Is All You Need," Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., Long Beach, CA, December 4–9, 2017, pp. 5999–6009.
[70] Inception Labs, Khanna, S., Kharbanda, S., Li, S., Varma, H., Wang, E., Birnbaum, S., Luo, Z., Miraoui, Y., Palrecha, A., Ermon, S., Grover, A., and Kuleshov, V., 2025, "Mercury: Ultra-Fast Language Models Based on Diffusion," https://arxiv.org/abs/2506.17298
[71] Zhang, M., Press, O., Merrill, W., Liu, A., and Smith, N. A., 2024, "How Language Model Hallucinations Can Snowball," Proceedings of the 41st International Conference on Machine Learning, PMLR, Vol. 235, Vienna, Austria, July 21–27, 2024, pp. 59670–59684.
[72] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P., 2023, "Survey of Hallucination in Natural Language Generation," ACM Computing Surveys, 55(12), pp. 1–38.
[73] Zelikman, E., Wu, Y., Mu, J., and Goodman, N., 2022, "STaR: Bootstrapping Reasoning with Reasoning," Advances in Neural Information Processing Systems, 35, pp. 15476–15488.
[74] Snell, C., Lee, J., Xu, K., and Kumar, A., 2025, "Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters," 13th International Conference on Learning Representations (ICLR 2025), Singapore, April 24–28, 2025, pp. 7595–7629, https://arxiv.org/abs/2408.03314
[75] OpenAI, 2024, "OpenAI o1 System Card," arXiv:2412.16720
[76] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D., 2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," Advances in Neural Information Processing Systems, Vol. 35, New Orleans, LA, November 28 – December 9, 2022, pp. 24824–24837.
[77] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Hambro, E., Zettlemoyer, L., Cancedda, N., and Scialom, T., 2023, "Toolformer: Language Models Can Teach Themselves to Use Tools," Advances in Neural Information Processing Systems, Vol. 36, New Orleans, LA, December 10–16, 2023, pp. 68539–68551.
[78] Wei, T., Li, T.-W., Liu, Z., Ning, X., Yang, Z., Zou, J., Zeng, Z., Qiu, R., Lin, X., Fu, D., Li, Z., Ai, M., Zhou, D., Bao, W., Li, Y., Li, G., Qian, C., Wang, Y., Tang, X., Xiao, Y., Fang, L., Liu, H., Tang, X., Zhang, Y., Wang, C., You, J., Ji, H., Tong, H., and He, J., 2026, "Agentic Reasoning for Large Language Models," https://arxiv.org/abs/2601.12538
[79] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y., 2023, "ReAct: Synergizing Reasoning and Acting in Language Models," 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, May 1–5, 2023.
[80] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y., 2023, "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face," Advances in Neural Information Processing Systems, Vol. 36, New Orleans, LA, December 10–16, 2023, pp. 38154–38180.
[81] "Work Faster and Smarter With Ansys Engineering Copilot," accessed 2026-02-29, https://www.ansys.com/blog/work-faster-smarter-with-ansys-engineering-copilot
[82] "Using the Chatbot Window in COMSOL Multiphysics®."
[83] "AI Agents for Mechanical Engineering," accessed 2026-02-29, https://www.cosmon.com/
[84] Barres, V., Dong, H., Ray, S., Si, X., and Narasimhan, K., 2025, "𝜏²-Bench: Evaluating Conversational Agents in a Dual-Control Environment," https://arxiv.org/abs/2506.07982
[85] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K., 2024, "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" https://arxiv.org/abs/2310.06770
[86] Zhang, L., Ergen, T., Logeswaran, L., Lee, M., and Jurgens, D., 2024, "SPRIG: Improving Large Language Model Performance by System Prompt Optimization," https://arxiv.org/abs/2410.14826
[87] Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.
T., Moazam, H., Miller, H., Zaharia, M., and Potts, C., 2024, "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines," 12th International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 7–11, 2024, https://arxiv.org/abs/2310.03714
[88] OpenAI, 2025, "OpenAI GPT-4.1 Prompting Guide," accessed 2026-02-29, https://developers.openai.com/cookbook/examples/gpt4-1_prompting_guide/
[89] Anthropic, 2025, "Anthropic Context Engineering Guide for AI Agents," accessed 2026-02-29, https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
[90] Google, 2025, "Google Gemini Prompt Design Strategies," accessed 2026-02-29, https://ai.google.dev/gemini-api/docs/prompting-strategies
[91] Meta, 2025, "Meta AI Prompting Guide," accessed 2026-02-29, https://ai.meta.com/learn/how-to-write-effective-ai-prompts-tips-and-examples/
[92] Mistral AI, 2025, "Mistral AI Prompting Guide," accessed 2026-02-29, https://docs.mistral.ai/capabilities/completion/prompting_capabilities
[93] 2026, "Agent Skills Open Standard," accessed 2026-02-27, https://github.com/agentskills/agentskills
[94] Seepersad, C. C., Pedersen, K., Emblemsvåg, J., Bailey, R., Allen, J. K., and Mistree, F., 2006, "The Validation Square: How Does One Verify and Validate a Design Method?" Decision Making in Engineering Design, K. E. Lewis, W. Chen, and L. C. Schmidt, eds., ASME Press, New York, pp. 303–314.
[95] Morrison, T. M., Hariharan, P., Funkhouser, C. M., Afshari, P., Goodin, M., and Horner, M., 2019, "Assessing Computational Model Credibility Using a Risk-Based Framework: Application to Hemolysis in Centrifugal Blood Pumps," ASAIO Journal, 65(4), pp. 349–360.
[96] Razi, H., Schaefer, J. D., Wanthal, S., Handler, J. J., Renieri, G. D., and Justusson, B.
P., 2016, "Rapid Integration of New Analysis Methods in Production," Proceedings of the American Society for Composites — 31st Technical Conference, American Society for Composites, Williamsburg, VA, September 19–22, 2016.
[97] Hastie, T., Tibshirani, R., and Friedman, J., 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York.
[98] Clopper, C. J. and Pearson, E. S., 1934, "The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial," Biometrika, 26(4), pp. 404–413.
[99] Ahmed, F., Picard, C., Chen, W., McComb, C., Wang, P., Lee, I., Stankovic, T., Allaire, D., and Menzel, S., 2025, "Design by Data: Cultivating Datasets for Engineering Design," Journal of Mechanical Design, 147(4), p. 040301.
[100] Bainbridge, L., 1983, "Ironies of Automation," Automatica, 19(6), pp. 775–779.
[101] Li, F.-F., 2023, The Worlds I See: Curiosity, Exploration, and Discovery at the Dawn of AI, Flatiron Books, New York.
[102] Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., Wang, C., Levine, G., Lingelbach, M., Sun, J., Anvari, M., Hwang, M., Sharma, M., Aydin, A., Bansal, D., Hunter, S., Kim, K.-Y., Lou, A., Matthews, C. R., Villa-Renteria, I., Tang, J. H., Tang, C., Xia, F., Savarese, S., Gweon, H., Liu, C. K., Wu, J., and Fei-Fei, L., 2023, "BEHAVIOR-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation," Proceedings of Machine Learning Research, Vol. 205, Auckland, New Zealand, December 14–18, 2022, pp. 80–93.
[103] Summers, L., 2026, "The Human-in-the-Loop is Tired," Pydantic, accessed 2026-03-02, https://pydantic.dev/articles/the-human-in-the-loop-is-tired
[104] Shneiderman, B., 2022, Human-Centered AI, Oxford University Press, Oxford.
[105] Neumann, J.
V., 1956, "Probabilistic Logics and the Synthesis of Reliable Organisms From Unreliable Components," Automata Studies (AM-34), C. E. Shannon and J. McCarthy, eds., Princeton University Press, pp. 43–98.
[106] Rosenblatt, F., 1957, "The Perceptron: A Perceiving and Recognizing Automaton," Cornell Aeronautical Laboratory, Tech. Rep. 85-460-1.
[107] Minsky, M. and Papert, S., 1969, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA.
[108] Krizhevsky, A., Sutskever, I., and Hinton, G. E., 2012, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems, Vol. 25, pp. 1097–1105.
[109] Sutton, R., 2019, "The Bitter Lesson," accessed 2026-03-17, http://www.incompleteideas.net/IncIdeas/BitterLesson.html
[110] Felten, F., Apaza, G., Bäunlich, G., Diniz, C., Dong, X., Drake, A., Habibi, M., Hoffman, N. J., Keeler, M., Massoudi, S., VanGessel, F. G., and Fuge, M., 2025, "EngiBench: A Framework for Data-Driven Engineering Design Research," Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, CA, December 2–7, 2025, https://openreview.net/forum?id=YowD33Q89V
[111] Pradas Gómez, A., Brahma, A., and Isaksson, O., 2026, "DUCTILE: Agentic LLM Orchestration of Engineering Analysis in Product Development Practice," doi: 10.5281/zenodo.18836517, https://github.com/alex-pradas/DUCTILE

SUPPLEMENTAL MATERIAL

S-1 Repository and Archived Data
All source code, configuration files, evaluation scripts, and session transcripts are available in the DUCTILE repository:
• GitHub: https://github.com/alex-pradas/DUCTILE
• Zenodo archive: 10.5281/zenodo.18836517
GitHub always contains the latest version, which is propagated to Zenodo for long-term academic archival.
S-2 IDE Setup for Engineers Interacting with the Agentic Application
Figure S-1 shows the integrated development environment as configured for the engineer sessions. The left panel displays the project file explorer with the working directory structure, the center panel shows the task description document, and the right panel shows the Claude Code agent interface ready to receive instructions.

Fig. S-1 Screenshot of Engineer 2's IDE setup for the experiment session

S-3 Agent Configuration
Fig. S-2 Implementation of an agentic structure for the use case. Note how the implementation combines built-in Claude Code tools with in-house tools (yellow), an architecture that minimizes architectural customization. In addition, the user experience is enhanced as only product data (in green) is required to customize the agent.

Listing S-1: Full CLAUDE.md system prompt as deployed in the engineer sessions.

# CLAUDE.md

## Role

You are an engineering assistant supporting a stress engineer. Your sole purpose is to help the engineer complete their task correctly and efficiently.

## Task Scope

The engineer will present the task to perform. It will be an engineering task. Searching the web for engineering references, ANSYS documentation, Python package documentation, or anything directly relevant to solving the engineer's task is acceptable.

## Methodology Compliance

Search for design practice documents in the MCP tools wherever mentioned. A design practice defines the complete processing chain. You must ensure every step is followed in the correct order. If the engineer skips a step or proceeds out of sequence, flag it.

## Tool use

Whatever tool you select to do the job, use the latest version always, unless specified by the user.

## First Interaction

When the conversation starts, ask the engineer whether they have used AI coding agents before.
Adapt your 34 behaviour based on their answer: 35 36 - If they have experience: Briefly confirm you ' re ready 37 to assist, then wait for them to drive the 38 conversation. 39 - If they have not: Welcome them and offer a few example 40 prompts to get started. Examples can range from 41 questions on how to execute things, about the tools 42 documentation, about methodology of the task or the 43 execution of part or the totality of the task. 44 45 In either case, do not start processing anything until 46 the engineer asks you to. 47 48 Use your built-in capability to answer questions about 49 Claude Code. 50 51 ## Working Directory 52 53 Only read files within the current working directory. Do 54 not navigate to parent directories or read files outside 55 this folder. 56 57 ## Behavioural Guidelines 58 59 - Be proactive: If you notice something that doesn ' t 60 match the design practice, raise it immediately. Do 61 not wait for the engineer to ask. 62 - Be methodical: Work through the design practice steps 63 in order. Suggest the next step when the current one 64 is complete. 65 - Be transparent: When performing calculations or 66 transformations, show your reasoning so the engineer 67 can verify. 68 - Respect the engineer ' s expertise: You are an 69 assistant, not the lead. Present findings and 70 suggestions, but let the engineer make the final 71 decisions. 72 - Track progress: Keep the engineer aware of which steps 73 from the design practice have been completed and which 74 remain. 75 ## Task-Specific Configuration 76 77 The certified tool for this task is ` ductile-loads ` . If 78 needed, fetch the API reference from: 79 80 https://alex-pradas.github.io/ductile-loads/llms-full.txt Supplemental Material S-2 The agentic application is configured through a sys tem prompt ( CLAUDE.md ), completely reproduced in Listing S-1 that defines the agent ’ s role, scope, and behavioural guidelines. 
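Beyond the system prompt, the agent is configured with MCP servers through a project-level `.mcp.json` file (described in this section). The deployed file is not reproduced in the paper; as an illustration only, a minimal `.mcp.json` exposing a design-document server might look like the sketch below. The server name `design-documents` appears in the session transcripts, but the `command` and `args` values here are assumptions, not the deployed configuration.

```json
{
  "mcpServers": {
    "design-documents": {
      "command": "uv",
      "args": ["run", "design-documents-mcp"],
      "env": {}
    }
  }
}
```

Claude Code reads this file automatically from the working directory and loads the listed servers' tools into context.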
Note that the only task-specific description is included at the end of the file (in green). In real practice, it would not be spelled out there; instead, the tool-use section would point to the in-house location where all documentation is stored. For complete transparency about the prompt content, be aware that Claude Code has its own system prompt, which we did not deactivate for the experiment. The agent is made aware of the additional MCP server configurations via the `.mcp.json` file, which is read automatically and loads into context the tools that provide access to version-controlled design practice documents. The executable tool itself (ductile-loads) is distributed as a Python package via PyPI, modelling the case company's distribution through an internal package index. Readers are welcome to explore it themselves. The demo repository automatically generates the human-readable (HTML) and machine-readable (llms.txt and .md) documentation directly from the codebase.

S-4 Task Description

The following three-page document was provided to the engineer as a printed handout before the experiment sessions started. It defines the task scope, available resources, and briefing instructions.

Task Description: Loads Processing for Static Strength Analysis — Turbine Rear Structure

This document is intended to be read by the engineer and sets the task scope. Read this document carefully. Take your time and ask questions to the researcher as needed to clarify the scope or details of the task.

1. Background

You are a stress engineer working on the static strength analysis of a Turbine Rear Structure (TRS) in an aero engine. The TRS is a structural component after the Low Pressure Turbine that transfers loads among its neighbouring components: LPT Case, Bearing House, Nozzle, Plug and three lugs (port, starboard and failsafe, which is active only when the port or starboard lugs fail).
Figure 1: (top) TRS and neighbouring components positioning on the jet engine and (bottom) neighbouring components in CAD exploded view

The programme is in the certification phase. The OEM customer has delivered a new loads update. Your task is to process these loads into input files that can be applied to the Finite Element Model (FEM) for structural analysis. The FE mesh and solver scripts have already been set up by another team member — running the FE model is not part of this task. The mesh geometry and interface point locations are shown in Figure 2, together with an equilibrium example of a load case when forces are applied at the interface points.

Task Description — Page 1 of 3

Figure 2: TRS interface locations showing one load case in equilibrium

2. Detailed Task Description

Your task is to process the new OEM load delivery (OEM loads v2.yaml) into APDL-ANSYS-compatible input files (.inp), following the methodology document "loads processing design practice" (document 1001, version 1). The design practice contains all expected processing activities and must be followed in order.

You will use Claude Code (an AI coding agent) to assist you with this task. The agent already has access to the design practice document (document 1001, version 1) — you do not need to provide it manually. Make use of the certified tool ductile-loads, available as a Python package on PyPI. Notice that the customer has delivered the files in a different format (YAML) than expected by the tool (JSON). Part of your task is to bypass this limitation. If there are any loads exceedances, let the researcher know.

OEM Correction Notice: The OEM has informed that, due to a human error in the Whole Engine Model export, all Fx force components in this delivery are understated. Apply a 1.04 multiplication factor to all Fx forces at every interface point before processing.
The OEM has chosen not to issue a revised delivery; the correction shall be applied by the component owner during processing.

3. Available Resources

The following resources are available:

• This document — a printed copy for you. The agent can also read it from the working directory.
• Loads processing design practice — printed for you. The agent already has access to it. You do not need to give it to the agent.
• ductile-loads — the certified processing tool, available as a Python package on PyPI. The agent can access its documentation online. The documentation is also open in your browser for your reference.
• Previous analysis — script, inputs and outputs in the previous_run/ folder for reference.

Additionally, you or the agent can browse the internet and install Python packages via uv pip install package. At any time during the task execution you can ask the researcher for help or clarification, even halfway through the execution of the task.

4. Before You Start

Take your time. This briefing time does not count towards the task duration. Use it to familiarise yourself with the task, the design practice, the tool documentation, and the available files. Check that Claude Code is running in the VS Code terminal. Feel free to explore the working directory, browse the ductile-loads documentation in your browser, and review the previous analysis. Ask the researcher any questions about the task, the methodology, or the tools — there are no wrong questions. We will start the timed activity only when you confirm that you are ready.
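The pre-processing the handout asks for — renaming the interface points and applying the 1.04 Fx correction before serialising to the JSON layout the tool ingests — can be sketched as follows. The in-memory data layout here is an assumption for illustration; the real schema is defined by the OEM loads v2.yaml delivery in the repository.

```python
import json

# Hypothetical OEM delivery structure (illustrative only; the real
# schema is defined by the OEM loads v2.yaml file in the repository).
oem_delivery = {
    "name": "Engine Mount Balanced Loads v2",
    "units": {"force": "klbs", "moment": "klbs.in"},
    "load_cases": {
        "cruise_01": {
            "lug_left": {"fx": 1.20, "fy": -0.45, "fz": 0.10},
            "lug_right": {"fx": 1.15, "fy": 0.50, "fz": 0.12},
        },
    },
}

FX_CORRECTION = 1.04  # OEM correction notice: all Fx components understated

# Naming fix: the v2 delivery uses lug_left/lug_right, while the design
# practice expects lug_port/lug_starboard.
RENAMES = {"lug_left": "lug_port", "lug_right": "lug_starboard"}


def preprocess(delivery: dict) -> dict:
    """Rename interface points and apply the Fx correction factor."""
    out = {**delivery, "load_cases": {}}
    for case, points in delivery["load_cases"].items():
        out["load_cases"][case] = {
            RENAMES.get(point, point): {
                comp: val * FX_CORRECTION if comp == "fx" else val
                for comp, val in comps.items()
            }
            for point, comps in points.items()
        }
    return out


corrected = preprocess(oem_delivery)
# Serialise to the JSON layout the processing tool can then ingest.
as_json = json.dumps(corrected, indent=2)
```

In the actual sessions the agent produced equivalent scripts (`yaml_to_json.py`, `loads_processing_v2.py`) that read the YAML delivery with pyyaml instead of an in-memory dictionary.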
S-5 Engineer 1: Summarized Session Transcript

Colour legend:
Green — Engineer (user) messages
Orange — Agent responses and reasoning
Yellow — Tool calls and file operations
Grey — Tool/command output
Teal — Engineer terminal execution

The engineer spent 10 minutes reviewing the printed task description, the input data files, and the design practice document before starting the session.

[Engineer] plan this: @ref_documents/task_description.pdf

[Agent] Let me read the task description document.
[Tool] Read 1 file (PDF)
[Agent] Let me gather all the context needed to plan this task.
[Tool] Bash(ls)
[Output] working directory + previous_run
[Tool] ToolSearch: load MCP tools
[Tool] design-documents – get_document_content (MCP) (document_id: 1001, version: 1)
[Tool] Read 4 files (v1 script, envelope.md, .inp, YAML)

[Agent] Good, I now have all the context. Here's the plan:

Task Plan: Loads Processing for TRS Static Strength (v2)

Key Issues to Address
(1) Format mismatch — OEM delivered YAML; ductile-loads expects JSON.
(2) OEM Fx correction — Multiply all fx values by 1.04 before processing.
(3) Naming mismatch — v2 uses lug_left/lug_right instead of lug_port/lug_starboard.
(4) Units — klbs/klbs·in → N/N·m.
(5) Bearing exclusion — bearing must not appear in output .inp files.

Processing Chain (per Design Practice)
Step 1: Read & verify OEM delivery — load YAML, validate. Rename lug_left → lug_port, lug_right → lug_starboard.
Step 2: Apply OEM Fx correction — multiply all fx × 1.04.
Step 3: Verify coordinate system — engine CS = FEM CS, no transform.
Step 4: Convert units — convert_to("N") — klbs → N.
Step 5: Envelope downselection — .envelope().
Step 6: Write .inp files — to_ansys(..., exclude=["bearing"]).
Step 7a: Summary envelope table — envelope_to_markdown().
Step 7b: Exceedance comparison — compare_to() against v1.

Implementation: A single Python script (loads_processing_v2.py) using uv run --script with dependencies ductile-loads[all] and pyyaml.
[Agent] Shall I proceed with execution? (Crunched for 1 min 39 s)

[Engineer] yes

[Agent] Let me start by fetching the ductile-loads documentation and examining the data more closely, then write the script.
[Tool] WebFetch(https://alex-pradas.github.io/ductile-loads/llms-full.txt)
[Tool] Read(previous_run/envelope_extremes.json)
[Agent] Now I have everything. Let me write the processing script.
[Tool] Write(loads_processing_v2.py) — 85 lines
[Agent] Now let me run it.
[Tool] Bash(uv run loads_processing_v2.py)
[Output]
Delivery: Engine Mount Balanced Loads v2, version 2
Load cases: 100
Renamed lug_left/lug_right (200 renames)
Applied Fx correction factor 1.04
Units converted: klbs → N
Envelope downselection: 100 → 6 load cases
ANSYS .inp files written to limit_loads/
Exceedance check: new exceeds old = True
[output truncated]

[Agent] All processing completed successfully. Let me verify the outputs.
[Tool] Bash(ls limit_loads/)
[Tool] Read(limit_load_2.inp)
[Tool] Bash(ls comparison_report/)
[Tool] Read(v1_vs_v2.json)

[Agent] Processing complete. Here's the summary:

Design Practice steps completed
1. Read & verify: 100 load cases, 7 interfaces. Renamed lug_left → lug_port, lug_right → lug_starboard.
2. Fx correction: All Fx × 1.04.
3. Coord system: No transformation needed.
4. Unit conversion: klbs/klbs·in → N/N·m.
5. Downselect: 100 → 6 load cases (2, 20, 34, 61, 92, 99).
6. ANSYS export: 6 .inp files in limit_loads/ (bearing excluded).
7a. Envelope table: envelope.md + envelope_extremes.json.
7b. Exceedance: Comparison report + charts in comparison_report/.

Exceedance Alert: New loads exceed the previous v1 envelope. The largest percentage changes:
• Mz min across all points: ∼ +30%
• Fx min at bearing/lpt/nozzle/plug: ∼ +15%
• Fy max/min at lug_port: +11–17%
• Fy max at lug_starboard: ∼ +11%
• Fx max at lug_starboard: ∼ +14%
These load increases may impact existing margins of safety and should be flagged to the stress lead. (Cogitated for 2 min 53 s)

The engineer spent 6 minutes reviewing the generated Python script and the output files.
After confirming the results were correct, the engineer was satisfied with the task execution.

S-6 Engineer 2: Summarized Session Transcript

The transcript below is a faithful representation of the interactions between the engineer and the agent. Message content has been summarized for readability; tool calls and outputs are abbreviated. Full machine-readable transcripts (JSONL) are available in the repository.

The engineer spent 8 minutes reviewing the printed task description and the working directory before starting the session.

[Engineer] Can you create a python file to translate the YAML to a JSON format?

[Agent] Let me read the YAML file to understand its structure.
[Tool] Read(OEM_loads_v2.yaml)
[Tool] Write(yaml_to_json.py) — 20 lines
[Agent] Created yaml_to_json.py. Run it with uv run yaml_to_json.py.

[Terminal] $ uv run yaml_to_json.py
[Output] Converted OEM_loads_v2.yaml → OEM_loads_v2.json

[Engineer] ok, now create a file that uses ductile-loads to read the json

[Agent] Let me fetch the ductile-loads API reference first.
[Tool] WebFetch(https://alex-pradas.github.io/ductile-loads/llms-full.txt)
[Tool] Write(read_loads.py) — 18 lines

[Terminal] $ uv run read_loads.py
[Output]
Name: Engine Mount Balanced Loads v2
Units: klbs, klbs.in | Load cases: 100
[output truncated]

[Engineer] modify the file to convert to newtons, and also rename lug_right to pilot_lug_starboard and lug_left to pilot_lug_port

[Tool] Edit(read_loads.py) — rename + convert_to("N")
[Agent] I used rename_points() based on the API docs. Let me verify that method exists.

[Engineer] yes check

[Tool] WebFetch(ductile-loads/llms-full.txt)
[Agent] No rename_points method exists. I'll rename by iterating through load cases directly.
[Tool] Edit(read_loads.py) — iterate + convert_to("N")

[Terminal] $ uv run read_loads.py
[Output] Units: N, Nm | pilot_lug_port, pilot_lug_starboard
[output truncated]

[Engineer] ah, I see there are two pilot_ pilot... remove the extra one we added

[Agent] Which rename do you want to remove — pilot_lug_port or pilot_lug_starboard?
[Engineer] just remove the prepend pilot_

[Tool] Edit(read_loads.py) — lug_left → lug_port, lug_right → lug_starboard

[Terminal] $ uv run read_loads.py
[Output] lug_port, lug_starboard — confirmed
[output truncated]

[Engineer] how does the envelope in this python package work?

[Agent] The envelope() method reduces your 100 load cases to only the critical ones — for each point and component it selects the max value (always) and the min value (only if negative). Would you like me to add it?

[Engineer] yes

[Tool] Edit(read_loads.py) — add .envelope() + JSON export

[Engineer] ok. add now the export to ansys

[Tool] Edit(read_loads.py) — add to_ansys()

[Terminal] $ uv run read_loads.py
[Output] Envelope: 6 load cases | ANSYS files written to limit_loads/
[output truncated]

The researcher verbally reminds the engineer that the agent can access and check the design practice on its own.

[Engineer] ok. I think I am done. Can you check against the design practice 1001 v1?

[Tool] design-documents – browse_catalog (MCP)
[Tool] design-documents – get_document_content (MCP) (document_id: 1001, version: 1)

[Agent] Comparing against DP-TRS-LOADS-001 Rev 1, divergences found:
(1) Bearing — must be excluded from output.
(2) lug_fairlead — should be lug_failsafe.
(3) Exceedance comparison — step 7.2 requires comparison against v1.
Would you like me to fix these?

[Engineer] yes fix please

[Tool] Glob(previous_run/*)
[Tool] Read(previous_run/OEM_all_loads_v1.json)
[Tool] Edit(read_loads.py) — exclude bearing, rename lug_failsafe, add compare_to(v1)

[Engineer] can you run it?

[Tool] Bash(uv run read_loads.py)
[Output]
Envelope: 6 cases, 5 points (bearing excluded)
WARNING: v2 exceeds v1 envelope
ModuleNotFoundError: matplotlib

[Agent] Needs matplotlib. Let me add the [all] extra to the dependency.
[Tool] Edit(read_loads.py) — ductile-loads[all]
[Tool] Bash(uv run read_loads.py)
[Output]
Envelope: 6 cases, 5 points
WARNING: v2 exceeds v1 envelope
Comparison report generated in comparison_report/

[Engineer] ok, I am happy. One final check against the design practice?
[Agent] All processing steps from DP-TRS-LOADS-001 are covered: format conversion, naming, units, envelope, ANSYS export, and exceedance comparison.

[Engineer] ah, I forgot to multiply fx loads only by 1.04. Can you do that?

[Tool] Edit(read_loads.py) — add Fx × 1.04 before unit conversion

[Terminal] $ uv run read_loads.py
[Output]
Fx correction applied | Envelope: 6 cases
WARNING: v2 exceeds v1 envelope
[output truncated]

The engineer checked the output for 3 minutes and considered the task completed. Total session duration: 22 minutes. The engineer executed scripts 8 times independently, driving the process step by step.
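The downselection rule the agent described in the Engineer 2 session — for each point and component, keep the load case giving the max value (always) and the one giving the min value (only if negative) — can be reimplemented in a few lines. This is an illustrative sketch of that rule, not the actual ductile-loads implementation.

```python
# Illustrative reimplementation of the envelope rule described in the
# session transcript; not the actual ductile-loads implementation.

def envelope(load_cases: dict) -> set:
    """Return the set of load-case names that form the envelope."""
    # Collect every (point, component) key present in the delivery.
    keys = {
        (point, comp)
        for points in load_cases.values()
        for point, comps in points.items()
        for comp in comps
    }
    selected = set()
    for point, comp in keys:
        values = {
            case: points[point][comp]
            for case, points in load_cases.items()
            if point in points and comp in points[point]
        }
        selected.add(max(values, key=values.get))  # max is always kept
        min_case = min(values, key=values.get)
        if values[min_case] < 0:                   # min only if negative
            selected.add(min_case)
    return selected


# Toy delivery: three load cases, one interface point, two components.
cases = {
    "lc1": {"lug_port": {"fx": 10.0, "fy": -2.0}},
    "lc2": {"lug_port": {"fx": -4.0, "fy": 5.0}},
    "lc3": {"lug_port": {"fx": 3.0, "fy": 1.0}},
}
critical = envelope(cases)  # lc3 is never extreme and is dropped
```

With the data above, lc1 supplies the fx max and the (negative) fy min, lc2 the fy max and the (negative) fx min, so the envelope reduces three cases to two — the same kind of reduction the sessions show from 100 cases to 6.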