Embodied Science: Closing the Discovery Loop with Agentic Embodied AI

Embodied Science: Closing the Disco very Loop with Agentic Embodied AI Xiang Zhuang 1,2 , Chenyi Zhou 1 , Kehua Feng 1 , Zhihui Zhu 1 , Y unfan Gao 3 , Yijie Zhong 3 , Yichi Zhang 1 , Junjie Huang 1 , Ke y an Ding 1 B , Lei Bai 2 B , Haofen W ang 3 B , Qiang Zhang 1 B , and Huajun Chen 1 B 1 Zhejiang University , Hangzhou, China 2 Shanghai Ar tiﬁcial Intelligence Laboratory , Shanghai, China 3 T ongji University , Shanghai, China B baisanshi@gmail.com, haof en.wang@tongji.edu.cn, { dingk ey an,qiang.zhang.cs,huajunsir } @zju.edu.cn ABSTRA CT Ar tiﬁcial intelligence has demonstrated remarkable capability in predicting scientiﬁc proper ties , yet scientiﬁc discov er y remains an inherently physical, long-hor iz on pursuit gov er ned by experimental cycles. Most current computational approaches are misaligned with this reality , framing discov er y as isolated, task-speciﬁc predictions rather than continuous interaction with the ph ysical world. Here, we argue for embodied science, a paradigm that reframes scientiﬁc disco very as a closed loop tightly coupling agentic reasoning with physical e x ecution. We propose a uniﬁed P erception– L anguage– A ction– D iscov er y ( PLAD ) frame work, wherein embodied agents perceiv e e xperimental en vironments, reason over scientiﬁc knowledge , ex ecute physical inter ventions , and inter nalize outcomes to drive subsequent e xploration. By grounding computational reasoning in rob ust ph ysical f eedback, this approach bridges the gap betw een digital prediction and empir ical validation, off ering a roadmap for autonomous discov er y systems in the lif e and chemical sciences. 1 Introduction Artiﬁcial intelligence is reshaping how scientiﬁc knowledge is produced and acted upon 1 . Over the past decade, data-dri ven models hav e deliv ered striking advances in problems once considered intractable, from accurate protein structure prediction 2 , 3 to learned models for molecular property prediction 4 – 6 , generati ve design 7 , 8 , and synthesis planning 9 . These successes, together with the rise of foundation models that unify representations across modalities 10 – 12 , hav e begun to shift AI for Science (AI4S) from a collection of specialized predictors to ward more general-purpose scientiﬁc engines 13 . Y et a central tension remains: scientiﬁc discovery is not a single-shot infer ence pr oblem . Breakthroughs typically emerge from long-horizon, iterati ve interaction with the ph ysical world, through a sustained loop of hypothesis formation, experimental design, ex ecution under real constraints, analysis, and model re vision 14 , 15 . Even when a predicti ve model is e xcellent, the discovery process can stall if the system cannot decide what to do ne xt, cannot percei ve the signals from instruments, or cannot reliably translate decisions into ex ecutable laboratory operations. Recent ef forts make this mismatch visible in the form of a bifurcated landscape. On one side, lar ge language model (LLM)-based agents 16 – 20 expand cogniti ve scope through language-mediated reasoning, planning, and tool use, often translating high-le vel scientiﬁc intent into experimental plans, code, and workﬂo ws. On the other side, automated and robotic laboratories 21 , 22 demonstrate reliable embodied execution, enabling sustained experimentation and closed-loop optimization within well-deﬁned experimental spaces. This split reﬂects broader empirical evidence that AI’ s currently attributed scientiﬁc impact remains concentrated in cogniti ve augmentation, especially in data processing and pattern recognition, whereas the next critical step is to e xpand sensory and e xperimental capacity to enable the search for and acquisition of ne w forms of evidence be yond “standing” datasets 23 . Crucially , this split is not an incidental inte gration gap that will v anish by “connecting” modules. Each line optimizes a dif ferent Human-orchestr ated, AI- assisted scientific discovery Long - horizon autonomous scientific discovery Data Collection Model T raining Prediction Experiments Predictio n - cent ric AI for science Discovery Perception Action Embodied science Language Disembodied AI assistant Agentic embodi ed AI Scienti fic data from instruments Scienti fic cognit ion Embodie d executio n I nternaliz e outcomes as insight Reasoning Planning Reflecting ... AI mode l Input Output Design Tr ai ni n g AI Mode l Instruments Output Analysis Dataset Source … Figure 1. Fr om prediction-centric AI f or science to embodied science. Left: the dominant AI4S workﬂo w is human-orchestrated. Experts curate data sources into datasets, train task-speciﬁc models, and use predictions to inform the next e xperimental step; ex ecution remains human-managed, yielding a loosely coupled loop across stages. Right: embodied science reframes discov ery as a closed-loop process interaction with the physical world. Agentic embodied AI provides the enabling system layer that operationalizes this paradigm by integrating Perception – Language – Action – Discov ery (PLAD): it grounds cognition in instrument-deriv ed e vidence (Perception), conducts reasoning and planning (Language), carries out embodied experimentation (Action), and internalizes outcomes into scientiﬁc insight (Discov ery). The system can sustain long-horizon autonomous scientiﬁc discovery , ex ecuting iterativ e cycles be yond task-bounded, single-shot assistance. projection of discov ery under the prev ailing framing: cognition is powerful b ut weakly grounded in instrument-lev el e vidence and physical constraints, whereas e xecution is rob ust but often optimized around predeﬁned objectiv es and procedural boundaries. W ithout reframing discov ery as an end-to-end closed loop, incremental scaling (such as stronger planners, better robots, or larger models) tends to reproduce fragmentation rather than resolv e it. Thus, autonomy in scientiﬁc discovery is a property of the coupled system, and realizing the capacity expansion required for continued exploration 23 collapses when any of the three structural requirements is e xternalized: perception , cognition , and action . At the level of per ception , scientiﬁc evidence is generated by instruments, not datasets: raw spectra, chromatograms, microscop y streams, sensor logs, calibration traces, and meta-data that capture drift, f ailure, and context. Models trained on curated tables often lack the ability to parse these multimodal, imperfect signals and to perform goal-directed sensing (e.g., adapti vely zoom, r e-measure, recalibrate, or switch modalities when anomalies occur). At the level of cognition , most AI4S systems optimize well-deﬁned tasks (predict a structure, rank candidates, re gress a property), b ut long-horizon discovery requires persistent goal management, experimental reasoning under uncertainty , and planning over contingencies and costs 24 . It demands not only selecting the next e xperiment, but also deciding what to measur e , when to intervene , and how to r evise hypotheses as e vidence accumulates. At the level of action , discovery hinges on interventions in the world 25 . Many AI4S pipelines terminate at candidate suggestions, while real laboratories require precise, v eriﬁable actions: manipulating reagents, conﬁguring instruments, e xecuting protocols, ensuring safety constraints, and reco vering from errors. W ithout rob ust action grounding, “recommendations” cannot mature into discov eries. 2/ 17 Here, we argue for embodied science as a foundational paradigm for long-horizon autonomous disco very . In embodied science, AI does not merely analyze data or recommend actions; it expands sensory and experimental capacity by participating directly in experimental w orkﬂows as part of a closed loop that couples perception of instrument signals, knowledge-grounded reasoning, and physical intervention. Scientiﬁc progress, under this paradigm, emerges from continuous eng agement with real experimental en vironments rather than from one-off computation ov er static datasets. This perspectiv e reframes AI-driv en discov ery as an exploration-dri ven process, in which hypotheses, experimental strategies, and operational beha viors co-e volv e through repeated interaction with the physical world. 2 Scoping Embodied Science and Agentic Embodied AI AI-dri ven discov ery is transitioning from tool-based augmentation to system-le vel reconﬁguration of the scientiﬁc method. T o av oid terminology drift, we scope two concepts that or ganize this Perspectiv e, Embodied Science and Agentic Embodied AI , and we deﬁne long-horizon autonomous scientiﬁc discov ery as the operational outcome they aim to enable. 2.1 Embodied Science Deﬁnition. W e use Embodied Science to denote a paradigm in which discovery is treated as an embodied, long- horizon, closed-loop process. AI is inte grated into real experimental w orkﬂows and operates across the complete cycle of disco very by percei ving instrument-generated signals, reasoning with scientiﬁc knowledge, and e xecuting laboratory interv entions. Scientiﬁc progress therefore arises from sustained interaction with the ph ysical world, rather than from isolated computation ov er static datasets. Why embodiment matters in scientiﬁc exploration. Embodiment, in this view , is not laboratory automation in disguise. It is what makes AI-driv en disco very actionable: plans are realized as physical interventions in the laboratory , and their consequences are returned as instrument-grounded feedback. Rather than mechanically ex ecuting ﬁxed protocols, embodied capabilities enable systems to carry intent into the real world, observe what actually happens, and adapt subsequent decisions accordingly . 2.2 Agentic Embodied AI Embodied Science deﬁnes what kind of discovery process is needed; Agentic Embodied AI speciﬁes what kind of AI system can realize it. Deﬁnition. Agentic Embodied AI is a persistent cyber–physical scientiﬁc agent that couples (i) scientiﬁc cognition, (ii) experimental perception, and (iii) laboratory action within a single closed-loop controller , operating under explicit feasibility and safety constraints. Three properties are essential: (1) Agentic autonomy: the ability to manage goals over e xtended horizons, plan under uncertainty and cost, and re vise strategies in response to outcomes; (2) Embodiment in the experimental loop: interfaces to raw instrument streams, operational state, and actuation primiti ves to ground reasoning in laboratory reality; (3) Long-horizon persistence: memory , prov enance, monitoring, and reco very beha viors that maintain continuity across cycles rather than treating each run as a standalone episode. 2.3 Long-Horizon A utonomous Scientiﬁc Discovery Because “autonomy” is often used loosely , we adopt an operational criterion: Operational criterion. A system demonstrates long-horizon autonomous disco very if it can sustain multiple end-to-end discov ery cycles—h ypothesis → experiment design → physical e xecution → interpretation → re vision— ov er extended time spans with minimal human interv ention, while maintaining reproducibility , pro venance, and safety . This criterion sets a higher bar than most current demonstrations: it requires the loop to remain closed be yond a single episode, sustaining autonomous operation across instrument drift, stochastic outcomes, and accumulating uncertainty , while preserving reproducibility , provenance, and safety . 3/ 17 3 Analyzing the Current Landscape: Reasoning-Centric and Ex ecution-Centric P aradigms Moti vated by the need to bridge computational reasoning and physical experimentation, scientiﬁc agents and embodied AI hav e recently emerged as promising routes to ward more autonomous scientiﬁc discov ery . In practice, ho wev er , existing ef forts ha ve lar gely bifurcated into two partial realizations: reasoning-centric systems that advance hypothesis generation and in silico exploration, and ex ecution-centric platforms that enable autonomous, high- throughput experimentation within well-deﬁned procedural boundaries. Each excels at a single component of the discov ery process while remaining fundamentally decoupled from the others. Belo w , we analyze the current landscape and argue that these limitations are structural rather than incremental. 3.1 Disembodied Scientiﬁc Cognition: Reasoning without Ph ysical Action Artiﬁcial intelligence has long supported scientiﬁc research through data-dri ven modeling and prediction, a trajectory often associated with the Fourth Paradigm of data-intensi v e science 26 . In recent years, this role has expanded tow ard cognition-centric scientiﬁc agents that aim to ele vate AI from a passi ve analytical tool to an acti v e reasoning entity . As summarized in T able 1 , these approaches are uniﬁed by a common paradigm: they operate primarily on curated datasets, emphasize language-based scientiﬁc reasoning, and remain physically disembodied, with experimental ex ecution and validation performed by humans. W ithin this paradigm, a ﬁrst subclass focuses on task decomposition and planning. Systems such as ChemCro w 27 , Biomni 28 , SciT oolAgent 29 , and T oolUni verse 30 le verage large language models to orchestrate domain-speciﬁc tools, dynamically assembling workﬂo ws for synthesis planning, reaction optimization, or biomedical analysis. Their strength lies in structuring complex scientiﬁc tasks and coordinating heterogeneous computational resources. Ho wev er , their operational boundary is fundamentally cogniti v e: experimental actions are abstracted as tool calls, and physical ex ecution remains external. A second subclass emphasizes hypothesis generation and problem-space exploration. Platforms such as V irtual Lab 31 , Robin 32 , and AI Co-scientist 33 aim to emulate aspects of collaborative scientiﬁc reasoning by coordinating e vidence gathering, hypothesis reﬁnement, and iterativ e discussion across speciﬁed research problems. These systems move beyond task execution toward exploratory inquiry , yet their exploration remains conﬁned to literature, databases, and simulations. Hypotheses are e valuated through manual e valuation by human experts rather than autonomous experimental falsiﬁcation, leaving the core scientiﬁc loop incomplete. A third line of w ork frames discov ery as iterati ve search and optimization in silico. Systems such as AlphaEvolv e 34 , DeepScientist 35 and InternAgent 36 , 37 formalize scientiﬁc progress as a feedback-dri ven process in which hypotheses or programs are iterati vely reﬁned based on computational e v aluation. While this paradigm introduces an explicit notion of iteration and feedback, the feedback signal is constrained to the execution of computational algorithms. Lacking physical embodiment and the ability to interact with the material world, these systems struggle to generalize across broader scientiﬁc domains, often producing candidates that fail to translate into physical experimental succe ss. Finally , some efforts to ward end-to-end research in silico, ex empliﬁed by systems such as AI Scientist 38 and K osmos 39 , extend cognition-centric agents to include automated manuscript writing and reporting. Although these systems appear to close the research loop across problem formulation, methodological design, experimental ex ecution, and writing, their scope remains conﬁned to computational domains. W ithout physical embodiment, they remain incapable of engaging with tangible e xperimental en vironments, leaving the loop fundamentally decoupled from the complexities of the material w orld. Despite their diversity , reasoning-centric scientiﬁc agents share a fundamental structural limitation. W ithout embodied execution in real experimental environments, hypotheses cannot be autonomously tested, falsiﬁed, or re vised through sustained interaction with the physical world. Feedback is indirect, delayed, or entirely com- putational, leading to a form of cogniti ve closure in which reasoning saturates without experimental grounding. Consequently , these approaches struggle to support long-horizon autonomous scientiﬁc disco very , which requires repeated, embodied cycles of hypothesis testing and re vision. 3.2 Execution-Centric Embodiment: Action without Scientiﬁc Understanding In parallel with advances in cognition-centric scientiﬁc agents, embodied automation has made substantial progress to ward the physical e xecution of e xperiments 47 . As characterized in T able 1 , ex ecution-centric embodied systems 4/ 17 T able 1. Landscape of current AI4S approaches tow ard autonomous scientiﬁc discov ery . Existing efforts cluster into two partial realizations: reasoning-centric but physically disembodied systems, and e xecution-centric b ut cogniti vely shallo w platforms. Notably , neither paradigm achie ves a sustained coupling between language-le vel scientiﬁc reasoning and embodied experimental action. Paradigm Characteristics Subtypes Example approaches Language-heavy , action-light (physically disembodied) 16 , 17 • Strong language-based reasoning, but no coupling to embodied action; • Operate primarily on curated datasets rather than raw signals from instruments; • Remain physically disembodied, with experimental e xecution carried out by human scientists. T ask decomposition & planning e.g., ChemCro w 27 ; Biomni 28 ; SciT oolA- gent 29 ; T oolUni verse 30 Hypothesis & problem exploration e.g., V irtual Lab 31 ; Robin 32 ; AI Co-scientist 33 Iterati ve search & opti- mization in silico e.g., AlphaEvolv e 34 ; DeepScientist 35 ; InternA- gent 36 , 37 End-to-end research in silico e.g., AI Scientist 38 ; K os- mos 39 Action-heavy , language-light (cogniti vely shallo w) 21 , 22 • Reliable embodied execution, b ut limited language-le vel reasoning and hypothesis formation; • Operate directly on raw instrument signals using heuristic decision-making (e.g., Bayesian optimization); • Enable autonomous execution within narro w task scopes and ﬁxed procedural boundaries. Single-step, instrument-bound e.g., automated liquid han- dlers; robotic pipetting and weighing systems Multi-step, protocol- dri ven automation e.g., Chemputer 40 ; FLUID 41 Closed-loop ex ecution and optimization e.g., A-Lab 42 RoboChem 43 ; CRESt 44 Language-instructed experimental ex ecution e.g., Coscientist 45 ; ChemAgents 46 operate on raw instrument signals, rely primarily on heuristic or statistical decision-making, and enable direct interaction with the physical w orld. Ho wev er , these systems are typically cogniti vely shallo w: while they e xcel at ex ecuting experiments, the y lack the capacity for mechanism-aware reasoning and hypothesis-dri v en inquiry . At one end of the e xecution-centric spectrum are single-step, instrument-bound systems. Automated liquid handlers, robotic pipetting platforms, and weighing systems e xemplify this class. They automate isolated e xperi- mental operations with high precision and repeatability , yet each action is ex ecuted independently of the broader scientiﬁc conte xt. Consequently , these systems constitute the basic infrastructure of automated experimentation: they robustly perform prescribed single-step actions, yet lack the capacity to analyze outcomes, reason about experimental context, or adapt beha vior beyond explicit instructions. A more advanced, yet still ex ecution-bound subclass extends automation to multi-step, protocol-driven e xecution. Platforms such as the Chemputer 40 and FLUID 40 encode experimental procedures as machine-e xecutable w orkﬂows, enabling the automated realization of comple x, multi-stage protocols. Compared to single-step systems, this approach extends automation from isolated actions to coordinated, multi-stage e xperimental e xecution. Although e xecution no w spans multiple stages, the system does not analyze intermediate outcomes, reason about e xperimental context, or de viate from the prescribed workﬂo w when unexpected results occur . At the most integrated end of execution-centric systems are closed-loop, feedback-dri v en platforms, including A-Lab 42 , RoboChem 43 , and CRESt 44 . By inte grating robotic e xecution with online charac- terization and statistical optimization methods, including Bayesian optimization and ev olutionary search, these platforms demonstrate impressi ve autonomy over e xtended e xperimental campaigns, adapting parameters in response to observed outcomes and achieving local optimization in high-dimensional spaces. Howe ver , the closed loop operates primarily at the lev el of numerical feedback rather than scientiﬁc representation. When experiments fail or yield anomalous results, adaptation proceeds through parameter adjustment rather than counterfactual reasoning or 5/ 17 hypothesis re vision. W ith the introduction of lar ge language models, robotic systems ha ve be gun to blur the boundary between ex ecution-centric automation and cogniti ve planning, using LLMs to design experimental w orkﬂows and dri ve embodied laboratory systems for physical e xecution 45 , 46 . Despite this progress, LLM integration primarily improv es the ﬂexibility of workﬂo w design and control, without conferring sustained, long-horizon scientiﬁc agency . Experimental outcomes are used to reﬁne individual tasks, leading to episodic, task-bounded reasoning rather than cumulati ve, discov ery-dri ven iteration. Despite increasing integration, ex ecution-centric systems face tw o persistent limitations at the le vel of embod- iment. First, most platforms rely on ﬁxed or rail-mounted manipulators tightly coupled to predeﬁned laboratory layouts, which provide reliability b ut limit physical ﬂexibility across heterogeneous instruments and reconﬁgurable en vironments. Second, the development, calibration, and maintenance of such systems remain labor-intensi ve, requiring substantial human effort to debug workﬂo ws and adapt execution logic to new experimental settings. T o alleviate these constraints, ex ecution-centric autonomy has been e xtended through mobile embodiments and digital twin–based simulation. Mobile robotics 48 , 49 address the limitation of rigid embodiment by physically interconnecting spatially distrib uted instruments, thereby expanding the scope of ex ecutable workﬂo ws beyond ﬁxed workcells. In parallel, digital twin framew orks such as MA TTERIX 50 target the engineering bottleneck by enabling virtual v alidation and sim-to-real transfer 51 , reducing the cost and risk associated with deploying automated experimental w orkﬂows. Howe v er , both advances operate strictly at the le vel of ex ecution. They improv e ﬂexibility and reliability in ho w experiments are carried out, but do not enable systems to interpret e xperimental outcomes, re vise scientiﬁc hypotheses, or engage in mechanism-aware reasoning. In this sense, mobile robots and digital twins address ex ecution bottlenecks, not scientiﬁc cognition. Consequently , execution-centric embodied systems often function as po werful ex ecution engines rather than as scientists. Their decision-making is guided by predeﬁned objecti ve functions and ﬁx ed parameter spaces, limiting their ability to generalize be yond narrowly scoped tasks. Although they may achie ve ef ﬁcient optimization within constrained domains, they do not accumulate transferable scientiﬁc insight. Experimentation is treated as a process of ex ecuting and scoring trials, rather than as an acti ve inquiry in which hypotheses are formulated, challenged, and re vised. This gap between action and understanding gi v es rise to an illusion of scientiﬁc progress: local performance improv es, yet discov ery remains decoupled from scientiﬁc understanding. 3.3 Why Incremental Scaling Is Insufﬁcient The limitations above do not disappear by scaling one side. More po werful language reasoning does not automatically yield procedural correctness, instrument awareness, or safe execution. More capable robotics does not automatically yield hypothesis-dri ven inquiry , evidence inte gration, or scientiﬁc generalization. Long-horizon autonomy requires a uniﬁed system-lev el coupling between perception, language-level reasoning, embodied action, and cumulativ e discov ery 52 , 53 . This moti v ates a closed-loop framework that treats autonomy as an end-to-end property rather than a component upgrade. 4 PLAD: T owards Closed-Loop Agentic Embodied AI f or Scientiﬁc Discovery W e ar gue that agentic embodied AI represents a critical technological pathw ay to ward long-horizon autonomous scientiﬁc disco very . T o this end, we propose the Perception–Language–Action–Discovery (PLAD) closed-loop paradigm (Figure 2 ) as an ov erarching frame w ork for implementation. Unlike the V ision–Language–Action (VLA) 54 paradigm that underpins general-purpose embodied intelligence, where the primary objecti ve is to understand open en vironments, generate linguistic descriptions, and e xecute physical actions. PLAD is explicitly centered on scientiﬁc discov ery as its core goal. Accordingly , its design is aligned with the distinctive requirements of scientiﬁc research across cogniti ve structures, perceptual tar gets, and forms of action. W ithin PLAD, scientiﬁc discovery is modeled as a continuously operating closed-loop process. An agent percei ves the experimental en vironment (Perception), reasons and plans under the support of scientiﬁc language and kno wledge (Language), ex ecutes experiments through embodied actions in real laboratory settings (Action), and internalizes experimental outcomes as ne w scientiﬁc insights (Discov ery), which in turn drive subsequent rounds of exploration. Belo w , we introduce each component of the PLAD paradigm in detail. 6/ 17 Long - horizon Autonomous Scientific D iscovery Reasoning with Models , Knowledge, and T ools Scientific Brain General Larg e Language Model · Spe cialized Knowledge ↓ grounds & bounds invokes & orchestrates Paper Database Sci - KG · Spe cialized T ools ↓ Web search Database query Model execution … … Embodied Execution in Physical W orld · St ationary Manipulat or · Li near T rack Manipulator · Mobile Manipulator · Hu manoid Robots Spatially Constrai ned: Autono my Spatially Unconst rained: Flexibility Instruments as Extensions o f Scientific Senses Internalizing Outcomes as S cientific Insight · Instrument - defined Experim ental States ↓ · Instrument - mediated Physical Observables ↓ Microsco py image s Cryo - EM outputs … Progress states Equipment status … … Spectral measureme nts Electronic lab notebook … Exploration quest ion New insight · Cu mulative sci entific know ledge ↓ Enzyme function Reaction pathway … … Material property … Drug Activity Laboratory information systems … Figure 2. The PLAD loop f or long-horizon autonolous scientiﬁc discovery . Per ception transforms raw instrument signals into structured e vidence. Language integrates foundation models with specialized kno wledge and tools to support hypothesis formation, reasoning, and experimental planning. Action compiles plans into veriﬁable laboratory operations. Discovery internalizes outcomes into transferable scientiﬁc knowledge that shapes subsequent exploration across c ycles. 4.1 P erception: Instruments as Extensions of Scientiﬁc Senses Perception characterizes an agent’ s capacity to sense the scientiﬁc en vironment through instruments that extend the limits of human perception. In embodied science, instruments do not merely record data; they function as artiﬁcial scientiﬁc senses, deﬁning what aspects of the ph ysical world are observ able and ho w experimental information is structured. Scientiﬁc perception comprises two complementary forms of instrument sensing. First, instruments provide instrument-mediated physical observ ations, through which latent physical phenomena are rendered observable. These include high-dimensional signals such as microscopy images, cryo-electron microscopy reconstructions, spectral measurements, and other modality-speciﬁc outputs. Such observations e xpose structural, dynamical, or compositional properties of experimental systems that are inaccessible to unaided human senses. Second, instruments deﬁne instrument-deﬁned experimental states, which encode the operational and procedural context of e xperimentation. These include progress indicators, equipment status, and structured records maintained in electronic lab notebooks or laboratory information systems. Rather than capturing physical phenomena 7/ 17 directly , these signals formalize the evolving state of an experiment, enabling agents to track execution, detect de viations, and coordinate multi-step workﬂows. T ogether , these two forms allo w embodied agents to perceiv e not only what is occurring in an e xperiment, b ut also where it resides within the broader scientiﬁc process, forming a stable foundation for closed-loop reasoning and action. 4.2 Language: Reasoning with Models, Knowledge, and T ools Language constitutes the “scientiﬁc brain” of an agent, responsible for scientiﬁc reasoning, interpretation, and planning. Within PLAD, this component is centered on large language models (LLMs) 18 , 19 , understood as a general class of foundation models that encompass multimodal LLMs 11 , 12 , 55 capable of reasoning over heterogeneous scientiﬁc inputs. While LLMs supply general reasoning capability , reliable scientiﬁc intelligence cannot emer ge from models alone. Instead, it arises from a structured integration of models, specialized kno wledge, and task-speciﬁc tools, enabling a dynamic balance between generality and specialization. At the model le vel, LLMs used for scientiﬁc tasks must extend beyond natural language understanding to interpret multimodal scientiﬁc inputs produced by perception, including experimental data, instrument outputs, and structured records. In addition, the slo w-thinking 56 of LLMs enables sustained reasoning ov er complex instrument data as well as hypothesis inference and e xperimental planning. Such capacity allows models to integrate heterogeneous e vidence, examine intermediate conclusions, and reason counterfactually about alternativ e e xplanations, experimental conditions, or mechanistic hypotheses. These requirements motiv ate a set of science-oriented design considerations. These include modality-aware encoding for spectra, images, and time series; specialized tokenization schemes for chemical, biological, or materials representations; and architectural support for reasoning over long contexts. T raining pipelines further beneﬁt from integrating scientiﬁc literature with instrument-generated data, enabling LLMs to acquire long chain-of-thought reasoning patterns that align with empirical scientiﬁc practice. Kno wledge provides the specialized grounding and constraint that anchors general LLM reasoning in domain reality . Such knowledge spans unstructured scientiﬁc literature and structured resources, including databases and scientiﬁc knowledge graphs (Sci-KGs) 57 . Sci-KGs systematically encode scientiﬁc concepts and their relationships in the form of triples, integrating structured databases with unstructured textual knowledge to provide a more comprehensi ve and stable foundation. Moreover , knowledge graphs can incorporate multimodal information, such as omics data, imaging data, and dynamical trajectories from computational simulations, thereby offering rich contextual support for scientiﬁc reasoning. Importantly , grounding and constraint are not abstract properties but are operationalized through concrete mechanisms 58 . During training, structured kno wledge can be transformed into reasoning supervision, for example, via kno wledge-graph-to-corpus approaches 59 that con vert graph structure into long-chain scientiﬁc reasoning data, yielding high-reliability learning signals. During inference, literature, databases, and kno wledge graphs are integrated through retrie v al-augmented generation (RA G) 60 , 61 , supplying authoritati ve context that constrains model outputs and stabilizes reasoning under uncertainty . In this way , kno wledge injects precision, consistency , and domain depth that general-purpose LLMs alone cannot guarantee. T ools constitute a second axis of specialization by e xtending reasoning into executable operations. Through tool in v ocation, such as web search, database querying, and computational model in v ocation, LLMs can acti vely acquire external e vidence, perform specialized analyses, and v alidate intermediate hypotheses. Unlike the general reasoning capacity of LLMs, tools encode expert procedures and formal methods that deli v er accuracy and reliability on well-deﬁned scientiﬁc tasks, ef fectiv ely externalizing domain e xpertise into veriﬁable operations. T ogether , this model–kno wledge–tool inte gration enables a dynamic balance between generality and special- ization. General-purpose LLMs pro vide adaptability , contextual understanding, and analogical reasoning across domains; specialized knowledge and tools deliv er precision, depth, and task reliability . W ithin PLAD, agents dynamically adjust their reliance on these components according to task demands, allo wing them to rob ustly interpret scientiﬁc data, perform mechanism-aware reasoning, and plan experiments that are both ﬂexible across domains and dependable within speciﬁc scientiﬁc contexts. 8/ 17 4.3 Action: Embodied Execution in Ph ysical W orld Action corresponds to the scientiﬁc body of an embodied agent and denotes its capacity to intervene in the physical world through e xperimental ex ecution. In PLAD, action is deﬁned by the agent’ s ability to physically manipulate materials, instruments, and experimental processes, thereby grounding scientiﬁc reasoning in reality . Embodied ex ecution can be broadly organized along a ke y dimension: the degree of physical constraint under which an agent operates. Along this axis, embodied forms range from spatially constrained embodiments, which trade autonomy for reliability , to spatially unconstrained embodiments, which prioritize ﬂexibility at the cost of ex ecution certainty . Spatially constrained embodiments operate within predeﬁned mechanical boundaries and are tightly coupled to laboratory infrastructure, with tw o representativ e forms: stationary manipulators and linear track manipulators 62 . Stationary manipulators are ﬁxed robotic arms deplo yed at indi vidual experimental stations, automating well-deﬁned manual operations such as dispensing, sample loading or unloading, and instrument handling. Their limited work en velope enables high precision, stability , and repeatability , but also conﬁnes them to step-le vel e xecution within isolated experimental stages. Linear track manipulators extend this capability by mounting robotic arms on meter - scale rails that connect multiple functional stations, such as synthesis and characterization zones. This conﬁguration enables coordinated, multi-step experimental w orkﬂows and sustained high-throughput e xecution under predeﬁned pipelines, signiﬁcantly expanding e xperimental coverage while preserving e xecution reliability . Nev ertheless, both forms remain constrained by ﬁxed layouts and preconﬁgured trajectories, fav oring rob ustness and safety ov er autonomy and limiting their adaptability to unstructured or rapidly e volving laboratory settings. In contrast, spatially unconstrained embodiments operate without predeﬁned trajectories or ﬁxed work en velopes. This category includes mobile manipulators, which integrate robotic arms with wheels, as well as humanoid robots 48 , 49 . Such embodiments can navigate comple x laboratory spaces, transport samples and consumables, and interact with diverse instruments across distributed en vironments. Their physical freedom enables higher-le vel autonomy and more closely mimics the beha viors of human researchers. At the same time, this ﬂexibility introduces challenges in motion planning and execution reliability , particularly in safety-critical laboratory settings characterized by dense equipment and intricate procedures. These two classes of embodiment deﬁne complementary regimes of embodied action 63 , 64 . Spatially constrained systems provide a stable and reliable foundation for routine experimentation, while spatially unconstrained systems enable ﬂexibility and integration across heterogeneous laboratory contexts. Among the latter , humanoid robots of fer a distinctiv e long-term advantage: by directly operating within human-oriented laboratory layouts, tools, and protocols, they minimize the need for en vironment reconﬁguration and thus represent a promising pathway to ward general-purpose laboratory autonomy . Long-horizon autonomous scientiﬁc disco very arises from integrating constrained reliability with unconstrained adaptability , with embodied action in PLAD bridging reasoning and sustained physical experimentation. 4.4 Discovery: Internalizing Execution Outcomes as Scientiﬁc Insight Discov ery denotes the process by which an agent internalizes the outcomes of embodied action into scientiﬁc insight. Rather than treating experimental results as isolated observ ations, discovery abstracts execution feedback into transferable scientiﬁc understanding. W ithin PLAD, this process con verts e xploratory questions into ne w insights, such as reﬁned interpretations of enzyme function, inferred reaction pathways, or updated structure–property relationships. By feeding these insights back into subsequent reasoning and action, discovery enables the continual reﬁnement of research objecti ves and supports long-horizon autonomous scientiﬁc discov ery . 4.5 PLAD Examples Using enzyme design and chemical reaction optimization as representati v e examples, this section illustrates how the PLAD frame work can be instantiated across di verse e xperimental settings (Figure 3 ). While the underlying scientiﬁc questions and experimental modalities dif fer , PLAD provides a common or ganizational structure for inte grating perception, reasoning, and action into closed-loop discov ery processes. 9/ 17 Language Action Perception Discovery Structural a nd functional signals · Structu ral data Structure – function hy pothesis · MD simulation · Structu re predict ion · Database sear ch Active - site muta tion Biofoundry - driven mutant generation and screening · Mutant library · Screeni ng · Purifi cation · Express ion Internalized design rules for activity – stabilit y Functional mutation Silent mutation L A D P Enzyme Design Language Action Perception Di s c o v e r y Spectrosco pic data and reaction trajec tories Mechanism - driven chemical reasoning Automated reaction execu tion In te r n a l i z e d i n s i g h t i n to r e a c ti o n p a t hway s · Spectro scopic characterization data · Time - resolved informatio n · Mechanis tic reasoning · Competing pathway infe rence · Intervention strategy · Mobile robotic chemist · Autom ated workstation L A D P Chemical Reaction Optimiza tion · Competing pat hway · Updated rule · Functional data · Stabil ity metrics a b Figure 3. Instantiating PLAD in scientiﬁc discov ery . (a) Enzyme design: Perception integrates structural and functional measurements; Language forms structure–function hypotheses under constraints; Action executes library construction and screening; Discov ery internalizes design rules and counterexamples for future c ycles. (b) Chemical reaction optimization: Perception tracks spectroscopic and kinetic trajectories; Language performs mechanism-aw are reasoning; Action executes tar geted campaigns; Discov ery internalizes mechanistic explanations and pathway-le vel constraints that generalize be yond parameter tuning. 4.5.1 Enzyme Design In enzyme design, Perception focuses on processing instrument-derived protein structural outputs (e.g., from cryo-electron microscopy or X-ray crystallography), functional and kinetic measurements (such as K m , k cat , and time-resolved acti vity proﬁles), and experimental indicators of stability and e xpression (including thermal stability , expression le vels, and puriﬁcation yield), among other inputs. Language is centered on constructing hypotheses about enzyme structure–function relationships. It aims to infer which structural changes are lik ely to dri ve functional improv ements. For example, it must determine whether enhanced acti vity arises from optimized substrate-binding conformations or from increased global stability , and distinguish whether activity loss results from direct perturbation of the acti ve site or from conformational instability caused by distal mutations. Based on these inferences, it proposes design strategies, such as prioritizing mutations at acti ve-site residues or shifting focus to modifying secondary-structure interfaces to enhance stability . This reasoning process is supported by computational and analytical tools, including protein structure prediction and molecular dynamics simulations to assess ho w mutations af fect conformational stability and dynamics, as well as database queries to examine e v olutionary conserv ation or 10/ 17 homologous v ariant distributions. Action subsequently translates these reasoning outcomes into concrete, embodied experimental e xecution. Action is implemented through control of physical experiments, such as generating mutant libraries on a biofoundry and orchestrating high-throughput e xpression, puriﬁcation, and acti vity screening. Through iterati ve cycles, structure–function h ypotheses are tested and reﬁned over e xtended experimental horizons. 4.5.2 Chemical Reaction Optimization In chemical reaction optimization, Perception emphasizes dynamic and process-lev el signals, including spectroscopic characterization data (such as NMR and IR) as well as time-resolved information, including reaction progress curves and by-product formation trajectories. T ogether , these signals reﬂect the temporal e volution of reaction pathways, intermediate states, and operational stability . Language is centered on mechanism-driven chemical reasoning. It focuses on ho w solvent, additi ves, and lig and structure modulate key elementary steps, such as oxidati ve addition, migratory insertion, and reductiv e elimination. When undesired outcomes arise, such as diminished enantioselecti vity or increased by-product formation, the focus shifts from global parameter adjustment to targeted hypothesis reﬁnement. The formation of speciﬁc side products may indicate a competing mechanistic pathway , prompting structural modiﬁcations or the introduction of additi ves to suppress or intercept that pathway . This reasoning process is also supported by computational tools, including quantum chemical modeling and kinetic analysis, which are used to assess relative energetics, barrier heights, and selecti vity trends under different conditions. Action then translates these into embodied e xperimental e xecution. Actions are realized through automated workstations or mobile robotic chemists that execute tar geted reaction campaigns, experimentally v alidating mechanistic hypotheses and suppressing undesired pathways. Through such embodied interv ention, mechanism-aw are exploration can be sustained across extended e xperimental cycles. 5 Challenges and Design Considerations In this section, we identify challenges that constrain the stability , reliability , and safety of long-horizon PLAD loops, and outline corresponding design responses that are necessary to k eep such systems operational, cumulati ve, and trustworthy (Figure 4 ). 5.1 Reasoning over Scientiﬁc Data Reasoning ov er scientiﬁc data is fundamental to bridging the gap between Perception and Language. In experiments, perceptual inputs typically manifest as complex, noisy signals, such as liquid chromatography–mass spectrometry (LC–MS) spectra, nuclear magnetic resonance (NMR) and infrared (IR) spectra, time-resolved reaction kinetic proﬁles, microscopy images, and instrument status logs. Interpretation of these data is deeply contingent on domain-speciﬁc knowledge and conte xtual experimental details. Consider LC–MS data as an illustrati ve example: a spectrum is not a simple “product ﬁngerprint” b ut rather the superposition of multiple ph ysicochemical processes, including v ariations in ionization efﬁcienc y , matrix effects, di verse fragmentation pathw ays, co-elution of isomers, and dynamic changes in signal-to-noise ratio ov er time. Consequently , the appearance, disappearance, or intensity shift of a peak does not necessarily correlate with reaction progress or the formation of a target compound; accurate interpretation requires integrating retention tim e, isotopic distribution patterns, fragment ion signatures, and experimental conditions into a coherent assessment. Similarly , scientiﬁc instruments characterize reaction through process signals and operational states. Parameters such as temperature, pressure, and ﬂo w rate, along with sensor readings or visual cues (e.g., phase transitions or color ev olution captured in images), are routinely used to assess whether a reaction has stalled or de viated from its intended trajectory . Such judgments, ho we ver , also rely on a deep understanding of the underlying reaction mechanisms and experimental protocols. Addressing this challenge requires coordinated adv ances in model design and tool-assisted reasoning. At the architectural level, large language models must be adapted to scientiﬁc en vironments so as to accommodate the structural or temporal properties of experimental data. This enables reasoning to operate directly o ver instrument- generated representations, rather than relying solely on linguistic descriptions. Such science-adapted language models are often described as scientiﬁc large language models (Sci-LLMs) 11 . Even with these adaptations, it is dif ﬁcult for LLMs to nati vely co ver all dimensions of scientiﬁc data analysis. Reliable interpretation also depends on 11/ 17 a. Reasoning over S cientific Data b. Embodied Execution Reliabilit y Challenge Resolution Heterogeneous, noisy instrument signals Interpretation requires a mechanistic understanding Modality - aware encoding of raw scientific signals To o l - based, precise, and verifi able analysis Sci - LLMs with tool - grounded interpretation Sci - LLM To o l s Signals Structural/T emporal Encoding Challenge Resolution Heterogeneous physical embodiments High - precision, safety - critical execution c. Long -Horizon Knowledg e Accumulation Challenge Resolution Digital Tw i n simulation & trajectory learning Sim - to - real transfer Experimental memory is fra gmented across cycles Knowledge struggles to accumul ate beyond local updates Knowledge graph for outcome - to - knowledge internalization Knowledge graph Experiment 1 Experiment 2 Experiment 3 update update update Experimental record Scientific hypotheses & conclusion Newly identified regularities, causal relations Experiment 1 Experiment 2 Experiment 3 d. Protocolization of Scientific In frastructure Challenge Resolution Fragmented, non - interoperable infrastructure Agent - interpretable action - observation protocols Challenge Resolution Irreversible embodiment amplifies cascading risk in long - horizon autonomy Governance embedded throughou t the PLAD loop e. Safety and Risk G overnance Experiment 1 Experiment 2 Experiment 3 Success Failure anomaly Perception from instrument Scientific brain Embodied action Pre - execution safety gating Model - based risk assessment Knowledge - based safety boundaries PLAD loop Observation Action Reasoning & planning Standardized actions Structured observations Explicit failure states Chemical hazards Biological risks … Figure 4. Challenges and design r esolution for sustaining long-horizon PLAD loops . (a) Reasoning over scientiﬁc data : Heterogeneous, noisy instrument signals require mechanism-aware interpretation; science-adapted LLMs, modality-aw are encoders, and tool-grounded analysis support reliable reasoning. (b) Embodied execution reliability : Robust e xecution across di verse robotic embodiments requires lar ge-scale simulation, trajectory learning, and sim-to-real transfer . (c) Long-horizon knowledge accumulation : Knowledge graphs con vert fragmented experimental records into cumulati ve scientiﬁc insight across c ycles. (d) Protocolization of scientiﬁc infrastructure : Agent-interpretable action–observ ation protocols standardize actions, observations, and f ailure states, enabling composable closed-loop workﬂo ws. (e) Safety and risk governance : T rustworthy long-horizon autonomy requires integrated safety gating, model-based risk assessment, and kno wledge-based safety boundaries. tool in v ocation, in which models work in concert with specialized computational tools to execute precise analytical procedures while maintaining control ov er higher-le vel scientiﬁc inference. Both model-based and tool-assisted reasoning rely critically on appropriate training paradigms. Agentic reinforcement learning has been sho wn to be ef fectiv e in training slo w-thinking and contextual tool use, enabling agents to in voke, interpret, and integrate tool outputs within sustained reasoning and planning processes. T ogether , these design and training strategies support the analysis of raw e xperimental signals within long-horizon autonomous discov ery loops. 12/ 17 5.2 Embodied Execution Reliability The core challenge of embodied ex ecution in scientiﬁc scenarios lies in e xecution reliability: whether the e xperi- mental actions can be e xecuted correctly , consistently , and repeatedly in comple x laboratory en vironments under strict safety constraints. Scientiﬁc experimentation may not be carried out by a single, uniform robotic form. Instead, embodied ex ecution spans a di verse set of physical carriers, including stationary or mobile manipulators and humanoid robots. Ensuring reliable execution across such heterogeneous embodiments ampliﬁes the difﬁculty of autonomy in scientiﬁc en vironments. Beyond the di versity of embodiments, scientiﬁc experiments themselves demand a wide range of precise and highly specialized skills, such as liquid dispensing, po wder weighing, apparatus grasping, sample transfer , and instrument interfacing. More importantly , real experiments often in v olve hazardous chemicals, high temperatures, high pressures, or biosafety risks, making it difﬁcult to scale up learning approaches that rely on physical trial and error . T o address these challenges, sim-to-real approaches based on digital twin environments play a central role 51 . Digital twins enable systematic modeling not only of laboratory layouts and instrument geometries, b ut also of the interaction dynamics speciﬁc to dif ferent embodiments and e xperimental mechanisms. In scientiﬁc scenarios, the ke y to digital twins is not limited to simulation accuracy in terms of geometry or motion, but also includes the accurate depiction of experimental mechanisms. For example, by modeling heat transfer processes, temperature changes in the virtual en vironment can accurately reﬂect thermal behaviors in physical experiments, ensuring that ex ecution strategies de v eloped in the simulation en vironment remain v alid in real experiments 50 . Lev eraging simulation en vironments for large-scale, lo w-cost training and trial and error , combined with high-quality trajectory data generated from human demonstrations and robot executions, can gradually narrow the gap between virtual and real worlds, enabling smooth migration from simulation training to real-world deplo yment. 5.3 Long-Horizon Knowledge Accum ulation The essence of long-horizon autonomy lies in the sustained bidirectional closure between cognition and e xecution across multiple experimental c ycles. Scientiﬁc discovery is not a collection of isolated experiments, b ut a process of cumulativ e knowledge construction, re vision, and e xtension. First, long-horizon autonomy imposes stringent requirements on memory management 65 . Agents must continuously record, organize, and re visit historical hypothe- ses, experimental designs, and observ ations across e xtended timescales, such that prior experience can be reliably carried forward into future decision-making. Such continuity cannot be reliably supported by language-model context windo ws or unstructured experimental logs alone. Second, long-horizon discov ery requires that experimental outcomes be systematically analyzed and internalized as e v olving scientiﬁc kno wledge. Newly generated results should not remain as transient observations or isolated performance metrics; rather , they must be integrated into the agent’ s internal state. This integration ensures that past successes, failures, and anomalies ex ert lasting inﬂuence, thereby enabling cumulati ve rather than repetiti ve disco very . A central strategy for addressing these challenges is the introduction of kno wledge graphs as structured repre- sentational skeletons 66 . At the memory lev el, the y transform fragmented experimental records, hypotheses, and conclusions into structured representations that support retriev al, comparison, and reasoning across temporal scales. At the disco very le vel, ne wly identiﬁed scientiﬁc re gularities, causal relationships, or anomalous behaviors can be incorporated as ne w nodes or relationships, allowing scientiﬁc understanding to be continuously expanded and reﬁned. Overall, long-horizon knowledge accumulation is a systemic requirement for the stability and effecti v eness of the entire closed loop. Only when scientiﬁc knowledge can be persistently accumulated, structured, and re vised across experimental cycles can embodied agents move beyond short-term optimization and genuinely assume responsibility for long-horizon autonomous scientiﬁc discov ery . 5.4 Protocolization of Scientiﬁc Infrastructure In most laboratories, experimental devices, sensing instruments, and ex ecution modules typically operate as isolated systems. Scientiﬁc instruments extend the perceptual boundary of discovery by rendering processes observ able; their states and outputs need to be acquired continuously to support reasoning. In practice, howe ver , measurements, de vice states, and operational logs are recorded locally or exposed only as low-le v el signals. As a result, perceptual 13/ 17 information cannot ﬂo w continuously into the scientiﬁc brain, and reasoning outcomes cannot be reliably grounded in the ev olving experimental state. This challenge is further ampliﬁed in complex experimental settings that in v olve multi-step procedures and coordinated embodied action. Executing such workﬂo ws requires agents to orchestrate heterogeneous robots and manipulators. W ithout a uniﬁed representation of actions and system states, high-le vel experimental plans cannot be systematically decomposed into executable embodied beha viors. T ogether , these factors create a structural disconnection between perception, reasoning, and action, constituting a fundamental bottleneck for closed-loop autonomy . Overcoming this bottleneck requires a principled reorg anization of ho w experimental components are repre- sented and interfaced. Protocolized infrastructure addresses this challenge by deﬁning a uniﬁed, agent-interpretable abstraction ov er distributed e xperimental components 67 , 68 . While networked connecti vity enables heterogeneous instruments, sensors, and ex ecution systems to expose their states and outputs, protocolization lifts raw connec- ti vity into agent-operable capability . For example, the Science Context Protocol (SCP) 67 of fers a standardized way to expose and orchestrate scientiﬁc resources, including tools, models, datasets, and physical instruments, thereby transforming heterogeneous laboratory connecti vity into agent-operable capability . By standardizing how experimental actions, perceptual observations, and failure or exception states are represented, protocols enable agents to interpret, compare, and compose interactions across instruments and e xperimental contexts. This allows actions and observations to function as reusable, veriﬁable primiti v es within the PLAD loop, supporting traceability , composability , and cumulativ e discov ery across long-horizon cycles. 5.5 Safety and Risk Go vernance Safety constitutes a fundamental challenge for long-horizon autonomous scientiﬁc e xploration. A deﬁning feature of PLAD is that e xperimental actions are embodied and irr e versible . Physical interventions permanently alter materials, instruments, and en vironmental states, and erroneous actions cannot be rolled back as f ailed computations. Risk is further ampliﬁed by the iteration of PLAD. It also creates the potential for runaway autonomy , progressiv ely relax implicit safety mar gins. These factors substantially amplify safety hazards during e xperimentation, including the ex ecution of unsafe procedures or the generation of hazardous outcomes. For e xample, operating under extreme temperatures or pressures, employing toxic reagents outside validated re gimes, or producing products that pose chemical, biological, or en vironmental risks beyond the e xperimental boundary . Safety gov ernance in PLAD relies on a deliberate complementarity between kno wledge-dri ven constraints and model-based risk assessment. Explicit safety kno wledge can be deri ved from laboratory re gulations, instrument speciﬁcations, hazard databases, and e xperimental manuals. It deﬁnes operational boundaries that delimit where autonomous exploration must not go. These constraints encode known hazards and non-ne gotiable limits, ensuring that hypothesis generation and e xperimental planning remain within v alidated and auditable regimes. Howe ver , long-horizon autonomous discov ery routinely encounters risk that cannot be exhausti v ely speciﬁed in advance. As PLAD iterates, hazards may emer ge from contextual int eractions, cumulativ e de viations, or gradual extrapolation to ward extreme conditions. Model-based safety supervision is therefore required to assess such conte xt-dependent and emergent risks. Safety-aware guard models e valuate candidate plans within their historical and e xperimental context, estimating whether sequences of otherwise permissible actions collecti vely approach unsafe re gimes. 6 Conclusion Embodied Science reframes disco very as a long-horizon closed-loop process in which reasoning is continuously grounded, corrected, and enriched through physical interaction with the world. This perspectiv e highlights a structural limitation in today’ s AI4S landscape: scaling language reasoning improv es cogniti ve breadth, and adv ancing laboratory automation impro ves throughput, b ut neither alone satisﬁes the core requirement of autonomous discov ery , namely the ability to iterati vely generate, test, falsify , and revise hypotheses o ver e xtended horizons. W ithout embodiment, reasoning risks becoming self-referential; without scientiﬁc cognition, ex ecution risks degenerating into blind optimization. PLAD pro vides an org anizing principle for o vercoming this di vide. By integrating Perception, Language, Action, and Discov ery into a coupled system, PLAD shifts autonomy from short-term performance to cumulative scientiﬁc 14/ 17 understanding, learning not only from success but also from failure, anomaly , and uncertainty . Realizing this vision will require coordinated advances across foundation models, instrument-aware perception, protocol compilation and control, scientiﬁc infrastructure, ev aluation standards, and safety gov ernance. The central question is no longer whether AI can assist science, but whether scientiﬁc disco very can be engineered as an embodied, long-horizon process that remains trustworthy as autonomy scales. References 1. W ang, H. et al. Scientiﬁc discov ery in the age of artiﬁcial intelligence. Nature 620 , 47–60 (2023). 2. Jumper , J. et al. Highly accurate protein structure prediction with alphafold. Nature 596 , 583–589 (2021). 3. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373 , 871–876 (2021). 4. W u, Z. et al. Moleculenet: A benchmark for molecular machine learning. Chem. science 9 , 513–530 (2018). 5. Y u, T . et al. Enzyme function prediction using contrastiv e learning. Science 379 , 1358–1363 (2023). 6. Fang, Y . et al. Knowledge graph-enhanced molecular contrasti ve learning with functional prompt. Nat. Mach. Intell. 5 , 542–553 (2023). 7. Krishna, R. et al. Generalized biomolecular modeling and design with rosettafold all-atom. Science 384 , eadl2528 (2024). 8. Zeni, C. et al. A generativ e model for inorganic materials design. Nature 639 , 624–632 (2025). 9. Han, Y . et al. Retrosynthesis prediction with an iterati ve string editing model. Nat. Commun. 15 , 6404 (2024). 10. W ang, X. et al. Multimodal learning with next-token prediction for large multimodal models. Nature 1–7 (2026). 11. Bai, L. et al. Intern-s1: A scientiﬁc multimodal foundation model (2025). 12. Zhuang, X. et al. Adv ancing biomolecular understanding and design following human instructions. Nat. Mach. Intell. 7 , 1154–1167 (2025). 13. Xu, W . et al. Probing scientiﬁc general intelligence of llms with scientist-aligned w orkﬂows. arXiv pr eprint arXiv:2512.16969 (2025). 14. Kuhn, T . S. & Hacking, I. The structure of scientiﬁc r evolutions , vol. 2 (Uni v ersity of Chicago press Chicago, 1970). 15. Popper , K. The logic of scientiﬁc discovery (Routledge, 2005). 16. Gao, S. et al. Empowering biomedical disco very with ai agents. Cell 187 , 6125–6151 (2024). 17. W ei, J. et al. From AI for science to agentic science: A surv ey on autonomous scientiﬁc discov ery . CoRR abs/2508.14111 (2025). 18. Nav eed, H. et al. A comprehensive overvie w of large language models. A CM T ransactions on Intell. Syst. T echnol. 16 , 1–72 (2025). 19. Achiam, J. et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023). 20. Li, B., Saini, A. K., Hernandez, J. G. & Moore, J. H. Agentic ai and the rise of in silico team science in biomedical research. Nat. Biotechnol. 1–15 (2026). 21. T om, G. et al. Self-driving laboratories for chemistry and materials science. Chem. Rev. 124 , 9633–9732 (2024). 22. Canty , R. B. et al. Science acceleration and accessibility with self-driving labs. Nat. Commun. 16 , 3856 (2025). 23. Hao, Q., Xu, F ., Li, Y . & Evans, J. Artiﬁcial intelligence tools expand scientists’ impact but contract science’ s focus. Natur e 1–7 (2026). 15/ 17 24. Y ao, S. et al. React: Synergizing reasoning and acting in language models. In ICLR (OpenRe view .net, 2023). 25. Fung, P . et al. Embodied AI agents: Modeling the world. CoRR abs/2506.22355 (2025). 26. Hey , T ., T ansley , S., T olle, K. M. et al. The fourth paradigm: Data-intensive scientiﬁc discovery , v ol. 1 (Microsoft research Redmond, W A, 2009). 27. M. Bran, A. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6 , 525–535 (2024). 28. Huang, K. et al. Biomni: A general-purpose biomedical ai agent. bioRxiv 2025–05 (2025). 29. Ding, K. et al. Scitoolagent: A knowledge-graph-dri ven scientiﬁc agent for multitool integration. Nat. Comput. Sci. 5 , 962–972 (2025). 30. Gao, S. et al. Democratizing AI scientists using tooluniv erse. CoRR abs/2509.23426 (2025). 31. Swanson, K., W u, W ., Bulaong, N. L., Pak, J. E. & Zou, J. The virtual lab of ai agents designs new sars-cov-2 nanobodies. Natur e 646 , 716–723 (2025). 32. Ghareeb, A. E. et al. Robin: A multi-agent system for automating scientiﬁc discovery . CoRR abs/2505.13400 (2025). 33. Penadés, J. R. et al. Ai mirrors e xperimental science to uncov er a mechanism of gene transfer crucial to bacterial e volution. Cell 188 , 6654–6665 (2025). 34. Novik ov , A. et al. Alphae volv e: A coding agent for scientiﬁc and algorithmic disco very . CoRR abs/2506.13131 (2025). 35. W eng, Y . et al. Deepscientist: Adv ancing frontier-pushing scientiﬁc ﬁndings progressi v ely . CoRR abs/2509.26603 (2025). 36. T eam, I. et al. Internagent: When agent becomes the scientist–building closed-loop system from hypothesis to veriﬁcation. arXiv e-prints arXi v–2505 (2025). 37. Feng, S. et al. Internagent-1.5: A uniﬁed agentic framew ork for long-horizon autonomous scientiﬁc discovery . arXiv pr eprint arXiv:2602.08990 (2026). 38. Lu, C. et al. The AI scientist: T o wards fully automated open-ended scientiﬁc discovery . CoRR abs/2408.06292 (2024). 39. Mitchener , L. et al. K osmos: An AI scientist for autonomous discovery . CoRR abs/2511.02824 (2025). 40. Steiner , S. et al. Organic synthesis in a modular robotic system dri v en by a chemical programming language. Science 363 , eaav2211 (2019). 41. Kuw ahara, M. et al. Dev elopment of an open-source 3d-printed material synthesis robot ﬂuid: Hardware and software blueprints for accessible automation in materials science. A CS Appl. Eng. Mater. 3 , 978–987 (2025). 42. Szymanski, N. J. et al. An autonomous laboratory for the accelerated synthesis of inorganic materials. Natur e 624 , 86–91 (2023). 43. Slattery , A. et al. Automated self-optimization, intensiﬁcation, and scale-up of photocatalysis in ﬂow . Science 383 , eadj1817 (2024). 44. Zhang, Z. et al. A multimodal robotic platform for multi-element electrocatalyst disco very . Natur e 647 , 390–396 (2025). 45. Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Natur e 624 , 570–578 (2023). 46. Song, T . et al. A multiagent-driven robotic ai chemist enabling autonomous chemical research on demand. J. Am. Chem. Soc. 147 , 12534–12545 (2025). 47. King, R. D. et al. The automation of science. Science 324 , 85–89 (2009). 16/ 17 48. Burger , B. et al. A mobile robotic chemist. Natur e 583 , 237–241 (2020). 49. Dai, T . et al. Autonomous mobile robots for exploratory synthetic chemistry . Natur e 635 , 890–897 (2024). 50. Darvish, K. et al. Matterix: T o ward a digital twin for robotics-assisted chemistry laboratory automation. Nat. computational science 1–16 (2025). 51. Zhao, W ., Queralta, J. P . & W esterlund, T . Sim-to-real transfer in deep reinforcement learning for robotics: A surve y . In 2020 IEEE symposium series on computational intelligence (SSCI) , 737–744 (IEEE, 2020). 52. Xu, Y . et al. Lumi-lab: A foundation model-driven autonomous platform enabling disco very of ionizable lipid designs for mrna deli very . Cell (2026). 53. Shi, T . et al. Knowledge-dri v en autonomous materials research via collaborati v e multi-agent and robotic system. Matter (2026). 54. Zitko vich, B. et al. R T-2: V ision-language-action models transfer web knowledge to robotic control. In 7th Annual Confer ence on Robot Learning (2023). 55. W ang, P . et al. Qwen2-vl: Enhancing vision-language model’ s perception of the world at an y resolution. arXiv pr eprint arXiv:2409.12191 (2024). 56. Guo, D. et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Natur e 645 , 633–638 (2025). 57. Ding, K. et al. Bridging data and disco very: A survey on kno wledge graphs in ai for science. Natl. Sci. Rev. nwag140 (2026). 58. Meng, Z., Chen, J., Zhuang, X. & Zhang, Q. Inte grating kno wledge graphs and lar ge language models for adv ancing scientiﬁc research. In 2025 IEEE 45th International Confer ence on Distrib uted Computing Systems W orkshops (ICDCSW) , 393–396 (IEEE, 2025). 59. Agarwal, O., Ge, H., Shakeri, S. & Al-Rfou, R. Knowledge graph based synthetic corpus generation for kno wledge-enhanced language model pre-training. In NAA CL-HL T , 3554–3565 (Association for Computational Linguistics, 2021). 60. Gao, Y . et al. Retrie val-augmented generation for large language models: A survey . CoRR abs/2312.10997 (2023). 61. Liang, L. et al. Kag: Boosting llms in professional domains via kno wledge augmented generation. In Companion Pr oceedings of the A CM on W eb Confer ence 2025 , WWW ’25, 334–343 (Association for Computing Machinery , Ne w Y ork, NY , USA, 2025). 62. Lu, J.-M., Pan, J.-Z., Mo, Y .-M. & Fang, Q. Automated intelligent platforms for high-throughput chemical synthesis. Artif. Intell. Chem. 2 , 100057 (2024). 63. Orouji, N. et al. Autonomous catalysis research with human–ai–robot collaboration. Nat. Catal. 1–11 (2025). 64. Zhao, H. Biofoundry: Nsf ibiofoundry for basic and applied biology . NSF A war d. Number 2400058. Dir. for Biol. Sci. 24 , 58 (2024). 65. Hu, Y . et al. Memory in the age of ai agents. arXiv pr eprint arXiv:2512.13564 (2025). 66. Chhikara, P ., Khant, D., Aryan, S., Singh, T . & Y ada v , D. Mem0: Building production-ready ai agents with scalable long-term memory . arXiv preprint arXiv:2504.19413 (2025). 67. Jiang, Y . et al. Scp: Accelerating discovery with a global web of autonomous scientiﬁc agents. arXiv pr eprint arXiv:2512.24189 (2025). 68. Sim, M. et al. Chemos 2.0: An orchestration architecture for chemical self-dri ving laboratories. Matter 7 , 2959–2977 (2024). 17/ 17

Embodied Science: Closing the Discovery Loop with Agentic Embodied AI

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment