Towards a Medical AI Scientist

2026-3-31

Hongtao Wu 1,*, Boyun Zheng 1,*, Dingjie Song 2,*, Yu Jiang 1, Jianfeng Gao 4, Lei Xing 3, Lichao Sun 2,†, Yixuan Yuan 1,†

1 The Chinese University of Hong Kong, 2 Lehigh University, 3 Stanford University, 4 Microsoft Research
{lis221@lehigh.edu, yxyuan@ee.cuhk.edu.hk}
Homepage: https://cuhk-aim-group.github.io/Med-AI-Scientist-Homepage/

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing "AI Scientists" remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical medicine. It generates clinically grounded ideas by transforming surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. The Medical AI Scientist further introduces evidence-grounded manuscript drafting guided by a structured medical writing paradigm and ethical policies. The framework operates under three research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, corresponding to distinct levels of medical scientific autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, covering 19 clinical tasks and 6 data modalities.
Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

1. Introduction

Recent years have witnessed rapid advances in artificial intelligence for healthcare, with increasingly capable models achieving state-of-the-art performance across disease diagnosis [1–4], medical image analysis [5–7] and clinical outcome prediction [8–10]. In parallel, large language models [11–16] have made substantial progress in language understanding, reasoning and code generation, enabling the emergence of tool-augmented and multi-agent systems [17–25] that extend beyond narrow task execution. Together, these developments have catalyzed the rise of autonomous research frameworks, often referred to as AI Scientists [26–29], which seek to automate the scientific workflow from hypothesis generation and experimental design to result interpretation and manuscript preparation, promising to accelerate scientific innovation [30]. These AI Scientist systems have shown promise in accelerating research in domains such as mathematics, chemistry and general machine learning, where problem formulations, data representations and evaluation protocols are relatively standardized.

Medical AI represents one of the most consequential domains for such systems, given its direct implications for patient outcomes, diagnostic reliability and healthcare efficiency.
As medical datasets, analytical methodologies and scientific literature continue to grow at an unprecedented pace, the throughput of human-driven research has become an increasingly critical bottleneck [31–34]. This widening gap highlights the urgent need for autonomous scientific systems that are explicitly designed to operate within the epistemic, operational, and ethical constraints inherent to clinical medicine.

However, extending these autonomous research paradigms to the medical field remains challenging. First, existing AI Scientists focus on model modifications or generic optimization strategies, ignoring medical priors such as basic diagnostic workflows and disease-specific pathological patterns. Moreover, their retrieval and reasoning processes frequently lack sufficient constraints to reliably identify authoritative medical evidence, which leads to models that report superficially strong performance metrics but fail to capture clinically relevant patterns. Second, the heterogeneous and high-dimensional nature of medical data, including three-dimensional and anisotropic structures, together with specialized evaluation standards, poses challenges to reliable and fair experiment execution. Third, the provenance of medical data and the clarity of ethical statements are central to the credibility, reproducibility, and clinical translation of research findings, yet current autonomous research systems largely overlook these requirements and fail to produce manuscripts that adhere to clinical writing frameworks and ethical standards.

Here we present Medical AI Scientist, an agentic framework for end-to-end medical AI discovery and development, as shown in Fig. 1a. The system comprises three key components: Idea Proposer, Experimental Executor, and Manuscript Composer, which together support the fully autonomous research lifecycle.
The Idea Proposer leverages structured literature retrieval and analysis to identify clinical priors and then adapts the most suitable emerging technical models to medical tasks. A clinician–engineer co-reasoning mechanism is incorporated into the idea generation process to explicitly ground each hypothesis in verifiable evidence and mitigate hallucinations. The automated Experimental Executor orchestrates a reliable validation pipeline by unifying general-purpose execution toolchains with domain-specific medical toolboxes tailored to heterogeneous and complex clinical data formats, enabling iterative and self-correcting deep model development. A hierarchical Manuscript Composer transforms research outputs into coherent and evidence-grounded drafts through a structured medical writing paradigm with enhanced narrative logic and readability. It also embeds ethical review mechanisms that explicitly document data usage in compliance with medical publication policies.

To address the absence of standardized evaluation protocols for automated medical research systems, we introduce Med-AI Bench (Fig. 1b). This benchmark comprises 171 high-quality evaluation cases, organized around 19 distinct research tasks spanning 6 common medical data modalities. For each task, we selected 3 representative papers of varying difficulty (easy, medium, hard) and, for each paper, constructed 3 evaluation cases with different input modes (19 × 3 × 3 = 171). This design provides a systematic and unified framework for both qualitative and quantitative assessment of automated medical research systems across the full research pipeline.

As presented in Fig. 1c, we first evaluate research idea generation using both large language models and human experts (Fig.
2), showing that the Medical AI Scientist consistently surpasses commercial language models across six dimensions: novelty, maturity, ethicality, generalizability, utility, and interpretability. We then assess experimental execution, where the system exhibits strong alignment between proposed methods and their implementations, together with substantially higher success rates in producing executable experiments (Fig. 4). Finally, under double-blind evaluation (Fig. 1d, 5b & c), 10 independent domain experts assess generated manuscripts alongside high-quality human-authored studies from leading venues such as MICCAI, ISBI, and BIBM, while all submissions were further reviewed using the Stanford Agentic Reviewer under ICLR-aligned criteria (Fig. 5a). The generated manuscripts achieve a mean score of 4.60 ± 0.56 and remain competitive across key dimensions including novelty, reproducibility, coherence, and clarity, with only a modest gap in coverage. Qualitative feedback further indicates strong practical relevance and clear presentation with limited critical weaknesses.

Figure 1 | a, System workflow: fully automated multi-agent system for end-to-end scientific discovery in clinical medicine. b, Med-AI Bench: visualization depicting 19 distinct medical research tasks within performance benchmarking. c, Experimental setup: comparative evaluation across idea generation, execution and full paper compilation in the research lifecycle. d, Performance benchmarking: comparable manuscript quality to representative works from leading venues.

Moreover, one manuscript generated by our system has been accepted by the International Conference on AI Scientists (ICAIS 2025 [35]) after peer review.
Together, these results suggest that automated systems can speed up complex methodological designs, highlighting their potential to significantly enhance the efficiency of medical AI research.

2. Results

2.1. Building universal medical research by systematic LLM Agent

The Medical AI Scientist provides three autonomous academic research modes with different levels of autonomy: Paper-based Reproduction, Literature-inspired Innovation, and Task-driven Exploration. These modes are designed to accommodate users ranging from early-stage PhD-level researchers entering a medical AI task to domain experts seeking efficient and highly automated solutions for open-ended problems. The Reproduction mode follows explicitly defined research instructions derived from target papers and focuses on the faithful implementation of established methods. An ethical gatekeeping mechanism is incorporated to prevent harmful implementations. Instead of relying on explicit method specifications, the Innovation mode identifies research gaps and generates hypotheses based on fixed references and datasets. Evaluation emphasizes originality and methodological completeness, supported by a clinician–engineer co-reasoning mechanism and multi-dimensional assessment. The Exploration mode further targets problem-driven discovery in real-world settings. Starting from a single user-defined question, the system conducts literature mining, selects and integrates paradigms, generates solutions, and performs experimental verification.

To enable a rigorous and domain-spanning assessment of the Medical AI Scientist, we constructed Med-AI Bench, a benchmark grounded in peer-reviewed medical AI literature and expert-annotated references.
Med-AI Bench is deliberately organized to reflect the breadth of contemporary medical AI research, covering six data modalities and nineteen representative tasks that span the full spectrum from low-level perception to high-level clinical reasoning (Fig. 1b). Specifically, medical image tasks cover core problems in visual understanding and analysis, including classification [36–38], segmentation [39–41], prognosis [42–44], registration [45–47], and restoration [48–50]. Video-centric tasks encompass instrument detection [51–53], restoration [54–56], workflow recognition [57–59], intraoperative risk assessment [60–62], and postoperative skill assessment [63–65]. Structured electronic health record data support tasks in risk prediction [66–68] and clinical decision support [69–71], while physiological signal data are used for diagnosis [72–74] and prognosis [75–77]. Text-based clinical reasoning is evaluated through report summarization [78–80], diagnosis and risk assessment [81–83], and biomedical question answering [84–86]. Finally, multimodal tasks assess the system's ability to integrate heterogeneous data sources for multimodal diagnosis [87–89] and cross-modal report generation [90–92]. For each task, we retrieve three papers from Google Scholar, which serve as a structured ground truth for benchmarking different levels of scientific reasoning and execution. Each paper was evaluated across five dimensions, including code availability, venue quality, citations, year, complexity, and subjective human rating, and then ranked and assigned to one of three difficulty tiers per task. Using this benchmark, we evaluate the Medical AI Scientist across the complete research lifecycle, including idea generation, experimental execution, and manuscript compilation.
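The benchmark layout described above can be checked with a short enumeration. The following is a minimal sketch, not the benchmark's actual data format: the modality and task names are taken from the text, while the dictionary layout, the `enumerate_cases` helper, and the assumption that each paper yields exactly one case per research mode are illustrative.

```python
from itertools import product

# Modality -> tasks, as listed in the text (6 modalities, 19 tasks in total).
TASKS = {
    "medical image": ["classification", "segmentation", "prognosis",
                      "registration", "restoration"],
    "video": ["instrument detection", "restoration", "workflow recognition",
              "intraoperative risk assessment", "postoperative skill assessment"],
    "electronic health record": ["risk prediction", "clinical decision support"],
    "physiological signal": ["diagnosis", "prognosis"],
    "text": ["report summarization", "diagnosis and risk assessment",
             "biomedical question answering"],
    "multimodal": ["multimodal diagnosis", "cross-modal report generation"],
}

# One representative paper per difficulty tier for every task.
DIFFICULTIES = ["easy", "medium", "hard"]

# Assumption: the three "input modes" correspond to the three research modes.
MODES = ["reproduction", "innovation", "exploration"]


def enumerate_cases():
    """Return one record per (task, paper difficulty, mode) combination."""
    cases = []
    for modality, tasks in TASKS.items():
        for task, difficulty, mode in product(tasks, DIFFICULTIES, MODES):
            cases.append({
                "modality": modality,
                "task": task,
                "difficulty": difficulty,
                "mode": mode,
            })
    return cases


if __name__ == "__main__":
    cases = enumerate_cases()
    n_tasks = sum(len(tasks) for tasks in TASKS.values())
    print(len(TASKS), n_tasks, len(cases))  # prints: 6 19 171
```

Under these assumptions the counts reproduce the reported figures: 6 modalities, 19 tasks, and 19 × 3 × 3 = 171 evaluation cases.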
Collectively, Med-AI Bench functions as a standardized and reproducible framework for assessing autonomous medical AI researchers under realistic, multi-modal, and clinically relevant research conditions.

2.2. Comprehensive evaluation of idea generation

The Idea Generation module is designed to address two central challenges in AI-assisted research ideation. The first concerns the generation of novel hypotheses from unstructured resources without a specific direction, as in the Innovation mode. The second concerns the need to ensure that these hypotheses remain clinically relevant and technically feasible, which is emphasized in the Exploration mode. We quantitatively evaluated the quality of model-generated research ideas against two commercial LLMs (GPT-5 and Gemini-2.5-Pro), using both LLM-as-judge metrics and blinded human assessments, with evaluations conducted across six criteria commonly adopted in medical AI research: novelty, maturity, ethicality, generalizability, utility, and interpretability.

As shown in Fig. 2a, the Medical AI Scientist consistently outperforms the baselines across six dimensions of idea quality. For novelty and maturity, it achieves higher scores in innovation (4.07 vs. 3.00 and 3.12 in the literature-based mode; 4.07 vs. 3.42 and 3.05 in the open-ended mode) and maturity (4.61 and 4.74 vs. ≤ 3.58 for the baselines). For technical reliability, it also leads in robustness (3.44 and 3.56 vs. ≤ 3.19) and interpretability (3.83 and 3.81 vs. ≤ 3.42). Finally, for practical and ethical suitability, the system obtains stronger utility (3.56 and 3.61 vs. ≤ 3.44) and ethicality (3.39 and 3.64 vs. ≤ 3.05), indicating that the generated ideas are not only more innovative but also more clinically grounded and deployable. In the human expert assessment (Fig.
2b), our method consistently achieves the highest scores in technical innovation (4.40 ± 0.49 and 4.32 ± 0.47) and maturity (4.65 ± 0.48 and 4.68 ± 0.47), substantially outperforming GPT-5 and Gemini-2.5-Pro, while also exhibiting lower variance. This advantage extends to ethicality (up to 4.39 ± 0.63) and robustness (3.90 ± 0.61), where competing models remain below 3.50 on average, indicating more stable and reliable hypothesis generation. Notably, improvements in utility and interpretability are more moderate (e.g., 3.93 ± 0.53 and 3.81 ± 0.63 in Innovation mode), suggesting that gains in novelty and rigor are accompanied by only incremental advances in practical clarity. As highlighted by the human evaluators' observations (Fig. 2c), our method produces more consistently innovative and mature research ideas, with stronger alignment to clinical relevance and clearer experimental grounding than competing approaches. In contrast, baseline models tend to generate more incremental and less coherent hypotheses, often with higher variability and weaker integration into realistic research workflows.

Fig. 3 presents a case study that compares the idea generation results of our method with those of commercial LLMs under the Innovation mode. All models operate under identical inputs, including the same task description, reference papers, and dataset specification, ensuring a fair comparison. While commercial models produce reasonable designs, their formulations remain relatively generic and lack strong domain grounding. Their outputs often resemble incremental extensions of prior work, with limited justification from a medical perspective. In contrast, the proposed method incorporates both medical and engineering evidence into the ideation process, informing model design and learning objectives.
This leads to a more concrete and clinically meaningful formulation, reflected in the richer and more explicit set of equations. Consequently, the Medical AI Scientist demonstrates greater implementation detail and improved conceptual novelty, as its designs are guided by disease-related priors rather than abstract extensions of existing approaches.

2.3. Analysis of experimental implementation

2.3.1. Implementation completeness

Translating a conceptual research hypothesis into executable code requires preserving methodological coherence between the idea and its technical realization. To evaluate this capability, we systematically examined the extent to which finalized research plans were faithfully instantiated in downstream implementations.

[Figure 2 panels. a, b: scores (Novelty, Maturity, Ethicality, Generalizability, Utility, Interpretability) for Gemini-2.5-Pro, GPT-5, and Ours in the Literature-inspired Innovation Mode and the Task-driven Exploration Mode. c: qualitative summaries. GPT-5: solid domain-specific innovation; moderately complete workflow; partial data compliance; moderate robustness; partially actionable solutions; moderate reasoning traceability. Ours: paradigm-shifting innovation; fully operational system; strong data appropriateness; broader generalization; promising clinical applicability; improved reasoning traceability. Gemini-2.5-Pro: incremental innovation; partially developed pipeline; limited data awareness; weak generalization capabilities; surface-level clinical incorporation; basic interpretability.]

Figure 2 | Medical AI Scientist surpasses commercial LLMs in idea generation under combined LLM-based and blinded human evaluation. Models generated research ideas that were anonymized and assessed by three independent experts using a five-point scale. a, LLM-based evaluation of idea quality. b, Quantitative human assessment across six evaluation criteria. c, Qualitative human analysis of strengths and limitations relative to commercial LLMs.

Figure 3 | Example of idea generation comparison between Medical AI Scientist and commercial LLMs in the Literature-inspired Innovation Mode.

[Figure 4 panels. a: Score for Algorithm Fidelity and Pipeline Integrity. b: Success Rate (Experimental Success). Both panels show Gemini-2.5-Pro, GPT-5, and Ours in the Paper-based Reproduction, Literature-inspired Innovation, and Task-driven Exploration Modes.]

Figure 4 | Comparative evaluation of Medical AI Scientist frameworks against commercial LLMs in terms of implementation completeness and experimental success rate. a, Implementation completeness was assessed on a five-point scale ranging from 1 to 5. Model-generated outputs were anonymized and independently evaluated by two LLM-based judges. b, Experimental success rate measured through quantitative human evaluation.

As summarized in Fig.
4a, we quantified experimental success by jointly assessing algorithm fidelity and pipeline integrity, reflecting whether the proposed methodological components were both present and functionally integrated within the resulting codebase. Across all three experimental modes, our Medical AI Scientist consistently achieved the highest mean scores for both indicators, along with the lowest or near-lowest standard deviations. In open-ended innovation mode, it reached 3.72 ± 0.52 and 4.09 ± 0.47, respectively, matching GPT-5-Pro while substantially outperforming Gemini-2.5-Pro (2.84 ± 0.67 and 3.18 ± 0.94). The advantage grew clearer in replication mode (3.84 ± 0.49 and 4.30 ± 0.62) and literature-based innovation mode (3.67 ± 0.54 and 4.12 ± 0.46), where our system not only scored highest but also showed the most stable performance. The results show that the system's structured refinement process, which couples systematic retrieval from the literature and code repositories with iterative clinician–engineer deliberation, grounds each proposed idea in accessible methodological and technical resources. This integration ensures that finalized research plans are not only scientifically coherent but also practically implementable, with sufficient technical and evidential grounding to enable reliable translation into executable and methodologically faithful code.

2.3.2. Code execution

Executing AI-generated research scripts may fail due to unresolved dependencies, dataset incompatibilities, or latent logical errors. These issues become more acute in medical AI research, where heterogeneous clinical data demand specialized preprocessing, domain-specific evaluation metrics, and dedicated software libraries to ensure valid analysis.
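These failure modes motivate an explicit pass/fail check for each run. This section defines experimental success by four criteria: successful runtime completion, a decreasing loss trajectory, absence of gradient explosion, and valid model weight files. The sketch below shows one way such a check could be written. It is an illustration under stated assumptions, not the system's actual implementation: the function name `is_successful_run`, the `max_grad_norm` threshold, the first-versus-last-quarter loss comparison, and the "non-empty file" proxy for weight validity are all hypothetical choices.

```python
import math
import os
import tempfile


def is_successful_run(exit_code, losses, grad_norms, weight_path,
                      max_grad_norm=1e3):
    """Return True only if all four success criteria hold.

    exit_code     -- process exit status of the training script (0 = completed)
    losses        -- per-step or per-epoch training losses, in order
    grad_norms    -- logged gradient norms, in order
    weight_path   -- path where the model weights should have been saved
    max_grad_norm -- hypothetical explosion threshold (an assumption)
    """
    # 1. Successful runtime completion.
    completed = exit_code == 0

    # NaN or infinite values count as failure for both loss and gradients.
    finite = all(math.isfinite(x) for x in list(losses) + list(grad_norms))

    # 2. Decreasing loss: the mean of the last quarter of the trajectory
    #    must be below the mean of the first quarter (assumed criterion).
    k = max(1, len(losses) // 4)
    decreasing = (
        len(losses) >= 2
        and finite
        and sum(losses[-k:]) / k < sum(losses[:k]) / k
    )

    # 3. No gradient explosion: every logged norm stays under the threshold.
    no_explosion = finite and all(g <= max_grad_norm for g in grad_norms)

    # 4. Weight file exists and is non-empty (a weak proxy for validity).
    has_weights = os.path.isfile(weight_path) and os.path.getsize(weight_path) > 0

    return completed and decreasing and no_explosion and has_weights


if __name__ == "__main__":
    # Stand-in for a checkpoint file written by a training run.
    with tempfile.NamedTemporaryFile(suffix=".pt", delete=False) as f:
        f.write(b"placeholder weights")
    print(is_successful_run(0, [2.0, 1.5, 1.1, 0.9], [3.0, 2.5, 2.0, 1.8], f.name))
    os.remove(f.name)
```

A success rate for one mode is then the fraction of runs for which this check returns True. A stricter variant could also load the checkpoint to confirm that it deserializes.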
To quantify robustness in this context, we measured first-run experimental success across a set of 57 medical AI research instances, comparing experimental results produced by our structured pipeline with those generated directly by the commercial LLM baselines. As shown in Fig. 4b, our approach consistently achieved higher success rates, reflecting the effective resolution of dependency conflicts, enforcement of data compatibility, and runtime-stable logic through iterative refinement and grounding in reference implementations. By contrast, general-purpose LLM-generated code encountered persistent debugging loops triggered by unresolved runtime errors or terminated prematurely due to environment configuration issues, preventing successful completion of experiments.

We defined experimental success as stable end-to-end execution of the training pipeline, characterized by successful runtime completion, a decreasing loss trajectory, absence of gradient explosion, and the generation of valid model weight files. Under this definition, our method achieved the highest success rate in all settings, reaching 0.91 in reproduction mode, 0.93 in literature-based innovation mode, and 0.86 in open-ended task mode. In comparison, GPT-5 obtained success rates of 0.72, 0.60, and 0.75, while Gemini-2.5-Pro achieved 0.40, 0.49, and 0.53 under the same conditions. These results show that our system consistently maintains a substantially higher end-to-end experimental execution success rate across increasing task difficulty.

2.4. Human and automated evaluation of medical research manuscript drafting

To evaluate the translational relevance of autonomous medical research under realistic expert scrutiny, we designed a double-blind user study centered on diabetic retinopathy classification from fundus images while preserving the generality of the framework.
We invited ten independent experts with over five years of first-author experience in AI for healthcare to assess a curated set of 20 manuscripts, including both autonomously generated studies and high-impact human-authored papers. These human-authored works were sampled from leading venues, including the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), the International Conference on Bioinformatics and Biomedicine (BIBM), and the IEEE International Symposium on Biomedical Imaging (ISBI). In parallel, all manuscripts were independently evaluated using the Stanford Agentic Reviewer, an advanced large language model-based assessment system, following standardized review criteria aligned with the International Conference on Learning Representations (ICLR) guidelines.

From the AI-based evaluation in Fig. 5a, our method achieves a mean score of 4.60 ± 0.56, comparable to the range observed across representative MICCAI (4.86 ± 0.47), ISBI (3.74 ± 1.02), and BIBM (4.06 ± 0.89) submissions. According to the double-blind human evaluations in Fig. 5b, our manuscripts demonstrate consistently strong performance across all five dimensions, with scores broadly aligned with those reported for MICCAI, ISBI, and BIBM. In particular, they show competitive results in Novelty, Reproducibility, Coherence, and Clarity, while exhibiting a modest gap in Coverage (3.44 ± 0.67 vs 3.68 ± 0.68), likely reflecting a more focused emphasis on methodological innovation rather than extensive dataset coverage and baseline comparisons. Qualitative observations (Fig.
5c) from domain experts further highlight the novelty, practical relevance, and clarity of presentation in our manuscripts, alongside solid mid-range assessments in logical coherence and experimental design, with relatively few critical weaknesses noted across comparisons with MICCAI, ISBI, and BIBM submissions. Overall, these results suggest that our manuscripts achieve a level of quality comparable to that observed across leading venues such as MICCAI, ISBI, and BIBM, as assessed under consistent double-blind evaluation criteria. We also demonstrated the advantage of our system over other AI-scientist systems by having a manuscript it generated accepted by ICAIS 2025 [35], which received 114 submissions and had an acceptance rate of 36.8%.

[Figure 5 panels. a, b: scores for MICCAI, ISBI, BIBM, and Ours, with the dimensions Novelty, Clarity, Coverage, Coherence, and Reproducibility. c: expert observations on Novelty, Logic, Experiments, Writing, and Practicality for each group; for MICCAI: moderately innovative; solid reasoning; strongest yet still insufficient experiments; clear and high-quality writing; moderate detail issues.]

Figure 5 | Anonymized comparison of paper quality on an identical medical task. Manuscripts generated by Medical AI Scientist achieve performance comparable to MICCAI, ISBI, and BIBM under consistent double-blind evaluation across both quantitative and qualitative assessments: a, Stanford Agentic Reviewer automatic evaluation. b, Double-blinded scoring (1–5) by 10 medical experts (PhD/postdoc) across five review dimensions. c, Experts' observations on strengths and limitations.

2.5. Case study of autonomous medical research process

2.5.1. Mode 2: Literature-inspired innovation for medical image classification

As shown in Fig. A.1, we further evaluated the proposed automated medical research system to assess its capacity to enrich generated research ideas with medically grounded priors and concrete engineering specifications through its medical–engineering discussion module. Using diabetic retinopathy severity grading as a representative task, the system operated without explicit design instructions and relied solely on reference literature and publicly available codebases. The system demonstrated structured co-reasoning between clinical evidence and implementable methodology: clinical insights from ophthalmic literature motivated the explicit separation of global neurodegenerative context and local vascular pathology, which were subsequently translated into a dual-pathway diffusion-based architecture with imbalance-aware objectives and realizable training protocols. Each design choice was justified by identifiable gaps in prior work and mapped to existing implementations, yielding a hypothesis that was both clinically interpretable and experimentally executable.
Quantitative evaluation confirmed that the resulting model achieved competitive performance on imbalanced disease stages, supporting the validity of the underlying reasoning process. Taken together with the paradigm-transfer case study, these results demonstrate that the system can not only identify and adapt novel AI paradigms for specified medical tasks, but also systematically refine them through medical–engineering co-reasoning into fully specified, experimentally validated research hypotheses.

2.5.2. Mode 3: Task-driven discovery for medical video restoration

As presented in Fig. A.2, we evaluated the proposed automated medical research system on a clinically motivated task of restoring high-resolution and temporally consistent endoscopic video from low-quality recordings, thereby assessing its ability to autonomously translate emerging AI paradigms into executable solutions for medical research. Starting from a minimally specified task description, the system independently grounded the problem in relevant clinical and technical literature, identified temporal inconsistency as a critical unmet requirement, and selected a recently developed continuous-time video restoration paradigm with demonstrable transfer potential. Without manual intervention, this paradigm was adapted to the endoscopic setting through task-specific architectural and training modifications, yielding a complete research hypothesis and an implementable model. The resulting system was experimentally validated through structured ablations and quantitative evaluation, achieving substantial performance gains over a strong baseline.
This case study demonstrates that the proposed framework can automatically operationalize novel AI paradigms for concrete medical tasks, progressing from task specification to validated experimental results, and thereby supports its role as a general-purpose engine for automated medical research rather than a task-specific algorithmic contribution.

3. Discussion

3.1. Key findings

In this study, we introduce Medical AI Scientist, an agentic framework that enables end-to-end automation of medical AI research, spanning hypothesis generation, experimental validation, and manuscript composition. By integrating an Idea Proposer, an automated Experimental Executor, and a hierarchical Manuscript Composer, the system provides a unified solution for the full research lifecycle. A central design feature lies in the clinician–engineer co-reasoning mechanism, which grounds hypothesis generation in verifiable medical evidence and reduces hallucinations. In parallel, the execution module ensures reliable and iterative model development across heterogeneous clinical data, while the manuscript component translates outputs into structured, evidence-based scientific narratives with embedded ethical compliance. To support systematic evaluation, we further introduce Med-AI Bench, a comprehensive benchmark that standardizes assessment across diverse medical research tasks, modalities, and difficulty levels.

Compared with existing approaches to automated scientific discovery, Medical AI Scientist addresses several key limitations. First, general-purpose language models, although capable of generating plausible research ideas, frequently suffer from insufficient grounding in domain-specific evidence, leading to unreliable or non-actionable hypotheses.
Second, existing automation frameworks rarely account for the complexity of clinical data formats and the stringent requirements of medical research reporting and ethics. By contrast, our framework unifies these components into a coherent pipeline, ensuring that each stage is both technically rigorous and clinically grounded.

Our experimental results highlight three principal findings. (1) Superior research idea quality: across six evaluation dimensions, the proposed system consistently outperforms commercial language models and approaches human expert-level assessments, demonstrating strong novelty, feasibility, and interpretability. (2) Robust experimental execution: the system achieves high alignment between proposed methods and implemented experiments, with substantially improved success rates in generating executable and self-consistent pipelines for medical AI development. (3) High-quality manuscript generation: under double-blind expert evaluation, generated manuscripts achieve competitive scores relative to top-tier conference publications, with strong performance in coherence, clarity, and reproducibility, and only minor limitations in content coverage. The acceptance of a system-generated manuscript at ICAIS 2025 further provides early evidence of real-world scientific validity.

The broader implications of Medical AI Scientist extend beyond performance gains to a fundamental shift in how medical AI research may be conducted. By significantly reducing the time and expertise required to move from idea to validated results and polished manuscripts, the framework has the potential to accelerate scientific discovery in healthcare. Its ability to systematically explore complex model designs and translate them into executable implementations suggests a complementary role alongside human researchers, particularly in tasks that demand extensive iteration and technical integration.
In clinical and translational settings, such a system could lower barriers to innovation, enabling wider participation in medical AI development and fostering more rapid dissemination of clinically relevant solutions.

3.2. Limitations and future work

Although our Medical AI Scientist demonstrates promising empirical behavior, several limitations remain before it can be considered to match the best human-produced science. First, the conceptual design of the method can at times become overly intricate. This complexity not only increases the difficulty of faithful implementation, but also introduces instability during execution. When the intended pipeline proves too demanding, the implementation may implicitly simplify or degrade certain components, leading to deviations from the original design and potentially undermining performance. Second, the depth of experimental evaluation is still limited. Current experiments are conducted strictly on predefined datasets, without sufficient exploration of cross-domain or out-of-distribution scenarios. Finally, despite achieving reasonable performance, the generated method does not yet reach state-of-the-art levels. This gap suggests that further refinement is needed, both in terms of algorithmic design and experimental validation, before the AI-generated approach can be considered competitive with leading methods in the field.

Figure 6 | The conceptual illustration of the Medical AI Scientist: a comprehensive system of fully automated agents for end-to-end scientific discovery in clinical medicine. The system offers three user interaction modes: Reproduction (reproducing a specified hypothesis), Innovation (innovating from provided literature), and Exploration (autonomously exploring a given research direction), to streamline the medical research process. The workflow consists of several phases covering automated idea generation, experiment execution, and manuscript writing.

Future work will focus on strengthening the experimental pipeline to enable more comprehensive and rigorous evaluations, thereby improving both the robustness and performance of the method. In parallel, we aim to enhance the quality and expressiveness of visualizations, including both empirical plots and framework illustrations, so as to better communicate the underlying mechanisms and results. Through these efforts, we expect the method to evolve into a more reliable and well-rounded system with stronger empirical competitiveness and clearer presentation.

4. Methods

4.1. Building an autonomous AI scientist for medical research

As illustrated in Fig. 6, the Medical AI Scientist comprises three core components: Idea Proposer, Experimental Executor, and Manuscript Composer. Each component is implemented as a multi-agent module that integrates multiple functionalities through carefully designed prompting strategies. The overall system operates via coordinated interactions among these agents. All agents are built upon general-purpose large language models, such as GPT-5, which serve as the base models for handling a broad range of tasks. For both the Reproduction mode and the Innovation mode, the system takes task instructions, dataset information, and reference papers as inputs, which are then processed sequentially by the Preparer and Surveyor, the Generator, and the Assessor. In contrast, the Exploration mode operates with only task instructions and dataset information as inputs. Building upon the previous two modes, it first introduces an Analyzer and an Explorer to retrieve the medical baseline paper and the novel technological paradigm paper, thereby establishing a sufficient literature foundation for subsequent idea generation.
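The mode-dependent routing described above can be summarized in a short sketch. It is an illustration under stated assumptions only: the agent names and the inputs each mode takes come from the text, while the `Mode` enum, the input key names, and the `agents_for` and `missing_inputs` helpers are hypothetical names introduced here and are not taken from the released system.

```python
from enum import Enum


class Mode(Enum):
    REPRODUCTION = "reproduction"
    INNOVATION = "innovation"
    EXPLORATION = "exploration"


# Agents shared by all three modes, in the order given in the text.
CORE_AGENTS = ["Preparer", "Surveyor", "Generator", "Assessor"]

# Exploration first retrieves a medical baseline paper and a paradigm paper.
EXPLORATION_PREFIX = ["Analyzer", "Explorer"]

# Inputs each mode expects (key names are hypothetical).
REQUIRED_INPUTS = {
    Mode.REPRODUCTION: {"task_instructions", "dataset_information", "reference_papers"},
    Mode.INNOVATION: {"task_instructions", "dataset_information", "reference_papers"},
    Mode.EXPLORATION: {"task_instructions", "dataset_information"},
}


def agents_for(mode: Mode) -> list[str]:
    """Return the ordered Idea Proposer agents for a given mode."""
    if mode is Mode.EXPLORATION:
        return EXPLORATION_PREFIX + CORE_AGENTS
    return list(CORE_AGENTS)


def missing_inputs(mode: Mode, provided: set[str]) -> set[str]:
    """Return the required inputs that the caller did not supply."""
    return REQUIRED_INPUTS[mode] - provided
```

For example, `agents_for(Mode.EXPLORATION)` places the Analyzer and Explorer ahead of the shared agents, and `missing_inputs(Mode.REPRODUCTION, {"task_instructions", "dataset_information"})` reports that the reference papers are absent.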
The resulting structured ideas are then formulated as research plans and passed to the Experimental Executor for empirical validation, after which the experimental outputs are further processed into structured manuscripts, yielding the final paper. This entire process enforces a continuous reflect-and-refine cycle, ensuring the final research output (including idea proposal, executable program, and final manuscript) is reproducible and responsible.

4.2. Idea Proposer

The Idea Proposer operationalizes medical hypothesis generation as a structured, evidence-grounded reasoning process. The system is organized into a set of interacting functional modules, each addressing a critical component of scientific ideation. This design highlights the central contribution: a unified framework that couples structured knowledge retrieval with clinician–engineer co-reasoning to produce hypotheses that are both novel and verifiable. At a high level, the Idea Proposer transforms loosely specified medical tasks into executable research hypotheses by iteratively refining problem understanding, identifying appropriate paradigms, and grounding designed ideas in external evidence. This process reduces hallucination and mitigates the tendency of language models to produce superficial or non-actionable ideas.

Analyzer. The Medical Task Analyzer formalizes the input problem by identifying its core clinical and technical challenges. Given a user-provided dataset or research objective, this module uses an academic search engine [93] to perform targeted retrieval over peer-reviewed medical and technical literature and construct a structured task representation. This representation encodes disease context, data characteristics, evaluation constraints, and implicit clinical needs. This step anchors subsequent hypothesis generation in real clinical gaps rather than abstract problem descriptions.
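To make the idea of a structured task representation concrete, the sketch below models the four fields named above as a Python dataclass. The paper does not specify a schema, so the class name, field names, the extra `evidence` field, and the example values (loosely based on the endoscopic video case study) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class TaskRepresentation:
    """Hypothetical container for the Analyzer's output (schema is assumed)."""

    disease_context: str
    data_characteristics: list[str] = field(default_factory=list)
    evaluation_constraints: list[str] = field(default_factory=list)
    implicit_clinical_needs: list[str] = field(default_factory=list)
    # Assumed field: identifiers of retrieved papers supporting the entries above.
    evidence: list[str] = field(default_factory=list)

    def is_grounded(self) -> bool:
        """True when at least one retrieved source backs the representation."""
        return len(self.evidence) > 0


# Illustrative values only; no literature has been attached yet.
task = TaskRepresentation(
    disease_context="endoscopic examination",
    data_characteristics=["low-quality video recordings"],
    evaluation_constraints=["temporal consistency across frames"],
    implicit_clinical_needs=["high-resolution restored video"],
)
```

Keeping the supporting sources next to the extracted fields is one simple way to let later stages check that a hypothesis traces back to retrieved literature; here `task.is_grounded()` is false until such sources are added.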
Explorer. Building on the structured task representation, the Paradigm Explorer identifies the most suitable emerging computational paradigms to address the extracted challenges. Instead of relying on static knowledge, it performs dynamic retrieval over recent literature and open-source repositories, jointly considering methodological novelty, empirical performance, and implementation maturity. A key feature of this module is the explicit alignment between problem structure and algorithmic capability. Candidate paradigms are not selected in isolation but are evaluated based on how their inductive biases and design principles match the identified clinical constraints. For each selected paradigm, the system retrieves corresponding high-quality codebases, ensuring that the resulting hypothesis is directly grounded in executable components.

Preparer and Surveyor. To support informed reasoning, the Preparer and Surveyor jointly construct a structured and executable evidence base that links scientific claims to their operational implementations. The Preparer retrieves relevant literature together with associated code artifacts, normalizing them into a unified representation that captures problem formulations, model designs, and experimental protocols. Inspired by [28], the Surveyor then performs structured synthesis by decomposing each reference into its core conceptual and methodological primitives. Large language models first extract the fundamental research contribution and methodological skeleton while abstracting away domain-specific terminology to reduce surface bias. These abstract directives are subsequently grounded through a multi-agent process that maps them to canonical mathematical formalisms and aligns them with executable code components from open-source repositories.
This design enables the system to reconstruct prior methods as verifiable workflows rather than static descriptions, thereby transforming existing work into modular and recomposable units for reasoning. As a result, this module establishes an evidence-grounded substrate for hypothesis construction by explicitly linking theoretical assumptions with their executable implementations.

Generator. Hypothesis generation is performed by the Generator through a clinician–engineer co-reasoning mechanism that integrates clinical insight with computational design. Rather than relying on unconstrained synthesis, the Generator constructs candidate hypotheses by aligning task-specific challenges with the capabilities of selected paradigms, guided by the structured evidence base. Clinical considerations are introduced in the process to ensure relevance and plausibility, while technical refinements are derived through targeted retrieval and adaptation of existing methods. This bidirectional interaction mitigates the risk of superficial novelty and grounds each hypothesis in verifiable evidence, effectively reducing hallucination. Iterative refinement continues until the hypothesis achieves internal coherence across clinical validity, methodological soundness, and implementation feasibility. This structured process parallels human-led medical hypothesis formation and enables systematic derivation of high-level ideas from clear gaps in existing literature.

Assessor. The final stage evaluates the generated hypothesis through a combination of scientific and ethical criteria. The Assessor examines conceptual consistency, empirical support, and practical executability. In parallel, an explicit ethics check ensures compliance with biomedical research standards.
Hypotheses that fail to meet quality thresholds are returned for refinement, while those violating ethical constraints are rejected. This mechanism enforces rigor and accountability, ensuring that only well-supported and responsible ideas proceed to experimental validation. The resulting hypothesis is formalized as a detailed research plan, which specifies the algorithmic rationale and anticipated evaluation protocols.

4.3. Experimental Executor

The Experimental Executor is formulated as a structured multi-stage pipeline for traceable and self-correcting model development within a secure Dockerized environment. Given a research objective, the Investigator assembles the required codebase together with domain-specific medical toolboxes to ensure compatibility with heterogeneous clinical data, and provides this unified specification to the Planner, which decomposes it into a structured, machine-interpretable execution protocol with defined inputs and outputs. The Executor instantiates this protocol within a controlled environment by constructing the full training and evaluation pipeline, leveraging general-purpose execution toolchains for scalable and stable implementation. Resulting logs, intermediate outputs, and quantitative metrics are assessed by the Judger, which evaluates consistency between the intended design and the observed behavior and produces targeted corrective feedback. The Analyst consolidates validated results into structured records for downstream use. Through iterative feedback and execution-level correction, the system unifies domain-specific medical processing with general execution infrastructures, enabling reliable, iterative, and self-correcting validation under complex clinical settings.

4.4. Manuscript Composer

The Manuscript Composer operates within an end-to-end multi-agent framework that transforms substantiated research materials into a typeset-ready paper.
The Content Generator first establishes the global structure of the manuscript by leveraging the organizational patterns of the most relevant reference papers, and subsequently develops section-level content grounded in evidence from a structured repository of implementations, experimental logs, and quantitative results. To preserve narrative coherence, concise summaries of previously generated sections are retained and reused as semantic anchors during subsequent drafting. The Content Generator further aligns narrative and presentation by automatically generating experimental figures from logged results and synthesizing architectural diagrams from method specifications. By summarizing current conference and journal policies into structured instructions, the Ethics Reviewer leverages dataset-specific evidence to rigorously report and cite the origin, license, and ethical approval of each dataset to meet publishing requirements. In parallel, a Scientific Narrative Enhancer is introduced to counter the tendency of AI-generated text to overemphasize procedural detail, refining the manuscript to improve clarity and the scientific storyline while aligning the writing style with task-specific paradigms. A Cross-Reference Resolver subsequently verifies internal references, including equations, figures, sections, and citations. Finally, a self-healing mechanism in the LaTeX Compilation Engine continuously validates the LaTeX source, interpreting compiler feedback to autonomously correct syntactic or structural errors and ensure reliable compilation without manual intervention. Together, these components enable the automated generation of coherent, compliant, and publication-ready medical manuscripts from heterogeneous research artifacts.

4.5. Construction of Med-AI Bench

To enable systematic and reproducible evaluation of the Medical AI Scientist, we constructed Med-AI Bench, a benchmark comprising 171 cases derived from 57 high-quality ground-truth medical research papers. Construction began with the six primary data modalities identified in a scoping review of multimodal AI in medicine [94]: (1) medical images, (2) videos, (3) electronic health records (EHR, including structured ICU data), (4) text, (5) physiological signals (e.g., ECG and EEG), and (6) multimodal data.

The tasks for each modality were derived from authoritative domain surveys as follows: medical imaging tasks (classification, prognosis, restoration, segmentation, and registration) from a comprehensive review of AI-driven imaging innovations [95]; video-analysis tasks (instrument detection, restoration, workflow recognition, intraoperative risk assessment, and postoperative skill assessment) from a scoping review of AI in medical videos [96]; EHR tasks (risk prediction and clinical decision support) from a comparative analysis of deep learning architectures for EHR [97]; physiological-signal tasks (disease diagnosis and prognosis) from a review of signal-based healthcare applications [98]; clinical text tasks (report summarization, text-based diagnosis/risk assessment, and biomedical question answering) from a UK-focused clinical NLP survey [99]; and multimodal tasks (multimodal diagnosis and cross-modal report generation) from a dedicated multimodal biomedical AI review [100]. This structured process yielded 19 distinct tasks. For each task, three representative papers were retrieved from Google Scholar using task-specific keyword combinations, with explicit prioritization of highly cited works.
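The benchmark size follows directly from this construction. As a quick sanity check, the sketch below transcribes the task list and multiplies out the counts. The dictionary keys and constant names are illustrative choices, and the factor of three cases per paper is an assumption inferred from the stated totals (171 cases from 57 papers).

```python
# Tasks per modality, transcribed from the task list above.
TASKS = {
    "medical images": ["classification", "prognosis", "restoration",
                       "segmentation", "registration"],
    "videos": ["instrument detection", "restoration", "workflow recognition",
               "intraoperative risk assessment", "postoperative skill assessment"],
    "EHR": ["risk prediction", "clinical decision support"],
    "physiological signals": ["disease diagnosis", "prognosis"],
    "text": ["report summarization", "text-based diagnosis/risk assessment",
             "biomedical question answering"],
    "multimodal": ["multimodal diagnosis", "cross-modal report generation"],
}

PAPERS_PER_TASK = 3  # three representative papers retrieved per task
CASES_PER_PAPER = 3  # assumption: inferred from 171 cases / 57 papers

num_modalities = len(TASKS)
num_tasks = sum(len(tasks) for tasks in TASKS.values())
num_papers = num_tasks * PAPERS_PER_TASK
num_cases = num_papers * CASES_PER_PAPER

print(num_modalities, num_tasks, num_papers, num_cases)  # prints: 6 19 57 171
```

The totals reproduce the figures quoted in the text: 6 modalities, 19 tasks, 57 papers, and 171 cases.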
Each paper was independently scored on five dimensions: Code Availability (presence and usability of public implementations), Venue Quality (prestige ranking of the publication venue), Citations (normalized citation count), Year and Complexity (publication recency weighted by methodological intricacy), and Subjective Human Rating (by domain experts). Papers were subsequently ranked and partitioned into three difficulty tiers (hard, medium, easy; one paper per tier per task) from an AI-implementation perspective. For each paper, three cases were constructed using different input modes. The resulting 171 cases form a stratified benchmark that systematically spans technical and clinical complexity, enabling rigorous assessment of hypothesis generation, implementation fidelity, and manuscript quality. Note that, to speed up the execution and validation of automated experiments, we randomly subsampled the datasets.

4.6. Performance assessment of the Medical AI Scientist

We evaluated the Medical AI Scientist on Med-AI Bench across three core dimensions: (1) Idea Generation, (2) Implementation Completeness, and (3) Code Execution, against the strongest closed-source models (GPT-5 and Gemini-2.5-Pro) under identical input conditions. All evaluation criteria are scored on a five-point scale ranging from 1 to 5, ensuring consistent and interpretable assessment across all cases.

For Idea Generation, the Idea Proposer and baseline models received equivalent prompts (either literature-derived innovation or autonomous exploration of a user-specified direction) and produced full research proposals. Each proposal was scored by a hybrid evaluator that combined LLM-based metrics with blinded assessments from professional clinical AI scientists.
Scoring followed standardized rubrics across six dimensions: Novelty (substantive innovation in medical problem modeling), Maturity (completeness and ease of implementation), Ethicality (responsible handling of medical data and constraints), Generalizability (robustness across devices, populations, and institutions), Utility (potential for real clinical adoption), and Interpretability (alignment with medical reasoning and traceability). Explicit evidence grounding ensured high inter-rater reliability.

For Implementation Completeness, the full proposals were fed into the Experimental Executor (our system) or the equivalent code-generation modules of the baselines, producing complete executable programs. LLM-based scoring then assessed two aspects: fidelity of core innovative components and completeness of the pipeline (data preprocessing, training, validation, testing, and logging).

For Code Execution, the generated code was deployed directly in a predefined Dockerized environment. The success rate was defined as the fraction of runs that completed without errors, exhibited monotonically decreasing training loss, and produced valid model weights accompanied by quantitative test results.

In addition, all Medical AI Scientist-generated manuscripts were submitted to the Stanford Agentic Reviewer under the complete ICLR review protocol. The system returned an overall score on a scale from 0 to 10, together with structured strengths and weaknesses, providing an independent multi-criteria validation of scientific rigor.

4.7. Human expert evaluation

To assess real-world usability in a controlled yet ecologically valid setting, we restricted the evaluation to a single classic medical AI task: diabetic retinopathy classification on fundus images, while preserving full methodological generality.
We invited 10 independent human experts, each with more than five years of first-author experience in AI-for-healthcare publications. Using a double-blind protocol, experts rated a total of 20 papers: five papers autonomously generated by the Medical AI Scientist on the constrained task and 15 high-impact human-authored papers (five randomly selected via keyword search from each of the MICCAI, BIBM, and ISBI conferences, prioritized by citation rank). To eliminate any potential source bias from formatting or stylistic templates, all human-authored papers had their original templates, fonts, and layouts removed, with only the core content retained. Experts scored every paper on five dimensions using the same standardized Likert-scale rubrics: Novelty (degree of methodological innovation relative to prior art), Coherence (logical flow and internal consistency of the scientific narrative), Coverage (comprehensiveness of experimental design), Clarity (precision and conciseness of exposition), and Reproducibility (sufficiency of methodological detail). All evaluation criteria are scored on a five-point scale ranging from 1 to 5. All ratings were collected anonymously to eliminate source bias, enabling direct quantitative comparison of perceived quality and practical utility between AI-generated and human-authored medical research outputs.

5. Related Work

5.1. AI agent systems and multi-agent collaboration

The evolution of AI agent systems has shifted from single-agent tool integrations to advanced multi-agent architectures that enable sophisticated collaboration and task decomposition. Early approaches focused on enhancing individual agents' capabilities, such as ReAct [101], which combines reasoning and action by prompting LLMs to generate interleaved thoughts and actions for dynamic environmental interactions.
Building on this, Toolformer [102] enables LLMs to learn tool usage autonomously through fine-tuning with API calls, supporting zero-shot applications in tasks requiring external resources. These foundations have paved the way for more integrated frameworks like LangChain [103], which facilitates chaining components for complex applications, and its extension LangGraph [104], which introduces graph-based orchestration for managing stateful multi-agent systems. Similarly, Semantic Kernel [105] integrates plugins for enterprise-level AI orchestration with an emphasis on semantic planning and memory persistence.

Building upon these integrated frameworks, advancements in multi-agent collaboration have produced systems that simulate team-based dynamics through role assignments and structured interactions. MetaGPT [22] employs standardized operating procedures to coordinate agents in workflows akin to software development teams, while CAMEL [21] uses role-playing to align autonomous agents with user goals. More specialized frameworks like CrewAI [106] assemble agent teams for sequential tasks such as research synthesis, and OpenAgents [23] deploys multiple agents to provide accessible capabilities for data analysis, plugin usage, and web navigation. Extending these paradigms, systems like Auto-GPT [107] and Devin [108] operate as autonomous AI engineers for full-cycle software development, while Manus [24] and its open-source counterpart OpenManus [25] support complex, cloud-based task execution. Although these frameworks reflect the evolution toward robust coordination, they often lack the deep reasoning required for scientific innovation, such as hypothesis formulation and domain-specific adaptation.

5.2. Autonomous AI-driven scientific discovery systems

Autonomous scientific discovery systems automate key research stages, from ideation to dissemination. The AI Scientist [26] pioneers an end-to-end automated pipeline that generates ideas, runs experiments, and drafts manuscripts, operating in an open-ended loop to build upon its own findings. Its successor, AI Scientist-v2 [27], enhances this autonomy by incorporating an agentic tree search for deeper hypothesis exploration and successfully generating a manuscript that passed peer review at a major conference workshop. AI-Researcher [28] introduces a multi-agent architecture that maintains coherence through bidirectional mappings between mathematical concepts and code, mitigating hallucinations. DeepScientist [29] frames scientific discovery as a Bayesian optimization problem, using an agent to iteratively balance exploration and exploitation to discover novel methods. Agent Laboratory [109] extends this by automating the execution and reporting of user-provided ideas, acting as an accelerator for human researchers rather than an independent ideator. In contrast, Google's AI co-scientist [110] operates as a collaborator in a "scientist-in-the-loop" paradigm, leveraging models like Gemini to assist domain experts with hypothesis generation.

Alongside these frameworks, complementary toolkits have been developed to support AI agent systems by enhancing resource integration and accessibility. ToolUniverse [111] provides an expansive repository of scientific tools governed by a standardized AI-tool interaction protocol, enabling agents to discover and orchestrate diverse tools seamlessly. Paper2Agent [112] transforms research papers into executable agents by encapsulating their contributions into a standardized Model Context Protocol (MCP), allowing for interactive, natural-language-based reproduction and analysis.
Code2MCP [113] further streamlines this by converting code repositories into standardized services, facilitating seamless tool incorporation into agent workflows. While effective for general research automation, these systems frequently overlook clinical necessities such as ethical compliance and specialized data processing, shortcomings our framework mitigates with dedicated medical stages and verification processes.

5.3. AI applications and challenges in clinical medicine

Artificial intelligence has made substantial impacts in clinical medicine, with specialized models achieving expert-level performance on well-defined tasks. These tasks include disease classification [1–3], lesion segmentation [5, 6], prognostic prediction [8–10], and enhanced surgical navigation [114–116]. As technology has evolved, multimodal large language models (MLLMs) have emerged, integrating diverse data types such as text and images to perform more complex, comprehensive tasks. For instance, models like Med-Gemini [117] leverage vision-language processing to support medical report generation and treatment recommendations, while frameworks such as LLaVA-Med [118] facilitate multimodal analysis in radiology.

However, these advances primarily consist of specialized models whose operation and integration still rely heavily on human experts to drive the entire research project. Researchers must be responsible for identifying clinical problems, formulating hypotheses, designing experiments, and ensuring ethical compliance. To our knowledge, no existing framework bridges the autonomous orchestration capabilities of a general AI Scientist with the domain-specific knowledge, tools, and ethical constraints of clinical medicine.
Medical AI Scientist aims to fill this gap, enabling autonomous, clinically meaningful, and ethically responsible innovation.

References

[1] Andre Esteva, Brett Kuprel, Roberto A. Novoa, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
[2] Daniel S. Kermany, Michael Goldbaum, Wenjia Cai, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell, 172(5):1122–1131.e9, 2018.
[3] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Banming Yang, Hershel Mehta, Tony Duan, Daisy Ding, Karan Bagaria, Jenny Ball, Curtis Langlotz, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint, 2017.
[4] Sheng Zhang, Yuhong Xu, Naoto Usuyama, et al. Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv preprint, 2023.
[5] Fabian Isensee, Jens Petersen, Andreas Klein, et al. nnU-Net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint, 2018.
[6] Ali Hatamizadeh, Yan Tang, Vishwesh Nath, Dong Yang, Andriy Myronenko, Bennett Landman, Holger R Roth, and Daguang Xu. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 574–584, 2022.
[7] Jun Ma, Yining He, Feng Li, et al. Segment anything in medical images. Nature Communications, 15(1):654, 2024.
[8] Pooya Mobadersany, Saeed Yousefi, Mohamed Amgad, et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proceedings of the National Academy of Sciences, 115(13):E2970–E2979, 2018.
[9] Xintian Wang, Jian Zhao, Elena Marostica, et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature, 634(8035):970–978, 2024.
[10] Yizhou Chen, Bo Wang, Yifan Zhao, et al. Metabolomic machine learning predictor for diagnosis and prognosis of gastric cancer. Nature Communications, 15(1):1657, 2024.
[11] OpenAI. Introducing gpt-5. Available at https://openai.com/index/introducing-gpt-5, August 2025.
[12] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint, 2025.
[13] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint, 2023.
[14] xAI. Grok 4. Available at https://x.ai/news/grok-4, July 2025.
[15] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[16] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[17] OpenAI. Introducing deep research, 2025. URL https://openai.com/index/introducing-deep-research/. Accessed: 2025-04-06.
[18] Google Team. Introducing gemini deep research, 2025. URL https://gemini.google/overview/deep-research/. Accessed: 2025-04-06.
[19] xAI Team. Introducing grok deepsearch, 2025. URL https://x.ai/news/grok-3. Accessed: 2025-04-06.
[20] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
[21] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023.
[22] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023.
[23] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. Openagents: An open platform for language agents in the wild, 2023. URL https://arxiv.org/abs/2310.10634.
[24] Manus Technologies. Manus: Structured and ai-assisted academic writing tool. https://manus.im/, 2025.
[25] OpenManus Contributors. Openmanus: Open-source framework for building general ai agents. https://openmanus.github.io/, 2025.
[26] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint, 2024.
[27] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint, 2025.
[28] Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. Ai-researcher: Autonomous scientific innovation. arXiv preprint, 2025.
[29] Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, and Yue Zhang. Deepscientist: Advancing frontier-pushing scientific findings progressively. arXiv preprint arXiv:2509.26603, 2025.
[30] Chandan K Reddy and Parshin Shojaee.
Towards scientific discovery with generative ai: Progress, opportunities, and challenges. Proceedings of the AAAI Conference on Artificial Intelligence, pages 28601–28609, 2025.
[31] Yolanda Gil, Mark Greaves, James Hendler, and Haym Hirsh. Amplify scientific discovery with artificial intelligence. Science, 346(6206):171–172, 2014.
[32] Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023.
[33] Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models. arXiv preprint arXiv:2404.07738, 2024.
[34] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025.
[35] Zhongguancun Academy. The 1st international conference on ai scientists (icais 2025). Website, November 2025. URL https://icais.ai/. Tagline: Exploring the frontiers of automated scientific discovery with AI Scientists and autonomous research agents.
[36] Omid Nejati Manzari, Hamid Ahmadabadi, Hossein Kashiani, Shahriar B Shokouhi, and Ahmad Ayatollahi. Medvit: a robust vision transformer for generalized medical image classification. Computers in biology and medicine, 157:106791, 2023.
[37] Xiangzuo Huo, Gang Sun, Shengwei Tian, Yan Wang, Long Yu, Jun Long, Wendong Zhang, and Aolun Li. Hifuse: Hierarchical multi-scale feature fusion network for medical image classification. Biomedical Signal Processing and Control, 87:105534, 2024.
[38] Yijun Yang, Huazhu Fu, Angelica I Aviles-Rivero, Zhaohu Xing, and Lei Zhu.
Diffmic-v2: Medical image classification via improved diffusion network. IEEE Transactions on Medical Imaging, 2025.
[39] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[40] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
[41] Jieneng Chen, Jieru Mei, Xianhang Li, Yongyi Lu, Qihang Yu, Qingyue Wei, Xiangde Luo, Yutong Xie, Ehsan Adeli, Yan Wang, et al. Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers. Medical Image Analysis, 97:103280, 2024.
[42] Renato Hermoza, Gabriel Maicas, Jacinto C Nascimento, and Gustavo Carneiro. Post-hoc overall survival time prediction from brain mri. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1476–1480. IEEE, 2021.
[43] Lina Chato and Shahram Latifi. Machine learning and deep learning techniques to predict overall survival of brain tumor patients using mri images. In 2017 IEEE 17th international conference on bioinformatics and bioengineering (BIBE), pages 9–14. IEEE, 2017.
[44] Yin Lin, Riccardo Barbieri, Domenico Aquino, Giuseppe Lauria, Marina Grisoli, Elena De Momi, Alberto Redaelli, and Simona Ferrante. Glioblastoma overall survival prediction with vision transformers. In 2025 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 1–4. IEEE, 2025.
[45] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca.
Voxelmorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging, 38(8):1788–1800, 2019.
[46] Junyu Chen, Eric C Frey, Yufan He, William P Segars, Ye Li, and Yong Du. Transmorph: Transformer for unsupervised medical image registration. Medical image analysis, 82:102615, 2022.
[47] Boah Kim, Inhwa Han, and Jong Chul Ye. Diffusemorph: Unsupervised deformable image registration using diffusion model. In European conference on computer vision, pages 347–364. Springer, 2022.
[48] Hong Wang, Yuexiang Li, Nanjun He, Kai Ma, Deyu Meng, and Yefeng Zheng. Dicdnet: Deep interpretable convolutional dictionary network for metal artifact reduction in ct images. IEEE Transactions on Medical Imaging, 41(4):869–880, 2021.
[49] Hong Wang, Qi Xie, Dong Zeng, Jianhua Ma, Deyu Meng, and Yefeng Zheng. Oscnet: Orientation-shared convolutional network for ct metal artifact learning. IEEE Transactions on Medical Imaging, 43(1):489–502, 2023.
[50] Hong Wang, Minghao Zhou, Dong Wei, Yuexiang Li, and Yefeng Zheng. Mepnet: A model-driven equivariant proximal network for joint sparse-view reconstruction and metal artifact reduction in ct images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 109–120. Springer, 2023.
[51] Hongqiu Wang, Guang Yang, Shichen Zhang, Jing Qin, Yike Guo, Bo Xu, Yueming Jin, and Lei Zhu. Video-instrument synergistic network for referring video instrument segmentation in robotic surgery. IEEE Transactions on Medical Imaging, 2024.
[52] Dongming Wu, Tiancai Wang, Yuang Zhang, Xiangyu Zhang, and Jianbing Shen. Onlinerefer: A simple online baseline for referring video object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2761–2770, 2023.
[53] Adam Botach, Evgenii Zheltonozhskii, and Chaim Baskin.
End-to-end referring video object segmentation with multimodal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4985–4995, 2022.
[54] Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, and Ender Konukoglu. Medvsr: Medical video super-resolution with cross state-space propagation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11697–11707, 2025.
[55] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
[56] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4947–4956, 2021.
[57] Shu Yang, Luyang Luo, Qiong Wang, and Hao Chen. Surgformer: Surgical transformer with hierarchical temporal attention for surgical phase recognition. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 606–616. Springer, 2024.
[58] Yueming Jin, Yonghao Long, Cheng Chen, Zixu Zhao, Qi Dou, and Pheng-Ann Heng. Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging, 40(7):1911–1923, 2021.
[59] Yueming Jin, Qi Dou, Hao Chen, Lequan Yu, Jing Qin, Chi-Wing Fu, and Pheng-Ann Heng. Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network. IEEE transactions on medical imaging, 37(5):1114–1126, 2017.
[60] Masahiro Kawamura, Yuichi Endo, Atsuro Fujinaga, Hiroki Orimoto, Shota Amano, Takahide Kawasaki, Yoko Kawano, Takashi Masuda, Teijiro Hirashita, Misako Kimura, et al. Development of an artificial intelligence system for real-time intraoperative assessment of the critical view of safety in laparoscopic cholecystectomy. Surgical Endoscopy, 37(11):8755–8763, 2023.
[61] Pietro Mascagni, Armine Vardazaryan, Deepak Alapatt, Takeshi Urade, Taha Emre, Claudio Fiorillo, Patrick Pessaux, Didier Mutter, Jacques Marescaux, Guido Costamagna, et al. Artificial intelligence for surgical safety: automatic assessment of the critical view of safety in laparoscopic cholecystectomy using deep learning. Annals of surgery, 275(5):955–961, 2022.
[62] Franciszek M Nowak, Evangelos B Mazomenos, Brian Davidson, and Matthew J Clarkson. Swincvs: a unified approach to classifying critical view of safety structures in laparoscopic cholecystectomy. International Journal of Computer Assisted Radiology and Surgery, 20(6):1145–1152, 2025.
[63] Aneeq Zia and Irfan Essa. Automated surgical skill assessment in rmis training. International journal of computer assisted radiology and surgery, 13(5):731–739, 2018.
[64] Isabel Funke, Sören Torge Mees, Jürgen Weitz, and Stefanie Speidel. Video-based surgical skill assessment using 3d convolutional neural networks. International journal of computer assisted radiology and surgery, 14(7):1217–1225, 2019.
[65] Daochang Liu, Qiyue Li, Tingting Jiang, Yizhou Wang, Rulin Miao, Fei Shan, and Ziyu Li. Towards unified surgical skill assessment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9522–9531, 2021.
[66] Sujeong Im, Jungwoo Oh, and Edward Choi. Labtop: A unified model for lab test outcome prediction on electronic health records. arXiv preprint, 2025.
[67] Raphael Poulain and Rahmatollah Beheshti.
Graph transformers on ehrs: Better representation improves downstream performance. In The twelfth international conference on learning representations, 2024.
[68] Adibvafa Fallahpour, Mahshid Alinoori, Wenqian Ye, Xu Cao, Arash Afkanpour, and Amrit Krishnan. Ehrmamba: Towards generalizable and scalable foundation models for electronic health records. arXiv preprint, 2024.
[69] Mengxuan Sun, Jinghao Niu, Xuebing Yang, Yifan Gu, and Wensheng Zhang. Cehmr: Curriculum learning enhanced hierarchical multi-label classification for medication recommendation. Artificial Intelligence in Medicine, 143:102613, 2023.
[70] Junyuan Shang, Cao Xiao, Tengfei Ma, Hongyan Li, and Jimeng Sun. Gamenet: Graph augmented memory networks for recommending medication combination. In proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1126–1133, 2019.
[71] Chaoqi Yang, Cao Xiao, Fenglong Ma, Lucas Glass, and Jimeng Sun. Safedrug: Dual molecular graph encoders for recommending effective and safe drug combinations. arXiv preprint arXiv:2105.02711, 2021.
[72] Hany El-Ghaish and Emadeldeen Eldele. Ecgtransform: Empowering adaptive ecg arrhythmia classification framework with bidirectional transformer. Biomedical Signal Processing and Control, 89:105714, 2024.
[73] Yihe Wang, Nan Huang, Taida Li, Yujun Yan, and Xiang Zhang. Medformer: A multi-granularity patching transformer for medical time-series classification. Advances in Neural Information Processing Systems, 37:36314–36341, 2024.
[74] Shunxiang Yang, Cheng Lian, Zhigang Zeng, Bingrong Xu, Junbin Zang, and Zhidong Zhang. A multi-view multi-scale neural network for multi-label ecg classification. IEEE Transactions on Emerging Topics in Computational Intelligence, 7(3):648–660, 2023.
[75] Emilly M Lima, Antônio H Ribeiro, Gabriela MM Paixão, Manoel Horta Ribeiro, Marcelo M Pinto-Filho, Paulo R Gomes, Derick M Oliveira, Ester C Sabino, Bruce B Duncan, Luana Giatti, et al. Deep neural network-estimated electrocardiographic age as a mortality predictor. Nature communications, 12(1):5117, 2021.
[76] Sushravya Raghunath, Alvaro E Ulloa Cerna, Linyuan Jing, David P VanMaanen, Joshua Stough, Dustin N Hartzel, Joseph B Leader, H Lester Kirchner, Martin C Stumpe, Ashraf Hafez, et al. Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network. Nature medicine, 26(6):886–891, 2020.
[77] Veer Sangha, Bobak J Mortazavi, Adrian D Haimovich, Antônio H Ribeiro, Cynthia A Brandt, Daniel L Jacoby, Wade L Schulz, Harlan M Krumholz, Antonio Luiz P Ribeiro, and Rohan Khera. Automated multilabel diagnosis on electrocardiographic images and signals. Nature communications, 13(1):1583, 2022.
[78] Dave Van Veen, Cara Van Uden, Louis Blankemeier, Jean-Benoit Delbrouck, Asad Aali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature medicine, 30(4):1134–1142, 2024.
[79] Shweta Yadav, Deepak Gupta, Asma Ben Abacha, and Dina Demner-Fushman. Reinforcement learning for abstractive question summarization with question-aware semantic rewards. arXiv preprint arXiv:2107.00176, 2021.
[80] Wenpeng Lu, Sibo Wei, Xueping Peng, Yi-Fei Wang, Usman Naseem, and Shoujin Wang. Medical question summarization with entity-driven contrastive learning. ACM Transactions on Asian and Low-Resource Language Information Processing, 23(4):1–19, 2024.
[81] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint, 2019.
[82] Sara Nouri Golmaei and Xiao Luo. Deepnote-gnn: predicting hospital readmission using clinical notes and patient network. In Proceedings of the 12th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 1–9, 2021.
[83] Huiting Ma, Dengao Li, Jumin Zhao, Wenjing Li, Jian Fu, and Chunxia Li. Hr-bgcn: Predicting readmission for heart failure from electronic health records. Artificial Intelligence in Medicine, 150:102829, 2024.
[84] Georg Wiese, Dirk Weissenborn, and Mariana Neves. Neural question answering at bioasq 5b. arXiv preprint arXiv:1706.08568, 2017.
[85] Zi Yang, Yue Zhou, and Eric Nyberg. Learning to answer biomedical questions: Oaqa at bioasq 4b. In Proceedings of the Fourth BioASQ workshop, pages 23–37, 2016.
[86] Hajung Kim, Hoonick Lee, Yewon Cho, Jungwoo Park, Jueon Park, Soyon Park, Yan Ting Chok, Seungheun Baek, Donghyeon Lee, and Jaewoo Kang. Prompting matters: snippet-aware strategies for biomedical qa with llms in bioasq 13b. In CLEF, 2025.
[87] Yilan Zhang, Fengying Xie, and Jianqi Chen. Tformer: A throughout fusion transformer for multi-modal skin lesion diagnosis. Computers in biology and medicine, 157:106712, 2023.
[88] Yuan Zhang, Yutong Xie, Hu Wang, Jodie C Avery, M Louise Hull, and Gustavo Carneiro. A novel perspective for multi-modal multi-label skin lesion classification. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3549–3558. IEEE, 2025.
[89] Matthew J Cockayne, Marco Ortolani, and Baidaa Al-Bander. Dermformer: nested multi-modal vision transformers for robust skin cancer detection. Pattern Analysis and Applications, 28(4):194, 2025.
[90] Shuxin Yang, Xian Wu, Shen Ge, S Kevin Zhou, and Li Xiao. Knowledge matters: Chest radiology report generation with general and specific knowledge.
Medical image analysis, 80:102510, 2022.
[91] Wenjun Hou, Yi Cheng, Kaishuai Xu, Wenjie Li, and Jiang Liu. Recap: Towards precise radiology report generation via dynamic disease progression reasoning. arXiv preprint, 2023.
[92] Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven transformer. arXiv preprint, 2020.
[93] Jason Priem, Heather Piwowar, and Richard Orr. Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint, 2022.
[94] Daan Schouten, Giulia Nicoletti, Bas Dille, Catherine Chia, Pierpaolo Vendittelli, Megan Schuurmans, Geert Litjens, and Nadieh Khalili. Navigating the landscape of multimodal ai in medicine: a scoping review on technical challenges and clinical applications. Medical image analysis, 105:103621, 2025.
[95] Luís Pinto-Coelho. How artificial intelligence is shaping medical imaging technology: a survey of innovations and applications. Bioengineering, 10(12):1435, 2023.
[96] Anni King, George E Fowler, Rhiannon C Macefield, Hamish Walker, Charlie Thomas, Sheraz Markar, Ethan Higgins, Jane M Blazeby, and Natalie S Blencowe. Use of artificial intelligence in the analysis of digital videos of invasive surgical procedures: scoping review. BJS open, 9(4):zraf073, 2025.
[97] Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir H Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, et al. Deep learning for electronic health records: A comparative review of multiple deep neural architectures. Journal of biomedical informatics, 101:103337, 2020.
[98] Oliver Faust, Yuki Hagiwara, Tan Jen Hong, Oh Shu Lih, and U Rajendra Acharya. Deep learning for healthcare applications based on physiological signals: A review.
Computer methods and programs in biomedicine, 161:1–13, 2018.
[99] Honghan Wu, Minhong Wang, Jinge Wu, Farah Francis, Yun-Hsuan Chang, Alex Shavick, Hang Dong, Michael TC Poon, Natalie Fitzpatrick, Adam P Levine, et al. A survey on clinical natural language processing in the united kingdom from 2007 to 2022. NPJ digital medicine, 5(1):186, 2022.
[100] Julián N Acosta, Guido J Falcone, Pranav Rajpurkar, and Eric J Topol. Multimodal biomedical ai. Nature medicine, 28(9):1773–1784, 2022.
[101] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.
[102] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539–68551, 2023.
[103] LangChain Contributors. Langchain: Build context-aware reasoning applications. https://github.com/langchain-ai/langchain, 2023.
[104] LangChain Contributors. Langchain: Build resilient language agents as graphs. https://github.com/langchain-ai/langgraph, 2024.
[105] Microsoft. Semantic Kernel Documentation, 2025. URL https://learn.microsoft.com/en-us/semantic-kernel/. Accessed: 2025-10-28.
[106] João Moura and contributors. crewai: Framework for building ai agent teams, 2024. URL https://github.com/crewAIInc/crewAI. Accessed: 2025-10-28.
[107] Significant Gravitas Contributors. Autogpt: Accessible ai agents for everyone. https://github.com/Significant-Gravitas/AutoGPT, 2023.
[108] Cognition AI. Introducing devin, the first ai software engineer, 2024. URL https://www.cognition-labs.com/introducing-devin. Accessed: 2025-10-28.
[109] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants, 2025. URL .
[110] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025.
[111] Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, et al. Democratizing ai scientists using tooluniverse. arXiv preprint, 2025.
[112] Jiacheng Miao, Joe R Davis, Jonathan K Pritchard, and James Zou. Paper2agent: Reimagining research papers as interactive and reliable ai agents. arXiv preprint, 2025.
[113] Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Linan Yue, Shaowu Pan, Jian Yin, and Min-Ling Zhang. Code2mcp: Transforming code repositories into mcp services. arXiv preprint arXiv:2509.05941, 2025.
[114] Alexey Shvets, Alexander Rakhlin, Alexandr Kalinin, and Vladimir Iglovikov. Automatic instrument segmentation in robot-assisted surgery using deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1103–1108. IEEE, 2018.
[115] Andru P Twinanda, S Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. Endonet: A deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2017.
[116] Lena Maier-Hein, Matthias Eisenmann, Alperen Sarikaya, Tobias Collins, et al. Surgical data science: Towards semantic understanding of operative workflows. Nature Biomedical Engineering, 6(10):957–972, 2022.
[117] Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, Fan Zhang, Tim Strother, Chunjong Park, Elahe Vedadi, et al. Capabilities of gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
[118] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023.

Appendix

Panel labels: Task; Idea Generation & Validation; Draft & Review; Experimentation

User: Based on the provided literature and the core objective, conduct a complete research study based on the literature to develop a reliable, general-purpose method for 2D medical image classification across diverse modalities, addressing both multi-class and multi-label tasks. ……

Human Evaluation

Neuro-Vascular Dual-Pathway Diffusion Network (NVD-DiffNet) for Diabetic Retinopathy Grading

1. Core Concept: The proposed method, NVD-DiffNet, is a novel dual-pathway architecture designed to address the specific clinical characteristics of Diabetic Retinopathy (DR). It departs from standard single-stream diffusion models by explicitly modeling the disease's dual nature: discrete vascular lesions (local high-frequency features) and diffuse neurodegenerative changes (global low-frequency features).
2. Key Challenges Addressed:
1) Multi-scale Pathology: DR manifests as both specific lesions (microaneurysms, hemorrhages) and global retinal stress (neurodegeneration). Single networks often struggle to capture both simultaneously.
2) Class Imbalance: The critical distinction between Non-Proliferative (NPDR) and Proliferative (PDR) stages is often obscured by the scarcity of PDR samples in training data.
3) Noise Sensitivity: Fundus images frequently contain artifacts that can mimic lesions; standard CNNs lack the robust generative noise-handling capabilities of diffusion models.
3. Methodological Architecture:
1) Global Context Pathway (The "Neuro" Path): Uses a lightweight EfficientNet-B0 encoder on down-sampled inputs (128x128) to extract a global context vector $f_{\mathrm{global}}$. Purpose: Captures diffuse retinal stress and optic nerve structural integrity associated with neurodegeneration.
2) Local Diffusion Pathway (The "Vascular" Path): A Conditional U-Net operating on high-resolution (512x512) images. Innovation: Utilizes Adaptive Layer Normalization (AdaLN) to inject the $f_{\mathrm{global}}$ vector into every residual block of the diffusion process. This conditions the fine-grained lesion refinement on the global anatomical context.
3) Imbalance-Aware Objective: Replaces standard Cross-Entropy with Focal Loss ($\gamma = 2$) to dynamically down-weight easy examples and focus training on hard, minority class examples (specifically PDR).

Idea content

Supporting Literature Evidence
1. [Medical Insight] A. Suzumura, et al., "Retinal Ganglion Cell Senescence Links Diabetes to Retinal Neurodegeneration," Cureus, 2025. DOI: 10.7759/cureus.96926.
2. [Conditional Diffusion] Z. Dorjsembe, et al. "Conditional Diffusion Models for Semantic 3D Brain MRI Synthesis," IEEE Journal of Biomedical and Health Informatics, 2024. DOI: 10.1109/JBHI.2024.3385504.
3. [Cascaded/Dual Models] R. Yilmaz, K., et al. "Cascaded Diffusion Models for 2D and 3D Microscopy Image Synthesis to Enhance Cell Segmentation," 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), 2025. DOI: 10.1109/ISBI60581.2025.10981037.
4. [Spatiotemporal Diffusion] Z. Wang et al., "Robust Cardiac Cine MRI Reconstruction With Spatiotemporal Diffusion Model," IEEE Transactions on Computational Imaging, 2025. DOI: 10.1109/TCI.2025.3598421.
5. [Tumor-Aware Augmentation] V.-A. Ngo et al., "OTADiff: Ovarian Tumor-Aware Diffusion Model for Ultrasound Image Augmentation and Detection," 2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), 2025. DOI: 10.1109/MAPR67746.2025.11133809.

Supporting Codebase
1. https://github.com/scott-yjyang/DiffMIC
2. https://github.com/hojonathanho/diffusion
3. https://github.com/openai/improved-diffusion
4. https://github.com/ermongroup/ddim
5. https://github.com/XzwHan/CARD
6. https://github.com/ge-xing/Diff-UNet
7.
https://github.com/mlyg/unified-focal-loss
……

Panel labels: Overview of the Paper; Framework Figure; Mathematical formula; Reference; Ethics Statement; Code Structure; Code Samples of the proposed algorithm; Training and Validation Process; Experimental Analysis

Experimental Analysis
Quadratic Weighted Kappa (QWK): 0.7189 - Promising for ordinal multi-class tasks, good alignment of predicted and true values.
Accuracy: 0.5850 - Subpar for imbalanced DR severities, skewed by majority class.
Macro F1-Score: 0.3666 - Poor minority class prediction robustness; weighted alpha focal loss insufficient for balance.
AUC (OvR): 0.8523 - Strong inter-class separation and discrimination ability.

Detailed Reference Codebase Analysis
DiffMIC & CARD: Focus on diffusion/conditional classification; DiffMIC fits medical images via dual conditional guidance.
Denoising Models: Improved-DDPMs suit complex data distributions and regression, compatible with DDIM contexts.
UNet Implementation: Diff-UNet uses convolution, skip connections + attention for feature extraction.
DDIM Enhancements: Reduces computation cost, fits high-performance inference with iterative conditional guidance.

This work presents a compelling and well-executed advancement in automated diabetic retinopathy grading.
The core innovation is significant: the NVD-DiffNet framework uniquely integrates a Global Context Pathway for capturing diffuse neurodegenerative signs, a Local Diffusion Pathway for fine-grained vascular lesion refinement, and an Adaptive Layer Normalization (AdaLN) mechanism within a cohesive dual-stream architecture. This holistic approach directly tackles the domain's critical challenges of multi-scale pathology detection, extreme class imbalance, and the complex differentiation between non-proliferative and proliferative stages. The implementation is thorough and reproducible. The methodology is described with precise architectural and mathematical detail, particularly the novel conditioning of the diffusion U-Net via global embeddings, and the commitment to providing modular PyTorch code, including custom dataset loaders and DDIM sampling strategies, highlights high completeness. Experimental results are rigorous and convincing. The authors establish a fair training protocol using stratified cross-validation, report robust metrics including Quadratic Weighted Kappa and AUC, and conduct systematic ablations to verify the necessity of the dual-pathway design. The reported performance gains on the APTOS 2019 dataset, coupled with the integration of Focal Loss for minority class sensitivity, strongly validate the method's effectiveness for clinical screening scenarios. Finally, the paper is exceptionally well-written and clearly structured, balanced in tone, and commendable for its grounding in retinal pathophysiology rather than purely engineering metrics.
Ov erall, t his is a s t rong, m et hodolo gic ally s ound c ont ribu t ion t hat e f f e c t iv ely bridges algorit hm ic innov at ion w it h prac t ic a l c linic al needs in ophthalmic diagnosis . Figure A.1 | Example of Medical AI Scientist Innov ation Mode on the medical image classiﬁcati on task. All necessary processes and outcomes are presented, with human expert’s notes in bottom boxes. 29 T owards a Medical AI Scientist T ask Idea Genera tion & V ali dation Dra ft & Re v ie w Expe rim en ta tion Input R es t ore high - res olut ion m edic al v ideo f ram es f rom low - res olut ion c linic al rec ordin gs t o s upport ac c urat e diagnosis and analy s is . Human Code Struc ture Code Sa m pl e s of the prop os e d a lg orith m O v e rv ie w of the A I - G e ne rate d Pa pe r AI - G e ne rate d fram e w ork fig ure Vis ua l c om pa ris on s of a bl a tio n s tud y M a the m a tic a l form ul a Prope r Cita tio n of Li te ratu re Ethi c s Sta te m e nt Tra in in g a nd Va li da tio n Proc e s s Th e pro j e c t be g a n w i t h t h e i mpl e me n t a t i o n o f a Te mpo ra l C o n s i s t e n c y - Dri v e n S u pe r - Re s o l u t i o n a n d De n o i s i n g Mo de l , e s t a bl i s h i n g a s t ro n g ba s e l i n e . Th e i n i t i a l mo de l , t ra i n e d f o r 1 0 e po c h s , a c h i e v e d a P S NR o f 2 7 . 5 2 a n d a n S S I M o f 0 . 7 5 5 . F o l l o w i n g t h e i n i t i a l a n a l y s i s , a c o mpre h e n s i v e pl a n w a s e x e c u t e d t o re f i n e t h e mo de l by i n t e g ra t i n g a dv a n c e d a t t e n t i o n me c h a n i s ms . Th i s l e d t o a s e ri e s o f e x pe ri me n t s : 1 . Qua l i t a t i v e Vi s u a l i za t i o n : A s c ri pt w a s c re a t e d t o v i s u a l l y c o mpa re t h e o u t pu t s o f di f f e re n t mo de l s , pro v i di n g qu a l i t a t i v e i n s i g h t s be y o n d s i mpl e me t ri c s . 
Th e g e n e ra t e d i ma g e s a re s t o re d i n ` / w o rk pl a c e / pro j e c t / v i s u a l _re s u l t s / ` . 2 . No n - Lo c a l A t t e n t i o n E x pe ri me n t : A No n - Lo c a l a t t e n t i o n bl o c k w a s i n t e g ra t e d i n t o t h e U - Ne t a rc h i t e c t u re . A n i n i t i a l run f o r 1 0 e po c h s s h o w e d a s l i g h t pe rf o rma n c e de g ra da t i o n (P S NR: 2 7 . 3 7 , S S I M: 0 . 7 1 9 ), s u g g e s t i n g t h e mo re c o mpl e x mo de l w a s u n de rt ra i n e d. Th e e x pe ri me n t w a s re pe a t e d w i t h a n e x t e n de d t ra i n i n g s c h e du l e o f 2 5 e po c h s a n d a l e a rn i n g ra t e s c h e du l e r. Th i s re s u l t e d i n a dra ma t i c pe rf o rma n c e i n c re a s e , a c h i e v i n g a P S NR o f 2 9 . 6 4 a n d a n S S I M o f 0 . 8 2 3 , s i g n i f i c a n t l y o u t pe rf o rmi n g t h e ba s e l i n e . 3 . C BA M A t t e n t i o n A bl a t i o n S t u dy : To e x pl o re a l t e rn a t i v e a t t e n t i o n me c h a n i s ms , a C o n v o l u t i o n a l Bl o c k A t t e n t i o n Mo du l e (C BA M) w a s i mpl e me n t e d. Th e mo de l w i t h C BA M w a s t ra i n e d f o r 2 5 e po c h s w i t h t h e s a me l e a rn i n g ra t e s c h e du l e r. I t a l s o s u rpa s s e d t h e ba s e l i n e , a c h i e v i n g a P S NR o f 2 9 . 4 8 a n d t h e h i g h e s t S S I M o f 0 . 8 2 9 . Ex pe rim e nta l A na ly s is Co re In s i gh t : The ke y in n ov a t ion lie s in e x p li c it ly m o d e lin g t e m p o ra l d y n am ic s as a f i rs t - c las s c on s t ra in t f o r e n d os c op ic v i d e o re s t or a t io n . R a t h e r t h an t rea t in g v i d e o as a s e q u e n c e of in d e p e n d e n t im ag e res t o ra t ion t as ks , the m e t h o d in t e g ra t e s m ot ion - awar e t e m p or al c oh e rence d i rec t ly in t o t h e v e c t o r - f ie l d f o rmu la t io n an d p ro m p t - g e n e ra t ion m e c h an is m . 
This rec o g n iz e s t h a t e n d os c o p ic v i d e o d e g ra d a t ion is n o t on ly s p a t ial b u t als o e x h i b i t s t e m p o ra l in c on s is t e n c ie s (f licke r, m o t ion ar t if ac t s ) d u e to n on - u n if or m p ro b e m ov e m e n t an d t is s u e d e f or m a t ion . The in s ig h t is t h at t e m p or al c on s is t e n c y s h ould be e n f o rce d t h ro u g h ad ap t i v e i nte r - f ra m e grad i ent mo del i ng an d predi c t i v e t empo ral dy nami c s , ra t h e r t h an t h ro u g h p os t - p r oce s s in g or n aiv e f ra m e av e ra g in g . Co ntr i b u t i o ns : 1. T emp o ral Vec t o r F i el d E x t en s i o n – T he o pt i c al - fl o w - like v ec t o r fi el d in U niF lowR es t o re is a u g me nted wi t h a t emp o ral grad i en t t e rm (Δ X_ m o t i o n) t hat c ap t u res fra me - to - f ra me m o t i o n pa t t ern s , enab ling t he mo del to di s t i ng u i s h bet wee n i nten t i o nal mo t i o n and art i fa c t - i ndu c ed i nc o ns i s t en c i es . Me d ic al b ase l in e p ape r An al y s is 1.ht t ps: / / gi t hu b. com / Te nce ntAR C / G FP G A N 2.ht t ps: / / gi t hu b. com / ma r comusy / v ed o 3.ht t ps: / / gi t hu b. com / x i n ntao/ R ea l - E S R G A N …… Cod e b ase P r e p ar a t ion K ey Ch alle ng es An aly s i s : Learning ac c u rat e t e mp o ra l depen den c i es ac ro s s fra m es . Han d l in g in h e ren t n ois e an d lo w reso lu t ion . Summar y of t he p ar adi gm p ap e r P ar adi gm Se ar ch 1. T a s k 2. T i m e 3. Rep os i to ry 4. Fo l l o w e r 5. 
Ra n k i n g …… Cap abi l it y Ab s t r act ion Cap a b il it y Car d 9 Cor e P a r a d igm K e y A ssu m p tio n s Cor e M e ch a n ism s T r a n sfe r a b il ity Ri sk s & L im ita tio n s Cap a b il it y Car d 5 Cor e P a r a d igm K e y A ssu m p tio n s Cor e M e ch a n ism s T r a n sfe r a b il ity Ri sk s & L im ita tio n s Cap a b il it y Car d 1 Cor e P a r a d igm K e y A ssu m p tio n s Cor e M e ch a n ism s T r a n sfe r a b il ity Ri sk s & L im ita tio n s Ch all e ng e – P ar ad igm Map Gr aph R ank in g & De ci s ion Lit e r a t ur e E v id e nce 1. B an , T es s hi n , et al . " A c ti v e G al l bl ad de r La v ag e Us i ng a Doub l e ‐ pi gta i l P l as ti c S ten t Del i v ery S y s tem Dur i ng E nd os c op i c Ul tr as ou nd ‐ gu i de d Gal l bl ad de r Dr ai na ge ( W i th V i de o)." DE N op en 6.1 ( 20 26 ) : e7 02 47 . 2. Mo r i , Hi tos hi , et al . " E nd os c op i c Ful l ‐ thi c k ne s s R es ec ti on f or G as tr i c S ub m uc os al T um or: A T ec hn i c al A na l y s i s S tud y ( W i th V i de o)." DE N op en 6.1 ( 20 26 ) : e 70 19 8. 3. Li an g, J i ng y un , et al . " V r t: A v i de o res torati on trans f orm er." IE E E T r an s ac ti on s on Im ag e P r oc es s i ng 33 ( 20 24 ) : 21 71 - 2182. …… 1. Par a d i gm i nno v a tion . T h e ke y i n n o v a t i o n is a H a milto ni a n f lo w – b a s e d f o r mu l a tion , w h e r e d e g r a d e d v id e o f r a me s e v o lv e u n d e r a lear n e d vec t o r f ie ld c o mp o sed of ( i ) a p hy s i c s - a w a r e p o tent i a l ener gy enco d ed by Phy s i c s UNet that c a p tu r es generic d egr a d a tion p r i o r s . 2. Po tent i a l me d i c a l t a s k a d a p t a b i li ty . T h is p a r a d ig m is h ig h l y compat i b le w it h me d i c a l v i d eo a n d d y n a mic i m a gi ng , where d e g r a d a t i o n s a r e o f t e n t e mporall y c o r r e la t e d . 
Th e c o n t i n u o us Hamilt o n ia n e v o lut i o n n a t ur a lly e n f o r ce s tem p o r a l c o ns i s tenc y , w h ic h is cr it ic a l f o r e n d o s c o p y/ la p a r o sc o p y v id e o e n h a n ce me n t , ult r a so u n d spe c kl e d e n o i si n g , a n d car d iac c i n e M R I or f lu o r o sc o p y seq ue n ce s . T his is a str o ng and w ell - ex ecu t e d pi ec e of w ork on e nd osc o pic v id e o su p er - r e s o lut i o n . Th e T AMT f r a m ew ork is g e n u i n e ly in n ov at iv e — combi ni ng m ot i o n - a w are p ro m p ts, reli a b il ity - g a te d c o a rs e - to - f i n e a li g n m e n t, tr a ject o ry - rest ri cte d a tt e n ti o n , and a p h y sics - insp ir e d t e m p o ral f i e l d , a ll in a stri ctly ca u s a l m a n n e r . It sm a rtl y a d d r e ss e s th e c o re d i f ficu lti e s (n o n - ri g id d e fo r m a ti o n , o ccl u sio n s, t e m p o r a l c o n sist e n cy ) th a t h a v e tro u b l e d t h is d o m a i n fo r a l o n g tim e . I m p le m e n ta ti o n l o o ks s o li d a n d fu ll y rep ro d u ci b l e (cod e + co n f ig s + sp li t s p ro m ise d ), e x p e ri m e n ts a r e ca re fu ll y d e s ig n e d w ith p r o p e r co n tro ls, co n f i d e n ce i n t e rv a ls, th o ro u g h a b l a ti o n s, and cli n ic a ll y rel e v a n t l o w - lat e n cy co n si d e rati o n s . T h e g a i n s on Hy p e rK v a sir a p p e a r co n v inci n g . W ri ti n g is cl e a r , p ro fe ssi o n a l, and u n u s u a ll y b a la n c e d — e v e n t h e e th ics & im p a ct d isc u ssio n fe e ls th o u g h t fu l rath e r th a n p e r fu n ct o r y . O v e rall , th is is o n e of th e stro n g e r su b m i ssio n s I'v e se e n in th i s a rea : te c h n ic a ll y d e e p , p ract ically m e a n in g fu l, and v e ry w e ll p rese n te d . 
Hamilt onian V ide o F lo w Co nt inuous D yn am ic s E ner g y + Mom entum Medic al T empo r al Mot ion Δ X_ m ot ion + T empo r al P r om pt Co ndit ioning Sum ma r y of P r o pose d Met ho d K ey in no vat ion K ey in no vat ion Figure A.2 | Example of Medica l AI Scientist Explorati on Mode on the medica l video restoration t a sk. All necessary processes and outcomes are presented, with human expert’s notes in bottom boxes. 30
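The experiment log in Figure A.2 reports PSNR values (for example, 27.52 for the baseline and 29.64 with Non-Local attention) without stating how they were computed. As a minimal, dependency-free sketch, assuming the standard definition PSNR = 10 log10(MAX^2 / MSE), pixel values scaled to [0, 1], and a per-frame average over each clip, the metric could be computed as follows. The function names `psnr` and `mean_video_psnr` are hypothetical and are not taken from the generated codebase.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    Assumes both inputs hold the same number of pixels, scaled to
    [0, max_value]. This is the textbook definition, not necessarily
    the exact variant used in the paper's experiments.
    """
    if len(reference) != len(restored):
        raise ValueError("inputs must contain the same number of pixels")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_video_psnr(reference_frames, restored_frames, max_value=1.0):
    """Average per-frame PSNR over a clip (an assumed averaging convention)."""
    if len(reference_frames) != len(restored_frames):
        raise ValueError("clips must contain the same number of frames")
    scores = [
        psnr(ref, out, max_value)
        for ref, out in zip(reference_frames, restored_frames)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy clip: two 4-pixel frames with uniform errors of 0.1 and 0.01.
    reference_clip = [[0.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
    restored_clip = [[0.1, 0.1, 0.1, 0.1], [0.51, 0.51, 0.51, 0.51]]
    print(round(mean_video_psnr(reference_clip, restored_clip), 2))
```

In practice an array library would replace the flat Python lists. The averaging convention matters: averaging per-frame PSNR and computing one PSNR from the clip-level MSE generally give different numbers, so reported values are only comparable when the convention is the same.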
