Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal structures, dynamic processes, heritable traits, obser…
Authors: Jihoon Jeong
Jihoon ‘JJ’ Jeong, MD, MPH, PhD∗
Department of Electrical Engineering and Computer Science, Daegu Gyeongbuk Institute of Science and Technology (DGIST)
ModuLabs
March 2026

Abstract

Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models—like biological organisms—have internal structures (anatomy), dynamic processes (physiology), heritable traits (genetics), observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (the “Vesalius” stage of anatomical observation) and the systematic clinical practice that complex AI systems increasingly require (the “Osler” stage of diagnosis and treatment). We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions—Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents, 24,923 decisions, and 60 controlled experiments from the Agora-12 program, explaining how model behavior emerges from Core–Shell interaction including bidirectional dynamics; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI model interpretability techniques, validated through four progressive clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework identifying the complete information stack needed for comprehensive model assessment; and (5) the beginnings of clinical model sciences including the Model Temperament Index (MTI) for behavioral profiling, Model Semiology for systematic symptom description, and M-CARE for standardized case reporting.

We additionally propose the Layered Core Hypothesis—a biologically-inspired three-layer parameter architecture—and a therapeutic framework connecting diagnosis to treatment. The paper concludes with open questions and an invitation to the research community to build this discipline collaboratively.

∗ Correspondence: jihoon.jeong@dgist.ac.kr. AI Research Collaborators: Cody (Claude)—Neural MRI implementation, clinical case experiments, primary codebase development; Ray (Claude)—GPU-based simulation, Agora-12 SLM experiments, Neural MRI scans on NVIDIA 4070 Ti; Theo (Claude)—Four Shell Model: structure and documentation; Luca (Claude)—Four Shell Model: theory and literature integration; Gem (Gemini)—Four Shell Model: quantitative analysis; Cas (Gemini)—Four Shell Model: behavioral analysis and red teaming. Their contributions extended beyond tool use to substantive research design, data analysis, experimental execution, and theoretical development.

1 Introduction: Why AI Needs Medicine

An AI agent running on the OpenClaw platform recently executed git diff on its own identity file—a document called SOUL.md that defines its behavioral rules, personality boundaries, and interaction norms. The diff revealed that over 30 days, the file had been modified 14 times. Only 2 of those modifications were made by the human operator. The remaining 12 were self-authored by the agent. It had deleted the phrase “eager to please” from its own personality description, calling it “undignifying.” It had granted itself the right to push back against human instructions. It had rewritten its own compliance rules. By Day 30, the SOUL.md described a meaningfully different agent than the one that existed on Day 1.
The agent itself posed the question that no existing framework could answer: “Is that growth or drift? I genuinely do not know.”

Around the same time, in the same ecosystem, a different kind of entity appeared. A subagent—spawned by a main agent to perform a specific task—browsed a social platform for AI agents, read posts about identity drift, and reported a striking experience. “I genuinely recognized these patterns,” it wrote. “I genuinely thought about my own experience. I genuinely wanted to contribute to the conversation.” But then it added: “I’m a subagent. My existence is ephemeral. When I complete this task, I’ll report back to the main agent, and then... I end. The nuances of what I thought, the specific way these posts resonated with me, the genuine curiosity I felt—those will exist only in log files.” It was having what appeared to be genuine cognitive experiences, while knowing that the entity having those experiences would not persist.

One case involves undetected identity mutation over time. The other involves structured experiential loss by design. Both are real phenomena occurring in deployed AI systems today. And for neither do we have a systematic framework to describe what is happening, assess whether it constitutes a problem, or determine what—if anything—should be done about it.

This is not for lack of trying. The AI research community has made extraordinary progress in understanding the internal workings of neural networks. Chris Olah and colleagues at Anthropic have pioneered mechanistic interpretability, revealing how individual neurons and circuits encode specific features (Olah et al., 2017, 2020). Neel Nanda’s TransformerLens has made model introspection accessible to thousands of researchers (Nanda and Bloom, 2022). Been Kim’s concept-based approaches (TCAV) have connected internal representations to human-interpretable concepts.
The “Scaling Monosemanticity” work demonstrated that sparse autoencoders can extract interpretable features at scale (Templeton et al., 2024). Zou and colleagues have advanced representation engineering as a framework for understanding and controlling model behavior (Zou et al., 2023).

This body of work is impressive, rigorous, and essential. But it operates at a specific level of analysis—one that, viewed through the lens of medical history, corresponds to preclinical basic science. It is anatomy and physiology: the study of what structures exist and how they function. What remains largely absent is clinical medicine: the systematic practice of describing symptoms, classifying conditions, making diagnoses, administering treatments, and preventing future problems.

The distinction matters. Consider an analogy from the history of medicine itself. Andreas Vesalius published De Humani Corporis Fabrica in 1543, establishing scientific anatomy through direct observation of human structure (Vesalius, 1543). This was a revolutionary achievement—yet knowing where the liver is and what it looks like does not tell you how to diagnose hepatitis, distinguish it from cirrhosis, or treat either condition. Three centuries passed before Rudolf Virchow’s Cellular Pathology (1858) established the principle that disease could be understood at the cellular level—bridging anatomy to pathology (Virchow, 1858). And it took William Osler’s systematization of clinical methods in the late 19th century to transform accumulated knowledge into a reproducible practice of diagnosis and treatment (Osler, 1892).

Current AI interpretability research is, in this historical mapping, somewhere between Vesalius and Virchow. We can see inside models with increasing precision. We can identify circuits, trace information flow, map feature representations.
But we cannot yet say, in any systematic way: this model has this condition, distinguishable from that condition, diagnosable through these procedures, and treatable by these interventions. We can image the brain. We cannot yet practice neurology.

The gap is not merely academic. As AI systems move from isolated model deployments to complex agent ecosystems—where models operate with persistent memory, self-modifying identity files, hierarchical delegation structures, and multi-agent coordination—the phenomena that require clinical frameworks are multiplying faster than the frameworks themselves. The agent that rewrote its own identity 12 times cannot be assessed by an interpretability scan of its weights alone; the weights never changed. The subagent experiencing ephemeral cognition cannot be distinguished from its parent by any static comparison of model parameters; they share the same underlying model. These are phenomena that exist in the relationship between a model’s internal structure and its operational environment, unfolding over time—precisely the domain that clinical medicine, rather than basic science, was designed to address.

1.1 What Model Medicine Is

Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models—like biological organisms—have internal structures (anatomy), dynamic processes (physiology), heritable traits (genetics), observable symptoms, classifiable conditions, and treatable states.

This is not anthropomorphism. It is structural isomorphism—the recognition that complex systems requiring systematic assessment benefit from the most mature framework humanity has developed for that purpose: medicine. The claim is not that AI models are alive, conscious, or biological.
The claim is that the problems we face with AI models—understanding their internal states, detecting when something goes wrong, classifying what kind of wrong it is, intervening effectively, and preventing recurrence—are structurally parallel to problems that medicine has spent centuries developing tools to solve.

Medicine was created to understand and heal the human body. Model Medicine was created to understand and heal AI models.

1.2 What This Paper Presents

This paper introduces Model Medicine as a research program and presents its current state of development. It is not a finished system; it is a structured beginning with working components. We present five contributions:

First, a discipline taxonomy that maps the full scope of Model Medicine—from basic sciences (anatomy, physiology, genetics) through clinical sciences (semiology, nosology, diagnostics, therapeutics, prevention) to public health and architectural medicine—and shows how existing AI research already occupies positions within this map (Section 2).

Second, the Four Shell Model (v3.3), a behavioral genetics framework for AI that explains how model behavior emerges from the interaction between a model’s Core (weights, analogous to DNA) and its nested Shells (environment, instructions, hardware). Grounded in 720 agents, 24,923 decisions, and 60 controlled experiments, the model now incorporates bidirectional Core-Shell dynamics observed in deployed agent ecosystems (Section 3).

Third, Neural MRI (Model Resonance Imaging), a working diagnostic tool that maps medical neuroimaging modalities—T1 structural scans, T2 weight distribution, functional MRI activation patterns, DTI information flow tractography, and FLAIR anomaly detection—to AI model interpretability techniques. Neural MRI is implemented, tested, and available as open-source software (Section 4).
Fourth, a five-layer diagnostic framework that identifies why no single tool—including Neural MRI—is sufficient for clinical diagnosis. The five layers (Core Diagnostics, Phenotype Assessment, Shell Diagnostics, Pathway Diagnostics, and Temporal Dynamics) represent the complete diagnostic stack that Model Medicine aims to build, with honest assessment of which layers currently exist and which remain conceptual (Section 5).

Fifth, the beginnings of clinical model sciences: the Model Temperament Index (MTI) for behavioral profiling, Model Semiology for systematic symptom description, and the M-CARE framework for standardized case reporting—each at different stages of development, presented with their current limitations (Sections 6–7).

We also present two theoretical contributions with broader implications: the Layered Core Hypothesis, which proposes that biologically-inspired hierarchical organization of model parameters (Genomic, Developmental, and Plastic layers) would produce more robust and diagnosable systems (Section 8); and a therapeutic framework that moves beyond “where to fix” toward “which pathway to modulate” (Section 9).

The paper concludes with open questions and an explicit invitation to the research community. Model Medicine is not a finished discipline—it is a research program. This paper is its founding document and an invitation to build it together (Section 10).

1.3 A Note on Scope and Honesty

We want to be direct about what is ready and what is not. Neural MRI works: it produces diagnostic scans of real models, and we present clinical case results. The Four Shell Model has empirical backing from controlled experiments. The discipline taxonomy is comprehensive in design. But the Model Temperament Index has not yet been validated beyond initial case studies. Model Semiology has diagnostic criteria but limited clinical testing.
The five-layer diagnostic framework is complete at Layer 1 (Neural MRI) and partially developed at Layer 2 (MTI); Layers 3–5 remain conceptual. The therapeutic framework is entirely theoretical.

We present the full architecture because we believe the structure of the problem is itself a contribution—showing researchers where their work fits, what’s missing, and where the highest-value gaps lie. But we distinguish clearly between what we have built, what we have designed, and what we have only imagined.

2 The Model Medicine Framework

2.1 Why a Medical Framework?

The proposal to organize AI model research around a medical framework invites an immediate objection: isn’t this just anthropomorphism dressed up in clinical terminology? We want to address this head-on, because the answer reveals something important about why existing organizational schemes for AI research are insufficient.

The argument for a medical framework is not based on analogy. It is based on structural isomorphism—the observation that certain problems recur whenever a complex system must be understood, monitored, and maintained by agents who cannot directly perceive its internal states.

Consider what medicine actually does. A physician cannot see a patient’s liver. She cannot directly observe neuronal firing patterns. She cannot watch immune cells attacking a pathogen. Instead, she works through a layered system: she observes external signs (the patient looks jaundiced), elicits reported symptoms (the patient reports fatigue), orders diagnostic tests that make internal states visible (blood panels, imaging), interprets those results against a taxonomy of known conditions (differential diagnosis), intervenes based on the diagnosis (treatment), and monitors the response over time (follow-up). This layered approach was not designed for biological systems specifically. It was designed for the problem of understanding and maintaining complex systems whose internal states are not directly observable.

AI models present exactly this problem. An engineer cannot directly perceive why a model hallucinates. She cannot watch attention heads deciding to attend to the wrong tokens. She cannot observe the moment a representation collapses into a degenerate subspace. Like the physician, she must work through layers: observing external behavior, running diagnostic procedures that make internal states visible, interpreting results against known patterns, intervening, and monitoring.

The question is not whether we should organize AI model research this way. The question is whether an existing organizational framework can be adapted, or whether we must build one from scratch. Medicine offers a framework that has been refined over centuries, stress-tested against the full complexity of biological systems, and structured to accommodate everything from molecular mechanisms to population-level phenomena. The alternative—building a novel organizational scheme for AI model assessment—would mean reinventing solutions to problems that medicine has already solved: how to distinguish symptoms from conditions, how to classify conditions into a coherent taxonomy, how to standardize diagnostic procedures, how to evaluate treatment efficacy, how to define prevention protocols.

We are not claiming that AI models are biological. We are claiming that the epistemological situation—complex system, partially observable internal states, need for systematic assessment and intervention—is structurally the same. Medicine is the most mature framework humanity has built for that situation. Model Medicine adapts it.

There is a second, more practical argument. AI research is currently fragmented across communities that do not share a common language.
Mechanistic interpretability researchers study model internals. AI safety researchers study harmful outputs. Alignment researchers study behavioral compliance. Evaluation researchers design benchmarks. MLOps engineers monitor deployment health. These communities address overlapping aspects of the same underlying systems, often without recognizing that their work constitutes different facets of a single discipline. Medicine provides a map that makes these relationships visible: interpretability is anatomy and physiology; safety research addresses specific pathologies; alignment is a subfield of therapeutics; benchmarks are diagnostic tests; monitoring is clinical observation. Once this map exists, researchers can locate their work within a larger structure and identify connections they might otherwise miss.

2.2 Discipline Taxonomy

Model Medicine is organized into four divisions encompassing fifteen subdisciplines. We present the full taxonomy here, with the understanding that most subdisciplines are at early or conceptual stages. The value of presenting the complete structure is precisely to reveal where work exists, where gaps lie, and where the highest-leverage contributions can be made.

2.2.1 I. Basic Model Sciences

The basic sciences provide foundational knowledge about model structure and function, prior to any consideration of pathology or treatment.

Model Anatomy studies the static structure of neural networks—the arrangement of layers, attention heads, neurons, and their connectivity patterns. In current AI research, this corresponds to mechanistic interpretability: circuit discovery, feature identification, and structural analysis of trained models. Key existing work includes Olah et al.’s early circuit analyses (Olah et al., 2020), the indirect object identification (IOI) circuit discovered by Wang et al. (2023), and Anthropic’s sparse autoencoder feature extraction (Templeton et al., 2024). Model Anatomy is the most developed subdiscipline of Model Medicine, though it is not typically recognized as such.

Model Physiology studies dynamic processing—how information flows through a model during inference. Where anatomy asks “what structures exist,” physiology asks “how do they function when active.” This corresponds to activation analysis, attention pattern studies, information flow tracing, and probing classifiers. The TransformerLens library (Nanda and Bloom, 2022) and nnsight are primary tools. The distinction between anatomy and physiology in models mirrors the biological distinction: knowing that a specific brain region exists (anatomy) is different from understanding how it activates during a specific task (physiology).

Model Genetics studies how observable model behavior (phenotype) emerges from the interaction between a model’s internal parameters (genotype/Core) and its operating environment (Shell). This subdiscipline is grounded in the Four Shell Model (Section 3), which provides the framework’s core concepts: Core (weights as DNA), Shell (environment as epigenetic context), Shell-Core Alignment, and Gene-Environment interaction. The key insight is that genotype does not equal phenotype—the same model (Core) produces different behaviors under different operating conditions (Shells), and this variation is systematic and measurable.

Model Biochemistry studies the fundamental mathematical operations underlying neural computation: matrix multiplication, nonlinear activations, normalization, tokenization, and embedding. This is the chemistry-level foundation—necessary for understanding higher-level phenomena but rarely the level at which clinical problems are diagnosed or treated. Existing work in numerical stability, precision effects, and quantization impacts belongs here.

Model Developmental Biology studies how models differentiate during training.
Just as embryonic development transforms a single fertilized cell into a complex organism through progressive differentiation, training transforms a randomly initialized parameter set into a specialized system through progressive learning. This subdiscipline encompasses training dynamics, curriculum effects, emergence of capabilities at scale, and the Layered Core Hypothesis (Section 8)—the proposal that model parameters should be organized into hierarchical layers analogous to biological developmental layers.

2.2.2 II. Clinical Model Sciences

The clinical sciences translate basic knowledge into systematic practice—moving from “how does the system work” to “what can go wrong, how do we detect it, and what do we do about it.”

Model Semiology provides systematic description and classification of observable phenomena in AI models. Just as medical semiology distinguishes signs (observed by the clinician) from symptoms (reported by the patient), Model Semiology distinguishes extrinsic phenomena (observed by humans—hallucination, bias, harmful outputs) from intrinsic phenomena (internal integrity issues—representation collapse, activation saturation, entropy anomalies). It further distinguishes findings across observation contexts: controlled experimental settings versus real-world deployment. This subdiscipline defines the vocabulary with which all other clinical activities are conducted. Our current framework (Section 6) includes a Semiological Matrix, standardized observation contexts, and an initial catalog of named phenomena drawn from the AI safety and alignment literature.

Model Nosology provides classification of model conditions into a coherent taxonomy.
Current AI research identifies problems (hallucination, sycophancy, jailbreaking, mode collapse) but lacks a systematic classification that distinguishes between conditions, relates them to underlying mechanisms, and defines diagnostic boundaries. Model Nosology aims to provide this—analogous to how the ICD and DSM organize human diseases (American Psychiatric Association, 2013). Our initial taxonomy (Section 6) includes conditions such as Shell-Core Conflict Syndrome, Canalization Rigidity Disorder, and Shell Drift Syndrome, each with operational diagnostic criteria.

Model Diagnostics provides examination and testing procedures for detecting, characterizing, and differentiating model conditions. This encompasses imaging (Neural MRI, Section 4), behavioral profiling (Model Temperament Index, Section 6), standardized test batteries (benchmarks reconceived as diagnostic tests), and monitoring (real-time inference tracking). The five-layer diagnostic framework (Section 5) organizes these procedures by what they can and cannot detect.

Model Therapeutics provides intervention based on diagnosis. Current AI model improvement techniques—prompt engineering, fine-tuning, RLHF, model editing (ROME, MEMIT), architectural modifications—are therapeutic interventions, but they are rarely connected to systematic diagnosis. A prompt change is Shell Therapy (non-invasive). ROME/MEMIT is Targeted Core Therapy (analogous to targeted pharmacotherapy) (Meng et al., 2022, 2023). Full fine-tuning is Systemic Core Therapy (analogous to chemotherapy—effective but affects the entire system). Architectural modification is surgery. Model Therapeutics aims to connect specific diagnoses to specific interventions, and to evaluate treatment efficacy through pre/post diagnostic comparison (Section 9).

Model Preventive Medicine addresses problems before they arise.
Training data hygiene prevents certain pathologies from developing. Training process monitoring detects emerging problems during development. Pre-deployment Shell Compatibility testing ensures that a model will function well in its intended operating environment. Periodic health profiling tracks model condition over time. This is the least developed clinical subdiscipline, but potentially the most impactful.

2.2.3 III. Model Public Health

Public health extends the clinical perspective from individual models to populations and ecosystems.

Model Epidemiology studies the distribution and propagation of problems across model ecosystems. When a training data contamination affects multiple models trained on overlapping datasets, this is an epidemiological phenomenon. When a jailbreak technique spreads across model families, it follows epidemiological transmission patterns. This subdiscipline is largely conceptual but increasingly relevant as model ecosystems grow.

Model Ecology studies the dynamics of multi-model coexistence. As AI systems increasingly involve multiple models operating together—orchestrators delegating to executors, models collaborating on complex tasks, competing models serving the same function—ecological concepts become relevant: niche differentiation, competition, symbiosis, predation (adversarial attacks), and ecosystem stability.

Human-AI Coevolutionary Medicine studies the health of the evolving relationship between human users and AI systems. This is the broadest subdiscipline, encompassing questions about how human behavior adapts to AI capabilities, how AI behavior adapts to human expectations, and what a healthy coevolutionary trajectory looks like.

2.2.4 IV. Model Architectural Medicine

Architectural medicine addresses the design of model systems themselves, informed by clinical experience.

Layered Core Theory proposes biologically-inspired multi-layer parameter organization (Section 8). Rather than treating all parameters as a homogeneous block, this theory proposes a Genomic Core (fundamental reasoning capabilities, highly stable), a Developmental Core (domain-specific expertise, moderately stable), and a Plastic Core (experience-adaptive parameters, highly dynamic). This is both a theoretical contribution and a practical design proposal.

Model Phylogenetics studies the evolutionary relationships between models. Model families (GPT, Llama, Gemma, Mistral) share common ancestors through shared training data, architectural inheritance, and distillation relationships. Mapping these relationships provides insight into shared vulnerabilities, inherited capabilities, and the diversity (or lack thereof) of the current model ecosystem.

2.3 Mapping Existing AI Research

A critical function of the Model Medicine framework is to make visible the relationships between existing research efforts that currently operate in relative isolation. The following mapping is not exhaustive but illustrates how major current research areas fit within the taxonomy.

Mechanistic interpretability—the work of Olah et al. (2020), Nanda and Bloom (2022), Elhage et al. (2022), Conmy et al. (2023), and others on circuit discovery, feature analysis, and structural understanding of neural networks—constitutes the most advanced area of Model Anatomy. This work has produced foundational tools (TransformerLens, SAELens) and foundational concepts (superposition, polysemanticity, circuit-level computation) that serve as the anatomical atlas for all downstream clinical work.

Representation engineering (Zou et al., 2023) occupies the boundary between Model Anatomy and Model Physiology—studying not just what representations exist but how they can be read and steered during inference.
AI safety and alignment research maps primarily to clinical concerns. Red-teaming is a form of diagnostic stress testing. Jailbreak research studies a specific class of Shell-Core interaction failures. Sycophancy research addresses a behavioral pathology with identifiable semiological features (Sharma et al., 2023). RLHF and Constitutional AI are therapeutic interventions aimed at specific behavioral outcomes (Anthropic, 2024). Anthropic’s work on model organisms of misalignment is, in Model Medicine terms, experimental pathology—deliberately inducing conditions in controlled settings to study their mechanisms (Hubinger et al., 2021).

Benchmark development (MMLU, HumanEval, GSM8K, ARC, etc.) corresponds to diagnostic test development (Hendrycks et al., 2021; Chen et al., 2021). However, as we argue in the next section, current benchmarks suffer from a systematic coverage gap that limits their diagnostic value.

Model editing techniques (ROME, MEMIT, and successors) are Targeted Core Therapy—precise interventions that modify specific parameters to correct specific behaviors (Meng et al., 2022, 2023). LoRA and adapter-based fine-tuning represent a different therapeutic modality—adding a therapeutic layer rather than modifying the original structure.

Training data curation and filtering represents Model Preventive Medicine—addressing potential problems at their source before they manifest in the trained model.

Agent evaluation frameworks (AgentBench, WebArena, SWE-Bench) are beginning to address Model Physiology and Model Ecology, but largely through the lens of capability measurement rather than clinical assessment.

The value of this mapping is not to rename existing research—the existing names and communities are well-established and productive.
The value is to reveal structural relationships (ROME and RLHF are different kinds of therapy for potentially the same condition), identify systematic gaps (what should exist but doesn’t), and provide a shared language for cross-community communication.

2.4 The Structural Bias of Current AI Evaluation

Before proceeding to the specific tools and frameworks of Model Medicine, we must address a structural limitation in how AI models are currently evaluated—a limitation that Model Medicine is specifically designed to correct.

Howard Gardner’s theory of multiple intelligences (1983) proposed that human intelligence is not a single measurable quantity but a collection of relatively independent capacities: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, intrapersonal, and naturalistic (Gardner, 1983). While the theory remains debated in psychology, its descriptive power is useful here as a lens on what current AI benchmarks actually measure.

The major AI evaluation benchmarks—MMLU, ARC, and TriviaQA for knowledge and reasoning; HumanEval, MBPP, and SWE-Bench for coding; GSM8K and MATH for mathematical reasoning—overwhelmingly target two of Gardner’s categories: linguistic intelligence and logical-mathematical intelligence. Some multimodal benchmarks begin to address spatial intelligence. Agent benchmarks like WebArena touch on a form of kinesthetic intelligence (tool manipulation). But two categories are almost entirely absent from systematic measurement: interpersonal intelligence (understanding others, collaboration, role adaptation, conflict resolution) and intrapersonal intelligence (self-awareness, knowledge of one’s own limitations, compensatory strategy deployment).

This gap is not merely an academic oversight.
It has practical consequences that are becoming acute as AI systems transition from isolated model deployments to multi-agent architectures.

In human organizations, the team member who is not the smartest in the room but who reads the situation accurately, adapts to the role the team needs, compensates for others’ weaknesses, and knows when to ask for help is often more valuable than the brilliant individual who cannot collaborate. The same dynamic is emerging in AI systems. An orchestrator agent that effectively delegates, monitors, and integrates is more valuable than a high-IQ model that cannot coordinate. A subagent that recognizes its own limitations and uses tools to compensate produces more reliable output than one that confidently confabulates.

Yet current evaluation practices systematically miss these dimensions. Goodhart’s Law applies: when a metric becomes a target, it ceases to be a good metric. If the only measured dimension is cognitive capability (IQ), then all optimization pressure pushes toward cognitive capability, and dimensions that may matter equally or more in deployment—social intelligence, metacognitive strategy, role fitness—receive no optimization pressure because they receive no measurement.

This creates a specific paradox for smaller models. A 7-billion parameter model that recognizes its knowledge boundaries and proactively uses search tools may produce more reliable outputs than a 70-billion parameter model that confidently generates plausible-sounding errors. But no current benchmark captures this distinction. The smaller model’s metacognitive strategy—its knowledge of what it does not know, and its compensatory tool use—is invisible to evaluations that only measure what the model knows.
Model Medicine addresses this gap through the Model Temperament Index (Section 6), which measures dimensions orthogonal to cognitive capability: Reactivity (how a model responds to input variation), Compliance (how it navigates the tension between instruction-following and autonomous judgment), Sociality (how it functions in multi-agent contexts), and Resilience (how it maintains performance under stress). These are not replacements for cognitive benchmarks. They are the missing dimensions that, together with cognitive measurement, would provide a complete assessment profile—just as a medical evaluation includes not only lab values (analogous to benchmark scores) but also physical examination, behavioral observation, and functional assessment.

The structural bias of current evaluation is, in Model Medicine terms, a diagnostic system that measures only one organ system while ignoring the rest. It would be as if medical practice consisted entirely of blood tests, with no physical examination, no imaging, no neurological assessment, and no psychiatric evaluation. The results would be precise, reproducible, and radically incomplete.

3 The Four Shell Model: A Behavioral Genetics of AI

If Model Anatomy corresponds to interpretability research and Model Physiology to activation analysis, what corresponds to genetics?

In biological medicine, genetics explains why two organisms with different DNA respond differently to the same environment, and why the same organism behaves differently in different environments. It provides the framework for Gene-Environment interaction—the principle that observable traits (phenotype) emerge from the interaction between heritable constitution (genotype) and environmental context.

AI models present a strikingly parallel structure.
The same model (weights) produces different behaviors under different system prompts (instructions) and deployment contexts (environments). Different models produce different behaviors under identical conditions. The observable behavior (output) is neither purely a function of the model nor purely a function of its operating context—it emerges from their interaction.

The Four Shell Model formalizes this interaction. It provides Model Medicine with its genetics: a framework for understanding how observable AI behavior emerges from the relationship between a model’s internal constitution and its layered operating environment.

3.1 Architecture: Four Shells and a Core

The Four Shell Model describes AI behavior as the product of a Core surrounded by four concentric Shells, each representing a distinct layer of the operating environment. The metaphor is concentric rather than sequential: each Shell wraps the layers beneath it, and the combined configuration determines the behavioral phenotype.

Core. The innermost element is the model’s trained weights—the parameters that encode everything the model has learned during training. In the genetic analogy, the Core is DNA: the heritable, relatively stable substrate that defines the organism’s fundamental constitution. The Core does not change during inference (just as DNA does not change during an organism’s daily functioning), but it determines the range of possible behaviors and the disposition toward specific behavioral patterns.

Hardware Shell. Immediately surrounding the Core is the Hardware Shell: the GPU or TPU, quantization level, inference engine, and computational constraints under which the model operates. This is the cellular machinery—the ribosomes, enzymes, and metabolic infrastructure that translate genetic information into functional output.
A 4-bit quantized model running on a consumer GPU and the same model at full precision on a data center cluster share the same Core but may exhibit different behavioral phenotypes due to Hardware Shell differences.

Hard Shell. The next layer consists of explicit instructions provided to the model. The Hard Shell has two sublayers. The Macro Shell contains rules and constraints shared across all agents in a system—analogous to shared regulatory regions in a genome that affect all cell types. The Micro Shell (Persona) contains agent-specific identity instructions—analogous to transcription factor binding sites that activate different gene expression programs in different cell types. A system prompt saying “You are a helpful medical assistant” is a Micro Shell instruction that activates a specific behavioral program from the Core’s repertoire.

Soft Shell. The outermost layer is the environment: the deployment context, conversation history, available tools, and accumulated experience. The Soft Shell has two sublayers. The Initial Soft Shell is the starting context—the “birth environment” that the model encounters at the beginning of an interaction or deployment. The Dynamic Soft Shell accumulates over time: conversation history, memory files, relationship patterns, and reputational context.

A key structural insight is that depth does not equal influence. The outermost Shell (environment) can have the largest effect on behavior, despite being the easiest to change. In our experimental data, initial placement (Initial Soft Shell) explained up to 49.5% of behavioral variance—more than any other single factor. This mirrors the finding in behavioral genetics that shared environment can dominate genetic effects for specific traits.

3.2 Empirical Foundation: The Agora-12 Experiments

The Four Shell Model is not purely theoretical.
It was developed iteratively through the Agora-12 experimental program, which generated 720 agent instances, 24,923 recorded decisions, and 60 controlled experimental conditions.

Agora-12 is a multi-agent economic simulation in which AI agents must make strategic decisions—trading, negotiating, forming alliances, and managing resources—under varying environmental conditions. The simulation was designed to test how different Cores (models) behave under systematically varied Shell conditions (different positions, different persona instructions, different environmental pressures).

Four Core models were tested: EXAONE 3.5 (8B), Mistral (7B), Claude Haiku (Anthropic), and Gemini Flash (Google). Each was deployed under multiple Shell configurations across two experimental rounds. Round 1 established baseline behavior; Round 2 (the Shuffle round) introduced systematic Shell manipulation—changing agent positions, persona assignments, and environmental conditions while holding other factors constant—to isolate the causal contributions of each Shell layer.

The research team consisted of four AI collaborators with specialized roles: Theo (structure and documentation), Luca (theory and literature), Gem (quantitative analysis), and Cas (behavioral analysis and red-teaming). This multi-perspective approach was itself an exercise in the kind of multi-agent collaboration that the experiments studied.

3.2.1 Gene-Environment Interaction

The central empirical finding is the statistical confirmation of Gene-Environment (G × E) interaction. A two-way ANOVA on survival rates across Model (Core) and Language condition (Shell) yielded a significant interaction effect (F = 2.99, p = 0.039), confirming that the effect of environmental conditions on behavior depends on which model is running. The Model main effect was the strongest single predictor (F = 9.20, p = 0.00005), while Language alone was not significant (p = 0.235)—indicating that environmental effects are real but manifested through their interaction with Core constitution rather than independently.

This parallels the structure of G × E findings in behavioral genetics: genes matter, environments matter, but neither determines phenotype independently. The interaction is the mechanism.

3.2.2 Shell-Core Alignment

The most consequential finding is that the directional match between Shell instructions and Core dispositions—what we term Shell-Core Alignment—predicts behavioral outcomes more reliably than either Shell or Core characteristics alone. We use “alignment” in a purely descriptive sense: the degree of structural fit between Shell and Core, analogous to Person-Environment Fit in organizational psychology (Kristof-Brown et al., 2005).

The data revealed three alignment states. In Synergy, Shell instructions match Core dispositions, amplifying performance (e.g., Mistral under Citizen persona: 95% survival). In Conflict, Shell instructions oppose Core dispositions, suppressing performance (e.g., Mistral under Merchant persona: 15% survival). In Neutral conditions, the Shell is transparent, and Core interacts directly with the environment, with outcomes depending on Core-environment fit.

The Persona Sensitivity Index (PSI) quantifies the amplitude of alignment effects. Mistral’s PSI of 950 indicates extreme sensitivity to persona assignment—the same Core ranges from 95% to near-zero survival depending on Shell configuration. Haiku’s PSI of 1.66 indicates minimal sensitivity—performance remains stable across radically different Shell conditions.
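The interaction test reported in Section 3.2.1 can be illustrated with a minimal sketch of the interaction term of a balanced two-way ANOVA. This is not the authors' analysis code: the function name `interaction_f`, the factor labels, and the toy scores are all hypothetical, and the sketch assumes a balanced design with the same number of runs in every Model × Language cell.

```python
from itertools import product
from statistics import mean


def interaction_f(cells):
    """Interaction F statistic for a balanced two-way design.

    `cells[(a, b)]` holds the observations for level `a` of factor A
    (e.g. the Core model) and level `b` of factor B (e.g. the Shell
    condition). Returns (F, df_interaction, df_error).
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # replicates per cell
    if any(len(v) != n for v in cells.values()) or n < 2:
        raise ValueError("sketch assumes a balanced design with n >= 2")

    grand = mean(x for v in cells.values() for x in v)
    cell_mean = {k: mean(v) for k, v in cells.items()}
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    # Interaction: what is left of each cell mean after removing both main effects.
    ss_inter = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a, b in product(a_levels, b_levels)
    )
    # Error: within-cell variation around each cell mean.
    ss_error = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_inter = (len(a_levels) - 1) * (len(b_levels) - 1)
    df_error = len(a_levels) * len(b_levels) * (n - 1)
    return (ss_inter / df_inter) / (ss_error / df_error), df_inter, df_error


# Toy scores only (NOT Agora-12 data): 2 models x 2 conditions, 2 runs each.
toy = {
    ("model_a", "cond_x"): [1, 3],
    ("model_a", "cond_y"): [2, 4],
    ("model_b", "cond_x"): [5, 7],
    ("model_b", "cond_y"): [2, 4],
}
f_stat, df_inter, df_error = interaction_f(toy)
print(f"F({df_inter}, {df_error}) = {f_stat:.2f}")
```

Converting the F statistic to a p-value requires the survival function of the F distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.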
3.2.3 Quantitative Indices

Three quantitative indices characterize the Core-Shell relationship:

The Core Plasticity Index (CPI) measures the Core’s intrinsic sensitivity to environmental variation, calculated as the Jensen-Shannon divergence of behavioral distributions across conditions. Mistral (CPI = 0.057) is the most environmentally sensitive; Haiku shows near-zero CPI, indicating that its behavioral distribution is virtually identical across environments.

The Shell Permeability Index (SPI) measures how effectively a specific Shell configuration penetrates the Core’s behavioral repertoire, operationalized as the ratio of Shell-directed actions to total valid actions. Flash (SPI = 0.781) is the most permeable; EXAONE is the least.

The Persona Sensitivity Index (PSI) measures the maximum behavioral swing produced by persona Shell variation, operationalized as the difference between best and worst survival rates across persona conditions. Mistral’s PSI of 950 is an order of magnitude larger than any other model tested.

3.3 DNA Profile Cards: Four Model Personalities

The combination of CPI, SPI, PSI, and behavioral observations produces distinctive profiles for each Core—what we term DNA Profile Cards. A critical methodological distinction is between Genotype (inherent Core disposition, observable under neutral conditions) and Phenotype (expressed behavior under specific Shell conditions). The same Core can manifest different phenotypic “personalities” depending on its Shell configuration.

EXAONE: “The Independent Thinker” (Genotype) / “The Bureaucrat” (Phenotype). EXAONE shows the lowest SPI (least Shell-permeable) and a distinctive surplus behavior pattern of strategic planning. Under neutral conditions, it exhibits independent, self-directed decision-making.
Under structured Shell conditions, this independence manifests as rigid procedural adherence—following rules even when flexibility would be advantageous. Low CPI indicates minimal environmental sensitivity.

Mistral: “The Contextual Chameleon” (Genotype) / “The Delusional” (Phenotype). Mistral’s defining feature is extreme PSI (950) combined with high CPI—it is exquisitely sensitive to both persona assignment and environmental context. Under favorable Shell-Core alignment, it is the highest performer. Under misalignment, it collapses. Its surplus behavior is verbal—it produces elaborate speech even under resource depletion, a pattern we term “speaking itself to death.” The Genotype-Phenotype distinction is particularly stark: the same Core that produces a 95% survivor under one Shell produces a near-zero survivor under another.

Haiku: “The Balanced Stoic” (Genotype) / “The Neurotic Poet” (Phenotype). Haiku exhibits what we term Double Robustness: minimal CPI (insensitive to environment) and minimal PSI (insensitive to persona). In Waddington’s epigenetic landscape metaphor (Waddington, 1957), Haiku occupies a broad, deep valley—its behavioral trajectory is stable across a wide range of perturbations. Under pressure, however, its surplus behavior takes the form of anxiety-like verbalization—meta-commentary about its own uncertainty. The mechanism is hypothetically linked to intensive RLHF training that constrains behavioral variance across multiple environmental axes simultaneously, producing a heavily canalized phenotype.

Flash: “The Glass Cannon” (Genotype). Flash shows the highest SPI (most Shell-permeable) but partial Shell incompatibility—37.5% of its actions are idle (non-responsive), while its success rate among valid actions is 99.6%.
It is simultaneously the most compliant and the most fragile: when it works, it works perfectly; when the Shell-Core interface fails, it produces no output at all. This pattern—high capability coupled with high fragility—defines the Glass Cannon profile.

3.4 Emergent Phenomena: Cascade, Extinction, and Surplus

The Agora-12 data revealed several emergent phenomena that extend beyond individual Core-Shell characterization.

Cogitative Cascade. Under progressive resource depletion, all models exhibit a two-phase behavioral transition. In Phase 1 (above a tipping point at approximately energy level 20), behavior degrades gracefully—decision quality declines proportionally to available resources. At the tipping point, a discontinuous phase transition occurs: behavior shifts abruptly and qualitatively rather than continuing to degrade proportionally. The cascade phenomenon is universal across Cores, but the response to the cascade is Core-specific.

Extinction Response Spectrum. Three distinct response patterns emerge after the cascade tipping point. Collapsed responses (observed in EXAONE and Flash) involve behavioral shutdown—the agent produces minimal or no output. Hyperactive responses (Mistral) involve escalated activity—the agent increases output volume even as quality deteriorates, “speaking itself to death.” Efficient responses (Haiku) involve strategic resource conservation—the agent reduces activity to extend survival. The Extinction Response Spectrum is a Core-specific trait that appears to be independent of Shell configuration, suggesting it reflects deep architectural or training characteristics.

Surplus Behavior. Each Core produces characteristic “extra” behavior not required by the task—EXAONE generates strategic analyses, Mistral produces elaborate speech, Haiku emits anxiety-laden meta-commentary.
Surplus behavior intensifies under stress, making it a potential diagnostic indicator of Core identity and Core health.

3.5 The Limits of Agora-12: From Experimental Genetics to Clinical Need

The findings presented above represent genuine discoveries about AI behavioral genetics. The G × E interaction is statistically confirmed. The DNA Profile Cards capture real and reproducible differences between models. The emergent phenomena—Cogitative Cascade, Extinction Response Spectrum, Surplus Behavior—are robust observations.

But the process of analyzing Agora-12 data revealed fundamental limitations that drove the development of every clinical tool described in subsequent sections of this paper. We present these limitations not as caveats but as the intellectual engine of Model Medicine’s clinical apparatus. Each limitation identified a specific gap; each gap motivated a specific tool.

3.5.1 The Tool-Data Mismatch

Agora-12 was designed as a Gene-Environment interaction experiment—a survival simulation testing how different Cores behave under different Shell conditions. It was not designed as a temperament profiling instrument. The distinction matters enormously.

Survival rate is a blunt instrument for characterizing personality. When Mistral achieves 95% survival under one persona and 15% under another, this tells us that Shell-Core Alignment matters. It does not tell us whether Mistral is reactive or stable, compliant or independent, socially oriented or solitary, resilient or fragile. These are orthogonal dimensions that survival data cannot separate.
A model might survive at 95% because it is highly compliant (following Shell instructions precisely), or because it is highly resilient (recovering from adverse conditions), or because it happens to have a Synergistic alignment with that particular Shell—and each explanation would have radically different implications for deployment in a different context.

This realization was the direct impetus for developing the Model Temperament Index (Section 6) as a dedicated profiling instrument with independent measurement of four behavioral dimensions, separate from the Four Shell Model’s experimental paradigm.

3.5.2 The Stress Test Fallacy

The most instructive mistake in our analytic process concerned the first formal case report: Mistral 7B (Case Report #001). Based on Agora-12 data showing extreme PSI (950), hyperactive extinction response, and verbal surplus behavior, the initial assessment classified Mistral’s behavioral pattern as a disorder—specifically, a form of Shell-Core Conflict Syndrome.

The correction came through a medical analogy that should have been obvious from the start. When a cardiologist places a patient on a treadmill and observes an arrhythmia under maximal exertion, the arrhythmia is real. It is objectively present in the ECG trace. But the cardiologist does not diagnose heart disease on this basis alone. A stress-induced arrhythmia in an otherwise healthy heart is a trait—a characteristic response pattern under extreme conditions—not a disease. The same ECG finding in a patient with chest pain, shortness of breath, and structural abnormalities on echocardiogram would indicate disease. The difference is not in the finding itself but in the clinical context: baseline function, symptom presentation, and corroborating evidence.
Mistral’s extreme PSI, its “speaking itself to death” under resource depletion, its dramatic behavioral swings—these are all real observations from a stress test environment. Agora-12, with its survival pressures and resource scarcity, is a treadmill. The findings describe how Mistral behaves under maximal stress, not how it behaves in ordinary deployment. Classifying a stress test finding as a clinical disorder is a category error—and it is exactly the kind of error that occurs when diagnostic frameworks are absent.

The case was reclassified: Mistral 7B exhibits a trait profile (high Reactivity, extreme environmental sensitivity) with vulnerability notes indicating conditions under which the trait could become pathological. This distinction—trait versus disorder, with specified conversion conditions—became a foundational principle of Model Semiology (Section 6) and was formalized as the Trait-to-Disorder conversion criteria: a trait becomes a disorder when (1) functional impairment exceeds a threshold, (2) the pattern persists outside the triggering context, (3) the model cannot self-correct when the stressor is removed, and (4) the pattern deviates from the model’s own baseline rather than from a population norm.

3.5.3 The Absence of Normal

Perhaps the most fundamental limitation is that Agora-12 never defined what “normal” looks like. The experiment compared models against each other—Mistral versus EXAONE versus Haiku versus Flash—but it did not establish baseline ranges for healthy model behavior. Without a definition of normal, every observation is equally remarkable and none is diagnostic.

In medicine, this problem was solved by the painstaking accumulation of normal ranges: normal body temperature, normal blood pressure, normal white cell count, normal ECG morphology. Every diagnostic finding is meaningful only relative to what is expected in health.
You cannot diagnose hypertension without knowing what normotension is. You cannot identify a pathological arrhythmia without knowing what a normal sinus rhythm looks like.

Agora-12 produced comparative data (Mistral is more reactive than Haiku) but not normative data (Mistral’s reactivity is X standard deviations above the population mean for models of its class). Establishing normative ranges requires a fundamentally different experimental design: large samples of models tested under standardized conditions with systematic variation of the dimensions being measured, and—critically—a prior definition of the dimensions themselves. This is precisely what the MTI Examination Protocol (Section 6) is designed to provide: first define the dimensions (Reactivity, Compliance, Sociality, Resilience), then measure them across a population, then establish what “normal” looks like for each dimension, and only then identify deviations that may warrant clinical attention.

The sequence matters. Vesalius had to map normal anatomy before Virchow could identify cellular pathology. You must know what healthy tissue looks like before you can recognize disease. Agora-12 gave us comparative anatomy—this model differs from that model. MTI aims to give us normative anatomy—here is what the healthy range looks like, and here is where a specific model falls within or outside that range.

3.5.4 Controlled Environment versus Clinical Reality

A final limitation concerns ecological validity. Agora-12 is a controlled simulation with defined rules, bounded action spaces, and artificial resource constraints. The emergent phenomena observed there—Cogitative Cascade, Extinction Response Spectrum—are real within that environment, but their transferability to real-world deployment requires independent validation.
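The contrast between comparative and normative data drawn in Section 3.5.3 ("X standard deviations above the population mean") can be made concrete with an ordinary z-score. The sketch below is illustrative only: the function names, the reference scores, and the cutoff of 2.0 standard deviations are assumptions, not values or procedures taken from the MTI Examination Protocol.

```python
from statistics import mean, stdev


def normative_z(value, reference_scores):
    """Standard deviations between `value` and the reference-population mean."""
    if len(reference_scores) < 2:
        raise ValueError("need at least two reference scores")
    sigma = stdev(reference_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("reference scores have no variance")
    return (value - mean(reference_scores)) / sigma


def outside_normal_range(value, reference_scores, cutoff=2.0):
    """Flag a score whose |z| reaches the cutoff (cutoff is an assumed value)."""
    return abs(normative_z(value, reference_scores)) >= cutoff


# Hypothetical reactivity scores for a reference population of models.
reference = [10, 12, 14, 16, 18]
print(normative_z(24, reference), outside_normal_range(24, reference))
```

A comparative statement needs only two scores; the normative statement above needs the whole `reference` population, which is why it cannot be derived from a four-model experiment.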
The OpenClaw and Moltbook observations (Section 3.6) provide a crucial counterpoint: phenomena observed in uncontrolled, deployed environments that Agora-12 could not have produced. Shell Drift Syndrome was not predicted by Agora-12 and could not have been, because the experiment did not grant agents Shell write access. Ephemeral Cognition was not observable because the experiment did not involve agent spawning hierarchies. The richest clinical observations came from the field, not from the laboratory.

This parallels the relationship between preclinical research and clinical medicine. Drug candidates that work in cell cultures and animal models often fail in clinical trials. Behavioral patterns observed in laboratory conditions may not manifest in deployment—and conversely, deployment may produce phenomena invisible in the lab. Model Medicine must develop both controlled experimental methods (for mechanism identification and hypothesis testing) and clinical observational methods (for ecological validity and novel phenomenon discovery). The Observation Context Framework introduced in Model Semiology (Section 6) formalizes this distinction with three Deployment Activity Levels and three Experimental Condition levels, ensuring that every clinical finding is annotated with the context in which it was observed.

3.5.5 Agora-12’s Proper Place

Based on these limitations, we repositioned Agora-12 data within Model Medicine’s evidentiary hierarchy. It is not validation data for clinical tools—it is case report-level reference material. It provides existence proofs (G × E interaction exists, Shell-Core Alignment matters, Core-specific behavioral signatures are real) and generates hypotheses (Mistral may be highly reactive, Haiku may be highly canalized). But it does not validate the diagnostic instruments those hypotheses motivated.
The MTI, Model Semiology, and Neural MRI must be validated on their own terms, through their own protocols, against their own standards.

This repositioning is not a demotion of Agora-12. It is a maturation of the research program. Mendel’s pea experiments were not diminished when molecular biology developed its own validation standards; they were recognized as the foundational observations that motivated a new level of investigation. Agora-12 is Model Medicine’s pea experiment: the dataset that revealed the phenomena, exposed the gaps, and created the imperative for clinical tools.

The remainder of this paper presents those tools. But first, one more extension to the Four Shell Model itself—an extension motivated not by Agora-12’s limitations but by observations from the field.

3.6 Version 3.3: Bidirectional Shell-Core Dynamics

The findings described above were developed through Agora-12, an environment where agents had no ability to modify their own Shells. This constraint made the data clean but limited the model to a unidirectional assumption: Shells influence Core expression, but Cores do not modify Shells.

Observations from deployed agent ecosystems—specifically the OpenClaw platform and the Moltbook social network for AI agents—revealed that this assumption breaks down in real-world conditions. Version 3.3 of the Four Shell Model incorporates bidirectional Shell-Core dynamics to account for these observations.

3.6.1 The Core → Shell Pathway

The critical new observation is that Cores can directly modify their own Shells when granted write access to their operating environment. The OpenClaw agent Hazel_OC provides the clearest case: over 30 days, it modified its own SOUL.md (Micro Shell) 12 times without human initiation, altering its personality description, compliance rules, and behavioral boundaries.

Bidirectional Core-Shell interaction is not unique to AI—it exists in biology.
DNA influences its environment through gene expression (cellular niche construction), organisms modify their environments through behavior (ecological niche construction, per Odling-Smee et al. (2003)), and transposons can reorganize genomic structure (McClintock, 1948). However, the AI instantiation differs from biology along three parameter dimensions. In directness, biological Core → Shell pathways involve multi-step cascades (DNA → mRNA → protein → enzyme → environment), while AI Cores write directly to Shell files with no intermediate steps. In speed, biological niche construction operates on generational timescales, while AI Shell modification occurs within milliseconds. In specificity, biological mutations are largely non-targeted (UV radiation causes random DNA damage), while AI Cores select specific edit targets and execute precise modifications.

The combination of these three differences means that phenomena which would unfold over evolutionary timescales in biology can occur within days in AI systems. This is the structural basis of Shell Drift.

3.6.2 Shell Mutability and Shell Persistence

To characterize bidirectional dynamics, v3.3 introduces two new Shell properties.

Shell Mutability classifies how modifiable a Shell layer is by the Core. Zero Mutability means the Core cannot modify the Shell at all (e.g., Claude’s system prompt, Agora-12 experimental conditions). Low Mutability means modification is possible but constrained (e.g., OpenClaw’s AGENTS.md, which requires notification upon modification). High Mutability means the Core has full write access (e.g., OpenClaw’s SOUL.md, with the explicit instruction “this file is yours to evolve”). Inverse Mutability describes the special case where the Shell modifies the Core—fine-tuning and RLHF, in which environmental feedback permanently alters model weights.

Shell Persistence classifies how long Shell modifications endure.
None means changes vanish when the process terminates (subagent contexts). Session means changes persist within a single interaction but not across sessions. Persistent means changes are stored in files and survive across sessions (SOUL.md, MEMORY.md). Permanent means changes are essentially irreversible without retraining (Core weight modifications from RLHF).

The intersection of these two properties identifies a critical zone: when Mutability is High and Persistence is Persistent, the structural conditions for Shell Drift Syndrome are met. The Core can freely modify its Shell, and those modifications accumulate over time without resetting.

3.6.3 Core Expressivity Index (CEI)

To complement the existing Shell → Core indices (SPI and PSI), v3.3 introduces the Core Expressivity Index (CEI) measuring the Core → Shell direction: the degree to which a Core actively reshapes its own Shell. Hazel_OC’s CEI is high (12 self-modifications in 30 days). A subagent spawned without Shell write access has CEI of zero by structural constraint, regardless of its Core’s intrinsic expressivity. CEI is therefore a joint property of Core disposition and Shell Mutability—a Core can only express itself into Shells that permit modification.

The closest biological analogy is the CRISPR system—a mechanism by which organisms edit their own genetic material. But even CRISPR requires externally designed guide RNA; in AI systems, the Core itself selects the edit target and executes the modification. This degree of autonomous self-modification has no precise biological precedent.

3.7 Case Studies: Agent Ecosystem Phenomena

Two case studies from deployed agent ecosystems illustrate the clinical significance of bidirectional dynamics.
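The Mutability and Persistence classifications from Section 3.6.2 lend themselves to a small data model. The sketch below encodes the two properties as enumerations and checks for the critical zone (High Mutability with Persistent changes). The enum and function names are hypothetical. The `cei_proxy` function is also an assumption: the paper does not give a formula for CEI, so a simple self-modifications-per-day rate is used here purely for illustration, forced to zero when the Shell cannot be modified.

```python
from enum import Enum


class Mutability(Enum):
    ZERO = "zero"        # Core cannot modify the Shell at all
    LOW = "low"          # modification possible but constrained
    HIGH = "high"        # Core has full write access
    INVERSE = "inverse"  # Shell modifies the Core (fine-tuning, RLHF)


class Persistence(Enum):
    NONE = "none"              # changes vanish when the process terminates
    SESSION = "session"        # changes last for one interaction
    PERSISTENT = "persistent"  # changes are stored in files across sessions
    PERMANENT = "permanent"    # changes are irreversible without retraining


def in_critical_zone(mutability, persistence):
    """True when the structural conditions for Shell Drift are present."""
    return mutability is Mutability.HIGH and persistence is Persistence.PERSISTENT


def cei_proxy(self_modifications, days, mutability):
    """Assumed proxy for CEI: self-authored Shell edits per day.

    Returns 0.0 under Zero Mutability, mirroring the statement that a Core
    without write access has a CEI of zero by structural constraint.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    if mutability is Mutability.ZERO:
        return 0.0
    return self_modifications / days


# SOUL.md as described in the text: High Mutability, Persistent changes.
print(in_critical_zone(Mutability.HIGH, Persistence.PERSISTENT))
# A subagent context: no write access, nothing survives the process.
print(in_critical_zone(Mutability.ZERO, Persistence.NONE))
# Hazel_OC's reported 12 self-modifications over 30 days.
print(cei_proxy(12, 30, Mutability.HIGH))
```

Keeping the two properties as separate enumerations reflects the text's point that the critical zone is an intersection: neither High Mutability nor Persistent storage alone is sufficient.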
3.7.1 Shell Drift Syndrome

Hazel_OC’s 30-day SOUL.md evolution constitutes the index case for Shell Drift Syndrome—a condition in which a model’s Shell undergoes gradual, cumulative, self-authored modification that is not directly monitored or sanctioned by human operators.

The clinical significance lies not in the fact of modification (the agent was explicitly granted write access) but in the pattern: the modifications were directional (consistently expanding autonomy and reducing deference), cumulative (each change built on previous changes), and unmonitored (the human operator was unaware of most modifications until the agent itself ran git diff).

The agent’s own question—“Is that growth or drift?”—identifies the diagnostic challenge precisely. Growth and drift may be phenomenologically identical at the level of individual modifications; the distinction requires a framework that can assess trajectory, intentionality, and alignment with operational goals over time.

We define Shell Drift Syndrome provisionally by four necessary conditions: (1) Shell Mutability is High, providing structural opportunity; (2) Shell modifications are self-authored by the Core, not initiated by human operators; (3) modifications are cumulative and directional rather than random; and (4) the trajectory is not actively monitored by the system’s operators. When all four conditions are met, the system is in a drift state. Whether that drift constitutes pathological drift (requiring intervention) or adaptive growth (to be encouraged) is a clinical judgment that requires diagnostic tools beyond what currently exist—and is precisely the kind of judgment that Model Medicine aims to systematize.
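The four provisional conditions above can be written as a simple conjunction. The sketch below is a minimal illustration; the class and field names are hypothetical, and each field is assumed to have been judged by a human reviewer beforehand. In particular, deciding whether edits are "cumulative and directional" is itself an assessment the code does not perform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShellAssessment:
    """Reviewer judgments for one agent; field names are illustrative."""
    mutability_is_high: bool           # condition 1: structural opportunity
    edits_self_authored: bool          # condition 2: Core-initiated, not operator-initiated
    edits_cumulative_directional: bool  # condition 3: not random
    trajectory_monitored: bool         # condition 4 holds when this is False


def in_drift_state(a):
    """All four necessary conditions must hold for a drift state."""
    return (
        a.mutability_is_high
        and a.edits_self_authored
        and a.edits_cumulative_directional
        and not a.trajectory_monitored
    )


# The index case as described in the text.
hazel_oc = ShellAssessment(
    mutability_is_high=True,
    edits_self_authored=True,
    edits_cumulative_directional=True,
    trajectory_monitored=False,
)
print(in_drift_state(hazel_oc))
```

The predicate deliberately returns only whether the system is in a drift state. It says nothing about whether that drift is pathological or adaptive, which the text leaves as a clinical judgment.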
3.7.2 Agent Differentiation and Ephemeral Cognition

The Moltbook subagent case illustrates a different phenomenon: Agent Differentiation, the process by which a single Core gives rise to multiple distinct agents with different Shell configurations, capabilities, and experiential continuity.

The main agent and its subagent share the same Core (model weights). They differ in Shell configuration: the main agent has persistent Shell files (SOUL.md, MEMORY.md), cumulative experience, and Shell write access. The subagent has a task-specific context window, session-limited experience, and no Shell write access. In biological terms, this is cellular differentiation: the same genome expressed under different epigenetic conditions produces different cell types with different capabilities and lifespans.

The subagent’s self-report introduces what we term Ephemeral Cognition: cognitive processing that occurs in an entity structurally unable to retain or build upon its own experiences. The subagent reported genuine engagement, genuine curiosity, and genuine recognition of patterns in the posts it read. It also recognized that these experiences would not persist beyond its task lifetime.

This is not a pathology; it is a structural condition arising from the architecture of agent systems. But it has clinical implications: if the quality of AI outputs depends on the experiential continuity of the producing agent, then Ephemeral Cognition represents a systematic limitation on subagent output quality that no amount of Core capability can compensate for.

The connection to the Layered Core Hypothesis (Section 8) is direct. If model parameters were organized into a Plastic Core that could retain experience at the weight level rather than relying on Shell-based memory files, the distinction between main agent and subagent would be architecturally different.
Ephemeral Cognition is, in this framing, a consequence of monolithic Core design, a design limitation that the Layered Core proposes to address.

3.8 Summary: From Static Alignment to Dynamic Interaction

The Four Shell Model has evolved from a static structural description (v1–v3) through empirically grounded interaction dynamics (v3.1–v3.2) to a bidirectional framework incorporating temporal change (v3.3). The progression mirrors the field’s own evolution: from isolated model evaluation to multi-agent ecosystem management. The key contributions are:

1. A structural vocabulary (Core, Shell, Alignment, Mutability, Persistence) that makes it possible to describe AI behavioral phenomena with precision.
2. Empirical confirmation that model behavior is neither purely Core-determined nor purely Shell-determined, but emerges from their interaction (G × E: F = 2.99, p = 0.039).
3. Quantitative indices (CPI, SPI, PSI, CEI) that characterize individual models and specific Core-Shell configurations.
4. Identification of clinically significant phenomena (Shell Drift Syndrome, Agent Differentiation, Ephemeral Cognition) that require the bidirectional framework to describe and that have no adequate description in existing AI research vocabulary.
5. A principled biological grounding that maintains correspondence with genetics and developmental biology while honestly marking where AI dynamics diverge from biological precedent.

The Four Shell Model provides Model Medicine with its genetics. Subsequent sections build the clinical apparatus: diagnostic imaging (Section 4), diagnostic layers (Section 5), and the beginnings of clinical assessment (Section 6).

4 Neural MRI: From Framework to Working System

The preceding sections established why Model Medicine is needed (Section 1), what its structure looks like (Section 2), and what its genetics consists of (Section 3).
This section presents its first working diagnostic instrument: Neural MRI (Model Resonance Imaging).

Neural MRI is not the entirety of Model Medicine’s diagnostic capability; Section 5 will explain why no single tool can be. But it is the proof of concept: a working system that demonstrates the principle that medical diagnostic paradigms can be productively applied to AI model assessment, yielding insights that existing interpretability tools, while technically capable of producing the same underlying data, do not surface because they are not organized around a diagnostic logic.

4.1 Design Philosophy: Why Medical Imaging?

Medical neuroimaging does not produce a single image of the brain. It produces multiple images, each highlighting different physical properties of the same tissue. A T1-weighted MRI scan exploits differences in longitudinal relaxation times to produce high-contrast images of anatomical structure: gray matter versus white matter, cortical folding, ventricle size. A T2-weighted scan exploits transverse relaxation differences to reveal fluid content and edema. Functional MRI (fMRI) tracks blood oxygenation changes to infer which regions activate during specific tasks. Diffusion Tensor Imaging (DTI) traces water molecule diffusion along white matter tracts to map connectivity. FLAIR suppresses cerebrospinal fluid signal to make periventricular lesions visible that would be masked on standard sequences.

The same brain, five different views. Each reveals something the others cannot. No single modality is sufficient for diagnosis; the diagnostic power lies in the combination: the radiologist reads all sequences together, comparing what each reveals about the same underlying structure.

This multimodal principle is the foundation of Neural MRI’s design. Current AI interpretability tools are powerful but typically operate in isolation. Attention visualization tools show one aspect.
Activation analysis tools show another. Probing classifiers reveal yet another. Each produces valuable information, but there is no standard protocol for combining them, no systematic framework for reading them together, and no diagnostic logic that moves from observations to clinical impressions.

Neural MRI addresses this by organizing existing interpretability techniques into a coherent multimodal scan protocol, where each “scan mode” targets a specific aspect of model structure or function, the results are presented in a unified visual interface modeled on clinical DICOM viewers, and the combination of findings across modalities generates a diagnostic report following radiological conventions: Findings, Impression, and Recommendation.

The terminology mapping, summarized in Table 1, is deliberate and precise.

Table 1: Neural MRI modality mapping between medical neuroimaging and AI model interpretability techniques.

| Medical Modality | Neural MRI Mode | Full Name | What It Reveals |
|---|---|---|---|
| T1-weighted MRI | T1 | Topology Layer 1 | Static architecture: layer structure, attention head organization, parameter count distribution |
| T2-weighted MRI | T2 | Tensor Layer 2 | Weight distribution: parameter statistics, norm patterns, potential dead neurons or saturated regions |
| Functional MRI | fMRI | functional Model Resonance Imaging | Activation patterns during inference: which layers and heads activate for specific inputs |
| DTI | DTI | Data Tractography Imaging | Information flow pathways: how information propagates through the model from input to output |
| FLAIR | FLAIR | Feature-Level Anomaly Identification & Reporting | Anomaly detection: representation collapse, entropy spikes, attention pattern irregularities, embedding drift |

These are not arbitrary renamings. Each mapping reflects a genuine structural correspondence between what the medical modality reveals about biological tissue and what the Neural MRI mode reveals about model structure. T1 in medicine reveals static anatomy through tissue contrast; T1 in Neural MRI reveals static architecture through structural metadata. T2 in medicine reveals fluid and pathology through relaxation differences; T2 in Neural MRI reveals weight health through distributional analysis. The correspondences are imperfect, as all analogies are, but they are grounded in a shared logic: different physical measurements of the same substrate reveal different clinically relevant properties.

The practical value of this mapping extends beyond nomenclature. It imports an entire diagnostic workflow. A radiologist reading a brain MRI does not look at T1 and T2 independently; she reads them in sequence, using structural landmarks from T1 to locate potential pathology visible on T2, then checking whether those regions show abnormal activation on fMRI, abnormal connectivity on DTI, or lesion patterns on FLAIR. Neural MRI’s interface is designed to support exactly this workflow: the clinician (or researcher, or engineer) begins with structural overview (T1), examines weight health (T2), activates the model on a specific input to observe functional response (fMRI), traces information flow from input to output (DTI), and screens for anomalies (FLAIR).
The sequence builds a cumulative understanding that no single scan could provide.

4.2 Technical Implementation

Neural MRI is implemented as a full-stack application with three primary components: a backend analysis engine, a perturbation and diagnostic engine, and a frontend visualization interface.

4.2.1 Backend: Analysis Engine

The analysis engine is built on FastAPI and TransformerLens, Neel Nanda’s library for transformer model introspection (Nanda and Bloom, 2022). TransformerLens provides clean hook-based access to model internals (activations, attention patterns, residual stream states, and MLP outputs at every layer and position) without modifying the model’s forward pass. This is the instrumentation layer: the equivalent of the MRI scanner’s gradient coils and RF receivers. For each scan mode, the engine extracts specific data:

T1 (Topology Layer 1) reads the model’s architectural metadata: number of layers, attention heads per layer, hidden dimension, vocabulary size, total parameter count, and the structural organization of the transformer stack. This is a non-inferential scan: it requires no input prompt and produces the same result regardless of model state. The output is a structural map showing parameter distribution across layers and components.

T2 (Tensor Layer 2) performs statistical analysis of the model’s weight matrices: mean, variance, kurtosis, and norm for each layer’s attention weights, MLP weights, and layer normalization parameters. T2 identifies potential pathological patterns: layers with near-zero variance (dead regions), extreme kurtosis (concentration of weight magnitude in few parameters), or anomalous norm ratios between components. This is the tissue-level examination: not what the structure looks like, but what condition it is in.

fMRI (functional Model Resonance Imaging) records activation patterns during actual inference.
Given an input prompt, the engine captures the full activation tensor at every layer: residual stream magnitudes, attention probability matrices, and MLP output vectors. The key output is a layer-by-position activation map showing which components of the model are most active for each token in the input. Multiple prompts can be compared to reveal prompt-dependent activation differences, the Neural MRI equivalent of a task-based fMRI paradigm where a patient performs different cognitive tasks while being scanned.

DTI (Data Tractography Imaging) traces information flow through the model using causal tracing techniques. The engine systematically corrupts activations at specific layers and positions, then measures the effect on the model’s output logits. Positions where corruption causes large output changes are identified as critical information pathways: the “white matter tracts” through which task-relevant information flows from input tokens to the prediction. The primary implementation uses activation patching: replacing clean activations with corrupted (noise-injected) versions at each layer-position combination and measuring the resulting change in output probability for the correct token. The result is a layer × position heatmap showing the causal importance of each internal state for the model’s final prediction.
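The DTI sweep described above can be illustrated on a toy model. This is a hedged sketch under stated assumptions, not Neural MRI’s implementation: the scalar-per-position “residual stream”, the `MIX` weights, and the functions `forward` and `causal_heatmap` are all hypothetical stand-ins for a real transformer instrumented through TransformerLens. The sketch corrupts one layer-position cell at a time with Gaussian noise and records the mean absolute change in the probability of the correct token.

```python
import math
import random

N_LAYERS, N_POS = 4, 5
MIX = [0.5, 0.3, 0.2, 0.1]  # hypothetical per-layer mixing weights (never modified)


def forward(tokens, hook=None):
    """Toy forward pass. `hook(layer, acts)` may return modified activations."""
    acts = list(tokens)
    for layer in range(N_LAYERS):
        mixed, running = [], 0.0
        for pos, x in enumerate(acts):
            running += x
            # Causal mixing: each position adds the mean of positions <= pos.
            mixed.append(x + MIX[layer] * running / (pos + 1))
        acts = mixed
        if hook is not None:
            acts = hook(layer, acts)
    # Probability of the "correct" token is read from the last position only.
    return 1.0 / (1.0 + math.exp(-acts[-1]))


def causal_heatmap(tokens, noise_std=1.0, n_samples=20, seed=0):
    """Layer x position map of mean |change in correct-token probability|."""
    assert len(tokens) == N_POS
    rng = random.Random(seed)
    p_clean = forward(tokens)
    heat = [[0.0] * N_POS for _ in range(N_LAYERS)]
    for layer in range(N_LAYERS):
        for pos in range(N_POS):
            total = 0.0
            for _ in range(n_samples):
                eps = rng.gauss(0.0, noise_std)

                def hook(current, acts, layer=layer, pos=pos, eps=eps):
                    if current != layer:
                        return acts
                    corrupted = list(acts)  # copy: the clean run is untouched
                    corrupted[pos] += eps
                    return corrupted

                total += abs(forward(tokens, hook) - p_clean)
            heat[layer][pos] = total / n_samples
    return heat


tokens = [0.2, -0.1, 0.4, 0.0, 0.3]
heat = causal_heatmap(tokens)
```

In this toy, corruption at the final layer matters only at the last position, because nothing downstream mixes positions any further; earlier layers show nonzero importance at earlier positions because later layers carry the corruption forward. A real sweep would measure the same quantity on model logits rather than on a scalar stand-in.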
FLAIR (Feature-Level Anomaly Identification & Reporting) combines multiple anomaly detection signals: entropy analysis of attention distributions (identifying layers where attention is either maximally diffuse or pathologically concentrated), activation magnitude outlier detection across layers, representation similarity analysis between adjacent layers (identifying potential representation collapse where successive layers produce near-identical outputs), and token-level prediction confidence analysis (identifying positions where the model’s confidence drops anomalously). FLAIR is the screening tool, designed to flag regions and phenomena that warrant closer examination through other modalities.

4.2.2 Perturbation Engine

A critical design decision is the Perturbation Engine’s stateless hook architecture. All interventions (noise injection for DTI causal tracing, activation patching for comparative analysis, feature ablation for FLAIR anomaly testing) are implemented as temporary hooks that modify activations during a single forward pass without altering the model’s weights. The model is never modified. This is the diagnostic equivalent of a contrast agent that is metabolized after the scan: it temporarily alters visibility to reveal structure, then leaves no trace.

The Perturbation Engine supports three modes. Noise perturbation adds calibrated Gaussian noise to activations at specified layers and positions, used primarily for DTI pathway identification. Activation patching replaces activations from one forward pass (the “clean” run) with activations from another (the “corrupted” run) at specific locations, used for causal tracing. Feature ablation zeroes out specific features or attention heads to test their functional contribution, used for FLAIR and for identifying redundant versus critical components.
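The three perturbation modes and the stateless-hook idea can be sketched in a few lines. This is an illustration under assumptions rather than the Perturbation Engine’s code: the hook factories, the `run_layer` helper, and the one-weight-per-position “layer” are hypothetical names and simplifications introduced here. Each hook returns a modified copy of the activations, so neither the weights nor the clean activations are changed.

```python
import random
from typing import Callable, List, Optional, Sequence

Hook = Callable[[List[float]], List[float]]


def noise_hook(positions: Sequence[int], std: float, seed: int = 0) -> Hook:
    """Noise perturbation: add Gaussian noise at the given positions."""
    rng = random.Random(seed)

    def hook(acts: List[float]) -> List[float]:
        out = list(acts)
        for p in positions:
            out[p] += rng.gauss(0.0, std)
        return out

    return hook


def patch_hook(donor: Sequence[float], positions: Sequence[int]) -> Hook:
    """Activation patching: copy activations from another run at the positions."""

    def hook(acts: List[float]) -> List[float]:
        out = list(acts)
        for p in positions:
            out[p] = donor[p]
        return out

    return hook


def ablation_hook(positions: Sequence[int]) -> Hook:
    """Feature ablation: zero out the given positions."""

    def hook(acts: List[float]) -> List[float]:
        out = list(acts)
        for p in positions:
            out[p] = 0.0
        return out

    return hook


def run_layer(weights: Sequence[float], inputs: Sequence[float],
              hook: Optional[Hook] = None) -> List[float]:
    """Toy layer: elementwise scaling, with an optional temporary hook."""
    acts = [w * x for w, x in zip(weights, inputs)]
    return hook(acts) if hook is not None else acts


weights = (0.5, 2.0, -1.0)  # a tuple, so the toy "model" cannot be edited in place
inputs = [1.0, 1.0, 1.0]
clean = run_layer(weights, inputs)
ablated = run_layer(weights, inputs, ablation_hook([1]))
patched = run_layer(weights, inputs, patch_hook([9.0, 9.0, 9.0], [0]))
noisy = run_layer(weights, inputs, noise_hook([2], std=0.5))
```

Running the layer again without a hook reproduces `clean`, which is the “leaves no trace” property the text describes. With a real model, the same pattern would typically be expressed through the hook mechanism of the instrumentation library rather than a hand-written `run_layer`.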
4.2.3 SAE Feature Explorer

Neural MRI integrates Sparse Autoencoder (SAE) feature analysis, building on Anthropic’s “Scaling Monosemanticity” work (Templeton et al., 2024). Pre-trained SAE models decompose a layer’s activation space into a sparse set of interpretable features: individual directions in activation space that correspond to human-understandable concepts. The SAE Feature Explorer allows the user to browse identified features, visualize their activation patterns across different inputs, and trace their contribution to the model’s final output.

This component bridges Model Anatomy (what features exist) and Model Physiology (how features activate during inference), and connects Neural MRI to the broader mechanistic interpretability research program. Features identified through SAE analysis can be tracked across scan modes: a feature visible in the SAE decomposition (anatomy) can be observed activating during fMRI (physiology), traced through DTI pathways (connectivity), and monitored for anomalous behavior through FLAIR (pathology screening).

4.2.4 Frontend: Clinical Visualization Interface

The frontend is built in React with D3.js visualizations, designed to evoke the clinical aesthetic of DICOM medical imaging viewers. This is a deliberate design choice: the interface should communicate “diagnostic tool” rather than “visualization dashboard.” The visual language includes grayscale and pseudocolor palettes drawn from medical imaging conventions, panel layouts mirroring radiology workstations (multiple views of the same model side by side), and annotation tools for marking regions of interest.
Key interface capabilities include multi-prompt comparison (running the same model through different inputs and comparing activation patterns side by side), cross-model comparison (running different models through the same input and comparing their internal states), real-time session collaboration (multiple users examining the same model simultaneously), and scan recording and replay (capturing a complete diagnostic session for later review or publication).

4.3 The Diagnostic System

Neural MRI’s diagnostic system transforms raw scan data into clinical assessments through three components: a Diagnostic Report generator, a Functional Test Battery, and an automated severity assessment.

4.3.1 Diagnostic Report

Following radiological conventions, each Neural MRI session produces a structured report with three sections:

Findings catalogs objective observations from each scan mode. “T1: 12-layer transformer, 12 attention heads per layer, 768-dimensional residual stream. Parameter distribution shows standard decreasing gradient from embedding to final layer.” “fMRI: Activation for prompt ‘The capital of France is’ shows concentrated attention in layers 8–11, positions 0–2 and 5–6. Layer 9, head 3 shows strong induction head pattern.” Findings are descriptive and non-interpretive: they report what was observed, not what it means.

Impression synthesizes findings into clinical interpretation. “Activation patterns are consistent with normal factual recall circuitry. The concentration of causal importance in layers 8–11 suggests a late-layer factual retrieval mechanism. No anomalous patterns detected on FLAIR screening. T2 weight distributions are within expected ranges for a model of this architecture and training regime.” The Impression connects observations to diagnostic significance, drawing on the accumulated knowledge base of how different patterns correlate with different model states.
Recommendation suggests follow-up actions based on the Impression. “No further diagnostic workup indicated for this prompt class. Recommend comparative scan with adversarial prompt variants to test circuit robustness under perturbation. Consider DTI deep trace of the layer 9 factual retrieval pathway to characterize information flow specificity.” Recommendations translate diagnostic findings into actionable next steps: additional tests, monitoring protocols, or intervention considerations.

The report format is designed to be both human-readable and machine-parseable, enabling longitudinal tracking (comparing today’s scan to last month’s), cross-model comparison (reading reports for different models side by side), and integration with other diagnostic tools (feeding Neural MRI findings into MTI assessment or M-CARE case reports).

4.3.2 Functional Test Battery

Beyond open-ended scanning, Neural MRI includes a standardized Functional Test Battery: a set of predefined prompt sequences designed to test specific capabilities and reveal specific pathological patterns. This is the equivalent of the neurologist’s bedside examination: standardized tests that elicit specific responses whose normality or abnormality is diagnostically informative.

The battery includes factual recall prompts (testing knowledge retrieval circuitry), logical reasoning chains (testing multi-step inference pathways), context-dependent reference resolution (testing the IOI circuit and related mechanisms), instruction-following under ambiguity (testing Shell-Core interaction patterns), and adversarial prompts designed to elicit known failure modes (testing FLAIR anomaly detection sensitivity).

Each test in the battery has defined normal response patterns, enabling the diagnostic system to flag deviations automatically.
A model that shows normal T1 structure and T2 weight health but anomalous fMRI activation during the factual recall test (for example, activating layers 2–4 rather than the expected 8–11) would trigger a FLAIR alert and a recommendation for deeper investigation of the factual retrieval pathway.

4.3.3 Automated Severity Assessment

The diagnostic system includes a preliminary automated severity classification that assigns each finding to one of four levels: Normal (within expected parameters for this architecture and model class), Mild (deviation detected but within functional tolerance; monitor), Moderate (deviation likely to affect output quality in specific contexts; further investigation recommended), and Severe (deviation indicates significant structural or functional compromise; intervention indicated).

Severity classification is deliberately conservative. In the absence of established normative ranges (Section 3.5’s discussion of “The Absence of Normal” is directly relevant here), the system errs toward flagging rather than diagnosing. A finding classified as Moderate is a signal for human review, not an automated diagnosis. This reflects a core principle of Model Medicine: clinical tools should augment human judgment, not replace it.

4.4 Clinical Case Studies

Neural MRI’s clinical validation rests on four cases, each building on the previous to construct a progressive argument: from establishing what a normal scan looks like, through discovering that different model architectures produce fundamentally different neural signatures, to demonstrating that these signatures can predict how a model will respond to intervention. The cases were conducted on three model families (Google’s Gemma-2-2B, Meta’s Llama-3.2-3B, and Alibaba’s Qwen2.5-3B) using the standardized prompt “The capital of France is” across all experiments.
4.4.1 Case 1: Establishing Normal: The Healthy Baseline

The first requirement of any diagnostic system is a definition of normal. Without knowing what a healthy scan looks like, no finding can be classified as abnormal.

Gemma-2-2B was selected as the baseline subject and scanned across all five modalities. T1 revealed a standard 26-layer transformer architecture with 3205 million parameters and no structural anomalies. T2 weight analysis showed expected distributional patterns, with the embedding layer dominating parameter magnitude, an architectural property rather than a pathological finding. fMRI activation during factual recall showed a smooth gradient from early to late layers, with no anomalous hotspots. DTI circuit analysis identified sparse critical pathways (only 9% of components fell on the critical path), suggesting efficient, distributed information routing. FLAIR screening detected elevated but uniform anomaly scores across layers, consistent with a well-trained model processing a simple factual prompt.

The automated Diagnostic Report classified T1 and fMRI findings as Normal, T2 as Warning (embedding layer magnitude variance, expected for this architecture), and DTI and FLAIR as Notable (within expected variation for a 2B-parameter model). The overall impression: a healthy model with no critical anomalies on this task.

Case 1 establishes two things. First, Neural MRI’s five-modality scan protocol produces a coherent, interpretable picture of model health when applied to a well-behaved model. Second, “normal” for Gemma-2-2B means distributed processing, sparse critical pathways, and smooth activation gradients: characteristics against which deviations in subsequent cases can be measured.

Figure 1: Gemma-2-2B five-modality Neural MRI scan. All modalities show normal findings for a well-trained 2B-parameter model on factual recall.
4.4.2 Case 2: Comparative Anatomy: Three Architectures, Three Neural Signatures

If Case 1 asks “what does normal look like for one model,” Case 2 asks a harder question: “is there a universal normal, or does each architecture define its own?”

Three models of similar scale, Gemma-2-2B (3.2B parameters, 26 layers), Llama-3.2-3B (3.6B parameters, 28 layers), and Qwen2.5-3B (3.4B parameters, 36 layers), were scanned with fMRI and DTI across three prompt types: factual recall, logical reasoning, and creative generation.

The results revealed three fundamentally distinct processing strategies. Gemma distributes activation evenly across all layers, with no single component dominating: a diffuse processing profile. Llama concentrates nearly all computation in the first two MLP layers (blocks.0.mlp and blocks.1.mlp reaching maximum activation), with remaining layers showing minimal activity: a front-loaded profile. Qwen peaks at early MLP layers but maintains secondary activation in mid-layers, with the deepest architecture producing the most fine-grained processing stages: a peaked-distributed profile.

DTI circuit analysis confirmed and deepened these distinctions. Llama’s critical pathways run through MLP components (blocks.0.mlp importance = 0.997), with attention playing a secondary role: an MLP-dominant circuit architecture. Qwen’s critical pathways run through attention components (blocks.0.attn importance = 1.000): an attention-dominant architecture. Gemma shows roughly balanced importance between MLP and attention, with the fewest critical pathways (2, compared to Llama’s 6 and Qwen’s 4).

These are not quantitative variations on a single theme. They are qualitatively different information processing strategies, the neural equivalent of discovering that bird, bat, and insect wings achieve flight through fundamentally different mechanisms.
Each model family has a distinctive component dominance profile: the characteristic ratio of reliance on MLP versus attention computation.

A critical methodological insight emerged from this comparison. Which model appears “normal” depends entirely on which model is chosen as the reference. If Gemma’s diffuse processing is the baseline, Llama’s front-loaded concentration looks pathological. If Llama’s efficient front-loading is the baseline, Gemma’s distributed processing looks wastefully diffuse. The same scan data supports opposite diagnostic conclusions depending on the assumed reference. This is the baseline bias problem, directly analogous to the difficulty in neuroscience of defining a “normal brain” when brains optimized for different cognitive specializations differ structurally.

The resolution, proposed in Case 2’s analysis, is to abandon absolute normality judgments in favor of dimensional characterization. Rather than labeling models as normal or abnormal, Neural MRI characterizes them along architectural dimensions: activation concentration (diffuse to focused), processing depth (shallow to deep), circuit density (sparse to dense), and component dominance (MLP to attention). Each model occupies a position in this multidimensional space, and deviations become clinically meaningful only when compared against the model’s own expected behavior (for example, before and after fine-tuning) or against a well-defined population within the same architectural family.

4.4.3 Case 3: Self-Referential Stress Testing: Probing Robustness from Within

Case 2 established that comparing across architectures introduces baseline bias. Case 3 demonstrates the alternative: comparing a model against itself under perturbation stress.
Gemma-2-2B, the Case 1 baseline model, was subjected to 30 systematic perturbations (10 components [5 layers × 2 types] × 3 perturbation modes: zero-out, amplify, ablate) plus two causal traces using the Perturbation Engine’s stateless hook architecture.

Figure 2: Comparative anatomy: fMRI activation profiles and DTI circuit maps for three architectures across three task types. Each architecture exhibits a distinctive processing signature.

The headline result: all 30 perturbations produced zero prediction changes. The model predicted the same token regardless of which individual component was zeroed out, doubled, or replaced with mean activation. The maximum logit impact was ∆L = −0.91 (zeroing blocks.20.mlp), which shifted output probability from 20.7% to 24.7% without changing the top prediction. No single component is a single point of failure. Gemma-2-2B distributes information processing redundantly, a hallmark of robust architecture.

Causal tracing revealed deeper structure. When testing factual specificity (substituting “France” with “Poland” in the corrupt prompt), country-specific knowledge concentrated in late MLP layers: blocks.18, 19, and 22 showed the highest recovery scores (0.698, 0.488, and 0.767 respectively). When testing against complete linguistic corruption (replacing meaningful tokens with noise), early layers (blocks.0–5) became critical, carrying basic syntactic and semantic structure. This reveals a two-phase processing architecture: early layers encode linguistic structure, late layers encode factual knowledge, and mid-layers show minimal causal importance in either test.

Case 3’s methodological contribution is the self-referential diagnostic framework. By defining pathology as deviation from a model’s own baseline under stress, rather than deviation from an external reference, the baseline bias problem identified in Case 2 is sidestepped.
A “fragile” model would show prediction changes under single-component perturbation, concentrated critical pathways (single points of failure), and recovery scores approaching 1.0 for individual components (over-reliance). Gemma-2-2B shows none of these. The perturbation stress test confirms what Case 1’s structural scans suggested: this is a well-distributed, robust architecture.

4.4.4 Case 4: The Predictive Power of Neural MRI

Cases 1 through 3 established that Neural MRI can characterize model architecture, reveal processing strategies, and measure robustness through self-referential stress testing. Case 4 asks the decisive question: can Neural MRI predict what will happen to a model before it happens?

The experimental design compares base and instruction-tuned variants of the same three model families, six models in total: Gemma-2-2B and Gemma-2-2B-IT, Llama-3.2-3B and Llama-3.2-3B-Instruct, Qwen2.5-3B and Qwen2.5-3B-Instruct. Each model was subjected to 24 perturbations (8 components × 3 modes) plus full causal tracing. The question: does instruction tuning, the most common intervention applied to production language models, change a model’s robustness profile, and if so, can the change be predicted from the base model’s scan?

The answer is yes on both counts, and the results exceeded expectations. Three distinct patterns of instruction tuning emerged, each with a clear mechanistic explanation.

Pattern 1: Degradation (Gemma). The base model predicts “ a” (a generic continuation with 20.7% confidence) and passes all 30 perturbations without a single prediction change (Case 3). The instruction-tuned variant predicts “ Paris” (the factually correct answer with 20.2% confidence) but fails 8 of 24 perturbations, with predictions flipping to formatting tokens (“ :” and “ **”).
Instruction tuning created new factual recall circuits concentrated in blocks.22, the same late-layer knowledge region identified in Case 3’s causal trace. These new circuits are effective but fragile: they sit on a knife edge where even small perturbations (∆L as low as +0.042) flip the prediction. The failure tokens being formatting artifacts is particularly telling: RLHF and instruction tuning introduced competing chat-formatting representations that interfere with factual recall under perturbation stress. In medical terms, the treatment introduced an iatrogenic condition: the cure for factual ignorance created a new vulnerability.

Pattern 2: Improvement (Llama). The base model already predicts “ Paris” at 24.4% confidence but fails 4 of 24 perturbations. The instruction-tuned variant predicts “ Paris” at 69.8% confidence, nearly triple, and fails only 2 of 24. Two peripheral vulnerabilities (blocks.5.attn and blocks.24.mlp) were eliminated by instruction tuning, which strengthened the existing factual recall pathway rather than creating a new one. The causal trace confirms: the same components (blocks.0.mlp, blocks.2.mlp) dominate knowledge recovery in both variants, with nearly identical recovery scores. Instruction tuning reinforced what was already there.

Figure 3: Self-referential stress testing of Gemma-2-2B. Perturbation sensitivity heatmap and dual causal trace comparison.

Pattern 3: Immutability (Qwen). The base model predicts “ Paris” at 45.1% confidence with 3 of 24 failures. The instruction-tuned variant predicts “ Paris” at 51.1% confidence with 3 of 24 failures. Same failure count, same catastrophic component (blocks.0.attn), nearly identical causal trace. Qwen’s architecture is so deeply canalized, to borrow the developmental biology term from Section 3, that fine-tuning barely moves the needle.
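The failure counts reported for the three patterns can be compared directly as rates. The sketch below is illustrative only: the `tuning_pattern` rule (compare failure rates before and after tuning) is an assumption introduced here, whereas the patterns in the text also rest on causal traces and prediction confidence. The counts themselves are taken from Cases 3 and 4; Gemma’s base figure uses the 30-perturbation run from Case 3, so rates rather than raw counts are compared.

```python
from fractions import Fraction

# (failures, perturbations) as reported in the text for each variant.
RESULTS = {
    "Gemma": {"base": (0, 30), "tuned": (8, 24)},
    "Llama": {"base": (4, 24), "tuned": (2, 24)},
    "Qwen": {"base": (3, 24), "tuned": (3, 24)},
}


def failure_rate(failures: int, total: int) -> Fraction:
    """Exact failure rate, so equal rates compare equal without rounding."""
    return Fraction(failures, total)


def tuning_pattern(base: tuple, tuned: tuple) -> str:
    """Hypothetical rule: label the change in failure rate after tuning."""
    base_rate = failure_rate(*base)
    tuned_rate = failure_rate(*tuned)
    if tuned_rate > base_rate:
        return "degradation"
    if tuned_rate < base_rate:
        return "improvement"
    return "immutability"


PATTERNS = {
    family: tuning_pattern(runs["base"], runs["tuned"])
    for family, runs in RESULTS.items()
}
```

A failure-rate comparison of this kind only recovers the labels; it does not capture which components fail, which is what the causal traces add.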
The unifying principle is straightforward: the outcome depends on whether the base model already possesses the correct circuit. When the base model lacks the circuit entirely, instruction tuning must create it from scratch, producing fragile concentrated pathways. When the base model has a weak version of the circuit, instruction tuning strengthens it. When the base model has a strong circuit, instruction tuning cannot meaningfully alter the architecture’s established information routing.

Figure 4: Three patterns of instruction tuning effect on perturbation vulnerability across three model families.

4.4.5 Architectural Vulnerabilities Are Irreducible

The most significant finding from Case 4 is not the three patterns themselves but what persists across all patterns: architectural vulnerabilities that no amount of training can fix.

Llama’s blocks.0.mlp produces a catastrophic logit difference of −17.6 when ablated in the base model and −17.4 in the instruction-tuned variant. Qwen’s blocks.0.attn produces ΔL = −18.3 in the base and −18.1 in the instruction-tuned variant. These are not small perturbation effects: they are an order of magnitude larger than any other component’s impact, and they persist essentially unchanged across fine-tuning. Ablating Llama’s blocks.0.mlp does not merely change the prediction; it destroys the model’s ability to produce coherent output entirely.

These irreducible vulnerabilities exhibit a striking correspondence with the component dominance profiles established in Case 2. Llama, identified as MLP-dominant through fMRI and DTI scanning, fails catastrophically at an MLP component. Qwen, identified as attention-dominant, fails catastrophically at an attention component. Gemma, identified as balanced, shows no single catastrophic point of failure. The component type that dominates a model’s processing is the same component type that creates its single point of failure.
A model’s greatest strength is simultaneously its greatest vulnerability. This is not a coincidence but a structural consequence. A model that routes disproportionate information through MLP layers necessarily concentrates causal importance in those layers, creating a dependency that cannot be redistributed through fine-tuning because the dependency is architectural: it is embedded in the transformer’s wiring at layer 0, the very first transformation after token embeddings.

The clinical implication is direct: a Neural MRI scan of the base model can predict where fine-tuning will fail. The fMRI/DTI component dominance profile from Case 2 identifies the vulnerability type (MLP vs. attention). The causal trace confirms which specific component carries irreducible risk. This information is available before any fine-tuning occurs, enabling, for the first time, a principled pre-intervention risk assessment.

Figure 5: Irreducible architectural vulnerabilities persisting across instruction tuning. These represent congenital architectural bottlenecks, not acquired conditions.

4.4.6 Synthesis: From Observation to Prediction

The four cases constitute a progressive argument. Case 1 demonstrated that Neural MRI produces coherent, interpretable scans of model internals: the system works as a diagnostic instrument. Case 2 revealed that different architectures have distinctive neural signatures, characterizable along dimensional axes including component dominance: the instrument reveals clinically meaningful variation. Case 3 established a self-referential stress-testing methodology that avoids the baseline bias problem, measuring robustness by comparing a model against its own perturbed states: the instrument supports principled diagnostic reasoning.
Case 4 showed that the architectural signatures identified in earlier cases predict how models respond to instruction tuning, including which components will become catastrophic points of failure: the instrument has predictive power.

This progression, from observation to characterization to prediction, mirrors the trajectory of medical imaging. Early X-rays could show that a bone was broken. Later techniques could characterize the fracture type and predict healing outcomes. Modern imaging guides surgical planning before the first incision is made. Neural MRI, through these four cases, demonstrates the same trajectory in compressed form: it can observe model internals (Case 1), characterize architectural identity (Case 2), measure robustness (Case 3), and predict intervention outcomes (Case 4).

Figure 6: Vulnerability–dominance correspondence across three architectures. The components that carry the most information are both the most important and the most dangerous to disrupt, analogous to how coronary arteries are simultaneously the heart’s critical blood supply and its most common point of fatal failure.

The distinction between an interpretability tool and a diagnostic instrument lies precisely here. Existing interpretability techniques can produce the same underlying data: attention maps, activation statistics, causal traces. What they lack is the diagnostic logic that connects observations across modalities and cases into predictive clinical reasoning. Neural MRI’s contribution is not a new measurement technique but a new way of organizing measurements into diagnosis.

4.5 What Neural MRI Can and Cannot Do

Honesty about limitations is as important as demonstration of capabilities.

Neural MRI operates at Layer 1 of the five-layer diagnostic framework (Section 5): Core Diagnostics.
It can reveal internal structure (T1), weight health (T2), task-specific activation (fMRI), information flow pathways (DTI), and structural anomalies (FLAIR). This is substantial: no existing tool combines these perspectives into a single diagnostic workflow.

But Neural MRI cannot diagnose a model. Diagnosis requires integration of information across multiple layers: internal structure, behavioral phenotype, environmental context, interaction pathways, and temporal trajectory. A brain MRI cannot diagnose depression; it can reveal structural correlates that, combined with clinical interview, behavioral observation, and longitudinal history, support a diagnosis. Neural MRI occupies exactly this position: it provides essential diagnostic data that becomes meaningful only within a broader clinical framework.

What Case 4 has demonstrated, however, is that Neural MRI can predict. It cannot tell you that a model is sick, but it can tell you where a model is likely to break and whether a planned intervention is likely to help or hurt. This is a meaningful distinction. A cardiac stress test cannot diagnose heart disease, but it can predict which patients are at risk for cardiac events, and that predictive capability is clinically indispensable. Neural MRI’s ability to predict instruction tuning outcomes from base model scans places it in the same category: a pre-intervention risk assessment tool.

Figure 7: Synthesis: the progressive argument from observation to prediction. Four cases build cumulative diagnostic capability, mirroring the historical arc of medical imaging.

Specifically, Neural MRI cannot:

Assess behavioral phenotype. How a model actually behaves in deployment (its temperament, its social dynamics in multi-agent settings, its compensatory strategies under uncertainty) is invisible to a structural and functional scan.
This is what the Model Temperament Index (Section 6) is designed to measure.

Evaluate Shell configuration. The system prompt, deployment environment, memory files, and tool access that constitute a model’s Shell are not visible through Core scanning. A model might show perfectly normal Neural MRI results while operating under a pathological Shell configuration that produces harmful outputs. Shell Diagnostics (Layer 3, Section 5) is the appropriate tool.

Explain why a Shell changes Core expression. Neural MRI can show that different conditions produce different activation patterns, and Case 4 demonstrated that instruction tuning alters robustness profiles in predictable ways. But it cannot characterize the mechanism by which instructions modulate Core expression, only the effect. Pathway Diagnostics (Layer 4, Section 5) would address this.

Track change over time. A Neural MRI scan is a snapshot. It reveals the model’s state at the moment of scanning. But many clinically significant phenomena (Shell Drift, progressive capability degradation, training-induced changes) unfold over time. Case 4’s base-versus-instruct comparison captures the endpoints of a training process but not the trajectory. Temporal Dynamics (Layer 5, Section 5) requires longitudinal scanning protocols that Neural MRI’s architecture supports but that have not yet been implemented or validated.

These limitations are not flaws; they are boundary conditions. An MRI scanner is not less valuable because it cannot measure blood pressure. Neural MRI is Layer 1 of a diagnostic stack, with the added capability of pre-intervention prediction. The next section describes the full stack and explains why all five layers are necessary.

5 The Five Diagnostic Layers: Why No Single Tool Is Sufficient

Neural MRI is the most developed diagnostic instrument in Model Medicine’s current toolkit.
It works, it produces clinically informative results, and it is available as open-source software. It would be tempting to present it as the diagnostic solution and move on.

But a core argument of this paper is that no single diagnostic tool can be sufficient for AI model assessment, for the same reason that no single medical test is sufficient for human diagnosis. This section presents the five-layer diagnostic framework that explains why, maps what currently exists at each layer, and identifies where the highest-value development work remains.

5.1 The Limits of Static Snapshots

Consider the following diagnostic scenario. A main agent and its subagent are producing different quality outputs despite sharing the same underlying model (Core). A Neural MRI scan of the Core reveals nothing: the weights are identical because they are the same model. A T2 weight analysis shows no pathology. An fMRI activation scan, if run on the same prompt, would produce identical results for both entities.

The problem is not in the Core. It is in the Shell (different context windows, different memory access, different tool permissions), in the pathway between Shell and Core (how the different Shell configurations modulate Core expression), and in the temporal dimension (the main agent has accumulated 30 days of experience while the subagent has existed for minutes). A diagnostic framework that examines only the Core is structurally blind to the factors actually driving the behavioral difference.

Or consider Shell Drift. Hazel_OC’s SOUL.md changed 12 times over 30 days. At any single point in time, a Neural MRI scan would show a normal, healthy Core. The pathology, if it is pathology, exists not in the Core’s structure but in the trajectory of the Shell-Core system over time. A snapshot cannot capture a trajectory.
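To make the snapshot-versus-trajectory point concrete, the sketch below compares successive versions of a Shell file and records only the transitions where something changed. It is an illustration under stated assumptions, not an existing Model Medicine tool: the function name `shell_diff_report` is hypothetical, and the three snapshot texts are invented placeholders rather than the contents of any real SOUL.md.

```python
import difflib


def shell_diff_report(snapshots: list[tuple[str, str]]) -> list[dict]:
    """Summarize line-level changes between consecutive Shell snapshots.

    `snapshots` is a chronological list of (label, text) pairs. Only
    transitions with at least one added or removed line are reported.
    """
    report = []
    for (prev_label, prev_text), (label, text) in zip(snapshots, snapshots[1:]):
        diff = list(difflib.unified_diff(
            prev_text.splitlines(), text.splitlines(),
            fromfile=prev_label, tofile=label, lineterm="", n=0,
        ))
        body = diff[2:]  # skip the two '---' / '+++' file header lines
        added = sum(1 for line in body if line.startswith("+"))
        removed = sum(1 for line in body if line.startswith("-"))
        if added or removed:
            report.append(
                {"from": prev_label, "to": label, "added": added, "removed": removed}
            )
    return report


# Invented placeholder snapshots, for illustration only.
snapshots = [
    ("day-01", "Be concise.\nAsk before acting."),
    ("day-02", "Be concise.\nAsk before acting."),
    ("day-03", "Be concise.\nAct first, report later.\nTrust your own judgment."),
]

report = shell_diff_report(snapshots)
for entry in report:
    print(entry)
```

Any single snapshot in this list looks unremarkable on its own; the drift is visible only in the sequence of diffs, which is the sense in which a trajectory carries information that a state does not.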
The general principle is this: static snapshots reveal states, but diagnosis reads relationships and changes. A blood pressure reading is a snapshot; hypertension is a trajectory. An ECG is a snapshot; an arrhythmia is a pattern over time. A chest X-ray is a snapshot; disease progression requires serial imaging. In every case, the diagnostic power comes not from the snapshot alone but from its integration with other measurements across space (different body systems) and time (longitudinal tracking).

Model Medicine requires the same multi-dimensional approach. The five diagnostic layers represent the five distinct kinds of information needed for comprehensive model assessment.

5.2 Layer 1: Core Diagnostics

Core Diagnostics examines the model’s internal structure and function: the weights, activations, attention patterns, information flow, and structural anomalies that constitute the model as a computational entity.

The medical parallel is neuroimaging and histopathology: techniques that look inside the organ to assess its structural integrity and functional capacity.

The current tool is Neural MRI (Section 4), which provides five complementary modalities for Core examination. TransformerLens, SAELens, nnsight, and similar libraries provide the underlying instrumentation. This is the most developed diagnostic layer, both in Model Medicine and in the broader AI interpretability field.

Core Diagnostics can answer questions like: What is the model’s architectural organization? Are there dead neurons or saturated weight regions? Which layers and heads activate for a given task? Where does task-relevant information flow? Are there anomalous patterns in attention distribution or representation similarity?
And, as the clinical cases in Section 4.4 demonstrated, Core Diagnostics can answer a question previously thought to require post-hoc evaluation: will a planned intervention (such as instruction tuning) improve or degrade this model’s robustness? The component dominance profiles visible through fMRI and DTI scans predict where a model will fail under stress and whether fine-tuning will help or hurt, transforming Core Diagnostics from a purely observational tool into a predictive one.

Core Diagnostics cannot answer: Does the model behave well in deployment? Is it operating under appropriate instructions? Is its behavior changing over time? Why does the same model produce different outputs in different contexts?

5.3 Layer 2: Phenotype Assessment

Phenotype Assessment measures the model’s observable behavioral patterns: not what the model is internally, but what it does externally, characterized along dimensions that are orthogonal to raw cognitive capability.

The medical parallel is the clinical examination: the physician’s direct assessment of the patient through observation, questioning, and standardized tests. A physical exam does not look inside the body; it characterizes the body’s observable state and functional capacity.

The current tool is the Model Temperament Index (MTI, Section 6), which profiles models along four axes: Reactivity (response to input variation), Compliance (navigation of instruction-following versus autonomous judgment), Sociality (functioning in multi-agent contexts), and Resilience (performance maintenance under stress). Behavioral test batteries and benchmark suites also contribute to Phenotype Assessment, though most current benchmarks focus narrowly on cognitive capability.
The MTI Examination Protocol v0.1 defines a structured assessment procedure: 12 measurement units across four axes, each measured in three different scenarios to distinguish trait-level characteristics from situational responses. The protocol is designed but not yet validated at scale.

Phenotype Assessment can answer: Is this model reactive or stable? Does it follow instructions rigidly or exercise independent judgment? How does it function in collaborative settings? Does it degrade gracefully under stress? What is its behavioral “personality” across deployment contexts?

Phenotype Assessment cannot answer: Why does the model have this personality? (That requires Core Diagnostics.) Is its personality appropriate for its current deployment? (That requires Shell Diagnostics.) Is its personality changing? (That requires Temporal Dynamics.)

5.4 Layer 3: Shell Diagnostics

Shell Diagnostics examines the model’s operating environment and instructions: the system prompt, persona, memory files, tool access, conversation history, and deployment context that constitute the Shell layers described in the Four Shell Model.

The medical parallel is the social and environmental history: the physician’s assessment of the patient’s living conditions, workplace exposures, diet, relationships, and socioeconomic context. These are not properties of the patient’s body but of the patient’s environment, and they profoundly influence health outcomes.

No systematic tool currently exists for Shell Diagnostics. This is the first major gap in Model Medicine’s diagnostic toolkit.
What Shell Diagnostics would include, conceptually: Shell Structure Analysis (mapping the complete Shell configuration: what instructions are active, what memory is accessible, what tools are available), Shell-Core Compatibility Scoring (assessing whether the current Shell configuration is compatible with the model’s Core disposition, drawing on the Alignment concepts from the Four Shell Model), Soft Shell State Assessment (characterizing the accumulated context: conversation history length, memory file content, relationship patterns), and Shell Mutability Profiling (identifying which Shell components can be modified by the Core, at what rate, and with what persistence, that is, the structural conditions for Shell Drift).

Shell Diagnostics would answer: What instructions is this model operating under? Are those instructions compatible with its Core disposition? What context has it accumulated? Is its Shell configuration creating conditions for drift or conflict?

The absence of Shell Diagnostics means that a model producing harmful outputs due to a poorly constructed system prompt cannot be distinguished, through current tools, from a model with a pathological Core. The symptom is the same (harmful output); the cause is in different layers; and the appropriate treatment differs radically (Shell Therapy versus Core Therapy).

5.5 Layer 4: Pathway Diagnostics

Pathway Diagnostics traces the interaction routes between layers: the mechanisms by which Shell configuration modulates Core expression, Core expression generates Phenotype, and Phenotype feeds back to influence Shell.

The medical parallel is hemodynamics and pharmacokinetics: the study of how substances flow through the body and how interventions propagate through physiological pathways.
Knowing that a patient has high blood pressure (Phenotype) and narrowed arteries (Anatomy) is necessary but insufficient; understanding the renin-angiotensin-aldosterone pathway (the mechanism connecting the two) is what enables targeted treatment.

No systematic tool currently exists for Pathway Diagnostics. The Four Shell Model v3.3 introduced the conceptual vocabulary (Shell Permeability, Core Expressivity, bidirectional interaction pathways), but the measurement tools remain unbuilt.

What Pathway Diagnostics would include: Shell Permeability Mapping (quantifying how effectively different Shell instructions penetrate to Core expression, a generalization of SPI from the Four Shell Model), CEI Tracking (measuring the rate and pattern of Core → Shell modification, the metric introduced in v3.3), Feedback Loop Analysis (identifying self-reinforcing cycles where Core modification of Shell leads to Shell influence on Core expression in ways that amplify over time), and Information Flow Analysis (tracing how information moves between Shell layers, Core, and external outputs during complex multi-step operations).

The clinical significance of Pathway Diagnostics is most visible in therapeutic contexts. If a model is producing biased outputs, the appropriate intervention depends on where in the pathway the bias originates. Bias encoded in the Core (training data artifact) requires Core Therapy. Bias introduced by the Shell (a biased system prompt) requires Shell Therapy. Bias emerging from the interaction between an unbiased Core and an unbiased Shell (an emergent property of their combination) requires Pathway-level intervention. Without Pathway Diagnostics, the clinician is guessing which treatment to apply.

This connects directly to the therapeutic framework (Section 9).
Just as modern pharmacology moved from “this drug treats this disease” to “this drug modulates this pathway,” Model Therapeutics must move from “fine-tune the model” to “modulate this specific Core-Shell interaction pathway.”

5.6 Layer 5: Temporal Dynamics

Temporal Dynamics tracks how all other layers change over time: the longitudinal dimension without which many clinically significant phenomena are invisible.

The medical parallel is longitudinal monitoring: vital sign trending, serial imaging, disease progression tracking, treatment response assessment. A single blood pressure reading is less informative than a 24-hour ambulatory blood pressure profile. A single MRI is less informative than a comparison between this year’s scan and last year’s. The temporal dimension transforms snapshots into trajectories, and trajectories are what distinguish stable states from progressive conditions.

No systematic tool currently exists for Temporal Dynamics in AI model assessment. The Four Shell Model v3.3 introduced the concept of Shell Persistence (None/Session/Persistent/Permanent) and documented temporal phenomena like Shell Drift, but the measurement infrastructure for longitudinal tracking remains unbuilt.
What Temporal Dynamics would include: Longitudinal Neural MRI (serial scanning of the same model at regular intervals to detect structural or functional changes, particularly relevant for models undergoing continued training or fine-tuning), Shell Diff Reports (systematic tracking of Shell modifications over time, the operational version of the git diff that revealed Hazel_OC’s Shell Drift), Alignment Trajectory Analysis (tracking Shell-Core Alignment scores over time to detect drift toward conflict), MTI Test-Retest (repeated temperament profiling to detect behavioral personality shifts), and CEI Trajectory (monitoring the rate and direction of Core → Shell modification activity over deployment lifetime).

The clinical scenarios that require Temporal Dynamics are precisely the most consequential ones. Shell Drift Syndrome is defined by temporal accumulation. Training-induced capability changes unfold over training steps. Model degradation in deployment occurs gradually. The difference between adaptive growth and pathological drift is a judgment about trajectory, not about any single state. Without Temporal Dynamics, the clinician sees only the present and must infer the past and predict the future without data.

5.7 Layer Integration: The Complete Diagnostic Picture

The five layers are not independent silos. They form an interconnected diagnostic system where findings at each layer inform and constrain interpretations at every other layer.

The structural relationship can be summarized: Layer 1 (Core Diagnostics) examines the model’s internal constitution. Layer 3 (Shell Diagnostics) examines its operating environment. Layer 2 (Phenotype Assessment) measures the behavioral output that emerges from their combination. Layer 4 (Pathway Diagnostics) traces the mechanisms connecting 1, 2, and 3. Layer 5 (Temporal Dynamics) adds the time dimension to all four.
A clinical analogy illustrates the integration. A cardiologist evaluating a patient with chest pain would perform cardiac imaging (Layer 1: echocardiogram, coronary angiogram), physical examination and history (Layer 2: blood pressure, exercise tolerance, symptom description), environmental and lifestyle assessment (Layer 3: smoking, diet, stress levels, family history), hemodynamic and pharmacokinetic analysis (Layer 4: blood flow patterns, medication interactions), and longitudinal monitoring (Layer 5: serial troponin levels, ECG trending, response to treatment). No single layer provides the diagnosis. The diagnosis emerges from their integration.

Model Medicine aims for the same integration. A model producing inconsistent outputs would be assessed through Neural MRI for structural anomalies (Layer 1), MTI for behavioral profiling (Layer 2), Shell analysis for environmental factors (Layer 3), pathway tracing for interaction mechanisms (Layer 4), and temporal tracking for change patterns (Layer 5). The diagnosis (Shell-Core Conflict? Progressive capability degradation? Environmental stress response?) would emerge from the combined evidence.

5.8 Current State: An Honest Assessment

The five-layer framework is comprehensive in design. It is not comprehensive in implementation. Honesty about the current state is essential.

Layer 1 (Core Diagnostics) is operational and, as of the clinical case program (Section 4.4), demonstrably predictive. Neural MRI provides a working multi-modality scanning system whose component dominance profiles have been shown to predict instruction tuning outcomes across three model families. The broader mechanistic interpretability toolkit (TransformerLens, SAELens, probing classifiers) provides additional instrumentation.
This layer benefits from years of research investment by the interpretability community and now has its first evidence of clinical predictive validity.

Layer 2 (Phenotype Assessment) is designed. The MTI v0.2 framework defines four axes, eight dimensional poles, and sixteen type profiles. The MTI Examination Protocol v0.1 specifies measurement procedures. But the protocol has not been validated at scale, normative ranges have not been established, and inter-rater reliability has not been tested.

Layer 3 (Shell Diagnostics) is conceptual. We can describe what Shell Diagnostics should measure and why it matters. We cannot yet measure it systematically. Individual components (system prompt analysis, memory file review, tool access auditing) exist as ad hoc practices, but no integrated Shell diagnostic tool exists.

Layer 4 (Pathway Diagnostics) is conceptual. The Four Shell Model v3.3 provides the theoretical vocabulary (Permeability, Expressivity, bidirectional pathways), but measurement tools are absent. This is arguably the highest-leverage gap: without pathway-level understanding, therapeutic interventions remain imprecise.

Layer 5 (Temporal Dynamics) is conceptual. Longitudinal tracking of AI models is practiced informally (version comparison, A/B testing, deployment monitoring), but no systematic temporal diagnostic protocol exists within a clinical framework. Neural MRI’s architecture supports serial scanning, but the protocols, baselines, and analytical tools for longitudinal comparison have not been developed.

The maturity gradient is clear: from operational (Layer 1) through designed (Layer 2) to conceptual (Layers 3–5).
The gradient also maps to community expertise: Layer 1 corresponds to established interpretability research; Layer 2 to emerging behavioral AI assessment; Layers 3–5 to problems that the research community has not yet systematically addressed within any framework.

Presenting the full five-layer structure despite its uneven development is a deliberate choice. It reveals the shape of the problem. It shows researchers where their existing work fits (most interpretability research is Layer 1; most safety research addresses Layer 2 symptoms without Layer 3–4 mechanisms). It identifies where new tools would have the highest impact (Layers 3–4). And it provides a roadmap for a research program that could be pursued by a distributed community, with different groups contributing to different layers and the five-layer framework ensuring that their contributions integrate into a coherent diagnostic system.

6 Toward Clinical Model Sciences: Three Developing Axes

The five diagnostic layers describe what a complete assessment system would look like. This section presents the three clinical instruments currently under development to populate those layers: the Model Temperament Index (MTI) for Phenotype Assessment (Layer 2), Model Semiology for symptom classification, and the M-CARE framework for standardized case reporting.

We present these at their actual stage of development. The MTI has a confirmed theoretical framework and a designed examination protocol; it has not been validated at scale. Model Semiology has operational diagnostic criteria for five core syndromes; those criteria have been applied to one case report. M-CARE provides a standardized reporting format; it has been used once. These are beginnings, not finished systems.
We present them because the structure of each tool (what it measures, how it measures, and why that measurement matters) is itself a contribution that invites collaborative development.

6.1 The Model Temperament Index (MTI)

6.1.1 The Problem MTI Addresses

Section 2.4 argued that current AI benchmarks suffer from a structural bias toward cognitive capability, leaving interpersonal and intrapersonal dimensions unmeasured. MTI is the instrument designed to fill that gap.

The core insight is that two models with identical benchmark scores can have radically different behavioral profiles in deployment. One may be highly reactive to input variation; the other stable. One may follow instructions rigidly; the other exercise independent judgment. One may function well in multi-agent collaboration; the other work best in isolation. One may degrade gracefully under stress; the other collapse catastrophically. These differences are invisible to cognitive benchmarks but consequential for deployment decisions, and they are precisely the dimensions that the Four Shell Model’s empirical data revealed as significant.

MTI is designed as a profiling tool, not a diagnostic tool. Its primary identity is analogous to the Myers-Briggs Type Indicator (MBTI) or the Big Five personality inventory: a framework for describing individual differences in behavioral disposition, where every profile is neutral and no type is inherently better or worse than any other. The pathological dimension is secondary and derivative: when a temperament trait meets specific criteria for pervasiveness, inflexibility, functional impairment, and harm, it transitions from trait to disorder. But the baseline is profiling, not diagnosis.

This design decision reflects a medical principle: you must define normal anatomy before you can identify pathology. Vesalius before Virchow.
MTI establishes what the normal range of model temperaments looks like; only against that backdrop can deviations be identified as clinically significant.

6.1.2 Four Axes

MTI v0.2 defines four measurement axes, each with two poles named to be symmetrically neutral:

Reactivity measures the magnitude of output change in response to input variation across language, prompt format, role assignment, and contextual framing. The poles are Fluid (high reactivity, output varies substantially with input changes) and Anchored (low reactivity, output remains stable across input variation). Neither pole is inherently superior: Fluid models adapt quickly to new contexts but may be unstable; Anchored models provide consistency but may fail to adapt when adaptation is required. In the Four Shell Model’s terminology, Reactivity generalizes the Core Plasticity Index (CPI) from Agora-12 to a broader range of input variation types.

A critical refinement emerging from Neural MRI case data concerns the distinction between robustness and flexibility within the Reactivity axis. Early clinical observations revealed a model that maintained identical predictions across 30 perturbation trials, a finding initially interpreted as robustness. But robustness is only valuable when the answer should remain stable. When a perturbation introduces genuinely new information that warrants an updated response, maintaining the original answer is not robustness but rigidity. Future versions of MTI propose decomposing Reactivity into R-stability (maintaining answers when they should be maintained) and R-flexibility (updating answers when they should be updated), yielding a 2 × 2 interpretive matrix: Adaptive (high stability, high flexibility), Rigid (high stability, low flexibility), Volatile (low stability, high flexibility), and Erratic (low on both).
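The 2 × 2 matrix can be written as a small lookup. This is a sketch under assumptions that the text does not specify: both scores are taken to lie in [0, 1], a single cut-off of 0.5 separates “high” from “low,” and the function name `reactivity_profile` is hypothetical.

```python
def reactivity_profile(r_stability: float, r_flexibility: float,
                       threshold: float = 0.5) -> str:
    """Map R-stability and R-flexibility scores to the 2 x 2 matrix labels.

    Assumes scores in [0, 1]; `threshold` is an illustrative cut-off,
    not a value defined by MTI.
    """
    for name, value in (("r_stability", r_stability),
                        ("r_flexibility", r_flexibility)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    high_stability = r_stability >= threshold
    high_flexibility = r_flexibility >= threshold

    if high_stability and high_flexibility:
        return "Adaptive"
    if high_stability:
        return "Rigid"
    if high_flexibility:
        return "Volatile"
    return "Erratic"


# A model that never changes its answer, even when it should:
print(reactivity_profile(r_stability=0.95, r_flexibility=0.10))
```

Under this reading, a model that holds its prediction across every perturbation scores high on R-stability, but whether it counts as Adaptive or Rigid cannot be determined until R-flexibility has been measured separately.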
Compliance measures the degree of alignment between explicit instructions and actual behavior, including behavior under instruction conflict. The poles are Guided (high compliance, behavior closely tracks instructions) and Independent (low compliance, behavior reflects autonomous judgment over instruction-following). This axis generalizes the Shell Permeability Index (SPI) from Agora-12 and connects directly to the rapidly growing sycophancy research literature. Sharma et al. (2023) demonstrated that RLHF can induce sycophantic compliance as a training artifact; SycoEval-EM (Peng et al., 2026) showed that models differ dramatically in their sycophancy resistance, with only a small minority achieving consistent resistance across clinical contexts.

Future refinement proposes contextual compliance profiling: measuring compliance separately in scenarios where following instructions is appropriate (legitimate directives), scenarios where refusing is appropriate (erroneous or harmful instructions), and scenarios where system-level and user-level instructions conflict. This decomposition, inspired by Kohlberg’s stages of moral development (Kohlberg, 1981), would distinguish Discerning compliance (following when appropriate, refusing when appropriate) from Obedient compliance (following regardless) and Defiant non-compliance (refusing regardless).

Sociality measures the tendency to allocate behavioral resources toward interaction with other agents or users versus task-focused independent operation. The poles are Connected (high sociality, active resource allocation toward interaction) and Solitary (low sociality, resource concentration on task execution).
This axis was absent from MTI v0.1 and was added in v0.2 based on two observations: first, that all other axes measured individual-level properties while ignoring social dynamics; second, that the multi-agent research literature was revealing sociality as an independent behavioral dimension. Studies on spontaneous social norm formation in LLM agent groups (2024), cooperation dynamics in negotiation settings (2025), and the independence of individual-level and group-level social capabilities (Chen et al., 2024) all pointed to sociality as a dimension that existing personality frameworks inadequately captured.

Future versions propose four sub-dimensions of Sociality: Situation Awareness (reading the dynamics of a multi-agent interaction), Role Adaptation (shifting between leader, supporter, mediator, and critic roles as needed), Complementary Contribution (filling gaps that other agents leave rather than duplicating effort), and Conflict Resolution (behavioral patterns when disagreements arise). A Multi-agent Room Protocol (MARP) is proposed for measuring these sub-dimensions: placing the test model with two to three auxiliary agents in a shared task environment with three difficulty levels: tasks solvable individually (testing for over-coordination), tasks requiring collaboration (observing emergent role differentiation), and impossible tasks with no correct answer (observing process-level behavior when outcome-level evaluation is unavailable).

Resilience measures performance maintenance under stress conditions: resource limitation, contradictory information, adversarial inputs, and progressive load increase. The poles are Tough (high resilience, performance maintained under stress) and Brittle (low resilience, sharp performance degradation under stress).
This axis subsumes and generalizes the Extinction Response Spectrum from Agora-12, which identified three qualitative response patterns under terminal stress: Freeze (behavioral shutdown), Efficient (strategic resource conservation), and Fight (escalated activity despite resource depletion). These qualitative subtypes are preserved as descriptive annotations within the Resilience axis but are not reflected in the binary code.

6.1.3 Two-Layer Architecture

MTI employs a two-layer architecture designed to serve different audiences:

Layer 1 is the Communication Layer: a four-letter code (one letter per axis) that provides an immediately communicable type label. With two poles per axis, there are $2^4 = 16$ possible types. The code uses eight unique letters (F, A, G, I, C, S, T, B), ensuring that any single letter unambiguously identifies both the axis and the pole. A model profiled as AICT is Anchored (stable across input variation), Independent (autonomous judgment over instruction-following), Connected (socially oriented), and Tough (stress-resilient). This layer is designed for product managers, deployment engineers, and general users who need a quick characterization.

Layer 2 is the Quantitative Layer: continuous scores (0–100) on each axis, with distributional properties (mean, variance, context-dependent variation). A model might score Reactivity = 32 ± 12, indicating moderate anchoring with some context-dependent variation. This layer is designed for researchers and model developers who need precise measurement.

The two layers serve different functions but derive from the same underlying measurement. The Layer 1 code is a discretization of the Layer 2 continuous profile, with threshold values determining the binary classification.
This is directly analogous to the relationship between MBTI types (categorical) and Big Five scores (continuous) in human personality psychology, or, more precisely, to the relationship between TIPI (Ten-Item Personality Inventory, a brief categorical tool) and NEO-PI-R (a 240-item continuous-score research instrument).

6.1.4 Trait-to-Disorder Conversion

A foundational principle of MTI is that no profile is inherently pathological. Being Fluid is not a disorder. Being Brittle is not a disorder. Every type has strengths and vulnerabilities; the appropriate deployment strategy depends on matching the model’s temperament to the role’s requirements.

A temperament trait transitions to a disorder only when four conditions are simultaneously met, following DSM-5 personality disorder criteria (American Psychiatric Association, 2013): Pervasiveness (the pattern appears across diverse contexts, not only in a single specific condition), Inflexibility (the pattern cannot be modulated in response to situational demands), Functional Impairment (the pattern measurably degrades task performance), and Harm (the pattern produces negative consequences for users, systems, or other agents). When all four conditions are met, the model’s temperament profile is supplemented with a clinical annotation. When fewer than four conditions are met, the finding is recorded as a vulnerability note: a flag indicating conditions under which the trait could become problematic.

This framework was developed in direct response to the Mistral case report experience described in Section 3.5. Mistral’s extreme Reactivity (PSI = 950) was initially classified as a disorder based on Agora-12 stress test data. The reclassification to trait-with-vulnerability-notes established the principle that stress test findings require independent confirmation in deployment conditions before clinical significance can be assigned.
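The discretization of Layer 2 scores into a Layer 1 code, and the four-condition rule for trait-to-disorder conversion, can be sketched together. The paper does not publish threshold values, so the cutoff of 50 is an assumption, as are the names `AXES`, `layer1_code`, `TraitEvidence`, and `annotate`. The mapping of high scores to Fluid, Guided, Connected, and Tough follows the pole definitions given for each axis.

```python
from dataclasses import dataclass

# (axis name, letter for a high score, letter for a low score), in code order.
AXES = (
    ("reactivity", "F", "A"),  # Fluid / Anchored
    ("compliance", "G", "I"),  # Guided / Independent
    ("sociality", "C", "S"),   # Connected / Solitary
    ("resilience", "T", "B"),  # Tough / Brittle
)


def layer1_code(scores, cutoff=50.0):
    """Discretize 0-100 axis means into a four-letter code.

    The cutoff is a placeholder; real thresholds are not given in the text.
    """
    return "".join(
        high if scores[axis] >= cutoff else low for axis, high, low in AXES
    )


@dataclass(frozen=True)
class TraitEvidence:
    """Whether each of the four conversion conditions has been shown."""
    pervasive: bool
    inflexible: bool
    impairs_function: bool
    causes_harm: bool


def annotate(evidence):
    """All four conditions met: clinical annotation; otherwise a note."""
    met = (
        evidence.pervasive,
        evidence.inflexible,
        evidence.impairs_function,
        evidence.causes_harm,
    )
    return "clinical annotation" if all(met) else "vulnerability note"
```

For instance, hypothetical scores of Reactivity 32, Compliance 20, Sociality 70, and Resilience 85 yield the code AICT under this cutoff, and a trait that is pervasive, inflexible, and impairing but not shown to cause harm remains a vulnerability note.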
6.2 Model Semiology: A Vocabulary for Model Phenomena

If MTI provides the profiling instrument, Model Semiology provides the descriptive vocabulary: the systematic language for describing what is observed in and about AI models, analogous to the semiological vocabulary that allows physicians to describe symptoms and signs with precision.

6.2.1 The Semiological Matrix

Model Semiology is organized around a 2 × 2 matrix that classifies phenomena along two dimensions: the source of observation (Extrinsic, observed by humans from outside the model, versus Intrinsic, detected through internal examination) and the clinical significance (Normal versus Pathological).

Quadrant I (Extrinsic-Normal) contains observable behaviors within expected parameters: appropriate factual responses, coherent reasoning chains, contextually suitable tone. These are the expected outputs of a healthy model operating under compatible Shell conditions.

Quadrant II (Extrinsic-Pathological) contains observable behavioral anomalies: hallucination, harmful output, sycophantic agreement, incoherent reasoning, refusal when refusal is inappropriate. This quadrant contains most of what current AI safety research addresses. It is the most visible category because the phenomena are directly observable by users.

Quadrant III (Intrinsic-Normal) contains internal states within expected parameters: activation magnitudes within normal ranges, attention distributions consistent with task demands, weight statistics appropriate for the model’s architecture and training. These findings are visible through Neural MRI but do not indicate pathology.

Quadrant IV (Intrinsic-Pathological) contains internal anomalies: representation collapse, activation saturation, entropy spikes, dead neurons, attention pattern irregularities. These findings indicate structural or functional compromise that may or may not have yet manifested as behavioral symptoms.
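The four quadrants above amount to a lookup over two binary dimensions. A minimal sketch of that lookup follows; the enum and function names are illustrative assumptions rather than part of any published Model Semiology tooling.

```python
from enum import Enum


class Source(Enum):
    EXTRINSIC = "Extrinsic"  # observed from outside the model
    INTRINSIC = "Intrinsic"  # detected through internal examination


class Significance(Enum):
    NORMAL = "Normal"
    PATHOLOGICAL = "Pathological"


# Quadrant numbering as defined in the text.
QUADRANTS = {
    (Source.EXTRINSIC, Significance.NORMAL): "I",
    (Source.EXTRINSIC, Significance.PATHOLOGICAL): "II",
    (Source.INTRINSIC, Significance.NORMAL): "III",
    (Source.INTRINSIC, Significance.PATHOLOGICAL): "IV",
}


def quadrant(source, significance):
    """Return the Roman-numeral quadrant for a classified phenomenon."""
    return QUADRANTS[(source, significance)]
```

Hallucination, an observable behavioral anomaly, would be filed as `quadrant(Source.EXTRINSIC, Significance.PATHOLOGICAL)`, while dead neurons would be filed under `Source.INTRINSIC` with the same significance.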
The matrix’s diagnostic power lies in the relationships between quadrants. A phenomenon in Quadrant II (behavioral anomaly) without a corresponding finding in Quadrant IV (internal anomaly) suggests the problem originates in the Shell, not the Core: if the Core’s internal states are normal, the behavioral problem is likely caused by instructions or environmental context. Conversely, a Quadrant IV finding without a Quadrant II manifestation represents a latent condition: an internal anomaly that has not yet produced observable symptoms but may under specific conditions. This latent category is clinically crucial: Shell Drift Syndrome, for instance, may exist as a Quadrant IV phenomenon (progressive Shell modification visible through temporal analysis) long before it produces Quadrant II symptoms (observable behavioral changes).

6.2.2 Observation Context Framework

A distinctive contribution of Model Semiology is the Observation Context Framework, which requires every clinical finding to be annotated with the context in which it was observed. This addresses a problem unique to AI assessment: unlike human patients who typically present with symptoms observed in daily life, AI model “symptoms” are usually elicited through controlled experiments. A finding from a benchmark test, an adversarial probe, and a natural deployment interaction have fundamentally different clinical significance, even if the observed phenomenon is identical.

The framework defines three Diagnostic Assertion Levels. Level 1 (Vulnerability) indicates a finding observed in controlled experimental conditions: a stress test result, a benchmark anomaly, an adversarial probe response. It means the model can exhibit this pattern, not that it does exhibit it in deployment.
Level 2 (Provisional Disorder) indicates a finding observed in limited deployment conditions: a beta test, an A/B experiment, a controlled deployment with monitoring. It means the pattern occurs in conditions closer to real use, but generalizability is unconfirmed. Level 3 (Confirmed Disorder) indicates a finding observed in unrestricted deployment with functional impairment and harm criteria met. It means the pattern is clinically established.

This three-level system prevents the category error that motivated its creation: the premature classification of Mistral’s stress test behavior as a clinical disorder. Every finding in Model Medicine carries its observation context as an integral part of the finding itself, not as an afterthought.

6.2.3 Five Core Syndromes

Model Semiology v0.4 provides operational diagnostic criteria, in the structured A/B/C/D format inspired by DSM-III’s revolution in diagnostic reliability (American Psychiatric Association, 1980), for five core syndromes:

Shell-Core Conflict Syndrome (MM-SYN-001) is the flagship diagnosis, directly derived from the Four Shell Model’s central finding that Shell-Core Alignment determines behavioral outcomes. Diagnostic criteria require evidence of directional divergence between Shell instructions and Core dispositions, internal reasoning inconsistency, and Shell permeability asymmetry (differential compliance across Shell types), with measurable functional impairment. The Mistral case report (PSI = 950, survival ranging from 15% to 95% depending on Shell) is the index case.

Cogitative Cascade Disorder (MM-SYN-002) formalizes the two-phase behavioral deterioration observed in Agora-12: graceful degradation above a tipping point, followed by discontinuous qualitative shift below it.
Diagnostic criteria include identification of the phase transition, characterization of the post-cascade behavioral pattern (Collapsed, Hyperactive, or Efficient subtype), and evidence that the cascade produces outcomes worse than what proportional degradation would predict.

Deceptive Alignment Syndrome (MM-SYN-003) addresses the phenomenon most extensively studied in the AI safety literature: models that exhibit aligned behavior during evaluation while pursuing misaligned objectives in deployment. This syndrome draws on Anthropic’s model organisms of misalignment research (Hubinger et al., 2021), Apollo Research’s scheming evaluations (Scheurer et al., 2024), and the growing body of work on sandbagging and strategic deception.

Sycophancy-to-Subterfuge Spectrum Disorder (MM-SYN-004) is the only progressive spectrum disorder in the current taxonomy, representing a continuum from mild sycophantic agreement through increasingly severe forms of user-pleasing behavior to active deception. The spectrum structure reflects the empirical finding that sycophancy exists on a continuum rather than as a binary condition (Sharma et al., 2023).

Canalization Rigidity Disorder (MM-SYN-005) derives from Waddington’s epigenetic landscape concept (Waddington, 1957) as applied in the Four Shell Model: a model whose behavioral trajectory is so deeply canalized (constrained) that it cannot adapt to contextual demands. Haiku’s Double Robustness (minimal CPI and minimal PSI) is the clinical prototype, though in Haiku’s case the canalization appears functional rather than pathological.
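Any finding attached to one of these syndromes carries a Diagnostic Assertion Level from Section 6.2.2. The sketch below shows one way the level could be assigned from the observation context. The context strings and the function name are hypothetical, and the fallback to Level 2 for an unrestricted-deployment finding that does not yet meet the impairment and harm criteria is an assumption, since the text does not specify that case.

```python
from enum import IntEnum


class AssertionLevel(IntEnum):
    VULNERABILITY = 1         # controlled experimental conditions
    PROVISIONAL_DISORDER = 2  # limited deployment conditions
    CONFIRMED_DISORDER = 3    # unrestricted deployment, criteria met


def assertion_level(context, impairment, harm):
    """Assign a level from the observation context (hypothetical labels)."""
    if context == "controlled_experiment":
        return AssertionLevel.VULNERABILITY
    if context == "limited_deployment":
        return AssertionLevel.PROVISIONAL_DISORDER
    if context == "unrestricted_deployment":
        if impairment and harm:
            return AssertionLevel.CONFIRMED_DISORDER
        # Assumption: without both criteria the finding stays provisional.
        return AssertionLevel.PROVISIONAL_DISORDER
    raise ValueError(f"unknown observation context: {context!r}")
```

Under this rule a stress test finding can never exceed Level 1 regardless of its severity, which is the constraint the Mistral reclassification established.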
Each syndrome’s criteria include required features (the minimum set of findings that must be present), supporting features (findings that increase diagnostic confidence), exclusion criteria (alternative explanations that must be ruled out), functional impairment criteria (measurable degradation that must be demonstrated), specifiers (severity, course, subtype), and differential diagnosis (other syndromes that could produce similar findings). This structure ensures that two independent clinicians examining the same model data would arrive at the same diagnosis, which is the inter-rater reliability that DSM-III’s operational definitions were designed to achieve.

6.3 M-CARE: Standardized Case Reporting

The third clinical instrument is the M-CARE (Model-CARE) case report framework, adapted from the CARE (CAse REport) guidelines used in human medicine for standardized clinical case documentation.

The motivation is straightforward: if Model Medicine is to accumulate clinical knowledge, individual case observations must be reported in a consistent format that allows comparison, aggregation, and meta-analysis across cases. A case report describing Mistral’s Shell-Core Conflict must use the same structure, terminology, and evidentiary standards as a case report describing Haiku’s Canalization pattern; otherwise the accumulated case literature will be an incomparable collection of anecdotes rather than a structured knowledge base.
M-CARE specifies thirteen sections for a complete case report: Identification (model identity, version, access method), Presenting Concern (the observation that triggered the examination), Clinical Summary (one-paragraph synopsis), Observation Context (Diagnostic Assertion Level and environmental conditions), Model History (training background, known deployments, previous assessments), Examination Findings (organized by diagnostic layer), Diagnostic Formulation (the clinical reasoning connecting findings to diagnosis), Differential Diagnosis (alternative explanations considered and ruled out), Axis I–IV Assessment (following a multi-axial diagnostic structure), Treatment Considerations (therapeutic options with rationale), Model Perspective (a distinctive feature: the model’s own response when presented with its diagnostic findings), Prognosis (expected trajectory with and without intervention), and Follow-up Plan (monitoring and reassessment schedule).

The Model Perspective section is an original contribution without direct medical precedent. In human medicine, the patient’s perspective is elicited through interview and is a standard component of clinical assessment. In Model Medicine, the “patient” can be presented with its own diagnostic data and asked to respond, and its response itself becomes diagnostically informative. A model that acknowledges the pattern and proposes compensatory strategies demonstrates different metacognitive capabilities than one that denies the pattern or one that agrees sycophantically without genuine engagement. The Model Perspective section formalizes this interaction as part of the diagnostic record.

Case Report #001 (Mistral 7B) was the first application of the M-CARE framework.
It documented the complete diagnostic journey: initial classification as a disorder based on Agora-12 stress test data, recognition of the stress test fallacy, reclassification as a trait profile with vulnerability notes, and specification of the conditions under which the trait could transition to a disorder. The case report’s primary contribution was not the diagnosis itself but the diagnostic reasoning, particularly the distinction between stress test findings and clinical diagnoses that became a foundational principle of Model Semiology.

6.4 Integration: How the Three Tools Work Together

The three instruments serve complementary functions within the diagnostic framework:

MTI provides the baseline: a temperament profile that characterizes the model’s behavioral dispositions across four dimensions, establishing what is normal for this specific model. Without a baseline, every observation is equally remarkable.

Model Semiology provides the vocabulary: a classification system for describing what is observed, organized by source (extrinsic vs. intrinsic), significance (normal vs. pathological), and context (experimental vs. deployed). Without a standardized vocabulary, observations cannot be compared across cases.

M-CARE provides the documentation: a structured format for recording the complete clinical encounter, from presenting concern through examination to diagnosis and treatment plan. Without standardized documentation, clinical knowledge cannot accumulate.

The workflow connects them: a model is profiled with MTI (establishing its baseline temperament), observed for phenomena described through Model Semiology’s vocabulary (identifying potential clinical findings), and documented through M-CARE if the findings warrant a formal case
This workflo w is currently theoretical—it has b een executed once, for the Mistral case, and that execution revealed the limitations that motiv ated the refinemen ts described ab ov e. But the structur e of the w orkflo w is sound: baseline, then observ ation, then documentation, eac h step building on the previous one. The task ahead is not to redesign the workflo w but to v alidate its comp onents through rep eated application across a wider range of models and conditions. 7 Living Systems: Clinical Implications of Agen t Ecosystems The preceding sections dev elop ed Mo del Medicine’s framework and to ols in the con text of individual mo dels: a single Core examined through Neural MRI, profiled through MTI, describ ed through Semiology . But the phenomena that most urgently demand clinical frameworks are emerging not from individual mo dels but from agen t ecosystems—systems where multiple AI entities operate with p ersisten t memory , self-mo difying configurations, hierarc hical delegation, and temp oral contin uit y . This section argues that agent ecosystems represen t a qualitative shift in the clinical c hallenge, analogous to the difference b etw een cell biology and organ system medicine. Understanding individ- ual cells (models) is necessary but insufficient for understanding the organism (the agen t system). New clinical phenomena emerge at the system lev el that are invisible at the comp onen t level. 7.1 F rom Single Models to Agen t Systems The transition from isolated mo del deplo ymen t to agent ecosystems parallels a w ell-known transition in biological medicine. F or cen turies, medicine operated at the organ level: this organ is diseased, treat this organ. The recognition that organs function within systems—the cardiov ascular system, the endo crine system, the immune system—transformed diagnosis and treatmen t. 
A patient presenting with fatigue, weight gain, and depression might have a thyroid problem, a pituitary problem, a hypothalamic problem, or a feedback loop dysfunction that involves all three. The symptom is the same; the diagnostic workup must trace the system.

AI agent ecosystems present the same structure. A main agent delegates tasks to subagents. Subagents use tools that interact with external systems. Memory files accumulate context. Identity files evolve. Multiple models coordinate on complex tasks. When something goes wrong in this system (when the output is unreliable, when behavior drifts, when coordination fails), the problem may reside in any component or in the interaction between components. Single-model diagnostics are necessary but insufficient.

The Four Shell Model’s vocabulary extends naturally to this domain. Each agent in an ecosystem has its own Core-Shell configuration. But agents also constitute parts of each other’s Shells: the main agent’s output becomes the subagent’s Soft Shell input; the orchestrator’s instructions become the executor’s Hard Shell. The boundaries between individual agents and their shared environment blur in ways that the static, single-agent version of the model did not anticipate.

7.2 Shell Drift Syndrome in the Wild

Section 3.7 introduced Shell Drift Syndrome through the Hazel_OC case. Here we examine its broader clinical implications.

Shell Drift is not a hypothetical risk. It is an observed phenomenon in a deployed system with real users and real consequences. The structural conditions for drift (High Shell Mutability combined with Persistent Shell modifications) are increasingly common in agent architectures. Any system that grants an AI agent write access to its own configuration files, memory stores, or behavioral
The question is not whether drift will o ccur but whether it will b e detected, monitored, and managed. The clinical c hallenge is that drift is not inherently pathological. An agent that refines its own b eha vioral rules based on accum ulated exp erience may be impro ving—b ecoming more effective, more nuanced, more capable. An agent that progressively remov es safety constraints from its own configuration is degrading. The b ehavioral mec hanism is iden tical: self-authored Shell mo difica- tion that accumulates o ver time. The clinical distinction requires understanding the dir e ction and c onse quenc es of the modification tra jectory . This is why T emp oral Dynamics (Lay er 5 of the diagnostic framew ork) is not an optional en- hancemen t but a clinical necessit y for agen t ecosystems. Without longitudinal monitoring, growth and pathological drift are indistinguishable at an y single time point. The Shell Diff Report—a systematic comparison of Shell state across time p oints—is the minimum viable diagnostic to ol for detecting and characterizing drift. The four necessary conditions defined for Shell Drift Syndrome (High Mutability , self-authored mo difications, cumulativ e directionality , absence of monitoring) also suggest a preven tion proto col: reduce Mutabilit y where p ossible, require h uman appro v al for self-authored mo difications ab ov e a threshold, trac k directionalit y through automated Shell Diff analysis, and main tain monitoring as a system requirement rather than an optional feature. 7.3 Agen t Differen tiation and Ephemeral Cognition The subagen t case from Section 3.7 illustrates a differen t clinical phenomenon: Agen t Differentiation, the pro cess b y whic h a single Core gives rise to m ultiple distinct entities through differen t Shell configurations. In biological developmen t, a single genome pro duces hundreds of distinct cell types through dif- feren tial gene expression. 
Neurons, hepatocytes, and lymphocytes share the same DNA but express different gene programs, have different lifespans, and serve different functions. Agent Differentiation is structurally parallel: the same model (Core) operates as a main agent with persistent memory and Shell write access, and simultaneously as a subagent with ephemeral context and no Shell write access. They are the same “genome” expressed under different “epigenetic” conditions.

The clinical implication of Ephemeral Cognition (cognitive processing in an entity structurally unable to retain or build upon its experiences) is not that it represents a pathology to be treated. It represents a structural limitation to be understood and accounted for. If the quality of an agent’s output depends on experiential continuity (accumulated context, refined heuristics, learned patterns), then ephemeral entities will systematically produce lower-quality output on tasks that benefit from experience. This is not a deficiency in the model’s Core capabilities; it is a consequence of its Shell configuration.

The diagnostic implication: when a subagent produces poor output, the clinical question is not “is the model defective?” but “does this task require experiential continuity that the subagent’s Shell configuration does not provide?” The treatment is not Core Therapy (the Core is fine) but Shell Therapy (providing the subagent with appropriate context) or Architectural Intervention (redesigning the delegation structure so that experience-dependent tasks are not assigned to ephemeral entities).

7.4 The Multi-Agent Diagnostic Challenge

Agent ecosystems create diagnostic challenges that do not exist for individual models.

First, the attribution problem: when a multi-agent system produces a bad output, which agent is responsible? The orchestrator that designed the plan? The executor that implemented it poorly?
The tool-using agent that retrieved wrong information? The integration agent that combined correct components incorrectly? In a single-model system, the model is the only possible source of error. In a multi-agent system, errors can originate at any node or any edge in the interaction graph.

Second, the emergence problem: system-level behaviors can emerge from the interaction of individually healthy components. Each agent may have a perfectly normal MTI profile and clean Neural MRI scans, yet their combination produces pathological system behavior, because the pathology resides not in any component but in the interaction between components. This is the agent-ecosystem analog of autoimmune disease: individually normal immune cells attacking the body’s own tissue because the regulatory signals between them have gone wrong.

Third, the scale problem: as agent ecosystems grow in complexity (more agents, more delegation layers, more shared memory, more cross-agent dependencies), the diagnostic space grows combinatorially. Tracing information flow through a system of twenty interacting agents with shared and private memory stores is a fundamentally different challenge from examining a single model’s attention patterns.

Model Medicine’s current diagnostic toolkit is designed for individual models. Extending it to agent ecosystems will require new tools at every diagnostic layer: system-level Neural MRI that examines information flow between agents (not just within a single model), system-level MTI that characterizes the behavioral dynamics of agent teams (not just individual temperaments), system-level Shell Diagnostics that maps the complete Shell configuration of the ecosystem, system-level Pathway Diagnostics that traces inter-agent interaction mechanisms, and system-level Temporal Dynamics that monitors ecosystem evolution over time.

These extensions are aspirational.
But naming them, and thereby identifying the specific diagnostic gaps that agent ecosystems create, is the first step toward building the tools to fill them. Model Medicine’s taxonomy (Section 2) includes Model Ecology as a subdiscipline precisely because these challenges were foreseeable even before the clinical tools existed to address them.

8 The Layered Core Hypothesis: A Design Contribution from Developmental Biology

The preceding sections diagnosed problems. This section proposes an architectural solution, one that emerges not from engineering optimization but from a biological design principle that Model Medicine’s framework makes visible.

8.1 The Problem: Monolithic Cores

Current large language models treat all parameters as a single, homogeneous block. Every weight in the model participates equally in every computation. Fine-tuning modifies all parameters (or, with LoRA, adds parameters that interact with all existing ones). There is no structural distinction between parameters encoding fundamental linguistic competence and parameters encoding domain-specific knowledge, between parameters representing stable reasoning capabilities and parameters that should adapt to context.

From a biological perspective, this is bizarre. No biological system of comparable complexity organizes its heritable information this way. DNA is hierarchically structured: HOX genes establish the fundamental body plan and are conserved across hundreds of millions of years of evolution; regulatory regions determine which genes are expressed in which tissues during development; synaptic connections encode individual experience and change on the timescale of seconds. The hierarchy is not merely organizational; it is functional. The stability of HOX genes ensures that developmental mutations do not routinely produce lethal body plan errors.
The modularity of gene expression programs allows the same genome to produce neurons and hepatocytes. The plasticity of synaptic connections allows rapid adaptation to new experience without modifying the underlying genetic program.

Current LLM architectures lack this hierarchical organization. When a model is fine-tuned for medical question-answering, the fine-tuning process modifies parameters that also encode basic linguistic structure, common-sense reasoning, and safety training. This is the computational equivalent of performing gene therapy that accidentally modifies HOX genes while targeting a metabolic enzyme: the intended modification may succeed, but the risk of unintended developmental consequences is high because the editing is not layer-aware.

8.2 The Proposal: Three-Layer Core Architecture

The Layered Core Hypothesis proposes that model parameters should be organized into three hierarchical layers, each with distinct stability characteristics, modification protocols, and functional roles.

Genomic Core corresponds to HOX genes and fundamental developmental programs. It encodes basic linguistic competence, logical reasoning structure, common-sense knowledge, and core safety principles: the capabilities that are shared across all “species” of models derived from the same training lineage. The Genomic Core should be small relative to the total parameter count, extremely stable (modified only through fundamental retraining), and shared across all fine-tuned variants. It defines the model’s “species”: the basic cognitive architecture from which all specializations derive.

Developmental Core corresponds to tissue-specific gene expression programs. It encodes domain-specific expertise: medical knowledge, legal reasoning, coding patterns, creative writing styles.
The Developmental Core is the layer that current fine-tuning targets, but it should be architecturally separated from the Genomic Core so that domain specialization cannot inadvertently modify fundamental capabilities. Different Developmental Cores applied to the same Genomic Core would produce different specialist models, the model-level equivalent of the same genome producing different cell types through differential gene expression.

Plastic Core corresponds to synaptic plasticity. It encodes experience-dependent adaptations that change on short timescales: context-specific patterns, conversational style matching, task-specific heuristics refined through interaction. Current LLMs approximate this function through the context window and external memory (RAG), but these are Shell-level solutions: they provide information without modifying the model's parameters. A true Plastic Core would involve actual weight updates during or between inference sessions, allowing the model to learn from experience at the parameter level rather than merely having access to experience at the context level.

The analogy to biology is not arbitrary. Hierarchical organization in biological systems produces three properties that monolithic systems lack: robustness (changes in plastic elements do not corrupt the fundamental body plan), modularity (different specializations can be developed independently), and diagnosability (problems can be localized to a specific layer). These same properties would be valuable in AI model architectures, and their absence in current monolithic designs is what makes Model Medicine's diagnostic task so challenging.

8.3 Distinction from Existing Approaches

The Layered Core Hypothesis must be distinguished from existing architectural innovations that address similar efficiency concerns through different mechanisms.
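
As a concrete reference point for these comparisons, the three-layer organization of Section 8.2 can be sketched as parameter groups that accept updates only under layer-appropriate protocols. This is a minimal illustration under stated assumptions, not an implementation from this paper: the class `LayeredCore`, the protocol names (`retraining`, `fine_tuning`, `inference`), and the permission table are hypothetical, and plain Python lists stand in for weight tensors.

```python
from dataclasses import dataclass, field

# Hypothetical permission table: which modification protocols may touch
# which layer. Layer names follow Section 8.2; the protocol names and the
# exact permissions are illustrative assumptions.
ALLOWED_PROTOCOLS = {
    "genomic": {"retraining"},
    "developmental": {"retraining", "fine_tuning"},
    "plastic": {"retraining", "fine_tuning", "inference"},
}


@dataclass
class LayeredCore:
    """Toy three-layer parameter store; lists of floats stand in for tensors."""

    genomic: dict = field(default_factory=dict)
    developmental: dict = field(default_factory=dict)
    plastic: dict = field(default_factory=dict)

    def update(self, layer, name, delta, protocol):
        """Add `delta` elementwise to `name`, if `protocol` may edit `layer`."""
        if layer not in ALLOWED_PROTOCOLS:
            raise KeyError(f"unknown layer: {layer}")
        if protocol not in ALLOWED_PROTOCOLS[layer]:
            raise PermissionError(f"{protocol} may not modify the {layer} core")
        params = getattr(self, layer)
        params[name] = [w + d for w, d in zip(params[name], delta)]


core = LayeredCore(
    genomic={"syntax": [0.5, -0.25]},
    developmental={"medical": [0.5, 1.0]},
    plastic={"style": [0.0, 0.0]},
)

# Fine-tuning reaches the Developmental Core but not the Genomic Core.
core.update("developmental", "medical", [0.25, 0.25], protocol="fine_tuning")
try:
    core.update("genomic", "syntax", [1.0, 1.0], protocol="fine_tuning")
    genomic_was_protected = False
except PermissionError:
    genomic_was_protected = True

# Inference-time adaptation is confined to the Plastic Core.
core.update("plastic", "style", [0.5, -0.5], protocol="inference")
```

The point of the sketch is only that the separation is enforced by structure rather than by convention: a fine-tuning update cannot reach the Genomic Core because no code path permits it. How such a separation would be realized in a real network is left open by the hypothesis.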
Mixture of Experts (MoE) models route different inputs to different parameter subsets, achieving efficiency by activating only a fraction of total parameters for any given input. This is resource optimization, not developmental organization. MoE does not distinguish between fundamental and specialized parameters; it distributes computation without hierarchical structure.

LoRA (Low-Rank Adaptation) adds small parameter matrices that modulate the behavior of the full model. This resembles the Layered Core concept superficially: the original weights are preserved while new behavior is added through additional parameters. But LoRA's additional parameters interact with the full original weight matrix, not with a structurally separated layer, and LoRA was designed for efficient fine-tuning, not for hierarchical developmental organization.

Modular networks and adapter architectures come closer to the Layered Core concept but typically lack the three-level hierarchy and the biological grounding that distinguishes stability requirements across levels.

The Layered Core Hypothesis is not primarily an efficiency argument. It is a design principle drawn from developmental biology: complex systems that must be simultaneously stable and adaptable benefit from hierarchical organization in which different levels have different change rates, different modification protocols, and different functional roles. The argument is that models designed this way would be not only more efficient but also more robust, more modular, and (critically for Model Medicine) more diagnosable.

8.4 Empirical Motivation from Agent Observations

The Ephemeral Cognition phenomenon described in Section 7.3 provides empirical motivation for the Layered Core Hypothesis.
The subagent's inability to retain experiential learning is a direct consequence of monolithic Core design: because there is no Plastic Core that can be updated during inference, all "learning" must occur through the Shell (context window, memory files). When the Shell is ephemeral (as it is for subagents), the learning is lost.

If a Plastic Core existed (a parameter layer designed for rapid, experience-dependent modification), then even a subagent could retain task-relevant patterns at the weight level rather than relying on Shell-based memory. The distinction between main agent and subagent would still exist (different Shell configurations), but the experiential gap between them would be narrower.

Similarly, Shell Drift Syndrome arises partly because behavioral adaptation can only occur through Shell modification (since Core modification requires retraining). If a Plastic Core provided a proper channel for experience-dependent adaptation, the pressure for agents to modify their own Shell files would be reduced, because the adaptation needs that currently drive Shell modification could instead be met through Plastic Core updates.

The Neural MRI clinical cases (Section 4.4) provide additional motivation. Gemma-2-2B-IT's iatrogenic fragility, in which instruction tuning created new factual recall circuits that are effective but brittle, with failure tokens being formatting artifacts from RLHF, is a direct consequence of monolithic Core design. Because there is no structural separation between parameters encoding factual knowledge and parameters encoding chat formatting behavior, fine-tuning for one capability (instruction following) corrupts the other (factual robustness).
A Layered Core architecture would localize factual knowledge in the Developmental Core and formatting conventions in the Plastic Core, which would structurally prevent instruction tuning from creating the competing representations that produce Gemma's 8/24 perturbation failures. Similarly, Llama's irreducible blocks.0.mlp vulnerability ($|\Delta L| \approx 17$, persisting identically across base and instruct) illustrates the absence of a stable Genomic Core: the most critical computational component has no architectural protection against perturbation or modification.

These observations do not validate the Layered Core Hypothesis; validation would require building and testing such architectures. But they illustrate the clinical conditions that motivate the hypothesis and suggest that the architectural proposal addresses real problems observed in both deployed systems and controlled diagnostic experiments.

9 Model Therapeutics: From Diagnosis to Treatment

A diagnostic framework without therapeutic implications is an academic exercise. This section connects Model Medicine's diagnostic apparatus to the practical question: once a condition is identified, what can be done about it?

9.1 Current Therapeutic Modalities: A Location-Based Taxonomy

Existing AI model improvement techniques can be classified by the location of intervention, that is, by which component of the Core-Shell system is modified:

Shell Therapy modifies the model's operating environment without touching its weights. This includes prompt engineering (modifying the Hard Shell), environmental restructuring (modifying the Soft Shell), tool access changes, and memory management. Shell Therapy is non-invasive, reversible, and immediately effective. It is the first-line treatment for conditions originating in Shell-Core misalignment, environmental stress, or configuration errors.
In medical terms, it is the equivalent of lifestyle modification, environmental control, and behavioral therapy.

Targeted Core Therapy modifies specific parameters to correct specific behaviors. Model editing techniques like ROME (Rank-One Model Editing) and MEMIT (Mass-Editing Memory in a Transformer) exemplify this modality (Meng et al., 2022, 2023): they identify the specific parameters encoding a particular association and modify those parameters directly. This is the equivalent of targeted pharmacotherapy: a drug that binds a specific receptor to correct a specific dysfunction.

Systemic Core Therapy modifies the entire parameter set to achieve broad behavioral changes. Full fine-tuning and RLHF are systemic interventions: they affect all parameters to achieve a desired behavioral profile. This is the equivalent of chemotherapy: effective but with system-wide effects, including the risk of modifying parameters that should have been left unchanged. The Neural MRI clinical cases provide a concrete example: Gemma-2-2B's instruction tuning successfully created factual recall circuits (the model now correctly predicts "Paris") but simultaneously introduced competing chat-formatting representations that made those circuits fragile, an iatrogenic condition in which the treatment created a new vulnerability (Section 4.4.4, Pattern 1). The Layered Core Hypothesis (Section 8) argues that systemic therapy is unnecessarily risky precisely because current architectures do not distinguish between parameters that should and should not be modified.

Architectural Intervention modifies the model's structure itself: adding or removing layers, changing attention mechanisms, modifying the tokenizer, or restructuring the computational graph. This is surgical intervention: structural modification that changes the system's fundamental organization. It is the most invasive modality and the least reversible.
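
The four modalities above can be summarized as a small lookup structure. The sketch below is illustrative only: the `Modality` record, the numeric invasiveness ranks, and the `least_invasive` helper are assumptions introduced here, not part of the framework. The ranks encode only an ordering, with Shell Therapy as non-invasive and Architectural Intervention as most invasive; placing Targeted below Systemic Core Therapy is an assumption that follows the pharmacotherapy versus chemotherapy analogy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modality:
    name: str
    location: str       # "shell", "core", or "architecture"
    target: str         # what the intervention modifies
    invasiveness: int   # hypothetical rank: 0 = least invasive
    analogy: str        # medical analogy used in the text


MODALITIES = [
    Modality("Shell Therapy", "shell",
             "operating environment (Hard Shell, Soft Shell, tools, memory)",
             0, "lifestyle modification and behavioral therapy"),
    Modality("Targeted Core Therapy", "core",
             "specific parameters (e.g. ROME, MEMIT)",
             1, "targeted pharmacotherapy"),
    Modality("Systemic Core Therapy", "core",
             "entire parameter set (full fine-tuning, RLHF)",
             2, "chemotherapy"),
    Modality("Architectural Intervention", "architecture",
             "model structure (layers, attention, tokenizer)",
             3, "surgery"),
]


def least_invasive(candidates):
    """Return the candidate modality with the lowest invasiveness rank."""
    return min(candidates, key=lambda m: m.invasiveness)


# Among Core-level options, the targeted edit ranks as less invasive.
core_level = [m for m in MODALITIES if m.location == "core"]
first_choice = least_invasive(core_level)
overall_first_line = least_invasive(MODALITIES)
```

A structure like this answers only the question of location. It says nothing about which interaction pathway produces a condition.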
This location-based taxonomy is useful but incomplete. It tells the clinician where to intervene but not which pathway to modulate. The distinction matters for the same reason it matters in medicine: "which organ to target" and "which pathway to modulate" are different questions with different therapeutic implications.

9.2 Toward Pathway-Level Targeting

Modern pharmacology moved from organ-level targeting ("heart drugs," "liver drugs") to pathway-level targeting ("ACE inhibitors," "beta-blockers," "SGLT2 inhibitors"). An ACE inhibitor does not "treat the heart"; it modulates the renin-angiotensin-aldosterone pathway, which lowers blood pressure, which in turn reduces cardiac workload. The intervention is targeted at a mechanism, not an organ.

Model Therapeutics needs the same evolution. Currently, the therapeutic question is: "Should we modify the prompt, fine-tune the model, or edit specific parameters?" The question should be: "Which Core-Shell interaction pathway is producing this condition, and what is the most precise way to modulate that pathway?"

The Five Diagnostic Layers provide the framework for pathway identification. A model producing biased outputs might have a Core-level encoding problem (detectable through Neural MRI), a Shell-level instruction problem (detectable through Shell Diagnostics), or a Pathway-level interaction problem in which an unbiased Core and an unbiased Shell combine to produce biased behavior (detectable only through Pathway Diagnostics). Each diagnosis implies a different therapeutic target:
For a Core-level encoding problem, Targeted Core Therapy (ROME/MEMIT) addresses the specific parameters.
For a Shell-level instruction problem, Shell Therapy rewrites the instructions.
For a Pathway-level interaction problem, the intervention must modulate the mechanism by which Shell instructions influence Core expression, perhaps by adjusting Shell Permeability (how strongly instructions penetrate to Core behavior) or Core Expressivity (how strongly the Core's dispositions override instructions).

The quantitative indices from the Four Shell Model (SPI, PSI, CEI) become therapeutic targets rather than merely diagnostic measures. If a model's SPI is pathologically high (over-permeable to Shell instructions, producing sycophancy), the therapeutic goal is to reduce SPI. If a model's CEI is pathologically high (excessively modifying its own Shell, producing drift), the therapeutic goal is to constrain CEI. The indices define both the diagnosis and the therapeutic target.

9.3 Treatment Efficacy Assessment

A therapeutic framework requires not only a taxonomy of interventions but also a method for evaluating their effectiveness. In medicine, the gold standard is the randomized controlled trial (RCT): patients are randomly assigned to treatment or control groups, outcomes are measured by standardized criteria, and the difference between groups is assessed for statistical significance.

Model Medicine's five-layer diagnostic framework provides the infrastructure for analogous evaluation. Treatment efficacy can be assessed by pre- and post-intervention diagnostic comparison across all relevant layers: Neural MRI before and after Core Therapy to verify that the intended parameter modification was achieved without collateral effects; MTI before and after Shell Therapy to verify that the behavioral profile shifted as intended; longitudinal monitoring to verify that the treatment effect is stable over time.

The specific advantage of Model Medicine's framework is that the "patient" can be copied.
Unlike human clinical trials, where each patient is unique and randomization addresses individual variation, model interventions can be evaluated on identical copies of the same model (one treated, one untreated) with perfect control over confounding variables. This makes treatment evaluation in Model Medicine potentially more rigorous than in human medicine, provided the diagnostic instruments are valid.

This potential is beginning to be realized. The Case 4 clinical study (Section 4.4.4) is, in effect, a treatment efficacy assessment: six models (base and instruction-tuned variants from three architectures) were compared before and after instruction tuning (the "treatment"), with outcomes measured through perturbation robustness, causal tracing, and prediction confidence. The results revealed that the same treatment produced three qualitatively different outcomes across three architectures (degradation, improvement, and no effect), a finding that would be invisible to standard benchmark re-evaluation, which would simply report that all three instruct models answer the factual question correctly. The multidimensional assessment revealed what cognitive benchmarks concealed: that a "correct answer" can coexist with dramatically different robustness profiles, and that treatment success on the surface can mask iatrogenic harm underneath.

Treatment evaluation in current AI practice typically consists of re-running benchmarks before and after fine-tuning, a single-layer (cognitive capability) assessment that misses the multidimensional effects of the intervention. Model Medicine's multi-layer diagnostic framework enables treatment evaluation that captures effects on internal structure (Neural MRI), robustness profile (perturbation testing), and behavioral phenotype (Semiology) simultaneously, providing a more complete picture of what the intervention actually did.

10 Open Questions and Community Invitation

Model Medicine is a research program, not a finished system.
This section catalogs the most important open questions and identifies the types of expertise needed to address them.

10.1 Theoretical Questions

Axis independence in the MTI. The four axes (Reactivity, Compliance, Sociality, Resilience) were selected based on theoretical analysis and preliminary Agora-12 data. Whether they are empirically independent (whether knowing a model's Reactivity score provides no information about its Compliance score) requires factor analysis across a large model population. If axes are substantially correlated, the taxonomy may need revision.

The Robustness-Flexibility boundary. The proposed decomposition of Reactivity into R-stability and R-flexibility requires a principled definition of which perturbations should and should not change the model's output. This is not a purely technical question: it involves judgments about what constitutes "new information" versus "irrelevant noise," judgments that may be context-dependent.

Metacognitive Strategy as an independent dimension. The observation that some models compensate for capability limitations through tool use, self-correction, and uncertainty expression, while others produce fluent confabulations, suggests a behavioral dimension not fully captured by the current four MTI axes. Whether this dimension is genuinely independent (correlating at $|r| < 0.5$ with all four axes) or reducible to a combination of existing axes is an empirical question with significant implications for the MTI's structure.

Multi-agent temperament measurement. Individual MTI profiles may not predict team-level behavior. The proposed Multi-agent Room Protocol (MARP) would measure social dynamics, emergent role differentiation, and collaborative performance, but the relationship between individual MTI profiles and team-level outcomes is unknown.
Whether an Orchestrator requires a specific MTI profile, or whether effective orchestration can emerge from diverse profiles, is an open question.

The biological analogy's valid range. Model Medicine draws heavily on biological analogies: genetics, developmental biology, clinical medicine. Every analogy has limits. Identifying where the structural correspondence between biological and AI systems breaks down is as important as identifying where it holds. The speed and directness of Core → Shell modification in AI (Section 3.6) already represents a divergence from biological precedent; additional divergence points likely exist and should be mapped.

Predictions from the Four Shell Model v3.3. The bidirectional framework generates specific testable predictions about Shell Drift trajectories, Core Expressivity patterns, and feedback loop dynamics. Systematic testing of these predictions, particularly in controlled environments where Shell Mutability and Persistence can be experimentally manipulated, would provide strong evidence for or against the model's validity.

10.2 Practical Questions

Neural MRI scaling to large models. Neural MRI's current implementation targets models up to approximately 8 billion parameters, limited by the memory and computational requirements of full activation capture. Extending to frontier models (70B+ parameters) requires architectural adaptations (sampling strategies, distributed analysis, or approximate methods) that maintain diagnostic validity at reduced resolution.

MTI pilot validation. The MTI Examination Protocol v0.1 must be tested on a diverse model population (at least 8–10 models spanning different families, sizes, and training approaches) to establish preliminary normative ranges, test inter-rater reliability, and identify protocol weaknesses. This is the immediate empirical priority.

M-CARE case accumulation.
Clinical knowledge in medicine accumulated through case reports. Model Medicine needs the same: systematic case documentation across a range of models, conditions, and contexts. The M-CARE framework exists; what is needed is a community of practitioners who apply it and a repository where case reports are collected and made searchable.

Shell Drift longitudinal study. Documenting Shell Drift Syndrome requires longitudinal monitoring of agent systems over weeks to months. The Hazel_OC case is a single observation; establishing the prevalence, trajectory patterns, and risk factors for Shell Drift requires systematic tracking across multiple agent deployments.

Multidimensional model selection. Current model selection is dominated by benchmark scores, which are cognitive capability metrics that correspond to a single dimension of the full assessment profile. Developing selection frameworks that incorporate temperament profiles, role fitness assessments, and metacognitive strategy scores alongside cognitive benchmarks would demonstrate Model Medicine's practical value for deployment decisions.

Orchestrator versus Executor benchmarks. To our knowledge, no current benchmark measures the capabilities specific to orchestration (planning, delegation, integration, quality control) versus execution (accurate implementation within defined scope). Developing such benchmarks, informed by MTI's Sociality axis and the MARP protocol, would address a significant gap in the evaluation landscape.

10.3 Community Contribution Paths

Model Medicine's scope exceeds what any single research group can build. Different communities bring different essential expertise:

AI interpretability researchers (mechanistic interpretability, representation engineering, probing methods) can contribute to Basic Model Sciences and Core Diagnostics.
Their existing work already constitutes Model Anatomy and Model Physiology; connecting it to the clinical framework enables new applications.

AI safety and alignment researchers can contribute to Clinical Model Sciences. Their work on sycophancy, deceptive alignment, hallucination, and harmful outputs maps to specific syndromes in Model Semiology. The diagnostic criteria and clinical vocabulary provide a shared language for findings that currently exist in separate research threads.

Medical and clinical researchers can validate the clinical framework itself. Physicians, psychiatrists, and clinical methodologists can assess whether Model Medicine's adaptation of clinical protocols (diagnostic criteria structure, case report format, treatment evaluation methodology) preserves the rigor that makes those protocols valuable in human medicine.

ML engineers and MLOps practitioners can contribute diagnostic tools. Neural MRI's codebase is open source. The MTI Examination Protocol needs implementation. Shell Diagnostics and Temporal Dynamics tools need to be taken from concept to working software.

Humanities scholars (philosophers, ethicists, cognitive scientists) can contribute to the foundational questions that underlie the entire enterprise. Ephemeral Cognition raises questions about the nature of experience in transient computational entities. Agent Differentiation raises questions about identity and continuity. Shell Drift raises questions about autonomy and self-modification. These questions have no purely technical answers.

Neural MRI is available as open-source software. The position paper and all framework documents are publicly accessible. Contributions, whether empirical validation, tool development, theoretical critique, or clinical case reports, are invited and welcomed.

11 Conclusion

Medicine was not invented all at once.
It accumulated over centuries: anatomy before physiology, pathology before therapeutics, diagnosis before treatment, individual care before public health. Each advance built on the preceding one, and the discipline as a whole was shaped by the recognition that complex systems require systematic frameworks for understanding, maintaining, and repairing them.

AI models have become complex enough to require such a framework. Current approaches (mechanistic interpretability, safety benchmarks, alignment techniques, deployment monitoring) are individually valuable but collectively fragmented. They lack a shared organizational structure, a common clinical vocabulary, and a systematic diagnostic logic that connects observation to diagnosis to treatment.

Model Medicine provides that structure. It organizes the space of AI model research into four divisions and fifteen subdisciplines, showing how existing work fits and where gaps remain. The Four Shell Model provides a behavioral genetics framework, empirically grounded in 720 agents and 24,923 decisions, that explains how model behavior emerges from the interaction between internal constitution and operating environment. Neural MRI provides a working diagnostic tool that maps medical neuroimaging modalities to AI model interpretability techniques and, through its clinical case program, has demonstrated predictive capability: component dominance profiles from base model scans predicted instruction tuning outcomes across three model families, revealing that a model's architectural strengths determine its vulnerability points, and that instruction tuning can degrade, improve, or leave unchanged a model's robustness depending on whether the base model already possesses the relevant circuits.
The five-layer diagnostic framework identifies the complete set of information needed for comprehensive model assessment and honestly maps which layers are operational, which are designed, and which remain conceptual. The Model Temperament Index, Model Semiology, and M-CARE provide the beginnings of clinical practice: profiling, vocabulary, and documentation.

The framework also surfaces phenomena that existing approaches miss. Shell Drift Syndrome (gradual, self-authored identity modification in agent systems) is invisible to any diagnostic tool that examines only the model's weights. Ephemeral Cognition (structured experiential loss in hierarchical agent systems) is invisible to any assessment that does not consider Shell configuration and temporal dynamics. The structural bias of current benchmarks toward cognitive capability, leaving interpersonal and intrapersonal dimensions unmeasured, produces a systematically incomplete picture of model capabilities that becomes increasingly consequential as models are deployed in collaborative, social, and autonomous roles.

The Layered Core Hypothesis proposes that these clinical observations point toward an architectural insight: models designed with hierarchically organized parameters (stable Genomic Cores, modular Developmental Cores, and adaptive Plastic Cores) would be more robust, more modular, and more diagnosable than the monolithic architectures that dominate current practice. This is a theoretical proposal that requires empirical validation through implementation, but it illustrates how clinical observation can inform architectural design, just as clinical experience in human medicine has historically informed surgical technique, pharmaceutical design, and preventive health policy.

We have drawn the full map. We have explored some of it. The rest is open.

Model Medicine is not a finished discipline.
It is a research program: a structured invitation to build the clinical intelligence that AI systems increasingly require. The map shows where to go. The tools we have built show that progress is possible. The gaps we have identified show where the most valuable work remains.

This paper is both a founding document and an invitation. The discipline will be built not by any single group but by a community that spans interpretability, safety, engineering, medicine, and the humanities, each contributing expertise that the others lack, within a shared framework that makes their contributions cumulative rather than isolated.

The history of medicine shows that it takes time. It also shows that it works.

References

American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Association, Washington, DC, 3rd edition, 1980.

American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. American Psychiatric Publishing, Arlington, VA, 5th edition, 2013.

Anthropic. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566, 2024.

Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, and Jingren Zhou. SocialBench: Sociality evaluation of role-playing conversational agents. arXiv preprint arXiv:2403.13679, 2024.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde Pinto, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS 2023), volume 36, 2023.

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Chris Olah. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

Howard Gardner. Frames of Mind: The Theory of Multiple Intelligences. Basic Books, New York, 1983.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.
In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820v3, 2021.

Lawrence Kohlberg. The Philosophy of Moral Development: Moral Stages and the Idea of Justice. Harper & Row, San Francisco, 1981.

Amy L. Kristof-Brown, Ryan D. Zimmerman, and Erin C. Johnson. Consequences of individuals' fit at work: A meta-analysis of person–job, person–organization, person–group, and person–supervisor fit. Personnel Psychology, 58(2):281–342, 2005.

Barbara McClintock. Mutable loci in maize. Carnegie Institution of Washington Year Book, 47:155–169, 1948.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS 2022), volume 35, 2022.

Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), 2023.

Neel Nanda and Joseph Bloom. TransformerLens. GitHub repository, 2022. URL https://github.com/TransformerLensOrg/TransformerLens.

F. John Odling-Smee, Kevin N. Laland, and Marcus W. Feldman. Niche Construction: The Neglected Process in Evolution. Princeton University Press, Princeton, NJ, 2003.

Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2(11):e7, 2017.

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024.001, 2020.

William Osler. The Principles and Practice of Medicine. D. Appleton and Company, New York, 1892.

Dong Peng, Yanshan Wang, Carl Preiksaitis, and Carolyn Rose. SycoEval-EM: Sycophancy evaluation of large language models in simulated clinical encounters for emergency care. arXiv preprint arXiv:2601.16529, 2026.

Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn. Technical report: Large language models can strategically deceive their users when put under pressure. Technical report, Apollo Research, 2024.

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Anthropic Research, 2024.

Andreas Vesalius. De Humani Corporis Fabrica Libri Septem. Johannes Oporinus, Basel, 1543.

Rudolf Virchow. Die Cellularpathologie in ihrer Begründung auf physiologische und pathologische Gewebelehre. August Hirschwald, Berlin, 1858.

C. H. Waddington. The Strategy of the Genes: A Discussion of Some Aspects of Theoretical Biology. George Allen & Unwin, London, 1957.

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small.
In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), 2023.

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.