Open Biomedical Knowledge Graphs at Scale: Construction, Federation, and AI Agent Access with Samyama Graph Database

Biomedical knowledge is fragmented across siloed databases -- Reactome for pathways, STRING for protein interactions, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug-gene interactions, SIDER for side effects. W…

Authors: Madhulatha Mandarapu, Sandeep Kunkunuru

Madhulatha Mandarapu∗, Sandeep Kunkunuru†
VaidhyaMegha Private Limited, India
https://samyama.ai/
March 2026

Abstract

Biomedical knowledge is fragmented across siloed databases: Reactome for pathways, STRING for protein interactions, Gene Ontology for functional annotations, ClinicalTrials.gov for study registries, DrugBank for drug vocabularies, DGIdb for drug–gene interactions, SIDER for side effects, and dozens more. Researchers routinely download flat files from each source and write bespoke scripts to cross-reference them, a process that is slow, error-prone, and not reproducible. We present three open-source biomedical knowledge graphs built on Samyama, a high-performance graph database written in Rust: Pathways KG (118,686 nodes, 834,785 edges from 5 sources), Clinical Trials KG (7,774,446 nodes, 26,973,997 edges from 5 sources), and Drug Interactions KG (32,726 nodes, 191,970 edges from 3 sources).

Our contributions are threefold. First, we describe a reproducible ETL pattern for constructing large-scale KGs from heterogeneous public data sources, with cross-source deduplication, batch loading (both Python Cypher and Rust native loaders), and portable snapshot export. Second, we demonstrate cross-KG federation: loading all three snapshots into a single graph tenant enables property-based joins across datasets, answering questions like “For drugs indicated for diabetes, what are their gene targets and which biological pathways do those targets participate in?”, a query that no single KG can answer alone. Third, we introduce schema-driven MCP server generation: each KG automatically exposes typed tools for LLM agents via the Model Context Protocol, enabling natural-language access to graph queries.
We evaluate domain-specific MCP tools against text-to-Cypher and standalone GPT-4o on a new BiomedQA benchmark (40 pharmacology questions), achieving 98% accuracy vs. 85% for schema-aware text-to-Cypher and 75% for standalone GPT-4o, with zero schema errors. All data sources are open-license (CC BY 4.0, CC0, OBO, public domain). Snapshots, ETL code, and MCP configurations are publicly available. The combined federated graph (7.9M nodes, 28M edges) loads in approximately 3 minutes from portable snapshots on commodity cloud hardware (62 GB RAM), with single-KG queries completing in 80–100 ms and cross-KG federation joins in 1–4 s.

Keywords: Knowledge Graphs, Biomedical Data Integration, Graph Databases, Cross-KG Federation, Model Context Protocol, Clinical Trials, Biological Pathways, Drug Interactions, Pharmacogenomics, OpenCypher.

∗ madhulatha@samyama.ai, ORCID: https://orcid.org/0009-0005-2837-6725
† sandeep@samyama.ai, ORCID: https://orcid.org/0000-0002-8886-1846

1 Introduction

Biological and clinical knowledge is distributed across dozens of public databases, each with its own schema, identifiers, and access patterns. A researcher investigating how a cancer drug affects cellular signaling must consult ClinicalTrials.gov for trial metadata, DrugBank for drug–target interactions, Reactome for pathway membership, STRING for protein–protein interactions, and Gene Ontology for functional annotations. These databases are maintained by independent communities, updated on different schedules, and stored in incompatible formats (TSV, XML, JSON, OBO, GAF).

Knowledge graphs (KGs) offer a natural integration point [Hogan et al., 2021]: entities from different sources become nodes, and relationships between them become edges.
A single Cypher query can traverse from a clinical trial to the biological pathways disrupted by its drug candidate, a query that would require joining across five separate databases in the flat-file paradigm.

However, constructing biomedical KGs at scale remains challenging. Prior efforts such as Bio2RDF [Belleau et al., 2008], Hetionet [Himmelstein et al., 2017], and the Clinical Knowledge Graph [Santos et al., 2022] have made significant progress, but each faces limitations: Bio2RDF requires a dedicated SPARQL endpoint infrastructure, Hetionet is a static dataset without an update pipeline, and the Clinical Knowledge Graph targets a single data source. We address these limitations with three contributions:

1. Reproducible KG construction. We present ETL pipelines for three biomedical KGs, namely Pathways KG (5 sources, 119K nodes), Clinical Trials KG (5 sources, 7.7M nodes), and Drug Interactions KG (3 sources, 33K nodes), built on Samyama Graph Database using a common pattern: download, parse, deduplicate, load (Python Cypher or Rust native), and export as portable .sgsnap snapshots. The Rust native loader creates the Drug Interactions KG (33K nodes, 192K edges) in under 1 second.

2. Cross-KG federation. We show that loading multiple snapshots into a single graph tenant enables property-based federation: joining on shared identifiers (UniProt accessions, DrugBank IDs, gene names) without ETL-time merging. We demonstrate cross-KG query patterns spanning molecular biology, translational medicine, and pharmacogenomics.

3. Schema-driven AI agent access. Each KG ships with a Model Context Protocol (MCP) [Anthropic, 2024] server configuration with domain-specific tools. We evaluate MCP tools against text-to-Cypher (GPT-4o) and standalone GPT-4o on a new BiomedQA benchmark: MCP tools achieve 98% accuracy (39/40) vs. 85% for schema-aware text-to-Cypher and 75% for standalone GPT-4o, with zero schema errors.

All code, snapshots, and MCP configurations are open-source.¹

¹ Pathways KG: https://github.com/samyama-ai/pathways-kg; Clinical Trials KG: https://github.com/samyama-ai/clinicaltrials-kg; Drug Interactions KG: https://github.com/samyama-ai/druginteractions-kg; Snapshots: https://github.com/samyama-ai/samyama-graph/releases

2 Background and Related Work

2.1 Biomedical Knowledge Graphs

Hetionet [Himmelstein et al., 2017] integrates 29 public resources into a single heterogeneous network (47K nodes, 2.25M edges, 24 edge types) for drug repurposing. While influential, it is a static dataset from 2017 with no update pipeline.

Bio2RDF [Belleau et al., 2008] converts life science databases to RDF and serves them via SPARQL endpoints. The RDF approach offers semantic interoperability but introduces complexity (OWL reasoning, blank nodes) and performance overhead for traversal-heavy queries.

Clinical Knowledge Graph (CKG) [Santos et al., 2022] builds a comprehensive biomedical graph from 25+ databases using Neo4j. CKG is the closest prior work to ours; we differ in three ways: (1) we use a Rust-native engine rather than Neo4j, achieving higher ingestion throughput; (2) we introduce portable snapshots for instant deployment; and (3) we provide schema-driven MCP servers for LLM agent integration.

PrimeKG [Chandak et al., 2023] constructs a precision medicine KG (129K nodes, 8M edges) from 20 sources. PrimeKG targets drug–disease prediction via graph neural networks; our focus is on interactive querying and AI agent access.

2.2 Graph Database Systems

Neo4j [Neo4j, Inc., 2024] is the dominant property graph database. Amazon Neptune, TigerGraph, and MemGraph offer alternatives with varying trade-offs.
Samyama [Mandarapu and Kunkunuru, 2026] is a Rust-native graph database combining property graph storage, vector search, 22 metaheuristic solvers, and OpenCypher support in a single binary.

2.3 Model Context Protocol (MCP)

MCP [Anthropic, 2024] is an open standard for connecting LLM agents to external data sources via typed tools. An MCP server exposes a set of tools (functions with typed parameters and return values) that an LLM can invoke during a conversation. Schema-driven MCP generation, which automatically creates tools from a KG schema, eliminates the manual tool authoring bottleneck.

3 Data Sources

All three KGs draw exclusively from open-license public databases. Table 1 summarizes the sources.

Table 1: Data sources for the three biomedical KGs. All are open-license and human-specific (organism 9606 where applicable).

KG          Source               Content                                     License         Size
Pathways    Reactome             Pathways, reactions, complexes              CC BY 4.0       172 MB
Pathways    STRING v12.0         Protein–protein interactions                CC BY 4.0       900 MB
Pathways    Gene Ontology        GO terms & annotations                      OBO             265 MB
Pathways    WikiPathways         Community-curated pathways                  CC0             336 KB
Pathways    UniProt              Protein metadata, gene/disease/drug links   CC BY 4.0       20 MB
Clinical    ClinicalTrials.gov   Trial registry (575K studies)               Public domain   API
Clinical    MeSH (NLM)           Disease hierarchy                           Free            API
Clinical    RxNorm (NLM)         Drug normalization & ATC codes              Free            API
Clinical    OpenFDA              Adverse event reports (FAERS)               Public domain   API
Clinical    PubMed (NLM)         Trial-linked publications                   Free            API
Drug Int.   DrugBank CC0         Drug vocabulary (19K drugs)                 CC0             3 MB
Drug Int.   DGIdb                Drug–gene interactions (38K edges)          Open            24 MB
Drug Int.   SIDER                Side effects & indications                  CC-BY-SA        22 MB

4 Knowledge Graph Construction

4.1 ETL Pattern

All three KGs follow a common five-phase pattern:

1. Download. A data downloader with resume support fetches all source files. Compressed files are decompressed automatically.

2. Parse & Filter. Source-specific parsers extract entities and relationships.
Human-only filters are applied where applicable (organism ID 9606).

3. Deduplicate. A shared Registry tracks seen entities across phases (e.g., a protein loaded from Reactome is not re-created when encountered in STRING).

4. Batch Load. Nodes and edges are loaded via batched Cypher CREATE statements (50–100 entities per batch) through Samyama's HTTP API.

5. Snapshot Export. The loaded graph is exported as a portable .sgsnap file (gzip JSON-lines), enabling instant restoration on any Samyama instance.

4.2 Pathways KG

The Pathways KG integrates molecular biology data into a graph with 5 node labels and 9 edge types (Table 2).

Table 2: Pathways KG schema.

Node Label    Key Property   Count
GOTerm        go_id          51,897
Protein       uniprot_id     37,990
Complex       reactome_id    15,963
Reaction      reactome_id    9,988
Pathway       reactome_id    2,848
Total nodes                  118,686

Edge Type         Count
ANNOTATED_WITH    265,492
INTERACTS_WITH    227,818
PARTICIPATES_IN   140,153
CATALYZES         121,365
IS_A              58,799
COMPONENT_OF      8,186
PART_OF           7,122
REGULATES         2,986
CHILD_OF          2,864
Total edges       834,785

The five ETL phases execute in order:

(1) Reactome Core: pathways, reactions, complexes, and protein participants from Reactome's flat files;
(2) STRING Interactions: high-confidence (≥ 700/999) protein–protein interactions with ENSP → UniProt ID mapping;
(3) Gene Ontology: 47K GO terms with IS_A/PART_OF/REGULATES hierarchy and 265K protein annotations;
(4) WikiPathways: community-curated pathways deduplicated against Reactome;
(5) UniProt Enrichment: gene mappings, disease associations, and drug targets.

4.3 Clinical Trials KG

The Clinical Trials KG models the translational medicine domain with 15 node labels and 25 edge types (Table 3).

Table 3: Clinical Trials KG schema (abbreviated; 15 labels, 25 edge types).

Node Label       Key Property              Est. Count
ClinicalTrial    nct_id                    575,000+
Condition        name, mesh_id             varies
Intervention     name, type                varies
Drug             rxnorm_cui, drugbank_id   varies
Protein          uniprot_id                varies
Gene             gene_id, symbol           varies
MeSHDescriptor   descriptor_id             varies
Sponsor          name, class               varies
Site             facility, country         varies
Publication      pmid, doi                 varies
AdverseEvent     term, is_serious          varies
ArmGroup         label, type               varies
Outcome          measure, type             varies
DrugClass        atc_code, level           varies
LabTest          loinc_code                varies
Total nodes                                7,774,446
Total edges                                26,973,997

The ETL pipeline queries the ClinicalTrials.gov API v2 for study metadata, enriches conditions with MeSH hierarchy, normalizes drug names via RxNorm, links adverse events from OpenFDA's FAERS database, and associates publications from PubMed E-utilities.

4.4 Drug Interactions KG

The Drug Interactions KG models drug–gene interactions, side effects, and indications with 4 node labels and 3 edge types (Table 4).

Table 4: Drug Interactions KG schema.

Node Label    Key Property   Count
Drug          drugbank_id    19,842
Gene          gene_name      4,182
SideEffect    meddra_id      5,858
Indication    meddra_id      2,844
Total nodes                  32,726

Edge Type             Count
INTERACTS_WITH_GENE   38,033
HAS_SIDE_EFFECT       139,193
HAS_INDICATION        14,744
Total edges           191,970

The ETL consists of two phases.

Phase 1 (DrugBank + DGIdb): 19,842 Drug nodes are created from the DrugBank CC0 vocabulary CSV, with all synonyms indexed for cross-source name matching (52,154 synonyms). DGIdb interactions yield 4,182 Gene nodes and 38,033 INTERACTS_WITH_GENE edges with interaction types (inhibitor, activator, etc.).

Phase 2 (SIDER): STITCH compound IDs are mapped to DrugBank names via synonym lookup, yielding 5,858 SideEffect nodes with 139,193 HAS_SIDE_EFFECT edges and 2,844 Indication nodes with 14,744 HAS_INDICATION edges.
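The synonym lookup used in Phase 2 can be illustrated with a small Python sketch. This is not the authors' loader: the record layout, the `normalize` rule, and the first-match policy for ambiguous synonyms are assumptions made for illustration, and the IDs are placeholders rather than real DrugBank or STITCH identifiers.

```python
def normalize(name: str) -> str:
    """Case- and whitespace-insensitive key used for name matching."""
    return " ".join(name.lower().split())


def build_synonym_index(drugs: list[dict]) -> dict[str, str]:
    """Map every primary name and synonym to its DrugBank ID.

    The first drug to claim a name keeps it, so ambiguous synonyms
    resolve deterministically in input order (an assumed policy).
    """
    index: dict[str, str] = {}
    for drug in drugs:
        for name in (drug["name"], *drug["synonyms"]):
            index.setdefault(normalize(name), drug["drugbank_id"])
    return index


def map_stitch_ids(stitch_names: dict[str, str], index: dict[str, str]):
    """Split STITCH IDs into resolved (-> DrugBank ID) and unresolved."""
    resolved, unresolved = {}, []
    for stitch_id, name in stitch_names.items():
        drugbank_id = index.get(normalize(name))
        if drugbank_id is None:
            unresolved.append(stitch_id)
        else:
            resolved[stitch_id] = drugbank_id
    return resolved, unresolved


# Hypothetical records; the IDs are placeholders, not real identifiers.
drugs = [
    {"drugbank_id": "DB-EX-1", "name": "Warfarin", "synonyms": ["Coumadin"]},
    {"drugbank_id": "DB-EX-2", "name": "Metformin", "synonyms": ["Glucophage"]},
]
stitch_names = {
    "CID-EX-1": "coumadin",
    "CID-EX-2": "  METFORMIN ",
    "CID-EX-3": "unknown compound",
}

index = build_synonym_index(drugs)
resolved, unresolved = map_stitch_ids(stitch_names, index)
print(resolved, unresolved)
```

Keeping the unresolved IDs in a separate list makes it easy to report how many SIDER compounds fail to match, which is the kind of coverage gap that name-based matching tends to produce.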

A Rust native loader (druginteractions_loader.rs) performs the full ETL in 928 ms using direct GraphStore API calls (no Cypher parsing), compared to ~50 minutes via the Python HTTP loader.

5 Cross-KG Federation

5.1 Motivation

The Pathways KG knows molecular biology: which proteins interact, what pathways they participate in. The Clinical Trials KG knows translational medicine: which drugs are in trials, what conditions they treat. The Drug Interactions KG knows pharmacology: which genes a drug targets, what side effects it causes. No single KG can answer:

    “For drugs indicated for diabetes, what are their gene targets and which biological pathways do those targets participate in?”

This query requires traversing three KGs: Drug -[HAS_INDICATION]-> Indication (Drug Interactions KG), Drug -[INTERACTS_WITH_GENE]-> Gene -(bridge: gene_name = name)-> Protein -[PARTICIPATES_IN]-> Pathway (Pathways KG). The WHERE p.name = g.gene_name clause bridges Drug Interactions to Pathways.

5.2 Join Points

Shared entity types with matching identifiers enable cross-KG joins:

Table 5: Cross-KG join points across the three biomedical KGs.

Entity         From KG        To KG          Join Property              Identifier
Gene/Protein   Drug Int.      Pathways       gene_name = name           Gene symbol
Drug           Drug Int.      Clin. Trials   name = Intervention.name   Drug name
Drug           Drug Int.      Clin. Trials   drugbank_id                DrugBank ID
Protein        Clin. Trials   Pathways       uniprot_id                 UniProt accession

5.3 Federation Mechanism

Samyama supports loading multiple snapshots into a single tenant. Each snapshot import appends nodes and edges to the existing graph. Since imports create new node IDs, entities from different snapshots with the same identifier (e.g., UniProt P04637 for TP53) exist as separate graph nodes.
Cross-KG queries use property-based joins:

    // Drug targets -> biological pathways (Drug Interactions -> Pathways)
    MATCH (d:Drug {name: 'Metformin'})-[:INTERACTS_WITH_GENE]->(g:Gene)
    MATCH (p:Protein)-[:PARTICIPATES_IN]->(pw:Pathway)
    WHERE p.name = g.gene_name
    RETURN g.gene_name, pw.name LIMIT 10

    // Drug -> clinical trials testing it (Drug Interactions -> Clinical Trials)
    MATCH (d:Drug {name: 'Warfarin'})
    MATCH (i:Intervention)<-[:TESTS]-(ct:ClinicalTrial)
    WHERE i.name = d.name
    RETURN ct.nct_id, ct.phase LIMIT 10

    // Breast cancer trial landscape (Clinical Trials only)
    MATCH (ct:ClinicalTrial)-[:STUDIES]->(c:Condition)
    WHERE c.name CONTAINS 'Breast'
    RETURN c.name, count(ct) AS trials
    ORDER BY trials DESC LIMIT 5

The WHERE p.name = g.gene_name clause is the Drug Interactions → Pathways bridge: it joins a Gene node from the Drug Interactions KG with a Protein node from the Pathways KG using gene symbol as the shared identifier. The WHERE i.name = d.name clause bridges Drug Interactions → Clinical Trials via drug name.

5.4 Federation Query Patterns

We identify five cross-KG query patterns of increasing complexity:

Table 6: Cross-KG federation query patterns.

Pattern                From KG                       Bridge      To KG
Drug → Pathway         Drug Int. (Gene)              gene_name   Pathways (Protein)
Drug → Trial           Drug Int. (Drug)              name        Clin. Trials (Intervention)
Drug → GO terms        Drug Int. (Gene)              gene_name   Pathways (GOTerm)
Drug SE → Pathway      Drug Int. (SE+Gene)           gene_name   Pathways (Protein)
Trial → Side effects   Clin. Trials (Intervention)   name        Drug Int. (Drug)

6 Schema-Driven MCP Server Generation

Each KG ships with a YAML configuration that defines domain-specific MCP tools. At startup, the MCP server:

1. Discovers schema from the running Samyama instance (GET /api/schema).

2. Auto-generates tools for each node label (search, get, count) and each edge type (find connections).

3. Registers domain tools from the YAML configuration. Each tool is a parameterized Cypher template with typed inputs (e.g., protein_name: string, confidence_threshold: int).

Each KG's MCP server exposes domain-specific tools. The Pathways KG has 12 tools (Table 7), the Drug Interactions KG has 12 tools (e.g., drug_interactions, interaction_checker, polypharmacy_risk, drug_side_effects), and the Clinical Trials KG has 15 tools.

Table 7: Pathways KG MCP tools (subset).

Tool                       Description
pathway_members            List proteins in a pathway (search by name)
interaction_partners       PPI neighbors above confidence threshold
shared_pathways            Pathways shared between two proteins
upstream_regulators        Multi-hop PPI traversal up to N steps
drug_pathway_impact        Pathways affected by a drug via protein targets
disease_pathways           Pathways associated with a disease through gene links
go_enrichment              GO terms enriched in pathway proteins
protein_function_summary   Pathways, GO processes, disease associations for a protein

An LLM agent connected to this MCP server can answer “What pathways does TP53 participate in?” without writing Cypher. The agent calls pathway_members(protein_name="TP53") and receives structured results.

7 Evaluation

We evaluate construction time, query performance, and AI agent access on an AWS g4dn.4xlarge instance (16 vCPU AMD EPYC, 62 GB RAM, NVIDIA A10G GPU).
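Since the evaluation compares MCP tools with free-form query generation, it may help to make concrete what a "parameterized Cypher template with typed inputs" looks like. The Python sketch below is illustrative only: `ToolSpec` and `bind` are hypothetical names, the `score` edge property is assumed, and it does not use the MCP SDK or reproduce the authors' YAML format. The tool name, parameter names, and the INTERACTS_WITH edge type come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """A tool is a name, a Cypher template, and declared parameter types."""
    name: str
    cypher: str
    params: dict

    def bind(self, **kwargs):
        """Check arguments against the declared types and return the
        (template, parameters) pair that would be sent to the database."""
        if set(kwargs) != set(self.params):
            raise ValueError(f"{self.name} expects {sorted(self.params)}")
        for key, expected in self.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return self.cypher, dict(kwargs)


# The `score` property on INTERACTS_WITH is an assumption for illustration.
interaction_partners = ToolSpec(
    name="interaction_partners",
    cypher=(
        "MATCH (p:Protein {name: $protein_name})-[r:INTERACTS_WITH]-(q:Protein) "
        "WHERE r.score >= $confidence_threshold "
        "RETURN q.name LIMIT 10"
    ),
    params={"protein_name": str, "confidence_threshold": int},
)

query, args = interaction_partners.bind(
    protein_name="TP53", confidence_threshold=700
)
print(query)
print(args)
```

In this arrangement the LLM only chooses a tool and supplies arguments; the query text is fixed in advance, so it cannot reference labels or edge types that do not exist in the schema.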
7.1 Construction Performance

Table 8: KG construction performance (snapshot import on AWS g4dn.4xlarge).

KG                     Nodes       Edges        Snapshot   Import   Rust ETL
Pathways KG            118,686     834,785      9 MB       3.4 s    -
Drug Interactions KG   32,726      191,970      1.9 MB     0.7 s    0.9 s
Clinical Trials KG     7,774,446   26,973,997   711 MB     177 s    -
Combined               7,925,858   28,000,752   722 MB     181 s    -

Snapshot import uses gzip-compressed JSON-lines format. The Pathways KG loads in 3.4 seconds; the Drug Interactions KG in 0.7 seconds; and the Clinical Trials KG, with its 7.8M nodes, loads in 177 seconds. The Drug Interactions KG also has a Rust native loader that constructs the full graph from source files (DrugBank CSV, DGIdb TSV, SIDER TSV) in 928 ms, orders of magnitude faster than the Python HTTP-based ETL.

The combined federated graph (7.9M nodes, 28M edges) uses 33 GB RAM (53% of 62 GB available on the AWS instance).

7.2 Query Performance

We benchmark representative queries from each KG and the federated graph:

Table 9: Query latency on AWS g4dn.4xlarge (7.9M nodes loaded).

Query                            KG               Results   Latency
Drug gene targets (Metformin)    Drug Int.        5 rows    83 ms
Shared gene targets (2 drugs)    Drug Int.        1 row     84 ms
Top pathways by protein count    Pathways         10 rows   97 ms
PPI partners (TP53: Q9UQ61)      Pathways         10 rows   96 ms
Side effects of Warfarin         Drug Int.        10 rows   82 ms
Diabetes drugs → pathways        Drug Int.+Path   5 rows    3.9 s
Warfarin → trial conditions      Drug Int.+CT     5 rows    5.8 s
PTGS1 drugs → side effects       Drug Int.        10 rows   3.7 s

Simple single-KG queries (drug lookups, gene targets, side effects) complete in 80–100 ms. Multi-hop cross-KG joins (e.g., “drugs targeting PTGS1 and their most common side effects”) complete in 3–4 s on the full 7.9M-node graph.

7.3 BiomedQA Benchmark

We introduce BiomedQA, a benchmark of 40 pharmacology questions across 7 categories over the three federated KGs.
We compare three approaches: domain-specific MCP tools (parameterized Cypher templates), text-to-Cypher via the schema-aware NLQ endpoint (GPT-4o with full schema system prompt and few-shot examples), and standalone GPT-4o (no database access). The BiomedQA benchmark is open-source.²

² https://github.com/samyama-ai/biomedqa

Table 10: BiomedQA results (40 questions, 7.9M nodes, 3 federated KGs).

Approach                Accuracy      Avg Latency   Avg Tokens
GPT-4o standalone       30/40 (75%)   2,805 ms      195
Text-to-Cypher (NLQ)    34/40 (85%)   1,846 ms      0†
MCP tools               39/40 (98%)   920 ms        0

Text-to-Cypher achieves 85% with a schema-aware NLQ pipeline (full schema in system prompt, few-shot examples). Its 6 failures are: 3 schema hallucinations (non-existent edge traversals), 1 exact-vs-CONTAINS mismatch, 1 inline property variable, and 1 correct empty result. MCP tools eliminate schema errors entirely. († Token count is 0 because the NLQ endpoint handles the LLM call server-side.)

7.4 Federation Correctness

We validated cross-KG federation on the AWS g4dn.4xlarge with all three KGs loaded (7.9M nodes):

1. Drug Interactions → Pathways bridge: The query “Metformin gene targets → biological pathways” uses WHERE p.name = g.gene_name to bridge Gene nodes from DGIdb to Protein nodes from Reactome/STRING. Returns pathway memberships (e.g., HNF1B → Developmental Biology) in 3.0 s.

2. Drug Interactions → Clinical Trials bridge: The query “clinical trials testing Warfarin” uses WHERE i.name = d.name to bridge Drug nodes to Intervention nodes. Returns trial NCT IDs and phases (e.g., NCT00835861, PHASE2) in 0.6 s.

3. Three-KG chain: The query “drugs indicated for diabetes → gene targets → pathways” traverses Drug → Indication (Drug Int. KG) → Gene (Drug Int. KG) → Protein → Pathway (Pathways KG), returning pathways such as Circadian clock and Dissolution of Fibrin Clot via SERPINE1, in 3.9 s.
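The bridges validated above are equality joins on a shared property between nodes that have no edge connecting them. A minimal Python sketch of that idea follows, using plain dictionaries as stand-ins for graph nodes. It does not describe how Samyama executes the query: the `property_join` function and the node IDs are hypothetical, while the gene and pathway pairs are the examples quoted in the list above.

```python
from collections import defaultdict


def property_join(genes, proteins, participates_in):
    """Equality join of Gene.gene_name against Protein.name, followed by
    one hop along PARTICIPATES_IN. Returns (gene_name, pathway) rows."""
    # Index proteins by name once so each gene lookup is a dict access.
    proteins_by_name = defaultdict(list)
    for protein in proteins:
        proteins_by_name[protein["name"]].append(protein)

    rows = []
    for gene in genes:
        for protein in proteins_by_name.get(gene["gene_name"], []):
            for pathway in participates_in.get(protein["node_id"], []):
                rows.append((gene["gene_name"], pathway))
    return rows


# Node IDs are hypothetical. Gene and Protein nodes come from different
# snapshots, so they are distinct nodes that only share a gene symbol.
genes = [
    {"node_id": 1, "gene_name": "HNF1B"},
    {"node_id": 2, "gene_name": "SERPINE1"},
    {"node_id": 3, "gene_name": "NO_MATCH"},
]
proteins = [
    {"node_id": 101, "name": "HNF1B"},
    {"node_id": 102, "name": "SERPINE1"},
]
participates_in = {
    101: ["Developmental Biology"],
    102: ["Dissolution of Fibrin Clot"],
}

rows = property_join(genes, proteins, participates_in)
print(rows)
```

Genes without a matching protein name simply drop out of the result, which is why join coverage depends on how consistently identifiers are normalized across sources.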

8 Interactive Visualization

Samyama Insight, a React-based frontend, provides schema-driven visualization of each KG:

• Dashboard: Auto-generated panels showing label distribution, edge type counts, and property statistics per tenant.
• Query Console: OpenCypher editor with result tables, JSON views, and EXPLAIN/PROFILE plan visualization.
• Graph Simulation: Force-directed canvas with per-label colors/shapes, live activity particles, entity filtering, and a legend overlay. The simulation engine is fully schema-driven: it auto-configures from the tenant's schema at runtime.

Demo recordings for both KGs are available as MP4 videos.³

³ Cricket KG demo: https://github.com/samyama-ai/samyama-graph/releases/tag/kg-snapshots-v2; Pathways KG demo: https://github.com/samyama-ai/samyama-graph/releases/tag/kg-snapshots-v3

9 Discussion

9.1 Property Joins vs. Entity Merging

Our federation approach uses property-based joins rather than merging entities at load time. This has trade-offs:

• Advantages: Snapshots remain independent and composable; no load-order dependency; straightforward to add or remove a KG from the federation.
• Disadvantages: Duplicate nodes inflate storage; property joins are slower than traversals on merged nodes; no referential integrity between the two copies of a Protein.

For production workloads, a post-load MERGE pass or ETL-time deduplication would eliminate duplicates. For exploration and prototyping, which is the primary use case for these KGs, property joins are pragmatic.

9.2 Limitations

1. Snapshot currency: Snapshots are point-in-time exports. Source databases (especially ClinicalTrials.gov) update continuously. Periodic re-export is required.

2. Identifier coverage: Not all proteins in the Clinical Trials KG have UniProt IDs (some have only gene symbols). The join overlap depends on identifier normalization quality.

3. Memory requirements: The combined 7.9M-node graph requires approximately 33 GB RAM. Machines with less memory can load the Pathways + Drug Interactions KGs alone (151K nodes, < 1 GB).

4. No cross-tenant queries: Currently, federation requires loading all snapshots into a single tenant. Native cross-tenant query support is planned for a future Samyama release.

9.3 Generalizability

The pattern is not limited to biomedicine. Any domain with shared identifiers across data sources can use the same approach: build independent KGs with common entity properties, export as snapshots, load into a single tenant, and query across them. We have applied this pattern to sports (Cricket KG, 36K nodes) and industrial operations (AssetOps KG, 781 nodes) in addition to the biomedical KGs described here.

10 Conclusion

We presented three open-source biomedical knowledge graphs built on Samyama Graph Database: Pathways KG (119K nodes from 5 sources), Clinical Trials KG (7.7M nodes from 5 sources), and Drug Interactions KG (33K nodes from 3 sources). Together they contain 7.9 million nodes and 28 million edges from 13 public data sources. Loading all three snapshots into a single graph tenant enables cross-KG federated queries that bridge molecular biology, translational medicine, and pharmacogenomics. Single-KG queries complete in 80–100 ms; cross-KG federation joins in 1–4 s on commodity cloud hardware (62 GB RAM). A Rust native loader constructs the Drug Interactions KG in under 1 second.

We introduced domain-specific MCP tools for LLM agent access and evaluated them on a new BiomedQA benchmark (40 pharmacology questions): MCP tools achieve 98% accuracy vs. 85% for schema-aware text-to-Cypher and 75% for standalone GPT-4o, with zero schema errors, demonstrating that for domain-specific data access, the LLM's role should be tool selection and argument extraction, not query generation.

All code, snapshots, ETL pipelines, benchmarks, and MCP configurations are open-source. Researchers can reproduce the full 3-KG federation, from snapshot import to cross-KG queries, in approximately 3 minutes.

Data and Code Availability

• Pathways KG: https://github.com/samyama-ai/pathways-kg
• Clinical Trials KG: https://github.com/samyama-ai/clinicaltrials-kg
• Drug Interactions KG: https://github.com/samyama-ai/druginteractions-kg
• BiomedQA Benchmark: https://github.com/samyama-ai/biomedqa
• Samyama Graph Database: https://github.com/samyama-ai/samyama-graph
• Snapshots: https://github.com/samyama-ai/samyama-graph/releases

References

Anthropic. Model context protocol. https://modelcontextprotocol.io/, 2024.

François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 41(5):706–716, 2008.

Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific Data, 10(1):67, 2023.

Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife, 6:e26726, 2017.

Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. Knowledge graphs. ACM Computing Surveys, 54(4):1–37, 2021.

Madhulatha Mandarapu and Sandeep Kunkunuru. Samyama: A unified graph-vector database with in-database optimization, agentic enrichment, and hardware acceleration. arXiv preprint arXiv:2603.08036, 2026.

Neo4j, Inc. Neo4j graph database. https://neo4j.com/, 2024.

Alberto Santos, Ana R Colaco, Annelaura B Nielsen, Lili Niu, Maximilian Strauss, Philipp E Geyer, Fabian Coscia, Nicolai J Wewer Albrechtsen, Filip Mundt, Lars Juhl Jensen, and Matthias Mann. A knowledge graph to interpret clinical proteomics data. Nature Biotechnology, 40(5):692–702, 2022.
