BioModelsRAG: A Biological Modeling Assistant Using RAG (Retrieval Augmented Generation)

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

The BioModels database is one of the premier repositories of computational models in systems biology. It contains over 1000 curated models and an even larger number of non-curated models, all stored in the machine-readable SBML format. Although SBML can be translated into the human-readable Antimony format, analyzing the models can still be time-consuming. To bridge this gap, an LLM (large language model) assistant was created to analyze BioModels entries and let users interact with a model in natural language, so that the salient points of a given model can be extracted easily and rapidly. Our analysis workflow involved 'chunking' BioModels, converting the chunks to plain text using Llama 3, and embedding them in a ChromaDB database. The user-provided query was embedded with the same model, and a similarity search between the query and the stored BioModels chunks in ChromaDB retrieved the most relevant models. These models were then supplied as context for the chat between the user and the LLM, producing more accurate output. This approach greatly reduced the chance of hallucination and kept the LLM focused on the problem at hand.


💡 Research Summary

The paper presents BioModelsRAG, an end‑to‑end Retrieval‑Augmented Generation (RAG) system that enables natural‑language interaction with the extensive collection of computational models stored in the BioModels repository. BioModels hosts over a thousand curated SBML (Systems Biology Markup Language) models and many more non‑curated ones. While SBML is machine‑readable, it is not readily interpretable by biologists without a background in systems biology. The authors therefore built a pipeline that (1) downloads every model via the BioModels API, (2) converts the SBML files to Antimony using the Tellurium and Antimony libraries, (3) splits each Antimony script into logical sections (compartments, species, reactions, initializations, etc.), (4) asks a large language model (LLM), specifically Llama 3, to produce concise, information‑preserving summaries of each section, (5) embeds those summaries with the sentence‑transformers all‑MiniLM‑L6‑v2 model, and (6) stores the resulting vectors in a ChromaDB vector database.
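Step (3) of the pipeline, splitting an Antimony script into logical sections, can be sketched in pure Python. The `// Section:` comment headers below follow the layout Tellurium typically emits when converting SBML to Antimony, and the model fragment is a made-up illustration, not an actual BioModels entry.

```python
import re

def split_antimony_sections(antimony_text: str) -> dict:
    """Split an Antimony script on its '// Section:' comment headers.

    Header names ('Compartments and Species', 'Reactions', ...) follow the
    layout Tellurium typically emits; treat them as illustrative.
    """
    sections = {"header": []}
    current = "header"
    for line in antimony_text.splitlines():
        match = re.match(r"\s*//\s*(.+?):\s*$", line)
        if match:
            # Start a new chunk named after the header text.
            current = match.group(1).lower()
            sections[current] = []
        else:
            sections[current].append(line)
    # Join each section back into a text chunk ready for LLM summarization.
    return {name: "\n".join(body).strip() for name, body in sections.items()}

# Toy Antimony fragment (hypothetical model, not from BioModels):
example = """model example_model
// Compartments and Species:
compartment cell;
species G6P in cell, F6P in cell;
// Reactions:
J0: G6P -> F6P; k1*G6P;
// Species initializations:
G6P = 1.0;
F6P = 0.0;
end"""

chunks = split_antimony_sections(example)
```

Each resulting chunk would then be sent to Llama 3 for a concise summary before embedding.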

When a user submits a natural‑language query through a Streamlit web interface, the query is embedded with the same transformer model and a cosine‑similarity search retrieves the five most relevant summary chunks from ChromaDB. These chunks are inserted into a fixed prompt that tells the LLM it is a conversational agent and must answer using only the provided context, explicitly stating when the information is insufficient. The final answer is generated by Llama 3, now grounded in the retrieved, model‑specific text, which dramatically reduces hallucination.
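The retrieval-and-prompting step might look like the following minimal sketch. The cosine search is written in pure Python over toy vectors to stand in for ChromaDB's similarity query against all-MiniLM-L6-v2 embeddings, and the prompt wording is a paraphrase, not the authors' exact template.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_vec, store, k=5):
    """Return the k chunk texts whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in ranked[:k]]

def build_prompt(question, context_chunks):
    """Assemble a grounding prompt in the spirit described above (paraphrased)."""
    context = "\n---\n".join(context_chunks)
    return (
        "You are a conversational agent. Answer using ONLY the context below. "
        "If the context is insufficient, say so explicitly.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy 3-D embeddings standing in for all-MiniLM-L6-v2's real vectors:
store = [
    {"text": "Reactions: J0: G6P -> F6P; k1*G6P", "vec": [0.9, 0.1, 0.0]},
    {"text": "Compartments: one compartment named cell", "vec": [0.0, 0.9, 0.1]},
    {"text": "Initializations: G6P = 1.0", "vec": [0.1, 0.0, 0.9]},
]
top = retrieve_top_k([1.0, 0.1, 0.0], store, k=2)
prompt = build_prompt("What reactions does the model contain?", top)
```

The assembled prompt, with the retrieved chunks inlined, is what gets passed to Llama 3 for the final grounded answer.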

The authors evaluate the system against a baseline where Llama 3 answers without any retrieved context. Two metrics are used: fuzzy string matching (token‑set‑ratio) to measure lexical overlap between the LLM output and the retrieved context (a proxy for hallucination), and cosine similarity between the output embedding and the context embedding (a proxy for topical relevance). With RAG, the token‑set‑ratio averages around 87 % versus 58 % for the baseline, and cosine similarity rises from 0.78 to 0.92. Parameter sweeps show that a temperature of 0, top‑k 20, and top‑p 1.0 yield the best trade‑off between deterministic output and coverage of relevant information.
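As a rough illustration of the two metrics, the sketch below uses a simplified Jaccard-style token overlap in place of RapidFuzz's exact token-set-ratio algorithm, plus a hand-rolled cosine similarity; the example strings and scores are illustrative, not the paper's data.

```python
import math

def token_set_overlap(a: str, b: str) -> float:
    """Simplified stand-in for fuzzy token-set matching: percent overlap of
    unique lowercase tokens (Jaccard * 100), NOT RapidFuzz's exact algorithm."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return 100.0 * len(ta & tb) / len(ta | tb)

def embedding_cosine(a, b):
    """Cosine similarity between two embedding vectors (relevance proxy)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Illustrative comparison: a grounded answer restates the retrieved context,
# while an ungrounded baseline drifts to unrelated material.
context = "glucose 6 phosphate is converted to fructose 6 phosphate"
grounded_answer = "glucose 6 phosphate is converted to fructose 6 phosphate"
baseline_answer = "the krebs cycle oxidizes acetyl coa in mitochondria"

grounded_score = token_set_overlap(grounded_answer, context)
baseline_score = token_set_overlap(baseline_answer, context)
```

A higher token-overlap score against the retrieved context suggests the answer stayed grounded, mirroring the 87% vs. 58% gap the authors report between RAG and the no-context baseline.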

The system is fully open‑source: the code resides on GitHub (https://github.com/TheBobBob/BioModelsRAG) and a live demo is hosted at https://biomodelsrag.streamlit.app/. The architecture consists of a FastAPI backend handling model download, conversion, summarization, and vector storage, while the Streamlit frontend provides a simple text box for queries.

Key contributions include: (1) an automated pipeline that transforms large, complex SBML models into human‑readable, summarized Antimony text; (2) a RAG framework that grounds LLM responses in retrieved, model‑specific context, thereby suppressing hallucinations; (3) a user‑friendly web interface that makes sophisticated systems‑biology models accessible to non‑experts. The authors acknowledge limitations such as potential loss of intricate mathematical expressions during summarization and the fixed number of context chunks, and propose future work on multi‑turn dialogues, integration of simulation results, parameter sensitivity analysis, and richer visualizations.

Overall, BioModelsRAG demonstrates that coupling domain‑specific knowledge bases with retrieval‑augmented LLMs can turn a vast, technically dense repository into an interactive, intelligible resource, lowering the barrier for biologists to explore, understand, and reuse computational models.

