Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus
Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their adoption in the research community is inhibited by their tendency to hallucinate invalid sources and their lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist: an application pipeline that uses dynamic Retrieval Augmented Generation (RAG) on the arXiv corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy utilization in the scientific community, we release both a website (https://citegeist.org) and an implementation harness that works with several different LLM implementations.
💡 Research Summary
Citegeist introduces a practical Retrieval‑Augmented Generation (RAG) pipeline that leverages the entire arXiv corpus (≈2.6 M papers) to automatically produce a “Related Work” section for a given scientific manuscript. The system bridges two major obstacles of large language models (LLMs) in scholarly writing: hallucinated citations and lack of direct access to up‑to‑date scientific knowledge bases.
Data preparation: All arXiv abstracts are embedded with the all‑mpnet‑base‑v2 Sentence‑Transformer, which produces 768‑dimensional vectors for inputs of up to 384 tokens. These embeddings, together with a SHA‑256 hash of the serialized metadata and a BERTopic‑derived topic label, are stored in a Milvus vector database. The hash enables efficient incremental updates: unchanged records are skipped, modified records are re‑embedded, and new records are added, all in batch mode with optional GPU acceleration.
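The hash-based incremental update described above can be sketched as follows. This is a minimal illustration, not the actual Citegeist implementation: the real pipeline embeds with all‑mpnet‑base‑v2 and writes to Milvus, whereas here a plain dictionary stands in for the vector database and `embed` is a dummy placeholder.

```python
import hashlib
import json

def metadata_hash(record: dict) -> str:
    """SHA-256 over canonically serialized metadata (sorted keys for stability)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def embed(abstract: str) -> list[float]:
    # Placeholder for the real 768-dim Sentence-Transformer embedding.
    return [float(b) for b in hashlib.md5(abstract.encode("utf-8")).digest()[:4]]

def incremental_update(store: dict, batch: list[dict]) -> dict:
    """Skip unchanged records, re-embed modified ones, and add new ones.

    `store` maps arXiv id -> {"hash": ..., "vector": ...} and stands in
    for the Milvus collection.
    """
    stats = {"skipped": 0, "updated": 0, "added": 0}
    for rec in batch:
        h = metadata_hash(rec)
        entry = store.get(rec["id"])
        if entry is not None and entry["hash"] == h:
            stats["skipped"] += 1          # metadata unchanged: no re-embedding
            continue
        store[rec["id"]] = {"hash": h, "vector": embed(rec["abstract"])}
        stats["updated" if entry else "added"] += 1
    return stats
```

Because the hash is computed before any embedding happens, unchanged papers cost only one SHA‑256 per record, which is what makes re-running the ingest over the full corpus cheap.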
Candidate retrieval: Users submit either an abstract or a full PDF. The input is embedded with the same model and a cosine‑similarity search returns an initial “longlist”. Three user‑controllable hyper‑parameters shape the selection:
- Breadth – size of the initial candidate pool.
- Depth – number of pages per candidate that will be examined.
- Diversity – a weight w ∈ [0, 1] that balances pure similarity against topical diversity among the selected candidates.