Unmediated AI-Assisted Scholarly Citations
Traditional bibliographic databases require users to navigate search forms and manually copy citation data. Language models offer an alternative: a natural-language interface where researchers write text with informal citation fragments, which are automatically resolved to proper references. However, language models are not reliable for scholarly work, as they generate fabricated (hallucinated) citations at substantial rates. We present an architectural approach that combines the natural-language interface of LLM chatbots with the accuracy of direct database access, implemented through the Model Context Protocol. Our system enables language models to search bibliographic databases, perform fuzzy matching, and export verified entries, all through conversational interaction. A key architectural principle bypasses the language model during final data export: entries are fetched directly from authoritative sources, with timeout protection, to guarantee accuracy. We demonstrate this approach with MCP-DBLP, a server providing access to the DBLP computer science bibliography. The system transforms form-based bibliographic services into conversational assistants that maintain scholarly integrity. This architecture is adaptable to other bibliographic databases and academic data sources.
💡 Research Summary
This paper addresses a critical flaw in using large language models (LLMs) for scholarly work: their propensity to generate fabricated or “hallucinated” citations. While LLMs offer a compelling natural-language interface that could streamline the tedious process of bibliographic referencing, their unreliability poses a significant threat to academic integrity. The authors propose a novel architectural solution that decouples the strengths of LLMs from the risks they introduce.
The core problem is that simply connecting an LLM to a database API does not prevent hallucination; the model may still corrupt or invent bibliographic data while processing search results. The proposed solution, implemented via the Model Context Protocol (MCP), establishes a clear separation of concerns. The LLM is responsible for what it does best: understanding informal user queries (e.g., “Devlin paper from 2018”), disambiguating references through conversational dialogue, and managing the search workflow. The authoritative database (DBLP, in this case) is solely responsible for providing accurate metadata.
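The database side of this division of labor can be illustrated with a toy fuzzy-title tool. The records below are real DBLP-style keys used purely as fixtures, and the matching logic (difflib similarity over an in-memory dictionary) is an illustrative stand-in, not the actual MCP-DBLP or DBLP implementation; the point is that the LLM receives only compact candidate summaries to disambiguate, while full metadata stays on the server.

```python
import difflib

# Toy authoritative store standing in for DBLP (keys/titles are fixtures).
RECORDS = {
    "conf/naacl/DevlinCLT19": "BERT: Pre-training of Deep Bidirectional "
                              "Transformers for Language Understanding",
    "conf/nips/VaswaniSPUJGKP17": "Attention is All you Need",
}

def fuzzy_title_search(query: str, cutoff: float = 0.3) -> list[dict]:
    """Database-side fuzzy matching for an informal fragment.

    Returns only compact summaries (key, title, score) for the LLM to
    disambiguate conversationally; complete metadata never needs to pass
    through the model at this stage.
    """
    results = []
    for key, title in RECORDS.items():
        score = difflib.SequenceMatcher(None, query.lower(),
                                        title.lower()).ratio()
        if score >= cutoff:
            results.append({"dblp_key": key, "title": title,
                            "score": round(score, 2)})
    return sorted(results, key=lambda r: -r["score"])
```

In this sketch, ranking an informal query like "BERT pretraining of transformers" surfaces the matching DBLP key first, which the LLM can then confirm with the user before any export happens.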
The system, named MCP-DBLP, exposes eight tools through the MCP interface, including boolean search, fuzzy title/author matching, and statistical analysis. The key innovation is the “unmediated BibTeX export” mechanism. Instead of the LLM receiving citation data and potentially reformatting it, the system employs a “shopping cart” pattern. When a user (via the LLM) identifies a desired paper, the add_bibtex_entry tool is called with the paper’s DBLP key. The server immediately fetches the complete BibTeX entry directly from the canonical DBLP URL with timeout protection and adds it to a session-specific collection. The LLM’s only involvement is to supply a preferred citation key, which the server substitutes via a deterministic pattern match. Finally, the export_bibtex tool writes the entire collection directly to a file on disk. Crucially, the bibliographic data—author names, titles, venues, DOIs—never passes through the LLM’s context window, eliminating any chance of corruption during the export process.
The evaluation rigorously tests this architecture against a baseline of web-only search. Using 104 obfuscated citations across three independent experiments, the study compares three agent configurations: Web (no MCP access), MCP-M (MCP search with manual BibTeX construction by the LLM), and MCP-U (MCP search with unmediated export). The results are striking. The “Perfect Match” rate jumps from 28.2% (Web) to 82.7% (MCP-U), a nearly threefold improvement. Most significantly, “Corrupted Metadata”—where the LLM invents plausible but false details—plagues the Web baseline at 6.7% but is completely eliminated (0%) in both MCP methods. Furthermore, while the MCP-M method suffered from a 36.5% “Incomplete Metadata” rate (e.g., missing DOIs or page numbers due to the LLM’s manual construction), the MCP-U method achieved 0%, as every field is populated directly from DBLP.
The paper concludes that this architectural pattern successfully combines conversational ease with scholarly rigor. By leveraging the MCP standard for tool calling and stateful interaction, and by architecturally bypassing the LLM for final data retrieval, the system provides a reliable, accurate, and user-friendly interface for bibliographic work. This approach is not limited to DBLP but is a generalizable framework applicable to other bibliographic databases like PubMed, arXiv, or Semantic Scholar, offering a path toward trustworthy AI-assisted scholarly communication.