A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering
Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor, and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query-explanation quality and fault detection across multiple LLMs, complemented by two smaller real-world query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
💡 Research Summary
The paper addresses the well‑known gap between large language models (LLMs), which excel at natural‑language understanding, and knowledge graphs (KGs), which provide precise, explainable, and multi‑hop reasoning capabilities. Traditional retrieval‑augmented generation (RAG) pipelines rely on text‑based retrieval and therefore struggle with multi‑hop queries and factual grounding. Conversely, KG‑based approaches require users to master query languages such as Cypher or SPARQL, creating a steep barrier for non‑technical domain experts.
To bridge this divide, the authors propose a human‑in‑the‑loop, LLM‑centered architecture that lets users interact with a KG using only natural language. The system consists of four modular components, all powered by the same LLM through a LangChain interface:
- Query Generator – Translates a user’s natural‑language question into a syntactically correct Cypher query. A schema‑aware prompt restricts the LLM to valid node and relationship types, dramatically reducing syntax errors.
- Executor – Sends the generated query to a Neo4j instance, retrieves the results, and returns them as structured Python objects.
- Explainer – Uses the LLM with a dedicated prompt to produce a step‑by‑step natural‑language description of the query, flagging implausible patterns, inverted relationships, or nonexistent labels. This “plausibility check” gives users confidence without reading Cypher code.
- Amender – Accepts natural‑language feedback (e.g., “use the opposite direction for the HAS_AUTHOR edge”) and asks the LLM to edit the existing query rather than regenerate it from scratch. This targeted editing preserves useful parts of the original query and avoids over‑correction.
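The four components above can be sketched as a small Python loop. This is a minimal illustration, not the authors' implementation: the prompt wording, the generic `llm(prompt) -> str` callable, and the toy schema are assumptions, and a stub LLM stands in for a real LangChain model call so the control flow can run without external services (the Executor, which would call a Neo4j instance, is omitted here).

```python
# Hypothetical sketch of the generate -> explain -> amend loop.
# The schema string, prompts, and llm() interface are illustrative assumptions.

SCHEMA = ("Nodes: (Movie {title, year}), (Person {name}); "
          "Relationships: (Person)-[:ACTED_IN]->(Movie)")

def generate_query(llm, question):
    # Query Generator: a schema-aware prompt restricts the LLM to valid
    # node and relationship types, reducing syntax errors.
    prompt = (f"Schema: {SCHEMA}\n"
              f"Write one Cypher query answering: {question}\n"
              f"Return Cypher only.")
    return llm(prompt).strip()

def explain_query(llm, cypher):
    # Explainer: step-by-step description plus a plausibility check that
    # flags inverted relationships or nonexistent labels.
    prompt = (f"Schema: {SCHEMA}\n"
              f"Explain this Cypher step by step and flag any implausible "
              f"patterns:\n{cypher}")
    return llm(prompt)

def amend_query(llm, cypher, feedback):
    # Amender: edit the existing query rather than regenerate it,
    # preserving useful parts and avoiding over-correction.
    prompt = (f"Schema: {SCHEMA}\n"
              f"Edit this Cypher per the feedback, changing as little as "
              f"possible.\nQuery:\n{cypher}\nFeedback: {feedback}\n"
              f"Return Cypher only.")
    return llm(prompt).strip()

def stub_llm(prompt):
    # Placeholder standing in for a real model call (e.g. via LangChain).
    if "Edit this Cypher" in prompt:
        return "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {year: 1999}) RETURN p.name"
    if "Explain this Cypher" in prompt:
        return "Matches Person nodes with ACTED_IN edges to Movies; the pattern is plausible."
    return "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name"

query = generate_query(stub_llm, "Who acted in movies from 1999?")
print(explain_query(stub_llm, query))
query = amend_query(stub_llm, query, "restrict to movies released in 1999")
print(query)
```

In a full deployment, `stub_llm` would be replaced by a LangChain-wrapped model and the generated Cypher would be sent to Neo4j between the explain and amend steps; the user's natural-language feedback drives the amendment loop until the explanation matches their intent.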
The authors evaluate the architecture on three knowledge graphs:
- Synthetic movie KG – a controlled benchmark with 90 questions designed to test explanation quality and fault detection. Multiple LLMs (GPT‑4, Claude, DeepSeek) are compared.
- Hyena KG – a real‑world ecological graph containing demographic, behavioral, and genetic data about a long‑studied hyena population.
- MaRDI KG – a mathematical research graph with over 700 million triples, exposing theorems, software implementations, and publications via SPARQL/REST endpoints.
Key findings include:
- GPT‑4 achieved the highest explanation‑match score (~92 %) and fault‑detection rate (~88 %) on the synthetic benchmark, while other models produced more ambiguous or incorrect explanations.
- In the Hyena and MaRDI case studies, initial query generation sometimes missed multi‑hop relationships, but users could correct the query in an average of 1.3–2.0 amendment rounds, demonstrating the efficiency of the natural‑language feedback loop.
- The plausibility‑check and error‑flagging mechanisms significantly increased user trust, as participants reported fewer “surprise” results compared with a baseline text‑RAG system.
- The system’s modularity allows swapping LLMs or graph back‑ends with minimal code changes, though the current implementation is tied to Neo4j.
Limitations noted by the authors are: (1) very complex or non‑standard schemas can still confuse the LLM, leading to mis‑mapped entities; (2) latency grows with each amendment round, which may be problematic for time‑critical applications; (3) the approach has only been tested on Neo4j, so portability to other graph databases remains open.
Future work will explore automatic error‑correction models, a database‑agnostic abstraction layer, and pre‑trained “query‑repair” prompts to reduce the number of human‑in‑the‑loop iterations.
In summary, the paper presents a novel, end‑to‑end framework that combines the expressive power of Cypher with the accessibility of LLMs, delivering transparent, interactive, and accurate KG‑based question answering. By providing natural‑language explanations and an iterative amendment mechanism, it makes sophisticated graph querying feasible for domain experts without requiring them to learn query languages, thereby advancing the practical integration of LLMs and knowledge graphs.