HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents


Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses in external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g., Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework for more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible; only if no suitable QA match is found does our framework fall back to on-the-fly response generation. Experiments on OHRBench demonstrate that HybridRAG provides higher answer quality and lower latency than a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.


💡 Research Summary

HybridRAG introduces a pragmatic Retrieval‑Augmented Generation (RAG) architecture designed for enterprise‑scale chatbot deployments that must handle large collections of raw, unstructured PDF documents. The core idea is to shift most of the heavy lifting—document parsing, knowledge extraction, and question‑answer (QA) pair creation—to an offline preprocessing stage, thereby reducing the computational burden at query time.

Offline pipeline

  1. Layout analysis & OCR – Each PDF page is processed with MinerU to detect layout elements (text blocks, tables, figures). Text is extracted via PaddleOCR, while tables and figures are transformed into natural‑language descriptions using GPT‑4o prompted as an “image description expert.”
  2. Hierarchical chunking – Inspired by RAPTOR, the extracted content is organized into a tree‑structured hierarchy: the root node holds a document‑level summary, intermediate nodes contain sections, and leaf nodes correspond to paragraphs or fine‑grained blocks. This hierarchy enables flexible retrieval ranging from coarse to detailed contexts.
  3. Keyword extraction – For every node, GPT‑4o‑mini extracts a set of core keywords; higher‑level nodes receive more keywords reflecting their broader information scope.
  4. QA generation – Using a chain‑of‑thought prompt, GPT‑4o‑mini generates a diverse set of QA pairs per node, constrained to be answerable solely from the node’s text. The number of QA pairs per node matches the number of extracted keywords, ensuring coverage of salient facts while avoiding redundancy.
  5. Embedding & indexing – Questions are embedded with the dense retriever BGE‑M3 and stored in a vector index (e.g., FAISS).
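The final embedding-and-indexing step can be sketched as a minimal, NumPy-only QA bank. This is a stand-in for the real components (BGE-M3 for encoding, a FAISS index for storage); the `QABank` class and its method names are illustrative, not from the paper.

```python
import numpy as np

class QABank:
    """Illustrative pre-generated QA bank: stores L2-normalized question
    embeddings alongside their answers and source chunks."""

    def __init__(self, dim: int):
        self.dim = dim
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.answers: list[str] = []
        self.chunks: list[str] = []

    def add(self, question_emb: np.ndarray, answer: str, chunk: str) -> None:
        # Normalize so that inner product equals cosine similarity,
        # matching the inner-product scoring used at query time.
        emb = question_emb / np.linalg.norm(question_emb)
        self.embeddings = np.vstack([self.embeddings, emb.astype(np.float32)])
        self.answers.append(answer)
        self.chunks.append(chunk)

    def search(self, query_emb: np.ndarray, top_k: int = 3):
        """Return the top_k (score, answer, chunk) tuples, best first."""
        q = (query_emb / np.linalg.norm(query_emb)).astype(np.float32)
        scores = self.embeddings @ q
        top = np.argsort(scores)[::-1][:top_k]
        return [(float(scores[i]), self.answers[i], self.chunks[i]) for i in top]
```

In a production setting the brute-force matrix product would be replaced by an approximate-nearest-neighbor index (e.g., FAISS), but the retrieval semantics are the same.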

Online inference
When a user submits a query, the same BGE‑M3 encoder produces a query embedding. The system retrieves the top‑3 most similar stored questions by inner‑product similarity. If the highest score exceeds a predefined threshold (e.g., 0.9), the associated answer is returned instantly, bypassing any LLM inference. If the similarity falls below the threshold, the corresponding chunks are aggregated and fed, together with the user query, to a generative LLM (Llama‑3.2‑3B‑Instruct or Qwen2.5‑3B‑Instruct) to produce a fresh answer. This dual‑mode approach provides low latency for frequent or predictable queries while preserving the flexibility to handle novel, complex questions.
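The dual-mode routing can be sketched as follows, assuming hypothetical `embed` and `generate` callables in place of BGE-M3 and the instruct model, and a QA-bank object exposing a `search` method that returns (score, answer, chunk) tuples sorted best-first; the 0.9 default threshold follows the paper's example.

```python
def answer_query(query: str, bank, embed, generate, threshold: float = 0.9) -> str:
    """Dual-mode routing: return a cached answer when the best QA match
    is confident enough; otherwise fall back to LLM generation.

    `embed`, `generate`, and `bank` are stand-ins for the paper's
    BGE-M3 encoder, generative LLM, and pre-generated QA bank.
    """
    matches = bank.search(embed(query), top_k=3)
    best_score, best_answer, _ = matches[0]
    if best_score >= threshold:
        # Fast path: serve the pre-generated answer, no LLM inference.
        return best_answer
    # Slow path: aggregate the supporting chunks and generate fresh.
    context = "\n\n".join(chunk for _, _, chunk in matches)
    return generate(query, context)
```

The threshold is the key latency/quality knob: raising it routes more queries through the (slower) generative path, while lowering it risks serving loosely matched cached answers.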

Experimental evaluation
The authors evaluate HybridRAG on OHRBench, a benchmark comprising 1,261 real‑world PDFs (8,561 pages) across seven domains (law, finance, textbooks, etc.) and 8,498 ground‑truth QA pairs. Metrics include F1, BERTScore, and ROUGE‑L for answer quality, plus average response latency measured on an NVIDIA RTX 3090 server. Three systems are compared: (1) Standard RAG (OCR‑only, no pre‑generated QA), (2) Simplified HybridRAG (pre‑generated QA but text only, no multimodal handling), and (3) Full HybridRAG (the proposed pipeline).

Results show that Full HybridRAG consistently reduces latency (e.g., from 1.685 s to 0.931 s with Llama‑3.2) while achieving comparable or slightly higher quality scores (average F1 24.36 vs. 23.25). The simplified version already outperforms Standard RAG, confirming that a pre‑generated QA bank alone yields measurable benefits. With Qwen2.5, latency improvements are modest but quality gains are more pronounced, indicating that the framework is robust across different generative models.

Strengths and limitations
HybridRAG’s main strength lies in its practical trade‑off: heavy LLM inference is amortized offline, enabling fast online responses without sacrificing answer correctness. The hierarchical chunking and keyword‑driven QA generation promote coverage of both high‑level concepts and fine details. However, the offline stage is computationally intensive (large‑scale OCR, multimodal description, and massive QA generation), and the quality of generated QA pairs depends on the reliability of the underlying LLMs. Errors in OCR or image description can propagate into the knowledge base. Moreover, the system currently requires full re‑generation when documents are updated, lacking incremental update mechanisms.

Future directions
The authors suggest integrating vision‑language models to directly process tables and figures, reducing reliance on separate LLM prompts. Incremental indexing and dynamic QA refreshing could further lower maintenance costs. Extending the approach to multilingual corpora and exploring adaptive similarity thresholds are also promising avenues.

In summary, HybridRAG offers a compelling, resource‑efficient solution for deploying LLM‑powered chatbots over large, unstructured document collections, demonstrating that pre‑generated QA banks can substantially improve both latency and answer quality in real‑world settings.

