Halo: Domain-Aware Query Optimization for Long-Context Question Answering
Long-context question answering (QA) over lengthy documents is critical for applications such as financial analysis, legal review, and scientific research. Current approaches, such as processing an entire document in a single LLM call or retrieving relevant chunks via retrieval-augmented generation (RAG), have two drawbacks. First, as context size grows, response quality can degrade, hurting accuracy. Second, iteratively processing hundreds of input documents can incur prohibitively high API costs. To improve response quality and reduce the number of iterations needed to reach the desired response, users often add domain knowledge to their prompts. However, existing systems fail to systematically capture and use this knowledge to guide query processing. Domain knowledge is treated as prompt tokens alongside the document: the LLM may or may not follow it, there is no reduction in computational cost, and when outputs are incorrect, users must iterate manually. We present Halo, a long-context QA framework that automatically extracts domain knowledge from user prompts and applies it as executable operators across a multi-stage query execution pipeline. Halo identifies three common forms of domain knowledge (where in the document to look, what content to ignore, and how to verify the answer) and applies each at the pipeline stage where it is most effective: pruning the document before chunk selection, filtering irrelevant chunks before inference, and ranking candidate responses after generation. To handle imprecise or invalid domain knowledge, Halo includes a fallback mechanism that detects low-quality operators at runtime and selectively disables them. Our evaluation across finance, literature, and scientific datasets shows that Halo achieves up to 13% higher accuracy and 4.8x lower cost than baselines, and enables a lightweight open-source model to approach frontier-LLM accuracy at 78x lower cost.
💡 Research Summary
Halo addresses the long‑context question answering (QA) problem, where users must extract precise answers from documents that can span hundreds of thousands of tokens (e.g., SEC 10‑K filings, legal contracts, scientific papers). Existing solutions fall into two categories. The “Vanilla LLM” approach feeds the entire document to a large language model (LLM) in a single request. While this leverages the model’s extended context windows, it is extremely costly and suffers from “lost in the middle” hallucinations as the model struggles to locate relevant evidence in massive inputs. The Retrieval‑Augmented Generation (RAG) approach reduces cost by retrieving a top‑K set of relevant chunks before inference, but retrieval errors can omit critical evidence, leading to inaccurate answers.
Practitioners often try to compensate by embedding domain knowledge directly into prompts (e.g., “focus only on tables”, “ignore legal disclosures”, “use diluted EPS, not basic”). However, this knowledge remains unstructured text that the LLM may ignore, especially in long contexts, and it provides no token‑level cost savings. When the output is wrong, users must iteratively refine prompts, incurring additional API calls and manual effort.
Halo proposes a fundamentally different architecture: it extracts domain knowledge from the free‑form user prompt, converts it into structured directives, and applies these directives via dedicated operators at the most effective stages of a multi‑stage query pipeline. The system consists of four main components:
- KnowledgeParser – a lightweight NLP module that parses the user prompt and identifies three canonical types of domain knowledge:
  - Structural – “where to look” (e.g., tables, MD&A sections).
  - Filter – “what to ignore” (e.g., legal disclosures, boilerplate text).
  - Validate – “how to verify” (e.g., “NOT basic computations”, “must contain diluted EPS”).
- ContextSelector (Structural Operator) – uses the structural directives to prune the raw document before chunking. By extracting only the specified structural elements (e.g., all tables), the token count fed downstream can drop from 200K to 30–50K, dramatically reducing cost and focusing retrieval on the most promising regions.
- InferenceEngine (Filter Operator) – employs a model cascade. A small, inexpensive language model (SLM) quickly evaluates each chunk against the filter directives and discards irrelevant pieces; only the remaining chunks are passed to the expensive, high-capacity LLM for answer generation. This ensures the costly LLM sees a minimal, high-quality context.
- Verifier (Validate Operator) – after candidate answers are generated, scores them against the validation directives. Positive constraints (e.g., presence of “diluted”) boost a candidate, while negative constraints (e.g., presence of “basic”) demote it. This ranking step detects hallucinations without requiring extra LLM calls.
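The directive types and their pipeline placement can be sketched as follows. This is a minimal toy illustration: the class and function names are our own, and simple keyword matching stands in for Halo's actual KnowledgeParser and SLM-based filter.

```python
from dataclasses import dataclass, field

@dataclass
class Directives:
    structural: list = field(default_factory=list)    # "where to look"
    filter: list = field(default_factory=list)        # "what to ignore"
    validate_pos: list = field(default_factory=list)  # terms a valid answer should contain
    validate_neg: list = field(default_factory=list)  # terms that signal a wrong answer

def parse_knowledge(prompt: str) -> Directives:
    """Toy KnowledgeParser: keyword rules standing in for the real NLP module."""
    d, low = Directives(), prompt.lower()
    if "only on tables" in low:
        d.structural.append("tables")
    if "ignore legal disclosures" in low:
        d.filter.append("legal disclosures")
    if "diluted" in low and "not basic" in low:
        d.validate_pos.append("diluted")
        d.validate_neg.append("basic")
    return d

def filter_chunks(chunks: list, d: Directives) -> list:
    """Stand-in for the SLM filter stage: drop chunks matching a filter directive."""
    return [c for c in chunks if not any(t in c.lower() for t in d.filter)]

def rank_candidates(candidates: list, d: Directives) -> list:
    """Verifier: boost answers matching positive constraints, demote negatives."""
    def score(ans: str) -> int:
        low = ans.lower()
        return (sum(t in low for t in d.validate_pos)
                - sum(t in low for t in d.validate_neg))
    return sorted(candidates, key=score, reverse=True)
```

Given the prompt "Focus only on tables, ignore legal disclosures, use diluted EPS, not basic.", this sketch routes the three directive types to pruning, filtering, and ranking respectively, so a candidate answer mentioning basic EPS is demoted below one mentioning diluted EPS.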
A critical innovation is Halo’s Fallback Manager. Domain directives extracted from prompts can be ambiguous, contradictory, or simply wrong. The fallback mechanism monitors runtime signals such as the confidence of the SLM filter, the match quality of validation scores, and overall answer accuracy (estimated via self‑consistency or external heuristics). If an operator degrades performance, the manager disables it for that query, reverting to a safer baseline pipeline. This dynamic adaptation prevents the system from suffering when user‑provided knowledge is low‑quality.
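The fallback idea can be sketched as a simple operator wrapper. The names and the threshold below are illustrative assumptions; the paper's actual runtime signals include SLM filter confidence and validation-match quality.

```python
def with_fallback(operator, quality_signal, threshold=0.5):
    """Wrap a pipeline operator so it disables itself for the current query
    when a runtime quality signal on its output falls below a threshold."""
    def guarded(chunks):
        result = operator(chunks)
        if quality_signal(result) < threshold:
            return chunks  # revert to the safer baseline: pass input through unchanged
        return result
    return guarded

# Example: an over-aggressive filter (from a bad directive) discards every
# chunk, the quality signal drops to zero, and the fallback lets the
# original chunks flow through untouched.
over_aggressive = with_fallback(lambda chunks: [], lambda out: 1.0 if out else 0.0)
```

The key design point is that the guard is per-query: a directive that misfires on one document does not permanently disable the operator for others.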
Evaluation: The authors benchmark Halo on three heterogeneous domains—finance (SEC 10‑K filings), literature (novels), and science (research papers). They compare against three baselines: (a) Vanilla LLM processing the full document, (b) standard RAG with top‑K retrieval, and (c) RAG augmented with a single structural cue (e.g., “only tables”). Metrics include Exact Match/F1 accuracy, total tokens processed, and estimated API cost. Results show:
- Accuracy gains of up to 13 percentage points over the strongest baseline, especially on queries that require fine‑grained distinctions (e.g., diluted vs. basic EPS).
- Cost reductions of 4.8× on average, driven by aggressive pruning and filtering before the expensive LLM call.
- When paired with a lightweight open-source model (e.g., Llama‑2‑7B), Halo approaches the accuracy of a commercial frontier model (e.g., Sonnet 4.5) while incurring 78× lower cost.
- The overhead of the three operators together accounts for less than 2 % of total query cost, confirming that the benefits far outweigh the extra computation.
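A back-of-the-envelope model shows where the pruning-driven savings originate. The per-token price below is a placeholder assumption, not a figure from the paper:

```python
PRICE_PER_1K_INPUT_TOKENS = 0.003  # hypothetical frontier-LLM input pricing (USD)

def llm_input_cost(tokens: int) -> float:
    """Dominant cost term for long-context QA: input tokens sent to the big LLM."""
    return tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

vanilla = llm_input_cost(200_000)  # full 10-K filing sent per query
halo = llm_input_cost(40_000)      # after structural pruning and SLM filtering
# Pruning 200K tokens down to ~40K alone cuts the expensive LLM's input cost
# about 5x; the cascade's cheap SLM calls add only a small overhead on top.
```

This is why the operators' own overhead (under 2% of query cost) is easily recouped: the savings scale with the fraction of the document that never reaches the frontier LLM.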
Contributions:
- Demonstrating that domain knowledge expressed in natural language can be automatically extracted and operationalized.
- Designing three specialized operators (Structural, Filter, Validate) that integrate these directives into a cost‑effective, accuracy‑boosting pipeline.
- Introducing a runtime fallback mechanism that safeguards against low‑quality directives.
- Providing extensive cross‑domain evaluation that validates both accuracy improvements and dramatic cost savings.
Limitations & Future Work: Halo currently supports only the three canonical directive types. More complex logical constraints (e.g., “use tables but also consider narrative for years 2020‑2022”) are not yet handled. Future research may extend the KnowledgeParser to capture richer logical forms, enable dynamic operator composition, and incorporate user feedback loops for continual parser refinement. Additionally, exploring adaptive chunk sizing and tighter integration with retrieval indexes could further reduce latency.
Conclusion: Halo transforms how domain expertise is leveraged in long‑context QA. By converting informal prompt instructions into explicit, executable operators placed at optimal stages of the processing pipeline, it simultaneously raises answer quality and slashes computational expense. The framework’s modularity, fallback safety net, and demonstrated effectiveness across diverse domains suggest a promising path toward scalable, cost‑efficient LLM‑powered knowledge extraction in real‑world enterprise settings.