Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"


Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.


💡 Research Summary

This paper investigates the reliability of large language models (LLMs) in closed‑book multi‑hop question answering (QA) by comparing three prompting regimes that are semantically equivalent: Direct (single‑step answer), Assistive (single‑call with a full gold decomposition), and Incremental (step‑wise execution of the same decomposition). The authors construct a verified gold‑standard decomposition for each question using a domain‑specific language (DSL) and ensure that the three regimes differ only in execution style, not in content.

Experiments span six multi‑hop QA benchmarks (Bamboogle, FRAMES, MuSiQue, CRAG, HotpotQA, Mintaka) and nine instruction‑tuned LLMs ranging from 8B to frontier models such as GPT‑5.1, Gemini‑2.5‑Pro, and Gemini‑2.5‑Flash. Accuracy is measured by semantic agreement with gold answers using an LLM‑as‑judge (Gemini‑2.5‑Flash), while consistency is defined as semantic agreement between Direct and each decomposed regime, independent of correctness.
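The two metrics above can be sketched as simple bookkeeping over per-question judgments. This is a minimal illustration, not the paper's code: `judge` stands in for the LLM‑as‑judge (Gemini‑2.5‑Flash in the paper) and is replaced here by a naive normalized string match purely so the sketch runs; the record field names are assumptions.

```python
def judge(answer_a: str, answer_b: str) -> bool:
    """Placeholder semantic-equivalence judge.
    The paper uses an LLM judge; exact string match is only illustrative."""
    return answer_a.strip().lower() == answer_b.strip().lower()

def evaluate(records):
    """Compute (accuracy, consistency) over a list of dicts with
    'gold', 'direct', and 'decomposed' answer strings.
    Accuracy  = agreement of the Direct answer with the gold answer.
    Consistency = agreement between Direct and the decomposed regime,
    judged independently of correctness."""
    n = len(records)
    accuracy = sum(judge(r["direct"], r["gold"]) for r in records) / n
    consistency = sum(judge(r["direct"], r["decomposed"]) for r in records) / n
    return accuracy, consistency

records = [
    {"gold": "Paris", "direct": "Paris", "decomposed": "Paris"},
    {"gold": "1969",  "direct": "1969",  "decomposed": "1968"},
]
acc, cons = evaluate(records)
print(acc, cons)  # 1.0 0.5
```

Note that consistency is deliberately computed without reference to the gold answer: it is a property of the model's behavior across prompting regimes, which is what makes it usable at deployment time.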

Key findings:

  1. Scale Improves Both Accuracy and Consistency – As model size grows, both metrics increase, but even the largest models exhibit substantial cross‑prompt disagreement (≈60 % consistency on MuSiQue).

  2. Decomposition Gains Diminish at Frontier Scale – For models ≤70B parameters, Assistive and Incremental prompting yield sizable accuracy boosts (often +15–30 percentage points). In contrast, frontier models show little to no gain, sometimes even a slight drop, suggesting they have already internalized the necessary reasoning chains.

  3. Cross‑Prompt Agreement Is a Strong Error Signal – The authors introduce the Reliability Multiplier (RM), the ratio of “correct ∧ consistent” to “correct ∧ inconsistent” cases. RM grows dramatically with scale, reaching >50× for Gemini‑Pro, indicating that when Direct and decomposed answers agree, the answer is far more likely to be correct. Pearson correlation between accuracy and consistency reaches r≈0.98.

  4. Disagreement‑Based Abstention (DBA) – Leveraging the above insight, DBA simply refuses to answer (“I don’t know”) whenever Direct and a decomposed answer differ. This method requires no additional training, retrieval, or confidence calibration. Empirically, DBA outperforms standard uncertainty baselines (e.g., confidence scores, temperature scaling) on both F1 and AUROC across all models, especially improving error detection for high‑capacity models.

  5. Practical Implications – The study reframes decomposition from a performance‑enhancing technique to a diagnostic probe. In safety‑critical settings where over‑confidence is dangerous (medical, legal, finance), DBA offers a lightweight, model‑agnostic way to curb hallucinations without sacrificing the benefits of closed‑book inference.
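Findings 3 and 4 translate directly into two small functions. The sketch below, with hypothetical helper names, computes the Reliability Multiplier as the paper describes it (the ratio of "correct ∧ consistent" to "correct ∧ inconsistent" counts) and applies the DBA policy: answer only when the Direct and decomposed answers agree, otherwise abstain.

```python
from collections import Counter

def reliability_multiplier(outcomes):
    """outcomes: iterable of (correct: bool, consistent: bool) pairs.
    Returns the count of correct-and-consistent cases divided by the
    count of correct-but-inconsistent cases (floored at 1 to avoid
    division by zero in small samples)."""
    c = Counter(outcomes)
    return c[(True, True)] / max(c[(True, False)], 1)

def dba_answer(direct, decomposed, agree):
    """Disagreement-based abstention: emit the Direct answer only when
    the two regimes semantically agree; otherwise say 'I don't know'.
    `agree` is the semantic-agreement judge (an LLM in the paper)."""
    return direct if agree(direct, decomposed) else "I don't know"

# Toy example: 9 correct-and-consistent cases vs. 1 correct-but-inconsistent.
outcomes = [(True, True)] * 9 + [(True, False)]
print(reliability_multiplier(outcomes))  # 9.0

exact = lambda a, b: a == b  # illustrative stand-in for the LLM judge
print(dba_answer("Paris", "Paris", exact))  # Paris
print(dba_answer("1969", "1968", exact))    # I don't know
```

The appeal of this gate is that it needs nothing beyond two extra inferences per question: no retrieval, no fine-tuning, and no calibrated confidence scores.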

Limitations include reliance on manually verified gold DSLs; automatic DSL generation may introduce noise that weakens the consistency signal. The work focuses on multi‑hop QA, leaving open whether the same approach generalizes to single‑hop or generative tasks. Future research directions suggested are (a) automated, high‑quality decomposition synthesis, (b) analysis of whether disagreement stems from knowledge gaps versus execution errors, and (c) extension of disagreement‑based gating to broader NLP tasks.

In summary, the paper demonstrates that while decomposed prompting no longer boosts raw accuracy for frontier LLMs, it provides a powerful, training‑free mechanism to detect when a model’s answer is unreliable. By simply checking for answer consistency across prompting styles, practitioners can implement an effective “I don’t know” fallback, enhancing the trustworthiness of closed‑book LLM deployments.

