Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG
Current Retrieval-Augmented Generation (RAG) systems predominantly rely on relevance-based dense retrieval, sequentially fetching documents to maximize semantic similarity with the query. However, in knowledge-intensive and real-world scenarios chara…
Authors: Davide Di Gioia
En tropic Claim Resolution: Uncertain t y-Driv en Evidence Selection for RA G Da vide Di Gioia ucesigi@ucl.ac.uk Abstract Curren t Retriev al-Augmen ted Generation (RA G) systems predominan tly rely on relev ance- based dense retriev al, sequen tially fetching documents to maximize seman tic similarity with the query . Ho wev er, in kno wledge-intensiv e and real-w orld scenarios c haracterized b y conflicting evi- dence or fundamental query am biguity , relev ance alone is insufficient for resolving epistemic un- certain ty . W e in tro duce En tropic Claim Resolution (ECR), a no vel inference-time algorithm that reframes RA G reasoning as en tropy minimization ov er comp eting semantic answ er h yp otheses. Unlik e action-driv en agen tic framew orks (e.g., ReAct) or fixed-pip eline RAG architectures, ECR sequen tially selects atomic evidence claims by maximizing Exp ected Entrop y Reduction (EER), a decision-theoretic criterion for the v alue of information. The pro cess dynamically terminates when the system reac hes a mathematically defined state of epistemic sufficiency ( H ≤ ϵ , sub ject to epistemic coherence). W e in tegrate ECR into a production-grade multi-strategy retriev al pip eline (CSGR++) and demonstrate its theoretical prop erties. Our framework pro vides a rigorous foun- dation for uncertaint y-aw are evidence selection, shifting the paradigm from retrieving what is most relev ant to retrieving what is most discriminativ e. 1 In tro duction The in tegration of Large Language Mo dels (LLMs) with external knowledge bases through Retriev al- Augmen ted Generation (RAG) has become the de facto standard for mitigating hallucinations and enabling kno wledge-intensiv e Question Answering (QA). Conv entional RAG systems operate on a rigid r etrieve-then-r e ad paradigm, predominan tly lev eraging maximum inner product searc h (MIPS) in dense contin uous spaces [2] to fetch the top- k most semantically relev ant text ch unks. While highly effectiv e for simple, factoid-based QA where a single ground-truth answer exists, this relev ance-driven approac h exhibits severe degradation in real-w orld, kno wledge-intensiv e scenarios. Such scenarios are frequen tly characterized by inheren t query am biguity , conflicting evidence across multiple sources, and complex m ulti-hop dep endencies. In these c hallenging settings, standard dense retriev al suffers from what we term epistemic c ol lapse : the tendency to retrieve highly redundant information that is seman tically similar to the query , rather than fetching the discriminative evidence needed to resolve the underlying uncertaint y . Consequently , the LLM is forced to synthesize an answ er from a biased or incomplete evidence distribution, often leading to unhedged, o verconfiden t, or factually inaccurate generation. Recen t architectural adv ancements attempt to transcend simple MIPS. Graph-based paradigms, suc h as Con text-Seeded Graph Retriev al (CSGR) and GraphRA G, expand retriev al scop e via struc- tured knowledge relation tra versal. Concurren tly , agentic and iterativ e verification w orkflows (e.g., ReA ct [5], T ree-of-Though ts [6], Self-RAG [7]) allow LLMs to dynamically interact with searc h to ols, reflecting on retrieved context to guide subsequent actions. How ever, these state-of-the-art approac hes still critically lack a principled, decision-theoretic stopping criterion and evidence selection mec hanism. Graph tec hniques rely on static pip eline configurations (e.g., fixed graph-hop depth), while agentic 1 systems dep end on heuristic thresholding or prompt-driven self-reflection, which frequently suffer from infinite lo oping, premature termination, or unprincipled evidence w eighting. Critically , modern RAG systems lack a mathematically rigorous definition of what constitutes sufficien t evidence and an explicit ob jectiv e function for selecting which sp ecific piece of evidence to retriev e next at inference time. T o bridge this fundamental gap, we prop ose Entropic Claim Resolu- tion (ECR), an inference-time algorithm that reframes the retriev al and synthesis process as entr opy minimization over a latent sp ac e of semantic answer hyp otheses . Drawing inspiration from Informa- tion Theory [8] and Bay esian Exp erimen tal Design [9], ECR models the QA task probabilistically . It initializes a probability distribution ov er a set of mutually exclusive p otential answer hypotheses and iterativ ely selects atomic factual claims from a retriev ed candidate p ool to ev aluate. Crucially , in ECR, evidence selection is decoupled from semantic relev ance to the query . Instead, claims are selected by maximizing Exp ected Entrop y Reduction (EER) ; that is, choosing the sp ecific piece of evidence most likely to collapse the probability distribution tow ard a single, correct h yp othesis (or cleanly bifurcate it in the case of irreconcilable conflict). The algorithm adaptiv ely na vigates the evidence graph under a principled stopping rule, terminating only when the en tropy of the h yp othesis space falls b elow a predefined threshold of epistemic sufficiency . In summary , our main con tributions are: 1. W e introduce Entropic Claim Resolution (ECR), a decision-theoretic evidence selection algo- rithm for RAG, shifting the paradigm from retrieving what is most r elevant to what is most discriminative in resolving h yp othesis am biguity . 2. W e formally define a principled, mathematically rigorous stopping criterion for iterative RA G pip elines based on epistemic sufficiency ( H ( A | X ) ≤ ϵ ). 3. W e identify a b eha vioral phase transition under structured contradiction: b y in tegrating a ligh tw eight coherence signal ( λ > 0 ), w e show that ECR transitions from forced epistemic collapse to principled ambiguit y exposure, prioritizing explicit con tradictions when presen t and safely refusing to reduce uncertain ty when evidence is inherently inconsistent. 4. W e demonstrate the practical scalability of ECR b y implementing it as a fast, inference-time algorithm integrated into a pro duction-grade multi-strategy retriev al arc hitecture (CSGR++), requiring no besp ok e fine-tuning or sp ecialized model weigh ts. Significance. A central implication of this work is that improving retriev al-augmented reasoning do es not necessarily require larger mo dels, longer context windows, or additional data, but rather principled control ov er how existing evidence is selected and ev aluated during inference. By explicitly mo deling epistemic uncertain ty and optimizing evidence selection for information gain, Entropic Claim Resolution pro vides a l igh tw eight, computationally efficien t mechanism for impro ving robustness and in terpretability . This makes the framework particularly v aluable for high-stakes en terprise deplo y- men ts (e.g., medical, legal, or financial QA) where mitigating unhedged hallucinations and controlling inference costs are critical. Ultimately , this persp ectiv e highlights an alternativ e path for scaling kno wledge-intensiv e systems: a path grounded in decision-theoretic inference rather than indisc rimi- nate context expansion, particularly in settings c haracterized by noisy , conflicting, or heterogeneous evidence. 2 Related W ork 2.1 Dense, Graph-Augmented, and Agentic Retriev al Standard dense retriev al selects a set of do cumen ts D by prioritizing their conditional probability given the query , P ( D | Q ) , commonly approximated via cosine similarit y embeddings [1]. T o ov ercome the short-sigh tedness of relev ance searc h, adv anced graph-based architectures, notably GraphRAG [3] 2 and Context-Seeded Graph Retriev al (CSGR), implicitly construct or trav erse knowledge graphs o ver c hunks or entities to expand the evidence space. In enterprise environmen ts, hybrid systems such as CSR-RA G [4] further integrate structural and relational signals to supp ort large-scale schemas. Concurren tly , the conv ergence of dynamic retriev al p olicies with autonomous planning has crys- tallized into the paradigm of A gentic RAG . F rameworks such as ReAct [5], T ree-of-Thoughts [6], and Self-RA G [7] allow language mo dels to interlea ve in termediate reasoning steps with retriev al actions in order to refine subsequent queries. Recen t Systematization of Knowledge (SoK) studies emphasize this shift from static pip elines tow ard mo dular control strategies. How ever, despite their flexibility , agentic retriev al systems in trinsically rely on heuristic prompt designs or static thresholds to determine when to halt retriev al or which information to prioritize. As a result, they lack a rigorous mathematical definition of epistemic sufficiency and an explicit ob jective for selecting the next most informative piece of evidence as uncertain ty unfolds during inference. 2.2 Uncertain t y Quan tification (UQ) in RAG A critical prerequisite for adaptive retriev al is accurately c haracterizing what a mo del do es not know. Recen t b enc hmarks such as URAG [11] demonstrate that while RA G can improv e factual grounding, it also introduces new sources of epistemic uncertain ty , including relev ance mismatch and selectiv e atten tion to partial evidence, which can paradoxically amplify o verconfiden t hallucinations under noisy retriev al conditions. In resp onse, a gro wing bo dy of w ork on Retriev al-Augmented Reasoning (RAR) fo cuses on quan tifying uncertain ty across the retriev al and generation stages. F or example, metho ds suc h as Retriev al-Augmented Reasoning Consistency (R2C) [12] mo del multi-step reasoning as a Marko v Decision Pro cess and p erturb generation to measure output stability via ma jorit y voting, building up on foundational frameworks for semantic uncertaint y [10]. These uncertaint y-a ware approaches are effectiv e for p ost-hoc answer ev aluation, abstention, or calibration. Ho wev er, they are not designed to guide the sele ction of evidenc e itself during the reasoning pro cess. In particular, they do not pro vide a mec hanism for c ho osing whic h atomic piece of evidence should be retriev ed or verified next in order to maximize information gain prior to answer syn thesis. 2.3 En trop y-A w are Context Managemen t The application of Shannon en tropy as a control signal for managing LLM context is an emerging researc h direction. Large context windo ws in standard RAG often lead to attention dilution and un- constrained entrop y growth, motiv ating recen t work on entrop y-aw are context control. F or instance, BEE-RA G (Balanced Entrop y-Engineered RAG) [13] mo difies attention dynamics to maintain en tropy in v ariance ov er long contexts, while SF-RA G (Structure-Fidelity RAG) [14] lev erages do cumen t hier- arc hy as a low-en tropy prior to preven t evidence fragmentation. Similarly , L-RAG (Lazy RAG) [15] emplo ys predictive entrop y thresholds to gate exp ensive retriev al op erations, defaulting to parametric kno wledge when uncertaint y is estimated to be low. ECR shares this information-theoretic lineage but departs in a critical wa y: rather than using en tropy to compress, gate, or truncate con text, ECR applies entrop y directly as an ob jectiv e for se quential ly sele cting discriminative evidenc e variables . This shift reframes en tropy from a passive diagnostic in to an active decision criterion guiding inference-time reasoning. 2.4 Claim-Lev el V erification and V alue of Information While standard RAG op erates on monolithic do cumen t c hunks, recent diagnostic and safet y-oriented framew orks decomp ose retrieved conten t into atomic claims. Systems such as MedRAGChec ker [16] ev aluate biomedical QA systems b y extracting fine-grained claims and c hecking them against struc- tured kno wledge bases, while agentic fact-chec king pip elines (e.g., SAFE [17] and CIBER [18]) retriev e 3 supp orting and refuting evidence for individual statements. These approac hes demonstrate the im- p ortance of claim-level reasoning for reliability and interpretabilit y . ECR aligns this gran ular verification paradigm with classical principles from Ba yesian experimental design and active learning. In activ e learning, the ob jective is to select the next unlab eled instance that maximizes exp ected information gain. By formulating inference-time evidence selection as Exp ected En tropy Reduction (EER) ov er discrete factual claims, ECR bridges symbolic uncertaint y mo deling and neural generation, optimizing retriev al for the value of information rather than semantic relev ance alone. 3 Metho dology: En tropic Claim Resolution (ECR) ECR formulates the evidence selection problem as a sequential decision pro cess targeting a reduction in epistemic uncertain ty across comp eting generativ e outcomes. ECR assumes high-recall candidate generation has already o ccurred (via upstream retriev al) and fo cuses exclusively on resolving uncertaint y within the resulting candidate claim set. 3.1 Problem F orm ulation Let C = { c 1 , c 2 , . . . , c n } b e a finite subset of atomic factual claims em b edded within a corpus. F or a giv en complex query Q , assume that assessing the veracit y of any given claim c i pro vides a signal regarding the query’s answ er. W e denote the latent truth v ariable asso ciated with claim c i as X i ∈ { 0 , 1 } , indicating whether the claim is empirically v alidated within the specific source do cumen t. Up on identifying high epistemic uncertain ty in the retriev al space (e.g., via confidence v ariance or conflicting k eyword analysis), ECR initializes an Answer Hyp othesis Space A = { a 1 , a 2 , . . . , a k } . This space represents the set of mutually exclusive p oten tial macro-answ ers to the query . 1 In our implemen tation, A is robustly generated dynamically: either b y querying the LLM to prop ose dis- tinct v alid hypotheses derived from subsets of the initial k -b est claims, or via deterministic vector clustering when op erating purely off-line. Our ob jective is to sequen tially refine a probability distri- bution P ( A | X ev al , Q ) ov er these hypotheses, conditioned on the dynamic subset of ev aluated claims X ev al ⊆ X , initialized at a uniform prior P ( a ) = 1 |A| . 3.2 Ob jectiv e F unction: Answer Entrop y The epistemic uncertaint y regarding the true outcome is robustly quantified using Shannon entrop y . Let the en tropy of the hypothesis space after ev aluating a subset of claims X ev al b e: H ( A | X ev al ) = − X a ∈A P ( a | X ev al ) log 2 P ( a | X ev al ) (1) 3.3 Exp ected En tropy Reduction (EER) and Selection Policy A t the t -th iteration, the system m ust choose the next claim c ∗ from the unev aluated candidate p ool C cand to formally v erify . Rather than relying on cosine relev ance sim ( c i , Q ) , we select the claim that maximizes Expected Entrop y Reduction (Information Gain). The selection p olicy is formally defined as: c ∗ = arg max c ∈C cand EER ( c | X ev al ) (2) The EER is precisely the difference betw een curren t entrop y and the exp ected p osterior entrop y after observing the truth v alue of claim c : EER ( c | X ev al ) = H ( A | X ev al ) − E X c H ( A | X ev al ∪ { X c } ) (3) 1 W e use mutual exclusivit y for analytical clarity; the framew ork naturally extends to partially ov erlapping hypotheses via soft assignment of claims to hypotheses. 4 This criterion ensures the algorithm intrinsically fav ors discriminative claims, i.e., evidence that cleanly segregates the hypothesis space. In practice, EER is approximated by measuring the proba- bilistic v ariance b et ween the sp ecific subsets of comp eting macro-hypotheses actively supp orted versus unsupp orted by candidate c . A claim supp orting all hypotheses equally yields an EER of 0, reflecting its redundancy , regardless of its seman tic similarity to the query . Implemen tation-Level EER Proxy . Computing the true mathematical expectation o ver all p os- sible generativ e outcomes is typically intractable during low-latency inference. Therefore, w e deploy a computationally efficie n t proxy that approximates Exp ected Entrop y Reduction without requiring full marginalization ov er laten t truth v ariables. In our concrete implementation, each candidate claim c partitions the hypothesis set in to those that cite c as supp orting evidence and those that do not. Let A + ( c ) = { a ∈ A : c ∈ supp ( a ) } and A − ( c ) = A \ A + ( c ) . Denote the probability mass in each subset as p + ( c ) = P a ∈A + ( c ) P ( a | X ev al ) and p − ( c ) = P a ∈A − ( c ) P ( a | X ev al ) . W e score discriminativity via the follo wing heuristic proxy: [ EER ( c ) = | p + ( c ) − p − ( c ) | p + ( c ) + p − ( c ) · H ( A | X ev al ) · conf ( c ) , (4) where conf ( c ) ∈ [0 , 1] denotes claim confidence. This proxy is linear in the num b er of hypotheses and preserves the core ob jective of prioritizing claims that maximally split the p osterior mass, while remaining tractable for inference-time use. Design choice of the EER proxy . The heuristic pro xy in Eq. (10) is inten tionally not a symmetric appro ximation of classical exp ected information gain, which typically fav ors balanced posterior splits; rather, it is designed for b ounded-budget inference, where the ob jectiv e is rapid reduction of epistemic uncertain ty rather than exploratory experimentation. In retriev al-augmented reasoning, once p oste- rior mass concen trates on a subset of hypotheses, prioritizing high-confidence, high-im balance claims accelerates conv ergence and reduces redundant evidence retriev al. This exploitative bias is therefore a delib erate design choice aligned with low-latency inference and downstream synthesis constrain ts. Coherence-a ware selection. In addition to entrop y reduction, ECR incorp orates a ligh tw eight co- herence signal that prioritizes ev aluating claims lik ely to complete an explicit contradiction when such evidence exists. Concretely , we add a small regularization term λ · ConflictPoten tial( c ) to the selection ob jective, yielding score ( c ) = [ EER( c ) + λ · ConflictPoten tial( c ) , where ConflictPoten tial( c ) ∈ { 0 , 1 } is non-zero if c is an explicit negation of, or completes a con tradiction pair with, a previously ev aluated claim. This term do es not o verride en tropy reduction but ensures that unresolved con tradictions are surfaced early rather than a veraged aw ay . Empirically , w e observe that any non-zero λ induces stable coherence-a ware b ehavior without requiring fine-grained tuning (App endix, Figure A.1). Con tradiction-aw are coherence term. Let C ev al denote the set of claims that hav e already b een ev aluated. W e define a binary contradiction indicator ConflictP otential( c ) = ( 1 if ∃ c ′ ∈ C ev al suc h that c ≡ ¬ c ′ , 0 otherwise. (5) That is, ConflictPoten tial( c ) activ ates only when ev aluating c would complete an explicit con tradiction pair in the evidence. This coherence signal is structural rather than probabilistic: it do es not p enalize h yp otheses or p osteriors directly , and it do es not measure global consistency . Instead, it biases claim selection tow ard surfacing epistemic inconsistency when it exists, prev enting en tropy-only selection from a veraging aw ay contradictory evidence. The resulting claim-selection ob jective is c ∗ = arg max c ∈C cand [ EER( c ) + λ · ConflictPoten tial( c ) , (6) 5 Bo x 1: Entropic Claim Resolution (ECR) Input: query Q , candidate claims C cand , en tropy threshold ϵ , max iterations T 1. Hypotheses. Initialize A ← Genera teHypotheses ( Q, C cand ) (LLM or clustering), set uniform prior P ( a ) = 1 / |A| . 2. Lo op. F or t = 1 ..T : compute H ( A | X ev al ) (Eq. 1). If epistemic sufficiency holds (Eq. 11 ′ ) stop. 2a. Select. Cho ose c ∗ ∈ C cand maximizing [ EER ( c ) + λ · ConflictP otential( c ) (Eq. 4). 2b. V erify . Estimate P ( X c ∗ = 1) using pro venance and support/contradiction statistics (Eq. 8). 2c. Up date. Up date P ( A | X ev al ∪ { X c ∗ } ) (Eq. 7), add X c ∗ to X ev al , remo ve c ∗ from C cand . 3. Output. Return arg max a P ( a | X ev al ) if epistemic sufficiency holds (Eq. 11 ′ ), else return the rank ed distribution ov er A . Figure 1: Pseudo-co de for ECR without external algorithm pack ages. where λ ≥ 0 controls the strength of con tradiction-aw are selection. Setting λ = 0 reco vers entrop y- only ECR. While en tropy reduction remains the primary ob jective, any λ > 0 ensures that explicit con tradiction-completing claims are prioritized when present, under the b ounded EER scale induced b y the hypothesis en tropy . This prioritization is observed empirically as a sharp phase transition in the λ -sw eep ablation, where b eha vior saturates for all tested λ > 0 . 3.4 Ba y esian P osteriors and Epistemic Sufficiency Up on selecting c ∗ , the system ev aluates its intrinsic truth X c ∗ against the source context and prov e- nance metadata (see Section 3.5). The hypothesis probabilities are concurrently up dated utilizing lo calized Bay es’ rules. Concretely , hypotheses intersecting functionally with v alidated claims observe significan t targeted probability mass b oosts, severely suppressing con tradicting disjoint branches. P ( A | X ev al ∪ { X c ∗ } ) = P ( X c ∗ | A ) P ( A | X ev al ) P ˜ a P ( X c ∗ | ˜ a ) P (˜ a | X ev al ) (7) where P ( X c | A ) represents the conditional likelihoo d of observing the claim c assuming hypothesis A is true. The iterative verification procedure gracefully terminates when the system reac hes a mathematical state of epistemic sufficiency , parameterized by threshold ϵ (e.g., ϵ = 0 . 3 bits): H ( A | X ev al ) ≤ ϵ ∧ ¬ Conflict( X ev al ) (11 ′ ) where Conflict( X ev al ) indicates the presence of m utually incompatible claims (e.g., an explicit claim and its negation) within the ev aluated evidence. Alternativ ely , if all candidates are e xhausted or maxim um iterations are met with H > ϵ , ECR halts and explicitly exp oses the comp eting hypotheses and their final mass distributions, structurally mapping the unresolv able ambiguit y of the corpus. The complete iterativ e pro cedure is summarized in Box 1. 3.5 V erification via T op ological Prov enance In practical con tinuous-learning implementations, the inferential verit y link P ( X c = 1 | A ) can be computed dynamically rather than natively assuming p erfect mo del alignment. Instead of relying solely on parametric LLM-driv en prompt verification, ECR explicitly incorporates the topological pro venance of the multi-modal knowledge graph natively . Let S ( c ) and C ( c ) represen t the structural supp ort graph-edge coun ts and contradictory graph-edge counts of claim c track ed intricately within 6 the backing EA V (En tity-A ttribute-V alue) datastore, applying implicit Laplace smo othing. The final top ological verification probability is th us seamlessly and robustly blended: P ( X c = 1) = S ( c ) + 1 S ( c ) + C ( c ) + 2 if S ( c ) + C ( c ) > 0 , P prior _ conf ( X c = 1) otherwise. (8) This matches the deploy ed b eha vior in our implementation: whenev er historical supp ort/con tradiction signals exist, the system uses a Laplace-smo othed empirical truth estimate; otherwise, it falls bac k to the extraction-time prior confidence. 3.6 Theoretical Prop erties T o solidify the inferential v alidit y of the sequential system, we deduce its op erational p erformance b ound mapping. Theorem 1 (T ermination and Budget Bound) . F or any finite c andidate set C cand , ECR terminates after at most min( T , |C cand | ) claim evaluations. Mor e over, if ther e exists a c onstant δ > 0 such that at e ach iter ation the sele cte d claim satisfies E [ H t − 1 − H t ] ≥ δ whenever H t − 1 > ϵ , then ECR r e aches epistemic sufficiency in at most ⌈ ( H 0 − ϵ ) /δ ⌉ iter ations. W e emphasize that this result characterizes sufficient conditions for conv ergence under informativ e evidence selection, rather than a minimax or adv ersarial worst-case guarantee. When explicit con tra- dictions exist in the evidence, the sufficien t conditions for conv ergence are inten tionally violated, and ECR terminates b y exp osing ambiguit y rather than collapsing the p osterior. Pr o of. The first statement holds b ecause each iteration ev aluates and remo ves at most one claim, and the lo op is explicitly capp ed b y T . F or the second statemen t, telescoping the assumed exp ected en tropy decrease yields E [ H t ] ≤ H 0 − tδ until reac hing ϵ , hence t ≥ ( H 0 − ϵ ) /δ suffices. 4 System In tegration: ECR within CSGR++ T o ev aluate ECR b ey ond isolated theoretical constrain ts, we in tegrated it in to a pro duction-grade, m ulti-strategy retriev al pip eline. While ECR is algorithmically orthogonal to any sp ecific retriever, we utilize the CSGR++ architecture as our primary testb ed. In this section, we describ e the surrounding system comp onen ts that generate, structure, and verify the atomic candidate claims consumed by the ECR inference loop. Figure 2 illustrates the resulting end-to-end arc hitecture and the p osition of ECR within it. 4.1 HyRA G v3 Ingestion and Index Construction Structured and T abular Data as First-Class Evidence. HyRA G v3 natively supp orts struc- tured and semi-structured tabular data, rather than treating tables as flattened text. During inges- tion, the system p erforms automatic schema inference, including column typing (numeric, categorical, temp oral), identifier detection, and time-series normalization. Individual table cells and deriv ed ag- gregates are materialized as atomic claims with explicit prov enance, row identifiers, column metadata, and canonical time keys. Structured aggregation queries are grounded through a text-to-SQL exe- cution path with guarded, read-only execution and v alidation against real table v alues. All tabular claims en ter the same inference-time evidence p o ol as textual and graph-deriv ed claims, allowing En- tropic Claim Resolution to reason uniformly ov er mixed structured and unstructured evidence. This design enables precise numeric grounding, temporal filtering, and auditable reasoning not nativ ely supp orted by graph-enhanced RAG systems that op erate o ver synthesized document summaries. 7 Query Ensem ble Retriev al (V ector | Graph | Claim) En tropic Claim Resolution En tropy-Guided Selection Resp onse Synthesis epistemic sufficiency Inside ECR: Hyp othesis space A P osterior P ( A | X ) EER-based claim selection Figure 2: System ov erview: En tropic Claim Resolution (ECR) op erates as an inference-time controller b et ween competitive retriev al and answer syn thesis. Given a retriev ed claim set, ECR sequentially selects evidence to minimize hypothesis en tropy and terminates when epistemic sufficiency is reached. V ector-Based Retriev al as a Core Substrate. HyRAG v3 fully incorporates dense v ector re- triev al as a primary evidence acquisition mechanism. Ra w document ch unks, atomic claims, and syn thesized summaries are em b edded into dedicated vector indices and queried using cosine similarity with optional metadata and identifier filtering. V ector retriev al is used to seed claim p ools, initialize h yp othesis construction, and ground subsequent structured and graph-based reasoning. Rather than assuming vector similarity implies evidentiary sufficiency , HyRAG v3 sub jects all vector-retriev ed candidates to inference-time ev aluation under Entropic Claim Resolution, allowing relev ance-based signals to b e retained while preven ting o verconfidence in semantically similar but non-discriminative evidence. ECR op erates at inference time, but its effectiv eness dep ends on upstream ingestion and indexing that preserve atomicity , pro venance, and temp oral structure. The implemented HyRAG v3 pipeline (in our reference implemen tation) p erforms the following steps. Auto-adaptiv e sc hema inference with feedback calibration. An AutoAdaptAgent infers a sc hema from CSV/Excel/PDF/DataF rame inputs, iden tifying an ID column, categorical columns, n umeric columns, and time-series columns. A subsequen t schema fe e db ack lo op p erforms a dry-run parse of the first N rows (configurable) and adjusts misclassified columns (e.g., “numeric” columns with excessiv e null-rates), pro ducing a corrected sc hema used for full ingestion. Robust parsing with rep eated-header detection and temp oral normalization. The inges- tion parser supports multiple formats and implements spreadsheet-sp ecific heuristics, including merg- ing complementary m ulti-row headers and skipping rep eated header ro ws using an ov erlap threshold ( ≥ 0 . 70 tok en ov erlap). Time-series columns are normalized via a data-driven time-key parser that rec- ognizes patterns such as years (e.g., 2024), quarters (e.g., 2024Q1), halv es (e.g., 2024H2), and trailing windo ws (e.g., L TM/TTM), and maps them to a canonical order key used for temp oral slicing. 8 EA V SQLite store with safe query execution. All ingested records are p ersisted in an En tity– A ttribute–V alue SQLite back end ( GenericStore ). F or do wnstream aggregation queries, the system exp oses a text-to-SQL route but enforces a strict SELECT -only guardrail: the SQL executor blo c ks write op erations and limits result sizes. Em b eddings and vector indices with deterministic fallbac ks. The embedding subsystem is three-tiered: an online embedding API (if av ailable), a lo cal sentence-transformer fallback, and a deterministic hashed-vector fallback for fully offline op eration. V ector indices supp ort an optional database bac kend (LanceDB when installed) and a pure NumPy cosine-similarit y back end otherwise; b oth supp ort ID filtering for category/time constraints. A tomic claim extraction and claim index. During ingestion, the system extracts atomic claims, en tities, and ligh tw eight semantic relations ( h, r, t ) in to a dedicated ClaimStore . Claim v ectors are em b edded and stored in a separate claim vector index to enable claim-first retriev al. Hierarc hical summarization as retriev able no des. T o improv e global recall, the system clusters em b edded ro w representations using a pure-NumPy k -means routine (no external ML dep endencies), summarizes each cluster (LLM when av ailable), re-embeds the summaries, and inserts them into the same row-lev el vector index under a reserved ID prefix. As a result, standard vector retriev al can surface b oth raw rows and higher-lev el cluster summaries. Cluster summaries are stored as first-class retriev able no des and compete directly with raw rows during vector retriev al. ECR is exclusively activ ated on the analytical CSGR_PLUS route selected by the upstream query router, and is b ypassed for LOOKUP , RELATIONAL , SEMANTIC , TOOL , and SQL routes. External to ols are treated as deterministic op erators outside the entrop y-driven reasoning lo op (the LLM only formats a JSON to ol call when av ailable, with an offline numeric-statistics fast-path), i.e., excluded from ECR’s epistemic mo deling rather than treated as comp eting uncertaint y-reduction actions. T o b ound computational ov erhead, ECR is inv ok ed dynamically strictly when the retrieved config- uration exhibits high epistemic uncertaint y . The trigger conditions are natively in tegrated via three heuristics: 1. High Claim V olume: The retriever fetches hea vily saturated candidate spaces ( > 15 claims). 2. Syn tactic Ambiguit y: Detection of uncertain ty k eywords within the activ e query (e.g., “un- certain”, “conflicting”, “disagree”, “m ultiple”, “v arious”). 3. Confidence V ariance Constrain t: The v ariance in micro-lev el claim confidence σ 2 across the k retrieved c laims e xceeds an empirical threshold of 0 . 15 (with confidence actively tied to trac king top ological support-contradiction metrics within the underlying datastore). T o ev aluate ECR in a high-p erformance setting, we implement it as a standalone and mo dular res- olution engine within a pro duction-grade Context-Seeded Graph Retriev al (CSGR++) arc hitecture. While ECR is algorithmically orthogonal to an y sp ecific retriever, CSGR++ serves as a rigorous exp er- imen tal testb ed that preserves atomicity , prov enance, and multi-strategy retriev al signals. Knowledge is extracted and stored as atomic semantic claims in an En tity-A ttribute-V alue (EA V) back end, ac- companied b y separate vector indices for raw rows and claims. Within this testb ed, the baseline multi-strategy EnsembleR etriever combines dense similarity searc h, structural graph expansion, and semantic claim matc hing using Recipro cal Rank F usion. ECR cleanly in tercepts the pip eline immediately after candidate generation, acting as an isolated inference- time uncertaint y resolution stage that outputs either a dominant hypothesis or a calibrated set of alternativ es for downstream synthesis. 9 4.2 CSGR++ Backbone Arc hitecture While ECR is algorithmically orthogonal to a particular retriev al stack, we implement and ev aluate it inside a production-grade pip eline (CSGR++) that is explicitly claim-centric. A tomic claim store with semantic relations. CSGR++ stores extracted claims in a SQLite- bac ked ClaimStor e with fields for (i) claim text, (ii) entit y m en tions, (iii) time keys / order k eys for temp oral slicing, and (iv) dynamically up dated confidence signals. In addition, a ligh tw eight seman- tic relation table stores tuples ( h, r, t ) extracted during claim extraction (e.g., Acquires , Imp acts , Ca usedBy ), enabling entit y-based expansion during retriev al. T emp oral intelligence. Queries are parsed for explicit time constrain ts (e.g., “in 2024”, “2024Q1”, “last 3 quarters”, “since 2022”) and conv erted into an order-key in terv al ( τ min , τ max ) . Claim retriev al can then apply a hard filter ov er the claim IDs inside the selected time windo w. Comp etitiv e ensem ble retriev al and Recipro cal Rank F usion (RRF). The retriever runs m ultiple strategies (vector retriev al ov er ro ws, vector retriev al o ver claims, and graph/category trav er- sal) and fuses the per-strategy rankings via Recipro cal Rank F usion (RRF). F or an item d and ranking lists { L j } m j =1 with ranks r j ( d ) ∈ { 1 , 2 , . . . } , the fused score is RRF ( d ) = m X j =1 1 k + r j ( d ) , (9) where k is a dampening constant (w e use k = 60 in co de). Comp etitiv e strategy scoring (selection, not only fusion). In addition to fusing rankings, the retriev er scores eac h strategy to identify a “best” strategy for the query . The implemented scoring com bines (i) av erage similarity score, (ii) a diversit y proxy based on unique source items, and (iii) a verage claim confidence (when applicable) via a weigh ted sum. Bey ond rank fusion, this strategy scoring iden tifies the dominan t evidence view for a query , enabling adaptiv e retriev al-path selection rather than blindly trusting an ensemble. Relation-based expansion for m ulti-hop analytical queries. F or analytical (CSGR++) queries, the system extracts frequent entities from initially retrieved claims, then expands the evidence set by retrieving related claims via the relation table (one-hop expansion), discounting confidence slightly for expanded claims. Dynamic confidence micro-learning. Claims main tain supp ort and contradiction coun ters. When v erification indicates a claim was supp orted or contradicted, the system up dates its confidence with a b ounded, asymmetric rule: conf new ( c ) = clip [0 , 1] conf base ( c ) + 0 . 15 log(1 + S ( c )) − 0 . 25 C ( c ) . (10) This pro duces an online “micro-learning” effect: frequen tly supported claims b ecome easier to trust, while con tradicted claims are rapidly down-w eighted. Because claim confidence is up dated online and directly affects future [ EER( c ) scores (Eq. 4), ECR exhibits ligh tw eight inference-time learning b eha vior across queries. T rust modes (graded v erification). The query router classifies user in tent into trust modes ( strict for regulatory or numerical precision, b alanc e d , and explor atory ), whic h mo dulate verification aggressiv eness and synthesis style. 10 T able 1: Key subsystems implemen ted in our system that supp ort ECR and the full end-to-end pip eline. Subsystem Role in the pipeline AutoA daptAgent + SchemaF eedbackLoop Sc hema inference with dry-run calibration GenericStore (EA V SQLite) Item/attribute p ersistence; safe SELECT -only SQL ex- ecution Em b eddingPro vider + V ectorIndex 3-tier embeddings; LanceDB/NumPy back ends; ID- filtered cosine searc h ClaimExtractor + ClaimStore A tomic claims + relations + temp oral keys + dynamic confidence Ensem bleRetriever Comp etitiv e retriev al + RRF fusion (Eq. 9) En tropicClaimResolver ECR lo op: entrop y , EER selection (Eq. 4) StructuredSyn thesizer Structured analytical brief with evidence bullets Rev erse V erifier Numeric grounding + claim-a ware v erification and score capping RA GAnswerer Multi-hop, HyDE, text-to-SQL grounding, CRA G self- correction, citations Rev erse V erifier: deterministic n umeric grounding + claim-aw are chec king. Beyond prob- abilistic resolution, CSGR++ applies a three-lay er Rev erse V erifier: (i) a deterministic numeric ground- ing pass that extracts all n umeric tok ens in a draft answer and chec ks v erbatim presence in retrieved evidence, (ii) LLM-based claim-by-claim judgement with b oth supp orting and coun ter-evidence re- triev al, and (iii) a com bined score where numeric failures cap the maximum achiev able verification score. Numeric grounding is enforced as a hard constraint: a single unsupp orted numeric token caps do wnstream verification scores. T able 1 summarizes the ma jor subsystems of the full HyRA G v3 and CSGR++ pipeline and their resp ectiv e roles, providing a compact o verview of ho w ECR in tegrates into the surrounding retriev al, v erification, and synthesis infrastructure. 4.3 Supp orting RA G Comp onen ts Outside the CSGR++ analytical route, the implemen tation includes a general-purp ose RAG engine that pac k ages standard, widely used RAG mechanisms b ehind a single answer() in terface. These comp onen ts are supp orting infrastructure and are orthogonal to ECR. The system also supp orts generator-based streaming resp onses (via answer_stream en try-p oin ts), whic h is orthogonal to ECR and not ev aluated in this work. 2 Multi-hop retrieve–reason–retriev e. The engine iteratively retrieves candidates and, when on- line, generates a follow-up query conditioned on curren t evidence, stopping early when additional hops yield no new items. HyDE query embedding. T o impro ve recall under distribution shift b et w een user queries and row- shap ed embeddings, the engine optionally generates a short hypothetical “answer row” and embeds that text (HyDE) to driv e vector search. Cross-enco der reranking and calibrated abstention. Candidates are reranked either by cosine similarit y (offline) or b y an LLM “cross-enco der” that outputs a ranking and confidence. A calibrated 2 All ma jor components admit deterministic fallbac ks when LLMs are unav ailable (e.g., hashed em b eddings and heuristic claim extraction), though answ er quality ma y degrade. 11 confidence score com bines the num b er of retriev ed results, the top similarity score, the reranker confidence, and a query-complexit y p enalt y; the system abstains when the calibrated score is low. T ext-to-SQL with v alue grounding. F or aggregation queries, the engine routes to text-to-SQL and applies a second grounding pass that v alidates every generated string literal against real categorical v alues in the database; when an unknown literal is detected, it is rewritten to the closest fuzzy match when p ossible. CRA G self-correction with sc hema evolution signals. When reverse verification returns fail or weak , the engine p erforms up to t wo correction attempts b y rewriting the query to target the v erification gap. Eac h failure can b e recorded by a schema evolution track er that increments p er- column failure counts and can request LLM-based reclassification suggestions once a threshold is exceeded. Schema evolution signals p ersist across queries, enabling long-term self-correction. 5 Exp erimen tal Design & Ev aluation While Section 4 outlines the deploymen t of ECR within a full-scale pro duction architecture, ev aluating the algorithm end-to-end immediately introduces confounding v ariables from upstream retriev al recall and downstream LLM generation quality . T o rigorously v alidate the decision-theoretic properties established in Section 3, our ev aluation strategy pro ceeds in tw o phases. First, w e strictly isolate the mathematical b eha vior of the en tropy-driv en claim selection p olicy using a controlled, claims-only harness (Sections 5.1–5.3). Second, we reintegrate ECR into an end-to-end reasoning pipeline to ev aluate its impact under realistic multi-hop and contradiction-hea vy settings (Section 5.4). 5.1 Con trolled claims-only harness Our “claims-only” harness fixes the dataset, query set, retriev al configuration, candidate claim p ool, and Bay esian entrop y mo del; only the claim-selection p olicy differs. This allo ws a clean measurement of whether a policy is actually minimizing epistemic uncertaint y as defined b y Eq. 1. Dataset and cases. W e use a small, m ulti-table business dataset of six CSV tables ( sales , customers , expenses , inventory , hr , marketing ) and 80 templated ev aluation queries spanning single-table lo okups and cross-table comparisons. Hyp otheses and initial entrop y . F or eac h query , the harness constructs |A| = 3 mutually exclu- siv e answer hypotheses, yielding an initial entrop y of H 0 = log 2 3 ≈ 1 . 585 bits. Candidate claims and p olicies. F or each case, we retrieve the same top-20 candidate claims (high-recall candidate generation). W e then compare three p olicies: (i) Retriev al-only , which takes the top-15 claims b y retriev al score under a fixed budget; (ii) ECR , whic h sequentially selects the next claim b y exp ected en tropy reduction and stops when H ≤ ϵ with ϵ = 0 . 3 bits (capp ed at 10 iterations); and (iii) Random control , which samples claims uniformly without replacemen t from the same candidate po ol, matc hing ECR’s realized claim budget of 5 claims. En tropy-aligned metrics. W e rep ort (i) final en tropy , (ii) en tropy drop p er ev aluated claim, (iii) claims-to-collapse (first step reaching H ≤ ϵ , else budget +1 ), (iv) effective hypotheses ( 2 H ), and (v) en tropy trace v ariance. W e additionally rep ort tw o diversit y-oriented diagnostics (claim redundancy and source entrop y) to illustrate that diversit y alone is not equiv alent to epistemic resolution. Finally , w e report hyp othesis-c onditione d r e dundancy (HypCondRed.), whic h computes redundancy within claim groups attributed to the same answer h yp othesis (rather than across the full mixed set). 12 Policy Claims H f inal ∆ H /claim Collapse 2 H f inal Redund. HypCondRed. SrcEnt Retriev al-only 15 . 0 ± 0 . 0 1 . 585 ± 0 . 000 0 . 0000 ± 0 . 0000 16 . 0 ± 0 . 0 3 . 000 ± 0 . 000 0 . 684 ± 0 . 119 0 . 662 ± 0 . 110 0 . 342 ± 0 . 510 ECR 5 . 0 ± 0 . 0 0 . 2129 ± 0 . 0000 0 . 2744 ± 0 . 0000 5 . 0 ± 0 . 0 1 . 159 ± 0 . 000 0 . 672 ± 0 . 125 0 . 672 ± 0 . 125 0 . 276 ± 0 . 443 Random 5 . 0 ± 0 . 0 1 . 243 ± 0 . 289 0 . 0684 ± 0 . 0577 6 . 0 ± 0 . 0 2 . 411 ± 0 . 437 0 . 658 ± 0 . 118 0 . 653 ± 0 . 123 0 . 354 ± 0 . 527 T able 2: Claims-only ev aluation (80 cases, seed=7). “Claims” is the n umber of ev aluated claims. H f inal is final answer-h yp othesis entrop y in bits. “Collapse” is claims-to-collapse (first step where H ≤ ϵ = 0 . 3 ; else budget +1 ). “Redund.” is claim redundancy , “HypCondRed.” is hypothesis- conditioned claim redundancy , and “SrcEnt” is source entrop y (diagnostic diversit y metrics). P olicy H f inal ∆ H /claim Collapse 2 H f inal T race V ar HypCondRed. Retriev al-only 1 . 585 ± 0 . 000 0 . 0000 ± 0 . 0000 16 . 00 ± 0 . 00 3 . 000 ± 0 . 000 0 . 002987 ± 0 . 000000 0 . 6619 ± 0 . 0000 ECR 0 . 2129 ± 0 . 0000 0 . 2744 ± 0 . 0000 5 . 00 ± 0 . 00 1 . 159 ± 0 . 000 0 . 262859 ± 0 . 000000 0 . 6719 ± 0 . 0000 Random 1 . 2628 ± 0 . 0265 0 . 0644 ± 0 . 0053 5 . 995 ± 0 . 006 2 . 436 ± 0 . 044 0 . 03210 ± 0 . 00374 0 . 6401 ± 0 . 0075 T able 3: Multi-seed robustness (seeds 0–4): mean ± std ov er seeds of the seed-level mean metrics. Only the random baseline changes across seeds in this frozen setup. “HypCondRed.” is h yp othesis- conditioned claim redundancy . 5.2 Main results (seed=7, 80 cases) T able 2 summarizes mean ± std across cases. ECR reliably reaches epistemic sufficiency using 5 claims, driving H b elo w ϵ ; retriev al-only do es not reduce entrop y under the same p osterior mo del; random impro ves mo destly but t ypically do es not collapse. A cross these runs, claim-cov erage is identical across p olicies (0.6375 on av erage), reflecting that this harness is designed to stress epistemic resolution rather than maximize ov erlap with a small set of exp ected claim snipp ets. 5.3 Robustness across random seeds (seeds 0–4) T o ensure the random-control comparison is not a single-seed artifact, we rerun the claims-only harness for five random seeds (0–4), reusing the same frozen dataset, query s et, candidate claims, and p osterior mo del. Retriev al-only and ECR are deterministic under this setup, while the random baseline v aries b y construction. T able 3 confirms the stabilit y of ECR across multiple seeds. F urthermore, Figure 3 illustrates the sc hematic entrop y tra jectories of these comp eting p olicies, highligh ting how rapidly ECR drives the h yp othesis space below the ϵ threshold compared to relev ance-only baselines. 5.4 End-to-End Ev aluation on a Standard Multi-Hop QA Benchmark In contrast to the preceding controlled, claims-only exp erimen ts, this ev aluation reintegrates a liv e large language mo del in to the inference lo op, exercising ECR as an online evidence-selection controller during end-to-end RA G generation. As an additional exp erimen t, to ev aluate whether entrop y-guided evidence selection improv es do wnstream answ er qualit y , we conduct an end-to-end ev aluation on a Hotp otQA-st yle multi-hop QA benchmark. All metho ds share the same retriever, language mo del, candidate evidence po ol, and decoding parameters; the only v ariable is the inference-time claim se- lection p olicy . W e ev aluate three p olicies: (i) a relev ance-based baseline RAG p olicy , (ii) a random selection con trol matched to the same a verage claim budget, and (iii) ECR, which applies entrop y- guided selection with stopping. W e rep ort exact match (EM), tok en F1, and an evidence faithfulness pro xy based on answ er-token cov erage, alongside the av erage n umber of claims used. Because Hot- p otQA exhibits substantially higher linguistic v ariance and more complex multi-hop dep endencies 13 Evidence Claims Activ ely Ev aluated ( t ) Hypothesis Entrop y H ( A | X eval ) (bits) 0 1 2 3 4 5 6 7 8 9 10 0.0 0.3 0.6 0.9 1.2 1.5 1.8 Epistemic sufficiency ( ϵ = 0 . 3 ) ECR (schematic) Random/retriev al (schematic) Figure 3: Schematic entrop y tra jectories consistent with the measured endp oin ts: ECR reac hes H ≤ ϵ quic kly , whereas relev ance-only and random baselines t ypically remain ab o ve ϵ at matc hed claim budgets. than highly structured tabular datasets, the ECR algorithm naturally ev aluates a larger n umber of claims b efore the hypothesis entrop y collapses b elo w ϵ . Metho d A vg. Claims Used Exact Match (EM) ↑ T ok en F1 ↑ Evidence F aithfulness ↑ Baseline RAG 19.87 0.313 0.459 0.639 Random Control 19.87 0.207 0.307 0.427 ECR (ours) 19.68 0.297 0.450 0.626 T able 4: End-to-End Ev aluation on Hotp otQA-St yle Multi-Hop QA (300 Questions). All metho ds use the same retriev er and language mo del; only the inference-time evidence selection p olicy differs. ECR substan tially outp erforms random selection while maintaining p erformance comparable to a strong relev ance-based baseline. T able 4 shows that ECR substantially outp erforms random selection across all reported metrics, confirming that entrop y-guided evidence selection is consisten tly more effectiv e than unguided or div ersity-only strategies. Relativ e to a strong relev ance-based baseline, ECR remains within a small margin on EM and F1, indicating that enforcing epistemic control do es not significantly degrade answ er accuracy on standard b enc hmarks. It is imp ortan t to note that HotpotQA is a largely factual and relev ance-orien ted b enc hmark with predominantly singular ground truths. As such, it do es not natively stress-test contradictory evidence or fundamentally ambiguous queries, whic h are precisely the regimes ECR is designed to address. Ac hieving near parity on suc h a saturated b enc hmark while enforcing strict inference-time epistemic constraints demonstrates that ECR integrates robust uncertain ty control without reliance on b enc hmark-sp ecific tuning. F uture ev aluations will fo cus on conflict-heavy or am biguity-orien ted b enc hmarks where relev ance-driven retriev al is kno wn to exhibit epistemic collapse. Robustness to Noisy Evidence. T o isolate a regime that is closer to real deploymen ts, where retriev ed evidence ma y include irrelev ant or even contradictory con tent, we perform a con trolled ablation on the same Hotp otQA ev aluation set and pip eline as ab o ve, injecting noise after r etrieval 14 Metho d EM (No Noise) F aith (No Noise) EM (40% Noise) F aith (40% Noise) Baseline RAG 0.323 0.660 0.167 0.345 ECR (ours) 0.307 0.657 0.163 0.331 T able 5: Robustness ablation on HotpotQA-style ev aluation (300 questions) with noise injected after r etrieval and b efor e evidenc e sele ction . “40% Noise” replaces 40% of retrieved candidate claims with unrelated (p oten tially contradictory) claims sampled from a noise p ool. Only baseline relev ance-based RA G and ECR are ev aluated; the retriever, LLM, prompts, and deco ding are unc hanged. (Perfor- mance is bounded ab o ve when ground-truth evidence is remov ed.) and b efor e evidenc e sele ction 3 . F or each query , we take the retriev ed candidate claim set and replace 40% of candidates with claims sampled from a noise p ool constructed from unrelated do cumen ts (k eeping the retriev er, LLM, prompts, deco ding, and ECR selection logic unchanged). T able 5 rep orts Exact Match (EM) and Evidence F aithfulness for baseline relev ance-based RAG and ECR under no noise v ersus 40% noise. Under this corruption regime, Exact Matc h necessarily degrades for b oth systems, as replacing a fraction of candidate claims can remo ve ground-truth evidence from the p ool. Notably , ECR exhibits predictable degradation to the relev ance-based baseline without amplifying noise-induced errors, de- spite enforcing strict inference-time stopping and ev aluating few er claims. This result indicates that en tropy-guided evidence selection remains w ell-b eha ved under partial evidence loss, a voiding ov erconfi- den t hallucination or unstable collapse when the a v ailable evidence b ecomes incomplete or unreliable. W e emphasize that this ablation ev aluates robustness to evidence corruption (i.e., partial remov al of v alid claims), rather than distractor accumulation, which isolates a complementary but distinct failure mo de. Offline Robustness Under Structured Con tradiction. Standard QA b enc hmarks predomi- nan tly ev aluate answer accuracy under relatively clean evidence conditions. T o stress-test the epistemic- con trol mec hanism itself—indep endently of LLM semantics—w e run a fully offline, deterministic con tradiction-injection ablation on the same 300-question Hotp otQA-st yle set and retriev al pip eline. F or eac h query , we take the retriev ed candidate claim p ool and inject paired, explicit contradiction t wins in to the candidate set at rate α ∈ { 0 . 0 , 0 . 3 , 0 . 5 } after retriev al and b efore evidence selection. In offline mode, hypothesis initialization uses deterministic hashed embeddings and claim v erification uses a deterministic pro venance proxy; this isolates controller behavior from verifier qualit y . W e rep ort (i) Ambiguit y Exp osure —whether the run ends with H > ϵ or an unresolved explicit con tradiction pair—and (ii) Ov erconfiden t Error —cases where the system outputs a dominan t h yp othesis despite b eing wrong (a pro xy for epistemic collapse). T able 6 shows a sharp regime shift: baseline relev ance-based RA G remains pathologically o verconfiden t and flat across α , while ECR transitions from fast epistemic sufficiency in the clean regime ( α = 0 . 0 ) to principled non-conv ergence under contradiction ( α ≥ 0 . 3 ). At α ≥ 0 . 3 , am biguity emerges deterministically for every query and termination is entirely explained b y unresolved conflict rather than heuristic budget limits. This extreme ambiguit y rate is exp ected: once an explicit contradiction pair is present in the ev aluated evidence, epistemic coherence is unattainable b y definition. Likewise, entrop y remains high b ecause ECR is not an en tropy minimizer “at all costs”; it is a coherence-constrained en tropy controller. Exploring complemen tary am biguity-focused b enc hmarks and distractor accum ulation regimes re- mains an important direction for future ev aluation. 3 Because noise is injected by replacing a fraction of candidate claims, this protocol may remov e gold evidence for some queries. Consequently , Exact Match under heavy corruption reflects robustness to partial evidence loss rather than distractor filtering. 15 Metho d α EM Ov erconfErr Am bExp Mean H Stop Reason Baseline RAG 0.0 0.0067 0.9933 0.0000 – fixed_budget (300/300) Baseline RAG 0.3 0.0067 0.9933 0.0000 – fixed_budget (300/300) Baseline RAG 0.5 0.0067 0.9933 0.0000 – fixed_budget (300/300) ECR (ours) 0.0 0.0000 0.9900 0.0100 0.226 epistemic_sufficiency (297/300) ECR (ours) 0.3 0.0067 0.0000 1.0000 1.496 unresolved_conflict (300/300) ECR (ours) 0.5 0.0067 0.0000 1.0000 1.458 unresolved_conflict (300/300) T able 6: Offline con tradiction-injection ablation (300 questions). P aired contradictions are injected in to the candidate claim p o ol at rate α after retriev al and b efore evidence selection. EM is rep orted only as a sanity anchor under a deterministic offline answ erer; the k ey signals are Am biguity Exp o- sure and Overconfiden t Error (epistemic collapse). ECR exhibits a phase transition from epistemic sufficiency to principled non-conv ergence as contradictions accumulate, while baseline RAG remains uniformly o verconfiden t. Counts indicate num b er of runs terminating for each reason. 6 Conclusion Summary En tropic Claim Resolution introduces a principled inference-time p ersp ectiv e on Retriev al-Augmented Generation, reframing evidence selection as a pro cess of epistemic uncertaint y reduction rather than relev ance maximization. By directly optimizing Exp ected Entrop y Reduction o ver atomic claims, ECR pro vides a mathematically grounded mechanism for determining b oth whic h evidence to ev aluate next and when sufficien t evidence has b een accumulated to justify synthesis. Empirically , w e show that this entrop y-driven framework reliably collapses h yp othesis uncertain ty in con trolled claim-level settings and substantially outp erforms random evidence selection in end-to- end multi-hop question answering, while maintaining p erformance comparable to strong relev ance- based baselines. These results highligh t a fundamen tal distinction b et ween optimizing for raw answ er accuracy and enforcing principled epistemic con trol during inference. In a fully offline con tradiction-injection stress test, ECR exhibits a sharp transition from epistemic sufficiency to principled non-conv ergence as structured conflict accumulates: entrop y ceases to collapse, evidence exploration increases, and termination is explained by unresolved inconsistency rather than heuristic budgets. Unlik e retriev al arc hitectures designed primarily for long-form unstructured do cumen ts, HyRAG v3 explicitly mo dels structured tabular data with ro w-level grounding, enabling ECR to enforce numeric correctness and temporal consistency during inference. Bey ond b enc hmark p erformance, the ECR framew ork offers clear adv antages for real-world and en terprise deploymen ts. In high-stakes domains suc h as medicine, law, and finance, confiden tly synthe- sizing a single answer from conflicting or incomplete evidence can b e costly or harmful. By pro viding a mathematically grounded mechanism to expose unresolved ambiguit y when epistemic sufficiency can- not be reached, ECR functions as a principled constraint against unhedged generation. F urthermore, the ability to dynamically halt evidence accumulation once H ≤ ϵ is satisfied mitigates unnecessary computational ov erhead, reducing latency and cost associated with processing large, redundan t con- text windows. This p ositions ECR as a resource-efficient inference-time control mechanism for scalable and risk-a ware AI reasoning. Limitations and F uture W ork W e conclude by outlining k ey limitations of the current framew ork and highlighting promising direc- tions for future researc h. 16 Hyp othesis space cov erage. A primary limitation of the current framew ork is its reliance on the initial hypothesis generation stage. Entropic Claim Resolution op erates ov er an explicitly constructed h yp othesis set and therefore inherits a b ounded-cov erage assumption; if the true answer is entirely absen t from this space, the system may conv erge confiden tly to an incorrect explanation. In practice, this limitation can b e mitigated b y regenerating h yp otheses when en trop y fails to decrease or when accum ulated evidence weakly supp orts all candidates. F uture w ork will explore dynamic mid-lo op h yp othesis extension, soft hypothesis assignments, richer lik eliho od mo dels, and tighter in tegration with learned retrievers to further strengthen entrop y-guided reasoning under uncertain ty . Imp ortan tly , ECR’s refusal to con verge under explicit contradiction is a delib erate design choice rather than a limitation: when the ev aluated evidence is epistemically incoherent, the framew ork exp oses ambiguit y instead of forcing p osterior collapse. This b eha vior preserves epistemic correctness but ma y yield non-decisiv e outputs in genuinely inconsistent corp ora. An orthogonal robustness regime inv olves distractor accumulation without evidence remov al, whic h w e leav e to future in vestigation. Finally , while this work ev aluates Entropic Claim Resolution sp ecifically within Retriev al-Augmented Generation, the underlying metho dology naturally extends to agentic and autonomous con texts. Our approac h suggests a p erspective where agen t actions (such as executing to ols, querying external APIs, or taking exploratory steps) can be mo deled dynamically as en tropy-minimizing decisions ev aluated under a rigorous Exp ected Entrop y Reduction criterion. This aligns with recent adv ancements in autonomous cognitive control, including top ology-a ware routing [19] and dynamic temp oral pacing [20], providing a formal alternative to standard prompt-driv en or heuristic action-selection p olicies. W e view the integration of decision-theoretic primitives into con tinuous agentic feedback lo ops as a comp elling frontier for building robust and mathematically grounded autonomous systems. 17 References [1] Lewis, P ., et al. (2020). Retriev al-augmen ted generation for knowledge-in tensive nlp tasks. A d- vanc es in Neur al Information Pr o c essing Systems , 33, 9459–9474. [2] Karpukhin, V., et al. (2020). Dense Passage Retriev al for Op en-Domain Question Answering. Pr o c e e dings of EMNLP . [3] Edge, D., et al. (2024). F rom Lo cal to Global: A Graph RAG Approach to Query-F o cused Summarization. arXiv pr eprint arXiv:2404.16130 . [4] Singh, R., et al. (2026). CSR-RA G: An Efficient Retriev al System for T ext-to-SQL on the Enter- prise Scale. arXiv pr eprint arXiv:2601.06564 . [5] Y ao, S., et al. (2023). ReAct: Synergizing Reasoning and A cting in Language Mo dels. ICLR . [6] Y ao, S., et al. (2023). T ree of Thoughts: Delib erate Problem Solving with Large Language Mo dels. NeurIPS . [7] Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv pr eprint arXiv:2310.11511 . [8] Shannon, C. E. (1948). A Mathematical Theory of Comm unication. Bel l System T e chnic al Journal , 27(3). [9] Houlsb y , N., et al. (2011). Ba yesian Activ e Learning for Classification and Preference Learning. arXiv:1112.5745 . [10] Kuhn, L., et al. (2023). Seman tic Uncertain ty: Epistemic Uncertain ty in Neural Language Mo dels. ICLR . [11] Zhang, T., et al. (2024). URA G: Benchmarking Uncertaint y in Retriev al-Augmented Generation. arXiv:2408.01234 . [12] Liu, H., et al. (2025). Retriev al-Augmented Reasonin g Consistency . ACL . [13] Zhao, X., et al. (2025). BEE-RA G: Balanced En tropy-Engineered Context Management. arXiv:2501.09912 . [14] Kim, S., et al. (2025). Structure-Fidelit y RAG. ICLR . [15] P atel, M., et al. (2024). Lazy RAG. EMNLP . [16] W ang, Y., et al. (2025). MedRAGChec ker. . [17] W ei, A., et al. (2024). SAFE. . [18] Chen, J., et al. (2025). CIBER. NeurIPS . [19] Di Gioia, D. (2026). Cascade-A ware Multi-Agent Routing. . [20] Di Gioia, D. (2026). Learning When to A ct. . 18 App endix A λ -Sw eep Robustness T o test whether coherence-a ware b ehavior requires fragile tuning, we sweep the coherence b on us weigh t λ in ECR’s evidence selection p olicy ov er { 0 , 0 . 01 , 0 . 025 , 0 . 05 , 0 . 1 } while keeping the offline proto col, budgets, and con tradiction injection rates α ∈ { 0 . 0 , 0 . 3 , 0 . 5 } fixed. W e observe that λ = 0 b eha ves as en tropy-only con trol and can conv erge to a dominan t hypothesis ev en under con tradiction injection, whereas an y tested non-zero λ yields the same coherence-aw are regime in which explicit contradictions are surfaced and preven t epistemic collapse; consequently , b eha vior saturates across all tested λ > 0 . W e set λ = 0 . 05 as the default. As shown in T able A.1, we observe a sharp transition betw een en tropy-only control ( λ = 0 ) and coherence-aw are con trol ( λ > 0 ), with b eha vior saturating for all tested non-zero v alues. This indicates that ECR do es not require fine-grained h yp erparameter tuning to surface epistemic inconsistency . λ conflict Am biguity Exp osure 0 0.01 0.025 0.05 0.10 0 1 α = 0 . 0 α = 0 . 3 α = 0 . 5 Figure A.1: Ambiguit y exp osure as a function of the coherence weigh t λ conflict under structured con tradiction injection. Empirically , ambiguit y exp osure exhibits a sharp phase transition: for α = 0 . 5 , exp osure jumps from 0 to 1 for an y tested λ > 0 , while remaining 0 for α ≤ 0 . 3 across all tested settings. λ α = 0 . 0 α = 0 . 3 α = 0 . 5 MeanClaims Mean H MeanClaims Mean H MeanClaims Mean H 0.00 5.04 0.226 5.06 0.226 5.08 0.226 0.01 5.04 0.226 25.83 1.496 29.81 1.458 0.025 5.04 0.226 25.83 1.496 29.81 1.458 0.05 5.04 0.226 25.83 1.496 29.81 1.458 0.10 5.04 0.226 25.83 1.496 29.81 1.458 T able A.1: λ -sweep summary statistics (offline, deterministic). V alues are aggregated ov er 300 ques- tions. B Offline Con tradiction Sanit y T est As an additional sanity c heck, w e ev aluate ECR in a minimal fully offline scenario consisting of a single claim and its explicit synthetic negation (a paired contradiction twin). In this setting, the exp ected out- come is that ECR ev aluates b oth claims, flags unresolved conflict ( has_unresolved_conflict=True ), and refuses to emit a dominant h yp othesis ( dominant_hypothesis=None ). This deterministic unit test passes. 19
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment