Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG

En tropic Claim Resolution: Uncertain t y-Driv en Evidence Selection for RA G Da vide Di Gioia ucesigi@ucl.ac.uk Abstract Curren t Retriev al-Augmen ted Generation (RA G) systems predominan tly rely on relev ance- based dense retriev al, sequen tially fetching documents to maximize seman tic similarity with the query . Ho wev er, in kno wledge-intensiv e and real-w orld scenarios c haracterized b y conﬂicting evi- dence or fundamental query am biguity , relev ance alone is insuﬃcient for resolving epistemic un- certain ty . W e in tro duce En tropic Claim Resolution (ECR), a no vel inference-time algorithm that reframes RA G reasoning as en tropy minimization ov er comp eting semantic answ er h yp otheses. Unlik e action-driv en agen tic framew orks (e.g., ReAct) or ﬁxed-pip eline RAG architectures, ECR sequen tially selects atomic evidence claims by maximizing Exp ected Entrop y Reduction (EER), a decision-theoretic criterion for the v alue of information. The pro cess dynamically terminates when the system reac hes a mathematically deﬁned state of epistemic suﬃciency ( H ≤ ϵ , sub ject to epistemic coherence). W e in tegrate ECR into a production-grade multi-strategy retriev al pip eline (CSGR++) and demonstrate its theoretical prop erties. Our framework pro vides a rigorous foun- dation for uncertaint y-aw are evidence selection, shifting the paradigm from retrieving what is most relev ant to retrieving what is most discriminativ e. 1 In tro duction The in tegration of Large Language Mo dels (LLMs) with external knowledge bases through Retriev al- Augmen ted Generation (RAG) has become the de facto standard for mitigating hallucinations and enabling kno wledge-intensiv e Question Answering (QA). Conv entional RAG systems operate on a rigid r etrieve-then-r e ad paradigm, predominan tly lev eraging maximum inner product searc h (MIPS) in dense contin uous spaces [2] to fetch the top- k most semantically relev ant text ch unks. While highly eﬀectiv e for simple, factoid-based QA where a single ground-truth answer exists, this relev ance-driven approac h exhibits severe degradation in real-w orld, kno wledge-intensiv e scenarios. Such scenarios are frequen tly characterized by inheren t query am biguity , conﬂicting evidence across multiple sources, and complex m ulti-hop dep endencies. In these c hallenging settings, standard dense retriev al suﬀers from what we term epistemic c ol lapse : the tendency to retrieve highly redundant information that is seman tically similar to the query , rather than fetching the discriminative evidence needed to resolve the underlying uncertaint y . Consequently , the LLM is forced to synthesize an answ er from a biased or incomplete evidence distribution, often leading to unhedged, o verconﬁden t, or factually inaccurate generation. Recen t architectural adv ancements attempt to transcend simple MIPS. Graph-based paradigms, suc h as Con text-Seeded Graph Retriev al (CSGR) and GraphRA G, expand retriev al scop e via struc- tured knowledge relation tra versal. Concurren tly , agentic and iterativ e veriﬁcation w orkﬂows (e.g., ReA ct [5], T ree-of-Though ts [6], Self-RAG [7]) allow LLMs to dynamically interact with searc h to ols, reﬂecting on retrieved context to guide subsequent actions. How ever, these state-of-the-art approac hes still critically lack a principled, decision-theoretic stopping criterion and evidence selection mec hanism. Graph tec hniques rely on static pip eline conﬁgurations (e.g., ﬁxed graph-hop depth), while agentic 1 systems dep end on heuristic thresholding or prompt-driven self-reﬂection, which frequently suﬀer from inﬁnite lo oping, premature termination, or unprincipled evidence w eighting. Critically , modern RAG systems lack a mathematically rigorous deﬁnition of what constitutes suﬃcien t evidence and an explicit ob jectiv e function for selecting which sp eciﬁc piece of evidence to retriev e next at inference time. T o bridge this fundamental gap, we prop ose Entropic Claim Resolu- tion (ECR), an inference-time algorithm that reframes the retriev al and synthesis process as entr opy minimization over a latent sp ac e of semantic answer hyp otheses . Drawing inspiration from Informa- tion Theory [8] and Bay esian Exp erimen tal Design [9], ECR models the QA task probabilistically . It initializes a probability distribution ov er a set of mutually exclusive p otential answer hypotheses and iterativ ely selects atomic factual claims from a retriev ed candidate p ool to ev aluate. Crucially , in ECR, evidence selection is decoupled from semantic relev ance to the query . Instead, claims are selected by maximizing Exp ected Entrop y Reduction (EER) ; that is, choosing the sp eciﬁc piece of evidence most likely to collapse the probability distribution tow ard a single, correct h yp othesis (or cleanly bifurcate it in the case of irreconcilable conﬂict). The algorithm adaptiv ely na vigates the evidence graph under a principled stopping rule, terminating only when the en tropy of the h yp othesis space falls b elow a predeﬁned threshold of epistemic suﬃciency . In summary , our main con tributions are: 1. W e introduce Entropic Claim Resolution (ECR), a decision-theoretic evidence selection algo- rithm for RAG, shifting the paradigm from retrieving what is most r elevant to what is most discriminative in resolving h yp othesis am biguity . 2. W e formally deﬁne a principled, mathematically rigorous stopping criterion for iterative RA G pip elines based on epistemic suﬃciency ( H ( A | X ) ≤ ϵ ). 3. W e identify a b eha vioral phase transition under structured contradiction: b y in tegrating a ligh tw eight coherence signal ( λ > 0 ), w e show that ECR transitions from forced epistemic collapse to principled ambiguit y exposure, prioritizing explicit con tradictions when presen t and safely refusing to reduce uncertain ty when evidence is inherently inconsistent. 4. W e demonstrate the practical scalability of ECR b y implementing it as a fast, inference-time algorithm integrated into a pro duction-grade multi-strategy retriev al arc hitecture (CSGR++), requiring no besp ok e ﬁne-tuning or sp ecialized model weigh ts. Signiﬁcance. A central implication of this work is that improving retriev al-augmented reasoning do es not necessarily require larger mo dels, longer context windows, or additional data, but rather principled control ov er how existing evidence is selected and ev aluated during inference. By explicitly mo deling epistemic uncertain ty and optimizing evidence selection for information gain, Entropic Claim Resolution pro vides a l igh tw eight, computationally eﬃcien t mechanism for impro ving robustness and in terpretability . This makes the framework particularly v aluable for high-stakes en terprise deplo y- men ts (e.g., medical, legal, or ﬁnancial QA) where mitigating unhedged hallucinations and controlling inference costs are critical. Ultimately , this persp ectiv e highlights an alternativ e path for scaling kno wledge-intensiv e systems: a path grounded in decision-theoretic inference rather than indisc rimi- nate context expansion, particularly in settings c haracterized by noisy , conﬂicting, or heterogeneous evidence. 2 Related W ork 2.1 Dense, Graph-Augmented, and Agentic Retriev al Standard dense retriev al selects a set of do cumen ts D by prioritizing their conditional probability given the query , P ( D | Q ) , commonly approximated via cosine similarit y embeddings [1]. T o ov ercome the short-sigh tedness of relev ance searc h, adv anced graph-based architectures, notably GraphRAG [3] 2 and Context-Seeded Graph Retriev al (CSGR), implicitly construct or trav erse knowledge graphs o ver c hunks or entities to expand the evidence space. In enterprise environmen ts, hybrid systems such as CSR-RA G [4] further integrate structural and relational signals to supp ort large-scale schemas. Concurren tly , the conv ergence of dynamic retriev al p olicies with autonomous planning has crys- tallized into the paradigm of A gentic RAG . F rameworks such as ReAct [5], T ree-of-Thoughts [6], and Self-RA G [7] allow language mo dels to interlea ve in termediate reasoning steps with retriev al actions in order to reﬁne subsequent queries. Recen t Systematization of Knowledge (SoK) studies emphasize this shift from static pip elines tow ard mo dular control strategies. How ever, despite their ﬂexibility , agentic retriev al systems in trinsically rely on heuristic prompt designs or static thresholds to determine when to halt retriev al or which information to prioritize. As a result, they lack a rigorous mathematical deﬁnition of epistemic suﬃciency and an explicit ob jective for selecting the next most informative piece of evidence as uncertain ty unfolds during inference. 2.2 Uncertain t y Quan tiﬁcation (UQ) in RAG A critical prerequisite for adaptive retriev al is accurately c haracterizing what a mo del do es not know. Recen t b enc hmarks such as URAG [11] demonstrate that while RA G can improv e factual grounding, it also introduces new sources of epistemic uncertain ty , including relev ance mismatch and selectiv e atten tion to partial evidence, which can paradoxically amplify o verconﬁden t hallucinations under noisy retriev al conditions. In resp onse, a gro wing bo dy of w ork on Retriev al-Augmented Reasoning (RAR) fo cuses on quan tifying uncertain ty across the retriev al and generation stages. F or example, metho ds suc h as Retriev al-Augmented Reasoning Consistency (R2C) [12] mo del multi-step reasoning as a Marko v Decision Pro cess and p erturb generation to measure output stability via ma jorit y voting, building up on foundational frameworks for semantic uncertaint y [10]. These uncertaint y-a ware approaches are eﬀectiv e for p ost-hoc answer ev aluation, abstention, or calibration. Ho wev er, they are not designed to guide the sele ction of evidenc e itself during the reasoning pro cess. In particular, they do not pro vide a mec hanism for c ho osing whic h atomic piece of evidence should be retriev ed or veriﬁed next in order to maximize information gain prior to answer syn thesis. 2.3 En trop y-A w are Context Managemen t The application of Shannon en tropy as a control signal for managing LLM context is an emerging researc h direction. Large context windo ws in standard RAG often lead to attention dilution and un- constrained entrop y growth, motiv ating recen t work on entrop y-aw are context control. F or instance, BEE-RA G (Balanced Entrop y-Engineered RAG) [13] mo diﬁes attention dynamics to maintain en tropy in v ariance ov er long contexts, while SF-RA G (Structure-Fidelity RAG) [14] lev erages do cumen t hier- arc hy as a low-en tropy prior to preven t evidence fragmentation. Similarly , L-RAG (Lazy RAG) [15] emplo ys predictive entrop y thresholds to gate exp ensive retriev al op erations, defaulting to parametric kno wledge when uncertaint y is estimated to be low. ECR shares this information-theoretic lineage but departs in a critical wa y: rather than using en tropy to compress, gate, or truncate con text, ECR applies entrop y directly as an ob jectiv e for se quential ly sele cting discriminative evidenc e variables . This shift reframes en tropy from a passive diagnostic in to an active decision criterion guiding inference-time reasoning. 2.4 Claim-Lev el V eriﬁcation and V alue of Information While standard RAG op erates on monolithic do cumen t c hunks, recent diagnostic and safet y-oriented framew orks decomp ose retrieved conten t into atomic claims. Systems such as MedRAGChec ker [16] ev aluate biomedical QA systems b y extracting ﬁne-grained claims and c hecking them against struc- tured kno wledge bases, while agentic fact-chec king pip elines (e.g., SAFE [17] and CIBER [18]) retriev e 3 supp orting and refuting evidence for individual statements. These approac hes demonstrate the im- p ortance of claim-level reasoning for reliability and interpretabilit y . ECR aligns this gran ular veriﬁcation paradigm with classical principles from Ba yesian experimental design and active learning. In activ e learning, the ob jective is to select the next unlab eled instance that maximizes exp ected information gain. By formulating inference-time evidence selection as Exp ected En tropy Reduction (EER) ov er discrete factual claims, ECR bridges symbolic uncertaint y mo deling and neural generation, optimizing retriev al for the value of information rather than semantic relev ance alone. 3 Metho dology: En tropic Claim Resolution (ECR) ECR formulates the evidence selection problem as a sequential decision pro cess targeting a reduction in epistemic uncertain ty across comp eting generativ e outcomes. ECR assumes high-recall candidate generation has already o ccurred (via upstream retriev al) and fo cuses exclusively on resolving uncertaint y within the resulting candidate claim set. 3.1 Problem F orm ulation Let C = { c 1 , c 2 , . . . , c n } b e a ﬁnite subset of atomic factual claims em b edded within a corpus. F or a giv en complex query Q , assume that assessing the veracit y of any given claim c i pro vides a signal regarding the query’s answ er. W e denote the latent truth v ariable asso ciated with claim c i as X i ∈ { 0 , 1 } , indicating whether the claim is empirically v alidated within the speciﬁc source do cumen t. Up on identifying high epistemic uncertain ty in the retriev al space (e.g., via conﬁdence v ariance or conﬂicting k eyword analysis), ECR initializes an Answer Hyp othesis Space A = { a 1 , a 2 , . . . , a k } . This space represents the set of mutually exclusive p oten tial macro-answ ers to the query . 1 In our implemen tation, A is robustly generated dynamically: either b y querying the LLM to prop ose dis- tinct v alid hypotheses derived from subsets of the initial k -b est claims, or via deterministic vector clustering when op erating purely oﬀ-line. Our ob jective is to sequen tially reﬁne a probability distri- bution P ( A | X ev al , Q ) ov er these hypotheses, conditioned on the dynamic subset of ev aluated claims X ev al ⊆ X , initialized at a uniform prior P ( a ) = 1 |A| . 3.2 Ob jectiv e F unction: Answer Entrop y The epistemic uncertaint y regarding the true outcome is robustly quantiﬁed using Shannon entrop y . Let the en tropy of the hypothesis space after ev aluating a subset of claims X ev al b e: H ( A | X ev al ) = − X a ∈A P ( a | X ev al ) log 2 P ( a | X ev al ) (1) 3.3 Exp ected En tropy Reduction (EER) and Selection Policy A t the t -th iteration, the system m ust choose the next claim c ∗ from the unev aluated candidate p ool C cand to formally v erify . Rather than relying on cosine relev ance sim ( c i , Q ) , we select the claim that maximizes Expected Entrop y Reduction (Information Gain). The selection p olicy is formally deﬁned as: c ∗ = arg max c ∈C cand EER ( c | X ev al ) (2) The EER is precisely the diﬀerence betw een curren t entrop y and the exp ected p osterior entrop y after observing the truth v alue of claim c : EER ( c | X ev al ) = H ( A | X ev al ) − E X c  H ( A | X ev al ∪ { X c } )  (3) 1 W e use mutual exclusivit y for analytical clarity; the framew ork naturally extends to partially ov erlapping hypotheses via soft assignment of claims to hypotheses. 4 This criterion ensures the algorithm intrinsically fav ors discriminative claims, i.e., evidence that cleanly segregates the hypothesis space. In practice, EER is approximated by measuring the proba- bilistic v ariance b et ween the sp eciﬁc subsets of comp eting macro-hypotheses actively supp orted versus unsupp orted by candidate c . A claim supp orting all hypotheses equally yields an EER of 0, reﬂecting its redundancy , regardless of its seman tic similarity to the query . Implemen tation-Level EER Proxy . Computing the true mathematical expectation o ver all p os- sible generativ e outcomes is typically intractable during low-latency inference. Therefore, w e deploy a computationally eﬃcie n t proxy that approximates Exp ected Entrop y Reduction without requiring full marginalization ov er laten t truth v ariables. In our concrete implementation, each candidate claim c partitions the hypothesis set in to those that cite c as supp orting evidence and those that do not. Let A + ( c ) = { a ∈ A : c ∈ supp ( a ) } and A − ( c ) = A \ A + ( c ) . Denote the probability mass in each subset as p + ( c ) = P a ∈A + ( c ) P ( a | X ev al ) and p − ( c ) = P a ∈A − ( c ) P ( a | X ev al ) . W e score discriminativity via the follo wing heuristic proxy: [ EER ( c ) = | p + ( c ) − p − ( c ) | p + ( c ) + p − ( c ) · H ( A | X ev al ) · conf ( c ) , (4) where conf ( c ) ∈ [0 , 1] denotes claim conﬁdence. This proxy is linear in the num b er of hypotheses and preserves the core ob jective of prioritizing claims that maximally split the p osterior mass, while remaining tractable for inference-time use. Design choice of the EER proxy . The heuristic pro xy in Eq. (10) is inten tionally not a symmetric appro ximation of classical exp ected information gain, which typically fav ors balanced posterior splits; rather, it is designed for b ounded-budget inference, where the ob jectiv e is rapid reduction of epistemic uncertain ty rather than exploratory experimentation. In retriev al-augmented reasoning, once p oste- rior mass concen trates on a subset of hypotheses, prioritizing high-conﬁdence, high-im balance claims accelerates conv ergence and reduces redundant evidence retriev al. This exploitative bias is therefore a delib erate design choice aligned with low-latency inference and downstream synthesis constrain ts. Coherence-a ware selection. In addition to entrop y reduction, ECR incorp orates a ligh tw eight co- herence signal that prioritizes ev aluating claims lik ely to complete an explicit contradiction when such evidence exists. Concretely , we add a small regularization term λ · ConﬂictPoten tial( c ) to the selection ob jective, yielding score ( c ) = [ EER( c ) + λ · ConﬂictPoten tial( c ) , where ConﬂictPoten tial( c ) ∈ { 0 , 1 } is non-zero if c is an explicit negation of, or completes a con tradiction pair with, a previously ev aluated claim. This term do es not o verride en tropy reduction but ensures that unresolved con tradictions are surfaced early rather than a veraged aw ay . Empirically , w e observe that any non-zero λ induces stable coherence-a ware b ehavior without requiring ﬁne-grained tuning (App endix, Figure A.1). Con tradiction-aw are coherence term. Let C ev al denote the set of claims that hav e already b een ev aluated. W e deﬁne a binary contradiction indicator ConﬂictP otential( c ) = ( 1 if ∃ c ′ ∈ C ev al suc h that c ≡ ¬ c ′ , 0 otherwise. (5) That is, ConﬂictPoten tial( c ) activ ates only when ev aluating c would complete an explicit con tradiction pair in the evidence. This coherence signal is structural rather than probabilistic: it do es not p enalize h yp otheses or p osteriors directly , and it do es not measure global consistency . Instead, it biases claim selection tow ard surfacing epistemic inconsistency when it exists, prev enting en tropy-only selection from a veraging aw ay contradictory evidence. The resulting claim-selection ob jective is c ∗ = arg max c ∈C cand  [ EER( c ) + λ · ConﬂictPoten tial( c )  , (6) 5 Bo x 1: Entropic Claim Resolution (ECR) Input: query Q , candidate claims C cand , en tropy threshold ϵ , max iterations T 1. Hypotheses. Initialize A ← Genera teHypotheses ( Q, C cand ) (LLM or clustering), set uniform prior P ( a ) = 1 / |A| . 2. Lo op. F or t = 1 ..T : compute H ( A | X ev al ) (Eq. 1). If epistemic suﬃciency holds (Eq. 11 ′ ) stop. 2a. Select. Cho ose c ∗ ∈ C cand maximizing [ EER ( c ) + λ · ConﬂictP otential( c ) (Eq. 4). 2b. V erify . Estimate P ( X c ∗ = 1) using pro venance and support/contradiction statistics (Eq. 8). 2c. Up date. Up date P ( A | X ev al ∪ { X c ∗ } ) (Eq. 7), add X c ∗ to X ev al , remo ve c ∗ from C cand . 3. Output. Return arg max a P ( a | X ev al ) if epistemic suﬃciency holds (Eq. 11 ′ ), else return the rank ed distribution ov er A . Figure 1: Pseudo-co de for ECR without external algorithm pack ages. where λ ≥ 0 controls the strength of con tradiction-aw are selection. Setting λ = 0 reco vers entrop y- only ECR. While en tropy reduction remains the primary ob jective, any λ > 0 ensures that explicit con tradiction-completing claims are prioritized when present, under the b ounded EER scale induced b y the hypothesis en tropy . This prioritization is observed empirically as a sharp phase transition in the λ -sw eep ablation, where b eha vior saturates for all tested λ > 0 . 3.4 Ba y esian P osteriors and Epistemic Suﬃciency Up on selecting c ∗ , the system ev aluates its intrinsic truth X c ∗ against the source context and prov e- nance metadata (see Section 3.5). The hypothesis probabilities are concurrently up dated utilizing lo calized Bay es’ rules. Concretely , hypotheses intersecting functionally with v alidated claims observe signiﬁcan t targeted probability mass b oosts, severely suppressing con tradicting disjoint branches. P ( A | X ev al ∪ { X c ∗ } ) = P ( X c ∗ | A ) P ( A | X ev al ) P ˜ a P ( X c ∗ | ˜ a ) P (˜ a | X ev al ) (7) where P ( X c | A ) represents the conditional likelihoo d of observing the claim c assuming hypothesis A is true. The iterative veriﬁcation procedure gracefully terminates when the system reac hes a mathematical state of epistemic suﬃciency , parameterized by threshold ϵ (e.g., ϵ = 0 . 3 bits):  H ( A | X ev al ) ≤ ϵ  ∧ ¬ Conﬂict( X ev al ) (11 ′ ) where Conﬂict( X ev al ) indicates the presence of m utually incompatible claims (e.g., an explicit claim and its negation) within the ev aluated evidence. Alternativ ely , if all candidates are e xhausted or maxim um iterations are met with H > ϵ , ECR halts and explicitly exp oses the comp eting hypotheses and their ﬁnal mass distributions, structurally mapping the unresolv able ambiguit y of the corpus. The complete iterativ e pro cedure is summarized in Box 1. 3.5 V eriﬁcation via T op ological Prov enance In practical con tinuous-learning implementations, the inferential verit y link P ( X c = 1 | A ) can be computed dynamically rather than natively assuming p erfect mo del alignment. Instead of relying solely on parametric LLM-driv en prompt veriﬁcation, ECR explicitly incorporates the topological pro venance of the multi-modal knowledge graph natively . Let S ( c ) and C ( c ) represen t the structural supp ort graph-edge coun ts and contradictory graph-edge counts of claim c track ed intricately within 6 the backing EA V (En tity-A ttribute-V alue) datastore, applying implicit Laplace smo othing. The ﬁnal top ological veriﬁcation probability is th us seamlessly and robustly blended: P ( X c = 1) =    S ( c ) + 1 S ( c ) + C ( c ) + 2 if S ( c ) + C ( c ) > 0 , P prior _ conf ( X c = 1) otherwise. (8) This matches the deploy ed b eha vior in our implementation: whenev er historical supp ort/con tradiction signals exist, the system uses a Laplace-smo othed empirical truth estimate; otherwise, it falls bac k to the extraction-time prior conﬁdence. 3.6 Theoretical Prop erties T o solidify the inferential v alidit y of the sequential system, we deduce its op erational p erformance b ound mapping. Theorem 1 (T ermination and Budget Bound) . F or any ﬁnite c andidate set C cand , ECR terminates after at most min( T , |C cand | ) claim evaluations. Mor e over, if ther e exists a c onstant δ > 0 such that at e ach iter ation the sele cte d claim satisﬁes E [ H t − 1 − H t ] ≥ δ whenever H t − 1 > ϵ , then ECR r e aches epistemic suﬃciency in at most ⌈ ( H 0 − ϵ ) /δ ⌉ iter ations. W e emphasize that this result characterizes suﬃcient conditions for conv ergence under informativ e evidence selection, rather than a minimax or adv ersarial worst-case guarantee. When explicit con tra- dictions exist in the evidence, the suﬃcien t conditions for conv ergence are inten tionally violated, and ECR terminates b y exp osing ambiguit y rather than collapsing the p osterior. Pr o of. The ﬁrst statement holds b ecause each iteration ev aluates and remo ves at most one claim, and the lo op is explicitly capp ed b y T . F or the second statemen t, telescoping the assumed exp ected en tropy decrease yields E [ H t ] ≤ H 0 − tδ until reac hing ϵ , hence t ≥ ( H 0 − ϵ ) /δ suﬃces. 4 System In tegration: ECR within CSGR++ T o ev aluate ECR b ey ond isolated theoretical constrain ts, we in tegrated it in to a pro duction-grade, m ulti-strategy retriev al pip eline. While ECR is algorithmically orthogonal to any sp eciﬁc retriever, we utilize the CSGR++ architecture as our primary testb ed. In this section, we describ e the surrounding system comp onen ts that generate, structure, and verify the atomic candidate claims consumed by the ECR inference loop. Figure 2 illustrates the resulting end-to-end arc hitecture and the p osition of ECR within it. 4.1 HyRA G v3 Ingestion and Index Construction Structured and T abular Data as First-Class Evidence. HyRA G v3 natively supp orts struc- tured and semi-structured tabular data, rather than treating tables as ﬂattened text. During inges- tion, the system p erforms automatic schema inference, including column typing (numeric, categorical, temp oral), identiﬁer detection, and time-series normalization. Individual table cells and deriv ed ag- gregates are materialized as atomic claims with explicit prov enance, row identiﬁers, column metadata, and canonical time keys. Structured aggregation queries are grounded through a text-to-SQL exe- cution path with guarded, read-only execution and v alidation against real table v alues. All tabular claims en ter the same inference-time evidence p o ol as textual and graph-deriv ed claims, allowing En- tropic Claim Resolution to reason uniformly ov er mixed structured and unstructured evidence. This design enables precise numeric grounding, temporal ﬁltering, and auditable reasoning not nativ ely supp orted by graph-enhanced RAG systems that op erate o ver synthesized document summaries. 7 Query Ensem ble Retriev al (V ector | Graph | Claim) En tropic Claim Resolution En tropy-Guided Selection Resp onse Synthesis epistemic suﬃciency Inside ECR: Hyp othesis space A P osterior P ( A | X ) EER-based claim selection Figure 2: System ov erview: En tropic Claim Resolution (ECR) op erates as an inference-time controller b et ween competitive retriev al and answer syn thesis. Given a retriev ed claim set, ECR sequentially selects evidence to minimize hypothesis en tropy and terminates when epistemic suﬃciency is reached. V ector-Based Retriev al as a Core Substrate. HyRAG v3 fully incorporates dense v ector re- triev al as a primary evidence acquisition mechanism. Ra w document ch unks, atomic claims, and syn thesized summaries are em b edded into dedicated vector indices and queried using cosine similarity with optional metadata and identiﬁer ﬁltering. V ector retriev al is used to seed claim p ools, initialize h yp othesis construction, and ground subsequent structured and graph-based reasoning. Rather than assuming vector similarity implies evidentiary suﬃciency , HyRAG v3 sub jects all vector-retriev ed candidates to inference-time ev aluation under Entropic Claim Resolution, allowing relev ance-based signals to b e retained while preven ting o verconﬁdence in semantically similar but non-discriminative evidence. ECR op erates at inference time, but its eﬀectiv eness dep ends on upstream ingestion and indexing that preserve atomicity , pro venance, and temp oral structure. The implemented HyRAG v3 pipeline (in our reference implemen tation) p erforms the following steps. Auto-adaptiv e sc hema inference with feedback calibration. An AutoAdaptAgent infers a sc hema from CSV/Excel/PDF/DataF rame inputs, iden tifying an ID column, categorical columns, n umeric columns, and time-series columns. A subsequen t schema fe e db ack lo op p erforms a dry-run parse of the ﬁrst N rows (conﬁgurable) and adjusts misclassiﬁed columns (e.g., “numeric” columns with excessiv e null-rates), pro ducing a corrected sc hema used for full ingestion. Robust parsing with rep eated-header detection and temp oral normalization. The inges- tion parser supports multiple formats and implements spreadsheet-sp eciﬁc heuristics, including merg- ing complementary m ulti-row headers and skipping rep eated header ro ws using an ov erlap threshold ( ≥ 0 . 70 tok en ov erlap). Time-series columns are normalized via a data-driven time-key parser that rec- ognizes patterns such as years (e.g., 2024), quarters (e.g., 2024Q1), halv es (e.g., 2024H2), and trailing windo ws (e.g., L TM/TTM), and maps them to a canonical order key used for temp oral slicing. 8 EA V SQLite store with safe query execution. All ingested records are p ersisted in an En tity– A ttribute–V alue SQLite back end ( GenericStore ). F or do wnstream aggregation queries, the system exp oses a text-to-SQL route but enforces a strict SELECT -only guardrail: the SQL executor blo c ks write op erations and limits result sizes. Em b eddings and vector indices with deterministic fallbac ks. The embedding subsystem is three-tiered: an online embedding API (if av ailable), a lo cal sentence-transformer fallback, and a deterministic hashed-vector fallback for fully oﬄine op eration. V ector indices supp ort an optional database bac kend (LanceDB when installed) and a pure NumPy cosine-similarit y back end otherwise; b oth supp ort ID ﬁltering for category/time constraints. A tomic claim extraction and claim index. During ingestion, the system extracts atomic claims, en tities, and ligh tw eight semantic relations ( h, r, t ) in to a dedicated ClaimStore . Claim v ectors are em b edded and stored in a separate claim vector index to enable claim-ﬁrst retriev al. Hierarc hical summarization as retriev able no des. T o improv e global recall, the system clusters em b edded ro w representations using a pure-NumPy k -means routine (no external ML dep endencies), summarizes each cluster (LLM when av ailable), re-embeds the summaries, and inserts them into the same row-lev el vector index under a reserved ID preﬁx. As a result, standard vector retriev al can surface b oth raw rows and higher-lev el cluster summaries. Cluster summaries are stored as ﬁrst-class retriev able no des and compete directly with raw rows during vector retriev al. ECR is exclusively activ ated on the analytical CSGR_PLUS route selected by the upstream query router, and is b ypassed for LOOKUP , RELATIONAL , SEMANTIC , TOOL , and SQL routes. External to ols are treated as deterministic op erators outside the entrop y-driven reasoning lo op (the LLM only formats a JSON to ol call when av ailable, with an oﬄine numeric-statistics fast-path), i.e., excluded from ECR’s epistemic mo deling rather than treated as comp eting uncertaint y-reduction actions. T o b ound computational ov erhead, ECR is inv ok ed dynamically strictly when the retrieved conﬁg- uration exhibits high epistemic uncertaint y . The trigger conditions are natively in tegrated via three heuristics: 1. High Claim V olume: The retriever fetches hea vily saturated candidate spaces ( > 15 claims). 2. Syn tactic Ambiguit y: Detection of uncertain ty k eywords within the activ e query (e.g., “un- certain”, “conﬂicting”, “disagree”, “m ultiple”, “v arious”). 3. Conﬁdence V ariance Constrain t: The v ariance in micro-lev el claim conﬁdence σ 2 across the k retrieved c laims e xceeds an empirical threshold of 0 . 15 (with conﬁdence actively tied to trac king top ological support-contradiction metrics within the underlying datastore). T o ev aluate ECR in a high-p erformance setting, we implement it as a standalone and mo dular res- olution engine within a pro duction-grade Context-Seeded Graph Retriev al (CSGR++) arc hitecture. While ECR is algorithmically orthogonal to an y sp eciﬁc retriever, CSGR++ serves as a rigorous exp er- imen tal testb ed that preserves atomicity , prov enance, and multi-strategy retriev al signals. Knowledge is extracted and stored as atomic semantic claims in an En tity-A ttribute-V alue (EA V) back end, ac- companied b y separate vector indices for raw rows and claims. Within this testb ed, the baseline multi-strategy EnsembleR etriever combines dense similarity searc h, structural graph expansion, and semantic claim matc hing using Recipro cal Rank F usion. ECR cleanly in tercepts the pip eline immediately after candidate generation, acting as an isolated inference- time uncertaint y resolution stage that outputs either a dominant hypothesis or a calibrated set of alternativ es for downstream synthesis. 9 4.2 CSGR++ Backbone Arc hitecture While ECR is algorithmically orthogonal to a particular retriev al stack, we implement and ev aluate it inside a production-grade pip eline (CSGR++) that is explicitly claim-centric. A tomic claim store with semantic relations. CSGR++ stores extracted claims in a SQLite- bac ked ClaimStor e with ﬁelds for (i) claim text, (ii) entit y m en tions, (iii) time keys / order k eys for temp oral slicing, and (iv) dynamically up dated conﬁdence signals. In addition, a ligh tw eight seman- tic relation table stores tuples ( h, r, t ) extracted during claim extraction (e.g., Acquires , Imp acts , Ca usedBy ), enabling entit y-based expansion during retriev al. T emp oral intelligence. Queries are parsed for explicit time constrain ts (e.g., “in 2024”, “2024Q1”, “last 3 quarters”, “since 2022”) and conv erted into an order-key in terv al ( τ min , τ max ) . Claim retriev al can then apply a hard ﬁlter ov er the claim IDs inside the selected time windo w. Comp etitiv e ensem ble retriev al and Recipro cal Rank F usion (RRF). The retriever runs m ultiple strategies (vector retriev al ov er ro ws, vector retriev al o ver claims, and graph/category trav er- sal) and fuses the per-strategy rankings via Recipro cal Rank F usion (RRF). F or an item d and ranking lists { L j } m j =1 with ranks r j ( d ) ∈ { 1 , 2 , . . . } , the fused score is RRF ( d ) = m X j =1 1 k + r j ( d ) , (9) where k is a dampening constant (w e use k = 60 in co de). Comp etitiv e strategy scoring (selection, not only fusion). In addition to fusing rankings, the retriev er scores eac h strategy to identify a “best” strategy for the query . The implemented scoring com bines (i) av erage similarity score, (ii) a diversit y proxy based on unique source items, and (iii) a verage claim conﬁdence (when applicable) via a weigh ted sum. Bey ond rank fusion, this strategy scoring iden tiﬁes the dominan t evidence view for a query , enabling adaptiv e retriev al-path selection rather than blindly trusting an ensemble. Relation-based expansion for m ulti-hop analytical queries. F or analytical (CSGR++) queries, the system extracts frequent entities from initially retrieved claims, then expands the evidence set by retrieving related claims via the relation table (one-hop expansion), discounting conﬁdence slightly for expanded claims. Dynamic conﬁdence micro-learning. Claims main tain supp ort and contradiction coun ters. When v eriﬁcation indicates a claim was supp orted or contradicted, the system up dates its conﬁdence with a b ounded, asymmetric rule: conf new ( c ) = clip [0 , 1]  conf base ( c ) + 0 . 15 log(1 + S ( c )) − 0 . 25 C ( c )  . (10) This pro duces an online “micro-learning” eﬀect: frequen tly supported claims b ecome easier to trust, while con tradicted claims are rapidly down-w eighted. Because claim conﬁdence is up dated online and directly aﬀects future [ EER( c ) scores (Eq. 4), ECR exhibits ligh tw eight inference-time learning b eha vior across queries. T rust modes (graded v eriﬁcation). The query router classiﬁes user in tent into trust modes ( strict for regulatory or numerical precision, b alanc e d , and explor atory ), whic h mo dulate veriﬁcation aggressiv eness and synthesis style. 10 T able 1: Key subsystems implemen ted in our system that supp ort ECR and the full end-to-end pip eline. Subsystem Role in the pipeline AutoA daptAgent + SchemaF eedbackLoop Sc hema inference with dry-run calibration GenericStore (EA V SQLite) Item/attribute p ersistence; safe SELECT -only SQL ex- ecution Em b eddingPro vider + V ectorIndex 3-tier embeddings; LanceDB/NumPy back ends; ID- ﬁltered cosine searc h ClaimExtractor + ClaimStore A tomic claims + relations + temp oral keys + dynamic conﬁdence Ensem bleRetriever Comp etitiv e retriev al + RRF fusion (Eq. 9) En tropicClaimResolver ECR lo op: entrop y , EER selection (Eq. 4) StructuredSyn thesizer Structured analytical brief with evidence bullets Rev erse V eriﬁer Numeric grounding + claim-a ware v eriﬁcation and score capping RA GAnswerer Multi-hop, HyDE, text-to-SQL grounding, CRA G self- correction, citations Rev erse V eriﬁer: deterministic n umeric grounding + claim-aw are chec king. Beyond prob- abilistic resolution, CSGR++ applies a three-lay er Rev erse V eriﬁer: (i) a deterministic numeric ground- ing pass that extracts all n umeric tok ens in a draft answer and chec ks v erbatim presence in retrieved evidence, (ii) LLM-based claim-by-claim judgement with b oth supp orting and coun ter-evidence re- triev al, and (iii) a com bined score where numeric failures cap the maximum achiev able veriﬁcation score. Numeric grounding is enforced as a hard constraint: a single unsupp orted numeric token caps do wnstream veriﬁcation scores. T able 1 summarizes the ma jor subsystems of the full HyRA G v3 and CSGR++ pipeline and their resp ectiv e roles, providing a compact o verview of ho w ECR in tegrates into the surrounding retriev al, v eriﬁcation, and synthesis infrastructure. 4.3 Supp orting RA G Comp onen ts Outside the CSGR++ analytical route, the implemen tation includes a general-purp ose RAG engine that pac k ages standard, widely used RAG mechanisms b ehind a single answer() in terface. These comp onen ts are supp orting infrastructure and are orthogonal to ECR. The system also supp orts generator-based streaming resp onses (via answer_stream en try-p oin ts), whic h is orthogonal to ECR and not ev aluated in this work. 2 Multi-hop retrieve–reason–retriev e. The engine iteratively retrieves candidates and, when on- line, generates a follow-up query conditioned on curren t evidence, stopping early when additional hops yield no new items. HyDE query embedding. T o impro ve recall under distribution shift b et w een user queries and row- shap ed embeddings, the engine optionally generates a short hypothetical “answer row” and embeds that text (HyDE) to driv e vector search. Cross-enco der reranking and calibrated abstention. Candidates are reranked either by cosine similarit y (oﬄine) or b y an LLM “cross-enco der” that outputs a ranking and conﬁdence. A calibrated 2 All ma jor components admit deterministic fallbac ks when LLMs are unav ailable (e.g., hashed em b eddings and heuristic claim extraction), though answ er quality ma y degrade. 11 conﬁdence score com bines the num b er of retriev ed results, the top similarity score, the reranker conﬁdence, and a query-complexit y p enalt y; the system abstains when the calibrated score is low. T ext-to-SQL with v alue grounding. F or aggregation queries, the engine routes to text-to-SQL and applies a second grounding pass that v alidates every generated string literal against real categorical v alues in the database; when an unknown literal is detected, it is rewritten to the closest fuzzy match when p ossible. CRA G self-correction with sc hema evolution signals. When reverse veriﬁcation returns fail or weak , the engine p erforms up to t wo correction attempts b y rewriting the query to target the v eriﬁcation gap. Eac h failure can b e recorded by a schema evolution track er that increments p er- column failure counts and can request LLM-based reclassiﬁcation suggestions once a threshold is exceeded. Schema evolution signals p ersist across queries, enabling long-term self-correction. 5 Exp erimen tal Design & Ev aluation While Section 4 outlines the deploymen t of ECR within a full-scale pro duction architecture, ev aluating the algorithm end-to-end immediately introduces confounding v ariables from upstream retriev al recall and downstream LLM generation quality . T o rigorously v alidate the decision-theoretic properties established in Section 3, our ev aluation strategy pro ceeds in tw o phases. First, w e strictly isolate the mathematical b eha vior of the en tropy-driv en claim selection p olicy using a controlled, claims-only harness (Sections 5.1–5.3). Second, we reintegrate ECR into an end-to-end reasoning pipeline to ev aluate its impact under realistic multi-hop and contradiction-hea vy settings (Section 5.4). 5.1 Con trolled claims-only harness Our “claims-only” harness ﬁxes the dataset, query set, retriev al conﬁguration, candidate claim p ool, and Bay esian entrop y mo del; only the claim-selection p olicy diﬀers. This allo ws a clean measurement of whether a policy is actually minimizing epistemic uncertaint y as deﬁned b y Eq. 1. Dataset and cases. W e use a small, m ulti-table business dataset of six CSV tables ( sales , customers , expenses , inventory , hr , marketing ) and 80 templated ev aluation queries spanning single-table lo okups and cross-table comparisons. Hyp otheses and initial entrop y . F or eac h query , the harness constructs |A| = 3 mutually exclu- siv e answer hypotheses, yielding an initial entrop y of H 0 = log 2 3 ≈ 1 . 585 bits. Candidate claims and p olicies. F or each case, we retrieve the same top-20 candidate claims (high-recall candidate generation). W e then compare three p olicies: (i) Retriev al-only , which takes the top-15 claims b y retriev al score under a ﬁxed budget; (ii) ECR , whic h sequentially selects the next claim b y exp ected en tropy reduction and stops when H ≤ ϵ with ϵ = 0 . 3 bits (capp ed at 10 iterations); and (iii) Random control , which samples claims uniformly without replacemen t from the same candidate po ol, matc hing ECR’s realized claim budget of 5 claims. En tropy-aligned metrics. W e rep ort (i) ﬁnal en tropy , (ii) en tropy drop p er ev aluated claim, (iii) claims-to-collapse (ﬁrst step reaching H ≤ ϵ , else budget +1 ), (iv) eﬀective hypotheses ( 2 H ), and (v) en tropy trace v ariance. W e additionally rep ort tw o diversit y-oriented diagnostics (claim redundancy and source entrop y) to illustrate that diversit y alone is not equiv alent to epistemic resolution. Finally , w e report hyp othesis-c onditione d r e dundancy (HypCondRed.), whic h computes redundancy within claim groups attributed to the same answer h yp othesis (rather than across the full mixed set). 12 Policy Claims H f inal ∆ H /claim Collapse 2 H f inal Redund. HypCondRed. SrcEnt Retriev al-only 15 . 0 ± 0 . 0 1 . 585 ± 0 . 000 0 . 0000 ± 0 . 0000 16 . 0 ± 0 . 0 3 . 000 ± 0 . 000 0 . 684 ± 0 . 119 0 . 662 ± 0 . 110 0 . 342 ± 0 . 510 ECR 5 . 0 ± 0 . 0 0 . 2129 ± 0 . 0000 0 . 2744 ± 0 . 0000 5 . 0 ± 0 . 0 1 . 159 ± 0 . 000 0 . 672 ± 0 . 125 0 . 672 ± 0 . 125 0 . 276 ± 0 . 443 Random 5 . 0 ± 0 . 0 1 . 243 ± 0 . 289 0 . 0684 ± 0 . 0577 6 . 0 ± 0 . 0 2 . 411 ± 0 . 437 0 . 658 ± 0 . 118 0 . 653 ± 0 . 123 0 . 354 ± 0 . 527 T able 2: Claims-only ev aluation (80 cases, seed=7). “Claims” is the n umber of ev aluated claims. H f inal is ﬁnal answer-h yp othesis entrop y in bits. “Collapse” is claims-to-collapse (ﬁrst step where H ≤ ϵ = 0 . 3 ; else budget +1 ). “Redund.” is claim redundancy , “HypCondRed.” is hypothesis- conditioned claim redundancy , and “SrcEnt” is source entrop y (diagnostic diversit y metrics). P olicy H f inal ∆ H /claim Collapse 2 H f inal T race V ar HypCondRed. Retriev al-only 1 . 585 ± 0 . 000 0 . 0000 ± 0 . 0000 16 . 00 ± 0 . 00 3 . 000 ± 0 . 000 0 . 002987 ± 0 . 000000 0 . 6619 ± 0 . 0000 ECR 0 . 2129 ± 0 . 0000 0 . 2744 ± 0 . 0000 5 . 00 ± 0 . 00 1 . 159 ± 0 . 000 0 . 262859 ± 0 . 000000 0 . 6719 ± 0 . 0000 Random 1 . 2628 ± 0 . 0265 0 . 0644 ± 0 . 0053 5 . 995 ± 0 . 006 2 . 436 ± 0 . 044 0 . 03210 ± 0 . 00374 0 . 6401 ± 0 . 0075 T able 3: Multi-seed robustness (seeds 0–4): mean ± std ov er seeds of the seed-level mean metrics. Only the random baseline changes across seeds in this frozen setup. “HypCondRed.” is h yp othesis- conditioned claim redundancy . 5.2 Main results (seed=7, 80 cases) T able 2 summarizes mean ± std across cases. ECR reliably reaches epistemic suﬃciency using 5 claims, driving H b elo w ϵ ; retriev al-only do es not reduce entrop y under the same p osterior mo del; random impro ves mo destly but t ypically do es not collapse. A cross these runs, claim-cov erage is identical across p olicies (0.6375 on av erage), reﬂecting that this harness is designed to stress epistemic resolution rather than maximize ov erlap with a small set of exp ected claim snipp ets. 5.3 Robustness across random seeds (seeds 0–4) T o ensure the random-control comparison is not a single-seed artifact, we rerun the claims-only harness for ﬁve random seeds (0–4), reusing the same frozen dataset, query s et, candidate claims, and p osterior mo del. Retriev al-only and ECR are deterministic under this setup, while the random baseline v aries b y construction. T able 3 conﬁrms the stabilit y of ECR across multiple seeds. F urthermore, Figure 3 illustrates the sc hematic entrop y tra jectories of these comp eting p olicies, highligh ting how rapidly ECR drives the h yp othesis space below the ϵ threshold compared to relev ance-only baselines. 5.4 End-to-End Ev aluation on a Standard Multi-Hop QA Benchmark In contrast to the preceding controlled, claims-only exp erimen ts, this ev aluation reintegrates a liv e large language mo del in to the inference lo op, exercising ECR as an online evidence-selection controller during end-to-end RA G generation. As an additional exp erimen t, to ev aluate whether entrop y-guided evidence selection improv es do wnstream answ er qualit y , we conduct an end-to-end ev aluation on a Hotp otQA-st yle multi-hop QA benchmark. All metho ds share the same retriever, language mo del, candidate evidence po ol, and decoding parameters; the only v ariable is the inference-time claim se- lection p olicy . W e ev aluate three p olicies: (i) a relev ance-based baseline RAG p olicy , (ii) a random selection con trol matched to the same a verage claim budget, and (iii) ECR, which applies entrop y- guided selection with stopping. W e rep ort exact match (EM), tok en F1, and an evidence faithfulness pro xy based on answ er-token cov erage, alongside the av erage n umber of claims used. Because Hot- p otQA exhibits substantially higher linguistic v ariance and more complex multi-hop dep endencies 13 Evidence Claims Activ ely Ev aluated ( t ) Hypothesis Entrop y H ( A | X eval ) (bits) 0 1 2 3 4 5 6 7 8 9 10 0.0 0.3 0.6 0.9 1.2 1.5 1.8 Epistemic suﬃciency ( ϵ = 0 . 3 ) ECR (schematic) Random/retriev al (schematic) Figure 3: Schematic entrop y tra jectories consistent with the measured endp oin ts: ECR reac hes H ≤ ϵ quic kly , whereas relev ance-only and random baselines t ypically remain ab o ve ϵ at matc hed claim budgets. than highly structured tabular datasets, the ECR algorithm naturally ev aluates a larger n umber of claims b efore the hypothesis entrop y collapses b elo w ϵ . Metho d A vg. Claims Used Exact Match (EM) ↑ T ok en F1 ↑ Evidence F aithfulness ↑ Baseline RAG 19.87 0.313 0.459 0.639 Random Control 19.87 0.207 0.307 0.427 ECR (ours) 19.68 0.297 0.450 0.626 T able 4: End-to-End Ev aluation on Hotp otQA-St yle Multi-Hop QA (300 Questions). All metho ds use the same retriev er and language mo del; only the inference-time evidence selection p olicy diﬀers. ECR substan tially outp erforms random selection while maintaining p erformance comparable to a strong relev ance-based baseline. T able 4 shows that ECR substantially outp erforms random selection across all reported metrics, conﬁrming that entrop y-guided evidence selection is consisten tly more eﬀectiv e than unguided or div ersity-only strategies. Relativ e to a strong relev ance-based baseline, ECR remains within a small margin on EM and F1, indicating that enforcing epistemic control do es not signiﬁcantly degrade answ er accuracy on standard b enc hmarks. It is imp ortan t to note that HotpotQA is a largely factual and relev ance-orien ted b enc hmark with predominantly singular ground truths. As such, it do es not natively stress-test contradictory evidence or fundamentally ambiguous queries, whic h are precisely the regimes ECR is designed to address. Ac hieving near parity on suc h a saturated b enc hmark while enforcing strict inference-time epistemic constraints demonstrates that ECR integrates robust uncertain ty control without reliance on b enc hmark-sp eciﬁc tuning. F uture ev aluations will fo cus on conﬂict-heavy or am biguity-orien ted b enc hmarks where relev ance-driven retriev al is kno wn to exhibit epistemic collapse. Robustness to Noisy Evidence. T o isolate a regime that is closer to real deploymen ts, where retriev ed evidence ma y include irrelev ant or even contradictory con tent, we perform a con trolled ablation on the same Hotp otQA ev aluation set and pip eline as ab o ve, injecting noise after r etrieval 14 Metho d EM (No Noise) F aith (No Noise) EM (40% Noise) F aith (40% Noise) Baseline RAG 0.323 0.660 0.167 0.345 ECR (ours) 0.307 0.657 0.163 0.331 T able 5: Robustness ablation on HotpotQA-style ev aluation (300 questions) with noise injected after r etrieval and b efor e evidenc e sele ction . “40% Noise” replaces 40% of retrieved candidate claims with unrelated (p oten tially contradictory) claims sampled from a noise p ool. Only baseline relev ance-based RA G and ECR are ev aluated; the retriever, LLM, prompts, and deco ding are unc hanged. (Perfor- mance is bounded ab o ve when ground-truth evidence is remov ed.) and b efor e evidenc e sele ction 3 . F or each query , we take the retriev ed candidate claim set and replace 40% of candidates with claims sampled from a noise p ool constructed from unrelated do cumen ts (k eeping the retriev er, LLM, prompts, deco ding, and ECR selection logic unchanged). T able 5 rep orts Exact Match (EM) and Evidence F aithfulness for baseline relev ance-based RAG and ECR under no noise v ersus 40% noise. Under this corruption regime, Exact Matc h necessarily degrades for b oth systems, as replacing a fraction of candidate claims can remo ve ground-truth evidence from the p ool. Notably , ECR exhibits predictable degradation to the relev ance-based baseline without amplifying noise-induced errors, de- spite enforcing strict inference-time stopping and ev aluating few er claims. This result indicates that en tropy-guided evidence selection remains w ell-b eha ved under partial evidence loss, a voiding ov erconﬁ- den t hallucination or unstable collapse when the a v ailable evidence b ecomes incomplete or unreliable. W e emphasize that this ablation ev aluates robustness to evidence corruption (i.e., partial remov al of v alid claims), rather than distractor accumulation, which isolates a complementary but distinct failure mo de. Oﬄine Robustness Under Structured Con tradiction. Standard QA b enc hmarks predomi- nan tly ev aluate answer accuracy under relatively clean evidence conditions. T o stress-test the epistemic- con trol mec hanism itself—indep endently of LLM semantics—w e run a fully oﬄine, deterministic con tradiction-injection ablation on the same 300-question Hotp otQA-st yle set and retriev al pip eline. F or eac h query , we take the retriev ed candidate claim p ool and inject paired, explicit contradiction t wins in to the candidate set at rate α ∈ { 0 . 0 , 0 . 3 , 0 . 5 } after retriev al and b efore evidence selection. In oﬄine mode, hypothesis initialization uses deterministic hashed embeddings and claim v eriﬁcation uses a deterministic pro venance proxy; this isolates controller behavior from veriﬁer qualit y . W e rep ort (i) Ambiguit y Exp osure —whether the run ends with H > ϵ or an unresolved explicit con tradiction pair—and (ii) Ov erconﬁden t Error —cases where the system outputs a dominan t h yp othesis despite b eing wrong (a pro xy for epistemic collapse). T able 6 shows a sharp regime shift: baseline relev ance-based RA G remains pathologically o verconﬁden t and ﬂat across α , while ECR transitions from fast epistemic suﬃciency in the clean regime ( α = 0 . 0 ) to principled non-conv ergence under contradiction ( α ≥ 0 . 3 ). At α ≥ 0 . 3 , am biguity emerges deterministically for every query and termination is entirely explained b y unresolved conﬂict rather than heuristic budget limits. This extreme ambiguit y rate is exp ected: once an explicit contradiction pair is present in the ev aluated evidence, epistemic coherence is unattainable b y deﬁnition. Likewise, entrop y remains high b ecause ECR is not an en tropy minimizer “at all costs”; it is a coherence-constrained en tropy controller. Exploring complemen tary am biguity-focused b enc hmarks and distractor accum ulation regimes re- mains an important direction for future ev aluation. 3 Because noise is injected by replacing a fraction of candidate claims, this protocol may remov e gold evidence for some queries. Consequently , Exact Match under heavy corruption reﬂects robustness to partial evidence loss rather than distractor ﬁltering. 15 Metho d α EM Ov erconfErr Am bExp Mean H Stop Reason Baseline RAG 0.0 0.0067 0.9933 0.0000 – ﬁxed_budget (300/300) Baseline RAG 0.3 0.0067 0.9933 0.0000 – ﬁxed_budget (300/300) Baseline RAG 0.5 0.0067 0.9933 0.0000 – ﬁxed_budget (300/300) ECR (ours) 0.0 0.0000 0.9900 0.0100 0.226 epistemic_suﬃciency (297/300) ECR (ours) 0.3 0.0067 0.0000 1.0000 1.496 unresolved_conﬂict (300/300) ECR (ours) 0.5 0.0067 0.0000 1.0000 1.458 unresolved_conﬂict (300/300) T able 6: Oﬄine con tradiction-injection ablation (300 questions). P aired contradictions are injected in to the candidate claim p o ol at rate α after retriev al and b efore evidence selection. EM is rep orted only as a sanity anchor under a deterministic oﬄine answ erer; the k ey signals are Am biguity Exp o- sure and Overconﬁden t Error (epistemic collapse). ECR exhibits a phase transition from epistemic suﬃciency to principled non-conv ergence as contradictions accumulate, while baseline RAG remains uniformly o verconﬁden t. Counts indicate num b er of runs terminating for each reason. 6 Conclusion Summary En tropic Claim Resolution introduces a principled inference-time p ersp ectiv e on Retriev al-Augmented Generation, reframing evidence selection as a pro cess of epistemic uncertaint y reduction rather than relev ance maximization. By directly optimizing Exp ected Entrop y Reduction o ver atomic claims, ECR pro vides a mathematically grounded mechanism for determining b oth whic h evidence to ev aluate next and when suﬃcien t evidence has b een accumulated to justify synthesis. Empirically , w e show that this entrop y-driven framework reliably collapses h yp othesis uncertain ty in con trolled claim-level settings and substantially outp erforms random evidence selection in end-to- end multi-hop question answering, while maintaining p erformance comparable to strong relev ance- based baselines. These results highligh t a fundamen tal distinction b et ween optimizing for raw answ er accuracy and enforcing principled epistemic con trol during inference. In a fully oﬄine con tradiction-injection stress test, ECR exhibits a sharp transition from epistemic suﬃciency to principled non-conv ergence as structured conﬂict accumulates: entrop y ceases to collapse, evidence exploration increases, and termination is explained by unresolved inconsistency rather than heuristic budgets. Unlik e retriev al arc hitectures designed primarily for long-form unstructured do cumen ts, HyRAG v3 explicitly mo dels structured tabular data with ro w-level grounding, enabling ECR to enforce numeric correctness and temporal consistency during inference. Bey ond b enc hmark p erformance, the ECR framew ork oﬀers clear adv antages for real-world and en terprise deploymen ts. In high-stakes domains suc h as medicine, law, and ﬁnance, conﬁden tly synthe- sizing a single answer from conﬂicting or incomplete evidence can b e costly or harmful. By pro viding a mathematically grounded mechanism to expose unresolved ambiguit y when epistemic suﬃciency can- not be reached, ECR functions as a principled constraint against unhedged generation. F urthermore, the ability to dynamically halt evidence accumulation once H ≤ ϵ is satisﬁed mitigates unnecessary computational ov erhead, reducing latency and cost associated with processing large, redundan t con- text windows. This p ositions ECR as a resource-eﬃcient inference-time control mechanism for scalable and risk-a ware AI reasoning. Limitations and F uture W ork W e conclude by outlining k ey limitations of the current framew ork and highlighting promising direc- tions for future researc h. 16 Hyp othesis space cov erage. A primary limitation of the current framew ork is its reliance on the initial hypothesis generation stage. Entropic Claim Resolution op erates ov er an explicitly constructed h yp othesis set and therefore inherits a b ounded-cov erage assumption; if the true answer is entirely absen t from this space, the system may conv erge conﬁden tly to an incorrect explanation. In practice, this limitation can b e mitigated b y regenerating h yp otheses when en trop y fails to decrease or when accum ulated evidence weakly supp orts all candidates. F uture w ork will explore dynamic mid-lo op h yp othesis extension, soft hypothesis assignments, richer lik eliho od mo dels, and tighter in tegration with learned retrievers to further strengthen entrop y-guided reasoning under uncertain ty . Imp ortan tly , ECR’s refusal to con verge under explicit contradiction is a delib erate design choice rather than a limitation: when the ev aluated evidence is epistemically incoherent, the framew ork exp oses ambiguit y instead of forcing p osterior collapse. This b eha vior preserves epistemic correctness but ma y yield non-decisiv e outputs in genuinely inconsistent corp ora. An orthogonal robustness regime inv olves distractor accumulation without evidence remov al, whic h w e leav e to future in vestigation. Finally , while this work ev aluates Entropic Claim Resolution sp eciﬁcally within Retriev al-Augmented Generation, the underlying metho dology naturally extends to agentic and autonomous con texts. Our approac h suggests a p erspective where agen t actions (such as executing to ols, querying external APIs, or taking exploratory steps) can be mo deled dynamically as en tropy-minimizing decisions ev aluated under a rigorous Exp ected Entrop y Reduction criterion. This aligns with recent adv ancements in autonomous cognitive control, including top ology-a ware routing [19] and dynamic temp oral pacing [20], providing a formal alternative to standard prompt-driv en or heuristic action-selection p olicies. W e view the integration of decision-theoretic primitives into con tinuous agentic feedback lo ops as a comp elling frontier for building robust and mathematically grounded autonomous systems. 17 References [1] Lewis, P ., et al. (2020). Retriev al-augmen ted generation for knowledge-in tensive nlp tasks. A d- vanc es in Neur al Information Pr o c essing Systems , 33, 9459–9474. [2] Karpukhin, V., et al. (2020). Dense Passage Retriev al for Op en-Domain Question Answering. Pr o c e e dings of EMNLP . [3] Edge, D., et al. (2024). F rom Lo cal to Global: A Graph RAG Approach to Query-F o cused Summarization. arXiv pr eprint arXiv:2404.16130 . [4] Singh, R., et al. (2026). CSR-RA G: An Eﬃcient Retriev al System for T ext-to-SQL on the Enter- prise Scale. arXiv pr eprint arXiv:2601.06564 . [5] Y ao, S., et al. (2023). ReAct: Synergizing Reasoning and A cting in Language Mo dels. ICLR . [6] Y ao, S., et al. (2023). T ree of Thoughts: Delib erate Problem Solving with Large Language Mo dels. NeurIPS . [7] Asai, A., et al. (2023). Self-RAG: Learning to Retrieve, Generate, and Critique. arXiv pr eprint arXiv:2310.11511 . [8] Shannon, C. E. (1948). A Mathematical Theory of Comm unication. Bel l System T e chnic al Journal , 27(3). [9] Houlsb y , N., et al. (2011). Ba yesian Activ e Learning for Classiﬁcation and Preference Learning. arXiv:1112.5745 . [10] Kuhn, L., et al. (2023). Seman tic Uncertain ty: Epistemic Uncertain ty in Neural Language Mo dels. ICLR . [11] Zhang, T., et al. (2024). URA G: Benchmarking Uncertaint y in Retriev al-Augmented Generation. arXiv:2408.01234 . [12] Liu, H., et al. (2025). Retriev al-Augmented Reasonin g Consistency . ACL . [13] Zhao, X., et al. (2025). BEE-RA G: Balanced En tropy-Engineered Context Management. arXiv:2501.09912 . [14] Kim, S., et al. (2025). Structure-Fidelit y RAG. ICLR . [15] P atel, M., et al. (2024). Lazy RAG. EMNLP . [16] W ang, Y., et al. (2025). MedRAGChec ker. . [17] W ei, A., et al. (2024). SAFE. . [18] Chen, J., et al. (2025). CIBER. NeurIPS . [19] Di Gioia, D. (2026). Cascade-A ware Multi-Agent Routing. . [20] Di Gioia, D. (2026). Learning When to A ct. . 18 App endix A λ -Sw eep Robustness T o test whether coherence-a ware b ehavior requires fragile tuning, we sweep the coherence b on us weigh t λ in ECR’s evidence selection p olicy ov er { 0 , 0 . 01 , 0 . 025 , 0 . 05 , 0 . 1 } while keeping the oﬄine proto col, budgets, and con tradiction injection rates α ∈ { 0 . 0 , 0 . 3 , 0 . 5 } ﬁxed. W e observe that λ = 0 b eha ves as en tropy-only con trol and can conv erge to a dominan t hypothesis ev en under con tradiction injection, whereas an y tested non-zero λ yields the same coherence-aw are regime in which explicit contradictions are surfaced and preven t epistemic collapse; consequently , b eha vior saturates across all tested λ > 0 . W e set λ = 0 . 05 as the default. As shown in T able A.1, we observe a sharp transition betw een en tropy-only control ( λ = 0 ) and coherence-aw are con trol ( λ > 0 ), with b eha vior saturating for all tested non-zero v alues. This indicates that ECR do es not require ﬁne-grained h yp erparameter tuning to surface epistemic inconsistency . λ conﬂict Am biguity Exp osure 0 0.01 0.025 0.05 0.10 0 1 α = 0 . 0 α = 0 . 3 α = 0 . 5 Figure A.1: Ambiguit y exp osure as a function of the coherence weigh t λ conﬂict under structured con tradiction injection. Empirically , ambiguit y exp osure exhibits a sharp phase transition: for α = 0 . 5 , exp osure jumps from 0 to 1 for an y tested λ > 0 , while remaining 0 for α ≤ 0 . 3 across all tested settings. λ α = 0 . 0 α = 0 . 3 α = 0 . 5 MeanClaims Mean H MeanClaims Mean H MeanClaims Mean H 0.00 5.04 0.226 5.06 0.226 5.08 0.226 0.01 5.04 0.226 25.83 1.496 29.81 1.458 0.025 5.04 0.226 25.83 1.496 29.81 1.458 0.05 5.04 0.226 25.83 1.496 29.81 1.458 0.10 5.04 0.226 25.83 1.496 29.81 1.458 T able A.1: λ -sweep summary statistics (oﬄine, deterministic). V alues are aggregated ov er 300 ques- tions. B Oﬄine Con tradiction Sanit y T est As an additional sanity c heck, w e ev aluate ECR in a minimal fully oﬄine scenario consisting of a single claim and its explicit synthetic negation (a paired contradiction twin). In this setting, the exp ected out- come is that ECR ev aluates b oth claims, ﬂags unresolved conﬂict ( has_unresolved_conflict=True ), and refuses to emit a dominant h yp othesis ( dominant_hypothesis=None ). This deterministic unit test passes. 19

Entropic Claim Resolution: Uncertainty-Driven Evidence Selection for RAG

Original Paper

Comments & Academic Discussion

Leave a Comment

Original Paper

Related Papers

Comments & Academic Discussion

Leave a Comment