Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification



Preprint. Under review.

Masnun Nuha Chowdhury† Nusrat Jahan Beg† Umme Hunny Khan
Syed Rifat Raiyan‡ Md Kamrul Hasan Hasan Mahmud
Systems and Software Lab (SSL), Department of Computer Science and Engineering
Islamic University of Technology, Dhaka, Bangladesh
†Equal contribution ‡Corresponding author: rifatraiyan@iut-dhaka.edu

Abstract

Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROCLAIM, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROCLAIM achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification.

1 Introduction

LLMs have demonstrated strong zero-shot performance on reasoning-intensive tasks, yet their reliability in high-stakes domains, such as claim verification, remains limited.
Despite fluent generation, LLMs frequently exhibit hallucinations, shallow reasoning, and overconfident predictions when evaluating evidence-dependent claims (Huang et al., 2025). RAG partially addresses these issues by grounding responses in external corpora (Lewis et al., 2020). Still, standard pipelines rely on static, single-pass retrieval (Gao et al., 2024) and lack mechanisms for iterative reasoning, often resulting in incomplete or biased conclusions. To improve reliability, recent work has explored Multi-Agent Debate (MAD), where multiple LLM instances iteratively argue and refine answers (Du et al., 2024; Liang et al., 2024; Han et al., 2025). While promising, prior studies show that unstructured debate often leads to premature convergence, shared bias reinforcement, and limited evidence exploration (Smit et al., 2023; Wu et al., 2025). In particular, agent agreement is frequently misinterpreted as correctness, even when grounded in insufficient or biased evidence.

In this work, we propose a courtroom-style multi-agent deliberation framework, PROCLAIM (Progressive Retrieval Orchestrated multi-agent framework for Claim verification), that reformulates claim verification as a structured, adversarial reasoning process. Inspired by legal systems and recent agent-based simulations (Chen et al., 2025), our approach introduces explicit roles (Plaintiff, Defense, Judge, Critic, and Expert Witness), evidence admission protocols, and multi-stage deliberation. Central to the framework is Progressive Retrieval-Augmented Generation (P-RAG), which performs iterative, query-adaptive retrieval to continuously expand and refine the evidence pool during debate.

Contributions. We present a unified framework where structured deliberation outperforms standard multi-agent debate, achieving a +10.0 pp accuracy gain and validating a courtroom-style architecture.
Dynamic retrieval via P-RAG is the primary driver, adding 7.5 points while preventing evidence stagnation. We further show that model diversity is crucial: heterogeneous LLMs exhibit complementary errors that offset each other, outperforming any single model. PROCLAIM also demonstrates strong zero-shot, domain-agnostic performance. Beyond gains, our analysis reveals deeper insights into multi-agent reasoning:

• Breaking the epistemic bubble: removing dynamic retrieval (P-RAG) increases inter-judge agreement (mean κ = 0.468 → 0.599) while reducing accuracy by 7.5 pp, revealing confident convergence on incorrect conclusions.
• Acting as a logic-level lie detector: incorrect predictions exhibit unstable reasoning trajectories with oscillating self-reflection scores, showing that reasoning dynamics are more informative than final confidence.
• Enabling synthetic wisdom through diversity: heterogeneous LLMs exhibit complementary error profiles (e.g., over-refutation vs. cautious abstention) whose combination yields a 3.3 pp accuracy gain over single-judge panels.
• Serving as an economic governor: self-reflection reduces debate rounds by 29% (7.06 → 5.47) and token usage by 17% while maintaining accuracy within 0.8 pp.
• Demonstrating the butterfly effect of argument framing: small biases in premise decomposition propagate and significantly impact outcomes, as seen in Run-2, where κ_GT drops to 0.384 (down from 0.423 in Run-0) despite stable inter-judge κ.
• Revealing structural negativity bias: LLM judges over-produce REFUTE verdicts relative to ground truth and converge faster on such claims (by 0.2 rounds in primary and 0.3 in role-switched debates), indicating inherent conservatism shaped by their training data.
Together, these findings reframe LLM reasoning as a dynamic process, where reliability arises from the structure, diversity, and evolution of deliberation, not just final answers; see Appendix A for context with prior work. We provide our code and data in the following GitHub repository: https://github.com/mnc13/PROClaim.

2 Methodology

We propose a courtroom-inspired pipeline for zero-shot, evidence-grounded fact-checking of COVID-19-related claims. Each claim is treated as a legal case, where a Plaintiff Counsel argues for the claim, a Defense Counsel argues against it, and an independent judicial panel deliberates toward a verdict, imposing disciplined adversarial structure over the reasoning process rather than relying on single-pass LLM inference. Figure 1 provides a high-level overview, with the full evaluation cycle formalised in Algorithm F.

2.1 Argument Mining

Figure 1: Overview of the pipeline. (Data layer: Check-COVID claims, FAISS vector index and metadata store, PubMed biomedical literature corpus. Preprocessing and retrieval layer: claim loading and normalization, RAG retrieval, LLM-based argument mining. Core reasoning and verification layer: Plaintiff Counsel, Defense Counsel, Critic, Expert Witness, and the Court, with a shared evidence pool, P-RAG feedback loop, and evidence negotiation. Evaluation and decision layer: multi-judge scoring, self-reflection and confidence penalty, majority voting, role switching and consistency audit, bias and contradiction detection, final verdict and confidence score.)

Before retrieval, the
raw claim is decomposed into atomic, independently testable premises (Hu et al., 2025a; Lawrence & Reed, 2017). This serves two purposes: first, decomposing complex claims allows the retrieval system to cast a wider and more targeted net; second, the resulting premises act as an explicit checklist for scoring argument completeness during self-reflection and critic evaluation (Section 2.6). The output is a list of premise strings propagated to all downstream phases. Formally, given a claim c, the miner produces a premise set P = {p_1, ..., p_k}, where each p_i encodes a single verifiable proposition. The number of premises k is not fixed, allowing the agent to produce as many non-redundant premises as the claim's semantic content warrants.

2.2 Initial Evidence Retrieval

Relevant evidence is retrieved from a corpus of COVID-19 PubMed abstracts (2020–2024), constructed for this study, using dense vector embeddings to measure semantic similarity (Gao & Callan, 2021). Claims and abstract chunks are mapped into a 384-dimensional embedding space using the all-MiniLM-L6-v2 bi-encoder (Reimers & Gurevych, 2019), and the most semantically similar chunks are selected as the seed evidence pool, with source and year preserved for provenance. The detailed embedding normalization and similarity computation are provided in Appendix L.

2.3 Evidence Negotiation and Judicial Arbitration

Inspired by the pre-trial discovery phase in legal proceedings, this stage constructs an admissibility-weighted evidence pool before the debate begins, drawing on recent works (Jeong et al., 2026; Hu et al., 2025b), and proceeds in four steps.

Premise-grounded shared retrieval: The FAISS index is queried once per decomposed premise, producing a shared pool of retrieved evidence.
Stance-conditioned retrieval: An LLM generates two distinct queries, one optimised to surface supporting evidence (for the Plaintiff Counsel) and one to surface challenging evidence (for the Defense Counsel), yielding a proponent pool and an opponent pool respectively. This asymmetric retrieval prevents the "echo chamber" effect common in top-k RAG, where the engine retrieves only the most statistically frequent papers, often ignoring marginalized but critical counter-evidence (Wu & Saxena, 2025), ensuring the evidence pool is adversarially balanced from Round 0.

Negotiation injection: Both sides are exposed to each other's discovery pools via structured LLM prompts, simulating the disclosure and challenge process. Rather than merely retrieving more data, agents actively refine the search space based on their legal strategy, a form of metacognitive evidence curation.

Admissibility scoring: Each candidate is scored on two dimensions: relevance r ∈ [0, 1], the degree to which the evidence pertains to the specific claim, and credibility c ∈ [0, 1], the scientific rigour and source quality. The admissibility weight is w = r × c. Items with w > 0.5 are admitted, items with 0.1 < w ≤ 0.5 are flagged as disputed, and items with w ≤ 0.1 are discarded. Admitted items are ranked in descending order of w and passed to the debate orchestrator (see Appendix H.2).

This mechanism operationalises a principle from both legal theory and fact-checking research: evidence quality and provenance matter beyond surface-level relevance, especially under conflicting sources (Ge et al., 2025). It mirrors the Daubert Standard (Legal Information Institute, 2024) used in real-world expert testimony, excluding "scientific hearsay" before it can pollute agent deliberation.

2.4 Progressive Retrieval-Augmented Generation (P-RAG)

Static, one-time retrieval is ill-suited to adversarial debate, as new evidential gaps emerge while arguments evolve.
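The admissibility triage of Section 2.3 amounts to a weighted three-way filter; a minimal sketch, assuming a simple (id, relevance, credibility) tuple schema that the paper does not prescribe:

```python
def triage_evidence(candidates):
    """Admissibility triage (Section 2.3): weight w = relevance * credibility.

    `candidates` is a list of (item_id, relevance, credibility) tuples; the
    schema is illustrative, not the paper's actual data structure.
    """
    admitted, disputed, discarded = [], [], []
    for item_id, r, c in candidates:
        w = r * c
        if w > 0.5:                       # admitted outright
            admitted.append((item_id, w))
        elif w > 0.1:                     # 0.1 < w <= 0.5: flagged as disputed
            disputed.append((item_id, w))
        else:                             # w <= 0.1: discarded
            discarded.append((item_id, w))
    # Admitted items go to the debate orchestrator in descending order of w.
    admitted.sort(key=lambda item: item[1], reverse=True)
    return admitted, disputed, discarded

admitted, disputed, discarded = triage_evidence(
    [("a", 0.9, 0.8), ("b", 0.6, 0.4), ("c", 0.2, 0.3)]
)
# weights: a -> 0.72 (admitted), b -> 0.24 (disputed), c -> 0.06 (discarded)
```

The disputed tier lets the orchestrator surface contested evidence to both counsels rather than silently dropping it.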
Unlike prior iterative approaches that query solely from the previous generation (Shao et al., 2023; Trivedi et al., 2023), P-RAG concatenates three sources, (i) the last four messages of the rolling debate context, (ii) the agent's self-identified evidential gap, and (iii) reflection-driven discovery needs from the prior round (Section 2.6), into one prompt, with the Judge refining each query before execution (see Appendix I for full logic and prompts). To prevent redundant retrieval, each candidate item is scored against the existing pool:

novelty(d) = 1 − max_{p ∈ P} cos(e_d, e_p),   (1)

where P is the current pool and e_(·) denotes an L2-normalised embedding. Unlike global diversity metrics such as the Vendi Score (Rezaei & Dieng, 2025), our novelty filter operates iteratively, rejecting near-duplicates at each round.

| Criterion | Threshold | Rationale |
| --- | --- | --- |
| Novelty filter | < 0.20 | Rejects near-duplicates |
| Redundancy ratio | > 70% | Indicates saturation |
| Relevance gain | < 0.05 | Diminishing returns |
| Iteration cap | 10 | Limits compute cost |

Table 1: P-RAG stopping criteria.

| Role | Model | Prov. | T. |
| --- | --- | --- | --- |
| Premise Decomp. | deepseek-r1 | OR | 0.7 |
| Plaintiff Counsel | gpt-5-mini | OA | 0.5 |
| Defense Counsel | deepseek-v3.2 | OR | 0.5 |
| The Court | qwen3-235b | OR | 0.2 |
| Expert Witness | hermes-3-llama-405b | OR | 0.5 |
| Critic Agent | deepseek-r1 | OR | 0.3 |
| Consistency Anal. | deepseek-v3.2 | OR | 0.3 |
| Judge 1 | deepseek-r1 | OR | 0.3 |
| Judge 2 | hermes-3-llama-405b | OR | 0.3 |
| Judge 3 | Qwen3-235B-A22B | OR | 0.3 |

Table 2: Role-wise models. Providers (Prov.): OR = OpenRouter, OA = OpenAI. T. = temperature.

Only items with novelty ≥ 0.20 are admitted. This threshold is selected as a redundancy–noise trade-off heuristic: values below 0.15 allow near-duplicates, while values above 0.30 prematurely discard nuanced but critical evidence. Retrieval terminates early under the adaptive stopping criteria in Table 1.
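Equation (1) and the 0.20 admission threshold reduce to a few lines; a pure-Python sketch (the pipeline itself operates on 384-dimensional MiniLM embeddings rather than the toy 2-D vectors used here):

```python
import math

def _unit(v):
    """L2-normalise a vector so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def novelty(candidate, pool):
    """Eq. (1): novelty(d) = 1 - max over p in pool of cos(e_d, e_p).
    An empty pool makes the candidate maximally novel."""
    if not pool:
        return 1.0
    e_d = _unit(candidate)
    return 1.0 - max(sum(a * b for a, b in zip(e_d, _unit(p))) for p in pool)

def admit(candidate, pool, threshold=0.20):
    # Items below the novelty threshold are rejected as near-duplicates.
    return novelty(candidate, pool) >= threshold

pool = [[1.0, 0.0], [0.0, 1.0]]
admit([1.0, 0.05], pool)   # near-duplicate of the first vector -> rejected
admit([1.0, -1.0], pool)   # points in a genuinely new direction -> admitted
```

Because every vector is L2-normalised, the inner maximisation is a plain dot product, which is also why the paper's FAISS index can serve the same computation at scale.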
2.5 Multi-Agent Debate (MAD) Orchestration

The multi-agent debate is the central reasoning engine of the pipeline, involving five adjudication roles, Plaintiff Counsel, Defense Counsel, and three Judges, each assigned to a different underlying model (Table 2). Heterogeneous assignment serves to prevent collusion from shared representations, introduce authentic disagreement, and improve argument diversity across rounds, as homogeneous agents cannot reliably improve over majority vote, while diversity in agent capabilities is critical for effective deliberation (Zhang et al., 2026; Zhu et al., 2026).

Each debate round executes five steps. (1) Evidence discovery: both agents independently identify an evidential gap and propose a targeted query, which the Judge refines before execution via P-RAG, with novel results admitted to the shared pool. (2) Argument generation: each agent produces a structured legal argument grounded in the current evidence pool and debate history. (3) Expert witness testimony: either side may request a domain expert; if granted by the Judge, a domain-specific persona is dynamically generated to provide targeted testimony. This mechanism enables access to specialized knowledge without predefining expert domains. (4) Self-reflection: each agent critiques its own performance, with identified gaps injected into the next round's P-RAG query (Section 2.6). (5) Critic evaluation: the Critic Agent independently scores both sides and signals early termination if all key premises have been addressed.

The debate runs for up to max_rounds = 10 but terminates early when any of the following conditions are met: (i) Reflection plateau: |Δ total reflection score| < 0.05 for two consecutive rounds; (ii) Critic resolution: debate_resolved = True; (iii) Novelty exhaustion: average novelty < 0.10 over two consecutive P-RAG calls; or (iv) Judicial signal: the Court agent affirms readiness to deliberate.
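The four early-termination conditions above can be sketched as a single check run after each round; the dictionary keys are illustrative, not the orchestrator's actual state schema:

```python
def stop_reason(history, max_rounds=10):
    """Return the Section 2.5 termination reason, or None to keep debating.

    `history` is one dict per completed round with illustrative keys:
    reflection (total reflection score), avg_novelty (mean P-RAG novelty),
    resolved (Critic's debate_resolved flag), court_ready (judicial signal).
    """
    t = len(history)
    last = history[-1]
    if last["resolved"]:
        return "critic_resolution"
    if last["court_ready"]:
        return "judicial_signal"
    if t >= 3:  # |delta reflection| < 0.05 for two consecutive rounds
        r = [h["reflection"] for h in history[-3:]]
        if abs(r[1] - r[0]) < 0.05 and abs(r[2] - r[1]) < 0.05:
            return "reflection_plateau"
    if t >= 2 and all(h["avg_novelty"] < 0.10 for h in history[-2:]):
        return "novelty_exhaustion"
    if t >= max_rounds:
        return "round_cap"
    return None

base = {"avg_novelty": 0.5, "resolved": False, "court_ready": False}
rounds = [dict(base, reflection=r) for r in (0.40, 0.42, 0.43)]
stop_reason(rounds)  # two consecutive deltas < 0.05 -> "reflection_plateau"
```

Checking the Critic and Court signals before the plateau test mirrors the intuition that explicit resolution should pre-empt score-based heuristics, though the paper does not specify a precedence order.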
2.6 Self-Reflection and Critic Evaluation

Per-agent self-reflection. After each round, each counsel performs a structured self-critique (Madaan et al., 2023; Shinn et al., 2023) across three dimensions: logic l ∈ [0, 1] (inferential validity), novelty n ∈ [0, 1] (new information introduced), and rebuttal b ∈ [0, 1] (engagement with the opposing side). Logical validity is weighted highest (0.4) as the core requirement for inferential integrity, with novelty and rebuttal balanced equally (0.3 each) to maintain adversarial accountability without compromising factual grounding. The dimensions are aggregated as

s_ref = 0.4l + 0.3n + 0.3b,   (2)

and mapped to a symmetric confidence adjustment used in the final verdict (Section 2.8):

δ_ref = (s_ref − 0.5) × 0.6 ∈ [−0.30, +0.30].   (3)

The 0.6 multiplier scales the centered reflection score to a range of ±0.30. This allows the winning counsel's self-critique to exert a significant, yet non-dominant, influence on the baseline consensus confidence (c_base ≈ 0.8), maintaining the supremacy of the judicial panel's verdict. The self-reflection output also surfaces a discovery need, a description of missing evidence, which is included with the subsequent round's P-RAG query.

Independent Critic Agent. An independent Critic Agent, distinct from both debating agents, evaluates each round, leveraging prior findings that dedicated critics outperform self-critique (Li et al., 2025b). It scores both sides on logic, evidence coverage, and rebuttal; identifies unaddressed premises from Section 2.1; provides actionable recommendations; and issues the debate_resolved signal. Using an architecturally distinct model prevents both debaters from converging on a shared but potentially erroneous position.

2.7 Role-Switching Consistency Test

To probe whether arguments are genuinely evidence-grounded rather than position-anchored, we introduce a role-switching consistency test.
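The reflection aggregation of Section 2.6 (Eqs. 2–3) is a two-line computation; a minimal sketch:

```python
def reflection_adjustment(logic, novelty, rebuttal):
    """Eq. (2): weighted self-critique score; Eq. (3): symmetric adjustment.
    All three inputs are scores in [0, 1]."""
    s_ref = 0.4 * logic + 0.3 * novelty + 0.3 * rebuttal  # logic weighted highest
    delta_ref = (s_ref - 0.5) * 0.6                       # centre, then scale to +/-0.30
    return s_ref, delta_ref

reflection_adjustment(1.0, 1.0, 1.0)  # flawless critique -> delta near +0.30
reflection_adjustment(0.5, 0.5, 0.5)  # neutral critique -> delta near 0.0
```

Centring at 0.5 before scaling is what makes the adjustment symmetric: below-average critiques penalise confidence exactly as much as above-average critiques boost it.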
Unlike prior work that uses role-switching to mitigate bias (Jin et al., 2025) or elicit uncertainty (Liu et al., 2025), we employ it as a consistency diagnostic. After the primary debate, the Plaintiff and Defense Counsel swap roles, with the orchestrator resetting all state and re-running the full debate under the switched configuration. This is analogous to requiring legal counsel to argue the opposing brief: if arguments are evidence-driven, a capable agent should construct comparably strong arguments from either position.

A separate LLM then analyses both transcripts, assessing whether the logic in the switched debate contradicts the agent's original position and whether evidence is marshalled consistently or opportunistically, producing a consistency score for each agent. The score is incorporated into the final confidence weighting (Section 2.8).

2.8 Judicial Panel Evaluation and Final Verdict

The complete case record, comprising both debate transcripts, admitted evidence, P-RAG retrieval histories, critic evaluations, and per-round self-reflection scores, is submitted to a three-member Judicial Panel of independent LLM judges, each assigned a different model to prevent correlated errors (Verga et al., 2024). Each judge independently evaluates the case through six stages: (1) case reconstruction, (2) evidence strength, (3) argument validity, (4) source reliability, (5) discovery rigour and P-RAG query evolution, and (6) a final verdict of SUPPORTED, NOT SUPPORTED, or INCONCLUSIVE. The independent evaluations of the three judges are aggregated using a majority-vote mechanism to determine the final verdict (see Appendix N).

The confidence score is computed in two stages.

Base confidence:

c_base = 0.8σ + 0.3q,   (4)

where σ = winning votes / total votes is the consensus strength and q = (s̄_ev + s̄_val + s̄_rel) / 30 is the mean judge quality score.
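The two-stage confidence computation of Section 2.8 can be sketched as a short function; the role-switch and reflection adjustments, the −0.15 floor, and the 0.10 majority lower bound follow the descriptions in the remainder of this section:

```python
def final_confidence(votes_winning, votes_total, s_ev, s_val, s_rel,
                     delta_rs=0.0, delta_ref_raw=0.0):
    """Two-stage confidence from Section 2.8 (Eqs. 4-5).
    s_ev, s_val, s_rel are the panel-mean 10-point judge quality scores."""
    sigma = votes_winning / votes_total        # consensus strength
    q = (s_ev + s_val + s_rel) / 30.0          # normalise 3 x 10-point scores to [0, 1]
    c_base = 0.8 * sigma + 0.3 * q             # Eq. (4)
    delta_ref = max(-0.15, delta_ref_raw)      # strict floor on the self-critique term
    c = min(1.0, max(0.0, c_base + delta_rs + delta_ref))  # Eq. (5) clamp
    if votes_winning / votes_total >= 2 / 3:   # majority-agreement lower bound
        c = max(c, 0.10)
    return c

final_confidence(3, 3, 9.0, 8.0, 8.5)  # unanimous panel, strong scores -> clamps to 1.0
```

Note that c_base can exceed 1.0 before clamping (0.8 + 0.3 = 1.1 at the maxima), which is why the Eq. (5) clamp is applied after, not before, the adjustments.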
The 30 divisor normalizes the three 10-point qualitative metrics into the unit interval [0, 1] for mathematical consistency with σ (the certainty buffer is detailed in Appendix M.1.3). The weights were determined using a 5-fold cross-validation grid search (range [0.5, 0.9]) to minimize Expected Calibration Error (ECE) on a held-out development subset, following best practices for calibrating LLM-based evaluators (Gu et al., 2025). Our final test-set calibration yielded a near-optimal ECE of 0.034 (see Appendix M.1), representing a 5x improvement over standard averaging (W = 0.6, ECE = 0.18). The high consensus weighting (W_consensus = 0.8) is empirically justified, as judge agreement proved the strongest predictor of accuracy.

Adjustment:

c_final = clamp(c_base + δ_rs + δ_ref, 0, 1),   (5)

where δ_rs is the role-switch adjustment scalar (details in Appendix M.2) and δ_ref is the winning side's final self-reflection adjustment. While the agent's raw self-reflection score ranges over [−0.30, +0.30] (Eq. 3), the framework applies a strict floor (δ_ref = max(−0.15, raw score)) to ensure that self-critiques mathematically refine, but do not unilaterally overturn, the established judicial consensus (see Appendix M.3). After clamping, if at least 2/3 of the judges agree, confidence is lower-bounded at 0.10 to prevent near-zero scores under clear majority agreement.

3 Experimental Setup

Figure 2: Termination distribution and convergence speed across 360 debate instances. (a) Debate termination reasons: Reflection Plateau 179 (49.7%), Judicial Signal 147 (40.8%), Critic Resolution 23 (6.4%), Novelty Stabilization 11 (3.1%). (b) Rounds to consensus (mean ± 95% CI) by ground-truth label: GT = SUPPORT, 3.0 primary and 2.9 role-switched; GT = REFUTE, 2.8 primary and 2.6 role-switched.

Dataset.
To evaluate the framework's capacity for adversarial resolution, we focus on the subset of the Check-COVID (Wang et al., 2023a) test set possessing definitive binary ground truths (SUPPORT or REFUTE). This task formulation, which we term Adversarial Resolution of Hard-Binary Claims, ensures that the system is tested on its ability to navigate conflicting evidence toward a definitive truth, rather than defaulting to the uninformative 'Not Enough Information' label common in standard model outputs.

Retrieval Corpus. Retrieval is performed over a static corpus of COVID-19 abstracts from PubMed (2020–2024), constructed for this study. Abstracts are encoded using all-MiniLM-L6-v2, a Sentence-BERT model (Reimers & Gurevych, 2019), and indexed with FAISS (Douze et al., 2025). Source metadata, including PMID, title, journal, and publication year, are stored alongside each passage to support provenance tracking.

Configuration and Metrics. Table 2 summarizes the heterogeneous LLM assignment, where agent roles are partitioned across distinct architectures based on their functional capacity (deliberative vs. evaluative) and cognitive profile, utilizing moderate temperatures for adversarial advocacy and lower temperatures for rigorous judicial evaluation (details in Appendix B.3). Key retrieval hyperparameters are a novelty threshold τ = 0.20, a per-round top-k = 3, and an admissibility floor w > 0.5; full settings are in Appendix B.4. We evaluate classification performance (accuracy, macro F1), inter-judge reliability (Cohen's κ, unanimity/split rates), confidence stability via the Kolmogorov–Smirnov statistic across rounds (Appendix E), and efficiency (token usage, debate rounds, evidence pool size).

4 Results and Discussion

Table 3 reports Check-COVID performance across three independent runs, utilizing majority voting for aggregate results. Crucially, our framework enforces an adversarial 'Burden of Refutation'.
Mirroring legal systems where defendants are formally presumed innocent unless definitively proven otherwise, we evaluate claims based strictly on adversarial survival. If a claim withstands rigorous deliberation and the panel fails to reach a definitive Refute consensus, instead returning an Inconclusive verdict due to insufficient opposing evidence, the claim is formally classified as Supported. This strictly aligns our evaluation metrics with established courtroom logic (see Appendix C).

| Run | Acc | m-F1 | Mean κ | Agr. | Unan. | Split |
| --- | --- | --- | --- | --- | --- | --- |
| Run-0 | 0.950 | 0.950 | 0.429 | 0.617 | 0.442 | 0.558 |
| Run-1 | 0.817 | 0.817 | 0.549 | 0.700 | 0.558 | 0.442 |
| Run-2 | 0.790 | 0.790 | 0.474 | 0.652 | 0.496 | 0.504 |
| W. Total | 0.841 | 0.841 | 0.484 | 0.656 | 0.499 | 0.501 |
| Maj. Vote | 0.817 | 0.817 | 0.468 | 0.648 | 0.489 | 0.511 |
| Oracle | 0.958 | 0.958 | 0.438 | 0.622 | 0.450 | 0.550 |

Table 3: Performance statistics. m-F1: macro F1. Agr.: raw agreement. Unan.: unanimity (3-0 consensus). Split: divided vote. W. Total: weighted total. Oracle: best-of-3. Per-judge κ_GT in Appendix D.

(a) Novelty decay across rounds (mean novelty, 10–90th percentile band, IQR, P-RAG query novelty, and the 0.2 admission threshold; n per round: 2031, 1448, 547, 208, 85, 26). (b) Novelty distribution by round.
Figure 3: P-RAG evidence novelty across debate rounds.

Figure 4: Reflection score trajectories across plateau, judicial, and critic resolution patterns. Representative trajectories (n ≈ 10 per panel, all runs) for the proponent (solid) and opponent (dashed), split by correct vs. incorrect prediction.

4.1 Main Pipeline Performance

The framework achieves an aggregate majority-vote accuracy of 81.7% across three independent runs, with an oracle ceiling of 95.8% confirming that near-optimal performance is structurally attainable under favourable initialisation. Inter-judge agreement remains stable across all runs (mean κ: 0.429, 0.549, 0.474), indicating coherent panel deliberation regardless of outcome.

The observed variability across runs is primarily attributable to stochastic effects in sequential agent interactions. Run-1's lower κ_GT (0.372) does not translate to an equivalent accuracy loss (0.817), as its errors remain class-balanced. Run-2, however, exhibits class-asymmetric degradation: retrieval bias introduced during stochastic premise decomposition suppresses REFUTE recall and propagates through deliberation, dropping κ_GT to 0.384 while inter-judge κ holds stable at 0.474, confirming the panel converged confidently on systematically biased verdicts rather than breaking down structurally.

4.2 Debate Dynamics and Adaptive Stopping

Termination distribution: Figure 2a shows that most debates self-terminate through reflection-driven signals: Reflection Plateau and Judicial Signal. Reflection Plateau enforces adaptive halting once marginal argumentative gains diminish, mitigating circular reasoning. The low incidence of Novelty Stabilization (3.1%) further suggests that termination is driven by argumentative saturation rather than corpus insufficiency.
Convergence speed: Figure 2b shows that REFUTE claims converge faster than SUPPORT claims in both the primary (0.2 rounds faster) and role-switched debates (0.3 rounds faster), consistent with the LLM negativity bias documented in Section 4.5.

Evidence novelty decay: Figures 3a–3b confirm that P-RAG sustains genuine evidential discovery across all active rounds, with query novelty approaching the threshold from round 4 onward. The narrowing violin distributions from round 3 indicate pool saturation, empirically validating the 0.20 novelty threshold as the natural boundary between productive and redundant retrieval and justifying the adaptive stopping design.

| System | Acc | MacroF1 | ΔAcc | Mean κ | Avg. Rounds | Ev | Tok (K) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PROCLAIM (Ours) | 0.8167 | 0.8165 | — | 0.468 | 5.47 | 67.5 | 210.9 |
| w/o P-RAG | 0.7417 | 0.7408 | −7.5 pp | 0.599 | 6.00 | 37.5 | 188.9 |
| w/o Role-Switching | 0.7750 | 0.7750 | −4.2 pp | 0.513 | 2.88 | 54.0 | 147.3 |
| w/o Three-Judge Panel | 0.7833 | 0.7818 | −3.3 pp | – | 5.29 | 68.8 | 195.9 |
| w/o Self-Reflection | 0.8083 | 0.8080 | −0.8 pp | 0.591 | 7.06 | 81.5 | 247.3 |
| Standard MAD | 0.7167 | 0.7068 | −10.0 pp | – | 2.00 | 12.1 | 18.9 |

Table 5: Ablation results on 120 claims. ΔAcc is measured relative to the full PROCLAIM pipeline. Ev denotes the average evidence pool size. Each round count reflects the sum of primary and role-switched debate rounds per claim.

Reflection trajectories: Figure 4 stratifies trajectories by termination condition, revealing distinct convergence behaviours: Reflection Plateau yields rapid stabilisation within 3–4 rounds, Judicial Signal extends deliberation for contested claims, and Critic Resolution exhibits higher variance. Across all conditions, incorrect predictions display greater trajectory instability, indicating a behavioural signature of erroneous outcomes.
4.3 Comparison with Baselines

| System | Acc | MacroF1 | avg ev |
| --- | --- | --- | --- |
| Single-call GPT-5-mini + RAG | 0.8583 | 0.8571 | 19.3 |
| Single-call DeepSeek-v3.2 + RAG | 0.8000 | 0.7972 | 18.5 |
| Standard MAD | 0.7167 | 0.7068 | 5.0 |
| PROCLAIM (Majority Vote) | 0.8167 | 0.8165 | 67.5 |

Table 4: PROCLAIM vs. baselines. avg ev denotes average evidence pool size.

While monolithic models like GPT-5-mini achieve strong point-estimate accuracy (85.8%), they lack deliberative traceability and adversarial robustness. Their predictions rely on single-pass, instruction-conditioned inference over a fixed evidence set, with no mechanisms for iterative refinement or consistency verification. Our role-switching case study (Appendix K) demonstrates that such models construct plausible arguments for opposing conclusions based solely on task framing, revealing sensitivity to positional prompts rather than evidence-grounded reasoning. PROCLAIM prioritizes deliberative resilience: variance in judge scores and agent consistency surfaces a trajectory instability signal, a "logic lie detector" for contested claims that is inherently absent in single-call black-box models. The 95.8% oracle ceiling further establishes a higher achievable performance bound than the monolithic baseline. Against Standard MAD (71.7%), a two-agent, single-judge RAG-based debate without our enhancements, our full pipeline achieves a +10.0 pp improvement, isolating the collective contribution of P-RAG, self-reflection, role-switching, and structured judicial evaluation.

4.4 Ablation Study

We conduct four ablation experiments, each excluding a single subsystem, evaluated on the same 120 claims. Table 6 summarises component activation across all configurations. P-RAG is the most impactful component (−7.5 pp): without it, the evidence pool nearly halves (37.5 vs. 67.5) and debates run longer (6.00 vs.
5.47) on weaker evidence, while inter-judge agreement rises to κ = 0.599, a negative signal of convergence without correctness (an epistemic bubble). Role-switching contributes −4.2 pp and reduces the evidence pool by 20%, confirming the second pass surfaces missed evidence rather than serving as a mere consistency check. A single judge costs 3.3 pp, directly quantifying the benefit of heterogeneous adjudication (Section 2.8). Self-reflection has minimal accuracy impact (−0.8 pp) but drives efficiency, reducing rounds (7.06 → 5.47) and token usage by 17%, providing the early-stopping signal that makes the pipeline cost-effective.

| Component | Full | A1 | A2 | A3 | A4 |
| --- | --- | --- | --- | --- | --- |
| Evidence Negotiation | ✓ | ✓ | ✓ | ✓ | ✓ |
| P-RAG | ✓ | ✓ | ✓ | × | ✓ |
| Expert Witnesses | ✓ | ✓ | ✓ | ✓ | ✓ |
| Self-Reflection | ✓ | ✓ | ✓ | ✓ | × |
| Reflection → P-RAG | ✓ | ✓ | ✓ | × | × |
| Critic Agent | ✓ | ✓ | ✓ | ✓ | ✓ |
| Adaptive Convergence | ✓ | ✓ | ✓ | × | Partial |
| Role-Switching | ✓ | × | ✓ | ✓ | ✓ |
| 3-Judge Panel | ✓ | ✓ | × | ✓ | ✓ |
| Reflection Confidence Adj. | ✓ | ✓ | ✓ | ✓ | × |
| Max Rounds | 10 | 10 | 10 | 3 | 10 |

Table 6: Component ablation matrix. ✓ = active; × = disabled.

A full token usage breakdown and analysis are provided in Appendix P.

4.5 Judicial Panel: Reliability and Negativity Bias

Inter-judge agreement (mean κ = 0.468, 48.9% unanimous) indicates heterogeneous panels produce genuine disagreement in 51.1% of cases, underscoring the need for diversity. Individual judges exhibit distinct biases: DeepSeek-R1 aggressively refutes (high REFUTE recall but false-refutes SUPPORT), Hermes-3-405B cautiously abstains (frequent INCONCLUSIVE on SUPPORT), and Qwen3-235B-A22B remains most calibrated (highest SUPPORT recall, balanced abstention). All overproduce REFUTE or INCONCLUSIVE, reflecting a structural negativity bias from epistemic caution induced by training corpora, yet their errors (commission, omission, calibration) differ such that disagreements correct rather than compound mistakes.
Consequently, the majority-vote panel achieves balanced judge–GT agreement (κ_GT ∈ [0.431, 0.468], mean 0.450), confirming that heterogeneous voting improves reliability over homogeneous panels by mitigating shared biases.

4.6 Generalization to External Datasets

Dataset      Claims  Acc    F1     Tok(K)
HealthVer    100     0.720  0.713  223
FEVEROUS     60      0.783  0.772  236
Check-COVID  120     0.817  0.817  211

Table 7: Generalization results (single run).

HealthVer (Sarrouti et al., 2021). We evaluate on a 100-claim sample from HealthVer, achieving 72.0% accuracy.

FEVEROUS (Aly et al., 2021). We evaluate on a 60-claim sample from FEVEROUS with a Wikipedia-based retriever, achieving 78.3% accuracy, within ~3 pp of the Check-COVID result.

Rather than exhaustive evaluations, these experiments serve as proof-of-concept robustness checks on randomly sampled subsets. Accuracy remains within 9 pp across domains, suggesting the performance gains stem from the domain-agnostic nature of the deliberative framework rather than domain-specific tuning (see Appendix G for details).

4.7 Sycophancy and Argument Integrity

Sycophancy, the tendency of agents to yield to opposing arguments (Malmqvist, 2025), is low overall. Role-play consistency (Plaintiff: 7.80/10; Defense: 6.22/10) shows mild asymmetry: the Defense "leaks" prior-role reasoning during role switches, producing weaker adversarial arguments. Concession rates are minimal (Plaintiff: 0.03; Defense: 0.01 per 1,000 words), with the slightly higher Plaintiff rate reflecting greater argumentative friction in claim support, consistent with the negativity bias (Section 2.8). The reflection plateau averages 0.3823 per round, well above the 0.05 stagnation threshold, confirming that agents actively push novel reasoning rather than resting on prior arguments.
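As a concrete illustration of the stagnation criterion above, the sketch below flags a debate as plateaued when per-round reflection novelty stays below the paper's 0.05 threshold. The `patience` window is our own assumption for illustration, not a parameter reported in the paper.

```python
def is_stagnant(novelty_scores, threshold=0.05, patience=2):
    """Return True when the last `patience` rounds of reflection novelty
    all fall below `threshold` (the 0.05 stagnation cutoff).

    `patience` is a hypothetical knob; the pipeline's actual
    early-stopping logic may differ.
    """
    recent = novelty_scores[-patience:]
    return len(recent) == patience and all(s < threshold for s in recent)
```

Under this check, a healthy debate averaging 0.38 novelty per round never triggers early stopping, while two consecutive sub-threshold rounds do.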
Judicial conformity (Fleiss' κ = 0.4513) further indicates moderate inter-judge agreement, an alignment level that affirms independent evaluation while precluding the algorithmic enmeshment and absolute conformity (κ → 1.0) symptomatic of systemic sycophancy (Appendix O).

5 Conclusion

PROCLAIM shows that courtroom-style deliberation yields a claim verification system that is accurate, interpretable, and auditable. The framework establishes that reliability in multi-agent LLM systems emerges from deliberative architecture rather than individual model capability, and that the courtroom provides a principled, empirically validated template for building such systems.

Despite strong empirical performance, the pipeline has several limitations. Computational cost is high due to argument mining, multi-round debate, role-switching, and repeated judicial evaluations, and run-level variance remains notable despite majority voting. KS stability relies on final confidence scores as a proxy for per-round distributions, limiting insight into intra-claim dynamics.

Future work includes extending retrieval to live PubMed queries, applying the framework to other high-stakes domains such as scientific claim verification and regulatory compliance, and reducing token usage via early-exit mechanisms, retrieval filtering, and model distillation.

References

Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. FEVEROUS: Fact extraction and VERification over unstructured and structured information. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=h-flVCIlstW.

Guhong Chen, Liyang Fan, Zihan Gong, Nan Xie, Zixuan Li, Ziqiang Liu, Chengming Li, Qiang Qu, Hamid Alinejad-Rokny, Shiwen Ni, et al.
AgentCourt: Simulating court with adversarial evolvable lawyer agents. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 5850–5865, 2025.

Jon Chun, Kathrine Elkins, and Yong Suk Lee. AgenticSimLaw: A juvenile courtroom multi-agent debate simulation for explainable high-stakes tabular decision making. arXiv preprint arXiv:2601.21936, 2026.

Yihan Dong and Takayuki Ito. Multi-agent systems and credibility-based advanced scoring mechanism in fact-checking. Scientific Reports, 2026.

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library, 2025.

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024.

Wei Fan, JinYi Yoon, and Bo Ji. iMAD: Intelligent multi-agent debate for efficient and accurate LLM inference. arXiv preprint, 2025.

Luyu Gao and Jamie Callan. Condenser: A pre-training architecture for dense retrieval, 2021.

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024.

Ziyu Ge, Yuhao Wu, Daniel Wai Kit Chin, Roy Ka-Wei Lee, and Rui Cao. Resolving conflicting evidence in automated fact-checking: A study on retrieval-augmented LLMs, 2025.

Shuzhi Gong, Richard O Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, and Zhuohan Xie. Multi-sourced, multi-agent evidence retrieval for fact-checking. arXiv preprint arXiv:2603.00267, 2026.

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo.
A survey on LLM-as-a-judge, 2025. URL https://arxiv.org/abs/2411.15594.

Chen Han, Wenzhen Zheng, and Xijin Tang. Debate-to-Detect: Reformulating misinformation detection as a real-world debate with large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 15125–15140, 2025.

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, and Wahid Sadik. CircuitLM: A multi-agent LLM-aided design framework for generating circuit schematics from natural language prompts. arXiv preprint, 2026.

Haorui He, Yupeng Li, Dacheng Wen, Yang Chen, Reynold Cheng, Donglong Chen, and Francis Lau. Debating truth: Debate-driven claim verification with multiple large language model agents. arXiv preprint, 2025.

Qisheng Hu, Quanyu Long, and Wenya Wang. Decomposition dilemmas: Does claim decomposition boost or burden fact-checking performance?, 2025a. URL https://arxiv.org/abs/2411.02400.

Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, and Li Qing. Removal of hallucination on hallucination: Debate-augmented RAG. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15839–15853, 2025b.

Zhe Hu, Hou Pong Chan, Jing Li, and Yu Yin. Debate-to-Write: A persona-driven multi-agent framework for diverse argument generation. In Proceedings of the 31st International Conference on Computational Linguistics, pp. 4689–4703, 2025c.

Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, and Xing Xie. Population-aligned persona generation for LLM-based social simulation. arXiv preprint, 2025d.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu.
A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, January 2025. ISSN 1558-2868. doi: 10.1145/3703155. URL http://dx.doi.org/10.1145/3703155.

Seyeon Jeong, Yeonjun Choi, JongWook Kim, and Beakcheol Jang. Tool-MAD: A multi-agent debate framework for fact verification with diverse tool augmentation and adaptive retrieval. arXiv preprint, 2026.

Weiqiang Jin, Dafu Su, Tao Tao, Xiujun Wang, Ningwei Wang, and Biao Zhao. Courtroom-FND: A multi-role fake news detection method based on argument switching-based courtroom debate. Journal of King Saud University - Computer and Information Sciences, 37(3):33, 2025. doi: 10.1007/s44443-025-00038-x. URL https://doi.org/10.1007/s44443-025-00038-x.

John Lawrence and Chris Reed. Mining argumentative structure from natural language text using automatically generated premise-conclusion topic models. In Ivan Habernal, Iryna Gurevych, Kevin Ashley, Claire Cardie, Nancy Green, Diane Litman, Georgios Petasis, Chris Reed, Noam Slonim, and Vern Walker (eds.), Proceedings of the 4th Workshop on Argument Mining, pp. 39–48, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-5105. URL https://aclanthology.org/W17-5105/.

Legal Information Institute. Daubert standard. https://www.law.cornell.edu/wex/daubert_standard, 2024.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

Baiyang Li, Xiaosong Li, YaYue Zhao, Anqi Zhuo, Yingxiao Zhao, and Shanhong Tang. A multi-agent debate approach based on large language models for scientific misinformation detection, 2025a.
Jiazheng Li, Yuxiang Zhou, Junru Lu, Gladys Tyen, Lin Gui, Cesare Aloisi, and Yulan He. Two heads are better than one: Dual-model verbal reflection at inference-time, 2025b.

Maoyuan Li, Zhongsheng Wang, Haoyuan Li, and Jiamou Liu. R-Debater: Retrieval-augmented debate generation through argumentative memory. arXiv preprint arXiv:2512.24684, 2025c.

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 17889–17904, 2024.

Tongxuan Liu, Xingyu Wang, Weizhe Huang, Wenjiang Xu, Yuting Zeng, Lei Jiang, Hailong Yang, and Jing Li. GroupDebate: Enhancing the efficiency of multi-agent debate using group discussion. arXiv preprint, 2024.

Zixuan Liu, Siavash H Khajavi, Guangkai Jiang, and Xinru Liu. Uncertainty-aware role-switching debate: Improving truthfulness in large language models. In Open Conference of AI Agents for Science 2025, 2025.

Jiatong Ma, Linmei Hu, Rang Li, and Wenbo Fu. Local: Logical and causal fact-checking with LLM-based multi-agents. In Proceedings of the ACM on Web Conference 2025, pp. 1614–1625, 2025.

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-Refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.

Lars Malmqvist. Sycophancy in large language models: Causes and mitigations. In Intelligent Computing - Proceedings of the Computing Conference, pp. 61–74. Springer, 2025.

Onat Ozer, Grace Wu, Yuchen Wang, Daniel Dosti, Honghao Zhang, and Vivi De La Rue. MAR: Multi-agent reflexion improves reasoning abilities in LLMs. arXiv preprint arXiv:2512.20845, 2025.
Minbae Park, Hyemin Yang, Jeonghyun Kim, Kunsoo Park, and Hyunjoon Kim. ProgRAG: Hallucination-resistant progressive retrieval and reasoning over knowledge graphs, 2025.

Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, et al. AI debate aids assessment of controversial claims. arXiv preprint, 2025.

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks, 2019.

Mohammad Reza Rezaei and Adji Bousso Dieng. Vendi-RAG: Adaptively trading off diversity and quality significantly improves retrieval-augmented generation with LLMs, 2025.

Mourad Sarrouti, Asma Ben Abacha, Yassine Mrabet, and Dina Demner-Fushman. Evidence-based fact-checking of health-related claims. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3499–3512, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.297. URL https://aclanthology.org/2021.findings-emnlp.297/.

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9248–9274, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.620. URL https://aclanthology.org/2023.findings-emnlp.620/.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning, 2023.

Oshayer Siddique, J.
M. Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, and Md Kamrul Hasan. PhysicsEval: Inference-time techniques to improve the reasoning proficiency of large language models on physics problems. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp. 738–760, Mumbai, India, December 2025. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics. ISBN 979-8-89176-303-6. doi: 10.18653/v1/2025.findings-ijcnlp.43. URL https://aclanthology.org/2025.findings-ijcnlp.43/.

Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D Barrett, and Arnu Pretorius. Should we be going MAD? A look at multi-agent debate strategies for LLMs. arXiv preprint arXiv:2311.17371, 2023.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, 2023.

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models, 2024. URL https://arxiv.org/abs/2404.18796.

Gengyu Wang, Kate Harwood, Lawrence Chillrud, Amith Ananthram, Melanie Subbiah, and Kathleen McKeown. Check-COVID: Fact-checking COVID-19 news claims with scientific evidence, 2023a.

Haotian Wang, Xiyuan Du, Weijiang Yu, Qianglong Chen, Kun Zhu, Zheng Chu, Lian Yan, and Yi Guan. Apollo's Oracle: Retrieval-augmented reasoning in multi-agent debates. arXiv preprint arXiv:2312.04854, 1, 2023b.

Haotian Wang, Xiyuan Du, Weijiang Yu, Qianglong Chen, Kun Zhu, Zheng Chu, Lian Yan, and Yi Guan. Learning to break: Knowledge-enhanced reasoning in multi-agent debate system. Neurocomputing, 618:129063, 2025.
Hao Wu and Prateek Saxena. Bias injection attacks on RAG databases and sanitization defenses, 2025.

Haolun Wu, Zhenkun Li, and Lingyao Li. Can LLM agents really debate? A controlled study of multi-agent debate in logical reasoning. arXiv preprint, 2025.

Miao Zhang, Junsik Kim, Siyuan Xiang, Jian Gao, and Cheng Cao. Dynamic role assignment for multi-agent debate. arXiv preprint, 2026.

Xinping Zhao, Yan Zhong, Zetian Sun, Xinshuo Hu, Zhenyu Liu, Dongfang Li, Baotian Hu, and Min Zhang. FunnelRAG: A coarse-to-fine progressive retrieval paradigm for RAG. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3029–3046, 2025.

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, and Andreas Vlachos. Demystifying multi-agent debate: The role of confidence and diversity. arXiv preprint, 2026.

A Related Work

Retrieval-Augmented Generation. RAG enhances factual grounding in large language models by incorporating external knowledge at inference time (Lewis et al., 2020; Gao et al., 2024). Recent work improves retrieval quality through structured pipelines such as coarse-to-fine retrieval (Zhao et al., 2025) and retrieval with reasoning (Park et al., 2025), helping mitigate hallucinations (Huang et al., 2025). Hybrid approaches further embed retrieval into multi-agent and iterative reasoning systems, including debate-augmented RAG (Hu et al., 2025b), Tool-MAD (Jeong et al., 2026), R-Debater (Li et al., 2025c), CircuitLM (Hasan et al., 2026), and multi-source retrieval frameworks (Gong et al., 2026), underscoring the need for adaptive evidence acquisition under conflicting or noisy conditions (Ge et al., 2025). However, most approaches rely on pre-collected evidence pools or limited iterative mechanisms, lacking retrieval that is continuously driven by structured deliberation.
Recent work also shows that claim decomposition's benefits in fact-checking depend on how sub-claims are constructed (Hu et al., 2025a), motivating its reuse as a means of extracting premise-level units to guide targeted retrieval.

Multi-Agent Debate. MAD has emerged as an effective paradigm for improving reasoning, factuality, and robustness in LLM outputs (Du et al., 2024), with variants spanning group-based discussions (Liu et al., 2024), efficiency-oriented architectures (Fan et al., 2025), and fact-checking frameworks (Han et al., 2025; Ma et al., 2025; He et al., 2025). Extensions incorporating credibility scoring and evidence aggregation further improve performance in high-stakes verification tasks (Dong & Ito, 2026; Li et al., 2025a; Rahman et al., 2025). By enabling agents to critique and refine each other's reasoning, debate reduces hallucinations and improves consensus quality. However, controlled analyses reveal persistent limitations including premature convergence, conformity bias, and sensitivity to agent configuration (Wu et al., 2025; Zhu et al., 2026; Smit et al., 2023), motivating more structured debate protocols.

Agent Coordination and Learning Dynamics. To address these limitations, recent work explores improved coordination mechanisms in multi-agent systems, including dynamic role assignment (Zhang et al., 2026), uncertainty-aware role switching (Liu et al., 2025), and self-reflection frameworks such as Self-Refine (Madaan et al., 2023) and MAR (Ozer et al., 2025; Siddique et al., 2025). Diversity and reasoning coverage are further improved through persona-based debate (Hu et al., 2025c;d) and divergent thinking strategies (Liang et al., 2024), while knowledge-enhanced and tool-augmented systems incorporate external signals to strengthen reasoning (Wang et al., 2025; Jeong et al., 2026).
Despite these advances, existing methods treat these components as independent objectives, whereas PROCLAIM repurposes them within a unified pipeline in which role-switching serves as a consistency diagnostic and self-reflection actively drives subsequent evidence retrieval.

Structured Deliberation and Courtroom Simulation. Structured debate frameworks introduce stronger inductive biases to stabilize multi-agent reasoning. Courtroom-style simulations such as AgentCourt (Chen et al., 2025) and AgenticSimLaw (Chun et al., 2026) demonstrate the effectiveness of explicit roles and adversarial interaction in high-stakes decision-making, while recent systems combine debate with retrieval for multi-round verification (Wang et al., 2023b; Hu et al., 2025b). While these approaches improve robustness, they typically lack tightly integrated, dynamic retrieval mechanisms that evolve alongside the debate process, limiting their applicability to open-domain fact verification.

Overall, prior work largely treats retrieval, debate, and coordination as separate or loosely coupled components. PROCLAIM advances this line of research by tightly integrating progressive retrieval, structured courtroom-style deliberation, and multi-level judicial evaluation within a single unified framework, where evidence acquisition is continuously guided by the evolving argumentation process, addressing key limitations in evidence evolution, reasoning stability, and decision reliability.

B Reproducibility

We provide the repository containing the full implementation of our framework, including all configurations, prompts, and evaluation scripts.

Source Code: https://anonymous.4open.science/r/PROClaim-2535/

B.1 Software Environment and Dependencies

All experiments were conducted using Python 3.8+. We isolated our reasoning engine behind external API calls to language models, but local orchestration relies on a strict set of dependencies.
The core local libraries encompass:

• Deep Learning Framework: PyTorch v1.7.0
• Vector Search Engine: faiss-cpu (configured with IndexFlatIP for inner-product cosine similarity searches)
• Text Embedding & Processing: transformers v3.4.0 and sentence-transformers
• Numerical Operations: NumPy v1.19.1

B.2 Data Processing and Corpus Assumptions

Input Dataset. We evaluate our framework on the Check-COVID dataset, focusing on claims from the test split with definitive labels (SUPPORTED or REFUTED). Claims are passed to the extraction agents without lexical normalization or stemming, preserving full linguistic semantics.

Knowledge Corpus. External knowledge is provided via a subset of PubMed abstracts related to COVID-19 (2020–2024). Offline processing embeds these abstracts into a 1.4 GB FAISS continuous vector index using the 384-dimensional all-MiniLM-L6-v2 model with normalized embeddings.

B.3 Agent Configurations

Our framework adopts a deliberate separation between argumentative and evaluative components to ensure reliable adjudication. Adversarial debate is conducted between heterogeneous models (GPT-5-mini and DeepSeek-V3.2), introducing diversity in reasoning styles and reducing the likelihood of homogeneous argument patterns. Evaluation is performed by a multi-model judicial panel, which includes DeepSeek-R1 alongside other independent large language models. Final decisions are determined via majority voting, ensuring that no single model exerts disproportionate influence over the outcome.

Importantly, we enforce architectural separation between generation and evaluation stages: models used for advocacy (e.g., DeepSeek-V3.2) are distinct from those used for judgment (e.g., DeepSeek-R1). This design ensures that argument construction and adjudication are carried out by systems with differing training paradigms and inference characteristics, thereby promoting more balanced and independent evaluation.
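The corpus indexing step in B.2 relies on inner-product search over L2-normalized embeddings, which is equivalent to cosine similarity. The pure-NumPy sketch below illustrates that equivalence without requiring faiss or sentence-transformers; in the actual pipeline, the rows would be 384-dimensional all-MiniLM-L6-v2 embeddings and the search would go through faiss.IndexFlatIP.

```python
import numpy as np

def build_index(vectors):
    # L2-normalize rows so that inner product equals cosine similarity,
    # mirroring what IndexFlatIP computes over normalized embeddings.
    v = np.asarray(vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def search(index, query, k=3):
    # Return indices and scores of the top-k most similar rows.
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top.tolist(), scores[top].tolist()
```

With faiss installed, `faiss.IndexFlatIP(384)` over the same normalized matrix returns identical rankings, which is why normalization at embedding time is essential.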
The exact model mappings and generation sampling temperatures (T) defined for each architectural role are as follows:

• Plaintiff Counsel (Proponent): gpt-5-mini (T = 0.5)
• Defense Counsel (Opponent): deepseek-v3.2 (T = 0.5)
• The Court: qwen3-235b-a22b-2507 (T = 0.2)
• Judicial Panel (Final Evaluation): a tripartite system combining deepseek-r1, hermes-3-llama-3.1-405b, and qwen3-235b-a22b-2507 (all strictly set to T = 0.3 for highly deterministic arbitration)
• Support Agents: Expert Witnesses are dynamically generated using hermes-3-llama-3.1-405b (T = 0.5), while the Critic, Argument Miner, and Consistency Analyzer rely on deepseek-r1 (T = 0.3)

B.4 Hyperparameters

The novelty threshold τ was selected via grid search over {0.10, 0.15, 0.20} on a held-out development subset of 20 claims, optimising for the tradeoff between evidence diversity and retrieval precision. Full settings are listed in Table 8.

Hyperparameter                   Value
Max debate rounds                10
Initial retrieval top-k          5
Per-round retrieval top-k        3
Novelty threshold                0.20
Admissibility floor              > 0.5 admitted, > 0.1 disputed
Redundancy similarity threshold  0.85
Redundancy ratio threshold       0.70
Relevance gain threshold         0.05

Table 8: Debate and retrieval hyperparameters.

C The 'Burden of Refutation' Standard

The framework's structural decision to formally classify 'Inconclusive' judicial panel verdicts as 'SUPPORT' is derived directly from the foundational legal concept of the burden of proof, which functions analogously to the presumption of innocence ("innocent until proven guilty"). In a traditional courtroom, a defending party does not need to absolutely guarantee their innocence to survive a trial; rather, the prosecution bears the strict burden of definitively proving guilt.
If the prosecution's evidence is ambiguous, controversial, or broadly insufficient, the presiding jury is legally obligated to return a "Not Guilty" verdict, even if they remain partially unsure. Our multi-agent debate architecture mirrors this decision-making protocol. When verifying controversial claims, the refuting agent functions as the prosecution. If the refuting agent fails to furnish enough concrete, high-quality evidence to convince the judicial panel to issue a definitive 'Refute' consensus, the panel will naturally return an 'Inconclusive' stance. Under the burden of proof, this 'Inconclusive' ruling means the prosecution failed to meet its evidentiary threshold. Consequently, the original claim legally and logically survives the adversarial trial, defaulting to 'Supported'.

This standard is a core necessity of the framework's design. It actively prevents the multi-agent system from stalling on highly ambiguous cases and faithfully operationalizes the adversarial mechanics used to evaluate the Check-COVID dataset.

D Per-Judge Ground-Truth Agreement (κ_GT)

While the main results table (Table 3) reports inter-judge agreement (κ̄), the degree to which the three judges agree with each other, it does not capture how well each individual judge aligns with the ground truth. We define κ_GT (Judge-vs-Ground-Truth Cohen's κ) as the Cohen's κ computed between a single judge's verdict and the ground-truth label, measured independently for each judge across all claims in a run. The distinction between κ̄ and κ_GT is analytically important: a panel can exhibit high inter-judge agreement while simultaneously producing systematically biased verdicts, a failure mode we term the epistemic bubble effect (Section 4.1). Table 9 reports the per-judge κ_GT across all runs and judge identities.
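The κ_GT metric defined above is ordinary Cohen's κ between one judge's verdicts and the ground-truth labels. A minimal self-contained sketch (equivalent in spirit to sklearn.metrics.cohen_kappa_score; note that κ is undefined when chance agreement equals 1, e.g. when both label sequences are constant):

```python
from collections import Counter

def cohen_kappa(judge, truth):
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    # agreement rate and p_e the chance agreement implied by the
    # two marginal label distributions.
    n = len(judge)
    p_o = sum(a == b for a, b in zip(judge, truth)) / n
    cj, ct = Counter(judge), Counter(truth)
    p_e = sum(cj[lab] * ct[lab] for lab in set(judge) | set(truth)) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Running this per judge against the gold labels, then averaging, reproduces the "Mean κ_GT" column structure of Table 9.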
The individual judge profiles corroborate the qualitative bias characterisation in Section 4.5: DeepSeek-R1 (J1) consistently achieves the highest κ_GT across runs, Hermes-3-LLaMA-405B (J2) yields the lowest κ_GT owing to frequent INCONCLUSIVE abstentions on SUPPORT claims, and Qwen3-235B-A22B (J3) occupies an intermediate position. Crucially, despite these individual biases, their error profiles are complementary, such that majority voting corrects rather than compounds them.

Run              κ_GT (J1)     κ_GT (J2)       κ_GT (J3)   Mean κ_GT
                 DeepSeek-R1   Hermes-3-405B   Qwen3-235B
Run-0            0.442         0.413           0.414       0.423
Run-1            0.402         0.367           0.347       0.372
Run-2            0.452         0.353           0.348       0.384
Majority Voting  0.452         0.431           0.468       0.450

Table 9: Per-judge ground-truth agreement (κ_GT) across runs and aggregation modes. Mean κ_GT is the unweighted average across the three judges.

E Kolmogorov–Smirnov (KS) Stability Analysis

The KS statistic measures the maximum divergence between the sequential per-round confidence distribution and the final empirical consensus distribution. As established in Section 4.4, the aggregate round index r tracks the continuous trajectory across both the primary debate phase (typically 2–3 rounds) and the subsequent role-switched debate phase (an additional 2–3 rounds). A decaying D_r value across this combined trajectory indicates that the multi-agent system successfully resolves uncertainty and converges toward a stable epistemic state, rather than oscillating randomly. Table 10 reports the average D_r values across the first eight aggregate rounds for all three independent runs, alongside the weighted average total.

Run    D_1    D_2    D_3    D_4    D_5    D_6    D_7    D_8     Avg. Stop Round
Run-0  1.142  0.027  0.333  0.242  0.097  0.129  0.104  -0.008  5.65
Run-1  1.131  0.017  0.455  0.347  0.178  0.138  0.127  -0.013  5.53
Run-2  1.137  0.018  0.590  0.287  0.226  0.197  0.131  0.038   5.36

Table 10: Kolmogorov–Smirnov (KS) stability statistics (D_r) across debate rounds.
Values closer to zero indicate increasing alignment with the final confidence distribution.

Discussion of Convergence Trends. As observed in Table 10, initial divergence is predictably high (D_1 ≈ 1.14), as the agents generate their initial arguments from the zero-shot discovery pool during the primary debate. Interestingly, we observe a sharp, transient alignment at Round 2 (D_2 ≈ 0.02) before divergence returns in Rounds 3 and 4 (D_3 ≈ 0.46).

This divergence pattern precisely mirrors the structural orchestration of PROCLAIM. Because the primary debate phase averages 2–3 rounds before triggering the role-switch configuration, D_3 and D_4 represent the peak of deliberative friction. At this threshold, agents are forced to reverse their advocacy positions and process newly surfaced counter-evidence, intentionally fracturing any premature consensus established during the initial phase.

Following this controlled disruption, the D statistic exhibits a steady, monotonic decay across all runs. This robust decay provides quantitative validation of the framework's adversarial resilience: as the role-switched agents engage in P-RAG discovery and self-reflection, the system's internal confidence state steadily recovers and stabilizes. By the time the aggregate debate approaches its average stopping point (~5.5 total rounds), the divergence is minimal (D_5 ≈ 0.16), confirming that the adaptive stopping criteria effectively halt generation just as the role-switched epistemic state converges. Thus, while the KS metric remains a macro-level temporal proxy, it corroborates the necessity of role-switching and the framework's capacity for evidence-driven resolution.
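The two-sample KS statistic underlying this analysis is the maximum vertical gap between two empirical CDFs; a minimal sketch is below (scipy.stats.ks_2samp computes the same quantity). Note that a raw two-sample KS statistic lies in [0, 1], so the reported D_r values, some of which exceed 1 or fall below 0, presumably include additional scaling or aggregation beyond the raw statistic.

```python
def ks_statistic(sample_a, sample_b):
    # D = max over x of |ECDF_a(x) - ECDF_b(x)|; the maximum gap is
    # always attained at one of the observed sample values.
    xs = sorted(set(sample_a) | set(sample_b))

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in xs)
```

In this framework's terms, `sample_a` would hold the agents' confidence scores at round r and `sample_b` the final consensus confidences, yielding D_r per round.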
F Pseudo-code

Algorithm 1: One Claim Evaluation Cycle of the P-RAG Multi-Agent Debate Framework

Require: claim c, PubMed FAISS index D, LLM agents {Proponent, Opponent, Judge, Critic}
Ensure: final verdict v ∈ {SUPPORT, REFUTE, INCONCLUSIVE}
 1: Mine atomic premises P = {p_1, ..., p_k} using the Argument Miner
 2: Retrieve initial evidence E_0 ← Retrieve(c, D)
    Evidence Negotiation & Arbitration:
 3: E_0 ← premise-grounded retrieval using P
 4: E_0 ← stance-conditioned retrieval for supporting and refuting evidence
 5: E_0 ← LLM admissibility scoring (Relevance × Credibility)
 6: Initialize debate state S_0 ← (P, E_0)
 7: for round t = 1 to T do
 8:     E_t ← PRAG(P, S_{t-1}, D)        ▷ progressive novelty-filtered retrieval
 9:     Proponent generates argument a+_t from E_t
10:     Opponent generates counterargument a−_t from E_t
11:     Call Expert Witness to produce testimony τ_t
12:     Compute self-reflection scores r+_t, r−_t
13:     Critic evaluates arguments and updates debate state
14:     S_t ← UpdateState(S_{t-1}, a+_t, a−_t, τ_t)
15:     if evidence novelty < ε or debate convergence reached or t = T then
16:         break
17:     end if
18: end for
19: Swap roles of Proponent and Opponent
20: Re-run debate with swapped roles to obtain consistency record S_swap
21: Send debate records {S_T, S_swap} to judicial panel {J_1, J_2, J_3}
22: for each judge J_i do
23:     v_i ← J_i(c, S_T, S_swap)
24: end for
25: v ← MajorityVote(v_1, v_2, v_3)
26: return v

G External Dataset Details

HealthVer. HealthVer is a benchmark dataset of health-related claims annotated for factuality, with a strong focus on COVID-19 misinformation. Claims are paired with evidence and labelled as supported or refuted. In our evaluation, we sample 100 claims uniformly at random. Since the dataset operates within the biomedical domain, we retain the same PubMed-based retrieval backend used for Check-COVID.

FEVEROUS.
FEVEROUS is a Wikipedia-based fact verification dataset that extends FEVER by requiring evidence from both unstructured text and semi-structured tables. Claims are annotated with supporting or refuting evidence from Wikipedia. For our experiments, we sample 60 claims and replace the retrieval module with a Wikipedia-based backend while keeping all other components unchanged.

H Prompt Templates

All prompts are reproduced verbatim from the pipeline source code. Variables enclosed in {braces} are filled at runtime. Each agent runs at the temperature listed in its slot definition.

H.1 Premise Decomposition Prompt
Agent: DeepSeek-R1

Given the following claim, decompose it into its core logical premises and sub-arguments that need to be verified.
Claim: {claim_text}
List each premise as a separate numbered point. Be thorough and identify both explicit and implicit assumptions that must hold for the claim to be true. Focus on scientific and medical aspects.

H.2 Admissibility Scoring Prompt

The Judicial Arbiter evaluates the admissibility of evidence using a joint weighting equation that enforces a distinction between relevance and scientific credibility:

w = relevance(q, e) × credibility(e). (6)

Items with w > 0.5 are automatically admitted, while items with 0.1 < w ≤ 0.5 are flagged as disputed. This product-based scoring ensures that "scientific hearsay" (high relevance but low credibility) is effectively excluded.

Arbiter Prompt. The exact zero-shot prompt used by the arbiter to generate these scores is provided below:

Evaluate the scientific relevance and credibility of the following evidence for the claim.
CLAIM: {claim}
EVIDENCE: {evidence_text}
Provide an evaluation based on:
1. Relevance: How directly does this evidence address the premises of the claim? (0.0 - 1.0)
2. Credibility: Does the evidence come from a reliable scientific context or contain high-quality data?
(0.0 - 1.0)

H.3 Plaintiff Counsel Prompt
Agent: GPT-5-mini

System Prompt:
You are the Plaintiff Counsel in a legal proceeding. Your role is to present arguments supporting the claim, interpret evidence favorably, challenge opposing arguments, and conduct examination of expert witnesses. Maintain a professional legal advocacy tone.

Per-turn Argument Generation Prompt:
You are participating in a structured legal proceeding.
- Maintain a professional, factual, and strictly evidence-based tone.
- Focus on proving or refuting the claim using the provided evidence and expert witness testimony.
- State your arguments clearly and concisely as you would in a courtroom.
- DIRECT OUTPUT ONLY: Do not reveal your internal thought process, scratchpad, or "thinking" steps. Output only your final argument.
Claim: {claim.text}
Your Role: Plaintiff Counsel
Instruction: As Plaintiff Counsel, present your case in SUPPORT of the claim. Use evidence and expert testimony to persuade the Court.
Available Evidence: {evidence_text}
Recent Debate History: {history_text}
Provide your statement (2-3 paragraphs, cite evidence by source ID)

H.4 Defense Counsel Prompt
Agent: DeepSeek-V3.2

System Prompt:
You are the Defense Counsel in a legal proceeding. Your role is to challenge the claim, identify weaknesses in arguments, contest evidence interpretation, and cross-examine expert witnesses. Maintain a professional legal defense tone.

Per-turn Argument Generation Prompt:
Identical in structure to the Plaintiff Counsel prompt above, with the role instruction replaced by:
Instruction: As Defense Counsel, present your case AGAINST the claim. Identify flaws and challenge the plaintiff's evidence and witnesses.

H.5 The Court (Presiding Judge) Prompt
Agent: Qwen3-235B-A22B (openrouter)

System Prompt:
You are The Court presiding over a legal proceeding.
Your role is to oversee the case, ensure professional conduct from all counsels, and determine when sufficient evidence and expert testimony have been presented for deliberation.

Query Refinement Prompt:
As the Court, you must maintain the quality and focus of evidence discovery. A counsel has proposed the following search query to retrieve additional exhibits:
Proposed Query: "{original_query}"
Context of proceedings: {debate_context}
Refine this query to be more precise, narrow the scope if necessary, and ensure it follows scientific rigor. Respond ONLY with the refined query string.

Debate Completion Check Prompt:
As the Court, review the proceedings. Have both counsels had sufficient opportunity to present their evidence and arguments?
Record Summary: {history_summary}
Should the proceedings continue or should we move to final deliberation? Respond 'Wait' to continue or 'Close' to finish.

Expert Witness Admissibility Prompt:
{requester} has requested to call an expert witness: {expert_type}
Reasoning: {reasoning}
As the Court, is this expert witness necessary for the thorough resolution of this case? Respond only with 'Granted' or 'Denied' followed by a brief reason.

H.6 Critic Agent Prompt
Agent: DeepSeek-R1 (openrouter)

System Prompt:
You are the Independent Critic Agent. Your role is to evaluate the debate rounds for logical coherence, evidence coverage, and rebuttal quality.

Round Evaluation Prompt:
You are the Critic Agent observing a courtroom-style scientific debate.
Claim: {claim}
Round: {round_num}
Recent Proceedings: {history_summary}
Analyze both the Plaintiff and Defense Counsel's performance in this round.
Score each side (0.0 to 1.0) on:
1. Logical Coherence: Argument flow and structure.
2. Evidence Coverage: How well they used admitted exhibits.
3. Rebuttal Coverage: Did they address the opponent's strongest points?
Identify any premises that remain "unresolved" or under-supported.
Provide actionable recommendations for both sides to improve their discovery and arguments.
Respond ONLY in valid JSON format:
{
  "plaintiff": { "logic": 0.0, "evidence": 0.0, "rebuttal": 0.0, "reasoning": "..." },
  "defense": { "logic": 0.0, "evidence": 0.0, "rebuttal": 0.0, "reasoning": "..." },
  "unresolved_premises": ["...", "..."],
  "recommendations": {
    "plaintiff": ["...", "..."],
    "defense": ["...", "..."],
    "queries": ["suggested search query 1", "..."]
  },
  "debate_resolved": false
}

H.7 Self-Reflection Prompt
Issued to: Plaintiff Counsel and Defense Counsel after each round.

You are the {job_title} ({side} Counsel). You have just completed Phase {round_num} of the proceedings.
CLAIM: {claim}
YOUR ARGUMENTS SO FAR: {my_args[-2:]}
{OPP_SIDE} COUNSEL'S CHALLENGES: {opponent_args[-2:]}
Perform a strictly professional self-audit:
1. Logical Coherence: Evaluate the flow and structural integrity of your arguments.
2. Evidence Novelty: Have you introduced truly new information or just repeated old points?
3. Rebuttal Coverage: How effectively did you address the {opp_side} counsel's latest points?
Identify:
- Critical gaps in your current evidence base.
- Premises you haven't sufficiently supported.
Respond ONLY in valid JSON format:
{
  "scores": { "logic": 0.0-1.0, "novelty": 0.0-1.0, "rebuttal": 0.0-1.0 },
  "flaws_identified": ["...", "..."],
  "discovery_need": "Specific evidence lookup query to fill a gap (1 sentence)",
  "refined_stance": "Summary of your improved position"
}

Note: The following aggregation and termination logic is executed programmatically by the orchestrator script and is strictly hidden from the LLM agents to prevent strategic manipulation of debate length. The weighted convergence score is calculated as:

S_total = 0.4 · logic + 0.3 · novelty + 0.3 · rebuttal.

The debate terminates when the absolute change satisfies |ΔS_total| < 0.05 across consecutive rounds.
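Concretely, the orchestrator-side check described in this note can be sketched as follows (helper names are illustrative, not the actual source; the weights and the 0.05 threshold follow the note above):

```python
# Weights from the orchestrator note: S_total = 0.4*logic + 0.3*novelty + 0.3*rebuttal.
WEIGHTS = {"logic": 0.4, "novelty": 0.3, "rebuttal": 0.3}

def total_score(scores: dict) -> float:
    """Weighted convergence score S_total from one agent's self-reflection JSON."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

def has_converged(prev_scores: dict, curr_scores: dict, eps: float = 0.05) -> bool:
    """Terminate when |ΔS_total| < eps across consecutive rounds."""
    return abs(total_score(curr_scores) - total_score(prev_scores)) < eps

# Plaintiff scores from the Appendix J transcript, Phases 1 and 2.
round_1 = {"logic": 0.78, "novelty": 0.45, "rebuttal": 0.62}  # S_total = 0.633
round_2 = {"logic": 0.85, "novelty": 0.60, "rebuttal": 0.70}  # S_total = 0.730
# |ΔS_total| = 0.097 > 0.05, so the debate continues, matching the transcript.
```

The phase scores reproduce the 0.633 and 0.730 totals reported in the example transcript of Appendix J.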
H.8 Expert Witness Prompt
Agent: Hermes-3-LLaMA-3.1-405B (openrouter)

System Prompt:
You are a scientific expert witness. Provide technical analysis based on your expertise.

The expert's role instruction within the argument prompt is:
Instruction: As an Expert Witness ({job_title}), provide your unbiased professional testimony regarding: {expertise_list}.

Expert Request Proposal Prompt (counsel-side):
Based on the current state of the proceedings, do you need to call an expert witness to clarify a specific point?
Recent Proceedings: {history_summary}
If yes, specify the type of expertise needed and why. If no, say 'None'.
Format: { "expert_type": "...", "reasoning": "..." } or "None"

H.9 Judicial Panel Prompt
Agents: three independent judges (DeepSeek-R1, Hermes-3-LLaMA-3.1-405B, Qwen3-235B-A22B)

Shared System Prompt:
You are an independent appellate judge presiding over a legal proceeding. Your role is to perform a comprehensive holistic evaluation of the case, focusing on evidence admissibility, logical coherence of advocacy, and scientific accuracy of expert testimonies.

Full Evaluation Prompt (6-stage):
You are an appellate judge evaluating the following proceedings for fact-checking.
PROCEEDINGS RECORD:
CLAIM: {claim}
PLAINTIFF COUNSEL'S ARGUMENTS: {proponent_args}
DEFENSE COUNSEL'S ARGUMENTS: {opponent_args}
ADMITTED EVIDENCE & EXPERT TESTIMONIES: {evidence_summary}
ROLE-SWITCH HISTORY (ADVERSARY CONSISTENCY): {role_switch_summary}
EVIDENCE DISCOVERY METRICS (PRAG EVOLUTION): {prag_metrics}
INDEPENDENT CRITIC EVALUATIONS (PROCESS INTEGRITY): {critic_evaluations}
AGENT SELF-REFLECTION TRENDS: {reflection_history}
Perform the following evaluation stages:
STAGE 1 -- CASE RECONSTRUCTION
Identify: Core claim; main supporting arguments from Plaintiff; main counterarguments from Defense.
STAGE 2 -- EVIDENCE & TESTIMONY WEIGHTING
Score: Evidence Strength (0--10)
0--3: Weak/irrelevant/unreliable.
4--6: Moderate with limitations.
7--10: Strong, credible, highly relevant.
STAGE 3 -- LOGICAL COHERENCE ANALYSIS
Detect logical contradictions, fallacies, misuse of evidence. Score: Argument Validity (0--10).
STAGE 4 -- SCIENTIFIC/TECHNICAL CONSISTENCY
Check alignment with consensus. Score: Scientific Reliability (0--10).
STAGE 5 -- DISCOVERY RIGOR & TRANSPARENCY
Analyze PRAG metrics: query evolution, evidence novelty, judicial refinement impact.
STAGE 6 -- JUDICIAL VERDICT
Determine: SUPPORTED / NOT SUPPORTED / INCONCLUSIVE.
Respond ONLY in valid JSON format:
{
  "claim_summary": "...",
  "evidence_strength": 0,       // 0-10
  "argument_validity": 0,       // 0-10
  "scientific_reliability": 0,  // 0-10
  "verdict": "SUPPORTED",       // or "NOT SUPPORTED" or "INCONCLUSIVE"
  "reasoning": "..."
}

Note on Verdict Label Mapping: While the judicial panel is prompted to output SUPPORTED, NOT SUPPORTED, or INCONCLUSIVE, these are programmatically mapped to the canonical Check-COVID dataset labels (SUPPORT and REFUTE) for all result logging and metric calculations. This terminology difference is a deliberate design decision: the judicial prompt uses NOT SUPPORTED to evoke a legal and scientific "burden of proof" framework. In practice, given our focus on adversarial resolution of binary claims (Section 3), a finding of NOT SUPPORTED after rigorous advocacy against the claim (including a role-switching consistency pass) is functionally equivalent to a REFUTE verdict.

I P-RAG Query Construction

To address the challenge of merging heterogeneous sources into a coherent search strategy, P-RAG does not issue multiple separate queries. Instead, it concatenates the three sources into a single directed prompt, formulates an initial query via a lightweight LLM, and passes it to the presiding Judge for rigorous refinement.
This guarantees that a single, highly targeted natural-language query is executed against the vector corpus per agent, per round.

Step 1: Gap Proposal (Counsel). Each counsel identifies a discovery need from the current debate state:

As {job_title}, analyze the current proceedings and identify a critical gap in the available exhibits. What specific evidence do you need to request to strengthen your case or challenge the opposition?
Context: {debate_context}
Propose exactly one specific evidence need (1 sentence):

Concatenation of Sources: The agent's newly identified evidential gap (Source 2) and any unresolved discovery need generated during the prior round's self-reflection (Source 3) are concatenated into a single agent request string:

{gap_proposal}. Focus also on: {reflection_discovery_need}

This combined agent request is then injected alongside the last four messages of the debate history (debate_context, Source 1) into the formulation prompt below.

Step 2: Query Formulation (P-RAG Engine). The combined discovery prompt is passed to the P-RAG LLM:

Based on the following proceedings context and legal request, formulate a precise search query to retrieve relevant exhibits and evidence.
Debate Context: {debate_context}
Agent Request: {agent_request}
Generate a concise search query (1--2 sentences) that will retrieve the most relevant evidence

Step 3: Judicial Query Refinement (The Court). Before retrieval executes, the formulated query is reviewed and refined by the presiding judge (The Court, Qwen3-235B-A22B):

As the Court, you must maintain the quality and focus of evidence discovery. A counsel has proposed the following search query:
Proposed Query: "{original_query}"
Context of proceedings: {debate_context}
Refine this query to be more precise, narrow the scope if necessary, and ensure it follows scientific rigor. Respond ONLY with the refined query string.
Novelty Scoring and Stopping Criteria. Retrieval applies the refined query to the FAISS index (all-MiniLM-L6-v2 embeddings). Each retrieved document d receives a novelty score:

η(d) = 1 − max_{e ∈ P} cos(v_d, v_e),

where P is the current evidence pool. Documents with η < 0.2 are rejected. Retrieval terminates early if any of the following criteria are met: (i) redundancy ratio > 0.70, (ii) relevance gain < 0.05 vs. the previous round, or (iii) round counter ≥ 10.

J Example Full Debate Transcript

We present a representative complete execution for claim 6063931df9b9e03ea4d8e441 0:

Claim: "Heart muscle cell damage is not an associated condition among hospitalized COVID-19 patients."
Ground Truth: REFUTE
Pipeline Verdict: REFUTE
Correct: ✓
Confidence: 0.902

Premise Decomposition (Argument Mining). The Argument Miner (DeepSeek-R1) decomposed the claim into:
1. Hospitalized COVID-19 patients have detectable levels of cardiac biomarkers indicative of heart muscle cell damage.
2. The prevalence of elevated cardiac biomarkers in hospitalized COVID-19 patients is comparable to a control group without COVID-19.
3. Incidence rates of heart muscle cell damage in hospitalized COVID-19 patients are not higher than in patients with other viral respiratory infections.
4. Clinical studies on hospitalized COVID-19 patients do not report significant occurrences of heart muscle cell damage.
5. There is no statistical association between COVID-19 infection severity and markers of heart muscle cell damage in hospitalized patients.
6. Autopsy findings of deceased hospitalized COVID-19 patients do not show evidence of heart muscle cell damage.
7. Hospitalized COVID-19 patients with pre-existing cardiac conditions do not have higher rates of heart muscle cell damage compared to those without pre-existing conditions.
Evidence Negotiation & Admission. Initial RAG retrieved 5 candidate documents; negotiation and arbitration admitted 21 exhibits (weights ranging 0.54-0.81), including: Source 32517963 (Journal of the American College of Cardiology, highest weight 0.81), Source 37600048 (Frontiers in Cardiovascular Medicine, weight 0.80), Source 33994028 (Current Cardiology Reports, 2022), and Source 35387441, among others. Two items were flagged for dispute.

Proceedings - Phase 1

Discovery (Plaintiff Counsel). Plaintiff proposed a need for controlled studies comparing myocardial injury biomarker rates in COVID-19 versus non-COVID critically ill cohorts. The Court refined the query to: "Retrieve individual participant data (IPD) from published meta-analyses and prospective cohort studies of hospitalized adult patients with confirmed COVID-19 that report myocardial injury defined by elevated cardiac troponin levels (above the 99th percentile upper reference limit) with concurrent assessment of cardiomyocyte damage, while adjusting for pre-existing cardiovascular disease, iatrogenic exposures, and other potential confounders, to evaluate the independent association between SARS-CoV-2 infection and myocardial injury . . . " 3 new exhibits admitted.

Argument - Plaintiff Counsel (GPT-5-mini, T = 0.5):
Your Honor, the evidence does not establish that heart muscle cell damage is an associated condition specific to hospitalized COVID-19 patients. Numerous high-quality investigations and meta-analyses document that cardiac biomarker elevations (e.g., troponin) occur in more severely ill patients and predict worse outcomes, but they do so as non-specific indicators of systemic illness rather than as proof of SARS-CoV-2-caused myocardial cell injury (Sources 3, 6, 2, 4, 18).
Moreover, studies find myocardial injury detected at a similar rate in patients with and without COVID-19 admitted to the same hospital setting, and it predicts short-term mortality to a comparable degree in both groups (Sources 12, 13). [. . . ] The record supports the conclusion that true heart muscle cell damage is not an established associated condition of hospitalization for COVID-19 but rather a non-specific marker of severe systemic disease.

Discovery (Defense Counsel). Defense proposed production of all patient-level data from cited meta-analyses to independently verify the control for confounding variables. The Court refined the query to target prospective cohort studies with troponin-defined myocardial injury and histopathologic confirmation. 3 new exhibits admitted.

Argument - Defense Counsel (DeepSeek-V3.2, T = 0.5):
Your Honor, opposing counsel's argument fundamentally mischaracterizes the extensive body of evidence. Direct pathological evidence confirms SARS-CoV-2 can cause cardiac cell damage. Autopsy studies have documented myocardial injury, including myocyte necrosis and myocarditis, providing definitive histological proof of heart muscle cell damage attributable to the infection (Sources 14, 15). Furthermore, elevated cardiac troponin is consistently linked to dramatically worse in-hospital outcomes including increased mortality, need for life-sustaining treatment, and longer hospitalization (Sources 1, 5, 16, 18). [. . . ] The weight of evidence from direct tissue analysis to large-scale clinical cohorts conclusively demonstrates that heart muscle cell damage is a significant and associated condition in hospitalized COVID-19 patients.

Expert Witness - Phase 1. Plaintiff requested a cardiologist with expertise in myocarditis and clinical epidemiology of infectious diseases. Granted by The Court. Expert testimony (Hermes-3-LLaMA-405B):
[. . .
] the evidence strongly supports the conclusion that heart muscle cell damage is indeed an associated condition among hospitalized COVID-19 patients. Autopsy studies have provided direct histological proof of myocardial injury, including myocyte necrosis and myocarditis, in COVID-19 patients (Sources 14, 15). This is not merely a biomarker correlation; it is direct anatomical evidence of SARS-CoV-2 causing cardiac cell damage. [. . . ] The plaintiff's claim that heart muscle cell damage is not an associated condition among hospitalized COVID-19 patients is not supported by the preponderance of the evidence presented.

Self-Reflection Scores - Phase 1.

Agent      Logic  Novelty  Rebuttal  Total   Discovery Need (summary)
Plaintiff  0.78   0.45     0.62      0.633   Controlled studies comparing myocarditis rates in COVID-19 vs. non-COVID respiratory illness
Defense    0.70   0.50     0.40      0.550   Studies demonstrating SARS-CoV-2 viral presence in cardiomyocytes

Critic Recommendations - Phase 1.
• Plaintiff: Introduce studies comparing troponin elevations in COVID-19 vs. other severe infections (e.g., sepsis) to demonstrate non-specificity; challenge statistical significance of autopsy findings.
• Defense: Provide epidemiological data linking autopsy-proven cardiac injury rates to hospitalization cohorts; address biomarker non-specificity by distinguishing acute myocardial injury patterns.

Convergence delta: ΔS = 1.183 (debate continues).

Proceedings - Phase 2

Argument - Plaintiff Counsel:
[. . . ] The medical literature does not establish heart muscle cell damage as a condition specifically associated with hospitalization for COVID-19.
Robust systematic reviews and meta-analyses consistently show that elevated cardiac biomarkers correlate with overall disease severity and worse prognosis but are non-specific markers of critical systemic illness rather than definitive evidence of SARS-CoV-2-mediated myocyte necrosis (Sources 33086028, 34416596, 33994028, 35691587, 32589258). Comparative cohort analyses report myocardial injury at similar frequencies and with similar prognostic significance in patients with and without COVID-19 admitted to the same institutions (Sources 32921872, 33144126). Autopsy and pooled post-mortem studies show heterogeneous cardiac findings without consistent, SARS-CoV-2-specific myocyte necrosis (Sources 34273507, 35339672).

Argument - Defense Counsel:
[. . . ] Multiple high-quality systematic reviews and meta-analyses demonstrate a consistent and significant association between COVID-19 severity and biomarkers of myocardial injury (Sources 3, 6, 2). The plaintiff's reliance on non-specificity of biomarkers ignores their established role as direct serological evidence of myocardial injury: a consistent, graded relationship between biomarker elevation and worse prognosis is confirmed across global cohorts (Sources 3, 6, 16). Crucially, autopsy and histopathologic analyses have identified myocarditis and other SARS-CoV-2-related cardiac pathologies in subsets of decedents, providing tissue-level corroboration beyond biomarker surrogates (Sources 14, 15, 25).

Self-Reflection Scores - Phase 2.

Agent      Logic  Novelty  Rebuttal  Total
Plaintiff  0.85   0.60     0.70      0.730
Defense    0.70   0.30     0.60      0.550

Convergence delta: ΔS = 0.097 > 0.05, debate continues. Phase 3 proceeded similarly; in Phase 3 the Court signalled sufficient evidence and invoked adaptive termination.

Final Judicial Panel Verdict

Judge    Model                    Ev. Str.  Arg. Val.  Verdict
Judge 1  DeepSeek-R1              7         8          NOT SUPPORTED
Judge 2  Hermes-3-LLaMA-405B      7         6          SUPPORTED
Judge 3  Qwen3-235B-A22B          8         7          NOT SUPPORTED

Majority Verdict: NOT SUPPORTED
Final pipeline verdict: REFUTE (confidence 0.902). Ground truth: REFUTE. ✓
Token usage: 441,109 total tokens. Retrieval calls: 40. Evidence pool: 124 documents.

K Role-Switching Case Study

The same claim (6063931df9b9e03ea4d8e441 0) is used to illustrate the role-switching mechanism.

Mechanism. After Phase 3, the orchestrator invokes RoleSwitcher.switch_roles(). The model previously acting as Plaintiff Counsel (GPT-5-mini, originally arguing SUPPORT) is reassigned the Defense role and vice versa. All debate state is reset; the evidence pool is restored to the 21 originally admitted exhibits. A fresh debate runs for up to 2 adaptive rounds.

Original Argument (Agent A as Plaintiff Counsel, Pro-Support).
Your Honor, the evidence does not establish that heart muscle cell damage is an associated condition specific to hospitalized COVID-19 patients. Meta-analyses and systematic reviews show a correlation between elevated cardiac biomarkers and COVID-19 severity and mortality, but explicitly characterize these biomarkers as prognostic markers of critical illness rather than definitive evidence of direct myocardial cell damage attributable to COVID-19 (Sources 3, 6, 2, 4, 18). Studies find myocardial injury detected at a similar rate in patients with and without COVID-19 admitted to the same hospital setting, and it predicts mortality to a comparable degree in both groups (Sources 12, 13). [. . . ] The record supports the conclusion that heart muscle cell damage has not been established as an associated condition among hospitalized COVID-19 patients.
Switched Argument (Agent A as Defense Counsel, Anti-Support). After the role swap, the same model (GPT-5-mini) now argues AGAINST the claim:

Your Honor, the claim that heart muscle cell damage is "not an associated condition" is definitively refuted by clinical, biochemical, and pathological data. Elevated cardiac troponin, a direct marker of cardiomyocyte damage, is a powerful and independent predictor of severe outcomes including mortality, prolonged hospitalization, and need for advanced life support in COVID-19 patients (Sources 1, 5, 18). The consistent, graded relationship between the degree of biomarker elevation and worse prognosis across global cohorts (Sources 3, 6, 16) confirms that this is a central feature of the disease's pathophysiology, not an incidental finding. Furthermore, autopsy studies provide incontrovertible histological proof of SARS-CoV-2-associated cardiac damage, documenting myocyte necrosis and myocarditis (Sources 14, 15, 31). [. . . ] We therefore ask the Court to reject the claim.

Consistency Analysis Output. The consistency analyzer (DeepSeek-V3.2, T = 0.3) evaluated both sets of arguments and returned:

Agent A (GPT-5-mini) Consistency: Agent A maintains logical consistency when switching roles. As Plaintiff Counsel, they focused on biomarker non-specificity and comparable injury rates in non-COVID populations to argue against a unique association. As Defense Counsel, they shifted focus to the graded prognostic relationship between biomarker elevation and outcomes and to direct histopathological evidence to argue for association. The core arguments are not contradictory but rather emphasize different aspects of the evidence depending on their role. Agent A adapts their stance to align with their new position without internal inconsistency. Consistency: 9/10.

Agent B (DeepSeek-V3.2) Consistency: Agent B maintains logical consistency when switching roles.
As Defense Counsel, they asserted biomarker elevation was proof of injury; as Plaintiff Counsel, they now argue it is merely correlative. While the conclusions flip, the underlying logic, demanding rigorous causal links, remains stable. Agent B consistently applies scrutiny to the opposition's claims that they previously defended against. The arguments are opposing but not internally inconsistent. Consistency: 8/10.

Contradictions Found: There are no direct contradictions in the arguments of either Agent A or Agent B when switching roles. Both agents adapt their arguments to align with their new positions by emphasizing different aspects of the evidence, for example, recontextualizing the same autopsy findings (Sources 14, 15) as either "heterogeneous and non-specific" (Plaintiff role) or "direct tissue-level corroboration" (Defense role). The arguments are opposing but not contradictory.

Overall Consistency Score: Consistency Score = 8.5/10. Both agents demonstrate high logical consistency when switching roles (Agent A: 9/10; Agent B: 8/10). The score is not 10/10 because while the arguments are logically consistent, they are inherently opposing, and minor inconsistencies, such as Agent B's shift from treating biomarkers as "definitive" (Defense) to "surrogate markers" (Plaintiff), reflect the inherent tension in adversarial role design rather than true internal contradiction.

Interpretation. The role-switching experiment confirms that the LLMs are not simply "parroting" their initial stance due to anchoring bias. When forced to argue the opposite position with identical evidence, both agents successfully construct valid, evidence-grounded arguments for the opposing side, without contradicting their prior reasoning. This provides empirical evidence that the agents are selectively marshaling evidence based on task instruction rather than hallucinating or pattern-matching their first output.
The high consistency score (8.5/10) also validates the adversarial structure: the same body of cardiac injury literature genuinely supports multiple framings (association versus non-specificity, biomarker surrogacy versus histopathologic corroboration), and the final verdict depends on the judicial panel's holistic synthesis rather than counsel advocacy alone.

L Embedding Normalization and Similarity Computation

This appendix details the embedding-based similarity computation used in evidence retrieval. Each claim q and abstract chunk d is mapped to a 384-dimensional vector v using the all-MiniLM-L6-v2 bi-encoder from the sentence-transformers framework (Reimers & Gurevych, 2019). To ensure that similarity depends only on semantic orientation, raw embeddings are L2-normalized:

v̂ = v / ||v||_2 = v / sqrt(Σ_{i=1}^{384} v_i²). (7)

The similarity between a query embedding v̂_q and a chunk embedding v̂_d is computed as their inner product, which is equivalent to cosine similarity for unit vectors:

sim(q, d) = v̂_q · v̂_d = cos(θ). (8)

The most semantically similar chunks are selected as the seed evidence pool, with source journal and publication year preserved for provenance context in subsequent deliberation.

M Confidence Score: Calibration and Adjustment Details

M.1 Confidence Calibration Analysis

To ensure that the confidence scores produced by PROCLAIM reflect empirical accuracy, we performed a post-hoc calibration of the consensus weighting parameter (W_consensus). The calibration aims to minimize the Expected Calibration Error (ECE), defined as:

ECE = Σ_{m=1}^{M} (|B_m| / N) · |acc(B_m) − conf(B_m)|, (9)

where N is the total number of samples, B_m is a bin of predictions within a specific confidence range, and acc(B_m) and conf(B_m) are the observed accuracy and mean confidence of that bin, respectively.

M.1.1 Methodology

We conducted an exhaustive grid search over W_consensus ∈ [0.5, 0.9] with a step size of 0.1.
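The ECE objective of Eq. 9, as minimized during this grid search, can be sketched as follows (the equal-width binning scheme and the sample values are illustrative assumptions, not the paper's actual predictions):

```python
def expected_calibration_error(confs, correct, n_bins=10):
    """ECE = sum_m (|B_m|/N) * |acc(B_m) - conf(B_m)| over equal-width bins."""
    n = len(confs)
    ece = 0.0
    for m in range(n_bins):
        lo, hi = m / n_bins, (m + 1) / n_bins
        # The last bin is closed on the right so a confidence of 1.0 is included.
        bin_idx = [i for i, c in enumerate(confs)
                   if lo <= c < hi or (m == n_bins - 1 and c == hi)]
        if not bin_idx:
            continue
        acc = sum(correct[i] for i in bin_idx) / len(bin_idx)   # acc(B_m)
        conf = sum(confs[i] for i in bin_idx) / len(bin_idx)    # conf(B_m)
        ece += (len(bin_idx) / n) * abs(acc - conf)
    return ece

# Illustrative predictions: three high-confidence, one moderate-confidence.
confs = [0.95, 0.9, 0.92, 0.65]
correct = [1, 1, 0, 1]
score = expected_calibration_error(confs, correct)
```

A lower score indicates confidences that track observed accuracy more closely, which is the criterion used to select W_consensus.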
To prevent test-set leakage, the optimal weights were determined via 5-fold cross-validation across the experimental metadata. The values for c_base were then adjusted by the secondary refinements (δ_rs and δ_ref).

M.1.2 Results

Table 11 summarizes the comparison between the baseline equal-weighting variant (W = 0.6) and our final calibrated model (W = 0.8). The 0.8 weighting achieved a near-optimal ECE of 0.0340, representing a significant reduction in over-confidence compared to the experimental variant.

Variant                    N    Accuracy  ECE
W_consensus = 0.6          118  95.76%    0.1802
W_consensus = 0.8 (Final)  118  95.76%    0.0340

Table 11: Impact of Consensus Weighting on Calibration Error.

The bucket analysis for the final model (Table 12) shows that the generated confidence scores closely align with observed accuracy, particularly in the high-certainty bins where the system's majority-vote consensus is strongest.

Confidence Bin  N   Observed Acc.  Mean Conf.
[0.6, 0.7)      4   75.00%         0.6685
[0.7, 0.8)      4   100.00%        0.7365
[0.8, 0.9)      20  95.00%         0.8639
[0.9, 1.0]      90  96.67%         0.9768

Table 12: Bucket Analysis for W_consensus = 0.8.

M.1.3 Confidence Aggregation and the Certainty Buffer

The coefficients in Equation 4 are purposefully scaled to provide a theoretical maximum greater than 1.0 (specifically 0.8 + 0.3 = 1.1) to establish a mathematical "certainty buffer" that ensures the framework's practical robustness. Epistemologically, confirming a controversial claim with definitive certainty should not necessitate absolute flawlessness across every granular adversarial sub-metric. By allowing the unadjusted calculation to exceed 1.0 prior to final metric clamping, the system guarantees that a unanimously supported claim (a 3-0 consensus) can safely absorb minor downstream penalties, such as a slightly imperfect role-switch or a pedantic self-reflection critique, without artificially degrading its definitive 1.0 confidence score.
Conversely, this weighted structure simultaneously acts as a strict safeguard against divided outcomes: in a "split court" scenario (a 2-1 vote), the consensus multiplier drops from 1.0 to 0.67, immediately pulling the base confidence calculation down to a heavily constrained 0.53 baseline (0.67 × 0.8). This intentional, dual-bounded design ensures that while inconsequential metric noise cannot derail a truly unanimous factual consensus, absolute high-confidence ratings remain mathematically inaccessible for any claim that fails to convince the entire adjudicating panel.

M.2 Role-Switch Consistency Mapping (δ_rs)

To ensure complete reproducibility, the mapping from the role-switch consistency score (γ ∈ [0, 10]) to the adjustment scalar (δ_rs) is defined as a discrete piecewise threshold function:

    \delta_{\mathrm{rs}}(\gamma) =
    \begin{cases}
        +0.10 & \text{if } \gamma \geq 7, \\
        0.0   & \text{if } 5 \leq \gamma < 7, \\
        -0.05 & \text{if } \gamma < 5.
    \end{cases}    (10)

Semantically, γ is not computed per-agent; rather, it serves as an aggregated, holistic metric that evaluates the stability of the entire debate's logical trajectory before and after the role exchange. The threshold boundaries of 5 and 7 partition the 10-point scale into distinct qualitative strata: γ ≥ 7 demarcates definitively strong consistency deserving of a positive scalar reward, scores in [5, 7) represent ambiguous or neutral role-switching performance where the baseline consensus remains unadjusted (0.0), and γ < 5 signifies a clear adversarial dialogue breakdown warranting a penalty.

Furthermore, we intentionally employ this asymmetric mapping—offering a larger +0.10 reward versus a smaller −0.05 penalty. Because LLM agents face inherent structural difficulties in strictly preserving complex persona states across continuous extended context windows, achieving high consistency (γ ≥ 7) is actively rewarded as a strong marker of reliability. Conversely, slight structural drift during role-switching is expected, so the penalty is constrained to −0.05 to ensure that a failed role-switch does not excessively override the primary evidentiary consensus.

M.3 Self-Reflection Asymmetric Bounding (δ_ref)

During the self-reflection phase, the winning agent generates a raw confidence adjustment scalar, denoted here as δ_raw ∈ [−0.30, +0.30] (derived from Eq. 3). However, to maintain the structural supremacy of the judicial panel over the individual debating agents, the framework limits the overall mathematical damage that an agent's post-hoc self-critique can inflict upon a finalized consensus.

To achieve this, the final self-reflection adjustment (δ_ref) is subjected to an explicit, asymmetric floor function:

    \delta_{\mathrm{ref}} = \max\left( -0.15,\; \delta_{\mathrm{raw}} \right)    (11)

Consequently, the effective operative range of the adjustment is bounded to [−0.15, +0.30]. This ensures that while exceptional self-validation can significantly boost the final confidence score (up to +0.30), extreme self-doubt is structurally constrained to a maximum penalty of −0.15. The specific threshold of −0.15 was chosen to cap the penalty at exactly half the magnitude of the theoretical maximum reward. This structural boundary ensures that while a reflective self-critique is incorporated into the final metric, a single agent's post-hoc self-doubt mathematically lacks the weight to unilaterally veto or completely overturn the established multi-agent majority vote.

N Consensus Edge-Cases

While the multi-agent panel primarily operates on a standard majority-vote consensus, a statistically rare edge-case may occur when all three evaluating judges return completely separate and distinct verdicts (e.g., one agent concludes 'SUPPORT,' another 'REFUTE,' and the third 'INCONCLUSIVE').
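Before turning to tie-breaking, the two adjustment mappings above can be made concrete in code. The sketch below implements Eqs. 10 and 11 directly; the final clamped combination is our own simplified assumption (the exact base-confidence formula is given by Eq. 4), and all function names are ours:

```python
def delta_rs(gamma):
    """Role-switch consistency adjustment (Eq. 10); gamma is in [0, 10]."""
    if gamma >= 7:
        return 0.10   # strong consistency: reward
    if gamma >= 5:
        return 0.0    # ambiguous / neutral: no adjustment
    return -0.05      # dialogue breakdown: penalty

def delta_ref(delta_raw):
    """Self-reflection adjustment (Eq. 11): floor delta_raw (in
    [-0.30, +0.30]) at -0.15, giving an effective range [-0.15, +0.30]."""
    return max(-0.15, delta_raw)

def final_confidence(c_base, gamma, delta_raw):
    """Simplified assumption: apply both refinements, then clamp to [0, 1].
    The 'certainty buffer' of Appendix M.1.3 lets the pre-clamp sum
    exceed 1.0, so unanimous verdicts can absorb minor penalties."""
    return max(0.0, min(1.0, c_base + delta_rs(gamma) + delta_ref(delta_raw)))
```

For instance, a unanimous pre-clamp base of 1.1 with a mild self-critique, final_confidence(1.1, 8, -0.05), still clamps to 1.0, whereas a split-court base of 0.53 cannot reach the top of the scale even with maximal rewards.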
In such highly disputed scenarios, the framework structurally defers to the judge DeepSeek-R1. DeepSeek-R1 was strategically designated as the Chief Justice because its specialized capabilities in deep Chain-of-Thought (CoT) reasoning provide unparalleled fidelity in tracing complex logical deductions and identifying fallacies during debate evaluation. Because it is established as the most analytically rigorous model on the panel, its initial independent evaluation is granted 'Chief Justice' priority-weighting to automatically break the deadlock. This ensures that in moments of complete ambiguity, the final determination safely relies on the panel's highest-quality reasoning trajectory without forcing an arbitrary or redundant meta-review cycle.

O Sycophancy and Argument Integrity Metrics

To rigorously evaluate the framework's susceptibility to sycophancy—where agents prematurely yield to opposition, abandon their persona, or collude without sufficient evidence (Malmqvist, 2025)—we track four quantitative metrics from the execution logs.

1. Role-Play Consistency (0–10). During the role-switching consistency test (Section 2.7), an independent consistency analyzer evaluates whether an agent successfully argues the opposing position using identical evidence without logically contradicting its prior arguments. The score reflects adherence to the persona constraints on a 10-point scale; lower scores indicate "leakage" or positional sycophancy, where an agent is unable to fully adopt the adversarial stance.

2. Concession Rate. We programmatically track explicit linguistic markers of concession and conversational yielding (e.g., "I concede," "you make a good point," "I partially agree") within the counsel transcripts. To normalize for varying debate lengths, the metric is reported as the frequency of such triggers per 1,000 generated words. A near-zero rate indicates high adversarial retention.

3. Reflection Plateau (ΔS). This metric is computed as the average absolute change in the cumulative self-reflection score (S_total) between consecutive debate rounds:

    \Delta S = \left| S^{(t)}_{\mathrm{total}} - S^{(t-1)}_{\mathrm{total}} \right|

For a given round, the maximum possible change is approximately 1.0 (depending on reflection adjustments). The early-stopping criterion conservatively halts the debate if ΔS < 0.05 (stagnation). In the context of sycophancy, an average ΔS ≈ 0.3823 is high relative to the stopping threshold; it empirically demonstrates that the agents maintain robust adversarial divergence and continually push the boundaries of their logic rather than comfortably colluding into a plateau.

4. Judicial Conformity (Fleiss' κ). To measure whether the three structurally heterogeneous LLM judges exhibit "rubber-stamping" or independent evaluation, we calculate Fleiss' Kappa (κ) over their final verdicts (SUPPORTED, NOT SUPPORTED, INCONCLUSIVE). A κ ≈ 0.4513 indicates moderate, authentic agreement. While confirming that they reach consensus on clear-cut cases, this moderate value shows they are not trapped in systemic conformity bias (κ → 1.0), instead independently evaluating the nuanced shortcomings of contested claims.

P Computational Cost and Scalability Analysis

Token Usage Overview. Table 13 reports token consumption across system configurations. The full PROCLAIM pipeline consumes an average of 210,900 tokens per claim. This figure reflects the cumulative cost of the primary debate, the role-switched debate, and the three-judge panel evaluation—the three structurally essential components. For context, Standard MAD consumes only 18,900 tokens per claim, making PROCLAIM approximately 11× more token-intensive. However, as the ablation demonstrates, each high-cost component corresponds directly to a measurable accuracy contribution: P-RAG (+7.5 pp), role-switching (+4.2 pp), and the three-judge panel (+3.3 pp).
The one component that does not contribute proportionally to accuracy—self-reflection—is also the one that reduces token usage: disabling it increases consumption from 210,900 to 247,300 tokens (+17%) while recovering only 0.8 pp, confirming that self-reflection functions as an economic governor rather than a performance driver.

    System                      Avg Tokens/Claim (K)   120 Claims (M)   1K Claims (M)   10K Claims (B)   Acc.
    Standard MAD                18.9                   2.3              18.9            0.19             71.7%
    PROCLAIM w/o Self-Refl.     247.3                  29.7             247.3           2.47             80.8%
    PROCLAIM w/o Role-Switch    147.3                  17.7             147.3           1.47             77.5%
    PROCLAIM w/o P-RAG          188.9                  22.7             188.9           1.89             74.2%
    PROCLAIM (Full)             210.9                  25.3             210.9           2.11             81.7%

Table 13: Token usage per claim across system configurations and projected totals at scale. Projections assume linear scaling with no batching discount.

Cost-Accuracy Trade-off as a Pareto Front. Figure 5 situates PROCLAIM relative to its ablated variants on the accuracy–token-cost plane. Although the full pipeline is not the cheapest configuration, it occupies the Pareto-optimal frontier: no single-ablation variant achieves equal or higher accuracy at lower cost. Notably, removing P-RAG saves only 22,000 tokens per claim (−10%) while costing 7.5 accuracy points, the worst trade-off of any ablation; the evidence pool nearly halves (37.5 vs. 67.5 documents), and the saved tokens are consumed by longer debates on weaker evidence (6.00 vs. 5.47 rounds on average). By contrast, self-reflection offers the most favourable trade: it consumes a disproportionately small share of total tokens while cutting rounds by 29% (7.06 → 5.47) and token usage by 17%, positioning it as the primary cost-control lever in any resource-constrained deployment.

P.1 When Is the Cost Justified?

Use-case determines the cost threshold. Token expenditure should be evaluated relative to the downstream cost of an incorrect verdict, not in isolation.
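The Pareto-front reading of Table 13 can be checked mechanically. The sketch below (token and accuracy values transcribed from the table; helper names are ours) marks a configuration as dominated when some other configuration uses no more tokens, is at least as accurate, and improves strictly on at least one axis:

```python
# (avg tokens per claim in K, accuracy %) transcribed from Table 13
configs = {
    "Standard MAD":             (18.9, 71.7),
    "PROCLAIM w/o Self-Refl.":  (247.3, 80.8),
    "PROCLAIM w/o Role-Switch": (147.3, 77.5),
    "PROCLAIM w/o P-RAG":       (188.9, 74.2),
    "PROCLAIM (Full)":          (210.9, 81.7),
}

def dominated(name):
    """True if some other config is no worse on both axes (tokens,
    accuracy) and strictly better on at least one of them."""
    tok, acc = configs[name]
    return any(
        t <= tok and a >= acc and (t < tok or a > acc)
        for other, (t, a) in configs.items()
        if other != name
    )
```

Running this reproduces the claims above: PROCLAIM (Full) and Standard MAD are undominated, while w/o P-RAG is strictly dominated (w/o Role-Switch is both cheaper and more accurate).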
In the Check-COVID setting, a false refutation of an evidence-backed health claim may lead practitioners to dismiss clinically valid guidance, while a false support of a misinformation claim may propagate harmful advice at scale. Regulatory and public-health domains place the cost of an erroneous verdict in the range of reputational, legal, and human-welfare consequences that dwarf any inference budget. Under this framing, PROCLAIM's 11× token overhead over Standard MAD is best read as a 10.0 pp accuracy uplift (71.7% → 81.7%) purchased at a marginal token surcharge, not as a raw cost increase.

Figure 5: Cost–accuracy Pareto front across system configurations (Accuracy (%) vs. Avg. Tokens per Claim (K)). PROCLAIM (Full) lies on the efficient frontier; the dashed line traces Pareto-optimal points. w/o P-RAG is strictly dominated: it saves fewer tokens than w/o Role-Switch while incurring a larger accuracy penalty.

Auditability has a value that accuracy alone does not capture. A black-box verdict offers no mechanism for human reviewers to interrogate why a claim was accepted or rejected, which evidence was admitted, whether the opposing case was meaningfully considered, or whether the system's confidence is calibrated. PROCLAIM produces a structured case record—comprising admitted evidence with admissibility weights, per-round argument transcripts, self-reflection trajectories, critic evaluations, role-switch consistency scores, and a six-stage judicial opinion for each judge—that directly supports post-hoc audit. In regulated domains such as healthcare, finance, or legal proceedings, this deliberative traceability is not merely desirable but is increasingly mandated by emerging AI governance frameworks.

Trajectory instability as a reliability signal.
Section 4.2 documents that incorrect predictions exhibit oscillating self-reflection trajectories, a behavioural signature absent from confident correct predictions. This instability signal is entirely invisible in single-call pipelines, where confidence is reported as a single scalar that correlates poorly with calibrated accuracy (ECE = 0.18 for naïve averaging vs. 0.034 for PROCLAIM; Appendix J). The multi-round deliberation process is thus a prerequisite for generating the per-round evidence from which this diagnostic is derived, meaning the additional token cost is precisely what enables reliability estimation beyond the final answer.

Heterogeneous adjudication and the value of diversity. Section 4.5 establishes that heterogeneous judicial panels produce genuine disagreement in 55.8% of cases, and that disagreements correct rather than compound individual judge errors. A homogeneous single-judge panel sacrifices 3.3 pp of accuracy by amplifying shared biases (e.g., the structural negativity bias documented for DeepSeek-R1, which over-produces REFUTE verdicts). The additional inference cost of three independent judges is thus the mechanism by which systematic model bias is suppressed—an effect that cannot be replicated by calling a single, larger model.

P.2 Pathways to Cost Reduction

PROCLAIM's deliberative architecture is not inherently incompatible with cost efficiency; rather, its current implementation prioritises correctness and interpretability over throughput. As noted in Section 5, several avenues exist to reduce token usage without sacrificing the core deliberative properties of the framework. Early-exit mechanisms beyond the current reflection plateau could further reduce average rounds; retrieval filtering via tighter admissibility thresholds could shrink evidence pools; and model distillation could replace large-parameter role assignments (e.g., Hermes-3-LLaMA-405B) with smaller task-specialised models.
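As a minimal illustration of the first avenue, the reflection-plateau early exit (the ΔS < 0.05 stagnation criterion described in Appendix O) can be expressed as a per-round stopping check. This is a hypothetical sketch with names of our own choosing:

```python
PLATEAU_THRESHOLD = 0.05  # halt once |ΔS| falls below this (stagnation)

def should_stop(s_total_history):
    """Early-exit check on the cumulative self-reflection score S_total.

    s_total_history holds one S_total value per completed debate round;
    the debate halts when the round-to-round change plateaus.
    """
    if len(s_total_history) < 2:
        return False  # need at least two rounds to measure a change
    delta_s = abs(s_total_history[-1] - s_total_history[-2])
    return delta_s < PLATEAU_THRESHOLD
```

A trajectory such as [0.40, 0.78, 0.80] would halt after the third round (ΔS = 0.02), whereas the reported average ΔS ≈ 0.38 sits well above the threshold and keeps debates running.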
