Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness

Phat Nguyen¹ [0009-0004-0671-2487] and Thang Pham² [0000-0003-3984-641X]

¹ Georgia Institute of Technology, Atlanta GA 30332, USA
cherry.07.skr@gmail.com
² Adobe Inc., San Jose CA 95110, USA
thanpham@adobe.com

Abstract. Multi-agent systems based on large language models (LLMs) for financial trading have grown rapidly since 2023, yet the field lacks a shared framework for understanding what drives performance or for evaluating claims credibly. This survey makes three contributions. First, we introduce a four-dimensional taxonomy covering architecture pattern, coordination mechanism, memory architecture, and tool integration, and apply it to 12 multi-agent systems and two single-agent baselines. Second, we formulate the Coordination Primacy Hypothesis (CPH): inter-agent coordination protocol design is a primary driver of trading decision quality, often exerting greater influence than model scaling. CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion; its definitive validation requires evaluation infrastructure that does not yet exist in the field. Third, we document five pervasive evaluation failures (look-ahead bias, survivorship bias, backtesting overfitting, transaction cost neglect, and regime-shift blindness) and show that these can reverse the sign of reported returns. Building on the CPH and the evaluation critique, we introduce the Coordination Breakeven Spread (CBS), a metric for determining whether multi-agent coordination adds genuine value net of transaction costs, and propose minimum evaluation standards as prerequisites for validating the CPH.

Keywords: Multi-agent systems · LLM agents · financial decision-making · coordination mechanisms · portfolio optimization · trading evaluation

1 Introduction

Financial markets reward decision quality under uncertainty. When LLM-based agents entered this domain, the initial approach treated the entire investment workflow as a single prompt-to-trade pipeline. These monolithic systems face a fundamental constraint: a single context window cannot simultaneously perform macroeconomic regime identification, earnings quality assessment, technical pattern recognition, risk budgeting, and execution optimization. The dominant paradigm now decomposes these tasks across cooperating agents, and systems such as FinCon [25], TradingAgents [22], and HedgeAgents [15] report substantial gains, with claimed cumulative returns ranging from 23% to over 400%. However, the field has not established whether these gains reflect genuine design advances or artifacts of inconsistent evaluation methodology.

This survey addresses that gap. Rather than cataloguing systems or ranking them by reported performance, we ask a more tractable prior question: Given that cross-system comparisons are currently unreliable, what design choices are most worth investigating rigorously once evaluation standards improve? Our approach proceeds in four steps: a taxonomy decomposing the design space (Section 3), documentation of five systematic evaluation failures (Section 4), the Coordination Primacy Hypothesis (CPH) derived from structural patterns that survive those failures (Section 5), and the Coordination Breakeven Spread (CBS) metric that operationalizes the hypothesis in deployment (Section 6). Our contributions are analytical; we do not report new empirical results.

2 Related Work

Ding et al. [6] survey LLM agents for financial trading but treat multi-agent coordination as one topic among many, without systematic comparison of coordination trade-offs. General multi-agent surveys [11,20] analyze communication protocols abstractly, without confronting domain-specific constraints such as coordination latency measurable in basis points of adverse price movement. Sun et al. [19] survey LLM-based multi-agent reinforcement learning; no published financial system has successfully integrated LLM agents with formal optimization guarantees. FinCon's verbal reward function is a step toward structured decision optimization but lacks formal convergence guarantees.

Our work differs in two respects. We treat financial multi-agent systems (MAS) as decision architectures rather than software architectures, so every design description is accompanied by its decision-quality implication. We also foreground evaluation methodology as a first-class concern: the gap between reported and robust performance is large enough to affect whether published claims should be acted upon.

3 A Taxonomy of Design Patterns

3.1 System Selection

Candidate systems were drawn from the LLM-based financial trading literature published between 2023 and 2026. Inclusion required that a system employ distinct LLM roles for active trading or portfolio allocation and provide sufficient architectural detail for four-dimensional classification. Purely single-agent systems, rule-based pipelines, and systems lacking architectural documentation were excluded.
These criteria yield 12 systems, selected to maximize coverage across the four taxonomy dimensions: FinCon [25], TradingAgents [22], HedgeAgents [15], ContestTrade [28], FinVision [10], TradingGPT [13], QuantAgents [16], AlphaAgents [2], ElliottAgents [21], FinRobot [24], Agentic RAG [5], and the Chinese Public REITs system [12]. We include FinMem [26] (memory-focused) and FinAgent [27] (tool-augmented) as single-agent baselines for comparison in Table 1.

[Figure 1 panels: D1 Architecture Pattern (Hierarchical, Role-Based, Debate-Based, Pipeline); D2 Coordination Mechanism (Structured Debate, Hierarchical Report, Conference-Based, Competitive Eval); D3 Memory Architecture (Layered Temporal, RAG-Based Retrieval, Episodic Verbal, Shared Blackboard); D4 Tool Integration (Read-Only Access, Interactive Compute, Verifier-Gated Execution).]

Fig. 1: Taxonomy of LLM-based multi-agent financial systems across four design dimensions. FinVision, ElliottAgents, REITs System, and Agentic RAG are not listed in Coordination Mechanism because they simply utilize a sequential pass-through (minimal coordination).

3.2 Four Dimensions

We classify systems along four dimensions (Fig. 1), intended to decompose hybrid designs so that future controlled experiments can vary one dimension while holding others constant.

D1: Architecture Pattern. Hierarchical architectures use a manager agent to arbitrate specialist inputs; the manager's ability to weight conflicting inputs is the binding constraint. Role-based designs map agents to professional departments, supporting auditability. Debate-based architectures improve calibration under signal ambiguity but incur overhead that is costly when execution speed matters. Pipeline architectures offer low latency but no error correction.

D2: Coordination Mechanism. Structured debate improves accuracy over two to four rounds [7,4] but risks Degeneration-of-Thought [17], where agents converge to a shared wrong answer through social pressure. Hierarchical reporting uses selective knowledge propagation to reduce noise, ensuring only decision-relevant feedback reaches specialists. Conference-based coordination activates group discussions adaptively but requires precise triggers to avoid activating complex protocols during routine trading. Competitive evaluation rewards contrarian accuracy rather than consensus, avoiding consensus bias entirely. The absence of competitive risk-adjusted returns among systems using a sequential pass-through provides indirect motivating evidence for the CPH (Section 5.2), though this observation is subject to the same evaluation confounds documented in Section 4 and should be treated as motivating rather than evidential.

D3: Memory Architecture. Layered temporal memory risks assuming fixed relevance decay during structural breaks. RAG-based retrieval allows high-granularity data access but introduces experience-following behaviour [23], amplifying anchoring bias. Episodic verbal memory supports compliance and auditability but risks update lag. Shared blackboard state enables real-time sharing but propagates errors system-wide.

D4: Tool Integration.
Read-only access depends on LLM numerical reasoning, a documented weakness. Interactive computation addresses this but introduces code correctness as a failure mode. Verifier-gated execution validates outputs before action and is preferred for institutional deployment.

3.3 Cross-System Observations

Table 1 reports metrics for the eight systems that publish them; the remaining six (TradingGPT, AlphaAgents, ElliottAgents, FinRobot, REITs System, Agentic RAG) are framework papers without standardized trading metrics. The evaluation quality assessment scores each system against the five criteria introduced in Section 4: no system satisfies more than two criteria. These eight systems differ along at least five methodological dimensions simultaneously: evaluation period, asset universe, market regime, cost model, and baseline choice. Normalizing across these dimensions would produce numbers that appear comparable but conceal the differences that matter most for practitioners. Instead, we assess how trustworthy each system's reported numbers are (Evaluation Quality column), and Section 4 addresses the evaluation inconsistencies in depth.

Although the results in Table 1 are not directly comparable, a qualitative structural pattern is worth noting. Systems with explicit coordination mechanisms more often report extended evaluation horizons and live deployment: HedgeAgents posts a 405% three-year return and QuantAgents sustains a live Sharpe ratio of 1.76–2.02 (both subject to the evaluation quality limitations in Section 4), whereas pipeline designs rarely disclose comparable long-horizon, risk-adjusted results. High Sharpe ratios observed over short, bullish windows (e.g., debate-based TradingAgents evaluations) highlight the importance of temporal robustness rather than establishing superiority.
While not causal evidence, this pattern suggests coordination complexity may correlate with demonstrated robustness, motivating the CPH (Section 5).

Note: Metrics across rows are not directly comparable owing to differences in evaluation period, asset universe, market regime, and cost assumptions.

Table 1: Reported performance metrics. Evaluation quality scores each system against five criteria: (1) contamination control, (2) point-in-time universe, (3) rolling-window reporting, (4) net-of-cost returns, (5) regime coverage. ✓ = satisfied, × = not satisfied, n/a = not reported. † FinMem's reported 23% return on MSFT reversed to −22% under FINSABER controlled conditions.

| System (Agents) | Sharpe Ratio | Cumulative Return | Annual Return | Max Drawdown | Eval Period | Evaluation Details | Contam. | Point-in-Time | Rolling Win. | Net Costs | Regime Cov. | Evaluation Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FinAgent [27] (1+tools) | 1.43–2.01 | n/a | 31.9–92.3% | 5.57–13.2% | ~6 mo | Six datasets; stocks and crypto. Best ARR 92.27% on TSLA. | ✓ | × | × | × | ✓ | 2/5 |
| FinCon [25] (7+1) | 3.26 | 114% | n/a | 16.2% | ~18 mo | Jan 2022–Jun 2023; 6 US stocks. | ✓ | × | × | × | ✓ | 2/5 |
| HedgeAgents [15] (4+1) | 2.41 | 405% | 71.60% | ~14% | 3 yr | 2021–2023; BTC, DJ30, Forex; multi-asset. | ✓ | × | × | × | ✓ | 2/5 |
| QuantAgents [16] (Multi) | 3.11 | ~300% | 58.7% | 16.86% | 3 yr + live | Jan 2021–Dec 2023; NASDAQ-100. | ✓ | × | × | × | ✓ | 2/5 |
| QuantAgents [16], live | 1.76–2.02 | 98–112% | n/a | n/a | Q3 2024–Q1 2025 | Live trading; A-stock and HK-stock markets. | | | | | | |
| ContestTrade [28] (Multi) | 3.12 | 52.8% | n/a | 12.41% | n/a | NASDAQ-100. | ✓ | × | × | × | × | 1/5 |
| FinVision [10] (4) | 1.20–1.72 | n/a | 14.8–42.1% | 12.09–14.38% | ~7 mo | AAPL, MSFT, AMZN; predominantly bullish window; pipeline architecture. | ✓ | × | × | × | × | 1/5 |
| TradingAgents [22] (7) | 5.60–8.21 | 23–27% | 24.9–30.5% | 0.91–2.1% | 3 mo | Jan–Mar 2024; AAPL, GOOGL, AMZN. Sharpe ratio inflated by short bullish window. | ✓ | × | × | × | × | 1/5 |
| FinMem [26] † (1) | 0.23–2.67 | 23–61.7% | n/a | 10.8–22.9% | ~1 yr | Oct 2022–Apr 2023. | × | × | × | × | × | 0/5 |
| FinMem [26] †, re-evaluated | 1.4 → −1.24 | 23 → −22% | | 14.9 → −29 | 6 mo → 8 mo | MSFT only; under FINSABER controlled re-evaluation. | | | | | | |

Evaluation quality legend: 4–5/5 relatively credible; 2–3/5 partial credibility; 0–1/5 low credibility.

4 Evaluation Failures in the Published Literature

The design diversity in Section 3 means that any observed performance difference between two systems could reflect a genuine design advantage, a difference in evaluation conditions, or both. Five systematic failures prevent these explanations from being distinguished, making cross-system comparison unreliable as a basis for design conclusions.

4.1 Look-Ahead Bias

LLMs trained through 2024 may have encountered financial outcomes for periods used in backtesting, effectively retrieving rather than predicting. StockBench [3] addresses this with DJIA data from March to July 2025; most LLM-based agents fail to outperform buy-and-hold under these conditions. A second manifestation is feature leakage through retrieval: imprecisely timestamped RAG databases can inject future information into historical queries. FinAgent's multi-step retrieval and Agentic RAG's cross-encoder re-ranking are both vulnerable in the absence of documented timestamp controls.

4.2 Survivorship Bias

Most systems evaluate on stock universes selected at evaluation time, excluding delisted companies that are disproportionately poor performers. Elton et al. [8] estimated 0.9% annual survivorship bias in mutual fund returns; for individual stock selection the effect is larger. FINSABER [14] addresses this with historical index constituent lists.

4.3 Backtesting Overfitting

LLM-based multi-agent systems have extensive hyperparameters (agent count, debate rounds, memory depth, temperature, prompt templates, evaluation windows), creating a combinatorial space prone to overfitting.
FinMem's reported 23.26% cumulative return on MSFT became −22.04% under a slightly different but equally defensible window with transaction costs included [14]. A sign reversal of this magnitude is consistent with an overfitted system.

4.4 Transaction Cost Neglect

Round-trip costs of 10 to 20 basis points can accumulate to 25 to 50 percentage points of annual drag for daily-trading systems (one round trip per day over roughly 252 trading days). Of the works surveyed, only FINSABER and StockBench explicitly model transaction costs. HedgeAgents' 405%, FinCon's 114%, and ContestTrade's 52.8% are all gross of costs. This failure is particularly consequential for MAS: coordination-driven signal improvements may increase trading frequency without proportionally improving per-trade alpha, converting a nominal performance advantage into net underperformance.

4.5 Regime-Shift Blindness

Most evaluations cover six to twelve months within a single market regime, providing no cross-regime evidence. Only HedgeAgents explicitly addresses regime adaptation; its three-year evaluation spanning 2021–2023 is the strongest available cross-regime evidence among surveyed systems. TradingAgents reports an extraordinary Sharpe of 5.60 to 8.21 based solely on a three-month bullish window (January–March 2024) during rallies in AAPL, GOOGL, and AMZN. A Sharpe ratio at this level, annualized from a single favourable regime, is more consistent with trend following during a rally than with genuine risk-adjusted alpha, and illustrates how regime-shift blindness produces unreliable metrics.

4.6 Consolidated Minimum Standards

1. Contamination control. Evaluation period should post-date model training, or a post-training ablation should be provided.
2. Point-in-time universe. Asset universe should reflect historical index composition at each evaluation date.
3. Rolling-window reporting. Performance across multiple non-overlapping windows with variance estimates.
4. Net-of-cost returns. Explicit transaction cost model covering commissions, half-spread, and market impact.
5. Regime coverage. Evaluation spanning multiple regimes or explicit adversarial stress testing.

No system in our survey satisfies all five (see evaluation quality scores in Table 1). FinCon, HedgeAgents, FinAgent, and QuantAgents each satisfy only 2/5, while TradingAgents, ContestTrade, and FinVision reach just 1/5. Notably, FinMem scores 0/5, with its reported 23% return on MSFT reversing to −22% under controlled re-evaluation. Building a benchmark satisfying all five simultaneously is among the most pressing infrastructure needs in this field.

5 The Coordination Primacy Hypothesis

5.1 Motivation

The evaluation failures above make precise quantitative comparison unreliable, but structural observations remain informative even when specific return figures do not: which systems survive the transition to live trading, which coordination patterns appear consistently across independent research groups, and which designs collapse under controlled re-evaluation. The CPH is derived from these patterns rather than from cross-system performance rankings.

5.2 Hypothesis Statement

The Coordination Primacy Hypothesis holds that the inter-agent coordination protocol is the most consequential structural factor in trading decision quality among the four taxonomy dimensions, exerting greater influence than model selection. This is a falsifiable claim: upgrading the LLM backbone within a fixed coordination protocol should yield smaller performance improvements than replacing the coordination protocol, holding all other design choices constant. The hypothesis does not assert that coordination is sufficient, only that it is the most consequential dimension to optimize.
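
The falsifiable comparison stated in the hypothesis can be sketched as a small effect-size check. The following is a minimal illustration, not a procedure from any surveyed system: the function names (`main_effects`, `consistent_with_cph`), the protocol and backbone labels, and every number in `PLACEHOLDER_RESULTS` are hypothetical and chosen only so the arithmetic is easy to follow. The metric is assumed to be something like a net-of-cost Sharpe ratio measured under identical evaluation conditions for every cell.

```python
from statistics import mean


def main_effects(results):
    """Average metric swing from changing one factor while the other is held fixed.

    `results` maps (protocol, backbone) -> metric and must contain every
    combination of the protocols and backbones that appear in its keys.
    """
    protocols = sorted({p for p, _ in results})
    backbones = sorted({b for _, b in results})

    # Swing from swapping the coordination protocol, backbone held fixed.
    protocol_effect = mean([
        max(results[(p, b)] for p in protocols) - min(results[(p, b)] for p in protocols)
        for b in backbones
    ])
    # Swing from swapping the LLM backbone, protocol held fixed.
    backbone_effect = mean([
        max(results[(p, b)] for b in backbones) - min(results[(p, b)] for b in backbones)
        for p in protocols
    ])
    return protocol_effect, backbone_effect


def consistent_with_cph(results):
    """True when the protocol swing exceeds the backbone swing."""
    protocol_effect, backbone_effect = main_effects(results)
    return protocol_effect > backbone_effect


# Made-up placeholder values, NOT results from any published system.
PLACEHOLDER_RESULTS = {
    ("pipeline", "small"): 0.5,
    ("pipeline", "large"): 0.75,
    ("debate", "small"): 1.25,
    ("debate", "large"): 1.5,
}

print(main_effects(PLACEHOLDER_RESULTS))
print(consistent_with_cph(PLACEHOLDER_RESULTS))
```

A single grid of point estimates cannot confirm or refute the hypothesis; in practice each cell would need rolling-window estimates with variance, so that the two swings can be compared against their sampling noise.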

5.3 Supporting Evidence (Tiered)

Tier 1 – Live-Market Benchmarking (Strongest). Available evidence suggests that framework architecture is a more consequential predictor of profitability than model selection: weaker models within sophisticated coordination structures tend to outperform frontier models in linear pipelines across the benchmarks examined. This is the most credible available evidence for the CPH, though the regime diversity covered by the AMA live-trading benchmark [18] remains limited and the finding should be treated as strongly suggestive rather than definitive.

Tier 2 – Ablation Studies (Moderate). In FinCon and TradingAgents, removing the coordination mechanism reduced the Sharpe ratio by 15–30%, whereas substituting a smaller model changed it by only 5–8%. These ablations are author-reported and should be treated as suggestive rather than confirmatory.

Tier 3 – Theoretical Scaling Arguments (Tentative). Formal results [9] suggest that increasing agent count without an optimized coordination topology yields diminishing returns and increased inter-agent interference. This is consistent with the CPH but does not directly test it in financial settings.

5.4 Why the CPH Cannot Yet Be Validated

Definitive validation requires a controlled experiment varying D2 while holding D1, D3, D4, and LLM backbone constant, evaluated on contamination-free data with rolling windows and net-of-cost returns. This experiment has not been conducted because the five evaluation failures make its prerequisites unavailable. The failures are not merely a general critique; they are the specific obstacle blocking the field's most important untested hypothesis. Addressing them is a prerequisite for testing the CPH, not a parallel concern.

To transition toward empirical validation, we propose a Cross-Architecture Factorial Design.
By isolating Coordination Logic as the primary independent variable and LLM Parameter Scale as a control variable, researchers can quantify the Marginal Alpha contributed by the protocol. We suggest a benchmark of 500 simulated trading days across three distinct market regimes (Bull, Bear, and Sideways) to ensure the coordination advantage is robust against regime-specific model biases.

6 Coordination Trade-offs and the CBS Metric

If the coordination protocol is the most consequential design dimension, understanding its costs and risks is essential for any practitioner acting on that hypothesis. This section synthesises four trade-off axes and introduces the CBS as the metric that operationalizes the CPH in deployment.

6.1 Key Trade-off Axes

Cost and performance. Inference costs scale linearly with agent count, while coordination costs scale quadratically in fully connected topologies. At current API pricing, a seven-agent system incurs roughly $0.50–$2.00 per daily decision, negligible for medium-frequency strategies but material at higher frequencies. Practical budgets of three to seven agents with two to three interaction rounds are consistent across the literature. Hybrid designs escalating selectively to frontier models for complex reasoning steps offer order-of-magnitude cost reductions.

Debate and latency. Each debate round introduces one to three seconds of latency; a two-round debate can incur five to twenty basis points of adverse price movement, potentially exceeding the signal improvement it provides. This latency cost is frequency-dependent, with coordination benefits most pronounced at medium-frequency horizons where holding periods are long enough to absorb the delay.
Debate depth should be calibrated to both market conditions and asset type: direct execution during high-conviction signals minimises latency cost, while liquid equities can tolerate deeper coordination more readily than illiquid or volatile assets.

Memory depth and regime drift. Historical precedents from a prior regime introduce anchoring bias when structural conditions change. Existing designs such as FinMem's decay-based memory and FinCon's episodic updating partially address this but rely on fixed temporal structures rather than event-driven adaptation. Explicit regime-change detection is needed to trigger belief revision; absent such mechanisms, memory-equipped agents should incorporate circuit breakers suspending retrieval under elevated drawdown or volatility.

Planner-executor depth. Ablation results from FinCon and TradingAgents indicate that removing independent risk assessment degrades risk-adjusted performance. A minimal three-stage pipeline (signal generation, risk gate, execution) appears sufficient for most production settings, with additional stages yielding diminishing returns unless they contribute distinct information.

6.2 The Coordination Breakeven Spread

The trade-off axes above share a common structure: coordination improves signal quality but incurs a cost, and no published metric determines whether the improvement justifies that cost for a given instrument. We formalize this via the Coordination Breakeven Spread (CBS). Let $\Delta p(d)$ denote the expected improvement in entry/exit price from coordination depth $d$, and let $s$ denote the bid–ask spread (round-trip cost $= 2s$). Defining

$$\mathrm{CBS}(d) = \frac{\Delta p(d)}{2},$$

coordination is optimal only if $s < \mathrm{CBS}(d)$, that is, only if the price improvement exceeds the round-trip cost, $\Delta p(d) > 2s$; when the instrument's spread exceeds the CBS, coordination overhead is not recovered and the system should revert to single-agent operation.
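
The decision rule implied by this definition can be written in a few lines. The sketch below is illustrative only: the function names and the 6 bps price improvement are assumptions introduced here, while the rule itself (halve the expected price improvement and compare it strictly against the spread) follows the definition above. All quantities are in basis points.

```python
def coordination_breakeven_spread(delta_p_bps: float) -> float:
    """CBS(d) = delta_p(d) / 2, with delta_p(d) the expected price improvement in bps."""
    return delta_p_bps / 2.0


def should_coordinate(delta_p_bps: float, spread_bps: float) -> bool:
    """Coordinate only when the spread is strictly below the breakeven spread."""
    return spread_bps < coordination_breakeven_spread(delta_p_bps)


# Hypothetical example: an assumed 6 bps improvement from coordination.
assumed_delta_p = 6.0
print(coordination_breakeven_spread(assumed_delta_p))      # 3.0
print(should_coordinate(assumed_delta_p, spread_bps=1.5))   # True
print(should_coordinate(assumed_delta_p, spread_bps=20.0))  # False
```

Because the inequality is strict, an instrument whose spread exactly equals the CBS is routed to single-agent operation; that tie-breaking choice is an assumption of this sketch.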

Under reasonable assumptions, the implied CBS lies in the low single-digit basis-point range for typical Sharpe levels, indicating that coordination is economically viable primarily in highly liquid instruments with narrow spreads. In practice, $\Delta p(d)$ can be estimated as the difference in volume-weighted average execution price between a coordinated and single-agent baseline over matched trading windows, net of latency-induced slippage; where unavailable, it can be bounded using coordination-driven Sharpe improvements converted to basis points via average holding period. Precise calibration beyond these approximations remains an open empirical task.

Relationship to the Evaluation Failures and the CPH. CBS directly addresses transaction cost neglect (Section 4.4) by converting coordination gains into a spread threshold. It is regime-dependent, as spreads widen during volatility spikes. In deployment, practitioners can test the CPH by running coordinated and single-agent systems in parallel and observing whether coordination consistently clears the CBS threshold.

[Figure 2: x-axis Instrument Bid-Ask Spread (bps), y-axis Coordination Signal Gain (bps), both 0–80. A CBS threshold line separates the region "Coordination adds genuine value" from the region "Coordination destroys value (cost > signal gain)". Marked asset classes: large-cap equities, forex majors, small-cap equities, crypto / EM instruments.]

Fig. 2: Conceptual illustration of the CBS threshold. Instruments in the upper-left region (low spread, high coordination gain) are candidates for multi-agent coordination; those in the lower-right are not. Asset-class regions are directional and illustrative only; empirical CBS values cannot be computed from currently published results owing to transaction cost neglect (Section 4.4).

Asset-Class and Regime Dependence. The CBS threshold varies substantially (Fig. 2).
For large-capitalisation US equities (1–2 bps spreads), coordination-driven signal improvements may plausibly exceed costs at daily frequency. For small-capitalisation equities (10–50 bps), mid-capitalisation cryptocurrency (20–100 bps), or emerging market instruments, the signal gain required to clear the threshold is considerably higher. A system applying coordination uniformly will over-coordinate during crisis periods when spreads are wide and under-coordinate during calm periods when spreads are narrow and coordination overhead is more easily recovered. We propose CBS as a standard reporting requirement alongside Sharpe ratio and maximum drawdown.

7 Conclusion and Future Directions

This survey has argued that five systematic evaluation failures make cross-system comparisons of LLM-based multi-agent financial systems unreliable, and that addressing them is a prerequisite for any credible claim about what drives performance. From the structural patterns that remain observable despite these failures, we formulated the CPH and introduced the CBS as its deployment metric. The logical chain is: the taxonomy provides vocabulary for controlled comparison; the evaluation critique explains why comparison is currently unreliable; the CPH identifies what is most worth testing; and the CBS defines what testing it requires in practice. We identify three high-priority directions for future work.

Controlled validation of the CPH. The definitive experiment holds architecture topology, memory design, tool integration, and LLM backbone constant while varying only the coordination mechanism across identical contamination-free data with rolling-window, net-of-cost evaluation. A community benchmark providing this infrastructure would either confirm the CPH and redirect research effort, or reveal that coordination and model quality require joint optimization.
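
The rolling-window, net-of-cost evaluation named above can be sketched as follows. This is a minimal illustration under stated assumptions, not a benchmark specification: the function names are hypothetical, costs are modelled as a flat proportional charge in basis points of traded notional (a real cost model would also cover half-spread and market impact, as Section 4.6 requires), the risk-free rate is taken as zero, and 252 trading days per year is assumed for annualization.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumed annualization convention


def annual_cost_drag(cost_bps, trades_per_year=TRADING_DAYS):
    """Simple (non-compounded) annual drag from one round trip per trade, as a fraction."""
    return trades_per_year * cost_bps / 1e4


def net_returns(gross, turnover, cost_bps):
    """Daily net return = gross return minus turnover times proportional cost."""
    return [g - t * cost_bps / 1e4 for g, t in zip(gross, turnover)]


def window_sharpes(returns, window):
    """Annualized Sharpe ratio for each non-overlapping window of `window` days.

    Requires window >= 2; a trailing partial window is dropped, and a window
    with zero volatility yields NaN.
    """
    sharpes = []
    for start in range(0, len(returns) - window + 1, window):
        chunk = returns[start:start + window]
        sd = stdev(chunk)
        sharpes.append(float("nan") if sd == 0 else mean(chunk) / sd * math.sqrt(TRADING_DAYS))
    return sharpes


# Round-trip costs of 10 and 20 bps incurred once per trading day.
print(annual_cost_drag(10), annual_cost_drag(20))
```

Reporting the spread of `window_sharpes` across windows, rather than a single full-period figure, is what exposes results that depend on one favourable window.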

Small language model specialist architectures. No published system implements a production-ready hybrid in which small models handle routine subtasks (sentiment classification, compliance checking) and escalate to frontier models only for complex reasoning. Related work [1] suggests that fine-tuned small models can match frontier models on specialized tasks, with inference cost reductions of one to two orders of magnitude, though this finding derives from general agentic settings and its applicability to financial multi-agent systems remains an open question.

Systemic risk from correlated AI trading. If multiple institutions deploy similar LLM-based multi-agent architectures, correlated signals could amplify rather than dampen market volatility. Existing regulatory frameworks (IMF, CFTC, EU AI Act), calibrated for single-agent AI systems, do not account for emergent coordination effects across independent deployments.

The most consequential contributions will come not from adding agents or scaling models, but from designing coordination mechanisms that demonstrably improve risk-adjusted, net-of-cost, regime-robust decision quality, and from building the evaluation infrastructure needed to verify such claims.

Disclosure of Interests. The authors declare no competing interests. No external funding was received in support of this work.

References

1. Belcak, P., Molchanov, P., Dong, H., Muralidharan, S., et al.: Small language models are the future of agentic AI. arXiv preprint arXiv:2506.02153 (2025), NVIDIA
2. BlackRock: AlphaAgents: Multi-agent LLM for equity portfolios. arXiv preprint arXiv:2508.11152 (2025)
3. Chen, Y., et al.: StockBench: Can LLM agents trade stocks profitably in real-world markets? arXiv preprint arXiv:2510.02209 (2025)
4. Choi, J., Zhu, S., Li, T.: Debate or vote: Which yields better decisions in multi-agent large language models? In: Advances in Neural Information Processing Systems (NeurIPS) (2025), spotlight
5. Cook, J., et al.: Agentic RAG for fintech (2025), preprint
6. Ding, Y., Li, J., Wang, X., Chen, Y.: Large language model agent in financial trading: A survey. arXiv preprint arXiv:2408.06361 (2024)
7. Du, Y., et al.: Improving factuality and reasoning in language models with multiagent debate. In: Proceedings of the 41st International Conference on Machine Learning (ICML) (2024)
8. Elton, E.J., Gruber, M.J., Blake, C.R.: Survivorship bias and mutual fund performance. Review of Financial Studies 9(4), 1097–1120 (1996)
9. Estornell, A., Liu, Y.: Multi-LLM debate: Framework, principals, and interventions. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
10. Fatemi, S., Hu, Y.: FinVision: A multi-agent framework for stock market prediction. In: Proceedings of the 5th ACM International Conference on AI in Finance (ICAIF) (2024)
11. Guo, T., Chen, X., Wang, Y., Chang, R., Pei, S., Chawla, N.V., Wiest, O., Zhang, X.: Large language model based multi-agents: A survey of progress and challenges. In: Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI) (2024)
12. Li, X.: Design and empirical study of a large language model-based multi-agent investment system for Chinese public REITs. arXiv preprint arXiv:2602.00082 (2026)
13. Li, Y., et al.: TradingGPT: Multi-agent system with layered memory and distinct characters for enhanced financial trading performance (2023), preprint
14. Li, Y., Kim, B., Cucuringu, M., Ma, Y.: Can LLM-based financial investing strategies outperform the market in long run? arXiv preprint arXiv:2505.07078 (2025), KDD 2026 Datasets & Benchmarks Track
15. Li, Y., et al.: HedgeAgents: A balanced-aware multi-agent financial trading system. In: Companion Proceedings of the ACM Web Conference (WWW Companion) (2025)
16. Li, Y., et al.: QuantAgents: Towards multi-agent financial system via simulated trading. arXiv preprint arXiv:2510.04643 (2025)
17. Liang, Y., et al.: Degeneration-of-thought: Multi-agent debate can harm reasoning in large language models. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2024)
18. Qian, Y., et al.: When agents trade: Live multi-market trading benchmark for LLM agents. arXiv preprint arXiv:2510.11695 (2025)
19. Sun, Y., Huang, H., Pompili, D.: LLM-based multi-agent reinforcement learning: Current and future directions. arXiv preprint arXiv:2405.11106 (2024)
20. Talebirad, Y., Nadiri, A.: Multi-agent collaboration: Harnessing the power of intelligent LLM agents. arXiv preprint arXiv:2306.03314 (2023)
21. Wawer, M., Chudziak, B.: Integrating traditional technical analysis with AI: A multi-agent LLM-based approach to stock market forecasting. In: Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART). pp. 100–111 (2025)
22. Xiao, Y., et al.: TradingAgents: Multi-agents LLM financial trading framework. In: Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI) (2025)
23. Xiong, Z., et al.: How memory management impacts LLM agents: An empirical study of experience-following behavior. arXiv preprint arXiv:2505.16067 (2025)
24. Yang, H., et al.: FinRobot: An open-source AI agent platform for financial applications using large language models. arXiv preprint arXiv:2405.14767 (2024)
25. Yu, W., et al.: FinCon: A synthesized LLM multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
26. Yu, Y., et al.: FinMem: A performance-enhanced LLM trading agent with layered memory and character design. In: AAAI Spring Symposium (AAAI-SS) (2024)
27. Zhang, Y., et al.: A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2024)
28. Zhao, Y., et al.: ContestTrade: Competitive multi-agent trading (2025), preprint
