Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies
The adoption of large language models (LLMs) for structured information extraction from financial documents has accelerated rapidly, yet production deployments face fundamental architectural decisions with limited empirical guidance. We present a systematic benchmark comparing four multi-agent orchestration architectures: sequential pipeline, parallel fan-out with merge, hierarchical supervisor-worker and reflexive self-correcting loop. These are evaluated across five frontier and open-weight LLMs on a corpus of 10,000 SEC filings (10-K, 10-Q and 8-K forms). Our evaluation spans 25 extraction field types covering governance structures, executive compensation and financial metrics, measured along five axes: field-level F1, document-level accuracy, end-to-end latency, cost per document and token efficiency. We find that reflexive architectures achieve the highest field-level F1 (0.943) but at 2.3x the cost of sequential baselines, while hierarchical architectures occupy the most favorable position on the cost-accuracy Pareto frontier (F1 0.921 at 1.4x cost). We further present ablation studies on semantic caching, model routing and adaptive retry strategies, demonstrating that hybrid configurations can recover 89% of the reflexive architecture’s accuracy gains at only 1.15x baseline cost. Our scaling analysis from 1K to 100K documents per day reveals non-obvious throughput-accuracy degradation curves that inform capacity planning. These findings provide actionable guidance for practitioners deploying multi-agent LLM systems in regulated financial environments.
💡 Research Summary
The paper presents a comprehensive benchmark of multi‑agent large language model (LLM) architectures for extracting structured data from U.S. Securities and Exchange Commission (SEC) filings. The authors evaluate four orchestration patterns—Sequential Pipeline, Parallel Fan‑Out with Merge, Hierarchical Supervisor‑Worker, and Reflexive Self‑Correcting Loop—across five LLMs (OpenAI GPT‑4o, Anthropic Claude 3.5 Sonnet, Google Gemini 1.5 Pro, Meta Llama 3 70B, and Mistral Mixtral 8×22B). Using a curated dataset of 10,000 filings (4,000 10‑K, 4,000 10‑Q, 2,000 8‑K) covering 25 fields (financial metrics, governance, executive compensation), the study measures field‑level micro‑averaged F1, strict document‑level accuracy, end‑to‑end latency (p50/p95), cost per document, and token‑efficiency.
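The headline metric, field-level micro-averaged F1, pools true/false positives and negatives across all fields and documents before computing precision and recall. A minimal sketch of that computation under an exact-match criterion (the field names and dictionary-per-document representation here are illustrative, not from the paper):

```python
from typing import Dict, List

def micro_f1(predictions: List[Dict[str, str]], gold: List[Dict[str, str]]) -> float:
    """Field-level micro-averaged F1: pool TP/FP/FN across all fields and documents."""
    tp = fp = fn = 0
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            got = pred.get(field)
            if got is None:
                fn += 1          # field missed entirely
            elif got == expected:
                tp += 1          # exact-match extraction
            else:
                fp += 1          # wrong value extracted...
                fn += 1          # ...and the true value was not recovered
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Micro-averaging weights every extracted field equally, so frequent field types (e.g., the 10-K financial metrics) dominate the score more than rare ones; strict document-level accuracy is the complementary, harsher metric requiring all 25 fields correct at once.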
Key findings:
- Accuracy hierarchy – The Reflexive architecture achieves the highest field‑level F1 (0.943) and strict document accuracy (0.758) but incurs 2.3 × the cost of the baseline Sequential pipeline. The Hierarchical architecture follows closely (F1 0.921, accuracy 0.704) while offering a far more favorable cost‑accuracy trade‑off (1.4 × baseline cost).
- Cost and token usage – Cost is driven by model pricing and the number of API calls. The cheapest model (Mixtral 8×22B) costs $0.031 per document in the Sequential setup, rising to $0.072 in the Reflexive setup. Parallel execution yields the highest token efficiency because each agent receives only the relevant document slice, avoiding cumulative context growth.
- Latency patterns – Sequential processing shows linear latency growth (≈34 s median), Parallel is the fastest (≈12 s p95), Hierarchical adds supervisory overhead (≈41 s), and Reflexive is the slowest (≈74 s) due to up to three verification‑correction cycles.
- Scaling behavior – When scaling daily throughput from 1 K to 100 K documents, Parallel and Hierarchical designs scale near‑linearly in cost and latency. Reflexive exhibits a “knee point” around 30 K documents per day where accuracy drops sharply and cost escalates non‑linearly, reflecting the increasing frequency of multi‑round verification for complex filings.
- Hybrid optimizations – The authors explore three orthogonal techniques: semantic caching (re‑using extracted sections), model routing (assigning high‑complexity fields to expensive models and simple fields to cheap ones), and adaptive retry (re‑extracting only low‑confidence fields). Combining these yields a hybrid configuration that recovers 89 % of the Reflexive accuracy gain (F1 ≈ 0.936) while raising cost to only 1.15 × the Sequential baseline and keeping latency around 18 s.
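The model-routing and adaptive-retry ideas can be sketched together. This is a hypothetical illustration, not the paper's implementation: the `extract()` callable, the model names, the field names, and the 0.85 confidence floor are all assumptions introduced here for clarity.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative constants -- not taken from the paper.
HIGH_COMPLEXITY = {"executive_compensation", "governance_structure"}
CONFIDENCE_FLOOR = 0.85
MAX_RETRIES = 2

def route_model(field: str) -> str:
    """Model routing: send complex fields to the expensive model, the rest to a cheap one."""
    return "frontier-model" if field in HIGH_COMPLEXITY else "small-model"

def extract_document(
    fields: List[str],
    extract: Callable[[str, str], Tuple[str, float]],  # (field, model) -> (value, confidence)
) -> Dict[str, str]:
    results = {}
    for field in fields:
        value, conf = extract(field, route_model(field))
        retries = 0
        # Adaptive retry: re-extract only low-confidence fields,
        # escalating to the expensive model on each retry.
        while conf < CONFIDENCE_FLOOR and retries < MAX_RETRIES:
            value, conf = extract(field, "frontier-model")
            retries += 1
        results[field] = value
    return results
```

The design point this captures is why the hybrid stays near 1.15 × baseline cost: the expensive model and the extra verification passes are spent only where the cheap first pass is uncertain, rather than on every field of every document as in the full Reflexive loop.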
The paper also contributes a failure taxonomy (12 failure modes specific to multi‑agent financial extraction) and mitigation strategies, as well as a rigorous evaluation pipeline built on LangGraph, vLLM, and a 21‑day experimental run on an 8‑node A100 cluster.
Practical implications for financial institutions:
- Cost‑constrained deployments should favor the Hierarchical architecture with model routing, achieving near‑optimal accuracy at roughly 60 % of the Reflexive cost.
- High‑precision regulatory reporting (e.g., audit‑ready data) may justify the Reflexive approach if budget permits, but only for workloads below the identified knee point.
- Large‑scale production (tens of thousands of filings per day) benefits from Parallel or Hierarchical designs with semantic caching to keep latency and cost manageable.
- Dynamic resource allocation can be realized by monitoring verification failure rates; when they exceed a threshold, the system can switch to a more thorough Reflexive mode for the affected subset.
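The last point, switching modes on a verification-failure signal, could be realized with a rolling-window monitor along these lines. A minimal sketch under stated assumptions: the window size, the 10 % threshold, and the mode names are illustrative choices, not values given by the paper.

```python
from collections import deque

class ModeSwitcher:
    """Escalate to the Reflexive mode when the rolling verification-failure
    rate exceeds a threshold; otherwise stay in the cheaper Hierarchical mode."""

    def __init__(self, window: int = 500, threshold: float = 0.10):
        self.outcomes = deque(maxlen=window)  # rolling window of pass/fail flags
        self.threshold = threshold

    def record(self, verification_passed: bool) -> None:
        self.outcomes.append(verification_passed)

    def current_mode(self) -> str:
        if not self.outcomes:
            return "hierarchical"  # default to the cheap mode with no signal yet
        failure_rate = 1 - sum(self.outcomes) / len(self.outcomes)
        return "reflexive" if failure_rate > self.threshold else "hierarchical"
```

In practice the switch would apply only to the affected subset (e.g., one filing type or one issuer), so a production system would keep one such monitor per document stratum rather than a single global one.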
In summary, the study demonstrates that the choice of orchestration pattern is as critical as the underlying LLM for financial document automation. Hierarchical supervision, especially when augmented with intelligent routing and caching, offers the most balanced solution for real‑world, regulated environments, while Reflexive loops provide a ceiling on accuracy at a substantially higher operational cost. Future work could integrate domain‑specific knowledge graphs or memory‑augmented verification to reduce the cost of self‑correction without sacrificing precision.