RAFFLES: Reasoning-based Attribution of Faults for LLM Systems

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The advent of complex, interconnected, long-horizon LLM systems has made it difficult to identify where and when these systems break down. Existing evaluation capabilities are limited: they often focus on simple metrics and end-to-end outcomes, and depend on human judgment. To match the increasing complexity of these multi-component systems, evaluation frameworks must also be able to reason, probe, iterate, and understand the nuanced logic passing through them. In this paper, we present RAFFLES, an offline evaluation architecture that incorporates iterative reasoning. Specifically, RAFFLES operates as an iterative, multi-component pipeline, using a central Judge to systematically identify faults and a set of specialized Evaluators to assess both the quality of the candidate faults and the Judge's rationales. We evaluated RAFFLES on several benchmarks: the Who&When dataset, to identify step-level faults in multi-agent systems, and the ReasonEval datasets, to diagnose step-level mathematical reasoning errors. RAFFLES outperforms strong baselines, achieving accuracies of over 20% and 50% on the Who&When Hand-Crafted and Algorithmically-Generated datasets respectively, and over 80% on the ReasonEval datasets. These results mark a key step toward replacing labor-intensive manual review with automated fault detection for autonomous systems.


💡 Research Summary

The paper introduces RAFFLES (Reasoning‑based Attribution of Faults for LLM Systems), an offline evaluation architecture designed to automatically locate and diagnose decisive faults in long‑horizon, multi‑component language‑model‑driven agentic systems. Existing evaluation methods largely focus on end‑to‑end metrics or human judgments, which are insufficient for pinpointing the exact step where a failure originates in complex pipelines. To address this, the authors formalize a hierarchy of error concepts—step‑level fault, trivial fault, critical fault, and finally decisive fault (the earliest critical fault whose correction would turn a failed trajectory into a success).
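The fault hierarchy above can be sketched as a simple search for the earliest critical fault. This is an illustrative reconstruction, not the paper's implementation: the predicate names `is_faulty` and `is_critical` are assumptions, standing in for judgments that RAFFLES delegates to LLM components.

```python
from typing import Callable, Optional, Sequence

def decisive_fault(
    steps: Sequence[str],
    is_faulty: Callable[[int], bool],    # step-level fault condition
    is_critical: Callable[[int], bool],  # fixing this step would flip failure to success
) -> Optional[int]:
    """Return the decisive fault: the earliest step that is both faulty and critical."""
    for t in range(len(steps)):
        if is_faulty(t) and is_critical(t):
            return t  # Primacy: first critical fault wins
    return None  # no single decisive step (trivial faults only, or success)
```

Under these definitions a trivial fault (faulty but not critical) is skipped, which matches the paper's distinction between trivial, critical, and decisive faults.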

RAFFLES consists of two cooperating LLM‑based modules: a Judge and a set of Evaluators. The Judge receives the full execution log τ and a memory buffer H, then proposes a candidate step t as the decisive fault together with three separate rationales (R₁, R₂, R₃) corresponding to the three criteria: (1) Fault Condition (the step is indeed erroneous), (2) Primacy (it is the earliest critical error), and (3) Decisiveness (the error is critical, i.e., fixing it would salvage the outcome). Each Evaluator Eₚ (p = 1, 2, 3) independently checks one of these rationales, returning a confidence score cₚ (0–100) and a supplemental rationale rₚ. A fourth, rule‑based Evaluator validates that the proposed step aligns with the log structure.
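The data flow of a single evaluation round might look like the following sketch. All names are assumptions, the scores are stubs, and the fourth (rule-based) check is folded in as one more score so its veto can zero out an out-of-range candidate; in the paper, each of the three criterion checks is an independent LLM call.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Candidate:
    step: int                  # proposed decisive-fault step t
    rationales: Tuple[str, str, str]  # (R1 fault, R2 primacy, R3 decisiveness)

@dataclass
class Evaluation:
    scores: List[int]          # confidence c_p in [0, 100], one per check
    rationales: List[str]      # supplemental rationale r_p per check

    @property
    def total(self) -> int:
        return sum(self.scores)

def evaluate_candidate(cand: Candidate, log_len: int) -> Evaluation:
    # Rule-based check: the proposed step must exist in the log structure.
    if not 0 <= cand.step < log_len:
        return Evaluation([0, 0, 0, 0], ["step outside log structure"] * 4)
    # Stub scores; real evaluators score each rationale independently via LLM calls.
    scores = [90, 85, 80, 100]  # three criteria + structural check
    rationales = [f"criterion {p + 1} supported" for p in range(3)] + ["step valid"]
    return Evaluation(scores, rationales)
```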

The confidence scores are summed (C = ∑ₚ cₚ) and stored together with the rationales in H. The Judge then incorporates H in the next iteration, refining its candidate selection. The loop terminates when either (a) C exceeds a fixed threshold (350 in the experiments) or (b) a maximum number of iterations K is reached, at which point the step with the highest accumulated confidence is output as the decisive fault t*. If the Judge declares that no decisive fault exists and the confidence threshold is met, t* = None.
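Putting the pieces together, the outer loop can be sketched as below. This is a minimal reconstruction under one assumption: that C sums four per-check scores in [0, 100] (three criterion evaluators plus the rule-based check), so the threshold of 350 out of a possible 400 is reachable. The `judge` and `evaluate` callables stand in for LLM calls.

```python
from typing import Callable, List, Optional, Tuple

def raffles_loop(
    judge: Callable[[list], Tuple[Optional[int], list]],   # H -> (candidate step, rationales)
    evaluate: Callable[[Optional[int], list], List[int]],  # -> four scores in [0, 100]
    threshold: int = 350,
    max_iters: int = 5,  # K in the paper
) -> Optional[int]:
    history: list = []           # memory buffer H of past candidates and feedback
    best_step, best_conf = None, -1
    for _ in range(max_iters):
        step, rationales = judge(history)
        scores = evaluate(step, rationales)
        conf = sum(scores)       # C = sum of per-check confidences
        history.append({"step": step, "rationales": rationales, "scores": scores})
        if conf > best_conf:
            best_step, best_conf = step, conf
        if conf >= threshold:
            return step          # confident enough; may be None ("no decisive fault")
    return best_step             # fall back to the highest-confidence candidate
```

Feeding the whole history back to the Judge is what lets later iterations correct a weak first guess, at the cost of multiple LLM calls per iteration.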

The authors evaluate RAFFLES on two benchmark suites. Who&When provides multi‑agent logs with annotated step‑level faults. On the Hand‑Crafted subset, RAFFLES raises accuracy from 18.20 % (previous best) to 27.59 %; on the Algorithmically‑Generated subset, accuracy improves from 38.10 % to 51.59 %. ReasonEval contains step‑annotated mathematical reasoning chains. Using Claude Sonnet 4, RAFFLES lifts step‑level fault detection from 73.58 % to 84.91 % on MR‑Math‑Invalid and from 75.46 % to 83.78 % on MR‑GSM8K‑Original. These gains demonstrate that iterative, structured reasoning between a central judge and specialized evaluators can substantially outperform single‑pass LLM‑as‑judge baselines.

Technical strengths include: (1) a rigorous, mathematically‑grounded definition of decisive faults that aligns with human annotation guidelines; (2) modular design allowing task‑agnostic prompts—only minor prompt adjustments are needed to transfer RAFFLES to new domains such as code execution or tool‑calling pipelines; (3) an explicit feedback loop that lets the system self‑correct when initial candidate selections are weak.

Limitations are also acknowledged. Confidence scores are model‑dependent; the absolute threshold of 350 may need retuning for different LLMs. The iterative process can become costly for very long trajectories, as each iteration invokes multiple LLM calls. Moreover, the current aggregation simply sums independent confidences, without a principled method to resolve contradictory evaluator feedback.

Future work suggested includes normalizing confidence scores across models (e.g., Bayesian calibration), learning dynamic stopping criteria to reduce unnecessary iterations, and extending the architecture to handle multimodal logs and direct tool‑call results. Integrating a more sophisticated consensus mechanism among evaluators could further improve robustness.

In sum, RAFFLES offers a practical, automated framework for pinpointing decisive step‑level errors in complex LLM‑driven systems, moving evaluation beyond coarse outcome metrics toward fine‑grained, actionable diagnostics. Its demonstrated performance gains and modular design make it a promising candidate for a new standard in the evaluation of autonomous language‑model pipelines.

