JFTA-Bench: Evaluate LLM's Ability of Tracking and Analyzing Malfunctions Using Fault Trees

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

In the maintenance of complex systems, fault trees are used to locate problems and provide targeted solutions. To enable fault trees stored as images to be processed directly by large language models, which can then assist in tracking and analyzing malfunctions, we propose a novel textual representation of fault trees. Building on this representation, we construct a benchmark for multi-turn dialogue systems that emphasizes robust interaction in complex environments and evaluates a model's ability to assist in malfunction localization; it contains 3,130 entries with an average of 40.75 turns per entry. We train an end-to-end model to generate vague information that reflects real user behavior, and we introduce long-range rollback and recovery procedures to simulate user error scenarios, enabling assessment of a model's integrated capabilities in task tracking and error recovery. Gemini 2.5 Pro achieves the best performance.


💡 Research Summary

The paper addresses a practical bottleneck in applying large language models (LLMs) to fault‑tree‑based diagnosis: most fault trees are published as static images, while LLMs are trained primarily on textual data. To bridge this modality gap, the authors introduce JFTA (JSON‑based Fault Tree Analysis), a structured, extensible textual representation that captures the full logical structure of fault trees, including Boolean gates (AND, OR, XOR), hierarchical nesting, and cross‑branch references that turn a strict tree into a directed acyclic graph (DAG). Node identifiers, types, child lists, and optional links are encoded in JSON, allowing both natural‑language descriptions (for leaf‑node solutions) and strict syntactic constraints that can be programmatically validated.
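To make the idea concrete, here is a minimal sketch of what such an encoding and its validation might look like. The field names (`id`, `type`, `children`, `link`) and the example fault tree are illustrative assumptions, not the paper's exact schema:

```python
# Hypothetical JFTA-style encoding of a tiny fault tree (schema assumed).
tree = {
    "id": "TOP",
    "type": "OR",  # Boolean gate: AND / OR / XOR
    "children": [
        {"id": "G1", "type": "AND", "children": [
            {"id": "B1", "type": "basic", "solution": "Replace relay K3."},
            {"id": "B2", "type": "basic", "solution": "Re-seat connector J7."},
        ]},
        {"id": "G2", "type": "XOR", "children": [
            {"id": "B3", "type": "basic", "solution": "Reset breaker CB2."},
            # Cross-branch reference: instead of duplicating B1's subtree,
            # this node links to it, turning the tree into a DAG.
            {"link": "B1"},
        ]},
    ],
}

def validate(node, seen=None):
    """Check the syntactic constraints programmatically: node ids are
    unique and every cross-branch link resolves to a declared node."""
    seen = set() if seen is None else seen
    if "link" in node:
        assert node["link"] in seen, f"dangling link: {node['link']}"
        return seen
    assert node["id"] not in seen, f"duplicate id: {node['id']}"
    seen.add(node["id"])
    for child in node.get("children", []):
        validate(child, seen)
    return seen

print(sorted(validate(tree)))  # → ['B1', 'B2', 'B3', 'G1', 'G2', 'TOP']
```

Because leaf nodes carry free-text solutions while the surrounding structure is strict JSON, a checker like this can be run on model-generated trees before they enter the benchmark.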

Using this representation, the authors construct JFTA‑Bench, a benchmark designed to evaluate LLMs in multi‑turn, error‑prone dialogue settings typical of real‑world maintenance. They first collect 126 fault trees across 24 domains (power, aerospace, etc.), each averaging 140 nodes. Human experts manually annotate a few seed trees in JFTA; then GPT‑4o and Claude Sonnet 4.5 are prompted in a one‑shot fashion to generate the remaining trees, which are subsequently vetted by experts. From these trees, 3,130 distinct fault‑path groups are sampled. Each path starts from 1–6 basic failures and is expanded upward to the top‑level event, yielding three difficulty levels based on the number of underlying root causes.
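The upward expansion from sampled basic failures can be sketched as follows. The toy parent map and the exact difficulty cut-offs are assumptions for illustration; the paper states only that difficulty is bucketed by the number of root causes:

```python
import random

# Child -> parent map for a toy tree (assumed structure, not the paper's data).
parents = {"B1": "G1", "B2": "G1", "B3": "G2", "G1": "TOP", "G2": "TOP"}
basics = ["B1", "B2", "B3"]

def path_to_top(leaf):
    """Expand a basic failure upward to the top-level event."""
    path = [leaf]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def sample_group(num_causes):
    """Sample a fault-path group: the union of upward paths from
    num_causes root causes, tagged with an (assumed) difficulty bucket."""
    causes = random.sample(basics, num_causes)
    nodes = set()
    for c in causes:
        nodes.update(path_to_top(c))
    # Bucketing 1-2 / 3-4 / 5-6 causes is a guess at the three levels.
    level = "easy" if num_causes <= 2 else "medium" if num_causes <= 4 else "hard"
    return causes, sorted(nodes), level
```

A call such as `sample_group(1)` would return a single root cause, the nodes on its path to `TOP`, and the easiest difficulty tag.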

A key novelty of the benchmark is the explicit simulation of long‑range state rollback and recovery. For each test case, two paths of identical difficulty share a common prefix of length L. The dialogue begins following the first path; at a random step after the shared prefix, the simulated user declares that a previous observation was incorrect and switches to the second path. The LLM assistant must therefore backtrack in its belief state, update the diagnostic reasoning, and continue to correctly localize all root causes and propose solutions within a limited number of turns.
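One way to picture this construction is the sketch below; the function name, the textual form of the user turns, and the exact retraction wording are invented for illustration:

```python
import random

def build_rollback_dialogue(path_a, path_b, shared_len):
    """Build a dialogue that follows path_a, then at a random step past
    the shared prefix retracts and switches to path_b (schematic)."""
    assert path_a[:shared_len] == path_b[:shared_len]
    # The switch point is sampled after the shared prefix of length L.
    switch = random.randrange(shared_len, len(path_a))
    turns = [("user", f"observed {step}") for step in path_a[: switch + 1]]
    # Long-range rollback: everything after the shared prefix is retracted.
    turns.append(("user", f"correction: observations after step {shared_len} were wrong"))
    turns += [("user", f"observed {step}") for step in path_b[shared_len:]]
    return turns, switch
```

The assistant never sees `path_b` in advance, so after the correction turn it must discard the retracted observations from its belief state and re-plan from the shared prefix.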

The interaction framework adopts the ReAct paradigm: the assistant can ask verification questions about specific fault nodes or suggest remediation actions, while the user (simulated) replies with vague, context‑dependent, and sometimes biased statements. To generate realistic user behavior, the authors train a user simulator based on Qwen‑3‑8B using supervised fine‑tuning (behavior cloning) followed by reinforcement learning for response optimization. The trained user model achieves a 99.98% correctness rate on held‑out fault trees, ensuring that the dialogue remains challenging yet coherent.
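The interaction loop can be outlined schematically. `assistant` and `user_sim` stand in for the LLM under test and the trained simulator; both interfaces, and the scripted stand-ins below, are assumptions rather than the paper's code:

```python
def run_episode(assistant, user_sim, max_turns=40):
    """ReAct-style loop: the assistant either verifies a fault node or
    commits to a diagnosis; the simulated user answers each query."""
    history = []
    for _ in range(max_turns):
        action = assistant(history)        # e.g. ("verify", "B1") or ("solve", [...])
        history.append(("assistant", action))
        if action[0] == "solve":
            return history, action[1]      # proposed root causes / solutions
        history.append(("user", user_sim(history, action)))
    return history, None                   # turn budget exhausted: failure

# Toy stand-ins for demonstration only.
def scripted_assistant(history):
    return ("verify", "B1") if len(history) < 2 else ("solve", ["B1"])

def scripted_user(history, action):
    return "hmm, the relay did click... I think?"  # deliberately vague reply

history, answer = run_episode(scripted_assistant, scripted_user)
```

Episodes that exhaust the turn budget without a `solve` action count as failures, which is what ties this loop to the success-rate metric used in evaluation.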

Evaluation metrics focus on success rate: the proportion of cases where the assistant identifies all root causes and provides correct solutions within the prescribed turn budget. Results show that Gemini 2.5 Pro attains the highest overall performance, successfully solving 53.76% of the test cases. Among open‑source models, DeepSeek‑V3.2 is the strongest with a 41.40% success rate. Failure analysis reveals that most errors stem from deficiencies in planning and state tracking: models often fail to ask the most informative questions in the optimal order, or to maintain a consistent internal representation after a rollback event.
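The metric itself is simple to state in code; the record layout below (`predicted`, `gold`, `turns_used`, `budget`) is an illustrative assumption:

```python
def success_rate(cases):
    """Fraction of cases where ALL root causes are identified, with
    correct solutions, within the prescribed turn budget."""
    solved = sum(
        1 for c in cases
        if c["turns_used"] <= c["budget"]
        and set(c["predicted"]) == set(c["gold"])
    )
    return solved / len(cases)

cases = [
    {"predicted": ["B1"], "gold": ["B1"], "turns_used": 12, "budget": 40},
    {"predicted": ["B2"], "gold": ["B1", "B2"], "turns_used": 9, "budget": 40},
]
print(success_rate(cases))  # → 0.5  (partial diagnoses score zero)
```

Note that the metric is all-or-nothing per case: identifying only some of the root causes earns no partial credit.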

The paper’s contributions are threefold: (1) a novel, machine‑friendly textual format (JFTA) that faithfully encodes complex fault‑tree structures, including DAG‑style cross‑branch links; (2) the JFTA‑Bench benchmark, which evaluates LLMs on fault localization, solution recommendation, and robust error recovery in long‑horizon dialogues; and (3) a high‑fidelity user simulator that injects ambiguity and bias, enabling realistic multi‑turn interaction testing. By providing both a standardized data representation and a rigorous evaluation suite, the work paves the way for deploying LLM‑driven agents in safety‑critical maintenance domains, while also highlighting current limitations in planning and long‑term consistency that future research must address.

