Enforcing Monotonic Progress in Legal Cross-Examination: Preventing Long-Horizon Stagnation in LLM-Based Inquiry

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language models (LLMs) exhibit impressive linguistic fluency but struggle to reliably complete long-horizon tasks under explicit procedural constraints. In legal cross-examination, purely probabilistic generation often maintains behavioral coherence while failing to ensure procedural advancement. We characterize this failure as procedural stagnation and propose Soft-FSM, a neuro-symbolic architecture that enforces monotonic progress over accumulated Key Information Units (KIUs) via an external deterministic state controller. Experiments on three real-world Taiwanese criminal homicide cases show that baseline methods collapse below 40% completeness, while Soft-FSM consistently achieves over 97% with near-zero redundancy. These results suggest that, in such domains, reliable task completion cannot be guaranteed by emergent LLM behavior alone, but can be reliably enforced through explicit, verifiable external state control.


💡 Research Summary

The paper tackles a fundamental limitation of large language models (LLMs) when they are used for long‑horizon, procedurally constrained tasks such as legal cross‑examination. While LLMs excel at producing fluent, locally coherent utterances, they lack any built‑in guarantee that each turn advances the overall procedural goal. The authors name this failure mode “procedural stagnation”: the dialogue remains linguistically consistent, but the set of verified Key Information Units (KIUs) does not grow, causing the inquiry to stall in a cyclic region of the state space. They formalize the problem as a directed‑acyclic‑graph (DAG) traversal over an information state Kₜ, where a valid transition must satisfy a monotonicity constraint |Kₜ₊₁| > |Kₜ|. The probability of eventual failure grows as 1‑(1‑ε)ⁿ, illustrating how even a small per‑turn error rate ε compounds over many turns.
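The monotonicity constraint and the compounding-failure estimate can be sketched in a few lines. This is an illustrative reading of the formalization above, not the authors' code; the function names are hypothetical.

```python
def is_valid_transition(k_t: set, k_next: set) -> bool:
    """A transition is valid only if verified KIUs are preserved and the
    set strictly grows, i.e. K_t ⊆ K_{t+1} and |K_{t+1}| > |K_t|."""
    return k_t <= k_next and len(k_next) > len(k_t)

def eventual_failure_prob(epsilon: float, n_turns: int) -> float:
    """P(at least one non-progressive turn over n turns) = 1 - (1 - ε)^n."""
    return 1.0 - (1.0 - epsilon) ** n_turns

# Even a 2% per-turn error rate compounds quickly over a 40-KIU inquiry:
print(round(eventual_failure_prob(0.02, 40), 3))  # ≈ 0.554
```

The worked number makes the paper's point concrete: a per-turn error rate that looks negligible in isolation yields a better-than-even chance of stagnation over a realistically long cross-examination.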

To address this, the authors propose Soft‑FSM, a neuro‑symbolic architecture that externalizes procedural control into a deterministic finite‑state machine (FSM). The FSM encodes the required KIUs for each stage of the cross‑examination and only permits a state transition when the LLM‑generated question yields a new, verified KIU. The language model is thus responsible solely for natural‑language question formulation, conditioned on the current state and the list of unmet KIUs. An “oracle witness”—a deterministic component that returns factually correct answers extracted from official court judgments—provides the ground‑truth responses, ensuring that any observed failure is due to procedural control rather than noisy evidence.

The experimental suite consists of three real Taiwanese homicide cases of increasing complexity (simple confession, disputed intent, and multi‑accomplice denial). For each case the authors manually construct a schema of over 40 KIUs; the task is considered complete only when all KIUs have been elicited. Four systems are compared using the same base model (Gemma‑3‑27B‑it): (1) Pure LLM with a high‑level goal, (2) Stage‑prompted LLM that receives a full SOP but no external state tracking, (3) Soft‑FSM (the proposed method), and (4) Equilibria‑prompted LLM that performs self‑checks. Metrics include completeness (percentage of KIUs retrieved), redundancy (questions targeting already‑filled KIUs), unknown rate (questions not mapping to any KIU), and stability (standard deviation across runs).
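The four metrics admit a straightforward implementation. The formulas below are assumed from the descriptions above (the paper may normalize differently); `asked_targets` maps each question to the KIU it targets, or `None` when it maps to no KIU.

```python
def evaluate_run(asked_targets, schema_kius):
    """Compute completeness, redundancy, and unknown rate for one run.

    asked_targets: list of KIU ids, one per question; None = unmapped question.
    schema_kius:   the manually constructed KIU schema for the case.
    """
    seen = set()
    redundant = unknown = 0
    for target in asked_targets:
        if target is None:
            unknown += 1           # question maps to no KIU
        elif target in seen:
            redundant += 1         # question re-targets an already-filled KIU
        else:
            seen.add(target)
    n = len(asked_targets)
    return {
        "completeness": len(seen & set(schema_kius)) / len(schema_kius),
        "redundancy": redundant / n if n else 0.0,
        "unknown_rate": unknown / n if n else 0.0,
    }
```

Stability would then be the standard deviation of `completeness` across repeated runs of the same system on the same case.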

Results show a stark contrast. Soft‑FSM achieves 97‑99 % completeness across all cases, with zero redundancy and negligible variance, demonstrating that monotonic progress can be enforced reliably even in the most complex scenario. In contrast, the Pure LLM’s completeness drops from ~67 % on the simple case to ~36 % on the most complex, illustrating the “Complexity Cliff”. The Stage‑prompted system suffers from high redundancy (up to 66 %) and frequent paraphrasing loops, while the Equilibria‑prompted approach performs similarly to the Pure LLM, confirming that internal self‑consistency mechanisms alone cannot prevent stagnation.

The analysis underscores a key insight: behavioral coherence (low perplexity, fluent language) does not imply procedural advancement. Without an explicit, verifiable external controller, LLMs will inevitably drift into regions where the dialogue continues but the underlying task remains unfinished. Soft‑FSM’s deterministic FSM guarantees that every state transition corresponds to genuine information gain, effectively pruning non‑progressive cycles and ensuring that the inquiry follows a DAG‑consistent path toward the terminal state.

The authors conclude that reliable completion of long‑horizon, asymmetric tasks such as legal cross‑examination cannot be left to emergent LLM behavior. Explicit external state control, as embodied by Soft‑FSM, provides the necessary monotonicity guarantee. Future work is suggested in automating KIU extraction, extending the framework to multi‑agent settings, and integrating with unstructured legal documents to broaden applicability beyond the controlled experimental setup.

