AgenticSimLaw: A Juvenile Courtroom Multi-Agent Debate Simulation for Explainable High-Stakes Tabular Decision Making


We introduce AgenticSimLaw, a role-structured, multi-agent debate framework that provides transparent and controllable test-time reasoning for high-stakes tabular decision-making tasks. Unlike black-box approaches, our courtroom-style orchestration explicitly defines agent roles (prosecutor, defense, judge), interaction protocols (7-turn structured debate), and private reasoning strategies, creating a fully auditable decision-making process. We benchmark this framework on young adult recidivism prediction using the NLSY97 dataset, comparing it against traditional chain-of-thought (CoT) prompting across almost 90 unique combinations of models and strategies. Our results demonstrate that structured multi-agent debate provides more stable and generalizable performance compared to single-agent reasoning, with stronger correlation between accuracy and F1-score metrics. Beyond performance improvements, AgenticSimLaw offers fine-grained control over reasoning steps, generates complete interaction transcripts for explainability, and enables systematic profiling of agent behaviors. While we instantiate this framework in the criminal justice domain to stress-test reasoning under ethical complexity, the approach generalizes to any deliberative, high-stakes decision task requiring transparency and human oversight. This work addresses key LLM-based multi-agent system challenges: organization through structured roles, observability through logged interactions, and responsibility through explicit non-deployment constraints for sensitive domains. Data, results, and code will be available on github.com under the MIT license.


💡 Research Summary

AgenticSimLaw introduces a courtroom‑style multi‑agent debate (MAD) framework designed to make high‑stakes tabular decision‑making transparent, controllable, and auditable. The system models three distinct roles—prosecutor, defense, and judge—and enforces a fixed seven‑turn interaction protocol. Each agent first formulates a private strategy before delivering a public utterance, while the judge maintains an internal belief state (prediction, confidence, and reasoning) that is updated after each turn. This structure simultaneously addresses three core challenges of LLM‑based multi‑agent systems: organization (through explicit roles), observability (by logging every utterance, private plan, and belief update), and responsibility (by keeping the system strictly for research and audit, not for deployment).
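The paper does not reproduce its implementation here, but the fixed seven‑turn protocol, the private‑strategy‑then‑public‑utterance pattern, and the judge's evolving belief state can be sketched roughly as follows. The strict prosecutor/defense alternation, the field names, and the callback signatures are illustrative assumptions, not the authors' exact schema:

```python
from dataclasses import dataclass

@dataclass
class BeliefState:
    prediction: str = "unknown"   # e.g. "rearrest" / "no rearrest"
    confidence: float = 0.5
    reasoning: str = ""

@dataclass
class Turn:
    role: str                     # "prosecutor" or "defense"
    private_strategy: str         # formulated before speaking, logged but not shown to the other agent
    public_utterance: str

def run_debate(case_facts, agent_fn, judge_fn, n_turns=7):
    """Run a fixed-length courtroom debate.

    agent_fn(role, transcript, case_facts) -> (private_strategy, public_utterance)
    judge_fn(belief, turn) -> updated BeliefState (called after every turn)
    """
    transcript, belief = [], BeliefState()
    roles = ["prosecutor", "defense"]
    for i in range(n_turns):
        role = roles[i % 2]       # alternation is an assumption
        strategy, utterance = agent_fn(role, transcript, case_facts)
        turn = Turn(role, strategy, utterance)
        transcript.append(turn)   # full transcript kept for auditability
        belief = judge_fn(belief, turn)
    return transcript, belief
```

Because every `Turn` and every intermediate `BeliefState` is retained, the whole deliberation can be replayed and inspected after the fact, which is the observability property the paper emphasizes.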

The authors evaluate the framework on the National Longitudinal Survey of Youth 1997 (NLSY97) recidivism prediction task, a binary classification problem (rearrest within three years) built from 1,412 cases and 27 demographic, socioeconomic, and criminal‑history features. Features are rendered as natural‑language statements (“sex is male”, etc.) and injected into prompts. The dataset is split 60/20/20 for training/validation/test, with the test set reserved for evaluation only.
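A minimal sketch of this serialization and split, assuming a simple "key is value" template and a seeded shuffle; the field names and seeding are illustrative, while the 60/20/20 proportions and the 1,412‑case total come from the paper:

```python
import random

def render_features(record):
    """Serialize one tabular case as natural-language statements,
    e.g. {"sex": "male"} -> ["sex is male"]. Field names here are
    illustrative; the paper uses 27 NLSY97-derived features."""
    return [f"{key.replace('_', ' ')} is {value}" for key, value in record.items()]

def split_cases(cases, seed=0):
    """Shuffle and split cases 60/20/20 into train/validation/test,
    mirroring the paper's split (the exact seeding is an assumption)."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

On 1,412 cases this yields 847/282/283 examples, with the test slice held out for evaluation only.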

Experiments run on a local workstation equipped with two RTX 3090 GPUs and 128 GB RAM, using the Ollama model serving stack. Sixteen 7–14 billion‑parameter models form the primary ensemble; larger ensembles (37 and 81 models) span 0.5–72 billion parameters. All models are 4‑bit quantized, and temperatures are set to 0.0 for deterministic single‑turn prompting and 0.7 for the stochastic multi‑turn debate.
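A request to a locally served Ollama model could be parameterized to match the reported temperatures roughly as below. The payload shape follows Ollama's `/api/generate` REST endpoint; the model tag is a placeholder, and tying temperature to a `debate_turn` flag is an assumption about how the two settings were applied:

```python
def build_ollama_request(model, prompt, debate_turn=False):
    """Build a payload for Ollama's /api/generate endpoint
    (POST http://localhost:11434/api/generate). Temperature follows
    the paper's setup: 0.0 for deterministic single-turn prompting,
    0.7 for the stochastic multi-turn debate."""
    return {
        "model": model,            # e.g. a 4-bit quantized 7-14B model tag
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.7 if debate_turn else 0.0},
    }
```

Keeping single‑turn runs at temperature 0.0 makes the prompting baselines reproducible, while 0.7 lets debate turns explore different argumentative lines.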

Two reasoning paradigms are compared: (1) standard single‑agent prompting, in zero‑shot, chain‑of‑thought (CoT), and 30‑shot CoT variants; and (2) the proposed AgenticSimLaw MAD. Across almost 90 model‑prompt combinations, the MAD approach yields higher stability: the correlation between accuracy and F1‑score is markedly stronger, indicating more consistent performance under class imbalance. The judge's intermediate belief updates expose when the model's confidence shifts, allowing analysts to trace which features drive decisions and where uncertainty spikes.
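The stability claim rests on the correlation between per‑configuration accuracy and F1. A stdlib‑only sketch of that computation, over clearly made‑up illustrative scores (not the paper's numbers):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs) *
             sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom

# Illustrative, fabricated-for-demo per-configuration scores:
accuracy = [0.61, 0.64, 0.58, 0.66, 0.63]
f1_score = [0.59, 0.63, 0.55, 0.65, 0.62]
```

A correlation near 1.0 across configurations means a model that ranks well on accuracy also ranks well on F1, i.e. its performance is not an artifact of favoring the majority class.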

The MAD protocol consumes roughly 9,100 additional tokens per run compared with single‑shot CoT, a deliberate trade‑off that provides a complete audit trail. Robust parsing combines strict JSON validation with a permissive regex fallback; failures are logged and either skipped (for intermediate turns) or cause the simulation to be marked incomplete (for the final verdict). Every public utterance, private plan, belief state, and item of API metadata (token count, latency, temperature) is timestamped, enabling post‑hoc bias analysis, failure‑mode detection, and reproducibility.
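That strict‑JSON‑then‑regex parsing strategy, including the skip‑versus‑fail distinction between intermediate turns and the final verdict, can be sketched as follows; the field names (`prediction`, `confidence`) and regex patterns are illustrative assumptions:

```python
import json
import re

def parse_verdict(raw, final=False):
    """Parse an agent reply: strict JSON first, permissive regex fallback second.
    Intermediate-turn failures return None (caller logs and skips); a final-verdict
    failure raises, marking the simulation incomplete."""
    # 1. Strict path: the reply is a well-formed JSON object with the expected keys.
    try:
        obj = json.loads(raw)
        return {"prediction": obj["prediction"],
                "confidence": float(obj["confidence"])}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        pass
    # 2. Permissive fallback: scrape key-value pairs out of free-form text.
    pred = re.search(r'"?prediction"?\s*[:=]\s*"?(\w+)"?', raw)
    conf = re.search(r'"?confidence"?\s*[:=]\s*([0-9.]+)', raw)
    if pred and conf:
        return {"prediction": pred.group(1), "confidence": float(conf.group(1))}
    if final:
        raise ValueError("final verdict unparseable; simulation marked incomplete")
    return None
```

The two‑tier design keeps the happy path strictly validated while tolerating the loosely formatted replies smaller quantized models often produce mid‑debate.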

Ethically, the paper stresses that AgenticSimLaw is a research benchmark, not a production sentencing tool, and imposes a “non‑deployment” constraint for sensitive domains. This aligns with broader calls for responsible AI in criminal‑justice contexts, where transparency and human oversight are paramount.

In sum, AgenticSimLaw demonstrates that a role‑structured, fully logged multi‑agent debate can turn opaque LLM reasoning on tabular data into a process that humans can inspect, question, and audit. It outperforms traditional single‑agent CoT in stability and offers richer explanatory artifacts, suggesting a viable path toward trustworthy AI assistance in high‑stakes domains such as law, medicine, finance, and policy analysis. Future work may extend the protocol to more complex legal or medical cases and integrate automated bias‑mitigation modules to further strengthen responsible AI guarantees.

