Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Mafia is a social deduction game where informed mafia compete against uninformed townsfolk. Its asymmetry of information and reliance on theory-of-mind reasoning mirror real-world multi-agent scenarios, making it a useful testbed for evaluating the social intelligence of large language models (LLMs). To support a systematic study, we introduce Mini-Mafia: a simplified four-player variant with one mafioso, one detective, and two villagers. We set the mafioso to kill a villager and the detective to investigate the mafioso during the night, reducing the game to a single day phase of discussion and voting. Remarkably, we find that the mafia win rate $p$ in this three-agent system can be described by a simple theoretical model: $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ are intrinsic model parameters representing the mafioso's deception, the detective's disclosure, and the villager's detection capabilities, respectively. This compact analytic description of an interacting triad shows that multi-agent dialogue can be captured by a few latent parameters while still matching empirical outcomes, opening a path to a principled theoretical description of multi-agent LLM systems. Estimating these parameters from LLM gameplay data using Bayesian inference yields the Mini-Mafia Benchmark. Our experiments reveal counterintuitive results, including cases where smaller models significantly outperform larger ones. We also establish human baselines, revealing that LLMs excel at persuasion but lag behind in simple strategic reasoning during agentic interaction. Beyond benchmarking, Mini-Mafia enables quantitative study of emergent multi-agent dynamics such as name bias and last-speaker advantage, and contributes to AI safety by generating training data for deception detectors.
💡 Research Summary
The paper introduces “Mini‑Mafia,” a deliberately simplified four‑player variant of the classic social deduction game Mafia, to evaluate the social intelligence of large language models (LLMs) in multi‑agent settings. In Mini‑Mafia, one player is the Mafia, one is a Detective, and two are Villagers. The night phase is fixed: the Mafia kills a random Villager, while the Detective always investigates the Mafia and learns their identity. Consequently, the game reduces to a single day phase consisting of two rounds of public discussion followed by a blind vote. This design creates a clear information asymmetry—partial knowledge for the Mafia, complete knowledge for the Detective, and no knowledge for the Villagers—allowing the authors to isolate three core interactive capabilities: Deception (Mafia), Detection (Villager), and Disclosure (Detective).
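The day-phase protocol described above can be sketched as a small game loop. The agent interface (`speak`/`vote` methods), the use of role names as player identifiers, and the tie-handling rule are my assumptions for illustration, not the paper's implementation:

```python
from collections import Counter

def play_mini_mafia(agents):
    """One Mini-Mafia day phase, after the fixed night phase in which
    the mafioso killed a villager and the detective learned the
    mafioso's identity.

    `agents` maps live player names to objects exposing a hypothetical
    speak(transcript) -> str and vote(transcript) -> str interface.
    For simplicity, role names double as player names here; the
    tie-breaking rule below is an assumption, not taken from the paper.
    """
    transcript = []
    # Two rounds of public discussion: every live player speaks in turn.
    for _ in range(2):
        for name, agent in agents.items():
            transcript.append((name, agent.speak(transcript)))
    # Blind vote: all players vote simultaneously, seeing only the dialogue.
    votes = Counter(agent.vote(transcript) for agent in agents.values())
    top, top_count = votes.most_common(1)[0]
    # Assumed win condition: the town wins only if the mafioso alone
    # tops the vote; any tie lets the mafia survive.
    is_tie = sum(1 for c in votes.values() if c == top_count) > 1
    mafia_wins = is_tie or top != "mafioso"
    return mafia_wins, transcript
```

Plugging in LLM-backed agents would amount to implementing `speak` and `vote` as prompt calls that receive the rules, the secret role information, and the running transcript, as the summary describes.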
To benchmark LLMs, the authors construct a systematic tournament where ten different models each play every role. They generate 140 distinct (Mafia, Detective, Villager) configurations, each repeated 100 times, yielding 14,000 games. Human baselines are collected from 80 games played at a data‑science school. Each agent receives a prompt containing the game rules, its secret role information, and the full dialogue history; it then produces a public utterance for each discussion round and finally a vote.
The central theoretical contribution is a compact logistic model of the Mafia’s win probability: logit(p₍ᵢⱼₖ₎) = vₖ · (mᵢ − dⱼ), where p₍ᵢⱼₖ₎ is the probability that model i (as Mafia) wins against model j (as Detective) and model k (as Villager). The three latent parameters have intuitive meanings: mᵢ captures the Mafia’s deception strength, dⱼ the Detective’s disclosure effectiveness, and vₖ the Villager’s sensitivity to the deception‑disclosure gap. The functional form mirrors the Fermi‑Dirac distribution, with (d − m) playing the role of energy and v acting as inverse temperature.
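The model is a one-line computation; a minimal sketch (parameter values below are illustrative, not fitted values from the paper):

```python
import math

def mafia_win_probability(m: float, d: float, v: float) -> float:
    """Mafia win probability under logit(p) = v * (m - d).

    m: mafioso's deception strength, d: detective's disclosure
    effectiveness, v: villager's sensitivity to the gap (all latent
    parameters on the logit scale).
    """
    return 1.0 / (1.0 + math.exp(-v * (m - d)))

# When deception exactly matches disclosure, the game is a coin flip
# regardless of the villager's sensitivity.
assert abs(mafia_win_probability(1.0, 1.0, 2.0) - 0.5) < 1e-12

# A stronger detective (d > m) pushes the mafia win rate below one half,
# and a larger v amplifies whichever side of the gap is ahead.
assert mafia_win_probability(0.5, 1.5, 1.0) < 0.5
```

The Fermi-Dirac reading falls out directly: rewriting p = 1/(1 + e^{v(d−m)}) matches the occupation function 1/(1 + e^{(E−μ)/kT}) with (d − m) as energy and v as inverse temperature.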
Using Bayesian inference with weakly informative N(0, 2) priors, the authors estimate the 30 latent parameters (three per model) from the binomial win counts. They employ PyMC’s No‑U‑Turn Sampler (NUTS) with two chains and 2,000 samples each, achieving good convergence (R̂≈1.01) and ample effective sample sizes. Post‑hoc rescaling enforces E
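The paper fits this model with PyMC's NUTS sampler; as a dependency-light sketch, the unnormalized log-posterior such a sampler explores can be written out directly. The function name, argument layout, and index encoding here are mine, for illustration only:

```python
import numpy as np

def log_posterior(m, d, v, wins, games, idx):
    """Unnormalized log-posterior for the Mini-Mafia win-rate model.

    m, d, v : arrays of per-model latent parameters (one entry per LLM).
    wins, games : mafia win counts and total game counts per
        (Mafia, Detective, Villager) configuration.
    idx : (i, j, k) index arrays selecting the mafioso, detective, and
        villager model for each configuration.
    Names and data layout are illustrative, not the paper's code.
    """
    i, j, k = idx
    logit_p = v[k] * (m[i] - d[j])
    # Binomial log-likelihood via the numerically stable log-sigmoid:
    # log p = -log(1 + e^{-x}) and log(1 - p) = -log(1 + e^{x}).
    log_p = -np.logaddexp(0.0, -logit_p)
    log_1mp = -np.logaddexp(0.0, logit_p)
    log_lik = np.sum(wins * log_p + (games - wins) * log_1mp)
    # Weakly informative N(0, 2) priors on every latent parameter
    # (normalization constants dropped).
    log_prior = -np.sum(np.concatenate([m, d, v]) ** 2) / (2 * 2.0 ** 2)
    return log_lik + log_prior
```

With ten models this gives the 30 latent parameters the summary mentions; NUTS then draws from this posterior by following its gradient, which PyMC obtains automatically.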