Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade worst-case performance, an issue with serious implications in high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem, along with its variants, as a two-player zero-sum extensive-form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.


💡 Research Summary

The paper tackles the problem of information‑seeking by large language models (LLMs) when they lack sufficient knowledge to complete a task, a situation common in high‑stakes domains such as medical diagnosis or planning. Existing methods (Self‑Consistency, Tree of Thought, Uncertainty of Thought) assume that the hidden “item” to be identified is drawn from a known, often uniform distribution, and they optimize expected information gain. This assumption is unrealistic when an adversary could select the hardest item, and it can lead to poor worst‑case performance.

To address this, the authors formalize the Strategic Language Search (SLS) problem, an adversarial version of the classic 20‑Questions game. An Item Chooser secretly selects an item s* from a finite set S. The Questioner then asks a sequence of binary (yes/no) natural‑language questions q∈Q, receiving deterministic answers f(q, s*). The game ends when the Questioner can identify s* with certainty; the cost incurred is the number of questions asked. The Questioner’s objective is to minimize the maximum possible cost, i.e., to find a strategy that performs best against the worst‑case item choice. This is precisely a two‑player zero‑sum extensive‑form game (EFG) with imperfect information, and the optimal strategy corresponds to a Nash equilibrium (NE), which must be randomized in general.
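The minimax structure of SLS can be illustrated with a toy instance. The sketch below is illustrative, not from the paper: the item names, the predicate "questions", and the brute-force recursion are all hypothetical, and the recursion only searches deterministic Questioner strategies, whereas the paper's NE may additionally require randomization.

```python
from functools import lru_cache

# Hypothetical toy SLS instance: four items and binary predicates
# standing in for natural-language yes/no questions.
ITEMS = ("cat", "dog", "eagle", "shark")
QUESTIONS = {
    "Is it a mammal?":        lambda s: s in ("cat", "dog"),
    "Can it fly?":            lambda s: s == "eagle",
    "Does it live in water?": lambda s: s == "shark",
    "Is it a cat?":           lambda s: s == "cat",
}

@lru_cache(maxsize=None)
def worst_case_cost(candidates: frozenset) -> int:
    """Minimax number of questions needed to identify the item,
    assuming deterministic answers and an adversarial Item Chooser
    who effectively steers play toward the worse branch."""
    if len(candidates) <= 1:
        return 0
    best = float("inf")
    for q, f in QUESTIONS.items():
        yes = frozenset(s for s in candidates if f(s))
        no = candidates - yes
        if not yes or not no:
            continue  # this question gives no information here
        best = min(best, 1 + max(worst_case_cost(yes), worst_case_cost(no)))
    return best

print(worst_case_cost(frozenset(ITEMS)))  # → 2, i.e. ceil(log2(4))
```

Here "Is it a mammal?" splits the four items into two pairs, each of which is resolved by one more question, so two questions suffice even against the worst-case item choice.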

The paper establishes several theoretical results: (1) deciding whether a deterministic sequence of k questions can guarantee identification of a known s* is NP‑complete, showing that optimal deterministic policies are computationally intractable for realistic Q. (2) When the question set Q is unrestricted (denoted Q∞), an “even‑split” strategy that halves the remaining candidate set at each step is optimal and achieves a worst‑case cost of ⌈log₂|S|⌉. (3) The authors introduce restricted variants—SLS‑Restricted (SLSR), where at each step the Questioner may only ask questions drawn from a function g(S(H)) that depends on the current candidate set, and Weighted SLS (WSLS/WSLSR), where each item carries a weight w(s) that scales the cost, reflecting higher stakes for certain items.
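The even-split bound from result (2) can be checked numerically. A minimal sketch, assuming only that in the unrestricted setting each question can split any candidate set as evenly as possible, and that the adversary always keeps the Questioner in the larger half:

```python
import math

def even_split_worst_case(n: int) -> int:
    """Worst-case question count for the even-split strategy on n items:
    each question halves the candidate set, and the adversarial Item
    Chooser keeps the larger half, so ceil(log2 n) questions suffice."""
    cost = 0
    while n > 1:
        n = math.ceil(n / 2)  # adversary survives in the larger half
        cost += 1
    return cost

# Matches the ceil(log2 |S|) bound on a few set sizes.
for n in (8, 16, 32, 33, 64):
    assert even_split_worst_case(n) == math.ceil(math.log2(n))
```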

Because manually constructing Q and f for large domains is infeasible, the authors propose to instantiate them with LLMs. The set Q consists of questions that an LLM can generate given a prompt, while f is realized by feeding the same LLM (or a human oracle) the question together with the hidden item description. The function g is similarly realized by prompting the LLM to propose up to m candidate questions based on the current candidate set. An important assumption (3.10) is that the LLM behaves as a perfect oracle for f, i.e., it never makes factual errors when answering about the hidden item.
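How f and g might be instantiated with an LLM can be sketched as follows. The prompt wording and the `complete(prompt) -> str` wrapper are hypothetical stand-ins for whatever chat API is used; the paper's actual prompts are not reproduced here.

```python
def answer_oracle(question: str, hidden_item: str, complete) -> bool:
    """Realize f(q, s*): ask the LLM to answer yes/no about the hidden item.
    Assumption 3.10 in the paper treats this oracle as error-free."""
    prompt = (f"The hidden item is: {hidden_item}.\n"
              "Answer strictly 'yes' or 'no'.\n"
              f"Question: {question}")
    return complete(prompt).strip().lower().startswith("yes")

def candidate_questions(candidate_set, m: int, complete) -> list:
    """Realize g(S(H)): prompt the LLM for up to m discriminating
    yes/no questions given the current candidate set."""
    prompt = ("Remaining candidates: " + ", ".join(candidate_set) + ".\n"
              f"Propose up to {m} yes/no questions, one per line, that best "
              "distinguish among them.")
    lines = [l.strip() for l in complete(prompt).splitlines() if l.strip()]
    return lines[:m]
```

In practice the same model can play both roles, and `answer_oracle` could be replaced by a human respondent without changing the rest of the pipeline.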

The core algorithmic contribution is Game of Thought (GoT), a framework that approximates the NE for SLSR using Counterfactual Regret Minimization (CFR) or similar regret‑minimization techniques. GoT iteratively simulates plays of the game, updates regret tables for each information set (i.e., each possible history of questions and answers), and converges to a mixed strategy that minimizes worst‑case cost. By leveraging the LLM to generate the feasible question set g(S(H)) at each iteration, GoT can scale to moderately large S (up to a few dozen items in the experiments) while respecting natural‑language constraints on question form.
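The regret-matching update at the heart of CFR can be sketched in a few lines. This is the generic textbook update applied at a single information set, not the paper's full GoT implementation:

```python
def regret_matching(cumulative_regret):
    """One regret-matching step, the core update CFR applies per
    information set: play each action in proportion to its positive
    cumulative regret, falling back to uniform when none is positive."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    n = len(cumulative_regret)
    return [1.0 / n] * n

strategy = regret_matching([3.0, -1.0, 1.0])
# → [0.75, 0.0, 0.25]: the action with no positive regret gets zero mass
```

Iterating this update over simulated plays, and averaging the resulting strategies, is what drives CFR's convergence to an approximate NE; GoT applies it over the histories of questions and answers generated with the LLM.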

Empirical evaluation is conducted on several synthetic 20‑Questions‑style benchmarks with varying numbers of items (8, 16, 32, 64), different limits on the number of questions per turn (m = 1–3), and weighted scenarios where certain items have higher penalties. Baselines include (a) direct prompting (the Questioner asks a single LLM‑generated question without look‑ahead), (b) Self‑Consistency and Tree of Thought (which perform shallow look‑ahead via multiple sampled reasoning paths), and (c) Uncertainty of Thought (UoT), which performs a depth‑limited tree search assuming a uniform prior over items. Results show that GoT consistently reduces both the average and worst‑case number of questions. In the hardest adversarial settings, GoT achieves 15‑30 % fewer questions than the best baseline, and in weighted experiments it lowers the weighted cost by over 20 % for high‑penalty items. Moreover, the learned mixed strategies approximate the theoretical NE: the empirical regret converges to near‑zero, confirming that randomization is essential for optimal worst‑case performance.

The authors discuss limitations: the reliance on an error‑free LLM oracle is strong; in practice LLMs may misinterpret or misanswer questions, which would degrade the guarantees. CFR can be computationally expensive when the branching factor (size of g(S(H))) is large, leading to slower convergence. The current implementation restricts the number of questions per turn (parameter m), so the method does not yet handle the fully unrestricted SLS where the Questioner could ask any conceivable question. Future work is suggested on (i) incorporating uncertainty in the oracle (e.g., Bayesian modeling of LLM error), (ii) hybrid approaches that combine regret minimization with heuristic pruning to accelerate convergence, (iii) real‑world deployments in domains like medical triage where weighted costs are critical, and (iv) extending the framework to multi‑turn dialogue with human users rather than a purely simulated Item Chooser.

In summary, the paper reframes LLM‑driven information seeking as an adversarial extensive‑form game, provides rigorous theoretical foundations, and introduces the Game of Thought algorithm that approximates a Nash equilibrium to achieve robust worst‑case performance. This contribution advances the reliability of LLMs in high‑stakes applications where guaranteeing correct identification under adversarial conditions is essential.

