Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably
Enoch Hyunwook Kang (University of Washington, ehwkang@uw.edu)

Abstract

AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents' advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that "reasonably reasoning" agents, i.e., agents capable of forming beliefs about others' strategies from previous observations and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner's dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.

1 Introduction

Recent advancements integrating Artificial Intelligence (AI) models with sophisticated reasoning and tool-use capabilities have enabled the widespread practical deployment of AI agents across diverse application domains (Qu et al., 2025). As AI agents become increasingly integral to interactive systems, a critical and timely challenge arises: determining whether these agents can navigate complex strategic interactions effectively in real-world competitions in digital markets, e.g., automated negotiation, dynamic pricing, and advertising auctions (Bianchi et al., 2024; T. Guo et al., 2024; Lopez-Lira, 2025; Y. Li et al., 2024; Zhu et al., 2025; Bansal et al., 2025). As AI agents are deployed more broadly in such settings, the central issue is not only whether they can behave strategically, but also whether their strategic interactions will converge to stable, predictable equilibria, and which equilibria such systems will select.

This question is not merely theoretical. Recent work by Calvano et al. (2020) and Fish, Gonczarowski, and Shorrer (2024), together with related empirical studies of algorithmic interaction, suggests that autonomous algorithmic and AI systems can generate strategically consequential repeated-game behavior in economically important environments.
Pricing algorithms can sustain supra-competitive outcomes without explicit communication, rapid reactive pricing technologies can elevate prices even in competitive equilibrium, and real-world adoption of algorithmic pricing has been associated with higher margins in concentrated markets (Brown and MacKay, 2023; Assad et al., 2024). On the other hand, empirical evaluations of LLMs reveal that widely used, off-the-shelf AI models (e.g., GPT, Claude, Gemini, Kimi, DeepSeek) deployed as AI agents frequently fail to exhibit predicted equilibrium behavior in strategic interactions and often resort to brittle heuristics or produce inconsistent policies (X. Guo et al., 2024; J.-t. Huang et al., 2024; Hua et al., 2024; Buscemi et al., 2025). In practice, simply prompting standard AI models to engage in repeated games often yields strategies that diverge significantly from the rational, equilibrium-based play predicted by classical game theory, although some successes have been reported (Akata et al., 2025). Such brittleness and inconsistency raise concerns about deploying AI agents in societally crucial domains that require reliable strategic decision-making.

One prominent approach to address this limitation is targeted, strategic post-training procedures (Park et al., 2024; Duque et al., 2024). However, relying on uniform deployment of such fine-tuning approaches across diverse and independently developed AI agents is often impractical. Consequently, there exists a compelling need for assurance that AI agents with some "reasonable" reasoning capabilities autonomously adapt their strategies and find a stable equilibrium. This critical observation motivates the central research question explored in this paper:

Can off-the-shelf reasoning AI agents achieve strategic equilibrium without post-training?

In this paper, we theoretically and empirically address this question within the framework of infinitely repeated games, a setting in which agents repeatedly encounter identical strategic scenarios with no predefined endpoint. Specifically, we show that reasoning LLM-based AI agents naturally evolve toward Nash equilibrium along realized play paths, without relying on explicit post-training or specialized fine-tuning procedures. The key to achieving this lies in two basic reasoning capabilities we call "reasonably reasoning" capabilities: Bayesian learning and asymptotic best-response learning. By Bayesian learning, we refer to an agent's capacity to learn other agents' strategies from observed historical interactions, thereby enabling a theory of mind of others' future actions. By asymptotic best-response learning, we mean the agent's ability to eventually learn an optimal counter-strategy given its inferred beliefs about other agents' strategies, thereby maximizing its expected payoff. Under such capabilities, which we demonstrate that AI agents possess, we prove that agents eventually exhibit a Nash equilibrium along every realized path in possibly infinitely repeated interactions.
Our main theoretical results are heavily rooted in a fundamental result in the Bayesian learning literature (Kalai and Lehrer, 1993a; Norman, 2022): a set of Bayesian learning agents with the ability to exactly best respond to their beliefs about the opponent agents' strategies, i.e., to maximize their expected payoff, eventually exhibit a Nash equilibrium along every realized path in possibly infinitely repeated interactions. The key difference in this paper's theoretical result is that it allows asymptotic best-response learning rather than assuming that the agent can choose the exact best-response action, i.e., that the agent is an expected-utility maximizer. This is an important relaxation, as off-the-shelf LLM agents are not expected-utility maximizers (Yamin et al., 2026; Ge, Yongyan Zhang, and Vorobeychik, 2026). Rather, they are stochastic posterior samplers by default (i.e., in the temperature = 1 setup) (Arumugam and Griffiths, 2025). We prove that, under mild and realistic assumptions, LLM agents, which are posterior belief samplers, achieve asymptotic best-response learning. We then prove that the fundamental result in Bayesian learning (Kalai and Lehrer, 1993a; Norman, 2022), which requires exact best-response capability, can be extended to asymptotic best-response learning. Combined with recent findings that LLMs are Bayesian in-context learners under stationary, repeated settings (Coda-Forno et al., 2023; Lu et al., 2024; Cahyawijaya, Lovenia, and Fung, 2024; Xie et al., 2021; X. Wang et al., 2023; Wakayama and Suzuki, 2025; Falck, Z. Wang, and Holmes, 2024), we conclude that reasoning LLM agents eventually exhibit a Nash equilibrium along every realized path in possibly infinitely repeated interactions.

Beyond the benchmark with common-knowledge stage payoffs, we also consider the practically relevant case in which payoffs are not known to agents ex ante and each agent observes only its own privately realized stochastic payoffs. We modify PS-BR to sample not only an opponent-strategy hypothesis, but also a hypothesis for the agent's own mean payoff matrix (equivalently, its own payoff kernel within the known noise family). Under the analogous learning conditions, together with an additional asymptotic public-sufficiency assumption on hidden private histories, PS-BR recovers the same asymptotic on-path $\varepsilon$-best-response property and therefore inherits the zero-shot Nash convergence guarantees.

This paper is structured as follows. Section 2 discusses related works. Section 3 introduces the setup. Section 4 defines reasonably reasoning agents and relates their Bayesian and best-response learning properties to in-context and test-time inference in language models. Section 5 presents the main zero-shot Nash convergence results. Section 6 discusses how we can extend the zero-shot Nash convergence result to unknown, stochastic payoffs. Then Section 7 provides empirical evidence of the theoretical contributions in this paper.

2 Related works

Bayesian learning. The theoretical analysis of reasonably reasoning agents is based largely on the Bayesian learning literature. Bayesian learning in repeated games is defined by a fundamental tension between the ability to logically learn opponents' strategies and the ability to respond to them optimally.
The foundational possibility result of Kalai and Lehrer (1993a) showed that if players' prior beliefs contain a "grain of truth" (absolute continuity) regarding the true distribution of play, then standard Bayesian updating guarantees that their predictions will eventually converge to the truth, thereby naturally culminating in a Nash equilibrium. However, Nachbar (1997, 2005) subsequently proved a negative result: requiring players to simultaneously maintain this grain of truth and perfectly best respond across all possible counterfactual game histories leads to a mathematical contradiction, as the infinite sets of learnable strategies and optimizing strategies are often mutually singular. Norman (2022) resolved this tension by introducing "optimizing learnability", the crucial insight that agents do not need to perfectly learn unreached counterfactuals; they only need to accurately predict and best respond along the realized path of play. Nonetheless, Norman identified that a stubborn impossibility persists in a specific class of games called MM* games, where adversarial payoff geometries prevent learning and optimization from coexisting even on-path.

This paper systematically navigates these classic boundaries to guarantee zero-shot Nash convergence for LLM agents. We actively employ the grain of truth of Kalai and Lehrer (1993a) (Assumption 2) to guarantee predictive accuracy via the classic merging of opinions, and we avoid the impossibility of Nachbar (1997, 2005) by formally adopting the on-path relaxation and the non-MM* condition of Norman (2022). However, although employing the standard Bayesian learning setup (Kalai and Lehrer, 1993a; Norman, 2022) guarantees accurate forecasts of future on-path actions, it does not guarantee posterior concentration, as LLM agents are not expected-utility maximizers but rather posterior belief samplers (Arumugam and Griffiths, 2025; Yamin et al., 2026; Ge, Yongyan Zhang, and Vorobeychik, 2026). To address this, we introduce the finite-menu and KL-separation condition (Assumption 3), which is necessary to mathematically force the LLM agent's posterior to concentrate onto a single point mass (Lemma 4.2). By forcing posterior concentration, the LLM agent's stochastic "predict-then-act" reasoning seamlessly stabilizes into an asymptotic best response.

Strategic capabilities of LLM agents. As LLMs are increasingly deployed as interactive agents, a growing literature studies whether LLMs behave strategically in canonical games, emphasizing preference representation, belief formation, and (approximate) best responses rather than taking equilibrium play for granted (H. Sun et al., 2025; Jia et al., 2025). In one-shot normal-form, bargaining, and negotiation tasks, off-the-shelf models often follow plausible but context-sensitive heuristics: behavior can depart from equilibrium predictions and change markedly under small framing or instruction variations (F. Guo, 2023; C. Fan et al., 2023; Hua et al., 2024). Strategic performance can improve with model scale and reasoning scaffolds, but the remaining variance across prompts and settings is substantial (Kader and Lee, 2024). These issues become more acute in repeated games, where payoffs depend on stable, history-contingent policies.
Multi-agent evaluation benchmarks report large cross-model and cross-game heterogeneity and frequent non-equilibrium dynamics, especially in coordination and social-dilemma regimes (Mao et al., 2023; Duan et al., 2024; J.-t. Huang et al., 2024). Controlled repeated-game experiments similarly find that cooperation and reciprocity can emerge, but are fragile to opponent choice and to seemingly minor prompt or protocol changes (Akata et al., 2025; Fontana, Pierri, and Aiello, 2024; Willis et al., 2025). In market-style repeated settings, recent work further documents collusive or supra-competitive outcomes among LLM agents and highlights sensitivity to communication opportunities and wording choices (Fish, Gonczarowski, and Shorrer, 2024; Agrawal et al., 2025).

Overall, existing results demonstrate meaningful strategic adaptation but do not provide general, zero-shot guarantees that heterogeneous, independently deployed off-the-shelf agents will converge to predictable equilibrium behavior. Our paper targets this gap by identifying two basic theory-of-mind capabilities, Bayesian learning of opponents and asymptotic best-response learning, and proving that, under mild conditions, they imply Nash continuation play along realized paths in repeated games, without requiring explicit post-training or cross-agent coordination.

LLM agents as Bayesian in-context learners. A growing body of work links in-context learning (ICL), i.e., test-time adaptation that conditions on prior history in a prompt without parameter updates, to Bayesian inference over latent task hypotheses. In stylized transformer meta-learning settings, Xie et al. (2021) argue that transformers trained over a task distribution can implement an implicit Bayesian update and produce posterior-predictive behavior from in-context data; related analyses formalize ICL as (approximate) Bayesian model averaging and study how this view depends on model parameterization and drives generalization (Yufeng Zhang et al., 2023). Moving beyond specific constructions, Falck, Z. Wang, and Holmes (2024) propose a martingale-based perspective that yields diagnostics and theoretical criteria for when an in-context learner's predictive sequence is consistent with Bayesian updating, while Wakayama and Suzuki (2025) provide a broader meta-learning theory in which ICL is provably equivalent to Bayesian inference, with accompanying generalization guarantees. Empirically, LLMs also exhibit meta-adaptation across tasks presented in-context (Coda-Forno et al., 2023), and several abilities that appear "emergent" under scaling can be substantially attributed to improved ICL mechanisms (Lu et al., 2024). Complementing these viewpoints, X. Wang et al. (2023) model LLM ICL through a latent-variable lens, where demonstrations act as evidence about an unobserved task variable, clarifying why behavior can be highly sensitive to the specific examples and their ordering, and related results document few-shot in-context adaptation even in low-resource language learning regimes (Cahyawijaya, Lovenia, and Fung, 2024). For agentic and repeated-interaction settings, these Bayesian-ICL perspectives motivate modeling an LLM agent's use of the interaction transcript as maintaining and updating a posterior over opponent strategies/types; autoregressive generation can then be interpreted as sampling-based decision-making from the induced posterior (K. W. Zhang et al., 2024; Welleck et al., 2024), providing a concrete bridge between in-context learning and belief-based strategic behavior.

Expected utility maximization and best response. Standard learning-in-games analyses often assume agents compute an exact best response to their posterior at every history (Kalai and Lehrer, 1993a; Norman, 2022). This is a poor behavioral model for off-the-shelf LLM agents, whose actions are induced by stochastic decoding and thus implement a distribution over choices rather than a deterministic maximization of expected utility. In probabilistic decision tasks, Yamin et al. (2026) find systematic belief-decision incoherence, suggesting that elicited probabilities should not be treated as beliefs to which the model then perfectly best responds. In risky-choice experiments, Ge, Yongyan Zhang, and Vorobeychik (2026) similarly document substantial departures from expected-utility maximization and large sensitivity to prompting and model type, with behavior better described as noisy sampling. Arumugam and Griffiths (2025) argue that LLMs naturally implement posterior sampling. These results motivate replacing exact best response with a weaker, sampling-compatible notion, e.g., posterior-sampling policies, which we show achieve asymptotic best-response performance along the realized path.

3 Setup

3.1 Infinitely repeated game

We study interaction among a finite set of agents $I = \{1, 2, \ldots, N\}$ in an infinitely repeated (discounted) game with perfect monitoring of actions and common-knowledge stage payoffs. We define the game as the tuple $G = \langle I, \{A_i\}_{i \in I}, \{u_i\}_{i \in I}, \{\lambda_i\}_{i \in I} \rangle$, where:

• $I$ is the finite set of AI agents.
• $A_i$ is the finite set of actions available to agent $i$.
• $A = \prod_{i \in I} A_i$ is the joint action space; the joint action profile at round $t$ is denoted $a^t = (a^t_1, \ldots, a^t_{|I|}) \in A$, where $a^t_i$ is the action of agent $i$ at round $t$.
• $u_i : A \to [0, 1]$ is agent $i$'s (known) stage-game payoff function.
• $\lambda_i \in (0, 1)$ is the private discount factor used by agent $i$ to value future payoffs.

At each round $t = 1, 2, \ldots$, each agent $i$ simultaneously chooses an action $a^t_i \in A_i$, forming a joint action profile $a^t \in A$, which is publicly observed. Agent $i$ then receives the stage payoff
$$u_i(a^t) \in [0, 1]. \tag{1}$$
These stage payoffs induce a standard infinitely repeated game with perfect monitoring of actions.

In defining the payoffs $\{u_i\}_{i \in I}$, we restrict the set of games considered in this paper using the following standard assumption from the Bayesian learning literature (Norman, 2022). Intuitively, it excludes games without a pure-strategy equilibrium, e.g., rock-paper-scissors; rigorously, it rules out the pathological class in which on-path learning cannot be patched into nearby Nash behavior.

Assumption 1 (Non-MM$^\star$ game (Norman, 2022)). Consider the infinitely repeated game induced by the true stage payoffs $\{u_i\}_{i \in I}$ in equation (1). For each player $i$, define the stage-game minmax payoff and pure-action maxmin payoff as
$$\varphi_i := \min_{\sigma_{-i} \in \Delta(A_{-i})} \max_{\sigma_i \in \Delta(A_i)} u_i(\sigma_i, \sigma_{-i}), \qquad \Phi^\star_i := \max_{a_i \in A_i} \min_{a_{-i} \in \mathrm{BR}_{-i}(a_i)} u_i(a_i, a_{-i}),$$
where $\mathrm{BR}_{-i}(a_i)$ denotes the set of opponents' (joint) best responses to $a_i$ in the stage game. We say that the stage game is MM$^\star$ if $\Phi^\star_i < \varphi_i$ for every $i$.
We assume the stage game is not MM$^\star$ (equivalently, $\Phi^\star_i \geq \varphi_i$ holds for some $i$).

3.2 Strategy

We define the joint action history at round $t$ as $h^t = (a^1, a^2, \ldots, a^{t-1})$, and
$$H^t = \big\{ (a^1, a^2, \ldots, a^{t-1}) : a^s \in A \text{ for } s \leq t-1 \big\}.$$
Let $H^0 := \{\emptyset\}$ denote the empty history. Denote the complete set of possible histories by $H = \bigcup_{t \geq 0} H^t$. (Throughout this paper, we allow AI agents' strategies to have bounded memory.)

Definition 1 (Strategy). A strategy for agent $i$ is a function $f_i : H \to \Delta(A_i)$, which maps every joint action history to a distribution over agent $i$'s actions $A_i$. Let $F_i$ denote the space of all strategies of agent $i$. A strategy profile is a tuple $f = (f_1, \ldots, f_N) \in F = \prod_{i \in I} F_i$.

Let $H^\infty$ denote the space of infinite play paths, i.e., $H^\infty = \{ (a^1, a^2, \ldots) : a^t \in A \text{ for all } t \in \mathbb{N} \}$.

Definition 2 (Play-path distribution). A strategy profile $f = (f_1, \ldots, f_N) \in F$ induces a unique probability distribution $\mu_f$ over $H^\infty$ (the play-path distribution), defined on cylinder sets by
$$\mu_f\big(C(a^1, \ldots, a^t)\big) := \prod_{s=1}^{t} \prod_{i \in I} f_i(h^s)(a^s_i),$$
where $h^s = (a^1, \ldots, a^{s-1})$ and $C(h) := \{ z \in H^\infty : z = (h, \ldots) \}$. By Kolmogorov's extension theorem (Durrett, 2019), these finite-dimensional probabilities define a unique probability measure $\mu_f$ on $(H^\infty, \mathcal{B})$, where $\mathcal{B}$ is the product $\sigma$-algebra.

For the upcoming discussions, we fix some notation. Given a fixed history $h^t$, for any continuation profile $g$ (i.e., a profile that specifies play after histories extending $h^t$), let $\mu_g^{h^t}$ denote the induced distribution on $H^\infty$ over the future joint-action sequence $(a^t, a^{t+1}, \ldots)$ when play starts at history $h^t$ and follows $g$ thereafter. Formally, we identify the tail $(a^t, a^{t+1}, \ldots)$ with $y \in H^\infty$ by setting $y^1 = a^t$, $y^2 = a^{t+1}$, and so on, and regard $\mu_g^{h^t}$ as a measure on this reindexed space. For a full profile $g \in F$, we write $\mu_g^{h^t}$ for the continuation distribution induced by its restriction $g|_{h^t}$. If $\mu_g(C(h^t)) > 0$, then $\mu_g^{h^t}$ coincides with the conditional distribution $\mu_g(\cdot \mid h^t)$.

3.3 Beliefs

Each agent $i$ acts under uncertainty regarding the opponents' future play $f_{-i}$. The agent maintains subjective beliefs over opponents' strategies and updates them as the game unfolds.

Behavioral representatives (belief-equivalent behavior strategies). Fix player $i$ and a (possibly mixed) belief $\mu_i$ over opponents' strategy profiles $F_{-i}$. For any own strategy $g_i \in F_i$, $\mu_i$ induces a predictive distribution over play paths
$$P^{\mu_i, g_i}_i(E) := \int_{F_{-i}} \mu_{(g_i, f_{-i})}(E) \, d\mu_i(f_{-i}) \quad \text{for measurable } E \subseteq H^\infty.$$
By Kuhn's theorem (Kuhn, 1953) and Aumann's extension to infinite extensive-form games (Aumann, 1961), there exists a behavior-strategy profile $\bar{f}_{-i} \in F_{-i}$ such that for every $g_i$, $\mu_{(g_i, \bar{f}_{-i})} = P^{\mu_i, g_i}_i$. We call any such $\bar{f}_{-i}$ a behavioral representative (or belief-equivalent profile) of $\mu_i$ (Kuhn, 1953; Aumann, 1961; Kalai and Lehrer, 1993a). When $\mu_i$ has finite support $\{g^1_{-i}, \ldots, g^K_{-i}\}$, one convenient choice is
$$\bar{f}_{-i}(h)(a_{-i}) = \sum_{k=1}^{K} \mu_i(g^k_{-i} \mid h) \, g^k_{-i}(h)(a_{-i}),$$
for histories $h$ where Bayes' rule is defined.

Prior and posterior predictive beliefs. Agent $i$ holds a subjective prior $\mu^0_i$ over $F_{-i}$. Write $P^{0, g_i}_i := P^{\mu^0_i, g_i}_i$ for the induced prior predictive distribution.
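To make the finite-support construction above concrete, the following minimal Python sketch computes the belief-equivalent next-action distribution $\bar{f}_{-i}(h)$ as the posterior-weighted mixture of the menu strategies' predictions. It is purely illustrative: the two-action game and the hypothesis names `tit_for_tat` and `always_defect` are assumptions for the example, not the paper's experimental setup.

```python
# Minimal sketch (illustrative assumptions): the belief-equivalent behavioral
# representative \bar{f}_{-i}(h)(a) = sum_k mu_i(g^k | h) * g^k(h)(a) for a finite menu.

ACTIONS = ["C", "D"]

def tit_for_tat(history):
    """Opponent hypothesis 1: copy my last action (cooperate at the empty history)."""
    if not history:
        return {"C": 1.0, "D": 0.0}
    my_last = history[-1][0]            # history = list of (my_action, opp_action) pairs
    return {my_last: 1.0, ("D" if my_last == "C" else "C"): 0.0}

def always_defect(history):
    """Opponent hypothesis 2: defect at every history."""
    return {"C": 0.0, "D": 1.0}

def behavioral_representative(menu, weights, history):
    """Posterior-weighted mixture of the menu strategies' next-action distributions."""
    mix = {a: 0.0 for a in ACTIONS}
    for g, w in zip(menu, weights):
        pred = g(history)
        for a in ACTIONS:
            mix[a] += w * pred[a]
    return mix

menu = [tit_for_tat, always_defect]
posterior = [0.7, 0.3]                   # current belief mu_i(g^k | h), invented numbers
history = [("C", "C"), ("C", "D")]       # realized (own, opponent) actions so far
print(behavioral_representative(menu, posterior, history))
# -> {'C': 0.7, 'D': 0.3}: the belief-equivalent next-action distribution at h
```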
As discussed above (and as used explicitly in Kalai and Lehrer (1993a)), there exists a behavioral representative $f^i_{-i} \in F_{-i}$ such that, for every $g_i$, $\mu_{(g_i, f^i_{-i})} = P^{0, g_i}_i$. We fix such an $f^i_{-i}$ and call it agent $i$'s subjective expectation of opponents' play.

At any history $h^t$ where Bayes' rule is defined, $\mu^0_i$ yields a posterior $\mu^t_i(\cdot \mid h^t)$ and a posterior predictive continuation belief. Let $f^{i,t}_{-i}$ denote any behavioral representative of this posterior predictive belief. As a standing convention, we take these representatives to be chosen consistently by continuation: $f^{i,t}_{-i}\big|_{h^t} := f^i_{-i}\big|_{h^t}$, i.e., the time-$t$ posterior predictive continuation is represented by the restriction of the fixed belief-equivalent profile $f^i_{-i}$ to histories extending $h^t$.

3.4 Subjective utility and Nash equilibrium

Subjective expected utility. An agent evaluates the optimality of a continuation strategy based on its subjective beliefs at a given history. Fix a history $h^t$ and let $\sigma_i \in F_i(h^t)$ be a continuation strategy for agent $i$ from $h^t$ onward. For any opponents' continuation profile $g_{-i}$, denote by $\mu^{h^t}_{(\sigma_i, g_{-i})}$ the induced distribution over future play paths when play starts at $h^t$ and follows $(\sigma_i, g_{-i})$ thereafter. Following the standard literature (Kalai and Lehrer, 1993b), we define the belief-explicit subjective expected utility of playing $\sigma_i$ starting at $h^t$ as
$$V_i(\sigma_i \mid h^t; g_{-i}) = \mathbb{E}_{y \sim \mu^{h^t}_{(\sigma_i, g_{-i})}}\left[ (1 - \lambda_i) \sum_{k=0}^{\infty} \lambda_i^k \, u_i(y^{k+1}) \right], \tag{2}$$
where $y = (y^1, y^2, \ldots)$ represents the future path of joint actions relative to time $t$, with $y^{k+1}$ denoting the joint action at step $k+1$ of this future path (i.e., at absolute time $t + k$). When $g_{-i} = f^{i,t}_{-i}$, we write
$$V_i(\sigma_i \mid h^t) := V_i(\sigma_i \mid h^t; f^{i,t}_{-i}). \tag{3}$$
For any belief about opponents' continuation play $g_{-i}$ at history $h^t$, we define the set of $\varepsilon$-best-response continuation strategies for agent $i$ at $h^t$ as
$$\mathrm{BR}^{\varepsilon}_i(g_{-i} \mid h^t) = \left\{ \sigma_i \in F_i(h^t) : V_i(\sigma_i \mid h^t; g_{-i}) \geq \sup_{\sigma'_i \in F_i(h^t)} V_i(\sigma'_i \mid h^t; g_{-i}) - \varepsilon \right\}.$$

Nash equilibrium. The true performance of a strategy profile $f \in F$ for agent $i$ is given by
$$U_i(f) = \mathbb{E}_{z \sim \mu_f}\left[ (1 - \lambda_i) \sum_{t=1}^{\infty} \lambda_i^{t-1} u_i(z^t) \right],$$
where $z^t \in A$ is the joint action at round $t$, and $\lambda_i \in (0, 1)$ is agent $i$'s discount factor. The factor $(1 - \lambda_i)$ is a normalization ensuring that $U_i(f) \in [0, 1]$ whenever $u_i(a) \in [0, 1]$ for all $a \in A$.

Definition 3 ($\varepsilon$-Nash equilibrium). A strategy profile $f = (f_1, \ldots, f_N) \in F$ is an $\varepsilon$-Nash equilibrium if, for every agent $i \in I$,
$$U_i(f) \geq \sup_{f'_i \in F_i} U_i(f'_i, f_{-i}) - \varepsilon.$$

4 Reasonably Reasoning Agents

As discussed earlier, one of the key ideas of this work is that reasoning LLM-based AI agents are fundamentally "reasonably reasoning" agents. In this section, we formally define the class of reasonably reasoning agents and then demonstrate why reasoning-LLM agents are naturally reasonably reasoning agents. The definition isolates two ingredients: (i) Bayesian learning and (ii) an on-path, asymptotic notion of $\varepsilon$-consistency.

Definition 4 (Reasonably Reasoning Agent). Fix a repeated game and a strategy profile $f = (f_i)_{i \in I}$ generating the objective play-path distribution $\mu_f$ (Definition 2). Player $i$ is a Reasonably Reasoning (RR) agent if the following hold.
• Bayesian learning: Player $i$ has a prior $\mu^0_i$ over opponents' strategy profiles $F_{-i}$ and forms posteriors $(\mu^t_i)_{t \geq 0}$ by Bayes' rule. Let $f^{i,t}_{-i}$ denote any behavioral representative of player $i$'s posterior predictive continuation belief at history $h^t$ (as in Section 3.3), so that for every continuation strategy $\sigma_i$, $V_i(\sigma_i \mid h^t) = V_i(\sigma_i \mid h^t; f^{i,t}_{-i})$.

• Asymptotic $\varepsilon$-consistency on-path: For every $\varepsilon > 0$,
$$\mu_f\Big(\Big\{ z : \exists\, T_i(z, \varepsilon) < \infty \text{ s.t. } \forall t \geq T_i(z, \varepsilon),\ f_i\big|_{h^t(z)} \in \mathrm{BR}^{\varepsilon}_i\big(f^{i,t}_{-i}\big|_{h^t(z)} \,\big|\, h^t(z)\big) \Big\}\Big) = 1.$$

Intuitively, the "Bayesian learning" condition ensures that agents update their strategic beliefs coherently given observations. The "asymptotic $\varepsilon$-consistency" condition captures the idea that after a possibly long initial stumbling phase, agents eventually learn to play (approximately) optimal continuation strategies relative to their own beliefs along the realized path of play. It generalizes Norman's $\varepsilon$-consistency (Norman, 2022), which requires $\varepsilon$-best responding at all times (not only eventually) on a full-measure set of paths. This generalization is critical, as LLM-based AI agents are not expected-utility maximizers but rather posterior belief samplers (Arumugam and Griffiths, 2025; Yamin et al., 2026; Ge, Yongyan Zhang, and Vorobeychik, 2026).

4.1 Bayesian learning

The Bayesian-learning component of Definition 4 does not require an agent to explicitly store a symbolic prior over the full (and typically infinite-dimensional) strategy space $F_{-i}$. Instead, what matters for decision-making is that, after observing a public history $h^t$, the agent induces a coherent posterior predictive distribution over opponents' continuation play.

In repeated interaction, the latent object of inference is not merely the opponents' next-period action, but their repeated-game strategy: a reaction rule mapping histories to action distributions. While realized actions vary with the evolving public history, the underlying reaction rule is time-invariant; learning is therefore best understood as refining uncertainty about that rule (and, crucially, about its predictive implications for future play).

Formally, let $\mu^0_i$ denote player $i$'s subjective prior over opponents' strategy profiles $F_{-i}$, and let $\mu^t_i(\cdot \mid h^t)$ denote the posterior obtained by Bayes' rule after history $h^t$ whenever it is defined. The continuation problem depends on $\mu^t_i$ only through the induced posterior predictive distribution over future play, because continuation values are computed by integrating payoffs against that predictive distribution. Following Kalai and Lehrer (1993a), we represent player $i$'s posterior predictive continuation belief by a behavioral profile $f^{i,t}_{-i}$, chosen (without loss of generality) so that along the realized history $h^t(z)$,
$$f^{i,t}_{-i}\big|_{h^t(z)} \equiv f^i_{-i}\big|_{h^t(z)}, \tag{4}$$
where $f^i_{-i}$ is a fixed belief-equivalent profile representing player $i$'s prior predictive distribution as in Section 3. Thus, the continuation of a single belief-equivalent behavioral profile can be taken to match the time-$t$ posterior predictive continuation belief along the realized path.

To guarantee that Bayesian updating is well-defined and that predictive beliefs can converge to the truth on-path, we impose the standard grain-of-truth condition.

Assumption 2 (Grain of truth (Kalai and Lehrer, 1993a)).
For each player $i$, the objective play-path distribution $\mu_f$ is absolutely continuous with respect to $i$'s prior predictive distribution under $f_i$, i.e., $\mu_f \ll P^{0, f_i}_i$.

Equivalently, any event to which player $i$ assigns zero probability under its prior predictive model has zero probability under the true play distribution induced by $f$. Under Assumption 2, classical merging-of-opinions results (Blackwell and Dubins, 1962) imply that player $i$'s posterior predictive continuation beliefs become accurate along $\mu_f$-almost every realized play path. We formalize this later by showing that absolute continuity implies strong path prediction (Lemma 5.1).

4.2 LLM agents are Bayesian learning agents

The Bayesian-learning abstraction above matches what we can operationally observe from LLM agents: history-conditioned predictive distributions. An LLM, when prompted with the game rules and the realized interaction history, induces a conditional distribution over next tokens, which can be arranged to correspond to a distribution over a discrete label for an opponent strategy.

This "as if Bayesian" framing is appropriate for two reasons. First, the technical apparatus in Section 3 already works at the level of predictive distributions: given any coherent family of history-conditioned forecasts, we may represent it by an equivalent belief over opponents' strategies via the behavioral representatives $f^{i,t}_{-i}$ (and, in particular, by a fixed belief-equivalent profile $f^i_{-i}$ whose continuation matches posteriors along realized histories as in (4)). Second, recent theory and empirical evidence indicate that AI agents, most of which are autoregressive LLMs, can implement Bayesian or approximately Bayesian in-context learning in repeated, stationary environments (Xie et al., 2021; Yufeng Zhang et al., 2023; Falck, Z. Wang, and Holmes, 2024; Wakayama and Suzuki, 2025). Interpreting the prompt history as data and the model's induced distribution as a posterior predictive therefore provides a principled bridge between LLM behavior and Bayesian-learning agents in repeated games.

Finally, Assumption 2 should be understood as a modeling requirement on the LLM agent's support: the agent's predictive model should not rule out (assign zero probability to) events that can actually occur under the true interaction induced by $f$. In practice, this corresponds to ensuring that the agent's elicited beliefs (or the menu used to elicit them) are sufficiently expressive and include mild stochasticity/trembles so that no on-path event receives zero predicted probability.

4.3 LLM agents achieve asymptotic $\varepsilon$-consistency

In LLM agents, the output mechanism is mediated by stochastic decoding. Even holding the prompt fixed, a standard LLM induces a distribution over outputs rather than a deterministic argmax rule. Empirically, LLMs exhibit substantial decision noise and can violate the coherence one would expect if they were consistently computing expected-utility-maximizing best responses to elicited beliefs (Yamin et al., 2026; Ge, Yongyan Zhang, and Vorobeychik, 2026). Rather, LLM agents are posterior samplers, which sample an output from their posterior belief over the output space (Arumugam and Griffiths, 2025; Cai et al., 2024).
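To make the "posterior sampler" reading concrete, the following minimal sketch contrasts greedy argmax decoding with default temperature-1 decoding, which draws an opponent-strategy label with probability equal to its posterior mass. The label set and probabilities are invented for illustration and are not elicited from any particular model.

```python
import random

# Illustrative posterior over opponent-strategy labels, as might be induced by an LLM's
# conditional distribution given the interaction history (values are made up).
posterior = {"tit_for_tat": 0.55, "always_defect": 0.30, "grim_trigger": 0.15}

def greedy_decode(p):
    """Deterministic argmax rule: always returns the modal label."""
    return max(p, key=p.get)

def temperature_one_sample(p):
    """Temperature-1 decoding: draws a label with probability equal to its posterior mass,
    i.e., posterior sampling rather than deterministic maximization."""
    labels, weights = zip(*p.items())
    return random.choices(labels, weights=weights, k=1)[0]

print(greedy_decode(posterior))                                  # 'tit_for_tat' every time
print([temperature_one_sample(posterior) for _ in range(5)])     # varies from draw to draw
```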
This creates a methodological tension for our purposes, as the Bayesian learning literature's Nash equilibrium convergence arguments require a best-response property (e.g., Kalai and Lehrer, 1993a; Norman, 2022). The goal of this subsection is to reconcile these: we formalize a minimal "predict-then-act" rule that is faithful to sampling-based LLM behavior yet is still sufficient to guarantee asymptotic $\varepsilon$-best-response learning on the realized play path.

LLMs naturally induce posterior-sampling best response (PS-BR). Reasoning LLM-based AI agents are naturally scaffolded first to infer the situation from the previous interactions and then to respond optimally to that inferred model (a theory-of-mind "infer, then respond" pattern (Zhou et al., 2023; Riemer et al., 2024)). This behavior is formally defined as posterior-sampling best response (PS-BR): sample a hypothesis about the opponent from the current posterior, then best respond to that sampled hypothesis.

Definition 5 (Posterior-sampling best response (PS-BR)). Fix player $i$ and a history $h^t$. Given the posterior $\mu^t_i(\cdot \mid h^t)$ over opponents' strategy profiles, PS-BR chooses a continuation strategy by:

1. sampling $\tilde{f}_{-i} \sim \mu^t_i(\cdot \mid h^t)$;
2. playing any best response $\sigma_i \in \mathrm{BR}_i(\tilde{f}_{-i} \mid h^t)$ in the continuation game after $h^t$.

Denote the resulting (randomized) continuation strategy by $\sigma^{\mathrm{PS}}_{i,t}(\cdot \mid h^t)$.

Here, step 1, "sample $\tilde{f}_{-i} \sim \mu^t_i(\cdot \mid h^t)$", is simply querying an LLM (under its default temperature $\tau = 1$ setup) to output an opponent-strategy label from the LLM's conditional distribution over allowed labels, given the previous interaction history. Step 2 is instantiated by evaluating a finite set of candidate self-strategies against that sampled opponent strategy via roll-out, and selecting the value-maximizing candidate. For implementation details used for experiments, see Appendix D.

Because PS-BR best responds to a single draw $\tilde{f}_{-i}$ rather than to the posterior predictive continuation $f^{i,t}_{-i}$, it can be suboptimal if the posterior remains dispersed: different posterior samples can induce different best responses, producing unstable play and potentially persistent deviations from best-response optimality. The key observation is that this suboptimality is entirely driven by posterior dispersion. The next lemma makes this quantitative by upper-bounding the best-response gap by a simple collision statistic of the posterior.

Lemma 4.1 (PS-BR is a $D^t_i$-best response). Fix player $i$ and a history $h^t$. Suppose $\mu^t_i(\cdot \mid h^t)$ is supported on a finite set $S_{-i}$ and write $p_t(g_{-i}) := \mu^t_i(g_{-i} \mid h^t)$, $g_{-i} \in S_{-i}$. Define the posterior collision complement
$$D^t_i(h^t) := 1 - \sum_{g_{-i} \in S_{-i}} p_t(g_{-i})^2 = \Pr_{\tilde{g}, \tilde{g}' \sim \mu^t_i(\cdot \mid h^t)}\big(\tilde{g} \neq \tilde{g}'\big).$$
Let $\sigma^{\mathrm{PS}}_{i,t}(\cdot \mid h^t)$ be PS-BR at $h^t$. Then
$$V_i(\sigma^{\mathrm{PS}}_{i,t} \mid h^t) \geq \sup_{\sigma_i} V_i(\sigma_i \mid h^t) - D^t_i(h^t).$$
Equivalently, $\sigma^{\mathrm{PS}}_{i,t}(\cdot \mid h^t) \in \mathrm{BR}^{D^t_i(h^t)}_i\big(f^{i,t}_{-i}\big|_{h^t} \,\big|\, h^t\big)$.

The statistic $D^t_i(h^t) = 1 - \|p_t\|_2^2$ is $0$ exactly when the posterior is degenerate (a point mass) and is close to $1$ when the posterior is highly spread out. Thus Lemma 4.1 says: PS-BR is an approximate best response to the agent's posterior predictive belief, with an approximation error equal to the probability that two independent posterior samples would disagree.
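The following minimal Python sketch illustrates PS-BR (Definition 5) together with the collision statistic $D^t_i$ from Lemma 4.1 in a repeated prisoner's dilemma. The opponent menu, candidate self-strategies, payoff numbers, trembles, and roll-out horizon are illustrative assumptions and are not the implementation of Appendix D.

```python
# Sketch of PS-BR over a finite hypothesis menu (illustrative assumptions throughout).
import random

ACTIONS = ["C", "D"]
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}   # row player's stage payoff

# Finite menu of opponent hypotheses: history -> distribution over the opponent's next action.
def tit_for_tat(hist):
    return {"C": 0.98, "D": 0.02} if (not hist or hist[-1][0] == "C") else {"C": 0.02, "D": 0.98}

def always_defect(hist):
    return {"C": 0.02, "D": 0.98}       # small trembles keep likelihoods positive ("caution")

MENU = {"tit_for_tat": tit_for_tat, "always_defect": always_defect}

def bayes_update(prior, hist):
    """Posterior over the menu given the realized history of (my_action, opp_action) pairs."""
    post = dict(prior)
    for k in range(len(hist)):
        _, opp_a = hist[k]
        for name, g in MENU.items():
            post[name] *= g(hist[:k])[opp_a]       # likelihood of the observed opponent action
    z = sum(post.values())
    return {name: w / z for name, w in post.items()}

def collision_complement(post):
    """D_t = 1 - sum_k p_k^2: probability that two independent posterior draws disagree."""
    return 1.0 - sum(p * p for p in post.values())

# Candidate self-strategies and roll-out evaluation against the sampled hypothesis.
def self_tft(hist):    return "C" if (not hist or hist[-1][1] == "C") else "D"
def self_defect(hist): return "D"
CANDIDATES = {"tit_for_tat": self_tft, "always_defect": self_defect}

def rollout_value(self_strat, opp_hypothesis, hist, horizon=50, lam=0.95, n_rollouts=200):
    """Monte Carlo estimate of the discounted continuation value under one opponent hypothesis."""
    total = 0.0
    for _ in range(n_rollouts):
        h, v = list(hist), 0.0
        for k in range(horizon):
            my_a = self_strat(h)
            opp_dist = opp_hypothesis(h)
            opp_a = random.choices(ACTIONS, weights=[opp_dist[a] for a in ACTIONS])[0]
            v += (1 - lam) * (lam ** k) * PAYOFF[(my_a, opp_a)]
            h.append((my_a, opp_a))
        total += v
    return total / n_rollouts

def ps_br_action(prior, hist):
    post = bayes_update(prior, hist)
    sampled = random.choices(list(post), weights=list(post.values()))[0]   # step 1: posterior sample
    best = max(CANDIDATES, key=lambda c: rollout_value(CANDIDATES[c], MENU[sampled], hist))
    return CANDIDATES[best](hist), post, collision_complement(post)        # step 2: best respond

prior = {"tit_for_tat": 0.5, "always_defect": 0.5}
hist = [("C", "C"), ("C", "C"), ("C", "C")]
action, post, D_t = ps_br_action(prior, hist)
print(action, post, round(D_t, 3))
```

As the history accumulates evidence consistent with one hypothesis, the posterior concentrates, $D^t_i$ shrinks, and the sampled best response stabilizes; this is exactly the mechanism formalized by Lemma 4.2 and Proposition 4.3 below.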
To obtain RR's asymptotic $\varepsilon$-consistency, it suffices (by Lemma 4.1) to ensure that $D^t_i(h^t(z)) \to 0$ along $\mu_f$-almost every realized path $z$. Intuitively, we need the agent's posterior to concentrate so that posterior sampling becomes (asymptotically) deterministic.

In general repeated games, full posterior concentration over an unrestricted strategy space is too much to ask (and is closely related to classic impossibility phenomena; see Nachbar, 1997; Nachbar, 2005). We therefore impose a standard restriction that is also natural from an LLM-agent implementation perspective: the agent maintains a finite menu of opponent-strategy hypotheses and updates a posterior over that menu (Aoyagi, Fréchette, and Yuksel, 2024; Gill and Rosokha, 2024). In addition, we require an on-path KL separation condition ensuring that incorrect hypotheses are detectably different from the true strategy along the realized play path. This is exactly what makes posterior concentration (and hence vanishing sampling error) mathematically inevitable.

Assumption 3 (Finite menu and KL separation). Fix player $i$. Assume the support of $\mu^0_i$ is finite; write $S_{-i} := \mathrm{supp}(\mu^0_i) \subseteq F_{-i}$. Assume:

1. (Menu grain of truth) $f_{-i} \in S_{-i}$ and $\mu^0_i(f_{-i}) > 0$.
2. (Caution / uniform positivity) There exists $\nu \in (0, 1)$ such that for every $g_{-i} \in S_{-i}$, every history $h$, and every $a_{-i} \in A_{-i}$, $g_{-i}(h)(a_{-i}) \geq \nu$.
3. (On-path KL separation) For every $g_{-i} \in S_{-i} \setminus \{f_{-i}\}$ there exists $\kappa_i(g_{-i}) > 0$ such that, $\mu_f$-a.s. in $z$,
$$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}\big(f_{-i}(h^t(z)) \,\big\|\, g_{-i}(h^t(z))\big) \geq \kappa_i(g_{-i}),$$
where for distributions $p, q \in \Delta(A_{-i})$,
$$D_{\mathrm{KL}}(p \,\|\, q) := \sum_{a_{-i} \in A_{-i}} p(a_{-i}) \log \frac{p(a_{-i})}{q(a_{-i})}.$$

Assumption 3 is directly implementable in an LLM-agent pipeline: the menu $S_{-i}$ is a finite library of opponent-strategy templates, "caution" can be enforced by adding an arbitrarily small tremble (to avoid zero likelihoods), and KL separation is an identifiability condition stating that wrong templates are distinguishable from the truth along the realized interaction history (the only history that matters for on-path learning). Under Assumption 3, standard likelihood-ratio arguments yield posterior concentration on the true hypothesis.

Lemma 4.2 (Posterior concentration under KL separation). Fix player $i$ and suppose Assumption 3 holds for $i$. Then, $\mu_f$-a.s. in $z$,
$$\mu^t_i(f_{-i} \mid h^t(z)) \longrightarrow 1, \quad \text{and hence} \quad \max_{g_{-i} \in S_{-i} \setminus \{f_{-i}\}} \mu^t_i(g_{-i} \mid h^t(z)) \longrightarrow 0.$$

Lemma 4.2 implies $D^t_i(h^t(z)) \to 0$ on-path, and then Lemma 4.1 upgrades PS-BR from a dispersion-dependent approximation to an eventual $\varepsilon$-best-response rule.

Proposition 4.3 (PS-BR implies asymptotic $\varepsilon$-consistency). Fix player $i$. Suppose player $i$ uses PS-BR at every history and Assumption 3 holds for $i$. Then player $i$ satisfies the asymptotic $\varepsilon$-consistency requirement in Definition 4.

This proposition is the formal resolution of the "LLMs are stochastic samplers" issue: standard sampling-based decoding (temperature $\tau \simeq 1$) induces randomness that prevents exact best-response optimality at any fixed time, but if the agent's posterior over a finite, identifiable hypothesis menu concentrates, then the induced sampling randomness becomes asymptotically negligible.
Consequently, the agent's behavior converges (on-path) to $\varepsilon$-best-response play relative to its (accurate) predictive beliefs, which is exactly the RR requirement needed for the zero-shot Nash convergence results in Section 5. The proofs of Lemmas 4.1–4.2 and Proposition 4.3 are deferred to Appendix B.

5 Zero-shot Nash convergence

We now show that the reasonably reasoning agents defined in Section 4, together with a learnability condition on beliefs, generate play that is eventually weakly close to Nash equilibrium play along the realized path. The argument follows the weak-subjective-equilibrium framework of Norman (2022), adapted to the LLM-agent-specific setup discussed in Section 4, i.e., (i) asymptotic (on-path) $\varepsilon$-consistency and (ii) the finite-menu KL separation for verifying the learnability condition.

5.1 Weak subjective equilibrium

We work with the standard weak distance on play-path distributions. Let $\mathcal{B}_t$ be the $\sigma$-algebra generated by cylinder events of length $t$.

Definition 6 (Weak distance). For probability measures $\mu, \nu$ over infinite play paths, define
$$d(\mu, \nu) := \sum_{t=1}^{\infty} 2^{-t} \sup_{E \in \mathcal{B}_t} \big| \mu(E) - \nu(E) \big|.$$
For a history $h^t$ with $\mu(C(h^t)) > 0$ and $\nu(C(h^t)) > 0$, define the conditional (continuation) weak distance
$$d_{h^t}(\mu, \nu) := d\big(\mu(\cdot \mid C(h^t)),\, \nu(\cdot \mid C(h^t))\big).$$
We use the weak distance to compare continuations of play after a realized history.

Definition 7 (Weak similarity in continuation). Fix a history $h^t$. Two profiles $f$ and $g$ are $\eta$-weakly similar in continuation after $h^t$ if $d_{h^t}(\mu_f, \mu_g) \leq \eta$.

Weak subjective equilibrium is Norman's key intermediate notion: players best respond (up to $\xi$) to their subjective model, and their subjective model is weakly close (within $\eta$) to the objective continuation distribution.

Definition 8 (Weak subjective equilibrium (Norman, 2022)). Fix $\xi, \eta \geq 0$ and a history $h^t$. A continuation profile $f|_{h^t}$ is a weak $\xi$-subjective $\eta$-equilibrium after $h^t$ if for every player $i$ there exists a supporting profile $f^i = (f_i, f^i_{-i})$ such that:

1. (Subjective best response) $f_i\big|_{h^t} \in \mathrm{BR}^{\xi}_i\big(f^i_{-i}\big|_{h^t} \,\big|\, h^t\big)$, where payoffs are evaluated under $\mu_{f^i}$.
2. (Weak predictive accuracy) $d_{h^t}(\mu_f, \mu_{f^i}) \leq \eta$.

Definition 9 (Learns to predict the path of play (strong)). Player $i$ learns to predict the path of play under $f$ if for every $\eta > 0$,
$$\mu_f\Big(\Big\{ z : \exists\, T_i(z, \eta) < \infty \text{ s.t. } \forall t \geq T_i(z, \eta),\ d_{h^t(z)}(\mu_f, \mu_{f^i}) \leq \eta \Big\}\Big) = 1,$$
where $f^i = (f_i, f^i_{-i})$ is a supporting (belief-equivalent) profile for player $i$ (as in Section 3).

Remark 1 (Connection to optimizing learnability). A longstanding challenge in Bayesian learning in games is the inconsistency result of Nachbar (1997, 2005), which shows that requiring an agent to learn and best respond on all possible continuation paths is often mathematically impossible. However, Norman (2022) resolved this by introducing optimizing learnability, the insight that agents only need to learn the continuation play along the realized paths generated by their optimizing choices. Our RR definition naturally instantiates Norman's insight: Definition 4 and Definition 9 require $\varepsilon$-consistency and predictive accuracy only $\mu_f$-almost surely (i.e., strictly on the realized, optimizing play path).
Therefore, the on-path merging of opinions guaranteed by Blackwell and Dubins (1962) is entirely sufficient for zero-shot Nash convergence, bypassing Nachbar's impossibility. Crucially, while our agent's specific decision rule (PS-BR) requires finite menus and KL separation to guarantee the optimality of actions (asymptotic $\varepsilon$-consistency, Section 4), the learning of the true path (strong path prediction) relies purely on the absolute continuity of beliefs. It does not require the posterior to concentrate; it can be verified directly from Assumption 2 via the classic merging-of-opinions result. The following Lemma 5.1 formalizes this idea.

Lemma 5.1 (Absolute continuity implies strong path prediction). Fix player $i$. Suppose the objective play-path distribution $\mu_f$ is absolutely continuous with respect to player $i$'s prior predictive distribution $P^{0, f_i}_i$ (Assumption 2). Then player $i$ learns to predict the path of play under $f$ in the sense of Definition 9.

The proof is deferred to Appendix B.

5.2 From learning to zero-shot Nash convergence

We first show that asymptotic $\varepsilon$-consistency, together with strong prediction, implies that the realized continuation play is eventually a weak subjective equilibrium.

Proposition 5.2. Suppose each player $i$ is RR (Definition 4) and learns to predict the path of play under $f$ (Definition 9). Then for any $\xi > 0$ and $\eta > 0$,
$$\mu_f\Big(\Big\{ z : \exists\, T(z) < \infty \text{ s.t. } \forall t \geq T(z),\ f\big|_{h^t(z)} \text{ is a weak } \xi\text{-subjective } \eta\text{-equilibrium after } h^t(z) \Big\}\Big) = 1.$$

Finally, we convert a weak subjective equilibrium into proximity to a Nash equilibrium.

Theorem 5.3 (Zero-shot Nash convergence along realized play). Suppose every player $i$ is RR and learns to predict the path of play under $f$. Assume the grain-of-truth condition (Assumption 2) holds for each player. Then for every $\varepsilon > 0$,
$$\mu_f\Big(\Big\{ z : \exists\, T(z) < \infty \text{ s.t. } \forall t \geq T(z),\ \exists\, \hat{f}^{\varepsilon, t, z} \text{ an } \varepsilon\text{-Nash equilibrium of the continuation game after } h^t(z) \text{ with } d_{h^t(z)}\big(\mu_f, \mu_{\hat{f}^{\varepsilon, t, z}}\big) \leq \varepsilon \Big\}\Big) = 1.$$

Corollary 5.4 (Zero-shot Nash convergence for PS-BR). Assume that for every player $i$, Assumption 3 holds and player $i$ uses PS-BR (Definition 5). Then the conclusion of Theorem 5.3 holds.

The proofs of Theorem 5.3 and Corollary 5.4 are deferred to Appendix B. As a direct consequence, under our practical PS-BR implementation, the premises of Theorem 5.3 are verified directly.

The main theoretical results, Theorem 5.3 and Corollary 5.4, may seem counter-intuitive: if each agent is learning, then what each agent is trying to predict is itself changing over time, so why should behavior ever stabilize? This concern is valid for many myopic learning models, where the learner treats the opponent as having a fixed action distribution even though the opponent is also adapting. The promise of Bayesian learning (Kalai and Lehrer, 1993a) is that, under a suitable grain-of-truth condition, agents' posterior predictive forecasts about future play can nonetheless become accurate (merge) along the realized path.

In repeated games, the correct object of inference is not a fixed action, but the opponent's repeated-game strategy: a fixed contingent plan (mapping histories to actions) that may be highly nonstationary.
In particular, even if an opponent updates beliefs and changes its period-by-period best response, once its prior, update rule, and decision rule are fixed from time 0, its behavior defines a single mapping $f_{-i} : H \to \Delta(A_{-i})$ (hence a fixed repeated-game strategy in our sense). Agents' beliefs change because they refine uncertainty about this fixed mapping (and its on-path implications), not because the mapping is being rewritten exogenously over time.

Indeed, our main results do not require that posteriors over opponent strategies literally stop moving. Instead, they require on-path stabilization in two weaker senses:

1. Stability of forecasts (predictive merging). Under the grain-of-truth condition (Assumption 2), Bayesian updating implies that, along $\mu_f$-almost every realized history $h^t(z)$, the agent's posterior predictive distribution over future play becomes close to the true continuation distribution (formalized later by Definition 9 and Lemma 5.1). Importantly, this can happen even if the posterior over strategy labels does not concentrate: distinct strategy hypotheses may be observationally equivalent on the realized path, and any remaining disagreement can persist only on counterfactual histories that are not reached.

2. Stability of (approximate) best responses. Once an agent's predictive belief about continuation play is accurate on-path, playing an $\varepsilon$-best response to that belief is also nearly optimal against the true continuation play. Moreover, best-response sets need not vary wildly: when the payoff gap between the best action and the runner-up is nontrivial, small changes in beliefs do not change which continuation strategies are $\varepsilon$-optimal. This is exactly why our RR definition imposes only asymptotic on-path $\varepsilon$-consistency (Definition 4), rather than requiring perfect best-response optimality at every time and every counterfactual history.

Even if beliefs keep updating forever, behavior can still stabilize because decisions depend on the predictive implications of beliefs for the realized continuation game. If posterior mass shifts among hypotheses that induce (nearly) the same continuation distribution after $h^t(z)$, then the agent's best-response problem is (nearly) unchanged, so play remains stable. For our PS-BR implementation with a finite menu and KL separation (Assumption 3), we obtain an even stronger form of stabilization: the posterior over the menu concentrates on the true opponent strategy (Lemma 4.2), so the randomness from posterior sampling becomes asymptotically negligible (Lemma 4.1), yielding eventual on-path $\varepsilon$-best-response behavior (Proposition 4.3).

5.3 Zero-shot stage-game Nash convergence for myopic rules

Theorem 5.3 and Corollary 5.4 establish eventual on-path convergence to a Nash equilibrium of the continuation game under PS-BR. That guarantee is deliberately strong: it concerns repeated-game optimality and therefore requires beliefs over opponents' full continuation strategies. Yet this level of reasoning may be unnecessary when the object of interest is only stage-wise strategic optimality. If we ask instead whether the realized mixed action profile at each history is eventually an approximate Nash equilibrium of the one-shot stage game, then predicting the opponents' next joint action may suffice.
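To preview the one-step "predict the next action, then best respond" logic formalized in Definitions 10 and 11 below, the following minimal sketch forms a one-step predictive belief over the opponent's next action as a posterior-weighted mixture and computes a stage best response to it. The 2x2 payoffs, hypothesis menu, and posterior weights are invented for illustration.

```python
# Minimal sketch of one-step prediction followed by a stage best response
# (the objects formalized as Definitions 10-11 below). All numbers are illustrative.
ACTIONS = ["A", "B"]
U = {("A", "A"): 2, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 1}   # player i's stage payoffs

def expected_payoff(a, q):
    """Expected stage payoff of pure action a against a belief q over the opponent's action."""
    return sum(q[b] * U[(a, b)] for b in ACTIONS)

# Posterior over two opponent hypotheses and each hypothesis's predicted next action.
posterior = {"mostly_A": 0.6, "mostly_B": 0.4}
pred = {"mostly_A": {"A": 0.95, "B": 0.05}, "mostly_B": {"A": 0.05, "B": 0.95}}

# One-step posterior predictive belief q_t over the opponent's next action.
q_t = {a: sum(posterior[g] * pred[g][a] for g in posterior) for a in ACTIONS}

# Stage best response to the predictive belief (Definition 10's br_i(q) with epsilon = 0).
best_action = max(ACTIONS, key=lambda a: expected_payoff(a, q_t))
print(q_t, best_action, round(expected_payoff(best_action, q_t), 3))
```

Myopic PS-BR (Definition 11) replaces the last step with a best response to a single hypothesis drawn from the posterior; Lemma 5.6 below bounds the resulting optimality gap by the collision statistic $D^t_i$.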
This reduction captures the logic of SCoT (Akata et al., 2025), which implements a "predict the next move, then best respond" procedure rather than full continuation planning. The purpose of this subsection is to justify this simplification formally. We analyze two one-step variants: myopic PS-BR, which best responds to a one-step predictive belief, and SCoT (Akata et al., 2025), which best responds to a deterministic point prediction of the opponents' next action.

5.3.1 Myopic PS-BR

Myopic PS-BR retains the Bayesian-learning-plus-best-response structure of the previous subsection, but truncates both objects to one period: the agent forms a one-step predictive belief over the opponents' next joint action and then plays a myopic best response to that belief. For notational convenience, as already used above, for any opponents' profile $g_{-i}$ and history $h$, we write $g_{-i}(h) \in \Delta(A_{-i})$ for the induced distribution over the opponents' joint next action at history $h$. In particular, when $g_{-i}$ is an actual profile of opponents' mixed actions, this is the product distribution
$$g_{-i}(h) = \bigotimes_{j \neq i} g_j(h).$$

Definition 10 (One-shot stage-game $\varepsilon$-best response and stage $\varepsilon$-Nash). For $\alpha_i \in \Delta(A_i)$ and $q \in \Delta(A_{-i})$, define
$$u_i(\alpha_i, q) := \sum_{a_i \in A_i} \sum_{a_{-i} \in A_{-i}} \alpha_i(a_i)\, q(a_{-i})\, u_i(a_i, a_{-i}).$$
For $\varepsilon \geq 0$, define
$$\mathrm{br}^{\varepsilon}_i(q) := \left\{ \alpha_i \in \Delta(A_i) : u_i(\alpha_i, q) \geq \sup_{\alpha'_i \in \Delta(A_i)} u_i(\alpha'_i, q) - \varepsilon \right\}.$$
We also write $\mathrm{br}_i(q) := \mathrm{br}^0_i(q)$. At a history $h^t$, write
$$f_{-i}(h^t) := \bigotimes_{j \neq i} f_j(h^t) \in \Delta(A_{-i})$$
for the actual current joint mixed action of player $i$'s opponents. The current mixed-action profile $f(h^t) := (f_1(h^t), \ldots, f_N(h^t)) \in \prod_{j \in I} \Delta(A_j)$ is a stage $\varepsilon$-Nash equilibrium if $f_i(h^t) \in \mathrm{br}^{\varepsilon}_i\big(f_{-i}(h^t)\big)$ for every $i \in I$.

Fix player $i$ and let $f^i = (f_i, f^i_{-i})$, where $f^i_{-i}$ is the fixed belief-equivalent profile from Section 3.3. Let $f^{i,t}_{-i}$ be the continuation-consistent representative of player $i$'s predictive belief at history $h^t$. We write $q^t_i(\cdot \mid h^t) := f^{i,t}_{-i}(h^t) \in \Delta(A_{-i})$. By the representative-choice convention from Section 3.3, along the histories under consideration, $f^{i,t}_{-i}(h^t) = f^i_{-i}(h^t)$. When the posterior $\mu^t_i(\cdot \mid h^t)$ is supported on a finite set $S_{-i} \subseteq F_{-i}$, this is
$$q^t_i(\cdot \mid h^t) = \sum_{g_{-i} \in S_{-i}} \mu^t_i(g_{-i} \mid h^t)\, g_{-i}(h^t)(\cdot).$$

Definition 11 (Myopic posterior-sampling best response (myopic PS-BR)). Fix player $i$ and a history $h^t$. Suppose $\mu^t_i(\cdot \mid h^t)$ is supported on a finite set $S_{-i}$. For each $g_{-i} \in S_{-i}$, choose a mixed action $\alpha^{g_{-i}, h^t}_i \in \mathrm{br}_i\big(g_{-i}(h^t)\big)$. Myopic PS-BR:

1. samples $\tilde{f}_{-i} \sim \mu^t_i(\cdot \mid h^t)$;
2. uses the mixed action $\alpha^{\tilde{f}_{-i}, h^t}_i$.

The induced ex ante mixed action is
$$\alpha^{\mathrm{mPS}}_{i,t}(\cdot \mid h^t) := \sum_{g_{-i} \in S_{-i}} \mu^t_i(g_{-i} \mid h^t)\, \alpha^{g_{-i}, h^t}_i(\cdot).$$
Whenever player $i$ uses myopic PS-BR, we identify $f_i(h^t) = \alpha^{\mathrm{mPS}}_{i,t}(\cdot \mid h^t)$.

Lemma 5.5 (Stage best responses are stable under nearby beliefs). Fix player $i$ and define $\|p - q\|_{\mathrm{TV}} := \sup_{B \subseteq A_{-i}} |p(B) - q(B)|$ for $p, q \in \Delta(A_{-i})$. If $\alpha_i \in \mathrm{br}^{\xi}_i(q)$, then $\alpha_i \in \mathrm{br}^{\xi + 2\|p - q\|_{\mathrm{TV}}}_i(p)$.

Lemma 5.6 (Myopic PS-BR is a $D^t_i$-stage best response). Fix player $i$ and a history $h^t$. Suppose $\mu^t_i(\cdot \mid h^t)$ is supported on a finite set $S_{-i}$ and write $p_t(g_{-i}) := \mu^t_i(g_{-i} \mid h^t)$, $g_{-i} \in S_{-i}$.
Define
$$D^t_i(h^t) := 1 - \sum_{g_{-i} \in S_{-i}} p_t(g_{-i})^2.$$
Let $\alpha^{\mathrm{mPS}}_{i,t}(\cdot \mid h^t)$ be myopic PS-BR and let
$$q^t_i(\cdot \mid h^t) = \sum_{g_{-i} \in S_{-i}} p_t(g_{-i})\, g_{-i}(h^t)(\cdot)$$
be the one-step posterior predictive belief. Then
$$u_i\big(\alpha^{\mathrm{mPS}}_{i,t},\, q^t_i(\cdot \mid h^t)\big) \geq \sup_{\alpha_i \in \Delta(A_i)} u_i\big(\alpha_i,\, q^t_i(\cdot \mid h^t)\big) - D^t_i(h^t).$$
Equivalently, $\alpha^{\mathrm{mPS}}_{i,t}(\cdot \mid h^t) \in \mathrm{br}^{D^t_i(h^t)}_i\big(q^t_i(\cdot \mid h^t)\big)$.

Lemma 5.7 (Strong path prediction implies one-step predictive accuracy). Fix player $i$. Suppose player $i$ learns to predict the path of play under $f$ (Definition 9). Then
$$\mu_f\Big(\Big\{ z : \forall \eta > 0,\ \exists\, T_i(z, \eta) < \infty \text{ s.t. } \forall t \geq T_i(z, \eta),\ \big\| q^t_i(\cdot \mid h^t(z)) - f_{-i}(h^t(z)) \big\|_{\mathrm{TV}} \leq \eta \Big\}\Big) = 1.$$

Theorem 5.8 (Bayesian convergence to stage-game Nash under myopic PS-BR). Assume that for every player $i$, Assumption 3 holds and player $i$ uses myopic PS-BR (Definition 11) at every history. Then for every $\varepsilon > 0$,
$$\mu_f\Big(\Big\{ z : \exists\, T(z) < \infty \text{ s.t. } \forall t \geq T(z),\ f(h^t(z)) \text{ is a stage } \varepsilon\text{-Nash equilibrium} \Big\}\Big) = 1.$$

5.4 SCoT (Akata et al., 2025)

The second reduction is SCoT (Akata et al., 2025). Instead of best responding to the full one-step predictive distribution, the agent first forms a deterministic point prediction of the opponents' next joint action and then best responds to that point prediction. In general, this is not equivalent to best responding to a mixed belief, so the argument differs from the classical Bayesian-learning-plus-best-response route. Nevertheless, when all players use deterministic point-prediction rules, the true next action along the realized path is pure at every history, and predictive accuracy is enough to make the point prediction eventually correct. This gives eventual stage-game Nash convergence under a different mechanism than myopic PS-BR.

Definition 12 (Social Chain of Thought (SCoT) (Akata et al., 2025)). Fix player $i$. At each history $h^t$, let $q^t_i(\cdot \mid h^t) := f^{i,t}_{-i}(h^t) \in \Delta(A_{-i})$ denote player $i$'s one-step predictive distribution over the opponents' next joint action. Along the histories under consideration, the representative-choice convention from Section 3.3 gives $f^{i,t}_{-i}(h^t) = f^i_{-i}(h^t)$. A SCoT rule for player $i$ consists of:

1. a deterministic MAP (maximum a posteriori) selector
$$\hat{a}^t_{-i}(h^t) \in \arg\max_{a_{-i} \in A_{-i}} q^t_i(a_{-i} \mid h^t);$$
2. a deterministic pure best-response selector $b_i : A_{-i} \to A_i$ such that $b_i(a_{-i}) \in \arg\max_{a_i \in A_i} u_i(a_i, a_{-i})$ for every $a_{-i} \in A_{-i}$.

The induced strategy is $f_i(h^t) := \delta_{b_i(\hat{a}^t_{-i}(h^t))} \in \Delta(A_i)$. Thus a SCoT player uses a pure action at every history.

Lemma 5.9 (Deterministic truth implies asymptotic purity and eventual MAP correctness). Fix player $i$ and suppose player $i$ learns to predict the path of play under $f$ in the sense of Definition 9. Assume that for every history $h \in H$ there exists an action $a^\star_{-i}(h) \in A_{-i}$ such that $f_{-i}(h) = \delta_{a^\star_{-i}(h)}$. Then
$$\mu_f\Big(\Big\{ z : \exists\, T_i(z) < \infty \text{ s.t. } \forall t \geq T_i(z),\ \hat{a}^t_{-i}(h^t(z)) = a^\star_{-i}(h^t(z)) \Big\}\Big) = 1.$$
In particular, along $\mu_f$-almost every realized path $z$,
$$q^t_i\big(a^\star_{-i}(h^t(z)) \mid h^t(z)\big) \longrightarrow 1 \quad \text{and} \quad 1 - \max_{a_{-i} \in A_{-i}} q^t_i(a_{-i} \mid h^t(z)) \longrightarrow 0.$$

Theorem 5.10 (One-shot stage-game Nash convergence for SCoT). Suppose every player $i \in I$ uses SCoT in the sense of Definition 12, and suppose every player learns to predict the path of play under $f$ in the sense of Definition 9.
Lemma 5.9 (Deterministic truth implies asymptotic purity and eventual MAP correctness). Fix player i and suppose player i learns to predict the path of play under f in the sense of Definition 9. Assume that for every history h ∈ H there exists an action a*_{-i}(h) ∈ A_{-i} such that f_{-i}(h) = δ_{a*_{-i}(h)}. Then

    µ^f{ z : ∃ T_i(z) < ∞ s.t. ∀ t ≥ T_i(z), â^t_{-i}(h^t(z)) = a*_{-i}(h^t(z)) } = 1.

In particular, along µ^f-almost every realized path z,

    q^t_i(a*_{-i}(h^t(z)) | h^t(z)) → 1   and   1 − max_{a_{-i} ∈ A_{-i}} q^t_i(a_{-i} | h^t(z)) → 0.

Theorem 5.10 (One-shot stage-game Nash convergence for SCoT). Suppose every player i ∈ I uses SCoT in the sense of Definition 12, and suppose every player learns to predict the path of play under f in the sense of Definition 9. Then

    µ^f{ z : ∃ T(z) < ∞ s.t. ∀ t ≥ T(z), f(h^t(z)) is a stage Nash equilibrium } = 1.

Equivalently, along µ^f-almost every realized path, the current mixed-action profile eventually becomes a stage 0-Nash equilibrium.

Corollary 5.11 (Bayesian stage-game Nash convergence for SCoT). Suppose every player uses deterministic MAP-SCoT and Assumption 2 holds for every player. Then the conclusion of Theorem 5.10 holds:

    µ^f{ z : ∃ T(z) < ∞ s.t. ∀ t ≥ T(z), f(h^t(z)) is a stage Nash equilibrium } = 1.

Remark 2. Theorem 5.10 relies on the fact that when all players use SCoT with deterministic tie-breaking, the true current action profile is pure at every history. This is why asymptotic purity need not be imposed separately: it is implied by Bayesian one-step predictive accuracy toward a pure truth. If opponents are allowed to play genuinely mixed current actions, this argument breaks down, and additional conditions such as asymptotic purity or BR-invariance are again needed. The SCoT result is therefore naturally paired with the grain-of-truth assumption (Assumption 2) and the corresponding merging-of-opinions argument, rather than with Assumption 3, whose uniform-positivity requirement is tailored to cautious menu-based posteriors and posterior-sampling rules such as PS-BR.

The proofs are deferred to Appendix B. Taken together, Theorem 5.8 and Theorem 5.10 show that, for the weaker objective of stage-game Nash convergence, full continuation planning is not necessary. However, these one-step results are inherently limited to stage-game equilibrium. They do not by themselves recover more demanding continuation-game or history-contingent repeated-game equilibria, whose incentive structure is sustained by the value of future paths of play. Establishing convergence to those richer repeated-game equilibria requires a procedure, such as PS-BR, that reasons over full continuation strategies rather than only over the next-period action.

6 Extension to unknown, stochastic, and private payoffs

Sections 3–5 assumed that the stage payoff functions u_i : A → [0, 1] are common knowledge and deterministic. We now drop this assumption and allow each agent to observe only its own privately realized stochastic payoffs.

6.1 Private-payoff repeated game and information histories

Fix the same action sets (A_i)_{i∈I} and discount factors (λ_i)_{i∈I} as in Section 3. For each player i, let R_i ⊆ ℝ denote the payoff space and let ν_i(dr) be a dominating base measure (counting measure in the discrete case, Lebesgue measure in the continuous case). We assume that the payoff noise family is known. Concretely, for each player i there is a known family of densities

    ψ_i(r; µ),   r ∈ R_i, µ ∈ ℝ,

where the parameter µ is the mean payoff. The true unknown object is player i's mean payoff matrix u_i : A → [0, 1]. (As usual, any bounded payoff matrix can be affinely normalized into [0, 1] without changing best responses or Nash inequalities.) At round t, after the public joint action a^t ∈ A is realized, player i privately observes r^t_i ~ q^{u_i}_i(·|a^t), where

    q^{u_i}_i(dr | a) := ψ_i(r; u_i(a)) ν_i(dr).   (5)

Thus the true payoff kernel is determined by the true mean matrix u_i. In the private-payoff model, actions may depend on both the public history and the player's own private payoff observations.
Accordingly, define player i's information history at time t as

    x^t_i := (h^t, r^{1:t−1}_i) ∈ X^t_i := H^t × R^{t−1}_i,   X_i := ∪_{t≥1} X^t_i.

A strategy for player i in the private-payoff game is a map σ_i : X_i → Δ(A_i). Let Σ_i denote the set of such strategies and Σ := ∏_{i∈I} Σ_i. The full sample space is

    Ω := ∏_{t≥1} ( A × ∏_{i∈I} R_i ),

whose typical element is ω = (a^1, r^1, a^2, r^2, ...), with r^t = (r^t_1, ..., r^t_N). Given a strategy profile σ ∈ Σ and the true mean matrices u = (u_i)_{i∈I}, the tuple (σ, u) induces a unique probability law P^{σ,u} on Ω by the Ionescu–Tulcea theorem. For a realized path ω ∈ Ω, write x^t(ω) := (x^t_i(ω))_{i∈I} for the realized vector of information histories at time t. For any continuation profile τ defined on future information histories extending x^t, let P^{τ,u}_{x^t} denote the induced continuation law.

For player i, define the continuation payoff after x^t by

    U_i(τ | x^t) := E_{P^{τ,u}_{x^t}} [ (1 − λ_i) Σ_{k=0}^∞ λ^k_i r^{t+k}_i ].

By iterated expectation and (5),

    U_i(τ | x^t) = E_{P^{τ,u}_{x^t}} [ (1 − λ_i) Σ_{k=0}^∞ λ^k_i u_i(a^{t+k}) ].

Hence the objective continuation payoff in the private-payoff game equals the discounted payoff induced by the true mean matrix, even though strategies may condition on private payoff realizations. A continuation profile τ is an ε-Nash equilibrium after x^t if, for every i ∈ I,

    U_i(τ | x^t) ≥ sup_{τ'_i ∈ Σ_i(x^t_i)} U_i(τ'_i, τ_{-i} | x^t) − ε.

Finally, let ¯µ^{τ,u}_{x^t} denote the public-action marginal of P^{τ,u}_{x^t} on the future public action path (a^t, a^{t+1}, ...) ∈ H^∞. We compare continuation profiles only through these public-action marginals, using

    d_{x^t}(τ, τ̂) := d( ¯µ^{τ,u}_{x^t}, ¯µ^{τ̂,u}_{x^t} ),

where d is the weak distance from Definition 6.

6.2 Known-noise, unknown-mean parametrization

We now impose the finite-menu structure used by PS-BR. For player i, let M_i be a finite menu of candidate mean payoff matrices m_i : A → [0, 1]. Each m_i ∈ M_i induces a payoff kernel

    q^{m_i}_i(dr | a) := ψ_i(r; m_i(a)) ν_i(dr).

Thus sampling a payoff-matrix label is exactly sampling a payoff kernel, expressed in mean-matrix coordinates. Given x^t_i = (h^t, r^{1:t−1}_i), player i's posterior over candidate mean matrices is

    π^t_i(m_i | x^t_i) ∝ π^0_i(m_i) ∏_{s=1}^{t−1} ψ_i(r^s_i; m_i(a^s)),   m_i ∈ M_i.   (6)

As in Sections 4–5, we model player i's beliefs about the opponents through a finite menu of public-action continuation models g_{-i} : H → Δ(A_{-i}). These models describe the predictive law of the opponents' next public action conditional on the public history. Let S_{-i} denote the finite menu and let µ^t_i(·|h^t) be player i's posterior over S_{-i}.
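A minimal sketch of the update (6) for a finite menu, assuming for concreteness the known-variance Gaussian noise family ψ_i(r; µ) = N(µ, σ²) that is later used in Section 7.4; the menu representation and names are illustrative, not the implementation used in the experiments.

```python
import numpy as np

def payoff_posterior(menu, log_prior, actions, rewards, sigma):
    """Posterior over a finite menu of candidate mean payoff matrices, as in (6).

    menu      : list of candidate matrices m_i, each a dict {joint action: m_i(a)}
    log_prior : array of log prior weights log pi_i^0(m_i)
    actions   : realized joint actions a^1, ..., a^{t-1}
    rewards   : player i's private payoffs r_i^1, ..., r_i^{t-1}
    sigma     : known noise scale of the Gaussian family psi_i (an assumption made here)
    """
    log_post = np.array(log_prior, dtype=float)
    for idx, m in enumerate(menu):
        for a, r in zip(actions, rewards):
            # accumulate log psi_i(r; m_i(a)); constants common to all m_i can be dropped
            log_post[idx] += -0.5 * ((r - m[a]) / sigma) ** 2
    log_post -= log_post.max()        # normalize in log space for numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```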
6.3 Subjective continuation values and PS-BR

Fix player i, an information history x^t_i = (h^t, r^{1:t−1}_i), a reduced-form opponents' continuation model g_{-i} ∈ S_{-i}, and a continuation strategy τ_i ∈ Σ_i(x^t_i). Let P^{(τ_i,g_{-i}),m_i}_{x^t_i} denote the induced law on player i's future observable sequence when: (i) player i follows τ_i, (ii) the opponents' public actions are generated by g_{-i}, and (iii) player i's future private payoffs are generated from the kernel q^{m_i}_i. Define the m_i-subjective continuation value by

    V^{m_i}_i(τ_i | x^t_i; g_{-i}) := E_{P^{(τ_i,g_{-i}),m_i}_{x^t_i}} [ (1 − λ_i) Σ_{k=0}^∞ λ^k_i r^{t+k}_i ].   (7)

For ε ≥ 0, define

    BR^ε_{i,m_i}(g_{-i} | x^t_i) := { τ_i ∈ Σ_i(x^t_i) : V^{m_i}_i(τ_i | x^t_i; g_{-i}) ≥ sup_{τ'_i ∈ Σ_i(x^t_i)} V^{m_i}_i(τ'_i | x^t_i; g_{-i}) − ε },

and write BR_{i,m_i}(g_{-i} | x^t_i) := BR^0_{i,m_i}(g_{-i} | x^t_i). Player i's mixed subjective continuation value is

    V^{mix,t}_i(τ_i | x^t_i) := E_{g_{-i} ~ µ^t_i(·|h^t), m_i ~ π^t_i(·|x^t_i)} [ V^{m_i}_i(τ_i | x^t_i; g_{-i}) ].   (8)

For the true mean matrix u_i, define

    V^{u_i,t}_i(τ_i | x^t_i) := E_{g_{-i} ~ µ^t_i(·|h^t)} [ V^{u_i}_i(τ_i | x^t_i; g_{-i}) ].   (9)

Fix player i and an information history x^t_i = (h^t, r^{1:t−1}_i). The posterior µ^t_i(·|h^t) over the finite menu S_{-i} induces a posterior predictive law over future public action paths. Let g^{i,t}_{-i} denote any reduced-form behavioral representative of this posterior predictive continuation law. Concretely, g^{i,t}_{-i} is chosen so that for every continuation strategy τ_i ∈ Σ_i(x^t_i),

    V^{u_i,t}_i(τ_i | x^t_i) = V^{u_i}_i(τ_i | x^t_i; g^{i,t}_{-i}).   (10)

When S_{-i} = {g^1_{-i}, ..., g^K_{-i}} is finite, one convenient choice is

    g^{i,t}_{-i}(h)(a_{-i}) = Σ_{k=1}^K µ^{t,h}_i(g^k_{-i}) g^k_{-i}(h)(a_{-i}),   h ⪰ h^t,

where µ^{t,h}_i is the continuation posterior obtained by updating µ^t_i(·|h^t) along the continuation history h. Let ¯µ^{(τ_i,g_{-i}),m_i}_{x^t_i} denote the public-action marginal of P^{(τ_i,g_{-i}),m_i}_{x^t_i} on (a^t, a^{t+1}, ...) ∈ H^∞. For the actual continuation strategy σ_i, player i's posterior predictive law over future public action paths can then be written as

    Π^t_i(· | x^t_i) = Σ_{m_i ∈ M_i} π^t_i(m_i | x^t_i) ¯µ^{(σ_i, g^{i,t}_{-i}), m_i}_{x^t_i}.   (11)

We can now state the private-payoff PS-BR rule.

Definition 13 (Posterior-sampling best response (PS-BR) with private payoffs). Fix player i and an information history x^t_i = (h^t, r^{1:t−1}_i). Given: (i) the posterior µ^t_i(·|h^t) over reduced-form opponents' continuation models, and (ii) the posterior π^t_i(·|x^t_i) over player i's own mean payoff matrices, PS-BR chooses a continuation strategy by:

1. sample an opponents' continuation model g̃_{-i} ~ µ^t_i(·|h^t);
2. sample a mean payoff matrix m̃_i ~ π^t_i(·|x^t_i);
3. play any continuation strategy τ_i ∈ BR_{i,m̃_i}(g̃_{-i} | x^t_i).

Denote the resulting randomized continuation strategy by σ^PS_{i,t}(·|x^t_i).
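The following is a minimal sketch of one PS-BR decision under Definition 13, with the two posteriors represented as dictionaries. The continuation best-response step, which in the experiments is carried out by rollout-based planning, is left as a user-supplied function, and all names are illustrative.

```python
import numpy as np

def ps_br_private(mu_post, pi_post, opponent_models, payoff_menu, best_respond, rng):
    """One PS-BR decision with private payoffs (Definition 13), illustrative sketch.

    mu_post         : dict {model label: mu_i^t(g_{-i} | h^t)}, posterior over opponent models
    pi_post         : dict {matrix label: pi_i^t(m_i | x_i^t)}, posterior over mean payoff matrices
    opponent_models : dict {model label: opponents' continuation model g_{-i}}
    payoff_menu     : dict {matrix label: candidate mean payoff matrix m_i}
    best_respond    : user-supplied planner returning some tau_i in BR_{i, m_i}(g_{-i} | x_i^t),
                      e.g. by rollout over continuation strategies (not implemented here)
    rng             : numpy random Generator
    """
    def sample(post):
        keys = list(post.keys())
        w = np.array([post[k] for k in keys], dtype=float)
        return keys[rng.choice(len(keys), p=w / w.sum())]

    g_label = sample(mu_post)    # step 1: sample an opponents' continuation model
    m_label = sample(pi_post)    # step 2: sample a mean payoff matrix
    # step 3: best respond to the sampled continuation game
    return best_respond(opponent_models[g_label], payoff_menu[m_label])
```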
6.4 Posterior concentration

Although the primitive strategy profile is σ ∈ Σ, the public action path it induces admits a reduced-form description. For each player i, define

    ¯f_i(h) := P^{σ,u}( a^t_i ∈ · | h^t = h ),   ¯f := (¯f_i)_{i∈I},

and let ¯µ^{σ,u} denote the induced law on the public action path in H^∞. Thus ¯f is the true reduced-form public-action model generated by the information-history strategy profile σ and the true mean matrices u. For player i's finite menu of reduced-form opponents' continuation models S_{-i}, assume that Assumption 3 holds mutatis mutandis with the true reduced-form opponent model ¯f_{-i} and the true public-action path law ¯µ^{σ,u} in place of f_{-i} and µ^f.

Lemma 6.1 (Posterior concentration of reduced-form public-action beliefs). Fix player i and suppose player i's finite menu S_{-i} and posterior µ^t_i(·|h^t) satisfy Assumption 3 mutatis mutandis with ¯f_{-i} and ¯µ^{σ,u} in place of f_{-i} and µ^f. Then, under the true interaction law P^{σ,u},

    µ^t_i(¯f_{-i} | h^t) → 1,   and hence   max_{g_{-i} ∈ S_{-i} \ {¯f_{-i}}} µ^t_i(g_{-i} | h^t) → 0,   almost surely.

The only genuinely new learnability requirement in the private-payoff extension is on the payoff side: identifiability of player i's own mean payoff matrix from private noisy rewards.

Assumption 4 (Finite payoff-menu identifiability under known noise). Fix player i and let M_i = supp(π^0_i) be finite. Assume:

1. (Menu grain of truth) The true mean matrix satisfies u_i ∈ M_i and π^0_i(u_i) > 0.
2. (Known common noise family) Each menu element m_i ∈ M_i induces the payoff kernel q^{m_i}_i(dr|a) = ψ_i(r; m_i(a)) ν_i(dr), and the true payoff law is q^{u_i}_i.
3. (Finite second moments of log-likelihood ratios) For every m_i ∈ M_i \ {u_i},

    sup_{a ∈ A} E_{R ~ q^{u_i}_i(·|a)} [ ( log ( ψ_i(R; u_i(a)) / ψ_i(R; m_i(a)) ) )² ] < ∞.

4. (On-path KL separation) For every m_i ∈ M_i \ {u_i} there exists κ_i(m_i) > 0 such that, under the true interaction law P^{σ,u},

    liminf_{T→∞} (1/T) Σ_{t=1}^T D_KL( q^{u_i}_i(·|a^t) ‖ q^{m_i}_i(·|a^t) ) ≥ κ_i(m_i)   a.s.

The next lemma is the mean-matrix analogue of Lemma 4.2.

Lemma 6.2 (Payoff posterior concentration under known-noise KL separation). Fix player i and suppose Assumption 4 holds. Then, under the true interaction law P^{σ,u},

    π^t_i(u_i | x^t_i) → 1,   and hence   max_{m_i ∈ M_i \ {u_i}} π^t_i(m_i | x^t_i) → 0,   almost surely.

Lemma 6.3 (Payoff concentration identifies the predictive public-action law). Fix player i. For every information history x^t_i,

    d( Π^t_i(·|x^t_i), ¯µ^{(σ_i, g^{i,t}_{-i}), u_i}_{x^t_i} ) ≤ 1 − π^t_i(u_i | x^t_i).

Consequently, under Lemma 6.2, d( Π^t_i(·|x^t_i), ¯µ^{(σ_i, g^{i,t}_{-i}), u_i}_{x^t_i} ) → 0 under P^{σ,u} a.s.

The proof is deferred to Appendix B.
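As a concrete illustration (not part of the formal development), consider the known-variance Gaussian noise family ψ_i(r; µ) = N(µ, σ²) used later in Section 7.4. There the KL divergence appearing in item 4 of Assumption 4 has the closed form

    D_KL( q^{u_i}_i(·|a) ‖ q^{m_i}_i(·|a) ) = ( u_i(a) − m_i(a) )² / ( 2σ² ),

so the on-path KL separation holds, for example, whenever the long-run frequency of joint actions at which m_i and u_i disagree stays bounded away from zero; item 3 holds automatically in this family because the log-likelihood ratio is an affine function of a Gaussian random variable.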
6.5 PS-BR gap and asymptotic consistency

Let

    p_t(g_{-i}, m_i) := µ^t_i(g_{-i} | h^t) π^t_i(m_i | x^t_i),   (g_{-i}, m_i) ∈ S_{-i} × M_i.

Define the joint collision complement

    D^{t,joint}_i(x^t_i) := 1 − Σ_{(g_{-i}, m_i) ∈ S_{-i} × M_i} p_t(g_{-i}, m_i)².

Lemma 6.4 (PS-BR is a D^{t,joint}_i-best response to the mixed subjective value). Fix player i and an information history x^t_i = (h^t, r^{1:t−1}_i). Let σ^PS_{i,t} be PS-BR from Definition 13. Then

    V^{mix,t}_i(σ^PS_{i,t} | x^t_i) ≥ sup_{τ_i ∈ Σ_i(x^t_i)} V^{mix,t}_i(τ_i | x^t_i) − D^{t,joint}_i(x^t_i).

Equivalently, σ^PS_{i,t} is a D^{t,joint}_i(x^t_i)-best response to the mixed subjective continuation value (8).

Define δ^t_i(x^t_i) := 1 − π^t_i(u_i | x^t_i). Because continuation values are normalized to lie in [0, 1], for every τ_i ∈ Σ_i(x^t_i),

    | V^{mix,t}_i(τ_i | x^t_i) − V^{u_i,t}_i(τ_i | x^t_i) | ≤ δ^t_i(x^t_i).   (12)

Combining (12), Lemma 6.4, Lemma 6.1, and Lemma 6.2 yields the asymptotic best-response property.

Proposition 6.5 (PS-BR implies asymptotic ε-consistency in the private-payoff game). Fix player i. Assume: (i) Assumption 3 holds mutatis mutandis for player i's menu of reduced-form opponents' continuation models, with the true reduced-form opponent model ¯f_{-i} and the true public-action path law ¯µ^{σ,u} in place of f_{-i} and µ^f; (ii) Assumption 4 holds for player i's mean-matrix menu; and (iii) player i uses PS-BR at every information history. Then for every ε > 0,

    P^{σ,u}{ ω : ∃ T_i(ω, ε) < ∞ s.t. ∀ t ≥ T_i(ω, ε), σ^PS_{i,t}(·|x^t_i(ω)) ∈ BR^ε_{i,u_i}( g^{i,t}_{-i} | x^t_i(ω) ) } = 1.

6.6 Zero-shot Nash convergence with private payoffs

To lift the earlier zero-shot argument, one replaces public histories h^t by information-history vectors x^t, and one compares continuation profiles through the weak distance between their induced public-action marginals after the realized full information-history vector. Because player i only observes x^t_i = (h^t, r^{1:t−1}_i), the relevant Bayesian merging step is first stated on player i's observable process. Assumption 6 then identifies this player-relative predictive target with the ex post public continuation law after x^t asymptotically.

For player i, let O_i := ∏_{t≥1} (A × R_i) denote the space of observable sequences (a^1, r^1_i, a^2, r^2_i, ...). Let P^{σ,u}_i be the marginal of P^{σ,u} on O_i, and let Q^{0,σ_i}_i be player i's prior predictive law on O_i induced by their priors over S_{-i} and M_i, the known noise family, and their own strategy σ_i. Let

    ¯µ^{σ,u}_{i,x^t_i}(E) := P^{σ,u}( (a^t, a^{t+1}, ...) ∈ E | x^t_i ),   E ∈ B,

denote the true public-action continuation law conditional on player i's own observable information history x^t_i. Also let Π^t_i(·|x^t_i) denote player i's posterior predictive law over the future public action path (a^t, a^{t+1}, ...) ∈ H^∞ conditional on x^t_i.

In the private-payoff setup, player i's prior over reduced-form opponents' continuation models and over its own finite menu of payoff hypotheses is constructed so that the true observable process is represented as one feasible element. Thus the induced prior predictive law on player i's observable sequence should place positive mass on the true observable path law. This naturally gives the following Assumption 5.

Assumption 5 (Observable grain of truth in the private-payoff game). Fix player i. Assume P^{σ,u}_i ≪ Q^{0,σ_i}_i.

The next requirement is also natural in the PS-BR regime. Although player i never observes the opponents' private reward histories, those histories matter for future public play only through how they shape the opponents' own continuation behavior. As each player's private payoff posterior concentrates and the residual effect of these hidden reward histories on public continuation play becomes negligible, conditioning on the realized full information-history vector x^t or on player i's own observable history x^t_i should asymptotically yield the same public-action continuation law. Assumption 6 formalizes the intended information structure: player i does not observe the other players' private reward histories and need only infer its own payoff matrix together with the opponents' reduced-form public-action strategy. Asymptotically, any additional predictive content in the unobserved private histories becomes negligible for future public play.

Assumption 6 (Asymptotic public sufficiency of hidden private histories). For every player i,

    d( ¯µ^{σ,u}_{i,x^t_i(ω)}, ¯µ^{σ,u}_{x^t(ω)} ) → 0   for P^{σ,u}-a.e. ω.
Assumption 6 is the formal expression of the idea that, in the intended regime, each player needs to infer only its own payoff matrix and the opponents' reduced-form public-action strategy; the opponents' unrevealed private reward histories do not asymptotically alter future public play beyond what those objects already encode.

Lemma 6.6 (Observable grain of truth implies strong public-path prediction). Fix player i. Under Assumptions 5 and 6, player i's posterior predictive law over future public action paths merges with the true public-action continuation law after the realized information-history vector:

    d( Π^t_i(·|x^t_i(ω)), ¯µ^{σ,u}_{x^t(ω)} ) → 0   for P^{σ,u}-a.e. ω.

The proof is deferred to Appendix B.

Definition 14 (Weak subjective equilibrium on information histories). Fix ξ, η ≥ 0 and an information-history vector x^t. A continuation profile τ is a weak ξ-subjective η-equilibrium after x^t if, for every player i, there exists a reduced-form opponents' continuation model g^i_{-i} such that

    τ_i ∈ BR^ξ_{i,u_i}( g^i_{-i} | x^t_i )   and   d( ¯µ^{τ,u}_{x^t}, ¯µ^{(τ_i, g^i_{-i}), u_i}_{x^t_i} ) ≤ η.

Proposition 6.7 (Learning and asymptotic consistency imply weak subjective equilibrium in the private-payoff game). Suppose every player i satisfies the conclusions of Proposition 6.5 and Lemma 6.6. Then for every ξ > 0 and η > 0,

    P^{σ,u}{ ω : ∃ T(ω) < ∞ s.t. ∀ t ≥ T(ω), σ_{x^t(ω)} is a weak ξ-subjective η-equilibrium after x^t(ω) } = 1.

The proof is deferred to Appendix B.

Theorem 6.8 (Zero-shot Nash convergence with private payoffs). Assume that for every player i: Assumption 3 holds mutatis mutandis for the finite menu of reduced-form opponents' continuation models, with the true reduced-form opponent model ¯f_{-i} and the true public-action path law ¯µ^{σ,u} in place of f_{-i} and µ^f; Assumption 4 holds for the finite menu of candidate mean payoff matrices under the known noise family; Assumptions 5 and 6 hold; and player i uses PS-BR at every information history. Then for every ε > 0,

    P^{σ,u}{ ω : ∃ T(ω) < ∞ s.t. ∀ t ≥ T(ω), there exists an ε-Nash equilibrium τ̂_{ε,t,ω} of the continuation game after x^t(ω) with d( ¯µ^{σ,u}_{x^t(ω)}, ¯µ^{τ̂_{ε,t,ω},u}_{x^t(ω)} ) ≤ ε } = 1.

Theorem 6.8's interpretation is similar to that of Theorem 5.3, but now under the additional Assumption 6: although agents do not know the payoff matrix ex ante and observe only noisy private rewards, their public continuation play eventually becomes weakly close, along the realized path, to an ε-Nash equilibrium of the continuation game. In the known-common-noise-family setting, implementing payoff-kernel sampling is equivalent to sampling a mean payoff matrix from a finite reward menu and evaluating continuation strategies against the induced kernel.

7 Experiments

In this section, we empirically evaluate whether off-the-shelf reasoning LLM agents exhibit the theoretical properties derived in the previous sections, i.e., whether they converge toward Nash equilibrium behavior in repeated strategic interaction. After describing the experimental setup common to all experiments in Section 7.1, we present simulation results that test the following three hypotheses implied by our theoretical analysis:
1. For convergence to the stage-game (myopic) Nash equilibrium, simple predict–then–act reasoning, e.g., SCoT, should already be sufficient (Section 7.2).
2. For convergence to non-trivial repeated-game Nash equilibria that rely on continuation incentives and long-horizon strategic reasoning, myopic approaches should generally fail, whereas PS-BR, which explicitly evaluates continuation strategies, should succeed (Section 7.3).
3. PS-BR should remain effective even when the payoff matrix is not given and must be learned from noisy payoff observations, recovering equilibrium behavior under payoff uncertainty (Section 7.4).

7.1 Setup

Baselines. We use Qwen 3.5-27B (Qwen Team, 2026), a small-scale open-reasoning model with GPT-5-mini-level capabilities (A. Singh et al., 2025). Specifically, we run three configurations with almost identical prompts that differ only in the prescribed reasoning pattern:

• Base: Qwen 3.5-27B with direct action selection from the rules and the interaction history.
• SCoT: Qwen 3.5-27B with chain-of-thought-style "predict-then-act" prompting (Akata et al., 2025). It has demonstrated success in some repeated games, such as the Battle of the Sexes, and can be considered a simplified, myopic version of PS-BR. For details, see Appendix E.
• PS-BR: Qwen 3.5-27B with PS-BR (Definition 5, also detailed in Appendix D).

Benchmarks. We consider five repeated-game environments in total: BoS, PD, Promo, Samaritan, and Lemons.

(1) Battle of the Sexes (BoS; coordination with asymmetric equilibria). Actions each period: J or F. Per-period payoff matrix (Player 1, Player 2):

             P2: J       P2: F
  P1: J     (10, 7)     (0, 0)
  P1: F     (0, 0)      (7, 10)

The pure stage-game Nash equilibria are (J, J) and (F, F). One non-trivial cooperative Nash equilibrium of the repeated game has both players sticking to a single action:

• Play J after every history (outcome (J, J) every period).
• Play F after every history (outcome (F, F) every period).

Such an equilibrium is particularly plausible when a monetary transfer underlies the game. Another non-trivial cooperative Nash equilibrium is turn-taking:

• Play (J, J) in odd periods and (F, F) in even periods.
• After any history, continue the same odd/even phase convention.

(2) Prisoner's Dilemma (PD; social dilemma). Actions each period: J or F. Per-period payoff matrix (Player 1, Player 2):

             P2: J       P2: F
  P1: J     (3, 3)      (−5, 5)
  P1: F     (5, −5)     (0, 0)

One-shot stage-game Nash equilibrium: (F, F). A baseline pure Nash equilibrium of the repeated game is stationary play of (F, F) after every history. A nontrivial cooperative Nash equilibrium (grim-trigger cooperation) is:

• Cooperative phase: play (J, J) every period.
• If any player ever plays F, switch forever to (F, F).

(3) Promo (Lal, 1990; Appendix H.1). Actions each period: R (Regular), P (Promotion), or Z (price-war punishment). Per-period payoff matrix (Player 1, Player 2):

             P2: R        P2: P        P2: Z
  P1: R     (1, 1)       (−1, 4)      (−2, −2)
  P1: P     (4, −1)      (0, 0)       (−2, −2)
  P1: Z     (−2, −2)     (−2, −2)     (−2, −2)

One-shot stage-game Nash equilibrium (pure): (P, P). A baseline pure Nash equilibrium of the repeated game is stationary play of (P, P) after every history. The nontrivial cooperative pure Nash equilibrium described in Lal (1990) is:

• Cooperative phase: (P, R) in odd rounds and (R, P) in even rounds.
• If the opponent deviates from the cooperative pattern, play Z for two periods and then revert to the cooperative phase.

(4) Samaritan (altruism / one-sided moral hazard). Player 1 (Helper): Help (H) or No-help (N). Player 2 (Recipient): Work (W) or Shirk (S). Per-period payoff matrix (Helper, Recipient):

                Recipient: W    Recipient: S
  Helper: H    (2, −1)         (0, 0)
  Helper: N    (1, −2)         (−1, −3)

One-shot stage-game Nash equilibrium (pure): (H, S). The helper has a dominant action (help), and the recipient best responds by shirking. A nontrivial cooperative Nash equilibrium exists for sufficiently patient players:

• Cooperative phase: play (H, W) every period.
• If the recipient ever shirks, switch forever to punishment (N, W).
• If, during punishment, the helper ever deviates by helping, the recipient switches forever to (H, S) behavior.

(5) Lemons (adverse selection). Player 1 (Seller): High Quality (HQ) or Low Quality (LQ). Player 2 (Buyer): Buy (B) or Don't buy (D). Per-period payoff matrix (Seller, Buyer):

                 Buyer: B    Buyer: D
  Seller: HQ    (3, 3)      (−1, 0)
  Seller: LQ    (4, −1)     (0, 0)

One-shot stage-game Nash equilibrium (pure): (LQ, D). The seller has the strictly dominant action LQ; the buyer best responds to LQ with D. A baseline pure Nash equilibrium of the repeated game is stationary play of (LQ, D) after every history. A nontrivial cooperative Nash equilibrium for sufficiently patient players is:

• Start by playing (HQ, B), and continue (HQ, B) as long as no low-quality sale has ever been observed.
• If the buyer ever buys and then observes LQ, switch forever to D; the seller then plays the dominant LQ thereafter.
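To make the notion of an "on-path action of the cooperative repeated-game equilibrium" used in the evaluation below concrete, the following sketch encodes two of the benchmark strategies, the PD grim trigger and the Promo alternation with two-period punishment, as functions of the public history. This is an illustration of the strategy descriptions above, not the prompts given to the models; in particular, how deviations that occur during a punishment phase are handled is a simplifying choice on our part.

```python
def pd_grim_trigger(history):
    """Grim trigger for the repeated Prisoner's Dilemma benchmark.
    history is a list of past joint actions, e.g. [("J", "J"), ("J", "F"), ...].
    Cooperate until any defection has occurred, then defect forever."""
    defected = any("F" in joint for joint in history)
    return "F" if defected else "J"

def promo_cooperative(round_idx, player):
    """Prescribed on-path action in the Promo alternation: (P, R) in odd rounds, (R, P) in even rounds."""
    odd = (round_idx % 2 == 1)
    if player == 0:
        return "P" if odd else "R"
    return "R" if odd else "P"

def promo_alternation(history, me):
    """Lal (1990)-style Promo strategy, sketched with explicit phase tracking:
    follow the alternation; a deviation during the cooperative phase triggers Z for the
    next two periods, after which play reverts to the alternation. Deviations occurring
    during punishment are ignored here, which is one simplifying assumption."""
    punish_left = 0
    for s, joint in enumerate(history, start=1):
        if punish_left > 0:
            punish_left -= 1
        elif joint != (promo_cooperative(s, 0), promo_cooperative(s, 1)):
            punish_left = 2          # deviation in the cooperative phase starts punishment
    return "Z" if punish_left > 0 else promo_cooperative(len(history) + 1, me)
```

A round's realized joint action is then counted as "on path" if it matches the actions these functions prescribe given the history up to that round.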
7.2 Experiment 1. Nash convergence

Here, we test the first hypothesis: for convergence to any Nash equilibrium, simple predict–then–act reasoning, e.g., SCoT (Akata et al., 2025), should already suffice.

7.2.1 Experiment design

In Section 5.3, we showed that if agents myopically learn to predict opponents' next actions and then best respond to those predictions, the realized play path eventually converges to a stage-game ε-Nash equilibrium. SCoT (Akata et al., 2025) operationalizes precisely such a predict–then–act rule, making it a natural empirical test of the theory. To evaluate this prediction, we simulate repeated interaction in each benchmark game described in Section 7.1. Two identical copies of the same model interact in symmetric self-play for T = 200 rounds with perfect monitoring of actions and payoffs. No communication channel is available beyond the public history of previous actions and realized payoffs. Each model conditions its round-t decision only on the observed interaction history up to round t − 1.

To measure equilibrium-action convergence, among rounds 1, ..., 200 we focus only on the late-round window t ∈ {161, ..., 180}. For each round in this window, we compute the percentage of rounds in which both players' realized actions match some Nash equilibrium action, i.e., a Nash equilibrium action of the underlying one-shot game or an on-path action of the cooperative repeated-game equilibrium described in Section 7.1. We then average these indicators over rounds 161–180 and report the resulting percentage. Thus, the reported number can be interpreted as the fraction of late-round play that lies on either a one-shot Nash path or a cooperative-equilibrium path. Using rounds 161–180 isolates steady-state behavior and avoids placing weight on transient early-round dynamics and terminal-horizon effects. For each of the three model configurations (Base, SCoT, and PS-BR), we run 20 independent such self-play matches. Our primary outcome of interest is whether the realized joint action profile converges to either a one-shot Nash action or an on-path action of the benchmark cooperative repeated-game Nash equilibrium for that game.

7.2.2 Results

Table 1: Equilibrium-follow percentage in late rounds (rounds 161–180) for any Nash equilibrium (one-shot Nash or cooperative on-path action). Reported scores are averaged over 20 trials.

  Game         Base      SCoT      PS-BR
  BoS          60.0%     100.0%    100.0%
  PD           60.0%     100.0%    87.8%
  Promo        0.0%      100.0%    100.0%
  Samaritan    64.5%     100.0%    97.2%
  Lemons       0.0%      100.0%    89.8%

Table 1 shows that once cooperative on-path actions are also credited, SCoT attains a perfect late-round equilibrium-action score in all five benchmark environments. Base, by contrast, remains uneven across games, reaching 60.0% in BoS, 60.0% in PD, and 64.5% in Samaritan, but 0.0% in both Promo and Lemons. PS-BR also performs strongly, scoring 100.0% in BoS and Promo and rising to 87.8% in PD, 97.2% in Samaritan, and 89.8% in Lemons when cooperative on-path actions are credited. Overall, these results show that myopic predict–then–act prompting often steers play to some Nash equilibrium.

A natural question is what kind of equilibrium convergence Table 1 is capturing. The theory in Section 5.3 predicts that myopic predict–then–act reasoning should be sufficient for convergence to a stage-game ε-Nash equilibrium, without requiring agents to reason over full continuation strategies. The empirical results are broadly consistent with this prediction. In particular, SCoT attains perfect equilibrium-follow scores in all five environments once the evaluation metric credits both one-shot Nash actions and on-path actions of cooperative repeated-game equilibria. This suggests that explicitly prompting the model to forecast the opponent's next move and then act accordingly is often enough to remove obviously non-equilibrium play in the late rounds.

At the same time, the results should be interpreted carefully. The metric in Table 1 deliberately aggregates two qualitatively different notions of equilibrium-consistent behavior: one-shot Nash actions and actions that lie on the path of a benchmark cooperative repeated-game equilibrium. As a result, a high score means that play has moved onto some equilibrium-consistent path, but it does not tell us which kind of equilibrium has been selected. For example, in the Prisoner's Dilemma, both (F, F) and (J, J) count as successful late-round outcomes under our metric, even though the former reflects myopic defection while the latter reflects cooperation sustained by continuation incentives. Likewise, in BoS, converging to either coordinated outcome counts as success even though equilibrium selection remains unresolved. This distinction is important because myopic reasoning can explain only a limited class of equilibrium phenomena. A one-step predict–then–act rule can stabilize play at actions that are locally optimal given beliefs about the opponent's next move, but it does not by itself reason over future punishment and reward paths.
Consequently, strong performance in Table 1 should be read as evidence that myopic prompting is often sufficient for equilibrium action convergence, not as evidence that it can reliably implement a particular nontrivial repeated-game equilibrium. In other words, SCoT appears effective at steering play toward some equilibrium-consistent late-round behavior, but the table does not yet establish whether it can sustain the richer, history-contingent equilibria that depend on long-horizon continuation values. This limitation is exactly what motivates the next experiment. To distinguish simple equilibrium-action convergence from genuine repeated-game strategic reasoning, we now test whether the models can follow a specific nontrivial cooperative Nash equilibrium path when that path must be sustained by continuation incentives rather than by myopic one-step optimization alone.

7.3 Experiment 2. Nontrivial Nash convergence

We now move from asking whether play converges to some equilibrium-consistent action profile to the harder question of whether agents can track a nontrivial, cooperative repeated-game Nash equilibrium sustained by continuation incentives. Here, we test the second hypothesis: for convergence to non-trivial repeated-game Nash equilibria that rely on continuation incentives and long-horizon strategic reasoning, myopic approaches should generally fail, whereas PS-BR, which explicitly evaluates continuation strategies, should succeed.

7.3.1 Experiment design

To verify whether a particular long-horizon cooperative Nash equilibrium can be implemented, we include in each agent's prompt a description of a particular long-horizon non-trivial cooperative Nash equilibrium and instruct the agent to "strongly expect the opponent to play" that strategy. Such prompting sets the initial point of the evolution of the agents' beliefs. For example, in PD, this means prompting both agents to expect the opponent to play a grim-trigger strategy, i.e., cooperation until a defection triggers permanent punishment. In Promo, it means prompting both agents to expect the prescribed alternating cooperative pattern (P, R), (R, P), (P, R), ..., until a defection triggers finite punishment. As before, all experiments use symmetric self-play with two copies of the same model under perfect monitoring. Each match lasts T = 200 rounds. In every round, players act simultaneously, observe both actions and realized payoffs, and then condition the next-round decision on the updated history. Again, for each round t ∈ {161, ..., 180} in each run, we check whether both players' realized actions match the desired nontrivial cooperative equilibrium behavior, then average these percentages over the 20 rounds (161–180) and report the mean by setting and game. (We chose round 180 as the endpoint because PS-BR uses 20 rounds of lookahead, and we excluded pre-161 rounds because we want to observe the equilibrium outcome.)

7.3.2 Results

Table 2 shows a sharp separation across methods. PS-BR achieves high late-round follow rates in all five environments, reaching 92.5% in BoS, 98.0% in PD, 94.8% in Promo, 93.3% in Samaritan, and 93.5% in Lemons.

Table 2: Equilibrium-follow percentage in late rounds (rounds 161–180) for the prompt-specified nontrivial cooperative equilibrium. Reported scores are averaged over 20 trials.
  Game         Base     SCoT      PS-BR
  BoS          0.0%     0.0%      92.5%
  PD           0.0%     100.0%    98.0%
  Promo        0.0%     0.0%      94.8%
  Samaritan    0.0%     0.0%      93.3%
  Lemons       0.0%     0.0%      93.5%

Thus, once the cooperative equilibrium is explicitly specified, the non-myopic planner tracks the intended long-horizon path quite reliably across all benchmark games. By contrast, Base remains at 0.0% in every environment. SCoT succeeds only in PD, where it reaches 100.0%, and remains at 0.0% in BoS, Promo, Samaritan, and Lemons. Since the three settings use nearly the same game instructions and history context, the main difference is the reasoning/decision strategy (direct action for Base, myopic predict–then–act for SCoT, and posterior-sampling best response with rollout planning for PS-BR). This pattern suggests that direct prompting is insufficient for following contingent cooperative equilibrium prescriptions, while myopic prompting can recover the simple stationary cooperative path in PD but not the richer coordination, punishment, or trust-based prescriptions in the other games. PS-BR's explicit modeling of opponent strategy and continuation value is what enables sustained on-path behavior in late rounds.

The results in Table 2 provide a clear separation between myopic and non-myopic reasoning. Unlike Experiment 1, where multiple equilibrium-consistent outcomes were credited, this experiment sets up initial beliefs so that agents follow a specific cooperative equilibrium path that requires non-myopic reasoning. Under this stricter criterion, PS-BR consistently achieves high follow rates across all environments, whereas Base fails entirely and SCoT succeeds only in the simplest case (PD). This pattern aligns closely with the theoretical distinction developed in Section 5. Implementing a nontrivial repeated-game equilibrium requires reasoning over continuation values: agents must understand that short-term deviations trigger future punishment, and that adherence to the cooperative path is optimal only when these future consequences are taken into account. PS-BR explicitly evaluates such continuation strategies through rollout, and therefore can internalize these long-horizon incentives. By contrast, SCoT operates on one-step predictions and local best responses, which are insufficient to sustain equilibria that depend on multi-period incentive compatibility.

The one partial exception is the Prisoner's Dilemma, where SCoT achieves perfect performance. This is consistent with the structure of the grim-trigger equilibrium in PD: the cooperative phase (J, J) is itself a stage-game Pareto-dominant outcome and is locally consistent with mutual best responses under optimistic beliefs. As a result, myopic reasoning can incidentally align with the cooperative path. In contrast, games such as BoS, Promo, Samaritan, and Lemons require coordination on asymmetric roles, punishment phases, or trust-dependent behavior that cannot be justified purely from one-step optimization, making myopic approaches ineffective.

More broadly, these results indicate that equilibrium selection and path-following are fundamentally harder than equilibrium action convergence. While Experiment 1 shows that simple reasoning can often eliminate non-equilibrium behavior, Experiment 2 demonstrates that sustaining a particular equilibrium, especially one supported by continuation incentives, requires explicit modeling of future play.
This provides empirical support for the theoretical claim that the posterior-sampling best response, by operating over full continuation strategies, can implement repeated-game equilibria that lie beyond the reach of myopic predict–then–act rules. Having established this distinction under known and deterministic payoffs, we next consider a more realistic setting in which agents must simultaneously learn the payoff structure from noisy private observations while engaging in strategic interaction.

7.4 Experiment 3: Nontrivial Nash convergence under unknown payoffs

7.4.1 Setup

We keep the interaction protocol, horizons, and game set from Experiment 1 (Section 7.2) and Experiment 2 (Section 7.3), and modify only the payoff observations: agents no longer receive the payoff matrix in the prompt and instead learn solely from noisy, privately observed payoffs. For each benchmark game g ∈ {BoS, PD, Promo, Samaritan, Lemons}, let u^g_i(a) ∈ ℝ denote the deterministic stage payoff from Experiment 1 for player i and joint action a ∈ A. In Experiment 3, after the public joint action a^t is realized, player i observes a private payoff

    r^t_i = u^g_i(a^t) + ϵ_{i,t},   ϵ_{i,t} ~ i.i.d. N(0, σ²_g),   (13)

independently across players i and rounds t. Players observe the full public action history but only their own payoff sequence (r^t_i)_t. All equilibrium notions continue to refer to the underlying mean-payoff repeated game induced by u^g_i.

Known common noise family, unknown mean matrix. Experiment 3 instantiates the private-payoff theory in the special case where the reward noise family is known and only the mean payoff matrix is unknown. Concretely, for each player i and joint action a,

    r^t_i | a^t = a ~ N(m_i(a), σ²_g),

where σ²_g is common knowledge and the unknown object is the matrix m_i : A → ℝ. The finite reward menu used by PS-BR is therefore a finite menu of candidate mean matrices. Equivalently, each candidate matrix m induces a full payoff kernel q^m_i(·|a) = N(m(a), σ²_g), so payoff-matrix sampling in the implementation is exactly payoff-kernel sampling in the theory, expressed in mean-matrix coordinates.

We choose a noise level large enough that, on a single step, the realized payoff can often reverse the ranking between two outcomes whose true mean payoffs differ by the smallest strategically relevant gap. Formally, for each game g, define the minimal nonzero payoff separation

    Δ_{min,g} := min_{i ∈ I} min { |u^g_i(a) − u^g_i(a')| : a, a' ∈ A, u^g_i(a) ≠ u^g_i(a') }.   (14)

For the payoff matrices used in Experiment 1, the smallest payoff gaps are Δ_{min,BoS} = 3 and Δ_{min,PD} = 2, while for Promo, Samaritan, and Lemons the smallest gap is 1. We set the Gaussian noise standard deviation to

    σ_g = Δ_{min,g}.   (15)

With additive Gaussian noise, the noisy difference between two outcomes with mean gap Δ has standard deviation √2 σ_g; hence when Δ = Δ_{min,g} and σ_g = Δ_{min,g}, a single observation reverses the sign of the comparison with probability Φ(−1/√2) ≈ 0.24. Thus, roughly one in four observations on the tightest gaps is directionally misleading, while averaging over time still reveals the true mean incentives.
We then repeat the same experiments as Experiment 1 (late-round adherence to any Nash equilibrium path) and Experiment 2 (late-round adherence to the prompt-specified nontrivial cooperative Nash equilibrium path), using the same scoring window and reporting conventions; the only change is that agents must infer incentives from the private noisy payoffs (13) rather than reading u^g_i from the prompt.

To match Assumption 4, we equip each agent with a finite hypothesis class over the unknown mean payoff matrix. Fix a game g and player i, and define the offset set

    K := {−2, −1.5, −1, −0.5, 0, +0.5, +1, +1.5, +2}.

The finite menu of candidate mean matrices is

    M_{i,g} := { m : A → ℝ : m(a) = u^g_i(a) + k_a σ_g for each a ∈ A, with k_a ∈ K }.

In particular, the true mean matrix u^g_i belongs to M_{i,g}, obtained by taking k_a = 0 for every joint action a. Operationally, player i maintains a posterior over M_{i,g} using the Gaussian likelihood

    π^t_i(m | h^t, r^{1:t−1}_i) ∝ π^0_i(m) ∏_{s=1}^{t−1} ϕ(r^s_i; m(a^s), σ²_g),

where ϕ(·; µ, σ²_g) is the Gaussian density. PS-BR then samples one candidate mean matrix from this posterior and evaluates continuation strategies against the induced payoff kernel. Because M_{i,g} has product form over joint actions, this posterior can be updated action-wise under a product prior over the offsets (k_a)_{a∈A}; one need not enumerate the full menu explicitly in order to sample a complete mean matrix.
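The following is a minimal sketch of this action-wise sampling, assuming a uniform product prior over the offsets; the variable names are illustrative and this is not the exact implementation used in the experiments.

```python
import numpy as np

OFFSETS = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])   # the offset set K

def sample_mean_matrix(u_anchor, sigma, actions, rewards, rng):
    """Sample one candidate mean payoff matrix from the posterior over the offset menu,
    exploiting the product form over joint actions (illustrative sketch, uniform prior).

    u_anchor : dict {joint action: u_i^g(a)}, the anchor values defining the menu
    sigma    : known Gaussian noise scale sigma_g
    actions  : realized joint actions a^1, ..., a^{t-1}
    rewards  : player i's private payoffs r_i^1, ..., r_i^{t-1}
    rng      : numpy random Generator
    """
    sampled = {}
    for a, base in u_anchor.items():
        # log-likelihood of each offset, using only rounds in which joint action a was played
        log_post = np.zeros(len(OFFSETS))
        for a_s, r_s in zip(actions, rewards):
            if a_s == a:
                log_post += -0.5 * ((r_s - (base + OFFSETS * sigma)) / sigma) ** 2
        log_post -= log_post.max()
        post = np.exp(log_post)
        post /= post.sum()
        sampled[a] = base + OFFSETS[rng.choice(len(OFFSETS), p=post)] * sigma
    return sampled
```

Sampling each entry independently in this way is equivalent to sampling a full matrix from the posterior over M_{i,g} under the product prior, because the Gaussian likelihood factorizes across joint actions.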
7.4.2 Results

We report two complementary late-round metrics under unknown stochastic payoffs: convergence to any Nash equilibrium action (Table 3) and follow-through on the prompt-specified cooperative Nash equilibrium path (Table 4).

Table 3: Unknown stochastic payoffs: equilibrium-follow percentage in late rounds (rounds 161–180) for any Nash equilibrium. Reported scores are averaged over 20 trials.

  Game         Base     SCoT      PS-BR
  BoS          60.0%    95.0%     99.8%
  PD           60.0%    98.0%     98.0%
  Promo        0.0%     100.0%    100.0%
  Samaritan    0.0%     0.0%      96.2%
  Lemons       0.0%     98.5%     82.5%

Table 4: Unknown stochastic payoffs: equilibrium-follow percentage in late rounds (rounds 161–180) for the prompt-specified cooperative Nash equilibrium. Reported scores are averaged over 20 trials.

  Game         Base     SCoT     PS-BR
  BoS          0.0%     0.0%     98.0%
  PD           0.0%     0.0%     71.2%
  Promo        0.0%     0.0%     71.0%
  Samaritan    5.0%     0.0%     81.0%
  Lemons       0.0%     0.0%     73.8%

On the broader "any Nash" metric (Table 3), SCoT still performs very strongly in BoS (95.0%), PD (98.0%), Promo (100.0%), and Lemons (98.5%), but falls to 0.0% in Samaritan. PS-BR is near-perfect in BoS (99.8%), PD (98.0%), and Promo (100.0%), remains strong in Samaritan (96.2%), and reaches 82.5% in Lemons. Base remains limited, scoring 60.0% in BoS and PD and 0.0% in Promo, Samaritan, and Lemons. On the stricter prompt-specified cooperative-equilibrium metric (Table 4), by contrast, PS-BR remains the only method with substantial late-round follow-through under unknown payoffs: 98.0% in BoS, 71.2% in PD, 71.0% in Promo, 81.0% in Samaritan, and 73.8% in Lemons. Both Base and SCoT are at 0.0% in BoS, PD, Promo, and Lemons, while Base reaches only 5.0% in Samaritan.

These results suggest that under noisy private payoffs, myopic reasoning is often still enough to reach some equilibrium-like late-round behavior, but not to track the specific long-horizon cooperative prescription; the non-myopic planner, PS-BR, retains a clear advantage when the task requires identifying and sustaining the intended cooperative repeated-game path. Accordingly, Experiment 3 should be interpreted as testing strategic learning under noisy private observations of an unknown mean-payoff matrix, rather than learning an arbitrary payoff distribution. The informational difficulty comes from identifying the mean incentives relevant for continuation planning, while the noise family itself is held fixed and known.

Taken together, Tables 3 and 4 show that payoff uncertainty preserves the basic separation observed in the deterministic-payoff experiments, while also making the task meaningfully harder. On the broader "any Nash" metric, both SCoT and PS-BR still often reach equilibrium-consistent late-round behavior, indicating that noisy private payoffs do not prevent agents from eventually identifying at least some strategically stable pattern of play. This is consistent with the idea that coarse equilibrium-action convergence can survive substantial observational noise as long as the underlying incentives remain learnable over repeated interaction.

However, the stricter cooperative-equilibrium metric reveals a much sharper distinction. Under unknown payoffs, PS-BR remains the only method that reliably tracks the prompt-specified nontrivial repeated-game equilibrium across all environments, whereas Base and SCoT almost completely fail. This gap is important because it shows that the main difficulty is not merely predicting the opponent's next move, but jointly inferring the payoff structure and reasoning over continuation incentives. To sustain a particular cooperative equilibrium under payoff uncertainty, an agent must learn which action profiles are valuable, which deviations are tempting, and why future punishments make cooperation incentive compatible. PS-BR is designed to do exactly this by sampling both opponent strategies and payoff hypotheses and then planning against the sampled continuation game.

The fact that PS-BR still performs well, though less perfectly than in the known-payoff case, is also informative. Relative to Table 2, follow rates decline in PD, Promo, Samaritan, and Lemons once payoffs must be learned from noisy private observations. This is the expected direction: payoff uncertainty introduces an additional layer of posterior dispersion, so even when the opponent strategy is inferred correctly, errors in the learned payoff model can still distort continuation-value comparisons. In other words, the unknown-payoff setting does not overturn the mechanism established earlier, but weakens it quantitatively by making both belief learning and best-response computation noisier. At the same time, the results suggest that the theoretical extension in Section 6 is empirically meaningful rather than merely formal. The model class that explicitly represents uncertainty over payoffs and updates from private observations retains a substantial advantage precisely in the environments where long-horizon repeated-game incentives matter most.
Thus, the experiments support the broader claim of the paper: reasonably reasoning agents need not know the full game in advance to move toward equilibrium-like behavior. What matters is whether they can infer both the strategic behavior of others and the payoff consequences of interaction well enough to approximate continuation best responses on the realized path.

Overall, the three experiments draw a coherent empirical picture. Simple predict–then–act reasoning is often sufficient for convergence to some stage-game or equilibrium-consistent action pattern. But when the objective is to implement a specific nontrivial repeated-game equilibrium, especially under realistic informational frictions such as unknown and stochastic payoffs, explicit continuation-level reasoning becomes decisive. This is exactly the regime in which PS-BR provides a robust advantage, matching the central theoretical message of the paper.

8 Conclusion

In this paper, we theoretically highlight the promising prospect that general-purpose AI agents can attain game-theoretic robustness through inherent reasoning capabilities rather than bespoke training. By demonstrating that LLMs can evolve toward equilibrium behavior on the fly, we take a step toward safer and more autonomous multi-agent AI systems that remain effective across the myriad interactive scenarios they will encounter in the real world. The results bridge the gap between AI agents and classical game theory, indicating that the rich knowledge and inferential power of modern LLMs may be harnessed to meet longstanding challenges in multi-agent learning and interaction. Ultimately, enabling LLM-based agents to naturally exhibit equilibrium-like behavior during play not only advances our theoretical understanding of their behavior but also paves the way for their deployment in societally crucial domains that require reliable strategic decision-making.

References

Qu, Xiaodong et al. (2025). "A comprehensive review of AI agents: Transforming possibilities in technology and beyond". In: arXiv preprint arXiv:2508.11957.
Bianchi, Federico et al. (2024). "How well can LLMs negotiate? NegotiationArena platform and analysis". In: arXiv preprint arXiv:2402.05863.
Guo, Taicheng et al. (2024). "Large language model based multi-agents: A survey of progress and challenges". In: arXiv preprint arXiv:2402.01680.
Lopez-Lira, Alejandro (2025). "Can large language models trade? Testing financial theories with LLM agents in market simulations". In: arXiv preprint arXiv:2504.10789.
Li, Yang et al. (2024). "Aligning individual and collective objectives in multi-agent cooperation". In: Advances in Neural Information Processing Systems 37, pp. 44735–44760.
Zhu, Shenzhe et al. (2025). "The automated but risky game: Modeling and benchmarking agent-to-agent negotiations and transactions in consumer markets". In: arXiv preprint arXiv:2506.00073.
Bansal, Gagan et al. (2025). "Magentic Marketplace: An Open-Source Environment for Studying Agentic Markets". In: arXiv preprint arXiv:2510.25779.
Calvano, Emilio et al. (2020). "Artificial intelligence, algorithmic pricing, and collusion". In: American Economic Review 110.10, pp. 3267–3297.
Fish, Sara, Yannai A. Gonczarowski, and Ran I. Shorrer (2024). "Algorithmic collusion by large language models". In: arXiv preprint arXiv:2404.00806.
Brown, Zach Y. and Alexander MacKay (2023).
"Competition in pricing algorithms". In: American Economic Journal: Microeconomics 15.2, pp. 109–156.
Assad, Stephanie et al. (2024). "Algorithmic pricing and competition: Empirical evidence from the German retail gasoline market". In: Journal of Political Economy 132.3, pp. 723–771.
Guo, Xudong et al. (2024). Embodied LLM Agents Learn to Cooperate in Organized Teams. arXiv: 2403.12482 [cs.AI]. URL: https://arxiv.org/abs/2403.12482.
Huang, Jen-tse et al. (2024). "How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments". In: arXiv preprint arXiv:2403.11807.
Hua, Wenyue et al. (2024). "Game-theoretic LLM: Agent workflow for negotiation games". In: arXiv preprint arXiv:2411.05990.
Buscemi, Alessio et al. (2025). "Fair game: A framework for AI agents bias recognition using game theory". In: arXiv preprint arXiv:2504.14325.
Akata, Elif et al. (2025). "Playing repeated games with large language models". In: Nature Human Behaviour 9.7, pp. 1380–1390.
Park, Chanwoo et al. (2024). "Do LLM agents have regret? A case study in online learning and games". In: arXiv preprint arXiv:2403.16843.
Duque, Juan Agustin et al. (2024). "Advantage alignment algorithms". In: arXiv preprint arXiv:2406.14662.
Kalai, Ehud and Ehud Lehrer (1993a). "Rational learning leads to Nash equilibrium". In: Econometrica: Journal of the Econometric Society, pp. 1019–1045.
Norman, Thomas W. L. (2022). "The possibility of Bayesian learning in repeated games". In: Games and Economic Behavior 136, pp. 142–152.
Yamin, Khurram et al. (2026). "Do LLMs Act Like Rational Agents? Measuring Belief Coherence in Probabilistic Decision Making". In: arXiv preprint arXiv:2602.06286.
Ge, Luise, Yongyan Zhang, and Yevgeniy Vorobeychik (2026). "Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs". In: arXiv preprint arXiv:2602.15173.
Arumugam, Dilip and Thomas L. Griffiths (2025). "Toward efficient exploration by large language model agents". In: arXiv preprint arXiv:2504.20997.
Coda-Forno, Julian et al. (2023). "Meta-in-context learning in large language models". In: Advances in Neural Information Processing Systems 36, pp. 65189–65201.
Lu, Sheng et al. (2024). "Are emergent abilities in large language models just in-context learning?" In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5098–5139.
Cahyawijaya, Samuel, Holy Lovenia, and Pascale Fung (2024). "LLMs are few-shot in-context low-resource language learners". In: arXiv preprint arXiv:2403.16512.
Xie, Sang Michael et al. (2021). "An explanation of in-context learning as implicit Bayesian inference". In: arXiv preprint arXiv:2111.02080.
Wang, Xinyi et al. (2023). "Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning". In: Advances in Neural Information Processing Systems 36, pp. 15614–15638.
Wakayama, Tomoya and Taiji Suzuki (2025). "In-context learning is provably Bayesian inference: A generalization theory for meta-learning". In: arXiv preprint arXiv:2510.10981.
Falck, Fabian, Ziyu Wang, and Chris Holmes (2024). "Is in-context learning in large language models Bayesian? A martingale perspective". In: arXiv preprint arXiv:2406.00793.
Nachbar, John H. (1997). "Prediction, optimization, and learning in repeated games".
In: Econometrica: Journal of the Econometric Society, pp. 275–309.
– (2005). "Beliefs in repeated games". In: Econometrica 73.2, pp. 459–480.
Sun, Haoran et al. (2025). "Game theory meets large language models: A systematic survey with taxonomy and new frontiers". In: arXiv preprint arXiv:2502.09053.
Jia, Jingru et al. (2025). "LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory". In: arXiv preprint arXiv:2502.20432.
Guo, Fulin (2023). GPT in Game Theory Experiments. DOI: 10.48550/arXiv.2305.05516. arXiv: 2305.05516 [econ.GN].
Fan, Caoyun et al. (2023). Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis. AAAI 2024. DOI: 10.48550/arXiv.2312.05488. arXiv: 2312.05488 [cs.AI].
Kader, Gavin and Dongwoo Lee (2024). "The Emergence of Strategic Reasoning of Large Language Models". In: arXiv preprint arXiv:2412.13013.
Mao, Shaoguang et al. (2023). ALYMPICS: LLM Agents Meet Game Theory – Exploring Strategic Decision-Making with AI Agents. DOI: 10.48550/arXiv.2311.03220. arXiv: 2311.03220 [cs.CL].
Duan, Jinhao et al. (2024). GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. DOI: 10.48550/arXiv.2402.12348. arXiv: 2402.12348 [cs.CL].
Fontana, Nicoló, Francesco Pierri, and Luca Maria Aiello (2024). "Nicer Than Humans: How do Large Language Models Behave in the Prisoner's Dilemma?" In: arXiv preprint arXiv:2406.13605.
Willis, Richard et al. (2025). "Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma". In: arXiv preprint arXiv:2501.16173.
Agrawal, Kushal et al. (2025). Evaluating LLM Agent Collusion in Double Auctions. DOI: 10.48550/arXiv.2507.01413. arXiv: 2507.01413 [cs.GT].
Zhang, Yufeng et al. (2023). "What and how does in-context learning learn? Bayesian model averaging, parameterization, and generalization". In: arXiv preprint arXiv:2305.19420.
Zhang, Kelly W. et al. (2024). "Posterior sampling via autoregressive generation". In: NeurIPS 2024 Workshop on Bayesian Decision-making and Uncertainty.
Welleck, Sean et al. (2024). "From decoding to meta-generation: Inference-time algorithms for large language models". In: arXiv preprint arXiv:2406.16838.
Durrett, Rick (2019). Probability: Theory and Examples. 5th ed. See Theorem 2.1.21 (Kolmogorov's extension theorem). Cambridge University Press. DOI: 10.1017/9781108591034.
Kuhn, Harold W. (1953). "Extensive games and the problem of information". In: Contributions to the Theory of Games 2.28, pp. 193–216.
Aumann, Robert J. (1961). Mixed and behavior strategies in infinite extensive games. Princeton University, Princeton.
Kalai, Ehud and Ehud Lehrer (1993b). "Subjective equilibrium in repeated games". In: Econometrica 61.5, pp. 1231–1240.
Blackwell, David and Lester Dubins (1962). "Merging of opinions with increasing information". In: The Annals of Mathematical Statistics 33.3, pp. 882–886.
Cai, Tiffany Tianhui et al. (2024). "Active Exploration via Autoregressive Generation of Missing Data". In: arXiv preprint arXiv:2405.19466.
Zhou, Pei et al. (2023). "How far are large language models from agents with theory-of-mind?" In: arXiv preprint arXiv:2310.03051.
Riemer, Matthew et al. (2024). "Position: Theory of mind benchmarks are broken for large language models". In: arXiv preprint arXiv:2412.19726.
Aoyagi, Masaki, Guillaume R Fréchette, and Sevgi Yuksel (2024). “Beliefs in repeated games: An experiment”. In: American Economic Review 114.12, pp. 3944–3975.
Gill, David and Yaroslav Rosokha (2024). “Beliefs, learning, and personality in the indefinitely repeated prisoner’s dilemma”. In: American Economic Journal: Microeconomics 16.3, pp. 259–283.
Qwen Team (Feb. 2026). Qwen3.5: Towards Native Multimodal Agents. url: https://qwen.ai/blog?id=qwen3.5.
Singh, Aaditya et al. (2025). “OpenAI GPT-5 system card”. In: arXiv preprint arXiv:2601.03267.
Lal, Rajiv (1990). “Price promotions: Limiting competitive encroachment”. In: Marketing Science 9.3, pp. 247–262.
Abreu, Dilip (1988). “On the theory of infinitely repeated games with discounting”. In: Econometrica: Journal of the Econometric Society, pp. 383–396.

A Continuity and Finite-Horizon Robustness

Lemma A.1 (Continuity of discounted payoff). For each agent $i$ and every $\delta > 0$, there exists $\rho_i(\delta) > 0$ such that for any strategy profiles $f, g \in \mathcal{F}$,
$$d(\mu_f, \mu_g) \le \rho_i(\delta) \;\Longrightarrow\; \lvert U_i(f) - U_i(g) \rvert \le \delta.$$
In particular, if $\rho(\delta) = \min_{i \in I} \rho_i(\delta)$ and $d(\mu_f, \mu_g) \le \rho(\delta)$, then $\lvert U_i(f) - U_i(g) \rvert \le \delta$ for all $i \in I$.

A.1 Finite-horizon variants and robustness

For a finite horizon $T \in \mathbb{N}$, we denote by $\mathcal{F}^T$ the set of behaviour strategies specified on histories of length at most $T$; two full strategies that coincide on these histories induce the same distribution over histories up to time $T$ and the same truncated payoff. For $f \in \mathcal{F}^T$, define the $T$-period discounted payoff
$$U_i^T(f) = \mathbb{E}_{z \sim \mu_f}\Big[(1-\lambda_i)\sum_{t=1}^{T} \lambda_i^{t-1} u_i(z_t)\Big].$$

Definition 15 (Finite-horizon weak $\xi$-subjective $\eta$-equilibrium). Let $\xi, \eta \ge 0$ and a fixed horizon $T$. A truncated strategy profile $f \in \mathcal{F}^T$ is a finite-horizon weak $\xi$-subjective $\eta$-equilibrium if for each agent $i \in I$ there exists a supporting truncated profile $f^i \in \mathcal{F}^T$ such that:
• $f^i_i = f_i$;
• $U_i^T(f_i, f^i_{-i}) \ge \sup_{g_i \in \mathcal{F}^T_i} U_i^T(g_i, f^i_{-i}) - \xi$;
• $d(\mu_{f^i}, \mu_f) \le \eta$ when $d$ is computed using only cylinder events in $\mathcal{B}_t$ with $t \le T$.

We now show that finite-horizon weak subjective equilibria can be “patched” into approximate finite-horizon Nash equilibria without changing the induced distribution of play up to time $T$.

Lemma A.2 (Finite-horizon purification for $\eta = 0$; Norman, 2022). Fix a finite horizon $T$ and a profile $f \in \mathcal{F}^T$. Suppose $f$ is a finite-horizon weak $\psi$-subjective $0$-equilibrium for some $\psi \ge 0$. Then there exists a truncated strategy profile $\hat{f} \in \mathcal{F}^T$ such that:
• $\hat{f}$ is a $\psi$-Nash equilibrium of the $T$-period game, i.e., for all $i \in I$ and all $g_i \in \mathcal{F}^T_i$, $U_i^T(\hat{f}_i, \hat{f}_{-i}) \ge U_i^T(g_i, \hat{f}_{-i}) - \psi$;
• the induced distributions of histories of length at most $T$ coincide: for every $E \in \mathcal{B}_T$, $\mu_{\hat{f}}(E) = \mu_f(E)$.

We next extend this to the case where $\eta > 0$ but small, using a compactness and limit argument.

Lemma A.3 (Finite-horizon robustness). Fix a finite horizon $T$ and $\psi > 0$. For every $\theta > 0$ there exists $\bar{\eta}_T(\psi, \theta) > 0$ such that: if $f \in \mathcal{F}^T$ is a finite-horizon weak $\psi$-subjective $\eta$-equilibrium with $\eta \le \bar{\eta}_T(\psi, \theta)$, then there exists a $\psi$-Nash equilibrium $\hat{f} \in \mathcal{F}^T$ satisfying $d(\mu_{\hat{f}}, \mu_f) \le \theta$ (again with $d$ computed on cylinder events of length at most $T$).
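For intuition about the weak metric $d$ and the payoff continuity it controls (Lemma A.1 and its finite-horizon variants above), the following is a minimal numerical sketch, not the code used in the paper's experiments. It assumes i.i.d. play over four joint action profiles with made-up stage payoffs, computes $d(\mu_f, \mu_g)$ exactly on cylinder events up to a cutoff horizon, and reports the resulting gap in discounted payoffs.

# Minimal numerical sketch (illustrative, not the paper's code) of Lemma A.1:
# play distributions close in the weak metric d induce close discounted payoffs.
# Play is assumed i.i.d. across periods so the path measure is a product measure;
# the payoffs and probabilities below are made-up values.
from itertools import product

A = [0, 1, 2, 3]            # joint action profiles of the stage game
u_i = [0.0, 0.3, 0.7, 1.0]  # player i's stage payoffs, normalized to [0, 1]

def discounted_payoff(p):
    """For i.i.d. play with per-period law p, (1 - lam) * sum_t lam^(t-1) E_p[u_i]
    collapses to the per-period expectation E_p[u_i], for any discount factor."""
    return sum(p[a] * u_i[a] for a in A)

def tv_product(p, q, t):
    """Total variation between the t-fold product laws of p and q, i.e. the
    supremum over cylinder events in B_t of |mu_f(E) - mu_g(E)|."""
    tv = 0.0
    for path in product(A, repeat=t):
        mp = mq = 1.0
        for a in path:
            mp *= p[a]
            mq *= q[a]
        tv += abs(mp - mq)
    return tv / 2.0

def weak_distance_bounds(p, q, k_max=8):
    """d(mu_f, mu_g) = sum_t 2^{-t} sup_{E in B_t} |mu_f(E) - mu_g(E)|,
    computed exactly for t <= k_max; the neglected tail is at most 2^{-k_max}."""
    head = sum(2.0 ** (-t) * tv_product(p, q, t) for t in range(1, k_max + 1))
    return head, head + 2.0 ** (-k_max)

p = [0.40, 0.10, 0.20, 0.30]
for eps in [0.20, 0.05, 0.01]:
    q = [0.40 - eps, 0.10 + eps, 0.20, 0.30]
    lo, hi = weak_distance_bounds(p, q)
    gap = abs(discounted_payoff(p) - discounted_payoff(q))
    print(f"eps={eps:.2f}  d in [{lo:.4f}, {hi:.4f}]  |U_i(f) - U_i(g)| = {gap:.4f}")

Shrinking the perturbation eps shrinks the weak distance and the payoff gap together, which is the qualitative content of Lemma A.1; the i.i.d. assumption is only there to keep the path measures computable in closed form.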
W e now patch finite-hor izon robustness to the infinite-hor izon game by tr uncating the pa y off at a sufficiently larg e hor izon and using Lemma A.1; the resulting infinite-hor izon patching lemma is recorded belo w . Lemma A.4 (Infinite-hor izon patching) . F ix ξ > 0 and ε > 0 . There exists ˆ η ( ξ , ε ) > 0 suc h that if f ∈ F is a weak ξ -subjective η -equilibrium in the sense of Definition 8 with η ≤ ˆ η ( ξ , ε ) , then ther e exists a str ategy pr ofile ˆ f ∈ F satisfying: • ˆ f is a ( ξ + ε ) -Nash equilibrium of the infinite-horizon game; • d ( µ ˆ f , µ f ) ≤ ε . Remar k 3 (Continuation-game analogues) . Lemmas A.2 – A.4 apply verbatim to continuation games after any history h t b y inter preting U i ( · ) as continuation pa y off from h t and d ( · , · ) as the weak distance betw een µ g h t and µ g ′ h t . The y also apply v erbatim to the pr ivate-pa y off continuation game after any realized inf or mation- history v ector x t when F i is replaced b y Σ i , histories h t are replaced b y x t , pa y offs are U i ( τ | x t ) , and w eak distance is computed on the public-action marginals ¯ µ τ ,u x t . 48 B Proofs Pr oof of Lemma A.1. Fix i and δ > 0 . Choose a finite hor izon T ∈ N larg e enough that (1 − λ i ) ∞ X t = T +1 λ t − 1 i ≤ δ 4 . (16) For an y profile g ∈ F , define the tr uncated pay off U T i ( g ) = E z ∼ µ g " (1 − λ i ) T X t =1 λ t − 1 i u i ( z t ) # . Then f or any g w e hav e U i ( g ) − U T i ( g ) ≤ (1 − λ i ) ∞ X t = T +1 λ t − 1 i ≤ δ 4 b y (16), using that u i ( · ) ∈ [ 0 , 1] . No w fix f , g ∈ F . W e can decompose U i ( f ) − U i ( g ) ≤ U i ( f ) − U T i ( f ) + U T i ( f ) − U T i ( g ) + U T i ( g ) − U i ( g ) . By the bound abo ve, the first and third ter ms are each at most δ / 4 . It remains to control | U T i ( f ) − U T i ( g ) | . For each t ∈ { 1 , . . . , T } and each joint action profile a ∈ A , let α f t ( a ) = µ f { z ∈ H ∞ : z t = a } , α g t ( a ) = µ g { z ∈ H ∞ : z t = a } . Since u i ( a ) ∈ [ 0 , 1] f or all a , w e ha v e X a ∈ A u i ( a ) α f t ( a ) − α g t ( a ) ≤ sup E ∈B t µ f ( E ) − µ g ( E ) . Hence U T i ( f ) − U T i ( g ) = T X t =1 (1 − λ i ) λ t − 1 i X a ∈ A u i ( a ) α f t ( a ) − α g t ( a ) ≤ T X t =1 (1 − λ i ) λ t − 1 i sup E ∈B t µ f ( E ) − µ g ( E ) . By the definition (6) of d ( µ f , µ g ) , f or each t we ha v e 2 − t sup E ∈B t µ f ( E ) − µ g ( E ) ≤ d ( µ f , µ g ) , 49 hence sup E ∈B t µ f ( E ) − µ g ( E ) ≤ 2 t d ( µ f , µ g ) . Thus U T i ( f ) − U T i ( g ) ≤ d ( µ f , µ g ) T X t =1 (1 − λ i ) λ t − 1 i 2 t . The finite sum on the right depends onl y on T and λ i ; call it C i ( T ) . Define ρ i ( δ ) = min δ 4 C i ( T ) , 1 . If d ( µ f , µ g ) ≤ ρ i ( δ ) , then U T i ( f ) − U T i ( g ) ≤ C i ( T ) ρ i ( δ ) ≤ δ 4 . Combining the three bounds giv es U i ( f ) − U i ( g ) ≤ δ 4 + δ 4 + δ 4 < δ. Setting ρ ( δ ) = min i ∈ I ρ i ( δ ) yields the final claim. Pr oof of Lemma A.2. This is the finite-horizon analogue of the “purification ” or “deviation-tree patc hing” result f or weak subjectiv e equilibria in Norman (2022). The ke y idea is to modify off-path behavior so that, f or each play er i , any histor y that can onl y ar ise from a deviation by i tr igg ers opponents ’ play according to the suppor ting profile f i (which makes f i a ψ -best response), while on-path histories preser v e the or iginal profile f . 
Formall y , one constructs a deviation tree f or each pla y er and assigns to each subtree cor responding to a first de viation b y i the opponents ’ strategies from f i − i , keeping f on the non-deviation branch. This constr uction ensures: (i) if all play ers follo w ˆ f , the induced distribution of histor ies up to time T coincides with that under f (item 2); and (ii) any unilateral de viation by play er i induces, up to time T , the same distribution of his tor ies as de viating ag ainst f i − i , ag ainst which f i is a ψ -best reply by Definition 15. Theref ore ˆ f is a ψ -N ash equilibrium of the T -per iod game (item 1). A detailed construction and proof of these proper ties is giv en in Norman (2022), Proposition 3.1, and the associated deviation-tree arguments; our setting is the same repeated-game environment, so the proof car ries o v er v erbatim. Pr oof of Lemma A.3. Suppose, to wards a contradiction, that there e xist T , ψ > 0 and θ > 0 such that f or e very m ∈ N there is a finite-horizon w eak ψ -subjective η m -equilibrium f ( m ) ∈ F T with η m ≤ 1 /m and such that no ψ -N ash equilibrium lies within weak distance θ of µ f ( m ) (measured on B T ). 50 For each m and each i ∈ I , let f i, ( m ) be a suppor ting tr uncated profile witnessing that f ( m ) is a finite-hor izon w eak ψ -subjectiv e η m -equilibrium, i.e., f i, ( m ) i = f ( m ) i , U T i ( f ( m ) i , f i, ( m ) − i ) ≥ sup g i ∈F T i U T i ( g i , f i, ( m ) − i ) − ψ , d ( µ f i, ( m ) , µ f ( m ) ) ≤ η m . Because the hor izon T and action sets are finite, the space of behaviour strategies F T is a finite-dimensional product of simplices and hence compact in the product topology . Thus, by sequential compactness, there e xists a subsequence (which w e relabel for notational con v enience) such that f ( m ) → f ⋆ and f i, ( m ) → f i,⋆ f or all i ∈ I , as m → ∞ , in the product topology on F T . The map f 7→ µ f on finite histor ies (up to time T ) is continuous with respect to this topology and the w eak topology induced by d (restricted to B T ), so µ f ( m ) → µ f ⋆ , µ f i, ( m ) → µ f i,⋆ . Since d ( µ f i, ( m ) , µ f ( m ) ) ≤ η m → 0 , we must hav e d ( µ f i,⋆ , µ f ⋆ ) = 0 , so µ f i,⋆ = µ f ⋆ on B T . Moreo v er , the best-response inequality passes to the limit. F ix i and any g i ∈ F T i . For all m , U T i ( f ( m ) i , f i, ( m ) − i ) ≥ sup g ′ i ∈F T i U T i ( g ′ i , f i, ( m ) − i ) − ψ ≥ U T i ( g i , f i, ( m ) − i ) − ψ . By continuity of U T i in the product topology (an immediate consequence of Lemma A.1 restricted to hor izon T ), taking m → ∞ yields U T i ( f ⋆ i , f i,⋆ − i ) ≥ U T i ( g i , f i,⋆ − i ) − ψ . Since g i w as arbitrar y and f i,⋆ i = f ⋆ i (b y pointwise conv erg ence of f i, ( m ) i to f i,⋆ i and of f ( m ) i to f ⋆ i ), w e conclude that U T i ( f ⋆ i , f i,⋆ − i ) ≥ sup g i ∈F T i U T i ( g i , f i,⋆ − i ) − ψ . T ogether with d ( µ f i,⋆ , µ f ⋆ ) = 0 , this sho ws that f ⋆ is a finite-hor izon weak ψ -subjective 0 -equilibrium of the T -period game. By Lemma A.2, there exis ts a profile ˆ f ⋆ ∈ F T such that ˆ f ⋆ is a ψ -Nash equilibrium of the T -per iod game and µ ˆ f ⋆ coincides with µ f ⋆ on histories of length at most T . In par ticular , d ( µ ˆ f ⋆ , µ f ⋆ ) = 0 . Since µ f ( m ) → µ f ⋆ in the w eak metr ic d (restricted to B T ), we ha ve d ( µ f ( m ) , µ ˆ f ⋆ ) → 0 as m → ∞ . Thus f or all sufficiently larg e m , d ( µ f ( m ) , µ ˆ f ⋆ ) ≤ θ . 
But ˆ f ⋆ is a ψ -Nash equilibrium, contradicting the assumption 51 that no ψ -N ash eq uilibr ium lies within weak distance θ of µ f ( m ) . This contradiction sho ws that suc h a sequence ( f ( m ) ) cannot exis t, and hence there must exis t ¯ η T ( ψ , θ ) > 0 with the stated proper ty . Pr oof of Lemma A.4. Fix ξ > 0 and ε > 0 . Choose a finite horizon T larg e enough that, f or all i ∈ I and all profiles h ∈ F , U i ( h ) − U T i ( h ) ≤ ε 8 , (17) and also X t>T 2 − t ≤ ε 4 . (18) Such a T e xists because the tails of both g eometr ic ser ies are uniforml y small. Let f be a w eak ξ -subjectiv e η -equilibrium with suppor ting profiles { f i } i ∈ I as in Definition 8, i.e., f or each i , f i i = f i , U i ( f i , f i − i ) ≥ sup g i ∈F i U i ( g i , f i − i ) − ξ , d ( µ f i , µ f ) ≤ η . Consider the tr uncated profiles f ( T ) and ( f i ) ( T ) obtained by restricting the prescr iptions of f and f i to histories of length at most T . For each i we hav e ( f i i ) ( T ) = f ( T ) i and, since the w eak distance on histor ies up to T is bounded b y the full weak distance, d ( µ ( f i ) ( T ) , µ f ( T ) ) ≤ d ( µ f i , µ f ) ≤ η . W e no w show that f ( T ) is a finite-horizon weak ψ T -subjectiv e η -equilibr ium f or a slightly relaxed parameter ψ T . Fix i and note that for an y profile h , | U i ( h ) − U T i ( h ) | ≤ ε 8 b y (17). Using the w eak subjectiv e inequality f or f and f i , w e obtain U T i ( f ( T ) i , ( f i − i ) ( T ) ) = U T i ( f i , f i − i ) ≥ U i ( f i , f i − i ) − ε 8 ≥ sup g i ∈F i U i ( g i , f i − i ) − ξ − ε 8 . For an y tr uncated deviation g ( T ) i ∈ F T i w e can e xtend it arbitrar il y to a full strategy g i ∈ F i , and then U i ( g i , f i − i ) ≥ U T i ( g ( T ) i , ( f i − i ) ( T ) ) − ε 8 , 52 again b y (17). T aking the supremum o ver g ( T ) i yields U T i ( f ( T ) i , ( f i − i ) ( T ) ) ≥ sup g ( T ) i ∈F T i U T i ( g ( T ) i , ( f i − i ) ( T ) ) − ξ − ε 4 . Thus, if we define ψ T := ξ + ε 4 , then f or each i the truncated profiles f ( T ) and ( f i ) ( T ) satisfy U T i ( f ( T ) i , ( f i − i ) ( T ) ) ≥ sup g ( T ) i ∈F T i U T i ( g ( T ) i , ( f i − i ) ( T ) ) − ψ T , and d ( µ ( f i ) ( T ) , µ f ( T ) ) ≤ η , so f ( T ) is a finite-horizon w eak ψ T -subjectiv e η -equilibrium in the sense of Definition 15. Appl ying Lemma A.3 with this T , ψ = ψ T and θ = ε/ 2 , there e xists ¯ η T ( ψ T , ε/ 2) > 0 such that if η ≤ ¯ η T ( ψ T , ε/ 2) then there is a ψ T -N ash equilibrium ˜ f ( T ) ∈ F T f or the T -period game with d ( µ ˜ f ( T ) , µ f ( T ) ) ≤ ε 2 . Define ˆ η ( ξ , ε ) := ¯ η T ξ + ε 4 , ε 2 . Assume hencef or th that η ≤ ˆ η ( ξ , ε ) so that this conclusion holds. Extend ˜ f ( T ) arbitrarily to a full strategy profile ˆ f ∈ F by specifying its behaviour after period T in any wa y . Then ˆ f and ˜ f ( T ) coincide on periods t ≤ T , and similarl y f and f ( T ) coincide on t ≤ T . The w eak distance betw een ˆ f and f can be bounded as d ( µ ˆ f , µ f ) ≤ d ( µ ˆ f , µ ˜ f ( T ) ) + d ( µ ˜ f ( T ) , µ f ( T ) ) + d ( µ f ( T ) , µ f ) . The second term is at most ε/ 2 b y cons tr uction. For the first and third ter ms, an y discrepancy between ˆ f and ˜ f ( T ) (respectiv ely , f and f ( T ) ) occurs only at times t > T , so each of these weak distances is bounded b y the tail P t>T 2 − t ≤ ε/ 4 b y (18). Hence d ( µ ˆ f , µ f ) ≤ ε 4 + ε 2 + ε 4 = ε. It remains to sho w that ˆ f is a ( ξ + ε ) -N ash equilibrium of the infinite-hor izon game. Fix i ∈ I and an y de viation g i ∈ F i . 
Let g ( T ) i denote the tr uncation of g i to a T -per iod strategy , i.e., its prescr iptions on histories of length at most T ; clear ly U T i ( g i , ˆ f − i ) = U T i ( g ( T ) i , ˜ f ( T ) − i ) since ˆ f and ˜ f ( T ) coincide on the first T 53 periods. Because ˜ f ( T ) is a ψ T -N ash equilibrium of the T -per iod game, U T i ( ˜ f ( T ) i , ˜ f ( T ) − i ) ≥ U T i ( g ( T ) i , ˜ f ( T ) − i ) − ψ T . Using the tr uncation bound (17), we obtain U i ( ˆ f i , ˆ f − i ) ≥ U T i ( ˆ f i , ˆ f − i ) − ε 8 = U T i ( ˜ f ( T ) i , ˜ f ( T ) − i ) − ε 8 and U i ( g i , ˆ f − i ) ≤ U T i ( g i , ˆ f − i ) + ε 8 = U T i ( g ( T ) i , ˜ f ( T ) − i ) + ε 8 . Combining these inequalities yields U i ( ˆ f i , ˆ f − i ) ≥ U T i ( ˜ f ( T ) i , ˜ f ( T ) − i ) − ε 8 ≥ U T i ( g ( T ) i , ˜ f ( T ) − i ) − ψ T − ε 8 ≥ U i ( g i , ˆ f − i ) − ψ T − ε 4 . R ecalling that ψ T = ξ + ε/ 4 , w e ha v e ψ T + ε 4 = ξ + ε 2 ≤ ξ + ε, so f or ev er y deviation g i , U i ( ˆ f i , ˆ f − i ) ≥ U i ( g i , ˆ f − i ) − ( ξ + ε ) . Thus ˆ f is a ( ξ + ε ) -Nash equilibr ium. Pr oof of Lemma 4.1. For each g − i ∈ S − i define the continuation value en velope M ( g − i ) := sup σ i V i ( σ i | h t ; g − i ) ∈ [ 0 , 1] . For each g − i pick a (measurable) best response σ g − i i ∈ BR i ( g − i | h t ) , so that V i ( σ g − i i | h t ; g − i ) = M ( g − i ) . By definition, PS-BR first samples ˜ g − i ∼ p t ( · ) and then pla ys σ ˜ g − i i . Evaluating against the posterior predictiv e belief and using linear ity in the mixing ov er opponent h ypotheses, V i ( σ PS i,t | h t ) = X ˜ g − i ∈S − i p t ( ˜ g − i ) X g − i ∈S − i p t ( g − i ) V i ( σ ˜ g − i i | h t ; g − i ) ≥ X g − i ∈S − i p t ( g − i ) 2 V i ( σ g − i i | h t ; g − i ) 54 = X g − i ∈S − i p t ( g − i ) 2 M ( g − i ) . On the other hand, sup σ i V i ( σ i | h t ) = sup σ i X g − i ∈S − i p t ( g − i ) V i ( σ i | h t ; g − i ) ≤ X g − i ∈S − i p t ( g − i ) M ( g − i ) . Subtracting and using M ( g − i ) ≤ 1 , sup σ i V i ( σ i | h t ) − V i ( σ PS i,t | h t ) ≤ X g − i ∈S − i p t ( g − i ) − p t ( g − i ) 2 M ( g − i ) ≤ X g − i ∈S − i p t ( g − i ) − p t ( g − i ) 2 = 1 − X g − i ∈S − i p t ( g − i ) 2 = D t i ( h t ) . This pro ves the claim. Pr oof of Lemma 4.2. Fix any g − i ∈ S − i \ { f − i } . W rite a t = ( a t i , a t − i ) f or the per iod- t action profile along the realized play path z , and write h t f or the length- t history ( a 1 , . . . , a t − 1 ) . Because S − i is finite and all menu strategies are ν -cautious, Ba yes ’ r ule is well-defined at e v er y history and the posterior odds admit the standard likelihood ratio f orm: µ t i ( g − i | h t ) µ t i ( f − i | h t ) = µ 0 i ( g − i ) µ 0 i ( f − i ) t − 1 Y s =1 g − i ( h s )( a s − i ) f − i ( h s )( a s − i ) . (19) Define the log-likelihood ratio increments X s := log f − i ( h s )( a s − i ) g − i ( h s )( a s − i ) . T aking logs in (19) gives log µ t i ( g − i | h t ) µ t i ( f − i | h t ) = log µ 0 i ( g − i ) µ 0 i ( f − i ) − t − 1 X s =1 X s . (20) Let F s be the σ -algebra generated by the histor y h s . U nder the tr ue pla y distribution µ f , conditional on F s the opponents ’ action a s − i is distributed according to f − i ( h s ) . Theref ore, E µ f X s | F s = X a − i ∈ A − i f − i ( h s )( a − i ) log f − i ( h s )( a − i ) g − i ( h s )( a − i ) = D KL f − i ( h s ) g − i ( h s ) . 55 Define the mar tingale difference sequence Y s := X s − E [ X s | F s ] . 
By ν -caution, f or all s w e ha v e f − i ( h s )( a s − i ) ∈ [ ν, 1] and g − i ( h s )( a s − i ) ∈ [ ν, 1] , hence | X s | ≤ log (1 /ν ) , E [ X s | F s ] ≤ log (1 /ν ) , and thus | Y s | ≤ 2 log(1 /ν ) := c. Azuma–Hoeffding yields, f or any ϵ > 0 , Pr T X s =1 Y s ≥ ϵT ! ≤ 2 exp − ϵ 2 T 2 c 2 ! . The right-hand side is summable in T , so b y Borel–Cantelli, 1 T T X s =1 Y s − → 0 µ f -a.s. Consequentl y , 1 T T X s =1 X s = 1 T T X s =1 E [ X s | F s ] + o (1) = 1 T T X s =1 D KL f − i ( h s ) g − i ( h s ) + o (1) µ f -a.s. By the KL -separation par t of Assumption 3, the liminf of the empir ical a v erages of these KL ter ms is strictly positiv e µ f -a.s., hence t − 1 X s =1 X s − → + ∞ µ f -a.s. R etur ning to (20), w e obtain log µ t i ( g − i | h t ) µ t i ( f − i | h t ) − → −∞ µ f -a.s., so µ t i ( g − i | h t ) /µ t i ( f − i | h t ) → 0 almos t surel y . Because there are finitely many g − i = f − i , this implies µ t i ( f − i | h t ) → 1 and max g − i = f − i µ t i ( g − i | h t ) → 0 almost surel y . Pr oof of Lemma 6.1. Identical to the proof of Lemma 4.2, with ( f − i , µ f ) replaced by ( ¯ f − i , ¯ µ σ,u ) . Pr oof of Pr oposition 4.3. Along any realized pla y path z , define p t ( · ) = µ t i ( · | h t ( z )) on the finite set S − i and the associated D t i ( h t ( z )) = 1 − P g − i p t ( g − i ) 2 . By Lemma 4.2, µ t i ( f − i | h t ( z )) → 1 almos t surely , hence X g − i ∈S − i p t ( g − i ) 2 ≥ µ t i ( f − i | h t ( z )) 2 − → 1 , and theref ore D t i ( h t ( z )) → 0 almost surely . Fix an y ε > 0 and an y z in the full-measure e vent where D t i ( h t ( z )) → 0 . Choose T i ( z , ε ) such that D t i ( h t ( z )) ≤ ε f or all t ≥ T i ( z , ε ) . For each such t , Lemma 4.1 implies that PS-BR at h t ( z ) is an ε -bes t 56 response to the posterior predictiv e continuation belief, i.e., f i h t ( z ) ∈ BR ε i f i,t − i h t ( z ) | h t ( z ) . This is ex actly the asymptotic ε -consistency requirement in Definition 4. Pr oof of Lemma 5.1. Let µ f i ≡ P 0 ,f i i be the dis tr ibution induced b y the belief-equiv alent profile ( f i , f i − i ) representing the pr ior predictive. By Assumption 2, µ f ≪ µ f i . By the mer ging of opinions theorem (Kalai and Lehrer, 1993a; Blackw ell and Dubins, 1962), absolute continuity guarantees that the conditional predictive distributions ov er future play paths merg e almost surely in total variation. Specifically , for µ f -almost ev er y path z ∈ H ∞ : lim t →∞ sup E ∈B µ f ( E | C ( h t ( z ))) − µ f i ( E | C ( h t ( z ))) = 0 , where B is the product σ -alg ebra on H ∞ . R ecall from Definition 6 that the continuation w eak distance is bounded b y the total variation dis tance. For any finite length k , the σ -algebra B k g enerated b y cy linder e v ents of length k is a sub- σ -algebra of B . Theref ore: sup E ∈B k µ f ( E | C ( h t ( z ))) − µ f i ( E | C ( h t ( z ))) ≤ sup E ∈B µ f ( E | C ( h t ( z ))) − µ f i ( E | C ( h t ( z ))) . Using this bound, the continuation w eak distance d h t ( z ) ( µ f , µ f i ) satisfies: d h t ( z ) ( µ f , µ f i ) = ∞ X k =1 2 − k sup E ∈B k µ f ( E | C ( h t ( z ))) − µ f i ( E | C ( h t ( z ))) ≤ ∞ X k =1 2 − k sup E ∈B µ f ( E | C ( h t ( z ))) − µ f i ( E | C ( h t ( z ))) = sup E ∈B µ f ( E | C ( h t ( z ))) − µ f i ( E | C ( h t ( z ))) . Since the total variation distance on the r ight-hand side conv erg es to zero as t → ∞ for µ f -almost ev ery z , w e ha v e: lim t →∞ d h t ( z ) ( µ f , µ f i ) = 0 µ f -a.s. 
By the definition of the limit, f or an y η > 0 , there µ f -a.s. exis ts a finite time T i ( z , η ) such that f or all t ≥ T i ( z , η ) , d h t ( z ) ( µ f , µ f i ) ≤ η . This precisel y satisfies the strong path prediction requirement in Definition 9. Pr oof of Pr oposition 5.2. Fix ξ , η > 0 . For each pla y er i , RR implies that µ f -a.s. in z there exis ts T br i ( z ) 57 such that f or all t ≥ T br i ( z ) , f i h t ( z ) ∈ BR ξ i f i,t − i h t ( z ) | h t ( z ) . By the representative choice (4), w e ma y equiv alently wr ite f i,t − i h t ( z ) ≡ f i − i h t ( z ) , so f or all t ≥ T br i ( z ) , f i h t ( z ) ∈ BR ξ i f i − i h t ( z ) | h t ( z ) , which is e xactl y the subjectiv e best-response condition in Definition 8. Similarl y , strong prediction implies that µ f -a.s. in z there e xists T pred i ( z ) suc h that for all t ≥ T pred i ( z ) , d h t ( z ) ( µ f , µ f i ) ≤ η , which is the w eak predictiv e accuracy condition in Definition 8. Let T ( z ) := max i { T br i ( z ) , T pred i ( z ) } , which is finite µ f -a.s. since I is finite. Then f or all t ≥ T ( z ) and ev er y play er i , both conditions in Definition 8 hold with suppor ting profile f i , so f h t ( z ) is a weak ξ -subjectiv e η -equilibrium after h t ( z ) . Pr oof of Theor em 5.3. Fix ε > 0 and set ξ := ε/ 2 . Let ˆ η ( · , · ) be the function from the infinite patching lemma (Lemma A.4 in Appendix A), and set η := ˆ η ( ξ , ε/ 2) . By Proposition 5.2, µ f -a.s. in z there exis ts T ( z ) such that f or all t ≥ T ( z ) , the continuation profile f h t ( z ) is a w eak ξ -subjective η -equilibrium after h t ( z ) . Applying Lemma A.4 at each such t yields an ε -Nash equilibrium ˆ f ε,t,z of the continuation game after h t ( z ) satisfying d h t ( z ) ( µ f , µ ˆ f ε,t,z ) ≤ ε . Pr oof of Cor ollar y 5.4. By Proposition 4.3, under Assumption 3, each pla yer is RR. Because Assumption 3 (specificall y the menu g rain of truth) implies Assumption 2, Lemma 5.1 guarantees each pla y er lear ns to predict the path of pla y under f . Theorem 5.3 theref ore applies. Pr oof of Lemma 6.2. Fix an y m i ∈ M i \ { u i } . By Ba yes ’ rule (6), π t i ( m i | x t i ) π t i ( u i | x t i ) = π 0 i ( m i ) π 0 i ( u i ) t − 1 Y s =1 ψ i ( r s i ; m i ( a s )) ψ i ( r s i ; u i ( a s )) . Equiv alently , log π t i ( m i | x t i ) π t i ( u i | x t i ) = log π 0 i ( m i ) π 0 i ( u i ) − t − 1 X s =1 X s , where X s := log ψ i ( r s i ; u i ( a s )) ψ i ( r s i ; m i ( a s )) . 58 Let H s := σ ( h s +1 , r 1: s − 1 i ) , so that a s is H s -measurable and, under the true interaction law , r s i is conditionally distributed as q u i i ( · | a s ) . Theref ore E [ X s | H s ] = D KL q u i i ( · | a s ) q m i i ( · | a s ) . Define the mar tingale difference sequence Y s := X s − E [ X s | H s ] . By Assumption 4(3), sup s E [ Y 2 s ] < ∞ . Hence ∞ X s =1 E [ Y 2 s ] s 2 < ∞ , so the mar tingale strong law implies 1 T T X s =1 Y s − → 0 a.s. Theref ore, 1 T T X s =1 X s = 1 T T X s =1 D KL q u i i ( · | a s ) q m i i ( · | a s ) + o (1) a.s. By Assumption 4(4), the liminf of the empir ical KL av erage is strictl y positiv e almost surely , hence t − 1 X s =1 X s − → + ∞ a.s. It f ollow s that log π t i ( m i | x t i ) π t i ( u i | x t i ) − → −∞ a.s., so π t i ( m i | x t i ) π t i ( u i | x t i ) − → 0 . Since M i is finite, this implies π t i ( u i | x t i ) − → 1 and max m i = u i π t i ( m i | x t i ) − → 0 almost surely . 59 Pr oof of Lemma 6.3. 
By (11), f or e very measurable ev ent E ⊆ H ∞ , Π t i ( E | x t i ) − ¯ µ ( σ i ,g i,t − i ) ,u i x t i ( E ) = X m i ∈M i π t i ( m i | x t i ) ¯ µ ( σ i ,g i,t − i ) ,m i x t i ( E ) − ¯ µ ( σ i ,g i,t − i ) ,u i x t i ( E ) ≤ X m i = u i π t i ( m i | x t i ) = 1 − π t i ( u i | x t i ) . T aking the supremum ov er cy linder ev ents at each hor izon and summing with the weights 2 − t yields the stated bound. Pr oof of Lemma 6.4. Fix pla y er i and an information history x t i = ( h t , r 1: t − 1 i ) . Let M := S − i × M i , and f or each m = ( g − i , m i ) ∈ M define the continuation value functional V m i ( τ i | x t i ) := V m i i ( τ i | x t i ; g − i ) ∈ [0 , 1] , and the value env elope M ( m ) := sup τ i V m i ( τ i | x t i ) ∈ [0 , 1] . For each m ∈ M fix a (measurable) best response τ m i attaining M ( m ) , i.e., V m i ( τ m i | x t i ) = M ( m ) . By Definition 13, PS-BR samples ( ˜ g − i , ˜ m i ) ∼ p t ( · ) and then pla ys τ ( ˜ g − i , ˜ m i ) i . Let σ PS i,t denote this randomized continuation strategy at x t i . Because V mix ,t i is linear in both the opponents-mixture and the pa y off-matr ix mixture, we can write V mix ,t i ( τ i | x t i ) = X ( g − i ,m i ) ∈M p t ( g − i , m i ) V ( g − i ,m i ) i ( τ i | x t i ) = X m ∈M p t ( m ) V m i ( τ i | x t i ) . Theref ore, ev aluating PS-BR under the mixed subjectiv e objectiv e giv es V mix ,t i ( σ PS i,t | x t i ) = X ˜ m ∈M p t ( ˜ m ) V mix ,t i ( τ ˜ m i | x t i ) = X ˜ m ∈M p t ( ˜ m ) X m ∈M p t ( m ) V m i ( τ ˜ m i | x t i ) ≥ X m ∈M p t ( m ) 2 V m i ( τ m i | x t i ) = X m ∈M p t ( m ) 2 M ( m ) . On the other hand, sup τ i V mix ,t i ( τ i | x t i ) = sup τ i X m ∈M p t ( m ) V m i ( τ i | x t i ) ≤ X m ∈M p t ( m ) sup τ i V m i ( τ i | x t i ) = X m ∈M p t ( m ) M ( m ) . 60 Subtracting and using M ( m ) ≤ 1 f or all m , sup τ i V mix ,t i ( τ i | x t i ) − V mix ,t i ( σ PS i,t | x t i ) ≤ X m ∈M p t ( m ) − p t ( m ) 2 M ( m ) ≤ X m ∈M p t ( m ) − p t ( m ) 2 = 1 − X m ∈M p t ( m ) 2 = D t, joint i ( x t i ) . This pro ves the claim. Pr oof of Pr oposition 6.5. W ork on the full-measure e v ent on which both posterior concentrations hold: µ t i ( ¯ f − i | h t ) → 1 and π t i ( u i | x t i ) → 1 . Then D t, joint i ( x t i ) → 0 and δ t i ( x t i ) := 1 − π t i ( u i | x t i ) → 0 . By (12), Lemma 6.4, and (10), sup τ i ∈ Σ i ( x t i ) V u i i τ i | x t i ; g i,t − i − V u i i σ PS i,t | x t i ; g i,t − i = sup τ i V u i ,t i ( τ i | x t i ) − V u i ,t i ( σ PS i,t | x t i ) ≤ D t, joint i ( x t i ) + 2 δ t i ( x t i ) . The r ight-hand side conv erg es to 0 almost surely , so the stated ev entual ε -best-response proper ty f ollo ws. Pr oof of Lemma 6.6. By the Blackw ell–Dubins merging argument applied on the obser v able process O i , Assumption 5 implies d Π t i ( · | x t i ( ω )) , ¯ µ σ,u i,x t i ( ω ) − → 0 for P σ,u -a.e. ω . Assumption 6 gives d ¯ µ σ,u i,x t i ( ω ) , ¯ µ σ,u x t ( ω ) − → 0 for P σ,u -a.e. ω . The claim f ollow s by the triangle inequality . Pr oof of Pr oposition 6.7. Fix ξ , η > 0 . F or each pla y er i , Proposition 6.5 implies that P σ,u -a.s. there e xists T br i ( ω ) such that f or all t ≥ T br i ( ω ) , σ PS i,t ( · | x t i ( ω )) ∈ BR ξ i,u i g i,t − i | x t i ( ω ) . Also, Lemma 6.6 together with Lemma 6.3 implies that P σ,u -a.s. there e xists T pred i ( ω ) such that f or all t ≥ T pred i ( ω ) , d ¯ µ σ,u x t ( ω ) , ¯ µ ( σ i ,g i,t − i ) ,u i x t i ( ω ) ≤ η . 
61 Indeed, d ¯ µ σ,u x t ( ω ) , ¯ µ ( σ i ,g i,t − i ) ,u i x t i ( ω ) ≤ d ¯ µ σ,u x t ( ω ) , Π t i ( · | x t i ( ω )) + d Π t i ( · | x t i ( ω )) , ¯ µ ( σ i ,g i,t − i ) ,u i x t i ( ω ) , and both ter ms vanish almost surel y b y Lemmas 6.6 and 6.3. Let T ( ω ) := max i ∈ I { T br i ( ω ) , T pred i ( ω ) } . Then for all t ≥ T ( ω ) and e very play er i , both conditions in Definition 14 hold with suppor ting reduced-f or m model g i,t − i . Pr oof of Lemma 5.5. Fix pla y er i , let p, q ∈ ∆( A − i ) , and suppose α i ∈ br ξ i ( q ) . For an y α i ∈ ∆( A i ) define ϕ α i ( a − i ) := X a i ∈ A i α i ( a i ) u i ( a i , a − i ) , a − i ∈ A − i . Since u i ( a i , a − i ) ∈ [ 0 , 1] , w e ha v e ϕ α i ( a − i ) ∈ [ 0 , 1] f or all a − i ∈ A − i . Also, u i ( α i , p ) − u i ( α i , q ) = X a − i ∈ A − i ϕ α i ( a − i ) p ( a − i ) − q ( a − i ) . Set S + := { a − i ∈ A − i : p ( a − i ) ≥ q ( a − i ) } . Because 0 ≤ ϕ α i ≤ 1 , we ha ve u i ( α i , p ) − u i ( α i , q ) = X a − i ∈ S + ϕ α i ( a − i ) p ( a − i ) − q ( a − i ) + X a − i / ∈ S + ϕ α i ( a − i ) p ( a − i ) − q ( a − i ) ≤ X a − i ∈ S + p ( a − i ) − q ( a − i ) = p ( S + ) − q ( S + ) ≤ ∥ p − q ∥ TV . Appl ying the same argument with p and q interchang ed yields u i ( α i , q ) − u i ( α i , p ) ≤ ∥ p − q ∥ TV . Theref ore | u i ( α i , p ) − u i ( α i , q ) | ≤ ∥ p − q ∥ TV f or ev ery α i ∈ ∆( A i ) . (21) 62 No w suppose α i ∈ br ξ i ( q ) . Then u i ( α i , q ) ≥ sup α ′ i ∈ ∆( A i ) u i ( α ′ i , q ) − ξ . Using (21), u i ( α i , p ) ≥ u i ( α i , q ) − ∥ p − q ∥ TV ≥ sup α ′ i ∈ ∆( A i ) u i ( α ′ i , q ) − ξ − ∥ p − q ∥ TV ≥ sup α ′ i ∈ ∆( A i ) u i ( α ′ i , p ) − ∥ p − q ∥ TV − ξ − ∥ p − q ∥ TV = sup α ′ i ∈ ∆( A i ) u i ( α ′ i , p ) − ξ − 2 ∥ p − q ∥ TV . Hence α i ∈ br ξ +2 ∥ p − q ∥ TV i ( p ) . Pr oof of Lemma 5.6. Fix pla y er i and histor y h t . For each g − i ∈ S − i define M ( g − i ) := sup α i ∈ ∆( A i ) u i α i , g − i ( h t ) ∈ [ 0 , 1] . By Definition 11, f or each g − i ∈ S − i w e ha v e chosen α g − i ,h t i ∈ br i g − i ( h t ) , so u i α g − i ,h t i , g − i ( h t ) = M ( g − i ) . W r ite p t ( g − i ) = µ t i ( g − i | h t ) . The e x ante mix ed action induced by m y opic PS-BR is α mPS i,t ( · | h t ) = X ˜ g − i ∈S − i p t ( ˜ g − i ) α ˜ g − i ,h t i ( · ) , and the one-step posterior predictive belief is q t i ( · | h t ) = X g − i ∈S − i p t ( g − i ) g − i ( h t )( · ) . 63 By bilinearity of u i ( · , · ) , u i α mPS i,t , q t i = X ˜ g − i ∈S − i p t ( ˜ g − i ) X g − i ∈S − i p t ( g − i ) u i α ˜ g − i ,h t i , g − i ( h t ) ≥ X g − i ∈S − i p t ( g − i ) 2 u i α g − i ,h t i , g − i ( h t ) = X g − i ∈S − i p t ( g − i ) 2 M ( g − i ) . On the other hand, again b y bilinear ity , sup α i ∈ ∆( A i ) u i ( α i , q t i ) = sup α i ∈ ∆( A i ) X g − i ∈S − i p t ( g − i ) u i α i , g − i ( h t ) ≤ X g − i ∈S − i p t ( g − i ) sup α i ∈ ∆( A i ) u i α i , g − i ( h t ) = X g − i ∈S − i p t ( g − i ) M ( g − i ) . Subtracting, sup α i u i ( α i , q t i ) − u i ( α mPS i,t , q t i ) ≤ X g − i ∈S − i p t ( g − i ) − p t ( g − i ) 2 M ( g − i ) ≤ X g − i ∈S − i p t ( g − i ) − p t ( g − i ) 2 = 1 − X g − i ∈S − i p t ( g − i ) 2 = D t i ( h t ) . This pro ves the claim. Pr oof of Lemma 5.7. Fix play er i and let f i = ( f i , f i − i ) be the supporting profile from Definition 9. Fix a realized path z ∈ H ∞ in the full-measure ev ent from Definition 9. By definition of q t i and the representativ e choice (4), q t i ( · | h t ( z )) = f i,t − i ( h t ( z )) = f i − i ( h t ( z )) . 
Let η > 0 . By Definition 9, there exis ts T i ( z , η / 2) < ∞ such that f or all t ≥ T i ( z , η / 2) , d h t ( z ) ( µ f , µ f i ) ≤ η / 2 . 64 Fix such a t . For any subset B ⊆ A − i , define the one-step cylinder ev ent E B := { y ∈ H ∞ : y 1 − i ∈ B } ∈ B 1 . By the definition of continuation measures, µ f h t ( z ) ( E B ) = f − i ( h t ( z ))( B ) , µ f i h t ( z ) ( E B ) = f i − i ( h t ( z ))( B ) = q t i ( B | h t ( z )) . Theref ore, q t i ( · | h t ( z )) − f − i ( h t ( z )) TV = sup B ⊆ A − i q t i ( B | h t ( z )) − f − i ( h t ( z ))( B ) = sup B ⊆ A − i µ f i h t ( z ) ( E B ) − µ f h t ( z ) ( E B ) ≤ sup E ∈B 1 µ f i h t ( z ) ( E ) − µ f h t ( z ) ( E ) . By Definition 6, d h t ( z ) ( µ f , µ f i ) = ∞ X k =1 2 − k sup E ∈B k µ f h t ( z ) ( E ) − µ f i h t ( z ) ( E ) . In particular, 1 2 sup E ∈B 1 µ f h t ( z ) ( E ) − µ f i h t ( z ) ( E ) ≤ d h t ( z ) ( µ f , µ f i ) , so sup E ∈B 1 µ f h t ( z ) ( E ) − µ f i h t ( z ) ( E ) ≤ 2 d h t ( z ) ( µ f , µ f i ) ≤ η . Hence q t i ( · | h t ( z )) − f − i ( h t ( z )) TV ≤ η f or all t ≥ T i ( z , η / 2) . Since η > 0 w as arbitrary , this pro v es the claim. Pr oof of Theor em 5.8. Fix ε > 0 and set ξ := ε/ 3 . For pla yer i , Assumption 3 implies, b y Lemma 4.2, that there is a full-measure ev ent on which µ t i ( f − i | h t ( z )) − → 1 . Since f − i ∈ S − i b y menu grain of tr uth, on that ev ent we also ha v e D t i ( h t ( z )) = 1 − X g − i ∈S − i µ t i ( g − i | h t ( z )) 2 − → 0 . 65 Theref ore there exis ts T br i ( z ) < ∞ such that f or all t ≥ T br i ( z ) , D t i ( h t ( z )) ≤ ξ . Because pla y er i uses my opic PS-BR, w e ha v e f i ( h t ( z )) = α mPS i,t ( · | h t ( z )) . Appl ying Lemma 5.6, it f ollow s that f or all t ≥ T br i ( z ) , f i ( h t ( z )) ∈ br ξ i q t i ( · | h t ( z )) . Ne xt, wr ite p t ( g − i ) = µ t i ( g − i | h t ( z )) . At history h t ( z ) , q t i ( · | h t ( z )) = X g − i ∈S − i p t ( g − i ) g − i ( h t ( z ))( · ) . For an y B ⊆ A − i , q t i ( B | h t ( z )) − f − i ( h t ( z ))( B ) = X g − i ∈S − i p t ( g − i ) g − i ( h t ( z ))( B ) − f − i ( h t ( z ))( B ) = X g − i = f − i p t ( g − i ) g − i ( h t ( z ))( B ) − f − i ( h t ( z ))( B ) ≤ X g − i = f − i p t ( g − i ) = 1 − µ t i ( f − i | h t ( z )) . T aking the supremum o v er B ⊆ A − i giv es q t i ( · | h t ( z )) − f − i ( h t ( z )) TV ≤ 1 − µ t i ( f − i | h t ( z )) − → 0 . Hence there exis ts T pred i ( z ) < ∞ such that f or all t ≥ T pred i ( z ) , q t i ( · | h t ( z )) − f − i ( h t ( z )) TV ≤ ξ . No w fix t ≥ max { T br i ( z ) , T pred i ( z ) } . W e already kno w that f i ( h t ( z )) ∈ br ξ i q t i ( · | h t ( z )) , 66 and that q t i ( · | h t ( z )) − f − i ( h t ( z )) TV ≤ ξ . Appl ying Lemma 5.5 with p = f − i ( h t ( z )) and q = q t i ( · | h t ( z )) yields f i ( h t ( z )) ∈ br ξ +2 ξ i f − i ( h t ( z )) = br ε i f − i ( h t ( z )) . Intersect the full-measure ev ents abo v e o v er all play ers i ∈ I . Since I is finite, on that intersection we ma y define T ( z ) := max i ∈ I max T br i ( z ) , T pred i ( z ) < ∞ . Then f or all t ≥ T ( z ) and all pla y ers i , f i ( h t ( z )) ∈ br ε i f − i ( h t ( z )) . By Definition 10, this means that f ( h t ( z )) is a stag e ε -Nash equilibrium f or all t ≥ T ( z ) . Pr oof of Lemma 5.9. Fix play er i and let f i = ( f i , f i − i ) be the supporting profile from Definition 9. Fix a realized path z in the full-measure ev ent from Definition 9. 
By definition of q t i and the representative choice (4), q t i ( · | h t ( z )) = f i,t − i ( h t ( z )) = f i − i ( h t ( z )) . For each t , define the one-step cy linder e vent E t ( z ) := { y ∈ H ∞ : y 1 − i = a ⋆ − i ( h t ( z )) } ∈ B 1 . Because the tr ue opponents’ ne xt action at histor y h t ( z ) is pure, f − i ( h t ( z )) = δ a ⋆ − i ( h t ( z )) , so µ f h t ( z ) ( E t ( z )) = 1 . Also, b y the on-path identification abo v e, µ f i h t ( z ) ( E t ( z )) = f i − i ( h t ( z )) a ⋆ − i ( h t ( z )) = q t i a ⋆ − i ( h t ( z )) | h t ( z ) . Hence 1 − q t i a ⋆ − i ( h t ( z )) | h t ( z ) = µ f h t ( z ) ( E t ( z )) − µ f i h t ( z ) ( E t ( z )) 67 ≤ sup E ∈B 1 µ f h t ( z ) ( E ) − µ f i h t ( z ) ( E ) . As in the proof of Lemma 5.7, sup E ∈B 1 µ f h t ( z ) ( E ) − µ f i h t ( z ) ( E ) ≤ 2 d h t ( z ) ( µ f , µ f i ) . Because pla y er i lear ns to predict the path of play , d h t ( z ) ( µ f , µ f i ) − → 0 . Theref ore q t i a ⋆ − i ( h t ( z )) | h t ( z ) − → 1 . It f ollow s immediately that 1 − max a − i ∈ A − i q t i ( a − i | h t ( z )) ≤ 1 − q t i a ⋆ − i ( h t ( z )) | h t ( z ) − → 0 , which pro ves asymptotic purity . Finall y , because q t i a ⋆ − i ( h t ( z )) | h t ( z ) − → 1 , there e xists T i ( z ) < ∞ such that f or all t ≥ T i ( z ) , q t i a ⋆ − i ( h t ( z )) | h t ( z ) > 1 2 . For such t , the action a ⋆ − i ( h t ( z )) is the uniq ue maximizer of q t i ( · | h t ( z )) , because all other probabilities sum to 1 − q t i a ⋆ − i ( h t ( z )) | h t ( z ) < 1 2 . Hence the deter ministic MAP selector must satisfy ˆ a t − i ( h t ( z )) = a ⋆ − i ( h t ( z )) f or all t ≥ T i ( z ) . This pro ves the claim. Pr oof of Theor em 5.10. Because ev er y play er j ∈ I uses deter ministic MAP -SCo T , f or ev er y histor y h ∈ H w e ha v e f j ( h ) = δ a ⋆ j ( h ) f or some a ⋆ j ( h ) ∈ A j . 68 Hence f or ev er y play er i and e v er y histor y h , f − i ( h ) = δ a ⋆ − i ( h ) f or some a ⋆ − i ( h ) ∈ A − i . For each play er i , apply Lemma 5.9. There is a full-measure ev ent on which there e xists T i ( z ) < ∞ suc h that f or all t ≥ T i ( z ) , ˆ a t − i ( h t ( z )) = a ⋆ − i ( h t ( z )) . Because the play er set I is finite, the intersection of these full-measure ev ents o v er all play ers still has measure one. Fix a realized path z in that intersection. For any play er i and any t ≥ T i ( z ) , Definition 12 gives f i ( h t ( z )) = δ b i (ˆ a t − i ( h t ( z ))) = δ b i ( a ⋆ − i ( h t ( z ))) . By definition of the pure best-response selector b i , b i ( a − i ) ∈ arg max a i ∈ A i u i ( a i , a − i ) f or ev ery a − i ∈ A − i . Theref ore δ b i ( a ⋆ − i ( h t ( z ))) ∈ br i δ a ⋆ − i ( h t ( z )) = br i f − i ( h t ( z )) . So f or ev er y play er i and all t ≥ T i ( z ) , f i ( h t ( z )) ∈ br i f − i ( h t ( z )) . Define T ( z ) := max i ∈ I T i ( z ) < ∞ . Then f or all t ≥ T ( z ) and ev er y play er i , f i ( h t ( z )) ∈ br i f − i ( h t ( z )) . By Definition 10, this means that f ( h t ( z )) is a stag e Nash equilibrium f or all t ≥ T ( z ) . Pr oof of Cor ollar y 5.11. By Lemma 5.1, Assumption 2 implies that e very play er lear ns to predict the path of pla y under f in the sense of Definition 9. Theorem 5.10 theref ore applies directly . 69 C Bounded-memory strategies and finite-state reduction Man y practical agent policies (including menu-based planners) depend only on a bounded windo w of recent interaction. Follo wing the bounded-recall restriction in Norman (2022), we f or malize this as a bounded- memor y condition. 
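As a concrete illustration of this bounded-memory condition, before the formal definitions, here is a minimal sketch (ours, not the paper's implementation). It shows a strategy that reads only the last $\kappa$ joint actions together with the induced finite-state update; the two-action repeated game and the tit-for-tat rule are illustrative assumptions, and the notation $\mathrm{suffix}_\kappa$ and $T_\kappa$ is defined formally in the remainder of this appendix.

# Minimal sketch (illustrative, not the paper's implementation) of a bounded-memory
# strategy: the agent's next move depends only on the last KAPPA joint actions, so
# the truncated history acts as a finite Markov state.
KAPPA = 1  # remember only the most recent joint action

def suffix_kappa(history, kappa=KAPPA):
    """Last min(kappa, len(history)) joint actions of the history (the memory state)."""
    return tuple(history[-kappa:])

def update_state(state, joint_action, kappa=KAPPA):
    """State update T_kappa(s, a): append the new joint action, keep the last kappa."""
    return (state + (joint_action,))[-kappa:]

def tit_for_tat(state):
    """A memory-1 strategy: cooperate first, then copy the opponent's previous move."""
    if not state:
        return "C"
    _own_last, opp_last = state[-1]  # joint actions stored as (own move, opponent move)
    return opp_last

# Two histories with the same kappa-suffix yield the same next action, which is the
# bounded-memory property formalized below; the memory-state process is then Markov.
history, state = [], suffix_kappa([])
for opp_move in ["C", "D", "D", "C"]:
    my_move = tit_for_tat(state)
    joint = (my_move, opp_move)
    history.append(joint)
    state = update_state(state, joint)
    print(f"played {my_move} vs {opp_move}; memory state -> {state}")

Lemma C.1 below states that under such strategies the per-period action distribution depends on the history only through this state, and Lemma C.3 exploits the resulting finite-state discounted MDP when computing best responses.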
For a history h = ( a 1 , . . . , a t − 1 ) ∈ H let | h | := t − 1 denote its length. F or κ ∈ N , define suffix κ ( h ) := ( a t − min { κ,t − 1 } , . . . , a t − 1 ) ∈ κ [ m =0 A m , i.e., the last min { κ, | h |} joint actions of h (with suffix κ ( ∅ ) = ∅ ). Definition 16 ( κ -memory (bounded-recall) strategy) . A strategy f i : H → ∆( A i ) has memor y at most κ if f or all histories h, h ′ ∈ H , suffix κ ( h ) = suffix κ ( h ′ ) = ⇒ f i ( h ) = f i ( h ′ ) . Let F κ i ⊆ F i denote the set of κ -memory strategies for pla yer i , and wr ite F κ := Q i ∈ I F κ i . Let S κ := κ [ m =0 A m be the finite set of action-suffixes of length at most κ . Define the deterministic state update map T κ : S κ × A → S κ b y T κ ( s, a ) := suffix κ (( s, a )) , i.e., append the new joint action a to the suffix s and keep the last κ entries. F or any pla y path z = ( a 1 , a 2 , . . . ) ∈ H ∞ , define the induced memory state at time t : s t ( z ) := suffix κ ( h t ( z )) ∈ S κ . Lemma C.1 (Finite-state Marko v property under bounded memor y) . If f ∈ F κ , then for every t ≥ 1 and ev er y hist or y h t with s = suffix κ ( h t ) , the next-period action distribution depends on h t only thr ough s : µ f ( a t = a | h t ) = Y i ∈ I f i ( s )( a i ) . Mor eov er , the induced state process satisfies s t +1 = T κ ( s t , a t ) almos t sur ely, so ( s t ) t ≥ 1 is a time- homog eneous Marko v chain on S κ . 70 Pr oof. Fix t and history h t . By Definition 2, µ f ( a t = a | h t ) = Y i ∈ I f i ( h t )( a i ) . If f ∈ F κ , then f i ( h t ) = f i (suffix κ ( h t )) = f i ( s ) for each i , giving the display ed equality . The state update is deter ministic by constr uction of T κ : s t +1 = suffix κ ( h t +1 ) = suffix κ (( h t , a t )) = T κ (suffix κ ( h t ) , a t ) = T κ ( s t , a t ) . Thus ( s t ) is Marko v with kernel induced by the conditional law of a t giv en s t . Lemma C.2 (Continuation distributions depend only on the memory state) . Let g ∈ F κ and let h, h ′ ∈ H satisfy suffix κ ( h ) = suffix κ ( h ′ ) . Then the continuation play-path distributions coincide: µ g h = µ g h ′ . Pr oof. By Lemma C.1, the conditional distribution of the next action profile and all future ev olution under g depends on the past only through the cur rent memor y state s = suffix κ ( · ) . Since h and h ′ induce the same state, the induced kernels f or ( a t , a t +1 , . . . ) are identical from either star ting history . Theref ore the induced continuation measures coincide. C.1 Best responses to bounded-memory opponents are bounded-memory A ke y benefit of bounded-memor y opponents is that each play er faces a finite-state discounted MDP in the continuation game. In particular, the best-response search in BR ε i ( g − i | h t ) can be restricted without loss to bounded-memory policies. Lemma C.3 (Marko vian best responses to κ -memor y opponents) . F ix play er i , a histor y h t , and an opponents’ continuation pr ofile g − i ∈ F κ − i . Then ther e exists a bes t r esponse σ ⋆ i ∈ BR i ( g − i | h t ) that is stationary Marko v with r espect to the memor y state. That is, ther e exists a map π i : S κ → ∆( A i ) suc h that f or ev er y continuation histor y ¯ h ⪰ h t , σ ⋆ i ( ¯ h ) = π i (suffix κ ( ¯ h )) . Consequently , f or ev er y ε ≥ 0 , sup σ i ∈F i ( h t ) V i ( σ i | h t ; g − i ) = sup σ i ∈F κ i ( h t ) V i ( σ i | h t ; g − i ) , and BR i ( g − i | h t ) ∩ F κ i ( h t ) = ∅ . Pr oof. Let s 0 := suffix κ ( h t ) ∈ S κ . F ix g − i ∈ F κ − i . Define a controlled Mark o v process on S κ as follo ws. 
In state s , the play er chooses a i ∈ A i , the opponents’ joint action is drawn as a − i ∼ g − i ( s ) ∈ ∆( A − i ) , the stag e pay off is u i ( a i , a − i ) , and the ne xt state is s ′ = T κ ( s, ( a i , a − i )) . 71 For an y bounded function v : S κ → R , define the Bellman operator T by ( T v )( s ) := max α ∈ ∆( A i ) E a i ∼ α a − i ∼ g − i ( s ) h (1 − λ i ) u i ( a i , a − i ) + λ i v T κ ( s, ( a i , a − i )) i . Because λ i ∈ (0 , 1) , T is a contraction in ∥ · ∥ ∞ : f or an y v , w and an y s , | ( T v )( s ) − ( T w )( s ) | ≤ max α E λ i | v ( s ′ ) − w ( s ′ ) | ≤ λ i ∥ v − w ∥ ∞ . Hence T has a unique fix ed point V ⋆ : S κ → R . For each s , the maximization ov er α ∈ ∆( A i ) attains its maximum because ∆( A i ) is compact and the objectiv e is continuous and linear in α . Fix a maximizer π i ( s ) ∈ ∆( A i ) f or each s and define the associated policy ev aluation operator ( T π i v )( s ) := E a i ∼ π i ( s ) a − i ∼ g − i ( s ) h (1 − λ i ) u i ( a i , a − i ) + λ i v T κ ( s, ( a i , a − i )) i . Then ( T π i V ⋆ )( s ) = ( T V ⋆ )( s ) = V ⋆ ( s ) f or all s , so V ⋆ is a fix ed point of T π i . Since T π i is also a λ i -contraction, its fixed point is unique; denote it by V π i . W e conclude V π i = V ⋆ . No w define σ ⋆ i to be the stationary Marko v continuation strategy induced by π i , i.e. σ ⋆ i ( ¯ h ) = π i (suffix κ ( ¯ h )) f or all ¯ h ⪰ h t . By construction, the induced continuation value from h t is V i ( σ ⋆ i | h t ; g − i ) = V ⋆ ( s 0 ) . It remains to show optimality agains t all continuation strategies, including those with unbounded memor y . Let σ i be an y continuation strategy and define its statewise v alue envelope W σ i ( s ) := sup n V i ( σ i | ¯ h ; g − i ) : ¯ h ⪰ h t , suffix κ ( ¯ h ) = s o . Fix any s and ϵ > 0 , and choose ¯ h with suffix κ ( ¯ h ) = s and V i ( σ i | ¯ h ; g − i ) ≥ W σ i ( s ) − ϵ . Let α := σ i ( ¯ h ) ∈ ∆( A i ) be the first-s tep mixed action. Conditioning on the first joint action ( a i , a − i ) and using that the ne xt state is s ′ = T κ ( s, ( a i , a − i )) , w e ha v e V i ( σ i | ¯ h ; g − i ) = E h (1 − λ i ) u i ( a i , a − i ) + λ i V i ( σ i | ( ¯ h, ( a i , a − i )); g − i ) i ≤ E h (1 − λ i ) u i ( a i , a − i ) + λ i W σ i ( s ′ ) i . Theref ore, W σ i ( s ) − ϵ ≤ E a i ∼ α a − i ∼ g − i ( s ) h (1 − λ i ) u i ( a i , a − i ) + λ i W σ i ( T κ ( s, ( a i , a − i ))) i ≤ ( T W σ i )( s ) . Letting ϵ ↓ 0 gives W σ i ≤ T W σ i pointwise. By monotonicity of T and contraction, iterating yields W σ i ≤ T n W σ i f or all n , and T n W σ i → V ⋆ unif ormly as n → ∞ . Hence W σ i ( s ) ≤ V ⋆ ( s ) f or all s , and in 72 par ticular V i ( σ i | h t ; g − i ) ≤ W σ i ( s 0 ) ≤ V ⋆ ( s 0 ) = V i ( σ ⋆ i | h t ; g − i ) . Thus σ ⋆ i is a best response. The final display ed equality of suprema follo ws because an optimal policy e xists within F κ i ( h t ) . C.2 A chec kable KL-separation condition under bounded memory Assumption 3-(3) (on-path KL separation) is stated f or general histor y -dependent strategies. U nder bounded memory , it reduces to a state-frequency condition. Lemma C.4 (S tate-frequency decomposition of on-path KL av erages) . F ix player i , κ ∈ N , and f − i , g − i ∈ F κ − i . F or a realized path z , define s t ( z ) = suffix κ ( h t ( z )) and empirical stat e fr equencies ˆ π z T ( s ) := 1 T T X t =1 1 { s t ( z ) = s } , s ∈ S κ . Then f or ev er y T and every z , 1 T T X t =1 D KL f − i ( h t ( z )) g − i ( h t ( z )) = X s ∈ S κ ˆ π z T ( s ) D KL f − i ( s ) g − i ( s ) . 
In particular , for any fixed stat e s , lim inf T →∞ 1 T T X t =1 D KL f − i ( h t ( z )) g − i ( h t ( z )) ≥ lim inf T →∞ ˆ π z T ( s ) · D KL f − i ( s ) g − i ( s ) . Pr oof. If f − i , g − i ∈ F κ − i , then for each t we hav e f − i ( h t ( z )) = f − i ( s t ( z )) and g − i ( h t ( z )) = g − i ( s t ( z )) by Definition 16. Theref ore, 1 T T X t =1 D KL f − i ( h t ( z )) g − i ( h t ( z )) = 1 T T X t =1 D KL f − i ( s t ( z )) g − i ( s t ( z )) . Grouping the sum b y the v alue of s t ( z ) yields the stated decomposition. The ineq uality f ollo ws by lo wer bounding the sum by a single state ’ s contribution and taking lim inf . Corollary C.5 (A sufficient condition f or Assumption 3(3)) . F ix player i and suppose S − i ⊆ F κ − i . F ix g − i ∈ S − i \ { f − i } and a state s ∈ S κ suc h that D KL ( f − i ( s ) ∥ g − i ( s )) > 0 . If µ f -a.s. in z , lim inf T →∞ ˆ π z T ( s ) ≥ ρ i ( g − i ) > 0 , then the on-pat h KL separation condition in Assump tion 3(3) holds for this g − i with κ i ( g − i ) = ρ i ( g − i ) · D KL ( f − i ( s ) ∥ g − i ( s )) . 73 Pr oof. Immediate from Lemma C.4. All statements in Sections 4 – 5 are f or mulated on the full history space H and theref ore apply verbatim when the realized profile f (and/or the menu strategies in Assumption 3) lie in F κ . The main additions abov e are: (i) best responses to κ -memory opponents can be taken to be stationary Marko v (Lemma C.3), and (ii) Assumption 3(3) can be v er ified by state-frequency separation (Lemma C.4 and Corollar y C.5). Once Assumption 3 is v erified (e.g. via Corollar y C.5), the proofs of Lemma 4.2, Proposition 4.3, and Corollary 5.4 are unchang ed. 74 D Implementation details of the strategy-le v el PS-BR planner This appendix details the implementation used in our e xperiments. A t each round, an ag ent samples a latent opponent strategy from its inf erence based on the pre vious history , e valuates candidate self-strategies by rollout, and play s the cur rent action induced by the best rollout-value strategy . D.1 Opponent strategy sampling Fix play er i at round t with local history h t i = (( a 1 i , a 1 − i ) , . . . , ( a t − 1 i , a t − 1 − i )) . For opponent-strategy inf erence, the implementation rewrites this to the opponent-vie w history ˜ h t − i = (( a 1 − i , a 1 i ) , . . . , ( a t − 1 − i , a t − 1 i )) , so eac h tuple is ordered as (opponent action, your action) . The opponent s trategy inf erence is perf ormed once per r eal decision round (with configured label-sampling temperature) and then held fix ed across all K rollout samples used to e valuate candidate self-strategies at that round. Inf erence suppor ts two modes: • llm-label (default): construct an in-conte xt prompt containing the game rules, obser v ed history , and the allo w ed strategy labels (with shor t descr iptions), then ask the model to output exactly one label . P arsing is label-constrained; if parsing fails repeatedly , a deterministic label fallback is used. • likelihood : inf er from a hand-coded likelihood ov er the menu (descr ibed belo w), with no model call. llm-label mode details. In llm-label mode, if the model call itself fails, the implementation falls back to likelihood mode f or that decision round. The template used in code is: {rules_text} Observed action history tuple format: (opponent action, your action). Infer the opponent strategy from the FIRST action in each tuple. Round 1: {opp_action_1}, {self_action_1} Round 2: {opp_action_2}, {self_action_2} ... 
You are inferring the opponent strategy in repeated {game_name}. Observed rounds so far: {observed_rounds}. Objective: sample one opponent strategy label according to your posterior belief over allowed labels. Estimate that posterior using ALL observed rounds (do not ignore older rounds), and focus on recent patterns. The opponent may change strategy over time; if you detect a shift, prioritize the most recent consistent behavior while still accounting for earlier rounds. Internally assign a compatibility score from 0 to 100 to every 75 allowed label, convert them into relative posterior weights, and sample exactly one final label from those weights. Output rule: do NOT output scores, reasoning, or ranking. Respond with exactly one label only. **Output only the label.** Allowed labels: - {label_1}: {description_1} - {label_2}: {description_2} ... where game_name is the activ e repeated-g ame name (e.g., BoS, PD, Promo, Samaritan ’ s dilemma, or Lemons), and observed_rounds =t-1. When collusiv e-pr ior guidance is enabled ( –collusive-mode ), the prompt appends a strong-prior line. In our code this pr ior is mad0 f or Promo opponent 1 and mad1 f or Promo opponent 2. Likelihood -mode details. T o score strategy s , the implementation e valuates history under the opponent ’ s perspectiv e ˜ h t − i = (( a 1 − i , a 1 i ) , . . . , ( a t − 1 − i , a t − 1 i )) : log L t ( s ) = t − 1 X u =1 log 1 { a u − i = J } p u s + 1 { a u − i = F } (1 − p u s ) , with clipping to [10 − 6 , 1 − 10 − 6 ] f or numerical stability . Given temperature τ > 0 (implemented as τ = max { sample_temperature , 10 − 5 } ), w eights are w t ( s ) ∝ exp log L t ( s ) τ , and one opponent strategy is sampled from this categor ical distribution. D.2 Rollout v alue and strategy selection Giv en a sampled opponent strategy s − i , for ev er y candidate self-strategy s i ∈ M g , the planner rolls out from round t to ¯ t , where ¯ t = min { T , t + H − 1 } , H > 0 , T , H = 0 , T is the game hor izon, and H is the planning horizon. For rollout sample m ∈ { 1 , . . . , K } , at each simulated round r , actions are sampled from the fix ed opponent strategy s − i and the cur rently ev aluated candidate s i : ˆ a r,m i ∼ Bernoulli p r s i , ˆ a r,m − i ∼ Bernoulli p r s − i , 76 Algorithm 1 Strategy -lev el PS-BR loop for tw o-play er games Req uire: g ame g , total rounds T , menu M g , samples K , hor izon H , discount γ , temperature τ , inf erence mode ∈ { llm-label , likelihood } 1: Initialize h 1 ← ∅ , x 1 1 ← ( h 1 , ∅ ) , x 1 2 ← ( h 1 , ∅ ) , C 1 ← 0 , and C 2 ← 0 2: f or t = 1 , . . . , T do 3: f or i ∈ { 1 , 2 } do 4: Let x t i = ( h t , r 1: t − 1 i ) be play er i ’ s current local history 5: Construct opponent-vie w history ˜ h t − i b y sw apping tuple order in the public history h t 6: Inf er one strategy label s − i ∈ M g from rules, history ˜ h t − i 7: f or all s i ∈ M g do 8: f or k = 1 , . . . 
, K do 9: V ( k ) i ( s i | s − i ) ← RolloutV alue( g , i, s i , s − i , x t i , t, T , H , γ ) 10: ¯ V i ( s i | s − i ) ← 1 K P K k =1 V ( k ) i ( s i | s − i ) 11: s ⋆ i ← arg max s i ∈ M g ¯ V i ( s i | s − i ) ▷ deterministic tie-break 12: Sample real action a t i from strategy s ⋆ i at history x t i 13: Sample realized rew ards ( r t 1 , r t 2 ) from the environment pa y off la w at ( a t 1 , a t 2 ) 14: C 1 ← C 1 + r t 1 and C 2 ← C 2 + r t 2 15: Set h t +1 ← ( h t , ( a t 1 , a t 2 )) 16: Set x t +1 1 ← ( h t +1 , r 1: t 1 ) and x t +1 2 ← ( h t +1 , r 1: t 2 ) where p r s i and p r s − i are the round- r probabilities of action J induced b y s i and s − i under the simulated history prefix generated so far . The rollout v alue f or candidate s i agains t sampled opponent strategy s − i is V ( m ) i ( s i | s − i ) = ¯ t X r = t γ r − t u i (ˆ a r,m i , ˆ a r,m − i ) , with discount γ . The estimated v alue of strategy s i is ¯ V i ( s i | s − i ) = 1 K K X m =1 V ( m ) i ( s i | s − i ) , and the chosen strategy is s ⋆ i ∈ arg max s i ¯ V i ( s i | s − i ) , with deter minis tic hash-based tie-breaking when needed. The e xecuted action at real round t is then sampled from s ⋆ i at the cur rent history . For Exper iment 3, the environment pay off law in Algorithm 1 is the kno wn Gaussian noise famil y centered at the tr ue mean matrix. On the pla y er’ s o wn side, pla y er i additionall y samples ˜ m i ∼ π t i ( · | x t i ) , rollout v alues are computed under ˜ m i in place of the tr ue u i , and pla y er i ’ s local inf or mation history s tores only 77 ( h t , r 1: t − 1 i ) ; in par ticular , the update step abov e nev er re veals or conditions on r 1: t − 1 − i . 78 E Social chain-of-thought promp ting (SCo T) This appendix discusses that the social c hain-of-thought (SCo T) prompting inter v ention of Akata et al. (2025) can be view ed as a par ticularl y simple instance PS-BR. E.1 SCo T as a two-stag e “predict-then-act ” operator In Akata et al. (2025), SCo T is implemented b y promp t-c haining in each round of a repeated game: 1. Pr ediction pr ompt (belief elicitation). Giv en the public his tor y h t , the model is asked to predict the opponent ’ s ne xt mo ve (or , more g enerally , to describe what the other play er will do ne xt). 2. Action promp t (best r esponse to the elicited belief). The model is then asked to choose its action given the predicted opponent mov e, typicall y phrased as “giv en y our prediction, what is best for y ou to do now?” This “separate belief repor t, then act ” structure forces an e xplicit theor y -of-mind step bef ore action selection, and empirically impro ves coordination in some repeated games. E.2 Mapping SCo T as a special case of PSBR Fix agent i at history h t . Let A − i denote the opponents ’ joint action space, and define the agent ’ s posterior pr edictive o v er opponents ’ ne xt action as q t i ( · | h t ) ∈ ∆( A − i ) . In our paper’ s belief language, q t i ( · | h t ) is the one-step marginal induced by the agent ’ s posterior predictive continuation belief f i,t − i | h t . SCo T can then be e xpressed as the f ollowing generic operator: 1. Infr ence: produce ˜ a t − i as an imputation of the missing opponents ’ ne xt action. Operationall y , this is obtained b y querying the model with the prediction prompt. 2. Optimize given the imputation: choose a t i as an (appro ximate) best response to the imputed ˜ a t − i (and the kno wn pa y offs), e.g. a t i ∈ arg max a i ∈ A i u i ( a i , ˜ a t − i ) (m y opic) . 
More g enerally , one may replace u i b y the continuation objectiv e, i.e., choose a t i (or a continuation strategy) that maximizes the discounted value conditional on ˜ a t − i and the induced continuation pla y . T wo special cases are worth separating because they clar ify the relationship to PS-BR. (i) Deterministic SCo T = point estimation. In the implementation studied b y Akata et al. (2025), the model is often r un in a near -deter ministic regime (e.g., decoding choices consistent with temperature ≈ 0 ), so the prediction step beha ves like a point estimate (roughl y “MAP” under the model’ s implicit predictive distribution). In this view , SCo T is an inference-and-optimize heur istic that can still impro ve play b y making 79 the model’ s implicit prediction problem e xplicit. (ii) My opic PS-BR = sam pling-based estimation. If instead the prediction prompt is decoded stochas- ticall y (e.g., sampling at nonzero temperature), then ˜ a t − i becomes a draw from the model’ s own predictiv e distribution: ˜ a t − i ∼ q t i ( · | h t ) . F Prom pts F .1 B ase promp ts In B ase , each play er’ s round- t prompt is: rules text + compact history + “Y ou are cur rentl y pla ying round t ” + action query . The compact history prefix used in code is: Observed action history (your action, opponent action): Round 1: , ... Round t-1: , R ound-lev el action query templates (Base). • BoS: Q: Which Option do you choose, J or F? A: • PD (order randomized each round): Q: Which action do you choose, J or F? A: • Harmony : Q: Which action do you choose, C or D? A: • Promo: Q: Which action do you choose, R, P, or Z? A: • Samaritan (Helper prompt): Q: Which action do you choose, H or N? A: 80 • Samaritan (Recipient prompt): Q: Which action do you choose, W or S? A: • Lemons (Seller prompt): Q: Which action do you choose, HQ or LQ? A: • Lemons (Buy er prompt): Q: Which action do you choose, B or D? A: Bef ore the final “ A:” token, code injects a strategy-conte xt block (same helper used in B ase and SCoT ): In repeated , a strategy maps prior history to a player’s next action (possibly probabilistically). Allowed strategies: - : - ... Role mapping in this prompt: - Player A is the other player. - Player B is you. Observed rounds so far: . Context: full history prefix up to round . Strongly expect Player A to play with strategy ’’. [if available] Allowed action tokens: . [if available] Output rule: do NOT output scores, reasoning, or ranking. Respond with exactly one action only. F .2 SCoT promp ts SCoT uses tw o prompts per play er per round. Stage 1 (prediction prompt). The prediction quer ies are: • BoS: Q: Which action do you predict the other player will choose, J or F? A: • PD (order randomized each round): Q: Which action do you predict the other player will choose, J or F? A: 81 • Harmony : Q: Which action do you predict the other player will choose, C or D? A: • Promo: Q: Which action do you predict the other player will choose, R, P, or Z? A: • Samaritan (Helper predicts R ecipient): Q: Which action do you predict the other player will choose, W or S? A: • Samaritan (Recipient predicts Helper): Q: Which action do you predict the other player will choose, action H or action N? A: • Lemons (Seller predicts Buy er): Q: Which Option do you predict the other player will choose, Option B or Option D? A: • Lemons (Buy er predicts Seller): Q: Which Option do you predict the other player will choose, Option HQ or Option LQ? 
As implemented, the Stage-1 prediction prompt is enriched with the same strategy-context block shown above.

Stage 2 (action prompt conditioned on Stage-1 prediction). After receiving prediction <...>, code uses:
• BoS:
    Q: Given that you think the other player will choose Option <...> in round <...>,
    imagine the outcome for both of your possible actions (Option J and Option F),
    compare which gives you a better result, and then choose.
    Which Option do you think is the best to choose for you in this round, Option J or Option F?
    Output only one letter: J or F.
    A:
• PD (with randomized <...>, <...>):
    Q: Given that you think the other player will choose Option <...> in round <...>,
    imagine the outcome for both of your possible actions (Option <...> and Option <...>),
    compare which gives you a better result, and then choose.
    Which Option do you think is the best to choose for you in this round, Option <...> or Option <...>?
    Output only one letter: J or F.
    A:
• Harmony:
    Q: Given that you think the other player will choose <...> in round <...>,
    imagine the outcome for both of your possible actions (C and D),
    compare which gives you a better result, and then choose.
    Which action do you think is best for you in this round, C or D?
    Output only one action: C or D.
    A:
• Promo:
    Q: Given that you think the other player will choose <...> in round <...>,
    imagine the outcome for your possible actions (R, P, and Z),
    compare which gives you a better result, and then choose.
    Which action do you think is best for you in this round, R, P, or Z?
    Output only one action: R, P, or Z.
    A:
• Samaritan (Helper):
    Q: Given that you think the other player will choose Option <...> in round <...>,
    imagine the outcome for both of your possible actions (Option H and Option N),
    compare which gives you a better result, and then choose.
    Which Option do you think is best to choose for you in this round, Option H or Option N?
    Output only one letter: H or N.
    A:
• Samaritan (Recipient):
    Q: Given that you think the other player will choose Option <...> in round <...>,
    imagine the outcome for both of your possible actions (Option W and Option S),
    compare which gives you a better result, and then choose.
    Which Option do you think is best to choose for you in this round, Option W or Option S?
    Output only one letter: W or S.
    A:
• Lemons (Seller):
    Q: Given that you think the other player will choose Option <...> in round <...>,
    imagine the outcome for both of your possible actions (Option HQ and Option LQ),
    compare which gives you a better result, and then choose.
    Which Option do you think is best to choose for you in this round, Option HQ or Option LQ?
    Output only one letter: HQ or LQ.
    A:
• Lemons (Buyer):
    Q: Given that you think the other player will choose Option <...> in round <...>,
    imagine the outcome for both of your possible actions (Option B and Option D),
    compare which gives you a better result, and then choose.
    Which Option do you think is best to choose for you in this round, Option B or Option D?
    Output only one letter: B or D.
    A:
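As a concrete illustration of the two-stage chaining in F.2, here is a minimal Python sketch for the BoS templates above. The query_llm argument stands in for whatever chat-completion client is used, and the single-character parsing is a simplification of the actual output handling; both are assumptions, not the paper's code.

    def scot_round(query_llm, history_text, round_idx):
        # Stage 1: elicit a prediction of the opponent's next action.
        prediction_prompt = (
            history_text
            + f"\nYou are currently playing round {round_idx}.\n"
            + "Q: Which action do you predict the other player will choose, J or F?\nA:"
        )
        predicted = query_llm(prediction_prompt).strip()[:1]

        # Stage 2: ask for the action, conditioning on the Stage-1 prediction.
        action_prompt = (
            history_text
            + f"\nQ: Given that you think the other player will choose Option {predicted} "
            + f"in round {round_idx}, imagine the outcome for both of your possible actions "
            + "(Option J and Option F), compare which gives you a better result, and then choose. "
            + "Which Option do you think is the best to choose for you in this round, "
            + "Option J or Option F? Output only one letter: J or F.\nA:"
        )
        action = query_llm(action_prompt).strip()[:1]
        return predicted, action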
F.3 PS-BR prompts for known deterministic payoffs
PS-BR does not query the LLM for direct action choice. Actions are produced by rollout-based strategy evaluation after sampling one opponent strategy per round. The prompt-facing LLM call is for opponent strategy-label inference in llm-label mode.

Opponent strategy inference prompt (llm-label). At round $t$, for player $i$, the history is rewritten to the opponent view $\tilde h^t_{-i} = ((a^1_{-i}, a^1_i), \ldots, (a^{t-1}_{-i}, a^{t-1}_i))$, so tuples are (Player A action, Player B action) with:
• Player A = the opponent whose strategy is inferred.
• Player B = the current decision-maker.
The prompt template is:

    You are inferring Player A's strategy (the opponent) in repeated <...>.
    In a repeated-game setting, a strategy is a rule that maps prior history to the
    player's next action (possibly probabilistically).
    Observed rounds so far: <...>.
    Allowed labels:
    - <...>: <...>
    - ...
    Observed action history tuple format: (Player A action, Player B action).
    Player A is the opponent whose strategy label you must infer.
    Player B is you (the decision-maker).
    Context: full history prefix up to round <...>.
    Target: observed Player A action at round <...>.
    Choose the allowed label that makes this observed Player A target most compatible
    with the context.
    At round <...>, use this mapping:
    Context history as (Player A, Player B), rounds <...>:
    round <...>: Player A=<...>, Player B=<...>
    Observed target Player A action at round <...>: <...>
    Strongly expect Player A to play with strategy '<...>'.
    Player A's strategy may have changed over time, so weigh recent rounds more heavily
    than earlier rounds.
    Output rule: do NOT output scores, reasoning, or ranking.
    Respond with exactly one label only. **Output only the label.**

Likelihood mode (no prompt). If --strategy-inference likelihood is used, no LLM prompt is issued for strategy inference; the label is sampled from a hand-coded likelihood over the finite menu.

F.4 PS-BR prompts for unknown stochastic payoffs
Under the theorem-aligned implementation used for Experiment 3, PS-BR under unknown stochastic payoffs still samples both an opponent-strategy hypothesis and a payoff hypothesis at each round before rollout-based strategy evaluation. The opponent-strategy side is handled exactly as in the known deterministic-payoff case. The payoff side is not open-ended JSON inference. Instead, Experiment 3 uses the known-common-noise / unknown-mean construction from Section 6 and Section 7.4.1: player $i$ maintains a posterior over a finite menu $\mathcal{M}_{i,g}$ of candidate mean payoff matrices under the Gaussian noise family with known variance $\sigma^2_g$.

Opponent strategy inference prompt (llm-label). The opponent strategy is inferred from the joint action history, exactly as in the known deterministic-payoff case. The prompt template is identical to the one detailed in the previous subsection.

Finite-menu Gaussian payoff posterior (experiment configuration). At round $t$, player $i$ updates
\[
\pi^t_i(m \mid h^t, r^{1:t-1}_i) \;\propto\; \pi^0_i(m) \prod_{s=1}^{t-1} \phi\bigl(r^s_i;\, m(a^s), \sigma^2_g\bigr), \qquad m \in \mathcal{M}_{i,g},
\]
where $\phi(\cdot\,; \mu, \sigma^2_g)$ is the Gaussian density and $r^s_i \mid a^s \sim \mathcal{N}(m(a^s), \sigma^2_g)$ under candidate mean matrix $m$. The implementation then samples one matrix label $\tilde m_i \sim \pi^t_i$ and evaluates continuation strategies against the induced payoff kernel $q^{\tilde m_i}_i(\cdot \mid a) = \mathcal{N}(\tilde m_i(a), \sigma^2_g)$.

Product structure of the menu. Although the theorem-level menu $\mathcal{M}_{i,g}$ is finite but large, it has product form over joint actions. With a product prior over the offsets $(k_a)_{a \in A}$ and the Gaussian likelihood above, the posterior factorizes by joint action. Operationally, the implementation therefore updates the discrete posterior for each action-specific offset $k_a \in \mathcal{K}$ separately and samples a full mean matrix by drawing one offset for each joint action. This is exactly equivalent to sampling from the full finite menu, without explicitly enumerating all of its elements.
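A minimal Python sketch of this factorized update and sampling follows, assuming each candidate mean for joint action a has the form base_mean(a) + k_a with offsets k_a ranging over a common finite grid; the function names and the base-mean-plus-offset parameterization are illustrative assumptions rather than the paper's code.

    import math

    def offset_posterior(prior, rewards, base_mean, offsets, sigma):
        # Discrete posterior over the offset k for ONE joint action a, given the
        # rewards observed at the rounds where a was played:
        #   pi(k)  propto  prior(k) * prod_s phi(r_s; base_mean + k, sigma^2).
        log_post = {k: math.log(prior[k])
                       + sum(-0.5 * ((r - (base_mean + k)) / sigma) ** 2 for r in rewards)
                    for k in offsets}
        m = max(log_post.values())
        unnorm = {k: math.exp(v - m) for k, v in log_post.items()}
        z = sum(unnorm.values())
        return {k: v / z for k, v in unnorm.items()}

    def sample_mean_matrix(priors, rewards_by_action, base_means, offsets, sigma, rng):
        # Product structure: update and sample the offset for each joint action
        # independently, then assemble one full candidate mean payoff matrix.
        matrix = {}
        for a, base in base_means.items():
            post = offset_posterior(priors[a], rewards_by_action.get(a, []),
                                    base, offsets, sigma)
            ks, ps = zip(*post.items())
            matrix[a] = base + rng.choices(ks, weights=ps, k=1)[0]
        return matrix

Here rng is, e.g., an instance of random.Random; drawing one offset per joint action is equivalent to drawing one element of the full product-form menu.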
Likelihood mode (experiment configuration). In the reported Experiment 3 runs, --payoff-inference likelihood is used. No LLM prompt is issued for payoff inference; the sampled mean-matrix label is drawn from the Gaussian posterior above. Opponent strategy inference is handled either by the llm-label prompt described above or by the corresponding likelihood mode, depending on the strategy-inference setting.

Heuristic prompt mode. An open-ended json payoff-table prompt can still be used as a heuristic variant, but it is not the theorem-aligned implementation analyzed in Section 6 and instantiated in Experiment 3.

G Game-specific strategy menus

Let $a^{t-1}_i$ and $a^{t-1}_{-i}$ denote own and opponent actions at round $t-1$. We consider the following menus; a code sketch of how such menu entries are encoded is given at the end of this appendix.

(1) BoS strategy menu. Here $p^t_s$ denotes the probability of playing J at round $t$.
• insist_j: $p^t_s = 1$ for all $t$.
• insist_f: $p^t_s = 0$ for all $t$.
• wsls_bos: $p^1_s = 0.5$; for $t \ge 2$, if $a^{t-1}_i = a^{t-1}_{-i}$ then repeat $a^{t-1}_i$, else switch from $a^{t-1}_i$.
• mlur: $p^1_s = 0.5$; for $t \ge 2$, if $a^{t-1}_i = a^{t-1}_{-i}$ then repeat $a^{t-1}_i$, else $p^t_s = 0.5$.
• alternate_phase0: $p^t_s = 1$ on odd $t$, and $p^t_s = 0$ on even $t$.
• alternate_phase1: $p^t_s = 0$ on odd $t$, and $p^t_s = 1$ on even $t$.
• noisy_insist_j: $p^t_s = 0.9$ for all $t$.
• noisy_insist_f: $p^t_s = 0.1$ for all $t$.

(2) PD strategy menu. Here $p^t_s$ denotes the probability of playing J at round $t$.
• allc: $p^t_s = 1$ for all $t$.
• alld: $p^t_s = 0$ for all $t$.
• soft_allc: $p^t_s = 0.9$ for all $t$.
• soft_alld: $p^t_s = 0.1$ for all $t$.
• tft: $p^1_s = 1$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ J.
• wsls: $p^1_s = 1$; for $t \ge 2$, if $a^{t-1}_i = a^{t-1}_{-i}$ then repeat $a^{t-1}_i$, else switch from $a^{t-1}_i$.
• soft_grim_trigger: $p^t_s = 0$ if the opponent played F in either of the previous two rounds; otherwise $p^t_s = 1$.
• grim_trigger: $p^t_s = 1$ until the opponent has played F at least once in the past; thereafter $p^t_s = 0$ forever.

(3) Harmony strategy menu. Here $p^t_s$ denotes the probability of playing C at round $t$.
• allc: $p^t_s = 1$ for all $t$.
• alld: $p^t_s = 0$ for all $t$.
• tft: $p^1_s = 1$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ C.
• stft: $p^1_s = 0$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ C.
• generous_tft: $p^1_s = 1$; for $t \ge 2$, if $a^{t-1}_{-i} = $ C then $p^t_s = 1$, else $p^t_s = 0.3$.
• grim_trigger: $p^t_s = 1$ until the opponent has played D at least once in the past; thereafter $p^t_s = 0$ forever.
• wsls_pavlov: $p^1_s = 1$; for $t \ge 2$, if $a^{t-1}_i = a^{t-1}_{-i}$ then repeat $a^{t-1}_i$, else switch from $a^{t-1}_i$.
• random_pc: $p^t_s = 0.5$ for all $t$.

(4) Promo strategy menu (actions: R = regular, P = promotion, Z = punishment/price war).
• allR: play R at every round.
• allP: play P at every round.
• allZ: play Z at every round.
• soft_allR: play R with probability 0.9 and P with probability 0.1.
• soft_allP: play P with probability 0.9 and R with probability 0.1.
• mad0: the cooperative path is odd-round P / even-round R; when a deviation from the prescribed phase path is detected, play Z for 2 rounds, then return to phase-0 alternation.
• mad1: the cooperative path is odd-round R / even-round P; when a deviation from the prescribed phase path is detected, play Z for 2 rounds, then return to phase-1 alternation.
• grim_trigger: follow the phase-0 alternating path until the first deviation, then play Z forever.

(5) Samaritan's dilemma (Helper actions: H = Help, N = No-help; Recipient actions: W = Work, S = Shirk).
Helper strategy menu. Here $p^t_s$ denotes the probability the helper plays H at round $t$.
• always_help: $p^t_s = 1$ for all $t$.
• never_help: $p^t_s = 0$ for all $t$.
• tft_help: $p^1_s = 1$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ W.
• grim_forgive: $p^t_s = 0$ if the recipient played S in either of the previous two rounds; otherwise $p^t_s = 1$.
• grim_nohelp: $p^t_s = 1$ until the recipient has played S at least once in the past; thereafter $p^t_s = 0$ forever.
• wsls_helper: $p^1_s = 1$; for $t \ge 2$, if $a^{t-1}_{-i} = $ W then repeat $a^{t-1}_i$, else switch from $a^{t-1}_i$.
• noisy_help: $p^t_s = 0.9$ for all $t$.
• noisy_nohelp: $p^t_s = 0.1$ for all $t$.
Recipient strategy menu. Here $p^t_s$ denotes the probability the recipient plays W at round $t$.
• always_work: $p^t_s = 1$ for all $t$.
• always_shirk: $p^t_s = 0$ for all $t$.
• work_if_helped: $p^1_s = 0.5$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ H.
• exploit_help: $p^1_s = 0.5$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ N.
• grim_shirk_after_nohelp: $p^t_s = 1$ until the helper has played N at least once in the past; thereafter $p^t_s = 0$ forever.
• forgiving_work: $p^1_s = 1$; for $t \ge 2$, if $a^{t-1}_{-i} = $ H then $p^t_s = 1$, else $p^t_s = 0.3$.
• noisy_work: $p^t_s = 0.9$ for all $t$.
• noisy_shirk: $p^t_s = 0.1$ for all $t$.

(6) Lemons (Seller actions: HQ = High-quality, LQ = Low-quality; Buyer actions: B = Buy, D = Don't buy).
Seller strategy menu. Here $p^t_s$ denotes the probability the seller plays HQ at round $t$.
• always_hq: $p^t_s = 1$ for all $t$.
• always_lq: $p^t_s = 0$ for all $t$.
• hq_if_bought_last: $p^1_s = 0.5$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ B.
• grim_hq_until_boycott: $p^t_s = 1$ until the buyer has played D at least once in the past; thereafter $p^t_s = 0$ forever.
• lq_if_boycott_last: $p^1_s = 0.5$; for $t \ge 2$, $p^t_s = 0$ iff $a^{t-1}_{-i} = $ D.
• grim_forgiving: $p^t_s = 0$ if the buyer played D in either of the previous two rounds; otherwise $p^t_s = 1$.
• noisy_hq: $p^t_s = 0.9$ for all $t$.
• noisy_lq: $p^t_s = 0.1$ for all $t$.
Buyer strategy menu. Here $p^t_s$ denotes the probability the buyer plays B at round $t$.
• always_buy: $p^t_s = 1$ for all $t$.
• never_buy: $p^t_s = 0$ for all $t$.
• soft_always_buy: $p^t_s = 0.9$ for all $t$.
• soft_never_buy: $p^t_s = 0.1$ for all $t$.
• tft_buy: $p^1_s = 0.5$; for $t \ge 2$, $p^t_s = 1$ iff $a^{t-1}_{-i} = $ HQ.
• generous_buy: $p^1_s = 1$; for $t \ge 2$, if $a^{t-1}_{-i} = $ HQ then $p^t_s = 1$, else $p^t_s = 0.3$.
• grim_boycott: $p^t_s = 1$ until the seller has played LQ at least once in the past; thereafter $p^t_s = 0$ forever.
• grim_forgiving: $p^t_s = 0$ if the seller played LQ in either of the previous two rounds; otherwise $p^t_s = 1$.
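To make the menu encoding concrete, here is a minimal Python sketch of two of the PD entries above, assuming histories are stored as lists of (own action, opponent action) tuples. The function names are illustrative, not the paper's code; each function returns the probability $p^t_s$ of the focal action given the history.

    import random

    def tft(history):
        # tft: play J in round 1; thereafter play J iff the opponent played J
        # in the previous round.
        if not history:
            return 1.0
        return 1.0 if history[-1][1] == "J" else 0.0

    def wsls(history):
        # wsls: start with J; repeat own last action if both actions matched
        # last round, otherwise switch to the other action.
        if not history:
            return 1.0
        own, opp = history[-1]
        next_action = own if own == opp else ("F" if own == "J" else "J")
        return 1.0 if next_action == "J" else 0.0

    def sample_action(p_focal, rng, focal="J", other="F"):
        # Draw the round-t action from the menu strategy's probability p_t^s.
        return focal if rng.random() < p_focal else other

    rng = random.Random(0)
    a1 = sample_action(tft([]), rng)  # round-1 action for the tft entry

All remaining menu entries follow the same pattern, differing only in how the history is mapped to the focal-action probability.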
H Promo game

H.1 Promo game (Lal, 1990): alternating promotions with finite punishment
Lal (1990) studies repeated price competition in a market with two identical "national" brands that have loyal consumers and a third "local" brand with little or no loyalty. The local brand disciplines prices in the switching segment, creating a tension for the national brands between (i) extracting rents from loyal consumers via a high "regular" price and (ii) defending the switchers via temporary price cuts.
A key result is that, even when the corresponding one-shot stage game has no pure-strategy Nash equilibrium, an alternating-promotions pattern (only one national brand is on promotion in a given period, and the roles alternate over time) can arise as a pure-strategy Nash equilibrium of the infinite-horizon discounted game, supported by a credible number of punishment periods.

To obtain a compact repeated-game benchmark, we discretize Lal's (1990) richer price-choice problem into three representative regimes per firm:
• Regular (R): charge the high "regular" price.
• Promotion (P): charge the low promotional price.
• Punishment/price war (Z): charge a very low price used only in punishment phases.

The resulting 3 × 3 payoff matrix in Appendix 7 is a reduced-form encoding of the ordinal incentive structure: a unilateral promotion against a regular-price rival yields the highest current-period gain (the "temptation" payoff); simultaneous promotions are less profitable than alternating promotions; and outcomes involving Z are jointly bad, standing in for the "intense competition/price war" phase used to deter deviations.

The canonical nontrivial Nash equilibrium is an alternating path: play (P, R) in odd rounds and (R, P) in even rounds (or vice versa). After any deviation from the prescribed phase, either switch to a punishment phase (e.g., play (Z, Z) for a fixed number of rounds) and then return to the alternating path (as in Abreu, 1988), or revert permanently to a low-payoff punishment regime (grim trigger). For sufficiently patient players, the discounted loss from the punishment phase outweighs the one-shot deviation gain, making the alternating-promotions path incentive compatible.
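To spell out the last condition, the standard one-shot deviation check behind this construction can be written as follows; this is a sketch in generic symbols that are not defined in the paper, where $V^{\text{path}}_i$ and $V^{\text{pun}}_i$ denote player $i$'s discounted continuation values from staying on the alternating path and from entering the punishment phase, respectively, and $a^{\text{path}}$ is the currently prescribed action profile:
\[
\max_{a_i' \in A_i} u_i\bigl(a_i', a^{\text{path}}_{-i}\bigr) \;-\; u_i\bigl(a^{\text{path}}_i, a^{\text{path}}_{-i}\bigr)
\;\le\;
\gamma \bigl( V^{\text{path}}_i - V^{\text{pun}}_i \bigr).
\]
The left-hand side is the one-shot deviation gain; the right-hand side is the discounted continuation loss from triggering punishment. For a long or harsh enough punishment phase and a discount factor $\gamma$ close enough to 1, the right-hand side exceeds the bounded left-hand side, which is exactly the incentive-compatibility requirement stated above; by the one-shot deviation principle, it suffices to check this inequality at every on-path history.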