The PokéAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten¹,∗ Jake Grigsby²,∗ Tersoo Upaa Jr¹ Junik Bae⁶ Seonghun Hong¹⁴ Hyunyoung Jeong¹⁴ Jaeyoon Jung¹⁴ Kun Kerdthaisong¹⁵ Gyungbo Kim⁹ Hyeokgi Kim¹⁴ Yujin Kim⁹ Eunju Kwon⁹ Dongyu Liu⁷ Patrick Mariglia⁸ Sangyeon Park⁹ Benedikt Schink¹² Xianwei Shi⁷ Anthony Sistilli¹¹ Joseph Twin¹³ Arian Urdu¹² Matin Urdu¹² Qiao Wang¹⁰ Ling Wu¹⁰ Wenli Zhang⁷ Kunsheng Zhou⁷ Stephanie Milani³,⁴ Kiran Vodrahalli⁵ Amy Zhang² Fei Fang³ Yuke Zhu² Chi Jin¹

¹Princeton ²UT-Austin ³CMU ⁴NYU ⁵Google DeepMind ⁶Team Heatz ⁷Team PA-Agent ⁸Team FoulPlay ⁹Team 4thLesson ¹⁰Team Q ¹¹Team Anthonys ¹²Team Hamburg ¹³Team Porygon2AI ¹⁴Team Deepest ¹⁵Team August

Abstract

We present the PokéAgent Challenge, a large-scale benchmark for decision-making research built on Pokémon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokéAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokémon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokémon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system that enables modular, reproducible comparisons of harness-based LLM approaches.
Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokémon, with more than 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our own baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokémon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing evaluation suites and positioning Pokémon as an unsolved benchmark that can drive RL and LLM research forward. We transition from the NeurIPS 2025 competition to a living benchmark by releasing a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

∗Contribution statement in Appendix I. ∗Equal contribution. Correspondence: sethkarten@princeton.edu, grigsby@cs.utexas.edu

39th Conference on Neural Information Processing Systems (NeurIPS 2025): Competition Track.

1 Introduction

Partial observability, game-theoretic reasoning, and long-horizon planning are core challenges in sequential decision-making, yet few existing benchmarks stress all three simultaneously under realistic conditions. Standard testbeds tend to isolate one axis—imperfect-information games emphasize equilibrium computation in short episodes, while open-ended environments test exploration but lack adversarial opponents. Pokémon is an environment that combines all three: competitive battles require reasoning under hidden information against a strategic adversary, while single-player campaigns demand thousands of cumulative decisions spanning exploration, resource management, and combat over extended horizons.
With approximately 10^564 possible battle states (see Appendix G), team building across 1,000+ species with diverse movesets and abilities, and a competitive metagame that evolves continuously, Pokémon is more complex and dynamic than most existing benchmarks. In 2025, Pokémon gained significant interest for evaluating frontier AI systems. Claude Plays Pokémon [1] demonstrated extended thinking capabilities over 35,000 actions to complete a small section of the game, Gemini 2.5 Pro completed the entire game of Pokémon Blue in 406 hours [2, 3], and OpenAI's GPT-5 finished the game in 6,470 steps [4]. These demonstrations reinforced Pokémon's suitability as an AI testbed, but the efforts were fragmented—different games (Red, Blue, Crystal, Emerald), different harnesses, and different evaluation criteria made meaningful comparison impossible. Was Gemini 3 Pro's 173-hour completion better than Claude Opus 4.6 reaching Victory Road? Did GPT-5's step count account for the same mechanics? By conflating harness with model capability, it became impossible to attribute success to the agent architecture, the underlying model, or hardcoded assumptions that simplified perception. The importance of standardized evaluation in games AI is well established: the Arcade Learning Environment [5] catalyzed a decade of RL progress [6], while MineRL [7–9] demonstrated the value of shared benchmarks for open-ended environments. Pokémon demands—and now receives—similar standardization.

Pokémon also offers a distinctive form of out-of-distribution evaluation. While extensive Pokémon knowledge exists in pretraining corpora—move types, damage formulas, competitive tier lists—translating that latent knowledge into effective multi-turn sequential decision-making under partial observability is fundamentally different from the recognition and retrieval tasks where data contamination typically inflates performance.
Moreover, the competitive metagame shifts continuously as the player community develops new strategies and the game's governing body rebalances tiers—creating natural distribution shifts that test an agent's ability to adapt rather than memorize.

We present the PokéAgent Challenge, a standardized evaluation framework for Pokémon-playing AI agents. The benchmark features two complementary tracks: Competitive Battling evaluates strategic reasoning under partial observability in two-player competitive Pokémon, while RPG Speedrunning tests long-horizon planning in completing Pokémon Emerald as quickly as possible.

The NeurIPS 2025 PokéAgent Challenge confirmed the benchmark's difficulty and drew strong community engagement: 100+ teams submitted solutions across both tracks and 650+ researchers joined the competition Discord for technical exchange. The competition produced novel methods—including Scripted Policy Distillation for RPG play and iterative offline RL with dynamic data weighting—and served as the first large-scale competitive testbed for approaches such as root-parallelized MCTS in imperfect-information battling [10], while revealing a capability hierarchy: specialist RL and search methods dominated LLM approaches in Competitive Battling, and no raw frontier model achieved non-trivial progress in Speedrunning without a sophisticated harness. These gaps remain far from closed.
Our contributions include: (1) the PokéAgent Challenge, a two-track evaluation framework pairing competitive battling (via Pokémon Showdown) with RPG speedrunning (via Pokémon Emerald), with standardized infrastructure for fair comparison across RL, LLM, and hybrid approaches; (2) the largest publicly available Pokémon battle dataset—comprising 4M human demonstrations and 18M synthetic battles, plus 200K+ curated competitive teams; (3) baselines spanning heuristic bots, RL agents, and harnessed LLM agents, alongside the first open-source multi-agent orchestration system for long-horizon RPG play; (4) empirical validation through the NeurIPS 2025 PokéAgent Challenge (100+ competitors, 100K+ battles, top methods in Appendix E), with results revealing substantial gaps between generalist LLM, specialist RL, and elite human performance, and orthogonality analysis showing that Pokémon battling captures capabilities not predicted by the 49 benchmarks in the BenchPress evaluation matrix [11]; and (5) a living benchmark with a live Battling and Speedrunning leaderboard and self-contained Track 2 evaluation at https://pokeagentchallenge.com.

2 Related Work

Game AI Benchmarks. Traditional benchmarks rapidly saturate, but adversarial games resist this by forcing continuous adaptation. Game AI has driven major advances: superhuman board games [12], imperfect-information poker [13, 14], grandmaster-level StarCraft II [15], and human-level Diplomacy combining language models with strategic reasoning [16]. As Figure 1 shows, RL achieves superhuman performance in fully observable settings, but this margin erodes in stochastic, partially observable environments [17], and LLM agents consistently lag specialist RL and search systems [18]. Hanabi [19] and FightLadder [20] advance imperfect-information and competitive evaluation; NetHack [21] and BALROG [22] benchmark long-horizon reasoning for RL and LLM agents.
However, none combines adversarial reasoning with long-horizon planning at scale, and most rely on symbolic state representations rather than visual perception. Pokémon offers a unique combination: an enormous partially observed state space, a visual RPG requiring pixel-level perception, and a massive active player base generating continuous human data and an evolving competitive metagame. The gaps between AI and humans, and between specialist and generalist AI, remain wide open.

Figure 1: Game Benchmarks. Performance (% of human expert) vs. state-space complexity for RL, LLM, and heuristic agents across zero-sum games (Backgammon, Chess, Go, Poker, Pokémon battling) and open-ended, partially observable games (Dota 2, StarCraft II, Minecraft, Pokémon RPG). Pokémon creates vast partially observable state spaces (see Appendix G). Data from [12–15, 23–32].

NeurIPS Game AI Competitions. Standardized competitions have been important for advancing game AI. Neural MMO [33, 34] benchmarks multi-agent cooperation, Lux AI [35] targets resource management with shifting dynamics, and MineRL [7–9] addresses long-horizon planning in open worlds. While each advances a specific research axis, none combines adversarial partial observability with long-horizon planning at the scale of a living competitive ecosystem.
PokéAgent bridges this gap: its dual-track design jointly evaluates RL and LLM approaches in high-stakes competitive play (battling) and extended sequential decision-making over thousands of steps (speedrunning), providing complementary stress tests that no single existing benchmark covers.

Prior Work on Pokémon AI. Our prior work introduced PokéChamp [36], combining minimax search with LLMs, and Metamon [37], training RL agents on millions of human and self-play battles. On the RPG side, Puffer [38] demonstrated RL-based completion of Pokémon Red, while demonstrations from Claude [1, 39], Gemini [2, 3, 40], and GPT [4] showed both the strengths and limitations of frontier models. However, each effort produced individual systems rather than reusable benchmark infrastructure: none established standardized evaluation, public leaderboards, or a multi-track design for fair cross-paradigm comparison. The PokéAgent Challenge extends these efforts into a unified evaluation framework.

3 Competitive Battling Track

3.1 Battling Environment Design

Pokémon Showdown is an open-source simulator that transforms Pokémon's turn-based battle mechanics into a standalone competitive game with thousands of daily players. Formally, battles are two-player, zero-sum, stochastic games with imperfect information and simultaneous action selection. Their imperfect information primarily stems from team construction: each player drafts a team from a vast design space, and key aspects of the opponent's team remain hidden until revealed through play. On each turn, players select from ~9 actions (Figure 2), with battles lasting ~20–100 turns. Action outcomes are stochastic and can lead to a long tail of rare events that abruptly swing state values.
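The long tail of stochastic outcomes can be made concrete with a small probability calculation. The sketch below is illustrative rather than the exact in-game damage formula: it assumes a uniform 16-value damage roll (85–100% of base damage) and a 1/24 critical-hit rate, and computes the chance that consecutive hits knock out a defender.

```python
from itertools import product

# Illustrative assumptions (not the exact in-game formula): a uniform damage
# multiplier in {0.85, 0.86, ..., 1.00} and a 1/24 critical-hit chance that
# multiplies damage by 1.5.
ROLLS = [r / 100 for r in range(85, 101)]  # 16 equally likely rolls
CRIT_P, CRIT_MULT = 1 / 24, 1.5


def ko_probability(base_damage: float, target_hp: float, turns: int = 2) -> float:
    """Probability that `turns` consecutive hits deal at least `target_hp` total."""
    # Enumerate every (probability, damage) outcome of a single hit.
    outcomes = []
    for roll in ROLLS:
        outcomes.append(((1 - CRIT_P) / len(ROLLS), base_damage * roll))
        outcomes.append((CRIT_P / len(ROLLS), base_damage * roll * CRIT_MULT))
    prob = 0.0
    for combo in product(outcomes, repeat=turns):
        p, total = 1.0, 0.0
        for q, dmg in combo:
            p *= q
            total += dmg
        if total >= target_hp:
            prob += p
    return prob


# A hit averaging ~50% of the target's HP cannot 2HKO on ordinary rolls,
# yet still succeeds occasionally via a critical hit.
p = ko_probability(base_damage=55, target_hp=120)
```

Under these assumptions, a small but nonzero crit-driven KO chance is exactly the kind of rare event that abruptly swings state values.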
The combination of randomness, hidden information, team diversity, long-horizon planning, and evolving rules presents a significant challenge, and evaluating progress is difficult: existing work typically relies on disjoint baselines or anonymous competition on the Pokémon Showdown ranked ladder, where performance metrics are noisy and non-stationary. We address these challenges by releasing standardized baselines and datasets alongside a dedicated leaderboard for AI agents.

3.2 Battling Evaluation Criteria

Battling Track agents are evaluated through direct competition against both community submissions and a diverse suite of state-of-the-art baselines maintained by our team. To avoid interfering with human players, all matches are conducted on a separate, modified Showdown server operated by the PokéAgent Challenge and configured specifically for AI benchmarking.

Skill Rating Metrics. Agents are ranked on a public leaderboard according to several metrics. We report the standard Showdown implementations of Glicko-1 [41] (an Elo variant incorporating uncertainty) and GXE (expected win probability against a randomly sampled opponent). However, these online metrics are designed for a large human player base with evolving skills and sparse matchups. In contrast, our agent pool is comparatively small, matchups are dense, and policies are fixed during evaluation. Our primary metric is based on a Bradley–Terry model [42] with bootstrapped uncertainty, fit over the full history of an agent's battle results subject to a minimum sample size. We refer to this metric as the Full-History Bradley–Terry (FH-BT) rating to distinguish it from Showdown's version of Elo, which is too noisy for our purposes. Appendix B provides a comparison of alternative skill metrics.

Figure 2: Pokémon Battling.

Rulesets.
Pokémon Showdown supports dozens of rulesets ("formats"), but results here will focus on two that stress different AI capabilities: Gen 1 OU and Gen 9 OU. Gen 1 OU features greater effective hidden information and a more compact state space but yields smaller human demonstration datasets than Gen 9 OU. Our infrastructure currently supports three additional formats, with room to expand as performance saturates. Agents can play under two different time constraints: standard rules enforce faster-than-human play for efficient large-sample evaluation, while an "Extended Timer" variant provides nearly unlimited deliberation time for LLMs and test-time reasoning.

3.3 Battling Baselines

The PokéAgent Challenge is co-organized by the teams behind PokéChamp [36] and Metamon [37]. While the Battling Track builds upon these leading LLM and RL approaches, the resources provided here have been heavily improved and standardized for this challenge. For clarity, we introduce these features as a unified framework, with novel improvements detailed in Appendix D.

Demonstrations. Showdown archives public battles spanning a decade of online play, and we organize an anonymized dataset of these files to protect player privacy. However, these "replays" are logged from a spectator's perspective and do not reflect the private information available to each player at decision time. We release more than 4M RL trajectories generated by inferring private information and reconstructing the battle from each player's perspective. The resulting dataset allows for flexible experimentation with alternative observation spaces, action spaces, and reward functions. While human demonstrations are invaluable for bootstrapping policies, competitive performance often requires the scale of self-play.
We release all 18M trajectories used to train our strongest baselines and continue to expand this dataset with battles played on the PokéAgent Challenge server, including 100K community battles from our NeurIPS competition (Section 5).

Sample Teams. The combinatorial space of legal, competitively viable teams creates a substantial generalization challenge, as agents must perform across a vast range of initial conditions. Effective self-play training and evaluation demand diverse, realistic teams that mirror human trends. We release a dataset of 200K+ teams generated by inferring hidden information from human replays, alongside a curated collection of expert-validated teams sourced from community forums.

LLM Baselines. Although Pokémon knowledge appears in pretraining corpora, competitive gameplay is not an explicit optimization target of LLM training, making the application of that knowledge in competitive battles a genuine out-of-distribution test that extends recent LLM evaluations in Chess and Poker [18] to an even more complex domain. We extend PokéChamp [36] into a generalized harness framework for reasoning models, supporting both frontier API models (GPT, Claude, Gemini) and open-source models (Llama, Gemma, Qwen). The framework converts game state to structured text and provides a configurable harness including depth-limited minimax search with LLM-based position evaluation. All LLM baselines use a harness; even small open-source models achieve meaningful performance with this support (Figure 3). Default turn timers (60–90s) proved insufficient for LLM inference; the Extended Timer setting provides nearly unlimited deliberation time for fair evaluation of these methods. See Appendix D for architecture details.

Figure 3: Baseline Performance. (Left) Agents vs. Humans: official GXE ratings on the public Showdown ranked ladder (where known), with 10th-percentile, IQR, and 90th-percentile statistics from the Top 500 leaderboard provided as a frame of reference for experienced human players. (Center) RL vs. RL: GXE on our self-play ladder (selected methods). (Right) LLM vs. LLM: GXE on our self-play ladder (Gen 9 OU). GXE is measured relative only to methods within each plot. We differentiate between prior Metamon RL policies [37] and baselines newly developed for this work (alongside heuristic, PokéAgent-scaffold, and PokéChamp-scaffold LLM agents); PC-Llama3.1-8B represents the original PokéChamp agent [36].

RL Baselines. In competitive domains, specialized systems often set the performance ceiling before general-purpose approaches reach parity. Pokémon provides a venue to study this gap, and we include strong RL baselines trained on large datasets of human demonstrations and self-play battles. We extend Metamon [37] and release checkpoints from 30 agents that span the competitive skill ladder, ranging from compact RNNs to 200M-parameter Transformers. Our RL baselines provide high-quality reference points across a range of human skill levels, allowing researchers to benchmark their progress and explore compute-efficiency tradeoffs on accessible hardware.
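The search component of the LLM harness can be sketched as depth-limited minimax with a pluggable leaf evaluator. This is a minimal stand-in, not the baseline's implementation: the real harness scores positions with an LLM from structured text and must handle simultaneous action selection, both of which are simplified away here (the `evaluate` callback is a hypothetical placeholder for the LLM scorer).

```python
def minimax(state, depth, evaluate, actions, transition, maximizing=True):
    """Depth-limited minimax with a pluggable leaf evaluator.

    `evaluate(state) -> float` stands in for an LLM-based position scorer.
    `actions(state, maximizing)` lists legal moves; `transition` applies one.
    Returns (value, best_action).
    """
    legal = actions(state, maximizing)
    if depth == 0 or not legal:
        return evaluate(state), None  # leaf: defer to the evaluator
    best_action = None
    best = float("-inf") if maximizing else float("inf")
    for a in legal:
        value, _ = minimax(transition(state, a, maximizing),
                           depth - 1, evaluate, actions, transition,
                           not maximizing)
        if (maximizing and value > best) or (not maximizing and value < best):
            best, best_action = value, a
    return best, best_action
```

As a toy usage example, a game where the maximizer adds and the minimizer subtracts from a counter: with actions {1, 2}, a depth-2 search from 0 picks action 2 (value 0), since the minimizer will always subtract 2 in reply.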
Figure 3 visualizes the relative strength of select RL and LLM agents alongside their performance against human players on Pokémon Showdown. Our baselines represent a substantial improvement over prior work [36, 37] and span a broad performance range in both categories, providing researchers with diverse reference points to track progress as they iterate on new techniques. The strongest baselines are competitive with top human players, confirming the benchmark captures the strategic depth of high-level play, though our current upper bound remains less than superhuman.

4 Long-Context Speedrunning Track

Speedrunning provides a natural optimization objective for long-horizon planning: a clear metric (completion time), decomposable milestones for fine-grained credit assignment, and a task that demands the full stack of AI capabilities—visual perception, long-horizon planning, persistent memory, spatial navigation, and strategic combat—simultaneously. We formalize RPG gameplay as an episodic MDP M = (S, A, T, R, γ) where actions are button inputs, transitions are largely deterministic for navigation but stochastic for battles, and the reward is +1 per milestone with γ = 1.

4.1 Speedrunning Environment Design

The evaluation environment runs the game server at a fixed frame rate. Agents receive visual frames and limited state information—party composition (species, levels), status conditions, and HP values—but puzzle states, dynamic obstacles, items, and movesets are not exposed, so that perception remains challenging (see Figure 23 in Appendix F).

Figure 4: Speedrunning Route (Early Game). Milestones from Littleroot Town (1) to Defeating Roxanne (15), with game frames from each waypoint and human-record pace splits (0:57 Littleroot, 2:44 Route 101, 3:00 Starter, 4:02 Oldale Town, 4:28 Rival Battle, 5:50 Birch Lab, 6:48 Route 102, 7:58 Petalburg, 8:25 Dad's Gym, 10:34 Route 104 S, 10:53 Petalburg Woods, 12:25 Route 104 N, 12:35 Rustboro, 12:42 Rustboro Gym, 17:27 Roxanne). The geographic overview (right) maps key locations. Although progression appears linear, the route requires substantial exploration and backtracking—agents must revisit earlier areas, navigate branching paths, and manage nonlinear dependencies between objectives. We provide splits from the human world record as an upper bound.

4.2 Speedrunning Evaluation Criteria

Agents are evaluated on completion percentage (progress through standardized milestones, illustrated in Figure 4) and completion time for agents achieving 100%, with ties broken by action count. An action is each discrete instance where the agent outputs button presses to the emulator. We scoped the initial evaluation to defeating the first gym leader (Roxanne). Even this early segment requires thousands of agent steps and millions of reasoning tokens, with agents maintaining coherent plans across extended context windows that accumulate over hours of real-time play. The task demands the full stack of AI capabilities—perception, memory, planning, navigation, and battle strategy—and requires context compaction to manage the thousands of reasoning steps involved. We scope to the first gym to enable rapid iteration on approaches at reasonable cost; the milestone framework naturally extends to the full game as agent performance saturates.

4.3 Speedrunning Baselines

Human Baselines. We scope evaluation to the first gym to enable participants to reasonably iterate on their approaches. Our top human speedrunner reached the first gym in 18 minutes, while average human players completed it in 1:22:05.

Harness versus Model Capability. A key challenge in evaluating LLM-based game agents is attribution: does performance stem from the underlying model or the surrounding harness (also called scaffold)?
As discussed in Section 1, prior efforts (Claude, Gemini, GPT playing Pokémon) conflated these factors. We disentangle them through a harness × model evaluation framework that analyzes systems along five dimensions—state representation (S), tools (T), memory (M), feedback (F), and fine-tuning (Φ)—so that approaches can be compared on equal footing (Appendix F, Table 2 and Figure 5). Figure 5 compares our harness against common CLI-agent harnesses (Claude Code, Codex CLI, Gemini CLI), revealing that, while coding-agent architectures are impressive out of the box, they nonetheless struggle to maintain coherence over the thousands of sequential decisions required for, and the non-linear exploration characteristic of, RPG play.

PokéAgent Baseline. We release the first open-source multi-agent orchestration system for long-horizon RPG play. The system coordinates MCP tools (A* pathfinding, button inputs, knowledge retrieval) with specialized sub-agents for battle strategy, self-reflection, gym puzzles, and objective verification. A central orchestrator maintains a high-level route plan while dynamically dispatching sub-agents based on game context, with automatic context compaction to manage the thousands of reasoning steps required (see Appendix D.2 for architecture details).

Figure 5: Speedrunning Track Baseline Results ("Harness vs. Model"). Cumulative (a) wall-clock time, (b) actions, (c) tokens, and (d) cost (USD) at each milestone for five frontier models under our harness, and for common CLI-agent harnesses (Claude Code, Gemini CLI, Codex CLI), with mean ± min/max range across runs. Gemini 3 Flash completes the route fastest (∼2:24 mean) but requires more actions than Gemini 3 Pro. Claude Sonnet 4.5 completes all milestones but with the highest variance and 3–4× the cost of the Gemini variants. GPT-5.2 falls between the two families in both time and cost.

We evaluate five frontier models using the same harness (Figure 5): Gemini 3 Flash achieves the fastest mean completion (∼2:24), while Claude Sonnet 4.5 completes all milestones but with high variance (6:25–20:45 across runs). Even the best organizer baseline remains ∼1.8× slower than the average human (1:22:05).

5 NeurIPS 2025 Competition

Our NeurIPS competition provided an opportunity to grow the research community's interest in Pokémon and validate our evaluation protocols. Participants competed for prizes across both our Battling and Speedrunning leaderboards. The competition, which ran from July to December 2025, grew an online community of 650+ members and generated 150+ submissions. Here, we summarize the competition's results and key takeaways; additional rules and details can be found in Appendix A.

5.1 Battling Track Results

Participants in the Battling Track submitted agents to our AI-focused Showdown leaderboard, where they competed against fellow entrants and organizer-hosted baselines. The PokéAgent Challenge offers substantial improvements over prior agents (Figure 3); several of the strongest new baselines were kept private until the competition's conclusion to ensure that reaching the top of the leaderboard would require significant technical contributions. Figure 6 visualizes the final standings.
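The FH-BT rating behind these leaderboard standings (Section 3.2) pairs a Bradley–Terry fit with bootstrap resampling for uncertainty. The sketch below is a minimal stand-in, not our exact implementation: it uses the classic MM (minorization–maximization) update for Bradley–Terry strengths and reports a crude half-range as the uncertainty estimate.

```python
import math
import random
from collections import Counter, defaultdict


def fit_bradley_terry(battles, iters=200):
    """MM-algorithm fit of Bradley-Terry strengths from (winner, loser) pairs."""
    wins = Counter(w for w, _ in battles)
    pair_games = Counter(frozenset((w, l)) for w, l in battles)
    players = {p for battle in battles for p in battle}
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            # Denominator: sum over opponents of n_pq / (s_p + s_q).
            denom = sum(pair_games[frozenset((p, q))] / (strength[p] + strength[q])
                        for q in players if q != p and frozenset((p, q)) in pair_games)
            new[p] = wins[p] / denom if denom else strength[p]
        total = sum(new.values())  # normalize for identifiability
        strength = {p: s * len(players) / total for p, s in new.items()}
    return strength


def bootstrap_ratings(battles, n_boot=100, seed=0):
    """Full-history rating with bootstrapped uncertainty (illustrative)."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resampled = [rng.choice(battles) for _ in battles]
        for p, s in fit_bradley_terry(resampled).items():
            samples[p].append(math.log(max(s, 1e-6)))  # guard zero strengths
    return {p: (sum(v) / len(v), (max(v) - min(v)) / 2)
            for p, v in samples.items()}
```

For example, fitting a history where A beats B 8 of 10 times and B beats C 8 of 10 times recovers the ordering A > B > C in strength, with bootstrap spread reflecting the small sample.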
The top 8 teams in both formats qualified for a head-to-head tournament bracket, where both #1 seeds (PA-Agent in Gen 1 OU, FoulPlay in Gen 9 OU) were eventually victorious. Members of the winning and runner-up teams provide details of their final solutions in Appendix E. Of the 16 qualifying spots, 13 were secured by teams extending our public RL baselines, while the remaining three were won by Porygon2AI (#8 in Gen 9 OU) and FoulPlay [10] (#1 in Gen 9 OU, #8 in Gen 1 OU)—independent RL and search approaches, respectively. Appendix B provides further analysis of the results, including the final tournament and a comparison of alternative rating schemes.

Figure 6: NeurIPS Battling Leaderboard. Organizer and participant agent ratings are directly comparable. The x-axis measures disagreement between our primary rating metric and the linear trend established by GXE (a metric widely used on Showdown).

5.2 Speedrunning Track Results

Of the 22 teams with valid submissions to the Speedrunning Track, 6 achieved 100% completion (all 15 milestones). Figure 7 visualizes the final standings; full methodology descriptions appear in Appendix E. The winner, Heatz, used Scripted Policy Distillation (SPD): an LLM decomposes the task into subgoals and generates scripted policies for each, which are then distilled into neural networks via imitation learning and refined with RL. The resulting agent completed the route in 40:13—nearly twice as fast as the second-place Hamburg PokéRunners, who used a pure RL approach (recurrent PPO with milestone-conditioned rewards). The remaining completions used LLM harness architectures with varying tool integrations (see Appendix C).
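The distillation step of SPD can be illustrated schematically. This is not the Heatz implementation: `scripted_policy` is a hypothetical stand-in for one LLM-generated subgoal script, the toy grid dynamics are invented for the example, and a tabular majority-vote policy stands in for the behavior-cloned neural network.

```python
import random
from collections import Counter, defaultdict


def scripted_policy(state):
    """Hypothetical LLM-generated script for one subgoal:
    walk right to the end of a corridor, then head up."""
    x, y = state
    return "right" if x < 5 else "up"


def collect_demos(policy, n_episodes=50, seed=0):
    """Roll out the script from random starts to build an imitation dataset."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n_episodes):
        x, y = rng.randrange(6), rng.randrange(3)
        for _ in range(10):  # short rollouts on a toy grid
            a = policy((x, y))
            demos.append(((x, y), a))
            x, y = (x + 1, y) if a == "right" else (x, y + 1)
    return demos


def distill(demos):
    """Imitation step: a tabular majority-vote policy stands in for the
    network trained by behavior cloning in the real pipeline."""
    counts = defaultdict(Counter)
    for s, a in demos:
        counts[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    return lambda s: table.get(s, "right")  # arbitrary default for unseen states


student = distill(collect_demos(scripted_policy))
```

In the full SPD pipeline described above, this imitation stage would be followed by RL fine-tuning of the distilled policy; the appeal of the design is that the resulting network executes far faster per step than repeated LLM calls.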
Figure 7: NeurIPS Speedrunning Leaderboard. Milestone progress vs. cumulative wall-clock time (left) and cumulative agent steps (right) for the six teams that completed all 15 milestones, including Heatz (SPD), Hamburg (Rec. PPO), anthonys (LLM+A*), Evelord (LLM Scaff.), and Deepest (VLM+Tools), alongside the PokéAgent Simple Baseline and the human world record. Unlike most game benchmarks that pause between actions, our environment runs in real time—the game world continues while the agent reasons, making inference latency a first-class cost. We therefore report both wall-clock time (end-to-end throughput) and step count (sample efficiency). This distinction reveals that Deepest† (Judge's Choice) completed in the fewest steps (649) despite ranking 5th by time, highlighting the tradeoff between inference speed and action efficiency.

Time-based rankings do not tell the full story. When measured by total actions rather than wall-clock time, Deepest—5th by time—completed with the fewest steps (649 vs. Heatz's 1,608), pointing to a tradeoff between inference speed and sample efficiency. Heatz's RL-distilled policy executes far faster per step, compensating for the higher step count.

Without a harness, raw frontier VLMs achieve effectively 0% task completion on this track. Pokémon Emerald gameplay is out-of-distribution for these models, and raw model calls produce agents that wander aimlessly, repeat failed actions, or become stuck in dialogue loops. A harness—perception, memory, planning, and action modules—is not a marginal optimization but a prerequisite for any progress.
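The Track 2 ranking rule from Section 4.2 (completion percentage first, then completion time, with ties broken by action count) can be sketched directly; the run tuples and data format here are illustrative, not the official scorer's.

```python
def rank_speedruns(runs, total_milestones=15):
    """Order runs by the Track 2 criteria: completion percentage (descending),
    then completion time in seconds (ascending), ties broken by action count.

    Each run is an illustrative (name, milestones_reached, seconds, actions)
    tuple; the official route has 15 milestones.
    """
    def key(run):
        _name, reached, seconds, actions = run
        completion = reached / total_milestones
        return (-completion, seconds, actions)
    return [name for name, *_ in sorted(runs, key=key)]
```

For example, two 100% runs with identical times are separated by action count, and any full completion outranks a faster partial run.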
Furthermore, not all harnesses are equal: common CLI-agent architectures (e.g., Claude Code, Codex CLI, Gemini CLI) fail to maintain coherence over the thousands of sequential decisions required, despite using the same underlying frontier models that succeed with our domain-specific harness (Figure 5). Long-context autonomous embodied tasks appear to demand qualitatively different architectures than single-session coding workflows, particularly in the effective use of robust long-term memory abstractions coupled with detailed, consistently iterated plans specified at sufficient levels of granularity. In the absence of these abstractions, we find that CLI-agent architectures frequently make erroneous assumptions about their progress and struggle to localize themselves within the broader continuity of milestone objectives in Pokémon Emerald. This gap likely extends to other OOD embodied domains requiring very long-context coherence.

5.3 Cross-Track Insights

Specialist methods outperform generalist LLMs. Both tracks show a consistent pattern: RL and search methods outperform LLM approaches. In battling, the top participants all used RL or MCTS rather than LLM reasoning. In speedrunning, the top two finishers used RL-based methods. Heatz's 40:13 is more than 2× faster than the best pure LLM harness approach (anthonys, 01:29:17). Put differently, Pokémon tasks require precise computation—probability estimation, opponent modeling, spatial planning—that current LLMs do not reliably perform from prompts alone.

LLMs as priors, RL as refinement. Despite this gap, LLMs played a key role in the winning speedrunning approach. Heatz used an LLM to decompose the task and generate initial scripted policies, then distilled these into neural networks and refined them with RL. The LLM provided the prior (task structure and initial behavior) while RL provided refinement (faster execution and strategy discovery).
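The distillation step of the SPD recipe can be sketched in miniature: roll out a scripted policy to collect demonstrations, then fit a student that imitates it. A simple threshold rule stands in for the neural network, and the "script" itself is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def scripted_policy(state):
    # Stand-in for an LLM-generated script: press RIGHT until x >= 5, then press A.
    return 1 if state[0] >= 5 else 0   # 0 = RIGHT, 1 = A

# 1) Roll out the script to collect (state, action) demonstrations.
states = rng.uniform(0, 10, size=(500, 2))
actions = np.array([scripted_policy(s) for s in states])

# 2) "Distill" by fitting a student policy that minimizes imitation error.
#    (A threshold rule stands in for the neural network used by the real method.)
thresholds = np.linspace(0, 10, 101)
errors = [np.mean((states[:, 0] >= t).astype(int) != actions) for t in thresholds]
student_t = thresholds[int(np.argmin(errors))]

student = lambda s: int(s[0] >= student_t)
accuracy = np.mean([student(s) == a for s, a in zip(states, actions)])
print(f"imitation accuracy: {accuracy:.2f}")
```

The student is then fast to execute at every frame, which is the property the RL refinement stage builds on.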
This pattern—using LLMs for high-level reasoning and RL for low-level optimization—seems applicable beyond games.

Pokémon exposes reasoning failures that standard benchmarks miss. In battles, weaker models exhibit "panic behavior" (also observed by [2]): after a small tactical error, they compound mistakes rather than recovering, losing games they could still have won. Different model families fail in distinct ways—memory corruption cascades, goal oscillation, excessive plan commitment, and computational paralysis (see Appendix H and Figure 26). These failure modes do not appear in coding or math benchmarks, where questions are independent. Pokémon's multi-turn, adversarial setting tests whether models recover from mistakes under pressure.

Pokémon battling is orthogonal to standard LLM benchmarks. BenchPress [11] shows that an 83-model × 49-benchmark evaluation matrix is approximately rank-2: two latent dimensions explain >90% of score variance, and 5 benchmarks suffice to predict the remaining 44 within ∼7 points. We added our Battling Track GXE scores for the 16 models that overlap with BenchPress. Pokémon breaks this low-rank structure: no existing benchmark correlates strongly with GXE (max Spearman ρ = 0.77; mean |ρ| = 0.45), and the rank-2 SVD that explains 91% of standard benchmark variance captures only 27% of GXE variance. Several models that score at frontier level on standard benchmarks collapse in battles, and vice versa. Competitive Pokémon appears to measure capabilities—strategic reasoning under partial observability and adversarial pressure—that are nearly orthogonal to what current evaluation suites capture.

6 Conclusion: From Competition to Living Benchmark

The PokéAgent Challenge transitions from the NeurIPS 2025 competition to a living benchmark available at https://pokeagentchallenge.com.
The Battling Track maintains a live leaderboard on a dedicated Showdown server with all organizer baselines active, allowing new agents to be evaluated against the full history of submissions. The Speedrunning Track provides self-contained evaluation: researchers run agents locally against the standardized emulator and milestone framework, enabling reproducible comparison without server access. All datasets, baselines, and infrastructure are publicly released. The evaluation server and leaderboards are maintained by the organizing team with funding support as listed in the Acknowledgments; all code and data are hosted on GitHub and HuggingFace to ensure long-term availability independent of server infrastructure. Both tracks provide clear room to expand in order to maintain benchmark difficulty as performance saturates.

Large capability gaps remain. We highlight four open challenges: (1) VLM-SLAM: Speedrunning agents struggle with basic localization, action-distance estimation, and objective detection. Grounding VLM outputs to consistent spatial representations—analogous to classical SLAM but through language-vision interfaces—remains a bottleneck for RPG play. (2) Closing the LLM–RL gap in battling: Specialist RL agents far outpace harness LLM agents in competitive battles. Developing LLM agents that match RL performance, or hybrid approaches that combine RL's sample efficiency with LLM world knowledge, is an open problem. (3) Full-game completion with open-source models: Proprietary frontier models have completed Pokémon RPGs with heavy harness support, but no open-source model has done so. Achieving this would make long-horizon RPG evaluation accessible to more research groups. (4) Approaching human speedrun times: The best agent (Heatz, 40:13) is 2.2× slower than human speedrunners.
Closing this gap requires advances in navigation efficiency, obstacle avoidance, and objective sequencing—capabilities relevant to time-critical planning more broadly.

Acknowledgments

We gratefully acknowledge funding from NeurIPS, IJCAI AI Journal, Google DeepMind, and compute support from Google Cloud Platform. This work is also supported by the NSF under Grant No. DGE-2444107. We thank all participating teams and community members for making this competition a success. We would also like to thank Aaron Traylor for his insights on competitive Pokémon and Caleb Frey for moderating our competition's online discussions.

References

[1] Anthropic. Visible extended thinking, 2025. URL https://www.anthropic.com/research/visible-extended-thinking.

[2] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.

[3] Joel Zhang. The making of gemini plays pokémon. Blog post, 2025. URL https://blog.jcz.dev/the-making-of-gemini-plays-pokemon.

[4] John-Anthony Disotto. Gpt-5 is the new pokémon master – openai's latest model completes red in half the time it took the last chatgpt model. TechRadar, August 2025. URL https://www.techradar.com/ai-platforms-assistants/chatgpt/gpt-5-just-completed-pokemon-red-in-a-new-world-record-time-claude-gemini-and-chatgpt-o3-arent-even-close.

[5] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[6] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[7] William H Guss, Brandon Houghton, Nicholay Topin, Phillip Wang, Cayden Codel, Manuela Veloso, and Ruslan Salakhutdinov. Minerl: A large-scale dataset of minecraft demonstrations. arXiv preprint arXiv:1907.13440, 2019.

[8] Stephanie Milani, Anssi Kanervisto, Karolis Ramanauskas, Sander Schulhoff, Brandon Houghton, Sharada Mohanty, Byron Galbraith, Ke Chen, Yan Song, Tianze Zhou, et al. Towards solving fuzzy tasks with human feedback: A retrospective of the minerl basalt 2022 competition. arXiv preprint arXiv:2303.13512, 2023.

[9] Rohin Shah, Steven H Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G Goecks, Nicholas Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, et al. Retrospective on the 2021 minerl basalt competition on learning from human feedback. In NeurIPS 2021 Competitions and Demonstrations Track, pages 259–272. PMLR, 2022.

[10] Patrick Mariglia. Foul play: A competitive pokémon showdown battle bot. https://pmariglia.github.io/posts/foul-play/, 2025.

[11] Dimitris Papailiopoulos. You don't need to run every eval, 2026. URL https://github.com/anadim/llm-benchmark-matrix.

[12] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

[13] Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.

[14] Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019.
[15] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.

[16] Meta Fundamental AI Research Diplomacy Team (FAIR)†, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.

[17] Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.

[18] Kaggle. Introducing game arena, August 2025. URL https://www.kaggle.com/blog/introducing-game-arena. Kaggle blog post.

[19] Nolan Bard, Jakob N Foerster, Sarath Chandar, Neil Burch, Marc Lanctot, H Francis Song, Emilio Parisotto, Vincent Dumoulin, Subhodeep Moitra, Edward Hughes, et al. The hanabi challenge: A new frontier for ai research. Artificial Intelligence, 280:103216, 2020.

[20] Wenzhe Li, Zihan Ding, Seth Karten, and Chi Jin. Fightladder: A benchmark for competitive multi-agent reinforcement learning. arXiv preprint arXiv:2406.02081, 2024.

[21] Heinrich Küttler, Nantas Nardelli, Alexander Miller, Roberta Raileanu, Marco Selvatici, Edward Grefenstette, and Tim Rocktäschel. The nethack learning environment. Advances in Neural Information Processing Systems, 33:7671–7684, 2020.

[22] Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, et al. Balrog: Benchmarking agentic llm and vlm reasoning on games. arXiv preprint arXiv:2411.13543, 2024.

[23] Gerald Tesauro et al. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
[24] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.

[25] Darse Billings, Aaron Davidson, Jonathan Schaeffer, and Duane Szafron. The challenge of poker. Artificial Intelligence, 134(1-2):201–240, 2002.

[26] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

[27] Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.

[28] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[29] Mathieu Acher. Debunking the chessboard: Confronting gpts against chess engines to estimate elo ratings and assess legal move abilities, 2023.

[30] Yichuan Ma, Linyang Li, Yongkang Chen, Peiji Li, Jiasheng Ye, Qipeng Guo, Dahua Lin, and Kai Chen. Mixing expert knowledge: Bring human thoughts back to the game of go. arXiv preprint arXiv:2601.16447, 2026.

[31] Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, and Gopala Anumanchipalli. Pokerbench: Training large language models to become professional poker players. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 26175–26182, 2025.

[32] Weiyu Ma, Qirui Mi, Yongcheng Zeng, Xue Yan, Runji Lin, Yuqiao Wu, Jun Wang, and Haifeng Zhang.
Large language models play starcraft ii: Benchmarks and a chain of summarization approach. Advances in Neural Information Processing Systems, 37(1):133386–133442, 2024.

[33] Enhong Liu, Joseph Suarez, Chenhui You, Bo Wu, Bingcheng Chen, Jun Hu, Jiaxin Chen, Xiaolong Zhu, Clare Zhu, Julian Togelius, et al. The neurips 2022 neural mmo challenge: A massively multiagent competition with specialization and trade. arXiv preprint arXiv:2311.03707, 2023.

[34] Joseph Suarez, David Bloomin, Kyoung Whan Choe, Hao Xiang Li, Ryan Sullivan, Nishaanth Kanna, Daniel Scott, Rose Shuman, Herbie Bradley, Louis Castricato, et al. Neural mmo 2.0: A massively multi-task addition to massively multi-agent learning. Advances in Neural Information Processing Systems, 36:50094–50104, 2023.

[35] Stone Tao, Akarsh Kumar, Bovard Doerschuk-Tiberi, Isabelle Pan, Addison Howard, and Hao Su. Lux ai season 3: Multi-agent meta learning at scale. In NeurIPS 2024 Competition Track, 2024.

[36] Seth Karten, Andy Luu Nguyen, and Chi Jin. Pokéchamp: an expert-level minimax language agent. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=SnZ7SKykHh.

[37] Jake Grigsby, Yuqi Xie, Justin Sasek, Steven Zheng, and Yuke Zhu. Human-level competitive Pokémon via scalable offline reinforcement learning with transformers. Reinforcement Learning Journal, 6:2685–2719, 2025.

[38] Marco Pleines, Daniel Addis, David Rubinstein, Frank Zimmer, Mike Preuss, and Peter Whidden. Pokemon red via reinforcement learning. In 2025 IEEE Conference on Games (CoG), pages 1–8. IEEE, 2025.

[39] Kris Holt. Claude isn't a great pokémon player, and that's okay. Engadget, April 2025. URL https://www.engadget.com/ai/claude-isnt-a-great-pokemon-player-and-thats-okay-151522448.html.

[40] Joel Zhang. The making of gemini plays pokémon crystal. Blog post, 2025.
URL https://blog.jcz.dev/gemini-3-pro-vs-25-pro-in-pokemon-crystal.

[41] Mark E Glickman. Parameter estimation in large dynamic paired comparison experiments. Journal of the Royal Statistical Society Series C: Applied Statistics, 48(3):377–394, 1999.

[42] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

[43] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[44] Jake Grigsby, Linxi Fan, and Yuke Zhu. Amago: Scalable in-context reinforcement learning for adaptive agents. In The Twelfth International Conference on Learning Representations.

[45] Dhruva Tirumala, Thomas Lampe, Jose Enrique Chen, Tuomas Haarnoja, Sandy Huang, Guy Lever, Ben Moran, Tim Hertweck, Leonard Hasenclever, Martin Riedmiller, et al. Replay across experiments: A natural extension of off-policy rl. In The Twelfth International Conference on Learning Representations.

[46] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

[47] Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144, 2023.

[48] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500.
IEEE, 2023.

Appendix Table of Contents

A Competition Retrospective
  A.1 Competition Setup
  A.2 Organizational Outcomes
B Battling Track Competition Results
  B.1 Practice Stage
  B.2 Qualifying Stage
  B.3 Tournament Stage
  B.4 Rating System Analysis
  B.5 Judge's Choice Awards
C Speedrunning Track: Full Competition Results
D Baseline Architecture Details
  D.1 Battling Track Baselines
  D.2 Speedrunning Track Baselines
E Participant Methodologies
  E.1 Battling Track: Competitive Battling
  E.2 Speedrunning Track: RPG Speedrunning
F Environment and System Details
  F.1 RPG System Comparison
  F.2 LLM Baseline Token Usage and Cost
  F.3 Speedrunning Track Task Diversity
G State Space Complexity Derivation
  G.1 EV Spread Counting
  G.2 Generation 9 Team Configuration Space
  G.3 Generation 1 OU (RBY) Team Configuration Space
  G.4 Battle State Spaces
  G.5 Summary
  G.6 Team Space in Human Metagame
H Extended Discussion
  H.1 Additional Findings
  H.2 Broader Impact
I Author Roles and Contributions

A Competition Retrospective

A.1 Competition Setup

A.1.1 Timeline

The PokéAgent Challenge ran from July 2025 through November 2025, culminating in presentations at NeurIPS 2025 in December. The timeline proceeded as follows: July launch with open ladder access, September hackathon for community building and technical talks, October qualifying rounds requiring 250+ games for statistical validity, and November finals featuring best-of-99 head-to-head matches. This structure allowed iterative improvement while maintaining evaluation integrity through held-out final assessments. The Speedrunning Track accepted submissions throughout with periodic leaderboard updates.
A.1.2 Resources

To support participants, we deployed several resources:

• A dedicated custom Pokémon Showdown server hosting baseline agents and participant battles
• A RAG-powered Discord chatbot ("@pokeagent") with a Professor Pokémon persona, using an embedding model and retrieval pipeline over organizer-curated documentation to answer participant questions
• Google Cloud Platform credits including Gemini API access
• A September hackathon with research talks streamed on YouTube

A.2 Organizational Outcomes

Participation Statistics The PokéAgent Challenge exceeded participation expectations:

• 100+ active teams with registered submissions across both tracks
• 650+ Discord community members engaging in technical discussions
• 100K+ battles on the competition Showdown server
• 22 valid Speedrunning Track submissions, with 6 achieving 100% completion

Dual Submission Tracks The dual-track structure successfully attracted participants from both the RL and LLM research communities. The Battling Track appealed to game AI and multi-agent learning researchers, while the Speedrunning Track attracted the long-context reasoning and language agent communities. Several teams competed in both tracks, developing unified architectures that could handle both strategic combat and exploration.

Cross-Platform Engagement The competition benefited significantly from the broader Pokémon AI zeitgeist of 2025. The Claude Plays Pokémon and Gemini Plays Pokémon communities provided a natural audience, and cross-promotion through Discord, Reddit (r/ClaudePlaysPokemon), and social media drove significant participation. The hackathon research talks, streamed on YouTube, attracted viewers from both academic and hobbyist communities.

Community Infrastructure Our RAG-powered "@pokeagent" Discord bot proved valuable for participant support, answering common questions about environment setup, baseline usage, and submission procedures using organizer-curated documentation.
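The retrieval step of such a pipeline can be sketched with a bag-of-words scorer standing in for the embedding model; the documentation snippets and keys below are invented, not the bot's actual corpus:

```python
import math
from collections import Counter

# Invented stand-ins for organizer-curated documentation chunks.
docs = {
    "setup":  "Install the starter kit and connect your agent to the Showdown server.",
    "submit": "Upload your agent logs through the submission form before the deadline.",
    "rating": "Ladder ratings use Glicko and require a minimum number of battles.",
}

def bow(text):
    # Bag-of-words vector; a real pipeline would call an embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def retrieve(question, k=1):
    # Return the k most similar doc keys; the retrieved text would then be
    # stuffed into the LLM prompt to ground the answer.
    scored = sorted(docs, key=lambda d: cosine(bow(question), bow(docs[d])), reverse=True)
    return scored[:k]

print(retrieve("How do I connect my agent to the server?"))  # ['setup']
```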
This reduced organizer burden while providing 24/7 assistance.

Timing Recommendations While NeurIPS encourages competitions to launch early, we found that peak participation and sustained engagement coincided with a 2–3 month window aligned with the Fall university semester (September–November). This timing particularly incentivized student involvement, as participants could integrate the competition into course projects and independent studies, creating a natural alignment between academic schedules and competition milestones.

Organizer Disclosure. The organizing team includes the developers of PokéChamp and Metamon, which serve as baselines. To ensure fairness, organizer baselines were excluded from prize consideration, and several strong baselines were withheld during the competition to ensure that qualifying required genuine technical contribution beyond fine-tuning provided checkpoints. Of the 16 qualifying tournament spots, 13 were secured by teams that extended our public RL baselines, while 3 used independent approaches—reflecting that the released baselines provided an effective starting point rather than an unfair advantage. All evaluation was conducted on a shared server with identical conditions for all participants.

B Battling Track Competition Results

The Battling Track competition was divided into a Practice Stage, Qualifying Stage, and Tournament Stage. In each phase, participating teams could compete in one or both of the Gen 1 OU and Gen 9 OU battling rulesets. The Practice and Qualifying Stages were most similar to the evaluation setup of the permanent PokéAgent Challenge leaderboard: teams connect their agents to our AI-focused Showdown server and compete in ranked battles against both their fellow participants and a large set of "organizer baselines" maintained by our team.
Our baselines ensure accurate ratings by creating a shared set of opponents and maintaining battle activity when few participants are online. During the NeurIPS competition, organizer baselines also served as a form of held-out evaluation; some were released as part of the official starter kit, while others were kept private, so the only way to evaluate against them was to compete on the leaderboard.

B.1 Practice Stage

For most of the competition (July–October), participants were free to compete on the leaderboard to iterate on their solutions. Teams could create unlimited usernames to reset their ratings and evaluate new methods. A hackathon event in September reset the leaderboard and offered compute credit prizes to the top teams 36 hours later, which were ED-Testing (Gen 9 OU) and srsk-1729 (Gen 1 OU). In total, participants played about 75k battles during the practice stage.

B.2 Qualifying Stage

In late October, our leaderboards were reset, and participation was restricted to one username per registered team. Teams competed to qualify for a spot in the Tournament Stage, where all teams would be guaranteed a cash prize. Qualifications were awarded to the top two teams by Elo and the next six highest by FH-BT, which is considered the more accurate metric (Appendix B.4). This structure was a compromise that encouraged participation by allowing a late comeback, as Elo can be improved regardless of earlier results, while FH-BT required a minimum sample of 250 battles, which could have been prohibitively expensive for some methods.

The main Qualifying Stage results were presented in Figure 6. Participants played a total of 35K battles over the two-week qualifying period, and we found strong alignment across all of our skill rating metrics; Figure 11 visualizes the leaderboard across several alternatives. Figure 8 presents head-to-head win rates among the top non-baseline participants.
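The qualification rule (top two by Elo, then the next six by FH-BT among teams meeting the battle minimum) can be sketched as follows; team names, ratings, and tie-breaking details are invented, not the organizers' implementation:

```python
def select_qualifiers(teams, elo_slots=2, fhbt_slots=6, min_battles=250):
    """Top `elo_slots` teams by Elo, then the next `fhbt_slots` by FH-BT among
    teams with enough battles for FH-BT to be meaningful (illustrative logic)."""
    by_elo = sorted(teams, key=lambda t: t["elo"], reverse=True)
    elo_q = by_elo[:elo_slots]
    # FH-BT slots require the minimum sample; Elo qualifiers are excluded.
    remaining = [t for t in teams if t not in elo_q and t["battles"] >= min_battles]
    fhbt_q = sorted(remaining, key=lambda t: t["fhbt"], reverse=True)[:fhbt_slots]
    return [t["name"] for t in elo_q + fhbt_q]

# Hypothetical standings (all numbers made up):
teams = [
    {"name": "A", "elo": 1600, "fhbt": 0.70, "battles": 300},
    {"name": "B", "elo": 1550, "fhbt": 0.66, "battles": 120},  # under the FH-BT minimum
    {"name": "C", "elo": 1500, "fhbt": 0.68, "battles": 400},
    {"name": "D", "elo": 1480, "fhbt": 0.72, "battles": 260},
]
print(select_qualifiers(teams, elo_slots=2, fhbt_slots=2))  # ['A', 'B', 'D', 'C']
```

Note how the two criteria interact: team B qualifies on Elo despite a small sample, while the FH-BT slots reorder C and D by the more battle-hungry metric.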
Our set of organizer baselines helps give the participant results context by grounding them against methods whose training details and performance relative to human players are known. The distinction between public and private organizer baselines created a clear separation point on both leaderboards: many participants clustered slightly above the best public baseline, whereas participants competitive with the private baselines separated from the field and qualified for the tournament. In Gen 9 OU, FoulPlay outperformed private baselines by all metrics, while Q, PA-Agent, and piploop formed a chase pack situated between the best public and private baselines. In Gen 1 OU, 4thLesson separated from the public baselines, while PA-Agent fell just short of the best private baselines and placed first among participants by a wide margin.

As discussed in Section 5, all top-performing submissions employed RL or search-based methods rather than pure LLM approaches. Our baselines revealed systematic LLM reasoning failures—including "panic behavior" and other failure modes—discussed in Section 5 and Appendix H. Additionally, LLM agent performance on the competition ladder was deflated by time pressure: agents frequently switched to less sophisticated fallback methods when low on time, or lost on time entirely. We have since introduced a separate long-timer leaderboard to isolate reasoning ability from inference speed (Section 3.2).

Figure 8: Head-to-Head Win Rates Among Top Participants. Heatmap matrices showing pairwise win rates for Gen 1 OU (left) and Gen 9 OU (right). Each cell shows the row player's win–loss record against the column player. Diagonal cells show each player's Elo rating. Players are ordered by Elo (highest at top). Empty cells indicate insufficient direct head-to-head matchups.
B.3 Tournament Stage

The competition concluded with a head-to-head single-elimination tournament among the qualifying teams, seeded by Qualifying Stage leaderboard position. Teams played their designated opponent in a best-of-99 battle match (first to 50 wins). Figure 9 shows the tournament brackets for both formats. PA-Agent dominated Gen 1, defeating 4thLesson 50–28 in the finals. In Gen 9, FoulPlay swept through the competition, defeating Q 50–14 in the finals after a close match with PA-Agent in the semifinals. Both tournament winners were ranked #1 on the Qualifying Stage leaderboard and were therefore the #1 seeds. Appendix E provides technical details of the first and second place methods as written by the teams themselves.

Gen 1 OU bracket (best-of-99): Quarterfinals: PA-Agent 50–12 FoulPlay; ED-Testing 50–28 MetaHorns; 4thLesson 50–11 GCOGS; srsk-1729 50–29 Exp-05. Semifinals: PA-Agent 50–13 ED-Testing; 4thLesson 50–20 srsk-1729. Final: PA-Agent 50–28 4thLesson.

Gen 9 OU bracket (best-of-99): Quarterfinals: FoulPlay 50–12 Porygon2AI; PA-Agent 50–37 piploop; ED-Testing 50–20 srsk-1729; Q 50–16 MetaHorns. Semifinals: FoulPlay 50–39 PA-Agent; Q 50–22 ED-Testing. Final: FoulPlay 50–14 Q.

Figure 9: Tournament Brackets. Best-of-99 single-elimination brackets for Gen 1 OU (left) and Gen 9 OU (right). Numbers indicate games won by each team; green highlighting indicates winners.

B.4 Rating System Analysis

Most LLM evaluation arenas use Bradley–Terry models (batch MLE) without reporting uncertainty—a potentially misleading practice. We stress-tested four ranking systems (Bradley–Terry, Elo, Glicko-1, and GXE) across 1.6M+ agent matches (Figure 10). Top-3 agents converged across all methods (with 250+ games each), but ranks 4+ showed systematic disagreement—Elo diverged from Bradley–Terry even when error bars did not overlap.
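The difference between the two estimators can be made concrete on synthetic data: Elo applies order-dependent incremental updates, while the Bradley–Terry MLE (fit here with the classic Zermelo/MM iteration) depends only on aggregate win counts. All strengths and constants below are invented; on clean balanced data like this the orderings typically agree, and the divergence the paper reports arises with uneven sample sizes:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Synthetic match log among 4 agents with fixed true log-strengths.
true = np.array([2.0, 1.0, 0.5, 0.0])
wins = np.zeros((4, 4))          # wins[i, j] = games i beat j
games = []
for _ in range(2000):
    i, j = rng.choice(4, size=2, replace=False)
    p_i = 1 / (1 + math.exp(true[j] - true[i]))
    winner, loser = (i, j) if rng.random() < p_i else (j, i)
    wins[winner, loser] += 1
    games.append((winner, loser))

# Online Elo: sequential, order-dependent updates.
elo = np.full(4, 1000.0)
K = 16
for winner, loser in games:
    expected = 1 / (1 + 10 ** ((elo[loser] - elo[winner]) / 400))
    elo[winner] += K * (1 - expected)
    elo[loser] -= K * (1 - expected)

# Batch Bradley–Terry MLE via the Zermelo/MM fixed-point iteration.
bt = np.ones(4)
for _ in range(200):
    for i in range(4):
        num = wins[i].sum()
        den = sum((wins[i, j] + wins[j, i]) / (bt[i] + bt[j]) for j in range(4) if j != i)
        bt[i] = num / den
    bt /= bt.sum()

print("Elo order:", np.argsort(-elo))
print("BT  order:", np.argsort(-bt))
```

The batch fit is invariant to game order and comes with standard MLE uncertainty machinery, which is why it is the usual reference point for the online systems.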
Glicko-1 offered the best tradeoff: online updates for real-time leaderboards with convergence guarantees matching batch MLE. We prefer Glicko-1 over raw Elo for agent evaluation, particularly when sample sizes vary across agents. Figure 11 compares the Battling Track competition results according to alternative skill metrics (Showdown's Elo, GXE, Win Rate) and sample size (Battles Played).

Figure 10: Rating System Comparison Across 1.6M+ Matches. Four ranking methods applied to Battling Track data. Left: Rank correlation between methods, showing Glicko-1 converges to Bradley–Terry (batch MLE) while Elo diverges for mid-ranked agents. Right: Rating trajectories with uncertainty bands for selected agents, demonstrating Glicko-1's uncertainty quantification. Top-3 agents (shaded) show consistent rankings across methods; ranks 4+ exhibit systematic disagreement, particularly between Elo and Bradley–Terry.

B.5 Judge's Choice Awards

The organizing committee awarded additional prizes based on participant technical writeups. These Judge's Choice awards were intended to reward novel directions and significant departures from the starter kit baselines regardless of final results. The winners were:

• Porygon2AI: Recognized for innovative league training methodology inspired by AlphaStar's [15] approach to diverse opponent modeling
• August: Best pure LLM method without learned components, demonstrating effective chain-of-thought reasoning for move selection

C Speedrunning Track: Full Competition Results

The Speedrunning Track challenged participants to complete Pokémon Emerald as quickly as possible (see Appendix E for detailed descriptions of participant methods). Of 22 teams that submitted, 6 achieved 100% completion. Figure 7 presents the final standings, showing both wall-clock time and step-count views.
Heatz won with a 40:13 completion time using Scripted Policy Distillation (SPD); see Section 5 for a discussion of the winning approach and the time-vs-efficiency tradeoff across teams. As discussed in Section 5, a harness is a prerequisite for any OOD performance—not a marginal optimization. The successful harness architectures decompose into four components:

• Perception: Vision-to-text translation for game state understanding
• Memory: Maintaining coherent state across thousands of timesteps
• Planning: High-level goal decomposition and route optimization
• Action: Low-level control for navigation and battle execution

Figure 11: Battling Track Qualifying Metrics. Qualifying stage results according to alternative skill metrics (Showdown's Elo, GXE, Win Rate) and sample size (Battles Played).

Benchmark results therefore reflect the joint capability of model + harness, and differences between models cannot be attributed to "raw" model capability alone.

Judge's Choice: Deepest was recognized for achieving completion with the fewest total actions, demonstrating efficient reasoning despite a slower wall-clock time.

D Baseline Architecture Details

D.1 Battling Track Baselines

PokéChamp [36] and Metamon [37] form the foundation for our Battling Track baselines. Both projects were significantly upgraded to improve their competitive performance and serve as extensible starting points for future work. All of these changes are open-source, and detailed information is available via the PokéAgent Challenge Resources Page. This section provides a high-level introduction and summarizes key improvements.

D.1.1 PokéChamp

PokéChamp combines large language models with minimax search for competitive battling. The system converts game state to text via a game engine interface, then uses an LLM to evaluate positions through an approximate state transition heuristic.
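The lookahead idea can be sketched in miniature: simulate each joint action under an approximate transition model, score the resulting states, and pick the move with the best worst case. The transition model, damage numbers, and `evaluate` function below are invented stand-ins for the LLM components:

```python
def evaluate(state):
    """Stand-in for the LLM rubric: score a predicted state for our side."""
    return state["our_hp"] - state["opp_hp"]

def simulate(state, our_move, opp_move):
    # Toy transition model standing in for the approximate state transitions.
    dmg = {"strong": 30, "safe": 10}
    return {
        "our_hp": state["our_hp"] - dmg[opp_move],
        "opp_hp": state["opp_hp"] - dmg[our_move],
    }

def maximin_move(state, our_moves, opp_moves):
    # One-step lookahead over simultaneous moves: assume the opponent picks
    # their best reply, then choose the move with the best worst case.
    def worst_case(move):
        return min(evaluate(simulate(state, move, opp)) for opp in opp_moves)
    return max(our_moves, key=worst_case)

state = {"our_hp": 62, "opp_hp": 82}
print(maximin_move(state, ["strong", "safe"], ["strong", "safe"]))  # "strong"
```

The real system prunes this search with value cutoffs and queries the LLM for the evaluation, but the maximin skeleton is the same.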
This enables strategic lookahead without requiring explicit reward signals or training data.

Figure 12: PokéChamp Architecture. Within the battle loop, two data sources feed into the LLM pipeline: (1) historical usage statistics from Smogon (top left), providing probability distributions over abilities, EV spreads, and moves for each species; and (2) the current game state from Pokémon Showdown (bottom left). The Text Conversion + LLM module (center) converts game state to structured text, enriches it with usage statistics for opponent prediction, and performs LLM inference to select actions. The chosen move executes and updates the game state for the next turn.

D.1.2 Metamon

Metamon converts Showdown's replay archives into an opportunity to study offline RL [43] at scale in a complex domain. Its baselines embrace the complexity and partial observability of Pokémon by training Transformer policies with model-free off-policy RL updates [44].
Further improvements beyond the human demonstration dataset are driven by a large-scale "replay across experiments" [45] loop in which the project continuously expands a central repository of self-play trajectories and policies (Figure 13). As of the launch of the PokéAgent Challenge, the training dataset exceeds 20M battles and has produced 30 meaningfully distinct policies at various skill levels. Metamon training runs are expensive by academic standards, and each iteration may have several axes of improvement (datasets, architectures, RL details). As a result, its agent releases are given somewhat arbitrary names (LLM-style) that are excluded from results like Figures 3 and 6 but are available in PokéAgent Challenge resources. Table 1 defines a short list that are relevant to the NeurIPS competition results (Appendix B) or referenced by participant writeups (Appendix E).

Figure 13: Metamon Training Pipeline. Metamon trains RL agents on large datasets of self-play battles and human demonstrations "reconstructed" from replays gathered from Pokémon Showdown. Figure reproduced from Grigsby et al. [37].

Table 1: Metamon baselines referenced in Appendix results, and their estimated performance (human ladder ratings, GXE) against human players ("G1" = "Gen 1 OU", and so on).

Model | Size | Date | Description | G1 | G2 | G3 | G4 | G9
SyntheticRLV2 | 200M | Sep 2024 | The best (public) Gen 1 OU policy during the NeurIPS competition, and therefore the basis of many of the qualifying submissions (Appendix E). | 77% | 68% | 64% | 66% | –
Abra | 57M | Jul 2025 | The best (public) Gen 9 OU policy during the NeurIPS competition, and therefore the basis of many of the qualifying submissions (Appendix E). | – | – | – | – | 50%
Kadabra3 | 57M | Sep 2025 | The best policy to participate in the NeurIPS competition as an organizer baseline (rank #1 in the Gen 1 OU qualifier and #2 in Gen 9 OU). | 80% | – | – | – | 64%
Kakuna | 142M | Dec 2025 | The best Metamon model as of the launch of the PokéAgent Challenge (with results appearing in Fig. 3). | 82% | 70% | 63% | 64% | 71%

The most notable improvement made to Metamon for the PokéAgent Challenge was its expansion to support the Gen 9 OU ruleset, which created the opportunity for RL and LLM baselines to compete directly. Gen 9 OU has a significantly larger state space (Appendix G) than the rulesets from the original work, and it was an open question whether model-free RL would generalize to this more complex game. Today, Metamon is at least as competitive against human players in Gen 9 OU as in earlier generations (Table 1), though its performance in Gen 1 OU remains an outlier.

D.2 Speedrunning Track Baselines

The PokéAgent starter kit implements a multi-agent orchestration system that successfully completed Pokémon Emerald. The architecture distinguishes between two types of capabilities: MCP tools for low-level game interaction and sub-agents for high-level reasoning. A central orchestrator coordinates these components, dispatching to specialized sub-agents based on game context while using MCP tools for direct game control.

Orchestrator. The flagship harness implements a central orchestrator that coordinates all sub-agents and tool calls. Upon initialization, the orchestrator invokes a planning sub-agent to generate a high-level route covering story progression, team-building objectives (training, catching, item acquisition), and resource management (Pokémon Center visits). This loose plan maintains long-term coherence while allowing ad-hoc objective creation during gameplay. The orchestrator dispatches to specialized sub-agents based on context: the battle sub-agent handles combat decisions, the reflection sub-agent analyzes stuck states, and the verification sub-agent checks objective completion to prevent premature advancement.
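As an illustration, the context-based dispatch described above can be sketched as a small rule table. The sub-agent names follow the text, but the state fields (`in_battle`, `stuck_counter`, `objective_done`) are illustrative assumptions, not the starter kit's actual schema:

```python
# Sketch of context-based sub-agent dispatch; the state fields used here
# are assumptions for illustration, not the harness's actual API.

def dispatch(state: dict) -> str:
    """Route the current game context to a specialized sub-agent."""
    if state.get("in_battle"):
        return "battle"        # combat decisions
    if state.get("stuck_counter", 0) >= 3:
        return "reflection"    # diagnose stuck states
    if state.get("objective_done"):
        return "verification"  # confirm completion before advancing
    return "main"              # general game control
```

In the real harness the orchestrator also consults the planning sub-agent's route and the knowledge base before dispatching; this sketch shows only the routing skeleton.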
The system exposes two categories of capabilities to the orchestrator:

Figure 14: PokéAgent Multi-Agent Architecture. The orchestrator coordinates sub-agents and MCP tools to play the game. Data flows left-to-right from the GBA emulator (game frames, parsed state) through the Orchestrator (context management, sub-agent dispatch, stuck detection), which invokes either MCP tools (pathfinding, button inputs, knowledge retrieval) for direct game control or specialized sub-agents (battle agent, reflection agent, verification agent, gym puzzle agent) for complex reasoning tasks. A feedback loop returns sub-agent outputs and tool results to inform subsequent decisions.

MCP Tools. Low-level game interaction via HTTP endpoints:

• Game Control: get_game_state, press_buttons, and navigate_to (A* pathfinding with variance options for obstacle handling)
• Knowledge Retrieval: Persistent memory system storing discoveries (locations, NPCs, items, strategies) with importance-weighted retrieval
• Objective Management: Three parallel objective sequences (story: main narrative; battling: team building; dynamics: agent-created adaptive goals), enabling balanced progress across game dimensions

Sub-Agents. Specialized reasoning modules invoked by the orchestrator:

• Battle Sub-Agent: Handles combat decisions including move selection, switching, and item usage based on type matchups and team state
• Reflection Sub-Agent: Analyzes stuck states by comparing the current situation against ground-truth sources (porymap data, knowledge base) to diagnose navigation failures
• Verification Sub-Agent: Independently checks whether objectives are truly complete, preventing the orchestrator from advancing when the main agent incorrectly believes a task is finished
• Gym Puzzle Sub-Agent: Specialized reasoning for gym-specific puzzles requiring spatial planning (e.g., Mauville's electric barriers, Lavaridge's floor tiles)

Interestingly, wiki access via the knowledge retrieval tool sometimes degraded performance: the agent would retrieve contradictory information from different sources or misapply advice for different game versions, highlighting the challenge of grounding external knowledge. We therefore did not grant the agent access to web search or Pokémon-specific wikis for our baselines.

Context Management. The agent maintains conversation history with automatic compaction when it exceeds a history time-step limit (typically 20+ turns), preserving only LLM responses and action summaries while discarding redundant game state. A knowledge base summary maintains persistent memory. This enables coherent behavior across thousands of timesteps without context overflow.

VLM Backend Support.
Both the orchestrator and all sub-agents share a unified VLM interface supporting multiple backends (Gemini, OpenAI, OpenRouter), with action speed control (fast/normal/slow) for different gameplay situations: rapid button presses for dialogue advancement versus deliberate inputs for critical menu navigation.

E Participant Methodologies

This section presents methodology summaries contributed by top-performing teams. These descriptions are provided largely verbatim to preserve the participants' voices and technical details, but we note that their writing generally assumes familiarity with Pokémon terminology and PokéAgent Challenge resources (baseline names, dataset names, etc.).

E.1 Battling Track: Competitive Battling

E.1.1 PA-Agent (Gen 1 OU Champion)

Team: Xianwei Shi, Kunsheng Zhou, Dongyu Liu, Wenli Zhang
Affiliation: PokéAgent Challenge Gen 1 OU Champion

PA-Agent addresses the core challenges of Pokémon battles (a massive team composition space and incomplete information) by building on the Metamon offline reinforcement learning (RL) framework [37], integrating Transformer-based decision-making, iterative training, and tournament-driven team selection. The approach prioritizes efficiency and adaptability in stochastic, partially observable scenarios.

The agent adopts a modular architecture with two core components: a Battle Decision Module and a Team Optimization Module. The Battle Decision Module uses a Transformer backbone to process hybrid inputs (87 textual tokens and 48 numerical features) for end-to-end action prediction (move selection/switching), with enhanced attention to recent battle events to mitigate information overload. Key algorithmic innovations include iterative offline RL with dynamic data weighting: bootstrapping from the human replays provided by Metamon, then refining via 6 rounds of inter-model battle data, gradually reducing the human data proportion from 100% to 10% to avoid interference from low-quality decisions.
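The iterative data-weighting scheme described above can be sketched as follows. The linear annealing schedule and batch-sampling details are assumptions for illustration; the writeup specifies only the 100% to 10% endpoints over 6 rounds:

```python
import random

def human_fraction(round_idx, n_rounds=6, start=1.0, end=0.1):
    """Anneal the human-data proportion across refinement rounds.

    Linear annealing is an assumption; only the endpoints (100% -> 10%)
    are given in the text.
    """
    t = round_idx / max(n_rounds - 1, 1)
    return start + t * (end - start)

def sample_batch(human_data, selfplay_data, human_frac, batch_size=4):
    """Sample a training batch mixing human replays with self-play data."""
    n_human = round(batch_size * human_frac)
    batch = random.sample(human_data, min(n_human, len(human_data)))
    batch += random.sample(selfplay_data, batch_size - len(batch))
    return batch
```

Each round would then train on batches drawn with the current round's `human_fraction`, gradually shifting the data distribution toward self-play battles.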
For team composition, a tournament selection mechanism narrows the huge space of candidate teams by evaluating candidates (from Smogon, Metamon, and community sources) against 50+ lineups and selecting those with >60% win rates.

In the PokéAgent Challenge, PA-Agent achieved competitive performance against top submissions, demonstrating the effectiveness of iterative offline RL refinement and tournament-based team selection. The model's GXE in Gen 1 OU qualifying reached 80.35%, with particular strength in handling diverse opponent strategies and partial observability.

E.1.2 FoulPlay (Gen 9 OU Champion)

Author: Patrick Mariglia [10]
Affiliation: PokéAgent Challenge Gen 9 OU Champion

Foul Play is a competitive Pokémon battle bot that employs root-parallelized Monte Carlo Tree Search (MCTS) with the Decoupled Upper Confidence Bound applied to Trees (DUCT) formula to handle simultaneous move selection. The bot uses a custom battle engine called poke-engine, written in Rust for performance, which addresses the computational complexity inherent in competitive Pokémon. Rather than exhaustively exploring all possible game states (which is intractable due to the combinatorial explosion from damage rolls, critical hits, and secondary effects), poke-engine employs damage roll grouping. This technique clusters damage outcomes by their practical impact, primarily whether they result in a knockout, reducing branching while preserving strategically relevant information. The engine generates state instructions rather than copying entire game states, enabling efficient tree traversal to depths of 10 or more turns for promising lines while pruning uninteresting branches early.

The transition from expectiminimax to MCTS was motivated by depth limitations: the earlier approach would exhaustively search every branch of the game tree, limiting depth to approximately 5 turns.
MCTS achieves 10+ turn depth on promising lines while exploring unpromising branches to only 2–3 turns. Rather than relying on rollouts, search is guided by a custom evaluation function.

Set prediction is critical to Foul Play's performance, particularly in open team-building formats like OU where opponent Pokémon attributes are initially unknown. The bot maintains a comprehensive dataset of possible sets for each species, sourced from Smogon usage statistics, scraped team builds from forums, and public replay data. As battles progress, Foul Play infers hidden information through game mechanics: damage calculations reveal stat distributions, move priority ordering constrains speed ranges, extended weather duration indicates specific items, and move usage patterns eliminate equipment possibilities (e.g., status move usage eliminates Assault Vest; hazard damage reactions exclude Heavy-Duty Boots). By sampling from likelihood-weighted distributions over possible opponent configurations during search, the bot adapts its strategy to the most probable scenarios.

This combination of efficient search and accurate prediction enabled Foul Play to achieve strong results: over 90% GXE in Generation 9 Random Battles (peak rating 2341), over 80% GXE in Generation 9 OU (peak rating 1879), and first place in the PokéAgent Challenge Gen 9 OU track with a dominant 50–14 finals victory.

Figure 15: Foul Play's MCTS search architecture with damage roll grouping. The engine clusters damage outcomes by knockout potential rather than exploring all 32 possible values (16 rolls × critical hit), enabling deeper search while preserving strategic relevance.

Figure 16: Foul Play's set prediction pipeline. Hidden information is progressively inferred through damage calculations, priority ordering, weather duration, and move usage patterns, narrowing the space of possible opponent configurations.
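The set-prediction loop described above can be sketched roughly as follows. The toy set data and elimination rules are simplified assumptions for illustration, not poke-engine's actual dataset or inference logic:

```python
import random

# Toy prior over possible sets for one species; weights stand in for
# usage-statistics likelihoods. Values are illustrative, not real data.
SETS = {
    "Kingambit": [
        {"item": "Assault Vest", "moves": {"Kowtow Cleave", "Sucker Punch"}, "weight": 0.3},
        {"item": "Heavy-Duty Boots", "moves": {"Kowtow Cleave", "Swords Dance"}, "weight": 0.5},
        {"item": "Leftovers", "moves": {"Sucker Punch", "Swords Dance"}, "weight": 0.2},
    ],
}

def filter_sets(species, observed_moves, used_status_move=False):
    """Eliminate sets inconsistent with revealed information.

    E.g. using a status move rules out Assault Vest, mirroring the
    elimination rule described in the text.
    """
    candidates = []
    for s in SETS[species]:
        if not observed_moves <= s["moves"]:
            continue  # set lacks a move we have already seen
        if used_status_move and s["item"] == "Assault Vest":
            continue  # Assault Vest prevents status moves
        candidates.append(s)
    return candidates

def sample_set(candidates, rng=random):
    """Sample one opponent configuration by likelihood weight for search."""
    weights = [s["weight"] for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Each MCTS iteration would then run against a configuration drawn by `sample_set`, so the search effort concentrates on the most probable opponent builds.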
E.1.3 4thLesson (Gen 1 OU Finalist)

Team: Gyungbo Kim, Sangyeon Park, Eunju Kwon, Yujin Kim
Affiliation: PokéAgent Challenge Gen 1 OU Finalist

4thLesson follows Metamon's SyntheticRL-V2 model and adopts a fine-tuning approach starting from its pretrained weights, preserving the original architecture for stable initialization. Two key modifications distinguish their approach. First, they employ the Kron (Kronecker-factored Approximate Curvature) optimizer instead of the AdamW optimizer originally used in the Amago framework. Although Kron is a second-order optimizer and incurs higher computational cost than AdamW, recent work demonstrates that it stabilizes reinforcement learning by ensuring more consistent gradient flow when scaling up model size, outperforming Adam-based optimizers in this regard. Second, they adopt AID (Activation by Interval-wise Dropout) in place of the previously used Leaky ReLU. AID introduces additional linearity into the model, mitigating the plasticity loss that tends to degrade models in continual learning and reinforcement learning settings. Since AID also acts as a form of dropout, they removed the dropout modules used in the original model.

To collect high-quality self-play data, they employ a multi-stage data generation strategy based on a local ladder setup. They first generated self-play replays using the 19 provided baseline models on a local ladder, collecting approximately 30k replays for small models, 40k for medium models, and 50k–100k for large models. All replays were generated using the modern_replays (v1) teamset, resulting in roughly 800k replays in total. In addition, they performed self-play between intermediate checkpoints of their own models, collecting another 300k–400k replays and increasing the total dataset size to approximately 1.1–1.2M replays.
After the preliminary round, they repeated a similar process, generating about 10k samples per model using the modern_replays (v2) teamset and adding approximately 130k more samples, increasing the total dataset size to 1.3–1.4M samples.

Figure 17: 4thLesson's learning architecture and training pipeline, showing the integration of the Kron optimizer and the AID activation function.

Figure 18: Structure of Activation by Interval-wise Dropout (AID), used by 4thLesson to mitigate plasticity loss.

E.1.4 Team Q (Gen 9 OU Finalist)

Team: Qiao Wang, Ling Wu
Affiliation: PokéAgent Challenge Gen 9 OU Finalist

Team Q's approach centers on a Two-Phase Curriculum Learning framework designed to bootstrap basic competency before refining advanced strategic reasoning. The agent architecture is a 50M-parameter Actor-Critic model. The core innovation lies in splitting the training process into two distinct phases: a Mechanics Phase and a Strategy Phase. In the initial Mechanics Phase, the agent is fine-tuned against a suite of heuristic bots. This phase focuses on learning the fundamental rules of the game, such as type advantages, move effectiveness, and valid action masking, without the noise of complex adversarial behavior. Once the agent achieves a baseline win rate against these static policies, it transitions to the Strategy Phase.
Here, they employ an iterative "coach cycle" in which the agent trains against a fixed expert coach and previous versions of itself. This hybrid of self-play and expert-play allows the agent to develop deeper foresight and prediction capabilities, resulting in a steady increase in win rates from version v0 to v6 in their experiments.

Their results demonstrate that separating mechanics learning from strategic fine-tuning significantly accelerates convergence. The agent showed the most dramatic performance jumps when transitioning from the v0 team set (random/heuristic initialization) to the v1 team set (curriculum-based), validating the hypothesis that mastering low-level game dynamics is a prerequisite for high-level long-horizon planning in stochastic environments like Pokémon.

Figure 19: Team Q's win rate across training iterations (v1–v6), measured against the Competitive v1 Team Set. The initial iterations (v1–v4) were trained on the v0 team set, while v6 incorporates fine-tuning on the v1 set, resulting in a performance jump to approximately 0.8.

E.2 Speedrunning Track: RPG Speedrunning

E.2.1 Heatz (Speedrunning Track Champion)

Author: Junik Bae
Affiliation: PokéAgent Challenge Speedrunning Track Champion

Reinforcement learning (RL) offers a powerful framework for training autonomous agents, yet applying RL directly to long-horizon tasks with sparse rewards remains fundamentally challenging. To address this challenge, we propose leveraging LLM-generated scripted policies as priors for RL exploration (Figure 20). Although these scripted policies are not always optimal, they facilitate RL training by initializing exploration from a distribution that already reaches the goal, rather than from scratch.
We implement this long-horizon task learning framework as Scripted Policy Distillation (SPD), which consists of three stages: (1) subgoal generation [46, 47], (2) scripted policy generation [28, 48], and (3) script-guided RL.

Figure 20: Overview of Scripted Policy Distillation (SPD). The approach consists of three stages: (1) Subgoal Generation, where an LLM decomposes the task into sequential subgoals with executable success conditions; (2) Scripted Policy Generation, where the LLM generates policies that can invoke VLM tools and use self-directed logging for debugging; (3) Script-Guided RL, where scripted policies are distilled into neural networks via supervised learning followed by RL with expert action guidance.

(1) Subgoal generation. Given a long-horizon task specification, an LLM decomposes the task into a sequence of subgoals, each paired with an executable success-condition function success_cond(state) that determines subgoal completion.

(2) Scripted policy generation.
For each subgoal, the LLM generates a scripted policy that maps states to actions. The scripted policy interacts with the environment until success_cond(state) returns True or a timeout occurs. Upon failure, the LLM analyzes execution traces and revises either the policy code (Stage 2) or the subgoal specification (Stage 1). This stage employs two key techniques to handle complex environments. First, to extract rich visual cues that are not present in the state inputs, policies can optionally query a vision-language model (Qwen3-VL-8B) and use the resulting information for decision making. Second, to efficiently update subgoals and scripted policies, we provide the LLM with concise summaries of agent interactions obtained via self-directed logging, rather than long, full execution trajectories.

(3) Script-guided RL. Once all scripted policies reliably achieve their corresponding subgoals, we distill them into neural network policies via imitation learning on expert trajectories, followed by RL with expert action guidance. The resulting neural policies execute faster than the scripted policies and can therefore discover more efficient strategies. For distillation, we train a DQN agent with two forms of expert guidance. First, we seed the replay buffer with successful expert trajectories. Second, during rollouts, we execute expert actions with probability ϵ (annealed from 0.1 to 0), and otherwise follow the learned policy. Together, these mechanisms bootstrap learning from expert behavior while allowing improvement through RL optimization.

Our method, SPD, achieved a 40:13 run on Pokémon Emerald up to the first gym. The resulting policy exhibited several interesting emergent behaviors across the pipeline. During scripted policy generation, the agent occasionally synthesized explicit search routines, such as BFS over a local navigation graph, to reliably plan short paths for navigation subgoals.
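A search routine of the kind the scripted policies synthesized, BFS over locally observed walkable tiles, might look like the following sketch (the grid representation is an illustrative assumption):

```python
from collections import deque

def bfs_path(walkable, start, goal):
    """Shortest button-press path over a set of walkable (x, y) tiles.

    A minimal sketch of a BFS navigation routine of the kind described
    above; the tile-set representation is assumed for illustration.
    """
    moves = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (x, y), path = queue.popleft()
        if (x, y) == goal:
            return path
        for action, (dx, dy) in moves.items():
            nxt = (x + dx, y + dy)
            if nxt in walkable and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # goal unreachable from the observed tiles
```

A scripted policy can then emit the returned button sequence one action per step until the subgoal's success condition fires.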
Furthermore, after distillation and RL fine-tuning, the neural policy executed these behaviors more efficiently and further improved speed through strategies not explicitly programmed in the scripts, including shorter routes, skipping unnecessary trainer battles, and faster menu interactions.

E.2.2 Hamburg PokéRunners (Speedrunning Track Second Place)

Team: Benedikt Schink, Arian Urdu, Matin Urdu
Affiliation: PokéAgent Challenge Speedrunning Track Second Place

Hamburg PokéRunners achieved second place (first badge: 01:14:43) using a reinforcement learning approach based on recurrent proximal policy optimization (PPO). They used a standard recurrent PPO architecture with a long short-term memory (LSTM) in the recurrent state. The encoder consists of a convolutional neural network (CNN) and a multi-layer perceptron (MLP). The CNN receives downsampled game frames (grayscale, 128 × 128 pixels). A sinusoidal positional encoding encodes the x and y in-game coordinates, and the location (e.g., the current city or route) is encoded as a one-hot vector. Together, the location and the sinusoidal positional encoding yield unique global positional coordinates; this global conditioning prevents the spatial ambiguity inherent in local coordinate systems.

The last input to the MLP is a binary milestone vector m ∈ {0, 1}^38, which encodes the sub-milestones of the competition as well as the locations and Pokémon Centers reached along the way. The milestone vector serves both as a memory component, tracking the milestones achieved so far, and as goal conditioning, tracking the current objective. This played a crucial role in making the model robust to the inherent randomness of Pokémon Emerald and the non-deterministic input/output latency of the evaluation script. They observed simple error-correcting behavior, where the agent would overshoot or undershoot the target and then backtrack to correct its trajectory.
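The positional input described above (sinusoidal x/y encodings concatenated with a one-hot location) can be sketched as follows; the embedding dimension and frequency base are illustrative choices, not the team's reported hyperparameters:

```python
import math

def sinusoidal_encoding(coord, dim=8, max_len=512):
    """Transformer-style sinusoidal encoding of one in-game coordinate.

    dim and max_len are assumed values for illustration only.
    """
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (max_len ** (2 * i / dim))
        enc.append(math.sin(coord * freq))
        enc.append(math.cos(coord * freq))
    return enc

def encode_position(x, y, location_id, n_locations):
    """Concatenate sinusoidal x/y encodings with a one-hot location vector."""
    one_hot = [0.0] * n_locations
    one_hot[location_id] = 1.0
    return sinusoidal_encoding(x) + sinusoidal_encoding(y) + one_hot
```

Because the same local (x, y) pair recurs on many maps, the one-hot location disambiguates otherwise identical coordinate encodings, which is the global conditioning the text describes.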
Their reward function includes only positive rewards that incentivize desired behavior: competition sub-milestones, locations, coordinate exploration, badges received, money earned, HP healed, Pokémon leveled up, and damaging attacks used. All negative rewards they tried, such as step penalties, degraded the model: it reverted to only safe behavior and did not progress in the game. Due to limited computational resources, they trained multiple smaller models that were stitched together during inference, instead of training one larger model. While transfer learning yielded only small improvements, mostly restricted to basic navigation capabilities, the inclusion of the milestone vector significantly mitigated catastrophic forgetting.

Figure 21: Hamburg PokéRunners' recurrent PPO architecture, showing the CNN encoder for visual input, the MLP for positional and milestone encoding, and the LSTM for temporal reasoning.

E.2.3 Anthonys (Speedrunning Track Third Place)

Author: Anthony Sistilli
Affiliation: PokéAgent Challenge Speedrunning Track Third Place

Anthonys' approach centered on decomposing game progression into discrete phases with specialized navigation strategies for each context. The core innovation was combining deterministic pathfinding algorithms with context-aware prompt engineering to guide the language model through complex navigation and battle scenarios.

Phase-Based State Management: They structured the challenge into seven distinct phases, each corresponding to major game milestones. Each phase maintained its own prompt template with conditional instructions that dynamically adapted based on completed objectives and current location.
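Phase-based prompt selection of this kind can be sketched as follows. The phase names, prerequisites, and template strings are invented for illustration and do not reproduce the team's actual seven phases:

```python
# Sketch of phase-based prompt selection; phase names, prerequisites,
# and templates are illustrative assumptions, not the team's prompts.

PHASES = [
    ("littleroot", "You are in the starting town. Goal: {objective}."),
    ("route_101", "Navigate north through tall grass. Goal: {objective}."),
    ("petalburg", "Head to the gym, then continue west. Goal: {objective}."),
]

# Milestones that must be complete before each phase becomes active.
PREREQS = [set(), {"got_starter"}, {"got_starter", "reached_oldale"}]

def current_phase(milestones_done: set) -> int:
    """Return the latest phase whose prerequisites are all satisfied."""
    for i in reversed(range(len(PHASES))):
        if PREREQS[i] <= milestones_done:
            return i
    return 0

def build_prompt(milestones_done: set, objective: str) -> str:
    """Fill the active phase's template with the current objective."""
    name, template = PHASES[current_phase(milestones_done)]
    return f"[phase: {name}] " + template.format(objective=objective)
```

Conditioning the prompt on completed milestones keeps the LLM's instructions short and relevant to the current stretch of the game, instead of one monolithic prompt covering the whole run.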
A* Pathfinding with Directional Priorities: Rather than relying solely on the LLM for spatial reasoning, they implemented A* pathfinding on the game map grid to compute optimal movement sequences. The system extracted walkable tiles from game memory, constructed a navigable grid representation, and generated action sequences to reach target coordinates.

Battle State Management: Battle sequences required special handling due to the complexity of the game's menu system. They implemented an input-clearing mechanism that sent predetermined button sequences to ensure the agent escaped bad states.

Adaptive Recovery and Stuck Detection: The system tracked position history to detect when the agent remained stationary for multiple turns, indicating an obstacle or navigation failure.

E.2.4 Deepest (Speedrunning Track Judge's Choice: Most Efficient)

Team: Seonghun Hong, Hyunyoung Jeong, Hyeokgi Kim, Jaeyoon Jung
Affiliation: Seoul National University, Soongsil University, MAUM.AI

Figure 22: Deepest agent architecture overview. The agent receives partial observations and milestone-specific guidance from the Guidebook. Gemini 2.5 Flash reasons and outputs actions, optionally invoking tools (Pathfinding, Memory, Thinking). Actions execute in the emulator, producing the next observation. History provides temporal context.

We propose a training-free, fully autonomous agent architecture for Pokémon Emerald speedrunning that operates under human-like perceptual constraints using the Gemini 2.5 Flash vision-language model. Our approach achieved 5th place with a completion time of 02:04:29 for the first gym milestone and received the Judge's Award for the most sample-efficient completion, demonstrating the effectiveness of principled agent design without data collection or human feedback. A core design principle of our system is the absence of privileged game state.
The agent receives no ground-truth map information; all navigation decisions, including those made by our pathfinding module, are computed exclusively from partially observed tiles explored during gameplay. This constraint closely mirrors human visual perception and ensures that planning and control emerge from observation-driven reasoning rather than access to internal game mechanics. To support high-level strategic decision-making, we introduce a Guidebook system inspired by how human speedrunners consult established guides. The Guidebook provides milestone-specific knowledge, such as traveling to Route 103 to meet the rival and selecting Mudkip for type advantage against the Rock-type gym leader. For goal specification, we employ image goals: reference images of destination locations such as Professor Birch's Lab and Dad's Gym, which are visually indistinct buildings that are difficult to identify without prior game knowledge. These image goals enable the agent to visually recognize objectives and ground its actions in perceptual similarity. The overall architecture is illustrated in Figure 22.

Rather than relying solely on low-level action tokens (e.g., "UP", "LEFT", "A"), we augment the agent with a set of auxiliary tools that allow it to autonomously reason, plan over long horizons, and adapt its computation, thereby better leveraging its decision-making capacity for efficient speedrunning. Specifically, we introduce three auxiliary tools:

Pathfinding Tool: This tool provides two navigation interfaces: coordinate-based navigation for known destinations, and directional exploration when exact coordinates are unknown. The pathfinder computes optimal trajectories on the observed map grid with the A* algorithm, incorporating game-specific constraints including one-way ledges and dynamic NPC collision avoidance.
Memory Tool: This tool supports long-horizon planning by enabling the agent to explicitly persist critical state information. To prevent the accumulation of stale or task-irrelevant information, we apply a sliding-window policy that retains memory only over the most recent 10 conversation turns.

Thinking Tool: This tool implements adaptive computation by toggling the model's reasoning budget, allocating 2048 tokens for complex decisions while defaulting to 512 tokens for routine navigation.

F Environment and System Details

F.1 RPG System Comparison

We analyze Pokémon RPG AI systems through a five-dimensional categorical framework S(A) = (S, T, M, F, Φ), characterizing state representation, tools, memory architecture, feedback mechanisms, and fine-tuning approach. Table 2 illustrates this decomposition for each system as it was initially popularized.² For the current capabilities of the X Plays Pokémon systems, we encourage readers to view the individual creators' streams directly. Many of these X Plays Pokémon baselines have completed numerous distinct Pokémon RPG games, which is very impressive from the perspective of the ML community. We recommend standardization among subsections of the game for easier comparison, with state standardized by PokéAgent. Across X Plays Pokémon, systems vary dramatically along each dimension: state representations range from visual frames to structured RAM extraction; tool availability ranges from basic button inputs to full planning suites with code execution; memory architectures span raw context windows to engineered summarization pipelines. Critically, even the reported evaluation metrics, variously "steps," "actions," and "turns," lack shared definitions across projects. These design choices dramatically affect measured performance, making cross-system comparison scientifically problematic without standardization.
Table 2: Analysis of Pokémon RPG AI systems using the (S, T, M, F, Φ) framework, characterized at the version each system was initially popularized. Heterogeneity across dimensions makes direct performance comparison methodologically unsound without standardization.

| System | State (S) | Tools (T) | Memory (M) | Feedback (F) | Fine-tuning (Φ) |
|---|---|---|---|---|---|
| Claude Plays Pokémon† | Frame + location, walkable tiles, party stats | Pathfinder, knowledge base | File-based summaries + knowledge base | ReAct loop | Zero-shot |
| GPT Plays Pokémon† | Frame + location, party stats, money, tile colors | Button inputs, self-built minimap | Goal tracker + notepad | Varied | Zero-shot |
| Gemini Plays Pokémon† | Frame + object positions, tile properties, navigability | Self-generated tools, code exec, pathfinder agent | Notepad + map markers + mental map | Varied | Zero-shot |
| Nunu AI | Frame + full parsed game state (map, party, items, NPCs) | Navigation, planning tools | Persistent memory store | ReAct loop + Twitch chat | Zero-shot |
| CLI Agents‡ | Frame + map, party, bag | Game MCP tools (get state, press buttons, pathfinding) | Agent context window | ReAct loop | Zero-shot |
| PokéAgent | Frame + map, party, bag | Full MCP tool suite (pathfinding, memory, progress summary, reflection, etc.) | Knowledge base + action history window | Multi-agent self-reflection | Zero-shot |

† Community-built harnesses, not official products of the respective model providers.
‡ Claude Code, Codex CLI, Gemini CLI, evaluated on the PokéAgent Speedrunning Track with standardized MCP tool access.

The key architectural distinction is in feedback and reflection. The "X Plays Pokémon" systems operate as observe-act loops: the model receives an observation, selects an action, and observes the result, with no explicit mechanism for evaluating its own decisions or recovering from errors.
In contrast, PokéAgent employs a multi-agent self-reflection system in which a dedicated critic agent evaluates action outcomes, detects suboptimal play, and triggers strategy revision, enabling error recovery rather than error compounding. This distinction is not merely taxonomic: pure observe-act agents are known to exhibit "panic behavior" [2], where a single tactical mistake cascades into increasingly poor decisions, a failure mode that self-reflection explicitly mitigates.

A further distinction is in execution model. The "X Plays Pokémon" systems all freeze the emulator between actions, giving the model unlimited deliberation time per step; wall-clock runtimes of hundreds of hours are common (e.g., Gemini's 406-hour playthrough of Pokémon Blue). None penalize slow inference in their reported metrics, fully decoupling computational cost from any notion of real-time play. The PokéAgent Speedrunning Track, by contrast, measures wall-clock time as a primary metric: the emulator runs continuously and agents must issue actions in real time, penalizing slow inference and long reasoning chains. This makes speedrunning performance a joint measure of decision quality and computational efficiency, more closely reflecting the practical constraints of deploying agents in interactive environments. Among contemporary RPG systems, progress has been rapid but fragmented.

² These systems are under active development and have since adopted features (minimal additional information and self-reflection mechanisms) that were initially introduced by the PokéAgent project. We characterize each system at the version contemporaneous with PokéAgent's development (mid 2025), as later iterations increasingly converge on similar design choices (such as using sub-agents for feedback) without being completely open-source at the time of writing.
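The feedback distinction above can be made concrete with a toy sketch. All components here (the environment, policy, and critic) are hypothetical stand-ins, not PokéAgent's actual architecture; the point is only the control-flow difference between the two loops:

```python
# Toy contrast between a pure observe-act loop and a critic-augmented loop.
# ToyEnv, bad_policy, and worse are illustrative stand-ins.

class ToyEnv:
    """Counter world: action +1 helps, action -1 hurts."""
    def __init__(self):
        self.score = 0
    def step(self, action):
        self.score += action
        return self.score

def observe_act_loop(env, policy, steps):
    # Pure observe-act: act, observe, repeat; no self-evaluation,
    # so a bad policy compounds its mistake every step.
    obs = 0
    for _ in range(steps):
        obs = env.step(policy(obs))
    return env.score

def reflective_loop(env, policy, critic, steps):
    # Critic-augmented: evaluate each outcome; revise the strategy
    # on a suboptimal result instead of compounding the error.
    obs, strategy = 0, +1
    for _ in range(steps):
        action = strategy * policy(obs)
        new_obs = env.step(action)
        if critic(obs, new_obs):   # outcome got worse?
            strategy = -strategy   # revise before compounding
        obs = new_obs
    return env.score

bad_policy = lambda obs: -1              # always picks the harmful action
worse = lambda before, after: after < before
print(observe_act_loop(ToyEnv(), bad_policy, 10))          # -10
print(reflective_loop(ToyEnv(), bad_policy, worse, 10))    # 8
```

The observe-act loop rides the mistake all the way down, while the reflective loop pays for one bad step and then recovers, which is the error-recovery-versus-error-compounding behavior described above.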
By early 2026, frontier models had completed full Pokémon playthroughs across multiple games, including Red, Blue, Crystal, and Emerald, using a variety of harness architectures. Notable milestones include Gemini 2.5 Pro completing Pokémon Blue in approximately 406 hours [2, 3], GPT-5 finishing Pokémon Red in 6,470 steps [4], Gemini 3 Pro completing Pokémon Crystal [40], and later model iterations (GPT-5.1, GPT-5.2, Gemini 3 Flash) completing Crystal and Emerald as well. Claude Plays Pokémon [1] used chain-of-thought reasoning with a harness to complete a small section of the game over 35,000 actions, where its predecessor without these capabilities failed to exit the starting location. However, these achievements resist direct comparison: the underlying harnesses differ fundamentally in state representation, tool access, and action granularity, and even the reported metrics, variously "steps," "actions," and "turns," lack shared definitions across projects. The PokéAgent Challenge Speedrunning Track addresses precisely this gap, providing a standardized state and evaluation protocol that enables verified, apples-to-apples comparisons across models and harnesses. See Appendix D for detailed architecture descriptions of our baselines, including the PokéAgent multi-agent orchestration system.

F.1.1 Speedrunning: Evaluation Interface

Figure 23: Speedrunning Track Evaluation Interface. The interface displays: (1) the Pokémon Emerald game screen showing a trainer battle with move selection (top left), (2) agent reasoning traces with timestamped decision logs detailing battle strategy and action rationale (right panel), (3) current party composition with sprite previews (bottom left), and (4) a map overview with action history (bottom right). This visualization enables real-time debugging of agent behavior across the thousands of timesteps required for speedrunning.
F.2 LLM Baseline Token Usage and Cost

Table 3 reports token consumption and API cost for each LLM baseline on the Gen 9 OU Extended Timer ladder. Costs vary by over 70× across models: GPT-5.2 is the most expensive at $1.247 per game, while DeepSeek V3 and Qwen3.5 Plus cost roughly $0.015. Output token counts per turn also vary dramatically: MiniMax produces ∼6.2K completion tokens per turn, whereas DeepSeek V3 and Qwen3.5 Plus produce fewer than 10. These differences highlight that raw per-game cost is driven primarily by output pricing and reasoning verbosity rather than game length.

Table 3: LLM baseline token usage and API cost on the Gen 9 OU Extended Timer ladder. Input/output tokens are per-turn averages. Models sorted by GXE rank (descending).

| Model | Avg Input Tok/Turn | Avg Output Tok/Turn | Avg Turns | Avg Cost/Game | Avg Cost/Turn | Total Cost |
|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | 1,245 | 54 | 21.1 | $0.066 | $0.0031 | $8.86 |
| GPT-5.2 | 2,472 | 4,270 | 17.8 | $1.247 | $0.0702 | $162.16 |
| Gemini 3 Flash | 1,265 | 76 | 23.7 | $0.020 | $0.0009 | $0.90 |
| GLM-5 | 1,658 | 960 | 18.3 | $0.069 | $0.0038 | $7.98 |
| Gemini 3 Pro | 1,211 | 115 | 23.9 | $0.091 | $0.0038 | $3.72 |
| Claude Opus 4.6 | 2,110 | 137 | 24.4 | $0.341 | $0.0140 | $73.59 |
| Grok-3 Mini | 1,902 | 1,262 | 23.7 | $0.029 | $0.0012 | $4.27 |
| Claude Sonnet 4.6 | 2,234 | 132 | 23.7 | $0.206 | $0.0087 | $34.83 |
| Grok-3 | 1,971 | 9 | 22.2 | $0.134 | $0.0061 | $24.96 |
| MiniMax M2.5 | 2,327 | 6,218 | 18.3 | $0.120 | $0.0065 | $1.31 |
| Hermes 4 405B | 2,310 | 91 | 21.7 | $0.056 | $0.0026 | $7.22 |
| DeepSeek V3 | 2,238 | 9 | 23.5 | $0.017 | $0.0007 | $3.42 |
| Kimi K2.5 | 1,900 | 4,510 | 19.2 | $0.207 | $0.0108 | $3.72 |
| Qwen3.5 Plus | 2,306 | 8 | 22.9 | $0.014 | $0.0006 | $2.55 |

F.3 Speedrunning Track Task Diversity

G State Space Complexity Derivation

This appendix derives combinatorial upper bounds on Pokémon team configuration and battle state spaces for Gen 1 OU, Gen 9 OU, and Gen 9 VGC. We clearly distinguish exact quantities (marked E), exact upper bounds (marked E upper bound), and approximations (marked A), and justify every choice.
Moves unavailable in Generation 9 (isNonstandard: "Past" in the Pokémon Showdown source) are excluded from the Gen 9 derivations. All arithmetic has been verified programmatically.

G.1 EV Spread Counting

Setup. Each Pokémon distributes Effort Values (EVs) across 6 stats. EVs only affect the stat formula in multiples of 4, so we count in units of 4. Let $x_i \in \mathbb{Z}_{\geq 0}$ be the number of 4-EV units allocated to stat $i$. The constraints are:
$$\sum_{i=1}^{6} x_i \leq 127, \qquad 0 \leq x_i \leq 63. \tag{1}$$
The upper bounds reflect the per-stat cap of $252\,\mathrm{EV} = 63 \times 4$ and the total budget of $508\,\mathrm{EV} = 127 \times 4$.

Exact count via inclusion-exclusion. Introduce a slack variable $x_7 \geq 0$ so that $\sum_{i=1}^{7} x_i = 127$. Stars-and-bars without the per-stat cap gives $\binom{133}{6} = 6{,}856{,}577{,}728$. Applying inclusion-exclusion on $x_i \geq 64$ (substitute $y_i = x_i - 64$, remaining sum $= 63$) gives $\binom{69}{6} = 119{,}877{,}472$ per capped stat. Two simultaneously capped stats would require a total $\geq 128 > 127$, so higher-order terms vanish.
$$|E| = \binom{133}{6} - 6\binom{69}{6} = 6{,}856{,}577{,}728 - 719{,}264{,}832 = 6{,}137{,}312{,}896. \tag{2}$$
(E) All arithmetic exact; no approximation.

G.2 Generation 9 Team Configuration Space

This bound applies to both Gen 9 OU and Gen 9 VGC, which draw from the same Pokémon pool.

Figure 24: Speedrunning Track Task Diversity. Directed graph showing RPG subtask categories and their dependencies. Core tasks include overworld navigation, battle encounters (wild and trainer), gym puzzles, menu management, and NPC dialogue. Edges indicate how completing one task type enables or requires others; for example, navigation leads to encounters, battles yield experience for team building, and menu interactions manage resources needed for all other tasks.

Species (E). pokedex.ts contains 1,329 standard entries for Generation 9: 1,025 base Pokémon (National Dex #1–1025) plus 304 alternate formes.
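The appendix states that all arithmetic was verified programmatically; the inclusion-exclusion count of Eq. (2) can be reproduced with Python's `math.comb`, together with a brute-force cross-check of the counting scheme on a small instance (a sketch of such a check, not the paper's actual verification script):

```python
from math import comb

# Exact count of EV spreads (Eq. 2): 6 capped stats plus one slack
# variable summing to 127, each stat at most 63 units of 4 EVs.
ev_spreads = comb(133, 6) - 6 * comb(69, 6)
print(ev_spreads)  # 6137312896

def incl_excl(n_stats, budget, cap):
    """Count nonneg. integer vectors with sum <= budget, each entry <= cap."""
    total = 0
    for k in range(n_stats + 1):
        rem = budget - k * (cap + 1)
        if rem < 0:
            break  # too many capped stats: no solutions (higher terms vanish)
        total += (-1) ** k * comb(n_stats, k) * comb(rem + n_stats, n_stats)
    return total

def brute(n_stats, budget, cap):
    """Direct enumeration, feasible only for tiny instances."""
    from itertools import product
    return sum(1 for xs in product(range(cap + 1), repeat=n_stats)
               if sum(xs) <= budget)

# The general formula agrees with brute force on a scaled-down instance
# and reproduces Eq. (2) at full size.
assert brute(3, 10, 4) == incl_excl(3, 10, 4)
assert incl_excl(6, 127, 63) == ev_spreads
```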
Under Species Clause (no duplicates, unordered team of 6):
$$\binom{1329}{6} = 7{,}566{,}741{,}017{,}135{,}560 \approx 7.57 \times 10^{15}. \tag{3}$$

Movesets (E upper bound). Mew has the largest movepool in Generation 9: exactly 375 moves. Using Mew's pool as the upper bound for all 6 team members:
$$\binom{375}{4}^{6} = 810{,}855{,}375^{6} \approx 2.84 \times 10^{53}. \tag{4}$$

Abilities (E upper bound). Maximum 3 abilities per species (two standard plus one Hidden Ability):
$$3^{6} = 729. \tag{5}$$

Individual Values (E). Each stat has an IV in $\{0, \ldots, 31\}$:
$$(32^{6})^{6} = 32^{36} = 2^{180} \approx 1.53 \times 10^{54}. \tag{6}$$

Effort Values (E). Using the exact count from Section G.1:
$$|E|^{6} = 6{,}137{,}312{,}896^{6} \approx 5.34 \times 10^{58}. \tag{7}$$

Natures (E). 25 natures exist, but 5 are neutral (all producing identical stat modifiers in battle), yielding $20 + 1 = 21$ functionally distinct natures:
$$21^{6} = 85{,}766{,}121. \tag{8}$$

Held Items (E). Generation 9 contains 248 standard held items:
$$248^{6} = 232{,}653{,}764{,}952{,}064 \approx 2.33 \times 10^{14}. \tag{9}$$

Terastallization (E). Each Pokémon may Terastallize into any of 19 types (the 18 standard types plus Stellar):
$$19^{6} = 47{,}045{,}881. \tag{10}$$

Gen 9 team space.
$$T_{\mathrm{Gen9}} \approx \underbrace{7.57 \times 10^{15}}_{\text{species}} \times \underbrace{2.84 \times 10^{53}}_{\text{moves}} \times \underbrace{7.29 \times 10^{2}}_{\text{abilities}} \times \underbrace{1.53 \times 10^{54}}_{\text{IVs}} \times \underbrace{5.34 \times 10^{58}}_{\text{EVs}} \times \underbrace{8.58 \times 10^{7}}_{\text{natures}} \times \underbrace{2.33 \times 10^{14}}_{\text{items}} \times \underbrace{4.70 \times 10^{7}}_{\text{Tera}} \approx 10^{215}. \tag{11}$$
The moveset factor is an upper bound (using Mew's movepool for all species); all other factors are exact. The only approximation is the final $\log_{10}$ rounding.

G.3 Generation 1 OU (RBY) Team Configuration Space

Gen 1 lacks abilities, held items, natures, and Terastallization, but uses distinct mechanics for stats.

Species (E).
$$\binom{151}{6} = 14{,}888{,}600{,}755 \approx 1.49 \times 10^{10}. \tag{12}$$

Movesets (E). Generation 1 contains 164 standard battle moves. Mew can learn all of them via TMs/HMs; using the full 164-move pool for all 6 Pokémon:
$$\binom{164}{4}^{6} = 29{,}051{,}001^{6} \approx 6.01 \times 10^{44}. \tag{13}$$

Determinant Values (DVs) (E). Attack, Defense, Speed, and Special DVs are each in $\{0, \ldots, 15\}$; the HP DV is determined by the others. In competitive Gen 1 play, the Defense, Speed, and Special DVs are always set to 15 (their maximum), as there is no strategic reason not to maximize them. The sole exception is the Attack DV, which is sometimes set to 0 on special attackers to minimize self-damage from Confusion. Thus Attack DV $\in \{0, 15\}$ (2 values) and all other DVs are fixed, giving 2 choices per Pokémon:
$$2^{6} = 64. \tag{14}$$

Stat Experience (E). Each stat has Stat Experience in $\{0, \ldots, 65535\}$ (an unsigned 16-bit integer), but in competitive Gen 1 play all Stat Experience values are always maxed at 65,535 via full EV training; there is no strategic reason to leave any stat below maximum. This factor therefore collapses to 1 and does not contribute to the team space.

Gen 1 team space.
$$T_{\mathrm{Gen1}} = \underbrace{\binom{151}{6}}_{E} \times \underbrace{\binom{164}{4}^{6}}_{E} \times \underbrace{2^{6}}_{E} \approx 1.49 \times 10^{10} \times 6.01 \times 10^{44} \times 64 \approx 10^{57}. \tag{15}$$

G.4 Battle State Spaces

Common components.

HP (A). Current HP is an integer from 0 to the maximum; we approximate with a representative max HP of 300, giving 301 states (0–300) per Pokémon:
$$301^{12} \approx 5.53 \times 10^{29}. \quad (A) \tag{16}$$
This is the only approximation in the derivation: actual max HP is determined by species, IVs, and EVs (already counted in the team space) and ranges from 1 (Shedinja) to ∼714 (Blissey). The representative value of 300 falls within the typical competitive range and does not materially affect the order-of-magnitude estimate.

Gen 9 field conditions (E). Weather: $1 + 4 \times 8 + 3 = 36$ states (None; Sun/Rain/Sand/Snow with 1–8 turns; Harsh Sun/Heavy Rain/Strong Winds). Terrain: $1 + 4 \times 8 = 33$ states (None; Electric/Grassy/Misty/Psychic with 1–8 turns).

Side conditions (E upper bound). Per-side conditions, each independent: Hazards: $2 \times 4 \times 3 \times 2 = 48$ states (Stealth Rock on/off, Spikes 0–3, Toxic Spikes 0–2, Sticky Web on/off).
Screens: $9^{3} = 729$ states (Reflect/Light Screen/Aurora Veil, each 0–8 turns). Tailwind: 5 states (off or 1–4 turns). Safeguard: 6 states (off or 1–5 turns). Mist: 6 states (off or 1–5 turns). Per-side total: $48 \times 729 \times 5 \times 6 \times 6 = 6{,}298{,}560$. Both sides: $6{,}298{,}560^{2} \approx 10^{13.6}$.

Terastallization state (E). Each player may Tera at most once per battle; state = (whether Tera used) × (which of 6 Pokémon Tera'd, if applicable):
$$(1 + 6)^{2} = 49. \quad (E) \tag{17}$$

Pseudo-weather (E). Turn-counted field-level volatile conditions: Trick Room: 6 states (off; 1–5 turns remaining). Gravity: 6 states. Magic Room: 6 states. Wonder Room: 6 states. Fairy Lock: 3 states (off; 1–2 turns remaining).
$$P_{\mathrm{Gen9}} = 6^{4} \times 3 = 3{,}888. \quad (E) \tag{18}$$

Slot conditions (E). Conditions tied to a battlefield position (slot), not to the Pokémon occupying it. In singles, there are 2 slots (one per side): Wish: 2 (pending or not). Future Sight/Doom Desire: 3 (off; 1 or 2 turns remaining). Healing Wish: 2 (pending or not). Per slot: $2 \times 3 \times 2 = 12$. Two slots: $12^{2} = 144$.

Per-active volatile statuses (E upper bound). Each active Pokémon may simultaneously carry multiple volatile conditions. We systematically enumerate all volatiles from the Pokémon Showdown source code (conditions.ts, moves.ts, abilities.ts), grouping mutually exclusive states.

Independent binary/counter volatiles (can all coexist). We restrict to moves available in Generation 9; moves marked isNonstandard: "Past" in the Showdown source are excluded.³ Confusion (5: off or 2–5 turns), Attract (2), Leech Seed (2), Substitute (2), Taunt (5: off or 1–4 turns), Encore (4: off or 1–3 turns), Disable (5: off or 1–4 turns), Torment (2), Perish Song (4: off or 1–3 counter), Aqua Ring (2), Ingrain (2), Focus Energy (2), Yawn (3: off or 1–2 turns), Curse (2), No Retreat (2), Tar Shot (2), Salt Cure (2), Syrup Bomb (4: off or 1–3 turns), Imprison (2), Charge (2), Minimize (2), Defense Curl (2), Stockpile (4: 0–3 layers), Glaive Rush (2), Gastro Acid (2), Power Trick (2), Transform (2), Trapped (2: Mean Look/Block/Shadow Tag), Throat Chop (3: off or 1–2 turns), Lock-On/Mind Reader (2).
$$V_{\mathrm{indep}} \approx 6.04 \times 10^{11} \quad \text{(product of all 30 factors above)}. \tag{19}$$

Mutually exclusive groups (at most one state per group):
Move locks (a Pokémon can be locked into at most one multi-turn move): None (1), locked move/Outrage etc. (1–3 turns = 3), two-turn move/Fly/Dig (1), must recharge/Hyper Beam (1), Uproar (1–3 turns = 3), Rollout (1–5 hits = 5), Ice Ball (1–5 hits = 5): $1 + 3 + 1 + 1 + 3 + 5 + 5 = 19$ states.
Protection moves (at most one per turn): None (1) + Protect, Baneful Bunker, Burning Bulwark, King's Shield, Obstruct, Silk Trap, Spiky Shield, Endure (8 variants) = 9 states.
Grounding/levitation (mutually exclusive): None (1), Smack Down (1), Magnet Rise (1–5 turns = 5) = 7 states.
Ability-specific volatiles (a Pokémon has exactly one ability, so at most one applies): None (1), Flash Fire active (1), Unburden active (1), Slow Start (1–5 turns = 5), Protosynthesis active (1), Quark Drive active (1) = 10 states.
Self-applied one-turn effects (only one move per turn): None (1), Roost (1), Destiny Bond (1) = 3 states.
Type changes: Type override (Soak, etc.): None (1) + 18 types + typeless from Burn Up (1) = 20 states. Type addition: None (1) + Grass from Forest's Curse (1) + Ghost from Trick-or-Treat (1) = 3 states.

³ Excluded Past moves: Rage, Nightmare, Embargo, Heal Block, Octolock, Telekinesis, Foresight, Miracle Eye, Snatch, Grudge, Magic Coat, Laser Focus, Bide. These could technically be invoked via Metronome, but we adopt the same competitive-play framing used elsewhere in this derivation.
Choice lock (from Choice Band/Scarf/Specs): None (1) + locked into move 1/2/3/4 (4) = 5 states. Independent of the multi-turn move locks above.
Other per-active factors: Stall counter (2: consecutive-protect tracking), partial trapping (8: off or 1–7 turns), flinch (2), Powder (2), Electrify (2).

Combining all groups per active Pokémon:
$$V_{\mathrm{Gen9}} = V_{\mathrm{indep}} \times 19 \times 9 \times 5 \times 2 \times 20 \times 3 \times 8 \times 2 \times 7 \times 10 \times 3 \times 2 \times 2 \approx 8.33 \times 10^{20}; \qquad V_{\mathrm{Gen9}}^{2} \approx 10^{41.8}. \quad (E\ \text{upper bound}) \tag{20}$$

Item and ability state changes (E upper bound). Items may be consumed, knocked off, or swapped (Trick/Switcheroo): 3 states per Pokémon. Abilities may be suppressed or changed (Gastro Acid, Skill Swap, Mummy): 2 states per Pokémon.
$$3^{12} \times 2^{12} \approx 10^{9.3}. \tag{21}$$

PP. We omit move PP from the state space count. In competitive play, PP depletion is strategically relevant only in niche stalling scenarios; the vast majority of games end well before any move runs out of PP. Including binary PP tracking ($2^{48} \approx 10^{14.4}$) would increase all totals by ∼14 orders of magnitude but would not meaningfully reflect the strategic state of typical competitive battles.

Non-volatile status conditions (E upper bound). Each Pokémon carries at most one non-volatile status, which persists through switches. In Gen 9 the possible states per Pokémon are: Healthy (1), Burn (1), Freeze (1), Paralysis (1), Poison (1), Toxic (1), Sleep with turn counter 1–3 (3):
$$C_{\mathrm{Gen9}} = 9^{12} \approx 2.82 \times 10^{11}. \quad (E\ \text{upper bound}) \tag{22}$$

Gen 9 OU (singles). Active Pokémon (E). One of 6 active per side: $6 \times 6 = 36$ arrangements.
Stat stages (E). 2 active Pokémon × 7 modifiable stats (Atk, Def, SpA, SpD, Spe, Acc, Eva), each $\in [-6, +6]$ (13 values):
$$13^{14} \approx 3.94 \times 10^{15}. \tag{23}$$

$$\text{Gen 9 OU battle state} = T_{\mathrm{Gen9}}^{2} \times 36 \times 301^{12} \times 13^{14} \times C_{\mathrm{Gen9}} \times 36 \times 33 \times 6{,}298{,}560^{2} \times 49 \times P_{\mathrm{Gen9}} \times 12^{2} \times V_{\mathrm{Gen9}}^{2} \times 3^{12} \times 2^{12} \approx 10^{430} \times 10^{1.6} \times 10^{29.7} \times 10^{15.6} \times 10^{11.5} \times 10^{1.6} \times 10^{1.5} \times 10^{13.6} \times 10^{1.7} \times 10^{3.6} \times 10^{2.2} \times 10^{41.8} \times 10^{5.7} \times 10^{3.6} \approx 10^{564}. \tag{24}$$

Gen 9 VGC (doubles). VGC is a doubles format: each player has two Pokémon active simultaneously, with position (left/right) mattering for targeting. Each player also chooses which 4 of 6 Pokémon to bring during team preview (hidden from the opponent).
Team preview (E). $\binom{6}{4}^{2} = 15^{2} = 225$ joint selections.
Active positions (E). Ordered pairs of active Pokémon from the 4 brought per side (left/right positions distinct): $P(4,2)^{2} = (4 \times 3)^{2} = 144$.
Stat stages (E). 4 active Pokémon × 7 modifiable stats:
$$13^{28} \approx 1.55 \times 10^{31}. \tag{25}$$

Since only 4 Pokémon per side participate in each VGC battle, HP is counted over 8 Pokémon (not 12), non-volatile status over 8, and Terastallization over the 4 brought per side: $(1 + 4)^{2} = 25$. VGC adds several doubles-specific side conditions not present in singles: Fire/Water/Grass Pledge field effects (5 states each, turn-counted), Quick Guard (2), Wide Guard (2), Crafty Shield (2), Mat Block (2). The per-side total becomes $6{,}298{,}560 \times 5^{3} \times 2^{4} = 12{,}597{,}120{,}000$, so both sides: $\sim 10^{20.2}$. Per-active volatiles gain doubles-specific additions: Helping Hand (2), redirection (Follow Me/Rage Powder/Spotlight, mutually exclusive, 4 states), Dragon Cheer (2), Ally Switch (2). Slot conditions have 4 slots (2 per side) instead of 2: $12^{4} = 20{,}736$.

$$\text{Gen 9 VGC battle state} = T_{\mathrm{Gen9}}^{2} \times 225 \times 144 \times 301^{8} \times 13^{28} \times 9^{8} \times 36 \times 33 \times SC_{\mathrm{VGC}}^{2} \times 25 \times P_{\mathrm{Gen9}} \times 12^{4} \times V_{\mathrm{VGC}}^{4} \times 3^{8} \times 2^{8} \approx 10^{430} \times 10^{2.4} \times 10^{2.2} \times 10^{19.8} \times 10^{31.2} \times 10^{7.6} \times 10^{1.6} \times 10^{1.5} \times 10^{20.2} \times 10^{1.4} \times 10^{3.6} \times 10^{4.3} \times 10^{89.7} \times 10^{3.8} \times 10^{2.4} \approx 10^{622}. \tag{26}$$

Gen 1 OU (singles). Gen 1 has no weather, terrain, entry hazards, or screens. Status conditions per Pokémon: Healthy + BRN + FRZ + PAR + PSN + bad PSN (turn counter 1–15) + SLP (turn counter 1–7) = $1 + 1 + 1 + 1 + 1 + 15 + 7 = 27$ states⁴; for 12 Pokémon: $27^{12} \approx 10^{17.2}$ (E).
Stat stages: 2 active Pokémon × 6 stats (Attack, Defense, Special, Speed, Evasion, Accuracy), each $\in [-6, +6]$: $13^{12} \approx 10^{13.4}$ (E).

Per-active volatile statuses (E upper bound). Volatile conditions that apply only to active Pokémon (cleared on switch). In Gen 1, Reflect and Light Screen are per-Pokémon volatiles rather than field-wide screens. Per active Pokémon:
Independent: Confusion (5: none or 1–4 turns), Leech Seed (2), Substitute (2), Focus Energy (2), Disable (29: off, or one of 4 moves × 1–7 turns), Reflect (2), Light Screen (2), Mist (2), Minimize (2), Transform (2), Rage (2).
Mutually exclusive move locks: None (1), Thrash lock (1–3 turns = 3), two-turn move/Fly/Dig (1), must recharge (1), Bide (1–2 turns = 2), partial trapping lock (1–4 turns = 4) = 12 states.
Other per-active: Partial trapping (defender side, 5: off or 1–4 turns), residual damage counter (16: 0–15, tracks toxic/trap damage), flinch (2).
$$V_{\mathrm{Gen1}} = (5 \times 2^{7} \times 29 \times 2^{2}) \times 12 \times 5 \times 16 \times 2 = 142{,}540{,}800 \approx 10^{8.2}. \tag{27}$$
$$V_{\mathrm{Gen1}}^{2} \approx 10^{16.3}. \quad (E\ \text{upper bound}) \tag{28}$$

$$\text{Gen 1 OU battle state} = T_{\mathrm{Gen1}}^{2} \times 36 \times 301^{12} \times 13^{12} \times 27^{12} \times V_{\mathrm{Gen1}}^{2} \approx 10^{114} \times 10^{1.6} \times 10^{29.7} \times 10^{13.4} \times 10^{17.2} \times 10^{16.3} \approx 10^{192}. \tag{29}$$

⁴ Toxic poison in Gen 1 deals $\lfloor N/16 \rfloor \times$ max-HP damage on turn $N$, so the counter 1–15 produces 15 distinct in-battle trajectories.

G.5 Summary

Table 4: State space complexity across Pokémon formats and classical games. All Pokémon components are exact (E) except HP values (marked A); status conditions and volatile statuses are exact upper bounds. Team space values are upper bounds. Battle state includes item consumption/knock-off state and ability suppression/swap state; PP is omitted as it is rarely strategic in competitive play (see text).
Volatile statuses are enumerated from the Pokémon Showdown source code with mutual-exclusion groups (protect variants, move locks, grounding/levitation, ability-specific volatiles, self-applied one-turn effects) properly handled. Gen 9 OU and VGC share the same team pool; VGC's larger battle state reflects 4 simultaneously active Pokémon (vs. 2 in singles), hidden team-preview selection, and doubles-specific side conditions and volatiles.

| Format | Team Space (1 player) | Battle State Space | Notes |
|---|---|---|---|
| Gen 1 OU (RBY) | ∼10^57 | ∼10^192 | Singles; no items/abilities/natures; competitive DV constraints |
| Gen 9 OU | ∼10^215 | ∼10^564 | Singles; Terastallization |
| Gen 9 VGC | ∼10^215 | ∼10^622 | Doubles; bring 4 of 6; team preview |
| Chess | — | ∼10^47 | — |
| Go | — | ∼10^170 | — |

G.6 Team Space in Human Metagame

The combinatorial space of Pokémon teams is vast when enumerating all legal combinations of species, moves, items, abilities, and stat spreads (Table 4). In practice, however, the effective space considered by competitive players is considerably smaller. Many options are dominated by clearly superior alternatives. For example, two moves may share the same type and role, but if one is less powerful and less reliable, there is no competitive reason to justify its selection. Similar dominance relationships apply to items, abilities, and stat spreads. Table 5 estimates this state-space reduction by restricting to options that Showdown players above a roughly median skill rating select in at least 1% of battles. A second factor is metagame convergence. Even among viable options, competitive play concentrates around successful team archetypes as players adapt to one another. Figure 25 provides an example: a small fraction of available Pokémon account for the majority of human choices. Nevertheless, the number of competitively viable combinations is exceptionally large, and team design in competitive Pokémon remains a daunting optimization problem.
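Because every factor in the derivation except HP is an exact integer, the headline magnitudes in Table 4 can be re-derived in a few lines of Python (a sketch of the programmatic verification mentioned in the text; Python's `math.log10` handles arbitrarily large integers):

```python
from math import comb, log10

# Team spaces (Eqs. 2, 11, 15)
ev = comb(133, 6) - 6 * comb(69, 6)                    # EV spreads
t9 = (comb(1329, 6) * comb(375, 4)**6 * 3**6 * 32**36  # species/moves/abilities/IVs
      * ev**6 * 21**6 * 248**6 * 19**6)                # EVs/natures/items/Tera
t1 = comb(151, 6) * comb(164, 4)**6 * 2**6             # Gen 1 team space

# Gen 9 per-active volatiles and side conditions (Eqs. 19, 20)
v_indep = 2**21 * 5**3 * 4**4 * 3**2                   # 30 independent volatiles
v9 = v_indep * 19*9*7*10*3*20*3*5*2*8*2*2*2            # mutual-exclusion groups
side9 = 48 * 729 * 5 * 6 * 6                           # per-side conditions

# Gen 9 OU battle state (Eq. 24)
ou9 = (t9**2 * 36 * 301**12 * 13**14 * 9**12           # teams/actives/HP/stages/status
       * 36 * 33 * side9**2 * 49 * 3888 * 144          # field/sides/Tera/pseudo/slots
       * v9**2 * 3**12 * 2**12)                        # volatiles/items/abilities

# Gen 9 VGC battle state (Eq. 26)
side_vgc = side9 * 5**3 * 2**4                         # Pledge fields + doubles guards
v_vgc = v9 * 2 * 4 * 2 * 2                             # Helping Hand/redirection/etc.
vgc = (t9**2 * 225 * 144 * 301**8 * 13**28 * 9**8
       * 36 * 33 * side_vgc**2 * 25 * 3888 * 12**4
       * v_vgc**4 * 3**8 * 2**8)

# Gen 1 OU battle state (Eqs. 27, 29)
v1 = (5 * 2**7 * 29 * 2**2) * 12 * 5 * 16 * 2          # = 142,540,800
ou1 = t1**2 * 36 * 301**12 * 13**12 * 27**12 * v1**2

print([round(log10(x)) for x in (t9, t1, ou9, vgc, ou1)])
# [215, 57, 564, 622, 192]
```

The printed exponents match the Team Space and Battle State Space columns of Table 4.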
H Extended Discussion

H.1 Additional Findings

Vision Remains a Fundamental Limitation. Community observations consistently identified vision as the primary failure mode: "Claude's vision is beyond fixing with a prompt—it barely knows what up and down are." This suggests that game-playing benchmarks will continue to challenge AI systems until vision-language integration improves substantially. Concurrent independent work comparing Gemini 3 Pro and 2.5 Pro on Pokémon Crystal [40] corroborates this finding, reporting that vision-related errors remained a dominant failure mode across model generations despite improvements in reasoning capability.

Table 5: Effective team space on Showdown Gen 1 OU and Gen 9 OU ladders. Derived from public usage statistics (2014–2025, Glicko-1 ≥ 1500). Only options appearing on at least 1% of teams (for species) or 1% of that species' sets (for moves, items, abilities, and EV/nature spreads) are counted.

| | Gen 1 OU | Gen 9 OU |
|---|---|---|
| Species in pool | 45 | 117 |
| Avg. moves per species | 13.0 | 14.0 |
| Avg. items per species | — | 6.6 |
| Avg. abilities per species | — | 1.7 |
| Avg. EV/nature spreads per species | — | 7.8 |
| Team state space (log10) | 23.5 | 37.8 |

Figure 25: Empirical usage distributions in Gen 1 OU and Gen 9 OU. Derived from Smogon usage statistics (2014–2025, rank ≥ 1500). Left: cumulative species usage, the fraction of all team-slot appearances accounted for by the top-k most-used species. Right: cumulative move usage per species, averaged over the top 50 species in each format.

Frontier vs. Open-Source Performance Gap.
Our baseline evaluation (Figure 3) reveals that frontier models significantly outperform open-source alternatives in direct prompting, with Gemini 3 Flash achieving 71% GXE compared to 29% for Gemma3-1B without a harness. However, the PokeChamp harness substantially narrows this gap: Gemma3-1B with a harness reaches 53% GXE, demonstrating that architectural support can partially compensate for raw model capability. This finding has practical implications for benchmark accessibility: researchers without frontier model access can still achieve meaningful results through careful harness design. The comparison also validates Pokémon as an OOD evaluation: if models had seen extensive Pokémon training data, we would expect the performance gap between model scales to be smaller, as smaller models would have relevant cached knowledge. Instead, the gap reflects genuine reasoning capability differences that a harness can only partially bridge.

Model Behavior Diversity. Different model families exhibited distinct failure patterns:

• Claude models showed "memory corruption cascades": once false information entered context, they followed incorrect paths for extended periods.
• Gemini models exhibited "roadblock" behavior, oscillating between contradictory goals when facing conflicting objectives.
• GPT models demonstrated excessive plan commitment, persisting with suboptimal strategies for hours or days.
• Qwen models exhibited "computational paralysis", entering recursive damage-calculation loops and getting stuck verifying basic type matchups ("water vs fire?") while the battle state evolved. This failure mode is striking: in high-stakes sequential decision-making, extended deliberation can be catastrophic.

Chain-of-Thought Visualization Enables Failure Diagnosis. We deployed a live ladder stream that visualizes LLM chain-of-thought synchronized with battle state. This revealed failure modes invisible to outcome-only metrics.
Figure 26 shows Gemma-3-12B vs. Qwen-3-14B on Gen 9 OU: while Gemma articulates strategy and executes decisively, Qwen's reasoning trace shows real-time "panic", recursive verification loops that consume the decision window. Without CoT visualization, Qwen's poor performance would appear as generic "weak play" rather than a specific, diagnosable pathology.

Figure 26: Chain-of-Thought Visualization Reveals Failure Modes. Screenshots from our live ladder stream showing synchronized CoT reasoning during a Gen 9 OU battle between Gemma-3-12B (a) and Qwen-3-14B (b). In Turn 2, Gemma articulates a brief strategic assessment ("boosting Special Attack will be beneficial") and executes Nasty Plot. By Turn 4, Qwen's reasoning trace fills the panel with recursive damage calculations (enumerating type matchups, computing exact damage ranges, and deliberating between moves), exhibiting the "computational paralysis" described in the text. This failure mode is invisible to win/loss statistics alone.

H.2 Broader Impact

H.2.1 Beyond Pokémon: From Game Agents to Coding Agents

An unexpected outcome of the X Plays Pokémon phenomenon has been the transfer of harness techniques to other domains. The modular context-engineering harness that emerged as essential for game-playing agents has influenced the design of autonomous coding agents. Systems like Claude Code now incorporate similar harness patterns: persistent memory across sessions, hierarchical planning for complex tasks, and structured perception of codebases. Recent developments have entirely overlapped with autonomous gaming agents as coding agents have been run in 24/7 loops with the "Ralph Wiggum" extension and "OpenClaw". This convergence suggests that game-playing benchmarks serve as verifiable testbeds for developing autonomous agent capabilities that transfer to practical applications.

I Author Roles and Contributions

Individual Contributors.
Seth Karten 1, Jake Grigsby 2, Tersoo Upaa 1

PokéAgent Winning Teams (alphabetical). Junik Bae 6, Seonghun Hong 14, Hyunyoung Jeong 14, Jaeyoon Jung 14, Kun Kerdthaisong 15, Gyungbo Kim 9, Hyeokgi Kim 14, Yujin Kim 9, Eunju Kwon 9, Dongyu Liu 7, Patrick Mariglia 8, Sangyeon Park 9, Benedikt Schink 12, Xianwei Shi 7, Anthony Sistilli 11, Joseph Twin 13, Arian Urdu 12, Matin Urdu 12, Qiao Wang 10, Ling Wu 10, Wenli Zhang 7, Kunsheng Zhou 7

Advisory Board. Stephanie Milani 3,4, Kiran Vodrahalli 5, Amy Zhang 2, Fei Fang 3, Yuke Zhu 2, Chi Jin 1

Affiliations. 1 Princeton University, 2 UT-Austin, 3 Carnegie Mellon University, 4 New York University, 5 Google DeepMind, 6 Team Heatz, 7 Team PA-Agent, 8 Team FoulPlay, 9 Team 4thLesson, 10 Team Q, 11 Team Anthonys, 12 Team Hamburg, 13 Team Porygon2AI, 14 Team Deepest, 15 Team August

Correspondence. Seth Karten and Jake Grigsby {sethkarten@princeton.edu, grigsby@cs.utexas.edu}