AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play **all conceivable human games**, in comparison to human players with the same level of experience, time, or other resources. We define a “human game” to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy – the “Multiverse of Human Games”. Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
💡 Research Summary
The paper tackles a fundamental problem in AI evaluation: existing benchmarks are narrow, static, and quickly saturated, failing to capture the breadth of human cognition. The authors propose using the “Multiverse of Human Games” – the set of all games that humans could design and enjoy – as a proxy for human‑level general intelligence. Games are argued to be distilled cultural artifacts that encapsulate a wide array of cognitive skills, from strategic planning and resource management to social interaction, pattern recognition, and embodied navigation. Excelling across this space would therefore require the same versatile, adaptable intelligence that characterizes an educated adult.
To make this vision operational, the authors introduce AI GameStore, a meta‑benchmark platform that automatically sources, adapts, and containerizes representative human games. The pipeline consists of three stages. First, large language models (LLMs) scrape metadata, core mechanics, objectives, and reward structures from the top charts of the Apple App Store and Steam. Second, the extracted specifications are fed into accessible game engines (e.g., Unity, Godot) to generate standardized, sandboxed environments with a unified API for observations, actions, and rewards. Third, a human‑in‑the‑loop step lets domain experts verify the fidelity of the generated games, adjust difficulty, and ensure that both AI agents and human participants interact through the same interface. This automation enables continual expansion of the benchmark without the legal and technical hurdles of directly using proprietary commercial games.
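The unified observation/action/reward interface described in the second stage can be pictured as a Gym-style environment wrapper around each containerized game. The sketch below is a minimal illustration of that idea under stated assumptions: the class and method names (`GameEnv`, `StepResult`, the frame-as-bytes observation, the 120-step episode cap) are all hypothetical, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class StepResult:
    observation: bytes            # e.g., an encoded RGB frame of the game screen
    reward: float                 # incremental score change since the last step
    done: bool                    # whether the short episode has ended
    info: dict = field(default_factory=dict)

class GameEnv:
    """Hypothetical wrapper exposing one containerized game behind a
    standard observation/action/reward interface, so that AI agents and
    human participants interact through the same channels."""

    def __init__(self, game_id: str, max_steps: int = 120):
        self.game_id = game_id
        self.max_steps = max_steps  # illustrative cap standing in for a timed episode
        self._t = 0

    def reset(self) -> bytes:
        """Start a fresh episode and return the first observation."""
        self._t = 0
        return b"<frame-0>"  # placeholder for a rendered frame

    def step(self, action: str) -> StepResult:
        """Apply one action (e.g., a tap coordinate or key press) and advance."""
        self._t += 1
        done = self._t >= self.max_steps
        return StepResult(
            observation=f"<frame-{self._t}>".encode(),
            reward=0.0,  # a real wrapper would report the game's score delta
            done=done,
        )
```

The design choice here mirrors the summary's point about standardizing heterogeneous game interfaces: once every title answers to the same `reset`/`step` contract, new games can be added without changing any agent code.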
As a proof‑of‑concept, the authors curated 100 games spanning puzzle, strategy, platformer, and casual genres, and evaluated seven state‑of‑the‑art vision‑language models (VLMs) – including GPT‑4V, LLaVA‑1.5, Flamingo, and others – against 106 human participants. Each participant and model played a two‑minute episode of each game. Performance was measured by average score, success rate, behavioral diversity, and trajectory similarity to human play. The results are stark: all VLMs achieved less than 30% of the human average score, with the best models reaching under 10% on the majority of games. The biggest deficits appeared in games that demand robust world‑model acquisition (e.g., exploration‑heavy puzzles), long‑term memory (e.g., resource‑management strategy games), and multi‑step planning (e.g., sequential quest lines). Moreover, the models required 15–20× more computation time than humans, and real‑time action games were essentially unplayable due to latency in API calls.
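The headline metric above is expressed as a fraction of the human average score. The summary does not specify the exact normalization, so the helper below assumes one common convention (as in Atari-style human-normalized scores): measure the agent's score relative to a random-play baseline and divide by the human average measured the same way. The function name and the `random_score` baseline are illustrative assumptions.

```python
def human_normalized_score(agent_score: float,
                           human_scores: list[float],
                           random_score: float = 0.0) -> float:
    """Agent score as a fraction of the human average on one game.

    Assumed convention: both the agent and the human average are measured
    above a random-play baseline, so 1.0 means human-average performance
    and 0.0 means no better than random.
    """
    human_avg = sum(human_scores) / len(human_scores)
    if human_avg == random_score:
        raise ValueError("Human average equals the random baseline; "
                         "normalization is undefined for this game.")
    return (agent_score - random_score) / (human_avg - random_score)
```

Under this convention, "less than 10% of the human average" corresponds to a normalized score below 0.1, and scores can be averaged across games without letting high-scoring titles dominate.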
These findings highlight two key insights. First, the Multiverse of Human Games is indeed a demanding, multidimensional testbed that can expose gaps in current AI systems far beyond what traditional benchmarks reveal. Second, contemporary VLMs, while impressive at static visual‑language tasks, lack integrated cognitive architectures capable of continuous world modeling, persistent memory, and adaptive planning. The AI GameStore pipeline also demonstrates a feasible path to scaling such evaluation: it mitigates copyright concerns by using synthetic, open‑source re‑implementations, and it standardizes heterogeneous game interfaces. Nevertheless, challenges remain, including ensuring engine compatibility across thousands of titles, managing the cost of human verification, and guaranteeing that generated games cover the full diversity of human cognition without bias.
The authors outline several future directions: (1) expanding the benchmark to include multi‑agent and socially interactive games to probe theory‑of‑mind and cooperation; (2) integrating meta‑learning and continual‑learning frameworks so that agents can improve across games within a shared budget; (3) enhancing human‑AI co‑creation loops to automatically design novel, high‑quality games; and (4) developing methods to detect and prevent data contamination between training corpora and benchmark games.
In sum, AI GameStore offers a scalable, open‑ended platform for measuring and driving progress toward human‑like general intelligence. By grounding evaluation in the rich, culturally relevant space of human games, it provides a more holistic, dynamic, and future‑proof alternative to existing AI benchmarks.