Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact-oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi-agent social deduction game, namely the Undercover game. In this setting, LLM-based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others' statements. Model performance is evaluated through both game-level outcomes and the semantic quality of generated descriptions. Furthermore, CK-Arena leverages the interaction process to automatically construct high-quality question-answering data for fine-grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories and is not strictly aligned with overall model capability. The data and code are available at the project homepage: https://ck-arena.site.


💡 Research Summary

The paper introduces CK‑Arena, a novel benchmark designed to assess whether large language models (LLMs) truly grasp conceptual knowledge rather than merely memorizing surface patterns. Existing benchmarks are largely static, fact‑oriented, and vulnerable to data leakage; they fail to probe fine‑grained semantic understanding. CK‑Arena addresses these shortcomings by embedding LLMs in a dynamic, multi‑agent social deduction game called “Undercover.” In each game, a group of agents (four “civilians” and two “undercover” players) receive concept words that are closely related but not identical. Civilians share a common target concept, while undercover agents receive a subtly different one. Over several rounds, agents generate short textual descriptions of their assigned concepts, infer the concepts of others from partial clues, and vote to eliminate suspected undercover players. Success is measured both by game‑level outcomes (win rate, survival rate) and by the semantic quality of the generated descriptions, evaluated with automated language judges and human calibration.
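The round structure described above — describe, vote, eliminate, repeat until one side wins — can be sketched as a simple loop. This is an illustrative toy, not the paper's implementation: the `Agent` class, the naive minority-vote heuristic, and the `run_demo` setup are inventions for demonstration, and in CK‑Arena the `describe` and `vote` callbacks would be LLM calls rather than fixed rules.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    concept: str      # the concept word this agent must describe
    undercover: bool  # hidden role, unknown to the other agents
    alive: bool = True

def play_round(agents, describe, vote):
    """One round: living agents describe their concepts, then vote out a suspect."""
    living = [a for a in agents if a.alive]
    statements = {a.name: describe(a) for a in living}
    tally = Counter(vote(a, statements) for a in living)
    eliminated, _ = tally.most_common(1)[0]
    for a in living:
        if a.name == eliminated:
            a.alive = False
    return eliminated

def winner(agents):
    """Civilians win when all undercover agents are out; undercover win at parity."""
    undercover = sum(a.undercover and a.alive for a in agents)
    civilians = sum((not a.undercover) and a.alive for a in agents)
    if undercover == 0:
        return "civilians"
    if undercover >= civilians:
        return "undercover"
    return None  # game continues

def run_demo():
    # Toy setup: civilians hold "cat", undercover agents hold "tiger".
    agents = [Agent(f"c{i}", "cat", False) for i in range(4)]
    agents += [Agent("u1", "tiger", True), Agent("u2", "tiger", True)]
    describe = lambda a: a.concept  # stand-in for an LLM-generated description

    def vote(a, statements):
        # Naive heuristic: accuse someone whose statement is in the minority.
        counts = Counter(statements.values())
        minority = min(counts, key=counts.get)
        suspects = [n for n, s in statements.items() if s == minority and n != a.name]
        return suspects[0] if suspects else random.choice(
            [n for n in statements if n != a.name])

    while winner(agents) is None:
        play_round(agents, describe, vote)
    return winner(agents)
```

With the transparent descriptions used here, the minority vote quickly exposes both undercover agents; the interesting regime in CK‑Arena is when descriptions are deliberately partial, so agents must balance being informative against revealing their role.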

A key innovation is the automatic extraction of gameplay logs to construct a snapshot QA benchmark. From each round, the system derives cross‑concept inference, fine‑grained comparison, and outlier detection questions, enabling detailed diagnostic analysis. The authors report a strong correlation (Spearman ρ = 0.89) between QA performance and game win rates, confirming that the dynamic game outcomes reflect underlying conceptual knowledge.
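The reported alignment (Spearman ρ = 0.89) is a rank correlation between the two score lists. As a reminder of how the statistic is computed — this is the textbook tie-free formula, not the authors' evaluation code — a minimal implementation:

```python
def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length score lists.

    Uses the tie-free formula rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of x[i] and y[i].
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A ρ of 0.89 therefore means models' rankings under the snapshot QA benchmark and under game win rate largely agree, even though the two evaluations measure conceptual knowledge through very different interfaces.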

The dataset comprises 529 English concept pairs (220 concrete nouns, 100 abstract nouns, 109 adverbs, 100 verbs) filtered for semantic proximity and describability. The main evaluation uses 464 game instances across twelve domains (food, landforms, animals, artifacts, tools, people/social, plants, sports, stationery, electronics, clothing, sundries). Models evaluated include GPT‑4, Claude‑2, LLaMA‑2‑70B, among others. Results reveal substantial variation: (1) performance is highly category‑dependent, with concrete noun domains yielding higher win rates than abstract or adverbial domains; (2) larger model size does not guarantee superior conceptual discrimination—some mid‑size models outperform larger ones on specific categories; (3) strategic language use differs markedly, with some models producing overly specific descriptions that expose their undercover role, while others are too vague, leading to higher survival but lower win rates. These patterns suggest that internal concept representations (token‑level co‑occurrence vs. deeper semantic structures) and prompt design critically influence outcomes.
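The paper's exact filtering pipeline is not reproduced here, but one plausible sketch of "filtered for semantic proximity" screens candidate pairs by embedding cosine similarity, keeping pairs that are close but not near-identical. The thresholds and the tiny embedding table in the test are purely illustrative assumptions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def close_pairs(embeddings, lo=0.6, hi=0.95):
    """Keep word pairs that are similar but not near-identical.

    `embeddings` maps word -> vector. The [lo, hi] band is an illustrative
    guess at "subtly different": below lo the concepts are too far apart to
    confuse, above hi they are effectively synonyms.
    """
    words = sorted(embeddings)
    kept = []
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            sim = cosine(embeddings[w1], embeddings[w2])
            if lo <= sim <= hi:
                kept.append((w1, w2, sim))
    return kept
```

In practice the describability criterion would need a second pass (e.g., checking that both words admit multiple short, non-identifying descriptions), which is harder to automate than the similarity band.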

CK‑Arena is designed for extensibility. Researchers can add domain‑specific concept sets (e.g., medical or legal terminology) or modify game rules to emphasize different strategic aspects, while the automatic QA generation pipeline ensures continuous, renewable evaluation data without extensive human annotation.

In summary, CK‑Arena provides a comprehensive, scalable framework that (1) leverages interactive multi‑agent gameplay to probe fine‑grained conceptual reasoning, (2) automatically generates diagnostic QA tasks from gameplay, and (3) delivers nuanced insights into model strengths and weaknesses beyond traditional static benchmarks. This work represents a significant step toward more realistic, dynamic evaluation of LLMs’ conceptual mastery.

