VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Large language models (LLMs) have shown significant promise in embodied decision-making tasks within virtual open-world environments. Nonetheless, their performance is hindered by the absence of domain-specific knowledge. Methods that finetune on large-scale domain-specific data entail prohibitive development costs. This paper introduces VistaWise, a cost-effective agent framework that integrates cross-modal domain knowledge and finetunes a dedicated object detection model for visual analysis. It reduces the requirement for domain-specific training data from millions of samples to a few hundred. VistaWise integrates visual information and textual dependencies into a cross-modal knowledge graph (KG), enabling a comprehensive and accurate understanding of multimodal environments. We also equip the agent with a retrieval-based pooling strategy to extract task-related information from the KG, and a desktop-level skill library to support direct operation of the Minecraft desktop client via mouse and keyboard inputs. Experimental results demonstrate that VistaWise achieves state-of-the-art performance across various open-world tasks, highlighting its effectiveness in reducing development costs while enhancing agent performance.
💡 Research Summary
VistaWise tackles two major bottlenecks in current Minecraft agents: the high cost of domain‑specific visual training and the reliance on external APIs for accurate environment perception. The authors fine‑tune a lightweight object detection model on fewer than 500 annotated frames extracted from gameplay videos, drastically reducing visual training data requirements compared with millions of frames used by prior works such as VPT or STEVE‑1. This detector isolates relevant environmental entities (e.g., trees, ores, inventory items) and provides precise spatial attributes (coordinates, bounding‑box size) in real time.
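The spatial attributes described above can be sketched as a small post-processing step over detector outputs. This is a minimal illustration, not the paper's implementation: the `Detection` format and the use of box area as a distance proxy are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detector output: class label plus a pixel-space bounding box."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float

def spatial_attributes(det: Detection) -> dict:
    """Derive the spatial attributes the agent reasons over: the box
    center (a coarse aiming target) and the box area (a rough proxy
    for distance -- larger boxes are usually closer)."""
    cx = (det.x1 + det.x2) / 2
    cy = (det.y1 + det.y2) / 2
    area = (det.x2 - det.x1) * (det.y2 - det.y1)
    return {"label": det.label, "center": (cx, cy),
            "area": area, "confidence": det.confidence}

tree = Detection("oak_log", 100, 80, 180, 240, 0.91)
print(spatial_attributes(tree)["center"])  # (140.0, 160.0)
```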
To mitigate hallucinations inherent to large language models (LLMs), VistaWise constructs a compact textual knowledge graph (KG) from publicly available Minecraft documentation. The KG contains only entity names and simple relational edges (e.g., “Iron Ingot → can be used to craft → Pickaxe”), deliberately omitting extraneous background text to keep prompts short and retrieval fast. The static KG encodes factual dependencies that LLMs often miss when reasoning about crafting recipes or tool usage.
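A graph this spare can be represented with nothing more than a list of (head, relation, tail) triples. The sketch below uses hypothetical entity and relation names to show the idea; the paper's actual schema may differ.

```python
# Minimal textual KG: entity names as nodes, labeled relation edges,
# and no free-form background text (keeps retrieval fast and prompts short).
kg_edges = [
    ("iron_ingot", "crafts", "iron_pickaxe"),
    ("iron_pickaxe", "mines", "diamond_ore"),
    ("oak_log", "crafts", "planks"),
]

def neighbors(entity, edges=kg_edges):
    """Return (relation, tail) pairs for an entity -- the only
    context that needs to enter the LLM prompt for that entity."""
    return [(r, t) for h, r, t in edges if h == entity]

print(neighbors("iron_ingot"))  # [('crafts', 'iron_pickaxe')]
```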
The core of the system is a cross‑modal vision‑text KG. At each timestep the detector’s outputs are embedded as dynamic attributes on the corresponding KG nodes, turning the static graph into a live representation of the world state. When the agent receives a task description, a retrieval‑based pooling mechanism selects the most relevant subgraph based on both the textual prompt and the current visual attributes. This selective feeding of information reduces token overhead, speeds up inference, and supplies the LLM with reliable, context‑specific knowledge, thereby curbing hallucinations.
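One plausible shape for this retrieval step is shown below. The edge list and the token-overlap-plus-visibility scoring heuristic are illustrative assumptions, not the paper's exact retrieval algorithm.

```python
# Sketch of retrieval-based pooling over the cross-modal KG.
kg_edges = [
    ("iron_ingot", "crafts", "iron_pickaxe"),
    ("iron_pickaxe", "mines", "diamond_ore"),
    ("oak_log", "crafts", "planks"),
]

def retrieve_subgraph(task, edges, visible, k=2):
    """Score each edge by token overlap with the task description,
    boost entities the detector currently sees, keep the top-k."""
    task_tokens = set(task.lower().split())
    def score(edge):
        h, r, t = edge
        s = sum(tok in task_tokens for e in (h, t) for tok in e.split("_"))
        s += sum(e in visible for e in (h, t))  # visual-attribute boost
        return s
    return sorted(edges, key=score, reverse=True)[:k]

# The task mentions diamonds; the detector has an iron pickaxe in view.
print(retrieve_subgraph("mine diamond ore", kg_edges, {"iron_pickaxe"}, k=1))
# [('iron_pickaxe', 'mines', 'diamond_ore')]
```

Feeding only this top-ranked slice of the graph into the prompt is what keeps token overhead low while still grounding the LLM's next decision.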
Action execution is performed entirely through a desktop‑level skill library built with PyAutoGUI. The library implements mouse‑click, drag, and keyboard commands that mimic human player inputs. The LLM, equipped with chain‑of‑thought reasoning and a memory stack that records past decisions, generates the required parameters (e.g., target coordinates, duration) for each skill call. Consequently, VistaWise operates directly on the Minecraft client without any Mineflayer‑style API, achieving a higher degree of autonomy and better generalization to environments lacking programmatic interfaces.
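A skill library in this style might look like the sketch below. In the real system the calls would go through PyAutoGUI (`pyautogui.moveTo`, `pyautogui.click`); here the backend is injectable so the skills can be exercised in a dry run, and the skill names and parameters are illustrative rather than the paper's exact API.

```python
class SkillLibrary:
    """Desktop-level skills driven by LLM-generated parameters.
    `backend` is any object exposing moveTo/click methods
    (e.g. the pyautogui module); None means dry-run only."""

    def __init__(self, backend=None):
        self.backend = backend
        self.log = []  # memory of executed actions

    def _emit(self, name, **params):
        self.log.append((name, params))
        if self.backend is not None:
            getattr(self.backend, name)(**params)

    def look_at(self, x, y, duration=0.2):
        """Aim the cursor at screen coordinates chosen by the LLM."""
        self._emit("moveTo", x=x, y=y, duration=duration)

    def mine(self, x, y):
        """Aim at a detected block, then left-click to dig it."""
        self.look_at(x, y)
        self._emit("click", x=x, y=y, button="left")

skills = SkillLibrary()   # dry run: no real input is injected
skills.mine(140, 160)
print(skills.log[0][0])   # moveTo
```

Because the skills take plain screen coordinates, the detector's bounding-box centers can be passed straight through as skill parameters, with no game-side API in the loop.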
Experiments were conducted on five representative open‑world tasks (diamond mining, wood gathering, farming, building structures, and hostile‑entity avoidance) using the MineDojo benchmark. Compared with non‑API baselines such as VPT‑Base, STEVE‑1, and recent multimodal LLM agents, VistaWise attained an average success rate of 33% across the tasks, including 33% on diamond acquisition versus the previous best of 25%. Ablation studies demonstrated that (1) removing the KG caused a 20-percentage-point drop due to increased hallucinations, (2) using an off‑the‑shelf object detector without fine‑tuning reduced performance by 9 percentage points, and (3) disabling retrieval‑based pooling doubled token usage and increased inference latency by 80%.
The paper acknowledges limitations: the current KG captures only simple static relations and cannot express temporal or conditional quest logic, and the detector must be re‑trained when novel blocks or items appear, limiting true zero‑shot adaptability. Future work is proposed to automate KG expansion via web crawling and LLM‑driven refinement, incorporate continual‑learning object detectors, and develop query mechanisms for time‑dependent graph reasoning.
Overall, VistaWise presents a cost‑effective, API‑free framework that synergistically combines a minimally fine‑tuned visual model, a lightweight external knowledge graph, and LLM reasoning to achieve state‑of‑the‑art performance on complex Minecraft tasks while dramatically lowering development and computational expenses.