CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence

Reading time: 5 minutes
...

📝 Original Info

  • Title: CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence
  • ArXiv ID: 2512.12768
  • Date: 2025-12-14
  • Authors: Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou

📝 Abstract

Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.

📄 Full Content

Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou
University of Illinois Urbana-Champaign
{ty41, lourent2}@illinois.edu
plan-lab.github.io

1. Introduction

Despite rapid progress in 3D generation, most existing methods remain imitation-based, reproducing shapes rather than reasoning about objects [52, 105]. As a result, they struggle with prompts that implicitly describe relations, counts, geometry, or physical contacts, concepts that recent unified language–vision models have begun to handle effectively in 2D settings [69, 98]. This progress is largely attributed to the integration of Chain-of-Thought (CoT) reasoning [81], which, when extended to multimodal LLMs [7, 50, 109, 112], improves interpretability and consistency across visual reasoning tasks [34, 48].
However, unified reasoning in the 3D domain remains under-explored; few models are capable of jointly interpreting and constructing 3D objects [80, 104]. To advance this frontier, we propose CoRe3D, a framework for collaborative reasoning that unifies semantic understanding and geometric generation within a single 3D-LLM. As illustrated in Fig. 1, CoRe3D integrates a unified 3D language model with an octant-based 3D VQ-VAE, enabling the model to reason in both language and 3D token space. At its core, our approach couples a Semantic CoT for high-level textual planning with a novel Geometric CoT for spatial synthesis. The Geometric CoT operates autoregressively across octant blocks, addressing the limitations of existing “flat” voxel representations that waste computation on empty space and fail to capture structured spatial dependencies. Unlike part-level representations [9], which require fixed ontologies and suffer from poor generalization across categories, or voxel-level representations [53, 90], which remain unstructured and semantically agnostic, our octant-based representation remains ontology-free yet structure-aware.

To jointly refine both reasoning streams, we further employ Group-Relative Policy Optimization (GRPO) [60], allowing CoRe3D to learn from multi-critic feedback that balances semantic intent, visual quality, and physical coherence. This reasoning-aware framework produces high-fidelity 3D construction with enhanced spatial understanding while maintaining strong general language abilities. This approach is essential for three reasons: (1) it elicits plans where no “gold” supervision exists; (2) it allows for granular process credit assignment using dense 3D-specific rewards; and (3) it prevents reward hacking and overfitting by leveraging an ensemble of different critics. By rewarding both linguistic reasoning and 3D synthesis, our approach lays the groundwork for general 3D intelligence, unifying understanding and generation.
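
The group-relative part of GRPO with multi-critic feedback can be illustrated with a minimal sketch. It is not the authors' implementation: the critic names, the weights, the weighted-sum combination, and the use of a population standard deviation are all assumptions made for illustration, since the text only states that semantic intent, visual quality, and physical coherence are balanced. The sketch shows how several samples generated for one prompt are scored and then standardized within their group.

```python
import statistics

# Hypothetical critic names and weights; the source names the three
# objectives but gives no numeric weighting.
CRITIC_WEIGHTS = {"semantic": 0.5, "visual": 0.3, "physical": 0.2}


def combine_critics(scores, weights=CRITIC_WEIGHTS):
    """Collapse per-critic scores for one sample into a scalar reward.

    A weighted sum is an assumption; other aggregation rules are possible.
    """
    return sum(w * scores[name] for name, w in weights.items())


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical samples for one prompt, each scored by three critics.
group_scores = [
    {"semantic": 0.9, "visual": 0.8, "physical": 0.7},
    {"semantic": 0.4, "visual": 0.9, "physical": 0.5},
    {"semantic": 0.6, "visual": 0.6, "physical": 0.6},
    {"semantic": 0.2, "visual": 0.3, "physical": 0.9},
]
rewards = [combine_critics(s) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Because each advantage is measured against the group's own mean, no learned value function or "gold" plan is needed: a sample is rewarded only for being better than its siblings under the combined critics.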
∗Preprint. Work in progress. arXiv:2512.12768v1 [cs.CV] 14 Dec 2025

[Figure 1 panel labels: Semantic-level CoT, Geometric-level CoT, High-Level Guidance, Local Details, Collaborative Reasoning]

3D Prompt: "A cozy wooden cottage with a red door and leafy vines"

Reasoning text shown in the figure:

  • First, recognize the main structural parts of the cottage, including the sloped roof, wooden walls, chimney, windows, and front door.
  • Next, place these components in the correct spatial arrangement, with the roof on top, the chimney offset to one side ... and the door centered beneath the upper windows.
  • Then, assign appropriate materials and styles, such as warm wooden textures for the walls ... along with white shutters and green vines.
  • Refine the scene by adding small decorative details like shingles, flowers ... and rounded edges to capture the cozy, handcrafted look.

Figure 1: We introduce CoRe3D, a framework that unifies Semantic CoT and octant-based Geometric CoT through collaborative reasoning. By coupling language-grounded reasoning with sh

Reference

This content is AI-processed based on open access ArXiv data.
