Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

Reading time: 5 minutes
...

📝 Original Info

  • Title: Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding
  • ArXiv ID: 2512.12822
  • Date: 2025-12-14
  • Authors: Yongyuan Liang, Xiyao Wang, Yuanchen Ju, Jianwei Yang, Furong Huang

📝 Abstract

Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
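To make the single-sequence design concrete, here is a minimal sketch (not the authors' code) of how a point cloud might be patchified into local groups and fed to a shared transformer alongside text tokens. The patch count, group size, hidden width, and function names below are illustrative assumptions, not details from the paper.

```python
# A minimal sketch (not the authors' code) of the single-sequence idea:
# a point cloud is split into local patches, each patch is embedded as a
# "patch token", and the patch tokens are concatenated with text tokens
# into one sequence for a shared transformer. All sizes are assumptions.
import numpy as np

def farthest_point_sample(points: np.ndarray, num_patches: int) -> np.ndarray:
    """Pick num_patches well-spread center indices from an (N, 3) cloud."""
    n = points.shape[0]
    centers = np.zeros(num_patches, dtype=np.int64)
    dist = np.full(n, np.inf)
    centers[0] = np.random.randint(n)
    for i in range(1, num_patches):
        dist = np.minimum(dist, np.linalg.norm(points - points[centers[i - 1]], axis=1))
        centers[i] = int(dist.argmax())
    return centers

def patchify(points: np.ndarray, num_patches: int = 64, group_size: int = 32) -> np.ndarray:
    """Group the k nearest neighbors of each center into a local patch,
    stored in center-relative coordinates to preserve local geometry."""
    centers = farthest_point_sample(points, num_patches)
    d = np.linalg.norm(points[None, :, :] - points[centers][:, None, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :group_size]               # (num_patches, group_size)
    return points[knn] - points[centers][:, None, :]          # (num_patches, group_size, 3)

# Toy end-to-end: embed patches with a linear map and concatenate them with
# text embeddings so one transformer can attend across both modalities.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(2048, 3))                            # toy point cloud
patches = patchify(cloud)                                     # (64, 32, 3)
W_patch = rng.normal(size=(patches.shape[1] * 3, 512)) * 0.02 # assumed hidden size 512
patch_tokens = patches.reshape(len(patches), -1) @ W_patch    # (64, 512)
text_tokens = rng.normal(size=(16, 512))                      # stand-in text embeddings
sequence = np.concatenate([patch_tokens, text_tokens], axis=0)
print(sequence.shape)                                         # (80, 512): one joint sequence
```

Because patch tokens and text tokens share one sequence, every transformer layer can mix spatial and linguistic information, which is the "early spatial-linguistic fusion" the abstract refers to.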

💡 Deep Analysis

Figure 1: Universal 3D understanding with Lemon. Lemon demonstrates comprehensive 3D spatial reasoning capabilities across diverse tasks.

📄 Full Content

Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding

Yongyuan Liang¹, Xiyao Wang¹, Yuanchen Ju², Jianwei Yang, Furong Huang¹
¹University of Maryland, College Park; ²University of California, Berkeley

1 Introduction

Understanding 3D environments is fundamental for embodied agents, enabling interaction, manipulation, and navigation in the physical world. While large multimodal models (LMMs) have achieved impressive progress in 2D vision-language domains — demonstrated by models such as Flamingo (Alayrac et al., 2022), GPT-4V (OpenAI, 2023), and many open-sourced ones (Chen et al., 2023; Liu et al., 2024; Zhang et al., 2021; Bai et al., 2025; Peng et al., 2023; Xiong et al., 2024; Yang et al., 2025a; Wang et al., 2025) — scaling such capabilities to 3D data remains an open challenge. The irregular structure, sparsity, and high-dimensional nature of point clouds make 3D learning inherently difficult.
Yet, robust 3D understanding is crucial for robotics (Fang et al., 2023; Zhu et al., 2024; Qi et al., 2025), AR/VR systems, and spatial AI (Chen et al., 2024a; Cheng et al., 2024; Zheng et al., 2024a; Yang et al., 2024b; Cao et al., 2024). Despite the emergence of 3D foundation models such as Point-BERT (Yu et al., 2022a) and ULIP (Xue et al., 2022), current efforts fall short of scaling to general-purpose 3D understanding and reasoning tasks in a manner analogous to 2D LMMs. Most existing 3D LMMs adopt modular designs that employ separate encoders for 3D geometry and language, typically using pretrained 3D encoders such as PointNet++ followed by cross-modal alignment mechanisms (Liu et al., 2023b; Zhou et al., 2023). However, this approach faces several fundamental challenges: (1) 3D encoders are typically pretrained on limited datasets with narrow training objectives, limiting their adaptability to diverse spatial reasoning tasks required by LLMs; (2) unlike the 2D domain where billions of images are available, 3D data remains significantly more constrained in scale, further limiting 3D representation quality; and (3) the architectural imbalance between smaller 3D encoders and large language models creates a representational bottleneck where spatial understanding becomes a performance limitation. Furthermore, reliance on frozen pretrained modality-specific encoders prevents end-to-end optimization and generalization to novel 3D structures, impeding progress toward scalable 3D multimodal learning. We propose Lemon,
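For contrast, the modular pipeline criticized above can be sketched roughly as follows. This is an illustrative stand-in, not the code of any cited system: the frozen encoder class, its token budget, and the projection dimensions are assumptions chosen to show why a small, frozen 3D encoder feeding a large language model through a thin alignment layer can become a representational bottleneck.

```python
# An illustrative stand-in (not any cited model's code) for the modular
# 3D-LMM pattern described above: a frozen, separately pretrained point
# encoder emits a small, fixed set of 3D features, and a thin trainable
# projector aligns them to the LLM embedding space before they are
# prepended to the text tokens. All names and sizes are assumptions.
import numpy as np

class FrozenPointEncoder:
    """Stand-in for a pretrained 3D backbone whose weights stay frozen."""
    def __init__(self, out_tokens: int = 8, out_dim: int = 256, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(3, out_dim)) * 0.02
        self.out_tokens = out_tokens

    def __call__(self, points: np.ndarray) -> np.ndarray:
        # Crude pooling into a fixed token budget; real encoders use
        # hierarchical set abstraction, but the output shape is the point.
        feats = points @ self.proj                           # (N, out_dim)
        chunks = np.array_split(feats, self.out_tokens)
        return np.stack([c.mean(axis=0) for c in chunks])    # (out_tokens, out_dim)

rng = np.random.default_rng(1)
llm_hidden = 512                                             # assumed LLM width
encoder = FrozenPointEncoder()                               # frozen: no gradients flow here
aligner = rng.normal(size=(256, llm_hidden)) * 0.02          # trainable alignment module

cloud = rng.normal(size=(2048, 3))
vis_tokens = encoder(cloud) @ aligner                        # (8, 512) projected 3D tokens
text_tokens = rng.normal(size=(16, llm_hidden))              # stand-in text embeddings
llm_input = np.concatenate([vis_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (24, 512): a handful of 3D tokens carry all spatial signal
```

The point of the sketch is the shape mismatch: only a few 3D tokens, produced by a frozen encoder that never sees the language objective, carry all spatial information into the LLM, which is the bottleneck the unified Lemon design is meant to avoid.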

📸 Image Gallery

  • app_abl.png
  • app_abl1.png
  • icon.png
  • pointnet_comparison.png
  • scaling_law_captioning.png
  • training_strategies_comparison.png

Reference

This content is AI-processed based on open access ArXiv data.
