Preprint.
LEMON: A UNIFIED AND SCALABLE 3D MULTIMODAL MODEL FOR UNIVERSAL SPATIAL UNDERSTANDING
Yongyuan Liang1, Xiyao Wang1, Yuanchen Ju2, Jianwei Yang, Furong Huang1
1University of Maryland, College Park
2University of California, Berkeley
[Figure 1 panels: example interactions, labeled "Universal 3D Understanding" — step-by-step instructions for opening an upside-down beverage can, judging whether a bed sits tightly against the wardrobe behind it, whether a robot vacuum can pass behind a sofa, and what would happen if a potted plant fell.]
Figure 1: Universal 3D understanding with Lemon. Lemon demonstrates comprehensive 3D spatial reasoning capabilities across diverse tasks.
ABSTRACT
Scaling large multimodal models (LMMs) to 3D understanding poses unique challenges: point cloud data is sparse and irregular, existing models rely on fragmented architectures with modality-specific encoders, and training pipelines often suffer from instability and poor scalability. We introduce Lemon, a unified transformer architecture that addresses these challenges by jointly processing 3D point cloud patches and language tokens as a single sequence. Unlike prior work that relies on modality-specific encoders and cross-modal alignment modules, this design enables early spatial-linguistic fusion, eliminates redundant encoders, improves parameter efficiency, and supports more effective model scaling. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context, and a three-stage training curriculum that progressively builds capabilities from object-level recognition to scene-level spatial reasoning. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks, from object recognition and captioning to spatial reasoning in 3D scenes, while demonstrating robust scaling properties as model size and training data increase. Our work provides a unified foundation for advancing 3D spatial intelligence in real-world applications.
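To make the single-sequence design concrete, the sketch below shows one way point-patch embeddings and language-token embeddings can be concatenated and processed by a single transformer stack. Module names, shapes, and hyperparameters (PointPatchEmbed, UnifiedLMM, 32 points per patch, hidden size 768) are illustrative assumptions for exposition, not the released Lemon implementation; causal masking and positional encodings are omitted for brevity.

    # Minimal sketch of joint point-patch / language-token processing (assumed design,
    # not the paper's released code). Point patches are embedded, concatenated with
    # text embeddings, and passed through one shared transformer.
    import torch
    import torch.nn as nn

    class PointPatchEmbed(nn.Module):
        """Embed fixed-size groups of points (patches) into the transformer hidden size."""
        def __init__(self, points_per_patch=32, hidden=768):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(points_per_patch * 3, hidden), nn.GELU(), nn.Linear(hidden, hidden)
            )

        def forward(self, patches):                  # patches: (B, P, points_per_patch, 3)
            B, P, N, C = patches.shape
            return self.mlp(patches.reshape(B, P, N * C))   # (B, P, hidden)

    class UnifiedLMM(nn.Module):
        """One transformer over the concatenated [point patches ; text tokens] sequence."""
        def __init__(self, vocab=32000, hidden=768, layers=12, heads=12):
            super().__init__()
            self.point_embed = PointPatchEmbed(hidden=hidden)
            self.text_embed = nn.Embedding(vocab, hidden)
            block = nn.TransformerEncoderLayer(hidden, heads, 4 * hidden, batch_first=True)
            self.blocks = nn.TransformerEncoder(block, layers)
            self.lm_head = nn.Linear(hidden, vocab)

        def forward(self, point_patches, text_ids):
            seq = torch.cat([self.point_embed(point_patches),
                             self.text_embed(text_ids)], dim=1)    # single joint sequence
            return self.lm_head(self.blocks(seq))                  # logits at every position

    # Usage: 64 point patches of 32 points each, plus a 16-token language prompt.
    model = UnifiedLMM()
    logits = model(torch.randn(2, 64, 32, 3), torch.randint(0, 32000, (2, 16)))

Because both modalities share the same attention layers from the first block onward, spatial and linguistic features can interact early rather than only after a separate 3D encoder has finished its forward pass.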
1 INTRODUCTION
Understanding 3D environments is fundamental for embodied agents, enabling interaction, manipulation, and navigation in the physical world. While large multimodal models (LMMs) have achieved impressive progress in 2D vision-language domains — demonstrated by models such as Flamingo (Alayrac et al., 2022), GPT-4V (OpenAI, 2023), and many open-source models (Chen et al., 2023; Liu et al., 2024; Zhang et al., 2021; Bai et al., 2025; Peng et al., 2023; Xiong et al., 2024; Yang et al., 2025a; Wang et al., 2025) — scaling such capabilities to 3D data remains an open challenge. The irregular structure, sparsity, and high dimensionality of point clouds make 3D learning inherently difficult. Yet robust 3D understanding is crucial for robotics (Fang et al., 2023; Zhu et al., 2024; Qi et al., 2025), AR/VR systems, and spatial AI (Chen et al., 2024a; Cheng et al., 2024; Zheng et al., 2024a; Yang et al., 2024b; Cao et al., 2024). Despite the emergence of 3D foundation models such as Point-BERT (Yu et al., 2022a) and ULIP (Xue et al., 2022), current efforts fall short of scaling to general-purpose 3D understanding and reasoning tasks in a manner analogous to 2D LMMs.
Most existing 3D LMMs adopt modular designs with separate encoders for 3D geometry and language, typically coupling a pretrained 3D encoder such as PointNet++ with cross-modal alignment mechanisms (Liu et al., 2023b; Zhou et al., 2023). However, this approach faces several fundamental challenges: (1) 3D encoders are usually pretrained on limited datasets with narrow training objectives, limiting their adaptability to the diverse spatial reasoning tasks that LLMs require; (2) unlike the 2D domain, where billions of images are available, 3D data remains far more constrained in scale, further limiting 3D representation quality; and (3) the architectural imbalance between small 3D encoders and large language models creates a representational bottleneck in which spatial understanding limits overall performance. Furthermore, reliance on frozen, pretrained modality-specific encoders prevents end-to-end optimization and generalization to novel 3D structures, impeding progress toward scalable 3D multimodal learning.
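For contrast with the unified design sketched above, the snippet below illustrates the modular pattern this paragraph critiques: a frozen, separately pretrained point-cloud encoder whose output a small projector maps into a much larger language model's embedding space. The encoder, projector, and all dimensions here are illustrative assumptions rather than any specific prior system.

    # Illustrative sketch of the modular baseline: a frozen, separately pretrained
    # 3D encoder feeds a small projector that maps point features into the hidden
    # size of a (much larger) language model. Dimensions are assumptions.
    import torch
    import torch.nn as nn

    class FrozenPointEncoder(nn.Module):
        """Stand-in for a pretrained point-cloud encoder; weights are frozen."""
        def __init__(self, out_dim=384):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, out_dim))
            for p in self.parameters():
                p.requires_grad = False          # frozen: no end-to-end optimization

        def forward(self, points):               # points: (B, N, 3)
            return self.net(points).max(dim=1).values   # (B, out_dim) global feature

    # A linear projector bridges the small 3D feature into the LLM hidden size (e.g., 4096).
    projector = nn.Linear(384, 4096)
    point_feat = FrozenPointEncoder()(torch.randn(2, 1024, 3))   # (2, 384)
    llm_prefix = projector(point_feat).unsqueeze(1)              # (2, 1, 4096) prefix token(s)
    # The whole scene is squeezed into a handful of prefix tokens before the language
    # model ever attends to it, which is the representational bottleneck described above.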
We propose Lemon,