📝 Original Info
- Title: SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding
- ArXiv ID: 2512.12842
- Date: 2025-12-14
- Authors: Kuan Fang, Yuxin Chen, Xinghao Zhu, Farzad Niroui, Lingfeng Sun, Jiuguang Wang
📝 Abstract
We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
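The abstract outlines a two-stage structure: task objectives are first grounded in the robot's observation as 3D affordance heatmaps, and a whole-body policy is then conditioned on those heatmaps. The sketch below is only a minimal illustration of that data flow, not the authors' implementation; the `ground_affordance` and `conditional_policy` functions, the Gaussian-shaped heatmap, and the hard-coded target point are all illustrative assumptions standing in for the paper's foundation-model grounding and learned policy.

```python
# Minimal sketch (not the paper's code) of the data flow described in the abstract:
# a task specification is grounded as a 3D affordance heatmap over the observed
# point cloud, and a policy consumes (observation, heatmap) to produce an action.
import numpy as np


def ground_affordance(points: np.ndarray, target_xyz: np.ndarray,
                      sigma: float = 0.1) -> np.ndarray:
    """Per-point affordance heatmap peaked at target_xyz.

    In SAGA this grounding would come from multimodal foundation models applied
    to a language instruction, a selected point, or an example demonstration;
    here the target is simply given, as an illustrative stand-in.
    """
    sq_dist = np.sum((points - target_xyz) ** 2, axis=1)   # (N,)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))            # values in (0, 1]


def conditional_policy(points: np.ndarray, heatmap: np.ndarray) -> np.ndarray:
    """Placeholder for a learned whole-body policy conditioned on the heatmap.

    A trained policy would map (points, heatmap) to base and arm commands; this
    stub only steers toward the affordance-weighted centroid to show the
    conditioning interface.
    """
    weights = heatmap / (heatmap.sum() + 1e-8)
    goal = weights @ points                       # affordance-weighted centroid, (3,)
    return np.concatenate([goal, np.zeros(3)])    # e.g., [base target, arm command]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.uniform(-1.0, 1.0, size=(2048, 3))   # stand-in for an RGB-D point cloud
    target = np.array([0.4, -0.2, 0.6])              # assumed output of grounding a task spec
    heat = ground_affordance(cloud, target)
    action = conditional_policy(cloud, heat)
    print("action:", np.round(action, 3))
```

The point of this interface, as the abstract argues, is that language instructions, selected points, and example demonstrations can all be reduced to the same heatmap conditioning signal, so a single policy can serve every specification form.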
📄 Full Content
SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding
Kuan Fang∗, Yuxin Chen∗, Xinghao Zhu∗, Farzad Niroui, Lingfeng Sun, Jiuguang Wang
∗Equal contribution. This work was conducted at the RAI Institute.
https://robot-saga.github.io
I. INTRODUCTION
Generalist robots need to seamlessly integrate semantic and geometric understanding to solve diverse and complex tasks in unstructured environments. In mobile manipulation [1, 2] in particular, performing a single task may require concurrent or sequential interactions with multiple objects of different affordances. An example is shown in Fig. 1, where a robot is tasked with retrieving snack bags from a shelf using a duster as a tool. During execution, the robot must select actions to achieve the task objectives while accounting for the geometry and configuration of surrounding objects. The difficulty is further compounded by the wide range of ways in which users specify task objectives, ranging from natural language to example trajectories, and the variations in how these specifications are expressed. Achieving such broad generalization across environments, objectives, and specifications remains a central challenge for modern robotic systems.
Fig. 1: SAGA expresses diverse, complex mobile-manipulation behaviors using an affordance-based task representation (affordance labels shown in the figure: indirect contact, grasp, place, function, direct contact, avoid, walk to, step on). By explicitly grounding task objectives as 3D heatmaps in the observed environment, our approach disentangles semantic intents from visuomotor control, enabling generalization across environments, task objectives, and user specifications.

Recent advances in multimodal foundation models [3, 4, 5] have created unprecedented opportunities for open-world robotics. These models can perform strong visual recognition and semantic reasoning over an open set of concepts, yet still lack the nuanced physical understanding required for control. To close the perception-action loop, end-to-end robot foundation models have been trained to directly fuse visual observations with high-level user specifications [6, 7]. However, such models must implicitly learn to parse abstract concepts (e.g., “fluffy duster,” “maroon stair”), ground them to raw sensory input, and generate control signals within a black-box model. As a result, their generalization capabilities depend on prohibitively large datasets that attempt to span the combinatorial diversity of real-world scenarios, often leading to sharp performance degradation when deployed outside their training distributions.

Alternatively, modular frameworks adopt a more structured design, leveraging pre-trained multimodal foundation models for high-level reasoning while resorting to hand-engineered modules for low-level execution [8, 9]. Although more data-efficient, most of these frameworks are less robust in unstructured environments and are often constrained to narrowly defined behaviors, such as grasping, limiting their application to sophisticated domains like mobile manipulation. Together, these limitations highlight the need for a new paradigm that can retain the open-world reasoning capabilities of foundation models while enabling robust, data-efficient visuomotor control in complex mobile manipulation settings.
In this work, we present Structured Affordance Grounding for Action (SAGA), a versatile and adaptable framework for open-world mobile manipulation. To enable broad generalization,