SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding

Reading time: 5 minutes

📝 Original Info

  • Title: SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding
  • ArXiv ID: 2512.12842
  • Date: 2025-12-14
  • Authors: Kuan Fang, Yuxin Chen, Xinghao Zhu, Farzad Niroui, Lingfeng Sun, Jiuguang Wang

📝 Abstract

We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.
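The abstract describes expressing tasks as a structured, affordance-based representation that is then grounded into 3D heatmaps. The excerpt does not give the exact schema, so the following is a minimal illustrative sketch in Python: the affordance categories are the ones labeled in Fig. 1 of the paper (grasp, place, function, direct contact, indirect contact, avoid, walk to, step on), while the class names, fields, and the example task are hypothetical stand-ins rather than the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

import numpy as np


class AffordanceType(Enum):
    """Affordance categories labeled in Fig. 1 of the paper."""
    GRASP = auto()
    PLACE = auto()
    FUNCTION = auto()
    DIRECT_CONTACT = auto()
    INDIRECT_CONTACT = auto()
    AVOID = auto()
    WALK_TO = auto()
    STEP_ON = auto()


@dataclass
class AffordanceElement:
    """One affordance element: a semantic query plus its grounded 3D heatmap.

    `query` names a task-relevant entity (e.g., "fluffy duster"); after
    grounding, `heatmap` assigns a relevance weight to each point of the
    robot's observed point cloud.
    """
    affordance: AffordanceType
    query: str
    heatmap: Optional[np.ndarray] = None  # shape (N,), one weight per 3D point


@dataclass
class AffordanceTask:
    """A task expressed as a collection of affordance elements."""
    elements: List[AffordanceElement]


# Hypothetical example loosely following the Fig. 1 scenario:
# retrieve a snack bag from a shelf using a duster as a tool.
task = AffordanceTask(elements=[
    AffordanceElement(AffordanceType.WALK_TO, "shelf"),
    AffordanceElement(AffordanceType.GRASP, "fluffy duster"),
    AffordanceElement(AffordanceType.INDIRECT_CONTACT, "snack bag"),
    AffordanceElement(AffordanceType.AVOID, "maroon stair"),
])
```

Because the representation is a structured container rather than raw text, the same `AffordanceTask` could plausibly be filled in from a language instruction, user-selected points, or an example demonstration before grounding, matching the abstract's claim of a unified task interface.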

💡 Deep Analysis

📄 Full Content

SAGA: Open-World Mobile Manipulation via Structured Affordance Grounding

Kuan Fang*, Yuxin Chen*, Xinghao Zhu*, Farzad Niroui, Lingfeng Sun, Jiuguang Wang
https://robot-saga.github.io

*Equal contribution. This work was conducted at the RAI Institute.

Abstract—We present SAGA, a versatile and adaptive framework for visuomotor control that can generalize across various environments, task objectives, and user specifications. To efficiently learn such capability, our key idea is to disentangle high-level semantic intent from low-level visuomotor control by explicitly grounding task objectives in the observed environment. Using an affordance-based task representation, we express diverse and complex behaviors in a unified, structured form. By leveraging multimodal foundation models, SAGA grounds the proposed task representation to the robot's visual observation as 3D affordance heatmaps, highlighting task-relevant entities while abstracting away spurious appearance variations that would hinder generalization. These grounded affordances enable us to effectively train a conditional policy on multi-task demonstration data for whole-body control. In a unified framework, SAGA can solve tasks specified in different forms, including language instructions, selected points, and example demonstrations, enabling both zero-shot execution and few-shot adaptation. We instantiate SAGA on a quadrupedal manipulator and conduct extensive experiments across eleven real-world tasks. SAGA consistently outperforms end-to-end and modular baselines by substantial margins. Together, these results demonstrate that structured affordance grounding offers a scalable and effective pathway toward generalist mobile manipulation.

I. INTRODUCTION

Generalist robots need to seamlessly integrate semantic and geometric understanding to solve diverse and complex tasks in unstructured environments. In mobile manipulation [1, 2] in particular, performing a single task may require concurrent or sequential interactions with multiple objects of different affordances. An example is shown in Fig. 1, where a robot is tasked with retrieving snack bags from a shelf using a duster as a tool. During execution, the robot must select actions to achieve the task objectives while accounting for the geometry and configuration of surrounding objects. The difficulty is further compounded by the wide range of ways in which users specify task objectives, ranging from natural language to example trajectories, and the variations in how these specifications are expressed. Achieving such broad generalization across environments, objectives, and specifications remains a central challenge for modern robotic systems.

Fig. 1: SAGA expresses diverse, complex mobile-manipulation behaviors using an affordance-based task representation (labeled affordances include indirect contact, grasp, place, function, direct contact, avoid, walk to, and step on). By explicitly grounding task objectives as 3D heatmaps in the observed environment, our approach disentangles semantic intents from visuomotor control, enabling generalization across environments, task objectives, and user specifications.

Recent advances in multimodal foundation models [3, 4, 5] have created unprecedented opportunities for open-world robotics. These models can perform strong visual recognition and semantic reasoning over an open set of concepts, yet still lack nuanced physical understanding required for control. To close the perception-action loop, end-to-end robot foundation models have been trained to directly fuse visual observations with high-level user specifications [6, 7]. However, such models must implicitly learn to parse abstract concepts (e.g., "fluffy duster," "maroon stair"), ground them to raw sensory input, and generate control signals within a black-box model. As a result, their generalization capabilities depend on prohibitively large datasets that attempt to span the combinatorial diversity of real-world scenarios, often leading to sharp performance degradation when deployed outside their training distributions.

Alternatively, modular frameworks adopt a more structured design, leveraging pre-trained multimodal foundation models for high-level reasoning, while resorting to hand-engineered modules for low-level execution [8, 9]. Although more data-efficient, most of these frameworks are less robust in unstructured environments and often constrained to narrowly defined behaviors, such as grasping, limiting their application to sophisticated domains like mobile manipulation. Together, these limitations highlight the need for a new paradigm that can retain the open-world reasoning capabilities of foundation models while enabling robust, data-efficient visuomotor control in complex mobile manipulation settings.

In this work, we present Structured Affordance Grounding for Action (SAGA), a versatile and adaptable framework for open-world mobile manipulation. To enable broad generalization,
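To make the data flow described above concrete, here is a minimal, non-authoritative sketch of the two-stage pipeline: a grounding step that turns affordance queries into 3D heatmaps over the observed point cloud, followed by a conditional whole-body policy that consumes the heatmaps alongside geometry and proprioception. The grounding and policy internals are stubbed (uniform heatmaps, zero actions), and every function name, class, and dimension is an assumption for illustration, not the paper's implementation.

```python
import numpy as np


def ground_affordances(point_cloud, queries):
    """Ground each (affordance, query) pair into a 3D heatmap over the point cloud.

    In SAGA this grounding is performed with multimodal foundation models;
    here it is stubbed with uniform heatmaps so the pipeline runs end to end.
    """
    n = point_cloud.shape[0]
    # Placeholder: a real grounder would localize each query in the RGB image,
    # lift the detection into 3D, and emit a soft relevance map per query.
    return np.full((len(queries), n), 1.0 / n)  # shape (K, N)


class ConditionalWholeBodyPolicy:
    """Stand-in for a learned policy conditioned on grounded affordances."""

    def __init__(self, action_dim: int = 18):  # arm + legs + base; illustrative
        self.action_dim = action_dim

    def act(self, point_cloud, affordance_heatmaps, proprioception):
        # A trained policy would fuse geometry, heatmaps, and proprioception;
        # this stub simply returns a zero action of the right dimensionality.
        return np.zeros(self.action_dim)


# Minimal end-to-end call with random stand-in observations.
queries = [("grasp", "fluffy duster"),
           ("indirect contact", "snack bag"),
           ("avoid", "maroon stair")]
points = np.random.rand(2048, 3)   # observed 3D point cloud
proprio = np.zeros(30)             # joint positions, base velocity, etc.

heatmaps = ground_affordances(points, queries)
action = ConditionalWholeBodyPolicy().act(points, heatmaps, proprio)
```

The point of this structure, as the introduction argues, is that the policy never sees raw semantic queries: it only conditions on grounded heatmaps, so appearance variation is handled upstream by the foundation-model grounder rather than learned from scratch by the controller.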

Reference

This content is AI-processed based on open access ArXiv data.
