AOMGen: Photoreal, Physics-Consistent Demonstration Generation for Articulated Object Manipulation
Recent advances in Vision-Language-Action (VLA) and world-model methods have improved generalization in tasks such as robotic manipulation and object interaction. However, successful execution of such tasks depends on large, costly collections of real demonstrations, especially for fine-grained manipulation of articulated objects. To address this, we present AOMGen, a scalable data-generation framework for articulated-object manipulation that is instantiated from a single real scan and demonstration together with a library of readily available digital assets, yielding photoreal training data with verified physical states. The framework synthesizes multi-view RGB sequences temporally aligned with action commands and with state annotations for joints and contacts, and systematically varies camera viewpoints, object styles, and object poses to expand a single execution into a diverse corpus. Experimental results show that fine-tuning VLA policies on AOMGen data raises the success rate from 0% to 88.7%, with the policies evaluated on unseen objects and layouts.
💡 Research Summary
The paper introduces AOMGen, a novel data‑generation framework that turns a single real‑world scan and demonstration of an articulated object into a large, photorealistic, and physics‑consistent dataset for training Vision‑Language‑Action (VLA) manipulation policies. The pipeline consists of three main stages.

First, a multi‑view static scene is reconstructed with 3D Gaussian Splatting (3DGS). Gaussian points are segmented into object parts using SAM2 masks and SAGA features, then aligned to the real‑world coordinate frame via ICP using the robot's URDF.

Second, motion recovery leverages the recorded robot‑arm joint states to infer the articulated object's motion. Keyframe extraction, contact‑point detection, and a custom edge‑pair scoring scheme identify joint axes and centers, while a supervised optimization aligns the movable part's rotation or translation parameters with the robot‑arm contact trajectory.

Third, the articulated object can be replaced by any other instance from the same category; the system transfers lighting and material properties from the original scene to the new model, and randomly varies camera viewpoints, object styles, and initial poses to generate diverse multi‑view RGB sequences together with synchronized action commands and joint/contact annotations.

Experiments show that fine‑tuning a VLA policy on AOMGen‑generated data raises the success rate on articulated manipulation tasks from 0% (no synthetic data) to 88.7%, and the policy generalizes to unseen objects and layouts. Compared with prior simulation‑only or video‑world‑model approaches, AOMGen achieves comparable visual fidelity (strong SSIM and LPIPS scores) while guaranteeing physical plausibility through explicit motion recovery. Limitations include support only for rotational and prismatic joints, the lack of detailed friction or high‑speed dynamics modeling, and reliance on the quality of 3DGS editing for texture realism.
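The joint-axis recovery in the second stage can be illustrated with a toy version of the underlying geometry: for a pure revolute joint, the recorded contact points lie on a circle whose plane normal gives the axis direction and whose center lies on the axis. The sketch below (plain NumPy; the function name and the algebraic circle fit are illustrative assumptions, not the paper's edge-pair scoring or optimization) recovers both from a contact trajectory:

```python
import numpy as np

def estimate_hinge_axis(contact_pts):
    """Recover a revolute joint axis from a contact-point trajectory.

    Minimal sketch of the idea behind the motion-recovery stage: for a
    pure rotation, contact points lie on a circle whose plane normal is
    the joint axis and whose center lies on it. The returned axis
    direction is determined only up to sign.
    """
    pts = np.asarray(contact_pts, dtype=float)
    centroid = pts.mean(axis=0)
    # Plane of motion via SVD: the smallest right singular vector of the
    # centered points is the plane normal, i.e. the axis direction.
    _, _, vt = np.linalg.svd(pts - centroid)
    axis = vt[2]
    u, v = vt[0], vt[1]               # orthonormal in-plane basis
    # Project to 2D and fit a circle algebraically:
    # x^2 + y^2 + a*x + b*y + c = 0 is linear in (a, b, c).
    xy = (pts - centroid) @ np.stack([u, v], axis=1)
    A = np.column_stack([xy, np.ones(len(xy))])
    rhs = -(xy ** 2).sum(axis=1)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    center_2d = np.array([-a / 2.0, -b / 2.0])
    center = centroid + center_2d[0] * u + center_2d[1] * v
    return axis, center
```

A prismatic joint would be even simpler: the translation axis is the dominant singular vector of the centered contact points rather than the smallest.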
The authors suggest future work on multi‑joint, non‑linear articulation, hybrid physics‑simulation integration, and real‑time data augmentation to further close the sim‑to‑real gap.
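As a concrete illustration of the viewpoint randomization used in the third stage, a minimal camera sampler might look like the following (NumPy sketch; the sampling ranges, hemisphere parameterization, and axis convention are assumptions for illustration, not details taken from the paper):

```python
import numpy as np

def sample_camera_pose(target, radius_range=(0.8, 1.5), rng=None):
    """Sample a random camera pose on a hemisphere around `target`.

    Hypothetical helper illustrating the kind of camera-viewpoint
    randomization described above; returns an OpenCV-style rotation
    (+x right, +y down, +z forward toward the target) and the camera
    position.
    """
    rng = np.random.default_rng() if rng is None else rng
    r = rng.uniform(*radius_range)
    az = rng.uniform(0.0, 2.0 * np.pi)          # azimuth around the object
    el = rng.uniform(np.pi / 12, np.pi / 3)     # elevation above the table
    eye = target + r * np.array([
        np.cos(el) * np.cos(az),
        np.cos(el) * np.sin(az),
        np.sin(el),
    ])
    # Look-at rotation: camera +z axis points from eye to target.
    forward = (target - eye) / np.linalg.norm(target - eye)
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)             # completes a right-handed frame
    R = np.stack([right, down, forward], axis=1)
    return R, eye
```

In a full pipeline, object style and initial pose would be randomized alongside the camera so that one recorded execution fans out into many distinct renderings.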