Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling
In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions – which fit the sensor data – can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
💡 Research Summary
In this work, the authors address a fundamental gap in current 3‑D scene reconstruction pipelines: while modern pose‑and‑shape estimators can produce geometrically accurate models that fit RGB‑D observations, those models often violate basic physical requirements such as non‑penetration, support, and stability. Small pose errors can lead to object interpenetration, floating objects, or unstable equilibria, which makes downstream simulation‑based planning and control unreliable. To close this gap, the paper proposes Picasso, a physics‑constrained reconstruction framework that reasons holistically over an entire scene rather than treating each object in isolation.
The problem is formalized as a maximum‑likelihood estimation (MLE) where the likelihood term captures the fit between observed point clouds (derived from RGB‑D data and object masks) and the predicted object shapes, while a feasible set F encodes physical plausibility. The likelihood is expressed using a Chamfer‑type distance between observed points and sampled surface points of the predicted shape. Physical plausibility is enforced through four hard constraints: (a) inter‑object non‑penetration, (b) object‑environment non‑penetration, (c) consistency with observed free space, and (d) a contact constraint that prevents floating objects by requiring each object to be within a small tolerance of at least one other object or the supporting surface. All constraints are written analytically using signed distance fields (SDFs) for each object, the environment, and free space.
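To make the formulation above concrete, here is a minimal sketch of the data term and the four hard constraints. It is not the authors' implementation: it assumes every object is a sphere (so its SDF has a closed form), that the environment is the plane z = 0, and that the Chamfer-type term is one-sided (observed points to surface samples). The names `sphere_sdf`, `one_sided_chamfer`, `is_feasible`, and the tolerance `tol` are hypothetical.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance from points (N, 3) to a sphere; negative inside."""
    return np.linalg.norm(points - center, axis=-1) - radius

def one_sided_chamfer(observed, surface):
    """Mean distance from each observed point to its nearest surface sample."""
    d = np.linalg.norm(observed[:, None, :] - surface[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def is_feasible(centers, radii, free_points, tol=1e-3):
    """Check constraints (a)-(d) for spheres above the plane z = 0 (assumed setup)."""
    n = len(centers)
    for i in range(n):
        # Gaps from object i to every other object and to the floor.
        gaps = [np.linalg.norm(centers[i] - centers[j]) - radii[i] - radii[j]
                for j in range(n) if j != i]
        floor_gap = centers[i][2] - radii[i]
        if any(g < -tol for g in gaps):      # (a) inter-object non-penetration
            return False
        if floor_gap < -tol:                 # (b) object-environment non-penetration
            return False
        if len(free_points) and np.any(
                sphere_sdf(free_points, centers[i], radii[i]) < -tol):
            return False                     # (c) object occupies observed free space
        if min(gaps + [floor_gap]) > tol:    # (d) floating: touches nothing
            return False
    return True
```

For example, two unit spheres stacked at heights 1 and 3 pass all four checks, while raising the upper sphere to height 3.5 violates the contact constraint (d).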
Directly optimizing this constrained, high‑dimensional problem (seven degrees of freedom per object) with gradient‑based methods is prone to local minima and requires costly Jacobian computations. Instead, Picasso adopts a rejection‑sampling strategy. To make sampling tractable, the authors first infer an object contact graph from the initial pose estimates. This graph encodes which objects are likely to be in contact, dramatically reducing the dimensionality of each sampling sub‑problem: each object is only allowed to move in ways that preserve its predicted contacts. Samples are generated in parallel, evaluated against the four constraints, and accepted if all are satisfied. This approach provides global exploration of the pose space while guaranteeing that any accepted configuration satisfies every physical constraint.
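The idea of contact-guided rejection sampling can be illustrated with a toy sketch. It makes strong simplifying assumptions that are not from the paper: objects are spheres on the plane z = 0, each sphere keeps exactly one predicted contact (its "parent"), the free-space constraint is omitted, samples are drawn sequentially rather than in parallel, and the best accepted sample is chosen by a user-supplied `cost`. All function names, `thresh`, and `sigma` are hypothetical.

```python
import numpy as np

FLOOR = -1  # hypothetical node id for the supporting surface

def infer_parents(centers, radii, thresh=0.05):
    """One predicted contact per sphere: the floor, or a lower-indexed sphere."""
    parents = []
    for i in range(len(centers)):
        parent = None
        if abs(centers[i][2] - radii[i]) < thresh:
            parent = FLOOR
        else:
            for j in range(i):
                gap = np.linalg.norm(centers[i] - centers[j]) - radii[i] - radii[j]
                if abs(gap) < thresh:
                    parent = j
                    break
        parents.append(parent)
    return parents

def propose(init, radii, parents, rng, sigma=0.05):
    """Perturb the initial estimate, then project onto each predicted contact."""
    centers = init.copy()
    for i, p in enumerate(parents):  # parents have lower indices, so they are placed first
        noisy = init[i] + rng.normal(0.0, sigma, size=3)
        if p == FLOOR:
            noisy[2] = radii[i]  # resting on the table: only x and y stay free
        elif p is not None:
            d = noisy - centers[p]
            noisy = centers[p] + (radii[i] + radii[p]) * d / np.linalg.norm(d)
        centers[i] = noisy
    return centers

def feasible(centers, radii, tol=1e-3):
    """No penetration and no floating: the smallest gap of each sphere is ~0."""
    n = len(centers)
    for i in range(n):
        gaps = [centers[i][2] - radii[i]]
        gaps += [np.linalg.norm(centers[i] - centers[j]) - radii[i] - radii[j]
                 for j in range(n) if j != i]
        if abs(min(gaps)) > tol:
            return False
    return True

def rejection_sample(init, radii, cost, n_samples=2000, seed=0):
    """Keep the lowest-cost sample among those that pass the feasibility check."""
    rng = np.random.default_rng(seed)
    parents = infer_parents(init, radii)
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        cand = propose(init, radii, parents, rng)
        if not feasible(cand, radii):
            continue
        c = cost(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best
```

Because each proposal is projected onto its predicted contact before the check, a sphere on the table is sampled in two dimensions instead of three, which is the dimensionality reduction the contact graph is meant to provide. In this toy setting, `cost` could be the squared distance to the initial estimate; in the formulation described above it would be the Chamfer-type data term.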
To evaluate the method, the authors introduce the Picasso dataset, comprising ten real‑world, contact‑rich tabletop scenes (e.g., piles of plates, Jenga blocks). For each scene they provide ground‑truth 6‑DOF poses, CAD models, environment SDFs, and free‑space SDFs. They also propose three quantitative metrics for physical plausibility: (1) total penetration volume, (2) stability margin (distance of each object’s center of mass to its support region), and (3) contact consistency (how many objects satisfy the contact constraint).
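As a rough illustration of metrics (1) and (3), the sketch below computes them for a scene of spheres over the plane z = 0. This is an assumed toy setting, not the released evaluation suite: the overlap volume uses the closed-form sphere-sphere lens formula, penetration into the floor is ignored, and the names `lens_volume`, `total_penetration_volume`, `contact_count`, and `tol` are hypothetical.

```python
import numpy as np

def lens_volume(r1, r2, d):
    """Overlap volume of two spheres with radii r1, r2 and center distance d."""
    if d >= r1 + r2:
        return 0.0  # disjoint or just touching
    if d <= abs(r1 - r2):
        return 4.0 / 3.0 * np.pi * min(r1, r2) ** 3  # one sphere inside the other
    return (np.pi * (r1 + r2 - d) ** 2
            * (d * d + 2 * d * (r1 + r2) - 3 * (r1 - r2) ** 2) / (12 * d))

def total_penetration_volume(centers, radii):
    """Sum of pairwise overlap volumes (metric 1, spheres only)."""
    total = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            d = np.linalg.norm(centers[i] - centers[j])
            total += lens_volume(radii[i], radii[j], d)
    return total

def contact_count(centers, radii, tol=1e-3):
    """Number of spheres within tol of another sphere or the floor (metric 3)."""
    n = len(centers)
    count = 0
    for i in range(n):
        gaps = [centers[i][2] - radii[i]]
        gaps += [np.linalg.norm(centers[i] - centers[j]) - radii[i] - radii[j]
                 for j in range(n) if j != i]
        if min(gaps) <= tol:
            count += 1
    return count
```

For general shapes, the same quantities would have to be estimated from the object SDFs, for instance by sampling points and counting those that lie inside two objects at once.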
Experiments on both the new dataset and the widely used YCB‑V benchmark demonstrate that Picasso consistently outperforms state‑of‑the‑art baselines such as SAM3D, PhysPose, and PhyRecon. Compared to these baselines, Picasso reduces Chamfer error by roughly 15‑25 % and cuts penetration volume by 30‑40 %. Qualitative user studies confirm that reconstructions produced by Picasso are perceived as more “physically plausible” and align better with human intuition. The ablation study shows that the contact‑graph‑guided sampling is the key factor: without it, rejection sampling becomes prohibitively slow and often fails to find feasible solutions.
The paper’s contributions are threefold: (1) a principled formulation that treats physics as hard constraints rather than soft penalties, (2) an efficient rejection‑sampling algorithm that leverages an inferred contact graph to explore the high‑dimensional pose space, and (3) a publicly released dataset and evaluation suite for physical plausibility in multi‑object reconstruction. Limitations include reliance on an accurate initial contact graph (errors can degrade sampling efficiency) and the focus on static scenes; extending the framework to dynamic interactions would require integrating temporal physics simulation or Bayesian filtering.
Overall, Picasso demonstrates that incorporating explicit physics constraints via guided sampling can dramatically improve the realism and utility of reconstructed digital twins, paving the way for more reliable simulation‑based manipulation and planning in cluttered, contact‑rich environments.