Disentangling perception and reasoning for improving data efficiency in learning cloth manipulation without demonstrations

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Cloth manipulation is a ubiquitous task in everyday life, but it remains an open challenge for robotics. The difficulties in developing cloth manipulation policies are attributed to the high-dimensional state space, complex dynamics, and high propensity to self-occlusion exhibited by fabrics. As analytical methods have not been able to provide robust and general manipulation policies, reinforcement learning (RL) is considered a promising approach to these problems. However, to address the large state space and complex dynamics, data-based methods usually rely on large models and long training times. The resulting computational cost significantly hampers the development and adoption of these methods. Additionally, due to the challenge of robust state estimation, garment manipulation policies often adopt an end-to-end learning approach with workspace images as input. While this approach enables a conceptually straightforward sim-to-real transfer via real-world fine-tuning, it also incurs a significant computational cost by training agents on a highly lossy representation of the environment state. This paper questions this common design choice by exploring an efficient and modular approach to RL for cloth manipulation. We show that, through careful design choices, model size and training time can be significantly reduced when learning in simulation. Furthermore, we demonstrate how the resulting simulation-trained model can be transferred to the real world. We evaluate our approach on the SoftGym benchmark and achieve significant performance improvements over available baselines on our task, while using a substantially smaller model.


💡 Research Summary

This paper tackles the long‑standing challenge of robotic cloth manipulation by rethinking the conventional end‑to‑end image‑based reinforcement learning (RL) pipeline. The authors argue that using raw workspace images as the sole input entangles perception, exploration, and reasoning, leading to inefficient learning and heavy computational demands. Instead, they propose a modular framework that separates perception from reasoning and exploits full‑state information available in simulation.

The method rests on four design principles: (1) Offline pre‑training – a large offline dataset (6.5 M transitions) generated from heuristic rollouts is used to bootstrap the agent; (2) Multi‑objective training – in addition to the primary coverage‑area reward, two auxiliary folding tasks (straight and diagonal) are introduced to enrich the learning signal and avoid latent space collapse; (3) Full‑state exploitation – the cloth’s node positions are encoded as a 3‑channel image (x, y, z per node), allowing a compact convolutional encoder to capture local geometric continuity; (4) Q‑level sim‑to‑real transfer – the simulation‑trained Q‑function serves as a teacher that labels real‑world RGB‑D observations, enabling supervised distillation into a vision‑based policy without re‑training the dynamics model.

The state representation consists of 40 × 30 × 3 = 3600 floating‑point values (x, y, z per node), reshaped into an image. Actions are split into a pick (node index) and a place (2‑D ground‑plane coordinate). This representation introduces an inductive bias that naturally highlights high‑leverage nodes such as corners and prevents invalid picks.
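A minimal numpy sketch of this representation (the 40 × 30 grid layout and helper names are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Assumed layout: the cloth mesh is a 40 x 30 grid of nodes, each carrying an
# (x, y, z) world position. Flattening the node list and reshaping it into a
# 3-channel "image" lets a convolutional encoder exploit grid adjacency.
GRID_H, GRID_W = 40, 30

def state_to_image(node_positions: np.ndarray) -> np.ndarray:
    """node_positions: (1200, 3) array of per-node (x, y, z) coordinates."""
    assert node_positions.shape == (GRID_H * GRID_W, 3)
    return node_positions.reshape(GRID_H, GRID_W, 3)  # 40 x 30 x 3 image

def split_action(pick_index: int, place_xy: np.ndarray):
    """Factor the action into a pick (node index on the grid, converted to a
    row/column for visualization) and a 2-D ground-plane place target."""
    row, col = divmod(pick_index, GRID_W)
    return (row, col), place_xy

state = state_to_image(np.random.rand(GRID_H * GRID_W, 3))
pick_rc, place = split_action(615, np.array([0.1, -0.2]))
```

Because picks are node indices rather than free pixel coordinates, every sampled pick lands on the cloth by construction, which is how the representation rules out invalid picks.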

The neural architecture features a shared convolutional encoder (two Conv‑GELU‑LayerNorm blocks followed by a linear layer) and two decoder heads, one for the pick Q‑value and one for the place Q‑value. The pick head receives only the encoder output, while the place head concatenates the selected pick index, reflecting the conditional nature of the task. Training follows the Double DQN algorithm with Polyak averaging (τ = 5 × 10⁻⁴) and a discount factor γ = 0.9. During offline pre‑training, the loss combines pick and place L2 terms with a bounding loss that caps Q‑values at the theoretical maximum return (R_max/(1‑γ)), mitigating over‑estimation in out‑of‑distribution states. Online fine‑tuning uses an ε‑greedy exploration strategy, a replay buffer seeded with the offline data, and drops the bounding loss once the agent begins to explore high‑quality actions.

For sim‑to‑real transfer, the trained Q‑function is used as a labeler: real‑world RGB‑D images are fed to a separate vision encoder, and the corresponding Q‑values generated by the simulation teacher are used as supervised targets. This “Q‑level” distillation allows a single simulation policy to be re‑used across multiple real‑world setups without re‑training the dynamics model.
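A schematic of one distillation step, with the teacher and student networks abstracted as callables (all names here are illustrative; the paper's actual encoders and optimizer are not shown):

```python
import numpy as np

def distill_batch(teacher_q, student_q, student_step, states, rgbd_images):
    """One supervised step of Q-level distillation: the simulation-trained
    teacher labels full states with Q-values, and the vision-based student,
    which sees only the paired RGB-D images, is regressed onto those labels."""
    targets = np.stack([teacher_q(s) for s in states])         # teacher labels
    preds = np.stack([student_q(img) for img in rgbd_images])  # student output
    loss = float(np.mean((preds - targets) ** 2))              # L2 distillation
    student_step(loss)  # caller applies the gradient update to the student
    return loss
```

Because only the student's vision encoder is trained, the same simulation teacher can label observations from several real-world camera setups without retraining the dynamics model.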

Experiments on the SoftGym cloth‑spreading benchmark demonstrate that the proposed approach outperforms existing baselines by more than 12 % in average coverage while using a model of only ~0.8 M parameters (about five times smaller). Training time is reduced to roughly 8 hours on a single GPU (4 h offline pre‑training, 4 h online fine‑tuning). Real‑world tests show that the distilled vision‑based policy reaches the 95 % coverage threshold with 30 % fewer interaction steps compared to prior sim‑to‑real methods.

The paper’s contributions are threefold: (i) a compact, state‑based RL agent that achieves superior data efficiency; (ii) an ablation study highlighting the impact of each design principle; and (iii) a cross‑modality distillation strategy that enables practical sim‑to‑real transfer. Limitations include the focus on single‑handed pick‑and‑place actions and the lack of explicit domain adaptation for material property mismatches. Future work will explore multi‑robot coordination, more complex folding sequences, and systematic handling of sim‑real dynamics gaps.

