ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation
Hongyu Yan^1, Qiwei Li^1, Jiaolong Yang^2, and Yadong Mu^1
^1 Peking University   ^2 Microsoft Research Asia (MSRA)
Fig. 1. Illustration of the key ideas in ProgressVLA. (a) We pretrain a vision/language-conditioned progress estimator on Open X-Embodiment (OXE) [33] and finetune it on LIBERO [30] / CALVIN [32]. (b) The estimator serves as a vision-language evaluator and achieves low residual after being finetuned (0.07 on CALVIN and 0.1 in real-world scenarios, with a progress scale of [0, 1]). (c) At inference, we use classifier (estimator) guidance in latent action space to steer diffusion toward higher progress. The refined latents are then decoded into action chunks for execution, often producing faster progress.

I. ABSTRACT
Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold: (1) robust progress estimation: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero-shot generalization to unseen real-world samples; and (2) differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
II. INTRODUCTION
Recent vision-language-action (VLA) models have advanced policy learning by scaling to large robotics datasets [33, 42, 23, 5]. Yet many approaches still rely on dense action supervision [24], which limits their scalability, while others depend on implicit and often noisy goal-satisfaction cues. Generative planners, including tokenized autoregressive models [35, 6] and diffusion policies [3, 19], can produce plausible trajectories. However, their sampling is largely driven by conditioning and typically lacks an explicit, dense notion of task progress. Consequently, long-horizon execution often relies on brittle termination heuristics rather than goal-directed generation. As a motivating observation, we present empirical validation in Table I: explicitly guiding sampling with a learned progress signal substantially improves progress alignment and reduces the steps required to complete a task on CALVIN [32], with a consistent gain in success (a sketch of the alignment metric appears later in this section).

TABLE I
Key motivating observation. Progress-guided sampling improves progress alignment and reduces completion steps on CALVIN. Pearson r is the correlation between predicted progress {p̂_t} and a linear ramp {t/T} over a rollout. Avg. steps denotes steps-to-completion, and Success is the task success rate. The baseline here is a standard diffusion policy for robotic manipulation.

Progress Guidance | Pearson r ↑ | Avg. steps ↓ | Success ↑
– | 0.722 | 90.4 | 92.7
✓ | 0.934 | 77.3 | 93.6

To address these limitations, we propose a progress estimation technique (see Fig. 1) and integrate it to guide a diffusion policy, a framework we call ProgressVLA. Central to this approach is a progress estimator that outputs a normalized completion score, conditioned on the language-specified task and current visual observations. Why is progress estimation fundamental to long-horizon robotic manipulation? In vision-language conditioned tasks, a policy must go beyond generating locally plausible motions; it must continuously evaluate whether its actions effectively advance toward the specified goal [18]. Without a dense notion of progress, generative policies often squander computation on trajectories that appear visually reasonable but remain task-irrelevant. Furthermore, they lack a principled mechanism for termination, often defaulting to brittle, hand-crafted heuristics.

However, learning progress directly from raw pixels is difficult [7]. Robotic videos exhibit significant nuisance variations, such as camera jitter, background shifts, and distractor objects, which naive learning objectives often entangle with task dynamics. This results in progress signals that are noisy and poorly aligned with actual goal completion. ProgressVLA mitigates this by grounding progress estimation within a pre-trained, object-centric visual feature space. By explicitly conditioning on language, the model ensures the learned signal prioritizes task-relevant state changes over incidental visual dynamics.
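For concreteness, Table I's alignment metric can be computed per rollout as in the following minimal sketch; this is our illustration, not the authors' evaluation code, and the t/(T−1) ramp normalization is an assumption.

```python
import numpy as np

def progress_alignment(p_hat: np.ndarray) -> float:
    """Pearson correlation between a predicted progress trace {p_hat_t}
    and the linear ramp {t/T} used as the reference in Table I."""
    T = len(p_hat)
    ramp = np.arange(T) / max(T - 1, 1)      # assumed ramp normalization, in [0, 1]
    return float(np.corrcoef(p_hat, ramp)[0, 1])

# Toy usage: a noisy but roughly monotone trace aligns well.
rng = np.random.default_rng(0)
trace = np.clip(np.linspace(0, 1, 50) + 0.05 * rng.standard_normal(50), 0, 1)
print(progress_alignment(trace))  # close to 1.0
```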
It is equally important to use progress during control. Rather than treating progress as a post-hoc evaluator, such as for reranking sampled trajectories or as a sparse success classifier [48, 49, 12, 14], ProgressVLA embeds progress awareness directly into the action generation process. Specifically, for a given candidate action chunk, we develop an inverse-dynamics-based world model to predict the resulting future visual features, while the progress predictor assigns a differentiable score to the predicted outcome. We then backpropagate the progress gradients through the world model to provide classifier-style guidance at each diffusion denoising step. This steers the sampling process toward action chunks predicted to maximize progress toward the goal. By coupling planning with evaluation, progress shapes the generated trajectories and provides a simple threshold-based termination criterion, leading to more goal-directed and reliable long-horizon execution.

In summary, our contributions are threefold: (1) a progress estimator grounded in predicted future observations through an inverse dynamics world model, enabling foresight in task evaluation; (2) a progress-guided diffusion sampler that leverages differentiable progress gradients to iteratively optimize action chunks during generation; and (3) extensive empirical validation on the CALVIN and LIBERO benchmarks, complemented by real-robot deployments, demonstrating significant gains in long-horizon success rates and more reliable task termination.

III. RELATED WORK
A. Vision-Language-Action Models
Large-scale Vision-Language Models (VLMs) have established robust multimodal representations that transfer effectively across diverse perception tasks, including visual question answering and image captioning. Extending these capabilities to control settings has led to the emergence of Vision-Language-Action (VLA) models [31, 38], which map linguistic goals and visual observations to control policies. Current VLA research generally follows three paradigms: (i) autoregressive tokenization, which discretizes continuous control signals into action codebooks [24, 35, 6]; (ii) direct supervised regression to joint action spaces when dense labels are available [50, 25, 43]; and (iii) generative trajectory modeling, notably diffusion policies that sample action sequences via iterative denoising [13, 22, 3, 19, 5]. While diffusion-based methods produce high-quality, diverse trajectories through stochastic sampling, they often lack an explicit task-level signal to steer generation toward goal completion.

Parallel efforts have focused on learning latent, transferable action spaces from video data to facilitate cross-embodiment generalization [47, 4, 7, 1, 9, 10, 11, 29, 5]. These approaches typically utilize an Inverse Dynamics Model (IDM) to infer latent actions from video frames and a Forward Dynamics Model (FDM) to reconstruct future states. For instance, recent works [47, 6, 10, 5] demonstrate that learning compact latent action representations from large-scale human data can mitigate domain gaps and provide a powerful supervisory signal for next-token prediction, ultimately yielding higher-fidelity robotic trajectories.

Despite these advances, existing methods primarily rely on passive vision-language conditioning and fail to incorporate an explicit mechanism for monitoring task progress or completion.
In contrast, we introduce a progress-critic framework that integrates a learned progress estimator directly into diffusion-based sampling within a latent action space. By fusing transferable latent representations with progress-guided generative planning, our approach produces more goal-directed and robust trajectories, significantly enhancing performance in challenging, long-horizon manipulation tasks.

B. World Model
World models establish compact internal representations of environmental dynamics, facilitating prediction, planning, and counterfactual reasoning without the need for costly physical rollouts [28]. In robotics, these models are increasingly utilized to learn latent dynamics for model-based control, synthesize future observations for imagination-based planning [40, 21, 17, 8], and estimate task-specific objectives such as success classifiers or reward functions [45, 51]. While such signals enable reinforcement learning through simulated rollouts, the resulting supervision is often restricted to sparse, binary success indicators that provide a limited gradient for efficient optimization [14]. Recent advances [39, 5] address this by demonstrating that jointly learning latent dynamics alongside perception and action embeddings significantly enhances sample efficiency and cross-embodiment generalization.

In this paper, we introduce a progress-oriented model that explicitly predicts a scalar progress estimate alongside latent observation dynamics. This learned signal serves as a dense, task-oriented guidance mechanism during diffusion-based sampling and can be thresholded to establish reliable, principled termination criteria. By jointly modeling latent actions, state dynamics, and task progress, our world model facilitates highly goal-directed trajectory generation while significantly reducing dependence on costly, sparse, or task-specific supervision.

Fig. 2. Overview of ProgressVLA. Conditioned on a language instruction and current observation, the diffusion policy first generates a candidate chunk of latent actions. An action-oriented world model then rolls out these actions within a pre-trained visual feature space to project future states, while a progress estimator assigns a completion score to the predicted outcomes. Finally, progress gradients are backpropagated through the world model as classifier guidance, steering the diffusion process toward actions that maximize task advancement.

IV. THE PROPOSED METHOD
Given an image observation o and a language instruction l, our goal is to train a policy π to predict a coherent action chunk, denoted as π : (o_t, l) → a_{t:t+N}. Our proposed ProgressVLA framework achieves this through three components:
(1) a progress estimator that regresses a normalized task-completion score from the current visual observation and the language instruction (Sec. IV-A); (2) an action-conditioned world model that facilitates bidirectional reasoning, either projecting future visual states from predicted latent actions (forward dynamics) or inferring the underlying latent actions from visual state transitions (inverse dynamics) (Sec. IV-B); and (3) a diffusion-based generative model that leverages the progress signal through differentiable classifier guidance to steer action sampling toward goal-optimal trajectories (Sec. IV-D). See Fig. 2 for an overview.

A. Progress Estimator
The progress estimator P operates as a vision-language evaluator that assesses task advancement. It processes the language instruction l, the initial observation o_0 (to provide a global task anchor), and the current image o_t to output a normalized scalar progress score:

p = P(l, o_0, o_t),  p ∈ [0, 1].   (1)

We train P as a regressor with an L1 loss:

L_prog = |p − p*|,   (2)

where p* denotes the ground-truth progress label. We use the normalized timestep as a proxy for progress; specifically, for a trajectory of total length T, the progress label at timestep t is defined as p* = t/T. This formulation is grounded in the observation that our expert demonstrations are curated to advance steadily toward completion, ensuring that task progress remains approximately monotonic. Consequently, normalized time serves as a robust and effective surrogate for progress, without requiring additional annotations.

B. World Model
Our method incorporates a compact latent world model designed to capture visual states and dynamics for short-horizon imagination. As shown in Fig. 3, the architecture consists of a vision encoder E and a decoder D. Specifically, the encoder serves as an inverse dynamics model, mapping the transition between two observations, o_t and o_{t+N}, into a compressed latent action space:

a_z = E(o_t, o_{t+N}),   (3)

and the decoder (the forward dynamics model) predicts the future image given the observation o_t and the latent action a_z:

o_{t+N} = D(o_t, a_z).   (4)

Fig. 3. Architecture of the action-dynamics-oriented world model (latent action token, encoder, and decoder).

The training objective for the proposed world model integrates a latent-dynamics reconstruction loss with a Kullback-Leibler (KL) divergence term to regularize the latent action distribution (N(0, I) denotes the standard normal distribution):

L_world = Σ_t ‖o_{t+N} − o*_{t+N}‖² + KL(a_z, N(0, I)).   (5)

Essentially, we train the world model to extract compact, transferable latent representations that decouple visual nuisances from task-relevant features. These latents serve as a unified state representation, shared by both the policy generator for actions and the progress estimator for state evaluation.

C. Joint Finetuning of World Model and Progress Estimator
After separately pre-training the world model and the progress estimator, we perform joint fine-tuning to align latent dynamics with task-level progression. Specifically, given a current visual observation o_t and a candidate latent-action chunk a^z_{t:t+N}, the world model first projects the resulting future latent state; the progress estimator then assesses this predicted outcome to compute a task-advancement score:

p_{t+N} = P(l, o_0, D(o_t, a^z_{t:t+N})).   (6)

We define a loss on the predicted progress score, which jointly supervises the two modules:

L_joint = |p_{t+N} − p*_{t+N}|.   (7)

The overall joint finetuning objective is a weighted combination of the world-model loss, progress loss, and joint loss, namely

L_ft = L_world + L_prog + L_joint,   (8)

which encourages the predicted latent dynamics to be informative for downstream progress estimation and guidance.
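To make the differentiable pipeline of Eqs. (6)-(8) concrete, the sketch below chains a placeholder decoder D and progress estimator P so that the progress loss backpropagates into the latent-action chunk; the module bodies and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the decoder D (forward dynamics)
# and the progress estimator P; all dimensions are illustrative.
D = nn.Sequential(nn.Linear(256 + 128, 256))                 # (feat(o_t), a_z) -> feat(o_{t+N})
P = nn.Sequential(nn.Linear(256 * 2 + 64, 1), nn.Sigmoid())  # (feat(o_0), feat(o_{t+N}), text) -> p

def joint_loss(f_o0, f_ot, text_emb, a_z, p_star):
    """Eqs. (6)-(7): score the imagined future and supervise it with L1."""
    f_future = D(torch.cat([f_ot, a_z], dim=-1))             # imagined future features
    p_pred = P(torch.cat([f_o0, f_future, text_emb], dim=-1)).squeeze(-1)
    return (p_pred - p_star).abs().mean()                    # L_joint, Eq. (7)

# Gradients reach a_z through D and P, which is what later enables
# classifier-style guidance on the latent action (Sec. IV-D).
f_o0, f_ot = torch.randn(4, 256), torch.randn(4, 256)
text_emb = torch.randn(4, 64)
a_z = torch.randn(4, 128, requires_grad=True)
loss = joint_loss(f_o0, f_ot, text_emb, a_z, p_star=torch.rand(4))
loss.backward()
print(a_z.grad.shape)  # torch.Size([4, 128])
```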
D. Progress-Guided Diffusion Policy
Our policy employs a two-stage generation pipeline designed for cross-embodiment flexibility. First, a Latent Action Expert generates an action chunk a^z_{t:t+N} within an embodiment-agnostic latent space, focusing on high-level task strategy. In the second stage, an Action Decoder maps a^z_{t:t+N} into a low-level action sequence a_{t:t+N} for robot execution.

Let x_0 denote the latent representation of a_z as in Eq. (3). The backbone diffusion model is trained by optimizing the denoising objective, namely

L_diff = E_{x_0, ϵ, τ} ‖ϵ − ϵ_θ(x_τ, τ, l, o_t)‖²,

where x_τ is the noisy latent-action sample at diffusion step τ (distinct from the observation index t), ϵ is Gaussian noise, and ϵ_θ predicts the noise.

The diffusion policy is guided by the progress estimator through the world model. Given the current visual observation o_t and the current latent sample x_τ, the world model predicts the resulting future image:

ô_{t+N} = D(o_t, x_τ),   (9)

which is then fed to the progress estimator to obtain the predicted progress:

p̂_{t+N} = P(l, o_0, ô_{t+N}).   (10)

Since p̂_{t+N} is differentiable with respect to x_τ through the world-model decoder D, we can backpropagate gradients to the latent action and use them as classifier guidance during sampling. At diffusion step τ, let the unguided reverse mean be µ_θ(x_τ, τ, c). We modify the update rule as

x_{τ−1} = µ_θ(x_τ, τ, c) + s ∇_{x_τ} p̂_{t+N} + σ_τ ϵ,   (11)

where s controls the guidance strength and ∇_{x_τ} p̂_{t+N} effectively optimizes a_z toward increased progress. The sampled latent-action chunk x_0 is mapped by the action decoder into an executable sequence a_{t:t+N−1} for deployment. Empirical results demonstrate that progress-guided sampling significantly shifts the distribution of generated actions toward those yielding higher predicted progress; this effectively reduces the need for extensive re-sampling and enables a robust, threshold-based termination criterion at runtime.
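Eq. (11) amounts to a classifier-guidance step with the progress estimator acting as the classifier. Below is a minimal sketch of one guided reverse update, assuming a DDPM-style sampler; unguided_mean, world_decoder, and progress_estimator are hypothetical stand-ins for µ_θ, D, and P.

```python
import torch

def guided_step(x_tau, tau, cond, unguided_mean, world_decoder,
                progress_estimator, s=1.0, sigma_tau=0.0):
    """One progress-guided reverse diffusion step (Eq. 11)."""
    x_tau = x_tau.detach().requires_grad_(True)
    # Imagine the future and score its progress (Eqs. 9-10).
    o_future = world_decoder(cond["o_t"], x_tau)
    p_hat = progress_estimator(cond["l"], cond["o_0"], o_future).sum()
    # Gradient of predicted progress w.r.t. the noisy latent action.
    grad = torch.autograd.grad(p_hat, x_tau)[0]
    mu = unguided_mean(x_tau, tau, cond)
    noise = sigma_tau * torch.randn_like(x_tau)
    return (mu + s * grad + noise).detach()
```

Larger guidance strengths s push samples more aggressively toward high predicted progress, at the risk of drifting off the data manifold, so s is a tuning knob rather than a free win.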
E. RL Finetuning with Progress
While progress-guided sampling can be applied at inference time, we further finetune both the progress estimator and the diffusion policy with online experience (see Fig. 4), so that (i) progress estimates remain well aligned with task completion and (ii) the resulting policy is more robust to execution noise.

Fig. 4. Reinforcement learning framework of ProgressVLA.

a) Online trajectory collection: We periodically roll out the current policy to collect trajectories

B = {(o_0, o_t, l, a^z_{t:t+N}, a_{t:t+N}, p̂_t, y)}_{t=0}^{T−1},   (12)

where p̂_t is the predicted progress and y ∈ {0, 1} indicates episode success. This online buffer captures critical edge cases that are typically under-represented in static offline datasets, such as recovery behaviors, near-failure states, and out-of-distribution visual perturbations.

b) Progress estimator finetuning: For successful episodes (y = 1), task progress should be (approximately) monotonic. We therefore mine progress anomalies, namely instances where the predictor's output violates this expected monotonicity. Specifically, for each timestep t we define

t′ = argmin_{k>t} p̂_k,   (13)

namely the index after t with the smallest predicted progress, and mark t as an anomaly if the following holds:

I_anom = {t | p̂_t > p̂_{t′} + ϵ},   (14)

and apply a margin-based monotonicity loss:

L_mono = Σ_{t∈I_anom} max(0, ϵ − (p̂_{t′} − p̂_t)).   (15)

In implementation, we finetune P by minimizing L_prog + L_mono on the online buffer.

c) Diffusion policy finetuning: We cast progress maximization as KL-regularized policy improvement. Let the state be s = (l, o_0, o_t) and the action be the latent action a_z. We define a task-aware score using the learned evaluator:

Q(s, a) = P(l, o_0, D(o_t, a_z)),   (16)

i.e., the progress predicted after applying a in the world model. We then formulate the following KL-constrained optimization problem:

π*(·|s) = argmax_π E_{a∼π(·|s)}[Q(s, a)],  s.t. KL(π(·|s) ‖ π_θ(·|s)) ≤ ε.   (17)

Solving this problem yields

π*(a|s) ∝ π_θ(a|s) · exp((1/α) Q(s, a)),   (18)

which increases the likelihood of action chunks associated with higher progress while staying close to the current policy. For diffusion policies, π_θ is parameterized by the denoiser ϵ_θ. From a guided-denoising view, we incorporate the progress score by adjusting the noise target at each diffusion step:

ϵ̃ = ϵ − (σ_τ / α) ∇_{x_τ} Q(s, x_τ),   (19)

and train the policy with a standard denoising objective,

L_policy = E ‖ϵ̃ − ϵ_θ(x_τ, τ, l, o_t)‖².   (20)

This update encourages the denoiser to produce samples that move toward higher progress.

V. EXPERIMENTS
A. Pretraining Data
All components are pre-trained on the Open X-Embodiment (OXE) datasets [33, 24], adhering to the dataset selection and mixture weighting protocols established in [24, 41]. Actions are normalized and filtered using the same procedure as [33]. Unless otherwise specified, we use the same image preprocessing across all modules. The modules are trained with a batch size of 2048 on 8 NVIDIA H20 GPUs (256 samples per GPU), and the base learning rate is set to 1 × 10⁻⁴.

B. Implementation Details
1) Progress estimator pretraining: The progress estimator takes as input the patch features of the starting and current frames extracted by DINOv2 [34]. Visual and text tokens are first projected into a shared embedding space, where learnable role embeddings (start, current, and text) preserve functional distinctness. A lightweight cross-attention stack, with residual connections and LayerNorm, aligns the language instruction with the current observation while encoding start-to-current changes. Finally, the tokens are mean-pooled and fused via an MLP, with a sigmoid head outputting the scalar progress score p.
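A condensed sketch of such an estimator head is given below, assuming pre-extracted DINOv2 patch tokens and text tokens; the widths, the shared attention module, and the single fusion path are simplifications rather than the exact architecture (Appendix A gives the full description).

```python
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    """Cross-attention progress regressor: (start, current, text) -> p in [0, 1]."""
    def __init__(self, d_vis=768, d_txt=768, d=1024, heads=8):
        super().__init__()
        self.proj_vis = nn.Linear(d_vis, d)
        self.proj_txt = nn.Linear(d_txt, d)
        # Learnable role embeddings keep the three token streams distinct.
        self.role = nn.Parameter(torch.zeros(3, d))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, 512), nn.GELU(),
                                 nn.Linear(512, 1), nn.Sigmoid())

    def forward(self, start_tok, cur_tok, txt_tok):
        s = self.proj_vis(start_tok) + self.role[0]
        c = self.proj_vis(cur_tok) + self.role[1]
        t = self.proj_txt(txt_tok) + self.role[2]
        # Current frame attends to the start frame (start-to-current changes),
        # then to language (task-relevant grounding); residual + LayerNorm.
        c = self.norm(c + self.attn(c, s, s)[0])
        c = self.norm(c + self.attn(c, t, t)[0])
        fused = torch.cat([c.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.mlp(fused).squeeze(-1)   # scalar progress p per sample

p = ProgressHead()(torch.randn(2, 196, 768), torch.randn(2, 196, 768),
                   torch.randn(2, 16, 768))
print(p.shape)  # torch.Size([2])
```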
2) World model pretraining: We adopt the UniVLA [6] world-model architecture to predict future visual features given candidate latent actions. To stabilize downstream latent-action prediction, we add a KL regularization term during training to normalize the latent-action distribution (Eq. 5), which improves compatibility with the Latent Action Expert.

3) Latent action expert pretraining: Our Latent Action Expert follows the Dita-style design [16] and uses a causal Transformer to autoregressively predict latent actions from multimodal context. The Action Decoder shares the same architecture and training recipe. Additional hyperparameters are deferred to the supplemental material.

C. CALVIN
CALVIN [32] is a simulated benchmark for long-horizon, language-conditioned manipulation. It contains four distinct environments (A, B, C, and D). We adopt the standard ABC → D evaluation protocol, training on environments A, B, and C while testing on D. Each evaluation trial involves a sequence of five subtasks sampled from a diverse pool of language-specified goals, totaling up to 1,000 unique sequences. Following established metrics, we report the success sequence length (denoted as "task completed in a row"), namely the number of consecutive subtasks completed, from 1 to 5, alongside the average number of tasks completed per episode.

1) Baselines and our variants: The competing methods adopted for comparison are listed in Table II. In addition, we consider the following variants:
• ProgressVLA (w/o CG): a from-scratch-trained diffusion policy without classifier (progress estimator) guidance.
• Pretrained ProgressVLA (w/o CG): a diffusion policy pretrained on OXE, with no guidance at inference.
• Pretrained ProgressVLA (w/ CG): the pretrained diffusion policy with classifier guidance, where the evaluator is trained from scratch on CALVIN.
• Pretrained ProgressVLA (w/ Pretrained CG): the pretrained diffusion policy with classifier guidance, using the pretrained world model and progress predictor as the evaluator.
• Pretrained ProgressVLA (Full): the pretrained diffusion policy with classifier guidance using the pretrained evaluator, plus RL finetuning.

TABLE II
Comparison with state-of-the-art approaches on CALVIN (ABC → D) with the metrics of success rate and average success length. Columns 1-5 report the rate (%) of completing 1-5 tasks in a row. The abbreviations denote different input modalities: S-RGB for static RGB, G-RGB for gripper RGB, S-RGBD for static RGB-D, G-RGBD for gripper RGB-D, P for proprioceptive arm position, and Cam for camera parameters.

Method | Input | 1 ↑ | 2 ↑ | 3 ↑ | 4 ↑ | 5 ↑ | Avg. Len. ↑
RoboFlamingo [27] | S-RGB, G-RGB | 82.4 | 61.9 | 46.6 | 33.1 | 23.5 | 2.47
GR-1 [44] | S-RGB, G-RGB, P | 85.4 | 71.2 | 59.6 | 49.7 | 40.1 | 3.06
3D Diffuser [22] | S-RGBD, G-RGBD, P, Cam | 92.2 | 78.7 | 63.9 | 51.2 | 41.2 | 3.27
GR-MG [26] | S-RGBD, G-RGBD, P | 96.8 | 89.3 | 81.5 | 72.7 | 64.4 | 4.04
SuSIE [2] | S-RGB | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69
GHIL-Glue [2, 15] | S-RGB | 95.2 | 88.5 | 73.2 | 62.5 | 49.8 | 3.69
Dita [16] | S-RGB | 94.5 | 82.5 | 72.8 | 61.3 | 50.0 | 3.61
ProgressVLA (w/o CG) | S-RGB | 89.4 | 76.8 | 63.0 | 52.2 | 43.1 | 3.24
Pretrained ProgressVLA (w/o CG) | S-RGB | 92.7 | 81.6 | 70.1 | 60.9 | 51.6 | 3.57
Pretrained ProgressVLA (w/ CG) | S-RGB | 93.6 | 82.4 | 71.2 | 60.8 | 52.8 | 3.61
Pretrained ProgressVLA (w/ CG (pretrained)) | S-RGB | 93.6 | 82.0 | 72.0 | 63.6 | 56.4 | 3.68
Pretrained ProgressVLA (Full) | S-RGB | 95.2 | 84.8 | 73.6 | 67.2 | 52.0 | 3.73

2) Pretraining contributes to diffusion policy performance: Comparing Pretrained ProgressVLA (w/o CG) to ProgressVLA (w/o CG) in Table II shows that pretraining the diffusion policy yields a large and consistent improvement in overall task completion performance, especially on longer-horizon sequences. This indicates that diffusion-policy pretraining provides a strong latent-action prior and reduces compounding errors even without guidance.
3) Classifier guidance relies on a reliable evaluator: Adding guidance on top of a pretrained policy (Pretrained ProgressVLA (w/o CG) → Pretrained ProgressVLA (w/ CG)) yields a moderate improvement. Notably, the benefit of classifier guidance becomes significantly larger when the evaluator (world model + progress predictor) is pretrained: Pretrained ProgressVLA (w/ CG) → Pretrained ProgressVLA (w/ CG (pretrained)) increases the 5-in-a-row rate from 52.8% to 56.4% (+3.6) and the 4-in-a-row rate from 60.8% to 63.6% (+2.8), while improving the average completed length to 3.68. We attribute this gap to the robustness of the pretrained vision-language evaluator: pretraining yields a more reliable progress signal under distribution shift and execution noise, and provides higher-quality guidance gradients during sampling.

4) RL finetuning further improves robustness: In the RL finetuning stage, we roll out the policy in training environments A/B/C and collect a total of 1,000 trajectories for online updates. Pretrained ProgressVLA (Full) achieves the best overall average completed length and improves 1-4 subtask success. We attribute these gains to the complementary effects of RL: online experience improves progress-completion alignment in the evaluator and yields a stronger guidance signal, while simultaneously refining the diffusion policy to be more robust to execution noise.

D. LIBERO
LIBERO [30] is a comprehensive benchmark for knowledge transfer in multitask and lifelong robot learning. It contains four sub-datasets: LIBERO-SPATIAL, LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-100. LIBERO-100 is further split into LIBERO-90 and LIBERO-LONG, where LIBERO-LONG features long-horizon tasks that require diverse object interactions and versatile motor skills. We use the modified LIBERO setup released with OpenVLA [24] as the data source for finetuning and evaluation.

Table III reports success rates (%) on LIBERO-SPATIAL, LIBERO-OBJECT, LIBERO-GOAL, and LIBERO-LONG, together with the average across subsets. We compare against representative multitask VLA baselines (e.g., Diffusion Policy [13], OpenVLA [24], MDT [37], and Dita [16]). To isolate the contribution of progress guidance, we report three variants: w/o cg performs unguided diffusion sampling (no classifier guidance); w/ cg enables progress-guided classifier guidance at inference; and Full denotes our strongest configuration.

TABLE III
Experimental results on the LIBERO benchmark. See the main text for more explanation.
Method | Spatial | Object | Goal | Long | Average
LAPA [47] | 73.8 | 74.6 | 58.8 | 55.4 | 65.7
Diffusion Policy [13] | 78.3 | 92.5 | 68.3 | 50.5 | 72.4
Octo [41] | 78.9 | 85.7 | 84.6 | 51.1 | 75.1
MDT [37] | 78.5 | 87.5 | 73.5 | 64.8 | 76.1
OpenVLA [24] | 84.7 | 88.4 | 79.2 | 53.7 | 76.5
MaIL [20] | 74.3 | 90.1 | 81.8 | 78.6 | 83.5
Dita [16] | 84.2 | 96.3 | 85.4 | 63.8 | 82.4
ProgressVLA w/o cg | 83.2 | 95.0 | 84.6 | 63.2 | 81.5
ProgressVLA w/ cg | 85.8 | 96.1 | 86.0 | 65.4 | 83.3
ProgressVLA Full | 88.2 | 96.4 | 87.2 | 66.2 | 84.5

Across all subsets, progress guidance yields consistent gains over the unguided counterpart (e.g., average 81.5 → 83.3 and LIBERO-LONG 63.2 → 65.4), and the full model improves performance further (average 84.5). Notably, our full method achieves the best overall average and delivers strong improvements on the long-horizon LIBERO-LONG split compared to OpenVLA (66.2 vs. 53.7), supporting the effectiveness of the progress-guided diffusion policy.

E. Real-World Evaluation
1) Experiment setup: Real-world experiments are conducted on an ARX AC-One dual-arm robot outfitted with two X5 arms and ARX G2 parallel grippers. The sensory setup consists of two Intel RealSense D405 RGB-D cameras: one wrist-mounted and one positioned as a stationary third-person "head" view. While the cameras support depth, we use RGB images as the primary policy inputs unless specified otherwise. All trials are performed in a tabletop manipulation environment characterized by fixed object initializations and consistent camera perspectives.

Fig. 5. Illustration of the five tasks in real-world model deployment on an ARX dual-arm robot: Pick up the Toy → Drawer, Pick up the Peach, Pick up the Orange → Plate, Open the Drawer, and Stack the Bowls.

TABLE IV
Real-robot results on five tasks. We report success rate (%), average end-effector path length (Dist, m), and average steps.

Task | Octo [41]: Success ↑ / Dist ↓ / Steps ↓ | ProgressVLA (w/o CG): Success ↑ / Dist ↓ / Steps ↓ | ProgressVLA (w/ CG): Success ↑ / Dist ↓ / Steps ↓
Pick up the peach | 35 / 1.04 / 151.6 | 70 / 0.78 / 96.0 | 90 / 0.61 / 39.1
Open the drawer | 20 / 1.45 / 196.8 | 80 / 0.92 / 90.5 | 90 / 0.78 / 42.0
Stack the bowls | 30 / 0.91 / 145.8 | 65 / 0.71 / 94.1 | 70 / 0.59 / 37.8
Pick up the orange → plate | 20 / 1.63 / 228.3 | 60 / 1.38 / 117.2 | 75 / 1.12 / 72.0
Pick up the toy → drawer | 10 / 1.48 / 216.8 | 55 / 1.03 / 106.1 | 55 / 0.96 / 75.6
Avg. | 23 / 1.30 / 187.9 | 66 / 0.96 / 100.8 | 76 / 0.81 / 53.3

2) Task setup: We evaluate the proposed model on five real-robot manipulation tasks of varying difficulty, as shown in Fig. 5: Toy → Drawer, Pick Peach, Open Drawer, Orange → Plate, and Stack Bowls. These tasks span both single-step and long-horizon behaviors. Target objects and goal receptacles are highlighted with blue boxes in Fig. 5.

3) Data collection and finetuning: For each task, we collect 50-100 human-teleoperated trajectories, depending on task complexity, to finetune the models. Each trajectory contains multi-view RGB observations from the wrist and head cameras together with the executed action sequence.

4) Evaluation protocol and baselines: For quantitative evaluation, we run 20 trials per task. We compare against two baselines: (i) Octo [41], a strong pretrained VLA policy, and (ii) ProgressVLA (w/o CG), which serves as an unguided generative baseline. Our full method applies progress-guided classifier guidance during sampling to improve task completion reliability.
Table IV reports real-robot results using success rate (Success, %), end-effector travel distance (Dist, m), and executed action chunks (Steps; one step corresponds to one predicted and executed chunk). Lower Dist/Steps indicates more efficient execution. Overall, ProgressVLA substantially outperforms Octo, and classifier guidance (CG) further improves both reliability and efficiency. Averaged across tasks, Octo achieves 23% success with 1.30 m / 187.9 steps, while ProgressVLA (w/o CG) increases success to 66% and reduces Dist/Steps to 0.96 m / 100.8. With CG, ProgressVLA (w/ CG) further improves to 76% success with 0.81 m / 53.3 steps. The gains are especially clear on tasks that otherwise exhibit redundant motion, suggesting that progress-guided CG leads to more decisive and goal-directed real-robot execution.

F. Progress Estimator Generalization on Real-Robot Expert Trajectories
1) Offline, policy-agnostic evaluation protocol: We evaluate the progress predictor independently of policy learning using a small set of real-robot expert demonstrations collected under three controlled scene settings, as shown in Fig. 6: (i) Original, (ii) Lighting shift (adding a desk lamp), and (iii) Novel objects (swapping the target object with an unseen instance while keeping the layout and cameras fixed). Given the language instruction ℓ and observation o_t (optionally with o_0), the progress estimator outputs a normalized progress p̂_t ∈ [0, 1] at each timestep.

Fig. 6. Real-world scenarios used for investigating the generalization of the progress estimator: original setting, light shifting, and novel object.

2) Metrics: We report three metrics, as seen in Table V (a sketch of their computation follows below): (i) Progress alignment (Pearson correlation between {p̂_t} and a reference ramp p_t = t/T; higher is better), (ii) Stop reliability (fraction of trajectories with max_{t∈{T−9,...,T}} p̂_t > 0.9; higher is better), and (iii) Progress error (MAE between p̂_t and p_t; lower is better).

3) From-scratch vs. pretrained-and-finetuned progress predictor: We compare two training regimes: From Scratch, where the progress predictor is trained only on the target real-robot dataset without large-scale pretraining, and Finetuned, where we start from a progress predictor pretrained on large-scale manipulation data and then finetune it on the real-robot demonstrations.
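The three metrics can be computed per trajectory as in the following sketch (our illustration, assuming one predicted trace per trajectory and a t/(T−1) ramp normalization; aggregating over trajectories gives table-style numbers).

```python
import numpy as np

def eval_progress_trace(p_hat: np.ndarray):
    """Alignment (Pearson r), stop-reliability flag, and MAE for one trace."""
    T = len(p_hat)
    ramp = np.arange(T) / max(T - 1, 1)              # reference p_t = t/T
    pearson = float(np.corrcoef(p_hat, ramp)[0, 1])  # progress alignment
    stop_ok = bool(p_hat[-10:].max() > 0.9)          # last-10-step stop check
    mae = float(np.abs(p_hat - ramp).mean())         # progress error
    return pearson, stop_ok, mae

# Aggregate over a dataset: mean Pearson/MAE, fraction of stop_ok (Stop, %).
traces = [np.linspace(0, 1, n) ** 0.9 for n in (40, 60, 80)]
results = [eval_progress_trace(tr) for tr in traces]
print(np.mean([r[0] for r in results]),          # Pearson ↑
      100 * np.mean([r[1] for r in results]),    # Stop ↑ (%)
      np.mean([r[2] for r in results]))          # MAE ↓
```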
TABLE V
"From Scratch" trains the progress predictor from random initialization on our real-robot data, while "Finetuned" starts from the pretrained checkpoint and is finetuned on the same data.

Setting | Method | Pearson ↑ | Stop ↑ | MAE ↓
Origin Setting | From Scratch | 0.912 | 53.8 | 0.14
Origin Setting | Finetuned | 0.977 | 82.1 | 0.10
Lighting Shift | From Scratch | 0.809 | 3.6 | 0.24
Lighting Shift | Finetuned | 0.953 | 80.8 | 0.12
Novel Objects | From Scratch | 0.810 | 37.5 | 0.15
Novel Objects | Finetuned | 0.972 | 81.2 | 0.11

4) Results and discussion: Table V shows that pretraining is critical for progress generalization, and the pretrained model remains robust after finetuning on real-robot data. In the original scene, the pretrained+finetuned predictor outperforms training from scratch across all metrics (Pearson 0.912 → 0.977, Stop 53.8 → 82.1, MAE 0.14 → 0.10). Under lighting shift, the from-scratch model degrades sharply (Pearson 0.809, Stop 3.6), whereas the pretrained+finetuned model stays strong (Pearson 0.953, Stop 80.8, MAE 0.12). A similar trend holds for novel objects, where pretraining yields large gains (Pearson 0.810 → 0.972, Stop 37.5 → 81.2, MAE 0.15 → 0.11). Overall, these results suggest that a pretrained progress predictor, finetuned on a small amount of real data, transfers well to common scene shifts, supporting its use as a reliable signal for classifier guidance.

G. Visualization
Fig. 7 visualizes three representative examples drawn from both simulation and real-world scenarios. Starting from the same time point, we present the achieved robotic states (particularly the progress scores) after executing the same number of action tokens, with and without the proposed classifier guidance. The curves of progress scores across the entire operation are displayed in the rightmost column of Fig. 7, further validating the effectiveness of ProgressVLA.

Fig. 7. Visualization of progress estimation and the effect of guidance, on CALVIN ("store the grasped block in the drawer"), LIBERO ("pick up the black bowl on the ramekin and place it on the plate"), and a real-world scenario ("pick up the peach"). The second column from the left illustrates the predicted trajectories, contrasting the baseline diffusion policy (in black) with our ProgressVLA approach (in red). Our method consistently generates more plausible, goal-directed paths.

VI. CONCLUSION
We presented ProgressVLA, a progress-guided diffusion policy for robotic manipulation that adds an explicit progress signal to diffusion-based action generation. Its progress estimator, built on pretrained visual features, remains robust under original, lighting-shift, and novel-object settings. Classifier guidance steers diffusion sampling with progress gradients, improving both success rates and execution efficiency, while RL finetuning further improves the robustness of both the evaluator and the policy.

REFERENCES
[1] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025.
[2] Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023.
[3] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
[4] Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024.
[5] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al.
Agibot World Colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint, 2025.
[6] Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025.
[7] Xizhou Bu, Jiexi Lyu, Fulei Sun, Ruichen Yang, Zhiqiang Ma, and Wei Li. LAOF: Robust latent action learning with optical flow constraints. arXiv preprint arXiv:2511.16407, 2025.
[8] Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. RynnVLA-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502, 2025.
[9] Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-goal representations are the atomic control units for foundation models in embodied AI. arXiv preprint arXiv:2411.00785, 2024.
[10] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-X: Enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025.
[11] Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation. arXiv preprint arXiv:2412.04445, 2024.
[12] Zengjue Chen, Runliang Niu, He Kong, Qi Wang, Qianli Xing, and Zipei Fan. TGRPO: Fine-tuning vision-language-action model via trajectory-wise group relative policy optimization. arXiv preprint arXiv:2506.08440, 2025.
[13] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684-1704, 2025.
[14] Senyu Fei, Siyin Wang, Li Ji, Ao Li, Shiduo Zhang, Liming Liu, Jinlong Hou, Jingjing Gong, Xianzhong Zhao, and Xipeng Qiu. SRPO: Self-referential policy optimization for vision-language-action models. arXiv preprint arXiv:2511.15605, 2025.
[15] Kyle B Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, et al. GHIL-Glue: Hierarchical control with filtered subgoal images. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9516-9524. IEEE, 2025.
[16] Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, et al. Dita: Scaling diffusion transformer for generalist vision-language-action policy. arXiv preprint arXiv:2503.19757, 2025.
[17] Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, and Soujanya Poria. NORA-1.5: A vision-language-action model trained using world-model- and action-based preference rewards. arXiv preprint arXiv:2511.14659, 2025.
[18] Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al. π*_0.6: A VLA that learns from experience. arXiv preprint arXiv:2511.14759, 2025.
[19] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.
[20] Xiaogang Jia, Qian Wang, Atalay Donat, Bowen Xing, Ge Li, Hongyi Zhou, Onur Celik, Denis Blessing, Rudolf Lioutikov, and Gerhard Neumann. MaIL: Improving imitation learning with selective state space models. In 8th Annual Conference on Robot Learning, 2024.
[21] Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. EnerVerse-AC: Envisioning embodied environments with action condition. arXiv preprint arXiv:2505.09723, 2025.
[22] Tsung-Wei Ke, Nikolaos Gkanatsios, and Katerina Fragkiadaki. 3D Diffuser Actor: Policy diffusion with 3D scene representations. arXiv preprint arXiv:2402.10885, 2024.
[23] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[24] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[25] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
[26] Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, and Tao Kong. GR-MG: Leveraging partially-annotated data via multi-modal goal-conditioned policy. IEEE Robotics and Automation Letters, 2025.
[27] Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023.
[28] Xinqing Li, Xin He, Le Zhang, Min Wu, Xiaoli Li, and Yun Liu. A comprehensive survey on world models for embodied AI. arXiv preprint arXiv:2510.16732, 2025.
[29] Zuolei Li, Xingyu Gao, Xiaofan Wang, and Jianlong Fu. LatBot: Distilling universal latent actions for vision-language-action models. arXiv preprint arXiv:2511.23034, 2025.
[30] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776-44791, 2023.
[31] Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied AI. arXiv preprint arXiv:2405.14093, 2024.
[32] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327-7334, 2022.
[33] Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892-6903.
IEEE, 2024.
[34] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[35] Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. FAST: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025.
[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.
[37] Moritz Reuss, Ömer Erdinç Yağmurlu, Fabian Wenzel, and Rudolf Lioutikov. Multimodal diffusion transformer: Learning versatile behavior from multimodal goals. arXiv preprint arXiv:2407.05996, 2024.
[38] Ranjan Sapkota, Yang Cao, Konstantinos I Roumeliotis, and Manoj Karkee. Vision-language-action models: Concepts, progress, applications and challenges. arXiv preprint arXiv:2505.04769, 2025.
[39] Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. ReconVLA: Reconstructive vision-language-action model as effective robot perceiver. arXiv preprint arXiv:2508.10333, 2025.
[40] GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. GigaWorld-0: World models as data engine to empower embodied AI. arXiv preprint arXiv:2511.19861, 2025.
[41] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[42] Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723-1736. PMLR, 2023.
[43] Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025.
[44] Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. arXiv preprint arXiv:2312.13139, 2023.
[45] Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-Env: Leveraging world model as a virtual environment for VLA post-training. arXiv preprint arXiv:2509.24948, 2025.
[46] Mingxing Xu, Wenrui Dai, Chunmiao Liu, Xing Gao, Weiyao Lin, Guo-Jun Qi, and Hongkai Xiong. Spatial-temporal transformer networks for traffic flow forecasting. arXiv preprint arXiv:2001.02908, 2020.
[47] Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, et al. Latent action pretraining from videos. arXiv preprint arXiv:2410.11758, 2024.
[48] Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, et al. RLinf: Flexible and efficient large-scale reinforcement learning via macro-to-micro flow transformation. arXiv preprint arXiv:2509.15965, 2025.
[49] Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, et al. RLinf-VLA: A unified and efficient framework for VLA+RL training. arXiv preprint arXiv:2510.06710, 2025.
[50] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint, 2023.
[51] Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: World model-based policy optimization for vision-language-action models. arXiv preprint arXiv:2511.09515, 2025.

APPENDIX A
IMPLEMENTATION DETAILS
A. Progress Estimator
The progress estimator is a compact cross-attention regressor that maps the instruction ℓ, the start observation o_0, and the current observation o_t to a scalar progress score p̂ ∈ [0, 1]. We first extract frozen pretrained features: visual patch tokens from the pretrained DINOv2 [34] and text tokens from the pretrained CLIP model (OpenAI CLIP ViT-L/14) [36]. All visual and text tokens are then projected into a shared embedding space. A lightweight cross-attention stack with residual connections and LayerNorm (i) aligns language with the current observation and (ii) encodes start-to-current changes. Finally, token features are mean-pooled, fused by a small MLP, and passed through a sigmoid head to predict p̂.

Concretely, let S, C, and T denote the projected (and role-embedded) start-frame visual tokens, current-frame visual tokens, and instruction tokens, respectively. Three residual cross-attention updates are applied: (1) attend from T to C to inject current visual context into the language stream; (2) attend from C to S to capture start-to-current changes; and (3) attend from the current-conditioned visual tokens to T to obtain a language-conditioned visual representation. The resulting streams are mean-pooled, concatenated, and passed through a fusion MLP and sigmoid head to obtain p̂.

The estimator uses 768-dim DINO patch features and 768-dim CLIP text features, each projected to width 1024, with 8 attention heads, dropout 0.1, and a 6-block Spatio-Temporal Transformer backbone [46]. The prediction head is an MLP with hidden size 512 and sigmoid output. The model is pretrained with a total budget of at most 400 H20-hours, achieving competitive offline progress estimation and providing a reliable signal for downstream guidance.

B. World Model
The world model follows the UniVLA world-model architecture [6]. Given observations o_t and o_{t+N}, the pretrained DINOv2 [34] extracts frozen features F_t and F_{t+N}, and the latent action a_z is modeled in feature space (instead of pixel space) for robustness to appearance changes. The encoder (inverse dynamics) maps (F_t, F_{t+N}) to a latent action a_z, and the decoder (forward dynamics) predicts future features conditioned on F_t and a_z:

a_z = E(F_t, F_{t+N}),  F̂_{t+N} = D(F_t, a_z).   (21)

A VQ bottleneck is used for latent-action discretization before passing a_z to downstream policy modules.
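A minimal training-step sketch of this feature-space world model follows; the linear encoder/decoder, pooled per-frame features, codebook size, and straight-through VQ estimator are illustrative assumptions, and the KL term of Eq. (5) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(768 * 2, 128)      # E: (F_t, F_{t+N}) -> a_z (pooled features, for simplicity)
dec = nn.Linear(768 + 128, 768)    # D: (F_t, a_z)     -> predicted F_{t+N}
codebook = nn.Embedding(32, 128)   # VQ codes for latent-action discretization

def world_model_step(F_t, F_tN):
    a_z = enc(torch.cat([F_t, F_tN], dim=-1))
    # Nearest-code lookup with a straight-through gradient estimator.
    idx = torch.cdist(a_z, codebook.weight).argmin(dim=-1)
    q = codebook(idx)
    a_q = a_z + (q - a_z).detach()
    F_hat = dec(torch.cat([F_t, a_q], dim=-1))
    recon = F.mse_loss(F_hat, F_tN)                       # Eq. (21) feature reconstruction
    vq = F.mse_loss(q, a_z.detach()) + 0.25 * F.mse_loss(a_z, q.detach())
    return recon + vq                                     # codebook + commitment terms

loss = world_model_step(torch.randn(8, 768), torch.randn(8, 768))
loss.backward()
```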
In our experiments, both the encoder and decoder adopt a Spatio-Temporal Transformer backbone [46], with hidden width 768, 12 Transformer layers, and 12 attention heads. The latent action is parameterized with dimension 128. Training minimizes an L2 feature-reconstruction loss with standard VQ commitment/codebook terms, plus the KL regularizer in Eq. (5) of the main paper to improve latent stability and reduce distribution drift.

C. Noise-Conditioned Evaluator
Following the evaluator definition in the main paper, candidate latent-action chunks are scored by applying the progress estimator to world-model-imagined futures. For state s = (ℓ, o_0, o_t) and latent-action chunk a, the task-aware score is

Q(s, a) = P(ℓ, o_0, D(o_t, a)).

To apply this evaluator inside diffusion sampling, we use a noise-conditioned form

Q_τ(x_τ, τ, s) = P(ℓ, o_0, D(o_t, x_τ, τ)),  τ ∈ {0, ..., 1000},   (22)

which is differentiable w.r.t. x_τ and provides classifier-guidance gradients.

The base world model D(o_t, a) is τ-agnostic. For evaluator prediction during classifier guidance only, a lightweight τ-conditioning branch is introduced and a guidance-time variant D(o_t, x_τ, τ) is used. Specifically, τ is encoded by a sinusoidal embedding e_τ = TimeEmb(τ) and concatenated with the noisy latent-action token x_τ and the current DINO features P_t:

X_τ = [x_τ; e_τ; P_t].   (23)

A Transformer decoder predicts future DINO features from X_τ, which are used in Eq. (22). This τ-conditioning stabilizes guidance gradients across noise levels and does not change the base world-model formulation.

To improve evaluator consistency from high-noise to low-noise steps (τ: 1000 → 0), the world model and progress estimator are jointly finetuned on expert demonstrations under the same noise conditioning as in Eq. (22), then distilled into a noise-aware evaluator, with a total compute budget of 160 H20-hours for classifier guidance.
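A sketch of the τ-conditioning in Eqs. (22)-(23), assuming a standard sinusoidal timestep embedding and a placeholder linear layer in place of the actual Transformer decoder.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_emb(tau: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Standard sinusoidal timestep embedding e_tau = TimeEmb(tau)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    ang = tau.float()[:, None] * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1)

# Guidance-time variant D(o_t, x_tau, tau): concatenate the noisy latent,
# the timestep embedding, and current DINO features, then decode (Eq. 23).
decoder = nn.Linear(128 + 64 + 768, 768)   # stand-in for the Transformer decoder

def noise_conditioned_future(x_tau, tau, dino_feats):
    X = torch.cat([x_tau, sinusoidal_emb(tau), dino_feats], dim=-1)
    return decoder(X)                       # predicted future DINO features

F_future = noise_conditioned_future(torch.randn(4, 128),
                                    torch.randint(0, 1000, (4,)),
                                    torch.randn(4, 768))
print(F_future.shape)  # torch.Size([4, 768])
```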
D. Latent Action Expert and Action Decoder

The policy is implemented as a two-stage diffusion system that factorizes planning into (i) denoising a latent-action chunk and (ii) decoding the latent chunk into an executable action chunk. Both stages adopt the LLaMA2-style diffusion-transformer architecture of [16]: the observation is tokenized into a compact visual token sequence, concatenated with language conditioning, and processed by a lightweight Transformer backbone. Each network is trained in the standard variance-preserving diffusion setting to predict the noise residual (epsilon prediction).

1) Latent Action Expert: The latent-action expert is a diffusion model operating in a compact latent-action space. It takes the current observation and instruction as conditioning and iteratively denoises a noisy latent variable into a latent-action chunk that captures task-relevant high-level intent. A fine-grained diffusion schedule with 1000 training timesteps is used for latent denoising, which provides smoother intermediate noise levels for guidance and refinement.

2) Action Decoder: The action decoder is another diffusion model that generates the executable action chunk. It is conditioned on the observation and instruction, and additionally takes the (noisy) latent-action variable as an explicit conditioning signal throughout denoising. Intuitively, the latent-action chunk represents an action plan in the visual (image) space, and the decoder translates it into low-level actions consistent with the robot embodiment. A shorter diffusion schedule with 100 training timesteps is used for action denoising, which reduces inference cost while retaining sufficient fidelity.

E. Two-Stage Inference with Latent Warm-Start

Inference uses two coupled diffusion processes: a latent-action diffusion (fine schedule) and an action-chunk diffusion (coarse schedule). The latent-action diffusion uses $T_z = 1000$ training timesteps, while the action-chunk diffusion uses $T_a = 100$. At test time, the noise scales are aligned via a two-stage procedure. This design is motivated by differing denoising difficulty: decoding a coherent latent plan benefits from a longer schedule, while the action chunk is simpler and can be generated faithfully with $T_a = 100$ steps.

1) Stage 1: Latent warm-start to the $T_a = 100$ noise scale: The latent-action denoiser is first run alone under the $T_z = 1000$ schedule, but only for the tail segment corresponding to timesteps $\tau > T_a$. Starting from Gaussian noise $x^z_{T_z} \sim \mathcal{N}(0, I)$, the following iterative updates are applied:

$$
\begin{aligned}
\epsilon^z_\tau &= \epsilon^z_\theta(x^z_\tau, \tau, s), \\
\tilde\epsilon^z_\tau &= \begin{cases}
\epsilon^z_\tau - \sigma^z_\tau \, \nabla_{x^z_\tau} Q_\tau(x^z_\tau, \tau, s), & \text{if CG on}, \\
\epsilon^z_\tau, & \text{otherwise},
\end{cases} \\
x^z_{\tau-1} &\leftarrow \mathrm{Step}^z(x^z_\tau, \tilde\epsilon^z_\tau, \tau), \qquad \forall\, \tau \in \{T_z, T_z - 1, \ldots, T_a + 1\},
\end{aligned}
\tag{24}
$$

yielding a partially denoised latent $x^z_{T_a}$ at a noise scale compatible with the action diffusion.

2) Stage 2: Joint denoising of latent and action chunks: A $T_a = 100$-step denoising loop is then run to jointly update the latent variable and the action chunk. An action-chunk noise $x^a_{T_a} \sim \mathcal{N}(0, I)$ is initialized, and the following coupled updates are applied for $\tau = T_a, \ldots, 1$:

$$
\begin{aligned}
\epsilon^z_\tau &= \epsilon^z_\theta(x^z_\tau, \tau, s), \\
\tilde\epsilon^z_\tau &= \begin{cases}
\epsilon^z_\tau - \sigma^z_\tau \, \nabla_{x^z_\tau} Q_\tau(x^z_\tau, \tau, s), & \text{if CG on}, \\
\epsilon^z_\tau, & \text{otherwise},
\end{cases} \\
x^z_{\tau-1} &\leftarrow \mathrm{Step}^z(x^z_\tau, \tilde\epsilon^z_\tau, \tau), \\
\epsilon^a_\tau &= \epsilon^a_\theta(x^a_\tau, x^z_{\tau-1}, \tau, s), \\
x^a_{\tau-1} &\leftarrow \mathrm{Step}^a(x^a_\tau, \epsilon^a_\tau, \tau).
\end{aligned}
\tag{25}
$$

Here, $\epsilon^z_\theta$ is the latent-action denoiser and $\epsilon^a_\theta$ is the action decoder conditioned on the updated latent $x^z_{\tau-1}$; $\mathrm{Step}^z$ and $\mathrm{Step}^a$ each denote one DDIM scheduler step, and CG is shorthand for classifier guidance. After Stage 2, $x^a_0$ is taken as the predicted action chunk to execute.

This two-stage procedure ensures that the latent plan is first brought to a compatible noise scale and then refined jointly with low-level action denoising, so that latent improvements immediately influence action updates within the same diffusion trajectory; a code sketch of the full procedure is given below.
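In the sketch below, eps_z, eps_a, ddim_step_z, ddim_step_a, and sigma_z are placeholders for the two denoisers, their DDIM scheduler steps, and the latent-diffusion noise scale; guidance_gradient can be the helper from the earlier sketch with its model arguments bound (e.g., via functools.partial). These are assumed names transcribing Eqs. (24)-(25), not an actual released API.

```python
import torch

def two_stage_inference(s, eps_z, eps_a, ddim_step_z, ddim_step_a, sigma_z,
                        guidance_gradient, latent_shape, action_shape,
                        T_z=1000, T_a=100, cg=True):
    def guided_eps(x_z, tau):
        eps = eps_z(x_z, tau, s)
        if cg:  # classifier guidance: steer toward higher predicted progress
            eps = eps - sigma_z(tau) * guidance_gradient(x_z, tau, s)
        return eps

    # Stage 1 (Eq. 24): latent warm-start for tau = T_z, ..., T_a + 1,
    # yielding x^z_{T_a} at the action diffusion's noise scale.
    x_z = torch.randn(latent_shape)
    for tau in range(T_z, T_a, -1):
        x_z = ddim_step_z(x_z, guided_eps(x_z, tau), tau)

    # Stage 2 (Eq. 25): joint denoising of latent and action chunks.
    x_a = torch.randn(action_shape)
    for tau in range(T_a, 0, -1):
        x_z = ddim_step_z(x_z, guided_eps(x_z, tau), tau)     # updated latent x^z_{tau-1}
        x_a = ddim_step_a(x_a, eps_a(x_a, x_z, tau, s), tau)  # decoder sees updated latent
    return x_a  # x^a_0: the executable action chunk
```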
Fig. 8. Real-world scenes (head-camera and wrist-camera views). All real-robot trials are executed using only the right arm for manipulation, while the left arm remains stationary throughout the rollout.

APPENDIX B
REAL-WORLD EXPERIMENTS

A. Robot Platform and Sensors

Real-world experiments are conducted on an ARX AC-One dual-arm platform equipped with two X5 arms and ARX G2 parallel grippers. The sensory setup consists of two Intel RealSense D405 RGB-D cameras: one wrist-mounted on the end-effector and one fixed as a stationary third-person "head-view" camera (see Fig. 8). Although both cameras support depth measurement, we do not use depth; the policy takes only RGB images as input.

B. Policy Execution and Evaluation

Fig. 9 visualizes the progress-estimator outputs during real-robot rollouts for the same instruction under two inference settings: with and without progress guidance. The predicted progress (in %) is plotted against the rollout timestep; higher values indicate that the evaluator believes the policy is closer to completing the task.

In Fig. 9, the top two panels show rollouts with progress guidance, while the bottom two panels show rollouts without it. With progress guidance, the gripper successfully closes on and lifts the orange, and the highlighted progress segment (blue box) increases monotonically, indicating consistent evaluator-measured advancement. Without progress guidance, the gripper fails to properly grasp the orange, and the highlighted progress segment exhibits noticeable oscillations, suggesting unstable advancement signals and dithering during execution.

This behavior is consistent with classifier guidance: during diffusion sampling, the evaluator-score gradient biases updates toward latent-action chunks predicted to achieve higher progress. Consequently, the guided policy selects task-advancing actions more reliably, reducing both the number of steps and unnecessary motion.

Fig. 9. Progress-estimator traces with/without progress guidance. The rollouts correspond to the instruction "pick up the orange and put it on the plate". Top: with progress guidance. Bottom: without progress guidance. The orange curve shows the predicted task progress over time.

C. Progress Estimator Generalization

A pretrained+finetuned progress estimator is compared with a from-scratch one under two appearance shifts: lighting change and novel objects.

1) Lighting shift: Fig. 10 compares progress traces under a lighting-shift setting for the same real-robot instruction.

Fig. 10. Progress traces under lighting shift. The rollout is an expert-collected trajectory for evaluating the progress estimator (instruction: "stack the bowls"). Left: pretrained+finetuned. Right: from scratch.

Under lighting shift, the pretrained+finetuned estimator (left) stays smooth and near-monotonic, reaching high progress in fewer steps. The from-scratch estimator (right) rises more slowly and plateaus in the later stage, making completion harder to identify.

2) Novel-object shift: A novel-object setting is further tested, in which the target object differs from those seen during training. Fig. 11 shows progress traces for an unseen-object manipulation instruction.

With novel objects, the pretrained+finetuned estimator (left) remains smooth and near-monotonic and approaches high progress near the end of the trajectory. The from-scratch estimator (right) fluctuates more and yields a less separable near-completion region, making it harder to judge whether the task is almost done. Pretraining plus finetuning thus improves robustness to appearance shifts, yielding more accurate progress predictions and a clearer terminal signal.

Fig. 11. Progress traces under novel objects. The rollout is an expert-collected trajectory for evaluating the progress estimator (instruction: "pick up the apple and put it on the plate"). Left: pretrained+finetuned. Right: from scratch.

D. Additional Video Results

More qualitative visualizations are provided in the supplementary videos/ folder. Progress-guidance videos are in videos/Guidance/ and are named by the corresponding instruction. Progress-estimator videos are in videos/Estimator/ and are named novel_object and light_shifting.
APPENDIX C
PROOF OF POLICY IMPROVEMENT

A. Task-Aware Score from the Evaluator

Following Eq. (16) in the main paper, let the state be $s = (\ell, o_0, o_t)$ and the (latent) action be $a$. Given a learned evaluator (world model + progress estimator), the task-aware score is defined as

$$ Q(s, a) = P\big(\ell, o_0, D(o_t, a)\big), \tag{26} $$

i.e., the predicted progress after applying $a$ through the world model $D$.

B. KL-Constrained Improvement

As in Eq. (17) in the main paper, progress maximization is cast as a KL-regularized policy improvement:

$$ \pi^\star(\cdot \mid s) = \arg\max_{\pi(\cdot \mid s)} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big], \quad \text{s.t.}\;\; \mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_0(\cdot \mid s)\big) \le \varepsilon. \tag{27} $$

To solve the KL-constrained problem (in Boltzmann form), first introduce a Lagrange multiplier $\alpha > 0$ and write the Lagrangian

$$ \mathcal{L}(\pi; \alpha) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big] - \alpha \Big( \mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_0(\cdot \mid s)\big) - \varepsilon \Big). \tag{28} $$

Taking the functional derivative w.r.t. $\pi(a \mid s)$, with an additional multiplier $\lambda(s)$ enforcing the normalization $\int \pi(a \mid s)\, da = 1$, gives the stationarity condition

$$ Q(s, a) - \alpha \big( \log \pi(a \mid s) - \log \pi_0(a \mid s) + 1 \big) + \lambda(s) = 0, $$

so $\pi(a \mid s) \propto \pi_0(a \mid s) \exp\big(Q(s, a)/\alpha\big)$, and normalization yields the unique optimum

$$ \pi^\star(a \mid s) = \frac{1}{Z(s)}\, \pi_0(a \mid s) \exp\!\Big( \frac{1}{\alpha} Q(s, a) \Big), \qquad Z(s) = \int \pi_0(a \mid s) \exp\!\Big( \frac{1}{\alpha} Q(s, a) \Big)\, da, \tag{29} $$

which matches Eq. (18) in the main paper. It implies that the log-density differs from that of the base policy by an additive energy term:

$$ \log \pi^\star(a \mid s) = \log \pi_0(a \mid s) + \frac{1}{\alpha} Q(s, a) - \log Z(s), \qquad \nabla_a \log \pi^\star(a \mid s) = \nabla_a \log \pi_0(a \mid s) + \frac{1}{\alpha} \nabla_a Q(s, a), \tag{30} $$

since $Z(s)$ does not depend on $a$.

C. Instantiating $\pi_0$ as a Variance-Preserving (VP) Diffusion Policy over Latent Actions

The action variable $a$ is now identified with the diffusion latent $x_0$, with denoising variable $x_\tau$ at diffusion step $\tau$. Under the VP forward process,

$$ x_\tau = \sqrt{\bar\alpha_\tau}\, x_0 + \sigma_\tau \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \sigma_\tau^2 = 1 - \bar\alpha_\tau. \tag{31} $$

The diffusion policy is parameterized by an $\epsilon$-predictor $\epsilon_\theta(x_\tau, \tau, s)$. For VP diffusion, the score of the base model satisfies

$$ s_\theta(x_\tau, \tau, s) \triangleq \nabla_{x_\tau} \log p_\theta(x_\tau \mid s) \approx -\frac{1}{\sigma_\tau}\, \epsilon_\theta(x_\tau, \tau, s), \tag{32} $$

where the approximation becomes exact when $\epsilon_\theta$ matches the conditional mean $\mathbb{E}[\epsilon \mid x_\tau, \tau, s]$.

D. Classifier Guidance on the Denoising Variable $x_\tau$

To apply the KL-derived improvement during sampling, the evaluator is used to define a noise-aware score $Q_\tau(s, x_\tau)$ (the evaluator takes $(x_\tau, \tau, s)$ as input). Applying Eq. (30) to the denoising variable gives the guided score

$$ s^\star(x_\tau, \tau, s) \triangleq \nabla_{x_\tau} \log p^\star(x_\tau \mid s) = s_\theta(x_\tau, \tau, s) + \frac{1}{\alpha} \nabla_{x_\tau} Q_\tau(s, x_\tau). \tag{33} $$

E. Converting the Guided Score to a Guided $\epsilon$ Target (Eq. (19))

Define a guided noise target $\tilde\epsilon$ by $s^\star(x_\tau, \tau, s) = -\sigma_\tau^{-1} \tilde\epsilon$. Combining Eq. (32) and Eq. (33) yields

$$ -\frac{1}{\sigma_\tau}\, \tilde\epsilon = -\frac{1}{\sigma_\tau}\, \epsilon_\theta(x_\tau, \tau, s) + \frac{1}{\alpha} \nabla_{x_\tau} Q_\tau(s, x_\tau), \qquad \tilde\epsilon = \epsilon_\theta(x_\tau, \tau, s) - \frac{\sigma_\tau}{\alpha} \nabla_{x_\tau} Q_\tau(s, x_\tau), \tag{34} $$

which is exactly Eq. (19) in the main paper (up to notation). Finally, the guided direction is distilled into the denoiser by minimizing the standard denoising objective with the guided target:

$$ \mathcal{L}_{\text{policy}} = \mathbb{E}\Big[ \big\| \tilde\epsilon - \epsilon_\theta(x_\tau, \tau, s) \big\|_2^2 \Big], \tag{35} $$

matching Eq. (20) in the main paper.
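In code, Eqs. (34)-(35) amount to regressing the denoiser onto a detached guided target. The sketch below reuses the guidance_gradient helper sketched in Appendix A (with its model arguments bound); eps_theta and sigma are placeholders for the denoiser and the VP noise scale, and alpha is the KL temperature. It is an illustrative sketch of the distillation step, not the authors' released training code.

```python
import torch

def policy_distillation_loss(x_tau, tau, s, eps_theta, sigma, guidance_gradient,
                             alpha=1.0):
    # Guided noise target of Eq. (34), detached so it serves as a fixed
    # regression target for the denoiser.
    grad_q = guidance_gradient(x_tau, tau, s)   # gradient of Q_tau w.r.t. x_tau
    eps_tilde = (eps_theta(x_tau, tau, s)
                 - (sigma(tau) / alpha) * grad_q).detach()
    # Standard epsilon-prediction objective against the guided target (Eq. 35).
    return torch.mean((eps_tilde - eps_theta(x_tau, tau, s)) ** 2)
```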