Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.


💡 Research Summary

Robometer tackles a fundamental bottleneck in robot learning: how to scale reward models beyond expert demonstrations to massive, noisy datasets that contain abundant failures and sub‑optimal attempts. Traditional general‑purpose reward models rely solely on dense, frame‑level progress labels derived from expert trajectories. While such absolute progress supervision is easy to obtain for successful demonstrations, it becomes ill‑defined and costly for failed rollouts, limiting both scalability and generalization.

The key insight of this work is to augment absolute progress supervision with a global, relative signal: trajectory‑level preference comparisons. By training a model to predict which of two videos better satisfies a given language instruction, the authors impose ordering constraints across diverse trajectories, tasks, robot embodiments, and viewpoints. This preference signal is cheap to obtain because it can be generated automatically through augmentations such as video rewinding, trimming, and cross‑task pairing, without any additional human annotation.
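As a concrete illustration of how such annotation-free pairs could be produced, here is a minimal sketch. The function names and the exact augmentation parameters are hypothetical; the paper only names the augmentation families (rewinding, trimming, cross-task pairing):

```python
import random

def rewind(frames):
    """Play the trajectory forward then backward, undoing its own progress."""
    return frames + frames[::-1]

def trim(frames, keep=0.5):
    """Keep only an early fraction of the trajectory, truncating progress."""
    return frames[: max(1, int(len(frames) * keep))]

def make_preference_pair(expert_frames, augment=None):
    """Return a (better, worse) pair given an expert trajectory.

    The augmented copy is strictly worse by construction, so the
    preference label comes for free, with no human annotation.
    """
    augment = augment or random.choice([rewind, trim])
    return expert_frames, augment(expert_frames)

better, worse = make_preference_pair(list(range(10)), augment=trim)
```

Cross-task pairing works the same way at the dataset level: an expert video for task A paired with the instruction for task B yields a guaranteed "worse" member of the pair.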

Robometer’s architecture builds on a large pre‑trained vision‑language model (VLM), specifically Qwen3‑VL‑4B‑Instruct, and introduces two new token types: (1) progress tokens interleaved within a single video stream to produce dense per‑frame progress estimates, and (2) a preference token appended after a pair of videos to make a binary judgment about which trajectory is “better”. The causal mask ensures progress tokens only attend to past frames, preserving the temporal nature of progress prediction, while the preference token attends to both videos jointly.
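The masking scheme can be sketched as a boolean attention mask. The token layout below ([video A frames][video B frames][preference token]) is an assumption for illustration; the paper's exact layout and the handling of text tokens are not reproduced here:

```python
import numpy as np

def build_attention_mask(n_frames_a, n_frames_b):
    """Boolean attention mask (True = may attend), for a hypothetical layout:
    [video A frames] [video B frames] [preference token].

    Frame tokens are causal within their own video, so per-frame progress
    estimates only use past frames; the preference token attends to both
    videos jointly to make its binary comparison.
    """
    n = n_frames_a + n_frames_b + 1
    mask = np.zeros((n, n), dtype=bool)
    # Causal attention within video A.
    for i in range(n_frames_a):
        mask[i, : i + 1] = True
    # Causal attention within video B (B's frames do not see A's frames).
    for i in range(n_frames_b):
        j = n_frames_a + i
        mask[j, n_frames_a : j + 1] = True
    # Preference token jointly attends to both videos and itself.
    mask[-1, :] = True
    return mask

m = build_attention_mask(3, 3)
```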

Training uses a composite loss:

  • Progress loss (L_prog) – a categorical cross‑entropy over discretized progress bins (C51 style) for expert trajectories, anchoring the absolute scale of the reward.
  • Preference loss (L_pref) – a binary cross‑entropy on the preference token’s hidden state, encouraging the model to rank trajectories correctly.
  • Success loss (L_succ) – a balanced binary cross‑entropy predicting a per‑frame success flag, useful for downstream RL.
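The three terms above can be sketched schematically in numpy. The loss weights, bin count, and head shapes here are assumptions, not the paper's exact configuration:

```python
import numpy as np

def progress_loss(logits, target_bin):
    """C51-style categorical cross-entropy over discretized progress bins."""
    log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(target_bin)), target_bin].mean()

def bce(logit, label):
    """Binary cross-entropy on a scalar logit (preference / success heads)."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def robometer_loss(prog_logits, prog_bins, pref_logit, pref_label,
                   succ_logit, succ_label, w_pref=1.0, w_succ=1.0):
    # Expert frames anchor absolute progress; preference pairs impose a
    # global ordering; the success head predicts a per-frame success flag.
    return (progress_loss(prog_logits, prog_bins)
            + w_pref * bce(pref_logit, pref_label)
            + w_succ * bce(succ_logit, succ_label))
```

For failed trajectories (p = None), the progress term would simply be masked out, leaving only the preference and success signals.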

The authors curate RBM‑1M, a 1‑million‑trajectory dataset spanning 21 robot embodiments (bimanual arms, single‑arm manipulators, mobile platforms) and human videos. The collection deliberately balances successful expert demonstrations with a substantial proportion of failed or sub‑optimal rollouts obtained from automated policy executions, simulation, and failure‑detection datasets. For failed trajectories, no absolute progress label is provided (p = None); they are only used in preference pairs. All trajectories are temporally normalized to a fixed length T to prevent the model from using trajectory length as a proxy for quality.
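The temporal normalization step can be as simple as uniform index resampling; a minimal sketch, assuming uniformly spaced frame selection (the paper's exact resampling scheme is not specified here):

```python
import numpy as np

def normalize_length(frames, T=32):
    """Resample a trajectory of any length to exactly T frames via
    uniformly spaced index selection, so that trajectory length can
    no longer serve as a proxy for quality."""
    idx = np.linspace(0, len(frames) - 1, T).round().astype(int)
    return [frames[i] for i in idx]

short = normalize_length(list(range(5)), T=8)    # upsamples by repetition
long_ = normalize_length(list(range(100)), T=8)  # downsamples by skipping
```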

Experiments evaluate both the reward model itself and its impact on downstream tasks. On six out‑of‑distribution (OOD) scenes gathered from three institutions, Robometer achieves a 14 % average improvement in reward rank‑correlation compared to state‑of‑the‑art baselines, and a 32 % boost in distinguishing successful from failed trajectories. Downstream, the model is applied to four distinct robot‑learning paradigms:

  1. Online RL – Using Robometer’s dense reward, policies converge faster and attain 2.4–4.5× higher success rates than baselines.
  2. Offline RL with noisy data – The model’s ability to rank sub‑optimal rollouts leads to higher sample efficiency and robust policy performance.
  3. Imitation‑learning data filtering – Robometer automatically selects high‑quality demonstrations, improving the quality of the imitation dataset and resulting policies.
  4. Zero‑shot failure detection – Across multiple robot platforms and institutions, the model accurately flags failures without any task‑specific fine‑tuning.
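The rank-correlation metric used in the evaluation above is, presumably, Spearman's rho between predicted per-frame rewards and ground-truth progress. A minimal sketch (no tie handling, which an off-the-shelf implementation would add):

```python
import numpy as np

def spearman_rank_correlation(pred, true):
    """Spearman's rho: the Pearson correlation of rank orders. Scores how
    well predicted rewards order frames by true task progress. Assumes no
    ties; a production version would use average ranks for tied values."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))
        return r
    rp, rt = ranks(np.asarray(pred)), ranks(np.asarray(true))
    rp, rt = rp - rp.mean(), rt - rt.mean()
    return float((rp * rt).sum() / np.sqrt((rp ** 2).sum() * (rt ** 2).sum()))
```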

Ablation studies reveal a mutual reinforcement effect: even when trained only on expert demonstrations, adding the preference objective improves the model’s ability to separate sub‑optimal from successful trajectories, indicating that global ordering constraints shape a more structured reward representation. Conversely, adding more unlabeled failure data continues to improve performance, demonstrating the scalability of the approach.

Limitations include reliance on automatically generated preference pairs, which may be sensitive to the design of augmentation policies, and the computational overhead of using a large VLM for long video sequences. Future work suggested by the authors includes incorporating human‑in‑the‑loop preference feedback, extending the modality beyond vision (e.g., force/torque signals), and developing lighter‑weight transformer variants for real‑time deployment.

In summary, Robometer introduces a principled dual‑supervision framework that combines absolute progress anchoring with trajectory‑level preference ranking, enabling reward learning at the scale of a million heterogeneous robot trajectories. This results in more generalizable, well‑calibrated reward functions that substantially boost performance across a wide range of robot learning applications.

