RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design



Tianxing Chen*1, Yuran Wang*2,3, Mingleyang Li*2, Yan Qin*4, Hao Shi5, Zixuan Li6, Yifan Hu2, Yingsheng Zhang5, Kaixuan Wang1, Yue Chen2, Hongcheng Wang2, Renjing Xu4, Ruihai Wu2, Yao Mu7, Yaodong Yang2,3, Hao Dong†2, Ping Luo†1

*Equal contribution, †Corresponding authors. 1 MMLab@HKU, 2 PKU, 3 PsiBot, 4 HKUST (GZ), 5 THU, 6 SZU, 7 SJTU. Correspondence to: Tianxing Chen <chentianxing@connect.hku.hk>, Ping Luo <pluo@hku.hk>, Hao Dong <hao.dong@pku.edu.cn>. Preprint. March 17, 2026.

Website: https://RMBench.github.io
Code: https://github.com/robotwin-Platform/rmbench

Abstract

Robotic manipulation policies have made rapid progress in recent years, yet most existing approaches give limited consideration to memory capabilities. Consequently, they struggle to solve tasks that require reasoning over historical observations and maintaining task-relevant information over time, which are common requirements in real-world manipulation scenarios. Although several memory-aware policies have been proposed, systematic evaluation of memory-dependent manipulation remains underexplored, and the relationship between architectural design choices and memory performance is still not well understood. To address this gap, we introduce RMBench, a simulation benchmark comprising 9 manipulation tasks that span multiple levels of memory complexity, enabling systematic evaluation of policy memory capabilities. We further propose Mem-0, a modular manipulation policy with explicit memory components designed to support controlled ablation studies. Through extensive simulation and real-world experiments, we identify memory-related limitations in existing policies and provide empirical insights into how architectural design choices influence memory performance.

1. Introduction

Recent progress in robotic manipulation has demonstrated strong capabilities across a wide range of tasks. Modern policies such as Pi0.6 (Intelligence et al., 2025a) and RDT2 (Team, 2025) achieve impressive performance in flexible object manipulation and fine-grained skills, including complex activities like coffee making. Nevertheless, most existing robotic policies are primarily designed for short-horizon tasks and fine-grained manipulation. These approaches typically rely on a fixed-length window of recent observations, implicitly assuming that the underlying decision process is approximately Markovian.

In contrast, memory-dependent tasks are ubiquitous in real-world robotic applications. Such tasks are inherently non-Markovian, as past observations and actions may influence future decisions over extended temporal horizons. Examples include remembering the location of previously placed objects or reasoning over multiple attempts after forgetting a password. These tasks are both challenging and practically important, as they require robots to retain, retrieve, and utilize information beyond short-term sensory inputs.

Motivated by this challenge, several recent works have begun to explore memory-aware robotic policies. Approaches such as MemoryVLA (Shi et al., 2025), MemER (Sridhar et al., 2025), CronusVLA (Li et al., 2025), and SAM2Act (Fang et al., 2025) incorporate explicit memory mechanisms to address long-horizon and memory-dependent decision making. Despite these efforts, the field currently lacks a systematically designed experimental platform for evaluating and analyzing robotic policies under long-term memory requirements. In particular, there is limited understanding of the underlying mechanisms that make memory strategies effective for robotic manipulation.

Existing benchmarks only partially bridge this gap. MemoryBench (Fang et al.
, 2025) comprises seven single-arm manipulation tasks that involve memory, yet only three can be reliably reproduced in simulation, and the benchmark provides limited guidance on principled task design. MIKASA (Cherepanov et al., 2025) introduces 32 memory-related manipulation tasks, but its formulation is largely tailored to reinforcement learning rather than general imitation learning. LIBERO-Long (Liu et al., 2023) offers ten long-horizon tasks, though these tasks do not explicitly demand memory, since all task-relevant information remains observable throughout execution.

To address these limitations, we first introduce Task Memory Complexity, a principled metric for characterizing memory requirements in robotic manipulation tasks. This metric provides a systematic way to classify memory-dependent tasks and serves as a guideline for task design. Based on this formulation, we propose RMBench, a robotic manipulation benchmark built on the RoboTwin 2.0 platform. RMBench consists of 9 dual-arm manipulation tasks spanning different levels of task memory complexity, enabling large-scale and controlled studies of memory retention and utilization in robotic manipulation.

Furthermore, we propose Mem-0, a novel memory-oriented robotic policy designed with modular components that can be easily replaced or reconfigured. Mem-0 adopts a dual-system architecture with a task-phase classifier that explicitly distinguishes different stages of a task, allowing structured memory usage across long horizons. Through systematic ablation studies of Mem-0, we analyze which design components are critical for effective memory in robotic manipulation and derive insights for future policy design.
Our main contributions are summarized as follows:

• We introduce Task Memory Complexity, a principled metric for categorizing memory-dependent robotic manipulation tasks, and propose RMBench, a simulation benchmark comprising 9 memory-centric tasks based on this metric.

• We propose Mem-0, a memory-oriented robotic manipulation policy featuring a dual-system architecture with a task-phase classifier for flexible memory usage.

• We conduct comprehensive evaluations of representative state-of-the-art policies on RMBench and perform detailed ablation studies of Mem-0, revealing which policy design mechanisms are most beneficial for memory in robotic manipulation.

2. Related Work

2.1. Robotic Manipulation Benchmarks

Physics-based simulators underpin modern robotic manipulation research, and numerous simulation benchmarks have been proposed in recent years. RoboTwin (Mu et al., 2025; Chen et al., 2025a), RoboCasa (Nasiriany et al., 2024), ManiSkill3 (Tao et al., 2025), AutoBio (Lan et al., 2025), UniVTAC (Chen et al., 2026), DexGarmentLab (Wang et al., 2025), BEHAVIOR-1K (Li et al., 2024a), and SIMPLER (Li et al., 2024c) provide diverse manipulation tasks, yet most scenarios emphasize short-horizon interactions or can be solved without relying on historical observations. Several benchmarks have begun to consider memory-related manipulation tasks. MemoryBench (Fang et al., 2025) includes a limited number of memory-dependent tasks but suffers from poor reproducibility and lacks clear task design principles. MIKASA (Cherepanov et al., 2025) introduces a larger collection of memory-related tasks, though its design is primarily tailored to reinforcement learning. LIBERO-Long (Liu et al., 2023) and RoboCerebra (Han et al., 2025) feature long-horizon tasks, but task-relevant information remains observable throughout execution and therefore does not explicitly require memory.
In contrast, RMBench explicitly stratifies memory requirements in manipulation tasks, enabling systematic analysis of memory-based policies across different levels of task difficulty.

2.2. Robotic Manipulation Policies

Recent advances in generative models and imitation learning have produced a wide range of robotic manipulation policies that achieve strong performance on individual tasks (Zhao et al., 2023; Chi et al., 2025; Ze et al., 2024; Chen et al., 2025b; Lu et al., 2024; Su et al., 2025). Inspired by large visual foundation models, many approaches adopt Vision-Language-Action (VLA) formulations (Wen et al., 2025a; Lin et al., 2025; Liang et al., 2025; Shen et al., 2025; Wen et al., 2025b), where policies are pretrained on large-scale robot datasets and exhibit improved generalization. Representative examples include Pi0.5 and Pi0.6 (Intelligence et al., 2025b;a), RDT2 (Team, 2025), and X-VLA (Zheng et al., 2025), as well as methods that incorporate future observation prediction such as Motus (Bi et al., 2025), CogACT (Li et al., 2024b), and CronusVLA (Li et al., 2025).

Despite these advances, most existing policies rely on fixed-length observation histories, which limits their ability to selectively retain task-relevant information over long time horizons. This limitation motivates recent memory-aware approaches, including MemoryVLA (Shi et al., 2025), MemER (Sridhar et al., 2025), and SAM2Act (Fang et al., 2025), which explicitly incorporate memory mechanisms for memory-dependent manipulation. Building on this line of work, we propose Mem-0, a modular memory-enabled policy designed to facilitate systematic ablation and analysis of memory components and architectural choices.

3. RMBench

In this section, we introduce Task Memory Complexity, a principled criterion for characterizing memory requirements in robotic manipulation tasks and guiding benchmark design.
Based on this, we propose RMBench, a simulation benchmark comprising nine manipulation tasks with varying memory demands, designed to support controlled evaluation across different levels of task memory complexity.

[Figure 1. RMBench Tasks. We illustrate the nine memory-dependent tasks in RMBench (Battery Try, Blocks Ranking Try, Cover Blocks, Press Button, Swap T, Swap Blocks, Observe and Pick Up, Rearrange Blocks, Put Back Block) along with their key execution steps. Detailed task descriptions are shown in Appendix A.]

3.1. Task Memory Complexity (TMC)

Robotic manipulation often operates under partial observability, where the current observation alone may be insufficient to determine task progress or the correct next action without access to past information. Such partial observability may arise from occlusions, delayed effects, or state aliasing. Importantly, the required past information does not necessarily correspond to a contiguous sequence of recent observations, but rather to a small set of task-relevant observations occurring at arbitrary time steps. Existing benchmarks typically assess memory through specific policy architectures, but lack a task-centric criterion that characterizes how much task-relevant information must be retained over time. To address this gap, we introduce Task Memory Complexity (TMC), which measures the minimal amount of past information required for optimal decision-making.

Setup. We model a manipulation task as a partially observable Markov decision process (POMDP) with latent state s_t ∈ S, observation o_t ∈ O, and action a_t ∈ A. Let the full interaction history up to time t be

    h_t = (o_{1:t}, a_{1:t-1}).    (1)

Rather than assuming access to the full history, we consider a memory state that summarizes task-relevant information from past observations.
Let M_t denote a memory representation constructed from h_t, and let M^(k)_t denote a memory state that encodes information from at most k task-relevant past observations.

Definition. Task Memory Complexity is defined as the smallest integer m ≥ 0 such that there exists an optimal policy π* whose action at time t depends only on the memory state M^(m)_t. Formally,

    ∃ π* s.t. π*(a_t | h_t) = π*(a_t | M^(m)_t), ∀ t.    (2)

Task Annotation. Tasks are annotated according to their Task Memory Complexity using the notation M(0), M(1), or more generally M(n), where the index indicates the number of task-relevant past observations that must be retained to solve the task optimally.

Interpretation. M(0) denotes memory-free tasks, for which the current observation is sufficient to uniquely determine task progress and the optimal action. M(1) denotes tasks that require retaining a single task-relevant past observation to disambiguate the current state. More generally, M(n) denotes tasks whose optimal decisions depend on retaining n task-relevant past observations, capturing non-local and multi-step temporal dependencies.

3.2. RMBench System Design

RMBench is developed within the RoboTwin 2.0 (Chen et al., 2025a) system framework. It is built on top of the SAPIEN (Xiang et al., 2020) simulation engine and supports both automated data synthesis and integrated policy evaluation within a unified pipeline. This design enables scalable data generation as well as consistent and reproducible benchmarking of robotic manipulation policies.

In addition, we provide fine-grained language annotations that align with each action–observation pair.
These annotations assign explicit linguistic descriptions to low-level interactions and state transitions, offering structured and dense supervision signals for downstream training of high-level reasoning or memory modules.

3.3. RMBench Benchmark Tasks

Based on the proposed Task Memory Complexity (Sec. 3.1), we design a total of nine memory-dependent manipulation tasks. These tasks are grouped into two categories: five M(1) tasks and four M(n) tasks. Representative key frames for each task are illustrated in Fig. 1, and detailed task specifications are provided in Appendix A.

The M(1) tasks consist of Observe and Pick Up, Rearrange Blocks, Put Back Block, Swap Blocks, and Swap T. These tasks require the policy to retain a single past observation or a fixed, limited number of historical frames. Successful execution depends on dynamically attending to task-relevant information across different stages of the task.

The M(n) tasks include Blocks Ranking Try, Press Button, Cover Blocks, and Battery Try. These tasks require repeated active exploration, trial-and-error interactions, or repeated execution for a task-specific number of attempts, often guided by external feedback. As a result, they demand strong long-term memory retention and effective retrieval mechanisms to accumulate and utilize historical information over extended horizons.

4. Mem-0 Policy

In this section, we present Mem-0, a modular memory-oriented robotic policy designed for systematic analysis of memory mechanisms. As shown in Fig. 2, Mem-0 consists of a Planning Module that performs subtask-level reasoning based on key memory and an Execution Module that executes subtasks using sliding-window and anchor memories. The two modules are connected by a Subtask End Classifier, which enables closed-loop planning and execution.
This modular design supports fine-grained analysis of the roles of different memory components in memory-dependent robotic manipulation.

4.1. Planning Module

In long-horizon manipulation tasks, end-to-end VLA models are prone to error accumulation and trajectory drift during inference. Prior work (Sridhar et al., 2025; Wen et al., 2024) mitigates these issues through subtask decomposition. However, such approaches become insufficient in memory-dependent settings, such as the M(n)-type tasks in RMBench, where accurate subtask inference requires reasoning over multiple previously completed subtasks, for example by tracking the number of button presses. To address this limitation, we introduce a key memory window module within the Planning Module, which aggregates completed subtasks and enables memory-aware subtask reasoning.

The Planning Module performs subtask-level reasoning using a vision-language model conditioned on visual observations and structured memory. At planning step t, the model receives as input the initial observation o_0, the task goal g, and the memory state M_{t-1}, and predicts the next subtask as

    s_t = V_plan(o_0, g, M_{t-1}).    (3)

Here, o_0 ∈ R^{H×W×3} denotes the initial RGB observation at the beginning of an episode, and g ∈ T denotes the global task instruction. The finished-task memory M_{t-1}, also referred to as the key memory window, aggregates all previously completed subtasks and is defined as

    M_{t-1} = {(s_i, o_i^end)}_{i=1}^{t-1},    (4)

where s_i ∈ T is the textual description of the i-th subtask and o_i^end ∈ R^{H×W×3} is the RGB observation at the termination of that subtask. Conditioning on M_{t-1} enables the vision-language model to explicitly reason over previously executed subtasks and their corresponding visual outcomes, thereby supporting memory-dependent subtask inference beyond single-frame observation-based planning.
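The key-memory planning step of Eqs. (3)-(4) can be sketched in a few lines of Python. This is an illustrative sketch only: `query_vlm`, the prompt layout, and the `KeyMemory` container are assumptions for exposition, not the paper's implementation.

```python
# Sketch of the Planning Module's key memory window (Sec. 4.1).
# `query_vlm` stands in for the vision-language planner V_plan; any VLM
# that accepts interleaved images and text could fill this role.

from dataclasses import dataclass, field

@dataclass
class KeyMemory:
    """Finished-task memory M_{t-1}: one (subtask text, end frame) pair
    per completed subtask, kept for the whole episode."""
    entries: list = field(default_factory=list)  # [(s_i, o_i_end), ...]

    def add(self, subtask: str, end_frame):
        self.entries.append((subtask, end_frame))

def plan_next_subtask(query_vlm, o_0, goal: str, memory: KeyMemory) -> str:
    """s_t = V_plan(o_0, g, M_{t-1}): condition on the initial observation,
    the task goal, and every previously completed subtask with its end frame."""
    prompt = [f"Task goal: {goal}", "Initial observation:", o_0]
    for i, (s_i, o_end) in enumerate(memory.entries, 1):
        prompt += [f"Completed subtask {i}: {s_i}",
                   "Observation at its end:", o_end]
    prompt.append("What is the next subtask?")
    return query_vlm(prompt)
```

Because the memory stores only one frame per completed subtask, the prompt grows with the number of subtasks N rather than with the episode length T, which is what makes conditioning on the full finished-task history tractable.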
In contrast to existing approaches that infer a new subtask at every observation frame and thus require O(T) planning calls over a horizon of T timesteps, the Planning Module performs subtask reasoning only upon subtask termination, as identified by a Subtask End Classifier (Section 4.3). For a long-horizon task consisting of N subtasks, where N ≪ T, this design reduces the number of planning invocations to O(N). As a result, the Execution Module can operate at a high control frequency within each subtask without being constrained by planning latency, substantially reducing computational overhead and accelerating task execution.

4.2. Execution Module

Although subtask-level planning is effective for many memory-dependent tasks, some tasks are not well suited for explicit subtask decomposition. For instance, in M(1)-type tasks such as Swap T, the target placement orientation of object T cannot be specified reliably through language. In addition, overly fine-grained subtask decomposition increases annotation effort and planning latency, which degrades overall execution efficiency. To address these limitations, we incorporate an anchor memory module and a sliding memory window module within the Execution Module. This design allows the Execution Module to handle M(1)-type tasks without additional subtask decomposition by maintaining a persistent anchor memory together with short-term transient memories.

[Figure 2. Mem-0 Pipeline. Mem-0 comprises a Planning Module and an Execution Module linked by a Subtask End Classifier. The Planning Module generates high-level subtasks from task instructions, observations, and key-frame memory, while the Execution Module produces low-level actions using the current observation, the subtask, and fused anchor and sliding memories in a diffusion-based policy. Upon subtask completion, a key frame is stored to enable iterative planning and execution until task completion. The diagram additionally labels Qwen3-VL-2B-Instruct and Qwen3-VL-8B-Instruct as the vision-language models and depicts the anchor and sliding memory fusion as cross-attention with residual connections feeding a Diffusion Transformer (DiT).]

The Execution Module executes the current subtask using a diffusion-based policy conditioned on multimodal perception and memory. A vision-language model (VLM), denoted by V_exec(·), encodes the current RGB observation o_t ∈ R^{H×W×3} and the subtask instruction s_t ∈ T into image and text token embeddings:

    [Z^img_t, Z^text_t] = V_exec(o_t, s_t).    (5)

Mean pooling is then applied to obtain compact latent representations z^img_t = MeanPool(Z^img_t) and z^text_t = MeanPool(Z^text_t). To incorporate memory, the image latent attends to two memory buffers: an anchor memory A and a sliding memory window S_t.
Memory-conditioned representations are computed via cross-attention:

    z̃^l_t = CrossAttn(z^img_t, M^l_t) + z^img_t,    (6)

where l ∈ {anchor, slide}, M^anchor_t = A, and M^slide_t = S_t. The fused tokens are concatenated with the text token to form the conditioning vector c_t = [z̃^anchor_t; z̃^slide_t; z^text_t].

After attention at timestep t, the image latent is appended to the sliding memory window:

    S_{t+1} = Trunc_K(S_t ∪ {z^img_t}),    (7)

where Trunc_K(·) retains the most recent K elements. At the beginning of a subtask (t = 0), the image latent z^img_0 is stored as the anchor memory A = {z^img_0}, which remains fixed throughout the subtask. Upon subtask termination, both memory buffers are reset: A ← ∅, S_t ← ∅.

For action generation, we employ a diffusion transformer (DiT) with a fixed action horizon H = 30. Let a^ε_{t:t+H-1} ∈ R^{H×d_a} denote a noisy action sequence obtained by perturbing a ground-truth action sequence with Gaussian noise, and let â_{t:t+H-1} ∈ R^{H×d_a} denote the corresponding denoised prediction produced by the model. At timestep t, the DiT predicts a denoised action sequence

    â_{t:t+H-1} = DiT(a^ε_{t:t+H-1}, x_t, c_t).    (8)

A prefix of the predicted sequence, denoted â_{t:t+Δ-1} with 1 ≤ Δ ≤ H, is executed as control commands before the next replanning step.

4.3. Subtask End Classifier

To enable closed-loop interaction between the Planning and Execution Modules, we introduce a Subtask End Classifier to detect subtask completion. The classifier is implemented as a lightweight multilayer perceptron (MLP) that operates on the conditioning vector c_t and outputs a binary signal C_end(c_t) ∈ {0, 1}, indicating whether the current subtask is ongoing or terminated at timestep t.
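The memory bookkeeping of the Execution Module (Eqs. (6)-(7)) can be sketched as follows. This is a minimal NumPy illustration: the learned query/key/value projections of real cross-attention are omitted, and the dimensions, class names, and single-head attention are assumptions rather than the paper's configuration.

```python
# Sketch of Eqs. (6)-(7): the pooled image latent attends to the anchor
# buffer A and the sliding window S_t with a residual connection, then is
# appended to the window, which keeps only the K most recent latents.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, bank):
    """Single-head scaled dot-product cross-attention with a residual
    connection (Eq. 6); learned projections omitted. query: (d,), bank: (n, d)."""
    d = query.shape[-1]
    weights = softmax(bank @ query / np.sqrt(d))   # attention over memory
    return weights @ bank + query                  # residual add

class ExecutionMemory:
    def __init__(self, window: int = 8):
        self.window = window   # K: sliding-window capacity
        self.anchor = None     # A: first latent of the subtask, held fixed
        self.sliding = []      # S_t: most recent K latents

    def fuse(self, z_img):
        if self.anchor is None:                    # t = 0: store the anchor
            self.anchor = z_img.copy()
        fused = cross_attend(z_img, self.anchor[None, :])
        if self.sliding:
            fused = cross_attend(fused, np.stack(self.sliding))
        # Eq. (7): append the current latent, truncate to the K most recent
        self.sliding = (self.sliding + [z_img.copy()])[-self.window:]
        return fused

    def reset(self):
        """On subtask termination, both buffers are cleared (A, S_t <- empty)."""
        self.anchor, self.sliding = None, []
```

The design point the sketch makes concrete: the anchor is written once per subtask and never evicted, while the sliding window turns over continuously, so the two buffers capture complementary persistent and transient context.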
To improve robustness and avoid premature termination caused by transient noise, we enforce a temporal consistency criterion. A subtask is considered complete only if the classifier predicts termination for L = 8 consecutive timesteps:

    Σ_{i=t-L+1}^{t} C_end(c_i) = L.    (9)

Once this condition is satisfied, the subtask is terminated, and the final observation o^end_t together with the corresponding subtask description s_t is passed to the Planning Module to trigger the next round of subtask-level reasoning. This mechanism establishes a closed loop between high-level planning and low-level execution, enabling coordinated and iterative subtask inference and execution.

5. Experiment

We design a set of experiments to validate three key objectives: (1) to evaluate the performance of existing manipulation policies and Mem-0 on RMBench, thereby characterizing their ability to handle memory-dependent tasks across different levels of difficulty; (2) to conduct systematic ablation studies on the Mem-0 architecture in order to analyze how different module designs affect performance on memory-intensive manipulation tasks; and (3) to perform real-world robotic experiments to assess the effectiveness and generalization of Mem-0 beyond simulation.

In addition to the SAPIEN platform, RMBench is also implemented on NVIDIA Isaac Lab - Arena (https://github.com/isaac-sim/IsaacLab-Arena).

5.1. Evaluation of Policies on RMBench

We benchmark a diverse set of policies on RMBench, including non-pretrained methods, pretrained methods, and Mem-0, our memory-centric policy. Specifically, DP and ACT are used as non-pretrained baselines, while Pi0.5 and X-VLA represent pretrained approaches. For each task under both the M(1) and M(n) settings, all models are trained with 50 expert demonstrations and evaluated over 100 rollout episodes.
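The temporal-consistency criterion of Sec. 4.3 (Eq. 9) amounts to a debouncing counter over the classifier's per-step outputs; a minimal sketch follows, in which the class name and interface are illustrative assumptions.

```python
# Sketch of the Subtask End Classifier's temporal-consistency check (Eq. 9):
# a subtask ends only after L consecutive "terminated" predictions, which
# filters out transient misclassifications by the per-step MLP.

class SubtaskEndDetector:
    def __init__(self, consecutive: int = 8):   # L = 8 in Sec. 4.3
        self.required = consecutive
        self.streak = 0

    def update(self, c_end: int) -> bool:
        """c_end in {0, 1} from the classifier at the current timestep;
        returns True exactly once, when the termination condition holds."""
        self.streak = self.streak + 1 if c_end == 1 else 0
        if self.streak >= self.required:
            self.streak = 0          # reset for the next subtask
            return True
        return False
```

A single spurious 0 resets the streak, so momentary classifier noise cannot trigger replanning; the cost is a fixed L-step detection latency before the Planning Module is invoked.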
For M(1)-type tasks, Mem-0 operates without subtask decomposition, and the reported results primarily reflect the capability of the Execution Module. In contrast, for M(n)-type tasks, subtask decomposition is performed at key decision points, and the results capture the joint performance of the Planning and Execution Modules. All baseline methods are trained without subtask decomposition. We report success rates for all methods in Table 1.

Experimental results show that both non-pretrained and pretrained baselines consistently underperform on memory-dependent tasks. This behavior can be attributed to the fact that most existing models are designed under a Markovian assumption, where the next action is determined solely by the current observation. When applied to the non-Markovian tasks in RMBench, these models fail to infer the correct action without access to task-relevant past information, leading to substantial performance degradation. Representative failure cases of baseline methods are illustrated in Fig. 3.

In contrast, Mem-0 demonstrates substantial performance gains across the majority of tasks, indicating the effectiveness of explicitly incorporating memory mechanisms. On average, Mem-0 improves success rates by 38.4% on M(1) tasks and 21.2% on M(n) tasks relative to the baselines, underscoring the critical role of memory modules in addressing memory-dependent manipulation in RMBench.

Despite these gains, Mem-0 exhibits limitations on tasks that require strong semantic understanding, such as Observe and Pick Up, where pretrained models retain an advantage due to large-scale pretraining. In fine-grained manipulation tasks such as Swap T, Mem-0 exhibits limited placement accuracy, resulting in only marginal performance gains.
In the Press Button task, the small magnitude of individual press actions further complicates reliable termination detection: the Subtask End Classifier may fail to consistently recognize task completion, causing repeated presses or missed contacts and ultimately zero success.

These results highlight several open challenges in the current Mem-0 design. More visualizations and analysis can be found in Appendix C and the Supplementary Material. Nevertheless, the overall performance trends clearly demonstrate that explicit memory modeling yields significant improvements on the majority of memory-dependent tasks in RMBench.

5.2. Analysis on Memory-Related Modules

In this section, we analyze the contribution of individual components in Mem-0 to provide insights into effective memory module design. To this end, we conduct four ablation studies:

(1) w/o Anchor: The anchor memory module in the Execution Module is removed, so image tokens are no longer fused with anchor memory tokens.

(2) w/o Sliding: The sliding memory module in the Execution Module is removed, and image tokens are not fused with historical sliding memory tokens.

(3) w/o Key: The key memory window in the Planning Module is removed, and subtask inference relies solely on a single-frame observation.

(4) GT Classifier: The Subtask End Classifier is removed, and subtask termination is determined using ground-truth signals provided by the simulator.

Table 1. RMBench benchmark results. RMBench includes nine manipulation tasks across the M(1) and M(n) levels of Task Memory Complexity. We report success rates for five policies, each trained with 50 synthesized demonstrations and evaluated over 100 rollouts. (Bold: best; Underlined: second-best; Green: relative improvement over the second-best.)
Tasks               | TMC  | DP    | ACT   | Pi0.5 | X-VLA | Mem-0 (ours)
Observe and Pick Up | M(1) | 1%    | 1%    | 9%    | 9%    | 4%
Rearrange Blocks    | M(1) | 0%    | 29%   | 13%   | 13%   | 89%
Put Back Block      | M(1) | 0%    | 0%    | 11%   | 18%   | 90%
Swap Blocks         | M(1) | 11%   | 2%    | 24%   | 16%   | 67%
Swap T              | M(1) | 20%   | 2%    | 15%   | 3%    | 14%
Average             | M(1) | 6.4%  | 6.8%  | 14.4% | 11.8% | 52.8% (+38.4%)
Battery Try         | M(n) | 10%   | 19%   | 16%   | 26%   | 28%
Blocks Ranking Try  | M(n) | 10%   | 0%    | 6%    | 1%    | 18%
Cover Blocks        | M(n) | 0%    | 0%    | 0%    | 2%    | 68%
Press Button        | M(n) | 0%    | 0%    | 0%    | 0%    | 0%
Average             | M(n) | 5%    | 4.8%  | 5.5%  | 7.3%  | 28.5% (+21.2%)
Total Average       | /    | 5.8%  | 5.9%  | 10.4% | 9.8%  | 42.0% (+31.6%)

Table 2. Ablation studies. (Bold: the best results; Underlined: the second-best results.)

M(1) Tasks     | Observe and Pick Up | Rearrange Blocks | Put Back Block | Swap Blocks | Swap T | Average
Vanilla (ours) | 4%                  | 89%              | 90%            | 67%         | 14%    | 52.8%
w/o Anchor     | 4%                  | 73%              | 35%            | 15%         | 7%     | 26.8%
w/o Sliding    | 3%                  | 62%              | 78%            | 39%         | 20%    | 40.4%

M(n) Tasks     | Battery Try | Blocks Ranking Try | Cover Blocks | Press Button | Average
Vanilla (ours) | 28%         | 18%                | 68%          | 0%           | 28.5%
w/o Key        | 13%         | 1%                 | 5%           | 0%           | 4.8%
w/o Anchor     | 14%         | 0%                 | 92%          | 1%           | 26.8%
w/o Sliding    | 17%         | 0%                 | 84%          | 0%           | 25.3%
GT Classifier  | 30%         | 45%                | 92%          | 14%          | 45.3%

Because Mem-0 does not perform subtask decomposition for M(1)-type tasks, the ablation study for M(1) includes only the w/o Anchor and w/o Sliding settings. All ablation results are reported in Table 2.

Analysis on Anchor Memory Performance. Compared to the vanilla setting, removing the anchor memory (w/o Anchor) leads to a substantial reduction in success rates across most tasks. Qualitative inspection of evaluation videos indicates that, although the sliding memory window remains active, Mem-0 progressively loses access to task-critical information as relevant memories are evicted over time. As a result, the policy fails to attend to essential cues required for correct decision-making and exhibits erroneous behaviors similar to those observed in Fig. 3.
These findings indicate that, for memory-dependent manipulation tasks, it is crucial for a policy to explicitly identify and retain task-critical information throughout execution in order to achieve reliable task completion.

Analysis on Sliding Memory Performance. The sliding memory window primarily captures short-term historical motion trends. Experiments show that, even with the support of anchor information, removing this module still degrades performance across most tasks, with success rates falling below those of the vanilla model. Qualitative results further indicate that, without sliding memory, Mem-0 exhibits unstable and oscillatory behaviors; for example, in the button-pressing task, the policy cannot infer whether the button has already been pressed from a single observation, leading to premature termination or redundant actions and eventual failure.

Interestingly, on the Swap T task, the w/o Sliding setting outperforms the vanilla model. This improvement likely stems from the fixed motion patterns in the training data and the task's high sensitivity to the initial orientation of the T-shaped object. Removing sliding memory shifts the policy's reliance toward anchor information and reduces interference from transient motion cues, leading to better performance. This observation suggests that sliding memory can function as either a facilitator or a source of interference, underscoring the importance of coordinating sliding and anchor memories.

Analysis on Key Memory Performance. Under the w/o Key setting, where subtask inference is based solely on single-frame observations, success rates decrease substantially relative to the vanilla configuration.
Qualitative analysis shows that, for M(n)-type tasks requiring long-term information to inform subsequent motion decisions, the Planning Module is unable to reliably infer the correct next subtask when restricted to the current observation alone, resulting in task failure. These results indicate that retaining key memories is critical for accurate and reliable subtask inference in memory-dependent manipulation tasks.

Figure 3. Visualization of Typical Baseline Errors. (Panels: rollouts from 0 to T on Swap T, Observe and Pick Up, Swap Blocks, Put Back Block, Rearrange Blocks, Blocks Ranking Try, Cover Blocks, Battery Try, and Press Button.) Because the baseline predicts the next action solely from the current observation, it struggles to perform reliably on non-Markovian tasks that require persistent memory over time.

Analysis on the Classifier between the Planning and Execution Modules. The Subtask End Classifier in Mem-0 serves two primary functions. First, it connects the Planning and Execution Modules by triggering high-level reasoning only when necessary, thereby reducing inference cost and latency. Second, it enables task simplification through subtask decomposition. Compared with MemER (Sridhar et al., 2025), which performs high-level subtask reasoning at every timestep, Mem-0 operates at a planning frequency of approximately 5-10 Hz, whereas MemER runs at 1-2 Hz. In addition, the strong performance achieved with the GT Classifier highlights the effectiveness of subtask decomposition relative to fully end-to-end approaches. Nevertheless, the current classifier design in Mem-0 is relatively simple and lacks precision in detecting subtask transitions. Inaccurate transition timing can negatively impact subtask inference in the Planning Module.
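The event-triggered coupling described above can be sketched as a control loop in which the planner runs only when the classifier fires, rather than at every timestep. All interfaces below are hypothetical placeholders, not the released implementation:

```python
def run_episode(observations, classifier, planner, executor):
    """Sketch of the Planning/Execution split (hypothetical interfaces):
    the high-level planner is invoked only when the subtask-end classifier
    fires, instead of at every timestep as in per-step reasoning."""
    subtask = planner(observations[0])  # initial high-level inference
    planner_calls = 1
    actions = []
    for obs in observations[1:]:
        if classifier(obs, subtask):    # subtask finished -> replan
            subtask = planner(obs)
            planner_calls += 1
        actions.append(executor(obs, subtask))
    return actions, planner_calls

# With a classifier that fires every 5 steps, the expensive high-level
# planner runs far less often than once per timestep.
obs = list(range(20))
_, calls = run_episode(
    obs,
    classifier=lambda o, s: o % 5 == 0,
    planner=lambda o: f"subtask@{o}",
    executor=lambda o, s: (o, s),
)
assert calls == 4  # vs. 20 planner calls under per-step reasoning
```

The sketch also makes the failure mode visible: if `classifier` never fires (or fires late), `subtask` stays stale and the executor keeps acting under the wrong context.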
These results indicate that more refined classifier designs are necessary to achieve tighter coordination between the Planning and Execution Modules through more accurate subtask-termination signals.

5.3. Real-World Experiment

Table 3. Real-world experiment results.

Tasks            | ACT   | Pi0.5 | Mem-0 (ours)
Put Back Block   | 0.0%  | 10.0% | 17.5%
Rearrange Blocks | 0.0%  | 7.5%  | 37.5%
Cover Blocks     | 0.0%  | 0.0%  | 12.5%
Average          | 0.00% | 5.83% | 22.50%

Figure 4. Real-world Experiment Tasks (Put Back Block, Rearrange Blocks, Cover Blocks). The real-world experimental setup is illustrated above.

To assess the real-world performance of Mem-0, we evaluate it on three physical manipulation tasks aligned with RMBench: Put Back Block, Rearrange Blocks, and Cover Blocks. We compare Mem-0 against ACT and Pi0.5, with all data collection and evaluation conducted on the X-One dual-arm robotic platform. For each task, we collect 100 real-world demonstrations and evaluate the trained policies over 40 rollout trials, reporting success rate as the primary metric. The evaluation results are summarized in Table 3.

The results show that Mem-0 outperforms both baseline policies in real-world experiments. Upon closer inspection, we observe that most failures of Mem-0 arise from imprecise block manipulation rather than from high-level task planning. This behavior is likely attributable to two factors. First, real-world data collection involves diverse human behaviors, which introduces additional variability and increases the difficulty of learning consistent low-level manipulation skills. Second, Mem-0 is trained without dedicated pretraining on robotic manipulation, which may limit its ability to generalize fine-grained motor behaviors.
Addressing these limitations through improved low-level pretraining and more structured real-world data collection is an important direction for future work.

6. Conclusion

In this paper, we presented RMBench (a benchmark) and Mem-0 (a policy) for systematically evaluating memory in robotic manipulation. Our experiments reveal the memory limitations of existing policies and show how different architectural choices (such as anchor memory, sliding memory, and key memory) affect memory performance, providing preliminary insights into the principled integration of memory mechanisms for effective memory-dependent robotic manipulation. Promising directions for future work include improved memory representation and fusion, more robust subtask-termination criteria, and the integration of pretraining to enhance semantic understanding and generalization. We hope RMBench fosters principled progress toward scalable, memory-aware robotic manipulation.

Acknowledgements

We would like to thank Xspark AI for supporting our real-world experiments, and D-Robotics for providing the computing resources. We also thank the NVIDIA Isaac Lab - Arena Team for technical support.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030, 2025.

Chen, B., Wan, W., Chen, T., Guo, X., Xu, C., Qi, Y., Zhang, H., Wu, L., Xu, T., Li, Z., et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking. arXiv preprint arXiv:2602.10093, 2026.

Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Li, Z., Liang, Q., Lin, X., Ge, Y., Gu, Z., et al.
Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088, 2025a.

Chen, T., Mu, Y., Liang, Z., Chen, Z., Peng, S., Chen, Q., Xu, M., Hu, R., Zhang, H., Li, X., et al. G3flow: Generative 3d semantic flow for pose-aware and generalizable object manipulation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 1735–1744, 2025b.

Cherepanov, E., Kachaev, N., Kovalev, A. K., and Panov, A. I. Memory, benchmark & robots: A benchmark for solving complex tasks with reinforcement learning. arXiv preprint arXiv:2502.10550, 2025.

Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

Fang, H., Grotz, M., Pumacay, W., Wang, Y. R., Fox, D., Krishna, R., and Duan, J. Sam2act: Integrating visual foundation model with a memory architecture for robotic manipulation. arXiv preprint arXiv:2501.18564, 2025.

Han, S., Qiu, B., Liao, Y., Huang, S., Gao, C., Yan, S., and Liu, S. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation. arXiv preprint arXiv:2506.06677, 2025.

Intelligence, P., Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., Equi, M., Esmail, A., Fang, Y., Finn, C., Glossop, C., Godden, T., Goryachev, I., Groom, L., Hancock, H., Hausman, K., Hussein, G., Ichter, B., Jakubczak, S., Jen, R., Jones, T., Katz, B., Ke, L., Kuchi, C., Lamb, M., LeBlanc, D., Levine, S., Li-Bell, A., Lu, Y., Mano, V., Mothukuri, M., Nair, S., Pertsch, K., Ren, A. Z., Sharma, C., Shi, L. X., Smith, L., Springenberg, J.
T., Stachowicz, K., Stoeckle, W., Swerdlow, A., Tanner, J., Torne, M., Vuong, Q., Walling, A., Wang, H., Williams, B., Yoo, S., Yu, L., Zhilinsky, U., and Zhou, Z. π*0.6: A VLA that learns from experience, 2025a. URL https://arxiv.org/abs/2511.14759.

Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A. Z., Shi, L. X., Smith, L., Springenberg, J. T., Stachowicz, K., Tanner, J., Vuong, Q., Walke, H., Walling, A., Wang, H., Yu, L., and Zhilinsky, U. π0.5: A vision-language-action model with open-world generalization, 2025b. URL https://arxiv.org/abs/2504.16054.

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

Lan, Z., Jiang, Y., Wang, R., Xie, X., Zhang, R., Zhu, Y., Li, P., Yang, T., Chen, T., Gao, H., et al. Autobio: A simulation and benchmark for robotic automation in digital biology laboratory. arXiv preprint arXiv:2505.14030, 2025.

Li, C., Zhang, R., Wong, J., Gokmen, C., Srivastava, S., Martín-Martín, R., Wang, C., Levine, G., Ai, W., Martinez, B., Yin, H., Lingelbach, M., Hwang, M., Hiranaka, A., Garlanka, S., Aydin, A., Lee, S., Sun, J., Anvari, M., Sharma, M., Bansal, D., Hunter, S., Kim, K.-Y., Lou, A., Matthews, C. R., Villa-Renteria, I., Tang, J. H., Tang, C., Xia, F., Li, Y., Savarese, S., Gweon, H., Liu, C. K., Wu, J., and Fei-Fei, L.
Behavior-1k: A human-centered, embodied AI benchmark with 1,000 everyday activities and realistic simulation. arXiv preprint, 2024a.

Li, H., Yang, S., Chen, Y., Tian, Y., Yang, X., Chen, X., Wang, H., Wang, T., Zhao, F., Lin, D., et al. Cronusvla: Transferring latent motion across time for multi-frame prediction in manipulation. arXiv preprint arXiv:2506.19816, 2025.

Li, Q., Liang, Y., Wang, Z., Luo, L., Chen, X., Liao, M., Wei, F., Deng, Y., Xu, S., Zhang, Y., et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024b.

Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H. R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., and Xiao, T. Evaluating real-world robot manipulation policies in simulation, 2024c. URL https://arxiv.org/abs/2405.05941.

Liang, Z., Li, Y., Yang, T., Wu, C., Mao, S., Nian, T., Pei, L., Zhou, S., Yang, X., Pang, J., et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025.

Lin, M., Ding, P., Wang, S., Zhuang, Z., Liu, Y., Tong, X., Song, W., Lyu, S., Huang, S., and Wang, D. Hif-vla: Hindsight, insight and foresight through motion representation for vision-language-action models. arXiv preprint arXiv:2512.09928, 2025.

Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310.

Lu, G., Gao, Z., Chen, T., Dai, W., Wang, Z., Ding, W., and Tang, Y. Manicm: Real-time 3d diffusion policy via consistency model for robotic manipulation. arXiv preprint arXiv:2406.01586, 2024.

Mu, Y., Chen, T., Chen, Z., Peng, S., Lan, Z., Gao, Z., Liang, Z., Yu, Q., Zou, Y., Xu, M., et al. Robotwin: Dual-arm robot benchmark with generative digital twins.
In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 27649–27660, 2025.

Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., and Zhu, Y. Robocasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523, 2024.

Shen, W., Liu, Y., Wu, Y., Liang, Z., Gu, S., Wang, D., Nian, T., Xu, L., Qin, Y., Pang, J., et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025.

Shi, H., Xie, B., Liu, Y., Sun, L., Liu, F., Wang, T., Zhou, E., Fan, H., Zhang, X., and Huang, G. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236, 2025.

Sridhar, A., Pan, J., Sharma, S., and Finn, C. Memer: Scaling up memory for robot control via experience retrieval. arXiv preprint arXiv:2510.20328, 2025.

Su, Y., Zhan, X., Fang, H., Xue, H., Fang, H.-S., Li, Y.-L., Lu, C., and Yang, L. Dense policy: Bidirectional autoregressive learning of actions. arXiv preprint arXiv:2503.13217, 2025.

Tao, S., Xiang, F., Shukla, A., Qin, Y., Hinrichsen, X., Yuan, X., Bao, C., Lin, X., Liu, Y., Chan, T.-k., Gao, Y., Li, X., Mu, T., Xiao, N., Gurha, A., Rajesh, V. N., Choi, Y. W., Chen, Y.-R., Huang, Z., Calandra, R., Chen, R., Luo, S., and Su, H. Maniskill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI, 2025.

Team, R. Rdt2: Enabling zero-shot cross-embodiment generalization by scaling up umi data, September 2025. URL https://github.com/thu-ml/RDT2.

Wang, Y., Wu, R., Chen, Y., Wang, J., Liang, J., Zhu, Z., Geng, H., Malik, J., Abbeel, P., and Dong, H. Dexgarmentlab: Dexterous garment manipulation environment with generalizable policy, 2025.
URL https://arxiv.org/abs/2505.11032.

Wen, J., Zhu, Y., Zhu, M., Tang, Z., Li, J., Zhou, Z., Liu, X., Shen, C., Peng, Y., and Feng, F. Diffusion vla: Scaling robot foundation models via unified diffusion and autoregression. In Forty-second International Conference on Machine Learning.

Wen, J., Zhu, Y., Li, J., Tang, Z., Shen, C., and Feng, F. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025a.

Wen, J., Zhu, Y., Li, J., Zhu, M., Tang, Z., Wu, K., Xu, Z., Liu, N., Cheng, R., Shen, C., et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025b.

Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., et al. Sapien: A simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11097–11107, 2020.

Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954, 2024.

Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705.

Zheng, J., Li, J., Wang, Z., Liu, D., Kang, X., Feng, Y., Zheng, Y., Zou, J., Chen, Y., Zeng, J., Zhang, Y.-Q., Pang, J., Liu, J., Wang, T., and Zhan, X. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model, 2025. URL https://arxiv.org/abs/2510.10274.

Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., and Ma, Y.
Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics. URL 13372.

A. RMBench Tasks Description

Table 4. Task descriptions of the RMBench benchmark.

Observe and Pick Up: A reference object is placed on a shelf, and multiple objects are placed on the table. The robot first observes the reference object while remaining stationary. After the reference object is hidden, the robot must pick up the matching object from the table.

Rearrange Blocks: Two pads and a button are placed on the table. One block is positioned between the two pads, and another block is placed on one of the pads. The robot moves the middle block onto a pad, presses the button, and then moves the other block to the middle position.

Put Back Block: Four pads are arranged around a central position, with one block placed on one of the pads. The robot moves the block to the center, presses the button, and then returns the block to its original pad.

Swap Blocks: Three pads and a button are placed on the table, with two blocks placed on different pads. The robot uses the empty pad to swap the positions of the two blocks and then presses the button.

Swap T: Two T-shaped blocks with different colors are placed on the table. The robot picks up both blocks and swaps their positions and orientations.

Battery Try: Two batteries with random orientations and a dual-slot battery holder are placed on the table. The robot repeatedly attempts different insertion orders, placing both batteries into the holder with the correct orientations until the insertion succeeds.

Blocks Ranking Try: Three blocks of different colors are randomly arranged on the table, along with a button.
The robot repeatedly attempts different block arrangements and presses the button to confirm until the correct ordering is achieved.

Cover Blocks: Three colored blocks (red, green, and blue) and three covers are placed on the table. The robot covers the blocks from left to right, then uncovers them in red–green–blue order and returns the covers to their original positions.

Press Button: Three buttons (left, middle, and right) and two single-digit number tiles are placed on the table. The robot presses the left button the number of times indicated by the left digit, presses the middle button the number of times indicated by the right digit, and then presses the right button to confirm.

B. Training Details

B.1. Planning Module

In the Planning Module, we fine-tune the vision-language model (Qwen3-VL-8B-Instruct) with LoRA via LLaMA-Factory (Zheng et al., 2024) to enable reasoning over key memories. After fine-tuning, we deploy the model with vLLM (Kwon et al., 2023) for efficient loading and inference. The key hyperparameters used for VLM fine-tuning are summarized in Table 5. Training is conducted on 8 NVIDIA A800 GPUs, and training a single task takes approximately half an hour.

Table 5. Hyperparameters for fine-tuning the Mem-0 Planning Module.

Configuration   | Value
Finetuning Type | LoRA
LoRA Rank       | 8
Batch Size      | 16
Learning Rate   | 1.0e-4
Epochs          | 25
LR Scheduler    | Cosine
Warmup Ratio    | 0.1
Dtype           | bf16

B.2. Execution Module

In this section, we detail the training infrastructure, training organization strategy, and hyperparameter configurations for the Execution Module in Mem-0.

B.2.1. Training Infrastructure and Time Budget

The Execution Module of Mem-0 uses a single-task training strategy, in which the model is trained from scratch for each specific task.
Training is conducted on 8 NVIDIA A800 GPUs with a global batch size of 448 over 30K iterations; training a single task takes approximately 18 hours.

B.2.2. Training Organization Strategy

Forward Pass Strategy. Given the memory-centric architecture of Mem-0, we employ a specialized training methodology. Within each batch, VLM token generation and DiT-based action-chunk generation are executed in parallel. Conversely, the fusion of Sliding Memory and Anchor Memory requires the temporal integrity of the data; this stage is therefore processed serially, ensuring that all frames within an episode remain sequentially aligned. This approach guarantees effective utilization of VLM tokens. Furthermore, we maintain a global data structure during training that stores memory information for each episode, enabling seamless cross-batch token reuse.

Dataloader Implementation. The dataloader is custom-designed to match this architecture. Episodes are distributed as evenly as possible across all GPUs. Each GPU processes its assigned episodes with a fixed number of workers and manages its dataloader resets independently. Owing to the stochastic nature of this distribution, the frames within a global batch become temporally desynchronized as training progresses, which allows the model to learn simultaneously from data spanning various time steps.

B.2.3. Hyperparameter Configurations

Table 6 summarizes the key training hyperparameters. To balance the learning-rate requirements of different modules, we use a grouped learning-rate strategy together with a cosine learning-rate schedule and linear warm-up. Regarding numerical precision, the VLM and Memory Bank components operate in bfloat16, while all other modules use float32.
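The warm-up plus cosine schedule implied by Table 6 can be written out explicitly per parameter group. The sketch below uses the listed settings (warmup ratio 0.05, 30K iterations, per-group peak/min learning rates); it is an illustration of the schedule, not the authors' trainer code:

```python
import math

def lr_at(step, total_steps, peak_lr, min_lr, warmup_ratio=0.05):
    """Cosine schedule with linear warmup, applied per parameter group:
    rises linearly to peak_lr over the warmup steps, then decays along a
    half cosine from peak_lr down to min_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine      # peak -> min decay

groups = {  # (peak LR, min LR) pairs from Table 6
    "base":        (1.0e-5, 1.0e-6),
    "action_head": (1.0e-4, 5.0e-6),
}
total = 30_000
for name, (peak, floor) in groups.items():
    assert lr_at(0, total, peak, floor) < peak                    # still warming up
    assert abs(lr_at(total, total, peak, floor) - floor) < 1e-12  # ends at min LR
```

Each group (base, VLM, action head, classifier) follows the same curve, only with its own peak and floor, which is what "grouped learning rate" amounts to here.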
Furthermore, input images are resized to 224 × 224 and lightly augmented with frame-independent ColorJitter to enhance the model's generalization capabilities.

Table 6. Hyperparameters for training the Mem-0 Execution Module.

Configuration  | Value
Batch Size     | 448
Iterations     | 30,000
Max Grad. Norm | 2.5
LR Scheduler   | Cosine
Warmup Ratio   | 0.05
Optimizer      | AdamW
Momentum       | β1, β2 = 0.9, 0.999
Weight Decay   | 0.005
Image Resize   | 224 × 224
Image Aug.     | ColorJitter (0.1, 0.1, 0.1, 0)

Configuration        | Value
LR (Base)            | 1.0e-5
LR (VLM)             | 1.0e-5
LR (Action Head)     | 1.0e-4
LR (Classifier)      | 1.0e-4
Min LR (Base)        | 1.0e-6
Min LR (VLM)         | 1.0e-6
Min LR (Action Head) | 5.0e-6
Min LR (Classifier)  | 5.0e-6
Workers per GPU      | 2

C. Additional Visualizations and Analysis of Failure Cases in Mem-0

While Mem-0 demonstrates substantial improvements over the baselines, its architectural design still offers extensive room for further exploration. In this section, we present representative cases where Mem-0 exhibits suboptimal performance, aiming to provide valuable insights for future research.

Figure 5. Failure examples of Observe and Pick Up. (Panels: observe initial scene; target object hidden; prepare to grasp; pick up target object.) (Top) Confused by objects with similar colors and shapes. (Middle) Confused by identical object morphologies. (Bottom) General failure to identify the target, resulting in the robot grasping at a mean or unintended position.

Figure 6. Failure examples of Swap Blocks. (Panels: initial scene; move block; press button; move block; move block.) (Top) Premature termination after a single subtask. (Middle) Premature termination after two subtasks. (Bottom) Failure to terminate on time, resulting in the initiation of a redundant subtask.
Figure 7. Failure examples of Rearrange Blocks. (Panels: initial scene; move blocks; press button; release; press button (redundant); back to origin; move blocks.) Mem-0 redundantly presses the button, resulting in task failure.

C.1. Failure Analysis for M(1) Tasks

For M(1) tasks, in addition to the failures illustrated in Fig. 3, we summarize the representative errors encountered by Mem-0 below.

Observe and Pick Up & Swap Blocks. Fig. 5 illustrates typical failure modes in the Observe and Pick Up task, where Mem-0 fails to accurately identify the target object. Similarly, Fig. 6 presents examples of misjudgments regarding the termination of the swapping sequence in the Swap Blocks task, resulting in the confirmation button being pressed at inappropriate times. These instances reveal that the Anchor Memory in fact exerts a continuous influence throughout the entire task horizon. Consequently, the model must maintain constant attention to the Anchor Memory and intelligently modulate the degree of its contribution to action prediction. Nevertheless, our ablation studies have already demonstrated the substantial performance gains brought by the Anchor Memory within the Mem-0 architecture, with its impact being particularly pronounced in tasks such as Rearrange Blocks and Put Back Block.

Rearrange Blocks. On the other hand, Fig. 7 illustrates failure cases in the Rearrange Blocks task where Mem-0 performs excessive button presses, a behavior we attribute to limitations of the Sliding Memory module. Quantitative results from our ablation studies show a significant performance degradation on this task when the Sliding Memory is omitted. By analyzing the video playbacks, we found that the frequency of redundant button-pressing events increases markedly without Sliding Memory, identifying this as a primary failure mode.
These findings demonstrate that while the current Sliding Memory provides substantial performance gains, there remains potential for further refinement.

Summary. To address these observations, we believe one promising avenue for enhancement involves exploring richer representation-fusion mechanisms to improve the utilization of both Anchor and Sliding Memory. Such advances would further improve the performance and stability of the model in more intricate and versatile scenarios. Additionally, increasing the visual processing capabilities of existing VLM architectures is expected to yield better results in tasks such as Observe and Pick Up.

C.2. Failure Analysis for M(n) Tasks

For M(n) tasks, although Mem-0 demonstrates substantial improvements over the baselines, there remains significant room for further progress. We believe the primary challenge lies in the performance and robustness of the Classifier.

Figure 8. Failure examples of Cover Blocks. (Panels: initial scene; uncover right cover; uncover left cover; cover blocks.) The Classifier fails to accurately detect the completion of the Uncover xxx subtask, thereby preventing a subtask transition. As the instruction remains unchanged, the model is forced to operate under a wrong task context, leading to unintended and erratic behaviors.

Cover Blocks. The current design of the Classifier occasionally fails to accurately perceive ongoing task progress, leading to an inability to discriminate whether a specific state represents the initiation or the termination of a subtask. This phenomenon is exemplified in Fig. 8 during the execution of the Cover Blocks task.

Blocks Ranking Try. Another primary challenge pertains to the execution of button-pressing operations.
We observe that the inclusion of button-pressing actions often introduces interference into hybrid tasks that are not exclusively focused on pressing. For instance, in Blocks Ranking Try, the transition between button-pressing and block-swapping is occasionally disrupted, a phenomenon illustrated in Fig. 9. In the Blocks Ranking Try task, even a single execution error inevitably leads to overall task failure. This sensitivity is the main driver of failure for this task, as clearly substantiated by the quantitative results of our ablation studies.

Figure 9. Failure examples of Blocks Ranking Try. (Panels: initial scene; press button; release; forced button press; unexpected process; chaotic scene; attempt to swap blocks.) Upon pressing the button, the system is expected to transition to the next subtask and execute the swapping of the designated blocks. However, the Classifier fails to trigger this transition promptly, causing the task to stall in the Press Button state. This leads to a coordination conflict between the dual arms: the right hand attempts to initiate manipulation while the left hand remains tethered to the button-pressing instruction.

Figure 10. Failure examples of Press Button. (Panels: initial scene; press button; release; move to middle button.) (Top) Insufficient presses: the Classifier issues a false-positive termination signal even when the button press is unsuccessful. (Bottom) Excessive presses: the Classifier fails to recognize a successful subtask completion, leading to redundant execution of the same subtask.
Figure 11. Failure examples of Battery Try. (Panels: initial scene; grasp battery; insert battery into slot; switch left/right; bad placement; chaotic scene.) (Top) For the horizontally oriented cylindrical battery, a suboptimal grasp pose prevents a successful lift and causes significant displacement, leading the model into unforeseen observational states. (Bottom) The model fails to commit to a specific manipulation strategy during battery adjustment, resulting in a mean action that leads to improper placement in the slot.

Press Button. The Press Button task, as a dedicated button-pressing scenario, underscores the inherent difficulty of this operation. In the current design of Mem-0, the state of the button (pressed vs. unpressed) is reflected in the visual input only through extremely subtle cues. Consequently, the visual tokens generated by the VLM backbone lack the granularity to encapsulate such fine-grained information, creating a fundamental bottleneck for the downstream Classifier. Fig. 10 displays representative failure cases of the Classifier in the Press Button task.

Battery Try. Furthermore, for the Battery Try task, a representative failure mode is suboptimal manipulation precision. This encompasses both insertion errors when placing the battery into the slot and challenges in determining the appropriate grasp strategy due to the extremely subtle visual cues of the slot, as illustrated in Fig. 11.

Summary. To address these observations, we posit that incorporating proprioceptive or tactile feedback could provide the Classifier with critical non-visual information.
Moreover, as the Classifier is the most downstream module in Mem-0, optimizing upstream VLM token extraction and the fusion mechanisms of Anchor and Sliding Memory remains a promising avenue. Such refinements would allow the Classifier to operate on input tokens with better-conditioned distributions and enhanced informational saliency. We also anticipate that more interpretable tokens could improve the synergy between diverse subtasks, thereby further enhancing performance in complex scenarios like Blocks Ranking Try. Nevertheless, the efficacy of the current design has been validated across various M(n) tasks, notably yielding substantial performance gains in Cover Blocks compared to the baseline. We hope the above analysis provides valuable insights for future work to further improve performance.
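As one concrete direction for the more robust subtask-termination criteria discussed above, a simple debouncing rule over the classifier's per-frame outputs would suppress the single-frame false positives seen in Fig. 10. The sketch below is a hypothetical refinement, not part of Mem-0:

```python
from collections import deque

def debounced_done(raw_signals, k=3):
    """Hypothetical debounced termination rule: declare a subtask finished
    only after k consecutive positive classifier outputs, filtering out
    isolated single-frame false positives."""
    recent = deque(maxlen=k)
    decisions = []
    for s in raw_signals:
        recent.append(bool(s))
        decisions.append(len(recent) == k and all(recent))
    return decisions

# A one-frame false positive (step 2) no longer triggers a transition;
# three consecutive positives (steps 5-7) do.
signals = [0, 0, 1, 0, 0, 1, 1, 1]
out = debounced_done(signals, k=3)
assert out[2] is False   # isolated positive suppressed
assert out[-1] is True   # sustained positive accepted
```

The cost of such a rule is a k-frame detection delay, so k would have to be tuned against the planning frequency reported in Section 5.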
