FlowCorrect: Efficient Interactive Correction of Generative Flow Policies for Robotic Manipulation
Edgar Welte, Yitian Shi, Rosa Wolf, Maximillian Gilles, Rania Rayyes

Abstract—Generative manipulation policies can fail catastrophically under deployment-time distribution shift, yet many failures are near-misses: the robot reaches almost-correct poses and would succeed with a small corrective motion. We present FlowCorrect, a deployment-time correction framework that converts near-miss failures into successes using sparse human nudges, without full policy retraining. During execution, a human provides brief corrective pose nudges via a lightweight VR interface. FlowCorrect uses these sparse corrections to locally adapt the policy, improving actions without retraining the backbone while preserving performance on previously learned scenarios. We evaluate on a real-world robot across three tabletop tasks: pick-and-place, pouring, and cup uprighting. With a low correction budget, FlowCorrect improves success on hard cases by 85% while preserving performance on previously solved scenarios. The results demonstrate that FlowCorrect learns from only a few demonstrations and enables fast, sample-efficient, incremental, human-in-the-loop correction of generative visuomotor policies at deployment time in real-world robotics.

I. INTRODUCTION

Recent years have witnessed major progress in large-scale imitation learning. These advances have produced foundational behavior models, including Vision-Language-Action (VLA) models [1], [2], that map rich semantic embeddings to continuous motor commands. In parallel, generative policy learning with diffusion and flow models [3]–[5] has emerged as a powerful paradigm for action generation. These policies can acquire broad, multimodal manipulation skills from diverse demonstrations.
However, real-world deployment remains brittle: out-of-distribution (OOD) situations can cause execution failures, including irreversible mistakes or unsafe interactions, because test-time states differ from those seen during training. Closing this gap requires not only stronger pre-training, but also mechanisms for continuous, incremental adaptation during deployment.

A common remedy is parameter-efficient fine-tuning, which adapts pre-trained robot policies to new tasks and embodiments with modest compute [6], [7]. Nevertheless, even lightweight fine-tuning typically assumes a relatively stable target distribution and sufficiently representative correction data. In practice, many failures occur in narrow OOD "pockets" of state space and are near-misses: the policy reaches an almost-correct state, and a small spatial or temporal adjustment would recover the task. Batch updating on a handful of corrected rollouts can be expensive and brittle, and may induce parameter interference that degrades previously competent behaviors [8]. We therefore argue that continual policy learning should support efficient online correction: incremental adaptation to new situations while preserving the stability of the base policy.

Fig. 1. Overview of our interactive FlowCorrect framework: A flow matching visuomotor policy is trained from offline demonstrations. During deployment, the frozen base policy runs the robot while a human provides occasional relative corrections. These sparse corrections are used to train our proposed lightweight FlowCorrect module that locally steers the policy's flow field, yielding an adapted policy without retraining the original backbone.

Interactive imitation learning (IIL) provides a natural pathway by enabling brief supervisory interventions during execution [9].
This is especially well suited to near-miss failures, where a small end-effector perturbation or short recovery motion can convert failure into success without collecting new expert demonstrations. Prior interactive approaches often learn from evaluative feedback (e.g., success/failure labels or scalar preferences [10], [11]), which offers limited directional information for precise motion correction. Other approaches [12], [13] rely on absolute corrections that specify an exact target action, pose, or short trajectory segment. While straightforward, they typically require precise inputs from the human supervisor and thus impose a higher cognitive load.

In this work, we propose FlowCorrect, a deployment-time correction framework that enables intuitive, incremental adaptation from sparse human interventions while preserving the broad capabilities of a pre-trained base policy. For a flow-based policy, the key idea is to keep the backbone frozen and learn a lightweight correction module that applies only local adjustments. We use relative corrections, i.e., incremental changes to the robot's current behavior, rather than requiring full demonstrations or exact target actions. Such corrections are often more natural for non-experts and align with interactive teaching paradigms such as COACH [14]. We treat these interventions as short-horizon, bounded edits that recover near-miss failures without relearning the underlying skill. Fig. 1 illustrates our plug-and-play FlowCorrect module for flow-based policies: a lightweight correction component that incorporates sparse online human feedback to improve behavior locally, together with a locality-preserving design that leaves the original policy unchanged outside corrected regions.
Flow parameterizations are particularly amenable to such edits because they represent behavior as a continuous action flow, enabling targeted local perturbations around encountered states while preserving the global structure learned from offline demonstrations.

To summarize, our contributions are as follows:
• FlowCorrect deployment-time correction for generative manipulation policies: We introduce an interactive framework for adapting flow-based manipulation policies from sparse human interventions, targeting near-miss failures without full policy retraining.
• Intuitive human feedback with localized adaptation: FlowCorrect learns from brief relative corrections. The adaptation is localized to corrected situations, improving recoverability while preserving the base policy in previously competent regions.
• Real-robot validation under low correction budgets: Across three tabletop manipulation tasks, we show that a small number of corrected rollouts enables rapid deployment-time recovery of hard failures, substantially improving success while maintaining performance on previously solved scenarios, and outperforming full-policy retraining in efficiency.

II. RELATED WORK

A. IIL with human-in-the-loop corrections

A practical line of IIL improves policies during deployment by allowing a human to intervene when failures are imminent. Pan et al. [15] introduce Decaying Relative Correction (DRC), where a human supplies a transient relative offset that decays over time, reducing sustained teleoperation burden. The policy is then updated by retraining a diffusion model on corrected demonstrations. Kasaei et al. [16] similarly employ relative wrist-delta corrections, but apply them through offline replay with an image-based teleoperation system. They fine-tune the full policy on a mixture of original and corrected data.
In contrast, several systems use absolute interventions, where the human takes over control to directly provide actions or trajectory segments, and the policy is updated from the collected intervention data [17]–[20]. For instance, following the recent trend in uncertainty estimation for robotics [21], [22], Diff-DAgger [18] utilizes generative diffusion models to approximate the expert's action distribution and quantify policy divergence, triggering interventions when the robot's predicted trajectory drifts from the learned manifold. However, absolute corrections raise human workload and shift the interaction toward full teleoperation, blurring the distinction between sparse corrective signals and full demonstrations.

B. Residual and modular adaptation from correction data

To mitigate catastrophic forgetting and focus on learning from failure patterns [23], a common direction learns residual or gated modules on top of an existing policy. Transic [24] learns a residual policy from on-robot human interventions (e.g., using a SpaceMouse) and combines it with a strong base policy to address sim-to-real gaps without overhauling the original controller. Xu et al. [25] propose Compliant Residual DAgger (CR-DAgger), collecting delta-motion corrections via a compliant end-effector interface and additionally leveraging force information to stabilize contact-rich adaptation. These residual formulations motivate our design choice to preserve the pre-trained policy and learn a lightweight correction mechanism that activates only when needed, reducing interference outside the narrow OOD regions responsible for failures.

C. Generative visuomotor policies

Recent visuomotor policies increasingly utilize generative models to capture multimodal predictions [3], [4], [26], enabling the policy to represent multiple distinct, equally valid action sequences for the same task.
Among them, diffusion-based policies [27] have demonstrated strong performance and flexibility, and are commonly adapted via fine-tuning on additional (corrected) rollouts [15], [18]. Beyond that, flow matching with consistency training [28] has emerged as a highly data-efficient alternative for high-dimensional control with fast inference. For instance, ManiFlow [5] combines flow matching and consistency training to produce dexterous actions in up to 10 steps and shows strong robustness and scaling behavior across diverse manipulation settings. We build on this generative-policy trend by introducing a correction-oriented module that steers the flow in the desired direction while keeping the base flow matching policy frozen.

III. PROBLEM FORMULATION

A. Imitation Learning Setup

A robot manipulator with a parallel-jaw gripper operates in discrete control time steps $t \in \mathcal{T} = \{1, \dots, T\}$. At each step, the robot observes $o_t = (o^{\mathrm{pcd}}_t, o^{\mathrm{prio}}_t)$: a point cloud $o^{\mathrm{pcd}}_t \in \mathbb{R}^{N \times 3}$ of the workspace and its current proprioceptive state $o^{\mathrm{prio}}_t \in \mathrm{SE}(3) \times \{0,1\}$; it then executes an end-effector pose command $a_t = (T_t, g_t)$, where $T_t \in \mathrm{SE}(3)$ denotes the end-effector transformation relative to the robot base and $g_t \in \{0,1\}$ represents the binary gripper state (open/close), tracked by a low-level Cartesian controller. A high-level policy $\pi$ maps an observation history to an action sequence¹:
$$\hat{a}_{t:(t+H-1)} = \pi(o_{(t-K+1):t}),$$
where $H$ denotes the prediction horizon and $K$ the length of the observation history.

¹We denote by $\hat{a}$ all actions predicted by policies in this work.

We assume access to an offline dataset of human teleoperation demonstrations $\mathcal{D}_{\mathrm{demo}} = \{\tau_m\}_{m=1,\dots,M}$ containing $M$ episodes, $\tau_m = (o_{1:T}, a_{1:T})_m$, which we use to train a base policy $\pi_\theta$ parametrized by $\theta$.
B. Preliminaries in Flow Matching Policy

In line with ManiFlow [5], we use consistency flow matching (CFM) [29] as the base policy $\pi_\theta$, which comprises a visual encoder $z_\theta$ and a vector field model $f_\theta$. Specifically, $f_\theta$ parameterizes a latent ordinary differential equation (ODE) in the normalized action space. Let $x_n = \{(T^n_h, s^n_h)\}_{h=t}^{t+H-1}$ denote the subtrajectory at ODE step $n \in \{0, \dots, N-1\}$. During inference, we initialize with Gaussian noise $x_0$ and then iteratively sample:
$$x_{n+1} = x_n + \Delta k \, f_\theta(x_n, k_n, c),$$
where $\Delta k = 1/N$ is the step size and $k_n = \Delta k \cdot n$ is the normalized time step. $c = z_\theta(o_{(t-K+1):t})$ denotes the latent conditioning on the observed point cloud, encoded by $z_\theta$. After $N$ inference steps, we interpret
$$\hat{a}_{t:(t+H-1)} = x_N$$
as the obtained action chunk.

During training, we sample a subtrajectory from a rollout, $x_{gt} = a_{t:(t+H-1)} \in \tau_m$, noise $x_0$, and a continuous flow time step $k \sim \mathcal{U}(0,1)$. A sample $x = (1-k)\,x_0 + k\,x_{gt}$ is defined as the linear interpolation between $x_0$ and $x_{gt}$. The model $f_\theta$ is trained to predict the velocity $v = x_{gt} - x_0$ by minimizing the loss:
$$\mathcal{L}_F(\theta) = \mathbb{E}_{(x_0, x_{gt}, k)}\left[ \left\| f_\theta(x, k, c) - v \right\|_2^2 \right]. \quad (1)$$
In consistency flow matching, an additional objective enforces the velocity predictions at two random flow time steps from a discrete range to be consistent, facilitating smoother velocity predictions according to [29].

IV. METHODOLOGY

A. System overview

Our IIL approach aims to learn an augmented policy $\pi_{\theta+\Delta\theta}$ that combines a frozen base policy $\pi_\theta$ with an additional learnable adapter parameterized by $\Delta\theta$. During online execution, a human teacher monitors the robot and can provide corrective feedback.
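The CFM inference procedure of Sec. III-B reduces to a short Euler loop over the learned ODE. A minimal sketch, where `f` is a generic stand-in for the trained vector field $f_\theta$ (not the actual ManiFlow model) and `c` an opaque conditioning object:

```python
import numpy as np

def sample_actions(f, c, action_dim, N=10, rng=None):
    """Euler integration of the latent ODE: x_{n+1} = x_n + dk * f(x_n, k_n, c).

    `f` only needs the signature f(x, k, c) -> velocity; `c` is the latent
    observation condition. Returns x_N, interpreted as the action chunk.
    """
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(action_dim)  # x_0 ~ N(0, I)
    dk = 1.0 / N                         # step size Delta_k
    for n in range(N):
        k = dk * n                       # normalized flow time k_n
        x = x + dk * f(x, k, c)
    return x
```

As a sanity check, an oracle field that always points the remaining distance to a fixed target divided by the remaining flow time, `f(x, k, c) = (target - x) / (1 - k)`, lands exactly on `target` after $N$ steps.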
Specifically, given a rollout generated by the current policy $\pi_{\theta+\Delta\theta}$, the teacher may identify a subset of actions as suboptimal and supply corrected actions through our teleoperation-based correction interface. These corrections are then used to update the adapter parameters $\Delta\theta$.

B. Objectives for Interactive Corrections

Let $\mathcal{T}_{corr} \subseteq \mathcal{T}$ be the set of corrected timesteps, i.e., a subset of the entire execution horizon $\mathcal{T}$. For each $t \in \mathcal{T}_{corr}$, the teacher specifies a corrected pose $a^{corr}_t \in \mathrm{SE}(3) \times \{0,1\}$. We define a binary mask that records the existence of a correction at timestep $t$:
$$m_t = \mathbb{1}_{\mathcal{T}_{corr}}(t) := \begin{cases} 1 & \text{if } t \in \mathcal{T}_{corr}, \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$
We denote the baseline (pre-adaptation) actions by $\hat{a}^{base}_t$, produced by the frozen base policy $\pi_\theta$, and the adapted (post-adaptation) actions by $\hat{a}^{adapt}_t$, produced by $\pi_{\theta+\Delta\theta}$:
$$\hat{a}^{base}_{t:(t+H-1)} = \pi_\theta(o_{(t-K+1):t}); \qquad \hat{a}^{adapt}_{t:(t+H-1)} = \pi_{\theta+\Delta\theta}(o_{(t-K+1):t}).$$
The goal of interactive adaptation is to update the augmented policy $\pi_{\theta+\Delta\theta}$ within the correction time set $\mathcal{T}_{corr}$ such that it matches the teacher's corrected actions $a^{corr}_t$ at timesteps $t \in \mathcal{T}_{corr}$, while remaining close to the frozen base policy $\pi_\theta$ elsewhere. This is achieved using only a small number of corrections and a parameter-efficient update. Formally, with the base policy parameters $\theta$ frozen, we seek adapter parameters $\Delta\theta$ such that:
$$\hat{a}^{adapt}_t \approx \begin{cases} a^{corr}_t & \text{if } m_t = 1, \\ \hat{a}^{base}_t & \text{otherwise.} \end{cases}$$

C. Interactive Correction Strategy

a) Correction Data: To train the adapter, we collect correction data in a dedicated format during base-policy rollouts.
The $i$-th correction sample $\mathcal{S}_i$ consists of:
$$\mathcal{S}_i = \left\{ o_{(t-K+1):t},\ \hat{a}^{base}_{t:(t+H-1)},\ a^{corr}_{t:(t+H-1)},\ x_0,\ m_{t:(t+H-1)} \right\}^{(i)},$$
each containing the observation history $o_{(t-K+1):t}$, the base policy actions $\hat{a}^{base}_{t:(t+H-1)}$, the corrected actions $a^{corr}_{t:(t+H-1)}$, the noise sample $x_0$ used to generate the actions, and the correction masks $m_{t:(t+H-1)}$. We additionally record a small set of successful episodes without corrections, which serve as anchor data to discourage global drift of the adapted policy. Since the correction dataset is typically small (i.e., $|\mathcal{T}_{corr}| \ll |\mathcal{T}|$), we require a highly parameter-efficient adaptation mechanism.

Following [15], we employ relative corrections to enable humans to adjust the robot's motion during policy execution. In contrast to absolute corrections, in which the human teacher takes full control and teleoperates the robot to specify the complete action, relative corrections consist of a correction offset $b_t$ applied on top of the policy's nominal output $\hat{a}^{base}_t$:
$$a^{corr}_t = \hat{a}^{base}_t \oplus b_t.$$
Here, we use $\oplus$ and $\ominus$ to denote group composition and difference in the action space $\mathrm{SE}(3) \times \{0,1\}$, respectively.

b) Interactive Correction Interface: We extract the user correction signal through the same VR teleoperation interface used to record demonstrations, as shown in Fig. 3. During policy execution, the user initiates a correction by holding a button on the VR controller. When the button is activated, we cache the controller pose $p_{ref} \in \mathrm{SE}(3)$ as a reference. While the button remains pressed, we compute the relative controller motion as the pose difference:
$$\Delta p_t = p_t \cdot p_{ref}^{-1},$$
where $p_t$ is the current controller pose. This raw correction $\Delta p_t$ is then scaled, low-pass filtered, and slew-rate limited before being applied as an additive offset to the policy action.

Fig. 2. (a) Overview of the FlowCorrect module attached to the DiTX-Transformer from ManiFlow [5]: we extend an existing flow matching policy $\pi_\theta$ based on the DiTX-Transformer with our FlowCorrect module. Our lightweight FlowCorrect module consists of LoRA adapters (parametrized by $\Delta\theta$) injected into the transformer, and a gating module $g_\psi$ that outputs a signal to steer the vector flow field toward the corrected action. (b) Intuition: across $N=4$ integration steps, FlowCorrect iteratively adjusts the predicted velocities from $v_{n,t}$ to $v^*_{n,t}$, steering the rollout from a base action $\hat{a}^{base}_t$ toward a corrected action $a^{corr}_t$.

Together with the corrective open/close signal $g_t$, we formulate a smoothed target:
$$\tilde{b}_t = b_{t-1} \oplus \alpha \left( \gamma\, \Delta p_t\, g_t \ominus b_{t-1} \right), \qquad \alpha = \frac{dt}{\tau + dt},$$
with scale factor $\gamma$, time constant $\tau$, and control timestep $dt$. To avoid abrupt changes, we additionally limit the maximum step size per timestep via
$$b_t = b_{t-1} \oplus \mathrm{clip}\left( \tilde{b}_t \ominus b_{t-1},\ r_{max}\, dt \right),$$
where the operator $\mathrm{clip}(\cdot, \rho)$ constrains the magnitude of the relative transformation in $\mathrm{SE}(3)$.

This smooth corrective process yields an intuitive "nudge" interface that preserves the overall structure of the policy's behavior while enabling targeted user adjustments. Importantly, the correction loop operates at a higher control rate (~15 Hz) than the policy updates (~1 Hz), thereby improving responsiveness and supporting more natural user interaction. Practically, we record the correction offset $b_t$ during data collection, corresponding to the action $a^{corr}_t$ once it has been fully executed, i.e., when the robot reaches the commanded target pose. Finally, we apply a temporal Decaying Relative Correction (DRC) following [15] after the user releases the button, allowing the robot to smoothly transition back to the policy's uncorrected output.
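For the translational part of the offset, where the SE(3) composition $\oplus$ reduces to vector addition, one low-pass plus slew-rate-limited update of the correction offset can be sketched as follows. The constants are illustrative defaults, not the paper's tuned values, and the gripper signal is omitted:

```python
import numpy as np

def smooth_offset(b_prev, dp, gamma=1.0, tau=0.2, dt=1/15, r_max=0.05):
    """One update of the correction offset b_t (translational sketch).

    b_prev : previous offset b_{t-1} (3-vector)
    dp     : relative controller motion Delta_p_t (3-vector)
    """
    alpha = dt / (tau + dt)                            # first-order low-pass gain
    b_tilde = b_prev + alpha * (gamma * dp - b_prev)   # smoothed target b~_t
    step = b_tilde - b_prev                            # proposed per-step change
    max_step = r_max * dt                              # slew-rate limit r_max * dt
    norm = np.linalg.norm(step)
    if norm > max_step:                                # clip(., r_max * dt)
        step = step * (max_step / norm)
    return b_prev + step
```

Applied repeatedly with a constant `dp`, the offset converges toward `gamma * dp` without ever moving more than `r_max * dt` per control tick, which is what makes the nudge feel smooth at the 15 Hz correction rate.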
D. FlowCorrect Module

To integrate corrections into the augmented policy $\pi_{\theta+\Delta\theta}$, we present FlowCorrect, a learnable adapter module based on LoRA [6]. The module is attached to the pretrained state-of-the-art ManiFlow [5] generative policy to enable efficient fine-tuning. Fig. 2(a) illustrates the general architecture of FlowCorrect.

Fig. 3. Pipeline of the interactive correction interface.

Specifically, our core idea is to modify the flow vector field by attaching the LoRA adapter to the MLP head of the DiTX-Transformer in ManiFlow. For a given correction trajectory, we want the flow trajectory starting from the original noise $x_0$ and integrated with the edited field to end at the corrected actions $a^{corr}_{t:(t+H-1)}$ (see Fig. 2(b)). Let $x_n$ denote the latent action at ODE step $n$ under the edited vector field model $f_{\theta+\Delta\theta}$; the FlowCorrect vector field model becomes:
$$f_{\theta+\Delta\theta}(x_n, k_n, c) = f_\theta(x_n, k_n, c) + v_{\Delta\theta}(x_n, k_n, c),$$
with observational condition $c = z_\theta(o_{(t-K+1):t})$. Restricting LoRA to the head of the DiTX-Transformer and to a low-rank update keeps the number of trainable parameters small (≈10k) and typically yields edits that are localized in hidden-state space.

For each timestep $t$, we define a per-step target velocity toward the corrected action $a^{corr}_t$:
$$v^*_{n,t} = \frac{a^{corr}_t \ominus x_{n,t}}{(N-n)\,\Delta k},$$
which can be viewed as the velocity that would exactly reach $a^{corr}_t$ by the end of the integration if it stayed constant (see Fig. 2(b)); $(N-n)\,\Delta k$ is the remaining flow time. We also introduce a time-dependent weight $w_n$ that emphasizes later steps, as these are more relevant for reaching the targeted action:
$$w_n = \frac{n+1}{N}.$$
The FlowCorrect loss for a single flow trajectory is:
$$\mathcal{L}_{FE}(\Delta\theta) = \frac{1}{N} \sum_{n=0}^{N-1} w_n \left\| \left[ f_{\theta+\Delta\theta}(x_n, k_n, c) \right]_t - v^*_{n,t} \right\|_2^2.$$
In practice, we re-use the logged noise $x_0$ and run the ODE forward with the current $f_{\theta+\Delta\theta}$.
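The per-trajectory loss above can be sketched compactly for a single timestep $t$; `f_edit` and `x_traj` are hypothetical stand-ins for the edited field $f_{\theta+\Delta\theta}$ and the latents logged along the ODE rollout:

```python
import numpy as np

def flowcorrect_loss(f_edit, x_traj, a_corr, c, N):
    """Sketch of the per-trajectory FlowCorrect loss L_FE for one timestep.

    f_edit : callable f_edit(x, k, c) -> velocity (edited vector field)
    x_traj : list of latents x_0 .. x_{N-1} along the ODE rollout
    a_corr : corrected action for this timestep
    """
    dk = 1.0 / N
    loss = 0.0
    for n in range(N):
        k = dk * n
        remaining = (N - n) * dk                    # remaining flow time
        v_star = (a_corr - x_traj[n]) / remaining   # per-step target velocity v*_{n,t}
        w = (n + 1) / N                             # weight emphasizing later steps
        err = f_edit(x_traj[n], k, c) - v_star
        loss += w * float(np.dot(err, err))         # squared l2 norm
    return loss / N
```

By construction the loss vanishes exactly when the edited field already points the remaining distance to `a_corr` at every step, i.e., when the rollout would land on the corrected action.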
The overall objective over all correction trajectories $\{\mathcal{S}_i\}$ is:
$$\Delta\theta^* = \underset{\Delta\theta}{\operatorname{argmin}} \sum_i \mathcal{L}^{(i)}_{FE}(\Delta\theta). \quad (3)$$
Although LoRA is parameter-efficient, its updates can have global effects on the policy. Consequently, improving behavior in one region of the workspace can unintentionally change actions in other regions, potentially reducing performance. To further enforce locality, we introduce a small gating network $g_\psi$ that decides where to apply the flow edit (see Fig. 2(a)):
$$\alpha_t = g_\psi(c_t) \in [0, 1],$$
where $c_t$ is the observation condition at timestep $t$. The gating network is intentionally small: it first projects $c_t$ into a low-dimensional space via a linear projection, aggregates the projected features via mean pooling, and then uses a two-layer multilayer perceptron (MLP) to produce the scalar gate $\alpha_t$. The gated FlowCorrect vector field model becomes:
$$f_{\theta+\Delta\theta}(x_n, k_n, c) = f_\theta(x_n, k_n, c) + \alpha_t\, v_{\Delta\theta}(x_n, k_n, c).$$
In general, we train FlowCorrect in two stages:
1) Training the FlowCorrect module: optimize $\Delta\theta$ with Eq. 3, where we fix $\alpha_t \equiv 1$.
2) Gate training: freeze $\theta + \Delta\theta$ and optimize $\psi$ with
$$\mathcal{L}_G = \mathrm{BCE}(\alpha, y) + \lambda_{ent}\, \mathrm{H}(\alpha).$$
$\mathrm{BCE}(\alpha, y)$ supervises the gate to predict whether an edit should be applied over the current horizon. The quadratic entropy term $\mathrm{H}(\alpha) = \alpha(1-\alpha)$ promotes decisive gating by pushing $\alpha$ toward 0 or 1 rather than ambiguous intermediate values. Moreover, we define the ground-truth target $y$ as:
$$y = \begin{cases} 1 & \text{if } \exists\, t' \in \{t, \dots, t+H-1\} \text{ with } m_{t'} = 1, \\ 0 & \text{otherwise,} \end{cases}$$
i.e., a window is labeled positive if at least one indicator $m_{t'} = 1$ falls within the interval $t : t+H-1$. This encourages the gate to open in those cases. At inference, we apply a hard threshold $\hat{\alpha}_t = \mathbb{1}[\alpha_t > 0.5]$ to obtain a binary "use edit / do not use edit" decision per timestep, making the behavior interpretable: the learned flow correction is applied only where the human indicated failures during interaction.

Fig. 4. Hardware setup and representative real-world tasks used in the experiments.

V. EXPERIMENTS

We design real-robot experiments to answer the following questions:
• Can FlowCorrect fine-tuning, using only local human corrections, reliably fix low-performing situations while preserving the base policy's performance on other states?
• How does FlowCorrect compare to retraining the complete base policy in terms of performance and efficiency?
• What is the impact of our gating mechanism and of the uncorrected rollout data?

We evaluate our FlowCorrect module on three tabletop manipulation tasks illustrated in Fig. 4: (i) Pick-and-Place, (ii) Pouring, and (iii) Cup Uprighting. For each task, we compare three policy types trained with the same backbone and observation/action spaces: (i) the base policy (Base) trained from demonstrations; (ii) our fine-tuned FlowCorrect policy (FC) updated from human corrections; (iii) a retrained policy (RT) updated from the same corrections. We report success rates over a structured set of in-distribution (ID) and out-of-distribution (OOD) initial conditions, and perform ablations to isolate the contributions of gating and rollout data.

A. Experiment setups

a) Hardware and Policy I/O: All experiments are conducted on a UR10 manipulator equipped with a Robotiq 2F-85 parallel-jaw gripper. The policy observes $o_t$, including (i) a 3D point cloud $o^{\mathrm{pcd}}_t$ of the workspace captured by an external time-of-flight (ToF) depth camera (Orbbec Femto Mega) and (ii) the robot proprioceptive state $o^{\mathrm{prio}}_t$ (end-effector pose and gripper state). The policy outputs an absolute 6D end-effector pose command together with a gripper command $a_t$, expressed in a common world coordinate frame. Demonstrations and policy rollouts are recorded at 10 Hz. We use an observation horizon of $K = 2$ timesteps and predict an action sequence of length $H = 14$. At execution time, we execute the first $H_{exec} = 10$ actions before triggering another inference cycle.

Fig. 5. Top row: Selected ID-hard and OOD-hard initial conditions for the three tasks (left to right): Pouring, Cup Uprighting, and Pick-and-Place. The green regions indicate the workspace areas covered by the demonstrations. Middle row: Representative failure cases of the base policy under these conditions. Bottom row: Qualitative examples of successful executions after FlowCorrect fine-tuning on conditions that previously failed.

b) Data collection: For each task, we collect eight expert demonstrations to train the base policy. During deployment, when the base policy fails, a human provides relative corrections for the same failure situation to generate correction rollouts. We also record trajectories from successful base-policy executions. For each selected failure situation, we collect ten corrected rollouts, and we randomly sample five uncorrected rollouts from successful executions.

c) Tasks: We consider three tasks. Pick-and-Place: the robot must pick up a Rubik's Cube and place it on a box at a fixed position. Pouring: the robot must grasp a cup and perform a pouring motion toward a cup at a fixed position. Cup Uprighting: the robot must flip an overturned cup upright and place it at a designated position on the table. The initial position of each manipulated object is randomly selected within a 15x20 cm area.

d) Policies and Training Variants: We evaluate three policy types per task: (i) Base policy: trained on 8 demonstrations. (ii) FC policy (ours): fine-tuned from the base policy by FlowCorrect using 10 corrections for each selected failure case plus 5 rollout trajectories.
(iii) RT policy: the base policy retrained (updated) using the same corrections and rollouts as the FC policy. The Base policy is trained for 3000 epochs, whereas the FC and RT policies are trained for 500 epochs on a single NVIDIA RTX 4090. We use a batch size of 256 for base policy training and 64 for fine-tuning; average runtimes are listed in Table II.

B. Experiment evaluation

For each task, we define 30 in-distribution (ID) initial conditions (i.e., object positions) within the workspace region used for demonstration recording. To ensure a standardized and comparable evaluation, these 30 positions are arranged as a 6x5 grid over the defined workspace. In addition, we define three selected low-performing ID conditions (ID-hard #1-#3, identified by evaluating the base policy) and one selected low-performing OOD condition (OOD-hard) to be corrected by a human. The OOD-hard condition is selected differently per task: in Cup Uprighting and Pick-and-Place, we chose a random position 3 cm outside the workspace as the OOD condition, whereas for the Pouring task we selected a smaller cup height as the OOD condition, since the positional OOD conditions were already robust in that task. Figure 5 shows the different ID and OOD conditions as well as common failure cases of the base policy before adaptation with FlowCorrect. Typical failure cases include an unstable grasp, collision with the object, or misalignment with the object. Each ID and OOD condition is evaluated 10 times to compute a success rate.

C. Results

a) Quantitative results: Fig. 6 summarizes overall performance across the 30 ID initial conditions. Across all three tasks, FlowCorrect (FC) improves the base policy's ID success rate, indicating that sparse corrections can generalize beyond the corrected timesteps and stabilize execution in nearby states. In Pouring and Cup Uprighting, FC yields large gains. Pick-and-Place requires more precise positioning of the gripper to prevent collisions, as the gripper opening is only 10 mm wider than the cube; the improvement therefore does not generalize as strongly there.

TABLE I. STRESS-TEST SUCCESS ON SELECTED HARD POSITIONS. "ID-HARD" ARE THE THREE LOW-PERFORMING ID CONDITIONS; "OOD-HARD" IS THE SELECTED OOD CONDITION.

Task            Policy     ID-hard #1  ID-hard #2  ID-hard #3  OOD-hard
Pick-and-Place  Base       0/10        0/10        0/10        0/10
                FC (Ours)  3/10        10/10       9/10        10/10
                RT         10/10       10/10       10/10       10/10
Pouring         Base       4/10        0/10        0/10        0/10
                FC (Ours)  10/10       10/10       10/10       2/10
                RT         10/10       10/10       10/10       10/10
Cup Uprighting  Base       0/10        0/10        0/10        0/10
                FC (Ours)  10/10       9/10        10/10       9/10
                RT         10/10       8/10        8/10        9/10

Fig. 6. Overall success rate on 30 ID positions inside the workspace. Per task (Pick-and-Place, Pouring, Cup Uprighting): Base 16/30, 19/30, 16/30; FC (Ours) 18/30, 27/30, 22/30; RT 20/30, 23/30, 21/30.

TABLE II. RESOURCE USAGE COMPARISON BETWEEN OUR FlowCorrect TRAINING (FC) AND RETRAINING (RT).

Mode       Avg. GPU Memory Usage (GB)  Avg. Runtime (min.)
Base       18.84 ± 0.24                80.86 ± 10.01
FC (Ours)  4.35 ± 0.15                 30.24 ± 5.45
RT         19.23 ± 0.25                52.93 ± 10.96

Table I reports stress tests on the selected low-performing ID and OOD conditions. For Cup Uprighting, FC reliably resolves both ID-hard and OOD-hard settings (9-10/10 success across all hard cases). For Pouring, FC fixes all three ID-hard positions (10/10 each) but improves the selected OOD condition only marginally (0/10 → 2/10). Notably, this OOD setting corresponds to a height change (smaller cup) rather than a positional shift.
For Pick-and-Place, FC substantially improves most hard cases (up to 10/10 on ID-hard #2, 9/10 on ID-hard #3, and 10/10 on OOD-hard), but one selected ID-hard condition improves only partially (3/10). This ID-hard condition lies spatially close to the chosen OOD condition, and the corresponding corrections are directionally different within a narrow region of the workspace. In such cases, a single locality gate and an observation-agnostic LoRA update can lead to over-correction toward the OOD solution, effectively overriding the more appropriate edit for the nearby ID case. This highlights an important limitation of local correction schemes when multiple, conflicting edits must coexist at fine spatial granularity.

TABLE III. ABLATION STUDY ON FlowCorrect. SUCCESS RATES (IN %) ON (I) THE 30 ID POSITIONS AND (II) THE HARD CASES, AVERAGED ACROSS ALL TASKS.

Variant          ID-30 Avg  ID-hard Avg  OOD-hard Avg
FC (full)        74.45      90.00        70.00
FC w/o gate      61.11      71.11        70.00
FC w/o rollouts  66.67      84.44        83.33
RT               71.11      95.56        96.67
RT w/o rollouts  45.56      92.22        96.67

b) Qualitative results: The bottom row of Fig. 5 shows representative successful executions after FlowCorrect fine-tuning. The examples visually confirm the quantitative improvements shown in Fig. 6 and Table I, demonstrating more stable alignment and execution under challenging initial conditions.

c) Comparison to retraining: Retraining (RT) achieves consistently strong performance on the hard cases. However, FC is largely competitive with RT in overall ID success (Fig. 6) while updating only a small LoRA module and a lightweight gate. In addition, RT incurs a substantially larger training-time footprint (GPU memory and runtime; Table II), whereas FC provides a more deployment-friendly adaptation mechanism.
d) Ablation insights.: T able III shows that the gating mechanism is critical for preserving ID performance: remov- ing the gate drops ID-30 success from 74.45% to 61.11%, consistent with the gate prev enting unintended global drift. Using a small set of uncorrected rollouts also improves stability: training FC without rollouts reduces ID-30 perfor- mance (66.67%) and changes the trade-off between hard-case gains and generalization, indicating that anchor trajectories help maintain the base policy’ s behavior outside corrected regions. D. Discussion and Outlook. The two observed failure modes: (i) conflicting edits in a narrow spatial neighborhood, and (ii) OOD shifts driv en by object geometry rather than pose, suggesting that condition- ing the FlowCorrect itself on the observation could improve selectivity . A promising direction is to inject observation- conditioned modulation not only into the gate, but also into the LoRA/edit pathway , enabling the correction to depend explicitly on the current scene features and reducing inter- ference between nearby but distinct correction regimes. V I . C O N C L U S I O N S W e presented FlowCorr ect , an interactiv e adaptation method for flow matching manipulation policies that targets common deployment failures in narrow OOD pockets. Flow- Corr ect keeps the pretrained backbone frozen and learns a lightweight LoRA-based module that locally steers the action flow field from sparse relative human corrections. A gate and anchor rollouts preserve behavior outside corrected regions. On three real-robot tabletop tasks, FlowCorr ect improves success on hard ID/OOD conditions with a small correction budget while maintaining ov erall ID performance and requir- ing substantially less training overhead than full retraining. Future work will focus on stronger observation-conditioned edits to better handle nearby , conflicting corrections and shifts dri ven by object geometry . R E F E R E N C E S [1] J. Zhang et al. 
, “V ision-language models for vision tasks: A survey , ” TP AMI , 2024. [2] M. J. Kim et al. , “Open vla: An open-source vision-language-action model, ” arXiv pr eprint arXiv:2406.09246 , 2024. [3] R. W olf et al. , “Diffusion models for robotic manipulation: a survey , ” F r ontiers in Robotics and AI , 2025. [4] K. Black et al. , “ π 0 : A vision-language-action flow model for general robot control, ” RSS , 2025. [5] G. Y an et al. , “Maniflow: A general robot manipulation policy via consistency flow training, ” in CoRL , 2025. [6] E. J. Hu et al. , “Lora: Low-rank adaptation of large language models. ” ICLR , 2022. [7] M. J. Kim et al. , “Fine-tuning vision-language action models: Opti- mizing speed and success, ” RSS , 2025. [8] Z. Zheng et al. , “imanip: Skill-incremental learning for robotic ma- nipulation, ” in ICCV , 2025. [9] E. W elte and R. Rayyes, “Interactiv e imitation learning for dexterous robotic manipulation: challenges and perspectiv es—a survey , ” Fr on- tiers in Robotics and AI , 2025. [10] J. MacGlashan et al. , “Interactiv e learning from policy-dependent human feedback, ” in ICML , 2017. [11] P . F . Christiano et al. , “Deep reinforcement learning from human preferences, ” NeurIPS , 2017. [12] M. Kelly et al. , “Hg-dagger: Interactive imitation learning with human experts, ” in ICRA , 2019. [13] C. Celemin and J. Ruiz-del Solar, “ An interactive framework for learning continuous actions policies based on corrective feedback, ” Journal of Intelligent & Robotic Systems , 2019. [14] C. Celemin and J. Ruiz-del-Solar, “Coach: Learning continuous ac- tions from correctiv e advice communicated by humans. ” ICAR, 2015. [15] C. Pan et al. , “Online imitation learning for manipulation via decaying relativ e correction through teleoperation, ” in IROS , 2025. [16] H. Kasaei and M. Kasaei, “V ital: V isual teleoperation to enhance robot learning through human-in-the-loop corrections, ” in CoRL , 2023. [17] Z. Xu et al. 
, “Hacts: A human-as-copilot teleoperation system for robot learning, ” in IR OS , 2025. [18] S.-W . Lee et al. , “Diff-dagger: Uncertainty estimation with diffusion policy for robotic manipulation, ” in ICRA , 2025. [19] H. Liu et al. , “Robot learning on the job: Human-in-the-loop autonomy and learning during deployment, ” IJRR , 2024. [20] E. Chisari et al. , “Correct me if i am wrong: Interactiv e learning for robotic manipulation, ” RAL , 2022. [21] Y . Shi et al. , “V iso-grasp: vision-language informed spatial object- centric 6-dof acti ve view planning and grasping in clutter and in visi- bility , ” in IR OS , 2025. [22] Y . Shi et al. , “vmf-contact: Uncertainty-aware evidential learning for probabilistic contact-grasp in noisy clutter , ” in ICRA , 2025. [23] Y . Shi et al. , “Uncertainty-dri ven exploration strategies for online grasp learning, ” in ICRA , 2024. [24] Y . Jiang et al. , “Transic: Sim-to-real policy transfer by learning from online correction, ” in CoRL , 2024. [25] X. Xu et al. , “Compliant residual dagger: Improving real-world contact-rich manipulation with human corrections, ” in NeurIPS , 2025. [26] Y . Shi et al. , “Hograspflow: Exploring vision-based generative grasp synthesis with hand-object priors and taxonomy awareness, ” arXiv pr eprint arXiv:2509.16871 , 2025. [27] C. Chi et al. , “Diffusion policy: V isuomotor policy learning via action diffusion, ” IJRR , 2025. [28] A. Prasad et al. , “Consistency policy: Accelerated visuomotor policies via consistency distillation, ” RSS , 2024. [29] L. Y ang et al. , “Consistency flow matching: Defining straight flows with velocity consistency , ” arXiv pr eprint arXiv:2407.02398 , 2024.