Enhancing Linguistic Generalization of VLA: Fine-Tuning OpenVLA via Synthetic Instruction Augmentation


Authors: Dongik Shin

Dongik Shin
i.dongik@utexas.edu

Abstract

Generalization remains a core challenge in embodied AI, as robots must adapt to diverse environments. While OpenVLA represents the state of the art (SOTA) in Vision-Language-Action models by leveraging large-scale pre-training, its zero-shot performance can be limited when encountering completely new environments. This paper proposes a parameter-efficient fine-tuning strategy to enhance the linguistic generalization of OpenVLA by synthesizing a general instruction set for the Bridge Dataset V2. The paper leverages a Large Language Model (LLM) to generate a rich variety of semantically equivalent but structurally diverse commands for existing trajectories. In this experiment, Low-Rank Adaptation (LoRA) is used to fine-tune OpenVLA on the augmented pairs, allowing the model to bridge the gap between complex natural language intent and robotic actions. Results demonstrate the LoRA-enhanced model's robustness, suggesting that enriching the linguistic space of specialized datasets is crucial for embodied agents.

1 Introduction

Embodied AI has recently undergone a paradigm shift with the emergence of Vision-Language-Action (VLA) models. Among these, OpenVLA (Kim et al., 2025) stands as a state-of-the-art (SOTA) open-source model that bridges the gap between high-level linguistic reasoning and low-level robotic control. OpenVLA is built upon a vision encoder (SigLIP) (Zhai et al., 2023a) and a large language model backbone (Llama) (Touvron et al., 2023), and the model is pre-trained on the massive Open X-Embodiment dataset (Collaboration et al., 2023), which consists of over 900,000 robot trajectories.
This extensive training allows the model to exhibit remarkable zero-shot generalization capabilities across various robotic platforms and tasks (Kim et al., 2025).

While OpenVLA shows outstanding physical task performance, its linguistic generalization is often limited by the quality of language annotations in its training data. Large-scale datasets such as the Bridge Dataset V2 (Walke et al., 2023b), a component of the Open X-Embodiment dataset (Collaboration et al., 2023), often lack natural language instructions. This lack of linguistic diversity hinders the model's robustness in interpreting varied human instructions.

To address this limitation, this paper proposes a method to enhance the linguistic generalization of OpenVLA (Kim et al., 2025) through a synthesized "General Instruction Set". This study leverages the reasoning capabilities of Large Language Models (LLMs) to generate a rich spectrum of semantically equivalent but general instructions for the trajectories in the Bridge Dataset V2 (Walke et al., 2023b). To adapt the massive OpenVLA model efficiently, Low-Rank Adaptation (LoRA) (Hu et al., 2021) is employed. This parameter-efficient fine-tuning strategy allows the model to align its general linguistic expressions without the prohibitive computational cost of full-parameter updates.

The primary contribution of this study is the demonstration that enriching the linguistic space of a specialized robotic dataset can significantly improve the generalization of a VLA agent. By fine-tuning with a structured and diverse instruction set, the proposed model achieves higher scores.

The remainder of this paper is organized as follows: Section 2 reviews related work in VLA and parameter-efficient tuning; Section 3 details the generation of the General Instruction Set and the LoRA (Hu et al.
, 2021) fine-tuning framework; Section 4 describes the experimental setup; Section 5 evaluates the results on linguistic generalization; and Section 6 concludes with future research directions.

2 Related Works

2.1 Visually-Conditioned Language Models

Visually-conditioned language models (VLMs) are trained on large-scale data to generate human language from images and language prompts. These models have commonly been adopted for various applications such as visual question answering (Goyal et al., 2017; Hudson and Manning, 2019; Singh et al., 2019). One of the key advances that make recent VLMs feasible is model architectures that bridge representations from pretrained vision encoders with pretrained language models (Radford et al., 2021; Zhai et al., 2023b; Team et al., 2024a). Recent open-source VLMs have converged on a simpler "patch-as-token" method, where patch representations from pretrained visual transformers are treated as tokens and fed into the input space of a language model (Chen et al., 2023; Liu et al., 2023, 2024; Karamcheti et al., 2024). This simple approach makes it easy to reuse existing methods for training language models at scale for VLM training. For instance, the VLMs from Karamcheti et al. (Karamcheti et al., 2024) are trained from multi-resolution visual features, fusing low-level spatial information from DINOv2 (Oquab et al., 2023) with higher-level semantics from SigLIP (Zhai et al., 2023b) to aid visual generalization.

2.2 Vision-Language-Action Models

Numerous works have explored the usage of VLMs for robotics, including object detection (Yitzhak Gadre et al., 2022), providing a feedback signal (Ma et al., 2023; Sontakke et al., 2023), and visual state representations (Nair et al., 2022). Likewise, a number of recent works have explored approaches that directly fine-tune large pretrained VLMs for predicting robot actions (Kim et al.
, 2024; Collaboration et al., 2023; Huang et al., 2023). Such models are referred to as vision-language-action models (VLAs), since they fuse robot control actions directly into VLM backbones (Kim et al., 2024). Kim et al. state three key benefits of VLAs: (1) they align pretrained vision and language components on a large, Internet-scale vision-language dataset; (2) the use of a generic architecture, not custom-made for robot control, makes it possible to leverage the scalable infrastructure underlying modern VLM training (Dao, 2023; Zhao et al., 2023) and to scale to billion-parameter policies with minimal code modifications; and (3) they provide a direct pathway for robotics to benefit from the rapid improvements in VLMs. Recent research on VLAs focuses on training and evaluating in single-robot or simulated setups (Huang et al., 2023; Zhen et al., 2024; Dorka et al., 2024) and thus lacks generalizability. RT-2-X (Collaboration et al., 2023) trains a 55B-parameter VLA policy on the Open X-Embodiment dataset and demonstrates state-of-the-art generalist manipulation performance, but at a high computation cost. OpenVLA (Kim et al., 2024) improves on this with a richer robot pretraining dataset, adapts well to new target setups, and shows the effectiveness of parameter-efficient fine-tuning (PEFT) and quantization approaches for VLAs. In this study, OpenVLA is used as the base model for computational efficiency.

2.3 Generalization in VLAs

A recent trend in robotics works towards training multi-task generalist robot policies on large, diverse robot datasets spanning many different robot embodiments (Kim et al., 2024; Brohan et al., 2022; Walke et al., 2023a; Kalashnikov et al., 2018, 2021; Ebert et al., 2021; Ehsani et al., 2024; Bharadhwaj et al., 2024; Pinto and Gupta, 2016; Mandlekar et al., 2018; Gupta et al., 2018; Dasari et al., 2019; Cabi et al., 2019; Jang et al.
, 2022; Fang et al., 2024; Devin et al., 2017; Hu et al., 2022; Yang et al., 2023; Reed et al., 2022; Radosavovic et al., 2023; Shah et al., 2022, 2023). A key difference between these approaches and OpenVLA is the model architecture. Prior works like Octo (Brohan et al., 2023; Team et al., 2024b; Walke et al., 2023b) typically consist of pretrained components such as language embeddings or visual encoders combined with additional model components initialized from scratch. OpenVLA adopts a more end-to-end approach, directly fine-tuning VLMs to generate robot actions by treating them as tokens in the language model vocabulary (Kim et al., 2024). In this work, the experimental evaluation shows that fine-tuning with general instruction sets improves generalization ability over prior generalist policies.

3 Proposed Method

3.1 Problem Formulation

In this study, the pre-trained OpenVLA (Kim et al., 2025) model is fine-tuned on a subset of Bridge Dataset V2 (Walke et al., 2023b), D = {(o_i, l_i, a_i)}_{i=1}^{N}, where o_i represents the visual observation, l_i denotes the generated instruction, and a_i is the corresponding ground-truth action. To achieve efficient adaptation during fine-tuning, the Low-Rank Adaptation (LoRA) (Hu et al., 2021) method is used. During fine-tuning, the original weights W_0 are kept frozen, and only the low-rank decomposition matrices are updated. The optimization objective is defined as follows:

\mathcal{L}_{\text{LoRA}} = \sum_{(o,l,a) \in \mathcal{D}} \mathcal{L}_{\text{vla}}\left( \mathrm{VLA}_{\theta}(o, l),\, a \right) \quad (1)

where \mathcal{L}_{\text{vla}} is the default objective in the official OpenVLA implementation for discrete action token prediction (Kim et al., 2025). The adaptation targets the query (Q), key (K), value (V), and output (O) projection matrices in the self-attention layers with rank r = 32 and α = 64 to capture the generated instruction set together with motion patterns.
Table 1: Top 5 representative instructions generated by the LLM for general task adaptation in Bridge Dataset V2 (Walke et al., 2023b).

No.  Instruction
1    In order to pick up the object, the robot should
2    To move the object to a new location, the robot must
3    In order to grasp and relocate the item, the robot should
4    To manipulate the objects in front of it, the robot must
5    In order to complete the task of moving the utensils, the robot should

3.2 Generating the Instruction Set

To enhance the model's adaptability and robustness across diverse environments, a method for generating a general instruction set using a Large Language Model (LLM) is introduced in this project. Unlike conventional datasets that rely on fixed, human-annotated labels, this approach leverages the semantic reasoning capabilities of LLMs to synthesize diverse instructions with linguistic variations for each trajectory. For each trajectory in the subset of BridgeData V2, the task's metadata, including the objects involved and the final goal state, are fed into the LLM. The LLM is then prompted to generate a comprehensive set of instructions that describe the same robotic action through various syntactic structures, such as "In order to pick up the object, the robot should move it to the target" or "To relocate the item, the robot must execute a grasp and place action".

By fine-tuning OpenVLA on this augmented instruction set, a more robust mapping between visual observations and linguistic commands is learned. This process allows the model to adapt to a general instruction space, preventing it from overfitting to specific, rigid phrasing. Consequently, the proposed method significantly improves the generalization capabilities of the agent, enabling it to execute commands successfully even when faced with paraphrased instructions that were not present in the original training distribution.
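The generation step above can be sketched as prompt construction plus parsing of the numbered reply. This is a hedged illustration, not the authors' pipeline: `build_prompt` and `parse_instructions` are hypothetical helpers, and the actual LLM call (omitted here) would sit between them.

```python
# Sketch (assumed helpers): compose the paraphrase request from trajectory
# metadata, then extract the numbered instructions from the LLM's reply.

def build_prompt(objects, goal):
    """Compose a request for 5 semantically equivalent instructions."""
    return (
        f"Objects involved: {', '.join(objects)}. Final goal: {goal}. "
        "Generate exactly 5 distinct natural language instructions for this "
        'task, each on a new line starting with "No. [Number]".'
    )

def parse_instructions(response):
    """Extract the instruction text from lines of the form 'No. 1 <text>'."""
    out = []
    for ln in response.splitlines():
        ln = ln.strip()
        parts = ln.split(" ", 2)
        if ln.startswith("No.") and len(parts) == 3:
            out.append(parts[2])
    return out

reply = (
    "No. 1 In order to pick up the object, the robot should grasp it.\n"
    "No. 2 To move the item to a new location, the robot must lift it."
)
print(parse_instructions(reply))
```

In the paper's setting the parsed candidates are then manually curated before being paired with trajectories, so the parser only needs to recover the raw numbered list.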
Table 1 illustrates the top 5 representative instructions generated by the LLM for a single manipulation task. These generated instructions introduce significant linguistic variety, ranging from high-level (abstract) goal descriptions (e.g., "manipulate the objects") to specific item-based commands (e.g., "moving the utensils"). Given a sequence of three key frames (initial, intermediate, and final images), the LLM, using the prompt template described in Table 3, was able to infer the robot's intent and generate semantically rich instructions. From the generated candidates, a small number of instructions (5 per task in these experiments) were manually curated to keep only the most contextually appropriate instruction sets and ensure high-quality supervision. These selected instructions were then randomly paired with their corresponding trajectories during the training phase. This random mapping within the augmented set encourages the model to decouple specific linguistic patterns from rigid task labels, thereby fostering a more flexible and generalized policy.

4 Experiments

4.1 Datasets

These experiments use the scripted raw dataset from BridgeData V2 (Walke et al., 2023b), a large-scale dataset specifically designed for robotic manipulation in diverse environments. Unlike purely teleoperated demonstrations, this subset consists of 9,731 trajectories collected via a randomized scripted policy and does not include human instructions. Although this autonomous collection process frequently results in suboptimal executions, the data is particularly valuable for training robust behaviors. Therefore, a subset of the data that aligns with the study's objectives is manually curated. Moreover, due to the massive scale of the original scripted dataset in BridgeData V2 (35GB), 100 trajectories were selected for simplicity. Each trajectory consists of an average of 25 sequential image-action pairs.
4.2 Implementation Details

The model is trained with the AdamW optimizer (Loshchilov and Hutter, 2019) and a learning rate of 5e-05, fine-tuned on the dataset described above. All experiments are performed on a single Nvidia A100 GPU with 40GB of VRAM.

Table 2: Comparison of action prediction accuracy between zero-shot OpenVLA and the proposed method (fine-tuned with the General Instruction Set).

Method                        Top-1 Acc (%)   5-Bin Acc (%)
OpenVLA (Kim et al., 2025)    6.62            40.76
Proposed Method               5.09            42.47

5 Evaluation

5.1 Policy Evaluation

OpenVLA replaces the 256 least frequently used tokens in the Llama tokenizer's vocabulary with action tokens (Kim et al., 2025). To evaluate the precision of the model's predicted actions, the predicted action tokens are compared against the ground-truth values from the BridgeData V2 test set.

For the quantitative assessment, each continuous action dimension, comprising the end-effector's Cartesian displacement (Δx, Δy, Δz), orientation (Δroll, Δpitch, Δyaw), and gripper state, is normalized to the range [−1, 1] and assigned to one of the 256 discrete bins. Two metrics are used for performance analysis: Top-1 accuracy and 5-bin tolerance accuracy. The accuracy for a given action dimension is formulated as follows:

\text{k-Bin Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( |a_i - \hat{a}_i| \le k \right) \quad (2)

where a_i is the ground-truth token, \hat{a}_i is the predicted token, and k is the tolerance threshold (k = 0 for Top-1 accuracy and k = 5 for 5-bin tolerance accuracy). Experimental results indicate that while the Top-1 accuracy slightly decreased compared to the original OpenVLA, the 5-bin tolerance accuracy showed a significant improvement. This shift suggests that although the model's exact token matching is more distributed, the predictions remain consistently within a narrow, physically plausible range of the target action.
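The tolerance metric in Eq. (2) is straightforward to compute over discretized action tokens. The sketch below is an assumed helper, not code from the OpenVLA codebase; the token values are made up for illustration.

```python
import numpy as np

def k_bin_accuracy(gt_tokens, pred_tokens, k):
    """Fraction of predictions whose discretized action token lies within
    k bins of the ground truth (k = 0 gives Top-1 accuracy, k = 5 gives
    the 5-bin tolerance accuracy used in the paper)."""
    gt = np.asarray(gt_tokens)
    pred = np.asarray(pred_tokens)
    return float(np.mean(np.abs(gt - pred) <= k))

# Illustrative bin indices for one action dimension (256 bins in total).
gt = [100, 120, 140, 160]
pred = [100, 123, 150, 161]
print(k_bin_accuracy(gt, pred, 0))  # only the exact match counts -> 0.25
print(k_bin_accuracy(gt, pred, 5))  # three predictions fall within 5 bins -> 0.75
```

This makes the paper's observed trade-off concrete: a prediction one bin off the target lowers Top-1 accuracy but still counts as a near-miss under the 5-bin tolerance.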
This result demonstrates that fine-tuning with a general instruction set enhances the robustness of the policy, allowing the model to prioritize functional success and motion consistency over rigid token-wise memorization.

6 Conclusion and Limitation

This study demonstrates that fine-tuning the OpenVLA model with an LLM-generated general instruction set enhances the robustness of robotic manipulation policies. While the results show a slight decrease in Top-1 accuracy, the substantial improvement in 5-bin tolerance accuracy indicates that the model has developed a more flexible and generalized understanding of the task, prioritizing functional success over rigid token-wise memorization. However, several limitations remain. First, the increase in linguistic variety appears to introduce a slight trade-off in absolute precision. Second, the current evaluation is conducted on a curated subset of 100 trajectories, leaving the scalability to more complex, multi-stage tasks as a subject for future investigation. Future work will focus on refining the instruction generation process to mitigate precision loss and on validating the proposed method in real-world environments with a broader range of robotic platforms.

Table 3: Prompt template for instruction generation.

[System Message]
You are a linguistic expert specializing in robotic task annotation. Your goal is to provide diverse, natural language instructions based on visual observations of robot manipulation.

[User Message]
{Image 1: First frame of the trajectory}
{Image 2: Intermediate frame of the trajectory}
{Image 3: Last frame of the trajectory}

Task:
1. Scene Analysis: Briefly identify the primary object and the robot's objective from the provided images.
2. Instruction Generation: Synthesize exactly 5 distinct natural language instructions for the observed task.
Requirements:
- Ensure linguistic variety: use different sentence structures (imperative, goal-oriented, and conditional).
- Vary the level of abstraction: include instructions ranging from low-level motor descriptions to high-level intent.
- Vocabulary diversity: use synonyms for objects (e.g., "item," "target," "utensil") and actions (e.g., "grasp," "pick up," "relocate").
- Format: return only the 5 instructions, each on a new line starting with "No. [Number]".

[Output Example]
No. 1 In order to pick up the object, the robot should...
No. 2 To move the item to a new location, the robot must...

References

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. 2024. RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788-4795. IEEE.

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, and 32 others. 2023. RT-1: Robotics transformer for real-world control at scale. Preprint, arXiv:2212.06817.

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, and 1 others. 2022. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, and 1 others. 2019. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv preprint arXiv:1909.12200.

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, and 1 others. 2023. PaLI-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199.

Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, and 275 others. 2023. Open X-Embodiment: Robotic learning datasets and RT-X models.

Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. 2019. RoboNet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215.

Coline Devin, Abhishek Gupta, Trevor Darrell, Pieter Abbeel, and Sergey Levine. 2017. Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2169-2176. IEEE.

Nicolai Dorka, Chenguang Huang, Tim Welschehold, and Wolfram Burgard. 2024. What matters in employing vision language models for tokenizing actions in robot control? In First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024.

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. 2021. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396.

Kiana Ehsani, Tanmay Gupta, Rose Hendrix, Jordi Salvador, Luca Weihs, Kuo-Hao Zeng, Kunal Pratap Singh, Yejin Kim, Winson Han, Alvaro Herrasti, and 1 others. 2024. SPOC: Imitating shortest paths in simulation enables effective navigation and manipulation in the real world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16238-16250.

Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. 2024. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 653-660. IEEE.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904-6913.

Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. 2018. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in Neural Information Processing Systems, 31.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. Preprint.

Edward S. Hu, Kun Huang, Oleh Rybkin, and Dinesh Jayaraman. 2022. Know thyself: Transferable visual control policies through robot-awareness. Preprint.

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. 2023. An embodied generalist agent in 3D world. arXiv preprint arXiv:2311.12871.

Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700-6709.

Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. 2022. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991-1002. PMLR.

Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and 1 others. 2018. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651-673. PMLR.

Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. 2021. MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212.

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. 2024. Prismatic VLMs: Investigating the design space of visually-conditioned language models. In Forty-first International Conference on Machine Learning.

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, and 1 others. 2024. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. 2025. OpenVLA: An open-source vision-language-action model. In Proceedings of The 8th Conference on Robot Learning, volume 270 of Proceedings of Machine Learning Research, pages 2679-2713. PMLR.

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296-26306.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892-34916.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Yecheng Jason Ma, Vikash Kumar, Amy Zhang, Osbert Bastani, and Dinesh Jayaraman. 2023. LIV: Language-image representations and rewards for robotic control. In International Conference on Machine Learning, pages 23301-23320. PMLR.

Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, and 1 others. 2018. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879-893. PMLR.

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. 2022. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601.

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, and 1 others. 2023. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

Lerrel Pinto and Abhinav Gupta. 2016. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406-3413. IEEE.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and 1 others. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR.

Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. 2023. Robot learning with sensorimotor pre-training. In Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 683-693. PMLR.

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. 2022. A generalist agent. Preprint, arXiv:2205.06175.

Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. 2022. GNM: A general navigation model to drive any robot. In Deep Reinforcement Learning Workshop NeurIPS 2022.

Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. 2023. ViNT: A foundation model for visual navigation. Preprint.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317-8326.

Sumedh Sontakke, Jesse Zhang, Séb Arnold, Karl Pertsch, Erdem Bıyık, Dorsa Sadigh, Chelsea Finn, and Laurent Itti. 2023. RoboCLIP: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems, 36:55681-55693.

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, and 1 others. 2024a. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. 2024b. Octo: An open-source generalist robot policy. Preprint.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, and 49 others. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. 2023a. BridgeData V2: A dataset for robot learning at scale. In 7th Annual Conference on Robot Learning.

Homer Rich Walke, Kevin Black, Tony Z. Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, Abraham Lee, Kuan Fang, Chelsea Finn, and Sergey Levine. 2023b. BridgeData V2: A dataset for robot learning at scale. In Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pages 1723-1736. PMLR.

Jonathan Heewon Yang, Dorsa Sadigh, and Chelsea Finn. 2023. PolyBot: Training one policy across robots while embracing variability. In 7th Annual Conference on Robot Learning.

Samir Yitzhak Gadre, Mitchell Wortsman, Gabriel Ilharco, Ludwig Schmidt, and Shuran Song. 2022. CoWs on Pasture: Baselines and benchmarks for language-driven zero-shot object navigation. arXiv e-prints, pages arXiv-2203.

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023a. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975-11986.

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023b. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975-11986.

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, and 1 others. 2023. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 2024. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631.
