CaltechAUTHORS: A Caltech Library Service

VIMA: General Robot Manipulation with Multimodal Prompts

Jiang, Yunfan and Gupta, Agrim and Zhang, Zichen and Wang, Guanzhi and Dou, Yongqiang and Chen, Yanjun and Fei-Fei, Li and Anandkumar, Anima and Zhu, Yuke and Fan, Linxi (2022) VIMA: General Robot Manipulation with Multimodal Prompts. (Unpublished)

PDF - Submitted Version (Creative Commons Attribution)


Abstract: Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often treated as different tasks and tackled by specialized models. This work shows that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. VIMA scales well in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to 2.9× in task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the top competing approach. We open-source all code, pretrained models, dataset, and simulation benchmark at
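The core idea in the abstract — a single prompt that interleaves text and object images, consumed by a policy that emits actions autoregressively — can be sketched in a few lines. This is a toy illustration only; the names (`TextToken`, `ImageToken`, `make_prompt`, `rollout`) are hypothetical and do not reflect the actual VIMA codebase, and the "policy" here is a stand-in that merely cycles through the referenced objects rather than a trained transformer.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class TextToken:
    word: str

@dataclass
class ImageToken:
    object_id: str  # stands in for a cropped object image in the prompt

PromptToken = Union[TextToken, ImageToken]

def make_prompt(spec: List[Union[str, Tuple[str, str]]]) -> List[PromptToken]:
    """Interleave text spans and visual references into one multimodal prompt."""
    prompt: List[PromptToken] = []
    for item in spec:
        if isinstance(item, str):
            prompt.extend(TextToken(w) for w in item.split())
        else:  # ("img", object_id) visual reference
            prompt.append(ImageToken(item[1]))
    return prompt

def rollout(prompt: List[PromptToken], steps: int) -> List[str]:
    """Autoregressive decoding sketch: each action is conditioned on the
    prompt and on all previously emitted actions."""
    actions: List[str] = []
    targets = [t.object_id for t in prompt if isinstance(t, ImageToken)]
    for step in range(steps):
        # A real policy would run a transformer over (prompt, actions);
        # here we simply cycle through the objects the prompt refers to.
        actions.append(f"pick_and_place({targets[step % len(targets)]})")
    return actions

prompt = make_prompt([
    "Put the", ("img", "red_block"), "into the", ("img", "green_bowl"),
])
print(rollout(prompt, steps=2))
# → ['pick_and_place(red_block)', 'pick_and_place(green_bowl)']
```

The point of the sketch is the data shape: one flat token sequence mixing modalities, so a single conditional sequence model can cover demonstration-following, language instructions, and visual goals without task-specific heads.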

Item Type: Report or Paper (Discussion Paper)
Related URLs: Paper
Fei-Fei, Li (ORCID: 0000-0002-7481-0810)
Anandkumar, Anima (ORCID: 0000-0002-6974-6797)
Zhu, Yuke (ORCID: 0000-0002-9198-2227)
Fan, Linxi (ORCID: 0000-0001-7393-3125)
Additional Information: Attribution 4.0 International (CC BY 4.0). We are extremely grateful to Shyamal Buch, Jonathan Tremblay, Ajay Mandlekar, Chris Choy, De-An Huang, Silvio Savarese, Fei Xia, Josiah Wong, Abhishek Joshi, Soroush Nasiriany, and many other colleagues and friends for their helpful feedback and insightful discussions. NVIDIA provided the necessary computing resources and infrastructure for this project. This work was done during Yunfan Jiang's and Guanzhi Wang's internships at NVIDIA. Guanzhi Wang is supported by the Kortschak fellowship in Computing and Mathematical Sciences at Caltech.
Funding Agency: Kortschak Scholars Program (Grant Number: UNSPECIFIED)
Record Number: CaltechAUTHORS:20221221-004703977
Persistent URL:
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 118550
Deposited By: George Porter
Deposited On: 22 Dec 2022 18:42
Last Modified: 16 Mar 2023 20:08
