---
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
base_model: Dream-org/Dream-VL-7B
tags:
- robotics
- vla
- multimodal
- pretraining
---

# Dream-VLA 7B

[Paper](https://huggingface.co/papers/2512.22615) | [Project Page](https://hkunlp.github.io/blog/2025/dream-vlx/) | [GitHub](https://github.com/DreamLM/Dream-VLX)

Dream-VLA 7B is an open vision-language-action model built from the diffusion VLM [Dream-VL](https://huggingface.co/Dream-org/Dream-VL-7B). The model takes language instructions and camera images as input and generates robot actions. It supports controlling multiple robots out of the box, and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

All Dream-VLA checkpoints, as well as our [training codebase](https://github.com/DreamLM/Dream-VLX), are released under an Apache 2.0 license.

## Model Summary

- **Model type:** Vision-language-action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Finetuned from:** [`Dream-VL`](https://huggingface.co/Dream-org/Dream-VL-7B), a VLM trained from:
  + **Vision Backbone**: Qwen2ViT
  + **Language Model**: Dream-7B (Diffusion Language Model)
- **Pretraining Dataset:** [Open X-Embodiment](https://robotics-transformer-x.github.io/), with the specific dataset mixture following [OpenVLA](https://github.com/openvla/openvla).
- **Repository:** [https://github.com/DreamLM/Dream-VLX](https://github.com/DreamLM/Dream-VLX)

## Uses

Dream-VLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be *un-normalized* according to statistics computed on a per-robot, per-dataset basis. The available un-normalization keys are listed inside `config.json` (a sketch for inspecting them is included at the end of this card).

Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., [BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). They can also be efficiently *fine-tuned* for new tasks and robot setups given minimal demonstration data (a parameter-efficient fine-tuning sketch is included at the end of this card).

## Getting Started

Dream-VLA 7B can be used out of the box to control multiple robots in domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:

```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
# ^ replace `get_from_camera` with actual camera loading
task_description = "pick up the block"
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": f"What action should the robot take to {task_description}?"},
    ]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action)
```

## Citation

```bibtex
@article{ye2025dreamvla,
  title={Dream-VL \& Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2512.22615},
  year={2025}
}
```
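
## Inspecting Available Un-Normalization Keys

The dataset names accepted by `predict_action`'s `unnorm_key` argument are stored alongside the model configuration. Below is a minimal sketch for listing them; it assumes the statistics live under a `norm_stats` field of `config.json` (the OpenVLA convention), so inspect the file directly if the field name differs.

```python
# Minimal sketch: list the dataset keys that have un-normalization statistics.
# Assumption: the statistics are stored under a "norm_stats" field in
# config.json (OpenVLA convention); inspect the file if the name differs.
import json

from huggingface_hub import hf_hub_download

config_path = hf_hub_download("Dream-org/Dream-VLA-7B", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(list(config.get("norm_stats", {}).keys()))  # e.g., includes "bridge_orig"
```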
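
## Parameter-Efficient Fine-Tuning (Sketch)

The supported fine-tuning recipes (training scripts, data pipeline, and loss setup) live in the [Dream-VLX repository](https://github.com/DreamLM/Dream-VLX). Purely as an illustration of parameter-efficient adaptation, the sketch below attaches LoRA adapters to the loaded model with the `peft` library; the `target_modules` names are assumptions about the backbone's projection layers and should be adjusted to the module names actually present in the model, and the training loop itself should follow the official codebase.

```python
# Hedged sketch: wrap Dream-VLA with LoRA adapters for parameter-efficient
# adaptation. The target_modules names are assumptions; adjust them to the
# module names actually present in the model (e.g., via vla.named_modules()).
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()

# The actual action supervision, data loading, and optimization schedule are
# provided by the Dream-VLX codebase and are not reproduced here.
```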