Dream-VLA 7B

Paper | Project Page | GitHub

Dream-VLA 7B is an open vision-language-action (VLA) model built on Dream-VL, a diffusion-based vision-language model (VLM). The model takes a language instruction and camera images as input and generates robot actions. It supports controlling multiple robots out of the box and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

All Dream-VLA checkpoints, as well as our training codebase, are released under the Apache 2.0 license.

Model Summary

  • Model type: Vision-language-action (language, image => robot actions)
  • Language(s) (NLP): en
  • License: apache-2.0
  • Finetuned from: Dream-VL, a VLM built from:
    • Vision Backbone: Qwen2ViT
    • Language Model: Dream-7B (Diffusion Language Model)
  • Pretraining Dataset: Open X-Embodiment, with specific dataset mixture following OpenVLA.
  • Repository: https://github.com/DreamLM/Dream-VLX

Uses

Dream-VLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be un-normalized using statistics computed on a per-robot, per-dataset basis. The available un-normalization keys are listed in config.json.
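
For intuition, here is a minimal sketch of what un-normalization does, assuming the quantile-based scheme used in OpenVLA-style pipelines. Note that predict_action applies this step internally when you pass unnorm_key, and the statistic names below are illustrative rather than part of the released API:

import numpy as np

def unnormalize_action(normalized_action, stats):
    # Map a normalized action in [-1, 1] back to raw robot units using
    # per-dimension 1st/99th-percentile statistics. The key names "q01"/"q99"
    # are assumptions for illustration; predict_action handles this for you.
    low = np.asarray(stats["q01"])
    high = np.asarray(stats["q99"])
    return 0.5 * (normalized_action + 1.0) * (high - low) + low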

Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., for BridgeV2 environments with a Widow-X robot). They can also be efficiently fine-tuned for new tasks and robot setups given minimal demonstration data.
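
As a rough sketch of parameter-efficient fine-tuning, the snippet below wraps the model with LoRA adapters via the peft library. The target module names are assumptions about the underlying attention projections and may need adjusting; the official training recipe lives in the GitHub repository:

import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# LoRA adapters on the attention projections (module names are illustrative).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
vla = get_peft_model(vla, lora_config)
vla.print_trainable_parameters()

# ...then train on (image, instruction, action) demonstrations with a standard loop.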

Getting Started

Dream-VLA 7B can be used out of the box to control multiple robots for domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:

# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True, 
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)  # Replace with actual camera loading
task_description = "pick up the block"
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": f"What action should the robot take to {task_description}?"}]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute on the robot (`robot` is a placeholder for your own control interface)
robot.act(action)
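
In practice the snippet above runs inside a closed control loop: capture a fresh image, predict the next action, and execute it at every step. A minimal sketch, where get_from_camera and robot.act remain placeholders for your own camera and control stack:

max_steps = 100  # task-dependent horizon (illustrative)
for _ in range(max_steps):
    image = get_from_camera(...)  # Placeholder: capture the current workspace image
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)
    action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
    robot.act(action)  # Placeholder: send the 7-DoF action to your controller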

Citation

@article{ye2025dreamvla,
  title={Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint},
  year={2025}
}