---
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
base_model: Dream-org/Dream-VL-7B
tags:
- robotics
- vla
- multimodal
- pretraining
---

# Dream-VLA 7B

[Paper](https://huggingface.co/papers/2512.22615) | [Project Page](https://hkunlp.github.io/blog/2025/dream-vlx/) | [GitHub](https://github.com/DreamLM/Dream-VLX)

Dream-VLA 7B is an open vision-language-action model built from the diffusion VLM [Dream-VL](https://huggingface.co/Dream-org/Dream-VL-7B). The model takes language instructions and camera images as input and generates robot actions. It supports controlling multiple robots out of the box, and can be quickly adapted to new robot domains via (parameter-efficient) fine-tuning.

All Dream-VLA checkpoints, as well as our [training codebase](https://github.com/DreamLM/Dream-VLX), are released under an Apache 2.0 license.

## Model Summary

- **Model type:** Vision-language-action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Finetuned from:** [`Dream-VL`](https://huggingface.co/Dream-org/Dream-VL-7B), a VLM trained from:
  + **Vision Backbone**: Qwen2ViT
  + **Language Model**: Dream-7B (Diffusion Language Model)
- **Pretraining Dataset:** [Open X-Embodiment](https://robotics-transformer-x.github.io/), with the specific dataset mixture following [OpenVLA](https://github.com/openvla/openvla).
- **Repository:** [https://github.com/DreamLM/Dream-VLX](https://github.com/DreamLM/Dream-VLX)

## Uses

Dream-VLA models take a language instruction and a camera image of a robot workspace as input, and predict (normalized) robot actions consisting of 7-DoF end-effector deltas of the form (x, y, z, roll, pitch, yaw, gripper). To execute on an actual robot platform, actions need to be *un-normalized* according to statistics computed on a per-robot, per-dataset basis. The available un-normalization keys are listed inside `config.json` (a sketch for inspecting them is included at the end of this card).

Dream-VLA models can be used zero-shot to control robots for specific combinations of embodiments and domains seen in the Open X-Embodiment pretraining mixture (e.g., [BridgeV2 environments with a Widow-X robot](https://rail-berkeley.github.io/bridgedata/)). They can also be efficiently *fine-tuned* for new tasks and robot setups given minimal demonstration data (a parameter-efficient fine-tuning sketch is included at the end of this card).

## Getting Started

Dream-VLA 7B can be used out of the box to control multiple robots in domains represented in the pretraining mixture. For example, here is how to load Dream-VLA for zero-shot instruction following in the BridgeV2 environments with a Widow-X robot:

```python
# Install minimal dependencies (`torch`, `transformers`, `timm`, `tokenizers`, `flash_attn`, ...)
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch

# Load Processor & VLA
processor = AutoProcessor.from_pretrained("Dream-org/Dream-VLA-7B", trust_remote_code=True)
vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B",
    attn_implementation="flash_attention_2",  # [Optional] Requires `flash_attn`
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to("cuda:0")

# Grab image input & format prompt
image: Image.Image = get_from_camera(...)
# ^ replace `get_from_camera` with actual camera loading
task_description = "pick up the block"
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": f"What action should the robot take to {task_description}?"},
    ]},
]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to("cuda:0", dtype=torch.bfloat16)

# Predict Action (7-DoF; un-normalize for BridgeV2)
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# Execute...
robot.act(action)
```

## Citation

```bibtex
@article{ye2025dreamvla,
  title={Dream-VL \& Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone},
  author={Ye, Jiacheng and Gong, Shansan and Gao, Jiahui and Fan, Junming and Wu, Shuang and Bi, Wei and Bai, Haoli and Shang, Lifeng and Kong, Lingpeng},
  journal={arXiv preprint arXiv:2512.22615},
  year={2025}
}
```
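
## Inspecting Available Un-Normalization Keys

The dataset names accepted by `predict_action`'s `unnorm_key` argument are stored alongside the model configuration. Below is a minimal sketch for listing them; it assumes the statistics live under a `norm_stats` field of `config.json` (the OpenVLA convention), so inspect the file directly if the field name differs.

```python
# Minimal sketch: list the dataset keys that have un-normalization statistics.
# Assumption: the statistics are stored under a "norm_stats" field in
# config.json (OpenVLA convention); inspect the file if the name differs.
import json

from huggingface_hub import hf_hub_download

config_path = hf_hub_download("Dream-org/Dream-VLA-7B", "config.json")
with open(config_path) as f:
    config = json.load(f)

print(list(config.get("norm_stats", {}).keys()))  # e.g., includes "bridge_orig"
```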
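
## Parameter-Efficient Fine-Tuning (Sketch)

The supported fine-tuning recipes (training scripts, data pipeline, and loss setup) live in the [Dream-VLX repository](https://github.com/DreamLM/Dream-VLX). Purely as an illustration of parameter-efficient adaptation, the sketch below attaches LoRA adapters to the loaded model with the `peft` library; the `target_modules` names are assumptions about the backbone's projection layers and should be adjusted to the module names actually present in the model, and the training loop itself should follow the official codebase.

```python
# Hedged sketch: wrap Dream-VLA with LoRA adapters for parameter-efficient
# adaptation. The target_modules names are assumptions; adjust them to the
# module names actually present in the model (e.g., via vla.named_modules()).
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

vla = AutoModel.from_pretrained(
    "Dream-org/Dream-VLA-7B", torch_dtype=torch.bfloat16, trust_remote_code=True
)

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
)
vla = get_peft_model(vla, lora_cfg)
vla.print_trainable_parameters()

# The actual action supervision, data loading, and optimization schedule are
# provided by the Dream-VLX codebase and are not reproduced here.
```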