thomagram committed · verified
Commit 3247f5f · 1 Parent: 65bacb2

Update README.md

Files changed (1): README.md (+274, -5)
README.md CHANGED
@@ -1,5 +1,274 @@
- ---
- license: apple-amlr
- language:
- - en
- ---
# STARFlow: Scalable Transformer Auto-Regressive Flow

<div align="center">
<img src="starflow_logo.png" alt="STARFlow Logo" width="300">
</div>

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2506.06276-b31b1b.svg)](https://arxiv.org/abs/2506.06276)
[![arXiv](https://img.shields.io/badge/arXiv-2511.20462-b31b1b.svg)](https://arxiv.org/abs/2511.20462)
[![NeurIPS](https://img.shields.io/badge/NeurIPS-2025%20Spotlight-blue.svg)](https://neurips.cc/Conferences/2025)

</div>

This is the official open-source release of **STARFlow** and **STARFlow-V**, state-of-the-art transformer autoregressive flow models for high-quality image and video generation.

## 📖 Overview

**STARFlow** introduces a novel transformer autoregressive flow architecture that combines the expressiveness of autoregressive models with the efficiency of normalizing flows. The model achieves state-of-the-art results in both text-to-image and text-to-video generation tasks.

- **[STARFlow](https://arxiv.org/abs/2506.06276)**: Scaling Latent Normalizing Flows for High-resolution Image Synthesis (NeurIPS 2025 Spotlight)
- **[STARFlow-V](https://arxiv.org/abs/2511.20462)**: End-to-End Video Generative Modeling with Normalizing Flows (arXiv preprint)

🎬 **[View Video Results Gallery](https://starflow-v.github.io)** - See examples of generated videos and comparisons

## 🚀 Quick Start

### Environment Setup

```bash
# Clone the repository
git clone https://github.com/apple/ml-starflow
cd ml-starflow

# Set up the conda environment (recommended)
bash scripts/setup_conda.sh

# Or install dependencies manually
pip install -r requirements.txt
```
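
Before moving on to checkpoints, it is worth confirming the environment is usable. A minimal sanity check, assuming only that PyTorch was installed by one of the steps above:

```bash
# Confirm PyTorch imports and CUDA is visible (illustrative check)
python -c "import torch; print(torch.__version__, '| CUDA available:', torch.cuda.is_available())"
```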

### Model Checkpoints

**Important**: You'll need to download the pretrained model checkpoints and place them in the `ckpts/` directory. For example:

- `ckpts/starflow_3B_t2i_256x256.pth` - for text-to-image generation
- `ckpts/starflow-v_7B_t2v_caus_480p_v3.pth` - for text-to-video generation
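
Sampling jobs fail late if a checkpoint path is wrong, so a quick existence check can save a GPU allocation. A minimal sketch using the example paths above:

```bash
# Verify the expected checkpoints are in place (paths from the examples above)
for f in ckpts/starflow_3B_t2i_256x256.pth ckpts/starflow-v_7B_t2v_caus_480p_v3.pth; do
  if [ -f "$f" ]; then echo "found:   $f"; else echo "missing: $f"; fi
done
```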

### Text-to-Image Generation

Generate high-quality images from text prompts:

```bash
# Basic image generation (256x256)
bash scripts/test_sample_image.sh "a film still of a cat playing piano"

# Custom prompt and settings
torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "your custom prompt here" \
    --sample_batch_size 8 \
    --cfg 3.6 \
    --aspect_ratio "1:1" \
    --seed 999
```
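
The guidance scale is the main quality/diversity knob (see Tips below), and a small sweep makes the tradeoff easy to compare side by side. This sketch reuses only the flags from the command above, with illustrative values:

```bash
# Sweep classifier-free guidance on a fixed prompt and seed (illustrative)
for cfg in 2.0 3.6 5.0; do
  torchrun --standalone --nproc_per_node 1 sample.py \
      --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
      --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
      --caption "a film still of a cat playing piano" \
      --sample_batch_size 4 \
      --cfg "$cfg" \
      --seed 999
done
```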

### Text-to-Video Generation

Generate videos from text descriptions:

```bash
# Basic video generation (480p, ~5 seconds)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera"

# With a custom input image for TI2V video generation
bash scripts/test_sample_video.sh "a cat playing piano" "/path/to/input/image.jpg"

# Longer video generation (specify target length in frames)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 241  # ~15 seconds at 16 fps
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 481  # ~30 seconds at 16 fps

# Advanced video generation
torchrun --standalone --nproc_per_node 8 sample.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --checkpoint_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --caption "your video prompt here" \
    --sample_batch_size 1 \
    --cfg 3.5 \
    --aspect_ratio "16:9" \
    --out_fps 16 \
    --jacobi 1 --jacobi_th 0.001 \
    --target_length 161  # customize video length
```
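
To render several clips unattended, the test script can be driven from a plain shell loop. A sketch, assuming a hypothetical `prompts.txt` with one prompt per line:

```bash
# Generate one video per line of prompts.txt (prompts.txt is a hypothetical input file)
while IFS= read -r prompt; do
  bash scripts/test_sample_video.sh "$prompt"
done < prompts.txt
```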

## 🛠️ Training

### Image Training

Train your own STARFlow model for text-to-image generation:

```bash
# Quick training test
bash scripts/test_train_image.sh 10 16

# Full training with custom parameters
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --epochs 100 \
    --batch_size 1024 \
    --wandb_name "my_starflow_training"
```
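
Before committing all eight GPUs to a long run, the `--dry_run 1` flag that the test scripts use for validation (see Tips) can exercise the config and data pipeline first. A sketch under the assumption that the flag behaves the same when passed to `train.py` directly:

```bash
# Validate config and data plumbing without real training
# (--dry_run 1 is the validation flag used by the test scripts; illustrative here)
torchrun --standalone --nproc_per_node 1 train.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --batch_size 16 \
    --dry_run 1
```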

### Video Training

Train STARFlow-V for text-to-video generation:

```bash
# Quick training test
bash scripts/test_train_video.sh 10 8

# Resume training from a checkpoint
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --resume_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --epochs 100 \
    --batch_size 192
```

## 🔧 Utilities

### Video Processing

Extract individual frames from multi-video grids:

```bash
# Extract frames from a video containing multiple video grids
python scripts/extract_image_from_video.py --input_video path/to/video.mp4 --output_dir output/

# Extract images with custom settings
python scripts/extract_images.py input_file.mp4
```
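
To process a whole directory of grid videos, the extractor can be looped over every file. A sketch (the `samples/` directory is hypothetical; the flags are the ones shown above):

```bash
# Extract frames from every .mp4 under samples/, one output folder per video (illustrative)
for v in samples/*.mp4; do
  python scripts/extract_image_from_video.py \
      --input_video "$v" \
      --output_dir "output/$(basename "$v" .mp4)/"
done
```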

## 📐 Model Architecture

### STARFlow (3B Parameters - Text-to-Image)
- **Resolution**: 256×256
- **Architecture**: 6-block deep-shallow architecture
- **Text Encoder**: T5-XL
- **VAE**: SD-VAE
- **Features**: RoPE positional encoding, mixed-precision training

### STARFlow-V (7B Parameters - Text-to-Video)
- **Resolution**: Up to 640×480 (480p)
- **Temporal**: 81 frames (≈5 seconds at 16 FPS)
- **Architecture**: 6-block deep-shallow architecture (full sequence)
- **Text Encoder**: T5-XL
- **VAE**: WAN2.2-VAE
- **Features**: Causal attention, autoregressive generation, variable-length support

## 🔧 Key Features

- **Autoregressive Flow Architecture**: Novel combination of autoregressive models and normalizing flows
- **High-Quality Generation**: FID scores and visual quality competitive with state-of-the-art diffusion models
- **Flexible Resolution**: Support for various aspect ratios and resolutions
- **Efficient Training**: FSDP support for large-scale distributed training
- **Fast Sampling**: Block-wise Jacobi iteration for accelerated inference
- **Text Conditioning**: Advanced text-to-image/video capabilities
- **Video Generation**: Temporal consistency and smooth motion

## 📊 Configuration

### Key Parameters

#### Image Generation (`starflow_3B_t2i_256x256.yaml`)
- `img_size: 256` - Output image resolution
- `txt_size: 128` - Text sequence length
- `channels: 3072` - Model hidden dimension
- `cfg: 3.6` - Classifier-free guidance scale
- `noise_std: 0.3` - Flow noise standard deviation

#### Video Generation (`starflow-v_7B_t2v_caus_480p.yaml`)
- `img_size: 640` - Video frame resolution
- `vid_size: '81:16'` - Temporal dimensions (frames:downsampling)
- `fps_cond: 1` - FPS conditioning enabled
- `temporal_causal: 1` - Causal temporal attention
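
To confirm which values a given config actually ships with, the fields above can be pulled straight out of the YAML files. A minimal sketch:

```bash
# Print the key fields from both shipped configs (simple text match, illustrative)
grep -E "img_size|txt_size|channels|noise_std|vid_size|fps_cond|temporal_causal" \
    configs/starflow_3B_t2i_256x256.yaml configs/starflow-v_7B_t2v_caus_480p.yaml
```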

### Sampling Options
- `--cfg` - Classifier-free guidance scale (higher = more prompt adherence)
- `--jacobi` - Enable Jacobi iteration for faster sampling
- `--jacobi_th` - Jacobi convergence threshold
- `--jacobi_block_size` - Block size for Jacobi iteration
- `--aspect_ratio` - Output aspect ratio ("1:1", "16:9", "4:3", etc.)
- `--seed` - Random seed for reproducible generation
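
These options combine freely. One invocation putting them together, as a sketch (the values are illustrative; the Jacobi settings `--jacobi_th 0.001` and `--jacobi_block_size 16` are the defaults quoted in the Tips section):

```bash
# Image sampling with guidance, Jacobi acceleration, and a fixed seed (illustrative)
torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "a film still of a cat playing piano" \
    --cfg 3.6 \
    --jacobi 1 --jacobi_th 0.001 --jacobi_block_size 16 \
    --aspect_ratio "16:9" \
    --seed 999
```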

## 📚 Project Structure

```
├── train.py              # Main training script
├── sample.py             # Sampling and inference
├── transformer_flow.py   # Core model implementation
├── dataset.py            # Dataset loading and preprocessing
├── finetune_decoder.py   # Decoder fine-tuning script
├── utils/                # Utility modules
│   ├── common.py         # Core utility functions
│   ├── model_setup.py    # Model configuration and setup
│   ├── training.py       # Training utilities and metrics
│   └── inference.py      # Evaluation and metrics
├── configs/              # Model configuration files
│   ├── starflow_3B_t2i_256x256.yaml
│   └── starflow-v_7B_t2v_caus_480p.yaml
├── scripts/              # Example training and sampling scripts
│   ├── test_sample_image.sh
│   ├── test_sample_video.sh
│   ├── test_train_image.sh
│   ├── test_train_video.sh
│   ├── setup_conda.sh
│   ├── extract_images.py
│   └── extract_image_from_video.py
└── misc/                 # Additional utilities
    ├── pe.py             # Positional encodings
    ├── lpips.py          # LPIPS loss
    └── wan_vae2.py       # Video VAE implementation
```

## 💡 Tips

### Image Generation
1. Use guidance scales between 2.0 and 5.0 for balanced quality and diversity
2. Experiment with different aspect ratios for your use case
3. Enable Jacobi iteration (`--jacobi 1`) for faster sampling
4. Use higher-resolution models for detailed outputs
5. The default script uses optimized settings: `--jacobi_th 0.001` and `--jacobi_block_size 16`

### Video Generation
1. Start with shorter sequences (81 frames) and gradually increase length (161, 241, 481+ frames)
2. Use input images (`--input_image`) for more controlled generation
3. Adjust FPS settings based on content type (8-24 FPS)
4. Consider temporal consistency when crafting prompts
5. The default script uses `--jacobi_block_size 64`
6. **Longer videos**: Use `--target_length` to generate videos beyond the training length (requires `--jacobi 1`)
7. **Frame reference**: 81 frames ≈ 5 s, 161 frames ≈ 10 s, 241 frames ≈ 15 s, 481 frames ≈ 30 s at 16 fps (see the arithmetic sketch below)
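
The durations in the frame reference follow a simple pattern: each listed frame count is one more than a multiple of 16, so `(frames - 1) / 16` yields whole seconds at 16 fps. A small sketch to tabulate candidate lengths (the formula is inferred from the reference numbers above, not taken from the code):

```bash
# Approximate duration for candidate frame counts at 16 fps
# ((frames - 1) / fps is inferred from the table above, e.g. 81 frames -> 5 s)
fps=16
for frames in 81 161 241 481; do
  echo "$frames frames = $(( (frames - 1) / fps )) s at ${fps} fps"
done
```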

### Training
1. Use FSDP for efficient large model training
2. Start with smaller batch sizes and scale up
3. Monitor loss curves and adjust learning rates accordingly
4. Use gradient checkpointing to reduce memory usage
5. The test scripts include `--dry_run 1` for validation

## 🔗 Citation

If you use STARFlow in your research, please cite:

```bibtex
@inproceedings{gu2025starflow,
  title={STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis},
  author={Gu, Jiatao and Chen, Tianrong and Berthelot, David and Zheng, Huangjie and Wang, Yuyang and Zhang, Ruixiang and Dinh, Laurent and Bautista, Miguel Angel and Susskind, Josh and Zhai, Shuangfei},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```

## 📄 License

Please review the repository [LICENSE](LICENSE) before using the provided code, and [LICENSE_MODEL](LICENSE_MODEL) before using the released models.

## 🤝 Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.