---
title: GestureLSM Demo
emoji: "\U0001F57A"
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "4.42.0"
app_file: hf_space/app.py
pinned: false
---
[](https://paperswithcode.com/sota/gesture-generation-on-beat2?p=gesturelsm-latent-shortcut-based-co-speech) <a href="https://arxiv.org/abs/2501.18898"><img src="https://img.shields.io/badge/arxiv-gray?logo=arxiv&"></a>

# GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
# Release Plans
- [x] Inference Code
- [x] Pretrained Models
- [x] A web demo
- [x] Training Code
- [x] Clean Code to make it look nicer
- [x] Support for [MeanFlow](https://arxiv.org/abs/2505.13447)
- [x] Unified training and testing pipeline
- [ ] MeanFlow Training Code (Coming Soon)
- [ ] Merge with [Intentional-Gesture](https://github.com/andypinxinliu/Intentional-Gesture)
## Code Updates
**Latest Update**: The codebase has been cleaned and restructured. For legacy or historical information, please check out the `old` branch.

**New Features**:
- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in the `configs_new/` directory
- Updated checkpoint files with improved performance
# Installation

## Build Environment
```
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
```
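After installation, a quick sanity check (a minimal sketch; it assumes the `gesturelsm` environment is active and that `install_mfa.sh` put `mfa` on the PATH) confirms the environment is usable:

```bash
# Verify that PyTorch imports and that a CUDA device is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# Confirm the Montreal Forced Aligner binary is on the PATH after install_mfa.sh
which mfa
```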
## Code Structure
Understanding the codebase structure will help you navigate and customize the project effectively.
```
GestureLSM/
├── configs_new/                   # New unified configuration files
│   ├── diffusion_rvqvae_128.yaml  # Diffusion model config
│   ├── shortcut_rvqvae_128.yaml   # Shortcut model config
│   └── meanflow_rvqvae_128.yaml   # MeanFlow model config
├── configs/                       # Legacy configuration files (deprecated)
├── ckpt/                          # Pretrained model checkpoints
│   ├── new_540_diffusion.bin      # Diffusion model weights
│   ├── shortcut_reflow.bin        # Shortcut model weights
│   ├── meanflow.pth               # MeanFlow model weights
│   └── net_300000_*.pth           # RVQ-VAE model weights
├── models/                        # Model implementations
│   ├── Diffusion.py               # Diffusion model
│   ├── LSM.py                     # Latent Shortcut Model
│   ├── MeanFlow.py                # MeanFlow model
│   ├── layers/                    # Neural network layers
│   ├── vq/                        # Vector quantization modules
│   └── utils/                     # Model utilities
├── dataloaders/                   # Data loading and preprocessing
│   ├── beat_sep_lower.py          # Main dataset loader
│   ├── pymo/                      # Motion processing library
│   └── utils/                     # Data utilities
├── trainer/                       # Training framework
│   ├── base_trainer.py            # Base trainer class
│   └── generative_trainer.py      # Generative model trainer
├── utils/                         # General utilities
│   ├── config.py                  # Configuration management
│   ├── metric.py                  # Evaluation metrics
│   └── rotation_conversions.py    # Rotation utilities
├── demo/                          # Demo and visualization
│   ├── examples/                  # Sample audio files
│   └── install_mfa.sh             # MFA installation script
├── datasets/                      # Dataset storage
│   ├── BEAT_SMPL/                 # Original BEAT dataset
│   ├── beat_cache/                # Preprocessed cache
│   └── hub/                       # SMPL models and pretrained weights
├── outputs/                       # Training outputs and logs
│   └── weights/                   # Saved model weights
├── train.py                       # Unified training/testing script
├── demo.py                        # Web demo script
├── rvq_beatx_train.py             # RVQ-VAE training script
└── requirements.txt               # Python dependencies
```
### Key Components

#### **Model Architecture**
- **`models/Diffusion.py`**: Denoising diffusion model for high-quality generation
- **`models/LSM.py`**: Latent Shortcut Model for fast inference
- **`models/MeanFlow.py`**: Flow-based model for single-step generation
- **`models/vq/`**: Vector quantization modules for latent space compression

#### **Configuration System**
- **`configs_new/`**: New unified configuration files for all models
- **`configs/`**: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths

#### **Data Pipeline**
- **`dataloaders/beat_sep_lower.py`**: Main dataset loader for the BEAT dataset
- **`dataloaders/pymo/`**: Motion processing library for gesture data
- **`datasets/beat_cache/`**: Preprocessed data cache for faster loading

#### **Training Framework**
- **`train.py`**: Unified script for training and testing all models
- **`trainer/`**: Training framework with base and generative trainers
- **`optimizers/`**: Optimizer and scheduler implementations

#### **Utilities**
- **`utils/config.py`**: Configuration management and validation
- **`utils/metric.py`**: Evaluation metrics (FGD, etc.)
- **`utils/rotation_conversions.py`**: 3D rotation utilities
### Getting Started with the Code
1. **For Training**: Use `train.py` with configs from `configs_new/` (a combined command sketch follows this list)
2. **For Inference**: Use `demo.py` for the web interface, or `train.py --mode test`
3. **For Customization**: Modify the config files in the `configs_new/` directory
4. **For New Models**: Add the model implementation in the `models/` directory
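Putting those steps together, a typical end-to-end workflow looks like the sketch below (the commands are the same ones documented in the Training, Testing, and Demo sections; it assumes the pretrained checkpoints have been downloaded to `ckpt/`):

```bash
# 1. Train the Shortcut model with the unified pipeline
python train.py --config configs_new/shortcut_rvqvae_128.yaml

# 2. Evaluate a pretrained checkpoint
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

# 3. Launch the web demo
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```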
## Results

This table shows the results of the 1-speaker and all-speaker comparisons. RAG-Gesture refers to [**Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis**](https://arxiv.org/abs/2412.06786), accepted to CVPR 2025. The 1-speaker stats use speaker ID 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repo and differ from the stats in the current paper.
## Important Notes

### Model Performance
- The statistics reported in the paper are for the 1-speaker setting with speaker ID 2 ('scott'), consistent with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on 1 speaker.
- If you want to use all speakers, modify the config files to include all speaker IDs.
- April 16, 2025: updated the pretrained models to include all speakers (RVQ-VAEs, Shortcut).
- No hyperparameter tuning was done for the all-speaker setting; the same settings as 1-speaker are used.

### Model Design Choices
- No speaker embedding is included, so the model can generate gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional: gesture types are typically unknown for novel speakers and settings, which makes this setup more realistic for real-world applications.
- If you want better FGD scores, you can try adding gesture type information.

### Code Structure
- **Current Version**: Clean, unified codebase with MeanFlow support
- **Legacy Code**: Available in the `old` branch for historical reference
- **Accepted to ICCV 2025**: Thanks to all co-authors!
## Download Models

### Pretrained Models (Updated)
```
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder

# Option 2: From the Hugging Face Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt

# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
```
### Available Checkpoints
- **Diffusion Model**: `ckpt/new_540_diffusion.bin`
- **Shortcut Model**: `ckpt/shortcut_reflow.bin`
- **MeanFlow Model**: `ckpt/meanflow.pth`
- **RVQ-VAE Models**: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
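A quick check that the expected checkpoints landed in `ckpt/` (filenames as listed above; adjust the patterns if your download uses different names):

```bash
# List the downloaded checkpoints and their sizes
ls -lh ckpt/new_540_diffusion.bin ckpt/shortcut_reflow.bin ckpt/meanflow.pth ckpt/net_300000_*.pth
```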
## Download Dataset
> Required for evaluation and training; not necessary for running the web demo or inference.

### Download BEAT2 Dataset from Hugging Face
The original dataset download method is no longer available. Please use the Hugging Face dataset instead:

```bash
# Download the BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --repo-type dataset --local-dir ./datasets/BEAT2
```

**Dataset Information**:
- **Source**: [H-Liu1997/BEAT2 on Hugging Face](https://huggingface.co/datasets/H-Liu1997/BEAT2)
- **Size**: ~4.1K samples
- **Format**: CSV with train/test splits
- **License**: Apache 2.0
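To verify the download, you can inspect the local copy (assuming `huggingface-cli` placed the files under `./datasets/BEAT2` as in the command above):

```bash
# Check the total size and top-level contents of the downloaded dataset
du -sh ./datasets/BEAT2
ls ./datasets/BEAT2 | head
```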
### Legacy Download (Deprecated)
> The original download method is no longer working.

```bash
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh
```
## Testing/Evaluation
> **Note**: Requires the dataset download for evaluation. For inference only, see the Demo section below.

### Unified Testing Pipeline
The codebase now uses a unified `train.py` script for both training and testing. Use the `--mode test` flag for evaluation:

```bash
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test

# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test
```
### Model Comparison

| Model | Steps | Description | Key Features | Use Case |
|-------|-------|-------------|--------------|----------|
| **Diffusion** | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| **Shortcut** | 2-4 | Latent shortcut with reflow | Fast inference, good quality | **Recommended for most users** |
| **MeanFlow** | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |
### Performance Comparison

| Model | Steps | FGD Score ↓ | Beat Constancy ↑ | L1Div Score ↑ | Inference Speed |
|-------|-------|-------------|------------------|---------------|-----------------|
| **MeanFlow** | 1 | **0.4031** | **0.7489** | 12.4631 | **Fastest** |
| **Diffusion** | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| **Shortcut** | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| **Shortcut-ReFlow** | 2 | 0.4104 | 0.7182 | **13.678** | Fast |

**Legend**:
- **FGD Score** (↓): Lower is better; measures gesture quality
- **Beat Constancy** (↑): Higher is better; measures audio-gesture synchronization
- **L1Div Score** (↑): Higher is better; measures diversity of generated gestures

**Recommendation**: **MeanFlow** is the clear winner, offering the best FGD and Beat Constancy scores with the fastest inference speed.
### Legacy Testing (Deprecated)
> For reference only; use the unified pipeline above instead.

```bash
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml
python test.py -c configs/diffuser_rvqvae_128.yaml
```
## Train RVQ-VAEs (1-speaker)
> Requires dataset download.

```
bash train_rvq.sh
```
## Training
> **Note**: Requires dataset download for training.

### Unified Training Pipeline
The codebase now uses a unified `train.py` script for training all models. Use the new configuration files in `configs_new/`:

```bash
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml

# Train Shortcut Model
python train.py --config configs_new/shortcut_rvqvae_128.yaml

# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml
```
### Training Configuration
- **Config Directory**: Use `configs_new/` for the latest configurations
- **Output Directory**: Models are saved to `./outputs/weights/`
- **Logging**: Supports Weights & Biases integration (configure in the config files; see the note below)
- **GPU Support**: Configure GPU usage in the config files
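If Weights & Biases logging is enabled in a config, the standard `WANDB_MODE` environment variable controls whether runs are synced; this is generic `wandb` behavior, not a flag defined by this repo:

```bash
# Keep W&B logs local only (no upload); assumes the config enables wandb logging
WANDB_MODE=offline python train.py --config configs_new/shortcut_rvqvae_128.yaml

# Disable W&B logging entirely
WANDB_MODE=disabled python train.py --config configs_new/shortcut_rvqvae_128.yaml
```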
### Legacy Training (Deprecated)
> For reference only; use the unified pipeline above instead.

```bash
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml
```
## Quick Start

### Demo/Inference (No Dataset Required)
```bash
# Run the web demo with the Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```

### Testing with Your Own Data
```bash
# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
```
## Demo
The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.

```bash
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```

**Features**:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures
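Since the demo is a Gradio app (per the Space config above), the standard Gradio environment variables can be used to change the host and port; these are generic Gradio settings, not options defined by this repo:

```bash
# Expose the demo on all interfaces on port 7860 (useful when running on a remote machine)
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```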
# Acknowledgments
Thanks to [SynTalker](https://github.com/RobinWitch/SynTalker/tree/main), [EMAGE](https://github.com/PantoMatrix/PantoMatrix/tree/main/scripts/EMAGE_2024), and [DiffuseStyleGesture](https://github.com/YoungSeng/DiffuseStyleGesture); our code partially borrows from them. Please check out these useful repos.
# Citation
If you find our code or paper helpful, please consider citing:

```bibtex
@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
```