VMAE MR25 ImageNet256 400ep LPIPS Tuned

This is a VMAE (Vision Masked AutoEncoder) trained to serve as the latent autoencoder for LDMAE (Latent Diffusion with Masked AutoEncoder).

Model Details

  • Architecture: VMAE (Vision Masked AutoEncoder) with f8d16 configuration
  • Training: 400 epochs on ImageNet 256x256
  • Optimization: tuned with an LPIPS perceptual loss (weight 0.1, see the config below) for better perceptual quality
  • Compression: 8x spatial compression, 16-dimensional latent space
  • Input Size: 256x256 RGB images
  • Output: 32x32x16 latent representation (see the shape check below)
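
The output shape follows directly from the numbers above: a 256x256 input divided by the 8x spatial compression factor gives a 32x32 grid with 16 latent channels per position. A quick sanity check of the arithmetic (illustrative only; the actual latent tensor is typically channels-first):

img_size, compression, latent_channels = 256, 8, 16
latent_hw = img_size // compression            # 256 // 8 = 32
print(latent_hw, latent_hw, latent_channels)   # 32 32 16
# Raw element count: 256*256*3 = 196,608 pixels vs. 32*32*16 = 16,384 latents (12x fewer)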

Usage

import torch
from models_mae import mae_for_ldmae_f8d16_prev

# Load model
model = mae_for_ldmae_f8d16_prev(
    ldmae_mode=True,
    no_cls=True,
    kl_loss_weight=True,
    smooth_output=True,
    img_size=256
)

# Load checkpoint (strict=False tolerates state-dict keys unused at inference)
checkpoint = torch.load('pytorch_model.bin', map_location='cpu')
model.load_state_dict(checkpoint['model'], strict=False)
model.eval()

# Encode images (images: float tensor of shape [B, 3, 256, 256],
# preprocessed to match the training-time normalization)
with torch.no_grad():
    latents = model.encode(images).latent_dist.mode()

# Decode latents
with torch.no_grad():
    reconstructed = model.decode(latents).sample
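
The snippet above assumes `images` is already a batched, normalized tensor. A minimal way to prepare one is sketched below; scaling to [-1, 1] is an assumption (the usual convention for VAE-style autoencoders), so check the training pipeline if reconstructions look off:

from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                       # [0, 1]
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # -> [-1, 1] (assumed convention)
])

img = Image.open('example.jpg').convert('RGB')
images = preprocess(img).unsqueeze(0)            # [1, 3, 256, 256]

with torch.no_grad():
    latents = model.encode(images).latent_dist.mode()  # [1, 16, 32, 32] (channels-first assumed)
    reconstructed = model.decode(latents).sample       # [1, 3, 256, 256]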

Training Configuration

augmentation:
  color_jitter: 0.4
  random_crop: true
  random_flip: true
data:
  batch_size: 32
  data_path: /data/dataset/imagenet/1K_dataset
  image_size: 256
  num_workers: 8
evaluation:
  metrics:
  - rfid
  - psnr
  - lpips
  - ssim
loss:
  kl_weight: 1.0e-06
  lpips_weight: 0.1
  reconstruction_weight: 1.0
model_info:
  compression_ratio: 8
  description: VMAE with 8x spatial compression and 16-dimensional latent space
  latent_channels: 16
  optimization: LPIPS-tuned for perceptual quality
  training_dataset: ImageNet 256x256
training:
  epochs: 400
  learning_rate: 0.00015
  min_lr: 0.0
  warmup_epochs: 40
  weight_decay: 0.05
vae:
  architecture: mae_for_ldmae_f8d16_prev
  model_name: vmae_f8d16
  params:
    decoder_depth: 8
    decoder_embed_dim: 512
    decoder_num_heads: 16
    depth: 24
    embed_dim: 512
    img_size: 256
    in_channels: 3
    kl_loss_weight: true
    latent_dim: 16
    ldmae_mode: true
    mlp_ratio: 4.0
    no_cls: true
    norm_layer: LayerNorm
    num_heads: 16
    patch_size: 8
    smooth_output: true
  weight_path: pretrain_weight/vmaef8d16.pth
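
The loss section of the config combines three weighted terms. A sketch of how they plausibly interact, using an MSE reconstruction term (the exact reconstruction loss is not specified here), the lpips package, and the standard diagonal-Gaussian KL; mu and logvar are illustrative names for the encoder's distribution parameters:

import torch
import torch.nn.functional as F
import lpips

lpips_fn = lpips.LPIPS(net='vgg')  # backbone choice is an assumption

def vae_loss(recon, target, mu, logvar,
             recon_w=1.0, lpips_w=0.1, kl_w=1e-6):
    rec = F.mse_loss(recon, target)           # pixel reconstruction
    perc = lpips_fn(recon, target).mean()     # perceptual term, inputs in [-1, 1]
    # KL divergence of a diagonal Gaussian against N(0, I)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_w * rec + lpips_w * perc + kl_w * kl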
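
The training fields (learning_rate, min_lr, warmup_epochs) are consistent with the linear-warmup plus half-cycle cosine schedule used in the original MAE codebase; that this model uses exactly that schedule is an assumption. A sketch:

import math

def lr_at_epoch(epoch, base_lr=1.5e-4, min_lr=0.0,
                warmup_epochs=40, total_epochs=400):
    # Linear warmup, then cosine decay from base_lr down to min_lr
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + (base_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))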

Citation

If you use this model, please cite:

@article{ldmae2025,
  title={LDMAE: Latent Diffusion with Masked AutoEncoder},
  author={Your Name},
  year={2025}
}

License

Apache-2.0
