SNCE: Geometry-Aware Supervision for Scalable Discrete Image Generation
Abstract
Recent advancements in discrete image generation showed that scaling the VQ codebook size significantly improves reconstruction fidelity. However, training generative models with a large VQ codebook remains challenging, typically requiring larger model size and a longer training schedule. In this work, we propose Stochastic Neighbor Cross Entropy Minimization (SNCE), a novel training objective designed to address the optimization challenges of large-codebook discrete image generators. Instead of supervising the model with a hard one-hot target, SNCE constructs a soft categorical distribution over a set of neighboring tokens. The probability assigned to each token is proportional to the proximity between its code embedding and the ground-truth image embedding, encouraging the model to capture semantically meaningful geometric structure in the quantized embedding space. We conduct extensive experiments across class-conditional ImageNet-256 generation, large-scale text-to-image synthesis, and image editing tasks. Results show that SNCE significantly improves convergence speed and overall generation quality compared to standard cross-entropy objectives.
Community
Effective Optimization Objective for VQ-based discrete image generators.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BitDance: Scaling Autoregressive Generative Models with Binary Tokens (2026)
- High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding (2026)
- Autoregressive Image Generation with Masked Bit Modeling (2026)
- Soft Tail-dropping for Adaptive Visual Tokenization (2026)
- EvoTok: A Unified Image Tokenizer via Residual Latent Evolution for Visual Understanding and Generation (2026)
- PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation (2026)
- NativeTok: Native Visual Tokenization for Improved Image Generation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper