Vision
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper • 2506.22434 • Published • 10

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Paper • 2507.13348 • Published • 79

RewardDance: Reward Scaling in Visual Generation
Paper • 2509.08826 • Published • 73

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Paper • 2510.18876 • Published • 37

Back to Basics: Let Denoising Generative Models Denoise
Paper • 2511.13720 • Published • 70

Diversity Has Always Been There in Your Visual Autoregressive Models
Paper • 2511.17074 • Published • 8

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Paper • 2511.20256 • Published • 28

World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Paper • 2511.22787 • Published • 10

InfiniDepth: Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields
Paper • 2601.03252 • Published • 103

DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation
Paper • 2601.22904 • Published • 15

Unified Personalized Reward Model for Vision Generation
Paper • 2602.02380 • Published • 20

Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Paper • 2602.11858 • Published • 59

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
Paper • 2602.10809 • Published • 57

PyVision-RL: Forging Open Agentic Vision Models via RL
Paper • 2602.20739 • Published • 31

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs
Paper • 2603.09095 • Published • 28

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Paper • 2603.12267 • Published • 13

WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
Paper • 2603.11593 • Published • 24