Multimodal
Qwen2.5-Omni Technical Report (arXiv:2503.20215) • 172 upvotes
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO (arXiv:2505.22453) • 46 upvotes
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning (arXiv:2505.23380) • 22 upvotes
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models (arXiv:2505.21523) • 13 upvotes
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces (arXiv:2506.00123) • 35 upvotes
Aligning Latent Spaces with Flow Priors (arXiv:2506.05240) • 27 upvotes
Discrete Diffusion in Large Language and Multimodal Models: A Survey (arXiv:2506.13759) • 43 upvotes
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation (arXiv:2506.14028) • 93 upvotes
Show-o2: Improved Native Unified Multimodal Models (arXiv:2506.15564) • 29 upvotes
OmniGen2: Exploration to Advanced Multimodal Generation (arXiv:2506.18871) • 78 upvotes
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv:2506.17218) • 29 upvotes
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation (arXiv:2506.17202) • 10 upvotes
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context (arXiv:2506.21277) • 14 upvotes
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning (arXiv:2507.01006) • 252 upvotes
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents (arXiv:2507.04590) • 17 upvotes
Robust Multimodal Large Language Models Against Modality Conflict (arXiv:2507.07151) • 6 upvotes
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models (arXiv:2507.07104) • 46 upvotes
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers (arXiv:2507.10787) • 13 upvotes
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning (arXiv:2507.22607) • 47 upvotes
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation (arXiv:2508.03320) • 63 upvotes
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation (arXiv:2508.18032) • 41 upvotes
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (arXiv:2508.18265) • 216 upvotes
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning (arXiv:2508.20751) • 90 upvotes
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning (arXiv:2509.01644) • 34 upvotes
Visual Representation Alignment for Multimodal Large Language Models (arXiv:2509.07979) • 84 upvotes
Reconstruction Alignment Improves Unified Multimodal Models (arXiv:2509.07295) • 40 upvotes
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer (arXiv:2509.16197) • 58 upvotes
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation (arXiv:2509.18824) • 23 upvotes
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training (arXiv:2509.26625) • 43 upvotes
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models (arXiv:2509.25848) • 81 upvotes
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance (arXiv:2509.26231) • 18 upvotes
Self-Improvement in Multimodal Large Language Models: A Survey (arXiv:2510.02665) • 21 upvotes
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning (arXiv:2511.06805) • 13 upvotes
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation (arXiv:2511.12207) • 10 upvotes
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark (arXiv:2511.17729) • 17 upvotes
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward (arXiv:2511.20561) • 33 upvotes
Architecture Decoupling Is Not All You Need For Unified Multimodal Model (arXiv:2511.22663) • 29 upvotes
MMGR: Multi-Modal Generative Reasoning (arXiv:2512.14691) • 121 upvotes
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices (arXiv:2512.14052) • 42 upvotes
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation (arXiv:2601.02204) • 63 upvotes
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (arXiv:2601.10129) • 12 upvotes
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (arXiv:2601.16973) • 40 upvotes
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods (arXiv:2601.21821) • 61 upvotes
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models (arXiv:2601.22060) • 155 upvotes
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models (arXiv:2602.04804) • 48 upvotes
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models (arXiv:2602.07026) • 140 upvotes
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device (arXiv:2602.20161) • 23 upvotes
Beyond Language Modeling: An Exploration of Multimodal Pretraining (arXiv:2603.03276) • 93 upvotes
Mario: Multimodal Graph Reasoning with Large Language Models (arXiv:2603.05181) • 8 upvotes
Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs (arXiv:2603.09095) • 28 upvotes