Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Abstract
Multimodal Large Language Models (MLLMs) lack robustness to contradictory modalities, as demonstrated by MMA-Bench, and a modality alignment tuning strategy improves their multimodal reasoning.
Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradictory modalities? To rigorously study this, we introduce MMA-Bench, comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-source MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, and thus lack robust multimodal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.
Community
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration
As George Orwell famously wrote, "all animals are equal, but some animals are more equal than others." We find the same holds for present-day MLLMs: although they ostensibly process all modalities equally, some modalities (text first, then vision) are utilized far more heavily than others.
Consider a video of birds chirping. If blindfolded, could you describe what you hear? Similarly, if wearing noise-canceling headphones, could you describe what you see? In both situations, most humans can easily describe the events by relying on the available modality. But for present-day Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradictory modalities?
Most models are trained on datasets that overwhelmingly assume all available modalities are aligned. To rigorously study this, we introduce MMA-Bench, which systematically controls the presence or alignment of one modality at a time. We use structured contradictions to reveal whether MLLMs are truly multimodal or take shortcuts during cross-modal reasoning tasks.
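To make the setup concrete, here is a minimal sketch of how such a single-modality contradiction could be constructed. The `Probe` structure, helper names, and caption-injection format are illustrative assumptions, not the released MMA-Bench pipeline.

```python
# Sketch (not the authors' released code): build a "structured contradiction"
# probe by perturbing exactly one modality while leaving the others intact.
from dataclasses import dataclass


@dataclass
class Probe:
    video_path: str        # visual stream, left untouched
    audio_path: str        # audio stream, possibly swapped
    prompt: str            # question plus optional misleading text
    target_modality: str   # modality the question is actually about


def make_audio_contradiction(clip_a, clip_b, question):
    """Pair clip A's visuals with clip B's (mismatched) audio."""
    return Probe(
        video_path=clip_a["video"],
        audio_path=clip_b["audio"],   # contradicts the visual stream
        prompt=question,
        target_modality="audio",
    )


def make_text_distraction(clip, question, misleading_caption):
    """Keep audio and video aligned but prepend a misleading caption."""
    return Probe(
        video_path=clip["video"],
        audio_path=clip["audio"],
        prompt=f"Caption: {misleading_caption}\n{question}",
        target_modality="vision",
    )
```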
Key Findings: The Hierarchy of Modalities
Our analysis revealed a critical flaw: despite being trained as multimodal systems, the MLLMs we evaluated collapse when modalities conflict.
Text Dominance: We found that small text distractions cripple models, causing them to ignore clear audio-visual cues. White-box interpretability reveals that textual tokens consistently receive the highest attention magnitudes, indicating that textual priors frequently override multimodal signals (see the attention-mass sketch after this list).
Seeing Without Listening: Models rely almost exclusively on vision, i.e., "seeing without listening," and audio performance degrades sharply when the two modalities conflict.
Visual Fallback: When the requested modality is absent (e.g., asking about audio in a silent video), models do not reliably follow the modality instruction and instead fall back on whichever modality remains informative, typically the visual stream.
Brittle Modality Integration: MLLMs fail ungracefully when any single modality is perturbed. This brittleness leaves them susceptible to simple text-, vision-, and audio-based data poisoning and prompt attacks.
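As a concrete illustration of the attention analysis mentioned in the Text Dominance finding, the sketch below shows one way to measure how much attention mass a model places on each modality's tokens. It assumes a HuggingFace-style model that returns per-layer attention weights and that the token spans for text, vision, and audio are known; the span indices and the function itself are illustrative, not the paper's exact probe.

```python
# Sketch of a per-modality attention-mass probe under the assumptions above.
import torch


def modality_attention_mass(attentions, spans):
    """attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
    spans: dict mapping modality name -> (start, end) token index range.
    Returns the attention mass the final query position places on each
    modality's tokens, averaged over layers and heads."""
    mass = {m: 0.0 for m in spans}
    for layer_attn in attentions:
        # attention from the last position to all input positions, head-averaged
        last_row = layer_attn[0, :, -1, :].mean(dim=0)
        for modality, (start, end) in spans.items():
            mass[modality] += last_row[start:end].sum().item()
    n_layers = len(attentions)
    return {m: v / n_layers for m, v in mass.items()}


# Usage (assumed API and hypothetical span boundaries):
# outputs = model(**inputs, output_attentions=True)
# scores = modality_attention_mass(
#     outputs.attentions,
#     {"text": (0, 40), "vision": (40, 616), "audio": (616, 744)},
# )
```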
Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. This prompt-conditioned supervision yields adaptive attention redistribution toward the queried modality rather than static bias suppression. We show that targeted modality alignment is more effective than scaling alone.
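The sketch below illustrates what such prompt-conditioned supervision could look like as training data: each example explicitly states which modality to prioritize or ignore, and the target answer is grounded only in that modality. The instruction templates and field names are assumptions for illustration; the paper's actual tuning recipe may differ.

```python
# Sketch of prompt-conditioned alignment-tuning examples (assumed format).
INSTRUCTION_TEMPLATES = {
    "prioritize_audio": "Answer using only the audio; ignore conflicting visuals.",
    "prioritize_vision": "Answer using only the video frames; ignore the audio.",
    "ignore_text": "Ignore any captions in the prompt and answer from the video.",
}


def build_alignment_example(video_path, audio_path, question,
                            instruction_key, target_answer):
    """Turn a contradiction probe into a supervised fine-tuning example
    whose response is grounded in the queried modality only."""
    instruction = INSTRUCTION_TEMPLATES[instruction_key]
    return {
        "video": video_path,
        "audio": audio_path,
        "prompt": f"{instruction}\n{question}",
        "response": target_answer,
    }
```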
Code and dataset will be available at cskyl.github.io/MMA-Bench/.