We introduce Awesome Multimodal Modeling, a curated repository tracing the architectural evolution of multimodal intelligence, from foundational fusion to native omni-models.
🔹 Taxonomy & Evolution:
- **Traditional Multimodal Learning**: foundational work on representation, fusion, and alignment.
- **Multimodal LLMs (MLLMs)**: architectures connecting vision encoders to LLMs for understanding.
- **Unified Multimodal Models (UMMs)**: models unifying understanding and generation via diffusion, autoregressive, or hybrid paradigms.
- **Native Multimodal Models (NMMs)**: models trained from scratch on all modalities; contrasts early vs. late fusion under scaling laws.

💡 Key Distinction: UMMs unify tasks via generation heads, while NMMs enforce modality interleaving through joint pre-training.
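The MLLM pattern in the taxonomy (a vision encoder bridged to an LLM through a learned connector) can be sketched in a few lines. This is a minimal NumPy illustration of the data flow only; all dimension sizes and variable names are assumptions for this sketch, not any specific model's configuration.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
D_VISION, D_LLM = 768, 4096    # vision-encoder and LLM hidden sizes
N_PATCHES, N_TOKENS = 196, 16  # image patches and text tokens

rng = np.random.default_rng(0)

# A frozen vision encoder would emit one feature vector per image patch.
patch_features = rng.standard_normal((N_PATCHES, D_VISION))

# The "connector": here a single learned linear projection that maps
# vision features into the LLM's embedding space.
W_proj = rng.standard_normal((D_VISION, D_LLM)) * 0.02
visual_tokens = patch_features @ W_proj  # shape (N_PATCHES, D_LLM)

# Text tokens are embedded by the LLM as usual (stand-in values here).
text_embeddings = rng.standard_normal((N_TOKENS, D_LLM))

# The projected visual tokens are prepended to the text sequence and the
# combined sequence is processed jointly by the LLM's transformer layers.
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(llm_input.shape)  # (212, 4096)
```

The same interface also marks the UMM/NMM contrast: a UMM attaches extra generation heads to an architecture like this, whereas an NMM trains a single model on interleaved multimodal token streams from the start.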