We present Matryoshka Multimodal Models (M3), which represents visual tokens in a nested manner following the coarse-to-fine order. Now users can explicitly control the visual granularity per test instance during inference! It will be great to host this model in huggingface! @akhaliq