Purpose: Multimodal models that process text, images, and audio together.
- llava-hf/llava-1.5-7b-hf
  Image-Text-to-Text • 7B • 2.84M • 354
- Salesforce/blip-image-captioning-base
  Image-to-Text • 2.4M • 847
- google/pix2struct-base
  Image-to-Text • 0.3B • 2.54k • 79
- microsoft/kosmos-2-patch14-224
  Image-to-Text • 159k • 184
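
As a rough illustration of how one of the listed models might be used, the sketch below captions an image with Salesforce/blip-image-captioning-base through the Hugging Face transformers library. It assumes transformers, torch, and Pillow are installed and that the model can be downloaded or is already cached. The helper names load_blip and caption_image are illustrative only and are not part of any library.

```python
import sys

MODEL_ID = "Salesforce/blip-image-captioning-base"  # taken from the list above


def load_blip(model_id=MODEL_ID):
    """Load the BLIP processor and model (downloads on first use)."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id)
    return processor, model


def caption_image(image, processor, model, max_new_tokens=30):
    """Return a caption string for a single PIL image."""
    # The processor resizes and normalizes the image into tensors.
    inputs = processor(images=image, return_tensors="pt")
    # Generate caption token ids, then decode them back to text.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()


if __name__ == "__main__" and len(sys.argv) > 1:
    from PIL import Image

    # Image path is given as the first command-line argument.
    image = Image.open(sys.argv[1]).convert("RGB")
    processor, model = load_blip()
    print(caption_image(image, processor, model))
```

Loading is kept separate from inference so the model is downloaded once and reused across images. The other entries in the list use different processor and model classes, so this sketch should not be expected to work for them unchanged.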