Collections
Discover the best community collections!
Collections including paper arxiv:2407.12594
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
  Paper • 2401.13313 • Published • 5
- BAAI/Bunny-v1_0-4B
  Text Generation • 4B • Updated • 63 • 10
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 103
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever
  Paper • 2405.20204 • Published • 37

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- rishiraj/CatPPT-base
  Text Generation • 7B • Updated • 3.64k • 48
- quantumaikr/quantum-dpo-v0.1
  Text Generation • 7B • Updated • 975 • 2
- TheBloke/zephyr-7B-beta-GGUF
  7B • Updated • 3.31k • 230
- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
  Paper • 2407.12594 • Published • 19

- RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
  Paper • 2403.00483 • Published • 15
- OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
  Paper • 2403.01779 • Published • 30
- Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
  Paper • 2401.11605 • Published • 23
- FiT: Flexible Vision Transformer for Diffusion Model
  Paper • 2402.12376 • Published • 48

- DocGraphLM: Documental Graph Language Model for Information Extraction
  Paper • 2401.02823 • Published • 36
- Understanding LLMs: A Comprehensive Overview from Training to Inference
  Paper • 2401.02038 • Published • 65
- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 189
- Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
  Paper • 2309.01131 • Published • 1