---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: peft
pipeline_tag: image-text-to-text
tags:
- vision
- vqa
- qwen2.5-vl
- lora
- transformers
license: apache-2.0
---

# VQA Base Model

A VQA model fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA.

**Performance:**
- **Validation Accuracy: 88.69%** (345/389)
- **High-res (512px) Accuracy: 89.72%** (349/389)
- Baseline model for the project

**Part of 3-Model Ensemble:**
- Combined with Improved Epoch 1 and Improved Epoch 2
- **Ensemble Validation: 90.75%**
- **Ensemble Test (Kaggle): 91.82%**

## Model Details

- **Base Model:** Qwen/Qwen2.5-VL-3B-Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **Quantization:** 4-bit (NF4)
- **Hardware:** NVIDIA A100 40GB
- **Training:** Fine-tuned on a VQA dataset (604 samples)

## LoRA Configuration

```python
{
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
}
```

## Usage

```python
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit (NF4) quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "ikellllllll/vqa-base-model")

# Processor pinned to 512x512 input resolution
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=512*512,
    max_pixels=512*512,
    trust_remote_code=True
)

# IMPORTANT: use left-padding for decoder-only models
processor.tokenizer.padding_side = 'left'
```

A batched-inference sketch that continues from this snippet is included at the end of this card.

## Inference Settings

- **Image Resolution:** 512×512 px (higher resolution recommended)
- **Batch Size:** 32 (for A100 40GB)
- **Padding:** Left-padding (critical for decoder-only models!)

## Dataset

- **Training:** 604 VQA samples
- **Validation:** 389 VQA samples
- **Test:** 3,887 VQA samples

## Performance Notes

- 384px resolution: 88.69% validation accuracy
- 512px resolution: 89.72% validation accuracy (+1.03 points)
- **Higher input resolution improves validation accuracy (4 additional correct answers at 512 px)**

## Links

- **GitHub Repository:** [SSAFY_AI_competition](https://github.com/ikellllllll/SSAFY_AI_competition)
- **Related Models:**
  - [vqa-improved-epoch1](https://huggingface.co/ikellllllll/vqa-improved-epoch1) (90.49%)
  - [vqa-improved-epoch2](https://huggingface.co/ikellllllll/vqa-improved-epoch2) (90.23%)

## Citation

```bibtex
@misc{vqa-base-model,
  author       = {Team 203},
  title        = {VQA Base Model},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ikellllllll/vqa-base-model}}
}
```

## License

Apache 2.0
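
## Example: Building the LoRA Config with PEFT

For reference, a minimal sketch of how the hyperparameters in the LoRA Configuration section map onto `peft.LoraConfig`. The `bias` and `task_type` values are assumptions for illustration, not settings confirmed by the training script (see the GitHub repository for the actual code).

```python
from peft import LoraConfig

# Sketch only: r / lora_alpha / lora_dropout / target_modules come from this card;
# bias and task_type are assumed defaults, not confirmed by the training script.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```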
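
## Example: Batched Inference (sketch)

A minimal batched-inference sketch that continues from the Usage snippet above, relying on the left-padding and 512×512 processor settings already configured there. The image paths, questions, and `max_new_tokens` value are illustrative assumptions, not the competition's actual inference code.

```python
from PIL import Image
import torch

# Illustrative inputs; replace with your own images and questions.
images = [Image.open(p).convert("RGB") for p in ["example1.jpg", "example2.jpg"]]
questions = ["What color is the car?", "How many people are in the image?"]

# Build one chat-formatted prompt per sample via the processor's chat template.
texts = []
for q in questions:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": q},
        ],
    }]
    texts.append(processor.apply_chat_template(messages, add_generation_prompt=True))

# Left-padding (set in Usage) right-aligns the prompts so generation starts cleanly
# for every row in the batch.
inputs = processor(text=texts, images=images, padding=True, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Drop the prompt tokens before decoding so only the generated answers remain.
answers = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answers)
```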