---
base_model: Qwen/Qwen2.5-VL-3B-Instruct
library_name: peft
pipeline_tag: image-text-to-text
tags:
- vision
- vqa
- qwen2.5-vl
- lora
- transformers
license: apache-2.0
---

# VQA Base Model

Fine-tuned VQA model using Qwen2.5-VL-3B-Instruct with LoRA.

**Performance:**
- **Validation Accuracy: 88.69%** (345/389)
- **High-res (512px) Accuracy: 89.72%** (349/389)
- Baseline model for the project

**Part of 3-Model Ensemble:**
- Combined with Improved Epoch 1 and Improved Epoch 2
- **Ensemble Validation: 90.75%**
- **Ensemble Test (Kaggle): 91.82%**

## Model Details

- **Base Model:** Qwen/Qwen2.5-VL-3B-Instruct
- **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- **Quantization:** 4-bit (NF4)
- **Hardware:** NVIDIA A100 40GB
- **Training:** Fine-tuned on a VQA dataset of 604 samples

## LoRA Configuration

```python
{
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
}
```
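
For reference, a minimal sketch of how these settings map onto a `peft.LoraConfig` (the `task_type` value is an assumption; the exact training script lives in the GitHub repository linked below):

```python
from peft import LoraConfig

# Mirrors the adapter settings above; task_type is assumed to be causal LM,
# the usual choice when fine-tuning a decoder-only VLM.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```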

## Usage

```python
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model with 4-bit (NF4) quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Attach the LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, "ikellllllll/vqa-base-model")

# Processor pinned to a 512x512 pixel budget (see Performance Notes)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=512*512,
    max_pixels=512*512,
    trust_remote_code=True
)

# IMPORTANT: set left-padding for decoder-only models
processor.tokenizer.padding_side = 'left'
```
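
Once the model and processor are loaded, a single question can be answered with the standard chat-template flow. A minimal sketch (the image path and question below are placeholders, not part of the dataset):

```python
from PIL import Image

image = Image.open("example.jpg")      # placeholder image path
question = "What color is the car?"    # placeholder question

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens (everything after the prompt)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```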

## Inference Settings

- **Image Resolution:** 512×512px (higher resolution recommended; see Performance Notes)
- **Batch Size:** 32 (fits an A100 40GB)
- **Padding:** Left-padding (critical for decoder-only models; see the batched sketch below)
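
These settings matter mainly for batched generation: with right-padding, a decoder-only model would start generating after pad tokens for the shorter prompts in a batch. A rough sketch of a batched loop under those assumptions (`samples` is a hypothetical list of `(PIL image, prompt)` pairs built with the chat template as in the Usage example):

```python
batch_size = 32  # sized for an A100 40GB, per the settings above
predictions = []

for start in range(0, len(samples), batch_size):
    batch = samples[start:start + batch_size]
    images = [img for img, _ in batch]
    prompts = [prompt for _, prompt in batch]

    # padding=True pads the shorter prompts; with padding_side='left' (set in Usage),
    # every sequence ends right at the generation boundary.
    inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)

    answers = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    predictions.extend(a.strip() for a in answers)
```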

## Dataset

- **Training:** 604 VQA samples
- **Validation:** 389 VQA samples
- **Test:** 3,887 VQA samples

## Performance Notes

- 384px resolution: 88.69% validation accuracy
- 512px resolution: 89.72% validation accuracy (+1.03%)
- **Raising the input resolution from 384px to 512px improves validation accuracy by about one point** (see the snippet below)
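
The input resolution is set through the processor's pixel budget (`min_pixels`/`max_pixels`). A sketch of the two configurations compared above, using square budgets as in the Usage section:

```python
# 384px budget: 88.69% validation accuracy in the notes above
processor_384 = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=384 * 384, max_pixels=384 * 384,
    trust_remote_code=True,
)

# 512px budget: 89.72% validation accuracy, the recommended setting
processor_512 = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    min_pixels=512 * 512, max_pixels=512 * 512,
    trust_remote_code=True,
)
```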

## Links

- **GitHub Repository:** [SSAFY_AI_competition](https://github.com/ikellllllll/SSAFY_AI_competition)
- **Related Models:**
  - [vqa-improved-epoch1](https://huggingface.co/ikellllllll/vqa-improved-epoch1) (90.49%)
  - [vqa-improved-epoch2](https://huggingface.co/ikellllllll/vqa-improved-epoch2) (90.23%)

## Citation

```bibtex
@misc{vqa-base-model,
  author = {Team 203},
  title = {VQA Base Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ikellllllll/vqa-base-model}}
}
```

## License

Apache 2.0