Qwen2-VL-7B Traffic Detection LoRA

Fine-tuned LoRA adapter for traffic and urban scene understanding using the Sculptor Method (inverse masking strategy).

About This Project

This LoRA adapter was developed as part of a comprehensive pipeline for video frame extraction, processing, and fine-tuning of Vision-Language Models (Qwen2-VL), built for the "Team Project" course at Technische Hochschule Ingolstadt under the supervision of lecturer Marion Neumier. The project demonstrates the application of LoRA (Low-Rank Adaptation) techniques to the UrbanIng-V2X dataset for cooperative perception tasks.

Model Details

  • Base Model: Qwen/Qwen2-VL-7B-Instruct
  • Method: LoRA (Low-Rank Adaptation)
  • LoRA Rank: 32
  • LoRA Alpha: 64
  • Quantization: 4-bit (NF4)
  • Training Strategy: Sculptor Method (Inverse Masking)
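
For reference, these settings map onto a PEFT LoraConfig roughly as sketched below. The target_modules list is an assumption (a common choice for Qwen2-VL LoRA fine-tuning) and is not stated in this card.

from peft import LoraConfig

# Sketch of the adapter configuration listed above (not the original training script).
lora_config = LoraConfig(
    r=32,                    # LoRA rank
    lora_alpha=64,           # LoRA alpha
    lora_dropout=0.05,       # matches the Training Hyperparameters section
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not confirmed
)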

Training Hyperparameters

  • Learning Rate: 2e-4
  • Batch Size: 2 per device
  • Gradient Accumulation: 8 steps
  • Effective Batch Size: 16
  • Epochs: 20
  • Optimizer: PagedAdamW 8-bit
  • Scheduler: Cosine with warmup (3%)
  • LoRA Dropout: 0.05
  • Weight Decay: 0.01
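
A minimal sketch of these values as transformers TrainingArguments is shown below; output_dir and fp16 are assumptions, and the actual training script may differ.

from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
# Effective batch size = 2 (per device) x 8 (gradient accumulation) = 16 on one GPU.
training_args = TrainingArguments(
    output_dir="./qwen2vl-traffic-lora",  # assumed path
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=20,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    fp16=True,                            # assumed, consistent with the float16 compute dtype used at inference
)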

Performance

This is the best model from training with a random data split strategy, achieving high performance on traffic detection and urban scene understanding tasks.

Usage

IMPORTANT: This repository contains only the LoRA adapter weights. You must load it on top of the base Qwen2-VL-7B-Instruct model.

Step 1: Download the Adapter

You can download the adapter using one of these methods:

Method 1: Using Hugging Face CLI

huggingface-cli download muk0644/Urban-Traffic-Qwen2-VL3

Method 2: Using Git Clone

# HTTPS
git clone https://huggingface.co/muk0644/Urban-Traffic-Qwen2-VL3

# SSH
git clone [email protected]:muk0644/Urban-Traffic-Qwen2-VL3

Method 3: Using Python (Recommended)

from huggingface_hub import snapshot_download

repo_id = "muk0644/Urban-Traffic-Qwen2-VL3"
local_folder = "./downloaded_adapter"

snapshot_download(repo_id=repo_id, local_dir=local_folder, token=True)
print(f"โœ… Adapter successfully downloaded to: {local_folder}")

Step 2: Installation

pip install transformers peft torch bitsandbytes accelerate qwen-vl-utils

Step 3: Inference with Downloaded Adapter

import os
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# Memory optimization
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Path to your downloaded adapter
ADAPTER_PATH = "./downloaded_adapter"  # Change this to your local path

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

print("Loading Base Model (4-bit)...")
# Step 1: Load base model
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Step 2: Load adapter on top of base model
print(f"Loading Adapter from: {ADAPTER_PATH}")
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()

# Step 3: Load processor
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    trust_remote_code=True
)

print("โœ… Model ready for inference!")

# Inference example
def analyze_image(image_path):
    question = (
        "Count the visible objects. Output the result strictly in this format: "
        "car=N, van=N, pedestrian=N, truck=N, trailer=N, bus=N, cyclist=N, other=N. "
        "Use 0 if an object is not present."
    )
    
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]
    
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)
    
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128)
    
    output_text = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    
    return output_text.split("assistant\n")[-1].strip()

# Run inference
image_path = "path/to/your/image.jpg"
result = analyze_image(image_path)
print(f"Result: {result}")

Dataset

Trained on the UrbanIng-V2X dataset, a large-scale multi-vehicle, multi-infrastructure dataset for cooperative perception. The training data includes traffic and urban scene images for:

  • Traffic scene description
  • Vehicle detection and counting
  • Road condition analysis
  • Urban environment understanding
  • Safety assessment
  • Cooperative perception scenarios
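
As an illustration, a supervised sample for the counting task could look like the conversation below. The structure and image path are assumptions based on the inference prompt in this card, not an excerpt from the actual dataset.

# Hypothetical training sample (structure assumed, values illustrative).
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "frames/intersection_000123.jpg"},  # assumed path
                {"type": "text", "text": "Count the visible objects. Output the result strictly in this format: "
                                         "car=N, van=N, pedestrian=N, truck=N, trailer=N, bus=N, cyclist=N, other=N. "
                                         "Use 0 if an object is not present."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "car=4, van=1, pedestrian=2, truck=0, trailer=0, bus=1, cyclist=0, other=0"}],
        },
    ]
}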

Training Methodology

This model uses the Sculptor Method with an inverse masking strategy:

  • Vision encoder frozen during training
  • Language model trained with targeted masking
  • Improved visual-language alignment
  • Enhanced performance on complex visual reasoning
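
The training script is not published here, but the first two points above (frozen vision encoder, masked language-model loss) can be sketched roughly as follows. The .visual attribute name and the prompt-length bookkeeping are assumptions, not the author's code.

import torch

def apply_sculptor_masking(model, input_ids, prompt_lengths):
    """Rough sketch of the inverse masking setup described above."""
    # 1) Freeze the vision encoder so only the language side receives LoRA updates.
    for param in model.visual.parameters():  # ".visual" is assumed to be the vision tower
        param.requires_grad = False

    # 2) Inverse masking of labels: set prompt positions to -100 so the
    #    cross-entropy loss is computed only on the target answer tokens.
    labels = input_ids.clone()
    for i, prompt_len in enumerate(prompt_lengths):
        labels[i, :prompt_len] = -100
    return labels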

Citation

If you use this model in your research, please cite:

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@inproceedings{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J and others},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

@misc{urbaningv2x2025,
  title={UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset for Cooperative Perception},
  author={Sekaran, Karthikeyan Chandra and others},
  year={2025},
  eprint={2510.23478},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Framework Versions

  • PEFT: 0.18.0+
  • Transformers: 4.45.0+
  • PyTorch: 2.0.0+
  • Python: 3.8+
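
If you want to enforce these as lower bounds, a single install command along the lines below should work; bitsandbytes and accelerate are added here because the 4-bit inference example needs them, even though they are not listed in this section.

pip install "peft>=0.18.0" "transformers>=4.45.0" "torch>=2.0.0" bitsandbytes accelerate qwen-vl-utils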

License

This model adapter is released under the Apache 2.0 license. The base Qwen2-VL model has its own license terms.

Contributors

  1. Muhammad Shariq Khan
  2. Akshat Arage
  3. Smit Bhenjaliya
  4. Abdullah Naim
  5. Pavankumar Ginkala
  6. Sai Muddu

Acknowledgments

Based on Qwen2-VL by Alibaba Cloud.
