Qwen2-VL-7B Traffic Detection LoRA

Fine-tuned LoRA adapter for traffic and urban scene understanding using the Sculptor Method (inverse masking strategy).

About This Project

This LoRA adapter was developed as part of a comprehensive pipeline for video frame extraction, processing, and fine-tuning of Vision-Language Models (Qwen2-VL), built for the "Team Project" course at Technische Hochschule Ingolstadt under the supervision of lecturer Marion Neumier. The project demonstrates the application of LoRA (Low-Rank Adaptation) techniques to the UrbanIng-V2X dataset for cooperative perception tasks.

Model Details

  • Base Model: Qwen/Qwen2-VL-7B-Instruct
  • Method: LoRA (Low-Rank Adaptation)
  • LoRA Rank: 32
  • LoRA Alpha: 64
  • Quantization: 4-bit (NF4)
  • Training Strategy: Sculptor Method (Inverse Masking)
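
For reference, these settings map onto a PEFT LoraConfig roughly as sketched below. The target_modules list is an assumption (a common choice for Qwen2-VL LoRA fine-tuning) and is not stated in this card.

from peft import LoraConfig

# Sketch of the adapter configuration listed above (not the original training script).
lora_config = LoraConfig(
    r=32,                    # LoRA rank
    lora_alpha=64,           # LoRA alpha
    lora_dropout=0.05,       # matches the Training Hyperparameters section
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not confirmed
)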

Training Hyperparameters

  • Learning Rate: 2e-4
  • Batch Size: 2 per device
  • Gradient Accumulation: 8 steps
  • Effective Batch Size: 16
  • Epochs: 20
  • Optimizer: PagedAdamW 8-bit
  • Scheduler: Cosine with warmup (3%)
  • LoRA Dropout: 0.05
  • Weight Decay: 0.01
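
A minimal sketch of these values as transformers TrainingArguments is shown below; output_dir and fp16 are assumptions, and the actual training script may differ.

from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
# Effective batch size = 2 (per device) x 8 (gradient accumulation) = 16 on one GPU.
training_args = TrainingArguments(
    output_dir="./qwen2vl-traffic-lora",  # assumed path
    learning_rate=2e-4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=20,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    fp16=True,                            # assumed, consistent with the float16 compute dtype used at inference
)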

Performance

This is the best model from training with a random data split strategy, achieving high performance on traffic detection and urban scene understanding tasks.

Usage

IMPORTANT: This repository contains only the LoRA adapter weights. You must load it on top of the base Qwen2-VL-7B-Instruct model.

Step 1: Download the Adapter

You can download the adapter using one of these methods:

Method 1: Using Hugging Face CLI

huggingface-cli download muk0644/Urban-Traffic-Qwen2-VL3

Method 2: Using Git Clone

# HTTPS
git clone https://huggingface.co/muk0644/Urban-Traffic-Qwen2-VL3

# SSH
git clone [email protected]:muk0644/Urban-Traffic-Qwen2-VL3

Method 3: Using Python (Recommended)

from huggingface_hub import snapshot_download

repo_id = "muk0644/Urban-Traffic-Qwen2-VL3"
local_folder = "./downloaded_adapter"

snapshot_download(repo_id=repo_id, local_dir=local_folder, token=True)
print(f"โœ… Adapter successfully downloaded to: {local_folder}")

Step 2: Installation

pip install transformers peft torch bitsandbytes accelerate qwen-vl-utils

Step 3: Inference with Downloaded Adapter

import os
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# Memory optimization
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Path to your downloaded adapter
ADAPTER_PATH = "./downloaded_adapter"  # Change this to your local path

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)

print("Loading Base Model (4-bit)...")
# Step 1: Load base model
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Step 2: Load adapter on top of base model
print(f"Loading Adapter from: {ADAPTER_PATH}")
model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
model.eval()

# Step 3: Load processor
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    trust_remote_code=True
)

print("โœ… Model ready for inference!")

# Inference example
def analyze_image(image_path):
    question = (
        "Count the visible objects. Output the result strictly in this format: "
        "car=N, van=N, pedestrian=N, truck=N, trailer=N, bus=N, cyclist=N, other=N. "
        "Use 0 if an object is not present."
    )
    
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]
    
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt"
    ).to(model.device)
    
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=128)
    
    output_text = processor.batch_decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    
    return output_text.split("assistant\n")[-1].strip()

# Run inference
image_path = "path/to/your/image.jpg"
result = analyze_image(image_path)
print(f"Result: {result}")

Dataset

Trained on the UrbanIng-V2X dataset, a large-scale multi-vehicle, multi-infrastructure dataset for cooperative perception. The training data includes traffic and urban scene images for:

  • Traffic scene description
  • Vehicle detection and counting
  • Road condition analysis
  • Urban environment understanding
  • Safety assessment
  • Cooperative perception scenarios
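
As an illustration, a supervised sample for the counting task could look like the conversation below. The structure and image path are assumptions based on the inference prompt in this card, not an excerpt from the actual dataset.

# Hypothetical training sample (structure assumed, values illustrative).
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "frames/intersection_000123.jpg"},  # assumed path
                {"type": "text", "text": "Count the visible objects. Output the result strictly in this format: "
                                         "car=N, van=N, pedestrian=N, truck=N, trailer=N, bus=N, cyclist=N, other=N. "
                                         "Use 0 if an object is not present."},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "car=4, van=1, pedestrian=2, truck=0, trailer=0, bus=1, cyclist=0, other=0"}],
        },
    ]
}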

Training Methodology

This model uses the Sculptor Method with an inverse masking strategy:

  • Vision encoder frozen during training
  • Language model trained with targeted masking
  • Improved visual-language alignment
  • Enhanced performance on complex visual reasoning
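
The training script is not published here, but the first two points above (frozen vision encoder, masked language-model loss) can be sketched roughly as follows. The .visual attribute name and the prompt-length bookkeeping are assumptions, not the author's code.

import torch

def apply_sculptor_masking(model, input_ids, prompt_lengths):
    """Rough sketch of the inverse masking setup described above."""
    # 1) Freeze the vision encoder so only the language side receives LoRA updates.
    for param in model.visual.parameters():  # ".visual" is assumed to be the vision tower
        param.requires_grad = False

    # 2) Inverse masking of labels: set prompt positions to -100 so the
    #    cross-entropy loss is computed only on the target answer tokens.
    labels = input_ids.clone()
    for i, prompt_len in enumerate(prompt_lengths):
        labels[i, :prompt_len] = -100
    return labels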

Citation

If you use this model in your research, please cite:

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@inproceedings{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J and others},
  booktitle={International Conference on Learning Representations},
  year={2022}
}

@misc{urbaningv2x2025,
  title={UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset for Cooperative Perception},
  author={Sekaran, Karthikeyan Chandra and others},
  year={2025},
  eprint={2510.23478},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Framework Versions

  • PEFT: 0.18.0+
  • Transformers: 4.45.0+
  • PyTorch: 2.0.0+
  • Python: 3.8+
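
If you want to enforce these as lower bounds, a single install command along the lines below should work; bitsandbytes and accelerate are added here because the 4-bit inference example needs them, even though they are not listed in this section.

pip install "peft>=0.18.0" "transformers>=4.45.0" "torch>=2.0.0" bitsandbytes accelerate qwen-vl-utils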

License

This model adapter is released under the Apache 2.0 license. The base Qwen2-VL model has its own license terms.

Contributors

  1. Muhammad Shariq Khan
  2. Akshat Arage
  3. Smit Bhenjaliya
  4. Abdullah Naim
  5. Pavankumar Ginkala
  6. Sai Muddu

Acknowledgments

Based on Qwen2-VL by Alibaba Cloud.
