SigLIP2-Giant Fine-tuned for Animal Identification

A SigLIP2-Giant model fine-tuned for individual animal identification, specialized in telling individual cats and dogs apart. The model produces image embeddings optimized for pet recognition, re-identification, and verification tasks.

Model Details

Base Model: google/siglip2-giant-opt-patch16-384
Input: Images (384x384)
Output: Image embeddings (1536-dimensional)
Task: Individual animal identification and verification

Training Data

The model was trained on a comprehensive dataset combining multiple sources:

PetFace Dataset: Large-scale animal face dataset with 257,484 unique individuals across 13 animal families
Dogs-World: Kaggle dataset for dog breed and individual identification
LCW (Labeled Cats in the Wild): Cat identification dataset
Web-scraped Data: Additional curated images from various sources

Total Dataset Statistics:

1,904,157 total photographs
695,091 unique individual animals (cats and dogs)

Training Details

Training Configuration:

Batch Size: 116 samples (58 unique identities × 2 photos each)
Optimizer: Adam with learning rate 1e-4
Training Duration: 10 epochs
Transfer Learning: Final 5 transformer blocks unfrozen, lower layers frozen to preserve pre-trained features
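
The batch composition above (58 identities with 2 photos each) can be illustrated with a small identity-balanced sampler. This is a minimal sketch and not the authors' data loader: the function name `sample_identity_batch`, the `labels` list, and the choice to skip identities with fewer than 2 photos are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_identity_batch(labels, identities_per_batch=58, photos_per_identity=2, rng=None):
    """Return dataset indices for one batch of P identities x K photos.

    labels[i] is the identity of photo i. Identities with fewer than K
    photos are skipped (ASSUMPTION: the source does not say how such
    identities were handled).
    """
    rng = rng or random.Random()

    # Group photo indices by the animal they belong to.
    photos_by_identity = defaultdict(list)
    for index, identity in enumerate(labels):
        photos_by_identity[identity].append(index)

    eligible = [
        identity
        for identity, indices in photos_by_identity.items()
        if len(indices) >= photos_per_identity
    ]
    if len(eligible) < identities_per_batch:
        raise ValueError("not enough identities with enough photos for one batch")

    # Pick P distinct identities, then K distinct photos of each.
    batch = []
    for identity in rng.sample(eligible, identities_per_batch):
        batch.extend(rng.sample(photos_by_identity[identity], photos_per_identity))
    return batch


# Toy dataset: 100 animals with 3 photos each.
toy_labels = [i // 3 for i in range(300)]
toy_batch = sample_identity_batch(toy_labels, rng=random.Random(0))
print(len(toy_batch))  # 58 identities * 2 photos = 116 indices
```

Sampling pairs this way guarantees that every photo in a batch has exactly one same-identity partner, which is what a pair-based or triplet-based loss needs in order to form positives.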

Loss Function: The model is trained using a combined loss function consisting of:

Triplet Loss (margin α=0.45): Encourages separation between different animal identities
Intra-Pair Variance Regularization (ε=0.01): Promotes consistency across multiple photos of the same animal

Combined as: L_total = 1.0 × L_triplet + 0.5 × L_var

This combination encourages compact embedding clusters for each individual animal while keeping different identities well separated.
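
To make the arithmetic of the combined loss concrete, here is a framework-free sketch for a single triplet. The margin, ε, and the loss weights come from the description above. The distance metric and the exact form of the variance term are not given in this card, so the sketch assumes Euclidean distance on L2-normalized embeddings and a hinge on the squared distance between the two photos of a pair. Actual training code would use batched tensor operations instead.

```python
import math

MARGIN = 0.45                 # triplet margin (alpha) from the configuration above
EPSILON = 0.01                # variance tolerance (epsilon) from the configuration above
W_TRIPLET, W_VAR = 1.0, 0.5   # weights from L_total


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=MARGIN):
    # ASSUMPTION: Euclidean distance on L2-normalized embeddings.
    d_pos = math.sqrt(squared_distance(anchor, positive))
    d_neg = math.sqrt(squared_distance(anchor, negative))
    # Zero once the negative is at least `margin` farther away than the positive.
    return max(0.0, d_pos - d_neg + margin)


def intra_pair_variance(anchor, positive, epsilon=EPSILON):
    # ASSUMPTION: hypothetical form of L_var. It penalizes the squared distance
    # between two photos of the same animal only beyond the tolerance epsilon.
    return max(0.0, squared_distance(anchor, positive) - epsilon)


def total_loss(anchor, positive, negative):
    return (W_TRIPLET * triplet_loss(anchor, positive, negative)
            + W_VAR * intra_pair_variance(anchor, positive))


# Unit-length toy embeddings in 2D.
anchor, positive = [1.0, 0.0], [0.8, 0.6]
easy_negative, hard_negative = [0.0, 1.0], [0.8, -0.6]
print(total_loss(anchor, positive, easy_negative))  # triplet term is zero here
print(total_loss(anchor, positive, hard_negative))  # both terms contribute
```

With the easy negative, only the variance term is active, so the loss keeps pulling the two photos of the same animal together even after the triplet constraint is satisfied.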

Performance Metrics

The model has been benchmarked against various vision encoders on multiple pet recognition datasets:

Cat Individual Images Dataset

Model	ROC AUC	EER	Top-1	Top-5	Top-10
CLIP-ViT-Base	0.9821	0.0604	0.8359	0.9579	0.9711
DINOv2-Small	0.9904	0.0422	0.8547	0.9660	0.9764
SigLIP-Base	0.9899	0.0390	0.8649	0.9757	0.9842
SigLIP2-Base	0.9894	0.0388	0.8660	0.9772	0.9863
Zer0int CLIP-L	0.9881	0.0509	0.8768	0.9767	0.9845
SigLIP2-Giant	0.9940	0.0344	0.8899	0.9868	0.9921
SigLIP2-Giant + E5-Small-v2 + gating	0.9929	0.0344	0.8952	0.9872	0.9932

DogFaceNet Dataset

Model	ROC AUC	EER	Top-1	Top-5	Top-10
CLIP-ViT-Base	0.9739	0.0772	0.4350	0.6417	0.7204
DINOv2-Small	0.9829	0.0571	0.5581	0.7540	0.8139
SigLIP-Base	0.9792	0.0606	0.5848	0.7746	0.8319
SigLIP2-Base	0.9776	0.0672	0.5925	0.7856	0.8422
Zer0int CLIP-L	0.9814	0.0625	0.6289	0.8092	0.8597
SigLIP2-Giant	0.9926	0.0326	0.7475	0.9009	0.9316
SigLIP2-Giant + E5-Small-v2 + gating	0.9920	0.0314	0.7818	0.9233	0.9482

Combined Test Dataset (Overall Performance)

Model	ROC AUC	EER	Top-1	Top-5	Top-10
CLIP-ViT-Base	0.9752	0.0729	0.6511	0.8122	0.8555
DINOv2-Small	0.9848	0.0546	0.7180	0.8678	0.9009
SigLIP-Base	0.9811	0.0572	0.7359	0.8831	0.9140
SigLIP2-Base	0.9793	0.0631	0.7400	0.8889	0.9197
Zer0int CLIP-L	0.9842	0.0565	0.7626	0.8994	0.9267
SigLIP2-Giant	0.9912	0.0378	0.8243	0.9471	0.9641
SigLIP2-Giant + E5-Small-v2 + gating	0.9882	0.0422	0.8428	0.9576	0.9722

Metrics Explanation:

ROC AUC: Area Under the Receiver Operating Characteristic Curve. Measures how well similarity scores separate same-individual pairs from different-individual pairs (higher is better)
EER: Equal Error Rate. The error rate at the decision threshold where the false acceptance rate equals the false rejection rate (lower is better)
Top-K: Fraction of queries for which the correct individual appears among the K highest-ranked candidates (higher is better)
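
The three metrics can be computed from similarity scores as sketched below. This is an illustrative plain-Python version, not the evaluation code behind the tables. The function names, the toy scores, and the threshold sweep used for EER are assumptions, and the benchmark protocol (how pairs and galleries were built) is not described in this card.

```python
def roc_auc(genuine, impostor):
    """Probability that a random same-animal pair scores higher than a
    random different-animal pair (ties count as half)."""
    wins = 0.0
    for g in genuine:
        for i in impostor:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(genuine) * len(impostor))


def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a threshold and return the error rate
    where false acceptance and false rejection are closest."""
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(genuine) | set(impostor)):
        far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < threshold for s in genuine) / len(genuine)     # genuines rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def top_k_accuracy(ranked_ids_per_query, true_ids, k):
    """Fraction of queries whose true identity is among the first k ranked candidates."""
    hits = sum(true_id in ranked[:k]
               for ranked, true_id in zip(ranked_ids_per_query, true_ids))
    return hits / len(true_ids)


# Toy similarity scores for same-animal (genuine) and different-animal (impostor) pairs.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
print(roc_auc(genuine, impostor))           # 0.9375
print(equal_error_rate(genuine, impostor))  # 0.25
```

In the toy data, one genuine pair (0.4) scores below one impostor pair (0.6), which is why the AUC is 15/16 and the error rate at the balanced threshold is 1 in 4.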

Basic Usage

Installation

pip install transformers torch pillow safetensors huggingface_hub

Get Image Embedding

import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from transformers import SiglipModel, SiglipProcessor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

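class Model(nn.Module):
    def __init__(self):
        super().__init__()
        ckpt = "google/siglip2-giant-opt-patch16-384"
        # Keep the attribute name `clip`: the strict load_state_dict call below
        # expects the fine-tuned weights under that prefix.
        self.clip = SiglipModel.from_pretrained(ckpt)
        self.processor = SiglipProcessor.from_pretrained(ckpt)

    def forward(self, images):
        # `images` is a list of PIL images; the processor resizes and normalizes them.
        clip_inputs = self.processor(images=images, return_tensors="pt").to(self.clip.device)
        # Only the vision tower is used; returns one embedding per image.
        return self.clip.get_image_features(**clip_inputs)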

model = Model()

weights_path = hf_hub_download(repo_id="AvitoTech/SigLIP2-giant", filename="model.safetensors")
state_dict = load_file(weights_path)
model.load_state_dict(state_dict)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

image = Image.open("your_image.jpg").convert("RGB")

with torch.no_grad():
    embedding = model([image])
    embedding = F.normalize(embedding, dim=1)

print(f"Embedding shape: {embedding.shape}")  # torch.Size([1, 1536])
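
Once embeddings are available, verification and gallery lookup reduce to comparing vectors. The sketch below works on plain Python lists (for example `embedding[0].tolist()` from the snippet above) and uses cosine similarity. The helper names and the 0.5 threshold are placeholders chosen for illustration and do not come from this model card; in practice the threshold would be tuned on validation pairs.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_same_animal(emb_a, emb_b, threshold=0.5):
    # ASSUMPTION: 0.5 is a placeholder threshold, not a value from the model card.
    return cosine_similarity(emb_a, emb_b) >= threshold


def best_match(query, gallery):
    """Return (name, score) of the gallery embedding most similar to the query.

    `gallery` maps a hypothetical pet name to its stored embedding.
    """
    name = max(gallery, key=lambda n: cosine_similarity(query, gallery[n]))
    return name, cosine_similarity(query, gallery[name])


# Toy 2D embeddings standing in for the 1536-dimensional model output.
gallery = {"milo": [0.9, 0.1], "luna": [0.0, 1.0]}
name, score = best_match([1.0, 0.0], gallery)
print(name, round(score, 3))
```

For already L2-normalized embeddings, as produced by `F.normalize` above, cosine similarity equals the dot product, so large galleries can be scored with a single matrix multiplication.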

Citation

If you use this model in your research or applications, please cite our work:

@Article{jimaging12010030,
AUTHOR = {Kudryavtsev, Vasiliy and Borodin, Kirill and Berezin, German and Bubenchikov, Kirill and Mkrtchian, Grach and Ryzhkov, Alexander},
TITLE = {From Visual to Multimodal: Systematic Ablation of Encoders and Fusion Strategies in Animal Identification},
JOURNAL = {Journal of Imaging},
VOLUME = {12},
YEAR = {2026},
NUMBER = {1},
ARTICLE-NUMBER = {30},
URL = {https://www.mdpi.com/2313-433X/12/1/30},
ISSN = {2313-433X},
ABSTRACT = {Automated animal identification is a practical task for reuniting lost pets with their owners, yet current systems often struggle due to limited dataset scale and reliance on unimodal visual cues. This study introduces a multimodal verification framework that enhances visual features with semantic identity priors derived from synthetic textual descriptions. We constructed a massive training corpus of 1.9 million photographs covering 695,091 unique animals to support this investigation. Through systematic ablation studies, we identified SigLIP2-Giant and E5-Small-v2 as the optimal vision and text backbones. We further evaluated fusion strategies ranging from simple concatenation to adaptive gating to determine the best method for integrating these modalities. Our proposed approach utilizes a gated fusion mechanism and achieved a Top-1 accuracy of 84.28% and an Equal Error Rate of 0.0422 on a comprehensive test protocol. These results represent an 11% improvement over leading unimodal baselines and demonstrate that integrating synthesized semantic descriptions significantly refines decision boundaries in large-scale pet re-identification.},
DOI = {10.3390/jimaging12010030}
}

Use Cases

Individual pet identification and re-identification
Lost and found pet matching systems
Veterinary record management
Animal behavior monitoring
Wildlife conservation and tracking


Related Paper

PetFace: A Large-Scale Dataset and Benchmark for Animal Identification (arXiv 2407.13555, published Jul 18, 2024)