---

library_name: transformers
tags:
- siglip
- siglip2
- vision
- text
- clip
- multimodal
- image-text-embeddings
- pet-recognition
model_id: AvitoTech/SigLIP2-giant-e5small-v2-gating-for-animal-identification
pipeline_tag: feature-extraction
---


# SigLIP2-Giant + E5-Small-v2 + Gating Fine-tuned for Animal Identification

Fine-tuned multimodal model combining the SigLIP2-Giant vision encoder with the E5-Small-v2 text encoder for individual animal identification. The architecture uses a learned gating mechanism to dynamically fuse image and text embeddings and specializes in distinguishing between unique cats and dogs. The model produces robust multimodal embeddings optimized for pet recognition, re-identification, and verification tasks.


## Model Details

- **Base Vision Model**: google/siglip2-giant-opt-patch16-384
- **Text Encoder**: intfloat/e5-small-v2
- **Image Input**: RGB images (384×384)
- **Text Input**: Variable length text descriptions
- **Final Output**: Fused embeddings (512-dimensional) via learned gating
- **Task**: Individual animal identification and verification with multimodal inputs

## Training Data

The model was trained on a comprehensive dataset combining multiple sources:

- **[PetFace Dataset](https://arxiv.org/abs/2407.13555)**: Large-scale animal face dataset with 257,484 unique individuals across 13 animal families
- **[Dogs-World](https://www.kaggle.com/datasets/lextoumbourou/dogs-world)**: Kaggle dataset for dog breed and individual identification
- **[LCW (Labeled Cats in the Wild)](https://www.kaggle.com/datasets/dseidli/lcwlabeled-cats-in-the-wild)**: Cat identification dataset
- **Web-scraped Data**: Additional curated images from various sources

**Total Dataset Statistics:**
- **1,904,157** total photographs
- **695,091** unique individual animals (cats and dogs)

## Training Details

**Training Configuration:**
- **Batch Size**: 116 samples (58 unique identities × 2 photos each; see the sampler sketch after this list)
- **Optimizer**: Adam with learning rate 1e-4
- **Training Duration**: 10 epochs
- **Transfer Learning**: Final 5 transformer blocks unfrozen, lower layers frozen to preserve pre-trained features
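
The exact batch sampler is not released with this card; the following is a minimal sketch of an identity-balanced sampler matching the 58 × 2 batch layout described above. The `labels` interface (one identity label per dataset index) is an assumption for illustration.

```python
import random
from collections import defaultdict
from torch.utils.data import Sampler

class IdentityBatchSampler(Sampler):
    """Yield batches of `ids_per_batch` identities with `imgs_per_id` photos each
    (58 x 2 = 116 samples per batch, as in the training setup above)."""

    def __init__(self, labels, ids_per_batch=58, imgs_per_id=2):
        self.ids_per_batch = ids_per_batch
        self.imgs_per_id = imgs_per_id
        self.by_id = defaultdict(list)
        for idx, label in enumerate(labels):
            self.by_id[label].append(idx)
        # Keep only identities with enough photos to sample from
        self.ids = [i for i, idxs in self.by_id.items() if len(idxs) >= imgs_per_id]

    def __iter__(self):
        ids = self.ids[:]
        random.shuffle(ids)
        for start in range(0, len(ids) - self.ids_per_batch + 1, self.ids_per_batch):
            batch = []
            for identity in ids[start:start + self.ids_per_batch]:
                batch.extend(random.sample(self.by_id[identity], self.imgs_per_id))
            yield batch  # pass to DataLoader(dataset, batch_sampler=...)

    def __len__(self):
        return len(self.ids) // self.ids_per_batch
```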

**Loss Function:**
The model is trained using a combined loss function consisting of:
1. **Triplet Loss** (margin α=0.45): Encourages separation between different animal identities
2. **Intra-Pair Variance Regularization** (ε=0.01): Promotes consistency across multiple photos of the same animal

Combined as: L_total = 1.0 × L_triplet + 0.5 × L_var
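
The exact implementation of the variance term is not released here; the sketch below only illustrates how such a combined objective might look. The margin, ε, and loss weights are taken from above, while the specific form of the intra-pair variance penalty (variance across the two photos of each identity, thresholded at ε) is an assumption.

```python
import torch
import torch.nn.functional as F

def combined_loss(anchor, positive, negative, margin=0.45, eps=0.01,
                  w_triplet=1.0, w_var=0.5):
    """Sketch of the combined objective: triplet loss plus an intra-pair
    variance penalty. anchor/positive/negative: L2-normalized embeddings (B, 512)."""
    # Triplet term: pull same-identity pairs closer than different-identity pairs
    l_triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)

    # Variance term (assumed form): penalize spread between photos of the
    # same individual beyond a small tolerance eps
    pair = torch.stack([anchor, positive], dim=1)      # (B, 2, 512)
    var = pair.var(dim=1, unbiased=False).mean(dim=1)  # per-identity variance
    l_var = F.relu(var - eps).mean()

    return w_triplet * l_triplet + w_var * l_var
```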



This approach creates compact feature clusters for each individual animal while maintaining large separation between different identities. The gating mechanism learns to dynamically balance image and text features for optimal performance.



## Performance Metrics



The model has been benchmarked against various vision encoders on multiple pet recognition datasets:



### [Cat Individual Images Dataset](https://www.kaggle.com/datasets/timost1234/cat-individuals)



| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| CLIP-ViT-Base | 0.9821 | 0.0604 | 0.8359 | 0.9579 | 0.9711 |
| DINOv2-Small | 0.9904 | 0.0422 | 0.8547 | 0.9660 | 0.9764 |
| SigLIP-Base | 0.9899 | 0.0390 | 0.8649 | 0.9757 | 0.9842 |
| SigLIP2-Base | 0.9894 | 0.0388 | 0.8660 | 0.9772 | 0.9863 |
| Zer0int CLIP-L | 0.9881 | 0.0509 | 0.8768 | 0.9767 | 0.9845 |
| SigLIP2-Giant | 0.9940 | 0.0344 | 0.8899 | 0.9868 | 0.9921 |
| **SigLIP2-Giant + E5-Small-v2 + gating** | **0.9929** | **0.0344** | **0.8952** | **0.9872** | **0.9932** |



### [DogFaceNet Dataset](https://www.springerprofessional.de/en/a-deep-learning-approach-for-dog-face-verification-and-recogniti/17094782)



| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| CLIP-ViT-Base | 0.9739 | 0.0772 | 0.4350 | 0.6417 | 0.7204 |
| DINOv2-Small | 0.9829 | 0.0571 | 0.5581 | 0.7540 | 0.8139 |
| SigLIP-Base | 0.9792 | 0.0606 | 0.5848 | 0.7746 | 0.8319 |
| SigLIP2-Base | 0.9776 | 0.0672 | 0.5925 | 0.7856 | 0.8422 |
| Zer0int CLIP-L | 0.9814 | 0.0625 | 0.6289 | 0.8092 | 0.8597 |
| SigLIP2-Giant | 0.9926 | 0.0326 | 0.7475 | 0.9009 | 0.9316 |
| **SigLIP2-Giant + E5-Small-v2 + gating** | **0.9920** | **0.0314** | **0.7818** | **0.9233** | **0.9482** |



### Combined Test Dataset (Overall Performance)



| Model | ROC AUC | EER | Top-1 | Top-5 | Top-10 |
|-------|---------|-----|-------|-------|--------|
| CLIP-ViT-Base | 0.9752 | 0.0729 | 0.6511 | 0.8122 | 0.8555 |
| DINOv2-Small | 0.9848 | 0.0546 | 0.7180 | 0.8678 | 0.9009 |
| SigLIP-Base | 0.9811 | 0.0572 | 0.7359 | 0.8831 | 0.9140 |
| SigLIP2-Base | 0.9793 | 0.0631 | 0.7400 | 0.8889 | 0.9197 |
| Zer0int CLIP-L | 0.9842 | 0.0565 | 0.7626 | 0.8994 | 0.9267 |
| SigLIP2-Giant | 0.9912 | 0.0378 | 0.8243 | 0.9471 | 0.9641 |
| **SigLIP2-Giant + E5-Small-v2 + gating** | **0.9882** | **0.0422** | **0.8428** | **0.9576** | **0.9722** |



**Metrics Explanation:**
- **ROC AUC**: Area Under the Receiver Operating Characteristic Curve, measuring the model's ability to distinguish between different individuals
- **EER**: Equal Error Rate, the error rate at the threshold where false acceptance and false rejection rates are equal
- **Top-K**: Accuracy of correct identification within the top K predictions



**Note:** This multimodal model achieves the best overall Top-K accuracy scores by leveraging both visual and textual information through a learned gating mechanism.
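
For reference, ROC AUC and EER can be computed from pairwise similarity scores of verification pairs. The sketch below uses scikit-learn and random toy scores purely for illustration; it is not the benchmark code used for the tables above.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def verification_metrics(scores, labels):
    """ROC AUC and EER for pair similarity scores.

    scores: similarity per pair (higher = more likely the same animal)
    labels: 1 if the pair shows the same individual, else 0
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]  # point where FAR ~= FRR
    return roc_auc_score(labels, scores), eer

# Toy example; replace scores with cosine similarities of model embeddings
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels + rng.normal(0, 0.5, size=1000)
print(verification_metrics(scores, labels))
```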



## Basic Usage



### Installation



```bash
pip install transformers torch pillow safetensors huggingface_hub
```



### Load Model and Get Embedding



```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from transformers import SiglipModel, SiglipProcessor, AutoModel, AutoTokenizer
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download


# Define the model architecture
class FaceRecognizer(nn.Module):
    def __init__(self, embedding_dim=512):
        super().__init__()
        ckpt = "google/siglip2-giant-opt-patch16-384"
        self.clip = SiglipModel.from_pretrained(ckpt)
        self.processor = SiglipProcessor.from_pretrained(ckpt)

        text_model_name = "intfloat/e5-small-v2"
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(text_model_name)

        img_dim = self.clip.config.vision_config.hidden_size
        text_dim = self.text_encoder.config.hidden_size

        # Project both modalities into a shared embedding space
        self.proj_img = nn.Linear(img_dim, embedding_dim)
        self.proj_text = nn.Linear(text_dim, embedding_dim)

        # Gating network: predicts per-sample weights for text vs. image features
        self.gate = nn.Sequential(
            nn.Linear(embedding_dim * 2, 128),
            nn.ReLU(),
            nn.Linear(128, 2),
            nn.Softmax(dim=-1)
        )

    def average_pool(self, last_hidden_states, attention_mask):
        # Mean-pool token embeddings, ignoring padding positions
        last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
        return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

    def forward(self, images, texts):
        device = next(self.parameters()).device

        # Image branch: SigLIP2 image features
        clip_inputs = self.processor(images=images, return_tensors="pt").to(device)
        img_emb = self.clip.get_image_features(**clip_inputs)

        # Text branch: E5 token embeddings, mean-pooled
        text_inputs = self.tokenizer(
            texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
        ).to(device)
        text_outputs = self.text_encoder(**text_inputs)
        text_emb = self.average_pool(text_outputs.last_hidden_state, text_inputs['attention_mask'])

        # Project both branches into the shared 512-d space
        img_proj = self.proj_img(img_emb)
        text_proj = self.proj_text(text_emb)

        # Gated fusion: learned weighted sum of text and image projections
        fused = torch.cat([text_proj, img_proj], dim=-1)
        w = self.gate(fused)
        fused_emb = w[:, 0:1] * text_proj + w[:, 1:2] * img_proj

        return F.normalize(fused_emb, dim=1)


# Load model
model = FaceRecognizer()

# Download and load weights from Hugging Face
weights_path = hf_hub_download(
    repo_id="AvitoTech/SigLIP2-giant-e5small-v2-gating-for-animal-identification",
    filename="model.safetensors",
)
state_dict = load_file(weights_path)
model.load_state_dict(state_dict)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Get fused embedding
image = Image.open("your_image.jpg").convert("RGB")
text = "orange cat"

with torch.no_grad():
    embedding = model([image], [text])

print(f"Embedding shape: {embedding.shape}")  # torch.Size([1, 512])
```
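
### Verification Example

For verification, the fused embeddings of two image/text pairs can be compared with cosine similarity. The sketch below reuses the model loaded above; the file names, captions, and the 0.6 threshold are illustrative, not calibrated operating points.

```python
# Compare two pets: decide whether two photos show the same individual
image_a = Image.open("pet_a.jpg").convert("RGB")
image_b = Image.open("pet_b.jpg").convert("RGB")

with torch.no_grad():
    emb = model([image_a, image_b], ["orange cat", "orange cat"])

# Embeddings are L2-normalized, so cosine similarity is a dot product
similarity = (emb[0] @ emb[1]).item()
print(f"Cosine similarity: {similarity:.3f}")
if similarity > 0.6:  # illustrative threshold; tune on a validation set
    print("Likely the same animal")
```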

## Citation

If you use this model in your research or applications, please cite our work:

```

BibTeX citation will be added upon paper publication.

```

## Use Cases

- Individual pet identification and re-identification with multimodal queries
- Lost and found pet matching systems with text descriptions
- Veterinary record management with combined image and text search
- Animal behavior monitoring with contextual information
- Wildlife conservation and tracking with metadata integration