# DINOv3 ConvNeXt-Small for GeoGuessr Country Classification

A DINOv3 ConvNeXt-Small model fine-tuned for geographic image classification on Google Street View images. The model predicts the country of origin among 55 countries with 74.31% test accuracy.

## Model Details

| Property | Value |
|---|---|
| Architecture | ConvNeXt-Small (50M params) |
| Base Model | `facebook/dinov3-convnext-small-pretrain-lvd1689m` |
| Input Resolution | 384 × 384 |
| Output Classes | 55 countries |
| Training Data | ~25,000 Street View images |

## Why DINOv3?

DINOv3 models are distilled from a 7B-parameter Vision Transformer trained with self-supervision on 1.7 billion diverse web images. This pretraining distribution—spanning global visual patterns like road markings, vegetation, architecture, and signage—creates representations inherently suited to geographic reasoning.

**Our core finding:** A smaller model with domain-aligned features outperforms larger models with weaker transfer. The 50M-parameter DINOv3 ConvNeXt-Small beats the 87M ConvNeXtV2-Base by 13 percentage points.

## Results

### Test Set Performance

| Metric | Value |
|---|---|
| Accuracy | 74.31% |
| Top-3 Accuracy | 86.94% |
| F1-Macro | 58.93% |
| F1-Weighted | 72.76% |
| mAP | 61.47% |

### Comparison with Previous Work

| Model | Parameters | Accuracy | F1-Macro |
|---|---|---|---|
| DINOv3 ConvNeXt-S (ours) | 50M | 74.31% | 58.93% |
| SigLIP2 | 93M | 64.85% | 38.36% |
| ConvNeXtV2-Base | 87M | 61.03% | 51.77% |
| ViT-Base-384 (baseline) | 86M | 38.81% | 14.40% |

### Training Progression

| Checkpoint | Val Acc | Test Acc | Test F1 | Test mAP |
|---|---|---|---|---|
| Epoch 10 | 55.38% | 52.36% | 42.91% | 50.09% |
| Epoch 20 | 64.24% | 62.61% | 53.78% | 57.89% |
| Epoch 30 | 70.27% | 68.98% | 55.92% | 59.34% |
| Epoch 40 | 74.87% | 73.50% | 58.64% | 61.19% |
| Epoch 48 (best) | 75.28% | 74.31% | 58.93% | 61.47% |

### Per-Country Performance

**Top 5 countries (by accuracy):**

| Country | Accuracy | F1 | Support |
|---|---|---|---|
| 🇯🇵 Japan | 95.83% | 0.923 | 576 |
| 🇧🇷 Brazil | 92.53% | 0.835 | 348 |
| 🇸🇬 Singapore | 91.59% | 0.916 | 107 |
| 🇩🇪 Germany | 90.57% | 0.877 | 106 |
| 🇦🇺 Australia | 87.55% | 0.851 | 257 |

**Bottom 5 countries (by accuracy):**

| Country | Accuracy | F1 | Support |
|---|---|---|---|
| 🇱🇻 Latvia | 0.00% | 0.000 | 19 |
| 🇸🇰 Slovakia | 5.88% | 0.111 | 17 |
| 🇨🇿 Czechia | 10.00% | 0.154 | 40 |
| 🇭🇷 Croatia | 10.00% | 0.182 | 20 |
| 🇧🇪 Belgium | 11.76% | 0.200 | 34 |

## Training Recipe

### Core Thesis

Self-distillation pretraining on diverse global data produces features that transfer better than generative self-supervised pretraining, and for geographic tasks the pretraining distribution matters more than model capacity.

### Key Design Decisions

#### 1. BCE Loss with Multi-Label MixUp/CutMix

Standard MixUp interpolates labels (y = 0.7*A + 0.3*B), but when you look at a mixed image, both concepts are visible. Following Wightman et al. [2], we treat mixed samples as multi-label (both classes = 1) with BCE loss. This better matches visual reality and gave consistent improvements.
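The multi-hot target construction can be sketched as follows. This is a minimal illustration of the idea, not the exact training code; `mixup_multilabel` and its argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mixup_multilabel(images, labels, num_classes, alpha=0.1):
    """MixUp variant that marks BOTH source classes as positives.

    Instead of interpolating one-hot labels (0.7*A + 0.3*B), each
    mixed sample gets a multi-hot target: class A = 1 and class B = 1,
    to be trained against with BCE rather than cross-entropy.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]

    targets = torch.zeros(images.size(0), num_classes)
    targets[torch.arange(images.size(0)), labels] = 1.0
    targets[torch.arange(images.size(0)), labels[perm]] = 1.0
    return mixed, targets

# BCE treats each class as an independent binary decision,
# so "both classes present" is a valid target.
images = torch.randn(4, 3, 384, 384)
labels = torch.tensor([0, 2, 1, 3])
mixed, targets = mixup_multilabel(images, labels, num_classes=55)
logits = torch.randn(4, 55)
loss = F.binary_cross_entropy_with_logits(logits, targets)
```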

#### 2. Short-Schedule Optimization

For 50-epoch fine-tuning, aggressive regularization hurts more than helps [2]:

- MixUp α = 0.1 (not 0.2-0.8)
- CutMix α = 1.0
- RandAugment magnitude 6 (not 7-9)
- No label smoothing
- No stochastic depth
- Test crop ratio 0.95 (not 0.875) [6]

#### 3. Resolution Matters

GeoGuessr images are ~1.5k pixels wide. Critical details (text on signs, road markings) are lost at 224px. Training at 384px preserves these geographic cues [6].

#### 4. Class Imbalance Handling

Japan and France together comprise ~20% of the training data. We used `WeightedRandomSampler` so that each class is sampled equally often.
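Inverse-frequency sampling can be set up as below (a self-contained sketch with toy labels standing in for the real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy labels: class 0 is heavily over-represented (like Japan/France).
labels = torch.tensor([0] * 80 + [1] * 15 + [2] * 5)

# Weight each sample by the inverse frequency of its class, so every
# class contributes equal total probability mass to the sampler.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),
    replacement=True,  # rare-class samples are drawn repeatedly
)
dataset = TensorDataset(torch.arange(len(labels)), labels)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)
```

With replacement enabled, each epoch sees rare classes about as often as common ones, at the cost of repeating rare samples.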

### Training Configuration

- **Hardware:** NVIDIA RTX 3080 Ti (12 GB)
- **Batch size:** 16 (hardware constrained)
- **Epochs:** 50
- **Optimizer:** AdamW (lr=5e-5, weight_decay=0.01)
- **Schedule:** cosine decay with 10% warmup
- **Image size:** 384
- **Dropout:** 0.1
- **Augmentation:** RandAugment(n=2, m=6) + MixUp(0.1) + CutMix(1.0)
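The cosine-with-warmup schedule can be expressed as a simple function of the step index (illustrative only; the exact scheduler implementation used in training may differ):

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_frac=0.1):
    """Linear warmup over the first 10% of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```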

## Usage

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor (trust_remote_code required for the custom model)
processor = AutoImageProcessor.from_pretrained("Simon-Kotchou/convnext-dinov3-small-geoguessr-25k-384")
model = AutoModelForImageClassification.from_pretrained(
    "Simon-Kotchou/convnext-dinov3-small-geoguessr-25k-384",
    trust_remote_code=True,
)

# Load and preprocess the image
image = Image.open("street_view.jpg")
inputs = processor(images=image, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    predicted_class = logits.argmax(-1).item()
    confidence = probs[0, predicted_class].item()

# Map to country name (id2label keys are ints after config loading)
country = model.config.id2label[predicted_class]
print(f"Predicted: {country} ({confidence:.1%})")

# Get top-3 predictions
top3 = torch.topk(probs, 3, dim=-1)
for i, (prob, idx) in enumerate(zip(top3.values[0], top3.indices[0])):
    country = model.config.id2label[idx.item()]
    print(f"  {i+1}. {country}: {prob.item():.1%}")
```

## Limitations and Future Work

### Why Latvia Fails Completely

Latvia achieves 0% accuracy; it represents only 0.35% of the training data, and the previous ConvNeXtV2 model also struggled with it (14.29%). This suggests:

1. **Visual similarity to neighbors:** Latvia shares visual characteristics with Lithuania, Estonia, and other Baltic states
2. **Insufficient training signal:** ~88 training samples may be below the threshold for learning discriminative features
3. **Lack of distinctive markers:** unlike Japan (unique script) or Australia (distinctive landscapes), Baltic countries lack globally unique visual signatures

### Potential Improvements

| Approach | Rationale |
|---|---|
| Focal loss | Down-weight easy examples; focus learning on hard tail classes |
| EMA | Stabilize training; often helps with class imbalance |
| Higher resolution | 518px or 768px to capture fine text details |
| MetaCLIP/SigLIP2 backbone | Caption-pretrained models might capture semantic country associations |
| Hierarchical classification | Predict region → country to reduce confusion within regions |
| Longer training | 100-200 epochs with a matching regularization schedule |
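Of these, focal loss is the most self-contained to prototype. A minimal single-label sketch, following the standard formulation of Lin et al. (not code from this project):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: scales cross-entropy by (1 - p_t)^gamma, so
    well-classified examples contribute little and gradient signal
    concentrates on hard (often rare-class) samples."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()  # probability assigned to the true class
    return (-((1 - pt) ** gamma) * log_pt).mean()
```

With `gamma=0` this reduces exactly to cross-entropy; larger `gamma` suppresses easy examples more aggressively.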

### Hardware Constraints

Training was conducted on a single RTX 3080 Ti with 12GB VRAM, limiting batch size to 16. Larger batches with gradient accumulation or multi-GPU training could improve optimization dynamics.

## Key Takeaways

1. **Pretraining distribution > model capacity:** a 50M model pretrained on 1.7B diverse images beats an 87M model pretrained on ImageNet-22k
2. **Modern ConvNets remain competitive:** in the age of Vision Transformers, ConvNeXt architectures offer strong performance with better efficiency and a spatial inductive bias
3. **DINOv3 transfers remarkably well:** self-supervised pretraining on diverse monocular images yields representations that generalize to geographic visual reasoning
4. **The long tail is hard:** even with weighted sampling, rare classes with subtle distinguishing features remain challenging

## Citation

If you use this model, please cite:

```bibtex
@misc{dinov3-geoguessr-2025,
  title={DINOv3 ConvNeXt-Small Fine-tuned for GeoGuessr Country Classification},
  author={Simon Kotchou},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Simon-Kotchou/dinov3-convnext-small-geoguessr-25k-384}
}
```

## References

[1] Siméoni et al. "DINOv3." arXiv:2508.10104, 2025.

[2] Wightman, Touvron, Jégou. "ResNet Strikes Back: An improved training procedure in timm." arXiv:2110.00476, 2021.

[3] Liu et al. "A ConvNet for the 2020s." CVPR 2022. arXiv:2201.03545.

[4] Zhang et al. "MixUp: Beyond Empirical Risk Minimization." ICLR 2018. arXiv:1710.09412.

[5] Yun et al. "CutMix: Regularization Strategy to Train Strong Classifiers." ICCV 2019. arXiv:1905.04899.

[6] Touvron et al. "Fixing the train-test resolution discrepancy." NeurIPS 2019. arXiv:1906.06423.

## Model Card Authors

Simon Kotchou

## Acknowledgments

- Meta AI for the DINOv3 foundation models
- The GeoGuessr dataset creators
- The HuggingFace and PyTorch communities