DINOv3 ConvNeXt-Small for GeoGuessr Country Classification
A DINOv3 ConvNeXt-Small model fine-tuned for geographic image classification on Google Street View imagery. Given a Street View image, the model predicts which of 55 countries it was taken in, reaching 74.31% test accuracy.
Model Details
| Property | Value |
|---|---|
| Architecture | ConvNeXt-Small (50M params) |
| Base Model | facebook/dinov3-convnext-small-pretrain-lvd1689m |
| Input Resolution | 384 × 384 |
| Output Classes | 55 countries |
| Training Data | ~25,000 Street View images |
Why DINOv3?
DINOv3 models are distilled from a 7B-parameter Vision Transformer trained with self-supervision on 1.7 billion diverse web images. This pretraining distribution—spanning global visual patterns like road markings, vegetation, architecture, and signage—creates representations inherently suited to geographic reasoning.
Our core finding: A smaller model with domain-aligned features outperforms larger models with weaker transfer. The 50M parameter DINOv3 ConvNeXt-Small beats the 87M ConvNeXtV2-Base by 13 percentage points.
Results
Test Set Performance
| Metric | Value |
|---|---|
| Accuracy | 74.31% |
| Top-3 Accuracy | 86.94% |
| F1-Macro | 58.93% |
| F1-Weighted | 72.76% |
| mAP | 61.47% |
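For reference, metrics of this kind can be computed with scikit-learn; a minimal sketch, assuming integer label arrays `y_true`/`y_pred` and an `(N, 55)` probability matrix `y_prob` from the model (these names are placeholders, not the actual evaluation script):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, top_k_accuracy_score)
from sklearn.preprocessing import label_binarize

labels = np.arange(55)  # class ids for the 55 countries

accuracy = accuracy_score(y_true, y_pred)
top3_accuracy = top_k_accuracy_score(y_true, y_prob, k=3, labels=labels)
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_weighted = f1_score(y_true, y_pred, average="weighted")
# mAP: macro-averaged per-class average precision against one-hot targets
mAP = average_precision_score(label_binarize(y_true, classes=labels), y_prob)
```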
Comparison with Previous Work
| Model | Parameters | Accuracy | F1-Macro |
|---|---|---|---|
| DINOv3 ConvNeXt-S (ours) | 50M | 74.31% | 58.93% |
| SigLIP2 | 93M | 64.85% | 38.36% |
| ConvNeXtV2-Base | 87M | 61.03% | 51.77% |
| ViT-Base-384 (baseline) | 86M | 38.81% | 14.40% |
Training Progression
| Checkpoint | Val Acc | Test Acc | Test F1 | Test mAP |
|---|---|---|---|---|
| Epoch 10 | 55.38% | 52.36% | 42.91% | 50.09% |
| Epoch 20 | 64.24% | 62.61% | 53.78% | 57.89% |
| Epoch 30 | 70.27% | 68.98% | 55.92% | 59.34% |
| Epoch 40 | 74.87% | 73.50% | 58.64% | 61.19% |
| Epoch 48 (best) | 75.28% | 74.31% | 58.93% | 61.47% |
Per-Country Performance
Top 5 Countries (by accuracy):
| Country | Accuracy | F1 | Support |
|---|---|---|---|
| 🇯🇵 Japan | 95.83% | 0.923 | 576 |
| 🇧🇷 Brazil | 92.53% | 0.835 | 348 |
| 🇸🇬 Singapore | 91.59% | 0.916 | 107 |
| 🇩🇪 Germany | 90.57% | 0.877 | 106 |
| 🇦🇺 Australia | 87.55% | 0.851 | 257 |
Bottom 5 Countries (by accuracy):
| Country | Accuracy | F1 | Support |
|---|---|---|---|
| 🇱🇻 Latvia | 0.00% | 0.000 | 19 |
| 🇸🇰 Slovakia | 5.88% | 0.111 | 17 |
| 🇨🇿 Czechia | 10.00% | 0.154 | 40 |
| 🇭🇷 Croatia | 10.00% | 0.182 | 20 |
| 🇧🇪 Belgium | 11.76% | 0.200 | 34 |
Training Recipe
Core Thesis
Self-distillation-based pretraining on diverse global data produces features that transfer better than generative self-supervised pretraining, and for geographic tasks the pretraining distribution matters more than model capacity.
Key Design Decisions
1. BCE Loss with Multi-Label MixUp/CutMix
Standard MixUp interpolates labels (e.g., y = 0.7·y_A + 0.3·y_B), but in a mixed image both concepts are actually visible. Following Wightman et al. [2], we instead treat mixed samples as multi-label (both classes set to 1) and train with BCE loss. This better matches the visual reality and gave consistent improvements.
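A minimal sketch of the idea (a hypothetical helper, not the exact training code; the class ids are made up): both source classes of a mixed image get a hard 1 in a multi-hot target, trained with BCE:

```python
import torch
import torch.nn.functional as F

def multilabel_mixup_targets(y_a, y_b, num_classes):
    """Both source classes are marked present in the mixed image."""
    return (F.one_hot(y_a, num_classes) | F.one_hot(y_b, num_classes)).float()

# Hypothetical example: an image mixed from a class-23 and a class-7 sample
logits = torch.randn(1, 55)  # model output for the mixed image
targets = multilabel_mixup_targets(torch.tensor([23]), torch.tensor([7]), 55)
loss = F.binary_cross_entropy_with_logits(logits, targets)
```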
2. Short-Schedule Optimization
For 50-epoch fine-tuning, aggressive regularization hurts more than helps [2]:
- MixUp α = 0.1 (not 0.2-0.8)
- CutMix α = 1.0
- RandAugment magnitude 6 (not 7-9)
- No label smoothing
- No stochastic depth
- Test crop ratio 0.95 (not 0.875) [6]
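These settings map onto standard torchvision components; a rough sketch of equivalent train/eval transforms (an assumption, not the exact training script; MixUp/CutMix are applied batch-wise, and normalization is omitted for brevity):

```python
from torchvision import transforms

# Training: RandAugment with 2 ops at magnitude 6
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(384),
    transforms.RandAugment(num_ops=2, magnitude=6),
    transforms.ToTensor(),
])

# Evaluation: crop ratio 0.95 -> resize to 384/0.95 ≈ 404, then center-crop 384 [6]
eval_tf = transforms.Compose([
    transforms.Resize(int(384 / 0.95)),
    transforms.CenterCrop(384),
    transforms.ToTensor(),
])
```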
3. Resolution Matters
GeoGuessr images are ~1.5k pixels wide. Critical details (text on signs, road markings) are lost at 224px. Training at 384px preserves these geographic cues [6].
4. Class Imbalance Handling
Japan and France together comprise ~20% of the training data. We used WeightedRandomSampler so that each class is drawn equally often during training.
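A minimal sketch of the balanced sampling, assuming `train_labels` (one class id per training image) and `train_dataset` exist:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor(train_labels)  # assumed: per-sample class ids
class_counts = torch.bincount(labels, minlength=55).float()
sample_weights = 1.0 / class_counts[labels]  # inverse-frequency weight per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)
```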
Training Configuration
- Hardware: NVIDIA RTX 3080 Ti (12 GB)
- Batch size: 16 (hardware constrained)
- Epochs: 50
- Optimizer: AdamW (lr=5e-5, weight_decay=0.01)
- Schedule: cosine decay with 10% warmup
- Image size: 384 × 384
- Dropout: 0.1
- Augmentation: RandAugment(n=2, m=6) + MixUp(α=0.1) + CutMix(α=1.0)
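The optimizer and schedule correspond to standard PyTorch/transformers components; a sketch, assuming `model` and a `train_loader` as in the sampler example above:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

total_steps = 50 * len(train_loader)  # 50 epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 10% linear warmup
    num_training_steps=total_steps,           # cosine decay afterwards
)
```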
Usage
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor (trust_remote_code required for the custom model)
processor = AutoImageProcessor.from_pretrained("Simon-Kotchou/convnext-dinov3-small-geoguessr-25k-384")
model = AutoModelForImageClassification.from_pretrained(
    "Simon-Kotchou/convnext-dinov3-small-geoguessr-25k-384",
    trust_remote_code=True,
)
model.eval()

# Load and preprocess an image
image = Image.open("street_view.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Get prediction
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
predicted_class = logits.argmax(-1).item()
confidence = probs[0, predicted_class].item()

# Map to country name
country = model.config.id2label[predicted_class]
print(f"Predicted: {country} ({confidence:.1%})")

# Get top-3 predictions
top3 = torch.topk(probs, 3, dim=-1)
for i, (prob, idx) in enumerate(zip(top3.values[0], top3.indices[0])):
    country = model.config.id2label[idx.item()]
    print(f"  {i+1}. {country}: {prob.item():.1%}")
```
Limitations and Future Work
Why Latvia Fails Completely
Latvia achieves 0% accuracy; at only ~0.35% of the training data, it is one of the rarest classes, and the previous ConvNeXtV2 model also struggled with it (14.29%). Possible explanations:
- Visual similarity to neighbors: Latvia shares visual characteristics with Lithuania, Estonia, and other Baltic states
- Insufficient training signal: ~88 training samples may be below the threshold for learning discriminative features
- Lack of distinctive markers: Unlike Japan (unique script) or Australia (distinctive landscapes), Baltic countries lack globally unique visual signatures
Potential Improvements
| Approach | Rationale |
|---|---|
| Focal Loss | Down-weight easy examples to focus learning on hard tail classes (see sketch below) |
| EMA | Stabilize training, often helps with class imbalance |
| Higher Resolution | 518px or 768px to capture fine text details |
| MetaCLIP/SigLIP2 backbone | Caption-pretrained models might capture semantic country associations |
| Hierarchical Classification | Predict region → country to reduce confusion within regions |
| Longer Training | 100-200 epochs with proper regularization schedule |
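As an illustration of the first row, a standard multi-class focal loss (Lin et al., 2017) could replace the classification loss; a minimal sketch, not tested on this task:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Down-weights well-classified examples by (1 - p_t)^gamma so the
    gradient concentrates on hard, often rare, classes."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```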
Hardware Constraints
Training was conducted on a single RTX 3080 Ti with 12GB VRAM, limiting batch size to 16. Larger batches with gradient accumulation or multi-GPU training could improve optimization dynamics.
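Gradient accumulation is the simplest route to larger effective batches on such hardware; a minimal sketch with assumed `model`, `criterion`, `optimizer`, and `train_loader`, accumulating 4 micro-batches for an effective batch of 64:

```python
accum_steps = 4  # effective batch size = 16 * 4 = 64

optimizer.zero_grad()
for step, (images, targets) in enumerate(train_loader):
    loss = criterion(model(images), targets) / accum_steps  # scale so the sum averages
    loss.backward()                                          # gradients accumulate
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```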
Key Takeaways
- Pretraining distribution > model capacity: a 50M model pretrained on 1.7B diverse images beats an 87M model pretrained on ImageNet-22k
- Modern ConvNets remain competitive: in the age of Vision Transformers, ConvNeXt architectures offer strong performance with better efficiency and a useful spatial inductive bias
- DINOv3 transfers remarkably well: self-supervised pretraining on diverse web images creates representations that generalize to geographic visual reasoning
- The long tail is hard: even with weighted sampling, rare classes with subtle distinguishing features remain challenging
Citation
If you use this model, please cite:
```bibtex
@misc{dinov3-geoguessr-2025,
  title={DINOv3 ConvNeXt-Small Fine-tuned for GeoGuessr Country Classification},
  author={Simon Kotchou},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Simon-Kotchou/dinov3-convnext-small-geoguessr-25k-384}
}
```
References
[1] Siméoni et al. "DINOv3." arXiv:2508.10104, 2025.
[2] Wightman, Touvron, Jégou. "ResNet Strikes Back: An improved training procedure in timm." arXiv:2110.00476, 2021.
[3] Liu et al. "A ConvNet for the 2020s." CVPR 2022. arXiv:2201.03545.
[4] Zhang et al. "MixUp: Beyond Empirical Risk Minimization." ICLR 2018. arXiv:1710.09412.
[5] Yun et al. "CutMix: Regularization Strategy to Train Strong Classifiers." ICCV 2019. arXiv:1905.04899.
[6] Touvron et al. "Fixing the train-test resolution discrepancy." NeurIPS 2019. arXiv:1906.06423.
Model Card Authors
Simon Kotchou
Acknowledgments
- Meta AI for the DINOv3 foundation models
- The GeoGuessr dataset creators
- The HuggingFace and PyTorch communities