# DINOv3 ConvNeXt-Small for GeoGuessr Country Classification

A DINOv3 ConvNeXt-Small model fine-tuned for geographic image classification on Google Street View images. The model predicts the country of origin among 55 countries with 74.31% test accuracy.

## Model Details

| Property | Value |
|---|---|
| Architecture | ConvNeXt-Small (50M params) |
| Base Model | `facebook/dinov3-convnext-small-pretrain-lvd1689m` |
| Input Resolution | 384 × 384 |
| Output Classes | 55 countries |
| Training Data | ~25,000 Street View images |

## Why DINOv3?

DINOv3 models are distilled from a 7B-parameter Vision Transformer trained with self-supervision on 1.7 billion diverse web images. This pretraining distribution—spanning global visual patterns like road markings, vegetation, architecture, and signage—creates representations inherently suited to geographic reasoning.

**Our core finding:** A smaller model with domain-aligned features outperforms larger models with weaker transfer. The 50M-parameter DINOv3 ConvNeXt-Small beats the 87M ConvNeXtV2-Base by 13 percentage points.

## Results

### Test Set Performance

| Metric | Value |
|---|---|
| Accuracy | 74.31% |
| Top-3 Accuracy | 86.94% |
| F1-Macro | 58.93% |
| F1-Weighted | 72.76% |
| mAP | 61.47% |

### Comparison with Previous Work

| Model | Parameters | Accuracy | F1-Macro |
|---|---|---|---|
| DINOv3 ConvNeXt-S (ours) | 50M | 74.31% | 58.93% |
| SigLIP2 | 93M | 64.85% | 38.36% |
| ConvNeXtV2-Base | 87M | 61.03% | 51.77% |
| ViT-Base-384 (baseline) | 86M | 38.81% | 14.40% |

### Training Progression

| Checkpoint | Val Acc | Test Acc | Test F1 | Test mAP |
|---|---|---|---|---|
| Epoch 10 | 55.38% | 52.36% | 42.91% | 50.09% |
| Epoch 20 | 64.24% | 62.61% | 53.78% | 57.89% |
| Epoch 30 | 70.27% | 68.98% | 55.92% | 59.34% |
| Epoch 40 | 74.87% | 73.50% | 58.64% | 61.19% |
| Epoch 48 (best) | 75.28% | 74.31% | 58.93% | 61.47% |

### Per-Country Performance

**Top 5 countries (by accuracy):**

| Country | Accuracy | F1 | Support |
|---|---|---|---|
| 🇯🇵 Japan | 95.83% | 0.923 | 576 |
| 🇧🇷 Brazil | 92.53% | 0.835 | 348 |
| 🇸🇬 Singapore | 91.59% | 0.916 | 107 |
| 🇩🇪 Germany | 90.57% | 0.877 | 106 |
| 🇦🇺 Australia | 87.55% | 0.851 | 257 |

**Bottom 5 countries (by accuracy):**

| Country | Accuracy | F1 | Support |
|---|---|---|---|
| 🇱🇻 Latvia | 0.00% | 0.000 | 19 |
| 🇸🇰 Slovakia | 5.88% | 0.111 | 17 |
| 🇨🇿 Czechia | 10.00% | 0.154 | 40 |
| 🇭🇷 Croatia | 10.00% | 0.182 | 20 |
| 🇧🇪 Belgium | 11.76% | 0.200 | 34 |

## Training Recipe

### Core Thesis

Self-distillation pretraining on diverse global data produces features that transfer better than generative self-supervised pretraining, and for geographic tasks the pretraining distribution matters more than model capacity.

### Key Design Decisions

#### 1. BCE Loss with Multi-Label MixUp/CutMix

Standard MixUp interpolates labels (y = 0.7*A + 0.3*B), but when you look at a mixed image, both concepts are visible. Following Wightman et al. [2], we treat mixed samples as multi-label (both classes = 1) with BCE loss. This better matches visual reality and gave consistent improvements.
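The multi-hot target construction can be sketched as follows. This is a minimal illustration of the idea, not the exact training code; `mixup_multilabel` and its argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mixup_multilabel(images, labels, num_classes, alpha=0.1):
    """MixUp variant that marks BOTH source classes as positives.

    Instead of interpolating one-hot labels (0.7*A + 0.3*B), each
    mixed sample gets a multi-hot target: class A = 1 and class B = 1,
    to be trained against with BCE rather than cross-entropy.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]

    targets = torch.zeros(images.size(0), num_classes)
    targets[torch.arange(images.size(0)), labels] = 1.0
    targets[torch.arange(images.size(0)), labels[perm]] = 1.0
    return mixed, targets

# BCE treats each class as an independent binary decision,
# so "both classes present" is a valid target.
images = torch.randn(4, 3, 384, 384)
labels = torch.tensor([0, 2, 1, 3])
mixed, targets = mixup_multilabel(images, labels, num_classes=55)
logits = torch.randn(4, 55)
loss = F.binary_cross_entropy_with_logits(logits, targets)
```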

#### 2. Short-Schedule Optimization

For 50-epoch fine-tuning, aggressive regularization hurts more than helps [2]:

- MixUp α = 0.1 (not 0.2-0.8)
- CutMix α = 1.0
- RandAugment magnitude 6 (not 7-9)
- No label smoothing
- No stochastic depth
- Test crop ratio 0.95 (not 0.875) [6]

#### 3. Resolution Matters

GeoGuessr images are ~1.5k pixels wide. Critical details (text on signs, road markings) are lost at 224px. Training at 384px preserves these geographic cues [6].

#### 4. Class Imbalance Handling

Japan and France together comprise ~20% of the training data. We used `WeightedRandomSampler` so that each class is sampled equally often.
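Inverse-frequency sampling can be set up as below (a self-contained sketch with toy labels standing in for the real dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy labels: class 0 is heavily over-represented (like Japan/France).
labels = torch.tensor([0] * 80 + [1] * 15 + [2] * 5)

# Weight each sample by the inverse frequency of its class, so every
# class contributes equal total probability mass to the sampler.
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),
    replacement=True,  # rare-class samples are drawn repeatedly
)
dataset = TensorDataset(torch.arange(len(labels)), labels)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)
```

With replacement enabled, each epoch sees rare classes about as often as common ones, at the cost of repeating rare samples.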

### Training Configuration

- **Hardware:** NVIDIA RTX 3080 Ti (12 GB)
- **Batch size:** 16 (hardware constrained)
- **Epochs:** 50
- **Optimizer:** AdamW (lr=5e-5, weight_decay=0.01)
- **Schedule:** cosine decay with 10% warmup
- **Image size:** 384
- **Dropout:** 0.1
- **Augmentation:** RandAugment(n=2, m=6) + MixUp(0.1) + CutMix(1.0)
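The cosine-with-warmup schedule can be expressed as a simple function of the step index (illustrative only; the exact scheduler implementation used in training may differ):

```python
import math

def lr_at_step(step, total_steps, base_lr=5e-5, warmup_frac=0.1):
    """Linear warmup over the first 10% of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```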

## Usage

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor (trust_remote_code required for the custom model)
processor = AutoImageProcessor.from_pretrained("Simon-Kotchou/convnext-dinov3-small-geoguessr-25k-384")
model = AutoModelForImageClassification.from_pretrained(
    "Simon-Kotchou/convnext-dinov3-small-geoguessr-25k-384",
    trust_remote_code=True,
)

# Load and preprocess the image
image = Image.open("street_view.jpg")
inputs = processor(images=image, return_tensors="pt")

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    predicted_class = logits.argmax(-1).item()
    confidence = probs[0, predicted_class].item()

# Map to country name (id2label keys are ints after config loading)
country = model.config.id2label[predicted_class]
print(f"Predicted: {country} ({confidence:.1%})")

# Get top-3 predictions
top3 = torch.topk(probs, 3, dim=-1)
for i, (prob, idx) in enumerate(zip(top3.values[0], top3.indices[0])):
    country = model.config.id2label[idx.item()]
    print(f"  {i+1}. {country}: {prob.item():.1%}")
```

## Limitations and Future Work

### Why Latvia Fails Completely

Latvia achieves 0% accuracy; it represents only 0.35% of the training data, and the previous ConvNeXtV2 model also struggled with it (14.29%). This suggests:

1. **Visual similarity to neighbors:** Latvia shares visual characteristics with Lithuania, Estonia, and other Baltic states
2. **Insufficient training signal:** ~88 training samples may be below the threshold for learning discriminative features
3. **Lack of distinctive markers:** unlike Japan (unique script) or Australia (distinctive landscapes), Baltic countries lack globally unique visual signatures

### Potential Improvements

| Approach | Rationale |
|---|---|
| Focal loss | Down-weight easy examples; focus learning on hard tail classes |
| EMA | Stabilize training; often helps with class imbalance |
| Higher resolution | 518px or 768px to capture fine text details |
| MetaCLIP/SigLIP2 backbone | Caption-pretrained models might capture semantic country associations |
| Hierarchical classification | Predict region → country to reduce confusion within regions |
| Longer training | 100-200 epochs with a matching regularization schedule |
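Of these, focal loss is the most self-contained to prototype. A minimal single-label sketch, following the standard formulation of Lin et al. (not code from this project):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal loss: scales cross-entropy by (1 - p_t)^gamma, so
    well-classified examples contribute little and gradient signal
    concentrates on hard (often rare-class) samples."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()  # probability assigned to the true class
    return (-((1 - pt) ** gamma) * log_pt).mean()
```

With `gamma=0` this reduces exactly to cross-entropy; larger `gamma` suppresses easy examples more aggressively.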

### Hardware Constraints

Training was conducted on a single RTX 3080 Ti with 12GB VRAM, limiting batch size to 16. Larger batches with gradient accumulation or multi-GPU training could improve optimization dynamics.

## Key Takeaways

1. **Pretraining distribution > model capacity:** a 50M model pretrained on 1.7B diverse images beats an 87M model pretrained on ImageNet-22k
2. **Modern ConvNets remain competitive:** in the age of Vision Transformers, ConvNeXt architectures offer strong performance with better efficiency and a spatial inductive bias
3. **DINOv3 transfers remarkably well:** self-supervised pretraining on diverse monocular images yields representations that generalize to geographic visual reasoning
4. **The long tail is hard:** even with weighted sampling, rare classes with subtle distinguishing features remain challenging

## Citation

If you use this model, please cite:

```bibtex
@misc{dinov3-geoguessr-2025,
  title={DINOv3 ConvNeXt-Small Fine-tuned for GeoGuessr Country Classification},
  author={Simon Kotchou},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/Simon-Kotchou/dinov3-convnext-small-geoguessr-25k-384}
}
```

## References

[1] Siméoni et al. "DINOv3." arXiv:2508.10104, 2025.

[2] Wightman, Touvron, Jégou. "ResNet Strikes Back: An improved training procedure in timm." arXiv:2110.00476, 2021.

[3] Liu et al. "A ConvNet for the 2020s." CVPR 2022. arXiv:2201.03545.

[4] Zhang et al. "MixUp: Beyond Empirical Risk Minimization." ICLR 2018. arXiv:1710.09412.

[5] Yun et al. "CutMix: Regularization Strategy to Train Strong Classifiers." ICCV 2019. arXiv:1905.04899.

[6] Touvron et al. "Fixing the train-test resolution discrepancy." NeurIPS 2019. arXiv:1906.06423.

## Model Card Authors

Simon Kotchou

## Acknowledgments

- Meta AI for the DINOv3 foundation models
- The GeoGuessr dataset creators
- The HuggingFace and PyTorch communities