DART Student Backbones

Distilled lightweight backbones for DART (Detect Anything in Real Time), a training-free framework that converts SAM3 into a real-time open-vocabulary multi-class detector.

For more details, see the paper: Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection.

These student backbones replace SAM3's 439M-parameter ViT-H/14 backbone with lightweight alternatives via adapter distillation: a small FPN adapter (~5M trainable parameters) is trained to project student features into the ViT-H feature space while the SAM3 encoder-decoder remains frozen.

Models

| Model | Backbone params | COCO AP | AP50 | AP75 | AP_S | AP_L | Backbone latency | Pipelined FPS |
|---|---|---|---|---|---|---|---|---|
| DART (ViT-H teacher) | 439M | 55.8 | 73.4 | 61.5 | 40.3 | 70.7 | 53 ms | 15.8 |
| DART-Pruned-16 | 220M | 53.6 | 70.6 | 58.9 | 37.7 | 68.8 | 26.6 ms | 37.6 |
| DART-RepViT-M2.3 | 8.2M | 38.7 | 53.1 | 42.3 | 22.6 | 49.9 | 13.9 ms | 55.8 |
| DART-TinyViT-21M | 21M | 30.1 | 42.4 | 32.6 | 17.4 | 37.8 | 12.2 ms | 57.8 |
| DART-EfficientViT-L2 | 9.2M | 21.7 | 31.5 | 23.5 | 13.7 | 24.2 | 10.7 ms | 62.5 |
| DART-EfficientViT-L1 | 5.3M | 16.3 | 24.2 | 17.4 | 10.6 | 17.3 | 10.4 ms | 64.2 |

All results on COCO val2017 (5,000 images, 80 classes, 1008x1008 resolution) using TRT FP16 backbone + encoder-decoder on a single RTX 4080. Pipelined FPS measured at 4 classes. Teacher uses training-free multi-class detection (no detection training); students use adapter distillation with frozen encoder-decoder; Pruned-16 uses self-distillation with 16 of 32 ViT blocks removed.

Pruned Backbone

DART-Pruned-16 removes 16 of 32 ViT-H blocks and recovers quality via self-distillation. The full backbone serves as a frozen teacher while the pruned copy is trained with MSE loss on FPN features.
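The self-distillation objective reduces to a plain MSE between the pruned copy's FPN features and the frozen full backbone's. A minimal sketch (the helper name and toy tensor shapes are ours, not the repo's API):

```python
import torch
import torch.nn.functional as F

def self_distill_loss(pruned_feats, teacher_feats):
    # Mean MSE over the FPN pyramid levels (illustrative helper).
    return sum(F.mse_loss(p, t)
               for p, t in zip(pruned_feats, teacher_feats)) / len(pruned_feats)

# Toy tensors standing in for the 3-level FPN output at 1008x1008 input.
teacher = [torch.randn(1, 256, n, n) for n in (288, 144, 72)]
pruned = [t + 0.1 * torch.randn_like(t) for t in teacher]
loss = self_distill_loss(pruned, teacher)
```

Only the pruned copy receives gradients; the full backbone acts purely as a feature target.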

Training

# 8xH100 via SLURM
srun --ntasks=1 torchrun --nproc_per_node=8 scripts/distill.py \
    --data-dir /path/to/coco/train2017 \
    --checkpoint sam3.pt \
    --phase prune \
    --skip-blocks "5,10,12,14,17,18,19,20,21,22,24,25,26,27,28,30" \
    --epochs 100 --batch-size 32 --lr 1e-4 \
    --output-dir skipblocks_distill

Export and evaluate

# Export pruned backbone via HF path (fused attention kernels)
PYTHONIOENCODING=utf-8 python scripts/export_hf_backbone.py \
    --image x.jpg \
    --output-onnx onnx_hf_backbone_1008_pruned/hf_backbone.onnx \
    --output-engine hf_backbone_1008_pruned_fp16.engine \
    --skip-blocks "5,10,12,14,17,18,19,20,21,22,24,25,26,27,28,30"

# Evaluate on COCO val2017
PYTHONIOENCODING=utf-8 python scripts/eval_coco_official.py \
    --images-dir D:/val2017 \
    --ann-file D:/coco2017labels/coco/annotations/instances_val2017.json \
    --checkpoint sam3.pt \
    --pruned-checkpoint distilled/pruned_16blocks.pt \
    --configs "pruned16_1008=trt:hf_backbone_1008_pruned_fp16.engine;encdec:enc_dec_1008_c16_presence_fp16.engine;imgsz:1008"

Block selection

Blocks were selected by greedy importance analysis (scripts/analyze_block_importance.py). Blocks in the later layers (17-30) are the least important, while early blocks (0-8) and the global attention blocks (7, 15, 23, 31) are critical. The pruned checkpoint stores skip_blocks metadata, so the pruning pattern is applied automatically when the checkpoint is loaded.
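The greedy selection loop behind that analysis can be sketched in a few lines; score_fn and the toy cost below are hypothetical stand-ins for the script's actual evaluation, not its API:

```python
def greedy_prune(num_blocks, n_remove, score_fn):
    # Greedily drop the block whose removal hurts a validation score the
    # least. score_fn(skipped) -> loss (lower is better); this signature
    # is a stand-in for the real evaluation routine.
    skipped = []
    for _ in range(n_remove):
        candidates = [b for b in range(num_blocks) if b not in skipped]
        best = min(candidates, key=lambda b: score_fn(skipped + [b]))
        skipped.append(best)
    return sorted(skipped)

# Toy cost: early blocks (0-4) and global-attention blocks are expensive
# to drop; later local blocks are cheap (purely illustrative numbers).
def toy_cost(skip):
    return sum(1.0 if b in (7, 15, 23, 31) or b < 5 else 0.01 * (32 - b)
               for b in skip)

picked = greedy_prune(32, 4, toy_cost)
```

With the toy cost, the loop drops the latest non-global blocks first, mirroring the trend described above.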

Distilled Student Architecture

Each student model consists of:

  1. A frozen ImageNet-pretrained backbone from timm (features_only=True, 3 stages)
  2. A trained FPN adapter (3 levels of Conv1x1 + bilinear interpolation + Conv3x3) that maps backbone features to SAM3's expected FPN format: (B, 256, 288, 288), (B, 256, 144, 144), (B, 256, 72, 72)
  3. The original frozen SAM3 encoder-decoder (unchanged)
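The adapter in step 2 can be sketched per level as a 1x1 projection, a bilinear resize, and a 3x3 refinement. Channel counts here are illustrative; the real adapter matches each timm stage's output width:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNAdapterLevel(nn.Module):
    # One adapter level: Conv1x1 projection -> bilinear resize -> Conv3x3.
    def __init__(self, in_ch, out_size, out_ch=256):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.refine = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.out_size = out_size

    def forward(self, x):
        x = self.proj(x)
        x = F.interpolate(x, size=(self.out_size, self.out_size),
                          mode="bilinear", align_corners=False)
        return self.refine(x)

# Toy 3-stage backbone features (channels/sizes assumed) mapped to
# SAM3's expected FPN shapes.
feats = [torch.randn(1, c, s, s) for c, s in [(64, 126), (128, 63), (256, 32)]]
levels = [FPNAdapterLevel(c, n) for c, n in [(64, 288), (128, 144), (256, 72)]]
fpn = [lvl(f) for lvl, f in zip(levels, feats)]
```

Only these adapter parameters are trained; the backbone and the SAM3 encoder-decoder stay frozen.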

Usage

Loading a student model

import torch
from sam3.distillation.sam3_student import build_sam3_student_model

model = build_sam3_student_model(
    backbone_config="repvit_m2_3",       # or efficientvit_l1, efficientvit_l2, tiny_vit_21m
    teacher_checkpoint="sam3.pt",        # SAM3 weights (encoder-decoder)
    device="cuda",
)

ckpt = torch.load("distilled/repvit_m2_3_distilled.pt", map_location="cuda")
model.backbone.student_backbone.load_state_dict(ckpt["student_state_dict"])
model.eval()

Inference

from sam3.model.sam3_multiclass_fast import Sam3MultiClassPredictorFast

predictor = Sam3MultiClassPredictorFast(model, device="cuda")
predictor.set_classes(["person", "car", "dog"])
state = predictor.set_image(image)  # PIL Image
results = predictor.predict(state, confidence_threshold=0.3)
# results: dict with 'boxes', 'scores', 'class_ids', 'class_names'
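A minimal way to consume the results dict, shown on toy values (the field names follow the comment above, and the xyxy pixel-coordinate box format is an assumption; check your version's output):

```python
# Toy results standing in for predictor.predict(...) output.
results = {
    "boxes": [[10.0, 20.0, 110.0, 220.0]],  # xyxy pixel coords (assumed)
    "scores": [0.91],
    "class_ids": [0],
    "class_names": ["person"],
}

detections = [
    f"{name}: {score:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})"
    for (x1, y1, x2, y2), score, name in zip(
        results["boxes"], results["scores"], results["class_names"]
    )
]
print(detections[0])
```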

COCO evaluation

PYTHONIOENCODING=utf-8 python scripts/eval_all_students.py

This runs scripts/eval_coco.py for all four student models and produces coco_eval_all_students.json.

Training

All adapters were trained on COCO train2017 (118K images; no annotations are used) for 5 epochs with AdamW (lr=1e-3, weight decay=0.01, cosine schedule) using a multi-scale MSE loss between student and teacher FPN features (level weights: 0.15, 0.20, 0.65). Training takes approximately 2 GPU-hours on a single RTX 4080.

python scripts/distill.py \
    --data-dir /path/to/coco/train2017 \
    --checkpoint sam3.pt \
    --backbone repvit_m2_3 \
    --epochs 5 --batch-size 2 --lr 1e-3
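The multi-scale objective above reduces to a per-level weighted MSE over the three FPN levels. A minimal sketch (function name is ours; the weights are the ones stated in the recipe):

```python
import torch
import torch.nn.functional as F

LEVEL_WEIGHTS = (0.15, 0.20, 0.65)  # per-level weights from the recipe above

def multiscale_mse(student_feats, teacher_feats, weights=LEVEL_WEIGHTS):
    # Weighted sum of per-level MSEs between student and teacher FPN features.
    return sum(w * F.mse_loss(s, t)
               for w, s, t in zip(weights, student_feats, teacher_feats))

feats = [torch.randn(1, 256, n, n) for n in (288, 144, 72)]
zero = multiscale_mse(feats, feats)
```

The heavy weight on the coarsest level (0.65 at 72x72) prioritizes the features the decoder relies on most.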

Supported backbones

| Config name | timm model | Stages |
|---|---|---|
| efficientvit_l1 | efficientvit_l1.r224_in1k | (0, 1, 2) |
| efficientvit_l2 | efficientvit_l2.r384_in1k | (0, 1, 2) |
| repvit_m2_3 | repvit_m2_3.dist_450e_in1k | (0, 1, 2) |
| tiny_vit_21m | tiny_vit_21m_224.dist_in22k_ft_in1k | (0, 1, 2) |
| vit_base | vit_base_patch16_224.augreg2_in21k_ft_in1k | (0, 1, 2) |
| vit_base_dinov3 | vit_base_patch16_dinov3.lvd1689m | (0, 1, 2) |

Checkpoint format

Each .pt file contains:

{
    "epoch": 5,
    "loss": float,
    "adapter_state_dict": { ... },   # FPN adapter weights only
    "student_state_dict": { ... },   # Full student backbone + adapter state
}

TRT export

python scripts/export_student_trt.py --models repvit_m2_3 --imgsz 1008

Produces ONNX and TRT FP16 engine files. The encoder-decoder is exported separately (split-engine design) to preserve open-vocabulary flexibility.

Requirements

  • PyTorch >= 2.7.0
  • timm
  • SAM3 checkpoint (sam3.pt)
  • TensorRT >= 10.9 (for TRT deployment)

Citation

@misc{turkcan2026detectrealtimesingleprompt,
      title={Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection}, 
      author={Mehmet Kerem Turkcan},
      year={2026},
      eprint={2603.11441},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.11441}, 
}

License

The student adapter weights are released under the same license as SAM3. The underlying backbone weights (RepViT, TinyViT, EfficientViT) retain their original licenses from timm.
