V-JEPA Nested Agent (L40 Training)

This model is an agent built on a V-JEPA 2 vision encoder, with a Nested Learning memory module and a Flow Matching action head. It was trained on the LeRobot so101-table-cleanup dataset.

Architecture

The agent consists of three main components:

  1. Vision Encoder: V-JEPA 2 (Video Joint Embedding Predictive Architecture)

    • Processes video input of shape (B, T, C, H, W).
    • Provides pretrained representations for robust visual features.
    • ~327M parameters.
  2. Memory: Nested Learning Module

    • Learns hierarchical temporal abstractions.
    • Captures long-horizon dependencies in the task.
    • ~813K parameters.
  3. Action Head: Flow Matching (Diffusion-like)

    • Conditional Flow Matching policy.
    • Predicts action trajectories based on visual and memory embeddings.
    • Uses max_state_dim=14 and action_dim=7.
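
The conditional flow matching idea behind the action head can be illustrated with a minimal NumPy sketch. It is not the model's implementation: the linear noise-to-action path, the time convention (noise at t=0, actions at t=1), the Euler sampler, and the names flow_matching_targets, masked_mse, and euler_sample are all assumptions made for illustration. In the real agent the velocity would be predicted by a network conditioned on the visual and memory embeddings.

```python
import numpy as np


def flow_matching_targets(actions, rng):
    """Build one training example for conditional flow matching.

    actions: array of shape (B, T, action_dim).
    Returns the noised trajectory x_t, the sampled times t, and the
    target velocity. Assumes a straight-line path from noise to actions.
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1, 1))  # one time per sample
    x_t = (1.0 - t) * noise + t * actions           # point on the path
    target_velocity = actions - noise               # d x_t / d t
    return x_t, t, target_velocity


def masked_mse(pred, target, mask):
    """Mean squared error over the steps where mask == 1.

    mask has shape (B, T, 1) and broadcasts over the action dimension.
    """
    squared_error = (pred - target) ** 2 * mask
    denom = max(float(mask.sum()) * pred.shape[-1], 1.0)
    return float(squared_error.sum() / denom)


def euler_sample(velocity_fn, shape, rng, num_steps=10):
    """Integrate a velocity field from t=0 (noise) to t=1 with Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x
```

During training, a velocity network would be fit by minimizing masked_mse between its prediction at (x_t, t) and target_velocity; at inference, euler_sample would be called with that network as velocity_fn.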

Training Details

  • Dataset: so101-table-cleanup (LeRobot)
  • Hardware: 2x NVIDIA L40 GPUs
  • Framework: PyTorch, Hugging Face Trainer
  • Precision: bfloat16

Usage

This model requires the custom VJEPANestedAgent code to load, so the gr00t.model.vjepa_nested_pipeline module must be importable.

from safetensors.torch import load_file

from gr00t.model.vjepa_nested_pipeline import VJEPANestedAgent, VJEPANestedConfig

# Load the config from the Hub
config = VJEPANestedConfig.from_pretrained("cbjp404/vjepa-nested-agent-l40")

# Initialize the model from the config
model = VJEPANestedAgent(config)

# Load the weights from a local copy of model.safetensors
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)

Inputs

The model expects a dictionary with the following keys:

  • video: (B, T, C, H, W)
  • state: (B, state_dim), padded to max_state_dim=14
  • action: (B, T, action_dim), required for training
  • action_mask: (B, T, 1)
  • embodiment_id: (B,)
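
As a rough illustration of these shapes, the sketch below builds a dummy batch with NumPy. Only max_state_dim=14 and action_dim=7 come from this card. The helper names pad_state and make_dummy_batch, the zero right-padding, the dtypes, the image size, the raw state_dim of 6, and the use of a separate horizon for the action tensors are assumptions. NumPy is used so the sketch runs without a GPU stack; the arrays would need converting (for example with torch.from_numpy) before being passed to the model.

```python
import numpy as np

MAX_STATE_DIM = 14  # from this card
ACTION_DIM = 7      # from this card


def pad_state(state, max_state_dim=MAX_STATE_DIM):
    """Right-pad a (B, state_dim) array with zeros to (B, max_state_dim).

    Zero right-padding is an assumption; the card only says "padded to 14".
    """
    state = np.asarray(state, dtype=np.float32)
    batch_size, state_dim = state.shape
    if state_dim > max_state_dim:
        raise ValueError(f"state_dim {state_dim} exceeds {max_state_dim}")
    padded = np.zeros((batch_size, max_state_dim), dtype=np.float32)
    padded[:, :state_dim] = state
    return padded


def make_dummy_batch(batch_size=2, num_frames=4, horizon=8,
                     image_size=64, state_dim=6):
    """Build a placeholder input dictionary with the documented keys.

    All sizes here are hypothetical defaults chosen for illustration.
    """
    return {
        "video": np.zeros(
            (batch_size, num_frames, 3, image_size, image_size),
            dtype=np.float32,
        ),
        "state": pad_state(np.ones((batch_size, state_dim))),
        "action": np.zeros((batch_size, horizon, ACTION_DIM), dtype=np.float32),
        "action_mask": np.ones((batch_size, horizon, 1), dtype=np.float32),
        "embodiment_id": np.zeros((batch_size,), dtype=np.int64),
    }
```

Printing the shape of each entry of make_dummy_batch() is a quick way to compare a real data loader's output against the layout listed above.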