V-JEPA Nested Agent (L40 Training)
This model is a V-JEPA 2-based agent with a Nested Learning memory module and a Flow Matching action head, trained on the LeRobot so101-table-cleanup dataset.
Architecture
The agent consists of three main components:
Vision Encoder: V-JEPA 2 (Video Joint Embedding Predictive Architecture)
- Processes video input of shape (B, T, C, H, W).
- Pretrained representation for robust visual features.
- ~327M parameters.
Memory: Nested Learning Module
- Learns hierarchical temporal abstractions.
- Captures long-horizon dependencies in the task.
- ~813K parameters.
Action Head: Flow Matching (Diffusion-like)
- Conditional Flow Matching policy.
- Predicts action trajectories based on visual and memory embeddings.
- Uses max_state_dim=14 and action_dim=7.
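To make the action head description more concrete, here is a minimal sketch of conditional flow matching for action trajectories. It is not the model's actual implementation: `TinyVelocityNet`, `cond_dim=32`, the straight-line noise-to-action path, and the fixed-step Euler sampler are all assumptions chosen for illustration. Only `action_dim=7` and the `(B, T, 1)` action mask come from this card.

```python
import torch
import torch.nn as nn


class TinyVelocityNet(nn.Module):
    """Hypothetical stand-in for the action head: predicts a velocity field."""

    def __init__(self, action_dim: int = 7, cond_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, T, action_dim), t: (B,), cond: (B, cond_dim)
        steps = x_t.shape[1]
        t_feat = t[:, None, None].expand(-1, steps, 1)
        c_feat = cond[:, None, :].expand(-1, steps, -1)
        return self.net(torch.cat([x_t, t_feat, c_feat], dim=-1))


def flow_matching_loss(model, actions, cond, action_mask):
    """Masked MSE between predicted and target velocity on a linear path."""
    x0 = torch.randn_like(actions)                       # noise sample
    t = torch.rand(actions.shape[0], device=actions.device)
    t_b = t[:, None, None]
    x_t = (1.0 - t_b) * x0 + t_b * actions               # point on the path
    target_v = actions - x0                              # constant velocity of that path
    pred_v = model(x_t, t, cond)
    mask = action_mask.to(actions.dtype)                 # (B, T, 1), broadcasts over action_dim
    sq_err = (pred_v - target_v) ** 2 * mask
    denom = (mask.sum() * actions.shape[-1]).clamp(min=1.0)
    return sq_err.sum() / denom


@torch.no_grad()
def sample_actions(model, cond, horizon, action_dim=7, steps=10):
    """Integrate the learned velocity field from noise (t=0) to actions (t=1)."""
    x = torch.randn(cond.shape[0], horizon, action_dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0],), i * dt, device=cond.device)
        x = x + dt * model(x, t, cond)                   # explicit Euler step
    return x
```

In this sketch, `cond` plays the role of the combined visual and memory embedding. During training, `flow_matching_loss` would be minimized over batches; at inference, `sample_actions` turns Gaussian noise into an action trajectory in a handful of integration steps.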
Training Details
- Dataset: so101-table-cleanup (LeRobot)
- Hardware: 2x NVIDIA L40 GPUs
- Framework: PyTorch, HuggingFace Trainer
- Precision: bfloat16
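As a rough illustration of what bfloat16 precision means in PyTorch, the sketch below runs a toy layer under autocast. The layer and tensor sizes are placeholders, and the card does not say whether training used autocast-style mixed precision or weights stored directly in bfloat16. With the HuggingFace Trainer, bfloat16 mixed precision is usually requested with `bf16=True` in `TrainingArguments`.

```python
import torch
import torch.nn as nn

layer = nn.Linear(14, 7)   # toy layer for illustration, not part of the agent
x = torch.randn(4, 14)

# Under autocast, eligible ops such as linear layers compute in bfloat16,
# while the stored parameters remain float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)

print(y.dtype, layer.weight.dtype)
```

On GPU hardware the same pattern applies with `device_type="cuda"`.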
Usage
Loading this model requires the custom VJEPANestedAgent code (the gr00t.model.vjepa_nested_pipeline module) to be importable.
from safetensors.torch import load_file

from gr00t.model.vjepa_nested_pipeline import VJEPANestedAgent, VJEPANestedConfig

# Load config
config = VJEPANestedConfig.from_pretrained("cbjp404/vjepa-nested-agent-l40")
# Initialize model
model = VJEPANestedAgent(config)
# Load weights from a local copy of model.safetensors
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
Inputs
The model expects a dictionary input with:
- video: (B, T, C, H, W)
- state: (B, state_dim) (padded to 14)
- action: (B, T, action_dim) (for training)
- action_mask: (B, T, 1)
- embodiment_id: (B,)
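The input layout above can be sketched as a dummy batch. Several details here are assumptions rather than facts from this card: the state is padded with trailing zeros, the raw state has 6 dimensions, frames are 224x224 RGB, and `embodiment_id` is an integer tensor. The card uses T for both the video and action lengths, so the sketch keeps them as separate arguments that default to the same value, since it is not stated whether they must match. The helper names `pad_state` and `make_dummy_batch` are hypothetical.

```python
import torch
import torch.nn.functional as F

MAX_STATE_DIM = 14   # from the card
ACTION_DIM = 7       # from the card


def pad_state(state, max_state_dim=MAX_STATE_DIM):
    """Right-pad the last dimension with zeros up to max_state_dim (assumed scheme)."""
    pad = max_state_dim - state.shape[-1]
    if pad < 0:
        raise ValueError(f"state_dim {state.shape[-1]} exceeds {max_state_dim}")
    return F.pad(state, (0, pad))


def make_dummy_batch(batch_size=2, num_frames=8, action_steps=8,
                     raw_state_dim=6, image_size=224):
    """Build random tensors with the shapes listed in the Inputs section."""
    return {
        "video": torch.rand(batch_size, num_frames, 3, image_size, image_size),
        "state": pad_state(torch.randn(batch_size, raw_state_dim)),
        "action": torch.randn(batch_size, action_steps, ACTION_DIM),
        "action_mask": torch.ones(batch_size, action_steps, 1),
        "embodiment_id": torch.zeros(batch_size, dtype=torch.long),
    }


batch = make_dummy_batch()
for name, tensor in batch.items():
    print(name, tuple(tensor.shape))
```

A batch like this is mainly useful for checking shapes before wiring up a real dataloader. How the model itself consumes the dictionary is defined by the VJEPANestedAgent code and is not shown here.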