NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data

Building simulators for robots has been a long-standing problem. Traditional engines require manual coding of physics and perfect 3D models. NVIDIA is changing this with DreamDojo, a fully open-source, generalizable robot world model. Instead of using a physics engine, DreamDojo ‘dreams’ the results of robot actions directly in pixels.

https://arxiv.org/pdf/2602.06949

Scaling Robotics with 44k+ Hours of Human Experience

The biggest hurdle for AI in robotics is data. Collecting robot-specific data is expensive and slow. DreamDojo addresses this by learning from 44k+ hours of egocentric human videos. This dataset, called DreamDojo-HV, is the largest of its kind for world model pretraining.

It features 6,015 unique tasks across 1M+ trajectories.
The data covers 9,869 unique scenes and 43,237 unique objects.
Pretraining used 100,000 NVIDIA H100 GPU hours to build 2B and 14B model variants.

Humans have already mastered complex physics, such as pouring liquids or folding clothes. DreamDojo uses this human data to give robots a ‘common sense’ understanding of how the world works.

Bridging the Gap with Latent Actions

Human videos do not contain robot motor commands. To make these videos ‘robot-readable,’ NVIDIA’s research team introduced continuous latent actions. This system uses a spatiotemporal Transformer VAE to extract actions directly from pixels.

The VAE encoder takes 2 consecutive frames and outputs a 32-dimensional latent vector.
This vector represents the most critical motion between frames.
The design creates an information bottleneck that disentangles action from visual context.
This allows the model to learn physics from humans and apply it to different robot bodies.
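
To make the encoder interface concrete, here is a minimal sketch of the shape of the computation: two consecutive frames in, a 32-dimensional latent action out. It is not NVIDIA's implementation. The class `ToyLatentActionEncoder`, the function `sample_latent_action`, the flattened-frame input, and the random linear projections are all hypothetical stand-ins for the spatiotemporal Transformer; only the two-frame input and the 32-dimensional output come from the description above.

```python
import math
import random
from typing import List, Sequence, Tuple

LATENT_DIM = 32  # latent action size stated in the article


class ToyLatentActionEncoder:
    """Hypothetical linear stand-in for the spatiotemporal Transformer VAE encoder."""

    def __init__(self, frame_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.frame_size = frame_size
        in_dim = 2 * frame_size  # two consecutive frames, flattened and concatenated
        scale = 1.0 / math.sqrt(in_dim)
        # Separate projections for the mean and log-variance of the latent action.
        self.w_mu = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
                     for _ in range(LATENT_DIM)]
        self.w_logvar = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
                         for _ in range(LATENT_DIM)]

    def encode(self, frame_t: Sequence[float],
               frame_next: Sequence[float]) -> Tuple[List[float], List[float]]:
        """Map a pair of consecutive frames to (mu, logvar), each of size 32."""
        if len(frame_t) != self.frame_size or len(frame_next) != self.frame_size:
            raise ValueError("both frames must have frame_size elements")
        x = list(frame_t) + list(frame_next)
        mu = [sum(w * v for w, v in zip(row, x)) for row in self.w_mu]
        logvar = [sum(w * v for w, v in zip(row, x)) for row in self.w_logvar]
        return mu, logvar


def sample_latent_action(mu: Sequence[float], logvar: Sequence[float],
                         rng: random.Random) -> List[float]:
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]
```

Because the latent has only 32 dimensions, far fewer than the pixels in two frames, it cannot carry the full visual context. That is the bottleneck the design relies on.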

Better Physics Through Architecture

DreamDojo is based on the Cosmos-Predict2.5 latent video diffusion model. It uses the WAN2.2 tokenizer, which has a temporal compression ratio of 4. The team improved the architecture with 3 key features:

Relative Actions: The model uses joint deltas instead of absolute poses. This makes it easier for the model to generalize across different trajectories.
Chunked Action Injection: It injects 4 consecutive actions into each latent frame. This aligns the actions with the tokenizer’s compression ratio and fixes causality confusion.
Temporal Consistency Loss: A new loss function matches predicted frame velocities to ground-truth transitions. This reduces visual artifacts and keeps objects physically consistent.
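
The three ideas above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the released training code. The function names, the list-of-floats representation, and the choice of mean squared error for the velocity term are assumptions. Only the chunk size of 4 and the delta and velocity concepts come from the text.

```python
from typing import List, Sequence

TEMPORAL_COMPRESSION = 4  # tokenizer temporal compression ratio stated in the article


def to_relative_actions(joint_poses: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert absolute joint poses into per-step joint deltas."""
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(joint_poses, joint_poses[1:])]


def chunk_actions(actions: Sequence[Sequence[float]],
                  chunk_size: int = TEMPORAL_COMPRESSION) -> List[List[float]]:
    """Concatenate each run of chunk_size actions into one vector per latent frame."""
    if len(actions) % chunk_size != 0:
        raise ValueError("number of actions must be a multiple of chunk_size")
    return [[x for action in actions[i:i + chunk_size] for x in action]
            for i in range(0, len(actions), chunk_size)]


def frame_velocities(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Difference between each frame and the one before it."""
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(frames, frames[1:])]


def temporal_consistency_loss(pred_frames: Sequence[Sequence[float]],
                              true_frames: Sequence[Sequence[float]]) -> float:
    """Assumed form: mean squared error between predicted and true velocities."""
    if len(pred_frames) != len(true_frames) or len(pred_frames) < 2:
        raise ValueError("need two equally long sequences of at least 2 frames")
    pred_v = frame_velocities(pred_frames)
    true_v = frame_velocities(true_frames)
    sq_errors = [(p - t) ** 2
                 for pv, tv in zip(pred_v, true_v)
                 for p, t in zip(pv, tv)]
    return sum(sq_errors) / len(sq_errors)
```

With 9 absolute poses, `to_relative_actions` yields 8 deltas, and `chunk_actions` groups them into 2 vectors, one per latent frame.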

Distillation for 10.81 FPS Real-Time Interaction

A simulator is only useful if it is fast. Standard diffusion models require too many denoising steps for real-time use. The NVIDIA team used a Self Forcing distillation pipeline to solve this.

The distillation training was carried out on 64 NVIDIA H100 GPUs.
The ‘student’ model reduces denoising from 35 steps down to 4 steps.
The final model achieves a real-time speed of 10.81 FPS.
It is stable for continuous rollouts of 60 seconds (600 frames).
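
The arithmetic behind these figures is easy to check. The snippet below only restates the numbers quoted above. The variable names are illustrative, and `STEPS_BEFORE_DISTILLATION` is just a label for the 35-step starting point.

```python
# Figures quoted in the article.
STEPS_BEFORE_DISTILLATION = 35
STEPS_AFTER_DISTILLATION = 4
DISTILLED_FPS = 10.81
ROLLOUT_SECONDS = 60
ROLLOUT_FRAMES = 600

# How many times fewer denoising steps the distilled model runs.
step_reduction = STEPS_BEFORE_DISTILLATION / STEPS_AFTER_DISTILLATION

# Average wall-clock time available per generated frame, in milliseconds.
frame_budget_ms = 1000.0 / DISTILLED_FPS

# Frame rate implied by the quoted rollout length.
rollout_fps = ROLLOUT_FRAMES / ROLLOUT_SECONDS

print(f"{step_reduction:.2f}x fewer denoising steps")
print(f"{frame_budget_ms:.1f} ms per frame at {DISTILLED_FPS} FPS")
print(f"{rollout_fps:.0f} FPS implied by {ROLLOUT_FRAMES} frames in {ROLLOUT_SECONDS} s")
```

The 600-frame, 60-second rollout works out to 10 frames per second, so generating at 10.81 FPS is slightly faster than that rate.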

Unlocking Downstream Applications

DreamDojo’s speed and accuracy enable several advanced applications for AI engineers.

1. Reliable Policy Evaluation

Testing robots in the real world is risky. DreamDojo acts as a high-fidelity simulator for benchmarking.

Its simulated success rates show a Pearson correlation of 0.995 with real-world outcomes.
The Mean Maximum Rank Violation (MMRV) is only 0.003.
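
For readers unfamiliar with the two metrics, the sketch below computes both from paired per-policy success rates. The Pearson formula is standard. The MMRV function follows the definition commonly used in prior simulated-evaluation work (for each policy, the largest real-world performance gap among the pairs that the simulator ranks in the wrong order, averaged over policies). Treat that reading as an assumption, since the article does not spell the definition out. Function names are illustrative.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mmrv(real: Sequence[float], sim: Sequence[float]) -> float:
    """Mean Maximum Rank Violation (assumed definition, see lead-in).

    A pair (i, j) is a violation when the simulator orders the two policies
    differently from the real world; its size is the real-world gap.
    """
    n = len(real)
    if n != len(sim) or n == 0:
        raise ValueError("need two equally long, non-empty sequences")
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n
```

When the simulator ranks every pair of policies in the same order as the real world, `mmrv` returns 0. A value near 0 therefore means that any ranking mistakes involve policies whose real-world performance is almost identical.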

2. Model-Based Planning

Robots can use DreamDojo to ‘look ahead.’ A robot can simulate multiple action sequences and pick the best one.

In a fruit-packing task, this improved real-world success rates by 17%.
Compared to random sampling, it provided a 2x increase in success.
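
The ‘look ahead’ loop can be sketched as sample, simulate, score, select. Everything below is a toy under stated assumptions: `toy_world_model` is a one-dimensional stand-in for DreamDojo, the candidates are drawn uniformly at random purely for illustration (in practice they would more likely come from a policy), and the scoring function is hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple


def plan_by_sampling(
    state: float,
    world_model: Callable[[float, Sequence[float]], float],
    score: Callable[[float], float],
    num_candidates: int,
    horizon: int,
    rng: random.Random,
) -> Tuple[List[float], float]:
    """Simulate several candidate action sequences and return the best one."""
    if num_candidates < 1 or horizon < 1:
        raise ValueError("num_candidates and horizon must be at least 1")
    best_actions: List[float] = []
    best_score = float("-inf")
    for _ in range(num_candidates):
        # Hypothetical proposal distribution: uniform actions in [-1, 1].
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        predicted_state = world_model(state, actions)  # imagined outcome
        candidate_score = score(predicted_state)
        if candidate_score > best_score:
            best_actions, best_score = actions, candidate_score
    return best_actions, best_score


def toy_world_model(state: float, actions: Sequence[float]) -> float:
    """Toy dynamics: each action moves a 1-D state by its own value."""
    return state + sum(actions)


def make_goal_score(goal: float) -> Callable[[float], float]:
    """Higher score for predicted states closer to the goal."""
    return lambda predicted_state: -abs(predicted_state - goal)


if __name__ == "__main__":
    actions, value = plan_by_sampling(
        state=0.0,
        world_model=toy_world_model,
        score=make_goal_score(2.0),
        num_candidates=64,
        horizon=4,
        rng=random.Random(0),
    )
    print("chosen actions:", actions)
    print("predicted score:", value)
```

Only the selected sequence would be executed on the real robot. The rejected candidates cost simulation time only.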

3. Live Teleoperation

Developers can teleoperate virtual robots in real time. The NVIDIA team demonstrated this using a PICO VR controller and a local desktop with an NVIDIA RTX 5090. This allows for safe and rapid data collection.

Summary of Model Performance

Metric	DREAMDOJO-2B	DREAMDOJO-14B
Physics Correctness	62.50%	73.50%
Action Following	63.45%	72.55%
FPS (Distilled)	10.81	N/A

NVIDIA has released all weights, training code, and evaluation benchmarks. This open-source release allows you to post-train DreamDojo on your own robot data today.

Key Takeaways

Massive Scale and Diversity: DreamDojo is pretrained on DreamDojo-HV, the largest egocentric human video dataset to date, featuring 44,711 hours of footage across 6,015 unique tasks and 9,869 scenes.
Unified Latent Action Proxy: To overcome the lack of action labels in human videos, the model uses continuous latent actions extracted via a spatiotemporal Transformer VAE, which serves as a hardware-agnostic control interface.
Optimized Training and Architecture: The model achieves high-fidelity physics and precise controllability by using relative action transformations, chunked action injection, and a specialized temporal consistency loss.
Real-Time Performance via Distillation: Through a Self Forcing distillation pipeline, the model is accelerated to 10.81 FPS, enabling interactive applications like live teleoperation and stable, long-horizon simulations for over 1 minute.
Reliable for Downstream Tasks: DreamDojo functions as an accurate simulator for policy evaluation, showing a 0.995 Pearson correlation with real-world success rates, and can improve real-world performance by 17% when used for model-based planning.


The post NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data appeared first on MarkTechPost.
