What is a robot foundation model?
A foundation model is a single, large model trained on a lot of data that's then adapted to many downstream tasks. Here's what that looks like when the downstream task is moving a body.
In language, a foundation model is a single large transformer trained on a huge corpus of text that can then be specialized — by fine-tuning or prompting — to do summarization, code completion, translation, customer support, and a hundred other things. In robotics, a foundation model is the same idea applied to bodies: a single large model trained on a lot of robot data (and often a lot of internet data too) that you can then adapt to drive an arm, a humanoid, a wheeled platform, or a hand.
The motivation is familiar to anyone who watched language models eat the NLP world. Before foundation models, each downstream task got its own model: a separate sentiment classifier, a separate translator, a separate summarizer. After, one model did all of them — usually better. Robotics is hoping for the same compression: instead of one policy per task per robot, one model that you specialize.
What goes in
A robot foundation model is trained on some mix of:
- Robot trajectories. Sequences of (observation, action) pairs from real robots performing real tasks. The biggest open datasets are Open X-Embodiment (more than 1M episodes across 22 robot embodiments) and DROID (about 76K episodes of in-the-wild teleoperation). See robot-learning-datasets-explained.
- Human video. People doing tasks the robot might one day do. Some labs use Ego4D, EPIC-Kitchens, or curated YouTube data. Video helps the model learn what manipulation looks like before it has to figure out what to do.
- Internet-scale image-text. Caption pairs, screen UI, instructional content. This is what gives the model its broader visual-semantic grounding — it knows what a "spatula" is before it sees one on a robot arm.
- Simulation rollouts. Synthetic trajectories from physics simulators, often with domain randomization.
The mix matters. Pure robot data is the gold standard for action grounding but is expensive to collect. Human video is plentiful but introduces an embodiment gap: a human hand is not a gripper. Internet data is essentially free but isn't directly about motor control. Simulation is cheap to scale but introduces a sim-to-real gap. Most modern robot foundation models combine several of these sources.
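One common way to realize a data mix is to pick a source for each training example with a fixed probability. The sketch below illustrates that idea only. The weights and the `sample_sources` helper are hypothetical, chosen for illustration, and not taken from any published model.

```python
import random

# Hypothetical mixture weights: illustrative values, not from any real training recipe.
MIXTURE_WEIGHTS = {
    "robot_trajectories": 0.5,
    "human_video": 0.2,
    "image_text": 0.2,
    "simulation": 0.1,
}

def sample_sources(weights, batch_size, rng):
    """Pick a data source for each example in a batch, proportional to its weight."""
    names = list(weights)
    probabilities = [weights[name] for name in names]
    return rng.choices(names, weights=probabilities, k=batch_size)

rng = random.Random(0)
batch = sample_sources(MIXTURE_WEIGHTS, batch_size=1000, rng=rng)

# Count how many examples in this batch came from each source.
counts = {name: batch.count(name) for name in MIXTURE_WEIGHTS}
```

Raising the weight on robot trajectories buys more action grounding per batch at the cost of seeing less of the cheaper, broader data.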
What comes out
The model's job is to map perception inputs to action outputs. The "perception" side is usually one or more camera streams plus proprioception (joint angles, gripper state). The "action" side varies by approach:
- Direct joint commands. Each timestep, the model emits target joint positions or velocities.
- End-effector pose. The model emits a 6-DoF target for the gripper; an inverse-kinematics layer turns that into joint motions.
- Tokenized actions. The model emits discrete action tokens that get decoded into continuous control. This is how RT-2 fits actions into a language-model vocabulary.
- Diffusion over trajectories. The model emits a short sequence of future actions (for example, end-effector poses) by iteratively denoising random noise. This is the approach introduced by Diffusion Policy.
Most models also accept a task spec as input: a natural-language instruction ("pour the cup into the sink"), a goal image, or a target end state.
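To make the tokenized-actions option above concrete, here is a minimal sketch of uniform binning: each continuous action dimension is clipped to a range and mapped to one of a fixed number of integer bins, and decoding returns the bin center. RT-2 describes discretizing each dimension into 256 bins; the action ranges, function names, and the example action below are assumptions for illustration.

```python
def tokenize_action(action, low, high, num_bins=256):
    """Map each continuous action dimension to an integer bin in [0, num_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)          # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)      # position within the range, 0..1
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens

def detokenize_action(tokens, low, high, num_bins=256):
    """Map bin indices back to continuous values at the center of each bin."""
    return [lo + (t + 0.5) * (hi - lo) / num_bins for t, lo, hi in zip(tokens, low, high)]

# Hypothetical 4-dimensional action, each dimension assumed to lie in [-1, 1].
low, high = [-1.0] * 4, [1.0] * 4
action = [0.0, -1.0, 1.0, 0.25]

tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width, which is the price of fitting continuous control into a discrete vocabulary.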
Three families worth knowing
Vision-language-action (VLA)
The dominant family right now. Start from a pretrained vision-language model (like PaLI or LLaVA), bolt on an action head, and fine-tune on robot data. RT-2, OpenVLA, π0, and Helix all live in this family. They get language and visual grounding mostly for free from the pretraining, then learn to act from a relatively small robot dataset. See vision-language-action-models-explained.
Behavior cloning at scale
Same imitation-learning objective as a VLA, but without the pretrained language-model backbone. The model is a transformer or diffusion network trained directly on (observation, action) pairs. Diffusion Policy is one of the most influential examples. These models are simpler and sometimes faster to train, but harder to condition on arbitrary language instructions.
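Stripped of scale, behavior cloning is supervised regression from observations to demonstrated actions. The sketch below shows that loop with a two-parameter linear policy standing in for the transformer or diffusion network. The `expert_action` demonstrator, the 1-D observation, and the hyperparameters are all hypothetical.

```python
import random

def expert_action(observation):
    """Hypothetical demonstrator: a proportional controller pushing the state toward zero."""
    return -0.5 * observation

# Collect (observation, action) pairs from the demonstrator.
rng = random.Random(0)
observations = [rng.uniform(-1.0, 1.0) for _ in range(200)]
actions = [expert_action(o) for o in observations]

# Fit a linear policy by full-batch gradient descent on mean squared error.
weight, bias = 0.0, 0.0
learning_rate = 0.3
for _ in range(300):
    errors = [weight * o + bias - a for o, a in zip(observations, actions)]
    grad_weight = 2.0 * sum(e * o for e, o in zip(errors, observations)) / len(errors)
    grad_bias = 2.0 * sum(errors) / len(errors)
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

def policy(observation):
    """The cloned policy: imitates the demonstrator on new observations."""
    return weight * observation + bias
```

Nothing in this loop knows what the task is. The policy only learns to reproduce what the demonstrator did, which is why coverage of the demonstration data matters so much.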
World models
The model predicts what the world will look like several steps in the future, conditioned on candidate actions. The robot picks the action whose predicted future is closest to the goal. World models are an older idea (Dyna, Dreamer) that's getting renewed attention for embodied AI. They're attractive because the model learns physics implicitly and can be used for planning, not just reactive control.
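The planning loop described above can be sketched with random shooting, one simple planning method among many: sample candidate action sequences, roll each through the model, and execute the first action of the best one. Everything here is an assumption for illustration. `predict_next` is a hand-written stand-in for a learned model, the state is a 1-D position, and candidates are scored by summed distance to the goal over the horizon.

```python
import random

def predict_next(state, action):
    """Stand-in for a learned world model: a 1-D point that moves by `action` each step."""
    return state + action

def plan_first_action(state, goal, horizon, num_candidates, rng):
    """Random shooting: return the first action of the lowest-cost sampled sequence."""
    best_cost, best_sequence = float("inf"), None
    for _ in range(num_candidates):
        sequence = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        predicted, cost = state, 0.0
        for action in sequence:
            predicted = predict_next(predicted, action)
            cost += abs(predicted - goal)  # accumulate distance to the goal along the rollout
        if cost < best_cost:
            best_cost, best_sequence = cost, sequence
    return best_sequence[0]

rng = random.Random(0)
state, goal = 0.0, 3.0
for _ in range(20):
    action = plan_first_action(state, goal, horizon=3, num_candidates=200, rng=rng)
    # In this toy the "real world" matches the model exactly; on a robot it would not.
    state = predict_next(state, action)
```

Replanning at every step, rather than committing to a whole sequence, is what lets this kind of controller absorb model error: only the first action of each plan is ever executed.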
What "foundation" actually buys you
Three things, in practice:
- Generalization to new tasks. Give the model a new task description and a few examples, and it often produces a reasonable first attempt. Earlier task-specific policies typically had to be trained from scratch for each new task.
- Transfer across robots. A model trained on a Franka arm can often be adapted to a UR5 with a few hours of data, not weeks. Cross-embodiment is still an open problem but it's no longer a non-starter.
- A shared substrate for the field. Researchers can build on each other's work because they share a common base. The cost of trying a new idea drops.
What it doesn't buy you
A foundation model is not a deployed robot. There's a long list of things you still need: safety layers, recovery behaviors, latency budgets, hardware-specific calibration, perception that handles your exact lighting and clutter. The foundation model is the brain; the body, the room, and the operating procedure all still need work.
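As one small example of what a safety layer can look like, the sketch below sits between the model and the motors and limits both how far each joint may move in one control step and where it may end up. The function name, joint limits, and step size are hypothetical, and a real deployment needs far more than this.

```python
def clamp_command(target, previous, lower, upper, max_step):
    """Limit per-step joint motion, then clip the result to the joint's position limits."""
    safe = []
    for t, p, lo, hi in zip(target, previous, lower, upper):
        step = min(max(t - p, -max_step), max_step)  # rate limit relative to the last command
        safe.append(min(max(p + step, lo), hi))      # position limit
    return safe

# Hypothetical two-joint arm with limits of +/-1.6 rad and 0.1 rad of motion per step.
previous = [0.0, 1.5]
requested = [2.0, 1.7]  # what the model asked for
safe = clamp_command(requested, previous, lower=[-1.6, -1.6], upper=[1.6, 1.6], max_step=0.1)
```

The model's output is treated as a request, not a command: whatever it asks for, the joints move at most `max_step` per tick and never leave their limits.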
Where to look next
- The Robot Brain Index models tab tracks every robot foundation model we cover, with source-backed claims about what each can actually do.
- For the data side: robot-learning-datasets-explained.
- For the family of models that bolts language on top: vision-language-action-models-explained.