Robot learning datasets, explained
An overview of the open datasets training modern robot models — what they contain, what they're missing, and how to use them.
Most of the progress in robot learning over the last two years has been a data story. Architectures matter, but the dataset is what determines whether a model can do anything in the real world. This guide is a working map of the open robot datasets — what's in them, what they're missing, and how to choose one for a project.
What "a robot dataset" actually contains
A typical entry in a modern robot dataset is an episode: a single attempt at a task. An episode is a sequence of timesteps, and each timestep has:
- One or more camera images (wrist camera, third-person, sometimes depth).
- Proprioception — joint angles, end-effector pose, gripper state.
- The action taken at that timestep (joint deltas, end-effector deltas, gripper command).
- Optionally: language instruction, success/failure label, task ID.
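As a rough mental model, the structure above can be sketched as two small Python types. This is an illustrative sketch, not the schema of any particular dataset: the names `Timestep`, `Episode`, and `duration_s` are hypothetical, and the optional labels are stored once per episode here for brevity, whereas some formats repeat them on every step.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Timestep:
    # Camera name -> encoded image bytes (e.g. "wrist", "third_person").
    images: dict[str, bytes]
    # Proprioception: joint angles, end-effector pose, gripper state, flattened.
    proprio: list[float]
    # Action taken at this step (joint or end-effector deltas plus gripper command).
    action: list[float]


@dataclass
class Episode:
    steps: list[Timestep] = field(default_factory=list)
    # Optional labels; kept per episode in this sketch (an assumption).
    language_instruction: Optional[str] = None
    success: Optional[bool] = None
    task_id: Optional[str] = None

    def __len__(self) -> int:
        return len(self.steps)

    def duration_s(self, control_hz: float) -> float:
        """Episode length in seconds, given the control frequency in Hz."""
        if control_hz <= 0:
            raise ValueError("control_hz must be positive")
        return len(self.steps) / control_hz
```

Under these assumptions, an episode with 50 steps recorded at 10 Hz lasts 5 seconds, which is a quick way to sanity-check a dataset's stated control frequency against its episode lengths.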
Episodes are stored in a structured format — usually RLDS (TFRecord-based, what Open X-Embodiment uses), LeRobot's HuggingFace datasets format, or a custom parquet/zarr layout.
The big four open datasets
Open X-Embodiment (OXE)
The aggregation dataset. Around 1M episodes across 22 robots from 21 institutions, all converted to a shared RLDS schema. Released alongside the RT-X models (RT-1-X and RT-2-X, the cross-embodiment versions of RT-1 and RT-2). OXE is what most current foundation models are trained on, sometimes exclusively, sometimes mixed with private data.
What's good: scale, diversity of embodiments, single format. What's not: huge embodiment imbalance (Franka arms dominate), wildly varying action spaces across robots, and quality varies — some sub-datasets were collected by graduate students for one paper and aren't held to data-engineering standards.
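One common way to soften this kind of imbalance when mixing sub-datasets is to sample each one in proportion to its episode count raised to a power below 1. The sketch below shows that heuristic only. It is not a description of how OXE or RT-X weight their mixtures, and the function name `mixture_weights` and the `alpha` parameter are assumptions made for illustration.

```python
def mixture_weights(episode_counts: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Return per-dataset sampling probabilities proportional to count ** alpha.

    alpha = 1.0 reproduces the raw imbalance; alpha = 0.0 samples every
    sub-dataset equally; values in between trade off the two.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Skip empty sub-datasets so they never get sampled.
    scaled = {name: n ** alpha for name, n in episode_counts.items() if n > 0}
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("no non-empty sub-datasets")
    return {name: s / total for name, s in scaled.items()}
```

For two hypothetical sub-datasets with 90,000 and 10,000 episodes, `alpha=1.0` gives sampling probabilities of 0.9 and 0.1, while `alpha=0.5` gives 0.75 and 0.25, so the smaller one is seen more often without being oversampled to parity.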
DROID
Around 76K episodes collected on Franka arms in real homes and offices by a distributed collection effort. Each episode has language annotations. Designed to look like the kind of data a generalist robot would actually see — varied lighting, clutter, objects from real living spaces.
What's good: realism, scale, language. What's not: a single embodiment (Franka), so cross-robot transfer needs to be added on the model side.
BridgeData V2
About 60K episodes across ~24 environments and ~13 skills on a WidowX arm. The classic dataset for studying generalization — easy to ask whether a model trained on the "train" environments works on the "test" environments.
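A held-out-environment split of this kind can be sketched in a few lines. The example assumes each episode is a dict carrying an `"environment"` key, which is a hypothetical field name, not the actual BridgeData V2 schema; adapt the lookup to whatever metadata your copy of the dataset exposes.

```python
from typing import Iterable


def split_by_environment(
    episodes: Iterable[dict], held_out_envs: set[str]
) -> tuple[list[dict], list[dict]]:
    """Split episodes so no held-out environment appears in the training set."""
    train, test = [], []
    for ep in episodes:
        # "environment" is an assumed metadata key for this sketch.
        if ep["environment"] in held_out_envs:
            test.append(ep)
        else:
            train.append(ep)
    return train, test
```

Splitting by environment rather than by random episode matters because a random split leaks every environment into training, which measures memorization of scenes more than generalization to new ones.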
LeRobot's HuggingFace datasets
A growing collection of datasets in the LeRobot format, often smaller and task-focused (under 10K episodes), but easy to download, mix, and replay. Includes ALOHA bimanual data, SO-100 arm data, mobile manipulation data, and community contributions.
Datasets you might also want
- Ego4D / EPIC-Kitchens. Human-centric video. Not robot data, but useful for pretraining vision encoders and learning what manipulation looks like.
- CALVIN. A simulation-based benchmark for long-horizon language-conditioned manipulation. Useful for evaluation more than pretraining.
- RoboHive datasets. A collection of simulated tasks designed to be a sim counterpart to OXE.
- Real-world humanoid datasets are still mostly private to humanoid companies. This is the biggest open-data gap in the field.
What's missing from open data
Three categories that the field needs and doesn't really have at scale:
- Failure data. Episodes where the robot fails, and the recovery. Most open datasets are filtered to successes only — useful for behavior cloning, useless for learning to recover.
- Tactile. Force-torque and skin-sensor data is rare in open datasets. Without it, models can't learn delicate insertion or in-hand manipulation.
- Long horizons. Most episodes are 5-30 seconds. The kinds of behaviors we care about deploying — a full household task — span minutes. Long-horizon data is hard to collect cleanly.
How to use a dataset
A rough checklist:
- Check the license. Not every "open" dataset is commercial-OK. OXE itself is a mixed bag because it aggregates sub-datasets with different licenses.
- Check the action space. Joint position? Joint velocity? End-effector delta? If the dataset's action space doesn't match what your downstream robot's controller expects, you're going to need an adapter.
- Check the camera setup, including intrinsics and extrinsics. If you train on wrist-camera data and deploy on a third-person camera, the model will likely perform poorly.
- Subsample carefully. Most large datasets have heavy task imbalance. A uniformly random 10% preserves that imbalance, so it can be dominated by the most-collected task and contain few or no episodes of the rare ones.
- Use a streaming loader. Loading hundreds of GB of episode data into memory at once will exhaust RAM on most machines. RLDS, the LeRobot loader, and most parquet readers support streaming or batched reads, so use them.
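The subsampling and streaming points in the checklist above can be combined: the sketch below makes one pass over an episode stream and keeps at most a fixed number of episodes per task, using reservoir sampling so memory stays bounded. It is a minimal illustration under stated assumptions. `capped_subsample` and the `task_of` callback are hypothetical names, and the stream can be any iterator your loader provides.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Iterable, TypeVar

E = TypeVar("E")


def capped_subsample(
    stream: Iterable[E],
    task_of: Callable[[E], Hashable],
    cap: int,
    seed: int = 0,
) -> dict[Hashable, list[E]]:
    """Keep a uniform random sample of at most `cap` episodes per task.

    Uses per-task reservoir sampling (Algorithm R), so the stream is read
    once and never held in memory in full.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    rng = random.Random(seed)
    kept: dict[Hashable, list[E]] = defaultdict(list)
    seen: dict[Hashable, int] = defaultdict(int)
    for episode in stream:
        task = task_of(episode)
        seen[task] += 1
        bucket = kept[task]
        if len(bucket) < cap:
            # Reservoir not full yet: always keep.
            bucket.append(episode)
        else:
            # Replace a random slot with probability cap / seen[task].
            j = rng.randrange(seen[task])
            if j < cap:
                bucket[j] = episode
    return dict(kept)
```

With a cap of 10, a stream containing 1,000 episodes of one task and 5 of another yields 10 and 5 episodes respectively, so rare tasks survive intact while common ones are cut down.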
Where to look next
- The Robot Brain Index datasets tab tracks every dataset we cover with size, license, robots, and best-fit use.
- What is a robot foundation model? explains how the models that consume this data are built.
- How robots learn from demonstration covers the training methods that turn this data into behavior.