Robot learning datasets, explained

Most of the progress in robot learning over the last two years has been a data story. Architectures matter, but the dataset is what determines whether a model can do anything in the real world. This guide is a working map of the open robot datasets — what's in them, what they're missing, and how to choose one for a project.

What "a robot dataset" actually contains

A typical entry in a modern robot dataset is an episode: a single attempt at a task. An episode is a sequence of timesteps, and each timestep has:

One or more camera images (wrist camera, third-person, sometimes depth).
Proprioception — joint angles, end-effector pose, gripper state.
The action taken at that timestep (joint deltas, end-effector deltas, gripper command).
Optionally: language instruction, success/failure label, task ID.
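
As a rough mental model, the fields above map onto a pair of record types. This is only an illustrative sketch: the class and field names (Timestep, Episode, ee_pose, and so on) are hypothetical and do not match the schema of RLDS, LeRobot, or any specific dataset.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Timestep:
    # Camera name -> image. A nested list stands in for an HxWx3 array here.
    images: Dict[str, list]
    # Proprioception: joint angles, end-effector pose, gripper state.
    joint_angles: List[float]
    ee_pose: List[float]  # assumed layout: x, y, z, roll, pitch, yaw
    gripper_state: float
    # Action taken at this timestep; the layout depends on the dataset.
    action: List[float]


@dataclass
class Episode:
    steps: List[Timestep] = field(default_factory=list)
    # Optional episode-level metadata.
    language_instruction: Optional[str] = None
    success: Optional[bool] = None
    task_id: Optional[str] = None

    def __len__(self) -> int:
        return len(self.steps)


# Build a one-step episode with placeholder values.
step = Timestep(
    images={"wrist": [[0, 0, 0]]},
    joint_angles=[0.0] * 7,
    ee_pose=[0.0] * 6,
    gripper_state=1.0,
    action=[0.0] * 7,
)
episode = Episode(steps=[step], language_instruction="pick up the cup")
```

Real formats store images as encoded bytes or video frames rather than nested lists, but the split between per-timestep data and per-episode metadata is the same.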

Episodes are stored in a structured format — usually RLDS (TFRecord-based, what Open X-Embodiment uses), LeRobot's HuggingFace datasets format, or a custom parquet/zarr layout.

The big four open datasets

Open X-Embodiment (OXE)

The aggregation dataset. Around 1M episodes across 22 robots from 21 institutions, all converted to a shared RLDS schema. Released alongside the RT-X models (RT-1-X and RT-2-X, the cross-embodiment versions of RT-1 and RT-2). Many current foundation models are trained on OXE, sometimes exclusively, sometimes mixed with private data.

What's good: scale, diversity of embodiments, single format. What's not: heavy embodiment imbalance (Franka arms dominate), action spaces that differ widely across robots, and uneven quality. Some sub-datasets were collected by graduate students for a single paper and aren't held to data-engineering standards.
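
One common way to cope with differing action scales is to normalize actions separately for each sub-dataset before mixing them. The sketch below assumes actions are plain lists of floats with the same dimensionality inside one sub-dataset; the function names are hypothetical, and this is one mitigation among several, not the method of any particular model.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Action = List[float]


def action_stats(actions: List[Action]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and standard deviation for one sub-dataset."""
    dims = list(zip(*actions))  # transpose: one tuple per action dimension
    means = [mean(d) for d in dims]
    # A constant dimension has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(d) or 1.0 for d in dims]
    return means, stds


def normalize(action: Action, means: List[float], stds: List[float]) -> Action:
    """Shift and scale one action using its sub-dataset's statistics."""
    return [(a - m) / s for a, m, s in zip(action, means, stds)]


# Toy sub-dataset: dimension 0 varies, dimension 1 is constant.
toy_actions = [[0.0, 10.0], [2.0, 10.0]]
toy_means, toy_stds = action_stats(toy_actions)
toy_normalized = [normalize(a, toy_means, toy_stds) for a in toy_actions]
```

Normalization only aligns scales. It does not reconcile robots whose actions have different dimensions or meanings (joint velocities versus end-effector deltas), which still needs an explicit mapping.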

DROID

Around 76K episodes collected on Franka arms in real homes and offices by a distributed team of collectors. Each episode has language annotations. Designed to look like the data a generalist robot would actually see: varied lighting, clutter, objects from real living spaces.

What's good: realism, scale, language. What's not: a single embodiment (Franka), so any cross-robot transfer has to come from the model or from mixing in other datasets.

BridgeData V2

About 60K episodes across ~24 environments and ~13 skills on a WidowX arm. The classic dataset for studying generalization — easy to ask whether a model trained on the "train" environments works on the "test" environments.
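
A held-out-environment split of the kind described above can be sketched in a few lines. The "environment" key and the environment names below are hypothetical placeholders; the real field name depends on how the dataset is loaded.

```python
from typing import Dict, Iterable, List, Set, Tuple

EpisodeMeta = Dict[str, str]


def split_by_environment(
    episodes: Iterable[EpisodeMeta], held_out: Set[str]
) -> Tuple[List[EpisodeMeta], List[EpisodeMeta]]:
    """Send every episode from a held-out environment to the test set."""
    train, test = [], []
    for ep in episodes:
        if ep["environment"] in held_out:
            test.append(ep)
        else:
            train.append(ep)
    return train, test


# Placeholder metadata: three episodes across three made-up environments.
demo_episodes = [
    {"id": "a", "environment": "kitchen_1"},
    {"id": "b", "environment": "kitchen_2"},
    {"id": "c", "environment": "tabletop_1"},
]
train_eps, test_eps = split_by_environment(demo_episodes, held_out={"kitchen_2"})
```

Splitting by environment rather than by random episode is what makes the test set a measure of generalization: no test scene is ever seen during training.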

LeRobot's HuggingFace datasets

A growing collection of datasets in the LeRobot format, often smaller and task-focused (under 10K episodes), but easy to download, mix, and replay. Includes ALOHA bimanual data, SO-100 arm data, mobile manipulation data, and community contributions.

Datasets you might also want

Ego4D / EPIC-Kitchens. Human-centric video. Not robot data, but useful for pretraining vision encoders and learning what manipulation looks like.
CALVIN. A simulation-based benchmark for long-horizon language-conditioned manipulation. Useful for evaluation more than pretraining.
RoboHive datasets. Data from RoboHive's collection of simulated robot-learning tasks. Useful as a simulated complement to real-robot data such as OXE.
Real-world humanoid datasets are still mostly private to humanoid companies. This is the biggest open-data gap in the field.

What's missing from open data

Three categories that the field needs and doesn't really have at scale:

Failure data. Episodes where the robot fails, and the recovery. Most open datasets are filtered to successes only — useful for behavior cloning, useless for learning to recover.
Tactile. Force-torque and skin-sensor data are rare in open datasets. Without them, models struggle to learn delicate insertion or in-hand manipulation.
Long horizons. Most episodes last 5 to 30 seconds. The behaviors we care about deploying, such as a full household task, span minutes. Long-horizon data is hard to collect cleanly.

How to use a dataset

A rough checklist:

Check the license. Not every "open" dataset is cleared for commercial use. OXE itself is a mixed bag because it aggregates sub-datasets with different licenses.
Check the action space. Joint position? Joint velocity? End-effector delta? If your model and your downstream robot don't match, you're going to need an adapter.
Check the camera setup: intrinsics, extrinsics, and viewpoint. A model trained on wrist-camera data and deployed on a third-person camera will perform poorly.
Subsample carefully. Most large datasets have heavy task imbalance. A random 10% might consist almost entirely of the most-collected task.
Use a streaming loader. Loading hundreds of GB of episode data into memory will fail. RLDS, the LeRobot loader, and most parquet readers stream natively — use them.
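
The subsampling and streaming items can be combined: per-task reservoir sampling keeps at most a fixed number of episodes per task while reading the data once, without holding the full dataset in memory. This is a minimal sketch that assumes each episode is a dict with a "task_id" key; the function name, the key, and the demo stream are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def sample_per_task(
    episodes: Iterable[dict], per_task: int, task_key: str = "task_id", seed: int = 0
) -> Dict[str, List[dict]]:
    """Keep a uniform random sample of up to `per_task` episodes for each task."""
    rng = random.Random(seed)
    kept: Dict[str, List[dict]] = defaultdict(list)
    seen: Dict[str, int] = defaultdict(int)
    for ep in episodes:  # single pass; works on generators and streaming loaders
        task = ep[task_key]
        seen[task] += 1
        if len(kept[task]) < per_task:
            kept[task].append(ep)
        else:
            # Reservoir sampling: the new episode replaces a kept one
            # with probability per_task / seen[task].
            j = rng.randrange(seen[task])
            if j < per_task:
                kept[task][j] = ep
    return dict(kept)


def demo_stream() -> Iterator[dict]:
    """Imbalanced toy stream: 1000 'pick' episodes, then 10 'pour' episodes."""
    for i in range(1000):
        yield {"task_id": "pick", "id": i}
    for i in range(10):
        yield {"task_id": "pour", "id": i}


subset = sample_per_task(demo_stream(), per_task=5)
```

A plain random 10% of this toy stream would contain about 100 "pick" episodes and about one "pour" episode. The per-task version returns five of each.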

Where to look next

The Robot Brain Index datasets tab tracks every dataset we cover with size, license, robots, and best-fit use.
What is a robot foundation model? explains how the models that consume this data are built.
How robots learn from demonstration covers the training methods that turn this data into behavior.