How robots learn from demonstration

Imitation learning is the dominant way modern robots acquire new skills. Here's what's in the toolkit, what's improving, and where it falls short.

The most common way to teach a modern robot a new skill is not to program it, and not to let it discover the skill through trial and error, but to demonstrate the skill yourself. You teleoperate the robot through a few dozen successful attempts. A model watches your demonstrations and learns to produce similar trajectories. This is learning from demonstration (LfD), also called imitation learning (IL), and it's the workhorse method behind most published manipulation results since 2023.

This guide is a working overview of the techniques, the data collection methods, and the failure modes.

Why imitation, not RL

Two reasons:

  1. Reward design is hard. Reinforcement learning needs a reward signal. For most useful tasks ("fold the laundry", "make the bed", "pour the cereal") a good reward function is close to impossible to specify by hand. You can fall back on sparse rewards (success or failure at the end of an episode), but learning is then prohibitively slow.
  2. Demonstrations are comparatively cheap. A human can teleoperate a robot through 50 successful episodes in an afternoon. That's enough data to bootstrap a useful policy. Collecting equivalent experience via RL on real hardware could take weeks of robot time and a lot of broken cups.

Imitation flips the problem from "specify the reward and let the robot figure it out" to "show the robot what good looks like and let it pattern-match." That is a much easier thing to ask of a human.

The core methods

Behavior cloning

The simplest form. Given a dataset of (observation, action) pairs, train a model to predict the action from the observation. That's it.
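As a minimal sketch of that idea, the snippet below fits a linear policy to (observation, action) pairs with ordinary least squares. Everything here is an assumption made for illustration: the synthetic "expert" matrix `w_expert`, the helper names `collect_demos` and `fit_bc`, and the choice of a linear model. Real policies are usually neural networks trained on images, but the supervised structure is the same.

```python
import numpy as np

def collect_demos(n, w_expert, rng, noise=0.01):
    # Hypothetical stand-in for teleoperated data: the "expert" is a linear
    # map from observation to action, plus a little operator noise.
    obs = rng.normal(size=(n, w_expert.shape[0]))
    actions = obs @ w_expert + noise * rng.normal(size=(n, w_expert.shape[1]))
    return obs, actions

def fit_bc(obs, actions):
    # Behavior cloning as plain supervised regression:
    # minimize ||obs @ w - actions||^2 over w.
    w_hat, *_ = np.linalg.lstsq(obs, actions, rcond=None)
    return w_hat

rng = np.random.default_rng(0)
w_expert = np.array([[1.0, 0.0],
                     [0.0, -1.0],
                     [0.5, 0.5]])      # 3-D observation -> 2-D action
obs, actions = collect_demos(500, w_expert, rng)
w_hat = fit_bc(obs, actions)

def policy(observation):
    # The cloned policy: predict an action from an observation.
    return observation @ w_hat

print(policy(np.array([1.0, 2.0, 0.0])))
```

The fit only sees observations the expert visited, which is exactly why the failure cases discussed in this section matter once the policy runs on its own.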

Behavior cloning is a supervised learning problem in disguise. It works surprisingly well when the dataset is good. It fails when:

  • The dataset has gaps, and the model interpolates badly across them.
  • Errors compound: the model takes a slightly wrong action, sees an observation it has never seen, takes a worse action, and so on. This is the "covariate shift" problem.
  • The task requires planning beyond a short horizon.

Most modern imitation methods are behavior cloning with extra structure.

Diffusion Policy

The dominant flavor in 2024-2026. Instead of predicting a single action from the current observation, the policy predicts a short trajectory of future actions, conditioned on the recent observation history. The prediction is done by iteratively denoising — the same process diffusion image models use.

Why it works: predicting a whole trajectory keeps consecutive actions committed to one way of doing the task instead of switching between valid options step by step, and predicting via denoising lets the model represent a multimodal probability distribution over action sequences rather than regressing to a single mean.
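The difference between regressing to a mean and sampling by denoising can be seen in a toy problem. The sketch below is not Diffusion Policy itself: it uses a single scalar action with two valid values (-1 and +1), replaces the trained, observation-conditioned network with a closed-form denoiser that happens to be known for this two-point dataset, and assumes a standard DDPM-style update with a linear noise schedule. The names `optimal_denoiser` and `sample_actions` are hypothetical.

```python
import numpy as np

# Toy demonstrations for one observation: half the operators approach
# from the left (-1), half from the right (+1).
demo_actions = np.array([-1.0, 1.0] * 500)

# A model trained with mean-squared error on a single observation can do
# no better than the mean of the demonstrations, which lies between modes.
mse_action = demo_actions.mean()

def optimal_denoiser(x_t, abar_t):
    # Closed-form E[x0 | x_t] when x0 is -1 or +1 with equal probability.
    # Assumption: this stands in for a trained denoising network.
    return np.tanh(np.sqrt(abar_t) * x_t / (1.0 - abar_t))

def sample_actions(n, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # assumed linear noise schedule
    alphas = 1.0 - betas
    abars = np.cumprod(alphas)
    x = rng.normal(size=n)                   # start from pure noise
    for t in range(steps - 1, -1, -1):
        x0_hat = optimal_denoiser(x, abars[t])
        # Noise estimate implied by the clean-action estimate.
        eps_hat = (x - np.sqrt(abars[t]) * x0_hat) / np.sqrt(1.0 - abars[t])
        # One reverse (denoising) step.
        x = (x - betas[t] / np.sqrt(1.0 - abars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.normal(size=n)
    return x

samples = sample_actions(1000)
print("MSE regression action:", mse_action)
print("fraction of samples near +1:", np.mean(samples > 0))
```

The regression answer sits at 0, an action nobody demonstrated, while the denoised samples land on one of the two demonstrated values.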

Diffusion Policy is the de facto baseline for new manipulation work.

Action Chunking Transformers (ACT)

A transformer that predicts a chunk of, say, 50 actions at once, conditioned on the current observation. The robot then executes that chunk open-loop before re-planning. ACT works well for bimanual manipulation; it's the policy ALOHA was originally trained with.
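The predict-a-chunk, execute-open-loop, re-plan cycle can be sketched as a control loop. The transformer is replaced here by a hypothetical `toy_chunk_policy`, and the one-dimensional "robot" whose position simply integrates the action is an assumption made to keep the example runnable; only the loop structure is the point.

```python
import numpy as np

def toy_chunk_policy(obs, chunk_size, goal=1.0, gain=0.5):
    # Hypothetical stand-in for a trained chunking model: returns
    # chunk_size actions that together close half the gap to the goal.
    return np.full(chunk_size, gain * (goal - obs) / chunk_size)

def run_chunked(policy, obs, horizon, chunk_size):
    n_queries = 0
    t = 0
    while t < horizon:
        chunk = policy(obs, chunk_size)   # query the model once per chunk
        n_queries += 1
        # Execute the chunk open-loop: no new observation is fed to the
        # model until the chunk (or the episode) is finished.
        for action in chunk[: horizon - t]:
            obs = obs + action            # toy dynamics (assumption)
            t += 1
    return obs, n_queries

final_obs, n_queries = run_chunked(toy_chunk_policy, 0.0, horizon=200, chunk_size=50)
print(final_obs, n_queries)
```

With a chunk size of 50 the model is queried 4 times over 200 control steps instead of 200 times, which is the trade: fewer opportunities for single-step errors to feed back into the input, at the cost of reacting more slowly to surprises inside a chunk.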

Implicit policies

Instead of mapping observation → action, learn an energy function over (observation, action) pairs. At inference, pick the action that minimizes the energy. Implicit Behavioral Cloning (IBC) is the best-known version. Less common than diffusion these days but conceptually adjacent.
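A rough sketch of the inference step follows, assuming a hand-written energy function in place of a learned network and a simple sample-and-pick-the-best search in place of a more careful optimizer. The names `toy_energy` and `implicit_policy` and the action bounds are hypothetical.

```python
import numpy as np

def toy_energy(obs, actions):
    # Hypothetical energy with two equally good minima, at a = +obs and
    # a = -obs. A learned model would be a network scoring (obs, action).
    return np.minimum((actions - obs) ** 2, (actions + obs) ** 2)

def implicit_policy(obs, energy_fn, n_candidates=4096, low=-2.0, high=2.0, seed=0):
    # Derivative-free inference: sample candidate actions, score each one,
    # and return the candidate with the lowest energy.
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(low, high, size=n_candidates)
    energies = energy_fn(obs, candidates)
    return candidates[np.argmin(energies)]

action = implicit_policy(1.0, toy_energy)
print(action)
```

Because the policy is an argmin rather than a direct regression, it commits to one of the two minima instead of blending them.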

Vision-language-action (VLA)

Behavior cloning, but on top of a vision-language model. The policy can be conditioned on a natural-language instruction. See vision-language-action-models-explained.

How demonstrations are collected

The data is the hard part. Three methods dominate.

Teleoperation via VR or controllers

A human wears a VR headset (or holds a controller) and the robot mirrors their movements. Most common for arm manipulation. Gives high-quality, smooth trajectories that match what a human would do.

Pros: high quality, scales with operators. Cons: slow (one demo at a time), requires a human, requires good teleop hardware.

Leader-follower puppetry

Used by ALOHA, GELLO, and similar bimanual setups. The operator moves a small mirror version of the arms (the "leader"); the real robot arms (the "follower") track. Lets a single operator do two arms at once.

Pros: very intuitive for bimanual, fast, cheap. Cons: requires building the leader rig.
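The leader-follower loop can be sketched as below. This is an illustration under assumptions, not the ALOHA or GELLO code: the per-tick joint limit `max_delta`, the two-joint arm, and the convention of logging the follower's joint state as the observation and the leader's joint positions as the action are all choices made for the example.

```python
import numpy as np

def follower_step(follower_q, leader_q, max_delta=0.05):
    # Move each follower joint toward the leader's position, limiting how
    # far a joint may travel in one control tick (assumed safety limit).
    delta = np.clip(leader_q - follower_q, -max_delta, max_delta)
    return follower_q + delta

def record_demo(leader_trajectory, follower_q0, max_delta=0.05):
    follower_q = np.asarray(follower_q0, dtype=float)
    episode = []
    for leader_q in leader_trajectory:
        obs = follower_q.copy()                     # what the robot "sees"
        action = np.asarray(leader_q, dtype=float)  # what the operator commanded
        follower_q = follower_step(follower_q, action, max_delta)
        episode.append((obs, action))
    return episode, follower_q

# The operator holds the leader at a fixed pose for 10 ticks.
leader_trajectory = [np.array([0.2, -0.1])] * 10
episode, final_q = record_demo(leader_trajectory, follower_q0=[0.0, 0.0])
print(len(episode), final_q)
```

The recorded `episode` is already a list of (observation, action) pairs, which is the shape of data the imitation methods in this guide train on.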

Recorded human video

Watch humans doing tasks and transfer what is learned to the robot. Several methods (R3M, VC-1, MIM) train visual representations on human video that then transfer to robot policies. Others try to extract human end-effector trajectories from video and adapt them to robot kinematics.

Pros: scales arbitrarily — there's a lot of human video. Cons: embodiment gap. Humans have five fingers; most robots have grippers. Humans are stronger and more dexterous. Transfer is noisy.

Common failure modes

A list of things that will go wrong:

  • Compounding errors. The first time the policy takes an action slightly off-distribution, the next observation is also off-distribution, and the error grows. DAgger (interactive learning where the human corrects the policy's mistakes) and ensemble methods help.
  • Multimodality collapse. When humans demonstrate "pick up the cup", they pick from many sides. A naive policy averages the demonstrations and tries to approach from the middle, which fails. Diffusion-based methods handle this better than regression.
  • Spurious correlations. The policy learns a cue such as "the demonstration starts with the operator's hand on the trigger" and breaks when there's no trigger.
  • Camera position changes. A policy trained on a wrist camera doesn't work on a third-person camera. Mix camera positions in training or use a representation that's robust.
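  • Reward myopia. If the only quality signal is "did the task succeed at the end" (for example, when demonstrations are filtered or policies are evaluated purely on final success), nothing pushes the policy to "use less force" or "be more efficient." Many failure modes look like success at the end of an episode.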
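The DAgger loop mentioned under compounding errors can be sketched in a few lines: roll out the current policy, have the expert label the states the policy actually visited, add those to the dataset, and retrain. The toy setup is entirely assumed for illustration: a scalar state with dynamics x ← x + a, a hypothetical `expert_action` controller, and a 1-nearest-neighbor lookup as the learner.

```python
import numpy as np

def expert_action(x):
    # Hypothetical expert: proportional controller driving the state to 0.
    return -0.5 * x

def fit_policy(states, actions):
    # Toy learner: 1-nearest-neighbor lookup over the aggregated dataset.
    states = np.asarray(states)
    actions = np.asarray(actions)
    def policy(x):
        return actions[np.argmin(np.abs(states - x))]
    return policy

def rollout(policy, x0, steps):
    # Assumed dynamics: x_{t+1} = x_t + a_t. Returns the visited states.
    visited, x = [], x0
    for _ in range(steps):
        visited.append(x)
        x = x + policy(x)
    return visited

def dagger(n_iters=3, steps=10, start=4.0):
    # Initial demonstrations only cover states between 0 and 1.
    states = rollout(expert_action, 1.0, steps)
    actions = [expert_action(s) for s in states]
    errors = []
    for _ in range(n_iters):
        policy = fit_policy(states, actions)
        visited = rollout(policy, start, steps)        # policy drives
        labels = [expert_action(s) for s in visited]   # expert corrects
        errors.append(float(np.mean([abs(policy(s) - a)
                                     for s, a in zip(visited, labels)])))
        states += visited                              # aggregate
        actions += labels
    return errors, len(states)

errors, dataset_size = dagger()
print(errors, dataset_size)
```

In the first round the policy starts outside the demonstrated region and disagrees with the expert; once those visited states are labeled and added, the disagreement on the policy's own rollouts disappears.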

What's improving fast

  • Mixed datasets across robots. A model trained on data from many embodiments generalizes better than one trained on a single robot. RT-X, OpenVLA, π0 all bet on this.
  • Data efficiency. Pretraining on visual representations from video, fine-tuning on a small number of robot demos. Some methods can learn a new task from under 10 demonstrations.
  • Online correction. Letting the model ask for help when uncertain. Active imitation learning loops are getting practical.
  • Hybrid sim + real. Generate variations of demonstrations in sim, augment the training set. Sim isn't a replacement for real demos but it can multiply them.
