Vision-language-action (VLA) models, explained
VLAs are the family of robot models that take a camera image plus a natural-language instruction and emit motor actions. Here's how they work and where they fail.
A vision-language-action (VLA) model takes in pixels and a language instruction and emits actions, usually joint commands or end-effector poses. The simplest one-line description: a vision-language model with action tokens added to its vocabulary.
That's a small change architecturally and a large change behaviorally. By treating actions as just another modality the model has to learn to produce, you get to reuse the entire vision-language pretraining stack, and you get a model that follows language instructions out of the box instead of requiring a separate planner.
How they're built
A typical VLA is built in four steps:
- Start from a pretrained vision-language model such as PaLI, LLaVA, Qwen-VL, or a similar multimodal transformer. This gives you image understanding and language grounding for free.
- Discretize the action space. Continuous joint commands or end-effector poses get binned into a fixed vocabulary of action tokens. Now actions look just like words from the model's perspective.
- Fine-tune on robot data. Train the combined model to predict action tokens conditioned on (image, language instruction, recent observations).
- At inference, decode the action tokens back into continuous values and send them to the robot.
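The discretize and decode steps above can be sketched in a few lines. This is a minimal illustration, not any particular model's tokenizer: the 256-bin vocabulary, the uniform binning, the per-dimension ranges, and the names `encode`, `decode`, `LOW`, and `HIGH` are all assumptions made for the example.

```python
from typing import List, Sequence

N_BINS = 256  # assumed number of bins (action tokens) per action dimension


def encode(action: Sequence[float], low: Sequence[float],
           high: Sequence[float], n_bins: int = N_BINS) -> List[int]:
    """Map each continuous action dimension to a bin index (its action token)."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        value = min(max(value, lo), hi)   # clip to the valid range
        frac = (value - lo) / (hi - lo)   # normalize to [0, 1]
        # The upper edge of the range is folded into the last bin.
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def decode(tokens: Sequence[int], low: Sequence[float],
           high: Sequence[float], n_bins: int = N_BINS) -> List[float]:
    """Map bin indices back to continuous values at the bin centers."""
    return [lo + (t + 0.5) * (hi - lo) / n_bins
            for t, lo, hi in zip(tokens, low, high)]


# Hypothetical 4-D action: end-effector delta x, y, z in meters,
# plus a gripper opening in [0, 1].
LOW = [-0.05, -0.05, -0.05, 0.0]
HIGH = [0.05, 0.05, 0.05, 1.0]

action = [0.012, -0.030, 0.0, 1.0]
tokens = encode(action, LOW, HIGH)     # what the model would be trained to predict
recovered = decode(tokens, LOW, HIGH)  # what would be sent to the robot
print(tokens)
print(recovered)
```

Because `decode` returns bin centers, the round-trip error in each dimension is at most half a bin width. Real systems differ in how they choose the ranges and how they map bin indices onto the language model's vocabulary.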
This pattern was popularized by Google's RT-2, then democratized by OpenVLA (open-source weights and code), and refined by π0, Helix, and others.
Why the language part matters
Before VLAs, robot policies typically took a "task ID" — a fixed integer representing which behavior to execute — or a goal image. Both work, but they don't compose. You couldn't tell a policy "pick up the blue mug that's behind the toaster" without explicitly training on that exact phrasing.
A VLA can interpret an open-ended instruction. The catch is that it only does well on instructions close to those in its training data. Tell it to "tape the door shut with the lemon" and it'll probably try something, but it isn't reasoning about feasibility. It's interpolating in instruction space and hoping the corresponding interpolation in action space works out.
What they're good at
- Following novel instructions that recombine known objects and known verbs. "Put the apple on the plate" works even if the model has never seen that exact combination, as long as it's seen apple and plate and put in the right contexts.
- Visual grounding. Pointing at the right object in a cluttered scene. This is a direct win from the pretrained vision-language backbone.
- Cross-task transfer. Train on cooking-adjacent tasks, get reasonable behavior on a slightly different cooking task you didn't include.
What they're not (yet) good at
- Long horizons. Most VLAs are reactive — they look at the current frame, the current instruction, maybe a short history, and emit an action. They don't plan over many steps. You can paper over this with a planner that decomposes a long instruction into short ones.
- Precise contact. Discretizing actions introduces a quantization error of up to half a bin width, and that small numerical error can translate to a big real-world error when you're inserting a peg or grasping a thin object. Models with continuous action heads (for example, diffusion-based ones) tend to do better here.
- Out-of-distribution objects. Show the model a piece of laboratory equipment it's never seen and the visual grounding can fall apart. This is a data problem more than an architecture problem.
- Negotiating physics. "Push the block until it tips over" requires the model to predict the moment of tipping. Reactive VLAs don't have an explicit physics model — they pattern-match.
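To put rough numbers on the precise-contact point, here is a back-of-the-envelope sketch of the quantization error for uniformly binned actions. The 256-bin count, the ranges, the 0.5 mm tolerance, and the helper names are illustrative assumptions, not measurements from any specific model.

```python
import math


def worst_case_error(range_width: float, n_bins: int) -> float:
    """Largest gap between a commanded value and the nearest bin center."""
    return range_width / n_bins / 2


def bins_needed(range_width: float, tolerance: float) -> int:
    """Smallest number of uniform bins whose worst-case error is within tolerance."""
    return math.ceil(range_width / (2 * tolerance))


# All lengths in millimeters. Both ranges are assumed for illustration.
delta_range_mm = 100.0      # per-step end-effector delta spanning +/- 50 mm
absolute_range_mm = 1000.0  # absolute position across a 1 m workspace

print(worst_case_error(delta_range_mm, 256))     # 0.1953125 mm
print(worst_case_error(absolute_range_mm, 256))  # 1.953125 mm
print(bins_needed(absolute_range_mm, 0.5))       # 1000 bins for a 0.5 mm tolerance
```

Under these assumptions, 256 bins are comfortable for small per-step deltas but leave almost 2 mm of slack when the same vocabulary has to cover a full workspace, which is already too coarse for a tight insertion.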
What's changing fast
Three frontiers worth tracking:
- Continuous action heads. Replacing tokenized actions with a regression head or a diffusion head improves precision, at the cost of no longer training everything with a single next-token objective. Many recent VLAs offer a continuous variant.
- Multi-camera, multi-arm, mobile. Early VLAs typically assumed a single arm and a single camera view. Newer ones handle bimanual setups, multiple camera views (including wrist cameras), and mobile bases.
- Speed. RT-2 ran at a few Hz; π0 and OpenVLA can be coaxed to ~10 Hz with optimized inference; Helix targets real-time control. This matters because reactive policies need to close the perception-action loop fast enough to handle small disturbances.
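A quick calculation shows why control rate matters for a reactive policy. The sketch below assumes an object drifting at a constant 50 mm/s and ignores inference latency and actuation delay. The speed and the function names are made up for the example.

```python
def drift_per_cycle_mm(speed_mm_per_s: float, control_hz: float) -> float:
    """Distance an object moves during one perception-action cycle."""
    return speed_mm_per_s / control_hz


def min_rate_hz(speed_mm_per_s: float, max_drift_mm: float) -> float:
    """Lowest control rate that keeps per-cycle drift within max_drift_mm."""
    return speed_mm_per_s / max_drift_mm


SPEED = 50.0  # assumed disturbance speed in mm/s, chosen for illustration

for hz in (2.0, 10.0, 50.0):
    print(f"{hz:5.1f} Hz -> {drift_per_cycle_mm(SPEED, hz):5.1f} mm per cycle")

# Rate needed to keep the object within 5 mm of where the policy last saw it.
print(min_rate_hz(SPEED, 5.0))
```

At 2 Hz the object moves 25 mm between consecutive actions, at 10 Hz it moves 5 mm, and at 50 Hz it moves 1 mm. A policy acting on a stale frame is effectively aiming at where the object used to be, so the tolerable drift sets a floor on the loop rate.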
A short reading list
- The original RT-2 paper from Google DeepMind (2023). The blueprint everything else follows.
- The OpenVLA paper (2024). The first major open-source VLA with full reproducibility.
- The π0 paper from Physical Intelligence (2024). Continuous action head trained with flow matching, fast inference, manipulation focus.
The Robot Brain Index models tab tracks every notable VLA with source-backed claims about what hardware it runs on, what tasks it handles, and what license you can use it under.
When NOT to use a VLA
If your task is well-specified and repetitive and the environment is structured (say, a pick-and-place from a known fixture to a known fixture in good lighting), a classical controller plus a small perception model will usually outperform a VLA and won't surprise you in production. VLAs earn their keep when the environment is unstructured, the task description is open-ended, or you can't afford to write a custom controller for every variant.