Figure 02 is learning from YouTube videos

My nephew learned to tie his shoes by watching me do it seven times. He watched. He tried. He failed. He watched again. He tried again. Somewhere around the fifth attempt, his fingers figured it out before his brain could explain what they were doing.

Figure AI just showed something that reminded me of that.

Their Figure 02 robot is learning physical tasks by watching demonstration videos. Not carefully staged laboratory demonstrations with motion capture suits and calibrated cameras. YouTube. Regular videos of regular people doing regular things.

A robot that learns to fold a towel by watching someone fold a towel on a webcam.

Why this is different

Much of today's robot learning relies on teleoperation: a human controls the robot’s arms directly, and the robot records the movements and tries to replicate them. It works, but it’s slow. Every task needs a human operator, often wearing motion-capture gear, spending hours guiding the robot through each motion.

Video learning skips that entirely. The robot watches a person do something, builds an internal model of what happened, and translates that visual understanding into motor commands for its own body. Different proportions. Different joint configurations. Different hands.

The translation is the hard part. A human arm and a robot arm don’t move the same way. The robot has to watch what happens and figure out how to achieve the same result with its own body. That’s not copying. That’s understanding the goal and improvising the execution.
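
To make that concrete for myself, I wrote a toy sketch. It is not Figure's method, which I haven't seen described in detail. It assumes the simplest possible case: two flat, two-joint arms with made-up segment lengths (one "human", one "robot") that both need to put their fingertip on the same point. The function names and numbers are my own invention for illustration. The point is only that the same goal produces different joint angles for different bodies.

```python
import math

def two_link_ik(x, y, l1, l2):
    """Return (shoulder, elbow) angles in radians that place the tip of a
    planar two-link arm with segment lengths l1 and l2 at the point (x, y)."""
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle from the distance to the target.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is out of reach for this arm")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow

def forward(shoulder, elbow, l1, l2):
    """Return the (x, y) fingertip position for the given joint angles."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y

# The goal taken from the "video": where the hand ends up (metres, hypothetical).
target = (0.35, 0.20)

# Two bodies with different proportions (hypothetical segment lengths).
human = two_link_ik(*target, l1=0.30, l2=0.25)
robot = two_link_ik(*target, l1=0.40, l2=0.20)

print("human joint angles:", [round(math.degrees(a), 1) for a in human])
print("robot joint angles:", [round(math.degrees(a), 1) for a in robot])
print("robot fingertip:", forward(*robot, 0.40, 0.20))
```

Both arms land on the same spot, but the joint angles that get them there are different. A real system has to do this in three dimensions, with many more joints, and without being handed the target.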

The implications I keep thinking about

If a robot can learn from YouTube videos, the training data is essentially infinite. There are billions of hours of humans doing physical tasks on the internet. Cooking. Cleaning. Building. Repairing. Carrying. Sorting.

Every cooking show becomes a robotics training set. Every DIY tutorial becomes a maintenance manual for a machine. Every factory tour becomes an onboarding program.

The bottleneck in robotics has always been teaching robots what to do. If video learning works at scale, the bottleneck dissolves. And the rate of robot capability improvement becomes tied to the rate of video upload, which is to say: it becomes very, very fast.

What I’m uncertain about

The demos look good. Demos always look good. The question is how well this works in the messy, unpredictable, non-YouTube reality of actual kitchens and warehouses and offices.

The videos that make good training material tend to have good lighting. Clear angles. Uncluttered backgrounds. Real life often has none of those things. The gap between “learned from a well-shot video” and “can do it in a dark kitchen with stuff everywhere” is wide.

But the gap is narrowing. And that’s the thing about exponential progress. You spend a long time thinking the gap is permanent, and then one morning you wake up and it’s gone.

I’m watching this closely. The way we teach robots might change everything about how fast they learn. And how fast they learn might change everything else.
