
Lessons from 50,000 humanoid trajectories

We crossed 50,000 humanoid trajectories last month — captured across Figure, Optimus, Apollo, and one platform we can’t name yet. The number is round and uninteresting on its own. The lessons from getting there are not.

Five things we learned, in order of how much they changed how we run programs:

1. Operator pacing matters more than operator skill

We initially graded operators on per-trajectory quality. We then found that operators who shipped clean 50-trajectory batches at a slower pace consistently outperformed operators who shipped 200-trajectory batches at a faster pace, even when the fast operators’ per-trajectory quality was higher in their first 50.

The cause: long-session quality degradation. Past two hours of concentrated teleop, even skilled operators start producing trajectories that are subtly worse. The signal-to-noise ratio drops. Models trained on the back half of a long session learn artifacts.

We now schedule operators in 90-minute sessions with 20-minute breaks, no more than four sessions a day. Throughput is lower; usable trajectory yield is higher. Net unit economics improve.
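As a rough illustration of that schedule, here is a minimal sketch that lays out one operator day. The session length, break length, and daily cap come from the numbers above; the function name `build_day_schedule` and the 09:00 start time are hypothetical choices for the example, not a description of our actual scheduling tool.

```python
from datetime import datetime, timedelta

SESSION_MIN = 90    # session length, from the post
BREAK_MIN = 20      # break between sessions, from the post
MAX_SESSIONS = 4    # daily cap, from the post


def build_day_schedule(start, sessions=MAX_SESSIONS):
    """Return (session_start, session_end) pairs for one operator day."""
    if not 1 <= sessions <= MAX_SESSIONS:
        raise ValueError(f"sessions must be between 1 and {MAX_SESSIONS}")
    schedule = []
    cursor = start
    for _ in range(sessions):
        end = cursor + timedelta(minutes=SESSION_MIN)
        schedule.append((cursor, end))
        # The next session starts only after the break has elapsed.
        cursor = end + timedelta(minutes=BREAK_MIN)
    return schedule


# Hypothetical day starting at 09:00.
day = build_day_schedule(datetime(2024, 1, 8, 9, 0))
teleop_minutes = sum(int((end - begin).total_seconds()) // 60 for begin, end in day)

for begin, end in day:
    print(f"{begin:%H:%M} to {end:%H:%M}")
print(f"Teleop time: {teleop_minutes} min")
```

Under these assumptions a full day tops out at six hours of teleop, which is the ceiling the schedule trades against higher usable yield.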

2. Cross-embodiment data is more valuable than we thought

Our first humanoid programs were single-platform. We added cross-embodiment capture late, partly because the unit cost is 1.5–2.5x higher.

Models trained on cross-embodiment data outperformed single-platform models by margins that were embarrassingly large — the kind of margins that make you question whether you should ever run a single-platform program again.

The most underrated benefit is that cross-embodiment also forces better task spec discipline. “Pick up the mug” means different things on different platforms, and you can’t paper over that ambiguity. The spec gets tighter, which helps even single-platform downstream.

3. Failure-recovery data is what actually shifts production performance

For our first 20K trajectories, we captured almost exclusively successful demonstrations. Models trained on this data reached competent in-domain performance but failed badly when they encountered situations they hadn’t seen — they had no notion of how to recover from their own mistakes.

We added failure-recovery capture as a deliberate data class around trajectory #30K. From then on, roughly 20% of new captures were failures with recovery, either triggered by the operator or encountered naturally. Models trained on the resulting mix (the first 30K plus new captures with that 20% failure-recovery share) reached production performance much faster than models trained on 50K all-successful trajectories.

The intuition: production-grade robotics is about graceful failure as much as it is about success.
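To put the mix in perspective, here is a back-of-the-envelope sketch of how much failure-recovery data that implies. It assumes everything before the 30K cutover was a successful demonstration and that exactly 20% of captures after it were failure-recovery; both are simplifications of the "almost exclusively" and "roughly" figures above, and the function name is hypothetical.

```python
def failure_recovery_share(total, cutover, new_capture_rate):
    """Estimate failure-recovery count and its share of the whole dataset.

    Assumes no failure-recovery trajectories exist before `cutover`.
    """
    new_captures = total - cutover
    recovery = round(new_captures * new_capture_rate)
    return recovery, recovery / total


recovery_count, overall_share = failure_recovery_share(
    total=50_000, cutover=30_000, new_capture_rate=0.20
)
print(f"{recovery_count} failure-recovery trajectories, {overall_share:.0%} of the dataset")
```

Under those assumptions, only a small slice of the full dataset is failure-recovery data, which is part of why its effect on production performance stood out.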

4. Annotation cost is the dominant cost, not capture

Going in, we budgeted annotation at roughly 20% of capture cost. After 50K trajectories, our actual annotation cost is closer to 90% of capture cost — nearly 1:1 — driven by:

  • Phase boundary labeling (where does “approach” end?)
  • Goal annotations per sub-task
  • Failure/success grading
  • Language alignment captions
  • Edge-case classification

If we were starting again, we’d build the annotation pipeline before scaling capture. Annotation surprises kill more programs than capture surprises do.
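The budget impact is easy to underestimate, so here is a minimal sketch of the arithmetic. It models annotation as a fixed fraction of capture cost, using the 20% budgeted and 90% actual ratios above; the capture cost of 100 is an arbitrary, hypothetical unit, and the model ignores every other cost line.

```python
def program_cost(capture_cost, annotation_ratio):
    """Total cost when annotation is modeled as a fraction of capture cost."""
    return capture_cost * (1 + annotation_ratio)


capture = 100.0  # hypothetical capture cost, arbitrary units

budgeted = program_cost(capture, 0.20)  # what we planned for
actual = program_cost(capture, 0.90)    # what we observed after 50K trajectories
overrun = (actual - budgeted) / budgeted

print(f"Budgeted: {budgeted:.0f}, actual: {actual:.0f}, overrun: {overrun:.0%}")
```

In this simplified model, a program scoped at the original ratio ends up well over half again as expensive as planned, with the entire gap coming from annotation.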

5. The customer’s acceptance criteria change between week 2 and week 8

The biggest surprise has been how often the customer’s definition of acceptable data shifts as they train their first model on a delivered batch. “Trajectories we thought were fine” turn out to encode subtle artifacts the model latches onto. “Trajectories we thought were too noisy” turn out to be exactly the variation the model needs.

We now run formal recalibration sessions every two weeks with customers who are actively training. The customer’s evolving understanding of what they need is part of the program, not an unwelcome surprise.

What’s next

The 50,000 are split across 12 programs. The next 50,000 will probably come from 3–4, as the programs that survived their first model deployment scale up. We expect the lessons from the next 50K to differ from those of the first, mostly concerning the long-horizon and multi-day data classes that the early programs haven’t needed yet.

If you’re running a humanoid program and any of the above sounds familiar, we should talk.


Pingal Mukherjee · Manager, Presales & Bid Management

Pingal scopes data collection programs for humanoid, surgical, and industrial robotics clients, translating ML requirements into collection specifications for presales and bid teams, and writes about data strategy and sim-to-real transfer.

