Blog post

Lessons from 50,000 humanoid trajectories

We crossed 50,000 humanoid trajectories last month — captured across Figure, Optimus, Apollo, and one platform we can’t name yet. The number is round and uninteresting on its own. The lessons from getting there are not.

Five things we learned, in order of how much they changed how we run programs:

1. Operator pacing matters more than operator skill

We initially graded operators on per-trajectory quality. We found that operators who shipped clean 50-trajectory batches at a slower pace consistently outperformed operators who shipped 200-trajectory batches at a faster pace, even when the fast operators’ per-trajectory quality was higher in their first 50.

The cause: long-session quality degradation. Past two hours of concentrated teleop, even skilled operators start producing trajectories that are subtly worse. The signal-to-noise ratio drops. Models trained on the back half of a long session learn artifacts.

We now schedule operators in 90-minute sessions with 20-minute breaks, no more than four sessions a day. Throughput is lower; usable trajectory yield is higher. Net unit economics improve.
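The schedule is easy to sanity-check in a few lines. The sketch below is a minimal illustration rather than a description of any real scheduling tool: the function names and the 09:00 start time are assumptions made for the example, and only the three constants come from the numbers above.

```python
from datetime import datetime, timedelta

SESSION_MINUTES = 90   # session length quoted above
BREAK_MINUTES = 20     # break length quoted above
MAX_SESSIONS = 4       # daily cap quoted above


def build_day_schedule(start, sessions=MAX_SESSIONS):
    """Return (session_start, session_end) pairs for one operator day.

    Hypothetical helper: `start` is an arbitrary start time chosen by the caller.
    """
    if not 1 <= sessions <= MAX_SESSIONS:
        raise ValueError(f"sessions must be between 1 and {MAX_SESSIONS}")
    blocks = []
    cursor = start
    for _ in range(sessions):
        end = cursor + timedelta(minutes=SESSION_MINUTES)
        blocks.append((cursor, end))
        # The next session starts after the break that follows this one.
        cursor = end + timedelta(minutes=BREAK_MINUTES)
    return blocks


def teleop_minutes(blocks):
    """Total minutes of active teleop across all sessions."""
    return sum(int((end - begin).total_seconds() // 60) for begin, end in blocks)


day = build_day_schedule(datetime(2025, 1, 6, 9, 0))  # assumed 09:00 start
for begin, end in day:
    print(f"{begin:%H:%M} to {end:%H:%M}")
print("teleop minutes:", teleop_minutes(day))
```

At the four-session cap that is six hours of teleop inside a seven-hour span, with every session ending well short of the two-hour mark where quality starts to slip.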

2. Cross-embodiment data is more valuable than we thought

Our first humanoid programs were single-platform. We added cross-embodiment capture late, partly because the unit cost is 1.5–2.5x higher.

Models trained on cross-embodiment data outperformed single-platform models by margins that were embarrassingly large — the kind of margins that make you question whether you should ever run a single-platform program again.

The most underrated benefit is that cross-embodiment also forces better task spec discipline. “Pick up the mug” means different things on different platforms, and you can’t paper over that ambiguity. The spec gets tighter, which helps even single-platform downstream.

3. Failure-recovery data is what actually shifts production performance

For our first 20K trajectories, we captured almost exclusively successful demonstrations. Models trained on this data reached competent in-domain performance but failed badly when they encountered situations they hadn’t seen — they had no notion of how to recover from their own mistakes.

We added failure-recovery capture as a deliberate data class around trajectory #30K. From that point, roughly 20% of new captures were failures, either triggered by the operator or encountered naturally, followed by a recovery. Models trained on this mix (the first 30K plus later captures at roughly 20% failure-recovery) reached production performance much faster than models trained on 50K all-successful trajectories.
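To make the mix concrete, here is a back-of-the-envelope sketch. It assumes, for illustration only, that every trajectory before #30K is a successful demonstration, that every trajectory after it counts as a new capture, and that the 20% rate held exactly. None of that is stated so precisely above, and `failure_recovery_mix` is a hypothetical helper.

```python
def failure_recovery_mix(total, cutover, failure_rate):
    """Split a dataset into success-only and failure-recovery counts.

    Simplifying assumptions (not from the post): everything before `cutover`
    is a successful demonstration, and exactly `failure_rate` of the
    trajectories after it include a failure with recovery.
    """
    if not 0 <= cutover <= total:
        raise ValueError("cutover must lie between 0 and total")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be a fraction between 0 and 1")
    new_captures = total - cutover
    failure_recovery = round(new_captures * failure_rate)
    return {
        "success_only": total - failure_recovery,
        "failure_recovery": failure_recovery,
        "failure_share_of_total": failure_recovery / total,
    }


mix = failure_recovery_mix(total=50_000, cutover=30_000, failure_rate=0.20)
print(mix)
```

Under those assumptions about 4,000 of the 50,000 trajectories, or 8% of the total, carry failure-recovery data.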

The intuition: production-grade robotics is about graceful failure as much as it is about success.

4. Annotation cost is the dominant cost, not capture

Going in, we budgeted annotation at roughly 20% of capture cost. After 50K trajectories, our actual annotation cost is closer to 90% of capture cost — nearly 1:1 — driven by:

Phase boundary labeling (where does “approach” end?)
Goal annotations per sub-task
Failure/success grading
Language alignment captions
Edge-case classification

If we were starting again, we’d build the annotation pipeline before scaling capture. Annotation surprises kill more programs than capture surprises do.
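The gap between budget and reality is easier to see as a share of combined spend. The sketch below is plain arithmetic on the two ratios quoted above (20% budgeted, roughly 90% actual). The function name is hypothetical, capture cost is normalized to 1.0 because no absolute figures are given, and the overrun figure additionally assumes capture itself came in on budget.

```python
def annotation_share(annotation_to_capture_ratio):
    """Annotation as a fraction of combined capture + annotation cost.

    Capture cost is normalized to 1.0, so the input is annotation cost
    expressed as a multiple of capture cost.
    """
    if annotation_to_capture_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return annotation_to_capture_ratio / (1.0 + annotation_to_capture_ratio)


budgeted = annotation_share(0.20)  # ratio budgeted going in
actual = annotation_share(0.90)    # ratio observed after 50K trajectories

# Combined cost relative to combined budget, assuming capture cost was on budget.
overrun = (1.0 + 0.90) / (1.0 + 0.20)

print(f"budgeted share: {budgeted:.1%}")
print(f"actual share:   {actual:.1%}")
print(f"combined cost vs. budget: {overrun:.2f}x")
```

Annotation moves from about a sixth of combined spend to nearly half, and under the on-budget capture assumption the program as a whole costs roughly 1.58x what was planned.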

5. The customer’s acceptance criteria change between week 2 and week 8

The biggest surprise has been how often the customer’s definition of acceptable data shifts as they train their first model on a delivered batch. “Trajectories we thought were fine” turn out to encode subtle artifacts the model latches onto. “Trajectories we thought were too noisy” turn out to be exactly the variation the model needs.

We now run formal recalibration sessions every two weeks with customers who are actively training. The customer’s evolving understanding of what they need is part of the program, not an unwelcome surprise.

What’s next

The 50,000 are split across 12 programs. The next 50,000 will probably come from 3–4, as the programs that survived their first model deployment scale up. We expect the lessons from the next 50K to differ from those of the first, and to center on the long-horizon and multi-day data classes that the early programs haven’t needed yet.

If you’re running a humanoid program and any of the above sounds familiar, we should talk.


Pingal Mukherjee · Manager, Presales & Bid Management

Pingal scopes data collection programs for humanoid, surgical, and industrial robotics clients, translating ML requirements into collection specifications for presales and bid teams, and writes about data strategy and sim-to-real transfer.


Ready to scope a program?

Send us the platform, the task, and the volume. A solutions engineer responds in one business day.
