How much humanoid robot training data do you actually need? The honest answer depends on three things: your deployment tier, your model architecture, and what “enough” means for your specific use case.
The humanoid robot training data question every team asks differently
Ask five humanoid robotics teams how much training data they need and you will get five different answers — not because they disagree, but because they are answering different questions. One team means “enough to get to demo quality.” Another means “enough to deploy in a controlled warehouse.” A third means “enough to handle the full distribution of environments a field robot will encounter.”
These require very different dataset sizes, and conflating them is one of the most expensive mistakes in robotics ML.
A working framework: three deployment tiers
Before estimating data requirements, define which tier you are targeting:
Tier 1 — Controlled environment, fixed task set. Single facility, well-defined workspace, 5–15 task variants, human oversight available. Examples: parts assembly on a known production line, warehouse picking from a fixed SKU catalog, structured lab protocols.
Tier 2 — Semi-structured environment, variable task set. Multiple facilities or environment configurations, 20–50 task variants, infrequent human oversight. Examples: retail stocking across multiple store layouts, hospital logistics across floor configurations, commercial kitchen prep.
Tier 3 — Unstructured environment, open task set. Novel environments, unbounded task variability, minimal human oversight expected. Examples: general-purpose home robots, field service robots, disaster response.
Data requirements by tier
Based on published results and operational experience across humanoid deployments:
Tier 1: 500–5,000 demonstrations per task variant. A 10-task deployment in a controlled setting can achieve reliable performance with 5,000–50,000 total demonstrations. The lower end is achievable with high-quality teleop from skilled operators; the upper end applies when using crowd-sourced or lower-consistency collection methods.
Tier 2: 5,000–50,000 demonstrations per task variant for core tasks, plus roughly 20–30% more demonstrations covering edge cases and environment variants. A 30-task deployment targeting multiple sites typically requires 150,000–1.5M demonstrations for the core tasks alone, depending on model architecture and the degree of sim augmentation used.
Tier 3: Unknown upper bound. Current frontier models (as of 2025) are trained on tens of millions of demonstrations and still exhibit significant failure modes in novel environments. General-purpose humanoid capability likely requires internet-scale data collection analogous to what enabled large language models — an unsolved problem.
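The tier arithmetic above can be sketched as a small estimator. This is a rough planning aid, not a validated model: the per-variant ranges are copied from the Tier 1 and Tier 2 figures, while the function name `estimate_total_demos` and the `edge_case_fraction` parameter are hypothetical, and treating the 20–30% edge-case allowance as a simple multiplier on the core count is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierRange:
    low_per_variant: int
    high_per_variant: int


# Per-variant ranges taken from the Tier 1 and Tier 2 figures in the text.
# Tier 3 is deliberately absent: the text gives no upper bound for it.
TIER_RANGES = {
    1: TierRange(500, 5_000),
    2: TierRange(5_000, 50_000),
}


def estimate_total_demos(tier, num_task_variants, edge_case_fraction=0.0):
    """Return a (low, high) range of total demonstrations.

    edge_case_fraction is an assumed multiplier (e.g. 0.2 to 0.3 for Tier 2)
    applied on top of the core per-variant counts.
    """
    if tier not in TIER_RANGES:
        raise ValueError(f"no bounded estimate available for tier {tier}")
    if num_task_variants <= 0:
        raise ValueError("num_task_variants must be positive")
    r = TIER_RANGES[tier]
    scale = 1.0 + edge_case_fraction
    low = round(r.low_per_variant * num_task_variants * scale)
    high = round(r.high_per_variant * num_task_variants * scale)
    return low, high


if __name__ == "__main__":
    print(estimate_total_demos(1, 10))        # 10-task Tier 1 deployment
    print(estimate_total_demos(2, 30))        # 30-task Tier 2, core tasks only
    print(estimate_total_demos(2, 30, 0.25))  # same, with an assumed 25% edge-case allowance
```

The output is only as good as the per-variant ranges, which shift with model architecture and collection quality, so it is best read as a budgeting bracket rather than a target.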
Why architecture changes the answer
Data requirements are not independent of model architecture. Diffusion policy architectures typically need only 50–200 demonstrations per behavior before they start to generalize, but they generalize poorly across environment changes. Transformer-based architectures with visual pre-training (e.g., RT-2-style models) require more demonstrations to fine-tune but generalize better across novel objects and configurations. Foundation model approaches (e.g., π0, OpenVLA) require less task-specific data when the pre-training distribution covers the deployment domain well.
If your target tasks are well-represented in published robotics datasets (standard manipulation, locomotion on even terrain), foundation model fine-tuning can dramatically reduce your data requirements — sometimes by an order of magnitude. If your tasks are novel (surgical robotics, specific industrial processes, unusual end-effector configurations), you are likely starting from scratch.
The distribution problem
Raw demonstration count is a poor proxy for data quality. What matters is distribution coverage: does your dataset represent the full range of conditions the deployed robot will encounter?
A dataset of 50,000 demonstrations collected in a single facility, on a clear day, under one lighting condition is likely to produce a model that fails in a second facility under fluorescent lighting. A dataset of 5,000 demonstrations collected across six facilities in varied lighting, with deliberate coverage of failure modes and recovery behaviors, will often outperform it.
Before scaling collection, map your deployment distribution: what are the environment variables (lighting, surface texture, object pose variance, clutter level, operator handoff conditions)? Design your collection protocol to sample that distribution deliberately, not opportunistically.
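One way to make the distribution map concrete is to treat each combination of environment variables as a cell and count the demonstrations that fall in it. The sketch below assumes each demonstration carries metadata tags for the variables you care about. The function name `coverage_report`, the `min_per_cell` parameter, and the example variable values are all hypothetical.

```python
from collections import Counter
from itertools import product


def coverage_report(demos, variables, min_per_cell=1):
    """Report how much of the deployment grid the dataset covers.

    demos: list of dicts mapping variable name -> observed value.
    variables: dict mapping variable name -> list of values expected in deployment.
    Returns (fraction_of_cells_covered, list_of_under_covered_cells).
    """
    names = sorted(variables)
    cells = list(product(*(variables[n] for n in names)))
    if not cells:
        raise ValueError("every variable needs at least one expected value")
    # Count demonstrations per combination of variable values.
    counts = Counter(tuple(d[n] for n in names) for d in demos)
    under = [dict(zip(names, c)) for c in cells if counts[c] < min_per_cell]
    return (len(cells) - len(under)) / len(cells), under


if __name__ == "__main__":
    variables = {
        "lighting": ["daylight", "fluorescent"],
        "clutter": ["low", "high"],
    }
    demos = [
        {"lighting": "daylight", "clutter": "low"},
        {"lighting": "daylight", "clutter": "low"},
        {"lighting": "daylight", "clutter": "high"},
    ]
    fraction, gaps = coverage_report(demos, variables)
    print(f"covered {fraction:.0%} of cells")
    for gap in gaps:
        print("needs collection:", gap)
```

A full cross-product grows quickly as variables are added, so in practice you would probably restrict it to the handful of variables you expect to matter most and raise `min_per_cell` for the cells that dominate deployment.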
A practical starting point
For teams early in their data strategy, a reasonable sequence for a Tier 1 deployment is:
- Collect 200–500 demonstrations per task variant from skilled teleop operators
- Train a baseline model and identify failure modes in simulation
- Target additional collection specifically at the failure distribution
- Repeat until the failure rate in sim drops below your deployment threshold
- Collect 500–2,000 additional demonstrations in the target environment before deployment
This iterative approach almost always produces better results than a single large collection run, because it forces early identification of the hard cases before you have invested in 50,000 demonstrations of the easy ones.
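The collect, train, evaluate, and re-collect loop described above can be sketched as follows. The three callables (`collect`, `train`, `evaluate_in_sim`) are hypothetical placeholders for your own teleop pipeline, training job, and simulator harness, and the default numbers are illustrative. The final in-environment collection step is left out because it happens after the loop ends.

```python
def iterative_collection(
    collect,            # hypothetical: collect(task, n) -> list of n demonstrations
    train,              # hypothetical: train(dataset) -> model
    evaluate_in_sim,    # hypothetical: evaluate_in_sim(model, tasks) -> {task: failure_rate}
    tasks,
    initial_per_task=200,
    followup_per_failure=50,
    failure_threshold=0.05,
    max_rounds=10,
):
    """Grow a dataset by targeting collection at tasks that still fail in sim."""
    dataset = []
    for task in tasks:
        dataset.extend(collect(task, initial_per_task))

    history = []
    for _ in range(max_rounds):
        model = train(dataset)
        failure_rates = evaluate_in_sim(model, tasks)
        history.append(failure_rates)
        # Only tasks above the threshold receive more demonstrations.
        failing = [t for t in tasks if failure_rates[t] > failure_threshold]
        if not failing:
            break
        for task in failing:
            dataset.extend(collect(task, followup_per_failure))
    return dataset, history
```

Keying the loop by task variant is a simplification. If your simulator can label failures by condition (lighting, clutter, object pose), keying the follow-up collection by condition would match the "failure distribution" idea more closely.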
If you’re trying to scope your first or next humanoid data program, read our scoping guide or send us the platform and the task.