Most humanoid robotics teams spend more time scoping their first data collection program than running it. The reason: scoping a humanoid teleop program touches platform engineering, ML, ops, legal, and procurement — five teams that rarely sit at the same table. When the SOW finally lands, it often misses something that costs weeks to fix.
This guide is the framework we use with customers who are scoping their first program with us. It works regardless of whether you end up working with Roborax or building the program in-house.
The four steps: pick the data classes, size the volume, define the acceptance criteria, and plan the ramp.
If you can answer four questions — one per step — you have enough to start contracting.
“Data” is too vague. Humanoid teleop data is at least seven different things, and they have different cost models, different operator skill profiles, and different acceptance bars.
The seven we’d separate:
Whole-body trajectories — Coordinated upper and lower body kinematics captured at 30 Hz in the robot’s frame. The bread-and-butter of humanoid foundation model training. Operator wears an exoskeleton or uses a bilateral leader-follower rig; trajectories are recorded from joint encoders and IMUs alongside synchronized video.
Dexterous manipulation — Per-finger joint logs and grasp state for both hands. Less coupled with whole-body locomotion; can be captured on a tabletop bimanual rig and later re-projected onto a humanoid. Different operator skill profile (precision over speed).
Locomotion + manipulation chaining — Walking while carrying, reaching while balancing. The chains that break policies trained only on isolated demonstrations. Hardest to capture, most expensive per hour, but the bottleneck for production deployment.
Cross-embodiment trajectories — Same task captured across multiple humanoid morphologies (Figure, Optimus, Apollo, Neo). Used for transfer learning. Requires a calibrated rig of each platform; not all programs need this from day one.
Failure recovery — Operator recovers from a mid-task failure (dropped object, stumble, blocked path). Labeled for imitation or RL. Most underweighted data class — the one that takes models from 70% success to 90% in production.
Procedural narration — Operator describes what they’re doing while doing it. Language-aligned data for VLA training. Different operator (needs to articulate well); much cheaper per hour than physical teleop.
Edge-case captures — Targeted re-captures of failure modes flagged from your production policy logs. Not generated in a vacuum; requires existing model output to identify what to re-capture.
The question you need to answer: which two or three of these are you contracting for in this program? “All seven” isn’t a real answer — each has a different SOW.
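One way to make the selection concrete is to record it as data before drafting the SOW. The sketch below is a minimal illustration in Python; the `DataClass` enum and the `validate_selection` helper are hypothetical names, and the two-or-three limit simply encodes the guidance above rather than any fixed rule.

```python
from enum import Enum


class DataClass(Enum):
    """The seven data classes separated above."""
    WHOLE_BODY = "whole-body trajectories"
    DEXTEROUS = "dexterous manipulation"
    LOCO_MANIP_CHAIN = "locomotion + manipulation chaining"
    CROSS_EMBODIMENT = "cross-embodiment trajectories"
    FAILURE_RECOVERY = "failure recovery"
    NARRATION = "procedural narration"
    EDGE_CASE = "edge-case captures"


def validate_selection(selected: set) -> None:
    """Raise ValueError unless the program contracts for two or three classes."""
    if not 2 <= len(selected) <= 3:
        raise ValueError(
            f"expected 2-3 data classes per program, got {len(selected)}"
        )


# Example (hypothetical program): whole-body plus failure-recovery data.
program_classes = {DataClass.WHOLE_BODY, DataClass.FAILURE_RECOVERY}
validate_selection(program_classes)
```

Keeping the selection explicit makes it easier to attach a separate volume, cost, and acceptance spec to each class later.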
Volume questions look easy and are not. “How many trajectories do we need?” is the wrong question. The right questions:
How many trajectories does your model architecture process per gradient step, and how many steps do you train for? That’s your training set. Double it to cover held-out validation.
What’s your sample efficiency budget? Different policies have different curves. If your team has historical data on previous models, use that. If not, the rule of thumb for humanoid manipulation policies is 5,000–20,000 task demonstrations to reach competent behavior; 50,000–100,000 to reach production quality.
What’s your per-trajectory unit cost target? This drives delivery model choice. If you have $80 per trajectory to spend, dedicated teams probably don’t fit your budget; you’d lean toward hybrid or crowdsource. If you have $400 per trajectory, dedicated is open to you.
How does volume break down by data class? A balanced humanoid program might be 60% whole-body trajectories, 20% dexterous, 10% locomotion+manipulation chains, 10% failure recovery. Your model’s bottleneck will shift these ratios — don’t anchor to industry numbers if your training pipeline tells you something different.
The question you need to answer: what’s the total trajectory count target per data class, and what’s the unit cost ceiling?
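The per-class targets and the unit cost ceiling are simple arithmetic, sketched here in Python. The class mix is the illustrative "balanced program" ratio from above; the function names, the 20,000-trajectory total, and the $2,000,000 budget are assumptions made up for the example, not recommendations.

```python
import math

# Illustrative mix from the balanced-program example; replace with your own.
CLASS_MIX = {
    "whole_body": 0.60,
    "dexterous": 0.20,
    "loco_manip_chain": 0.10,
    "failure_recovery": 0.10,
}


def trajectory_targets(total: int, mix: dict) -> dict:
    """Split a total trajectory target across data classes by ratio.

    Rounding each class independently can leave the sum a few
    trajectories off the total for awkward ratios.
    """
    if not math.isclose(sum(mix.values()), 1.0):
        raise ValueError("class mix must sum to 1.0")
    return {name: round(total * share) for name, share in mix.items()}


def unit_cost_ceiling(budget_usd: float, total: int) -> float:
    """Maximum average spend per trajectory for a given budget."""
    if total <= 0:
        raise ValueError("total must be positive")
    return budget_usd / total


# Hypothetical program: 20,000 trajectories on a $2,000,000 budget.
targets = trajectory_targets(20_000, CLASS_MIX)
ceiling = unit_cost_ceiling(2_000_000, 20_000)
print(targets)  # whole_body 12000, dexterous 4000, the other two 2000 each
print(ceiling)  # 100.0
```

With these made-up numbers the ceiling is $100 per trajectory, which sits much closer to the $80 example than the $400 one.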
Most scoping documents skip this section. The result: at week 4, the operator team is shipping data the model team rejects, and nobody knows why because no one wrote down the bar.
Acceptance criteria for humanoid teleop typically have four dimensions:
Trajectory quality — Smoothness (no joint velocity spikes), stability (no falls or contact failures), task completion (reached the goal state). These are typically auto-evaluated on capture.
Annotation completeness — Phase labels, success/failure labels, and goal annotations are attached. Checked automatically, then spot-checked by a human.
Demographic and environmental coverage — Lighting variations, scene variations, operator variations (height, hand size, technique). Tracked at the dataset level, not per trajectory.
Domain alignment — Operator’s interpretation of the task matches your deployment context. Easy to get wrong; needs ongoing calibration sessions between your team and the operator team.
The question you need to answer: what’s your gold set (the reference trajectories that define what passes), and who from your team owns the acceptance review?
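The per-trajectory checks (trajectory quality and annotation completeness) are the ones that lend themselves to automation. The following Python sketch shows what such a check could look like. The `Trajectory` fields, the label names, and the 3.0 rad/s speed limit are all hypothetical; a real bar would come from your own gold set and platform limits.

```python
from dataclasses import dataclass, field

REQUIRED_LABELS = {"phase", "outcome", "goal"}  # hypothetical label keys
MAX_JOINT_SPEED = 3.0  # rad/s, hypothetical spike threshold


@dataclass
class Trajectory:
    joint_positions: list  # rows = samples, columns = joints (radians)
    sample_rate_hz: float
    reached_goal: bool
    fell: bool
    labels: dict = field(default_factory=dict)


def max_joint_speed(traj: Trajectory) -> float:
    """Peak absolute joint speed, by finite difference between samples."""
    dt = 1.0 / traj.sample_rate_hz
    peak = 0.0
    for prev, curr in zip(traj.joint_positions, traj.joint_positions[1:]):
        for a, b in zip(prev, curr):
            peak = max(peak, abs(b - a) / dt)
    return peak


def acceptance_failures(traj: Trajectory) -> list:
    """Return the reasons a trajectory fails; an empty list means it passes."""
    failures = []
    if max_joint_speed(traj) > MAX_JOINT_SPEED:
        failures.append("velocity_spike")
    if traj.fell:
        failures.append("fall")
    if not traj.reached_goal:
        failures.append("task_incomplete")
    missing = REQUIRED_LABELS - set(traj.labels)
    if missing:
        failures.append("missing_labels:" + ",".join(sorted(missing)))
    return failures
```

Returning the list of reasons rather than a bare pass/fail gives the operator team something to act on when a batch is rejected. Coverage and domain alignment are dataset-level and human-reviewed, so they are deliberately left out of this check.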
Most programs ramp in four phases: pilot (2 weeks), expansion (4–6 weeks), production (months), and edge-case loop (ongoing).
Pilot (2 weeks): Single operator, single rig, 100–500 trajectories. Purpose is calibration, not volume. The deliverable is “we agree on what acceptance means” — not the trajectories themselves.
Expansion (4–6 weeks): Five to ten operators, multi-rig, 5,000–10,000 trajectories. Purpose is to find process bottlenecks before scaling further.
Production (months): Full team, 20,000+ trajectories per month if the budget supports it. Purpose is volume against the spec from steps 1–3.
Edge-case loop (ongoing): Targeted re-captures based on your production policy’s failure logs. Lower volume, higher unit cost, highest ML value per trajectory.
The question you need to answer: which phase are you contracting for, and what triggers transition to the next phase?
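Writing the phases and the transition trigger down as data keeps both sides honest about when to scale. The Python sketch below uses the trajectory ranges listed above; the `Phase` structure, the `ready_to_advance` helper, and the 90% acceptance-rate threshold are assumptions for illustration, and your own trigger may look quite different. The edge-case loop is omitted because it has no stated volume range.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    min_trajectories: int
    max_trajectories: Optional[int]  # None means no stated upper bound
    purpose: str


RAMP = [
    Phase("pilot", 100, 500, "calibrate what acceptance means"),
    Phase("expansion", 5_000, 10_000, "find process bottlenecks"),
    # Production volume is per month, unlike the two phases above.
    Phase("production", 20_000, None, "volume against the spec"),
]


def ready_to_advance(
    phase: Phase,
    accepted: int,
    submitted: int,
    min_accept_rate: float = 0.9,  # hypothetical threshold
) -> bool:
    """Hypothetical trigger: enough accepted trajectories at a high
    enough acceptance rate to justify moving to the next phase."""
    if submitted == 0:
        return False
    enough_volume = accepted >= phase.min_trajectories
    enough_quality = accepted / submitted >= min_accept_rate
    return enough_volume and enough_quality


pilot = RAMP[0]
print(ready_to_advance(pilot, accepted=450, submitted=480))
```

Whatever trigger you choose, the useful property is that it is agreed in the SOW rather than negotiated after the pilot ends.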
If you can answer the four questions, one from each step, you have enough to draft an SOW. Most teams over-engineer the first program. The goal of the pilot phase is calibration, not scale; the goal of the SOW is to enable that calibration, not to commit to volume targets you don’t yet have evidence for.
When you’re ready, send these four answers to a solutions engineer. We respond with a scoped SOW within one business day.