“We have 50,000 trained operators” is a claim that fits on a slide. It tells you almost nothing about whether the data your program produces will be good enough for your model. This guide gives procurement and engineering leads a framework for evaluating whether a data partner’s operator network will meet your bar.
Operator quality is the joint product of three things:
Selection. Which humans are on the network. Background, prior experience, demographic coverage, domain expertise (clinical, mechanical, linguistic) where required.
Training. What operators learn before they touch your program. Platform-specific instruction, calibration sessions on your gold set, ongoing recalibration.
Quality gating. What gets shipped to you. Per-task acceptance, statistical sampling of work, gold-set retesting, operator-level performance tracking.
A partner can be strong on one of these and weak on the others. A partner with great selection but no training will ship inconsistent quality. A partner with great training but weak quality gating will ship good and bad work with no way to tell them apart. Evaluate all three.
When evaluating a data partner, ask these seven questions. The answers tell you more than any marketing material:
1. What’s the selection process? How does a person become an operator on your network? What do they have to demonstrate before they’re trusted with paid work? Acceptable answers: “background check + skill calibration + training program.” Unacceptable: “anyone with an internet connection.”
2. How are operators trained on a specific program? Is there a calibration session with the customer? A documented task spec? Ongoing recalibration when quality drifts? If the answer is “we send them the spec and they figure it out,” quality will be a random variable.
3. What’s the gold-set methodology? Is there a held-out set of “correct answers” used to evaluate operators continuously? How is it maintained, who builds it, and how often is it refreshed? Programs without active gold-set discipline drift in quality over months.
4. What’s the tier structure? Are operators tiered by skill (e.g., entry / senior / lead)? Does tier affect pay? Are tier transitions performance-based? A flat operator pool means no advancement incentive and high churn.
5. What’s the rejection rate, and what happens to rejected work? A partner with a 0% rejection rate is shipping bad work as good. A partner with a 30% rejection rate is wasting your time and theirs. Ask what “healthy” looks like for the data class. For most teleop programs: 5–12% rejection during ramp, 1–3% steady state.
6. How is operator behavior monitored during sessions? Real-time monitoring catches problems faster than post-hoc review. Acceptable: “sessions are sampled and reviewed within 24 hours.” Better: “sessions are monitored in real time with anomaly detection.” Unacceptable: “we review at the end of the week.”
7. What does operator churn look like? High churn means the training investment doesn’t compound. Low churn means experienced operators with deep program context. Ask for typical operator tenure on a program. Anything under three months means you’re paying to retrain constantly.
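The rejection-rate bands from question 5 can be turned into a quick check on the numbers a partner reports. The following is a minimal sketch, assuming the partner can give you rejected and total task counts per program phase; the function name `assess_rejection_rate` and the phase labels are hypothetical, and the bands are the teleop figures quoted above, not a universal standard.

```python
from typing import Literal

Phase = Literal["ramp", "steady"]

# Healthy rejection-rate bands for teleop programs, as quoted in question 5.
# Other data classes will likely need different bands.
HEALTHY_BANDS = {
    "ramp": (0.05, 0.12),
    "steady": (0.01, 0.03),
}


def assess_rejection_rate(rejected: int, total: int, phase: Phase) -> str:
    """Classify a reported rejection rate against the healthy band for a phase."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = rejected / total
    low, high = HEALTHY_BANDS[phase]
    if rate < low:
        return "too_low"   # bad work is probably shipping as good
    if rate > high:
        return "too_high"  # time is being wasted on rework
    return "healthy"


print(assess_rejection_rate(rejected=20, total=1000, phase="steady"))
```

A "too_low" result is not automatically good news: per question 5, a rate near zero usually means the gate is not rejecting anything.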
The single most important quality signal is whether the partner has built a calibration set for your specific program. A calibration set is 50–200 trajectories or labels where the customer (you) and the partner (them) agree on the “correct” answer. New operators are tested against it before being allowed on the program. Existing operators are spot-tested against it monthly.
If a partner can’t produce a calibration set within two weeks of program start, expect inconsistent quality. If they have a calibration set but can’t tell you which operators have passed it and when, the quality bar isn’t actually enforced.
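One way to test whether the calibration bar is enforced is to ask for per-operator calibration results and check them yourself. The sketch below assumes the partner can export one record per test attempt; the `CalibrationResult` fields, the 90% pass threshold, and the 31-day retest window are illustrative assumptions, not values any partner is known to use.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CalibrationResult:
    operator_id: str
    tested_on: date
    matches: int  # items where the operator agreed with the gold answer
    items: int    # calibration items attempted


PASS_THRESHOLD = 0.90                 # hypothetical pass bar; agree on it with the partner
RETEST_INTERVAL = timedelta(days=31)  # approximates "spot-tested monthly"


def cleared_operators(results: list[CalibrationResult], today: date) -> set[str]:
    """Return operators whose most recent test both passed and is still fresh."""
    latest: dict[str, CalibrationResult] = {}
    for r in results:
        prev = latest.get(r.operator_id)
        if prev is None or r.tested_on > prev.tested_on:
            latest[r.operator_id] = r

    cleared = set()
    for operator_id, r in latest.items():
        passed = r.items > 0 and r.matches / r.items >= PASS_THRESHOLD
        fresh = today - r.tested_on <= RETEST_INTERVAL
        if passed and fresh:
            cleared.add(operator_id)
    return cleared
```

Comparing this set with the operators who actually produced your data shows whether anyone worked on the program without a current pass.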
Three patterns that consistently predict program failure:
Red flag 1: “We have N thousand operators.” Headcount is a vanity metric. What matters is how many operators are currently active on programs like yours, with calibrated quality. “50 trained on humanoid teleop” is more useful than “50,000 in the network.”
Red flag 2: Single-tier pricing across all data classes. If a partner charges the same per hour for entry-level annotation and surgical-grade teleop, they don’t differentiate operator skill internally. Either you’re overpaying for simple work or you’re getting under-skilled people on complex work.
Red flag 3: No audit trail. Can the partner tell you, for any given trajectory in your dataset, who operated it, when, on which rig, and with which calibration? If not, you can’t debug quality issues, you can’t pass regulatory audits (for example under HIPAA or ITAR), and you can’t learn from operator-level signal. Audit trails are not optional.
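The audit-trail requirement in red flag 3 can be checked mechanically on a sample delivery. This is a minimal sketch, assuming each trajectory ships with a metadata record; the `AuditRecord` field names are hypothetical and simply mirror the four questions above (who, when, which rig, which calibration).

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass
class AuditRecord:
    trajectory_id: str
    operator_id: Optional[str]       # who operated it
    recorded_at: Optional[datetime]  # when
    rig_id: Optional[str]            # on which rig
    calibration_id: Optional[str]    # with which calibration


def missing_fields(record: AuditRecord) -> list[str]:
    """List the audit fields that are absent or empty on one record."""
    return [
        f.name for f in fields(record)
        if getattr(record, f.name) in (None, "")
    ]


def audit_coverage(records: list[AuditRecord]) -> float:
    """Fraction of records with every audit field populated."""
    if not records:
        return 0.0
    complete = sum(1 for r in records if not missing_fields(r))
    return complete / len(records)
```

Anything below full coverage on a sample batch is worth raising before the program scales, since gaps in the trail are hard to backfill later.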
When you’re ready to evaluate a partner against these seven questions, send us your spec. We’ll answer all seven on our own program before you commit to anything.