Guide 04

Operator quality: how to evaluate a data partner

The seven questions that separate operations-grade data partners from labeling marketplaces.

9 MIN READ • LAST UPDATED JUNE 2026

“We have 50,000 trained operators” is a claim that fits on a slide. It tells you almost nothing about whether the data your program produces will be good enough for your model. This guide gives procurement and engineering leads a framework for evaluating whether a data partner’s operator network will meet your bar.

What operator quality actually means

Operator quality is the joint product of three things:

Selection. Which humans are on the network. Background, prior experience, demographic coverage, domain expertise (clinical, mechanical, linguistic) where required.

Training. What operators learn before they touch your program. Platform-specific instruction, calibration sessions on your gold set, ongoing recalibration.

Quality gating. What gets shipped to you. Per-task acceptance, statistical sampling of work, gold-set retesting, operator-level performance tracking.

A partner can be strong on one of these and weak on the others. A partner with great selection but no training will ship inconsistent quality. A partner with great training but weak quality gating will ship a mix of good and bad work with no way for you to tell them apart. Evaluate all three.

The seven questions

When evaluating a data partner, ask these seven. The answers tell you more than any marketing material:

1. What’s the selection process? How does a person become an operator on your network? What do they have to demonstrate before they’re trusted with paid work? An acceptable answer: “background check + skill calibration + training program.” An unacceptable one: “anyone with an internet connection.”

2. How are operators trained on a specific program? Is there a calibration session with the customer? A documented task spec? Ongoing recalibration when quality drifts? If the answer is “we send them the spec and they figure it out,” quality will be a random variable.

3. What’s the gold-set methodology? Is there a held-out set of “correct answers” used to evaluate operators continuously? How is it maintained, who builds it, and how often is it refreshed? Programs without active gold-set discipline drift in quality over months.

4. What’s the tier structure? Are operators tiered by skill (e.g., entry / senior / lead)? Does tier affect pay? Are tier transitions performance-based? A flat operator pool means no advancement incentive and high churn.

5. What’s the rejection rate, and what happens to rejected work? A partner with a 0% rejection rate is shipping bad work as good. A partner with a 30% rejection rate is wasting your time and theirs. Ask what “healthy” looks like for the data class. For most teleop programs: 5–12% rejection during ramp, 1–3% steady state.

6. How is operator behavior monitored during sessions? Real-time monitoring catches problems faster than post-hoc review. Acceptable: “sessions are sampled and reviewed within 24 hours.” Better: “sessions are monitored in real time with anomaly detection.” Unacceptable: “we review at the end of the week.”

7. What does operator churn look like? High churn means the training investment doesn’t compound. Low churn means experienced operators with deep program context. Ask for typical operator tenure on a program. Anything under three months means you’re paying to retrain constantly.
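Two of these questions come with numbers you can check yourself once a partner shares raw counts. The sketch below is a minimal illustration, not a standard tool: the function names are hypothetical, the rejection bands are the teleop ranges quoted in question 5, and using the median as the measure of “typical” tenure is an assumption.

```python
from statistics import median

# Bands quoted in question 5 for teleop programs, as fractions.
HEALTHY_REJECTION = {"ramp": (0.05, 0.12), "steady": (0.01, 0.03)}
# Question 7: tenure under three months means constant retraining.
MIN_TENURE_MONTHS = 3


def rejection_status(rejected: int, submitted: int, phase: str) -> str:
    """Classify a rejection rate as 'low', 'healthy', or 'high' for a phase."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    low, high = HEALTHY_REJECTION[phase]
    rate = rejected / submitted
    if rate < low:
        return "low"    # gating may be too lenient
    if rate > high:
        return "high"   # too much work is being thrown away
    return "healthy"


def tenure_is_short(tenures_months: list[float]) -> bool:
    """True when median operator tenure (an assumed summary) is under the threshold."""
    return median(tenures_months) < MIN_TENURE_MONTHS
```

For example, `rejection_status(8, 100, "ramp")` classifies an 8% rejection rate during ramp, and `tenure_is_short([1, 2, 6])` checks a three-operator pool. For data classes other than teleop, replace the bands with whatever the partner says “healthy” looks like.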

Calibration sets and gold standards

The single most important quality signal is whether the partner has built a calibration set for your specific program. A calibration set is 50–200 trajectories or labels where the customer (you) and the partner (them) agree on the “correct” answer. New operators are tested against it before being allowed on the program. Existing operators are spot-tested against it monthly.

If a partner can’t produce a calibration set within two weeks of program start, they’re going to ship inconsistent quality. If they have a calibration set but can’t tell you which operators have passed it and when, the quality bar isn’t actually enforced.
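The “which operators have passed it and when” check can be made concrete. The following is a minimal sketch under stated assumptions: the record layout, the 90% pass threshold, and the 31-day approximation of “monthly” are all illustrative choices, not values from any partner’s system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

PASS_THRESHOLD = 0.90                  # assumed agreement bar, set per program
RETEST_INTERVAL = timedelta(days=31)   # "monthly" spot test, approximated


@dataclass
class CalibrationResult:
    operator_id: str
    tested_on: date
    agreed: int   # items where the operator matched the agreed answer
    total: int    # calibration items attempted

    @property
    def score(self) -> float:
        return self.agreed / self.total if self.total else 0.0


def is_cleared(results: list[CalibrationResult], operator_id: str, today: date) -> bool:
    """Cleared means the operator's latest test is both recent and passing."""
    own = [r for r in results if r.operator_id == operator_id]
    if not own:
        return False  # never tested, so not allowed on the program
    latest = max(own, key=lambda r: r.tested_on)
    is_fresh = today - latest.tested_on <= RETEST_INTERVAL
    return is_fresh and latest.score >= PASS_THRESHOLD
```

A partner that enforces its quality bar should be able to produce the equivalent of this `results` list on request. If they cannot, there is nothing for a check like `is_cleared` to run against.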

Red flags in vendor evaluation

Three patterns that consistently predict program failure:

Red flag 1: “We have N thousand operators.” Headcount is a vanity metric. What matters is how many operators are currently active on programs like yours, with calibrated quality. “50 trained on humanoid teleop” is more useful than “50,000 in the network.”

Red flag 2: Single-tier pricing across all data classes. If a partner charges the same per hour for entry-level annotation and surgical-grade teleop, they don’t differentiate operator skill internally. Either you’re overpaying for simple work or you’re getting under-skilled people on complex work.

Red flag 3: No audit trail. Can the partner tell you, for any given trajectory in your dataset, who operated it, when, on which rig, and with which calibration? If not, you can’t debug quality issues, you can’t comply with regulatory audits (HIPAA, ITAR), and you can’t learn from operator-level signal. Audit trails are not optional.
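One way to test the audit-trail claim is to check delivered metadata for the four facts named above. This is a rough sketch that assumes each trajectory arrives with a metadata dictionary; the field names are hypothetical and will differ by partner.

```python
# Hypothetical field names for: who, when, which rig, which calibration.
REQUIRED_AUDIT_FIELDS = ("operator_id", "recorded_at", "rig_id", "calibration_id")


def missing_audit_fields(record: dict) -> list[str]:
    """Return the required audit fields that are absent or empty in a record."""
    return [field for field in REQUIRED_AUDIT_FIELDS if not record.get(field)]


def audit_coverage(records: list[dict]) -> float:
    """Fraction of trajectory records that carry a complete audit trail."""
    if not records:
        return 0.0
    complete = sum(1 for r in records if not missing_audit_fields(r))
    return complete / len(records)
```

Running `audit_coverage` over a sample delivery gives a single number to discuss with the partner. Anything below full coverage points to trajectories you would not be able to trace back to an operator, rig, or calibration.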

When you’re ready to evaluate a partner against these seven questions, send us your spec. We’ll answer all seven on our own program before you commit to anything.

More guides

GUIDE 01

How to scope a humanoid teleop program

A four-step framework for going from “we need data” to a signed SOW.

12 min read · For technical buyers

GUIDE 02

Cross-embodiment data: what it is and why it matters

Why policies trained on one platform fail on another, and what diverse training data actually looks like.

10 min read · For ML leads

GUIDE 03

Sim-to-real transfer: measurement and benchmarks

How to measure transfer accurately, and what benchmarks are actually worth trusting.

11 min read · For sim engineers

GUIDE 05

Choosing a delivery model: dedicated, crowdsource, or hybrid

When each model fits, what each costs, and the IP and quality tradeoffs at each tier.

8 min read · For program leads

GUIDE 06

Long-horizon data capture: methodology

Why 30-minute episodes break generic pipelines, and how we structure capture at scale.

11 min read · For research teams

Ready to scope a program?

Send us the platform, the task, and the volume. A solutions engineer responds in one business day.
