Scaling teleop data collection is one of the hardest operational problems in robotics. The volume grows quickly. The quality does not always follow.
The teleop data collection quality cliff nobody warns you about
Scaling teleoperation data collection sounds straightforward: hire more operators, run more sessions, collect more trajectories. In practice, most teams hit a quality cliff between 5,000 and 20,000 demonstrations. The data volume grows, but downstream model performance plateaus or regresses.
The cause is almost always operator variance, not hardware failure, annotation error, or the sim-to-real gap. The operators who performed consistently at 200 sessions per week start cutting corners at 2,000. Latency that was within tolerance during a pilot drifts imperceptibly, and the resulting artifacts compound across a large dataset. QA workflows designed for small batches become bottlenecks that incentivize passing marginal demonstrations.
This post covers the operational levers that prevent quality degradation as you scale.
1. Operator selection and calibration
Not every operator performs equally on every task type. A skilled teleop operator for bimanual pick-and-place may produce low-quality demonstrations for tool-use tasks requiring precise wrist rotation. Before scaling, run a calibration battery across all task categories you plan to collect. Score on smoothness (jerk metric), success rate, and time-on-task against a reference trajectory.
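As a rough illustration of the smoothness score, the sketch below estimates mean squared jerk from uniformly sampled positions using a third finite difference. The function name, the single-axis input, and the choice of mean squared jerk as the metric are assumptions made for this example; nothing above prescribes a specific jerk formula.

```python
from typing import Sequence


def mean_squared_jerk(positions: Sequence[float], dt: float) -> float:
    """Mean squared jerk of a 1-D trajectory sampled every `dt` seconds.

    Hypothetical helper: jerk is approximated with the third finite
    difference (x[i+3] - 3*x[i+2] + 3*x[i+1] - x[i]) / dt**3.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2]
         + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return sum(j * j for j in jerks) / len(jerks)
```

Lower values mean a smoother trajectory. For a calibration battery, one option is to compute the same score for the reference trajectory and compare the operator's score against it, per joint or per end-effector axis.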
Set a minimum threshold before assigning operators to production sessions; we recommend the 85th percentile on the calibration battery. Operators who fall below the threshold on specific task types get reassigned to the task types where they qualify, not removed. A specialist roster outperforms a generalist pool at scale.
2. Latency envelopes per task class
Teleop latency requirements are task-dependent. For whole-body locomotion, 150ms round-trip is often acceptable. For precision manipulation — inserting connectors, handling deformable objects — anything above 80ms introduces artifacts that corrupt the action signal. Define and enforce latency envelopes per task class before scaling, not after you notice model regressions.
Monitor latency in real time during sessions, not just in post-processing. An operator completing 40 precision-manipulation sessions per day on a connection that drifts to 120ms after hour three will keep producing out-of-envelope data for the rest of the shift before a daily QA review catches it.
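A minimal sketch of in-session latency monitoring against per-task envelopes might look like the following. The 150ms and 80ms limits come from the figures above; the task-class labels, the `LatencyMonitor` class, the rolling-median rule, and the window size are all assumptions chosen for illustration.

```python
import statistics
from collections import deque

# Round-trip latency limits in milliseconds, from the figures in the text.
# The task-class names are hypothetical labels.
LATENCY_ENVELOPE_MS = {
    "whole_body_locomotion": 150.0,
    "precision_manipulation": 80.0,
}


class LatencyMonitor:
    """Tracks a rolling window of round-trip latency samples for one session."""

    def __init__(self, task_class: str, window: int = 50):
        self.limit_ms = LATENCY_ENVELOPE_MS[task_class]
        self.samples = deque(maxlen=window)

    def add_sample(self, latency_ms: float) -> bool:
        """Record a sample; return True if the rolling median is out of envelope."""
        self.samples.append(latency_ms)
        return statistics.median(self.samples) > self.limit_ms
```

Using a rolling median rather than single samples is a design choice here: it ignores isolated spikes but trips within a few samples once the connection has genuinely drifted, which is the case a daily review catches too late.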
3. Automated QA at collection time
Manual QA does not scale past a few hundred demonstrations per day. Build automated checks into the collection pipeline:
- Jerk filter: flag trajectories where joint jerk exceeds 3 standard deviations from operator baseline
- Stall detector: flag episodes where end-effector velocity drops to zero for more than 400ms mid-task (operator hesitation, not genuine task behavior)
- Success classifier: use a lightweight vision model to verify task completion before the operator marks the episode done
- Outlier embedding check: embed each trajectory and flag outliers versus the session distribution — catches both low-quality and genuinely novel demonstrations that warrant human review
Flag, do not auto-reject. Automatic rejection creates incentives to game the detector. Flag and route to human review; track per-operator flag rates to identify who needs retraining.
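Two of the checks above, the jerk filter and the stall detector, can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the function names, the one-sided jerk test, the `eps` speed threshold standing in for "zero", and the fixed sampling interval are all hypothetical. Note that the functions only return flags; nothing is rejected.

```python
from typing import List, Sequence


def jerk_flag(episode_jerk: float, baseline_mean: float, baseline_std: float,
              z_limit: float = 3.0) -> bool:
    """True if the episode's jerk score is more than `z_limit` standard
    deviations above the operator's baseline (one-sided, by assumption)."""
    if baseline_std <= 0:
        return False  # no usable baseline yet, so do not flag
    return (episode_jerk - baseline_mean) / baseline_std > z_limit


def stall_flag(speeds: Sequence[float], dt_ms: float,
               max_stall_ms: float = 400.0, eps: float = 1e-3) -> bool:
    """True if end-effector speed stays at or below `eps` for longer than
    `max_stall_ms` between the first and last moving samples."""
    moving = [i for i, s in enumerate(speeds) if s > eps]
    if not moving:
        return False  # never moved; leave that case to the success check
    run = 0
    # Only look mid-task: stillness before the first or after the last
    # movement is not counted as a stall.
    for s in speeds[moving[0]:moving[-1] + 1]:
        run = run + 1 if s <= eps else 0
        if run * dt_ms > max_stall_ms:
            return True
    return False


def qa_flags(episode_jerk: float, baseline_mean: float, baseline_std: float,
             speeds: Sequence[float], dt_ms: float) -> List[str]:
    """Collect flag names for human review; never rejects an episode."""
    flags = []
    if jerk_flag(episode_jerk, baseline_mean, baseline_std):
        flags.append("jerk")
    if stall_flag(speeds, dt_ms):
        flags.append("stall")
    return flags
```

The returned flag names can be stored with the episode and counted per operator, which is what makes per-operator flag rates cheap to track.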
4. Session structure and operator fatigue
Operator fatigue is real and measurable. Teleop performance typically degrades after 90 minutes of continuous operation. At scale, operators running 8-hour shifts without structured breaks produce materially worse data in hours 4–8 than in hours 1–3.
Enforce 15-minute breaks every 75 minutes. Segment sessions so that post-break performance can be tracked separately. If post-break performance does not recover to within 10% of pre-session baseline, the session data from that block should be flagged for review.
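The break schedule and the recovery check are simple enough to encode directly. The sketch below assumes a single scalar quality score where higher is better; the function names and that scoring convention are assumptions for the example, while the 75-minute, 15-minute, and 10% figures come from the text.

```python
from typing import List, Tuple


def work_blocks(shift_minutes: int, work: int = 75,
                rest: int = 15) -> List[Tuple[int, int]]:
    """Return (start, end) minute offsets of work blocks within a shift,
    with a `rest`-minute break after each `work`-minute block."""
    blocks = []
    start = 0
    while start < shift_minutes:
        end = min(start + work, shift_minutes)
        blocks.append((start, end))
        start = end + rest
    return blocks


def block_needs_review(baseline_score: float, post_break_score: float,
                       tolerance: float = 0.10) -> bool:
    """True if the post-break score is more than `tolerance` (as a fraction)
    below the baseline. Assumes higher scores mean better demonstrations."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return post_break_score < baseline_score * (1.0 - tolerance)
```

Tagging each episode with the index of its work block is what makes the post-break comparison possible later, so it is worth recording at collection time rather than reconstructing from timestamps.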
5. The feedback loop that prevents drift
The most important quality lever is a fast feedback loop between measured data quality and the operator. Operators should see their per-session quality scores within 2 hours of session end, not in a weekly review. When an operator sees that 18% of their trajectories were flagged for jerk artifacts, they correct within one to two sessions. When they see a weekly aggregate, the behavior that caused the issue is already gone from memory.
Build a per-operator dashboard. Track: success rate, jerk flag rate, stall flag rate, outlier flag rate, and latency 95th percentile. Set thresholds. Route operators below threshold to a supervised session before they return to production.
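As a sketch of what sits behind such a dashboard, the code below aggregates one operator's episodes into the five tracked metrics and applies thresholds. The `Episode` record, the flag names, and every threshold value are hypothetical placeholders; the nearest-rank method for the 95th percentile is likewise one choice among several.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Sequence


@dataclass(frozen=True)
class Episode:
    success: bool
    flags: FrozenSet[str]   # e.g. {"jerk", "stall", "outlier"}
    latency_ms: float       # representative round-trip latency for the episode


def p95(values: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method (integer arithmetic)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def operator_summary(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate one operator's episodes into dashboard metrics."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")

    def rate(flag: str) -> float:
        return sum(flag in e.flags for e in episodes) / n

    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "jerk_flag_rate": rate("jerk"),
        "stall_flag_rate": rate("stall"),
        "outlier_flag_rate": rate("outlier"),
        "latency_p95_ms": p95([e.latency_ms for e in episodes]),
    }


# Placeholder thresholds; real values depend on the task class and program.
MIN_ALLOWED = {"success_rate": 0.90}
MAX_ALLOWED = {"jerk_flag_rate": 0.10, "stall_flag_rate": 0.10,
               "outlier_flag_rate": 0.10, "latency_p95_ms": 80.0}


def needs_supervised_session(summary: Dict[str, float]) -> bool:
    """True if any metric falls outside its threshold."""
    return (any(summary[k] < v for k, v in MIN_ALLOWED.items())
            or any(summary[k] > v for k, v in MAX_ALLOWED.items()))
```

Running this per session rather than per week keeps the summary aligned with the 2-hour feedback target.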
What this means at 50,000 demonstrations
At 50,000 demonstrations across 30 operators and 8 task types, quality is an operational problem, not a data science problem. The signal is there. The question is whether your collection pipeline is disciplined enough to capture it consistently. Teams that get this right can find that a 50k-demonstration dataset trains models that outperform those trained on 200k demonstrations from unstructured collection pipelines.
Volume is a proxy for quality only when the collection process is controlled. Otherwise it is a proxy for noise.
If you’re scaling a teleop program and hitting consistency problems, describe your pipeline to us. We’ll tell you where the control points are.





