Guide 06

Long-horizon data capture: methodology

Why 30-minute episodes break generic pipelines and what failure-recovery data adds.

11 MIN READ • LAST UPDATED JUNE 2026

Most robotics data pipelines were built for short demonstrations — pick-and-place, single-arm trajectories, 15-second episodes. Long-horizon data is a different animal: 5–30 minute episodes that combine navigation, manipulation, language, and recovery from failure. The pipeline patterns that work for short demos break here. This guide describes what changes and why.

What “long-horizon” actually means

Long-horizon data refers to demonstrations that span multiple sub-goals strung together. “Walk to the kitchen, find the red mug on the counter, pick it up, walk back to the dining table, set it down” is a long-horizon task. So is “prepare a simple breakfast” or “unload the dishwasher.”

Three things define long-horizon:

Multiple sub-goals. The task has natural decomposition points, but the operator has to decide them on the fly.
Long temporal span. Episodes are minutes, not seconds. Operator fatigue and attention become real variables.
Open-ended environments. The episode unfolds in a real space (kitchen, warehouse, office) with clutter and variation, not a controlled bench.

Why 30-minute episodes break generic pipelines

Short-demo pipelines fail on long-horizon data in five places:

1. Storage and bandwidth. A 30-minute episode at 30 Hz across 50 channels (joint positions, IMU, multi-camera RGB-D, force-torque, audio) generates 5–10 GB. A program with 1,000 such episodes is 5–10 TB, so network transfer, storage cost, and retrieval latency all have to be planned for up front.

2. Annotation cost per episode. Short demos can be hand-annotated in minutes per episode. Long-horizon episodes need phase labels (where does “approach” end and “grasp” begin?), goal annotations at each sub-task, failure-recovery labels, and language descriptions. Annotation can cost 30–60 minutes per episode — sometimes longer than the episode itself.

3. Operator fatigue. A 30-minute episode of sustained concentration is roughly twice as draining as a 60-minute session with breaks. Operator scheduling has to account for cognitive load, not just clock time. Programs that ignore this see quality degrade across the back half of long episodes.

4. Failure handling. In a short demo, if the operator fails, you discard the episode. In a long-horizon demo, if the operator fails at minute 22, you’ve already invested 22 minutes — you don’t discard, you capture the failure and recovery. This changes what you log and how you label.

5. Multi-modal alignment. Short demos can get away with loose time alignment. Long-horizon demos need tight alignment because cross-modal signals (language at second 240, action at second 245) carry meaning. Alignment drift across 30 minutes accumulates fast.
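Items 1 and 5 above come down to arithmetic that is worth running before capture starts. The sketch below is a minimal estimate, not a measured profile: the per-sample sizes, the split into 4 camera streams and 46 low-bandwidth channels, and the 50 ppm clock error are all illustrative assumptions, and `Channel`, `episode_bytes`, and `drift_ms` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    name: str
    rate_hz: float          # samples per second
    bytes_per_sample: int   # assumed average size after any compression


def episode_bytes(channels: list[Channel], duration_s: float) -> float:
    """Payload size of one episode, ignoring container overhead."""
    return sum(c.rate_hz * c.bytes_per_sample * duration_s for c in channels)


def drift_ms(clock_error_ppm: float, duration_s: float) -> float:
    """Offset accumulated between two unsynchronized clocks, in milliseconds."""
    return clock_error_ppm * 1e-6 * duration_s * 1000.0


# Illustrative channel mix (assumed, not measured): 4 compressed RGB-D
# streams plus 46 small channels, all sampled at 30 Hz.
CHANNELS = [Channel(f"camera_{i}", 30.0, 30_000) for i in range(4)] + [
    Channel(f"sensor_{i}", 30.0, 64) for i in range(46)
]
EPISODE_S = 30 * 60

per_episode_gb = episode_bytes(CHANNELS, EPISODE_S) / 1e9
program_tb = per_episode_gb * 1_000 / 1e3   # 1,000 episodes, GB to TB
offset_ms = drift_ms(50.0, EPISODE_S)
frames_off = offset_ms / (1000.0 / 30.0)    # offset in 30 Hz frame periods

print(f"per episode: {per_episode_gb:.2f} GB, 1,000 episodes: {program_tb:.2f} TB")
print(f"drift after 30 min: {offset_ms:.0f} ms, about {frames_off:.1f} frames at 30 Hz")
```

With these assumed numbers the estimate is about 6.6 GB per episode, inside the 5–10 GB range quoted in item 1, and a 50 ppm clock error adds up to about 90 ms over one episode, nearly three frames at 30 Hz. Swap in measured sizes and clock specifications before relying on either figure.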

How we structure long-horizon capture

A pipeline pattern that works:

Episode planning. The customer and operator team jointly design a set of episode templates (task scenarios with allowed variation). Each template includes goal sub-tasks, allowed environments, and explicit failure cases worth capturing.

Pre-episode setup. Environment configured per template. Operator briefed on the specific goal and acceptable variations. Calibration verified.

Capture. Episode runs end-to-end. Operator narrates sub-goals (“now approaching the table”, “now grasping”) for language alignment. All channels logged with hardware-synchronized timestamps.

On-rig integrity check. Before the operator moves on, automated checks verify that all channels recorded, timestamps align, and no data was dropped. This catches issues immediately rather than after a day of capture.

Post-episode annotation. A second-pass annotator (not the operator) labels phase boundaries, success/failure, and sub-goal completion. The operator does not self-label, because self-assessment is too biased.

Failure-recovery handling. If the episode included a failure, it’s explicitly labeled as a failure-recovery episode (not a success). The failure and the recovery are both first-class data, not artifacts to clean up.
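The on-rig integrity check described above can be sketched as a few timestamp tests per channel. This is a minimal illustration under assumptions: `check_episode`, the 1.5x gap threshold, and the 5 ms skew tolerance are hypothetical choices, and timestamps are assumed to already be expressed in one shared clock, in seconds.

```python
def check_episode(channels, max_skew_s=0.005, gap_factor=1.5):
    """Return a list of human-readable problems; an empty list means pass.

    channels maps a channel name to (nominal_rate_hz, timestamps_in_seconds).
    """
    problems = []
    spans = {}
    slowest_period = 0.0
    for name, (rate_hz, stamps) in channels.items():
        if not stamps:
            problems.append(f"{name}: no samples recorded")
            continue
        period = 1.0 / rate_hz
        slowest_period = max(slowest_period, period)
        for prev, cur in zip(stamps, stamps[1:]):
            if cur <= prev:
                problems.append(f"{name}: timestamp not increasing at {cur:.3f}s")
            elif cur - prev > gap_factor * period:
                problems.append(f"{name}: {cur - prev:.3f}s gap after {prev:.3f}s")
        spans[name] = (stamps[0], stamps[-1])

    if spans:
        # Channels should start and stop together, give or take one sample
        # period of the slowest channel plus the allowed skew.
        tolerance = max_skew_s + slowest_period
        for label, idx in (("start", 0), ("end", 1)):
            times = [span[idx] for span in spans.values()]
            spread = max(times) - min(times)
            if spread > tolerance:
                problems.append(f"{label} times differ by {spread:.3f}s across channels")
    return problems


# Usage: a clean 3-second capture, then the same capture with a dropout.
clean = {
    "joint_positions": (30.0, [i / 30 for i in range(90)]),
    "wrist_camera": (30.0, [i / 30 for i in range(90)]),
}
dropped = dict(clean)
dropped["wrist_camera"] = (30.0, [i / 30 for i in range(90) if not 10 <= i < 15])

print(check_episode(clean))
print(check_episode(dropped))
```

Returning a list of problems rather than a single pass/fail flag lets the rig show the operator exactly which channel to fix before the next episode starts.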

The failure-recovery data class

The single highest-value addition to a long-horizon program is deliberate capture of failure-recovery sequences. Most programs leave this on the table.

The pattern: alongside successful episodes, capture episodes where the operator deliberately triggers (or naturally encounters) a failure, then recovers. Drop the object and re-grasp. Stumble and re-balance. Misidentify the goal and correct mid-task. These episodes train the policy to recover from its own mistakes — a critical capability in production where the policy will encounter situations it didn’t see in training.

Programs that include 15–25% failure-recovery data routinely outperform programs that include only successful demonstrations. The improvement isn’t in success rate on familiar tasks — it’s in recovery from unfamiliar ones.
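One way to keep that mix visible during a program is to track the share of failure-recovery episodes as labels come in. The sketch below assumes a hypothetical `EpisodeLabel` record and `Outcome` enum; neither is a prescribed schema. The share is counted per episode, which is itself an assumption (counting by recorded duration would be equally defensible), and the 15–25% band is simply the range quoted above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE_RECOVERY = "failure_recovery"   # failed, then recovered
    FAILURE = "failure"                     # failed, no recovery


@dataclass(frozen=True)
class EpisodeLabel:
    episode_id: str
    outcome: Outcome
    # (start_s, end_s, phase name), filled in by the second-pass annotator
    phases: tuple[tuple[float, float, str], ...] = ()


def failure_recovery_share(labels: list[EpisodeLabel]) -> float:
    """Fraction of episodes labeled failure-recovery (0.0 for an empty set)."""
    if not labels:
        return 0.0
    hits = sum(1 for label in labels if label.outcome is Outcome.FAILURE_RECOVERY)
    return hits / len(labels)


def in_target_band(share: float, low: float = 0.15, high: float = 0.25) -> bool:
    """True when the share falls inside the target band, bounds included."""
    return low <= share <= high


# Usage: 8 successful episodes and 2 failure-recovery episodes.
labels = [EpisodeLabel(f"ep{i:03d}", Outcome.SUCCESS) for i in range(8)] + [
    EpisodeLabel(
        "ep008",
        Outcome.FAILURE_RECOVERY,
        ((0.0, 41.5, "grasp"), (41.5, 60.0, "re-grasp")),
    ),
    EpisodeLabel("ep009", Outcome.FAILURE_RECOVERY),
]
share = failure_recovery_share(labels)
print(f"failure-recovery share: {share:.0%}, in band: {in_target_band(share)}")
```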

Where to start

If you’re moving from short-demo to long-horizon for the first time:

Start with 50 long-horizon episodes. Not 5,000. Validate your storage, annotation, and alignment pipelines on a small set before scaling.
Build failure-recovery into the templates from day one. Adding it later means re-running episodes, which is expensive.
Budget annotation cost as 1–1.5x capture cost. Not 0.2x like short demos. Surprises here have killed more long-horizon programs than capture cost ever did.
Train operators on episode pacing. Long episodes are an endurance event. Operators who don’t pace themselves degrade in the second half. Tier-promote the ones who hold quality through the full episode.
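Building failure-recovery into templates from day one is easier to enforce when the template is a checked data structure. A minimal sketch, assuming a hypothetical `EpisodeTemplate` with the three fields the episode-planning step names (sub-goals, allowed environments, failure cases); the field names and the two example templates are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeTemplate:
    name: str
    sub_goals: tuple[str, ...]
    environments: tuple[str, ...]
    failure_cases: tuple[str, ...] = ()   # failures worth capturing on purpose


def templates_missing_failure_cases(templates):
    """Names of templates that would yield success-only data."""
    return [t.name for t in templates if not t.failure_cases]


templates = [
    EpisodeTemplate(
        name="fetch_mug",
        sub_goals=("walk to kitchen", "find red mug", "pick up", "walk to table", "set down"),
        environments=("kitchen", "dining area"),
        failure_cases=("drop object and re-grasp",),
    ),
    EpisodeTemplate(
        name="unload_dishwasher",
        sub_goals=("open door", "remove items", "stow items"),
        environments=("kitchen",),
    ),
]

# Flag templates to revise before the pilot set is captured.
print(templates_missing_failure_cases(templates))
```

Running a check like this over the template set before the first 50 episodes is cheap compared with re-running episodes later.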

When you’re ready to scope a long-horizon program, tell us the task length and the environment. We’ll come back with a templated capture plan within one business day.
