Most robotics data pipelines were built for short demonstrations — pick-and-place, single-arm trajectories, 15-second episodes. Long-horizon data is a different animal: 5–30 minute episodes that combine navigation, manipulation, language, and recovery from failure. The pipeline patterns that work for short demos break here. This guide describes what changes and why.
Long-horizon data refers to demonstrations that span multiple sub-goals strung together. “Walk to the kitchen, find the red mug on the counter, pick it up, walk back to the dining table, set it down” is a long-horizon task. So is “prepare a simple breakfast” or “unload the dishwasher.”
Three things define long-horizon:
Short-demo pipelines fail on long-horizon data in five places:
1. Storage and bandwidth. A 30-minute episode at 30 Hz across 50 channels (joint positions, IMU, multi-camera RGB-D, force-torque, audio) generates 5–10 GB per episode. A program with 1,000 such episodes is 5–10 TB. At that scale, network transfer time, storage cost, and retrieval latency all have to be budgeted up front.
2. Annotation cost per episode. Short demos can be hand-annotated in minutes per episode. Long-horizon episodes need phase labels (where does “approach” end and “grasp” begin?), goal annotations at each sub-task, failure-recovery labels, and language descriptions. Annotation can cost 30–60 minutes per episode — sometimes longer than the episode itself.
3. Operator fatigue. A 30-minute episode at sustained high concentration is roughly twice as draining as a 60-minute episode with breaks. Operator scheduling has to account for cognitive load, not just clock time. Programs that ignore this tend to see quality degrade in the back half of long episodes.
4. Failure handling. In a short demo, if the operator fails, you discard the episode. In a long-horizon demo, if the operator fails at minute 22, you’ve already invested 22 minutes — you don’t discard, you capture the failure and recovery. This changes what you log and how you label.
5. Multi-modal alignment. Short demos can get away with loose time alignment. Long-horizon demos need tight alignment because cross-modal signals (language at second 240, action at second 245) carry meaning. A small clock offset between channels that is harmless over 15 seconds keeps growing over 30 minutes, so drift has to be controlled at capture time.
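The arithmetic behind items 1 and 5 above can be sketched in a few lines. This is a back-of-envelope illustration, not a sizing tool: the function names are hypothetical, and the 50 ppm clock drift is an assumed figure chosen only to show how an offset scales with episode length. The per-episode sizes (5–10 GB) and the 30 Hz rate come from the text.

```python
def samples_per_channel(minutes: float, rate_hz: float) -> int:
    """Number of samples one channel produces over an episode."""
    return round(minutes * 60 * rate_hz)


def program_storage_tb(episodes: int, gb_per_episode: float) -> float:
    """Total program size in TB (decimal units: 1 TB = 1000 GB)."""
    return episodes * gb_per_episode / 1000


def accumulated_drift_ms(seconds: float, drift_ppm: float) -> float:
    """Offset between two free-running clocks after `seconds`, in ms.

    drift_ppm is an ASSUMED relative clock error in parts per million.
    """
    return seconds * drift_ppm * 1e-6 * 1000


if __name__ == "__main__":
    ASSUMED_DRIFT_PPM = 50.0  # hypothetical; measure this on your own rig
    print("samples per channel:", samples_per_channel(30, 30))
    print("program size (TB):",
          program_storage_tb(1000, 5), "to", program_storage_tb(1000, 10))
    print("drift over 15 s (ms):",
          accumulated_drift_ms(15, ASSUMED_DRIFT_PPM))
    print("drift over 30 min (ms):",
          accumulated_drift_ms(30 * 60, ASSUMED_DRIFT_PPM))
```

Under the assumed 50 ppm figure, a 15-second demo drifts by less than a millisecond, while a 30-minute episode drifts by roughly 90 ms, which is close to three frame periods at 30 Hz.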
A pipeline pattern that works:
Episode planning. The customer and operator team jointly design a set of episode templates (task scenarios with allowed variation). Each template includes goal sub-tasks, allowed environments, and explicit failure cases worth capturing.
Pre-episode setup. Environment configured per template. Operator briefed on the specific goal and acceptable variations. Calibration verified.
Capture. Episode runs end-to-end. Operator narrates sub-goals (“now approaching the table”, “now grasping”) for language alignment. All channels logged with hardware-synchronized timestamps.
On-rig integrity check. Before the operator moves on, automated checks verify all channels recorded, timestamps align, no data dropped. This catches issues immediately rather than after a day of capture.
Post-episode annotation. A second-pass annotator (not the operator) labels phase boundaries, success or failure, and sub-goal completion. The operator does NOT self-label: self-assessment is too biased.
Failure-recovery handling. If the episode included a failure, it’s explicitly labeled as a failure-recovery episode (not a success). The failure and the recovery are both first-class data, not artifacts to clean up.
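The on-rig integrity check in the pattern above could look something like the following. This is a minimal sketch under stated assumptions: each channel is represented as a list of timestamps in seconds, and the function name, the `gap_factor` and `max_skew_s` thresholds, and the channel names are all hypothetical. A real rig would also validate payloads, not just timestamps.

```python
from typing import Dict, List, Sequence


def check_episode(
    channels: Dict[str, Sequence[float]],
    expected_rates_hz: Dict[str, float],
    gap_factor: float = 1.5,   # assumed: a gap > 1.5 periods counts as a drop
    max_skew_s: float = 0.05,  # assumed: allowed start/end mismatch
) -> List[str]:
    """Return a list of human-readable problems; empty means the episode passed."""
    problems: List[str] = []

    # 1. Every expected channel recorded, in order, with no dropped samples.
    for name, rate_hz in expected_rates_hz.items():
        ts = channels.get(name)
        if not ts:
            problems.append(f"{name}: missing or empty")
            continue
        period = 1.0 / rate_hz
        for prev, cur in zip(ts, ts[1:]):
            if cur <= prev:
                problems.append(f"{name}: non-monotonic timestamp at {cur:.3f}s")
                break
            if cur - prev > gap_factor * period:
                problems.append(f"{name}: gap of {cur - prev:.3f}s after {prev:.3f}s")
                break

    # 2. Channels that did record start and stop at roughly the same time.
    present = [channels[n] for n in expected_rates_hz if channels.get(n)]
    if len(present) > 1:
        starts = [ts[0] for ts in present]
        ends = [ts[-1] for ts in present]
        if max(starts) - min(starts) > max_skew_s:
            problems.append("channels do not start together")
        if max(ends) - min(ends) > max_skew_s:
            problems.append("channels do not end together")

    return problems


if __name__ == "__main__":
    episode = {
        "joints": [i / 30 for i in range(90)],
        "imu": [i / 100 for i in range(300)],
    }
    print(check_episode(episode, {"joints": 30.0, "imu": 100.0, "camera": 30.0}))
```

Returning a list of problems rather than a single pass/fail flag lets the operator see everything that went wrong before deciding whether to re-run the episode.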
The single highest-value addition to a long-horizon program is deliberate capture of failure-recovery sequences. Most programs leave this on the table.
The pattern: alongside successful episodes, capture episodes where the operator deliberately triggers (or naturally encounters) a failure, then recovers. Drop the object and re-grasp. Stumble and re-balance. Misidentify the goal and correct mid-task. These episodes train the policy to recover from its own mistakes — a critical capability in production where the policy will encounter situations it didn’t see in training.
Programs that include 15–25% failure-recovery data routinely outperform programs that include only successful demonstrations. The improvement isn’t in success rate on familiar tasks — it’s in recovery from unfamiliar ones.
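To track where a dataset sits relative to the 15–25% range, the share of failure-recovery episodes can be computed directly from the episode labels. The sketch below is illustrative only: the `Episode` record, the label strings `"success"` and `"failure_recovery"`, and both function names are assumptions, not a schema from this guide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Episode:
    episode_id: str
    label: str  # assumed values: "success" or "failure_recovery"


def failure_recovery_share(episodes: Iterable[Episode]) -> float:
    """Fraction of episodes labeled as failure-recovery (0.0 for an empty set)."""
    counts = Counter(e.label for e in episodes)
    total = sum(counts.values())
    return counts["failure_recovery"] / total if total else 0.0


def failure_episodes_needed(n_success: int, target_pct: int) -> int:
    """Failure-recovery episodes needed so they make up target_pct of the total.

    Solves f / (n_success + f) >= target_pct / 100 for the smallest integer f,
    using integer arithmetic to avoid floating-point rounding.
    """
    if not 0 <= target_pct < 100:
        raise ValueError("target_pct must be in [0, 100)")
    numerator = target_pct * n_success
    denominator = 100 - target_pct
    return -(-numerator // denominator)  # ceiling division


if __name__ == "__main__":
    dataset = [Episode(f"ep{i}", "success") for i in range(8)]
    dataset += [Episode(f"fr{i}", "failure_recovery") for i in range(2)]
    print("current share:", failure_recovery_share(dataset))
    for pct in (15, 25):
        print(f"needed for {pct}% with 800 successes:",
              failure_episodes_needed(800, pct))
```

Note that the target is a share of the whole dataset, so reaching 20% alongside 800 successful episodes takes 200 failure-recovery episodes, not 160.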
If you’re moving from short-demo to long-horizon for the first time:
When you’re ready to scope a long-horizon program, tell us the task length and the environment. We’ll come back with a templated capture plan within one business day.