Guide 03

Sim-to-real transfer: measurement and benchmarks

How to measure the sim-to-real gap, what “91% transfer” actually means, and the three pitfalls in simulation pipelines.

11 MIN READ • LAST UPDATED JUNE 2026

“We achieve 91% sim-to-real transfer” is one of the most overloaded claims in robotics. It can mean almost anything depending on how transfer is measured, what counts as the same task in sim and reality, and what baseline you’re comparing against. This guide is for ML leads and sim engineers who need to put real numbers behind sim-to-real claims, whether in their own work or when evaluating a data partner’s.

What sim-to-real actually is

Sim-to-real refers to training a policy primarily or entirely in simulation, then deploying it on physical hardware. The motivation: simulation is cheap, parallelizable, and lets you train on edge cases that would be expensive or dangerous to capture in reality.

The challenge: physics simulators are approximate. Real-world friction, contact dynamics, sensor noise, and actuator behavior all differ from what the simulator models. A policy trained naively in sim often fails in reality because it has learned to exploit simulator artifacts that don’t exist in the real world.

The fix is domain randomization plus paired sim-real validation. The goal isn’t to make sim perfect; it’s to make policies robust to the gap between sim and reality.

The measurement problem

Sim-to-real transfer rates are usually reported as a single number, but the number means different things depending on three choices:

What’s the task definition? “Pick the cup” in sim and “pick the cup” in the real world can be very different tasks. If the real-world cup has a different mass distribution, deformability, or surface friction, you’re measuring transfer to a different task, not transfer of the same task.

What’s the baseline? A 91% transfer rate is impressive if the sim-only baseline is 30%. It’s less impressive if the baseline is 88%. Always ask for the baseline.

What’s the success criterion? Binary success (task completed, yes or no) tells you something different from a graded score (task completed efficiently, with no contact failures, in under N seconds). Most reported transfer rates are binary; most production deployments need graded scores.
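
To make the binary-versus-graded distinction concrete, here is a minimal sketch. The `Episode` fields, the 20-second limit, and the 0.5/0.25/0.25 weighting are illustrative assumptions, not a scoring scheme defined in this guide.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    completed: bool        # did the robot finish the task at all?
    duration_s: float      # time taken for the attempt, in seconds
    contact_failures: int  # slips, drops, or collisions logged during the attempt


def binary_success(ep: Episode) -> int:
    """1 if the task was completed, 0 otherwise."""
    return int(ep.completed)


def graded_score(ep: Episode, time_limit_s: float = 20.0) -> float:
    """Hypothetical graded score in [0, 1].

    Assumed weighting: 0.5 for completion, 0.25 for finishing under
    the time limit, 0.25 for zero contact failures.
    """
    if not ep.completed:
        return 0.0
    score = 0.5
    if ep.duration_s < time_limit_s:
        score += 0.25
    if ep.contact_failures == 0:
        score += 0.25
    return score


# Placeholder episodes, not real results.
episodes = [
    Episode(completed=True, duration_s=12.0, contact_failures=0),
    Episode(completed=True, duration_s=31.0, contact_failures=2),
    Episode(completed=False, duration_s=40.0, contact_failures=1),
]
binary_rate = sum(binary_success(e) for e in episodes) / len(episodes)
graded_mean = sum(graded_score(e) for e in episodes) / len(episodes)
print(f"binary: {binary_rate:.2f}, graded: {graded_mean:.2f}")
```

On these placeholder episodes the binary rate is two in three while the graded mean is 0.5: the slow, slip-prone second episode counts as a full success under the binary criterion but only half under the graded one.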

What 91% transfer actually means (or doesn’t)

Three concrete examples illustrate how the same number can mean different things:

Example A: Policy trained on 10K sim episodes. Tested on 100 real-world episodes. 91 succeed (binary). Transfer rate: 91%. Sounds great.

Example B: Policy trained on 10K sim episodes + 50 real-world fine-tuning episodes. Tested on 100 real-world episodes. 91 succeed. Transfer rate: 91% (but it’s a sim-plus-real policy, not pure sim-to-real).

Example C: Policy trained on 10K sim episodes with heavy domain randomization. Tested on 100 real-world episodes in environments matched to the sim’s randomization range. 91 succeed. Transfer rate: 91% (but transfer to environments outside the randomization range may be far lower, for example 30%).

All three are “91% transfer.” Only Example A is sim-to-real in the strict sense. The other two are useful but should be reported with their qualifiers.
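
Separately from the qualifiers, 91 successes in 100 trials carries sampling uncertainty of its own. The guide does not prescribe a statistical method; the sketch below uses the Wilson score interval, one common choice for binomial proportions, and assumes the 100 test episodes are independent.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(91, 100)
print(f"91/100 -> approx. 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 84% to 95%, so two programs both reporting "91%" on 100 episodes could differ meaningfully in true success rate.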

Three pitfalls in simulation pipelines

In our experience running sim-to-real programs across humanoid, surgical, and warehouse robotics, three pitfalls account for most failures:

Pitfall 1: The simulator’s contact model doesn’t match reality. Most rigid-body physics engines (MuJoCo, PhysX) handle friction and contact compliance with approximations that work well for some scenarios and poorly for others. Deformable contact (cloth, food, soft tissue) is especially fraught. If your task involves contact-rich manipulation, validate your simulator’s contact model with paired real-world captures before trusting transfer claims.
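
One way to validate a contact model against paired captures is to compare a contact signal recorded in both settings. The sketch below assumes time-aligned normal-force traces in newtons and a hypothetical 0.5 N tolerance; the episode names and values are placeholders, and a real check would likely compare more than one signal.

```python
import math


def rmse(sim_trace: list, real_trace: list) -> float:
    """Root-mean-square error between two equally sampled traces."""
    if len(sim_trace) != len(real_trace) or not sim_trace:
        raise ValueError("traces must be non-empty and the same length")
    squared = sum((s - r) ** 2 for s, r in zip(sim_trace, real_trace))
    return math.sqrt(squared / len(sim_trace))


def flag_mismatches(pairs: list, tolerance_n: float) -> list:
    """Names of paired episodes whose force RMSE exceeds the tolerance."""
    return [name for name, sim, real in pairs if rmse(sim, real) > tolerance_n]


# Placeholder traces: (episode name, sim forces, real forces).
pairs = [
    ("rigid_block", [0.0, 2.0, 4.0, 4.0], [0.0, 2.1, 3.9, 4.0]),
    ("soft_sponge", [0.0, 2.0, 4.0, 4.0], [0.0, 0.5, 1.5, 2.5]),
]
flagged = flag_mismatches(pairs, tolerance_n=0.5)
print(flagged)
```

With these placeholder traces only the deformable object is flagged, which mirrors the point that rigid-body approximations tend to break down first on soft contact.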

Pitfall 2: Domain randomization too narrow or too wide. Too narrow: the policy overfits to the randomization range and fails on real-world conditions outside it. Too wide: the policy learns a least-common-denominator behavior that’s suboptimal everywhere. Finding the sweet spot requires paired sim-real captures to calibrate the ranges.
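
A rough sketch of that calibration check: sample simulator parameters from per-parameter ranges, then measure what fraction of real-world measurements each range actually covers. The parameter names, range values, and measurements are hypothetical placeholders, and uniform sampling is an assumption rather than a recommendation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamRange:
    low: float
    high: float

    def sample(self, rng: random.Random) -> float:
        return rng.uniform(self.low, self.high)

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


# Hypothetical randomization ranges; the values are placeholders.
RANGES = {
    "friction_coeff": ParamRange(0.4, 1.0),
    "object_mass_kg": ParamRange(0.1, 0.5),
}


def sample_sim_params(rng: random.Random) -> dict:
    """Draw one set of simulator parameters for a training episode."""
    return {name: r.sample(rng) for name, r in RANGES.items()}


def coverage(real_measurements: dict) -> dict:
    """Fraction of real-world measurements that fall inside each sim range."""
    return {
        name: sum(RANGES[name].contains(v) for v in values) / len(values)
        for name, values in real_measurements.items()
    }


# Placeholder measurements standing in for paired real-world captures.
real = {
    "friction_coeff": [0.35, 0.5, 0.8, 1.1],
    "object_mass_kg": [0.2, 0.3, 0.45, 0.48],
}
params = sample_sim_params(random.Random(0))
result = coverage(real)
print(params)
print(result)
```

Here the friction range covers only half of the placeholder measurements, a sign it is too narrow. Coverage alone cannot detect a range that is too wide; that shows up as degraded task performance instead.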

Pitfall 3: Sensor noise model mismatched. Real cameras have specific noise patterns, exposure variations, and motion blur. Real depth sensors have specific failure modes (specular surfaces, edge artifacts). If your sim noise is Gaussian and your real noise is structured, the policy will fail on the structured patterns it never saw in sim.
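
The difference between Gaussian and structured noise can be sketched on a toy depth image. The structured model below is an assumption chosen for illustration: it drops readings at depth discontinuities (a simple stand-in for edge artifacts) and returns 0.0 as the "no reading" value. Real sensors need a model fitted to their own captures.

```python
import random


def gaussian_depth_noise(depth, sigma_m, rng):
    """Independent Gaussian noise on every pixel."""
    return [[d + rng.gauss(0.0, sigma_m) for d in row] for row in depth]


def structured_depth_noise(depth, sigma_m, edge_threshold_m, rng):
    """Gaussian noise plus dropouts at depth discontinuities.

    A pixel whose left or right neighbor differs by more than
    edge_threshold_m is returned as 0.0 (assumed "no reading" value).
    """
    noisy = []
    for row in depth:
        out = []
        for x, d in enumerate(row):
            left = row[x - 1] if x > 0 else d
            right = row[x + 1] if x < len(row) - 1 else d
            at_edge = max(abs(d - left), abs(d - right)) > edge_threshold_m
            out.append(0.0 if at_edge else d + rng.gauss(0.0, sigma_m))
        noisy.append(out)
    return noisy


# A 1 x 7 depth row in meters: an object at 0.5 m in front of a wall at 2.0 m.
depth = [[2.0, 2.0, 0.5, 0.5, 0.5, 2.0, 2.0]]
rng = random.Random(0)
noisy = structured_depth_noise(depth, sigma_m=0.005, edge_threshold_m=0.3, rng=rng)
dropouts = [x for x, d in enumerate(noisy[0]) if d == 0.0]
print(dropouts)
```

The dropouts cluster around the object boundary, which is exactly where a grasping policy needs depth most. Purely Gaussian noise never produces that pattern, so a policy trained only on it has not seen the failure mode.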

A measurement protocol that works

A protocol we recommend for any sim-to-real program:

Define the task with both a binary and a graded success criterion. Report both.
Capture 100 paired episodes (same task, same conditions, in both sim and real). Use these for validation, not training.
Report transfer in three settings: (a) sim-only policy on real hardware, (b) sim-plus-real-finetune policy on real hardware, (c) real-only policy on real hardware. The gap between (a) and (c) is the actual sim-to-real gap; (b) is your production option.
Report environmental coverage. A 91% transfer rate on 5 environment configurations is different from 91% on 50. Report the count.
Audit failures. When the policy fails on real hardware, classify failures into categories (contact, perception, control, environment). The category mix tells you where to improve sim.
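
The reporting steps above can be sketched as a small summary function. The `RealEpisode` fields, setting names, and environment identifiers are hypothetical, and only the binary criterion is shown; a graded score would be aggregated the same way.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

SETTINGS = ("sim_only", "sim_plus_finetune", "real_only")
FAILURE_CATEGORIES = ("contact", "perception", "control", "environment")


@dataclass
class RealEpisode:
    setting: str      # one of SETTINGS: how the evaluated policy was trained
    env_config: str   # identifier of the real-world environment configuration
    success: bool     # binary success criterion
    failure_category: Optional[str] = None  # one of FAILURE_CATEGORIES on failure


def summarize(episodes: list) -> dict:
    """Success rate per setting, sim-to-real gap, coverage count, failure mix."""
    rates = {}
    for setting in SETTINGS:
        subset = [e for e in episodes if e.setting == setting]
        rates[setting] = sum(e.success for e in subset) / len(subset) if subset else None
    gap = None
    if rates["sim_only"] is not None and rates["real_only"] is not None:
        # Gap between setting (c) and setting (a).
        gap = rates["real_only"] - rates["sim_only"]
    failures = Counter(
        e.failure_category
        for e in episodes
        if e.setting == "sim_only" and not e.success
    )
    return {
        "success_rate": rates,
        "sim_to_real_gap": gap,
        "env_config_count": len({e.env_config for e in episodes}),
        "sim_only_failures": dict(failures),
    }


# Placeholder episodes, not real results.
episodes = [
    RealEpisode("sim_only", "bench_A", True),
    RealEpisode("sim_only", "bench_B", False, "contact"),
    RealEpisode("sim_plus_finetune", "bench_A", True),
    RealEpisode("sim_plus_finetune", "bench_B", True),
    RealEpisode("real_only", "bench_A", True),
    RealEpisode("real_only", "bench_B", True),
]
report = summarize(episodes)
print(report)
```

Keeping the three settings, the environment count, and the failure mix in one report makes it harder to quote the headline rate without its qualifiers.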

If a data partner or sim team tells you they achieve X% sim-to-real without offering this level of breakdown, ask for it. If they can’t produce it, the X% number is brand-positive but operationally unreliable. When you’re ready to scope a sim-to-real program with proper measurement built in, tell us the platform and the task.
