
Data service 06

Evaluation benchmark services

Held-out test sets, scenario libraries, and scoring infrastructure for model release gates.

50+
Benchmarks built
2,000+
Scenarios in library
100/day
Eval runs
[Figure: example release scorecard. A model (v2.3) is run against the benchmark suite: Pick from clutter 94.2%, Stack 3 objects 88.7%, Deformable fold 71.3%, Tool handover 42.1%, Pour liquid 86.0%, plus 45 more scenarios. Overall pass rate 82.4%; 3 pass, 1 warn, 1 fail; ship / no-ship gate.]

What are evaluation benchmarks?

Evaluation benchmarks are held-out test sets, scenario libraries, and automated scoring infrastructure that tell you whether a model is ready to ship. They turn “it seems to work” into a pass/fail gate with numbers you can defend.

Typical use cases

  • Release gates — automated pass/fail scoring before deploying a new model version
  • Regression detection — catch tasks that got worse when you improved others
  • Competitor benchmarking — apples-to-apples comparison across architectures
  • Safety validation — scenario-specific tests for collision, drop, and force limits
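A release gate of this kind can be sketched in a few lines. The example below is only an illustration, not the scoring logic used in any actual program: the scenario scores are the sample values from the example scorecard on this page, while the `PASS_AT` and `WARN_AT` thresholds, the `release_gate` function, and the "any FAIL blocks the release" rule are assumptions made for the sketch.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; real gates use criteria you define.
PASS_AT = 85.0  # success rate (%) at or above which a scenario passes
WARN_AT = 70.0  # success rate (%) at or above which a scenario warns


@dataclass
class GateResult:
    statuses: dict[str, str]  # scenario name -> "PASS" / "WARN" / "FAIL"
    ship: bool                # True only if no scenario failed


def grade(score: float) -> str:
    """Map a per-scenario success rate to a gate status."""
    if score >= PASS_AT:
        return "PASS"
    if score >= WARN_AT:
        return "WARN"
    return "FAIL"


def release_gate(scores: dict[str, float]) -> GateResult:
    """Grade every scenario; a single FAIL blocks the release (assumed rule)."""
    statuses = {name: grade(score) for name, score in scores.items()}
    return GateResult(statuses=statuses, ship="FAIL" not in statuses.values())


scores = {
    "Pick from clutter": 94.2,
    "Stack 3 objects": 88.7,
    "Deformable fold": 71.3,
    "Tool handover": 42.1,
    "Pour liquid": 86.0,
}
result = release_gate(scores)
for name, status in result.statuses.items():
    print(f"{name:<20} {scores[name]:5.1f}%  {status}")
print("SHIP" if result.ship else "NO-SHIP")
```

With these assumed thresholds the five sample scenarios grade as three passes, one warning, and one failure, so the gate returns a no-ship decision.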

Why teams partner with us

Building a benchmark that actually predicts real-world performance requires curated scenarios, physical test setups, and statistical rigor. We maintain the library so you can run evals instead of building them.

  • 2,000+ scenarios — continuously growing library
  • 100 runs/day — automated eval infrastructure
  • Ship/no-ship — clear scoring criteria you define

Why outsource benchmarks?

Internal benchmarks drift toward what your model is already good at. We maintain independent, adversarial test sets designed to find the gaps.

50+ benchmarks built for partners.

2,000+ scenarios in the library.

100/day automated eval runs.

Where we collect

41+ delivery centers across 12 countries. Every program runs from a Roborax hub near your target time zone.

Asia Pacific
India · Philippines

Americas
USA · Canada · Colombia · Jamaica · El Salvador · Belize

EMEA
UK · Albania · Kosovo · Morocco


What we deliver

The release-gate primitives

Four artifacts that turn a model release into a confident decision instead of a gut feel.

Held-out test sets

Isolated scenarios your model has never seen. Leakage check enforced.
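One simple form a leakage check can take is an exact-match comparison of scenario fingerprints between the training data and the held-out set. The sketch below assumes scenarios are described by small dictionaries of defining fields; the field names and the `fingerprint` and `find_leaks` helpers are hypothetical. It only catches exact duplicates, so near-duplicate scenarios would need a separate similarity check.

```python
import hashlib
import json


def fingerprint(scenario: dict) -> str:
    """Stable hash of a scenario's defining fields (key order does not matter)."""
    canonical = json.dumps(scenario, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_leaks(train: list[dict], held_out: list[dict]) -> list[dict]:
    """Return held-out scenarios whose fingerprint also appears in training data."""
    seen = {fingerprint(s) for s in train}
    return [s for s in held_out if fingerprint(s) in seen]


# Hypothetical scenario descriptions, for illustration only.
train = [
    {"task": "stack", "object": "mug_03", "lighting": "dim"},
    {"task": "pour", "object": "cup_01", "lighting": "bright"},
]
held_out = [
    {"lighting": "dim", "object": "mug_03", "task": "stack"},   # same as a training scenario
    {"task": "stack", "object": "mug_17", "lighting": "dim"},   # novel object instance
]

leaks = find_leaks(train, held_out)
print(f"{len(leaks)} leaked scenario(s): {leaks}")
```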

Scenario libraries

Curated scenes covering production distribution and known edge cases.

Scoring infrastructure

Aggregation, statistical significance, and per-scenario breakdowns.
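To make the statistical side concrete, one common way to put an uncertainty range on a pass rate is the Wilson score interval. The sketch below is a generic textbook calculation, not a description of the scoring infrastructure itself; the `wilson_interval` name and the run counts are assumptions for illustration.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical example: 50 successful runs out of 100 for one scenario.
low, high = wilson_interval(passes=50, runs=100)
print(f"pass rate 50.0%, approx. 95% interval {low:.1%} to {high:.1%}")
```

The interval narrows as the number of runs grows, which is why a per-scenario score from a handful of runs should be read more cautiously than one from hundreds.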

Regression suites

Replayable runs that catch silent regressions between model versions.
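A silent regression is one that an aggregate score hides: the average goes up while a single scenario gets worse. A minimal sketch of a per-scenario comparison follows; the `find_regressions` helper, the 2-point tolerance, and all scores are hypothetical values chosen for illustration.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 2.0,
) -> dict[str, float]:
    """Scenarios where the candidate dropped by more than `tolerance` points."""
    regressions = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            continue  # scenario not run for the candidate; report separately
        drop = old_score - new_score
        if drop > tolerance:
            regressions[name] = round(drop, 2)
    return regressions


# Hypothetical success rates (%) for two model versions.
baseline = {"Pick from clutter": 80.0, "Stack 3 objects": 80.0, "Tool handover": 60.0}
candidate = {"Pick from clutter": 95.0, "Stack 3 objects": 92.0, "Tool handover": 42.0}

mean_old = sum(baseline.values()) / len(baseline)
mean_new = sum(candidate.values()) / len(candidate)
print(f"aggregate: {mean_old:.1f}% -> {mean_new:.1f}%")
print("regressions:", find_regressions(baseline, candidate))
```

In this made-up example the aggregate improves even though one scenario drops by 18 points, which is the case a per-scenario regression suite is meant to flag.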

How we work

From criteria to release-grade scorecard

Four stages that produce a benchmark you can actually defend in a safety review.

Step 1

Define criteria

What does success look like for this release? Pass/fail and metric thresholds.

Step 2

Build test set

Curated scenarios from real and synthetic sources. Coverage matrix locked.
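A coverage matrix can be thought of as a grid of task by condition, with a minimum number of scenarios required in each cell. The sketch below shows one way to find empty cells before a test set is locked; the `coverage_gaps` helper and the task and condition names are hypothetical.

```python
from collections import Counter
from itertools import product


def coverage_gaps(
    scenarios: list[dict],
    tasks: list[str],
    conditions: list[str],
    min_per_cell: int = 1,
) -> list[tuple[str, str]]:
    """Return (task, condition) cells with fewer than `min_per_cell` scenarios."""
    counts = Counter((s["task"], s["condition"]) for s in scenarios)
    return [cell for cell in product(tasks, conditions) if counts[cell] < min_per_cell]


# Hypothetical test-set contents, for illustration only.
scenarios = [
    {"task": "pick", "condition": "bright"},
    {"task": "pick", "condition": "dim"},
    {"task": "stack", "condition": "bright"},
]
gaps = coverage_gaps(scenarios, tasks=["pick", "stack"], conditions=["bright", "dim"])
print("uncovered cells:", gaps)
```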

Step 3

Score policy

Run your model against the set. Per-scenario results plus aggregate metrics.

Step 4

Report

Release-grade scorecard with regression flags. Replayable for any future model.

Rigs and tools

Harnesses and scenario libraries

Isaac eval, Habitat Lab, custom runners. Open splits or your own.

Isaac eval

Sim harness

Custom runner

Your stack

Habitat Lab

Embodied AI

Open X

Public splits

ROS2 scorers

Real-robot eval

Dashboards

Stats reports

What our partners say
Roborax built our release benchmark from scratch in six weeks. Now every model release goes through the same gate. We finally have a story for safety reviews.
Hideo Sato
Eval Lead, Polaris AI

FAQ

Questions about evaluation benchmarks

Benchmark design is a collaborative process. Your team defines the success criteria and task requirements. We design the evaluation protocol, build the test environment, and handle execution.
Benchmark tasks are kept strictly separate from training data. We use held-out scenarios, novel object instances, and modified environmental conditions that your model has not been exposed to.
Task success rate by condition, failure mode taxonomy, per-subtask breakdown, comparison to your previous benchmark run, and recommendations for the next training iteration.
At minimum after every major training update. For active programs we recommend a rolling benchmark cadence — typically fortnightly — so you can track policy improvement in near real time.

Further reading

From the blog

The QA Pipeline Every Robotics Data Team Needs

Success-rate measurement, regression suites, and policy scoring.

From the blog

From Imitation Learning to RL: How Your Data Strategy Changes

Evaluation requirements shift significantly between training regimes.

Build your release benchmark

Tell us the deployment domain and the metrics that matter. Six weeks to scorecard.
