
Data service 03

Multimodal sensor capture

RGB-D, LiDAR, IMU, tactile, force/torque, audio, thermal, and event cameras. Microsecond alignment across all eight modalities.

<1 ms cross-modal sync
8 modalities supported
7 capture rigs active
[Diagram: eight modalities (RGB-D depth + color, LiDAR 3D point cloud, IMU accel + gyro, tactile pressure grid, 6-axis force/torque sensor, audio contact mic array, thermal IR heat map, event camera async pixel events) pass through time alignment (μs precision, PTP / hardware trigger) into one aligned frame. Fused output: 8 sensor modalities synchronized per frame; μs time alignment via PTP + hardware trigger; 1 unified format (HDF5, rosbag, or custom); frame-level QA with drop and drift detection; training-ready data.]

What is multimodal sensor capture?

Multimodal sensor capture is the synchronized recording of multiple sensor streams — RGB-D cameras, LiDAR, IMUs, tactile arrays, and audio — aligned to microsecond precision. The result is a single fused frame per timestep: every modality in lockstep.
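As a rough illustration of what "one fused frame per timestep" can mean in practice, the sketch below groups samples from several streams around a reference camera timeline by nearest timestamp. It assumes all timestamps already share one hardware-synchronized clock and are expressed in microseconds. The function names, the 10 Hz LiDAR rate, and the tolerance value are illustrative assumptions, not part of any delivered tooling.

```python
from bisect import bisect_left


def nearest(timestamps, t):
    """Index of the timestamp closest to t (timestamps sorted ascending)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Choose the closer of the two neighbours around t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def fuse(reference_us, others, tolerance_us):
    """Build one fused frame per reference timestamp.

    reference_us: sorted timestamps of the reference stream (microseconds)
    others: dict of stream name -> sorted timestamps (microseconds)
    Each frame maps a stream name to the index of its matched sample,
    or None when no sample lies within tolerance_us.
    """
    frames = []
    for t in reference_us:
        frame = {"t_us": t}
        for name, ts in others.items():
            j = nearest(ts, t) if ts else None
            within = j is not None and abs(ts[j] - t) <= tolerance_us
            frame[name] = j if within else None
        frames.append(frame)
    return frames


# Hypothetical timelines: RGB-D near 60 fps, IMU at 200 Hz, LiDAR at an assumed 10 Hz.
rgbd_us = [0, 16_667, 33_333]
imu_us = [0, 5_000, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000]
lidar_us = [0, 100_000]

frames = fuse(rgbd_us, {"imu": imu_us, "lidar": lidar_us}, tolerance_us=2_500)
for frame in frames:
    print(frame)
```

A stream that runs slower than the reference shows up as None in most frames, which is one reason the choice of reference stream and tolerance depends on the model consuming the data.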

Typical use cases

  • Mobile manipulation — wrist camera + force sensor + joint state for pick-and-place policies
  • Autonomous navigation — LiDAR + RGB-D + IMU for indoor/outdoor mapping
  • Dexterous grasping — tactile arrays + depth camera for in-hand manipulation
  • Human-robot handover — body tracking + force sensing for safe interaction

Why teams partner with us

Getting four sensor types to agree on a timestamp is harder than deploying any one of them. We own the full stack so you skip the integration pain.

  • Pre-calibrated rigs — shipped to your site or ours
  • Hardware sync — PTP and trigger-based, not software post-hoc
  • Frame-level QA — automated drop and drift detection
  • Flexible format — HDF5, rosbag, or your custom schema
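To make the idea of automated drop detection concrete, here is a minimal sketch that scans one stream's timestamps for gaps longer than the nominal sample period and derives a simple integrity ratio. The function names, the 50% slack threshold, and the ratio definition are assumptions chosen for illustration; they are not the checks or the metric used in production.

```python
def find_drops(timestamps_us, period_us, slack=0.5):
    """Return (index, missing_count) for each gap longer than expected.

    A gap counts as a drop when it exceeds period_us * (1 + slack).
    missing_count estimates how many samples should have arrived in the gap.
    """
    drops = []
    for i in range(1, len(timestamps_us)):
        gap = timestamps_us[i] - timestamps_us[i - 1]
        if gap > period_us * (1 + slack):
            missing = round(gap / period_us) - 1
            drops.append((i, missing))
    return drops


def integrity_rate(timestamps_us, period_us):
    """Fraction of expected samples that were actually received."""
    received = len(timestamps_us)
    missing = sum(count for _, count in find_drops(timestamps_us, period_us))
    return received / (received + missing)


# Hypothetical 200 Hz IMU stream (5,000 us period) with two samples lost.
imu_us = [0, 5_000, 10_000, 25_000, 30_000]
drops = find_drops(imu_us, period_us=5_000)
rate = integrity_rate(imu_us, period_us=5_000)
print(drops, round(rate, 3))
```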

Why outsource capture?

Building a multi-sensor rig, calibrating it, and maintaining sync is a full-time job. We do it across dozens of deployments so each one costs you less.

μs-level cross-modal alignment.

4+ modalities per capture session.

99.7% frame integrity rate.

Where we collect

41+ delivery centers across 12 countries. Every program runs from a Roborax hub near your target time zone.

Asia Pacific
India · Philippines

Americas
USA · Canada · Colombia · Jamaica · El Salvador · Belize

EMEA
UK · Albania · Kosovo · Morocco


What we deliver

Every modality, one timeline

Synchronized streams ready to drop straight into your perception or fusion pipeline.

RGB-D streams

Color + depth at 60 fps, calibrated intrinsics and extrinsics per camera.

LiDAR point clouds

Continuous 3D scans aligned with camera frames, in your robot frame.

IMU sequences

Accel + gyro at 200 Hz, drift-corrected and bias-calibrated.

Tactile arrays

GelSight and pressure grids, time-locked per grasp frame.

Force / torque

6-axis F/T sensors at wrist or fingertip, synced with joint state.

Audio

Contact mics and ambient arrays for acoustic event detection.

Thermal

Infrared heat maps for material and contact classification.

Event cameras

Asynchronous pixel events for high-speed motion capture.

How we work

From sensor stack to packaged dataset

A four-step protocol that delivers training-ready bag files at the end.

Step 1

Sensor selection

Match your model’s input modalities. Rig built from inventory or custom.

Step 2

Sync configuration

Hardware triggers + software time-sync. End-to-end validation.

Step 3

Capture

Continuous logging with on-rig integrity checks. Drift flagged in real time.
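One way drift between two sensor clocks could be flagged is to fit a line through the timestamp differences recorded for the same trigger events. The sketch below does this with an ordinary least-squares fit. It assumes paired timestamps for identical events are available from both devices; the function name and the 20 ppm limit are hypothetical.

```python
def estimate_drift(ref_us, dev_us):
    """Least-squares fit of (device - reference) against reference time.

    ref_us, dev_us: timestamps of the same events on two clocks (microseconds)
    Returns (offset_us, drift_ppm): the constant offset at t = 0 and the
    rate at which the device clock gains on the reference.
    """
    n = len(ref_us)
    diffs = [d - r for r, d in zip(ref_us, dev_us)]
    mean_t = sum(ref_us) / n
    mean_d = sum(diffs) / n
    var_t = sum((t - mean_t) ** 2 for t in ref_us)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(ref_us, diffs))
    slope = cov / var_t
    offset = mean_d - slope * mean_t
    return offset, slope * 1e6


# Hypothetical device clock: 100 us ahead and gaining 50 us per second.
ref_us = [0, 1_000_000, 2_000_000, 3_000_000]
dev_us = [100, 1_000_150, 2_000_200, 3_000_250]

offset_us, drift_ppm = estimate_drift(ref_us, dev_us)
DRIFT_LIMIT_PPM = 20  # assumed threshold, for illustration only
flagged = abs(drift_ppm) > DRIFT_LIMIT_PPM
print(offset_us, drift_ppm, flagged)
```

Running the fit over a sliding window of recent events, rather than the whole session, would be closer to a real-time check.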

Step 4

Package

Time-aligned, model-ready dataset. Rosbag, MCAP, or your custom format.

Rigs and tools

Sensors and packaging formats we run

Best-in-class hardware. Pipeline-agnostic output.

RealSense

D435 / D455 RGB-D

Velodyne

Puck / Alpha Prime

Hesai

Pandar series

Xsens MTi

IMU sequences

GelSight

Tactile arrays

rosbag / MCAP

Packaging

What our partners say
We needed time-synced RGB-D and lidar across a hundred-plus scenes. Their packaging gave us bag files we could drop directly into our training pipeline.
Wei Chen
Sensor Lead, Holos AI

FAQ

Questions about multimodal sensor capture

RGB, depth, tactile, force-torque, IMU, audio, and thermal, depending on the platform and task. All streams are hardware-synchronized, not software-interpolated.

Either. We maintain calibrated sensor rigs for standard configurations. For novel sensor combinations, you supply the hardware and we integrate it into our capture pipeline and handle calibration.

We use hardware trigger lines wherever possible. Where that is not feasible, we use PTP time synchronization with sub-millisecond accuracy. Sync quality is reported in every delivery.

Structured lab environments, purpose-built scene replicas, and real-world deployment sites including factories, warehouses, kitchens, and clinical spaces. We handle logistics, access permissions, and health and safety compliance.
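For readers unfamiliar with the PTP synchronization mentioned in the answers above, the sketch below shows the standard delay request-response arithmetic from IEEE 1588: four timestamps from one message exchange give the clock offset and the one-way path delay, assuming the delay is the same in both directions. The function name and the example values are illustrative only.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Clock offset and one-way delay from one PTP exchange.

    t1: master sends Sync (master clock)
    t2: slave receives Sync (slave clock)
    t3: slave sends Delay_Req (slave clock)
    t4: master receives Delay_Req (master clock)
    Assumes a symmetric network path. A positive offset means the
    slave clock is ahead of the master.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


# Hypothetical exchange in microseconds: slave 40 us ahead, 10 us path delay.
offset_us, delay_us = ptp_offset_and_delay(t1=1_000, t2=1_050, t3=1_100, t4=1_070)
print(offset_us, delay_us)
```

Asymmetric paths break the symmetry assumption and appear as offset error, which is one reason hardware trigger lines are preferable where they are feasible.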

Further reading

From the blog

The Embodied AI Data Flywheel

Why physical AI needs multimodal data at scale.

From the blog

Sim-to-Real Transfer: Why Synthetic Data Alone Falls Short

The role of real sensor data in closing the sim-to-real gap.

Spec a sensor capture program

Tell us the modalities and the scene count. We propose a rig and a schedule in two days.
