DexUMI Deployment Guide

Deploy DexUMI in the real world

Installation

Mamba Environment and External Dependencies

Tested on Ubuntu 22.04 with both NVIDIA L40S and RTX 4090 GPUs. We recommend using a GPU with larger memory so that you can process longer demonstrations.

Environment Setup

Terminal - Environment Setup
# We recommend Miniforge for faster installation
cd DexUMI
mamba env create -f environment.yml
mamba activate dexumi
# Optional: set DEV_PATH in your bashrc or zshrc
export DEV_PATH="/parent/directory/of/DexUMI"

DexUMI utilizes SAM2 and ProPainter to track and remove the exoskeleton and hand. Our system uses Record3D to track the wrist pose. To make Record3D compatible with Python 3.10, please follow the instructions here. Alternatively, you can directly install our forked version, which already integrates the solution.

Please clone the above three packages into the same directory as DexUMI. The final folder structure should be:

.
├── DexUMI
├── sam2
├── ProPainter
├── record3D
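
For reference, here is one way to clone the three dependencies, assuming everything lives under the same parent directory. The sam2 and ProPainter URLs below are the upstream repositories; point record3D at our fork (or the patched upstream) linked above.

Terminal - Clone Dependencies
cd /parent/directory/of/DexUMI
git clone https://github.com/facebookresearch/sam2.git
git clone https://github.com/sczhou/ProPainter.git
git clone <record3D-fork-url> record3D  # use the forked Record3D linked above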

Download the SAM2 checkpoint sam2.1_hiera_large.pt into sam2/checkpoints/ and install SAM2.
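
One way to fetch the checkpoint is the download script shipped in the sam2 repository, which downloads the SAM 2.1 checkpoints (including sam2.1_hiera_large.pt):

Terminal - Download SAM2 Checkpoint
cd sam2/checkpoints
./download_ckpts.sh
cd ..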

Terminal - Install SAM2
cd sam2
pip install -e .

You also need to install Record3D on your iPhone. We use an iPhone 15 Pro Max to track the wrist pose. You can use any iPhone model with ARKit capability, but you might need to modify some CAD models to accommodate other iPhone dimensions.

Fitting Encoder to Motor Regression

Build action mapping between exoskeleton and robot hardware

The goal is to fit a regression model that maps the joint encoder readings from the exoskeleton to the motor values on the robot hand, such that the fingertips of the exoskeleton are perfectly overlaid with the fingertips of the robot hand in the wrist camera image space. We achieve this by uniformly sampling robot hand motor values between the lower and upper limits. A human manually rotates the corresponding exoskeleton joint to align the fingertips in the wrist camera image. Repeating this process yields paired data points (encoder reading, motor value). After collecting enough data points, we fit a linear regression model that maps encoder readings to motor values. We follow the steps below to fit the model:
# Pseudocode for Motor-Encoder Regression Data Collection
for motor_id in robot_hand_motors:
    motor_values = []
    encoder_readings = []

    for motor_value in uniform_sample(motor_lower_limit, motor_upper_limit):
        # 1) Send sampled motor value to robot hand
        robot_hand.send_motor_command(motor_id, motor_value)

        # 2) Robot hand executes the command
        robot_hand.execute_command()
        wait_for_completion()

        # 3) Manually rotate the corresponding exoskeleton joint so the
        # fingertips overlay in the wrist camera image, then confirm
        input("Align exoskeleton fingertip with robot fingertip, then press Enter")

        # 4) Record the encoder reading
        encoder_value = exoskeleton.read_encoder(motor_id)
        motor_values.append(motor_value)
        encoder_readings.append(encoder_value)

    # Fit regression model using collected data
    regression_model = fit_linear_regression(encoder_readings, motor_values)
    save_model(regression_model, f"motor_{motor_id}_regression.pkl")
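
For concreteness, below is a minimal sketch of the fitting and saving steps with scikit-learn, assuming encoder_readings and motor_values are the per-motor lists collected above; the function names mirror the pseudocode and are illustrative.

# Sketch: per-motor linear regression fit (encoder reading -> motor value)
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression


def fit_linear_regression(encoder_readings, motor_values):
    # Fit motor_value ~ a * encoder_reading + b for a single joint
    X = np.asarray(encoder_readings, dtype=np.float64).reshape(-1, 1)
    y = np.asarray(motor_values, dtype=np.float64)
    return LinearRegression().fit(X, y)


def save_model(model, path):
    with open(path, "wb") as f:
        pickle.dump(model, f)


# Usage: map a new encoder reading to a motor command
# model = fit_linear_regression(encoder_readings, motor_values)
# motor_cmd = model.predict(np.array([[encoder_value]]))[0]
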
Note that for XHand we do not fit the regression model for every motor, as we found the mapping between the exoskeleton joints and the robot hand motors to be very close to linear. In this case, you can simply change the calibration angles; see the video for details. We also provide our mappings for both hands in the repo so that you can use them directly without fitting the regression model. However, you might still need to slightly adjust the regression model; again, see the video for details.



Data Generation & Collection

Collect, process, and prepare training data

Data Generation Pipeline

For each collected exoskeleton demonstration, we execute the following data processing pipeline:

# Data Generation Pipeline for DexUMI
for demo_video in exoskeleton_demonstrations:
    # Step 0: Synchronize multi-modal data streams
    sync_data_sources(wrist_camera, encoder_readings, wrist_pose, tactile_sensor)

    # Step 1: Record robot hand video by replaying actions
    robot_video = replay_on_robot(encoder_readings)

    # Step 2: Resize exoskeleton & robot hand videos to a common resolution
    demo_video, robot_video = resize_videos(demo_video, robot_video)

    # Step 3: Segment hands from both video streams
    exo_mask = segment_exoskeleton(demo_video)
    robot_mask = segment_robot_hand(robot_video)

    # Step 4: Remove exoskeleton and inpaint background
    clean_background = inpaint_video(demo_video, exo_mask)

    # Step 5: Composite final high-fidelity manipulation video
    final_video = composite_videos(clean_background, robot_video, robot_mask, exo_mask)
    save_training_data(final_video, action_labels)

This pipeline transforms raw exoskeleton demonstrations into high-quality training data by removing the human operator and exoskeleton hardware while preserving the natural occlusions between hand and object.
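
As an illustration of the compositing idea in Step 5, here is a simplified per-frame sketch, assuming the inpainted background, the robot replay frame, and both masks are already aligned numpy arrays; the actual compositing logic lives in the rendering scripts and may differ in detail.

# Sketch: per-frame compositing of the robot hand onto the inpainted background
import numpy as np


def composite_frame(background, robot_frame, robot_mask, exo_mask):
    """background, robot_frame: HxWx3 uint8 frames; robot_mask, exo_mask: HxW bool masks.

    Robot-hand pixels are pasted only where the exoskeleton was visible, so
    objects that occluded the exoskeleton also occlude the inserted robot hand
    (an assumption about how occlusion is preserved).
    """
    out = background.copy()
    paste = robot_mask & exo_mask
    out[paste] = robot_frame[paste]
    return out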

Data generation pipeline overview
Data Generation Commands
cd DexUMI/real_script/data_generation_pipeline
# Sync the different data sources and replay the exoskeleton actions on the robot hand.
# This covers Step 0 and Step 1 of the data generation pipeline pseudocode.
# Modify DATA_DIR, TARGET_DIR, and REFERENCE_DIR before running
./process.sh

# Run the data processing pipeline
# This covers Steps 1, 2, 3, 4, and 5 of the data generation pipeline pseudocode.
# Modify config/render/render_all_dataset.yaml before running
python render_all_dataset.py

# Generate the final training data
python 6_generate_dataset.py -d path/to/data_replay -t path/to/final_dataset --force-process total --force-adjust
Segmentation Setup (complete this before actual data collection/generation)

To achieve automatic segmentation of the exoskeleton and robot hand, you need to configure prompt points before starting data collection and processing. This is a one-time setup process.

Follow these steps to complete the setup:

  • Collect Reference Episode: Wear the exoskeleton and collect one initial episode. Ensure your hand and exoskeleton are clearly visible in the first few frames, with the hand in a fully open and comfortable pose. This episode will serve as your reference for all future data collection.
  • Generate Robot Replay: Replay the collected episode on the robot hand to create the corresponding robot hand video.
  • Create Segmentation Prompts: Set up prompt points for both the exoskeleton and robot hand segmentation. Save these prompt points to the reference episode for consistent use across all future collections.

Check the video below for detailed instructions on setting up prompt points and configuring the reference episode. A minimal SAM2 prompting sketch is also provided after the tips below.

Tips for Better Segmentation Results:
  • Color consistency: Wear gloves that match the exoskeleton color to improve detection accuracy
  • Prompt point optimization: Experiment with different positive and negative prompt points, as results can vary significantly based on placement
  • Sparse prompting: Use fewer, well-placed prompt points rather than dense coverage for better results
  • Background exclusion: Place negative prompt points on background regions to prevent SAM2 from including unwanted areas
  • Region-based segmentation: Divide the exoskeleton/robot hand into separate regions (thumb, fingers, pinky) with dedicated prompt points for each, then combine masks later
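
Below is a minimal sketch of prompting SAM2 on a single reference frame with positive and negative points; the checkpoint/config paths, image path, and pixel coordinates are placeholders to adapt to your setup.

# Sketch: SAM2 image prediction with positive/negative prompt points
import cv2
import numpy as np
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths; adjust to your folder layout (assumes a CUDA-capable GPU)
checkpoint = "../sam2/checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Placeholder reference frame from the wrist camera
image = cv2.cvtColor(cv2.imread("reference_frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Placeholder prompt points: label 1 = positive (on the exoskeleton/hand),
# label 0 = negative (on the background)
point_coords = np.array([[320, 240], [400, 260], [60, 40]])
point_labels = np.array([1, 1, 0])

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=False,
)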

Data Collection Guide

We visualize the prompt points on the wrist camera image. Make sure to adjust your hand and the exoskeleton so that they fully cover the prompt points. We also visualize the current encoder readings on the image. The text turns red when the encoder readings are not aligned with the prompt points.


Data Collection Commands
cd DexUMI/real_script/data_collection/
# If you do not have a force sensor installed, simply omit the -ef flag.
# Create REFERENCE_DIR before running
python record_exoskeleton.py -et -ef --fps 45 --reference-dir /path/to/reference_folder --hand_type xhand/inspire --data-dir /path/to/data

Real-world deployment

Robot setup

Calibrate iPhone to EE transformation matrix

Before deploying the policy in the real world, you need to determine the transformation matrix from the iPhone coordinate system to the robot end-effector coordinate system. Since the collected data records the pose of the iPhone rather than the actual robot end-effector pose, we need to calibrate this transformation so that the iPhone pose can be converted to the robot end-effector pose during deployment. The following figure shows two 3D-printed mounting components between the robot hand and the UR5/UR5e: component (a) is for actual deployment, while component (b) is for calibration. The only difference is that you can place an iPhone on the calibration component. Note that in image (b), the iPhone pose in the robot hand wrist (flange) frame is exactly the same as the iPhone pose in the exoskeleton wrist (flange) frame. Therefore, we can use mounting component (b) to determine how to transform the iPhone pose into the robot end-effector pose.
Robot setup diagram
In the video, we only calibrate the translation part, since the rotation matrix can be inferred from the definitions of the iPhone coordinate frame and the UR5/UR5e end-effector coordinate frame. Our script also supports calibrating the rotation matrix; however, in that case, you need to slightly modify real_script/calibration/record_ur5_trajectory.py to use a SpaceMouse or a scripted trajectory that moves the robot in SE(3) (currently it only moves in SO(3)).


Calibrate Transformation Matrix between iPhone and robot EE
cd DexUMI/real_script/calibration
# Make the UR5 rotate around the flange (TCP/wrist).
# Note: you might need to modify the target_pose on line 60 to place the EE in a safe region.
python record_ur5_trajectory.py -rp iphone_calibration

# Run optimization to compute translation part of the matrix
python compute_ur5_iphone_offset.py -rp iphone_calibration

After running the above scripts, you can set the correct transformation matrix in the eval scripts. You can repeat the above process several times and compute the average transformation matrix to obtain a more accurate result.
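
For reference, here is a minimal sketch of how the calibrated offset can be applied at deployment time, assuming the calibration yields a fixed rigid transform from the iPhone frame to the EE (flange) frame; the variable names and placeholder values are illustrative.

# Sketch: converting a recorded iPhone pose to a robot EE pose
import numpy as np


def pose_to_mat(R, t):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


# Pose of the EE (flange) frame expressed in the iPhone frame: rotation inferred
# from the frame definitions, translation from the calibration above (placeholders)
R_iphone_ee = np.eye(3)
t_iphone_ee = np.zeros(3)
T_iphone_ee = pose_to_mat(R_iphone_ee, t_iphone_ee)


def iphone_pose_to_ee_pose(T_base_iphone):
    """Convert an iPhone pose (expressed in the robot base frame) to the EE pose."""
    return T_base_iphone @ T_iphone_ee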

Policy Deployment

Policy Deployment
cd DexUMI/real_script/eval_policy
# Start the robot hand and UR5 server
python DexUMI/real_script/open_server.py --dexhand --ur5

# open a new terminal
python DexUMI/real_script/eval_policy/eval_xhand.py --model_path path/to/model --ckpt N # for xhand
# or
python DexUMI/real_script/eval_policy/eval_inspire.py --model_path path/to/model --ckpt N # for inspire hand