Tested on Ubuntu 22.04 with both NVIDIA L40S and RTX 4090 GPUs. We recommend using a GPU with larger memory so that you can process longer demonstrations.
Environment Setup
Terminal - Environment Setup
# We recommend Miniforge for faster installation
cd DexUMI
mamba env create -f environment.yml
mamba activate dexumi
# Optional: set DEV_PATH in your bashrc or zshrc
export DEV_PATH="/parent/directory/of/DexUMI"
DexUMI utilizes SAM2 and
ProPainter to track and
remove the exoskeleton and hand. Our system uses Record3D to track the
wrist pose. To make Record3D compatible with Python 3.10, please follow the instructions here.
Alternatively, you can directly install our forked version, which already integrates the solution.
Please clone the above three packages into the same directory as DexUMI. The final folder
structure should be:
.
├── DexUMI
├── sam2
├── ProPainter
├── record3D
Download the SAM2 checkpoint sam2.1_hiera_large.pt into
sam2/checkpoints/ and install SAM2.
Terminal - Environment Setup
cd sam2
pip install -e .
You also need to install Record3D on your iPhone. We use an iPhone 15 Pro Max to track the wrist
pose. You can use any iPhone model with ARKit capability, but you might need to modify some CAD
models to fit other iPhone dimensions.
Fitting Encoder to Motor Regression
Build action mapping between exoskeleton and robot hardware
The goal is to fit a regression model that maps the joint encoder readings from the exoskeleton to
the motor values on the robot hand, such that the fingertips of the exoskeleton can be perfectly
overlaid with the fingertips of the robot hand in the wrist camera image space. We achieve this by
uniformly sampling the robot hand motor value between its lower and upper limits. A human then
manually rotates the exoskeleton joint to align the fingertips in the wrist camera image. Repeating
this process yields paired data points (encoder reading, motor value). After collecting enough data
points, we fit a linear regression model that maps the encoder reading to the motor value.
We follow the steps below to fit the model:
# Pseudocode for Motor-Encoder Regression Data Collection
for motor_id in robot_hand_motors:
    motor_values = []
    encoder_readings = []
    for motor_value in uniform_sample(motor_lower_limit, motor_upper_limit):
        # 1) Send sampled motor value to robot hand
        robot_hand.send_motor_command(motor_id, motor_value)
        # 2) Robot hand executes the command
        robot_hand.execute_command()
        wait_for_completion()
        # 3) Manually rotate corresponding exoskeleton joint
        #    to overlay fingertips in wrist camera image
        Human("Align exoskeleton fingertip with robot fingertip")
        # 4) Record the encoder reading
        encoder_value = exoskeleton.read_encoder(motor_id)
        motor_values.append(motor_value)
        encoder_readings.append(encoder_value)

    # Fit regression model using collected data
    regression_model = fit_linear_regression(encoder_readings, motor_values)
    save_model(regression_model, f"motor_{motor_id}_regression.pkl")
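For reference, here is a minimal sketch (not the repository's exact script) of fitting and saving the per-motor linear map with NumPy, assuming the paired samples collected above are available as two lists per motor:

import pickle
import numpy as np

def fit_encoder_to_motor(encoder_readings, motor_values):
    """Fit motor ~= a * encoder + b with least squares."""
    a, b = np.polyfit(np.asarray(encoder_readings, dtype=float),
                      np.asarray(motor_values, dtype=float), deg=1)
    return a, b

def predict_motor(a, b, encoder_reading):
    return a * encoder_reading + b

if __name__ == "__main__":
    # Hypothetical example data for a single motor.
    encoder = [0.02, 0.35, 0.71, 1.05, 1.38]
    motor = [100, 380, 690, 1010, 1290]
    a, b = fit_encoder_to_motor(encoder, motor)
    with open("motor_0_regression.pkl", "wb") as f:
        pickle.dump({"a": a, "b": b}, f)
    print(predict_motor(a, b, 0.5))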
Note that for XHand, we do not fit the regression model for all motors, as we found the mapping between
exoskeleton joints and robot hand motors is very close to linear. In this case, you can simply
change the calibration angles; see the video for details. We also provide our mappings for both hands in
the repo so that you can use them directly without fitting the regression model. However, you might
still need to slightly adjust the regression model; again, see the video for details.
2. Data Generation & Collection
Collect, process, and prepare training data
Data Generation Pipeline
For each collected exoskeleton demonstration, we execute the following data processing pipeline:
# Data Generation Pipeline for DexUMI
for demo_video in exoskeleton_demonstrations:
    # Step 0: Synchronize multi-modal data streams
    sync_data_sources(wrist_camera, encoder_readings, wrist_pose, tactile_sensor)
    # Step 1: Record robot hand video by replaying actions
    robot_video = replay_on_robot(encoder_readings)
    # Step 2: Resize exoskeleton and robot hand videos
    demo_video, robot_video = resize_videos(demo_video, robot_video)
    # Step 3: Segment hands from both video streams
    exo_mask = segment_exoskeleton(demo_video)
    robot_mask = segment_robot_hand(robot_video)
    # Step 4: Inpaint the exoskeleton region to obtain a clean background
    clean_background = inpaint_exoskeleton(demo_video, exo_mask)
    # Step 5: Composite final high-fidelity manipulation video
    final_video = composite_videos(clean_background, robot_video, robot_mask, exo_mask)
    save_training_data(final_video, action_labels)
This pipeline transforms raw exoskeleton demonstrations into high-quality training data by
removing the human operator and exoskeleton hardware while preserving the natural occlusion
between hand and object.
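As a concrete illustration of the compositing step, the sketch below shows a per-frame version with NumPy; the function and variable names are our own, and it assumes the exoskeleton region has already been inpainted (e.g., with ProPainter) into a clean background frame.

import numpy as np

def composite_frame(demo_frame: np.ndarray,   # original wrist camera frame (HxWx3)
                    background: np.ndarray,   # inpainted frame with exoskeleton removed
                    robot_frame: np.ndarray,  # robot hand replay frame (HxWx3)
                    exo_mask: np.ndarray,     # HxW bool mask of exoskeleton and hand
                    robot_mask: np.ndarray    # HxW bool mask of the robot hand
                    ) -> np.ndarray:
    out = demo_frame.copy()
    # Erase the exoskeleton and human hand using the inpainted background.
    out[exo_mask] = background[exo_mask]
    # Paint the segmented robot hand on top.
    out[robot_mask] = robot_frame[robot_mask]
    return out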
Data Generation Commands
cd DexUMI/real_script/data_generation_pipeline
# Sync the different data sources and replay the exoskeleton actions on the robot hand.
# This covers Step 0 and Step 1 in the data generation pipeline pseudocode.
# Modify DATA_DIR, TARGET_DIR and REFERENCE_DIR before running
./process.sh
# Run the data processing pipeline
# This covers Steps 1, 2, 3, 4 and 5 in the data generation pipeline pseudocode.
# Modify config/render/render_all_dataset.yaml before running
python render_all_dataset.py
# Generate the final training data
python 6_generate_dataset.py -d path/to/data_replay -t path/to/final_dataset --force-process total --force-adjust
Segmentation Setup (finish it before actual data collection/generation)
To achieve automatic segmentation of the exoskeleton and robot hand, you need to configure prompt
points before starting data collection and processing. This is a one-time setup process.
Follow these steps to complete the setup:
Collect Reference Episode: Wear the exoskeleton and collect one initial
episode. Ensure your hand and exoskeleton are clearly visible in the first few frames,
with the hand in a fully open and comfortable pose. This episode will serve as your
reference for all future data collection.
Generate Robot Replay: Replay the collected episode on the robot hand
to create the corresponding robot hand video.
Create Segmentation Prompts: Set up prompt points for both the
exoskeleton and robot hand segmentation. Save these prompt points to the reference
episode for consistent use across all future collections.
Check the video below for detailed instructions on setting up prompt points and configuring
the reference episode.
Tips for Better Segmentation Results:
Color consistency: Wear gloves that match the exoskeleton color to
improve detection accuracy
Prompt point optimization: Experiment with different positive and
negative prompt points, as results can vary significantly based on placement
Sparse prompting: Use fewer, well-placed prompt points rather than
dense coverage for better results
Background exclusion: Place negative prompt points on background
regions to prevent SAM2 from including unwanted areas
Region-based segmentation: Divide the exoskeleton/robot hand into
separate regions (thumb, fingers, pinky) with dedicated prompt points for each, then
combine masks later
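If you are setting up prompt points programmatically, the snippet below is a minimal sketch of prompting SAM2 on a single wrist camera frame with positive and negative points. The checkpoint and config paths follow the SAM2 repository layout, and the image and coordinates are purely illustrative.

import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "sam2/checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Replace with an actual wrist camera frame (HxWx3 RGB uint8).
image = np.zeros((480, 640, 3), dtype=np.uint8)

with torch.inference_mode():
    predictor.set_image(image)
    # One positive point on the exoskeleton, one negative point on the background.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[420, 310], [60, 40]]),  # (x, y) pixels, illustrative
        point_labels=np.array([1, 0]),                  # 1 = positive, 0 = negative
        multimask_output=False,
    )
    # For region-based segmentation, call predict() once per region
    # (thumb, fingers, pinky) and combine the masks with np.logical_or.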
Data Collection Guide
We visualize the prompt points on the wrist camera image. Make sure to adjust your hand and
exoskeleton to fully cover the prompt points. We also visualize the current encoder reading on the
image; the text turns red when the encoder reading is not aligned with the prompt points.
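As an illustration (not the repository's GUI code), the overlay can be drawn with OpenCV roughly as follows:

import cv2

def draw_overlay(frame, prompt_points, encoder_reading, aligned):
    """Draw reference prompt points and the current encoder reading on a frame."""
    for (x, y) in prompt_points:
        cv2.circle(frame, (int(x), int(y)), 5, (0, 255, 0), -1)
    # Green text when the reading matches the reference pose, red otherwise.
    color = (0, 255, 0) if aligned else (0, 0, 255)
    cv2.putText(frame, f"encoder: {encoder_reading:.3f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2)
    return frame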
Data Collection Commands
cd DexUMI/real_script/data_collection/
# If you do not have a force sensor installed, simply omit the -ef flag.
# Create REFERENCE_DIR before running
python record_exoskeleton.py -et -ef --fps 45 --reference-dir /path/to/reference_folder --hand_type xhand/inspire --data-dir /path/to/data
Real-world deployment
Robot setup
Calibrate iPhone to EE transformation matrix
Before deploying the policy in the real world, you need to determine the transformation matrix from
the iPhone coordinate system to the robot end effector coordinate system. Since the collected data
records the pose of the iPhone rather than the actual robot end effector pose, we need to calibrate
the transformation matrix so that the iPhone pose can be transformed to the robot end effector pose
during deployment. In the following figure, we show two images of 3D printed mounting components
between the robot hand and the UR5/5e. Mounting component (a) is for actual deployment, while
mounting component (b) is for calibration; the only difference is that you can place an iPhone on
the calibration component. Note that in image (b), the iPhone pose in the robot hand wrist (flange)
frame is exactly the same as the iPhone pose in the exoskeleton wrist (flange) frame. Therefore, we
can use mounting component (b) to determine how to transfer the iPhone pose to the robot end
effector pose.
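At deployment time, applying the calibrated transform amounts to a single matrix multiplication. The sketch below uses hypothetical variable names and a translation-only offset:

import numpy as np

def iphone_to_ee(T_base_iphone: np.ndarray, T_iphone_ee: np.ndarray) -> np.ndarray:
    """Chain 4x4 homogeneous transforms: pose of the EE in the robot base frame."""
    return T_base_iphone @ T_iphone_ee

# Example: identity rotation plus the translation offset found by calibration.
T_iphone_ee = np.eye(4)
T_iphone_ee[:3, 3] = np.array([0.0, -0.05, 0.12])  # hypothetical offset in meters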
In the video, we only calibrate the translation part, since the rotation matrix can be inferred
from the definitions of the iPhone coordinate frame and the UR5/5e end effector coordinate frame.
Our script also supports calibrating the rotation matrix. However, in that case, you need to
slightly modify real_script/calibration/record_ur5_trajectory.py to use a space mouse or a
scripted trajectory so that the robot moves in SE(3) space (right now it only moves in SO(3)).
Calibrate Transformation Matrix between iPhone and robot EE
cd DexUMI/real_script/calibration
# Make UR5 rotate around flange (TCP/Wrist).
# Note: you might need to modify the target_pose on line 60 to place the EE in a safe region.
python record_ur5_trajectory.py -rp iphone_calibration
# Run optimization to compute translation part of the matrix
python compute_ur5_iphone_offset.py -rp iphone_calibration
After running the above script, you can set the correct transformation matrix in the eval scripts.
You can repeat the above process several times and compute the average transformation matrix to
get a more accurate result.
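Since only the translation part is calibrated here, averaging repeated runs reduces to a simple mean over the recovered offsets (averaging rotations would need more care, e.g., quaternion averaging). A tiny sketch with hypothetical values:

import numpy as np

offsets = np.array([
    [0.001, -0.052, 0.119],   # hypothetical results from repeated calibration runs
    [0.003, -0.049, 0.121],
    [0.002, -0.051, 0.120],
])
mean_offset = offsets.mean(axis=0)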
Policy Deployment
Policy Deployment
cd DexUMI/real_script/eval_policy
python DexUMI/real_script/open_server.py --dexhand --ur5
# open a new terminal
python DexUMI/real_script/eval_policy/eval_xhand.py --model_path path/to/model --ckpt N # for xhand
# or
python DexUMI/real_script/eval_policy/eval_inspire.py --model_path path/to/model --ckpt N # for inspire hand