Tested on Ubuntu 22.04 with both NVIDIA L40S and RTX 4090 GPUs. We recommend using a GPU with larger memory so that you can process longer demonstrations.
Environment Setup
Terminal - Environment Setup
# We recommend Miniforge for faster installation
cd DexUMI
mamba env create -f environment.yml
mamba activate dexumi
# Optional: set DEV_PATH in your bashrc or zshrc
export DEV_PATH="/parent/directory/of/DexUMI"
DexUMI utilizes SAM2 and
ProPainter to track and
remove the exoskeleton and hand. Our system uses Record3D to track the
wrist pose. To make Record3D compatible with Python 3.10, please follow the instructions here.
Alternatively, you can directly install our forked version, which already integrates the solution.
Please clone the above three packages into the same directory as DexUMI. The final folder
structure should be:
.
├── DexUMI
├── sam2
├── ProPainter
├── record3D
Download the SAM2 checkpoint sam2.1_hiera_large.pt into
sam2/checkpoints/ and install SAM2.
Terminal - Environment Setup
cd sam2
pip install -e .
You also need to install Record3D on your iPhone. We use an iPhone 15 Pro Max to track the wrist
pose. You can use any iPhone model with ARKit capability, but you might need to modify some CAD
models to fit other iPhone dimensions.
Fitting Encoder to Motor Regression
Build action mapping between exoskeleton and robot hardware
The goal is to fit a regression model that maps the joint encoder readings from the exoskeleton to
the motor values on the robot hand, such that the fingertips of the exoskeleton can be perfectly
overlaid with the fingertips of the robot hand in the wrist camera image space. We achieve this by
uniformly sampling the robot hand motor value between its lower and upper limits. A human then
manually rotates the exoskeleton joint to align the fingertips in the wrist camera image. Repeating
this process yields paired data points (encoder reading, motor value). After collecting enough data
points, we fit a linear regression model that maps the encoder reading to the motor value.
We follow the steps below to fit the model:
# Pseudocode for Motor-Encoder Regression Data Collection
for motor_id in robot_hand_motors:
    motor_values = []
    encoder_readings = []
    for motor_value in uniform_sample(motor_lower_limit, motor_upper_limit):
        # 1) Send sampled motor value to robot hand
        robot_hand.send_motor_command(motor_id, motor_value)
        # 2) Robot hand executes the command
        robot_hand.execute_command()
        wait_for_completion()
        # 3) Manually rotate corresponding exoskeleton joint
        #    to overlay fingertips in wrist camera image
        Human("Align exoskeleton fingertip with robot fingertip")
        # 4) Record the encoder reading
        encoder_value = exoskeleton.read_encoder(motor_id)
        motor_values.append(motor_value)
        encoder_readings.append(encoder_value)

    # Fit regression model using collected data
    regression_model = fit_linear_regression(encoder_readings, motor_values)
    save_model(regression_model, f"motor_{motor_id}_regression.pkl")
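For reference, here is a minimal sketch (not the repository's exact script) of fitting and saving the per-motor linear map with NumPy, assuming the paired samples collected above are available as two lists per motor:

import pickle
import numpy as np

def fit_encoder_to_motor(encoder_readings, motor_values):
    """Fit motor ~= a * encoder + b with least squares."""
    a, b = np.polyfit(np.asarray(encoder_readings, dtype=float),
                      np.asarray(motor_values, dtype=float), deg=1)
    return a, b

def predict_motor(a, b, encoder_reading):
    return a * encoder_reading + b

if __name__ == "__main__":
    # Hypothetical example data for a single motor.
    encoder = [0.02, 0.35, 0.71, 1.05, 1.38]
    motor = [100, 380, 690, 1010, 1290]
    a, b = fit_encoder_to_motor(encoder, motor)
    with open("motor_0_regression.pkl", "wb") as f:
        pickle.dump({"a": a, "b": b}, f)
    print(predict_motor(a, b, 0.5))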
Note that for XHand, we do not fit the regression model for all motors, as we found the mapping between
exoskeleton joints and robot hand motors is very close to linear. In this case, you can simply
change the calibration angles; see the video for details. We also provide our mappings for both hands in
the repo so that you can use them directly without fitting the regression model. However, you might
still need to slightly adjust the regression model; again, see the video for details.
2. Data Generation & Collection
Collect, process, and prepare training data
Data Generation Pipeline
For each collected exoskeleton demonstration, we execute the following data processing pipeline:
# Data Generation Pipeline for DexUMI
for demo_video in exoskeleton_demonstrations:
    # Step 0: Synchronize multi-modal data streams
    sync_data_sources(wrist_camera, encoder_readings, wrist_pose, tactile_sensor)
    # Step 1: Record robot hand video by replaying actions
    robot_video = replay_on_robot(encoder_readings)
    # Step 2: Resize exoskeleton and robot hand videos
    demo_video, robot_video = resize_videos(demo_video, robot_video)
    # Step 3: Segment hands from both video streams
    exo_mask = segment_exoskeleton(demo_video)
    robot_mask = segment_robot_hand(robot_video)
    # Step 4: Inpaint the exoskeleton region to obtain a clean background
    clean_background = inpaint_exoskeleton(demo_video, exo_mask)
    # Step 5: Composite final high-fidelity manipulation video
    final_video = composite_videos(clean_background, robot_video, robot_mask, exo_mask)
    save_training_data(final_video, action_labels)
This pipeline transforms raw exoskeleton demonstrations into high-quality training data by
removing the human operator and exoskeleton hardware while preserving the natural occlusion
between hand and object.
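As a concrete illustration of the compositing step, the sketch below shows a per-frame version with NumPy; the function and variable names are our own, and it assumes the exoskeleton region has already been inpainted (e.g., with ProPainter) into a clean background frame.

import numpy as np

def composite_frame(demo_frame: np.ndarray,   # original wrist camera frame (HxWx3)
                    background: np.ndarray,   # inpainted frame with exoskeleton removed
                    robot_frame: np.ndarray,  # robot hand replay frame (HxWx3)
                    exo_mask: np.ndarray,     # HxW bool mask of exoskeleton and hand
                    robot_mask: np.ndarray    # HxW bool mask of the robot hand
                    ) -> np.ndarray:
    out = demo_frame.copy()
    # Erase the exoskeleton and human hand using the inpainted background.
    out[exo_mask] = background[exo_mask]
    # Paint the segmented robot hand on top.
    out[robot_mask] = robot_frame[robot_mask]
    return out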
Data Generation Commands
cd DexUMI/real_script/data_generation_pipeline
# Sync the different data sources and replay the exoskeleton actions on the robot hand.
# This covers Step 0 and Step 1 in the data generation pipeline pseudocode.
# Modify DATA_DIR, TARGET_DIR and REFERENCE_DIR before running
./process.sh
# Run the data processing pipeline
# This covers Steps 1, 2, 3, 4 and 5 in the data generation pipeline pseudocode.
# Modify config/render/render_all_dataset.yaml before running
python render_all_dataset.py
# Generate the final training data
python 6_generate_dataset.py -d path/to/data_replay -t path/to/final_dataset --force-process total --force-adjust
Segmentation Setup (finish it before actual data collection/generation)
To achieve automatic segmentation of the exoskeleton and robot hand, you need to configure prompt
points before starting data collection and processing. This is a one-time setup process.
Follow these steps to complete the setup:
Collect Reference Episode: Wear the exoskeleton and collect one initial
episode. Ensure your hand and exoskeleton are clearly visible in the first few frames,
with the hand in a fully open and comfortable pose. This episode will serve as your
reference for all future data collection.
Generate Robot Replay: Replay the collected episode on the robot hand
to create the corresponding robot hand video.
Create Segmentation Prompts: Set up prompt points for both the
exoskeleton and robot hand segmentation. Save these prompt points to the reference
episode for consistent use across all future collections.
Check the video below for detailed instructions on setting up prompt points and configuring
the reference episode.
Tips for Better Segmentation Results:
Color consistency: Wear gloves that match the exoskeleton color to
improve detection accuracy
Prompt point optimization: Experiment with different positive and
negative prompt points, as results can vary significantly based on placement
Sparse prompting: Use fewer, well-placed prompt points rather than
dense coverage for better results
Background exclusion: Place negative prompt points on background
regions to prevent SAM2 from including unwanted areas
Region-based segmentation: Divide the exoskeleton/robot hand into
separate regions (thumb, fingers, pinky) with dedicated prompt points for each, then
combine masks later
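If you are setting up prompt points programmatically, the snippet below is a minimal sketch of prompting SAM2 on a single wrist camera frame with positive and negative points. The checkpoint and config paths follow the SAM2 repository layout, and the image and coordinates are purely illustrative.

import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "sam2/checkpoints/sam2.1_hiera_large.pt"
model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Replace with an actual wrist camera frame (HxWx3 RGB uint8).
image = np.zeros((480, 640, 3), dtype=np.uint8)

with torch.inference_mode():
    predictor.set_image(image)
    # One positive point on the exoskeleton, one negative point on the background.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[420, 310], [60, 40]]),  # (x, y) pixels, illustrative
        point_labels=np.array([1, 0]),                  # 1 = positive, 0 = negative
        multimask_output=False,
    )
    # For region-based segmentation, call predict() once per region
    # (thumb, fingers, pinky) and combine the masks with np.logical_or.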
Data Collection Guide
We visualize the prompt points on the wrist camera image. Make sure to adjust your hand and
exoskeleton to fully cover the prompt points. We also visualize the current encoder reading on the
image; the text turns red when the encoder reading is not aligned with the prompt points.
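As an illustration (not the repository's GUI code), the overlay can be drawn with OpenCV roughly as follows:

import cv2

def draw_overlay(frame, prompt_points, encoder_reading, aligned):
    """Draw reference prompt points and the current encoder reading on a frame."""
    for (x, y) in prompt_points:
        cv2.circle(frame, (int(x), int(y)), 5, (0, 255, 0), -1)
    # Green text when the reading matches the reference pose, red otherwise.
    color = (0, 255, 0) if aligned else (0, 0, 255)
    cv2.putText(frame, f"encoder: {encoder_reading:.3f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2)
    return frame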
Data Collection Commands
cd DexUMI/real_script/data_collection/
# If you do not have a force sensor installed, simply omit the -ef flag.
# Create REFERENCE_DIR before running
python record_exoskeleton.py -et -ef --fps 45 --reference-dir /path/to/reference_folder --hand_type xhand/inspire --data-dir /path/to/data
Real-world deployment
Robot setup
Calibrate iPhone to EE transformation matrix
Before deploying the policy in the real world, you need to determine the transformation matrix from
the iPhone coordinate system to the robot end effector coordinate system. Since the collected data
records the pose of the iPhone rather than the actual robot end effector pose, we need to calibrate
the transformation matrix so that the iPhone pose can be transformed to the robot end effector pose
during deployment. In the following figure, we show two images of 3D printed mounting components
between the robot hand and the UR5/5e. Mounting component (a) is for actual deployment, while
mounting component (b) is for calibration; the only difference is that you can place an iPhone on
the calibration component. Note that in image (b), the iPhone pose in the robot hand wrist (flange)
frame is exactly the same as the iPhone pose in the exoskeleton wrist (flange) frame. Therefore, we
can use mounting component (b) to determine how to transfer the iPhone pose to the robot end
effector pose.
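At deployment time, applying the calibrated transform amounts to a single matrix multiplication. The sketch below uses hypothetical variable names and a translation-only offset:

import numpy as np

def iphone_to_ee(T_base_iphone: np.ndarray, T_iphone_ee: np.ndarray) -> np.ndarray:
    """Chain 4x4 homogeneous transforms: pose of the EE in the robot base frame."""
    return T_base_iphone @ T_iphone_ee

# Example: identity rotation plus the translation offset found by calibration.
T_iphone_ee = np.eye(4)
T_iphone_ee[:3, 3] = np.array([0.0, -0.05, 0.12])  # hypothetical offset in meters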
In the video, we only calibrate the translation part, since the rotation matrix can be inferred
from the definitions of the iPhone coordinate frame and the UR5/5e end effector coordinate frame.
Our script also supports calibrating the rotation matrix. However, in that case, you need to
slightly modify real_script/calibration/record_ur5_trajectory.py to use a space mouse or a
scripted trajectory so that the robot moves in SE(3) space (right now it only moves in SO(3)).
Calibrate Transformation Matrix between iPhone and robot EE
cd DexUMI/real_script/calibration
# Make UR5 rotate around flange (TCP/Wrist).
# Note: you might need to modify the target_pose on line 60 to place the EE in a safe region.
python record_ur5_trajectory.py -rp iphone_calibration
# Run optimization to compute translation part of the matrix
python compute_ur5_iphone_offset.py -rp iphone_calibration
After running the above script, you can set the correct transformation matrix in the eval scripts.
You can repeat the above process several times and compute the average transformation matrix to
get a more accurate result.
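Since only the translation part is calibrated here, averaging repeated runs reduces to a simple mean over the recovered offsets (averaging rotations would need more care, e.g., quaternion averaging). A tiny sketch with hypothetical values:

import numpy as np

offsets = np.array([
    [0.001, -0.052, 0.119],   # hypothetical results from repeated calibration runs
    [0.003, -0.049, 0.121],
    [0.002, -0.051, 0.120],
])
mean_offset = offsets.mean(axis=0)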
Policy Deployment
Policy Deployment
cd DexUMI/real_script/eval_policy
python DexUMI/real_script/open_server.py --dexhand --ur5
# open a new terminal
python DexUMI/real_script/eval_policy/eval_xhand.py --model_path path/to/model --ckpt N # for xhand
# or
python DexUMI/real_script/eval_policy/eval_inspire.py --model_path path/to/model --ckpt N # for inspire hand