DexUMI: Using Human Hand as the Universal Manipulation Interface for Dexterous Manipulation

1 Stanford University, 2 Columbia University,
3 J.P. Morgan AI Research, 4 Carnegie Mellon University, 5 NVIDIA

*Indicates Equal Contribution

Abstract

We present DexUMI - a data collection and policy learning framework that uses the human hand as the natural interface to transfer dexterous manipulation skills to various robot hands. DexUMI includes hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap using a wearable hand exoskeleton. It allows direct haptic feedback during manipulation data collection and adapts human motion to feasible robot hand motion. The software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI's capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.

Introduction to DexUMI

Hardware Design

XHand exoskeleton

Inspire Hand exoskeleton

Capability Experiments

DexUMI experiment video. Please see the complete evaluations below.

Tea Picking with Tool

Task: Grasp tweezers from the table and use them to transfer tea leaves from a teapot to a cup. The main challenge is to stably and precisely operate the deformable tweezers with multi-finger contacts.

Hardware: XHand and Inspire Hand.

Ours (XHand)
Ours (Inspire Hand)
With the DexUMI framework, both the XHand and the Inspire Hand complete this long-horizon task with an average success rate of 85%.

Cube Picking

Task: Pick up a 2.5cm wide cube from a table and place it into a cup. This evaluates the basic capabilities and precision of the DexUMI system.

Ablation: We compare two forms of finger action trajectory: absolute position and relative trajectory. Note that we always use relative positions for the wrist action.

Hardware: Inspire Hand.

Ours
Absolute finger action trajectory
The absolute finger action policy fails to grasp the cube because it closes the index finger too early. We found that relative actions consistently yield more precise finger motions across all tasks due to their simpler distribution and more reactive behavior.
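To make the distinction concrete, here is a minimal sketch of the two action parameterizations (illustrative only, not the DexUMI implementation; the function names and array shapes are assumptions). It converts an absolute finger joint trajectory into per-step relative actions and integrates them back at execution time:

```python
import numpy as np

def to_relative_actions(abs_traj: np.ndarray) -> np.ndarray:
    """Convert absolute finger joint targets (T, D) into per-step deltas.

    The first delta is taken w.r.t. the initial joint state, so the
    relative trajectory has the same length as the absolute one.
    """
    prev = np.concatenate([abs_traj[:1], abs_traj[:-1]], axis=0)
    return abs_traj - prev

def to_absolute_actions(rel_traj: np.ndarray, init_state: np.ndarray) -> np.ndarray:
    """Integrate per-step deltas back into absolute joint targets at execution time."""
    return init_state + np.cumsum(rel_traj, axis=0)

# Toy usage: 4 timesteps, 6 finger joints.
abs_traj = np.linspace(0.0, 1.0, 4)[:, None] * np.ones((1, 6))
rel_traj = to_relative_actions(abs_traj)
recovered = to_absolute_actions(rel_traj, init_state=abs_traj[0])
assert np.allclose(recovered, abs_traj)
```

Because each relative action is a small delta centered around zero, the distribution the policy must model is simpler, which matches the behavior observed above.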

Kitchen Manipulation

Task: The task involves four sequential steps: turn off the stove knob; transfer the pan from the stove top to the counter; pick up salt from a container; and lastly, sprinkle it over the food in the pan. The task tests DexUMI's capability on long-horizon tasks that require precise actions, tactile sensing, and skills beyond fingertip use (utilizing the sides of the fingers for stable pan handling).

Ablation: The wearable exoskeleton allows users to directly contact objects and receive haptic feedback. However, this human haptic feedback cannot be directly transferred to the robotic dexterous hand. Therefore, we install tactile sensors on the exoskeleton to capture and translate these tactile interactions. We compare the policies trained with and without tactile sensor input.

Hardware: XHand.

Ours
No Tactile Sensor
The policy without tactile sensor input fails to grasp the seasoning. With tactile sensors, the fingers always insert into the salt first and then close. Without tactile feedback, the fingers sometimes attempt to grasp the salt in mid-air.
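The ablation above amounts to changing the policy's observation space. Below is a minimal, hypothetical sketch (the dimensions and module names are assumptions, not the actual DexUMI architecture) of a policy head that fuses image features, proprioception, and tactile readings; the no-tactile baseline simply drops or zeroes the last input:

```python
import torch
import torch.nn as nn

class TactilePolicy(nn.Module):
    """Illustrative policy head fusing visual, proprioceptive, and tactile inputs."""

    def __init__(self, img_feat_dim=512, proprio_dim=18, tactile_dim=15, action_dim=18):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_feat_dim + proprio_dim + tactile_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, img_feat, proprio, tactile):
        # The "no tactile" ablation corresponds to removing (or zeroing) `tactile`.
        x = torch.cat([img_feat, proprio, tactile], dim=-1)
        return self.mlp(x)

policy = TactilePolicy()
action = policy(torch.randn(1, 512), torch.randn(1, 18), torch.randn(1, 15))
```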

Egg Carton

Task: Open an egg carton with multiple fingers: the hand needs the index, middle, ring, and little fingers to apply downward pressure on the carton's top while simultaneously using the thumb to lift the front latch. The task evaluates multi-finger coordination.

Ablation: DexUMI includes a software adaptation pipeline to bridge the visual gap between policy training and robot deployment. To test whether the software adaptation pipeline is crucial to our framework, we train a policy without software adaptation, replacing the pixels occupied by the exoskeleton (during training) or the robot hand (during inference) with a green color mask.

Hardware: Inspire Hand.

Ours
Without Software Adaptation
Through these experiments, we found that software adaptation is critical for bridging the visual gap in the DexUMI pipeline. Without software adaptation, the policy still learns coarse actions such as approaching the egg carton, but it cannot perform precise actions when interacting with the object.
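For reference, the green-mask baseline in this ablation can be approximated by a single masking step. The sketch below is illustrative (function and variable names are assumptions); the hand/exoskeleton mask would come from a segmentation model such as SAM2, as discussed in the Inpaint Results section:

```python
import numpy as np

def apply_green_mask(frame: np.ndarray, hand_mask: np.ndarray) -> np.ndarray:
    """Replace pixels covered by the exoskeleton or robot hand with flat green.

    frame:     (H, W, 3) uint8 RGB image from the wrist camera.
    hand_mask: (H, W) boolean mask of exoskeleton (training) or robot hand
               (inference) pixels.
    """
    out = frame.copy()
    out[hand_mask] = np.array([0, 255, 0], dtype=np.uint8)  # flat green fill
    return out
```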

Efficiency Comparison

DexUMI offers two key advantages over teleoperation: 1) DexUMI is significantly more efficient than traditional teleoperation methods, and 2) DexUMI provides direct haptic feedback, which typical teleoperation systems often fail to deliver.

Inpaint Results

We show the exoskeleton data and the inpainted video side by side to demonstrate our software adaptation layer's capability. Our software adaptation bridges the visual gap by replacing the human hand and exoskeleton in visual observations recorded by the wrist camera with high-fidelity robot hand inpainting. Though the overall inpainting quality is good, we found there are still some deficits in the output, caused by:

  • 1. Imperfect segmentation from SAM2: In most cases, we found SAM2 (Ravi et al., 2024) can segment the human hand and exoskeleton quite well. However, we noticed that SAM2 sometimes misses small areas on the exoskeleton.
  • 2. Quality of the inpainting method: We use the flow-based inpainting method ProPainter (Zhou et al., 2023) to replace the human and exoskeleton pixels with background pixels. Though the overall quality is high, some areas remain blurry.
  • 3. Robot hand hardware: Throughout our experiments, we found that both the Inspire Hand and XHand lack sufficient precision due to backlash and friction. For example, the fingertip location of the Inspire Hand differs when moving from 1000 to 500 motor units compared to moving from 0 to 500 motor units. Consequently, when fitting regression models between encoder and hand motor values, we can typically ensure precision in only "one direction"—either when closing the hand or opening it. This inevitably causes minor discrepancies in the inpainting and action mapping processes (see the direction-aware regression sketch below).
  • 4. Inconsistent illumination: Similar to prior work (Chen et al., 2024), we found that illumination on the robot hand might be inconsistent with what the robot experiences during deployment. Therefore, we add image augmentation including color jitter and random grayscale during policy training to make the learned policy less sensitive to lighting conditions (see the augmentation sketch below).
  • 5. 3D-printed exoskeleton deformation: The human hand is powerful and can sometimes cause the 3D-printed exoskeleton to deform during operation. In such cases, the encoder value fails to reflect this deformation. Consequently, the robot finger location might not align with the exoskeleton's actual finger position.
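Regarding point 3, one simple way to handle direction-dependent backlash is to fit separate encoder-to-motor regressions for closing and opening motions. The sketch below is a minimal illustration (a linear fit and these function names are assumptions; DexUMI's actual regression form may differ):

```python
import numpy as np

def fit_direction_aware_mapping(encoder: np.ndarray, motor: np.ndarray):
    """Fit separate linear encoder->motor regressions for closing and opening.

    Because of backlash and friction, the same encoder reading can map to
    different motor values depending on the direction of travel, so paired
    (encoder, motor) samples are split by the sign of the encoder velocity
    before fitting.
    """
    direction = np.sign(np.diff(encoder, prepend=encoder[0]))
    coeffs = {}
    for name, sel in (("closing", direction >= 0), ("opening", direction < 0)):
        if sel.sum() >= 2:
            coeffs[name] = np.polyfit(encoder[sel], motor[sel], deg=1)
    return coeffs

def encoder_to_motor(value: float, coeffs: dict, direction: str = "closing") -> float:
    """Map a single encoder reading to a motor command for the given direction."""
    slope, intercept = coeffs[direction]
    return slope * value + intercept
```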
Nevertheless, the processed visual observation is passed to the manipulation policy as input. The learned policy achieves an average task success rate of 86% on two different robot hardware platforms. This suggests our software adaptation layer can effectively minimize the visual gap for policy learning and deployment.
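Regarding point 4, the photometric augmentations mentioned above can be expressed with standard torchvision transforms. The ranges and probability below are illustrative defaults, not the exact DexUMI settings:

```python
import torchvision.transforms as T

# Photometric augmentations applied to wrist-camera frames during policy
# training, so the policy is less sensitive to illumination differences
# between inpainted training images and real deployment images.
train_augment = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.RandomGrayscale(p=0.1),
])
```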

Cube Picking

Egg Carton

Tea Picking with Tool (Inspire Hand)

Tea Picking with Tool (XHand)

Kitchen Manipulation

References