We present DexUMI, a data collection and policy learning framework that uses the human hand as the natural interface for transferring dexterous manipulation skills to various robot hands. DexUMI includes hardware and software adaptations to minimize the embodiment gap between the human hand and various robot hands. The hardware adaptation bridges the kinematics gap with a wearable hand exoskeleton, which provides direct haptic feedback during manipulation data collection and maps human motion to feasible robot hand motion. The software adaptation bridges the visual gap by replacing the human hand in video data with high-fidelity robot hand inpainting. We demonstrate DexUMI's capabilities through comprehensive real-world experiments on two different dexterous robot hand hardware platforms, achieving an average task success rate of 86%.
Task: Grasp tweezers from the table and use them to transfer tea leaves from a teapot to a cup. The main challenge is operating the deformable tweezers stably and precisely with multi-finger contacts.
Hardware: XHand and Inspire Hand.
Task: Pick up a 2.5cm wide cube from a table and place it into a cup. This evaluates the basic capabilities and precision of the DexUMI system.
Ablation: We compare two forms of finger action representation: absolute position versus relative trajectory. Note that we always use relative position for the wrist action. A minimal sketch of the two representations follows.
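To make the distinction concrete, here is a minimal sketch in Python with NumPy; the function names are our own illustration, not part of the DexUMI codebase:

```python
import numpy as np

def to_relative_actions(joint_traj: np.ndarray) -> np.ndarray:
    """Convert an absolute finger joint trajectory of shape (T, D)
    into per-step relative actions (deltas from the previous step)."""
    deltas = np.diff(joint_traj, axis=0)  # (T-1, D)
    # Prepend a zero delta so the action sequence aligns with the trajectory.
    return np.concatenate([np.zeros((1, joint_traj.shape[1])), deltas], axis=0)

def to_absolute_actions(deltas: np.ndarray, start: np.ndarray) -> np.ndarray:
    """Integrate relative actions back into absolute joint positions,
    starting from the current joint configuration `start` of shape (D,)."""
    return start + np.cumsum(deltas, axis=0)
```

With the zero delta prepended, `to_absolute_actions(to_relative_actions(traj), traj[0])` recovers the original trajectory, so the two forms carry the same information and differ only in how the policy is supervised.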
Hardware: Inspire Hand.
Task: The task involves four sequential steps: turn off the stove knob; transfer the pan from the stovetop to the counter; pick up salt from a container; and finally, sprinkle it over the food in the pan. The task tests DexUMI's capability on long-horizon tasks requiring precise actions, tactile sensing, and skills beyond fingertip use (e.g., using the sides of the fingers for stable pan handling).
Ablation: The wearable exoskeleton allows users to contact objects directly and receive haptic feedback. However, this human haptic feedback cannot be transferred directly to the robot hand. We therefore install tactile sensors on the exoskeleton to capture and translate these tactile interactions, and compare policies trained with and without tactile sensor input, as sketched below.
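At the observation level, the ablation amounts to dropping one input stream. A rough sketch (the dictionary keys and normalization are our own illustration, not the exact DexUMI interface):

```python
import numpy as np

def build_observation(wrist_rgb, finger_joints, tactile=None, use_tactile=True):
    """Assemble a policy observation dict; setting use_tactile=False
    reproduces the 'without tactile' ablation condition."""
    obs = {"wrist_rgb": wrist_rgb, "finger_joints": finger_joints}
    if use_tactile and tactile is not None:
        # Scale raw tactile readings to roughly [0, 1] before the policy sees them.
        obs["tactile"] = tactile / (np.abs(tactile).max() + 1e-8)
    return obs
```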
Hardware: XHand.
Task: Open an egg carton with multiple fingers: the hand must use the index, middle, ring, and little fingers to apply downward pressure on the carton's top while simultaneously using the thumb to lift the front latch. The task evaluates multi-finger coordination.
Ablation: DexUMI includes a software adaptation pipeline to bridge the visual gap between policy training and robot deployment. To test whether this pipeline is crucial to our framework, we train a policy without software adaptation, instead replacing the pixels occupied by the exoskeleton (during training) or the robot hand (during inference) with a green color mask; a minimal sketch of this baseline appears below.
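For illustration, the green-mask baseline could be implemented as follows, assuming a boolean segmentation mask of the hand region is available (how the mask is obtained is not shown, and the function name is hypothetical):

```python
import numpy as np

def apply_green_mask(frame: np.ndarray, hand_mask: np.ndarray) -> np.ndarray:
    """Replace pixels covered by the exoskeleton (training) or robot hand
    (inference) with a flat green color, instead of inpainting.

    frame:     (H, W, 3) uint8 RGB image from the wrist camera.
    hand_mask: (H, W) boolean segmentation of the hand/exoskeleton region.
    """
    out = frame.copy()
    out[hand_mask] = np.array([0, 255, 0], dtype=frame.dtype)  # flat green fill
    return out
```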
Hardware: Inspire Hand.
We show the exoskeleton data and the inpainted video side by side to demonstrate the capability of our software adaptation layer. The software adaptation bridges the visual gap by replacing the human hand and exoskeleton in the wrist-camera observations with high-fidelity robot hand inpainting. Though the overall inpainting quality is good, some deficits remain in the output, caused by: