Junyi Ma1, Wentao Bao2, Jingyi Xu1, Guanzhong Sun3, Yu Zheng1, Erhang Zhang1, Xieyuanli Chen4, Hesheng Wang1*
1 Shanghai Jiao Tong University 2 Meta Reality Labs 3 China University of Mining and Technology 4 National University of Defense Technology
[Paper][Project Page][Code][Preliminary Version]
Human Videos are All You Need!
In this repository, we demonstrate how to train Uni-Hand using only human demonstration videos and leverage the trained model to generate end-effector trajectories for robotic manipulation. To evaluate our method on other public datasets (e.g., EgoPAT3D), please refer to our preliminary work.
If any bugs are spotted or any download links are broken, please do not hesitate to make a PR or open an issue.
🔧 Prepare a new conda environment with required dependencies for Uni-Hand. [Click to expand]
First, clone Uni-Hand
git clone https://github.com/IRMVLab/UniHand
cd UniHand
Then, create and activate a new conda environment
conda create -n unihand python=3.10
conda activate unihand
pip install -r requirements.txt
🔧 Clone HaMeR/SAM-3D-Body and DINOv2 for data preprocessing. (optional) [Click to expand]
Since we use HaMeR for hand motion extraction and DINOv2 for vision feature extraction, we need to clone them into this project. We also recommend trying SAM-3D-Body; a related tutorial for Uni-Hand is coming soon.
git clone https://github.com/geopavlakos/hamer.git
# install HaMeR following its instruction
# replace hamer/datasets/vitdet_dataset.py with preprocess_human_video/vitdet_dataset.py in our repo
git clone https://github.com/facebookresearch/dinov2.git
# install DINOv2 following its instruction
- Alternatively, you can directly download our preprocessed data (hand trajectories + vision features) for our toy dataset here.
- Note that you can use vision features generated by any other visual foundation model. Please update `input_dims` of `glip_encoder` in `model.yaml` if the feature vector dimension differs from DINOv2's (a quick check is shown below).
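If you are unsure about the dimension, one quick way to check is to load a saved feature and print its shape. The file path below is a placeholder, not the exact layout produced by our scripts:

```python
# Inspect the dimension of one saved vision feature before editing model.yaml.
# The path is hypothetical; adjust it to your vision_features/ layout.
import numpy as np

feat = np.load("vision_features/2025-0723-07-17-59/000000.npy")
print(feat.shape)  # set input_dims of glip_encoder in model.yaml to match this
```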
📁 We recommend following the default data structure for fast deployment. [Click to expand]
./UniHand
|-- human_video_data
|   |-- 2025-0723-07-17-46
|   |-- 2025-0723-07-17-52
|   |-- 2025-0723-07-17-59
|   |   |-- depth
|   |   |   |-- 000000.npy
|   |   |   |-- 000001.npy
|   |   |   |-- ...
|   |   |-- rgb
|   |   |   |-- 000000.npy
|   |   |   |-- 000001.npy
|   |   |   |-- ...
|   |   |-- 2025-0723-07-17-59_point_cloud.ply
|-- hand_keypoints   # auto-generated or downloaded
|-- hand_trajs       # auto-generated or downloaded
|-- vision_features  # auto-generated or downloaded
- We have provided the toy dataset (100 human pick-and-place videos) here, recorded with a RealSense LiDAR Camera L515. Please feel free to use it. We plan to release more human manipulation videos in the future.
- The `.ply` file will be generated automatically by the following scripts.
- You can collect your own dataset with a single RGBD camera following this data structure (a minimal recording sketch is shown after this list). Just sit in front of your robot and hit record. It's that easy!
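For reference, below is a minimal recording sketch using the `pyrealsense2` SDK that saves RGB and depth frames as `.npy` files in the structure above. The stream settings, frame count, and folder naming are assumptions; adapt them to your camera and task.

```python
# Minimal RGB-D recording sketch (assumes a RealSense camera and pyrealsense2).
import os
import time

import numpy as np
import pyrealsense2 as rs

out = time.strftime("human_video_data/%Y-%m%d-%H-%M-%S")  # e.g. 2025-0723-07-17-46
os.makedirs(os.path.join(out, "rgb"), exist_ok=True)
os.makedirs(os.path.join(out, "depth"), exist_ok=True)

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color)   # default color profile
config.enable_stream(rs.stream.depth)   # default depth profile
pipeline.start(config)
align = rs.align(rs.stream.color)       # align depth to the color frame

try:
    for i in range(300):                # record roughly 10 s at 30 FPS
        frames = align.process(pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        np.save(os.path.join(out, "rgb", f"{i:06d}.npy"), color)
        np.save(os.path.join(out, "depth", f"{i:06d}.npy"), depth)
finally:
    pipeline.stop()
```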
📁 Extract and refine 3D hand trajectories. [Click to expand]
First, we extract raw 3D hand trajectories from human videos with the help of HaMeR
cp preprocess_human_video/extract_hand_keypoints.py ./hamer
cd hamer
python extract_hand_keypoints.py \
--img_folder ../human_video_data \
    --out_folder ../hand_keypoints
cd ../preprocess_human_video
python generate_hand_trajs.py \
--input_root ../human_video_data \
--keypoint_root ../hand_keypoints \
--output_root ../hand_trajs
The point cloud files have been generated automatically. Then, we clean the hand trajectories for better training performance
python clean_hand_data.py \
--gt_paths ../hand_trajs \
--joint_idx 0
You can specify the joint index you want to visualize via `--joint_idx`.
Alternatively, you can directly use our preprocessed hand motion data here.
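For a quick sanity check of the refined trajectories, you can plot one of them. The snippet below assumes the trajectories are stored as NumPy arrays of shape (T, num_joints, 3) and uses a hypothetical file name; adapt it to the actual output of `generate_hand_trajs.py` and `clean_hand_data.py`.

```python
# Plot the wrist (joint 0) of one refined 3D hand trajectory.
# File name and array layout (T, num_joints, 3) are assumptions.
import numpy as np
import matplotlib.pyplot as plt

traj = np.load("hand_trajs/2025-0723-07-17-59.npy")  # hypothetical file
wrist = traj[:, 0, :]                                # joint index 0, as above

ax = plt.figure().add_subplot(projection="3d")
ax.plot(wrist[:, 0], wrist[:, 1], wrist[:, 2])
ax.set_title("Refined wrist trajectory (joint 0)")
plt.show()
```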
📁 Label hand-object contact and separation timestamps. [Click to expand]
- Contact/separation timestamps are crucial for training Uni-Hand, since trajectory data alone is insufficient to generate the gripper's grasping actions. Please refer to our paper for more details.
- We have provided the contact/separation labels of the toy dataset under the `unihand/data_utils` folder. You can manually label the timings of your own videos and organize them as `date_folders.csv` (a loading example follows this list).
- You can also try our recent work, EgoLoc, for autonomous temporal interaction localization.
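If you label your own videos, a quick way to double-check the file is to load it with pandas. The snippet makes no assumption about column names; inspect the shipped `date_folders.csv` for the actual layout before relying on a specific schema.

```python
# Inspect the contact/separation label file before training on your own videos.
import pandas as pd

labels = pd.read_csv("unihand/data_utils/date_folders.csv")
print(labels.columns.tolist())  # check the actual column layout
print(labels.head())            # expected: one row per recorded video folder
```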
📁 Extract vision features. [Click to expand]
Here we use DINOv2 to extract vision features, but you can also replace it with any other visual foundation model
cp preprocess_human_video/extract_visual_features.py ./dinov2
cd dinov2
python extract_visual_features.py \
--input_root ../human_video_data \
--output_root ../vision_features \
--interval 1
Alternatively, you can directly use our pre-extracted features here. We will release the GLIP version with language instruction in the future.
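As a concrete example of swapping in a different backbone, the sketch below uses a torchvision ResNet-50 to produce one global feature per frame (2048-D, so `input_dims` of `glip_encoder` would need to be updated accordingly). The folder names are placeholders, and this is not the script used in the paper.

```python
# Hedged sketch: per-frame features from a torchvision ResNet-50 instead of DINOv2.
import glob
import os

import numpy as np
import torch
from PIL import Image
from torchvision.models import ResNet50_Weights, resnet50

weights = ResNet50_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

frame_dir = "human_video_data/2025-0723-07-17-59/rgb"   # RGB frames stored as .npy
out_dir = "vision_features_resnet/2025-0723-07-17-59"   # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for path in sorted(glob.glob(os.path.join(frame_dir, "*.npy"))):
    img = Image.fromarray(np.load(path).astype(np.uint8))
    with torch.no_grad():
        feat = backbone(preprocess(img).unsqueeze(0)).flatten()  # 2048-dim vector
    np.save(os.path.join(out_dir, os.path.basename(path)), feat.numpy())
```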
👉 Train and evaluate Uni-Hand. [Click to expand]
To train Uni-Hand, set `evaluate` to `false` in `unihand/configs/traineval.yaml`, and run the following command
cd unihand
bash run_unihand.sh
- After training, you can evaluate the trained model by setting `evaluate` to `true` in `unihand/configs/traineval.yaml` and running the same command. You can set `gap` to `1` for better results (see the sketch after this list).
- You can resume training from, or evaluate, a checkpoint by setting `resume` in `unihand/configs/traineval.yaml`. If you evaluate a model trained from scratch, set `use_os_weights` to `false` and set `resume` to a non-existent path.
- We also provide the pretrained model of Uni-Hand here. Please set `use_os_weights` to `true` and test it!
- Uni-Hand is robust to background variations thanks to the use of depth information.
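The config keys above can of course be edited by hand. Purely for illustration, here is a small sketch that toggles them with PyYAML; it assumes they are top-level keys in `traineval.yaml`, which may not match the actual file structure, and it will drop comments from the file.

```python
# Illustrative only: flip Uni-Hand's train/eval switches in traineval.yaml.
# Assumes evaluate/gap/use_os_weights/resume are top-level keys (unverified).
import yaml

cfg_path = "unihand/configs/traineval.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["evaluate"] = True          # switch from training to evaluation
cfg["gap"] = 1                  # denser sampling for better results
cfg["use_os_weights"] = False   # True only when testing the released checkpoint
cfg["resume"] = "checkpoints/unihand_latest.pth"  # hypothetical checkpoint path

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```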
👉 Visualize end-effector trajectories predicted by Uni-Hand. [Click to expand]
After evaluation, you can visualize the end-effector trajectories. Since in this demo we predict hand wrist trajectories for robotic manipulation, a heuristic grasp offset is used to convert the predicted wrist waypoints into end-effector trajectories
cd unihand
python viz_predicted_trajs.py
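For reference, the wrist-to-end-effector conversion amounts to shifting each predicted wrist waypoint by a fixed offset along the gripper approach direction. The sketch below uses a placeholder offset and file name, not the values from `viz_predicted_trajs.py`.

```python
# Hedged sketch of the heuristic wrist-to-gripper conversion.
# Offset value, trajectory file, and array layout (N, 3) are assumptions.
import numpy as np

GRASP_OFFSET = np.array([0.0, 0.0, -0.10])  # e.g. 10 cm along the approach axis (assumed)

def wrist_to_ee(wrist_waypoints: np.ndarray) -> np.ndarray:
    """Shift predicted wrist waypoints (N, 3) by the fixed grasp offset."""
    return wrist_waypoints + GRASP_OFFSET

wrist_traj = np.load("predicted_wrist_traj.npy")  # hypothetical prediction dump
ee_traj = wrist_to_ee(wrist_traj)
```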
📘 We provide our self-recorded videos and model weights. [Click to expand]
- Human videos with annotations [pick-and-place]
- Human videos with annotations [open-door]
- Pretrained model [pick-and-place]
- Pretrained model [open-door]
- We have put the train/test splits and contact/separation labels of the pick-and-place task under the `unihand/data_utils` folder. For the open-door task, please refer to this link.
We have demonstrated the deployment of Uni-Hand on real robots. Please refer to our project page and paper for more details.
Let your robot’s end-effector follow the trajectories predicted by Uni-Hand! 🤖
This is the initial version of Uni-Hand, which already provides an out-of-the-box paradigm for human-video-based imitation learning. We will provide a more comprehensive version with multimodal inputs and additional downstream tasks, and release the heuristic grasp assumption once the paper is accepted.
If you find our work helpful to your research, we would appreciate it if you could cite our paper:
@misc{ma2025unihand,
title={Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views},
author={Junyi Ma and Wentao Bao and Jingyi Xu and Guanzhong Sun and Yu Zheng and Erhang Zhang and Xieyuanli Chen and Hesheng Wang},
year={2025},
eprint={2511.12878},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.12878},
}
@inproceedings{ma2025mmtwin,
author={Ma, Junyi and Bao, Wentao and Xu, Jingyi and Sun, Guanzhong and Chen, Xieyuanli and Wang, Hesheng},
booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction},
year={2025},
pages={2408-2415},
doi={10.1109/IROS60139.2025.11246803}
}
@article{ma2025madiff,
title={MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos},
author={Junyi Ma and Xieyuanli Chen and Wentao Bao and Jingyi Xu and Hesheng Wang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
}
This project is free software made available under the MIT License. For details see the LICENSE file.







