
Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma1, Wentao Bao2, Jingyi Xu1, Guanzhong Sun3, Yu Zheng1, Erhang Zhang1, Xieyuanli Chen4, Hesheng Wang1*

1 Shanghai Jiao Tong University 2 Meta Reality Labs 3 China University of Mining and Technology 4 National University of Defense Technology

[Paper][Project Page][Code][Preliminary Version]

Human Videos are All You Need!

In this repository, we demonstrate how to train Uni-Hand using only human demonstration videos and leverage the trained model to generate end-effector trajectories for robotic manipulation. To evaluate our method on other public datasets (e.g., EgoPAT3D), please refer to our preliminary work.

If any bugs are spotted or any download links are broken, please do not hesitate to make a PR or open an issue.

Install Uni-Hand

🔧 Prepare a new conda environment with required dependencies for Uni-Hand.

First, clone Uni-Hand

git clone https://github.com/IRMVLab/UniHand
cd UniHand

Then, create and activate a new conda environment

conda create -n unihand python=3.10
conda activate unihand
pip install -r requirements.txt
🔧 Clone HaMeR/SAM-3D-Body and DINOv2 for data preprocessing (optional).

Since we use HaMeR for hand motion extraction and DINOv2 for vision feature extraction, both need to be cloned into this project. We also recommend trying SAM-3D-Body; its Uni-Hand tutorial is coming soon.

git clone https://github.com/geopavlakos/hamer.git
# install HaMeR following its instruction
# replace hamer/datasets/vitdet_dataset.py with preprocess_human_video/vitdet_dataset.py in our repo

git clone https://github.com/facebookresearch/dinov2.git
# install DINOv2 following its instruction
  • Alternatively, you can directly download our preprocessed data (hand trajectories + vision features) for our toy dataset here.

  • Note that you can use vision features generated by any other visual foundation model. Please update the input_dims of glip_encoder in model.yaml if the feature dimension differs from DINOv2's, as sketched below.
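
If you swap in another backbone, a quick way to determine the right input_dims is to inspect one of the saved feature files. Below is a minimal sketch that assumes the features are stored as .npy arrays under vision_features/ (adjust the path and pattern to your own layout):

import glob
import numpy as np

# Minimal sketch: inspect one saved vision feature to decide input_dims.
# Assumes features are stored as .npy arrays under ./vision_features/;
# adjust the glob pattern if your layout differs.
feature_files = sorted(glob.glob("vision_features/**/*.npy", recursive=True))
if feature_files:
    feat = np.load(feature_files[0])
    print(f"{feature_files[0]}: shape = {feat.shape}")
    # The last dimension is the value to set as input_dims of glip_encoder in
    # model.yaml (e.g., 384/768/1024/1536 for the DINOv2 ViT-S/B/L/g variants).
else:
    print("No .npy feature files found under vision_features/")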

Prepare Human Video Data

📁 We recommend following the default data structure for fast deployment.
./UniHand
    |-- human_video_data
        |-- 2025-0723-07-17-46
        |-- 2025-0723-07-17-52
        |-- 2025-0723-07-17-59
            |-- depth
                |-- 000000.npy
                |-- 000001.npy
                |-- ...
            |-- rgb
                |-- 000000.npy
                |-- 000001.npy
                |-- ...
            |-- 2025-0723-07-17-59_point_cloud.ply
    |-- hand_keypoints # auto generated or downloaded
    |-- hand_trajs # auto generated or downloaded
    |-- vision_features # auto generated or downloaded

  • We have provided the toy dataset (100 human pick-and-place videos) here, which was recorded by a RealSense LiDAR Camera L515. Please feel free to use it. We plan to release more human manipulation videos in the future.
  • The .ply file will be generated by the following scripts automatically.
  • You can collect your own dataset with a single RGB-D camera following the data structure above; a minimal layout check is sketched after this list. Just sit in front of your robot and hit record!
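
Before running the preprocessing scripts on your own recordings, it can help to verify that each sequence folder matches the layout above. A minimal sanity-check sketch, assuming rgb/ and depth/ live inside each dated sequence folder as shown (names follow the toy dataset, not a fixed API):

from pathlib import Path

# Minimal sketch: check that each recorded sequence has paired rgb/depth frames.
# Assumes the toy-dataset layout shown above; adjust for your own data.
root = Path("human_video_data")
for seq in sorted(p for p in root.iterdir() if p.is_dir()):
    rgb = sorted((seq / "rgb").glob("*.npy"))
    depth = sorted((seq / "depth").glob("*.npy"))
    status = "OK" if rgb and len(rgb) == len(depth) else "CHECK"
    print(f"{seq.name}: {len(rgb)} rgb / {len(depth)} depth frames [{status}]")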
📁 Extract and refine 3D hand trajectories.

First, we extract raw 3D hand trajectories from human videos with the help of HaMeR.
cp preprocess_human_video/extract_hand_keypoints.py ./hamer
cd hamer
python extract_hand_keypoints.py \
    --img_folder ../human_video_data \
    --out_folder ../hand_keypoints
cd ../preprocess_human_video
python generate_hand_trajs.py \
    --input_root ../human_video_data \
    --keypoint_root ../hand_keypoints \
    --output_root ../hand_trajs
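
For reference, lifting a detected 2D hand keypoint to 3D from an RGB-D frame comes down to standard pinhole back-projection. The actual procedure lives in generate_hand_trajs.py; the sketch below only illustrates the geometry, with placeholder intrinsics that should be replaced by your camera's calibration:

import numpy as np

# Minimal sketch of pinhole back-projection: 2D keypoint + depth -> 3D point.
# The intrinsics below are placeholders; use your RGB-D camera's calibration.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

def backproject(u, v, depth_m):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# Example: a wrist keypoint at pixel (350, 260) observed 0.85 m from the camera.
print(backproject(350.0, 260.0, 0.85))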

The point cloud files are generated automatically by the script above. Then, we clean the hand trajectories for better training performance

python clean_hand_data.py \
    --gt_paths ../hand_trajs \
    --joint_idx 0

Use --joint_idx to specify which joint you want to visualize.

Alternatively, you can directly use our preprocessed hand motion data here.
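
For a quick look at a cleaned trajectory outside of clean_hand_data.py, a plotting sketch is shown below. It assumes a trajectory is stored as a (T, 3) array of 3D waypoints in a .npy file; the actual files under hand_trajs/ may use a different format, so treat this purely as an illustration:

import numpy as np
import matplotlib.pyplot as plt

# Minimal sketch: plot one 3D hand trajectory with matplotlib.
# Assumes a (T, 3) array of xyz waypoints saved as .npy; the real files
# under hand_trajs/ may be organized differently.
traj = np.load("hand_trajs/example_traj.npy")  # hypothetical file name
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot(traj[:, 0], traj[:, 1], traj[:, 2], marker="o", markersize=2)
ax.set_xlabel("x [m]"); ax.set_ylabel("y [m]"); ax.set_zlabel("z [m]")
plt.show()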

📁 Label hand-object contact and separation timestamps.
  • Contact/separation timestamps are crucial for training Uni-Hand, since trajectory data alone is insufficient to generate the gripper’s grasping actions. Please refer to our paper for more details.

  • We have provided the contact/separation labels of the toy dataset under the unihand/data_utils folder. You can manually label the timings of your own videos and organize them as date_folders.csv (a parsing sketch follows this list).

  • You can also try our recent work, EgoLoc, for autonomous temporal interaction localization.
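
The exact schema of date_folders.csv is defined by the label files shipped in unihand/data_utils; the sketch below only illustrates reading per-sequence contact/separation frame indices, and the column names used here are hypothetical:

import csv

# Minimal sketch: read per-sequence contact/separation timestamps from a CSV.
# Column names (folder, contact_frame, separation_frame) are hypothetical;
# follow the schema of the labels provided in unihand/data_utils.
labels = {}
with open("unihand/data_utils/date_folders.csv", newline="") as f:
    for row in csv.DictReader(f):
        labels[row["folder"]] = (int(row["contact_frame"]), int(row["separation_frame"]))

for folder, (contact, separation) in labels.items():
    print(f"{folder}: contact at frame {contact}, separation at frame {separation}")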

📁 Extract vision features.

Here we use DINOv2 to extract vision features; you can also replace it with any other visual foundation model.
cp preprocess_human_video/extract_visual_features.py ./dinov2
cd dinov2
python extract_visual_features.py \
    --input_root ../human_video_data  \
    --output_root ../vision_features \
    --interval 1

Alternatively, you can directly use our pre-extracted features here. We will release the GLIP version with language instructions in the future.
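
For reference, per-frame DINOv2 features can be extracted roughly as sketched below. The repository's extract_visual_features.py is the actual pipeline; the preprocessing here (resizing to a multiple of the 14-pixel patch size, ImageNet normalization, taking the CLS embedding) reflects common DINOv2 defaults rather than the exact script:

import torch
import torchvision.transforms as T
from PIL import Image

# Minimal sketch: extract a per-frame DINOv2 CLS embedding via torch.hub.
# The repo's extract_visual_features.py is the real pipeline; this only
# illustrates the idea. Requires torch and torchvision.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # 384-dim features
model.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),  # side lengths must be multiples of the 14-pixel patch size
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

image = Image.open("frame_000000.png").convert("RGB")  # hypothetical frame path
with torch.no_grad():
    feature = model(preprocess(image).unsqueeze(0))  # shape: (1, 384) for ViT-S/14
print(feature.shape)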

Run Uni-Hand

👉 Train and evaluate Uni-Hand.

To train Uni-Hand, set evaluate to false in unihand/configs/traineval.yaml, and run the following command

cd unihand
bash run_unihand.sh 
  • After training, you can evaluate the trained model by setting evaluate to true in unihand/configs/traineval.yaml and running the same command. You can set gap to 1 for better results (see the config sketch after this list).

  • You can resume training from, or evaluate, a checkpoint by setting resume in unihand/configs/traineval.yaml. If you evaluate a model trained from scratch, set use_os_weights to false and set resume to a non-existent path.

  • Also, we provide the pretrained model of Uni-Hand here. Please set use_os_weights to true and test it!

  • Uni-Hand is robust to background variations thanks to the use of depth information.
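
Since all of the switches above live in unihand/configs/traineval.yaml, they can also be flipped programmatically. A minimal sketch using PyYAML, assuming the mentioned fields (evaluate, gap, use_os_weights, resume) sit at the top level of the config; the real file contains more options:

import yaml

# Minimal sketch: switch traineval.yaml from training to evaluation mode.
# Assumes the fields below are top-level keys; the real config has more options.
cfg_path = "unihand/configs/traineval.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["evaluate"] = True                  # run evaluation instead of training
cfg["gap"] = 1                          # denser sampling for better results
cfg["use_os_weights"] = False           # evaluating a model trained from scratch
cfg["resume"] = "checkpoints/last.pth"  # hypothetical checkpoint path

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)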

👉 Visualize end-effector trajectories predicted by Uni-Hand.

After evaluation, you can visualize the predicted end-effector trajectories. Since in this demo we predict hand wrist trajectories for robotic manipulation, a heuristic grasp offset is used to convert the predicted wrist waypoints into end-effector trajectories.
cd unihand
python viz_predicted_trajs.py 
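
The heuristic wrist-to-gripper conversion used by the released script is not spelled out here; the sketch below only illustrates the general idea of shifting each predicted wrist waypoint by a fixed offset along an assumed approach direction, with made-up numbers:

import numpy as np

# Minimal sketch: convert wrist waypoints to end-effector waypoints with a
# fixed heuristic offset. The offset and approach direction are illustrative
# assumptions, not the values used by viz_predicted_trajs.py.
def wrist_to_eef(wrist_traj, offset):
    """Shift a (T, 3) wrist trajectory by a constant offset in the camera frame."""
    return wrist_traj + offset

wrist_traj = np.array([[0.10, 0.00, 0.50],
                       [0.12, 0.01, 0.48],
                       [0.15, 0.02, 0.45]])  # toy (T, 3) wrist waypoints in meters
grasp_offset = np.array([0.0, 0.0, -0.10])   # e.g., 10 cm along the approach axis
print(wrist_to_eef(wrist_traj, grasp_offset))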

Download Human Videos and Pretrained Models

📘 We provide our self-recorded videos and model weights.

Deploy on Your Own Robot!

We have demonstrated the deployment of Uni-Hand on real robots. Please refer to our project page and paper for more details.

Let your robot’s end-effector follow the trajectories predicted by Uni-Hand! 🤖

This is the initial version of Uni-Hand, which already provides an out-of-the-box paradigm for human-video-based imitation learning. Once the paper is accepted, we will release a more comprehensive version with multimodal inputs and additional downstream tasks, along with the heuristic grasp assumption.

Cite Our Work

If you find our work helpful to your research, we would appreciate it if you could cite our paper:

@misc{ma2025unihand,
    title={Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views}, 
    author={Junyi Ma and Wentao Bao and Jingyi Xu and Guanzhong Sun and Yu Zheng and Erhang Zhang and Xieyuanli Chen and Hesheng Wang},
    year={2025},
    eprint={2511.12878},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2511.12878}, 
}
@inproceedings{ma2025mmtwin,
    author={Ma, Junyi and Bao, Wentao and Xu, Jingyi and Sun, Guanzhong and Chen, Xieyuanli and Wang, Hesheng},
    booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
    title={Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction},
    year={2025},
    pages={2408-2415},
    doi={10.1109/IROS60139.2025.11246803},
}
@article{ma2025madiff,
    title={MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos}, 
    author={Junyi Ma and Xieyuanli Chen and Wentao Bao and Jingyi Xu and Hesheng Wang}, 
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
    year={2025}, 
}

License

This project is free software made available under the MIT License. For details see the LICENSE file.
