Junyi Ma1, Wentao Bao2, Jingyi Xu1, Guanzhong Sun3, Yu Zheng1, Erhang Zhang1, Xieyuanli Chen4, Hesheng Wang1*
1 Shanghai Jiao Tong University 2 Meta Reality Labs 3 China University of Mining and Technology 4 National University of Defense Technology
[Paper][Project Page][Code][Preliminary Version]
Human Videos are All You Need!
In this repository, we demonstrate how to train Uni-Hand using only human demonstration videos and leverage the trained model to generate end-effector trajectories for robotic manipulation. To evaluate our method on other public datasets (e.g., EgoPAT3D), please refer to our preliminary work.
If any bugs are spotted or any download links are broken, please do not hesitate to make a PR or open an issue.
🔧 Prepare a new conda environment with required dependencies for Uni-Hand. [Click to expand]
First, clone Uni-Hand
git clone https://github.com/IRMVLab/UniHand
cd UniHand
Then, create and activate a new conda environment
conda create -n unihand python=3.10
conda activate unihand
pip install -r requirements.txt
🔧 Clone HaMeR/SAM-3D-Body and DINOv2 for data preprocessing. (optional) [Click to expand]
Since we use HaMeR for hand motion extraction and DINOv2 for vision feature extraction, we need to clone them into this project. We also recommend trying SAM-3D-Body; a related tutorial for Uni-Hand is coming soon.
git clone https://github.com/geopavlakos/hamer.git
# install HaMeR following its instruction
# replace hamer/datasets/vitdet_dataset.py with preprocess_human_video/vitdet_dataset.py in our repo
git clone https://github.com/facebookresearch/dinov2.git
# install DINOv2 following its instruction
- Alternatively, you can directly download our preprocessed data (hand trajectories + vision features) for our toy dataset here.
- Note that you can use vision features generated by any other visual foundation model. Please update `input_dims` of `glip_encoder` in `model.yaml` if the feature vector dimension differs from DINOv2's (a quick check is shown below).
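If you are unsure about the dimension, one quick way to check is to load a saved feature and print its shape. The file path below is a placeholder, not the exact layout produced by our scripts:

```python
# Inspect the dimension of one saved vision feature before editing model.yaml.
# The path is hypothetical; adjust it to your vision_features/ layout.
import numpy as np

feat = np.load("vision_features/2025-0723-07-17-59/000000.npy")
print(feat.shape)  # set input_dims of glip_encoder in model.yaml to match this
```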
📁 We recommend following the default data structure for fast deployment. [Click to expand]
./UniHand
|-- human_video_data
|   |-- 2025-0723-07-17-46
|   |-- 2025-0723-07-17-52
|   |-- 2025-0723-07-17-59
|   |   |-- depth
|   |   |   |-- 000000.npy
|   |   |   |-- 000001.npy
|   |   |   |-- ...
|   |   |-- rgb
|   |   |   |-- 000000.npy
|   |   |   |-- 000001.npy
|   |   |   |-- ...
|   |   |-- 2025-0723-07-17-59_point_cloud.ply
|-- hand_keypoints   # auto-generated or downloaded
|-- hand_trajs       # auto-generated or downloaded
|-- vision_features  # auto-generated or downloaded
- We have provided the toy dataset (100 human pick-and-place videos) here, recorded with a RealSense LiDAR Camera L515. Please feel free to use it. We plan to release more human manipulation videos in the future.
- The `.ply` file will be generated automatically by the following scripts.
- You can collect your own dataset with a single RGBD camera following this data structure (a minimal recording sketch is shown after this list). Just sit in front of your robot and hit record. It's that easy!
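For reference, below is a minimal recording sketch using the `pyrealsense2` SDK that saves RGB and depth frames as `.npy` files in the structure above. The stream settings, frame count, and folder naming are assumptions; adapt them to your camera and task.

```python
# Minimal RGB-D recording sketch (assumes a RealSense camera and pyrealsense2).
import os
import time

import numpy as np
import pyrealsense2 as rs

out = time.strftime("human_video_data/%Y-%m%d-%H-%M-%S")  # e.g. 2025-0723-07-17-46
os.makedirs(os.path.join(out, "rgb"), exist_ok=True)
os.makedirs(os.path.join(out, "depth"), exist_ok=True)

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color)   # default color profile
config.enable_stream(rs.stream.depth)   # default depth profile
pipeline.start(config)
align = rs.align(rs.stream.color)       # align depth to the color frame

try:
    for i in range(300):                # record roughly 10 s at 30 FPS
        frames = align.process(pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        np.save(os.path.join(out, "rgb", f"{i:06d}.npy"), color)
        np.save(os.path.join(out, "depth", f"{i:06d}.npy"), depth)
finally:
    pipeline.stop()
```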
📁 Extract and refine 3D hand trajectories. [Click to expand]
First, we extract raw 3D hand trajectories from human videos with the help of HaMeR
cp preprocess_human_video/extract_hand_keypoints.py ./hamer
cd hamer
python extract_hand_keypoints.py \
--img_folder ../human_video_data \
    --out_folder ../hand_keypoints
cd ../preprocess_human_video
python generate_hand_trajs.py \
--input_root ../human_video_data \
--keypoint_root ../hand_keypoints \
--output_root ../hand_trajs
The point cloud files have been generated automatically. Then, we clean the hand trajectories for better training performance
python clean_hand_data.py \
--gt_paths ../hand_trajs \
--joint_idx 0
You can specify the joint index you want to visualize via `--joint_idx`.
Alternatively, you can directly use our preprocessed hand motion data here.
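For a quick sanity check of the refined trajectories, you can plot one of them. The snippet below assumes the trajectories are stored as NumPy arrays of shape (T, num_joints, 3) and uses a hypothetical file name; adapt it to the actual output of `generate_hand_trajs.py` and `clean_hand_data.py`.

```python
# Plot the wrist (joint 0) of one refined 3D hand trajectory.
# File name and array layout (T, num_joints, 3) are assumptions.
import numpy as np
import matplotlib.pyplot as plt

traj = np.load("hand_trajs/2025-0723-07-17-59.npy")  # hypothetical file
wrist = traj[:, 0, :]                                # joint index 0, as above

ax = plt.figure().add_subplot(projection="3d")
ax.plot(wrist[:, 0], wrist[:, 1], wrist[:, 2])
ax.set_title("Refined wrist trajectory (joint 0)")
plt.show()
```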
📁 Label hand-object contact and separation timestamps. [Click to expand]
- Contact/separation timestamps are crucial for training Uni-Hand, since trajectory data alone is insufficient to generate the gripper's grasping actions. Please refer to our paper for more details.
- We have provided the contact/separation labels of the toy dataset under the `unihand/data_utils` folder. You can manually label the timings of your own videos and organize them as `date_folders.csv` (a loading example follows this list).
- You can also try our recent work, EgoLoc, for autonomous temporal interaction localization.
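If you label your own videos, a quick way to double-check the file is to load it with pandas. The snippet makes no assumption about column names; inspect the shipped `date_folders.csv` for the actual layout before relying on a specific schema.

```python
# Inspect the contact/separation label file before training on your own videos.
import pandas as pd

labels = pd.read_csv("unihand/data_utils/date_folders.csv")
print(labels.columns.tolist())  # check the actual column layout
print(labels.head())            # expected: one row per recorded video folder
```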
📁 Extract vision features. [Click to expand]
Here we use DINOv2 to extract vision features, but you can also replace it with any other visual foundation model
cp preprocess_human_video/extract_visual_features.py ./dinov2
cd dinov2
python extract_visual_features.py \
--input_root ../human_video_data \
--output_root ../vision_features \
--interval 1
Alternatively, you can directly use our pre-extracted features here. We will release the GLIP version with language instruction in the future.
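As a concrete example of swapping in a different backbone, the sketch below uses a torchvision ResNet-50 to produce one global feature per frame (2048-D, so `input_dims` of `glip_encoder` would need to be updated accordingly). The folder names are placeholders, and this is not the script used in the paper.

```python
# Hedged sketch: per-frame features from a torchvision ResNet-50 instead of DINOv2.
import glob
import os

import numpy as np
import torch
from PIL import Image
from torchvision.models import ResNet50_Weights, resnet50

weights = ResNet50_Weights.DEFAULT
backbone = torch.nn.Sequential(*list(resnet50(weights=weights).children())[:-1]).eval()
preprocess = weights.transforms()

frame_dir = "human_video_data/2025-0723-07-17-59/rgb"   # RGB frames stored as .npy
out_dir = "vision_features_resnet/2025-0723-07-17-59"   # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for path in sorted(glob.glob(os.path.join(frame_dir, "*.npy"))):
    img = Image.fromarray(np.load(path).astype(np.uint8))
    with torch.no_grad():
        feat = backbone(preprocess(img).unsqueeze(0)).flatten()  # 2048-dim vector
    np.save(os.path.join(out_dir, os.path.basename(path)), feat.numpy())
```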
👉 Train and evaluate Uni-Hand. [Click to expand]
To train Uni-Hand, set `evaluate` to `false` in `unihand/configs/traineval.yaml`, and run the following command
cd unihand
bash run_unihand.sh
- After training, you can evaluate the trained model by setting `evaluate` to `true` in `unihand/configs/traineval.yaml` and running the same command. You can set `gap` to `1` for better results (see the sketch after this list).
- You can resume training from, or evaluate, a checkpoint by setting `resume` in `unihand/configs/traineval.yaml`. If you evaluate a model trained from scratch, set `use_os_weights` to `false` and set `resume` to a non-existent path.
- We also provide the pretrained model of Uni-Hand here. Please set `use_os_weights` to `true` and test it!
- Uni-Hand is robust to background variations thanks to the use of depth information.
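The config keys above can of course be edited by hand. Purely for illustration, here is a small sketch that toggles them with PyYAML; it assumes they are top-level keys in `traineval.yaml`, which may not match the actual file structure, and it will drop comments from the file.

```python
# Illustrative only: flip Uni-Hand's train/eval switches in traineval.yaml.
# Assumes evaluate/gap/use_os_weights/resume are top-level keys (unverified).
import yaml

cfg_path = "unihand/configs/traineval.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["evaluate"] = True          # switch from training to evaluation
cfg["gap"] = 1                  # denser sampling for better results
cfg["use_os_weights"] = False   # True only when testing the released checkpoint
cfg["resume"] = "checkpoints/unihand_latest.pth"  # hypothetical checkpoint path

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```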
👉 Visualize end-effector trajectories predicted by Uni-Hand. [Click to expand]
After evaluation, you can visualize the end-effector trajectories. Since in this demo we predict hand wrist trajectories for robotic manipulation, a heuristic grasp offset is used to convert the predicted wrist waypoints into end-effector trajectories
cd unihand
python viz_predicted_trajs.py
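For reference, the wrist-to-end-effector conversion amounts to shifting each predicted wrist waypoint by a fixed offset along the gripper approach direction. The sketch below uses a placeholder offset and file name, not the values from `viz_predicted_trajs.py`.

```python
# Hedged sketch of the heuristic wrist-to-gripper conversion.
# Offset value, trajectory file, and array layout (N, 3) are assumptions.
import numpy as np

GRASP_OFFSET = np.array([0.0, 0.0, -0.10])  # e.g. 10 cm along the approach axis (assumed)

def wrist_to_ee(wrist_waypoints: np.ndarray) -> np.ndarray:
    """Shift predicted wrist waypoints (N, 3) by the fixed grasp offset."""
    return wrist_waypoints + GRASP_OFFSET

wrist_traj = np.load("predicted_wrist_traj.npy")  # hypothetical prediction dump
ee_traj = wrist_to_ee(wrist_traj)
```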
📘 We provide our self-recorded videos and model weights. [Click to expand]
- Human videos with annotations [pick-and-place]
- Human videos with annotations [open-door]
- Pretrained model [pick-and-place]
- Pretrained model [open-door]
- We have put the train/test splits and contact/separation labels of the pick-and-place task under the `unihand/data_utils` folder. For the open-door task, please refer to this link.
We have demonstrated the deployment of Uni-Hand on real robots. Please refer to our project page and paper for more details.
Let your robot’s end-effector follow the trajectories predicted by Uni-Hand! 🤖
This is the initial version of Uni-Hand, which already provides an out-of-the-box paradigm for human-video-based imitation learning. We will provide a more comprehensive version with multimodal inputs and additional downstream tasks, and release the heuristic grasp assumption once the paper is accepted.
If you find our work helpful to your research, we would appreciate it if you could cite our paper:
@misc{ma2025unihand,
title={Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views},
author={Junyi Ma and Wentao Bao and Jingyi Xu and Guanzhong Sun and Yu Zheng and Erhang Zhang and Xieyuanli Chen and Hesheng Wang},
year={2025},
eprint={2511.12878},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.12878},
}
@inproceedings{ma2025mmtwin,
author={Ma, Junyi and Bao, Wentao and Xu, Jingyi and Sun, Guanzhong and Chen, Xieyuanli and Wang, Hesheng},
booktitle={2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
title={Novel Diffusion Models for Multimodal 3D Hand Trajectory Prediction},
year={2025},
pages={2408-2415},
doi={10.1109/IROS60139.2025.11246803}
}
@article{ma2025madiff,
title={MADiff: Motion-Aware Mamba Diffusion Models for Hand Trajectory Prediction on Egocentric Videos},
author={Junyi Ma and Xieyuanli Chen and Wentao Bao and Jingyi Xu and Hesheng Wang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2025},
}
This project is free software made available under the MIT License. For details see the LICENSE file.







