Official implementation of the WACV 2026 paper:
TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model
Project page: https://dfki-av.github.io/TalkingPose/
Diffusion models have recently advanced the realism and generalizability of character-driven animation, enabling high-quality motion synthesis from a single RGB image and driving poses. However, generating temporally coherent long-form content remains challenging: many existing methods are trained on short clips due to computational and memory constraints, limiting their ability to maintain consistency over extended sequences.
We propose TalkingPose, a diffusion-based framework designed for long-form, temporally consistent upper-body human animation. TalkingPose uses driving frames to capture expressive facial and hand motion and transfers them to a target identity through a Stable Diffusion backbone. To improve temporal consistency without additional training stages or computational overhead, we introduce a feedback-guided mechanism built upon image-based diffusion models. This design enables generation with unbounded duration. In addition, we introduce a large-scale dataset to support benchmarking for upper-body human animation.
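As a purely illustrative sketch (not the actual TalkingPose architecture), the feedback idea can be pictured as a chunked rollout in which a few frames from the previously generated chunk are fed back as conditioning for the next one; the function `generate_chunk`, the chunk size, and the overlap below are assumptions made only for this example.

```python
# Illustrative feedback-guided rollout over an image diffusion backbone.
# NOT the TalkingPose implementation: generate_chunk is a stand-in for a
# diffusion model conditioned on the reference identity, driving poses,
# and feedback frames from the previous chunk.
import torch

def generate_chunk(reference, poses, feedback):
    # Placeholder backbone: returns one RGB frame per driving pose.
    return torch.rand(len(poses), 3, 512, 512)

def rollout(reference, driving_poses, chunk_size=16, overlap=4):
    frames, feedback = [], None
    for start in range(0, len(driving_poses) - overlap, chunk_size - overlap):
        poses = driving_poses[start:start + chunk_size]
        chunk = generate_chunk(reference, poses, feedback)
        # The last `overlap` frames become the feedback for the next chunk,
        # anchoring each new chunk to what was already generated.
        feedback = chunk[-overlap:]
        frames.append(chunk if start == 0 else chunk[overlap:])
    return torch.cat(frames)

video = rollout(torch.rand(3, 512, 512), [None] * 100)
print(video.shape)  # torch.Size([100, 3, 512, 512])
```

Because each chunk depends only on the reference image, its driving poses, and the feedback frames, such a rollout is not tied to a fixed clip length, which is the property the paper targets.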
- Python >= 3.10
- CUDA 11.7
git clone https://github.com/dfki-av/TalkingPose.git
cd TalkingPose
python -m venv tk_pose_venv
source tk_pose_venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cu117 \
torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117
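An optional quick check that the pinned CUDA 11.7 build of PyTorch is installed and can see a GPU:

```python
# Verify the PyTorch build and GPU visibility before continuing.
import torch

print(torch.__version__)          # expected: 2.0.1+cu117
print(torch.version.cuda)         # expected: 11.7
print(torch.cuda.is_available())  # should print True on a working setup
```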
pip install -r requirements.txt
python tools/download_weights.py
To extract DWPose keypoints from your training videos, run:
python tools/extract_dwpose_from_vid.py \
--video_root /path/to/mp4_videos \
--save_dir /path/to/save_dwpose
Then generate the dataset metadata:
python tools/extract_meta_info.py \
--video_root /path/to/videos \
--dwpose_root /path/to/dwpose_output \
--dataset_name <your_dataset_name> \
--out_dir /path/to/output_meta_json
After pose extraction and metadata generation, update the training configuration to specify (a path sanity-check sketch follows the list):
- metadata JSON paths
- checkpoint paths
- output directory
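Before launching training, an optional sanity check along the following lines can catch broken paths in the edited configuration. It is only a sketch: it assumes PyYAML is available and uses a simple heuristic for what counts as a path, without assuming any particular key names in the config schema.

```python
# Walk the training config and flag string values that look like paths
# but do not exist on disk. Purely a convenience check.
import os
import yaml  # PyYAML, assumed to be installed with the requirements

def check_paths(node, prefix=""):
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            check_paths(value, f"{prefix}{i}.")
    # Heuristic: treat strings with a '/' or a known extension as paths.
    elif isinstance(node, str) and ("/" in node or node.endswith((".json", ".pth", ".ckpt"))):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{status:7s} {prefix.rstrip('.')} -> {node}")

with open("configs/train/training.yaml") as f:
    check_paths(yaml.safe_load(f))
```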
Then run:
python train.py --config configs/train/training.yaml
For self-identity animation, specify the checkpoint path and the video and pose directories in the configuration file.
Note: Video folders and their corresponding pose folders must share the same directory names.
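A quick way to verify that layout before running inference (the two roots below are placeholders and should be replaced with the paths from your configuration):

```python
# Check that every video folder has a pose folder with the same name.
from pathlib import Path

video_root = Path("/path/to/videos")        # placeholder
pose_root = Path("/path/to/dwpose_output")  # placeholder

video_dirs = {p.name for p in video_root.iterdir() if p.is_dir()}
pose_dirs = {p.name for p in pose_root.iterdir() if p.is_dir()}

print("video folders without poses:", sorted(video_dirs - pose_dirs))
print("pose folders without videos:", sorted(pose_dirs - video_dirs))
```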
Then run:
python -m scripts.pose2vid --config configs/prompts/self_identity.yaml
The dataset/ directory contains video_ids.csv, which lists the YouTube video IDs included in the TalkingPose dataset.
To download the videos, please use yt-dlp:
https://github.com/yt-dlp/yt-dlp
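A minimal download sketch that drives the yt-dlp CLI from Python, assuming dataset/video_ids.csv contains one YouTube video ID per row (adjust the CSV handling and yt-dlp options to your needs):

```python
# Download the TalkingPose source videos listed in dataset/video_ids.csv.
import csv
import subprocess

with open("dataset/video_ids.csv", newline="") as f:
    video_ids = [row[0] for row in csv.reader(f) if row]

for vid in video_ids:
    subprocess.run(
        ["yt-dlp", "-o", "videos/%(id)s.%(ext)s",
         f"https://www.youtube.com/watch?v={vid}"],
        check=False,  # continue even if an individual video is unavailable
    )
```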
To evaluate generated videos using the average temporal jittering error:
python tools/tje_error.py \
--real_dir /path/to/real_videos \
--gen_dir /path/to/generated_videos \
--delta 2 \
--out
This repository builds mainly upon and is inspired by the following works:
- Moore-AnimateAnyone: https://github.com/MooreThreads/Moore-AnimateAnyone/tree/master
- DWPose: https://github.com/IDEA-Research/DWPose
This work has been partially supported by the EU projects CORTEX2 (GA No. 101070192) and LUMINOUS (GA No. 101135724).
If you find this work useful, please cite:
@article{javanmardi2025talkingpose,
title={TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model},
author={Javanmardi, Alireza and Jaiswal, Pragati and Habtegebrial, Tewodros Amberbir and Millerdurai, Christen and Wang, Shaoxiang and Pagani, Alain and Stricker, Didier},
journal={arXiv preprint arXiv:2512.00909},
year={2025}
}
Release checklist:
- Inference code
- Pretrained models
- Training code
- Training data
- Annotations (will be released soon)
