
TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model

Official implementation of the WACV 2026 paper:
TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model

Project page: https://dfki-av.github.io/TalkingPose/

Figure: TalkingPose pipeline overview.


Overview

Diffusion models have recently advanced the realism and generalizability of character-driven animation, enabling high-quality motion synthesis from a single RGB image and driving poses. However, generating temporally coherent long-form content remains challenging: many existing methods are trained on short clips due to computational and memory constraints, limiting their ability to maintain consistency over extended sequences.

We propose TalkingPose, a diffusion-based framework designed for long-form, temporally consistent upper-body human animation. TalkingPose uses driving frames to capture expressive facial and hand motion and transfers them to a target identity through a Stable Diffusion backbone. To improve temporal consistency without additional training stages or computational overhead, we introduce a feedback-guided mechanism built upon image-based diffusion models. This design enables generation with unbounded duration. In addition, we introduce a large-scale dataset to support benchmarking for upper-body human animation.
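
As a loose illustration only (this is not the algorithm from the paper), the sketch below shows the general idea of chaining fixed-length chunks through a feedback signal so that the total length is unbounded; generate_chunk is a hypothetical stand-in for the diffusion backbone.

# Illustrative toy only: chunked generation with a feedback frame.
# generate_chunk is a hypothetical placeholder, not the TalkingPose model.
import numpy as np

def generate_chunk(ref_image, pose_chunk, feedback_frame):
    # Placeholder "model": blends the reference image with the feedback frame.
    base = ref_image if feedback_frame is None else 0.5 * (ref_image + feedback_frame)
    return np.stack([base for _ in pose_chunk])  # (chunk_len, H, W, 3)

def generate_long_video(ref_image, driving_poses, chunk_len=16):
    frames, feedback = [], None
    for start in range(0, len(driving_poses), chunk_len):
        chunk = generate_chunk(ref_image, driving_poses[start:start + chunk_len], feedback)
        feedback = chunk[-1]  # last generated frame conditions the next chunk
        frames.append(chunk)
    return np.concatenate(frames, axis=0)

ref = np.zeros((256, 256, 3), dtype=np.float32)
poses = [None] * 100  # stand-in for driving pose frames
print(generate_long_video(ref, poses).shape)  # (100, 256, 256, 3)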


Getting Started

Prerequisites

  • Python >= 3.10
  • CUDA 11.7

Installation

1. Clone the repository

git clone https://github.com/dfki-av/TalkingPose.git
cd TalkingPose

2. Create and activate a virtual environment

python -m venv tk_pose_venv
source tk_pose_venv/bin/activate

3. Install dependencies

pip install --index-url https://download.pytorch.org/whl/cu117 \
  torch==2.0.1+cu117 torchvision==0.15.2+cu117 torchaudio==2.0.2+cu117
pip install -r requirements.txt
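
To confirm that the CUDA build of PyTorch installed correctly, a quick check can be run before continuing:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

On a machine with a working GPU driver this should print 2.0.1+cu117 and True.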

4. Download pre-trained checkpoints

python tools/download_weights.py

Pose Extraction

Extract DWPose (required for training and inference)

python tools/extract_dwpose_from_vid.py \
  --video_root /path/to/mp4_videos \
  --save_dir /path/to/save_dwpose

Extract metadata for training

python tools/extract_meta_info.py \
  --video_root /path/to/videos \
  --dwpose_root /path/to/dwpose_output \
  --dataset_name <your_dataset_name> \
  --out_dir /path/to/output_meta_json

Training

After pose extraction and metadata generation, update the training configuration to specify:

  • metadata JSON paths
  • checkpoint paths
  • output directory

Then run:

python train.py --config configs/train/training.yaml
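
The keys inside training.yaml are defined by this repository, so the snippet below makes no assumption about them; it is only a generic pre-flight check that walks the loaded config and flags path-like string values that do not exist on disk.

# Generic pre-flight check for the training config (no repo-specific keys assumed).
import os
import yaml

with open("configs/train/training.yaml") as f:
    cfg = yaml.safe_load(f)

def check_paths(node, prefix=""):
    # Recursively visit config values and report strings that look like paths.
    if isinstance(node, dict):
        for key, value in node.items():
            check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, (list, tuple)):
        for i, value in enumerate(node):
            check_paths(value, f"{prefix}{i}.")
    elif isinstance(node, str) and ("/" in node or node.endswith((".json", ".ckpt", ".pth", ".safetensors"))):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{status:7s} {prefix[:-1]} -> {node}")

check_paths(cfg)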

Inference

For self-identity animation, specify the checkpoint path and the video and pose directories in the configuration file.

Note: Video folders and their corresponding pose folders must share the same directory names.

python -m scripts.pose2vid --config configs/prompts/self_identity.yaml
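
Because video folders and pose folders are paired by name, a small check along these lines (video_root and pose_root are placeholders for the directories set in your config) can catch mismatches before running inference:

# Verify that every video folder has a pose folder of the same name, and vice versa.
import os
import sys

video_root = "/path/to/videos"        # placeholder: video directory from your config
pose_root = "/path/to/dwpose_output"  # placeholder: pose directory from your config

videos = {d for d in os.listdir(video_root) if os.path.isdir(os.path.join(video_root, d))}
poses = {d for d in os.listdir(pose_root) if os.path.isdir(os.path.join(pose_root, d))}

missing_poses = sorted(videos - poses)
missing_videos = sorted(poses - videos)
if missing_poses or missing_videos:
    print("videos without poses:", missing_poses)
    print("poses without videos:", missing_videos)
    sys.exit(1)
print(f"{len(videos)} matched video/pose folders")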

Dataset

The dataset/ directory contains video_ids.csv, which lists the YouTube video IDs included in the TalkingPose dataset.

To download the videos, please use yt-dlp:
https://github.com/yt-dlp/yt-dlp
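
A minimal download loop over dataset/video_ids.csv could look like the sketch below; the single-column CSV layout and the downloads/ output directory are assumptions, and yt-dlp must be installed and on PATH.

# Download the listed YouTube videos with yt-dlp.
# Assumes video_ids.csv holds one video ID per row (adjust if there is a header or extra columns).
import csv
import subprocess

with open("dataset/video_ids.csv", newline="") as f:
    video_ids = [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]

for vid in video_ids:
    subprocess.run(
        ["yt-dlp",
         "-o", f"downloads/{vid}.%(ext)s",  # output filename template
         f"https://www.youtube.com/watch?v={vid}"],
        check=False,  # continue even if an individual video is unavailable
    )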


Temporal Jittering Error (TJE) Evaluation

To evaluate generated videos using the average temporal jittering error:

python tools/tje_error.py \
  --real_dir /path/to/real_videos \
  --gen_dir /path/to/generated_videos \
  --delta 2 \
  --out
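
The exact metric is implemented in tools/tje_error.py; purely as an illustration, temporal jitter is commonly estimated from second-order (acceleration-like) frame differences, and a toy version with a frame gap delta might look like this (not the repository's exact formulation):

# Toy temporal-jitter measure: mean second-order frame difference over a clip.
# Illustrative approximation only, not the metric from tools/tje_error.py.
import numpy as np

def temporal_jitter(frames, delta=2):
    # frames: (T, H, W, C) array; delta: frame gap used for the finite differences
    f = frames.astype(np.float32)
    accel = f[2 * delta:] - 2.0 * f[delta:-delta] + f[:-2 * delta]
    return float(np.abs(accel).mean())

frames = np.random.rand(32, 64, 64, 3)
print(temporal_jitter(frames, delta=2))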

Acknowledgements

This repository builds upon, and is inspired by, the following works:


Citation

If you find this work useful, please cite:

@article{javanmardi2025talkingpose,
  title={TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model},
  author={Javanmardi, Alireza and Jaiswal, Pragati and Habtegebrial, Tewodros Amberbir and Millerdurai, Christen and Wang, Shaoxiang and Pagani, Alain and Stricker, Didier},
  journal={arXiv preprint arXiv:2512.00909},
  year={2025}
}

To-do / Release Plan

  • Inference code
  • Pretrained models
  • Training code
  • Training data
  • Annotations (will be released soon)
