[arXiv], [IEEE Xplore]
Licai Sun, Zheng Lian, Kexin Wang, Yu He, Mingyu Xu, Haiyang Sun, Bin Liu, and Jianhua Tao
University of Chinese Academy of Sciences & Institute of Automation, Chinese Academy of Sciences & Tsinghua University
[2024.09.24] We release the code, along with the pre-trained and fine-tuned models.
[2024.07.28] Our paper has been accepted by IEEE Transactions on Affective Computing.
Encoder architecture (i.e., TPSBT) in SVFAP.
Abstract: Video-based facial affect analysis has recently attracted increasing attention owing to its critical role in human-computer interaction. Previous studies mainly focus on developing various deep learning architectures and training them in a fully supervised manner. Although significant progress has been achieved by these supervised methods, the longstanding lack of large-scale high-quality labeled data severely hinders their further improvements. Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP), to address the dilemma faced by supervised methods. Specifically, SVFAP leverages masked facial video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos. Considering that large spatiotemporal redundancy exists in facial videos, we propose a novel temporal pyramid and spatial bottleneck Transformer as the encoder of SVFAP, which not only largely reduces computational costs but also achieves excellent performance. To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition. Comprehensive results demonstrate that SVFAP can learn powerful affect-related representations via large-scale self-supervised pre-training and it significantly outperforms previous state-of-the-art methods on all datasets.
Comparison with state-of-the-art methods on 9 datasets.
Please check our paper to see detailed results on each dataset.
Main prerequisites:
- Python 3.8
- PyTorch 1.7.1 (CUDA 10.2)
- timm==0.4.12
- einops==0.6.1
- decord==0.6.0
- scikit-learn==1.1.3
- scipy==1.10.1
- pandas==1.5.3
- numpy==1.23.4
- opencv-python==4.7.0.72
- tensorboardX==2.6.1
If some are missing, please refer to environment.yml for more details.
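As a quick sanity check, the snippet below (just a convenience sketch, not part of the codebase) prints the installed versions of the main dependencies:

```python
# Quick environment sanity check: print the versions of the main dependencies.
import cv2
import decord
import einops
import numpy
import pandas
import scipy
import sklearn
import timm
import torch

for name, module in [("PyTorch", torch), ("timm", timm), ("einops", einops),
                     ("decord", decord), ("scikit-learn", sklearn), ("scipy", scipy),
                     ("pandas", pandas), ("numpy", numpy), ("opencv-python", cv2)]:
    print(f"{name}: {module.__version__}")
print("CUDA available:", torch.cuda.is_available())
```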
Please follow the scripts (e.g., dfew.py) in preprocess for data preparation.
Specifically, you need to generate annotation files for the dataloader, where each line follows the format "<path_to_video> <video_class>".
The annotations usually include train.csv, val.csv, and test.csv. Each *.csv file is formatted as follows:
dataset_root/video_1 label_1
dataset_root/video_2 label_2
dataset_root/video_3 label_3
...
dataset_root/video_N label_N
An example of train.csv for DFEW fold 1 (fd1) is shown below:
/data/ycs/AC/Dataset/DFEW/Clip/jpg_256/02522 5
/data/ycs/AC/Dataset/DFEW/Clip/jpg_256/02536 5
/data/ycs/AC/Dataset/DFEW/Clip/jpg_256/02578 6
Note that the labels for the pre-training dataset (i.e., VoxCeleb2) are dummy labels; you can simply use 0 (see voxceleb2.py).
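For reference, the sketch below shows one way such an annotation file could be written. The dataset_root path and the label_map dictionary are placeholders; the scripts in preprocess remain the recommended way to generate the real annotations.

```python
import os

# Hypothetical example: write "<path_to_video> <video_class>" lines for the dataloader.
# dataset_root and label_map are placeholders; adapt them to your own dataset layout.
dataset_root = "/path/to/dataset_root"
label_map = {"video_1": 5, "video_2": 6, "video_3": 0}  # video folder name -> class label

with open("train.csv", "w") as f:
    for video_name, label in sorted(label_map.items()):
        f.write(f"{os.path.join(dataset_root, video_name)} {label}\n")

# For the pre-training data (VoxCeleb2) the label is unused, so a dummy 0 is enough:
#     f.write(f"{os.path.join(dataset_root, video_name)} 0\n")
```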
- VoxCeleb2

  `sh scripts/voxceleb2/pretrain_svfap_base.sh`

  You can download our pre-trained model on VoxCeleb2 from here and put it into this folder.
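If you want to verify the downloaded pre-trained weights before fine-tuning, a quick inspection like the one below should work. The checkpoint file name and the "model" key are assumptions and may differ from the actual file:

```python
import torch

# Load the pre-trained checkpoint on CPU and list a few of its parameter tensors.
# The file name and the "model" key are assumptions; adjust them to the actual checkpoint.
ckpt = torch.load("svfap_pretrain_voxceleb2_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints wrap the weights under a "model" key

for name, tensor in list(state_dict.items())[:10]:
    print(f"{name}: {tuple(tensor.shape)}")
print(f"total parameter tensors: {len(state_dict)}")
```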
- DFEW

  Run `sh scripts/dfew/finetune_svfap_base.sh`.

  The fine-tuned checkpoints and logs across the five folds of DFEW are provided as follows:

  | Fold | UAR | WAR | Fine-tuned Model |
  |:---:|:---:|:---:|:---:|
  | 1 | 63.63 | 75.31 | log / checkpoint |
  | 2 | 58.82 | 71.68 | log / checkpoint |
  | 3 | 64.88 | 74.96 | log / checkpoint |
  | 4 | 63.73 | 74.65 | log / checkpoint |
  | 5 | 68.39 | 77.44 | log / checkpoint |
  | Total (Reproduced) | 63.89 | 74.81 | - |
  | Total (Reported) | 62.83 | 74.27 | - |

  Note that we lost the original checkpoints for this dataset; however, the reproduced results are better than those reported in the paper. A sketch of how UAR and WAR can be computed is given after the fine-tuning results below.
- FERV39k

  Run `sh scripts/ferv39k/finetune_svfap_base.sh`.

  The fine-tuned checkpoint and log on FERV39k are provided as follows:

  | Version | UAR | WAR | Fine-tuned Model |
  |:---:|:---:|:---:|:---:|
  | Reproduced | 43.05 | 52.86 | log / checkpoint |
  | Reported | 42.14 | 52.29 | - |

  Note that we lost the original checkpoint for this dataset; however, the reproduced result is better than that reported in the paper.
- MAFW

  Run `sh scripts/mafw/finetune_svfap_base.sh`.

  The fine-tuned checkpoints and logs across the five folds of MAFW are provided as follows:

  | Fold | UAR | WAR | Fine-tuned Model |
  |:---:|:---:|:---:|:---:|
  | 1 | 38.40 | 49.10 | log / checkpoint |
  | 2 | 40.94 | 53.95 | log / checkpoint |
  | 3 | 46.27 | 59.68 | log / checkpoint |
  | 4 | 47.78 | 61.30 | log / checkpoint |
  | 5 | 44.17 | 57.56 | log / checkpoint |
  | Total (Reproduced) | 43.51 | 56.31 | - |
  | Total (Reported) | 41.19 | 54.28 | - |

  Note that we lost the original checkpoints for this dataset; however, the reproduced results are better than those reported in the paper.
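In the tables above, UAR is the unweighted average recall (the mean of per-class recalls) and WAR is the weighted average recall, which equals the overall accuracy. As a minimal sketch (the prediction and label arrays are placeholders), both metrics can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Placeholder predictions and ground-truth labels gathered over a test set.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

uar = recall_score(y_true, y_pred, average="macro")  # unweighted average recall (mean per-class recall)
war = accuracy_score(y_true, y_pred)                 # weighted average recall (= overall accuracy)
print(f"UAR: {uar * 100:.2f} | WAR: {war * 100:.2f}")
```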
If you have any questions, please feel free to reach out to me at Licai.Sun@oulu.fi.
This project is built upon VideoMAE. Thanks for their great codebase.
If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:
@article{sun2024svfap,
title={SVFAP: Self-supervised video facial affect perceiver},
author={Sun, Licai and Lian, Zheng and Wang, Kexin and He, Yu and Xu, Mingyu and Sun, Haiyang and Liu, Bin and Tao, Jianhua},
journal={IEEE Transactions on Affective Computing},
year={2024},
publisher={IEEE}
}
