Automated Key Point Description for Vision Transformers using Vision-Language Models

Recent work has shown that features extracted by Vision Transformers (ViTs) trained with self-supervised learning can perform unsupervised key point matching between two images with high precision (https://arxiv.org/abs/2112.05814). However, because the key points are identified in an unsupervised manner, human inspection is currently needed to describe the key points that are discovered. The goal of this project is to automatically generate textual descriptions of these key points using recent advances in vision-language modeling (https://proceedings.mlr.press/v139/radford21a/radford21a.pdf). The project will focus on fine-grained classification and has direct links to ongoing research on explainability.
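As a rough illustration of the intended pipeline, the sketch below crops a patch around a discovered key point and ranks a set of candidate part descriptions by CLIP image-text similarity, using the openly available CLIP weights through Hugging Face transformers. The image path, key point coordinates, crop size, and candidate phrases are placeholders for illustration only and are not part of the project brief.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal sketch: describe a key point by zero-shot ranking of candidate phrases with CLIP.
# Assumption: key point coordinates come from an upstream DINO-ViT matching step.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def describe_keypoint(image_path, keypoint_xy, candidate_texts, crop_size=64):
    """Crop a square patch around a key point and rank candidate descriptions by CLIP similarity."""
    image = Image.open(image_path).convert("RGB")
    x, y = keypoint_xy
    half = crop_size // 2
    patch = image.crop((max(x - half, 0), max(y - half, 0), x + half, y + half))

    inputs = processor(text=candidate_texts, images=patch, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_texts); softmax gives a distribution over the candidates.
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    return sorted(zip(candidate_texts, probs.tolist()), key=lambda pair: -pair[1])

if __name__ == "__main__":
    # Hypothetical candidate descriptions for bird parts in CUB-200-2011.
    candidates = ["the beak of a bird", "the wing of a bird", "the eye of a bird", "the tail of a bird"]
    print(describe_keypoint("bird.jpg", keypoint_xy=(120, 80), candidate_texts=candidates))

Ranking a fixed candidate vocabulary keeps the description step zero-shot; generating free-form descriptions instead would require pairing the key points with a captioning model, which is a possible extension of the project.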
Data source: https://www.kaggle.com/datasets/wenewone/cub2002011
Important paper links:
https://proceedings.neurips.cc/paper_files/paper/2019/file/adf7ee2dcf142b0e11888e72b43fcb75-Paper.pdf
https://openaccess.thecvf.com/content/CVPR2023/papers/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.pdf
https://arxiv.org/abs/2105.02968
https://dino-vit-features.github.io/
https://arxiv.org/abs/2103.00020