chensy618/02456_Deep_Learning_Project

02456_Deep_Learning_Project

Automated Key Point Description for Vision Transformers using Vision-Language Models

Recent work has shown that features extracted by Vision Transformers (ViTs) trained with self-supervised learning can perform unsupervised key point matching between two images with high precision (https://arxiv.org/abs/2112.05814). However, because the key points are identified in an unsupervised manner, human evaluation is needed to describe the key points that are discovered. The goal of this project is to automatically generate textual descriptions of the key points using recent advances in vision-language modeling (https://proceedings.mlr.press/v139/radford21a/radford21a.pdf). The project focuses on fine-grained classification and has direct links to ongoing research on explainability.
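One way the description step could work is CLIP-style zero-shot scoring: embed an image crop around each discovered key point with CLIP's image encoder, embed a set of candidate part descriptions with its text encoder, and rank the candidates by cosine similarity. The sketch below implements only that scoring step with placeholder NumPy embeddings; the function name `describe_keypoint`, the prompts, and the temperature value are illustrative assumptions, not part of the project code.

```python
# Sketch of CLIP zero-shot scoring for one key point. Assumption: in the real
# pipeline, `patch_embedding` would come from CLIP's image encoder applied to a
# crop around a DINO-ViT key point, and `text_embeddings` from CLIP's text
# encoder applied to candidate part descriptions. Here we use plain NumPy
# vectors so the scoring logic is self-contained.
import numpy as np

def describe_keypoint(patch_embedding, text_embeddings, prompts, temperature=100.0):
    """Rank candidate textual descriptions for a single key point.

    patch_embedding : (d,)   image embedding of the crop around the key point
    text_embeddings : (n, d) text embeddings of the n candidate prompts
    prompts         : list of n candidate strings, e.g. "a photo of a bird's beak"
    """
    # Cosine similarity = dot product of L2-normalised embeddings
    img = patch_embedding / np.linalg.norm(patch_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = temperature * (txt @ img)
    # Softmax over candidates, as in CLIP's zero-shot classification
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    return [(prompts[i], float(probs[i])) for i in order]

# Toy usage: the patch embedding is closest to the first candidate,
# so "a photo of a bird's beak" should be ranked first.
ranked = describe_keypoint(
    np.array([1.0, 0.0, 0.0]),
    np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]),
    ["a photo of a bird's beak", "a photo of a bird's wing"],
)
```

The temperature of 100 mirrors the learned logit scale in the CLIP paper; with real CLIP embeddings one would use the model's own scale instead of a hard-coded constant.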

Data source (CUB-200-2011): https://www.kaggle.com/datasets/wenewone/cub2002011

Important paper links:

- https://proceedings.neurips.cc/paper_files/paper/2019/file/adf7ee2dcf142b0e11888e72b43fcb75-Paper.pdf
- https://openaccess.thecvf.com/content/CVPR2023/papers/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.pdf
- https://arxiv.org/abs/2105.02968
- https://dino-vit-features.github.io/
- https://arxiv.org/abs/2103.00020
