CLIP
- Contrastive Language-Image Pre-training (CLIP) is an efficient method of learning from natural language supervision
Authors
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever (OpenAI)
Abstract
- We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
- After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks (a sketch of this follows after this list).
- We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification
- For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on
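
As a concrete picture of what "zero-shot transfer via natural language" means here, below is a minimal PyTorch-style sketch of classification with a CLIP-like model: each class name is turned into a caption, both modalities are embedded into the shared space, and the most similar caption wins. The names `image_encoder`, `text_encoder`, `tokenize`, and the prompt template are placeholders I am assuming, not APIs from the paper.

```python
# Hypothetical sketch of CLIP-style zero-shot classification (not the paper's code).
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    """image: [3, H, W] tensor; class_names: list of strings; encoders/tokenize are assumed."""
    # Turn every class name into a natural-language caption.
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        # Embed captions and the image into the shared multi-modal space and L2-normalize.
        text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)     # [C, d]
        image_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)  # [1, d]
    # Cosine similarity against every caption; the best-matching caption names the class.
    similarity = image_emb @ text_emb.t()                                   # [1, C]
    return class_names[similarity.argmax(dim=-1).item()]
```

This is how the ImageNet comparison above works: the 1,000 class names stand in for classifier weights, so none of the 1.28 million labeled training examples are needed.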
Introduction and Motivating Work
- We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models
Approach
Natural Language Supervision
- Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet
Creating a Sufficiently Large Dataset
- [1] Existing datasets are too small: Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each.
- [2] The large-scale alternatives have sparse metadata: By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality.
- [3] A large-scale dataset for natural language supervision that combines the best of both: A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.
- We search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries.
- We approximately class balance the results by including up to 20,000 (image, text) pairs per query (a rough sketch of this step follows after this list).
- The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
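
Here is a rough sketch of that per-query balancing step, under the assumption that pairs arrive from some crawl and that a hypothetical `matched_query` helper tells us which of the 500,000 queries a text contains; the paper's notes above do not spell out the actual pipeline.

```python
# Hypothetical sketch of the ~20,000-pairs-per-query cap; the crawl source and
# the query-matching helper are assumptions, not details from the paper.
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000

def balance_by_query(candidate_pairs, matched_query):
    """candidate_pairs: iterable of (image_url, text); matched_query(text) -> query string or None."""
    kept, counts = [], defaultdict(int)
    for image_url, text in candidate_pairs:
        query = matched_query(text)              # which of the 500,000 queries the text includes
        if query is None:
            continue                             # text matches no query: pair is not collected
        if counts[query] < MAX_PAIRS_PER_QUERY:  # approximate class balance across queries
            counts[query] += 1
            kept.append((image_url, text))
    return kept
```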
Selecting an Efficient Pre-Training Method
- Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.
- Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
- Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N ×N possible (image, text) pairings across a batch actually occurred
- To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N(N−1) incorrect pairings
- We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP (a PyTorch-style sketch of the same objective follows after this list). To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).
- Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020)
- We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights (huh.. I wonder why they trained from scratch)
- We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space.
- Oh.. also curious why they used a linear rather than a non-linear projection for contrastive learning; interesting. Does the point below mean there is simply no benefit?
- We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods.
- A random square crop from resized images is the only data augmentation used during training.
- Finally, the temperature parameter τ, which controls the range of the logits in the softmax, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.
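
Pulling the bullets above together, below is a rough PyTorch restatement of the core objective (the paper's Figure 3 gives numpy-like pseudocode for the same thing): linear projections into the shared embedding space, L2-normalized embeddings, a learnable log-parameterized temperature, and a symmetric cross entropy over the N×N cosine-similarity matrix. The encoder modules, feature widths, and the 0.07 initial temperature are assumptions for illustration, not details quoted in the notes above.

```python
# Sketch of the CLIP training core; encoders, widths, and the temperature
# initialization (1/0.07) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPCore(nn.Module):
    def __init__(self, image_encoder, text_encoder, d_img, d_txt, d_embed=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Only linear projections into the multi-modal embedding space (no non-linear head).
        self.image_proj = nn.Linear(d_img, d_embed, bias=False)
        self.text_proj = nn.Linear(d_txt, d_embed, bias=False)
        # Temperature is learned directly as a log-parameterized multiplicative scalar.
        self.log_temp = nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

    def forward(self, images, texts):
        # Embed each modality and L2-normalize so dot products are cosine similarities.
        img = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)  # [N, d_embed]
        txt = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)     # [N, d_embed]
        # N x N similarity matrix over all possible (image, text) pairings in the batch.
        logits = img @ txt.t() * self.log_temp.exp()
        # The i-th image and i-th text are the real pair; everything else is a negative.
        labels = torch.arange(images.size(0), device=logits.device)
        loss_images = F.cross_entropy(logits, labels)      # image -> text direction
        loss_texts = F.cross_entropy(logits.t(), labels)   # text -> image direction
        return (loss_images + loss_texts) / 2              # symmetric cross entropy
```

The cross entropy pulls up the diagonal of `logits` (the N real pairs) while normalizing over each row and column, which is exactly "maximize the cosine similarity of the real pairs while minimizing it for the N(N−1) incorrect pairings".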
Choosing and Scaling a Model
Training
Experiments
Zero-Shot Transfer
Hyper Params
