Learning Transferable Visual Models from Natural Language Supervision

Alec Radford, Jong Wook Kim, et al.

2021 · ICML

Problem

Framing

Image classifiers transfer poorly when supervision is restricted to a fixed label set. CLIP closes this gap by jointly training an image encoder and a text encoder on 400M web image-text pairs, then turning class names into zero-shot classifiers through text prompts. Zero-shot ImageNet accuracy reaches 76.2% top-1.

Currently Used Methods

Foundational

@bengioRepresentationLearning2013 — representation learning frames transfer through reusable features.
- Limitation in context: it does not provide language-grounded supervision.
@devlinBERT2018 — large-scale language pre-training from raw text.
- Limitation in context: it is unimodal and cannot align images to text.
@dosovitskiyViT2020 — transformer vision backbone that scales with data.
- Limitation in context: it still depends on closed-set labels.
"VirTex" — image-to-text pre-training with generative caption prediction.
- Limitation in context: caption generation transfers worse than contrastive alignment.
"ConVIRT: Contrastive Learning of Visual Representations from Text" — contrastive image-text pre-training in medical imaging.
- Limitation in context: domain scale is far below web supervision.

Proposed Method

Architecture

CLIP learns a shared 512-d space with an image encoder and a text encoder. The image tower is a modified ResNet or ViT. The text tower is a 12-layer transformer over BPE tokens, used at test time to encode prompted class names.

CLIP pipeline: contrastive image-text pre-training, prompt-based classifier construction, and zero-shot prediction by similarity matching.

Loss / Objective

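The training loss is a symmetric cross-entropy over the batchwise image-text similarity matrix, with logits scaled by $\exp(t)$ for a learned log-temperature parameter $t$. For a batch of $N$ pairs $(\mathbf{x}_i, \mathbf{t}_i)$ with image encoder $f_\theta$ and text encoder $g_\phi$: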

\mathbf{z}_i = \frac{f_\theta(\mathbf{x}_i)}{\|f_\theta(\mathbf{x}_i)\|_2}, \qquad \mathbf{z}'_i = \frac{g_\phi(\mathbf{t}_i)}{\|g_\phi(\mathbf{t}_i)\|_2}

\mathbf{L}_{ij} = \exp(t) \, \mathbf{z}_i^\top \mathbf{z}'_j

\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\left[ -\log \frac{\exp(\mathbf{L}_{ii})}{\sum_{j=1}^{N} \exp(\mathbf{L}_{ij})} -\log \frac{\exp(\mathbf{L}_{ii})}{\sum_{j=1}^{N} \exp(\mathbf{L}_{ji})} \right]
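The objective above can be sketched in plain Python to make the row-wise and column-wise softmax terms explicit. This is a minimal illustration, not the authors' implementation: the function names (`l2_normalize`, `log_softmax`, `clip_loss`) are hypothetical, and the feature vectors stand in for encoder outputs $f_\theta(\mathbf{x}_i)$ and $g_\phi(\mathbf{t}_i)$.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (z_i and z'_i in the equations)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(image_feats, text_feats, t):
    """Symmetric cross-entropy over the N x N matrix L_ij = exp(t) * z_i . z'_j."""
    z_img = [l2_normalize(v) for v in image_feats]
    z_txt = [l2_normalize(v) for v in text_feats]
    n = len(z_img)
    scale = math.exp(t)
    logits = [
        [scale * sum(a * b for a, b in zip(z_img[i], z_txt[j])) for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text term: softmax over row i, target is the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n))
    # Text-to-image term: softmax over column i, target is the diagonal entry.
    columns = [[logits[j][i] for j in range(n)] for i in range(n)]
    loss_t2i = -sum(log_softmax(columns[i])[i] for i in range(n))
    return (loss_i2t + loss_t2i) / (2 * n)
```

With correctly paired features the diagonal dominates and the loss approaches zero; shuffling the text features against the images raises it sharply, which is the signal the encoders are trained on.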

Sampling Rule

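Zero-shot classification scores an image against $K$ prompted label texts $\tau_1, \dots, \tau_K$ by cosine similarity in the shared space, scaled by $\exp(t)$ and normalized with a softmax. The embeddings are L2-normalized exactly as in the training objective.

p(y=i\mid \mathbf{x}) = \frac{\exp\left(\exp(t)\, \mathbf{z}^\top \mathbf{w}_i\right)}{\sum_{j=1}^{K} \exp\left(\exp(t)\, \mathbf{z}^\top \mathbf{w}_j\right)}, \qquad \mathbf{z} = \frac{f_\theta(\mathbf{x})}{\|f_\theta(\mathbf{x})\|_2}, \quad \mathbf{w}_i = \frac{g_\phi(\tau_i)}{\|g_\phi(\tau_i)\|_2}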
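A small sketch of this prediction rule follows, assuming L2-normalized embeddings as in the training objective. The helper names (`build_prompts`, `zero_shot_probs`, `classify`) are hypothetical, and the vectors passed in are placeholders for real encoder outputs, since no encoder is bundled here. The template string is the one quoted in the ablations.

```python
import math

TEMPLATE = "A photo of a {label}."


def build_prompts(class_names, template=TEMPLATE):
    """Turn bare class names into prompted label texts (the tau_i)."""
    return [template.format(label=name) for name in class_names]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_feat, class_text_feats, t):
    """Softmax over exp(t)-scaled cosine similarities to each class embedding."""
    z = l2_normalize(image_feat)
    scale = math.exp(t)
    logits = [
        scale * sum(a * b for a, b in zip(z, l2_normalize(w)))
        for w in class_text_feats
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_feat, class_names, class_text_feats, t):
    """Return the most probable class name and its probability."""
    probs = zero_shot_probs(image_feat, class_text_feats, t)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]
```

In practice the class embeddings would be computed once by running the text encoder over `build_prompts(class_names)` and then reused for every image, which is what makes the text tower act as a classifier generator.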

Training Procedure

Dataset: WIT, 400M image-text pairs.
Query pool: 500,000 text queries.
Max pairs per query: 20,000.
Batch size: 32,768.
Epochs: 32.
Embedding dimension: 512.
Vocabulary size: 49,152 lower-cased BPE tokens as reported in the paper (the released tokenizer uses 49,408 entries).
Models trained: 5 ResNets, 3 ViTs.
Largest ResNet run (RN50x64): 592 V100 GPUs for 18 days.
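The figures above imply a few derived quantities that are easy to check by hand. The arithmetic below is a back-of-the-envelope sketch, assuming exactly 400M pairs and that the listed batch size and epoch count apply to the run in question; the variable names are illustrative only.

```python
DATASET_PAIRS = 400_000_000
BATCH_SIZE = 32_768
EPOCHS = 32
GPUS = 592
DAYS = 18
QUERIES = 500_000
MAX_PAIRS_PER_QUERY = 20_000

# Optimizer steps: total samples seen divided by the batch size.
samples_seen = DATASET_PAIRS * EPOCHS          # 12,800,000,000
total_steps = samples_seen // BATCH_SIZE       # 390,625

# Each example is contrasted against every other pair in its batch.
negatives_per_example = BATCH_SIZE - 1         # 32,767

# Compute budget of the largest run, in GPU-days.
gpu_days = GPUS * DAYS                         # 10,656

# Upper bound on collected pairs if every query hit its cap.
pair_upper_bound = QUERIES * MAX_PAIRS_PER_QUERY  # 10,000,000,000

print(total_steps, negatives_per_example, gpu_days, pair_upper_bound)
```

The per-query cap bounds the pool at 10 billion pairs, well above the 400M actually collected, so most queries evidently return far fewer than 20,000 pairs.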

Evaluation

Datasets

Pre-training: WIT, 400M image-text pairs.
Zero-shot transfer: 30+ vision datasets.
Classification examples: ImageNet, aYahoo, SUN.
Robustness: 7 natural distribution-shift datasets.
Retrieval: Flickr30k, MS-COCO.

Metrics

Top-1 accuracy.
Average score across 36 datasets.
Retrieval recall.
Robustness gap under shift.
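The metrics listed above can be sketched as follows. These are simplified illustrations with hypothetical function names: `recall_at_k` assumes exactly one correct candidate per query (retrieval benchmarks with several captions per image need a set-membership check instead), and `relative_gap_reduction` encodes one plausible reading of "robustness gap", namely in-distribution accuracy minus accuracy under shift.

```python
def top1_accuracy(scores, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


def recall_at_k(similarity, k):
    """Fraction of queries whose match ranks in the top k.

    similarity[i][j] scores query i against candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += int(i in ranked[:k])
    return hits / len(similarity)


def relative_gap_reduction(baseline_gap, model_gap):
    """Share of the baseline's accuracy drop under shift that the model removes."""
    return 1.0 - model_gap / baseline_gap
```

Under this reading, a model whose accuracy drop under shift is a quarter of the baseline's drop has reduced the gap by 75%.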

Headline results

ImageNet zero-shot: 76.2% top-1.
aYahoo zero-shot: 98.4% top-1.
SUN zero-shot: 58.5% top-1.
Average over 36 datasets: prompt engineering and ensembling add almost 5 points.
Natural distribution shift: robustness gap shrinks by up to 75%.

Results page: Table 1 compares CLIP with Visual N-Grams on aYahoo, ImageNet, and SUN; Figure 4 plots average-score gains from prompt engineering and ensembling across model scale.

Ablations

Bag-of-words prediction vs transformer language-model caption prediction: the bag-of-words baseline is about $3\times$ more compute-efficient at zero-shot ImageNet transfer.
Contrastive objective vs bag-of-words prediction: contrastive training is a further $4\times$ more compute-efficient.
Prompt template choice: "A photo of a {label}." adds 1.3 ImageNet points over bare class names.
Prompt engineering plus ensembling: adds almost 5 average points across 36 datasets.
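Prompt ensembling can be done in embedding space: encode one class under several templates, average the normalized text embeddings, and renormalize, so the ensemble costs the same as a single prompt at prediction time. The sketch below assumes a caller-supplied `encode_text` function. Both `toy_encode_text` and the second and third templates are illustrative stand-ins, not the paper's encoder or its full template list.

```python
import math

TEMPLATES = [
    "A photo of a {label}.",
    "A photo of a big {label}.",
    "A photo of a small {label}.",
]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_encode_text(text):
    """Stand-in encoder (vowel counts plus length); NOT a real text tower."""
    return [float(text.count(c)) for c in "aeiou"] + [float(len(text))]


def ensemble_class_embedding(label, templates, encode_text):
    """Average the unit-norm embeddings of all prompts, then renormalize."""
    embeddings = [
        l2_normalize(encode_text(template.format(label=label)))
        for template in templates
    ]
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    return l2_normalize(mean)


dog_embedding = ensemble_class_embedding("dog", TEMPLATES, toy_encode_text)
```

The resulting vector can replace a single-prompt class embedding in the zero-shot softmax without any other change.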

Method Strengths and Weaknesses

Strengths

Replaces fixed classifier heads with language-specified classes.
Reaches 76.2% zero-shot ImageNet top-1.
Shrinks the robustness gap under natural distribution shift by up to 75%.
Contrastive pre-training beats caption-prediction baselines at matched compute.

Weaknesses

Training needs 400M pairs and very large compute.
Zero-shot accuracy is sensitive to prompt wording.
Class names and templates materially affect results.
Web data introduces social bias and harmful associations.

Suggestions from the authors

Scale data, models, and compute beyond the reported runs.
Improve prompt design and automatic template selection.
Characterize model biases, harms, and deployment risks.
Extend natural-language supervision to broader multimodal settings.
