Learning Transferable Visual Models from Natural Language Supervision

Alec Radford, Jong Wook Kim

2021 · ICML

Problem

Framing

Image classifiers transfer poorly when supervision is a fixed label set. CLIP closes this gap by jointly training an image encoder and a text encoder on 400M web-collected image-text pairs, then turning class names into zero-shot classifiers through text prompts. Zero-shot ImageNet accuracy reaches 76.2% top-1.

Currently Used Methods

Foundational

Proposed Method

Architecture

CLIP learns a shared 512-d space with an image encoder and a text encoder. The image tower is a modified ResNet or ViT. The text tower is a 12-layer transformer over BPE tokens, used at test time to encode prompted class names.

CLIP pipeline: contrastive image-text pre-training, prompt-based classifier construction, and zero-shot prediction by similarity matching.

Loss / Objective

The training loss is a symmetric cross-entropy over batchwise image-text similarities with a learned temperature parameter $t$.

$$\mathbf{z}_i = \frac{f_\theta(\mathbf{x}_i)}{\|f_\theta(\mathbf{x}_i)\|_2}, \qquad \mathbf{z}'_i = \frac{g_\phi(\mathbf{t}_i)}{\|g_\phi(\mathbf{t}_i)\|_2}$$

$$\mathbf{L}_{ij} = \exp(t) \, \mathbf{z}_i^\top \mathbf{z}'_j$$

$$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N}\left[ -\log \frac{\exp(\mathbf{L}_{ii})}{\sum_{j=1}^{N} \exp(\mathbf{L}_{ij})} -\log \frac{\exp(\mathbf{L}_{ii})}{\sum_{j=1}^{N} \exp(\mathbf{L}_{ji})} \right]$$
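The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`l2_normalize`, `log_softmax`, `clip_loss`) are hypothetical, the encoder outputs are assumed to be precomputed lists of floats, and a practical version would use a tensor library with automatic differentiation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_loss(image_feats, text_feats, t):
    """Symmetric cross-entropy over batchwise image-text similarities.

    image_feats[i] and text_feats[i] are the raw encoder outputs of the
    i-th matched pair (assumed precomputed). t is the learned scalar from
    the text, so similarities are multiplied by exp(t).
    """
    z_img = [l2_normalize(v) for v in image_feats]
    z_txt = [l2_normalize(v) for v in text_feats]
    n = len(z_img)
    scale = math.exp(t)
    # logits[i][j] = exp(t) * <z_i, z'_j>
    logits = [
        [scale * sum(a * b for a, b in zip(zi, zj)) for zj in z_txt]
        for zi in z_img
    ]
    total = 0.0
    for i in range(n):
        row = logits[i]                         # image i against every text
        col = [logits[j][i] for j in range(n)]  # text i against every image
        total += -log_softmax(row)[i] - log_softmax(col)[i]
    return total / (2 * n)


# Toy batch of two perfectly aligned pairs (made-up vectors).
feats = [[1.0, 0.0], [0.0, 1.0]]
print(clip_loss(feats, feats, t=0.0))  # approximately 0.3133
```

With the pairs swapped so that each image is matched to the wrong text, the same call returns a larger value, which is the behaviour the contrastive objective relies on.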

Sampling Rule

Zero-shot classification scores an image against prompted label texts.

$$p(y=i\mid \mathbf{x}) = \frac{\exp\left(\exp(t)\, f_\theta(\mathbf{x})^\top g_\phi(\tau_i)\right)}{\sum_{j=1}^{K} \exp\left(\exp(t)\, f_\theta(\mathbf{x})^\top g_\phi(\tau_j)\right)}$$
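As a rough sketch of this prediction rule, the snippet below computes the class probabilities for one image given $K$ prompt embeddings. The function name `zero_shot_probs` and the toy vectors are hypothetical. The embeddings are assumed to be precomputed by the two encoders and already L2-normalized as in the training objective, and no real encoder is called here.

```python
import math


def zero_shot_probs(image_embedding, class_text_embeddings, t):
    """Softmax over exp(t)-scaled image-text similarities.

    image_embedding: embedding of one image (assumed precomputed).
    class_text_embeddings: one embedding per prompted class name.
    t: the learned scalar from training; logits are scaled by exp(t).
    """
    scale = math.exp(t)
    logits = [
        scale * sum(a * b for a, b in zip(image_embedding, g))
        for g in class_text_embeddings
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Made-up unit vectors standing in for one image and three class prompts.
image = [1.0, 0.0]
prompts = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
probs = zero_shot_probs(image, prompts, t=0.0)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, probs)
```

The predicted class is the index whose prompt embedding is most similar to the image embedding; a larger $t$ sharpens the distribution without changing that index.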

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Results page: Table 1 compares CLIP with Visual N-Grams on aYahoo, ImageNet, and SUN; Figure 4 plots average-score gains from prompt engineering and ensembling across model scale.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers