An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov

2020 · ICLR

Problem

Framing

Before ViT, pure Transformers had no competitive recipe for image classification: without the locality bias built into CNNs, they overfit when trained at ImageNet scale. The paper closes this gap by treating an image as a sequence of patches and relying on large-scale supervised pre-training. Pre-trained on JFT-300M, ViT-H/14 reaches 88.55% ImageNet top-1 accuracy.

Currently Used Methods

Foundational

Proposed Method

Architecture

ViT splits an image into non-overlapping $P \times P$ patches, linearly projects each patch to width $D$, prepends a learned class token, and adds learned 1D position embeddings. A standard Transformer encoder then stacks $L$ pre-norm attention and MLP blocks. Main scales are Base $(L{=}12, D{=}768, h{=}12)$, Large $(24, 1024, 16)$, and Huge $(32, 1280, 16)$.
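To make the patch arithmetic concrete, the following is a minimal sketch of how image size and patch size determine the token sequence. The helper names (`VIT_CONFIGS`, `token_shapes`, `head_dim`) are hypothetical, and the 224×224 RGB input used in the usage note is an assumption for illustration, not a value stated in this summary.

```python
# Illustrative only: names are hypothetical; the three configs mirror the
# Base / Large / Huge scales (layers L, width D, heads h) listed above.
VIT_CONFIGS = {
    "Base": {"layers": 12, "width": 768, "heads": 12},
    "Large": {"layers": 24, "width": 1024, "heads": 16},
    "Huge": {"layers": 32, "width": 1280, "heads": 16},
}


def token_shapes(height, width, patch, channels=3):
    """Return (num_patches, flattened_patch_dim, sequence_length).

    Assumes non-overlapping patches, so the image side lengths must be
    divisible by the patch size. channels=3 (RGB) is an assumption.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels  # length of one flattened patch
    return num_patches, patch_dim, num_patches + 1  # +1 for the class token


def head_dim(name):
    """Per-head width D / h for a named configuration."""
    cfg = VIT_CONFIGS[name]
    return cfg["width"] // cfg["heads"]
```

Under the assumed 224×224 input, `token_shapes(224, 224, 16)` gives 196 patches of flattened length 768 and a sequence of 197 tokens, while a patch size of 14 (as in ViT-H/14) gives 256 patches and 257 tokens. Smaller patches therefore lengthen the sequence, which raises the cost of self-attention.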

Architecture diagram: an image is divided into fixed-size patches, each patch is linearly projected, a class token and position embeddings are added, and the resulting sequence is processed by stacked Transformer encoder blocks with multi-head attention and MLP layers.

Loss / Objective

The model applies standard classification to the final class token after patch embedding and encoder updates.

$$\mathbf{z}_0 = [\mathbf{x}_{\mathrm{class}}; \mathbf{x}_p^1 \mathbf{E}; \mathbf{x}_p^2 \mathbf{E}; \ldots; \mathbf{x}_p^N \mathbf{E}] + \mathbf{E}_{\mathrm{pos}}$$

$$\mathbf{z}'_{\ell} = \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{\ell-1})) + \mathbf{z}_{\ell-1}, \qquad \ell = 1,\ldots,L$$

$$\mathbf{z}_{\ell} = \mathrm{MLP}(\mathrm{LN}(\mathbf{z}'_{\ell})) + \mathbf{z}'_{\ell}, \qquad \ell = 1,\ldots,L$$

$$\mathbf{y} = \mathrm{LN}(\mathbf{z}_L^0)$$
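The embedding and encoder equations in this section can be traced step by step in a small NumPy sketch. This is a simplified illustration rather than the authors' implementation: biases and the learned LayerNorm gain and offset are omitted, the GELU uses the tanh approximation, and all function and parameter names (`vit_encode`, `w_qkv`, `w_out`, `w1`, `w2`) are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # LN over the feature axis; learned gain/offset omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def gelu(x):
    # tanh approximation of GELU (an assumption of this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def msa(x, w_qkv, w_out, heads):
    """Multi-head self-attention on a (tokens, D) array, without biases."""
    n, d = x.shape
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (n, D)

    def split_heads(t):
        return t.reshape(n, heads, d // heads).transpose(1, 0, 2)  # (h, n, D/h)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // heads))  # (h, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)  # merge heads
    return out @ w_out


def encoder_block(z, params, heads):
    # z' = MSA(LN(z)) + z
    z_prime = msa(layer_norm(z), params["w_qkv"], params["w_out"], heads) + z
    # z = MLP(LN(z')) + z', with a two-layer MLP
    hidden = gelu(layer_norm(z_prime) @ params["w1"])
    return hidden @ params["w2"] + z_prime


def vit_encode(patches, E, x_class, E_pos, blocks, heads):
    """patches: (N, patch_dim) flattened patches; returns y = LN(z_L^0)."""
    # z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos
    z = np.concatenate([x_class[None, :], patches @ E], axis=0) + E_pos
    for params in blocks:
        z = encoder_block(z, params, heads)
    return layer_norm(z[0])  # index 0 is the class token
```

A toy call with random weights, for example 4 patches of length 12, width 8, and 2 heads, returns a vector of length 8. Because both sublayers are residual, a block whose output projections are all zero returns its input unchanged, which is a convenient sanity check.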

Algorithm

Inference runs the encoder on the patch-token sequence and predicts from the final class token.

$$\hat{c} = \arg\max_k \, \mathrm{Head}(\mathrm{LN}(\mathbf{z}_L^0))_k$$
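The prediction rule can be illustrated with a small pure-Python sketch. It assumes the head is a single linear layer with a bias, which this summary does not specify, and it omits the learned LayerNorm gain and offset. The names `predict`, `head_weights`, and `head_bias` are hypothetical.

```python
import math


def layer_norm(v, eps=1e-6):
    # Normalize one token vector; learned gain/offset omitted.
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def predict(class_token, head_weights, head_bias):
    """Return (predicted class index, logits).

    class_token: final-layer class token z_L^0, a list of D floats.
    head_weights: K rows of D floats (assumed linear head); head_bias: K floats.
    """
    y = layer_norm(class_token)
    logits = [
        sum(w * a for w, a in zip(row, y)) + b
        for row, b in zip(head_weights, head_bias)
    ]
    best = max(range(len(logits)), key=logits.__getitem__)  # arg max over k
    return best, logits
```

Only the class token enters the head; the patch tokens influence the prediction indirectly through attention in the encoder.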

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Results figure

The VTAB comparison shows ViT-H/14 leading on the full 19-task average and on Natural and Structured subsets, while Specialized remains close to BiT-L.

Results figure: grouped bar charts compare VTAB performance of ViT-H/14, BiT-L, VIVI-Ex-100%, and S4L over all 19 tasks and the Natural, Specialized, and Structured task groups.

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers