Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao

2021 · ICCV

Problem

Framing

ViT-style backbones keep a single token scale and use global self-attention, so transfer to dense prediction tasks is awkward and attention cost grows quadratically with image size. Swin addresses both problems with hierarchical patch merging plus shifted local windows, giving complexity linear in image size and reaching 58.7 box AP and 51.1 mask AP on COCO test-dev and 53.5 mIoU on ADE20K val.
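The quadratic-versus-linear claim can be checked with a small calculation. The sketch below uses the standard cost estimates for global attention and window attention on an $h \times w$ token map with $C$ channels and window size $M$ (softmax cost omitted): $4hwC^2 + 2(hw)^2C$ versus $4hwC^2 + 2M^2hwC$. The function names and the example sizes are illustrative assumptions, not part of the original text.

```python
def msa_flops(h, w, c):
    """Approximate cost of global self-attention on an h x w token map."""
    tokens = h * w
    # Projections are linear in tokens; the attention matrix is quadratic.
    return 4 * tokens * c**2 + 2 * tokens**2 * c


def wmsa_flops(h, w, c, m):
    """Approximate cost of attention restricted to m x m windows."""
    tokens = h * w
    # Each token attends to only m*m tokens, so both terms are linear in tokens.
    return 4 * tokens * c**2 + 2 * m**2 * tokens * c


# Illustrative sizes (assumed): C = 96 channels, window size M = 7.
for side in (56, 112, 224):
    global_cost = msa_flops(side, side, 96)
    window_cost = wmsa_flops(side, side, 96, 7)
    print(f"{side}x{side} tokens: global={global_cost:.3e} "
          f"windowed={window_cost:.3e} ratio={global_cost / window_cost:.1f}")
```

Doubling the side length multiplies the token count by four: the windowed cost grows by exactly that factor, while the global cost grows faster because of its quadratic term.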

Currently Used Methods

Foundational

Proposed Method

Architecture

Swin splits the image into non-overlapping $4 \times 4$ patches, linearly embeds them, then applies four stages separated by patch merging. Swin-T uses widths $96, 192, 384, 768$, block counts $2, 2, 6, 2$, and window size $7 \times 7$. Consecutive blocks alternate W-MSA and SW-MSA.
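To make the hierarchy concrete, the following sketch tabulates the token grid, channel width, and window count per stage for the Swin-T settings above. It assumes a $224 \times 224$ input and assumes each patch-merging step halves the token grid per side (consistent with the widths doubling from stage to stage); `swin_stage_shapes` is a hypothetical helper name.

```python
def swin_stage_shapes(image_size=224, patch_size=4,
                      widths=(96, 192, 384, 768), window=7):
    """Return per-stage token grid side, channel width, and window count."""
    side = image_size // patch_size  # tokens per side after patch embedding
    shapes = []
    for stage, width in enumerate(widths, start=1):
        if stage > 1:
            side //= 2  # assumed: patch merging halves the grid per side
        shapes.append({
            "stage": stage,
            "tokens_per_side": side,
            "channels": width,
            "windows": (side // window) ** 2,
        })
    return shapes


for row in swin_stage_shapes():
    print(row)
```

Under these assumptions the last stage holds a single $7 \times 7$ window, so window attention there covers the whole feature map.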

Figure (architecture diagram): patch partition and linear embedding feed four hierarchical Swin stages with patch merging, plus paired W-MSA and shifted-window MSA (SW-MSA) blocks.

Loss / Objective

For classification, the final-stage feature map is global-average pooled and optimized with cross-entropy.

$$\mathcal{L}_{\mathrm{cls}} = - \sum_{k=1}^{K} y_k \log p_k$$
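A minimal pure-Python sketch of this objective follows: tokens of the final feature map are averaged, passed through a linear head, and scored with the cross-entropy above. The linear head, its weights, and the toy inputs are assumptions for illustration; the text only states that the pooled feature is trained with cross-entropy.

```python
import math


def global_average_pool(tokens):
    """Average a list of C-dimensional token vectors into one C-vector."""
    n = len(tokens)
    return [sum(tok[c] for tok in tokens) / n for c in range(len(tokens[0]))]


def linear_head(feature, weights, bias):
    """Hypothetical classifier head: one weight row per class."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """L_cls = -sum_k y_k log p_k, skipping terms where y_k = 0."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


# Toy example (assumed values): 2 tokens, C = 2 channels, K = 3 classes.
tokens = [[1.0, 2.0], [3.0, 4.0]]
pooled = global_average_pool(tokens)
logits = linear_head(pooled, [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
loss = cross_entropy(softmax(logits), [0, 1, 0])
print(pooled, logits, loss)
```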

Algorithm

Each two-block unit alternates regular and shifted window attention with pre-norm residual updates.

$$
\begin{aligned}
\hat{\mathbf{z}}^{l} &= \mathrm{W\text{-}MSA}(\mathrm{LN}(\mathbf{z}^{l-1})) + \mathbf{z}^{l-1} \\
\mathbf{z}^{l} &= \mathrm{MLP}(\mathrm{LN}(\hat{\mathbf{z}}^{l})) + \hat{\mathbf{z}}^{l} \\
\hat{\mathbf{z}}^{l+1} &= \mathrm{SW\text{-}MSA}(\mathrm{LN}(\mathbf{z}^{l})) + \mathbf{z}^{l} \\
\mathbf{z}^{l+1} &= \mathrm{MLP}(\mathrm{LN}(\hat{\mathbf{z}}^{l+1})) + \hat{\mathbf{z}}^{l+1}
\end{aligned}
$$
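What distinguishes SW-MSA from W-MSA is only how tokens are grouped into windows. The sketch below assigns each token a window index, assuming the shifted configuration displaces the partition by $\lfloor M/2 \rfloor$ tokens in each direction and is realized as a cyclic roll of the feature map. The helper name and the $8 \times 8$ grid are illustrative, and the attention mask that keeps wrapped-around tokens from attending to each other is omitted.

```python
def window_index(row, col, height, width, window, shift=0):
    """Window (row, col) index of a token after a cyclic shift of the map."""
    # Rolling the map by -shift moves token (row, col) to these coordinates.
    rolled_row = (row - shift) % height
    rolled_col = (col - shift) % width
    return (rolled_row // window, rolled_col // window)


# Illustrative setup (assumed): 8 x 8 tokens, window size M = 4, shift = M // 2.
H = W = 8
M = 4
a, b = (3, 3), (4, 4)  # diagonal neighbours on opposite sides of a window border

regular = (window_index(*a, H, W, M), window_index(*b, H, W, M))
shifted = (window_index(*a, H, W, M, shift=M // 2),
           window_index(*b, H, W, M, shift=M // 2))
print("regular:", regular)
print("shifted:", shifted)
```

The two tokens fall in different windows under the regular partition but share one after the shift, which is how alternating the two block types lets information cross window borders.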

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Table 4: Shifted windows and position-bias ablations on ImageNet, COCO, and ADE20K

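| method | ImageNet top-1 | ImageNet top-5 | COCO AP^box | COCO AP^mask | ADE20K mIoU |
|---|---|---|---|---|---|
| w/o shifting | 80.2 | 95.1 | 47.7 | 41.5 | 43.3 |
| shifted windows | 81.3 | 95.6 | 50.5 | 43.7 | 46.1 |
| no pos. | 80.1 | 94.9 | 49.2 | 42.6 | 43.8 |
| abs. pos. | 80.5 | 95.2 | 49.0 | 42.4 | 43.2 |
| abs.+rel. pos. | 81.3 | 95.6 | 50.2 | 43.4 | 44.0 |
| rel. pos. w/o app. | 79.3 | 94.7 | 48.2 | 41.9 | 44.1 |
| rel. pos. | 81.3 | 95.6 | 50.5 | 43.7 | 46.1 |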

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers