Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

2016 · IEEE

Problem

Framing

Deeper plain CNNs showed higher training error once depth passed the easy-to-optimize regime. The paper closes this degradation gap by reformulating each block as residual learning, $H(\mathbf{x}) = F(\mathbf{x}) + \mathbf{x}$, which enables 152-layer ImageNet models and a 3.57% top-5 error on the ImageNet test set with an ensemble.

Currently Used Methods

Foundational

@lecunGradientbasedLearningApplied1998 — end-to-end convolutional learning for vision.
- Limitation in context: shallow designs did not expose depth-induced optimization failure.
@krizhevskyAlexNet2012 — large-scale CNN classification breakthrough on ImageNet.
- Limitation in context: depth stayed modest, so degradation beyond tens of layers remained unresolved.
@simonyanVGGVeryDeep2014 — very deep plain nets built from stacked $3 \times 3$ convolutions.
- Limitation in context: deeper plain variants become harder to optimize and more error-prone.
@szegedyGoogLeNet2015 — increased depth and width with inception modules.
- Limitation in context: does not directly fix higher training error in deeper plain counterparts.
@ioffeBatchNormalizationAccelerating2015 — stabilizes optimization with normalization.
- Limitation in context: addresses vanishing and exploding gradients, but degradation from added depth persists.

Proposed Method

Architecture

The network replaces plain stacks with residual blocks that add a shortcut to a learned branch. ResNet-18/34 use two $3 \times 3$ layers per block. ResNet-50/101/152 use a bottleneck $1 \times 1 \rightarrow 3 \times 3 \rightarrow 1 \times 1$ block, with identity shortcuts when dimensions match and projection shortcuts when they do not.

Figure: Residual block variants for ImageNet. Left: the two-layer basic block on $56 \times 56$ feature maps. Right: the bottleneck block with $1 \times 1$, $3 \times 3$, and $1 \times 1$ convolutions plus shortcut addition.
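To make the bottleneck's savings concrete, here is a minimal sketch that counts convolution weights for the two block types. It ignores biases and batch-norm parameters, and it assumes a 64-channel basic block and a 256-channel bottleneck with a 64-channel inner width; the helper names are hypothetical.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights in a k x k convolution (bias and batch norm ignored)."""
    return k * k * c_in * c_out


def basic_block_weights(channels):
    # Two stacked 3x3 convolutions at constant width.
    return 2 * conv_weights(3, channels, channels)


def bottleneck_weights(channels, inner):
    # 1x1 reduce -> 3x3 at the reduced width -> 1x1 restore.
    return (conv_weights(1, channels, inner)
            + conv_weights(3, inner, inner)
            + conv_weights(1, inner, channels))


basic_64 = basic_block_weights(64)            # basic block on 64 channels
basic_256 = basic_block_weights(256)          # basic block on 256 channels
bottleneck_256 = bottleneck_weights(256, 64)  # bottleneck on 256 channels

print(basic_64, bottleneck_256, basic_256)  # 73728 69632 1179648
```

Under these assumptions the 256-channel bottleneck costs about as much as a 64-channel basic block, and roughly 17 times less than a basic block at 256 channels.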

Loss / Objective

The paper keeps the standard classification loss and changes the block parameterization:

$$\mathbf{y} = F(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

For dimension mismatch, the shortcut becomes:

$$\mathbf{y} = F(\mathbf{x}, \{W_i\}) + W_s \mathbf{x}$$
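The two shortcut forms can be sketched on plain vectors. This is a toy illustration, not the paper's convolutional implementation: `matvec`, `residual_output`, the lambda branches, and the matrix `W_s` are all hypothetical stand-ins.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def residual_output(branch, x, W_s=None):
    """Return F(x) + x, or F(x) + W_s x when a projection W_s is given."""
    fx = branch(x)
    shortcut = x if W_s is None else matvec(W_s, x)
    if len(fx) != len(shortcut):
        raise ValueError("branch and shortcut dimensions differ")
    return [f + s for f, s in zip(fx, shortcut)]


# Toy branch that keeps dimension 2: the identity shortcut suffices.
same_dim = residual_output(lambda x: [0.5 * v for v in x], [1.0, 2.0])

# Toy branch mapping 2 -> 3 dimensions: the shortcut needs a 3x2 projection.
W_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
grow = residual_output(lambda x: [x[0], x[1], 0.0], [1.0, 2.0], W_s)

print(same_dim)  # [1.5, 3.0]
print(grow)      # [2.0, 4.0, 3.0]
```

Calling `residual_output` with a dimension-changing branch and no `W_s` raises an error, which mirrors why the identity shortcut alone cannot be used when dimensions differ.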

Algorithm

Each unit computes a residual branch, adds the shortcut, then applies ReLU:

$$\mathbf{x}_{l+1} = \mathrm{ReLU}\left(F(\mathbf{x}_l, W_l) + \mathbf{x}_l\right)$$
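A minimal sketch of one unit follows, assuming a two-layer fully connected branch in place of the paper's convolutions and omitting batch normalization. The function names and weights are hypothetical. It illustrates that when the branch weights are zero, the unit passes a non-negative input through unchanged.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def residual_unit(x, W1, W2):
    """Compute ReLU(F(x) + x) with the toy branch F(x) = W2 * ReLU(W1 * x)."""
    hidden = relu(matvec(W1, x))
    branch = matvec(W2, hidden)
    return relu([b + a for b, a in zip(branch, x)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 3.0]

# With an all-zero branch the unit reduces to ReLU(x), which equals x here
# because x has no negative entries.
out = residual_unit(x, zeros, zeros)
print(out)  # [1.0, 3.0]
```

Because of the final ReLU, the pass-through is exact only for non-negative inputs, which is the case for activations that already went through a ReLU in the previous unit.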

Training Procedure

Optimizer: SGD, momentum $0.9$
Weight decay: $10^{-4}$
Batch size: $256$
Image resize: shorter side sampled in $[256, 480]$
Crop size: $224 \times 224$
Learning rate: $0.1$
LR drops: divide by $10$ when the error plateaus
Training length: up to $60 \times 10^4$ iterations
CIFAR depths: $20, 32, 44, 56, 110, 1202$
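The resize and crop settings above can be sketched as a geometry-only sampler. It is a sketch under assumptions: the function name, the uniform integer sampling of the shorter side, and the uniform crop position are illustrative choices, and no image library is involved.

```python
import random


def sample_crop(width, height, rng, crop=224, min_side=256, max_side=480):
    """Pick a resize target and a crop box for one training image."""
    short = rng.randint(min_side, max_side)      # inclusive on both ends
    scale = short / min(width, height)
    # Guard against float rounding pushing a side below the sampled value.
    new_w = max(short, round(width * scale))
    new_h = max(short, round(height * scale))
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return (new_w, new_h), (left, top, left + crop, top + crop)


rng = random.Random(0)
size, box = sample_crop(500, 375, rng)
print(size, box)
```

The returned box is always a $224 \times 224$ window inside the resized image, since the shorter side is never below $256$.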

Evaluation

Datasets

ImageNet 2012 classification
CIFAR-10 classification
PASCAL VOC 2007/2012 detection
MS COCO detection and segmentation

Metrics

ImageNet: top-1 error, top-5 error
CIFAR-10: test error
Detection and segmentation: mAP

Headline results

ImageNet single-model (10-crop validation), ResNet-34: 25.03% top-1 error, beating plain-34 at 28.54%.
ImageNet single-model (10-crop validation), ResNet-152: 21.43% top-1 error, 5.71% top-5 error.
ImageNet test ensemble: 3.57% top-5 error.
CIFAR-10, ResNet-110: 6.43% test error.
COCO detection: 28% relative improvement in mAP@[.5, .95] from deeper residual features (ResNet-101 replacing VGG-16).
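As a quick arithmetic check on how absolute and relative gains relate, the sketch below uses the plain-34 and ResNet-34 top-1 errors listed above. The helper name is hypothetical. The COCO figure is a relative increase in mAP rather than a relative error reduction, and its underlying mAP values are not listed in these notes, so it is not recomputed here.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


plain34_top1 = 28.54   # top-1 error (%), from the results above
resnet34_top1 = 25.03  # top-1 error (%), from the results above

absolute = plain34_top1 - resnet34_top1
relative = relative_reduction(plain34_top1, resnet34_top1)

print(f"{absolute:.2f} points absolute, {relative:.1%} relative")
# 3.51 points absolute, 12.3% relative
```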

Ablations

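Plain vs residual at 18/34 layers: at 34 layers the residual net cuts both training and validation error (25.03% vs 28.54% top-1); at 18 layers accuracy is comparable (27.88% vs 27.94%), but the residual net converges faster.
Shortcut type for dimension increase: projection shortcuts give slightly better ImageNet accuracy than zero-padding shortcuts, but the gap is small, so projections are not essential.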
CIFAR extreme depth: 1202 layers still optimize, but test error worsens from overfitting.
Bottleneck design: enables 50/101/152-layer models at lower complexity than equally deep plain blocks.

Method Strengths and Weaknesses

Strengths

Directly fixes degradation in training error, not just gradient instability.
Identity shortcuts add negligible compute and no parameters.
Reaches 152 layers with lower complexity than VGG.
Transfers strongly to detection and segmentation benchmarks.

Weaknesses

Gives little theory for why residual parameterization optimizes better.
Extreme depth on CIFAR overfits despite lower training error.
Comparisons focus on plain nets more than other skip designs.
Projection shortcuts add parameters when feature dimensions change.

Suggestions from the authors

Study why plain deep nets fail to realize the constructed identity solution.
Analyze optimization behavior of extremely deep residual networks.
Extend residual learning to more vision tasks.
Test residual principles in non-vision domains.

1. Summary

Motivation / Problem

Deeper CNNs have stronger representational power, but they are harder to optimize and exhibit the degradation problem.
- Degradation problem: both training and validation error increase as depth increases

Prior Work and Its Limitations

VGG / GoogLeNet
- Rely on careful initialization and normalization layers to stabilize training
- Vanishing / exploding gradients are largely addressed
- Limitation
  - Deeper plain nets still suffer from higher training error

Proposed Method

Residual Connection
- Instead of directly learning the target mapping $H(x)$, learn the residual function $F(x) = H(x) - x$
- Implement this idea with shortcut connections.
  - When dimensions match, an identity mapping is enough
  - When dimensions do not match, use zero-padding or a projection
Bottleneck structure
- To build deeper models in a parameter-efficient way, use the bottleneck structure.
- ![[@heDeepResidualLearning2016_Bottleneck.png]]
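The zero-padding option for mismatched dimensions can be sketched on plain vectors. This is a toy illustration that covers only the channel dimension and leaves out the spatial stride. The function names are hypothetical.

```python
def zero_pad_shortcut(x, out_channels):
    """Parameter-free shortcut: keep x and append zero channels."""
    if out_channels < len(x):
        raise ValueError("zero padding can only increase the dimension")
    return list(x) + [0.0] * (out_channels - len(x))


def add_residual(fx, x):
    """Add the shortcut to the branch output, padding x only when needed."""
    shortcut = x if len(fx) == len(x) else zero_pad_shortcut(x, len(fx))
    return [f + s for f, s in zip(fx, shortcut)]


matched = add_residual([0.5, -0.25], [1.0, 2.0])
widened = add_residual([0.5, -0.25, 0.75, 1.0], [1.0, 2.0])

print(matched)  # [1.5, 1.75]
print(widened)  # [1.5, 1.75, 0.75, 1.0]
```

The padded channels receive no shortcut signal, so in those positions the block behaves like a plain layer. The padding itself adds no parameters.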

Hypothesis and Evaluation

Hypothesis
- Learning residual functions is easier than learning unreferenced mappings
Evaluation
- ImageNet / CIFAR-10 / PASCAL VOC / COCO
  - Deep residual nets outperform the plain baselines they are compared against

2. Paper Strengths and Weaknesses

Strengths

Conceptually clean and easy to implement
Parameter efficient
- The bottleneck structure is very parameter efficient, and the residual connection helps it train
- Identity shortcut connections need no parameters
Tackles the degradation problem
Faster convergence during training

Weaknesses

No theoretical proof that residual mappings are easier to optimize

3. My Opinion

Overall Rating

Strong Accept

Recommendation Justification

Key method that directly addresses the degradation problem in deeper networks
Key idea for architecture design

Detailed Comments

Can we remove the residual connections, or optimize them away, after training? Maybe by freezing the other layers?
Can we get a glimpse of what the loss landscape looks like? If we could visualize or interpret it, might better methods emerge?
