Deep Residual Learning for Image Recognition
Problem
Framing
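Deeper plain CNNs showed higher training error once depth passed the easy-to-optimize regime, so the accuracy loss could not be blamed on overfitting. The paper closes this degradation gap by reformulating each block to learn a residual, $\mathcal{H}(x) = \mathcal{F}(x) + x$, enabling 152-layer ImageNet models and a 3.57% top-5 error on the ImageNet test set with an ensemble.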
Currently Used Methods
Foundational
- @lecunGradientbasedLearningApplied1998 — end-to-end convolutional learning for vision.
- Limitation in context: shallow designs did not expose depth-induced optimization failure.
- @krizhevskyAlexNet2012 — large-scale CNN classification breakthrough on ImageNet.
- Limitation in context: depth stayed modest, so degradation beyond tens of layers remained unresolved.
- @simonyanVGGVeryDeep2014 — very deep plain stacked convolutional nets.
- Limitation in context: deeper plain variants become harder to optimize and more error-prone.
- @szegedyGoogLeNet2015 — increased depth and width with inception modules.
- Limitation in context: does not directly fix higher training error in deeper plain counterparts.
- @ioffeBatchNormalizationAccelerating2015 — stabilizes optimization with normalization.
- Limitation in context: mitigates vanishing/exploding gradients, but not the degradation that comes with added depth.
Proposed Method
Architecture
The network replaces plain stacks with residual blocks that add a shortcut to a learned branch. ResNet-18/34 use two layers per block. ResNet-50/101/152 use a bottleneck block, with identity shortcuts when dimensions match and projection shortcuts when they do not.
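
As a rough check on the depths named above, the sketch below counts weighted layers from the per-stage block counts in the paper's ImageNet architecture table. The names `RESNET_CONFIGS` and `resnet_depth` are illustrative assumptions, not the paper's code. The count assumes one stem convolution and one final fully connected layer, and it ignores projection shortcuts.

```python
# Per-stage block counts (conv2_x .. conv5_x) and layers per block,
# as listed in the paper's ImageNet architecture table.
RESNET_CONFIGS = {
    18: ([2, 2, 2, 2], 2),    # two-layer basic blocks
    34: ([3, 4, 6, 3], 2),
    50: ([3, 4, 6, 3], 3),    # three-layer bottleneck blocks
    101: ([3, 4, 23, 3], 3),
    152: ([3, 8, 36, 3], 3),
}


def resnet_depth(blocks_per_stage, layers_per_block):
    """Weighted layers: all block layers plus the stem conv and the final FC."""
    return sum(blocks_per_stage) * layers_per_block + 2


for depth, (blocks, layers) in RESNET_CONFIGS.items():
    print(depth, resnet_depth(blocks, layers))
```

Each computed depth matches the number in the model name, which is one way to see that ResNet-34 and ResNet-50 share the same block counts and differ only in block type.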

Loss / Objective
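The paper keeps the standard classification loss and changes only the block parameterization. With input $x$ and a learned residual branch $\mathcal{F}$, a block outputs:
$$y = \mathcal{F}(x, \{W_i\}) + x$$
For dimension mismatch, the shortcut becomes a linear projection $W_s$:
$$y = \mathcal{F}(x, \{W_i\}) + W_s x$$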
Algorithm
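Each unit computes a residual branch, adds the shortcut, then applies ReLU (with $W_s x$ in place of $x$ when a projection shortcut is used):
$$y = \mathrm{ReLU}\big(\mathcal{F}(x, \{W_i\}) + x\big)$$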
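
The unit described above can be sketched in plain Python on a single feature vector. This is a toy illustration under stated assumptions: the residual branch is two fully connected layers rather than convolutions, batch normalization is omitted, and the names `residual_unit`, `matvec`, and `zeros` are hypothetical helpers, not anything from the paper.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def residual_unit(x, W1, W2, Ws=None):
    """y = relu(F(x) + shortcut(x)), with F(x) = W2 . relu(W1 . x).

    Ws=None selects the identity shortcut; otherwise Ws projects x
    to the output dimension.
    """
    branch = matvec(W2, relu(matvec(W1, x)))
    shortcut = x if Ws is None else matvec(Ws, x)
    return relu([b + s for b, s in zip(branch, shortcut)])


x = [1.0, 2.0, 0.5]

# With an all-zero residual branch the unit passes its input through.
print(residual_unit(x, zeros(3, 3), zeros(3, 3)))  # [1.0, 2.0, 0.5]

# Projection shortcut mapping 3 inputs to 2 outputs.
Ws = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(residual_unit(x, zeros(4, 3), zeros(2, 4), Ws))  # [1.0, 2.5]
```

The first call shows the point of the parameterization: driving the branch weights to zero is enough to recover an identity mapping (for non-negative inputs, because of the final ReLU).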
Training Procedure
- Optimizer: SGD, momentum $0.9$
- Weight decay: $10^{-4}$
- Batch size: $256$
- Image resize: shorter side sampled in $[256, 480]$
- Crop size: $224 \times 224$
- Learning rate: $0.1$
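- LR drops: divide by $10$ when the error plateaus (ImageNet); at 32k and 48k iterations (CIFAR-10)
- Training length: up to $60 \times 10^4$ iterations (ImageNet); 64k iterations (CIFAR-10)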
- CIFAR depths: $20, 32, 44, 56, 110, 1202$
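
A step learning-rate schedule of the kind listed above can be sketched as follows. The defaults use the CIFAR-10 settings reported in the paper (start at 0.1, divide by 10 at 32k and 48k iterations, stop at 64k). The function name `step_lr` is an assumption for illustration. The ImageNet runs instead drop the rate when the error plateaus, which this fixed-milestone sketch does not model.

```python
def step_lr(iteration, base_lr=0.1, milestones=(32_000, 48_000), factor=0.1):
    """Return the learning rate in effect at a given training iteration.

    The rate is multiplied by `factor` once for every milestone
    that has already been reached.
    """
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * factor ** drops


for it in (0, 31_999, 32_000, 48_000, 63_999):
    print(it, step_lr(it))
```

Keeping the milestones as a parameter makes it easy to express an epoch-based variant by passing epoch indices instead of iteration counts.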
Evaluation
Datasets
- ImageNet 2012 classification
- CIFAR-10 classification
- PASCAL VOC 2007/2012 detection
- MS COCO detection and segmentation
Metrics
- ImageNet: top-1 error, top-5 error
- CIFAR-10: test error
- Detection and segmentation: mAP
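
For reference, top-k error counts a prediction as correct when the true label is among the k highest-scoring classes. The sketch below is a minimal pure-Python version; `topk_error` and the toy scores are hypothetical and only illustrate the definition.

```python
def topk_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k top-scoring classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Three samples, three classes (toy values).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

print(topk_error(scores, labels, k=1))  # 2 of 3 samples are wrong
print(topk_error(scores, labels, k=2))  # 1 of 3 samples is wrong
```

Top-5 error on ImageNet is the same computation with k=5 over 1000 classes, so it is never higher than top-1 error for the same model.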
Headline results
- ImageNet single-model (10-crop validation), ResNet-34: 25.03% top-1 error, beating plain-34 at 28.54%.
- ImageNet single-model (10-crop validation), ResNet-152: 21.43% top-1 error, 5.71% top-5 error.
- ImageNet test ensemble: 3.57% top-5 error.
- CIFAR-10, ResNet-110: 6.43% error.
- COCO detection: 28% relative improvement in mAP@[.5, .95] when Faster R-CNN replaces VGG-16 with ResNet-101.
Ablations
- Plain vs residual at 18/34 layers: at 34 layers the residual net cuts both training and validation error; at 18 layers accuracy is comparable, but the residual net converges faster.
- Shortcut type for dimension increase: projection improves ImageNet accuracy over zero-padding shortcuts.
- CIFAR extreme depth: 1202 layers still optimize, but test error worsens from overfitting.
- Bottleneck design: enables 50/101/152-layer models at lower complexity than equally deep plain blocks.
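
The two shortcut options compared in the ablation can be sketched at a single spatial position. This is a simplified illustration: the stride-2 subsampling that accompanies a dimension increase is omitted, and the names `zero_pad_shortcut`, `projection_shortcut`, and `projection_params` are hypothetical.

```python
def zero_pad_shortcut(x, out_channels):
    """Zero-padding shortcut: keep the input channels and append zeros (no parameters)."""
    return list(x) + [0.0] * (out_channels - len(x))


def projection_shortcut(x, Ws):
    """Projection shortcut: a learned linear map, i.e. a 1x1 convolution at one pixel."""
    return [sum(w * a for w, a in zip(row, x)) for row in Ws]


def projection_params(in_channels, out_channels):
    """Weight count of a 1x1 projection (bias and normalization ignored)."""
    return in_channels * out_channels


x = [1.0, 2.0]
print(zero_pad_shortcut(x, 4))                             # [1.0, 2.0, 0.0, 0.0]
print(projection_shortcut(x, [[1, 0], [0, 1], [1, 1]]))    # [1.0, 2.0, 3.0]
print(projection_params(64, 128))                          # 8192
```

The zero-padding variant stays parameter-free, which is why the projection's small accuracy gain has to be weighed against the extra weights it adds at every dimension change.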
Method Strengths and Weaknesses
Strengths
- Directly fixes degradation in training error, not just gradient instability.
- Identity shortcuts add negligible compute and no parameters.
- Reaches 152 layers with lower complexity than VGG.
- Transfers strongly to detection and segmentation benchmarks.
Weaknesses
- Gives little theory for why residual parameterization optimizes better.
- Extreme depth on CIFAR overfits despite lower training error.
- Comparisons focus on plain nets more than other skip designs.
- Projection shortcuts add parameters when feature dimensions change.
Suggestions from the authors
- Study why plain deep nets fail to realize the constructed identity solution.
- Analyze optimization behavior of extremely deep residual networks.
- Extend residual learning to more vision tasks.
- Test residual principles in non-vision domains.
Links
Prior Papers
- @heKaimingInit2015 — initialization work that stabilizes deep optimization but does not solve degradation.
- @lecunGradientbasedLearningApplied1998 — early CNN foundation that residual networks deepen far beyond.
- @simonyanVGGVeryDeep2014 — plain very-deep baseline that exposes the optimization gap ResNet closes.
- @srivastavaDropout2014 — regularization baseline for deep nets, orthogonal to residual parameterization.
- @szegedyGoogLeNet2015 — strong pre-ResNet ImageNet architecture with depth and width scaling.
Further Papers
- @huangDenseNet2017 — extends skip-connected feature reuse beyond additive residual summation.
- @heMaskRCNN2017 — uses ResNet backbones to transfer residual features into detection and segmentation.
- @tanEfficientNet2019 — builds on strong residual-style CNN backbones for accuracy-efficiency scaling.
- @dosovitskiyViT2020 — later vision architecture that still retains residual pathways as an optimization primitive.
1. Summary
Motivation / Problem
- Deeper CNNs have stronger representational power, but they are harder to optimize and exhibit a degradation problem.
- Degradation problem: training and validation error both increase as depth increases.
Prior Work and Its Limitations
- VGG / GoogLeNet
- Use initialization / normalization for a better training process
- Largely addresses vanishing / exploding gradients
- Limitation
- Deeper plain nets still suffer from higher training error
Proposed Method
- Residual Connection
- Instead of directly learning the target mapping $\mathcal{H}(x)$, learn the residual function $\mathcal{F}(x) = \mathcal{H}(x) - x$.
- Implement this idea with shortcut connections.
- When dimensions match, an identity mapping is enough.
- When dimensions don't match, use zero-padding or a projection.
- Bottleneck structure
- To build deeper models in a parameter-efficient way, use a bottleneck structure.
- ![[@heDeepResidualLearning2016_Bottleneck.png]]
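
The parameter efficiency of the bottleneck can be checked with simple arithmetic. The sketch below counts convolution weights only (biases and batch normalization ignored) for the two blocks in the paper's bottleneck figure: a 64-channel block of two 3x3 layers, and a 256-channel bottleneck of 1x1, 3x3, 1x1 layers that reduces to 64 channels. The function names are illustrative assumptions.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution."""
    return kernel * kernel * in_channels * out_channels


def basic_block_weights(channels):
    """Two 3x3 convolutions at constant width."""
    return 2 * conv_weights(3, channels, channels)


def bottleneck_weights(channels, reduced):
    """1x1 reduce, 3x3 at the reduced width, 1x1 restore."""
    return (conv_weights(1, channels, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, channels))


print(basic_block_weights(64))      # 73728
print(bottleneck_weights(256, 64))  # 69632
print(basic_block_weights(256))     # 1179648
```

A 256-channel bottleneck therefore costs about as much as a 64-channel basic block, and roughly 17 times less than a basic block operating directly on 256 channels.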
Hypothesis and Evaluation
- Hypothesis
- Residual functions are easier to learn than unreferenced mappings
- Evaluation
- ImageNet / CIFAR-10 / PASCAL VOC / COCO
- Outperforms the plain-network baselines on all of these benchmarks
2. Paper Strengths and Weaknesses
Strengths
- Conceptually clean and easy to implement
- Parameter efficient
- Bottleneck structure is very parameter efficient and the residual connection helps optimization
- Identity shortcut connections need no parameters
- Tackles the degradation problem
- Faster convergence during training
Weaknesses
- No theoretical proof that residual mappings are easier to optimize
3. My Opinion
Overall Rating
- Strong Accept
Recommendation Justification
- Key method that directly solves the degradation problem in deeper networks
- Key reference point for architecture design
Detailed Comments
- Can we remove the residual connections, or optimize them away later, perhaps by freezing the other layers?
- Can we get a glimpse of what the gradient landscape looks like? If we could visualize or interpret it, maybe better methods could be found.