Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

2015 · ICCV

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Problem

Framing

Deep rectifier CNNs still fail at depth because Xavier scaling assumes symmetric responses and underestimates variance loss after ReLU. The paper closes this with a learned rectifier, PReLU, and a rectifier-aware initialization rule, then reports 5.71% top-5 single-model error on ImageNet.

Currently Used Methods

Foundational

@krizhevskyAlexNet2012 — deep ReLU CNNs establish the ImageNet training baseline.
- Limitation in context: deeper rectifier stacks still destabilize under naive scaling.
@simonyanVGGVeryDeep2014 — very deep small-filter CNNs push accuracy through depth.
- Limitation in context: fixed initialization makes deeper rectifier optimization brittle.
@ioffeBatchNormalizationAccelerating2015 — normalization stabilizes deep-network optimization.
- Limitation in context: this paper seeks stability without added normalization layers.
@szegedyGoogLeNet2015 — multi-branch CNNs improve ImageNet accuracy and efficiency.
- Limitation in context: it does not derive rectifier-specific initialization.
Understanding the difficulty of training deep feedforward neural networks — Xavier initialization preserves linear-layer variance.
- Limitation in context: it ignores ReLU asymmetry in very deep nets.

Proposed Method

Architecture

The paper keeps standard CNN backbones and changes only the activation. PReLU learns a negative slope $a_i$ per channel or per layer, adding negligible parameters relative to convolution weights. The learned slopes are larger in early layers and smaller in deep layers.

Convergence plot for a 22-layer ImageNet model: the proposed rectifier-aware initialization reduces top-1 error earlier than Xavier, while both eventually converge.

Loss / Objective

The modeling change is the learned rectifier itself.

f(y_i) = \max(0, y_i) + a_i \min(0, y_i)

Algorithm

The initialization preserves variance across rectifier layers.

\frac{1}{2} n_l \operatorname{Var}[w_l] = 1

For PReLU with initial slope $a$ :

\frac{1}{2}(1+a^2) n_l \operatorname{Var}[w_l] = 1

Training Procedure

Dataset: ImageNet 2012, 1000 classes.
Input crop: $224 \times 224$ .
Optimizer: SGD.
Mini-batch: 128.
Momentum: 0.9.
Weight decay: 0.0005.
Learning rates: $10^{-2}$ , $10^{-3}$ , $10^{-4}$ .
Schedule: switch when error plateaus.
Total epochs: about 80.

Evaluation

Datasets

ImageNet 2012 classification.
1000 classes.
About 1.2M train, 50k validation, 100k test.

Metrics

Top-1 error.
Top-5 error.
Single-crop evaluation.
Multi-view or multi-scale testing.

Headline results

ImageNet validation, model C single-model: top-5 5.71%.
ImageNet test, six-model ensemble: top-5 4.94%.
ImageNet validation, model A at scale 256: ReLU 8.25 top-5, PReLU 8.08.
ImageNet validation, model A at scale 384: ReLU 7.26 top-5, PReLU 7.03.
ImageNet validation, model A multi-scale: ReLU 6.51 top-5, PReLU 6.28.

Table 6: ReLU vs PReLU on ImageNet validation for model A at different test scales.

model A scale $s$	ReLU top-1	ReLU top-5	PReLU top-1	PReLU top-5
256	26.25	8.25	25.81	8.08
384	24.77	7.26	24.20	7.03
480	25.46	7.63	24.83	7.39
multi-scale	24.02	6.51	22.97	6.28

Ablations

Activation type: PReLU beats ReLU at every reported test scale.
Slope granularity: channel-wise and channel-shared PReLU perform similarly on the small model.
Initialization: proposed scaling converges for 30-layer rectifier nets where Xavier stalls.
Test scale $s$ : larger scales reduce error for both activations.

Method Strengths and Weaknesses

Strengths

Derives a closed-form initialization matched to rectifier asymmetry.
Shows 30-layer convergence where Xavier completely stalls.
PReLU adds negligible parameters yet improves ImageNet error.
Reaches 5.71% top-5 single-model error on ImageNet.

Weaknesses

Single-model PReLU gains over ReLU are modest.
Best headline test number uses a six-model ensemble.
Training requires 3–4 weeks on 4–8 GPUs.
Validation is concentrated on ImageNet classification.

Suggestions from the authors

Test the initialization in even deeper rectifier networks.
Study why deep layers learn smaller negative slopes.
Extend rectifier-aware initialization to other nonlinearities.
Analyze machine errors against human contextual failures.

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Problem

Framing

Currently Used Methods

Foundational

Proposed Method

Architecture

Loss / Objective

Algorithm

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers