Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

2015

Problem

Framing

Deep networks were hard to optimize because each layer's input distribution drifted as the parameters of earlier layers changed during training. The paper inserts a differentiable mini-batch normalization with learned scale and shift, which permits much larger learning rates and cuts the training steps needed to reach $72.2\%$ ImageNet accuracy from $31.0 \cdot 10^6$ to $13.3 \cdot 10^6$.

Currently Used Methods

Foundational

@heKaimingInit2015 — variance-preserving initialization for deep rectifier networks.
- Limitation in context: initialization alone cannot stabilize evolving hidden activations.
@srivastavaDropout2014 — stochastic regularization by dropping units during training.
- Limitation in context: improves generalization, not conditioning of layer inputs.
@krizhevskyAlexNet2012 — large-scale CNN training with careful learning-rate tuning.
- Limitation in context: still needs conservative step sizes for stability.
@szegedyGoogLeNet2015 — efficient Inception modules for ImageNet classification.
- Limitation in context: optimization remains sensitive without activation normalization.
@rumelhartLearningRepresentationsBackpropagating1986 — end-to-end gradient learning for multilayer networks.
- Limitation in context: shifting activations still drive saturation and poor conditioning.

Proposed Method

Architecture

BatchNorm wraps an activation $x$ with mini-batch standardization, then restores representation power with learned $\gamma$ and $\beta$. In convolutional layers, one pair of moments is shared across all spatial positions in a feature map.
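The convolutional case can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function name `conv_batch_norm`, the nested-list layout `[N][C][H][W]`, and the value `eps=1e-5` are assumptions made for the example. It pools every value of a feature map across the mini-batch and all spatial positions before computing a single mean and variance for that map.

```python
import math

def conv_batch_norm(x, gammas, betas, eps=1e-5):
    """x is a nested list shaped [N][C][H][W]; one (gamma, beta) pair per feature map.

    eps=1e-5 is an assumed value; the note does not state one.
    """
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    out = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(n)]
    for k in range(c):
        # Pool every value of feature map k across the batch and all positions.
        vals = [x[i][k][r][s] for i in range(n) for r in range(h) for s in range(w)]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        scale = gammas[k] / math.sqrt(var + eps)
        for i in range(n):
            for r in range(h):
                for s in range(w):
                    out[i][k][r][s] = scale * (x[i][k][r][s] - mu) + betas[k]
    return out

# Two images, two feature maps, a 1x2 spatial grid.
x = [[[[1.0, 3.0]], [[10.0, 10.0]]],
     [[[5.0, 7.0]], [[20.0, 20.0]]]]
y = conv_batch_norm(x, gammas=[1.0, 1.0], betas=[0.0, 0.0])
```

With a batch of size $m$ and feature maps of size $p \times q$, each feature map is therefore normalized with statistics from $m \cdot p \cdot q$ values rather than $m$.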

Algorithm 1: the batch-normalizing transform computes mini-batch mean, mini-batch variance, normalization, then learned scale and shift.

Loss / Objective

The task loss is unchanged; the method reparameterizes intermediate activations.

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i-\mu_B)^2$$

$$\hat{x}_i = \frac{x_i-\mu_B}{\sqrt{\sigma_B^2+\epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$
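The two equations above translate almost line for line into code. The sketch below handles one scalar activation over a mini-batch. The function name `batch_norm_train` and the default `eps=1e-5` are assumptions for illustration, since the note does not give a value for $\epsilon$.

```python
import math

def batch_norm_train(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one scalar activation over a mini-batch, then scale and shift.

    eps=1e-5 is an assumed value for the stability constant.
    """
    m = len(xs)
    mu = sum(xs) / m                              # mini-batch mean
    var = sum((x - mu) ** 2 for x in xs) / m      # biased mini-batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    ys = [gamma * xh + beta for xh in x_hat]      # learned scale and shift
    return ys, mu, var

ys, mu, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
```

After the transform, the outputs have mean $\beta$ and standard deviation close to $\gamma$ over the batch, whatever the scale of the inputs was.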

Algorithm

Training uses batch moments; inference replaces them with frozen population estimates.

$$\mathbb{E}[x] \leftarrow \mathbb{E}_B[\mu_B], \qquad \mathrm{Var}[x] \leftarrow \frac{m}{m-1}\,\mathbb{E}_B[\sigma_B^2]$$

$$y = \frac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}} \cdot x + \left(\beta - \frac{\gamma\,\mathbb{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\right)$$

Algorithm 2: BN is inserted into selected activations during training, then replaced at inference by a fixed affine map using averaged population moments.
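The inference path can be sketched in the same style. The helper names `population_stats` and `fold_to_affine`, the example moments, and `eps=1e-5` are hypothetical choices for illustration. The code averages stored per-batch moments, applies the $\frac{m}{m-1}$ correction to the variance, and folds the result into one multiply and one add.

```python
import math

def population_stats(batch_means, batch_vars, m):
    """Average per-batch moments; apply the m/(m-1) unbiased-variance correction."""
    mean = sum(batch_means) / len(batch_means)
    var = (m / (m - 1)) * sum(batch_vars) / len(batch_vars)
    return mean, var

def fold_to_affine(gamma, beta, mean, var, eps=1e-5):
    """Return (a, b) such that inference-time BN is y = a * x + b.

    eps=1e-5 is an assumed value for the stability constant.
    """
    a = gamma / math.sqrt(var + eps)
    b = beta - a * mean
    return a, b

# Made-up moments from two training mini-batches of size m = 4.
mean, var = population_stats([2.4, 2.6], [1.2, 1.3], m=4)
a, b = fold_to_affine(gamma=2.0, beta=0.5, mean=mean, var=var)
y = a * 3.0 + b  # deterministic output for the input x = 3.0
```

Because `a` and `b` are constants once training ends, the output for a given example no longer depends on which other examples share its batch.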

Training Procedure

Optimizer: SGD
Mini-batch size: $m$
Stability constant: $\epsilon$
Learned BN parameters per normalized activation: $\gamma, \beta$
Inference statistics: averages of $\mu_B$ and $\sigma_B^2$ over training mini-batches
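Since $\gamma$ and $\beta$ are trained by gradient descent along with the other weights, their gradients follow from the chain rule: $\partial \ell / \partial \gamma = \sum_i (\partial \ell / \partial y_i)\,\hat{x}_i$ and $\partial \ell / \partial \beta = \sum_i \partial \ell / \partial y_i$. The sketch below checks the first expression against a finite difference. The squared-error `toy_loss`, the targets, and `eps=1e-5` are assumptions introduced only so that a concrete $\partial \ell / \partial y_i$ exists.

```python
import math

def bn_forward(xs, gamma, beta, eps=1e-5):
    """Training-mode BN for one activation; returns outputs and normalized values."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat], x_hat

def toy_loss(ys, targets):
    # Hypothetical squared-error loss, used only to produce dl/dy.
    return 0.5 * sum((y - t) ** 2 for y, t in zip(ys, targets))

def bn_param_grads(xs, targets, gamma, beta):
    """Analytic gradients of toy_loss with respect to gamma and beta."""
    ys, x_hat = bn_forward(xs, gamma, beta)
    dl_dy = [y - t for y, t in zip(ys, targets)]
    d_gamma = sum(g * xh for g, xh in zip(dl_dy, x_hat))
    d_beta = sum(dl_dy)
    return d_gamma, d_beta

xs, targets = [0.5, -1.0, 2.0, 0.0], [1.0, 0.0, 1.0, 0.0]
d_gamma, d_beta = bn_param_grads(xs, targets, gamma=1.5, beta=0.2)

# Central finite difference in gamma as a sanity check.
h = 1e-6
num_d_gamma = (toy_loss(bn_forward(xs, 1.5 + h, 0.2)[0], targets)
               - toy_loss(bn_forward(xs, 1.5 - h, 0.2)[0], targets)) / (2 * h)
```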

Evaluation

Datasets

MNIST
ImageNet ILSVRC2012 (used to train the Inception-style model variants)

Metrics

Classification accuracy
Top-1 error
Top-5 error
Steps to target validation accuracy

Headline results

ImageNet, Inception baseline: $72.2\%$ max accuracy at $31.0 \cdot 10^6$ steps.
ImageNet, BN-Baseline: reaches $72.2\%$ accuracy at $13.3 \cdot 10^6$ steps; max accuracy $72.7\%$.
ImageNet test set, BN-Inception ensemble: $4.82\%$ top-5 error.
ImageNet, sigmoid Inception variant (BN-x5-Sigmoid): trains successfully with BN, reaching $69.8\%$ max accuracy.
MNIST: hidden activations stay in the non-saturated regime.
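The step counts quoted above imply the following speedup. The variable names are illustrative only.

```python
inception_steps = 31.0e6    # steps for the Inception baseline to reach 72.2%
bn_baseline_steps = 13.3e6  # steps for BN-Baseline to reach 72.2%

# Ratio of training steps needed to reach the same accuracy.
speedup = inception_steps / bn_baseline_steps
print(f"{speedup:.2f}x fewer steps")
```

The ratio is about $2.33$.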

Ablations

Nonlinearity sweep: BN makes sigmoid networks trainable at ImageNet scale.
Learning-rate increase: the BN-x5 and BN-x30 variants train with $5\times$ and $30\times$ the baseline learning rate without diverging.
Architecture swap: BN Inception reaches target accuracy in fewer updates.
Inference statistics: averaged batch moments give a deterministic test transform.

Method Strengths and Weaknesses

Strengths

Cuts the training steps needed to reach $72.2\%$ ImageNet accuracy by more than $2\times$ ($31.0 \cdot 10^6$ to $13.3 \cdot 10^6$).
Preserves expressivity through learned $\gamma$ and $\beta$.
Makes sigmoid-based deep ImageNet models trainable.
Converts stochastic normalization into a fixed inference affine map.

Weaknesses

Depends on reliable mini-batch moment estimates.
Training and inference use different statistics.
Internal covariate shift is asserted more than formally derived.
Small or skewed batches can distort normalization.
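The dependence on batch size can be made concrete with a small simulation. This is a sketch under stated assumptions: activations are drawn from a standard normal distribution, and the function name `batch_mean_spread`, the batch sizes, and the seed are arbitrary. It measures how much the mini-batch mean $\mu_B$ fluctuates from batch to batch.

```python
import random
import statistics

def batch_mean_spread(m, n_batches=2000, seed=0):
    """Standard deviation of the mini-batch mean across many batches of size m."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(m))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)

spread_small = batch_mean_spread(m=2)    # noisy estimate of the true mean 0
spread_large = batch_mean_spread(m=64)   # much tighter estimate
```

For independent samples the spread shrinks roughly as $1/\sqrt{m}$, so the normalization applied to a given example varies far more from batch to batch when $m$ is small.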

Suggestions from the authors

Apply activation normalization beyond image classification.
Study why normalization improves optimization and regularization.
Use BN to train deeper or harder-to-initialize networks.
Test BN with both saturating and non-saturating nonlinearities.

1. Summary

Motivation / Problem

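Internal Covariate Shift
- Updating a layer's parameters changes the input distribution that the following layers receive.
- Shifted inputs can push saturating nonlinearities into regimes where gradients vanish, or let gradients explode.
- This rules out large learning rates and slows optimization.
Whitening the network inputs (zero mean, unit variance, decorrelated) is known to speed up convergence, which motivates normalizing the inputs of each layer as well.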

Prior Work and Its Limitations

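Stabilization of Training
- Careful initialization
- Small learning rates
- Limitation
  - Neither directly keeps internal activation distributions stable during training
Whitening Layer Activations
- Limitation
  - Full whitening is computationally expensive and not everywhere differentiable; if the normalization is computed outside the gradient step, it can cancel the effect of parameter updates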
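One way normalization can undermine gradient learning is shown by a toy case along the lines of the paper's bias example: a layer computes $x = u + b$ and is then centered, $\hat{x} = x - \mathrm{E}[x]$, but the gradient step treats $\mathrm{E}[x]$ as a constant. The sketch below is illustrative only; the data, targets, squared-error loss, and learning rate are made up.

```python
def centered(us, b):
    """x_hat = x - E[x] for x = u + b, with E taken over the given samples."""
    xs = [u + b for u in us]
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

us = [0.2, -0.4, 1.0, 0.6]
targets = [1.0, 1.0, 1.0, 1.0]  # hypothetical targets for a toy squared loss
b, lr = 0.0, 0.1
outputs_before = centered(us, b)

for _ in range(100):
    x_hat = centered(us, b)
    # Gradient w.r.t. x_hat, treating E[x] as a constant (the flawed update).
    dl_dxhat = [xh - t for xh, t in zip(x_hat, targets)]
    b -= lr * sum(dl_dxhat)

outputs_after = centered(us, b)
```

The bias `b` grows by the same amount at every step while the centered outputs, and hence the loss, never change. Batch Normalization avoids this by making the normalization part of the differentiated computation.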

Proposed Method

Batch Normalization
- Normalize each layer input using the mean and variance of the mini-batch.
- Add per-activation parameters $\gamma^{(k)}, \beta^{(k)}$ (scale and shift) and learn them by backpropagation.
- For CNNs, compute one mean and variance per feature map, over the mini-batch and all spatial positions.
- ![[@ioffeBatchNormalizationAccelerating2015_BatchNormalizationTransform.png]] ![[@ioffeBatchNormalizationAccelerating2015_BatchNormalizationTraining.png]]

Hypothesis and Evaluation

Hypothesis
- Batch Normalization stabilizes activation distributions and reduces internal covariate shift.
- Optimization becomes faster and easier.
Evaluation
- MNIST
  - Distribution of hidden activations over the course of training
- ImageNet
  - Training speed and final accuracy

2. Paper Strengths and Weaknesses

Strengths

Simple, general normalization layer for deep networks.
- Drops into existing architectures with little change
Speeds up optimization and also has a regularization effect.

Weaknesses

Effectiveness depends on mini-batch size: with very small batches the moment estimates are noisy and batch norm is less effective.
Training and inference use different computations, which complicates the implementation.
Internal covariate shift lacks a sufficient theoretical explanation.

3. My Opinion

Overall Rating

Strong Accept

Recommendation Justification

Extends input whitening, a long-standing ML practice, to the internal layers of a network.
Delivers both faster optimization and a regularization effect.

Detailed Comments

The activation distributions shown for MNIST suggest that normalization can stabilize the optimization process.
A historically important, foundational study.
