Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Problem
Framing
Deep networks were hard to optimize because the distribution of each layer's inputs drifted as the parameters of earlier layers changed during training. The paper inserts a differentiable mini-batch normalization step with a learned scale and shift, which permits much larger learning rates and substantially cuts the number of ImageNet training steps needed to reach a given accuracy.
Currently Used Methods
Foundational
- @heKaimingInit2015 — variance-preserving initialization for deep rectifier networks.
- Limitation in context: initialization alone cannot stabilize evolving hidden activations.
- @srivastavaDropout2014 — stochastic regularization by dropping units during training.
- Limitation in context: improves generalization, not conditioning of layer inputs.
- @krizhevskyAlexNet2012 — large-scale CNN training with careful learning-rate tuning.
- Limitation in context: still needs conservative step sizes for stability.
- @szegedyGoogLeNet2015 — efficient Inception modules for ImageNet classification.
- Limitation in context: optimization remains sensitive without activation normalization.
- @rumelhartLearningRepresentationsBackpropagating1986 — end-to-end gradient learning for multilayer networks.
- Limitation in context: shifting activations still drive saturation and poor conditioning.
Proposed Method
Architecture
BatchNorm standardizes each activation using mini-batch statistics, then restores representational power with a learned scale $\gamma$ and shift $\beta$. In convolutional layers, one pair of moments (mean and variance) is shared across all spatial positions in a feature map.
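As a reference sketch, the transform described above can be written for a single activation $x$ over a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$. The names $m$, $\mu_\mathcal{B}$, $\sigma_\mathcal{B}^2$ and $\hat{x}_i$ are notation assumed here for illustration; $\epsilon$ is the stability constant listed under Training Procedure.

$$
\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2
$$

$$
\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \qquad
y_i = \gamma\,\hat{x}_i + \beta
$$

Setting $\gamma = \sqrt{\sigma_\mathcal{B}^2 + \epsilon}$ and $\beta = \mu_\mathcal{B}$ recovers the original activation, which is why the learned pair preserves what the layer can represent.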

Loss / Objective
The task loss is unchanged; the method reparameterizes intermediate activations.
Algorithm
Training uses batch moments; inference replaces them with frozen population estimates.
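The two modes can be illustrated with a minimal pure-Python sketch for one scalar activation. The class name `BatchNormScalar`, the initial values of `gamma` and `beta`, the value of `eps`, and the use of a plain average over all seen mini-batches are assumptions made for illustration, not details taken from the notes above.

```python
import math


class BatchNormScalar:
    """Batch normalization of one scalar activation (illustrative sketch)."""

    def __init__(self, eps=1e-5):
        self.gamma = 1.0      # learned scale; initial value is an assumption
        self.beta = 0.0       # learned shift; initial value is an assumption
        self.eps = eps        # stability constant; value is an assumption
        self.mean_sum = 0.0   # running sum of mini-batch means
        self.var_sum = 0.0    # running sum of unbiased mini-batch variances
        self.num_batches = 0

    def forward_train(self, xs):
        """Normalize with the moments of the current mini-batch."""
        m = len(xs)
        mean = sum(xs) / m
        var = sum((x - mean) ** 2 for x in xs) / m  # biased batch variance
        # Accumulate population estimates for later use at inference time.
        self.mean_sum += mean
        self.var_sum += var * m / (m - 1) if m > 1 else var
        self.num_batches += 1
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) * inv_std + self.beta for x in xs]

    def forward_infer(self, xs):
        """Apply the fixed affine map built from averaged training moments.

        Assumes forward_train has been called at least once.
        """
        pop_mean = self.mean_sum / self.num_batches
        pop_var = self.var_sum / self.num_batches
        scale = self.gamma / math.sqrt(pop_var + self.eps)
        shift = self.beta - scale * pop_mean
        return [scale * x + shift for x in xs]
```

In training mode the output for one example depends on the other examples in the batch; in inference mode each output depends only on its own input, so the transform is deterministic.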

Training Procedure
- Optimizer: SGD
- Mini-batch size:
- Stability constant:
- Learned BN parameters per normalized activation: scale $\gamma$ and shift $\beta$
- Inference statistics: averages of the mini-batch mean and variance over training mini-batches
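For the inference statistics in the last item, a sketch of how the averaged moments turn into a fixed transform, assuming mini-batches of size $m$ and writing $\mu_\mathcal{B}$, $\sigma_\mathcal{B}^2$ for the per-batch mean and (biased) variance. The factor $m/(m-1)$ is the usual unbiased-variance correction.

$$
\mathrm{E}[x] = \mathrm{E}_\mathcal{B}[\mu_\mathcal{B}], \qquad
\mathrm{Var}[x] = \frac{m}{m-1}\,\mathrm{E}_\mathcal{B}[\sigma_\mathcal{B}^2]
$$

$$
y = \frac{\gamma}{\sqrt{\mathrm{Var}[x] + \epsilon}}\,x + \left(\beta - \frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}\right)
$$

Because every quantity on the right is frozen after training, the normalization at inference is a plain per-activation affine map.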
Evaluation
Datasets
- MNIST
- ImageNet ILSVRC2012
- Inception-style ImageNet model variants
Metrics
- Classification accuracy
- Top-1 error
- Top-5 error
- Steps to target validation accuracy
Headline results
- ImageNet, Inception baseline: max accuracy at steps.
- ImageNet, BN-Baseline: max accuracy at steps.
- ImageNet test set, BN-Inception ensemble: top-5 error.
- ImageNet, sigmoid Inception variant: trains successfully with BN.
- MNIST: hidden activations stay in the non-saturated regime.
Ablations
- Nonlinearity sweep: BN makes sigmoid networks trainable at ImageNet scale.
- Learning-rate increase: BN tolerates much larger steps without divergence.
- Architecture swap: BN Inception reaches target accuracy in fewer updates.
- Inference statistics: averaged batch moments give a deterministic test transform.
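On the learning-rate item above, one argument that can be given is that a normalized layer is insensitive to the scale of its weights. A sketch, assuming a layer computing $\mathrm{BN}(Wu)$, a scalar $a > 0$, and neglecting the stability constant:

$$
\mathrm{BN}(Wu) = \mathrm{BN}((aW)u), \qquad
\frac{\partial\,\mathrm{BN}((aW)u)}{\partial u} = \frac{\partial\,\mathrm{BN}(Wu)}{\partial u}, \qquad
\frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W}
$$

Under these assumptions, rescaling the weights leaves the backpropagated signal unchanged, and larger weights receive smaller gradients, which plausibly limits runaway parameter growth at large step sizes.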
Method Strengths and Weaknesses
Strengths
- Substantially cuts the number of ImageNet steps needed to reach the baseline's accuracy.
- Preserves expressivity through the learned scale $\gamma$ and shift $\beta$.
- Makes sigmoid-based deep ImageNet models trainable.
- Converts stochastic normalization into a fixed inference affine map.
Weaknesses
- Depends on reliable mini-batch moment estimates.
- Training and inference use different statistics.
- Internal covariate shift is asserted more than formally derived.
- Small or skewed batches can distort normalization.
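The batch-size weaknesses above can be made concrete with a small simulation of how noisy the mini-batch mean is at different batch sizes. This is a sketch under stated assumptions: activations are drawn from a standard normal distribution, and the function name `batch_mean_spread`, the trial count, and the seed are all hypothetical choices for illustration.

```python
import random
import statistics


def batch_mean_spread(batch_size, trials=500, seed=0):
    """Standard deviation of the mini-batch mean across many sampled batches.

    Activations are drawn from a standard normal distribution; this is an
    illustrative assumption, not a setting taken from the paper.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)


# Print the spread of the estimated mean for a few batch sizes.
for m in (2, 8, 32, 128):
    print(m, round(batch_mean_spread(m), 3))
```

The spread shrinks roughly as one over the square root of the batch size, so very small batches normalize with a noticeably noisier mean (and variance) than the population statistics used at inference.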
Suggestions from the authors
- Apply activation normalization beyond image classification.
- Study why normalization improves optimization and regularization.
- Use BN to train deeper or harder-to-initialize networks.
- Test BN with both saturating and non-saturating nonlinearities.
Links
Prior Papers
- @krizhevskyAlexNet2012 — establishes large-scale CNN optimization constraints that BN relaxes.
- @szegedyGoogLeNet2015 — provides the Inception family that BN upgrades in the main experiments.
- @heKaimingInit2015 — tackles optimization through initialization rather than activation normalization.
- @srivastavaDropout2014 — offers regularization that BN partly complements and partly overlaps.
Further Papers
- @baLayerNormalization2016 — removes batch dependence by normalizing within each example.
- @ulyanovInstanceNormalizationMissing2017 — replaces batch statistics with per-instance normalization for style transfer.
- @wuGroupNormalization2018 — restores stable normalization when batch sizes are small.
1. Summary
Motivation / Problem
- Internal Covariate Shift
- Updating a layer's parameters changes the input distribution that the following layers receive
- Gradients explode or vanish
- Large learning rates cannot be used, so optimization is slow
- Whitening network inputs is known to speed up convergence
Prior Work and Its Limitations
- Stabilization of Training
- Careful Initialization
- Small Learning Rates
- Limitation
- Cannot directly keep internal activations stable
- Whitening layer activations
- Limitation
- Computationally expensive, and gradient descent can undo the normalization when it is computed outside the gradient step
Proposed Method
- Batch Normalization
- Normalize each layer input using the mean and variance of the mini-batch
- Add learnable scale and shift parameters ($\gamma$, $\beta$), trained by backpropagation
- For CNNs, compute the mean and variance per feature map, over the mini-batch and all spatial locations
- ![[@ioffeBatchNormalizationAccelerating2015_BatchNormalizationTransform.png]] ![[@ioffeBatchNormalizationAccelerating2015_BatchNormalizationTraining.png]]
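For the convolutional case, a minimal pure-Python sketch of per-feature-map normalization. The function name `batchnorm_conv`, the nested-list layout `[N][C][H][W]`, and the default `eps` are assumptions for illustration; a real implementation would use an array library.

```python
import math


def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for convolutional activations.

    x: nested list with shape [N][C][H][W] (layout is an assumption).
    gamma, beta: lists of length C, one scale and shift per feature map.
    """
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    count = n * h * w  # effective mini-batch size for each feature map
    out = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(n)]
    for ch in range(c):
        # Pool statistics over the batch and every spatial position.
        values = [x[i][ch][r][s]
                  for i in range(n) for r in range(h) for s in range(w)]
        mean = sum(values) / count
        var = sum((v - mean) ** 2 for v in values) / count
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            for r in range(h):
                for s in range(w):
                    centered = x[i][ch][r][s] - mean
                    out[i][ch][r][s] = gamma[ch] * centered * inv_std + beta[ch]
    return out
```

Sharing one mean and variance per feature map keeps the normalization consistent with the convolution itself: every location of a map is treated the same way.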
Hypothesis and Evaluation
- Hypothesis
- Batch Normalization will stabilize activation distributions and reduce internal covariate shift.
- Optimization becomes faster and easier
- Evaluation
- MNIST
- hidden activation distribution
- ImageNet
- training process and performance test
2. Paper Strengths and Weaknesses
Strengths
- Simple, general normalization layer for deep learning.
- Drops seamlessly into existing neural architectures
- Not only speeds up optimization but also has a regularizing effect.
Weaknesses
- Effectiveness depends on mini-batch size. If the mini-batch is too small, the batch statistics are noisy and batch norm is not effective.
- Training and inference behave differently, which complicates the implementation
- Insufficient theoretical explanation of internal covariate shift
3. My Opinion
Overall Rating
- Strong Accept
Recommendation Justification
- Extends whitening, a classical ML preprocessing step, from the input data to the internal activations of neural networks.
- Faster optimization plus a regularization effect
Detailed Comments
- The activation distributions shown for MNIST suggest that normalization can stabilize the optimization process.
- Historically important cornerstone study