Layer Normalization
Problem
Framing
Batch normalization ties normalization to mini-batch statistics, which breaks for small batches, online updates, and variable-length RNNs. The paper replaces batch-wise statistics with per-example, per-layer statistics, giving identical train/test computation and faster recurrent optimization.
Currently Used Methods
Foundational
- @ioffeBatchNormalizationAccelerating2015 — normalizes activations with mini-batch mean and variance.
- Limitation in context: batch dependence breaks under small batches, online learning, and variable-length RNNs.
- @choGRU2014 — gated recurrent unit for compact sequence modeling.
- Limitation in context: hidden-state dynamics remain sensitive to activation scale and shift.
- @hochreiterLSTM1997 — gated memory for long-range recurrent credit assignment.
- Limitation in context: recurrent computation still lacks batch-free internal normalization.
- Recurrent Batch Normalization — extends batch normalization into recurrent hidden transitions.
- Limitation in context: needs time-step-specific statistics and still depends on batch size.
- Weight Normalization — reparameterizes weights into norm and direction.
- Limitation in context: does not normalize activations using current-example statistics.
Proposed Method
Architecture
Layer normalization computes one mean and one standard deviation across the pre-activations in a layer for each example. It then applies learned per-unit gain and bias before the nonlinearity. In RNNs, the statistics are recomputed at every time step.
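The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `layer_norm`, the weight matrix `W`, the `tanh` nonlinearity, and the small `eps` added for numerical safety are all assumptions made for the example.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize pre-activations `a` of shape (batch, hidden), one example at a time."""
    mu = a.mean(axis=-1, keepdims=True)     # one mean per example
    sigma = a.std(axis=-1, keepdims=True)   # one standard deviation per example
    # eps is an assumed numerical safeguard, not part of the method's definition
    return gain * (a - mu) / (sigma + eps) + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # 4 examples, 8 input features (illustrative sizes)
W = rng.normal(size=(8, 16))    # hypothetical weight matrix of a 16-unit layer
gain = np.ones(16)              # learned per-unit gain
bias = np.zeros(16)             # learned per-unit bias

a = x @ W                                # pre-activations
normalized = layer_norm(a, gain, bias)   # normalize before the nonlinearity
h = np.tanh(normalized)
```

Because the statistics are taken along the hidden axis, each row of `a` is normalized without reference to any other example in the batch.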

Loss / Objective
The method changes the layer computation, not the task loss.
Algorithm
The forward rule normalizes each layer independently within each example.
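Written out, the rule for a layer with $H$ hidden units, pre-activations $a$, gain $g$, bias $b$, and nonlinearity $f$ is roughly the following. The symbol names are chosen here for illustration and may differ from the paper's exact notation.

$$
\mu = \frac{1}{H}\sum_{i=1}^{H} a_i, \qquad
\sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i - \mu\right)^2}, \qquad
h = f\!\left(\frac{g}{\sigma}\odot\left(a - \mu\right) + b\right)
$$

Both $\mu$ and $\sigma$ are scalars shared by all units in the layer for a given example, while $g$ and $b$ are vectors with one entry per unit.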
Training Procedure
- Adaptive gains initialized to 1.
- Adaptive biases initialized to 0.
- Statistics computed per example, per layer.
- In RNNs, statistics computed separately at each time step.
- Train and test use the same normalization rule.
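The procedure above can be sketched for a plain tanh RNN. This is an illustrative sketch, not the paper's code: `ln_rnn`, the weight names, and the layer sizes are hypothetical, `eps` is an assumed numerical safeguard, and the gains and biases start at ones and zeros.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return gain * (a - mu) / (sigma + eps) + bias

def ln_rnn(xs, W_xh, W_hh, gain, bias):
    """Run a layer-normalized tanh RNN over one sequence `xs` of shape (T, input_dim)."""
    h = np.zeros(W_hh.shape[0])
    for x_t in xs:
        a_t = x_t @ W_xh + h @ W_hh                # summed inputs at this time step
        h = np.tanh(layer_norm(a_t, gain, bias))   # statistics recomputed at every step
    return h

rng = np.random.default_rng(0)
input_dim, hidden_dim = 5, 12                      # illustrative sizes
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
gain = np.ones(hidden_dim)                         # gains start at one
bias = np.zeros(hidden_dim)                        # biases start at zero

# The same function handles sequences of any length, with no stored statistics.
h_short = ln_rnn(rng.normal(size=(3, input_dim)), W_xh, W_hh, gain, bias)
h_long = ln_rnn(rng.normal(size=(50, input_dim)), W_xh, W_hh, gain, bias)
```

Nothing in the loop depends on the sequence length or on other sequences, so the same code path serves training and inference.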
Evaluation
Datasets
- MSCOCO image–sentence ranking
- CNN question answering corpus
- Skip-thought sentence representation tasks
- Binarized MNIST with DRAW
- Handwriting sequence generation
- Permutation-invariant MNIST
Metrics
- Recall@K
- Mean rank
- Validation accuracy
- Pearson correlation
- Spearman correlation
- Mean squared error
- Negative log likelihood
- Variational lower bound
- Test error
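For the retrieval metrics, a minimal sketch of Recall@K and mean rank is shown here. It assumes a simplified setup in which query `i` has exactly one correct candidate, candidate `i`, which is not necessarily how the MSCOCO protocol pairs images and captions. The function name and the toy score matrix are made up for illustration.

```python
import numpy as np

def retrieval_metrics(scores, k=1):
    """scores[i, j] is the similarity of query i to candidate j; candidate i is the match."""
    order = np.argsort(-scores, axis=1)          # candidates sorted best-first per query
    targets = np.arange(len(scores))[:, None]
    ranks = np.argmax(order == targets, axis=1) + 1   # 1-based rank of the true match
    recall_at_k = float(np.mean(ranks <= k))
    mean_rank = float(np.mean(ranks))
    return recall_at_k, mean_rank

# Toy similarity matrix: queries 0 and 2 rank their match first, query 1 ranks it second.
scores = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
])
r1, mean_r = retrieval_metrics(scores, k=1)
```

Higher Recall@K and lower mean rank both indicate better retrieval.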
Headline results
- MSCOCO symmetric model: caption retrieval R@1 , image retrieval R@1 .
- MSCOCO order-embeddings: caption retrieval R@1 , image retrieval R@1 .
- CNN attentive reader: validation error drops faster than LSTM, BN-LSTM, and BN-everywhere baselines.
- Skip-thought vectors: downstream transfer metrics improve at matched training iterations.
- Permutation-invariant MNIST: layer normalization beats batch normalization most clearly at batch size 4.
Ablations
- Batch size: batch normalization degrades sharply at size 4; layer normalization remains effective.
- Gain initialization: attentive reader stays robust across different initial gain values.
- Architecture type: recurrent models benefit more than feed-forward models.
- Network type: convolutional networks see speedup over no normalization, but batch normalization still wins.
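The batch-size ablation has a simple mechanical explanation that can be checked numerically. The sketch below is an assumption-laden toy, not a reproduction of the experiment: it uses random pre-activations, omits gains and biases, and compares training-mode batch statistics for a large and a small batch (128 and 4 are illustrative sizes).

```python
import numpy as np

def batch_norm_train(a, eps=1e-5):
    """Training-mode batch normalization: statistics over the batch axis."""
    mu = a.mean(axis=0, keepdims=True)
    sigma = a.std(axis=0, keepdims=True)
    return (a - mu) / (sigma + eps)

def layer_norm(a, eps=1e-5):
    """Layer normalization: statistics over the hidden axis of each example."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return (a - mu) / (sigma + eps)

rng = np.random.default_rng(0)
a = rng.normal(size=(128, 16))   # pre-activations for a batch of 128 examples

# Normalize the same first example inside a large batch and inside a batch of 4.
bn_gap = np.abs(batch_norm_train(a)[0] - batch_norm_train(a[:4])[0]).max()
ln_gap = np.abs(layer_norm(a)[0] - layer_norm(a[:4])[0]).max()
```

`bn_gap` is clearly non-zero because the first example's output depends on which other examples share its batch, and the estimate gets noisier as the batch shrinks. `ln_gap` is zero up to floating-point error because layer normalization never looks at the other examples.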
Method Strengths and Weaknesses
Strengths
- Removes train/test mismatch by avoiding running statistics.
- Fits variable-length RNNs because normalization is time-step local.
- Improves MSCOCO retrieval by large R@1 margins.
- Stays effective when mini-batches shrink to 4.
Weaknesses
- Gains concentrate in recurrent settings, not all feed-forward models.
- Loses batch normalization's stochastic regularization effect.
- Underperforms batch normalization in convolutional networks.
- Adds per-example mean and variance computation at every layer.
Suggestions from the authors
- Study why recurrent networks benefit more than convolutional networks.
- Design better normalization schemes for convolutional activations.
- Clarify when normalization improves generalization, not only optimization speed.
- Extend the invariance analysis to broader architectures and training dynamics.
Links
Prior Papers
- @ioffeBatchNormalizationAccelerating2015 — introduces batch normalization, the direct baseline layer normalization removes batch dependence from.
- @choGRU2014 — provides the GRU architecture used in the paper's recurrent evaluations.
- @hochreiterLSTM1997 — provides the LSTM family whose hidden dynamics layer normalization stabilizes.
Further Papers
- @wuGroupNormalization2018 — extends batch-free normalization to vision settings where layer normalization is weaker.
1. Summary
Motivation / Problem
- Need a normalization method that works for RNNs (variable sequence lengths) and with small batch sizes.
Prior Work and Its Limitations
- Batch Normalization
- Limitations
- Breaks down with small batch sizes.
- Cannot be used in online learning settings.
- Ill-suited to RNNs, since it needs separate statistics for each time step.
Proposed Method
- Layer Normalization
- Normalize the neurons in each layer using the mean and variance computed over that layer's summed inputs for a single training example.
- Add learnable gain (scale) and bias (shift) parameters, trained by backpropagation.
- ![[@baLayerNormalization2016_LayerNorm.png]]
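- Theoretical Stability
- Layer Norm is invariant to re-scaling of an individual training case, and to re-scaling and re-centering of the whole weight matrix.
- A Riemannian-geometry analysis (using the Fisher information as the metric) suggests that the normalization scalar implicitly reduces the effective learning rate as the weight norm grows, which stabilizes training across different weight scales.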
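The invariance properties can be checked numerically. The following is a minimal sketch under stated assumptions: gains and biases are omitted, no `eps` is used so that the invariances hold up to floating-point error, and the scaling factors and the shift vector `gamma` are arbitrary illustrative choices.

```python
import numpy as np

def layer_norm(a):
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return (a - mu) / sigma

rng = np.random.default_rng(0)
x = rng.normal(size=(6,))       # a single training case
W = rng.normal(size=(6, 10))    # hypothetical weight matrix (inputs x units)

base = layer_norm(x @ W)

# Re-scaling the individual training case leaves the output unchanged.
scaled_input = layer_norm((3.0 * x) @ W)

# Re-scaling the whole weight matrix leaves the output unchanged.
scaled_weights = layer_norm(x @ (0.5 * W))

# Re-centering: adding the same vector to the incoming weights of every unit
# shifts all pre-activations by one shared scalar, which the mean removes.
gamma = rng.normal(size=(6, 1))
shifted_weights = layer_norm(x @ (W + gamma))
```

All three variants reproduce `base`, which is the sense in which the normalized layer is insensitive to these transformations.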
Hypothesis and Evaluation
- Hypothesis
- Layer Norm should accelerate and stabilize training in small mini-batch and long-sequence settings.
- Evaluation
- Order Embeddings of Images and Language
- Image + Language Embedding Space --> Retrieval Task
- Faster training and better generalization with a layer-normalized GRU sentence encoder
- Question Answering Task
- Layer-normalized Attentive Reader (LSTM-based) outperformed the BN variants
- Also robust to different initializations of the scale (gain) parameter
- Skip-Thought Vectors
- Faster training, Better Performance
- Other experiments
- Generative RNN (DRAW)
- Handwriting Sequence Generation
- Permutation Invariant MNIST
2. Paper Strengths and Weaknesses
Strengths
- Identifies and addresses the batch-dependence problems of Batch Normalization
- Intuitive idea backed by a theoretical analysis
- Useful in small-batch and long-sequence settings
Weaknesses
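- Only performs better on recurrent architectures
- Works less well on ConvNets, since hidden units whose receptive fields lie near the image boundary have very different activation statistics from the rest of the layer
- Provides less regularization than batch normalization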
3. My Opinion
Overall Rating
- Weak Accept
Recommendation Justification
- The paper is candid about both the advantages and disadvantages of the method.
- A foundational approach to normalization in neural networks.
Detailed Comments
- Changes the normalization axis from the batch to the layer.
- Following the theoretical analysis in depth requires background in Riemannian manifolds.