Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

2016

Problem

Framing

Batch normalization ties normalization to mini-batch statistics, which breaks for small batches, online updates, and variable-length RNNs. The paper replaces batch-wise statistics with per-example, per-layer statistics, giving identical train/test computation and faster recurrent optimization.

Currently Used Methods

Foundational

@ioffeBatchNormalizationAccelerating2015 — normalizes activations with mini-batch mean and variance.
- Limitation in context: batch dependence breaks under small batches, online learning, and variable-length RNNs.
@choGRU2014 — gated recurrent unit for compact sequence modeling.
- Limitation in context: hidden-state dynamics remain sensitive to activation scale and shift.
@hochreiterLSTM1997 — gated memory for long-range recurrent credit assignment.
- Limitation in context: recurrent computation still lacks batch-free internal normalization.
Recurrent Batch Normalization — extends batch normalization into recurrent hidden transitions.
- Limitation in context: needs time-step-specific statistics and still depends on batch size.
Weight Normalization — reparameterizes weights into norm and direction.
- Limitation in context: does not normalize activations using current-example statistics.

Proposed Method

Architecture

Layer normalization computes one mean and one standard deviation across the $H$ pre-activations in a layer for each example. It then applies learned per-unit gain $\mathbf{g}$ and bias $\mathbf{b}$ before the nonlinearity. In RNNs, the statistics are recomputed at every time step.

Verified equation figure: per-time-step layer normalization computes a layer-wide mean and standard deviation from one example's pre-activations, then applies learned gain and bias before the nonlinearity.

Loss / Objective

The method changes the layer computation, not the task loss.

$$\mu^t = \frac{1}{H}\sum_{i=1}^{H} a_i^t, \qquad \sigma^t = \sqrt{\frac{1}{H}\sum_{i=1}^{H} \left(a_i^t - \mu^t\right)^2}$$

$$\mathbf{h}^t = f\left[ \frac{\mathbf{g}}{\sigma^t} \odot \left(\mathbf{a}^t - \mu^t\right) + \mathbf{b} \right]$$
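The recurrent update can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual recurrent pre-activation $\mathbf{a}^t = W_{hh}\mathbf{h}^{t-1} + W_{xh}\mathbf{x}^t$ and $f = \tanh$, and the helper names (`matvec`, `ln_rnn_step`, `run_sequence`), the toy weights, and the `eps` term are all assumptions made for this example.

```python
import math

def layer_norm(a, g, b, eps=1e-5):
    # eps is an added assumption for numerical safety; the equations omit it.
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [gi * (x - mu) / (sigma + eps) + bi for x, gi, bi in zip(a, g, b)]

def matvec(W, v):
    # Plain matrix-vector product for a list-of-rows matrix.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ln_rnn_step(h_prev, x, W_hh, W_xh, g, b):
    # Assumed pre-activation: a^t = W_hh h^{t-1} + W_xh x^t
    a = [p + q for p, q in zip(matvec(W_hh, h_prev), matvec(W_xh, x))]
    # Statistics come from this one example at this one time step.
    return [math.tanh(v) for v in layer_norm(a, g, b)]

def run_sequence(xs, W_hh, W_xh, g, b):
    h = [0.0] * len(W_hh)
    for x in xs:
        h = ln_rnn_step(h, x, W_hh, W_xh, g, b)
    return h

# Toy weights (made up): H = 3 hidden units, 2 input features.
W_hh = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9]]
W_xh = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
g = [1.0, 1.0, 1.0]  # gains initialized to 1
b = [0.0, 0.0, 0.0]  # biases initialized to 0

short_seq = [[1.0, 2.0], [0.5, -1.0]]
long_seq = short_seq + [[3.0, 0.0], [-2.0, 1.0], [0.0, 0.5]]
h_short = run_sequence(short_seq, W_hh, W_xh, g, b)
h_long = run_sequence(long_seq, W_hh, W_xh, g, b)
```

Because no statistics are stored per time step, the same function handles sequences of length 2 and length 5 without any change.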

Algorithm

The forward rule normalizes each layer independently within each example.

$$\mathrm{LN}(\mathbf{a};\mathbf{g},\mathbf{b}) = \frac{\mathbf{a} - \mu}{\sigma} \odot \mathbf{g} + \mathbf{b}, \qquad \mu = \frac{1}{H}\sum_{i=1}^{H} a_i, \qquad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H} \left(a_i - \mu\right)^2}$$
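A direct transcription of this forward rule into plain Python might look like the sketch below. The function name `layer_norm` and the small `eps` added to the denominator are assumptions of this example (the formula itself has no such term), and the input vector is made up.

```python
import math

def layer_norm(a, g, b, eps=1e-5):
    """Normalize one example's pre-activations a (length H), then apply gain and bias.

    eps is an assumption added to avoid division by zero; it is not in the formula.
    """
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H)
    return [gi * (x - mu) / (sigma + eps) + bi for x, gi, bi in zip(a, g, b)]

a = [1.0, 2.0, 3.0, 4.0]          # mu = 2.5, sigma = sqrt(1.25)
out = layer_norm(a, g=[1.0] * 4, b=[0.0] * 4)
out_mean = sum(out) / len(out)
out_second_moment = sum(v * v for v in out) / len(out)
```

With unit gains and zero biases, the output has zero mean and a second moment of about 1, up to the effect of `eps`.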

Training Procedure

Adaptive gains initialized to $1$.
Adaptive biases initialized to $0$.
Statistics computed per example, per layer.
In RNNs, statistics computed separately at each time step.
Train and test use the same normalization rule.

Evaluation

Datasets

MSCOCO image–sentence ranking
CNN question answering corpus
Skip-thought sentence representation tasks
Binarized MNIST with DRAW
Handwriting sequence generation
Permutation-invariant MNIST

Metrics

Recall@K
Mean rank
Validation accuracy
Pearson correlation
Spearman correlation
Mean squared error
Negative log likelihood
Variational lower bound
Test error
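For the retrieval metrics, a common definition can be sketched as below: given the 1-indexed rank of the ground-truth item for each query, Recall@K is the fraction of queries ranked within the top K, and mean rank is the average of those ranks. The function names and the rank list are hypothetical, and the paper's exact evaluation protocol may differ in details.

```python
def recall_at_k(ranks, k):
    # Fraction of queries whose ground-truth item is ranked within the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    # Average 1-indexed rank of the ground-truth item (lower is better).
    return sum(ranks) / len(ranks)

ranks = [1, 3, 12, 2, 7]  # made-up ranks for five queries
r1 = recall_at_k(ranks, 1)
r5 = recall_at_k(ranks, 5)
r10 = recall_at_k(ranks, 10)
mr = mean_rank(ranks)
```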

Headline results

MSCOCO symmetric model: caption retrieval R@1 $45.4 \rightarrow 52.5$, image retrieval R@1 $36.3 \rightarrow 41.3$.
MSCOCO order-embeddings: caption retrieval R@1 $46.7 \rightarrow 58.0$, image retrieval R@1 $37.9 \rightarrow 44.2$.
CNN attentive reader: validation error drops faster than LSTM, BN-LSTM, and BN-everywhere baselines.
Skip-thought vectors: downstream transfer metrics improve at matched training iterations.
Permutation-invariant MNIST: layer normalization beats batch normalization most clearly at batch size $4$.

Ablations

Batch size: batch normalization degrades sharply at size $4$; layer normalization remains effective.
Gain initialization: attentive reader stays robust across different initial gain values.
Architecture type: recurrent models benefit more than feed-forward models.
Network type: convolutional networks see speedup over no normalization, but batch normalization still wins.

Method Strengths and Weaknesses

Strengths

Removes train/test mismatch by avoiding running statistics.
Fits variable-length RNNs because normalization is time-step local.
Improves MSCOCO retrieval by large R@1 margins.
Stays effective when mini-batches shrink to $4$.

Weaknesses

Gains concentrate in recurrent settings, not all feed-forward models.
Loses batch normalization's stochastic regularization effect.
Underperforms batch normalization in convolutional networks.
Adds per-example mean and variance computation at every layer.

Suggestions from the authors

Study why recurrent networks benefit more than convolutional networks.
Design better normalization schemes for convolutional activations.
Clarify when normalization improves generalization, not only optimization speed.
Extend the invariance analysis to broader architectures and training dynamics.

1. Summary

Motivation / Problem

A normalization method is needed that works for RNNs with varying sequence lengths and for small mini-batches.

Prior Work and Its Limitations

Batch Normalization
- Limitations
  - Performs poorly with small batch sizes, where batch statistics are noisy.
  - Cannot be applied in online learning, where examples arrive one at a time.
  - Is awkward for RNNs, since separate $\mu, \sigma$ must be kept for each time step.
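The batch dependence behind these limitations can be illustrated with a small sketch. It compares training-mode batch statistics (per feature, across the batch) with layer statistics (per example, across features); gain and bias are omitted for brevity. The function names, the `eps` term, and the toy data are assumptions of this example.

```python
import math

def layer_norm_example(a, eps=1e-5):
    # Statistics over the features of a single example.
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / (sigma + eps) for x in a]

def batch_norm(batch, eps=1e-5):
    # Statistics per feature, computed across the examples in the batch.
    n = len(batch)
    cols = list(zip(*batch))
    mus = [sum(c) / n for c in cols]
    sigmas = [math.sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, mus)]
    return [[(x - m) / (s + eps) for x, m, s in zip(row, mus, sigmas)] for row in batch]

target = [1.0, 2.0, 4.0]
batch_a = [target, [0.0, 0.0, 0.0]]    # same target example,
batch_b = [target, [10.0, -5.0, 3.0]]  # different batch-mate

bn_a = batch_norm(batch_a)[0]  # target's output depends on its batch-mate
bn_b = batch_norm(batch_b)[0]
ln_a = layer_norm_example(batch_a[0])  # target's output depends only on itself
ln_b = layer_norm_example(batch_b[0])
```

With a batch of two, each batch-normalized feature collapses to roughly $\pm 1$, and the sign for the target example flips depending on its batch-mate, while the layer-normalized output is unchanged.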

Proposed Method

Layer Normalization
- Normalize the units in each layer using the mean and variance of that layer's pre-activations for a single training example.
- Add new parameters $\mathbf{g}, \mathbf{b}$ (gain and bias) and learn this scale and shift through backpropagation.
- ![[@baLayerNormalization2016_LayerNorm.png]]
Theoretical Stability
- Layer Norm is invariant to per-training-case feature shifting and scaling.
- A Riemannian-manifold analysis relates the curvature of the parameter space to training stability under different weight scalings.
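The per-training-case invariance can be checked numerically with a short sketch. As a simplification, the shift and scale are applied directly to one example's pre-activations; the function name, the tiny `eps`, and the values of `alpha` and `beta` are assumptions chosen for illustration.

```python
import math

def layer_norm(a, eps=1e-8):
    # Gain and bias omitted; eps is an assumption added for numerical safety.
    mu = sum(a) / len(a)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / len(a))
    return [(x - mu) / (sigma + eps) for x in a]

a = [0.5, -1.0, 2.0, 3.5]
alpha, beta = 7.0, -3.0  # arbitrary positive scale and shift for this one example
a_transformed = [alpha * x + beta for x in a]

base = layer_norm(a)
transformed = layer_norm(a_transformed)
max_diff = max(abs(p - q) for p, q in zip(base, transformed))
```

Subtracting the mean cancels `beta` and dividing by the standard deviation cancels `alpha`, so `max_diff` is zero up to floating-point error and the effect of `eps`.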

Hypothesis and Evaluation

Hypothesis
- Layer Norm should accelerate and stabilize training with small mini-batches and long sequences
Evaluation
- Order Embeddings of Images and Language
  - Image + Language Embedding Space --> Retrieval Task
  - Faster training and better generalization for the GRU sentence encoder
- Question Answering Task
  - Attentive Reader (LSTM-based) with Layer Norm outperformed the BN variants
  - Also remained robust to different initializations of the scale (gain) parameter
- Skip-Thought Vectors
  - Faster training and better performance
- Other experiments
  - Generative RNN (DRAW)
  - Handwriting Sequence Generation
  - Permutation Invariant MNIST

2. Paper Strengths and Weaknesses

Strengths

Identified a concrete limitation of Batch Normalization and proposed a fix
Intuitive idea backed by a theoretical analysis of the method
Useful with small batches and long sequences

Weaknesses

Clear gains appear mainly on recurrent architectures
Underperforms Batch Normalization on ConvNets, since units near the image boundary have activation statistics that differ from the rest of the layer
Weaker regularization effect than Batch Normalization

3. My Opinion

Overall Rating

Weak Accept

Recommendation Justification

Reports both the advantages and the disadvantages of the method
A fundamental approach to a new normalization method for neural networks

Detailed Comments

Changes the axis of normalization from the batch to the layer
Need to study manifold (Riemannian) geometry to follow the theoretical analysis in more depth
