Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

2016

Problem

Framing

Batch normalization ties normalization to mini-batch statistics, which breaks for small batches, online updates, and variable-length RNNs. The paper replaces batch-wise statistics with per-example, per-layer statistics, giving identical train/test computation and faster recurrent optimization.
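The dependence on mini-batch composition can be illustrated with a small sketch. The helper names `batch_norm` and `layer_norm`, the toy inputs, and the `eps` stabilizer are assumptions made for illustration. Gain and bias are omitted so that only the statistics differ.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each unit with mean/std taken across the examples in the batch."""
    n, H = len(batch), len(batch[0])
    out = [[0.0] * H for _ in range(n)]
    for j in range(H):
        col = [row[j] for row in batch]
        mu = sum(col) / n
        sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / n + eps)
        for i in range(n):
            out[i][j] = (batch[i][j] - mu) / sigma
    return out

def layer_norm(batch, eps=1e-5):
    """Normalize each example with mean/std taken across its own H units."""
    out = []
    for row in batch:
        H = len(row)
        mu = sum(row) / H
        sigma = math.sqrt(sum((v - mu) ** 2 for v in row) / H + eps)
        out.append([(v - mu) / sigma for v in row])
    return out

# The same example x placed in two different mini-batches (toy values).
x = [1.0, 2.0, 4.0]
batch_a = [x, [0.0, 0.0, 1.0]]
batch_b = [x, [5.0, -3.0, 2.0]]

# Layer norm output for x ignores the other rows; batch norm output does not.
ln_same = layer_norm(batch_a)[0] == layer_norm(batch_b)[0]
bn_same = batch_norm(batch_a)[0] == batch_norm(batch_b)[0]
print("layer norm unchanged:", ln_same)
print("batch norm unchanged:", bn_same)
```

Because the per-example statistics never look at other rows, the same function can be applied to a batch of one, which is the setting where batch statistics are least reliable.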

Currently Used Methods

Foundational

Proposed Method

Architecture

Layer normalization computes one mean and one standard deviation across the $H$ pre-activations in a layer for each example. It then applies learned per-unit gain $\mathbf{g}$ and bias $\mathbf{b}$ before the nonlinearity. In RNNs, the statistics are recomputed at every time step.

Verified equation figure: per-time-step layer normalization computes a layer-wide mean and standard deviation from one example's pre-activations, then applies learned gain and bias before the nonlinearity.

Loss / Objective

The method changes the layer computation, not the task loss.

$$\mu^t = \frac{1}{H}\sum_{i=1}^{H} a_i^t, \qquad \sigma^t = \sqrt{\frac{1}{H}\sum_{i=1}^{H} \left(a_i^t - \mu^t\right)^2}$$

$$\mathbf{h}^t = f\left[ \frac{\mathbf{g}}{\sigma^t} \odot \left(\mathbf{a}^t - \mu^t\right) + \mathbf{b} \right]$$
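The per-time-step rule can be sketched as a single recurrent step. This is a minimal illustration, not the authors' code. It assumes a vanilla RNN whose summed inputs are formed from hypothetical weight matrices `W_hh` and `W_xh`, uses `tanh` for the nonlinearity $f$, and adds a small `eps` under the square root for numerical safety.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ln_rnn_step(x_t, h_prev, W_xh, W_hh, g, b, eps=1e-5):
    """One layer-normalized RNN step for a single example."""
    # Summed inputs a^t (assumed form: W_hh h^{t-1} + W_xh x^t).
    a_t = [r + i for r, i in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    H = len(a_t)
    # Statistics over the H hidden units, recomputed at this time step.
    mu_t = sum(a_t) / H
    sigma_t = math.sqrt(sum((a - mu_t) ** 2 for a in a_t) / H + eps)
    # h^t = f[ g / sigma^t * (a^t - mu^t) + b ], with f = tanh (assumption).
    return [math.tanh(g_i / sigma_t * (a_i - mu_t) + b_i)
            for a_i, g_i, b_i in zip(a_t, g, b)]

# Toy dimensions: 2 inputs, H = 3 hidden units (illustrative values only).
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
g = [1.0, 1.0, 1.0]
b = [0.0, 0.0, 0.0]

h = [0.0, 0.0, 0.0]
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = ln_rnn_step(x_t, h, W_xh, W_hh, g, b)
print(h)
```

The loop works for a sequence of any length, since each step computes its own $\mu^t$ and $\sigma^t$ from the current example alone.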

Algorithm

The forward rule normalizes each layer independently within each example.

$$\mathrm{LN}(\mathbf{a};\mathbf{g},\mathbf{b}) = \frac{\mathbf{a} - \mu}{\sigma} \odot \mathbf{g} + \mathbf{b}, \qquad \mu = \frac{1}{H}\sum_{i=1}^{H} a_i, \qquad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H} \left(a_i - \mu\right)^2}$$
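A minimal sketch of this forward rule for one example, written with plain Python lists. The function name and the `eps` term are assumptions. The formula above has no stabilizer, and `eps` is added here only to avoid dividing by zero when all pre-activations are equal.

```python
import math

def layer_norm(a, g, b, eps=1e-5):
    """Apply LN(a; g, b) to one example's H pre-activations."""
    H = len(a)
    mu = sum(a) / H                                    # layer-wide mean
    var = sum((a_i - mu) ** 2 for a_i in a) / H        # layer-wide variance
    sigma = math.sqrt(var + eps)                       # eps is an added assumption
    # Normalize, then apply the per-unit gain and bias.
    return [g_i * (a_i - mu) / sigma + b_i for a_i, g_i, b_i in zip(a, g, b)]

a = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(a, g=[1.0] * 4, b=[0.0] * 4)
print(y)
```

With unit gain and zero bias, the output has approximately zero mean and unit variance across the $H$ units. The learned $\mathbf{g}$ and $\mathbf{b}$ then rescale and shift each unit individually.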

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers

1. Summary

Motivation / Problem

Prior Work and Its Limitations

Proposed Method

Hypothesis and Evaluation


2. Paper Strengths and Weaknesses

Strengths

Weaknesses


3. My Opinion

Overall Rating

Recommendation Justification

Detailed Comments