Group Normalization

Yuxin Wu, Kaiming He

2018


Problem

Framing

Batch Normalization (BN) degrades when the per-worker batch size becomes small, because its batch statistics become unreliable; small batches are common in detection, video, and fine-tuning. Group Normalization (GN) removes the batch dependence by normalizing each sample over groups of channels. With ResNet-50 on ImageNet, GN keeps the error at 24.1% even at batch size 2, where BN's error rises to 34.7%.

Currently Used Methods

Foundational

Proposed Method

Architecture

GN splits $C$ channels into $G$ groups and normalizes each sample over each group's channels and spatial positions. $G=1$ gives LayerNorm; $G=C$ gives InstanceNorm. The paper uses $G=32$ in most vision models.

Verified figure: the set definition for GroupNorm, showing that each feature is normalized within its sample and channel group.
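The two limiting cases can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes an `(N, C, H, W)` layout, omits the affine parameters, and the function names `group_normalize`, `layer_normalize`, and `instance_normalize` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def group_normalize(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) per sample and per channel group (no affine)."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("C must be divisible by the number of groups")
    # Consecutive channels form a group: channel k_C falls in group floor(k_C / (C/G)).
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)  # population variance (1/m)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

def layer_normalize(x, eps=1e-5):
    """Reference LayerNorm over (C, H, W) of each sample (assumed layout)."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_normalize(x, eps=1e-5):
    """Reference InstanceNorm over (H, W) of each sample and channel."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))

# G = 1 should coincide with LayerNorm, G = C with InstanceNorm.
same_as_layer_norm = np.allclose(group_normalize(x, 1), layer_normalize(x))
same_as_instance_norm = np.allclose(group_normalize(x, 8), instance_normalize(x))
```

Nothing in `group_normalize` reduces over the batch axis, so the result for one sample does not depend on which other samples share its batch.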

Loss / Objective

GN is a drop-in normalization layer with affine re-scaling.

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i), \qquad y_i = \gamma \hat{x}_i + \beta$$

Sampling Rule / Algorithm

The normalization set and statistics are

$$S_i = \{k \mid k_N = i_N,\; \lfloor k_C/(C/G) \rfloor = \lfloor i_C/(C/G) \rfloor\}$$

$$\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}(x_k - \mu_i)^2 + \epsilon}$$

Here $i$ and $k$ index individual features, with $i_N$ and $i_C$ the sample and channel components of $i$; $m$ is the size of $S_i$, and $\epsilon$ is a small constant.

Verified figure: the mean and standard deviation equations used by GroupNorm for each group.
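To make the set definition concrete, the sketch below follows it literally for a single feature index, building $S_i$ by enumeration instead of reshaping. It is an illustration under stated assumptions, not the paper's implementation: the function name `group_norm_at` is hypothetical, the layout is taken to be `(N, C, H, W)`, and `gamma` and `beta` are assumed to be per-channel vectors.

```python
import itertools
import math
import numpy as np

def group_norm_at(x, i, G, gamma, beta, eps=1e-5):
    """Return (mu_i, sigma_i, y_i, m) for index i = (n, c, h, w), enumerating S_i."""
    N, C, H, W = x.shape
    i_n, i_c, _, _ = i
    per_group = C // G  # C/G channels per group; assumes G divides C
    # S_i: same sample, same channel group, any spatial position.
    S_i = [
        k for k in itertools.product(range(N), range(C), range(H), range(W))
        if k[0] == i_n and k[1] // per_group == i_c // per_group
    ]
    m = len(S_i)
    mu = sum(x[k] for k in S_i) / m
    sigma = math.sqrt(sum((x[k] - mu) ** 2 for k in S_i) / m + eps)
    x_hat = (x[i] - mu) / sigma
    y = gamma[i_c] * x_hat + beta[i_c]  # per-channel affine (assumption)
    return mu, sigma, y, m

# Tiny example: N=2, C=4, H=1, W=2, split into G=2 groups of 2 channels.
x = np.arange(16, dtype=float).reshape(2, 4, 1, 2)
gamma, beta = np.ones(4), np.zeros(4)
mu, sigma, y, m = group_norm_at(x, (0, 0, 0, 0), G=2, gamma=gamma, beta=beta)
# S_i covers the values 0, 1, 2, 3, so m = 4 and mu = 1.5.
```

The enumeration costs time proportional to the full tensor size for every index and is only meant for reading against the equations; a practical layer would compute the same statistics with a reshape and a reduction over each group.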

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers

1. Summary

Motivation / Problem

Prior Work and Its Limitations

Proposed Method

Hypothesis and Evaluation


2. Paper Strengths and Weaknesses

Strengths

Weaknesses


3. My Opinion

Overall Rating

Recommendation Justification

Detailed Comments