Group Normalization

Yuxin Wu, Kaiming He

2018

Problem

Framing

Batch Normalization breaks when per-worker batch size collapses, which is common in detection, video, and fine-tuning. Group Normalization removes batch dependence by normalizing each sample over channel groups, keeping ResNet-50 ImageNet error at 24.1% even at batch size 2, where BN rises to 34.7%.

Currently Used Methods

Foundational

@ioffeBatchNormalizationAccelerating2015 — batch-wise activation normalization that accelerates optimization.
- Limitation in context: depends on minibatch statistics and collapses at tiny batch sizes.
@baLayerNormalization2016 — per-sample normalization across all channels.
- Limitation in context: ignores convolutional channel structure and trails BN on ImageNet.
@ulyanovInstanceNormalizationMissing2017 — per-instance, per-channel normalization for style transfer.
- Limitation in context: removes cross-channel coupling and underperforms for recognition.
Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models — reduces BN train-test mismatch.
- Limitation in context: still degrades more than GN in small-batch training.

Proposed Method

Architecture

GN splits $C$ channels into $G$ groups and normalizes each sample over each group's channels and spatial positions. $G=1$ gives LayerNorm; $G=C$ gives InstanceNorm. The paper uses $G=32$ in most vision models.

Verified figure: the set definition for GroupNorm, showing that each feature is normalized within its sample and channel group.
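
The channel-to-group assignment implied by the set definition can be sketched as follows. This is a minimal illustration that assumes $C$ is divisible by $G$; the helper names `group_of` and `group_layout` are hypothetical and do not come from the paper.

```python
def group_of(channel: int, num_channels: int, num_groups: int) -> int:
    """Return the group index floor(channel / (C/G)) for one channel."""
    if num_channels % num_groups != 0:
        raise ValueError("this sketch assumes C is divisible by G")
    channels_per_group = num_channels // num_groups
    return channel // channels_per_group


def group_layout(num_channels: int, num_groups: int) -> list[int]:
    """List the group index of every channel, in channel order."""
    return [group_of(c, num_channels, num_groups) for c in range(num_channels)]


print(group_layout(6, 3))  # three groups of two channels each
print(group_layout(6, 1))  # G=1: a single group, the LayerNorm case
print(group_layout(6, 6))  # G=C: one channel per group, the InstanceNorm case
```

Two positions share statistics exactly when they belong to the same sample and their channels map to the same group index.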

Loss / Objective

GN is a drop-in normalization layer with affine re-scaling.

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i), \qquad y_i = \gamma \hat{x}_i + \beta$$

Sampling Rule / Algorithm

The normalization set and statistics are

$$S_i = \{k \mid k_N = i_N,\; \lfloor k_C/(C/G) \rfloor = \lfloor i_C/(C/G) \rfloor\}$$

$$\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}(x_k - \mu_i)^2 + \epsilon}$$

Verified figure: the mean and standard deviation equations used by GroupNorm for each group.
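
The statistics and the affine step can be put together in a small reference sketch. It is written in plain Python for readability rather than speed, and it makes several assumptions that are not part of the notes above: the input layout `x[n][c]` (a flat list of the $H \cdot W$ activations of channel `c` in sample `n`), one `gamma` and `beta` value per channel, and the default `eps=1e-5`. The function name `group_norm` is hypothetical.

```python
import math


def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Group-normalize x, where x[n][c] is a flat list of spatial activations.

    Assumed layout and per-channel gamma/beta; a sketch, not a tuned layer.
    """
    num_channels = len(x[0])
    if num_channels % num_groups != 0:
        raise ValueError("this sketch assumes C is divisible by G")
    per_group = num_channels // num_groups
    gamma = gamma if gamma is not None else [1.0] * num_channels
    beta = beta if beta is not None else [0.0] * num_channels

    out = []
    for sample in x:  # statistics never cross the sample boundary
        normalized = [None] * num_channels
        for g in range(num_groups):
            channels = range(g * per_group, (g + 1) * per_group)
            # S_i: every spatial position of every channel in this group
            values = [v for c in channels for v in sample[c]]
            m = len(values)
            mu = sum(values) / m
            sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / m + eps)
            for c in channels:
                normalized[c] = [
                    gamma[c] * (v - mu) / sigma + beta[c] for v in sample[c]
                ]
        out.append(normalized)
    return out


# One sample, four channels, two spatial positions per channel, G=2.
x = [[[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]]
y = group_norm(x, num_groups=2)
print(y[0][0], y[0][1])  # first group: normalized jointly over 4 values
```

Within each group the outputs have approximately zero mean and unit variance before the affine step, regardless of how many samples are in the batch.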

Training Procedure

Groups: $G=32$ by default
ImageNet backbone: ResNet-50
ImageNet batch sizes: 32, 16, 8, 4, 2 images/GPU
COCO detector: Mask R-CNN with GN fine-tuning
COCO fine-tuning weight decay: 0
Kinetics setting: 32-frame inputs, 8 or 4 clips/GPU

Evaluation

Datasets

ImageNet classification
COCO object detection and instance segmentation
Kinetics video classification

Metrics

ImageNet: top-1 validation error
COCO: $\mathrm{AP}^{\mathrm{bbox}}$, $\mathrm{AP}^{\mathrm{mask}}$
Kinetics: top-1 and top-5 accuracy

Headline results

ImageNet, ResNet-50, batch 32: BN 23.6% error; GN 24.1%.
ImageNet, ResNet-50, batch 2: BN 34.7% error; GN 24.1%.
COCO, Mask R-CNN R50: BN* (frozen BN) 38.6 box AP, 34.5 mask AP; GN 40.3 box AP, 35.7 mask AP.
Kinetics, ResNet-50 I3D, 32-frame, 4 clips/GPU: GN 72.8 top-1, 90.6 top-5; BN 72.1 top-1, 90.0 top-5.

Ablations

Number of groups: best results appear near $G=32$.
Batch size: GN stays flat from 32 to 2 images/GPU; BN degrades sharply.
Transfer setting: GN beats frozen BN in COCO fine-tuning.
Video batch budget: GN loses less accuracy when clips/GPU drop.
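
The batch-size behavior follows from where the statistics come from. The toy sketch below compares GN with training-mode BN (batch statistics only, no running averages, no affine parameters). The helper names and the activation values are made up for illustration.

```python
import math


def _normalize(values, eps=1e-5):
    """Standardize a flat list with its own mean and variance."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]


def group_norm_sample(sample, num_groups):
    """GN for one sample; sample[c] is a flat list of spatial values."""
    per_group = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        chans = sample[g * per_group:(g + 1) * per_group]
        flat = _normalize([v for ch in chans for v in ch])
        width = len(chans[0])
        out.extend(flat[i * width:(i + 1) * width] for i in range(per_group))
    return out


def group_norm_batch(batch, num_groups):
    """GN over a batch: each sample is normalized independently."""
    return [group_norm_sample(s, num_groups) for s in batch]


def batch_norm_train(batch):
    """Training-mode BN: per-channel statistics pooled over the whole batch."""
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        width = len(batch[0][c])
        flat = _normalize([v for s in batch for v in s[c]])
        for n in range(len(batch)):
            out[n][c] = flat[n * width:(n + 1) * width]
    return out


a = [[1.0, 2.0], [3.0, 5.0]]    # sample of interest (2 channels)
b = [[10.0, -4.0], [0.5, 7.0]]  # an unrelated batch-mate

# GN: output for `a` is the same alone or alongside `b`.
print(group_norm_batch([a], 1)[0] == group_norm_batch([a, b], 1)[0])
# BN: output for `a` changes once `b` joins the batch.
print(batch_norm_train([a])[0] == batch_norm_train([a, b])[0])
```

Because GN never pools across samples, shrinking the batch changes nothing about its forward computation, whereas BN's estimates get noisier as fewer samples contribute.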

Method Strengths and Weaknesses

Strengths

Removes train-test dependence on batch statistics.
Preserves near-BN ImageNet accuracy at normal batch size.
Dramatically outperforms BN at batch size 2.
Improves COCO and Kinetics under memory-limited training.

Weaknesses

Slightly worse than BN at large-batch ImageNet training.
Introduces group count $G$ as a new hyperparameter.
Loses BN's implicit minibatch regularization.
Evidence focuses on convolutional vision models.

Suggestions from the authors

Test GN beyond vision architectures.
Analyze why grouped channels suit convolutional features.
Study GN with stronger regularization schemes.
Extend GN-based pretraining for transfer tasks.

1. Summary

Motivation / Problem

Batch Normalization degrades at small batch sizes and transfers poorly, because in both cases its batch statistics become unreliable.

Prior Work and Its Limitations

Batch Normalization
- Normalization accelerates training
- Batch statistics add a regularization effect
- Limitations
  - Performs well only with a sufficiently large batch size (e.g. 32)
  - The pretrained batch statistics are of little use in transfer learning
Instance Norm / Layer Norm
- Batch-independent alternatives to Batch Norm, but they do not match BN's accuracy

Proposed Method

Group Normalization
- Normalize each layer's activations using a per-sample mean and variance
- The mean and variance are computed over the $(H, W)$ axes and over a group of $C/G$ channels.
- ![[@wuGroupNormalization2018_NormalizedFeature.png]] ![[@wuGroupNormalization2018_NormalizationMeanStd.png]] ![[@wuGroupNormalization2018_GroupNormSet.png]]
Relation with BN, IN, LN
- Layer Norm when $G=1$
- Instance Norm when $G=C$
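
As a quick check on these special cases, the number of activations $m$ pooled into each mean and variance follows directly from the group definition. The layer shape used in the second line ($C=256$, $H=W=14$) is an assumed example, not a setting taken from the paper.

$$
m = \frac{C}{G} \cdot H \cdot W, \qquad
m\big|_{G=1} = C H W \;(\text{Layer Norm}), \qquad
m\big|_{G=C} = H W \;(\text{Instance Norm})
$$

$$
C=256,\; H=W=14: \quad
m\big|_{G=32} = 8 \cdot 196 = 1568, \quad
m\big|_{G=1} = 50176, \quad
m\big|_{G=C} = 196
$$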

Hypothesis and Evaluation

Hypothesis
- GN is insensitive to batch size
- GN transfers better than BN on small-batch vision tasks
Evaluation
- ImageNet
  - At batch size 32, GN performed almost on par with BN (24.1% vs. 23.6% top-1 error)
  - At small batch sizes, GN outperformed the other normalization methods compared (24.1% vs. 34.7% error for BN at batch size 2)
- COCO Detection/Segmentation
  - GN outperformed frozen BN
- Video Classification
  - GN beat BN when fewer clips fit on each GPU (4 clips/GPU).

2. Paper Strengths and Weaknesses

Strengths

Robust at small batch sizes
Intuitive approach and easy implementation
Low discrepancy between training / fine-tuning / inference

Weaknesses

Does not beat BN at large batch sizes
Loss of BN's regularization effect
New hyperparameter $G$ (default 32)

3. My Opinion

Overall Rating

Strongly Accept

Recommendation Justification

Simple and practical idea
Directly addresses the small-batch limitation of Batch Normalization
Easy to understand, with a clear comparison against existing methods

Detailed Comments

GN may be a great alternative, but it seems hard to replace BN entirely, since BN still has the edge at large batch sizes.
