Group Normalization
Problem
Framing
Batch Normalization breaks when per-worker batch size collapses, which is common in detection, video, and fine-tuning. Group Normalization removes batch dependence by normalizing each sample over channel groups, keeping ResNet-50 ImageNet error at 24.1% even at batch size 2, where BN rises to 34.7%.
Currently Used Methods
Foundational
- @ioffeBatchNormalizationAccelerating2015 — batch-wise activation normalization that accelerates optimization.
- Limitation in context: depends on minibatch statistics and collapses at tiny batch sizes.
- @baLayerNormalization2016 — per-sample normalization across all channels.
- Limitation in context: ignores convolutional channel structure and trails BN on ImageNet.
- @ulyanovInstanceNormalizationMissing2017 — per-instance, per-channel normalization for style transfer.
- Limitation in context: removes cross-channel coupling and underperforms for recognition.
- Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models — reduces BN train-test mismatch.
- Limitation in context: still degrades more than GN in small-batch training.
Proposed Method
Architecture
GN splits the $C$ channels into $G$ groups and normalizes each sample over each group's channels and spatial positions. $G=1$ gives LayerNorm; $G=C$ gives InstanceNorm. The paper uses $G=32$ in most vision models.
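A minimal NumPy sketch of the forward pass described above may help make the grouping concrete. It assumes an `(N, C, H, W)` layout; the function name `group_norm`, the `eps` value, and the optional per-channel `gamma`/`beta` arguments are illustrative choices, not the paper's reference code.

```python
import numpy as np

def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Group Normalization sketch for an (N, C, H, W) array."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("num_groups must divide the channel count")
    # Fold the channel axis into (groups, channels_per_group).
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    # One mean and variance per (sample, group), taken over the
    # group's channels and all spatial positions.
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    normed = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # Optional per-channel affine re-scaling (gamma and beta of shape (C,)).
    if gamma is not None:
        normed = normed * gamma.reshape(1, c, 1, 1)
    if beta is not None:
        normed = normed + beta.reshape(1, c, 1, 1)
    return normed

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 64, 8, 8))
y = group_norm(x, num_groups=32)

# Each (sample, group) slice should now have roughly zero mean and unit variance.
per_group = y.reshape(2, 32, -1)
group_means = per_group.mean(axis=-1)
group_vars = per_group.var(axis=-1)
```

No batch axis enters the statistics, so the same function applies unchanged at training and inference time.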

Loss / Objective
GN is a drop-in normalization layer with affine re-scaling.
Sampling Rule / Algorithm
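For a feature indexed by $i = (i_N, i_C, i_H, i_W)$, the normalization set and statistics are

$$
S_i = \left\{ k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor \right\}
$$

$$
\mu_i = \frac{1}{m} \sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m} \sum_{k \in S_i} (x_k - \mu_i)^2 + \epsilon}, \qquad \hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}
$$

where $m = |S_i|$ is the number of features in the set.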
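The membership rule can be sketched in plain Python as follows: two positions share statistics when they belong to the same sample and their channels fall in the same group, with the group index taken as the floor of the channel index divided by the channels per group. The helper names `group_of` and `shares_statistics` are hypothetical and exist only for this illustration.

```python
def group_of(channel, num_channels, num_groups):
    """Group index of a channel: floor(channel / (C / G))."""
    return channel // (num_channels // num_groups)

def shares_statistics(i, k, num_channels, num_groups):
    """True if positions i and k, given as (n, c, h, w), are normalized together."""
    same_sample = i[0] == k[0]
    same_group = (group_of(i[1], num_channels, num_groups)
                  == group_of(k[1], num_channels, num_groups))
    # Spatial coordinates never matter: all (h, w) positions are pooled.
    return same_sample and same_group

C, G = 6, 2
groups = [group_of(c, C, G) for c in range(C)]  # [0, 0, 0, 1, 1, 1]
```

With `C = 6` and `G = 2`, channels 0 to 2 form one group and channels 3 to 5 the other, so `shares_statistics((0, 0, 0, 0), (0, 2, 3, 1), C, G)` holds while the same check across samples or across groups does not.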

Training Procedure
- Groups: $G=32$ by default
- ImageNet backbone: ResNet-50
- ImageNet batch sizes: 32, 16, 8, 4, 2 images/GPU
- COCO detector: Mask R-CNN with GN fine-tuning
- COCO fine-tuning weight decay: 0 for the GN scale and shift parameters ($\gamma$, $\beta$)
- Kinetics setting: 32-frame inputs, 8 or 4 clips/GPU
Evaluation
Datasets
- ImageNet classification
- COCO object detection and instance segmentation
- Kinetics video classification
Metrics
- ImageNet: top-1 validation error
- COCO: box AP ($\text{AP}^{\text{bbox}}$) and mask AP ($\text{AP}^{\text{mask}}$)
- Kinetics: top-1 and top-5 accuracy
Headline results
- ImageNet, ResNet-50, batch 32: BN 23.6% error; GN 24.1%.
- ImageNet, ResNet-50, batch 2: BN 34.7% error; GN 24.1%.
- COCO, Mask R-CNN R50: BN* 38.6 box AP, 34.5 mask AP; GN 40.3 box AP, 35.7 mask AP.
- Kinetics, ResNet-50 I3D, 32-frame, 4 clips/GPU: GN 72.8 top-1, 90.6 top-5; BN 72.1 top-1, 90.0 top-5.
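To make the ImageNet comparison easier to read, the short snippet below computes the error gaps from the two ImageNet rows listed above. The numbers are copied from those bullets; the variable names are arbitrary.

```python
# Top-1 validation error (%) for ResNet-50 on ImageNet, keyed by images/GPU.
error = {
    32: {"BN": 23.6, "GN": 24.1},
    2: {"BN": 34.7, "GN": 24.1},
}

# Positive gap means GN has lower error than BN at that batch size.
gap = {b: round(e["BN"] - e["GN"], 1) for b, e in error.items()}

# How much each method's error rises when going from batch 32 to batch 2.
bn_rise = round(error[2]["BN"] - error[32]["BN"], 1)
gn_rise = round(error[2]["GN"] - error[32]["GN"], 1)
```

BN leads by 0.5 points at batch 32, GN leads by 10.6 points at batch 2, and BN's own error rises by 11.1 points across that range while GN's is unchanged.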
Ablations
- Number of groups: best results appear near $G=32$.
- Batch size: GN stays flat from 32 to 2 images/GPU; BN degrades sharply.
- Transfer setting: GN beats frozen BN in COCO fine-tuning.
- Video batch budget: GN loses less accuracy when clips/GPU drop.
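The batch-size ablation reflects a structural property that can be checked directly: under GN, a sample's output does not depend on what else is in the batch, whereas training-mode BN output does. The sketch below is a toy NumPy check on random data, not a reproduction of the paper's experiments. The function names and tensor shapes are assumptions made for illustration.

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Training-mode BN: per-channel statistics over (N, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    """GN: per-sample statistics over each group's channels and positions."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, -1)
    mean = g.mean(axis=2, keepdims=True)
    var = g.var(axis=2, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 32, 4, 4))
other_a = rng.normal(size=(1, 32, 4, 4))
other_b = rng.normal(loc=5.0, size=(1, 32, 4, 4))  # a very different batch mate

# Same first sample, different second sample.
batch_a = np.concatenate([sample, other_a])
batch_b = np.concatenate([sample, other_b])

gn_same = np.allclose(group_norm(batch_a, 8)[0], group_norm(batch_b, 8)[0])
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
```

Here `gn_same` is true and `bn_same` is false: changing the batch mate shifts BN's statistics for the first sample but leaves GN's untouched.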
Method Strengths and Weaknesses
Strengths
- Removes train-test dependence on batch statistics.
- Preserves near-BN ImageNet accuracy at normal batch size.
- Dramatically outperforms BN at batch size 2.
- Improves COCO and Kinetics under memory-limited training.
Weaknesses
- Slightly worse than BN at large-batch ImageNet training.
- Introduces group count as a new hyperparameter.
- Loses BN's implicit minibatch regularization.
- Evidence focuses on convolutional vision models.
Suggestions from the authors
- Test GN beyond vision architectures.
- Analyze why grouped channels suit convolutional features.
- Study GN with stronger regularization schemes.
- Extend GN-based pretraining for transfer tasks.
Links
Prior Papers
- @ioffeBatchNormalizationAccelerating2015 — GN replaces BN's batch-coupled statistics with per-sample group statistics.
- @baLayerNormalization2016 — GN recovers LayerNorm at $G=1$ and adapts it to convolutional channel structure.
- @ulyanovInstanceNormalizationMissing2017 — GN recovers InstanceNorm at $G=C$ and interpolates between LN and IN.
Further Papers
- @GenerativeInverseDesignof2023 — later small-batch generative modeling can benefit from GN's batch-size-stable normalization.
1. Summary
Motivation / Problem
- Batch Normalization degrades at small batch sizes and transfers poorly when batch statistics are unreliable.
Prior Work and Its Limitations
- Batch Normalization
    - Normalization accelerates training
    - Batch statistics add a regularization effect
    - Limitations
        - Performs well only with sufficiently large batches, e.g. 32
        - Batch statistics become unreliable for transfer learning
- Instance Norm / Layer Norm
    - Alternatives to Batch Norm, but they do not outperform BN
Proposed Method
- Group Normalization
    - Normalize each layer's features using a per-sample mean and variance
    - The mean and variance are computed along the $(H, W)$ axes and along a group of $C/G$ channels.
    - ![[@wuGroupNormalization2018_NormalizedFeature.png]] ![[@wuGroupNormalization2018_NormalizationMeanStd.png]] ![[@wuGroupNormalization2018_GroupNormSet.png]]
    - Relation with BN, IN, LN
        - Layer Norm when $G=1$
        - Instance Norm when $G=C$
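One way to see the relation between these normalizers is that they share a single formula and differ only in the axes pooled for the mean and variance. The NumPy sketch below, assuming an `(N, C, H, W)` layout and hypothetical function names, checks that GN with one group matches Layer Norm and GN with one channel per group matches Instance Norm.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the std computed over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x):
    return normalize(x, (1, 2, 3))   # per sample, over (C, H, W)

def instance_norm(x):
    return normalize(x, (2, 3))      # per sample and channel, over (H, W)

def group_norm(x, num_groups):
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    # Per sample and group, over (C/G, H, W).
    return normalize(grouped, (2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))

matches_ln = np.allclose(group_norm(x, 1), layer_norm(x))     # one group
matches_in = np.allclose(group_norm(x, 8), instance_norm(x))  # one channel per group
```

Intermediate group counts, such as `group_norm(x, 2)` here, fall between the two extremes and match neither.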
Hypothesis and Evaluation
- Hypothesis
    - GN is insensitive to batch size variation
    - GN transfers better than BN on small-batch vision tasks
- Evaluation
    - ImageNet
        - At batch size 32, GN performed nearly on par with BN (24.1% vs. 23.6% error)
        - At small batch sizes, GN outperformed every compared normalization method
    - COCO Detection/Segmentation
        - GN outperformed frozen BN
    - Video Classification
        - GN beat BN at the smaller clip budget (4 clips/GPU).
2. Paper Strengths and Weaknesses
Strengths
- Strong in small batch size
- Intuitive approach and easy implementation
- Low discrepancy between training / fine-tuning / inference
Weaknesses
- Does not match BN at large batch sizes
- Loss of BN's regularization effect
- New hyperparameter (number of groups)
3. My Opinion
Overall Rating
- Strongly Accept
Recommendation Justification
- Simple and Practical Idea
- Directly solves the limitation of Batch Normalization
- Easy to understand and nice comparison with existing methods
Detailed Comments
- GN may be a great alternative but it seems hard to replace BN entirely.