Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky

2017

Problem

Framing

Feed-forward stylization ran in real time but still trailed the quality of Gatys-style optimization-based transfer, and its results degraded with larger training sets or longer training. The paper closes this gap with one architectural change: replace batch normalization with instance normalization and keep it active at test time.

Currently Used Methods

Direct antecedents

Image Style Transfer Using Convolutional Neural Networks — optimization-based perceptual style transfer with strongest visual quality.
- Limitation in context: several minutes per $512 \times 512$ image, not real-time.
Texture Networks: Feed-forward Synthesis of Textures and Stylized Images — feed-forward generators for fast fixed-style transfer.
- Limitation in context: quality degrades with many training images or long training.
Perceptual Losses for Real-Time Style Transfer and Super-Resolution — residual feed-forward stylization with perceptual losses.
- Limitation in context: the reproduced model still improves when batch normalization is swapped for instance normalization.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift — batch-wise activation normalization for CNN optimization.
- Limitation in context: statistics pooled over the whole batch leave each image's own contrast in place, which the generator should remove.

Proposed Method

Architecture

The paper keeps the feed-forward generator and swaps every batch-normalization layer for instance normalization. It tests the change in both the earlier Ulyanov generator and a reproduced Johnson residual generator, with normalization still applied at inference.
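The difference between the two layers comes down to which axes the statistics are pooled over. The following is a minimal NumPy sketch, assuming activations stored as `(T, C, H, W)` and omitting the learned scale and shift as well as the running averages that batch normalization normally uses at inference. The function names are illustrative and not taken from the paper.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # One mean/variance per channel, pooled over all images and positions.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # One mean/variance per image and per channel, pooled over positions only.
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Two synthetic "images" with very different contrast (assumed toy data).
low = rng.normal(0.0, 0.1, size=(1, 3, 8, 8))
high = rng.normal(0.0, 5.0, size=(1, 3, 8, 8))
x = np.concatenate([low, high], axis=0)

# Per-image, per-channel standard deviation after each normalization.
bn_std = batch_norm(x).std(axis=(2, 3))      # shape (2, 3)
in_std = instance_norm(x).std(axis=(2, 3))   # shape (2, 3)
```

In this toy setup `in_std` is close to 1 for both images, while `bn_std` still separates the low-contrast image from the high-contrast one, which is the contrast dependence the swap is meant to remove.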

Qualitative comparison figure: top row shows content, style, and Gatys transfer; bottom row compares zero padding, improved padding, and zero padding plus instance normalization, where instance normalization suppresses border artifacts.

Loss / Objective

Training keeps the fixed-style perceptual objective:

$$\min_g \; \frac{1}{n} \sum_{t=1}^{n} \mathcal{L}\big(x_0, x_t, g(x_t, z_t)\big), \qquad z_t \sim \mathcal{N}(0,1)$$
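The objective is an empirical average over content images and noise draws. The sketch below only illustrates that averaging structure: `toy_generator` and `toy_loss` are hypothetical stand-ins introduced here, not the paper's generator or its perceptual loss (which is computed from a pretrained CNN), and the shape of `z_t` is an assumption.

```python
import numpy as np

def empirical_objective(generator, loss, x0, contents, rng):
    """Average loss(x0, x_t, g(x_t, z_t)) over the n content images."""
    total = 0.0
    for x_t in contents:
        z_t = rng.standard_normal(x_t.shape)   # z_t ~ N(0, 1); shape is assumed
        total += loss(x0, x_t, generator(x_t, z_t))
    return total / len(contents)

# Hypothetical stand-ins, for illustration only.
def toy_generator(x_t, z_t):
    return x_t + 0.1 * z_t

def toy_loss(x0, x_t, y):
    content_term = np.mean((y - x_t) ** 2)     # stay close to the content image
    style_term = (y.mean() - x0.mean()) ** 2   # match a crude style statistic
    return float(content_term + style_term)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(3, 8, 8))                             # fixed style image
contents = [rng.normal(size=(3, 8, 8)) for _ in range(5)]   # x_1 .. x_n
value = empirical_objective(toy_generator, toy_loss, x0, contents, rng)
```

In actual training the minimization over `g` would be carried out by gradient descent on the generator's parameters; that step is left out here.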

Normalization Rule

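The key change is per-instance, per-channel normalization over the spatial dimensions, where $t$ indexes the image in the batch, $i$ the feature channel, and $j, k$ the spatial position: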

$$y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}$$

$$\mu_{ti} = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} x_{tilm}, \qquad \sigma_{ti}^2 = \frac{1}{HW} \sum_{l=1}^{W} \sum_{m=1}^{H} \big(x_{tilm} - \mu_{ti}\big)^2$$
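The rule translates almost line for line into array code. The following is a minimal NumPy sketch, assuming a `(T, C, H, W)` layout and an `eps` default chosen here for illustration; it also checks that changing the contrast of the input barely changes the output.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (image t, channel i) slice over its H x W positions."""
    # mu_ti and sigma_ti^2: one scalar per image and per channel.
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = ((x - mu) ** 2).mean(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 16, 16))   # T=4 images, 3 channels, 16x16 maps
y = instance_norm(x)

# Rescaling and shifting the input leaves the output almost unchanged;
# eps is the only source of discrepancy.
y_contrast = instance_norm(2.5 * x + 7.0)
max_diff = np.abs(y - y_contrast).max()
```

Each `(t, i)` slice of `y` has mean close to 0 and variance close to 1, and `max_diff` stays small because a positive rescaling and a shift cancel out in the ratio.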

Training Procedure

Fixed style image $x_0$ per generator.
Content images $x_t$ , with $t = 1, \dots, n$ .
Noise seeds $z_t \sim \mathcal{N}(0,1)$ .
Same hyperparameters as the batch-normalized baselines.
Instance normalization active at train and test time.
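Because the statistics are computed per image, keeping the layer active at test time means an image is normalized the same way whether it is processed alone or inside a batch. The check below is a sketch under an assumed `(T, C, H, W)` layout, with a training-mode batch normalization (no running averages, no learned affine terms) included only for contrast.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    mu = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_train_mode(x, eps=1e-5):
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
# Four toy images with different contrast levels (assumed data).
scales = np.array([0.5, 1.0, 2.0, 4.0]).reshape(4, 1, 1, 1)
batch = rng.normal(size=(4, 3, 8, 8)) * scales
single = batch[:1]   # the first image processed on its own

# Does the first image get the same output alone and inside the batch?
in_same = np.allclose(instance_norm(batch)[:1], instance_norm(single))
bn_same = np.allclose(batch_norm_train_mode(batch)[:1],
                      batch_norm_train_mode(single))
```

Here `in_same` holds while `bn_same` does not, since the batch-level statistics change with the other images in the batch.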

Evaluation

Datasets

Fixed style images, one per trained generator.
Natural content-image collections for training and test stylization.
Qualitative examples on portraits and scenes.

Metrics

Qualitative visual comparison.
Comparison against Gatys optimization-based transfer.
Comparison of batch normalization versus instance normalization.

Headline results

Gatys baseline: several minutes per $512 \times 512$ image.
Ulyanov generator: instance normalization removes severe border artifacts after long training.
Johnson residual generator: the same swap yields similar qualitative gains.
Cross-architecture comparison: both generators improve with instance normalization.
Runtime: real-time inference on standard GPU hardware.

Sample grid: two style images on the top row, then a portrait content image and its two stylized outputs from the proposed method.

Ablations

Normalization type: switching from batch normalization to instance normalization drives the main visual gain.
Architecture family: gains persist in both Ulyanov and Johnson generators.
Padding choice: better padding alone does not remove the dominant border artifacts.
Training scale: many images or long training hurt the original batch-normalized generator.

Method Strengths and Weaknesses

Strengths

One normalization swap yields the central quality improvement.
Gains transfer across two generator architectures.
Test-time normalization matches the contrast-removal hypothesis.
Preserves single-pass, real-time stylization.

Weaknesses

Evaluation is almost entirely qualitative.
No quantitative stylization metric is reported.
Scope is limited to fixed-style transfer.
Training hyperparameters are sparsely specified.

Suggestions from the authors

Test instance normalization in discriminative vision models.
Analyze why contrast removal simplifies image generation.
Apply the same normalization change to other generators.
