Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal, Alex Nichol

2021 · NeurIPS

Problem

Framing

Diffusion models still trailed BigGAN-class image quality on ImageNet and LSUN. The paper closes that gap with an ablated UNet redesign plus classifier guidance, reaching FID 2.97 on ImageNet $128\times128$ and, with guided upsampling, FID 3.85 at $512\times512$.

Currently Used Methods

Foundational

@goodfellowGAN2014 — adversarial training for high-fidelity image synthesis.
- Limitation in context: weaker coverage and unstable training relative to diffusion.
@DenoisingDiffusionProbabilisticModels2020 — DDPM with UNet $\epsilon$-prediction training.
- Limitation in context: ImageNet and LSUN sample quality still trails strong GANs.
@nicholImprovedDDPM2021 — learned variances and reduced-step diffusion sampling.
- Limitation in context: ImageNet FID still does not beat BigGAN-deep.
@songScoreSDE2020 — score-based conditioning links diffusion and classifier gradients.
- Limitation in context: this paper targets stronger large-scale image synthesis quality.
@karrasStyleGAN2019 — strong GAN baseline for photorealistic synthesis.
- Limitation in context: lacks diffusion's likelihood training and coverage advantages.

Proposed Method

Architecture

The model keeps the DDPM UNet family and swaps in empirically stronger blocks. The final setting uses variable width, 2 residual blocks per resolution, attention at resolutions $32,16,8$, 64 channels per head, BigGAN up/down blocks, and AdaGN for timestep and class conditioning.
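As a rough illustration of the AdaGN conditioning mentioned above, the sketch below applies a per-channel scale and shift after group normalization, i.e. AdaGN(h, y) = y_s · GroupNorm(h) + y_b. It is a pure-Python toy, not the authors' implementation: the `[channels][positions]` layout, the function names, and the hand-picked `scale`/`shift` values (which in a real model would come from a learned projection of the timestep and class embedding) are all assumptions made for illustration.

```python
import math

def group_norm(h, num_groups, eps=1e-5):
    """Normalize a [channels][positions] activation within each channel group."""
    channels = len(h)
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = h[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in channel] for channel in group])
    return out

def ada_gn(h, scale, shift, num_groups):
    """Toy AdaGN: per-channel scale y_s and shift y_b applied after GroupNorm."""
    normed = group_norm(h, num_groups)
    return [[scale[c] * v + shift[c] for v in channel]
            for c, channel in enumerate(normed)]

# Hypothetical 4-channel activation with 2 spatial positions per channel.
h = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [10.0, -10.0]]
# Hypothetical conditioning output (y_s, y_b); a real model would predict these.
out = ada_gn(h, scale=[1.0, 1.0, 2.0, 2.0], shift=[0.0, 0.0, 0.5, 0.5], num_groups=2)
```

With identity scale and zero shift the layer reduces to plain group normalization, which is one way to see that the conditioning only modulates already-normalized features.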

Loss / Objective

Training uses the improved-DDPM hybrid objective with learned reverse variances.

$$L_{\mathrm{hybrid}} = L_{\mathrm{simple}} + \lambda L_{\mathrm{vlb}}$$

$$L_{\mathrm{simple}} := \mathbb{E}_{t\sim [1,T],\,\mathbf{x}_0\sim q(\mathbf{x}_0),\,\boldsymbol{\epsilon}\sim \mathcal{N}(0,\mathbf{I})}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)\right\|^2\right]$$
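To make the $L_{\mathrm{simple}}$ term concrete, here is a minimal single-sample Monte Carlo estimate in pure Python. It assumes the standard DDPM forward process $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$, and it omits the $\lambda L_{\mathrm{vlb}}$ term entirely. The three-step `alpha_bars` list, the `oracle` predictor, and the zero predictor are hypothetical stand-ins for a trained network.

```python
import math
import random

def q_sample(x0, alpha_bar_t, eps):
    """Forward-noise x0: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One-sample estimate of L_simple = E ||eps - eps_theta(x_t, t)||^2."""
    t = rng.randrange(len(alpha_bars))            # uniform timestep index
    eps = [rng.gauss(0.0, 1.0) for _ in x0]       # eps ~ N(0, I)
    x_t = q_sample(x0, alpha_bars[t], eps)
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))

# Hypothetical toy setup: three timesteps and a 3-dimensional "image".
alpha_bars = [0.9, 0.5, 0.1]
x0 = [0.5, -1.0, 2.0]

def oracle(x_t, t):
    """Cheating predictor that knows x0, so it can invert q_sample exactly."""
    a, b = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [(xt - a * x) / b for xt, x in zip(x_t, x0)]

oracle_loss = simple_loss(oracle, x0, alpha_bars, random.Random(0))
zero_loss = simple_loss(lambda x_t, t: [0.0] * len(x_t), x0, alpha_bars, random.Random(0))
```

The oracle drives the loss to (numerically) zero while the zero predictor does not, which is the behavior the objective rewards: recovering the injected noise from the noised input.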

Sampling Rule / Algorithm

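Classifier guidance shifts the reverse-step mean along the classifier's log-probability gradient, scaled by the gradient scale $s$ and the reverse-step covariance; for DDIM sampling it instead subtracts the gradient, weighted by $\sqrt{1-\bar{\alpha}_t}$, from the noise prediction.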

$$p_{\theta,\phi}(\mathbf{x}_t\mid \mathbf{x}_{t+1},y) = Z\,p_{\theta}(\mathbf{x}_t\mid \mathbf{x}_{t+1})\,p_{\phi}(y\mid \mathbf{x}_t)$$

$$\mathbf{x}_{t-1} \sim \mathcal{N}\left(\boldsymbol{\mu}_{\theta}(\mathbf{x}_t) + s\,\boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t)\,\nabla_{\mathbf{x}_t}\log p_{\phi}(y\mid \mathbf{x}_t),\; \boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t)\right)$$

$$\hat{\boldsymbol{\epsilon}}(\mathbf{x}_t) := \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t) - \sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t}\log p_{\phi}(y\mid \mathbf{x}_t)$$
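The two guidance rules above can be checked on a one-dimensional toy problem. The sketch below is illustrative only: `classifier_grad` uses a hypothetical Gaussian-shaped log-probability in place of a trained noisy-image classifier, and the values for `mu`, `sigma2`, `eps`, and `alpha_bar_t` are made-up scalars rather than outputs of a diffusion model.

```python
import math

def classifier_grad(x, target=2.0, tau=1.0):
    """Gradient of a toy log p(y|x) = -(x - target)^2 / (2 tau^2) + const."""
    return -(x - target) / tau ** 2

def guided_mean(mu, sigma2, grad, s):
    """Shifted reverse-step mean: mu + s * Sigma * grad log p(y|x_t)."""
    return mu + s * sigma2 * grad

def guided_eps(eps, alpha_bar_t, grad):
    """DDIM-style noise prediction: eps - sqrt(1 - abar_t) * grad log p(y|x_t)."""
    return eps - math.sqrt(1.0 - alpha_bar_t) * grad

# Hypothetical scalars standing in for model outputs at one timestep.
x_t, mu, sigma2 = 0.0, 0.1, 0.04
grad = classifier_grad(x_t)                           # points toward target = 2.0

unguided = guided_mean(mu, sigma2, grad, s=0.0)       # s = 0 leaves the mean unchanged
guided = guided_mean(mu, sigma2, grad, s=1.0)         # mean moves toward the target
stronger = guided_mean(mu, sigma2, grad, s=5.0)       # larger s, larger shift
eps_hat = guided_eps(eps=0.3, alpha_bar_t=0.75, grad=grad)
```

Because the shift is multiplied by the step variance, the same gradient moves the mean less at low-noise steps, while the scale `s` acts as the single knob that trades diversity for fidelity.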

Training Procedure

Diffusion steps: $T=1000$.
Batch size: 256 for ImageNet $128\times128$ architecture ablations.
Batch size: 256 for ImageNet $256\times256$ guidance experiments.
Architecture-ablation sampling steps: 250.
Guidance-study training length: 2M iterations on ImageNet $256\times256$.
Noise schedule: cosine for ImageNet $64\times64$; linear for $128\times128$, $256\times256$, $512\times512$.
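For reference, the two noise schedules named above can be compared through their cumulative products $\bar{\alpha}_t$. The sketch below assumes the commonly used DDPM linear endpoints ($\beta$ from $10^{-4}$ to $0.02$) and the Improved DDPM cosine form with offset $0.008$ and a $\beta$ clip at $0.999$; none of these constants appear in this summary, so treat them as assumptions rather than the paper's exact configuration.

```python
import math

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for linearly spaced betas."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def cosine_alpha_bars(T, s=0.008, max_beta=0.999):
    """Cumulative products for the cosine schedule, with betas clipped at max_beta."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    out, prod = [], 1.0
    for i in range(T):
        beta = min(1.0 - f(i + 1) / f(i), max_beta)
        prod *= 1.0 - beta
        out.append(prod)
    return out

T = 1000
linear = linear_alpha_bars(T)
cosine = cosine_alpha_bars(T)

# Signal level remaining halfway through the forward process.
mid_linear, mid_cosine = linear[T // 2 - 1], cosine[T // 2 - 1]
```

Under these assumptions the cosine schedule retains noticeably more signal at the midpoint than the linear one, meaning it destroys information more gradually.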

Evaluation

Datasets

ImageNet $64\times64$
ImageNet $128\times128$
ImageNet $256\times256$
ImageNet $512\times512$
LSUN bedroom $256\times256$
LSUN cat $256\times256$
LSUN horse $256\times256$

Metrics

FID
sFID
Inception Score
Precision
Recall

Headline results

ImageNet $128\times128$ conditional: FID 2.97.
ImageNet $256\times256$ conditional: FID 4.59.
ImageNet $512\times512$ conditional: FID 7.72.
ImageNet $256\times256$ guided upsampling stack: FID 3.94.
ImageNet $512\times512$ guided upsampling stack: FID 3.85.

Table 1: Classifier guidance on ImageNet $128\times128$ trades diversity for fidelity as gradient scale increases.

gradient scale	FID	sFID	IS	precision	recall
0	5.91	5.09	158.82	0.70	0.65
0.5	2.97	4.69	221.57	0.78	0.61
1.0	3.01	5.11	253.01	0.82	0.59
2.0	5.28	7.24	279.0	0.87	0.50
3.0	6.94	8.94	280.48	0.89	0.45
5.0	9.21	11.37	291.06	0.90	0.39
7.5	10.58	13.03	293.57	0.90	0.35
10.0	12.14	15.36	300.28	0.90	0.28
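The trade-off in Table 1 can be read off programmatically. The snippet below simply copies the table rows into Python tuples and checks the trends the caption describes; the variable names are illustrative and nothing beyond the table's own numbers is assumed.

```python
# Rows copied from Table 1: (gradient scale, FID, sFID, IS, precision, recall)
rows = [
    (0.0, 5.91, 5.09, 158.82, 0.70, 0.65),
    (0.5, 2.97, 4.69, 221.57, 0.78, 0.61),
    (1.0, 3.01, 5.11, 253.01, 0.82, 0.59),
    (2.0, 5.28, 7.24, 279.0, 0.87, 0.50),
    (3.0, 6.94, 8.94, 280.48, 0.89, 0.45),
    (5.0, 9.21, 11.37, 291.06, 0.90, 0.39),
    (7.5, 10.58, 13.03, 293.57, 0.90, 0.35),
    (10.0, 12.14, 15.36, 300.28, 0.90, 0.28),
]

# Scale with the lowest FID in the table.
best_scale = min(rows, key=lambda r: r[1])[0]

def is_monotone(values, increasing=True, strict=True):
    """Check whether consecutive values move in one direction."""
    pairs = list(zip(values, values[1:]))
    if increasing:
        return all(a < b if strict else a <= b for a, b in pairs)
    return all(a > b if strict else a >= b for a, b in pairs)

is_rises = is_monotone([r[3] for r in rows], increasing=True)
precision_rises = is_monotone([r[4] for r in rows], increasing=True, strict=False)
recall_falls = is_monotone([r[5] for r in rows], increasing=False)
```

In this table the lowest FID sits at scale 0.5, while IS and precision keep rising and recall keeps falling as the scale grows, which is the fidelity-versus-diversity trade-off in numeric form.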

Ablations

Width versus depth: wider models reach lower FID faster in wall-clock time.
Attention heads: more heads or fewer channels per head improve FID.
Attention resolutions: using $32,16,8$ beats $16$-only attention.
BigGAN up/down blocks and AdaGN: both improve FID; residual rescaling hurts.

Results plot: three line charts show that increasing classifier gradient scale first improves then worsens FID/sFID, steadily raises IS, and increases precision while reducing recall.

Method Strengths and Weaknesses

Strengths

Beats BigGAN-class baselines on ImageNet FID across multiple resolutions.
Guidance gives one scalar knob for fidelity versus diversity.
Matches BigGAN-deep with as few as 25 forward passes per sample.
Architecture gains are cumulative across controlled ablations.

Weaknesses

Guided sampling reduces recall as precision rises.
Conditional synthesis needs an extra noisy-image classifier.
Best quality still depends on many denoising steps.
Final architecture is heavily tuned on ImageNet ablations.

Suggestions from the authors

Develop better sample-quality metrics beyond FID and IS.
Improve faster samplers that preserve guided-sampling quality.
Understand why large guidance scales avoid adversarial failure.
Leverage unlabeled pretraining before classifier-based specialization.
