Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal, Alex Nichol

2021 · NeurIPS

Problem

Framing

Diffusion models still trailed BigGAN-class image quality on ImageNet and LSUN. The paper closes that gap with an ablated UNet redesign plus classifier guidance, reaching FID 2.97 on ImageNet $128\times128$ and, with guided upsampling, FID 3.85 at $512\times512$.

Currently Used Methods

Foundational

Proposed Method

Architecture

The model keeps the DDPM UNet family and swaps in empirically stronger blocks. The final setting uses variable width, 2 residual blocks per resolution, attention at resolutions 32, 16, and 8, 64 channels per head, BigGAN up/down blocks, and AdaGN for timestep and class conditioning.
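The final setting listed above can be written down as a small configuration object. This is a minimal sketch: the class name `UNetConfig` and all field and method names are hypothetical and chosen for illustration, not taken from the authors' code. Only the values (2 residual blocks, attention at 32/16/8, 64 channels per head, BigGAN up/down blocks, AdaGN) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UNetConfig:
    """Hypothetical container for the final architecture choices (names are illustrative)."""
    res_blocks_per_resolution: int = 2
    attention_resolutions: tuple = (32, 16, 8)
    channels_per_head: int = 64
    biggan_up_down: bool = True
    adaptive_group_norm: bool = True

    def num_heads(self, channels: int) -> int:
        """Number of attention heads at a layer with `channels` features.

        With a fixed number of channels per head, the head count grows
        with layer width instead of being a constant.
        """
        if channels % self.channels_per_head != 0:
            raise ValueError("channels must be a multiple of channels_per_head")
        return channels // self.channels_per_head

    def uses_attention(self, resolution: int) -> bool:
        """True if a feature map of this spatial resolution gets an attention layer."""
        return resolution in self.attention_resolutions


cfg = UNetConfig()
print(cfg.num_heads(256), cfg.uses_attention(16), cfg.uses_attention(64))
```

Fixing channels per head rather than the head count means wider layers automatically get more heads, which is one way to read the "64 channels per head" choice.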

Loss / Objective

Training uses the improved-DDPM hybrid objective with learned reverse variances.

$$L_{\mathrm{hybrid}} = L_{\mathrm{simple}} + \lambda L_{\mathrm{vlb}}$$

$$L_{\mathrm{simple}} := \mathbb{E}_{t\sim [1,T],\,\mathbf{x}_0\sim q(\mathbf{x}_0),\,\boldsymbol{\epsilon}\sim \mathcal{N}(0,\mathbf{I})}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)\right\|^2\right]$$
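To make the objective concrete, the sketch below estimates $L_{\mathrm{simple}}$ by Monte Carlo on a batch and combines it with a given $L_{\mathrm{vlb}}$ value. It is an illustration under stated assumptions, not the authors' training code: the linear beta schedule, the function names, and the flat `(batch, dim)` data layout are all assumptions, and $L_{\mathrm{vlb}}$ is passed in as a number rather than computed.

```python
import numpy as np


def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def l_simple(eps_model, x0, alpha_bar, rng):
    """One-sample Monte Carlo estimate of L_simple for a batch x0 of shape (B, D)."""
    batch = x0.shape[0]
    # Uniform timestep per example; index 0 stands for t = 1.
    t = rng.integers(0, len(alpha_bar), size=batch)
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]
    # Forward-process sample x_t ~ q(x_t | x_0).
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    err = eps - eps_model(x_t, t)
    # Squared L2 norm per example, averaged over the batch.
    return float(np.mean(np.sum(err ** 2, axis=1)))


def l_hybrid(l_simple_value, l_vlb_value, lam):
    """L_hybrid = L_simple + lambda * L_vlb."""
    return l_simple_value + lam * l_vlb_value


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)
x0 = rng.standard_normal((16, 8))
# Placeholder model that always predicts zero noise.
value = l_simple(lambda x_t, t: np.zeros_like(x_t), x0, alpha_bar, rng)
print(value)
```

A model that always predicts zero noise gives a loss close to the data dimension, since the expected squared norm of a standard normal vector in $D$ dimensions is $D$; a trained model should fall below that baseline.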

Sampling Rule / Algorithm

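Classifier guidance multiplies each reverse transition by a classifier $p_{\phi}(y\mid \mathbf{x}_t)$ trained on noisy images, with $Z$ a normalizing constant. For stochastic sampling this shifts the reverse-step mean by the classifier gradient, scaled by the gradient scale $s$ and the step covariance; for DDIM it modifies the noise prediction instead.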

$$p_{\theta,\phi}(\mathbf{x}_t\mid \mathbf{x}_{t+1},y) = Z\,p_{\theta}(\mathbf{x}_t\mid \mathbf{x}_{t+1})\,p_{\phi}(y\mid \mathbf{x}_t)$$

$$\mathbf{x}_{t-1} \sim \mathcal{N}\left(\boldsymbol{\mu}_{\theta}(\mathbf{x}_t) + s\,\boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t)\,\nabla_{\mathbf{x}_t}\log p_{\phi}(y\mid \mathbf{x}_t),\; \boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t)\right)$$

$$\hat{\boldsymbol{\epsilon}}(\mathbf{x}_t) := \boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t) - \sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t}\log p_{\phi}(y\mid \mathbf{x}_t)$$
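The two guidance rules above can be sketched on a toy problem. The code below is an illustration, not the paper's implementation: the linear-softmax classifier, the diagonal covariance, and all function names are assumptions made so that the gradient $\nabla_{\mathbf{x}_t}\log p_{\phi}(y\mid \mathbf{x}_t)$ has a closed form. In practice the gradient would come from automatic differentiation through a noisy-image classifier.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))


def linear_classifier_grad(x, W, b, y):
    """Gradient of log p(y | x) w.r.t. x for a toy linear-softmax classifier.

    For logits W @ x + b, d/dx log p_y = W[y] - sum_k p_k * W[k].
    """
    p = np.exp(log_softmax(W @ x + b))
    return W[y] - p @ W


def guided_mean(mu, sigma_diag, grad, s):
    """Shifted reverse-step mean mu + s * Sigma * grad, with diagonal Sigma."""
    return mu + s * sigma_diag * grad


def guided_eps(eps, alpha_bar_t, grad):
    """DDIM variant: eps_hat = eps - sqrt(1 - alpha_bar_t) * grad."""
    return eps - np.sqrt(1.0 - alpha_bar_t) * grad


# Two classes in two dimensions; class 0 prefers the first coordinate.
W = np.eye(2)
b = np.zeros(2)
x_t = np.zeros(2)
grad = linear_classifier_grad(x_t, W, b, y=0)

print(guided_mean(np.zeros(2), np.full(2, 0.1), grad, s=2.0))
print(guided_eps(np.zeros(2), alpha_bar_t=0.75, grad=grad))
```

Setting `s = 0` recovers the unguided mean, and larger `s` pushes the sample further along the classifier gradient, which matches the fidelity-versus-diversity trade-off controlled by the gradient scale.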

Training Procedure

Evaluation

Datasets

Metrics

Headline results

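Table 1: Classifier guidance on ImageNet $128\times128$ trades diversity for fidelity as gradient scale increases.

| gradient scale | FID | sFID | IS | precision | recall |
|---|---|---|---|---|---|
| 0 | 5.91 | 5.09 | 158.82 | 0.70 | 0.65 |
| 0.5 | 2.97 | 4.69 | 221.57 | 0.78 | 0.61 |
| 1.0 | 3.01 | 5.11 | 253.01 | 0.82 | 0.59 |
| 2.0 | 5.28 | 7.24 | 279.0 | 0.87 | 0.50 |
| 3.0 | 6.94 | 8.94 | 280.48 | 0.89 | 0.45 |
| 5.0 | 9.21 | 11.37 | 291.06 | 0.90 | 0.39 |
| 7.5 | 10.58 | 13.03 | 293.57 | 0.90 | 0.35 |
| 10.0 | 12.14 | 15.36 | 300.28 | 0.90 | 0.28 |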
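The trade-off in Table 1 can be checked directly from its rows. The snippet below is a small sketch that copies the table values and reports which gradient scale minimizes FID and whether IS, precision, and recall move monotonically; the helper names are illustrative only.

```python
# Rows copied from Table 1: (gradient scale, FID, sFID, IS, precision, recall).
ROWS = [
    (0.0, 5.91, 5.09, 158.82, 0.70, 0.65),
    (0.5, 2.97, 4.69, 221.57, 0.78, 0.61),
    (1.0, 3.01, 5.11, 253.01, 0.82, 0.59),
    (2.0, 5.28, 7.24, 279.0, 0.87, 0.50),
    (3.0, 6.94, 8.94, 280.48, 0.89, 0.45),
    (5.0, 9.21, 11.37, 291.06, 0.90, 0.39),
    (7.5, 10.58, 13.03, 293.57, 0.90, 0.35),
    (10.0, 12.14, 15.36, 300.28, 0.90, 0.28),
]
COLUMNS = {"scale": 0, "fid": 1, "sfid": 2, "is": 3, "precision": 4, "recall": 5}


def column(name):
    """Values of one column, ordered by increasing gradient scale."""
    return [row[COLUMNS[name]] for row in ROWS]


def scale_minimizing(name):
    """Gradient scale of the row with the lowest value in the given column."""
    return min(ROWS, key=lambda row: row[COLUMNS[name]])[0]


def non_decreasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))


def non_increasing(values):
    return all(a >= b for a, b in zip(values, values[1:]))


print("best FID at scale", scale_minimizing("fid"))
print("IS rises:", non_decreasing(column("is")))
print("precision rises:", non_decreasing(column("precision")))
print("recall falls:", non_increasing(column("recall")))
```

FID and sFID are both lowest at scale 0.5, while precision keeps rising and recall keeps falling across the whole range, so the best FID sits at a small scale rather than at either extreme.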

Ablations

Results plot: three line charts show that increasing classifier gradient scale first improves then worsens FID/sFID, steadily raises IS, and increases precision while reducing recall.

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers