Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel

2020 · arXiv

Problem

Framing

Diffusion models had a variational formulation but had not shown GAN-class sample fidelity. The paper closes that gap with an $\epsilon$-prediction parameterization of the reverse Gaussian chain, reaching CIFAR-10 IS $9.46$ and FID $3.17$ while remaining a likelihood-based model.

Currently Used Methods

Foundational and direct antecedents

@DeepUnsupervisedLearningusing2015 — variational diffusion with a fixed Gaussian corruption process.
- Limitation in context: lacked a practical reverse parameterization with strong image fidelity.
@dinhNVP2017 — exact-likelihood flow modeling with invertible transformations.
- Limitation in context: did not match DDPM's perceptual quality on these image benchmarks.
@songScoreSDE2020 — multi-noise score matching with Langevin-style sampling.
- Limitation in context: not the same discrete variational diffusion chain.
@ronnebergerUNet2015 — U-Net encoder–decoder backbone for dense image prediction.
- Limitation in context: needs timestep conditioning for reverse-process denoising.

Proposed Method

Architecture

The reverse model is a U-Net built from Wide-ResNet-style blocks, with group normalization, weights shared across timesteps, sinusoidal timestep embeddings, and self-attention at the $16 \times 16$ feature-map resolution. The $32 \times 32$ model uses four feature-map resolutions; the $256 \times 256$ models use six.
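
As an illustration of the timestep conditioning, here is a minimal sketch of a Transformer-style sinusoidal embedding. The function name `timestep_embedding`, the frequency base of 10000, and the sin-then-cos layout are assumptions chosen for the example; the summary only states that sinusoidal timestep embeddings are used.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps.

    t: int array of shape (batch,); dim: even embedding width (assumed).
    Returns a float array of shape (batch, dim).
    """
    t = np.asarray(t, dtype=np.float64)
    half = dim // 2
    # Geometric progression of frequencies starting at 1 and decaying toward 1/10000.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    # First half holds sines, second half holds cosines.
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([0, 1, 999]), dim=128)
```

In a network of this kind, an embedding like `emb` would typically pass through a small MLP and be added inside each residual block, so a single set of weights can serve every timestep.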

Figure: Directed graphical model with a forward Gaussian noising chain $q(\mathbf{x}_t\mid\mathbf{x}_{t-1})$ and a learned reverse denoising chain $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ from noise to image.
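
For reference, the two chains can be written out as follows. These are the standard DDPM definitions, restated here because the symbols $\alpha_t$, $\bar{\alpha}_t$, and $\sigma_t$ appear in the objective and sampling rule of this summary without being defined.

$$
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t \mathbf{I}\right), \qquad \alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
$$

$$
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right)
$$

$$
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \sigma_t^2 \mathbf{I}\right)
$$

The closed form for $q(\mathbf{x}_t \mid \mathbf{x}_0)$ is what lets training jump directly to any timestep without simulating the chain step by step.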

Loss / Objective

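Training uses the simplified noise-prediction objective, evaluated at a timestep $t$ drawn uniformly from $\{1, \dots, T\}$ with noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.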

$$
L_{\mathrm{simple}} = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, t\right)\right\|^2\right]
$$
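
The objective above can be estimated on a batch roughly as follows. This is a minimal NumPy sketch, not the authors' implementation: `l_simple`, `eps_model`, and `zero_model` are hypothetical names, and the model is any callable that maps a noised batch and its timesteps to a noise prediction.

```python
import numpy as np

def l_simple(eps_model, x0, alpha_bar, rng):
    """Monte Carlo estimate of L_simple on one batch.

    eps_model: callable (x_t, t) -> predicted noise, same shape as x_t.
    x0: array of shape (batch, ...) holding clean data.
    alpha_bar: array of shape (T,), cumulative products of 1 - beta_t.
    """
    batch = x0.shape[0]
    T = alpha_bar.shape[0]
    # One uniform timestep per example; index 0 corresponds to t = 1.
    t = rng.integers(0, T, size=batch)
    eps = rng.standard_normal(x0.shape)
    # Broadcast alpha_bar[t] over the non-batch axes.
    a = alpha_bar[t].reshape((batch,) + (1,) * (x0.ndim - 1))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    err = eps - eps_model(x_t, t)
    # Squared L2 norm per example, averaged over the batch.
    return np.mean(np.sum(err.reshape(batch, -1) ** 2, axis=1))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.standard_normal((8, 3, 4, 4))
zero_model = lambda x_t, t: np.zeros_like(x_t)  # hypothetical stand-in for a network
loss = l_simple(zero_model, x0, alpha_bar, rng)
```

Averaging over pixels instead of summing only rescales the loss by a constant, so either reduction gives the same minimizer.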

Sampling Rule / Algorithm

Sampling starts from Gaussian noise and applies one reverse Gaussian step per timestep.

$$
\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)\right) + \sigma_t \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
$$
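
The update above can be turned into a sampling loop along these lines. This is a sketch under stated assumptions: it takes $\sigma_t^2 = \beta_t$ as the fixed variance, adds no noise on the final step ($t = 1$), and uses hypothetical names (`sample`, `eps_model`, `oracle`). The oracle is a toy stand-in for a trained network: it is the exact noise predictor when all data sits at zero, so the sampler should return zeros.

```python
import numpy as np

def sample(eps_model, shape, betas, rng):
    """Ancestral sampling with the reverse update above.

    Assumes sigma_t^2 = beta_t and no added noise on the last step.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for i in reversed(range(len(betas))):  # array index i corresponds to t = i + 1
        t = np.full(shape[0], i)
        eps_hat = eps_model(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        z = rng.standard_normal(shape) if i > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
# Hypothetical oracle for data concentrated at 0, where x_t = sqrt(1 - alpha_bar_t) * eps.
oracle = lambda x, t: x / np.sqrt(1.0 - alpha_bar[t]).reshape(-1, 1, 1, 1)
x0_hat = sample(oracle, (2, 3, 4, 4), betas, rng)
```

The loop makes the cost explicit: one call to `eps_model` per timestep, so $T$ network evaluations per batch of samples.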

Training Procedure

Diffusion length: $T=1000$.
Forward schedule: linear, $\beta_1=10^{-4}$ to $\beta_T=0.02$.
Batch size: 128 on CIFAR-10; 64 on larger images.
Optimizer: Adam.
Learning rate: $2 \times 10^{-4}$ on CIFAR-10, lowered to $2 \times 10^{-5}$ for $256 \times 256$ images.
EMA decay: 0.9999.
CIFAR-10 dropout: 0.1.
Reverse variances: fixed, not learned.
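
The schedule-related settings above can be written down concretely. The sketch below builds the linear $\beta_t$ schedule, the cumulative products $\bar{\alpha}_t$, and the posterior variance $\tilde{\beta}_t$, which is one of the fixed choices for the reverse variance. The helper `ema_update` is a hypothetical illustration of an exponential moving average with decay 0.9999 over a dict of parameter arrays, not the authors' code.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # beta_1 ... beta_T, linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # cumulative product, \bar{alpha}_t

# Posterior variance \tilde{beta}_t of q(x_{t-1} | x_t, x_0), with \bar{alpha}_0 := 1.
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])
beta_tilde = betas * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)

def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over a dict of arrays."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}

ema = ema_update({"w": np.ones(2)}, {"w": np.zeros(2)})
```

With these values $\bar{\alpha}_T$ is close to zero, so $\mathbf{x}_T$ is nearly standard Gaussian noise, which is what justifies starting the sampler from $\mathcal{N}(\mathbf{0}, \mathbf{I})$.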

Evaluation

Datasets

CIFAR-10, unconditional, $32 \times 32$
CelebA-HQ, $256 \times 256$
LSUN Bedroom, $256 \times 256$
LSUN Church, $256 \times 256$
LSUN Cat, $256 \times 256$

Metrics

Inception Score
FID
Negative log-likelihood in bits/dim
Rate-distortion RMSE
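
Two of these metrics reduce to short formulas. The sketch below shows the Fréchet distance between two Gaussians, which is the quantity FID computes on Inception feature statistics, and the conversion of a negative log-likelihood from nats to bits/dim. The function names are hypothetical, and feature extraction with an Inception network is assumed to happen elsewhere.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Frechet distance between N(mu1, cov1) and N(mu2, cov2)."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for PSD inputs.
    eig = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eig.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def nats_to_bits_per_dim(nll_nats, num_dims):
    """Convert a per-example NLL in nats to bits per dimension."""
    return nll_nats / (num_dims * np.log(2.0))

cov = np.array([[2.0, 0.5], [0.5, 1.0]])
same = frechet_distance(np.zeros(2), cov, np.zeros(2), cov)
shifted = frechet_distance(np.array([1.0, 0.0]), np.eye(2), np.zeros(2), np.eye(2))
bpd = nats_to_bits_per_dim(3072 * np.log(2.0), 3072)
```

For a $32 \times 32$ RGB image, `num_dims` is $3 \times 32 \times 32 = 3072$.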

Headline results

CIFAR-10 unconditional: IS $9.46$, FID $3.17$ (FID computed against the training set).
CIFAR-10 FID computed against the test set: $5.24$.
LSUN Bedroom $256 \times 256$: FID $4.90$.
LSUN Church $256 \times 256$: FID $7.89$.
CIFAR-10 model with the best sample quality: rate $1.78$ bits/dim, distortion $1.97$ bits/dim.

Sample grid: LSUN Church generations with varied building layouts, facades, and lighting

Ablations

Objective: full variational bound improves likelihood; $L_{\mathrm{simple}}$ improves sample quality.
Reverse target: predicting $\tilde{\mu}$ works well only with variational-bound training, not with an unweighted mean-squared-error objective.
Variance parameterization: learned diagonal variance destabilizes training and worsens fidelity.
Sampling length: $T=1000$ fixes generation cost at 1000 network evaluations.

Method Strengths and Weaknesses

Strengths

Reaches CIFAR-10 FID $3.17$, better than most image generators published at the time, including class-conditional ones.
Uses a simple Gaussian reverse chain with fixed variances.
Connects diffusion training to denoising score matching.
Shows progressive generation and compression behavior across timesteps.

Weaknesses

Sampling needs $1000$ neural network evaluations per sample.
Likelihood trails stronger exact-likelihood image models.
Learned reverse variances hurt stability and fidelity.
Best likelihood objective differs from best sample-quality objective.

Suggestions from the authors

Shorten diffusion chains for faster sampling.
Find objectives that improve likelihood and sample quality together.
Explain the progressive lossy coding bias more precisely.
Explore alternative diffusion lengths and forward processes.
