Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, Stefano Ermon

2020 · arXiv

Problem

Framing

DDPMs deliver strong image quality but need long reverse chains, usually $T=1000$ steps. DDIM closes the latency gap by defining a family of non-Markovian forward processes that share the DDPM denoising training objective, then sampling along a short subsequence of timesteps with deterministic or partially stochastic updates. On CIFAR-10, deterministic DDIM reaches FID $4.16$ in $100$ steps.

Currently Used Methods

Direct antecedents

@DeepUnsupervisedLearningusing2015 — diffusion latent-variable learning from nonequilibrium thermodynamics.
- Limitation in context: no practical short-step image sampler.
@DenoisingDiffusionProbabilisticModels2020 — denoising diffusion with strong likelihoods and sample quality.
- Limitation in context: generation needs long sequential reverse chains.
@songScoreSDE2020 — score-based generation with continuous noise perturbations.
- Limitation in context: not framed as direct reuse of DDPM checkpoints.
@goodfellowGAN2014 — one-pass adversarial image synthesis with high perceptual quality.
- Limitation in context: no diffusion-style inversion trajectory.

Proposed Method

Architecture

DDIM changes the sampler, not the denoiser. It reuses the DDPM network $\boldsymbol{\epsilon}^{(t)}_\theta(\mathbf{x}_t)$, chooses an increasing subsequence $\tau$ of $\{1,\dots,T\}$ with length $S$, and controls randomness with $\eta$, where $\eta=0$ gives a deterministic trajectory and $\eta=1$ matches the DDPM noise level.

Figure: Graphical models comparing standard diffusion inference on the left with the DDIM non-Markovian inference family on the right; the DDIM graph conditions intermediate states on both later states and $\mathbf{x}_0$.

Loss / Objective

The non-Markovian family shares the DDPM denoising surrogate up to timestep weights.

$$L_{\gamma}(\boldsymbol{\epsilon}_\theta)=\sum_{t=1}^{T}\gamma_t\,\mathbb{E}_{\mathbf{x}_0,\boldsymbol{\epsilon}_t}\left[\left\|\boldsymbol{\epsilon}^{(t)}_\theta\left(\sqrt{\alpha_t}\,\mathbf{x}_0+\sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t\right)-\boldsymbol{\epsilon}_t\right\|_2^2\right]$$
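As a rough illustration of the objective above, the following NumPy sketch estimates $L_\gamma$ with one noise draw per timestep. The function name `ddim_training_loss` and the callable `eps_model(x_t, t)` are hypothetical stand-ins for a trained denoiser, and the batch layout (first axis is the batch) is an assumption rather than something the paper prescribes.

```python
import numpy as np

def ddim_training_loss(eps_model, x0, alphas, gammas, rng):
    """Monte Carlo estimate of L_gamma using one noise sample per timestep.

    eps_model(x_t, t): hypothetical denoiser returning an array shaped like x_t.
    x0: batch of clean samples, shape (B, ...).
    alphas[t - 1]: alpha_t for t = 1..T (the cumulative signal level).
    gammas[t - 1]: per-timestep weight gamma_t.
    """
    batch = x0.shape[0]
    total = 0.0
    for t in range(1, len(alphas) + 1):
        a_t = alphas[t - 1]
        eps = rng.standard_normal(x0.shape)
        # Noised input: sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps
        x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
        residual = (eps_model(x_t, t) - eps).reshape(batch, -1)
        # Squared L2 norm per sample, averaged over the batch
        total += gammas[t - 1] * np.mean(np.sum(residual ** 2, axis=1))
    return total
```

In practice a training loop would sample a random $t$ per minibatch instead of summing over all timesteps; the full sum is kept here only to mirror the formula.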

Sampling Rule

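Each step predicts $\hat{\mathbf{x}}_0$ from the current state (the parenthesized term), adds a direction term pointing back toward $\mathbf{x}_{\tau_i}$, and injects fresh Gaussian noise $\boldsymbol{\epsilon}_{\tau_i}$ scaled by $\sigma_{\tau_i}(\eta)$, moving along the chosen subsequence $\tau$.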

$$\mathbf{x}_{\tau_{i-1}}=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{\mathbf{x}_{\tau_i}-\sqrt{1-\alpha_{\tau_i}}\,\boldsymbol{\epsilon}^{(\tau_i)}_\theta(\mathbf{x}_{\tau_i})}{\sqrt{\alpha_{\tau_i}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_i}(\eta)^2}\,\boldsymbol{\epsilon}^{(\tau_i)}_\theta(\mathbf{x}_{\tau_i})+\sigma_{\tau_i}(\eta)\,\boldsymbol{\epsilon}_{\tau_i}$$

$$\sigma_{\tau_i}(\eta)=\eta\sqrt{\frac{1-\alpha_{\tau_{i-1}}}{1-\alpha_{\tau_i}}}\sqrt{1-\frac{\alpha_{\tau_i}}{\alpha_{\tau_{i-1}}}}$$
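The two formulas above translate almost line by line into code. The sketch below is a minimal NumPy version of a single update; the names `ddim_sigma` and `ddim_step` are illustrative, and `eps_pred` is assumed to be the output of the pretrained denoiser at the current state.

```python
import numpy as np

def ddim_sigma(alpha_t, alpha_prev, eta):
    """sigma_{tau_i}(eta) for signal levels alpha_{tau_i} and alpha_{tau_{i-1}}."""
    return (eta
            * np.sqrt((1.0 - alpha_prev) / (1.0 - alpha_t))
            * np.sqrt(1.0 - alpha_t / alpha_prev))

def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, eta=0.0, rng=None):
    """One update x_{tau_i} -> x_{tau_{i-1}} given the predicted noise eps_pred."""
    sigma = ddim_sigma(alpha_t, alpha_prev, eta)
    # Predicted clean sample (the parenthesized term in the update)
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    # Direction term; the clamp only guards against float round-off
    coeff = np.sqrt(np.maximum(1.0 - alpha_prev - sigma ** 2, 0.0))
    x_prev = np.sqrt(alpha_prev) * x0_pred + coeff * eps_pred
    if eta > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        x_prev = x_prev + sigma * rng.standard_normal(x_t.shape)
    return x_prev
```

With `eta=0.0` no random numbers are drawn, so repeated calls with the same inputs return the same output; passing `alpha_prev=1.0` on the final step returns the predicted $\hat{\mathbf{x}}_0$.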

Training Procedure

Forward process length $T=1000$
Reuses pretrained DDPM denoisers
Sampling uses a subsequence $\tau$ with length $S$
Stochasticity $\eta\in[0,1]$, set at sampling time
Same dataset-specific architectures as DDPM
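The subsequence $\tau$ is a sampling-time choice, so it can be built without touching the trained model. The sketch below shows two simple spacings, linear and quadratic; the helper name `make_tau` and the rounding details are assumptions made for illustration, not a prescribed recipe.

```python
import numpy as np

def make_tau(T, S, kind="linear"):
    """Return an increasing integer array of at most S timesteps in [1, T].

    "linear" spaces steps evenly; "quadratic" places more steps at small t.
    Duplicates created by rounding are dropped, so the result can be
    slightly shorter than S.
    """
    if kind == "linear":
        tau = np.linspace(1.0, float(T), S)
    elif kind == "quadratic":
        tau = np.linspace(1.0, np.sqrt(float(T)), S) ** 2
    else:
        raise ValueError(f"unknown kind: {kind}")
    tau = np.clip(np.round(tau).astype(int), 1, T)
    return np.unique(tau)  # sorted, duplicates removed
```

For example, `make_tau(1000, 100)` yields 100 timesteps from 1 to 1000; the sampler then visits them in reverse order, from $\tau_S$ down to $\tau_1$.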

Evaluation

Datasets

CIFAR-10, $32 \times 32$, unconditional
CelebA, $64 \times 64$
LSUN Bedroom, $256 \times 256$
LSUN Church, $256 \times 256$

Metrics

FID
Reconstruction MSE on CIFAR-10 test images
Sampling steps $S$

Headline results

CIFAR-10 ($S=100$, $\eta=0$): FID $4.16$
CelebA ($S=100$, $\eta=0$): FID $6.53$
LSUN Bedroom ($S=100$, $\eta=0$): FID $6.62$
LSUN Church ($S=100$, $\eta=0$): FID $10.58$
DDPM baseline ($S=1000$, $\hat{\sigma}$): CIFAR-10 FID $3.17$, CelebA FID $3.26$

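Table 1: CIFAR10 and CelebA image generation measured in FID (lower is better). $S$ is the number of sampling steps; $\eta=0.0$ is DDIM, while $\eta=1.0$ and $\hat{\sigma}$ are DDPM variants.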

$S$	CIFAR10 $\eta=0.0$	CIFAR10 $\eta=0.2$	CIFAR10 $\eta=0.5$	CIFAR10 $\eta=1.0$	CIFAR10 $\hat{\sigma}$	CelebA $\eta=0.0$	CelebA $\eta=0.2$	CelebA $\eta=0.5$	CelebA $\eta=1.0$	CelebA $\hat{\sigma}$
10	13.36	14.04	16.66	41.07	367.43	17.33	17.66	19.86	33.12	299.71
20	6.84	7.11	8.35	18.36	133.37	13.73	14.11	16.06	26.03	183.83
50	4.67	4.77	5.25	8.01	32.72	9.17	9.51	11.01	18.48	71.71
100	4.16	4.25	4.46	5.78	9.99	6.53	6.79	8.09	13.93	45.20
1000	4.04	4.09	4.29	4.73	3.17	3.51	3.64	4.28	5.98	3.26

Ablations

Step count $S$: larger subsequences improve FID across datasets.
Stochasticity $\eta$: $\eta=0$ is best in short-step regimes.
DDIM versus DDPM: deterministic updates degrade far less when steps are truncated.
Deterministic trajectories enable interpolation and low-error reconstruction.
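To make the reconstruction claim concrete, the sketch below runs the $\eta=0$ update forward in time to encode $\mathbf{x}_0$ and then backward to decode it. Everything here is illustrative: `eta0_move`, `encode`, and `decode` are hypothetical names, `alphas[t]` is assumed to hold $\alpha_t$ with `alphas[0] = 1`, and the denoiser is queried at `t = 0` on the first encoding step, which a real checkpoint may not support.

```python
import numpy as np

def eta0_move(x, eps_pred, alpha_from, alpha_to):
    """Deterministic (eta = 0) DDIM move between two signal levels."""
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps_pred) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps_pred

def encode(x0, eps_model, alphas, tau):
    """Approximate inversion: step x_0 up to x_{tau_S} along increasing tau."""
    x, t = x0, 0
    for t_next in tau:
        x = eta0_move(x, eps_model(x, t), alphas[t], alphas[t_next])
        t = t_next
    return x

def decode(x_last, eps_model, alphas, tau):
    """Deterministic sampling: step x_{tau_S} back down to x_0."""
    steps = [0] + list(tau)
    x = x_last
    # Pairs (tau_S, tau_{S-1}), ..., (tau_1, 0)
    for t, t_prev in zip(steps[:0:-1], steps[-2::-1]):
        x = eta0_move(x, eps_model(x, t), alphas[t], alphas[t_prev])
    return x
```

The round trip `decode(encode(x0, ...), ...)` is only approximately the identity, because each step reuses a noise prediction made at the starting point of the move; the error shrinks as the subsequence gets finer, which is consistent with the reconstruction MSE falling as $S$ grows.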

Method Strengths and Weaknesses

Strengths

Reuses DDPM checkpoints without retraining.
CIFAR-10 reaches FID $4.16$ in $100$ steps.
Gives $10\times$ to $50\times$ faster sampling than DDPM in wall-clock time.
Deterministic paths support inversion and interpolation.

Weaknesses

Sampling remains iterative, not one-shot.
Very short chains still lose quality sharply.
Best overall FID still comes from $1000$-step DDPM.
Quality depends on timestep subsequence design.

Suggestions from the authors

Study shorter forward processes with the same denoising objective.
Use deterministic inversion as a latent representation.
Extend the construction beyond Gaussian continuous data.
Analyze continuous-time limits of the DDIM process.

