Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, Stefano Ermon

2020 · arXiv

Problem

Framing

DDPMs deliver strong image quality but need long reverse chains, usually $T=1000$ steps. DDIM narrows the latency gap by defining a non-Markovian forward family that shares the DDPM denoising training objective, then sampling along a short subsequence of steps with a deterministic or partially stochastic update. On CIFAR-10, DDIM reaches FID 4.16 in 100 steps.

Currently Used Methods

Direct antecedents

Proposed Method

Architecture

DDIM changes the sampler, not the denoiser. It reuses the DDPM network $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$, chooses a subsequence $\tau$ of length $S$ from the $T$ training timesteps, and controls randomness with $\eta$, where $\eta=0$ gives a deterministic trajectory.
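As a minimal sketch of how a subsequence $\tau$ of length $S$ might be chosen, the snippet below picks evenly spaced timesteps. The linear spacing, the choice to end at $T$, and the function name `make_subsequence` are illustrative assumptions, not details given in this summary.

```python
def make_subsequence(T: int, S: int) -> list[int]:
    """Return S strictly increasing timesteps in [1, T].

    Assumption: linear spacing that always ends at T. Other spacings
    are possible; this is only one illustrative choice.
    """
    if not 1 <= S <= T:
        raise ValueError("need 1 <= S <= T")
    # Integer arithmetic keeps the steps distinct whenever S <= T.
    return [(i + 1) * T // S for i in range(S)]


tau = make_subsequence(1000, 10)
print(tau)
```

Sampling then visits `tau` in reverse order, from the noisiest timestep down to the cleanest.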

Figure: Graphical models comparing standard diffusion inference (left) with the DDIM non-Markovian inference family (right); the DDIM graph conditions intermediate states on both later states and $\mathbf{x}_0$.

Loss / Objective

The non-Markovian family shares the DDPM denoising surrogate objective; the two differ only in the per-timestep weights $\gamma_t$.

$$L_{\gamma}(\boldsymbol{\epsilon}_\theta)=\sum_{t=1}^{T}\gamma_t\,\mathbb{E}_{\mathbf{x}_0,\boldsymbol{\epsilon}_t}\left[\left\|\boldsymbol{\epsilon}^{(t)}_\theta\left(\sqrt{\alpha_t}\,\mathbf{x}_0+\sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t\right)-\boldsymbol{\epsilon}_t\right\|_2^2\right]$$
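The objective above can be estimated by sampling noise at each timestep and comparing it with the network's prediction. The following is a rough one-sample sketch for a single data point, written with plain Python lists. The names `denoising_loss`, `eps_model`, `alphas`, and `gammas` are hypothetical, and `alphas[t-1]` is assumed to hold $\alpha_t$ in the notation of the formula.

```python
import math
import random


def denoising_loss(eps_model, x0, alphas, gammas, rng):
    """One-sample Monte Carlo estimate of L_gamma for one data point x0.

    eps_model(x_t, t) is a hypothetical callable returning a noise
    prediction with the same length as x_t.
    """
    total = 0.0
    for t, (alpha_t, gamma_t) in enumerate(zip(alphas, gammas), start=1):
        # Draw eps_t ~ N(0, I) and form the noisy input
        # x_t = sqrt(alpha_t) * x0 + sqrt(1 - alpha_t) * eps_t.
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        x_t = [math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e
               for x, e in zip(x0, eps)]
        pred = eps_model(x_t, t)
        # Weighted squared L2 error between predicted and true noise.
        total += gamma_t * sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total


# Usage with a trivial stand-in model that always predicts zero noise.
x0 = [0.5, -1.0, 2.0]
loss = denoising_loss(lambda x_t, t: [0.0] * len(x_t), x0,
                      alphas=[0.9, 0.5, 0.1], gammas=[1.0, 1.0, 1.0],
                      rng=random.Random(0))
print(loss)
```

Changing `gammas` reweights the timesteps without changing what the network is asked to predict, which is why a model trained under one weighting can be reused by the DDIM sampler.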

Sampling Rule

Sampling predicts $\hat{\mathbf{x}}_0$ from the current state and updates along the chosen subsequence $\tau$.

$$\mathbf{x}_{\tau_{i-1}}=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{\mathbf{x}_{\tau_i}-\sqrt{1-\alpha_{\tau_i}}\,\boldsymbol{\epsilon}^{(\tau_i)}_\theta(\mathbf{x}_{\tau_i})}{\sqrt{\alpha_{\tau_i}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_i}(\eta)^2}\,\boldsymbol{\epsilon}^{(\tau_i)}_\theta(\mathbf{x}_{\tau_i})+\sigma_{\tau_i}(\eta)\,\boldsymbol{\epsilon}_{\tau_i}$$

$$\sigma_{\tau_i}(\eta)=\eta\sqrt{\frac{1-\alpha_{\tau_{i-1}}}{1-\alpha_{\tau_i}}}\sqrt{1-\frac{\alpha_{\tau_i}}{\alpha_{\tau_{i-1}}}}$$
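A single update of the sampling rule can be sketched as follows. This is an illustrative reading of the two formulas above, not the authors' code: `sigma` and `ddim_step` are hypothetical names, vectors are plain Python lists, and `eps_pred` stands for the network output $\boldsymbol{\epsilon}^{(\tau_i)}_\theta(\mathbf{x}_{\tau_i})$ computed elsewhere.

```python
import math
import random


def sigma(alpha_t, alpha_prev, eta):
    """Noise scale sigma_{tau_i}(eta) from the formula above."""
    return (eta * math.sqrt((1.0 - alpha_prev) / (1.0 - alpha_t))
            * math.sqrt(1.0 - alpha_t / alpha_prev))


def ddim_step(x_t, eps_pred, alpha_t, alpha_prev, eta=0.0, rng=None):
    """One update x_{tau_i} -> x_{tau_{i-1}}; eta=0 is deterministic."""
    rng = rng or random
    s = sigma(alpha_t, alpha_prev, eta)
    # Predicted clean sample x0_hat.
    x0_hat = [(x - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
              for x, e in zip(x_t, eps_pred)]
    # Coefficient of the "direction pointing to x_t" term; the clamp
    # only guards against tiny negative values from rounding.
    dir_coef = math.sqrt(max(0.0, 1.0 - alpha_prev - s * s))
    # Fresh Gaussian noise is drawn only when sigma is positive.
    noise = [rng.gauss(0.0, 1.0) if s > 0.0 else 0.0 for _ in x_t]
    return [math.sqrt(alpha_prev) * x0 + dir_coef * e + s * n
            for x0, e, n in zip(x0_hat, eps_pred, noise)]


# Usage: a deterministic step with a made-up noise prediction.
x_prev = ddim_step([0.9, -1.2], [0.3, -0.7], alpha_t=0.5, alpha_prev=0.8)
print(x_prev)
```

With `eta=0` the noise term vanishes, so repeated calls from the same starting point trace the same trajectory. Setting `alpha_prev=1.0` on the last step returns the predicted $\hat{\mathbf{x}}_0$ directly.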

Training Procedure

Evaluation

Datasets

Metrics

Headline results

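Table 1: CIFAR10 and CelebA image generation measured in FID. $S$ is the number of sampling steps.

| $S$ | CIFAR10 $\eta=0.0$ | CIFAR10 $\eta=0.2$ | CIFAR10 $\eta=0.5$ | CIFAR10 $\eta=1.0$ | CIFAR10 $\hat{\sigma}$ | CelebA $\eta=0.0$ | CelebA $\eta=0.2$ | CelebA $\eta=0.5$ | CelebA $\eta=1.0$ | CelebA $\hat{\sigma}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 13.36 | 14.04 | 16.66 | 41.07 | 367.43 | 17.33 | 17.66 | 19.86 | 33.12 | 299.71 |
| 20 | 6.84 | 7.11 | 8.35 | 18.36 | 133.37 | 13.73 | 14.11 | 16.06 | 26.03 | 183.83 |
| 50 | 4.67 | 4.77 | 5.25 | 8.01 | 32.72 | 9.17 | 9.51 | 11.01 | 18.48 | 71.71 |
| 100 | 4.16 | 4.25 | 4.46 | 5.78 | 9.99 | 6.53 | 6.79 | 8.09 | 13.93 | 45.20 |
| 1000 | 4.04 | 4.09 | 4.29 | 4.73 | 3.17 | 3.51 | 3.64 | 4.28 | 5.98 | 3.26 |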

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers