Classifier-Free Diffusion Guidance

Jonathan Ho, Tim Salimans

2022 · arXiv

Problem

Framing

Classifier guidance gives diffusion models a fidelity–diversity control knob, but it needs a separate noisy-image classifier and classifier gradients at sampling. This paper removes that dependency by training one denoiser with conditioning dropout, then mixing its conditional and unconditional predictions at test time.

Currently Used Methods

Foundational and direct antecedents

Proposed Method

Architecture

The method keeps the class-conditional diffusion U-Net from @dhariwalDiffusionBeatGANs2021. During training, the class label is replaced by a null token $\varnothing$ with probability $p_{\mathrm{uncond}}$, so one network learns both the conditional and the unconditional denoiser.
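The label-dropout step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the `NULL_TOKEN` id, the `drop_labels` helper, and the value `p_uncond=0.1` are assumptions chosen for the example.

```python
import numpy as np

NULL_TOKEN = -1  # hypothetical id reserved for the null label (assumption)

def drop_labels(labels, p_uncond, rng):
    """Replace each label by NULL_TOKEN independently with probability p_uncond."""
    labels = np.asarray(labels)
    drop = rng.random(labels.shape) < p_uncond  # Bernoulli(p_uncond) mask per example
    return np.where(drop, NULL_TOKEN, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 1000, size=10_000)           # toy class ids
dropped = drop_labels(labels, p_uncond=0.1, rng=rng)  # p_uncond value is illustrative
frac_null = float(np.mean(dropped == NULL_TOKEN))     # empirical dropout rate
```

In a real model the null token would typically map to its own learned embedding, so the same network can be queried with or without a class.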

Toy guidance figure: a three-Gaussian mixture sharpens toward class modes as guidance strength increases

Loss / Objective

Training uses the standard continuous-time $\boldsymbol{\epsilon}$-prediction loss with conditioning dropout.

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathbf{x},c,\lambda,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},c)\right\|_2^2\right],\qquad c\leftarrow \varnothing\ \text{with probability }p_{\mathrm{uncond}}$$

$$\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}+\sigma_{\lambda}\boldsymbol{\epsilon},\qquad \alpha_{\lambda}^2=\frac{1}{1+e^{-\lambda}},\qquad \sigma_{\lambda}^2=1-\alpha_{\lambda}^2$$
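As a rough sketch of how the objective could be estimated on one batch, the snippet below noises the data with $\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}+\sigma_{\lambda}\boldsymbol{\epsilon}$ and averages the squared error. The helper names, the uniform draw of $\lambda$, and the placeholder `zero_model` are assumptions for illustration. The labels `c` are assumed to have already gone through conditioning dropout.

```python
import numpy as np

def alpha_sigma(lam):
    """Schedule from the text: alpha^2 = 1 / (1 + exp(-lam)), sigma^2 = 1 - alpha^2."""
    alpha_sq = 1.0 / (1.0 + np.exp(-lam))
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

def eps_loss(eps_model, x, c, lam, rng):
    """Monte Carlo estimate of the epsilon-prediction loss over one batch."""
    eps = rng.standard_normal(x.shape)              # target noise
    alpha, sigma = alpha_sigma(lam)
    z = alpha[:, None] * x + sigma[:, None] * eps   # noised input z_lambda
    pred = eps_model(z, c, lam)
    return float(np.mean(np.sum((eps - pred) ** 2, axis=-1)))

def zero_model(z, c, lam):
    # Placeholder denoiser that always predicts zero noise (stand-in for the U-Net)
    return np.zeros_like(z)

rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 8))           # toy 8-dimensional "images"
c = rng.integers(0, 10, size=4096)           # toy labels, dropout assumed applied
lam = rng.uniform(-20.0, 20.0, size=4096)    # illustrative log-SNR draw (assumption)
loss = eps_loss(zero_model, x, c, lam, rng)
```

With the zero-predicting placeholder the loss is simply the mean squared norm of the noise, close to the data dimension (8 here), which gives a convenient sanity check before plugging in a trained network.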

Sampling Rule / Algorithm

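Sampling replaces classifier gradients with a linear combination of the conditional and unconditional score estimates. For $w>0$ the weights are $1+w$ and $-w$, so the result extrapolates beyond the conditional estimate, away from the unconditional one, rather than interpolating between the two.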

$$\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},c)=(1+w)\,\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},c)-w\,\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\varnothing),\qquad w\ge 0$$
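The guided prediction is a one-line combination of two forward passes of the same network. The sketch below applies the rule to hand-picked toy vectors. The function name `guided_eps` and the numbers are illustrative assumptions.

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guided prediction: (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_cond = np.array([1.0, 2.0])    # toy conditional prediction eps(z, c)
eps_uncond = np.array([0.5, 1.0])  # toy unconditional prediction eps(z, null)

g0 = guided_eps(eps_cond, eps_uncond, w=0.0)  # w = 0 recovers the conditional model
g2 = guided_eps(eps_cond, eps_uncond, w=2.0)  # w = 2 pushes away from the unconditional one
```

Equivalently, the rule can be read as `eps_cond + w * (eps_cond - eps_uncond)`: the conditional prediction plus `w` times the direction that separates it from the unconditional prediction.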

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: Baseline comparison on ImageNet $128\times128$.

| Model | FID ($\downarrow$) | IS ($\uparrow$) |
| --- | --- | --- |
| BigGAN-deep, max IS (Brock et al., 2019) | 25 | 253 |
| BigGAN-deep (Brock et al., 2019) | 5.7 | 124.5 |
| CDM (Ho et al., 2021) | 3.52 | 128.8 |
| LOGAN (Wu et al., 2019) | 3.36 | 148.2 |
| ADM-G (Dhariwal & Nichol, 2021) | 2.97 | - |

Sample grid: ImageNet cats, corgis, and volcanoes generated with stronger classifier-free guidance, showing sharper but less diverse outputs

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers