Classifier-Free Diffusion Guidance

Jonathan Ho, Tim Salimans

2022 · arXiv

Problem

Framing

Classifier guidance gives diffusion models a knob for trading diversity against fidelity, but it requires a separate classifier trained on noisy images and the gradients of that classifier at every sampling step. This paper removes that dependency by training a single denoiser with conditioning dropout and then combining its conditional and unconditional predictions at sampling time.

Currently Used Methods

Foundational and direct antecedents

@DenoisingDiffusionProbabilisticModels2020 — denoising objective and ancestral reverse sampler.
- Limitation in context: no classifier-free guidance rule.
@songScoreSDE2020 — continuous-time score-based diffusion formulation.
- Limitation in context: conditional steering still uses external guidance.
@dhariwalDiffusionBeatGANs2021 — classifier guidance for sharper class-conditional samples.
- Limitation in context: requires a noisy classifier and gradients.
@nicholImprovedDDPM2021 — improved diffusion parameterization and faster sampling.
- Limitation in context: does not remove classifier dependence.
@ronnebergerUNet2015 — U-Net backbone for denoising networks.
- Limitation in context: no null-conditioning scheme by itself.

Proposed Method

Architecture

The method keeps the class-conditional diffusion U-Net from @dhariwalDiffusionBeatGANs2021. Training drops the class label to a null token $\varnothing$ with probability $p_{\mathrm{uncond}}$, so one network learns both conditional and unconditional denoisers.
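The conditioning dropout described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `NULL_TOKEN` and `drop_condition` are hypothetical, `None` stands in for the null token $\varnothing$, and the drop is assumed to be applied independently per training example.

```python
import random

# Hypothetical stand-in for the null token; a real model would more
# likely reserve an extra class index or a learned embedding.
NULL_TOKEN = None


def drop_condition(labels, p_uncond, rng):
    """Replace each label with NULL_TOKEN independently with probability p_uncond."""
    return [NULL_TOKEN if rng.random() < p_uncond else c for c in labels]


rng = random.Random(0)          # seeded only to make the sketch repeatable
batch_labels = [3, 7, 7, 1, 0, 9, 4, 2]
print(drop_condition(batch_labels, p_uncond=0.2, rng=rng))
```

Because the same network sees both real labels and the null token, no second model is needed for the unconditional prediction at sampling time.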

Toy guidance figure: a three-Gaussian mixture sharpens toward class modes as guidance strength increases

Loss / Objective

Training uses the standard continuous-time $\boldsymbol{\epsilon}$-prediction loss with conditioning dropout.

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathbf{x},c,\lambda,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},c)\right\|_2^2\right],\qquad c\leftarrow \varnothing\ \text{with probability }p_{\mathrm{uncond}}$$

$$\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}+\sigma_{\lambda}\boldsymbol{\epsilon},\qquad \alpha_{\lambda}^2=\frac{1}{1+e^{-\lambda}},\qquad \sigma_{\lambda}^2=1-\alpha_{\lambda}^2$$
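The forward process and per-example loss above translate fairly directly into code. The sketch below is illustrative only: the helper names `alpha_sigma`, `diffuse`, and `eps_loss` are hypothetical, and vectors are plain Python lists rather than image tensors. Since $\alpha_{\lambda}^2$ is the sigmoid of $\lambda$, the ratio $\alpha_{\lambda}^2/\sigma_{\lambda}^2$ equals $e^{\lambda}$, so $\lambda$ acts as a log signal-to-noise ratio.

```python
import math


def alpha_sigma(lam):
    """Return (alpha, sigma) for log-SNR lam, using alpha^2 = sigmoid(lam)."""
    alpha_sq = 1.0 / (1.0 + math.exp(-lam))
    return math.sqrt(alpha_sq), math.sqrt(1.0 - alpha_sq)


def diffuse(x, eps, lam):
    """Form z_lambda = alpha * x + sigma * eps, elementwise."""
    alpha, sigma = alpha_sigma(lam)
    return [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]


def eps_loss(eps, eps_pred):
    """Squared L2 error between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


# At lam = 0 the signal and noise variances are equal (alpha^2 = sigma^2 = 0.5).
alpha, sigma = alpha_sigma(0.0)
z = diffuse([1.0, -1.0], [0.3, 0.1], 0.0)
```

In training, `eps_pred` would come from the network evaluated on `z` and the (possibly dropped) label; that network is outside the scope of this sketch.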

Sampling Rule / Algorithm

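Sampling replaces classifier gradients with a linear combination of the conditional and unconditional noise predictions. For $w>0$ the combination extrapolates past the conditional prediction, away from the unconditional one, rather than interpolating between the two.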

$$\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},c)=(1+w)\,\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},c)-w\,\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\varnothing),\qquad w\ge 0$$
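As a rough sketch, the guided prediction is a one-line combination of two network outputs. The function name `guided_eps` is hypothetical, and the two inputs are assumed to be the outputs of the same network evaluated once with the label and once with the null token. Rearranged, the rule reads $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},c)+w\,[\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},c)-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\varnothing)]$, which makes the extrapolation explicit.

```python
def guided_eps(eps_cond, eps_uncond, w):
    """Combine conditional and unconditional noise predictions.

    Computes (1 + w) * eps_cond - w * eps_uncond elementwise; w = 0
    returns the plain conditional prediction.
    """
    if w < 0:
        raise ValueError("guidance strength w must be non-negative")
    return [(1.0 + w) * ec - w * eu for ec, eu in zip(eps_cond, eps_uncond)]


# Toy values in place of real network outputs.
eps_c = [1.0, 2.0]
eps_u = [0.5, 2.0]
no_guidance = guided_eps(eps_c, eps_u, 0.0)   # equals eps_c
guided = guided_eps(eps_c, eps_u, 1.0)        # pushed away from eps_u
```

Each sampling step then needs two forward passes of the denoiser, one per conditioning input, in place of a classifier gradient.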

Training Procedure

Conditioning dropout probability: $p_{\mathrm{uncond}}\in\{0.1,0.2,0.5\}$.
Guidance strength: $w\in\{0,0.1,0.2,\ldots,4\}$.
Log-SNR range: $\lambda_{\min}=-20$, $\lambda_{\max}=20$.
Sampling steps: $T\in\{128,256,1024\}$.
FID and IS are computed from $50{,}000$ samples.
Reuses the base architecture and most hyperparameters of @dhariwalDiffusionBeatGANs2021.

Evaluation

Datasets

ImageNet $64\times 64$, class-conditional.
ImageNet $128\times 128$, class-conditional.

Metrics

FID.
Inception Score.

Headline results

ImageNet $64\times 64$ ($w=0$): FID 1.48, IS 67.95.
ImageNet $64\times 64$ ($w=1.0$): FID 12.6, IS 170.1.
ImageNet $64\times 64$ ($w=3.0$): FID 24.83, IS 250.4.
ImageNet $128\times 128$ ($w=1.0$): FID 7.86, IS 297.98.
ImageNet $128\times 128$ ($w=4.0$): FID 21.53, IS 421.03.

Table 1: Baseline comparison on ImageNet $128\times128$.

Model	FID ($\downarrow$)	IS ($\uparrow$)
BigGAN-deep, max IS (Brock et al., 2019)	25	253
BigGAN-deep (Brock et al., 2019)	5.7	124.5
CDM (Ho et al., 2021)	3.52	128.8
LOGAN (Wu et al., 2019)	3.36	148.2
ADM-G (Dhariwal & Nichol, 2021)	2.97	-

Sample grid: ImageNet cats, corgis, and volcanoes generated with stronger classifier-free guidance, showing sharper but less diverse outputs

Ablations

Guidance strength $w$: IS rises as $w$ increases, while FID worsens once $w$ moves beyond small values.
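Dropout $p_{\mathrm{uncond}}$: $0.1$ and $0.2$ perform about equally well, while $0.5$ is consistently worse, so only a small share of training needs to go to the unconditional task.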
Sampler steps $T$: more steps improve sample quality; $T=256$ balances quality against sampling cost.
Strong guidance: samples become sharper, more saturated, and more repetitive.

Method Strengths and Weaknesses

Strengths

Removes the auxiliary classifier and its extra training pipeline.
Adds one sampling knob $w$ for fidelity–diversity control.
Reuses existing class-conditional diffusion architectures unchanged.
Works on both $64\times64$ and $128\times128$ ImageNet.

Weaknesses

Large $w$ improves IS while degrading FID.
Strong guidance yields saturated colors and repeated motifs.
Quality depends on tuning both $p_{\mathrm{uncond}}$ and $w$.
The paper gives little theory for the interpolation rule.

Suggestions from the authors

Derive why classifier-free interpolation improves classifier-based metrics.
Tune architectures and hyperparameters specifically for classifier-free guidance.
Test the method in other conditional diffusion settings.
Study why pure generative diffusion models maximize classifier-based metrics.
