High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz

2022 · CVPR

Problem

Framing

Pixel-space diffusion delivers strong likelihoods, but its cost scales badly with image resolution. The paper shifts diffusion to a perceptually compressed latent space, preserving fidelity while cutting training and sampling cost enough for high-quality $256\times256$ synthesis and flexible conditioning.

Currently Used Methods

Foundational

@DenoisingDiffusionProbabilisticModels2020 — pixel-space diffusion with the standard noise-prediction training objective.
- Limitation in context: reverse diffusion stays expensive at high resolution.
@dhariwalDiffusionBeatGANs2021 — guided diffusion with strong ImageNet sample quality.
- Limitation in context: still pays full pixel-space training and sampling cost.
@DenoisingDiffusionImplicitModels2020 — faster non-Markovian diffusion sampling.
- Limitation in context: acceleration alone does not remove pixel-space dimensionality.
@kingmaVAE2013 — continuous latent compression for cheaper generative modeling.
- Limitation in context: latent models alone did not match diffusion quality.
@rameshDALLE2021 — two-stage latent generation for text-to-image synthesis.
- Limitation in context: discrete autoregressive priors remain slow and heavy.

Proposed Method

Architecture

LDM splits generation into an autoencoder $\mathcal{E},\mathcal{D}$ and a latent diffusion U-Net over $\mathbf{z}=\mathcal{E}(\mathbf{x})$. At $256\times256$, typical latent grids are $64\times64\times3$ for $f=4$ and $32\times32\times4$ for $f=8$. Conditioning enters by concatenation or cross-attention inside the denoiser.
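The latent grid sizes quoted above follow from dividing each spatial side by $f$. A minimal sketch of that arithmetic, where the helper names `latent_shape` and `compression_ratio` are illustrative and not taken from the paper:

```python
def latent_shape(image_size, f, latent_channels):
    """Return (height, width, channels) of the latent grid for a square image."""
    if image_size % f != 0:
        raise ValueError("image_size must be divisible by the downsampling factor f")
    side = image_size // f
    return (side, side, latent_channels)


def compression_ratio(image_size, f, latent_channels, image_channels=3):
    """Ratio of pixel-space element count to latent-space element count."""
    h, w, c = latent_shape(image_size, f, latent_channels)
    return (image_size * image_size * image_channels) / (h * w * c)


shape_f4 = latent_shape(256, 4, 3)       # (64, 64, 3)
shape_f8 = latent_shape(256, 8, 4)       # (32, 32, 4)
ratio_f4 = compression_ratio(256, 4, 3)  # 16.0
ratio_f8 = compression_ratio(256, 8, 4)  # 48.0
```

Under these numbers the denoiser at $f=8$ operates on 48 times fewer values than a pixel-space model at the same image size, which is where the compute saving comes from.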

Architecture diagram: an encoder maps images to latent space, diffusion runs with a latent U-Net, and conditioning enters through concatenation or cross-attention blocks.
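For reference, the cross-attention conditioning can be written in the usual scaled dot-product form. The projection matrices $W_Q, W_K, W_V$, the flattened intermediate U-Net feature $\varphi(\mathbf{z}_t)$, and the key dimension $d$ are notation assumed here, not symbols defined elsewhere in this note:

$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,\qquad
Q=W_Q\,\varphi(\mathbf{z}_t),\quad K=W_K\,\tau_{\theta}(\mathbf{y}),\quad V=W_V\,\tau_{\theta}(\mathbf{y})
$$

Queries come from the latent features, while keys and values come from the encoded condition $\tau_{\theta}(\mathbf{y})$, so the same block can accept any conditioning that can be encoded as a token sequence.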

Loss / Objective

The denoiser predicts Gaussian noise in latent space:

$$L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(\mathbf{x}),\,\mathbf{y},\,\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\,t}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_t,t,\tau_{\theta}(\mathbf{y}))\right\|_2^2\right]$$
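The objective does not say how $\mathbf{z}_t$ is produced. A minimal sketch, assuming the standard DDPM forward process $\mathbf{z}_t=\sqrt{\bar\alpha_t}\,\mathbf{z}_0+\sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$; the symbol $\bar\alpha_t$ and the function names are assumptions for illustration, and latents are flattened to plain lists to keep the example dependency-free:

```python
import math
import random


def noise_latent(z0, alpha_bar_t, rng):
    """Draw eps ~ N(0, I) and return (z_t, eps) under the assumed DDPM forward process."""
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    z_t = [signal * z + noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
z0 = [0.3, -1.2, 0.8, 0.0]                    # stand-in for a flattened latent E(x)
z_t, eps = noise_latent(z0, 0.5, rng)
perfect = noise_prediction_loss(eps, eps)      # a perfect denoiser gives zero loss
zero_guess = noise_prediction_loss(eps, [0.0] * len(eps))
```

In training, the per-sample loss would be averaged over images, conditions, noise draws, and timesteps, matching the expectation in the objective.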

Sampling Rule / Algorithm

Sampling runs reverse diffusion on latents, then decodes once at the end:

$$\mathbf{z}_{t-1} \sim p_{\theta}(\mathbf{z}_{t-1}\mid \mathbf{z}_t, \mathbf{y}), \qquad \mathbf{x}=\mathcal{D}(\mathbf{z}_0)$$
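The control flow of this rule can be sketched as a loop that calls the denoiser once per step and the decoder exactly once. The `toy_denoise_step` and `toy_decode` functions below are hypothetical stand-ins for the trained U-Net step and $\mathcal{D}$; only the loop structure reflects the rule above:

```python
import random


def sample_then_decode(denoise_step, decode, latent_size, num_steps, cond, rng):
    """Run reverse diffusion in latent space, then decode the final latent once."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]  # z_T ~ N(0, I)
    for t in range(num_steps, 0, -1):
        z = denoise_step(z, t, cond)  # stands for z_{t-1} ~ p_theta(z_{t-1} | z_t, y)
    return decode(z)                  # x = D(z_0)


calls = {"denoise": 0, "decode": 0}


def toy_denoise_step(z, t, cond):
    # Hypothetical placeholder: a real step would query the conditioned U-Net.
    calls["denoise"] += 1
    return [0.5 * v for v in z]


def toy_decode(z):
    # Hypothetical placeholder for the first-stage decoder D.
    calls["decode"] += 1
    return list(z)


x = sample_then_decode(toy_denoise_step, toy_decode,
                       latent_size=32 * 32 * 4, num_steps=200,
                       cond="a text prompt", rng=random.Random(0))
```

Every iteration touches only the 4096-value latent; the full-resolution image is materialized only in the single decoder call.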

Training Procedure

First stage: perceptual + adversarial autoencoder.
Latent downsampling factors studied: $f\in\{1,2,4,8,16,32\}$.
Unconditional models: 500k diffusion training steps.
Class-conditional ImageNet curves shown up to 2M steps.
Text-to-image sample figure uses 200 DDIM steps, $\eta=1.0$.
Text guidance example uses guidance scale $s=10.0$.
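The guidance scale $s$ can be read through the usual classifier-free guidance combination, $\tilde{\boldsymbol{\epsilon}}=\boldsymbol{\epsilon}_{\text{uncond}}+s\,(\boldsymbol{\epsilon}_{\text{cond}}-\boldsymbol{\epsilon}_{\text{uncond}})$. This sketch assumes that convention applies to the quoted $s=10.0$; the function name and the toy noise vectors are illustrative only:

```python
def guided_noise(eps_uncond, eps_cond, s):
    """Combine unconditional and conditional noise predictions with guidance scale s."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 0.5]   # toy prediction with the condition dropped
eps_cond = [1.0, 1.5]     # toy prediction with the condition present

no_guidance = guided_noise(eps_uncond, eps_cond, 0.0)     # equals eps_uncond
plain_cond = guided_noise(eps_uncond, eps_cond, 1.0)      # equals eps_cond
strong = guided_noise(eps_uncond, eps_cond, 10.0)         # extrapolates past eps_cond
```

With $s>1$ the prediction is pushed beyond the conditional estimate in the direction away from the unconditional one, which typically trades sample diversity for closer adherence to the prompt.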

Evaluation

Datasets

CelebA-HQ $256\times256$
FFHQ $256\times256$
LSUN-Churches $256\times256$
LSUN-Bedrooms $256\times256$
ImageNet $256\times256$
LAION
COCO
OpenImages

Metrics

FID
IS
Precision
Recall
PSNR
SSIM
R-FID
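Of the metrics listed, PSNR is simple enough to state directly. A minimal sketch, assuming images are flattened to equal-length sequences with values in `[0, max_value]`; the function name and defaults are illustrative:

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length value sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)


identical = psnr([0.0, 1.0], [0.0, 1.0])
off_by_tenth = psnr([0.0, 0.0], [0.1, 0.1])  # MSE of 0.01 gives roughly 20 dB
```

Higher is better. PSNR rewards pixel-wise agreement only, so it is usually read together with perceptual measures such as R-FID when judging a first-stage autoencoder.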

Headline results

CelebA-HQ $256\times256$ unconditional: FID $5.11$.
FFHQ $256\times256$ unconditional: FID $4.98$.
LSUN-Churches $256\times256$ unconditional: FID $4.02$.
LSUN-Bedrooms $256\times256$ unconditional: FID $2.95$.
ImageNet $256\times256$ class-conditional: competitive with ADM at lower parameter and compute budgets.

Sample grid: random generations from LDMs on CelebA-HQ, FFHQ, LSUN-Churches, LSUN-Bedrooms, and class-conditional ImageNet at $256\times256$.

Ablations

Downsampling factor $f$: moderate compression ($f=4$ to $8$) gives the best fidelity-efficiency tradeoff.
First-stage tokenizer: overly aggressive compression degrades reconstruction and downstream generation.
Sampling steps: latent models keep strong FID at lower step counts than pixel-space diffusion.
Conditioning interface: cross-attention extends one backbone across text, layout, class, inpainting, and super-resolution.

Method Strengths and Weaknesses

Strengths

Cuts diffusion cost by moving denoising to compressed latents.
One conditioning mechanism covers text, layout, class labels, and image editing.
Strong $256\times256$ unconditional quality: FID $2.95$ on LSUN-Bedrooms.
Compression ablations give a clear operating regime around $f=4$ to $8$.

Weaknesses

Quality depends heavily on first-stage autoencoder design.
High-quality synthesis still needs many DDIM steps.
Native training resolution in core experiments stays at $256\times256$.
Reconstruction bottlenecks can discard details before diffusion starts.

Suggestions from the authors

Improve perceptual compression without losing semantics needed for generation.
Extend latent diffusion to more vision and multimodal tasks.
Study better conditioning interfaces than the current cross-attention design.
Push beyond native training resolution with convolutional or tiled sampling.

Links

Prior Papers

Further Papers