Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli

2015 · arXiv

Problem

Framing

Earlier generative models had not combined a normalized likelihood, exact ancestral sampling, and easy conditional manipulation in one stochastic framework. The paper closes that gap by diffusing data to a tractable terminal distribution and then learning only the local reverse kernels. On dead leaves it reports $1.184$ bits/pixel, against $1.244$ for MCGSM.

Currently Used Methods

Foundational

@hintonDeepBeliefNets2006 — greedy layerwise training for deep probabilistic models.
- Limitation in context: no stochastic diffusion path with local reverse-step targets.
@kingmaVAE2013 — amortized latent-variable likelihood modeling.
- Limitation in context: depends on approximate inference, not reversible corruption dynamics.
@goodfellowGAN2014 — adversarial generation with sharp samples.
- Limitation in context: lacks normalized likelihood and cheap state-probability evaluation.
@dinhNVP2017 — invertible exact-likelihood transformations.
- Limitation in context: uses deterministic bijections, not stochastic diffusion transitions.

Proposed Method

Architecture

The model defines a forward diffusion chain $q(\mathbf{x}^{(0\cdots T)})$ and a learned reverse chain $p(\mathbf{x}^{(0\cdots T)})$. For images, the reverse model predicts a per-step mean and covariance from multiscale convolutions followed by several $1 \times 1$ convolutions. Time enters through learned bump-function coefficients over $t$.
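One way to picture the time dependence is as a softmax-normalized set of Gaussian bumps over $t$ that mix a stack of per-bump network outputs. The sketch below is a minimal illustration under stated assumptions: the evenly spaced bump centers, the width equal to their spacing, the number of bumps, and the names `bump_weights` and `time_mixed_output` are all illustrative choices, not details given in this summary.

```python
import numpy as np

def bump_weights(t, T, num_bumps=10):
    """Softmax-normalized Gaussian bumps over time (illustrative parameterization)."""
    centers = np.linspace(0.0, T, num_bumps)   # assumed: evenly spaced centers
    width = centers[1] - centers[0]            # assumed: width equals the spacing
    logits = -0.5 * ((t - centers) / width) ** 2
    logits -= logits.max()                     # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def time_mixed_output(y, t, T):
    """Mix per-bump outputs y of shape (num_bumps, ...) into one output at time t."""
    w = bump_weights(t, T, num_bumps=y.shape[0])
    return np.tensordot(w, y, axes=(0, 0))
```

Because the weights sum to one, the mixed output varies smoothly with $t$ while the network itself stays shared across time steps.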

Swiss-roll diffusion figure: top row shows forward noising from the data spiral to a near-Gaussian cloud; middle row shows samples from the learned reverse chain; bottom row shows learned reverse drift vectors at three times.
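The forward noising in the top row can be reproduced with a few lines of NumPy. This is a sketch, assuming the Gaussian kernel $q(\mathbf{x}^{(t)} \mid \mathbf{x}^{(t-1)}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}^{(t-1)}, \beta_t \mathbf{I})$; the toy spiral generator and the linear `betas` schedule are assumptions made for illustration, not the paper's settings.

```python
import numpy as np

def make_swiss_roll(n, rng):
    """Toy 2-D spiral, standardized per axis; a stand-in for the Swiss-roll data."""
    theta = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))
    pts = np.stack([theta * np.cos(theta), theta * np.sin(theta)], axis=1)
    return (pts - pts.mean(axis=0)) / pts.std(axis=0)

def forward_diffuse(x0, betas, rng):
    """Apply the Gaussian forward kernel once per entry of betas; return all states."""
    x = x0.copy()
    trajectory = [x]
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        trajectory.append(x)
    return trajectory

rng = np.random.default_rng(0)
x0 = make_swiss_roll(2000, rng)
betas = np.linspace(0.02, 0.3, 40)   # assumed schedule with T = 40 steps
traj = forward_diffuse(x0, betas, rng)
```

With unit-variance inputs this kernel preserves the variance at every step, so `traj[-1]` is close to a standard Gaussian cloud while `traj[0]` is still the spiral.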

Loss / Objective

Training maximizes a variational lower bound on log likelihood.

L \ge K

K = - \sum_{t=2}^{T} \int d\mathbf{x}^{(0)} \, d\mathbf{x}^{(t)} \, q(\mathbf{x}^{(0)}, \mathbf{x}^{(t)}) \, D_{KL}\!\left(q(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}, \mathbf{x}^{(0)}) \,\|\, p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)})\right) + H_q(X^{(T)} \mid X^{(0)}) - H_q(X^{(1)} \mid X^{(0)}) - H_p(X^{(T)})
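For Gaussian diffusion, both arguments of each KL term are Gaussian, so the term has a closed form. The helper below is a minimal sketch assuming diagonal covariances; the function name and argument layout are illustrative, not taken from the paper's code.

```python
import numpy as np

def gaussian_kl_diag(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in nats, summed over dims."""
    mu_q, var_q, mu_p, var_p = (
        np.asarray(a, dtype=float) for a in (mu_q, var_q, mu_p, var_p)
    )
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
```

The value is zero when the two Gaussians coincide and grows as the predicted reverse mean or variance drifts from the forward posterior.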

Sampling Rule / Algorithm

Sampling starts from the tractable terminal distribution and applies learned reverse kernels for $T$ steps.

\mathbf{x}^{(T)} \sim \pi(\mathbf{x}^{(T)}), \qquad \mathbf{x}^{(t-1)} \sim p(\mathbf{x}^{(t-1)} \mid \mathbf{x}^{(t)}), \quad t = T, \ldots, 1
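The rule above can be sketched as a short loop. This assumes the Gaussian case with a standard normal $\pi$ and a reverse kernel that returns a mean and a diagonal variance; `toy_reverse_kernel` is a hypothetical stand-in for a trained network, chosen because it is the exact reverse kernel when the data themselves are standard normal.

```python
import numpy as np

def ancestral_sample(reverse_kernel, T, shape, rng):
    """Draw x_T ~ N(0, I), then x_{t-1} ~ N(mean, var) for t = T, ..., 1."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        mean, var = reverse_kernel(x, t)
        x = mean + np.sqrt(var) * rng.standard_normal(shape)
    return x

def toy_reverse_kernel(x, t, beta=0.05):
    """Hypothetical placeholder for a learned model: mean sqrt(1-beta)*x, variance beta."""
    return np.sqrt(1.0 - beta) * x, beta * np.ones_like(x)

rng = np.random.default_rng(0)
samples = ancestral_sample(toy_reverse_kernel, T=40, shape=(5000, 2), rng=rng)
```

Each sample costs $T$ sequential kernel evaluations, which is where the linear sampling cost comes from.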

Training Procedure

Swiss roll: $T=40$ Gaussian steps.
Binary heartbeat: $T=2000$ binomial steps.
Images: $T=1000$.
Bark: $T=500$.
Optimizer: RMSprop.
Swiss roll MLP hidden width: $16$ .
Heartbeat MLP: $3$ hidden layers, $50$ units each.

Evaluation

Datasets

Swiss roll
Binary heartbeat sequences
MNIST
CIFAR-10
Bark textures
Dead leaves

Metrics

Variational lower bound $K$
Improvement over null model $K - L_{\mathrm{null}}$
Bits
Bits per sequence
Bits per pixel
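The metrics above differ only in units and baseline, which a few helper functions make explicit. This is a minimal sketch; the function names are illustrative, and it assumes the bound is first computed in nats, which this summary does not state.

```python
import math

def nats_to_bits(value_nats):
    """Convert a log likelihood (or bound) from nats to bits."""
    return value_nats / math.log(2.0)

def bits_per_pixel(total_bits, num_pixels):
    """Normalize a per-image bound by the number of pixels."""
    return total_bits / num_pixels

def implied_null_bound(K, improvement):
    """Recover L_null from K and the reported improvement K - L_null."""
    return K - improvement

# Swiss-roll figures from this summary: K = 2.35 bits, K - L_null = 6.45 bits.
swiss_roll_null = implied_null_bound(2.35, 6.45)
```

For the Swiss roll this gives an implied null-model bound of about $-4.10$ bits.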

Headline results

Swiss roll: $K=2.35$ bits, $K-L_{\mathrm{null}}=6.45$ bits.
Binary heartbeat: $K=-2.414$ bits/seq; true process $-2.322$ bits/seq.
Dead leaves: $1.184$ bits/pixel, against $1.244$ bits/pixel for MCGSM.
CIFAR-10: posterior denoising and unconditional samples are shown.
MNIST: true ancestral samples are shown.
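The $-2.322$ bits/seq figure for the true heartbeat process is what a uniform distribution over five equally likely sequences would give. That reading is an assumption here, since this summary does not describe how the heartbeat data are generated. Under it, the arithmetic is:

```latex
\log_2 \tfrac{1}{5} = -\log_2 5 \approx -2.322 \ \text{bits/seq},
\qquad
-2.322 - (-2.414) = 0.092 \ \text{bits/seq}
```

So the learned bound falls short of the true log likelihood by roughly $0.09$ bits per sequence.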

CIFAR-10 qualitative results: holdout images, the same images corrupted with Gaussian noise, posterior denoised reconstructions, and unconditional samples from the diffusion model.

Table 1: Log-likelihood summary across datasets.

Dataset	$K$	$K - L_{\mathrm{null}}$
Swiss Roll	2.35 bits	6.45 bits
Binary Heartbeat	-2.414 bits/seq	2.676 bits/seq
MNIST	82.90 bits	136.7 bits
CIFAR-10	4.51 bits/pixel	0.59 bits/pixel
Dead Leaves (MCGSM)	1.244 bits/pixel	-
Dead Leaves (diffusion probabilistic model)	1.184 bits/pixel	0.53 bits/pixel

Ablations

Diffusion length $T$: larger $T$ makes each reverse step easier to estimate.
Forward kernel family: Gaussian and binomial diffusions use one training framework.
Noise schedule: likelihood depends on the hand-chosen diffusion-rate schedule.
Task type: the same formulation covers continuous images and binary sequences.
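To make the role of the schedule concrete, the sketch below builds one example of a fixed schedule, $\beta_t = 1/(T - t + 1)$, which removes a constant $1/T$ of the original signal at each step. Treat it as an illustration: the function names are hypothetical, and $T=2000$ is used only because it matches the heartbeat setting listed above.

```python
import numpy as np

def constant_erasure_schedule(T):
    """beta_t = 1 / (T - t + 1) for t = 1, ..., T."""
    t = np.arange(1, T + 1)
    return 1.0 / (T - t + 1)

def surviving_fraction(betas):
    """Fraction of the original signal left after each step: prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - betas)

betas = constant_erasure_schedule(2000)
surviving = surviving_fraction(betas)
```

The product telescopes to $(T - t)/T$, so the signal decays linearly and reaches exactly zero at $t = T$.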

Method Strengths and Weaknesses

Strengths

Reduces density learning to local reverse-step estimation.
Supports exact ancestral sampling from a normalized model.
Handles Gaussian images and binomial sequences in one framework.
Reports $1.184$ bits/pixel on dead leaves, close to MCGSM's $1.244$.

Weaknesses

Sampling cost scales linearly with the full horizon $T$.
Image backbone predates U-Net-style denoisers.
CIFAR-10 denoising and samples are assessed only qualitatively; the sole quantitative figure is the bound in Table 1.
Performance depends on hand-designed diffusion schedules.

Suggestions from the authors

Learn diffusion-rate schedules instead of fixing them.
Extend the framework to richer diffusion kernels.
Use distribution multiplication for conditional inference.
Scale reverse models for long diffusion chains.
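Distribution multiplication is easiest to see in the Gaussian case, where the product of two Gaussian factors is again Gaussian with a precision-weighted mean. The sketch below shows only that special case, elementwise on diagonal variances; it is not the paper's general procedure, and the function name is illustrative.

```python
import numpy as np

def gaussian_product(mu_a, var_a, mu_b, var_b):
    """Mean and variance of the normalized product N(mu_a, var_a) * N(mu_b, var_b)."""
    mu_a, var_a, mu_b, var_b = (
        np.asarray(a, dtype=float) for a in (mu_a, var_a, mu_b, var_b)
    )
    precision = 1.0 / var_a + 1.0 / var_b      # precisions add under multiplication
    var = 1.0 / precision
    mu = var * (mu_a / var_a + mu_b / var_b)   # precision-weighted mean
    return mu, var
```

Multiplying a reverse kernel by a sharp Gaussian around observed values pulls its mean toward them, which is the mechanism behind conditioning tasks such as inpainting and denoising.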
