Auto-Encoding Variational Bayes

Diederik P. Kingma, Max Welling

2013 · ICLR

Problem

Framing

Mean-field variational Bayes for directed latent-variable models breaks down when the marginal likelihood $p_{\theta}(\mathbf{x})$, the true posterior $p_{\theta}(\mathbf{z}\mid\mathbf{x})$, and the expectations required by the mean-field updates are all intractable. The paper closes this gap with a reparameterized lower-bound estimator (SGVB) plus an amortized recognition model, replacing per-datapoint iterative inference with a single encoder pass.

Currently Used Methods

Foundational

"The wake-sleep algorithm for unsupervised neural networks": wake-sleep latent-variable training with a learned recognition network.
- Limitation in context: slower convergence and worse variational bounds than AEVB.
"Estimating or Propagating Gradients Through Stochastic Neurons": score-function estimators for stochastic computation graphs.
- Limitation in context: gradient variance is too high for practical generic VB.
Mean-field variational Bayes: analytic lower-bound optimization under simple factorized posteriors.
- Limitation in context: breaks when expectations under nonlinear decoders are intractable.
Monte Carlo EM with HMC: posterior-sampling-based likelihood learning for latent-variable models.
- Limitation in context: too expensive for online minibatch learning on large datasets.
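
The variance gap between score-function and pathwise gradients noted above can be illustrated on a toy problem. The sketch below is not taken from the paper: the objective $f(z) = z^2$, the values of `mu` and `sigma`, and the helper name `gradient_samples` are all assumptions chosen so that the true gradient $\partial_{\mu}\,\mathbb{E}[z^2] = 2\mu$ is known exactly.

```python
import random
import statistics

def gradient_samples(mu, sigma, n, seed=0):
    """Draw n single-sample estimates of d/dmu E[z^2] for z ~ N(mu, sigma^2).

    Toy setup for illustration only; f(z) = z^2 is an assumed objective.
    """
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        # Score-function estimator: f(z) * d/dmu log N(z; mu, sigma^2)
        score.append(z * z * (z - mu) / sigma**2)
        # Pathwise estimator: d/dmu f(mu + sigma * eps) = f'(z) = 2z
        pathwise.append(2.0 * z)
    return score, pathwise

score, pathwise = gradient_samples(mu=1.0, sigma=1.0, n=20000)
print("true gradient:", 2.0)
print("score-function mean, variance:",
      statistics.fmean(score), statistics.variance(score))
print("pathwise mean, variance:",
      statistics.fmean(pathwise), statistics.variance(pathwise))
```

Both estimators are unbiased for this toy objective, but the score-function samples spread far more widely, which is the behaviour the limitation above refers to.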

Proposed Method

Architecture

The model factorizes as $p_{\theta}(\mathbf{z})p_{\theta}(\mathbf{x}\mid\mathbf{z})$ with an amortized approximate posterior $q_{\phi}(\mathbf{z}\mid\mathbf{x})$. In the VAE instantiation, the encoder and decoder are single-hidden-layer MLPs; the encoder outputs the mean $\boldsymbol{\mu}(\mathbf{x})$ and standard deviation $\boldsymbol{\sigma}(\mathbf{x})$ of a diagonal Gaussian, and the decoder outputs Bernoulli or Gaussian observation parameters.

Directed graphical model: latent variable $\mathbf{z}$, observed $\mathbf{x}$, solid generative edges for $p_{\theta}(\mathbf{z})p_{\theta}(\mathbf{x}\mid\mathbf{z})$, and a dashed variational edge for $q_{\phi}(\mathbf{z}\mid\mathbf{x})$.
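
A minimal sketch of the encoder and Bernoulli decoder forward passes follows, in plain Python. The layer sizes, the tanh hidden activation, the log-variance parameterization of $\boldsymbol{\sigma}$, the initialization scale `std`, and every function and key name are illustrative assumptions, not details fixed by the text above.

```python
import math
import random

def init_layer(n_out, n_in, rng, std=0.1):
    """Return (weights, biases); small Gaussian weights, std is illustrative."""
    W = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def affine(layer, v):
    """Compute W v + b for a (W, b) pair."""
    W, b = layer
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def encode(params, x):
    """Map x to the mean and log-variance of a diagonal Gaussian over z."""
    h = [math.tanh(a) for a in affine(params["enc_h"], x)]
    return affine(params["enc_mu"], h), affine(params["enc_logvar"], h)

def decode_bernoulli(params, z):
    """Map z to per-dimension Bernoulli probabilities for x."""
    h = [math.tanh(a) for a in affine(params["dec_h"], z)]
    return [1.0 / (1.0 + math.exp(-a)) for a in affine(params["dec_out"], h)]

def init_params(n_x, n_h, n_z, seed=0):
    rng = random.Random(seed)
    return {
        "enc_h": init_layer(n_h, n_x, rng),
        "enc_mu": init_layer(n_z, n_h, rng),
        "enc_logvar": init_layer(n_z, n_h, rng),
        "dec_h": init_layer(n_h, n_z, rng),
        "dec_out": init_layer(n_x, n_h, rng),
    }

params = init_params(n_x=8, n_h=5, n_z=2)
x = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
mu, log_var = encode(params, x)
probs = decode_bernoulli(params, mu)  # decodes the posterior mean; no sampling here
print(len(mu), len(log_var), len(probs))
```

Predicting the log-variance rather than $\boldsymbol{\sigma}$ directly keeps the standard deviation positive without a constraint on the network output.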

Loss / Objective

The method maximizes the variational lower bound. With a standard-normal prior $p_{\theta}(\mathbf{z})=\mathcal{N}(\mathbf{0},\mathbf{I})$ and a diagonal-Gaussian posterior, the KL term has a closed form and only the reconstruction term is estimated by sampling:

$$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \approx \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log \left((\sigma_j^{(i)})^2\right) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L}\sum_{l=1}^{L} \log p_{\theta}(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i,l)})$$
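
The estimator above can be sketched term by term for a Bernoulli decoder. The function names, the log-variance input, and the clipping constant `eps` are assumptions made for this illustration; the decoder outputs are passed in as plain lists so the block stays self-contained.

```python
import math

def neg_kl_diag_gaussian(mu, log_var):
    """First term: 0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)."""
    return 0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                     for m, lv in zip(mu, log_var))

def bernoulli_log_likelihood(x, probs, eps=1e-7):
    """log p(x | z) for a Bernoulli decoder; eps clipping is an assumed safeguard."""
    total = 0.0
    for xi, p in zip(x, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += xi * math.log(p) + (1.0 - xi) * math.log(1.0 - p)
    return total

def lower_bound_estimate(x, mu, log_var, decoded_probs_per_sample):
    """Combine both terms; decoded_probs_per_sample holds L decoder outputs."""
    L = len(decoded_probs_per_sample)
    recon = sum(bernoulli_log_likelihood(x, p)
                for p in decoded_probs_per_sample) / L
    return neg_kl_diag_gaussian(mu, log_var) + recon

# With mu = 0 and sigma = 1 the KL term vanishes, leaving only reconstruction.
value = lower_bound_estimate(
    x=[1.0, 0.0], mu=[0.0, 0.0], log_var=[0.0, 0.0],
    decoded_probs_per_sample=[[0.5, 0.5]],
)
print(value)
```

In this usage example the posterior equals the prior, so the estimate reduces to $2\log 0.5$, the log-likelihood of two binary pixels under uniform probabilities.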

Sampling Rule

Sampling uses the pathwise reparameterization:

$$\mathbf{z}^{(i,l)} = \boldsymbol{\mu}^{(i)} + \boldsymbol{\sigma}^{(i)} \odot \boldsymbol{\epsilon}^{(l)}, \qquad \boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$
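
As a small illustration of this sampling rule, the sketch below draws $\mathbf{z}$ from given posterior parameters. The function name `reparameterize`, the log-variance input, and the example values are assumptions for this sketch.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps with eps ~ N(0, I), sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu = [2.0, -1.0]
log_var = [math.log(0.25), 0.0]  # sigma = 0.5 and 1.0
samples = [reparameterize(mu, log_var, rng) for _ in range(5000)]

# Empirical means should sit close to mu.
mean0 = sum(s[0] for s in samples) / len(samples)
mean1 = sum(s[1] for s in samples) / len(samples)
print(mean0, mean1)
```

Because the noise $\boldsymbol{\epsilon}$ is drawn independently of the parameters, $\mathbf{z}$ is a deterministic, differentiable function of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ for a fixed draw, which is what lets gradients flow through the sample.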

Training Procedure

Minibatch size $M = 100$
Samples per datapoint $L = 1$
Optimizer: Adagrad
Global step size in $\{0.01, 0.02, 0.1\}$
Parameter initialization: $\mathcal{N}(0, 0.01)$
Weight-decay prior: $p(\theta)=\mathcal{N}(0, I)$
Hidden units: 500 on MNIST, 200 on Frey Face
Marginal-likelihood runs: 100 hidden units, 3 latent variables
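
The optimizer settings above can be sketched as a generic Adagrad ascent step over minibatches. This is a minimal sketch under stated assumptions: parameters are a flat list, the gradient is supplied by the caller, the stabilizer `eps` and the toy objective are illustrative, and the weight-decay prior term is left out.

```python
import math

def adagrad_step(theta, grad, accum, step_size=0.01, eps=1e-8):
    """One Adagrad ascent step (the lower bound is maximized, hence '+')."""
    new_theta, new_accum = [], []
    for t, g, a in zip(theta, grad, accum):
        a = a + g * g  # running sum of squared gradients
        new_accum.append(a)
        new_theta.append(t + step_size * g / (math.sqrt(a) + eps))
    return new_theta, new_accum

def minibatches(data, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Toy usage: maximize -(theta - 3)^2, whose gradient is -2 * (theta - 3).
theta_demo, accum_demo = [0.0], [0.0]
for _ in range(50):
    grad = [-2.0 * (theta_demo[0] - 3.0)]
    theta_demo, accum_demo = adagrad_step(theta_demo, grad, accum_demo, step_size=0.1)
print(theta_demo[0])
```

Adagrad scales each coordinate by the root of its accumulated squared gradients, so the effective step shrinks over time. The global step size then matters mostly in early iterations.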

Evaluation

Datasets

MNIST
Frey Face

Metrics

Average variational lower bound per datapoint
Estimated marginal log-likelihood

Headline results

MNIST $(N_z \in \{3,5,10,20,200\})$: AEVB converges faster and reaches better lower bounds than wake-sleep.
Frey Face $(N_z \in \{2,5,10,20\})$: AEVB converges faster and reaches better lower bounds than wake-sleep.
MNIST $(N_{\mathrm{train}}=1000)$: AEVB improves estimated marginal log-likelihood faster than wake-sleep and MCEM.
MNIST $(N_{\mathrm{train}}=50000)$: MCEM is not efficiently applicable; AEVB remains online.
Figure 2: estimator variance stays below $1$; runtime is about 20 to 40 minutes per million training samples on CPU.

Results plots: MNIST and Frey Face lower-bound curves across latent dimensions, where AEVB train/test curves rise faster and higher than wake-sleep.

Sample grid: Frey Face latent-manifold traversal with smooth identity and expression changes across the grid.

Ablations

Latent dimensionality $N_z$: extra latent variables do not induce visible overfitting.
Training-set size: AEVB keeps its advantage on both 1k and 50k-example MNIST.
Objective target: better lower bounds align with better estimated marginal likelihood.
Algorithm choice: amortized inference beats wake-sleep and avoids MCEM sampling cost.

Method Strengths and Weaknesses

Strengths

Pathwise gradients avoid the high variance of score-function estimators.
Amortized inference removes per-example MCMC or inner-loop optimization.
A single sample per datapoint ($L=1$) works in practice with minibatch size $M=100$.
Beats wake-sleep on lower bound and marginal likelihood experiments.

Weaknesses

Core recipe assumes continuous latents with differentiable reparameterization.
Main posterior family is diagonal Gaussian, which limits expressivity.
Empirical scope is narrow: MNIST and Frey Face only.
Marginal-likelihood estimation becomes unreliable at higher latent dimension.

Suggestions from the authors

Extend the method to variational inference over the global parameters, testing the appendix algorithm that places variational posteriors over model parameters.
Apply the method to online and non-stationary streaming data.
Explore richer approximate posteriors beyond diagonal Gaussians.
