Density Estimation using Real NVP

Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio

2017 · ICLR

Problem

Framing

Earlier generative models forced a trade-off among exact density evaluation, exact latent inference, and fast sampling on high-dimensional images. Real NVP removes this trade-off with affine coupling layers whose Jacobian is triangular, so the log-determinant is cheap to compute and both density evaluation and inversion stay exact. It reports 3.49 bits/dim on CIFAR-10 and 2.72 bits/dim on LSUN bedroom.

Currently Used Methods

Foundational

@kingmaVAE2013 — variational latent-variable modeling with amortized inference.
- Limitation in context: optimizes a bound, not exact likelihood or exact inversion.
@goodfellowGAN2014 — adversarial generation with sharp samples.
- Limitation in context: no tractable density and no exact latent inference.
@DeepUnsupervisedLearningusing2015 — autoregressive image density modeling with strong likelihoods.
- Limitation in context: sampling is sequential and slow.
Improving Variational Inference with Inverse Autoregressive Flow — autoregressive flows for richer variational posteriors.
- Limitation in context: remains tied to variational training, not exact bijective density.
NICE: Non-linear Independent Components Estimation — coupling layers with tractable inverse and determinant.
- Limitation in context: volume preserving, so it cannot learn local scaling.

Proposed Method

Architecture

Real NVP stacks affine coupling layers that alternate between checkerboard and channel-wise masks. A squeeze operation trades spatial size for channels, turning an $s \times s \times c$ tensor into an $\frac{s}{2} \times \frac{s}{2} \times 4c$ tensor, and a multi-scale scheme factors out half of the variables at each resolution until the remaining tensor is $4 \times 4 \times c$.

Figure: checkerboard masking before squeezing, then channel-wise masking after the squeeze reshapes spatial positions into channels.
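The masking and squeeze operations described above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function names, the height-width-channel layout, and the ordering of the four sub-pixels inside the channel axis are assumptions made here.

```python
import numpy as np

def checkerboard_mask(height, width):
    """Binary mask equal to 1 where the row and column indices sum to an odd number."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    return ((rows + cols) % 2 == 1).astype(np.float32)

def channel_mask(channels):
    """Binary mask equal to 1 for the first half of the channels and 0 for the rest."""
    mask = np.zeros(channels, dtype=np.float32)
    mask[: channels // 2] = 1.0
    return mask

def squeeze(x):
    """Reshape (H, W, C) into (H/2, W/2, 4C) by moving each 2x2 block into channels."""
    h, w, c = x.shape
    assert h % 2 == 0 and w % 2 == 0, "spatial sizes must be even"
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/2, W/2, 2, 2, C)
    return x.reshape(h // 2, w // 2, 4 * c)

def unsqueeze(y):
    """Exact inverse of squeeze: (H, W, 4C) back to (2H, 2W, C)."""
    h, w, c4 = y.shape
    c = c4 // 4
    y = y.reshape(h, w, 2, 2, c)
    y = y.transpose(0, 2, 1, 3, 4)  # (H, 2, W, 2, C)
    return y.reshape(2 * h, 2 * w, c)

image = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
squeezed = squeeze(image)
restored = unsqueeze(squeezed)
spatial_mask = checkerboard_mask(4, 4)
feature_mask = channel_mask(squeezed.shape[-1])
```

Because the squeeze is a pure permutation of entries, it contributes nothing to the log-determinant, and each mask leaves exactly half of the variables unchanged in a coupling layer.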

Loss / Objective

Training maximizes exact log-likelihood under change of variables.

$$
\log p_X(\mathbf{x}) = \log p_Z\big(f(\mathbf{x})\big) + \log \left| \det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}^T} \right|
$$
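As a sanity check on the change-of-variables objective, the sketch below applies it to a one-dimensional toy case. The bijection $f(x) = (x - \mu)/\sigma$ and the values of `mu` and `sigma` are assumptions chosen for illustration; with a standard normal prior the formula should reproduce the closed-form density of $\mathcal{N}(\mu, \sigma^2)$.

```python
import math

def log_standard_normal(z):
    """Log density of the standard normal prior p_Z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def log_density_via_flow(x, mu, sigma):
    """log p_X(x) for the bijection f(x) = (x - mu) / sigma."""
    z = (x - mu) / sigma
    log_abs_det = -math.log(sigma)  # df/dx = 1 / sigma
    return log_standard_normal(z) + log_abs_det

def log_normal_pdf(x, mu, sigma):
    """Closed-form log density of N(mu, sigma^2), used only as a reference."""
    return (-0.5 * ((x - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi))

flow_value = log_density_via_flow(1.3, mu=0.5, sigma=2.0)
reference_value = log_normal_pdf(1.3, mu=0.5, sigma=2.0)
```

The two values agree up to floating-point error, which is the scalar version of the exact likelihood that the model optimizes.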

Sampling Rule / Algorithm

Each coupling layer is analytically invertible.

$$
\begin{aligned}
\mathbf{y}_{1:d} &= \mathbf{x}_{1:d}, \\
\mathbf{y}_{d+1:D} &= \mathbf{x}_{d+1:D} \odot \exp\big(s(\mathbf{x}_{1:d})\big) + t(\mathbf{x}_{1:d}), \\
\mathbf{x}_{1:d} &= \mathbf{y}_{1:d}, \\
\mathbf{x}_{d+1:D} &= \big(\mathbf{y}_{d+1:D} - t(\mathbf{y}_{1:d})\big) \odot \exp\big(-s(\mathbf{y}_{1:d})\big).
\end{aligned}
$$
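The coupling equations can be checked numerically with a small NumPy sketch. The scale and translation functions `s_fn` and `t_fn` below are hypothetical stand-ins (fixed random linear maps, with a `tanh` on the scale) for the residual networks used in the paper; only the forward and inverse formulas come from the equations above. The log-determinant is the sum of the scale outputs because the Jacobian is triangular with $\exp(s)$ on the transformed part of the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3  # total dimension and size of the pass-through part

# Hypothetical parameters standing in for the learned s and t networks.
W_s = rng.normal(size=(d, D - d)) * 0.1
W_t = rng.normal(size=(d, D - d)) * 0.1

def s_fn(h):
    """Toy scale function; tanh keeps the log-scale bounded."""
    return np.tanh(h @ W_s)

def t_fn(h):
    """Toy translation function."""
    return h @ W_t

def coupling_forward(x):
    """Return y and log|det dy/dx| for one affine coupling layer."""
    x1, x2 = x[:d], x[d:]
    s = s_fn(x1)
    y2 = x2 * np.exp(s) + t_fn(x1)
    return np.concatenate([x1, y2]), s.sum()

def coupling_inverse(y):
    """Invert the layer without inverting s_fn or t_fn themselves."""
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t_fn(y1)) * np.exp(-s_fn(y1))
    return np.concatenate([y1, x2])

x = rng.normal(size=D)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Note that the inverse only evaluates `s_fn` and `t_fn` in the forward direction, so these functions can be arbitrarily complex without affecting invertibility.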

Training Procedure

Prior: isotropic unit Gaussian.
Input transform: $\operatorname{logit}(\alpha + (1-\alpha)\,\mathbf{x}/256)$ with $\alpha = 0.05$.
Data augmentation: horizontal flips for CIFAR-10, CelebA, LSUN.
Batch size: 64.
Optimizer: @kingmaAdam2014.
Regularization: $L_2$ on weight-scale parameters, coefficient $5 \cdot 10^{-5}$.
$32 \times 32$ images: 4 residual blocks, 32 hidden feature maps.
$64 \times 64$ images: 2 residual blocks.
CIFAR-10: 8 residual blocks, 64 feature maps, one downscaling step.
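The input transform listed above maps bounded pixel values to an unbounded space before the flow is applied. The sketch below shows that mapping and its inverse, assuming pixel values lie in $[0, 256)$; the function names and the sample values are illustrative, not taken from the paper.

```python
import numpy as np

ALPHA = 0.05  # value of alpha given in the training procedure

def to_logit_space(x):
    """Map pixel values in [0, 256) to logit(alpha + (1 - alpha) * x / 256)."""
    p = ALPHA + (1.0 - ALPHA) * x / 256.0
    return np.log(p) - np.log1p(-p)

def from_logit_space(y):
    """Invert the transform: sigmoid, then undo the alpha shift and the scaling."""
    p = 1.0 / (1.0 + np.exp(-y))
    return (p - ALPHA) / (1.0 - ALPHA) * 256.0

pixels = np.array([0.0, 64.5, 128.0, 255.9])
logits = to_logit_space(pixels)
roundtrip = from_logit_space(logits)
```

The offset `ALPHA` keeps the argument of the logit away from 0, so a pixel value of 0 still maps to a finite number.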

Evaluation

Datasets

CIFAR-10
ImageNet $32 \times 32$
ImageNet $64 \times 64$
LSUN bedroom
LSUN tower
LSUN church outdoor
CelebA

Metrics

Bits per dimension
Qualitative sample quality
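Bits per dimension is the negative log-likelihood expressed in base 2 and divided by the number of dimensions. A minimal conversion sketch follows, assuming the negative log-likelihood is measured in nats and already expressed in the original pixel space; the helper name is hypothetical.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

# CIFAR-10 images have 32 * 32 * 3 dimensions.
cifar_dims = 32 * 32 * 3

# Negative log-likelihood in nats that corresponds to 3.49 bits/dim.
nll_for_3_49 = 3.49 * cifar_dims * math.log(2.0)
recovered = bits_per_dim(nll_for_3_49, cifar_dims)
```

Dividing by the dimension count is what makes scores comparable across image sizes such as $32 \times 32$ and $64 \times 64$.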

Headline results

CIFAR-10: 3.49 bits/dim.
ImageNet $32 \times 32$: 4.28 bits/dim.
ImageNet $64 \times 64$: 3.98 bits/dim.
LSUN bedroom: 2.72 bits/dim.
CelebA: 3.02 bits/dim.

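Table 1: Bits/dim results on CIFAR-10, ImageNet, LSUN, and CelebA. Values are test results for CIFAR-10 and validation results for the other datasets; training results are in parentheses.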

Dataset	PixelRNN [46]	Real NVP	Conv DRAW [22]	IAF-VAE [34]
CIFAR-10	3.00	3.49	< 3.59	< 3.28
ImageNet ($32 \times 32$)	3.86 (3.83)	4.28 (4.26)	< 4.40 (4.35)	
ImageNet ($64 \times 64$)	3.63 (3.57)	3.98 (3.75)	< 4.10 (4.04)	
LSUN (bedroom)		2.72 (2.70)		
LSUN (tower)		2.81 (2.78)		
LSUN (church outdoor)		3.08 (2.94)		
CelebA		3.02 (2.97)		

Ablations

Capacity: limited models generate implausible samples, especially on CelebA.
Batch normalization: enables deeper coupling stacks and stabilizes scale-parameter training.
Latent interpolation: traversals stay semantically smooth across faces and scenes.
Architecture masking: checkerboard masks switch to channel-wise masks after squeezing.
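Batch normalization can sit inside a flow because, for fixed statistics, it is an elementwise affine map with a known determinant, $\big(\prod_i (\sigma_i^2 + \epsilon)\big)^{-1/2}$. The sketch below illustrates this under simplifying assumptions: statistics are computed from the current batch and treated as constants, and the function names and `eps` value are hypothetical.

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    """Normalize each feature over the batch axis.

    Returns the normalized batch, the per-example log|det| of the map
    (treating the statistics as fixed), and the statistics for inversion.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    y = (x - mean) / np.sqrt(var + eps)
    log_det = -0.5 * np.sum(np.log(var + eps))
    return y, log_det, (mean, var)

def batchnorm_inverse(y, stats, eps=1e-5):
    """Undo the normalization using the stored mean and variance."""
    mean, var = stats
    return y * np.sqrt(var + eps) + mean

rng = np.random.default_rng(1)
batch = rng.normal(loc=2.0, scale=3.0, size=(8, 4))
normalized, bn_log_det, bn_stats = batchnorm_forward(batch)
recovered_batch = batchnorm_inverse(normalized, bn_stats)
```

The log-determinant term is added to the coupling-layer terms when accumulating the total log-likelihood.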

Method Strengths and Weaknesses

Strengths

Exact likelihood, exact inversion, and exact sampling coexist in one model.
Affine coupling keeps log-determinants cheap through triangular Jacobians.
Multi-scale design reaches natural images up to $64 \times 64$.
Latent interpolations look semantically organized, not purely local.

Weaknesses

CIFAR-10 likelihood trails PixelRNN: 3.49 versus 3.00 bits/dim.
Limited-capacity models produce highly improbable CelebA samples.
Bijectivity forces latent dimensionality to match input dimensionality.
Strong results need deep residual coupling networks and normalization.

Suggestions from the authors

Explore semi-supervised learning with the high-dimensional latent space.
Build conditional Real NVP models using variables such as class labels.
Extend the framework to language, video, and audio.
Learn richer priors over $\mathbf{z}$ than an isotropic Gaussian.
