Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, Soumith Chintala

2015 · ICLR


Problem

Framing

GANs produced sharp samples, but scaling them up with deeper CNNs was unstable, and the quality of their learned features was largely untested. DCGAN addresses both gaps with a constrained all-convolutional architecture, batch normalization, and tuned Adam settings; features from its ImageNet-trained discriminator reach $82.8\%$ CIFAR-10 accuracy.

Currently Used Methods

Foundational

@goodfellowGAN2014 — adversarial learning with a generator and discriminator.
- Limitation in context: vanilla MLP GANs did not train deep convolutional image models stably.
@krizhevskyAlexNet2012 — deep convolutional design for strong supervised visual features.
- Limitation in context: supervised CNN heuristics did not directly stabilize adversarial co-training.
"Striving for Simplicity: The All Convolutional Net" — replaces pooling with learned strided convolutions.
- Limitation in context: it does not address generator–discriminator optimization dynamics.
"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" — batch normalization stabilizes deep optimization.
- Limitation in context: its placement inside GANs was not yet established.
"Discriminative Unsupervised Feature Learning with Convolutional Neural Networks" — unsupervised CNN features transfer to classification.
- Limitation in context: generative training had not matched that representation quality.

Proposed Method

Architecture

DCGAN removes pooling and hidden fully connected layers. The generator maps $\mathbf{z} \in \mathbb{R}^{100}$ to a $64 \times 64 \times 3$ image through four fractionally strided convolutions; the discriminator mirrors this with strided convolutions. The generator uses ReLU in every layer except the output, which uses $\tanh$; the discriminator uses LeakyReLU throughout. Batch normalization is applied in both networks, except at the generator output and the discriminator input.

Architecture diagram: a 100-D latent vector is projected to a $4 \times 4 \times 1024$ tensor, then upsampled through four stride-2 fractionally strided convolution blocks to a $64 \times 64 \times 3$ image.
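The spatial sizes in the diagram can be checked with the standard output-size formula for a transposed convolution. The sketch below is illustrative only: the kernel size of 4 and padding of 1 are assumptions chosen so that each stride-2 block exactly doubles the side length, and the intermediate channel widths (512, 256, 128) are assumed rather than taken from the text above. The helper names are hypothetical.

```python
def conv_transpose_out(size, kernel, stride, padding):
    """Output side length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def generator_shapes(start=4, channels=(1024, 512, 256, 128, 3),
                     kernel=4, stride=2, padding=1):
    """Trace (height, width, channels) through the upsampling blocks.

    kernel=4 and padding=1 are assumptions; the intermediate channel
    widths are also assumed, only 1024 and 3 are stated in the notes.
    """
    shapes = [(start, start, channels[0])]
    size = start
    for ch in channels[1:]:
        size = conv_transpose_out(size, kernel, stride, padding)
        shapes.append((size, size, ch))
    return shapes


shapes = generator_shapes()
for shape in shapes:
    print(shape)
```

With these assumed settings the side length doubles at every block, so four blocks take the $4 \times 4$ projection to $64 \times 64$.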

Loss / Objective

The model keeps the standard GAN minimax game.

$$\min_G \max_D \, V(D,G) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}}\left[\log D(\mathbf{x})\right] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}}\left[\log \left(1 - D(G(\mathbf{z}))\right)\right]$$
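As a rough illustration of the objective, the two expectations can be estimated from a minibatch of discriminator outputs. This is a minimal sketch, not the authors' code: `value_estimate` is a hypothetical helper, and the probabilities passed to it are made-up numbers.

```python
import math


def value_estimate(d_real, d_fake):
    """Minibatch estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical outputs: a discriminator that separates real from fake well...
confident = value_estimate([0.9, 0.8], [0.1, 0.2])
# ...and one that cannot tell them apart and outputs 0.5 everywhere.
fooled = value_estimate([0.5, 0.5], [0.5, 0.5])
print(confident, fooled)
```

The discriminator pushes this value up and the generator pushes it down; when the discriminator outputs $0.5$ on every input the estimate equals $-\log 4$.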

Algorithm

Training alternates discriminator and generator updates under the adversarial objective.

$$\mathbf{z} \sim \mathrm{Uniform}([-1,1]^{100}), \qquad \hat{\mathbf{x}} = G(\mathbf{z})$$

The discriminator outputs $D(\mathbf{x})$ and $D(\hat{\mathbf{x}})$ drive the two-player update.
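The alternating update can be sketched on a deliberately tiny problem. Everything below is an assumption made for illustration: a 1-D Gaussian stands in for the data, the generator is the affine map $G(z) = az + b$, the discriminator is a logistic unit, and plain stochastic gradient steps replace Adam. It shows the order of the updates, not DCGAN itself.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=500, lr=0.05, seed=0):
    """Alternate one discriminator step and one generator step per iteration."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # toy generator: G(z) = a * z + b
    w, c = 0.0, 0.0  # toy discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        # Discriminator step: gradient ascent on log D(x) + log(1 - D(G(z))).
        x = rng.gauss(2.0, 0.5)       # assumed stand-in for a real sample
        z = rng.uniform(-1.0, 1.0)    # uniform latent prior, here 1-D
        x_fake = a * z + b
        d_real = sigmoid(w * x + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - d_real) * x - d_fake * x_fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient descent on log(1 - D(G(z))) with a fresh z.
        z = rng.uniform(-1.0, 1.0)
        x_fake = a * z + b
        d_fake = sigmoid(w * x_fake + c)
        a += lr * d_fake * w * z
        b += lr * d_fake * w
    return a, b, w, c


params = train_toy_gan()
print(params)
```

The generator here descends the minimax term exactly as written in the objective; practical implementations often swap in a different generator loss, which this sketch does not attempt.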

Training Procedure

Latent prior: $\mathrm{Uniform}([-1,1]^{100})$.
Images scaled to $[-1,1]$, the range of the generator's $\tanh$ output.
Optimizer: Adam.
Learning rate: $0.0002$.
Batch size: $128$.
Adam momentum term: $\beta_1 = 0.5$.
Weight init: zero-centered normal, std $0.02$.
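The settings above can be collected into a small configuration with helpers for preprocessing, sampling, and a single Adam update. This is a sketch under stated assumptions: `beta2` and `eps` are not given above and are set to common Adam defaults, 8-bit input pixels are assumed, and all function names are hypothetical.

```python
import math
import random

HPARAMS = {
    "latent_dim": 100,
    "lr": 0.0002,
    "beta1": 0.5,
    "batch_size": 128,
    "init_std": 0.02,
    "beta2": 0.999,  # assumption: common Adam default, not stated in the notes
    "eps": 1e-8,     # assumption: common Adam default, not stated in the notes
}


def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return value / 127.5 - 1.0


def sample_latent(rng, dim=HPARAMS["latent_dim"]):
    """Draw one latent vector from Uniform([-1, 1]^dim)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def init_weights(rng, n, std=HPARAMS["init_std"]):
    """Draw n weights from a zero-centered normal with the given std."""
    return [rng.gauss(0.0, std) for _ in range(n)]


def adam_step(param, grad, m, v, t, hp=HPARAMS):
    """One bias-corrected Adam update for a scalar parameter at step t >= 1."""
    m = hp["beta1"] * m + (1.0 - hp["beta1"]) * grad
    v = hp["beta2"] * v + (1.0 - hp["beta2"]) * grad * grad
    m_hat = m / (1.0 - hp["beta1"] ** t)
    v_hat = v / (1.0 - hp["beta2"] ** t)
    param = param - hp["lr"] * m_hat / (math.sqrt(v_hat) + hp["eps"])
    return param, m, v


rng = random.Random(0)
z = sample_latent(rng)
weights = init_weights(rng, 1000)
param, m, v = adam_step(1.0, 1.0, 0.0, 0.0, t=1)
print(scale_pixel(0), scale_pixel(255), len(z), param)
```

With $\beta_1 = 0.5$ the first-moment average forgets old gradients much faster than with $0.9$, which is one plausible reading of why the lower value damps oscillation.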

Evaluation

Datasets

LSUN bedrooms.
ImageNet-1k, $32 \times 32$ center crops.
Faces dataset from 10K identities.
CIFAR-10.
SVHN.
MNIST for nearest-neighbor analysis.

Metrics

CIFAR-10 classification accuracy.
CIFAR-10 accuracy with 400 labels per class.
SVHN error rate with 1000 labels.
MNIST nearest-neighbor test error.
Qualitative sample inspection.

Headline results

CIFAR-10, ImageNet-pretrained features: $82.8\%$ accuracy.
CIFAR-10, 400 labels per class: $73.8\% \pm 0.4\%$.
SVHN, 1000 labels: $22.48\%$ error.
MNIST, $10$M generated samples: $2.18\%$ nearest-neighbor test error.

Table 1: CIFAR-10 classification using pretrained discriminator features

Model	Accuracy	Accuracy (400 per class)	Max # of feature units
1 Layer K-means	80.6%	63.7% ($\pm 0.7\%$)	4800
3 Layer K-means Learned RF	82.0%	70.7% ($\pm 0.7\%$)	3200
View Invariant K-means	81.9%	72.6% ($\pm 0.7\%$)	6400
Exemplar CNN	84.3%	77.4% ($\pm 0.2\%$)	1024
DCGAN (ours) + L2-SVM	82.8%	73.8% ($\pm 0.4\%$)	512

Sample grid: LSUN bedroom generations with coherent room layout, windows, beds, and lighting across many draws.

Ablations

Pooling removal: learned strided convolutions improve training stability.
Fully connected removal: deeper GANs train more reliably.
Momentum sweep: $\beta_1=0.9$ oscillates; $0.5$ stabilizes.
Extended training: a subset of filters sometimes collapses to a single oscillating mode.

Method Strengths and Weaknesses

Strengths

Architectural rules are simple and reproducible.
Discriminator features reach $82.8\%$ CIFAR-10 accuracy without CIFAR pretraining.
Few-label transfer is strong: $73.8\%$ with 400 labels per class.
LSUN samples show consistent global room structure at $64 \times 64$.

Weaknesses

Training still shows oscillation and occasional filter collapse.
No calibrated generative metric like FID or likelihood is reported.
Full-label CIFAR-10 trails Exemplar CNN.
Design rules are empirical, not derived from GAN optimization theory.

Suggestions from the authors

Extend the approach to video frame prediction.
Extend learned features to audio and speech synthesis.
Study latent-space structure more systematically.
Use latent vector arithmetic to reduce the data needed for conditional generation.
