Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov

2014 · JMLR

Problem

Framing

Large neural nets overfit, and the usual remedy of averaging the predictions of many separately trained large nets is too expensive at both training and test time. The paper closes this gap by training randomly thinned subnetworks that share weights and approximating their ensemble with a single weight-scaled full network at test time.

Currently Used Methods

Foundational

@krizhevskyAlexNet2012 — large-scale deep CNN training for visual recognition.
- Limitation in context: improves accuracy, but does not solve overfitting cheaply.
@goodfellowGAN2014 — adversarial training for sharp generative modeling.
- Limitation in context: not a regularizer for supervised neural nets.
"Rank, Trace-Norm and Max-Norm" — constrains incoming weight norms during training.
- Limitation in context: regularizes weights, not co-adapting units.
"ImageNet Classification with Deep Convolutional Neural Networks" — high-capacity convnets with data augmentation and tuning.
- Limitation in context: still needs a generic ensemble-like regularizer.
"Improving neural networks by preventing co-adaptation of feature detectors" — early masking intuition for hidden units.
- Limitation in context: lacks the full test-time scaling rule and broad evaluation.

Proposed Method

Architecture

Dropout applies to standard multilayer nets, convnets, and RBMs. For each training case, each unit is independently retained with probability $p$ and otherwise removed together with its incoming and outgoing connections. Hidden units use $p \approx 0.5$; input units use a larger $p$, closer to $1$.
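
The per-case masking described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the helper names `sample_mask` and `thin`, the toy activation values, and the specific retention values 0.8 and 0.5 are assumptions chosen for the example.

```python
import random

def sample_mask(n_units, p_retain, rng):
    """Draw one independent Bernoulli(p_retain) keep/drop flag per unit."""
    return [1 if rng.random() < p_retain else 0 for _ in range(n_units)]

def thin(activations, mask):
    """Zero the activations of dropped units; retained units are unchanged."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)                    # seeded only for reproducibility
inputs = [0.2, 1.5, -0.7, 0.9]            # toy input activations (assumed)
hidden = [0.4, 0.0, 2.1, 1.3, 0.8, 0.6]   # toy hidden activations (assumed)

# A fresh mask is drawn for every training case.
# Inputs are retained more often than hidden units (0.8 vs 0.5, illustrative).
input_mask = sample_mask(len(inputs), p_retain=0.8, rng=rng)
hidden_mask = sample_mask(len(hidden), p_retain=0.5, rng=rng)

thinned_inputs = thin(inputs, input_mask)
thinned_hidden = thin(hidden, hidden_mask)
```

Each call to `sample_mask` corresponds to picking one thinned subnetwork; all subnetworks share the same underlying weights.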

Standard multilayer network beside a thinned network after randomly dropping several hidden and input units.

Loss / Objective

Training minimizes the base supervised loss under random Bernoulli masks:

\min_{\theta}\; \mathbb{E}_{(\mathbf{x},y)}\; \mathbb{E}_{\mathbf{r} \sim \mathrm{Bernoulli}(p)}\left[\ell\big(f_{\theta}(\mathbf{x};\mathbf{r}), y\big)\right]
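
To make the inner expectation over masks concrete, the sketch below estimates it by Monte Carlo for a toy linear model with squared loss. The model, the loss, the values of `x`, `y`, `w`, and the function names are all assumptions made for illustration; in actual training a single mask is sampled per training case and the expectation is only followed stochastically.

```python
import random

def forward(x, w, mask):
    """Toy linear model whose input units are multiplied by a 0/1 mask."""
    return sum(wi * xi * ri for wi, xi, ri in zip(w, x, mask))

def squared_loss(prediction, target):
    return 0.5 * (prediction - target) ** 2

def mc_dropout_loss(x, y, w, p_retain, n_samples, rng):
    """Average the loss over n_samples independently drawn Bernoulli masks."""
    total = 0.0
    for _ in range(n_samples):
        mask = [1 if rng.random() < p_retain else 0 for _ in x]
        total += squared_loss(forward(x, w, mask), y)
    return total / n_samples

rng = random.Random(0)
x, y, w = [1.0, 2.0, -1.0], 0.5, [0.3, -0.2, 0.1]   # assumed toy values

# Estimate of the expected loss under masks with p = 0.5.
estimate = mc_dropout_loss(x, y, w, p_retain=0.5, n_samples=2000, rng=rng)

# With p = 1 every unit is kept, so the objective reduces to the base loss.
no_dropout = mc_dropout_loss(x, y, w, p_retain=1.0, n_samples=10, rng=rng)
```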

Sampling Rule / Algorithm

Test-time inference uses the full, unthinned network, with each unit's outgoing weights multiplied by that unit's retention probability $p$:

\tilde{w}_{ij} = p\, w_{ij}
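
A small check of why this scaling is reasonable: for a single linear downstream unit, multiplying the weights by $p$ reproduces the exact expectation of the pre-activation over all masks. The sketch below enumerates every mask for four upstream units. The weights, activations, and function names are assumed for illustration, and the equality holds only for the linear pre-activation; once nonlinearities are stacked, weight scaling is an approximation to the ensemble average.

```python
from itertools import product

def pre_activation(w, y, mask):
    """Weighted sum into one downstream unit, with masked upstream outputs."""
    return sum(wi * yi * ri for wi, yi, ri in zip(w, y, mask))

def expected_pre_activation(w, y, p_retain):
    """Exact expectation over all 2^n Bernoulli masks (feasible for small n)."""
    total = 0.0
    for mask in product([0, 1], repeat=len(y)):
        kept = sum(mask)
        prob = p_retain ** kept * (1 - p_retain) ** (len(y) - kept)
        total += prob * pre_activation(w, y, mask)
    return total

p = 0.5
w = [0.4, -1.2, 0.7, 0.3]    # assumed training-time weights
y = [1.0, 0.5, 2.0, -1.0]    # assumed upstream activations

w_test = [p * wi for wi in w]                     # test-time weights p * w
scaled = pre_activation(w_test, y, [1] * len(y))  # full network, no mask
averaged = expected_pre_activation(w, y, p)       # explicit average over masks
```

Here `scaled` and `averaged` agree up to floating-point rounding, while the enumeration costs $2^n$ forward passes and the scaled network costs one.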

Training Procedure

Optimizer: stochastic gradient descent with backpropagation.
Masking: sample a fresh Bernoulli mask for each training case.
Hidden-layer retention: typically $p \in [0.5, 0.8]$.
Input-layer retention: closer to $1$ than to $0.5$.
Learning rate: typically $10$ to $100\times$ the rate that works best for a standard net.
Momentum: high, roughly $0.95$ to $0.99$ rather than the $0.9$ common for standard nets.
Max-norm bound: the norm of each unit's incoming weight vector is capped at $c$, typically $c \in [3, 4]$.
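
The momentum update and the max-norm bound listed above can be combined as in the following sketch for one unit's incoming weight vector. The function names, the toy weights and gradient, and the specific values `lr=1.0`, `momentum=0.95`, `c=3.0` are assumptions for illustration, not settings taken from the paper's experiments.

```python
import math

def max_norm_project(w_in, c):
    """Rescale one unit's incoming weight vector so its L2 norm is at most c."""
    norm = math.sqrt(sum(w * w for w in w_in))
    if norm <= c:
        return list(w_in)
    return [w * c / norm for w in w_in]

def sgd_momentum_step(w_in, grad, velocity, lr, momentum, c):
    """One momentum SGD update followed by the max-norm projection."""
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    w_in = [w + v for w, v in zip(w_in, velocity)]
    return max_norm_project(w_in, c), velocity

w_in = [2.0, -2.0, 1.0]        # assumed weights, L2 norm 3.0
grad = [-1.0, 1.0, -0.5]       # assumed gradient from one dropout minibatch
velocity = [0.0, 0.0, 0.0]

# The raw update would push the norm to 4.5; the projection pulls it back to c.
w_in, velocity = sgd_momentum_step(
    w_in, grad, velocity, lr=1.0, momentum=0.95, c=3.0
)
```

Projecting after every update is what lets the large learning rate be used without the weights growing without bound.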

Evaluation

Datasets

MNIST
SVHN
CIFAR-10
CIFAR-100
ImageNet / ILSVRC-2010, ILSVRC-2012
TIMIT
Reuters-RCV1
splice-junction gene sequences

Metrics

Classification error rate
Phone error rate
Top-1 error
Top-5 error
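
As a reminder of how the top-1 and top-5 figures are computed, the sketch below implements a generic top-k error; the plain classification error rate is the k = 1 case. The function name `top_k_error` and the toy scores are assumptions for illustration. Phone error rate is not covered here, since it is scored on phone sequences rather than per-case labels.

```python
def top_k_error(scores, labels, k):
    """Fraction of cases whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

scores = [
    [0.1, 0.7, 0.2],   # true class 1: hit for k = 1
    [0.5, 0.3, 0.2],   # true class 1: miss for k = 1, hit for k = 2
    [0.6, 0.3, 0.1],   # true class 2: miss for k = 1 and k = 2
    [0.2, 0.1, 0.7],   # true class 2: hit for k = 1
]
labels = [1, 1, 2, 2]

top1 = top_k_error(scores, labels, k=1)   # 2 of 4 cases missed
top2 = top_k_error(scores, labels, k=2)   # 1 of 4 cases missed
```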

Headline results

MNIST: $0.79\%$ test error.
CIFAR-10: $15.6\%$ error.
CIFAR-100: $42.4\%$ error.
TIMIT core test set: $19.7\%$ phone error with pretrained dropout nets.
ILSVRC-2010: $37.5\%$ top-1, $17.0\%$ top-5 test error.

Ablations

Weight scaling vs. explicit averaging: scaling matches ensemble averaging closely.
Dropout rate on MNIST: a retention probability of $p \approx 0.5$ remains near-optimal for hidden units.
Regularizer comparison: dropout plus max-norm beats L2, L1, KL-sparsity, and max-norm alone.
Bernoulli vs. Gaussian noise: multiplicative Gaussian noise performs similarly.
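
The Bernoulli versus Gaussian comparison can be illustrated by looking at the two multipliers directly. The sketch assumes the Bernoulli mask is rescaled by $1/p$ so that it has mean 1, and that the Gaussian variance is set to $(1-p)/p$ so both multipliers share the same mean and variance. The function names and sample size are also assumptions.

```python
import random
import statistics

def bernoulli_noise(p_retain, rng):
    """Bernoulli mask rescaled by 1/p so the multiplier has mean 1."""
    return (1.0 if rng.random() < p_retain else 0.0) / p_retain

def gaussian_noise(p_retain, rng):
    """Gaussian multiplier with mean 1 and variance (1 - p) / p."""
    sigma = ((1.0 - p_retain) / p_retain) ** 0.5
    return rng.gauss(1.0, sigma)

rng = random.Random(0)
p = 0.5
b = [bernoulli_noise(p, rng) for _ in range(20000)]
g = [gaussian_noise(p, rng) for _ in range(20000)]

# Empirical (mean, variance) of each multiplier; both should sit near (1, 1)
# for p = 0.5, even though the Bernoulli one only takes the values 0 and 2.
noise_stats = {
    "bernoulli": (statistics.mean(b), statistics.pvariance(b)),
    "gaussian": (statistics.mean(g), statistics.pvariance(g)),
}
```

Matching the first two moments is one plausible reading of why the two noise types behave similarly; the sketch does not reproduce the paper's experiment.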

Results figure: activation histograms from dropout-RBM features, showing sparse activations with many near-zero responses.

Method Strengths and Weaknesses

Strengths

One regularizer works across vision, speech, text, and biology benchmarks.
Test-time cost stays near one model, not an explicit ensemble.
Weight scaling gives a simple, usable ensemble approximation.
Dropout plus max-norm beats standard regularizers on MNIST.

Weaknesses

Adds substantial gradient noise; larger learning rates become necessary.
Introduces per-layer retention probabilities to tune.
Test-time averaging is approximate, not exact Bayesian model averaging.
Reported best architectures remain dataset-specific.

Suggestions from the authors

Extend dropout analysis to deeper graphical models.
Study multiplicative noise distributions beyond Bernoulli masks.
Improve test-time averaging beyond simple weight scaling.
Explain why dropout reduces co-adaptation and induces sparse features.
