Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Nitish Srivastava, Geoffrey Hinton

2014 · JMLR

Problem

Framing

Large neural networks overfit easily, and the standard remedy of averaging the predictions of many separately trained large networks is too expensive at test time. The paper closes this gap by training randomly thinned subnetworks that share weights and approximating their ensemble with a single full network whose weights are scaled at test time.

Currently Used Methods

Foundational

Proposed Method

Architecture

Dropout applies to standard multilayer nets, convnets, and RBMs. For each training case, each unit is independently retained with probability $p$ and removed otherwise; hidden units use $p \approx 0.5$, and input units use a larger $p$.

Figure: a standard multilayer network beside a thinned network obtained by randomly dropping several hidden and input units.
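The per-case sampling described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names `sample_mask` and `apply_mask` are hypothetical, and the input retention probability of 0.8 is an assumed example of "larger $p$" rather than a value taken from this summary.

```python
import random

def sample_mask(n_units, p_retain, rng):
    """Return 0/1 flags; each unit is kept independently with probability p_retain."""
    return [1 if rng.random() < p_retain else 0 for _ in range(n_units)]

def apply_mask(activations, mask):
    """Zero the activations of dropped units, leaving retained units unchanged."""
    return [a * m for a, m in zip(activations, mask)]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# A fresh mask is drawn for every training case.
input_mask = sample_mask(8, 0.8, rng)   # assumed illustrative input retention
hidden_mask = sample_mask(8, 0.5, rng)  # hidden units use p of about 0.5

thinned_inputs = apply_mask([1.0] * 8, input_mask)
print(input_mask, hidden_mask, thinned_inputs)
```

Each distinct pair of masks selects one thinned network; all thinned networks share the same underlying weights.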

Loss / Objective

Training minimizes the base supervised loss under random Bernoulli masks:

$$\min_{\theta}\; \mathbb{E}_{(\mathbf{x},y)}\; \mathbb{E}_{\mathbf{r} \sim \mathrm{Bernoulli}(p)}\left[\ell\big(f_{\theta}(\mathbf{x};\mathbf{r}), y\big)\right]$$
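To make the inner expectation concrete, the sketch below estimates it by Monte Carlo for a single training case. The model is an assumption chosen for brevity: a single linear unit with squared loss, with the mask applied to its inputs. All function names are hypothetical.

```python
import random

def masked_prediction(weights, x, mask):
    """Output of one linear unit when inputs with mask value 0 are dropped."""
    return sum(w * xi * m for w, xi, m in zip(weights, x, mask))

def squared_loss(pred, y):
    return (pred - y) ** 2

def expected_dropout_loss(weights, x, y, p, n_samples, rng):
    """Average the loss over n_samples independently drawn Bernoulli(p) masks."""
    total = 0.0
    for _ in range(n_samples):
        mask = [1 if rng.random() < p else 0 for _ in x]
        total += squared_loss(masked_prediction(weights, x, mask), y)
    return total / n_samples

rng = random.Random(0)
estimate = expected_dropout_loss([1.0, 2.0], [1.0, 1.0], 3.0, 0.5, 10_000, rng)
print(estimate)
```

In stochastic training there is no need to average explicitly: drawing one mask per training case gives an unbiased single-sample estimate of the same expectation.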

Sampling Rule / Algorithm

Test-time inference uses the full network with outgoing weights scaled by retention probability:

$$\tilde{w}_{ij} = p\, w_{ij}$$
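The scaling rule can be checked on a toy case. The sketch below assumes a single linear unit, for which the scaled full network reproduces the average output over all masks exactly; with nonlinear units the rule is only an approximation to the ensemble average. The function names are hypothetical, and the exhaustive enumeration is feasible only because the example has three inputs.

```python
import itertools

def scale_weights(weights, p):
    """Test-time weights: each weight multiplied by the retention probability."""
    return [p * w for w in weights]

def linear_output(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def exact_expected_output(weights, x, p):
    """Average the masked output over all 2^n masks, weighted by mask probability."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(x)):
        prob = 1.0
        for m in mask:
            prob *= p if m else (1.0 - p)
        total += prob * sum(w * xi * m for w, xi, m in zip(weights, x, mask))
    return total

weights = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
p = 0.5

ensemble_mean = exact_expected_output(weights, x, p)
scaled_output = linear_output(scale_weights(weights, p), x)
print(ensemble_mean, scaled_output)
```

A single forward pass through the scaled network thus stands in for an average over exponentially many thinned networks.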

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Results figure: activation histograms from dropout-RBM features, showing sparse activations with many near-zero responses.

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers