A Fast Learning Algorithm for Deep Belief Nets

Geoffrey E. Hinton, Simon Osindero, Yee-Whye Teh

2006 · Neural Computation

A Fast Learning Algorithm for Deep Belief Nets

Problem

Framing

Deep directed belief nets were hard to train because posterior inference in dense directed layers suffers from explaining away. The paper removes that bottleneck with complementary priors plus greedy layerwise RBM learning, turning deep training into tractable local steps. On permutation-invariant MNIST, it reports 1.25% test error.

Currently Used Methods

Foundational

Proposed Method

Architecture

The DBN uses a hybrid stack: a top undirected associative memory and lower directed sigmoid layers. The MNIST model is 7845005002000784 \rightarrow 500 \rightarrow 500 \leftrightarrow 2000, with 10 label units coupled at the top.

Verified architecture page: left, the DBN for images and labels with a 28\times28 image layer, two 500-unit hidden layers, 2000 top units, and 10 label units; right, a toy earthquake-truck-house-jumps belief net illustrating explaining away.

Loss / Objective

Greedy layer addition improves a variational lower bound on data likelihood.

logp(v)hQ(hv)logp(v,h)Q(hv)\log p(\mathbf{v}) \geq \sum_{\mathbf{h}} Q(\mathbf{h}\mid \mathbf{v}) \log \frac{p(\mathbf{v}, \mathbf{h})}{Q(\mathbf{h}\mid \mathbf{v})}

Sampling Rule / Algorithm

The top RBM uses alternating conditional Bernoulli updates.

p(hj=1v)=11+exp ⁣(bjiviWij),p(vi=1h)=11+exp ⁣(cijhjWij)p(h_j = 1 \mid \mathbf{v}) = \frac{1}{1 + \exp\!\left(-b_j - \sum_i v_i W_{ij}\right)}, \qquad p(v_i = 1 \mid \mathbf{h}) = \frac{1}{1 + \exp\!\left(-c_i - \sum_j h_j W_{ij}\right)}

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: MNIST test-error comparison across permutation-invariant and geometry-aware baselines

Version of MNIST taskLearning algorithmTest error %
permutation-invariantOur generative model 784\rightarrow500\rightarrow500\leftrightarrow2000\leftrightarrow101.25
permutation-invariantSupport Vector Machine degree 9 polynomial kernel1.4
permutation-invariantBackprop 784\rightarrow500\rightarrow300\rightarrow10 cross-entropy & weight-decay1.51
permutation-invariantBackprop 784\rightarrow800\rightarrow10 cross-entropy & early stopping1.53
permutation-invariantBackprop 784\rightarrow500\rightarrow150\rightarrow10 squared error & on-line updates2.95
permutation-invariantNearest Neighbor All 60,000 examples & L3 norm2.8
permutation-invariantNearest Neighbor All 60,000 examples & L2 norm3.1
permutation-invariantNearest Neighbor 20,000 examples & L3 norm4.0
permutation-invariantNearest Neighbor 20,000 examples & L2 norm4.4
unpermuted images extra data from elastic deformationsBackprop cross-entropy & early-stopping convolutional neural net0.4
unpermuted deskewed images, extra data from 2 pixel transl.Virtual SVM degree 9 polynomial kernel0.56
unpermuted imagesShape-context features hand-coded matching0.63
unpermuted images extra data from affine transformationsBackprop in LeNet5 convolutional neural net0.8
Unpermuted imagesBackprop in LeNet5 convolutional neural net0.95

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers