Representation Learning: A Review and New Perspectives

Yoshua Bengio, Aaron Courville, Pascal Vincent

2013 · IEEE TPAMI

Problem

Framing

Representation learning lacked a compact account of why learned features help transfer, invariance, and data efficiency. The paper closes that gap with a synthesis centered on distributed codes, disentangled explanatory factors, and depth as factor reuse across tasks.

Currently Used Methods

Foundational

@rosenblattPerceptron1958 — early learned linear features for classification.
- Limitation in context: shallow linear structure misses hierarchical explanatory factors.
@rumelhartLearningRepresentationsBackpropagating1986 — distributed hidden representations learned by backpropagation.
- Limitation in context: optimization and depth remained unstable.
@hintonDeepBeliefNets2006 — greedy layer-wise pretraining for deep generative features.
- Limitation in context: no unified account across representation-learning families.
@lecunGradientbasedLearningApplied1998 — convolutional features learned end to end for vision.
- Limitation in context: supervised signals alone do not explain unlabeled transfer.
@mikolovWord2vec2013 — compact distributed word embeddings for language modeling.
- Limitation in context: task-specific embeddings do not yield a general theory.

Proposed Method

Architecture

This is a perspective paper, not a single trainable model. Its core diagram shows inputs mapped to latent explanatory factors, with overlapping factor subsets reused by multiple tasks.

Concept diagram: an input feeds a shared latent layer of explanatory factors, and overlapping subsets support Tasks A, B, and C.
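
The diagram can be read as one shared encoder followed by per-task selections of factors. The following is a minimal sketch of that reading, not a model from the paper: a random tanh layer stands in for a learned encoder, and the layer sizes, the weight matrix `W`, and the `TASK_FACTORS` assignment are all hypothetical.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
N_INPUTS, N_FACTORS = 4, 6

# Random weights stand in for a learned encoder (assumption).
rng = random.Random(0)
W = [[rng.uniform(-1.0, 1.0) for _ in range(N_INPUTS)] for _ in range(N_FACTORS)]

# Each task reads an overlapping subset of the shared latent factors.
# The index assignment is made up; the paper only describes the overlap.
TASK_FACTORS = {"A": [0, 1, 2], "B": [1, 2, 3, 4], "C": [3, 4, 5]}


def encode(x, weights):
    """Map an input vector to latent factors with a single tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def task_view(factors, task):
    """Return the subset of shared factors that a given task consumes."""
    return [factors[i] for i in TASK_FACTORS[task]]


x = [0.5, -1.0, 0.25, 2.0]
h = encode(x, W)                                   # computed once, shared by all tasks
views = {task: task_view(h, task) for task in TASK_FACTORS}
shared_ab = set(TASK_FACTORS["A"]) & set(TASK_FACTORS["B"])
```

Here `h` is computed once and every task reuses it, so factors 1 and 2 serve both Task A and Task B. That sharing is the reuse the diagram illustrates.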

Loss / Objective

The paper does not introduce a new optimization objective.

Algorithm

The paper does not introduce a new sampling or inference rule.

Training Procedure

No single training recipe
No shared hyperparameter table
Covers supervised, unsupervised, and generative settings

Evaluation

Datasets

RT03S speech recognition
Bing mobile business search speech benchmark
Wall Street Journal speech recognition
MNIST digit recognition
ImageNet object recognition
NLP benchmarks used by SENNA

Metrics

Word error rate
Perplexity
Negative log-likelihood
BLEU
Classification error rate
F1 score
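
Two of these metrics have short standard definitions. Word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the reference length, and perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below assumes whitespace tokenization and natural-log probabilities; the function names are illustrative and do not come from any benchmark's scoring tool.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
ppl = perplexity([math.log(0.25)] * 4)
```

In the usage lines, the hypothesis has one substituted and one dropped word against a six-word reference, and a model that assigns probability 0.25 to every token has perplexity of about 4.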

Headline results

RT03S speech: word error rate $27.4\% \rightarrow 18.5\%$.
Bing speech benchmark: relative error reduction $16\%$ to $23\%$.
Wall Street Journal speech: word error rate $17.2\%$ or $16.9\% \rightarrow 14.4\%$.
MNIST: error reaches $0.27\%$; knowledge-free setting reaches $0.81\%$.
ImageNet: top-5 error $26.1\% \rightarrow 15.3\%$.
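
The list mixes absolute error rates with relative reductions, so it can help to put the absolute pairs on the same relative scale. The sketch below uses only the before and after figures listed above. Treating "17.2% or 16.9%" as two alternative baselines is an assumption, and the dictionary labels are illustrative.

```python
def relative_reduction(before, after):
    """Fraction of the original error that was removed."""
    return (before - after) / before


# (before, after) error rates in percent, copied from the list above.
results = {
    "RT03S WER": (27.4, 18.5),
    "WSJ WER, 17.2 baseline": (17.2, 14.4),   # assumed alternative baseline
    "WSJ WER, 16.9 baseline": (16.9, 14.4),   # assumed alternative baseline
    "ImageNet error": (26.1, 15.3),
}

reductions = {name: relative_reduction(b, a) for name, (b, a) in results.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} relative reduction")
```

On these figures the RT03S drop removes roughly a third of the original errors and the ImageNet drop roughly two fifths. The Wall Street Journal reductions are smaller, at about 15% to 16%.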

Sample grid: natural-image samples from a spike-and-slab RBM, shown as a tiled qualitative generation result.

Ablations

Depth: deeper hierarchies support more abstract factor reuse.
Unsupervised pretraining: helps optimization when labels are scarce.
Distributed codes: share statistical strength better than local codes.
Disentangling factors: cleaner separation improves transfer and invariance.
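
The distributed-codes point can be made concrete with a counting argument. With $n$ units, a local (one-hot) code can name only $n$ distinct patterns, while a distributed binary code can name up to $2^n$. The following sketch enumerates both for a small, arbitrarily chosen $n$; the function names are illustrative.

```python
from itertools import product


def local_codes(n):
    """One-hot codes: exactly one of the n units is active per pattern."""
    return [tuple(int(i == j) for j in range(n)) for i in range(n)]


def distributed_codes(n):
    """Binary distributed codes: any subset of the n units may be active."""
    return list(product((0, 1), repeat=n))


n = 4  # small, arbitrary illustration size
n_local = len(local_codes(n))
n_distributed = len(distributed_codes(n))
print(f"{n} units: {n_local} local patterns, {n_distributed} distributed patterns")
```

Every unit in the distributed code takes part in many patterns, so evidence about one unit is reused across all of them. That reuse is one way to read the sharing of statistical strength.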

Method Strengths and Weaknesses

Strengths

Unifies transfer, invariance, sparsity, and disentangling in one framework.
Grounds claims with concrete gains in speech, vision, and NLP.
Explains depth as reuse of shared explanatory factors.
Connects generative and discriminative feature learning.

Weaknesses

No formal criterion predicts when disentangling emerges.
Causal claims about why deep features work remain partly conjectural.
Breadth reduces mechanistic detail on any single algorithm.
The paper proposes no benchmark that isolates each hypothesis.

Suggestions from the authors

Explain when deep models recover disentangled explanatory factors.
Clarify why unsupervised pretraining improves optimization and generalization.
Develop priors better matched to manifolds and sparse factors.
Study architectures that separate factors while preserving task information.
