Representation Learning: A Review and New Perspectives
Problem
Framing
Before this review, representation learning lacked a compact account of why learned features help with transfer, invariance, and data efficiency. The paper addresses that gap with a synthesis centered on distributed codes, disentangled explanatory factors, and depth as the reuse of factors across tasks.
Currently Used Methods
Foundational
- @rosenblattPerceptron1958 — early learned linear features for classification.
- Limitation in context: shallow linear structure misses hierarchical explanatory factors.
- @rumelhartLearningRepresentationsBackpropagating1986 — distributed hidden representations learned by backpropagation.
- Limitation in context: optimization and depth remained unstable.
- @hintonDeepBeliefNets2006 — greedy layer-wise pretraining for deep generative features.
- Limitation in context: no unified account across representation-learning families.
- @lecunGradientbasedLearningApplied1998 — convolutional features learned end to end for vision.
- Limitation in context: supervised signals alone do not explain unlabeled transfer.
- @mikolovWord2vec2013 — compact distributed word embeddings for language modeling.
- Limitation in context: task-specific embeddings do not yield a general theory.
Proposed Method
Architecture
This is a perspective paper, not a single trainable model. Its core diagram shows inputs mapped to latent explanatory factors, with overlapping factor subsets reused by multiple tasks.
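
The factor-reuse picture described above can be made concrete with a toy sketch. The factor names, task names, and factor-to-task assignments below are hypothetical illustrations chosen for this note and do not come from the paper. The sketch only shows what "overlapping factor subsets reused by multiple tasks" means as a data structure.

```python
# Toy illustration of tasks sharing latent explanatory factors.
# All factor and task names are assumptions made for this example.
FACTORS = ["pose", "lighting", "identity", "texture", "background"]

# Each task reads only the factor subset it needs; the subsets overlap.
TASK_FACTORS = {
    "face_recognition": {"identity", "texture"},
    "pose_estimation": {"pose", "identity"},
    "scene_labeling": {"background", "lighting", "texture"},
}


def shared_factors(task_a, task_b):
    """Return the factors that both tasks depend on, sorted by name."""
    return sorted(TASK_FACTORS[task_a] & TASK_FACTORS[task_b])


def reuse_counts():
    """Count how many tasks use each factor."""
    return {
        factor: sum(factor in used for used in TASK_FACTORS.values())
        for factor in FACTORS
    }


if __name__ == "__main__":
    print(shared_factors("face_recognition", "pose_estimation"))
    print(reuse_counts())
```

A factor with a reuse count above one is where statistical strength could be shared: data from any task that uses it also constrains the representation the other tasks read.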

Loss / Objective
The paper does not introduce a new optimization objective.
Algorithm
The paper does not introduce a new sampling or inference rule.
Training Procedure
- No single training recipe
- No shared hyperparameter table
- Covers supervised, unsupervised, and generative settings
Evaluation
Datasets
- RT03S speech recognition
- Bing mobile business search speech benchmark
- Wall Street Journal speech recognition
- MNIST digit recognition
- ImageNet object recognition
- NLP benchmarks used by SENNA
Metrics
- Word error rate
- Perplexity
- Negative log-likelihood
- BLEU
- Classification error rate
- F1 score
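
Two of the metrics listed above have short standard definitions that are easy to state in code. The sketch below uses the usual textbook forms: word error rate as word-level edit distance divided by reference length, and perplexity as the exponential of the mean negative log-likelihood. The function names and example sentences are assumptions made for this note. The benchmarks themselves may apply additional text normalization that is not modeled here.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion of a reference word
                cur[j - 1] + 1,      # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = cur
    return prev[len(hyp)] / len(ref)


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood per token."""
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

The perplexity function shows how that metric relates to negative log-likelihood: it is a monotone transform of the per-token average, so the two rank models identically on the same test set.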
Headline results
- RT03S speech: word error rate .
- Bing speech benchmark: relative error reduction to .
- Wall Street Journal speech: word error rate or .
- MNIST: error reaches ; knowledge-free setting reaches .
- ImageNet: top error .

Ablations
- Depth: deeper hierarchies support more abstract factor reuse.
- Unsupervised pretraining: helps optimization when labels are scarce.
- Distributed codes: share statistical strength better than local codes.
- Disentangling factors: cleaner separation improves transfer and invariance.
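
The point about distributed versus local codes rests on a counting argument, which the following minimal sketch illustrates. It assumes binary units and compares one-hot (local) codes with arbitrary binary (distributed) codes. The function names are hypothetical. The sketch is not a model from the paper and only counts distinguishable patterns.

```python
from itertools import product


def local_codes(n_units):
    """One-hot codes: exactly one unit is active per pattern."""
    return [
        tuple(1 if i == k else 0 for i in range(n_units))
        for k in range(n_units)
    ]


def distributed_codes(n_units):
    """All binary activation patterns over the same units."""
    return list(product((0, 1), repeat=n_units))


def shared_active_units(code_a, code_b):
    """Number of units that are active in both codes."""
    return sum(a and b for a, b in zip(code_a, code_b))


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, len(local_codes(n)), len(distributed_codes(n)))
```

With n units, a local code distinguishes n patterns while a distributed code distinguishes up to 2^n. Two distinct one-hot codes never share an active unit, whereas distributed codes can overlap, and that overlap is what lets related inputs share statistical strength.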
Method Strengths and Weaknesses
Strengths
- Unifies transfer, invariance, sparsity, and disentangling in one framework.
- Grounds claims with concrete gains in speech, vision, and NLP.
- Explains depth as reuse of shared explanatory factors.
- Connects generative and discriminative feature learning.
Weaknesses
- No formal criterion predicts when disentangling emerges.
- Causal claims about why deep features work remain partly conjectural.
- Breadth reduces mechanistic detail on any single algorithm.
- The paper proposes no benchmark that isolates each hypothesis.
Suggestions from the authors
- Explain when deep models recover disentangled explanatory factors.
- Clarify why unsupervised pretraining improves optimization and generalization.
- Develop priors better matched to manifolds and sparse factors.
- Study architectures that separate factors while preserving task information.
Links
Prior Papers
- @hintonDeepBeliefNets2006 — establishes greedy layer-wise pretraining, a central precursor for deep representation learning.
- @lecunGradientbasedLearningApplied1998 — shows end-to-end hierarchical feature learning in vision.
- @rumelhartLearningRepresentationsBackpropagating1986 — introduces learned hidden representations and backpropagation.
Further Papers
- @mikolovWord2vec2013 — gives a canonical distributed-representation success case in language.
- @goodfellowGAN2014 — extends representation learning with adversarially learned latent features.
- @devlinBERT2018 — scales transferable self-supervised representations in NLP.
- @radfordCLIP2021 — learns broad multimodal representations from image–text supervision.