Gradient-based learning applied to document recognition
Problem
Framing
At the time of the paper, OCR systems still depended on hand-built segmentation and features, which broke under shifts, distortions, and varied handwriting. The paper closes that gap with an end-to-end convolutional recognizer that learns features and classifier jointly from pixels, reaching under 1% error on handwritten-digit recognition.
Currently Used Methods
Foundational
- @rosenblattPerceptron1958 — trainable linear threshold classifier for pattern recognition.
- Limitation in context: no hierarchy, locality, or deformation tolerance.
- @rumelhartLearningRepresentationsBackpropagating1986 — backpropagation enables multilayer feature learning.
- Limitation in context: dense MLPs ignore image geometry and over-parameterize.
- A Theory of the Learnable — formal (PAC) framework for what can be learned from examples.
- Limitation in context: still depends on hand-crafted document features.
- Backpropagation Applied to Handwritten Zip Code Recognition — earlier convolutional zip-code reader.
- Limitation in context: narrower scope than full document-recognition pipelines.
Proposed Method
Architecture
LeNet-5 takes a 32×32 grayscale image and alternates convolution with subsampling before two dense stages. The core widths are 6, 6, 16, and 16 feature maps (C1, S2, C3, S4), then 120 and 84 units (C5, F6), and a 10-way output.
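The layer sizes can be sanity-checked with a short shape trace. This is a minimal sketch, not the authors' code: it assumes a 32×32 input, 5×5 unpadded convolutions, and 2×2 non-overlapping subsampling, which is how LeNet-5 is usually described. The helper names (`conv_out`, `pool_out`, `lenet5_shapes`) are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Side length after an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, window=2):
    """Side length after non-overlapping subsampling."""
    return size // window


def lenet5_shapes(input_size=32):
    """Trace (name, maps_or_units, height, width) through the network.

    Assumed layout: 5x5 convolutions and 2x2 subsampling throughout.
    """
    shapes = [("input", 1, input_size, input_size)]
    size = input_size
    stack = [("C1", "conv", 6), ("S2", "pool", 6),
             ("C3", "conv", 16), ("S4", "pool", 16),
             ("C5", "conv", 120)]
    for name, kind, maps in stack:
        size = conv_out(size, 5) if kind == "conv" else pool_out(size)
        shapes.append((name, maps, size, size))
    # Dense stages: 84 hidden units, then 10 class outputs.
    shapes.append(("F6", 84, 1, 1))
    shapes.append(("output", 10, 1, 1))
    return shapes


for layer in lenet5_shapes():
    print(layer)
```

Under these assumptions the spatial side shrinks 32, 28, 14, 10, 5, 1, so C5 acts as a dense layer over the 5×5 maps of S4.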

Loss / Objective
The network trains by supervised gradient descent on output targets.
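As a reminder of what "gradient descent on output targets" means, the stochastic update can be written as below. This is a generic sketch: $W$ denotes all trainable parameters, $\epsilon$ is an assumed step size, and $E^{p_k}$ is the loss on the single training pattern $p_k$ used at step $k$. The specific form of $E$ is left open here because this note does not record it.

```latex
W_k = W_{k-1} - \epsilon \, \frac{\partial E^{p_k}(W)}{\partial W}
```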
Algorithm
Inference is a single forward pass from pixels to class scores.
Training Procedure
- Input: 32×32 grayscale images.
- Feature maps: 6 and 16 in the first two convolution blocks.
- Hidden units: 120, then 84, before output.
- Output classes: 10 for digit recognition.
Evaluation
Datasets
- Handwritten digit recognition.
- Check reading.
- Document field recognition.
Metrics
- Classification error rate (%).
- End-to-end field recognition accuracy.
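The error-rate metric is simply the share of misclassified test examples. A minimal sketch follows; the function name `error_rate` and the toy inputs are illustrative, not taken from the paper.

```python
def error_rate(predictions, labels):
    """Percentage of examples whose predicted class differs from the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return 100.0 * wrong / len(labels)


# One mistake out of four examples.
print(error_rate([3, 1, 4, 1], [3, 1, 4, 2]))  # 25.0
```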
Headline results
- Handwritten digits: under 1% error (0.95% for LeNet-5).
- K-NN Euclidean: 5.0% error.
- Deslanted K-NN Euclidean: 2.4% error.
- The comparison table shows LeNet variants with lower error than the classical nearest-neighbor baselines.
Ablations
- Local receptive fields vs dense MLP: fewer parameters and better image modeling.
- Shared convolutions vs hand-built features: learned features improve recognition.
- Distortion-aware training: robustness improves under writing variation.
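The parameter saving from local receptive fields and weight sharing can be made concrete with a small count. This is a sketch under stated assumptions (32×32 input, six 5×5 feature maps with one bias each, and a dense layer producing the same number of outputs); the helper names are hypothetical.

```python
def conv_params(in_maps, out_maps, kernel):
    """Shared-weight convolution: one kernel plus one bias per output map."""
    return (kernel * kernel * in_maps + 1) * out_maps


def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair plus biases."""
    return (n_in + 1) * n_out


in_side, kernel, out_maps = 32, 5, 6
out_side = in_side - kernel + 1  # 28 for an unpadded 5x5 convolution

shared = conv_params(1, out_maps, kernel)
dense = dense_params(in_side * in_side, out_maps * out_side * out_side)

print(shared)  # 156
print(dense)   # 4821600
```

The dense layer needs millions of weights to produce the same 6×28×28 outputs that the shared convolution produces with 156.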
Method Strengths and Weaknesses
Strengths
- Replaces handcrafted OCR pipelines with end-to-end learning.
- Weight sharing cuts parameters relative to dense image MLPs.
- Architecture encodes translation tolerance through convolution and subsampling.
- Reports under 1% digit error, ahead of the K-NN baselines.
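The translation-tolerance claim can be illustrated with a toy 1D example. This is not from the paper: the signal, the kernel, and the helpers `conv1d`, `avg_pool`, and `shift_right` are made up for illustration. It checks two things: shifting the input shifts the convolution output by the same amount, and after 2x subsampling a two-sample input shift shows up as only a one-position shift.

```python
def conv1d(signal, kernel):
    """Unpadded sliding dot product (cross-correlation, as in most CNN code)."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]


def avg_pool(values, window=2):
    """Non-overlapping average pooling."""
    return [sum(values[i:i + window]) / window
            for i in range(0, len(values) - window + 1, window)]


def shift_right(signal, k):
    """Shift by k positions, padding with zeros on the left."""
    return [0] * k + signal[:len(signal) - k]


x = [0, 0, 1, 2, 1, 0, 0, 0]
k = [1, 0, -1]
feat = conv1d(x, k)

# Convolution commutes with the shift (away from the borders).
equivariant = conv1d(shift_right(x, 1), k) == shift_right(feat, 1)

# After pooling, a 2-sample input shift becomes a 1-position shift.
halved = (avg_pool(conv1d(shift_right(x, 2), k))
          == shift_right(avg_pool(feat), 1))

print(equivariant, halved)  # True True
```

Each subsampling stage reduces positional precision in this way, so later layers see smaller displacements than the input did.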
Weaknesses
- The retrieved text does not include the exact loss formula, so the objective is only summarized here.
- Fixed-size input limits handling of variable document layouts.
- Shallow architecture limits representational depth.
- Evaluation summary here is strongest on digits, weaker on broader documents.
Suggestions from the authors
- Extend trainable recognition from isolated characters to full document fields.
- Improve robustness to geometric distortion and handwriting variation.
- Integrate segmentation, recognition, and language constraints jointly.
- Scale convolutional readers to richer document structures.
Links
Prior Papers
- @rosenblattPerceptron1958 — early trainable classification lineage that this paper turns into a spatially structured vision model.
- @rumelhartLearningRepresentationsBackpropagating1986 — backpropagation makes end-to-end training of convolutional recognizers possible.
Further Papers
- @bengioRepresentationLearning2013 — frames why learned hierarchical features replace handcrafted pipelines.
- @krizhevskyAlexNet2012 — scales the convolutional recipe to large-scale visual recognition.
- @simonyanVGGVeryDeep2014 — deepens stacked convolutions into a stronger generic vision backbone.
- @heDeepResidualLearning2016 — extends CNN depth with residual optimization for much larger visual models.
1. Summary
Motivation / Problem
- Traditional OCR and document-recognition pipelines require heavy hand-crafted preprocessing and manually designed features.
- Classical ML approaches carry a heavy feature-engineering burden.
Prior Work and Its Limitations
- ML (Hand-Engineered Feature Extractor + Classical ML Classifier)
- Feature Extractor + K-NN, PCA/quadratic methods, RBFs, SVMs
- Limitation
- The hand-engineered feature extractor needs domain-specific feature engineering, which is labor-intensive.
- Cannot handle shifts and distortions (outliers).
- MLP
- Cannot capture the local structure of 2D images.
Proposed Method
- LeNet: a convolutional neural network.
- Uses a convolutional network as a learned feature extractor, trained with the same gradient-based learning as the classifier.
- ![[@lecunGradientbasedLearningApplied1998_LeNet5.png]]
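- The convolution, squashing nonlinearity, and average-pooling (subsampling) structure captures the local information that 2D images possess. LeNet-5 uses a sigmoid-style (scaled tanh) squashing function rather than ReLU.
- A plain MLP cannot exploit the local connectivity of 2D images.
- Gradients flow end-to-end, from the classifier back through the feature extractor, rather than only through the classifier.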
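The end-to-end gradient flow can be written out with the chain rule for a two-module system. The notation is an assumption introduced here for illustration: a feature extractor $h = F(x; W_f)$, a classifier $y = G(h; W_c)$, and a loss $E(y, d)$ against the target $d$.

```latex
\frac{\partial E}{\partial W_c}
  = \left(\frac{\partial y}{\partial W_c}\right)^{\!\top} \frac{\partial E}{\partial y},
\qquad
\frac{\partial E}{\partial W_f}
  = \left(\frac{\partial h}{\partial W_f}\right)^{\!\top}
    \left(\frac{\partial y}{\partial h}\right)^{\!\top}
    \frac{\partial E}{\partial y}
```

In a hand-crafted pipeline only the first gradient exists, because $F$ has no trainable parameters. Here the second term lets the classification error reshape the features themselves.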
Hypothesis and Evaluation
- Hypothesis
- LeNet can learn task-specific features jointly with the classifier, and can match or beat existing methods (hand-crafted pipelines).
- Evaluation
- Handwritten character and digit recognition benchmarks
- A full document-recognition system
2. Paper Strengths and Weaknesses
Strengths
- Learns features without handcrafting.
- Combines feature extraction and classification in one trainable model.
- Captures image structure through locality and weight sharing.
Weaknesses
- The model is shallow, which limits it to relatively simple recognition tasks.
3. My Opinion
Overall Rating
- Strong Accept
Recommendation Justification
- This paper plays a historically important role as one of the earliest and most influential demonstrations of learned image feature extraction with convolutional networks.
Detailed Comments
- This work is very important because it shifts the paradigm from handcrafted feature-extraction pipelines to trainable end-to-end systems.