ImageNet Classification with Deep Convolutional Neural Networks
Problem
Framing
Large-scale image classification had ImageNet-scale data, but CNNs had not yet been shown to train effectively on 1.2M high-resolution images across 1000 classes. The paper closes that gap with a deep GPU-trained CNN using ReLUs, data augmentation, and dropout, reaching 17.0% top-5 error on ILSVRC-2010.
Currently Used Methods
Foundational
- @lecunGradientbasedLearningApplied1998 — convolution, pooling, and weight sharing for vision.
- Limitation in context: too small-scale for ImageNet-sized recognition.
- @srivastavaDropout2014 — dropout as cheap ensemble-style regularization.
- Limitation in context: regularization alone does not solve large-CNN optimization.
- Multi-column Deep Neural Networks for Image Classification — GPU CNNs with columnar model parallelism.
- Limitation in context: no ImageNet-scale 1000-class result.
- ImageNet Large Scale Visual Recognition Challenge — sparse-coding competition baseline on ILSVRC.
- Limitation in context: much worse top-1 and top-5 error.
Proposed Method
Architecture
The network has five convolutional layers and three fully connected layers, with ReLU after every learned layer and a 1000-way softmax head. Input is a 224×224×3 crop. The conv stack uses 11×11 kernels at stride 4 in the first layer, 5×5 in the second, and 3×3 in the remaining three, with 96, 256, 384, 384, and 256 kernels respectively; the two dense hidden layers have 4096 units each. The model is split across two GPUs with limited cross-GPU connections.
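As a sanity check on model size, the parameter count can be tallied layer by layer. This is a minimal sketch, not the authors' code. It assumes the layer shapes reported in the paper: kernel depths of 48 and 192 for the conv layers that only see their own GPU's feature maps, and a 6×6×256 feature map feeding the first dense layer. The names `CONV_LAYERS`, `DENSE_LAYERS`, and the helper functions are hypothetical.

```python
# Rough parameter count for the eight learned layers (illustrative sketch).
# Conv entry: (num_kernels, kernel_h, kernel_w, input_depth_seen_by_each_kernel).
# Depths 48 and 192 are assumptions reflecting layers whose kernels only
# connect to the half of the previous layer that lives on the same GPU.
CONV_LAYERS = [
    (96, 11, 11, 3),
    (256, 5, 5, 48),
    (384, 3, 3, 256),
    (384, 3, 3, 192),
    (256, 3, 3, 192),
]

# Dense entry: (inputs, outputs). 6 * 6 * 256 is the assumed flattened size
# of the feature map after the last pooling layer.
DENSE_LAYERS = [
    (6 * 6 * 256, 4096),
    (4096, 4096),
    (4096, 1000),
]


def conv_params(kernels, kh, kw, depth):
    """Weights plus one bias per kernel."""
    return kernels * kh * kw * depth + kernels


def dense_params(inputs, outputs):
    """Weight matrix plus one bias per output unit."""
    return inputs * outputs + outputs


def total_params():
    conv = sum(conv_params(*layer) for layer in CONV_LAYERS)
    dense = sum(dense_params(*layer) for layer in DENSE_LAYERS)
    return conv + dense


print(total_params())
```

Under these assumptions the total lands at roughly 60 million, with the three dense layers accounting for the large majority of the weights.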

Loss / Objective
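Training maximizes the multinomial logistic regression objective, i.e. the average log-probability of the correct label under the softmax output over the 1000 ImageNet classes.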
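The objective can be written out in a few lines. The sketch below is illustrative only: `log_softmax` and `mean_log_prob` are hypothetical helper names, and logits are plain Python lists rather than GPU tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_log_prob(batch_logits, labels):
    """Average log-probability of the correct label (the quantity to maximize)."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total += log_softmax(logits)[label]
    return total / len(labels)
```

Maximizing `mean_log_prob` is the same as minimizing the usual cross-entropy loss, which is its negation.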
Algorithm
Test-time prediction averages the softmax outputs over ten 224×224 crops: the four corners and the center, plus their horizontal reflections.
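A minimal sketch of ten-crop averaging follows, assuming a 224-pixel crop taken from the four corners and the center of the image, each with and without a horizontal flip. The function names and the `(top, left, flip)` tuple layout are hypothetical, and the model that would produce the per-crop probabilities is left out.

```python
def ten_crop_boxes(height, width, crop=224):
    """Return (top, left, flip) for 4 corner crops, the center crop, and their mirrors."""
    positions = [
        (0, 0),                                        # top-left
        (0, width - crop),                             # top-right
        (height - crop, 0),                            # bottom-left
        (height - crop, width - crop),                 # bottom-right
        ((height - crop) // 2, (width - crop) // 2),   # center
    ]
    return [(top, left, flip) for flip in (False, True) for (top, left) in positions]


def average_predictions(per_crop_probs):
    """Average class probabilities across crops, class by class."""
    n = len(per_crop_probs)
    num_classes = len(per_crop_probs[0])
    return [sum(p[c] for p in per_crop_probs) / n for c in range(num_classes)]
```

In use, each box would be cropped (and mirrored when `flip` is true), passed through the network, and the ten softmax vectors handed to `average_predictions`.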
Training Procedure
- Batch size: 128
- Optimizer: SGD with momentum 0.9
- Weight decay: 0.0005
- Learning rate: start at 0.01, divide by 10 when validation error stops improving
- Dropout: 0.5 in the first two fully connected layers
- Input resize: shorter side to 256, then mean subtraction
- Augmentation: random 224×224 crops and horizontal flips
- Color jitter: PCA RGB noise with Gaussian scale (standard deviation 0.1)
- Hardware: two GTX 580 3GB GPUs
- Training time: 5 to 6 days
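The optimizer settings combine into a single update rule. The sketch below follows the form given in the paper, where weight decay is scaled by the learning rate and folded into the velocity. The defaults (learning rate 0.01, momentum 0.9, weight decay 0.0005) are the paper's reported values, and `sgd_step` is a hypothetical name operating on a single scalar weight for clarity.

```python
def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One momentum-SGD update with weight decay for a single scalar weight.

    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next


# Example: one step from w = 1.0 with zero initial velocity and gradient 2.0.
w, v = sgd_step(1.0, 0.0, 2.0)
```

Note that the decay term here acts on the weight itself, not on the gradient. In this form it behaves as part of the update rather than only as a regularization penalty added to the loss.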
Evaluation
Datasets
- ILSVRC-2010: 1.2M train, 50k validation, 150k test, 1000 classes
- ILSVRC-2012 competition test set
Metrics
- Top-1 error
- Top-5 error
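Both metrics are instances of top-k error: the fraction of examples whose true label is not among the k highest-scoring classes. A minimal sketch, with `top_k_error` as a hypothetical helper that takes per-class scores as plain lists:

```python
def top_k_error(score_rows, labels, k=5):
    """Fraction of examples whose true label is outside the k best-scoring classes."""
    errors = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)
```

Calling it with `k=1` gives top-1 error and with `k=5` gives top-5 error. Top-5 error can never exceed top-1 error on the same predictions.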
Headline results
- ILSVRC-2010 test: top-1 37.5%, top-5 17.0%
- ILSVRC-2010 competition best baseline: top-1 47.1%, top-5 28.2%
- ILSVRC-2010 published SIFT+FV baseline: top-1 45.7%, top-5 25.7%
- ILSVRC-2012 single model: top-5 18.2% (validation)
- ILSVRC-2012 ensemble submission: top-5 15.3% (test)
Table 1: ILSVRC-2010 test error comparison
| Model | Top-1 | Top-5 |
|---|---|---|
| Sparse coding [2] | 47.1% | 28.2% |
| SIFT + FVs [24] | 45.7% | 25.7% |
| CNN | 37.5% | 17.0% |
Ablations
- ReLU vs. tanh: reaches 25% CIFAR-10 training error about six times faster.
- Two GPUs vs. smaller one-GPU net: lowers top-1 by 1.7% and top-5 by 1.2%.
- Local response normalization: lowers top-1 by 1.4% and top-5 by 1.2%.
- Overlapping pooling: lowers top-1 by 0.4% and top-5 by 0.3%.
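For reference, local response normalization divides each activation by a term built from the squared activations of neighboring kernel maps at the same spatial position. The sketch below applies it to the channel vector at one pixel. The defaults k=2, n=5, alpha=1e-4, beta=0.75 are the values reported in the paper, and `local_response_norm` is a hypothetical name.

```python
def local_response_norm(activations, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize one pixel's channel activations across n adjacent channels.

    b_i = a_i / (k + alpha * sum_j a_j^2) ** beta,
    where j runs over the channels within n // 2 of i, clipped to valid indices.
    """
    num_channels = len(activations)
    out = []
    for i, a in enumerate(activations):
        lo = max(0, i - n // 2)
        hi = min(num_channels - 1, i + n // 2)
        sq_sum = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out
```

Because the denominator grows with the energy of nearby channels, a strongly responding kernel damps its neighbors, which is the competition effect the ablation measures.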

Method Strengths and Weaknesses
Strengths
- Cuts ILSVRC-2010 top-5 error from 25.7% (best published baseline) to 17.0%.
- Shows deep CNNs can train on 1.2M images and 1000 classes.
- Quantifies gains from ReLUs, normalization, pooling, and multi-GPU training.
- Uses simple augmentation and dropout to control a 60M-parameter model.
Weaknesses
- Requires two GPUs and 5 to 6 training days.
- Architecture depends on hand-crafted cross-GPU connectivity.
- Ten-crop evaluation increases inference cost.
- Single-model ILSVRC-2012 top-5 stays at 18.2%, above the ensemble result.
Suggestions from the authors
- Train larger CNNs as GPU memory and speed improve.
- Use bigger labeled datasets for further accuracy gains.
- Expand public GPU CNN implementations for wider experimentation.
- Push model size beyond current memory limits.
Links
Prior Papers
- @lecunGradientbasedLearningApplied1998 — establishes the convolutional template that AlexNet scales to ImageNet.
- @srivastavaDropout2014 — dropout is one of AlexNet's main regularizers for the dense layers.
Further Papers
- @simonyanVGGVeryDeep2014 — deepens the AlexNet-style CNN stack with smaller filters and stronger ImageNet accuracy.
- @szegedyGoogLeNet2015 — redesigns large-scale CNNs for better accuracy-efficiency tradeoffs on ImageNet.
- @renFasterRCNN2015 — uses AlexNet-era convolutional backbones for modern object detection.
- @radfordDCGAN2015 — transfers AlexNet-inspired convolutional design into generative modeling.
- @mnihDQN2015 — applies deep convolutional feature learning to control from pixels.
- @heKaimingInit2015 — studies initialization for deep rectifier networks, sharpening a core AlexNet choice.