ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

2012 · NeurIPS

Problem

Framing

ImageNet made large-scale labeled data available, but CNNs had not yet been shown to train effectively on $1.2$ M high-resolution images spanning $1000$ classes. The paper closes that gap with a deep CNN trained on GPUs, using ReLUs, data augmentation, and dropout, and reaches $37.5\%$ top-1 and $17.0\%$ top-5 error on the ILSVRC-2010 test set.

Currently Used Methods

Foundational

@lecunGradientbasedLearningApplied1998 — convolution, pooling, and weight sharing for vision.
- Limitation in context: too small-scale for ImageNet-sized recognition.
@srivastavaDropout2014 — dropout as cheap ensemble-style regularization.
- Limitation in context: regularization alone does not solve large-CNN optimization.
Multi-column Deep Neural Networks for Image Classification — GPU CNNs with columnar model parallelism.
- Limitation in context: no ImageNet-scale $1000$-class result.
ImageNet Large Scale Visual Recognition Challenge — sparse-coding competition baseline on ILSVRC.
- Limitation in context: much worse top-1 and top-5 error.

Proposed Method

Architecture

The network has five convolutional layers followed by three fully connected layers, with ReLU nonlinearities on the convolutional and fully connected layers and a $1000$-way softmax on the output. Input is a $224 \times 224 \times 3$ crop. The convolutional stack uses $96$ kernels of $11 \times 11$ at stride $4$, then $256$ kernels of $5 \times 5$, then $384$, $384$, and $256$ kernels of $3 \times 3$; the two fully connected hidden layers have $4096$ units each. The model is split across two GPUs that communicate only at certain layers.
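The layer sizes above can be turned into a rough parameter count. The sketch below is illustrative only: the per-kernel input depths (48, 256, 192, 192) follow the paper's description of the two-GPU split, the $6 \times 6 \times 256$ feature map feeding the first dense layer is an assumption based on the commonly cited layer arithmetic, and the names `CONV_LAYERS`, `FC_LAYERS`, and `count_parameters` are hypothetical.

```python
# Hypothetical layer table: (name, out_channels, kernel_h, kernel_w, in_channels_per_kernel).
# in_channels_per_kernel reflects the two-GPU split: conv2, conv4 and conv5
# kernels only see the feature maps that live on their own GPU.
CONV_LAYERS = [
    ("conv1", 96, 11, 11, 3),
    ("conv2", 256, 5, 5, 48),
    ("conv3", 384, 3, 3, 256),  # the one conv layer that reads from both GPUs
    ("conv4", 384, 3, 3, 192),
    ("conv5", 256, 3, 3, 192),
]

# (name, inputs, outputs). The 6*6*256 input size is an assumption.
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def count_parameters():
    """Return a dict of weights + biases per layer."""
    counts = {}
    for name, out_c, kh, kw, in_c in CONV_LAYERS:
        counts[name] = out_c * (kh * kw * in_c) + out_c
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out
    return counts


PARAM_COUNTS = count_parameters()
TOTAL_PARAMS = sum(PARAM_COUNTS.values())
```

Under these assumptions the total comes to about $61$ M, in line with the roughly $60$ M parameters reported, with most of the weights in the fully connected layers.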

Architecture diagram: the AlexNet CNN split across two GPUs, with five convolutional stages, max-pooling, two dense hidden layers, and a 1000-way output.

Loss / Objective

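The model maximizes the multinomial logistic regression objective, which is the average over training cases of the log-probability of the correct label under the predicted distribution.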

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid \mathbf{x}_i)$$
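As a minimal sketch of this objective, the code below evaluates $\mathcal{L}(\theta)$ from raw class scores using a numerically stable log-softmax. The function names and the plain-list data layout are illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Stable log-softmax over one list of class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def mean_log_likelihood(batch_logits, labels):
    """Average log p(y_i | x_i) over the batch; training maximizes this value."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        total += log_softmax(logits)[y]
    return total / len(labels)
```

With uniform scores over $K$ classes the value is $\log(1/K)$, which is a convenient sanity check for an untrained model.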

Algorithm

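At test time, the prediction averages the softmax outputs over ten $224 \times 224$ crops: the four corners and the center, plus their horizontal reflections. $\mathcal{C}_{10}$ denotes this set of ten crop operations.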

$$\hat{p}(y \mid \mathbf{x}) = \frac{1}{10} \sum_{c \in \mathcal{C}_{10}} p_{\theta}(y \mid c(\mathbf{x}))$$
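The averaging rule can be sketched as follows. The code assumes a $256 \times 256$ input image and enumerates four corner crops and one center crop, each with and without a horizontal flip; `ten_crop_boxes` and `average_predictions` are hypothetical helper names, and the model that produces the per-crop probabilities is left out.

```python
def ten_crop_boxes(height=256, width=256, size=224):
    """Return ten (top, left, flip) triples: 4 corners + center, plain and flipped."""
    tops = (0, height - size)
    lefts = (0, width - size)
    corners = [(t, l) for t in tops for l in lefts]
    center = ((height - size) // 2, (width - size) // 2)
    positions = corners + [center]
    return [(t, l, flip) for flip in (False, True) for (t, l) in positions]


def average_predictions(per_crop_probs):
    """Element-wise mean of the softmax vectors, one vector per crop."""
    n = len(per_crop_probs)
    k = len(per_crop_probs[0])
    return [sum(p[j] for p in per_crop_probs) / n for j in range(k)]
```

In use, each triple selects a crop (flipped if `flip` is true), the network scores it, and `average_predictions` combines the ten softmax vectors into $\hat{p}$.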

Training Procedure

Batch size: $128$
Optimizer: SGD with momentum $0.9$
Weight decay: $5 \times 10^{-4}$
Learning rate: start at $10^{-2}$, divide by $10$ when validation error stops improving
Dropout: $p=0.5$ in the first two fully connected layers
Input resize: shorter side to $256$, central $256 \times 256$ crop, then per-pixel mean subtraction
Augmentation: random $224 \times 224$ crops and horizontal flips
Color jitter: PCA-based RGB noise with Gaussian standard deviation $0.1$
Hardware: two GTX 580 $3$ GB GPUs
Training time: $5$ to $6$ days
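The optimizer settings above correspond to the update rule given in the paper, $v \leftarrow 0.9\,v - 0.0005\,\epsilon\,w - \epsilon\,g$ followed by $w \leftarrow w + v$, where $\epsilon$ is the learning rate and $g$ the batch gradient. The sketch below applies it to plain Python lists; `sgd_step` and `reduce_on_plateau` are hypothetical names, `grad` is assumed to be the gradient of the loss being minimized, and the plateau test is a simple stand-in for the authors' manual schedule.

```python
def sgd_step(w, v, grad, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One momentum-SGD step with weight decay folded into the velocity."""
    v_new = [
        momentum * vi - weight_decay * lr * wi - lr * gi
        for wi, vi, gi in zip(w, v, grad)
    ]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def reduce_on_plateau(lr, val_errors, factor=10.0):
    """Divide lr by `factor` if the latest validation error did not beat the best so far.

    This criterion is an assumption; the paper adjusted the rate by hand.
    """
    if len(val_errors) >= 2 and val_errors[-1] >= min(val_errors[:-1]):
        return lr / factor
    return lr
```

Note that weight decay here scales with the learning rate, so it shrinks along with $\epsilon$ each time the rate is reduced.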

Evaluation

Datasets

ILSVRC-2010: $1.2$ M train, $50$ k validation, $150$ k test, $1000$ classes
ILSVRC-2012 competition test set

Metrics

Top-1 error
Top-5 error
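Both metrics are instances of top-$k$ error: the fraction of images whose true label is not among the $k$ highest-scoring classes. A minimal sketch, with `top_k_error` as a hypothetical helper operating on plain lists of scores:

```python
def top_k_error(batch_scores, labels, k=5):
    """Fraction of examples whose true label is outside the k best-scoring classes."""
    wrong = 0
    for scores, y in zip(batch_scores, labels):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if y not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)
```

Setting `k=1` gives top-1 error and `k=5` gives top-5 error; top-5 error can never exceed top-1 error on the same predictions.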

Headline results

ILSVRC-2010 test: top-1 $37.5\%$, top-5 $17.0\%$
ILSVRC-2010 best competition entry (sparse coding): top-1 $47.1\%$, top-5 $28.2\%$
ILSVRC-2010 best published result (SIFT + FVs): top-1 $45.7\%$, top-5 $25.7\%$
ILSVRC-2012 single model: top-5 $18.2\%$ (validation)
ILSVRC-2012 ensemble submission: top-5 $15.3\%$ (test)

Table 1: ILSVRC-2010 test error comparison

Model	Top-1	Top-5
Sparse coding [2]	47.1%	28.2%
SIFT + FVs [24]	45.7%	25.7%
CNN	37.5%	17.0%

Ablations

ReLU vs. $\tanh$: a four-layer CNN with ReLUs reaches $25\%$ CIFAR-10 training error about six times faster than the same network with $\tanh$ units.
Two GPUs vs. a one-GPU net with half as many kernels per convolutional layer: lowers top-1 by $1.7\%$ and top-5 by $1.2\%$ (absolute).
Local response normalization: lowers top-1 by $1.4\%$ and top-5 by $1.2\%$ (absolute).
Overlapping pooling: lowers top-1 by $0.4\%$ and top-5 by $0.3\%$ (absolute).
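For readers unfamiliar with local response normalization, the sketch below shows the operation at a single spatial position: each channel's activation is divided by a term that grows with the squared activations of its neighboring channels. The constants $k=2$, $n=5$, $\alpha=10^{-4}$, $\beta=0.75$ are the values reported in the paper as best recalled here; the function name and list-based layout are illustrative assumptions.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations `a` (one value per channel) across adjacent channels."""
    num_channels = len(a)
    half = n // 2
    out = []
    for i in range(num_channels):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        # Sum of squares over the window of neighboring channels, clipped at the edges.
        s = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * s) ** beta)
    return out
```

Channels surrounded by strongly active neighbors are damped more than isolated ones, which is the competition effect the ablation measures.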

Sample grid: first-layer convolutional filters, with many grayscale edge detectors and a few color-sensitive kernels.

Method Strengths and Weaknesses

Strengths

Cuts ILSVRC-2010 top-5 error from $25.7\%$ (best published result) to $17.0\%$.
Shows deep CNNs can train on $1.2$ M images and $1000$ classes.
Quantifies gains from ReLUs, normalization, pooling, and multi-GPU training.
Uses simple augmentation and dropout to control a $60$ M-parameter model.

Weaknesses

Requires two GPUs and $5$ to $6$ days of training.
Architecture depends on hand-crafted cross-GPU connectivity.
Ten-crop evaluation increases inference cost.
Single-model ILSVRC-2012 top-5 error is $18.2\%$ on validation; the $15.3\%$ test result requires an ensemble.

Suggestions from the authors

Train larger CNNs as GPU memory and speed improve.
Use bigger labeled datasets for further accuracy gains.
Expand public GPU CNN implementations for wider experimentation.
Push model size beyond current memory limits.
