Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, Andrew Zisserman

2014 · ICLR

Problem

Framing

Earlier ImageNet ConvNets were comparatively shallow (8 learned layers in AlexNet), and reported gains from depth were entangled with larger filters and ad hoc design choices. The paper isolates depth by stacking uniform $3 \times 3$ convolutions, reaches 19 weight layers, and cuts ILSVRC top-5 test error to $6.8\%$ with a two-model fusion.

Currently Used Methods

Foundational

@lecunGradientbasedLearningApplied1998 — early convolutional hierarchy for visual recognition.
- Limitation in context: too shallow for ImageNet-scale representation learning.
@krizhevskyAlexNet2012 — large-scale CNN training with ReLU, dropout, and augmentation.
- Limitation in context: only 8 learned layers and large early filters.
"Visualizing and Understanding Convolutional Networks" — stronger ImageNet ConvNet with improved optimization.
- Limitation in context: still shallower and less architecturally uniform.
@szegedyGoogLeNet2015 — parameter-efficient deep CNN via inception modules.
- Limitation in context: does not isolate plain depth scaling with homogeneous filters.

Proposed Method

Architecture

The network takes fixed $224 \times 224$ RGB crops. Convolutions use stride 1, with padding 1 for $3 \times 3$ kernels so that spatial resolution is preserved, and five $2 \times 2$ max-pooling layers with stride 2 halve the resolution after each convolutional block. All convolutions use $3 \times 3$ kernels except the $1 \times 1$ layers in config C. Width doubles after each pooling layer, from 64 up to 512 channels; configs A–E span 11 to 19 weight layers, and each ends with FC-4096, FC-4096, FC-1000 followed by a softmax.
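The depth and resolution bookkeeping above can be checked with a short sketch. The per-block convolution counts are taken from the paper's configuration table; the names `CONV_PER_BLOCK`, `weight_layers`, and `trace_spatial_size` are hypothetical helpers for this note, not part of any released code.

```python
# Convolutions per block for configs A-E (from the paper's configuration
# table). Config C has the same depth as D; it swaps the last conv of
# blocks 3-5 for a 1x1 kernel, which does not change these counts.
CONV_PER_BLOCK = {
    "A": [1, 1, 2, 2, 2],
    "B": [2, 2, 2, 2, 2],
    "C": [2, 2, 3, 3, 3],
    "D": [2, 2, 3, 3, 3],
    "E": [2, 2, 4, 4, 4],
}
NUM_FC_LAYERS = 3  # FC-4096, FC-4096, FC-1000


def weight_layers(config):
    """Number of layers with learned weights (conv + fully connected)."""
    return sum(CONV_PER_BLOCK[config]) + NUM_FC_LAYERS


def out_size(size, kernel, stride, padding):
    """Standard output-size formula for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_spatial_size(config, size=224):
    """Spatial side length at the input and after each of the five blocks."""
    sizes = [size]
    for n_convs in CONV_PER_BLOCK[config]:
        for _ in range(n_convs):
            # 3x3, stride 1, padding 1 keeps the resolution unchanged
            # (a 1x1 conv with padding 0 would do the same).
            size = out_size(size, kernel=3, stride=1, padding=1)
        # 2x2 max-pooling with stride 2 halves the resolution.
        size = out_size(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    for name in "ABCDE":
        print(name, weight_layers(name), trace_spatial_size(name))
```

Every config ends at a $7 \times 7 \times 512$ feature map, which is what the first FC-4096 layer consumes.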

Loss / Objective

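Training minimizes the multinomial logistic (softmax cross-entropy) loss over 1000 classes, where $y_{nk} \in \{0, 1\}$ is the one-hot label of image $\mathbf{x}_n$ and $p_{\theta}$ is the softmax output of the network:

$$\mathcal{L}(\theta) = - \sum_{n=1}^{N} \sum_{k=1}^{1000} y_{nk} \log p_{\theta}(k \mid \mathbf{x}_n)$$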
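A minimal pure-Python sketch of this objective, assuming integer class labels in place of one-hot vectors (with a one-hot label, the inner sum over $k$ reduces to the log-probability of the true class). The function names are illustrative only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def multinomial_logistic_loss(batch_logits, labels):
    """Sum over the batch of -log p(true class | x), matching the formula above."""
    total = 0.0
    for logits, label in zip(batch_logits, labels):
        total -= log_softmax(logits)[label]
    return total


if __name__ == "__main__":
    # Two images, 1000 classes, all-zero logits: each term equals log(1000).
    logits = [[0.0] * 1000, [0.0] * 1000]
    print(multinomial_logistic_loss(logits, [3, 7]))
```

The formula sums over examples; practical training code usually divides by the batch size, which only rescales the gradient.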

Sampling Rule / Algorithm

Test-time prediction averages softmax class posteriors over $M$ evaluations of the image, which may differ in crop, test scale, or model (for fusion):

$$\hat{p}(k \mid \mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(k \mid \mathbf{x}^{(m)})$$
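The averaging rule can be sketched as follows, assuming each view or model has already produced a normalized posterior vector. `average_posteriors` and `predict` are hypothetical names, and the numbers in the usage lines are made up for illustration.

```python
def average_posteriors(posteriors):
    """Element-wise mean of M posterior vectors, each of length K."""
    m = len(posteriors)
    k = len(posteriors[0])
    return [sum(p[c] for p in posteriors) / m for c in range(k)]


def predict(posteriors):
    """Class index with the highest averaged posterior."""
    avg = average_posteriors(posteriors)
    return max(range(len(avg)), key=avg.__getitem__)


if __name__ == "__main__":
    # Three views of one image over three classes (illustrative values).
    views = [
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.1, 0.6, 0.3],
    ]
    print(average_posteriors(views), predict(views))
```

Because each input sums to one, the average is itself a valid distribution; a single confident view can be outvoted by the others, as in the example.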

Training Procedure

Input crop: $224 \times 224$ RGB.
Batch size: 256.
Optimizer: SGD with momentum 0.9.
Weight decay: $5 \times 10^{-4}$.
Dropout: 0.5 in the first two FC layers.
Learning rate: $10^{-2}$, divided by 10 whenever validation accuracy stops improving (three decreases in total).
Training length: 370K iterations, about 74 epochs.
Scale jittering: training scale $S$ (the shorter image side after isotropic rescaling) is either fixed at 256 or 384, or sampled per image from $[256, 512]$.
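The scale-jittering step and the iteration-to-epoch arithmetic can be sketched as below. This is only geometry and counting, not an image pipeline. Sampling $S$ as an integer, the helper names, and the training-set size of 1,281,167 images (the commonly cited ILSVRC-2012 figure; the paper rounds it to 1.3M) are assumptions of this sketch.

```python
import random

CROP = 224  # fixed network input size


def sample_train_scale(rng, s_min=256, s_max=512):
    """Multi-scale training: draw S per image (integer sampling is an assumption)."""
    return rng.randint(s_min, s_max)


def rescaled_size(width, height, s):
    """Isotropic rescale so that the shorter side equals s."""
    if width <= height:
        return s, round(height * s / width)
    return round(width * s / height), s


def random_crop_box(rng, width, height, crop=CROP):
    """(left, top, right, bottom) of a random crop inside the rescaled image."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


def approx_epochs(iterations=370_000, batch_size=256, train_images=1_281_167):
    """Passes over the training set implied by the iteration count."""
    return iterations * batch_size / train_images


if __name__ == "__main__":
    rng = random.Random(0)
    s = sample_train_scale(rng)
    w, h = rescaled_size(500, 375, s)
    print(s, (w, h), random_crop_box(rng, w, h))
    print(round(approx_epochs(), 1))
```

Since $S \geq 256 > 224$, a crop always fits; at larger $S$ the crop covers a smaller part of the image, so the network sees objects at varying apparent sizes.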

Evaluation

Datasets

ILSVRC-2012 classification.
ILSVRC-2014 localization.
PASCAL VOC-2007 classification.
PASCAL VOC-2012 classification.
PASCAL VOC-2012 action classification.
Caltech-101.
Caltech-256.

Metrics

Top-1 classification error.
Top-5 classification error.
Top-5 localization error.
Mean average precision.
Mean class recall.
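For reference, top-$k$ classification error counts an image as correct when the true label appears among the $k$ highest-scoring classes. A small sketch, with `top_k_error` as a hypothetical helper and made-up scores:

```python
def top_k_error(score_rows, labels, k=5):
    """Fraction of examples whose true label is not among the k best scores."""
    errors = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


if __name__ == "__main__":
    scores = [
        [0.10, 0.50, 0.20, 0.15, 0.05],
        [0.40, 0.10, 0.30, 0.15, 0.05],
    ]
    labels = [2, 4]
    print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=2))
```

Top-1 error is the special case $k = 1$; top-5 error is the main ILSVRC ranking criterion.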

Headline results

ILSVRC classification, 2-model fusion: $23.7\%$ top-1 val, $6.8\%$ top-5 val, $6.8\%$ top-5 test.
ILSVRC classification, single network with multi-scale dense evaluation (configs D and E): $24.8\%$ top-1 val, $7.5\%$ top-5 val.
ILSVRC classification, single-scale config E: $25.5\%$ top-1 val, $8.0\%$ top-5 val.
VOC-2012 action, image+bbox: $84.0$ mAP.
Caltech-256, Net-D&E: $86.2\%$ mean class recall.

Ablations

Depth A $\rightarrow$ E: validation error drops steadily, then saturates near 19 layers.
LRN: no accuracy gain, extra computation.
$1 \times 1$ layers in config C: worse than all- $3 \times 3$ config D.
Scale jittering: training with sampled $S$ and testing at several scales each lower validation error relative to fixed-scale training and single-scale evaluation.

Method Strengths and Weaknesses

Strengths

Controlled A–E comparison isolates depth from most other design changes.
Uniform $3 \times 3$ stacks beat shallower large-filter baselines.
Strong transfer results on VOC, Caltech, and action classification.
Two-model fusion reaches $6.8\%$ top-5 test error on ILSVRC.
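The paper's case for small filters rests on a simple count: a stack of three $3 \times 3$ stride-1 layers has the same $7 \times 7$ effective receptive field as one $7 \times 7$ layer, with fewer weights and two extra non-linearities. A sketch of that arithmetic, assuming $C$ input and $C$ output channels per layer and ignoring biases (function names are illustrative):

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stride-1 conv layers stacked without pooling."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weights in num_layers conv layers with C input and C output channels."""
    return num_layers * kernel * kernel * channels * channels


if __name__ == "__main__":
    c = 256
    print(stacked_receptive_field(3))            # three 3x3 layers
    print(conv_weights(3, c, num_layers=3))      # 27 * C^2
    print(conv_weights(7, c))                    # 49 * C^2
```

The ratio $27C^2 / 49C^2 \approx 0.55$ holds for any $C$, so the stacked design saves roughly 45% of the weights at equal receptive field.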

Weaknesses

Two 4096-unit FC layers make the model parameter-heavy.
Best accuracy depends on multi-scale testing and model fusion.
Gains from 16 to 19 layers are modest.
Dense multi-scale evaluation is computationally expensive.
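To make the first weakness concrete, the parameter count of config D can be tallied layer by layer. The sketch below assumes the channel widths and per-block convolution counts from the paper's configuration table, a $7 \times 7 \times 512$ input to the first FC layer, and one bias per output unit; the helper names are hypothetical.

```python
CHANNELS = [64, 128, 256, 512, 512]   # output width of each conv block
CONVS_D = [2, 2, 3, 3, 3]             # config D: 13 conv layers
FC_SIZES = [4096, 4096, 1000]


def conv_params(in_channels=3):
    """Weights plus biases of all 3x3 conv layers in config D."""
    total = 0
    for width, n_convs in zip(CHANNELS, CONVS_D):
        for _ in range(n_convs):
            total += 3 * 3 * in_channels * width + width
            in_channels = width
    return total


def fc_params(in_features=7 * 7 * 512):
    """Weights plus biases of the three fully connected layers."""
    total = 0
    for out_features in FC_SIZES:
        total += in_features * out_features + out_features
        in_features = out_features
    return total


if __name__ == "__main__":
    conv, fc = conv_params(), fc_params()
    print(conv, fc, conv + fc, round(fc / (conv + fc), 3))
```

Under these assumptions the total is about 138M parameters, of which roughly 89% sit in the three FC layers; the first FC layer alone accounts for about 103M.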

Suggestions from the authors

Test deeper small-filter ConvNets on larger recognition datasets.
Improve localization with stronger bounding-box regression variants.
Reduce the cost of dense multi-scale evaluation.
Reuse very deep pretrained features across more vision tasks.
