Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, Andrew Zisserman

2014 · ICLR

Problem

Framing

ImageNet CNNs were still shallow, and depth gains were entangled with larger filters and ad hoc design. The paper isolates depth with uniform $3 \times 3$ stacks, reaches 19 weight layers, and cuts ILSVRC top-5 test error to $6.8\%$.

Currently Used Methods

Foundational

Proposed Method

Architecture

The network takes fixed $224 \times 224$ RGB crops, uses stride-1 convolutions with padding 1, and inserts five $2 \times 2$ max-pooling layers. All hidden convolutions use $3 \times 3$ kernels except optional $1 \times 1$ layers in config C. Width increases from 64 to 512 channels; configs A–E span 11 to 19 weight layers and end with FC-4096, FC-4096, FC-1000.
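
The layer counts and shapes described above can be checked with a small sketch. The layer lists below are transcribed from the paper's configuration table as best understood here (the A-LRN variant is left out); the names `CONFIGS`, `weight_layers`, `final_spatial_size`, and `parameter_count` are hypothetical helpers, and counting one bias per output channel is an assumption of this sketch rather than something the summary states.

```python
# Layer lists: an int is a 3x3 conv with that many output channels,
# ("1x1", c) is a 1x1 conv with c output channels (config C only),
# and "M" is a 2x2 max-pooling layer with stride 2.
CONFIGS = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "B": [64, 64, "M", 128, 128, "M", 256, 256, "M", 512, 512, "M",
          512, 512, "M"],
    "C": [64, 64, "M", 128, 128, "M", 256, 256, ("1x1", 256), "M",
          512, 512, ("1x1", 512), "M", 512, 512, ("1x1", 512), "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
FC_LAYERS = [4096, 4096, 1000]  # FC-4096, FC-4096, FC-1000


def weight_layers(cfg):
    """Count layers with learnable weights: convolutions plus the three FC layers."""
    return sum(1 for layer in cfg if layer != "M") + len(FC_LAYERS)


def final_spatial_size(cfg, size=224):
    """Stride-1, padding-1 3x3 convs keep the size; each pooling layer halves it."""
    for layer in cfg:
        if layer == "M":
            size //= 2
    return size


def parameter_count(cfg, in_channels=3, size=224):
    """Total weights and biases (bias per output channel is an assumption)."""
    total = 0
    channels = in_channels
    for layer in cfg:
        if layer == "M":
            size //= 2
            continue
        kernel, out_channels = (1, layer[1]) if isinstance(layer, tuple) else (3, layer)
        total += kernel * kernel * channels * out_channels + out_channels
        channels = out_channels
    features = channels * size * size  # flattened input to the first FC layer
    for width in FC_LAYERS:
        total += features * width + width
        features = width
    return total


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, weight_layers(cfg), final_spatial_size(cfg), parameter_count(cfg))
```

Under these assumptions the five pooling layers reduce a 224-pixel input to a 7-pixel feature map, so the first fully connected layer sees 7 × 7 × 512 inputs; that single layer accounts for most of the parameters, which is why adding convolutional depth from A to E changes the total comparatively little.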

Loss / Objective

Training minimizes multinomial logistic loss over 1000 classes.

$$\mathcal{L}(\theta) = - \sum_{n=1}^{N} \sum_{k=1}^{1000} y_{nk} \log p_{\theta}(k \mid \mathbf{x}_n)$$
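
As a minimal sketch of this objective, assuming $y_{nk}$ is a one-hot label indicator and $p_{\theta}$ is a softmax over the network's 1000 output scores (neither is spelled out in the summary), the inner sum reduces to the negative log-probability of the true class. The function name `multinomial_logistic_loss` and its list-based inputs are hypothetical.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw scores, using the log-sum-exp trick for stability."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]


def multinomial_logistic_loss(batch_logits, labels):
    """Sum over samples of -log p(true class | x_n).

    With one-hot y_nk, the inner sum over k keeps only the true-class term.
    """
    loss = 0.0
    for logits, label in zip(batch_logits, labels):
        loss -= log_softmax(logits)[label]
    return loss


if __name__ == "__main__":
    # An untrained model with uniform scores pays log(1000) per sample.
    uniform = [0.0] * 1000
    print(multinomial_logistic_loss([uniform, uniform], [3, 917]))
```

The sketch returns the sum over the batch, matching the formula as written; dividing by $N$ gives the per-sample mean that training code more commonly reports.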

Sampling Rule / Algorithm

Test-time prediction averages class posteriors across crops, scales, or fused models.

$$\hat{p}(k \mid \mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(k \mid \mathbf{x}^{(m)})$$
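
The averaging rule can be illustrated with a short sketch. Each of the $M$ entries is one class-posterior vector, produced by one crop, one scale, or one model in the fusion; the helper names `average_posteriors` and `top_k` are hypothetical, and the three-class vectors in the usage example are made-up numbers for illustration only.

```python
def average_posteriors(posteriors):
    """Element-wise mean of M class-posterior vectors of equal length."""
    if not posteriors:
        raise ValueError("need at least one posterior vector")
    num_views = len(posteriors)
    num_classes = len(posteriors[0])
    if any(len(p) != num_classes for p in posteriors):
        raise ValueError("all posterior vectors must have the same length")
    return [sum(p[k] for p in posteriors) / num_views for k in range(num_classes)]


def top_k(probs, k=5):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


if __name__ == "__main__":
    # Two views of the same image over three illustrative classes.
    views = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
    fused = average_posteriors(views)
    print(fused, top_k(fused, k=2))
```

Because each input vector sums to one, the average is itself a valid distribution, so the top-5 prediction can be read directly from the fused vector.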

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers