Very Deep Convolutional Networks for Large-Scale Image Recognition
Problem
Framing
ImageNet CNNs were still shallow, and gains from depth were entangled with larger filters and ad hoc design choices. The paper isolates depth using uniform stacks of $3 \times 3$ convolutions, reaches 19 weight layers, and cuts ILSVRC top-5 test error to 6.8%.
Currently Used Methods
Foundational
- @lecunGradientbasedLearningApplied1998 — early convolutional hierarchy for visual recognition.
  - Limitation in context: too shallow for ImageNet-scale representation learning.
- @krizhevskyAlexNet2012 — large-scale CNN training with ReLU, dropout, and augmentation.
  - Limitation in context: only 8 learned layers and large early filters.
- "Visualizing and Understanding Convolutional Networks" — stronger ImageNet ConvNet with improved optimization.
  - Limitation in context: still shallower and less architecturally uniform.
- @szegedyGoogLeNet2015 — parameter-efficient deep CNN via inception modules.
  - Limitation in context: does not isolate plain depth scaling with homogeneous filters.
Proposed Method
Architecture
The network takes fixed $224 \times 224$ RGB crops, uses stride-1 convolutions with padding 1, and inserts five $2 \times 2$ max-pooling layers. All hidden convolutions use $3 \times 3$ kernels except the $1 \times 1$ layers in config C. Width increases from 64 to 512 channels; configs A–E span 11 to 19 weight layers and end with FC-4096, FC-4096, FC-1000.
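As a rough illustration of the A–E family, the sketch below encodes each configuration as a list of layers and counts weight layers and parameters. The layer lists are an assumption based on the commonly reproduced configuration table ($3 \times 3$ convolutions, $1 \times 1$ convolutions closing the last three stages of C, $2 \times 2$ pooling that halves resolution, $224 \times 224$ input). The names `block`, `CONFIGS`, `count_weight_layers`, and `count_parameters` are illustrative, not from the paper.

```python
# Each entry is (kernel_size, out_channels) for a convolution, or "M" for max-pooling.
# The layer lists below are an assumption, not copied from the source text.

def block(width, *kernels):
    """One stage: a convolution per kernel size, followed by max-pooling."""
    return [(k, width) for k in kernels] + ["M"]

CONFIGS = {
    "A": block(64, 3) + block(128, 3) + block(256, 3, 3)
         + block(512, 3, 3) + block(512, 3, 3),
    "B": block(64, 3, 3) + block(128, 3, 3) + block(256, 3, 3)
         + block(512, 3, 3) + block(512, 3, 3),
    "C": block(64, 3, 3) + block(128, 3, 3) + block(256, 3, 3, 1)
         + block(512, 3, 3, 1) + block(512, 3, 3, 1),
    "D": block(64, 3, 3) + block(128, 3, 3) + block(256, 3, 3, 3)
         + block(512, 3, 3, 3) + block(512, 3, 3, 3),
    "E": block(64, 3, 3) + block(128, 3, 3) + block(256, 3, 3, 3, 3)
         + block(512, 3, 3, 3, 3) + block(512, 3, 3, 3, 3),
}

FC_WIDTHS = [4096, 4096, 1000]  # FC-4096, FC-4096, FC-1000


def count_weight_layers(cfg):
    """Convolutional layers plus the three fully connected layers."""
    return sum(1 for layer in cfg if layer != "M") + len(FC_WIDTHS)


def count_parameters(cfg, in_channels=3, input_size=224):
    """Weights and biases, assuming each pooling layer halves the resolution."""
    params, channels, size = 0, in_channels, input_size
    for layer in cfg:
        if layer == "M":
            size //= 2
        else:
            kernel, out_channels = layer
            params += kernel * kernel * channels * out_channels + out_channels
            channels = out_channels
    features = channels * size * size  # flattened input to the first FC layer
    for width in FC_WIDTHS:
        params += features * width + width
        features = width
    return params


for name, cfg in CONFIGS.items():
    print(name, count_weight_layers(cfg), count_parameters(cfg))
```

Under these assumptions the counts come out at 11, 13, 16, 16, and 19 weight layers, with roughly 133M parameters for A and 144M for E.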
Loss / Objective
Training minimizes multinomial logistic loss over 1000 classes.
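For reference, multinomial logistic loss is the negative log of the softmax probability assigned to the true class. The following is a minimal sketch of that definition for plain Python lists; the function names are illustrative, and a real training setup would use a framework's built-in loss instead.

```python
import math


def softmax_cross_entropy(logits, label):
    """Loss for one example: -log softmax(logits)[label]."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def batch_loss(batch_logits, labels):
    """Mean loss over a mini-batch."""
    losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)
```

With 1000 classes and uninformative (all-equal) logits, the loss equals $\ln 1000 \approx 6.91$, which is a useful sanity check at the start of training.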
Sampling Rule / Algorithm
Test-time prediction averages class posteriors across crops, scales, or fused models.
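The averaging step can be sketched as an element-wise mean over probability vectors, whether those vectors come from different crops, different test scales, or different networks. The helper names below are hypothetical.

```python
def average_posteriors(posteriors):
    """Element-wise mean of several class-probability vectors for one image."""
    count = len(posteriors)
    num_classes = len(posteriors[0])
    return [sum(p[c] for p in posteriors) / count for c in range(num_classes)]


def predict(posteriors):
    """Index of the class with the highest averaged posterior."""
    averaged = average_posteriors(posteriors)
    return max(range(len(averaged)), key=averaged.__getitem__)


# Three views (crops, scales, or models) of the same image, three classes.
views = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
print(predict(views))
```

Because each input vector sums to one, the average is itself a valid posterior, so the same function can be applied again to fuse the outputs of several models.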
Training Procedure
- Input crop: $224 \times 224$ RGB.
- Batch size: 256.
- Optimizer: SGD with momentum 0.9.
- Weight decay: $5 \times 10^{-4}$.
- Dropout: 0.5 in the first two FC layers.
- Learning rate: $10^{-2}$ initially, divided by 10 at validation plateaus.
- Training length: 370K iterations, about 74 epochs.
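- Scale jittering: fixed $S \in \{256, 384\}$ or sampled $S \in [256, 512]$, where $S$ is the shorter side of the rescaled training image.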
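The optimizer settings above can be sketched as a single update rule. This is one common formulation of SGD with momentum and L2 weight decay, not necessarily the exact update used by the authors. The default values `5e-4` and `1e-2` and the training-set size of 1,281,167 images are assumptions supplied for illustration, and all function names are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=5e-4):
    """v <- momentum * v - lr * (grad + weight_decay * w);  w <- w + v."""
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def decayed_lr(initial_lr, num_plateaus):
    """Learning rate after dividing by 10 at each validation plateau."""
    return initial_lr / (10 ** num_plateaus)


def epochs(iterations, batch_size, dataset_size):
    """Passes over the training set implied by an iteration count."""
    return iterations * batch_size / dataset_size


# Assumed ILSVRC-2012 training-set size; 370K iterations at batch size 256.
print(round(epochs(370_000, 256, 1_281_167), 1))
```

The last line checks the arithmetic behind the stated training length: 370K iterations at batch size 256 is about 94.7M images, or close to 74 passes over a training set of roughly 1.28M images.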
Evaluation
Datasets
- ILSVRC-2012 classification.
- ILSVRC-2014 localization.
- PASCAL VOC-2007 classification.
- PASCAL VOC-2012 classification.
- PASCAL VOC-2012 action classification.
- Caltech-101.
- Caltech-256.
Metrics
- Top-1 classification error.
- Top-5 classification error.
- Top-5 localization error.
- Mean average precision.
- Mean class recall.
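Two of these metrics are simple enough to state in code. The sketch below assumes per-class scores stored as plain lists and integer class labels; `top_k_error` and `mean_class_recall` are illustrative names.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores."""
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        errors += label not in ranked[:k]
    return errors / len(labels)


def mean_class_recall(predictions, labels):
    """Recall computed per class, then averaged so rare classes count equally."""
    hits, totals = {}, {}
    for pred, label in zip(predictions, labels):
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (pred == label)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```

Top-1 error is the special case `k=1`. Mean class recall differs from plain accuracy when classes are imbalanced, which is why it is the usual choice for Caltech-style benchmarks.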
Headline results
- ILSVRC classification, 2-model fusion: 23.7% top-1 val, 6.8% top-5 val, 6.8% top-5 test.
- ILSVRC classification, best single network: 24.4% top-1 val, 7.1% top-5 val.
- ILSVRC classification, single-scale config E (trained with scale jittering): 25.5% top-1 val, 8.0% top-5 val.
- VOC-2012 action, image+bbox: 84.0 mAP.
- Caltech-256, Net-D&E: 86.2 ± 0.3 mean class recall.
Ablations
- Depth A→E: validation error drops steadily, then saturates near 19 layers.
- LRN: no accuracy gain, extra computation.
- $1 \times 1$ layers in config C: worse than the all-$3 \times 3$ config D.
- Train/test scale jittering: improves validation error over fixed-scale evaluation.
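Training-time scale jittering can be sketched as follows: draw a scale $S$, rescale the image isotropically so its shorter side equals $S$, then take a random fixed-size crop. The range 256 to 512 and the crop size 224 are assumed defaults here, drawing $S$ as an integer is a simplification, and `sample_training_crop` is a hypothetical helper that only computes geometry (no pixels are touched).

```python
import random


def sample_training_crop(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Return the sampled scale, the rescaled image size, and a random crop box."""
    s = rng.randint(s_min, s_max)  # shorter side after rescaling
    scale = s / min(width, height)
    new_w = max(s, round(width * scale))
    new_h = max(s, round(height * scale))
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return s, (new_w, new_h), (left, top, left + crop, top + crop)


rng = random.Random(0)  # seeded only to make the example repeatable
print(sample_training_crop(500, 375, rng=rng))
```

At small $S$ the crop covers most of the image, while at large $S$ it covers a small part of it, so a single network sees objects over a range of apparent sizes.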
Method Strengths and Weaknesses
Strengths
- Controlled A–E comparison isolates depth from most other design changes.
- Uniform $3 \times 3$ stacks beat shallower large-filter baselines.
- Strong transfer results on VOC, Caltech, and action classification.
- Two-model fusion reaches 6.8% top-5 test error on ILSVRC.
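The appeal of small-filter stacks can be checked with a little arithmetic: several stride-1 $3 \times 3$ layers cover the same receptive field as one larger filter while using fewer weights and adding more non-linearities. The sketch below assumes equal input and output channel counts and ignores biases; the function names are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convolutions with the same kernel."""
    return num_layers * (kernel - 1) + 1


def conv_weights(kernel, channels, num_layers=1):
    """Weights (no biases) in num_layers convolutions with `channels` in and out."""
    return num_layers * kernel * kernel * channels * channels


channels = 512
print(stacked_receptive_field(3))               # same field as a single 7x7 filter
print(conv_weights(3, channels, num_layers=3))  # 27 * C^2 weights
print(conv_weights(7, channels))                # 49 * C^2 weights
```

For any channel count $C$, three stacked $3 \times 3$ layers need $27C^2$ weights against $49C^2$ for a single $7 \times 7$ layer, a reduction of about 45%.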
Weaknesses
- Two 4096-unit FC layers make the model parameter-heavy.
- Best accuracy depends on multi-scale testing and model fusion.
- Gains from 16 to 19 layers are modest.
- Dense multi-scale evaluation is computationally expensive.
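To put the first weakness in numbers, the sketch below counts parameters in the fully connected classifier alone. It assumes a $7 \times 7 \times 512$ feature map entering the first FC layer, which follows from a $224 \times 224$ input halved by five pooling layers; `fc_parameters` is a hypothetical helper.

```python
def fc_parameters(in_features, widths=(4096, 4096, 1000)):
    """Per-layer parameter counts (weights plus biases) for the FC classifier."""
    counts = []
    for width in widths:
        counts.append(in_features * width + width)
        in_features = width
    return counts


counts = fc_parameters(7 * 7 * 512)
print(counts, sum(counts))
```

Under these assumptions the first FC layer alone holds about 103M parameters and the three FC layers together about 124M, which is the large majority of a model with roughly 138M to 144M parameters in total.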
Suggestions from the authors
- Test deeper small-filter ConvNets on larger recognition datasets.
- Improve localization with stronger bounding-box regression variants.
- Reduce the cost of dense multi-scale evaluation.
- Reuse very deep pretrained features across more vision tasks.
Links
Prior Papers
- @krizhevskyAlexNet2012 — establishes large-scale ImageNet CNN training that VGG deepens with smaller filters.
- @lecunGradientbasedLearningApplied1998 — provides the convolution-and-pooling template that VGG scales dramatically.
- @srivastavaDropout2014 — dropout regularizes VGG's large fully connected classifier.
Further Papers
- @heKaimingInit2015 — tackles optimization issues that become acute in very deep rectified nets.
- @heDeepResidualLearning2016 — extends VGG's depth-scaling agenda with residual shortcuts.
- @szegedyGoogLeNet2015 — offers a competing deep CNN design with better parameter efficiency.
- @huangDenseNet2017 — continues the very-deep CNN line with dense feature reuse.