Going Deeper with Convolutions

Christian Szegedy, Wei Liu, Yangqing Jia

2015 · CVPR

Problem

Framing

ImageNet CNNs improved by getting deeper and wider, but naive scaling wasted parameters and compute. The paper addresses this inefficiency with the Inception module: parallel multi-scale branches plus $1 \times 1$ reductions that keep the inference budget near $1.5$ billion multiply-adds. An ensemble of $7$ GoogLeNet models reaches $6.67\%$ top-5 error on ILSVRC14.

Currently Used Methods

Foundational

@krizhevskyAlexNet2012 — large-scale ImageNet CNN baseline with strong classification gains.
- Limitation in context: about $12\times$ more parameters than GoogLeNet, with worse top-5 accuracy.
@simonyanVGGVeryDeep2014 — deeper CNNs from uniform stacks of small convolutions.
- Limitation in context: depth scaling lacks Inception-style multi-branch efficiency.
@lecunGradientbasedLearningApplied1998 — canonical conv-pool hierarchy for visual recognition.
- Limitation in context: no explicit parallel multi-scale processing inside each stage.
"Network In Network" — introduces $1 \times 1$ convolutions and local micro-networks.
- Limitation in context: stops short of sparse multi-branch module design.
@renFasterRCNN2015 — CNN-based region proposal and detection pipeline.
- Limitation in context: backbone efficiency still constrains detection throughput and accuracy.

Proposed Method

Architecture

GoogLeNet is a 22-layer CNN built by stacking Inception modules. Each module runs parallel $1 \times 1$, $3 \times 3$, $5 \times 5$, and pooling branches, then concatenates their outputs along the channel axis; $1 \times 1$ projections reduce cost before the expensive branches. The classifier head replaces large fully connected blocks with global average pooling, and training adds two auxiliary classifiers.

Figure: the naive Inception module and the dimension-reduced version, showing parallel $1 \times 1$, $3 \times 3$, $5 \times 5$, and pooling branches merged by filter concatenation.
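To make the benefit of the $1 \times 1$ projections concrete, the following sketch counts multiply-adds for a single $5 \times 5$ branch with and without a reduction in front of it. The feature-map size and channel counts are illustrative assumptions chosen for this example, and `conv_mult_adds` is a hypothetical helper, not something defined in the paper.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds for a stride-1, same-padded kernel x kernel convolution."""
    return height * width * kernel * kernel * c_in * c_out


# Illustrative sizes (assumptions for this sketch, not taken from the summary).
H, W = 28, 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# Naive branch: 5x5 convolution applied directly to the full input depth.
naive = conv_mult_adds(H, W, 5, C_IN, C_OUT)

# Reduced branch: 1x1 projection to C_REDUCED channels, then the 5x5 convolution.
reduced = (conv_mult_adds(H, W, 1, C_IN, C_REDUCED)
           + conv_mult_adds(H, W, 5, C_REDUCED, C_OUT))

print(f"naive 5x5 branch:   {naive:,} multiply-adds")
print(f"reduced 5x5 branch: {reduced:,} multiply-adds")
print(f"saving factor:      {naive / reduced:.1f}x")
```

Under these assumed sizes the projection itself is cheap compared with the $5 \times 5$ convolution it shrinks, which is why the reduction pays off.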

Loss / Objective

Training sums the main softmax loss and two auxiliary softmax losses.

$$\mathcal{L} = \mathcal{L}_{\mathrm{main}} + 0.3\,\mathcal{L}_{\mathrm{aux1}} + 0.3\,\mathcal{L}_{\mathrm{aux2}}$$
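A minimal sketch of this weighted sum for a single training example, written in plain Python. The function names and the list-of-logits interface are assumptions made for illustration; only the $0.3$ weights come from the text above.

```python
import math

AUX_WEIGHT = 0.3  # weight applied to each auxiliary loss, as stated above


def softmax_cross_entropy(logits, target):
    """Negative log-probability of class index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def total_loss(main_logits, aux1_logits, aux2_logits, target):
    """Main softmax loss plus the two down-weighted auxiliary softmax losses."""
    main = softmax_cross_entropy(main_logits, target)
    aux1 = softmax_cross_entropy(aux1_logits, target)
    aux2 = softmax_cross_entropy(aux2_logits, target)
    return main + AUX_WEIGHT * aux1 + AUX_WEIGHT * aux2


# Toy example with 4 classes: three heads scoring the same image.
example = total_loss([2.0, 0.5, 0.1, -1.0],
                     [1.0, 0.8, 0.2, 0.0],
                     [0.5, 0.5, 0.5, 0.5],
                     target=0)
print(f"combined loss: {example:.4f}")
```

The auxiliary terms only shape training; in the original paper the auxiliary heads are dropped at inference time, so only the main head's output is used for prediction.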

Algorithm

An Inception block applies four branch transforms to $\mathbf{x}$ and concatenates their outputs.

$$\mathbf{y} = \operatorname{concat}\Big( \operatorname{conv}_{1 \times 1}(\mathbf{x}),\ \operatorname{conv}_{3 \times 3}(\operatorname{conv}_{1 \times 1}(\mathbf{x})),\ \operatorname{conv}_{5 \times 5}(\operatorname{conv}_{1 \times 1}(\mathbf{x})),\ \operatorname{conv}_{1 \times 1}(\operatorname{pool}_{3 \times 3}(\mathbf{x})) \Big)$$
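The four-branch formula can be sketched as a small NumPy forward pass. This is an illustration under stated assumptions rather than the authors' implementation: stride 1 with zero "same" padding, max pooling for the pooling branch, no biases, and no nonlinearities (mirroring the formula, which omits them). The helper names, the parameter dictionary keys, and the tiny channel counts are all hypothetical.

```python
import numpy as np


def conv_same(x, w):
    """Stride-1 cross-correlation with zero 'same' padding.

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k.
    """
    k = w.shape[-1]
    pad = k // 2
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((w.shape[0], height, width))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel offset (i, j) to every output position.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j],
                             xp[:, i:i + height, j:j + width])
    return out


def max_pool_3x3_same(x):
    """3x3 max pooling, stride 1, padded so the spatial size is unchanged."""
    _, height, width = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    shifted = [xp[:, i:i + height, j:j + width]
               for i in range(3) for j in range(3)]
    return np.max(np.stack(shifted), axis=0)


def inception_block(x, p):
    """Four parallel branches concatenated along the channel axis."""
    b1 = conv_same(x, p["1x1"])
    b3 = conv_same(conv_same(x, p["3x3_reduce"]), p["3x3"])
    b5 = conv_same(conv_same(x, p["5x5_reduce"]), p["5x5"])
    bp = conv_same(max_pool_3x3_same(x), p["pool_proj"])
    return np.concatenate([b1, b3, b5, bp], axis=0)


rng = np.random.default_rng(0)


def random_kernel(c_out, c_in, k):
    return rng.normal(scale=0.1, size=(c_out, c_in, k, k))


# Tiny illustrative sizes: 8 input channels on a 6x6 feature map.
params = {
    "1x1": random_kernel(4, 8, 1),
    "3x3_reduce": random_kernel(3, 8, 1),
    "3x3": random_kernel(6, 3, 3),
    "5x5_reduce": random_kernel(2, 8, 1),
    "5x5": random_kernel(3, 2, 5),
    "pool_proj": random_kernel(3, 8, 1),
}
x = rng.normal(size=(8, 6, 6))
y = inception_block(x, params)
print(y.shape)  # channels add up across branches: 4 + 6 + 3 + 3
```

Because every branch preserves the spatial size, the outputs can be stacked along the channel axis without any cropping or resizing.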

Training Procedure

Inference budget: about $1.5$ billion multiply-adds.
Input size: $224 \times 224$.
Auxiliary heads: $2$ .
Auxiliary-loss weight: $0.3$ each.
Auxiliary-head hidden layer: $1024$ units.
Auxiliary-head dropout: $70\%$ dropped outputs.
Final submission ensemble: $7$ models.
Dense evaluation: $144$ crops per image.
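The figure of $144$ crops can be reproduced by enumeration. The breakdown below follows the cropping scheme described in the original paper (it is not spelled out in this summary): four resize scales, three squares per scale, six $224 \times 224$ views per square, each with and without mirroring. The label strings are placeholders chosen for this sketch.

```python
from itertools import product

scales = (256, 288, 320, 352)          # shorter-side resize targets
squares = ("left", "center", "right")  # top/center/bottom for portrait images
views = ("top_left", "top_right", "bottom_left", "bottom_right",
         "center", "resized_square")   # 4 corners, center crop, whole square resized
mirrors = (False, True)                # original and horizontally flipped

# Every combination is one crop fed to the network at test time.
crops = list(product(scales, squares, views, mirrors))
print(len(crops))
```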

Evaluation

Datasets

ILSVRC 2014 classification: about $1.2$ M train, $50$ k validation, $100$ k test, $1000$ classes.
ILSVRC 2014 detection: $200$ classes.

Metrics

Classification: top-5 error.
Detection: mean average precision (mAP).
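As a reminder of how top-5 error is computed, here is a minimal sketch: a prediction counts as correct when the true label appears among the five highest-scoring classes. The function name and the toy scores are assumptions for illustration only.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true label is not among the k top-scoring classes."""
    errors = 0
    for scores, label in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(true_labels)


# Toy example with 6 classes and 2 images.
toy_scores = [
    [0.10, 0.20, 0.30, 0.15, 0.20, 0.05],  # class 5 has the lowest score
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],
]
toy_labels = [5, 2]  # first image is a top-5 miss, second is a hit

print(top_k_error(toy_scores, toy_labels))
```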

Headline results

ILSVRC14 classification (single model, single crop): $10.07\%$ top-5 error.
ILSVRC14 classification (single model, $144$ crops): $7.89\%$ top-5 error.
ILSVRC14 classification ($7$-model ensemble): $6.67\%$ top-5 error.
ILSVRC14 detection: $43.9\%$ mAP.
Parameter efficiency: about $12\times$ fewer parameters than @krizhevskyAlexNet2012.

Table 1: ILSVRC classification challenge leaderboard by top-5 error

Team	Year	Place	Error (top-5)	Uses external data
SuperVision	2012	1st	16.4%	no
SuperVision	2012	1st	15.3%	Imagenet 22k
Clarifai	2013	1st	11.7%	no
Clarifai	2013	1st	11.2%	Imagenet 22k
MSRA	2014	3rd	7.35%	no
VGG	2014	2nd	7.32%	no
GoogLeNet	2014	1st	6.67%	no
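One way to read the table is as relative error reduction against earlier entries. The sketch below uses only the rows trained without external data; `relative_reduction` is a hypothetical helper introduced here.

```python
# (team, year, top-5 error in percent) for entries without external data.
baselines = [
    ("SuperVision", 2012, 16.4),
    ("Clarifai", 2013, 11.7),
    ("MSRA", 2014, 7.35),
    ("VGG", 2014, 7.32),
]
GOOGLENET_ERROR = 6.67


def relative_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new result."""
    return (baseline_error - new_error) / baseline_error


for team, year, error in baselines:
    gain = relative_reduction(error, GOOGLENET_ERROR)
    print(f"vs {team} ({year}): {gain:.1%} relative reduction")
```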

Ablations

Crop count: denser evaluation cuts top-5 error from $10.07\%$ to $7.89\%$.
Ensembling: $7$ models cut top-5 error to $6.67\%$.
Auxiliary classifiers: intended to strengthen the gradient signal reaching the intermediate layers and to add regularization during training.
Wider variant: gives small extra ensemble gains.
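Both the crop and ensemble gains come from combining many predictions for the same image. A minimal sketch of that combination step, assuming the softmax probabilities are simply averaged over every (model, crop) pair; the function name and toy numbers are illustrative.

```python
def average_probabilities(prob_vectors):
    """Element-wise mean of probability vectors, one per (model, crop) pair."""
    count = len(prob_vectors)
    num_classes = len(prob_vectors[0])
    return [sum(p[c] for p in prob_vectors) / count for c in range(num_classes)]


# Toy example: 2 models x 2 crops over 3 classes.
predictions = [
    [0.6, 0.3, 0.1],  # model A, crop 1
    [0.2, 0.5, 0.3],  # model A, crop 2
    [0.5, 0.4, 0.1],  # model B, crop 1
    [0.3, 0.3, 0.4],  # model B, crop 2
]
averaged = average_probabilities(predictions)
predicted_class = averaged.index(max(averaged))
print(averaged, predicted_class)
```

Individual crops may disagree, as crop 2 of model A does here, but the averaged vector smooths those disagreements before the final ranking is taken.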

Method Strengths and Weaknesses

Strengths

Wins ILSVRC14 classification with $6.67\%$ top-5 error.
Uses about $12\times$ fewer parameters than @krizhevskyAlexNet2012.
Captures multiple receptive-field scales within one module.
$1 \times 1$ reductions cut cost before $3 \times 3$ and $5 \times 5$ branches.

Weaknesses

Best headline number requires a $7$-model ensemble.
Dense $144$-crop testing is expensive.
Architecture remains hand-designed and branch-heavy.
Detection result still depends on proposal-based R-CNN components.

Suggestions from the authors

Automate Inception topology design instead of hand-crafting modules.
Test whether the sparse-design principle transfers beyond vision.
Analyze which architectural choices drive the accuracy gains.
Reduce memory cost when training deep multi-branch networks.

Links

Prior Papers

Further Papers