EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Mingxing Tan, Quoc V. Le

2019 · ICML

Problem

Framing

Before EfficientNet, CNNs were usually scaled along a single dimension (depth, width, or input resolution), which wastes compute and makes accuracy saturate early. EfficientNet addresses this with compound scaling: depth, width, and input resolution are scaled jointly, starting from one searched baseline. On ImageNet, EfficientNet-B7 reaches 84.3% top-1 accuracy with 66M parameters.

Currently Used Methods

Foundational

@heDeepResidualLearning2016 — residual connections make very deep CNNs trainable.
- Limitation in context: depth-only scaling gives diminishing accuracy gains.
@huangDenseNet2017 — dense connectivity improves reuse and gradient flow.
- Limitation in context: no principled rule for joint multi-axis scaling.
@szegedyGoogLeNet2015 — Inception improves efficiency with hand-crafted factorization.
- Limitation in context: scaling remains manual and architecture-specific.
MnasNet: Platform-Aware Neural Architecture Search for Mobile — searched mobile backbone with strong efficiency.
- Limitation in context: baseline search alone does not solve large-scale scaling.
AmoebaNet: Regularized Evolution for Image Classifier Architecture Search — strong NAS classifier family.
- Limitation in context: higher accuracy requires far larger parameter budgets.

Proposed Method

Architecture

EfficientNet starts from EfficientNet-B0, a NAS-designed MBConv network with squeeze-and-excitation and Swish. It keeps the stage structure fixed and scales all stages with one global coefficient $\phi$.
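To make "scales all stages with one global coefficient" more concrete, the sketch below applies one width multiplier and one depth multiplier to every stage of a B0-like layout. The per-stage channel and repeat counts are taken from the baseline table in the paper, not from this summary. The rounding helpers `round_channels` and `round_repeats` (snap channels to multiples of 8, round repeats up) are assumptions modeled on common open-source implementations, not a statement of the authors' exact code.

```python
import math

# (output channels, layer repeats) for the seven MBConv stages of the B0 baseline.
B0_STAGES = [(16, 1), (24, 2), (40, 2), (80, 3), (112, 3), (192, 4), (320, 1)]


def round_channels(channels, width_mult, divisor=8):
    """Scale a channel count and snap it to a multiple of `divisor` (assumed convention)."""
    scaled = channels * width_mult
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    # Do not let rounding shrink the scaled value by more than 10%.
    if rounded < 0.9 * scaled:
        rounded += divisor
    return rounded


def round_repeats(repeats, depth_mult):
    """Scale a repeat count, rounding up so no stage loses layers."""
    return int(math.ceil(repeats * depth_mult))


def scale_stages(stages, width_mult, depth_mult):
    """Apply the same two multipliers to every stage; stage order is unchanged."""
    return [
        (round_channels(c, width_mult), round_repeats(n, depth_mult))
        for c, n in stages
    ]


# Illustrative multipliers only; they are not claimed to match any released model.
scaled = scale_stages(B0_STAGES, width_mult=1.1, depth_mult=1.2)
print(scaled)
```

With both multipliers set to 1.0 the function returns the baseline layout unchanged, which is a convenient sanity check for the rounding logic.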

Figure: model scaling choices, comparing the baseline with width-only, depth-only, and resolution-only scaling and with the proposed compound scaling that increases all three together.

Loss / Objective

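The paper optimizes the standard supervised classification (cross-entropy) loss and changes only the scaling rule. For $K$ classes, with one-hot label $y_k$ and predicted probability $p_k$ for class $k$: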

\mathcal{L} = - \sum_{k=1}^{K} y_k \log p_k
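As a quick numerical illustration of the loss above, the following sketch evaluates it for a single example. It assumes $y$ is a one-hot label and $p$ comes from a softmax over raw scores; the function names and the toy logits are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, p, eps=1e-12):
    """L = -sum_k y_k * log(p_k); eps guards against log(0)."""
    return -sum(yk * math.log(pk + eps) for yk, pk in zip(y, p))


probs = softmax([2.0, 0.5, -1.0])  # toy logits for K = 3 classes
label = [1.0, 0.0, 0.0]            # one-hot target: class 0 is correct
loss = cross_entropy(label, probs)
print(f"loss = {loss:.4f}")
```

Because the label is one-hot, the sum collapses to the negative log-probability of the correct class.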

Scaling Rule / Algorithm

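Compound scaling distributes extra compute across depth, width, and resolution in fixed ratios. A single coefficient $\phi$ sets the overall budget, while the constants $\alpha$, $\beta$, $\gamma$ determine how much of it goes to the depth, width, and resolution multipliers $d$, $w$, $r$. Because FLOPs grow roughly with $d \cdot w^2 \cdot r^2$, the constraint below means total FLOPs increase by about $2^{\phi}$.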

d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}

\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \qquad \alpha \ge 1, \ \beta \ge 1, \ \gamma \ge 1
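A small worked example may help here. The sketch below plugs in the coefficient values reported in the paper for the B0 baseline ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$; these are not stated in this summary) and computes the three multipliers for a given $\phi$, together with the implied FLOPs multiplier $(\alpha \cdot \beta^2 \cdot \gamma^2)^{\phi}$. The function name `compound_multipliers` is hypothetical.

```python
# Coefficients reported in the paper for the B0 baseline (not given in this summary).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_multipliers(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs are assumed to scale with depth * width^2 * resolution^2.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_multipliers(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these values the product $\alpha \cdot \beta^2 \cdot \gamma^2$ is about 1.92, so each unit of $\phi$ slightly less than doubles the FLOPs, consistent with the "approximately 2" in the constraint.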

Training Procedure

Optimizer: RMSProp
Momentum: 0.9
RMSProp decay: 0.9
Weight decay: $1 \times 10^{-5}$
Initial learning rate: 0.256
LR decay: $\times 0.97$ every 2.4 epochs
Activation: SiLU / Swish
Model family: B0 through B7
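The learning-rate settings above describe an exponential decay. A minimal sketch, assuming the decay is applied as a staircase (one discrete step every 2.4 epochs) and ignoring any warmup, neither of which is specified in this summary:

```python
import math


def learning_rate(epoch, base_lr=0.256, decay=0.97, decay_epochs=2.4):
    """Staircase exponential decay: multiply by `decay` once per `decay_epochs`."""
    steps = math.floor(epoch / decay_epochs)
    return base_lr * decay ** steps


for epoch in (0, 5, 50, 100):
    print(f"epoch {epoch:>3}: lr = {learning_rate(epoch):.5f}")
```

The `epoch` argument may be fractional, so the same function can be called per training step by passing `step / steps_per_epoch`.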

Evaluation

Datasets

ImageNet
CIFAR-10
CIFAR-100
Flowers
Stanford Cars
FGVC Aircraft
Oxford-IIIT Pets
Food-101

Metrics

Top-1 accuracy
Top-5 accuracy
Parameter count
FLOPs
Single-core CPU latency
Transfer-learning accuracy

Headline results

ImageNet, B0: top-1 77.1%, top-5 93.3%, 5.3M params, 0.39B FLOPs.
ImageNet, B4: top-1 82.9%, top-5 96.4%, 19M params, 4.2B FLOPs.
ImageNet, B7: top-1 84.3%, top-5 97.0%, 66M params, 37B FLOPs.
ImageNet, B7 vs GPipe: same 84.3% top-1 with $8.4\times$ fewer parameters.
Transfer learning: best on 5 of 8 datasets with $9.6\times$ fewer parameters on average.

Table 1: ImageNet comparisons at matched accuracy levels show EfficientNet using far fewer parameters and FLOPs.

Model	Top-1 Acc.	Top-5 Acc.	#Params	Params ratio to EfficientNet	#FLOPs	FLOPs ratio to EfficientNet
EfficientNet-B0	77.1%	93.3%	5.3M	1x	0.39B	1x
ResNet-50	76.0%	93.0%	26M	4.9x	4.1B	11x
DenseNet-169	76.2%	93.2%	14M	2.6x	3.5B	8.9x
EfficientNet-B1	79.1%	94.4%	7.8M	1x	0.70B	1x
ResNet-152	77.8%	93.8%	60M	7.6x	11B	16x
DenseNet-264	77.9%	93.9%	34M	4.3x	6.0B	8.6x
Inception-v3	78.8%	94.4%	24M	3.0x	5.7B	8.1x
Xception	79.0%	94.5%	23M	3.0x	8.4B	12x
EfficientNet-B2	80.1%	94.9%	9.2M	1x	1.0B	1x
Inception-v4	80.0%	95.0%	48M	5.2x	13B	13x
Inception-resnet-v2	80.1%	95.1%	56M	6.1x	13B	13x
EfficientNet-B3	81.6%	95.7%	12M	1x	1.8B	1x
ResNeXt-101	80.9%	95.6%	84M	7.0x	32B	18x
PolyNet	81.3%	95.8%	92M	7.7x	35B	19x
EfficientNet-B4	82.9%	96.4%	19M	1x	4.2B	1x
SENet	82.7%	96.2%	146M	7.7x	42B	10x
NASNet-A	82.7%	96.2%	89M	4.7x	24B	5.7x
AmoebaNet-A	82.8%	96.1%	87M	4.6x	23B	5.5x
PNASNet	82.9%	96.2%	86M	4.5x	23B	6.0x
EfficientNet-B5	83.6%	96.7%	30M	1x	9.9B	1x
AmoebaNet-C	83.5%	96.5%	155M	5.2x	41B	4.1x
EfficientNet-B6	84.0%	96.8%	43M	1x	19B	1x
EfficientNet-B7	84.3%	97.0%	66M	1x	37B	1x
GPipe	84.3%	97.0%	557M	8.4x	-	-
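The ratio columns are quotients against the EfficientNet row that heads each group. The sketch below recomputes a few of them from the parameter and FLOPs figures in the table. The helper name `ratio` is hypothetical, and small differences from the printed ratios are expected elsewhere in the table because its inputs are already rounded.

```python
def ratio(other, efficientnet):
    """How many times larger `other` is than the matching EfficientNet value."""
    return other / efficientnet


# Values copied from the table: parameters in millions, FLOPs in billions.
resnet50_params = ratio(26, 5.3)     # ResNet-50 vs EfficientNet-B0, parameters
resnet50_flops = ratio(4.1, 0.39)    # ResNet-50 vs EfficientNet-B0, FLOPs
gpipe_params = ratio(557, 66)        # GPipe vs EfficientNet-B7, parameters

print(f"ResNet-50 params ratio: {resnet50_params:.1f}x")
print(f"ResNet-50 FLOPs ratio:  {resnet50_flops:.0f}x")
print(f"GPipe params ratio:     {gpipe_params:.1f}x")
```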

Ablations

Width scaling: accuracy rises, then saturates near 80%.
Depth scaling: deeper models help first, then returns taper.
Resolution scaling: larger inputs help, but gains diminish.
Compound scaling: best accuracy-efficiency tradeoff at fixed compute.
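To see what "fixed compute" means for these comparisons, recall that FLOPs grow roughly in proportion to $d \cdot w^2 \cdot r^2$. Under that approximation, the sketch below works out how far a single dimension must be pushed to reach a given FLOPs multiplier, next to a compound setting that reaches the same budget. The coefficient defaults are the values reported in the paper, and the function names are illustrative.

```python
import math


def single_axis_multipliers(flops_mult):
    """Multiplier needed on one axis alone to reach `flops_mult` times the FLOPs."""
    return {
        "depth_only": flops_mult,                  # FLOPs proportional to d
        "width_only": math.sqrt(flops_mult),       # FLOPs proportional to w^2
        "resolution_only": math.sqrt(flops_mult),  # FLOPs proportional to r^2
    }


def compound_for_budget(flops_mult, alpha=1.2, beta=1.1, gamma=1.15):
    """Solve (alpha * beta^2 * gamma^2)^phi = flops_mult for phi, then scale."""
    base = alpha * beta ** 2 * gamma ** 2
    phi = math.log(flops_mult) / math.log(base)
    return phi, alpha ** phi, beta ** phi, gamma ** phi


budget = 4.0  # target: four times the baseline FLOPs
print(single_axis_multipliers(budget))
phi, d, w, r = compound_for_budget(budget)
print(f"compound: phi={phi:.2f}, depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

At the same budget, the compound setting needs a much smaller change on each individual axis than any single-axis option, which is the regime where the ablations report diminishing returns for one-dimensional scaling.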

Method Strengths and Weaknesses

Strengths

One rule replaces ad hoc depth-only or width-only scaling.
B7 reaches 84.3% top-1 with 66M parameters.
Matches GPipe with $8.4\times$ fewer parameters.
Wins 5 of 8 transfer datasets with much smaller models.

Weaknesses

Depends on a strong searched baseline, B0.
Global ratios limit stage-specific scaling choices.
Largest model still costs 37B FLOPs.
Coefficients require search, not full theory.

Suggestions from the authors

Apply compound scaling to other vision architectures.
Study scaling under stricter latency and hardware constraints.
Improve baseline architecture search before scaling.
Extend the rule to more transfer tasks.
