BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

2018 · NAACL

Problem

Framing

Before BERT, NLP pre-training relied on left-to-right language models or on shallow concatenations of independently trained left-to-right and right-to-left models, so a token's representation could not condition jointly on its left and right context. BERT closes this gap by pre-training a deep bidirectional Transformer with a masked-LM objective plus next-sentence prediction, then fine-tuning the same encoder for each task. It reports state-of-the-art results on 11 NLP tasks.

Currently Used Methods

Foundational

@bengioRepresentationLearning2013 — transferable representations from unlabeled data.
- Limitation in context: no deep bidirectional pre-training objective.
@choGRU2014 — gated recurrent encoders for sequence modeling.
- Limitation in context: weaker parallelism and scaling than self-attention.
@petersELMo2018 — contextual bi-LM features for downstream NLP.
- Limitation in context: feature extraction, not full-parameter fine-tuning.
@radfordGPT2018 — Transformer pre-training with left-to-right LM.
- Limitation in context: unidirectional masking blocks full token conditioning.
@vaswaniAttentionAllNeed2017 — Transformer encoder backbone with self-attention.
- Limitation in context: no masked bidirectional pre-training recipe.

Proposed Method

Architecture

BERT is a bidirectional Transformer encoder whose input is the sum of token, segment, and position embeddings. The paper reports two sizes: BERT$_{\mathrm{BASE}}$ with $L=12$ layers, hidden size $H=768$, $A=12$ attention heads, and 110M parameters; and BERT$_{\mathrm{LARGE}}$ with $L=24$, $H=1024$, $A=16$, and 340M parameters.
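As a sanity check on the reported sizes, the sketch below estimates the parameter count from $L$ and $H$. It assumes a standard Transformer encoder layout with a feed-forward inner size of $4H$, 512 positions, the 30,000-token vocabulary, and a dense pooler over the $[\mathrm{CLS}]$ state; the pooler and the exact bias and LayerNorm bookkeeping are assumptions, not details given in this summary. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(num_layers, hidden, vocab_size=30_000,
                     max_positions=512, num_segments=2, with_pooler=True):
    """Rough parameter count (weights + biases) for a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale and shift vectors
    # Token, position, and segment tables; their sum is layer-normalized.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    # Self-attention: query, key, value, and output projections.
    # The head count A splits H across heads and does not change the total.
    attention = 4 * (hidden * hidden + hidden)
    # Position-wise feed-forward block with inner size 4H (assumed).
    inner = 4 * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    # Assumed dense layer applied to the [CLS] state.
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BASE  ~{base / 1e6:.0f}M parameters")
print(f"LARGE ~{large / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land close to the reported 110M and 340M; the remaining gap depends on vocabulary size and which output layers are counted.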

Overall diagram: pre-training predicts masked tokens and whether the second sentence follows the first; fine-tuning then starts from the same pre-trained encoder for MNLI, NER, and SQuAD, adding a task-specific output layer for each.

Loss / Objective

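Pre-training minimizes the sum of the masked-LM and next-sentence-prediction losses. Here $\mathcal{M}$ is the set of positions selected for prediction, $\mathbf{x}_{\setminus \mathcal{M}}$ is the corrupted input sequence, $y_{\text{NSP}}$ is the IsNext/NotNext label, and $\mathbf{C}$ is the final hidden state of $[\mathrm{CLS}]$.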

$$\mathcal{L}_{\text{BERT}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$

$$\mathcal{L}_{\text{MLM}} = - \sum_{i \in \mathcal{M}} \log p_{\theta}(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$$

$$\mathcal{L}_{\text{NSP}} = - \log p_{\theta}(y_{\text{NSP}} \mid \mathbf{C})$$
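The two terms can be sketched in plain Python as a pair of negative log-likelihoods. This is a minimal illustration on toy logits, not the authors' implementation; the function names and the toy numbers are made up for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def mlm_loss(token_logits, target_ids, masked_positions):
    """Summed negative log-likelihood over the selected positions only."""
    return -sum(log_softmax(token_logits[i])[target_ids[i]]
                for i in masked_positions)


def nsp_loss(nsp_logits, label):
    """Negative log-likelihood of the binary IsNext / NotNext label."""
    return -log_softmax(nsp_logits)[label]


# Toy example: 4 positions, vocabulary of 3, positions 1 and 3 selected.
token_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0],
                [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
target_ids = [0, 1, 2, 2]
masked_positions = [1, 3]
total = (mlm_loss(token_logits, target_ids, masked_positions)
         + nsp_loss([1.5, -0.5], label=0))
print(f"L_BERT = {total:.4f}")
```

Positions outside `masked_positions` contribute nothing to the MLM term, which matches the sum over $\mathcal{M}$ in the formula.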

Algorithm

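Task heads read either the final $[\mathrm{CLS}]$ state $\mathbf{C}$ or the final token states $\mathbf{T}_i$ from the shared encoder. Classification applies a learned weight matrix $\mathbf{W}$ to $\mathbf{C}$; span extraction learns a start vector $\mathbf{S}$ and an end vector $\mathbf{E}$, each scored against every $\mathbf{T}_i$.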

$$p(y \mid \mathbf{x}) = \mathrm{softmax}(\mathbf{C}\mathbf{W}^{\top})$$

$$P_i^{\text{start}} = \frac{\exp(\mathbf{S}^{\top}\mathbf{T}_i)}{\sum_j \exp(\mathbf{S}^{\top}\mathbf{T}_j)}, \qquad P_i^{\text{end}} = \frac{\exp(\mathbf{E}^{\top}\mathbf{T}_i)}{\sum_j \exp(\mathbf{E}^{\top}\mathbf{T}_j)}$$
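The span head can be sketched as two softmaxes over dot products. The span-selection rule below (highest $\mathbf{S}^{\top}\mathbf{T}_i + \mathbf{E}^{\top}\mathbf{T}_j$ with $j \ge i$) follows the paper's description but is not stated in this summary; the function names and toy vectors are illustrative only.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def span_probabilities(token_states, start_vec, end_vec):
    """P_start and P_end over all token positions."""
    start = softmax([dot(start_vec, t) for t in token_states])
    end = softmax([dot(end_vec, t) for t in token_states])
    return start, end


def best_span(token_states, start_vec, end_vec):
    """Highest-scoring (i, j) with j >= i under score S.T_i + E.T_j."""
    start_scores = [dot(start_vec, t) for t in token_states]
    end_scores = [dot(end_vec, t) for t in token_states]
    n = len(token_states)
    candidates = ((start_scores[i] + end_scores[j], i, j)
                  for i in range(n) for j in range(i, n))
    _, i, j = max(candidates)
    return i, j


# Toy 2-dimensional token states for a 4-token passage.
T = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
S, E = [1.0, 0.0], [0.0, 1.0]
p_start, p_end = span_probabilities(T, S, E)
print(best_span(T, S, E))
```

This exhaustive search is quadratic in passage length, which is fine for a sketch; a real system would likely cap the span length.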

Training Procedure

Pre-training corpora: BooksCorpus (800M words); English Wikipedia (2,500M words).
Vocabulary: 30,000 WordPiece tokens.
MLM prediction rate: 15% of input tokens are selected for prediction.
MLM replacement for selected tokens: 80% $[\mathrm{MASK}]$, 10% random token, 10% unchanged.
NSP sampling: 50% IsNext, 50% NotNext.
Optimizer: Adam.
Learning rate: $1 \times 10^{-4}$.
Adam coefficients: $\beta_1=0.9$, $\beta_2=0.999$.
Weight decay: 0.01 (L2).
Warmup: first 10,000 steps.
Fine-tuning on GLUE: batch size 32, 3 epochs, learning rate chosen from $\{5\times10^{-5}, 4\times10^{-5}, 3\times10^{-5}, 2\times10^{-5}\}$.
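The masking rule and the warmup can be sketched as follows. Several details here are assumptions: each position is selected independently with probability 0.15 (a simplification of choosing 15% of positions), special tokens are skipped, and the schedule decays linearly to zero over 1,000,000 steps, a total taken from the paper rather than from the list above. `mask_tokens` and `learning_rate` are hypothetical names.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_rate=0.15):
    """Return (corrupted tokens, {position: original token}) for MLM."""
    corrupted = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]") or rng.random() >= select_rate:
            continue  # special tokens and unselected positions stay as-is
        targets[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in the input
    return corrupted, targets


def learning_rate(step, peak=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to `peak`, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


rng = random.Random(0)
sentence = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
corrupted, targets = mask_tokens(sentence, vocab=["apple", "run", "blue"], rng=rng)
print(corrupted, targets)
print(learning_rate(5_000), learning_rate(10_000))
```

Keeping 10% of selected tokens unchanged and randomizing another 10% means the encoder cannot rely on $[\mathrm{MASK}]$ alone to know which positions it will be asked about.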

Evaluation

Datasets

GLUE: MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE.
SQuAD v1.1.
SQuAD v2.0.
SWAG.

Metrics

Accuracy for most GLUE tasks and SWAG.
F1 for QQP and MRPC.
Spearman correlation for STS-B.
EM and F1 for SQuAD.
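For orientation, EM and token-level F1 can be sketched as below. This is a simplified version that only lowercases and splits on whitespace; the official SQuAD evaluation also strips punctuation and articles and takes the maximum over several reference answers, all omitted here. The function names are illustrative.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(prediction.lower().split() == reference.lower().split())


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction, reference = "Eiffel Tower in Paris", "Eiffel Tower"
print(exact_match(prediction, reference))
print(round(token_f1(prediction, reference), 3))
```

In this example the prediction contains the full reference plus two extra tokens, so recall is 1 while precision is 0.5: EM gives no credit and F1 gives partial credit.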

Headline results

GLUE test, BERT$_{\mathrm{LARGE}}$: average 82.1.
MNLI test, BERT$_{\mathrm{LARGE}}$: 86.7 matched, 85.9 mismatched.
QNLI test, BERT$_{\mathrm{LARGE}}$: 92.7.
SQuAD v1.1 test, BERT$_{\mathrm{LARGE}}$ (single model) + TriviaQA: 85.1 EM, 91.8 F1.
SWAG test, BERT$_{\mathrm{LARGE}}$: 86.3.

Table 1: GLUE test results across eight tasks.

System	MNLI-(m/mm)	QQP	QNLI	SST-2	CoLA	STS-B	MRPC	RTE	Average
Pre-OpenAI SOTA	80.6/80.1	66.1	82.3	93.2	35.0	81.0	86.0	61.7	74.0
BiLSTM+ELMo+Attn	76.4/76.1	64.8	79.8	90.4	36.0	73.3	84.9	56.8	71.0
OpenAI GPT	82.1/81.4	70.3	87.4	91.3	45.4	80.0	82.3	56.0	75.1
BERT$_{\mathrm{BASE}}$	84.6/83.4	71.2	90.5	93.5	52.1	85.8	88.9	66.4	79.6
BERT$_{\mathrm{LARGE}}$	86.7/85.9	72.1	92.7	94.9	60.5	86.5	89.3	70.1	82.1
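The Average column appears consistent with a plain mean over nine scores, counting MNLI matched and mismatched separately. The check below recomputes it from the rounded table entries; small differences are expected because the reported averages were presumably computed before rounding.

```python
# Scores in table order: MNLI-m, MNLI-mm, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE.
rows = {
    "Pre-OpenAI SOTA":  ([80.6, 80.1, 66.1, 82.3, 93.2, 35.0, 81.0, 86.0, 61.7], 74.0),
    "BiLSTM+ELMo+Attn": ([76.4, 76.1, 64.8, 79.8, 90.4, 36.0, 73.3, 84.9, 56.8], 71.0),
    "OpenAI GPT":       ([82.1, 81.4, 70.3, 87.4, 91.3, 45.4, 80.0, 82.3, 56.0], 75.1),
    "BERT BASE":        ([84.6, 83.4, 71.2, 90.5, 93.5, 52.1, 85.8, 88.9, 66.4], 79.6),
    "BERT LARGE":       ([86.7, 85.9, 72.1, 92.7, 94.9, 60.5, 86.5, 89.3, 70.1], 82.1),
}

for system, (scores, reported) in rows.items():
    mean = sum(scores) / len(scores)
    print(f"{system:18s} mean={mean:.2f} reported={reported}")
```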

Ablations

Remove NSP (BERT$_{\mathrm{BASE}}$, dev sets): MNLI-m accuracy drops from 84.4 to 83.9; SQuAD F1 drops from 88.5 to 87.9.
Replace bidirectional MLM with a left-to-right LM (also without NSP): SQuAD F1 falls to 77.8.
Larger models help most on low-resource tasks such as RTE.
More pre-training steps improve downstream accuracy.

Method Strengths and Weaknesses

Strengths

One encoder transfers across classification, tagging, and extractive QA.
Bidirectional MLM beats left-to-right pre-training in ablations.
BERT$_{\mathrm{LARGE}}$ reaches an 82.1 average on the GLUE test set, 7.0 points above OpenAI GPT (75.1).
Fine-tuning adds only small task-specific heads.

Weaknesses

The $[\mathrm{MASK}]$ token appears during pre-training but never during fine-tuning, creating a mismatch between the two stages.
NSP is a coarse binary discourse signal.
Gains depend on large corpora and heavy pre-training compute.
BERT$_{\mathrm{LARGE}}$ fine-tuning is sometimes unstable on small datasets.

Suggestions from the authors

Scale depth, hidden size, and attention heads further.
Study masking schemes that reduce the $[\mathrm{MASK}]$ mismatch.
Pre-train on larger corpora and additional languages.
Extend the same recipe to more downstream task families.
