BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

2018 · NAACL

Problem

Framing

Earlier NLP pre-training relied on left-to-right or shallowly bidirectional objectives, so a token's representation could not condition on both left and right context in every layer. BERT closes this gap by pre-training a bidirectional Transformer with masked language modeling plus next-sentence prediction, then fine-tuning a single encoder per task, and it reports new state-of-the-art results on 11 NLP tasks.

Currently Used Methods

Foundational

Proposed Method

Architecture

BERT is a bidirectional Transformer encoder whose input sums token, segment, and position embeddings. With $L$ the number of layers, $H$ the hidden size, and $A$ the number of self-attention heads, the paper reports two sizes: $\mathrm{BERT}_{\mathrm{BASE}}$ with $L=12$, $H=768$, $A=12$ (110M parameters) and $\mathrm{BERT}_{\mathrm{LARGE}}$ with $L=24$, $H=1024$, $A=16$ (340M parameters).
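As a rough sanity check on the reported sizes, the sketch below counts parameters from the layer count and hidden size alone. The vocabulary size (30,522), maximum position count (512), two segment types, feed-forward width of four times the hidden size, and the pooler layer are assumptions that this summary does not state. The head count $A$ only splits the hidden size across heads, so it does not change the total.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, num_segments=2):
    """Approximate parameter count of a BERT-style encoder (assumed layout)."""
    layer_norm = 2 * hidden  # scale and bias vectors

    # Token, position, and segment embedding tables, followed by a layer norm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections with biases.
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward block with an assumed inner width of 4 * hidden.
    inner = 4 * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)

    # Each sub-block is followed by its own layer norm.
    per_layer = attention + layer_norm + feed_forward + layer_norm

    # Pooler: one dense layer applied to the [CLS] state.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BASE:  {base / 1e6:.1f}M parameters")
print(f"LARGE: {large / 1e6:.1f}M parameters")
```

Under these assumptions the counts come out at roughly 109M and 335M, close to the reported 110M and 340M. The exact figures depend on the vocabulary and on which output heads are included.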

Overall diagram: pre-training predicts masked tokens and whether the second sentence follows the first; the same BERT encoder is then fine-tuned for MNLI, NER, and SQuAD with task-specific output heads.

Loss / Objective

Pre-training sums masked LM and next-sentence prediction losses.

$$\mathcal{L}_{\text{BERT}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$$

$$\mathcal{L}_{\text{MLM}} = - \sum_{i \in \mathcal{M}} \log p_{\theta}(x_i \mid \mathbf{x}_{\setminus \mathcal{M}})$$

$$\mathcal{L}_{\text{NSP}} = - \log p_{\theta}(y_{\text{NSP}} \mid \mathbf{C})$$
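The objective can be illustrated with a toy calculation. This is a minimal sketch rather than the authors' training code: the function names, the dictionary of predicted probabilities, and the example sentence are all hypothetical stand-ins for real model outputs.

```python
import math


def mlm_loss(predicted_probs, original_tokens, masked_positions):
    """Negative log-likelihood of the original tokens, summed over masked positions only."""
    return -sum(
        math.log(predicted_probs[i][original_tokens[i]])
        for i in masked_positions
    )


def nsp_loss(prob_is_next, is_next):
    """Negative log-likelihood of the binary next-sentence label."""
    prob_of_label = prob_is_next if is_next else 1.0 - prob_is_next
    return -math.log(prob_of_label)


def bert_pretraining_loss(predicted_probs, original_tokens, masked_positions,
                          prob_is_next, is_next):
    """Sum of the masked-LM and next-sentence terms."""
    return (mlm_loss(predicted_probs, original_tokens, masked_positions)
            + nsp_loss(prob_is_next, is_next))


# Hypothetical example: positions 2 and 4 were masked before encoding.
original_tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
masked_positions = [2, 4]
# Made-up model probabilities over a tiny vocabulary at each masked position.
predicted_probs = {
    2: {"dog": 0.5, "cat": 0.5},
    4: {"hairy": 0.25, "cute": 0.75},
}

loss = bert_pretraining_loss(
    predicted_probs, original_tokens, masked_positions,
    prob_is_next=0.5, is_next=True,
)
print(f"toy pre-training loss: {loss:.4f}")
```

Unmasked positions contribute nothing to the masked-LM term, which is what the sum over $i \in \mathcal{M}$ expresses.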

Algorithm

Task heads read either the [CLS] state or token states from the shared encoder.

$$p(y \mid \mathbf{x}) = \mathrm{softmax}(\mathbf{C}\mathbf{W}^{\top})$$

$$P_i^{\text{start}} = \frac{\exp(\mathbf{S}^{\top}\mathbf{T}_i)}{\sum_j \exp(\mathbf{S}^{\top}\mathbf{T}_j)}, \qquad P_i^{\text{end}} = \frac{\exp(\mathbf{E}^{\top}\mathbf{T}_i)}{\sum_j \exp(\mathbf{E}^{\top}\mathbf{T}_j)}$$
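Both heads reduce to a softmax over dot products, which the following sketch makes concrete with plain Python lists. The function names and the two-dimensional toy vectors are illustrative assumptions; in practice the states would come from the fine-tuned encoder and have dimension $H$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def classify(cls_state, class_weights):
    """Sentence-level head: softmax(C W^T), with one row of W per label."""
    return softmax([dot(cls_state, row) for row in class_weights])


def span_probabilities(token_states, start_vector, end_vector):
    """Span head: softmax over S . T_i for starts and E . T_i for ends."""
    p_start = softmax([dot(start_vector, t) for t in token_states])
    p_end = softmax([dot(end_vector, t) for t in token_states])
    return p_start, p_end


# Toy token states T_0..T_2 and learned vectors S and E (made-up values).
token_states = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
start_vector = [1.0, 0.0]
end_vector = [0.0, 1.0]

p_start, p_end = span_probabilities(token_states, start_vector, end_vector)
best_start = max(range(len(p_start)), key=p_start.__getitem__)
best_end = max(range(len(p_end)), key=p_end.__getitem__)
print("predicted span:", best_start, "to", best_end)

# Toy [CLS] state C and a 2-label weight matrix W (made-up values).
class_probs = classify([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]])
print("class probabilities:", class_probs)
```

The only task-specific parameters here are $\mathbf{W}$ for classification and $\mathbf{S}$, $\mathbf{E}$ for span prediction; everything else is the shared encoder.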

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: GLUE test results across eight tasks.

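| System | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|---|---|---|---|---|---|---|---|---|---|
| Pre-OpenAI SOTA | 80.6/80.1 | 66.1 | 82.3 | 93.2 | 35.0 | 81.0 | 86.0 | 61.7 | 74.0 |
| BiLSTM+ELMo+Attn | 76.4/76.1 | 64.8 | 79.8 | 90.4 | 36.0 | 73.3 | 84.9 | 56.8 | 71.0 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |
| $\mathrm{BERT}_{\mathrm{BASE}}$ | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
| $\mathrm{BERT}_{\mathrm{LARGE}}$ | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |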

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers