Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan

2018 · OpenAI

Problem

Framing

Labeled data for natural language understanding (NLU) is scarce, and earlier transfer methods either transfer only word-level features or depend on task-specific pipelines. The paper addresses this gap by pre-training a decoder-only Transformer with an autoregressive language-modeling objective, then fine-tuning it on each task with minimal architectural change. It reports state-of-the-art results on 9 of 12 benchmarks, including absolute gains of 8.9% on Story Cloze and 5.7% on RACE.

Currently Used Methods

Foundational

Proposed Method

Architecture

The model is a 12-layer decoder-only Transformer with masked self-attention, hidden size 768, 12 heads, and feed-forward width 3072. Fine-tuning keeps the backbone and serializes each task as one token sequence, then applies a linear classifier to the final token state.
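The hyperparameters above can be collected into a small configuration object to check how they relate. The following is a minimal sketch, not the authors' code: the names `DecoderConfig`, `causal_mask`, and `block_weight_count` are hypothetical, and the per-block count assumes a standard block layout (four attention projections with biases, a two-layer feed-forward network, two layer norms), which the summary does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hyperparameters quoted in the text; the class itself is illustrative."""
    n_layers: int = 12
    d_model: int = 768
    n_heads: int = 12
    d_ff: int = 3072

    @property
    def head_dim(self) -> int:
        # Each head works on an equal slice of the hidden state.
        assert self.d_model % self.n_heads == 0
        return self.d_model // self.n_heads


def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def block_weight_count(cfg: DecoderConfig) -> int:
    """Weights and biases in one block under the assumed standard layout."""
    # Query, key, value, and output projections, each d_model x d_model plus bias.
    attn = 4 * (cfg.d_model * cfg.d_model + cfg.d_model)
    # Feed-forward: d_model -> d_ff -> d_model, with biases.
    ffn = (cfg.d_model * cfg.d_ff + cfg.d_ff) + (cfg.d_ff * cfg.d_model + cfg.d_model)
    # Two layer norms, each with a gain and a bias vector.
    norms = 2 * 2 * cfg.d_model
    return attn + ffn + norms


cfg = DecoderConfig()
print(cfg.head_dim)             # per-head width
print(block_weight_count(cfg))  # parameters in one block (embeddings excluded)
print(causal_mask(3))           # lower-triangular attention pattern
```

The mask is what makes the self-attention "masked": each position sees only itself and earlier positions, which is the property the autoregressive objective relies on.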

Verified architecture figure: a 12-layer decoder-only Transformer on the left, and task serializations for classification, entailment, similarity, and multiple-choice QA on the right.

Loss / Objective

Pre-training uses autoregressive likelihood, and fine-tuning adds supervised loss plus an auxiliary LM term.

$$L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

$$L_2(C) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)$$

$$L_3(C) = L_2(C) + \lambda L_1(C)$$
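As a rough illustration of how the three objectives combine, the sketch below evaluates them from probabilities that a model has already assigned. The function names are hypothetical, the input probabilities are made-up toy values, and the value of `lam` is an illustrative assumption rather than a setting taken from this summary. All three quantities are log-likelihoods, so training would maximize them (or minimize their negatives).

```python
import math


def lm_log_likelihood(token_probs: list[float]) -> float:
    """L1: sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over token positions."""
    return sum(math.log(p) for p in token_probs)


def supervised_log_likelihood(label_probs: list[float]) -> float:
    """L2: sum of log P(y | x^1, ..., x^m) over labeled examples."""
    return sum(math.log(p) for p in label_probs)


def combined_objective(label_probs: list[float],
                       token_probs: list[float],
                       lam: float) -> float:
    """L3 = L2 + lam * L1, with L1 evaluated on the labeled corpus C."""
    return supervised_log_likelihood(label_probs) + lam * lm_log_likelihood(token_probs)


# Toy values: probability of the correct label for two examples, and the
# probability of each actual next token in those examples' input sequences.
label_probs = [0.9, 0.6]
token_probs = [0.2, 0.5, 0.1, 0.4]
print(combined_objective(label_probs, token_probs, lam=0.5))
```

Setting `lam` to 0 recovers plain supervised fine-tuning, which makes the auxiliary language-modeling term easy to ablate.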

Algorithm

Each target task is converted to a left-to-right token sequence, and prediction reads out the last hidden state.

$$\mathbf{h}_l^m = \mathrm{Transformer}(x^1, \ldots, x^m)$$

$$P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}(\mathbf{W}_y \mathbf{h}_l^m)$$
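The serialization and readout steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the special-token strings `<s>`, `<$>`, and `<e>` and all function names are hypothetical placeholders, the Transformer itself is omitted, and `last_hidden` stands in for the final-layer state at the last token.

```python
import math

# Hypothetical special tokens marking sequence start, segment boundary, and readout position.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"


def serialize_entailment(premise: list[str], hypothesis: list[str]) -> list[str]:
    """Join premise and hypothesis into one left-to-right token sequence."""
    return [START, *premise, DELIM, *hypothesis, EXTRACT]


def serialize_multiple_choice(context: list[str],
                              answers: list[list[str]]) -> list[list[str]]:
    """Build one sequence per candidate answer; each is scored separately."""
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(last_hidden: list[float], W_y: list[list[float]]) -> list[float]:
    """P(y | x^1..x^m) = softmax(W_y h), with W_y holding one row per class."""
    logits = [sum(w * h for w, h in zip(row, last_hidden)) for row in W_y]
    return softmax(logits)


tokens = serialize_entailment(["a", "dog", "runs"], ["an", "animal", "moves"])
print(tokens)

# Toy 2-dimensional hidden state and a 3-class classifier matrix.
print(classify([1.0, -1.0], [[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]]))
```

Because every task is reduced to a token sequence ending in a readout position, the only new parameters at fine-tuning time in this sketch are the classifier matrix and the embeddings of the added special tokens.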

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Verified results plot: zero-shot relative task performance versus pre-training updates, with solid Transformer curves above dashed LSTM curves on sentiment, Winograd, acceptability, and QA.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers