Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu

2019 · OpenAI

Problem

Framing

Transfer learning in NLP still tied each task to supervised adaptation on labeled data. The paper argues that scaling a single decoder-only language model trained on WebText yields zero-shot task behavior from prompting alone. The largest model has 1.5B parameters, and zero-shot performance on CoQA, WMT-14 Fr→En, CNN/Daily Mail, and Natural Questions improves with model size.

Currently Used Methods

Foundational

Proposed Method

Architecture

GPT-2 is a decoder-only Transformer with byte-level BPE inputs and a context length of 1024. Model sizes scale from 117M parameters (12 layers, $d_{\mathrm{model}}=768$) to 1542M parameters (48 layers, $d_{\mathrm{model}}=1600$). Tasks are specified only by the prompt prefix.
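As a rough sanity check on the two model sizes, the sketch below estimates parameter counts from depth and width. It is an approximation under stated assumptions, not the authors' accounting: the `DecoderConfig` class and `approx_params` method are hypothetical names, the vocabulary size of 50,257 and the 4x feed-forward expansion are assumptions not stated in this summary, and biases and layer-norm weights are ignored. The estimates land in the same range as the reported sizes but are not expected to match them exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical config holder for a decoder-only Transformer."""
    n_layer: int
    d_model: int
    n_ctx: int = 1024         # context length, as stated in the text
    vocab_size: int = 50257   # assumption: not stated in this summary

    def approx_params(self) -> int:
        # Per block: about 4*d^2 for attention (Q, K, V, output projections)
        # plus 8*d^2 for a feed-forward layer with 4x expansion (assumed).
        blocks = 12 * self.n_layer * self.d_model ** 2
        # Token embeddings plus learned position embeddings.
        embeddings = (self.vocab_size + self.n_ctx) * self.d_model
        return blocks + embeddings


small = DecoderConfig(n_layer=12, d_model=768)
large = DecoderConfig(n_layer=48, d_model=1600)

for name, cfg in [("small", small), ("large", large)]:
    print(f"{name}: ~{cfg.approx_params() / 1e6:.0f}M parameters")
```

The estimate is dominated by the $12 \cdot L \cdot d_{\mathrm{model}}^2$ term, which is why widening and deepening together moves the count by roughly an order of magnitude between the two configurations.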

Zero-shot scaling plot across four tasks: CoQA F1, WMT-14 Fr→En BLEU, CNN/Daily Mail ROUGE average, and Natural Questions accuracy versus model size.

Loss / Objective

Training uses standard autoregressive maximum likelihood.

$$p(\mathbf{x}) = \prod_{i=1}^{n} p_{\theta}(x_i \mid x_{<i})$$

$$\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_{\theta}(x_i \mid x_{<i})$$
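To make the objective concrete, here is a minimal sketch that evaluates the negative log-likelihood of one token sequence from per-position logits. The function names and the toy logits are illustrative assumptions; in a real model the logits at position i would come from the network conditioned on the preceding tokens.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_nll(step_logits, tokens):
    """Sum of -log p(x_i | x_<i), where step_logits[i] scores position i."""
    if len(step_logits) != len(tokens):
        raise ValueError("need one logit vector per token")
    return -sum(log_softmax(l)[t] for l, t in zip(step_logits, tokens))


# Toy example: vocabulary of 3 tokens, sequence of 2 tokens.
step_logits = [
    [0.0, 0.0, 0.0],  # uniform over the vocabulary at position 1
    [2.0, 0.0, 0.0],  # prefers token 0 at position 2
]
tokens = [1, 0]

nll = sequence_nll(step_logits, tokens)
likelihood = math.exp(-nll)  # product of the per-token probabilities
print(f"NLL = {nll:.4f}, p(x) = {likelihood:.4f}")
```

Exponentiating the negated loss recovers the product form of the sequence probability, which is the link between the two expressions above.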

Sampling Rule / Algorithm

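Zero-shot inference conditions the model on a task prompt and samples from the same next-token distribution that was trained with the likelihood objective; no task-specific parameters are added.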

$$x_t \sim p_{\theta}(x_t \mid x_{<t})$$
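The sampling rule can be sketched as a simple loop: start from the prompt tokens, draw one token from the next-token distribution, append it, and repeat. Everything here is a stand-in under stated assumptions: `toy_logits` is a hypothetical replacement for a trained model, token ids are plain integers, and no truncation or temperature is applied. In the summarization setting, the prompt would be the article followed by a hint such as the `TL;DR:` suffix named in the results table.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_token_logits, prompt, max_new_tokens, seed=0):
    """Ancestral sampling: x_t ~ p(x_t | x_<t), conditioned on the prompt."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(next_token_logits(tokens))
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


VOCAB_SIZE = 4


def toy_logits(tokens):
    # Hypothetical stand-in for a trained model: strongly prefers the
    # token that follows the last one, cycling through the vocabulary.
    preferred = (tokens[-1] + 1) % VOCAB_SIZE
    return [10.0 if v == preferred else 0.0 for v in range(VOCAB_SIZE)]


prompt = [0, 1]  # the "task specification" is just a token prefix
output = generate(toy_logits, prompt, max_new_tokens=5)
print(output)
```

The only thing that changes between tasks in this setup is the contents of `prompt`; the sampling loop and the model stay the same.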

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: CNN/Daily Mail summarization ROUGE F1 results

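| Model | R-1 | R-2 | R-L | R-AVG |
|---|---|---|---|---|
| Bottom-Up Sum | 41.22 | 18.68 | 38.34 | 32.75 |
| Lede-3 | 40.38 | 17.66 | 36.62 | 31.55 |
| Seq2Seq + Attn | 31.33 | 11.81 | 28.83 | 23.99 |
| GPT-2 TL;DR: | 29.34 | 8.27 | 26.58 | 21.40 |
| Random-3 | 28.78 | 8.63 | 25.52 | 20.98 |
| GPT-2 no hint | 21.58 | 4.03 | 19.47 | 15.03 |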
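The R-AVG values in Table 1 are consistent with a plain mean of R-1, R-2, and R-L. The short check below recomputes that mean for each row; the averaging rule is inferred from the numbers rather than stated in this summary, and the `rows` dictionary and `rouge_avg` helper are illustrative names.

```python
# (R-1, R-2, R-L, reported R-AVG) for each row of Table 1.
rows = {
    "Bottom-Up Sum": (41.22, 18.68, 38.34, 32.75),
    "Lede-3": (40.38, 17.66, 36.62, 31.55),
    "Seq2Seq + Attn": (31.33, 11.81, 28.83, 23.99),
    "GPT-2 TL;DR:": (29.34, 8.27, 26.58, 21.40),
    "Random-3": (28.78, 8.63, 25.52, 20.98),
    "GPT-2 no hint": (21.58, 4.03, 19.47, 15.03),
}


def rouge_avg(r1, r2, rl):
    """Unweighted mean of the three ROUGE F1 scores (assumed rule)."""
    return (r1 + r2 + rl) / 3


# A row counts as a mismatch if the recomputed mean differs from the
# reported value by more than two-decimal rounding allows.
mismatches = [
    name
    for name, (r1, r2, rl, reported) in rows.items()
    if abs(rouge_avg(r1, r2, rl) - reported) > 0.005
]
print(mismatches)  # prints []
```

On this reading, the `TL;DR:` hint accounts for a gap of about 6.4 R-AVG points over the unprompted model (21.40 versus 15.03), while the prompted model sits only slightly above the Random-3 baseline (20.98).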

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers