Language Models are Unsupervised Multitask Learners

Alec Radford, Jeffrey Wu

2019 · OpenAI

Language Models are Unsupervised Multitask Learners

Problem

Framing

Supervised transfer in NLP still bound each task to labeled adaptation. The paper argues that scaling one decoder-only language model on WebText yields zero-shot task behavior from prompting alone. The largest model has 1.5B parameters and improves on CoQA, WMT-14 Fr→En, CNN/Daily Mail, and Natural Questions.

Currently Used Methods

Foundational

@radfordGPT2018 — generative pretraining with supervised downstream fine-tuning.
- Limitation in context: each task still needs labeled adaptation.
@vaswaniAttentionAllNeed2017 — Transformer decoder for scalable autoregressive sequence modeling.
- Limitation in context: architecture alone does not yield zero-shot transfer.
@petersELMo2018 — contextual representations transferred into supervised task models.
- Limitation in context: requires separate task heads and labeled training.
@devlinBERT2018 — strong bidirectional pretraining with fine-tuned task performance.
- Limitation in context: evaluation still centers on supervised adaptation.

Proposed Method

Architecture

GPT-2 is a decoder-only Transformer with byte-level BPE inputs and context length 1024. Model sizes scale from 117M, 12 layers, $d_{\mathrm{model}}=768$ to 1542M, 48 layers, $d_{\mathrm{model}}=1600$ . Tasks are specified only by the prompt prefix.

Zero-shot scaling plot across four tasks: CoQA F1, WMT-14 Fr→En BLEU, CNN/Daily Mail ROUGE average, and Natural Questions accuracy versus model size.

Loss / Objective

Training uses standard autoregressive maximum likelihood.

p(\mathbf{x}) = \prod_{i=1}^{n} p_{\theta}(x_i \mid x_{<i})

\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_{\theta}(x_i \mid x_{<i})

Sampling Rule / Algorithm

Zero-shot inference conditions on a task prompt and samples the same next-token distribution.

x_t \sim p_{\theta}(x_t \mid x_{<t})

Training Procedure

Data: WebText, about 8 million documents, about 40 GB.
Source filter: outbound Reddit links with at least 3 karma.
Context length: 1024 tokens.
Model sizes: 117M, 345M, 762M, 1542M.
Qualitative decoding: top- $k$ sampling with $k=40$ .

Evaluation

Datasets

CoQA
WMT-14 Fr→En
CNN/Daily Mail
Natural Questions
LAMBADA
CBT
WikiText-2
PTB
enwik8
text8
WikiText-103
1BW

Metrics

F1
BLEU
ROUGE-1
ROUGE-2
ROUGE-L
ROUGE-AVG
Accuracy
Perplexity
Bits per byte
Bits per character

Headline results

CoQA zero-shot: 55.0 F1.
WMT-14 Fr→En zero-shot: 11.5 BLEU.
CNN/Daily Mail zero-shot with TL;DR hint: ROUGE-AVG 21.40.
Natural Questions zero-shot: about 4.1 accuracy.
LAMBADA zero-shot: 8.63 perplexity, 63.24% accuracy.

Table 1: CNN/Daily Mail summarization ROUGE F1 results

Model	R-1	R-2	R-L	R-AVG
Bottom-Up Sum	41.22	18.68	38.34	32.75
Lede-3	40.38	17.66	36.62	31.55
Seq2Seq + Attn	31.33	11.81	28.83	23.99
GPT-2 TL;DR:	29.34	8.27	26.58	21.40
Random-3	28.78	8.63	25.52	20.98
GPT-2 no hint	21.58	4.03	19.47	15.03

Ablations

Model size: zero-shot performance improves monotonically across tasks.
Prompt hinting: TL;DR: sharply improves summarization over no hint.
Data overlap: CoQA news overlap reaches about 15% and explains only part of gains.
Task format: natural language prefixes can elicit translation and QA without fine-tuning.

Method Strengths and Weaknesses

Strengths

Unifies many tasks under one next-token objective.
Zero-shot gains rise smoothly with scale across four task families.
No task-specific parameters or gradient updates at evaluation.
Prompted summarization beats random and no-hint baselines.

Weaknesses

Zero-shot results still trail supervised systems on core benchmarks.
Prompt phrasing strongly affects performance.
Benchmark contamination is nontrivial on CoQA news.
Context stays limited to 1024 tokens.

Suggestions from the authors

Train on larger and more diverse corpora.
Measure contamination more rigorously across benchmarks.
Test broader task families under pure zero-shot prompting.
Study how context specifies latent tasks.

Language Models are Unsupervised Multitask Learners

Language Models are Unsupervised Multitask Learners

Problem

Framing

Currently Used Methods

Foundational

Proposed Method

Architecture

Loss / Objective

Sampling Rule / Algorithm

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers