Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann

2020 · NeurIPS

Problem

Framing

Fine-tuned NLP systems still need labeled, task-specific gradient updates, and prompt-only transfer trails them badly on benchmarks such as Natural Questions and CoQA. The paper argues that scale alone induces in-context learning in autoregressive transformers: a 175B model reaches 71.2 accuracy on closed-book TriviaQA and 85.0 F1 on CoQA in the few-shot setting, without gradient updates.

Currently Used Methods

Foundational

@radfordGPT2_2019 — decoder-only autoregressive transformer pre-training with prompt-based adaptation.
- Limitation in context: prompting stayed well below fine-tuned leaders.
@vaswaniAttentionAllNeed2017 — transformer self-attention backbone for large-scale sequence modeling.
- Limitation in context: no account of task learning from demonstrations alone.
@devlinBERT2018 — bidirectional pre-training followed by supervised task-specific fine-tuning.
- Limitation in context: each task still needs labeled updates.
@raffelT5_2020 — text-to-text transfer with strong fine-tuned benchmark coverage.
- Limitation in context: benchmark gains still depend on supervised adaptation.
@kaplanScalingLaws2020 — scaling laws for loss versus model size and compute.
- Limitation in context: does not establish few-shot task competence.

Proposed Method

Architecture

GPT-3 keeps the GPT-2 decoder-only transformer and scales it to 175B parameters. The largest model uses 96 layers, $d_{\mathrm{model}}=12288$, 96 heads, $d_{\mathrm{head}}=128$, and a 2048-token context window. It alternates dense and locally banded sparse attention.
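As a rough sanity check on these dimensions, the sketch below estimates the parameter count from depth and width. The `12 * d_model**2` per-layer rule of thumb (attention projections plus a 4x-wide feed-forward block) and the 50,257-entry vocabulary are assumptions, not stated in this summary; biases and layer norms are ignored.

```python
# Rough parameter estimate for a decoder-only transformer.
# Dimensions are the ones stated above for the largest model.
N_LAYERS = 96
D_MODEL = 12288
N_HEADS = 96
D_HEAD = 128
N_CTX = 2048
VOCAB_SIZE = 50257  # assumption: GPT-2 style BPE vocabulary


def approx_params(n_layers, d_model, vocab_size, n_ctx):
    """Approximate weight count, ignoring biases and layer norms."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 8 * d_model * d_model   # assumption: hidden width 4 * d_model
    blocks = n_layers * (attention + feed_forward)
    embeddings = (vocab_size + n_ctx) * d_model  # token + position tables
    return blocks + embeddings


# The head dimensions should tile the model width exactly.
assert N_HEADS * D_HEAD == D_MODEL

total = approx_params(N_LAYERS, D_MODEL, VOCAB_SIZE, N_CTX)
print(f"{total / 1e9:.1f}B parameters")
```

Under these assumptions the estimate lands within about one percent of the stated 175B.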

Verified figure: language-model meta-learning diagram with an outer pre-training loop and inner in-context learning across arithmetic, word-cleanup, and translation example sequences.

Loss / Objective

Training uses standard next-token maximum likelihood.

$$\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})$$
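A minimal sketch of this objective on toy data, in plain Python. The function names and the list-of-logits input format are illustrative assumptions, not the paper's code; perplexity is taken here as the exponential of the mean per-token loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_nll(step_logits, token_ids):
    """Sum over t of -log p(x_t | x_<t).

    step_logits[t] holds the model's logits for position t (already
    conditioned on x_<t); token_ids[t] is the token actually observed.
    """
    return -sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)
    )


def perplexity(step_logits, token_ids):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(next_token_nll(step_logits, token_ids) / len(token_ids))


# Toy check: a uniform model over 4 tokens has perplexity 4.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
print(perplexity(uniform_logits, [0, 1, 2]))
```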

Algorithm

Task adaptation happens only through conditioning on instructions and demonstrations in the context.

$$p_{\theta}(\mathbf{y} \mid \mathbf{x}, \mathcal{D}_{\mathrm{ctx}}) = \prod_{t=1}^{|\mathbf{y}|} p_{\theta}\!\left(y_t \mid \mathcal{D}_{\mathrm{ctx}}, \mathbf{x}, y_{<t}\right)$$
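The factorization above can be sketched as prompt assembly plus a chain-rule score. The `Q:`/`A:` template and the `token_logprob` callable are hypothetical stand-ins for a real prompt format and a real model; only the structure (demonstrations in context, no weight updates) follows the text.

```python
import math


def build_prompt(instruction, demonstrations, query, sep="\n\n"):
    """Concatenate an instruction, K worked examples, and the query.

    K = len(demonstrations); K = 0 gives a zero-shot prompt.
    The "Q:/A:" layout is an illustrative assumption.
    """
    parts = [instruction] if instruction else []
    parts += [f"Q: {x}\nA: {y}" for x, y in demonstrations]
    parts.append(f"Q: {query}\nA:")
    return sep.join(parts)


def sequence_logprob(token_logprob, context, answer_tokens):
    """Sum of log p(y_t | context, y_<t) over the answer tokens.

    token_logprob(context, prefix, token) is a hypothetical model call.
    """
    total = 0.0
    prefix = []
    for tok in answer_tokens:
        total += token_logprob(context, prefix, tok)
        prefix = prefix + [tok]
    return total


# Toy usage with a fake model that gives every token probability 0.5.
prompt = build_prompt("Translate to French.", [("cat", "chat")], "dog")
score = sequence_logprob(lambda c, p, t: math.log(0.5), prompt, ["ch", "ien"])
print(prompt)
print(score)
```

Scoring each candidate answer this way and picking the highest score is one way to handle multiple-choice tasks; free-form tasks would instead sample a continuation from the same prompt.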

Training Procedure

Training tokens: 300B.
Optimizer: Adam, $\beta_1=0.9$, $\beta_2=0.95$, $\epsilon=10^{-8}$.
Gradient clipping: global norm $1.0$.
LR schedule: 375M-token warmup, cosine decay to 10% over 260B tokens.
175B batch size: 3.2M tokens.
175B learning rate: $0.6 \times 10^{-4}$.
Data mix: Common Crawl 60%, WebText2 22%, Books1 8%, Books2 8%, Wikipedia 3%.
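The learning-rate schedule listed above can be sketched as a multiplier on the peak rate. Linear warmup, measuring the cosine phase from the end of warmup, and holding the 10% floor for the remaining tokens are assumptions made for this sketch.

```python
import math

WARMUP_TOKENS = 375e6   # warmup length in tokens
DECAY_TOKENS = 260e9    # token count at which decay reaches the floor
FLOOR = 0.10            # final fraction of the peak learning rate


def lr_multiplier(tokens_seen):
    """Fraction of the peak learning rate after `tokens_seen` tokens."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: warmup is linear from 0 to the peak rate.
        return tokens_seen / WARMUP_TOKENS
    # Assumption: cosine progress is measured from the end of warmup.
    progress = (tokens_seen - WARMUP_TOKENS) / (DECAY_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FLOOR + (1.0 - FLOOR) * cosine


for tokens in (0.0, 375e6, 130e9, 260e9, 300e9):
    print(f"{tokens:.3g} tokens -> {lr_multiplier(tokens):.3f} x peak")
```

Multiplying the result by a model's peak learning rate gives the step's actual rate; keeping the schedule as a unit-free multiplier lets the same function serve every model size.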

Evaluation

Datasets

Cloze and completion: LAMBADA, StoryCloze, HellaSwag.
Question answering: Natural Questions, WebQuestions, TriviaQA.
Reading comprehension: CoQA, DROP, QuAC, SQuADv2, RACE.
Reasoning: Winograd, Winogrande, PIQA, ARC, OpenBookQA.
Aggregate benchmark: SuperGLUE.
Synthetic tasks: arithmetic, symbol removal, novel-word usage.

Metrics

Accuracy.
F1.
Perplexity.
Human preference.
Human-vs-model discrimination.
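For reference, a token-overlap F1 of the kind commonly used for reading-comprehension answers can be sketched as follows. This summary does not say how F1 is computed, so the whitespace tokenization and lowercasing here are assumptions rather than the benchmark's official scorer.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat"))
```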

Headline results

LAMBADA few-shot: 86.4 accuracy, 1.92 perplexity.
TriviaQA closed-book few-shot: 71.2 accuracy.
CoQA few-shot: 85.0 F1.
HellaSwag few-shot: 79.3 accuracy.
Winogrande XL few-shot: 77.7 accuracy.

Verified plot: validation loss versus training compute for models of different parameter counts, with a dashed power-law fit $L = 2.57 \cdot C^{-0.048}$.
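To get a feel for how shallow this fit is, the sketch below evaluates it at two compute budgets a factor of ten apart. The unit of $C$ is not given in this summary; the ratio computed here does not depend on it.

```python
def fitted_loss(compute):
    """Power-law fit from the plot: L = 2.57 * C^(-0.048)."""
    return 2.57 * compute ** -0.048


# Ratio of fitted losses when compute grows tenfold (unit-free).
ratio = fitted_loss(10.0) / fitted_loss(1.0)
print(f"10x compute multiplies the fitted loss by {ratio:.4f}")
```

Each tenfold increase in compute lowers the fitted loss by roughly ten percent, which is why the curve looks nearly straight on log-log axes over many orders of magnitude.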

Ablations

Model size: zero-, one-, and few-shot performance rise smoothly with scale.
In-context examples $K$: larger contexts improve most tasks, especially for larger models.
Evaluation regime: few-shot gains widen faster than zero-shot gains as capacity grows.
Contamination checks: most clean subsets move little, but a few benchmarks remain sensitive.

Method Strengths and Weaknesses

Strengths

Removes gradient-based task adaptation across many NLP tasks.
Few-shot gains grow faster than zero-shot gains with scale.
Closed-book TriviaQA reaches 71.2, competitive with fine-tuned systems.
Reports contamination analysis instead of only headline scores.

Weaknesses

Training cost is extreme: 175B parameters and 300B tokens.
Still weak on RACE, QuAC, DROP, and ANLI.
Performance depends strongly on prompt format and context budget.
Web-scale data creates contamination, bias, and misuse risks.

Suggestions from the authors

Evaluate very large language models under standard fine-tuning.
Improve performance on adversarial NLI and hard reading comprehension.
Build stronger methods to detect and reduce benchmark contamination.
Study bias, misuse, fairness, and energy costs at larger scale.
