Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann

2020 · NeurIPS

Problem

Framing

Fine-tuned NLP systems still require a labeled dataset and gradient updates for every new task, while prompt-only transfer has trailed badly on benchmarks such as Natural Questions and CoQA. The paper argues that scaling an autoregressive transformer is enough to induce strong in-context learning: with no gradient updates, the 175B model reaches 71.2% accuracy on closed-book TriviaQA and 85.0 F1 on CoQA in the few-shot setting.

Currently Used Methods

Foundational

Proposed Method

Architecture

GPT-3 keeps the GPT-2 decoder-only transformer and scales it to 175B parameters. The largest model uses 96 layers, $d_{\mathrm{model}}=12288$, 96 heads, $d_{\mathrm{head}}=128$, and a 2048-token context window. It alternates dense and locally banded sparse attention.
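As a sanity check on the figures above, the following sketch estimates the parameter count from the stated hyperparameters. The `12 * n_layers * d_model**2` rule of thumb and the vocabulary size of 50257 (the GPT-2 BPE vocabulary) are assumptions brought in for illustration; neither appears in this summary, and biases and layer-norm weights are ignored.

```python
# Rough parameter count for a decoder-only transformer.
# ASSUMPTIONS (not from the summary): the 12 * n_layers * d_model^2 rule of
# thumb and a GPT-2-style vocabulary of 50257 tokens.

def approx_param_count(n_layers, d_model, vocab_size, n_ctx):
    # Per layer: 4*d^2 for attention (Q, K, V, output projections)
    # plus 8*d^2 for a feed-forward block with a 4x hidden width.
    block_params = 12 * n_layers * d_model ** 2
    # Token embeddings plus learned position embeddings.
    embedding_params = (vocab_size + n_ctx) * d_model
    return block_params + embedding_params

# Hyperparameters of the largest model, as listed in the text.
n_layers, d_model, n_heads, d_head, n_ctx = 96, 12288, 96, 128, 2048
vocab_size = 50257  # assumed, see lead-in

# The head dimensions should tile the model width exactly.
heads_match = n_heads * d_head == d_model
total = approx_param_count(n_layers, d_model, vocab_size, n_ctx)
print(heads_match, f"{total / 1e9:.1f}B parameters")
```

Under these assumptions the estimate lands close to the quoted 175B, which suggests the listed layer count and width are mutually consistent.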

Verified figure: language-model meta-learning diagram with an outer pre-training loop and inner in-context learning across arithmetic, word-cleanup, and translation example sequences.

Loss / Objective

Training uses standard next-token maximum likelihood.

$$\mathcal{L}(\theta) = - \sum_{t=1}^{T} \log p_{\theta}(x_t \mid x_{<t})$$
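The objective above can be made concrete with a small pure-Python sketch that sums the negative log-probability of each target token given per-position logits. The function names and the toy logits are hypothetical; a real model would produce the logits from the preceding tokens.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_nll(logits_per_step, targets):
    # Sum over positions t of -log p(x_t | x_<t).
    # logits_per_step[t] is assumed to already be conditioned on x_<t.
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total

# Toy case: 3 positions, vocabulary of 4, uniform predictions.
uniform_logits = [[0.0] * 4 for _ in range(3)]
loss = next_token_nll(uniform_logits, [1, 0, 3])
print(loss)  # equals 3 * log(4) for uniform predictions
```

A uniform predictor gives $T \log V$ for sequence length $T$ and vocabulary size $V$, which is a useful upper reference point when reading loss curves.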

Algorithm

Task adaptation happens only through conditioning on instructions and demonstrations in the context.

$$p_{\theta}(\mathbf{y} \mid \mathbf{x}, \mathcal{D}_{\mathrm{ctx}}) = \prod_{t=1}^{|\mathbf{y}|} p_{\theta}\!\left(y_t \mid \mathcal{D}_{\mathrm{ctx}}, \mathbf{x}, y_{<t}\right)$$
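A minimal sketch of this conditioning, assuming a simple `Q:`/`A:` prompt layout and a stand-in scoring function: `build_prompt` assembles the instruction, demonstrations, and query into one context, and `sequence_logprob` applies the factorization above token by token. The prompt format, the function names, and `toy_cond_logprob` are all hypothetical and not taken from the paper.

```python
import math

def build_prompt(instruction, demonstrations, query):
    # Concatenate instruction, K worked examples, and the unanswered query.
    # The "Q:/A:" layout is an illustrative assumption.
    parts = [instruction]
    for x, y in demonstrations:
        parts.append(f"Q: {x}\nA: {y}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

def sequence_logprob(cond_logprob, context_tokens, answer_tokens):
    # log p(y | x, D_ctx) = sum_t log p(y_t | D_ctx, x, y_<t).
    # No parameters are updated; only the conditioning prefix grows.
    total = 0.0
    prefix = list(context_tokens)
    for tok in answer_tokens:
        total += cond_logprob(prefix, tok)
        prefix.append(tok)
    return total

def toy_cond_logprob(prefix, token):
    # Hypothetical stand-in for a language model: uniform over 10 symbols.
    return math.log(1 / 10)

prompt = build_prompt(
    "Add the numbers.",
    [("1 + 1", "2"), ("2 + 3", "5")],
    "4 + 4",
)
# Characters serve as tokens here purely for illustration.
score = sequence_logprob(toy_cond_logprob, list(prompt), list(" 8"))
print(prompt)
print(score)
```

Candidate answers can be ranked by comparing their `sequence_logprob` values under the same context, which is one common way to use such a score for multiple-choice tasks.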

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Verified plot: validation loss versus training compute for models of different parameter counts, with a dashed power-law fit $L = 2.57 \cdot C^{-0.048}$.
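To get a feel for how shallow this fit is, the sketch below evaluates the power law and the loss ratio for a tenfold increase in compute. The unit of `compute` is whatever the plot's x-axis uses, which this summary does not state; the function name is hypothetical.

```python
def fitted_loss(compute, coeff=2.57, exponent=-0.048):
    # Power-law fit quoted in the plot description: L = 2.57 * C^(-0.048).
    return coeff * compute ** exponent

# Ratio of fitted losses when compute grows by 10x. It does not depend on
# the starting point, because the coefficient cancels in a pure power law.
ratio_10x = fitted_loss(10.0) / fitted_loss(1.0)
print(f"{ratio_10x:.4f}")
```

The ratio is about 0.895, so under this fit each tenfold increase in compute lowers validation loss by roughly 10%.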

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers