Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish

2020 · OpenAI

Problem

Framing

Language-model scaling lacked a quantitative law linking loss to parameters, data, and compute. The paper shows that cross-entropy loss follows stable power laws in model size $N$, dataset size $D$, and training compute $C$, then uses these laws to derive compute-optimal training, which favors larger models stopped well before convergence.

Currently Used Methods

Foundational

@vaswaniAttentionAllNeed2017: Transformer self-attention backbone for autoregressive language modeling.
- Limitation in context: gives architecture, not predictive loss laws versus scale.
@radfordGPT2_2019: large decoder-only pretraining on WebText.
- Limitation in context: shows scale helps, without compute-allocation equations.
An Empirical Model of Large-Batch Training: gradient-noise-scale analysis for efficient batch sizing.
- Limitation in context: studies optimization efficiency, not joint scaling in $N$, $D$, and $C$.
Generating Wikipedia by Summarizing Long Sequences: decoder-only Transformer applied to long-sequence generation.
- Limitation in context: architecture choice matters less than total scale here.

Proposed Method

Architecture

The paper introduces no new block. It studies decoder-only Transformers on WebText2 with context length $n_{\mathrm{ctx}}=1024$ and measures scale by the number of non-embedding parameters $N$.

Results figure: three plots showing test loss versus compute, dataset size, and parameter count, each following a near-linear power law on log scales.
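
A power law $L = (x_c/x)^{\alpha}$ becomes a straight line with slope $-\alpha$ on log-log axes, which is why the trends in the figure look linear. The following is a minimal sketch of recovering the exponent by least squares in log space. It assumes synthetic, noise-free data generated here purely for illustration; the function name `fit_power_law` and the sample points are hypothetical and are not the paper's fitting code.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = (x_c / x) ** alpha by ordinary least squares in log-log space.

    Taking logs gives log y = -alpha * log x + alpha * log x_c,
    so the slope is -alpha and the intercept is alpha * log x_c.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)
    return alpha, x_c


# Synthetic (hypothetical) points lying exactly on a power law in model size.
sizes = [10.0 ** k for k in range(3, 10)]
losses = [(8.8e13 / n) ** 0.076 for n in sizes]

alpha, n_c = fit_power_law(sizes, losses)
print(f"alpha = {alpha:.4f}, x_c = {n_c:.3e}")
```

With real measurements the points scatter around the line, so the fitted exponent depends on which range of scales is included in the fit.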

Loss / Objective

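The core model is a pair of fitted scaling laws for autoregressive cross-entropy: one in model size $N$ and dataset size $D$, and one in model size $N$ and the number of optimization steps $S$.

$$L(N,D)=\left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D}+\frac{D_c}{D}\right]^{\alpha_D}$$

$$L(N,S)=\left(\frac{N_c}{N}\right)^{\alpha_N}+\left(\frac{S_c}{S}\right)^{\alpha_S}$$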
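
As a sanity check on the joint law, letting $D \to \infty$ removes the data term and leaves $L = (N_c/N)^{\alpha_N}$, the pure parameter-limited power law. A minimal sketch that evaluates $L(N,D)$ and verifies this limit numerically follows. The exponents and $N_c$ are the joint-fit values listed in this summary; the $D_c$ value is not listed here and is an assumption used only for illustration, and the function name `loss_nd` is hypothetical.

```python
import math


def loss_nd(n, d, n_c, d_c, alpha_n, alpha_d):
    """Joint law L(N, D) = [(N_c/N)^(alpha_N/alpha_D) + D_c/D]^alpha_D, in nats."""
    model_term = (n_c / n) ** (alpha_n / alpha_d)
    data_term = d_c / d
    return (model_term + data_term) ** alpha_d


# Joint-fit constants taken from the fit table in this summary.
ALPHA_N, ALPHA_D, N_C = 0.076, 0.103, 6.4e13
# ASSUMPTION: D_c is not given in this summary; this value is illustrative.
D_C = 1.8e13

# Largest model and largest dataset in the studied ranges.
finite_data = loss_nd(1.5e9, 2.3e10, N_C, D_C, ALPHA_N, ALPHA_D)
# With unlimited data the law reduces to (N_c / N) ** alpha_N.
infinite_data = loss_nd(1.5e9, math.inf, N_C, D_C, ALPHA_N, ALPHA_D)

print(f"L(N, D)      = {finite_data:.3f} nats")
print(f"L(N, D->inf) = {infinite_data:.3f} nats")
```

The gap between the two printed values is the overfitting penalty that the law attributes to a finite dataset at that model size.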

Sampling Rule / Algorithm

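Training compute is approximated from non-embedding parameters $N$, batch size $B$ in tokens, and optimization steps $S$. The factor of 6 is the approximate number of floating-point operations per parameter per token for one forward and backward pass.

$$C \approx 6NBS$$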
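
A short worked example of the estimate, converting the result to PF-days (one petaflop per second sustained for a day, $8.64 \times 10^{19}$ floating-point operations). This is a sketch under stated assumptions: it pairs the largest model size with the full batch size and step count from the training setup purely as an illustration, and the function names are hypothetical.

```python
PF_DAY_FLOPS = 1e15 * 24 * 3600  # operations in one PF-day


def training_compute_flops(n_params, batch_tokens, steps):
    """Estimate training compute as C ~ 6 * N * B * S (B measured in tokens)."""
    return 6 * n_params * batch_tokens * steps


def to_pf_days(flops):
    """Convert a raw operation count to PF-days."""
    return flops / PF_DAY_FLOPS


# Illustrative pairing: 1.5e9 non-embedding parameters,
# 512 sequences of 1024 tokens per batch, 2.5e5 steps.
batch_tokens = 512 * 1024
flops = training_compute_flops(1.5e9, batch_tokens, 2.5e5)
pf_days = to_pf_days(flops)

print(f"C = {flops:.3e} FLOPs = {pf_days:.2f} PF-days")
```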

Training Procedure

Dataset: WebText2
Tokenization: BPE, $n_{\mathrm{vocab}}=50257$
Context length: $1024$
Steps: $2.5 \times 10^5$
Batch size: $512$ sequences of length $1024$
Optimizer: Adam
Optimizer for $>1$ B models: Adafactor
Dropout: $10\%$
Model sizes: $768$ to $1.5 \times 10^9$ non-embedding parameters
Dataset sizes: $2.2 \times 10^7$ to $2.3 \times 10^{10}$ tokens

Evaluation

Datasets

WebText2 test
Internet Books
Common Crawl
Wikipedia

Metrics

Cross-entropy loss in nats
Test loss versus $N$, $D$, and compute
Transfer-loss offset across distributions
Critical batch size in tokens

Headline results

Parameter-limited: $\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$.
Dataset-limited: $\alpha_D \approx 0.095$, $D_c \approx 5.4 \times 10^{13}$ tokens.
Compute-optimal frontier: $\alpha_{C_{\min}} \approx 0.050$, $C_{c,\min} \approx 3.1 \times 10^8$ PF-days.
Overfitting boundary: penalty scales with $N^{0.74}/D$.
Largest converged runs: critical batch size reaches roughly $1$ to $2$ million tokens.
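
To make the exponents above concrete, the sketch below treats each limited regime as a single-variable power law of the form $(x_c/x)^{\alpha}$ and computes two consequences: how much loss falls when the limiting resource doubles, and how fast data must grow with model size to hold the $N^{0.74}/D$ penalty fixed. The helper names are hypothetical, and the calculation is an illustration of the stated exponents rather than a reproduction of the paper's fits.

```python
def power_law_loss(x, x_c, alpha):
    """Single-variable law L(x) = (x_c / x) ** alpha, in nats."""
    return (x_c / x) ** alpha


def loss_ratio_on_doubling(alpha):
    """Multiplicative change in loss when the limiting resource doubles."""
    return 2.0 ** (-alpha)


def data_growth_for_constant_penalty(model_growth, exponent=0.74):
    """Factor by which D must grow so that N ** exponent / D stays constant."""
    return model_growth ** exponent


# Headline constants from this summary.
ALPHA_N, N_C = 0.076, 8.8e13
ALPHA_D, D_C = 0.095, 5.4e13  # D_c in tokens
ALPHA_C, C_C = 0.050, 3.1e8   # C_c in PF-days

for name, alpha in [("N", ALPHA_N), ("D", ALPHA_D), ("C_min", ALPHA_C)]:
    print(f"doubling {name}: loss x {loss_ratio_on_doubling(alpha):.4f}")

print(f"8x model -> {data_growth_for_constant_penalty(8.0):.2f}x data")
print(f"L(N = 1.5e9) = {power_law_loss(1.5e9, N_C, ALPHA_N):.3f} nats")
```

Because the exponents are small, each doubling buys only a few percent of loss, which is why the trends have to be read across many orders of magnitude.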

Table 1: Fit to $L(N,D)$

Parameter	$\alpha_N$	$\alpha_D$	$N_c$
Value	0.076	0.103	$6.4 \times 10^{13}$

Results figure: two plots comparing LSTMs and Transformers, showing Transformers improve more steeply with parameter count and keep benefiting across longer context positions.

Ablations

Width versus depth: loss changes little at fixed total parameter count.
Early-curve extrapolation: later loss is predictable from the stable training regime.
Transfer evaluation: out-of-domain loss differs by an almost constant offset.
Batch-size sweep: critical batch size follows a power law in loss.
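
The batch-size result can be sketched as a power law in loss. The functional form $B_{\mathrm{crit}}(L) = B_* / L^{1/\alpha_B}$ and the default constants below are assumptions that do not appear in this summary; they are included only to illustrate how a falling loss pushes the critical batch size up, and the function name is hypothetical.

```python
def critical_batch_tokens(loss, b_star=2e8, alpha_b=0.21):
    """Assumed form B_crit(L) = B_* / L ** (1 / alpha_B), in tokens.

    ASSUMPTION: b_star and alpha_b are illustrative defaults that are
    not stated in this summary.
    """
    return b_star / loss ** (1.0 / alpha_b)


# Critical batch size grows as the loss (in nats) falls during training.
for loss in (4.0, 3.0, 2.5):
    print(f"L = {loss:.1f} nats -> B_crit ~ {critical_batch_tokens(loss):.3e} tokens")
```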

Method Strengths and Weaknesses

Strengths

Gives closed-form loss laws across parameters, data, and compute.
Some trends span more than seven orders of magnitude.
Converts descriptive scaling into compute-allocation rules.
Shows model shape has weak effect once scale is fixed.

Weaknesses

Focuses on cross-entropy, not downstream task accuracy.
Constants depend on dataset, tokenization, and vocabulary.
Largest models stop at $1.5$ B parameters.
Compute-optimal advice relies on extrapolated large-scale behavior.

Suggestions from the authors

Test whether the same exponents persist at larger model scales.
Measure scaling under different datasets and tokenizations.
Extend the analysis to other modalities and objectives.
Probe batch-size behavior farther beyond the measured regime.
