Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish

2020 · OpenAI

Problem

Framing

Language-model scaling lacked a quantitative law linking loss to parameters, data, and compute. The paper shows that cross-entropy loss follows stable power laws in model size $N$, dataset size $D$, and training compute $C$, then uses these laws to derive compute-optimal training, which favors larger models and early stopping.

Currently Used Methods

Foundational

Proposed Method

Architecture

The paper introduces no new block. It studies decoder-only Transformers on WebText2 with context length $n_{\mathrm{ctx}}=1024$ and measures scale by the number of non-embedding parameters $N$.

Results figure: three plots showing test loss versus compute, dataset size, and parameter count, each following a near-linear power law on log scales.
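A near-linear trend on log-log axes is what a power law of the form $L = (x_c/x)^{\alpha}$ looks like, so the exponent can be read off as the negative slope of a straight-line fit in log space. The sketch below illustrates that idea with an ordinary least-squares fit. The function name `fit_power_law` and the data points are hypothetical: the losses are generated synthetically from a known power law, not taken from the paper's measurements.

```python
import math


def fit_power_law(xs, losses):
    """Fit loss = (x_c / x) ** alpha by least squares in log-log space.

    Taking logs gives log(loss) = alpha * log(x_c) - alpha * log(x),
    which is a straight line with slope -alpha.
    """
    log_x = [math.log(x) for x in xs]
    log_l = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_l = sum(log_l) / n
    cov = sum((a - mean_x) * (b - mean_l) for a, b in zip(log_x, log_l))
    var = sum((a - mean_x) ** 2 for a in log_x)
    slope = cov / var
    intercept = mean_l - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)
    return alpha, x_c


# Synthetic illustration only: losses generated from a known power law.
sizes = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [(6.4e13 / n) ** 0.076 for n in sizes]
alpha, x_c = fit_power_law(sizes, synthetic_losses)
print(f"alpha = {alpha:.4f}, x_c = {x_c:.3e}")
```

On real measurements the points would scatter around the line, and the fit would only be meaningful over the range where the trend stays linear.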

Loss / Objective

The core model is a fitted scaling law for autoregressive cross-entropy.

$$L(N,D)=\left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D}+\frac{D_c}{D}\right]^{\alpha_D}$$

$$L(N,S)=\left(\frac{N_c}{N}\right)^{\alpha_N}+\left(\frac{S_c}{S}\right)^{\alpha_S}$$
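To make the first fitted form concrete, here is a minimal sketch that evaluates $L(N,D)$ numerically. The defaults for $\alpha_N$, $\alpha_D$, and $N_c$ are the values reported in Table 1. This summary does not list $D_c$, so it is a required argument, and the value used in the usage lines is a hypothetical placeholder. The function name `loss_nd` is likewise an assumption made for illustration.

```python
def loss_nd(n_params, n_tokens, d_c, n_c=6.4e13, alpha_n=0.076, alpha_d=0.103):
    """Evaluate L(N, D) = [(N_c / N) ** (alpha_N / alpha_D) + D_c / D] ** alpha_D.

    n_params: non-embedding parameter count N
    n_tokens: dataset size D
    d_c:      data-scale constant D_c (not given in this summary; caller supplies it)
    """
    model_term = (n_c / n_params) ** (alpha_n / alpha_d)
    data_term = d_c / n_tokens
    return (model_term + data_term) ** alpha_d


# Hypothetical placeholder for D_c, used only to show the shape of the curve.
EXAMPLE_D_C = 1.0e13

for n in (1e8, 1e9, 1e10):
    print(f"N = {n:.0e}: L = {loss_nd(n, 1e10, EXAMPLE_D_C):.3f}")
```

With $D$ held fixed, the printed loss falls as $N$ grows but flattens once the data term dominates. As $D \to \infty$ the expression reduces to $(N_c/N)^{\alpha_N}$, the pure model-size power law.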

Sampling Rule / Algorithm

Training compute is approximated from parameters, batch size, and optimization steps.

$$C \approx 6NBS$$
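As a worked example of this estimate, the sketch below multiplies out $6NBS$ and converts the result to petaflop-days. It assumes $B$ is the batch size measured in tokens and $S$ the number of optimizer steps, and uses 1 PF-day $= 10^{15} \times 86400$ floating-point operations. The function names and the example run configuration are hypothetical.

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PF_DAY = 1e15 * SECONDS_PER_DAY  # one petaflop/s sustained for a day


def training_flops(n_params, batch_tokens, steps):
    """Approximate training compute as C = 6 * N * B * S.

    n_params:     non-embedding parameter count N
    batch_tokens: batch size B in tokens (assumed unit)
    steps:        number of optimization steps S
    """
    return 6 * n_params * batch_tokens * steps


def to_pf_days(flops):
    """Convert a raw FLOP count to petaflop-days."""
    return flops / FLOPS_PER_PF_DAY


# Hypothetical run: 1e9 parameters, 5e5 tokens per batch, 1e4 steps.
c = training_flops(1e9, 5e5, 1e4)
print(f"C = {c:.2e} FLOPs = {to_pf_days(c):.3f} PF-days")
```

The estimate counts only operations that scale with the non-embedding parameters, so it is a rough approximation rather than an exact operation count.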

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: Fit to $L(N,D)$

| Parameter | $\alpha_N$ | $\alpha_D$ | $N_c$ |
| --- | --- | --- | --- |
| Value | 0.076 | 0.103 | $6.4 \times 10^{13}$ |

Results figure: two plots comparing LSTMs and Transformers, showing Transformers improve more steeply with parameter count and keep benefiting across longer context positions.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers