Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud

2022 · DeepMind

Training Compute-Optimal Large Language Models

Problem

Framing

Kaplan-style scaling laws over-allocate compute to parameters and under-allocate it to data. The paper refits the compute frontier and shows that optimal training scales parameters and tokens nearly equally. At Gopher compute, this shifts the target from 280B parameters to about 70B trained on 1.4T tokens.

Currently Used Methods

Foundational

@brownGPT3_2020 — 175B dense LM proves few-shot gains at scale.
- Limitation in context: trained on too few tokens for its compute budget.
@kaplanScalingLaws2020 — power-law loss scaling versus parameters, data, and compute.
- Limitation in context: compute rule favors oversized models and short training.
@radfordGPT2_2019 — decoder-only scaling shows broad zero-shot transfer.
- Limitation in context: offers no compute-optimal parameter-token prescription.
Gopher — 280B decoder-only baseline with broad downstream evaluation.
- Limitation in context: same FLOPs train a smaller model longer with lower loss.

Proposed Method

Architecture

Chinchilla keeps the Gopher decoder-only transformer and changes scale, not design. The main model uses 80 layers, $d_{\mathrm{model}}=8192$ , 64 heads, key/value size 128, and feed-forward width $4\times d_{\mathrm{model}}$ .

Training-loss envelope and projected compute-optimal parameter count and token count versus FLOPs.

Loss / Objective

The analysis fits final pretraining loss as separate penalties for finite model size and finite data.

\hat{L}(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}

Sampling Rule / Algorithm

Compute-optimal allocation minimizes the fitted loss under the transformer training-cost constraint.

\min_{N,D}\ \hat{L}(N,D)\quad\text{s.t.}\quad C\approx 6ND

Training Procedure

Dataset: MassiveText.
Optimizer: AdamW.
Model: 70B decoder-only transformer.
Layers: 80.
Heads: 64.
$d_{\mathrm{model}}$ : 8192.
Key/value size: 128.
Feed-forward width: $4\times d_{\mathrm{model}}$ .
Max learning rate: $1\times 10^{-4}$ .
Batch size: $1.5$ M $\rightarrow$ $3$ M tokens.
Training tokens: $1.4$ T.
LR schedule: cosine with $10\times$ decay.

Evaluation

Datasets

Training: MassiveText.
Language modeling: WikiText-103, The Pile subsets, PG-19, arXiv, FreeLaw.
Reading comprehension: RACE-m, RACE-h, LAMBADA.
Question answering: Natural Questions, TriviaQA, TruthfulQA.
Common sense: HellaSwag, Winogrande, PIQA, SIQA, BoolQ.
Language understanding: MMLU, BIG-bench.

Metrics

Pretraining loss.
Bits per byte.
Accuracy.
Exact match.
Group-stratified accuracy on Winogender.

Headline results

Gopher-compute regime: 70B on 1.4T tokens beats 280B Gopher.
MMLU (5-shot): $67.6\%$ .
BIG-bench, 62 tasks: $65.1\%$ .
LAMBADA (zero-shot): $77.4\%$ .
RACE-m (few-shot): $86.8\%$ .
Natural Questions (64-shot): $35.5\%$ .

Ablations

Model size vs tokens: IsoFLOP curves show a clear loss-minimizing valley.
Frontier estimation: three fitting methods recover near-equal parameter and token scaling.
Cosine cycle length: overshooting target steps hurts final loss.
Optimizer choice: AdamW finishes better than Adam.

Method Strengths and Weaknesses

Strengths

Recasts scaling as constrained optimization over parameters and tokens.
Uses 400+ runs instead of one extrapolated frontier fit.
Delivers better downstream accuracy than Gopher with one quarter the parameters.
Three estimation routes agree on near- $0.5/0.5$ scaling.

Weaknesses

Main prescription depends on the fit $E+A/N^{\alpha}+B/D^{\beta}$ .
Analysis models data quantity directly, not data quality.
Validation targets dense decoder-only transformers, not broader architectures.
Winogender group gaps remain despite higher overall accuracy.

Suggestions from the authors

Scale datasets further without degrading data quality.
Test richer loss models beyond simple power-law fits.
Extend compute-optimal analysis beyond dense transformers.
Improve large-scale data collection for long-training regimes.

Training Compute-Optimal Large Language Models

Training Compute-Optimal Large Language Models

Problem

Framing

Currently Used Methods

Foundational

Proposed Method

Architecture

Loss / Objective

Sampling Rule / Algorithm

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers