Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud

2022 · DeepMind

Problem

Framing

Kaplan-style scaling laws over-allocate compute to parameters and under-allocate it to training data. The paper refits the compute-optimal frontier and shows that parameters and training tokens should scale in nearly equal proportion as compute grows. At Gopher's compute budget, this shifts the target from a 280B-parameter model to one of about 70B parameters trained on 1.4T tokens.
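As a rough sanity check on these figures, the sketch below applies the training-cost approximation C ≈ 6ND (the constraint used in the paper's allocation problem) to the 70B / 1.4T configuration. The variable names are illustrative, and the result is a back-of-envelope estimate rather than a reported FLOP count.

```python
# Back-of-envelope check using the C ~ 6 * N * D training-cost approximation.
params = 70e9    # about 70B parameters (from the text)
tokens = 1.4e12  # about 1.4T training tokens (from the text)

train_flops = 6 * params * tokens   # approximate training FLOPs
tokens_per_param = tokens / params  # data-to-model ratio

print(f"approx. training FLOPs: {train_flops:.2e}")
print(f"tokens per parameter:   {tokens_per_param:.0f}")
```

Under this approximation the configuration costs on the order of 6e23 FLOPs and uses about 20 training tokens per parameter.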

Currently Used Methods

Foundational

Proposed Method

Architecture

Chinchilla keeps the Gopher decoder-only transformer and changes scale, not design. The main model uses 80 layers, $d_{\mathrm{model}}=8192$, 64 heads, key/value size 128, and feed-forward width $4\times d_{\mathrm{model}}$.
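To relate these hyperparameters to the headline model size, here is a minimal sketch of the common dense-transformer estimate that counts only the attention projections and the feed-forward block in each layer. The function name `approx_transformer_params` and the counting rule are assumptions made for illustration; the paper's own accounting may include other terms.

```python
def approx_transformer_params(n_layers, d_model, n_heads, kv_size, ffw_mult=4):
    """Rough weight count for a decoder-only transformer.

    Counts only the attention projections (Q, K, V, output) and the two
    feed-forward matrices per layer. Embeddings, biases, layer norms and
    any positional-encoding parameters are deliberately ignored.
    """
    attn = 4 * d_model * n_heads * kv_size     # Q, K, V and output projections
    ffw = 2 * d_model * (ffw_mult * d_model)   # up- and down-projection
    return n_layers * (attn + ffw)


# Hyperparameters quoted in the text for the main model.
estimate = approx_transformer_params(n_layers=80, d_model=8192, n_heads=64, kv_size=128)
print(f"estimated weights: {estimate / 1e9:.1f}B")
```

This estimate gives roughly 64B weights, somewhat below the reported 70B. The remainder plausibly comes from terms the sketch omits, such as embeddings and positional-encoding parameters.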

Figure: Training-loss envelope, with projected compute-optimal parameter count and token count versus FLOPs.

Loss / Objective

The analysis fits final pretraining loss as an irreducible term $E$ plus separate penalties for finite model size $N$ (parameters) and finite data $D$ (training tokens).

$$\hat{L}(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}$$
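The parametric form above can be evaluated directly. The sketch below is illustrative only: the function name `parametric_loss` is hypothetical, and the default constants are the values the paper reports for its parametric fit as best recalled here (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28), so they should be checked against the paper before being relied on.

```python
def parametric_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Evaluate L(N, D) = E + A / N**alpha + B / D**beta.

    Defaults are assumed values for the paper's parametric fit; substitute
    your own fitted constants where available.
    """
    model_term = A / n_params ** alpha   # penalty for finite model size
    data_term = B / n_tokens ** beta     # penalty for finite data
    return E + model_term + data_term


# Example: the two terms shrink as N and D grow, leaving the floor E.
small = parametric_loss(1e9, 20e9)
large = parametric_loss(70e9, 1.4e12)
print(f"loss at 1B / 20B tokens:    {small:.3f}")
print(f"loss at 70B / 1.4T tokens:  {large:.3f}")
```

Because the two penalty terms are additive, growing only the model or only the data leaves the other term as a floor on the achievable loss, which is the intuition behind scaling both together.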

Sampling Rule / Algorithm

Compute-optimal allocation minimizes the fitted loss under the transformer training-cost constraint, where $C$ is the training FLOP budget, $N$ the parameter count, and $D$ the number of training tokens.

$$\min_{N,D}\ \hat{L}(N,D)\quad\text{s.t.}\quad C\approx 6ND$$
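For the additive power-law loss, substituting D = C / (6N) and setting the derivative with respect to N to zero gives a closed form: N_opt = G (C/6)^a and D_opt = (C/6)^b / G, with G = (αA / (βB))^(1/(α+β)), a = β/(α+β) and b = α/(α+β). The sketch below implements that formula. The function name is hypothetical and the default constants are assumed values for the paper's parametric fit, not something stated in this summary.

```python
def compute_optimal_allocation(flops, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Closed-form minimizer of A / N**alpha + B / D**beta subject to 6*N*D = flops.

    The irreducible term E does not affect the argmin, so it is omitted.
    Default constants are assumed fit values; replace them with your own.
    """
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    a = beta / (alpha + beta)    # exponent for parameters
    b = alpha / (alpha + beta)   # exponent for tokens
    budget = flops / 6.0         # the product N * D allowed by the budget
    n_opt = G * budget ** a
    d_opt = budget ** b / G
    return n_opt, d_opt


# Example: allocate a budget of about 5.88e23 FLOPs (6 * 70e9 * 1.4e12).
n_opt, d_opt = compute_optimal_allocation(5.88e23)
print(f"N_opt ~ {n_opt:.3e} parameters, D_opt ~ {d_opt:.3e} tokens")
print(f"constraint check: 6*N*D = {6 * n_opt * d_opt:.3e}")
```

With these assumed constants the exponents come out near 0.45 for parameters and 0.55 for tokens, consistent with the claim that the two should scale nearly equally. The paper draws on more than one fitting approach, so this single fit need not reproduce the 70B / 1.4T configuration exactly.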

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers