A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play

David Silver, Thomas Hubert

2018 · Science

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play

Problem

Framing

Chess and shogi engines still depended on handcrafted evaluation and alpha-beta heuristics, while AlphaGo Zero remained tied to Go-friendly symmetries and binary outcomes. AlphaZero closes this with one tabula-rasa self-play algorithm that uses only game rules, a shared policy-value network, and MCTS. It reaches superhuman chess and shogi within 24 hours.

Currently Used Methods

Foundational

@silverAlphaGo2016 — policy-value self-play with neural-guided tree search for Go.
- Limitation in context: exploits Go symmetries and binary outcomes, not chess or shogi.
NeuroChess — neural chess evaluation from 175 handcrafted features with TD learning.
- Limitation in context: depends on engineered inputs and weak depth-2 search.
KnightCap — TD(leaf) neural evaluation inside alpha-beta search.
- Limitation in context: still relies on domain-specific search and hand-initialised weights.
Giraffe — self-play neural chess evaluation reaching master-level strength.
- Limitation in context: remains coupled to alpha-beta search and crafted representations.
DeepChess — supervised pairwise position ranking from expert human games.
- Limitation in context: learns from human data, not tabula-rasa self-play.

Proposed Method

Architecture

AlphaZero learns one network $f_{\theta}(s)=(\mathbf{p},v)$ over board-plane inputs and legal-move outputs. The state tensor is $N \times N \times (MT+L)$ with $T=8$ history steps; chess uses 119 input planes and an $8 \times 8 \times 73$ policy head, shogi uses 362 planes and a $9 \times 9 \times 139$ policy head.

Verified caption: three training curves showing AlphaZero Elo versus training steps in chess, shogi, and Go, compared against Stockfish, Elmo, and earlier AlphaGo systems.

Loss / Objective

The network fits search-improved policy targets and final game outcomes.

(\mathbf{p}, v) = f_{\theta}(s), \qquad l = (z - v)^2 - \pi^{\top} \log \mathbf{p} + c\|\theta\|^2

Sampling Rule

Training samples moves from root visit counts; evaluation plays greedily from the same counts.

\pi(a \mid s) \propto N(s,a)

Training Procedure

Training steps: $700{,}000$
Mini-batch size: $4{,}096$
MCTS simulations per move: $800$
Learning rate: $0.2 \rightarrow 0.02 \rightarrow 0.002 \rightarrow 0.0002$
Dirichlet noise $\alpha$ : chess $0.3$ , shogi $0.15$ , Go $0.03$
Self-play hardware: 5,000 first-generation TPUs
Training hardware: 64 second-generation TPUs
Separate network instance per game

Evaluation

Datasets

Chess: 100-game match versus Stockfish 8
Shogi: 100-game match versus Elmo WCSC27
Go: 100-game match versus AlphaGo Zero 3-day
Time control: 1 minute per move

Metrics

Win / draw / loss over 100 games
Elo during training
Relative Elo versus thinking time
Positions searched per second

Headline results

Chess: 28 wins, 72 draws, 0 losses versus Stockfish.
Shogi: 90 wins, 2 draws, 8 losses versus Elmo.
Go: 52 wins, 48 losses versus AlphaGo Zero 3-day.
Chess training: surpasses Stockfish after $300\mathrm{k}$ steps, about 4 hours.
Shogi training: surpasses Elmo after $110\mathrm{k}$ steps, under 2 hours.

Table 1: Chess match results from the tournament evaluation table

Game	White	Black	Win	Draw	Loss
Chess	AlphaZero	Stockfish	25	25	0
Chess	Stockfish	AlphaZero	3	47	0

Verified caption: two line plots of relative Elo versus seconds per move, showing AlphaZero scaling better with thinking time than Stockfish in chess and Elmo in shogi.

Ablations

Thinking time: AlphaZero gains Elo faster than Stockfish and Elmo as search time increases.
Search efficiency: $80\mathrm{k}$ chess positions per second beats Stockfish at $70{,}000\mathrm{k}$ .
Search efficiency: $40\mathrm{k}$ shogi positions per second beats Elmo at $35{,}000\mathrm{k}$ .
Opening analysis: self-play rediscoveries cover the major human chess openings.

Method Strengths and Weaknesses

Strengths

One algorithm works across chess, shogi, and Go with nearly shared hyperparameters.
Beats Stockfish without a single loss in 100 tournament games.
Beats Elmo decisively despite searching three orders fewer positions.
Removes opening books, endgame tablebases, and handcrafted evaluation.

Weaknesses

Training cost is extreme: 5,000 self-play TPUs and 64 training TPUs.
Requires a separate training run for each game.
Tournament evaluation uses only 100 games per opponent.
Network-search internals remain less interpretable than alpha-beta heuristics.

Suggestions from the authors

Add classical search-engine techniques to improve the pure self-play system.
Test alternative board-state and action representations.
Extend the approach beyond board games to other domains.
Study which alpha-beta ideas still help inside the generic framework.

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play

A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play

Problem

Framing

Currently Used Methods

Foundational

Proposed Method

Architecture

Loss / Objective

Sampling Rule

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers