A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play

David Silver, Thomas Hubert

2018 · Science


Problem

Framing

Chess and shogi engines still depended on handcrafted evaluation functions and alpha-beta search heuristics, while AlphaGo Zero remained tied to Go-specific board symmetries and binary win/loss outcomes. AlphaZero closes this gap with a single tabula-rasa self-play algorithm that uses only the game rules, a shared policy-value network, and MCTS. It reaches superhuman play in chess and shogi within 24 hours of training.

Currently Used Methods

Foundational

Proposed Method

Architecture

AlphaZero learns one network $f_{\theta}(s)=(\mathbf{p},v)$ over board-plane inputs and legal-move outputs. The state tensor is $N \times N \times (MT+L)$ with $T=8$ history steps; chess uses 119 input planes and an $8 \times 8 \times 73$ policy head, shogi uses 362 planes and a $9 \times 9 \times 139$ policy head.
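The shape arithmetic above can be checked with a small sketch. The class `GameSpec` and its field names are hypothetical, and the per-step plane count M and constant plane count L (14 and 7 for chess, 45 and 2 for shogi) are assumed values not stated in this summary; they are chosen so that MT+L reproduces the 119 and 362 input planes quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameSpec:
    """Hypothetical container for the tensor dimensions of one game."""
    board: int        # N, board side length
    m: int            # M, feature planes per history step (assumed value)
    t: int            # T, number of history steps
    l: int            # L, constant-valued planes (assumed value)
    move_planes: int  # depth of the policy head

    def input_shape(self):
        # State tensor is N x N x (M*T + L).
        return (self.board, self.board, self.m * self.t + self.l)

    def policy_size(self):
        # Number of policy logits before any legal-move masking.
        return self.board * self.board * self.move_planes


CHESS = GameSpec(board=8, m=14, t=8, l=7, move_planes=73)
SHOGI = GameSpec(board=9, m=45, t=8, l=2, move_planes=139)

if __name__ == "__main__":
    for name, spec in (("chess", CHESS), ("shogi", SHOGI)):
        print(name, spec.input_shape(), spec.policy_size())
```

Under these assumptions the policy heads flatten to 4672 logits for chess and 11259 for shogi, most of which correspond to illegal moves in any given position.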

Verified caption: three training curves showing AlphaZero Elo versus training steps in chess, shogi, and Go, compared against Stockfish, Elmo, and earlier AlphaGo systems.

Loss / Objective

The network fits search-improved policy targets and final game outcomes.

$$(\mathbf{p}, v) = f_{\theta}(s), \qquad l = (z - v)^2 - \pi^{\top} \log \mathbf{p} + c\|\theta\|^2$$
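As an illustration, the objective can be evaluated term by term for a single position. This is a minimal pure-Python sketch, not the authors' training code: the function name `alphazero_loss`, the flat list `theta` standing in for the network weights, the default `c=1e-4`, and the small clamp inside the logarithm are all assumptions made for this example.

```python
import math


def alphazero_loss(p, v, pi, z, theta, c=1e-4):
    """Loss for one position: (z - v)^2 - pi^T log p + c * ||theta||^2.

    p     : predicted move probabilities (sums to 1)
    v     : predicted value, a scalar
    pi    : search policy target (sums to 1)
    z     : final game outcome from the current player's view
    theta : flat list of weights (stand-in for the network parameters)
    c     : L2 regularisation coefficient (assumed default)
    """
    value_loss = (z - v) ** 2
    # Cross-entropy between search policy and network policy.
    # Terms with pi_a == 0 contribute nothing; clamp avoids log(0).
    policy_loss = -sum(
        t * math.log(max(q, 1e-12)) for t, q in zip(pi, p) if t > 0
    )
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


if __name__ == "__main__":
    loss = alphazero_loss(
        p=[0.5, 0.5], v=0.0, pi=[1.0, 0.0], z=1.0, theta=[1.0, -2.0]
    )
    print(loss)
```

The value term pulls $v$ toward the game outcome $z$, the cross-entropy term pulls $\mathbf{p}$ toward the search policy $\pi$, and the last term is ordinary weight decay.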

Sampling Rule

Training samples moves from root visit counts; evaluation plays greedily from the same counts.

$$\pi(a \mid s) \propto N(s,a)$$
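The two selection modes described above might be sketched as follows. The helper names `visit_policy` and `select_move`, the dictionary of visit counts, and the example move strings are hypothetical; the sketch takes the root counts as given and does not implement the search itself.

```python
import random


def visit_policy(counts):
    """Normalise root visit counts N(s, a) into pi(a | s)."""
    total = sum(counts.values())
    return {move: n / total for move, n in counts.items()}


def select_move(counts, training, rng=random):
    """Pick a move from root visit counts.

    training=True  : sample in proportion to visit counts
    training=False : play the most-visited move (greedy)
    """
    if training:
        moves = list(counts)
        weights = [counts[m] for m in moves]
        return rng.choices(moves, weights=weights, k=1)[0]
    return max(counts, key=counts.get)


if __name__ == "__main__":
    root_counts = {"e2e4": 60, "d2d4": 30, "g1f3": 10}  # made-up counts
    print(visit_policy(root_counts))
    print(select_move(root_counts, training=False))
    print(select_move(root_counts, training=True, rng=random.Random(0)))
```

Sampling during training keeps self-play games diverse, while greedy selection at evaluation time plays the move the search favoured most.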

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: Chess match results from the tournament evaluation. Win, draw, and loss counts are from AlphaZero's perspective.

| Game | White | Black | Win | Draw | Loss |
|---|---|---|---|---|---|
| Chess | AlphaZero | Stockfish | 25 | 25 | 0 |
| Chess | Stockfish | AlphaZero | 3 | 47 | 0 |
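To put the match counts on a single scale, the rows of Table 1 can be converted into a score fraction and an approximate Elo gap. This is a back-of-the-envelope sketch, not a figure reported in the source: it assumes the win, draw, and loss columns are counted from AlphaZero's perspective and uses the standard logistic Elo model. The function names are hypothetical.

```python
import math


def match_score(wins, draws, losses):
    """Fraction of available points scored (win = 1, draw = 0.5)."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def elo_gap(score):
    """Rating difference implied by an expected score under the
    logistic Elo model: score = 1 / (1 + 10 ** (-gap / 400))."""
    return 400.0 * math.log10(score / (1.0 - score))


# Rows of Table 1 as (win, draw, loss), assumed to be from
# AlphaZero's perspective: as White, then as Black.
ROWS = [(25, 25, 0), (3, 47, 0)]

TOTAL_WINS = sum(r[0] for r in ROWS)
TOTAL_DRAWS = sum(r[1] for r in ROWS)
TOTAL_LOSSES = sum(r[2] for r in ROWS)
SCORE = match_score(TOTAL_WINS, TOTAL_DRAWS, TOTAL_LOSSES)

if __name__ == "__main__":
    print(f"score fraction: {SCORE:.2f}")
    print(f"implied Elo gap: {elo_gap(SCORE):.0f}")
```

With 28 wins and 72 draws over 100 games the score fraction is 0.64, which under this model corresponds to a gap of roughly 100 Elo points. The estimate is coarse at this sample size and says nothing about how the gap changes with time controls.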

Verified caption: two line plots of relative Elo versus seconds per move, showing AlphaZero scaling better with thinking time than Stockfish in chess and Elmo in shogi.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers