Human-Level Control through Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu

2015 · Nature

Problem

Framing

Deep RL had not shown stable, end-to-end control from raw pixels across many Atari games. DQN closes this gap with a convolutional $Q$-network plus experience replay and a lagged target network, reaching more than $75\%$ of the human score on 29 of 49 games.

Currently Used Methods

Foundational

@krizhevskyAlexNet2012: deep convolutional networks for visual representation learning.
- Limitation in context: no control objective or temporal-difference bootstrapping.
Playing Atari with Deep Reinforcement Learning: earlier workshop version of DQN, trained on Atari from pixels.
- Limitation in context: evaluated on seven games, versus 49 in the Nature version.
Reinforcement Learning with Linear Function Approximation: hand-crafted Atari features with linear value prediction.
- Limitation in context: weak visual abstraction from raw pixels.
Contingency Awareness in Reinforcement Learning: SARSA-style Atari agent with game-specific structure.
- Limitation in context: needs stronger prior knowledge than DQN.
Reinforcement Learning for Robots using Neural Networks: experience replay for stabilizing neural RL.
- Limitation in context: no large-scale deep visual control demonstration.

Proposed Method

Architecture

The agent maps a preprocessed $84 \times 84 \times 4$ state stack to one $Q(s,a)$ output per legal action. It uses three convolution layers followed by two fully connected layers. A rectifier nonlinearity follows each hidden layer, and the final fully connected layer is a linear output.

Architecture diagram: four stacked Atari frames pass through two illustrated convolution stages and two fully connected layers to action-value outputs mapped to joystick actions.
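As a rough check on the layer sizes, the sketch below traces activation shapes from the $84 \times 84 \times 4$ input through three unpadded convolutions. The filter counts, kernel sizes, strides, and the 512-unit hidden layer are not stated in this summary; they are assumed here from the values reported in the Nature paper. The helper names and the default of 18 actions are placeholders for illustration.

```python
# Shape arithmetic for a DQN-style convolutional stack (illustrative sketch).
# Layer sizes are assumptions not stated in this summary; they follow the
# values reported in the Nature paper.

CONV_LAYERS = [
    # (out_channels, kernel_size, stride)
    (32, 8, 4),
    (64, 4, 2),
    (64, 3, 1),
]
HIDDEN_UNITS = 512


def conv_out(size, kernel, stride):
    """Output width/height of an unpadded ("valid") convolution."""
    return (size - kernel) // stride + 1


def trace_shapes(height=84, width=84, channels=4, n_actions=18):
    """Return the activation shape after each layer, input first."""
    shapes = [(height, width, channels)]
    for out_channels, kernel, stride in CONV_LAYERS:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
        shapes.append((height, width, out_channels))
    flat = height * width * shapes[-1][2]
    shapes.append((flat,))          # flattened convolutional features
    shapes.append((HIDDEN_UNITS,))  # fully connected hidden layer (rectifier)
    shapes.append((n_actions,))     # linear output: one Q-value per action
    return shapes


if __name__ == "__main__":
    for shape in trace_shapes():
        print(shape)
```

Under these assumptions the spatial size shrinks from 84 to 20, then 9, then 7, so the first fully connected layer sees $7 \times 7 \times 64 = 3136$ inputs.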

Loss / Objective

DQN minimizes the one-step temporal-difference error against a target network whose parameters $\theta_i^-$ are held fixed between periodic updates.

$$L_i(\theta_i) = \mathbb{E}_{(\phi, a, r, \phi') \sim U(D)} \left[ \left( y_i - Q(\phi, a; \theta_i) \right)^2 \right]$$

$$y_i = \begin{cases} r & \text{if } \phi' \text{ is terminal}, \\ r + \gamma \max_{a'} Q(\phi', a'; \theta_i^-) & \text{otherwise} \end{cases}$$
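The target and loss above can be sketched in plain Python for a single minibatch. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `next_q_values` is assumed to hold the target network's outputs $Q(\phi', \cdot\,; \theta^-)$ for each sampled transition, and `q_taken` is assumed to hold the online network's $Q(\phi, a; \theta)$ for the action actually taken.

```python
def td_targets(rewards, next_q_values, terminals, gamma=0.99):
    """One-step targets: r if terminal, else r + gamma * max_a' Q_target(s', a')."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, terminals):
        if done:
            targets.append(r)                       # no bootstrap past episode end
        else:
            targets.append(r + gamma * max(q_next))  # bootstrap from target network
    return targets


def squared_td_loss(q_taken, targets):
    """Mean squared TD error over the minibatch."""
    errors = [(y - q) ** 2 for q, y in zip(q_taken, targets)]
    return sum(errors) / len(errors)
```

In a real training loop the targets are treated as constants, so gradients flow only through `q_taken`.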

Sampling Rule / Algorithm

Action selection uses an $\epsilon$-greedy policy over the learned action values.

$$a_t = \begin{cases} \text{random action} & \text{with probability } \epsilon, \\ \arg\max_a Q(\phi(s_t), a; \theta) & \text{with probability } 1 - \epsilon \end{cases}$$
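A minimal sketch of this selection rule, assuming the action values for the current state are already available as a list. The function name and the `rng` parameter are illustrative; ties in the greedy branch resolve to the lowest action index, which is a choice made here rather than something the source specifies.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy branch: index of the largest action value (first index on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Passing a seeded `random.Random` instance as `rng` makes action selection reproducible.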

Training Procedure

Replay memory: $10^6$ transitions.
Discount: $\gamma = 0.99$.
Minibatch size: $32$.
Target network update: every $10{,}000$ parameter updates.
Replay start size: $50{,}000$ frames.
Frame skip: $4$ .
Exploration: $\epsilon$ annealed from $1.0$ to $0.1$ over $10^6$ frames.
Training horizon: $50$ million frames.
Optimizer: RMSProp.
Learning rate: $2.5 \times 10^{-4}$.
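The settings listed above can be collected into a single configuration object, together with the exploration schedule. This is a sketch: the class and field names are hypothetical, and the schedule is assumed to be linear in frames and held at the final value afterwards, which matches the paper's description but is not spelled out in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQNConfig:
    replay_capacity: int = 1_000_000      # transitions
    gamma: float = 0.99
    batch_size: int = 32
    target_update_every: int = 10_000     # parameter updates
    replay_start_frames: int = 50_000
    frame_skip: int = 4
    epsilon_start: float = 1.0
    epsilon_final: float = 0.1
    epsilon_anneal_frames: int = 1_000_000
    total_frames: int = 50_000_000
    learning_rate: float = 2.5e-4         # used with RMSProp


def epsilon_at(frame, cfg=DQNConfig()):
    """Linearly anneal epsilon over the first frames, then hold it constant."""
    fraction = min(max(frame, 0) / cfg.epsilon_anneal_frames, 1.0)
    return cfg.epsilon_start + fraction * (cfg.epsilon_final - cfg.epsilon_start)
```

Halfway through the annealing window, for example, `epsilon_at(500_000)` is approximately 0.55.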

Evaluation

Datasets

Atari 2600 Arcade Learning Environment.
49 games from raw pixels.
Evaluation protocol: no-op starts, with a random number of no-op actions (up to 30) at the start of each episode.

Metrics

Raw game score.
Human-normalized score, with random policy as $0\%$ and human as $100\%$.
Learning curves over training frames.
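The human-normalized score is a linear rescaling of the raw score between the random-play and human reference scores. A minimal sketch follows; the function name is illustrative.

```python
def human_normalized_score(agent, random_play, human):
    """Percent score with random play mapped to 0 and the human reference to 100."""
    if human == random_play:
        raise ValueError("human and random-play scores must differ")
    return 100.0 * (agent - random_play) / (human - random_play)
```

With made-up reference scores of 10 for random play and 110 for the human, a raw score of 85 normalizes to $75\%$, and a raw score of 210 normalizes to $200\%$.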

Headline results

Atari-49, no prior game features: beats prior RL methods on 43 games.
Atari-49, professional-human reference: exceeds $75\%$ of human score on 29 games.
Atari-49, aggregate comparison: competitive with or above human on more than half the suite.
Five validation games: convolutional DQN beats a linear approximator under matched training setup.

Ablations

Replay removal: training degrades and becomes less stable.
Separate target network removal: value learning destabilizes.
Linear model replacement: scores drop on validation games.
Representation analysis: final-layer features cluster semantically similar game states.
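The two stabilizing components ablated above, uniform experience replay and a periodically synchronized target network, can be sketched as follows. This is an illustration under simplifying assumptions: parameters are represented as plain dictionaries rather than network weights, and the class and function names are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity buffer with uniform sampling; the oldest transitions drop first."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def push(self, transition):
        self._buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Copying to a list keeps the sketch simple; a large buffer would
        # sample indices into an array instead.
        return rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)


def maybe_sync_target(update_count, online_params, target_params, every=10_000):
    """Copy online parameters into the target network every `every` updates."""
    if update_count % every == 0:
        target_params.clear()
        target_params.update(online_params)
        return True
    return False
```

Sampling uniformly from a large buffer breaks the correlation between consecutive frames, while the infrequent copy keeps the bootstrap target from shifting on every gradient step.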

Method Strengths and Weaknesses

Strengths

End-to-end control from pixels, rewards, and actions only.
Replay plus target network directly attack divergence in nonlinear $Q$-learning.
One architecture and hyperparameter set spans 49 games.
Human-level or better play appears on a large fraction of the suite.

Weaknesses

Sample inefficient: training uses $50$ million frames per game.
Discrete-action formulation does not cover continuous control.
One-step bootstrap target still inherits overestimation bias.
Partial observability is handled only by stacking four frames.

Suggestions from the authors

Extend value learning to harder planning and delayed-credit tasks.
Reduce data requirements for real-world control domains.
Learn stronger state representations under partial observability.
Transfer knowledge across games instead of training from scratch.
