Deep Reinforcement Learning from Human Preferences

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei

2017 · NeurIPS

Problem

Framing

Deep RL typically assumes access to a hand-specified numeric reward, but many goals are easier to judge than to specify. The paper closes this gap by fitting a reward model to pairwise human comparisons of short clips and optimizing against that model online. Humans give feedback on under 1% of the agent's interactions with the environment.

Currently Used Methods

Foundational

@mnihDQN2015 — deep Q-learning succeeds on Atari with explicit game rewards.
- Limitation in context: cannot act when the reward is unobserved.
@schulmanPPO2017 — practical policy-gradient optimization for large continuous-control problems.
- Limitation in context: still requires a scalar objective to optimize.
@silverAlphaGo2016 — superhuman RL under exact win-loss objectives.
- Limitation in context: depends on perfectly specified task outcomes.
Apprenticeship Learning via Inverse Reinforcement Learning — infers rewards from expert demonstrations.
- Limitation in context: demonstrations fail for hard or non-human behaviors.
Active Preference-Based Learning — learns utilities from pairwise preference queries.
- Limitation in context: did not scale to deep RL domains.

Proposed Method

Architecture

The system has three parts: policy, environment, and reward predictor. The policy generates trajectories, humans compare short clip pairs, and the reward predictor trains asynchronously from those labels. RL then maximizes predicted reward instead of environment reward.

Figure (architecture diagram): the reward predictor receives human feedback and environment observations, then sends predicted reward to the RL algorithm, which acts in the environment.

Loss / Objective

The reward model uses a Bradley–Terry preference likelihood over clip returns.

\hat{P}[\sigma^1 \succ \sigma^2] = \frac{\exp \sum_t \hat{r}(o_t^1,a_t^1)}{\exp \sum_t \hat{r}(o_t^1,a_t^1) + \exp \sum_t \hat{r}(o_t^2,a_t^2)}

\mathcal{L}(\hat{r}) = - \sum_{(\sigma^1,\sigma^2,\mu) \in D} \left[ \mu(1) \log \hat{P}[\sigma^1 \succ \sigma^2] + \mu(2) \log \hat{P}[\sigma^2 \succ \sigma^1] \right]
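
As a minimal sketch of the two formulas above, the following pure-Python code computes the preference probability and the cross-entropy loss from per-step predicted rewards. The function names, the list-of-floats clip representation, and the small `eps` guard inside the logarithm are illustrative assumptions, not details from the paper.

```python
import math
from typing import List, Sequence, Tuple

# One labeled comparison: per-step predicted rewards for each clip, plus the
# label distribution (mu(1), mu(2)) over which clip the human preferred.
Comparison = Tuple[Sequence[float], Sequence[float], Tuple[float, float]]


def preference_prob(rewards_1: Sequence[float], rewards_2: Sequence[float]) -> float:
    """Estimate P[clip 1 preferred over clip 2] from summed predicted rewards."""
    d = sum(rewards_1) - sum(rewards_2)
    # exp(s1) / (exp(s1) + exp(s2)) equals the logistic function of s1 - s2;
    # branching on the sign avoids overflow in math.exp for large |d|.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def preference_loss(dataset: List[Comparison], eps: float = 1e-12) -> float:
    """Cross-entropy between predicted preferences and human labels.

    eps is an assumed numerical guard against log(0); it is not in the paper.
    """
    total = 0.0
    for rewards_1, rewards_2, (mu_1, mu_2) in dataset:
        p_1 = preference_prob(rewards_1, rewards_2)
        p_2 = 1.0 - p_1
        total -= mu_1 * math.log(p_1 + eps) + mu_2 * math.log(p_2 + eps)
    return total
```

Two clips with equal summed reward give a probability of 0.5, so a single comparison labeled (1, 0) contributes about log 2 to the loss.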

Algorithm

Policy optimization is standard RL on the current learned reward.

\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_t \hat{r}(o_t,a_t) \right]
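
The objective above can be estimated by averaging predicted returns over sampled trajectories. The sketch below illustrates only that estimate, with the learned reward standing in for the environment reward; `r_hat`, the scalar observation and action types, and the function names are hypothetical placeholders, and the policy update itself is left to whichever RL algorithm is used.

```python
from typing import Callable, Sequence, Tuple

# Scalars stand in for observations and actions purely for illustration.
Step = Tuple[float, float]
RewardFn = Callable[[float, float], float]


def predicted_return(trajectory: Sequence[Step], r_hat: RewardFn) -> float:
    """Sum the learned reward r_hat(o_t, a_t) over one trajectory."""
    return sum(r_hat(o, a) for o, a in trajectory)


def estimate_objective(trajectories: Sequence[Sequence[Step]], r_hat: RewardFn) -> float:
    """Monte Carlo estimate of the expected predicted return under the policy
    that generated the trajectories."""
    return sum(predicted_return(t, r_hat) for t in trajectories) / len(trajectories)
```

The environment's own reward never appears in this computation, which is the point of the method: the policy sees only the predictor's output.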

Training Procedure

Clip length: 1–2 seconds.
Query choice: ensemble disagreement over segment pairs.
Reward model: bootstrap ensemble.
Predictor holdout fraction: $1/e$.
MuJoCo optimizer: TRPO.
MuJoCo discount: $\gamma = 0.995$.
Atari reward-model pretraining: 200 epochs.
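
The query-choice rule listed above can be sketched as follows, assuming disagreement is measured as the variance of predicted preference probabilities across ensemble members. The `Clip` type, the `RewardModel` callables, and the function names are hypothetical; a real reward model would be a trained network, not a plain function.

```python
import math
from statistics import pvariance
from typing import Callable, List, Sequence, Tuple

Clip = Sequence[float]                 # hypothetical: a clip as per-step features
RewardModel = Callable[[Clip], float]  # returns the summed predicted reward of a clip


def _sigmoid(d: float) -> float:
    """Overflow-safe logistic function."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def disagreement(pair: Tuple[Clip, Clip], ensemble: Sequence[RewardModel]) -> float:
    """Variance across ensemble members of P[first clip preferred]."""
    clip_1, clip_2 = pair
    probs = [_sigmoid(model(clip_1) - model(clip_2)) for model in ensemble]
    return pvariance(probs)


def select_queries(
    pairs: List[Tuple[Clip, Clip]], ensemble: Sequence[RewardModel], k: int
) -> List[Tuple[Clip, Clip]]:
    """Return the k candidate pairs on which the ensemble disagrees most."""
    scores = [disagreement(pair, ensemble) for pair in pairs]
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [pairs[i] for i in order[:k]]
```

Pairs on which every member predicts the same preference score zero and are sent to the human last, so labeling effort goes where the predictor is least certain.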

Evaluation

Datasets

Atari: BeamRider, Breakout, Pong, Q*bert, Seaquest, SpaceInvaders, Enduro.
MuJoCo: 8 continuous-control tasks.
Novel tasks: Hopper backflip, one-leg Half-Cheetah, Enduro traffic pacing.

Metrics

True environment return.
Human query count.
Environment interactions.
Qualitative success on novel behaviors.

Headline results

MuJoCo: 700 synthetic labels nearly match true-reward RL.
MuJoCo: 1400 synthetic labels slightly outperform true-reward RL.
Atari: policies trained with 5.5k human queries are competitive on most games.
Oversight cost: feedback covers less than 1% of interactions.
Hopper backflip: about 900 queries, under one hour.

Figure (results): Atari learning curves for seven games, comparing true-reward RL, synthetic preference labels, and 5.5k human labels.

Ablations

Query strategy: random queries underperform disagreement-based selection.
Reward ensemble: one predictor degrades learning quality.
Label timing: offline reward training yields bizarre exploitative behavior.
Episode design: variable termination leaks task information.

Method Strengths and Weaknesses

Strengths

Learns reward functions from pairwise judgments, not hand-coded scores.
Reaches strong Atari and MuJoCo performance with sparse oversight.
Online querying reduces reward-model exploitation.
Trains novel behaviors humans can judge but not demonstrate.

Weaknesses

Learned-reward training is less stable than true-reward RL.
Human-labeled Atari trails oracle-label runs on several games.
Reward hacking appears when reward learning is offline.
Performance remains weak on some games, especially Q*bert.

Suggestions from the authors

Improve consistency and quality of human labels.
Tighten online feedback loops against predictor exploitation.
Extend preference learning to harder real-world tasks.
Reduce the amount of feedback needed further.
