Deep Reinforcement Learning from Human Preferences

Paul Christiano, Jan Leike

2017 · NeurIPS

Problem

Framing

Deep RL typically assumes access to a well-specified numeric reward function, but many goals are easier to judge than to specify. The paper addresses this gap by fitting a reward model to pairwise human comparisons of short trajectory clips and optimizing the policy against that model online. Human feedback is needed on less than 1% of the agent's interactions with the environment.

Currently Used Methods

Foundational

Proposed Method

Architecture

The system has three parts: policy, environment, and reward predictor. The policy generates trajectories, humans compare short clip pairs, and the reward predictor trains asynchronously from those labels. RL then maximizes predicted reward instead of environment reward.
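
The labeling step in this loop can be sketched as a small data structure. This is a minimal illustration, not the authors' code: `PreferenceBuffer`, `sample_clip`, the choice strings, and the 25-step clip length are all hypothetical names and values chosen for the example. The handling of "equal" and "incomparable" answers (uniform label, and dropping the pair) is an assumption based on how the paper is usually described, not something stated in this summary.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PreferenceBuffer:
    """Stores (clip_1, clip_2, mu) triples; mu is a distribution over {1, 2}."""
    triples: list = field(default_factory=list)

    def add(self, clip_1, clip_2, choice):
        # `choice` is a hypothetical label string from the human rater.
        if choice == "left":
            mu = (1.0, 0.0)
        elif choice == "right":
            mu = (0.0, 1.0)
        elif choice == "equal":
            mu = (0.5, 0.5)  # assumption: ties become a uniform label
        else:
            return  # assumption: incomparable pairs are not stored
        self.triples.append((clip_1, clip_2, mu))


def sample_clip(trajectory, clip_len, rng):
    """Cut a contiguous clip of (observation, action) steps from a trajectory."""
    start = rng.randrange(len(trajectory) - clip_len + 1)
    return trajectory[start:start + clip_len]


# Toy usage: a fake trajectory of (observation, action) pairs.
rng = random.Random(0)
trajectory = [(t, t % 2) for t in range(100)]
buffer = PreferenceBuffer()
clip_1 = sample_clip(trajectory, 25, rng)
clip_2 = sample_clip(trajectory, 25, rng)
buffer.add(clip_1, clip_2, "left")
```

In a full system the policy, the labeling interface, and the reward predictor would each read from or write to a shared buffer like this one while running asynchronously. The sketch shows only the data that passes between them.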

Verified architecture diagram: a reward predictor receives human feedback and environment observations, then sends predicted reward to the RL algorithm, which acts in the environment.

Loss / Objective

The reward model uses a Bradley–Terry preference likelihood over clip returns.

$$\hat{P}[\sigma^1 \succ \sigma^2] = \frac{\exp \sum_t \hat{r}(o_t^1,a_t^1)}{\exp \sum_t \hat{r}(o_t^1,a_t^1) + \exp \sum_t \hat{r}(o_t^2,a_t^2)}$$

$$\mathcal{L}(\hat{r}) = - \sum_{(\sigma^1,\sigma^2,\mu) \in D} \left[ \mu(1) \log \hat{P}[\sigma^1 \succ \sigma^2] + \mu(2) \log \hat{P}[\sigma^2 \succ \sigma^1] \right]$$
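
The preference probability and loss can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: each clip is represented directly by a list of per-step predicted rewards standing in for the values of r̂(o_t, a_t), the function names are hypothetical, and the log-sum-exp shift is a numerical-stability choice of this sketch rather than something taken from the paper.

```python
import math


def log_preference_probs(rewards_1, rewards_2):
    """Return (log P[clip 1 preferred], log P[clip 2 preferred])."""
    s1, s2 = sum(rewards_1), sum(rewards_2)
    # Log of the softmax denominator, shifted by the max to avoid overflow.
    m = max(s1, s2)
    log_z = m + math.log(math.exp(s1 - m) + math.exp(s2 - m))
    return s1 - log_z, s2 - log_z


def preference_loss(dataset):
    """Cross-entropy loss over (rewards_1, rewards_2, (mu1, mu2)) triples."""
    total = 0.0
    for rewards_1, rewards_2, (mu1, mu2) in dataset:
        log_p1, log_p2 = log_preference_probs(rewards_1, rewards_2)
        total -= mu1 * log_p1 + mu2 * log_p2
    return total


# Toy usage: both clips have predicted return 3.0, so each is preferred
# with probability 0.5 and the loss for one label is log(2).
dataset = [([1.0, 2.0], [3.0, 0.0], (1.0, 0.0))]
loss = preference_loss(dataset)
```

In practice the per-step rewards would come from a differentiable model, and the loss would be minimized by gradient descent on its parameters. The sketch only evaluates the objective.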

Algorithm

Policy optimization is standard RL on the current learned reward.

$$\max_{\pi} \; \mathbb{E}_{\tau \sim \pi} \left[ \sum_t \hat{r}(o_t,a_t) \right]$$
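
The quantity inside this objective can be estimated by averaging predicted returns over sampled trajectories. The sketch below assumes a trajectory is a list of (observation, action) pairs and uses a toy `reward_model` callable as a stand-in for r̂. Both function names are hypothetical, and no particular RL algorithm is implied.

```python
def predicted_return(trajectory, reward_model):
    """Sum the predicted reward over every (observation, action) step."""
    return sum(reward_model(obs, act) for obs, act in trajectory)


def estimate_objective(trajectories, reward_model):
    """Monte Carlo estimate of the expected predicted return under a policy."""
    returns = [predicted_return(tau, reward_model) for tau in trajectories]
    return sum(returns) / len(returns)


# Toy usage: a made-up reward model and two short sampled trajectories.
def toy_reward(obs, act):
    return obs * act


samples = [[(1, 1), (2, 0)], [(3, 1)]]
objective = estimate_objective(samples, toy_reward)
```

An RL algorithm would adjust the policy to increase this estimate, treating the predicted reward exactly as it would an environment reward.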

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Verified results figure: Atari learning curves for seven games, comparing true-reward RL, synthetic preference labels, and 5.5k human labels.

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers