Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal

2017 · arXiv

Problem

Framing

Vanilla policy gradient methods take a single gradient step per batch of sampled data, which is sample-inefficient, while TRPO stabilizes updates at the cost of a constrained optimization solve. PPO closes this gap with a clipped surrogate objective that supports multiple epochs of minibatch updates using only first-order optimization. On Atari, it wins 30 of 49 games when scored by average episode reward over all of training.

Currently Used Methods

Foundational

Proposed Method

Architecture

PPO changes the update rule, not the network family. For MuJoCo, it uses separate policy and value MLPs, each with two hidden layers of 64 units and tanh nonlinearities; the policy outputs Gaussian means with learned standard deviations. Shared policy-value parameters are also supported through a joint loss.
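A minimal NumPy sketch of the MuJoCo-style networks described above, forward pass only. The observation and action sizes, the weight initialization, and the choice of a state-independent log standard deviation vector are assumptions made for illustration; the helper names (`init_mlp`, `mlp_forward`, `gaussian_log_prob`) are hypothetical and not taken from the paper.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Return a list of (W, b) pairs for a fully connected network."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        # Scaled normal init is an illustrative choice, not from the paper.
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        params.append((W, np.zeros(n_out)))
    return params

def mlp_forward(params, x):
    """Apply tanh on hidden layers and a linear output layer."""
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

def gaussian_log_prob(mean, log_std, action):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * (action - mean) ** 2 / var - log_std - 0.5 * np.log(2.0 * np.pi)
    return np.sum(per_dim, axis=-1)

rng = np.random.default_rng(0)
obs_dim, act_dim = 11, 3                              # hypothetical sizes
policy = init_mlp(rng, [obs_dim, 64, 64, act_dim])    # two 64-unit tanh layers
value = init_mlp(rng, [obs_dim, 64, 64, 1])           # separate value network
log_std = np.zeros(act_dim)                           # assumed state-independent

obs = rng.normal(size=(5, obs_dim))                   # batch of 5 fake observations
mean = mlp_forward(policy, obs)                       # Gaussian means, shape (5, 3)
action = mean + np.exp(log_std) * rng.normal(size=mean.shape)
logp = gaussian_log_prob(mean, log_std, action)       # shape (5,)
v = mlp_forward(value, obs)[:, 0]                     # state values, shape (5,)
```

The log-probabilities computed this way are what the probability ratio in the objective is built from: the ratio is the exponential of the new log-probability minus the one stored at rollout time.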

Verified figure: page showing the clipped surrogate objective and piecewise plots of L^{CLIP} versus probability ratio r_t for positive and negative advantages.

Loss / Objective

The core objective clips the probability ratio around the old policy.

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where the probability ratio is

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

With a value-function error term and an entropy bonus, the combined objective is

$$L^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 \left(V_\theta(s_t) - V_t^{\mathrm{targ}}\right)^2 + c_2 S\left[\pi_\theta\right](s_t) \right]$$
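The clipped objective and the combined objective can be sketched in a few lines of NumPy. This is an illustrative evaluation of the formulas on arrays, not the authors' implementation: the function names are hypothetical, the default coefficients are illustrative, and a real training loop would compute gradients with an autodiff library rather than just evaluating the value.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    ratio = np.exp(logp_new - logp_old)          # r_t(theta)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

def ppo_total_objective(logp_new, logp_old, adv, v_pred, v_targ, entropy,
                        eps=0.2, c1=1.0, c2=0.01):
    """L^{CLIP} minus c1 * squared value error plus c2 * entropy bonus (to be maximized)."""
    l_clip = ppo_clip_objective(logp_new, logp_old, adv, eps)
    l_vf = np.mean((v_pred - v_targ) ** 2)
    return l_clip - c1 * l_vf + c2 * np.mean(entropy)

# Four cases: ratio above/below the clip range, with positive/negative advantage.
ratio = np.array([1.5, 0.5, 1.5, 0.5])
adv = np.array([1.0, 1.0, -1.0, -1.0])
logp_old = np.zeros(4)
logp_new = np.log(ratio)

# Per-sample terms are 1.2, 0.5, -1.5, -0.8, so the mean is -0.15.
l_clip = ppo_clip_objective(logp_new, logp_old, adv)
```

The four samples show the pessimistic nature of the `min`: a gain from moving the ratio outside the clip range is capped (1.5 becomes 1.2), while a loss from moving it the wrong way is kept in full (-1.5 stays -1.5).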

Algorithm

PPO alternates rollout collection under $\pi_{\theta_{\mathrm{old}}}$ with several epochs of minibatch ascent on the clipped surrogate.

$$\theta \leftarrow \arg\max_{\theta}\; \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$
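The "several epochs of minibatch ascent" structure can be sketched as an index generator over one collected rollout batch. The generator name `ppo_minibatches` is hypothetical, and the batch size, epoch count, and minibatch size below are illustrative settings; the gradient step itself is left as a comment because it depends on the model and optimizer in use.

```python
import numpy as np

def ppo_minibatches(batch_size, num_epochs, minibatch_size, rng):
    """Yield index arrays: each epoch reshuffles the batch and splits it into minibatches."""
    for _ in range(num_epochs):
        perm = rng.permutation(batch_size)
        for start in range(0, batch_size, minibatch_size):
            yield perm[start:start + minibatch_size]

rng = np.random.default_rng(0)
batch_size, num_epochs, minibatch_size = 2048, 10, 64   # illustrative settings

num_updates = 0
seen_first_epoch = []
for idx in ppo_minibatches(batch_size, num_epochs, minibatch_size, rng):
    # Here one would take a gradient ascent step on the clipped surrogate,
    # using only the rollout samples selected by idx. The rollout data and
    # the stored old log-probabilities stay fixed for all epochs.
    if num_updates < batch_size // minibatch_size:
        seen_first_epoch.append(idx)
    num_updates += 1
```

With these settings each rollout batch is reused for 320 gradient steps (32 minibatches per epoch over 10 epochs), which is where the sample-efficiency gain over a single update per batch comes from. After the last epoch, the current parameters become the old policy for the next rollout.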

Training Procedure

Evaluation

Datasets

Metrics

Headline results

Table 1: Atari game wins across summary criteria

| Criterion | A2C | ACER | PPO | Tie |
|---|---|---|---|---|
| (1) avg. episode reward over all of training | 1 | 18 | 30 | 0 |
| (2) avg. episode reward over last 100 episodes | 1 | 28 | 19 | 1 |

Ablations

Method Strengths and Weaknesses

Strengths

Weaknesses

Suggestions from the authors

Links

Prior Papers

Further Papers