
[Policy Gradient] Vanilla Policy Gradient, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization Algorithms (PPO)

This post collects the equations from the papers (derivations are not included). The related documentation was also referenced.

🔖 Simplest Policy Gradient

We consider the case of a stochastic, parameterized policy $\pi_\theta$. We aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. To do this, we optimize the policy parameters by gradient ascent.

$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$
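As a sanity check on the sign convention, here is a minimal sketch of this ascent step in PyTorch, using a toy stand-in for an estimate of $J(\pi_\theta)$ (all names and values are illustrative assumptions, not from the papers):

```python
import torch

theta = torch.zeros(4, requires_grad=True)   # toy policy parameters
alpha = 1e-2                                 # step size

# Stand-in for a sample estimate of J(pi_theta); in practice this comes
# from returns of trajectories collected with the current policy.
j_estimate = -((theta - 1.0) ** 2).sum()

grad, = torch.autograd.grad(j_estimate, theta)
with torch.no_grad():
    theta += alpha * grad                    # ascent step: move along +grad to maximize J
```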

The gradient of policy performance, $\nabla_\theta J(\pi_\theta)$, is the policy gradient; in practice it is estimated from sampled trajectories.

  • Gradient estimator 

$\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]$

  • The estimator $\hat{g}$ is obtained by differentiating the objective 

$L^{PG}(\theta) = \hat{\mathbb{E}}_t [ \log \pi_\theta(a_t|s_t) \hat{A}_t ]$

While it is appealing to perform multiple steps of optimization on this loss $L^{PG}$ using the same trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy updates.
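A minimal sketch of one such update in PyTorch, assuming a toy categorical policy and pre-computed advantage estimates $\hat{A}_t$ (tensor names and sizes are illustrative assumptions, not from the papers):

```python
import torch
import torch.nn as nn

# Toy categorical policy over 4 actions from 8-dimensional observations (assumed sizes).
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Assumed batch of sampled data: states s_t, actions a_t, advantage estimates A_hat_t.
states = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
advantages = torch.randn(32)

# L^PG = E_t[ log pi_theta(a_t|s_t) * A_hat_t ]; minimize its negative to ascend.
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * advantages).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()  # one update per batch; re-using the batch for many steps is what TRPO/PPO address
```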

🔖 Trust Region Methods (TRPO)

  • TRPO suggests a different objective function, which is maximized subject to a constraint on the size of the policy update (see the sketch after the equation):

$\underset{\theta}{\text{maximize}} \;\; \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right]$

$\text{subject to} \;\; \hat{\mathbb{E}}_t \left[ \text{KL}\left[ \pi_{\theta_{old}}(\cdot|s_t),\ \pi_\theta(\cdot|s_t) \right] \right] \le \delta$
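The sketch below only evaluates the two quantities in this problem, the surrogate objective and the mean KL, for a toy categorical policy; the actual TRPO update (conjugate gradient plus a line search on the constrained problem) is not shown, and all names and the value of $\delta$ are assumptions.

```python
import copy
import torch
import torch.nn as nn

# Assumed toy policy and batch (same shapes as the sketch above).
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
policy_old = copy.deepcopy(policy)            # frozen snapshot pi_theta_old
states, actions = torch.randn(32, 8), torch.randint(0, 4, (32,))
advantages = torch.randn(32)

dist_new = torch.distributions.Categorical(logits=policy(states))
with torch.no_grad():
    dist_old = torch.distributions.Categorical(logits=policy_old(states))

# Surrogate objective to maximize: E_t[ (pi_theta / pi_theta_old) * A_hat_t ]
ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
surrogate = (ratio * advantages).mean()

# Trust-region constraint: E_t[ KL[pi_theta_old(.|s_t), pi_theta(.|s_t)] ] <= delta
kl = torch.distributions.kl_divergence(dist_old, dist_new).mean()
delta = 0.01  # assumed trust-region size
```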

🔖 Proximal Policy Optimization (PPO)

  • PPO replaces the hard constraint with a clipped surrogate objective: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$.
  • Combined with a value-function error term and an entropy bonus, the overall objective becomes $L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$, where $L_t^{VF}(\theta) = (V_\theta(s_t) - V_t^{\text{targ}})^2$ (see the sketch below).
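A minimal sketch of evaluating this combined objective for a toy actor-critic, assuming pre-computed advantages and value targets (network names, sizes, and coefficient values are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

# Toy actor-critic with assumed sizes; eps, c1, c2 play the roles of the
# clipping parameter and the coefficients in the combined objective.
actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
critic = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
actor_old = copy.deepcopy(actor)
eps, c1, c2 = 0.2, 0.5, 0.01

# Assumed batch: states, actions, advantage estimates, value targets.
states, actions = torch.randn(32, 8), torch.randint(0, 4, (32,))
advantages, returns = torch.randn(32), torch.randn(32)

dist = torch.distributions.Categorical(logits=actor(states))
with torch.no_grad():
    dist_old = torch.distributions.Categorical(logits=actor_old(states))

# L^CLIP = E_t[ min(r_t * A_hat_t, clip(r_t, 1-eps, 1+eps) * A_hat_t) ]
ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
l_clip = torch.min(ratio * advantages,
                   torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# L^VF: squared value-function error; S: entropy bonus.
l_vf = (critic(states).squeeze(-1) - returns).pow(2).mean()
entropy = dist.entropy().mean()

# The combined objective is maximized, so the loss to minimize is its negative.
loss = -(l_clip - c1 * l_vf + c2 * entropy)
```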