This post summarizes the equations from the paper (without derivations). The document was used as a reference.
🔖 Simplest Policy Gradient
We consider the case of a stochastic, parameterized policy $\pi_\theta$. We aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. For this, we optimize the policy by gradient ascent.
$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)|_{\theta_k}$
The gradient of policy performance, $\nabla_\theta J(\pi_\theta)$, is called the policy gradient:
$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau) \right]$
- Gradient estimator
$\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]$
- The estimator $\hat{g}$ is obtained by differentiating the objective
$L^{PG} (\theta) = \hat{\mathbb{E}}_t [ \log \pi_\theta (a_t|s_t) \hat{A}_t ] $
While it is appealing to perform multiple steps of optimization on this loss $L^{PG}$ using the same trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy updates.
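A minimal PyTorch sketch of $L^{PG}$ and one gradient-ascent step. The toy policy network, dummy batch, and learning rate are assumptions for illustration; the advantage estimates $\hat{A}_t$ are taken as given.

```python
import torch
import torch.nn as nn

# Toy setup (assumed, not from the paper): 4-dim observations, 2 discrete actions.
obs_dim, n_actions, batch = 4, 2, 8
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

states = torch.randn(batch, obs_dim)           # s_t
actions = torch.randint(n_actions, (batch,))   # a_t sampled from pi_theta
advantages = torch.randn(batch)                # A_hat_t, assumed precomputed

# L^PG(theta) = E_hat_t[ log pi_theta(a_t|s_t) * A_hat_t ]
dist = torch.distributions.Categorical(logits=policy(states))
loss = -(dist.log_prob(actions) * advantages).mean()  # negated because optimizers minimize

# One gradient-ascent step: theta_{k+1} = theta_k + alpha * g_hat
optimizer.zero_grad()
loss.backward()
optimizer.step()
```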
🔖 Trust Region Methods (TRPO)
- TRPO suggests a different (surrogate) objective function that is maximized subject to a constraint on the size of the policy update.
$\underset{\theta}{\text{maximize}} \;\; \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right]$
$\text{subject to} \;\; \hat{\mathbb{E}}_t \left[ \text{KL}\left[ \pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t) \right] \right] \le \delta$
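A sketch of the two quantities this problem is built from: the ratio surrogate and the mean KL constraint term. The actual TRPO algorithm solves the constrained problem with conjugate gradient and a line search, which is not shown here; the toy policy and dummy batch are assumptions.

```python
import torch
import torch.nn as nn

# Toy policy and data; policy_old is a frozen copy standing in for pi_theta_old.
obs_dim, n_actions, batch = 4, 2, 8
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
policy_old = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
policy_old.load_state_dict(policy.state_dict())

states = torch.randn(batch, obs_dim)
actions = torch.randint(n_actions, (batch,))
advantages = torch.randn(batch)

dist = torch.distributions.Categorical(logits=policy(states))
with torch.no_grad():
    dist_old = torch.distributions.Categorical(logits=policy_old(states))

# Surrogate objective: E_hat_t[ (pi_theta / pi_theta_old) * A_hat_t ]
ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
surrogate = (ratio * advantages).mean()

# Constraint term: E_hat_t[ KL(pi_theta_old, pi_theta) ] <= delta
mean_kl = torch.distributions.kl_divergence(dist_old, dist).mean()
```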
🔖 Proximal Policy Optimization (PPO)
- With the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$, the clipped surrogate objective is $L_t^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right]$.
- Combining this with a value-function error term and an entropy bonus gives the full objective: $L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$, where $L_t^{VF} = (V_\theta(s_t) - V_t^{targ})^2$ and $c_1, c_2$ are coefficients.
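A sketch of the combined PPO loss for one minibatch. The actor-critic networks, dummy batch, stored old log-probabilities, and the coefficient values ($\epsilon$, $c_1$, $c_2$) are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

# Toy actor-critic and dummy minibatch (assumed).
obs_dim, n_actions, batch = 4, 2, 8
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, 1))
eps, c1, c2 = 0.2, 0.5, 0.01                   # clip range and loss coefficients

states = torch.randn(batch, obs_dim)
actions = torch.randint(n_actions, (batch,))
advantages = torch.randn(batch)
returns = torch.randn(batch)                   # targets V_t^targ for the value loss
old_log_probs = torch.randn(batch)             # log pi_theta_old(a_t|s_t), assumed stored from the rollout

dist = torch.distributions.Categorical(logits=actor(states))
ratio = torch.exp(dist.log_prob(actions) - old_log_probs)        # r_t(theta)

# L^CLIP: minimum of the unclipped and clipped surrogate
l_clip = torch.min(ratio * advantages,
                   torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
l_vf = (critic(states).squeeze(-1) - returns).pow(2).mean()      # L^VF (squared error)
entropy = dist.entropy().mean()                                  # S[pi_theta]

# Negate the maximization objective L^CLIP - c1*L^VF + c2*S so an optimizer can minimize it.
loss = -(l_clip - c1 * l_vf + c2 * entropy)
loss.backward()
```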