
[Policy Gradient] Vanilla Policy Gradient, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization Algorithms (PPO)

This post collects the equations from the papers (derivations are not included). The related documentation was also referenced.

🔖 Simplest Policy Gradient

We consider the case of a stochastic, parameterized policy $\pi_\theta$. We aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. To do this, we optimize the policy parameters by gradient ascent.

$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}$
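As a sanity check on the sign convention, here is a minimal sketch of this ascent step in PyTorch, using a toy stand-in for an estimate of $J(\pi_\theta)$ (all names and values are illustrative assumptions, not from the papers):

```python
import torch

theta = torch.zeros(4, requires_grad=True)   # toy policy parameters
alpha = 1e-2                                 # step size

# Stand-in for a sample estimate of J(pi_theta); in practice this comes
# from returns of trajectories collected with the current policy.
j_estimate = -((theta - 1.0) ** 2).sum()

grad, = torch.autograd.grad(j_estimate, theta)
with torch.no_grad():
    theta += alpha * grad                    # ascent step: move along +grad to maximize J
```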

The gradient of policy performance, $\nabla_\theta J(\pi_\theta)$, is the policy gradient; in practice it is estimated from sampled trajectories.

  • Gradient estimator 

$\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) \hat{A}_t \right]$

  • The estimator $\hat{g}$ is obtained by differentiating the objective 

$L^{PG}(\theta) = \hat{\mathbb{E}}_t [ \log \pi_\theta(a_t|s_t) \hat{A}_t ]$

While it is appealing to perform multiple steps of optimization on this loss $L^{PG}$ using the same trajectory, doing so is not well-justified, and empirically it often leads to destructively large policy updates.
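A minimal sketch of one such update in PyTorch, assuming a toy categorical policy and pre-computed advantage estimates $\hat{A}_t$ (tensor names and sizes are illustrative assumptions, not from the papers):

```python
import torch
import torch.nn as nn

# Toy categorical policy over 4 actions from 8-dimensional observations (assumed sizes).
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Assumed batch of sampled data: states s_t, actions a_t, advantage estimates A_hat_t.
states = torch.randn(32, 8)
actions = torch.randint(0, 4, (32,))
advantages = torch.randn(32)

# L^PG = E_t[ log pi_theta(a_t|s_t) * A_hat_t ]; minimize its negative to ascend.
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * advantages).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()  # one update per batch; re-using the batch for many steps is what TRPO/PPO address
```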

🔖 Trust Region Methods (TRPO)

  • TRPO suggests a different objective function, which is maximized subject to a constraint on the size of the policy update (see the sketch after the equation):

$\underset{\theta}{\text{maximize}} \;\; \hat{\mathbb{E}}_t \left[ \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}_t \right]$

$\text{subject to} \;\; \hat{\mathbb{E}}_t \left[ \text{KL}\left[ \pi_{\theta_{old}}(\cdot|s_t),\ \pi_\theta(\cdot|s_t) \right] \right] \le \delta$
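The sketch below only evaluates the two quantities in this problem, the surrogate objective and the mean KL, for a toy categorical policy; the actual TRPO update (conjugate gradient plus a line search on the constrained problem) is not shown, and all names and the value of $\delta$ are assumptions.

```python
import copy
import torch
import torch.nn as nn

# Assumed toy policy and batch (same shapes as the sketch above).
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
policy_old = copy.deepcopy(policy)            # frozen snapshot pi_theta_old
states, actions = torch.randn(32, 8), torch.randint(0, 4, (32,))
advantages = torch.randn(32)

dist_new = torch.distributions.Categorical(logits=policy(states))
with torch.no_grad():
    dist_old = torch.distributions.Categorical(logits=policy_old(states))

# Surrogate objective to maximize: E_t[ (pi_theta / pi_theta_old) * A_hat_t ]
ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
surrogate = (ratio * advantages).mean()

# Trust-region constraint: E_t[ KL[pi_theta_old(.|s_t), pi_theta(.|s_t)] ] <= delta
kl = torch.distributions.kl_divergence(dist_old, dist_new).mean()
delta = 0.01  # assumed trust-region size
```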

🔖 Proximal Policy Optimization (PPO)

  • PPO replaces the hard constraint with a clipped surrogate objective: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$.
  • Combined with a value-function error term and an entropy bonus, the overall objective becomes $L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$, where $L_t^{VF}(\theta) = (V_\theta(s_t) - V_t^{\text{targ}})^2$ (see the sketch below).
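A minimal sketch of evaluating this combined objective for a toy actor-critic, assuming pre-computed advantages and value targets (network names, sizes, and coefficient values are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

# Toy actor-critic with assumed sizes; eps, c1, c2 play the roles of the
# clipping parameter and the coefficients in the combined objective.
actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
critic = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))
actor_old = copy.deepcopy(actor)
eps, c1, c2 = 0.2, 0.5, 0.01

# Assumed batch: states, actions, advantage estimates, value targets.
states, actions = torch.randn(32, 8), torch.randint(0, 4, (32,))
advantages, returns = torch.randn(32), torch.randn(32)

dist = torch.distributions.Categorical(logits=actor(states))
with torch.no_grad():
    dist_old = torch.distributions.Categorical(logits=actor_old(states))

# L^CLIP = E_t[ min(r_t * A_hat_t, clip(r_t, 1-eps, 1+eps) * A_hat_t) ]
ratio = torch.exp(dist.log_prob(actions) - dist_old.log_prob(actions))
l_clip = torch.min(ratio * advantages,
                   torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

# L^VF: squared value-function error; S: entropy bonus.
l_vf = (critic(states).squeeze(-1) - returns).pow(2).mean()
entropy = dist.entropy().mean()

# The combined objective is maximized, so the loss to minimize is its negative.
loss = -(l_clip - c1 * l_vf + c2 * entropy)
```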