This post is my write-up of David Silver's Reinforcement Learning course.
In the last lecture, we approximated the value or action-value function using parameters θ, and the policy was generated directly from that value function. In this lecture, we directly parameterise a stochastic policy:
$\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
The taxonomy below shows where value-based and policy-based RL sit.

- Value-based RL: learnt value function and implicit (greedy or ϵ-greedy) policy
    - Ex. DQN
- Policy-based RL: no value function, but a learnt (stochastic) policy
    - Ex. REINFORCE
    - The policy can be a softmax policy or a Gaussian policy (see the sketch after this list)
    - Advantages
        - Better convergence properties
        - Effective in high-dimensional or continuous action spaces
        - Can learn stochastic policies
    - Disadvantages
        - Typically converges to a local rather than a global optimum
        - Evaluating a policy is typically inefficient and high variance
- Value-based + Policy-based RL: learnt value function and learnt policy
    - Ex. Actor-Critic
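As a concrete illustration of a parameterised stochastic policy, here is a minimal sketch of a softmax policy over linear action preferences. The feature map `phi`, the two-action setup, and the numbers are purely illustrative assumptions, not anything from the lecture.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) = exp(phi(s, a) . theta) / sum_b exp(phi(s, b) . theta)."""
    prefs = np.array([phi(s, a) @ theta for a in actions])
    prefs -= prefs.max()              # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Toy illustration: two actions, features are the action one-hot scaled by a scalar state.
actions = [0, 1]
phi = lambda s, a: s * np.eye(len(actions))[a]
theta = np.array([0.5, -0.5])
print(softmax_policy(theta, phi, 1.0, actions))   # ~[0.73, 0.27]
```

A Gaussian policy works the same way, except the parameters define the mean (and possibly variance) of a continuous action distribution instead of discrete action preferences.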
🥏 Policy Gradient with Policy Objective Functions
- Goal: given a policy $\pi_\theta(s, a)$ with parameters θ, find the θ that maximizes $J(\theta)$
- $J(\theta)$ measures the quality of the policy $\pi_\theta$; there are three candidate objective functions, and the optimization works the same whichever one we choose, so let $J(\theta)$ be any policy objective function
- In episodic environments, we can use the start value: $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
- In continuing environments, we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$, where $d^{\pi_\theta}(s)$ is the stationary distribution of states under $\pi_\theta$
- In continuing environments, we can also use the average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}^a_s$
- Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the objective w.r.t. the policy parameters θ: $\Delta\theta = \alpha \nabla_\theta J(\theta)$, where α is a step-size parameter and $\nabla_\theta J(\theta)$ is the policy gradient, the vector of partial derivatives (a toy gradient-ascent sketch follows below)

$\nabla_\theta J(\theta) = \left( \dfrac{\partial J(\theta)}{\partial \theta_1},\ \dots,\ \dfrac{\partial J(\theta)}{\partial \theta_n} \right)^{\!\top}$
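The update rule above is plain gradient ascent. Here is a minimal sketch, assuming only that $J(\theta)$ can be evaluated; the toy quadratic objective stands in for a real policy objective, and the gradient is estimated by central finite differences purely for illustration.

```python
import numpy as np

def gradient_ascent(J, theta, alpha=0.1, steps=200, eps=1e-5):
    """Repeatedly apply delta theta = alpha * grad J(theta),
    estimating the gradient by central finite differences."""
    for _ in range(steps):
        grad = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                         for e in np.eye(len(theta))])
        theta = theta + alpha * grad
    return theta

# Toy stand-in for a policy objective, maximised at theta = (1, -2).
J = lambda th: -(th[0] - 1.0) ** 2 - (th[1] + 2.0) ** 2
print(gradient_ascent(J, np.zeros(2)))   # converges towards [1, -2]
```

In practice the gradient is not computed this way; the point of the rest of the lecture is how to estimate $\nabla_\theta J(\theta)$ from experience.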
🥏 Policy Gradient Theorem
- The policy gradient theorem generalises the likelihood ratio approach to multi-step MDPs (the likelihood ratio identity is sketched below)
- For any differentiable policy $\pi_\theta(s, a)$ and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$
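The "likelihood ratio" trick referred to above is the identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. The short sketch below (my own fill-in, not a full proof) shows how it turns the weighted sum that the theorem produces into an expectation under $\pi_\theta$; the theorem's real content is that $\nabla_\theta J(\theta)$ reduces to this sum, with no extra terms from differentiating $d^{\pi_\theta}$ or $Q^{\pi_\theta}$.

```latex
\begin{align*}
% Likelihood ratio identity (assuming pi_theta(s,a) > 0 and differentiable):
\nabla_\theta \pi_\theta(s,a)
  &= \pi_\theta(s,a)\,\frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)}
   = \pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a) \\[6pt]
% Applied inside the theorem, it turns the sum over actions into an
% expectation under the policy (states weighted by d^{pi_theta}):
\sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)
  &= \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\,
       \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \\
  &= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\,
       Q^{\pi_\theta}(s,a)\right]
\end{align*}
```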
🥏 Monte-Carlo Policy Gradient: REINFORCE
- Using the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$, update the parameters by

$\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$
- Pseudo code (used in AlphaGo); a Python sketch is given below

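A minimal Python sketch of REINFORCE under illustrative assumptions: a gym-style environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done, info)`, a feature map `phi(s, a)` with the same size as `theta`, and a linear softmax policy. All of these names and the interface are assumptions for the sketch, not part of the lecture.

```python
import numpy as np

def reinforce(env, phi, n_actions, theta, alpha=0.01, gamma=1.0, episodes=1000):
    """Monte-Carlo policy gradient (REINFORCE) with a linear softmax policy."""
    def policy(s):
        prefs = np.array([phi(s, a) @ theta for a in range(n_actions)])
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    for _ in range(episodes):
        # Sample one full episode following pi_theta.
        states, acts, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(n_actions, p=policy(s))
            s_next, r, done, _ = env.step(a)
            states.append(s); acts.append(a); rewards.append(r)
            s = s_next

        # Walk backwards, accumulating the return v_t, and apply
        # delta theta_t = alpha * grad log pi_theta(s_t, a_t) * v_t.
        v = 0.0
        for t in reversed(range(len(states))):
            v = rewards[t] + gamma * v
            probs = policy(states[t])
            # Score of a linear softmax: phi(s,a) - sum_b pi(s,b) phi(s,b)
            score = phi(states[t], acts[t]) - sum(
                probs[b] * phi(states[t], b) for b in range(n_actions))
            theta = theta + alpha * score * v
    return theta
```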
- Huge variance problem -> Solved by Actor-Critic
🥏 Actor-Critic Policy Gradient
- The REINFORCE algorithm still has high variance; Actor-Critic reduces this variance by introducing a critic that estimates the action-value function
- Learn both the action-value function and the policy
- Actor-critic algorithms maintain two sets of parameters
    - Critic: updates the action-value function parameters w
    - Actor: updates the policy parameters θ, in the direction suggested by the critic
- Actor-critic algorithms follow an approximate policy gradient (a Python sketch of the resulting algorithm is given below)

$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\right]$

$\Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$
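A minimal sketch of a one-step action-value actor-critic along these lines: a linear critic $Q_w(s,a) = x(s,a)^\top w$ updated towards a Sarsa-style TD target, and a softmax actor updated in the direction the critic suggests. The environment interface and the shared feature map `x` are the same illustrative assumptions as in the REINFORCE sketch above.

```python
import numpy as np

def actor_critic(env, x, n_actions, theta, w,
                 alpha=0.01, beta=0.1, gamma=0.99, episodes=500):
    """One-step action-value actor-critic.

    Actor:  softmax policy with preferences x(s, a) . theta
    Critic: linear action-value function Q_w(s, a) = x(s, a) . w
    """
    def policy(s):
        prefs = np.array([x(s, a) @ theta for a in range(n_actions)])
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    for _ in range(episodes):
        s, done = env.reset(), False
        a = np.random.choice(n_actions, p=policy(s))
        while not done:
            s2, r, done, _ = env.step(a)
            q_sa = x(s, a) @ w
            if done:
                td_target = r
            else:
                a2 = np.random.choice(n_actions, p=policy(s2))
                td_target = r + gamma * (x(s2, a2) @ w)

            # Actor: delta theta = alpha * grad log pi(s,a) * Q_w(s,a)
            probs = policy(s)
            score = x(s, a) - sum(probs[b] * x(s, b) for b in range(n_actions))
            theta = theta + alpha * score * q_sa

            # Critic: move w towards the TD target along the features.
            w = w + beta * (td_target - q_sa) * x(s, a)

            if not done:
                s, a = s2, a2
    return theta, w
```

Using the TD error (rather than $Q_w$ itself) in the actor update, or subtracting a baseline, are the variance-reduction refinements covered later in the lecture; this sketch sticks to the basic form given by the two equations above.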