
[David Silver] 7. Policy Gradient: REINFORCE, Actor-Critic, NPG

This post is a write-up of my notes from David Silver's Reinforcement Learning course.

In the last lecture, we approximated the value or action-value function using parameters $\theta$, and a policy was generated directly from that value function (e.g. using $\epsilon$-greedy). In this lecture, we will directly parameterise a stochastic policy:

$$\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$$

This taxonomy explains Value-based and Policy-based RL well.

Value-Based and Policy-based RL. David Silver RL lecture 7.

  • Value-based RL: Learnt value function and implicit (deterministic) policy (e.g. ϵ-greedy)
    • Ex. DQN
  • Policy-based RL: No value function but learnt (stochastic) policy
    • Ex. REINFORCE
    • Policy can be a softmax policy or a Gaussian policy (a minimal softmax sketch follows this list)
    • Advantages
      1. Better convergence properties
      2. Effective in high-dimensional or continuous action spaces
      3. Can learn stochastic policies
    • Disadvantages
      1. Typically converge to a local rather than global optimum
      2. Evaluating a policy is typically inefficient and high variance
  • Both Value- and Policy-based RL: Learnt value function and learnt policy
    • Ex. Actor-Critic
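
As a concrete illustration of a learnt stochastic policy, here is a minimal sketch of a linear softmax policy, $\pi_\theta(s, a) \propto \exp(\phi(s, a)^\top \theta)$. The one-hot feature map and the toy state/action counts are assumptions made only for this example.

```python
import numpy as np

N_STATES, N_ACTIONS = 4, 3   # toy sizes, assumed only for this sketch

def features(state, action):
    """One-hot feature vector phi(s, a) over state-action pairs (toy assumption)."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[state * N_ACTIONS + action] = 1.0
    return phi

def softmax_policy(state, theta):
    """pi_theta(s, .) with action preferences phi(s, a)^T theta."""
    prefs = np.array([features(state, a) @ theta for a in range(N_ACTIONS)])
    prefs -= prefs.max()                      # for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

theta = np.zeros(N_STATES * N_ACTIONS)
probs = softmax_policy(0, theta)              # uniform at initialisation
action = np.random.choice(N_ACTIONS, p=probs) # sample a ~ pi_theta(s, .)
```

A Gaussian policy would instead parameterise the mean (and possibly the variance) of a continuous action distribution, which is what makes policy-based methods natural for continuous action spaces.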

🥏 Policy Gradient with Policy Objective Functions

  • Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find the best $\theta$, i.e. the one that maximises $J(\theta)$
  • For the objective function $J(\theta)$, which measures the quality of the policy $\pi_\theta$, there are three standard candidates; whichever one we choose, the optimisation procedure is the same, so let $J(\theta)$ be any policy objective function.
    1. In episodic environments, we can use the start value: $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
    2. In continuing environments, we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$, where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$
    3. In continuing environments, we can also use the average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_s^a$
  • Policy gradient algorithms search for a local maximum of $J(\theta)$ by ascending the gradient of the objective with respect to the policy parameters $\theta$: $\Delta\theta = \alpha \nabla_\theta J(\theta)$, where $\alpha$ is a step-size parameter and the policy gradient is the vector of partial derivatives (a numerical sketch follows the equation below)

$$\nabla_\theta J(\theta) = \begin{pmatrix} \dfrac{\partial J(\theta)}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial J(\theta)}{\partial \theta_n} \end{pmatrix}$$
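
To make the ascent direction concrete, below is a minimal sketch of gradient ascent on $J(\theta)$ where $\nabla_\theta J(\theta)$ is estimated by finite differences, perturbing each parameter in turn (as in the finite-difference policy gradient from the lecture). The black-box `evaluate_policy` function, which is assumed to return an estimate of $J(\theta)$, and the step sizes are assumptions for illustration.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate grad_theta J(theta) by perturbing each parameter in turn."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        unit = np.zeros_like(theta)
        unit[k] = eps
        # kth partial derivative of J(theta) via a central difference
        grad[k] = (evaluate_policy(theta + unit) - evaluate_policy(theta - unit)) / (2 * eps)
    return grad

def gradient_ascent_step(evaluate_policy, theta, alpha=0.1):
    """Delta theta = alpha * grad_theta J(theta)."""
    return theta + alpha * finite_difference_gradient(evaluate_policy, theta)

# Stand-in objective for J(theta): a quadratic with its maximum at theta = 1
J = lambda theta: -np.sum((theta - 1.0) ** 2)
theta = np.zeros(3)
for _ in range(100):
    theta = gradient_ascent_step(J, theta)
print(theta)   # approaches [1. 1. 1.]
```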

🥏 Policy Gradient Theorem

  • The policy gradient generalizes the likelihood ratio approach to multi-step MDPs
  • For any differentiable policy $\pi_\theta(s, a)$ and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$$
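
As a brief reminder of where the expectation form comes from (following the one-step argument in the lecture), the likelihood-ratio (score function) trick rewrites the gradient of the policy as

$$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a)\, \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)$$

so for a one-step MDP with $J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}_{s,a}$ we get $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, r]$. The theorem states that replacing the instantaneous reward $r$ with the long-term value $Q^{\pi_\theta}(s, a)$ gives the same form for multi-step MDPs.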

🥏 Monte-Carlo Policy Gradient: REINFORCE

  • Using the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$

$$\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$

  • Pseudocode (used in AlphaGo); a minimal Python sketch follows this list

https://datascience.stackexchange.com/questions/48872/reinforce-algorithm-with-discounted-rewards-where-does-gammat-in-the-update-c

  • Huge variance problem -> Solved by Actor-Critic
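
Below is a minimal Python sketch of one REINFORCE update pass, implementing $\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$ for the linear softmax policy from the earlier sketch (it reuses `features`, `softmax_policy`, and `N_ACTIONS` defined there). The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the hyperparameters are assumptions; the extra $\gamma^t$ factor discussed in the linked thread is omitted for brevity.

```python
import numpy as np

def score(state, action, theta):
    """grad_theta log pi_theta(s, a) for a linear softmax policy:
    phi(s, a) - sum_b pi_theta(s, b) phi(s, b)."""
    probs = softmax_policy(state, theta)
    expected_phi = sum(probs[b] * features(state, b) for b in range(N_ACTIONS))
    return features(state, action) - expected_phi

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99):
    """Run one episode with pi_theta, then update theta from the sampled returns."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = np.random.choice(N_ACTIONS, p=softmax_policy(state, theta))
        next_state, reward, done = env.step(action)     # assumed interface
        states.append(state); actions.append(action); rewards.append(reward)
        state = next_state

    # Work backwards so v_t (the discounted return from time t) is easy to accumulate
    v = 0.0
    for t in reversed(range(len(rewards))):
        v = rewards[t] + gamma * v
        theta = theta + alpha * score(states[t], actions[t], theta) * v
    return theta
```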

🥏 Actor-Critic Policy Gradient: Actor-Critic

  • The REINFORCE algorithm still has high variance; Actor-Critic reduces the variance by using a critic that estimates the action-value function
  • Learn both an action-value function and a policy
  • Actor-critic algorithms maintain two sets of parameters
    • Critic) Updates action-value function parameters w
    • Actor) Updates policy parameters θ, in direction suggested by critic
  • Actor-critic algorithms follow an approximate policy gradient

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\right]$$

$$\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$$
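
To tie the two updates together, here is a minimal sketch of a one-step action-value actor-critic in the spirit of the lecture's QAC algorithm: a linear critic $Q_w(s, a) = \phi(s, a)^\top w$ updated with a Sarsa-style TD error, and an actor updated in the direction $\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$. It again reuses `features`, `softmax_policy`, `score`, and `N_ACTIONS` from the sketches above; the environment interface and step sizes are assumptions.

```python
import numpy as np

def actor_critic_episode(env, theta, w, alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One episode of one-step actor-critic with a linear critic Q_w(s, a) = phi(s, a)^T w."""
    state, done = env.reset(), False
    action = np.random.choice(N_ACTIONS, p=softmax_policy(state, theta))
    while not done:
        next_state, reward, done = env.step(action)      # assumed interface
        q = features(state, action) @ w                  # current estimate Q_w(s, a)

        if done:
            td_target = reward
        else:
            next_action = np.random.choice(N_ACTIONS, p=softmax_policy(next_state, theta))
            td_target = reward + gamma * features(next_state, next_action) @ w

        # Critic: move Q_w(s, a) toward the TD target (Sarsa-style update of w)
        w = w + alpha_critic * (td_target - q) * features(state, action)

        # Actor: Delta theta = alpha * grad_theta log pi_theta(s, a) * Q_w(s, a)
        theta = theta + alpha_actor * score(state, action, theta) * q

        if not done:
            state, action = next_state, next_action
    return theta, w
```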