This post is my write-up of David Silver's Reinforcement Learning course.
In the last lecture, we approximated the value or action-value function using parameters θ, and the policy was generated directly from that value function. In this lecture, we directly parameterise a stochastic policy:
$\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
The taxonomy below shows where value-based and policy-based RL sit.

- Value-based RL: learnt value function and implicit (greedy or ϵ-greedy) policy
    - Ex. DQN
- Policy-based RL: no value function, but a learnt (stochastic) policy
    - Ex. REINFORCE
    - The policy can be a softmax policy or a Gaussian policy (see the sketch after this list)
    - Advantages
        - Better convergence properties
        - Effective in high-dimensional or continuous action spaces
        - Can learn stochastic policies
    - Disadvantages
        - Typically converges to a local rather than a global optimum
        - Evaluating a policy is typically inefficient and high variance
- Value-based + Policy-based RL: learnt value function and learnt policy
    - Ex. Actor-Critic
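As a concrete illustration of a parameterised stochastic policy, here is a minimal sketch of a softmax policy over linear action preferences. The feature map `phi`, the two-action setup, and the numbers are purely illustrative assumptions, not anything from the lecture.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) = exp(phi(s, a) . theta) / sum_b exp(phi(s, b) . theta)."""
    prefs = np.array([phi(s, a) @ theta for a in actions])
    prefs -= prefs.max()              # subtract the max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

# Toy illustration: two actions, features are the action one-hot scaled by a scalar state.
actions = [0, 1]
phi = lambda s, a: s * np.eye(len(actions))[a]
theta = np.array([0.5, -0.5])
print(softmax_policy(theta, phi, 1.0, actions))   # ~[0.73, 0.27]
```

A Gaussian policy works the same way, except the parameters define the mean (and possibly variance) of a continuous action distribution instead of discrete action preferences.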
🥏 Policy Gradient with Policy Objective Functions
- Goal: given a policy $\pi_\theta(s, a)$ with parameters θ, find the θ that maximizes $J(\theta)$
- $J(\theta)$ measures the quality of the policy $\pi_\theta$; there are three candidate objective functions, and the optimization works the same whichever one we choose, so let $J(\theta)$ be any policy objective function
- In episodic environments, we can use the start value: $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
- In continuing environments, we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$, where $d^{\pi_\theta}(s)$ is the stationary distribution of states under $\pi_\theta$
- In continuing environments, we can also use the average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a)\, \mathcal{R}^a_s$
- Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the objective w.r.t. the policy parameters θ: $\Delta\theta = \alpha \nabla_\theta J(\theta)$, where α is a step-size parameter and $\nabla_\theta J(\theta)$ is the policy gradient, the vector of partial derivatives (a toy gradient-ascent sketch follows below)

$\nabla_\theta J(\theta) = \left( \dfrac{\partial J(\theta)}{\partial \theta_1},\ \dots,\ \dfrac{\partial J(\theta)}{\partial \theta_n} \right)^{\!\top}$
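The update rule above is plain gradient ascent. Here is a minimal sketch, assuming only that $J(\theta)$ can be evaluated; the toy quadratic objective stands in for a real policy objective, and the gradient is estimated by central finite differences purely for illustration.

```python
import numpy as np

def gradient_ascent(J, theta, alpha=0.1, steps=200, eps=1e-5):
    """Repeatedly apply delta theta = alpha * grad J(theta),
    estimating the gradient by central finite differences."""
    for _ in range(steps):
        grad = np.array([(J(theta + eps * e) - J(theta - eps * e)) / (2 * eps)
                         for e in np.eye(len(theta))])
        theta = theta + alpha * grad
    return theta

# Toy stand-in for a policy objective, maximised at theta = (1, -2).
J = lambda th: -(th[0] - 1.0) ** 2 - (th[1] + 2.0) ** 2
print(gradient_ascent(J, np.zeros(2)))   # converges towards [1, -2]
```

In practice the gradient is not computed this way; the point of the rest of the lecture is how to estimate $\nabla_\theta J(\theta)$ from experience.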
🥏 Policy Gradient Theorem
- The policy gradient theorem generalises the likelihood ratio approach to multi-step MDPs (the likelihood ratio identity is sketched below)
- For any differentiable policy $\pi_\theta(s, a)$ and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1-\gamma} J_{avV}$, the policy gradient is

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$
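The "likelihood ratio" trick referred to above is the identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$. The short sketch below (my own fill-in, not a full proof) shows how it turns the weighted sum that the theorem produces into an expectation under $\pi_\theta$; the theorem's real content is that $\nabla_\theta J(\theta)$ reduces to this sum, with no extra terms from differentiating $d^{\pi_\theta}$ or $Q^{\pi_\theta}$.

```latex
\begin{align*}
% Likelihood ratio identity (assuming pi_theta(s,a) > 0 and differentiable):
\nabla_\theta \pi_\theta(s,a)
  &= \pi_\theta(s,a)\,\frac{\nabla_\theta \pi_\theta(s,a)}{\pi_\theta(s,a)}
   = \pi_\theta(s,a)\,\nabla_\theta \log \pi_\theta(s,a) \\[6pt]
% Applied inside the theorem, it turns the sum over actions into an
% expectation under the policy (states weighted by d^{pi_theta}):
\sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)
  &= \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\,
       \nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a) \\
  &= \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\,
       Q^{\pi_\theta}(s,a)\right]
\end{align*}
```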
🥏 Monte-Carlo Policy Gradient: REINFORCE
- Using the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$, update the parameters by

$\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$
- Pseudo code (used in AlphaGo); a Python sketch is given below

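A minimal Python sketch of REINFORCE under illustrative assumptions: a gym-style environment whose `reset()` returns a state and whose `step(a)` returns `(next_state, reward, done, info)`, a feature map `phi(s, a)` with the same size as `theta`, and a linear softmax policy. All of these names and the interface are assumptions for the sketch, not part of the lecture.

```python
import numpy as np

def reinforce(env, phi, n_actions, theta, alpha=0.01, gamma=1.0, episodes=1000):
    """Monte-Carlo policy gradient (REINFORCE) with a linear softmax policy."""
    def policy(s):
        prefs = np.array([phi(s, a) @ theta for a in range(n_actions)])
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    for _ in range(episodes):
        # Sample one full episode following pi_theta.
        states, acts, rewards = [], [], []
        s, done = env.reset(), False
        while not done:
            a = np.random.choice(n_actions, p=policy(s))
            s_next, r, done, _ = env.step(a)
            states.append(s); acts.append(a); rewards.append(r)
            s = s_next

        # Walk backwards, accumulating the return v_t, and apply
        # delta theta_t = alpha * grad log pi_theta(s_t, a_t) * v_t.
        v = 0.0
        for t in reversed(range(len(states))):
            v = rewards[t] + gamma * v
            probs = policy(states[t])
            # Score of a linear softmax: phi(s,a) - sum_b pi(s,b) phi(s,b)
            score = phi(states[t], acts[t]) - sum(
                probs[b] * phi(states[t], b) for b in range(n_actions))
            theta = theta + alpha * score * v
    return theta
```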
- Huge variance problem -> Solved by Actor-Critic
🥏 Actor-Critic Policy Gradient
- The REINFORCE algorithm still has high variance; Actor-Critic reduces this variance by introducing a critic that estimates the action-value function
- Learn both the action-value function and the policy
- Actor-critic algorithms maintain two sets of parameters
    - Critic: updates the action-value function parameters w
    - Actor: updates the policy parameters θ, in the direction suggested by the critic
- Actor-critic algorithms follow an approximate policy gradient (a Python sketch of the resulting algorithm is given below)

$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\right]$

$\Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$
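A minimal sketch of a one-step action-value actor-critic along these lines: a linear critic $Q_w(s,a) = x(s,a)^\top w$ updated towards a Sarsa-style TD target, and a softmax actor updated in the direction the critic suggests. The environment interface and the shared feature map `x` are the same illustrative assumptions as in the REINFORCE sketch above.

```python
import numpy as np

def actor_critic(env, x, n_actions, theta, w,
                 alpha=0.01, beta=0.1, gamma=0.99, episodes=500):
    """One-step action-value actor-critic.

    Actor:  softmax policy with preferences x(s, a) . theta
    Critic: linear action-value function Q_w(s, a) = x(s, a) . w
    """
    def policy(s):
        prefs = np.array([x(s, a) @ theta for a in range(n_actions)])
        p = np.exp(prefs - prefs.max())
        return p / p.sum()

    for _ in range(episodes):
        s, done = env.reset(), False
        a = np.random.choice(n_actions, p=policy(s))
        while not done:
            s2, r, done, _ = env.step(a)
            q_sa = x(s, a) @ w
            if done:
                td_target = r
            else:
                a2 = np.random.choice(n_actions, p=policy(s2))
                td_target = r + gamma * (x(s2, a2) @ w)

            # Actor: delta theta = alpha * grad log pi(s,a) * Q_w(s,a)
            probs = policy(s)
            score = x(s, a) - sum(probs[b] * x(s, b) for b in range(n_actions))
            theta = theta + alpha * score * q_sa

            # Critic: move w towards the TD target along the features.
            w = w + beta * (td_target - q_sa) * x(s, a)

            if not done:
                s, a = s2, a2
    return theta, w
```

Using the TD error (rather than $Q_w$ itself) in the actor update, or subtracting a baseline, are the variance-reduction refinements covered later in the lecture; this sketch sticks to the basic form given by the two equations above.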