This post is my summary of David Silver's Reinforcement Learning course.
In the previous post, we saw that solving an unknown MDP involves (1) estimating the value function (Model-Free Prediction) and (2) optimizing the value function (Model-Free Control). In this lecture we tackle (2), optimizing the value function, by building on the prediction methods from (1), namely MC and TD.
So the goal is to carry out policy iteration in the model-free setting:
- Policy evaluation: so far we have MC and TD.
- Policy improvement: how do we combine greedy policy improvement with those evaluation methods? The answers are Monte-Carlo control and Temporal-Difference control.
🥁 On-policy vs. Off-policy
The difference between the two did not feel clear to me yet, so here it is spelled out.
- On-policy learning: Learn about policy π from experience sampled from π
- Off-policy learning: Learn about policy π from experience sampled from μ
- Following the behavior policy \mu(a|s): \{ S_1, A_1, R_2, \cdots, S_T \} \sim \mu
- Why is this important?
- Learn about optimal policy while following exploratory policy
- Learn from observing humans or other agents
- Re-use experience generated from old policies π1,π2,⋯,πt−1
- Learn about multiple policies while following one policy
🥁 ϵ-Greedy policy improvement
- Before) Greedy Policy Improvement
- Greedy policy improvement over Q(s,a) is model-free: \pi'(s) = \arg \max_{a \in \mathcal{A}} Q(s,a)
- No exploration
- After) \epsilon-Greedy Policy Improvement
- Simplest idea for ensuring continual exploration.
- All m actions are tried with non-zero probability.
- With probability 1-\epsilon choose the greedy action; with probability \epsilon choose an action uniformly at random.
$\pi (a|s) = \begin{cases} \epsilon / m + 1 - \epsilon & \textrm{if } a = \arg \max_{a' \in \mathcal{A}} Q(s,a') \\ \epsilon / m & \textrm{otherwise} \end{cases}$
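As a quick sketch (not from the lecture), \epsilon-greedy action selection over a tabular Q can be written as follows; Q is assumed to be indexable as Q[state], returning an array of action values:

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions, rng):
    """Pick an action epsilon-greedily from a tabular Q (Q[state] -> array of action values)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform over all m actions
    return int(np.argmax(Q[state]))           # exploit: greedy action

# usage: epsilon_greedy_action(Q, s, 0.1, n_actions=4, rng=np.random.default_rng())
```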
🥁 On-policy Monte-Carlo control: GLIE
Greedy in the Limit with Infinite Exploration (GLIE): all state-action pairs are explored infinitely many times, and the policy converges to a greedy policy.
For every episode:
- Policy evaluation: Monte-Carlo policy evaluation, Q \approx q_\pi
- Sample kth episode using \pi: \big\{ S_1, A_1, R_2, \cdots , S_T \big\} \sim \pi
- For each state S_t and action A_t in the episode,
- N(S_t, A_t) \leftarrow N(S_t, A_t) + 1
- Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} (G_t - Q (S_t, A_t))
- Policy improvement: \epsilon-greedy policy improvement
- Improve the policy based on the new action-value function: \epsilon \leftarrow \frac{1}{k}, \pi \leftarrow \epsilon \textrm{-greedy}(Q)
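A rough sketch of GLIE Monte-Carlo control, assuming a minimal tabular environment interface (env.reset() -> state, env.step(action) -> (next_state, reward, done), env.n_actions); these names are illustrative, not from the lecture:

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """Every-visit GLIE Monte-Carlo control with epsilon_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.n_actions))  # action-value estimates
    N = defaultdict(lambda: np.zeros(env.n_actions))  # visit counts N(S, A)

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                             # GLIE schedule
        # Sample the k-th episode with the current epsilon-greedy policy
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(env.n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Policy evaluation: incremental every-visit MC update
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]
    return Q
```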
🥁 On-policy Temporal-Difference control: SARSA
For every time-step:
- Policy evaluation: SARSA, Q \approx q_\pi
- Update the action-value function with the SARSA target: Q(S, A) \leftarrow Q(S,A) + \alpha (R + \gamma Q(S', A')-Q(S,A))
- Policy improvement: \epsilon-greedy policy improvement
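As a minimal sketch, a single SARSA backup on a tabular Q (the function name and defaults are mine, not the lecture's):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA backup: move Q(S, A) towards R + gamma * Q(S', A')."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
```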
🥁 Deeper into SARSA
- n-step SARSA updates Q(s,a) towards the n-step Q-return: Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{(n)}-Q(S_t, A_t))
- Forward-view SARSA(\lambda): Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (q_t^{\lambda}-Q(S_t, A_t))
- Backward-view SARSA(\lambda): $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t E_t(s,a)$
- TD error: \delta_t = R_{t+1}+\gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
- Eligibility trace for each state-action pair: E_t (s,a) = \gamma \lambda E_{t-1}(s,a) + 1(S_t=s, A_t=a)
- Combining the TD error with the eligibility trace gives the full SARSA(\lambda) algorithm (a sketch follows below).
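A minimal sketch of one backward-view SARSA(\lambda) step on NumPy arrays Q and E of shape [n_states, n_actions] (an assumed tabular representation):

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One backward-view SARSA(lambda) step: update every (s, a) in proportion to its trace."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error delta_t
    E[s, a] += 1.0                                   # accumulate the eligibility trace
    Q += alpha * delta * E                           # online update of all state-action pairs
    E *= gamma * lam                                 # decay all traces for the next step
    return Q, E
```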

🥁 Off-Policy Learning: Importance Sampling
- Given a target policy \pi (a|s) and a behavior policy \mu(a|s), we can estimate expectations under \pi from samples generated by \mu using importance sampling:
$ \mathbb{E}_{X \sim P} [f(X)] = \mathbb{E}_{X \sim Q} [\frac{P(X)}{Q(X)}f(X)]$
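A quick numerical check of this identity on a toy discrete distribution (the numbers are arbitrary and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([0, 1, 2])
p = np.array([0.2, 0.5, 0.3])    # target distribution P
q = np.array([0.5, 0.3, 0.2])    # sampling distribution Q
f = np.array([1.0, 4.0, 9.0])    # arbitrary f(x)

samples = rng.choice(xs, size=200_000, p=q)                    # sample from Q only
is_estimate = np.mean((p[samples] / q[samples]) * f[samples])  # reweight by P(X)/Q(X)
print(is_estimate, float(np.sum(p * f)))                       # both approximately 4.9 = E_P[f(X)]
```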
🥁 Importance Sampling for Off-Policy Monte-Carlo
To evaluate \pi using returns generated from \mu, we multiply importance-sampling corrections along the whole episode:
$G^{\frac{\pi}{\mu}}_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_{T}|S_{T})}{\mu(A_{T}|S_{T})}G_t$
Update value towards corrected return
V(S_t) \leftarrow V(S_t) + \alpha (G_t^{\frac{\pi}{\mu}}-V(S_t))
But this is hard to use in practice: (1) it is undefined when \mu is zero where \pi is non-zero, and (2) multiplying ratios over a whole episode can dramatically increase variance.
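For concreteness, a sketch of the corrected return for one episode; pi_prob and mu_prob are assumed callables returning \pi(a|s) and \mu(a|s). The long product of ratios in the loop is exactly what makes the variance blow up:

```python
def corrected_return(episode, pi_prob, mu_prob, gamma=1.0):
    """Importance-sampling-corrected return G^{pi/mu} from the start of an episode
    generated by the behaviour policy mu; episode is a list of (state, action, reward)."""
    G, rho = 0.0, 1.0
    for k, (s, a, r) in enumerate(episode):
        rho *= pi_prob(a, s) / mu_prob(a, s)  # multiply corrections along the whole episode
        G += (gamma ** k) * r                 # plain discounted return
    return rho * G
```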
🥁 Importance Sampling for Off-Policy TD
TD can also be used off-policy, and it only needs a single importance-sampling correction (for the one step it bootstraps over):
V(S_t) \leftarrow V(S_t) + \alpha (\frac{\pi (A_t|S_t)}{\mu (A_t|S_t)} (R_{t+1}+ \gamma V(S_{t+1}))-V(S_t))
Because policies only need to be similar over a single step, this has much lower variance than Monte-Carlo importance sampling.
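A sketch of this off-policy TD(0) backup, where pi_prob and mu_prob stand for \pi(A_t|S_t) and \mu(A_t|S_t), the probabilities of the action actually taken (illustrative names):

```python
def off_policy_td_update(V, s, r, s_next, pi_prob, mu_prob, alpha=0.1, gamma=0.99):
    """One off-policy TD(0) backup with a single importance-sampling correction."""
    corrected_target = (pi_prob / mu_prob) * (r + gamma * V[s_next])
    V[s] += alpha * (corrected_target - V[s])
```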
🥁 Off-Policy Control with Q-Learning
- In Q-learning, the next action is chosen using the behavior policy, A_{t+1} \sim \mu(\cdot | S_{t+1}). But we also consider an alternative successor action A' \sim \pi (\cdot|S_{t+1}) and update Q(S_t, A_t) towards the value of the alternative action:
Q(S_t, A_t) \leftarrow Q(S_t,A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A')-Q(S_t, A_t))
- So we can allow both behavior and target policies to improve
- The target policy \pi is greedy \pi(S_{t+1})= \arg \max_{a'} Q(S_{t+1}, a')
- The behavior policy \mu is \epsilon-greedy
- The Q-learning target then simplifies:
R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \gamma Q(S_{t+1}, \arg \max_{a'} Q(S_{t+1}, a')) = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
- This update rule is the Q-learning algorithm (a sketch follows below).
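A minimal sketch of one Q-learning backup on a tabular Q indexable as Q[state] (function name and defaults are mine):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning backup: bootstrap from the greedy action in S', regardless of the action actually taken."""
    td_target = r + gamma * np.max(Q[s_next])  # max_{a'} Q(S_{t+1}, a')
    Q[s][a] += alpha * (td_target - Q[s][a])
```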

🥁 Summary of SARSA
