[David Silver] 6. Value Function Approximation: Experience Replay, Deep Q-Network (DQN)


This post is my write-up of David Silver's Reinforcement Learning course.

This lecture presents the solution for large MDPs: function approximation. We have to scale up the model-free methods for prediction and control, so in Lectures 6 and 7 we will learn how to do that.

  • How have we dealt with small (not large) MDPs so far? We have represented the value function by a lookup table; we can simply think of it as a matrix.
  • What counts as a large MDP? Reinforcement learning has to solve large problems, where (1) there are too many states or actions to store in memory, or (2) it is too slow to learn the value of each state individually.

🐕‍🦺 Solution for large MDPs: Function Approximation

To estimate the value function with function approximation,

$\hat{v}(s, \textbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s, a, \textbf{w}) \approx q_\pi(s, a)$

To make function approximation work, we will update the parameters $\textbf{w}$ using MC or TD learning. With function approximation, we can generalize from seen states to unseen states.

🐕‍🦺 Types of Value Function Approximation

Assume the approximating function is $f$ and its parameters are $\textbf{w}$ (a small code sketch of the action-value forms follows the list below):

  • State-value function $\hat{v}(s, \textbf{w})$
    • $f(s, \textbf{w}) = \hat{v}(s, \textbf{w})$
  • Action-value function $\hat{q}(s, a, \textbf{w})$
    • $f(s,\textbf{w}) = \big\{ \hat{q} (s, a_1, \textbf{w}), \hat{q}(s, a_2, \textbf{w}), \cdots, \hat{q}(s, a_m, \textbf{w}) \big\}$
    • $f(s, a,\textbf{w}) = \hat{q} (s, a, \textbf{w})$
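
As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the two action-value forms above, assuming a simple linear model over hypothetical hand-made features: one function scores a single $(s, a)$ pair ("action in"), the other returns one value per action ("action out").

```python
import numpy as np

# Illustrative sketch only: a linear model standing in for the approximator f.
n_features, n_actions = 8, 3
rng = np.random.default_rng(0)

# "Action in": f(s, a, w) = q_hat(s, a, w), one scalar per (state, action) pair
w_in = rng.normal(size=n_features)
def q_action_in(x_sa, w=w_in):
    # x_sa: feature vector of the (state, action) pair
    return x_sa @ w

# "Action out": f(s, w) = [q_hat(s, a_1, w), ..., q_hat(s, a_m, w)]
W_out = rng.normal(size=(n_actions, n_features))
def q_action_out(x_s, W=W_out):
    # x_s: feature vector of the state only; returns one value per action
    return W @ x_s
```

The "action out" form is convenient for control, since a single forward pass yields $\hat{q}$ for every action; DQN later uses this architecture.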

There are many function approximators, e.g. linear combinations of features, neural networks (non-linear combinations of features), decision trees, etc. For now, we will focus on linear combinations of features and neural networks. Furthermore, we require a training method that is suitable for non-stationary, non-iid data.

To use function approximation, we have to represent the state. The state is represented by a feature vector

$\textbf{x}(S) = \begin{pmatrix} x_1(S) \\ \vdots \\ x_n(S) \end{pmatrix}$
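
For example, a feature vector might stack a few hand-crafted measurements of the state. The sketch below is a hypothetical illustration (the position/velocity features are my own choice, not from the lecture):

```python
import numpy as np

# Hypothetical features x(S) for a state described by position and velocity
def feature_vector(position, velocity):
    return np.array([
        position,             # x1(S)
        velocity,             # x2(S)
        position * velocity,  # x3(S): simple interaction feature
        1.0,                  # constant bias feature
    ])

x_s = feature_vector(position=0.4, velocity=-0.07)  # x(S) for one state
```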

🐕‍🦺  Incremental Methods for Value Function Approximation

We use value function approximation for policy evaluation. With an approximate value function, policy iteration becomes:

  • Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, \textbf{w}) \approx q_\pi$
  • Policy improvement: $\epsilon$-greedy policy improvement (a small sketch of $\epsilon$-greedy action selection follows below)
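
For the improvement step, a minimal sketch of $\epsilon$-greedy action selection over the approximate action values (the function and variable names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # q_values: the vector q_hat(s, ., w) for every action in state s
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniformly random action
    return int(np.argmax(q_values))              # exploit: greedy action

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.5, -0.2]), epsilon=0.1, rng=rng)
```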

🐕‍🦺 Value Function Approximation by Stochastic Gradient Descent

  • Goal: find the parameter vector $\textbf{w}$ minimizing the mean-squared error (MSE) between the approximate value $\hat{v}(s, \textbf{w})$ and the true value $v_\pi(s)$ (a short derivation of the resulting update follows this list)

$J(\textbf{w}) = \mathbb{E}_\pi \big[ (v_\pi(S) - \hat{v}(S, \textbf{w}))^2 \big]$

  • Stochastic gradient descent works as: $\Delta \textbf{w} = \alpha \, (v_\pi(S) - \hat{v}(S, \textbf{w})) \, \nabla_\textbf{w} \hat{v}(S, \textbf{w})$
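
This update comes from following the negative gradient of $J(\textbf{w})$ with step size $\frac{1}{2}\alpha$:

$\Delta \textbf{w} = -\frac{1}{2} \alpha \, \nabla_\textbf{w} J(\textbf{w}) = \alpha \, \mathbb{E}_\pi \big[ (v_\pi(S) - \hat{v}(S, \textbf{w})) \, \nabla_\textbf{w} \hat{v}(S, \textbf{w}) \big]$

Stochastic gradient descent simply samples this gradient, i.e. it drops the expectation and uses the current sample $S$.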

🐕‍🦺 Linear Value Function Approximation

  • $\hat{v}(S, \textbf{w}) = \textbf{x}(S)^\top \textbf{w} = \sum_{j=1}^{n} x_j(S) \, w_j$
  • Stochastic gradient descent works on the update rule $\Delta \textbf{w} = \alpha \, (v_\pi(S) - \hat{v}(S, \textbf{w})) \, \textbf{x}(S)$ (see the sketch after this list)
    • In practice, we substitute a target for $v_\pi(S)$
    • For MC, the target is the return $G_t$
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \textbf{w})$
    • For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
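
A minimal NumPy sketch of the MC and TD(0) prediction updates for the linear case (function and variable names are my own; `alpha` is the step size, `gamma` the discount factor):

```python
import numpy as np

def mc_update(w, x_s, G_t, alpha):
    # Monte-Carlo target: the observed return G_t
    return w + alpha * (G_t - x_s @ w) * x_s

def td0_update(w, x_s, reward, x_s_next, gamma, alpha, done):
    # TD(0) target: R_{t+1} + gamma * v_hat(S_{t+1}, w), with no bootstrap at terminal states
    target = reward + (0.0 if done else gamma * (x_s_next @ w))
    return w + alpha * (target - x_s @ w) * x_s
```

Note that the TD(0) update treats the bootstrapped part of the target as a constant (it does not differentiate through $\hat{v}(S_{t+1}, \textbf{w})$), which is why it is often called a semi-gradient method.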

🐕‍🦺 Action-Value Function Approximation by Stochastic Gradient Descent

  • Goal: find the parameter vector $\textbf{w}$ minimizing the mean-squared error (MSE) between the approximate value $\hat{q}(S, A, \textbf{w})$ and the true value $q_\pi(S, A)$

$J(\textbf{w}) = \mathbb{E}_\pi \big[ (q_\pi(S, A) - \hat{q}(S, A, \textbf{w}))^2 \big]$

  • Stochastic gradient descent works as: $\Delta \textbf{w} = \alpha \, (q_\pi(S, A) - \hat{q}(S, A, \textbf{w})) \, \nabla_\textbf{w} \hat{q}(S, A, \textbf{w})$

🐕‍🦺 Linear Action-Value Function Approximation

  • $\hat{q}(S, A, \textbf{w}) = \textbf{x}(S, A)^\top \textbf{w} = \sum_{j=1}^{n} x_j(S, A) \, w_j$
  • Stochastic gradient descent works on the update rule $\Delta \textbf{w} = \alpha \, (q_\pi(S, A) - \hat{q}(S, A, \textbf{w})) \, \textbf{x}(S, A)$ (a Sarsa-style sketch follows this list)
    • In practice, we substitute a target for $q_\pi(S, A)$
    • For MC, the target is the return $G_t$
      • The return $G_t$ is an unbiased but noisy sample of the true value $q_\pi(S_t, A_t)$
      • This can be treated as supervised learning on "training data": $\langle (S_1, A_1), G_1 \rangle, \langle (S_2, A_2), G_2 \rangle, \cdots, \langle (S_T, A_T), G_T \rangle$
      • MC evaluation converges to a local optimum
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \textbf{w})$
      • The TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \textbf{w})$ is a biased sample of the true value $q_\pi(S_t, A_t)$
      • This can be treated as supervised learning on "training data": $\langle (S_1, A_1), R_2 + \gamma \hat{q}(S_2, A_2, \textbf{w}) \rangle, \cdots, \langle (S_{T-1}, A_{T-1}), R_T \rangle$
      • Linear TD(0) converges (close) to the global optimum
    • For TD($\lambda$), the target is the $\lambda$-return $q_t^\lambda$
      • The $\lambda$-return $q_t^\lambda$ is also a biased sample of the true value $q_\pi(S_t, A_t)$
      • This can be treated as supervised learning on "training data": $\langle (S_1, A_1), q_1^\lambda \rangle, \langle (S_2, A_2), q_2^\lambda \rangle, \cdots, \langle (S_{T-1}, A_{T-1}), q_{T-1}^\lambda \rangle$
  • Convergence of control algorithms: with linear function approximation, MC control and Sarsa converge (they chatter around a near-optimal value function).
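
Putting the pieces together, here is a minimal sketch of linear Sarsa control ($\epsilon$-greedy improvement plus the TD-style $\hat{q}$ update). The `env` object and `features(s, a)` function are assumed, hypothetical helpers: `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`.

```python
import numpy as np

def run_sarsa_episode(env, features, w, n_actions, gamma=0.99, alpha=0.01,
                      epsilon=0.1, rng=np.random.default_rng(0)):
    def pick(s):
        # epsilon-greedy over the approximate action values q_hat(s, ., w)
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([features(s, a) @ w for a in range(n_actions)]))

    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = pick(s_next)
        # TD target R + gamma * q_hat(S', A', w); no bootstrap at terminal states
        target = r + (0.0 if done else gamma * (features(s_next, a_next) @ w))
        w = w + alpha * (target - features(s, a) @ w) * features(s, a)
        s, a = s_next, a_next
    return w
```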

David Silver lecture 6.

 

🐕‍🦺  Batch Methods for Function Approximation

Gradient descent is simple and appealing, but it is not sample efficient. Batch methods instead seek the value function that best fits all of the agent's experience ("training data").

🐕‍🦺  Stochastic Gradient Descent with Experience Replay

Given experience consisting of $\langle$state, value$\rangle$ pairs $\mathcal{D} = \big\{ \langle s_1, v_1^\pi \rangle, \langle s_2, v_2^\pi \rangle, \cdots, \langle s_T, v_T^\pi \rangle \big\}$; this dataset is called the replay memory.

Repeat:

  1. Sample a state, value pair from the experience: $\langle s, v^\pi \rangle \sim \mathcal{D}$
  2. Apply a stochastic gradient descent update: $\Delta \textbf{w} = \alpha \, (v^\pi - \hat{v}(s, \textbf{w})) \, \nabla_\textbf{w} \hat{v}(s, \textbf{w})$

This converges to the least-squares solution $\textbf{w}^\pi = \arg\min_{\textbf{w}} LS(\textbf{w})$.
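
A small NumPy sketch of this loop with a toy replay memory of $\langle$feature-vector, value$\rangle$ pairs (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, alpha = 4, 0.01
# Toy replay memory D of <x(s), v_pi> pairs (random placeholders for illustration)
D = [(rng.normal(size=n_features), rng.normal()) for _ in range(100)]
w = np.zeros(n_features)

for _ in range(1000):
    x_s, v_pi = D[rng.integers(len(D))]    # 1. sample <s, v_pi> ~ D
    w += alpha * (v_pi - x_s @ w) * x_s    # 2. SGD step toward the sampled value
```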

🐕‍🦺  Experience Replay in Deep Q-Networks (DQN)

Deep Q-Networks (DQN) uses experience replay and fixed Q-targets.

  • Dataset generation
    1. Take action $a_t$ according to an $\epsilon$-greedy policy
    2. Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$
  • Train the network
    1. Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
    2. Compute Q-learning targets w.r.t. old, fixed parameters $\textbf{w}^-$
    3. Optimize the MSE between the Q-network and the Q-learning targets

$\mathcal{L}_i(\textbf{w}_i) = \mathbb{E}_{s, a, r, s' \sim \mathcal{D}_i} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \textbf{w}_i^-) - Q(s, a; \textbf{w}_i) \big)^2 \Big]$
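
To make the loss concrete, here is a toy sketch of one DQN-style update with experience replay and fixed Q-targets. The "network" is just a linear map $Q(s, \cdot; W) = W\,\textbf{x}(s)$ so the sketch stays self-contained; all names and the data are illustrative assumptions, not the original DQN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, alpha = 4, 2, 0.99, 0.01
W = rng.normal(size=(n_actions, n_features))   # online parameters w_i
W_target = W.copy()                            # old, fixed parameters w_i^-

def q(x_s, params):
    return params @ x_s  # one value per action

def dqn_step(batch, W, W_target):
    # One SGD step on the mini-batch MSE between Q-network and Q-learning targets
    grad = np.zeros_like(W)
    for x_s, a, r, x_s_next, done in batch:
        target = r + (0.0 if done else gamma * np.max(q(x_s_next, W_target)))
        td_error = target - q(x_s, W)[a]
        grad[a] += td_error * x_s              # gradient of the squared error w.r.t. row W[a]
    return W + alpha * grad / len(batch)

# Toy mini-batch of transitions (s, a, r, s', done) sampled from replay memory
batch = [(rng.normal(size=n_features), int(rng.integers(n_actions)), 1.0,
          rng.normal(size=n_features), False) for _ in range(32)]
W = dqn_step(batch, W, W_target)
# Every C steps, refresh the fixed targets: W_target = W.copy()
```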