[David Silver] 6. Value Function Approximation: Experience Replay, Deep Q-Network (DQN)


This post is my write-up of David Silver's Reinforcement Learning course.

This lecture presents the solution for large MDPs: function approximation. We have to scale up the model-free methods for prediction and control, so in Lectures 6 and 7 we will learn how to do that.

  • How have we dealt with small (not large) MDPs so far? We have represented the value function by a lookup table; we can simply think of it as a matrix.
  • What counts as a large MDP? Reinforcement learning has to solve large problems, where (1) there are too many states or actions to store in memory, or (2) it is too slow to learn the value of each state individually.

🐕‍🦺 Solution for large MDPs: Function Approximation

To estimate the value function with function approximation,

$\hat{v}(s, \textbf{w}) \approx v_\pi(s) \quad \text{or} \quad \hat{q}(s, a, \textbf{w}) \approx q_\pi(s, a)$

To make function approximation work, we will update the parameters $\textbf{w}$ using MC or TD learning. With function approximation, we can generalize from seen states to unseen states.

🐕‍🦺 Types of Value Function Approximation

Assume the approximating function is $f$ and its parameters are $\textbf{w}$ (a small code sketch of the action-value forms follows the list below):

  • State-value function $\hat{v}(s, \textbf{w})$
    • $f(s, \textbf{w}) = \hat{v}(s, \textbf{w})$
  • Action-value function $\hat{q}(s, a, \textbf{w})$
    • $f(s,\textbf{w}) = \big\{ \hat{q} (s, a_1, \textbf{w}), \hat{q}(s, a_2, \textbf{w}), \cdots, \hat{q}(s, a_m, \textbf{w}) \big\}$
    • $f(s, a,\textbf{w}) = \hat{q} (s, a, \textbf{w})$
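
As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the two action-value forms above, assuming a simple linear model over hypothetical hand-made features: one function scores a single $(s, a)$ pair ("action in"), the other returns one value per action ("action out").

```python
import numpy as np

# Illustrative sketch only: a linear model standing in for the approximator f.
n_features, n_actions = 8, 3
rng = np.random.default_rng(0)

# "Action in": f(s, a, w) = q_hat(s, a, w), one scalar per (state, action) pair
w_in = rng.normal(size=n_features)
def q_action_in(x_sa, w=w_in):
    # x_sa: feature vector of the (state, action) pair
    return x_sa @ w

# "Action out": f(s, w) = [q_hat(s, a_1, w), ..., q_hat(s, a_m, w)]
W_out = rng.normal(size=(n_actions, n_features))
def q_action_out(x_s, W=W_out):
    # x_s: feature vector of the state only; returns one value per action
    return W @ x_s
```

The "action out" form is convenient for control, since a single forward pass yields $\hat{q}$ for every action; DQN later uses this architecture.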

There are many function approximators, e.g. linear combinations of features, neural networks (non-linear combinations of features), decision trees, etc. For now, we will focus on linear combinations of features and neural networks. Furthermore, we require a training method that is suitable for non-stationary, non-iid data.

To use function approximation, we have to represent the state. The state is represented by a feature vector

$\textbf{x}(S) = \begin{pmatrix} x_1(S) \\ \vdots \\ x_n(S) \end{pmatrix}$
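
For example, a feature vector might stack a few hand-crafted measurements of the state. The sketch below is a hypothetical illustration (the position/velocity features are my own choice, not from the lecture):

```python
import numpy as np

# Hypothetical features x(S) for a state described by position and velocity
def feature_vector(position, velocity):
    return np.array([
        position,             # x1(S)
        velocity,             # x2(S)
        position * velocity,  # x3(S): simple interaction feature
        1.0,                  # constant bias feature
    ])

x_s = feature_vector(position=0.4, velocity=-0.07)  # x(S) for one state
```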

🐕‍🦺  Incremental Methods for Value Function Approximation

We use value function approximation for policy evaluation. With an approximate value function, policy iteration becomes:

  • Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, \textbf{w}) \approx q_\pi$
  • Policy improvement: $\epsilon$-greedy policy improvement (a small sketch of $\epsilon$-greedy action selection follows below)
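
For the improvement step, a minimal sketch of $\epsilon$-greedy action selection over the approximate action values (the function and variable names are illustrative):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    # q_values: the vector q_hat(s, ., w) for every action in state s
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniformly random action
    return int(np.argmax(q_values))              # exploit: greedy action

rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.1, 0.5, -0.2]), epsilon=0.1, rng=rng)
```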

🐕‍🦺 Value Function Approximation by Stochastic Gradient Descent

  • Goal: find the parameter vector $\textbf{w}$ minimizing the mean-squared error (MSE) between the approximate value $\hat{v}(s, \textbf{w})$ and the true value $v_\pi(s)$ (a short derivation of the resulting update follows this list)

$J(\textbf{w}) = \mathbb{E}_\pi \big[ (v_\pi(S) - \hat{v}(S, \textbf{w}))^2 \big]$

  • Stochastic gradient descent works as: $\Delta \textbf{w} = \alpha \, (v_\pi(S) - \hat{v}(S, \textbf{w})) \, \nabla_\textbf{w} \hat{v}(S, \textbf{w})$
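
This update comes from following the negative gradient of $J(\textbf{w})$ with step size $\frac{1}{2}\alpha$:

$\Delta \textbf{w} = -\frac{1}{2} \alpha \, \nabla_\textbf{w} J(\textbf{w}) = \alpha \, \mathbb{E}_\pi \big[ (v_\pi(S) - \hat{v}(S, \textbf{w})) \, \nabla_\textbf{w} \hat{v}(S, \textbf{w}) \big]$

Stochastic gradient descent simply samples this gradient, i.e. it drops the expectation and uses the current sample $S$.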

🐕‍🦺 Linear Value Function Approximation

  • $\hat{v}(S, \textbf{w}) = \textbf{x}(S)^\top \textbf{w} = \sum_{j=1}^{n} x_j(S) \, w_j$
  • Stochastic gradient descent works on the update rule $\Delta \textbf{w} = \alpha \, (v_\pi(S) - \hat{v}(S, \textbf{w})) \, \textbf{x}(S)$ (see the sketch after this list)
    • In practice, we substitute a target for $v_\pi(S)$
    • For MC, the target is the return $G_t$
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \textbf{w})$
    • For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$
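
A minimal NumPy sketch of the MC and TD(0) prediction updates for the linear case (function and variable names are my own; `alpha` is the step size, `gamma` the discount factor):

```python
import numpy as np

def mc_update(w, x_s, G_t, alpha):
    # Monte-Carlo target: the observed return G_t
    return w + alpha * (G_t - x_s @ w) * x_s

def td0_update(w, x_s, reward, x_s_next, gamma, alpha, done):
    # TD(0) target: R_{t+1} + gamma * v_hat(S_{t+1}, w), with no bootstrap at terminal states
    target = reward + (0.0 if done else gamma * (x_s_next @ w))
    return w + alpha * (target - x_s @ w) * x_s
```

Note that the TD(0) update treats the bootstrapped part of the target as a constant (it does not differentiate through $\hat{v}(S_{t+1}, \textbf{w})$), which is why it is often called a semi-gradient method.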

🐕‍🦺 Action-Value Function Approximation by Stochastic Gradient Descent

  • Goal: find the parameter vector $\textbf{w}$ minimizing the mean-squared error (MSE) between the approximate value $\hat{q}(S, A, \textbf{w})$ and the true value $q_\pi(S, A)$

$J(\textbf{w}) = \mathbb{E}_\pi \big[ (q_\pi(S, A) - \hat{q}(S, A, \textbf{w}))^2 \big]$

  • Stochastic gradient descent works as: $\Delta \textbf{w} = \alpha \, (q_\pi(S, A) - \hat{q}(S, A, \textbf{w})) \, \nabla_\textbf{w} \hat{q}(S, A, \textbf{w})$

🐕‍🦺 Linear Action-Value Function Approximation

  • $\hat{q}(S, A, \textbf{w}) = \textbf{x}(S, A)^\top \textbf{w} = \sum_{j=1}^{n} x_j(S, A) \, w_j$
  • Stochastic gradient descent works on the update rule $\Delta \textbf{w} = \alpha \, (q_\pi(S, A) - \hat{q}(S, A, \textbf{w})) \, \textbf{x}(S, A)$ (a Sarsa-style sketch follows this list)
    • In practice, we substitute a target for $q_\pi(S, A)$
    • For MC, the target is the return $G_t$
      • The return $G_t$ is an unbiased but noisy sample of the true value $q_\pi(S_t, A_t)$
      • This can be treated as supervised learning on "training data": $\langle (S_1, A_1), G_1 \rangle, \langle (S_2, A_2), G_2 \rangle, \cdots, \langle (S_T, A_T), G_T \rangle$
      • MC evaluation converges to a local optimum
    • For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \textbf{w})$
      • The TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \textbf{w})$ is a biased sample of the true value $q_\pi(S_t, A_t)$
      • This can be treated as supervised learning on "training data": $\langle (S_1, A_1), R_2 + \gamma \hat{q}(S_2, A_2, \textbf{w}) \rangle, \cdots, \langle (S_{T-1}, A_{T-1}), R_T \rangle$
      • Linear TD(0) converges (close) to the global optimum
    • For TD($\lambda$), the target is the $\lambda$-return $q_t^\lambda$
      • The $\lambda$-return $q_t^\lambda$ is also a biased sample of the true value $q_\pi(S_t, A_t)$
      • This can be treated as supervised learning on "training data": $\langle (S_1, A_1), q_1^\lambda \rangle, \langle (S_2, A_2), q_2^\lambda \rangle, \cdots, \langle (S_{T-1}, A_{T-1}), q_{T-1}^\lambda \rangle$
  • Convergence of control algorithms: with linear function approximation, MC control and Sarsa converge (they chatter around a near-optimal value function).
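
Putting the pieces together, here is a minimal sketch of linear Sarsa control ($\epsilon$-greedy improvement plus the TD-style $\hat{q}$ update). The `env` object and `features(s, a)` function are assumed, hypothetical helpers: `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`.

```python
import numpy as np

def run_sarsa_episode(env, features, w, n_actions, gamma=0.99, alpha=0.01,
                      epsilon=0.1, rng=np.random.default_rng(0)):
    def pick(s):
        # epsilon-greedy over the approximate action values q_hat(s, ., w)
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([features(s, a) @ w for a in range(n_actions)]))

    s = env.reset()
    a = pick(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = pick(s_next)
        # TD target R + gamma * q_hat(S', A', w); no bootstrap at terminal states
        target = r + (0.0 if done else gamma * (features(s_next, a_next) @ w))
        w = w + alpha * (target - features(s, a) @ w) * features(s, a)
        s, a = s_next, a_next
    return w
```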

David Silver lecture 6.

 

🐕‍🦺  Batch Methods for Function Approximation

Gradient descent is simple and appealing, but it is not sample efficient. Batch methods instead seek the value function that best fits all of the agent's experience ("training data").

🐕‍🦺  Stochastic Gradient Descent with Experience Replay

Given experience consisting of $\langle$state, value$\rangle$ pairs $\mathcal{D} = \big\{ \langle s_1, v_1^\pi \rangle, \langle s_2, v_2^\pi \rangle, \cdots, \langle s_T, v_T^\pi \rangle \big\}$; this dataset is called the replay memory.

Repeat:

  1. Sample a state, value pair from the experience: $\langle s, v^\pi \rangle \sim \mathcal{D}$
  2. Apply a stochastic gradient descent update: $\Delta \textbf{w} = \alpha \, (v^\pi - \hat{v}(s, \textbf{w})) \, \nabla_\textbf{w} \hat{v}(s, \textbf{w})$

This converges to the least-squares solution $\textbf{w}^\pi = \arg\min_{\textbf{w}} LS(\textbf{w})$.
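
A small NumPy sketch of this loop with a toy replay memory of $\langle$feature-vector, value$\rangle$ pairs (the data here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, alpha = 4, 0.01
# Toy replay memory D of <x(s), v_pi> pairs (random placeholders for illustration)
D = [(rng.normal(size=n_features), rng.normal()) for _ in range(100)]
w = np.zeros(n_features)

for _ in range(1000):
    x_s, v_pi = D[rng.integers(len(D))]    # 1. sample <s, v_pi> ~ D
    w += alpha * (v_pi - x_s @ w) * x_s    # 2. SGD step toward the sampled value
```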

🐕‍🦺  Experience Replay in Deep Q-Networks (DQN)

Deep Q-Networks (DQN) uses experience replay and fixed Q-targets.

  • Dataset generation
    1. Take action $a_t$ according to an $\epsilon$-greedy policy
    2. Store the transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$
  • Train the network
    1. Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
    2. Compute Q-learning targets w.r.t. old, fixed parameters $\textbf{w}^-$
    3. Optimize the MSE between the Q-network and the Q-learning targets

$\mathcal{L}_i(\textbf{w}_i) = \mathbb{E}_{s, a, r, s' \sim \mathcal{D}_i} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \textbf{w}_i^-) - Q(s, a; \textbf{w}_i) \big)^2 \Big]$
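
To make the loss concrete, here is a toy sketch of one DQN-style update with experience replay and fixed Q-targets. The "network" is just a linear map $Q(s, \cdot; W) = W\,\textbf{x}(s)$ so the sketch stays self-contained; all names and the data are illustrative assumptions, not the original DQN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, alpha = 4, 2, 0.99, 0.01
W = rng.normal(size=(n_actions, n_features))   # online parameters w_i
W_target = W.copy()                            # old, fixed parameters w_i^-

def q(x_s, params):
    return params @ x_s  # one value per action

def dqn_step(batch, W, W_target):
    # One SGD step on the mini-batch MSE between Q-network and Q-learning targets
    grad = np.zeros_like(W)
    for x_s, a, r, x_s_next, done in batch:
        target = r + (0.0 if done else gamma * np.max(q(x_s_next, W_target)))
        td_error = target - q(x_s, W)[a]
        grad[a] += td_error * x_s              # gradient of the squared error w.r.t. row W[a]
    return W + alpha * grad / len(batch)

# Toy mini-batch of transitions (s, a, r, s', done) sampled from replay memory
batch = [(rng.normal(size=n_features), int(rng.integers(n_actions)), 1.0,
          rng.normal(size=n_features), False) for _ in range(32)]
W = dqn_step(batch, W, W_target)
# Every C steps, refresh the fixed targets: W_target = W.copy()
```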