Robotics & Perception/Reinforcement Learning

[David Silver] 2. Markov Decision Processes

이 글은 필자가 David Silver의 Reinforcement Learning 강좌를 듣고 정리한 글입니다.
(2023.09.12) 추가적으로 필자가 임재환 교수님의 AI611 대학원 수업을 듣고 이해가 부족한 부분을 채웠습니다. -보라색으로 추가하였습니다.

Markov decision process formally describe an fully observable environment for reinforcement learning.

🥭 Markov Processes

Based on the Markov property, A Markov process is a random process, i.e. a sequence of random states $S_1, S_2, \cdots$ with the Markov property.

A Markov Process (or Markov Chain) is a tuple $<\mathcal{S, P}>$

  • $\mathcal{S}$ is a (finite) set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1}=s' | S_t=s]$

Markov chain episodes are the episodes that are sampled from a Markov Chain. Since its based on the markov property, it does not care about the past, because the current state sufficiently takes it's information. Here, there is no agent. 

🥭 Markov Reward Processes

A Markov reward process (MRP) is a Markov chain with values.

A Markov Reward Process is a tuple $<\mathcal{S, P, R}, \gamma>$

  • $\mathcal{S}$ is a (finite) set of states
  • Agent gets
    • Value function
      • The state-value function$v(s)$ gives the expected return starting from state $s$: $v(s) = \mathbb{E}[G_t |S_t=s]$
    •  Model
      • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1}=s' | S_t=s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1}|S_t=s]$
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$

Here, there is no agent, but reward exist.

🥭 Markov Decision Processes

A Markov decision process (MDP) is a Markov reward process with decisions(policy). On an environment in which all states are Markov.  Here, agent exists, therefore, there are transition model/observation etc. Agent with the policy, gives the action, the action affect the environment, which would give the reward (instance feedback), and this process iterates until the agent achieve the goal.

A Markov Decision Process$\mathcal{M}$ is a tuple $<\mathcal{S, A, P, R}, \gamma>$

  • $\mathcal{S}$ is a (finite) set of states
  • Agent with a finite set of actions $A_t$ gets 
    • Policy $\pi$ is a distribution over actions given states. $\pi (a|s) = \mathbb{P}[A_t=a|S_t=s]$
    • Value function
      • Return $G_t$ is the total discounted reward from time-step $t$. 
        • $G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^\infty {\gamma}^k R_{t+k+1}$
      • The state-value function $v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s]$
        • Bellman equation for value-function: $v_\pi (s) = \mathbb{E}_\pi [R_{t+1}+ \gamma V_\pi(S_{t+1})|S_t=s]$
          • Immediate reward인 $R_{t+1}$과 discounted value of successor state $\gamma v(S_{t+1})$의 조화
          • $v_\pi (s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'} v_\pi(s')$
      • The action-value function $q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t=s, A_t=a]$
        • $q_\pi (s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}}\mathcal{P}_{ss'}^a v_\pi(s')$
      • Joseph's tip
        • Return은 한 episode에 대해서 가는 것을 처리했을 때를 고려, value-function은 모든 episode를 다 갔을 때를 고려한다.
        • Action-value function은 모든 a 에 대해서 갈 수 있는 all chance of s'을 다 고려한다. 이 때문에 추론하기 어렵지만 우리가 궁극적으로 원하는 값이다.
        • State-value function은 주위의 state들에 대해서 recursive하게 결정이 되므로 상대적으로 추론하기 쉽다.
        • Policy는 highest reward를 줄 수 있는 곳으로 가는데 이는 value function을 통해서 결정이 될 것이다. [Value function -> Policy -> action -> reward ] -> [Value..]의 순환으로 제대로된 value function을 구하는 것이 중요하다.
    • Transition Model
      • The state and reward sequence $S_1, R_2, S_2, \cdots $ is a Markov reward process $<\mathcal{S, P^\pi, R^\pi}, \gamma>$
      • Transition model $\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}} \pi (a|s)\mathcal{P}_{ss'}^a$
        • $s$에서 action $a$를 했을 때 $s'$으로 갈 확률.
      • Reward model $\mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi (a|s) \mathcal{R}_s^a$
        • Evaluating how good is this action is.
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$
    • Without this discount factor, the policy would not converge, and we would like to give importance that the current state is important.

 

We have to find an optimal policy $\pi_*$ that maximize cumulative reward. An MDP is "solved" when we know the optimal value fn. We could get an optimal policy by maximizing over $q_*(s,a)$

  • All optimal policies achieve the optimal value function, $v_* (s) = \max_\pi v_\pi (s) = v_{\pi_*}(s)$
  • All optimal policies achieve the optimal action-value function, $q_* (s,a) = \max_\pi q_\pi (s,a) = q_{\pi_*}(s,a)$ 
  • The optimal value functions are recursively related by the Bellman optimality equations: $v_* (s) = \max_a q_* (s,a)$

 We can derive optimal policy via 1) getting an optimal value function (DQN, value iteration, policy iteration, SARSA) or directly deriving from optimal policy. Note that this is different notations from policy based RL.

🥭 Bellman Expectation Equation

Based on MRP, the state-value function can again be decomposed into immediate reward plus discounted value of successor state, 

$v(s) = \mathbb{E}[G_t|S_t=s] = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1})|S_t=s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s') = \mathcal{R}+ \gamma \mathcal{P}v $

The Bellman equation is a linear equation. It can be solved directly: $v = (I-\gamma \mathcal{P})^{-1}\mathcal{R}$. Computational complexity is $O(n^3)$ for $n$ states. Direct solution only possible for small MRPs. There are many iterative methods for large MRPs for example, DP, MC evaluation and TD learning.

Based on MDP, the state-value function can again be decomposed into immediate reward plus discounted value of successor state,

$v_\pi (s)= \mathbb{E}_\pi [R_{t+1}+\gamma v_\pi (S_{t+1})|S_t=s]= \sum_{a \in \mathbb{A}} \pi (a|s) q_\pi (s,a) $

$v_\pi (s)= \sum_{a \in \mathbb{A}} \pi (a|s) (\mathbb{R}_s^a + \gamma \sum_{s' \in \mathbb{S}} \mathbb{P}_{ss'}^a v_\pi (s')) $

$q_\pi (s,a) = \mathbb{E}_\pi [R_{t+1}+\gamma q_\pi (S_{t+1}, A_{t+1})|S_t=s, A_t=a] = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi (s')$

$q_\pi (s,a)  = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi (a'|s') q_\pi (s', a')$

🥭 Bellman Optimality Equation

The optimal value functions are recursively related by the Bellman optimality equations:

$q_* (s,a) = \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_* (s')$

$v_* (s) = \max_a \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_* (s')$

Bellman optimality equation is non-linear (No closed form solution). There are many iterative methods for MDPs for example, Value iteration, Policy iteration, Q-learning and SARSA etc.