[Robotics] MDP and POMDP Notes

This post is based on David Silver's reinforcement learning slides and the site below.

http://www.pomdp.org/faq.html

✂️ Markov Decision Process (MDP)

A Markov decision process is a Markov reward process with decisions. It is an environment in which all states are Markov.

  • A Markov Decision Process is a tuple <S, A, P, R, γ>
    • S: states, A: actions, P: state transition probability matrix
    • R: reward function, γ: discount factor
  • A policy π is a distribution over actions given states: π(a|s) = P[A_t = a | S_t = s]
    • GOAL: Find the optimal policy
      • All optimal policies achieve the optimal value function / the optimal action-value function
      • There is always a deterministic optimal policy for any MDP
  • Value-function
    • State-value function v_π(s): the expected return starting from state s and then following policy π, v_π(s) = E_π[G_t | S_t = s]; the optimal state-value function is v_*(s) = max_π v_π(s)
    • Action-value function q_π(s, a): the expected return starting from state s, taking action a, and then following policy π, q_π(s, a) = E_π[G_t | S_t = s, A_t = a]; the optimal action-value function is q_*(s, a) = max_π q_π(s, a) (a small value-iteration sketch follows after this list)
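
To make the optimal value function and the deterministic optimal policy concrete, here is a minimal value-iteration sketch in Python. This is not from the original post or the slides; the 3-state, 2-action MDP and all its numbers are made up purely for illustration.

```python
import numpy as np

# Toy 3-state, 2-action MDP; all numbers are made up purely for illustration.
# P[a, s, s'] : state transition probability P(s' | s, a)
# R[s, a]     : expected immediate reward for taking action a in state s
P = np.array([
    [[0.8, 0.2, 0.0],    # action 0
     [0.0, 0.9, 0.1],
     [0.0, 0.0, 1.0]],
    [[0.1, 0.9, 0.0],    # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
R = np.array([[0.0, 1.0],
              [0.5, 2.0],
              [0.0, 0.0]])
gamma = 0.9

# Value iteration: repeat the Bellman optimality backup until convergence,
#   v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) v(s') ]
v = np.zeros(3)
for _ in range(1000):
    q = R + gamma * np.einsum('ast,t->sa', P, v)   # q(s, a)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        v = v_new
        break
    v = v_new

pi_star = q.argmax(axis=1)   # a deterministic optimal policy pi*(s)
print("v* :", v)
print("pi*:", pi_star)
```

Taking the argmax of q over actions gives a deterministic policy, which matches the statement above that every MDP admits a deterministic optimal policy.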

✂️ Partially Observable Markov Decision Process (POMDP)

A POMDP is an MDP with hidden states. It is a hidden Markov model with actions.

Why did POMDPs emerge?

- In a real-world environment, the agent is almost never given the full state of the system. In other words, the Markov property rarely holds in real environments.

- Building an agent in an environment where the observability of the state is not guaranteed = gathering and combining information from each partial observation in an environment where only partial information is available.

  • A Partially Observable Markov Decision Process is a tuple <S, A, P, R, Ω, O, γ>
    • S: states, A: actions, P: state transition probability matrix
    • R: reward function, γ: discount factor
    • Addition) Ω: observations, O: conditional observation probabilities
    • Addition) A belief state b(h) is a probability distribution over states, conditioned on the history h: b(h) = (P[S_t = s^1 | H_t = h], ..., P[S_t = s^n | H_t = h]), where the history is H_t = A_0, O_1, R_1, ..., A_{t-1}, O_t, R_t
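
In practice the belief state is maintained by a Bayes filter: after taking action a and receiving observation o, the belief is pushed through the transition model P and then reweighted by the observation model O. Below is a minimal sketch of this standard belief update, not taken from the original post; the array layout and the toy numbers are assumptions for illustration.

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """One step of the standard POMDP belief update (Bayes filter).

    b : current belief over states, shape (S,)
    a : index of the action taken
    o : index of the observation received
    P : P[a, s, s'] = P(s' | s, a)   (transition model)
    O : O[a, s', o] = P(o | s', a)   (observation model)
    """
    predicted = b @ P[a]                    # predict: sum_s P(s' | s, a) b(s)
    unnormalized = O[a, :, o] * predicted   # correct: weight by observation likelihood
    return unnormalized / unnormalized.sum()

# Toy 2-state, 1-action, 2-observation example (made-up numbers)
P = np.array([[[0.9, 0.1],
               [0.1, 0.9]]])                # shape (A=1, S=2, S'=2)
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])                # shape (A=1, S'=2, O=2)
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, o=0, P=P, O=O)
print(b)   # belief shifts toward the state that better explains o = 0
```

Because the belief is a sufficient statistic of the history, planning in a POMDP can be viewed as planning in a (continuous-state) MDP over belief states.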