This post is based on David Silver's reinforcement learning slides and this site.

✂️ Markov Decision Process (MDP)
A Markov decision process is a Markov reward process with decisions. It is an environment in which all states are Markov.
- A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
- S: states, A: actions, P: state transition probability matrix
- R: reward function, γ: discount factor
- A policy π is a distribution over actions given states: π(a|s) = P(A_t = a | S_t = s)
- GOAL: Find the optimal policy
- All optimal policies achieve the optimal state-value function and the optimal action-value function
- There is always a deterministic optimal policy for any MDP
- Value functions (a short value-iteration sketch follows this list)
- State-value function v_π(s): expected return starting from state s and then following policy π → optimal state-value function v_*(s)
- Action-value function q_π(s, a): expected return starting from state s, taking action a, and then following policy π → optimal action-value function q_*(s, a)
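Putting the definitions above together, here is a minimal sketch of value iteration on a made-up 2-state, 2-action MDP (the transition probabilities, rewards, and γ = 0.9 below are illustrative assumptions, not from the slides). It computes the optimal state-value function with the Bellman optimality backup and then reads off a deterministic optimal policy, illustrating the point above that every MDP admits one.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions; all numbers are made up for illustration.
n_states, n_actions = 2, 2
gamma = 0.9

# P[a, s, s'] = probability of moving from s to s' when taking action a
P = np.array([
    [[0.8, 0.2],   # action 0 from state 0
     [0.1, 0.9]],  # action 0 from state 1
    [[0.5, 0.5],   # action 1 from state 0
     [0.3, 0.7]],  # action 1 from state 1
])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)  # q[s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

# A deterministic optimal policy: act greedily w.r.t. the optimal q
policy = q.argmax(axis=1)
print("optimal state values:", v)
print("deterministic optimal policy:", policy)
```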
✂️ Partially Observable Markov Decision Process (POMDP)
A POMDP is an MDP with hidden states. It is a hidden Markov model with actions.
Why did POMDPs emerge?
- In real-world environments, the full state of the system is almost never provided to the agent; in other words, the Markov property rarely holds in real environments.
- Building an agent in an environment where state observability is not guaranteed means collecting and combining the partial information obtained from each observation.
- A Partially Observable Markov Decision Process is a tuple ⟨S, A, P, R, Ω, O, γ⟩
- S: states, A: actions, P: state transition probability matrix
- R: reward function, γ: discount factor
- Addition) Ω: observations, O: conditional observation probabilities
- Addition) A belief state b(h) is a probability distribution over states conditioned on the history h, i.e., b(h) = P(S_t = s | H_t = h) (see the belief-update sketch below)
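Since the belief state plays the role of the Markov state in a POMDP, it is worth seeing how it is maintained. Below is a minimal sketch of the standard Bayesian belief update (the array shapes and all numbers are illustrative assumptions, not from the slides): after taking action a and observing o, the new belief is b'(s') ∝ O(o | s', a) · Σ_s P(s' | s, a) · b(s).

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes filter step.
    b: (n_states,) current belief, P[a, s, s'] transitions,
    O[a, s', o] observation probabilities -> updated belief."""
    predicted = P[a].T @ b           # sum_s P(s'|s, a) b(s)
    unnorm = O[a, :, o] * predicted  # weight by observation likelihood
    return unnorm / unnorm.sum()     # renormalize to a distribution

# Toy example: 2 hidden states, 1 action, 2 observations
P = np.array([[[0.7, 0.3],
               [0.2, 0.8]]])         # P[a, s, s']
O = np.array([[[0.9, 0.1],
               [0.4, 0.6]]])         # O[a, s', o]

b = np.array([0.5, 0.5])             # uniform prior belief
b = belief_update(b, a=0, o=1, P=P, O=O)
print("belief after observing o=1:", b)  # mass shifts toward state 1
```

Each update folds one more observation into the distribution over hidden states, which is exactly the "collect and combine partial information" idea from the motivation above.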