This post is based on David Silver's reinforcement learning slides and this site.

✂️ Markov Decision Process (MDP)
A Markov decision process is a Markov reward process with decisions. It is an environment in which all states are Markov.
- A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
- S: states, A: actions, P: state transition probability matrix
- R: reward function, γ: discount factor
- A policy π is a distribution over actions given states: π(a|s) = P(A_t = a | S_t = s)
- GOAL: Find the optimal policy
- All optimal policies achieve the optimal state-value function and the optimal action-value function
- There is always a deterministic optimal policy for any MDP
- Value functions (a short value-iteration sketch follows this list)
- State-value function v_π(s): expected return starting from state s and then following policy π → optimal state-value function v_*(s)
- Action-value function q_π(s, a): expected return starting from state s, taking action a, and then following policy π → optimal action-value function q_*(s, a)
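Putting the definitions above together, here is a minimal sketch of value iteration on a made-up 2-state, 2-action MDP (the transition probabilities, rewards, and γ = 0.9 below are illustrative assumptions, not from the slides). It computes the optimal state-value function with the Bellman optimality backup and then reads off a deterministic optimal policy, illustrating the point above that every MDP admits one.

```python
import numpy as np

# Toy MDP: 2 states, 2 actions; all numbers are made up for illustration.
n_states, n_actions = 2, 2
gamma = 0.9

# P[a, s, s'] = probability of moving from s to s' when taking action a
P = np.array([
    [[0.8, 0.2],   # action 0 from state 0
     [0.1, 0.9]],  # action 0 from state 1
    [[0.5, 0.5],   # action 1 from state 0
     [0.3, 0.7]],  # action 1 from state 1
])

# R[s, a] = expected immediate reward for taking action a in state s
R = np.array([
    [1.0, 0.0],
    [0.0, 2.0],
])

# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) v(s') ]
v = np.zeros(n_states)
for _ in range(1000):
    q = R + gamma * np.einsum("ast,t->sa", P, v)  # q[s, a]
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

# A deterministic optimal policy: act greedily w.r.t. the optimal q
policy = q.argmax(axis=1)
print("optimal state values:", v)
print("deterministic optimal policy:", policy)
```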
✂️ Partially Observable Markov Decision Process (POMDP)
A POMDP is an MDP with hidden states. It is a hidden Markov model with actions.
Why did POMDPs emerge?
- In real-world environments, the full state of the system is almost never provided to the agent; in other words, the Markov property rarely holds in real environments.
- Building an agent in an environment where state observability is not guaranteed means collecting and combining the partial information obtained from each observation.
- A Partially Observable Markov Decision Process is a tuple ⟨S, A, P, R, Ω, O, γ⟩
- S: states, A: actions, P: state transition probability matrix
- R: reward function, γ: discount factor
- Addition) Ω: observations, O: conditional observation probabilities
- Addition) A belief state b(h) is a probability distribution over states conditioned on the history h, i.e., b(h) = P(S_t = s | H_t = h) (see the belief-update sketch below)
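Since the belief state plays the role of the Markov state in a POMDP, it is worth seeing how it is maintained. Below is a minimal sketch of the standard Bayesian belief update (the array shapes and all numbers are illustrative assumptions, not from the slides): after taking action a and observing o, the new belief is b'(s') ∝ O(o | s', a) · Σ_s P(s' | s, a) · b(s).

```python
import numpy as np

def belief_update(b, a, o, P, O):
    """Bayes filter step.
    b: (n_states,) current belief, P[a, s, s'] transitions,
    O[a, s', o] observation probabilities -> updated belief."""
    predicted = P[a].T @ b           # sum_s P(s'|s, a) b(s)
    unnorm = O[a, :, o] * predicted  # weight by observation likelihood
    return unnorm / unnorm.sum()     # renormalize to a distribution

# Toy example: 2 hidden states, 1 action, 2 observations
P = np.array([[[0.7, 0.3],
               [0.2, 0.8]]])         # P[a, s, s']
O = np.array([[[0.9, 0.1],
               [0.4, 0.6]]])         # O[a, s', o]

b = np.array([0.5, 0.5])             # uniform prior belief
b = belief_update(b, a=0, o=1, P=P, O=O)
print("belief after observing o=1:", b)  # mass shifts toward state 1
```

Each update folds one more observation into the distribution over hidden states, which is exactly the "collect and combine partial information" idea from the motivation above.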