[David Silver] 3. Planning by Dynamic Programming

이 글은 필자가 David Silver의 Reinforcement Learning 강좌를 듣고 정리한 글입니다.
(2023.09.12) 추가적으로 필자가 임재환 교수님의 AI611 대학원 수업을 듣고 이해가 부족한 부분을 채웠습니다. -보라색 처리

This lecture is about a solution of known MDP which is Dynamic Programming. We will talk about what is dynamic programming, and prove MDP is solvable.

🥭 Dynamic Programming

Dynamic programming is a method for solving complex problems. By breaking them down into subproblems, 1. Solve the subproblems 2. Combine solutions to subproblems. It is a very general solution method for problems which have two properties: Optimal substructure, Overlapping subproblems. Markov decision processes satisfy both properties, so we will check how DP solves MDP problem.

Bellman equation도 recursive한 decomposition에 속한다. 처음에는 super accurate하지 않지만 계속 update를 하면서 궁극적으로 optimal에 도달한다는 점.
대부분 MDP Planning할 때 쓰임!!
- Planning vs. Learning
  - Planning: Environment의 model(reward, transition matrix)을 안다는 전제 하에 해결
  - Learning: Enviornment의 model은 모르지만 상호작용을 통해서 문제를 푸는 것Prediction (필자가 이해했을 때는 goal에 달성하기 위해서 policy가 주어졌을 때 그게 얼마나 좋은지 판단. value function을 예측하는 시스템)
Dynamic programming은 두 step으로 나뉨
1. Prediction (optimal하지 않은 policy가 주어졌을 때 그게 얼마나 좋은지 판단. value function을 예측하는 시스템)
2. Control (value function이 있고 이를 바탕으로 policy을 업데이트 가는 걸 생각)
해당 update를 하면서 sub-optimal한 action을 고를수도 있겠지만 update가 축적됨에 따라 궁극적으로 optimal action에 이를 수 있다.

🥭 Policy Iteration

For control, Dynamic programming improves a policy by

Given a policy π
1. Policy Evaluation: Evaluate the policy pi, vπ(s)=E[Rt+1+γRt+2+⋯|St=s]
  - Solution: iterative application of Bellman expectation backup $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_\pi$ , using synchronous backups
  - $v_\pi (s) = \mathbb{E} [ R_{t+1} + \gamma R_{t+2} + \cdots | S_t=s]$
2. Greedy Policy Improvement: Improve the policy by acting greedily with respect to vπ, π′≥π
  - $\pi' (s) = \textrm{greedy}(v_\pi(s)) = \max_{a \in \mathcal{A}} q_\pi (s,a)$
  - If improvements stop, $q_\pi (s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi (s, \pi (s)) = v_\pi (s)$

David Silver's Lecture 5. Generalized policy iteration (refresher)

🥭 Value Iteration

Find an optimal policy $\pi$

Solution: iterative application of Bellman optimality backup $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_\pi$ , using synchronous backups

$v_{k+1} = \max_{a \in \mathcal{A}} \mathcal{R}^a + \gamma \mathcal{P}^a v_k$

🥭 DP: "Full-backup" vs. Approximate DP via "Sample-back up"

여기에 너무 잘 설명해놓으셔서 그대로 적어놨습니다.정말 잘 설명해놓으셔서 읽어보시는 것 추천드립니다.

DP는 full-width backup(한 번 update할 때 가능한 모든 successor state의 value function을 통해 update하는 방법)을 사용하고 있기 때문에
1. 단 한 번의 backup을 하는 데도 많은 계산을 해야합니다
2. discontinuous 한 action/state space에 대해서 모든 state에 대한 backup이 불가능
이때 등장하는 개념이 sample back-up입니다. 즉, 모든 가능한 successor state와 action을 고려하는 것이 아니고 Sampling을 통해서 한 길만 가보고 그 정보를 토대로 value function을 업데이트한다는 것입니다.
- DP의 방법대로 optimal한 해를 찾으려면 매 iteration마다 Reward function과 state transition matrix를 알아야 하는데 sample backup의 경우에는 아래 그림과 같이 <S,A,R,S'>을 training set으로 실재 나온 reward와 sample transition으로서 그 두 개를 대체하게 됩니다.

즉, 머리로 다 계산하고 있는 것이 아니고 일단 가보면서 겪는 experience로 문제를 풀겠다는 것입니다. DP를 sampling을 통해서 풀면서부터 "Reinforcement Learning"이 시작됩니다.

🥭 Appendix

David Silver's reinforcement learning lecture 3.

저작자표시 비영리 동일조건

'Robotics & Perception > Reinforcement Learning' 카테고리의 다른 글

[David Silver] 5. Model-Free Control: On-policy (GLIE, SARSA), Off-policy (Importance Sampling, Q-Learning) (0)	2022.04.09
[David Silver] 4. Model-Free Prediction: Monte-Carlo, Temporal-Difference (0)	2022.04.09
[David Silver] 2. Markov Decision Processes (0)	2022.04.02
[David Silver] 1. Introduction to Reinforcement learning (0)	2022.04.02
[Basic] 헷갈리는 용어 정리 (0)	2022.01.04

🥭 Dynamic Programming
🥭 DP: "Full-backup" vs. Approximate DP via "Sample-back up"
🥭 Appendix

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

[David Silver] 3. Planning by Dynamic Programming

🥭 Dynamic Programming

🥭 Policy Iteration

🥭 Value Iteration

🥭 DP: "Full-backup" vs. Approximate DP via "Sample-back up"

🥭 Appendix

'Robotics & Perception > Reinforcement Learning' 카테고리의 다른 글

티스토리툴바

개인정보

단축키

내 블로그

블로그 게시글

모든 영역