[David Silver] 1. Introduction to Reinforcement Learning


This post is my summary of David Silver's Reinforcement Learning course.

🧵 Sequential decision making

  • Goal: select actions to maximize total future reward
    • Because reward may be delayed, there are tasks where long-term reward matters most, so it may be better to sacrifice immediate reward to gain more long-term reward (see the discounted-return sketch after this list).
  • Two problem settings for sequential decision making
    • Reinforcement Learning: the environment is initially unknown; the agent improves its policy by interacting with the environment
    • Planning: a model of the environment is known; the agent improves its policy by computing with the model, without external interaction
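As a concrete illustration of the trade-off above, here is a minimal Python sketch; the reward sequences and the discount factor $\gamma = 0.9$ are made up for illustration and are not from the lecture:

```python
# A minimal sketch (with made-up reward sequences) of why delayed reward can
# dominate: the discounted return G_t = R_{t+1} + gamma*R_{t+2} + ... can be
# larger for a path that sacrifices immediate reward.

GAMMA = 0.9  # discount factor (assumed value, for illustration only)

def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma^k * R_{t+1+k} over a reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

greedy_path = [1.0, 0.0, 0.0, 0.0]    # take the immediate reward, nothing later
patient_path = [0.0, 0.0, 0.0, 10.0]  # give up immediate reward for a big payoff

print(discounted_return(greedy_path))   # 1.0
print(discounted_return(patient_path))  # 10 * 0.9^3 = 7.29 > 1.0
```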

🧵 Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • Agent's actions affect the subsequent data it receives

🧵 Agent and Environment

Figure: Sutton and Barto's agent-environment interface (source: https://www.researchgate.net/figure/Sutton-and-Bartos-agent-environment-interface-with-states-generalized-to-observations_fig1_220320890)

  • Basic concepts
    • Exploration and Exploitation: trial-and-error learning; exploration finds more information about the environment, exploitation uses known information to maximize reward
    • Prediction (given a policy, evaluate the future) and Control (find the best policy)
  • At each step $t$, the environment (see the interaction-loop sketch at the end of this section):
    1. Emits observation $O_t$
    2. Emits scalar reward $R_t$
      • On step $t$, the agent gets the scalar feedback signal $R_t$; the agent wants to maximize cumulative reward.
      • Reward hypothesis: all goals can be described by the maximisation of expected cumulative reward.
    3. Receives action $A_t$ from the agent
  • An agent may include one or more of these components:
    • Policy: the agent's behavior function, a map from state to action. Can be deterministic, $a = \pi(s)$, or stochastic, $\pi(a|s) = \mathbb{P}[A_t = a \mid S_t = s]$
    • Value function: how good each state is, a prediction of future reward (therefore it helps select better actions): $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
    • Model: predicts what the environment will do next, via a transition model and a reward model:
      • Transition model $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
        • State transition matrix $\mathcal{P}$: for a Markov state $s$ and successor state $s'$, the state transition probability is defined by $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$. The matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$
      • Reward model $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • History at step $t$: $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$
  • We additionally define state: the information used to determine what happens next, a function of the history, $S_t = f(H_t)$
    • Division of state
      1. Environment state $S^e_t$: the environment's private representation. Usually not visible to the agent.
      2. Agent state $S^a_t$: the agent's internal representation; the information used by the reinforcement learning algorithm
      3. Information state (a.k.a. Markov state): contains all the useful information from the history. A state $S_t$ is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$: "the future is independent of the past given the present." The state captures all relevant information from the history, i.e. the state is a sufficient statistic of the future.
    • Observability
      • Full observability: the agent directly observes the environment state, $O_t = S^a_t = S^e_t$; this is a Markov decision process (MDP)
      • Partial observability: the agent indirectly observes the environment, $S^a_t \neq S^e_t$; this is a partially observable Markov decision process (POMDP). In this setting the agent must construct its own state representation $S^a_t$, e.g. the complete history $S^a_t = H_t$, beliefs of the environment state $S^a_t = (\mathbb{P}[S^e_t = s^1], \dots, \mathbb{P}[S^e_t = s^n])$, or a recurrent neural network $S^a_t = \sigma(S^a_{t-1} W_s + O_t W_o)$
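Putting the pieces above together, here is a minimal Python sketch of the interaction loop: the environment emits $O_t$ and $R_t$, the agent returns $A_t$, and together they build up the history $H_t$. The toy environment, its reward scheme, and the random agent are my own assumptions for illustration, not from the lecture:

```python
# A minimal sketch of the agent-environment loop, using a hypothetical toy
# environment; class names and reward scheme are made up for illustration.
import random

class ToyEnvironment:
    """Environment: receives A_t, then emits observation O_t and scalar reward R_t."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0  # environment state S^e_t (private to the environment)

    def step(self, action):
        # Toy transition: the action nudges the state up or down.
        self.state = max(0, min(self.n_states - 1, self.state + action))
        observation = self.state  # fully observable here: O_t = S^e_t (an MDP)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return observation, reward

class RandomAgent:
    """Agent with a stochastic policy pi(a|s); this one ignores the state."""

    def act(self, observation):
        return random.choice([-1, +1])  # A_t sampled from the policy

env = ToyEnvironment()
agent = RandomAgent()
history = []  # H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
observation, reward = 0, 0.0
total_reward = 0.0

for t in range(20):
    action = agent.act(observation)         # agent picks A_t from O_t
    observation, reward = env.step(action)  # environment emits next O, R
    history.extend([action, observation, reward])
    total_reward += reward                  # the quantity the agent maximizes

print(f"cumulative reward: {total_reward}")
```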

🧵 Categorizing RL agents

  • By policy/value components (see the sketch after this list)
    • Value Based: value function only (the policy is implicit)
    • Policy Based: policy only
    • Actor Critic: policy + value function
  • By model
    • Model Free: policy and/or value function, no model
    • Model Based: policy and/or value function + model
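To make the taxonomy concrete, here is a small sketch (my own illustration, not from the lecture) that classifies an agent by which components it carries:

```python
# A small illustration of the agent taxonomy above: an agent is categorized
# along two axes by which of the three components it carries.
from dataclasses import dataclass

@dataclass
class AgentComponents:
    has_policy: bool
    has_value_function: bool
    has_model: bool

def categorize(agent: AgentComponents) -> str:
    if agent.has_policy and agent.has_value_function:
        kind = "actor critic"
    elif agent.has_value_function:
        kind = "value based"
    else:
        kind = "policy based"
    model = "model based" if agent.has_model else "model free"
    return f"{kind}, {model}"

print(categorize(AgentComponents(True, True, False)))   # actor critic, model free
print(categorize(AgentComponents(False, True, True)))   # value based, model based
```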