
[Basic] Sorting Out Confusing Terminology

On-policy vs. Off-policy RL

  • On-policy learning: Learn about policy π from experience sampled from π
    • An RL algorithm that can only learn when the policy being learned is the same as the policy used to act.
    • ex) Sarsa: in the on-policy setting, the moment even one learning step performs policy improvement, all experience collected by the previous policy can no longer be used. This makes it very data-inefficient: experience gathered through exploration cannot simply be reused once it has been learned from (it can only be reused with techniques such as importance sampling).
  • Off-policy learning: Learn about policy π from experience sampled from μ
    • μ: Following behavior policy μ(a|s), {S_1, A_1, R_2, ..., S_T} ~ μ
    • An algorithm that can learn even when the policy being learned and the policy used to act are not the same.
    • ex) Q-learning: in the off-policy setting, the policy currently being learned can use experience collected in the past, and can even learn from data generated by a different source, for example data produced by a human (see the sketch after this list).
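
To make the difference concrete, here is a minimal tabular sketch contrasting the Sarsa and Q-learning update rules. The sizes, hyper-parameters, and the epsilon_greedy helper are illustrative assumptions, not values from the text.

```python
import numpy as np

# Minimal tabular sketch contrasting the Sarsa (on-policy) and Q-learning
# (off-policy) update targets. n_states, n_actions, alpha, gamma, eps are
# illustrative hyper-parameters.
n_states, n_actions = 10, 4
alpha, gamma, eps = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(Q, s, eps):
    """Behavior policy: random action with probability eps, otherwise greedy."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next):
    # On-policy: the target uses a_next, the action actually chosen by the
    # same epsilon-greedy policy that is being learned.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next):
    # Off-policy: the target uses max_a' Q(s', a'), i.e. the greedy target
    # policy, regardless of which behavior policy generated (s, a, r, s').
    # This is why transitions from old policies or human data remain usable.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```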

 

Online RL (Online-update) vs. Offline RL (Offline-update)

  • Offline update: updates are accumulated within an episode (e.g., collected in a buffer) and applied in batch at the end of the episode; learn from the whole batch at once.
  • Online update: values are updated at every time-step, as soon as each transition is observed (see the sketch below).
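
A small sketch of the two update schedules, assuming TD(0)-style value prediction; the episode format and the constants are hypothetical placeholders.

```python
# Sketch of the two update schedules for TD(0) value prediction.
# `episode` is a hypothetical list of (state, reward, next_state) transitions.
alpha, gamma = 0.1, 0.99
V = {}  # state -> value estimate, missing states default to 0.0

def online_td0(episode):
    # Online: the value table is updated at every time-step,
    # so later steps in the same episode already see the new estimates.
    for s, r, s_next in episode:
        td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        V[s] = V.get(s, 0.0) + alpha * td_error

def offline_td0(episode):
    # Offline: TD errors are accumulated within the episode (a buffer)
    # and applied in one batch at the end of the episode.
    deltas = {}
    for s, r, s_next in episode:
        td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
        deltas[s] = deltas.get(s, 0.0) + alpha * td_error
    for s, d in deltas.items():
        V[s] = V.get(s, 0.0) + d
```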

 

Policy Iteration vs. Policy evaluation vs. Policy improvement

  • Policy iteration = Repeat [ Policy evaluation --> Policy improvement ] until the policy no longer changes (see the sketch below).
    • Policy evaluation: estimate v_π for the current policy π.
    • Policy improvement: produce a better policy, e.g., by acting greedily with respect to v_π.
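
A minimal policy-iteration sketch for a tabular MDP, assuming the transition tensor P[s, a, s'] and reward matrix R[s, a] are given; all names and default values are illustrative.

```python
import numpy as np

# Minimal policy-iteration sketch for a tabular MDP.
# P[s, a, s'] = transition probability, R[s, a] = expected reward;
# both arrays (and their sizes) are hypothetical inputs.
def policy_iteration(P, R, gamma=0.99, eval_tol=1e-8):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary policy
    V = np.zeros(n_states)

    while True:
        # --- Policy evaluation: estimate v_pi for the current policy ---
        while True:
            V_new = np.array([
                R[s, policy[s]] + gamma * P[s, policy[s]] @ V
                for s in range(n_states)
            ])
            converged = np.max(np.abs(V_new - V)) < eval_tol
            V = V_new
            if converged:
                break

        # --- Policy improvement: act greedily w.r.t. v_pi ---
        Q = R + gamma * np.einsum("sap,p->sa", P, V)
        new_policy = np.argmax(Q, axis=1)

        if np.array_equal(new_policy, policy):  # policy stable => done
            return policy, V
        policy = new_policy
```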

Bootstrapping vs. Sampling

  • Bootstrapping: the update involves an estimate of the value function (DP and TD bootstrap; MC does not).
  • Sampling: the update uses a sampled expectation instead of a full backup over all successors (MC and TD sample; DP does not). The sketch below shows the three update targets.
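
The distinction is easiest to see in the update targets themselves. A small illustrative sketch (all quantities below are placeholders): the DP target bootstraps but does not sample, the MC target samples but does not bootstrap, and the TD target does both.

```python
import numpy as np

# Illustrative update targets for a state s:
#   DP: bootstraps (uses the estimate V), does NOT sample (full expectation).
#   MC: samples (one observed return G_t), does NOT bootstrap.
#   TD: both samples (one observed transition) and bootstraps (uses V[s_next]).
gamma = 0.99
V = np.zeros(5)                       # current value estimates (placeholder)
P = np.full((5, 5), 0.2)              # P[s, s'] under the policy (placeholder)
R = np.ones(5)                        # expected reward per state (placeholder)

s = 0
dp_target = R[s] + gamma * P[s] @ V   # full backup over all successor states

G_t = 3.7                             # sampled return of one complete episode
mc_target = G_t                       # no value estimate involved

r, s_next = 1.0, 2                    # one sampled transition (s, r, s')
td_target = r + gamma * V[s_next]     # sampled AND bootstrapped
```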

REINFORCE algorithm vs. Reinforcement learning

  • REINFORCE algorithm: Monte-Carlo policy gradient, i.e., one specific RL method (see the sketch below).
  • Reinforcement Learning: the general learning framework in which an agent learns a policy by interacting with an environment so as to maximize cumulative reward.
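
A rough sketch of the REINFORCE update for a tabular softmax policy, assuming the episode has already been collected by running the current policy; the shapes and step sizes are illustrative assumptions.

```python
import numpy as np

# REINFORCE (Monte-Carlo policy gradient) sketch for a tabular softmax policy.
# `episode` is a hypothetical list of (state, action, reward) tuples obtained
# by running the current policy until the end of an episode.
n_states, n_actions = 10, 4
alpha, gamma = 0.01, 0.99
theta = np.zeros((n_states, n_actions))   # policy parameters

def softmax_policy(theta, s):
    prefs = theta[s] - np.max(theta[s])
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode):
    # Monte-Carlo: the return G_t is computed from the full sampled episode,
    # then each visited (s, a) is pushed along grad log pi(a|s) scaled by G_t.
    G = 0.0
    returns = []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for (s, a, _), G_t in zip(episode, returns):
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs              # gradient of log softmax w.r.t. theta[s]
        grad_log_pi[a] += 1.0             # one-hot(a) - pi(.|s)
        theta[s] += alpha * G_t * grad_log_pi
```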

MC: Markov Chain vs. Monte-Carlo method vs. Monte-Carlo algorithm

  • Markov Chain / Markov process: from probability theory - a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
  • Monte-Carlo method / Monte-Carlo experiments: a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle (see the toy example after this list).
  • Monte-Carlo algorithm: a randomized algorithm whose output may be incorrect with some (bounded) probability, in contrast to a Las Vegas algorithm, which always returns a correct result but has a random running time.
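
As a toy illustration of a Monte-Carlo method (not from the text): the classic estimate of pi by repeated random sampling, i.e., a deterministic quantity recovered through randomness.

```python
import random

# Classic Monte-Carlo-method toy example: estimate pi, a deterministic
# quantity, by repeated random sampling. The fraction of uniform points in
# the unit square that land inside the quarter circle approaches pi/4 as the
# number of samples grows.
def estimate_pi(n_samples=1_000_000):
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(estimate_pi())   # ~3.14 for large n_samples
```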