Cramming Reinforcement Learning and Sorting Out Floating Keywords
AI, Deep Learning Basics

This post was written to keep my feel for reinforcement learning alive by collecting the keywords I find confusing and resolving my own questions about them.
The highlighted parts are things I personally care about, so feel free to ignore them. Note that this post is excerpted from several sources, and I relied especially heavily on this person's document. Parts I am not sure about are marked like this.

Collection of Confusing Basic Keywords / Questions and Answers

  • Planning vs. Control
    • The distinction between the two seems to come down to whether there is a model holding the transition probability and reward function. Even if this model is learned, it still seems to count as planning.
    • Motion planning and task planning also belong to this family. -I thought these kinds of planning sat under Control...? Getting confused again. Also, how is MPC (Model Predictive Control) different?
  • Model-based vs. Model-free; Definition of model
    • The model defines the reward function and transition probabilities. The model can be known in advance, or the algorithm can learn it explicitly.
    • Model-based algorithms rely on the model of the environment, whereas model-free RL has no dependency on the model during learning.
  • On-policy vs. Off-policy
    • A classification based on whether the dataset used for training comes from actions taken by the policy currently being learned.
    • On-policy methods learn from experience generated by the policy currently being trained; i.e., the same policy that is used to make decisions is also used to evaluate and improve itself. Off-policy methods learn the value of the optimal policy from experience that was not generated by the policy currently being trained.
  • Difference between online RL and offline RL -summarized below
  • Difference between model-based RL and model-free RL -summarized below
  • Difference between state value, return and reward -summarized below
  • Buffer, Rollout: the experience data used for training
  • The transition function P records the probability of transitioning from state s to s' after taking action a while obtaining reward r: $P(s', r|s,a)$. The state-transition function is $P(s'|s,a)$ (see the relation right after this list).
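As a quick sanity check (standard MDP bookkeeping, my own addition rather than something from the original post), the state-transition function is just the reward-marginalized version of the full transition function, and the expected reward can be read off the same object:

$$P(s'|s,a) = \sum_{r} P(s', r|s,a), \qquad r(s,a) = \mathbb{E}[R_t \mid S_{t-1}=s, A_{t-1}=a] = \sum_{r} r \sum_{s'} P(s', r|s,a)$$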

Basics of RL

Basic setup

  • Agent is acting in an environment.
  • The environment is described by a state/observation.
  • The agent chooses an action via its policy, and a reward is given as the result of taking that action in the environment. (This is natural: the environment holds the reward function and the transition probabilities, so it can return the outcome of that action.)
  • From rewards we can estimate the value functions (return, state-value, action-value (ultimately the Q value)).
    • The return is the total sum of discounted rewards going forward.
    • The state value of a state s is the expected return if we are in this state at time t.
    • The action-value of a particular state s and action a is the expected return if we are in this state and apply this action at time t. (See the formulas right after this list.)
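Written out in the usual notation (my own summary of the standard textbook definitions, not text from the original post):

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right], \qquad Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$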

First classification criterion: whether a model exists

  • Planning vs. Control
  • Model-Free RL vs. Model-based RL
    • Model-based RL
      • Allows the agent to plan by thinking ahead. However, bias in the learned model can be exploited by the agent.
      • Challenge: performs well with respect to the learned model, but behaves sub-optimally in the real environment. It does great in the domains it knows, but performance drops on OOD (out-of-distribution) domains.
    • Model-Free RL: does not use a model.

Under Model-Free RL, second classification criterion: how the policy is learned

  • Q-Value Based / Usually off-policy
    • Methods learn an approximator $Q_\theta (s,a)$ for the optimal action-value function. Actions taken by the Q-learning agent are then given by $a(s) = \arg\max_a Q_\theta (s,a)$. (A minimal Q-learning sketch follows this list.)
    • Pros and Cons
      • These methods only indirectly optimize for agent performance, so they tend to be less stable.
      • But they are more sample efficient.
    • Lineage -DQN family (DQN, DDQN) -> PER (Prioritized Experience Replay), NoisyNet, C51 -> Rainbow
  • Policy Based / Usually on-policy
    • Methods learn an approximator $V_\phi (s)$ for the on-policy value function, which gets used in figuring out how to update the policy.
      • Maximize expected return $J(\pi_\theta) = E_{\tau \sim \pi_\theta} [R(\tau)]$
      • 1. We collect a set of trajectories $\mathcal{D}$, where each trajectory is obtained by letting the agent act in the environment using the policy $\pi_\theta$.
      • 2. Then gradient ascent, using the policy-gradient estimate $\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta} [\sum_{t=0}^T \nabla_\theta \log \pi_\theta (a_t|s_t) R(\tau)]$
    • Pros and Cons
      • Methods are principled, in the sense that you directly optimize for the thing you want, so they tend to be stable and reliable.
      • But less sample efficient.
    • Lineage
      • 1. REINFORCE: A straightforward algorithm that uses entire trajectories to update the policy parameters. (See the sketch after this list.)
  • Both
    • Lineage -on-policy
      • 1. Actor-Critic (A2C, A3C): These methods combine value-based and policy-based methods. Two models: the actor (policy) and the critic (value function). The critic helps reduce the variance of the gradient estimate.
      • 2. TRPO: This method tries to satisfy a special constraint on how close the new and old policies are allowed to be --expressed in terms of KL-Divergence.
      • 3. PPO: Maximize a surrogate objective function which gives a conservative estimate of how $J(\pi_\theta)$ will change as a result of the update. (The clipped objective is written out after this list.)
    • Lineage -off-policy
      • 1. DDPG (deterministic)
      • 2. SAC (stochastic): adding entropy (less overfitting, better exploration)
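To make the value-based bullet above concrete, here is a minimal tabular Q-learning sketch with an epsilon-greedy behavior policy. This is my own illustrative code (not from the original post), and it assumes an old-style Gym environment whose `step` returns `(next_state, reward, done, info)`:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: off-policy, because the TD target bootstraps
    from max_a' Q(s', a') regardless of what the behavior policy does."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behavior policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # TD target uses the greedy (target) policy
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```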
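For the policy-based bullet, here is a minimal REINFORCE-style loss, sketched in PyTorch under the assumption that `policy` maps a batch of states to action logits and that a trajectory is given as a list of state tensors, integer actions, and scalar rewards (again illustrative, not the post's code):

```python
import torch

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    """REINFORCE: loss = -sum_t log pi(a_t|s_t) * G_t, so that gradient
    descent on this loss is gradient ascent on the expected return."""
    # reward-to-go returns G_t = sum_{k >= t} gamma^(k-t) * r_k
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))

    logits = policy(torch.stack(states))                 # (T, n_actions)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs[torch.arange(len(actions)), torch.tensor(actions)]
    return -(chosen * returns).sum()
```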
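And for reference, the clipped surrogate objective that PPO maximizes (standard form from the PPO paper), with probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{\text{old}}}(a_t|s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\; \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$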

Other topics

  • Online RL vs. Offline RL vs. On-policy RL vs. Off-policy RL
    • Off-policy: the data you train on does not come from the current policy; you learn by watching someone else.
    • On-policy: the data you train on comes from the current policy itself; you learn by doing it yourself.
    • Online: the agent interacts with the environment directly, collecting and processing data in real time (run once -> tune immediately).
    • Offline: a dataset is collected separately and processed (no rollout needed, just tune on the data).
    • On-policy is necessarily online; off-policy can be either online or offline.
  • Exploitation vs. Exploration. Especially targeting the "hard-exploration" problem, which refers to exploration in an environment with very sparse or even deceptive reward. It is difficult because random exploration in such scenarios can rarely discover successful states or obtain meaningful feedback.
    • Random exploration: selecting actions completely at random from the action space some fraction of the time, i.e., trading off between exploration and exploitation. (A small epsilon-greedy/UCB sketch is given at the end of this list.)
      • Epsilon-greedy --> Parameter-space noise
      • Upper confidence bounds (UCB): The agent selects the greediest action to maximize the upper confidence bound.
      • Thompson sampling: the agent keeps track of a belief over the probability of optimal actions and samples from this distribution.
    • Posterior sampling
    • Optimistic exploration: if we haven't visited a state, assume it might have high reward until we experience otherwise. e.g., Never-Give-Up
  • Unsupervised RL (reward-free exploration): a human baby explores the world without a predefined task reward and learns how to manipulate many objects --related to exploration / intrinsic rewards / curiosity.
    • Learning diverse behaviors without any reward function at all. (1) Pre-train while no task reward is available; (2) leverage the pre-training to learn faster once the task reward becomes available. Why? Fast adaptation to downstream tasks, and learning sub-skills to use with hierarchical RL.
    • Asymmetric Self-play (ASP); kind of a goal-conditioned framework.
    • Unsupervised Skill Discovery
      • DIAYN; Maximize mutual information between skill and state
      • LSD; Maximize euclidean traveled distance
      • CSD; Discover diverse skills
  • Model-based RL
    • World model
      • DayDreamer
      • Dream to Control
    • World Model + Planning
      • Visual MPC
  • Reward engineering --multiple objectives (safety, fluency, diversity, control cost, etc.) --> the agent can easily exploit loopholes in the reward
  • Offline RL (or Batch RL): learning agents from a fixed dataset generated by some arbitrary behavior policy
    • No interaction allowed. Train the model with samples available in the replay buffer only. In offline RL, where the agent can't interact with the environment and must rely on a fixed dataset, the policy evaluation and improvement steps become challenging. The key issue highlighted is the extrapolation error: when the Q-function must generalize to actions and states not well-represented in the dataset, it can make poor predictions. This, in turn, causes the policy to prefer these poorly predicted actions, leading to suboptimal or even dangerous behaviors when the policy is deployed.
    • CQL (Conservative Q-Learning). (A sketch of its conservative penalty is given at the end of this list.)
  • Representation learning
    • RL from pixels is hard --issue 1: poor sample efficiency; issue 2: poor generalization
    • Representation learning and pre-training overlap in a fuzzy way: the former learns a representation of the input during training (a contribution on the model-architecture side), while the latter is learned before the main training.
    • Learning representations of the data that contain useful information --RL objective + representation learning (reconstruction, contrastive learning, or imagination)
      1. Masked Auto Encoder --> MVP (Masked Visual Pre-training) --> Masked World Model
      2. Value Implicit Pretraining
      3. Time-contrastive learning --> R3M
  • Foundation model for RL: Can we use foundation models for sequential decision making problems?
    • Reward engineering --VLM, LLM for reward / EUREKA
  • Hierarchical RL: Can we generate high-level plans that are guaranteed to work at lower levels?
  • Multi-task RL: training an agent which can perform multiple tasks
  • Meta-RL
    • Learning to learn by gradient descent
    • training an agent which can quickly adapt to new tasks
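To illustrate the exploration strategies mentioned in the list above (epsilon-greedy and UCB), here is a minimal bandit-style sketch. It is my own illustrative code, assuming a k-armed setting with empirical value estimates `q_est` and visit counts `counts`:

```python
import numpy as np

def epsilon_greedy(q_est, epsilon=0.1):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_est))
    return int(np.argmax(q_est))

def ucb(q_est, counts, t, c=2.0):
    """UCB1: pick the arm maximizing Q(a) + c * sqrt(ln t / N(a));
    unvisited arms are tried first."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmin(counts))  # try every arm at least once
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(np.asarray(q_est) + bonus))
```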
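And, for the offline-RL/CQL item, a sketch of the conservative penalty added on top of the usual Bellman error. This assumes a discrete-action Q-network in PyTorch and is only meant to convey the idea (push Q-values down on all actions via a log-sum-exp, push them up on the dataset actions); it is not the official implementation:

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
    """CQL-style objective for discrete actions:
    Bellman error + alpha * (logsumexp_a Q(s, a) - Q(s, a_data))."""
    s, a, r, s_next, done = batch                      # tensors from the offline dataset
    q_all = q_net(s)                                   # (B, n_actions)
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        target = r + gamma * (1 - done) * target_q_net(s_next).max(dim=1).values

    bellman = F.mse_loss(q_data, target)
    # conservative term: penalize large Q on out-of-distribution actions,
    # while keeping Q high on actions actually present in the dataset
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return bellman + alpha * conservative
```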