
[Advanced Topics] 01. RL with human feedback

This post is a summary written for my own review and understanding after taking Professor Kimin Lee's AI707 course in Fall 2023.
(I feel honored to attend lectures taught directly by world-renowned professors!!)

A rough introduction to Reinforcement Learning

  • Reinforcement Learning: Finding an optimal policy for a sequential decision-making problem through interaction and learning
    1. By interacting with the environment, the agent generates roll-outs of the form $\tau = \{(s_0, a_0, r_0), \dots, (s_H, a_H, r_H)\}$ and evaluates each interaction via the discounted sum of rewards $\sum_{t=0}^{H} \gamma^t r_t$. The interaction proceeds as follows (a minimal sketch follows this list):
      1. From the environment state $s_t$, the agent with policy $\pi(a|s)$ selects an action $a_t$.
      2. As a consequence, the next state $s_{t+1} \sim P(s'|s_t, a_t)$ and the reward $r_t = R(s_t, a_t, s_{t+1})$ are determined.
    2. The goal of RL is to maximize the expected return, formulated as $\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{H} \gamma^t r_t\right]$.
      1. The task changes depending on how we define the states, actions, and rewards.
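Below is a minimal sketch of this interaction loop, assuming a Gym-style `reset()`/`step()` environment and a `policy` callable; both are hypothetical stand-ins, not code from the lecture.

```python
# Roll-out collection and discounted-return evaluation (minimal sketch).
def collect_rollout(env, policy, horizon):
    """Interact with the environment and record tau = {(s_t, a_t, r_t)}."""
    rollout = []
    state = env.reset()
    for t in range(horizon):
        action = policy(state)                              # a_t ~ pi(a | s_t)
        next_state, reward, done, info = env.step(action)   # s_{t+1} ~ P(.|s_t, a_t), r_t = R(s_t, a_t, s_{t+1})
        rollout.append((state, action, reward))
        state = next_state
        if done:
            break
    return rollout

def discounted_return(rollout, gamma=0.99):
    """Evaluate the roll-out via the discounted sum of rewards, sum_t gamma^t r_t."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(rollout))
```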

However, reward engineering is difficult. 

  • We should ask: "Is it possible to express every goal as the maximization of a scalar reward value?" According to Rich Sutton (the reward hypothesis), the answer is yes.
  • Reward engineering is hard due to the following reasons:
    1. Multiple objectives must be balanced (e.g., control cost)
    2. The agent can easily exploit loopholes in the reward function, leading to unexpected behavior (reward hacking)
    3. At the same time, we need the reward to generalize

In one line: designing a suitable reward that fits multiple objectives is challenging.

RL from Human Feedback (RLHF) resolves these reward engineering issues.

This approach works under the following assumption: by incorporating human feedback directly into the reward, we can design a suitable reward. The following approaches realize this idea:

(Earlier approach) Scalar-valued Feedback + RL

  • This approach collects scalar-valued feedback from humans; the feedback can be binary or on a Likert scale. It is used either to train a reward function or to train the policy directly (a regression sketch follows this list).
  • Related works: TAMER, COACH
  • Limitation: Because human-provided rewards are subjective (human bias, etc.), the feedback can be very noisy across labelers.
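For the reward-learning variant, one minimal sketch is to regress a small reward network onto the scalar labels. The dimensions and dummy data below are illustrative placeholders, not the actual TAMER or COACH implementations.

```python
# Fitting a reward model to scalar human feedback by regression (minimal sketch).
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
reward_model = nn.Sequential(                  # maps (state, action) features to a scalar reward
    nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=3e-4)

# Dummy batch standing in for logged (state, action) pairs and human scores (e.g., Likert 1-5).
obs_act = torch.randn(64, obs_dim + act_dim)
human_score = torch.randint(1, 6, (64,)).float()

for _ in range(100):
    pred = reward_model(obs_act).squeeze(-1)
    loss = nn.functional.mse_loss(pred, human_score)   # regress the model onto the scalar feedback
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```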

Preference-based RL focuses on resolving the limitations of scalar-valued feedback.

This starts from a random policy, or from a policy pre-trained on demonstrations.

This requires repeating the following steps:

  1. Initialize a policy
  2. Collect a human preference dataset
    1. From the behavior buffer, sample a pair of segments and ask the human which one they prefer.
  3. Learn a reward function: from the human preference dataset, we learn a reward model $\hat{r}: S \times A \rightarrow \mathbb{R}$ using the Bradley-Terry model (trained in a classification manner; a sketch follows this list).
  4. Optimize the policy using RL algorithms on the learned reward $\hat{r}$
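Below is a minimal sketch of the reward-learning step (step 3) with the Bradley-Terry model, in PyTorch. The network sizes, tensor shapes, and the convention that `pref = 1.0` means segment 0 is preferred are illustrative assumptions, not the lecture's implementation.

```python
# Bradley-Terry reward learning from pairwise preferences (minimal sketch).
# P[segment 0 preferred] = exp(sum_t r(s,a) over seg 0) / (exp(sum over seg 0) + exp(sum over seg 1)),
# which reduces to binary cross-entropy on the difference of predicted returns.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):                  # per-step reward \hat{r}(s, a)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, seg0, seg1, pref):
    """seg* = (obs, act) tensors of shape [batch, segment_len, dim];
    pref = 1.0 if the human prefers seg0, else 0.0 (shape [batch])."""
    ret0 = reward_model(*seg0).sum(dim=1)         # predicted return of segment 0
    ret1 = reward_model(*seg1).sum(dim=1)         # predicted return of segment 1
    logits = ret0 - ret1                          # Bradley-Terry log-odds
    return nn.functional.binary_cross_entropy_with_logits(logits, pref)
```

The preference probability is a softmax over the two segments' predicted returns, so training the reward model is just binary classification on which segment the human preferred.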

 

Recent trends: Diffusion models with RLHF, LLMs with RLHF

Diffusion model with RLHF

This also follows the same pipeline: 1) human data collection, 2) reward learning, 3) fine-tuning the text-to-image model (a reward-weighted sketch is given below).
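One concrete (hedged) way to picture step 3 is reward-weighted fine-tuning: weight each sample's denoising loss by a frozen reward model trained on the human labels. The toy model, dummy data, and weighting below are illustrative assumptions only, not the lecture's exact recipe.

```python
# Reward-weighted fine-tuning of a denoising model (toy sketch).
import torch
import torch.nn as nn

img_dim = 16
denoiser = nn.Sequential(nn.Linear(img_dim + 1, 64), nn.ReLU(), nn.Linear(64, img_dim))
reward_model = nn.Sequential(nn.Linear(img_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # frozen, pre-trained on human labels
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

images = torch.randn(32, img_dim)                      # stand-in for generated images
for _ in range(50):
    t = torch.rand(32, 1)                              # diffusion time in (0, 1)
    noise = torch.randn_like(images)
    noisy = torch.sqrt(1 - t) * images + torch.sqrt(t) * noise
    pred_noise = denoiser(torch.cat([noisy, t], dim=-1))
    per_sample = ((pred_noise - noise) ** 2).mean(dim=-1)          # standard denoising loss per image
    with torch.no_grad():
        weights = torch.sigmoid(reward_model(images)).squeeze(-1)  # higher human-aligned reward -> larger weight
    loss = (weights * per_sample).mean()               # reward-weighted objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```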

LLM with RLHF

  • Supervised fine-tuning with an expert dataset is easy to implement, but designing a suitable reward is difficult (misalignment between the training objective and what we actually want). Fine-tuning LLMs by RLHF emerged as a solution.
  • Language generation is cast as a token-level MDP and trained in the aforementioned format.
  • This approach has been used from GPT-2 to ChatGPT and GPT-4.
    • InstructGPT: uses use-case categories from their API prompt dataset.
    • ChatGPT: extension to conversation
  • Now what is left?
    • Reward overoptimization (Goodhart's law): optimizing the proxy objective makes the true objective get better at first and then get worse (a common mitigation, a KL penalty toward the reference policy, is sketched after this list)
      • Proxy objective: an approximation or estimate of the true objective that we actually care about optimizing.
    • How to select informative queries to improve feedback efficiency when the comparisons are ambiguous?
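As a reference point, the KL-regularized fine-tuning objective used in InstructGPT-style RLHF keeps the policy close to the supervised reference model while maximizing the learned reward (the symbols $\beta$ and $\pi_{\text{ref}}$ follow common notation rather than the lecture slides):

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \left[ \hat{r}(x, y) \;-\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right]$$

The $\beta$-weighted KL term penalizes drifting too far from $\pi_{\text{ref}}$, which is one practical guard against over-optimizing the proxy reward $\hat{r}$.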