This post summarizes, for my own review and understanding, Prof. Kimin Lee's AI707 course from the Fall 2023 semester.
(I feel honored to attend lectures taught directly by world-renowned professors!!)
Rough introduction to Reinforcement Learning
- Reinforcement Learning: finding an optimal policy for a sequential decision-making problem through interaction and learning
- By interacting with the environment, the agent generates roll-outs of the form $\tau=\{(s_0,a_0,r_0),\cdots,(s_H,a_H,r_H)\}$, and that interaction is evaluated via the discounted sum of rewards $\sum_{t=0}^{H}\gamma^t r_t$ (see the code sketch after this list). More concretely:
- Given the environment state $s_t$, the agent with policy $\pi(a|s)$ selects an action $a_t$.
- As a consequence, the next state $s_{t+1}\sim P(s'|s_t,a_t)$ and the reward $r_t=R(s_t,a_t,s_{t+1})$ are determined.
- The goal of RL is to maximize the expected return, which is formulated as $\max_{\pi}\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{H}\gamma^t r_t\right]$.
- The task varies depending on how we define the states, actions, and rewards.
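To make the rollout and discounted-return notation above concrete, here is a minimal Python sketch with a toy chain environment and a random policy. The environment, policy, and horizon are illustrative choices, not something from the lecture.

```python
import random

class ToyChainEnv:
    """Toy environment: agent moves left/right on integers; reward +1 on reaching state 5."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):              # action in {-1, +1}
        next_state = self.state + action
        reward = 1.0 if next_state == 5 else 0.0
        self.state = next_state
        return next_state, reward

def random_policy(state):
    return random.choice([-1, +1])

def rollout(env, policy, horizon=10):
    """Generate tau = [(s_t, a_t, r_t)] by interacting with the environment."""
    s = env.reset()
    tau = []
    for _ in range(horizon + 1):
        a = policy(s)
        s_next, r = env.step(a)
        tau.append((s, a, r))
        s = s_next
    return tau

def discounted_return(tau, gamma=0.99):
    # sum_t gamma^t * r_t
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(tau))

tau = rollout(ToyChainEnv(), random_policy)
print(discounted_return(tau))
```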
However, reward engineering is difficult.
- We should ask, "Is it possible to express everything we care about as a single scalar reward?" According to Rich Sutton (the reward hypothesis), it is.
- Reward engineering is hard for the following reasons:
- There are multiple objectives to balance (e.g., task success vs. control cost).
- The agent can easily exploit loopholes in the reward function, leading to unexpected behavior.
- At the same time, we need a reward that generalizes.
In one line: designing a suitable reward that fits multiple objectives is challenging.
RL from Human Feedback (RLHF) resolves these reward engineering issues.
This approach works under the following assumption: by directly incorporating human feedback into the reward, we can design a suitable reward. There are several ways to realize this idea:
(Earlier approach) Scalar-valued Feedback + RL
- This approach collects scalar-valued feedback from humans; the feedback could be binary or on a Likert scale. This feedback is used to train a reward model or to train a policy directly (a minimal sketch follows after this list).
- Related works: TAMER, COACH
- Limitation: because human ratings are subjective (human bias, etc.), the feedback can be very noisy across labelers.
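As a rough illustration (not the actual TAMER or COACH algorithms), one simple way to use scalar-valued feedback is to regress a reward model onto human ratings of state-action pairs. The dimensions, network, and data below are all placeholders.

```python
import torch
import torch.nn as nn

# Placeholder dimensions; in practice these come from the environment.
state_dim, action_dim = 4, 2
reward_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake labeled data: each (s, a) pair gets a scalar rating from a human labeler.
states = torch.randn(256, state_dim)
actions = torch.randn(256, action_dim)
ratings = torch.randn(256, 1)          # e.g., normalized Likert scores (noisy across labelers)

for _ in range(100):
    pred = reward_model(torch.cat([states, actions], dim=-1))
    loss = nn.functional.mse_loss(pred, ratings)   # simple regression onto scalar feedback
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```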
Preference-based RL focuses on resolving the limitations of scalar-valued feedback.
It starts from a random policy or a policy pre-trained on demonstrations.
This requires repeating the following steps:
- Initialize a policy
- Collect a human preference dataset
- From the behavior buffer, sample a pair of segments and let a human indicate which segment they prefer.
- Learn a reward function: from this human dataset, we learn a reward model $\hat{r}: S \times A \rightarrow \mathbb{R}$ using the Bradley-Terry model (i.e., in a classification manner); see the sketch after these steps.
- Optimize the policy using RL algorithms
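The reward-learning step can be sketched as follows. Under the Bradley-Terry model, the probability that segment $\sigma_1$ is preferred over $\sigma_0$ is the sigmoid of the difference of predicted segment returns, so learning $\hat{r}$ reduces to binary classification. The shapes, network, and data below are placeholders, not the exact setup of any specific paper.

```python
import torch
import torch.nn as nn

state_dim, action_dim, seg_len, batch = 4, 2, 25, 32
r_hat = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(r_hat.parameters(), lr=3e-4)

def segment_return(seg_states, seg_actions):
    # seg_states: (batch, seg_len, state_dim); seg_actions: (batch, seg_len, action_dim)
    x = torch.cat([seg_states, seg_actions], dim=-1)
    return r_hat(x).sum(dim=1)        # (batch, 1): predicted return of each segment

# Fake preference batch, standing in for segment pairs sampled from the behavior buffer.
s0, a0 = torch.randn(batch, seg_len, state_dim), torch.randn(batch, seg_len, action_dim)
s1, a1 = torch.randn(batch, seg_len, state_dim), torch.randn(batch, seg_len, action_dim)
labels = torch.randint(0, 2, (batch, 1)).float()   # human preference: 1 -> segment 1 preferred

# Bradley-Terry: P(sigma_1 > sigma_0) = sigmoid(return_1 - return_0)
logits = segment_return(s1, a1) - segment_return(s0, a0)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```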
Current trends: Diffusion model with RLHF, LLM with RLHF
Diffusion model with RLHF
This also follows the same pipeline: 1) human data collection, 2) reward learning, and 3) fine-tuning the text-to-image model.
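One simple way to instantiate step 3 is reward-weighted fine-tuning: weight the denoising loss of generated samples by a frozen reward model's scores. This is an illustrative assumption rather than the exact method from the lecture; the toy denoiser, toy reward model, and simplified noising step below stand in for a real text-to-image diffusion model.

```python
import torch
import torch.nn as nn

img_dim = 16
denoiser = nn.Sequential(nn.Linear(img_dim + 1, 128), nn.ReLU(), nn.Linear(128, img_dim))
reward_model = nn.Sequential(nn.Linear(img_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # kept frozen
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

samples = torch.randn(64, img_dim)                      # stand-in for generated images
with torch.no_grad():
    # Higher reward -> larger weight on that sample's denoising loss.
    weights = torch.softmax(reward_model(samples).squeeze(-1), dim=0)

t = torch.rand(64, 1)                                   # random diffusion timesteps in [0, 1]
noise = torch.randn_like(samples)
noisy = samples + t * noise                             # simplified forward noising (illustrative)
pred_noise = denoiser(torch.cat([noisy, t], dim=-1))

# Reward-weighted denoising loss: emphasize samples the reward model prefers.
loss = (weights * ((pred_noise - noise) ** 2).mean(dim=-1)).sum()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```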
LLM with RLHF
- Supervised fine-tuning on an expert dataset is easy to implement, but designing a suitable reward is difficult (there is misalignment between objectives). Instead, fine-tuning LLMs with RLHF emerged as a solution.
- Language generation is cast as a token-level MDP and trained in the aforementioned format (a toy sketch of this framing appears at the end of this section).
- From GPT-2-based work to ChatGPT and GPT-4, this approach has been used.
- InstructGPT: use case categories from their API prompt dataset.
- ChatGPT: extension to conversation
- Now what is left?
- Reward overoptimization (Goodhart's law): optimizing the proxy objective causes the true objective to first improve and then degrade.
- Proxy objective: an approximation or estimate of the true objective that you actually care about optimizing.
- How can we select informative queries to improve feedback efficiency when the comparisons are ambiguous?
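As a toy illustration of the token-level MDP view mentioned above: the state is the prompt plus the tokens generated so far, the action is the next token, and the reward is sparse, given by a learned reward model at the end of the episode. The vocabulary, policy, and reward model below are placeholders standing in for an LLM and a trained reward model.

```python
import random

VOCAB = ["good", "bad", "movie", "great", "<eos>"]

def policy(state):
    """Stand-in for an LLM: pick the next token given the current prefix."""
    return random.choice(VOCAB)

def reward_model(state):
    """Stand-in for a learned reward model scoring the full response."""
    return sum(1.0 for tok in state if tok in ("good", "great"))

def generate_episode(prompt, max_tokens=10):
    state = list(prompt)                 # state: prompt + tokens generated so far
    transitions = []
    for step in range(max_tokens):
        action = policy(state)           # action: next token
        next_state = state + [action]
        done = (action == "<eos>") or (step == max_tokens - 1)
        reward = reward_model(next_state) if done else 0.0   # sparse, end-of-episode reward
        transitions.append((tuple(state), action, reward))
        state = next_state
        if done:
            break
    return transitions

episode = generate_episode(["the", "movie", "is"])
print(episode)
```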