This post summarizes, for my own review and understanding, Prof. Kimin Lee's AI707 course from the Fall 2023 semester.
(I feel honored to attend lectures taught directly by world-renowned professors!!)
Rough introduction to Reinforcement Learning
- Reinforcement Learning: finding an optimal policy for a sequential decision-making problem through interaction and learning
- By interacting with the environment, the agent generates roll-outs of the form $\tau=\{(s_0,a_0,r_0),\cdots,(s_H,a_H,r_H)\}$, and that interaction is evaluated via the discounted sum of rewards $\sum_{t=0}^{H}\gamma^t r_t$ (see the code sketch after this list). More concretely:
- Given the environment state $s_t$, the agent with policy $\pi(a|s)$ selects an action $a_t$.
- As a consequence, the next state $s_{t+1}\sim P(s'|s_t,a_t)$ and the reward $r_t=R(s_t,a_t,s_{t+1})$ are determined.
- The goal of RL is to maximize the expected return, which is formulated as $\max_{\pi}\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{H}\gamma^t r_t\right]$.
- The task varies depending on how we define the states, actions, and rewards.
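To make the rollout and discounted-return notation above concrete, here is a minimal Python sketch with a toy chain environment and a random policy. The environment, policy, and horizon are illustrative choices, not something from the lecture.

```python
import random

class ToyChainEnv:
    """Toy environment: agent moves left/right on integers; reward +1 on reaching state 5."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):              # action in {-1, +1}
        next_state = self.state + action
        reward = 1.0 if next_state == 5 else 0.0
        self.state = next_state
        return next_state, reward

def random_policy(state):
    return random.choice([-1, +1])

def rollout(env, policy, horizon=10):
    """Generate tau = [(s_t, a_t, r_t)] by interacting with the environment."""
    s = env.reset()
    tau = []
    for _ in range(horizon + 1):
        a = policy(s)
        s_next, r = env.step(a)
        tau.append((s, a, r))
        s = s_next
    return tau

def discounted_return(tau, gamma=0.99):
    # sum_t gamma^t * r_t
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(tau))

tau = rollout(ToyChainEnv(), random_policy)
print(discounted_return(tau))
```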
However, reward engineering is difficult.
- We should ask, "Is it possible to express everything we care about as a single scalar reward?" According to Rich Sutton (the reward hypothesis), it is.
- Reward engineering is hard for the following reasons:
- There are multiple objectives to balance (e.g., task success vs. control cost).
- The agent can easily exploit loopholes in the reward function, leading to unexpected behavior.
- At the same time, we need a reward that generalizes.
In one line: designing a suitable reward that fits multiple objectives is challenging.
RL from Human Feedback (RLHF) resolves these reward engineering issues.
This approach works under the following assumption: by directly incorporating human feedback into the reward, we can design a suitable reward. There are several ways to realize this idea:
(Earlier approach) Scalar-valued Feedback + RL
- This approach collects scalar-valued feedback from humans; the feedback could be binary or on a Likert scale. This feedback is used to train a reward model or to train a policy directly (a minimal sketch follows after this list).
- Related works: TAMER, COACH
- Limitation: because human ratings are subjective (human bias, etc.), the feedback can be very noisy across labelers.
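As a rough illustration (not the actual TAMER or COACH algorithms), one simple way to use scalar-valued feedback is to regress a reward model onto human ratings of state-action pairs. The dimensions, network, and data below are all placeholders.

```python
import torch
import torch.nn as nn

# Placeholder dimensions; in practice these come from the environment.
state_dim, action_dim = 4, 2
reward_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1)
)
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake labeled data: each (s, a) pair gets a scalar rating from a human labeler.
states = torch.randn(256, state_dim)
actions = torch.randn(256, action_dim)
ratings = torch.randn(256, 1)          # e.g., normalized Likert scores (noisy across labelers)

for _ in range(100):
    pred = reward_model(torch.cat([states, actions], dim=-1))
    loss = nn.functional.mse_loss(pred, ratings)   # simple regression onto scalar feedback
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```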
Preference-based RL focuses on resolving the limitations of scalar-valued feedback.
It starts from a random policy or a policy pre-trained on demonstrations.
This requires repeating the following steps:
- Initialize a policy
- Collect a human preference dataset
- From the behavior buffer, sample a pair of segments and let a human indicate which segment they prefer.
- Learn a reward function: from this human dataset, we learn a reward model $\hat{r}: S \times A \rightarrow \mathbb{R}$ using the Bradley-Terry model (i.e., in a classification manner); see the sketch after these steps.
- Optimize the policy using RL algorithms
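The reward-learning step can be sketched as follows. Under the Bradley-Terry model, the probability that segment $\sigma_1$ is preferred over $\sigma_0$ is the sigmoid of the difference of predicted segment returns, so learning $\hat{r}$ reduces to binary classification. The shapes, network, and data below are placeholders, not the exact setup of any specific paper.

```python
import torch
import torch.nn as nn

state_dim, action_dim, seg_len, batch = 4, 2, 25, 32
r_hat = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(r_hat.parameters(), lr=3e-4)

def segment_return(seg_states, seg_actions):
    # seg_states: (batch, seg_len, state_dim); seg_actions: (batch, seg_len, action_dim)
    x = torch.cat([seg_states, seg_actions], dim=-1)
    return r_hat(x).sum(dim=1)        # (batch, 1): predicted return of each segment

# Fake preference batch, standing in for segment pairs sampled from the behavior buffer.
s0, a0 = torch.randn(batch, seg_len, state_dim), torch.randn(batch, seg_len, action_dim)
s1, a1 = torch.randn(batch, seg_len, state_dim), torch.randn(batch, seg_len, action_dim)
labels = torch.randint(0, 2, (batch, 1)).float()   # human preference: 1 -> segment 1 preferred

# Bradley-Terry: P(sigma_1 > sigma_0) = sigmoid(return_1 - return_0)
logits = segment_return(s1, a1) - segment_return(s0, a0)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```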
Current trends: Diffusion model with RLHF, LLM with RLHF
Diffusion model with RLHF
This also follows the same pipeline: 1) human data collection, 2) reward learning, and 3) fine-tuning the text-to-image model.
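One simple way to instantiate step 3 is reward-weighted fine-tuning: weight the denoising loss of generated samples by a frozen reward model's scores. This is an illustrative assumption rather than the exact method from the lecture; the toy denoiser, toy reward model, and simplified noising step below stand in for a real text-to-image diffusion model.

```python
import torch
import torch.nn as nn

img_dim = 16
denoiser = nn.Sequential(nn.Linear(img_dim + 1, 128), nn.ReLU(), nn.Linear(128, img_dim))
reward_model = nn.Sequential(nn.Linear(img_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # kept frozen
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

samples = torch.randn(64, img_dim)                      # stand-in for generated images
with torch.no_grad():
    # Higher reward -> larger weight on that sample's denoising loss.
    weights = torch.softmax(reward_model(samples).squeeze(-1), dim=0)

t = torch.rand(64, 1)                                   # random diffusion timesteps in [0, 1]
noise = torch.randn_like(samples)
noisy = samples + t * noise                             # simplified forward noising (illustrative)
pred_noise = denoiser(torch.cat([noisy, t], dim=-1))

# Reward-weighted denoising loss: emphasize samples the reward model prefers.
loss = (weights * ((pred_noise - noise) ** 2).mean(dim=-1)).sum()
optimizer.zero_grad(); loss.backward(); optimizer.step()
```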
LLM with RLHF
- Supervised fine-tuning on an expert dataset is easy to implement, but designing a suitable reward is difficult (there is misalignment between objectives). Instead, fine-tuning LLMs with RLHF emerged as a solution.
- Language generation is cast as a token-level MDP and trained in the aforementioned format (a toy sketch of this framing appears at the end of this section).
- From GPT-2-based work to ChatGPT and GPT-4, this approach has been used.
- InstructGPT: use case categories from their API prompt dataset.
- ChatGPT: extension to conversation
- Now what is left?
- Reward overoptimization (Goodhart's law): optimizing the proxy objective causes the true objective to first improve and then degrade.
- Proxy objective: an approximation or estimate of the true objective that you actually care about optimizing.
- How can we select informative queries to improve feedback efficiency when the comparisons are ambiguous?
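As a toy illustration of the token-level MDP view mentioned above: the state is the prompt plus the tokens generated so far, the action is the next token, and the reward is sparse, given by a learned reward model at the end of the episode. The vocabulary, policy, and reward model below are placeholders standing in for an LLM and a trained reward model.

```python
import random

VOCAB = ["good", "bad", "movie", "great", "<eos>"]

def policy(state):
    """Stand-in for an LLM: pick the next token given the current prefix."""
    return random.choice(VOCAB)

def reward_model(state):
    """Stand-in for a learned reward model scoring the full response."""
    return sum(1.0 for tok in state if tok in ("good", "great"))

def generate_episode(prompt, max_tokens=10):
    state = list(prompt)                 # state: prompt + tokens generated so far
    transitions = []
    for step in range(max_tokens):
        action = policy(state)           # action: next token
        next_state = state + [action]
        done = (action == "<eos>") or (step == max_tokens - 1)
        reward = reward_model(next_state) if done else 0.0   # sparse, end-of-episode reward
        transitions.append((tuple(state), action, reward))
        state = next_state
        if done:
            break
    return transitions

episode = generate_episode(["the", "movie", "is"])
print(episode)
```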