Robotics & Perception/Reinforcement Learning

Proximal Policy Optimization Algorithms (PPO) Hyper-parameters


🔖 Questions

  • What is the difference between the advantage function, the reward, and the value function?
  • PPO-clip is said not to use a KL-divergence term, so why is there an approx_kl metric?
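A partial answer to the second question: the PPO-clip loss indeed contains no KL term, but implementations such as Stable-Baselines3 still log a low-variance KL estimate between the old and updated policy purely as a diagnostic (and for optional early stopping via target_kl). A minimal standalone sketch, assuming we already have sampled log-probabilities from both policies (the function name `approx_kl` here is illustrative, not the library's code):

```python
import numpy as np

def approx_kl(log_prob_old, log_prob_new):
    """Estimate KL(old || new) from action log-probabilities sampled under the old policy."""
    log_ratio = np.asarray(log_prob_new) - np.asarray(log_prob_old)
    ratio = np.exp(log_ratio)
    # (ratio - 1) - log(ratio) is an unbiased, always non-negative KL estimator,
    # unlike the naive mean of -log_ratio, which can go negative from sampling noise.
    return np.mean((ratio - 1.0) - log_ratio)

# Identical policies give (approximately) zero estimated KL.
same = approx_kl(np.log([0.5, 0.5]), np.log([0.5, 0.5]))
```

Because the clipped objective only bounds the ratio per sample, watching this estimate is a cheap sanity check that the policy is not drifting too far in a single update.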

🔖 Notions to understand

  • Reward
  • Loss
    • entropy loss: entropy bonus that ensures sufficient exploration.
    • value loss $L_t^{VF} (\theta) = {(V_\theta(s_t)-V_t^{targ})}^2$
    • Policy gradient loss $L_t^{CLIP}$
  • Procedure
    • Epoch: one full pass over the collected rollout data
    • Mini-batch: a single mini-batch contains samples drawn from several episodes/rollouts.
    • Episode $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \cdots, s_T, a_T, r_T)$
    • Time step: each time step of an episode yields one data point.
  • train (logged metrics)
    • approx_kl: an approximate KL divergence between the old and updated policy, logged as a diagnostic only; it is not a term in the PPO-clip loss.
    • std: the standard deviation of the action distribution used when sampling actions. With gSDE enabled the exploration noise becomes state-dependent; the logged std shows how much the policy is still exploring.
    • explained_variance: how well the value function explains the observed returns (1 is perfect; 0 or below means it does no better than predicting the mean).
    • clip_fraction: the fraction of samples whose probability ratio was clipped by clip_range.
  • Hyper-parameters
    • verbose
    • n_steps: The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel)
    • batch_size: mini-batch size
    • n_epochs
    • gamma: discount factor used when computing returns
    • learning_rate: step size for the gradient update of the parameters
    • gae_lambda: bias-variance trade-off factor for Generalized Advantage Estimation (GAE)
    • clip_range: clipping parameter for the policy probability ratio
    • clip_range_vf: Clipping parameter for the value function
    • normalize_advantage: Whether or not to normalize the advantage
    • ent_coef: Entropy coefficient for the loss calculation
    • vf_coef:  Value function coefficient for the loss calculation
    • max_grad_norm: The maximum value for the gradient clipping
    • use_sde: whether to use generalized State-Dependent Exploration (gSDE) instead of action-space noise
    • sde_sample_freq:  Sample a new noise matrix every n steps when using gSDE
    • target_kl: Limit the KL divergence between updates, because the clipping is not enough to prevent large update
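The loss pieces above (policy gradient loss, value loss, entropy bonus) and the diagnostics can be tied together in a minimal NumPy sketch. This is a standalone illustration of the PPO-clip computation, not Stable-Baselines3's actual implementation; the helper name `ppo_loss` and its signature are hypothetical:

```python
import numpy as np

def ppo_loss(log_prob_new, log_prob_old, advantages, values, returns,
             clip_range=0.2, ent_coef=0.0, vf_coef=0.5, entropy=None):
    """Clipped PPO loss on one mini-batch, plus the clip_fraction diagnostic."""
    # normalize_advantage=True: zero-mean, unit-std advantages per mini-batch.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Probability ratio pi_new(a|s) / pi_old(a|s).
    ratio = np.exp(log_prob_new - log_prob_old)
    # L^CLIP: pessimistic minimum of the unclipped and clipped surrogate
    # objectives; negated because we phrase it as a loss to minimize.
    pg_loss = -np.mean(np.minimum(
        ratio * adv,
        np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv))
    # L^VF: squared error between value predictions and return targets.
    value_loss = np.mean((values - returns) ** 2)
    # Diagnostic: fraction of samples where the ratio hit the clip boundary.
    clip_fraction = np.mean(np.abs(ratio - 1.0) > clip_range)
    ent = 0.0 if entropy is None else np.mean(entropy)
    # Total loss: pg_loss + vf_coef * L^VF - ent_coef * entropy bonus.
    total = pg_loss + vf_coef * value_loss - ent_coef * ent
    return total, clip_fraction
```

With identical old and new log-probabilities the ratio is 1 everywhere, so nothing is clipped and the policy term vanishes; that makes a convenient sanity check when wiring the pieces together.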