Questions
- What is the difference between the advantage function, the reward, and the value function? (standard definitions recalled right after this list)
- PPO-clip supposedly does not use a KL-divergence term, so why does the log report approx_kl?
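A quick reference for the first question (standard RL definitions, nothing SB3-specific): the reward $r_t$ is the immediate scalar signal from the environment, the value functions are expected discounted returns, and the advantage measures how much better a specific action is than the policy's average behaviour in that state:
- $V^\pi(s_t) = \mathbb{E}_\pi\big[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t\big]$
- $Q^\pi(s_t, a_t) = \mathbb{E}_\pi\big[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_t, a_t\big]$
- $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$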
Notions to understand
- Reward
- Loss
- entropy loss: entropy bonus that ensures sufficient exploration.
- value loss $L_t^{VF} (\theta) = {(V_\theta(s_t)-V_t^{targ})}^2$
- Policy gradient loss $L_t^{CLIP}$ (the clipped surrogate and the combined objective are written out below)
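For reference, the clipped surrogate and the combined objective as written in the PPO paper, with $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ (SB3 minimizes the negative of this, with $c_1$ ~ vf_coef and $c_2$ ~ ent_coef):
- $L_t^{CLIP}(\theta) = \hat{\mathbb{E}}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\big]$
- $L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\big[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\big]$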
- Procedure
- Epoch: one full pass over the collected rollout data
- Mini batch/one batch: a single mini-batch contains transitions from several episodes/rollouts.
- Episode $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \cdots, s_T, a_T, r_T)$
- time step: each time step of an episode is one data point (see the sketch after this block).
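A rough sketch of how these pieces nest in a PPO-style update loop. This is only an illustration, not SB3's actual code; the buffer is assumed to hold n_steps * n_envs transitions, and `update_fn` is a placeholder for one gradient step.

```python
import numpy as np

def ppo_update(rollout_buffer, n_epochs, batch_size, update_fn):
    """rollout_buffer: list of transitions collected over n_steps * n_envs time steps.
    Each transition (one time step) is one data point, and the buffer usually
    spans several, possibly partial, episodes."""
    buffer_size = len(rollout_buffer)
    for _epoch in range(n_epochs):  # one epoch = one pass over the whole buffer
        indices = np.random.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            # one mini-batch = a random slice of transitions, usually mixing episodes
            mini_batch = [rollout_buffer[i] for i in indices[start:start + batch_size]]
            update_fn(mini_batch)  # one gradient step per mini-batch
```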
- train
- approx_kl: seems to be mostly for monitoring (see the sketch after this list)
- std: seems to be used for action sampling; I think it may be a fixed (state-independent) parameter, and it seems related to gSDE, but I am not sure exactly why it is tracked.
- explained_variance
- clip_fraction
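My understanding of how these diagnostics are typically computed from the probability ratio; a sketch under the assumption that SB3 uses the (ratio - 1) - log(ratio) KL estimator, so the exact code may differ. As far as I can tell, std is just the current (state-independent) standard deviation of the Gaussian action distribution, which gSDE replaces with state-dependent noise.

```python
import torch as th

def train_diagnostics(log_prob_new, log_prob_old, values_pred, returns, clip_range):
    """log_prob_new / log_prob_old: log pi(a|s) under the current and rollout policies."""
    log_ratio = log_prob_new - log_prob_old
    ratio = th.exp(log_ratio)
    # approx_kl: cheap estimator of the KL between old and new policy; used for logging
    # (and, if target_kl is set, for early-stopping the epoch loop).
    approx_kl = th.mean((ratio - 1.0) - log_ratio).item()
    # clip_fraction: share of samples where the ratio actually hit the clip boundary.
    clip_fraction = th.mean((th.abs(ratio - 1.0) > clip_range).float()).item()
    # explained_variance: 1 - Var(returns - predicted values) / Var(returns);
    # close to 1 means the value function predicts the returns well, <= 0 means it is useless.
    explained_var = 1.0 - th.var(returns - values_pred) / th.var(returns)
    return approx_kl, clip_fraction, explained_var.item()
```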
- Hyper-parameters (a constructor example follows this list)
- verbose
- n_steps: The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel)
- batch_size: mini-batch size
- n_epochs
- gamma: discount factor used when computing returns from rewards
- learning_rate: step size for the parameter updates
- gae_lambda: trade-off factor for the GAE advantage estimate
- clip_range: defines the clipping parameter for the policy ratio
- clip_range_vf: Clipping parameter for the value function
- normalize_advantage: Whether or not to normalize the advantage
- ent_coef: Entropy coefficient for the loss calculation
- vf_coef: Value function coefficient for the loss calculation
- max_grad_norm: The maximum value for the gradient clipping
- use_sde: whether to use generalized State-Dependent Exploration (gSDE)
- sde_sample_freq: Sample a new noise matrix every n steps when using gSDE
- target_kl: Limit the KL divergence between updates, because the clipping is not enough to prevent large updates
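How these hyper-parameters map onto the SB3 PPO constructor. A minimal sketch assuming a recent SB3 that uses gymnasium; the environment (Pendulum-v1) and the specific values are only placeholders, mostly SB3 defaults.

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")  # placeholder continuous-control env

model = PPO(
    "MlpPolicy",
    env,
    verbose=1,
    n_steps=2048,            # rollout buffer = n_steps * n_envs transitions
    batch_size=64,           # mini-batch size
    n_epochs=10,             # passes over the rollout buffer per update
    gamma=0.99,              # discount factor
    learning_rate=3e-4,
    gae_lambda=0.95,         # GAE trade-off factor
    clip_range=0.2,          # policy-ratio clipping
    clip_range_vf=None,      # no value-function clipping
    normalize_advantage=True,
    ent_coef=0.0,            # c2 in the combined objective
    vf_coef=0.5,             # c1 in the combined objective
    max_grad_norm=0.5,
    use_sde=True,            # generalized state-dependent exploration
    sde_sample_freq=4,       # resample gSDE noise every 4 steps
    target_kl=None,          # optionally early-stop epochs on large approx_kl
)
model.learn(total_timesteps=100_000)
```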