Robotics & Perception/Reinforcement Learning

Proximal Policy Optimization Algorithms (PPO) Hyper-parameters


🔖 Questions

  • What is the difference between the advantage function, the reward, and the value function?
  • PPO-clip is said not to use a KL-divergence term, so why is there an approx_kl metric?
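A partial answer to the second question: the PPO-clip loss indeed contains no KL term, but implementations such as Stable-Baselines3 still log a low-variance KL estimate between the old and updated policy purely as a diagnostic (and for optional early stopping via target_kl). A minimal standalone sketch, assuming we already have sampled log-probabilities from both policies (the function name `approx_kl` here is illustrative, not the library's code):

```python
import numpy as np

def approx_kl(log_prob_old, log_prob_new):
    """Estimate KL(old || new) from action log-probabilities sampled under the old policy."""
    log_ratio = np.asarray(log_prob_new) - np.asarray(log_prob_old)
    ratio = np.exp(log_ratio)
    # (ratio - 1) - log(ratio) is an unbiased, always non-negative KL estimator,
    # unlike the naive mean of -log_ratio, which can go negative from sampling noise.
    return np.mean((ratio - 1.0) - log_ratio)

# Identical policies give (approximately) zero estimated KL.
same = approx_kl(np.log([0.5, 0.5]), np.log([0.5, 0.5]))
```

Because the clipped objective only bounds the ratio per sample, watching this estimate is a cheap sanity check that the policy is not drifting too far in a single update.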

🔖 Notions to understand

  • Reward
  • Loss
    • entropy loss: entropy bonus that ensures sufficient exploration.
    • value loss $L_t^{VF} (\theta) = {(V_\theta(s_t)-V_t^{targ})}^2$
    • Policy gradient loss $L_t^{CLIP}$
  • Procedure
    • Epoch: one full pass over the collected rollout data
    • Mini-batch: a single mini-batch contains samples drawn from several episodes/rollouts.
    • Episode $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \cdots, s_T, a_T, r_T)$
    • Time step: each time step of an episode yields one data point.
  • train (logged metrics)
    • approx_kl: an approximate KL divergence between the old and updated policy, logged as a diagnostic only; it is not a term in the PPO-clip loss.
    • std: the standard deviation of the action distribution used when sampling actions. With gSDE enabled the exploration noise becomes state-dependent; the logged std shows how much the policy is still exploring.
    • explained_variance: how well the value function explains the observed returns (1 is perfect; 0 or below means it does no better than predicting the mean).
    • clip_fraction: the fraction of samples whose probability ratio was clipped by clip_range.
  • Hyper-parameters
    • verbose
    • n_steps: The number of steps to run for each environment per update (i.e. rollout buffer size is n_steps * n_envs where n_envs is number of environment copies running in parallel)
    • batch_size: mini-batch size
    • n_epochs
    • gamma: discount factor used when computing returns
    • learning_rate: step size for the gradient update of the parameters
    • gae_lambda: bias-variance trade-off factor for Generalized Advantage Estimation (GAE)
    • clip_range: clipping parameter for the policy probability ratio
    • clip_range_vf: Clipping parameter for the value function
    • normalize_advantage: Whether or not to normalize the advantage
    • ent_coef: Entropy coefficient for the loss calculation
    • vf_coef:  Value function coefficient for the loss calculation
    • max_grad_norm: The maximum value for the gradient clipping
    • use_sde: whether to use generalized State-Dependent Exploration (gSDE) instead of action-space noise
    • sde_sample_freq:  Sample a new noise matrix every n steps when using gSDE
    • target_kl: Limit the KL divergence between updates, because the clipping is not enough to prevent large update
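The loss pieces above (policy gradient loss, value loss, entropy bonus) and the diagnostics can be tied together in a minimal NumPy sketch. This is a standalone illustration of the PPO-clip computation, not Stable-Baselines3's actual implementation; the helper name `ppo_loss` and its signature are hypothetical:

```python
import numpy as np

def ppo_loss(log_prob_new, log_prob_old, advantages, values, returns,
             clip_range=0.2, ent_coef=0.0, vf_coef=0.5, entropy=None):
    """Clipped PPO loss on one mini-batch, plus the clip_fraction diagnostic."""
    # normalize_advantage=True: zero-mean, unit-std advantages per mini-batch.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # Probability ratio pi_new(a|s) / pi_old(a|s).
    ratio = np.exp(log_prob_new - log_prob_old)
    # L^CLIP: pessimistic minimum of the unclipped and clipped surrogate
    # objectives; negated because we phrase it as a loss to minimize.
    pg_loss = -np.mean(np.minimum(
        ratio * adv,
        np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * adv))
    # L^VF: squared error between value predictions and return targets.
    value_loss = np.mean((values - returns) ** 2)
    # Diagnostic: fraction of samples where the ratio hit the clip boundary.
    clip_fraction = np.mean(np.abs(ratio - 1.0) > clip_range)
    ent = 0.0 if entropy is None else np.mean(entropy)
    # Total loss: pg_loss + vf_coef * L^VF - ent_coef * entropy bonus.
    total = pg_loss + vf_coef * value_loss - ent_coef * ent
    return total, clip_fraction
```

With identical old and new log-probabilities the ratio is 1 everywhere, so nothing is clipped and the policy term vanishes; that makes a convenient sanity check when wiring the pieces together.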