Robotics & Perception/Reinforcement Learning

    [Advanced Topics] 02. Representation Learning for RL

    This post is a summary I wrote for my own review and understanding after taking Professor 이기민's AI707 course in the Fall 2023 semester. RL from pixels is difficult: poor sample-efficiency and poor generalization to different environments. Representation learning resolves these issues. What is representation learning? Learning a representation of the data that contains useful information for ML methods. However, it is also true that optimizing a main task objective might n..
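
    Not from the post itself, but a minimal sketch of the idea it previews: a shared feature extractor trained with the RL loss plus an auxiliary representation objective (here a toy reconstruction error). All names (encode, auxiliary_loss, beta) and numbers are illustrative placeholders, not the method the post discusses.

```python
def encode(pixels, w_enc):
    """Toy 'encoder': a single weighted sum of the flattened pixels."""
    return sum(w * p for w, p in zip(w_enc, pixels))

def auxiliary_loss(feature, pixels, w_dec):
    """Auxiliary objective: reconstruct the input from the feature."""
    recon = [w * feature for w in w_dec]
    return sum((r - p) ** 2 for r, p in zip(recon, pixels)) / len(pixels)

def total_loss(rl_loss, feature, pixels, w_dec, beta=0.1):
    # The RL objective alone may not shape useful features from pixels,
    # so an auxiliary representation term is added, weighted by beta.
    return rl_loss + beta * auxiliary_loss(feature, pixels, w_dec)

pixels = [0.2, 0.8, 0.5, 0.1]
feature = encode(pixels, w_enc=[0.3, 0.3, 0.2, 0.2])
print(total_loss(rl_loss=1.5, feature=feature, pixels=pixels, w_dec=[0.5, 1.5, 1.0, 0.3]))
```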

    [Advanced Topics] 01. RL with human feedback

    This post is a summary I wrote for my own review and understanding after taking Professor 이기민's AI707 course in the Fall 2023 semester. (I feel honored to be able to attend lectures given in person by world-renowned professors!!) Rough introduction of Reinforcement Learning Reinforcement Learning: finding an optimal policy for a sequential decision-making problem through interactions and learning. By interacting with the environment, the agent generates roll-outs of the form $\tau = \{(s_0, a_0, r_0), \cdots, (s_H, a_H, r_H)..
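
    A minimal sketch (not from the post) of the roll-out definition quoted above: interacting with a toy environment to produce $\tau$ as a list of $(s_t, a_t, r_t)$ tuples. The environment dynamics and the random policy are invented for illustration.

```python
import random

def toy_env_step(state, action):
    """Illustrative dynamics: returns (next_state, reward)."""
    next_state = state + action
    reward = 1.0 if next_state == 0 else 0.0
    return next_state, reward

def collect_rollout(horizon=5):
    state, tau = 0, []
    for _ in range(horizon + 1):            # t = 0, ..., H
        action = random.choice([-1, 1])     # stand-in for sampling from pi(a|s)
        next_state, reward = toy_env_step(state, action)
        tau.append((state, action, reward)) # (s_t, a_t, r_t)
        state = next_state
    return tau

print(collect_rollout())
```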

    Proximal Policy Optimization Algorithms (PPO) Hyper-parameters

    🔖 Questions What is the difference between the advantage function and the reward and value functions? PPO-clip is said not to use a KL-divergence term, so why is there an approx_kl? 🔖 Notions to understand Reward Loss entropy loss: entropy bonus that ensures sufficient exploration. value loss $L_t^{VF} (\theta) = {(V_\theta(s_t)-V_t^{targ})}^2$ Policy gradient loss $L_t^{CLIP}$ Procedure Epoch: one full pass over the dataset Mini batch/one batch: one mini bat..
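
    A hedged sketch of how the loss terms listed above are typically combined in PPO-clip, for a single sample; the coefficients c1, c2 and the clip range epsilon are illustrative defaults, and approx_kl is computed only as a diagnostic, not as part of the loss.

```python
import math

def ppo_loss(logp_new, logp_old, advantage, value, value_target, entropy,
             epsilon=0.2, c1=0.5, c2=0.01):
    ratio = math.exp(logp_new - logp_old)                  # pi_theta / pi_theta_old
    clipped = max(min(ratio, 1 + epsilon), 1 - epsilon)    # clip(ratio, 1-eps, 1+eps)
    l_clip = min(ratio * advantage, clipped * advantage)   # clipped surrogate L^CLIP
    l_vf = (value - value_target) ** 2                     # value loss L^VF
    approx_kl = logp_old - logp_new                        # rough KL estimate, diagnostic only
    # maximize L^CLIP and the entropy bonus, minimize L^VF -> minimize:
    loss = -l_clip + c1 * l_vf - c2 * entropy
    return loss, approx_kl

loss, kl = ppo_loss(logp_new=-0.9, logp_old=-1.0, advantage=0.5,
                    value=1.2, value_target=1.0, entropy=1.3)
print(loss, kl)
```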

    [Policy Gradient] Vanilla Policy Gradient, Trust region policy optimization (TRPO), Proximal Policy Optimization Algorithms (PPO)

    This post organizes the equations from the paper (no derivations); I also referred to the documentation. 🔖 Simplest Policy Gradient We consider the case of a stochastic, parameterized policy $\pi_\theta$. We aim to maximize the expected return $J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} [R(\tau)]$. For this, we want to optimize the policy by gradient ascent. $\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)|_{\theta_k}$ The gradient ..
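
    A minimal sketch (not from the paper) of the gradient-ascent update above, using the REINFORCE estimator with a one-parameter Bernoulli policy on a toy bandit; every number here is illustrative.

```python
import math, random

def sample_action(theta):
    p = 1.0 / (1.0 + math.exp(-theta))     # pi_theta(a=1)
    a = 1 if random.random() < p else 0
    grad_logp = a - p                      # d/dtheta log pi_theta(a) for this Bernoulli policy
    return a, grad_logp

def policy_gradient_step(theta, alpha=0.1, n_episodes=100):
    grad_estimate = 0.0
    for _ in range(n_episodes):
        a, grad_logp = sample_action(theta)
        ret = 1.0 if a == 1 else 0.0       # R(tau): only action 1 is rewarded
        grad_estimate += ret * grad_logp / n_episodes
    return theta + alpha * grad_estimate   # gradient *ascent* on J(pi_theta)

theta = 0.0
for _ in range(50):
    theta = policy_gradient_step(theta)
print(theta)                               # grows, so pi_theta(a=1) increases
```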

    [David Silver] 7. Policy Gradient: REINFORCE, Actor-Critic, NPG

    This post is my summary of David Silver's Reinforcement Learning course. In the last lecture, we approximated the value or action-value function using parameters $\theta$, and the policy was generated directly from the value function. In this lecture, we will directly parameterise the policy as a stochastic $\pi_\theta(s,a) = \mathbb{P} [a|s, \theta]$. This taxonomy explains value-based and policy-based RL well. Value-base..
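
    A hedged sketch (not from the lecture) of a directly parameterised stochastic policy $\pi_\theta(s,a) = \mathbb{P}[a|s,\theta]$, implemented as a softmax over linear action preferences; the feature vector and weights are made up for illustration.

```python
import math, random

def softmax_policy(theta, state_features, n_actions):
    """Returns the distribution pi_theta(. | s) over discrete actions."""
    prefs = [sum(theta[a][i] * x for i, x in enumerate(state_features))
             for a in range(n_actions)]
    m = max(prefs)                               # for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

theta = [[0.5, -0.2],                            # one weight row per action
         [0.1, 0.3]]
probs = softmax_policy(theta, state_features=[1.0, 2.0], n_actions=2)
action = random.choices([0, 1], weights=probs)[0]
print(probs, action)
```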

    [David Silver] 6. Value Function Approximation: Experiment Replay, Deep Q-Network (DQN)

    This post is my summary of David Silver's Reinforcement Learning course. This lecture suggests a solution for large MDPs using function approximation. We have to scale up the model-free methods for prediction and control, so in lectures 6 and 7 we will learn how we can scale up the model-free methods. How have we dealt with small (not large) MDPs so far? We have represented the value function by a looku..
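
    A minimal sketch (not from the lecture) of the two scaling-up ingredients the post previews: a parameterised value function in place of a lookup table, and an experience-replay buffer sampled in random minibatches; sizes and names are illustrative.

```python
import random
from collections import deque

def q_value(w, state_features, action):
    """Linear function approximator: Q(s, a; w) ~ w_a . x(s) instead of a table."""
    return sum(wi * xi for wi, xi in zip(w[action], state_features))

replay_buffer = deque(maxlen=10_000)       # stores (s, a, r, s') transitions

def store(transition):
    replay_buffer.append(transition)

def sample_minibatch(batch_size=32):
    """Random minibatches break the correlation between consecutive samples."""
    batch_size = min(batch_size, len(replay_buffer))
    return random.sample(list(replay_buffer), batch_size)

for t in range(100):                       # act in the environment, store experience
    store((t, 0, 1.0, t + 1))
print(q_value([[0.5, 0.1], [0.2, 0.0]], state_features=[1.0, 2.0], action=0))
print(len(sample_minibatch()), "transitions sampled for one update")
```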

    [David Silver] 5. Model-Free Control: On-policy (GLIE, SARSA), Off-policy (Importance Sampling, Q-Learning)

    This post is my summary of David Silver's Reinforcement Learning course. In the previous post, to solve an unknown MDP we had to (1) estimate the value function (Model-Free Prediction) and (2) optimize the value function (Model-Free Control). In this lecture, we are going to learn (2) how to optimize the value function based on the (1) methodologies, MC and TD. So the goal we have to achiev..
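
    A hedged sketch (not from the lecture) of the two tabular control updates covered there: SARSA bootstraps from the action actually taken next (on-policy), while Q-learning bootstraps from the greedy action (off-policy). The step size and discount are illustrative.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[(s_next, a_next)]                    # action actually taken next
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[(s_next, b)] for b in actions)   # greedy action instead
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)                                          # tabular Q, default 0
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(dict(Q))
```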

    [David Silver] 4. Model-Free Prediction: Monte-Carlo, Temporal-Difference

    This post is my summary of David Silver's Reinforcement Learning course. The last lecture was about Planning by Dynamic Programming, which solves a known MDP. Now we are going to check how we can solve an unknown MDP (i.e. Model-free RL). To solve an unknown MDP we have to (1) estimate the value function of the unknown MDP. We usually call this Model-free prediction (Policy evaluation). After that, we will ..
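
    A minimal sketch (not from the lecture) of the two model-free prediction updates: Monte-Carlo moves $V(s)$ toward the full return $G_t$, while TD(0) moves it toward the bootstrapped target $r + \gamma V(s')$; the episode and constants are illustrative.

```python
def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """episode: list of (state, reward); update every state toward its return G_t."""
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Update after every step, bootstrapping from the current estimate V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

V = {0: 0.0, 1: 0.0, 2: 0.0}
mc_update(V, episode=[(0, 0.0), (1, 0.0), (2, 1.0)])
td0_update(V, s=0, r=0.0, s_next=1)
print(V)
```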

    [David Silver] 3. Planning by Dynamic Programming

    This post is my summary of David Silver's Reinforcement Learning course. (2023.09.12) I have additionally filled in the parts I did not fully understand after taking Professor 임재환's AI611 graduate course - marked in purple. This lecture is about a solution for a known MDP: Dynamic Programming. We will talk about what dynamic programming is and show that an MDP is solvable. 🥭 Dynamic Programming Dynamic programming is a method for solving complex problems by breaking them down in..
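
    A hedged sketch (not from the lecture) of solving a small known MDP by dynamic programming, here value iteration; the toy transition model is invented for illustration.

```python
# Toy known MDP: P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup using the known model
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

print(value_iteration(P))
```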

    [David Silver] 2. Markov Decision Processes

    This post is my summary of David Silver's Reinforcement Learning course. (2023.09.12) I have additionally filled in the parts I did not fully understand after taking Professor 임재환's AI611 graduate course - additions marked in purple. A Markov decision process formally describes a fully observable environment for reinforcement learning. 🥭 Markov Processes Based on the Markov property, a Markov process is a random process, i.e. a sequence of random states $S_1, S_2, \cdots$ with the M..
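
    A minimal sketch (not from the lecture) of a Markov process: a state set plus a transition matrix, from which a random state sequence $S_1, S_2, \cdots$ is sampled using only the current state. The states and probabilities are made up.

```python
import random

states = ["Study", "Pub", "Sleep"]
# P[s][s'] = probability of moving from state s to state s'
P = {
    "Study": {"Study": 0.5, "Pub": 0.3, "Sleep": 0.2},
    "Pub":   {"Study": 0.4, "Pub": 0.4, "Sleep": 0.2},
    "Sleep": {"Sleep": 1.0},                 # absorbing terminal state
}

def sample_chain(start="Study", max_len=10):
    s, chain = start, [start]
    for _ in range(max_len):
        if s == "Sleep":
            break
        nexts = list(P[s])
        s = random.choices(nexts, weights=[P[s][s2] for s2 in nexts])[0]
        chain.append(s)
    return chain

print(sample_chain())                        # e.g. ['Study', 'Pub', 'Study', 'Sleep']
```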

    [David Silver] 1. Introduction to Reinforcement learning

    This post is my summary of David Silver's Reinforcement Learning course. 🧵 Sequential decision making Goal: select actions to maximize total future reward. To maximize total future reward, there may be tasks where long-term reward matters, so it can be better to sacrifice immediate reward to gain more long-term reward. Reward may be delayed. Solution of Sequential decision making Reinforcement Learning Pl..
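
    A tiny illustration (not from the lecture) of the point about delayed reward: an action sequence with no immediate reward can still be better once total future reward is counted; the two reward sequences are invented.

```python
def total_return(rewards, gamma=1.0):
    """Total (optionally discounted) future reward of a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

take_now = [1, 0, 0, 0]      # grab the immediate reward
wait     = [0, 0, 0, 10]     # sacrifice immediate reward for a larger delayed one

print(total_return(take_now), total_return(wait))   # 1 vs 10
```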

    [Basic] Sorting Out Confusing Terminology

    On-policy vs. Off-policy RL On-policy learning: Learn about policy $\pi$ from experience sampled from $\pi$. A reinforcement learning algorithm of this kind can only learn when the policy being learned and the policy used to act are the same. ex) Sarsa: in the on-policy case, the moment the policy is improved even once, all of the past experience generated by the old policy becomes unusable, so data efficiency is very poor. Experience obtained from one round of exploration cannot simply be reused once it has been learned from (it can only be reused with techniques such as importance sampling). Off-policy learn..
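
    A hedged sketch (not from the post) of the importance-sampling idea mentioned above: experience generated by a behaviour policy $\mu$ can be reused to evaluate a different target policy $\pi$ by reweighting with the ratio $\prod_t \pi(a_t|s_t)/\mu(a_t|s_t)$; the toy policies and trajectory are illustrative.

```python
def importance_weight(trajectory, pi, mu):
    """trajectory: list of (state, action); pi and mu map (s, a) -> probability."""
    w = 1.0
    for s, a in trajectory:
        w *= pi(s, a) / mu(s, a)
    return w

pi = lambda s, a: 0.9 if a == 1 else 0.1    # target policy (the one we learn about)
mu = lambda s, a: 0.5                       # behaviour policy (the one that acted)

trajectory = [(0, 1), (1, 1), (2, 0)]
G = 3.0                                     # return observed under mu
print(importance_weight(trajectory, pi, mu) * G)   # reweighted return for pi
```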