P2 Proximal Policy Optimization (PPO)
Importance Sampling:
On-policy -> Off-policy
Gradient for update:
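With importance sampling, data collected by the behaviour policy $\pi_{\theta'}$ is reused to update $\theta$; the standard off-policy gradient has the form:

$$
\nabla \bar{R}_\theta = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t \mid s_t)\right]
$$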
*KL divergence is commonly used to measure the distance between two probability distributions
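For reference, the discrete KL divergence and the KL-penalty form of the PPO objective (notation follows the usual PPO-penalty formulation):

$$
\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}, \qquad
J_{\mathrm{PPO}}(\theta) = J^{\theta'}(\theta) - \beta\, \mathrm{KL}\!\left(\pi_\theta, \pi_{\theta'}\right)
$$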
Q-learning
Critic: evaluates how good an action is
Monte-Carlo (MC) based approach
Temporal-Difference (TD) approach
*Note: what do MC and TD stand for, and how do they differ?
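In short, MC regresses the value toward the full episode return, while TD bootstraps from a single step:

$$
\text{MC:}\quad V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\left(G_t - V^\pi(s_t)\right), \qquad G_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}
$$

$$
\text{TD:}\quad V^\pi(s_t) \leftarrow V^\pi(s_t) + \alpha\left(r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right)
$$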
Target Network
Exploration: Epsilon Greedy
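A minimal epsilon-greedy sketch (the function name and the annealing comment are my own):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# epsilon is usually annealed from ~1.0 toward a small value during training
action = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1)
```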
Replay Buffer
Reduces the time spent interacting with the environment
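A minimal replay buffer sketch, assuming transitions are stored as (s, a, r, s_next, done) tuples:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past transitions, sampled randomly for updates."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Re-using old transitions means fewer environment interactions
        # are needed per gradient step.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```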
Typical Q-learning Algorithm
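A minimal sketch of one DQN-style update step combining the target network and a sampled batch; layer sizes, hyperparameters, and the dummy batch are placeholders:

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy sizes
target_net = copy.deepcopy(q_net)          # frozen copy, refreshed only every C steps
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# Dummy batch standing in for a sample drawn from the replay buffer.
s      = torch.randn(32, 4)
a      = torch.randint(0, 2, (32,))
r      = torch.randn(32)
s_next = torch.randn(32, 4)
done   = torch.zeros(32)

# TD target computed with the fixed target network.
with torch.no_grad():
    target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
loss = nn.functional.mse_loss(q_sa, target)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Periodically: target_net.load_state_dict(q_net.state_dict())
```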
Tips for Q-Learning
Double DQN
Q-values tend to be overestimated, because the target is always set too high
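Reusing q_net, target_net, and the batch tensors from the sketch above: the Double DQN target picks the action with the online network but evaluates it with the target network, which counteracts the overestimation:

```python
# Double DQN target: select with q_net, evaluate with target_net.
with torch.no_grad():
    best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # action chosen by online net
    q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # value given by target net
    target = r + gamma * (1 - done) * q_eval
```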
Dueling DQN
Modifies the network architecture: Q is decomposed into V + A
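A sketch of that architecture with toy layer sizes; subtracting the mean advantage is one common way to keep V and A identifiable:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.feature   = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value     = nn.Linear(hidden, 1)           # V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, s):
        h = self.feature(s)
        v = self.value(h)
        a = self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQNet()(torch.randn(32, 4))  # -> shape (32, 2)
```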
Prioritized Replay
Multi-step
Q-Learning for Continuous Actions
Q-learning does not handle continuous actions easily, e.g. autonomous driving or robot control
Using gradient ascent to solve the optimization problem
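One way to do this (a sketch with a made-up critic and toy sizes): treat the action as a free variable and run gradient ascent on Q(s, a) to approximate the argmax:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))  # Q(s, a)

s = torch.randn(1, 4)                      # a single state
a = torch.zeros(1, 2, requires_grad=True)  # continuous action, optimized directly
opt = torch.optim.Adam([a], lr=0.1)

for _ in range(50):                        # inner loop: approximate argmax_a Q(s, a)
    opt.zero_grad()
    loss = -critic(torch.cat([s, a], dim=1)).sum()  # negate for gradient ascent
    loss.backward()
    opt.step()
```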
Policy-based (PPO)
Value-based (Q-learning)
Asynchronous Advantage Actor-Critic
Review Policy Gradient
$$
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(\sum_{t'=t}^{T_n}\gamma^{t'-t}\, r_{t'}^{n} - b\right)\nabla \log p_\theta(a_t^{n} \mid s_t^{n})
$$

$$
\sum_{t'=t}^{T_n}\gamma^{t'-t}\, r_{t'}^{n} - b \;\approx\; r_t^{n} + V^\pi(s_{t+1}^{n}) - V^\pi(s_t^{n}), \qquad b = V^\pi(s_t^{n})
$$
Actor-Critic
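A minimal sketch of an advantage actor-critic update using the TD form of the advantage above; networks, sizes, and the dummy batch are placeholders:

```python
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # action logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

s      = torch.randn(32, 4)
a      = torch.randint(0, 2, (32,))
r      = torch.randn(32)
s_next = torch.randn(32, 4)

v      = critic(s).squeeze(1)
v_next = critic(s_next).squeeze(1).detach()
advantage = r + gamma * v_next - v                      # r_t + V(s_{t+1}) - V(s_t)

log_prob = torch.log_softmax(actor(s), dim=1).gather(1, a.unsqueeze(1)).squeeze(1)
actor_loss  = -(log_prob * advantage.detach()).mean()   # policy gradient with advantage
critic_loss = advantage.pow(2).mean()                   # fit V to the TD target

opt.zero_grad()
(actor_loss + critic_loss).backward()
opt.step()
```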
Pathwise Derivative Policy Gradient
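Here the actor outputs the action directly and is trained by backpropagating through the critic (DDPG-style); a sketch with made-up networks and sizes:

```python
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())  # a = pi(s)
critic = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 1))         # Q(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(32, 4)
a = actor(s)
# Maximize Q(s, pi(s)); the gradient flows through the critic into the actor.
actor_loss = -critic(torch.cat([s, a], dim=1)).mean()

actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```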
Sparse Reward
In most situations the agent gets no reward at all
Reward Shaping:
ICM = intrinsic curiosity module: encourages the agent to explore / take risks
Given $a_t$ and $s_t$, Network 1 predicts $s_{t+1}$; the gap between the prediction and the true $s_{t+1}$ is the intrinsic reward, so actions whose outcomes are hard to predict are encouraged
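A rough sketch of that forward-model part (Network 1): the prediction error on the next state is paid out as the intrinsic reward; all names and sizes below are toy stand-ins:

```python
import torch
import torch.nn as nn

forward_model = nn.Sequential(nn.Linear(4 + 2, 64), nn.ReLU(), nn.Linear(64, 4))  # "Network 1"

s, a, s_next = torch.randn(32, 4), torch.randn(32, 2), torch.randn(32, 4)

pred_next = forward_model(torch.cat([s, a], dim=1))
# Hard-to-predict outcomes give a large error, hence a large intrinsic reward.
intrinsic_reward = (pred_next - s_next).pow(2).mean(dim=1)
```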
A feature extractor keeps only the relevant features; if Network 2 can still recover the correct $a_t$ from them, the information that was filtered out was indeed useless

Curriculum Learning
Reverse Curriculum Generation: work backwards from states that already obtain reward
Hierarchical RL (hierarchical reinforcement learning)
Imitation Learning
Behavior Cloning
- Limitation: data for extreme situations cannot be collected
- May learn useless, irrelevant behaviors
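Behavior cloning itself is just supervised learning on expert (state, action) pairs; a minimal sketch with dummy demonstrations:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # action logits
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Dummy expert demonstrations standing in for recorded (state, action) pairs.
expert_s = torch.randn(256, 4)
expert_a = torch.randint(0, 2, (256,))

for _ in range(100):  # ordinary supervised training on the expert data
    loss = nn.functional.cross_entropy(policy(expert_s), expert_a)
    opt.zero_grad()
    loss.backward()
    opt.step()
```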
Inverse RL
IRL learns the correct reward function from the expert's demonstrations; RL then uses that reward function to obtain an optimal actor. The setup is similar to a GAN.
Application: self-driving cars