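DQN and the TD update algorithm.
A value network is typically used to estimate the value of an action. The action-value function Qπ evaluates how good an action is; the TD error, together with the gradient of the network output with respect to its parameters, is used to update the value network.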
Contents
1.1 Approximate the Q*(s,a) Function
1.3 Temporal Difference(TD) Learning
1.4.2 Train DQN using TD learning
1.5 summary: DQN and TD learning
2. Extension: TD Learning Algorithm
2.1.1 Derive TD Target of Sarsa
2.1.3 Sarsa: Neural Network Version
2.2.1 Derive TD Target of Q-Learning
2.2.2 Q-Learning (tabular version)
2.4.2 One-step return vs. multi-step return
3. Extension: DQN Advanced Training Skills
3.1 revisiting DQN and TD Learning
3.2.1 Reason why we need Experience Replay
3.2.2 Experience Replay Introduction
3.2.3 TD Algorithm with Experience Replay
3.2.4 Benefits of Experience Replay
3.3 Prioritized Experience Replay
3.4 Target Network & Double DQN
3.4.2 Problem of Overestimation in DQN
3.5.1 TD Learning with Target Network
3.6.1 Why does DDQN work better?
3.7.3 Dueling Network Mathematical Principle: Overcome Non-identifiability
Review
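- The discounted return Ut is the weighted sum of future rewards.
- Qπ(st, at) reflects how good action at is in the current state st.
- Maximizing Qπ over the policy π gives Q*; the Q* function assigns a score to every action.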
1. Deep Q-Network(DQN)
Essence: use a neural network to approximate the Q* function.
1.1 Approximate the Q*(s,a) Function
Goal: Win the game (≈ maximize the total reward).
Question: Suppose we know Q*(s, a). What is the best action? The one with the highest score: $a^\star = \arg\max_a Q^\star(s, a)$.
Q* indicates how good it is for the agent to pick action a while in state s.
Challenge: We do not know Q*(s, a).
A value-based method learns a function that approximates Q*. This leads to DQN.
- Solution: Deep Q Network (DQN)
- Use neural network Q(s, a; w) to approximate Q*(s, a)
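A neural network is used to approximate Q*(s, a). Its parameters are w, its input is the state s, and its output is a vector of numbers, one score for each possible action. The network is trained from rewards, so its scores gradually improve and become more accurate.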
1.2 Apply DQN to Play Game
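- Observe the current state st. The DQN takes st as input and scores all actions; the action with the highest score is chosen as at, and the agent executes at.
- The environment then changes the state: it uses the state-transition function p to randomly draw a new state st+1, and it also reports the reward rt for this step (rt can be positive, negative, or zero).
- The reward is the supervision signal in reinforcement learning; the DQN is trained from these rewards.
- Given the new state st+1, the DQN scores all actions again and the agent picks the highest-scoring action as at+1. After at+1 is executed, the environment moves to state st+2 and returns the reward rt+1.
- This process repeats until the game ends.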
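The loop above can be sketched as follows. This is a minimal illustration, not the original post's code: `toy_q` is a hypothetical stand-in for a trained DQN's scoring function, and `toy_step` is a made-up four-state environment standing in for the state-transition function p and the reward.

```python
def greedy_action(scores):
    """Return the index of the highest-scoring action (ties go to the lowest index)."""
    return max(range(len(scores)), key=lambda a: scores[a])


def play_episode(q_fn, step_fn, s0, max_steps=100):
    """Roll out one episode by always executing the highest-scoring action."""
    s, total_reward, trajectory = s0, 0.0, []
    for _ in range(max_steps):
        a = greedy_action(q_fn(s))        # score all actions, pick the best
        s_next, r, done = step_fn(s, a)   # environment returns new state and reward
        trajectory.append((s, a, r, s_next))
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward, trajectory


# Hypothetical toy environment: states 0..3 on a line,
# action 0 = move left, action 1 = move right, reward 1 on reaching state 3.
def toy_step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done


# Hypothetical stand-in for DQN scores: always prefers "right".
def toy_q(s):
    return [0.0, 1.0]


total, traj = play_episode(toy_q, toy_step, s0=0)
```

Each element of `traj` is one transition (st, at, rt, st+1), which is exactly what a TD update consumes.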
How do we train a DQN?
1.3 Temporal Difference(TD) Learning
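TD stands for temporal difference.
Challenge: Can I update the model before finishing the trip?
The TD algorithm:
- TD target: an estimate that is partly based on an actual observation, so it is more reliable than the original prediction.
- TD error: the difference between the prediction and the TD target.
- Use gradient descent to reduce the TD error.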
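As a concrete illustration of updating before the trip is finished, the sketch below uses made-up numbers: the model predicts 1000 minutes for the whole trip, the first leg actually takes 300 minutes, and the model predicts 600 minutes for the remainder. The function name, the numbers, and the learning rate are all illustrative assumptions.

```python
def td_step(prediction, observed, remaining_estimate, gamma=1.0, alpha=0.1):
    """One scalar TD step: build the TD target, the TD error, and the updated prediction."""
    td_target = observed + gamma * remaining_estimate   # part observation, part estimate
    td_error = prediction - td_target                   # prediction minus TD target
    # Gradient of 0.5 * td_error**2 with respect to the prediction is td_error.
    new_prediction = prediction - alpha * td_error
    return td_target, td_error, new_prediction


y, delta, updated = td_step(prediction=1000.0, observed=300.0, remaining_estimate=600.0)
```

With these numbers the TD target is 900 and the TD error is 100, so the prediction is pulled down toward the target without waiting for the trip to end.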
1.4 TD Learning for DQN
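1.4.1 Condition for Using TD
To apply TD learning, the DQN should satisfy $Q(s_t, a_t; w) \approx r_t + \gamma \, Q(s_{t+1}, a_{t+1}; w)$.
Proof: The discounted return satisfies $U_t = R_t + \gamma U_{t+1}$. The DQN output $Q(s_t, a_t; w)$ is an estimate of $U_t$, and $Q(s_{t+1}, a_{t+1}; w)$ is an estimate of $U_{t+1}$, so $Q(s_t, a_t; w) \approx \mathbb{E}[R_t + \gamma \, Q(S_{t+1}, A_{t+1}; w)]$. Replacing the expectation with the observed $r_t$ and $s_{t+1}$ gives the relation above.
The left-hand side is called the prediction; the right-hand side is called the TD target.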
1.4.2 Train DQN using TD learning
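The loss function is used to compute the gradient. The gradient points in the direction in which the loss J(θ) increases fastest, and the sign of each component gives the direction along that coordinate. Gradient descent moves the parameters θ in the opposite direction. If the learning rate α is too small to carry θ over a ridge, the updates can stay trapped in a local optimum.
Momentum-based optimizers were introduced to help with this. Plain SGD uses only the current gradient. Adam additionally keeps running estimates of the first moment (momentum) and the second moment of the gradient, which smooths the update direction and adapts the step size. This makes getting stuck in a poor local optimum less likely, though it does not rule it out.
See also: [Repost] Mathematical Foundations of Deep Learning (2): Stochastic Gradient Descent (SGD), 天狼啸月1990's blog, CSDN.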
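For the agent's action at+1 at time t+1, the DQN scores all actions a, and the action with the highest score is taken as at+1.
Note that a here is not at: a is the variable of the maximization and ranges over all actions.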
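A minimal sketch of one TD update for a DQN-style model, under stated assumptions: a linear model Q(s, a; w) = w[a] · s stands in for the neural network, the TD target is y = r + γ · max_a Q(s', a; w) and is treated as a constant when differentiating, and the `done` flag and the `gamma` and `alpha` defaults are illustrative choices not taken from the text.

```python
def q_values(w, s):
    """Scores for all actions: Q(s, a; w) = dot(w[a], s) (linear stand-in for a network)."""
    return [sum(wi * si for wi, si in zip(row, s)) for row in w]


def td_update(w, s, a, r, s_next, gamma=0.9, alpha=0.1, done=False):
    """Apply one TD update to w in place and return the TD error."""
    prediction = q_values(w, s)[a]
    # TD target: observed reward plus the discounted best score in the next state.
    target = r if done else r + gamma * max(q_values(w, s_next))
    delta = prediction - target
    # For the linear model, dQ(s, a; w)/dw[a] = s, so w[a] <- w[a] - alpha * delta * s.
    w[a] = [wi - alpha * delta * si for wi, si in zip(w[a], s)]
    return delta


w = [[0.0, 0.0], [0.0, 0.0]]          # two actions, two state features
delta = td_update(w, s=[1.0, 0.0], a=1, r=1.0, s_next=[0.0, 1.0])
```

Only the row of `w` that belongs to the executed action changes, because only Q(st, at; w) appears in the prediction.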
1.5 summary: DQN and TD learning
2. Extension: TD Learning Algorithm
2.1 Sarsa Algorithm
The Sarsa algorithm is used to learn the action-value function Qπ.
2.1.1 Derive TD Target of Sarsa
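- Discounted return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots = R_t + \gamma U_{t+1}$
- Assume Rt depends on (St, At, St+1).
- Action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid s_t, a_t] = \mathbb{E}[R_t + \gamma \, Q_\pi(S_{t+1}, A_{t+1}) \mid s_t, a_t]$
Computing the expectation directly is difficult, so approximate it using Monte Carlo (MC): replace $R_t$, $S_{t+1}$, $A_{t+1}$ with the observed $r_t$, $s_{t+1}$, $a_{t+1}$ to get the TD target $y_t = r_t + \gamma \, Q_\pi(s_{t+1}, a_{t+1})$.
TD learning: Encourage Qπ(st, at) to approach yt.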
2.1.2 Sarsa: Tabular Version
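We use the tuple (st, at, rt, st+1, at+1) to learn Qπ(s, a), hence the name State-Action-Reward-State-Action (Sarsa).
- Observe a transition (st, at, rt, st+1).
- Sample at+1 ~ π(·|st+1), where π is the policy function.
- TD target: $y_t = r_t + \gamma \, Q_\pi(s_{t+1}, a_{t+1})$
- TD error: $\delta_t = Q_\pi(s_t, a_t) - y_t$, where the values of Qπ are obtained by table lookup.
- Update: $Q_\pi(s_t, a_t) \leftarrow Q_\pi(s_t, a_t) - \alpha \cdot \delta_t$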
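A minimal sketch of the tabular Sarsa update, assuming the standard TD target y = r + γ · Q(s', a') and a dictionary keyed by (state, action) as the table. The `gamma` and `alpha` defaults are illustrative.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, gamma=0.9, alpha=0.5):
    """One tabular Sarsa step on the table Q; returns the TD error."""
    target = r + gamma * Q[(s_next, a_next)]   # uses the action actually sampled from the policy
    delta = Q[(s, a)] - target                 # table lookup for the current estimate
    Q[(s, a)] -= alpha * delta
    return delta


Q = defaultdict(float)      # unseen (state, action) pairs default to 0.0
Q[(1, 0)] = 2.0
delta = sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Note that `a_next` is sampled from the policy π rather than chosen by a maximization, which is why Sarsa learns Qπ and not Q*.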
2.1.3 Sarsa: Neural Network Version
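Use a neural network to learn Qπ(s, a); the resulting network is called the value network q(s, a; w).
The parameters w of the value network are randomly initialized, and the observed rewards are used to update w.
- TD target: $y_t = r_t + \gamma \, q(s_{t+1}, a_{t+1}; w)$
- TD error: $\delta_t = q(s_t, a_t; w) - y_t$
- Loss: $\delta_t^2 / 2$
- Gradient: $\frac{\partial \, \delta_t^2 / 2}{\partial w} = \delta_t \cdot \frac{\partial q(s_t, a_t; w)}{\partial w}$
- Gradient descent: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial q(s_t, a_t; w)}{\partial w}$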
2.1.4 Sarsa summary
2.2 Q-Learning Algorithm
The Q-Learning algorithm is used to learn the optimal action-value function Q*(s, a).
2.2.1 Derive TD Target of Q-Learning
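From the Sarsa identity for Qπ, if π is the optimal policy π*, then $Q_{\pi^\star}(s_t, a_t) = \mathbb{E}[R_t + \gamma \, Q_{\pi^\star}(S_{t+1}, A_{t+1}) \mid s_t, a_t]$.
Writing $Q_{\pi^\star}$ as $Q^\star$: $Q^\star(s_t, a_t) = \mathbb{E}[R_t + \gamma \, Q^\star(S_{t+1}, A_{t+1}) \mid s_t, a_t]$.
The action At+1 is computed by $A_{t+1} = \arg\max_a Q^\star(S_{t+1}, a)$.
Thus, $Q^\star(S_{t+1}, A_{t+1}) = \max_a Q^\star(S_{t+1}, a)$.
Substituting the max into the second equation: $Q^\star(s_t, a_t) = \mathbb{E}[R_t + \gamma \max_a Q^\star(S_{t+1}, a) \mid s_t, a_t]$.
Computing this expectation directly is difficult, so approximate it by Monte Carlo: $Q^\star(s_t, a_t) \approx r_t + \gamma \max_a Q^\star(s_{t+1}, a) = y_t$.
TD learning: encourage Q*(st, at) to approach the TD target yt.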
2.2.2 Q-Learning (tabular version)
- observe a transition (st, at, rt, st+1)
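- TD target: $y_t = r_t + \gamma \max_a Q^\star(s_{t+1}, a)$
- TD error: $\delta_t = Q^\star(s_t, a_t) - y_t$
- Update: $Q^\star(s_t, a_t) \leftarrow Q^\star(s_t, a_t) - \alpha \cdot \delta_t$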
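A minimal sketch of the tabular Q-Learning update, assuming the standard TD target y = r + γ · max_a Q(s', a) and a dictionary keyed by (state, action). The `actions` list and the `gamma` and `alpha` defaults are illustrative.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=0.5):
    """One tabular Q-Learning step on the table Q; returns the TD error."""
    # The target maximizes over all actions in the next state;
    # no next action needs to be sampled from a policy.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    delta = Q[(s, a)] - target
    Q[(s, a)] -= alpha * delta
    return delta


Q = defaultdict(float)
Q[(1, 0)], Q[(1, 1)] = 2.0, 4.0
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```

The only difference from a Sarsa step is the `max` in the target, which is what makes the table converge toward Q* rather than Qπ.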
2.2.3 Q-Learning: DQN version
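DQN is an approximation of the optimal action-value function Q*(s, a), written Q(s, a; w).
- Approximate Q*(s, a) by DQN, Q(s, a; w).
- DQN controls the agent by: $a_t = \arg\max_a Q(s_t, a; w)$
- We seek to learn the parameter w using the collected transitions.
Training DQN with Q-Learning:
- Observe a transition (st, at, rt, st+1).
- TD target: $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$
- TD error: $\delta_t = Q(s_t, a_t; w) - y_t$
- Update: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; w)}{\partial w}$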
2.2.4 Q-Learning summary
2.3 Sarsa vs. Q-Learning
Both of these TD targets use only a single reward rt; using multiple rewards usually works better.
2.4 Multi-Step TD Target
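- Using one reward: the one-step TD target combines rt with the discounted value estimate at step t+1.
- Using multiple rewards: the multi-step TD target combines several consecutive rewards with the discounted value estimate at a later step.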
2.4.1 Multi-Step Return
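$U_t = R_t + \gamma U_{t+1}$ --> $U_t = R_t + \gamma R_{t+1} + \gamma^2 U_{t+2}$
Ut now contains two rewards. Applying the recursion repeatedly gives the multi-step return: $U_t = \sum_{i=0}^{m-1} \gamma^i R_{t+i} + \gamma^m U_{t+m}$
- m-step TD target for Sarsa: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \, Q_\pi(s_{t+m}, a_{t+m})$
- m-step TD target for Q-Learning: $y_t = \sum_{i=0}^{m-1} \gamma^i r_{t+i} + \gamma^m \max_a Q^\star(s_{t+m}, a)$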
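A minimal sketch of the m-step TD target, assuming the form y = Σ γ^i · r_{t+i} + γ^m · (bootstrap value). The caller supplies the bootstrap value: Q(s_{t+m}, a_{t+m}) for Sarsa or max_a Q(s_{t+m}, a) for Q-Learning. The function and parameter names are illustrative.

```python
def multi_step_target(rewards, bootstrap_value, gamma=0.9):
    """m-step TD target, where m = len(rewards).

    rewards         -- [r_t, r_{t+1}, ..., r_{t+m-1}]
    bootstrap_value -- value estimate at step t+m
    """
    m = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** m) * bootstrap_value


one_step = multi_step_target([1.0], bootstrap_value=2.0, gamma=0.5)
two_step = multi_step_target([1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
```

With m = 1 this reduces to the one-step TD target. A larger m weights the estimated part by γ^m, so the target relies more on observed rewards and less on the model's own estimate.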
2.4.2 One-step return vs. multi-step return
3. Extension: DQN Advanced Training Skills
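Training a DQN with the most basic TD algorithm works poorly.
Advanced techniques that improve DQN performance: Experience Replay, Prioritized Experience Replay, Target Network, Double DQN, and Dueling Network.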
3.1 revisiting DQN and TD Learning
DQN
DQN, Q(s, a; w), is the neural network used to approximate the optimal action-value function Q*(s, a). Given the current state s, Q* scores every action. The score reflects how good the action is, so the agent should execute the action with the highest score.
TD Learning
TD learning is the temporal difference algorithm.
- Observe state st and perform action at.
- Environment provides new state st+1 and reward rt.
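- TD target: $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$
- TD error: $\delta_t = q_t - y_t$, where qt = Q(st, at; w)
- Goal: Make qt close to yt, for all t. (Equivalently, make δt^2 small.)
- TD learning: Find w by minimizing $L(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$
- Online gradient descent: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; w)}{\partial w}$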
- Discard (st, at, rt, st+1) after using it.
This is the most basic implementation of the TD algorithm, and it does not work well.
Several improvements make the TD algorithm converge faster.
3.2 Experience Replay
3.2.1 Reason why we need Experience Replay
Shortcoming 1 of TD learning: Waste of Experience
- a transition: (st, at, rt, st+1)
- Experience: all the transitions, for t=1,2,...
- Previously, we discard (st, at, rt, st+1) after using it
- This is wasteful.
Shortcoming 2 of TD learning: Correlated Updates
- Previously, we use (st, at, rt, st+1) sequentially, for t = 1, 2, ..., to update w.
- Consecutive states, st and st+1, are strongly correlated, which is bad for training; the order of the transitions should be shuffled.
3.2.2 Experience Replay Introduction
- A transition: (st, at, rt, st+1)
- Store recent n transitions in a replay buffer.
- Remove old transitions so that the buffer has at most n transitions.
- Buffer capacity n is a tuning hyper-parameter.
- n is typically large.
- The best setting of n is application-specific.
3.2.3 TD Algorithm with Experience Replay
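- Find w by minimizing $L(w) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$
- Stochastic gradient descent (SGD):
- Randomly sample a transition, (si, ai, ri, si+1), from the buffer.
- Compute TD error, $\delta_i = Q(s_i, a_i; w) - y_i$
- Stochastic gradient: $g_i = \frac{\partial \, \delta_i^2 / 2}{\partial w} = \delta_i \cdot \frac{\partial Q(s_i, a_i; w)}{\partial w}$
- SGD: $w \leftarrow w - \alpha \cdot g_i$. In practice a mini-batch is used: several transitions are sampled, and the average of their gradients is used to update w.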
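A minimal sketch of a replay buffer with uniform sampling, using `collections.deque` with `maxlen` so that the oldest transition is dropped automatically once the buffer is full. The class and method names are illustrative.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps at most `capacity` recent transitions (s, a, r, s_next)."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old transitions fall off the left end

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push(t, 0, 0.0, t + 1)   # dummy transitions for illustration
batch = buf.sample(2)
```

A training loop would compute the TD error and gradient for each sampled transition and update w with the average gradient of the mini-batch.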
3.2.4 Benefits of Experience Replay
1. Make the updates uncorrelated.
2. Reuse collected experience many times.
3.3 Prioritized Experience Replay
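An improvement on Experience Replay: replace uniform sampling with non-uniform sampling.
- Not all transitions are equally important. Different scenes in a game differ in importance.
- Then how do we know which transition is important?
- TD error. If a transition has a high TD error |δt|, it will be given high priority. The larger the absolute TD error, the more important the transition and the higher its priority should be.
Prioritized Experience Replay samples transitions by importance, and there are two different non-uniform sampling schemes.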
3.3.1 Sampling methods
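- Use importance sampling instead of uniform sampling.
- Option 1: Sampling probability $p_t \propto |\delta_t| + \epsilon$, where ε is a small positive number that keeps every probability nonzero.
- Option 2: Sampling probability $p_t \propto \frac{1}{\mathrm{rank}(t)}$
- The transitions are sorted so that |δt| is in descending order.
- rank(t) is the rank of the t-th transition.
- In sum, a big |δt| shall be given high priority.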
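A minimal sketch of the two sampling schemes, assuming the common forms p_t ∝ |δt| + ε (proportional) and p_t ∝ 1 / rank(t) (rank-based). The function names and the `eps` default are illustrative.

```python
import random


def proportional_probs(td_errors, eps=1e-6):
    """Option 1: probability proportional to |delta| + eps."""
    priorities = [abs(d) + eps for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def rank_based_probs(td_errors):
    """Option 2: probability proportional to 1 / rank, rank 1 = largest |delta|."""
    order = sorted(range(len(td_errors)), key=lambda t: abs(td_errors[t]), reverse=True)
    weights = [0.0] * len(td_errors)
    for rank, t in enumerate(order, start=1):
        weights[t] = 1.0 / rank
    total = sum(weights)
    return [w / total for w in weights]


def sample_indices(probs, k):
    """Draw k transition indices (with replacement) according to probs."""
    return random.choices(range(len(probs)), weights=probs, k=k)


errors = [0.1, -2.0, 0.5]
p_prop = proportional_probs(errors)
p_rank = rank_based_probs(errors)
picked = sample_indices(p_prop, k=4)
```

Both schemes give the transition with |δ| = 2.0 the highest probability. The rank-based one depends only on the ordering, so it is less sensitive to outlier TD errors.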
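Sampling transitions with different probabilities biases the DQN's predictions, so the learning rate should be adjusted accordingly to cancel the bias caused by the non-uniform sampling.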
3.3.2 Scaling Learning Rate
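SGD: $w \leftarrow w - \alpha \cdot g$, where α is the learning rate and g is the stochastic gradient.
If importance sampling is used, α shall be adjusted according to the sampling probability.
If a transition has a large sampling probability, its learning rate should be set smaller.
- Scale the learning rate by $(n \, p_t)^{-\beta}$, where β ∈ (0,1), n is the number of transitions in the buffer, and $p_t$ is the sampling probability of the transition.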
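A minimal sketch of the learning-rate scaling, assuming the factor (n · p_t)^(−β), where n is the number of transitions in the buffer and p_t is the sampling probability of the chosen transition. The function name and the default `beta` are illustrative.

```python
def scaled_learning_rate(alpha, prob, n, beta=0.5):
    """Scale alpha by (n * prob) ** (-beta) to offset non-uniform sampling."""
    return alpha * (n * prob) ** (-beta)


uniform_lr = scaled_learning_rate(alpha=0.1, prob=0.25, n=4)    # p_t = 1/n
frequent_lr = scaled_learning_rate(alpha=0.1, prob=0.5, n=4)    # sampled more often
rare_lr = scaled_learning_rate(alpha=0.1, prob=0.125, n=4)      # sampled less often
```

Under uniform sampling (p_t = 1/n) the factor is 1 and the learning rate is unchanged. Transitions that are sampled more often get a smaller step, and rarely sampled ones get a larger step.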
3.3.3 Update TD Error
3.4 Target Network & Double DQN
3.4.1 Bootstrapping
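Bootstrapping: to lift oneself up by one's own bootstraps.
In RL, bootstrapping means "using an estimated value in the update step for the same kind of estimated value".
- Use a transition, (st, at, rt, st+1), to update w.
- TD target: $y_t = r_t + \gamma \max_a Q(s_{t+1}, a; w)$. The TD target yt uses both the real observation rt and the DQN's own estimate.
- TD error: $\delta_t = Q(s_t, a_t; w) - y_t$
- SGD: $w \leftarrow w - \alpha \cdot \delta_t \cdot \frac{\partial Q(s_t, a_t; w)}{\partial w}$. To update the DQN's estimate at time t, we use a target yt that partly consists of the DQN's own estimate at time t+1. The DQN "lifts itself up", which is bootstrapping.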
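3.4.2 Problem of Overestimation in DQN
Problem: Training a DQN with the TD algorithm causes the DQN to overestimate the true action-values.
- Reason 1: The maximization. Computing the TD target involves a maximization.
- As a result, the TD target tends to be bigger than the real action-value.
- Reason 2: Bootstrapping propagates the overestimation. The DQN uses its own estimates to update itself.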
Analysis: Why is overestimation a shortcoming?
solution: Target Network & Double DQN.
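- Solution 1: Use a target network to compute TD targets. (This addresses the problem caused by bootstrapping.) Instead of using the TD target computed by the DQN itself, use another neural network, called the target network, to compute the TD target.
- Solution 2: Use Double DQN to alleviate the overestimation caused by maximization. Double DQN also uses a target network, but in a slightly different way; this small difference greatly improves the results and alleviates the overestimation.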
3.5 Target Network
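DQN uses one neural network to approximate the optimal action-value function Q*; now two neural networks are used.
- The DQN is used to control the agent and to collect experience, i.e. many transitions.
- The only purpose of the target network is to compute the TD target, which avoids bootstrapping to some extent.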
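A minimal sketch of how a target network is typically used, under stated assumptions: parameters are plain Python lists, the TD target is y = r + γ · max_a Q(s', a; w⁻) computed from the target network's scores, and the target parameters w⁻ are refreshed either by a periodic copy or by a weighted average with rate `tau`. These two update rules are common practice, not taken from the text, and all names are illustrative.

```python
def td_target_with_target_network(r, q_target_next, gamma=0.9):
    """TD target computed from the target network's scores for the next state."""
    return r + gamma * max(q_target_next)


def hard_update(w_target, w_online):
    """Periodic copy: w_target <- w_online (copies values, keeps separate lists)."""
    w_target[:] = w_online


def soft_update(w_target, w_online, tau=0.01):
    """Weighted average: w_target <- tau * w_online + (1 - tau) * w_target."""
    for i, w in enumerate(w_online):
        w_target[i] = tau * w + (1.0 - tau) * w_target[i]


w_online = [1.0, 2.0]
w_target = [0.0, 0.0]
soft_update(w_target, w_online, tau=0.5)
after_soft = list(w_target)
hard_update(w_target, w_online)
y = td_target_with_target_network(r=1.0, q_target_next=[0.5, 2.0], gamma=0.5)
```

Because the target parameters are eventually derived from the DQN's own parameters, the target network reduces bootstrapping but cannot remove it completely.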
3.5.1 TD Learning with Target Network
3.5.2 Update Target Network
3.5.3 TD learning comparisons
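Using a target network reduces the DQN's overestimation and improves its performance, but it cannot eliminate overestimation, because bootstrapping is not completely avoided.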
3.6 Double DQN (DDQN)
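Double DQN alleviates the overestimation problem better than a target network alone.
- Naive TD target:
  - Selection using DQN: $a^\star = \arg\max_a Q(s_{t+1}, a; w)$
  - Evaluation using DQN: $y_t = r_t + \gamma \, Q(s_{t+1}, a^\star; w)$
  - Serious overestimation.
- Target network:
  - Selection using target network: $a^\star = \arg\max_a Q(s_{t+1}, a; w^-)$
  - Evaluation using target network: $y_t = r_t + \gamma \, Q(s_{t+1}, a^\star; w^-)$
  - It works better, but overestimation is still serious.
- Double DQN:
  - Selection using DQN: $a^\star = \arg\max_a Q(s_{t+1}, a; w)$
  - Evaluation using target network: $y_t = r_t + \gamma \, Q(s_{t+1}, a^\star; w^-)$
  - It is the best among the three, but overestimation still happens.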
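A minimal sketch comparing the three TD targets, assuming each network's scores for the next state are given as plain lists (`q_online_next` from the DQN, `q_target_next` from the target network). The numbers are made up for illustration.

```python
def argmax(values):
    """Index of the largest value (ties go to the lowest index)."""
    return max(range(len(values)), key=lambda i: values[i])


def naive_target(r, q_online_next, gamma):
    """Select and evaluate with the DQN itself."""
    return r + gamma * q_online_next[argmax(q_online_next)]


def target_network_target(r, q_target_next, gamma):
    """Select and evaluate with the target network."""
    return r + gamma * q_target_next[argmax(q_target_next)]


def double_dqn_target(r, q_online_next, q_target_next, gamma):
    """Select with the DQN, evaluate with the target network."""
    a_star = argmax(q_online_next)
    return r + gamma * q_target_next[a_star]


q_online_next = [1.0, 3.0, 2.0]
q_target_next = [2.5, 1.5, 2.0]
y_naive = naive_target(0.0, q_online_next, gamma=1.0)
y_target = target_network_target(0.0, q_target_next, gamma=1.0)
y_double = double_dqn_target(0.0, q_online_next, q_target_next, gamma=1.0)
```

Since the target network's score for the selected action can never exceed the target network's own maximum, the Double DQN target is never larger than the target-network target. This is one way to see why it overestimates less.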
3.6.1 Why does DDQN work better?
3.6.2 Summary
Double DQN alleviates both causes of overestimation, the maximization and bootstrapping, so it works best of the three.
3.7 Dueling Network
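Target Network and Double DQN are improvements to the TD algorithm.
Dueling Network is an improvement to the neural network architecture, and it can also greatly improve DQN performance.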
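3.7.1 Advantage Function
- Discounted return: $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots$
- Action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$, the conditional expectation of Ut, i.e. the predicted return.
- State-value function: $V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)]$, the expectation of Qπ over actions.
- Optimal action-value function: $Q^\star(s, a) = \max_\pi Q_\pi(s, a)$
- Optimal state-value function: $V^\star(s) = \max_\pi V_\pi(s)$
- Optimal advantage function: $A^\star(s, a) = Q^\star(s, a) - V^\star(s)$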
Properties of Advantage Function
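- Theorem 1: $V^\star(s) = \max_a Q^\star(s, a)$, which implies $\max_a A^\star(s, a) = 0$.
- Theorem 2: $Q^\star(s, a) = V^\star(s) + A^\star(s, a) - \max_a A^\star(s, a)$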
3.7.2 Dueling Network
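- DQN approximates Q*(s, a) by a neural network, Q(s, a; w).
- Approximate the advantage function A*(s, a) by a neural network, $A(s, a; w^A)$.
- Approximate the state-value function V*(s) by a neural network, $V(s; w^V)$.
- Dueling Network: $Q(s, a; w^A, w^V) = V(s; w^V) + A(s, a; w^A) - \max_a A(s, a; w^A)$
Writing $w = (w^A, w^V)$, this becomes $Q(s, a; w)$.
The Dueling Network plays the same role as DQN: it approximates the optimal action-value function Q*(s, a).
In the architecture diagram: blue scalar (V) + red vector (A) − maximum of the red vector = purple vector (Q).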
3.7.3 Dueling Network Mathematical Principle: Overcome Non-identifiability
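Without the max term, V* and A* cannot be identified from their sum: if V* and A* fluctuate by the same amount in opposite directions, Q* is unchanged, yet both networks drift, which makes training unstable.
Subtracting max A* removes this ambiguity and keeps the neural network stable.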
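A minimal sketch of the aggregation step and of the identifiability argument, assuming the form Q(s, a) = V(s) + A(s, a) − max_a A(s, a). The state value and advantages are plain numbers standing in for the outputs of the two network heads, and all names are illustrative.

```python
def naive_sum(v, advantages):
    """Q = V + A, without the max term (not identifiable)."""
    return [v + a for a in advantages]


def dueling_q(v, advantages):
    """Q = V + A - max(A), the aggregation with the max term."""
    max_adv = max(advantages)
    return [v + a - max_adv for a in advantages]


v, adv = 2.0, [1.0, -1.0, 0.0]
shift = 10.0
v_shifted = v + shift                      # V moves up by 10 ...
adv_shifted = [a - shift for a in adv]     # ... and A moves down by 10

same_naive = naive_sum(v, adv) == naive_sum(v_shifted, adv_shifted)
q = dueling_q(v, adv)
q_shifted = dueling_q(v_shifted, adv_shifted)
```

With the plain sum, the shifted pair produces exactly the same Q, so the loss cannot tell the two apart. With the max term, the same shift changes Q by the shift amount, so opposite fluctuations of V and A are no longer free. As a side effect, the largest entry of `q` equals `v`, matching Theorem 1.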