Reinforcement Learning (RL) 02: Value-Based Reinforcement Learning

DQN and the temporal difference (TD) update algorithm.

A value network is used to estimate the value of actions: the action-value function Qπ evaluates how good an action is, the value network approximates it, and the network's gradient \frac{\partial q}{\partial w}, combined with the TD error, is used to update the value network's parameters.

Table of Contents

Review

1. Deep Q-Network(DQN)

1.1 Approximate the Q*(s,a) Function

 1.2 Apply DQN to Play Game

1.3 Temporal Difference(TD) Learning

1.4 TD Learning for DQN

1.4.1 Conditions for Applying TD Learning

1.4.2 Train DQN using TD learning

1.5 Summary: DQN and TD Learning

2. Extension: TD Learning Algorithm

2.1 Sarsa Algorithm

2.1.1 Derive TD Target of Sarsa

2.1.2 Sarsa: Tabular Version

2.1.3 Sarsa: Neural Network Version

2.1.4 Sarsa summary

2.2 Q-Learning Algorithm

2.2.1 Derive TD Target of Q-Learning

2.2.2 Q-Learning (tabular version)

2.2.3 Q-Learning: DQN version

2.2.4 Q-Learning summary

2.3 Sarsa vs. Q-Learning

 2.4 Multi-Step TD Target

2.4.1 Multi-Step Return

2.4.2 One-Step Return vs. Multi-Step Return

3. Extension: DQN Advanced Training Skills

3.1 Revisiting DQN and TD Learning

3.2  Experience Replay

3.2.1 Reason why we need Experience Replay 

3.2.2 Experience Replay Introduction

3.2.3 TD Algorithm with Experience Replay

3.2.4 Benefits of Experience Replay

3.3 Prioritized Experience Replay

3.3.1 Sampling methods

3.3.2 Scaling Learning Rate

3.3.3 Update TD Error

3.4 Target Network & Double DQN

3.4.1 Bootstrapping

3.4.2 The Overestimation Problem in DQN

3.5 Target Network

3.5.1 TD Learning with Target Network

3.5.2 Update Target Network

3.5.3 TD learning comparisons 

3.6 Double DQN (DDQN)

3.6.1 Why does DDQN work better?

3.6.2 Summary

3.7 Dueling Network

3.7.1 Advantage Function

3.7.2 Dueling Network

3.7.3 Dueling Network Mathematical Principle: Overcoming Non-Identifiability

3.7.4 Dueling Network Summary

References


Review

  • Ut is the weighted sum of future rewards (the discounted return).
  • Qπ(st, at) reflects how good action at is in the current state st.
  • Maximizing Qπ over policies π gives Q*, which can score every action.

1. Deep Q-Network(DQN)

Essence: use a neural network to approximate the Q* function.

1.1 Approximate the Q*(s,a) Function

Goal: win the game (≈ maximize the total reward).

Question: if we knew Q*(s, a), what would be the best action? (Suppose for now that the Q* function is known.)

Q* indicates how good it is for the agent to pick action a while in state s.

Challenge: we do not know the function Q*(s, a).

A value-based model learns a function that approximates Q*. --> DQN

  • Solution: Deep Q-Network (DQN)
  • Use a neural network Q(s, a; w) to approximate Q*(s, a).

We approximate Q*(s, a) with a neural network whose parameters are w. The input is the state s and the output is a set of values, one score per possible action. The network is trained from the rewards, and its action scores gradually improve and become more accurate.
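As a concrete illustration, here is a minimal PyTorch sketch of such a network (the layer sizes, `state_dim`, and `n_actions` are illustrative placeholders, not values from the lecture):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Q(s, a; w): maps a state to one score per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per possible action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) -> Q-values: (batch, n_actions)
        return self.net(state)
```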

 1.2 Apply DQN to Play Game

  • At the current state st, feed st into the DQN to score all actions, pick the highest-scoring action as at, and let the agent execute at.
  • The environment then changes the state: it samples a new state st+1 from the state-transition function p and returns the reward rt for this step (rt can be positive, negative, or zero).
  • The reward is the supervision signal in reinforcement learning; the DQN is trained from these rewards.
  • Given the new state st+1, the DQN scores all actions again and the agent picks the highest-scoring action as at+1. After at+1, the environment updates the state to st+2 and gives another reward rt+1.
  • Repeat this process until the game ends. (A minimal interaction loop is sketched below.)
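A minimal sketch of this interaction loop, with the value network, the environment, and all sizes stubbed out as illustrative placeholders:

```python
import torch
import torch.nn as nn

# Illustrative stand-ins; a real task defines these.
state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def select_action(state: torch.Tensor) -> int:
    """a_t = argmax_a Q(s_t, a; w): pick the highest-scoring action."""
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def env_step(state, action):
    """Placeholder environment: samples s_{t+1}, r_t, and a termination flag."""
    next_state = torch.randn(state_dim)        # stand-in for s_{t+1} ~ p(. | s_t, a_t)
    reward = float(torch.randn(()))            # stand-in for r_t
    done = bool(torch.rand(()) < 0.05)         # stand-in for "game over"
    return next_state, reward, done

state, done = torch.randn(state_dim), False
while not done:                                # repeat until the game ends
    action = select_action(state)
    state, reward, done = env_step(state, action)
```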

How do we train a DQN?

1.3 Temporal Difference(TD) Learning

TD: the temporal difference algorithm.

Challenge: can we update the model before finishing the trip?

The TD algorithm:

  • TD target
  • TD error
  • Use gradient descent to reduce the TD error. (A small numeric sketch follows this list.)
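A tiny numeric sketch of these three quantities, using a driving-time flavored example with made-up numbers (the "model" here is just a scalar estimate of the total trip time):

```python
# Initial prediction: the whole trip will take 60 minutes.
predicted_total = 60.0
actual_first_leg = 30.0       # observed: the first part actually took 30 minutes
predicted_remaining = 25.0    # updated prediction for the rest of the trip

td_target = actual_first_leg + predicted_remaining   # 55.0: partly observed, partly estimated
td_error = predicted_total - td_target               # 5.0: the original prediction was too high

alpha = 0.1                                           # learning rate
predicted_total -= alpha * td_error                   # gradient-descent-style update -> 59.5
```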

1.4 TD Learning for DQN

1.4.1 Conditions for Applying TD Learning

Proof sketch: from U_{t} = R_{t} + \gamma \cdot U_{t+1}, taking expectations gives Q(s_{t}, a_{t}; w) \approx r_{t} + \gamma \cdot Q(s_{t+1}, a_{t+1}; w).

The left-hand side is called the prediction; the right-hand side is called the TD target.

1.4.2 Train DQN using TD learning

The gradient of the loss function J(θ) points in the direction in which J(θ) increases fastest; gradient descent moves against that direction. If the learning rate α is too small to carry the parameters across a ridge of the loss surface, the update of θ can get stuck in a local optimum.

Momentum-style optimizers address this: plain SGD uses only the current (first-order) gradient, while Adam additionally tracks first and second moments of the gradient, which helps the updates keep a consistent descent direction and escape poor local regions.

See also: 深度学习数学基础(二)~随机梯度下降 (Stochastic Gradient Descent, SGD), 天狼啸月1990's CSDN blog.

For the agent's action at time t+1, the DQN scores every action a, and the highest-scoring one is taken as at+1; this is the max over a in the TD target.

Note that a here ranges over all actions; it is not the same as at.
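Putting the pieces together, here is a minimal sketch of one TD training step for DQN in PyTorch; the network size, γ, α, and the sample transition are illustrative assumptions:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma, alpha = 4, 2, 0.99, 1e-3
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

# One observed transition (s_t, a_t, r_t, s_{t+1}); random placeholders here.
s_t, a_t, r_t, s_next = torch.randn(state_dim), 1, 0.5, torch.randn(state_dim)

q_t = q_net(s_t)[a_t]                         # prediction q_t = Q(s_t, a_t; w)

with torch.no_grad():                         # TD target is treated as a constant
    y_t = r_t + gamma * q_net(s_next).max()   # y_t = r_t + gamma * max_a Q(s_{t+1}, a; w)

delta_t = q_t - y_t                           # TD error
loss = 0.5 * delta_t ** 2                     # loss delta_t^2 / 2

optimizer.zero_grad()
loss.backward()                               # gradient = delta_t * dQ(s_t, a_t; w)/dw
optimizer.step()                              # w <- w - alpha * gradient
```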

1.5 Summary: DQN and TD Learning

2. Extension: TD Learning Algorithm

2.1 Sarsa Algorithm

The Sarsa algorithm is used to learn the action-value function Qπ.

2.1.1 Derive TD Target of Sarsa

  • Discounted Return Ut

U_{t} = R_{t} + \gamma U_{t+1}

  • Assume Rt depends on (St, At, St+1)
  • Action-value Function Qπ(st, at)

Q_{\pi}(s_{t}, a_{t}) = E[U_{t}|s_{t}, a_{t}] = E[R_{t} + \gamma \cdot U_{t+1}|s_{t},a_{t}] \\ = E[R_{t}|s_{t},a_{t}] + \gamma \cdot E[U_{t+1}|s_{t},a_{t}] \\ = E[R_{t}|s_{t},a_{t}] + \gamma \cdot E[Q_{\pi}(S_{t+1}, A_{t+1})|s_{t},a_{t}] \\ = E[R_{t} + \gamma \cdot Q_{\pi}(S_{t+1}, A_{t+1})|s_{t},a_{t}], \quad for\ all\ \pi

Computing this expectation directly is hard, so approximate it with a Monte Carlo (MC) sample:

Q_{\pi} (s_{t}, a_{t}) \approx r_{t} + \gamma \cdot Q_{\pi} (s_{t+1}, a_{t+1}), called\ TD\ target\ y_{t}

 TD learning: Encourage Qπ(st, at) to approach yt.

2.1.2 Sarsa: Tabular Version

We want to use (st, at, rt, st+1, at+1) to learn Qπ(s, a) --> State-Action-Reward-State-Action (Sarsa). The steps below are sketched in code after the list.

  • observe a transition (st, at, rt, st+1)
  • sample at+1 ~ π(·|st+1), where π is the policy function
  • TD target: y_{t} = r_{t} + \gamma \cdot Q_{\pi}(s_{t+1}, a_{t+1})
  • TD error: \delta_{t} = Q_{\pi}(s_{t}, a_{t}) - y_{t}, where the concrete values are looked up in the table.
  • Update: Q_{\pi}(s_{t}, a_{t}) \leftarrow Q_{\pi}(s_{t}, a_{t}) - \alpha \cdot \delta_{t}
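A minimal sketch of this tabular Sarsa update in Python (the table size, γ, α, and the example transition are illustrative):

```python
import numpy as np

n_states, n_actions, gamma, alpha = 10, 4, 0.9, 0.1
Q = np.zeros((n_states, n_actions))           # the table holding Q_pi(s, a)

def sarsa_update(s_t, a_t, r_t, s_next, a_next):
    """One tabular Sarsa step using (s_t, a_t, r_t, s_{t+1}, a_{t+1})."""
    y_t = r_t + gamma * Q[s_next, a_next]     # TD target (a_{t+1} sampled from pi)
    delta_t = Q[s_t, a_t] - y_t               # TD error, looked up in the table
    Q[s_t, a_t] -= alpha * delta_t            # update the table entry

sarsa_update(s_t=0, a_t=1, r_t=1.0, s_next=2, a_next=3)
```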

2.1.3 Sarsa: Neural Network Version

Use a neural network to learn Qπ(s, a); the resulting network is called the value network q(s, a; w).

The value network's parameters w are randomly initialized, and we update w from the observed rewards. (A sketch with the gradient written out explicitly follows the list below.)

  • TD target: y_{t} = r_{t} + \gamma \cdot q(s_{t+1}, a_{t+1};w)
  • TD error: \delta_{t} = q(s_{t}, a_{t}; w) - y_{t}
  • Loss: \delta_{t}^{2}/2
  • Gradient: \frac{\partial \delta_{t}^{2}/2}{\partial w} = \delta_{t} \cdot \frac{\partial q(s_{t}, a_{t}; w)}{\partial w}
  • Gradient descent: w \leftarrow w - \alpha \cdot \delta_{t} \cdot \frac{\partial q(s_{t}, a_{t}; w)}{\partial w}
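To make the gradient explicit, here is a sketch that uses a linear value network q(s, a; w) = w·φ(s, a), so that ∂q/∂w = φ(s, a) and the update above can be written out by hand (the one-hot feature map and all numbers are illustrative assumptions):

```python
import numpy as np

n_states, n_actions, gamma, alpha = 5, 3, 0.9, 0.05
rng = np.random.default_rng(0)
w = rng.normal(size=n_states * n_actions)     # value-network parameters

def phi(s, a):
    """One-hot feature map, so q(s, a; w) = w . phi(s, a)."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def q(s, a):
    return w @ phi(s, a)

def sarsa_step(s_t, a_t, r_t, s_next, a_next):
    global w
    y_t = r_t + gamma * q(s_next, a_next)     # TD target
    delta_t = q(s_t, a_t) - y_t               # TD error
    grad = delta_t * phi(s_t, a_t)            # delta_t * dq(s_t, a_t; w)/dw
    w = w - alpha * grad                      # gradient descent

sarsa_step(s_t=0, a_t=1, r_t=1.0, s_next=2, a_next=0)
```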

2.1.4 Sarsa summary

2.2 Q-Learning Algorithm

The Q-learning algorithm is used to learn the optimal action-value function Q*(s, a).

2.2.1 Derive TD Target of Q-Learning

Starting from the Sarsa identity for Qπ: if π is the optimal policy π*, then

Q_{\pi^{*}}(s_{t}, a_{t}) = E[R_{t} + \gamma \cdot Q_{\pi ^{*}}(S_{t+1}, A_{t+1})]

Writing Qπ* in the form of Q*,

 Q^{*}(s_{t}, a_{t}) = E[R_{t} + \gamma \cdot Q^{*}(S_{t+1}, A_{t+1})]

The action At+1 is computed by

A_{t+1} = \underset{a}{argmax} Q^{*} (S_{t+1}, a)

Thus,

Q^{*}(S_{t+1}, A_{t+1}) = \underset{a}{max} Q^{*}(S_{t+1}, a)

Substituting the max form of Q* into the identity above,

Q^{*}(s_{t}, a_{t}) = E[R_{t} + \gamma \cdot \underset{a}{max} Q^{*}(S_{t+1}, a)]

Computing the expectation in this identity directly is hard, so we apply a Monte Carlo approximation:

Q^{*}(s_{t}, a_{t}) \approx r_{t} + \gamma \cdot \underset{a}{max} Q^{*}(s_{t+1}, a), \ called\ TD\ target\ y_{t}

Encourage Q*(st, at) to approach the TD target yt.

2.2.2 Q-Learning (tabular version)

  • observe a transition (st, at, rt, st+1)
  • TD target: y_{t} = r_{t} + \gamma \cdot \underset{a}{max} Q^{*}(s_{t+1}, a)
  • TD error: \delta_{t} = Q^{*}(s_{t}, a_{t}) - y_{t}
  • Update: Q^{*}(s_{t}, a_{t}) \leftarrow Q^{*}(s_{t}, a_{t}) - \alpha \cdot \delta_{t}
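A sketch of the tabular Q-learning update; compared with the Sarsa sketch earlier, the only change is that the TD target takes a max over actions instead of using the sampled at+1 (sizes and numbers are illustrative):

```python
import numpy as np

n_states, n_actions, gamma, alpha = 10, 4, 0.9, 0.1
Q = np.zeros((n_states, n_actions))           # the table holding the estimate of Q*(s, a)

def q_learning_update(s_t, a_t, r_t, s_next):
    y_t = r_t + gamma * Q[s_next].max()       # TD target: max over all actions a
    delta_t = Q[s_t, a_t] - y_t               # TD error
    Q[s_t, a_t] -= alpha * delta_t            # update the table entry

q_learning_update(s_t=0, a_t=1, r_t=1.0, s_next=2)
```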

2.2.3 Q-Learning: DQN version

DQN approximates the optimal action-value function Q*(s, a); it is written Q(s, a; w).

  • Approximate Q*(s,a) by DQN, Q(s,a;w)
  • DQN controls the agent by: a_{t} = \underset{a}{argmax} Q(s_{t},a; w)
  • We seek to learn the parameter w using the collected transitions.

Training DQN with Q-learning:

  • Observe a transition (st, at, rt, st+1)
  • TD target: y_{t} = r_{t} + \gamma \cdot \underset{a}{max} Q(s_{t+1}, a; w)
  • TD error: \delta_{t} = Q(s_{t}, a_{t}; w) - y_{t}
  • Update: w \leftarrow w - \alpha \cdot \delta_{t} \cdot \frac{\partial Q(s_{t},a_{t};w)}{\partial w}

2.2.4 Q-Learning summary

2.3 Sarsa vs. Q-Learning

 

Both TD targets above contain only a single reward rt; including multiple rewards works better.

 2.4 Multi-Step TD Target

using one reward

 using multiple rewards

2.4.1 Multi-Step Return

U_{t} = R_{t} + \gamma \cdot U_{t+1} --> U_{t} = R_{t} + \gamma \cdot (R_{t+1} + \gamma \cdot U_{t+2}) = R_{t} + \gamma \cdot R_{t+1} + \gamma^{2} \cdot U_{t+2}

Now Ut contains two rewards, Rt and Rt+1. Repeating the expansion gives the multi-step return formula,

U_{t} = \sum_{i=0}^{m-1} \gamma ^{i} \cdot R_{t+i} + \gamma^{m} \cdot U_{t+m}

  • m-step TD target for Sarsa:

y_{t} = \sum_{i=0}^{m-1} \gamma^{i} \cdot r_{t+i} + \gamma^{m} \cdot Q_{\pi}(s_{t+m}, a_{t+m})

  • m-step TD target for Q-Learning:

y_{t} = \sum_{i=0}^{m-1} \gamma^{i} \cdot r_{t+i} + \gamma^{m} \cdot \underset{a}{max}Q^{*}(s_{t+m},a)
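A sketch of how the m-step TD target for Sarsa could be computed from m observed rewards plus one bootstrap value (the rewards and the bootstrap value are made up); for Q-learning the bootstrap value would be max_a Q*(s_{t+m}, a) instead:

```python
gamma, m = 0.9, 3
rewards = [1.0, 0.0, 2.0]      # r_t, r_{t+1}, r_{t+2}: the m observed rewards
bootstrap = 5.0                # Q_pi(s_{t+m}, a_{t+m}) read from the table or network

# y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * Q_pi(s_{t+m}, a_{t+m})
y_t = sum(gamma**i * r for i, r in enumerate(rewards)) + gamma**m * bootstrap
# 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```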

2.4.2 One-Step Return vs. Multi-Step Return

3. Extension: DQN Advanced Training Skills

Training DQN with the most basic TD algorithm performs poorly.

Advanced techniques can improve DQN's performance: experience replay (and prioritized experience replay), the target network and Double DQN, and the dueling network.

3.1 Revisiting DQN and TD Learning

DQN 

The DQN Q(s, a; w) is a neural network used to approximate the optimal action-value function Q*(s, a). The Q* function scores all actions given the current state s; each score reflects how good the corresponding action is, so the agent should execute the action with the highest score.

TD Learning

TD learning refers to the temporal difference algorithm.

  • Observe state st and perform action at.
  • Environment provides new state st+1 and reward rt.
  • TD target: y_{t} = r_{t} + \gamma \cdot \underset{a}{max}Q(s_{t+1},a;w)
  • TD error: \delta_{t} = q_{t} - y_{t}, where qt = Q(st, at; w)
  • Goal: Make qt close to yt, for all t. (Equivalently, make δt^2 small)
  • TD learning: Find w by minimizing L(w) = \frac{1}{T} \sum_{t=1}^{T}\frac{\delta_{t}^{2}}{2}
  • Online gradient descent: w \leftarrow w - \alpha \cdot \delta_{t} \cdot \frac{\partial Q(s_{t},a_{t};w)}{\partial w}
  • Discard (st, at, rt, st+1) after using it.

This is the most basic implementation of the TD algorithm, and it does not work well.

We now make some improvements so that the TD algorithm converges faster.

3.2  Experience Replay

3.2.1 Reason why we need Experience Replay 

TD Learning shortage 1: Waste of Experience

  • a transition: (st, at, rt, st+1)
  • Experience: all the transitions, for t=1,2,...
  • Previously, we discarded (st, at, rt, st+1) after using it.
  • This is wasteful.

TD Learning shortage 2: Correlated Updates

  • Previously, we used (st, at, rt, st+1) sequentially, for t = 1, 2, ..., to update w.
  • Consecutive states st and st+1 are strongly correlated (which is bad); the updates should be decorrelated by shuffling the experience.

3.2.2 Experience Replay Introduction

  • A transition: (st, at, rt, st+1)
  • Store recent n transitions in a replay buffer.
  • Remove old transitions so that the buffer has at most n transitions.
  • Buffer capacity n is a tuning hyper-parameter.
    • n is typically large, e.g. 10^{5}\sim10^{6}.
    • The choice of n is application-specific.
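A minimal sketch of such a replay buffer with a bounded capacity n and uniform random sampling (the capacity default is an illustrative choice):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the most recent n transitions (s_t, a_t, r_t, s_{t+1})."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped automatically

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int):
        """Uniform random sampling; breaks the correlation between consecutive updates."""
        idx = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in idx]

    def __len__(self):
        return len(self.buffer)
```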

3.2.3 TD Algorithm with Experience Replay

  • Find w by minimizing L(w) = \frac{1}{T}\sum_{t=1}^{T} \frac{\delta_{t}^{2}}{2}
  • Stochastic gradient descent (SGD):
    • Randomly sample a transition, (si, ai, ri, si+1), from the buffer.
    • Compute TD error, \delta_{i}
    • Stochastic gradient: g_{i} = \frac{\partial \delta_{i}^{2}/2}{\partial w} = \delta_{i} \cdot \frac{\partial Q(s_{i},a_{i};w)}{\partial w}
    • SGD: w \leftarrow w - \alpha \cdot g_{i}. In practice, a mini-batch is used: sample several transitions and update w with the average of their gradients.

3.2.4 Benefits of Experience Replay

1. Make the updates uncorrelated.

2. Reuse collected experience many times.

3.3 Prioritized Experience Replay

Prioritized experience replay improves on experience replay by replacing uniform sampling with non-uniform sampling.

  • Not all transitions are equally important; different scenes in a game have different importance.
  • Then how do we know which transition is important?
  • By the TD error: if a transition has a large TD error |δt|, it is given high priority. The larger |δt| is, the more important the transition, and the higher its priority should be.

Prioritized experience replay uses non-uniform (importance) sampling; two common ways to set the sampling probability are described below.

3.3.1 Sampling methods

  • Use importance sampling instead of uniform sampling.
  • Option 1: Sampling probability p_{t} \propto |\delta_{t}| + \varepsilon
  • Option 2: Sampling probability p_{t} \propto \frac{1}{rank(t)}
    • The transitions are sorted so that |δt| is in descending order.
    • rank(t) is the rank of the t-th transition.
  • In sum, big |δt| shall be given high priority.

Sampling transitions with different probabilities biases the DQN's updates; the learning rate should be adjusted accordingly to cancel the bias introduced by non-uniform sampling.

3.3.2 Scaling Learning Rate

SGD: w \leftarrow w - \alpha \cdot g, where α is the learning rate.

If importance sampling is used, α shall be adjusted according to the importance.

If a transition has a high sampling probability, its learning rate should be set relatively small.

  • Scale the learning rate by (np_{t})^{-\beta}, where β ∈ (0,1)
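A sketch of how the two sampling options and the learning-rate scaling factor could be computed from stored TD errors (ε, β, the base learning rate, and the TD errors are illustrative assumptions):

```python
import numpy as np

td_errors = np.array([0.1, 2.0, 0.5, 1.2])    # |delta_t| stored with each transition
eps, beta, base_lr = 0.01, 0.5, 1e-3
n = len(td_errors)

# Option 1: probability proportional to |delta_t| + eps
p = np.abs(td_errors) + eps
p = p / p.sum()

# Option 2: probability proportional to 1 / rank(t), ranked by descending |delta_t|
ranks = np.empty(n, dtype=int)
ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, n + 1)
p_rank = (1.0 / ranks) / (1.0 / ranks).sum()

# Sample one transition with Option 1 and scale its learning rate by (n * p_t)^(-beta)
i = np.random.choice(n, p=p)
lr_i = base_lr * (n * p[i]) ** (-beta)
```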

3.3.3 Update TD Error

3.4 Target Network & Double DQN

3.4.1 Bootstrapping

Bootstrapping: "to lift oneself up by one's bootstraps."

RL bootstrapping means "using an estimated value in the update step for the same kind of estimated value".

  • Use a transition, (st, at, rt, st+1), to update w.
    • TD target: y_{t} = r_{t} + \gamma \cdot \underset{a}{max}Q(s_{t+1}, a; w). The TD target yt uses both the real observation rt and the DQN's own estimate.
    • TD error: \delta_{t} = Q(s_{t}, a_{t};w)-y_{t}
    • SGD: w \leftarrow w - \alpha \cdot \delta_{t} \cdot \frac{\partial Q(s_{t},a_{t};w)}{\partial w}. To update the DQN's estimate at time t, the target yt partly consists of the DQN's own estimate (at st+1), i.e., the estimate is used to improve itself (bootstrapping).

3.4.2 The Overestimation Problem in DQN

Problem: training DQN with the TD algorithm leads DQN to overestimate the true action values.

  • Reason 1: the maximization. Computing the TD target takes a max over all actions.
    • The TD target tends to be larger than the real action value.

  • Reason 2: bootstrapping propagates the overestimation, because the DQN's own (already overestimated) values are used to build its targets.

Analysis: why is overestimation a problem? The overestimation is non-uniform across actions, so the relative ranking of actions gets distorted and the agent may choose worse actions.

Solutions: target network and Double DQN.

  • Solution 1: use a target network to compute TD targets (this addresses the problem caused by bootstrapping). Instead of letting the DQN compute its own TD target, a separate neural network, called the target network, computes it.
  • Solution 2: use Double DQN to alleviate the overestimation caused by maximization. DDQN also uses a target network, but in a slightly different way; this small difference substantially improves the results and alleviates the overestimation.

3.5 Target Network

DQN uses one neural network to approximate the optimal action-value function Q*; now we use two neural networks.

  • The DQN controls the agent and collects experience, i.e., many transitions.
  • The target network's only use is to compute TD targets, which avoids bootstrapping to some extent. (Both steps are sketched below.)
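A minimal PyTorch sketch of this setup: the TD target is computed with the target network's parameters w⁻, only the DQN's w is updated by gradient descent, and w⁻ is refreshed either by copying w or by a weighted average (sizes, γ, τ, and the choice of sync rule are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma, tau = 4, 2, 0.99, 0.01
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))   # w
target_net = copy.deepcopy(q_net)                                                       # w^-
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def td_step(s_t, a_t, r_t, s_next):
    """TD update of w; the target network is only used to compute the TD target."""
    q_t = q_net(s_t)[a_t]
    with torch.no_grad():
        y_t = r_t + gamma * target_net(s_next).max()   # uses w^-, not w
    loss = 0.5 * (q_t - y_t) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def update_target(hard: bool = False):
    """Refresh w^-: either copy w, or blend w^- <- tau * w + (1 - tau) * w^-."""
    if hard:
        target_net.load_state_dict(q_net.state_dict())
    else:
        for p_tgt, p in zip(target_net.parameters(), q_net.parameters()):
            p_tgt.data.mul_(1 - tau).add_(tau * p.data)

td_step(torch.randn(state_dim), 0, 1.0, torch.randn(state_dim))
update_target()        # done periodically, not after every TD step
```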

3.5.1 TD Learning with Target Network

3.5.2 Update Target Network

3.5.3 TD learning comparisons 

Using a target network reduces the degree of overestimation and makes DQN perform better, but it cannot eliminate the overestimation and cannot completely avoid bootstrapping.

3.6 Double DQN (DDQN)

Double DQN alleviates the overestimation problem better than the target network alone.

  • Naive TD Target: y_{t} = r_{t} + \gamma \cdot \underset{a}{max} Q(s_{t+1},a; w) 
    • Selection using DQN: a^{*} = \underset{a}{argmax}Q(s_{t+1},a;w)
    • Evaluation using DQN: y_{t} = r_{t} + \gamma \cdot Q(s_{t+1}, a^{*};w)
    • Serious overestimation.
  • Target Network
    • Selection using target network: a^{*} = \underset{a}{argmax} Q(s_{t+1},a;w^{-})
    • Evaluation using target network: y_{t} = r_{t} + \gamma \cdot Q(s_{t+1}, a^{*};w^{-}) 
    • It works better, but overestimation is still serious.
  • Double DQN
    • Selection using DQN: a^{*} = \underset{a}{argmax}Q(s_{t+1},a;w)
    • Evaluation using target network: y_{t} = r_{t} + \gamma \cdot Q(s_{t+1}, a^{*};w^{-})
    • It is the best among the three, but overestimation still happens.
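A sketch that computes the three TD targets side by side, given a DQN with parameters w and a target network with parameters w⁻ (both networks and the transition are random placeholders; in real training w⁻ lags behind w):

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))       # w
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # w^-
r_t, s_next = 1.0, torch.randn(state_dim)

with torch.no_grad():
    q_next = q_net(s_next)             # Q(s_{t+1}, . ; w)
    q_next_tgt = target_net(s_next)    # Q(s_{t+1}, . ; w^-)

    # Naive DQN: select and evaluate with w -> most overestimation
    y_naive = r_t + gamma * q_next.max()

    # Target network: select and evaluate with w^-
    y_target = r_t + gamma * q_next_tgt.max()

    # Double DQN: select with w, evaluate with w^-
    a_star = q_next.argmax()
    y_double = r_t + gamma * q_next_tgt[a_star]
```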

3.6.1 Why does DDQN work better?

3.6.2 Summary

Double DQN mitigates both factors that cause overestimation, so it works best among the three.

3.7 Dueling Network

The target network and Double DQN are improvements to the TD algorithm.

The dueling network is an improvement to the neural network architecture, and it can also substantially improve DQN performance.

3.7.1 Advantage Function

  • Discounted return Ut.
  • Action-value function Qπ(st, at): the conditional expectation of Ut, i.e., a prediction of the return.
  • State-value function Vπ(st): the expectation of Qπ over actions.
  • Optimal action-value function Q*(s, a).
  • Optimal state-value function V*(s).
  • Optimal advantage function: A^{*}(s,a) = Q^{*}(s,a) - V^{*}(s)

Properties of Advantage Function

  • Theorem 1: V^{*}(s) = \underset{a}{max}Q^{*}(s,a), which implies \underset{a}{max}A^{*}(s,a) = 0
  • Theorem 2: Q^{*}(s,a) = V^{*}(s) + A^{*}(s,a) - \underset{a}{max}A^{*}(s,a)

3.7.2 Dueling Network

  • Approximate Q*(s,a) by a neural network Q(s,a;w).
  • Approximate the advantage function A*(s,a) by a neural network A(s,a;w^{A}).
  • Approximate the state-value function V*(s) by a neural network V(s;w^{V}).
  • Dueling network: Q(s,a; w^{A},w^{V}) = V(s;w^{V}) + A(s,a;w^{A}) - \underset{a}{max}A(s,a;w^{A})

--> Writing w = (w^{V}, w^{A}): Q(s,a; w) = V(s;w^{V}) + A(s,a;w^{A}) - \underset{a}{max}A(s,a;w^{A})

The dueling network plays the same role as DQN: it approximates the optimal action-value function Q*(s, a).

(In the lecture figure, the scalar V(s; w^V) is added to the advantage vector A(s, ·; w^A), the maximum entry of that vector is subtracted, and the result is the output vector of Q-values.)
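A minimal PyTorch sketch of the dueling architecture: a shared trunk feeds a scalar value head V(s; w^V) and an advantage head A(s, ·; w^A), which are combined exactly as in the formula above (layer sizes are illustrative; some implementations subtract the mean advantage instead of the max, which is not shown here):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Q(s, a) = V(s; w^V) + A(s, a; w^A) - max_a A(s, a; w^A)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)         # V(s; w^V), a scalar
        self.adv_head = nn.Linear(hidden, n_actions)   # A(s, .; w^A), one entry per action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        h = self.trunk(state)
        v = self.value_head(h)                                  # shape (..., 1)
        a = self.adv_head(h)                                    # shape (..., n_actions)
        return v + a - a.max(dim=-1, keepdim=True).values       # broadcast V over actions

q_values = DuelingDQN(state_dim=4, n_actions=3)(torch.randn(2, 4))   # shape (2, 3)
```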

3.7.3 Dueling Network Mathematical Principle: Overcoming Non-Identifiability

If V* and A* shift by equal amounts in opposite directions, Q* is unchanged; such unidentifiable shifts make training unstable.

The \underset{a}{max}A^{*} term removes this degree of freedom (it pins the maximum advantage at zero), which keeps the neural network stable.

3.7.4 Dueling Network Summary

References

1. 王树森, 强化学习 (Reinforcement Learning) lecture series.

2. https://www.cnblogs.com/pinard/category/1254674.html
