[Reinforcement Learning] Bolei Zhou, Chapter 4: Value Function Approximation

Model-Free

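To reduce the cost of learning and storing a separate value for every state, the value function, the action-value function, and the policy are approximated by functions parameterized by weights $w$: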
$\hat{v}(s,w) \approx v^{\pi}(s)$
$\hat{q}(s,a,w) \approx q^{\pi}(s, a)$
$\hat{\pi}(s,a,w) \approx \pi(s,a)$

Value Function Approximation for Model-Free Prediction

MC
$\Delta w = \alpha(G_{t} - \hat{v}(s_t, w))\nabla \hat{v}(s_{t}, w)$
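The return $G_{t}$ is an unbiased sample of the true value, because $\mathbb{E}[G_{t} \mid s_{t}] = v^{\pi}(s_{t})$.
It is noisy, however, because each MC return is sampled from a whole trajectory.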
TD
$\Delta w = \alpha(R_{t+1} + \gamma \hat{v}(s_{t+1}, w) - \hat{v}(s_t, w))\nabla \hat{v}(s_{t}, w)$
The TD target is biased, because it bootstraps from the current estimate: in general $\mathbb{E}[R_{t+1}+\gamma \hat{v}(s_{t+1}, w)] \neq v^{\pi}(s_{t})$
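To make the two update rules concrete, here is a minimal sketch that assumes a linear approximator $\hat{v}(s,w) = w^\top x(s)$, so that $\nabla \hat{v}(s,w) = x(s)$. The feature vectors, step size, and function names are illustrative assumptions and do not come from the lecture code.

```python
def v_hat(w, x):
    """Linear value estimate: dot product of the weights and the feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mc_update(w, x_t, g_t, alpha):
    """Gradient MC: move v_hat(s_t, w) toward the sampled return G_t."""
    error = g_t - v_hat(w, x_t)
    # For a linear model the gradient of v_hat with respect to w is x_t.
    return [wi + alpha * error * xi for wi, xi in zip(w, x_t)]


def td0_update(w, x_t, reward, x_next, alpha, gamma, done=False):
    """Semi-gradient TD(0): the bootstrapped target is treated as a constant."""
    target = reward if done else reward + gamma * v_hat(w, x_next)
    error = target - v_hat(w, x_t)
    return [wi + alpha * error * xi for wi, xi in zip(w, x_t)]


w = [0.0, 0.0]
w_mc = mc_update(w, x_t=[1.0, 0.0], g_t=2.0, alpha=0.5)
w_td = td0_update(w, x_t=[1.0, 0.0], reward=1.0, x_next=[0.0, 1.0],
                  alpha=0.5, gamma=0.9)
print(w_mc, w_td)
```

The only difference between the two functions is the target: MC uses the full sampled return, while TD uses one reward plus the current estimate of the next state.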

Action-Value Function Approximation for Model-Free Control

MC
$\Delta w = \alpha(G_{t} - \hat{q}(s_{t}, a_{t},w))\nabla \hat{q}(s_{t}, a_{t}, w)$
Sarsa (on-policy: the next action $a_{t+1}$ is sampled from the same policy that is being learned)
$\Delta w = \alpha(R_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w)- \hat{q}(s_{t}, a_{t},w))\nabla \hat{q}(s_{t}, a_{t}, w)$
Q-learning
$\Delta w = \alpha(R_{t+1} + \gamma \max_{a}\hat{q}(s_{t+1}, a, w)- \hat{q}(s_{t}, a_{t},w))\nabla \hat{q}(s_{t}, a_{t}, w)$
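The three updates above differ only in the target. The sketch below assumes a linear $\hat{q}$ with one weight row per action, which is an illustrative choice and not the lecture's implementation; the MC case would simply pass the sampled return $G_t$ as `target`.

```python
def q_hat(w, x, a):
    """Linear action value: weight row of action a times the state features."""
    return sum(wi * xi for wi, xi in zip(w[a], x))


def sarsa_target(w, reward, x_next, a_next, gamma):
    """Sarsa bootstraps from the action actually chosen by the policy."""
    return reward + gamma * q_hat(w, x_next, a_next)


def q_learning_target(w, reward, x_next, gamma):
    """Q-learning bootstraps from the greedy (max) action."""
    return reward + gamma * max(q_hat(w, x_next, a) for a in range(len(w)))


def update(w, x, a, target, alpha):
    """Semi-gradient step on the weights of the action that was taken."""
    error = target - q_hat(w, x, a)
    new_w = [row[:] for row in w]
    new_w[a] = [wi + alpha * error * xi for wi, xi in zip(w[a], x)]
    return new_w


w = [[1.0, 0.0], [0.0, 2.0]]
x, x_next = [1.0, 0.0], [0.0, 1.0]
y_sarsa = sarsa_target(w, reward=1.0, x_next=x_next, a_next=0, gamma=0.5)
y_q = q_learning_target(w, reward=1.0, x_next=x_next, gamma=0.5)
print(y_sarsa, y_q, update(w, x, 0, y_q, alpha=0.5))
```

In this toy setting the policy picks action 0 in the next state while the greedy action is 1, so the two targets differ.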

Sarsa algorithm:
[Figure: Sarsa algorithm]
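As a rough illustration of the Sarsa loop, here is a minimal sketch with a linear $\hat{q}$ over one-hot features. The five-state corridor environment, the reward of -1 per step, and all hyperparameters are hypothetical choices made for this example; they are not taken from the lecture or from the linked repository.

```python
import random

N_STATES, GOAL = 5, 4        # corridor with cells 0..4; the episode ends at cell 4
LEFT, RIGHT = 0, 1


def features(s, a):
    """One-hot feature vector over (state, action) pairs."""
    x = [0.0] * (N_STATES * 2)
    x[s * 2 + a] = 1.0
    return x


def q_hat(w, s, a):
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))


def epsilon_greedy(w, s, eps, rng):
    if rng.random() < eps:
        return LEFT if rng.random() < 0.5 else RIGHT
    return max((LEFT, RIGHT), key=lambda a: q_hat(w, s, a))


def step(s, a):
    """Move one cell; every step costs -1 until the goal is reached."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return s_next, -1.0, s_next == GOAL


def sarsa(episodes=500, alpha=0.1, gamma=1.0, eps=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    w = [0.0] * (N_STATES * 2)
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(w, s, eps, rng)
        for _ in range(max_steps):
            s_next, r, done = step(s, a)
            # On-policy: the next action comes from the same epsilon-greedy policy.
            a_next = epsilon_greedy(w, s_next, eps, rng)
            target = r if done else r + gamma * q_hat(w, s_next, a_next)
            error = target - q_hat(w, s, a)
            # The gradient of a linear q_hat with respect to w is the feature vector.
            x = features(s, a)
            w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
            if done:
                break
            s, a = s_next, a_next
    return w


w = sarsa()
greedy = [max((LEFT, RIGHT), key=lambda a: q_hat(w, s, a)) for s in range(GOAL)]
print(greedy)
```

With one-hot features the linear approximator reduces to a table, which keeps the sketch easy to check by hand; richer features would plug into `features` without changing the loop.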

Code:
https://github.com/cuhkrlcourse/RLexample/blob/master/modelfree/q_learning_mountaincar.py

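Notes on Value Function Approximation

The TD update with function approximation does not follow the true gradient of any objective function, because the target itself depends on $w$. The update combines two processes: 1. the Bellman backup and 2. the function approximation. Both introduce a lot of noise.
Off-policy: the behavior policy and the target policy differ, so the value estimates are not necessarily accurate. This is one reason reinforcement learning is hard to train and slow to converge.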

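The Deadly Triad of Reinforcement Learning

Function approximation: approximating the value function introduces error.
Bootstrapping: TD-style targets are biased.
Off-policy learning: the behavior policy and the target policy differ too much.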
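A small numerical sketch can show how the three ingredients interact. It uses a well-known two-state construction from the reinforcement learning literature, which is not part of this post: the two states have scalar features 1 and 2, so $\hat{v}(s_1,w)=w$ and $\hat{v}(s_2,w)=2w$, the reward is 0, and only the transition $s_1 \to s_2$ is ever updated (an off-policy data distribution). The function name and all constants are assumptions made for illustration.

```python
def off_policy_td(w=1.0, alpha=0.1, gamma=0.99, steps=100):
    """Repeat the semi-gradient TD(0) update on the single transition s1 -> s2."""
    history = [w]
    for _ in range(steps):
        # reward 0, v_hat(s2) = 2 * w, v_hat(s1) = 1 * w
        td_error = 0.0 + gamma * (2.0 * w) - (1.0 * w)
        # the gradient of v_hat(s1, w) with respect to w is the feature 1.0
        w = w + alpha * td_error * 1.0
        history.append(w)
    return history


diverging = off_policy_td(gamma=0.99)
shrinking = off_policy_td(gamma=0.4)
print(diverging[-1], shrinking[-1])
```

Each update multiplies $w$ by $1 + \alpha(2\gamma - 1)$, so in this toy setting the weight grows without bound whenever $\gamma > 0.5$, even though the true values are all zero.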

DQN (Deep Q-Network)

A deep neural network is used to approximate the action-value function.

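Problems

Correlation between samples: consecutive Atari frames are highly similar and differ in only a few pixels.
Non-stationary targets: the target changes as the network is updated.

Solutions

Experience Replay
Fixed Q targets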

Experience Replay

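Build a replay buffer $D$ that stores past transitions. Training on samples drawn at random from $D$ breaks the correlation between consecutive samples.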
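A replay buffer can be sketched with the standard library alone. The class below is a minimal illustration, not the implementation used in the linked tutorials; the transition layout `(state, action, reward, next_state, done)` and the `seed` parameter are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling without replacement mixes transitions from different times.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.push(state=i, action=0, reward=0.0, next_state=i + 1, done=False)
batch = buf.sample(2)
print(len(buf), batch)
```

Because the capacity here is 3, only the three most recent transitions remain available for sampling.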

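Fixed Targets

To make training more stable, the target is held fixed: it is computed from a separate copy of the weights, $w^-$, which is synchronized with the trained weights $w$ only every fixed number of updates.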
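The idea can be sketched without a deep learning framework. The example below assumes a linear Q function with one weight row per action and a single self-looping transition; `SYNC_EVERY`, the step size, and the discount are hypothetical values chosen so the numbers are easy to follow.

```python
import copy


def q_values(w, x):
    """Linear Q estimates for every action: one weight row per action."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def td_target(w_target, reward, x_next, done, gamma):
    """Q-learning target computed from the frozen weights w_target."""
    return reward if done else reward + gamma * max(q_values(w_target, x_next))


w = [[0.0], [0.0]]                 # trained (online) weights
w_target = copy.deepcopy(w)        # frozen copy used only for targets
x, a, r, x_next, done = [1.0], 0, 1.0, [1.0], False
ALPHA, GAMMA, SYNC_EVERY = 0.5, 0.5, 4

targets = []
for step in range(1, 9):
    y = td_target(w_target, r, x_next, done, GAMMA)
    targets.append(y)
    error = y - q_values(w, x)[a]
    w[a] = [wi + ALPHA * error * xi for wi, xi in zip(w[a], x)]
    if step % SYNC_EVERY == 0:
        # Only here does the target see the newly learned weights.
        w_target = copy.deepcopy(w)

print(targets)
```

Between synchronizations the target stays constant, so the trained weights chase a stationary value; the target then jumps once when the copy is refreshed.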

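Why Use Fixed Targets

This part is fun: the lecture uses a vivid analogy. The mouse is the Q target and the cat is the Q estimate.
If the target is not fixed, the Q target and the Q estimate move at the same time, which makes stable training very difficult.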

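With a fixed target the process alternates: [the Q target stays still while the Q estimate moves] -> [both move] -> ... This way the cat keeps narrowing the gap to the mouse and closes in on it.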

Improvements to DQN

Double DQN: Deep Reinforcement Learning with Double Q-Learning. Van Hasselt et al., AAAI 2016
Dueling DQN: Dueling Network Architectures for Deep Reinforcement Learning. Wang et al., ICML 2016 (best paper)
Prioritized Replay: Prioritized Experience Replay. Schaul et al., ICLR 2016
Agent57: https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark

Code implementation:
https://github.com/cuhkrlcourse/DeepRL-Tutorials