The Road to Learning Reinforcement Learning (16), 2021-01-16: Value Function Approximation

Chou_pijiang

Category: Reinforcement Learning Basics. Tags: reinforcement learning

Article link: https://blog.csdn.net/zyh19980527/article/details/112675763

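As a beginner, I am writing this Reinforcement Learning Basics column to share my own path through learning reinforcement learning, in the hope that it helps others. I will keep updating this series and hope to average one post per day throughout 2021. It mainly covers the fundamentals of reinforcement learning; a column on reading reinforcement learning papers will follow. I originally wanted each post to cover more material, but I later noticed that people mostly come to CSDN to ask specific questions, so I split much of the content into smaller posts (which also keeps each day's workload lighter, a handy way to slack off, haha). I still want the material to form a coherent system, so the table of contents is organized systematically by chapter, and I have also written it up in book form on GitHub. If you find it helpful and want to read from the beginning, you are welcome to follow my GitHub. Thank you! I also share a Deep Learning Basics column and a Deep Learning Paper Reading column, which my friends and I put a lot of effort into some time ago; if you are interested in deep learning, you are welcome to follow those too. Let's learn from each other! There are probably many mistakes and omissions, so corrections are welcome. Do not overestimate what one year of effort can do, and do not underestimate what ten years of accumulation can do. Let's encourage each other!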

Value function approximation

Why function approximation is needed

Previous lectures on small RL problems:
- Cliff walk: $4 \times 16$ states
- Mountain car: 1600 states
- Tic-Tac-Toe: $10^{3}$ states
Large-scale problems:
- Backgammon: $10^{20}$ states
- Chess: $10^{47}$ states
- Game of Go: $10^{170}$ states
- Robot arm and helicopter: continuous state spaces
- Number of atoms in the universe: $10^{80}$

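Even if the agent has a complete and accurate model of the environment, it usually lacks the computational power to make full use of it at every time step. Available memory is another important constraint: exact value functions, policies, and models all take up storage. In most practical problems, the state space is far too large to fit in a table. This motivates function approximation, which both avoids an oversized Q-table and generalizes to states that have never been seen.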

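Ways to do function approximation

When we fit an approximation to the value function V, there are two cases.

In the first case, the true values are known (more precisely, the true values of some states are known, and fitting an approximation of V to them gives generalization to other states):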

We assume that the true value function $v^{\pi}(s)$ is given by a supervisor / oracle.

The second and more realistic case is that we do not know V and only observe rewards. We can then use the methods introduced earlier to estimate V:
- For MC, the target is the actual return $G_{t}$:
$\Delta \mathbf{w}=\alpha\left(G_{t}-\hat{v}\left(s_{t}, \mathbf{w}\right)\right) \nabla_{\mathbf{w}} \hat{v}\left(s_{t}, \mathbf{w}\right)$

The return $G_{t}$ is an unbiased but noisy sample of the true value $v^{\pi}\left(s_{t}\right)$.
- For TD(0), the target is the TD target $R_{t+1}+\gamma \hat{v}\left(s_{t+1}, \mathbf{w}\right)$:
$\Delta \mathbf{w}=\alpha\left(R_{t+1}+\gamma \hat{v}\left(s_{t+1}, \mathbf{w}\right)-\hat{v}\left(s_{t}, \mathbf{w}\right)\right) \nabla_{\mathbf{w}} \hat{v}\left(s_{t}, \mathbf{w}\right)$

The TD target $R_{t+1}+\gamma \hat{v}\left(s_{t+1}, \mathbf{w}\right)$ is a biased sample of the true value $v^{\pi}\left(s_{t}\right)$.

Why biased?

It is drawn from our previous estimate, rather than the true value: $\mathbb{E}\left[R_{t+1}+\gamma \hat{v}\left(s_{t+1}, \mathbf{w}\right)\right] \neq v^{\pi}\left(s_{t}\right)$
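To make the unbiased/biased distinction concrete, here is a minimal numerical sketch. The two-step episode (A to B with reward 0, then B to terminal with a noisy reward of mean 1), the discount `gamma = 0.9`, and the wrong current estimate `v_hat_B = 0` are all assumptions made up for illustration; they do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical episode: A -> B (reward 0), then B -> terminal (reward ~ N(1, 1)).
true_v_A = gamma * 1.0   # v(A) = 0 + gamma * E[reward received from B]
v_hat_B = 0.0            # current (wrong) estimate of v(B), e.g. weights still at zero

n = 100_000
rewards_from_B = rng.normal(loc=1.0, scale=1.0, size=n)

# MC target for state A: the full observed return G_t.
mc_targets = 0.0 + gamma * rewards_from_B
# TD(0) target for state A: R_{t+1} + gamma * v_hat(B), built from the estimate.
td_targets = np.full(n, 0.0 + gamma * v_hat_B)

mc_mean, td_mean = mc_targets.mean(), td_targets.mean()
print(f"true v(A)      = {true_v_A:.3f}")
print(f"mean MC target = {mc_mean:.3f}  (std {mc_targets.std():.3f})")
print(f"mean TD target = {td_mean:.3f}  (std {td_targets.std():.3f})")
```

In this toy setting the MC targets average out to the true value but have high variance, while the TD targets have no variance at all but are centered on whatever the current estimate implies.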

Using linear TD(0), where $\nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})=\mathbf{x}(s)$, the stochastic gradient descent update is
$\begin{aligned} \Delta \mathbf{w} &=\alpha\left(R+\gamma \hat{v}\left(s^{\prime}, \mathbf{w}\right)-\hat{v}(s, \mathbf{w})\right) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w}) \\ &=\alpha\left(R+\gamma \hat{v}\left(s^{\prime}, \mathbf{w}\right)-\hat{v}(s, \mathbf{w})\right) \mathbf{x}(s) \end{aligned}$
This is also called a semi-gradient method, because we ignore the effect of changing the weight vector $\mathbf{w}$ on the target.
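The linear TD(0) update above can be sketched in a few lines of NumPy. The environment here is an assumption chosen for illustration: a 5-state random walk that starts in the middle, moves left or right with equal probability, and pays reward 1 only when it exits on the right. The one-hot `features` function and the hyperparameters are likewise illustrative choices.

```python
import numpy as np

def features(s, n_states=5):
    """One-hot feature vector x(s); an illustrative choice of features."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def linear_td0(n_episodes=5000, alpha=0.02, gamma=1.0, n_states=5, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(n_states)
    for _ in range(n_episodes):
        s = n_states // 2                      # start in the middle state
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            done = s_next < 0 or s_next >= n_states
            r = 1.0 if s_next >= n_states else 0.0
            x = features(s, n_states)
            # The target uses the current estimate of the next state (bootstrapping).
            v_next = 0.0 if done else features(s_next, n_states) @ w
            td_error = r + gamma * v_next - x @ w
            # Semi-gradient step: the gradient is taken only through v_hat(s, w) = x(s)^T w.
            w += alpha * td_error * x
            if done:
                break
            s = s_next
    return w

w = linear_td0()
true_v = np.arange(1, 6) / 6.0   # analytic values of this random walk
print(np.round(w, 3), np.round(true_v, 3))
```

With one-hot features this reduces to tabular TD(0); swapping `features` for a coarser encoding would show the generalization effect, at the cost of some approximation error.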

Semi-gradient Sarsa for VFA Control:
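A minimal sketch of semi-gradient Sarsa with a linear action-value function $\hat{q}(s, a, \mathbf{w})=\mathbf{x}(s, a)^{T} \mathbf{w}$ is given below. The corridor environment (6 cells, reward -1 per step, goal at the right end), the one-hot state-action features, and all hyperparameters are assumptions for illustration, not part of the original lecture.

```python
import numpy as np

N_STATES, N_ACTIONS = 6, 2      # corridor cells 0..5; actions: 0 = left, 1 = right
GOAL = N_STATES - 1

def x(s, a):
    """One-hot feature vector for the (state, action) pair."""
    feat = np.zeros(N_STATES * N_ACTIONS)
    feat[s * N_ACTIONS + a] = 1.0
    return feat

def q_hat(s, a, w):
    return x(s, a) @ w

def epsilon_greedy(s, w, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)]))

def step(s, a):
    """Hypothetical corridor dynamics: reward -1 per step until the goal cell."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s_next, -1.0, s_next == GOAL

def semi_gradient_sarsa(n_episodes=2000, alpha=0.05, gamma=1.0, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(N_STATES * N_ACTIONS)
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(s, w, eps, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            if done:
                target = r
            else:
                # On-policy: bootstrap from the action actually chosen next.
                a_next = epsilon_greedy(s_next, w, eps, rng)
                target = r + gamma * q_hat(s_next, a_next, w)
            # Semi-gradient update: the target is treated as a constant.
            w += alpha * (target - q_hat(s, a, w)) * x(s, a)
            if not done:
                s, a = s_next, a_next
    return w

w = semi_gradient_sarsa()
greedy = [int(np.argmax([q_hat(s, a, w) for a in range(N_ACTIONS)])) for s in range(GOAL)]
print(greedy)
```

The only difference from the prediction case is that the features are defined on state-action pairs and the policy being evaluated is the epsilon-greedy policy derived from the current weights.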

Concretely, we can use the following kinds of models for the approximation:

- Linear combinations of features
- Neural networks
- Decision trees
- Nearest neighbors

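The choices above are really choices of the label (the training target). As for the model, we can represent it as a linear transformation of features extracted from the state: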

Represent value function by a linear combination of features
$\hat{v}(s, \mathbf{w})=\mathbf{x}(s)^{T} \mathbf{w}=\sum_{j=1}^{n} x_{j}(s) w_{j}$
The objective function is quadratic in parameter $\mathbf{w}$
$J(\mathbf{w})=\mathbb{E}_{\pi}\left[\left(v^{\pi}(s)-\mathbf{x}(s)^{T} \mathbf{w}\right)^{2}\right]$
Thus the update rule is as simple as
$\Delta \mathbf{w}=\alpha\left(v^{\pi}(s)-\hat{v}(s, \mathbf{w})\right) \mathbf{x}(s)$
that is, Update = StepSize × PredictionError × FeatureValue.
Stochastic gradient descent converges to the global optimum: in the linear case there is only one optimum, so any method that converges to or near a local optimum automatically converges to or near the global optimum.
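As a rough check of the single-optimum claim, the sketch below runs the linear update from two very different starting points. Everything concrete in it is an assumption for illustration: the random feature matrix `X`, the weight vector `w_star`, and an oracle whose values happen to be exactly linear in the features (so the optimum is `w_star` itself).

```python
import numpy as np

def sgd_linear_vfa(X, v_true, w_init, alpha=0.05, n_steps=20000, seed=0):
    """SGD on J(w) = E[(v_pi(s) - x(s)^T w)^2] with oracle values v_true."""
    rng = np.random.default_rng(seed)
    w = w_init.astype(float).copy()
    for _ in range(n_steps):
        i = rng.integers(len(X))             # sample a state uniformly
        x = X[i]
        prediction_error = v_true[i] - x @ w
        w += alpha * prediction_error * x    # step size * prediction error * feature value
    return w

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 3))                 # hypothetical features for 20 states
w_star = np.array([1.0, -2.0, 0.5])          # hypothetical weights defining the oracle
v_true = X @ w_star                          # oracle values, exactly linear by construction

w_a = sgd_linear_vfa(X, v_true, w_init=np.zeros(3), seed=1)
w_b = sgd_linear_vfa(X, v_true, w_init=np.array([10.0, 10.0, -10.0]), seed=2)
print(np.round(w_a, 4), np.round(w_b, 4))
```

When the oracle values are not exactly linear in the features, a constant step size leaves the weights fluctuating around the least-squares solution, and a decaying step size would typically be used instead.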

The deadly triad of reinforcement learning

- Function approximation
- Bootstrapping
- Off-policy training

If all three elements are present, instability is very likely and hard to avoid; if only two of them are present, instability can be avoided.
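A small illustration of how the three elements combine is the two-state example discussed in Sutton and Barto's textbook: the first state has feature 1 (value estimate $w$), the second has feature 2 (value estimate $2w$), the reward is 0, and only the transition from the first state to the second is ever updated, which corresponds to an off-policy update distribution. The function name and hyperparameters below are illustrative assumptions.

```python
def off_policy_td_two_state(gamma, alpha=0.1, n_updates=100, w0=1.0):
    """Repeated semi-gradient TD(0) updates on the single transition s1 -> s2."""
    w = w0
    history = [w]
    for _ in range(n_updates):
        v_s, v_next = 1.0 * w, 2.0 * w              # linear estimates with features 1 and 2
        td_error = 0.0 + gamma * v_next - v_s       # reward is always 0
        w += alpha * td_error * 1.0                 # gradient of v_hat(s1, w) w.r.t. w is 1
        history.append(w)
    return history

diverging = off_policy_td_two_state(gamma=0.9)
shrinking = off_policy_td_two_state(gamma=0.4)
print(f"gamma=0.9: w after 100 updates = {diverging[-1]:.1f}")
print(f"gamma=0.4: w after 100 updates = {shrinking[-1]:.3f}")
```

Each update multiplies $w$ by $1+\alpha(2 \gamma-1)$, so the weight grows without bound whenever $\gamma>0.5$, even though the true value of both states is 0.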

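Of the three elements, function approximation is the one that can least be given up: state aggregation and nonparametric methods, whose complexity grows with the amount of data, either perform too poorly or are too expensive.

Doing without bootstrapping is possible, at the cost of computational and data efficiency.

Many reinforcement learning algorithms are aimed at solving this instability problem.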