RL (6): Value Function Approximation

Mia_compiling

Article link: https://blog.csdn.net/qq_41796745/article/details/105322794

Why use value function approximation

The problems with large MDPs

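Tabular methods work when the number of states and actions is small and finite. For problems such as backgammon or Go, where the number of states and actions is far too large to enumerate, we need value function approximation.

Problems with large MDPs:

There are too many states and actions to store in memory.
Even if we could build that much storage, learning the value of each state individually would be too slow.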

Solution

[Figure]

We build a value function approximator v̂(s, w) or q̂(s, a, w). With it we can get an approximate value for any state (or state-action pair), and at the same time we use far less memory than a table would need.

It works like this:
[Figure]
We feed the state s (or the state-action pair) into the function (the boxes in the picture) and get back an approximate state value or action value. What reinforcement learning has to do is adjust w so that this output gets close to the true value.

Linear combinations of features and neural networks are both differentiable function approximators.

SGD — Stochastic Gradient Descent

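Suppose we know the true value v_π(S), that is, suppose we have an oracle (think of it as a database) that tells us the value v_π(S) each state should have. We can then measure the error between the true value and the approximator's output and shrink it, arriving at an approximate value function that balances the error across all states.
[Figure]
② Since the true value v_π(S) does not depend on w, its gradient with respect to w is zero, so the gradient of J(w) is determined entirely by the second term, which gives the formula that follows.
① The expectation sign is dropped because we sample: each update uses one sampled state, so we never need to compute the expectation explicitly.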

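Δw means: for each state, take the corresponding error, multiply it by the gradient and by the step size α, and move the parameters w in that direction. (The fourth formula shows the error term, and the gradient tells us how to correct for it.)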
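The figure with the derivation is not reproduced here. Assuming it follows the standard gradient-descent derivation for a mean-squared-error objective (an assumption, since the image is missing), the four formulas the notes refer to would read roughly as follows, with \hat v(S, w) denoting the approximator:

```latex
\begin{align}
J(\mathbf{w}) &= \mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat v(S,\mathbf{w})\big)^2\right] \\
\Delta\mathbf{w} &= -\tfrac{1}{2}\,\alpha\,\nabla_{\mathbf{w}} J(\mathbf{w}) \\
&= \alpha\,\mathbb{E}_\pi\!\left[\big(v_\pi(S) - \hat v(S,\mathbf{w})\big)\,\nabla_{\mathbf{w}} \hat v(S,\mathbf{w})\right] \\
\Delta\mathbf{w} &= \alpha\,\big(v_\pi(S) - \hat v(S,\mathbf{w})\big)\,\nabla_{\mathbf{w}} \hat v(S,\mathbf{w})
\end{align}
```

The third line uses the fact that v_π(S) has zero gradient with respect to w, and the last line replaces the expectation with a single sampled state.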

At every step you look at where you are and predict what its value is going to be; the oracle tells you what the value should be, you immediately adjust your weights w, and you move on to the next step.
You're fitting the oracle's predictions by minimizing the mean-squared error between the ……

Feature Vectors

[Figure]

Linear Value Function Approximation

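[Figure]
① The objective function here is the mean-squared error we computed earlier. It is the quantity we want to minimize, and it is now quadratic in w.
② Because the approximator is linear in w, the gradient is trivial to compute (it is just the feature vector x(S)), which gives the result in the formula.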
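As a minimal sketch of this update, assuming NumPy and a made-up five-state problem (the `features` function, `w_true`, and all other names are illustrative and not from the original post), an oracle-supervised SGD loop for a linear approximator might look like this:

```python
import numpy as np

def features(state, n_states=5):
    """Hypothetical 2-d feature vector: a bias term and the scaled state index."""
    return np.array([1.0, state / (n_states - 1)])

def v_hat(state, w):
    """Linear value estimate: dot product of features and weights."""
    return float(features(state) @ w)

def sgd_step(w, state, true_value, alpha):
    """One SGD step: w += alpha * (v_pi(S) - v_hat(S, w)) * x(S)."""
    x = features(state)
    error = true_value - x @ w
    return w + alpha * error * x

def fit_to_oracle(w_true, n_steps=5000, alpha=0.1, seed=0):
    """Sample states at random and fit w to oracle values generated by w_true."""
    rng = np.random.default_rng(seed)
    w = np.zeros_like(w_true)
    for _ in range(n_steps):
        s = int(rng.integers(5))
        oracle_value = features(s) @ w_true  # what the oracle would report
        w = sgd_step(w, s, oracle_value, alpha)
    return w

w_true = np.array([1.0, -2.0])  # assumed "true" weights, for the demo only
w = fit_to_oracle(w_true)
```

Because the oracle values here are exactly linear in the features, the learned `w` should end up very close to `w_true`; with a real problem the fit would only be approximate.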

Connecting tabular and linear methods

Table lookup is a special case of the linear method.
[Figure]
① We build a feature vector x^table(S) whose entry for the current state is 1 and whose other entries are 0.

② Taking the dot product of x(S) with the parameter vector w gives the estimate v̂(S, w), and this dot product simply selects the weight w_n that belongs to the state we are in. Depending on which state we are in, we just pick one entry from our table; the parameter vector is our table. So we are back to table lookup.
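To make the equivalence concrete, here is a small sketch with NumPy (the state count and weight values are made up for illustration). It shows that the dot product with a one-hot vector is a table lookup, and that a linear SGD step then touches only the visited state's entry:

```python
import numpy as np

def one_hot(state, n_states):
    """Table-lookup feature vector: 1 at the current state's index, 0 elsewhere."""
    x = np.zeros(n_states)
    x[state] = 1.0
    return x

w = np.array([0.5, -1.0, 2.0, 0.0])  # the parameter vector doubles as the value table
x = one_hot(2, len(w))

v = x @ w  # picks out w[2]

# A linear SGD step with one-hot features changes only the visited state's entry.
alpha, target = 0.1, 3.0
w_new = w + alpha * (target - v) * x
```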

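Everything above assumed that we know the true value function v_π(S). In reality we do not, so in practice we replace it with a target, as shown in the figure: [Figure] Under TD(λ) the target is the λ-return.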
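A rough sketch of two such targets, assuming a linear approximator and NumPy (the function names are mine, not from the original post): the Monte Carlo return and the one-step TD(0) target, each plugged into the same update in place of v_π(S).

```python
import numpy as np

def mc_return(rewards, gamma):
    """Monte Carlo target G_t: discounted sum of rewards to the end of the episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td0_target(reward, next_x, w, gamma, done):
    """TD(0) target: R + gamma * v_hat(S', w), with v_hat = 0 at a terminal state."""
    return reward + (0.0 if done else gamma * float(next_x @ w))

def semi_gradient_step(w, x, target, alpha):
    """Move w toward the target, treating the target as a fixed number."""
    return w + alpha * (target - float(x @ w)) * x
```

Note that the TD target itself depends on w, but the update does not differentiate through it, which is why this kind of update is usually called semi-gradient.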

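Policy iteration with function approximation is shown in the figure below:
Instead of waiting for the end of each episode, we update every n steps to get a new value function (the upward arrow), then use the new estimate to improve the policy (the downward arrow). My understanding is that with Sarsa, for example, this means choosing the next action from the current q with an ε-greedy policy. Then come another n steps, and so on.

[Figure]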
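The policy-improvement step (the downward arrow) can be sketched as ε-greedy action selection over the current action-value estimates. This is only an illustration with NumPy; `q_values` stands for whatever the approximator returns for each action in the current state.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With `epsilon = 0` this is the purely greedy policy; a small positive value keeps some exploration going while the estimates are still changing.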

Action-Value Function Approximation

Similarly, we can approximate the action-value function.
[Figure]
Using linear action-value function approximation as an example:
The difference is that we use x(S, A) as the feature vector instead.

Again, for prediction the algorithm is exactly the same as with the state-value function.
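A sketch of what that looks like for one-step Sarsa with a linear q̂, assuming NumPy. The one-hot `sa_features` encoding is a placeholder I picked for the example; any feature vector over state-action pairs would do.

```python
import numpy as np

def sa_features(state, action, n_states, n_actions):
    """Hypothetical x(S, A): one-hot over (state, action) pairs."""
    x = np.zeros(n_states * n_actions)
    x[state * n_actions + action] = 1.0
    return x

def q_hat(x, w):
    """Linear action-value estimate."""
    return float(x @ w)

def sarsa_step(w, x, reward, next_x, gamma, alpha, done):
    """Semi-gradient Sarsa(0): the target is R + gamma * q_hat(S', A', w)."""
    target = reward + (0.0 if done else gamma * q_hat(next_x, w))
    return w + alpha * (target - q_hat(x, w)) * x
```

The update has the same shape as the state-value one: error times feature vector times step size, only with x(S, A) in place of x(S).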

Should we bootstrap? (use λ)

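Take the Mountain Car example. The number of steps needed to reach the goal is shown in the figure below:
[Figure]
We can see that with λ = 1, that is, using the MC return as the target, learning is slowest, because updates have to wait until the end of each episode. With λ = 0, that is, TD(0), things are somewhat better than MC. For λ between 0 and 1, the results vary from task to task.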

The problem with TD is that it does not converge in every setting. The figure below shows where it converges and where it can diverge:
(A cross means divergence is possible.)
[Figure]

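Batch Methods

So far we have thrown each experience away right after learning from it, so we never got the full value out of it. With batch methods we build a dataset to store these experiences.
[Figure]
To be honest I have not fully understood this part yet. As for experience replay, the idea should be to build a dataset D, sample from it, and find the least-squares w. I will leave it at that for now, since it does not seem to affect what comes next. I did understand the Deep Q-Network algorithm that follows, though: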
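As far as I can tell, "finding the least-squares w" means picking the w that minimizes the summed squared error over everything stored in D. A minimal sketch, assuming a linear approximator and that D holds (feature vector, target value) pairs; the toy numbers are made up, and `np.linalg.lstsq` does the solving:

```python
import numpy as np

def least_squares_weights(X, targets):
    """Return w minimising sum_t (targets[t] - X[t] @ w)^2 over the dataset."""
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w

# Toy dataset D: each row of X is a feature vector x(s), each target a value for s.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
targets = np.array([1.0, 3.0, 5.0])
w_ls = least_squares_weights(X, targets)
```

Repeatedly sampling from D and taking SGD steps heads toward the same solution; solving directly just gets there in one shot when the approximator is linear.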

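① D can be very large, possibly millions of transitions. We draw a small sample from it, say a few dozen, and do gradient descent on that minibatch.

② We use the sampled transitions to compute the q-target. (DQN is based on Q-learning, so the target value is the Q-learning target.)
Note that DQN uses two Q-networks, a new one and an old one. The old one is used for bootstrapping, so the q-target comes from the old network, while the new one is the network being updated.

③ We keep reducing the MSE (mean-squared error).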
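The three steps above can be sketched as follows. This is a simplified illustration, not the actual DQN implementation: `q_old` is a hypothetical callable standing in for the old network, returning one value per action, and transitions are assumed to be `(state, action, reward, next_state, done)` tuples.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng):
        # Step 1: draw a small minibatch uniformly at random from D.
        return rng.sample(list(self.data), batch_size)

def q_targets(batch, q_old, gamma):
    """Step 2: Q-learning targets r + gamma * max_a' q_old(s', a')."""
    targets = []
    for _, _, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * float(np.max(q_old(s_next)))
        targets.append(r + bootstrap)
    return np.array(targets)

def mse(predictions, targets):
    """Step 3: the loss that gradient descent tries to reduce."""
    return float(np.mean((np.asarray(predictions) - targets) ** 2))
```

Sampling at random from the buffer means the transitions in a minibatch are no longer consecutive steps of one trajectory, which is the decorrelation effect usually credited to experience replay.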

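The advantage of this approach is that it is very stable in practice: it does not diverge, it DOES NOT BLOW UP. (With a neural network there is no formal guarantee of converging to the optimum, though.) The two things that make it so stable are:

<1> Experience replay, because it breaks the correlations between consecutive samples along a trajectory. (To be honest I do not fully understand the mechanism.)

<2> Fixed Q-targets. We use two neural networks, a new one and an old one (that is, two sets of parameters w), and the old one is held fixed (freeze the old one for a while). We bootstrap from the old network, so the q-target in the formula stays put instead of shifting with every update to the predictions, and that is what makes learning stable.

After many steps and many updates, we copy the new network's parameters into the old network and repeat the whole procedure.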
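Here is a small sketch of that freeze-and-sync schedule. To keep it self-contained I use a linear q̂ with one weight row per action as a stand-in for the neural network, which is my simplification; `sync_every` and the other names are hypothetical.

```python
import numpy as np

def train_with_fixed_target(transitions, n_features, n_actions,
                            gamma=0.9, alpha=0.1, sync_every=100):
    """Q-learning with a frozen copy of the weights used for bootstrapping."""
    w = np.zeros((n_actions, n_features))  # "new" parameters, updated every step
    w_old = w.copy()                       # "old" parameters, frozen between syncs
    for step, (x, a, r, x_next, done) in enumerate(transitions, start=1):
        # The target is computed from the frozen weights only.
        target = r + (0.0 if done else gamma * np.max(w_old @ x_next))
        w[a] += alpha * (target - w[a] @ x) * x
        if step % sync_every == 0:
            w_old = w.copy()               # refresh the frozen copy
    return w, w_old
```

Between syncs the target for a given transition does not move, so each stretch of updates is an ordinary regression toward fixed numbers.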

As before, see the figure below:
[Figure]
