Approximation Methods for On-policy Control

胧月夜い

Published 2021-09-14 14:58:51

Tags: reinforcement learning

Article link: https://blog.csdn.net/qq_46013251/article/details/119944050

On-policy Control with Approximation

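Episodic Semi-gradient Control
- Example: Mountain Car Task
Semi-gradient n-step Sarsa
- Example
Average Reward: A New Problem Setting for Continuing Tasks
Deprecating the Discounted Setting
Differential Semi-gradient n-step Sarsa
References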

Episodic Semi-gradient Control

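We extend the semi-gradient prediction methods to action values. Here it is the action-value function $\hat{q} \approx q_\pi$ that is approximated, represented as a parameterized functional form with weight vector $\mathbf{w}$.
We now consider training examples of the form $S_t, A_t \mapsto U_t$.
The update target $U_t$ can be any approximation of $q_\pi (S_t, A_t)$, including the usual backed-up values such as the full Monte Carlo return $G_t$ or any n-step Sarsa return.
The general gradient-descent update for action-value prediction is: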
$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha [U_t - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q} (S_t, A_t, \mathbf{w}_t)$
For example, the update for the one-step Sarsa method is:
$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha [R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat{q}(S_t, A_t, \mathbf{w}_t)] \nabla \hat{q} (S_t, A_t, \mathbf{w}_t)$
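We call this method episodic semi-gradient one-step Sarsa.
For a fixed policy, this method converges in the same way as TD(0), with the same kind of error bound.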
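
To make the one-step update concrete, here is a minimal sketch assuming a linear approximator, $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$, so that the gradient is just the feature vector. The names `QHat` and `SarsaUpdate`, and the feature vectors passed in, are illustrative assumptions and not part of mlpack or the original post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear action-value estimate: q_hat(s, a, w) = w . x(s, a).
// `features` is the (hypothetical) feature vector x(s, a).
double QHat(const std::vector<double>& w, const std::vector<double>& features)
{
  assert(w.size() == features.size());
  double q = 0.0;
  for (std::size_t i = 0; i < w.size(); ++i)
    q += w[i] * features[i];
  return q;
}

// One semi-gradient one-step Sarsa update:
//   w <- w + alpha * [R + gamma * q_hat(S', A', w) - q_hat(S, A, w)] * x(S, A)
// Both estimates are evaluated with the weights from before the update.
// Pass terminal = true when S' is terminal, so the bootstrap term is dropped.
void SarsaUpdate(std::vector<double>& w,
                 const std::vector<double>& x,      // x(S_t, A_t)
                 double reward,                     // R_{t+1}
                 const std::vector<double>& xNext,  // x(S_{t+1}, A_{t+1})
                 bool terminal,
                 double alpha,
                 double gamma)
{
  const double target = terminal ? reward : reward + gamma * QHat(w, xNext);
  const double delta = target - QHat(w, x);
  for (std::size_t i = 0; i < w.size(); ++i)
    w[i] += alpha * delta * x[i];
}
```

The method is "semi-gradient" because the target also depends on $\mathbf{w}$, yet only the gradient of $\hat{q}(S_t, A_t, \mathbf{w}_t)$ is used in the update.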

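To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection.
If the action set is discrete and not too large, we can use the techniques already introduced earlier.
For each possible action $a$ available in the next state $S_{t+1}$, we compute $\hat{q}(S_{t+1}, a, \mathbf{w}_t)$ and then find the greedy action $A^*_{t+1} = \operatorname*{arg\,max}_a \hat{q}(S_{t+1}, a, \mathbf{w}_t)$.
Policy improvement is then done by changing the estimation policy to a soft approximation of the greedy policy, such as the $\varepsilon$-greedy policy.
Actions are selected according to this same policy.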
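
As a rough illustration of this action-selection step, the sketch below picks an $\varepsilon$-greedy action from a vector holding $\hat{q}(S_{t+1}, a, \mathbf{w}_t)$ for every action $a$. The function names are hypothetical, and breaking ties toward the lowest action index is an arbitrary choice made here for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Index of the greedy action, given q_hat(S, a, w) for every action a.
// Ties are broken in favour of the lowest index.
std::size_t GreedyAction(const std::vector<double>& qValues)
{
  assert(!qValues.empty());
  std::size_t best = 0;
  for (std::size_t a = 1; a < qValues.size(); ++a)
    if (qValues[a] > qValues[best])
      best = a;
  return best;
}

// Epsilon-greedy selection: with probability epsilon choose uniformly at
// random among all actions, otherwise choose the greedy action.
template<typename Rng>
std::size_t EpsilonGreedyAction(const std::vector<double>& qValues,
                                double epsilon,
                                Rng& rng)
{
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  if (coin(rng) < epsilon)
  {
    std::uniform_int_distribution<std::size_t> pick(0, qValues.size() - 1);
    return pick(rng);
  }
  return GreedyAction(qValues);
}
```

Because the same soft policy both generates behaviour and is being evaluated, the method stays on-policy.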

The pseudocode follows:
[Figure: pseudocode for episodic semi-gradient Sarsa (image not recovered)]
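
Since the pseudocode figure is not available, the following is a minimal sketch of how an episodic semi-gradient one-step Sarsa loop might look. It is not the author's pseudocode. Everything in it is an assumption made for illustration: the toy `Corridor` task, the one-hot features over (state, action) pairs (which make the linear approximator tabular), and the `maxSteps` safety cap.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical toy task: a corridor of `length` cells. The agent starts in
// cell 0; action 0 moves left, action 1 moves right; every step yields a
// reward of -1; the episode ends on reaching the last cell.
struct Corridor
{
  std::size_t length;
  std::size_t Step(std::size_t s, std::size_t a) const
  {
    if (a == 1) return s + 1;
    return s == 0 ? 0 : s - 1;
  }
  bool IsTerminal(std::size_t s) const { return s + 1 == length; }
};

// One-hot features over (state, action): q_hat(s, a, w) = w[2 * s + a],
// and the gradient with respect to w is that same one-hot vector.
inline std::size_t Index(std::size_t s, std::size_t a) { return 2 * s + a; }

// Epsilon-greedy choice between the two actions (ties go to action 0).
template<typename Rng>
std::size_t ChooseAction(const std::vector<double>& w, std::size_t s,
                         double epsilon, Rng& rng)
{
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  if (coin(rng) < epsilon)
    return std::uniform_int_distribution<std::size_t>(0, 1)(rng);
  return w[Index(s, 1)] > w[Index(s, 0)] ? 1 : 0;
}

// Runs `episodes` episodes and returns the number of steps in each one.
template<typename Rng>
std::vector<std::size_t> EpisodicSarsa(const Corridor& env,
                                       std::vector<double>& w,
                                       std::size_t episodes,
                                       std::size_t maxSteps,
                                       double alpha, double gamma,
                                       double epsilon, Rng& rng)
{
  assert(w.size() == 2 * env.length);
  std::vector<std::size_t> stepsPerEpisode;
  for (std::size_t e = 0; e < episodes; ++e)
  {
    std::size_t s = 0;
    std::size_t a = ChooseAction(w, s, epsilon, rng);
    std::size_t steps = 0;
    while (steps < maxSteps)
    {
      const std::size_t sNext = env.Step(s, a);
      const double reward = -1.0;
      ++steps;
      if (env.IsTerminal(sNext))
      {
        // Terminal transition: the target is the reward alone.
        w[Index(s, a)] += alpha * (reward - w[Index(s, a)]);
        break;
      }
      // Choose A' first, then bootstrap from q_hat(S', A', w).
      const std::size_t aNext = ChooseAction(w, sNext, epsilon, rng);
      w[Index(s, a)] += alpha *
          (reward + gamma * w[Index(sNext, aNext)] - w[Index(s, a)]);
      s = sNext;
      a = aNext;
    }
    stepsPerEpisode.push_back(steps);
  }
  return stepsPerEpisode;
}
```

A call such as `EpisodicSarsa(env, w, 50, 1000, 0.5, 1.0, 0.1, rng)` with `Corridor env{5}` and a zero-initialised `w` of size `2 * env.length` runs 50 episodes. For a task like Mountain Car the one-hot features would be replaced by a richer feature construction.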

Example: Mountain Car Task

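Consider the task of driving an underpowered car up a steep hill, as shown in the figure:
[Figure: the Mountain Car task (image not recovered)]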

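Note that gravity is stronger than the car's engine: even at full throttle, the car cannot accelerate up the steep slope.
The only solution is to first move away from the goal and up the opposite slope on the left, then apply full throttle to build up enough inertia to reach the goal.
The reward is -1 on every time step until the car moves past the goal position at the top of the hill, which ends the episode.
There are three possible actions: full throttle forward (+1), full throttle reverse (-1), and zero throttle (0).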
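Its position $x_t$ and velocity $\dot{x}_t$ are updated by:
$x_{t+1} \doteq \textit{bound}[x_t + \dot{x}_{t+1}]$
$\dot{x}_{t+1} \doteq \textit{bound}[\dot{x}_t + 0.001 A_t - 0.0025 \cos(3 x_t)]$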

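where the $\textit{bound}$ operation enforces $-1.2 \leqslant x_{t+1} \leqslant 0.5$ and $-0.07 \leqslant \dot{x}_{t+1} \leqslant 0.07$.
In addition, when $x_{t+1}$ reaches the left bound, $\dot{x}_{t+1}$ is reset to zero.
When it reaches the right bound, the goal is reached and the episode ends.
Each episode starts from a random position $x_t \in [-0.6, -0.4)$ with zero velocity.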
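
The dynamics of this task can be sketched as a single step function. The constants below follow the standard formulation of the task in Sutton and Barto's textbook. The names `CarState` and `MountainCarStep` are hypothetical. mlpack's `MountainCar` class exposes the bounds as constructor parameters, so its defaults may differ from the values hard-coded here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct CarState
{
  double position;
  double velocity;
};

// Advances the car by one time step. `action` is -1 (full throttle reverse),
// 0 (zero throttle) or +1 (full throttle forward).
// Returns true when the right bound (the goal) is reached.
bool MountainCarStep(CarState& s, int action)
{
  const double positionMin = -1.2;
  const double positionMax = 0.5;
  const double velocityMax = 0.07;

  // The velocity update uses the old position x_t.
  s.velocity += 0.001 * action - 0.0025 * std::cos(3.0 * s.position);
  s.velocity = std::max(-velocityMax, std::min(velocityMax, s.velocity));

  // The position update uses the new, already bounded velocity.
  s.position += s.velocity;
  s.position = std::max(positionMin, std::min(positionMax, s.position));

  // Hitting the left bound resets the velocity to zero.
  if (s.position <= positionMin)
    s.velocity = 0.0;

  return s.position >= positionMax;
}
```

Starting from `CarState{-0.5, 0.0}` and repeatedly applying +1 does not reach the goal, which is why the agent has to learn to back up the left slope first.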

Program:

mlpack provides the same environment, so we use it directly:

```cpp
/**
 * Implementation of Mountain Car task.
 */
class MountainCar
{
 public:
  /**
   * Implementation of state of Mountain Car. Each state is a
   * (velocity, position) vector.
   */
  class State
  {
   public:
    /**
     * Construct a state instance.
     */
    State(): data(dimension, arma::fill::zeros)
    { /* Nothing to do here. */ }

    /**
     * Construct a state based on the given data.
     *
     * @param data Data for the velocity and position.
     */
    State(const arma::colvec& data): data(data)
    { /* Nothing to do here. */ }

    //! Modify the internal representation of the state.
    arma::colvec& Data() { return data; }

    //! Get the velocity.
    double Velocity() const { return data[0]; }
    //! Modify the velocity.
    double& Velocity() { return data[0]; }

    //! Get the position.
    double Position() const { return data[1]; }
    //! Modify the position.
    double& Position() { return data[1]; }

    //! Encode the state to a column vector.
    const arma::colvec& Encode() const { return data; }

    //! Dimension of the encoded state.
    static constexpr size_t dimension = 2;

   private:
    //! Locally-stored velocity and position vector.
    arma::colvec data;
  };

  /**
   * Implementation of action of Mountain Car.
   */
  class Action
  {
   public:
    enum actions
    {
      backward,
      stop,
      forward
    };
    // To store the action.
    Action::actions action;

    // Track the size of the action space.
    static const size_t size = 3;
  };

  /**
   * Construct a Mountain Car instance using the given constant.
   *
   * @param maxSteps The number of steps after which the episode
   *    terminates. If the value is 0, there is no limit.
   * @param positionMin Minimum legal position.
   * @param positionMax Maximum legal position.
   * @param positionGoal Final target position.
   * @param velocityMin Minimum legal velocity.
   * @param velocityMax Maximum legal ve
```
