Reinforcement Learning Paper Notes

This paper explores using generalized advantage estimation (GAE) to reduce the variance of policy gradient estimates while keeping the bias at a tolerable level. It uses value functions to address the large sample requirement and the difficulty of obtaining stable improvement, gives an analysis applicable to both the online and batch settings, and, combined with trust region optimization, effectively learns neural network policies for high-dimensional continuous control tasks. The main contributions are: the theoretical justification and broader applicability of GAE, a trust region optimization method for the value function, and the experimental results obtained by combining the two.

HIGH-DIMENSIONAL CONTINUOUS CONTROL USING GENERALIZED ADVANTAGE ESTIMATION

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan and Pieter Abbeel
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{joschu,pcmoritz,levine,jordan,pabbeel}@eecs.berkeley.edu

  • The main idea is to approximate the advantage function with GAE to reduce variance, using parameters to control how far into the future a sequence of actions is credited for rewards. This observation suggests an interpretation of Equation (16): reshape the rewards using $V$ to shrink the temporal extent of the response function, and then introduce a "steeper" discount $\gamma\lambda$ to cut off the noise arising from long delays, i.e., ignore the terms $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\delta^V_{t+l}$ where $l \gg 1/(1-\gamma\lambda)$.
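    A minimal sketch of this reward-shaping view, using the TD residual $\delta^V_t$ defined later in these notes: reshaping the rewards with $V$ gives the shaped reward
    $$\tilde{r}(s_t, a_t, s_{t+1}) = r_t + \gamma V(s_{t+1}) - V(s_t) = \delta^V_t,$$
    and GAE sums these shaped rewards under the steeper discount $\gamma\lambda$,
    $$\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta^V_{t+l},$$
    so terms with $l \gg 1/(1-\gamma\lambda)$ are exponentially suppressed.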

  • GAE: generalized advantage estimation

  • Two main challenges
    • large number of samples
    • difficulty of obtaining stable and steady improvement
  • Solutions

    • We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ).
    • We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks; the value-function version is sketched just below.
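    As a rough sketch of the value-function side, the paper fits $V_\phi$ with a constrained least-squares ("trust region") problem of roughly the form
    $$\min_{\phi}\ \sum_{n=1}^{N} \lVert V_\phi(s_n) - \hat{V}_n \rVert^2
    \quad \text{s.t.} \quad \frac{1}{N}\sum_{n=1}^{N} \frac{\lVert V_\phi(s_n) - V_{\phi_{\text{old}}}(s_n) \rVert^2}{2\sigma^2} \le \epsilon,$$
    where $\hat{V}_n$ are the empirical return targets and $\sigma^2$ is the mean squared error of the previous value function $V_{\phi_{\text{old}}}$; the constraint is handled approximately with conjugate gradient.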

  1. A family of policy gradient estimators, the generalized advantage estimator (GAE), parameterized by $\gamma \in [0,1]$ and $\lambda \in [0,1]$, which significantly reduces variance while keeping bias at a tolerable level.
  2. A more general analysis that applies to both the online and batch settings, along with a discussion interpreting our method as an instance of reward shaping.

The paper's three contributions:

  • We provide justification and intuition for an effective variance reduction scheme for policy gradients, which we call generalized advantage estimation (GAE). While the formula has been proposed in prior work (Kimura & Kobayashi, 1998; Wawrzyński, 2009), our analysis is novel and enables GAE to be applied with a more general set of algorithms, including the batch trust-region algorithm we use for our experiments.
  • We propose the use of a trust region optimization method for the value function, which we find is a robust and efficient way to train neural network value functions with thousands of parameters.
  • By combining (1) and (2) above, we obtain an algorithm that empirically is effective at learning neural network policies for challenging control tasks. The results extend the state of the art in using reinforcement learning for high-dimensional continuous control.
    Videos are available at
    https://sites.google.com/site/gaepapersupp.

Several different choices of policy gradient estimator
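As listed in the paper (its Equation 2), for the policy gradient $g = \mathbb{E}\big[\sum_{t=0}^{\infty} \Psi_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$ the common choices of $\Psi_t$ include:

  • $\sum_{t=0}^{\infty} r_t$: total reward of the trajectory
  • $\sum_{t'=t}^{\infty} r_{t'}$: reward following action $a_t$
  • $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$: baselined version of the previous formula
  • $Q^{\pi}(s_t, a_t)$: state-action value function
  • $A^{\pi}(s_t, a_t)$: advantage function
  • $r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$: TD residual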


Key points:

  • The value function is updated after the policy update, so the advantage estimates are computed with the value function from the previous iteration.
  • The choice $\Psi_t = A^\pi(s_t, a_t)$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated.
  • Introduce a parameter $\gamma$ that reduces variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias.
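    Concretely, with discounted value functions $V^{\pi,\gamma}$, $Q^{\pi,\gamma}$ and the discounted advantage $A^{\pi,\gamma} = Q^{\pi,\gamma} - V^{\pi,\gamma}$, the quantity being estimated is the discounted policy gradient (the Equation 6 referenced further down):
    $$g^{\gamma} = \mathbb{E}\Big[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big].$$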

  • The batch estimate of the policy gradient:

    $$\hat{g} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=0}^{\infty} \hat{A}^n_t\,\nabla_\theta \log \pi_\theta(a^n_t \mid s^n_t) \qquad (9)$$

    where $n$ indexes the $N$ trajectories in the batch.

  • $V$ is an approximate value function. Define the TD residual $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, which can be viewed as an estimate of the advantage of the action $a_t$.
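    A sketch of the derivation the next bullets refer to, following the paper: summing $k$ of these residuals telescopes into a $k$-step advantage estimator, and GAE is the exponentially-weighted average of all of them:
    $$\hat{A}^{(k)}_t = \sum_{l=0}^{k-1}\gamma^l\,\delta^V_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}),$$
    $$\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t = (1-\lambda)\big(\hat{A}^{(1)}_t + \lambda\hat{A}^{(2)}_t + \lambda^2\hat{A}^{(3)}_t + \cdots\big) = \sum_{l=0}^{\infty}(\gamma\lambda)^l\,\delta^V_{t+l}.$$
    The two special cases: $\lambda = 0$ gives $\hat{A}_t = \delta^V_t$ (low variance, but biased unless $V = V^{\pi,\gamma}$); $\lambda = 1$ gives $\hat{A}_t = \sum_{l=0}^{\infty}\gamma^l r_{t+l} - V(s_t)$ (unbiased for any $V$, but high variance).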

  • The derivation proceeds by summing the $\delta^V$ terms, which telescope into the $k$-step advantage estimators $\hat{A}^{(k)}_t$ shown above.
  • The generalized advantage estimator $\mathrm{GAE}(\gamma,\lambda)$ has two notable special cases, $\lambda = 0$ and $\lambda = 1$ (see above).
  • GAE gives a biased estimate of $g^{\gamma}$, obtained by rewriting Equation 6 with $\hat{A}^{\mathrm{GAE}(\gamma,\lambda)}_t$ in place of the true advantage.
  • The concrete algorithm (a sketch follows):
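A minimal Python sketch of the procedure, with generic `env_rollout`, `policy_update`, and `value_update` helpers standing in for the paper's TRPO policy step and trust-region value fit (these names are illustrative, not from the paper); `compute_gae` is the backward recursion $\hat{A}_t = \delta^V_t + \gamma\lambda\,\hat{A}_{t+1}$:

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.97):
    """Generalized advantage estimation for one finite trajectory.

    rewards: shape (T,)   -- r_0 ... r_{T-1}
    values:  shape (T+1,) -- V(s_0) ... V(s_T), including a bootstrap value for the last state
    Returns advantages of shape (T,) via the recursion A_t = delta_t + gamma * lam * A_{t+1}.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals delta_t
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

def discounted_returns(rewards, gamma=0.99):
    """Empirical discounted return-to-go, used as regression targets for the value function."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def train(env_rollout, policy_update, value_update, value_fn, iterations=100):
    """Outer loop skeleton; the paper's TRPO policy step and trust-region value fit
    are abstracted behind policy_update / value_update."""
    for _ in range(iterations):
        trajectories = env_rollout()                 # list of (states, actions, rewards)
        all_adv, all_targets = [], []
        for states, actions, rewards in trajectories:
            rewards = np.asarray(rewards)
            values = np.asarray(value_fn(states))    # shape (T+1,), includes the final state
            all_adv.append(compute_gae(rewards, values))
            all_targets.append(discounted_returns(rewards))
        policy_update(trajectories, all_adv)         # e.g. a TRPO step on the batch gradient (9)
        value_update(trajectories, all_targets)      # value function is updated after the policy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(compute_gae(rng.normal(size=10), rng.normal(size=11), gamma=0.99, lam=0.95))
```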

