Reinforcement Learning - Vanilla Policy Gradient (VPG)

Background

The key idea behind policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until the policy converges to the optimal one.

Quick Facts

  • VPG is an on-policy algorithm.
  • VPG can be used in environments with either continuous or discrete action spaces.
  • The Spinning Up implementation of VPG supports parallelization with MPI.

Key Equations

Let $\pi_\theta$ denote a policy with parameters $\theta$, and $J(\pi_\theta)$ the expected finite-horizon undiscounted return of the policy. The gradient of $J(\pi_\theta)$ is

$$\nabla_{\theta} J(\pi_{\theta}) = \mathop{\mathbb{E}}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, A^{\pi_{\theta}}(s_t, a_t) \right],$$

where $\tau$ is a trajectory and $A^{\pi_\theta}$ is the advantage function for the current policy.
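In practice this expectation is estimated from sampled trajectories: collect a batch of state-action pairs and advantage estimates, then average $\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t$ over the batch. Below is a minimal sketch of that estimator for a categorical (discrete-action) policy, written in PyTorch purely for illustration; the Spinning Up implementation documented later uses TensorFlow, and the names `logits_net`, `obs`, `act`, and `adv` are assumptions.

```python
import torch
from torch.distributions import Categorical

def vpg_loss(logits_net, obs, act, adv):
    """Sample-based surrogate whose gradient is the (negative) policy gradient."""
    logits = logits_net(obs)                          # (batch, n_actions)
    logp = Categorical(logits=logits).log_prob(act)   # log pi_theta(a_t | s_t), shape (batch,)
    return -(logp * adv).mean()                       # minimizing this ascends E[logp * A]
```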

The policy gradient algorithm works by updating the policy parameters via stochastic gradient ascent on policy performance:

$$\theta_{k+1} = \theta_k + \alpha \, \nabla_{\theta} J(\pi_{\theta_k}).$$

Policy gradient implementations typically compute advantage function estimates based on the infinite-horizon discounted return, despite otherwise using the finite-horizon undiscounted policy gradient formula.
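The ascent step itself can be taken with any stochastic-gradient optimizer by minimizing the negative surrogate above. A hedged sketch, continuing the hypothetical names from the previous snippet:

```python
import torch

def vpg_update(logits_net, pi_optimizer, obs, act, adv):
    # theta_{k+1} = theta_k + alpha * grad_theta J(pi_theta_k),
    # implemented as one minimization step on the -J surrogate.
    pi_optimizer.zero_grad()
    loss = vpg_loss(logits_net, obs, act, adv)
    loss.backward()
    pi_optimizer.step()
```

Here `pi_optimizer` would typically be `torch.optim.Adam(logits_net.parameters(), lr=pi_lr)`, matching the pi_lr parameter documented below.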

Exploration vs. Exploitation

VPG trains a stochastic policy in an on-policy way. This means that it explores by sampling actions according to the latest version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and the training procedure. Over the course of training, the policy typically becomes progressively less random, as the update rule encourages it to exploit rewards it has already found. This may cause the policy to get trapped in local optima.
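Concretely, "sampling actions from the stochastic policy" means drawing each action from the distribution the policy assigns to the current state; the entropy of that distribution is a rough measure of how much the agent is still exploring. A small illustrative sketch, using the same hypothetical categorical policy as above:

```python
import torch
from torch.distributions import Categorical

def sample_action(logits_net, obs):
    # obs: 1-D float tensor for a single state.
    dist = Categorical(logits=logits_net(obs))   # pi_theta(. | s)
    action = dist.sample()                       # stochastic action -> exploration
    return action.item(), dist.entropy().item()  # entropy shrinks as the policy exploits
```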

Pseudocode

[Figure: Vanilla Policy Gradient pseudocode]
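The figure shows the standard VPG loop: for each iteration, collect trajectories with the current policy, compute rewards-to-go and advantage estimates (GAE-Lambda in Spinning Up), take one policy gradient ascent step, and refit the value function by regression on mean-squared error. A compact sketch of that loop, reusing the snippets above; `collect_trajectories`, `value_net`, and the optimizers are assumed helper names, not Spinning Up internals:

```python
for epoch in range(epochs):
    # 1) Run the current policy to collect a batch of trajectories,
    #    returning observations, actions, rewards-to-go, and advantages.
    obs, act, rew_to_go, adv = collect_trajectories(
        env, logits_net, value_net, steps_per_epoch, gamma, lam)

    # 2) One step of policy gradient ascent on the collected batch.
    vpg_update(logits_net, pi_optimizer, obs, act, adv)

    # 3) Refit the value function by regression on rewards-to-go (MSE).
    for _ in range(train_v_iters):
        v_optimizer.zero_grad()
        v_loss = ((value_net(obs).squeeze(-1) - rew_to_go) ** 2).mean()
        v_loss.backward()
        v_optimizer.step()
```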

Documentation

spinup.vpg(env_fn, actor_critic=<function mlp_actor_critic>, ac_kwargs={}, seed=0, steps_per_epoch=4000, epochs=50, gamma=0.99, pi_lr=0.0003, vf_lr=0.001, train_v_iters=80, lam=0.97, max_ep_len=1000, logger_kwargs={}, save_freq=10)
Parameters (a usage example follows this list):

  • env_fn – A function which creates a copy of the environment. The environment must satisfy the OpenAI Gym API.
  • actor_critic – A function which takes in placeholder symbols for state, x_ph, and action, a_ph, and returns the main outputs from the agent’s Tensorflow computation graph: pi (samples actions from the policy given states), logp (log probability, according to the policy, of taking actions a_ph in states x_ph), logp_pi (log probability, according to the policy, of the actions sampled by pi), and v (the value estimate for states in x_ph).
  • ac_kwargs (dict) – Any kwargs appropriate for the actor_critic function you provided to VPG.
  • seed (int) – Seed for random number generators.
  • steps_per_epoch (int) – Number of steps of interaction (state-action pairs) for the agent and the environment in each epoch.
  • epochs (int) – Number of epochs of interaction (equivalent to number of policy updates) to perform.
  • gamma (float) – Discount factor. (Always between 0 and 1.)
  • pi_lr (float) – Learning rate for policy optimizer.
  • vf_lr (float) – Learning rate for value function optimizer.
  • train_v_iters (int) – Number of gradient descent steps to take on value function per epoch.
  • lam (float) – Lambda for GAE-Lambda. (Always between 0 and 1, close to 1.)
  • max_ep_len (int) – Maximum length of trajectory / episode / rollout.
  • logger_kwargs (dict) – Keyword args for EpochLogger.
  • save_freq (int) – How often (in terms of gap between epochs) to save the current policy and value function.
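With the parameters above, a typical call looks roughly like the following; the environment, output directory, and experiment name are placeholders for illustration, not taken from the original post.

```python
import gym
import tensorflow as tf
from spinup import vpg

env_fn = lambda: gym.make("CartPole-v0")                        # any Gym-API environment
ac_kwargs = dict(hidden_sizes=(64, 64), activation=tf.tanh)     # forwarded to actor_critic
logger_kwargs = dict(output_dir="/tmp/vpg_cartpole", exp_name="vpg_cartpole")

vpg(env_fn=env_fn, ac_kwargs=ac_kwargs, seed=0,
    steps_per_epoch=4000, epochs=50, gamma=0.99,
    pi_lr=3e-4, vf_lr=1e-3, train_v_iters=80, lam=0.97,
    logger_kwargs=logger_kwargs, save_freq=10)
```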

References

Policy Gradient Methods for Reinforcement Learning with Function Approximation, Sutton et al. 2000

Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs, Schulman 2016(a)

Benchmarking Deep Reinforcement Learning for Continuous Control, Duan et al. 2016

High Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2016(b)
