[David Silver's Reinforcement Learning Course] - 7: Policy Gradient

This post covers the basic ideas and advantages of policy gradient methods in reinforcement learning: how they handle continuous action spaces, how they allow stochastic policies, and how they avoid computing a complicated value function. It introduces the finite difference policy gradient method, compares different choices of objective function, and discusses the concrete softmax and Gaussian policy parameterizations.

1. Introduction

The control methods covered so far are all value-based: after the value function is determined, the action is actually chosen by applying some policy on top of that value function (greedy or ε-greedy). So why not control the action directly through a policy function?

The advantages of doing so:

  • More efficient in continuous (or high-dimensional) action spaces;
  • Can represent stochastic policies;
  • In some problems the value function may be hard to compute, while the policy function is comparatively easy.

2. Finite Difference Policy Gradient

First, for a parameterized policy $\pi_\theta$, we need to pick an objective function $J(\theta)$; three choices are given:

  • start value
  • average value
  • average reward per time-step
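
As a reference, these objectives can be written out as follows (a sketch following the lecture's standard definitions, where $s_1$ is a designated start state and $d^{\pi_\theta}$ is the stationary distribution of the Markov chain induced by $\pi_\theta$):

$$J_1(\theta) = V^{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_s^a$$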

Since we want to maximize the objective, the parameters $\theta$ are optimized by gradient ascent.

So how do we compute the policy gradient? One approach is finite differences: perturb the parameters by a small amount along each dimension $k$ and form an approximation to the partial derivative:

$$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$$

where $u_k$ is the unit vector along dimension $k$.
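
A minimal sketch of this estimator, assuming a hypothetical `evaluate_policy(theta)` function that runs a few episodes and returns an estimate of $J(\theta)$:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dJ/dtheta by perturbing each parameter dimension separately.

    evaluate_policy: callable mapping a parameter vector to an estimate of J(theta)
                     (hypothetical; e.g. the average return over a few rollouts).
    """
    grad = np.zeros_like(theta, dtype=float)
    base = evaluate_policy(theta)          # J(theta)
    for k in range(len(theta)):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                       # unit vector along dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - base) / eps
    return grad
```

This needs $n$ evaluations of $J$ for an $n$-dimensional $\theta$, and each evaluation is itself noisy, which is why the lecture treats finite differences as simple but inefficient.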

A better approach uses the likelihood ratio trick. Since $\nabla_\theta \log \pi_\theta(s,a) = \nabla_\theta \pi_\theta(s,a) / \pi_\theta(s,a)$, the policy gradient can be rewritten via

$$\nabla_\theta \pi_\theta(s,a) = \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)$$

where $\nabla_\theta \log \pi_\theta(s,a)$ is called the score function.

Softmax policy: the action probabilities are given by a softmax over a linear combination of features.
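
Concretely, weighting actions by an exponentiated linear combination of state-action features $\phi(s,a)$, the policy and its score function are (a standard result quoted in the lecture):

$$\pi_\theta(s,a) \propto e^{\phi(s,a)^\top \theta}, \qquad \nabla_\theta \log \pi_\theta(s,a) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\left[\phi(s,\cdot)\right]$$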

Gaussian policy: the mean of the distribution is a linear combination of features, $\mu(s) = \phi(s)^\top \theta$, and the action is sampled as $a \sim \mathcal{N}(\mu(s), \sigma^2)$ (useful in continuous action spaces).
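
Its score function follows directly from differentiating the log of the Gaussian density:

$$\nabla_\theta \log \pi_\theta(s,a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}$$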

For any differentiable policy $\pi_\theta$, the policy gradient theorem gives its gradient in an MDP as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\right]$$

Finally, taking this gradient formula as the basis, the lecture presents the Monte-Carlo Policy Gradient (REINFORCE) algorithm, which uses the sampled return from each episode as an unbiased estimate of $Q^{\pi_\theta}(s,a)$:
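
Below is a minimal sketch of the REINFORCE update for a linear softmax policy over discrete actions; the episode format (a list of (state-features, action, reward) tuples) and the step size `alpha` are assumptions made here for illustration:

```python
import numpy as np

def softmax_probs(theta, phi):
    """phi: (n_actions, n_features) feature matrix for one state."""
    logits = phi @ theta
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One Monte-Carlo policy-gradient (REINFORCE) update over a full episode.

    episode: list of (phi, action, reward), where phi is the
             (n_actions, n_features) feature matrix of the visited state.
    """
    # Compute the returns G_t by sweeping backwards through the episode.
    G = 0.0
    returns = []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    for (phi, a, _), G_t in zip(episode, returns):
        p = softmax_probs(theta, phi)
        # Score function of a softmax policy: phi(s,a) - E_pi[phi(s,.)]
        score = phi[a] - p @ phi
        theta = theta + alpha * score * G_t    # theta += alpha * grad log pi * G_t
    return theta
```

Using the full return $G_t$ as the estimate of $Q^{\pi_\theta}(s_t,a_t)$ keeps the gradient estimate unbiased but makes it high-variance, which is exactly what motivates the next section.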

3. Actor-Critic

The Monte-Carlo policy gradient method from the previous section suffers from high variance, so Actor-Critic methods are introduced to address this.

The actor, with parameters $\theta$, learns and executes the policy in the environment; it is updated through a policy-gradient step on $\theta$.

The critic, with parameters $w$, estimates the action-value function $Q$; it is updated through a policy-evaluation step on $w$.
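
For example, with a linear critic $Q_w(s,a) = \phi(s,a)^\top w$, the simple action-value Actor-Critic (QAC) from the lecture performs roughly the following updates at every step (sketched here, with $\alpha$ and $\beta$ as the actor and critic step sizes):

$$\delta = r + \gamma\, Q_w(s', a') - Q_w(s, a)$$
$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_w(s,a), \qquad w \leftarrow w + \beta\, \delta\, \phi(s,a)$$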

Tricks to further reduce variance:

  • Subtract a baseline function: in the policy-gradient step, replace the $Q$ function with the advantage function $A = Q - V$, where the baseline is the state-value function $V$. (This is the same idea as in Dueling networks.)

To estimate the advantage function we can use the TD error, since the two are equivalent in expectation (see the derivation below).
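
A short derivation of this equivalence: defining the TD error with respect to the true value function,

$$\delta^{\pi_\theta} = r + \gamma\, V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$$
$$\mathbb{E}_{\pi_\theta}\left[\delta^{\pi_\theta} \mid s, a\right] = \mathbb{E}\left[r + \gamma\, V^{\pi_\theta}(s') \mid s, a\right] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s,a)$$

so the TD error is an unbiased sample of the advantage; in practice an approximate TD error $\delta = r + \gamma V_v(s') - V_v(s)$ computed from a learned critic $V_v$ is used, which only requires one set of critic parameters.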


Original post: http://cairohy.github.io/2017/09/06/deeplearning/%E3%80%8ADavid%20Silver%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0%E5%85%AC%E5%BC%80%E8%AF%BE%E3%80%8B-7%EF%BC%9APolicy%20Gradient/
