Policy Learning (Policy-Based Reinforcement Learning)

The policy function $\pi(a \mid s)$ is a probability density function. Its input is a state $s$, and its output is a probability distribution that gives the probability of each action being taken next. The agent draws a random sample from this distribution; for example, if the probability of "up" is 0.7, the sampled action is likely (but not guaranteed) to be "up".

Policy network: a policy network $\pi(a \mid s; \theta)$ is used to approximate the policy function $\pi(a \mid s)$. Example: the input is the current state (e.g., a screen image), which passes through several convolutional layers to produce a feature vector; a fully connected layer then maps the feature vector to a 3-dimensional vector (because the game has three actions); finally, a softmax activation (which makes all outputs positive and sum to 1) turns that vector into a probability distribution, i.e., the probability of each action.
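Below is a minimal sketch of such a policy network in PyTorch. The input shape (a stack of 4 frames of 84x84 pixels), the layer sizes, and the class name PolicyNetwork are assumptions made for illustration; only the overall structure (convolutional layers, a fully connected layer, softmax over 3 actions) follows the example above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Approximates pi(a | s; theta): maps a state image to action probabilities."""
    def __init__(self, num_actions: int = 3):
        super().__init__()
        # Convolutional layers extract a feature vector from the input image.
        self.conv1 = nn.Conv2d(in_channels=4, out_channels=16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        # Fully connected layer maps the flattened features to one score per action.
        self.fc = nn.Linear(32 * 9 * 9, num_actions)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(state))
        x = F.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        logits = self.fc(x)
        # softmax makes the outputs positive and sum to 1 -> a probability distribution.
        return F.softmax(logits, dim=-1)

# Usage: sample an action from pi(. | s; theta).
policy = PolicyNetwork(num_actions=3)
state = torch.randn(1, 4, 84, 84)                 # a dummy 84x84 stacked-frame state
probs = policy(state)                             # shape (1, 3): one probability per action
action = torch.multinomial(probs, num_samples=1)  # random sampling from the distribution
```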

Action-value function: $Q_\pi(s_t, a_t) = \mathbb{E}[U_t \mid S_t = s_t, A_t = a_t]$ is the conditional expectation of the return $U_t$. Taking this expectation integrates out all the states and actions after time $t$, so $Q_\pi$ depends only on the current state $s_t$ and action $a_t$. It also depends on the policy function $\pi$: different choices of $\pi$ give different $Q_\pi$. It measures how good it is to take action $a_t$ when in state $s_t$.

State-value function: $V_\pi(s_t) = \mathbb{E}_A[Q_\pi(s_t, A)]$ is the expectation of $Q_\pi(s_t, A)$, with the action $A \sim \pi(\cdot \mid s_t)$ integrated out, so the state-value function depends only on the current state $s_t$ and the policy function $\pi$. Given a policy $\pi$, it evaluates how good the current state is: the larger $V_\pi(s_t)$, the better the state and the higher the chance of winning. Given a state $s$, it evaluates how good the policy $\pi$ is: the larger $V_\pi(s)$, the better the policy.

When $A$ is a discrete variable, the expectation can be written as a sum, $V_\pi(s_t) = \sum_a \pi(a \mid s_t)\, Q_\pi(s_t, a)$; when $A$ is a continuous variable, the summation is replaced by an integral.
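A toy illustration of the discrete case; the probabilities and Q-values below are made-up numbers, not taken from any real game:

```python
import numpy as np

# Hypothetical numbers for one state s_t with three actions.
pi = np.array([0.7, 0.2, 0.1])   # pi(a | s_t): probability of each action
q  = np.array([1.5, 0.3, -0.4])  # Q_pi(s_t, a): value of each action

# V_pi(s_t) = sum_a pi(a | s_t) * Q_pi(s_t, a)
v = np.dot(pi, q)
print(v)  # approximately 1.07
```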

The main idea of policy learning:

As above, we approximate the policy function with the policy network $\pi(a \mid s; \theta)$, so the state-value function can be approximated as

$V(s; \theta) = \sum_a \pi(a \mid s; \theta)\, Q_\pi(s, a).$

Given a state $s$, $V(s; \theta)$ predicts how good the state is (the better the policy, the larger it is), and we can change $\theta$ to make $V(s; \theta)$ larger. Based on this, the objective function is defined as $J(\theta) = \mathbb{E}_S[V(S; \theta)]$, i.e., the expectation of $V(S; \theta)$. The expectation is taken over the state $S$, so $J$ depends only on $\theta$: the better the policy network, the larger $J(\theta)$.

So the goal of policy-based reinforcement learning is to change $\theta$ so that $J(\theta)$ becomes as large as possible. How do we change it? With the policy gradient algorithm: at each step the agent observes a (different) state $s$, which is a random sample from the state distribution. We differentiate $V(s; \theta)$ with respect to $\theta$ at this observed $s$, and then update $\theta$ by gradient ascent, $\theta \leftarrow \theta + \beta \cdot \frac{\partial V(s; \theta)}{\partial \theta}$, because we want $J(\theta)$ to keep increasing.
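A sketch of one such update step, continuing from the hypothetical PolicyNetwork above; the learning rate and the Q-values are placeholders, since estimating $Q_\pi$ is exactly the open problem discussed later:

```python
import torch

# Continuing from the PolicyNetwork sketch above.
beta = 1e-3                                # learning rate for gradient ascent
state = torch.randn(1, 4, 84, 84)          # one observed state s
q_values = torch.tensor([1.5, 0.3, -0.4])  # hypothetical Q_pi(s, a) for the 3 actions

probs = policy(state).squeeze(0)           # pi(. | s; theta), shape (3,)
v = torch.dot(probs, q_values)             # V(s; theta) = sum_a pi(a|s;theta) * Q_pi(s,a)

policy.zero_grad()
v.backward()                               # computes dV(s; theta)/d theta for every parameter
with torch.no_grad():
    for param in policy.parameters():
        param += beta * param.grad         # gradient ASCENT: theta <- theta + beta * gradient
```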

Derivation of the policy gradient $\frac{\partial V(s; \theta)}{\partial \theta}$:

A simplifying assumption is made here: that $Q_\pi(s, a)$ does not depend on $\theta$. This is not quite accurate, because $Q_\pi$ depends on the policy function $\pi$, which in turn depends on $\theta$; the assumption is made only to simplify the derivation.

This gives two (equivalent) forms of the policy gradient:

Form 1: $\frac{\partial V(s; \theta)}{\partial \theta} = \sum_a \frac{\partial \pi(a \mid s; \theta)}{\partial \theta}\, Q_\pi(s, a)$

Form 2: $\frac{\partial V(s; \theta)}{\partial \theta} = \mathbb{E}_{A \sim \pi(\cdot \mid s; \theta)}\!\left[\frac{\partial \log \pi(A \mid s; \theta)}{\partial \theta}\, Q_\pi(s, A)\right]$
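To spell out why the two forms are equivalent (a standard identity, written out here because the notes skip it): since $\frac{\partial \log \pi}{\partial \theta} = \frac{1}{\pi} \cdot \frac{\partial \pi}{\partial \theta}$, we have $\frac{\partial \pi}{\partial \theta} = \pi \cdot \frac{\partial \log \pi}{\partial \theta}$, and therefore

$$
\frac{\partial V(s;\theta)}{\partial \theta}
= \sum_a \frac{\partial \pi(a \mid s;\theta)}{\partial \theta}\, Q_\pi(s,a)
= \sum_a \pi(a \mid s;\theta)\, \frac{\partial \log \pi(a \mid s;\theta)}{\partial \theta}\, Q_\pi(s,a)
= \mathbb{E}_{A \sim \pi(\cdot \mid s;\theta)}\!\left[\frac{\partial \log \pi(A \mid s;\theta)}{\partial \theta}\, Q_\pi(s,A)\right].
$$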

Computation with the first form (discrete actions): for every action $a$, compute $\frac{\partial \pi(a \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, a)$ and sum the results over all actions; this is practical only when the action set is small.

Computation with the second form (continuous actions):

Because the policy is a neural network, the function is extremely complicated and the expectation (an integral over actions) essentially cannot be computed analytically, so the Monte Carlo method is used instead. Monte Carlo here means randomly sampling one action $\hat a$ from the policy $\pi(\cdot \mid s; \theta)$ and then treating $\hat a$ as a fixed value in the following computation: $g(\hat a, \theta) = \frac{\partial \log \pi(\hat a \mid s; \theta)}{\partial \theta} \cdot Q_\pi(s, \hat a)$ is an unbiased estimate of the policy gradient.
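A sketch of this Monte Carlo estimate, again reusing the hypothetical PolicyNetwork from above; `q_hat` is a stand-in for $Q_\pi(s, \hat a)$, whose estimation is the remaining problem addressed below:

```python
import torch

# Continuing from the PolicyNetwork sketch above.
beta = 1e-3
state = torch.randn(1, 4, 84, 84)    # observed state s
probs = policy(state).squeeze(0)     # pi(. | s; theta), shape (3,)

# Monte Carlo: sample one action a_hat from the policy and treat it as fixed.
a_hat = torch.multinomial(probs, num_samples=1).item()

q_hat = 1.2  # hypothetical stand-in for Q_pi(s, a_hat)

# g(a_hat, theta) = Q_pi(s, a_hat) * d log pi(a_hat | s; theta) / d theta
log_prob = torch.log(probs[a_hat])
policy.zero_grad()
(q_hat * log_prob).backward()        # param.grad now holds the unbiased gradient estimate

with torch.no_grad():
    for param in policy.parameters():
        param += beta * param.grad   # gradient ascent step
```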

Summary of the policy gradient algorithm (one iteration at time $t$):

1. Observe the state $s_t$.
2. Sample an action $a_t \sim \pi(\cdot \mid s_t; \theta_t)$ from the policy network.
3. Compute (somehow) an estimate $q_t \approx Q_\pi(s_t, a_t)$.
4. Compute the derivative $d_{\theta,t} = \frac{\partial \log \pi(a_t \mid s_t; \theta)}{\partial \theta}\big|_{\theta = \theta_t}$.
5. Form the approximate policy gradient $g(a_t, \theta_t) = q_t \cdot d_{\theta,t}$.
6. Update the parameters by gradient ascent: $\theta_{t+1} = \theta_t + \beta \cdot g(a_t, \theta_t)$.

The remaining problem in this algorithm: step 3 needs $q_t \approx Q_\pi(s_t, a_t)$, but $Q_\pi$ is unknown, so we need a way to estimate it.

Two ways to solve it:

Method 1: the REINFORCE algorithm

Play from the very beginning all the way to the end of the episode and record the agent's entire trajectory. From the recorded rewards we can then compute the observed return $u_t$. Since $Q_\pi(s_t, a_t)$ is the expectation of $U_t$, the observed return $u_t$ can serve as an approximation of it; REINFORCE simply uses $u_t$ in place of $q_t$.
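A sketch of the REINFORCE update under these assumptions: one complete episode has been recorded, the log-probabilities $\log \pi(a_t \mid s_t; \theta)$ were saved as autograd tensors while playing, and a discount factor `gamma` is used when summing the rewards into $u_t$; the function name and the use of an optimizer are illustrative choices, not prescribed by the notes:

```python
import torch

def reinforce_update(optimizer, rewards, log_probs, gamma=0.99):
    """One REINFORCE update from a single recorded episode.

    rewards[t]   -- observed reward r_t at step t
    log_probs[t] -- log pi(a_t | s_t; theta), saved as autograd tensors while playing
    gamma        -- discount factor (an assumed hyperparameter)
    """
    # Observed returns u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.insert(0, u)

    # Use u_t in place of q_t: ascend sum_t u_t * d log pi(a_t | s_t; theta) / d theta.
    # The optimizer minimizes, so we minimize the negative of that objective.
    loss = -torch.stack([u_t * lp for u_t, lp in zip(returns, log_probs)]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Hypothetical usage after playing one full episode with the policy network:
#   optimizer = torch.optim.SGD(policy.parameters(), lr=1e-3)
#   reinforce_update(optimizer, episode_rewards, episode_log_probs)
```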

Method 2:

Use another neural network to approximate $Q_\pi(s, a)$. This gives two neural networks, and the resulting method is called an actor-critic network; it will be introduced later.

Summary:

We want a policy function $\pi$ that automatically controls the agent: whenever the agent observes a state $s_t$, it uses the policy function to compute a probability distribution over actions and draws a random sample from it to get the action $a_t$.

Computing the policy function directly is difficult, so we approximate it with a neural network, the policy network $\pi(a \mid s; \theta)$. The parameters $\theta$ of the policy network are randomly initialized at the start and then learned with the policy gradient algorithm.

The policy gradient is the derivative of the value function $V$ with respect to $\theta$, and gradient ascent is used to update the parameters $\theta$. The objective function is the expectation of the value function $V$ over the state $S$; it can be understood as the agent's average chance of winning when it plays with policy $\pi$. The better the policy function, the larger this objective, and the greater the agent's chance of winning.
