Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion


This post covers a paper from Google Brain that proposes a technique for using model-based RL to boost the performance of a model-free RL algorithm, named STEVE (stochastic ensemble value expansion). Thanks to the model-based component, the model-free method becomes sample-efficient.

Contributions:
1) A way of combining model-free and model-based methods that avoids the performance degradation which occurs when the learned model is not accurate enough.
2) Because the model-based component is exploited, the algorithm's sample efficiency is improved.

Code: https://github.com/tensorflow/models/tree/master/research/steve

First, let's see what the model in the model-based part consists of:
There are three functions: $\hat T_\xi(s, a)$ is the transition function, which returns the next state; $\hat d_\xi(s)$ is the termination function, which returns a signal indicating whether the episode has ended; and $\hat r_\psi(s, a, s')$ is the reward function, which returns the immediate reward. The model is trained by minimizing the following loss:
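Roughly, as a sketch over collected transitions $(s_t, a_t, r_t, s_{t+1})$ (the exact objective in the paper may weight or separate these terms differently), the loss combines a state-prediction error, a termination cross-entropy, and a reward error:

$$\mathcal{L}(\xi, \psi) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\Big[\, \big\|\hat T_\xi(s_t, a_t) - s_{t+1}\big\|^2 + \mathbb{H}\big(d(s_{t+1}),\, \hat d_\xi(s_{t+1})\big) + \big(\hat r_\psi(s_t, a_t, s_{t+1}) - r_t\big)^2 \,\Big]$$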

In this loss, $\mathbb{H}$ denotes the cross-entropy, and $d(s)$ is the true termination signal from the environment, which is 1 when the episode ends.

So how does the model-free method make use of this model to improve its performance?

We take as our starting point the update rule of the action-value function $Q(s,a)$ in the TD algorithm, in which $\hat Q_\theta(s_t, a_t)$ is regressed toward a target value.

That target, denoted $\mathcal{T}^{TD}$, is exactly what the paper modifies; a sketch of its standard form follows.
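As a sketch (assuming the usual target network $\hat Q^-_\theta$ and a policy $\pi$, and ignoring episode termination), the standard one-step TD target is:

$$\mathcal{T}^{TD} = r_t + \gamma\, \hat Q^-_\theta\big(s_{t+1}, \pi(s_{t+1})\big)$$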

Since the paper's algorithm, STEVE, is built on top of the MVE algorithm, let's first look at the MVE target expression used in the paper; a sketch follows.
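As a simplified sketch (ignoring the termination-signal factors from $\hat d_\xi$ for readability), the $H$-step target rolls the learned model forward from $s_{t+1}$ under the policy $\pi$ and bootstraps with the Q-function at the end:

$$\mathcal{T}_H^{MVE} = r_t + \sum_{i=1}^{H} \gamma^{i}\, \hat r_\psi(\hat s_{i-1}, \hat a_{i-1}, \hat s_i) + \gamma^{H+1}\, \hat Q^-_\theta(\hat s_H, \hat a_H),$$

where $\hat s_0 = s_{t+1}$, $\hat a_i = \pi(\hat s_i)$, and $\hat s_{i+1} = \hat T_\xi(\hat s_i, \hat a_i)$. Note that $\mathcal{T}_0^{MVE}$ reduces to the ordinary TD target.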

Simply replacing the Q-value target $\mathcal{T}^{TD}$ in the TD algorithm with $\mathcal{T}_H^{MVE}$ turns it into the MVE algorithm. (Strictly speaking, this expression differs slightly from the one in the original MVE paper, but it is the form this paper builds on.)

So how is $\mathcal{T}_H^{MVE}$ related to the $\mathcal{T}_H^{STEVE}$ proposed in the paper?
Look at the expression directly:
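In the paper's notation, it is an inverse-variance-weighted combination of the candidate targets at each rollout length (written here from the definitions that follow):

$$\mathcal{T}_H^{STEVE} = \sum_{i=0}^{H} \frac{\tilde w_i}{\sum_j \tilde w_j}\, \mathcal{T}_i^{\mu}, \qquad \tilde w_i^{-1} = \mathcal{T}_i^{\sigma^2}$$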

Here, $\mathcal{T}_i^{\mu}$ and $\mathcal{T}_i^{\sigma^2}$ are the mean and variance of the ensemble of $\mathcal{T}_i^{MVE}$ estimates, i.e., of the candidate MVE targets at each rollout length $i \le H$.

Simply replacing the Q-value target $\mathcal{T}^{TD}$ in the TD algorithm with $\mathcal{T}_H^{STEVE}$ gives the STEVE algorithm proposed in this paper. Simple, isn't it…

But where do the mean $\mathcal{T}_i^{\mu}$ and variance $\mathcal{T}_i^{\sigma^2}$ actually come from?

This is where the model described earlier comes in. In the paper, ensembles of parameters are used to approximate the Q-function, the reward function, and the transition function: $\theta = \{\theta_1, \dots, \theta_L\}$, $\psi = \{\psi_1, \dots, \psi_N\}$, $\xi = \{\xi_1, \dots, \xi_M\}$, i.e., there are $L$ Q-functions, $N$ reward functions, and $M$ transition functions.

Using all of these functions to compute $\mathcal{T}_H^{MVE}$, we obtain $M \times N \times L$ different estimates of $\mathcal{T}_H^{MVE}$, and from these we can compute its mean and variance (a small code sketch after the example below illustrates this step).

For example, take $H=2$, i.e., we compute $\mathcal{T}_2^{MVE}$.
Suppose each ensemble also has two members, i.e., $M = N = L = 2$ (note that the value of $H$ is unrelated to the values of $M$, $N$, and $L$); then we obtain $2 \times 2 \times 2 = 8$ estimates of $\mathcal{T}_2^{MVE}$, as shown in the figure below:

Explanation:
Because $M=2$, there are two transition functions, which produce two different trajectories (blue and light blue);
Because $N=2$, there are two reward functions, so each trajectory can be rewarded in two different ways (red and orange);
Because $L=2$, there are two Q-functions, so each trajectory yields two different Q-values (green and light green).
Combining these gives $8$ estimates of $\mathcal{T}_2^{MVE}$ in total.
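A minimal NumPy sketch of this ensembling step (the function name `steve_target` and the `(H+1, M, N, L)` array layout are my own conventions, and it assumes the $M \times N \times L$ candidate targets for each rollout length have already been computed):

```python
import numpy as np

def steve_target(candidate_targets, eps=1e-8):
    """Combine ensemble candidate targets into a single STEVE target.

    candidate_targets: array of shape (H+1, M, N, L) holding the
    M*N*L rollout estimates of T^MVE_i for each rollout length i = 0..H.
    Returns the inverse-variance-weighted scalar target.
    """
    h_plus_1 = candidate_targets.shape[0]
    flat = candidate_targets.reshape(h_plus_1, -1)  # (H+1, M*N*L)

    mu = flat.mean(axis=1)    # T^mu_i for each rollout length i
    var = flat.var(axis=1)    # T^sigma^2_i for each rollout length i

    w = 1.0 / (var + eps)     # inverse-variance weights
    w = w / w.sum()           # normalize over rollout lengths

    return float((w * mu).sum())

# Toy usage with H=2 and M=N=L=2 (8 candidates per rollout length):
candidates = np.random.randn(3, 2, 2, 2)
print(steve_target(candidates))
```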

The last question: how should we understand the weights $\tilde w_i$ in the expression for $\mathcal{T}_H^{STEVE}$?
Dividing by $\sum_j \tilde w_j$ simply normalizes the weights.
As for $\tilde w_i$ itself, it is chosen to be the inverse of the variance of the $\mathcal{T}_i^{MVE}$ estimates. This weighting scheme is called inverse-variance weighting, which Wikipedia explains as follows:

In statistics, inverse-variance weighting is a method of aggregating two or more random variables to minimize the variance of the weighted average. Each random variable is weighted in inverse proportion to its variance, i.e. proportional to its precision.

In other words, each estimate is weighted according to its precision: if the precision is high (i.e., the variance is small), that estimate contributes more to the target. So when the model is accurate we trust its estimates more, and when it is inaccurate we trust them less. This way the algorithm benefits from the model without its performance being degraded by large model errors.
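Concretely, for independent estimates $x_1, \dots, x_n$ with variances $\sigma_1^2, \dots, \sigma_n^2$, the inverse-variance-weighted average

$$\hat x = \frac{\sum_i x_i / \sigma_i^2}{\sum_i 1 / \sigma_i^2}$$

is the convex combination with the smallest variance, which is exactly the property STEVE exploits when combining its candidate targets.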

Finally, one small point worth noting:
The bias in the learned Q-function is not uniform across states and actions. They find that the bias in the Q-function on states sampled from the replay buffer is lower than when the Q-function is evaluated on states generated from model rollouts. They term this the distribution mismatch problem and propose the TD-k trick as a solution.
