SQDDPG: A Local Reward Approach to Solve Global Reward Games

This post introduces the Shapley Q-value, a new method for the credit assignment problem in global reward games. By extending convex game theory to the extended convex game (ECG), the authors propose the Shapley Q-value as a local reward that reflects each agent's contribution in a cooperative game. The SQDDPG algorithm builds on it to achieve efficient learning in multi-agent deep reinforcement learning. Experiments show that SQDDPG converges better than existing algorithms on tasks such as Cooperative Navigation, Prey-and-Predator, and Traffic Junction.

Shapley Q-value: A Local Reward Approach to Solve Global Reward Games

https://arxiv.org/abs/1907.05707 

Abstract

Cooperative game is a critical research area in multi-agent reinforcement learning (MARL). Global reward game is a subclass of cooperative games, where all agents aim to maximize the global reward. Credit assignment is an important problem studied in the global reward game. Most previous works took the view of the non-cooperative-game theoretical framework with the shared reward approach, i.e., each agent is directly assigned the shared global reward. This, however, may give each agent an inaccurate reward for its contribution to the group, which could cause inefficient learning. To deal with this problem, we i) introduce a cooperative-game theoretical framework called extended convex game (ECG) that is a superset of global reward game, and ii) propose a local reward approach called Shapley Q-value. The Shapley Q-value is able to distribute the global reward, reflecting each agent's own contribution, in contrast to the shared reward approach. Moreover, we derive an MARL algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG), using the Shapley Q-value as the critic for each agent. We evaluate SQDDPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction, compared with state-of-the-art algorithms, e.g., MADDPG, COMA, Independent DDPG and Independent A2C. In the experiments, SQDDPG shows a significant improvement in the convergence rate. Finally, we plot the Shapley Q-values and validate the property of fair credit assignment.

1 Introduction

Cooperative game is a critical research area in multi-agent reinforcement learning (MARL). Many real-life tasks can be modeled as cooperative games, e.g., the coordination of autonomous vehicles (?), autonomous distributed logistics (?) and search-and-rescue robots (??). Global reward game (?) is a subclass of cooperative games where agents aim to maximize the global reward. In this game, credit assignment is an important problem, which aims at finding a method to distribute the global reward. There are two categories of approaches to this problem, namely the shared reward approach (also known as shared reward game or fully cooperative game) (???) and the local reward approach (?). The shared reward approach directly assigns the global reward to all agents. The local reward approach, on the other hand, distributes the global reward according to each agent's contribution, and turns out to have superior performance in many tasks (??).

Whatever approach is adopted, a remaining open question is whether there is an underlying theory that explains credit assignment. Conventionally, a global reward game is built upon non-cooperative game theory, which primarily aims to find a Nash equilibrium as the stable solution (??). This formulation can be extended to a dynamic environment with infinite horizons via the stochastic game (?). However, Nash equilibrium focuses on the individual reward and provides no explicit incentive for cooperation. As a result, the shared reward function has to be applied to force cooperation, which can serve as a possible explanation for the shared reward approach, but not for the local reward approach.

In our work, we introduce and investigate cooperative game theory (or coalitional game theory) (?), in which the local reward approach becomes rationalized. In cooperative game theory, the objective is to divide the agents into coalitions and bind agreements among agents who belong to the same coalition. We focus on the convex game (CG), a typical game in cooperative game theory featuring the existence of a stable coalition structure with an efficient payoff distribution scheme (i.e., a local reward approach) called the core. This payoff distribution is equivalent to credit assignment, so the core rationalizes and explains the local reward approach well (?).

Referring to the previous work (?), we extend the CG to an infinite-horizon scenario, namely the extended CG (ECG). In addition, we show that a global reward game is equivalent to an ECG with the grand coalition and an efficient payoff distribution scheme. Furthermore, we propose the Shapley Q-value, extending the Shapley value (i.e., an efficient payoff distribution scheme) for credit assignment in an ECG with the grand coalition. Therefore, we conclude that the Shapley Q-value is able to work in a global reward game. Finally, we derive an algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG) according to the actor-critic framework (?) to learn decentralized policies with centralized critics (i.e., Shapley Q-values). SQDDPG is evaluated on environments such as Cooperative Navigation, Prey-and-Predator (?), and Traffic Junction (?), compared with state-of-the-art baselines, e.g., MADDPG (?), COMA (?), Independent DDPG (?) and Independent A2C (?).

2 Related Work

Multi-agent Learning

Multi-agent learning refers to a category of methods that tackle games with multiple agents, such as cooperative games. Among these methods, we only focus on using reinforcement learning to deal with cooperative games, which is called multi-agent reinforcement learning (MARL). Remarkable progress has recently been made in MARL. Some researchers (???) focus on distributed executions, which allow communication among agents. Others (?????) consider decentralized executions, where no communication is permitted during execution. Nevertheless, all of them rely on centralized critics, which means information can be shared through the value function during training. In our work, we focus on decentralized execution with centralized critics.

Cooperative Game

As opposed to competing with others, agents in a cooperative game aim to cooperate to solve a joint task or maximize the global payoff (also known as the global reward) (?). ? (?) proposed a non-cooperative game theoretical framework called the stochastic game, which models the dynamics of multiple agents in a zero-sum game with infinite horizons. ? (?) introduced a general-sum stochastic game theoretical framework, which generalises the zero-sum game. To force cooperation under this framework, a potential function (?) was applied such that each agent shares the same objective, namely the global reward game (?). In this paper, we use cooperative game theory, whereas the existing cooperative game frameworks are built under non-cooperative game theory. Our framework gives a new view on the global reward game and explains well why credit assignment is important. We show that the global reward game is a subclass of our framework if we interpret the agents in a global reward game as forming a grand coalition (i.e., the group including all agents). Under our framework, it is more rational to use a local reward approach to distribute the global reward.

Credit Assignment

Credit assignment is a significant problem that has been studied in cooperative games for a long time. There are two sorts of credit assignment approaches, i.e., the shared reward approach and the local reward approach. The shared reward approach directly assigns each agent the global reward (????). We show that this is actually equivalent to distributing the global reward equally to individual agents. The global reward game with this credit assignment scheme is also called the shared reward game (also known as the fully cooperative game) (?). However, ? (?) claimed that the shared reward approach does not give each agent its accurate contribution. Thus, it may not perform well on difficult problems. This motivates the study of the local reward approach, which distributes the global reward to agents according to their contributions. The open question is how to quantify these contributions. To answer this question, ? (?) attempted to use a Kalman filter to infer the contribution of each agent. Recently, ? (?) and ? (?) modelled the marginal contributions, inspired by the reward difference (?). Under our proposed framework (i.e., ECG), we propose a new method called Shapley Q-value to learn a local reward. This method is extended from the Shapley value (?) and is theoretically guaranteed to distribute the global reward fairly. Although the Shapley value can be regarded as the expectation of the marginal contributions, it is different from the previous works (??): it considers all possible orders in which agents join to form a grand coalition, which has not been considered in those works.

3 Preliminaries

3.1 Convex Game

Convex game (CG) is a typical transferable utility game in cooperative game theory. The definitions below follow the textbook (?). A CG is formally represented as $\Gamma = \langle \mathcal{N}, v \rangle$, where $\mathcal{N}$ is the set of agents and $v$ is the value function measuring the profit earned by a coalition (i.e., a group of agents). $\mathcal{N}$ itself is called the grand coalition. The value function $v: 2^{\mathcal{N}} \rightarrow \mathbb{R}$ is a mapping from a coalition $\mathcal{C} \subseteq \mathcal{N}$ to a real number $v(\mathcal{C})$. In a CG, the value function satisfies two properties: 1) $v(\mathcal{C} \cup \mathcal{D}) \geq v(\mathcal{C}) + v(\mathcal{D}), \ \forall \mathcal{C}, \mathcal{D} \subset \mathcal{N}, \ \mathcal{C} \cap \mathcal{D} = \emptyset$; 2) the coalitions are independent. The solution of a CG is a tuple $(\mathcal{CS}, \mathbf{x})$, where $\mathcal{CS} = \{\mathcal{C}_1, \mathcal{C}_2, \dots, \mathcal{C}_m\}$ is a coalition structure and $\mathbf{x} = (x_i)_{i \in \mathcal{N}}$ indicates the payoff (i.e., the local reward) distributed to each agent, which satisfies two conditions: 1) $x_i \geq 0, \ \forall i \in \mathcal{N}$; 2) $x(\mathcal{C}) \leq v(\mathcal{C}), \ \forall \mathcal{C} \in \mathcal{CS}$, where $x(\mathcal{C}) = \sum_{i \in \mathcal{C}} x_i$. $\mathcal{CS}^{\mathcal{N}}$ denotes the set of all possible coalition structures. The core is a stable solution set of a CG, which can be defined mathematically as $\mathrm{core}(\Gamma) = \{(\mathcal{CS}, \mathbf{x}) \mid x(\mathcal{C}) \geq v(\mathcal{C}), \ \forall \mathcal{C} \subseteq \mathcal{N}\}$. The core of a CG ensures a reasonable payoff distribution and inspires our work on credit assignment in MARL.
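To make these definitions concrete, here is a minimal sketch in Python over a hypothetical three-agent value function (the toy game, the payoff vector and all names are made-up examples, not from the paper). It checks property (1) over disjoint coalitions and tests whether a candidate payoff vector satisfies the core condition $x(\mathcal{C}) \geq v(\mathcal{C})$ for every coalition.

```python
from itertools import chain, combinations

# Hypothetical 3-agent value function v: 2^N -> R, stored as a dict keyed by frozensets.
N = frozenset({0, 1, 2})
v = {
    frozenset(): 0.0,
    frozenset({0}): 1.0, frozenset({1}): 1.0, frozenset({2}): 1.0,
    frozenset({0, 1}): 3.0, frozenset({0, 2}): 3.0, frozenset({1, 2}): 3.0,
    frozenset({0, 1, 2}): 6.0,  # grand coalition N
}

def coalitions(agents):
    """Enumerate all coalitions C contained in the agent set (the power set)."""
    s = list(agents)
    return (frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1)))

def satisfies_property_1(v, agents):
    """Property (1): v(C union D) >= v(C) + v(D) for all disjoint C, D."""
    return all(v[C | D] >= v[C] + v[D]
               for C in coalitions(agents) for D in coalitions(agents)
               if C.isdisjoint(D))

def in_core(x, v, agents):
    """Core condition: x(C) >= v(C) for every coalition C."""
    return all(sum(x[i] for i in C) >= v[C] for C in coalitions(agents))

x = {0: 2.0, 1: 2.0, 2: 2.0}       # candidate payoff vector with x(N) = v(N) = 6
print(satisfies_property_1(v, N))  # True for this toy game
print(in_core(x, v, N))            # True: the equal split lies in the core here
```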

3.2 Shapley Value

Shapley value (?) is one of the most popular methods to solve the payoff distribution problem for a grand coalition (???). Given a cooperative game $\Gamma = \langle \mathcal{N}, v \rangle$, for any $\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}$, let $\delta_i(\mathcal{C}) = v(\mathcal{C} \cup \{i\}) - v(\mathcal{C})$ be a marginal contribution; then the Shapley value of each agent $i$ can be written as:

$$\mathrm{Sh}_i(\Gamma) = \sum_{\mathcal{C} \subseteq \mathcal{N} \setminus \{i\}} \frac{|\mathcal{C}|! \, (|\mathcal{N}| - |\mathcal{C}| - 1)!}{|\mathcal{N}|!} \cdot \delta_i(\mathcal{C}). \qquad (1)$$

Literally, the Shapley value takes the average of the marginal contributions over all possible coalitions, so that it satisfies: 1) efficiency: $x(\mathcal{N}) = v(\mathcal{N})$; 2) dummy player: if an agent $i$ has no contribution, then $x_i = 0$; and 3) symmetry: if the $i$-th and $j$-th agents have the same contribution, then $x_i = x_j$ (?). Together, these properties constitute fairness. As we can see from Eq. 1, to calculate the Shapley value of an agent, we have to consider all $2^{|\mathcal{N}|-1}$ possible coalitions that the agent could join to form the grand coalition, which is computationally prohibitive. To mitigate this issue, we propose an approximation for scenarios with infinite horizons, called the approximate Shapley Q-value, which is introduced in the next section.
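As an illustration of Eq. 1 (and of why the exact computation scales poorly), the sketch below enumerates all $2^{|\mathcal{N}|-1}$ coalitions for the same hypothetical three-agent toy game used in the Section 3.1 sketch; this brute-force enumeration is only feasible for small $|\mathcal{N}|$, which is exactly the issue the approximate Shapley Q-value addresses.

```python
from itertools import chain, combinations
from math import factorial

# The same hypothetical 3-agent toy game as in the Section 3.1 sketch.
N = frozenset({0, 1, 2})
v = {frozenset(): 0.0,
     frozenset({0}): 1.0, frozenset({1}): 1.0, frozenset({2}): 1.0,
     frozenset({0, 1}): 3.0, frozenset({0, 2}): 3.0, frozenset({1, 2}): 3.0,
     N: 6.0}

def shapley_value(i, v, agents):
    """Exact Shapley value of agent i (Eq. 1): a weighted sum of the marginal
    contributions delta_i(C) = v(C union {i}) - v(C) over all C excluding i."""
    others = list(agents - {i})
    n = len(agents)
    total = 0.0
    for C in chain.from_iterable(combinations(others, r) for r in range(len(others) + 1)):
        C = frozenset(C)
        weight = factorial(len(C)) * factorial(n - len(C) - 1) / factorial(n)
        total += weight * (v[C | {i}] - v[C])
    return total

# Symmetry and efficiency: each agent gets v(N) / 3 = 2.0, and the values sum to v(N).
print([shapley_value(i, v, N) for i in sorted(N)])  # [2.0, 2.0, 2.0]
```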

3.3 Multi-agent Actor-Critic

Different from value-based methods, e.g., Q-learning (?), policy gradient (?) directly learns the policy by maximizing $J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}[r(s, a)]$, where $r(s, a)$ is the reward of an arbitrary state-action pair. Since the gradient of $J(\theta)$ w.r.t. $\theta$ cannot be calculated directly, the policy gradient theorem (?) is used to approximate it such that $\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}, a \sim \pi_{\theta}}[Q^{\pi}(s, a) \nabla_{\theta} \log \pi_{\theta}(a|s)]$. In the actor-critic framework (?) (derived from the policy gradient theorem), $\pi_{\theta}(a|s)$ is called the actor and $Q^{\pi}(s, a)$ is called the critic, where $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[\sum_{t=1}^{\infty} \gamma^{t-1} r(s_t, a_t) \mid s_1 = s, a_1 = a]$. Extending to multi-agent scenarios, the gradient of each agent $i$ can be represented as $\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim \rho^{\boldsymbol{\pi}}, \mathbf{a} \sim \boldsymbol{\pi}}[Q_i^{\boldsymbol{\pi}}(s, a_i) \nabla_{\theta_i} \log \pi_{\theta_i}(a_i|s)]$, where $Q_i^{\boldsymbol{\pi}}(s, a_i)$ can be regarded as an estimate of the contribution of agent $i$. If a deterministic policy (?) needs to be learned in MARL problems, we can reformulate the approximate gradient of each agent as $\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s \sim \rho^{\boldsymbol{\mu}}}[\nabla_{\theta_i} \mu_{\theta_i}(s) \, \nabla_{a_i} Q_i^{\boldsymbol{\mu}}(s, a_i)|_{a_i = \mu_{\theta_i}(s)}]$. In this work, we apply this formulation to learn a deterministic policy for each agent.
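To ground the deterministic-policy formulation, here is a minimal PyTorch sketch of how the per-agent gradient $\nabla_{\theta_i} J(\theta_i) = \mathbb{E}_{s}[\nabla_{\theta_i} \mu_{\theta_i}(s) \, \nabla_{a_i} Q_i(s, a_i)|_{a_i = \mu_{\theta_i}(s)}]$ is typically realized in practice: feed the actor's action into the critic and ascend the critic's output, letting autograd apply the chain rule above. The network sizes, batch and module names are hypothetical, and only the actor update is shown (the critic would be trained separately, e.g., by TD learning); this is a generic sketch, not the paper's SQDDPG implementation.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2  # hypothetical sizes

# Per-agent actor mu_theta_i(s) and critic Q_i(s, a_i).
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(64, state_dim)  # a batch of states (random stand-in for s ~ rho^mu)
actions = actor(states)              # a_i = mu_theta_i(s), kept in the autograd graph
q_values = critic(torch.cat([states, actions], dim=-1))

# Maximizing Q_i w.r.t. the actor parameters implements
# grad_theta_i mu(s) * grad_{a_i} Q_i(s, a_i)|_{a_i = mu(s)} via backpropagation.
actor_loss = -q_values.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```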

4 Our Work

In this section, we (i) extend the convex game (CG) to infinite horizons and decisions, namely the extended convex game (ECG), and show that a global reward game is equivalent to an ECG with the grand coalition and an efficient distribution scheme; (ii) show that the shared reward approach is an efficient distribution scheme in an ECG with the grand coalition; (iii) propose the Shapley Q-value, extending and approximating the Shapley value to distribute the credits in a global reward game, because it can accelerate the convergence rate compared with the shared reward approach; and (iv) derive an MARL algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG), using the Shapley Q-value as each agent's critic.

4.1 Extended Convex Game

Referring to the previous work (??), we extend the CG to scenarios with infinite horizons and decisions, named the extended CG (ECG). The set of joint actions of the agents is defined as $\mathcal{A} = \times_{i \in \mathcal{N}} \mathcal{A}_i$, where $\mathcal{A}_i$ is the feasible action set of each agent $i$. $\mathcal{S}$ is the set of possible states of the environment. The dynamics of the environment are defined as $Pr(s'|s, \mathbf{a})$, where $s, s' \in \mathcal{S}$ and $\mathbf{a} \in \mathcal{A}$. Inspired by ? (?), we construct the ECG in two stages. In stage 1, an oracle arranges the coalition structure and contracts the cooperation agreements, i.e., the credit assigned to an agent for its optimal long-term contribution if it joins some coalition. We assume that this oracle can observe the whole environment and is familiar with each agent's features. In stage 2, after joining the allocated coalition, each agent further makes decisions through $\pi_i(a_i|s)$ to maximize the social value of its coalition, so that the optimal social value of each coalition and the individual credit assignment can be achieved, where $a_i \in \mathcal{A}_i$ and $s \in \mathcal{S}$. Mathematically, the optimal value of a coalition $\mathcal{C} \in \mathcal{CS}$ can be written as $\max_{\boldsymbol{\pi}_{\mathcal{C}}} v^{\boldsymbol{\pi}_{\mathcal{C}}}(\mathcal{C}) = \mathbb{E}_{\boldsymbol{\pi}_{\mathcal{C}}}[\sum_{t=1}^{\infty} \gamma^{t-1} r_t(\mathcal{C})]$, with $\boldsymbol{\pi}_{\mathcal{C}} = \times_{i \in \mathcal{C}} \pi_i$, where $r_t(\mathcal{C})$ is the reward gained by coalition $\mathcal{C}$ at each time step. According to property (1) of the CG mentioned above, $\max_{\boldsymbol{\pi}_{\mathcal{C} \cup \mathcal{D}}} v^{\boldsymbol{\pi}_{\mathcal{C} \cup \mathcal{D}}}(\mathcal{C} \cup \mathcal{D}) \geq \max_{\boldsymbol{\pi}_{\mathcal{C}}} v^{\boldsymbol{\pi}_{\mathcal{C}}}(\mathcal{C}) + \max_{\boldsymbol{\pi}_{\mathcal{D}}} v^{\boldsymbol{\pi}_{\mathcal{D}}}(\mathcal{D}), \ \forall \mathcal{C}, \mathcal{D} \subset \mathcal{N}, \ \mathcal{C} \cap \mathcal{D} = \emptyset$ holds. In this paper, we denote the joint policy of all agents as $\boldsymbol{\pi} = \times_{i \in \mathcal{N}} \pi_i(a_i|s)$ and assume that each agent can observe the global state.
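As a rough illustration of the coalition value $v^{\boldsymbol{\pi}_{\mathcal{C}}}(\mathcal{C}) = \mathbb{E}_{\boldsymbol{\pi}_{\mathcal{C}}}[\sum_{t=1}^{\infty} \gamma^{t-1} r_t(\mathcal{C})]$, the sketch below estimates it by truncated Monte Carlo rollouts. The environment interface (`env.reset()` / `env.step()` returning per-agent rewards) and the policy callables are hypothetical stand-ins, not the paper's implementation.

```python
def coalition_return(env, policies, coalition, gamma=0.99, horizon=200):
    """One truncated rollout of sum_t gamma^(t-1) * r_t(C), where r_t(C) is the
    reward earned by coalition C at step t. `env` and `policies` are hypothetical:
    env.step(actions) is assumed to return (next_state, rewards) with per-agent rewards."""
    state = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):  # truncate the infinite horizon for estimation
        actions = {i: policies[i](state) for i in coalition}
        state, rewards = env.step(actions)
        ret += discount * sum(rewards[i] for i in coalition)  # r_t(C)
        discount *= gamma
    return ret

def estimate_coalition_value(env, policies, coalition, episodes=100):
    """Average over rollouts to approximate the expectation over pi_C."""
    return sum(coalition_return(env, policies, coalition)
               for _ in range(episodes)) / episodes
```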
