Multi-Agent Reinforcement Learning: QPLEX

QPLEX is a multi-agent reinforcement learning algorithm. It uses a duplex dueling network structure that factorizes the joint action-value function into per-agent value and advantage terms while keeping the factorization consistent with the Individual-Global-Max (IGM) principle, so learning stays effective without sacrificing representational power. A Transformation network injects global state information into each agent's value estimates to support better coordination. In StarCraft II experiments, QPLEX outperforms competing algorithms.



Paper: QPLEX: Duplex Dueling Multi-Agent Q-Learning
Video: Experiments on StarCraft II
Recommended background: QMIX (see the earlier post 多智能体强化学习—QMIX)

1 Introduction

IGM (Individual-Global-Max):

$$
\underset{\mathbf{u}}{\operatorname{argmax}}\, Q_{tot}(\boldsymbol{\tau}, \mathbf{u})=
\begin{pmatrix}
\operatorname{argmax}_{u^{1}} Q_{1}\left(\tau^{1}, u^{1}\right) \\
\vdots \\
\operatorname{argmax}_{u^{n}} Q_{n}\left(\tau^{n}, u^{n}\right)
\end{pmatrix}
$$

where $Q_{tot}$ denotes the joint Q-function and $Q_i$ denotes the action-value function of agent $i$.
IGM states that $\operatorname{argmax}(Q_{tot})$ and the per-agent $\operatorname{argmax}(Q_i)$ give the same result: each agent greedily maximizing its own $Q_i$, without any additional coordination constraints, already yields the jointly optimal action.
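For intuition, a toy two-agent example with hypothetical numbers: when $Q_{tot}$ factorizes additively (as in VDN), the joint greedy action is exactly the pair of individual greedy actions, so IGM holds:

import numpy as np

# Toy 2-agent example (hypothetical numbers) with an additively factored Q_tot = Q_1 + Q_2
Q1 = np.array([1.0, 3.0])                 # Q_1(tau^1, u^1) for u^1 in {0, 1}
Q2 = np.array([2.0, 0.5])                 # Q_2(tau^2, u^2) for u^2 in {0, 1}
Q_tot = Q1[:, None] + Q2[None, :]         # Q_tot(tau, (u^1, u^2))

joint_greedy = np.unravel_index(Q_tot.argmax(), Q_tot.shape)
print(joint_greedy, (Q1.argmax(), Q2.argmax()))   # both are (1, 0): IGM holds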
Motivation for QPLEX: QMIX and VDN each impose a sufficient condition for IGM (below) in order to factorize the joint action-value function. Both factorizations are structural constraints that restrict the class of joint action-value functions they can represent.

$$
Q_{tot}^{\mathrm{VDN}}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{i=1}^{n} Q_{i}\left(\tau_{i}, a_{i}\right)
\quad\text{and}\quad
\forall i \in \mathcal{N},\ \frac{\partial Q_{tot}^{\mathrm{QMIX}}(\boldsymbol{\tau}, \boldsymbol{a})}{\partial Q_{i}\left(\tau_{i}, a_{i}\right)}>0
$$
To address this limitation, the paper proposes duplex dueling multi-agent Q-learning (QPLEX), which uses a duplex dueling network to factorize the joint action-value function into per-agent action-value functions. QPLEX applies the dueling structure $Q = V + A$ to both the joint and the individual action-value functions and re-formulates the IGM principle as an advantage-based IGM. This reformulation turns IGM consistency into a constraint on the value range of the advantage functions, which makes it straightforward to learn action-value functions with a linear factorization structure. The duplex dueling architecture encodes the IGM constraint directly into the network structure, so IGM consistency is guaranteed by construction.
QPLEX's main idea: apply the dueling decomposition $Q = V + A$ to both the joint value $Q_{tot}$ and each agent's $Q_i$. This converts IGM consistency into an easily enforced constraint on the range of the advantage functions, simplifying the learning of value functions with a linear factorization structure. The decomposition also makes the Q-value more interpretable: $Q$ equals the value $V$ of the current state plus the advantage $A$ of the chosen action, so one can tell whether a high Q-value comes from being in a good state or from choosing a good action.
The decomposition is as follows:

  • Dueling decomposition of the joint action-value function:

$$
\text{(Joint Dueling)}\quad Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})
\quad\text{and}\quad
V_{tot}(\boldsymbol{\tau})=\max_{\boldsymbol{a}^{\prime}} Q_{tot}\left(\boldsymbol{\tau}, \boldsymbol{a}^{\prime}\right)
$$

where $Q_{tot}: \mathcal{T} \times \mathcal{A} \mapsto \mathbb{R}$.

  • Dueling decomposition of the individual action-value functions:

$$
\text{(Individual Dueling)}\quad Q_{i}\left(\tau_{i}, a_{i}\right)=V_{i}\left(\tau_{i}\right)+A_{i}\left(\tau_{i}, a_{i}\right)
\quad\text{and}\quad
V_{i}\left(\tau_{i}\right)=\max_{a_{i}^{\prime}} Q_{i}\left(\tau_{i}, a_{i}^{\prime}\right)
$$

where $\left[Q_{i}: \mathcal{T} \times \mathcal{A} \mapsto \mathbb{R}\right]_{i=1}^{n}$, for all $\boldsymbol{\tau} \in \mathcal{T},\ \boldsymbol{a} \in \mathcal{A},\ i \in \mathcal{N}$.

  • Constraint (advantage-based IGM):

$$
\underset{\boldsymbol{a} \in \mathcal{A}}{\arg\max}\, A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=\left(\underset{a_{1} \in \mathcal{A}}{\arg\max}\, A_{1}\left(\tau_{1}, a_{1}\right), \ldots, \underset{a_{n} \in \mathcal{A}}{\arg\max}\, A_{n}\left(\tau_{n}, a_{n}\right)\right)
$$

For the advantage function,
$$
A^{\pi}(s, a)=Q^{\pi}(s, a)-V^{\pi}(s)
$$
When the optimal action is taken, $Q^{\pi}(s, a)=V^{\pi}(s)$, so for the joint and individual advantage functions:
$$
A_{tot}\left(\boldsymbol{\tau}, \boldsymbol{a}^{*}\right)=A_{i}\left(\tau_{i}, a_{i}^{*}\right)=0
\quad\text{and}\quad
A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})<0,\ A_{i}\left(\tau_{i}, a_{i}\right) \leq 0
$$
where $\mathcal{A}^{*}(\boldsymbol{\tau})=\left\{\boldsymbol{a} \mid \boldsymbol{a} \in \mathcal{A},\ Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})\right\}$, for all $\boldsymbol{\tau} \in \mathcal{T},\ \boldsymbol{a}^{*} \in \mathcal{A}^{*}(\boldsymbol{\tau}),\ \boldsymbol{a} \in \mathcal{A} \backslash \mathcal{A}^{*}(\boldsymbol{\tau}),\ i \in \mathcal{N}$.
Because $V$ depends only on the state (history) and not on the action, the action-dependence of $Q$ comes entirely from $A$; the constraint above on the advantage values is therefore equivalent to the IGM condition.
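A quick numerical sanity check (hypothetical numbers, not the paper's implementation) illustrates why this is enough: if every $A_i$ is non-positive and reaches zero at that agent's greedy action, then any positively weighted combination $A_{tot}=\sum_i \lambda_i A_i$ with $\lambda_i>0$ attains its maximum exactly at the tuple of individual greedy actions:

import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 2, 3

# Individual advantages: A_i <= 0, with A_i(a_i*) = 0 at each agent's greedy action
A = -rng.uniform(0.1, 1.0, size=(n_agents, n_actions))
best = rng.integers(n_actions, size=n_agents)
for i in range(n_agents):
    A[i, best[i]] = 0.0

# Positive mixing weights lambda_i(tau, a) > 0, one per agent and per joint action (a_1, a_2)
lam = rng.uniform(0.5, 2.0, size=(n_agents, n_actions, n_actions))

# A_tot(a_1, a_2) = lambda_1(a_1, a_2) * A_1(a_1) + lambda_2(a_1, a_2) * A_2(a_2)
A_tot = lam[0] * A[0][:, None] + lam[1] * A[1][None, :]

joint_greedy = np.unravel_index(A_tot.argmax(), A_tot.shape)
print(joint_greedy, tuple(best))   # identical: the joint argmax factorizes, i.e. IGM holds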

2 QPLEX Algorithm Framework

[Figure: QPLEX duplex dueling architecture]

The framework has three main parts:

  • (a) the Dueling Mixing network
  • (b) the overall Duplex Dueling architecture
  • (c) the agent network together with the Transformation network

Each part is analyzed in detail below.

2.1 Agent network (same as the agent network in QMIX)

Input: agent $i$'s observation $o_i^t$ at time $t$ and its action $a_i^{t-1}$ at time $t-1$.
Output: agent $i$'s action-value function $Q_{i}\left(\tau_{i}, a_i^{t}\right)$ at time $t$.
[Figure: agent network structure (MLP, GRU, MLP)]

The agent network is implemented as a DRQN. Depending on the task, the agents' networks can be trained separately or share parameters. DRQN replaces a fully connected layer of DQN with a recurrent layer (an LSTM in the original DRQN, a GRU here), which adapts better when observation quality varies under partial observability. As shown in the figure, the network has three layers: an input MLP, a GRU (gated recurrent unit) in the middle, and an output MLP.
The implementation is as follows.
Agent network configuration:

# --- Agent parameters ---
agent: "rnn" # Default rnn agent
rnn_hidden_dim: 64 # Size of hidden state for default rnn agent
obs_agent_id: True # Include the agent's one_hot id in the observation
obs_last_action: True # Include the agent's last action (one_hot) in the observation
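
Under this configuration, each agent's input is its observation concatenated with a one-hot encoding of its last action and a one-hot agent id. The helper below is an illustrative sketch of how the resulting input size is assembled (not pymarl's actual code):

def get_input_shape(obs_shape, n_actions, n_agents,
                    obs_last_action=True, obs_agent_id=True):
    """Size of the per-agent input fed to RNNAgent below."""
    input_shape = obs_shape
    if obs_last_action:
        input_shape += n_actions      # one-hot last action
    if obs_agent_id:
        input_shape += n_agents       # one-hot agent id
    return input_shape

# e.g. a 30-dim observation, 9 actions, 5 agents -> 30 + 9 + 5 = 44
print(get_input_shape(30, 9, 5))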

RNN network:

import torch.nn as nn
import torch.nn.functional as F


class RNNAgent(nn.Module):
    def __init__(self, input_shape, args):
        super(RNNAgent, self).__init__()
        self.args = args
        # Per the configuration above, the agent input is
        # input_shape = obs_shape + n_actions (one-hot last action) + n_agents (one-hot agent id)
        self.fc1 = nn.Linear(input_shape, args.rnn_hidden_dim)  # input MLP
        self.rnn = nn.GRUCell(args.rnn_hidden_dim, args.rnn_hidden_dim)  # GRU layer, takes a hidden state
        self.fc2 = nn.Linear(args.rnn_hidden_dim, args.n_actions)  # output MLP

    def init_hidden(self):
        # make hidden states on same device as model
        return self.fc1.weight.new(1, self.args.rnn_hidden_dim).zero_()

    def forward(self, inputs, hidden_state):
        x = F.relu(self.fc1(inputs))  # input layer followed by ReLU
        h_in = hidden_state.reshape(-1, self.args.rnn_hidden_dim)  # reshape hidden state to (batch, hidden_dim)
        h = self.rnn(x, h_in)  # GRU step: current features x plus the previous hidden state
        q = self.fc2(h)  # per-action Q-values
        return q, h
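
A quick usage sketch of the agent network (hypothetical sizes; `SimpleNamespace` stands in for the framework's argument object):

from types import SimpleNamespace
import torch as th

# Hypothetical sizes for illustration: 30-dim observation, 9 actions, 5 agents
args = SimpleNamespace(rnn_hidden_dim=64, n_actions=9)
input_shape = 30 + 9 + 5                      # obs + one-hot last action + one-hot agent id

agent = RNNAgent(input_shape, args)
hidden = agent.init_hidden().expand(5, -1)    # one hidden state per agent
inputs = th.rand(5, input_shape)              # one input row per agent
q, hidden = agent(inputs, hidden)             # q.shape == (5, 9): per-action values for each agent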
Transformation network

Input: agent $i$'s state-value function $V_i(\tau_i)$, advantage function $A_i(\tau_i, a_i)$, and the global state $s_t$.
Output: agent $i$'s state-value function $V_i(\boldsymbol{\tau})$ and advantage function $A_i(\boldsymbol{\tau}, a_i)$, conditioned on the global information $s$.

$$
\left[V_{i}\left(\tau_{i}\right), A_{i}\left(\tau_{i}, a_{i}\right)\right]_{i=1}^{n} \;\longrightarrow\; \left[V_{i}(\boldsymbol{\tau}), A_{i}\left(\boldsymbol{\tau}, a_{i}\right)\right]_{i=1}^{n}
$$
The Transformation network combines the local value and advantage functions $V_i(\tau_i)$ and $A_i(\tau_i, a_i)$ with the global information $s_t$ (or the joint observation history) to obtain globally conditioned local functions $V_i(\boldsymbol{\tau})$ and $A_i(\boldsymbol{\tau}, a_i)$.
Concretely,
$$
V_{i}(\boldsymbol{\tau})=w_{i}(\boldsymbol{\tau})\, V_{i}\left(\tau_{i}\right)+b_{i}(\boldsymbol{\tau})
\quad\text{and}\quad
A_{i}\left(\boldsymbol{\tau}, a_{i}\right)=w_{i}(\boldsymbol{\tau})\, A_{i}\left(\tau_{i}, a_{i}\right)+b_{i}(\boldsymbol{\tau})
$$
where $w_i(\boldsymbol{\tau}) > 0$ is a positive weight, which preserves monotonicity between the local functions and their globally conditioned counterparts.

The implementation (an excerpt from the DMAQer mixer class shown in full in Section 2.2):

# Hypernetwork that outputs the weights W
self.hyper_w_final = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                                   nn.ReLU(),
                                   nn.Linear(hypernet_embed, self.n_agents))
# Hypernetwork that outputs the bias b
self.V = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                       nn.ReLU(),
                       nn.Linear(hypernet_embed, self.n_agents))

# Obtain the Transformation network parameters from the global state s
w_final = self.hyper_w_final(states)  # weights W
w_final = th.abs(w_final)  # absolute value keeps the weights positive (monotonicity)
w_final = w_final.view(-1, self.n_agents) + 1e-10
v = self.V(states)  # bias b
v = v.view(-1, self.n_agents)

if self.args.weighted_head:  # whether to use the weighted head
    agent_qs = w_final * agent_qs + v  # transformed per-agent action values Q_i(tau, a_i)
if not is_v:
    max_q_i = max_q_i.view(-1, self.n_agents)
    if self.args.weighted_head:
        max_q_i = w_final * max_q_i + v  # transformed per-agent state values V_i(tau)
2.2 Dueling Mixing network

Input: the globally conditioned state-value functions $V_i(\boldsymbol{\tau})$ and advantage functions $A_i(\boldsymbol{\tau}, a_i)$ of all agents, and the global state $s_t$.
Output: the joint action-value function $Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$.

The Dueling Mixing network consists of two parts:
$$
Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})
$$

  • Computing $V_{tot}(\boldsymbol{\tau})$:

Since $V$ depends only on the state $s$ (or the joint observation history $\boldsymbol{\tau}$) and not on the actions, $V_{tot}(\boldsymbol{\tau})$ is simply
$$
V_{tot}(\boldsymbol{\tau})=\sum_{i=1}^{n} V_{i}(\boldsymbol{\tau})
$$

  • Computing $A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$:
$$
A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{i=1}^{n} \lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})\, A_{i}\left(\boldsymbol{\tau}, a_{i}\right)
$$

where $\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})>0$, which keeps greedy joint action selection consistent with the individual greedy policies (the advantage-based IGM).
$$
\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{k=1}^{K} \lambda_{i, k}(\boldsymbol{\tau}, \boldsymbol{a})\, \phi_{i, k}(\boldsymbol{\tau})\, v_{k}(\boldsymbol{\tau})
$$
$\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})$ is implemented with a multi-head attention mechanism: $K$ is the number of heads, $\lambda_{i, k}(\boldsymbol{\tau}, \boldsymbol{a})$ and $\phi_{i, k}(\boldsymbol{\tau})$ are sigmoid-activated attention weights, and $v_{k}(\boldsymbol{\tau})>0$ is the positive key of each head.

  • With $V_{tot}(\boldsymbol{\tau})$ and $A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})$ in hand, the joint value $Q_{tot}$ is
$$
Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=V_{tot}(\boldsymbol{\tau})+A_{tot}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{i=1}^{n} Q_{i}\left(\boldsymbol{\tau}, a_{i}\right)+\sum_{i=1}^{n}\left(\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})-1\right) A_{i}\left(\boldsymbol{\tau}, a_{i}\right)
$$

The first term $\sum_{i=1}^{n} Q_{i}(\boldsymbol{\tau}, a_{i})$ is the same as $Q_{tot}^{\mathrm{VDN}}$, while the second term corrects the discrepancy between $Q_{tot}^{\mathrm{VDN}}$ and the true joint action-value function $Q_{tot}$.
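The last equality is a one-line expansion using $Q_i(\boldsymbol{\tau}, a_i) = V_i(\boldsymbol{\tau}) + A_i(\boldsymbol{\tau}, a_i)$:
$$
\begin{aligned}
Q_{tot}(\boldsymbol{\tau}, \boldsymbol{a})
&=\sum_{i=1}^{n} V_{i}(\boldsymbol{\tau})+\sum_{i=1}^{n} \lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})\, A_{i}\left(\boldsymbol{\tau}, a_{i}\right)\\
&=\sum_{i=1}^{n}\left(V_{i}(\boldsymbol{\tau})+A_{i}\left(\boldsymbol{\tau}, a_{i}\right)\right)+\sum_{i=1}^{n}\left(\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})-1\right) A_{i}\left(\boldsymbol{\tau}, a_{i}\right)\\
&=\sum_{i=1}^{n} Q_{i}\left(\boldsymbol{\tau}, a_{i}\right)+\sum_{i=1}^{n}\left(\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})-1\right) A_{i}\left(\boldsymbol{\tau}, a_{i}\right)
\end{aligned}
$$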
The implementation is as follows (the mixer-related configuration, then the mixer class):

# use the Q_Learner to train
agent_output_type: "q"
learner: "q_learner"
double_q: True
mixer: "qmix"
mixing_embed_dim: 32
hypernet_layers: 2
hypernet_embed: 64

import numpy as np
import torch as th
import torch.nn as nn


class DMAQer(nn.Module):
    def __init__(self, args):
        super(DMAQer, self).__init__()
        # read agent / environment parameters
        self.args = args
        self.n_agents = args.n_agents
        self.n_actions = args.n_actions
        self.state_dim = int(np.prod(args.state_shape))
        self.action_dim = args.n_agents * self.n_actions
        self.state_action_dim = self.state_dim + self.action_dim + 1

        self.embed_dim = args.mixing_embed_dim  # mixing embedding dimension

        hypernet_embed = self.args.hypernet_embed  # hidden size of the hypernetworks
        # hypernetwork that outputs the weights W
        self.hyper_w_final = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                                           nn.ReLU(),
                                           nn.Linear(hypernet_embed, self.n_agents))
        # hypernetwork that outputs the bias b
        self.V = nn.Sequential(nn.Linear(self.state_dim, hypernet_embed),
                               nn.ReLU(),
                               nn.Linear(hypernet_embed, self.n_agents))
        # attention module that computes the weights lambda_i (DMAQ_SI_Weight is defined in the next snippet)
        self.si_weight = DMAQ_SI_Weight(args)

    # Q_tot = V_tot + A_tot = \sum Q_i + \sum (lambda_i - 1) * A_i
    def calc_v(self, agent_qs):  # Dueling Mixing network: V_tot = \sum V_i
        agent_qs = agent_qs.view(-1, self.n_agents)
        v_tot = th.sum(agent_qs, dim=-1)  # sum over agents
        return v_tot

    def calc_adv(self, agent_qs, states, actions, max_q_i):  # Dueling Mixing network: \sum (lambda_i - 1) * A_i
        states = states.reshape(-1, self.state_dim)
        actions = actions.reshape(-1, self.action_dim)
        agent_qs = agent_qs.view(-1, self.n_agents)
        max_q_i = max_q_i.view(-1, self.n_agents)

        adv_q = (agent_qs - max_q_i).view(-1, self.n_agents).detach()  # advantages A_i = Q_i - V_i, gradient detached

        adv_w_final = self.si_weight(states, actions)  # attention weights lambda_i
        adv_w_final = adv_w_final.view(-1, self.n_agents)

        # compute A_tot
        if self.args.is_minus_one:  # use the (lambda_i - 1) form
            adv_tot = th.sum(adv_q * (adv_w_final - 1.), dim=1)  # \sum (lambda_i - 1) * A_i
        else:
            adv_tot = th.sum(adv_q * adv_w_final, dim=1)
        return adv_tot

    def calc(self, agent_qs, states, actions=None, max_q_i=None, is_v=False):  # compute the joint value terms
        if is_v:
            v_tot = self.calc_v(agent_qs)
            return v_tot
        else:
            adv_tot = self.calc_adv(agent_qs, states, actions, max_q_i)
            return adv_tot

    def forward(self, agent_qs, states, actions=None, max_q_i=None, is_v=False):
        bs = agent_qs.size(0)  # batch size
        states = states.reshape(-1, self.state_dim)
        agent_qs = agent_qs.view(-1, self.n_agents)
        # obtain the Transformation network parameters from the global state s
        w_final = self.hyper_w_final(states)  # weights W
        w_final = th.abs(w_final)  # absolute value keeps the weights positive (monotonicity)
        w_final = w_final.view(-1, self.n_agents) + 1e-10
        v = self.V(states)  # bias b
        v = v.view(-1, self.n_agents)

        if self.args.weighted_head:  # whether to use the weighted head
            agent_qs = w_final * agent_qs + v  # transformed per-agent action values Q_i(tau, a_i)
        if not is_v:
            max_q_i = max_q_i.view(-1, self.n_agents)
            if self.args.weighted_head:
                max_q_i = w_final * max_q_i + v  # transformed per-agent state values V_i(tau)

        y = self.calc(agent_qs, states, actions=actions, max_q_i=max_q_i, is_v=is_v)  # Dueling Mixing network: joint V term or A term
        v_tot = y.view(bs, -1, 1)

        return v_tot
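
A minimal usage sketch of the mixer (hypothetical sizes, `SimpleNamespace` standing in for the real argument object, and assuming the `DMAQ_SI_Weight` class from the next snippet is already defined). Two forward passes produce the VDN-style sum term and the advantage correction term, which the learner adds to obtain $Q_{tot}$:

from types import SimpleNamespace
import torch as th

# Hypothetical configuration for illustration only
args = SimpleNamespace(n_agents=3, n_actions=5, state_shape=(48,),
                       mixing_embed_dim=32, hypernet_embed=64,
                       num_kernel=4, adv_hypernet_embed=64, adv_hypernet_layers=1,
                       is_minus_one=True, weighted_head=True)
mixer = DMAQer(args)

bs, T = 8, 1
agent_qs = th.rand(bs, T, args.n_agents)                    # Q_i of the chosen actions (random, shapes only)
max_q_i = th.rand(bs, T, args.n_agents)                     # max_a Q_i, i.e. V_i (random, shapes only)
states = th.rand(bs, T, 48)
actions = th.rand(bs, T, args.n_agents * args.n_actions)    # flattened joint one-hot actions

sum_q = mixer(agent_qs, states, is_v=True)                                        # \sum_i Q_i (VDN-style term)
adv = mixer(agent_qs, states, actions=actions, max_q_i=max_q_i, is_v=False)       # \sum_i (lambda_i - 1) * A_i
q_tot = sum_q + adv                                          # Q_tot, shape (bs, T, 1)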

The multi-head attention part, which computes
$$
\lambda_{i}(\boldsymbol{\tau}, \boldsymbol{a})=\sum_{k=1}^{K} \lambda_{i, k}(\boldsymbol{\tau}, \boldsymbol{a})\, \phi_{i, k}(\boldsymbol{\tau})\, v_{k}(\boldsymbol{\tau})
$$

import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F


class DMAQ_SI_Weight(nn.Module):
    def __init__(self, args):
        super(DMAQ_SI_Weight, self).__init__()

        self.args = args
        self.n_agents = args.n_agents
        self.n_actions = args.n_actions
        self.state_dim = int(np.prod(args.state_shape))
        self.action_dim = args.n_agents * self.n_actions
        self.state_action_dim = self.state_dim + self.action_dim

        self.num_kernel = args.num_kernel

        self.key_extractors = nn.ModuleList()
        self.agents_extractors = nn.ModuleList()
        self.action_extractors = nn.ModuleList()

        adv_hypernet_embed = self.args.adv_hypernet_embed
        for i in range(self.num_kernel):  # multi-head attention
            if getattr(args, "adv_hypernet_layers", 1) == 1:
                self.key_extractors.append(nn.Linear(self.state_dim, 1))  # key
                self.agents_extractors.append(nn.Linear(self.state_dim, self.n_agents))  # agent
                self.action_extractors.append(nn.Linear(self.state_action_dim, self.n_agents))  # action
            elif getattr(args, "adv_hypernet_layers", 1) == 2:
                self.key_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                         nn.ReLU(),
                                                         nn.Linear(adv_hypernet_embed, 1)))  # key
                self.agents_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # agent
                self.action_extractors.append(nn.Sequential(nn.Linear(self.state_action_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # action
            elif getattr(args, "adv_hypernet_layers", 1) == 3:
                self.key_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                         nn.ReLU(),
                                                         nn.Linear(adv_hypernet_embed, adv_hypernet_embed),
                                                         nn.ReLU(),
                                                         nn.Linear(adv_hypernet_embed, 1)))  # key
                self.agents_extractors.append(nn.Sequential(nn.Linear(self.state_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # agent
                self.action_extractors.append(nn.Sequential(nn.Linear(self.state_action_dim, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, adv_hypernet_embed),
                                                            nn.ReLU(),
                                                            nn.Linear(adv_hypernet_embed, self.n_agents)))  # action
            else:
                raise Exception("Error setting number of adv hypernet layers.")

    def forward(self, states, actions):
        states = states.reshape(-1, self.state_dim)
        actions = actions.reshape(-1, self.action_dim)
        data = th.cat([states, actions], dim=1)

        all_head_key = [k_ext(states) for k_ext in self.key_extractors]
        all_head_agents = [k_ext(states) for k_ext in self.agents_extractors]
        all_head_action = [sel_ext(data) for sel_ext in self.action_extractors]

        head_attend_weights = []
        for curr_head_key, curr_head_agents, curr_head_action in zip(all_head_key, all_head_agents, all_head_action):
            x_key = th.abs(curr_head_key).repeat(1, self.n_agents) + 1e-10  # positive key v_k(tau)
            x_agents = F.sigmoid(curr_head_agents)  # phi_{i,k}(tau)
            x_action = F.sigmoid(curr_head_action)  # lambda_{i,k}(tau, a)
            weights = x_key * x_agents * x_action  # per-head weight
            head_attend_weights.append(weights)

        head_attend = th.stack(head_attend_weights, dim=1)
        head_attend = head_attend.view(-1, self.num_kernel, self.n_agents)
        head_attend = th.sum(head_attend, dim=1)  # sum over the K heads -> lambda_i(tau, a)

        return head_attend
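
Every factor above is either a sigmoid output or an absolute value plus a small constant, so the resulting $\lambda_i$ are strictly positive, which is exactly the condition the advantage-based IGM requires. A quick check with hypothetical sizes:

from types import SimpleNamespace
import torch as th

# Hypothetical sizes for illustration only
args = SimpleNamespace(n_agents=3, n_actions=5, state_shape=(48,),
                       num_kernel=4, adv_hypernet_embed=64, adv_hypernet_layers=1)
lam_net = DMAQ_SI_Weight(args)

states = th.rand(8, 48)
actions = th.rand(8, 3 * 5)               # flattened joint one-hot actions
lam = lam_net(states, actions)            # lambda_i(tau, a), one value per agent
print(lam.shape, bool((lam > 0).all()))   # torch.Size([8, 3]) True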

2.3 Update Procedure

[Figure: QPLEX training and update flow]

Loss function:
$$
\mathcal{L}(\theta)=\sum_{i=1}^{b}\left[\left(y_{i}^{tot}-Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s ; \theta)\right)^{2}\right]
$$
where $b$ is the number of samples drawn from the replay buffer, $y^{tot}=r+\gamma \max_{\mathbf{u}^{\prime}} Q_{tot}\left(\boldsymbol{\tau}^{\prime}, \mathbf{u}^{\prime}, s^{\prime} ; \theta^{-}\right)$, and $\theta^{-}$ are the parameters of the target network.
The temporal-difference error can therefore be written as
$$
\mathrm{TD\ error}=\left(r+\gamma\, Q_{tot}(\text{target})\right)-Q_{tot}(\text{evaluate})
$$
$Q_{tot}(\text{target})$: the maximum value attainable over all joint actions in the next state $s^{\prime}$; by the IGM condition, its inputs are each agent's maximal action value.
$Q_{tot}(\text{evaluate})$: the value obtained in state $s$ under the current network's policy.
The implementation is as follows.
Configuration:

# --- QMIX specific parameters ---

# use epsilon greedy action selector
action_selector: "epsilon_greedy"
epsilon_start: 1.0
epsilon_finish: 0.05
epsilon_anneal_time: 50000

runner: "episode"

buffer_size: 5000

# update the target network every {} episodes
target_update_interval: 200

# use the Q_Learner to train
agent_output_type: "q"
learner: "q_learner"
double_q: True
mixer: "qmix"
mixing_embed_dim: 32
hypernet_layers: 2
hypernet_embed: 64

name: "qmix"

Action selection (ε-greedy):

import torch as th
from torch.distributions import Categorical
# DecayThenFlatSchedule comes from pymarl's components/epsilon_schedules.py;
# a minimal sketch of it is given after this snippet.


class EpsilonGreedyActionSelector():

    def __init__(self, args):
        self.args = args

        self.schedule = DecayThenFlatSchedule(args.epsilon_start, args.epsilon_finish, args.epsilon_anneal_time,
                                              decay="linear")
        self.epsilon = self.schedule.eval(0)

    def select_action(self, agent_inputs, avail_actions, t_env, test_mode=False):
        # Assuming agent_inputs is a batch of Q-Values for each agent bav
        self.epsilon = self.schedule.eval(t_env)  # current epsilon for environment step t_env

        if test_mode:
            # Greedy action selection only
            self.epsilon = 0.0

        # mask actions that are excluded from selection
        masked_q_values = agent_inputs.clone()  # per-agent Q-values
        masked_q_values[avail_actions == 0.0] = -float("inf")  # unavailable actions must never be selected

        random_numbers = th.rand_like(agent_inputs[:, :, 0])  # random matrix of matching shape
        pick_random = (random_numbers < self.epsilon).long()  # 1 where we explore
        random_actions = Categorical(avail_actions.float()).sample().long()  # sample uniformly among available actions
        # pick_random == 1: random_numbers < epsilon, explore with a random available action
        # pick_random == 0: random_numbers >= epsilon, exploit the action with the highest Q-value

        picked_actions = pick_random * random_actions + (1 - pick_random) * masked_q_values.max(dim=2)[1]  # final choice
        return picked_actions  # the chosen actions
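
The epsilon schedule `DecayThenFlatSchedule` is not shown in the excerpt above. A minimal sketch of its linear mode, matching how `eval(t_env)` is called (an illustrative stand-in, not the repo's exact implementation):

class DecayThenFlatSchedule:
    """Illustrative stand-in: linearly anneal epsilon from `start` to `finish`
    over `time_length` environment steps, then keep it flat."""

    def __init__(self, start, finish, time_length, decay="linear"):
        self.start = start
        self.finish = finish
        self.time_length = time_length
        self.delta = (self.start - self.finish) / self.time_length

    def eval(self, t):
        # epsilon used at environment step t
        return max(self.finish, self.start - self.delta * t)


# e.g. epsilon_start=1.0, epsilon_finish=0.05, epsilon_anneal_time=50000 (as in the config above)
schedule = DecayThenFlatSchedule(1.0, 0.05, 50000, decay="linear")
print(schedule.eval(0), schedule.eval(25000), schedule.eval(100000))   # 1.0, ~0.525, 0.05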

Computing each agent's estimated Q-values:

# Calculate estimated Q-Values: per-agent Q-values from the evaluation network
mac_out = []
mac.init_hidden(batch.batch_size)
for t in range(batch.max_seq_length):
    agent_outs = mac.forward(batch, t=t)  # per-agent Q-values at timestep t
    mac_out.append(agent_outs)
mac_out = th.stack(mac_out, dim=1)  # Concat over time

# Pick the Q-Values for the actions taken by each agent
# select each agent's Q-value for the action actually taken, then squeeze the now-singleton last dim
chosen_action_qvals = th.gather(mac_out[:, :-1], dim=3, index=actions).squeeze(3)  # Remove the last dim

x_mac_out = mac_out.clone().detach()  # copy without gradients
x_mac_out[avail_actions == 0] = -9999999  # unavailable actions get a very negative value
max_action_qvals, max_action_index = x_mac_out[:, :-1].max(dim=3)  # maximal action values and their indices

max_action_index = max_action_index.detach().unsqueeze(3)  # detach from the graph
is_max_action = (max_action_index == actions).int().float()  # whether the taken action was the greedy one
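
To make the `gather` above concrete, here is a tiny shape check with hypothetical sizes (4 episodes, 10 timesteps, 3 agents, 5 actions), following the usual pymarl batch layout `[batch, T, n_agents, n_actions]`:

import torch as th

mac_out = th.rand(4, 10, 3, 5)              # [batch, T, n_agents, n_actions]
actions = th.randint(0, 5, (4, 9, 3, 1))    # actions taken (one timestep shorter than mac_out)

chosen = th.gather(mac_out[:, :-1], dim=3, index=actions).squeeze(3)
print(chosen.shape)                          # torch.Size([4, 9, 3]): Q_i of the taken action per agent and step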

Computing each agent's target Q-values:

# Calculate the Q-Values necessary for the target (target network)
target_mac_out = []
self.target_mac.init_hidden(batch.batch_size)
for t in range(batch.max_seq_length):
    target_agent_outs = self.target_mac.forward(batch, t=t)
    target_mac_out.append(target_agent_outs)

# We don't need the first timesteps Q-Value estimate for calculating targets
target_mac_out = th.stack(target_mac_out[1:], dim=1)  # Concat across time

# Mask out unavailable actions
target_mac_out[avail_actions[:, 1:] == 0] = -9999999

# Max over target Q-Values
if self.args.double_q:  # double Q-learning: pick the greedy action with the live network, evaluate it with the target network
    # Get actions that maximise live Q (for double q-learning)
    mac_out_detach = mac_out.clone().detach()
    mac_out_detach[avail_actions == 0] = -9999999
    cur_max_actions = mac_out_detach[:, 1:].max(dim=3, keepdim=True)[1]  # greedy actions under the live network
    # evaluate the greedy actions with the target network, then squeeze the singleton last dim
    target_chosen_qvals = th.gather(target_mac_out, 3, cur_max_actions).squeeze(3)
    target_max_qvals = target_mac_out.max(dim=3)[0]
    target_next_actions = cur_max_actions.detach()

    cur_max_actions_onehot = th.zeros(cur_max_actions.squeeze(3).shape + (self.n_actions,)).cuda()  # one-hot encoding of the greedy actions
    cur_max_actions_onehot = cur_max_actions_onehot.scatter_(3, cur_max_actions, 1)
else:
    # Calculate the Q-Values necessary for the target
    # (recomputed for the non-double-Q branch; duplicates the computation above)
    target_mac_out = []
    self.target_mac.init_hidden(batch.batch_size)
    for t in range(batch.max_seq_length):
        target_agent_outs = self.target_mac.forward(batch, t=t)
        target_mac_out.append(target_agent_outs)
    # We don't need the first timesteps Q-Value estimate for calculating targets
    target_mac_out = th.stack(target_mac_out[1:], dim=1)  # Concat across time
    target_max_qvals = target_mac_out.max(dim=3)[0]  # maximal target Q-values

Computing the loss and backpropagating:

# Mix: the mixing network computes the joint values
# In QPLEX, the evaluation mixer takes the Q-values of the actions each agent actually chose,
# while the target mixer takes each agent's maximal Q-value, just as in a standard DQN-style update
if mixer is not None:
    # compute Q_tot(evaluate)
    if self.args.mixer == "dmaq_qatten":
        ans_chosen, q_attend_regs, head_entropies = \
            mixer(chosen_action_qvals, batch["state"][:, :-1], is_v=True)  # state-value part V
        ans_adv, _, _ = mixer(chosen_action_qvals, batch["state"][:, :-1], actions=actions_onehot,
                              max_q_i=max_action_qvals, is_v=False)  # advantage part A
        chosen_action_qvals = ans_chosen + ans_adv  # joint value Q_tot(evaluate)
    else:
        ans_chosen = mixer(chosen_action_qvals, batch["state"][:, :-1], is_v=True)  # state-value part V
        ans_adv = mixer(chosen_action_qvals, batch["state"][:, :-1], actions=actions_onehot,
                        max_q_i=max_action_qvals, is_v=False)  # advantage part A
        chosen_action_qvals = ans_chosen + ans_adv  # joint value Q_tot(evaluate)
    # compute Q_tot(target)
    if self.args.double_q:
        if self.args.mixer == "dmaq_qatten":
            target_chosen, _, _ = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:],
                                                    is_v=True)  # state-value part V
            target_adv, _, _ = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:],
                                                 actions=cur_max_actions_onehot,
                                                 max_q_i=target_max_qvals, is_v=False)  # advantage part A
            target_max_qvals = target_chosen + target_adv  # joint value Q_tot(target)
        else:
            target_chosen = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:], is_v=True)  # state-value part V
            target_adv = self.target_mixer(target_chosen_qvals, batch["state"][:, 1:],
                                           actions=cur_max_actions_onehot,
                                           max_q_i=target_max_qvals, is_v=False)  # advantage part A
            target_max_qvals = target_chosen + target_adv  # joint value Q_tot(target)
    else:
        target_max_qvals = self.target_mixer(target_max_qvals, batch["state"][:, 1:], is_v=True)  # joint value Q_tot(target)

# Calculate 1-step Q-Learning targets: r + gamma * Q_tot(target)
targets = rewards + self.args.gamma * (1 - terminated) * target_max_qvals

if show_demo:
    tot_q_data = chosen_action_qvals.detach().cpu().numpy()
    tot_target = targets.detach().cpu().numpy()
    print('action_pair_%d_%d' % (save_data[0], save_data[1]), np.squeeze(q_data[:, 0]),
          np.squeeze(q_i_data[:, 0]), np.squeeze(tot_q_data[:, 0]), np.squeeze(tot_target[:, 0]))
    self.logger.log_stat('action_pair_%d_%d' % (save_data[0], save_data[1]),
                         np.squeeze(tot_q_data[:, 0]), t_env)
    return

# Td-error
td_error = (chosen_action_qvals - targets.detach())

mask = mask.expand_as(td_error)  # expand mask to the same shape as td_error

# 0-out the targets that came from padded data
masked_td_error = td_error * mask  # zero out the td_error of padded (filler) transitions

# Normal L2 loss, take mean over actual data
if self.args.mixer == "dmaq_qatten":
    loss = (masked_td_error ** 2).sum() / mask.sum() + q_attend_regs
else:
    # cannot simply take .mean(): padded entries would dilute it, so divide the sum by the number of real entries
    loss = (masked_td_error ** 2).sum() / mask.sum()

# Optimise (RMSprop)
optimiser.zero_grad()
loss.backward()
grad_norm = th.nn.utils.clip_grad_norm_(params, self.args.grad_norm_clip)
optimiser.step()

3 Experimental Results

The figure below shows learning curves on six StarCraft II maps with online data collection; QPLEX clearly outperforms the other algorithms.
[Figure: learning curves on six StarCraft II maps with online data collection]

References:

Blog: QPLEX: Duplex Dueling Multi-agent Q-learning
Code: https://github.com/oxwhirl/pymarl
