Paper Notes: A Decentralized Communication Policy for Multi Agent Multi Armed

Basic Notation

Symbols and their meanings:

  • $t$ — time step
  • $i \in \{1,2,\dots,n_0\}$ — options (arms, levels)
  • $i_*$ — the optimal option
  • $j \in \{1,2,\dots,n_A\}$ — agents (players)
  • $\mathcal N_j^t$ — the set of neighbors that communicate with agent $j$ at time $t$ (with $j \in \mathcal N_j^t$)
  • $\varphi_j^t \in \{1,2,\dots,n_0\}$ — the option chosen by agent $j$ at time $t$
  • $X_i^t$ — the reward obtained for choosing option $i$ at time $t$
  • $\Pi_{\{\varphi_j^t=i\}} \in \{0,1\}$ — indicator of whether agent $j$ chose $i$ at time $t$
  • $\epsilon_{ij}^t \triangleq \begin{cases} 1, & \text{if } \left( \sum_{k \in \mathcal N_j^t} \Pi_{\{\varphi_k^t=i\}} \right) \neq 0 \\ 0, & \text{if } \left( \sum_{k \in \mathcal N_j^t} \Pi_{\{\varphi_k^t=i\}} \right) = 0 \end{cases}$ — whether agent $j$ or any of its neighbors chose $i$ at time $t$
  • $N_{ij}(t) \triangleq \sum_{v=1}^t \epsilon_{ij}^v = N_{ij}(t-1) + \epsilon_{ij}^t$ — the number of time steps in $[1,2,\dots,t]$ at which agent $j$ or one of its neighbors chose $i$
  • $S_j(T) \triangleq \sum_{i=1}^{n_0} \sum_{t=1}^T X_i^t \Pi_{\{\varphi_j^t=i\}}$ — cumulative reward
  • $R_j(T) \triangleq E\left( \sum_{i \neq i_*} \sum_{t=1}^T (X_{i_*}^t - X_i^t) \Pi_{\{\varphi_j^t=i\}} \right)$ — regret
  • The reward of each option at a given time step is drawn from a Gaussian distribution $(\mu, \nu)$ and does not depend on the agent: every agent that chooses the same option at the same time receives the same reward.
  • Goal: maximize the expected cumulative reward $E(X)$ and minimize the regret.
  • A superscript $s$ marks "self" quantities, i.e. those involving only agent $j$'s own choices, e.g. $R_j^s(T)$, $S_j^s(T)$, $N_{ij}^s(T)$:
    $$S_{ij}^s(T) \triangleq \sum_{t=1}^T X_i^t \Pi_{\{\varphi_j^t=i\}}$$
    $$R_{ij}^s(T) = E\left( \sum_{t=1}^T (X_{i_*}^t - X_i^t) \Pi_{\{\varphi_j^t=i\}} \right) \leq \bar{\Delta} E\left( \sum_{t=1}^T \Pi_{\{\varphi_j^t=i\}} \right) = \bar{\Delta} E\left( N_{ij}^s(T) \right)$$
    where $\Delta \leq E(X_{i_*}^r) - E(X_i^r) \leq \bar{\Delta}$, i.e. $\bar{\Delta}$ is an upper bound on the gap $E(X_{i_*}^r) - E(X_i^r)$.
  • A superscript $c$ marks "communication" quantities, i.e. those contributed by neighbors other than $j$ itself, e.g. $R_j^c(T)$, $S_j^c(T)$, $N_{ij}^c(T)$.
    If several neighbors choose $i$ at the same time, it is counted only once:
    $$S_{ij}^c(T) \triangleq \sum_{t=1}^T X_i^t \Pi_{\{\varphi_j^t \neq i \,\&\, \epsilon_{ij}^t=1\}}$$
    $$R_{ij}^c(T) = E\left( \sum_{t=1}^T (X_{i_*}^t - X_i^t) \Pi_{\{\varphi_j^t \neq i \,\&\, \epsilon_{ij}^t=1\}} \right) \leq \bar{\Delta} E\left( \sum_{t=1}^T \Pi_{\{\varphi_j^t \neq i \,\&\, \epsilon_{ij}^t=1\}} \right)$$
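The bookkeeping behind $\epsilon_{ij}^t$ and $N_{ij}(t)$ can be sketched in a few lines of Python. This is only an illustrative sketch; the neighbor sets and choices below are made-up example values, not from the paper.

```python
# Minimal sketch of the epsilon_{ij}^t / N_{ij}(t) bookkeeping above.
# Neighbor sets and choices are made-up illustrative values.

def eps(i, j, choices, neighbors):
    """epsilon_{ij}^t: 1 if agent j or any of its neighbors chose option i."""
    return 1 if any(choices[k] == i for k in neighbors[j]) else 0

# 3 agents; by convention each agent is in its own neighbor set (j in N_j^t)
neighbors = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
choices = {0: 2, 1: 1, 2: 2}  # phi_j^t for each agent j at this time step

# N_{ij}(t) = N_{ij}(t-1) + epsilon_{ij}^t, per (option, agent) pair
N = {(i, j): 0 for i in range(3) for j in range(3)}
for i in range(3):
    for j in range(3):
        N[(i, j)] += eps(i, j, choices, neighbors)

print(N[(2, 0)])  # agent 0 itself chose option 2 -> 1
print(N[(0, 2)])  # neither agent 1 nor 2 chose option 0 -> 0
```

Note that even if both agents 1 and 2 had chosen option 2, $\epsilon_{2,j}^t$ would still be 1, which is exactly the "count only once" rule used for the $c$-superscript quantities.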

UCB Algorithm

Making a Choice

The score of each option is the sum of an exploitation term and an exploration term:
$$Q_{ij}^t = \widehat{X}_{ij}^t + C_{ij}^t$$
where $C_{ij}^t = \sqrt{\frac{\Psi_j(t)}{N_{ij}(t)}}$ and $\Psi_j(t) \approx \log(t)$.

The exploitation term is agent $j$'s estimate of the reward of option $i$: the average of all rewards for option $i$ observed by $j$ and its neighbors over $[1,2,\dots,t]$:
$$\widehat{X}_{ij}^t \triangleq \frac{1}{N_{ij}(t)} \left( \sum_{\tau=1}^t X_i^{\tau} \epsilon_{ij}^{\tau} \right)$$
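The estimate $\widehat{X}_{ij}^t$ can be maintained incrementally rather than recomputing the sum each step. A minimal sketch, with made-up reward values; `update_estimate` is a hypothetical helper, not from the paper:

```python
# Incremental version of X_hat_{ij}^t: the mean of all option-i rewards
# that agent j observed (own pulls plus neighbors', each step counted once).
# Reward values below are made-up illustrative numbers.

def update_estimate(x_hat, n, reward, eps_t):
    """Fold one time step into the (estimate, count) pair for an (i, j) pair."""
    if eps_t == 0:                     # nobody in j's neighborhood chose i
        return x_hat, n
    n += 1                             # N_{ij}(t) = N_{ij}(t-1) + eps_{ij}^t
    x_hat += (reward - x_hat) / n      # incremental mean update
    return x_hat, n

x_hat, n = 0.0, 0
for reward, eps_t in [(1.0, 1), (0.5, 0), (2.0, 1)]:
    x_hat, n = update_estimate(x_hat, n, reward, eps_t)

print(x_hat, n)  # mean of 1.0 and 2.0 -> 1.5, 2
```

The step with $\epsilon_{ij}^t = 0$ is skipped entirely, so the reward 0.5 never enters the average, matching the $X_i^\tau \epsilon_{ij}^\tau$ factor in the definition.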

Finding Neighbors

At time $t$, agent $j$'s score for option $i$ with respect to another agent $k$ is defined as
$$Q_{ijk}^t = \widehat{X}_{ijk}^t + \sqrt{\frac{\Psi_{jk}(t)}{N_{ijk}(t)}}$$
where $\widehat{X}_{ijk}^t$ is the average reward for option $i$ that agent $j$ has obtained from $k$.

  • At each time step, agent $j$ selects $n_j$ neighbors ($n_j$ is a predefined quantity): first, within each candidate $k$, rank the options $i$ and take the largest value as $k$'s score; then rank the candidates $k$ and take the Top($n_j$) as this round's neighbors.
    Each round, the information exchanged with neighbors consists of the neighbors' choices and the corresponding rewards.
  • As time goes on, agents drift toward exploitation, so when choosing neighbors an agent prefers agents with a greater degree of exploration.
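The two-stage ranking above (max over options $i$ inside each candidate $k$, then Top($n_j$) over the candidates) can be sketched as follows. The $Q_{ijk}$ values are made-up illustrative numbers, and `pick_neighbors` is a hypothetical helper name:

```python
# Minimal sketch of Top(n_j) neighbor selection: each candidate agent k is
# scored by max over options i of Q_{ijk}, then the n_j best are kept.
# The Q values below are made-up illustrative numbers.

def pick_neighbors(Q, n_j):
    """Q[k] is a list of Q_{ijk} over options i; return the Top(n_j) agents k."""
    scores = {k: max(q_over_i) for k, q_over_i in Q.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_j]

Q = {1: [0.4, 0.9], 2: [0.7, 0.6], 3: [0.2, 0.3]}  # candidates k = 1, 2, 3
print(pick_neighbors(Q, n_j=2))  # [1, 2]: scores 0.9 and 0.7 beat 0.3
```

Because the per-candidate score is an optimistic maximum over options, a candidate whose estimates still carry large exploration bonuses scores high, which matches the stated preference for more exploratory neighbors.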

Code Excerpt

```python
from numpy import argsort, log, sqrt
import random

    # per-option constant used to scale the exploration bonus;
    # reward_variance[i] is the variance of option i's Gaussian reward
    self.k = [(1 / ((8 * self.reward_variance[i]) ** 2) * 2) for i in range(self.no_bandits)]

def pick(self):
    # score every option: Q = estimated mean + exploration bonus,
    # with Psi(t) ~ log(t) entering through gamma
    # (assumes Nij_T[i] > 0, i.e. each option has been observed at least once)
    Qij_T = [0 for i in range(self.no_bandits)]
    for i in range(self.no_bandits):
        k = self.k[i]
        alpha = 3 / (2 * k)
        gamma = alpha * log(self.T)
        Qij_T[i] = self.X_ij_T[i] + sqrt(gamma / self.Nij_T[i])
    self.Q = Qij_T
    sorted_choice = argsort(Qij_T)

    # count how many options tie for the maximum score
    m = sorted_choice[-1]
    m_c = 1
    for i in range(1, self.no_bandits):
        if Qij_T[sorted_choice[-(i + 1)]] != Qij_T[m]:
            break
        else:
            m_c += 1

    if m_c == 1:  # a unique maximum
        index = 1
    else:  # several maxima: pick one uniformly at random
        index = random.randint(1, m_c)
    choice = sorted_choice[-index]
    return choice
```
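The tie-breaking at the end of `pick` can be isolated into a standalone helper, which makes the intent easier to test. `random_argmax` is a hypothetical name for illustration, not from the original code:

```python
import random

# Standalone version of the tie-breaking in pick() above: return the index
# of the maximum score, choosing uniformly at random among exact ties.

def random_argmax(scores):
    best = max(scores)
    candidates = [i for i, q in enumerate(scores) if q == best]
    return random.choice(candidates)

print(random_argmax([0.1, 0.9, 0.4]))          # 1 (unique maximum)
print(random_argmax([0.9, 0.2, 0.9]) in (0, 2))  # True (random among the ties)
```

Random tie-breaking matters early on, when several options share the same count $N_{ij}(t)$ and hence the same score; a deterministic argmax would bias all agents toward the lowest-indexed option.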
Background: Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning (MARL) is a subfield of reinforcement learning (RL) in which multiple agents learn simultaneously in a shared environment. MARL has been studied for several decades, but recent advances in deep learning and computational power have led to significant progress in the field. Its development can be divided into several key stages:

1. Early approaches: early MARL algorithms were based on game theory and heuristic methods. These approaches were limited in their ability to handle complex environments or large numbers of agents.
2. Independent Learners: the Independent Learners (IL) algorithm, proposed in the 1990s, allowed agents to learn independently while interacting with a shared environment. This approach was successful in simple environments but often led to convergence issues in more complex scenarios.
3. Decentralized Partially Observable Markov Decision Process (Dec-POMDP): the Dec-POMDP framework was introduced to address the challenge of coordinating multiple agents in a decentralized manner. It models the environment as a Partially Observable Markov Decision Process (POMDP), which allows agents to reason about the beliefs and actions of other agents.
4. Deep MARL: deep learning techniques such as deep neural networks have enabled the use of MARL in more complex environments. Deep MARL algorithms, such as Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG), have achieved state-of-the-art performance in many applications.
5. Multi-Agent Actor-Critic (MAAC): MAAC is a recent algorithm that combines the advantages of policy-based and value-based methods. It uses an actor-critic architecture to learn decentralized policies and value functions for each agent, while also incorporating a centralized critic to estimate the global value function.

Overall, the development of MARL has been driven by the need to coordinate multiple agents in complex environments. While there is still much to be learned in this field, recent advances in deep learning and reinforcement learning have opened up new possibilities for developing more effective MARL algorithms.