Deep Decentralized Multi-task Multi-Agent RL under Partial Observability: Paper Reading Notes


This paper, "Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability," is motivated by multi-task multi-agent reinforcement learning under partial observability and limited communication. The main problems are:

(1) Learning difficulty: under partial observability, the environment appears non-stationary from each agent's local perspective, since teammates are learning and exploring at the same time while each agent only sees local observations.

(2) Learning a specialized policy for every task is problematic: each agent would have to store a separate policy per task, and in practice the task identity is often unobservable, which makes it hard to match the right policy to the current task.

To reduce the learning difficulty and cope with this apparent non-stationarity, the paper combines two components:

Cautiously optimistic learners: by using a variant of hysteretic Q-learning, agents can learn policies more stably even when teammates' exploratory behavior changes the environment from their point of view. Hysteretic Q-learning applies a smaller learning rate whenever the TD error is negative, which reduces the agent's sensitivity to teammates' exploration and stabilizes learning.
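As a concrete illustration, here is a minimal tabular sketch of the hysteretic update; the paper applies the same idea to the TD error of a deep recurrent Q-network, and the specific learning rates below are illustrative assumptions.

```python
import numpy as np

def hysteretic_q_update(Q, s, a, r, s_next, done,
                        alpha=0.1, beta=0.01, gamma=0.99):
    """One hysteretic Q-learning step (tabular sketch).

    Positive TD errors are applied with the full learning rate `alpha`;
    negative TD errors with a smaller rate `beta` (beta < alpha), so the
    agent is punished less for bad outcomes caused by teammates' exploration.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    delta = target - Q[s, a]
    lr = alpha if delta >= 0 else beta
    Q[s, a] += lr * delta
    return Q
```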

Concurrent Experience Replay Trajectories (CERTs): in standard experience replay, each agent independently draws random samples from its own buffer. In a multi-agent setting this independent sampling can desynchronize learning, because agents may update on different experiences at different times, which leads to miscoordinated policies. CERTs instead let agents store and replay experiences concurrently, with synchronized updates that reduce the risk of policy instability; this concurrent sampling and replay helps agents coordinate and learn in partially observable environments. In panel (a) of the paper's figure, e, i, and t index the training episode, the agent, and the timestep, so each cell can be read as one experience tuple. Panel (b) shows the sampling step (the red cells are the sampled region): every agent samples data from the same episode and the same timesteps. Because samples are sequences of length τ, timesteps would otherwise be picked with unequal probability, so the paper samples the start index from [-τ+1, t_max] and fills the missing positions with zeros. A rough sketch of this sampling scheme follows below.
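The sketch below illustrates the concurrent sampling idea under simplifying assumptions: each agent's buffer is a nested list `buffers[i][e]` of per-timestep feature vectors, and the zero-padding layout is my own simplification rather than the paper's exact data structure.

```python
import random
import numpy as np

def sample_cert_minibatch(buffers, tau, batch_size):
    """Sample aligned length-`tau` trajectory slices for all agents.

    All agents are indexed by the SAME episode e and start timestep t0,
    so their samples stay concurrent. Start indices are drawn from
    [-tau+1, t_max - 1], and slices that run off either end of the
    episode are zero-padded, giving every timestep the same chance of
    being included.
    """
    n_agents = len(buffers)
    feat_dim = len(buffers[0][0][0])
    batch = np.zeros((batch_size, n_agents, tau, feat_dim))
    for b in range(batch_size):
        e = random.randrange(len(buffers[0]))       # shared episode index
        t_max = len(buffers[0][e])
        t0 = random.randrange(-tau + 1, t_max)      # shared start timestep
        for i in range(n_agents):
            for k in range(tau):
                t = t0 + k
                if 0 <= t < t_max:                  # inside the episode
                    batch[b, i, k] = buffers[i][e][t]
                # else: keep the zero padding
    return batch
```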


In the first phase, each agent uses decentralized hysteretic deep recurrent Q-networks (Dec-HDRQNs) to train a specialized policy for each task, so that it obtains an effective per-task policy. In the second phase, each agent's specialized policies are distilled into a single generalized policy, following the distillation approach of Rusu et al. (2015): knowledge is extracted from the Q-networks learned by the specialized policies and transferred into a new, general Q-network. During distillation the agent learns from the experiences of all tasks rather than a single one, which lets it acquire knowledge that generalizes across tasks.
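To make the first phase concrete, below is a minimal recurrent Q-network in the spirit of Dec-HDRQN, written with PyTorch; the layer sizes and single-LSTM layout are illustrative assumptions, not the paper's exact architecture. Each agent would hold one such network per task in phase one, trained with the hysteretic update and CERT minibatches sketched above.

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    """Minimal DRQN-style Q-network for a single agent.

    The LSTM carries a hidden state over the observation history,
    which is what lets the agent act under partial observability.
    """
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, tau, obs_dim), e.g. one agent's CERT slice
        x = self.encoder(obs_seq)
        x, hidden = self.lstm(x, hidden)
        q_values = self.q_head(x)          # (batch, tau, n_actions)
        return q_values, hidden
```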


Using the Q-values each agent has learned for the different tasks, the multi-task network is trained so that its Q-values approach them. This resembles a supervised-learning problem: the single-task Q-values act as labels for training the multi-task network. During training each agent still behaves essentially as an independent learner (independent learning, IL); explicit coordination between agents could be a direction for improvement.
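A sketch of that distillation step: the specialized (teacher) Q-values serve as regression targets for the multi-task (student) network. Using a plain mean-squared-error loss here is an assumption for illustration; temperature-softened KL objectives are another option discussed in the distillation literature (Rusu et al., 2015), and the names in the usage comment are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_q, teacher_q):
    """Regress the multi-task network's Q-values onto the frozen
    task-specific teacher Q-values (supervised-style targets)."""
    return F.mse_loss(student_q, teacher_q.detach())

# Hypothetical usage inside the phase-two training loop:
#   obs_batch drawn from the CERT buffer of some task k
#   teacher_q, _ = specialized_nets[k](obs_batch)   # frozen per-task net
#   student_q, _ = multitask_net(obs_batch)
#   loss = distillation_loss(student_q, teacher_q)
#   loss.backward(); optimizer.step()
```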
