# Reinforcement Learning | Multi Agents | Trust Region | HATRPO | HAPPO


🌱 I mainly read the MATRPO (HATRPO) paper; this blog post is my notes on it. Besides that, I recently skimmed the GAE and Variance of MAPG papers without taking notes.

🌱 MATRPO's logic is very clear and its theoretical derivation is solid; the formula derivations in the appendix run nearly 10 pages. Its code

🌱 The authors wrote a blog post explaining the paper; it explains the core points quite clearly without getting into complex mathematical derivations. Reading their blog carefully gives a rough understanding, and rereading the paper afterwards becomes easy.

🌱 The full post is about 7k characters, all typed by hand, with many handwritten notes and a lot of my own subjective interpretation. Compared with the previous two study posts, this one contains more direct quotes from the paper and more English notes. Corrections are welcome if you spot mistakes.

🌱 Follow-up article 4: Reinforcement Learning | Mirror Learning

- 🌻 Paper Overview
- 🌻 Notation
  - 🌴 Basic MARL Notation
  - 🌴 Q-value Function
- 🌻 Decomposition Lemma
- 🌻 Trust Region Learning
  - 🌴 Notation Differences
  - 🌴 Trust Region in Single Agent
  - 🌴 Trust Region in Multi Agents
- 🌻 HATRPO
  - 🌴 Principle
  - 🌴 Pseudocode
- 🌻 HAPPO
  - 🌴 Principle
  - 🌴 Pseudocode
- 🌻 Experiments
  - 🌴 SMAC
    - 🌵 Task
    - 🌵 Results
    - 🌵 Analysis
  - 🌴 Multi-Agent MuJoCo
    - 🌵 Task
    - 🌵 Results
    - 🌵 Analysis

## 🌻 Paper Overview

• Homogeneous: agents share the same action space and policy parameters, which largely limits applicability and harms performance.
• Heterogeneous: agents need not share parameters and can each have their own action space.

The proof is as follows (proving it by counterexample is quite interesting, and this proof is fairly easy to follow):

## 🌻 Notation

### 🌴 Q-value Function

The multi-agent state-action value function $Q$ for an arbitrary ordered agent subset $i_{1:m}$ is defined as

$$Q_{\boldsymbol{\pi}}^{i_{1:m}}\left(s, \boldsymbol{a}^{i_{1:m}}\right) = \mathbb{E}_{\boldsymbol{a}^{-i_{1:m}} \sim \boldsymbol{\pi}^{-i_{1:m}}}\left[Q_{\boldsymbol{\pi}}\left(s, \boldsymbol{a}^{i_{1:m}}, \boldsymbol{a}^{-i_{1:m}}\right)\right],$$

where $-i_{1:m}$ refers to its complement and $i_k$ refers to the $k$-th agent in the ordered subset.

Here "complement" means the set complement: the agents other than the $m$ agents $i_{1:m}$, i.e. $-i_{1:m} = \{1, \dots, n\} \setminus \{i_1, \dots, i_m\}$.

The multi-agent advantage function $A$ of subset $i_{1:m}$ is defined as

$$A_{\boldsymbol{\pi}}^{i_{1:m}}\left(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}\right) = Q_{\boldsymbol{\pi}}^{j_{1:k}, i_{1:m}}\left(s, \boldsymbol{a}^{j_{1:k}}, \boldsymbol{a}^{i_{1:m}}\right) - Q_{\boldsymbol{\pi}}^{j_{1:k}}\left(s, \boldsymbol{a}^{j_{1:k}}\right),$$

where $j_{1:k}$ and $i_{1:m}$ are disjoint subsets.

## 🌻 Decomposition Lemma

In any cooperative Markov game, given a joint policy $\boldsymbol{\pi}$, for any state $s$ and any agent subset $i_{1:m}$, the following equation holds:

$$A_{\boldsymbol{\pi}}^{i_{1:m}}\left(s, \boldsymbol{a}^{i_{1:m}}\right) = \sum_{j=1}^{m} A_{\boldsymbol{\pi}}^{i_j}\left(s, \boldsymbol{a}^{i_{1:j-1}}, a^{i_j}\right).$$

The lemma shows that the joint advantage function can be decomposed into a summation of each agent's local advantage in a sequential-update process.

ps: an intuitive way to understand the step from the first line to the second line of the derivation.
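The lemma can be checked numerically in a toy single-state, two-agent game (all numbers here are made up for illustration): the joint advantage is exactly the sum of the sequential local advantages.

```python
import numpy as np

# Toy 2-agent, single-state cooperative game: R[a1, a2] is the shared reward.
R = np.array([[1.0, 0.0],
              [2.0, 3.0]])
p1 = np.array([0.7, 0.3])   # agent 1's policy over its 2 actions
p2 = np.array([0.4, 0.6])   # agent 2's policy over its 2 actions

V = p1 @ R @ p2              # state value under the joint policy
Q1 = R @ p2                  # Q^{i_1}(s, a1): agent 2 marginalised out
A1 = Q1 - V                  # A^{i_1}(s, a1)
A2 = R - Q1[:, None]         # A^{i_2}(s, a1, a2) = Q^{i_1,i_2} - Q^{i_1}
A_joint = R - V              # joint advantage A^{i_{1:2}}(s, a1, a2)

# Decomposition Lemma: joint advantage = sum of sequential local advantages
assert np.allclose(A_joint, A1[:, None] + A2)
```

The check holds by construction: summing the local advantages telescopes, leaving $Q_{\boldsymbol{\pi}}(s, \boldsymbol{a}) - V_{\boldsymbol{\pi}}(s)$, which is exactly the intuition behind the lemma.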

## 🌻 Trust Region Learning

### 🌴 Notation Differences

ps: Notation definitions differ. On one hand, the authors give new multi-agent definitions of the Q-value function and advantage function (written in the previous part); on the other hand, different papers may use different symbols for the same concept (partly summarised here). I wrote this part only so that comparing the formulas of the two papers below goes a little more smoothly. It has no real impact; it is simply the normal situation of notation varying across papers.

### 🌴 Trust Region in Multi Agents

This definition characterises the equilibrium point at convergence for cooperative MARL tasks. Based on this, we have the following result that describes the asymptotic convergent behaviour towards NE.

Nash equilibrium: at equilibrium, every agent's payoff is maximised given the others' strategies; each agent considers following the agreed protocol at least as good as deviating from it. (Reference links 1, 2, 3)

Corollary 1, which Proposition 2 relies on, itself involves quite a few derivations and definitions:
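HATRPO enforces a hard KL constraint per agent. As a rough illustration only (not the paper's actual implementation, which takes a natural-gradient step in parameter space), here is a TRPO-style backtracking search that shrinks a hypothetical proposed step until the KL constraint holds:

```python
import numpy as np

def categorical_kl(p_old, p_new):
    """KL(pi_old || pi_new) for categorical policies, averaged over states."""
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1).mean()

def backtracking_update(p_old, p_proposed, delta=0.01, n_steps=10):
    """Shrink the proposed step until the hard KL constraint holds.

    Simplification: we interpolate directly in probability space rather than
    in parameter space, just to show the acceptance test.
    """
    for k in range(n_steps):
        alpha = 0.5 ** k
        p_new = (1 - alpha) * p_old + alpha * p_proposed
        if categorical_kl(p_old, p_new) <= delta:
            return p_new
    return p_old  # no acceptable step found: keep the old policy

# hypothetical old policy and (too aggressive) proposed policy for one state
p_old = np.array([[0.5, 0.5]])
p_prop = np.array([[0.9, 0.1]])
p_new = backtracking_update(p_old, p_prop, delta=0.01)
```

The full proposed step violates the constraint, so the search halves the step until the new policy sits inside the trust region.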

## 🌻 HAPPO

### 🌴 Principle

To further alleviate the computational burden of the Hessian matrix in HATRPO, one can follow the idea of PPO and use only first-order derivatives. This is achieved by making agent $i_m$ choose a policy parameter $\hat{\theta}^{i_m}_{k+1}$ which maximises the clipping objective of

$$\mathbb{E}_{s \sim \rho_{\boldsymbol{\pi}_{\theta_k}},\, \boldsymbol{a} \sim \boldsymbol{\pi}_{\theta_k}}\left[\min\left(\frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m} \mid s)}{\pi^{i_m}_{\theta_k^{i_m}}(a^{i_m} \mid s)}\, M^{i_{1:m}}(s, \boldsymbol{a}),\ \operatorname{clip}\left(\frac{\pi^{i_m}_{\theta^{i_m}}(a^{i_m} \mid s)}{\pi^{i_m}_{\theta_k^{i_m}}(a^{i_m} \mid s)},\ 1 \pm \epsilon\right) M^{i_{1:m}}(s, \boldsymbol{a})\right)\right].$$

The optimisation process can be performed by stochastic gradient methods such as Adam.
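A minimal numpy sketch of the sequential update, with hypothetical post-update ratios standing in for the actual Adam gradient steps: each agent evaluates the clipped surrogate against the factor $M$, and after the agent updates, its ratio is folded into $M$ before the next agent's turn.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_objective(ratio, M, eps=0.2):
    """PPO-style clipped surrogate, applied per agent in the sequential scheme."""
    return np.minimum(ratio * M, np.clip(ratio, 1.0 - eps, 1.0 + eps) * M).mean()

# toy batch: joint advantage estimates for B sampled (s, a) pairs
B = 5
M = rng.normal(size=B)            # M^{i_1}(s, a) = \hat{A}(s, a)

for agent in rng.permutation(3):  # draw a random agent ordering
    # hypothetical ratios pi_new / pi_old for this agent after its update
    ratio = np.exp(rng.normal(scale=0.05, size=B))
    obj = clip_objective(ratio, M)
    # fold the updated agent's ratio into M for the next agent:
    # M^{i_{1:m+1}} = ratio^{i_m} * M^{i_{1:m}}
    M = ratio * M
```

The clipping behaves as in single-agent PPO: with a positive $M$ the ratio's contribution is capped at $1+\epsilon$, and with a negative $M$ at $1-\epsilon$, so no single agent's step can exploit the surrogate too aggressively.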

## 🌻 Experiments

### 🌴 SMAC

#### 🌵 Task

SMAC (StarCraft II Multi-Agent Challenge) contains a set of StarCraft maps in which a team of ally units aims to defeat the opponent team.

#### 🌵 Analysis

SMAC tasks are relatively simple, so non-parameter sharing is not strictly required; sharing policies is sufficient to solve SMAC tasks.

### 🌴 Multi-Agent MuJoCo

#### 🌵 Task

A continuous control task. MuJoCo tasks challenge a robot to learn an optimal way of motion; Multi-Agent MuJoCo models each part of a robot as an independent agent, for example, a leg for a spider or an arm for a swimmer.

#### 🌵 Results

HATRPO and HAPPO enjoy superior performance over the parameter-sharing methods IPPO and MAPPO, and the gap enlarges as the number of agents increases.

HATRPO and HAPPO also outperform the non-parameter-sharing MADDPG, both in terms of reward values and variance.

#### 🌵 Analysis

HATRPO performs much better than parameter-sharing methods (e.g. MAPPO), and the gap between the two classes of algorithms widens as the number of agents grows, which demonstrates the necessity of modelling heterogeneous policies.

HATRPO outperforms HAPPO, presumably because the hard KL constraint stays closer to the theory than clipping does.
