LeCTR (Learning to Coordinate and Teach Reinforcement)
1. Introduction
LeCTR is an RL algorithm that applies the "learning to teach" approach to the agents of a Dec-POMDP (Decentralized Partially Observable Markov Decision Process). At appropriate moments, agents take on the role of teacher or student, offering or requesting action advice, which improves how well the team learns as a whole.
This "learning to teach" approach can speed up learning, and each agent needs only local knowledge. It also raises several difficulties:
1. Agents must learn when to give advice and what advice to give.
2. Although the agents cooperate, their learning should remain independent; they cannot simply share their policies.
3. Agents must learn to estimate the value of advice coming from teammates.
4. Because MARL requires agents to coordinate with each other, the learning process is non-stationary and has high computational complexity.
2. Notation in LeCTR
The notation used below:
$$
\begin{aligned}
& \boldsymbol{a}=\left\langle a^{1}, \ldots, a^{n}\right\rangle \quad \textit{joint action}\\
& P\left(s^{\prime} \mid s, \boldsymbol{a}\right)=\mathcal{T}\left(s, \boldsymbol{a}, s^{\prime}\right) \\
& \boldsymbol{o}=\left\langle o^{1}, \ldots, o^{n}\right\rangle \quad \textit{joint observation}\\
& P\left(\boldsymbol{o} \mid s^{\prime}, \boldsymbol{a}\right)=\mathcal{O}\left(\boldsymbol{o}, s^{\prime}, \boldsymbol{a}\right)\\
& h_{t}^{i}=\left(o_{1}^{i}, \ldots, o_{t}^{i}\right) \quad \textit{observation history} \\
& a^{i}=\pi^{i}\left(h_{t}^{i}\right)\\
& \boldsymbol{\pi}=\left\langle\pi^{1}, \ldots, \pi^{n}\right\rangle \quad \pi \textit{ is parameterized by } \theta\\
& r_{t}=\mathcal{R}\left(s_{t}, \boldsymbol{a}_{t}\right)\\
& V(s ; \boldsymbol{\theta})=\mathbb{E}\left[\sum_{t} \gamma^{t} r_{t} \mid s_{0}=s\right]\\
& Q^{i}\left(o^{i}, a^{i} ; h^{i}\right) \quad \textit{action value}\\
& \vec{Q}^{i}\left(o^{i} ; h^{i}\right) \quad \textit{vector of action values}
\end{aligned}
$$
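To make the notation concrete, here is a minimal sketch of the Dec-POMDP pieces above in Python (all class and function names are my own illustration, not from the paper's implementation):

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Joint action a = <a^1, ..., a^n> and joint observation o = <o^1, ..., o^n>
# are simply tuples with one entry per agent.
JointAction = Tuple[int, ...]
JointObservation = Tuple[int, ...]

@dataclass
class DecPOMDP:
    """Container mirroring the Dec-POMDP tuple in the notation above."""
    n_agents: int
    transition: Callable   # T(s, a, s')  -> P(s' | s, a)
    observation: Callable  # O(o, s', a)  -> P(o | s', a)
    reward: Callable       # R(s_t, a_t)  -> r_t
    gamma: float = 0.99    # discount in V(s) = E[sum_t gamma^t r_t]

class Agent:
    """Agent i keeps a local observation history h_t^i = (o_1^i, ..., o_t^i)
    and acts via a^i = pi^i(h_t^i)."""
    def __init__(self, policy):
        self.policy = policy
        self.history = []

    def observe(self, o_i):
        self.history.append(o_i)

    def act(self):
        return self.policy(tuple(self.history))
```

The point of the sketch is only that each agent conditions on its *own* history, never on the global state.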
3. Teaching in cooperative MARL
Cooperative MARL involves two important problems:
1. $\mathcal{P}_{\text{Task}}$ : the task-level learning problem
2. $\widetilde{\mathcal{P}}_{\text{Advise}}$ : the advising-level problem. Throughout, a tilde marks advising-level quantities.
The difference between the two: the task-level problem stays within standard RL, still maximizing return, while the advising-level problem learns, through advising, how to better influence task-level learning.
This is also what separates "learning to teach" from "learning to communicate": what matters most is not how agents implement communication, but what effect advising has on task-level learning.
The task-level policy embodies the local knowledge each agent has learned. Through the learning-to-teach process, each agent should end up with an optimal task-level policy and no longer need advice from other agents. Learning to teach relies on an "action advising" mechanism, which turns local knowledge into action advice.
4. The LeCTR algorithm (2 agents)
LeCTR has two phases:
1. The agents learn the task-level problem (the task-level learner is treated as a black box, which I read as a neural network), with advice injected according to the advising policies.
2. The advising policies are updated with advising-level rewards.
Phase I:
LeCTR learns the student policies $\left\langle\tilde{\pi}_{S}^{i}, \tilde{\pi}_{S}^{j}\right\rangle$ and the teacher policies $\left\langle\tilde{\pi}_{T}^{i}, \tilde{\pi}_{T}^{j}\right\rangle$. Note that these two kinds of policies must not be trained with the same neural network.
The student policy decides when to request advice from the teacher. It uses the advising-level observation $\widetilde{o}_{S}^{i}=\left\langle o^{i}, \vec{Q}^{i}\left(o^{i} ; h^{i}\right)\right\rangle$, where $\vec{Q}^{i}\left(o^{i} ; h^{i}\right)$ is the vector of task-level action values. This yields $\widetilde{a}_{S}^{i}=\widetilde{\pi}_{S}^{i}\left(\widetilde{o}_{S}^{i}\right) \in\{\text{request advice},\ \text{do not request advice}\}$.
The teacher policy works the same way, via the advising-level observation $\widetilde{o}_{T}^{j}=\left\langle o^{i}, \vec{Q}^{i}\left(o^{i} ; h^{i}\right), \vec{Q}^{j}\left(o^{i} ; h^{i}\right)\right\rangle$. Notice that the teacher's observation contains the student's task-level state knowledge $\vec{Q}^{i}\left(o^{i} ; h^{i}\right)$ as well as the teacher's own task-level knowledge evaluated at the student's observation, $\vec{Q}^{j}\left(o^{i} ; h^{i}\right)$. The teacher policy decides what advice to give the student: $\widetilde{a}_{T}^{j}=\widetilde{\pi}_{T}^{j}\left(\widetilde{o}_{T}^{j}\right) \in \mathcal{A}^{i} \cup\left\{\widetilde{a}_{\emptyset}\right\}$, where $\widetilde{a}_{\emptyset}$ is the "no advice" action. Note that the teacher policy chooses from the student agent's action set! Given that, how the teacher policy gets updated becomes crucial, and that is what LeCTR's second phase addresses.
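One step of this Phase I advising exchange can be sketched as follows (`NO_ADVICE`, the callable policy signatures, and the greedy fallback are my own simplifications; in LeCTR the advising policies are learned networks):

```python
import numpy as np

NO_ADVICE = -1  # stands in for the "no advice" action ã_∅

def student_step(o_i, q_i, student_policy):
    """Student i: decide whether to request advice, using the
    advising-level observation õ_S^i = <o^i, Q-vector^i(o^i; h^i)>."""
    obs_tilde = np.concatenate([o_i, q_i])
    return student_policy(obs_tilde)  # True -> request advice

def teacher_step(o_i, q_i, q_j, teacher_policy):
    """Teacher j: pick an advised action from the STUDENT's action
    set A^i (or NO_ADVICE), using õ_T^j = <o^i, Q-vector^i, Q-vector^j>."""
    obs_tilde = np.concatenate([o_i, q_i, q_j])
    return teacher_policy(obs_tilde)

def execute(o_i, q_i, q_j, student_policy, teacher_policy):
    """Return the task-level action student i actually executes."""
    if student_step(o_i, q_i, student_policy):
        advice = teacher_step(o_i, q_i, q_j, teacher_policy)
        if advice != NO_ADVICE:
            return advice            # follow the teacher's advice
    return int(np.argmax(q_i))       # otherwise act on local knowledge
```

The sketch makes the asymmetry visible: the teacher sees both agents' Q-vectors at the student's observation, while the student sees only its own.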
Phase II:
The paper gives six advising-level rewards; four of them are described here.
JVG: $\widetilde{r}_{T}^{j}=V\left(s ; \boldsymbol{\theta}_{t+1}\right)-V\left(s ; \boldsymbol{\theta}_{t}\right)$
This reward measures how much the advised action improved learning. It is intuitive, but it has high variance and is sensitive to policy initialization.
QTR: $\tilde{r}_{T}^{j}=I_{T}\left(o^{i}, a^{i} ; h^{i}\right)=\max _{a} Q_{T}\left(o^{i}, a ; h^{i}\right)-Q_{T}\left(o^{i}, a^{i} ; h^{i}\right)$ (here $Q_{T}$ is the teacher's task-level action-value function)
Whenever the student agent sends a request, the teacher agent can use QTR to decide whether to give advice. QTR means: when there is a choice better than the action $a^{i}$ the student is about to take, the teacher is more likely to advise the student.
TDG: $\tilde{r}_{T}^{j}=\left|\delta_{t}^{i}\right|-\left|\delta_{t+1}^{i}\right|$
Using the TD error as the advising reward is also easy to understand: actions that shrink the student agent's task-level TD error are the ones recommended to it.
VEG: $\tilde{r}_{T}^{j}=\mathbb{1}\left(\hat{V}\left(\theta^{i}\right)>\tau\right)$, where $\hat{V}\left(\theta^{i}\right)=\max _{a^{i}} Q\left(o^{i}, a^{i} ; \theta^{i}, h^{i}\right)$
VEG rewards the teacher when the student's task-level value estimate exceeds a preset threshold $\tau$.
The rewards above belong to agent $j$ alone; the actual optimization uses the joint advising-level reward $\widetilde{r}=\widetilde{r}_{T}^{i}+\widetilde{r}_{T}^{j}$.
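The four rewards above can be written down directly, assuming the needed quantities (values before/after the task-level update, the teacher's Q-vector, the student's TD errors) are already available; a hedged sketch:

```python
import numpy as np

def jvg(v_after, v_before):
    """JVG: value gain, r̃ = V(s; θ_{t+1}) - V(s; θ_t)."""
    return v_after - v_before

def qtr(q_teacher, a_i):
    """QTR: gap between the teacher's best action value and the value
    of the action a^i the student is about to take."""
    return np.max(q_teacher) - q_teacher[a_i]

def tdg(td_before, td_after):
    """TDG: reduction in the magnitude of the student's TD error."""
    return abs(td_before) - abs(td_after)

def veg(q_student, tau):
    """VEG: indicator that the student's greedy value exceeds τ."""
    return float(np.max(q_student) > tau)

def joint_reward(r_i, r_j):
    """Joint advising-level reward r̃ = r̃_T^i + r̃_T^j."""
    return r_i + r_j
```

Note the trade-off visible in the code: JVG and TDG need before/after learning signals (expensive, high variance), while QTR and VEG are computable from current value estimates alone.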
5. Training LeCTR
LeCTR is also trained with an actor-critic method. The actor is $\tilde{\boldsymbol{\pi}}=\left\langle\tilde{\pi}_{S}^{i}, \tilde{\pi}_{S}^{j}, \tilde{\pi}_{T}^{i}, \tilde{\pi}_{T}^{j}\right\rangle$; the critic $\widetilde{Q}$ is trained on the advising-level TD error $\widetilde{r}+\gamma \widetilde{Q}\left(\widetilde{\boldsymbol{o}}^{\prime}, \widetilde{\boldsymbol{a}}^{\prime} ; \widetilde{\boldsymbol{\theta}}\right)-\widetilde{Q}(\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}} ; \widetilde{\boldsymbol{\theta}})$.
This gives the critic loss:
$$
\mathcal{L}(\widetilde{\boldsymbol{\theta}})=\left.\mathbb{E}_{\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}}, \widetilde{r}, \widetilde{\boldsymbol{o}}^{\prime} \sim \widetilde{\mathcal{M}}}\left[\left(\widetilde{r}+\gamma \widetilde{Q}\left(\widetilde{\boldsymbol{o}}^{\prime}, \widetilde{\boldsymbol{a}}^{\prime} ; \widetilde{\boldsymbol{\theta}}\right)-\widetilde{Q}(\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}} ; \widetilde{\boldsymbol{\theta}})\right)^{2}\right]\right|_{\widetilde{\boldsymbol{a}}^{\prime}=\widetilde{\boldsymbol{\pi}}\left(\widetilde{\boldsymbol{o}}^{\prime}\right)} \tag{10.5.1}
$$
where $\widetilde{\mathcal{M}}$ denotes the advising-level replay memory.
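Eq. (10.5.1) is an ordinary squared TD error averaged over a batch from the advising-level replay memory; a sketch with illustrative function signatures:

```python
import numpy as np

def critic_loss(batch, q_fn, pi_fn, gamma):
    """Mean squared advising-level TD error, as in eq. (10.5.1).

    batch: tuples (o_tilde, a_tilde, r_tilde, o_tilde_next) sampled
           from the advising-level replay memory M̃.
    q_fn(o, a) -> scalar Q̃ estimate; pi_fn(o) -> next advising action ã'.
    """
    errors = []
    for o, a, r, o_next in batch:
        a_next = pi_fn(o_next)                     # ã' = π̃(õ')
        target = r + gamma * q_fn(o_next, a_next)  # r̃ + γ Q̃(õ', ã')
        errors.append((target - q_fn(o, a)) ** 2)
    return float(np.mean(errors))
```

The constraint $\widetilde{\boldsymbol{a}}^{\prime}=\widetilde{\boldsymbol{\pi}}(\widetilde{\boldsymbol{o}}^{\prime})$ appears here as the `pi_fn(o_next)` call: the bootstrap action comes from the current advising policies, not from the replayed transition.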
The advising policies are updated as follows:
$$
\begin{aligned}
\nabla_{\widetilde{\boldsymbol{\theta}}} J(\widetilde{\boldsymbol{\theta}}) &= \mathbb{E}_{\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}} \sim \widetilde{\mathcal{M}}}\left[\nabla_{\widetilde{\boldsymbol{\theta}}} \log \widetilde{\boldsymbol{\pi}}(\widetilde{\boldsymbol{a}} \mid \widetilde{\boldsymbol{o}})\, \widetilde{Q}(\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}} ; \widetilde{\boldsymbol{\theta}})\right] \\
&= \mathbb{E}_{\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}} \sim \widetilde{\mathcal{M}}}\left[\sum_{\alpha \in\{i, j\},\, \rho \in\{S, T\}} \nabla_{\widetilde{\theta}_{\rho}^{\alpha}} \log \widetilde{\pi}_{\rho}^{\alpha}\left(\widetilde{a}_{\rho}^{\alpha} \mid \widetilde{o}_{\rho}^{\alpha}\right) \widetilde{Q}(\widetilde{\boldsymbol{o}}, \widetilde{\boldsymbol{a}} ; \widetilde{\boldsymbol{\theta}})\right]
\end{aligned} \tag{10.5.2}
$$
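For one of the four advising-level policies, the per-policy term in eq. (10.5.2), $\nabla_{\widetilde{\theta}} \log \widetilde{\pi}(\widetilde{a}\mid\widetilde{o})\,\widetilde{Q}$, can be sketched for a linear-softmax parameterization (the parameterization is my own assumption; in an autodiff framework this would just be a `-log π̃ · Q̃` loss):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

def actor_grad(theta, o_tilde, a_tilde, q_value):
    """Policy-gradient estimate for one advising-level policy:
    grad_theta log π̃_θ(ã | õ) * Q̃(õ, ã), for a linear-softmax π̃.

    theta:   (n_actions, obs_dim) weight matrix (illustrative).
    o_tilde: advising-level observation vector õ.
    a_tilde: advising-level action index ã taken.
    q_value: centralized critic estimate Q̃(õ, ã)."""
    probs = softmax(theta @ o_tilde)
    # For linear softmax, grad_theta log π(a|o) = (one_hot(a) - probs) ⊗ o.
    one_hot = np.zeros_like(probs)
    one_hot[a_tilde] = 1.0
    return np.outer(one_hot - probs, o_tilde) * q_value
```

Summing such terms over the four role/agent combinations $(\alpha, \rho)$ gives the full gradient in eq. (10.5.2); note that every policy's gradient is weighted by the same centralized critic $\widetilde{Q}$.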
The complete algorithm is as follows: