I. Background
1. Introduction
2. Methods
- policy-based approach (learning an actor)
- value-based approach (learning a critic)
- actor + critic (A3C, A2C)
1. Policy-based approach
1.1 Start with a picture
1.2 The three steps of machine learning
Step 1: define a function
Step 2: evaluate how good a function is
Suppose the actor (denoted $\pi_\theta(s)$) plays one game from start to finish, producing the trajectory $\tau=\{ s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_T,a_T,r_T\}$; the total reward is $R_\theta=\sum_{t=1}^{T}r_t$.
Because both the actor and the game are stochastic, $R_\theta$ is a random variable, so we maximize its expected value $\bar{R}_\theta$ instead.
Expectation: $\bar{R}_\theta=\sum_{\tau}R(\tau)p(\tau \vert \theta)$.
Sampling $\{\tau^1,\tau^2,\dots,\tau^N\}$ to estimate the population mean:
$$\bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N}R(\tau^n)$$
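A minimal sketch of this Monte Carlo estimate, assuming a hypothetical `run_episode(policy)` that plays one game with the actor and returns that episode's total reward $R(\tau^n)$ (neither name is from the original):

```python
import statistics

def estimate_expected_reward(policy, run_episode, n_episodes=1000):
    # Sample N trajectories and average their total rewards:
    # R̄_θ ≈ (1/N) Σ_n R(τ^n)
    returns = [run_episode(policy) for _ in range(n_episodes)]
    return statistics.mean(returns)
```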
Step 3: pick the best function
1. Objective: $\theta^*=\argmax_{\theta}\bar{R}_{\theta}$
2. Gradient ascent (policy gradient): $\theta^{new} \leftarrow \theta^{old}+\eta\nabla \bar{R}_\theta$
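Written out, one update step is just the rule above; a sketch assuming a hypothetical `grad_R(theta)` that returns $\nabla \bar{R}_\theta$ as a list of partial derivatives:

```python
def gradient_ascent_step(theta, grad_R, eta=0.01):
    # θ_new ← θ_old + η ∇R̄_θ, applied element-wise;
    # note the "+": we maximize R̄_θ, unlike gradient descent on a loss.
    return [p + eta * g for p, g in zip(theta, grad_R(theta))]
```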
3. Derivation
$$\begin{aligned}\bar{R}_\theta &=\sum_{\tau}R(\tau)p(\tau \vert \theta) \\ \nabla \bar{R}_\theta &= \sum_{\tau}R(\tau)\nabla{p(\tau \vert \theta)} \\ &=\sum_{\tau}R(\tau)p(\tau \vert \theta)\frac{\nabla{p(\tau \vert \theta)}}{p(\tau \vert \theta)} \\ &=\sum_{\tau}R(\tau)p(\tau \vert \theta) \nabla{\log p(\tau \vert \theta)} \\ &\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\nabla{\log p(\tau^n \vert \theta)} \\ \tau &=\{ s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_T,a_T,r_T\} \\ p(\tau \vert \theta) &=p(s_1)p(a_1 \vert s_1,\theta)p(r_1,s_2 \vert s_1,a_1)p(a_2 \vert s_2,\theta)p(r_2,s_3 \vert s_2,a_2) \dots \\ &=p(s_1)\prod_{t=1}^{T}p(a_t\vert s_t,\theta)p(r_t,s_{t+1} \vert s_t,a_t) \\ \nabla \bar{R}_\theta &\approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n) \nabla \log p(a_t^n \vert s_t^n,\theta)\end{aligned}$$
The last step holds because $\log p(\tau \vert \theta)=\log p(s_1)+\sum_{t=1}^{T}\left[\log p(a_t \vert s_t,\theta)+\log p(r_t,s_{t+1} \vert s_t,a_t)\right]$, and only the $p(a_t \vert s_t,\theta)$ terms depend on $\theta$, so the initial-state and transition terms vanish under $\nabla$.
Conclusion: if $R(\tau^n)>0$, increase $p(a_t^n \vert s_t^n)$; if $R(\tau^n)<0$, decrease $p(a_t^n \vert s_t^n)$.
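A minimal sketch of this estimator for one sampled episode, written in PyTorch (an assumption; the original names no framework). The names `policy_net`, `states`, `actions`, and `episode_return` are illustrative placeholders. Minimizing the negative objective performs the gradient ascent above:

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(policy_net, states, actions, episode_return):
    # states: (T, obs_dim) float tensor, actions: (T,) long tensor,
    # episode_return: scalar R(τ) shared by every step of the episode.
    logits = policy_net(states)                                # (T, n_actions)
    log_probs = Categorical(logits=logits).log_prob(actions)   # log p(a_t|s_t,θ)
    # Negative of R(τ) · Σ_t log p(a_t|s_t,θ): descending it ascends ∇R̄_θ.
    return -(episode_return * log_probs).sum()

# Usage: loss = reinforce_loss(net, s, a, R); loss.backward(); optimizer.step()
```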
4. Improvements
Motivation: in $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n) \nabla \log p(a_t^n \vert s_t^n,\theta)$, every action in an episode gets the same weight. Even when $R(\tau^n)>0$, it is not the case that every action in $\tau^n$ deserves positive credit, nor that every action matters equally, so different actions should be given different weights.
Variant 1: the credit for an action should be the total reward from the moment the action is taken until the end of the episode:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(\sum_{t^\prime=t}^{T_n}r_{t^\prime}^n\right)\nabla \log p(a_t^n \vert s_t^n,\theta)$$
Variant 2: every reward that follows an action may be a consequence of it, but the longer the delay, the smaller the action's influence, so discount later rewards:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\left(\sum_{t^\prime=t}^{T_n}\gamma^{t^\prime-t}r_{t^\prime}^n\right)\nabla \log p(a_t^n \vert s_t^n,\theta)$$
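Both weighting schemes reduce to one backward pass over the episode's rewards ($\gamma=1$ recovers Variant 1); a small self-contained sketch:

```python
def rewards_to_go(rewards, gamma=0.99):
    # weights[t] = Σ_{t'=t..T} γ^(t'-t) · r_{t'}, computed with the
    # backward recursion G_t = r_t + γ · G_{t+1}.
    weights = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running
    return weights

# Example: rewards_to_go([1, 0, 2], gamma=0.9) -> [2.62, 1.8, 2.0]
```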
II. Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning
1. Attack objective
Concrete objective: lower the victim's expected total reward.
2. Difficulties
- The environment changes dynamically, and …
3. What the authors do
1. They classify prior adversarial attacks into 3 classes according to the attacker's function space:
- Space 1: the attacker misleads the agent into taking a suboptimal action;
- Space 2: the attacker either misleads the agent into a suboptimal action or lets it keep its original action;
- Space 3: the attacker lures the agent into learning a harmful policy;
2. They prove that Space 1 $\subseteq$ Space 2 $\subseteq$ Space 3, and that Space 3 is the space that yields the strongest attackers;
3. Building on Space 3, they cast the task as a two-stage optimization: the first stage trains a decisive policy, which explores the environment's dynamics and, under the modified reward function, attains the lowest expected total reward; the second stage makes the victim imitate the decisive policy, so that the victim's expected total reward is driven to its minimum.
4. Details
1. State-Adversarial Markov Decision Process (SA-MDP)
$$g^*=\argmin_{g \in G} \mathbb{E}_{a_t \sim \pi_g(\cdot \vert s_t),\,s_{t+1} \sim P_a(s_t,a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t r_t\right]$$
where:
- attacker set: $G$
- attacker: $g\in G:S \rightarrow F(S)$
- state set: $S$
- distributions over $S$: $F(S)$
- discount factor: $\gamma$
- action set: $A$
- transition function: $P_a:S \times A \rightarrow F(S)$
- reward function: $R:S \times A \times S \rightarrow \mathbb{R}$
- policy: $\pi:S \rightarrow F(A)$
2. Understanding adversarial attacks in function space
$H$: the adversary's function space; each $h$ adds a perturbation to the observed state:
$$h(s_t)=s_t+\delta_{s_t}$$
$$H=\{ h \vert \|h(s)-s\|_p\leq\epsilon,\ \forall s\in S\}$$
$$h^*=\argmin_{h \in H}\left[\mathbb{E}_{a \sim \pi_{h}}\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
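A sketch of one member of this family under the $p=\infty$ norm: an FGSM-style one-step perturbation that stays inside the $\epsilon$-ball and pushes probability away from the action the clean policy prefers. The function and variable names are illustrative assumptions, not the paper's method:

```python
import torch

def linf_observation_attack(policy_net, state, epsilon=0.01):
    # state: 1-D observation tensor; returns h(s) = s + δ_s
    # with ‖h(s) − s‖_∞ ≤ ε, so h stays inside H.
    state = state.clone().detach().requires_grad_(True)
    logits = policy_net(state)
    best_action = logits.argmax()
    # Loss rises as p(best_action | s) falls; one ascent step on it
    # makes the victim less likely to pick its original action.
    loss = -torch.log_softmax(logits, dim=-1)[best_action]
    loss.backward()
    delta = epsilon * state.grad.sign()
    return (state + delta).detach()
```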