Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning (Paper Notes)

This post discusses adversarial attacks on observations in deep reinforcement learning. It first covers background material, including the detailed steps of the policy-based approach, and then turns to how such attacks can be understood, in particular through the state-adversarial Markov decision process (SA-MDP) and the interpretation of attacks in function space. It also describes the attack objective, the difficulties involved, and how the authors construct a strong attacker through a two-stage optimization.

I. Background

1. Introduction

(figures omitted)

2. Methods

  • policy-based approach (learning an actor)
  • value-based approach (learning a critic)
  • actor + critic (A3C, A2C)

1. Policy-based approach

1.1 Start with a picture

(figure omitted)

1.2 The three steps of machine learning

Step 1: define a function (the actor)
(figure omitted)
Step 2: decide how good a function is

Suppose the actor (denoted $\pi_\theta(s)$) plays one game from start to finish, producing the trajectory
$\tau=\{ s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_T,a_T,r_T\}$,
with total reward $R_\theta=\sum_{t=1}^{T}r_t$.
Because both the actor and the game are stochastic, $R_\theta$ is a random variable, so instead we maximize its expected value $\bar{R}_\theta$.
Expectation: $\bar{R}_\theta=\sum_{\tau}R(\tau)\,p(\tau \mid \theta)$.
Sampling trajectories $\{\tau^1,\tau^2,\dots,\tau^N\}$ to estimate this expectation:
$\bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N}R(\tau^n)$
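
To make the sampling step concrete, here is a minimal sketch of estimating $\bar{R}_\theta$ by Monte Carlo rollouts. It assumes the `gymnasium` package and uses a random actor as a stand-in for $\pi_\theta$; the environment name, episode count, and function name are illustrative, not from the paper.

```python
# Minimal sketch: estimate the expected return R̄_θ ≈ (1/N) Σ_n R(τ^n)
# by averaging the total reward over N sampled episodes.
import gymnasium as gym
import numpy as np

def estimate_expected_return(env_name="CartPole-v1", n_episodes=50, seed=0):
    env = gym.make(env_name)
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        total_reward, done = 0.0, False
        while not done:
            # Random stand-in for a_t ~ π_θ(· | s_t); a trained actor goes here.
            action = int(rng.integers(env.action_space.n))
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        returns.append(total_reward)          # R(τ^n)
    return float(np.mean(returns))            # ≈ R̄_θ

print(estimate_expected_return())
```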

Step 3: pick the best function

1. Objective: $\theta^*=\arg\max_{\theta}\bar{R}_{\theta}$
2. Gradient ascent (policy gradient): $\theta^{new} \leftarrow \theta^{old}+\eta\nabla \bar{R}_\theta$
3. Derivation:
$$
\begin{aligned}
\bar{R}_\theta &=\sum_{\tau}R(\tau)\,p(\tau \mid \theta) \\
\nabla \bar{R}_\theta &= \sum_{\tau}R(\tau)\,\nabla p(\tau \mid \theta) \\
&=\sum_{\tau}R(\tau)\,p(\tau \mid \theta)\,\frac{\nabla p(\tau \mid \theta)}{p(\tau \mid \theta)} \\
&=\sum_{\tau}R(\tau)\,p(\tau \mid \theta)\,\nabla \log p(\tau \mid \theta) \\
&\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^n)\,\nabla \log p(\tau^n \mid \theta) \\
\tau &=\{ s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_T,a_T,r_T\} \\
p(\tau \mid \theta) &=p(s_1)\,p(a_1 \mid s_1,\theta)\,p(r_1,s_2 \mid s_1,a_1)\,p(a_2 \mid s_2,\theta)\,p(r_2,s_3 \mid s_2,a_2) \dots \\
&=p(s_1)\prod_{t=1}^{T}p(a_t \mid s_t,\theta)\,p(r_t,s_{t+1} \mid s_t,a_t) \\
\nabla \bar{R}_\theta &\approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\,\nabla \log p(a_t^n \mid s_t^n,\theta)
\end{aligned}
$$
Conclusion: if $R(\tau^n)>0$, increase $p(a_t^n \mid s_t^n)$; if $R(\tau^n)<0$, decrease $p(a_t^n \mid s_t^n)$.
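
The estimator derived above is the classic REINFORCE update. Below is a minimal sketch of it (assuming PyTorch and `gymnasium`; the small softmax policy, the environment, and the hyperparameters are illustrative choices, not from the paper).

```python
# Minimal REINFORCE sketch: ascend ∇R̄_θ ≈ (1/N) Σ_n Σ_t R(τ^n) ∇log p(a_t|s_t, θ).
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
policy = torch.nn.Sequential(
    torch.nn.Linear(env.observation_space.shape[0], 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, env.action_space.n),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))     # log p(a_t | s_t, θ)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # The whole-episode return R(τ) weights every log-probability equally,
    # exactly as in the estimator above (improved in the next subsection).
    episode_return = sum(rewards)
    loss = -episode_return * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```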

4. Improvements
Reason: in $\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\,\nabla \log p(a_t^n \mid s_t^n,\theta)$, every action within one game receives the same weight. Even if $R(\tau^n)>0$, not every action in $\tau^n$ earns a positive reward, and not every action matters equally, so different actions should be given different weights.
Option 1: the credit assigned to an action should be the sum of the rewards from that action onward:
$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t^\prime=t}^{T_n}r_{t^\prime}^n\Big)\nabla \log p(a_t^n \mid s_t^n,\theta)$
Option 2: every reward obtained after an action may be a consequence of that action, but the further away in time, the weaker the influence, so later rewards are discounted:
$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Big(\sum_{t^\prime=t}^{T_n}\gamma^{t^\prime-t}r_{t^\prime}^n\Big)\nabla \log p(a_t^n \mid s_t^n,\theta)$
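
Both improvements can be expressed as one weighting function; `reward_to_go_weights` below is a hypothetical helper (not from the paper) that plugs into the REINFORCE sketch above.

```python
# Weight each log p(a_t|s_t, θ) by the (optionally discounted) reward-to-go
# instead of the whole-episode return R(τ).
import torch

def reward_to_go_weights(rewards, gamma=1.0):
    """Return w_t = Σ_{t'>=t} γ^{t'-t} r_{t'} for one episode.
    gamma=1.0 gives Option 1; gamma<1 gives Option 2."""
    weights, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        weights.append(running)
    weights.reverse()
    return torch.tensor(weights, dtype=torch.float32)

# Usage inside the REINFORCE loop, replacing the episode_return line:
# weights = reward_to_go_weights(rewards, gamma=0.99)
# loss = -(weights * torch.stack(log_probs)).sum()
```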

II. Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning

1. Attack objective

Concrete objective: reduce the victim's expected total reward.

2. Difficulties

  • The environment changes dynamically, and …

3. What the authors did

1. Classify previous adversarial attacks into three function spaces:

  • Space 1: the attacker misleads the agent into taking a suboptimal action;

  • Space 2: the attacker either misleads the agent into taking a suboptimal action or lets it keep its original action;

  • Space 3: the attacker lures the agent into following a harmful policy;

2. Prove that Space 1 $\subseteq$ Space 2 $\subseteq$ Space 3, and that Space 3 is the space that yields the strongest attacker;
3. Based on Space 3, split the task into a two-stage optimization: the first stage trains a deceptive policy, which explores the environment dynamics and, under the modified reward function, achieves the lowest possible expected total reward; the second stage makes the victim imitate the deceptive policy, so that the victim's expected total reward is driven to its minimum as well (a minimal sketch of the first stage is given right after this list).
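
One plausible reading of "the modified reward function" in the first stage is a sign-flipped reward, so that maximizing the wrapped reward minimizes the original expected return. Below is a minimal sketch under that assumption; the `NegatedReward` wrapper and the training note are illustrative, not the paper's implementation.

```python
# Stage-1 sketch: train a policy on the same environment but with the reward
# negated, so that maximizing the wrapped reward minimizes the victim's
# original expected return. Assumes gymnasium; training can reuse the
# REINFORCE sketch from the background section.
import gymnasium as gym

class NegatedReward(gym.RewardWrapper):
    """Flip the reward sign: a policy trained here (the 'deceptive policy')
    learns behaviour with the lowest original return."""
    def reward(self, reward):
        return -reward

env = NegatedReward(gym.make("CartPole-v1"))
# ...train the deceptive policy on `env` with any policy-gradient method...
```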

4. Details

1. State-adversarial Markov decision process (SA-MDP)

$g^*=\arg\min_{g \in G} \mathbb{E}_{a_t \sim \pi_g(\cdot \mid s_t),\, s_{t+1} \sim P_a(s_t,a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t r_t\right]$
where
attacker set: $G$;
attacker: $g\in G$, $g: S \rightarrow F(S)$;
state set: $S$;
$F(S)$: the set of distributions over $S$;
discount factor: $\gamma$;
action set: $A$;
transition function: $P_a: S \times A \rightarrow F(S)$;
reward function: $R: S \times A \times S \rightarrow \mathbb{R}$;
policy: $\pi: S \rightarrow F(A)$
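
To make the SA-MDP objective concrete, here is a minimal rollout sketch: the adversary $g$ only alters what the victim observes, while the environment itself still transitions on the true state. `victim_policy` and `adversary` are hypothetical placeholders (any callables with these signatures), not the paper's code.

```python
# One SA-MDP rollout: return Σ_t γ^t r_t for one episode in which the victim
# acts on the perturbed observation g(s_t) but the environment steps on s_t.
import gymnasium as gym
import numpy as np

def rollout(env, victim_policy, adversary, gamma=0.99):
    true_state, _ = env.reset()
    ret, discount, done = 0.0, 1.0, False
    while not done:
        perturbed = adversary(true_state)      # g(s_t): what the victim sees
        action = victim_policy(perturbed)      # a_t ~ π(· | g(s_t))
        true_state, reward, term, trunc, _ = env.step(action)  # s_{t+1} ~ P(s_t, a_t)
        ret += discount * reward
        discount *= gamma
        done = term or trunc
    return ret

env = gym.make("CartPole-v1")
identity_adversary = lambda s: s               # no attack: recovers the usual MDP
random_victim = lambda s: int(np.random.randint(env.action_space.n))
print(rollout(env, random_victim, identity_adversary))
```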

2. Understanding adversarial attacks in function space

$H$: the adversary's function space;
$h(s_t)=s_t+\delta_{s_t}$;
$H=\{ h \mid \|h(s)-s\|_p\leq\epsilon,\ \forall s\in S\}$;
$h^*=\arg\min_{h \in H} \mathbb{E}_{a \sim \pi_{h}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
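
As an illustration of the constraint set $H$ only (not the paper's two-stage attack), the sketch below runs a few projected-gradient steps inside the $\ell_\infty$ ball of radius $\epsilon$ to push the victim's policy away from the action it would take on the clean state. It assumes PyTorch and a policy mapping a state tensor to action logits; `perturb_state`, the step count, step size, and loss choice are illustrative.

```python
# Search over H = {h : ||h(s) - s||_∞ ≤ ε} with projected gradient steps.
import torch

def perturb_state(policy, state, eps=0.05, steps=10, lr=0.01):
    s = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():
        clean_action = policy(s).argmax()      # action on the unperturbed state
    delta = torch.zeros_like(s, requires_grad=True)
    for _ in range(steps):
        logits = policy(s + delta)
        # Minimize the log-probability of the clean action => mislead the victim.
        loss = torch.log_softmax(logits, dim=-1)[clean_action]
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()
            delta.clamp_(-eps, eps)            # projection onto the l_∞ ε-ball
        delta.grad.zero_()
    return (s + delta).detach()                # h(s) = s + δ_s, with h ∈ H
```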
