LAFEAT: Piercing Through Adversarial Defenses with Latent Features (Paper Walkthrough)

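Abstract

In this paper, we show that the latent features of certain "robust" models are surprisingly vulnerable to adversarial attacks. Building on this, we propose LAFEAT, a unified $\ell_\infty$ white-box attack algorithm that exploits latent features in its gradient descent steps. We show that it not only completes attacks more efficiently, but is also a stronger adversary than the current state of the art across a wide range of defense mechanisms. This suggests that model robustness may hinge on how effectively the defender uses hidden components, and that robustness should no longer be viewed from a holistic perspective.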

1 Introduction

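As feature extractors, the shallow layers of a CNN extract simple local textures, while neurons in deeper layers specialize in complex objects [40, 10]. Motivated by this, we expect that wrongly extracted shallow features cannot be assembled into correct high-level features, and that the error cascades through the subsequent layers. To illustrate, we modified PGD to attack an intermediate layer by maximizing only the loss of an attacker-trained classifier; for now we call this LPGD. In Figure 1, we scrambled the intermediate-layer features of a network with LPGD and observed that the feature difference between natural and adversarial images keeps growing with network depth.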

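[Figure 1]
Existing attack and defense strategies, however, treat the whole model as a single nonlinear differentiable function $f$ that maps input images to output logits. Although these approaches generalize well across models, they tend to ignore the latent features extracted by the model's intermediate layers.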

The authors pose two questions:

  • Can latent features be vulnerable to attacks?
  • Can the falsely extracted features be cascaded to the remaining layers to make the model output incorrect?

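It turns out that adversarial examples computed with LPGD can harm the accuracies of the "robust" models above (Figure 2). The experiments show that, although these models were trained to withstand PGD, they fail easily against attacks aimed at latent features. This may also imply that a flat model loss landscape with respect to the input image does not necessarily entail flat latent features w.r.t. the input. Existing attacks that rely on a holistic view of the model may therefore fail to provide a reliable assessment of model robustness.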

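[Figure 2]
Motivated by these findings, in this paper we propose a new attack method called LAFEAT, which tries to exploit latent features within a more general framework. It also draws on many effective techniques discovered in recent years, such as momentum [16, 14], surrogate loss [20, 15], step size schedule [14, 21], and multi-targeted attacks [21, 44, 43]. Our main contributions are as follows: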

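  • We introduce how intermediate layers of a network can be used for adversarial attacks.
  • We show that latent features provide faster convergence and accelerate gradient-based attacks.
  • By combining several effective attack strategies, we propose LAFEAT. Experiments show that it rivals existing methods in both attack performance and computational efficiency.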

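To the best of our knowledge, LAFEAT is currently the strongest against a wide variety of defense mechanisms and matches the current top-1 on the TRADES [64] CIFAR-10 white-box leaderboard (Section 4). Because latent features are highly vulnerable to adversarial attacks, which in turn can be used to break robust models, we believe that future robustness evaluations should depend on how effectively the hidden components of a defending model are used. In short, model robustness should no longer be viewed from a holistic perspective.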

2 Related Work

2.1 Adversarial examples

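We define the classifier network as $f_{\theta}(x)$, where $\theta$ denotes the model parameters and $f_{\theta}:\mathcal{I}\to\mathbb{R}^K$ maps an input image to classification outputs over $K$ classes. $\mathcal{I}\subset\mathbb{R}^{C\times H\times W}$ restricts images to a valid range.

The attacker's goal is to find an adversarial example $\hat{x}\in\mathcal{I}$ by solving the following optimization problem: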

$$\max\limits_{\hat{x}\in\mathcal{I}\,\land\,d(x,\hat{x})\le\epsilon}\mathcal{L}^{sce}(f_{\theta}(\hat{x}),y)\qquad(1)$$

Here $\mathcal{L}^{sce}(f_{\theta}(\hat{x}),y)$ denotes the softmax cross-entropy (SCE) between the output and the one-hot ground truth $y$. By maximizing this loss, we can obtain an $\hat{x}$ such that $\arg\max f_{\theta}(\hat{x})\ne\arg\max y$. The constraint $d(x,\hat{x})\le\epsilon$ bounds the distance between the original image and the adversarial example.

In general, the distance metric $d(x,\hat{x})$ is commonly defined as the $\ell_p$-norm of the difference between $x$ and $\hat{x}$ [51, 19, 35, 7]. Different choices of norm were explored in the literature, e.g. one pixel attacks [50] minimize the $\ell_0$-norm $\|x-\hat{x}\|_0$, while others may be interested in the standard Euclidean distance, the $\ell_2$-norm [51, 37, 8]. In this paper, we focus on another popular choice of distance metric, the $\ell_\infty$-norm $d(x,\hat{x})\triangleq\|x-\hat{x}\|_\infty$, as used in [19, 35].
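To make the three distance choices concrete, here is a minimal sketch that computes them for a pair of flattened toy "images". The function name `lp_distances` and the pixel values are illustrative assumptions, not part of the paper.

```python
def lp_distances(x, x_adv):
    """Return the (l0, l2, linf) distances between two flattened images."""
    diffs = [abs(a - b) for a, b in zip(x, x_adv)]
    l0 = sum(1 for d in diffs if d > 0)     # number of changed pixels
    l2 = sum(d * d for d in diffs) ** 0.5   # Euclidean distance
    linf = max(diffs)                       # largest per-pixel change
    return l0, l2, linf


# Toy example (assumed values): two of four pixels change by 0.25 each.
x = [0.0, 0.5, 1.0, 0.25]
x_adv = [0.0, 0.5, 0.75, 0.5]
l0, l2, linf = lp_distances(x, x_adv)
print(l0, l2, linf)
```

An $\ell_\infty$ constraint $d(x,\hat{x})\le\epsilon$ therefore only limits the largest single-pixel change, no matter how many pixels are modified.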

A popular attack that defenders use to evaluate white-box adversarial robustness is projected gradient descent (PGD). PGD finds adversarial examples by performing the following iterative update [35]:

$$\hat{x}_{i+1}=\mathcal{P}_{\epsilon,x}\left(\hat{x}_i+\alpha_i\,\mathrm{sign}\left(\nabla_{\hat{x}_i}\mathcal{L}^{sce}(f_{\theta}(\hat{x}_i),y)\right)\right)\qquad(2)$$

Initially, $\hat{x}_0=\mathcal{P}_{\epsilon,x}(x+u)$, where $u\sim\mathcal{U}([-\epsilon,\epsilon])$. The function $\mathcal{P}_{\epsilon,x}:\mathbb{R}^{C\times H\times W}\to\mathcal{I}$ clips its input into the $\epsilon$-ball around $x$ and into the valid range $\mathcal{I}$. $\nabla_{\hat{x}_i}\mathcal{L}^{sce}(f_{\theta}(\hat{x}_i),y)$ is the gradient of the loss with respect to the input $\hat{x}_i$. $\alpha_i$ denotes the step size, and $\mathrm{sign}(z)$ returns $1$, $0$, or $-1$ for each element of the tensor $z$ according to its sign. For simplicity, we define:

$$PGD_{\epsilon,x,y}(\mathcal{L},\alpha,i)\triangleq\hat{x}_i$$

i.e. the result of iterating for $i$ times with a sequence of step sizes $\alpha$ and the loss function $\mathcal{L}$ on the original image $x$.
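The update rule (2) can be sketched in a few lines. The example below is a dependency-free illustration under stated assumptions: the "model" is a toy two-class linear classifier whose SCE gradient is available in closed form, the valid range $\mathcal{I}$ is taken to be $[0,1]$, and all names (`pgd`, `project`, `sce_grad_x`) and numbers are hypothetical rather than taken from the paper.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, b, x):
    """Toy linear model: one logit per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def sce_grad_x(W, b, x, y):
    """Gradient of softmax cross-entropy w.r.t. the input: W^T (softmax(Wx+b) - y)."""
    p = softmax(logits(W, b, x))
    return [sum(W[k][j] * (p[k] - y[k]) for k in range(len(W)))
            for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def project(x_adv, x, eps):
    """Clip into the l_inf ball of radius eps around x, then into [0, 1]."""
    return [min(1.0, max(0.0, min(xj + eps, max(xj - eps, aj))))
            for aj, xj in zip(x_adv, x)]


def pgd(W, b, x, y, eps, alphas, rng):
    # Random start inside the eps-ball, as in the initialization of (2).
    x_adv = project([xj + rng.uniform(-eps, eps) for xj in x], x, eps)
    for alpha in alphas:
        g = sce_grad_x(W, b, x_adv, y)
        x_adv = project([aj + alpha * sign(gj) for aj, gj in zip(x_adv, g)], x, eps)
    return x_adv


W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x = [0.6, 0.4]   # classified as class 0 by the toy model
y = [1.0, 0.0]   # one-hot label for class 0
x_adv = pgd(W, b, x, y, eps=0.2, alphas=[0.05] * 20, rng=random.Random(0))
print(x_adv, logits(W, b, x_adv))
```

For a real network the closed-form gradient would be replaced by automatic differentiation; the sign step and the projection stay the same.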

Because the SCE loss function $\mathcal{L}^{sce}$ in the objective (1) is highly non-linear, easily saturated, and normally evaluated with limited floating-point precision, gradient-based attacks may experience vanishing gradients and difficulty converging [8, 14]. Recent attack methods hence use surrogate losses for gradient calculation [8, 21], and optimize an alternative objective by replacing $\mathcal{L}^{sce}$ with a custom surrogate loss function. As the alternative objective is usually aligned with the original, maximizing the former would also tend to maximize the latter.
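The saturation problem is easy to reproduce. The gradient of SCE with respect to the logits is $\mathrm{softmax}(z)-y$, so once the model is very confident in the true class the gradient rounds to exactly zero. The sketch below uses Python's double-precision floats, which is why the logit margin is exaggerated (an assumption for illustration); in lower precision the same effect appears at much smaller margins.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]  # math.exp underflows to 0.0 for very negative inputs
    s = sum(e)
    return [v / s for v in e]


def sce_grad_logits(z, y):
    """Gradient of softmax cross-entropy w.r.t. the logits: softmax(z) - y."""
    return [p - t for p, t in zip(softmax(z), y)]


y = [1.0, 0.0, 0.0]
moderate = sce_grad_logits([2.0, 1.0, 0.0], y)     # informative gradient
saturated = sce_grad_logits([800.0, 1.0, 0.0], y)  # every entry underflows to 0.0
print(moderate)
print(saturated)
```

With an all-zero gradient, the sign step in (2) has no direction to follow, which is the motivation for surrogate losses.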

Many auxiliary tricks can push the limit of existing attack methods. For instance, a step-size schedule [14, 21] with a decaying step size in relation to the iteration count could improve the overall success rate. Multi-targeted attack [21, 44, 43] uses a label-specific surrogate loss by enumerating all possible target labels. Attackers may also resort to an ensemble of multiple attack strategies, making the compound approach stronger than any individual attack [7, 14]. The latter two methods, however, tend to introduce an order of magnitude increase in the worst-case computational costs.

Generative networks that learn from the loss were proposed for adversarial example synthesis [5, 26]. This tactic can be further enhanced with generative adversarial networks (GANs) [18], where the discriminator network encourages the distribution of adversarial examples to become indistinguishable from that of natural examples [59, 36]. Finally, there are a few recent publications that leverage latent features in their attacks [30, 41]. Unlike these methods, LAFEAT considers $\ell_\infty$-norm white-box attacks, and further differentiates itself from them as it learns to attack defending models.

2.2 Defending against adversarial examples

3 The LAFEAT Method

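Figure 3 shows the LAFEAT attack pipeline:
[Figure 3]
First, to get a firm grip on the latent features, LAFEAT begins by training a fully connected layer for each residual block on the training set until convergence. Note that the parameters of the original model $f_{\theta}$ are kept unchanged throughout this process. To compute adversarial examples, we maximize the alternative adversarial loss $\hat{L}$, which is an adaptively-weighted sum of surrogate losses from individual layers. To test adversarial robustness, the generated adversarial examples are then fed to the original model $f_{\theta}$ for evaluation.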

3.1 Latent feature adversarial

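Following the footsteps of surrogate losses, in Section 1 we hypothesized that a similar indirect loss on latent features can also effectively strengthen adversarial attacks. LAFEAT uses the features extracted by intermediate layers to generate stronger adversarial examples against $f_{\theta}$. We assume the model $f_{\theta}$ is a sequence of $N$ layers (or residual blocks), which can be written as: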

$$f_{\theta}(x)=f^{(N)}(\cdots f^{(2)}(f^{(1)}(x))\cdots)\qquad(5)$$

Here $f^{(1)},f^{(2)},\dots,f^{(N)}$ denote the sequence of layers in the model. For brevity, we omit the parameters of each layer. We then formalize this proposal by generalizing the traditional PGD attack (2) into a latent-feature PGD (LFPGD) adversarial optimization problem:

$$\max\limits_{h,\lambda,\alpha,\mathcal{L}^{sur}}\mathcal{L}^{sce}\left(f_{\theta}\left(PGD_{\epsilon,x,y}(\mathcal{L}_{\lambda}^{lf},\alpha,I)\right),y\right),\quad\text{where}\quad\mathcal{L}^{lf}_{\lambda}(z)=\mathcal{L}^{sur}\Big(\sum\limits_{l\in[1:N]}\lambda^{(l)}h^{(l)}(z^{(l)}),y\Big)\qquad(6)$$

The symbols are defined as follows:

  • The constant $I$ is the maximum number of gradient-update iterations.
  • For each layer $l\in[1:N]$, $\lambda^{(l)}\in[0,1]$ assigns an importance weight to that layer's gradient, with $\sum\limits_{l\in[1:N]}\lambda^{(l)}=1$.
  • $z^{(l)}\triangleq f^{(l)}\circ\cdots\circ f^{(1)}(z)$ denotes the features extracted from the $l^{th}$ layer.
  • The function $h^{(l)}$ for the $l^{th}$ layer maps the features output by $f^{(l)}$ to logits.
  • $\alpha_i$ is the step size at iteration $i$, given by the step-size schedule $\alpha$.

Our goal is to find the right combination of logits functions $h\triangleq(h^{(1)},\dots,h^{(N)})$ and their corresponding weights $\lambda\triangleq(\lambda^{(1)},\dots,\lambda^{(N)})$ from intermediate layers, the step-size schedule $\alpha$, and the surrogate loss $\mathcal{L}^{sur}$ to use.

Unfortunately, the LFPGD problem above is intractable in practice. We therefore design methods that solve it approximately and nevertheless enable us to generate adversarial examples stronger than competing methods.
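As a rough illustration of the argument passed to $\mathcal{L}^{sur}$ in (6), the sketch below runs an input through a layer sequence as in (5), applies a logits function to each intermediate feature, and forms the $\lambda$-weighted sum. The two toy "layers", the heads, and the weights are made-up assumptions chosen so the arithmetic is easy to follow.

```python
def latent_logits(layers, heads, lams, x):
    """Weighted sum of per-layer logits: sum over l of lams[l] * heads[l](z_l)."""
    assert abs(sum(lams) - 1.0) < 1e-9 and all(0.0 <= w <= 1.0 for w in lams)
    z, total = x, None
    for f, h, lam in zip(layers, heads, lams):
        z = f(z)                       # z^(l) = f^(l)(z^(l-1))
        out = [lam * v for v in h(z)]  # lambda^(l) * h^(l)(z^(l))
        total = out if total is None else [a + b for a, b in zip(total, out)]
    return total


# Toy two-layer "network" (assumed): the last layer already outputs logits.
layers = [
    lambda v: [2.0 * a for a in v],          # f^(1)
    lambda v: [v[0] + v[1], v[0] - v[1]],    # f^(2), the logits layer
]
heads = [
    lambda z: list(z),   # h^(1): stand-in for a trained logits head
    lambda z: list(z),   # h^(N) is the identity
]
mixed = latent_logits(layers, heads, lams=[0.25, 0.75], x=[1.0, 3.0])
print(mixed)
```

In the full attack, this mixed logits vector would be fed to the surrogate loss and differentiated with respect to the input.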

3.2 Training intermediate logits layers

To exploit the latent features, we use SGD to train a logits layer $h^{(l)}$ until convergence, independently for every intermediate layer $f^{(l)}$ with $l\in[1:N-1]$. The function $h^{(l)}:\mathbb{R}^{C^{(l)}\times H^{(l)}\times W^{(l)}}\to\mathbb{R}^K$ is an auxiliary classifier consisting of a global average pooling layer ($\mathbb{R}^{C^{(l)}\times H^{(l)}\times W^{(l)}}\to\mathbb{R}^{C^{(l)}}$) followed by a fully connected layer, defined as:

$$h^{(l)}(x^{(l)})\triangleq\mathrm{pool}(x^{(l)})\,\phi^{(l)}+\eta^{(l)}\qquad(7)$$

Here $x^{(l)}$ denotes the features extracted from the $l^{th}$ layer, and $\phi^{(l)}\in\mathbb{R}^{C^{(l)}\times K}$ and $\eta^{(l)}\in\mathbb{R}^K$ are the trainable parameters of $h^{(l)}$. Because the final layer $f^{(N)}$ is already a logits layer, we set $h^{(N)}\triangleq\mathrm{id}$, the identity function.
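Equation (7) amounts to "average each channel, then apply one linear layer". A minimal pure-Python sketch of the forward pass follows; the feature map, $\phi$, and $\eta$ values are arbitrary assumptions, and in practice the parameters would be learned with SGD on top of the frozen model.

```python
def global_avg_pool(feat):
    """feat[c][i][j] with shape C x H x W -> list of C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]


def logits_head(feat, phi, eta):
    """h(x) = pool(x) @ phi + eta, with phi of shape C x K and eta of length K."""
    pooled = global_avg_pool(feat)
    return [sum(pooled[c] * phi[c][k] for c in range(len(pooled))) + eta[k]
            for k in range(len(eta))]


# Assumed toy feature map with C=2 channels of size 2x2 and K=3 classes.
feat = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0, mean 4.0
        [[0.0, 2.0], [2.0, 4.0]]]   # channel 1, mean 2.0
phi = [[1.0, 0.0, -1.0],
       [0.5, 1.0, 0.0]]
eta = [0.0, 0.1, 0.2]
head_out = logits_head(feat, phi, eta)
print(head_out)
```

Because the head has only $C^{(l)}\times K+K$ parameters, training one per residual block adds little cost compared with the attack itself.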

Depending on availability, these layers can be trained on the training set $\mathcal{D}_{train}$, on the set $\mathcal{D}_{attack}$ used for the attack, or on both. Although we use $\mathcal{D}_{train}$ in our experiments, in practice we observe negligible differences in attack strength between the choices given a sufficient number of training examples, as the sets are theoretically drawn from the same data distribution.

It is important to note that while training $h^{(l)}$, the original model $f_{\theta}$ is used purely as a feature extractor, with all of its training-time behaviors disabled (e.g. dropout, parameter updates). That is, the model parameters $\theta$, the layers $f^{(l)}$ and their parameters, batch normalization, and so on all stay constant; only the parameters of $h^{(l)}$ are trained.

3.3 Choosing intermediate layers to attack

Searching for $\lambda^{(1:N)}$ is difficult because of the computational cost of finding a statistically significant number of adversarial examples. For this reason, LAFEAT uses a greedy yet effective procedure to simplify the search. First, we enumerate all intermediate layers $l\in[1:N-1]$ and set:

$$\mathcal{L}_{\lambda}^{lf}(z,y)=\mathcal{L}^{sur}\left(\beta h^{(l)}(z^{(l)})+(1-\beta)f_{\theta}(z),y\right)\qquad(8)$$

Equivalently, $\lambda=\beta\,\mathrm{onehot}(l,N)+(1-\beta)\,\mathrm{onehot}(N,N)$, with $\beta$ initially set to $\frac{1}{2}$. In other words, the attack now uses only the $l^{th}$ layer together with the output layer at a time while disabling all others. With this method, we can discover the most effective layer for subsequent attack procedures across all images in $\mathcal{D}_{attack}$. Empirically, we find that for most defending models the weakest link is the penultimate residual block, so the whole search can simply be skipped for performance reasons. There are exceptions, however: for the model from [38], the sixth-to-last residual block exhibits the weakest defense, and attacking through this layer converges faster.

Finally, experiments show that the intermediate layer $l$ can be adaptively disabled if it misclassifies the adversarial example, i.e. when $\arg\max h^{(l)}(\hat{x}_i^{(l)})\ne\arg\max y$, where $\hat{x}_i^{(l)}$ denotes the layer-$l$ features of $\hat{x}_i$. This is done in order to optimize faster towards the original adversarial objective (1). We incorporate this in the final algorithm.
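The two-layer interpolation in (8) and the adaptive disabling rule can be sketched together. One assumption is made explicit here: "disabling" layer $l$ is interpreted as setting $\beta=0$ so that only the output layer contributes; the paper's Algorithm 1 may implement this differently. The function names and logits values are hypothetical.

```python
def argmax(v):
    return max(range(len(v)), key=lambda k: v[k])


def mixed_logits(head_logits, model_logits, y, beta=0.5):
    """beta * h(z_l) + (1 - beta) * f(z), following equation (8)."""
    if argmax(head_logits) != argmax(y):
        # Assumption: once the head is fooled, drop it and attack the output layer alone.
        beta = 0.0
    return [beta * h + (1.0 - beta) * f for h, f in zip(head_logits, model_logits)]


y = [1.0, 0.0, 0.0]
model_out = [4.0, 0.0, 2.0]
head_correct = [2.0, 1.0, 0.0]   # head still predicts class 0: layers are mixed
head_fooled = [0.0, 3.0, 1.0]    # head predicts class 1: layer l is disabled
active = mixed_logits(head_correct, model_out, y)
disabled = mixed_logits(head_fooled, model_out, y)
print(active, disabled)
```

The check is per example and per iteration, so different images in a batch can have the intermediate layer switched off at different times.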

3.4 Surrogate loss function

Gradients computed from the SCE loss have been notoriously shown to easily underflow in floating-point arithmetic [8, 14]. (It is worth looking into why SCE runs into this problem.) For this reason, surrogate loss functions have been proposed [8, 21, 14] to work around this limitation. Although they are very effective at breaking defense mechanisms, it is hard to explain why they work, because they do not maximize the original adversarial objective (1) directly. We therefore propose a small modification of the original SCE loss, which scales the logits adaptively before evaluating the softmax operation:

$$\mathcal{L}^{sur}(z,y)\triangleq\mathcal{L}^{sce}\left(\frac{z}{t}\left(y^Tz-\max((1-y)\cdot z)\right)^{-1},y\right)\qquad(9)$$

Here $\cdot$ denotes the element-wise product, and the temperature is $t=1$ in all our experiments. The term $y^Tz-\max((1-y)\cdot z)$ is the well-known difference of logits (DL) [8], which measures the gap between the largest and second-largest outputs of the logits $z$ when the example is correctly classified. The negated version of DL and the ratio-based variant, the difference of the logits ratio (DLR) loss, have respectively been used as surrogate losses in [8] and [14].

Our surrogate loss (9) has two advantages. First, it keeps the gradients from underflowing in floating-point arithmetic and improves convergence. Second, unlike the DL or DLR losses, it can still represent the original SCE loss in a faithful fashion, and all logits can still contribute to the final loss.

Finally, we define a targeted variant of $\mathcal{L}^{sur}$, which moves the logits towards a predefined target class $k$, with a one-hot vector $\tau=\mathrm{onehot}(k,K)$:

$$\mathcal{L}_{\tau}^{sur}(z,y)\triangleq-\mathcal{L}^{sce}\left(\frac{z}{t}\left(y^Tz-\max((1-y)\cdot z)\right)^{-1},\tau\right)\qquad(10)$$
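A small numeric sketch may help show what the scaling in (9) and (10) does. The code below follows the two formulas literally for a single logits vector; the function names and the example logits are assumptions, and the case where DL is exactly zero (where the scaling is undefined) is simply rejected.

```python
import math


def sce(z, target):
    """Softmax cross-entropy between logits z and a one-hot target."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(v - m) for v in z))
    return sum(t * (log_sum - v) for v, t in zip(z, target))


def diff_of_logits(z, y):
    """DL = y^T z - max((1 - y) * z), written exactly as in equation (9)."""
    true_logit = sum(v * t for v, t in zip(z, y))
    return true_logit - max((1.0 - t) * v for v, t in zip(z, y))


def scaled_logits(z, y, t=1.0):
    dl = diff_of_logits(z, y)
    if dl == 0:
        raise ValueError("DL is zero, so the scaling in (9) is undefined")
    return [v / (t * dl) for v in z]


def surrogate_loss(z, y, t=1.0):
    return sce(scaled_logits(z, y, t), y)            # equation (9)


def targeted_surrogate_loss(z, y, tau, t=1.0):
    return -sce(scaled_logits(z, y, t), tau)         # equation (10)


y = [1.0, 0.0, 0.0]
tau = [0.0, 1.0, 0.0]
confident = [800.0, 400.0, 0.0]   # plain SCE saturates on these logits
plain = sce(confident, y)
sur = surrogate_loss(confident, y)
sur_small = surrogate_loss([8.0, 4.0, 0.0], y)
tgt = targeted_surrogate_loss(confident, y, tau)
print(plain, sur, sur_small, tgt)
```

After scaling with $t=1$, the true-class logit leads the runner-up by exactly 1 whenever DL is positive, so the softmax cannot saturate, and the loss is unchanged if all logits are multiplied by the same positive constant.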

3.5 Summary

In addition to the original contributions explained above, we also employ simple yet helpful tactics from the prior literature. First, a step-size schedule with a linear decay, $2\epsilon(1-i/I)$, is used in the iterative updates, where $\epsilon$ is the $\ell_\infty$-norm perturbation bound in (1), $i$ is the current iteration number, and $I$ denotes the total number of iterations. Second, we adapt momentum-based updates from [14].
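The linear-decay schedule is straightforward to tabulate. In the sketch below, counting iterations from $i=0$ and the value $\epsilon=8/255$ are assumptions made for illustration; the momentum update from [14] is not reproduced here.

```python
def linear_decay_schedule(eps, num_iters):
    """Step sizes alpha_i = 2 * eps * (1 - i / I) for i = 0, ..., I - 1."""
    return [2.0 * eps * (1.0 - i / num_iters) for i in range(num_iters)]


eps = 8 / 255   # assumed perturbation bound
alphas = linear_decay_schedule(eps, num_iters=4)
print(alphas)
```

The first step is as large as the diameter of the $\epsilon$-ball, which lets the attack cross the whole feasible region early, while the later, smaller steps refine the perturbation.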

We summarize the LAFEAT algorithm in Algorithm 1:
[Algorithm 1]
The function LAFEAT_Attack takes the following inputs:

  • the model $f_{\theta}$
  • the pretrained logits function $h^{(l)}$ for the $l^{th}$ layer to be attacked jointly
  • the natural image $x$
  • the one-hot label $y$ of image $x$
  • the step size $\alpha$
  • the interpolation parameter $\beta$ between the $l^{th}$ layer and the output layer
  • the momentum weight $\mathcal{V}$
  • the perturbation bound $\epsilon$
  • the maximum number of iterations $I$