LAFEAT: Piercing Through Adversarial Defenses with Latent Features (Paper Walkthrough)

你回到了你的家

Article link: https://blog.csdn.net/kking_edc/article/details/121225377

Abstract

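In this paper, we show that the latent features of certain "robust" models are surprisingly susceptible to adversarial attacks. Building on this, we propose LAFEAT, a unified $\ell_{\infty}$ white-box attack algorithm that exploits latent features in its gradient descent steps. We show that it not only completes successful attacks far more efficiently, but is also a stronger adversary than the current state of the art across a wide range of defense mechanisms. This suggests that model robustness may be contingent on how effectively the defender's hidden components are used, and that robustness should no longer be viewed from a holistic perspective.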

1 Introduction

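As feature extractors, the shallow layers of a CNN extract simple local textures, while neurons in deeper layers extract complex objects [40, 10]. Motivated by this, we hypothesize that falsely extracted shallow features cannot be assembled into correct high-level features, and that the error has a cascading effect on the subsequent layers. To illustrate, we modify PGD so that it attacks an intermediate layer by maximizing only the loss of a classifier trained by the attacker on that layer's features; for now we call this LPGD. In Figure 1, we scramble the intermediate-layer features of the network with the LPGD attack and observe that the feature difference between natural and adversarial images keeps growing as the layers get deeper.

[Figure 1]

However, existing attack and defense strategies treat the entire model as a single non-linear differentiable function $f$ that maps input images to output logits. Although these approaches generalize well across models, they tend to ignore the latent features extracted by the intermediate layers inside the model.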

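The authors raise two questions:

Can latent features be vulnerable to attacks?
Can the falsely extracted features be cascaded to the remaining layers to make the model output incorrect?

It turns out that adversarial examples computed by LPGD can harm the accuracies of the "robust" models above (Figure 2). The experiments show that although these models were trained to withstand PGD, they fail easily when faced with attacks that target their latent features. This may also suggest that a flat model loss landscape with respect to the input image does not necessarily entail flat latent features w.r.t. the input. Existing attacks that rely on a holistic view of the model may therefore fail to provide a reliable assessment of model robustness.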

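[Figure 2]

Motivated by these findings, in this paper we propose a new attack method called LAFEAT, which attempts to exploit latent features in a more generalized framework. It also draws on many effective techniques discovered in recent years, such as momentum [16, 14], surrogate loss [20, 15], step size schedule [14, 21], and multi-targeted attacks [21, 44, 43]. Our main contributions are as follows:

We introduce how the intermediate layers of a network can be used for adversarial attacks.
We show that latent features provide faster convergence and accelerate gradient-based attacks.
By combining multiple effective attack tactics, we propose LAFEAT. Experimental results show that it rivals existing methods in both attack performance and computational efficiency.

To the best of our knowledge, LAFEAT is currently the strongest against a wide variety of defense mechanisms and matches the current top-1 on the TRADES [64] CIFAR-10 white-box leaderboard (Section 4). Since latent features are highly vulnerable to adversarial attacks, and this can in turn be used to break robust models, we believe that future evaluations of model robustness should depend on how effectively the hidden components of the defending model are used. In short, model robustness should no longer be viewed from a holistic perspective.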

2 Related Work

2.1 Adversarial examples

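We define the classifier network as $f_{\theta}(x)$, where $\theta$ denotes the model parameters and $f_{\theta}:\mathcal{I}\to\mathbb{R}^K$ maps an input image to classification logits. $\mathcal{I}\subset\mathbb{R}^{C\times H\times W}$ restricts images to a valid range.

The attacker's goal is to find an adversarial example $\hat{x}\in\mathcal{I}$ by solving the following optimization problem:

$\max\limits_{\hat{x}\in\mathcal{I}\,\land\,d(x,\hat{x})\le\epsilon}\mathcal{L}^{sce}(f_{\theta}(\hat{x}),y)\quad\quad\quad\quad\quad(1)$

Here $\mathcal{L}^{sce}(f_{\theta}(\hat{x}),y)$ denotes the softmax cross-entropy (SCE) between the output and the one-hot ground truth $y$. By maximizing this loss, we can obtain an $\hat{x}$ for which $\arg\max f_{\theta}({\hat{x}})\ne\arg\max y$. The constraint $d(x,\hat{x})\le\epsilon$ bounds the distance between the original image and the adversarial example.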

In general, the distance metric $d(x,\hat{x})$ is commonly defined as the $\ell_p$-norm of the difference between $x$ and $\hat{x}$ [51, 19, 35, 7]. Different choices of norm were explored in the literature, e.g. one-pixel attacks [50] minimize the $\ell_0$-norm $\|x-\hat{x}\|_0$, while others may be interested in the standard Euclidean distance, the $\ell_2$-norm [51, 37, 8]. In this paper, we focus on another popular choice of distance metric, the $\ell_{\infty}$-norm $d(x,\hat{x})\triangleq\|x-\hat{x}\|_{\infty}$, as used in [19, 35].

A popular attack that defenders use to evaluate white-box adversarial robustness is projected gradient descent (PGD). PGD finds adversarial examples by performing the following iterative update [35]:

$\hat{x}_{i+1}=\mathcal{P}_{\epsilon,x}(\hat{x}_i+\alpha_i \operatorname{sign}(\nabla_{\hat{x}_i}\mathcal{L}^{sce}(f_{\theta}(\hat{x}_i),y)))\quad\quad\quad\quad\quad(2)$

Initially, $\hat{x}_0=\mathcal{P}_{\epsilon,x}(x+u)$, where $u\sim\mathcal{U}([-\epsilon,\epsilon])$. The function $\mathcal{P}_{\epsilon,x}:\mathbb{R}^{C\times H\times W}\to\mathcal{I}$ clips its input into both the $\epsilon$-ball neighborhood of $x$ and the valid range $\mathcal{I}$. $\nabla_{\hat{x}_i}\mathcal{L}^{sce}(f_{\theta}(\hat{x}_i),y)$ is the gradient of the loss with respect to the input $\hat{x}_i$. $\alpha_i$ denotes the step size, and $\operatorname{sign}(z)$ returns $1$, $0$, or $-1$ for each element of the tensor $z$ according to its sign. For simplicity, we define:

$PGD_{\epsilon,x,y}(\mathcal{L},\alpha,i)\triangleq\hat{x}_i$

i.e. the result of iterating for $i$ times with a sequence of step sizes $\alpha$ and the loss function $\mathcal{L}$ on the original image $x$.
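Update (2) can be sketched on a toy model. The snippet below is a minimal sketch, not the paper's implementation: it assumes a hypothetical linear softmax classifier $f(x)=Wx+b$ (so the input gradient has the closed form $W^T(\mathrm{softmax}(f(x))-y)$), a valid image range of $[0,1]$, and a constant step size. The names `pgd`, `project`, and `sce_loss_and_grad` are illustrative only.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sce_loss_and_grad(W, b, x, y):
    """SCE loss of the toy classifier f(x) = W x + b and its gradient w.r.t. x."""
    p = softmax(W @ x + b)
    loss = -np.log(p[np.argmax(y)])
    grad_x = W.T @ (p - y)               # closed form for a linear model
    return loss, grad_x

def project(x_adv, x, eps):
    """P_{eps,x}: clip into the l_inf eps-ball around x, then into [0, 1]."""
    x_adv = np.clip(x_adv, x - eps, x + eps)
    return np.clip(x_adv, 0.0, 1.0)

def pgd(W, b, x, y, eps, alphas, rng):
    """Apply update (2) once per step size, from a random start in the eps-ball."""
    x_adv = project(x + rng.uniform(-eps, eps, size=x.shape), x, eps)
    for alpha in alphas:
        _, grad = sce_loss_and_grad(W, b, x_adv, y)
        x_adv = project(x_adv + alpha * np.sign(grad), x, eps)
    return x_adv

rng = np.random.default_rng(0)
D, K = 16, 3                             # toy input dimension and class count
W, b = rng.normal(size=(K, D)), np.zeros(K)
x = rng.uniform(0.2, 0.8, size=D)
y = np.eye(K)[np.argmax(W @ x + b)]      # use the clean prediction as the label
eps = 8 / 255
x_adv = pgd(W, b, x, y, eps, alphas=[eps / 4] * 20, rng=rng)
clean_loss, _ = sce_loss_and_grad(W, b, x, y)
adv_loss, _ = sce_loss_and_grad(W, b, x_adv, y)
```

For a real network the closed-form gradient would be replaced by automatic differentiation; the projection and sign-step structure stay the same.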

Because the SCE loss function $\mathcal{L}^{sce}$ in the objective (1) is highly non-linear, easily saturated, and normally evaluated with limited floating-point precision, gradient-based attacks may experience vanishing gradients and difficulty converging [8, 14]. Recent attack methods hence use surrogate losses instead for gradient calculation [8, 21], and optimize an alternative objective by replacing $\mathcal{L}^{sce}$ with a custom surrogate loss function. As the alternative objective is usually aligned with the original, maximizing the former also tends to maximize the latter.

Many auxiliary tricks can push the limit of existing attack methods. For instance, a step-size schedule [14, 21] with a step size that decays with the iteration count could improve the overall success rate. Multi-targeted attacks [21, 44, 43] use label-specific surrogate losses by enumerating all possible target labels. Attackers may also resort to an ensemble of multiple attack strategies, making the compound approach stronger than any individual attack [7, 14]. The latter two methods, however, tend to introduce an order of magnitude increase in the worst-case computational costs.

Generative networks that learn from the loss were proposed for adversarial example synthesis [5, 26]. This tactic can be further enhanced with generative adversarial networks (GANs) [18], where the discriminator network encourages the distribution of adversarial examples to become indistinguishable from that of natural examples [59, 36]. Finally, there are a few recent publications that leverage latent features in their attacks [30, 41]. Unlike these methods, LAFEAT considers $\ell_{\infty}$-norm white-box attacks, and further differentiates itself from them as it learns to attack defending models.

2.2 Defending against adversarial examples

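3 The LAFEAT Method

The LAFEAT attack pipeline is shown in Figure 3:

[Figure 3]

First, to get a firm grip on the latent features, LAFEAT begins by training a fully connected layer for each residual block on the training set until convergence. Note that the parameters of the original model $f_{\theta}$ are kept unchanged throughout this process. To compute adversarial examples, we maximize the alternative adversarial loss $\hat{L}$, which is an adaptively-weighted sum of surrogate losses from individual layers. To test adversarial robustness, the generated adversarial examples are then passed to the original model $f_{\theta}$ for evaluation.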

3.1 Latent feature adversarial

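Following the footsteps of surrogate losses, in Section 1 we hypothesized that a similar indirect loss on latent features can also strengthen adversarial attacks. LAFEAT uses the features extracted by intermediate layers to generate a stronger adversarial example against $f_{\theta}$. We assume that the model architecture $f_{\theta}$ is a sequence of $N$ layers (or residual blocks), which can be written as: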

$f_{\theta}(x)=f^{(N)}(\cdots f^{(2)}(f^{(1)}(x))\cdots)\quad\quad\quad\quad\quad\quad(5)$

Here $f^{(1)},f^{(2)},\dots,f^{(N)}$ denote the sequence of layers of the model. For simplicity, we omit the parameters of each layer. We therefore formalize this proposal by generalizing the traditional PGD attack (2) into a latent-feature PGD (LFPGD) adversarial optimization problem:

$\max\limits_{h,\lambda,\alpha,\mathcal{L}^{sur}}\mathcal{L}^{sce}(f_{\theta}(PGD_{\epsilon,x,y}(\mathcal{L}_{\lambda}^{lf},\alpha,I)),y),$

where

$\mathcal{L}^{lf}_{\lambda}(z,y)=\mathcal{L}^{sur}(\sum\limits_{l\in[1:N]}\lambda^{(l)}h^{(l)}(z^{(l)}),y)\quad\quad\quad\quad(6)$

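The symbols are defined as follows:

The constant $I$ is the maximum number of gradient-update iterations.
For each layer $l\in[1:N]$, $\lambda^{(l)}\in[0,1]$ assigns an importance weight to the gradient of that layer, with $\sum\limits_{l\in[1:N]}\lambda^{(l)}=1$.
$z^{(l)}\triangleq f^{(l)}\circ\cdots\circ f^{(1)}(z)$ denotes the features extracted from the $l^{th}$ layer.
The function $h^{(l)}$ of the $l^{th}$ layer maps the features output by $f^{(l)}$ to logits.
$\alpha_i$ is the step size at iteration $i$, given by the step-size schedule $\alpha$.

Our goal is to find the right combination of logits functions $h\triangleq(h^{(1)},\dots,h^{(N)})$ and their corresponding weights $\lambda\triangleq(\lambda^{(1)},\dots,\lambda^{(N)})$ from intermediate layers, the step-size schedule $\alpha$, and the surrogate loss $\mathcal{L}^{sur}$ to use.

Unfortunately, the LFPGD problem above is intractable in practice. We therefore devise methods that solve it approximately and nevertheless enable us to generate adversarial examples stronger than competing methods.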

3.2 Training intermediate logits layers

To exploit latent features, we use SGD to train a logits layer $h^{(l)}$ for each intermediate layer $f^{(l)}$, $l\in[1:N-1]$, independently until convergence. The function $h^{(l)}:\mathbb{R}^{C^{(l)}\times H^{(l)}\times W^{(l)}}\to\mathbb{R}^K$ is defined as an auxiliary classifier consisting of a global average pooling layer ($\mathbb{R}^{C^{(l)}\times H^{(l)}\times W^{(l)}}\to\mathbb{R}^{C^{(l)}}$) followed by a fully connected layer:

$h^{(l)}(x^{(l)})\triangleq \mathrm{pool}(x^{(l)})\phi^{(l)}+\eta^{(l)}\quad\quad\quad\quad\quad(7)$

Here $x^{(l)}$ denotes the features extracted from the $l^{th}$ layer, and $\phi^{(l)}\in\mathbb{R}^{C^{(l)}\times K}$ and $\eta^{(l)}\in\mathbb{R}^K$ are the trainable parameters of $h^{(l)}$. Because the final layer $f^{(N)}$ is already a logits layer, we simply let $h^{(N)}\triangleq \mathrm{id}$ be the identity function.
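The forward pass of Eq. (7) is small enough to write out directly. The following is a minimal NumPy sketch under assumed shapes: the feature tensor is random stand-in data rather than the output of a real $f^{(l)}$, and `phi` and `eta` are untrained placeholders for parameters that would be fitted with SGD.

```python
import numpy as np

def global_avg_pool(features):
    """pool: R^{C x H x W} -> R^{C}, averaging over the two spatial axes."""
    return features.mean(axis=(1, 2))

def logits_layer(features, phi, eta):
    """h^{(l)}(x^{(l)}) = pool(x^{(l)}) phi^{(l)} + eta^{(l)}, as in Eq. (7)."""
    return global_avg_pool(features) @ phi + eta

rng = np.random.default_rng(0)
C, H, W, K = 8, 4, 4, 10                 # hypothetical feature shape and class count
features = rng.normal(size=(C, H, W))    # stand-in for the output of f^{(l)}
phi = rng.normal(size=(C, K)) * 0.1      # untrained placeholder weights
eta = np.zeros(K)                        # untrained placeholder bias
logits = logits_layer(features, phi, eta)
print(logits.shape)
```

Because pooling removes the spatial axes, each $h^{(l)}$ has only $C^{(l)}\times K + K$ parameters, which keeps training one such head per block cheap.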

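Depending on availability, we can train these layers with the training set $\mathcal{D}_{train}$, the attack set $\mathcal{D}_{attack}$, or both. Although we use $\mathcal{D}_{train}$ in our experiments, in practice we observed negligible differences in attack strength between the two given a sufficient number of training examples, as they are theoretically drawn from the same data distribution.

It is important to note that while $h^{(l)}$ is being trained, the original model $f_{\theta}$ is used purely as a feature extractor, with every training-time mechanism disabled (e.g. dropout, parameter updates). This means that the model parameters $\theta$, the layers $f^{(l)}$ and their parameters, batch normalization, and so on are all held constant; only the parameters of $h^{(l)}$ are trained.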

3.3 Choosing intermediate layers to attack

Searching for $\lambda^{(1:N)}$ is difficult because of the computational cost associated with finding a statistically significant amount of adversarial examples. For this reason, LAFEAT uses a greedy yet effective approach to simplify the search. First, we enumerate all intermediate layers $l\in[1:N-1]$ and let:

$\mathcal{L}_{\lambda}^{lf}(z,y)=\mathcal{L}^{sur}(\beta h^{(l)}(z^{(l)})+(1-\beta)f_{\theta}(z),y)\quad\quad\quad\quad(8)$

that is, $\lambda=\beta\,\mathrm{onehot}(l,N)+(1-\beta)\,\mathrm{onehot}(N,N)$, with $\beta$ initially set to $\frac{1}{2}$. In other words, the attack is now using only the $l^{th}$ layer together with the output layer at a time while disabling all others. With this method, we can discover the most effective layer for subsequent attack procedures across all images in $\mathcal{D}_{attack}$. Empirically, we find that for most defending models the weakest link is the penultimate residual block, so for performance reasons the whole search can simply be skipped. There are exceptions, however: for the model from [38], the sixth-to-last residual block exhibits the weakest defense, and attacking this layer converges faster.

Finally, experiments show that the intermediate layer $l$ can be adaptively disabled if it misclassifies the adversarial example, i.e. when $\arg\max h^{(l)}(\hat{x}_i^{(l)})\ne\arg\max y$, in order to optimize faster towards the original adversarial objective (1). We incorporate this in the final algorithm.
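The two ideas in this section, mixing one intermediate layer with the output layer and dropping that layer once it misclassifies, can be sketched as follows. This is an illustrative sketch only: it follows the weighting written in Eq. (8) ($\beta$ on layer $l$, $1-\beta$ on the output layer), models "disabling" as setting $\beta$ to zero for that step, and uses hypothetical helper names (`onehot`, `layer_weights`, `combined_logits`) and made-up logits.

```python
import numpy as np

def onehot(index, size):
    """One-hot vector with a 1-based index, matching onehot(l, N) in the text."""
    v = np.zeros(size)
    v[index - 1] = 1.0
    return v

def layer_weights(l, N, beta):
    """lambda with weight beta on layer l and 1 - beta on the output layer N."""
    return beta * onehot(l, N) + (1.0 - beta) * onehot(N, N)

def combined_logits(h_l_logits, f_logits, y, beta):
    """Argument of L^sur in Eq. (8); layer l is dropped once it misclassifies."""
    if np.argmax(h_l_logits) != np.argmax(y):
        beta = 0.0                       # assumed form of adaptive disabling
    return beta * h_l_logits + (1.0 - beta) * f_logits

lam = layer_weights(l=3, N=5, beta=0.5)
y = onehot(1, 3)                         # true class is the first of 3 classes
f_logits = np.array([0.0, 4.0, 0.0])     # made-up output-layer logits
mixed = combined_logits(np.array([2.0, 1.0, 0.0]), f_logits, y, beta=0.5)
dropped = combined_logits(np.array([0.0, 3.0, 1.0]), f_logits, y, beta=0.5)
```

In the first call the intermediate head still predicts the true class, so both sets of logits are averaged; in the second it already misclassifies, so only the output logits remain.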

3.4 Surrogate loss function

Gradients computed from the SCE loss have been notoriously shown to easily underflow in floating-point arithmetic [8, 14]. The gradient of SCE with respect to the logits is $\mathrm{softmax}(z)-y$, so once the softmax saturates on a confidently classified input this difference rounds to zero. For this reason, surrogate loss functions have been proposed [8, 21, 14] to work around this limitation. Although they are very effective at breaking defense mechanisms, it is hard to explain why they work, because they do not maximize the original adversarial objective (1) directly. We therefore propose a small modification to the original SCE loss, which scales the logits adaptively before evaluating the softmax operation:

$\mathcal{L}^{sur}(z,y)\triangleq\mathcal{L}^{sce}(\frac{z}{t}(y^Tz-\max((1-y)\cdot z))^{-1},y)\quad\quad\quad\quad(9)$

Here $\cdot$ denotes the element-wise product, and the temperature is $t=1$ in all our experiments. The term $y^Tz-\max((1-y)\cdot z)$ is the well-known difference of logits (DL) [8], which measures the gap between the true-class logit and the largest of the remaining logits; for a correctly classified input this is the gap between the largest and second-largest entries of $z$. The negated version of DL and its ratio-based variant, the difference of the logits ratio (DLR) loss, have been used as surrogate losses in [8] and [14] respectively.

Our surrogate loss (9) has two advantages. First, it prevents the gradients from floating-point underflows and improves convergence. Second, unlike the DL or DLR losses, it can still represent the original SCE loss in a faithful fashion, and all logits can still contribute to the final loss.

Finally, we define a targeted variant of $\mathcal{L}^{sur}$, which moves the logits towards a predefined target class $k$, with a one-hot vector $\tau=\mathrm{onehot}(k,K)$:

$\mathcal{L}_{\tau}^{sur}(z,y)\triangleq-\mathcal{L}^{sce}(\frac{z}{t}(y^Tz-\max((1-y)\cdot z))^{-1},\tau)\quad\quad\quad\quad\quad(10)$
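To make the effect of the rescaling concrete, here is a minimal NumPy sketch of Eqs. (9) and (10) evaluated on made-up logits. It implements the formulas literally, which is an assumption about the intended reading: the true-class entry is zeroed inside the max rather than masked out, and DL is assumed to be non-zero. The function names are illustrative.

```python
import numpy as np

def sce(z, y):
    """Softmax cross-entropy between logits z and a one-hot vector y."""
    z = z - z.max()                          # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum())
    return -float(y @ log_p)

def difference_of_logits(z, y):
    """DL = y^T z - max((1 - y) * z), taken literally from Eq. (9)."""
    return float(y @ z - np.max((1 - y) * z))

def surrogate_loss(z, y, t=1.0):
    """Eq. (9): SCE evaluated on logits rescaled by 1 / (t * DL)."""
    return sce(z / (t * difference_of_logits(z, y)), y)

def targeted_surrogate_loss(z, y, tau, t=1.0):
    """Eq. (10): negated SCE of the rescaled logits against the target tau."""
    return -sce(z / (t * difference_of_logits(z, y)), tau)

z = np.array([30.0, 10.0, 5.0])              # confidently correct, saturated logits
y = np.array([1.0, 0.0, 0.0])
tau = np.array([0.0, 1.0, 0.0])              # hypothetical target class
plain = sce(z, y)                            # tiny: the softmax is saturated
scaled = surrogate_loss(z, y)                # rescaled logits have a gap of 1
targeted = targeted_surrogate_loss(z, y, tau)
```

With $t=1$ the rescaled logits always have a DL of exactly 1, so the loss no longer depends on the overall scale of $z$, while every logit still enters the softmax.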

3.5 Summary

In addition to the original contributions explained above, we also employ simple yet helpful tactics from the previous literature. First, a step-size schedule with a linear decay, $2\epsilon(1-i/I)$, is used in the iterative updates, where $\epsilon$ is the $\ell_{\infty}$-norm perturbation boundary in (1), $i$ is the current iteration number and $I$ denotes the total number of iterations. Second, we adapt momentum-based updates from [14].
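The linear-decay schedule is a one-liner. The sketch below assumes that iterations are indexed $i=0,\dots,I-1$, which the text does not state explicitly; the function name is illustrative.

```python
def linear_decay_step_sizes(eps, num_iters):
    """alpha_i = 2 * eps * (1 - i / I) for i = 0, ..., I - 1 (assumed indexing)."""
    return [2.0 * eps * (1.0 - i / num_iters) for i in range(num_iters)]

eps = 8 / 255
alphas = linear_decay_step_sizes(eps, num_iters=4)
# Step sizes are 2*eps times 1.0, 0.75, 0.5, 0.25.
```

Starting at $2\epsilon$ lets the first step cross the whole $\epsilon$-ball, while the shrinking steps allow finer adjustments towards the end.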

We summarize the LAFEAT algorithm in Algorithm 1:

[Algorithm 1]

The function LAFEAT_Attack takes the following inputs:

the model $f_{\theta}$
the pretrained logits function $h^{(l)}$ for the $l^{th}$ layer to be attacked jointly
the natural image $x$
the one-hot label $y$ of the image $x$
the step size $\alpha$
the interpolation parameter $\beta$ between the $l^{th}$ layer and the output layer
the momentum weight $\mathcal{V}$
the perturbation boundary $\epsilon$
the maximum number of iterations $I$