Abstract
Deep neural networks are vulnerable to adversarial examples, which poses security concerns on these algorithms due to the potentially severe consequences. Adversarial attacks serve as an important surrogate to evaluate the robustness of deep learning models before they are deployed. However, most existing adversarial attacks can only fool a black-box model with a low success rate. To address this issue, we propose a broad class of momentum-based iterative algorithms to boost adversarial attacks. By integrating the momentum term into the iterative process for attacks, our methods can stabilize update directions and escape from poor local maxima during the iterations, resulting in more transferable adversarial examples. To further improve the success rates for black-box attacks, we apply momentum iterative algorithms to an ensemble of models, and show that the adversarially trained models with a strong defense ability are also vulnerable to our black-box attacks. We hope that the proposed methods will serve as a benchmark for evaluating the robustness of various deep models and defense methods. With this method, we won the first places in the NIPS 2017 Non-targeted Adversarial Attack and Targeted Adversarial Attack competitions.
2 Background
In this section, we provide the background knowledge and review related works on adversarial attack and defense methods. Given a classifier $f(x): x \in \mathcal{X} \to y \in \mathcal{Y}$ that outputs a label $y$ as the prediction for an input $x$, the goal of adversarial attacks is to seek an example $x^*$ in the vicinity of $x$ that is misclassified by the classifier. Specifically, there are two classes of adversarial examples: non-targeted and targeted ones. For a correctly classified input $x$ with ground-truth label $y$ such that $f(x) = y$, a non-targeted adversarial example $x^*$ is crafted by adding small noise to $x$ without changing the true label, such that the classifier is misled as $f(x^*) \ne y$; a targeted adversarial example instead aims to fool the classifier into outputting a specific label, $f(x^*) = y^*$, where $y^*$ is the target label specified by the adversary and $y^* \ne y$. In most cases, the $L_p$ norm of the adversarial noise is required to be less than an allowed value $\epsilon$, i.e., $\Vert x^* - x \Vert_p \le \epsilon$, where $p$ could be $0, 1, 2, \infty$.
2.1 Attack methods
Existing methods for generating adversarial examples can be divided into three categories. We introduce the non-targeted versions of these attacks here; the targeted versions can be derived analogously.
One-step gradient-based attacks, such as the fast gradient sign method (FGSM), find an adversarial example $x^*$ by maximizing the loss function $J(x^*, y)$, where $J$ is often the cross-entropy loss. FGSM generates adversarial examples that meet the $L_\infty$ norm bound $\Vert x^* - x \Vert_\infty \le \epsilon$ as

$$x^* = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y)) \tag{1}$$
Here $\nabla_x J(x, y)$ is the gradient of the loss function with respect to $x$. The fast gradient method (FGM) is a generalization of FGSM that meets the $L_2$ norm bound $\Vert x^* - x \Vert_2 \le \epsilon$ as

$$x^* = x + \epsilon \cdot \frac{\nabla_x J(x, y)}{\Vert \nabla_x J(x, y) \Vert_2} \tag{2}$$
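As a concrete illustration, here is a minimal PyTorch sketch of these one-step attacks. It assumes `model` is a differentiable image classifier and `x`, `y` are a 4-D image batch with its ground-truth labels; these names and the cross-entropy choice of $J$ are illustrative, not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM under the L_inf bound (Eq. 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)          # J(x, y)
    grad = torch.autograd.grad(loss, x)[0]       # gradient of J w.r.t. x
    return (x + eps * grad.sign()).detach()

def fgm(model, x, y, eps):
    """One-step FGM under the L_2 bound (Eq. 2)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    # Normalize the gradient by its per-example L_2 norm.
    norm = grad.flatten(1).norm(p=2, dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
    return (x + eps * grad / norm).detach()
```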
Iterative attacks apply the fast gradient multiple times with a small step size $\alpha$. The iterative version of FGSM (I-FGSM) can be expressed as:

$$x^*_0 = x, \quad x^*_{t+1} = x^*_t + \alpha \cdot \mathrm{sign}(\nabla_x J(x^*_t, y)) \tag{3}$$
To make the generated adversarial examples satisfy the $L_\infty$ (or $L_2$) norm bound, one can clip $x^*_t$ into the $\epsilon$-vicinity of $x$ after each iteration, or simply set $\alpha = \epsilon/T$, where $T$ is the number of iterations. It has been shown that iterative methods are stronger white-box adversaries than one-step methods, at the cost of worse transferability [10, 24].
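A corresponding sketch of I-FGSM under the same assumptions as the snippet above, using $\alpha = \epsilon/T$ so the final perturbation stays inside the $L_\infty$ ball without explicit clipping:

```python
import torch
import torch.nn.functional as F

def i_fgsm(model, x, y, eps, T):
    """Iterative FGSM (Eq. 3) with step size alpha = eps / T."""
    alpha = eps / T
    x_adv = x.clone().detach()
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()
        # Alternatively, clip x_adv into [x - eps, x + eps] here if a
        # larger step size is used.
    return x_adv
```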
Optimization-based methods [23] directly optimize the distance between the adversarial example and the real image, subject to the constraint that the adversarial example is misclassified. Box-constrained L-BFGS can be used to solve such a problem. A more sophisticated way [1] is to solve:

$$\arg\min_{x^*} \lambda \cdot \Vert x^* - x \Vert_p - J(x^*, y) \tag{4}$$
Since it directly optimizes the distance between an adversarial example and the corresponding real example, there is no guarantee that the resulting $L_\infty$ ($L_2$) distance is less than the required value. Optimization-based methods also lack efficacy in black-box attacks, just like iterative methods.
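For intuition, a minimal sketch of optimizing the objective in Eq. (4) with a generic gradient optimizer is shown below; this is a simplification rather than the box-constrained L-BFGS of [23] or the exact formulation of [1], and `lam`, `steps`, and `lr` are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def optimization_attack(model, x, y, lam=0.1, steps=200, lr=0.01):
    """Minimize lam * ||x* - x||_2 - J(x*, y) with Adam (a sketch of Eq. 4)."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        dist = (x_adv - x).flatten(1).norm(p=2, dim=1).sum()
        loss = lam * dist - F.cross_entropy(model(x_adv), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_adv.detach()
```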
3 Method
In this paper, we propose a broad class of momentum iterative gradient-based methods to generate adversarial examples, which can fool both white-box and black-box models. We first describe how to integrate momentum into iterative FGSM, which yields the momentum iterative fast gradient sign method (MI-FGSM) for generating non-targeted adversarial examples under the $L_\infty$ norm bound. We then propose several approaches for effectively attacking an ensemble of models. Finally, we extend MI-FGSM to the $L_2$ norm bound and to targeted attacks.
3.1 Momentum iterative fast gradient sign method
Momentum [18] is a technique for accelerating gradient descent by accumulating a velocity vector in the gradient direction of the loss function across iterations. Memorizing previous gradients helps to barrel through narrow valleys, small humps, and poor local minima or maxima [4]. The momentum method has also shown its effectiveness in stabilizing the updates of stochastic gradient descent [20]. We apply the idea of momentum to generating adversarial examples and obtain tremendous benefits.
To generate a non-targeted adversarial example $x^*$ from a real example $x$ that satisfies the $L_\infty$ norm bound, gradient-based approaches seek the adversarial example by solving the constrained optimization problem:

$$\arg\max_{x^*} J(x^*, y) \quad \text{s.t.} \ \Vert x^* - x \Vert_\infty \le \epsilon \tag{5}$$
Here $\epsilon$ is the size of the adversarial perturbation. FGSM generates an adversarial example by applying the sign of the gradient to a real example only once (Eq. 1), under the assumption that the decision boundary is linear around the data point. However, in practice the linear assumption may not hold when the distortion is large [12], which makes the adversarial example generated by FGSM "underfit" the model and limits its attack ability. In contrast, iterative FGSM greedily moves the adversarial example in the direction of the sign of the gradient in each iteration (Eq. 3). Therefore, the adversarial example can easily drop into poor local maxima and "overfit" the model, which makes it less transferable.
To break this dilemma, we integrate momentum into iterative FGSM to stabilize the update directions and escape from poor local maxima. The momentum-based method therefore retains the transferability of adversarial examples as the number of iterations increases, and at the same time acts as a strong adversary for white-box models, like iterative FGSM. It alleviates the trade-off between attack ability and transferability, demonstrating strong black-box attacks.
The momentum iterative fast gradient sign method (MI-FGSM) is summarized in Algorithm 1.
Specifically, $g_t$ gathers the gradients of the first $t$ iterations with a decay factor $\mu$, as defined in Eq. (6). The adversarial example $x^*_t$ up to the $t$-th iteration is then perturbed in the direction of the sign of $g_t$ with a step size $\alpha$, as in Eq. (7). If $\mu$ equals 0, MI-FGSM degenerates to iterative FGSM. In each iteration, the current gradient $\nabla_x J(x^*_t, y)$ is normalized by its own $L_1$ distance (any distance measure is feasible), because we notice that the scale of the gradients varies in magnitude across iterations.
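The accumulation and update steps referenced above as Eq. (6) and Eq. (7) are $g_{t+1} = \mu \cdot g_t + \frac{\nabla_x J(x^*_t, y)}{\Vert \nabla_x J(x^*_t, y) \Vert_1}$ and $x^*_{t+1} = x^*_t + \alpha \cdot \mathrm{sign}(g_{t+1})$. A minimal PyTorch sketch of MI-FGSM under the same assumptions as the earlier snippets:

```python
import torch
import torch.nn.functional as F

def mi_fgsm(model, x, y, eps, T, mu=1.0):
    """MI-FGSM: accumulate L_1-normalized gradients with decay mu (Eq. 6)
    and step along the sign of the accumulated gradient (Eq. 7)."""
    alpha = eps / T
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Normalize the current gradient by its per-example L_1 norm.
        l1 = grad.flatten(1).norm(p=1, dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        g = mu * g + grad / l1                       # Eq. (6)
        x_adv = (x_adv + alpha * g.sign()).detach()  # Eq. (7)
    return x_adv
```

Setting `mu=0` recovers I-FGSM, consistent with the degeneration noted above.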
3.2 Attacking an ensemble of models
3.3 Extensions
The momentum iterative method can easily be applied to other attack methods. By replacing the current gradient with the accumulated gradient of previous iterations, any iterative method can be extended to its momentum variant. Here we introduce the methods for generating adversarial examples under the $L_2$ norm bound and for targeted attacks.
To find an adversarial example within the $\epsilon$-vicinity of a real example measured by the $L_2$ distance, i.e., $\Vert x^* - x \Vert_2 \le \epsilon$, the momentum variant of the iterative fast gradient method (MI-FGM) can be written as:

$$x^*_{t+1} = x^*_t + \alpha \cdot \frac{g_{t+1}}{\Vert g_{t+1} \Vert_2} \tag{8}$$
where $g_{t+1}$ is defined in Eq. (6) and $\alpha = \frac{\epsilon}{T}$, with $T$ denoting the total number of iterations.
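A sketch of MI-FGM under the same assumptions, combining the accumulation of Eq. (6) with the $L_2$-normalized update of Eq. (8):

```python
import torch
import torch.nn.functional as F

def mi_fgm(model, x, y, eps, T, mu=1.0):
    """MI-FGM: the L_2-bound momentum variant of the fast gradient method."""
    alpha = eps / T
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        l1 = grad.flatten(1).norm(p=1, dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        g = mu * g + grad / l1                       # Eq. (6)
        l2 = g.flatten(1).norm(p=2, dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        x_adv = (x_adv + alpha * g / l2).detach()    # Eq. (8)
    return x_adv
```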
For targeted attacks, the objective of finding an adversarial example misclassified as a target class $y^*$ is to minimize the loss function $J(x^*, y^*)$. The accumulated gradient is derived as:
$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x J(x^*_t, y^*)}{\Vert \nabla_x J(x^*_t, y^*) \Vert_1} \tag{9}$$
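Since the targeted loss $J(x^*, y^*)$ is minimized rather than maximized, the update moves against the sign of the accumulated gradient. A sketch of the targeted variant under the same assumptions, where the subtraction in the update is the only change from the non-targeted version:

```python
import torch
import torch.nn.functional as F

def targeted_mi_fgsm(model, x, y_target, eps, T, mu=1.0):
    """Targeted momentum iterative attack: accumulate gradients of J(x*, y*)
    as in Eq. (9) and descend so the loss w.r.t. the target class decreases."""
    alpha = eps / T
    x_adv = x.clone().detach()
    g = torch.zeros_like(x)
    for _ in range(T):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y_target)  # J(x*, y*)
        grad = torch.autograd.grad(loss, x_adv)[0]
        l1 = grad.flatten(1).norm(p=1, dim=1).clamp_min(1e-12).view(-1, 1, 1, 1)
        g = mu * g + grad / l1                          # Eq. (9)
        x_adv = (x_adv - alpha * g.sign()).detach()     # descend the targeted loss
    return x_adv
```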
4 Experiments
4.1 Setup
We study seven models in total. Four of them are normally trained models: Inception v3 (Inc-v3), Inception v4 (Inc-v4), Inception ResNet v2 (IncRes-v2), and ResNet v2-152 (Res-152). The other three are obtained by ensemble adversarial training: Inc-v3$_{ens3}$, Inc-v3$_{ens4}$, and IncRes-v2$_{ens}$. We simply refer to the latter three as the "adversarially trained models".
It is less meaningful to study the success rates of attacks if the models cannot classify the original image correctly. Therefore, we randomly choose 1000 images belonging to the 1000 categories from the ILSVRC 2012 validation set, all of which are correctly classified by all of these models.
In our experiments, we compare our methods to one-step gradient-based methods and iterative methods. Since optimization-based methods cannot explicitly control the distance between the adversarial examples and the corresponding real examples, they are not directly comparable to ours, but they have similar properties to iterative methods, as discussed in Sec. 2.1. For clarity, we only report the results based on the $L_\infty$ norm bound for non-targeted attacks, and leave the results based on the $L_2$ norm bound and targeted attacks to the supplementary material. The findings in this paper are general across different attack settings.