Paper Reading: Explaining and Harnessing Adversarial Examples

Summary

This paper is a follow-up to the adversarial example line of work, showing that the phenomenon is not unique to neural networks. Before this paper, explanations of the problem focused on nonlinearity and overfitting; here the authors argue that the primary cause of these models' vulnerability to adversarial examples is precisely their linear nature. They back this up with a quantitative analysis that explains why adversarial examples work across different architectures and different training sets, and use the insight to propose a fast method for generating adversarial examples.

Why Adversarial Examples Arise

The earlier, speculative explanations for adversarial examples pointed to the extreme nonlinearity of deep neural networks, together with overfitting caused by insufficient model averaging and insufficient regularization in purely supervised training. This paper argues instead that linear behavior in high-dimensional spaces is by itself enough to produce adversarial examples. It also proposes that adversarial training can provide a regularization benefit comparable to dropout (though, because of the extra training cost, it had not seen much practical use), whereas conventional regularization techniques do not fix the adversarial example problem; moving to a nonlinear model family such as RBF networks can. There is therefore a tradeoff between linearity and nonlinearity: linear models are easier to train, while nonlinear models can resist adversarial examples. ("In the long run, it may be possible to escape this tradeoff by designing more powerful optimization methods that can successfully train more nonlinear models.")
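The linearity argument (Section 3 of the paper) in one line: for a linear score, the effect of a max-norm-bounded perturbation grows with the input dimension.

```latex
% For a linear score w^T x and a perturbed input \tilde{x} = x + \eta with
% \|\eta\|_\infty \le \epsilon, the activation changes by w^T \eta:
\[
w^\top \tilde{x} = w^\top x + w^\top \eta,
\qquad
\eta = \epsilon\,\operatorname{sign}(w)
\;\Rightarrow\;
w^\top \eta = \epsilon \lVert w \rVert_1 \approx \epsilon\, m\, n,
\]
% where m is the average magnitude of a weight and n the input dimension.
% The change grows linearly with n while \|\eta\|_\infty stays fixed at \epsilon,
% so in high dimensions many imperceptible per-pixel changes add up to a large
% change in the output.
```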

The uncomfortable question this raises is whether even our best current models are something of a Potemkin village: they may not have learned the underlying semantic concepts at all, which is why they fail to generalize in this way.

"These results have often been interpreted as being a flaw in deep networks in particular, even though linear classifiers have the same problem." Only deep networks even have the capacity to fit the kind of nonlinear function that could resist these attacks, something shallow models cannot represent; the popular impression that deep learning is especially vulnerable to adversarial attacks gets it backwards. A shallow model cannot both assign different outputs to different inputs and assign the same output to inputs that are close together. Of course, there is no theoretical guarantee that our training algorithms will actually find a function with all the properties we want: standard supervised training by itself does not produce a function that resists adversarial examples, so this property has to be encouraged explicitly during training.

Fast Generation of Adversarial Examples: the Fast Gradient Sign Method


Although the Fast Gradient Sign Method is simple, the perturbation is computed from the gradient (in the linear view, from w), so the resulting adversarial examples do not fool the network 100% of the time. The authors note that adversarial examples can also be generated in other ways, for example by rotating x by a small angle in the direction of the gradient; later in the paper, though, adversarial training on such examples turns out to regularize much less well, perhaps because rotations are easy for the model to learn (a rotation matrix has a simple structure: "However, we did not find nearly as powerful of a regularizing result from this process, perhaps because these kinds of adversarial examples are not as difficult to solve."). The fact that such simple, gradient-sign examples work at all is in turn further evidence for the linear explanation of adversarial examples. A sketch of FGSM follows.
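FGSM sets η = ε · sign(∇ₓ J(θ, x, y)) and adds it to the input. A minimal PyTorch sketch; the cross-entropy loss, the default ε, and the [0, 1] clamp are my placeholder choices rather than the paper's exact setup:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.25):
    """Return x + eps * sign(grad_x J(theta, x, y)) -- the fast gradient sign perturbation."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)    # J(theta, x, y)
    loss.backward()                            # gradient w.r.t. the input
    x_adv = x_adv + eps * x_adv.grad.sign()    # take one signed step of size eps
    return x_adv.clamp(0.0, 1.0).detach()      # keep pixels in a valid range

# Usage: x_adv = fgsm(model, x, y); accuracy of model(x_adv) typically drops sharply,
# though, as noted above, not every example ends up misclassified.
```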

 

The authors illustrate this with a binary logistic regression classifier (3s versus 7s from MNIST), looking at the weight vector w, the perturbation η, and the images before and after the perturbation (the figure is not reproduced here). A rough sketch of that experiment is below.
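A quick sketch of that setup, assuming sklearn's OpenML copy of MNIST; the ε value and the sign convention for pushing each example toward the wrong class are my own choices:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# Binary logistic regression on MNIST 3s vs 7s, perturbed with eta = eps * sign(w).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
mask = (y == "3") | (y == "7")
X, y = X[mask] / 255.0, (y[mask] == "7").astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)
w = clf.coef_.ravel()                  # the weight vector shown in the paper's figure

eps = 0.25
eta = eps * np.sign(w)                 # FGSM reduces to eps * sign(w) for a linear model
X_adv = np.clip(X + (1 - 2 * y)[:, None] * eta, 0.0, 1.0)  # push each example toward the wrong class

print("clean accuracy:      ", clf.score(X, y))
print("adversarial accuracy:", clf.score(X_adv, y))
```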


 

Interpreting the Effect of Adversarial Training: Reducing Overfitting

Section 5 contrasts adversarial training of a linear model with an L1 weight-decay penalty, and argues that adversarial training is the better-behaved of the two. The key difference is that the ε‖w‖₁ term produced by adversarial training is subtracted from the model's activation inside the loss rather than added to the training cost, so once the model becomes confident enough that the softplus saturates (i.e. it already has a good margin), the penalty effectively switches off; L1 weight decay, by contrast, never deactivates. The flip side is that in the underfitting regime adversarial training simply makes underfitting worse. In that sense L1 weight decay is more pessimistic, more "worst case", than adversarial training. The derivation is sketched below.
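My reading of the Section 5 derivation, for logistic regression with labels y ∈ {−1, 1} and softplus ζ(z) = log(1 + eᶻ); the second objective matches the paper's up to sign conventions:

```latex
% L1 weight decay adds a penalty to the cost, and the penalty never switches off:
\[
\mathbb{E}_{x,y}\;\zeta\!\bigl(-y\,(w^\top x + b)\bigr) \;+\; \lambda \lVert w \rVert_1 .
\]
% Adversarial training instead uses the worst-case perturbation
% \eta = -y\,\epsilon\,\operatorname{sign}(w), which moves the penalty inside the loss,
% subtracted from the activation:
\[
\mathbb{E}_{x,y}\;\zeta\!\bigl(\epsilon \lVert w \rVert_1 - y\,(w^\top x + b)\bigr).
\]
% Once the model is confident enough that \zeta saturates, the \epsilon\|w\|_1 term
% stops mattering; the L1 penalty above stays active regardless. In the underfitting
% regime, the adversarial term only makes underfitting worse.
```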

 

Adversarial training does provide some regularization, but adversarial examples are not like other forms of data augmentation: ordinary augmentations produce inputs that we expect to occur in the test distribution, while adversarial examples are not expected to occur naturally. The adversarial training results in this paper also do not clearly beat dropout on the benchmark; the authors suggest this is because the range of adversarial examples they could use was limited: "it was difficult to experiment extensively with expensive adversarial examples based on L-BFGS."

Adversarial training corresponds to the regularized objective below [the correspondence assumes θ are the parameters of the model being trained; in practice the perturbation is recomputed from the current θ as training proceeds]. In this paper's experiments α = 0.5 (a fairly arbitrary choice; other values might work better). With adversarial training on top of a dropout network, the error rate drops below that of dropout alone (from 0.94% to 0.84%).
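The objective referred to above (the formula image in the original post did not survive; this is the paper's FGSM-based adversarial training objective):

```latex
\[
\tilde{J}(\theta, x, y)
  \;=\; \alpha\, J(\theta, x, y)
  \;+\; (1 - \alpha)\, J\!\bigl(\theta,\; x + \epsilon\,\operatorname{sign}\!\bigl(\nabla_x J(\theta, x, y)\bigr),\; y\bigr),
\qquad \alpha = 0.5 .
\]
```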


Training with this objective not only reduces overfitting and improves accuracy, it also makes the model resistant to adversarial examples (the error rate on adversarial examples drops from 89.4% to 17.9%). As mentioned earlier, adversarial examples transfer: they remain effective against other models. The adversarially trained model alleviates this as well: generating one set of adversarial examples from the original model's parameters and one from the new model's, and feeding each set to the other model, gives error rates of 19.6% and 40.9% respectively. The confidence on the examples that are still misclassified remains high, however, at 81.4% on average. The authors also observe that the weights learned with adversarial training are more localized and easier to interpret.

 

The paper also asks whether it is worth perturbing the hidden layers. Szegedy et al. reported that applying the perturbation to the hidden layers gave the best regularization, but the experiments here find that when the hidden units' activations are unbounded there is no point in perturbing them: the model simply responds by making those activations larger, and in practice the regularization effect is poor, even worse than perturbing the input layer directly.

 

One reason that the existence of adversarial examples can seem counter-intuitive is that most of us have poor intuitions for high dimensional spaces. We live in three dimensions, so we are not used to small effects in hundreds of dimensions adding up to create a large effect. There is another way that our intuitions serve us poorly.

 

The simple RBF networks with low capacity are naturally immune to adversarial examples, in the sense that they have low confidence when they are fooled. Under FGSM the error rate on adversarial examples is still 55.4%, but the average confidence on those mistakes is only 1.2%, compared with 60.6% on the clean test set. "We can't expect a model with such low capacity to get the right answer at all points of space, but it does correctly respond by reducing its confidence considerably on points it does not 'understand.'" In other words, the property we want is that on inputs the model does not understand, even if it cannot generalize correctly, it should at least not give a confidently wrong answer.
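A generic shallow RBF-style classifier (not necessarily the paper's exact parameterization) makes the low-confidence behavior easy to see:

```latex
% One unnormalized RBF response per class:
\[
f_c(x) \;=\; \exp\!\bigl(-\gamma \,\lVert x - \mu_c \rVert_2^2\bigr), \qquad c = 1, \dots, K.
\]
% Each response is large only near its prototype \mu_c. On a point far from all the
% prototypes, every f_c(x) is close to zero, so even when the argmax is wrong the
% reported confidence is very low -- the behavior described above.
```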

 

Why Adversarial Examples Generalize Across Models

As mentioned above, adversarial examples generated from one particular model remain effective against other models, even ones trained on a different training set, and the models often misclassify them as the same class. The explanation is that the adversarial perturbation is highly aligned with the model's weight vectors, and different models trained to perform the same task learn similar functions. The paper backs this up by checking how correlated the misclassifications are across models: adversarial examples generated from a maxout network are fed to an RBF network and a softmax classifier, which misclassify 16.0% and 54.6% of them respectively (different models, different error rates). Restricted to the examples the maxout network itself misclassifies, the softmax classifier's error rate rises to 84.6% and the RBF network's to 54.3%; restricted to the examples the softmax classifier misclassifies, the RBF network's error rate is 53.6%. (Would it be worth also checking the RBF error rate on the examples misclassified by both maxout and softmax? It seems computable from these numbers.) As the paper puts it, "a significant proportion of them are consistent with linear behavior being a major cause of cross-model generalization."

 

This transferability means that an attacker who wants to attack a model maliciously does not need any access to the target model at all: they can train a model of their own, generate adversarial examples from it, and then deploy those examples against the model they want to attack (a sketch of this recipe follows). The figure in the paper (not reproduced here) also explains why the misclassifications still come with fairly high confidence.
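A sketch of that recipe on a synthetic problem; the data, both architectures, and all hyperparameters are placeholders for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, :10].sum(dim=1) > 0).long()          # a simple synthetic labeling rule

def train(model, X, y, steps=300, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(X), y).backward()
        opt.step()
    return model

# The attacker never touches `target`; they only train their own `surrogate`.
target    = train(nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2)), X, y)
surrogate = train(nn.Sequential(nn.Linear(20, 32), nn.Tanh(), nn.Linear(32, 2)), X, y)

# FGSM computed against the surrogate only.
X_adv = X.clone().requires_grad_(True)
F.cross_entropy(surrogate(X_adv), y).backward()
X_adv = (X_adv + 0.25 * X_adv.grad.sign()).detach()

with torch.no_grad():
    acc = lambda m, data: (m(data).argmax(dim=1) == y).float().mean().item()
    print(f"target accuracy on clean data:        {acc(target, X):.2%}")
    print(f"target accuracy on transferred advs.: {acc(target, X_adv):.2%}")
```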


 

Some hypotheses:

  1. One hypothesis is that generative training could provide more constraint on the training process, or cause the model to learn how to distinguish "real" from "fake" data and to be confident only on "real" data.
  2. Another hypothesis about why adversarial examples exist is that individual models have strange quirks but averaging over many models can cause adversarial examples to wash out. With an ensemble of 12 maxout networks, the error rate on adversarial examples crafted against the whole ensemble is still 91.1%; on adversarial examples crafted against a single member network, the ensemble's error rate only falls to 87.9%. So ensembling gives some resistance to adversarial examples, but a limited amount.

 

Summary:

  1. Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.
  2. The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.
  3. The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.
  4. Because it is the direction that matters most, adversarial perturbations generalize across different clean examples. (Why? Presumably because the loss increases roughly linearly along that direction regardless of the starting point, so the same direction works from many different clean inputs.)
  5. We have introduced a family of fast methods for generating adversarial examples.
  6. We have demonstrated that adversarial training can result in regularization; even further regularization than dropout.
  7. We have run control experiments that failed to reproduce this effect with simpler but less efficient regularizers including L1 weight decay and adding noise.
  8. Models that are easy to optimize are easy to perturb.
  9. Linear models lack the capacity to resist adversarial perturbation; only structures with a hidden layer (where the universal approximator theorem applies) should be trained to resist adversarial perturbation.
  10. RBF networks are resistant to adversarial examples.
  11. Models trained to model the input distribution are not resistant to adversarial examples.
  12. Ensembles are not resistant to adversarial examples.

Some further observations concerning rubbish class examples are presented in the appendix:

  1. Rubbish class examples are ubiquitous and easily generated.
  2. Shallow linear models are not resistant to rubbish class examples.
  3. RBF networks are resistant to rubbish class examples.