AI可解释性 I | 对抗样本（Adversarial Sample）论文导读- Part I-CSDN博客

AI可解释性 I | 对抗样本（Adversarial Sample）论文导读（持续更新）

导言#

本文作为AI可解释性系列的第一部分，旨在以汉语整理并阅读对抗攻击（Adversarial Attack）相关的论文，并持续更新。与此同时，AI可解释性系列的第二部分：归因方法（Attribution）也即将上线，敬请期待。

Intriguing properties of neural networks（Dec 2013）#

作者：Christian Szegedy

简介#

Intriguing properties of neural networks 乃是对抗攻击的开山之作，首次发现并将对抗样本命名为Adversarial Sample，首先发现了神经网络存在的两个性质：

单个高层神经元和多个高层神经元的线性组合之间并无差别，即表示语义信息的是高层神经元的空间而非某个具体的神经元

there is no distinction between individual high level units and random linear combinations of high level units, ..., it is the space, rather than the individual units, that contains of the semantic information in the high layers of neural networks.
神经网络的输入-输出之间的映射很大程度是不连续的，可以通过对样本施加难以觉察的噪声扰动（perturbation）最大化网络预测误差以使得网络错误分类，并且可以证明这种扰动并不是一种随机的学习走样（random artifact of learning），可以应用在不同数据集训练的不同结构的神经网络

we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extend. Specifically, we find that we can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

神经元激活#

文章通过实验证明，某个神经元的方向（natural basis direction）和随机挑选一个方向（random basis）再和整个层级的激活值作余弦相似度之后的结果完全不可区分。

可以证明，单个神经元的解释程度和整个层级的解释程度不分伯仲，即所谓”神经网络解耦了不同坐标上的特征“存在疑问。

This suggest that the natural basis is not better than a random basis in for inspecting the properties of $\phi(x)$ . Moreover, it puts into question the notion that neural networks disentangle variation factors across coordinates.

虽然每个层级似乎在了输入分布的某个部分存在不变性，但是很明显在这些部分的邻域中又存在着一种反直觉的未定义的行为。

观点认为神经网络的多层非线性叠加的目的就是为了使得模型对样本空间进行非局部泛化先验（non-local generalization prior）的编码，换句话说，输出可能会对其周围没有训练样本的输入空间邻域分配不显著（推测约为非 $\epsilon$ ）的概率（对抗攻击可以发生的假设）。这样做的好处在于，不同视角的同一张图片可能在像素空间产生变化，但是经过非局部泛化先验编码使得在概率上的输出是不变的。

In other words, it is possible for the output unit to assign non-significant (and, presumably, non-epsilon) probabilities to regions of the input space that contain no training examples in their vicinity.

这里可以推导出一种平滑性假设，即训练样本在 $\epsilon$ 领域内（ $||x^\prime-x||<\epsilon$ ）的所有样本都是满足和训练样本的类别一致。

And that in particular, for a small enough radius " in the vicinity of a given training input x, an x " which satisfies ||x x " || < " will get assigned a high probability of the correct class by the model.

Uzuki评论：神经网络验证做的工作就是寻找这个 $\epsilon$ ，证明在 $\epsilon$ 邻域内不存在对抗样本

接下来，作者将通过实验证明这种平滑性假设在很多的核方法（kernel method）上都是不成立的，可以通过一种高效的优化算法完成（即对抗攻击），这种优化过程在于遍历从网络形成的流形上以寻找“对抗样本”。这些样本在网络的高维流形上被认为是低概率出现的局部“口袋“。

In some sense, what we describe is a way to traverse the manifold represented by the network in an efficient way (by optimization) and finding adversarial examples in the input space.

算法的形式化描述#

给定一个分类器 $f:\mathbb{R}^m\rightarrow \{1\dots k\}$ 以及一个连续的损失函数，输入图像和目标类别 ${1\dots k}$ 以解开下面如下的箱约束（box-constrained）问题：

最小化 $||r||_2$ 并保证：
1. $f(x+r)=l$
2. $x+r\in[0,1]^m$ （确保是RGB范围）

这个问题只要 $f(x)\neq l$ 就是一个非平凡的难解问题，因此我们通过box-constrained L-BFGS算法去优化近似求解。问题可以如下表示：通过线搜索找到一个最小的 $c$ ，以最小化 $r$

最小化 $c|r|+\text{loss}_f(x+r,l)\quad s.t.\quad x+r\in[0,1]^m$

In general, the exact computation of D(x, l) is a hard problem, so we approximate it by using a box-constrained L-BFGS. Concretely, we find an approximation of D(x, l) by performing line-search to find the minimum c > 0 for which the minimizer r of the following problem satisfies f (x + r) = l.

实验#

通过实验，可以得到如下三个结论：

对于文章研究的所有网络（包括MNIST、QuocNet、AlexNet），针对每个样本，始终能够生成与原始样本极其相似、视觉上无法区分的对抗样本，且这些样本均被原网络误分类。
跨模型的泛化性：当使用不同超参数（如层数、正则化项或初始权重）从头训练网络时，仍有相当比例的对抗样本会被误分类。
跨训练集的泛化性：在完全不同的训练集上从头训练的网络，同样会误分类相当数量的对抗样本。

For all the networks we studied (MNIST, QuocNet [10], AlexNet [9]), for each sample, we always manage to generate very close, visually indistinguishable, adversarial examples that are misclassified by the original network (see figure 5 for examples).
Cross model generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch with different hyper-parameters (number of layers, regularization or initial weights).
Cross training-set generalization a relatively large fraction of examples will be misclassified by networks trained from scratch trained on a disjoint training set.

可以证明对抗样本存在普适性，一个微妙但关键的细节是：对抗样本需针对每一层的输出生成，并用于训练该层之上的所有层级。实验表明，高层生成的对抗样本比输入层或低层生成的更具训练价值。

A subtle, but essential detail is that adversarial examples are generated for each layer output and are used to train all the layers above. Adversarial examples for the higher layers seem to be more useful than those on the input or lower layers.

然而，这个实验仍然留下了关于训练集依赖性的问题。生成示例的难度是否仅仅依赖于我们训练集作为样本的特定选择，还是这一效应能够泛化到在完全不同训练集上训练的模型？

Still, this experiment leaves open the question of dependence over the training set. Does the hardness of the generated examples rely solely on the particular choice of our training set as a sample or does this effect generalize even to models trained on completely different training sets?

因此作者进行了一个迁移性实验，将MNIST的训练集切分成两个大小为30000的部分 $P_1$ 和 $P_2$ ，用以训练三个全连接网络：

名称	结构	训练数据
$M_1$	100-100-10	$P_1$
$M_1^\prime$	123-456-10	$P_1$
$M_2$	100-100-10	$P_2$

接下来在每个网络上使用测试集训练对抗样本，并将这些对抗样本迁移到其他网络上，这些对抗样本对其他网络同样有效果。

Uzuki评论：这个性质为对抗攻击的迁移性（transferability）提供了保证，那些无法得知内部结构黑盒网络，可以通过一个结构类似的白盒作为代理（surrogate model）生成对抗样本。

因此可以得到一个有趣的结论：即使在不相交的训练集上训练的模型，对抗样本仍然难以处理，尽管它们的有效性显著降低。

The intriguing conclusion is that the adversarial examples remain hard for models trained even on a disjoint training set, although their effectiveness decreases considerably.

网络不稳定性的谱分析#

作者将上一节提出这种监督学习网络对这些特定的扰动族存在的不稳定性可以被如下数学表示：

给定多个集合对 $(x_i,n_i)$ ，使得 $||n_i||<\delta$ 但 $||\phi(x_i;W)-\phi(x_i+n_i;W)||\ge M \ge 0$ ，其中 $\delta$ 是一个非常小的值， $W$ 是一个一般的可训练参数。那么这个不稳定的扰动 $n_i$ 取决于网络结构 $\phi$ 而非特定训练的参数 $W$

Uzuki评论：这个结论就不得不提对抗训练（Adversarial training）了，因为对抗训练不改变网络结构但是改变了训练集的结构以生成不同的训练参数，提高模型对对抗样本的防护能力，因此对抗训练应该是证明了某个特定的对抗样本生成算法本身总不是最优的

结论#

通过寻找对抗样本的过程，我们可以证明神经网络本身并没有很好地实现泛化能力，尽管这些对抗样本在训练过程中出现的概率是极低的，但是模型构建的空间本身是稠密的，几乎每个正常样本都能寻找到对抗样本。

Explaining and Haressing Adversarial Examples（Mar 2015）#

作者：Ian J. Goodfellow

简介#

本文作为对Intriguing properties of neural networks一文的修正，指出神经网络对对抗样本的敏感性并非源自于非线性而是线性上，并利用这个性质提出了一种更高效的线性攻击：Fast Gradient Sign Method （FGSM）。同时提出了一种正则化方法称为对抗训练（Adversarial training）有助于提高网络对对抗样本的抗性。并证明传统的正则化方法例如dropout，预训练以及模型集成并不能很好地提高模型对于对抗样本的抗性，但是将模型换成非线性结构确能提高模型的抗性。

作者认为，在设计由于其线性而易于训练的模型与设计利用非线性效应以抵抗对抗扰动的模型之间存在根本性的矛盾。因此长期来看，神经网络的突破应该在于提出一种更好的训练方法以应对非线性的网络。

Our explanation suggests a fundamental tension between designing models that are easy to train due to their linearity and designing models that use nonlinear effects to resist adversarial perturbation.

对抗样本的线性阐释#

我们给定一个对抗样本 $\tilde x=x+\eta$ ，并且保证 $||\eta||_{\infty}<\epsilon$ ，这个 $\epsilon$ 将难以被数据传感器识别并且 $\eta$ 的精度不会小于 $1/255$ （对于八位的RGB图像），那么当这个样本 $\tilde{x}$ 与权重 $w$ 做点乘，有：

w^{T} \tilde{x} = w^{T} x + w^{T} η

$w^T\tilde{x}=w^Tx+w^T\eta$

相比于原始样本 $x$ ，输出的激活值将增加 $w^T\eta$ ，当我们在范数约束范围内取 $\eta=\text{sign}{\ w}$ 时可以得到最大值。假设 $w$ 是 $n$ 维的并且每个元素的平均大小为 $m$ ，那么输出增加的部分可以写作 $\epsilon m n$ . 可以看到，范数约束 $||\eta||_{\infty}$ 并不会随着维度 $n$ 的增加而增加，但是扰动却会随着维度 $n$ 增加而增加，那么哪怕 $\eta$ 的值近乎无限小，只要维度足够大也能使得输出变化非常大。

Uzuki评论：至于 $\eta=\text{sign}{\ w}$ 这部分原文即如此，应该指的是 $\eta$ 与权重 $w$ 的符号方向一致。

作者指出这是因为线性模型只关注与权重紧密相关的输入，哪怕其他输入的值再大。

a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.

因此作者认为只要维度足够大，哪怕是最简单的线性模型也可以产生对抗样本，而过去的工作通过分析神经网络的高度非线性以阐释对抗样本。因此本工作通过线性的角度阐释对抗样本将更为简单。

FGSM方法的提出#

首先给定一个假设，大多数模型都被人为设计拥有一定程度的线性结构，另一方面即使是非线性的结构（如sigmoid激活函数）在处于非饱和态的时候都是非常接近线性的。那么基于这个假设，就可以认为模型将非常容易受到线性扰动的影响，从而我们可以利用线性性质非常快速地生成对抗样本。

Uzuki评论：所谓的“饱和态”就是比如sigmoid函数在分别接近0和1时，函数值将增长缓慢，而在“非饱和态”的时候拥有非常接近线性函数的结构

给定模型参数 $\theta$ ，模型输入 $x$ ，以及 $x$ 对应的标签 $y$ ， $J(\theta,x,y)$ 表示训练时所使用的损失函数。那么FGSM将沿着损失的线性方向（梯度方向）以产生一个最大范数约束的扰动：

η = ϵ sign (\nabla_{x} J (θ, x, y))

$\eta=\epsilon \text{sign}(\nabla_xJ(\theta,x,y))$

这个方法就被称为“fast gradient sign method”

对抗训练 v.s. L1正则化#

首先我们以最简单的模型为例，即对数几何回归（logistic regression）。给点类别 $y\in\{-1,1\}$ ，并且有 $P(y=1)=\sigma(w^Tx+b)$ ，其中 $\sigma(z)$ 是sigmoid函数，那么训练损失如下：

E_{x, y \sim p_{data}} ζ (- y (w^{T} x + b))

$\mathbb{E}_{x,y\sim p_\text{data}}\zeta(-y(w^Tx+b))$

其中 $\zeta(z)=\log(1+\exp(z))$ ，是softplus函数

Uzuki评论：

如果是按最大似然估计或者交叉熵损失： $\mathcal{L}(x, y)=-y\log P(y=1 \mid x)$ 来算的话，推导出来应该是 $\mathbb{E}_{x,y\sim p_\text{data}}y\zeta(-(w^Tx+b))$ ，疑似笔误了...

如果我们使 $w^T\text{sign}(w)=||w||_1$ ，那么可以给出对抗训练梯度的闭式解：

E_{x, y \sim p_{data}} ζ (y (ϵ | | w | |_{1} - w^{T} x - b))

$\mathbb{E}_{x,y\sim p_\text{data}}\zeta(y(\epsilon||w||_1-w^Tx-b))$

可以看到我们得到了一个非常接近 $L_1$ 正则化的形式，只不过这个正则项是在训练的时候在激活值中被直接减去，而不是附加在损失函数上，这就使得我们的对抗训练正则项会在模型趋于饱和的时候衰减。因此我们可以认为 $L_1$ 正则化比对抗训练具有更强的正则化作用，更为“悲观”。通过实验可以验证当 $L_1$ 正则化系数为 $0.0025$ 时，正则化效果也远高于我们取 $\epsilon=0.25$ 时的对抗训练，使得模型在训练集上存在 $5\%$ 的误差

更加深层网络的对抗训练#

作者认为更加深层的网络对对抗扰动的抗性更高，但是需要显式加入对抗训练进行训练。作者借用“万用拟合器理论（universal approximator theorem）”进行解释，即模型有能力拟合这些对抗样本从而增加对抗扰动的抗性。可以基于FGSM给出一个更广义的正则化损失函数：

~ J (θ, x, y) = α J (θ, x, y) + (1 - α) J (θ, x + ϵ sign (\nabla_{x} J (θ, x, y), y)

$\tilde J(\theta,x,y)=\alpha J(\theta,x,y)+(1-\alpha)J(\theta,x+\epsilon\text{sign}(\nabla_x J(\theta,x,y),y)$

作者通过实验证明经过对抗训练的模型对对抗样本的抗性更强，并且拥有了更强的可解释性与定位性

We also found that the weights of the learned model changed significantly, with the weights of the adversarially trained model being significantly more localized and interpretable (see Fig. 3).

作者也解释了为什么随机噪声的效果不如对抗样本，因为随机噪声大多数样本都不是如对抗样本那样困难的样本，以至于训练效果不好

作者同意Intriguing properties of neural networks一文提出的在越深入的层级进行对抗训练效果越好，这是因为中间层级的激活值经常都是无界的，更容易产生极端的输出

we find that networks with hidden units whose activations are unbounded simply respond by making their hidden unit activations very large, so it is usually better to just perturb the original input