Paper Reading Notes — How Hard Is Trojan Detection in DNNs

HOW HARD IS TROJAN DETECTION IN DNNS? FOOLING DETECTORS WITH EVASIVE TROJANS

Paper Info

Paper: https://openreview.net/pdf?id=V-RDBWYf0go

Under review as a conference paper at ICLR 2023

The code is open-sourced, but the link is currently inaccessible.

Abstract

Trojan detection techniques are quite effective, but there is relatively little work on making trojans evade detection. The paper proposes a new method for crafting trojans that evade general detection, combining distribution matching, specificity, and randomization to remove the distinguishing features of trojaned networks. The resulting trojans are hard to detect, achieve a high attack success rate (ASR), and are hard to reverse-engineer.

Related Work

Trojan Attacks: adversarial perturbations, learnable triggers, etc.

Trojan Detection: reverse-engineering (label, neuron), query-based approaches, etc.

Evasive Trojans: there are many methods for making trojan triggers stealthy, but few for making the trojaned models themselves hard to detect. Prior attempts either only preserve clean accuracy (Gu et al., 2017; Chen et al., 2017), which is too simple, make overly strong assumptions (Xu et al. (2021), black-box setting), are not general (Bagdasaryan & Shmatikov (2021); Hong et al. (2021)), or only apply to one-layer networks (Goldwasser et al. (2022)). The most similar work is Sahabandu et al. (2022), which trains trojans and a meta-network detector in a min-max alternating fashion so that the trojans are hard to distinguish from clean networks.

Prior trojan-insertion methods also use a weak notion of specificity: a trojan is considered highly specific as long as it does not affect accuracy on clean examples. The authors extend this notion to include unintended triggers.

Contributions

1. A new method for making trojans evade detection in general. It combines distribution matching, specificity, and randomization to remove the distinguishing features of trojaned networks.

A detector-agnostic loss encourages the trojaned networks to be indistinguishable from clean networks.

It consists of a distribution matching loss inspired by the Wasserstein distance, along with specificity and randomization losses.

White-box threat model: defenders are allowed full access to training sets of evasive trojans.

Advantages: hard to detect, high ASR, and hard to reverse-engineer (measured via target label prediction and trigger synthesis).


2. The first work to systematically measure reverse-engineering on a large scale.

Background

Neural Trojans: the paper considers classification networks and all-to-one attacks.

Criteria for a successful attack: high ASR, high clean accuracy (CA), and high specificity.

Threat Model: trojan detection is framed as an interaction between an attacker and a defender. The defender has access to a clean dataset and to trojaned networks, and knows the attacker's trojan distribution.

Method

The loss function is $L_{task} + L_{trojan} + L_{evasion}$ (the first two are standard cross-entropy losses).

  • $L_{task}$: the task loss, which increases accuracy on clean examples (CA)
  • $L_{trojan}$: the trojan loss, which increases the attack success rate (ASR)
  • $L_{evasion}$ (the main contribution): the evasion loss, designed to make trojans hard to detect

$L_{evasion}$ is further decomposed into distribution matching, specificity, and randomization losses.

  • distribution matching loss: makes the distribution of trojaned networks as close as possible to that of clean networks
  • specificity and randomization losses: reinforce this central loss by addressing two remaining challenges (a sketch combining all of these terms is given after this list)
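
A minimal PyTorch-style sketch of how the three terms might be combined during training; the weighting coefficients and the `evasion_loss_fn` callback are illustrative assumptions, not the paper's exact implementation:

```python
import torch.nn.functional as F

def total_loss(f, clean_x, clean_y, trojan_x, target_y, evasion_loss_fn,
               lambda_trojan=1.0, lambda_evasion=1.0):
    """Combine L_task + L_trojan + L_evasion (weights are assumed)."""
    # L_task: cross-entropy on clean examples, keeps clean accuracy (CA) high
    l_task = F.cross_entropy(f(clean_x), clean_y)
    # L_trojan: cross-entropy to the attacker's target label on triggered examples (ASR)
    l_trojan = F.cross_entropy(f(trojan_x), target_y)
    # L_evasion: distribution matching + specificity + randomization (sketched later)
    l_evasion = evasion_loss_fn(f)
    return l_task + lambda_trojan * l_trojan + lambda_evasion * l_evasion
```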

1. Distribution matching loss

A GAN-based loss would be too expensive.

Instead, a loss based on the primal form of the 1-Wasserstein distance is used.

[Equation from the paper: distribution matching loss (image not reproduced)]

$\theta_f$ and $\theta_g$ are the parameters of the two networks, and $f'(x)$ and $g'(x)$ are their unnormalized output logits.
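
From the description above (a parameter-space term plus a logit-space term over clean inputs), the distribution matching loss presumably has roughly the following form; this is a reconstruction from the notes, not the paper's exact equation:

```latex
% Assumed form: distance between the trojaned network f and its clean initialization g,
% measured both on the parameters and on the unnormalized logits over clean inputs x.
\mathcal{L}_{\mathrm{dist}}
  = \bigl\lVert \theta_f - \theta_g \bigr\rVert_2
  + \mathbb{E}_{x}\bigl[\,\lVert f'(x) - g'(x) \rVert_2\,\bigr]
```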

2. Specificity

Previous approach: apply a cross-entropy loss to the clean label on negative examples.

used cross-entropy to the clean label on negative examples (Nguyen & Tran, 2021)

Drawback: clean networks do not behave normally on examples with triggers inserted, so forcing the clean label on such examples is itself a distinguishable behavior.


Low specificity means that unintended triggers can activate the trojan.

If a trojan has low specificity, a defender who knows the trigger distribution can check whether triggers drawn from it cause anomalous behavior and thereby detect the trojan (the Spec detector).

[Equation from the paper: specificity loss (image not reproduced)]

$f$ is the trojaned network, $g$ is its clean initialization evaluated on negative examples, and $h$ is the trigger insertion function.

In the experiments, incorrect triggers are inserted into 16 examples for blended attacks and 10 examples for patch attacks.
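
A rough sketch of this specificity idea, under the assumption that on negative examples the trojaned network f is pushed to match the outputs of its clean initialization g rather than the clean label; the KL divergence and the `insert_trigger` helper are assumptions:

```python
import torch
import torch.nn.functional as F

def specificity_loss(f, g, x_batch, insert_trigger):
    """Push f to behave like its clean initialization g on negative examples.

    insert_trigger(x) is assumed to stamp incorrect (unintended) triggers onto
    the batch; g is frozen and only used as a reference.
    """
    x_neg = insert_trigger(x_batch)           # negative examples with unintended triggers
    with torch.no_grad():
        ref = F.softmax(g(x_neg), dim=1)      # behavior of the clean initialization
    logp = F.log_softmax(f(x_neg), dim=1)
    # KL(reference || f) as one possible choice of divergence (assumption)
    return F.kl_div(logp, ref, reduction="batchmean")
```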

3. Randomization

The distribution matching loss introduces emergent coordination in parameter space, which makes the trojans easy to find with a logistic regression on summary features of the network parameters (the Param detector).

[Equation from the paper: randomization loss (image not reproduced)]

$\theta'_f$ and $\theta'_g$ are differentiable summary features (summary statistics) of $f$ and $g$ (applying the randomization loss directly in the raw parameter space would be too restrictive), and $r$ is a random direction fixed at the start of training.

This loss encourages the difference between $\theta_f$ and $\theta_g$ to not only be small, but to also point in a random direction.

In other words, the final loss keeps the two distributions close while pushing the difference between the new model's parameters and the original model's parameters to align with $r$.

Since $r$ differs across training runs, the parameter gaps between different trojaned models become larger; a minimal sketch of this idea follows.
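
One plausible reading of the randomization loss (the exact summary features and the alignment penalty are assumptions): compute differentiable summary statistics of both networks and penalize the part of their difference that does not point along the fixed random direction r.

```python
import torch
import torch.nn.functional as F

def summary_features(model):
    """Differentiable summary statistics of the parameters
    (illustrative choice: per-tensor mean and standard deviation)."""
    feats = []
    for p in model.parameters():
        if p.numel() > 1:
            feats.extend([p.mean(), p.std()])
    return torch.stack(feats)

def randomization_loss(f, g, r):
    """Encourage theta'_f - theta'_g to be small and aligned with the fixed
    random direction r (drawn once at the start of training)."""
    diff = summary_features(f) - summary_features(g).detach()
    r_hat = r / r.norm()
    # 1 - cosine similarity penalizes misalignment with r; the norm term keeps
    # the difference small. Both terms are assumptions about the exact form.
    return (1.0 - F.cosine_similarity(diff, r_hat, dim=0)) + diff.norm()
```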


Experiments

Setup

Training datasets: MNIST, CIFAR-10, CIFAR-100, and GTSRB (downsampled to 32x32)

For each dataset, 250 clean networks and 250 networks per trojan attack are generated.

More than 6k neural networks are trained in total for the evaluation.

Two attacks are used: the Patch attack (Gu et al., 2017) and the Blended attack (Chen et al., 2017), both implemented via data poisoning.
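
Both attacks poison a fraction of the training data by stamping a trigger onto inputs and relabeling them to the target class. A minimal sketch of the two trigger-insertion functions; the patch location, patch size, and blend ratio are illustrative assumptions:

```python
import torch

def insert_patch_trigger(x, patch, top=0, left=0):
    """Patch attack (Gu et al., 2017): overwrite a small region with a fixed patch.

    x: (B, C, H, W) image batch; patch: (C, h, w) tensor. The location is an assumption.
    """
    x = x.clone()
    _, h, w = patch.shape
    x[:, :, top:top + h, left:left + w] = patch
    return x

def insert_blended_trigger(x, trigger, alpha=0.1):
    """Blended attack (Chen et al., 2017): blend a full-size trigger into the image.

    trigger: (C, H, W) tensor with the same spatial size as the images; alpha is the blend ratio.
    """
    return (1.0 - alpha) * x + alpha * trigger
```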

Network architectures and hyperparameters

  • CIFAR-10 and CIFAR-100: 40-2 Wide ResNets (Zagoruyko & Komodakis, 2016) with a dropout rate of 0.3 (Srivastava et al., 2014).
  • GTSRB: SimpleViT Vision Transformer (Beyer et al., 2022) as implemented by lucidrains
  • MNIST: a simple 5-layer convnet with batch norm

Trojan attack procedure

  • Train a new set of clean networks, used to initialize and train the evasive trojans
  • Train the evasive trojans with the same patch and blended triggers as the standard trojans

Six detectors

  • accuracy-based detector (Acc): only checks clean accuracy (CA)

  • specificity-based detector (Spec): probes the network with random triggers (a sketch is given after this list)

computes the entropy of average posteriors on inputs with random triggers inserted. If a trojan has low specificity, these random triggers will activate it and reveal its presence

  • Neural Cleanse (NC)

  • ABS

  • MNTD: trains a set of queries and a shallow classifier on the training set of clean and trojaned models (Xu et al., 2021)

  • Param detector: a logistic regressor with a single linear layer that takes summary statistics of the raw network parameters as its input


    Max and Avg baselines: the maximum and average AUROC obtained by the six detectors on a given set of trojaned networks
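
A rough sketch of the Spec detector as described above (entropy of the average posterior over inputs with random triggers inserted); the random-trigger generator and how the score is thresholded are assumptions:

```python
import torch
import torch.nn.functional as F

def spec_score(model, x_batch, insert_random_trigger):
    """Detection score: entropy of the average posterior on randomly triggered inputs.

    A low-specificity trojan reacts to many unintended triggers, which skews the
    average posterior toward the target class and lowers its entropy, so lower
    entropy means more suspicious here.
    """
    x_trig = insert_random_trigger(x_batch)      # inputs with random triggers stamped in
    with torch.no_grad():
        probs = F.softmax(model(x_trig), dim=1)
    avg_post = probs.mean(dim=0)                 # average posterior over the batch
    entropy = -(avg_post * avg_post.clamp_min(1e-12).log()).sum()
    return entropy                               # scores across models are then fed to AUROC
```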

Results

The results are good overall.

The ASR of evasive trojans is similar to that of standard trojans.
[Table: attack success rates of standard vs. evasive trojans (image not reproduced)]
Evading detection and generality:

Evaluation metric on the test set of clean and trojaned networks: AUROC,

a threshold-independent metric that can be interpreted as the probability that a positive example has a higher detection score than a negative example (Fawcett, 2006), so 50% corresponds to random detection performance.

https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/
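
For reference, a minimal way to compute AUROC over a set of detection scores (clean networks labeled 0, trojaned networks labeled 1); the score arrays here are random placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder detection scores: one score per network from some detector.
scores_clean = np.random.rand(250)     # scores assigned to clean networks
scores_trojan = np.random.rand(250)    # scores assigned to trojaned networks

labels = np.concatenate([np.zeros(250), np.ones(250)])   # 0 = clean, 1 = trojaned
scores = np.concatenate([scores_clean, scores_trojan])

auroc = roc_auc_score(labels, scores)  # 0.5 means the detector is no better than chance
print(f"AUROC: {auroc:.3f}")
```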

Compared with standard trojans, the evasive trojans evade detection much better.

NC was actually designed to detect patch attacks, yet it detects blended attacks remarkably well.

[Tables: detection AUROC of the detectors on standard vs. evasive trojans (images not reproduced)]

Reverse-engineering

This was unexpected: reverse-engineering was not considered when designing the method, yet the evasive trojans resist it well.

Target Label Prediction

Metric: accuracy.

Evasive trojans make predicting the target label harder:

The effect on NC is particularly strong, dropping it from nearly perfect prediction to nearly random prediction.

[Table: target label prediction accuracy (image not reproduced)]

Trigger Synthesis

The task is to reconstruct the trigger mask.

Metric: mean intersection over union (IoU) across trojaned networks (a minimal computation sketch follows the table below).

The IoU on NC and ABS is very close to that of standard trojans, but the IoU on Param and MNTD decreases.

[Table: trigger synthesis IoU (image not reproduced)]
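
For reference, a minimal sketch of computing IoU between a reconstructed trigger mask and the ground-truth mask (NumPy arrays; the binarization threshold is an assumption):

```python
import numpy as np

def mask_iou(pred_mask, true_mask, threshold=0.5):
    """IoU between a reconstructed trigger mask and the ground-truth mask.

    Both masks are HxW arrays; the reconstructed mask is binarized at an assumed
    threshold before comparison.
    """
    pred = pred_mask > threshold
    true = true_mask > 0.5
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return intersection / union if union > 0 else 1.0  # two empty masks count as a perfect match
```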

Personal thoughts

Only the patch and blended attacks are tested; it is unclear how well the method works against other attacks.
