Paper Reading Notes — How Hard Is Trojan Detection in DNNs

HOW HARD IS TROJAN DETECTION IN DNNS? FOOLING DETECTORS WITH EVASIVE TROJANS

Paper Info

Paper: https://openreview.net/pdf?id=V-RDBWYf0go

Under review as a conference paper at ICLR 2023

The code is open-sourced, but the link is currently inaccessible.

Abstract

Trojan detection techniques are quite effective, but there is relatively little work on making trojans evade detection. The paper proposes a new method for crafting trojans that evade general detection, combining distribution matching, specificity, and randomization to remove the distinguishing features of trojaned networks. The resulting trojans are hard to detect, achieve a high attack success rate (ASR), and are hard to reverse-engineer.

Related Work

Trojan Attacks: adversarial perturbations, learnable triggers, etc.

Trojan Detection: reverse-engineering (label, neuron), query-based approaches, etc.

Evasive Trojans: there are many methods for making trojan triggers stealthy, but few for making the trojaned models themselves hard to detect. Prior attempts either only preserve clean accuracy (Gu et al., 2017; Chen et al., 2017), which is too simple, make overly strong assumptions (Xu et al. (2021), black-box setting), are not general (Bagdasaryan & Shmatikov (2021); Hong et al. (2021)), or only apply to one-layer networks (Goldwasser et al. (2022)). The most similar work is Sahabandu et al. (2022), which trains trojans and a meta-network detector in a min-max alternating fashion so that the trojans are hard to distinguish from clean networks.

Prior trojan-insertion methods also use a weak notion of specificity: a trojan is considered highly specific as long as it does not affect accuracy on clean examples. The authors extend this notion to include unintended triggers.

Contributions

1. A new method for making trojans evade detection in general. It combines distribution matching, specificity, and randomization to remove the distinguishing features of trojaned networks.

A detector-agnostic loss encourages the trojaned networks to be indistinguishable from clean networks.

It consists of a distribution matching loss inspired by the Wasserstein distance, along with specificity and randomization losses.

White-box threat model: defenders are allowed full access to training sets of evasive trojans.

Advantages: hard to detect, high ASR, and hard to reverse-engineer (measured via target label prediction and trigger synthesis).


2. The first work to systematically measure reverse-engineering on a large scale.

Background

Neural Trojans: the paper considers classification networks and all-to-one attacks.

Criteria for a successful attack: high ASR, high clean accuracy (CA), and high specificity.

Threat Model: trojan detection is framed as an interaction between an attacker and a defender. The defender has access to a clean dataset and to trojaned networks, and knows the attacker's trojan distribution.

Method

The loss function is $L_{task} + L_{trojan} + L_{evasion}$ (the first two are standard cross-entropy losses).

  • $L_{task}$: the task loss, which increases accuracy on clean examples (CA)
  • $L_{trojan}$: the trojan loss, which increases the attack success rate (ASR)
  • $L_{evasion}$ (the main contribution): the evasion loss, designed to make trojans hard to detect

$L_{evasion}$ is further decomposed into distribution matching, specificity, and randomization losses.

  • distribution matching loss: makes the distribution of trojaned networks as close as possible to that of clean networks
  • specificity and randomization losses: reinforce this central loss by addressing two remaining challenges (a sketch combining all of these terms is given after this list)
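
A minimal PyTorch-style sketch of how the three terms might be combined during training; the weighting coefficients and the `evasion_loss_fn` callback are illustrative assumptions, not the paper's exact implementation:

```python
import torch.nn.functional as F

def total_loss(f, clean_x, clean_y, trojan_x, target_y, evasion_loss_fn,
               lambda_trojan=1.0, lambda_evasion=1.0):
    """Combine L_task + L_trojan + L_evasion (weights are assumed)."""
    # L_task: cross-entropy on clean examples, keeps clean accuracy (CA) high
    l_task = F.cross_entropy(f(clean_x), clean_y)
    # L_trojan: cross-entropy to the attacker's target label on triggered examples (ASR)
    l_trojan = F.cross_entropy(f(trojan_x), target_y)
    # L_evasion: distribution matching + specificity + randomization (sketched later)
    l_evasion = evasion_loss_fn(f)
    return l_task + lambda_trojan * l_trojan + lambda_evasion * l_evasion
```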

1. Distribution matching loss

A GAN-based loss would be too expensive.

Instead, a loss based on the primal form of the 1-Wasserstein distance is used.

[Equation from the paper: distribution matching loss (image not reproduced)]

$\theta_f$ and $\theta_g$ are the parameters of the two networks, and $f'(x)$ and $g'(x)$ are their unnormalized output logits.
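
From the description above (a parameter-space term plus a logit-space term over clean inputs), the distribution matching loss presumably has roughly the following form; this is a reconstruction from the notes, not the paper's exact equation:

```latex
% Assumed form: distance between the trojaned network f and its clean initialization g,
% measured both on the parameters and on the unnormalized logits over clean inputs x.
\mathcal{L}_{\mathrm{dist}}
  = \bigl\lVert \theta_f - \theta_g \bigr\rVert_2
  + \mathbb{E}_{x}\bigl[\,\lVert f'(x) - g'(x) \rVert_2\,\bigr]
```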

2. Specificity

Previous approach: apply a cross-entropy loss to the clean label on negative examples.

used cross-entropy to the clean label on negative examples (Nguyen & Tran, 2021)

Drawback: clean networks do not behave normally on examples with triggers inserted, so forcing the clean label on such examples is itself a distinguishable behavior.


Low specificity means that unintended triggers can activate the trojan.

If a trojan has low specificity, a defender who knows the trigger distribution can check whether triggers drawn from it cause anomalous behavior and thereby detect the trojan (the Spec detector).

[Equation from the paper: specificity loss (image not reproduced)]

$f$ is the trojaned network, $g$ is its clean initialization evaluated on negative examples, and $h$ is the trigger insertion function.

In the experiments, incorrect triggers are inserted into 16 examples for blended attacks and 10 examples for patch attacks.
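
A rough sketch of this specificity idea, under the assumption that on negative examples the trojaned network f is pushed to match the outputs of its clean initialization g rather than the clean label; the KL divergence and the `insert_trigger` helper are assumptions:

```python
import torch
import torch.nn.functional as F

def specificity_loss(f, g, x_batch, insert_trigger):
    """Push f to behave like its clean initialization g on negative examples.

    insert_trigger(x) is assumed to stamp incorrect (unintended) triggers onto
    the batch; g is frozen and only used as a reference.
    """
    x_neg = insert_trigger(x_batch)           # negative examples with unintended triggers
    with torch.no_grad():
        ref = F.softmax(g(x_neg), dim=1)      # behavior of the clean initialization
    logp = F.log_softmax(f(x_neg), dim=1)
    # KL(reference || f) as one possible choice of divergence (assumption)
    return F.kl_div(logp, ref, reduction="batchmean")
```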

3. Randomization

The distribution matching loss introduces emergent coordination in parameter space, which makes the trojans easy to find with a logistic regression on summary features of the network parameters (the Param detector).

[Equation from the paper: randomization loss (image not reproduced)]

$\theta'_f$ and $\theta'_g$ are differentiable summary features (summary statistics) of $f$ and $g$ (applying the randomization loss directly in the raw parameter space would be too restrictive), and $r$ is a random direction fixed at the start of training.

This loss encourages the difference between $\theta_f$ and $\theta_g$ to not only be small, but to also point in a random direction.

In other words, the final loss keeps the two distributions close while pushing the difference between the new model's parameters and the original model's parameters to align with $r$.

Since $r$ differs across training runs, the parameter gaps between different trojaned models become larger; a minimal sketch of this idea follows.
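
One plausible reading of the randomization loss (the exact summary features and the alignment penalty are assumptions): compute differentiable summary statistics of both networks and penalize the part of their difference that does not point along the fixed random direction r.

```python
import torch
import torch.nn.functional as F

def summary_features(model):
    """Differentiable summary statistics of the parameters
    (illustrative choice: per-tensor mean and standard deviation)."""
    feats = []
    for p in model.parameters():
        if p.numel() > 1:
            feats.extend([p.mean(), p.std()])
    return torch.stack(feats)

def randomization_loss(f, g, r):
    """Encourage theta'_f - theta'_g to be small and aligned with the fixed
    random direction r (drawn once at the start of training)."""
    diff = summary_features(f) - summary_features(g).detach()
    r_hat = r / r.norm()
    # 1 - cosine similarity penalizes misalignment with r; the norm term keeps
    # the difference small. Both terms are assumptions about the exact form.
    return (1.0 - F.cosine_similarity(diff, r_hat, dim=0)) + diff.norm()
```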


Experiments

Setup

Training datasets: MNIST, CIFAR-10, CIFAR-100, and GTSRB (downsampled to 32x32)

For each dataset, 250 clean networks and 250 networks per trojan attack are generated.

More than 6k neural networks are trained in total for the evaluation.

Two attacks are used: the Patch attack (Gu et al., 2017) and the Blended attack (Chen et al., 2017), both implemented via data poisoning.
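
Both attacks poison a fraction of the training data by stamping a trigger onto inputs and relabeling them to the target class. A minimal sketch of the two trigger-insertion functions; the patch location, patch size, and blend ratio are illustrative assumptions:

```python
import torch

def insert_patch_trigger(x, patch, top=0, left=0):
    """Patch attack (Gu et al., 2017): overwrite a small region with a fixed patch.

    x: (B, C, H, W) image batch; patch: (C, h, w) tensor. The location is an assumption.
    """
    x = x.clone()
    _, h, w = patch.shape
    x[:, :, top:top + h, left:left + w] = patch
    return x

def insert_blended_trigger(x, trigger, alpha=0.1):
    """Blended attack (Chen et al., 2017): blend a full-size trigger into the image.

    trigger: (C, H, W) tensor with the same spatial size as the images; alpha is the blend ratio.
    """
    return (1.0 - alpha) * x + alpha * trigger
```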

Network architectures and hyperparameters

  • CIFAR-10 and CIFAR-100: 40-2 Wide ResNets (Zagoruyko & Komodakis, 2016) with a dropout rate of 0.3 (Srivastava et al., 2014).
  • GTSRB: SimpleViT Vision Transformer (Beyer et al., 2022) as implemented by lucidrains
  • MNIST: a simple 5-layer convnet with batch norm

Trojan attack procedure

  • Train a new set of clean networks, used to initialize and train the evasive trojans
  • Train the evasive trojans with the same patch and blended triggers as the standard trojans

Six detectors

  • accuracy-based detector (Acc): only checks clean accuracy (CA)

  • specificity-based detector (Spec): probes the network with random triggers (a sketch is given after this list)

computes the entropy of average posteriors on inputs with random triggers inserted. If a trojan has low specificity, these random triggers will activate it and reveal its presence

  • Neural Cleanse (NC)

  • ABS

  • MNTD: trains a set of queries and a shallow classifier on the training set of clean and trojaned models (Xu et al., 2021)

  • Param detector: a logistic regressor with a single linear layer that takes summary statistics of the raw network parameters as its input


    Max and Avg baselines: the maximum and average AUROC obtained by the six detectors on a given set of trojaned networks
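
A rough sketch of the Spec detector as described above (entropy of the average posterior over inputs with random triggers inserted); the random-trigger generator and how the score is thresholded are assumptions:

```python
import torch
import torch.nn.functional as F

def spec_score(model, x_batch, insert_random_trigger):
    """Detection score: entropy of the average posterior on randomly triggered inputs.

    A low-specificity trojan reacts to many unintended triggers, which skews the
    average posterior toward the target class and lowers its entropy, so lower
    entropy means more suspicious here.
    """
    x_trig = insert_random_trigger(x_batch)      # inputs with random triggers stamped in
    with torch.no_grad():
        probs = F.softmax(model(x_trig), dim=1)
    avg_post = probs.mean(dim=0)                 # average posterior over the batch
    entropy = -(avg_post * avg_post.clamp_min(1e-12).log()).sum()
    return entropy                               # scores across models are then fed to AUROC
```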

Results

The results are good overall.

The ASR of evasive trojans is similar to that of standard trojans.
[Table: attack success rates of standard vs. evasive trojans (image not reproduced)]
Evading detection and generality:

Evaluation metric on the test set of clean and trojaned networks: AUROC,

a threshold-independent metric that can be interpreted as the probability that a positive example has a higher detection score than a negative example (Fawcett, 2006), so 50% corresponds to random detection performance.

https://glassboxmedicine.com/2019/02/23/measuring-performance-auc-auroc/
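
For reference, a minimal way to compute AUROC over a set of detection scores (clean networks labeled 0, trojaned networks labeled 1); the score arrays here are random placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder detection scores: one score per network from some detector.
scores_clean = np.random.rand(250)     # scores assigned to clean networks
scores_trojan = np.random.rand(250)    # scores assigned to trojaned networks

labels = np.concatenate([np.zeros(250), np.ones(250)])   # 0 = clean, 1 = trojaned
scores = np.concatenate([scores_clean, scores_trojan])

auroc = roc_auc_score(labels, scores)  # 0.5 means the detector is no better than chance
print(f"AUROC: {auroc:.3f}")
```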

Compared with standard trojans, the evasive trojans evade detection much better.

NC was actually designed to detect patch attacks, yet it detects blended attacks remarkably well.

[Tables: detection AUROC of the detectors on standard vs. evasive trojans (images not reproduced)]

Reverse-engineering

This was unexpected: reverse-engineering was not considered when designing the method, yet the evasive trojans resist it well.

Target Label Prediction

Metric: accuracy.

Evasive trojans make predicting the target label harder:

The effect on NC is particularly strong, dropping it from nearly perfect prediction to nearly random prediction.

[Table: target label prediction accuracy (image not reproduced)]

Trigger Synthesis

The task is to reconstruct the trigger mask.

Metric: mean intersection over union (IoU) across trojaned networks (a minimal computation sketch follows the table below).

The IoU on NC and ABS is very close to that of standard trojans, but the IoU on Param and MNTD decreases.

[Table: trigger synthesis IoU (image not reproduced)]
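
For reference, a minimal sketch of computing IoU between a reconstructed trigger mask and the ground-truth mask (NumPy arrays; the binarization threshold is an assumption):

```python
import numpy as np

def mask_iou(pred_mask, true_mask, threshold=0.5):
    """IoU between a reconstructed trigger mask and the ground-truth mask.

    Both masks are HxW arrays; the reconstructed mask is binarized at an assumed
    threshold before comparison.
    """
    pred = pred_mask > threshold
    true = true_mask > 0.5
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return intersection / union if union > 0 else 1.0  # two empty masks count as a perfect match
```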

Personal thoughts

Only the patch and blended attacks are tested; it is unclear how well the method works against other attacks.
