NARCISSUS: A Practical Clean-Label Backdoor Attack with Limited Information (paper notes)

wwweiyx

Category: AI Security. Tags: paper reading

Article link: https://blog.csdn.net/weiyuxin107/article/details/127983873

#Paper notes#

1. Paper information

Title	NARCISSUS: A Practical Clean-Label Backdoor Attack with Limited Information
Author	Yi Zeng (Virginia Tech, Blacksburg)
Publisher	None
PDF	📄 online PDF
Code	💻 PyTorch

The attacker knows only the target-class data. A surrogate network is first built from a POOD dataset and then fine-tuned on the target class; the trigger is generated by solving $\delta^{*}=\underset{\delta \in \Delta}{\arg \min } \sum_{(x, t) \in D_{t}} \mathcal{L}\left(f_{\theta_{\mathrm{sur}}}(x+\delta), t\right)$.

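2. Introduction

2.1 Background

Early backdoor attacks inject a backdoor trigger into images and change their labels to the target label, which forces the model to learn the association between the backdoor trigger and the target label. However, this leaves the dataset with visibly wrong labels that are easy to catch by manual inspection, which motivated research on clean-label attacks.

The simplest clean-label attack injects the backdoor trigger only into target-class samples and then trains on them. In practice this works poorly, because the model prefers to learn the natural features of the target class rather than the backdoor trigger. The paper's experiments show that the attack succeeds only when the backdoor trigger is added to 70% of the target-class data.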
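
A minimal sketch of this naive baseline, assuming NumPy arrays for the images and labels. The function name `poison_target_class`, the patch-style `mask`, and the `rate` parameter are hypothetical and not taken from the paper. It stamps a fixed trigger onto a chosen fraction of target-class images and leaves every label untouched:

```python
import numpy as np

def poison_target_class(images, labels, target, trigger, mask, rate, seed=0):
    """Stamp `trigger` onto a fraction `rate` of target-class images.

    images: (n, h, w) array, labels: (n,) array, trigger: (h, w) array,
    mask: (h, w) boolean array marking the pixels the trigger overwrites.
    Labels are never modified, which is what makes the attack clean-label.
    """
    rng = np.random.default_rng(seed)
    target_idx = np.flatnonzero(labels == target)      # indices of target-class samples
    n_poison = int(round(rate * target_idx.size))      # how many of them to poison
    chosen = rng.choice(target_idx, size=n_poison, replace=False)
    poisoned = images.copy()
    # Overwrite only the masked pixels of the chosen images.
    poisoned[chosen] = np.where(mask, trigger, poisoned[chosen])
    return poisoned, labels.copy(), chosen
```

With `rate=0.7` this reproduces the setting in which the naive attack starts to work, which is exactly the poisoning budget that a practical attack wants to avoid.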

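2.2 Existing problems

Existing clean-label attack methods all need access to the entire training set to obtain the knowledge they rely on. In realistic settings, obtaining training data for every class is costly or even unrealistic.

For example, when a face classifier is trained, the training set is uploaded by different users; an attacker can modify only their own data, not anyone else's.

2.3 Contributions

The paper proposes “NARCISSUS”, a simple and effective algorithm that needs only knowledge of the target-class data for the attack to succeed.
Modifying only 0.05% of the data achieves an 85.81% attack success rate, while model accuracy drops by only 1.72%.
It carries out physical-world experiments and is “the first workable physical-world clean-label backdoor attack”.
The attack is tested against popular defenses (Neural Cleanse, Fine-pruning) and state-of-the-art defenses (I-BAU, Anti Backdoor Learning). Because the generated trigger has high-frequency artifacts, a frequency-based defense can detect the attack effectively; however, once a low-frequency constraint is imposed on the generated trigger, frequency-domain detection fails.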
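
To put the 0.05% poisoning ratio listed above into perspective, here is a small worked calculation. The dataset size (50,000 images in 10 balanced classes) is an assumption chosen for illustration and is not stated in these notes, the ratio is read as a fraction of the whole training set, and `poison_budget` is a hypothetical helper.

```python
def poison_budget(n_samples, ratio):
    """Number of samples a poisoning ratio allows, given the size it is measured against."""
    return round(n_samples * ratio)

n_total = 50_000                      # assumed training-set size
n_classes = 10                        # assumed number of balanced classes
per_class = n_total // n_classes      # samples in the target class

# 0.05% of the whole training set (assumed reading of the reported ratio)
narcissus_budget = poison_budget(n_total, 0.0005)
# 70% of the target class, the level the naive clean-label attack needs
naive_budget = poison_budget(per_class, 0.70)

print(narcissus_budget, naive_budget)
```

Under these assumptions the budget is 25 images, against 3,500 for the naive target-class-only attack.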


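3. Method

3.1 Threat model

Threat model:

The victim is assumed to collect the training dataset from multiple sources, so the attacker can access part of the training data.
What the attacker knows:
1. Some representative samples of the target class.
2. Information about the task, for example whether it is a face classifier or a bird classifier, but not the specific classes.
3. Some extra samples related to the learning task, which the attacker can collect. The paper uses public out-of-distribution (POOD) examples.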

3.2 Optimization problem

Goal: generate the backdoor trigger through optimization.

Assume an oracle model $f_{\theta_{\text {orc }}}$, a clean model that contains no backdoor.

Optimization problem: $\delta^{*}=\underset{\delta \in \Delta}{\arg \min } \sum_{(x, t) \in D_{t}} \mathcal{L}\left(f_{\theta_{\text {orc }}}(x+\delta), t\right)$, where $D_{t}$ is the target-class data and $t$ is the target label.

“Intuitively, $\delta^{*}$ can be thought of as the most robust, representative feature of the target class, as adding it into any inputs would maximize the chance of them being predicted as the target class universally.”

Since $f_{\theta_{\text {orc }}}$ is not accessible, and inspired by black-box attacks, a surrogate model $f_{\theta_{\text {sur }}}$ is built from target-class samples and POOD samples, so the optimization problem becomes:

$\delta^{*}=\underset{\delta \in \Delta}{\arg \min } \sum_{(x, t) \in D_{t}} \mathcal{L}\left(f_{\theta_{\mathrm{sur}}}(x+\delta), t\right)$

“$\delta^{*}$ increases the confidence of all target-class examples and thus represents a direction that points towards the inside of the target class.”

A trigger produced by this optimization points towards the inside of the class.
Experiments show that the generated perturbation is robust across different model architectures.
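
The surrogate-based objective above can be sketched with projected gradient descent. This is a toy illustration under stated assumptions, not the authors' implementation: the surrogate is a linear softmax classifier with hypothetical parameters `W` and `b` (the paper uses a neural network), the loss is averaged rather than summed (which has the same minimizer), and `generate_trigger`, `steps`, and `lr` are names chosen here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def target_loss(W, b, X, delta, t):
    """Mean cross-entropy of the surrogate's prediction on x + delta against label t."""
    p = softmax((X + delta) @ W + b)
    return float(-np.log(p[:, t] + 1e-12).mean())

def generate_trigger(W, b, X_target, t, eps, steps=100, lr=0.01):
    """Projected gradient descent on delta, constrained to ||delta||_inf <= eps.

    X_target: (n, d) flattened target-class samples, W: (d, k), b: (k,).
    """
    delta = np.zeros(X_target.shape[1])
    onehot = np.zeros(W.shape[1])
    onehot[t] = 1.0
    for _ in range(steps):
        p = softmax((X_target + delta) @ W + b)        # (n, k) class probabilities
        grad = ((p - onehot) @ W.T).mean(axis=0)       # gradient of the mean loss w.r.t. delta
        delta = np.clip(delta - lr * grad, -eps, eps)  # descent step, then projection
    return delta
```

One shared `delta` is optimised over all target-class samples, which matches the single universal trigger $\delta^{*}$ in the formula; only target-class data and the surrogate are needed.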

3.3 Training process

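The surrogate model is obtained by first training on the POOD samples and then fine-tuning on the target class.

Why not simply combine the POOD data with the target class and train the model on the union?

Because whenever a different class is chosen as the target class, the model would have to be retrained.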
Trigger-Generation

$\Delta=\left\{\delta:\|\delta\|_{\infty} \leq \epsilon\right\}$

Projection to a $l_{\infty}$-norm ball can be done by just clipping each dimension of $\delta$ into $[-\epsilon,+\epsilon]$
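
Written out, the clipping projection and one generic projected-gradient step on the surrogate objective look as follows. The step size $\alpha$ and the iteration index $k$ are symbols assumed here for illustration; the notes do not give the exact update rule used in the paper.

$$
\left[\Pi_{\Delta}(\delta)\right]_{i}=\min \left(\max \left(\delta_{i},-\epsilon\right),+\epsilon\right)
$$

$$
\delta^{(k+1)}=\Pi_{\Delta}\left(\delta^{(k)}-\alpha \nabla_{\delta} \sum_{(x, t) \in D_{t}} \mathcal{L}\left(f_{\theta_{\mathrm{sur}}}\left(x+\delta^{(k)}\right), t\right)\right)
$$

Since $\Delta$ is a box, the projection acts on each coordinate independently, which is why a plain clip suffices.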
Trigger Insertion
Test Query Manipulation