论文阅读笔记——Backdoor Defense with Machine Unlearning

大大怪将军zwh

已于 2022-11-03 17:21:08 修改

阅读量708

点赞数

文章标签：论文阅读深度学习

于 2022-11-03 16:51:35 首次发布

本文链接：https://blog.csdn.net/Wendy_094/article/details/127673752

版权

Backdoor Defense with Machine Unlearning

论文相关

paper地址：https://arxiv.org/abs/2201.09538

preliminarily accepted in IEEE International Conference on Computer Communications (IEEE INFOCOM 2022).

摘要

提出了一种新的方法 BAERASER，可以消除注入 victim model 的后门。具体来说，BAERASER主要通过两个关键步骤实现后门防御。首先，进行 trigger pattern 恢复，提取被受害者模型感染的触发模式。这里的触发模式恢复问题相当于从受害者模型中提取未知噪声分布的问题，可以通过基于熵最大化的生成模型轻松解决。随后，诱饵利用这些恢复的触发模式来逆转后门注入过程，并通过一种新设计的基于梯度上升的 machine unlearning 方法，诱导受害者模型来消除被污染的记忆。与以往的机器遗忘解决方案相比，该方法摆脱了对训练数据进行再训练的依赖，显示了更高的后门擦除有效性。

前人工作

Backdoor Injection Attack & Defense

后门攻击：很多种，This paper is proposed to defend against all kinds of attacks mentioned above

后门防御大体上分为两类：

后门检测：Strip、Deepinspect、One-pixel signature。最多能达到99%以上的准确率
后门消除：Fine-pruning、Neural attention distillation。效果不佳。

fine-tuning: straightforward and typically available to real-world applications, fine-tuning is not a robust method and fails to defend most of the state-of-the-art attacks with limited clean data

fine-pruning: usually causes considerable model performance degradation while achieving a high defense rate.

Neural attention distillation: not removing it thoroughly, and thus, cannot achieve a very high defense rate in most cases

Machine Unlearning

Cao et al.: 开山之作

优缺点：straightforward and effective, it can only be applied to the non-adaptive models and achieves poor performance on adaptive models, like neural networks.
the recently proposed methods to implement machine unlearning are mainly retraining based.

优缺点：easily understandable and practical for applications, the required computational and storage resources of these retraining-based methods are massive and usually unaffordable.

Contributions

insight: backdoor erasing is equivalent to unlearning the unexpected memory of the victim model about backdoor trigger patterns.

两个挑战：

find the targeted set of data to be unlearned

解决方案：adopt generative based model to recover the trigger patterns infected by the victim model without need to access any training data

之前方法的问题：Moreover, previous solutions for trigger pattern recovery are usually based on the typical generative model, like generative adversarial network (GAN), which often suffers from an unexpected performance loss when estimating high-dimensional trigger patterns.

Therefore, instead of using GAN, BAERASER adopts the mutual information neural estimator [16], a generative model that can avoid the above problem via entropy maximization.
前人工作的缺陷：most existing machine unlearning methods are based on retraining that requires a full access to the training set of the target model. However, such a rigid data access setting is usually hard to be satisfied in the backdoor defense scenarios as discussed in most prior works [6], [7], [9], [10].

解决方案：reverse the process via gradient ascent

a vanilla gradient ascent method in our evaluation can suffer from obvious model performance degradation caused by catastrophic forgetting. Therefore, BAERASER introduces a weighted penalty mechanism to mitigate the problem.

贡献如下：

提出了一种后门防御的新方法，use a max-entropy staircase approximator to achieve trigger reconstruction, and then, erase the injected backdoor via machine unlearning.
introduce a gradient ascent based machine unlearning method that can mitigate catastrophic forgetting through a dynamic penalty mechanism
conduct extensive experiments on four datasets with three state-of-the-art backdoor injection attacks. The results show that our method can lower the attack success rate by 98% on average with less than 5% accuracy drops, which outperforms most of prior methods

此外，This paper is proposed to defend against all kinds of attacks mentioned above

Method

Threat Model & Goals

the attacker has successfully launched the attack and the victim model needs to be purifified.

The defender has no prior knowledge about which images are polluted or the target label of the attacker.
The defender can only get access to a limited portion of validation data but cannot hold the whole training set.

Defense goal：在erasing trigger的同时保持CA

Defense Intuition and Overview

BAERASER involves two key steps, i.e., trigger pattern recovery and trigger pattern unlearning.

Trigger Pattern Recovery

max-entropy staircase approximator (MSA)

constrained the trigger pattern recovery problem to be an unknown noise distribution extraction problem.

Observation 1: the backdoor injection attack tricks F to learn a specific trigger pattern distribution ∆ that can transform F(X +∆) = y_i to be F(X +∆) = y_t.

Observation 2: If a trigger pattern ∆ exists, all inputs X +∆ will be classifified into the target class y_t.

和对抗样本的区别：对抗样本的noise适用于单个样本，而MSA的noise适用于 a batch of inputs；MSA还会 abandons the recovered trigger patterns whose overall attack success rates are lower than the desired threshold (line 13).

个人理解：

对每个label，经过n次迭代，recover出一个trigger，每次迭代都让 $F_{\theta_0}(x+G_i(\delta))[y_p]$ （ $F_{\theta_0}(x+G_i(\delta))$ 是添加了trigger的input经过backdoor model之后的输出， $F_{\theta_0}(x+G_i(\delta))[y_p]$ 就是输出为 $y_p$ 的概率）尽可能接近 $\varepsilon_i$ ，那么最后一次迭代， $F_{\theta_0}(x+G_i(\delta))[y_p]$ 就应该接近1。

max函数起到截断的作用，也就是说如果 $\varepsilon_i - F_{\theta_0}(x+G_i(\delta))[y_p] < 0$ ，也就是 $F_{\theta_0}(x+G_i(\delta))[y_p]$ 比 $\varepsilon_i$ 大（这正是我们所希望的，加了trigger之后在target label上的概率当然越大越好啦），所以这时候就让损失为0而不是为负数。

$H_i$ 是mutual information estimator，作用是让 $G_i(\delta)$ 和 $\delta'$ 尽可能相似？

第13行，只有ASR高于阈值 $\tau$ 的trigger才会被加入trigger pool。

Trigger Pattern Unlearning

直接的loss：

may cause obvious catastrophic forgetting, which makes the model suffer from signifificant performance degradation

tricks：

leveraging the validation data to maintain the memory of the target model over the normal data
introducing a dynamic penalty mechanism to punish the over-unlearning of the memorizes unrelated to trigger patterns.

As β/α increases, more attention of BAERASER is paid to control model performance loss, and otherwise, BAERASER tends to lower the ASR as much as possible.

实验

Baselines: Fine Tuning、Fine-Pruning、Neural Attention Distillation(NAD)

Attacks：BadNet [29], TrojanNN [9] and IMC[30]

Datasets：MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100

base model：ResNet18

Evaluation Metrics：Attack Success Rate (ASR) and model Accuracy (Acc)

Comparison analysis

效果非常好：

Overall, BAERASER signifificantly outperforms all baselines on most attack settings and lowers around 98% ASR of all three backdoor attacks while introducing negligible Acc drops.

characteristic:For BAERASER, its performance potentially becomes better along with the increasing of trigger sizes, however, other methods tends to perform worse in the same condition.

原因：machine unlearning is sensitive to the feature distribution required to be unlearned

According to Eq. 4, more distinguished trigger patterns will promote the gradient ascent of machine unlearning to converge faster to wards the desired direction and mitigate its side-effect on other normal data (less punishment)

Further Understanding of BAERASER

Trade-off between ASR and Acc

lower ASR requires more Acc sacrififice, and the sacrifificed Acc intensively grows with the decreasing of ASR.

Besides, although Acc decay rate increases with unlearning iterations, we can still maintain less than 10% Acc drops as the ASR is decreased to be almost 0%.

Impact of Holding Ratio

BAERASER performs better with higher holding ratios of clean data. Furthermore, even with only 1% clean data (500 clean data points), BAERASER still achieves an appealing defense rate (lowering at most 98% ASR)

Impact of α and β