Efficient Decision-based Black-box Patch Attacks on Video Recognition (background, problem addressed, contributions, method and experiments, code reproduction)

call me by ur name

Last modified 2024-05-29 16:23:03

Category column: 保研 (graduate school recommendation). Tags: deep learning, artificial intelligence, machine learning

First published 2024-03-20 12:46:06

Article link: https://blog.csdn.net/qq_61786525/article/details/136830255


Paper: "Efficient Decision-based Black-box Patch Attacks on Video Recognition"

A patch is described by its texture, position, and shape.

Background

General background

Although Deep Neural Networks (DNNs) have demonstrated excellent performance, they are vulnerable to adversarial patches that introduce perceptible and localized perturbations to the input.

Deep neural networks (DNNs) are vulnerable to adversarial patches.

Further, many defense methods that are effective against perturbation-based attacks turn ineffective against patch attacks.

(Quoted from the ICCV Open Access version of the paper, provided by the Computer Vision Foundation; the final published version of the proceedings is available on IEEE Xplore.)

Defenses built for perturbation-based attacks often stop working against a $patch\ attack$, which is why patch attacks deserve separate study.

Since the white-box is too ideal, more work has explored black-box score-based attacks, where the adversary can only access the model output (e.g. labels and corresponding logits) by querying models and utilize logits to design optimization strategies

Because the white-box setting is too idealized, research has moved on to black-box attacks.

as model security and privacy concerns continue to grow, obtaining logits has become increasingly difficult, particularly for video models used in security-critical tasks

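As concerns about model security and privacy keep growing, the logits that score-based black-box attacks depend on are becoming harder to obtain. This is why a decision-based attack is needed.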

Specific background

Generating adversarial patches on images has received much attention, while adversarial patches on videos have not been well investigated. Further, decision-based attacks, where attackers only access the predicted hard labels by querying threat models, have not been well explored on video models either, even if they are practical in real-world video recognition scenes.

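Adversarial patches on images have received wide attention and have made a lot of progress.
By contrast, decision-based attacks on video are under-studied. The best existing method is BSCA, a patch attack against video recognition models, so the robustness of video models still cannot be assessed well.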

Problem to solve

Compared to images, the temporal dimension of videos substantially enlarges the parameter space and incurs a significant query burden. Particularly, the mutual complement of information between frames increases the difficulty of attack. The large parameter space of the patch (position, shape, texture) and the scarce output of the model (top-1 predicted label) can easily lead the attack to local optima [30], which reduces the efficiency of the attack.

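Problems:

The parameter space is huge, and in the decision-based setting the model returns only limited information (the predicted label). This increases both the query burden (an adversarial attack may need a large number of interactions with the target model to infer its internal behavior or gradients) and the difficulty of the attack.
The patch carries many parameters while a decision-based attack gets very little information back from the model, so the search easily ends in a local optimum.

The goal is therefore a query-efficient attack on video models.

Explanation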

Parameter space in videos

Videos consist of a sequence of frames, and each frame contains a large number of pixels.
Each pixel has multiple color channels, such as red, green, and blue.
The combination of frames and pixels creates a huge parameter space in videos.
This means that there are numerous possible variations and configurations of pixels in a video, making it difficult to analyze and manipulate.
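To make the scale concrete, the short calculation below counts the raw pixel values in a single clip. The clip shape (16 frames of 112×112 RGB) is an assumption borrowed from the frame format used in the reproduction part of this post; other models use other shapes.

```python
# Count the raw pixel values in one video clip.
# Assumed clip shape: 16 frames, 112x112 pixels, 3 colour channels (RGB).
frames, height, width, channels = 16, 112, 112, 3

values_per_frame = height * width * channels
values_per_clip = frames * values_per_frame

print(values_per_frame)  # 37632
print(values_per_clip)   # 602112
```

Even for this small clip, a pixel-level attack would have to search over roughly 600,000 values.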

Minimal information returned by decision-based models

"Decision-based" describes the attack setting rather than a special kind of model: the attacker can only query the model and observe its final decision.
In the context of video recognition, that decision is the predicted class or label of a video.
The attacker therefore receives only limited information about each prediction, namely the predicted label.
No scores, logits, gradients, or other details about the internal workings behind the prediction are available.
This minimal information makes it challenging for attackers to understand and exploit the vulnerabilities of the model.

Increased attack difficulty

The combination of the huge parameter space in videos and the minimal information returned by decision-based models makes it difficult for attackers to craft effective adversarial patches.
Attackers need to explore a vast number of possible variations in the video frames to find the optimal perturbations that can fool the model.
Without detailed information about the model’s decision-making process, attackers have to rely on trial and error to find successful attacks.
This trial and error process becomes more challenging and time-consuming due to the large parameter space in videos.

Increased query burden

Query burden refers to the number of queries or requests an attacker needs to make to the model to gather information and optimize their attack.
In decision-based attacks, attackers can only access the predicted labels by querying the model.
With the minimal information returned by decision-based models, attackers need to make a large number of queries to gather enough information about the model’s behavior and vulnerabilities.
This increases the query burden and adds to the computational cost and time required for the attack.
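The decision-based setting and the query budget can be sketched as a thin wrapper that hides everything except the top-1 label and counts every call. This is an illustration only: `HardLabelOracle` and `toy_scores` are hypothetical names, and the toy scoring function merely stands in for a real video model.

```python
import numpy as np


class HardLabelOracle:
    """Expose only the top-1 label of a classifier and count the queries made."""

    def __init__(self, score_fn):
        # score_fn is a hypothetical callable: video array -> 1-D array of class scores.
        self._score_fn = score_fn
        self.num_queries = 0

    def query(self, video):
        """Return the predicted class index; the scores stay hidden from the caller."""
        self.num_queries += 1
        scores = np.asarray(self._score_fn(video))
        return int(np.argmax(scores))


def toy_scores(video):
    """Toy stand-in for a video model: the mean pixel value decides between two classes."""
    m = float(np.mean(video))
    return np.array([1.0 - m, m])


oracle = HardLabelOracle(toy_scores)
dark = np.zeros((16, 112, 112, 3))
bright = np.ones((16, 112, 112, 3))
labels = [oracle.query(dark), oracle.query(bright)]
print(labels, oracle.num_queries)  # [0, 1] 2
```

An attack built on such a wrapper can stop as soon as `num_queries` reaches its budget, which is how the average query number is usually tracked.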

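Contributions

The paper is the first to propose a $decision-based\ patch\ attack$ for video recognition. It combines the advantages of patch attacks and decision-based attacks to improve the assessment of video model robustness.

It proposes the $spatial-temporal\ differential\ evolution (STDE)\ framework$, which reduces the parameter space in the temporal and spatial domains.

Explanation

A "decision-based patch attack" is not a single attack method but a class of attacks in which the attacker can only access the model's decision output (the predicted hard label). The STDE framework is one concrete implementation of this class: it combines the characteristics of patch attacks and decision-based attacks to raise attack efficiency and lower the query burden.

Method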

STDE introduces target videos as prior knowledge to fill the texture of the patch and uses paired coordinates to model the position and shape of the patch.

Target videos are introduced first and supply the $patch\ texture$.

In the temporal domain, STDE performs binary encoding on the video sequence and selects keyframes according to the temporal difference, achieving a sparse attack.

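In other words, the video sequence is encoded as a binary sequence, keyframes are chosen according to the temporal difference, and the result is a sparse attack.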
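How a patch could be pasted once its position and keyframes are known can be sketched as follows. This is a minimal sketch under my own assumptions, not the authors' code: boxes are given per frame as (x1, y1, x2, y2) with the bottom-right corner exclusive, the texture is copied from the same location in the target video, and the names `build_mask` and `apply_patch` are hypothetical.

```python
import numpy as np


def build_mask(boxes, keyframes, height, width):
    """Binary mask of shape (T, H, W): 1 inside the patch on keyframes, 0 elsewhere."""
    mask = np.zeros((len(keyframes), height, width), dtype=np.float32)
    for t, (is_key, (x1, y1, x2, y2)) in enumerate(zip(keyframes, boxes)):
        if is_key:
            mask[t, y1:y2, x1:x2] = 1.0
    return mask


def apply_patch(clean, target, mask):
    """Fill the masked region of the clean video with pixels from the target video."""
    m = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return (1.0 - m) * clean + m * target


T, H, W = 16, 112, 112
clean = np.zeros((T, H, W, 3), dtype=np.float32)   # stand-in for the clean video
target = np.ones((T, H, W, 3), dtype=np.float32)   # stand-in for the target video
boxes = np.tile(np.array([10, 20, 30, 50]), (T, 1))  # same 20x30 box on every frame
keyframes = np.array([1, 0] * 8)                     # attack every other frame

mask = build_mask(boxes, keyframes, H, W)
adv = apply_patch(clean, target, mask)
print(adv.shape, float(mask.mean()))  # occluded fraction of the whole clip
```

Because only the frames flagged in `keyframes` are modified, the remaining frames stay identical to the clean video.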

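The STDE pipeline can be summarized in the following steps:

Population initialization: randomly generate N individuals (candidate solutions), each made up of a patch position P and a binary keyframe sequence FK. P holds the top-left and bottom-right coordinates of the patch, and its length equals the number of video frames. FK is a binary array, also as long as the number of video frames, that marks whether each frame is a keyframe. A mask matrix M is generated from the keyframes and patch positions. N is the number of individuals maintained by the STDE algorithm, not the number of video frames.
Initialization constraints: in the STDE framework the attack starts from adversarial examples that already mislead the threat model, so every individual v in the population V must produce an adversarial x_adv. The initial P and FK are sampled uniformly at random, subject to the initialization rate μ and the frame coverage rate cf.
Fitness function: in the decision-based setting the fitness function acts as the feedback from each query and scores the quality of every individual in the evolutionary algorithm. It brings the patch area into the optimization objective and, based on the model's response, quantifies attack success or failure as 0 or ∞. It is computed from the patch area and the intersection area of the different patches across keyframes.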
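Mutation: randomly pick two individuals vi and vj from the population, together with the current best individual vbest, and generate a new individual vnew by differential mutation. This covers position mutation and keyframe mutation. Position mutation uses integer arithmetic and keyframe mutation uses binary arithmetic. The mutation rate γ controls the strength of the mutation.
- Position mutation: uses the differential evolution strategy, Pnew = Pbest + γ * (Pi - Pj), where Pbest is the position parameter of the current best individual, Pi and Pj are the position parameters of the two randomly chosen individuals, γ is the mutation rate, and Pnew is the position parameter of the new individual. The difference term drives the mutation, and the best individual provides the search direction.
- Keyframe mutation: also uses the differential evolution strategy, FKnew = FKbest ∧ (FKi ∨ FKj), where FKbest is the keyframe parameter of the current best individual, FKi and FKj are the keyframe parameters of the two randomly chosen individuals, and FKnew is the keyframe parameter of the new individual. The mutation is carried out with logical operations and uses information from both the best individual and the randomly chosen ones.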
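Crossover: apply crossover to the new individual vnew, covering position crossover and keyframe crossover. Position crossover controls its strength by adding random noise, and keyframe crossover controls its strength by changing the state of α keyframes. The crossover rate α controls the strength of the crossover.
- Position crossover: uses a sparse crossover strategy that randomly adds noise to each position parameter, Pnew = S_Cross(Pnew, γ), where Pnew is the position parameter after crossover and γ is the rate. S_Cross is the sparse crossover operation: it randomly adds noise k, k ∈ {-1, 0, 1}, to each element of Pnew. Adjusting γ controls the strength of the position crossover.
- Keyframe crossover: randomly changes keyframe states while keeping the keyframe distribution continuous after crossover, FKnew = T_Cross(FKnew, α), where FKnew is the keyframe parameter after crossover and α is the crossover rate. T_Cross is the keyframe crossover operation: it randomly picks α positions in FKnew and changes their state. Adjusting α controls the strength of the keyframe crossover.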
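Selection: if the fitness g(vnew) of the new individual is better than the fitness g(vworst) of the current worst individual, vnew joins the population V and the worst individual is removed. Otherwise the population stays unchanged.
Output: once the maximum number of queries is reached, the individual with the lowest fitness is used to generate the final adversarial video.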
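The mutation, crossover, and selection steps above can be sketched in a few lines of NumPy. This is a simplified sketch, not the released implementation: how the rate enters S_Cross (here, a per-element probability), the fitness being just the patch area plus a 0/∞ penalty, and all function names are my assumptions. Clipping coordinates to the frame and keeping the top-left corner above the bottom-right one are left out.

```python
import numpy as np

rng = np.random.default_rng(0)


def mutate_position(p_best, p_i, p_j, gamma):
    """Differential mutation on integer coordinates: P_new = P_best + gamma * (P_i - P_j)."""
    return np.rint(p_best + gamma * (p_i - p_j)).astype(int)


def mutate_keyframes(fk_best, fk_i, fk_j):
    """Binary mutation: FK_new = FK_best AND (FK_i OR FK_j)."""
    return fk_best & (fk_i | fk_j)


def s_cross(p, rate, rng):
    """Sparse crossover: add noise k in {-1, 0, 1} to elements chosen with probability `rate`."""
    noise = rng.integers(-1, 2, size=p.shape)
    chosen = rng.random(p.shape) < rate
    return p + noise * chosen


def t_cross(fk, alpha, rng):
    """Temporal crossover: flip the state of `alpha` randomly chosen frames."""
    out = fk.copy()
    idx = rng.choice(len(fk), size=alpha, replace=False)
    out[idx] ^= 1
    return out


def fitness(is_adversarial, patch_area):
    """Patch area plus a penalty of 0 (attack succeeds) or infinity (attack fails)."""
    return patch_area + (0.0 if is_adversarial else float("inf"))


def select(population, scores, candidate, candidate_score):
    """Replace the worst individual when the candidate has a lower (better) fitness."""
    worst = int(np.argmax(scores))
    if candidate_score < scores[worst]:
        population[worst] = candidate
        scores[worst] = candidate_score
    return population, scores


p_best = np.array([10, 20, 40, 60])
p_i = np.array([12, 18, 44, 66])
p_j = np.array([8, 22, 40, 60])
p_new = mutate_position(p_best, p_i, p_j, gamma=0.5)
print(p_new)  # [12 18 42 63]

fk_best = np.array([1, 1, 0, 1, 0, 0, 1, 1])
fk_i = np.array([1, 0, 0, 0, 1, 0, 1, 0])
fk_j = np.array([0, 0, 1, 1, 0, 0, 0, 0])
fk_new = mutate_keyframes(fk_best, fk_i, fk_j)
print(fk_new)  # [1 0 0 1 0 0 1 0]

print(s_cross(p_new, 0.5, rng), t_cross(fk_new, 2, rng))
```

Note that the keyframe mutation can only switch frames off relative to the best individual, which pushes the search toward sparser attacks; the crossover step is what can switch frames back on.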

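Experiments

Experimental setup

Datasets: two popular video recognition datasets, UCF-101 and Kinetics-400.

Video recognition models: C3D, Non-local (NL), and TPN.

Video sampling: videos are randomly sampled from the UCF-101 test set and the Kinetics-400 validation set as clean videos, and target videos are selected in the same way. Each target video must belong to a different class from its clean video.

Evaluation metrics: Fooling Rate (FR), Average Occluded Area (AOA), Average Occluded Area in the Salient Region (AOA*), and Average Query Number (AQN).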
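Three of these metrics are simple averages over the attacked videos and can be sketched as below. I assume here that AOA and AQN are averaged over all attacked videos; the paper may average over successful attacks only. The four records are made-up numbers for illustration, not experimental results, and `summarize` is a hypothetical helper.

```python
import numpy as np


def summarize(results):
    """Aggregate per-video attack records into FR (%), AOA (%), and AQN.

    Each record is a dict with 'success' (bool), 'area' (occluded fraction
    in [0, 1]), and 'queries' (number of model queries used).
    """
    success = np.array([r["success"] for r in results], dtype=bool)
    area = np.array([r["area"] for r in results], dtype=float)
    queries = np.array([r["queries"] for r in results], dtype=float)
    return {
        "FR": 100.0 * float(success.mean()),
        "AOA": 100.0 * float(area.mean()),
        "AQN": float(queries.mean()),
    }


# Made-up records, only to show the call.
demo = [
    {"success": True, "area": 0.10, "queries": 1000},
    {"success": True, "area": 0.20, "queries": 2000},
    {"success": True, "area": 0.30, "queries": 3000},
    {"success": False, "area": 0.20, "queries": 4000},
]
metrics = summarize(demo)
print(metrics)
```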

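Performance comparison

Comparison with reinforcement learning: STDE is compared with TPA and BSCA, which use a reinforcement learning framework. The results show that STDE beats TPA on all metrics.
Comparison with random search: STDE is compared with Patch-Rs, which uses random search. STDE exceeds Patch-Rs in both attack strength and efficiency. The comparison with BSCA has two parts: (1) untargeted attacks and score-based attacks; (2) a portability check, split into STDE-W (white bullet-screen attack) and STDE-P (prior bullet-screen attack).
Comparison with Basin Hopping Evolution: STDE is compared with AdvW, which uses the Basin Hopping Evolution algorithm. STDE performs better than AdvW on untargeted attacks.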

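Diagnostic experiments

Hyperparameter tuning: vary STDE's hyperparameters, such as the mutation rate γ and the crossover rate α, and observe how attack performance changes.
Target video source: verify that STDE is robust to different sources of target videos.
Image quality assessment: use metrics such as PSNR, SSIM, MSE, and AOA* to quantify the quality of the generated adversarial videos.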
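Of these quality metrics, MSE and PSNR are easy to compute directly with NumPy using their standard definitions. The sketch below assumes 8-bit pixel values (maximum 255) and compares two toy clips; it is not the evaluation code used in the paper. SSIM is left out because it needs a windowed computation.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two videos of the same shape."""
    a = np.asarray(a, dtype=np.float64)  # cast first so uint8 values cannot wrap around
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the videos are identical."""
    err = mse(a, b)
    if err == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))


clean = np.zeros((2, 4, 4, 3), dtype=np.uint8)
shifted = clean + 1  # every pixel differs by exactly 1
print(mse(clean, shifted), psnr(clean, shifted))
```

A higher PSNR means the adversarial video is closer to the clean one, so a smaller patch on fewer keyframes should raise it.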

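Ablation study:

Temporal difference: studies how keyframe selection and the temporal difference affect attack performance.
Fitness function: examines how different norm constraints affect performance.
Different types of targets: compares Gaussian noise and solid-color blocks as the target video.
Robustness to different videos: verifies that STDE is robust to different target videos.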

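Time and memory cost:

Cost analysis: compares the time and memory cost of STDE and BSCA.

Attacking defense methods:

Defenses: tests how STDE performs against patch defenses such as Local Gradient Smoothing (LGS) and Digital Watermarking (DW).

Effectiveness analysis:

Attention distribution: uses Grad-CAM visualization to analyze how STDE achieves a targeted attack by changing the model's attention distribution.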

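Reproduction

Prerequisites

Paper code

The code released on GitHub only covers the attack on the C3D model. No processing code or pretrained weights are provided for the NL and TPN models mentioned in the paper, so I did not run experiments on them.

In the paper, STDE is evaluated on UCF-101 and Kinetics-400. Kinetics-400 is too large for the hardware I currently have, so I did not run experiments on it.

Experimental setup

Create the environment with the command given in the README of the GitHub repository: conda env create -f environment.yaml
On Windows, the machine-specific information in environment.yaml has to be deleted first, and some of the packages cannot be installed (in my runs the code still worked and produced normal results without them).
On Linux, the command can be used as is.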

Video Sampling. Following [17, 42, 41], we randomly sample one video from each class of test set on UCF-101 and validation set on Kinetics-400 as clean videos, and follow the same procedure to select target videos. We guarantee the video recognition model can accurately classify each video and the target video corresponding to each clean video belongs to a different class.

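Following the paper, I picked one video from each class of the UCF-101 test set. Instead of sampling at random, I used the first video of each class as the $clean\ video$ and the last video of each class as the $target\ video$.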

Following the data-processing requirements of the code, each video is converted into 16 PNG frames, numbered 0–15, with height and width 112×112.
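One way to produce such frames is sketched below with OpenCV. Several details are my assumptions rather than something the repository specifies: the 16 frames are spread evenly over the video, the files are named 0.png to 15.png, and `sample_frame_indices` and `video_to_pngs` are hypothetical helper names. Check the repository's data loader for the exact naming it expects.

```python
import os

import numpy as np


def sample_frame_indices(num_frames, num_out=16):
    """Pick num_out frame indices spread evenly over a video with num_frames frames."""
    return np.linspace(0, num_frames - 1, num_out).round().astype(int)


def video_to_pngs(video_path, out_dir, num_out=16, size=(112, 112)):
    """Save num_out resized frames of a video as 0.png, 1.png, ... in out_dir."""
    import cv2  # OpenCV, imported here so the index helper works without it

    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    if not frames:
        raise ValueError(f"could not read any frames from {video_path}")

    os.makedirs(out_dir, exist_ok=True)
    for k, idx in enumerate(sample_frame_indices(len(frames), num_out)):
        resized = cv2.resize(frames[idx], size)  # size is (width, height)
        cv2.imwrite(os.path.join(out_dir, f"{k}.png"), resized)


print(sample_frame_indices(160))
```

Videos with fewer than 16 frames would yield repeated indices under this scheme, so they need separate handling.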

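In addition, the video recognition model must classify both the $clean\ video$ and the $target\ video$ correctly. This has a large effect on the metric scores of the $untarget\ attack$.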

I then paired each $clean\ video$ with a $target\ video$ as 0–1, 2–3, 4–5, and so on, and dropped the pairs that did not meet the requirements above.
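The pairing rule can be sketched as below. This reflects one reading of the scheme: the numbers are class indices, the clean video comes from the even class and the target video from the following odd class, and a pair is kept only when the model classifies both videos correctly. The function name and the two boolean lists are hypothetical.

```python
def make_pairs(num_classes, clean_ok, target_ok):
    """Pair class 0 with 1, 2 with 3, ...; keep a pair only if both videos are classified correctly.

    clean_ok[c] / target_ok[c] say whether the model classifies the clean / target
    video chosen for class c correctly.
    """
    pairs = []
    for clean_cls in range(0, num_classes - 1, 2):
        target_cls = clean_cls + 1
        if clean_ok[clean_cls] and target_ok[target_cls]:
            pairs.append((clean_cls, target_cls))
    return pairs


# Toy example with 6 classes; the target video of class 3 is misclassified.
pairs = make_pairs(6, [True] * 6, [True, True, True, False, True, True])
print(pairs)  # [(0, 1), (4, 5)]
```

With an odd number of classes, as in UCF-101, the last class is left without a partner under this rule.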

Results

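The AOA* metric is the percentage of the salient region occupied by the patch. I did not compute it: neither the paper nor the code provides a function for computing the salient region. After searching online I tried the saliency functions that ship with the OpenCV library, but ran into matrix-mismatch errors that I could not resolve.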
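For reference, once a binary saliency mask of the same shape as the patch mask is available, the metric as described here reduces to a few lines. This is a sketch of that description only: how the paper obtains the salient region is not known to me, and `aoa_star` is a hypothetical helper. The explicit shape check is there to surface mismatches early.

```python
import numpy as np


def aoa_star(patch_mask, salient_mask):
    """Percentage of the salient region that is covered by the patch."""
    patch = np.asarray(patch_mask, dtype=bool)
    salient = np.asarray(salient_mask, dtype=bool)
    if patch.shape != salient.shape:
        raise ValueError(f"shape mismatch: {patch.shape} vs {salient.shape}")
    salient_area = int(salient.sum())
    if salient_area == 0:
        return 0.0
    return 100.0 * int((patch & salient).sum()) / salient_area


patch = np.zeros((4, 4), dtype=bool)
patch[0:2, 0:2] = True      # 2x2 patch in the top-left corner
salient = np.zeros((4, 4), dtype=bool)
salient[1:3, 1:3] = True    # 2x2 salient region in the centre
print(aoa_star(patch, salient))  # 25.0
```

If the saliency map comes out at a different resolution than the frames, it has to be resized to the frame size before this function is called.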

Excerpt of the final output

target attack

FR=100%
AOA=18.5%
query=2865
The query count deviates from the reported value by roughly 100; the other two metrics essentially match.

untarget attack

FR=100%
AOA=4.7%
query=2953
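The query count deviates from the reported value by roughly 300; the other two metrics essentially match.

A likely cause of the deviation is that my dataset setup differs from that of the original authors.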
