SPARK: Spatial-aware Online Incremental Attack

theonePhoebe

Tags: machine learning

Article link: https://blog.csdn.net/qq_42026346/article/details/113861852

Notes: SPARK: Spatial-aware Online Incremental Attack

Reference: Guo Q. et al. (2020) SPARK: Spatial-Aware Online Incremental Attack Against Visual Tracking. In: Vedaldi A., Bischof H., Brox T., Frahm JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12370. Springer, Cham. https://doi.org/10.1007/978-3-030-58595-2_13

Problem to solve

Generate imperceptible perturbations online that mislead a tracker along an incorrect trajectory (Untargeted Attack, UA) or along a specified trajectory (Targeted Attack, TA).

Difficulty

Object tracking processes incoming frames one by one, in order. When the current frame $t$ is under attack, all previous frames have already been analyzed and cannot be changed, and future frames are not yet available.
Object tracking usually depends on an object template cropped from the first frame of a video. Different initial templates can lead to different tracking results, which often renders a universal adversarial perturbation ineffective.
Object tracking runs at real-time speed, so the attack must be efficient enough to finish perturbing the current frame before the next frame arrives.

Basic problem definition

Online video with $T$ frames: $V = \{X_t\}_1^T$ , where $X_t$ is the $t$ th frame;

Tracker with parameters $\theta$ : $\phi_\theta(.)$ ;

Object template: $T$ ;

At $t$ th frame, the tracker calculates: $\{y_t^i,b_t^i\}=\phi_\theta(X_t,T)$ , $b_t^i\,$ is the $i$ th object candidate, $y_t^i\,$ is the positive activation of $b_t^i\,$ ;

The tracker’s predictive bounding box at the clean $t$ th frame: $b_t^{gt}$ ;

The object tracker assigns the predictive result: $OT(X_t,T)=b_t^{gt}=b_t^k$ , $k=\arg\max_{1\leq i\leq N}(y_t^i)$ , where $N$ is the number of candidates.
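The selection rule above can be illustrated with a minimal sketch. The function name `select_prediction`, the `(x, y, width, height)` box layout, and the sample values are assumptions made for illustration, not part of the paper.

```python
def select_prediction(activations, boxes):
    """Return (k, b_k): the index and box of the candidate with the highest activation."""
    if len(activations) != len(boxes) or not boxes:
        raise ValueError("need exactly one activation per candidate box")
    k = max(range(len(activations)), key=lambda i: activations[i])
    return k, boxes[k]


# Hypothetical candidates, each given as (x, y, width, height).
boxes = [(10, 10, 40, 40), (60, 20, 40, 40), (15, 12, 40, 40)]
activations = [0.2, 0.1, 0.9]
k, b_gt = select_prediction(activations, boxes)
```

Here the third candidate has the largest activation, so it becomes the predicted bounding box.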

To attack a tracker $O T (.)$ , use another tracker $O T^{'} (.)$ to generate adversarial examples.

Untargeted Attacker

$minimize \; D(X_t,X_t+E_t)$
$subject \; to\; IoU(OT'(X_t+E_t,T),b_t^{gt'})=0$
$E_t$ is the desired distortion and $D$ is a distance metric.

$X_t^a = X_t+E_t$ is the generated adversarial example. $b_t^{gt'}$ is the predictive result of the clean frame $X_t$ .

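Objective function:
$f^{ua}(X_t+E_t,T)=y_t^{gt'}-\max \limits_{IoU(b_t^i,b_t^{gt'})=0}(y_t^i)$
The constraint is satisfied when $f^{ua}(X_t+E_t,T)<0$ , i.e., when some candidate that does not overlap $b_t^{gt'}$ has a higher activation than $b_t^{gt'}$ itself.
$y_t^{gt'}$ is the activation value of $b_t^{gt'}$ (the ‘correct’ prediction) and $y_t^i$ is the activation value of candidate $b_t^i$ (a desired prediction when it does not overlap $b_t^{gt'}$ ), where $\{(y_t^i,b_t^i)\}_{i=1}^N=\phi_{\theta'}(X_t+E_t,T)$ .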
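As a rough illustration of the untargeted objective, the sketch below evaluates $f^{ua}$ on a handful of candidates. It assumes that candidate boxes keep the same indices before and after the perturbation (so `gt_index` points at $b_t^{gt'}$ ), and that boxes are stored as `(x, y, width, height)`. The helper names and numbers are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def f_ua(activations, boxes, gt_index):
    """Activation of b^{gt'} minus the best activation among boxes with zero IoU to it."""
    b_gt = boxes[gt_index]
    rivals = [y for y, b in zip(activations, boxes) if iou(b, b_gt) == 0.0]
    if not rivals:
        raise ValueError("no candidate is disjoint from the clean prediction")
    return activations[gt_index] - max(rivals)


boxes = [(0, 0, 10, 10), (5, 5, 10, 10), (50, 50, 10, 10), (80, 0, 10, 10)]
clean_scores = [0.9, 0.8, 0.3, 0.1]      # tracker still prefers box 0
attacked_scores = [0.2, 0.8, 0.7, 0.1]   # a disjoint box now beats box 0
ua_clean = f_ua(clean_scores, boxes, gt_index=0)
ua_attacked = f_ua(attacked_scores, boxes, gt_index=0)
```

With the clean scores the value is positive (the attack has not succeeded); with the attacked scores it is negative, which is the condition the attacker aims for. Box 1 overlaps box 0, so it is ignored by the maximum even though its score is high.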

Targeted Attacker

$minimize \; D(X_t,X_t+E_t)$
$subject \; to\; ce(OT'(X_t+E_t,T))=p_t^{tr}$
$E_t$ is the desired distortion and $D$ is a distance metric.

$p_t^{tr}$ is the targeted position at frame $t$ , $c e (.)$ outputs the center position of a bounding box.

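Objective function:
$f^{ta}(X_t+E_t,T)=y_t^{gt'}-\max \limits_{ce(b_t^i)=p_t^{tr}}(y_t^i)$
The constraint is satisfied when $f^{ta}(X_t+E_t,T)<0$ , i.e., when a candidate centered at $p_t^{tr}$ has a higher activation than $b_t^{gt'}$ .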
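The targeted objective can be sketched in the same way. The `tol` parameter below is an assumption added for illustration (the formula itself uses exact equality of the center and the target position), as are the function names and sample values.

```python
def center(box):
    """Center (cx, cy) of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def f_ta(activations, boxes, gt_index, target_pos, tol=0.0):
    """Activation of b^{gt'} minus the best activation among boxes centered at target_pos."""
    hits = []
    for y, b in zip(activations, boxes):
        cx, cy = center(b)
        if abs(cx - target_pos[0]) <= tol and abs(cy - target_pos[1]) <= tol:
            hits.append(y)
    if not hits:
        raise ValueError("no candidate is centered at the target position")
    return activations[gt_index] - max(hits)


boxes = [(0, 0, 10, 10), (20, 20, 10, 10), (15, 15, 20, 20)]
scores = [0.9, 0.4, 0.6]
ta_value = f_ta(scores, boxes, gt_index=0, target_pos=(25.0, 25.0))
```

Boxes 1 and 2 are both centered at (25, 25); the better of the two scores 0.6, so the value is about 0.3 and the targeted attack has not yet succeeded on these scores.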

Empirical study

In this part, the paper performs an empirical study of two questions:

How effective is it to apply a basic attack (BA) to every frame? (BA-E)
How do the temporal frames in the video affect the attack? (BA-R)

BA-E: attack every frame using FGSM, BIM, and C&W;

BA-R1: randomly select frames and apply the basic attack with a probability of 0.1;

BA-R2: select frames at an interval of 10 frames and apply the basic attack.

The conclusion is that BA-E is not efficient enough to attack a real-time tracker, while BA-R sacrifices success rate: BA-R1 and BA-R2 only work at the specific frames on which the attacks are performed. Perturbations generated by BA are therefore difficult to transfer directly to subsequent frames, because the scene in the video is dynamic.
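To make the three schedules concrete, here is a small sketch of which frame indices each variant would attack. It assumes that the interval variant starts at frame 0 and that frame indices are 0-based; the function name and the `seed` parameter are hypothetical.

```python
import random


def frames_to_attack(num_frames, mode, seed=0):
    """Return the sorted frame indices attacked under BA-E, BA-R1, or BA-R2."""
    if mode == "BA-E":
        # Every frame is attacked.
        return list(range(num_frames))
    if mode == "BA-R1":
        # Each frame is attacked independently with probability 0.1.
        rng = random.Random(seed)
        return [t for t in range(num_frames) if rng.random() < 0.1]
    if mode == "BA-R2":
        # One attacked frame every 10 frames (assumed to start at frame 0).
        return list(range(0, num_frames, 10))
    raise ValueError(f"unknown mode: {mode}")


every = frames_to_attack(100, "BA-E")
sampled = frames_to_attack(100, "BA-R1")
spaced = frames_to_attack(100, "BA-R2")
```

BA-E pays the attack cost on all 100 frames, whereas the two BA-R variants touch only about a tenth of them, which is why they are cheaper but leave most frames unperturbed.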

SPARK online incremental attack

SPARK achieves transferability between nearby frames. The intuition is to still attack every frame, but to apply the previous perturbations to the new frame and combine them with a small but effective incremental perturbation obtained via optimization.

At frame $t$ , UA is defined as:
$minimize \; D(X_t,X_t+E_{t-1}+\epsilon_t)$
$subject \; to\; IoU(OT'(X_t+E_{t-1}+\epsilon_t,T),b_t^{gt'})=0$

$\epsilon_t$ is the incremental perturbation, i.e. $\epsilon_t=E_t-E_{t-1}$ .

So $E_t=\epsilon_t+\sum_{\tau=t_0}^{t-1}\epsilon_\tau$ , where $t_0 = t-L$ is the start of an attack.

The new objective function is:
$f^{ua}(X_t+\epsilon_t+\sum_{t-L}^{t-1}\epsilon_\tau,T)+\lambda||\Gamma||_{2,1}$

$\Gamma=[\epsilon_{t-L},...,\epsilon_{t-1},\epsilon_t]$ .

The $X_t+\epsilon_t+\sum_{t-L}^{t-1}\epsilon_\tau$ part is the incremental perturbation strategy. SPARK also introduces the $L_{2,1}$ norm to regularize $\{\epsilon_\tau\}$ .
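The accumulation $\epsilon_t+\sum_{t-L}^{t-1}\epsilon_\tau$ can be sketched as a bounded history of increments. This treats $L$ as a sliding window, which is one reading of the sum bounds and an assumption here. Perturbations are flattened to plain lists, and the class and method names are hypothetical.

```python
from collections import deque


class IncrementalPerturbation:
    """Keeps the last L increments and adds them to the current one."""

    def __init__(self, window):
        # Only the L most recent increments eps_{t-L} .. eps_{t-1} are retained.
        self.history = deque(maxlen=window)

    def total(self, eps_t):
        """Perturbation applied to frame t: eps_t plus all stored increments."""
        total = list(eps_t)
        for eps in self.history:
            total = [a + b for a, b in zip(total, eps)]
        return total

    def commit(self, eps_t):
        """Store the increment of frame t once its optimization is finished."""
        self.history.append(list(eps_t))


acc = IncrementalPerturbation(window=2)
for eps in ([1.0, 0.0], [0.0, 2.0], [3.0, 0.0]):
    acc.commit(eps)
# Only [0, 2] and [3, 0] remain in the window; [1, 0] has been dropped.
e_t = acc.total([0.5, 0.5])
```

Because the earlier increments are reused as they are, only the small `eps_t` needs to be optimized at each new frame.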

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.
https://en.wikipedia.org/wiki/Regularization_(mathematics)
In mathematics, a norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin: it commutes with scaling, obeys a form of the triangle inequality, and is zero only at the origin. In particular, the Euclidean distance of a vector from the origin is a norm, called the Euclidean norm, or 2-norm, which may also be defined as the square root of the inner product of a vector with itself.
https://en.wikipedia.org/wiki/Norm_(mathematics)

$L_1$ norm (absolute-value norm):
$||X||_1=\sum_i|x_i|$

LASSO Regression (L1)
“You can also think of L1 as reducing the number of features in the model altogether.”
https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
“So, this works well for feature selection in case we have a huge number of features.”
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

$L_2$ norm (Euclidean norm):
$||X||_2=\sqrt{\sum_i x_i^2}$

Ridge regression (L2)
“L2 regularization forces the weights to be small but does not make them zero and does non sparse solution. Ridge regression performs better when all the input features influence the output and all with weights are of roughly equal size.”
https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2

The $L_{2,1}$ norm is the sum of the Euclidean norms of the columns of the matrix. For an $m\times n$ matrix $X$ :
$||X||_{2,1}=\sum_{j=1}^n\sqrt{{\sum_{i=1}^m}x_{i,j}^2}=\sum_{j=1}^n{||X_{:,j}||}_2$

“The $L_{2,1}$ norm as an error function is more robust, since the error for each data point (a column) is not squared. It is used in robust data analysis and sparse coding.”
https://en.wikipedia.org/wiki/Matrix_norm
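The three norms can be compared on a tiny example. The sketch below follows the column-wise definition quoted from Wikipedia and stores the matrix as a list of rows; the function names are chosen for illustration.

```python
import math


def l1_norm(x):
    """Sum of absolute values."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """Square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in x))


def l21_norm(matrix):
    """Sum of the Euclidean norms of the columns of a row-major matrix."""
    return sum(l2_norm(column) for column in zip(*matrix))


vector = [3.0, -4.0]
matrix = [
    [3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0],
]
norms = (l1_norm(vector), l2_norm(vector), l21_norm(matrix))
```

The matrix columns have Euclidean norms 5, 0, and 1, so the $L_{2,1}$ norm is 6. A column that is entirely zero contributes nothing, which is why penalizing this norm tends to switch off whole columns at once.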

Strategy

SPARK uses sign gradient descent with a step size of 0.3 to minimize the two objective functions, followed by a clip operation. This can be effective and efficient because:

Optimizing $\epsilon_t$ is equivalent to optimizing $E_t$ with $E_{t-1}$ as the starting point. Since neighboring frames of a video are usually similar, such a starting point helps obtain an effective perturbation within very few iterations.
The $L_{2,1}$ norm makes the incremental perturbations spatially and temporally sparse and makes $E_t$ more imperceptible.

Sign gradient descent: involves the sign of the gradient instead of the gradient itself.
https://www.sciencedirect.com/science/article/abs/pii/S0020025519303135
The Geometry of Sign Gradient descent

In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue. Both properties are common in deep networks.
$x_{t+1}=x_t-\alpha\,sign(\nabla f_t)$ , where $\alpha$ is the step size.
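A minimal sketch of sign gradient descent followed by a clip is shown below, applied to a toy quadratic rather than to a tracker. The symmetric clip range `[-bound, bound]`, the step size, and the objective are assumptions for illustration only.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_gradient_descent(grad_fn, x0, step, num_iters, bound):
    """Move each coordinate by a fixed step against the gradient sign, then clip."""
    x = list(x0)
    for _ in range(num_iters):
        grad = grad_fn(x)
        x = [xi - step * sign(gi) for xi, gi in zip(x, grad)]
        x = [max(-bound, min(bound, xi)) for xi in x]
    return x


# Toy objective: f(x) = sum_i (x_i - c_i)^2, with gradient 2 * (x - c).
target = [1.0, -2.0]


def grad_fn(x):
    return [2.0 * (xi - ci) for xi, ci in zip(x, target)]


result = sign_gradient_descent(grad_fn, x0=[0.0, 0.0], step=0.5, num_iters=4, bound=1.5)
```

The first coordinate reaches its optimum at 1.0 and stops moving because its gradient sign becomes 0. The second would continue toward -2.0 but is held at -1.5 by the clip, which mirrors how a clip keeps a perturbation within a fixed budget.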

In practice, SPARK is performed every 30 frames, and $E_{t_0}$ is calculated with 10 iterations. The attack is also applied to the search region of the attacked tracker instead of the whole frame. The search region of the $t$ th frame is cropped from $X_t$ , centered at the predictive result of frame $t - 1$ , i.e., $b_{t-1}^a$ . The trackers can therefore be reformulated as $\phi_{\theta'}(X_t,T,b_{t-1}^a)$ and $\phi_\theta(X_t,T,b_{t-1}^a)$ .
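Cropping a search region around the previous prediction might look like the sketch below. It clamps the window to the frame borders, whereas real trackers typically pad instead. The square window, the `(x, y, width, height)` box layout, and the function name are assumptions for illustration.

```python
def crop_search_region(frame, prev_box, size):
    """Crop a size x size window centered on prev_box, clamped to the frame."""
    height, width = len(frame), len(frame[0])
    if size > height or size > width:
        raise ValueError("search region is larger than the frame")
    cx = prev_box[0] + prev_box[2] / 2.0
    cy = prev_box[1] + prev_box[3] / 2.0
    # Top-left corner of the window, kept inside the frame.
    x0 = max(0, min(width - size, int(round(cx - size / 2.0))))
    y0 = max(0, min(height - size, int(round(cy - size / 2.0))))
    region = [row[x0:x0 + size] for row in frame[y0:y0 + size]]
    return (x0, y0), region


# Toy 8 x 8 single-channel frame whose pixel value encodes its position.
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
origin, region = crop_search_region(frame, prev_box=(2, 2, 2, 2), size=4)
edge_origin, _ = crop_search_region(frame, prev_box=(6, 6, 2, 2), size=4)
```

A perturbation computed on `region` can be written back into the full frame at `origin`, so only the search region is modified.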

Pseudocode (TA)

[Figure: pseudocode of the SPARK targeted attack]

Experiment

Dataset
OTB100, VOT2018, UAV123, and LaSOT.

OTB100: The full benchmark contains 100 sequences from recent literature. Each row in the ground-truth files gives the bounding box of the target in that frame as (x, y, box-width, box-height).
VOT2018: A dataset for visual object tracking that consists of 60 challenging videos collected from real-life datasets, annotated with rotated bounding boxes.
UAV123: Contains a total of 123 video sequences and more than 110K frames, making it the second-largest object tracking dataset after ALOV300++. All sequences are fully annotated with upright bounding boxes.
LaSOT: A high-quality benchmark for Large-scale Single Object Tracking. It consists of 1,400 sequences with more than 3.5M frames in total. Each frame is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmarks. The average video length is more than 2,500 frames, and each sequence comprises various challenges from the wild, where target objects may disappear and re-appear in the view.

Models

SiamRPN-based trackers (SiamRPN++ and Siamese-RPN) that use AlexNet, MobileNetV2, and ResNet-50 as backbones.
Online updating variants of SiamRPN-based trackers.
SiamDW tracker.

Metrics

Prec. Drop (UA): the precision drop of a tracker after being attacked.
A tracker locates an object successfully if the center location error satisfies $CLE(b_t,b_t^{an}) = ||ce(b_t)-ce(b_t^{an})||_2<20$ , where $b_t^{an}$ is the annotated bounding box.
Succ. Rate (TA): the rate of frames where an attack method fools a tracker successfully. An attacker succeeds at frame $t$ if $||ce(b_t)-p_t^{tr}||_2<20$ .
MAP (mean absolute perturbation): measures the distortion of adversarial perturbations. $MAP=\frac{1}{D*K}\sum_d\sum_k\frac{1}{M*C}\sum_i\sum_c|E_{k,d}(i,c)|$ , where $D$ is the number of videos in a dataset, and $K$ , $M$ , and $C$ refer to the number of frames, pixels, and channels, respectively.
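The metrics can be sketched on toy data as follows. The nested-list layout (`videos[d][k][i][c]`), the function names, and the sample boxes are assumptions. MAP is averaged per video first, which matches the formula when every video has the same number of frames $K$ .

```python
import math


def center(box):
    """Center (cx, cy) of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def cle(box_a, box_b):
    """Center location error: Euclidean distance between the two box centers."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return math.hypot(ax - bx, ay - by)


def precision(pred_boxes, annotated_boxes, threshold=20.0):
    """Fraction of frames whose CLE is below the threshold."""
    hits = sum(1 for p, a in zip(pred_boxes, annotated_boxes) if cle(p, a) < threshold)
    return hits / len(pred_boxes)


def mean_absolute_perturbation(videos):
    """Average of |E| over channels, pixels, frames, and videos."""
    per_video = []
    for frames in videos:
        per_frame = []
        for frame in frames:
            values = [abs(v) for pixel in frame for v in pixel]
            per_frame.append(sum(values) / len(values))
        per_video.append(sum(per_frame) / len(per_frame))
    return sum(per_video) / len(per_video)


preds = [(0, 0, 10, 10), (100, 100, 10, 10)]
annotations = [(5, 5, 10, 10), (0, 0, 10, 10)]
prec = precision(preds, annotations)

# One video, two frames, two pixels per frame, two channels per pixel.
videos = [[[[1, -1], [3, -3]], [[0, 0], [2, -2]]]]
map_value = mean_absolute_perturbation(videos)
```

The first prediction is about 7 pixels from its annotation and counts as a hit; the second is far away, so the precision is 0.5. A precision drop would then be the difference between this value computed on clean frames and on attacked frames.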

Comparison results

SPARK is compared with FGSM, BIM, MI-FGSM, C&W, and Wei.

Analysis

Validation of the online incremental attack : implement 6 variants of SPARK by setting $L\in\{5,10,15,20,25,30\}$ .
Results under challenging attributes : OTB dataset.
Transferability across models : applying perturbations generated from one model to another.
SPARK without template $T$ : using SSD to detect all possible objects in the frame and selecting the object nearest to the center as the target object.
SPARK without the attacked tracker’s predictions : replacing the $b_{t-1}^a$ with $b_{t-1}^{a'}$ , i.e., perform attack on the search region of $\phi_{\theta'}(.)$ and propagate the perturbation to the whole frame.

Attacking other tracking frameworks

Transferability to online updating trackers : using the adversarial perturbations from SiamRPN-AlexNet, MobileNetV2, and ResNet-50 to attack the DSiamRPN-based trackers.
Attacking SiamDW