SPARK: Spatial-aware Online Incremental Attack

Notes: SPARK: Spatial-aware Online Incremental Attack

reference: Guo Q. et al. (2020) SPARK: Spatial-Aware Online Incremental Attack Against Visual Tracking. In: Vedaldi A., Bischof H., Brox T., Frahm JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12370. Springer, Cham. https://doi.org/10.1007/978-3-030-58595-2_13

Problem to solve

Generate imperceptible perturbations online that mislead trackers along an incorrect trajectory (Untargeted Attack, UA) or along a specified trajectory (Targeted Attack, TA).

Difficulty

  • Object tracking processes incoming frames one by one, in order. When the current frame t is under attack, all previous frames have already been analyzed and cannot be changed, and the future frames are not yet available.
  • Object tracking usually depends on an object template cropped from the first frame of a video. A different initially designated object can lead to a different tracking analysis, which often renders a universal adversarial perturbation ineffective.
  • Object tracking runs at real-time speed, so the attack must be efficient enough that the adversarial perturbation of the current frame is completed before the next frame arrives.

Basic problem definition

Online video with $T$ frames: $V = \{X_t\}_1^T$, where $X_t$ is the $t$-th frame;

Tracker with parameters $\theta$: $\phi_\theta(\cdot)$;

Object template: $T$;

At the $t$-th frame, the tracker calculates $\{(y_t^i,b_t^i)\}_{i=1}^N=\phi_\theta(X_t,T)$, where $b_t^i$ is the $i$-th object candidate and $y_t^i$ is the positive activation of $b_t^i$;

The tracker’s predictive bounding box at the clean $t$-th frame: $b_t^{gt}$;

The object tracker assigns the predictive result: $OT(X_t,T)=b_t^{gt}=b_t^k$, with $k=\arg\max_{1\leq i\leq N}(y_t^i)$.

To attack a tracker $OT(\cdot)$, use another tracker $OT'(\cdot)$ to generate adversarial examples.
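
The selection rule above (pick the candidate with the largest activation) can be sketched in a few lines. This is only an illustration: the `(x, y, width, height)` box layout, the function name `select_prediction`, and the sample values are assumptions, not part of the paper.

```python
from typing import List, Tuple

# A box is (x, y, width, height); this layout is an assumption for the example.
Box = Tuple[float, float, float, float]


def select_prediction(activations: List[float], boxes: List[Box]) -> Tuple[int, Box]:
    """Return (k, b^k) with k = argmax_i y^i over the N candidates."""
    if not activations or len(activations) != len(boxes):
        raise ValueError("need one activation per candidate box")
    k = max(range(len(activations)), key=lambda i: activations[i])
    return k, boxes[k]


# Hypothetical candidates standing in for the tracker output phi_theta(X_t, T).
ys = [0.10, 0.85, 0.30]
bs = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 10.0, 10.0), (40.0, 40.0, 10.0, 10.0)]
k, predicted_box = select_prediction(ys, bs)
print(k, predicted_box)
```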

Untargeted Attacker

$$\text{minimize}\; D(X_t,X_t+E_t)$$
$$\text{subject to}\; IoU(OT'(X_t+E_t,T),b_t^{gt'})=0$$
where $E_t$ is the desired distortion and $D$ is a distance metric.

$X_t^a = X_t+E_t$ is the generated adversarial example. $b_t^{gt'}$ is the predictive result on the clean frame $X_t$.

Objective function:
$$f^{ua}(X_t+E_t,T)=y_t^{gt'}-\max_{IoU(b_t^i,b_t^{gt'})=0}(y_t^i)$$
$$f^{ua}(X_t+E_t,T)<0$$
$y_t^{gt'}$ is the activation value of $b_t^{gt'}$ (the ‘correct’ prediction), $y_t^i$ is the activation value of $b_t^i$, and $\{(y_t^i,b_t^i)\}_{i=1}^N=\phi_{\theta'}(X_t+E_t,T)$ (the desired prediction).
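
To make the untargeted objective concrete, here is a small sketch that evaluates $f^{ua}$ on hand-made candidates. It assumes boxes are `(x, y, width, height)` tuples and that the candidate boxes sit at the same positions for the clean and the perturbed frame, so that `gt_index` (the argmax on the clean frame) can be reused; the helper names and numbers are made up for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def f_ua(activations: List[float], boxes: List[Box], gt_index: int) -> float:
    """y^{gt'} minus the best activation among candidates with zero IoU to b^{gt'}."""
    gt_box = boxes[gt_index]
    rivals = [y for y, b in zip(activations, boxes) if iou(b, gt_box) == 0.0]
    if not rivals:
        raise ValueError("no candidate with zero IoU to the clean prediction")
    return activations[gt_index] - max(rivals)


boxes = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 10.0, 10.0), (30.0, 30.0, 10.0, 10.0)]
clean_ys = [0.9, 0.5, 0.2]  # activations on the clean frame (made up)
adv_ys = [0.3, 0.4, 0.6]    # activations on the perturbed frame (made up)

gt_index = max(range(len(clean_ys)), key=lambda i: clean_ys[i])
clean_value = f_ua(clean_ys, boxes, gt_index)  # positive: tracker still on target
adv_value = f_ua(adv_ys, boxes, gt_index)      # negative: a non-overlapping box wins
print(clean_value, adv_value)
```

The second candidate overlaps the clean prediction, so it is excluded from the max; only the third box counts as a rival.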

Targeted Attacker

$$\text{minimize}\; D(X_t,X_t+E_t)$$
$$\text{subject to}\; ce(OT'(X_t+E_t,T))=p_t^{tr}$$
where $E_t$ is the desired distortion and $D$ is a distance metric.

$p_t^{tr}$ is the targeted position at frame $t$, and $ce(\cdot)$ outputs the center position of a bounding box.

Objective function:
$$f^{ta}(X_t+E_t,T)=y_t^{gt'}-\max_{ce(b_t^i)=p_t^{tr}}(y_t^i)$$
$$f^{ta}(X_t+E_t,T)<0$$
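
The targeted objective differs only in which candidates count as rivals: those whose center equals the target position. A minimal sketch follows, assuming `(x, y, width, height)` boxes and a small tolerance `tol` in place of exact equality of floating-point centers (both are assumptions for the example, as are the numbers).

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout
Point = Tuple[float, float]


def center(box: Box) -> Point:
    """ce(.): center position of a bounding box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def f_ta(activations: List[float], boxes: List[Box], gt_index: int,
         target: Point, tol: float = 1e-6) -> float:
    """y^{gt'} minus the best activation among candidates centered on the target."""
    rivals = []
    for y, b in zip(activations, boxes):
        cx, cy = center(b)
        if math.hypot(cx - target[0], cy - target[1]) <= tol:
            rivals.append(y)
    if not rivals:
        raise ValueError("no candidate is centered on the target position")
    return activations[gt_index] - max(rivals)


boxes = [(0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 10.0, 10.0), (30.0, 30.0, 10.0, 10.0)]
target_position = (35.0, 35.0)  # hypothetical p_t^{tr}, center of the third box
ta_clean = f_ta([0.9, 0.5, 0.2], boxes, 0, target_position)
ta_adv = f_ta([0.3, 0.4, 0.6], boxes, 0, target_position)
print(ta_clean, ta_adv)
```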

Empirical study

In this part, the paper performs an empirical study on two questions:

  1. How effective is the attack when a basic attack is applied to every frame? (BA-E)
  2. How do the temporal frames in the video affect the attack? (BA-R)

BA-E: attack every frame using FGSM, BIM, and C&W;

BA-R1: randomly select frames with a probability of 0.1 and perform the basic attack on them;

BA-R2: select frames at an interval of 10 frames and perform the basic attack on them.
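
The two BA-R frame-selection schemes can be sketched as below. This only shows which frame indices get attacked, not the attack itself; the function names, the `seed` argument, and the `start` offset are assumptions made for the example.

```python
import random
from typing import List


def ba_r1_frames(num_frames: int, prob: float = 0.1, seed: int = 0) -> List[int]:
    """BA-R1: each frame is attacked independently with probability `prob`."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    return [t for t in range(num_frames) if rng.random() < prob]


def ba_r2_frames(num_frames: int, interval: int = 10, start: int = 0) -> List[int]:
    """BA-R2: attack one frame every `interval` frames, beginning at `start`."""
    return list(range(start, num_frames, interval))


r1 = ba_r1_frames(100)
r2 = ba_r2_frames(35)
print(r1)
print(r2)
```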

The conclusion is that BA-E is not efficient enough for real-time trackers, and BA-R sacrifices success rate. BA-R1 and BA-R2 only work at the specific frames on which the attacks are performed, so the perturbations generated by BA are difficult to transfer directly to the following frames because of the dynamic scene in the video.

SPARK online incremental attack

SPARK achieves transferability between nearby frames. The intuition is to still attack every frame, but to apply the previous perturbations to the new frame and combine them with a small but effective incremental perturbation found via optimization.

At frame $t$, UA is defined as:
$$\text{minimize}\; D(X_t,X_t+E_{t-1}+\epsilon_t)$$
$$\text{subject to}\; IoU(OT'(X_t+E_{t-1}+\epsilon_t,T),b_t^{gt'})=0$$

$\epsilon_t$ is the incremental perturbation, i.e. $\epsilon_t=E_t-E_{t-1}$.

So $E_t=\epsilon_t+\sum_{\tau=t_0}^{t-1}\epsilon_\tau$, with $t_0 = t-L$ being the start of the attack.

The new objective function is:
$$f^{ua}\Big(X_t+\epsilon_t+\sum_{\tau=t-L}^{t-1}\epsilon_\tau,\,T\Big)+\lambda\|\Gamma\|_{2,1}$$

$\Gamma=[\epsilon_{t-L},...,\epsilon_{t-1},\epsilon_t]$.

The $X_t+\epsilon_t+\sum_{\tau=t-L}^{t-1}\epsilon_\tau$ part is the incremental perturbation strategy. SPARK also introduces the $L_{2,1}$ norm to regularize $\{\epsilon_\tau\}$.
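
As a rough illustration of the incremental strategy, the total perturbation applied at frame $t$ is the sum of the last $L+1$ incremental perturbations. The sketch below keeps them in a sliding window; flat Python lists stand in for image-shaped perturbations, and the names `total_perturbation`, `window`, and `history` are assumptions for the example.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]  # a flattened perturbation (assumed representation)


def total_perturbation(window: Deque[Vector]) -> Vector:
    """Element-wise sum of the incremental perturbations currently in the window."""
    if not window:
        raise ValueError("window is empty")
    total = [0.0] * len(window[0])
    for eps in window:
        for i, value in enumerate(eps):
            total[i] += value
    return total


L = 2  # number of past increments kept besides the current one (toy value)
window: Deque[Vector] = deque(maxlen=L + 1)  # holds eps_{t-L}, ..., eps_t
history = []
for eps in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]):
    window.append(eps)                 # the oldest increment drops out automatically
    history.append(total_perturbation(window))
print(history)
```

With `L = 2`, the fourth total no longer contains the first increment, which is the windowing expressed by the sum from $t-L$ to $t$.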

In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting.
https://en.wikipedia.org/wiki/Regularization_(mathematics)
In mathematics, a norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin: it commutes with scaling, obeys a form of the triangle inequality, and is zero only at the origin. In particular, the Euclidean distance of a vector from the origin is a norm, called the Euclidean norm, or 2-norm, which may also be defined as the square root of the inner product of a vector with itself.
https://en.wikipedia.org/wiki/Norm_(mathematics)

$L_1$ norm (absolute-value norm):
$$\|X\|_1=\sum_i|x_i|$$

LASSO Regression (L1)
“You can also think of L1 as reducing the number of features in the model altogether.”
https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261
“So, this works well for feature selection in case we have a huge number of features.”
https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

$L_2$ norm (Euclidean norm):
$$\|X\|_2=\sqrt{\sum_i x_i^2}$$

Ridge regression (L2)
“L2 regularization forces the weights to be small but does not make them zero and does non sparse solution. Ridge regression performs better when all the input features influence the output and all with weights are of roughly equal size.”
https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2

The $L_{2,1}$ norm, in the form used here, is the sum of the Euclidean norms of the rows of the matrix (Wikipedia states the definition over columns, which is the same norm applied to the transpose):
$$\|X\|_{2,1}=\sum_{i=1}^n\sqrt{\sum_{j=1}^t x_{i,j}^2}=\sum_{i=1}^n\|X_{i,:}\|_2$$

“The $L_{2,1}$ norm as an error function is more robust, since the error for each data point (a column) is not squared. It is used in robust data analysis and sparse coding.”
https://en.wikipedia.org/wiki/Matrix_norm
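
A quick numeric check of the row-wise formula above, written in plain Python. Reading each row as one pixel and each column as one frame is an assumption made for this example; under that reading, a pixel that is never perturbed contributes exactly zero.

```python
import math
from typing import List


def l21_norm(matrix: List[List[float]]) -> float:
    """Sum over rows of the Euclidean norm of each row."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


# Three "pixels" (rows) over two "frames" (columns); values are made up.
gamma = [
    [3.0, 4.0],   # row norm 5
    [0.0, 0.0],   # untouched pixel, row norm 0
    [5.0, 12.0],  # row norm 13
]
norm_value = l21_norm(gamma)
print(norm_value)
```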

Strategy

SPARK uses sign gradient descent with a step size of 0.3 to minimize the two objective functions, followed by a clip operation. It is effective and efficient for two reasons:

  1. Optimizing $\epsilon_t$ is equivalent to optimizing $E_t$ with $E_{t-1}$ as the starting point. Since neighboring frames of a video are usually similar, such a starting point helps find an effective perturbation within very few iterations.
  2. The $L_{2,1}$ norm makes the incremental perturbations spatially and temporally sparse and makes $E_t$ more imperceptible.

Sign gradient descent uses the sign of the gradient instead of the gradient itself.
https://www.sciencedirect.com/science/article/abs/pii/S0020025519303135
The Geometry of Sign Gradient descent

In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue. Both properties are common in deep networks.
$$x_{t+1}=x_t-\mathrm{sign}(\nabla f_t)$$
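
A minimal sketch of a sign-gradient step followed by a clip, using the step size of 0.3 mentioned above. The quadratic toy objective, the clip range of [-1, 1], and the helper names are assumptions chosen only to make the update easy to follow by hand.

```python
from typing import List


def sign(v: float) -> int:
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def sign_gd_step(x: List[float], grad: List[float], step: float,
                 low: float, high: float) -> List[float]:
    """One update x <- clip(x - step * sign(grad), low, high)."""
    return [min(high, max(low, xi - step * sign(gi))) for xi, gi in zip(x, grad)]


def toy_grad(x: List[float]) -> List[float]:
    """Gradient of the toy objective (x0 - 1)^2 + (x1 + 2)^2."""
    return [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]


x = [0.0, 0.0]
for _ in range(5):
    x = sign_gd_step(x, toy_grad(x), step=0.3, low=-1.0, high=1.0)
print(x)
```

Every coordinate moves by the same amount regardless of gradient magnitude, and the clip keeps the second coordinate at the boundary even though its unconstrained minimum lies outside the allowed range.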

In practice, SPARK is performed every 30 frames, and $E_{t_0}$ is calculated with 10 iterations. The attack is also applied to the search region of the attacked tracker instead of the whole frame. The search region of the $t$-th frame is cropped from $X_t$ at the center of the predictive result of frame $t-1$, i.e., $b_{t-1}^a$. The trackers can be reformulated as $\phi_{\theta'}(X_t,T,b_{t-1}^a)$ and $\phi_\theta(X_t,T,b_{t-1}^a)$.

Pseudocode (TA)

Pseudocode
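
Since the pseudocode figure is not reproduced here, the following is a loose toy sketch of the warm-start idea only, not the paper's algorithm. It restarts every 30 frames with 10 iterations (as stated above), otherwise starts from the previous perturbation and runs a few sign-gradient iterations. Everything else is assumed: `grad_fn` is a stand-in for the gradient of the attack objective, `frame_iters=2` and `bound=10.0` are guesses for illustration, and the $L_{2,1}$ regularizer and search-region cropping are left out.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def online_incremental_attack(frames: Sequence[Vector],
                              grad_fn: Callable[[Vector, int], Vector],
                              step: float = 0.3, bound: float = 10.0,
                              restart: int = 30, init_iters: int = 10,
                              frame_iters: int = 2) -> List[Vector]:
    """Return one perturbation E_t per frame, warm-started from E_{t-1}."""
    perturbations: List[Vector] = []
    prev: Optional[Vector] = None
    for t, frame in enumerate(frames):
        if prev is None or t % restart == 0:
            current = [0.0] * len(frame)   # fresh start of an attack window
            iters = init_iters
        else:
            current = list(prev)           # reuse the previous perturbation
            iters = frame_iters
        for _ in range(iters):
            adv = [x + e for x, e in zip(frame, current)]
            grad = grad_fn(adv, t)
            current = [min(bound, max(-bound, e - step * sign(g)))
                       for e, g in zip(current, grad)]
        perturbations.append(current)
        prev = current
    return perturbations


# Toy objective: squared distance of the perturbed frame to a fixed goal.
goal = [2.0, -1.0]


def toy_loss(x: Vector) -> float:
    return sum((a - g) ** 2 for a, g in zip(x, goal))


def toy_grad(adv: Vector, t: int) -> Vector:
    return [2.0 * (a - g) for a, g in zip(adv, goal)]


frames = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]  # slowly changing toy "video"
perturbations = online_incremental_attack(frames, toy_grad)
for frame, e in zip(frames, perturbations):
    print(toy_loss(frame), toy_loss([x + p for x, p in zip(frame, e)]))
```

In this toy run, two iterations per frame are enough after the first frame because each frame starts from a perturbation that already worked on its neighbor.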

Experiment

Dataset
OTB100, VOT2018, UAV123, and LaSOT.

OTB100: The full benchmark contains 100 sequences from recent literature. Each row in the ground-truth files represents the bounding box of the target in that frame, (x, y, box-width, box-height).
VOT2018: VOT2018 is a dataset for visual object tracking. It consists of 60 challenging videos collected from real-life datasets and is annotated with rotated bounding boxes.
UAV123: The UAV123 dataset contains a total of 123 video sequences and more than 110K frames, making it the second-largest object tracking dataset after ALOV300++. All sequences are fully annotated with upright bounding boxes.
LaSOT: LaSOT is a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT one of the largest densely annotated tracking benchmarks. The average video length of LaSOT is more than 2,500 frames, and each sequence comprises various challenges deriving from the wild, where target objects may disappear and re-appear in the view.

Models

  • SiamRPN-based trackers (SiamRPN++ and Siamese-RPN) that use AlexNet, MobileNetV2, and ResNet-50 as backbones.

  • online updating variants of SiamRPN-based trackers.

  • SiamDW tracker.

Metrics

  • Prec.Drop (UA): precision drop of a tracker after the attack.
    A tracker locates an object successfully if the center location error $CLE(b_t,b_t^{an}) = \|ce(b_t)-ce(b_t^{an})\|_2<20$.
  • Succ.Rate (TA): the rate of frames in which an attack method fools a tracker successfully. An attacker succeeds at frame $t$ if $\|ce(b_t)-p_t^{tr}\|_2<20$.
  • MAP (mean absolute perturbation): measures the distortion of adversarial perturbations. $MAP=\frac{1}{D*K}\sum_d\sum_k\frac{1}{M*C}\sum_i\sum_c|E_{k,d}(i,c)|$, where $D$ is the number of videos in a video dataset, and $K$, $M$, and $C$ refer to the number of frames, pixels, and channels, respectively.
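
The three metrics can be sketched on toy data as follows. Boxes are assumed to be `(x, y, width, height)`, a perturbation frame is assumed to be a list of pixels that each hold a list of channel values, and Prec.Drop is computed here as clean precision minus attacked precision, which is an assumption about how the drop is taken. The MAP helper matches the formula above when every video has the same number of frames.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout
Point = Tuple[float, float]


def center(box: Box) -> Point:
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def cle(a: Box, b: Box) -> float:
    """Center location error between two boxes."""
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by)


def precision(preds: Sequence[Box], annots: Sequence[Box], thr: float = 20.0) -> float:
    """Fraction of frames whose CLE to the annotation is below thr."""
    return sum(1 for p, a in zip(preds, annots) if cle(p, a) < thr) / len(preds)


def success_rate(preds: Sequence[Box], targets: Sequence[Point], thr: float = 20.0) -> float:
    """Fraction of frames whose predicted center is within thr of the target."""
    hits = 0
    for p, (tx, ty) in zip(preds, targets):
        cx, cy = center(p)
        hits += math.hypot(cx - tx, cy - ty) < thr
    return hits / len(preds)


def mean_abs_perturbation(videos: List[List[List[List[float]]]]) -> float:
    """Average |E| over channels and pixels, then frames, then videos."""
    video_means = []
    for video in videos:
        frame_means = []
        for frame in video:
            values = [abs(v) for pixel in frame for v in pixel]
            frame_means.append(sum(values) / len(values))
        video_means.append(sum(frame_means) / len(frame_means))
    return sum(video_means) / len(video_means)


annots = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
clean = [(0.0, 0.0, 10.0, 10.0), (3.0, 4.0, 10.0, 10.0)]
attacked = [(0.0, 0.0, 10.0, 10.0), (30.0, 40.0, 10.0, 10.0)]
prec_drop = precision(clean, annots) - precision(attacked, annots)
succ = success_rate(attacked, [(5.0, 5.0), (35.0, 45.0)])
map_value = mean_abs_perturbation([[[[1.0, -1.0], [2.0, -2.0]]],
                                   [[[0.0, 0.0], [4.0, -4.0]]]])
print(prec_drop, succ, map_value)
```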

Comparison results

SPARK is compared with FGSM, BIM, MI-FGSM, C&W, and Wei's method.

Analysis

  • Validation of the online incremental attack: implement 6 variants of SPARK by setting $L\in\{5,10,15,20,25,30\}$.
  • Results under challenging attributes: OTB dataset.
  • Transferability across models: applying perturbations generated from one model to another.
  • SPARK without template $T$: using SSD to detect all possible objects in the frame and selecting the object nearest to the center as the target object.
  • SPARK without the attacked tracker’s predictions: replacing $b_{t-1}^a$ with $b_{t-1}^{a'}$, i.e., performing the attack on the search region of $\phi_{\theta'}(\cdot)$ and propagating the perturbation to the whole frame.

Attacking other tracking frameworks

  • Transferability to online updating trackers : using the adversarial perturbations from SiamRPN-AlexNet, MobileNetV2, and ResNet-50 to attack the DSiamRPN-based trackers.
  • Attacking SiamDW