Rotation Equivariant Siamese Networks for Tracking: Paper Notes

1. Introduction

The task of visual object tracking with Siamese networks, referred to as Siamese tracking, casts tracking as estimating the similarity response between a template frame and a region sampled from a candidate frame (the search region).
Although Siamese trackers generally work well, they are prone to failure under challenges such as partial occlusion, scale change, or in-plane rotation of one of the two inputs.
The CNN architectures used in Siamese trackers are not inherently equivariant to in-plane rotations of the target. As a result, the model may perform well on object orientations that are represented in the training set but fail on previously unseen orientations.
A straightforward way to make the model learn rotated variants is to use a training dataset in which in-plane rotations occur naturally or are introduced through data augmentation.
Limitations of data augmentation:
1. It requires learning separate representations for the different rotated variants of the data.
2. The more variations are considered, the more flexible the tracker model must be to capture them all.
3. Further, such an approach makes the model invariant to rotations, so its predictions become unreliable when the target is surrounded by similar objects, e.g., tracking one fish in a school of fish.


Example of rotation non-equivariance in the regular CNN models used in object tracking: for a rotation $\psi_\theta$, rotating the output of the network $f$ is not the same as applying $f$ to the rotated input:

$$\psi_\theta(f(\cdot)) \neq f(\psi_\theta(\cdot))$$



Equivariance (the output transforms in the same way as the input):

$$\mathrm{transform}[F(x)] = F(\mathrm{transform}[x])$$

Invariance (the output does not change):

$$F(x) = F(\mathrm{transform}[x])$$

Equivariance in the general sense (the output changes by a corresponding transformation $\mathrm{transform}^*$, which need not be the same as the input transformation):

$$\mathrm{transform}^*[F(x)] = F(\mathrm{transform}[x])$$
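To make the non-equivariance concrete, here is a minimal NumPy sketch (an illustration, not from the paper) that tests $\psi_\theta(f(x)) = f(\psi_\theta(x))$ for a single correlation layer $f$ and a 90° rotation $\psi_\theta$. The random image, the oriented Sobel-like kernel, and the rotation-symmetric kernel are arbitrary choices.

```python
import numpy as np

def correlate_valid(x, k):
    """'valid' 2-D cross-correlation: the linear part of one CNN layer (one channel)."""
    m = k.shape[0]
    p = x.shape[0] - m + 1
    out = np.zeros((p, p))
    for a in range(m):
        for b in range(m):
            out += k[a, b] * x[a:a + p, b:b + p]
    return out

def commutes_with_rot90(x, k):
    """Check psi(f(x)) == f(psi(x)) for f = correlation with k, psi = 90-degree rotation."""
    rotate_output = np.rot90(correlate_valid(x, k))
    rotate_input = correlate_valid(np.rot90(x), k)
    return np.allclose(rotate_output, rotate_input)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))                                  # a random "image"
edge = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])   # oriented, Sobel-like
blob = np.array([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]])      # unchanged by rot90

print(commutes_with_rot90(x, edge))  # False: the oriented filter is not equivariant
print(commutes_with_rot90(x, blob))  # True: the symmetric filter is
```

Typical learned filters are oriented like `edge`, so they fail this check; `blob` passes only because it has no preferred orientation.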

2. Related Work

Equivariant CNNs
SiamRPN++ proposed a training strategy that removes the spatial bias introduced by non-fully-convolutional backbones.
Deeper and Wider Siamese Networks for Real-Time Visual Tracking showed that existing tracking models induce a positional bias, which breaks strict translation equivariance.
Scale Equivariance Improves Siamese Tracking (SE-SiamNet) introduced scale-equivariant Siamese trackers, which matter greatly when the camera zooms its lens or when the target moves in depth.

3. Rotation Equivariant CNNs


Rotation Equivariance


Learning Steerable Filters for Rotation Equivariant CNNs showed that one of the more robust ways to enforce rotation equivariance in CNNs is to use steerable filters (SFCNNs).
For rotation equivariance with steerable filters, the network must convolve with several rotated versions of each filter.
Steerable filters not only make it efficient to compute responses for an arbitrary number of discrete filter rotations, but also have strong expressive power.



  1. Circular harmonics: dropping z and θ from spherical coordinates leaves the polar coordinates $(r, \varphi)$, on which the circular harmonic basis is defined:

$$\psi_{jk}(r,\varphi) = \tau_j(r)\, e^{ik\varphi}$$

  • $\tau_j$ is the radial function; its index $j = 1, 2, \dots, J$ selects the radial profile, i.e., which range of radii the basis function covers.
  • $(r, \varphi)$ are the polar coordinates of the point $(x_1, x_2)$; the angle $\varphi \in (-\pi, \pi]$ measures its rotation.
  • $k \in \mathbb{Z}$ is the angular frequency, also called the order, of the angular function $e^{ik\varphi}$; its allowed range depends on the current radial index $j$ (here $|k| \le j$).


  2. Expressing rotation with Euler's formula: rotating a basis filter by $\theta$ only multiplies it by a phase factor:

$$\rho_{\theta}\psi_{jk}(x) = e^{-ik\theta}\psi_{jk}(x)$$

$e^{-ik\theta}$ rotates the filter by $+\theta$, which appears clockwise when the $x_2$ axis points down as image row indices do; $e^{+ik\theta}$ rotates it the opposite way.

Note that $\psi_{jk}(x)$ here stands for the function $\psi_{jk}(\cdot)$: $x$ is generic rather than a specific point, so the identity holds everywhere.
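A quick numerical check of this phase-shift identity, written as a NumPy sketch. The Gaussian ring profile $\tau_j(r) = \exp(-(r-j)^2/(2\sigma^2))$, the value $\sigma = 0.5$, and the convention $(\rho_\theta f)(x) = f(R_{-\theta}x)$ (rotate the function by $+\theta$) are assumptions made for illustration; the basis is set to zero at $r = 0$ for $k \neq 0$, where $e^{ik\varphi}$ is undefined.

```python
import numpy as np

def psi(j, k, x1, x2, sigma=0.5):
    """Circular harmonic psi_jk = tau_j(r) * exp(i k phi) at Cartesian points (x1, x2)."""
    r, phi = np.hypot(x1, x2), np.arctan2(x2, x1)
    tau = np.exp(-((r - j) ** 2) / (2 * sigma ** 2))   # assumed Gaussian ring at radius j
    # exp(i k phi) is undefined at r = 0 for k != 0, so the basis is set to 0 there
    return tau * np.where(r > 0, np.exp(1j * k * phi), 1.0 if k == 0 else 0.0)

def rotated_psi(j, k, x1, x2, theta):
    """(rho_theta psi_jk)(x) = psi_jk(R_{-theta} x): the basis function rotated by +theta."""
    c, s = np.cos(theta), np.sin(theta)
    return psi(j, k, c * x1 + s * x2, -s * x1 + c * x2)

x1, x2 = np.meshgrid(np.linspace(-3, 3, 13), np.linspace(-3, 3, 13))
theta = 2 * np.pi / 8                                   # one rotation step for Lambda = 8

for j, k in [(1, 0), (1, 1), (2, 3)]:
    phase_shifted = np.exp(-1j * k * theta) * psi(j, k, x1, x2)
    print(j, k, np.allclose(rotated_psi(j, k, x1, x2, theta), phase_shifted))  # True each time
```

Evaluating the basis on a rotated grid and multiplying it by $e^{-ik\theta}$ give the same values, which is what lets the network rotate filters without resampling them.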


  3. Each filter is built as a linear combination of the basis filters, with learned complex weights $w_{jk} \in \mathbb{C}$:

$$\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,\psi_{jk}(x)$$

  4. To rotate the composed filter by $\theta$, it suffices to shift the phases of the basis filters (steering):

$$\rho_{\theta}\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,e^{-ik\theta}\psi_{jk}(x)$$

The real part of the steered filter gives the real-valued filter at that orientation, written $\mathrm{Re}\,\Psi(x)$.
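A NumPy sketch of steering a composed filter: the filter is a weighted sum of basis functions, and rotating it only changes the phases of the weights. Random complex weights stand in for learned ones, and $J$, $K$, $\sigma$, the Gaussian ring profile, and the grid are illustrative assumptions.

```python
import numpy as np

def psi(j, k, x1, x2, sigma=0.5):
    """Circular harmonic psi_jk = tau_j(r) exp(i k phi); tau_j is an assumed Gaussian ring."""
    r, phi = np.hypot(x1, x2), np.arctan2(x2, x1)
    tau = np.exp(-((r - j) ** 2) / (2 * sigma ** 2))
    return tau * np.where(r > 0, np.exp(1j * k * phi), 1.0 if k == 0 else 0.0)

J, K = 2, 2
rng = np.random.default_rng(0)
w = rng.standard_normal((J, K + 1)) + 1j * rng.standard_normal((J, K + 1))  # stand-in for learned w_jk

def steered_filter(x1, x2, theta):
    """rho_theta Psi(x) = sum_jk w_jk exp(-i k theta) psi_jk(x): rotation by phase shifts only."""
    return sum(w[j - 1, k] * np.exp(-1j * k * theta) * psi(j, k, x1, x2)
               for j in range(1, J + 1) for k in range(K + 1))

x1, x2 = np.meshgrid(np.linspace(-3, 3, 13), np.linspace(-3, 3, 13))
theta = 3 * 2 * np.pi / 8                      # 135 degrees, an element of Theta for Lambda = 8

# Reference: rotate the unsteered filter explicitly by evaluating it at R_{-theta} x.
c, s = np.cos(theta), np.sin(theta)
explicit = steered_filter(c * x1 + s * x2, -s * x1 + c * x2, 0.0)
steered = steered_filter(x1, x2, theta)
real_filter = steered.real                      # Re Psi: the real kernel used for convolution

print(np.allclose(steered, explicit), real_filter.shape)  # True (13, 13)
```

Only the $(J, K+1)$ complex weights are touched when the angle changes; the sampled basis stays fixed.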



4. Rotation Equivariant Siamese Trackers


4.1 Formulation Based on SiamFC

The authors start from the basic SiamFC model, chosen for its simple design, and modify it. SiamFC computes the response map as:

$$h(z,x) = f(z) * f(x)$$

where $f(\cdot)$ is the feature extraction network and $*$ denotes the cross-correlation operation.
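A minimal NumPy sketch of the SiamFC response $h(z,x) = f(z) * f(x)$, assuming the features are already computed (random arrays stand in for $f(z)$ and $f(x)$). The template features slide over the search features and the products are summed over channels and positions, so the peak marks where the target sits.

```python
import numpy as np

def xcorr(template_feat, search_feat):
    """SiamFC-style response h(z, x) = f(z) * f(x).

    template_feat: (C, h, w) features f(z); search_feat: (C, H, W) features f(x).
    Returns an (H - h + 1, W - w + 1) map; each entry sums over channels and positions.
    """
    _, h, w = template_feat.shape
    _, H, W = search_feat.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search_feat[:, i:i + h, j:j + w] * template_feat)
    return out

rng = np.random.default_rng(0)
template = rng.random((4, 3, 3))        # stand-in for f(z): C = 4 channels, positive values
search = np.zeros((4, 10, 10))          # stand-in for f(x)
search[:, 3:6, 5:8] = template          # target placed at offset (3, 5)

response = xcorr(template, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(response.shape, peak)             # (8, 8) (3, 5)
```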

To build a rotation equivariant Siamese tracker, the authors introduce rotation equivariant modules, together with a group max pooling module that, among the multiple heatmaps generated in this setup, selects the cross-correlation response of the best-matching orientation.


  1. The candidate head of the network, which processes the search region, takes a single, unmodified search image.

  2. The template head is modified to take multiple template images as input, i.e., rotated versions of the template. The set of $\Lambda$ rotated variants is $Z = \{z_1, z_2, \dots, z_\Lambda\}$, covering all rotation angles considered.

  3. The features $f(z)$ of the initial target are computed first and then rotated; because the network is rotation equivariant, this is valid in theory.

  4. Rotating the target in the template (first layer):

$$y_{\hat{c}}^{(1)}(x,\theta) = \mathrm{Re} \sum_{c=1}^{C}\sum_{j=1}^{J}\sum_{k=0}^{K} w_{\hat{c}cjk}\, e^{-ik\theta}\,(I_c * \psi_{jk})(x)$$

  • $I_c$ is channel $c$ of the input image, $c \in \{1, 2, \dots, C\}$.
  • $\rho_{\theta}\Psi_{\hat{c}c}^{(1)}$ denotes the rotated filter.
  • $\hat{c} \in \{1, 2, \dots, \hat{C}\}$ indexes the output channels.
  • The equally spaced rotation angles $\theta$ are taken from the set $\Theta = \{0, \frac{2\pi}{\Lambda}, \dots, 2\pi\frac{\Lambda-1}{\Lambda}\}$.
  • A bias $\beta_{\hat{c}}^{(1)}$ and a nonlinearity $\sigma_{\hat{c}}^{(1)}$ are applied to obtain the first-layer feature map $\zeta_{\hat{c}}^{(1)}(x,\theta) = \sigma_{\hat{c}}^{(1)}\big(y_{\hat{c}}^{(1)}(x,\theta) + \beta_{\hat{c}}^{(1)}\big)$.
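The first layer can be sketched in NumPy for a single output channel $\hat{c}$, under assumptions: Gaussian ring radial profiles, random complex weights in place of learned ones, and no bias or nonlinearity. It illustrates the efficiency of steerable filters: the basis responses $(I_c * \psi_{jk})$ are computed once, each of the $\Lambda$ orientations only costs a re-weighting by $e^{-ik\theta}$, and the result matches convolving with the explicitly steered real filter.

```python
import numpy as np

def psi(j, k, x1, x2, sigma=0.6):
    """Circular harmonic basis psi_jk = tau_j(r) exp(i k phi); tau_j is an assumed Gaussian ring."""
    r, phi = np.hypot(x1, x2), np.arctan2(x2, x1)
    tau = np.exp(-((r - j) ** 2) / (2 * sigma ** 2))
    return tau * np.where(r > 0, np.exp(1j * k * phi), 1.0 if k == 0 else 0.0)

def corr_valid(img, ker):
    """'valid' 2-D cross-correlation of a square image with a square (real or complex) kernel."""
    m = ker.shape[0]
    p = img.shape[0] - m + 1
    out = np.zeros((p, p), dtype=np.result_type(img, ker))
    for a in range(m):
        for b in range(m):
            out += ker[a, b] * img[a:a + p, b:b + p]
    return out

C, J, K, LAM = 3, 2, 2, 8
x1, x2 = np.meshgrid(np.arange(-3, 4), np.arange(-3, 4))          # 7x7 filter support
basis = {(j, k): psi(j, k, x1, x2) for j in range(1, J + 1) for k in range(K + 1)}

rng = np.random.default_rng(0)
images = rng.standard_normal((C, 16, 16))                           # input channels I_c
w = rng.standard_normal((C, J, K + 1)) + 1j * rng.standard_normal((C, J, K + 1))  # w_{c_hat c j k}, one c_hat
thetas = 2 * np.pi * np.arange(LAM) / LAM                          # Theta

# (I_c * psi_jk) is computed once and re-used for every angle in Theta.
resp = {(c, j, k): corr_valid(images[c], basis[j, k]) for c in range(C) for (j, k) in basis}

def first_layer(theta):
    """y^(1)(x, theta) = Re sum_c sum_j sum_k w e^{-ik theta} (I_c * psi_jk)(x), no bias/nonlinearity."""
    return sum(w[c, j - 1, k] * np.exp(-1j * k * theta) * resp[c, j, k]
               for (c, j, k) in resp).real

def explicit(theta):
    """Same response, built by convolving each channel with the steered real filter Re(rho_theta Psi)."""
    return sum(corr_valid(images[c], sum(w[c, j - 1, k] * np.exp(-1j * k * theta) * basis[j, k]
                                         for (j, k) in basis).real)
               for c in range(C))

y = np.stack([first_layer(t) for t in thetas])                      # shape (Lambda, 10, 10)
print(y.shape, np.allclose(y[3], explicit(thetas[3])))              # (8, 10, 10) True
```

The two routes agree because correlation is linear and the input channels are real, so taking the real part commutes with convolving.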
  5. Rotation equivariant (group) convolution:

$$y_{\hat{c}}^{(l)}(x,\theta) = \mathrm{Re}\sum_{c=1}^{C}\sum_{\phi \in \Theta}\sum_{j,k} w_{\hat{c}cjk,\theta - \phi}\, e^{-ik\theta}\,\big(\zeta_c^{(l-1)}(\cdot, \phi) * \psi_{jk}\big)(x)$$

The subscript $\theta - \phi$ on the weight $w$ indicates a group convolution along the orientation dimension: each weight depends only on the relative angle between the input orientation $\phi$ and the output orientation $\theta$.
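A stripped-down NumPy sketch of the $\theta - \phi$ indexing, assuming a single channel and a 1×1 spatial kernel so that only the orientation axis is mixed (the real layer also convolves spatially with steerable filters). Because each weight depends only on the relative angle $\theta - \phi$, cyclically shifting the input's orientation axis shifts the output's orientation axis by the same amount.

```python
import numpy as np

def orientation_group_conv(feat, w):
    """y(x, theta) = sum_phi w[(theta - phi) mod Lambda] * zeta(x, phi).

    feat: (Lambda, H, W) feature map with an orientation axis (one channel, 1x1 spatial kernel)
    w:    (Lambda,) weights indexed by the relative angle theta - phi
    """
    lam = feat.shape[0]
    out = np.zeros_like(feat)
    for theta in range(lam):
        for phi in range(lam):
            out[theta] += w[(theta - phi) % lam] * feat[phi]
    return out

rng = np.random.default_rng(0)
LAM = 8
feat = rng.standard_normal((LAM, 5, 5))
w = rng.standard_normal(LAM)

y = orientation_group_conv(feat, w)
s = 3                                         # shift by 3 orientation steps (135 degrees)
y_from_shifted = orientation_group_conv(np.roll(feat, s, axis=0), w)
print(np.allclose(y_from_shifted, np.roll(y, s, axis=0)))   # True
```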


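  6. Rotation equivariant pooling

The output of the last group convolution layer is further processed along the rotation dimension. Unlike in conventional classification networks, this pooling is not performed over the $W \times H$ (spatial) dimensions but over the orientation groups $\{0, \frac{2\pi}{8}, \frac{4\pi}{8}, \dots, \frac{14\pi}{8}\}$; the orientation axis is collapsed while the spatial feature map remains rotation equivariant.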
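A NumPy sketch of pooling over the orientation axis, assuming features laid out as (C, Λ, H, W) with Λ = 8 (the layout is an illustrative choice). A rotation by one group element shifts the orientation axis cyclically (it also rotates the spatial grid, which is not modeled here), and the max over orientations is unchanged by that shift.

```python
import numpy as np

LAM, C, H, W = 8, 4, 6, 6
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, LAM, H, W))      # output of the last group conv layer

pooled = feat.max(axis=1)                        # max over the orientation axis only
print(pooled.shape)                              # (4, 6, 6): spatial size unchanged

# Cyclically shifting the orientation axis leaves the pooled features unchanged.
shifted = np.roll(feat, 1, axis=1)
print(np.array_equal(shifted.max(axis=1), pooled))   # True
```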


  7. Rotation equivariant cross-correlation


  • The two subnetworks of RE-SiamNet yield a set of feature maps $\{\phi(z), \phi(x)\}$.

  • $\phi(z)$ is the set of feature maps of the $\Lambda$ rotated templates.

  • The cross-correlation layer computes one heatmap per template orientation, giving the set $\{\hat{h}(z_i, x)\}$ with $\hat{h}(z_i, x) = \phi(z_i) * \phi(x)$, $i = 1, \dots, \Lambda$.

  • Group max pooling over $\{\hat{h}(z_i, x)\}$ outputs a single heatmap $h(Z, x)$, i.e., it keeps the $\hat{h}$ with the largest response.
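A toy NumPy version of this step, under loud assumptions: raw pixels stand in for the equivariant features $\phi(\cdot)$ (rotating template features is valid for RE-SiamNet only because the encoder is equivariant), $\Lambda = 4$ so that `np.rot90` rotates the grid exactly, and group max pooling is read as keeping the heatmap with the highest peak, consistent with $h(Z,x) = \hat{h}(z_i, x)$. `group_max_pool` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def xcorr(t, s):
    """Single-channel 'valid' cross-correlation of template t over search s."""
    h, w = t.shape
    out = np.empty((s.shape[0] - h + 1, s.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(s[i:i + h, j:j + w] * t)
    return out

def group_max_pool(heatmaps):
    """Keep the heatmap h_hat(z_i, x) with the strongest peak; return it and i (1-based)."""
    peaks = heatmaps.reshape(len(heatmaps), -1).max(axis=1)
    best = int(np.argmax(peaks))
    return heatmaps[best], best + 1

rng = np.random.default_rng(1)
target = rng.random((5, 5))                    # stand-in for the template features phi(z)
search = np.zeros((20, 20))                    # stand-in for phi(x)
search[8:13, 2:7] = np.rot90(target)           # target appears rotated by 90 degrees at (8, 2)

LAM = 4                                        # 90-degree steps, so np.rot90 rotates exactly
templates = [np.rot90(target, k=i) for i in range(LAM)]         # z_1, ..., z_Lambda
heatmaps = np.stack([xcorr(z, search) for z in templates])      # {h_hat(z_i, x)}
h, i = group_max_pool(heatmaps)                                 # h(Z, x)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(h), h.shape))
print(heatmaps.shape, i, peak)   # (4, 16, 16) 2 (8, 2)
```

The winning index `i = 2` corresponds to the template rotated by one 90° step, and the peak of the kept heatmap gives the target location.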



4.2 Constructing the RE-SiamNet Framework

  1. Decide how precisely the tracker should discriminate between orientations of the rotational degree of freedom. The authors use $\Lambda$ rotation groups, so RE-SiamNets are exactly equivariant to the angles in $\Theta = \{\frac{i-1}{\Lambda}\, 2\pi\}_{i=1}^{\Lambda}$; for $\Lambda = 8$ this is $\{(i-1)\frac{2\pi}{8}\}_{i=1}^{8}$, i.e., $0^\circ, 45^\circ, \dots, 315^\circ$.
  2. Define the non-parametric encoding $\phi(\cdot)$ based on existing Siamese trackers; the discriminative power of the tracker varies with the choice of $\phi(\cdot)$.
  3. Replace all convolutional layers of $\phi(\cdot)$ with rotation equivariant modules (implemented here with e2CNN).
  4. Instead of a single convolution to generate $h(z, x)$, perform $\Lambda$ convolutions to generate $\Lambda$ different heatmaps.
  5. Perform group max pooling over the heatmaps to generate $h(Z, x)$, which is then processed to localize the target.


5. Unsupervised Relative Rotation Estimation


5.1 Unsupervised 2D pose estimation


  • The design of RE-SiamNet makes it possible to estimate the relative change in the target's 2D pose in a fully unsupervised manner; the estimate is read off the result of the group max pooling step.
  • Let $i \in \{1, 2, \dots, \Lambda\}$ index one of the $\Lambda$ orientations of the template. If
    $$h(Z,x) = \hat{h}(z_i, x) = \text{group-maxpool}\big(\{\hat{h}(z_1, x), \dots, \hat{h}(z_\Lambda, x)\}\big),$$
    then the pose of the template differs from its appearance in the candidate image by $i-1$ rotation groups, i.e., by the angle $\theta_i = 2\pi\frac{i-1}{\Lambda} \in \Theta$.
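A small sketch of the readout, assuming $z_i$ is the template rotated by $\theta_i = 2\pi(i-1)/\Lambda$, as in the definition of $\Theta$ in Section 4.2, and that the peak value of each heatmap $\hat{h}(z_i, x)$ is already available (the scores below are made up for illustration). The estimate can only take values in $\Theta$, so its resolution is $2\pi/\Lambda$ (45° for $\Lambda = 8$).

```python
import numpy as np

def theta_set(lam):
    """Theta = {2*pi*(i - 1)/Lambda : i = 1..Lambda}: angles with exact equivariance."""
    return 2 * np.pi * np.arange(lam) / lam

def relative_rotation(peak_scores):
    """Unsupervised pose readout from the orientation kept by group max pooling.

    peak_scores[i - 1] is the peak value of heatmap h_hat(z_i, x), i = 1..Lambda.
    Returns the 1-based index i and the relative rotation theta_i = 2*pi*(i - 1)/Lambda.
    """
    lam = len(peak_scores)
    i = int(np.argmax(peak_scores)) + 1
    return i, theta_set(lam)[i - 1]

print(np.round(np.degrees(theta_set(8))).astype(int).tolist())
# [0, 45, 90, 135, 180, 225, 270, 315]
i, angle = relative_rotation([0.2, 0.1, 0.9, 0.3, 0.0, 0.1, 0.2, 0.4])
print(i, round(float(np.degrees(angle))))   # 3 90
```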

