Two-stream network
Paper:
Convolutional Two-Stream Network Fusion for Video Action Recognition
Network fusion:
Conv fusion
$$\mathbf{y}^{\mathrm{conv}}=f^{\mathrm{conv}}\left(\mathbf{x}^{a}, \mathbf{x}^{b}\right)$$
$$\mathbf{y}^{\mathrm{conv}}=\mathbf{y}^{\mathrm{cat}} * \mathbf{f}+b$$
$$\mathbf{f} \in \mathbb{R}^{1 \times 1 \times 2 D \times D}$$
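The two feature maps are first concatenated along the channel dimension and then convolved with the filter bank f, which reduces the number of channels from 2D to D.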
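Because the filters are 1 × 1, the convolution reduces to a per-pixel matrix product, which can be sketched in NumPy as follows. This is a minimal sketch, not the paper's implementation: the function name `conv_fusion`, the random inputs, the bias shape `(D,)`, and the plain (non-interleaved) stacking order are assumptions made for illustration. Since f is learned, the stacking order only permutes the rows of the filter.

```python
import numpy as np

def conv_fusion(xa, xb, f, b):
    """Fuse two H x W x D feature maps with a 1x1 convolution (illustrative sketch)."""
    # Stack the two maps along the channel axis: H x W x 2D.
    y_cat = np.concatenate([xa, xb], axis=-1)
    # A 1x1 convolution is a per-pixel matrix product with the 2D x D filter slice.
    return y_cat @ f[0, 0] + b

rng = np.random.default_rng(0)
H, W, D = 4, 5, 3                          # assumed toy sizes
xa = rng.standard_normal((H, W, D))
xb = rng.standard_normal((H, W, D))
f = rng.standard_normal((1, 1, 2 * D, D))  # filter bank, shape 1 x 1 x 2D x D
b = np.zeros(D)                            # assumed per-channel bias
y_conv = conv_fusion(xa, xb, f, b)
print(y_conv.shape)  # (4, 5, 3)
```

If f is set to two stacked identity matrices and b to zero, this sketch reproduces element-wise summation of the two maps, so conv fusion can be read as a learned generalisation of sum fusion.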
Concatenation fusion
$$\mathbf{y}^{\mathrm{cat}}=f^{\mathrm{cat}}\left(\mathbf{x}^{a}, \mathbf{x}^{b}\right)$$
$$y_{i, j, 2 d}^{\mathrm{cat}}=x_{i, j, d}^{a} \quad y_{i, j, 2 d-1}^{\mathrm{cat}}=x_{i, j, d}^{b}$$
$$\mathbf{y} \in \mathbb{R}^{H \times W \times 2 D}$$
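The index pattern above interleaves the two streams channel by channel. A minimal NumPy sketch, assuming both inputs have shape H × W × D and translating the 1-based channel indices of the formula to 0-based array indices (the function name `concat_fusion` and the toy inputs are illustrative only):

```python
import numpy as np

def concat_fusion(xa, xb):
    """Interleave the channels of two H x W x D maps into one H x W x 2D map."""
    assert xa.shape == xb.shape
    H, W, D = xa.shape
    y = np.empty((H, W, 2 * D), dtype=np.result_type(xa, xb))
    # 1-based channel 2d-1 is 0-based index 2(d-1): taken from stream b.
    y[:, :, 0::2] = xb
    # 1-based channel 2d is 0-based index 2d-1: taken from stream a.
    y[:, :, 1::2] = xa
    return y

xa = np.arange(24, dtype=np.float64).reshape(2, 3, 4)  # toy input, D = 4
xb = -xa
y_cat = concat_fusion(xa, xb)
print(y_cat.shape)  # (2, 3, 8)
```

Concatenation itself defines no correspondence between the two streams; it leaves that to the layers that follow.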
Sum fusion
$$\mathbf{y}^{\mathrm{sum}}=f^{\mathrm{sum}}\left(\mathbf{x}^{a}, \mathbf{x}^{b}\right)$$
$$y_{i, j, d}^{\mathrm{sum}}=x_{i, j, d}^{a}+x_{i, j, d}^{b}$$
$$\mathbf{y} \in \mathbb{R}^{H \times W \times D}$$
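Sum fusion is a plain element-wise addition of two maps with identical shape. A tiny worked example, with made-up values for a single pixel and D = 2:

```python
import numpy as np

# Toy feature maps of shape H x W x D = 1 x 1 x 2 (values are illustrative).
xa = np.array([[[1.0, -2.0]]])
xb = np.array([[[0.5, 3.0]]])

# Element-wise sum at the same spatial location and channel.
y_sum = xa + xb
print(y_sum)  # [[[1.5 1. ]]]
```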
Max fusion
$$\mathbf{y}^{\max }=f^{\max }\left(\mathbf{x}^{a}, \mathbf{x}^{b}\right)$$
$$y_{i, j, d}^{\max }=\max \left\{x_{i, j, d}^{a}, x_{i, j, d}^{b}\right\}$$
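Max fusion keeps, for every location and channel, the larger of the two responses. A tiny worked example with made-up values for a single pixel and D = 2:

```python
import numpy as np

# Toy feature maps of shape H x W x D = 1 x 1 x 2 (values are illustrative).
xa = np.array([[[1.0, -2.0]]])
xb = np.array([[[0.5, 3.0]]])

# Element-wise maximum; the output keeps the input shape H x W x D.
y_max = np.maximum(xa, xb)
print(y_max)  # [[[1. 3.]]]
```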
Bilinear fusion
$$\mathbf{y}^{\mathrm{bil}}=f^{\mathrm{bil}}\left(\mathbf{x}^{a}, \mathbf{x}^{b}\right)$$
$$\mathbf{y}^{\mathrm{bil}}=\sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{x}_{i, j}^{a \top} \mathbf{x}_{i, j}^{b}$$
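A minimal NumPy sketch of the double sum above. It assumes the product at each pixel is read as an outer product of the two D-dimensional channel vectors, as is usual for bilinear pooling, and that the resulting D × D matrix is flattened into a vector of length D². Both the flattening and the function name `bilinear_fusion` are assumptions for illustration.

```python
import numpy as np

def bilinear_fusion(xa, xb):
    """Sum per-pixel outer products of two H x W x D maps, flattened to D*D."""
    # einsum sums over the spatial indices i, j and keeps both channel indices.
    m = np.einsum("ijd,ije->de", xa, xb)
    return m.reshape(-1)

rng = np.random.default_rng(0)
xa = rng.standard_normal((4, 5, 3))  # assumed toy sizes: H=4, W=5, D=3
xb = rng.standard_normal((4, 5, 3))
y_bil = bilinear_fusion(xa, xb)
print(y_bil.shape)  # (9,)
```

Under this reading the output no longer has spatial dimensions, and its size grows quadratically with D, which makes it considerably more expensive than the other fusion variants.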