《Video Self-Stitching Graph Network for Temporal Action Localization》- VSGN Notes
Link: http://arxiv.org/abs/2011.14598
Model Name: VSGN
Code: https://github.com/coolbay/VSGN
Proposes a solution to the low detection accuracy on short actions in videos.
Two key components:
VSS (video self-stitching)
xGPN (cross-scale graph pyramid network)
Related work
- Multi-scale solutions in object detection
–borrows from FPN and mosaic augmentation
- Temporal action localization: two kinds of methods
- Fixed-length video input (such as 100 frames)
–BSN, BMN, G-TAD, BC-GNN
–small input scale makes them efficient
–down-scaling harms short actions (easily lost or distorted)
–up-scaling in those methods is not useful due to their architectures (the "scaling curse")
- Sliding windows
–R-C3D, TAL-Net, PBRNet
–preserve the original information
–perform pooling / strided convolution to obtain multi-scale features
- Graph neural networks for TAL
–VSGN builds a graph on video snippets, as G-TAD does
–VSGN exploits correlations between cross-scale snippets and defines cross-scale edges to break the scaling curse
Video Self-Stitching Graph Network
1. Video: T frames of size H × W × 3
2. Feature extraction on snippets (a snippet = several consecutive video frames): TSN, I3D
3. Feature sequence F = {f_t}, t = 1, …, T/a, of shape (T/a) × C
4. Crop the snippet feature sequence into clips
5. Clip up-scaling: Clip O (the original short clip) + gap + Clip U (the up-scaled clip), stitched into one sequence
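A minimal sketch of the self-stitching step, assuming linear interpolation for up-scaling and a zero-feature gap (the function name, `ratio`, and `gap` are illustrative, not fixed by the paper):

```python
import numpy as np

def self_stitch(clip_o: np.ndarray, ratio: int = 2, gap: int = 4) -> np.ndarray:
    """Stitch a short clip with its up-scaled version along the time axis.

    clip_o: (T, C) snippet features of the original short clip.
    Returns (T + gap + ratio*T, C): [Clip O | zero gap | Clip U].
    """
    T, C = clip_o.shape
    # Up-scale Clip O temporally by linear interpolation (one assumption;
    # any temporal up-scaling would fit the same slot).
    src = np.arange(T)
    dst = np.linspace(0, T - 1, ratio * T)
    clip_u = np.stack([np.interp(dst, src, clip_o[:, c]) for c in range(C)], axis=1)
    gap_feat = np.zeros((gap, C))  # separator between the two scales
    return np.concatenate([clip_o, gap_feat, clip_u], axis=0)
```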
Cross-Scale Graph Pyramid Network
Borrows from FPN: an encoding pyramid and a decoding pyramid, with connections between them.
Graph construction
1. Features are nodes, with edges built between them; each node has K edges
2. K/2 are free edges: for each node, the K/2 features with the smallest MSE (mean squared error) to it, regardless of scale (feature similarity alone matters)
3. K/2 are cross-scale edges: across clips, between Clip O and Clip U
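A simplified sketch of the edge construction above; the helper name, the MSE-based top-K/2 selection, and the assumption that Clip U sits at the end of the stitched sequence are illustrative, and VSGN's exact selection differs in detail:

```python
import numpy as np

def build_edges(feats, len_o, ratio=2, k=4):
    """Build free and cross-scale edges (hypothetical simplification).

    feats: (N, C) node features of the stitched sequence (Clip O first).
    len_o: number of Clip O nodes; Clip U nodes assumed at the end.
    Returns a list of (i, j) directed edges.
    """
    N = feats.shape[0]
    edges = []
    # Free edges: k/2 nearest neighbors by MSE, regardless of scale.
    for i in range(N):
        mse = ((feats - feats[i]) ** 2).mean(axis=1)
        mse[i] = np.inf                      # exclude self-loops
        for j in np.argsort(mse)[: k // 2]:
            edges.append((i, int(j)))
    # Cross-scale edges: each Clip O node connects to its temporally
    # corresponding Clip U nodes (Clip U assumed to start at N - ratio*len_o).
    u_start = N - ratio * len_o
    for i in range(len_o):
        for r in range(min(k // 2, ratio)):
            edges.append((i, u_start + ratio * i + r))
    return edges
```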
Feature aggregation
1. A multi-layer perceptron (MLP) with weights W
2. Edge convolution operations, as in the paper's equation
3. Take the channel-wise maximum to generate the aggregated feature
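The aggregation steps above can be sketched as an EdgeConv-style operation, with a single linear layer standing in for the MLP (names and shapes are assumptions):

```python
import numpy as np

def edge_conv(feats, edges, W, b):
    """EdgeConv-style aggregation with channel-wise max (sketch).

    feats: (N, C); edges: list of (i, j); W: (2C, C_out), b: (C_out,),
    a single linear layer standing in for the note's MLP.
    """
    N, C = feats.shape
    out = np.full((N, W.shape[1]), -np.inf)
    for i, j in edges:
        msg = np.concatenate([feats[i], feats[j] - feats[i]])  # [f_i, f_j - f_i]
        h = np.maximum(msg @ W + b, 0.0)                       # linear + ReLU
        out[i] = np.maximum(out[i], h)                         # channel-wise max
    # Nodes without any incoming message keep zeros instead of -inf.
    out[np.isinf(out).any(axis=1)] = 0.0
    return out
```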
Fusion of graph features and temporal features
Scoring and Localization
The network head consists of four modules: Mloc, Mcls, Madj, Mscr
Mloc:
1) contains 4 blocks of Conv1d(3, 1), group normalization (GN), and ReLU layers, followed by one Conv1d(1, 1) to generate the location offsets for each anchor segment
2)pre-defined anchor segments
Mcls:
1) contains 4 blocks of Conv1d(3, 1), group normalization (GN), and ReLU layers, followed by one Conv1d(1, 1) to generate the classification scores for each anchor segment
2) updates the anchor segments by applying their predicted offsets from the Mloc module (update method in [37])
Madj:
1) inspired by FGD in [23]
2) samples 3 features from around the start and end locations, respectively, then temporally concatenates the 3 feature vectors from each location and applies Conv1d(3,1)-ReLU-Conv1d(1,1) to predict start/end offsets
Mscr:
1) Conv1d(3,1), ReLU, and Conv1d(1,1); predicts actionness/startness/endness scores [20] for each sequence
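A numpy sketch of the shared head structure described above, 4 × [Conv1d(3,1) → GN → ReLU] followed by Conv1d(1,1); the channel counts and number of GN groups are assumptions:

```python
import numpy as np

def conv1d(x, w, b, k):
    """x: (C_in, T); w: (C_out, C_in, k); same padding, stride 1."""
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    out = np.zeros((w.shape[0], T))
    for t in range(T):
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1])) + b
    return out

def group_norm(x, groups=8, eps=1e-5):
    """Normalize each group of channels over (channels-in-group, time)."""
    C, T = x.shape
    g = x.reshape(groups, C // groups, T)
    g = (g - g.mean(axis=(1, 2), keepdims=True)) / np.sqrt(
        g.var(axis=(1, 2), keepdims=True) + eps)
    return g.reshape(C, T)

def head(x, blocks, final_w, final_b):
    """4 x [Conv1d(3,1) -> GN -> ReLU], then Conv1d(1,1)."""
    for w, b in blocks:
        x = np.maximum(group_norm(conv1d(x, w, b, 3)), 0.0)
    return conv1d(x, final_w, final_b, 1)
```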
LOSS
training:
Lloc, Ladj: computed from the distance between the updated / adjusted anchor segments and their corresponding ground-truth actions, respectively.----GIoU
Lcls: loss between predicted classification scores and the ground-truth categories.----focal loss [22]
Lscr:----TEM losses [20]
whether an anchor segment is positive or negative: temporal intersection-over-union (tIoU)
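A sketch of the tIoU computation and anchor labeling; the thresholds are illustrative, not the paper's values:

```python
def t_iou(seg, gt):
    """Temporal IoU between two [start, end] segments."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gts, pos_thr=0.6, neg_thr=0.4):
    """Label each anchor positive (1), negative (0), or ignored (-1),
    by its best tIoU against any ground-truth action."""
    labels = []
    for a in anchors:
        best = max((t_iou(a, g) for g in gts), default=0.0)
        labels.append(1 if best >= pos_thr else (0 if best < neg_thr else -1))
    return labels
```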
inference:
1.score: ψ = (ts, te, s)
s = cψ · ps(ts) · pe(te)
confidence score cψ from Mcls
startness probability ps from Mscr
endness probability pe from Mscr
2.process Clip U: shift the boundaries of each detected segment to the beginning of the sequence and down-scale them back to the original scale
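The inference steps above can be sketched as follows: fuse the Mcls confidence with the Mscr boundary probabilities, then map Clip U detections back to the original scale (the function names, `u_start`, and `ratio` are assumptions):

```python
def final_score(c_psi, p_start, p_end, ts, te):
    """s = c_psi * p_s(t_s) * p_e(t_e): confidence from Mcls times the
    startness/endness probabilities from Mscr at the boundaries."""
    return c_psi * p_start[ts] * p_end[te]

def map_u_to_original(seg, u_start, ratio=2):
    """Map a segment detected on Clip U back to the original time scale:
    shift to the beginning of the sequence, then down-scale by the ratio."""
    return ((seg[0] - u_start) / ratio, (seg[1] - u_start) / ratio)
```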