《Video Self-Stitching Graph Network for Temporal Action Localization》- VSGN Notes
Link: http://arxiv.org/abs/2011.14598
Model Name: VSGN
Code: https://github.com/coolbay/VSGN
Proposes a solution to the low detection accuracy on short actions in videos.
Two key components:
VSS (video self-stitching)
xGPN (cross-scale graph pyramid network)
Related work
- Multi-scale solutions in object detection
–borrows from FPN and mosaic augmentation
- Temporal action localization: two kinds of methods
- Fixed-length video input (such as 100 frames)
–BSN, BMN, G-TAD, BC-GNN
–small input scale makes them efficient
–down-scaling harms short actions (easily lost or distorted)
–up-scaling in those methods is not useful due to their architectures (the "scaling curse")
- Sliding windows
–R-C3D, TAL-Net, PBRNet
–preserve the original information
–perform pooling / strided convolution to obtain multi-scale features
- Graph neural networks for TAL
–VSGN builds a graph on video snippets, as G-TAD does
–VSGN exploits correlations between cross-scale snippets and defines cross-scale edges to break the scaling curse
Video Self-Stitching Graph Network
1. Video: T frames of size H × W × 3
2. Feature extraction on snippets (a snippet = several consecutive video frames): TSN, I3D
3. Feature sequence F = {f_t}, t = 1, …, T/a, of shape (T/a) × C
4. Crop the snippet feature sequence into clips
5. Clip up-scaling: Clip O (the original short clip) + gap + Clip U (the up-scaled clip), stitched into one sequence
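A minimal sketch of the self-stitching step, assuming linear interpolation for up-scaling and a zero-feature gap (the function name, `ratio`, and `gap` are illustrative, not fixed by the paper):

```python
import numpy as np

def self_stitch(clip_o: np.ndarray, ratio: int = 2, gap: int = 4) -> np.ndarray:
    """Stitch a short clip with its up-scaled version along the time axis.

    clip_o: (T, C) snippet features of the original short clip.
    Returns (T + gap + ratio*T, C): [Clip O | zero gap | Clip U].
    """
    T, C = clip_o.shape
    # Up-scale Clip O temporally by linear interpolation (one assumption;
    # any temporal up-scaling would fit the same slot).
    src = np.arange(T)
    dst = np.linspace(0, T - 1, ratio * T)
    clip_u = np.stack([np.interp(dst, src, clip_o[:, c]) for c in range(C)], axis=1)
    gap_feat = np.zeros((gap, C))  # separator between the two scales
    return np.concatenate([clip_o, gap_feat, clip_u], axis=0)
```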
Cross-Scale Graph Pyramid Network
Borrows from FPN: an encoding pyramid and a decoding pyramid, with connections between them.
Graph construction
1. Features are nodes, with edges built between them; each node has K edges
2. K/2 are free edges: for each node, the K/2 features with the smallest MSE (mean squared error) to it, regardless of scale (feature similarity alone matters)
3. K/2 are cross-scale edges: across clips, between Clip O and Clip U
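A simplified sketch of the edge construction above; the helper name, the MSE-based top-K/2 selection, and the assumption that Clip U sits at the end of the stitched sequence are illustrative, and VSGN's exact selection differs in detail:

```python
import numpy as np

def build_edges(feats, len_o, ratio=2, k=4):
    """Build free and cross-scale edges (hypothetical simplification).

    feats: (N, C) node features of the stitched sequence (Clip O first).
    len_o: number of Clip O nodes; Clip U nodes assumed at the end.
    Returns a list of (i, j) directed edges.
    """
    N = feats.shape[0]
    edges = []
    # Free edges: k/2 nearest neighbors by MSE, regardless of scale.
    for i in range(N):
        mse = ((feats - feats[i]) ** 2).mean(axis=1)
        mse[i] = np.inf                      # exclude self-loops
        for j in np.argsort(mse)[: k // 2]:
            edges.append((i, int(j)))
    # Cross-scale edges: each Clip O node connects to its temporally
    # corresponding Clip U nodes (Clip U assumed to start at N - ratio*len_o).
    u_start = N - ratio * len_o
    for i in range(len_o):
        for r in range(min(k // 2, ratio)):
            edges.append((i, u_start + ratio * i + r))
    return edges
```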
Feature aggregation
1. A multi-layer perceptron (MLP) with weights W
2. Edge convolution operations, as in the paper's equation
3. Take the channel-wise maximum to generate the aggregated feature
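The aggregation steps above can be sketched as an EdgeConv-style operation, with a single linear layer standing in for the MLP (names and shapes are assumptions):

```python
import numpy as np

def edge_conv(feats, edges, W, b):
    """EdgeConv-style aggregation with channel-wise max (sketch).

    feats: (N, C); edges: list of (i, j); W: (2C, C_out), b: (C_out,),
    a single linear layer standing in for the note's MLP.
    """
    N, C = feats.shape
    out = np.full((N, W.shape[1]), -np.inf)
    for i, j in edges:
        msg = np.concatenate([feats[i], feats[j] - feats[i]])  # [f_i, f_j - f_i]
        h = np.maximum(msg @ W + b, 0.0)                       # linear + ReLU
        out[i] = np.maximum(out[i], h)                         # channel-wise max
    # Nodes without any incoming message keep zeros instead of -inf.
    out[np.isinf(out).any(axis=1)] = 0.0
    return out
```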
Fusion of graph features and temporal features
Scoring and Localization
The network head consists of four modules: Mloc, Mcls, Madj, Mscr
Mloc:
1) contains 4 blocks of Conv1d(3, 1), group normalization (GN), and ReLU layers, followed by one Conv1d(1, 1) to generate the location offsets for each anchor segment
2)pre-defined anchor segments
Mcls:
1) contains 4 blocks of Conv1d(3, 1), group normalization (GN), and ReLU layers, followed by one Conv1d(1, 1) to generate the classification scores for each anchor segment
2) updates the anchor segments by applying their predicted offsets from the Mloc module (update method in [37])
Madj:
1) inspired by FGD in [23]
2) samples 3 features from around the start and end locations, respectively, then temporally concatenates the 3 feature vectors from each location and applies Conv1d(3,1)-ReLU-Conv1d(1,1) to predict start/end offsets
Mscr:
1) Conv1d(3,1), ReLU, and Conv1d(1,1); predicts actionness/startness/endness scores [20] for each sequence
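A numpy sketch of the shared head structure described above, 4 × [Conv1d(3,1) → GN → ReLU] followed by Conv1d(1,1); the channel counts and number of GN groups are assumptions:

```python
import numpy as np

def conv1d(x, w, b, k):
    """x: (C_in, T); w: (C_out, C_in, k); same padding, stride 1."""
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    out = np.zeros((w.shape[0], T))
    for t in range(T):
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1])) + b
    return out

def group_norm(x, groups=8, eps=1e-5):
    """Normalize each group of channels over (channels-in-group, time)."""
    C, T = x.shape
    g = x.reshape(groups, C // groups, T)
    g = (g - g.mean(axis=(1, 2), keepdims=True)) / np.sqrt(
        g.var(axis=(1, 2), keepdims=True) + eps)
    return g.reshape(C, T)

def head(x, blocks, final_w, final_b):
    """4 x [Conv1d(3,1) -> GN -> ReLU], then Conv1d(1,1)."""
    for w, b in blocks:
        x = np.maximum(group_norm(conv1d(x, w, b, 3)), 0.0)
    return conv1d(x, final_w, final_b, 1)
```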
LOSS
training:
Lloc, Ladj: computed from the distance between the updated / adjusted anchor segments and their corresponding ground-truth actions, respectively.----GIoU
Lcls: loss between predicted classification scores and the ground-truth categories.----focal loss [22]
Lscr:----TEM losses [20]
whether an anchor segment is positive or negative: temporal intersection-over-union (tIoU)
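A sketch of the tIoU computation and anchor labeling; the thresholds are illustrative, not the paper's values:

```python
def t_iou(seg, gt):
    """Temporal IoU between two [start, end] segments."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gts, pos_thr=0.6, neg_thr=0.4):
    """Label each anchor positive (1), negative (0), or ignored (-1),
    by its best tIoU against any ground-truth action."""
    labels = []
    for a in anchors:
        best = max((t_iou(a, g) for g in gts), default=0.0)
        labels.append(1 if best >= pos_thr else (0 if best < neg_thr else -1))
    return labels
```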
inference:
1.score: ψ = (ts, te, s)
s = cψ · ps(ts) · pe(te)
confidence score cψ from Mcls
startness probability ps from Mscr
endness probability pe from Mscr
2.process Clip U: shift the boundaries of each detected segment to the beginning of the sequence and down-scale them back to the original scale
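The inference steps above can be sketched as follows: fuse the Mcls confidence with the Mscr boundary probabilities, then map Clip U detections back to the original scale (the function names, `u_start`, and `ratio` are assumptions):

```python
def final_score(c_psi, p_start, p_end, ts, te):
    """s = c_psi * p_s(t_s) * p_e(t_e): confidence from Mcls times the
    startness/endness probabilities from Mscr at the boundaries."""
    return c_psi * p_start[ts] * p_end[te]

def map_u_to_original(seg, u_start, ratio=2):
    """Map a segment detected on Clip U back to the original time scale:
    shift to the beginning of the sequence, then down-scale by the ratio."""
    return ((seg[0] - u_start) / ratio, (seg[1] - u_start) / ratio)
```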