"Video Self-Stitching Graph Network for Temporal Action Localization" – VSGN Reading Notes

Link: http://arxiv.org/abs/2011.14598
Model Name: VSGN
Code: https://github.com/coolbay/VSGN

VSGN addresses the poor detection accuracy of short actions in videos.
Two key components:
VSS (video self-stitching)
xGPN (cross-scale graph pyramid network)

Related work

  1. Multi-scale solutions in object detection
    Borrows from FPN and mosaic augmentation.
  2. Temporal action localization
    Two families of methods:
    1. Fixed-length video input (e.g., 100 frames)
      –BSN, BMN, G-TAD, BC-GNN
      –the small input scale makes them efficient
      –down-scaling harms short actions (they easily get lost or distorted)
      –up-scaling is not useful in these methods due to their architectures (the "scaling curse")
    2. Sliding windows
      –R-C3D, TAL-Net, PBRNet
      –preserve the original information
      –perform pooling / strided convolution to obtain multi-scale features
  3. Graph neural networks for TAL
    –Like G-TAD, VSGN builds a graph on video snippets.
    –VSGN further exploits correlations between cross-scale snippets and defines cross-scale edges to break the scaling curse.

Video Self-Stitching Graph Network

1. Input video: W × H × T × 3.
2. Features are extracted at the snippet level (a snippet is a few consecutive video frames) with TSN or I3D.
3. Feature sequence: F = {f_t}, t = 1, …, T/a, of shape (T/a) × C, where a is the number of frames per snippet.
4. Short clips are cropped from the feature sequence.
5. Clip up-scaling: Clip O (the original short clip) + gap + Clip U (the up-scaled clip) are stitched into one sequence (see the sketch after this list).
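A minimal sketch of the self-stitching step, assuming PyTorch tensors, linear interpolation as the up-scaling operator, and a zero-padded gap; `self_stitch`, `scale`, and `gap` are illustrative names, not the paper's API:

```python
import torch
import torch.nn.functional as F

def self_stitch(clip_feat: torch.Tensor, scale: int = 2, gap: int = 4) -> torch.Tensor:
    """Stitch a short clip (Clip O) with its up-scaled copy (Clip U).

    clip_feat: (T, C) snippet features of the original short clip.
    Returns a (T + gap + scale*T, C) stitched feature sequence.
    """
    # Up-scale along time with linear interpolation: (T, C) -> (scale*T, C).
    x = clip_feat.t().unsqueeze(0)                       # (1, C, T)
    clip_u = F.interpolate(x, scale_factor=scale, mode="linear",
                           align_corners=False)          # (1, C, scale*T)
    clip_u = clip_u.squeeze(0).t()                       # (scale*T, C)

    # Separate the two clips with a short zero gap (length is an assumption).
    pad = clip_feat.new_zeros(gap, clip_feat.size(1))
    return torch.cat([clip_feat, pad, clip_u], dim=0)    # Clip O + gap + Clip U
```

Both clips then go through the network jointly, so a short action is seen at its original scale and at an enlarged scale in one pass.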

Cross-Scale Graph Pyramid Network

Following FPN, xGPN has an encoder pyramid and a decoder pyramid, with connections between them.
Graph construction
1. Snippet features are nodes, edges are built between features, and each node has K edges.
2. K/2 are free edges: for each node, the K/2 features with the smallest MSE (mean squared error) distance to it, regardless of scale (feature proximity is all that matters).
3. K/2 are cross-scale edges: across clips, between Clip O and Clip U (see the sketch after this list).
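A minimal sketch of building the two edge sets; the cross-scale matching rule (connecting each Clip O node to Clip U nodes around its temporally corresponding position) is an assumption, and the real graph may also be symmetric:

```python
import torch

def build_edges(feats: torch.Tensor, len_o: int, k: int):
    """Build free edges and cross-scale edges on a stitched sequence.

    feats: (N, C) node features; nodes [0, len_o) belong to Clip O,
    nodes [len_o, N) to Clip U. k is the total edge budget per node.
    """
    n = feats.size(0)
    # Pairwise MSE distance between all node features.
    dist = ((feats.unsqueeze(1) - feats.unsqueeze(0)) ** 2).mean(dim=-1)
    dist.fill_diagonal_(float("inf"))

    # Free edges: the k/2 nearest neighbours in feature space, any scale.
    free_nbrs = dist.topk(k // 2, largest=False).indices          # (N, k/2)
    free_edges = [(i, j.item()) for i in range(n) for j in free_nbrs[i]]

    # Cross-scale edges: each Clip O node connects to k/2 Clip U nodes
    # around its temporally corresponding position (matching rule assumed).
    scale = (n - len_o) / max(len_o, 1)
    cross_edges = []
    for i in range(len_o):
        center = len_o + int(i * scale)
        for d in range(k // 2):
            j = min(max(center + d - k // 4, len_o), n - 1)
            cross_edges.append((i, j))
    return free_edges, cross_edges
```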
Feature aggregation

1. A multilayer perceptron (MLP) with weights W.
2. Edge convolution operations, as in the paper's aggregation formula (not reproduced here).
3. The maximum value is taken channel-wise over each node's edges to generate the aggregated feature (see the sketch after this list).
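A minimal DGCNN-style edge-convolution sketch matching the description above (MLP on each edge, then channel-wise max over a node's edges); the exact edge feature, (f_src, f_dst − f_src), is a common DGCNN choice and an assumption here, not taken from the paper:

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """Edge convolution with channel-wise max aggregation."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # MLP with weights W applied to each edge feature.
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, feats: torch.Tensor, edges: list) -> torch.Tensor:
        """feats: (N, C); edges: list of (src, dst) node-index pairs."""
        src = torch.tensor([e[0] for e in edges])
        dst = torch.tensor([e[1] for e in edges])
        # Edge feature: (f_src, f_dst - f_src), assumed DGCNN-style.
        edge_feat = torch.cat([feats[src], feats[dst] - feats[src]], dim=-1)
        msg = self.mlp(edge_feat)                        # (E, out_dim)
        # Channel-wise max over each node's incident edges; nodes without
        # edges keep -inf and should be handled by the caller.
        out = msg.new_full((feats.size(0), msg.size(1)), float("-inf"))
        out = out.scatter_reduce(0, src.unsqueeze(1).expand_as(msg), msg,
                                 reduce="amax", include_self=True)
        return out
```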
Fusion of graph features and temporal features: at each pyramid level, the graph-branch output is fused with a temporal branch (a sketch follows).
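A sketch of how one level might fuse the two branches, reusing the EdgeConv sketch above; the Conv1d temporal branch and element-wise sum are assumptions about the fusion, not the paper's definition:

```python
import torch
import torch.nn as nn

class FusionLevel(nn.Module):
    """One pyramid level: temporal branch + graph branch, fused by addition."""

    def __init__(self, dim: int):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.graph = EdgeConv(dim, dim)   # EdgeConv from the sketch above

    def forward(self, feats: torch.Tensor, edges: list) -> torch.Tensor:
        """feats: (N, C) snippet features at this level."""
        t_out = self.temporal(feats.t().unsqueeze(0)).squeeze(0).t()  # (N, C)
        g_out = self.graph(feats, edges)                              # (N, C)
        return t_out + g_out   # element-wise sum as the fusion (assumed)
```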

Scoring and Localization


The head consists of four modules: Mloc, Mcls, Madj and Mscr (a sketch of the shared head pattern follows this list).
Mloc
1) contains 4 blocks of Conv1d(3, 1), group normalization (GN) and ReLU layers, followed by one Conv1d(1, 1) to generate the location offsets for each anchor segment
2) works on pre-defined anchor segments
Mcls
1) contains 4 blocks of Conv1d(3, 1), group normalization (GN) and ReLU layers, followed by one Conv1d(1, 1) to generate the classification scores for each anchor segment
2) the anchor segments are updated by applying the predicted offsets from the Mloc module (update method in [37])
Madj
1) inspired by FGD in [23]
2) samples 3 features from around each segment's start and end locations, respectively; the 3 feature vectors from each location are temporally concatenated, and Conv1d(3, 1)-ReLU-Conv1d(1, 1) predicts the start/end offsets
Mscr
1) Conv1d(3, 1), ReLU and Conv1d(1, 1); predicts actionness/startness/endness scores [20] for each sequence
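A minimal sketch of the shared head pattern for Mloc and Mcls described above, reading Conv1d(k, s) as kernel size k and stride s; the channel width and GroupNorm group count are assumptions:

```python
import torch.nn as nn

def make_head(in_ch: int, out_ch: int, width: int = 256, blocks: int = 4) -> nn.Sequential:
    """4 x [Conv1d(3,1) + GroupNorm + ReLU], then Conv1d(1,1) for predictions.

    Used for both Mloc (out_ch = offsets per anchor) and Mcls
    (out_ch = classification scores per anchor); widths are assumed.
    """
    layers = []
    ch = in_ch
    for _ in range(blocks):
        layers += [
            nn.Conv1d(ch, width, kernel_size=3, stride=1, padding=1),
            nn.GroupNorm(32, width),   # group count is an assumption
            nn.ReLU(inplace=True),
        ]
        ch = width
    layers.append(nn.Conv1d(ch, out_ch, kernel_size=1, stride=1))
    return nn.Sequential(*layers)
```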

Loss

training:
The total loss is a weighted sum of the four module losses (formula in the paper).
Lloc and Ladj are computed from the distance between the updated / adjusted anchor segments and their corresponding ground-truth actions, respectively, using GIoU (see the sketch below).
Lcls is computed between predicted classification scores and the ground-truth categories, using focal loss [22].
Lscr uses the TEM losses [20].
Whether an anchor segment is positive or negative is determined by temporal intersection-over-union (tIoU).
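Since Lloc and Ladj are GIoU-based, a minimal 1-D GIoU for temporal segments may help; this is the standard GIoU definition specialized to intervals, not code from the paper:

```python
def giou_1d(pred: tuple, gt: tuple) -> float:
    """GIoU between temporal segments pred = (s1, e1) and gt = (s2, e2)."""
    s1, e1 = pred
    s2, e2 = gt
    inter = max(0.0, min(e1, e2) - max(s1, s2))      # overlap length
    union = (e1 - s1) + (e2 - s2) - inter
    iou = inter / union if union > 0 else 0.0
    hull = max(e1, e2) - min(s1, s2)                 # smallest enclosing span
    return iou - (hull - union) / hull if hull > 0 else iou

# The corresponding loss is 1 - giou_1d(pred, gt).
```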

inference:
1. Scoring: each candidate ψ = (ts, te, s) is scored as
   s = cψ · ps(ts) · pe(te)
   with the confidence score cψ from Mcls, the startness probability ps(ts) from Mscr, and the endness probability pe(te) from Mscr.
2. Processing Clip U: shift the boundaries of each detected segment to the beginning of the sequence and down-scale them back to the original scale (see the sketch below).
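A minimal sketch of step 2, mapping a Clip U detection back to the original time axis; `len_o`, `gap` and `scale` follow the stitched layout assumed in the VSS sketch earlier:

```python
def map_clip_u_back(ts: float, te: float, len_o: int, gap: int, scale: int):
    """Map boundaries detected on Clip U back onto Clip O's time axis.

    (ts, te) are in stitched-sequence coordinates; Clip U starts at
    len_o + gap and is `scale` times longer than the original clip.
    """
    offset = len_o + gap
    # Shift to the beginning of the sequence, then undo the up-scaling.
    return (ts - offset) / scale, (te - offset) / scale
```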
