【论文翻译】 BMN: Boundary-Matching Network for Temporal Action Proposal Generation

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

边界匹配网络[时序动作提名]

1. Introduction

With the number of videos on the Internet growing rapidly, video content analysis methods have attracted widespread attention from both academia and industry.

随着互联网视频数量的快速增长,视频内容分析方法受到了学术界和业界的广泛关注。

Temporal action detection is an important task in video content analysis area, which aims to locate action instances in untrimmed long videos with both action categories and temporal boundaries.

时序动作检测是视频内容分析领域的一项重要任务,其目的是在未裁剪的长视频中定位动作实例,并给出其动作类别和时间边界。

Akin to object detection, temporal action detection methods can be divided into two stages: temporal action proposal generation and action classification.

与目标检测类似,时间动作检测方法可以分为两个阶段:时间动作提议生成和动作分类。

Although convincing classification accuracy can be achieved by action recognition methods, the detection performance is still low in mainstream benchmarks [14, 5].

虽然动作识别方法可以达到令人信服的分类精度,但在主流基准中检测性能仍然较低[14,5]。

Therefore, many recent methods work on improving the quality of temporal action proposals.

因此,许多最近的方法都致力于提高时序动作提名的质量

Besides being used in the temporal action detection task, temporal proposal generation methods also have wide applications in many areas such as video recommendation, video highlight detection and smart surveillance.

除了用于时间动作检测任务外,时序提名生成方法在视频推荐、视频高亮检测和智能监控等领域也有广泛的应用
Figure 1. Overview of our method. Given an untrimmed video, BMN can simultaneously generate (1) a boundary probability sequence to construct proposals and (2) a Boundary-Matching confidence map to densely evaluate the confidence of all proposals.

图 1:我们方法的概述。给定一个未裁剪的视频,BMN可以同时生成(1)边界概率序列来构造建议,(2)边界匹配置信图来密集评估所有建议的置信度。

To achieve high proposal quality, a proposal generation method should (1) generate temporal proposals with flexible duration and precise boundaries to cover groundtruth action instances precisely and exhaustively; (2) generate reliable confidence scores so that proposals can be retrieved properly.

为了提高提名的质量,提名生成方法应该:(1)生成时间灵活、边界精确的时间提案,精确、详尽地覆盖ground - truth action实例;(2)生成可靠的置信度分数,使提名可以被正确检索。

Most existing proposal generation methods [3, 4, 8, 23] adopted a “top-down” fashion to generate proposals with multi-scale temporal sliding windows in regular interval, and then evaluate confidence scores of proposals respectively or simultaneously.

现有的大多数提案生成方法[3,4,8,23]采用“自上而下”的方式,在规则的间隔内生成具有多尺度时间滑动窗口的提名,然后分别或同时评估提案的置信度得分。

The main drawback of these methods is that the generated proposals are usually not temporally precise or not flexible enough to cover ground-truth action instances of varying duration.

这些方法的主要缺点是生成的建议通常在时间上不够精确,或者不够灵活,不能涵盖持续时间不同的真实行动实例。

Recently, Boundary-Sensitive Network (BSN) [17] adopted a “bottom-up” fashion to generate proposals in two stages: (1) locate temporal boundaries and combine boundaries as proposals and (2) evaluate the confidence score of each proposal using a constructed proposal feature.

最近,边界敏感网络(BSN)[17]采用了“自底向上”的方式生成提名,分为两个阶段:(1)定位时间边界并将边界合并为提案;(2)利用构造的提案特征评估每个提名的可信度得分

By exploiting local clues, BSN can generate proposals with more precise boundaries and more flexible duration than existing top-down methods.

通过利用局部线索,BSN可以生成比现有的自上而下方法更精确的边界和更灵活的时间

However, BSN has three main drawbacks: (1) proposal feature construction and confidence evaluation are conducted for each proposal separately, leading to inefficiency; (2) the proposal feature constructed in BSN is too simple to capture enough temporal context; (3) BSN is a multiple-stage method rather than a unified framework.

但BSN存在三个主要缺陷:(1)对每个提名分别进行提名特征构建和置信度评估程序,导致效率低下;(2) BSN中构造的proposal feature过于简单,无法捕捉足够的时间上下文;(3) BSN是多阶段的,但不是一个统一的框架

Can we evaluate confidence for all proposals simultaneously with rich context? Top-down methods [18, 2] can achieve this easily with anchor mechanism, where proposals are pre-defined as non-continuous distributed anchors.

我们能否在丰富的背景下同时评估所有提名的可信度?自上而下的方法[18,2]可以通过锚定机制轻松实现这一点,锚定机制将提案预先定义为非连续的分布式锚。

However, since the boundary and duration of proposals are much more flexible, anchor mechanism is not suitable for bottom-up methods such as BSN.

但是,由于提名的边界和期限要灵活得多,锚定机制不适合BSN等自下而上的方法。

To address these difficulties, we propose the Boundary-Matching (BM) mechanism for confidence evaluation of densely distributed proposals.

为了解决这些困难,我们提出了边界匹配(BM)机制来评估密集分布的提议的置信度。

In BM mechanism, a proposal is denoted as a matching pair of its starting and ending boundaries, and then all BM pairs are combined as a two dimensional BM confidence map to represent densely distributed proposals with continuous starting boundaries and temporal duration.

在BM机制中,将一个提名表示为其起始边界和结束边界的一对匹配对,然后将所有的BM对组合为一个二维BM置信图,以表示密集分布且起始边界和时间持续时间连续的提议。

Thus, we can generate confidence scores for all proposals simultaneously via the BM confidence map.

因此,我们可以通过BM置信度图同时生成所有提名的置信度得分。

A BM layer is proposed to generate BM feature map from temporal feature sequence, and the BM confidence map can be obtained from the BM feature map using a series of conv-layers.

提出了一种基于时间特征序列生成BM特征图的BM层,利用一系列的卷积层从BM特征图获得BM置信度图。

BM feature map contains rich feature and temporal context for each proposal, and gives the potential for exploiting context of adjacent proposals.

BM特征图包含了每个提名的丰富特征和时间上下文,并为开发相邻提案的上下文提供了潜力。

In summary, our work has three main contributions:

总之,我们的工作有三个主要贡献:

  1. We introduce the Boundary-Matching mechanism for evaluating confidence scores of densely distributed proposals, which can be easily embedded in a network.

我们引入了边界匹配机制来评估分布密集的建议的置信度,该机制可以很容易地嵌入到网络中。

  2. We propose an efficient, effective and end-to-end temporal action proposal generation method, the Boundary-Matching Network (BMN). The temporal boundary probability sequence and the BM confidence map are generated simultaneously in two branches of BMN, which are trained jointly as a unified framework.

我们提出了一种高效、有效、端到端的时间动作提议生成方法——边界匹配网络(BMN)。在BMN的两个分支中同时生成时间边界概率序列和BM置信图,并将其联合训练成统一的框架。

  3. Extensive experiments show that BMN can achieve significantly better proposal generation performance than other state-of-the-art methods, with remarkable efficiency, great generalizability and great performance on the temporal action detection task.

大量的实验表明,与目前最先进的方法相比,BMN可以获得明显更好的提议生成性能,具有显著的效率、良好的泛化性和较好的时间动作检测任务性能。

2. Related Work

Action Recognition

行为识别

Action recognition is a fundamental and important task of video understanding area.

动作识别是视频理解领域的一项基本而重要的任务。

Handcrafted features such as HOG, HOF and MBH are widely used in earlier works, such as improved Dense Trajectory (iDT) [29, 30].

手工制作的特征如HOG、HOF和MBH在早期的著作中被广泛使用,如improved Dense Trajectory (iDT)[29,30]。

Recently, deep learning models have achieved significant performance improvements in the action recognition task.

近年来,深度学习模型在动作识别任务中取得了显著的性能提升。

The mainstream networks fall into two categories: two-stream networks [9, 24, 32] exploit appearance and motion clues from RGB image and stacked optical flow separately; 3D networks [27, 21] exploit appearance and motion clues directly from raw video volume.

主流网络分为两类:双流网络[9,24,32]分别利用RGB图像和堆叠光流的外观和运动线索;3D网络[27,21]直接从原始视频量中利用外观和运动线索。

In our work, by convention, we adopt action recognition models to extract visual feature sequence of untrimmed video.

在我们的工作中,我们按照惯例,采用动作识别模型来提取未裁剪视频的视觉特征序列。

Correlation Matching

相关匹配

Correlation matching algorithms are widely used in many computer vision tasks, such as image registration, action recognition and stereo matching.

相关匹配算法广泛应用于图像配准、动作识别和立体匹配等计算机视觉任务中。

Specifically, stereo matching aims to find corresponding pixels from stereo images.

立体匹配是指从立体图像中找到相应的像素点。

For each pixel in the left image of a rectified image pair, the stereo matching method needs to find the corresponding pixel in the right image along the horizontal direction, that is, the right-image pixel with minimum matching cost.

对于矫正后的图像对左图像中的每个像素,立体匹配方法需要沿水平方向在右图像中找到相应的像素,或者说以最小的代价找到右像素。

Thus, the cost minimization over all left pixels can be represented as a cost volume, in which each left-right pixel pair is a point in the volume.

因此,所有左像素的代价最小化可以表示为一个cost volume,其中每个左右像素对是体积中的一个点。

Based on cost volume, many recent works [26, 20, 16] achieve end-to-end network via generating cost volume directly from combining two feature maps, using correlation layer [20] or feature concatenation [6].

在cost volume的基础上,近年来的许多著作[26,20,16]采用相关层[20]或特征拼接[6],通过结合两个特征映射直接生成cost volume来实现端到端网络。

Inspired by cost volume, our proposed BM confidence map contains pairs of temporal starting and ending boundaries as proposals, thus can directly generate confidence scores for all proposals using convolutional layers.

受cost volume的启发,我们提出的BM置信度图包含一对时间开始和结束边界作为提名,因此可以使用卷积层直接为所有提名生成置信度得分。

We propose BM layer to efficiently generate BM feature map via sampling feature among starting and ending boundaries of each proposal simultaneously.

提出了一种BM层算法,通过对每个方案的起始边界和结束边界进行采样,有效地生成BM特征图。

Temporal Action Proposal Generation

时序动作提名

As aforementioned, the goal of temporal action detection task is to detect action instances in untrimmed videos with temporal boundaries and action categories, which can be divided into temporal proposal generation and action classification stages.

如前所述,时序动作检测任务的目标是检测未修剪视频中具有时间边界和动作类别的动作实例,分为时间提议生成和动作分类两个阶段。

These two stages are taken apart in most detection methods [23, 25, 35], and are taken together as single model in some methods [18, 2]. For proposal generation task, most previous works [3, 4, 8, 12, 23] adopt top-down fashion to generate proposals with pre-defined duration and interval, where the main drawback is the lack of boundary precision and duration flexibility.

这两个阶段在大多数检测方法中被分开[23,25,35],在一些方法中被合并为单一模型[18,2]。对于提名生成任务,以往的作品[3,4,8,12,23]大多采用自顶向下的方式生成具有预定义时间和时间间隔的提案,其主要缺点是缺乏边界精度和时间灵活性。

There are also some methods [35, 17] that adopt a bottom-up fashion. TAG [35] generates proposals using a temporal watershed algorithm, but lacks confidence scores for retrieval.

也有一些方法[35,17]采用自下而上的方式。TAG[35]使用时间分水岭算法生成提名,但缺乏用于检索的置信度分数。

Recently, BSN [17] generates proposals via locally locating temporal boundaries and globally evaluating confidence scores, and achieves a significant performance improvement over previous proposal generation methods.

近年来,BSN[17]通过局部定位时间边界和全局评估置信度来生成建议,与以前的建议生成方法相比,取得了显著的性能提升。

In this work, we propose the Boundary-Matching mechanism for proposal confidence evaluation, which can largely simplify the pipeline of BSN and bring significant promotion in both efficiency and effectiveness.

在这项工作中,我们提出了边界匹配机制来评估提议的可信度,这大大简化了BSN的流程,并在效率和有效性方面带来了显著的提升。

Figure 2. Illustration of BM confidence map. Proposals in the same row have the same temporal duration, and proposals in the same column have the same starting time. The ending boundaries of proposals at right-bottom corner exceed the range of video, thus these proposals are not considered during training and inference.

图2。BM置信图图解。同一行中的提案具有相同的时间持续时间,同一列中的提案具有相同的开始时间。由于右下角建议的结束边界超出了视频的范围,所以在训练和推理时不考虑这些建议。
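
The layout described in Figure 2 can be illustrated with a small NumPy mask. This is only a sketch: the function name `bm_valid_mask`, the 0-based indexing, and the convention that row `i` holds proposals of duration `i + 1` are assumptions made for illustration, not details stated in the text.

```python
import numpy as np

def bm_valid_mask(T: int, D: int) -> np.ndarray:
    """Return a (D, T) boolean mask over the BM confidence map.

    Assumed convention: entry [i, j] is the proposal that starts at
    temporal location j and has duration i + 1, so it ends at j + i + 1.
    """
    durations = np.arange(1, D + 1)[:, None]  # (D, 1): one duration per row
    starts = np.arange(T)[None, :]            # (1, T): one start per column
    ends = starts + durations                 # (D, T) by broadcasting
    return ends <= T                          # False where the end exceeds the video

mask = bm_valid_mask(T=5, D=3)
print(mask.astype(int))
```

The `False` entries gather in the bottom-right corner, which corresponds to the proposals that Figure 2 excludes from training and inference.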

3. Our Approach

实现方法

3.1. Problem Formulation

问题公式化


Figure 3. Illustration of BM layer. For each proposal, we conduct dot product at T dimension between sampling weight and temporal feature sequence, to generate BM feature of shape C × N.

图3。BM层图。对于每个提名,我们在采样权值和时间特征序列之间进行T维点积,生成形状为C×N的BM特征。
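
The dot product in Figure 3 can be sketched for a single proposal as follows. Since the section that defines the sampling weight is not reproduced here, the details are assumptions: this sketch samples `N` points uniformly between the starting and ending boundaries and spreads each non-integer point over its two neighbouring locations by linear interpolation. The helper names `sampling_weight` and `bm_feature` are hypothetical.

```python
import numpy as np

def sampling_weight(t_start: float, t_end: float, N: int, T: int) -> np.ndarray:
    """Build a sampling weight of shape (N, T) for one proposal (assumed scheme)."""
    w = np.zeros((N, T))
    points = np.linspace(t_start, t_end, N)   # N uniformly spaced sample points
    for n, t in enumerate(points):
        lo = int(np.floor(t))
        frac = t - lo
        if 0 <= lo < T:
            w[n, lo] += 1.0 - frac            # weight on the left neighbour
        if 0 <= lo + 1 < T:
            w[n, lo + 1] += frac              # weight on the right neighbour
    return w

def bm_feature(S: np.ndarray, t_start: float, t_end: float, N: int) -> np.ndarray:
    """Dot product over the T dimension: (C, T) x (T, N) -> (C, N)."""
    _, T = S.shape
    w = sampling_weight(t_start, t_end, N, T)
    return S @ w.T

# Toy feature sequence with C = 2 channels and T = 8 locations.
S = np.stack([np.arange(8.0), 2.0 * np.arange(8.0)])
feat = bm_feature(S, t_start=1.0, t_end=4.5, N=4)
print(feat.shape)
```

Because the toy channels are linear ramps, the sampled features equal the sample positions themselves (and twice those positions), which is a convenient sanity check for the interpolation weights.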

3.4. Boundary-Matching Network

边界匹配网络

Different from the multiple-stage framework of BSN [17], BMN generates the local boundary probability sequence and the global proposal confidence map simultaneously, while the whole model is trained in a unified framework.

与BSN[17]的多级框架不同,BMN同时生成局部边界概率序列和全局建议置信度图,同时在统一框架下对整个模型进行训练。

As demonstrated in Fig 4, BMN model contains three modules:

如图4所示,BMN模型包含三个模块:

Base Module handles the input feature sequence, and outputs feature sequence shared by the following two modules;

基础模块处理输入特征序列,输出特征序列由以下两个模块共享;

Temporal Evaluation Module evaluates starting and ending probabilities of each location in video to generate boundary probability sequences;

时序评估模块对视频中每个位置的起始和结束概率进行评估,生成边界概率序列;

Proposal Evaluation Module contains the BM layer to transfer feature sequence to BM feature map, and contains a series of 3D and 2D convolutional layers to generate BM confidence map.

提名评估模块包含BM层,将特征序列转换为BM特征图,包含一系列三维和二维卷积层,生成BM置信度图。

Figure 4. The framework of Boundary-Matching Network. After feature extraction, we use BMN to simultaneously generate temporal boundary probability sequence and BM confidence map, and then construct proposals based on boundary probabilities and get corresponding confidence score from BM confidence map.

图4. 边界匹配网络的框架。在特征提取后,利用BMN同时生成时间边界概率序列和BM置信图,然后根据边界概率构造提名,从BM置信图中得到相应的置信值。
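
The last step in the Figure 4 caption, constructing proposals from boundary probabilities and reading their confidence from the BM confidence map, might look roughly like the sketch below. The fixed `threshold` for selecting candidate boundaries, the function name `construct_proposals`, and the indexing `conf_map[d - 1][s]` (duration `d`, start `s`) are all assumptions for illustration; the text above does not specify the selection rule.

```python
def construct_proposals(p_start, p_end, conf_map, threshold=0.5):
    """Pair candidate boundaries and look up each pair in the BM confidence map.

    p_start, p_end: per-location boundary probabilities, each of length T.
    conf_map: D rows of T values; conf_map[d - 1][s] is assumed to score the
              proposal that starts at location s and has duration d.
    Returns (start, end, confidence) tuples sorted by confidence, highest first.
    """
    D, T = len(conf_map), len(conf_map[0])
    starts = [t for t in range(T) if p_start[t] >= threshold]
    ends = [t for t in range(T) if p_end[t] >= threshold]
    proposals = []
    for s in starts:
        for e in ends:
            d = e - s
            if 1 <= d <= D:  # keep pairs with positive duration that fit in the map
                proposals.append((s, e, conf_map[d - 1][s]))
    return sorted(proposals, key=lambda p: p[2], reverse=True)

p_start = [0.9, 0.1, 0.6, 0.0]
p_end = [0.0, 0.2, 0.7, 0.8]
conf_map = [
    [0.1, 0.2, 0.3, 0.4],  # duration 1
    [0.5, 0.6, 0.7, 0.8],  # duration 2
]
proposals = construct_proposals(p_start, p_end, conf_map)
print(proposals)  # [(0, 2, 0.5), (2, 3, 0.3)]
```

The pair (0, 3) is dropped because its duration of 3 exceeds the maximum duration D = 2 of this toy map.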

Table 1. The detailed architecture of BMN, where the output feature sequence of base module is shared by temporal evaluation and proposal evaluation modules. T and D are length of input feature sequence and maximum proposal duration separately.

表1。BMN的详细架构,其中基本模块的输出特征序列由时间评估和提案评估模块共享。T和D分别为输入特征序列长度和最大提案持续时间。


Base Module.

基础模块

The goal of the base module is to handle the input feature sequence, expand the receptive field and serve as backbone of network, to provide a shared feature sequence for TEM and PEM.

基本模块的目标是处理输入特征序列,扩展接受域,充当网络的骨干,为TEM和PEM提供共享的特征序列。

Since untrimmed videos have uncertain temporal length, we adopt a long observation window with length $l_\omega$ to truncate the untrimmed feature sequence with length $l_f$.

由于未修剪的视频具有不确定的时间长度,我们采用长度为 $l_\omega$ 的长观察窗口对长度为 $l_f$ 的未修剪特征序列进行截断。

We denote an observation window as $\omega = \{t_{\omega,s}, t_{\omega,e}, \Psi_\omega, F_\omega\}$, where $t_{\omega,s}$ and $t_{\omega,e}$ are the starting and ending time of $\omega$ separately, $\Psi_\omega$ and $F_\omega$ are annotations and feature sequence within the window separately.

我们设一个观察窗口为 $\omega = \{t_{\omega,s}, t_{\omega,e}, \Psi_\omega, F_\omega\}$,其中 $t_{\omega,s}$ 和 $t_{\omega,e}$ 分别是 $\omega$ 的开始时间和结束时间,$\Psi_\omega$ 和 $F_\omega$ 分别是窗口内的标注和特征序列。

The window length $l_\omega = t_{\omega,e} - t_{\omega,s}$ is set depending on the dataset. The details of the base module are shown in Table 1, including two temporal convolutional layers.

根据数据集设置窗口长度 $l_\omega = t_{\omega,e} - t_{\omega,s}$。基础模块的详细信息如表1所示,包括两个时域卷积层。
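
As a rough illustration of truncating a feature sequence of length $l_f$ with windows of length $l_\omega$, the sketch below enumerates window boundaries. The `stride` parameter, the extra final window that is flush with the end of the sequence, and the function name `observation_windows` are assumptions; the text only states that the window length depends on the dataset.

```python
def observation_windows(l_f: int, l_w: int, stride: int):
    """Return (t_ws, t_we) index pairs of windows over a sequence of length l_f.

    Each window satisfies t_we - t_ws == l_w. If the sequence is shorter than
    one window, a single window is returned (padding is assumed to happen elsewhere).
    """
    if l_f <= l_w:
        return [(0, l_w)]
    starts = list(range(0, l_f - l_w + 1, stride))
    if starts[-1] + l_w < l_f:
        starts.append(l_f - l_w)  # extra window so the tail of the sequence is covered
    return [(s, s + l_w) for s in starts]

print(observation_windows(l_f=10, l_w=4, stride=2))
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```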

Temporal Evaluation Module (TEM).

时序评估模块(TEM)

The goal of TEM is to evaluate the starting and ending probabilities for all temporal locations in untrimmed video.

TEM 的目标是评估未裁剪视频中所有时间点的起始和结束概率

These boundary probability sequences are used for generating proposals during post processing.

这些边界概率序列用于在后处理过程中产生提名。

The details of TEM are shown in Table 1, where the $conv1d_4$ layer with two sigmoid-activated filters outputs the starting probability sequence $P_{S,\omega} = \{p^s_{t_n}\}_{n=1}^{l_\omega}$ and the ending probability sequence $P_{E,\omega} = \{p^e_{t_n}\}_{n=1}^{l_\omega}$ separately for an observation window $\omega$.

TEM的细节如表1所示,其中,对于一个观测窗口 $\omega$,使用两个sigmoid激活滤波器的 $conv1d_4$ 层分别输出起始概率序列 $P_{S,\omega} = \{p^s_{t_n}\}_{n=1}^{l_\omega}$ 和结束概率序列 $P_{E,\omega} = \{p^e_{t_n}\}_{n=1}^{l_\omega}$。
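
A minimal NumPy sketch of such an output layer is given below. It assumes a kernel size of 1, so that the temporal convolution reduces to a per-location matrix product; the kernel size, the function name `tem_head`, and the toy shapes are assumptions, since Table 1 is not reproduced here.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def tem_head(features: np.ndarray, weight: np.ndarray, bias: np.ndarray):
    """Two sigmoid-activated filters applied at every temporal location.

    features: (C, l_w) feature sequence inside one observation window
    weight:   (2, C)   one filter for starting, one for ending (kernel size 1 assumed)
    bias:     (2,)
    Returns (P_S, P_E), each of shape (l_w,).
    """
    logits = weight @ features + bias[:, None]  # (2, l_w)
    probs = sigmoid(logits)
    return probs[0], probs[1]

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6))              # toy sizes: C = 4, l_w = 6
P_S, P_E = tem_head(features, rng.normal(size=(2, 4)), np.zeros(2))
print(P_S.shape, P_E.shape)
```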

Proposal Evaluation Module (PEM).

提案评估模块(PEM)。

The goal of PEM is to generate Boundary-Matching (BM) confidence map, which contains confidence scores for densely distributed proposals. To achieve this, PEM contains BM layer and a series of 3d and 2d convolutional layers.

PEM的目标是生成边界匹配(BM)置信度图,该置信度图包含对分布密集的提名的置信度值。为此,PEM包含BM层和一系列的3d、2d卷积层。

As introduced in Sec. 3.3, BM layer transfers temporal feature sequence $S$ to BM feature map $M_F$ via matrix dot product between $S$ and sampling mask weight $W$ in temporal dimension.

如3.3节所述,BM层通过 $S$ 与采样掩码权值 $W$ 在时间维度上的矩阵点积,将时间特征序列 $S$ 转换为BM特征图 $M_F$。

In BM layer, the number of sample points N is set to 32, and the maximum proposal duration D is set depending on dataset.

After generating BM feature map $M_F$, first we conduct $conv3d_1$ layer in sample dimension to reduce dimension length from $N$ to 1, and increase hidden units from 128 to 512.

在生成BM特征图 $M_F$ 后,我们首先在采样维度上进行 $conv3d_1$ 层,将维度长度从 $N$ 减少到1,将隐藏单元从128增加到512。

Then, we conduct $conv2d_1$ layer with $1 \times 1$ kernel to reduce the hidden units, and $conv2d_2$ layer with $3 \times 3$ kernel to capture context of adjacent proposals.

然后,我们引入了 $conv2d_1$ 层和 $conv2d_2$ 层,其中 $conv2d_1$ 层采用 $1 \times 1$ 核来减少隐含单元,$conv2d_2$ 层采用 $3 \times 3$ 核来捕获相邻提名的上下文。

Finally, we generate two types of BM confidence map $M_{CC}, M_{CR} \in R^{D \times T}$ with sigmoid activation, where $M_{CC}$ and $M_{CR}$ are trained using binary classification and regression loss function separately.

最后,我们通过sigmoid激活生成了两种BM置信度图 $M_{CC}, M_{CR} \in R^{D \times T}$,其中 $M_{CC}$ 和 $M_{CR}$ 分别使用二元分类和回归损失函数进行训练。
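
The shape changes in PEM can be traced with a simplified NumPy sketch. It is not the architecture of Table 1: biases, hidden activations and the $3 \times 3$ convolution are omitted, the output layer is assumed to be a $1 \times 1$ convolution with two filters, and the toy channel sizes and the name `pem_head` are hypothetical. The point is only to show how a kernel spanning the whole sample dimension collapses $N$ to 1 and how two $D \times T$ maps come out at the end.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def pem_head(M_F, w3d, w1x1, w_out):
    """Toy PEM forward pass (shape walk-through only).

    M_F:   (C, N, D, T)   BM feature map
    w3d:   (C_mid, C, N)  kernel covering the whole sample dimension
    w1x1:  (C_low, C_mid) 1x1 kernel that reduces the hidden units
    w_out: (2, C_low)     assumed 1x1 kernel producing the two confidence maps
    """
    x = np.einsum("cndt,ocn->odt", M_F, w3d)            # N collapses: (C_mid, D, T)
    x = np.einsum("cdt,oc->odt", x, w1x1)               # (C_low, D, T)
    maps = sigmoid(np.einsum("cdt,oc->odt", x, w_out))  # (2, D, T), values in [0, 1]
    return maps[0], maps[1]                             # M_CC, M_CR

rng = np.random.default_rng(0)
C, N, D, T = 8, 4, 3, 5      # toy sizes; the text uses 128 hidden units and N = 32
M_F = rng.normal(size=(C, N, D, T))
M_CC, M_CR = pem_head(
    M_F,
    rng.normal(size=(16, C, N)),
    rng.normal(size=(8, 16)),
    rng.normal(size=(2, 8)),
)
print(M_CC.shape, M_CR.shape)  # both are (D, T)
```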

5. Conclusion

In this paper, we introduced the Boundary-Matching mechanism for evaluating confidence scores of densely distributed proposals, which is achieved via denoting proposal as BM pair and combining all proposals as BM confidence map.

在本文中,我们引入了边界匹配机制来评估分布密集的建议的置信度,该机制是通过将建议表示为BM对,并将所有建议组合为BM置信度映射来实现的。

Meanwhile, we proposed the Boundary-Matching Network (BMN) for effective and efficient temporal action proposal generation, where BMN generates proposals with precise boundaries and flexible duration via combining high probability boundaries, and simultaneously generates reliable confidence scores for all proposals based on BM mechanism.

同时,我们提出了边界匹配网络(Boundary-Matching Network, BMN),用于有效和高效地生成时间动作提议,BMN通过结合高概率边界生成具有精确边界和灵活时间的提议,同时基于BM机制为所有提议生成可靠的置信度分数。

Extensive experiments demonstrate that BMN outperforms other state-of-the-art proposal generation methods in both proposal generation and temporal action detection tasks, with remarkable efficiency and generalizability.

大量实验表明,无论是在提议生成还是时间动作检测任务上,BMN都优于其他最先进的提议生成方法,具有显著的效率和通用性。
