[Paper Translation] BMN: Boundary-Matching Network for Temporal Action Proposal Generation

1. Introduction

With the number of videos on the Internet growing rapidly, video content analysis methods have attracted widespread attention from both academia and industry.


Temporal action detection is an important task in video content analysis; it aims to locate action instances in long untrimmed videos, predicting both action categories and temporal boundaries.


Akin to object detection, temporal action detection can be divided into two stages: temporal action proposal generation and action classification.


Although action recognition methods achieve convincing classification accuracy, detection performance remains low on mainstream benchmarks [14, 5].


Therefore, many recent methods work on improving the quality of temporal action proposals.


Besides temporal action detection, temporal proposal generation methods also have wide applications in areas such as video recommendation, video highlight detection and smart surveillance.

Figure 1. Overview of our method. Given an untrimmed video, BMN simultaneously generates (1) a boundary probability sequence to construct proposals and (2) a Boundary-Matching confidence map to densely evaluate the confidence of all proposals.

To achieve high proposal quality, a proposal generation method should (1) generate temporal proposals with flexible durations and precise boundaries to cover ground-truth action instances precisely and exhaustively; (2) generate reliable confidence scores so that proposals can be retrieved properly.

Most existing proposal generation methods [3, 4, 8, 23] adopt a “top-down” fashion: they generate proposals with multi-scale temporal sliding windows at regular intervals, and then evaluate the confidence scores of the proposals individually or simultaneously.


The main drawback of these methods is that the generated proposals are usually not temporally precise, or not flexible enough to cover ground-truth action instances of varying durations.


Recently, the Boundary-Sensitive Network (BSN) [17] adopted a “bottom-up” fashion to generate proposals in two stages: (1) locate temporal boundaries and combine them into proposals, and (2) evaluate the confidence score of each proposal using a constructed proposal feature.


By exploiting local clues, BSN can generate proposals with more precise boundaries and more flexible durations than existing top-down methods.


However, BSN has three main drawbacks: (1) proposal feature construction and confidence evaluation are conducted for each proposal separately, leading to inefficiency; (2) the proposal feature constructed in BSN is too simple to capture enough temporal context; (3) BSN is a multi-stage pipeline rather than a unified framework.

Can we evaluate confidence for all proposals simultaneously with rich context? Top-down methods [18, 2] can achieve this easily with an anchor mechanism, where proposals are pre-defined as non-continuously distributed anchors.


However, since the boundaries and durations of proposals are much more flexible, the anchor mechanism is not suitable for bottom-up methods such as BSN.


To address these difficulties, we propose the Boundary-Matching (BM) mechanism for confidence evaluation of densely distributed proposals.


In the BM mechanism, a proposal is denoted as a matching pair of its starting and ending boundaries, and all BM pairs are then combined into a two-dimensional BM confidence map that represents densely distributed proposals with continuous starting boundaries and temporal durations.


Thus, we can generate confidence scores for all proposals simultaneously via the BM confidence map.


A BM layer is proposed to generate the BM feature map from a temporal feature sequence, and the BM confidence map is obtained from the BM feature map using a series of convolutional layers.


The BM feature map contains rich features and temporal context for each proposal, and offers the potential to exploit the context of adjacent proposals.


In summary, our work has three main contributions:


  1. We introduce the Boundary-Matching mechanism for evaluating confidence scores of densely distributed proposals, which can be easily embedded in a network.


  2. We propose an efficient, effective and end-to-end temporal action proposal generation method, the Boundary-Matching Network (BMN). The temporal boundary probability sequence and the BM confidence map are generated simultaneously in two branches of BMN, which are trained jointly as a unified framework.


  3. Extensive experiments show that BMN achieves significantly better proposal generation performance than other state-of-the-art methods, with remarkable efficiency, great generalizability and strong performance on the temporal action detection task.


2. Related Work

Action Recognition


Action recognition is a fundamental and important task in video understanding.


Handcrafted features such as HOG, HOF and MBH were widely used in earlier work, such as improved Dense Trajectory (iDT) [29, 30].

Recently, deep learning models have brought significant performance improvements to the action recognition task.


The mainstream networks fall into two categories: two-stream networks [9, 24, 32] exploit appearance and motion clues from RGB image and stacked optical flow separately; 3D networks [27, 21] exploit appearance and motion clues directly from raw video volume.


In our work, following convention, we adopt action recognition models to extract the visual feature sequence of an untrimmed video.


Correlation Matching


Correlation matching algorithms are widely used in many computer vision tasks, such as image registration, action recognition and stereo matching.


Specifically, stereo matching aims to find corresponding pixels from stereo images.


For each pixel in the left image of a rectified image pair, the stereo matching method needs to find the corresponding pixel in the right image along the horizontal direction; in other words, it finds the right pixel with minimum cost.


Thus, the cost minimization over all left pixels can be represented as a cost volume, in which each left-right pixel pair is a point in the volume.


Based on the cost volume, many recent works [26, 20, 16] build end-to-end networks by generating the cost volume directly from a combination of two feature maps, using a correlation layer [20] or feature concatenation [6].

Inspired by the cost volume, our proposed BM confidence map contains pairs of temporal starting and ending boundaries as proposals, and can thus directly generate confidence scores for all proposals using convolutional layers.

We propose the BM layer to efficiently generate the BM feature map by sampling features between the starting and ending boundaries of each proposal simultaneously.


Temporal Action Proposal Generation


As mentioned above, the goal of temporal action detection is to detect action instances in untrimmed videos with temporal boundaries and action categories; the task can be divided into temporal proposal generation and action classification stages.


These two stages are handled separately in most detection methods [23, 25, 35], and jointly in a single model in some methods [18, 2]. For the proposal generation task, most previous works [3, 4, 8, 12, 23] adopt a top-down fashion to generate proposals with pre-defined durations and intervals, where the main drawback is the lack of boundary precision and duration flexibility.


Some methods [35, 17] adopt a bottom-up fashion. TAG [35] generates proposals using a temporal watershed algorithm, but lacks confidence scores for retrieval.


Recently, BSN [17] generated proposals by locally locating temporal boundaries and globally evaluating confidence scores, achieving a significant performance improvement over previous proposal generation methods.


In this work, we propose the Boundary-Matching mechanism for proposal confidence evaluation, which largely simplifies the pipeline of BSN and brings significant improvements in both efficiency and effectiveness.


Figure 2. Illustration of the BM confidence map. Proposals in the same row have the same temporal duration, and proposals in the same column have the same starting time. The ending boundaries of proposals in the bottom-right corner exceed the range of the video, so these proposals are not considered during training and inference.
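The layout described in the Figure 2 caption can be sketched as a small indexing exercise. The snippet below assumes a zero-based map where row `i` stands for duration `i + 1` and column `j` for starting location `j`; this indexing convention and the helper name `valid_mask` are illustrative assumptions, not something stated in the paper.

```python
def valid_mask(D, T):
    """Return a D x T grid of 0/1 flags; 1 means the proposal ends inside the video.

    Assumed (hypothetical) convention: row i -> duration i + 1, column j -> start j,
    so the proposal spans [j, j + i + 1] on a timeline of length T.
    """
    return [[1 if j + (i + 1) <= T else 0 for j in range(T)] for i in range(D)]


# Small example: 3 possible durations over 4 temporal locations.
mask = valid_mask(D=3, T=4)
for row in mask:
    print(row)
```

Under this convention the zeros collect in the bottom-right corner of the grid, which matches the region the caption says is ignored during training and inference.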


3. Our Approach


3.1. Problem Formulation



Figure 3. Illustration of the BM layer. For each proposal, we compute the dot product along the T dimension between the sampling weight and the temporal feature sequence, generating a BM feature of shape C × N.
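A minimal sketch of the dot product in the Figure 3 caption, for a single proposal, might look as follows. The way the sampling weight is built here (N evenly spaced points between the two boundaries, linearly interpolated between neighbouring locations) is an assumption made for illustration, and the function names `sampling_weight` and `bm_feature` are hypothetical.

```python
import math


def sampling_weight(t_start, t_end, N, T):
    """Build an N x T weight matrix; row n interpolates the feature at the n-th sample point."""
    w = [[0.0] * T for _ in range(N)]
    for n in range(N):
        # Evenly spaced sample points between the boundaries (assumed scheme).
        pos = t_start + (t_end - t_start) * n / (N - 1) if N > 1 else t_start
        lo = math.floor(pos)
        frac = pos - lo
        if 0 <= lo < T:
            w[n][lo] += 1.0 - frac
        if frac > 0 and 0 <= lo + 1 < T:
            w[n][lo + 1] += frac
    return w


def bm_feature(S, w):
    """Dot product over the T dimension: (C x T) with (N x T) gives C x N."""
    return [[sum(s_t * w_t for s_t, w_t in zip(channel, row)) for row in w]
            for channel in S]


# Toy feature sequence with C = 2 channels and T = 5 locations.
S = [[0.0, 1.0, 2.0, 3.0, 4.0],       # channel 0: a ramp, so sampled values equal positions
     [10.0, 10.0, 10.0, 10.0, 10.0]]  # channel 1: constant
w = sampling_weight(t_start=1, t_end=3, N=5, T=5)
feat = bm_feature(S, w)
print(len(feat), len(feat[0]))  # C, N
print(feat[0])
```

Because the weight matrix depends only on the proposal boundaries and not on the video, it could be precomputed once for every (start, duration) pair and reused, which is what makes a single batched product over all proposals possible.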


3.4. Boundary-Matching Network


Unlike the multi-stage framework of BSN [17], BMN generates the local boundary probability sequence and the global proposal confidence map simultaneously, and the whole model is trained in a unified framework.


As shown in Fig. 4, the BMN model contains three modules:


The Base Module handles the input feature sequence and outputs a feature sequence shared by the following two modules;


The Temporal Evaluation Module evaluates the starting and ending probabilities of each location in the video to generate boundary probability sequences;


The Proposal Evaluation Module contains the BM layer, which transforms the feature sequence into the BM feature map, and a series of 3D and 2D convolutional layers that generate the BM confidence map.


Figure 4. The framework of the Boundary-Matching Network. After feature extraction, we use BMN to simultaneously generate the temporal boundary probability sequence and the BM confidence map, and then construct proposals based on the boundary probabilities and obtain the corresponding confidence scores from the BM confidence map.
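The last step in the Figure 4 caption, building proposals from boundary probabilities and reading their scores off the BM confidence map, can be sketched roughly as below. The fixed probability threshold used to pick candidate boundaries and the `conf_map[d - 1][s]` indexing are simplifying assumptions for illustration; the text here does not specify how candidates are selected, and `build_proposals` is a hypothetical helper.

```python
def build_proposals(p_start, p_end, conf_map, threshold=0.5):
    """Pair candidate boundaries and look up each pair's score in the BM map.

    conf_map[d - 1][s] is assumed to hold the confidence of the proposal that
    starts at location s and lasts d steps (hypothetical indexing convention).
    """
    D = len(conf_map)
    starts = [t for t, p in enumerate(p_start) if p >= threshold]
    ends = [t for t, p in enumerate(p_end) if p >= threshold]
    proposals = []
    for s in starts:
        for e in ends:
            d = e - s
            if 1 <= d <= D:  # the end must follow the start, within the max duration
                proposals.append((s, e, conf_map[d - 1][s]))
    # Highest-confidence proposals first.
    return sorted(proposals, key=lambda p: p[2], reverse=True)


p_start = [0.9, 0.1, 0.6, 0.1, 0.1]
p_end = [0.1, 0.1, 0.1, 0.7, 0.8]
conf_map = [[0.1, 0.2, 0.3, 0.4, 0.0],  # duration 1
            [0.5, 0.6, 0.7, 0.0, 0.0],  # duration 2
            [0.8, 0.9, 0.0, 0.0, 0.0]]  # duration 3
for s, e, score in build_proposals(p_start, p_end, conf_map):
    print(s, e, score)
```

Since the map is computed once for the whole window, scoring a proposal is only a table lookup, in contrast to evaluating each proposal separately.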

Table 1. The detailed architecture of BMN, where the output feature sequence of the base module is shared by the temporal evaluation and proposal evaluation modules. T and D are the length of the input feature sequence and the maximum proposal duration, respectively.



Base Module.


The goal of the base module is to handle the input feature sequence, expand the receptive field and serve as the backbone of the network, providing a shared feature sequence for TEM and PEM.


Since untrimmed videos have uncertain temporal length, we adopt a long observation window of length $l_\omega$ to truncate the untrimmed feature sequence of length $l_f$.

We denote an observation window as $\omega = \{t_{\omega,s}, t_{\omega,e}, \Psi_\omega, F_\omega\}$, where $t_{\omega,s}$ and $t_{\omega,e}$ are the starting and ending times of $\omega$, and $\Psi_\omega$ and $F_\omega$ are the annotations and the feature sequence within the window, respectively.

The window length $l_\omega = t_{\omega,e} - t_{\omega,s}$ is set depending on the dataset. The details of the base module are shown in Table 1, including two temporal convolutional layers.
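As a rough illustration of truncating a feature sequence of length $l_f$ with windows of length $l_\omega$, the sketch below slides a fixed-length window along the sequence. The stride, the handling of the tail, and the treatment of sequences shorter than one window are all assumptions made for the example (the text only says the window length depends on the dataset), and `observation_windows` is a hypothetical helper.

```python
def observation_windows(l_f, l_w, stride):
    """Return (t_start, t_end) pairs of length-l_w windows over a sequence of length l_f."""
    if l_f <= l_w:
        # A short sequence fits in one window (padding up to l_w is assumed).
        return [(0, l_w)]
    starts = list(range(0, l_f - l_w + 1, stride))
    if starts[-1] + l_w < l_f:
        # Add one extra window so the tail of the sequence is covered.
        starts.append(l_f - l_w)
    return [(s, s + l_w) for s in starts]


for t_s, t_e in observation_windows(l_f=11, l_w=4, stride=3):
    print(t_s, t_e, t_e - t_s)  # each window has length l_w = t_e - t_s
```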

Temporal Evaluation Module (TEM).


The goal of TEM is to evaluate the starting and ending probabilities for all temporal locations in the untrimmed video.

These boundary probability sequences are used to generate proposals during post-processing.


The details of TEM are shown in Table 1, where the $conv1d_4$ layer, with two sigmoid-activated filters, outputs the starting probability sequence $P_{S,\omega} = \{p^s_{t_n}\}_{n=1}^{l_\omega}$ and the ending probability sequence $P_{E,\omega} = \{p^e_{t_n}\}_{n=1}^{l_\omega}$, respectively, for an observation window $\omega$.
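To make the "two sigmoid-activated filters" concrete, here is a plain-Python sketch of a one-dimensional convolution with two output filters followed by a sigmoid, producing one probability per temporal location for each of the two boundary types. The kernel size, weights and zero padding are placeholders chosen for the example rather than values from Table 1, and `conv1d_two_heads` is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d_two_heads(features, kernels, biases):
    """features: C x T; kernels: 2 x C x K (K odd); returns two length-T sequences in (0, 1)."""
    C, T = len(features), len(features[0])
    K = len(kernels[0][0])
    pad = K // 2
    outputs = []
    for kernel, bias in zip(kernels, biases):
        seq = []
        for t in range(T):
            acc = bias
            for c in range(C):
                for k in range(K):
                    idx = t + k - pad
                    if 0 <= idx < T:  # zero padding at the borders
                        acc += kernel[c][k] * features[c][idx]
            seq.append(sigmoid(acc))
        outputs.append(seq)
    return outputs[0], outputs[1]  # (starting probabilities, ending probabilities)


features = [[0.0, 1.0, 2.0, 3.0],
            [1.0, 1.0, 1.0, 1.0]]
kernels = [[[1.0], [-1.5]],   # start filter: placeholder weights
           [[-1.0], [1.5]]]   # end filter: placeholder weights
p_start, p_end = conv1d_two_heads(features, kernels, biases=[0.0, 0.0])
print(p_start)
print(p_end)
```

Each output sequence has one value per temporal location, mirroring $P_{S,\omega}$ and $P_{E,\omega}$ above.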

Proposal Evaluation Module (PEM).


The goal of PEM is to generate the Boundary-Matching (BM) confidence map, which contains confidence scores for densely distributed proposals. To achieve this, PEM contains the BM layer and a series of 3D and 2D convolutional layers.


As introduced in Sec. 3.3, the BM layer transforms the temporal feature sequence $S$ into the BM feature map $M_F$ via a matrix dot product between $S$ and the sampling mask weight $W$ in the temporal dimension.

In the BM layer, the number of sample points N is set to 32, and the maximum proposal duration D is set depending on the dataset.

After generating the BM feature map $M_F$, we first apply the $conv3d_1$ layer in the sample dimension to reduce the dimension length from N to 1 and increase the hidden units from 128 to 512.

Then, we apply the $conv2d_1$ layer with a 1 × 1 kernel to reduce the hidden units, and the $conv2d_2$ layer with a 3 × 3 kernel to capture the context of adjacent proposals.

Finally, we generate two types of BM confidence map $M_{CC}, M_{CR} \in R^{D \times T}$ with sigmoid activation, where $M_{CC}$ and $M_{CR}$ are trained using binary classification and regression loss functions, respectively.
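The sequence of layers in PEM is easiest to follow as a trace of tensor shapes. The sketch below only does shape bookkeeping and runs no convolutions. The values 128, 512 and N = 32 come from the text; the example values of D and T, the width after the 1 × 1 convolution, and the padding that keeps D × T unchanged under the 3 × 3 convolution are assumptions, and `pem_shapes` is a hypothetical helper.

```python
def pem_shapes(C=128, N=32, D=100, T=100, hidden_3d=512, hidden_2d=128):
    """Trace (channels, ...) shapes through the PEM layers; D and T are example values."""
    shapes = [("BM feature map M_F", (C, N, D, T))]
    # conv3d_1 collapses the sample dimension N -> 1 and widens channels 128 -> 512.
    shapes.append(("after conv3d_1", (hidden_3d, 1, D, T)))
    shapes.append(("squeeze sample dim", (hidden_3d, D, T)))
    # conv2d_1 uses a 1x1 kernel to reduce the hidden units (target width assumed).
    shapes.append(("after conv2d_1 (1x1)", (hidden_2d, D, T)))
    # conv2d_2 uses a 3x3 kernel; padding 1 is assumed so that D x T is preserved.
    shapes.append(("after conv2d_2 (3x3)", (hidden_2d, D, T)))
    # Final sigmoid layer: two maps, M_CC and M_CR, each of shape D x T.
    shapes.append(("M_CC and M_CR", (2, D, T)))
    return shapes


for name, shape in pem_shapes():
    print(f"{name:24s} {shape}")
```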

5. Conclusion

In this paper, we introduced the Boundary-Matching mechanism for evaluating confidence scores of densely distributed proposals, which is achieved by denoting each proposal as a BM pair and combining all proposals into a BM confidence map.


Meanwhile, we proposed the Boundary-Matching Network (BMN) for effective and efficient temporal action proposal generation: BMN generates proposals with precise boundaries and flexible durations by combining high-probability boundaries, and simultaneously generates reliable confidence scores for all proposals based on the BM mechanism.

Extensive experiments demonstrate that BMN outperforms other state-of-the-art proposal generation methods in both proposal generation and temporal action detection tasks, with remarkable efficiency and generalizability.


