Duanxx's paper reading notes:
Behavior Recognition via Sparse Spatio-Temporal Features
Motion recognition based on sparse spatio-temporal feature points
——Duanxx
——2015-04-24
1、Introduction
In this work we develop a general framework for detecting and characterizing behavior from video sequences, making few underlying assumptions about the domain and subjects under observation. Consider some of the well known difficulties faced in behavior recognition. Subjects under observation can vary in posture, appearance and size. Occlusions and complex backgrounds can impede observation, and variations in the environment, such as in illumination, can further make observations difficult. Moreover, there are variations in the behaviors themselves.
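Behavior recognition has to cope with many difficulties: the subject's posture, appearance, and size can change; occlusion and complex backgrounds hinder observation; environmental factors such as illumination make observation harder still; and, worst of all, the behaviors themselves vary.
This paper proposes a general framework for detecting and recognizing behavior in video that works well with only a few assumptions about the subjects being observed.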
The inspiration for our approach comes from approaches to object recognition that rely on sparsely detected features in a particular arrangement to characterize an object, e.g. [6, 1, 18]. Such approaches tend to be robust to pose, image clutter, occlusion, object variation, and the imprecise nature of the feature detectors. In short they can provide a robust descriptor for objects without relying on too many assumptions.
The approach draws mainly on the following three papers:
1: S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11):1475–1490, Nov 2004.
2: A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing objects in range data using regional point descriptors. In ECCV, 2004.
3: B. Leibe and B. Schiele. Scale invariant object categorization using a scale-adaptive mean-shift search. In DAGM, Aug. 2004.
In short, such approaches yield a robust descriptor without needing many assumptions.
We propose to characterize behavior through the use of spatio-temporal feature points (see figure 1). A spatiotemporal feature is a short, local video sequence such as an eye opening or a knee bending, or for a mouse a paw rapidly moving back and forth. A behavior is then fully described in terms of the types and locations of feature points present.
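This paper characterizes a behavior through spatio-temporal feature points. A spatio-temporal feature is a very short, local video sequence, such as an eye blinking or a knee bending. A behavior is then described by the types and locations of the spatio-temporal feature points it contains.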
The motivation is that an eye opening can be characterized as such regardless of global appearance, posture, nearby motion or occlusion and so forth, for example, see figure 2. The complexity of discerning whether two behaviors are similar is shifted to the detection and description of a rich set of features.
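This description works because motion is local: to recognize an eye opening, for example, we do not need to consider the subject's overall appearance, posture, occlusion, and so on.
As a result, deciding whether two behaviors are similar becomes a problem of detecting and describing a set of spatio-temporal feature points.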
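3、 Proposed Algorithm
This section describes the algorithm used in the paper in detail. The algorithm has four steps:
3.1. Feature Detection
There is a large body of algorithms and papers on feature detection. One survey is:
C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. IJCV, 37(2):151–172, June 2000.
The more popular approaches are: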
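- detection based on corners
A corner is a region where the local gradient vectors point in orthogonal directions.
The gradient vectors are obtained from the first-order derivatives of the smoothed image $L(x, y, \sigma) = I(x, y) * g(x, y, \sigma)$,
where $g$ is the Gaussian smoothing kernel and $\sigma$ controls the spatial scale at which corners are detected.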
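The response strength at each point is then based on the rank of the covariance matrix of the gradient calculated in a local window.
Each point therefore gets a response strength that depends on the rank of the covariance matrix of the gradients in its local window.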
References:
1、W. Förstner and E. Gülch. A fast operator for detection and precise location of distinct points. In Intercommission Conf. on Fast Processing of Photogrammetric Data, pages 281–305, Switzerland, 1987.
2、C. Harris and M. Stephens. A combined corner and edge detector. In Proc. Alvey Conf., pages 189–192, 1988.
- the Laplacian of Gaussian (LoG) and the Harris detector
Reference:
1、K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In ICCV, pages I: 525–531, 2001.
- detectors designed specifically to find salient features (saliency detection)
Reference:
1、T. Kadir and M. Brady. Saliency, scale and image description. IJCV, 45(2):83–105, Nov 2001.
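The corner-based approach in the list above can be sketched in a few lines. This is a minimal illustration in the style of the Harris detector, not code from the paper: the function names, the Gaussian window, and the constant `k = 0.04` are assumptions, and the rank of the gradient covariance matrix is approximated by the usual determinant-minus-trace measure.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian truncated at 3*sigma and normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    return kernel / kernel.sum()

def smooth(image, sigma):
    """Separable Gaussian smoothing: filter along rows, then along columns.
    Assumes both image sides are at least as long as the kernel."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

def harris_response(image, sigma=1.0, window_sigma=2.0, k=0.04):
    """Corner strength from the gradient covariance matrix in a local window."""
    L = smooth(image.astype(float), sigma)      # L = I * g
    Ly, Lx = np.gradient(L)                     # first-order derivatives
    # Entries of the covariance matrix, averaged over a Gaussian window.
    Sxx = smooth(Lx * Lx, window_sigma)
    Syy = smooth(Ly * Ly, window_sigma)
    Sxy = smooth(Lx * Ly, window_sigma)
    det = Sxx * Syy - Sxy**2
    trace = Sxx + Syy
    # Large only when the matrix has two significant eigenvalues (full rank).
    return det - k * trace**2
```

On a bright square against a dark background, the response is positive near the square's corners, negative along its straight edges, and close to zero in flat regions.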
3.1.1. Extensions to the Spatio-Temporal Case
The general idea of interest point detection in the spatiotemporal case is similar to the spatial case. Instead of an image I(x, y), interest point detection must operate on a stack of images denoted by I(x, y, t). Localization must proceed not only along the spatial dimensions x and y but also the temporal dimension t. Likewise, detected features also have temporal extent.
Interest point detection in the spatio-temporal domain works much like detection in the spatial domain, except that it operates on an image sequence I(x, y, t) rather than a single image I(x, y). Besides the spatial dimensions x and y, time t becomes a further dimension along which features are localized.
The established spatio-temporal interest point detector is the 3D Harris detector (an extension of the Harris corner detector to the 3D case). References:
1、I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, pages 432–439, 2003.
2、C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, pages III: 32–36, 2004.
In practice, however, the 3D Harris detector finds too few feature points to support spatio-temporal feature extraction well.
The authors' spatio-temporal feature detector:
Like much of the work on interest point detectors, our response function is calculated by application of separable linear filters. We assume a stationary camera or a process that can account for camera motion. The response function has the form $R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$ where $g(x, y; \sigma)$ is the 2D Gaussian smoothing kernel, applied only along the spatial dimensions, and $h_{ev}$ and $h_{od}$ are a quadrature pair [10] of 1D Gabor filters applied temporally. These are defined as $h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}$ and $h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}$. In all cases we use $\omega = 4/\tau$, effectively giving the response function $R$ two parameters $\sigma$ and $\tau$, corresponding roughly to the spatial and temporal scale of the detector.
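Like most interest point detectors, the response function here is computed with separable linear filters. The algorithm assumes a stationary camera, or a process that can compensate for camera motion.
The response function is $R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$, where $g(x, y; \sigma)$ is the 2D Gaussian smoothing kernel, applied only along the spatial dimensions, and $h_{ev}$ and $h_{od}$ are a quadrature pair of 1D Gabor filters applied along the time axis (G. Granlund and H. Knutsson. Signal Processing for Computer Vision. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1995.). The two filters are defined as:
$h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}$
$h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}$
$\omega = 4/\tau$
$\sigma$ and $\tau$ are the spatial and temporal scale parameters, respectively.
This detector responds well to periodic motion, but it responds poorly to pure translation.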
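To make the response function concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the `[t, y, x]` array layout, the truncation radii of the kernels, the zero padding at the borders, and the default values of `sigma` and `tau` are all assumptions, and `t` is sampled at integer frame indices.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian truncated at 3*sigma and normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    return kernel / kernel.sum()

def gabor_pair(tau):
    """Quadrature pair (h_ev, h_od) of 1D Gabor filters with omega = 4 / tau."""
    omega = 4.0 / tau
    radius = int(np.ceil(2 * tau))          # assumed truncation of the envelope
    t = np.arange(-radius, radius + 1, dtype=float)
    envelope = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2 * np.pi * t * omega) * envelope
    h_od = -np.sin(2 * np.pi * t * omega) * envelope
    return h_ev, h_od

def convolve_axis(volume, kernel, axis):
    """Convolve every 1D slice along `axis` with `kernel` (zero padding).
    Assumes the axis is at least as long as the kernel."""
    return np.apply_along_axis(np.convolve, axis, volume, kernel, mode="same")

def response(video, sigma=1.0, tau=2.5):
    """R = (I*g*h_ev)^2 + (I*g*h_od)^2 for a video indexed as [t, y, x]."""
    g = gaussian_kernel(sigma)
    h_ev, h_od = gabor_pair(tau)
    # The 2D Gaussian is separable, so smooth along y and then along x.
    smoothed = convolve_axis(convolve_axis(video.astype(float), g, 1), g, 2)
    # The Gabor pair is applied along the time axis only.
    even = convolve_axis(smoothed, h_ev, 0)
    odd = convolve_axis(smoothed, h_od, 0)
    return even**2 + odd**2
```

Because the two Gabor filters form a quadrature pair, the summed squares respond to a temporally oscillating patch regardless of its phase, while a static region produces almost no response.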
3.2. Cuboids
At each interest point (local maxima of the response function defined above), a cuboid is extracted which contains the spatio-temporally windowed pixel values. The size of the cuboid is set to contain most of the volume of data that contributed to the response function at that interest point; specifically, cuboids have a side length of approximately six times the scale at which they were detected.
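At each interest point (a local maximum of the response function above), a cuboid is extracted around the point. The cuboid should contain as much as possible of the data that contributed to the response function at that point; its side length is approximately six times the scale at which the point was detected.
To compare two cuboids, a notion of similarity needs to be defined. Given the large number of cuboids we deal with in some of the datasets (on the order of 10^5), we opted to use a descriptor that could be computed once for each cuboid and compare using Euclidean distance.
The next step is to compare cuboids using a notion of similarity. The paper uses a cuboid descriptor that only has to be computed once per cuboid, and measures the similarity between cuboids by the Euclidean distance between their descriptors.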
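The extraction step can be sketched as follows. This is an illustration under assumptions, not the paper's code: the helper names are hypothetical, local maxima are taken over the 26 immediate neighbors, cuboids that would cross the video boundary are skipped, and the raw pixel values stand in for the descriptor when computing the Euclidean distance.

```python
import itertools
import numpy as np

def local_maxima(R, threshold=0.0):
    """Indices (t, y, x) where R exceeds `threshold` and is strictly greater
    than all 26 neighbors in the 3x3x3 neighborhood."""
    R = np.asarray(R, dtype=float)
    padded = np.pad(R, 1, mode="constant", constant_values=-np.inf)
    is_max = R > threshold
    for dt, dy, dx in itertools.product((-1, 0, 1), repeat=3):
        if (dt, dy, dx) == (0, 0, 0):
            continue
        neighbor = padded[1 + dt:1 + dt + R.shape[0],
                          1 + dy:1 + dy + R.shape[1],
                          1 + dx:1 + dx + R.shape[2]]
        is_max &= R > neighbor
    return np.argwhere(is_max)

def extract_cuboid(video, center, sigma, tau):
    """Cut out a cuboid with side length of about six times the detection
    scale (3*tau frames and 3*sigma pixels on each side of the center).
    Returns None when the cuboid would leave the video."""
    t, y, x = center
    rt, rs = int(round(3 * tau)), int(round(3 * sigma))
    lo = (t - rt, y - rs, x - rs)
    hi = (t + rt, y + rs, x + rs)
    if min(lo) < 0 or any(h >= n for h, n in zip(hi, video.shape)):
        return None
    return video[lo[0]:hi[0] + 1, lo[1]:hi[1] + 1, lo[2]:hi[2] + 1]

def euclidean_distance(a, b):
    """Euclidean distance between two equally sized cuboids or descriptors."""
    return float(np.linalg.norm(np.ravel(a) - np.ravel(b)))
```

With `sigma = 1` and `tau = 1`, each cuboid spans 7 frames by 7 by 7 pixels.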
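The paper offers three transformations that can be applied to a cuboid:
- normalized pixel values
- brightness gradient: The brightness gradient is calculated at each spatio-temporal location (x, y, t), giving rise to three channels (G x, G y, G t) each the same size as the cuboid.
In other words, the gradient is computed over both space and time, and each of the three resulting channels has the same dimensions as the cuboid.
- windowed optical flow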
The aim of these transformations is to make the descriptor invariant to small changes while preserving its discriminative power.
To extract motion information we calculate Lucas-Kanade optical flow [20] between each pair of consecutive frames, creating two channels (Vx, Vy). Each channel is the same size as the cuboid, minus one frame.
To extract motion information, Lucas-Kanade optical flow is computed between each pair of consecutive video frames (B.D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674–679, 1981.).
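Of the three transformations, the brightness gradient is the simplest to sketch. The snippet below is a minimal illustration, assuming the cuboid is a NumPy array indexed as `[t, y, x]` and using central finite differences; the paper does not say which difference scheme the authors used, and `gradient_channels` is a hypothetical name.

```python
import numpy as np

def gradient_channels(cuboid):
    """Return the channels (Gx, Gy, Gt), each the same size as the cuboid."""
    # np.gradient returns one array per axis, in axis order (t, y, x).
    Gt, Gy, Gx = np.gradient(cuboid.astype(float))
    return Gx, Gy, Gt
```

For a cuboid whose brightness grows linearly in x and t, the three channels are constant, which makes the function easy to check by hand.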
We use one of three methods to create a feature vector given the transformed cuboid (or multiple resulting cuboids when using the gradient or optical flow). The simplest method involves flattening the cuboid into a vector, although the resulting vector is potentially sensitive to small cuboid perturbations. The second method involves histogramming the values in the cuboid. Such a representation is robust to perturbations but also discards all positional information (spatial and temporal). Local histograms, used as part of Lowe's 2D SIFT descriptor [19], provide a compromise solution. The cuboid is divided into a number of regions and a local histogram is created for each region. The goal is to introduce robustness to small perturbations while retaining some positional information. For all the methods, to reduce the dimensionality of the final descriptors we use PCA [12].
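There are three alternative ways to turn the transformed cuboid into a feature vector.
- flattening
This method is the simplest, but it is also sensitive to small perturbations of the cuboid.
- global histogramming
This method is robust to perturbations, but it discards all positional information in space and time.
- local histogramming
This method uses local histograms, as in Lowe's SIFT descriptor (D.G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, Nov 2004.). Compared with the first two methods, it offers a compromise.
The cuboid is divided into a number of small regions, and a local histogram is built for each region. The aim is to stay robust to perturbations while keeping as much positional information as possible.
For all of these methods, PCA is used to reduce the dimensionality of the final descriptor (T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer Verlag, Basel, 2001.).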
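The three vectorization methods and the PCA step can be sketched as below. Everything specific here is an assumption made for illustration: the function names, the number of bins, the fixed value range, the 2x2x2 block layout for local histograms, and the use of an SVD to compute PCA. The input `channels` is a list of equally sized 3D arrays, such as the gradient channels of one cuboid.

```python
import numpy as np

def flatten(channels):
    """Concatenate all channel values into one long vector."""
    return np.concatenate([c.ravel() for c in channels])

def global_histogram(channels, bins=8, value_range=(-1.0, 1.0)):
    """One histogram per channel; positional information is discarded.
    Values outside `value_range` are ignored by np.histogram."""
    hists = [np.histogram(c, bins=bins, range=value_range)[0] for c in channels]
    return np.concatenate(hists).astype(float)

def local_histograms(channels, splits=2, bins=8, value_range=(-1.0, 1.0)):
    """Split each channel into splits**3 blocks and histogram each block."""
    parts = []
    for c in channels:
        for block_t in np.array_split(c, splits, axis=0):
            for block_y in np.array_split(block_t, splits, axis=1):
                for block in np.array_split(block_y, splits, axis=2):
                    parts.append(np.histogram(block, bins=bins, range=value_range)[0])
    return np.concatenate(parts).astype(float)

def pca_reduce(descriptors, n_components):
    """Project descriptors (one per row) onto their top principal components."""
    centered = descriptors - descriptors.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T
```

Flattening keeps every value in place, the global histogram keeps none of the positions, and the local histograms keep only which block a value came from, which is the compromise described above.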
The methods above all build on research into 2D descriptors. For a detailed discussion of that research, see: K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. In CVPR, pages II: 257–263, 2003.
In all experiments reported later in the paper we used the flattened gradient as the descriptor, which is essentially a generalization of the PCA-SIFT descriptor [15].
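All of the experiments reported later in the paper use the flattened gradient as the descriptor. It is essentially a generalization of the PCA-SIFT descriptor, whose main reference is:
Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In CVPR, pages 506–513, 2004.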
Recall that the descriptors we use involve first transforming the cuboid into: (1) normalized brightness, (2) gradient, or (3) windowed optical flow, followed by a conversion into a vector by (1) flattening, (2) global histogramming, or (3) local histogramming, for a total of nine methods, along with multi-dimensional histograms when they apply. Using the gradient in any form gave very reliable results, as did using the flattened vector of normalized brightness values.
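To recap the descriptor pipeline: the cuboid is first transformed using one of three methods, (1) normalized brightness, (2) gradient, or (3) windowed optical flow. The transformed cuboid is then mapped to a feature vector using one of (1) flattening, (2) global histogramming, or (3) local histogramming. Of these combinations, the gradient in any form and the flattened normalized brightness gave very reliable results; the experiments use the flattened gradient.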
3.3. Cuboid Prototypes
Our approach is based on the idea that although two instances of the same behavior may vary significantly in terms of their overall appearance and motion, many of the interest points they give rise to are similar. Under this assumption, even though the number of possible cuboids is virtually unlimited, the number of different types of cuboids is relatively small. In terms of recognition the exact form of a cuboid becomes unimportant, only its type matters.
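Although two instances of the same behavior can differ significantly in overall appearance and motion, many of the interest points they give rise to are similar. Under this assumption, even though the number of possible cuboids is virtually unlimited, the number of cuboid types is relatively small.
The cuboids are clustered with k-means, and each cuboid is assigned to its corresponding cuboid prototype.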
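As a rough illustration of this step, the sketch below clusters cuboid descriptors with a plain Lloyd-style k-means and then assigns each descriptor to its nearest prototype. The function names, the random initialization, and the fixed iteration count are assumptions; the notes do not say which k-means variant or settings the authors used.

```python
import numpy as np

def assign_types(descriptors, prototypes):
    """Index of the closest prototype (Euclidean distance) for each descriptor."""
    diffs = descriptors[:, None, :] - prototypes[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain k-means; the k cluster centers serve as the cuboid prototypes."""
    rng = np.random.default_rng(seed)
    # Initialize the prototypes with k distinct descriptors chosen at random.
    start = rng.choice(len(descriptors), size=k, replace=False)
    prototypes = descriptors[start].astype(float)
    for _ in range(n_iter):
        labels = assign_types(descriptors, prototypes)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:   # leave an empty cluster's center unchanged
                prototypes[j] = members.mean(axis=0)
    return prototypes
```

A typical use would be `prototypes = kmeans(descriptors, k)` on the training descriptors, followed by `assign_types(new_descriptors, prototypes)` to obtain the type of each new cuboid.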
3.4. Behavior Descriptor
After extraction of the cuboids the original clip is discarded. The rationale for this is that once the interest points have been detected, together their local neighborhoods contain all the information necessary to characterize a behavior. Each cuboid is assigned a type by mapping it to the closest prototype vector, at which point the cuboids themselves are discarded and only their type is kept.
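Once the cuboids have been extracted, the original video clip is no longer needed.
The rationale is that once the interest points have been detected, their local neighborhoods together contain all the information needed to characterize a behavior. Each cuboid is assigned to its closest prototype, after which the cuboids themselves can also be discarded and only their types are kept.
The behavior descriptor is a histogram of cuboid types, and the distance between two behavior descriptors is measured with the Euclidean distance or the $\chi^2$ distance.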
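A small sketch of the behavior descriptor and the two distances follows. The normalization of the histogram to sum to 1 and the factor 0.5 in the $\chi^2$ distance are common conventions assumed here, not details taken from the paper, and the function names are hypothetical.

```python
import numpy as np

def behavior_descriptor(cuboid_types, n_types):
    """Histogram of cuboid types for one clip, normalized to sum to 1."""
    counts = np.bincount(np.asarray(cuboid_types, dtype=int), minlength=n_types)
    hist = counts.astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def euclidean_distance(h1, h2):
    """Euclidean distance between two behavior descriptors."""
    return float(np.linalg.norm(h1 - h2))

def chi2_distance(h1, h2):
    """Chi-squared distance; bins that are empty in both histograms are skipped."""
    denom = h1 + h2
    mask = denom > 0
    return float(0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask]))
```

Two clips with the same mix of cuboid types have distance 0 under both measures, and with normalized histograms this form of the $\chi^2$ distance reaches its maximum of 1 when the clips share no cuboid types.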