Translated material: Motion Segmentation and Pose Recognition with Motion History Gradients



Motion Segmentation and Pose Recognition with Motion History Gradients

Gary R. Bradski
Intel Corporation, Microcomputer Research Labs
SC12-303, 2200 Mission College Blvd.
Santa Clara, CA 95052-8119 USA

James Davis
MIT Media Lab
E15-390, 20 Ames St.
Cambridge, MA 02139 USA


This paper uses a simple method for representing motion
in successively layered silhouettes that directly encode
system time, termed the timed Motion History Image
(tMHI). This representation can be used both to (a)
determine the current pose of the object and to (b)
segment and measure the motions induced by the object
in a video scene. These segmented regions are not
“motion blobs”, but instead motion regions naturally
connected to the moving parts of the object of interest.
This method may be used as a very general gesture
recognition “toolbox”. We use it to recognize waving
and overhead clapping motions to control a music
synthesis program.

1. Introduction and Related Work
Three years ago, a PC cost about $2500 and a low end
video camera and capture board cost about $300. Today
the computer could be had for under $700 and an
adequate USB camera costs about $50. It is not surprising
then that there is an increasing interest in the recognition
of human motion and action in real-time vision. For
example, during these three years this topic has been
addressed by [[5][6][7][8][9][10][11][12][15][17][24]]
among others. Several survey papers in this period have
reviewed computer vision based motion recognition [25],
human motion capture [[22][23]] and human motion
analysis [1]. In particular, with the advent of inexpensive
and powerful hardware, tracking/surveillance systems,
human computer interfaces, and entertainment domains
have a heightened interest in understanding and
recognizing human movements. For example, monitoring
applications may wish to signal only when a person is
seen moving in a particular area (perhaps within a
dangerous or secure area), interface systems may require
the understanding of gesture as a means of input or
control, and entertainment applications may want to
analyze the actions of the person to better aid in the
immersion or reactivity of the experience.

One possible motion representation is found by collecting
optical flow over the image or region of interest
throughout the sequence, but this is computationally
expensive and often not robust. For example,
hierarchical [2] and/or robust estimation [4] is often
needed, and optical flow frequently signals unwanted
motion in regions such as loose and textured clothing.
Moreover, in the absence of some type of grouping,
optical flow happens frame to frame whereas human
gestures may span seconds. Despite these difficulties,
optical flow signals have been grouped into regional blobs
and used successfully for gesture recognition [9].

An alternative approach was proposed in [13] where
successive layering of image silhouettes is used to
represent patterns of motions. Every time a new frame
arrives, the existing silhouettes are decreased in value
subject to some threshold and the new silhouette (if any)
is overlaid at maximal brightness. This layered motion
image is termed a Motion History Image (MHI). MHI
representations have the advantage that a range of times
from frame to frame to several seconds may be encoded in
a single image. Thus MHIs span the time scales of human
gestures.
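The layering step described above can be sketched in a few lines. This is only an illustration of the classic (frame-count based) MHI, not code from [13]: the function name `update_mhi`, the brightness ceiling `max_value` and the per-frame `decay` are assumed values chosen for the example, and images are plain nested lists so the sketch runs without third-party packages.

```python
def update_mhi(mhi, silhouette, max_value=255, decay=16):
    """One classic MHI step (hypothetical parameters).

    Pixels covered by the new silhouette are set to maximal brightness;
    all other pixels fade by `decay`, with a floor at zero.
    """
    for y, row in enumerate(silhouette):
        for x, on in enumerate(row):
            if on:
                mhi[y][x] = max_value
            else:
                mhi[y][x] = max(0, mhi[y][x] - decay)
    return mhi


# A one-row "image" with a silhouette sweeping from left to right.
mhi = [[0, 0, 0]]
for sil in ([[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]):
    update_mhi(mhi, sil)
print(mhi)
```

Because the fade is applied once per frame, the brightness ramp here depends on the frame rate, which is the limitation the timed variant introduced in this paper is meant to remove.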

In this paper, we generalize the Motion History Image to
directly encode actual time in a floating point format
which we call the timed Motion History Image (tMHI).
We take Hu Moment shape descriptors [19] of the current
silhouette to recognize pose. A gradient of the tMHI is
used to determine normal optical flow (i.e. motion flow
orthogonal to object boundaries). The motion is then
segmented relative to object boundaries and the motion
orientation and magnitude of each region is obtained. The
processing flow is summarized in Figure 1, where numbers
indicate the section in which each processing step is described.
The end result is recognized pose, and motion to that pose
-- a general “tool” for use in object motion analysis or
gesture recognition. Section 5 compares the
computational advantages of our approach with the optical
flow approaches such as used in [9]. We use our
approach in section 6 to recognize walking, waving and
clapping motions to control musical synthesis.


2. Pose and Motion Representation

2.1. Silhouettes and Pose Recognition
The algorithm as shown in Figure 1 depends on generating
silhouettes of the object of interest. Almost any silhouette
generation method can be used. Possible methods of
silhouette generation include stereo disparity or stereo
depth subtraction [3], infrared back-lighting [12], frame
differencing [13], color histogram back-projection [6],
texture blob segmentation, range imagery foreground
segmentation etc. We chose a simple background
subtraction method for the purposes of this paper as
described below.

2.1.1. Silhouette Generation
Although there is recent work on more sophisticated
methods of background subtraction [[14][18][21]], we use
a simplistic method here. We label as foreground those
pixels that are a set number of standard deviations from
the mean RGB background. Then a pixel dilation and
region growing method is applied to remove noise and
extract the silhouette. A limitation of using silhouettes is
that no motion inside the body region can be seen. For
example, a silhouette generated from a camera facing a
person would not show the hands moving in front of the
body. One possibility to help overcome this problem is to
simultaneously use multiple camera views. Another
approach would be to separately segment flesh-colored
regions and overlay them when they cross the foreground
silhouette.
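A minimal sketch of the background subtraction described above follows, assuming a short burst of background-only frames is available for estimating the per-pixel RGB mean and standard deviation. The names `background_model`, `make_silhouette` and `dilate`, the threshold `k` and the noise floor `min_sd` are illustrative assumptions, not values from the paper; the region growing step is omitted.

```python
from statistics import mean, pstdev


def background_model(frames):
    """Per-pixel, per-channel mean and standard deviation.

    frames: list of images, each a list of rows of (r, g, b) tuples.
    """
    h, w = len(frames[0]), len(frames[0][0])
    mu = [[[mean(f[y][x][c] for f in frames) for c in range(3)]
           for x in range(w)] for y in range(h)]
    sd = [[[pstdev([f[y][x][c] for f in frames]) for c in range(3)]
           for x in range(w)] for y in range(h)]
    return mu, sd


def make_silhouette(frame, mu, sd, k=3.0, min_sd=1.0):
    """Mark a pixel as foreground (1) if any channel lies more than
    k standard deviations from the background mean (k, min_sd assumed)."""
    h, w = len(frame), len(frame[0])
    return [[int(any(abs(frame[y][x][c] - mu[y][x][c])
                     > k * max(sd[y][x][c], min_sd) for c in range(3)))
             for x in range(w)] for y in range(h)]


def dilate(mask):
    """3x3 binary dilation: every foreground pixel also sets its 8 neighbours."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for ny in range(max(0, y - 1), min(h, y + 2)):
                    for nx in range(max(0, x - 1), min(w, x + 2)):
                        out[ny][nx] = 1
    return out
```

The `min_sd` floor is a design choice for the sketch: without it, a pixel whose background samples happen to be identical would have zero deviation and any sensor noise would be labelled foreground.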


2.1.2. Mahalanobis Match to Hu Moments of Silhouette Pose

For recognition of silhouette pose, seven higher-order Hu
moments [19] provide shape descriptors that are invariant
to translation and scale. Since these moments are of
different orders, we must use the Mahalanobis distance
metric [26] for matching based on a statistical measure of
closeness to training examples:

$$\mathrm{mahal}(x) = (x - m)^{T} K^{-1} (x - m) \qquad (1)$$

where x is the moment feature vector, m is the mean of
the training moment vectors, and $K^{-1}$ is the inverse
covariance matrix for the training vectors. The
discriminatory power of these moment features for the
silhouette poses is indicated by a short example. For this
example, the training set consisted of five people each
performing five repetitions of three gestural poses (“Y”, “T”,
and “Left Arm”), shown in Figure 2. A sixth
person who had not practiced the gestures was brought in
to perform the gestures.
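To make Equation (1) concrete, the sketch below evaluates the distance for a test vector against the mean and covariance of a set of training vectors. It is an illustration only: the helper names are assumptions, the Hu moment computation itself is not shown, and the example uses 2-D feature vectors rather than seven moments to keep the arithmetic easy to check. Instead of forming the inverse covariance explicitly, it solves K y = (x - m) by Gaussian elimination, which gives the same value; a singular covariance matrix (too few training samples) is not handled.

```python
def mean_vector(samples):
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]


def covariance(samples, m):
    """Unbiased sample covariance matrix K of the training vectors."""
    n, d = len(samples), len(m)
    return [[sum((s[i] - m[i]) * (s[j] - m[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        y[r] = (M[r][n] - sum(M[r][c] * y[c] for c in range(r + 1, n))) / M[r][r]
    return y


def mahal(x, m, K):
    """(x - m)^T K^{-1} (x - m), as in Equation (1)."""
    d = [xi - mi for xi, mi in zip(x, m)]
    return sum(di * yi for di, yi in zip(d, solve(K, d)))


# Hypothetical 2-D training vectors for one pose model.
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
m = mean_vector(train)
K = covariance(train, m)
print(mahal((3.0, 1.0), m, K))
```

In use, one mean and covariance would be stored per trained pose, and a test silhouette would be assigned to the pose with the smallest distance, subject to a threshold.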

Table 1 shows typical results for pose discrimination. We
can see that even the confusable poses “Y” and “T” are
separated by more than an order of magnitude making it
easy to set thresholds to recognize test poses against
trained model poses.

An alternative approach to pose recognition uses gradient
histograms of the segmented silhouette region [5].



Table 1. Pose recognition results. The distance to the correct model is far smaller than the distance to incorrectly matched models.


2.2. timed Motion History Images (tMHI)
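In this paper, we use a floating point Motion History
Image [10] where new silhouette values are copied in with
a floating point timestamp in the format:
seconds.milliseconds. This MHI representation is updated
as follows:

$$\mathrm{tMHI}_{\delta}(x,y) = \begin{cases} \tau & \text{if the current silhouette covers } (x,y) \\ 0 & \text{else if } \mathrm{tMHI}_{\delta}(x,y) < (\tau - \delta) \end{cases} \qquad (2)$$

where τ is the current time-stamp, and δ is the maximum
time duration constant (typically a few seconds)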
associated with the template. This method makes our
representation independent of system speed or frame rate
(within limits) so that a given gesture will cover the same
MHI area at different capture rates. We call this
representation the timed Motion History Image (tMHI).
Figure 3 shows a schematic representation for a person
doing an upward arm movement.
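The tMHI update can be sketched as follows, under the reading that pixels inside the current silhouette receive the current timestamp, pixels whose stored timestamp is older than the duration window are cleared, and all other pixels keep their value. The function and parameter names are assumptions made for the example.

```python
def update_tmhi(tmhi, silhouette, timestamp, duration):
    """One tMHI step on nested lists of floats (timestamps in seconds).

    timestamp: current time; duration: maximum age kept in the template.
    """
    for y, row in enumerate(silhouette):
        for x, on in enumerate(row):
            if on:
                tmhi[y][x] = timestamp          # newest layer
            elif tmhi[y][x] < timestamp - duration:
                tmhi[y][x] = 0.0                # too old: forget it
    return tmhi


# A silhouette sweeping right, sampled at irregular times.
tmhi = [[0.0, 0.0, 0.0]]
update_tmhi(tmhi, [[1, 0, 0]], timestamp=1.0, duration=1.0)
update_tmhi(tmhi, [[0, 1, 0]], timestamp=1.5, duration=1.0)
update_tmhi(tmhi, [[0, 0, 1]], timestamp=2.2, duration=1.0)
print(tmhi)
```

Since the stored values are times rather than frame counts, dropping or adding frames changes how finely the ramp is sampled but not the values written for a given moment.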



2.3. Motion History Gradients
Notice in the right image in Figure 3 (tMHI) that if we
took the gradient of the tMHI, we would get direction
vectors pointing in the direction of the movement of the
arm. Note that these gradient vectors will point
orthogonal to the moving object boundaries at each “step”
in the tMHI giving us a normal optical flow representation
(see middle left image, Figure 4). Gradients of the tMHI
can be calculated efficiently by convolution with
separable Sobel filters in the X and Y directions yielding
the spatial derivatives $F_x(x,y)$ and $F_y(x,y)$. Gradient
orientation at each pixel is then:

$$\phi(x,y) = \arctan \frac{F_y(x,y)}{F_x(x,y)} \qquad (3)$$

We must be careful, though, when calculating the gradient
information because it is only valid at locations within the
tMHI. The surrounding boundary of the tMHI should not
be used because non-silhouette (zero valued) pixels would
be included in the gradient calculation, thus corrupting the
result. Only tMHI interior silhouette pixels should be
examined. Additionally, we must not use gradients of
tMHI pixels that have a contrast which is too low (inside a
silhouette) or too high (large temporal disparity) in their
local neighborhood. Figure 4 (center left) shows raw tMHI
gradients. Applying the above criteria to the raw gradients
yields a masked region of valid gradients in Figure 4
(center right).
After calculating the motion gradients, we can then extract
motion features to varying scales. For instance, we can
generate a radial histogram of the motion orientations
which then can be used directly for recognition as done in
[10]. But an even simpler measure is to find the global
motion orientation as discussed next.
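The gradient and masking rules of this section might be sketched as below. Several details are assumptions for illustration: a 3x3 Sobel kernel, a 3x3 neighborhood for the contrast test, the thresholds `min_delta` and `max_delta` (in seconds), and the use of `atan2` instead of a plain arctangent so that the full 0 to 360 degree range is resolved. Angles follow image coordinates, with y increasing downward.

```python
import math


def motion_gradient(tmhi, min_delta, max_delta):
    """Return a map of gradient orientations in degrees, or None where
    the gradient is not considered valid."""
    h, w = len(tmhi), len(tmhi[0])
    orient = [[None] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # 3x3 neighborhood, row-major: nb[0] is top-left, nb[8] bottom-right.
            nb = [tmhi[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            if min(nb) <= 0:
                continue  # touches non-silhouette (zero) pixels
            contrast = max(nb) - min(nb)
            if not (min_delta <= contrast <= max_delta):
                continue  # flat interior or too large a time jump
            # Sobel responses: right column minus left, bottom row minus top.
            fx = (nb[2] + 2 * nb[5] + nb[8]) - (nb[0] + 2 * nb[3] + nb[6])
            fy = (nb[6] + 2 * nb[7] + nb[8]) - (nb[0] + 2 * nb[1] + nb[2])
            orient[y][x] = math.degrees(math.atan2(fy, fx)) % 360.0
    return orient


# Timestamps increase to the right, so the motion direction is 0 degrees.
ramp = [[1.0, 1.1, 1.2, 1.3, 1.4] for _ in range(3)]
print(motion_gradient(ramp, min_delta=0.05, max_delta=0.5)[1])
```

The gradient points toward larger (more recent) timestamps, which is why it can be read as the direction of movement.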



3. Global Gradient Orientation
Calculation of the global orientation should be weighted
by normalized tMHI values to give more influence to the
most current motion within the template. A simple
calculation for the global weighted orientation is as
follows:

$$\phi = \phi_{ref} + \frac{\sum_{x,y} \mathrm{angDiff}(\phi(x,y), \phi_{ref}) \times \mathrm{norm}(\tau, \delta, \mathrm{tMHI}_{\delta}(x,y))}{\sum_{x,y} \mathrm{norm}(\tau, \delta, \mathrm{tMHI}_{\delta}(x,y))} \qquad (4)$$

where φ is the global motion orientation, φ_ref is the base
reference angle (peaked value in the histogram of
orientations), φ(x,y) is the motion orientation map found
from gradient convolutions, norm(τ, δ, tMHI_δ(x,y))
is a normalized tMHI value (linearly normalizing the
tMHI from 0-1 using the current time-stamp τ and
duration δ), and angDiff(φ(x,y), φ_ref) is the
minimum, signed angular difference of an orientation
from the reference angle. A histogram-based reference
angle (φ_ref) is required due to problems associated with
averaging circular distance measurements. Figure 4 shows
from left to right a tMHI, the raw gradients, the masked
region of valid gradients and finally the orientation
histogram with global direction vector calculated.
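A small sketch of the weighted global orientation follows. It assumes an orientation map in degrees with `None` marking invalid pixels, a 12-bin histogram whose peak bin center serves as the reference angle, and a linear weight of (tMHI - (timestamp - duration)) / duration clamped to the range 0 to 1. These choices, and all names, are assumptions for illustration.

```python
def ang_diff(a, ref):
    """Minimum signed angular difference a - ref, in degrees, in [-180, 180)."""
    return (a - ref + 180.0) % 360.0 - 180.0


def global_orientation(orient, tmhi, timestamp, duration, bins=12):
    # Collect (angle, weight) pairs for every valid pixel.
    samples = []
    for y, row in enumerate(orient):
        for x, a in enumerate(row):
            if a is not None:
                wgt = (tmhi[y][x] - (timestamp - duration)) / duration
                samples.append((a, max(0.0, min(1.0, wgt))))
    if not samples:
        return None
    # Reference angle: center of the most populated histogram bin.
    width = 360.0 / bins
    hist = [0] * bins
    for a, _ in samples:
        hist[int(a // width) % bins] += 1
    ref = (hist.index(max(hist)) + 0.5) * width
    total = sum(wgt for _, wgt in samples)
    if total == 0:
        return ref
    shift = sum(ang_diff(a, ref) * wgt for a, wgt in samples) / total
    return (ref + shift) % 360.0


# Two equally recent orientations on either side of 0 degrees.
orient = [[350.0, 10.0, None]]
tmhi = [[2.0, 2.0, 0.0]]
print(global_orientation(orient, tmhi, timestamp=2.0, duration=1.0))
```

The example shows why the reference angle is needed: a plain arithmetic mean of 350 and 10 degrees would give 180, the opposite direction, whereas averaging signed differences from the reference recovers a direction near 0.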

4. Motion Segmentation
Any segmentation scheme begs the question as to what is
being segmented. Segmentation by collecting “blobs” of
similar direction motion collected frame to frame from
optical flow as done in [9] doesn’t guarantee that the
motion corresponds to the actual movement of objects in a
scene. We want to group motion regions that were
produced by the movement of parts or the whole of the
object of interest. A novel modification to the tMHI
gradient algorithm has an advantage in this regard: by
labeling motion regions connected to the current
silhouette using a downward stepping floodfill, we can
identify areas of motion directly attached to parts of the
object of interest.

4.1. Motion Attached to Object
By construction, the most recent silhouette has the
maximal values (e.g. most recent timestamp) in the tMHI.
We scan the image until we find this value, then “walk”
along the most recent silhouette’s contour to find attached
areas of motion. Below, let dT be a time difference
threshold, for example, the time difference between each
video frame. The algorithm for creating masks to segment
motion regions is as follows (with reference to Figure 6):
(1) Scan the tMHI until we find a pixel of the current
timestamp. This is a boundary pixel of the most
recent silhouette (Figure 6b).
(2) “Walk” around the boundary of the current silhouette
region looking outside for recent (within dT)
unmarked motion history “steps”. When a suitable
step is found, mark it with a downward floodfill
(Figure 6b). If the size of the fill isn’t big enough,
zero out the area.
(3) Store the segmented motion masks that were found.
(4) Continue the boundary “walk” until the silhouette has
been circumnavigated.

In the algorithm above, “downfill” refers to floodfills that
will fill (replace with a labeled value) pixels with the same
value, OR pixels of a value one step (within dT) lower
than the current pixel being filled. The segmentation
algorithm then relies on 2 parameters: (1) The maximum
allowable downward step distance dT (e.g. how far back
in time can a past motion be considered to be connected to
the current silhouette); (2) The minimum acceptable size
of the downward flood fill (else zero it out because the
region is too small -- a motion “noise” region).
The algorithm above produces segmentation masks that
are used to select portions of the valid motion history
gradient described in Section 2.3. These segmented
regions may then be labeled with their weighted regional
orientation as described in Section 3. Since these
segmentation masks derive directly from past motion that
“spilled” from the current silhouette boundary of the
object, the motion regions are directly connected to the
object itself. We give segmentation examples in the
section below.
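The segmentation procedure of this section can be sketched as follows. This is a simplified illustration rather than the authors' implementation: instead of walking the silhouette contour, it visits every pixel carrying the current timestamp and inspects its 4-neighbours, which finds the same attached steps at higher cost. Rejected (too small) regions are marked with -1 in the label map so they are not filled again. All names are assumptions.

```python
from collections import deque


def _neighbours(y, x, h, w):
    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
        if 0 <= ny < h and 0 <= nx < w:
            yield ny, nx


def down_fill(tmhi, labels, seed, label, dT):
    """Flood fill that only moves to pixels of equal timestamp or of a
    timestamp at most dT older than the pixel being expanded."""
    h, w = len(tmhi), len(tmhi[0])
    labels[seed[0]][seed[1]] = label
    region, queue = [seed], deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in _neighbours(y, x, h, w):
            if labels[ny][nx] == 0 and tmhi[ny][nx] > 0:
                step = tmhi[y][x] - tmhi[ny][nx]
                if 0 <= step <= dT:
                    labels[ny][nx] = label
                    region.append((ny, nx))
                    queue.append((ny, nx))
    return region


def segment_motion(tmhi, timestamp, dT, min_size):
    """Return a list of motion regions (lists of (y, x)) attached to the
    most recent silhouette."""
    h, w = len(tmhi), len(tmhi[0])
    labels = [[0] * w for _ in range(h)]
    masks = []
    for y in range(h):
        for x in range(w):
            if tmhi[y][x] != timestamp:
                continue  # not part of the current silhouette
            for ny, nx in _neighbours(y, x, h, w):
                age = timestamp - tmhi[ny][nx]
                if labels[ny][nx] == 0 and tmhi[ny][nx] > 0 and 0 < age <= dT:
                    region = down_fill(tmhi, labels, (ny, nx), len(masks) + 1, dT)
                    if len(region) >= min_size:
                        masks.append(region)
                    else:
                        for ry, rx in region:
                            labels[ry][rx] = -1  # motion "noise"
    return masks


# One row: an older trail (8.5, 9.0, 9.5) leading into the current pixel (10.0),
# plus an unrelated old pixel (5.0) that is not attached to it.
tmhi = [[0.0, 8.5, 9.0, 9.5, 10.0, 0.0, 5.0]]
print(segment_motion(tmhi, timestamp=10.0, dT=0.6, min_size=2))
```

Each returned region can then be used as a mask over the valid gradient orientations to compute a per-region weighted orientation.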




4.2. Motion Segmentation Examples
Figure 8 shows a hand opening and closing in front of a
camera. Note that the small arrows correctly catch the
finger motion while the global motion is ambiguous.


Figure 9 shows a kicking motion from left to right. At left,
hands had just been brought down as indicated by the
large global motion arrow. The small segmentation arrow
is already catching the leftward lean of the body at right.
In the center left image, the left leg lean and right leg
motion are detected. At center right, the left hand motion
and right leg are indicated. At right, the downward leg
motion and rightward lean of the body are found.


Figure 10 shows segmented motion and recognized pose
for lifting the arms into a “T” position and then dropping
the arms back down. The large arrow indicates global
motion over a few seconds; the smaller arrows show
segmented motion as long as the corresponding silhouette
region moved less than 0.2 seconds ago.



