View Invariant Gait Recognition Using Only One Uniform Model: Paper Translation and Notes


The paper is quoted section by section, with my own notes added along the way.

The advantage stated in the paper: "The unique advantage is that it can extract view invariant feature from any view using only one model."

II. VIEW INVARIANT GAIT FEATURE EXTRACTION

In gait recognition, when the angle between the walking direction and the camera is 90° (the side view), it is the best view for gait recognition because it carries more dynamic information. We try to transform the gait data from any view to the side view using one uniform non-linear model, and then extract the view invariant feature. The proposed model is inspired by the one in [14], where an auto-encoder based model named Stacked Progressive Auto-Encoders (SPAE) is proposed to deal with multi-view face recognition. We apply it to multi-view gait recognition. The framework for view invariant feature extraction is illustrated in Fig. 1. The model is described in the following subsections.


[Fig. 1: the framework for view invariant feature extraction]

B. Auto-Encoder for Gait View Transformation

[Fig. 3] (a) Schematic diagram of the auto-encoder; (b) a larger angle change is much harder for one auto-encoder to handle. It is much more difficult for one auto-encoder to transform a 54° image to a 90° image than to transform a 72° image to a 90° image, but we can gradually transform the 54° image to a 72° image with one auto-encoder and then the 72° image to a 90° image with another auto-encoder, which is much easier.


The auto-encoder [16] is one of the popular models of recent years and can be used to extract compact features. As shown in Fig. 3(a), an auto-encoder usually contains three layers: one input layer, one hidden layer and one output layer. An auto-encoder has two parts, an encoder and a decoder. The encoder transforms the input data into a new representation in the hidden layer. It usually consists of a linear transformation and a nonlinear transformation as follows:

$$f(x) = s(Wx + b)$$

where $f(\cdot)$ denotes the encoder, $W$ denotes the linear transformation, $b$ denotes the bias and $s(\cdot)$ is the nonlinear transformation, also called the activation function, such as:

$$s(t) = \frac{1}{1 + e^{-t}}$$
The decoder transforms the hidden layer representation back to the input data as follows:

$$x' = g(f(x)) = s\left(W' f(x) + b'\right)$$
where $g(\cdot)$ denotes the decoder, $W'$ and $b'$ denote the linear transformation and bias in the decoder, and $x'$ is the output data. We usually use the least squares error as the cost function to optimize the parameters $W$, $b$, $W'$ and $b'$:

$$\min_{W,\, b,\, W',\, b'} \sum_{i=1}^{N} \left\| x_i' - x_i \right\|^2$$
where $x_i$ denotes the $i$-th of the $N$ training samples and $x_i'$ is the corresponding output for $x_i$. The traditional auto-encoder reconstructs its input, but if we replace the target output with data that differs from the input, the whole auto-encoder can be regarded as a regression function. However, it would be very hard for just one auto-encoder to deal with a large angle change.

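To make the regression view concrete, here is a minimal NumPy sketch of a single auto-encoder used as a regressor. This is not the authors' code; the image size, hidden width and sigmoid activation are my assumptions for illustration:

```python
import numpy as np

def sigmoid(t):
    # the activation s(t) from above
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
D, H = 64 * 64, 1024                 # flattened GEI size and hidden width (assumed)
W,  b  = rng.normal(0, 0.01, (H, D)), np.zeros(H)   # encoder parameters
Wp, bp = rng.normal(0, 0.01, (D, H)), np.zeros(D)   # decoder parameters

def encode(x):
    return sigmoid(W @ x + b)        # f(x) = s(Wx + b)

def decode(h):
    return sigmoid(Wp @ h + bp)      # g(h) = s(W'h + b')

def loss(x, y):
    # least squares error; with y = x this is a plain auto-encoder,
    # with y = the same gait at an adjacent view it becomes a regressor
    return np.sum((decode(encode(x)) - y) ** 2)
```

Training would simply minimize this loss over all input/target pairs with gradient descent.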

As shown in Fig. 3(b), the difference between the 54° image and the 90° image is much larger than that between the 72° image and the 90° image, especially in the leg part. It would be very difficult for just one auto-encoder to transform a 54° image to a 90° image. But if we use one auto-encoder to transform the 54° image to the 72° one, and then use another auto-encoder to transform the 72° image to the 90° image, it is much easier. So we may need more than one auto-encoder to deal with the multi-view challenge.


C. Stacked Progressive Auto-Encoders (SPAE)

In [14], the authors stacked several auto-encoders together to deal with the multi-view problem in face recognition. In model training, the output is synthesized in a progressive way. We stack auto-encoders in a similar manner for the multi-view challenge in gait recognition. In gait recognition, the side view contains more dynamic information about the gait, so we try to convert all the gait energy images to the side view. Since it is difficult for one auto-encoder to handle all the views, each auto-encoder converts the gait energy images at a larger view to an adjacent smaller view, while the gait energy images at smaller views are kept unchanged. After several auto-encoders, all the images gradually become side view images as shown in Fig. 4, which is very helpful for improving the accuracy of gait recognition. It is assumed that there are 2L + 1 views in the dataset. The difference between adjacent angles is Δ = 18° and L = 5. The view angles of the gait data are {0°, 18°, ..., 180°}. The auto-encoder in the first layer maps the gait images at 0° to 18°, and the gait images at 180° to 162°, while keeping the gait images from 18° to 162° unchanged. The auto-encoder in the second layer then maps gait images at views smaller than 36° to 36°, and at views larger than 144° to 144°. The last layer maps all the images to 90° while keeping the images already at 90° unchanged. Fig. 4 shows a schematic view of the progressive training phase.

(Note: each layer is trained on a different subset of the data but has the same structure. Each layer is first trained on the data from its specific view angles, and then all the layers are stacked together and fine-tuned on all the data.)
[Fig. 4: schematic view of the progressive training phase]
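The progressive targets can be summarized as a clamp toward 90°. The following sketch is my reading of the mapping rule, not the authors' code; the 18° grid and L = 5 follow the text:

```python
DELTA, SIDE, L = 18, 90, 5

def target_view(view, layer):
    """Target view angle for the auto-encoder at layer `layer` (1-based)."""
    lo = SIDE - (L - layer) * DELTA
    hi = SIDE + (L - layer) * DELTA
    return min(max(view, lo), hi)    # clamp one step closer to the side view

# layer 1 maps 0 -> 18 and 180 -> 162, keeps 18..162 unchanged:
assert target_view(0, 1) == 18 and target_view(180, 1) == 162
assert target_view(54, 1) == 54
# layer 5 (the last layer) maps every view to 90:
assert all(target_view(v, L) == 90 for v in range(0, 181, 18))
```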

We train each layer individually, and the output of each layer is the input of the next layer. After training all the auto-encoders, the whole network is fine-tuned by optimizing all layers together as below, where $j$ denotes the $j$-th of all $L$ layers.

$$\min \sum_{i=1}^{N} \left\| g\left( f_L\left( f_{L-1}\left( \cdots f_1(x_i) \right) \right) \right) - y_i \right\|^2$$
(It feels like this objective should pair each $f(\cdot)$ with its own $g(\cdot)$, but in this formula only the last layer's $g(\cdot)$ is used. Perhaps this is not what the authors intended to express, or am I misunderstanding it?)
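For completeness, here is how I read the fine-tuning objective in code form: all encoders are chained and only the final decoder maps back to image space. This is my interpretation under the note above, not a confirmed detail of the paper:

```python
import numpy as np

def fine_tune_loss(x, y, encoders, last_decoder):
    """encoders: list of the L trained f_j; last_decoder: the final g.

    x is an input GEI, y its side view (90 degree) target image.
    """
    h = x
    for f in encoders:               # h = f_L(...f_1(x))
        h = f(h)
    return np.sum((last_decoder(h) - y) ** 2)
```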

D. Feature Extraction

As the view variations become smaller layer by layer, the output of the topmost layer $f_L$ should be the synthesized side view feature, and it is robust to view variation. But the lower layers should also contain some view invariant information. So we accumulate the representations in multiple hidden layers in descending order as follows:

$$F_i = \left[ f_L;\; f_{L-1};\; \cdots;\; f_{L-i} \right]$$

where $0 \leqslant i \leq L$. We use it combined with Principal Component Analysis (PCA) for feature extraction. There are only 4 samples from each subject, and experimental results also show that LDA cannot obviously improve the recognition rate. Considering the computational cost of LDA, we do not use LDA as is done in [14].

[Fig. 5: recognition rates of different layer combinations]

The results of the different combinations are shown in Fig. 5. From the results we find that the feature consisting of the last two layers, the 4th and 5th ones, achieves the highest recognition rate. So we take the outputs of the last two layers and concatenate them into one vector as the view invariant gait feature.

(I was long unsure what the "4th and 5th layers" referred to; comparing against Fig. 4, they are simply the top two layers.)
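Putting this together, a minimal sketch of the feature extraction step might look as follows. The PCA dimensionality is a placeholder, and `encoders` stands for the trained f_j from the earlier sketches:

```python
import numpy as np
from sklearn.decomposition import PCA

def gait_feature(x, encoders):
    """Concatenate the hidden outputs of the last two layers (f_4 and f_5)."""
    outputs, h = [], x
    for f in encoders:
        h = f(h)
        outputs.append(h)
    return np.concatenate(outputs[-2:])

# features: one row of concatenated outputs per training sample
# pca = PCA(n_components=128).fit(features)    # 128 is an assumed dimension
# compact = pca.transform(features)
```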

III. EXPERIMENTS AND ANALYSIS

A. Dataset

CASIA B gait dataset [15] is one of the largest public gait databases, created by the Institute of Automation, Chinese Academy of Sciences in January 2005. It consists of 124 subjects (31 females and 93 males) captured from 11 views. The view range is from 0° to 180°, with an 18° interval between two nearest views. There are 10 sequences for each subject: 6 sequences of normal walking ("nm"), 2 sequences of walking with a bag ("bag") and 2 sequences of walking in a coat ("cl"). Fig. 6 shows the samples at the 11 views from one subject walking normally.
[Fig. 6: samples at the 11 views from one subject, normal walking]


B. Experiments design

In this work, we mainly focus on the view variation challenge in gait recognition, so we only use the normal walking data. We put the first 62 subjects into the training set and the remaining 62 subjects into the test set. In the test set, the first 4 normal walking sequences of each subject are put into the gallery set and the others into the probe set. The experiment design is listed in Table I.
[Table I: experiment design]
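In code, the split of Table I might look like the sketch below. The nm-XX sequence naming follows the common CASIA-B convention and is an assumption on my part:

```python
subjects = list(range(1, 125))       # 124 subjects in CASIA B
train_subjects = subjects[:62]       # first 62 for training
test_subjects  = subjects[62:]       # remaining 62 for testing

# within the test set: nm-01..nm-04 -> gallery, nm-05 and nm-06 -> probe
gallery = [(s, f"nm-{k:02d}") for s in test_subjects for k in (1, 2, 3, 4)]
probe   = [(s, f"nm-{k:02d}") for s in test_subjects for k in (5, 6)]
```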

C. Results and analysis

We use the GEI image as our input and try to extract a more robust feature to deal with the multi-view challenge, so we first compare our method with GEI+PCA [4]. The experiment design of the gallery set and probe set for GEI+PCA is exactly the same as our design in Table I. Fig. 7 shows the comparison of the recognition rate with GEI+PCA at each probe angle. For each probe angle, we test on the gallery angles from the other 10 views, excluding the corresponding probe view. As illustrated in Fig. 7, our method performs much better than GEI+PCA at each probe angle, and is more than 60% higher than GEI+PCA when the probe angle is 18° and the gallery angle is 54°.
[Fig. 7: comparison of recognition rate with GEI+PCA at each probe angle]
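The cross-view evaluation itself is a nearest neighbour search over the extracted features. A minimal sketch of the rank-1 accuracy computation, assuming the feature matrices have already been extracted, might be:

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """probe_feats/gallery_feats: (n, d) arrays; *_ids: subject labels."""
    hits = 0
    for feat, pid in zip(probe_feats, probe_ids):
        dists = np.linalg.norm(gallery_feats - feat, axis=1)  # Euclidean distance
        hits += int(gallery_ids[int(np.argmin(dists))] == pid)
    return hits / len(probe_ids)
```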
