[Paper Notes] ASTPN: Attentive Spatial-Temporal Pooling Networks


魏大明白


Category: Paper Notes. Tags: computer vision, neural networks

Article link: https://blog.csdn.net/qq_37747189/article/details/109982706


Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification


Abstract

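Person re-identification (person re-id) is an important topic with applications in visual surveillance and human-computer interaction. In this work, we propose a novel jointly Attentive Spatial-Temporal Pooling Network (ASTPN) for video-based person re-identification. It makes the feature extractor aware of the current input video sequences, so that the interdependence between the matching items can directly influence the computation of each other's representations. Specifically, the spatial pooling layer selects regions from each frame, while the attentive temporal pooling selects informative frames from the sequence. Both pooling steps are guided by information from distance matching.
The paper also analyzes how joint pooling over both dimensions improves person re-id performance more than using either one alone.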

Introduction

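Person re-identification: given a query image, the task is to identify a set of matching person images from a pool. These images are usually captured by the same or different cameras, from different viewpoints, at the same or different times. Large variations in lighting conditions, viewing angle, body pose, and occlusion make this a very challenging task.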

Much work has been done on person re-identification, but most of it targets still images, for example:
feature representation learning, distance metric learning, and CNN-based schemes.

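Learning person re-identification from video sequences, however, provides additional temporal information and leads to better performance. An image sequence offers rich samples of a person's appearance, which yields more discriminative features and helps re-identification. The earliest models to perform well on video were CNN-RNN models, along with methods that use a distance function to judge how well two sequences match. However, most of these methods derive the representation of each sequence separately and rarely consider the influence of the other sequence, so they ignore the mutual influence of the two video sequences in the context of the matching task.
Consider how human visual processing works when comparing video sequences.
[Figure not preserved in the source]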

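For example, in the figure above, when video sequence a is compared with two other sequences b and c, which differ from each other, the brain naturally focuses on different frames of a in each comparison. The interaction between the compared sequences should also affect the spatial dimension, which matters especially when viewpoint changes are large or the subject moves quickly. This example shows why different attention should be applied when comparing different pairs of video sequences (which is the motivation for the attention mechanism).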

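Inspired by the recent success of attention models [1, 31, 34, 5], we propose the jointly Attentive Spatial-Temporal Pooling Network (ASTPN), a powerful mechanism for learning video sequence representations by taking the interdependence between sequences into account. Specifically, ASTPN first learns a similarity measure over the features extracted from the two inputs by recurrent convolutional networks. It then uses the similarity scores between features to compute attention vectors in both the spatial dimension (regions in each frame) and the temporal dimension (frames over the sequence). Next, the attention vectors are used to perform pooling. Finally, a Siamese network architecture is deployed over the attention vectors. The proposed architecture can be trained effectively end to end.
[Figure not preserved in the source]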

Model Architecture

1. Spatial Pooling Layer

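[Figure not preserved in the source]
This layer is designed to:
1) generate multi-scale region patches from each image and feed them into the RNN / attentive pooling layer;
2) make the model robust to image sequences of arbitrary resolution and length.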

As shown in the figure above:
Parameter table of the 3-layer CNN (table not preserved in the source)

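The feature map produced by the 3-layer CNN (why 3?) is passed through an SPP (spatial pyramid pooling) layer to obtain the image-level representation. The SPP layer uses spatial bins at multiple levels to produce multi-level spatial features, which are finally fused into a single fixed-length feature (containing the person's position and multi-scale spatial information). This is also why the attentive spatial pooling mechanism can select regions from each frame.
I won't type out the formulas; here is the figure instead:
[Figure not preserved in the source]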
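To make the fixed-length property concrete, here is a minimal NumPy sketch of spatial pyramid pooling. It is an illustration under assumptions, not the paper's implementation: the function name `spatial_pyramid_pool`, the pyramid levels `(4, 2, 1)`, the bin-boundary rule, and the 32-channel feature maps are all chosen for the example.

```python
import math
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(4, 2, 1)):
    """Max-pool a (C, H, W) feature map into one fixed-length vector.

    For each pyramid level n, the map is split into an n x n grid of bins
    and every bin is max-pooled per channel. The output length is
    C * sum(n * n for n in levels), independent of H and W.
    The levels (4, 2, 1) are an assumption made for this sketch.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i: floor for the start, ceil for the end,
            # so bins cover the whole map and are never empty.
            r0 = (i * height) // n
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = (j * width) // n
                c1 = math.ceil((j + 1) * width / n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

# Two feature maps with different spatial sizes give vectors of equal length.
rng = np.random.default_rng(0)
small = spatial_pyramid_pool(rng.standard_normal((32, 13, 5)))
large = spatial_pyramid_pool(rng.standard_normal((32, 20, 9)))
print(small.shape, large.shape)  # (672,) (672,)
```

With 32 channels and levels (4, 2, 1), every frame maps to 32 × (16 + 4 + 1) = 672 values regardless of its resolution, which is what lets the later recurrent layer accept frames of any size.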

2. Attentive Temporal Pooling Layer

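Although the recurrent layer can capture hidden temporal information, when the sequence changes little between frames it absorbs redundant information such as blurry background and clothing. (The existing approach falls short, so the proposed one has value. Fair enough.) The added attentive temporal pooling layer strengthens the link between the input (probe) sequence and the target (gallery) sequence, allowing the input $I_{P}$ to directly influence the target feature $v_{g}$.
[Figure not preserved in the source]
A rough explanation: the input is the fixed-size feature obtained above, which carries the person's position and multi-scale spatial information. The probe and gallery features each pass through an RNN, and the outputs are two matrices in which the i-th row corresponds to the i-th time step of the original sequence.
The two matrices are multiplied with a parameter matrix and passed through an activation function to give the fused matrix A. Max pooling over each row of A gives the importance score of the i-th frame of the probe sequence, and max pooling over each column of A gives the importance score of the j-th frame of the gallery sequence.
Finally, a softmax is applied to each of the two score vectors, and the result is multiplied (dot product) with the corresponding matrix to obtain the final attention vector. The two vectors are then fed to the Siamese network to compute a distance.
The formulas:
[Figure not preserved in the source]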
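The steps described above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: the names `P`, `G`, `U`, and `attentive_temporal_pool`, the use of tanh as the activation, the random inputs, and the Euclidean distance at the end are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_temporal_pool(P, G, U):
    """Pool two sequences of RNN outputs with mutual attention.

    P: (T_p, N) probe matrix, row i is the RNN output at time step i.
    G: (T_g, N) gallery matrix, same layout.
    U: (N, N) parameter matrix (learned in a real model, random here).
    """
    A = np.tanh(P @ U @ G.T)   # fused matrix, shape (T_p, T_g)
    t_p = A.max(axis=1)        # row-wise max: score of each probe frame
    t_g = A.max(axis=0)        # column-wise max: score of each gallery frame
    a_p = softmax(t_p)         # attention weights over probe frames
    a_g = softmax(t_g)         # attention weights over gallery frames
    v_p = P.T @ a_p            # attention-weighted probe vector, shape (N,)
    v_g = G.T @ a_g            # attention-weighted gallery vector, shape (N,)
    return v_p, v_g, a_p, a_g

rng = np.random.default_rng(0)
P = rng.standard_normal((16, 128))          # 16 probe frames (assumed sizes)
G = rng.standard_normal((20, 128))          # 20 gallery frames
U = 0.01 * rng.standard_normal((128, 128))  # stand-in for learned weights

v_p, v_g, a_p, a_g = attentive_temporal_pool(P, G, U)
distance = np.linalg.norm(v_p - v_g)        # distance a Siamese head could use
print(v_p.shape, v_g.shape, a_p.shape, a_g.shape)
```

Because the scores for the probe depend on G and the scores for the gallery depend on P, the same probe sequence gets different frame weights when it is compared with a different gallery sequence, which is the behavior the introduction motivates.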

References
[1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
[2] A. Barhillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. JMLR, pages 937–965, 2005.
[3] Y. Cheng, L. M. Brown, Q. Fan, R. S. Feris, S. Pankanti, and T. Zhang. Riskwheel: Interactive visual analytics for surveillance event detection. In IEEE International Conference on Multimedia and Expo, ICME 2014, Chengdu, China, July 14-18, 2014, pages 1–6, 2014.
[4] Y. Cheng, Q. Fan, S. Pankanti, and A. Choudhary. Temporal sequence modeling for video event detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[5] C. N. dos Santos, M. Tan, B. Xiang, and B. Zhou. Attentive pooling networks. CoRR, abs/1602.03609, 2016.
[6] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In IEEE CVPR, pages 2360–2367, 2010.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729, 2014.
[8] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis, pages 91–102, 2011.
[9] S. Karanam, Y. Li, and R. Radke. Person re-identification with discriminatively trained viewpoint invariant dictionaries. In IEEE ICCV, pages 4516–4524, 2015.
[10] S. Karanam, Y. Li, and R. Radke. Sparse re-id: Block sparsity for person re-identification. In IEEE CVPR Workshops, pages 33–40, 2015.
[11] I. Kviatkovsky, A. Adam, and E. Rivlin. Color invariants for person reidentification. IEEE TPAMI, 35(7):1622–1634, 2013.
[12] Y. Li, Z. Wu, S. Karanam, and R. Radke. Multi-shot human re-identification using adaptive fisher discriminant analysis. In BMVC, 2015.
[13] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. R. Smith. Learning locally-adaptive decision functions for person verification. In IEEE CVPR, pages 3610–3617, 2013.
[14] S. Liao and S. Z. Li. Efficient psd constrained asymmetric metric learning for person re-identification. In IEEE ICCV, pages 3685–3693, 2015.
[15] C. Liu, S. Gong, and C. C. Loy. Person re-identification: what features are important? In IEEE ICCV, pages 391–401, 2012.
[16] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. CoRR, abs/1606.04404, 2016.
[17] K. Liu, B. Ma, W. Zhang, and R. Huang. A spatiotemporal appearance representation for video-based pedestrian re-identification. In IEEE CVPR, pages 3810–3818, 2015.
[18] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In IJCAI, pages 674–679, 1981.
[19] B. Ma, Y. Su, and F. Jurie. Local descriptors encoded by fisher vectors for person re-identification. In IEEE ICCV, pages 413–422, 2012.
[20] N. Mclaughlin, J. M. Rincon, and P. Miller. Recurrent convolutional network for video-based person re-identification. In IEEE CVPR, pages 1325–1334, 2016.
[21] S. Paisitkriangkrai, C. Shen, and A. V. D. Hengel. Learning to rank in person re-identification with metric ensembles. In IEEE CVPR, pages 1846–1855, 2015.
[22] D. G. S, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. PETS, 3, 2007.
[23] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. CoRR, abs/1511.04119, 2015.
[24] A. Subramaniam, M. Chatterjee, and A. Mittal. Deep neural networks with inexact matching for person re-identification. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in NIPS, pages 2667–2675. Curran Associates, Inc., 2016.
[25] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human reidentification. In ECCV, pages 135–153, 2016.
[26] J. Wang, Y. Cheng, and R. Schmidt Feris. Walk and learn: Facial attribute representation learning from egocentric video and contextual data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[27] T. Wang, S. Gong, X. Zhu, and S. Wang. Person reidentification by video ranking. In ECCV, pages 688–703, 2014.
[28] X. Wang, G. Doretto, T. Sebastian, J. Rittscher, and P. Tu. Shape and appearance context modeling. In IEEE ICCV, pages 1–8, 2007.
[29] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, pages 207–244, 2009.
[30] F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person reidentification using kernel-based metric learning methods. In ECCV, pages 1–16, 2014.
[31] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.
[32] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang. Person re-identification via recurrent feature aggregation. In ECCV, pages 701–716, 2016.
[33] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Deep metric learning for person re-identification. In IEEE CVPR, pages 24–39, 2014.
[34] W. Yin, H. Schütze, B. Xiang, and B. Zhou. ABCNN: attention-based convolutional neural network for modeling sentence pairs. TACL, 4:259–272, 2016.
[35] J. You, A. Wu, X. Li, and W. Zheng. Top-push video-based person re-identification. In IEEE CVPR, pages 1345–1353, 2016.
[36] Z. Zhang, Y. Chen, and V. Saligrama. Group membership prediction. In IEEE ICCV, pages 3916–3924, 2015.
[37] R. Zhao, W. Ouyang, and X. Wang. Person re-identification by salience matching. In IEEE ICCV, pages 2528–2535, 2013.
[38] R. Zhao, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In IEEE CVPR, pages 144–151, 2014.
[39] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV, pages 868–884, 2016.
[40] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, J. Bu, and Q. Tian. Scalable person re-identification: a benchmark. In IEEE ICCV, pages 1116–1124, 2015.
[41] W. S. Zheng, S. Gong, and T. Xiang. Reidentification by relative distance comparison. IEEE TPAMI, 35(3):653–668, 2013.
