Reading Notes 25: Temporal Hallucinating for Action Recognition with Few Still Images (CVPR 2018)


b224618


openaccess.thecvf.com/content_cvpr_2018/papers/Wang_Temporal_Hallucinating_for_CVPR_2018_paper.pdf

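The abstract opens with the background: deep learning has recently advanced action recognition from still images, but the successful methods all need large amounts of training data, which makes them impractical. Humans do not need that much data for the same task, because they can compare the image in front of them with videos they remember having seen. Based on this observation, the paper proposes a hybrid video memory machine that, on seeing a still image, hallucinates motion information from the videos in its memory. The model has roughly three parts. The first is a temporal memory module, which performs temporal hallucinating and temporal predicting: temporal hallucinating generates temporal features for still images in an unsupervised way, and temporal predicting infers the action category of the query image. The second is a spatial memory module, which performs spatial predicting. The last is a video selection module, which picks out the most relevant videos to serve as memory.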

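First, the paper uses the two-stream models from "Towards good practices for very deep two-stream convnets" and "Temporal segment networks: Towards good practices for deep action recognition" to capture the different action characteristics of videos and images. The RGB frames and the optical flow of the video memory are fed into the spatial and temporal streams of the two-stream model, respectively, which yields spatial and temporal features for the videos.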

Next, the query image and the training images are fed into the spatial CNN to obtain their spatial features.

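These spatial features cover the query image and the training images (of which there are many). A still image has no optical flow and therefore no temporal feature. The temporal memory module introduced next, however, can "hallucinate" a temporal feature from a still image, and the hallucinated features are then combined with the video memory for temporal prediction.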

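Temporal memory module: this module has two parts. The first is temporal hallucinating, whose goal is to use the video memory to learn temporal features for still images. The paper uses a Gaussian process (GP) for this. A Gaussian process is a non-parametric Bayesian model. The authors say the main reason for choosing it is that, when new samples are compared against a memory, a non-parametric method is more suitable than a parametric one and less prone to "catastrophic forgetting". (I did not fully follow this part; I am not very familiar with non-parametric models or Gaussian processes.)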

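Temporal hallucinating first compares the spatial similarities between the images and the videos, in order to weigh which videos are likely to be more relevant to a given image. This comparison is carried out with a GP (Gaussian process) kernel operation.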

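The image-side inputs are the spatial features of the query and training images mentioned above, and the video-side inputs are the spatial features of the videos. Every element of the kernel matrices is computed with a kernel function whose two arguments are both features, and a noise term is included. Each row of the resulting matrix is the weight vector over the videos for one image.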
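For reference, in standard GP regression a kernel operation of this kind takes the form below. The symbols are labels chosen here for illustration and may differ from the paper's notation: S_img stacks the spatial features of the images (one per row), S_vid stacks the spatial features of the videos, k is the kernel function, sigma^2 is the noise term, and I is the identity matrix.

```latex
A = K(S_{\mathrm{img}}, S_{\mathrm{vid}})
    \bigl[ K(S_{\mathrm{vid}}, S_{\mathrm{vid}}) + \sigma^{2} I \bigr]^{-1},
\qquad
\bigl[ K(X, Y) \bigr]_{ij} = k(x_i, y_j)
```

Under this form, row i of A holds the weights that image i assigns to each video in memory.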

Once the similarity matrix between videos and images is available, the temporal feature of an image is obtained as the weighted sum of the videos' temporal features.

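The authors stress here that temporal hallucinating is unsupervised and needs no labels, so it can be applied in practical settings, for example when the image and video categories do not match. To summarize, this module compares the spatial similarity of the query image with each video in memory to obtain a weight vector, and then sums the videos' temporal features with those weights to obtain the temporal feature of the query image.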
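The two steps summarized above (kernel weights from spatial similarity, then a weighted sum of video temporal features) can be sketched in NumPy as follows. This is only a minimal sketch under my own assumptions, not the authors' code: the RBF kernel, the `gamma` and `noise` values, and the names `rbf_kernel` and `hallucinate_temporal` are all hypothetical choices for illustration.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Kernel matrix with entries exp(-gamma * ||x - y||^2) (assumed kernel)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def hallucinate_temporal(S_img, S_vid, T_vid, noise=0.1, gamma=1.0):
    """Return (A, T_img): per-image video weights and hallucinated temporal features.

    S_img: (n_img, d_s) image spatial features
    S_vid: (n_vid, d_s) video spatial features
    T_vid: (n_vid, d_t) video temporal features
    noise: value added to the diagonal (the noise term; assumed value)
    """
    K_iv = rbf_kernel(S_img, S_vid, gamma)                 # (n_img, n_vid)
    reg = rbf_kernel(S_vid, S_vid, gamma) + noise * np.eye(len(S_vid))
    # A = K_iv @ inv(reg); reg is symmetric, so solve on the transpose instead.
    A = np.linalg.solve(reg, K_iv.T).T
    # No labels are used anywhere: the step is unsupervised.
    T_img = A @ T_vid                                      # (n_img, d_t)
    return A, T_img

# Toy example: three well-separated videos, one image identical to video 1.
S_vid = 2.0 * np.eye(3)
T_vid = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
A, T_img = hallucinate_temporal(S_vid[[1]], S_vid, T_vid)
print(A.shape, T_img.shape)   # (1, 3) (1, 2)
print(int(A[0].argmax()))     # 1
```

In the toy example the image puts almost all of its weight on the video it resembles spatially, so its hallucinated temporal feature is close to a scaled copy of that video's temporal feature.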

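Once temporal features have been generated for the still images, temporal prediction can be performed for the query image. The first step is a temporal similarity comparison. The temporal features of the videos and of the training images obtained earlier form the temporal memory. The query image is compared against this memory to weigh which videos and which training images are more relevant to it temporally. One point to note is that the similarities between the query image and the training images may behave differently from those between the query image and the videos, because videos and images are data from different domains. The GP (Gaussian process) kernel operation therefore needs an additional domain-adaptation noise term.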

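Here the query input is the temporal feature of the query image, obtained by temporal hallucinating, and it is compared against the temporal memory. The elements of the kernel matrices are computed with a kernel function, and there are two noise terms. The result is the set of weights for a weighted sum over the temporal memory. With these weights, temporal predicting is obtained as a weighted sum over the labels of the temporal memory.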

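The labels are those of the videos and of the training images in the temporal memory. Because the action categories of the videos and the images may not coincide, the total number of categories across the two is used as the length of the one-hot vector that represents a class. (Figure: schematic of temporal predicting.)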
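Temporal predicting as described can be sketched like this. It is a sketch under stated assumptions, not the paper's implementation: I assume the two noise terms enter as separate diagonal values for the video entries and the training-image entries of the memory, that labels are already indexed in a joint class space covering both domains, and that the kernel is an RBF. The names `temporal_predict`, `noise_vid`, and `noise_img` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Kernel matrix with entries exp(-gamma * ||x - y||^2) (assumed kernel)."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def one_hot(labels, num_classes):
    """One-hot rows whose length is the joint number of video and image classes."""
    out = np.zeros((len(labels), num_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def temporal_predict(T_query, T_vid, T_train, y_vid, y_train, num_classes,
                     noise_vid=0.1, noise_img=0.1, gamma=1.0):
    """Class scores for the queries as a weighted sum of memory labels."""
    T_mem = np.vstack([T_vid, T_train])                    # temporal memory
    Y_mem = np.vstack([one_hot(y_vid, num_classes),
                       one_hot(y_train, num_classes)])
    # Assumed form of the domain-adaptation noise: one value per domain.
    noise = np.concatenate([np.full(len(T_vid), noise_vid),
                            np.full(len(T_train), noise_img)])
    reg = rbf_kernel(T_mem, T_mem, gamma) + np.diag(noise)
    W = np.linalg.solve(reg, rbf_kernel(T_query, T_mem, gamma).T).T
    return W @ Y_mem                                       # (n_query, num_classes)

# Toy example: video classes 0-1 and image classes 2-3 share one label space.
feats = 3.0 * np.eye(4)
scores = temporal_predict(feats[[2]], feats[:2], feats[2:],
                          y_vid=[0, 1], y_train=[2, 3], num_classes=4)
print(scores.shape)             # (1, 4)
print(int(scores[0].argmax()))  # 2
```

Using different values for `noise_vid` and `noise_img` changes how much the prediction trusts memory entries from each domain, which is one way to read the role of the domain-adaptation term.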

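On top of this, since both images and videos have spatial features, the paper also introduces a spatial memory module, which combines the spatial features of the videos and the training images into a spatial memory. Concretely, the module has the same structure as temporal predicting, with every temporal term replaced by its spatial counterpart. The final prediction for a query image is then a spatial-temporal fusion, in which the information from the spatial and temporal dimensions complements each other and helps the prediction.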
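As a rough illustration of the fusion step, the sketch below combines spatial and temporal class scores with a weighted average and takes the top class. The weighted average and the weight `w_temporal` are my own assumptions for illustration; these notes do not record how the paper actually fuses the two scores.

```python
import numpy as np

def fuse_scores(spatial_scores, temporal_scores, w_temporal=0.5):
    """Weighted average of spatial and temporal scores (assumed fusion rule)."""
    spatial_scores = np.asarray(spatial_scores, dtype=float)
    temporal_scores = np.asarray(temporal_scores, dtype=float)
    return (1.0 - w_temporal) * spatial_scores + w_temporal * temporal_scores

# One query image, three classes: the two streams disagree on the top class.
spatial = np.array([[0.6, 0.3, 0.1]])
temporal = np.array([[0.2, 0.7, 0.1]])
fused = fuse_scores(spatial, temporal)
print(fused.argmax(axis=1))   # [1]
```

Here the spatial stream alone would pick class 0, but the temporal evidence tips the fused prediction to class 1, which is the kind of complementarity the paragraph describes.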

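In practice, using the entire video dataset as memory is uneconomical and unnecessary, because not every video is relevant to the query image. The paper therefore designs a video selection module that filters highly relevant videos out of a video bag, which is sampled at random from each video-domain action category. Once the video bag has been drawn, it is first used to hallucinate temporal features for the training images, so that each image with a temporal feature can be treated as a pseudo video that has both spatial and temporal features. The pseudo videos are then used as memory to perform spatial and temporal predicting on the videos in the video bag. The prediction scores of each video in the bag are over image-domain labels (because the memory consists of pseudo videos generated from images). Next, the spatial and temporal scores of each video are aggregated by a spatial and temporal score fusion (the exact fusion method is not stated). The largest entry of the fused score then serves as a measure of each video's importance to the image domain. For each video-domain category in the video bag, the videos with the highest importance are selected, and their spatial and temporal features become the video memory. To keep the videos and the training images balanced, the number of videos selected per category equals the number of images in each image-domain category. (Figure: schematic of video selection.)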
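The final ranking step of video selection can be sketched as below. The sketch starts from fused scores that are assumed to be already computed (one row per video in the bag, one column per image-domain class), so it shows only the importance measure and the per-category top-k choice. The function name `select_videos` and the parameter `k` are hypothetical.

```python
import numpy as np

def select_videos(fused_scores, video_labels, k):
    """Pick the k most important videos within each video-domain category.

    fused_scores: (n_vid, n_img_classes) fused spatial-temporal scores
    video_labels: (n_vid,) video-domain category of each video in the bag
    k: videos kept per category (set to the number of images per image class)
    Returns (selected_indices, importance).
    """
    fused_scores = np.asarray(fused_scores, dtype=float)
    video_labels = np.asarray(video_labels)
    # Importance of a video = its largest score over image-domain classes.
    importance = fused_scores.max(axis=1)
    selected = []
    for c in np.unique(video_labels):
        idx = np.flatnonzero(video_labels == c)
        order = idx[np.argsort(-importance[idx], kind="stable")]
        selected.extend(order[:k].tolist())
    return sorted(selected), importance

# Toy bag: three videos of category 0, two of category 1, two image classes.
scores = [[0.9, 0.1], [0.2, 0.3], [0.5, 0.6], [0.1, 0.8], [0.4, 0.4]]
labels = [0, 0, 0, 1, 1]
chosen, importance = select_videos(scores, labels, k=1)
print(chosen)   # [0, 3]
```

The spatial and temporal features of the chosen videos would then be stored as the video memory.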
