Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics
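A CVPR 2019 paper. The pretext task the authors design is built on statistics derived from motion and color: specifically, the location and direction of the largest motion, and the locations of the largest and smallest color change together with the corresponding color values. In the Introduction, the authors note that motion representation in the human visual system is based on a set of learned patterns, and the paper's approach is closely tied to this idea: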
The idea is inspired by Giese and Poggio’s work on human visual system [14], in which the representation of motion is found to be based on a set of learned patterns.
These patterns are encoded as sequences of snapshots of body shapes by neurons in the form pathway, and by sequences of complex optic flow patterns in the motion pathway.
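
To make the motion part of the pretext task concrete, here is a minimal sketch of how a "largest motion location and direction" label could be derived from an optical flow field. This is a simplified illustration, not the paper's exact procedure: the function name `largest_motion_label`, the 4x4 grid, the 8 direction bins, and the use of raw flow magnitude are all assumptions made for this example.

```python
import numpy as np

def largest_motion_label(flow, grid=4, bins=8):
    """Return (block_index, direction_bin) for a flow field of shape (H, W, 2).

    Hypothetical helper: `flow[..., 0]` is the horizontal displacement and
    `flow[..., 1]` the vertical one, e.g. accumulated over a clip.
    """
    h, w, _ = flow.shape
    bh, bw = h // grid, w // grid  # block size; leftover border pixels are dropped

    # Per-pixel motion magnitude, cropped so it divides evenly into blocks.
    mag = np.hypot(flow[..., 0], flow[..., 1])[: bh * grid, : bw * grid]

    # Sum the magnitude inside each block of the grid.
    block_mag = mag.reshape(grid, bh, grid, bw).sum(axis=(1, 3))

    # Location label: row-major index of the block with the largest motion.
    block = int(np.argmax(block_mag))
    r, c = divmod(block, grid)

    # Direction label: quantize the angle of the summed flow vector in that block.
    patch = flow[r * bh : (r + 1) * bh, c * bw : (c + 1) * bw]
    dx, dy = patch[..., 0].sum(), patch[..., 1].sum()
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    direction = int(angle // (2 * np.pi / bins)) % bins

    return block, direction


# Toy example: motion only in one 2x2 patch of an 8x8 frame.
flow = np.zeros((8, 8, 2))
flow[4:6, 6:8] = (1.0, 2.0)
print(largest_motion_label(flow))
```

A network trained on this kind of target would take the raw clip as input and predict the two labels, so the supervision comes from the video itself rather than from human annotation. The color statistics mentioned above could be handled analogously by comparing per-block color histograms across frames.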