References
Paper:
A Closer Look at Spatiotemporal Convolutions for Action Recognition
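Chapter 1 Introduction
Deep learning has had a profound impact on still-image recognition, but in the video domain deep learning methods had not surpassed the best traditional hand-crafted features (iDT). Moreover, a 2D CNN (ResNet-152) applied to individual video frames performs very close to the best 3D CNN results. The earlier view was that 2D convolutions cannot model the temporal information and motion patterns needed for video analysis. Yet these 2D results suggest that temporal reasoning is not necessarily required for accurate action recognition, because much of the information about the action class is already contained in the individual static frames.
However, many earlier 3D CNN experiments also show that, at the same network depth and on large-scale datasets, 3D CNNs still outperform 2D CNNs. Motivated by this, the paper introduces two new forms of convolution:
(1) Mixed convolution (MixedConvolution): 3D convolutions in the shallow layers, followed by 2D convolutions in the deep layers:
Motion modeling is a low- to mid-level operation, so 3D convolutions are used only in the corresponding shallow layers; the deep layers then only need 2D convolutions on top of these mid-level motion features.
(2) The (2+1)D convolution block, R(2+1)D: a 3D convolution is factorized into two consecutive sub-convolutions, a 2D spatial convolution followed by a 1D temporal convolution:
This factorization has two advantages. First, there is an additional nonlinearity between the two sub-convolutions: compared with a 3D convolution with the same number of parameters, the number of nonlinearities doubles, which increases the capacity of the network. Second, factorizing space and time also decouples the optimization. 3D spatiotemporal convolutions entangle spatial and motion information, which makes them hard to optimize, whereas (2+1)D convolutions are easier to optimize and reach a lower loss.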
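Chapter 2 R(2+1)D versus P3D
Although the paper also compares several other 3D and 2D convolutional feature extractors, its main contribution is factorizing the 3D convolution into a 2D convolution plus a 1D convolution. This is similar to the earlier P3D network, so the differences between the two are compared here.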
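2.1 Factorizing the 3D convolution
P3D: 3D convolutions have too many parameters, which makes training and optimization very difficult; this is why C3D performs worse than ResNet-152.
R(2+1)D: Factorizing one 3D kernel into two kernels allows an extra nonlinearity (activation function) between them, which increases the model's expressive power. In addition, the experiments show that the factorized network is easier to optimize than the 3D network even when both have the same number of parameters.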
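2.2 Combining the two factorized convolutions
P3D: Three residual blocks are designed (P3D-A, P3D-B, and P3D-C). Because P3D aims to reduce computation, all of these blocks are placed inside a bottleneck (a kernel of size 1 at each end of the bottleneck, used to reduce and then restore the channel dimension). The final P3D ResNet model cycles through the A, B, and C blocks.
R(2+1)D: Residual connections are also used, but without a bottleneck to save computation, and every block combines the two factorized convolutions in one and only one way (serially, as in P3D-A).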
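2.3 Number of parameters
P3D: The purpose of factorizing the 3D convolution is to reduce the number of parameters, which eases optimization and improves computational efficiency.
R(2+1)D: The benefit of factorization is shown from another angle: with the parameter count kept roughly equal before and after factorization, the factorized network is still easier to optimize than the unfactorized 3D network, and it performs better. (How the parameter count is kept roughly equal after factorization is described in Section 3.5.)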
Chapter 3 Main Methods
The authors compare six ways of using 2D and 3D convolutions for video tasks: R2D, f-R2D, R3D, MCx, rMCx, and R(2+1)D, as shown in Fig. 1:
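3.1 R2D
The R2D module is an ordinary 2D convolution: the input of size $c \times t \times h \times w$ is treated as $ct \times h \times w$ (several frames are treated as one image), so the 2D convolution is simply applied across multiple frames. If a video has 10 frames, it is processed as a 10-channel image; if the 10 frames are color images, it becomes a 30-channel image.
Normally a video has four dimensions, $\text{channel} \times \text{time} \times h \times w$. For a 10-frame color video at $1080 \times 960$ resolution, the tensor would be $3 \times 10 \times 960 \times 1080$. For R2D, however, the tensor is $30 \times 960 \times 1080$: the RGB channels of all input frames are stacked together. As a result, the temporal information is collapsed by the first 2D convolution, whose output has no time axis left.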
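The channel folding can be illustrated with plain shape arithmetic. This is a minimal sketch, not code from the paper: the helper names are hypothetical, and the first layer with 64 filters of size 3×3 is an assumption chosen only for illustration.

```python
def fold_time_into_channels(shape):
    """Map a (C, T, H, W) clip shape to the (C*T, H, W) shape seen by R2D."""
    c, t, h, w = shape
    return (c * t, h, w)


def conv2d_weight_count(in_channels, out_channels, kernel):
    """Number of weights in a 2D convolution with a square kernel (no bias)."""
    return in_channels * out_channels * kernel * kernel


clip_shape = (3, 10, 960, 1080)            # 10 RGB frames, 960 high, 1080 wide
folded_shape = fold_time_into_channels(clip_shape)
print(folded_shape)                        # (30, 960, 1080)

# Assumed first layer: 64 filters of size 3x3. Each filter spans all 30
# channels, i.e. all 10 frames at once, and emits a single 2D map, so the
# layer output (64, H, W) no longer has a time axis.
print(conv2d_weight_count(folded_shape[0], 64, 3))   # 17280
```

Because every filter mixes all frames in one step, any layer after the first sees only 2D feature maps.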
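3.2 f-R2D
Since R2D merges the time dimension into the channels and loses motion information after the first convolution, f-R2D takes a different approach. It still uses a 2D convolutional network but, as in image tasks, extracts spatial features from each frame separately. If the input is again $3 \times T \times H \times W$ (T frames in total), this yields T spatial features. These T features are then either (1) fed into a temporal network to extract motion features, or (2) fused by pooling across the frames (global spatiotemporal pooling at the top of the network).
Building a spatial representation for each frame first, and then modeling motion over all of the per-frame representations together, is more reasonable than the R2D approach.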
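The per-frame idea, with pooling over time as the fusion step, can be sketched in a few lines. The extractor `frame_feature` is a hypothetical stand-in for a 2D CNN (it only returns the mean and max of a frame), and average pooling is an assumed choice of fusion; the toy frames are made-up values.

```python
def frame_feature(frame):
    """Hypothetical stand-in for a 2D CNN: returns [mean, max] of one frame."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def clip_feature(frames):
    """Apply the same extractor to every frame, then average over time."""
    per_frame = [frame_feature(f) for f in frames]          # T feature vectors
    dim = len(per_frame[0])
    return [sum(v[d] for v in per_frame) / len(per_frame) for d in range(dim)]


frames = [
    [[0, 2], [2, 4]],   # frame 1 (2x2 grayscale, toy values)
    [[1, 1], [1, 1]],   # frame 2
    [[4, 4], [8, 8]],   # frame 3
]
print(clip_feature(frames))
# Average pooling ignores frame order, so reversing the clip changes nothing.
print(clip_feature(frames) == clip_feature(frames[::-1]))   # True
```

The last line shows the limitation of pure pooling: the fused feature is identical for the clip played forwards and backwards, so any sensitivity to temporal order has to come from a separate temporal model.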
3.3 R3D
In R3D every convolution in the network is a 3D convolution; it is essentially C3D built on a ResNet. The network structure is shown in Fig. 3:
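3.4 MCx & rMCx
MCx uses 3D convolutions in the early groups and 2D convolutions from group x onward. MC3, for example, uses 3D convolutions in the first two groups and 2D convolutions in groups 3, 4, and 5. In other words, MCx first uses 3D convolutions to extract motion information and then 2D convolutions to extract appearance features.
Note that although 3D convolutions can model temporal information, they also extract spatial features, so the motion information meant by MCx's claim that "motion modeling belongs in the shallow layers" still depends on appearance information.
Motion modeling is useful in the shallow layers, but it is unnecessary in the deep layers that handle high-level semantic abstraction.
rMCx is the reverse of MCx: 2D convolutions first, then 3D convolutions. rMC3 uses 2D convolutions in the first two groups and 3D convolutions from group 3 onward. Whether 2D or 3D comes first depends on whether you believe temporal information should be processed in the shallow layers or in the deep layers.
The experiments favor MCx over rMCx: MCx performs somewhat better, because 2D convolutions in the early layers lose much of the information along the time dimension, so the 3D convolutions that follow have little effect.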
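The naming scheme can be made concrete with a small helper. This sketch follows the paper's convention that MCx switches from 3D to 2D at group x (and rMCx from 2D to 3D); it assumes five convolution groups, and the function name `mixed_conv_layout` is hypothetical.

```python
def mixed_conv_layout(x, reverse=False, num_groups=5):
    """Return the convolution type used in each group of MCx (or rMCx).

    MCx:  groups before x use 3D convolutions, groups x and deeper use 2D.
    rMCx: the reverse, 2D before group x and 3D from group x onward.
    """
    if not 2 <= x <= num_groups:
        raise ValueError("x must lie between 2 and num_groups")
    early, late = ("2D", "3D") if reverse else ("3D", "2D")
    return [early if group < x else late for group in range(1, num_groups + 1)]


print(mixed_conv_layout(3))                 # ['3D', '3D', '2D', '2D', '2D']  (MC3)
print(mixed_conv_layout(3, reverse=True))   # ['2D', '2D', '3D', '3D', '3D']  (rMC3)
```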
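3.5 R(2+1)D
Like MCx, R(2+1)D is a mixed form of convolution, but it approximates a 3D convolution with a 2D convolution followed by a 1D convolution.
To keep the number of parameters the same, the authors give a formula that matches the number of filters between the 2D and the 1D convolution. For a $t \times d \times d$ 3D kernel with $N_{i-1}$ input channels and $N_i$ output channels, the number of intermediate channels is

$$M_i = \left\lfloor \frac{t d^2 N_{i-1} N_i}{d^2 N_{i-1} + t N_i} \right\rfloor$$

Compared with R3D, the parameter count is unchanged, but because R(2+1)D adds more ReLU activations, the model should be more expressive and also easier to train.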
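The effect of the filter-matching rule can be checked numerically. The sketch below assumes a $t \times d \times d$ kernel, counts convolution weights only (no biases or batch-norm parameters), and uses illustrative function names that do not come from the paper.

```python
def intermediate_channels(t, d, n_in, n_out):
    """M = floor(t * d^2 * n_in * n_out / (d^2 * n_in + t * n_out))."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def params_3d(t, d, n_in, n_out):
    """Weights of a full t x d x d 3D convolution."""
    return t * d * d * n_in * n_out


def params_2plus1d(t, d, n_in, n_out):
    """Weights of a 1 x d x d conv into M channels plus a t x 1 x 1 conv out of them."""
    m = intermediate_channels(t, d, n_in, n_out)
    return d * d * n_in * m + t * m * n_out


for n_in, n_out in [(64, 64), (64, 128), (128, 256)]:
    print(n_in, n_out,
          intermediate_channels(3, 3, n_in, n_out),
          params_3d(3, 3, n_in, n_out),
          params_2plus1d(3, 3, n_in, n_out))
# 64 64 144 110592 110592
# 64 128 230 221184 220800
# 128 256 460 884736 883200
```

Because of the floor, the factorized block never has more weights than the 3D convolution it replaces, and the two counts are equal whenever the division is exact.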
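Chapter 4 Experiments
Conclusions:
- (1) The 2D models (f-R2D and R2D) are clearly worse than R3D and the mixed models (MCx and rMCx), and the gap widens when the input has 16 frames. This indicates that motion modeling is very important for action recognition.
- (2) Factorized spatiotemporal convolutions outperform both 3D and mixed convolutions, and outperform 2D models by an even larger margin.
- (3) Temporal modeling over longer input clips is more effective, but the clips should not be too long.
- (4) As network depth increases, R(2+1)D is easier to train than R3D.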
Chapter 5 PyTorch Implementation of R(2+1)D
```python
import math

import torch.nn as nn
from torch.nn.modules.utils import _triple


class SpatioTemporalConv(nn.Module):
    r"""Factorized (2+1)D convolution: a 2D convolution over the spatial axes
    into an intermediate subspace, followed by a 1D convolution over the time
    axis that produces the final output.

    Args:
        in_channels (int): Number of channels in the input tensor.
        out_channels (int): Number of channels produced by the convolution.
        kernel_size (int or tuple): Size of the convolving kernel.
        stride (int or tuple, optional): Stride of the convolution. Default: 1
        padding (int or tuple, optional): Zero-padding added to the sides of the
            input during the respective convolutions. Default: 0
        bias (bool, optional): If ``True``, adds a learnable bias to the output.
            Default: ``False``
        first_conv (bool, optional): If ``True``, use the settings of the first
            convolution of the network. Default: ``False``
    """

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0,
                 bias=False, first_conv=False):
        super().__init__()

        # If ints are given, convert them to 3-tuples: 1 -> (1, 1, 1)
        kernel_size = _triple(kernel_size)
        stride = _triple(stride)
        padding = _triple(padding)

        if first_conv:  # settings for the first layer of the network
            spatial_kernel_size = kernel_size  # (1, 7, 7)
            spatial_stride = (1, stride[1], stride[2])  # (1, 2, 2)
            spatial_padding = padding  # (0, 3, 3)

            temporal_kernel_size = (3, 1, 1)
            temporal_stride = (stride[0], 1, 1)  # (1, 1, 1)
            temporal_padding = (1, 0, 0)

            # From the official code, the first conv uses intermed_channels = 45.
            # In general intermed_channels follows the formula in the paper:
            # (3D kernel volume * in_channels * out_channels)
            # / (spatial kernel area * in_channels + temporal kernel length * out_channels)
            intermed_channels = 45
        else:
            # Split the parameters into spatial and temporal components by
            # masking out, with defaults, the axis that is not convolved over.
            # This avoids unintended behavior such as padding being added twice.
            spatial_kernel_size = (1, kernel_size[1], kernel_size[2])  # usually (1, 3, 3)
            spatial_stride = (1, stride[1], stride[2])  # stride 2 downsamples
            spatial_padding = (0, padding[1], padding[2])

            temporal_kernel_size = (kernel_size[0], 1, 1)
            temporal_stride = (stride[0], 1, 1)
            temporal_padding = (padding[0], 0, 0)

            # Intermediate channel count from the formula in Section 3.5 of the paper
            intermed_channels = int(math.floor(
                (kernel_size[0] * kernel_size[1] * kernel_size[2] * in_channels * out_channels)
                / (kernel_size[1] * kernel_size[2] * in_channels + kernel_size[0] * out_channels)
            ))

        # The spatial conv is effectively a 2D conv because of spatial_kernel_size;
        # it is followed by batch norm and ReLU.
        self.spatial_conv = nn.Conv3d(in_channels, intermed_channels, spatial_kernel_size,
                                      stride=spatial_stride, padding=spatial_padding, bias=bias)
        self.bn1 = nn.BatchNorm3d(intermed_channels)

        # The temporal conv is effectively a 1D conv. In this implementation its
        # batch norm and ReLU are applied in forward() below, so the module's
        # output is already normalized and non-negative.
        self.temporal_conv = nn.Conv3d(intermed_channels, out_channels, temporal_kernel_size,
                                       stride=temporal_stride, padding=temporal_padding, bias=bias)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.bn1(self.spatial_conv(x)))
        x = self.relu(self.bn2(self.temporal_conv(x)))
        return x
```
```python
class SpatioTemporalResBlock(nn.Module):
    r"""Single block for the ResNet network. Uses SpatioTemporalConv in the
    standard ResNet block layout (conv->batchnorm->ReLU->conv->batchnorm->sum->ReLU).

    Args:
        in_channels (int): Number of channels in the input tensor.
        out_channels (int): Number of channels in the output produced by the block.
        kernel_size (int): Size of the convolving kernels. Must be an int here,
            because the padding is computed as ``kernel_size // 2``.
        downsample (bool, optional): If ``True``, the output size is to be smaller
            than the input. Default: ``False``
    """

    def __init__(self, in_channels, out_channels, kernel_size, downsample=False):
        super().__init__()

        # If downsample is True, the first conv of the block has stride 2 to
        # halve the residual output size, and the input x is passed through a
        # separate 1x1x1 conv with stride 2 to halve it as well.
        # No pooling layers are used inside the ResNet.
        self.downsample = downsample

        # "SAME" padding
        padding = kernel_size // 2

        if self.downsample:
            # Shortcut branch: downsample the input x directly with a 1x1x1 conv
            self.downsampleconv = SpatioTemporalConv(in_channels, out_channels, 1, stride=2)
            self.downsamplebn = nn.BatchNorm3d(out_channels)

            # Main branch: one conv = spatial conv + temporal conv, with stride 2
            self.conv1 = SpatioTemporalConv(in_channels, out_channels, kernel_size,
                                            padding=padding, stride=2)
        else:
            # Without downsampling, the shortcut is added to the residual unchanged
            self.conv1 = SpatioTemporalConv(in_channels, out_channels, kernel_size,
                                            padding=padding)

        # Note: SpatioTemporalConv already ends with batch norm + ReLU, so the
        # layers below apply a second normalization to its output.
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU()

        # Second conv, followed by batch norm (ReLU comes after the sum)
        self.conv2 = SpatioTemporalConv(out_channels, out_channels, kernel_size,
                                        padding=padding)
        self.bn2 = nn.BatchNorm3d(out_channels)

    def forward(self, x):
        res = self.relu(self.bn1(self.conv1(x)))
        res = self.bn2(self.conv2(res))

        if self.downsample:
            x = self.downsamplebn(self.downsampleconv(x))

        return self.relu(x + res)
```
```python
class SpatioTemporalResLayer(nn.Module):
    r"""Forms a single layer of the ResNet network by stacking several blocks
    with the same output size on top of each other.

    Args:
        in_channels (int): Number of channels in the input tensor.
        out_channels (int): Number of channels in the output produced by the layer.
        kernel_size (int): Size of the convolving kernels.
        layer_size (int): Number of blocks to be stacked to form the layer.
        block_type (Module, optional): Type of block that is to be used to form
            the layer. Default: SpatioTemporalResBlock.
        downsample (bool, optional): If ``True``, the first block in the layer
            will implement downsampling. Default: ``False``
    """

    def __init__(self, in_channels, out_channels, kernel_size, layer_size,
                 block_type=SpatioTemporalResBlock, downsample=False):
        super().__init__()

        # Only the first block may downsample (and change the channel count)
        self.block1 = block_type(in_channels, out_channels, kernel_size, downsample)

        # The remaining layer_size - 1 blocks are identical and never downsample;
        # with layer_size = 2 there is exactly one such block.
        self.blocks = nn.ModuleList(
            [block_type(out_channels, out_channels, kernel_size) for _ in range(layer_size - 1)]
        )

    def forward(self, x):
        x = self.block1(x)
        for block in self.blocks:
            x = block(x)
        return x
```
```python
class R2Plus1DNet(nn.Module):
    r"""Forms the overall ResNet feature extractor by initializing 5 layers, with
    the number of blocks in each layer set by layer_sizes, and by performing a
    global average pool at the end, producing a 512-dimensional vector for each
    element in the batch.

    Args:
        layer_sizes (tuple): An iterable containing the number of blocks in each layer.
        block_type (Module, optional): Type of block that is to be used to form
            the layers. Default: SpatioTemporalResBlock.
    """

    def __init__(self, layer_sizes, block_type=SpatioTemporalResBlock):
        super().__init__()

        # First layer: 3 input channels, kernel (1, 7, 7), stride (1, 2, 2), i.e.
        # stride 1 along time and stride 2 along height and width (downsampling).
        # Padding is (0, 3, 3); the spatial padding is 7 // 2 = 3.
        self.conv1 = SpatioTemporalConv(3, 64, (1, 7, 7), stride=(1, 2, 2),
                                        padding=(0, 3, 3), first_conv=True)
        # Second layer: same output size and channel count as the first layer,
        # no downsampling, 3x3x3 kernels
        self.conv2 = SpatioTemporalResLayer(64, 64, 3, layer_sizes[0], block_type=block_type)
        # The last three layers double the channel count and downsample in the
        # first block of each layer
        self.conv3 = SpatioTemporalResLayer(64, 128, 3, layer_sizes[1],
                                            block_type=block_type, downsample=True)
        self.conv4 = SpatioTemporalResLayer(128, 256, 3, layer_sizes[2],
                                            block_type=block_type, downsample=True)
        self.conv5 = SpatioTemporalResLayer(256, 512, 3, layer_sizes[3],
                                            block_type=block_type, downsample=True)

        # Global average pooling of the output
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)

        x = self.pool(x)

        return x.view(-1, 512)
```
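To see how the clip shrinks on its way through `R2Plus1DNet`, the output size of each layer can be traced with the usual convolution size formula. This is a sketch under the kernel, stride, and padding values used in the listing; the helper names are hypothetical, and it relies on the fact that only the first block of `conv3`, `conv4`, and `conv5` changes the size (kernel 3, stride 2, padding 1 on every axis).

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_r2plus1d(t, h, w):
    """Return the (T, H, W) size after each of the five layers."""
    shapes = {}
    # conv1: spatial conv with kernel 7, stride 2, padding 3;
    # temporal conv with kernel 3, stride 1, padding 1
    h, w = conv_out_size(h, 7, 2, 3), conv_out_size(w, 7, 2, 3)
    t = conv_out_size(t, 3, 1, 1)
    shapes["conv1"] = (t, h, w)
    shapes["conv2"] = (t, h, w)  # no downsampling
    for name in ("conv3", "conv4", "conv5"):
        t, h, w = (conv_out_size(s, 3, 2, 1) for s in (t, h, w))
        shapes[name] = (t, h, w)
    return shapes


print(trace_r2plus1d(16, 112, 112))
# {'conv1': (16, 56, 56), 'conv2': (16, 56, 56), 'conv3': (8, 28, 28),
#  'conv4': (4, 14, 14), 'conv5': (2, 7, 7)}

# The 1x1x1 shortcut conv with stride 2 gives the same size as the main branch:
print(conv_out_size(56, 1, 2, 0) == conv_out_size(56, 3, 2, 1))   # True
```

The final `(2, 7, 7)` volume is what the global average pool reduces to a single 512-dimensional vector.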
```python
class R2Plus1DClassifier(nn.Module):
    r"""Forms a complete ResNet classifier producing vectors of size num_classes,
    by initializing 5 layers, with the number of blocks in each layer set by
    layer_sizes, and by performing a global average pool at the end, producing a
    512-dimensional vector for each element in the batch and passing it through
    a Linear layer.

    Args:
        num_classes (int): Number of classes in the data.
        layer_sizes (tuple): An iterable containing the number of blocks in each layer.
        block_type (Module, optional): Type of block that is to be used to form
            the layers. Default: SpatioTemporalResBlock.
        pretrained (bool, optional): If ``True``, call the pretrained-weights
            hook below. Default: ``False``
    """

    def __init__(self, num_classes, layer_sizes, block_type=SpatioTemporalResBlock,
                 pretrained=False):
        super().__init__()

        self.res2plus1d = R2Plus1DNet(layer_sizes, block_type)
        self.linear = nn.Linear(512, num_classes)

        self.__init_weight()

        if pretrained:
            self.__load_pretrained_weights()

    def forward(self, x):
        x = self.res2plus1d(x)
        logits = self.linear(x)
        return logits

    def __load_pretrained_weights(self):
        # Placeholder: this only prints the name and size of every entry in the
        # state dict; no weights are actually loaded.
        s_dict = self.state_dict()
        for name in s_dict:
            print(name)
            print(s_dict[name].size())

    def __init_weight(self):
        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
```
```python
def get_1x_lr_params(model):
    """Yield all trainable parameters of the convolutional part of the net."""
    for param in model.res2plus1d.parameters():
        if param.requires_grad:
            yield param


def get_10x_lr_params(model):
    """Yield all trainable parameters of the fc layer of the net."""
    for param in model.linear.parameters():
        if param.requires_grad:
            yield param
```
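The source does not show the training script, but the names of these two generators suggest that the backbone is meant to train at the base learning rate and the final linear layer at ten times that rate. Under that assumption, the two parameter sets could be turned into per-group optimizer settings as sketched below; `build_param_groups` and `head_multiplier` are hypothetical names, and plain strings stand in for real parameters so the sketch runs without PyTorch.

```python
def build_param_groups(backbone_params, head_params, base_lr, head_multiplier=10):
    """Build per-group options: backbone at base_lr, head at a multiple of it."""
    return [
        {"params": list(backbone_params), "lr": base_lr},
        {"params": list(head_params), "lr": base_lr * head_multiplier},
    ]


# Stand-in names instead of real tensors (assumption, for illustration only)
groups = build_param_groups(
    backbone_params=["conv.weight"],
    head_params=["linear.weight", "linear.bias"],
    base_lr=0.001,
)
for group in groups:
    print(len(group["params"]), group["lr"])
```

With a real model, the two arguments would be `get_1x_lr_params(model)` and `get_10x_lr_params(model)`, and the resulting list of dicts is the per-parameter-group format accepted by PyTorch optimizers, for example `torch.optim.SGD(groups, lr=base_lr, momentum=0.9)`.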
```python
if __name__ == "__main__":
    import torch

    # Random clip: (batch, channels, frames, height, width)
    inputs = torch.rand(1, 3, 16, 112, 112)
    net = R2Plus1DClassifier(101, (2, 2, 2, 2), pretrained=False)

    outputs = net(inputs)
    print(outputs.size())  # torch.Size([1, 101])
```