References
Paper:
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
Chapter 1 Introduction
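3D convolutions work well for learning spatio-temporal features from video, but their kernels carry many parameters, so a 3D model ends up large even when it is shallow. C3D, for example, has only 11 layers, yet its model size reaches 321 MB, whereas a 152-layer 2D ResNet takes only 235 MB. Moreover, simply fine-tuning ResNet-152 on Sports-1M gives better results than training C3D from scratch.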
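The GoogLeNet-Inception series introduced the idea of asymmetric convolution: a $1\times k$ convolution stacked with a $k\times 1$ convolution replaces a single $k\times k$ convolution.
This design keeps comparable feature-extraction capacity while cutting the parameter count substantially: $k^2$ parameters become $2k$.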
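The authors apply the same idea to 3D convolution, decomposing a $k\times k\times k$ kernel ($k^3$ parameters) into a $1\times k\times k$ convolution and a $k\times 1\times 1$ convolution ($k^2+k$ parameters in total); the paper uses $k=3$. The former extracts spatial features and the latter fuses temporal information. A further benefit is that the network can be initialized from the weights of 2D networks pre-trained on ImageNet.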
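To make the savings concrete, here is a minimal sketch that counts the weights of a single kernel before and after factorization. The function names are hypothetical helpers introduced only for this illustration, and channels and biases are ignored, as in the counts above.

```python
def full_3d_params(k):
    """Weights in one k x k x k kernel (one input/output channel, no bias)."""
    return k ** 3


def pseudo_3d_params(k):
    """Weights in a 1 x k x k spatial kernel plus a k x 1 x 1 temporal kernel."""
    return k * k + k


def asymmetric_2d_params(k):
    """Weights in a 1 x k kernel plus a k x 1 kernel that replace one k x k kernel."""
    return 2 * k


if __name__ == "__main__":
    k = 3  # kernel size used in the paper
    print("2D: k^2 =", k * k, "-> 1xk + kx1 =", asymmetric_2d_params(k))
    print("3D: k^3 =", full_3d_params(k), "-> 1xkxk + kx1x1 =", pseudo_3d_params(k))
```

For $k=3$ this gives 9 versus 6 weights in the 2D case and 27 versus 12 in the 3D case.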
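Building on these separable 3D kernels and the 2D ResNet, the authors then design several bottleneck building blocks that combine the two convolutions in series, in parallel, or both. These blocks markedly reduce the parameter count, and stacking them yields the deep P3D ResNet model.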
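Chapter 2 Related Work
Video representation learning falls into two main families:
(1) Methods based on hand-crafted features:
- STIP
- Histogram of Gradient and Histogram of Optical Flow
- 3D Histogram of Gradient
- SIFT-3D
- Dense trajectory features
(2) Methods based on deep learning:
- Stack CNN-based frame-level representations
- Two-Stream
- TSN
- …
- 3D CNN
For models that do not use 3D kernels, the authors argue that the motion between consecutive frames is not captured well.
For C3D, which does use 3D kernels, the authors argue that the parameter and compute cost of those kernels limits the model to 11 layers, which is too shallow for strong representational power.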
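Chapter 3 P3D Network Architecture
The authors first revisit the Residual Unit of ResNet:
$$x_{t+1}=h(x_t)+F(x_t)$$
Here $x_t$ and $x_{t+1}$ are the input and output of the $t$-th unit, $h(x_t)=x_t$ is the identity mapping, and $F$ is the non-linear residual function.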
This can also be written as:
$$(I+F)\cdot x_t=x_t+F\cdot x_t=x_t+F(x_t)=x_{t+1}$$
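On this basis the authors propose the P3D structure, which replaces the 2D kernels with separable 3D kernels, and design three variants:
- Fig. 2 (a): the two convolutions in series;
- Fig. 2 (b): the two convolutions in parallel;
- Fig. 2 (c): the two convolutions both in series and in parallel. They are cascaded, but the output of each one is also connected directly to the skip connection, so both can directly influence the final output of the block.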
With $S$ denoting the spatial $1\times k\times k$ convolution and $T$ the temporal $k\times 1\times 1$ convolution, the three blocks are:
P3D-A: $(I+T\cdot S)\cdot x_t:=x_t+T(S(x_t))=x_{t+1}$
P3D-B: $(I+S+T)\cdot x_t:=x_t+S(x_t)+T(x_t)=x_{t+1}$
P3D-C: $(I+S+T\cdot S)\cdot x_t:=x_t+S(x_t)+T(S(x_t))=x_{t+1}$
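The three compositions are easy to mix up, so here is a small sketch that spells them out with ordinary Python functions. The toy `spatial` and `temporal` functions are assumptions made purely for illustration; in the real network each stands for a convolution followed by BN and ReLU.

```python
def p3d_a(x, S, T):
    """Serial: x + T(S(x))."""
    return x + T(S(x))


def p3d_b(x, S, T):
    """Parallel: x + S(x) + T(x)."""
    return x + S(x) + T(x)


def p3d_c(x, S, T):
    """Serial with an extra shortcut from S: x + S(x) + T(S(x))."""
    s = S(x)
    return x + s + T(s)


def spatial(x):
    # Toy stand-in for the spatial convolution S
    return 2.0 * x


def temporal(x):
    # Toy stand-in for the temporal convolution T
    return x + 1.0


if __name__ == "__main__":
    x = 3.0
    print("P3D-A:", p3d_a(x, spatial, temporal))  # 3 + T(6) = 10.0
    print("P3D-B:", p3d_b(x, spatial, temporal))  # 3 + 6 + 4 = 13.0
    print("P3D-C:", p3d_c(x, spatial, temporal))  # 3 + 6 + 7 = 16.0
```

Note that P3D-C reuses the output of S for both the shortcut and the input of T, which is why `s` is computed only once.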
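The authors follow the Bottleneck block of ResNet closely when designing the P3D blocks.
As a side note, a Bottleneck first compresses and then restores the channel dimension, which saves computation. In the ResNet family, ResNet-50, -101, and -152 use the Bottleneck block, whereas ResNet-18 and -34 use the Basic Residual block.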
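As a rough illustration of why the Bottleneck saves computation, the sketch below counts convolution weights for a 2D bottleneck (1x1 reduce, 3x3, 1x1 restore) and for two 3x3 convolutions at the same width. The helper names are hypothetical, the reduction factor of 4 mirrors `expansion = 4` in the code, and the same-width comparison is an assumption made for illustration only; the Basic blocks in ResNet-18/34 run at narrower widths.

```python
def conv_weights(c_in, c_out, k):
    """Weights of a 2D convolution with a k x k kernel and no bias."""
    return c_in * c_out * k * k


def bottleneck_weights(channels, reduction=4):
    """1x1 reduce -> 3x3 at reduced width -> 1x1 restore."""
    mid = channels // reduction
    return (conv_weights(channels, mid, 1)
            + conv_weights(mid, mid, 3)
            + conv_weights(mid, channels, 1))


def two_3x3_weights(channels):
    """Two 3x3 convolutions applied at full width."""
    return 2 * conv_weights(channels, channels, 3)


if __name__ == "__main__":
    c = 256
    print("bottleneck:", bottleneck_weights(c))  # 69632
    print("two 3x3:   ", two_3x3_weights(c))     # 1179648
```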
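Chapter 4 Experiments
To find out which P3D block works best, the paper builds four networks based on ResNet-50:
- The original ResNet-50, which processes single frames
- ResNet-50 with every Bottleneck block replaced by P3D-A
- ResNet-50 with every Bottleneck block replaced by P3D-B
- ResNet-50 with every Bottleneck block replaced by P3D-C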
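These networks are trained as follows:
- The original ResNet-50 is fine-tuned on UCF-101. Each video is first resized to $240\times320$, and a $224\times224$ region is then randomly cropped. All BN layers except the first are frozen, and an extra Dropout layer with a dropout rate of 0.9 is added. The video-level prediction is obtained by averaging the scores of all individual frames.
- The three P3D networks are initialized from the pre-trained ResNet-50 weights (the newly added temporal convolutions have no pre-trained counterpart, so they are randomly initialized) and fine-tuned on UCF-101. Each video is first resized to $182\times242$; non-overlapping 16-frame clips are then taken from it, and a $160\times160$ region is randomly cropped.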
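The clip sampling and random cropping described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, frames are addressed by index, and the crop is returned as a box rather than applied to pixel data.

```python
import random


def non_overlapping_clips(num_frames, clip_len=16):
    """Return (start, end) index pairs of consecutive, non-overlapping clips."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, clip_len)]


def random_crop_box(height, width, crop, rng):
    """Pick a crop x crop window uniformly at random inside a height x width frame."""
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    return top, left, top + crop, left + crop


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(non_overlapping_clips(50))            # [(0, 16), (16, 32), (32, 48)]
    print(random_crop_box(182, 242, 160, rng))  # (top, left, bottom, right)
```

Trailing frames that do not fill a whole clip are dropped here; whether the original training pipeline does the same is not stated in the text.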
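The results show that P3D achieves a solid accuracy gain with only a small increase in parameters. The two designs with a serial path, A and C, are also more accurate than the parallel design B, which suggests that a serial connection fuses spatial and temporal information more effectively.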
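The last row of Fig. 4, P3D ResNet, is a network in which the authors mix the A, B, and C blocks in pursuit of structural diversity.
The ResNet configurations are shown in Fig. 6:
Taking ResNet-50 as an example, its stages contain [3,4,6,3] residual blocks, 16 in total. In P3D ResNet these 16 residual blocks are replaced in turn by the A, B, and C blocks, giving [ABC,ABCA,BCABCA,BCA]. This mixed structure achieves the highest accuracy.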
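The round-robin assignment is easy to reproduce. The helper below is a hypothetical sketch, not part of the reference implementation: it cycles through the block types across all stages without restarting at stage boundaries.

```python
from itertools import cycle


def assign_block_types(layers, types=("A", "B", "C")):
    """Assign block types in round-robin order across all stages."""
    order = cycle(types)
    return ["".join(next(order) for _ in range(n)) for n in layers]


if __name__ == "__main__":
    print(assign_block_types([3, 4, 6, 3]))  # ['ABC', 'ABCA', 'BCABCA', 'BCA']
```

The same rule appears in the code as `self.id % self.len_ST`, where the block index keeps counting from one stage to the next.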
Chapter 5 Implementing P3D in PyTorch
Reference blog:
PyTorch code for the Pseudo-3D Residual Networks algorithm
The full code follows; it is best read alongside the ResNet source code:
"""
Code reference: https://github.com/qijiezhao/pseudo-3d-pytorch
"""
from __future__ import print_function
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.autograd import Variable
import math
from functools import partial
from torchsummary import summary
__all__ = ['P3D', 'P3D63', 'P3D131', 'P3D199']
# Spatial convolution
def conv_S(in_planes, out_planes, stride=1, padding=1):
    # As described in the paper, conv S is 1x3x3.
    # Note: the stride argument is accepted but unused; the stride is fixed at 1.
    return nn.Conv3d(in_planes, out_planes, kernel_size=(1, 3, 3), stride=1,
                     padding=padding, bias=False)


# Temporal convolution
def conv_T(in_planes, out_planes, stride=1, padding=1):
    # conv T is 3x1x1 (stride is likewise fixed at 1)
    return nn.Conv3d(in_planes, out_planes, kernel_size=(3, 1, 1), stride=1,
                     padding=padding, bias=False)
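def downsample_basic_block(x, planes, stride):
    # Shortcut type 'A': subsample with a 1x1x1 average pool, then zero-pad the channels
    out = F.avg_pool3d(x, kernel_size=1, stride=stride)
    # Allocate the padding with the same dtype and on the same device as the input,
    # so no explicit CUDA type check is needed
    zero_pads = torch.zeros(out.size(0), planes - out.size(1),
                            out.size(2), out.size(3), out.size(4),
                            dtype=out.dtype, device=out.device)
    # Concatenate along the channel dimension (keeps the autograd graph intact)
    out = torch.cat([out, zero_pads], dim=1)
    return out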
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, n_s=0, depth_3d=47, ST_struc=('A', 'B', 'C')):
        super(Bottleneck, self).__init__()
        self.downsample = downsample
        self.depth_3d = depth_3d
        self.ST_struc = ST_struc
        self.len_ST = len(self.ST_struc)
        stride_p = stride
        # If this block downsamples, stride only over the spatial dimensions
        if self.downsample is not None:
            stride_p = (1, 2, 2)
        if n_s < self.depth_3d:
            # 3D part; the very first block keeps the resolution
            if n_s == 0:
                stride_p = 1
            self.conv1 = nn.Conv3d(inplanes, planes, kernel_size=1, bias=False, stride=stride_p)
            self.bn1 = nn.BatchNorm3d(planes)
        else:
            # 2D part; only its first block downsamples
            if n_s == self.depth_3d:
                stride_p = 2
            else:
                stride_p = 1
            self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False, stride=stride_p)
            self.bn1 = nn.BatchNorm2d(planes)
        # self.conv2 = nn.Conv3d(planes, planes, kernel_size=3, stride=stride,
        #                        padding=1, bias=False)
        self.id = n_s
        # Pick the block type (A, B or C) in round-robin order
        self.ST = list(self.ST_struc)[self.id % self.len_ST]
        if self.id < self.depth_3d:
            self.conv2 = conv_S(planes, planes, stride=1, padding=(0, 1, 1))
            self.bn2 = nn.BatchNorm3d(planes)
            #
            self.conv3 = conv_T(planes, planes, stride=1, padding=(1, 0, 0))
            self.bn3 = nn.BatchNorm3d(planes)
        else:
            self.conv_normal = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
            self.bn_normal = nn.BatchNorm2d(planes)
        if n_s < self.depth_3d:
            self.conv4 = nn.Conv3d(planes, planes * 4, kernel_size=1, bias=False)
            self.bn4 = nn.BatchNorm3d(planes * 4)
        else:
            self.conv4 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
            self.bn4 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)
        self.stride = stride
    # Serial structure (P3D-A)
    def ST_A(self, x):
        x = self.conv2(x)  # 1x3x3
        x = self.bn2(x)
        x = self.relu(x)
        x = self.conv3(x)  # 3x1x1
        x = self.bn3(x)
        x = self.relu(x)
        return x

    # Parallel structure (P3D-B)
    def ST_B(self, x):
        tmp_x = self.conv2(x)
        tmp_x = self.bn2(tmp_x)
        tmp_x = self.relu(tmp_x)
        x = self.conv3(x)
        x = self.bn3(x)
        x = self.relu(x)
        return x + tmp_x

    # Serial + parallel structure (P3D-C)
    def ST_C(self, x):
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        tmp_x = self.conv3(x)
        tmp_x = self.bn3(tmp_x)
        tmp_x = self.relu(tmp_x)
        return x + tmp_x
    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        # out = self.conv2(out)
        # out = self.bn2(out)
        # out = self.relu(out)
        if self.id < self.depth_3d:  # C3D parts:
            if self.ST == 'A':
                out = self.ST_A(out)
            elif self.ST == 'B':
                out = self.ST_B(out)
            elif self.ST == 'C':
                out = self.ST_C(out)
        else:
            out = self.conv_normal(out)  # normal is res5 part, C2D all.
            out = self.bn_normal(out)
            out = self.relu(out)
        out = self.conv4(out)
        out = self.bn4(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out
class P3D(nn.Module):
    def __init__(self, block, layers, modality='RGB',
                 shortcut_type='B', num_classes=400, dropout=0.5, ST_struc=('A', 'B', 'C')):
        super(P3D, self).__init__()
        self.inplanes = 64
        # self.conv1 = nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2),
        #                        padding=(3, 3, 3), bias=False)
        # RGB video frames have 3 input channels; optical flow has 2
        self.input_channel = 3 if modality == 'RGB' else 2  # 2 is for flow
        self.ST_struc = ST_struc
        self.conv1_custom = nn.Conv3d(self.input_channel, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2),
                                      padding=(0, 3, 3), bias=False)
        # layers[0] + layers[1] + layers[2] are 3D stages; layers[3] is a 2D stage
        self.depth_3d = sum(layers[:3])  # C3D layers are only (res2,res3,res4), res5 is C2D
        self.bn1 = nn.BatchNorm3d(64)  # bn1 follows conv1
        self.cnt = 0
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool3d(kernel_size=(2, 3, 3), stride=2, padding=(0, 1, 1))  # pooling layer for conv1.
        self.maxpool_2 = nn.MaxPool3d(kernel_size=(2, 1, 1), padding=0,
                                      stride=(2, 1, 1))  # pooling layer for res2, 3, 4.
        self.layer1 = self._make_layer(block, 64, layers[0], shortcut_type)
        self.layer2 = self._make_layer(block, 128, layers[1], shortcut_type, stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], shortcut_type, stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], shortcut_type, stride=2)
        self.avgpool = nn.AvgPool2d(kernel_size=(5, 5), stride=1)  # pooling layer for res5.
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        # Initialize the 3D convolutions and 3D BN layers
        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
        # some private attributes
        self.input_size = (self.input_channel, 16, 160, 160)  # input of the network
        self.input_mean = [0.485, 0.456, 0.406] if modality == 'RGB' else [0.5]
        self.input_std = [0.229, 0.224, 0.225] if modality == 'RGB' else [np.mean([0.229, 0.224, 0.225])]
    @property
    def scale_size(self):
        return self.input_size[2] * 256 // 160  # assume that raw images are resized (340,256).

    @property
    def temporal_length(self):
        return self.input_size[1]

    @property
    def crop_size(self):
        return self.input_size[2]
    def _make_layer(self, block, planes, blocks, shortcut_type, stride=1):
        downsample = None
        stride_p = stride  # especially for downsample branch.
        # 3D convolution part
        if self.cnt < self.depth_3d:
            if self.cnt == 0:
                stride_p = 1
            else:
                stride_p = (1, 2, 2)
            # A shortcut projection is needed when the shape changes
            # (Conv Block); otherwise the block is an Identity Block
            if stride != 1 or self.inplanes != planes * block.expansion:
                # Downsample with 3D average pooling
                if shortcut_type == 'A':
                    downsample = partial(downsample_basic_block,
                                         planes=planes * block.expansion,
                                         stride=stride)
                # Downsample with a 1x1x1 Conv3d
                else:
                    downsample = nn.Sequential(
                        nn.Conv3d(self.inplanes, planes * block.expansion,
                                  kernel_size=1, stride=stride_p, bias=False),
                        nn.BatchNorm3d(planes * block.expansion)
                    )
        # 2D convolution part
        else:
            # Same shortcut logic as above, with 2D layers
            if stride != 1 or self.inplanes != planes * block.expansion:
                if shortcut_type == 'A':
                    downsample = partial(downsample_basic_block,
                                         planes=planes * block.expansion,
                                         stride=stride)
                else:
                    downsample = nn.Sequential(
                        nn.Conv2d(self.inplanes, planes * block.expansion,
                                  kernel_size=1, stride=2, bias=False),
                        nn.BatchNorm2d(planes * block.expansion)
                    )
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, n_s=self.cnt, depth_3d=self.depth_3d,
                            ST_struc=self.ST_struc))
        # Increment the block counter; it tracks the depth so that the
        # blocks switch to 2D convolutions once layers[3] is reached
        self.cnt += 1
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, n_s=self.cnt, depth_3d=self.depth_3d, ST_struc=self.ST_struc))
            self.cnt += 1
        return nn.Sequential(*layers)
    def forward(self, x):
        # 3x16x160x160 -> 64x16x80x80 -> 64x8x40x40
        x = self.conv1_custom(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        # 64x8x40x40 -> 256x8x40x40 -> 256x4x40x40
        x = self.maxpool_2(self.layer1(x))  # Part Res2
        # 256x4x40x40 -> 512x4x20x20 -> 512x2x20x20
        x = self.maxpool_2(self.layer2(x))  # Part Res3
        # 512x2x20x20 -> 1024x2x10x10 -> 1024x1x10x10
        x = self.maxpool_2(self.layer3(x))  # Part Res4
        # From here on, the 3D convolutions give way to 2D convolutions
        sizes = x.size()
        # (batch, 1024, 1, 10, 10) -> (batch, 1024, 10, 10)
        x = x.view(-1, sizes[1], sizes[3], sizes[4])  # Part Res5
        # 1024x10x10 -> 2048x5x5 -> 2048x1x1
        x = self.layer4(x)
        x = self.avgpool(x)
        # Fully connected layer
        x = x.view(-1, self.fc.in_features)
        x = self.fc(self.dropout(x))
        return x
def P3D63(**kwargs):
    """Construct a P3D63 model based on a ResNet-50-3D model.
    """
    model = P3D(Bottleneck, [3, 4, 6, 3], **kwargs)
    return model


def P3D131(**kwargs):
    """Construct a P3D131 model based on a ResNet-101-3D model.
    """
    model = P3D(Bottleneck, [3, 4, 23, 3], **kwargs)
    return model


def P3D199(pretrained=False, modality='RGB', **kwargs):
    """Construct a P3D199 model based on a ResNet-152-3D model.
    """
    model = P3D(Bottleneck, [3, 8, 36, 3], modality=modality, **kwargs)
    if pretrained:
        if modality == 'RGB':
            pretrained_file = 'p3d_rgb_199.checkpoint.pth.tar'
        elif modality == 'Flow':
            pretrained_file = 'p3d_flow_199.checkpoint.pth.tar'
        else:
            # Previously an unknown modality left pretrained_file undefined
            raise ValueError("No pretrained weights for modality: {}".format(modality))
        # The checkpoint file is looked up relative to the working directory
        weights = torch.load(pretrained_file)['state_dict']
        model.load_state_dict(weights)
    return model
# custom operation
def get_optim_policies(model=None, modality='RGB', enable_pbn=True):
    '''
    first conv:     weight --> conv weight
                    bias   --> conv bias
    normal action:  weight --> non-first conv + fc weight
                    bias   --> non-first conv + fc bias
    bn:             the first bn2, and many all bn3.
    '''
    first_conv_weight = []
    first_conv_bias = []
    normal_weight = []
    normal_bias = []
    bn = []
    if model is None:
        raise ValueError("get_optim_policies requires a model")
    conv_cnt = 0
    bn_cnt = 0
    for m in model.modules():
        if isinstance(m, torch.nn.Conv3d) or isinstance(m, torch.nn.Conv2d):
            ps = list(m.parameters())
            conv_cnt += 1
            if conv_cnt == 1:
                first_conv_weight.append(ps[0])
                if len(ps) == 2:
                    first_conv_bias.append(ps[1])
            else:
                normal_weight.append(ps[0])
                if len(ps) == 2:
                    normal_bias.append(ps[1])
        elif isinstance(m, torch.nn.Linear):
            ps = list(m.parameters())
            normal_weight.append(ps[0])
            if len(ps) == 2:
                normal_bias.append(ps[1])
        elif isinstance(m, torch.nn.BatchNorm3d):
            bn_cnt += 1
            # later BN's are frozen
            if not enable_pbn or bn_cnt == 1:
                bn.extend(list(m.parameters()))
        elif isinstance(m, torch.nn.BatchNorm2d):
            bn.extend(list(m.parameters()))
        elif len(m._modules) == 0:
            if len(list(m.parameters())) > 0:
                raise ValueError("New atomic module type: {}. Need to give it a learning policy".format(type(m)))
    # Split the remaining weights: the first 70% form the "slow" group
    slow_rate = 0.7
    n_fore = int(len(normal_weight) * slow_rate)
    slow_feat = normal_weight[:n_fore]  # finetune slowly.
    slow_bias = normal_bias[:n_fore]
    normal_feat = normal_weight[n_fore:]
    normal_bias = normal_bias[n_fore:]
    return [
        {'params': first_conv_weight, 'lr_mult': 5 if modality == 'Flow' else 1, 'decay_mult': 1,
         'name': "first_conv_weight"},
        {'params': first_conv_bias, 'lr_mult': 10 if modality == 'Flow' else 2, 'decay_mult': 0,
         'name': "first_conv_bias"},
        {'params': slow_feat, 'lr_mult': 1, 'decay_mult': 1,
         'name': "slow_feat"},
        {'params': slow_bias, 'lr_mult': 2, 'decay_mult': 0,
         'name': "slow_bias"},
        {'params': normal_feat, 'lr_mult': 1, 'decay_mult': 1,
         'name': "normal_feat"},
        {'params': normal_bias, 'lr_mult': 2, 'decay_mult': 0,
         'name': "normal_bias"},
        {'params': bn, 'lr_mult': 1, 'decay_mult': 0,
         'name': "BN scale/shift"},
    ]
def test():
    model = P3D63(num_classes=400)
    # Create the model and move it to the GPU when one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    summary(model, (3, 16, 160, 160))


if __name__ == '__main__':
    # # Build the 199-layer P3D network by calling P3D199
    # # model = P3D199(pretrained=True, num_classes=400)
    # model = P3D63(num_classes=400)
    # model = model.cuda()
    # # Generate random input data. The first dimension is 10, so the batch holds 10 clips;
    # # each clip has 16 frames, and each frame is a 160x160 image with 3 channels.
    # data = torch.autograd.Variable(
    #     torch.rand(10, 3, 16, 160, 160)).cuda()  # if modality=='Flow', please change the 2nd dimension 3==>2
    # out = model(data)
    # print(out.size(), out)
    test()
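The shape comments in `forward` can be checked without running the network. The sketch below applies the standard output-size formula floor((n + 2p - k) / s) + 1 to the temporal and spatial dimensions, using the kernel, stride, and padding values that appear in the code above. The helper names are hypothetical, and layers inside each stage that preserve the size are skipped.

```python
def out_size(n, kernel, stride, padding=0):
    """Output length of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_p3d_shapes(frames=16, size=160):
    """Return (stage, temporal length, spatial size) after each stage."""
    t, s = frames, size
    shapes = []
    # conv1_custom: kernel (1, 7, 7), stride (1, 2, 2), padding (0, 3, 3)
    t, s = out_size(t, 1, 1), out_size(s, 7, 2, 3)
    shapes.append(("conv1", t, s))
    # maxpool: kernel (2, 3, 3), stride 2, padding (0, 1, 1)
    t, s = out_size(t, 2, 2), out_size(s, 3, 2, 1)
    shapes.append(("maxpool", t, s))
    # layer1..layer3: a 1x1x1 conv with the given spatial stride in the first
    # block, then maxpool_2 halves the temporal length
    for name, spatial_stride in (("layer1", 1), ("layer2", 2), ("layer3", 2)):
        s = out_size(s, 1, spatial_stride)
        t = out_size(t, 2, 2)
        shapes.append((name + "+maxpool_2", t, s))
    # layer4 is 2D: the first block strides by 2 with a 1x1 conv
    s = out_size(s, 1, 2)
    shapes.append(("layer4", 1, s))
    # avgpool: 5x5 kernel, stride 1
    s = out_size(s, 5, 1)
    shapes.append(("avgpool", 1, s))
    return shapes


if __name__ == "__main__":
    for stage, t, s in trace_p3d_shapes():
        print("{:<18} T={:<3} HxW={}x{}".format(stage, t, s, s))
```

With the default 16x160x160 input the temporal length shrinks 16, 8, 4, 2, 1 and the spatial size 160, 80, 40, 40, 20, 10, 5, 1, matching the comments in `forward`. It also shows why `avgpool` uses a fixed 5x5 kernel: other input sizes would need a different kernel.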