[Video Understanding] Part 4: P3D

travellerss

Last modified: 2023-09-17 09:05:22

First published: 2022-11-06 21:38:24

Article link: https://blog.csdn.net/qq_30196905/article/details/127721688

References

Paper:

Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Blogs:

[Paper notes] P3D

Paper 2017 [P3D]

Chapter 1 Introduction

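3D convolutions work well for learning spatio-temporal features from video, but 3D kernels carry many parameters, so the resulting models are shallow yet large. C3D, for example, has only 11 layers but a model size of 321 MB, whereas the 152-layer 2D ResNet takes only 235 MB. Moreover, simply fine-tuning ResNet-152 on Sports-1M gives better results than training C3D from scratch.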

Fig. 1 Comparison of accuracy and parameter count for ResNet-152, C3D and P3D

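The GoogLeNet/Inception family introduced the idea of asymmetric convolutions: a convolution with a $k\times k$ kernel is replaced by a stack of a $1\times k$ convolution and a $k\times1$ convolution.

This design keeps comparable feature-extraction capacity while significantly reducing parameters: $k^2$ parameters become $2k$.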

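The authors apply the same idea to 3D convolutions: a $k \times k \times k$ 3D convolution ($k^3$ parameters) is decomposed into a $1 \times k \times k$ convolution and a $k \times 1 \times 1$ convolution ($k^2+k$ parameters in total), with $k=3$ in the paper. The former extracts spatial features and the latter fuses temporal features. A further benefit is that the model can be initialized from the parameters of 2D networks pretrained on ImageNet.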
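
To make the saving concrete, here is a minimal PyTorch sketch that counts the weights of a full $3\times3\times3$ convolution against the $1\times3\times3$ plus $3\times1\times1$ pair. The channel width of 64, the clip size and the helper `count_params` are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Total number of learnable parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters())


channels = 64  # illustrative channel width

# Full 3x3x3 convolution: k^3 weights per (input, output) channel pair.
full_3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)

# Pseudo-3D replacement: 1x3x3 spatial conv followed by 3x1x1 temporal conv.
spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                    padding=(0, 1, 1), bias=False)
temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                     padding=(1, 0, 0), bias=False)

full_count = count_params(full_3d)                               # 27 * C * C
factored_count = count_params(spatial) + count_params(temporal)  # (9 + 3) * C * C

# Both variants preserve the (batch, channels, frames, height, width) shape.
clip = torch.rand(1, channels, 8, 16, 16)
with torch.no_grad():
    full_out = full_3d(clip)
    factored_out = temporal(spatial(clip))

print(full_count, factored_count, tuple(factored_out.shape))
```

With equal input and output widths, the ratio is $27:12$ regardless of the channel count, which matches $k^3$ versus $k^2+k$ for $k=3$.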

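In addition, building on the separable 3D kernels and the 2D ResNet, the authors try serial and parallel arrangements and design several bottleneck building blocks. These blocks significantly reduce the parameter count and are finally assembled into the deep P3D ResNet model.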

Chapter 2 Related Work

Video representation learning falls into two main families:

(1) Methods based on hand-crafted features:

STIP
Histogram of Gradient and Histogram of Optical Flow
3D Histogram of Gradient
SIFT-3D
Dense trajectory features

(2) Methods based on deep learning:

Stack CNN-based frame-level representations
Two-Stream
TSN
…
3D CNN

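For models that do not use 3D kernels, the authors argue that they fail to capture the motion between consecutive frames well.

For C3D, which does use 3D kernels, the authors argue that the parameter and compute cost of 3D kernels limits the model to 11 layers, which is too shallow and restricts its representational capacity.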

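Chapter 3 P3D Network Structure

The authors first review the Residual Unit of ResNet:
$x_{t+1}=h(x_t)+F(x_t)$
With the identity mapping $h(x_t)=x_t$, this can also be written as:
$(I+F)\cdot x_t = x_t+F\cdot x_t := x_t+F(x_t) = x_{t+1}$
Building on this, the authors propose the P3D structure, which replaces the 2D kernels with separable 3D kernels, and design the following three variants: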

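Fig. 2 (a): the two convolutions in series;
Fig. 2 (b): the two convolutions in parallel;
Fig. 2 (c): a combination of series and parallel.

The variants differ in which convolution outputs connect directly to the block's final output: in (a) only the temporal convolution does, while in (b) and (c) both the spatial and the temporal convolution do.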

Fig. 2 Three ways of connecting the convolutions in a basic residual block

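P3D-A: $(I+T\cdot S)\cdot x_t := x_t+T(S(x_t)) = x_{t+1}$
P3D-B: $(I+S+T)\cdot x_t := x_t+S(x_t)+T(x_t) = x_{t+1}$
P3D-C: $(I+S+T\cdot S)\cdot x_t := x_t+S(x_t)+T(S(x_t)) = x_{t+1}$
Here $S$ is the spatial $1\times3\times3$ convolution and $T$ is the temporal $3\times1\times1$ convolution.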
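
The three formulas translate almost literally into code. The sketch below is a stripped-down illustration only: it omits the batch normalization, ReLU and $1\times1\times1$ bottleneck convolutions that the real blocks use, and the channel width and the function names `p3d_a`, `p3d_b`, `p3d_c` are made up for this example.

```python
import torch
import torch.nn as nn

channels = 8  # illustrative width

# S: spatial 1x3x3 convolution, T: temporal 3x1x1 convolution.
# The paddings are chosen so that both keep the input shape.
S = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
              padding=(0, 1, 1), bias=False)
T = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
              padding=(1, 0, 0), bias=False)


def p3d_a(x):
    # Series: x + T(S(x))
    return x + T(S(x))


def p3d_b(x):
    # Parallel: x + S(x) + T(x)
    return x + S(x) + T(x)


def p3d_c(x):
    # Series plus a direct path from S: x + S(x) + T(S(x))
    s = S(x)
    return x + s + T(s)


x = torch.rand(2, channels, 4, 10, 10)  # (batch, channels, frames, H, W)
with torch.no_grad():
    out_a, out_b, out_c = p3d_a(x), p3d_b(x), p3d_c(x)
    s_only = S(x)

print(tuple(out_a.shape), tuple(out_b.shape), tuple(out_c.shape))
```

Comparing the definitions, P3D-C is P3D-A with one extra direct path from $S$ to the output, so `out_c - out_a` equals `S(x)` up to floating-point error.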

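The authors closely follow the Bottleneck block of ResNet when designing the P3D blocks.

As a side note, a bottleneck first compresses and then restores the number of channels, which saves computation. In the ResNet family, ResNet-50, -101 and -152 use the Bottleneck block, while ResNet-18 and -34 use the basic residual block.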
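
A quick back-of-the-envelope calculation shows why the bottleneck saves so much. The sketch below compares a bottleneck working on a 256-channel feature map (compressed to 64 channels in the middle, matching the `expansion = 4` used in the code later in this post) with a hypothetical pair of $3\times3$ convolutions applied directly at 256 channels. The helper `conv_params` is an assumption for illustration, and biases and BN parameters are ignored.

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a 2D convolution with a k x k kernel and no bias."""
    return in_ch * out_ch * k * k


wide, narrow = 256, 64  # block width and compressed width (expansion = 4)

# Hypothetical alternative: two 3x3 convolutions directly at the wide width.
two_wide_convs = 2 * conv_params(wide, wide, 3)

# Bottleneck: 1x1 compress -> 3x3 at the narrow width -> 1x1 restore.
bottleneck = (conv_params(wide, narrow, 1)
              + conv_params(narrow, narrow, 3)
              + conv_params(narrow, wide, 1))

print(two_wide_convs, bottleneck, round(two_wide_convs / bottleneck, 1))
```

Under these assumptions the bottleneck needs 69,632 weights against 1,179,648, roughly a 17x reduction.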

Fig. 3 The basic residual block and the three P3D blocks

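Chapter 4 Experiments

To find out which P3D block works best, the paper builds four networks based on ResNet-50:

The original ResNet-50, which processes single frames
ResNet-50 with all Bottleneck blocks replaced by P3D-A
ResNet-50 with all Bottleneck blocks replaced by P3D-B
ResNet-50 with all Bottleneck blocks replaced by P3D-C

These networks are trained as follows:

For the original ResNet-50, the network is fine-tuned on UCF-101. Each video is first resized to $240 \times 320$, then a $224 \times 224$ region is randomly cropped. All BN layers except the first are frozen, a Dropout layer with a dropout rate of 0.9 is added, and the video-level prediction is obtained by averaging the scores of all individual frames.
For the three P3D networks, the parameters are initialized from the pretrained ResNet-50 weights (the added temporal convolutions have no 2D counterpart, so they can only be randomly initialized) and the networks are fine-tuned on UCF-101. Each video is first resized to $182 \times 242$, then non-overlapping 16-frame clips are taken and a $160 \times 160$ region is randomly cropped.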
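
The post does not spell out how the 2D weights are mapped onto the P3D layers. One plausible way, sketched below under that assumption, is to copy each pretrained $3\times3$ kernel into the $1\times3\times3$ spatial convolution by adding a singleton temporal axis, and to leave the temporal convolution at its default random initialization. The `conv2d` layer here is a random stand-in for a pretrained ResNet-50 layer, and the channel width is illustrative.

```python
import torch
import torch.nn as nn

channels = 16  # illustrative width

# Stand-in for a pretrained 2D 3x3 convolution (random weights here).
conv2d = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

# P3D counterparts: spatial 1x3x3 and temporal 3x1x1 convolutions.
spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                    padding=(0, 1, 1), bias=False)
temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                     padding=(1, 0, 0), bias=False)  # stays randomly initialized

with torch.no_grad():
    # (out, in, 3, 3) -> (out, in, 1, 3, 3): add a temporal axis of length 1.
    spatial.weight.copy_(conv2d.weight.unsqueeze(2))

# Sanity check: the spatial 3D conv now acts like the 2D conv on every frame.
clip = torch.rand(2, channels, 4, 10, 10)  # (batch, channels, frames, H, W)
with torch.no_grad():
    out3d = spatial(clip)
    per_frame = torch.stack(
        [conv2d(clip[:, :, t]) for t in range(clip.size(2))], dim=2)

print(torch.allclose(out3d, per_frame, atol=1e-4))
```

Because the spatial kernel has temporal extent 1, the copied layer reproduces the 2D network's per-frame behaviour, which is what makes 2D pretraining reusable here.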

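Fig. 4 Accuracy comparison of ResNet, P3D-A, P3D-B, P3D-C and P3D ResNet

The P3D variants gain a noticeable amount of accuracy for only a small increase in parameters. In addition, the two designs that contain a serial path (A and C) are more accurate than the purely parallel design B, which suggests that the serial arrangement fuses spatio-temporal information better.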

The last row of Fig. 4, P3D ResNet, uses all three block types A, B and C within a single network, a choice the authors make to increase structural diversity.

Fig. 5 Mixed structure

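Concretely, the ResNet configurations are shown in Fig. 6:

Fig. 6 ResNet architecture configurations

Taking ResNet-50 as an example, it has [3,4,6,3] blocks per stage, 16 residual blocks in total. In P3D ResNet these 16 residual blocks are replaced by the A, B and C blocks in turn, giving [ABC, ABCA, BCABCA, BCA]. This more diverse structure achieves the highest accuracy.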
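
The assignment is simply a round-robin over the block types that carries on across stage boundaries. A small sketch (the function name `assign_blocks` is made up for this example):

```python
from itertools import cycle, islice


def assign_blocks(stage_sizes, block_types=("A", "B", "C")):
    """Assign block types round-robin, continuing across stage boundaries."""
    order = cycle(block_types)
    # islice consumes exactly n items from the shared iterator per stage.
    return ["".join(islice(order, n)) for n in stage_sizes]


stages = assign_blocks([3, 4, 6, 3])
print(stages)  # ['ABC', 'ABCA', 'BCABCA', 'BCA']
```

The implementation in Chapter 5 gets the same effect by indexing `ST_struc` with the running block counter modulo its length.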

Chapter 5 PyTorch Implementation of P3D

Code reference:

qijiezhao/pseudo-3d-pytorch

Blog:

PyTorch code for Pseudo-3D Residual Networks

The full code follows; it is best read alongside the ResNet source code:

"""
    Code reference: https://github.com/qijiezhao/pseudo-3d-pytorch
"""

from __future__ import print_function
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.autograd import Variable
import math
from functools import partial
from torchsummary import summary

__all__ = ['P3D', 'P3D63', 'P3D131', 'P3D199']


# Spatial convolution
def conv_S(in_planes, out_planes, stride=1, padding=1):
    # As described in the paper, conv S is 1x3x3.
    # Note: the stride argument is accepted but unused; stride is fixed to 1.
    return nn.Conv3d(in_planes, out_planes, kernel_size=(1, 3, 3), stride=1,
                     padding=padding, bias=False)


# Temporal convolution
def conv_T(in_planes, out_planes, stride=1, padding=1):
    # conv T is 3x1x1.
    # Note: the stride argument is accepted but unused; stride is fixed to 1.
    return nn.Conv3d(in_planes, out_planes, kernel_size=(3, 1, 1), stride=1,
                     padding=padding, bias=False)


def downsample_basic_block(x, planes, stride):
    out = F.avg_pool3d(x, kernel_size=1, stride=stride)
    # Zero-pad the channel dimension up to `planes`, matching dtype and device
    zero_pads = torch.zeros(out.size(0), planes - out.size(1),
                            out.size(2), out.size(3), out.size(4),
                            dtype=out.dtype, device=out.device)

    # Concatenate the tensors directly so gradients still flow through `out`
    out = torch.cat([out, zero_pads], dim=1)

    return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, n_s=0, depth_3d=47, ST_struc=('A', 'B', 'C')):
        super(Bottleneck, self).__init__()
        self.downsample = downsample
        self.depth_3d = depth_3d
        self.ST_struc = ST_struc
        self.len_ST = len(self.ST_struc)

        stride_p = stride
        # If downsampling is required
        if self.downsample is not None:
            stride_p = (1, 2, 2)

        if n_s < self.depth_3d:
            if n_s == 0:
                stride_p = 1
            self.conv1 = nn.Conv3d(inplanes, planes, kernel_size=1, bias=False, stride=stride_p)
            self.bn1 = nn.BatchNorm3d(planes)
        else:
            if n_s == self.depth_3d:
                stride_p = 2
            else:
                stride_p = 1
            self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False, stride=stride_p)
            self.bn1 = nn.BatchNorm2d(planes)
        # self.conv2 = nn.Conv3d(planes, planes, kernel_size=3, stride=stride,
        #                        padding=1, bias=False)
        self.id = n_s
        self.ST = list(self.ST_struc)[self.id % self.len_ST]
        if self.id < self.depth_3d:
            self.conv2 = conv_S(planes, planes, stride=1, padding=(0, 1, 1))
            self.bn2 = nn.BatchNorm3d(planes)
            #
            self.conv3 = conv_T(planes, planes, stride=1, padding=(1, 0, 0))
            self.bn3 = nn.BatchNorm3d(planes)
        else:
            self.conv_normal = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
            self.bn_normal = nn.BatchNorm2d(planes)

        if n_s < self.depth_3d:
            self.conv4 = nn.Conv3d(planes, planes * 4, kernel_size=1, bias=False)
            self.bn4 = nn.BatchNorm3d(planes * 4)
        else:
            self.conv4 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
            self.bn4 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)

        self.stride = stride

    # Serial structure (P3D-A)
    def ST_A(self, x):
        x = self.conv2(x)   # 1x3x3
        x = self.bn2(x)
        x = self.relu(x)

        x = self.conv3(x)   # 3x1x1
        x = self.bn3(x)
        x = self.relu(x)

        return x

    # Parallel structure (P3D-B)
    def ST_B(self, x):
        tmp_x = self.conv2(x)
        tmp_x = self.bn2(tmp_x)
        tmp_x = self.relu(tmp_x)

        x = self.conv3(x)
        x = self.bn3(x)
        x = self.relu(x)

        return x + tmp_x

    # Serial + parallel structure (P3D-C)
    def ST_C(self, x):
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)

        tmp_x = self.conv3(x)
        tmp_x = self.bn3(tmp_x)
        tmp_x = self.relu(tmp_x)

        return x + tmp_x

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # out = self.conv2(out)
        # out = self.bn2(out)
        # out = self.relu(out)
        if self.id < self.depth_3d:  # C3D parts:

            if self.ST == 'A':
                out = self.ST_A(out)
            elif self.ST == 'B':
                out = self.ST_B(out)
            elif self.ST == 'C':
                out = self.ST_C(out)
        else:
            out = self.conv_normal(out)  # normal is res5 part, C2D all.
            out = self.bn_normal(out)
            out = self.relu(out)

        out = self.conv4(out)
        out = self.bn4(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class P3D(nn.Module):

    def __init__(self, block, layers, modality='RGB',
                 shortcut_type='B', num_classes=400, dropout=0.5, ST_struc=('A', 'B', 'C')):
        self.inplanes = 64
        super(P3D, self).__init__()
        # self.conv1 = nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2),
        #                        padding=(3, 3, 3), bias=False)

        # RGB video frames have 3 input channels; optical flow has 2
        self.input_channel = 3 if modality == 'RGB' else 2  # 2 is for flow
        self.ST_struc = ST_struc

        self.conv1_custom = nn.Conv3d(self.input_channel, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2),
                                      padding=(0, 3, 3), bias=False)

        # layers[0] + layers[1] + layers[2] use 3D convolutions; layers[3] uses 2D convolutions
        self.depth_3d = sum(layers[:3])  # C3D layers are only (res2,res3,res4),  res5 is C2D

        self.bn1 = nn.BatchNorm3d(64)  # bn1 is followed by conv1
        self.cnt = 0
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool3d(kernel_size=(2, 3, 3), stride=2, padding=(0, 1, 1))  # pooling layer for conv1.
        self.maxpool_2 = nn.MaxPool3d(kernel_size=(2, 1, 1), padding=0,
                                      stride=(2, 1, 1))  # pooling layer for res2, 3, 4.

        self.layer1 = self._make_layer(block, 64, layers[0], shortcut_type)
        self.layer2 = self._make_layer(block, 128, layers[1], shortcut_type, stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], shortcut_type, stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], shortcut_type, stride=2)

        self.avgpool = nn.AvgPool2d(kernel_size=(5, 5), stride=1)  # pooling layer for res5.
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

        # some private attribute
        self.input_size = (self.input_channel, 16, 160, 160)  # input of the network
        self.input_mean = [0.485, 0.456, 0.406] if modality == 'RGB' else [0.5]
        self.input_std = [0.229, 0.224, 0.225] if modality == 'RGB' else [np.mean([0.229, 0.224, 0.225])]

    @property
    def scale_size(self):
        return self.input_size[2] * 256 // 160  # assume that raw images are resized to (340, 256).

    @property
    def temporal_length(self):
        return self.input_size[1]

    @property
    def crop_size(self):
        return self.input_size[2]

    def _make_layer(self, block, planes, blocks, shortcut_type, stride=1):
        downsample = None
        stride_p = stride  # especially for downsample branch.

        # 3D convolution part
        if self.cnt < self.depth_3d:
            if self.cnt == 0:
                stride_p = 1
            else:
                stride_p = (1, 2, 2)

            # The shortcut decides between a Conv Block and an Identity Block
            if stride != 1 or self.inplanes != planes * block.expansion:
                # Downsample with avg_pool3d plus zero padding
                if shortcut_type == 'A':
                    downsample = partial(downsample_basic_block,
                                         planes=planes * block.expansion,
                                         stride=stride)
                # Downsample with a strided Conv3d
                else:
                    downsample = nn.Sequential(
                        nn.Conv3d(self.inplanes, planes * block.expansion,
                                  kernel_size=1, stride=stride_p, bias=False),
                        nn.BatchNorm3d(planes * block.expansion)
                    )

        # 2D convolution part
        else:
            # The shortcut decides between a Conv Block and an Identity Block
            if stride != 1 or self.inplanes != planes * block.expansion:
                if shortcut_type == 'A':
                    downsample = partial(downsample_basic_block,
                                         planes=planes * block.expansion,
                                         stride=stride)
                else:
                    downsample = nn.Sequential(
                        nn.Conv2d(self.inplanes, planes * block.expansion,
                                  kernel_size=1, stride=2, bias=False),
                        nn.BatchNorm2d(planes * block.expansion)
                    )
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, n_s=self.cnt, depth_3d=self.depth_3d,
                            ST_struc=self.ST_struc))
        # Increment the block counter so that layers[3] switches to 2D convolutions
        self.cnt += 1

        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, n_s=self.cnt, depth_3d=self.depth_3d, ST_struc=self.ST_struc))
            self.cnt += 1

        return nn.Sequential(*layers)

    def forward(self, x):
        # 3x16x160x160 -> 64x16x80x80 -> 64x8x40x40
        x = self.conv1_custom(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        # 64x8x40x40 -> 256x8x40x40 -> 256x4x40x40
        x = self.maxpool_2(self.layer1(x))  # Part Res2
        # 256x4x40x40 -> 512x4x20x20 -> 512x2x20x20
        x = self.maxpool_2(self.layer2(x))  # Part Res3
        # 512x2x20x20 -> 1024x2x10x10 -> 1024x1x10x10
        x = self.maxpool_2(self.layer3(x))  # Part Res4

        # Switch from 3D to 2D convolutions here (the temporal axis is now 1)
        sizes = x.size()
        # (batch, 1024, 1, 10, 10) -> (batch, 1024, 10, 10)
        x = x.view(-1, sizes[1], sizes[3], sizes[4])  # Part Res5
        # 1024x10x10 -> 2048x5x5 -> 2048x1x1
        x = self.layer4(x)
        x = self.avgpool(x)

        # Fully connected layer
        x = x.view(-1, self.fc.in_features)
        x = self.fc(self.dropout(x))

        return x


def P3D63(**kwargs):
    """Construct a P3D63 model based on a ResNet-50-3D model.
    """
    model = P3D(Bottleneck, [3, 4, 6, 3], **kwargs)
    return model


def P3D131(**kwargs):
    """Construct a P3D131 model based on a ResNet-101-3D model.
    """
    model = P3D(Bottleneck, [3, 4, 23, 3], **kwargs)
    return model


def P3D199(pretrained=False, modality='RGB', **kwargs):
    """construct a P3D199 model based on a ResNet-152-3D model.
    """
    model = P3D(Bottleneck, [3, 8, 36, 3], modality=modality, **kwargs)
    if pretrained:
        if modality == 'RGB':
            pretrained_file = 'p3d_rgb_199.checkpoint.pth.tar'
        elif modality == 'Flow':
            pretrained_file = 'p3d_flow_199.checkpoint.pth.tar'
        weights = torch.load(pretrained_file)['state_dict']
        model.load_state_dict(weights)
    return model


# custom operation
def get_optim_policies(model=None, modality='RGB', enable_pbn=True):
    '''
    first conv:         weight --> conv weight
                        bias   --> conv bias
    normal action:      weight --> non-first conv + fc weight
                        bias   --> non-first conv + fc bias
    bn:                 the first bn2, and many all bn3.

    '''
    first_conv_weight = []
    first_conv_bias = []
    normal_weight = []
    normal_bias = []
    bn = []

    if model is None:
        raise ValueError("get_optim_policies requires a model")

    conv_cnt = 0
    bn_cnt = 0
    for m in model.modules():
        if isinstance(m, torch.nn.Conv3d) or isinstance(m, torch.nn.Conv2d):
            ps = list(m.parameters())
            conv_cnt += 1
            if conv_cnt == 1:
                first_conv_weight.append(ps[0])
                if len(ps) == 2:
                    first_conv_bias.append(ps[1])
            else:
                normal_weight.append(ps[0])
                if len(ps) == 2:
                    normal_bias.append(ps[1])
        elif isinstance(m, torch.nn.Linear):
            ps = list(m.parameters())
            normal_weight.append(ps[0])
            if len(ps) == 2:
                normal_bias.append(ps[1])

        elif isinstance(m, torch.nn.BatchNorm3d):
            bn_cnt += 1
            # later BN's are frozen
            if not enable_pbn or bn_cnt == 1:
                bn.extend(list(m.parameters()))
        elif isinstance(m, torch.nn.BatchNorm2d):
            bn.extend(list(m.parameters()))
        elif len(m._modules) == 0:
            if len(list(m.parameters())) > 0:
                raise ValueError("New atomic module type: {}. Need to give it a learning policy".format(type(m)))

    slow_rate = 0.7
    n_fore = int(len(normal_weight) * slow_rate)
    slow_feat = normal_weight[:n_fore]  # finetune slowly.
    slow_bias = normal_bias[:n_fore]
    normal_feat = normal_weight[n_fore:]
    normal_bias = normal_bias[n_fore:]

    return [
        {'params': first_conv_weight, 'lr_mult': 5 if modality == 'Flow' else 1, 'decay_mult': 1,
         'name': "first_conv_weight"},
        {'params': first_conv_bias, 'lr_mult': 10 if modality == 'Flow' else 2, 'decay_mult': 0,
         'name': "first_conv_bias"},
        {'params': slow_feat, 'lr_mult': 1, 'decay_mult': 1,
         'name': "slow_feat"},
        {'params': slow_bias, 'lr_mult': 2, 'decay_mult': 0,
         'name': "slow_bias"},
        {'params': normal_feat, 'lr_mult': 1, 'decay_mult': 1,
         'name': "normal_feat"},
        {'params': normal_bias, 'lr_mult': 2, 'decay_mult': 0,
         'name': "normal_bias"},
        {'params': bn, 'lr_mult': 1, 'decay_mult': 0,
         'name': "BN scale/shift"},
    ]

def test():
    model = P3D63(num_classes=400)
    # Create the model and move it to the GPU if one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    summary(model, (3, 16, 160, 160))

if __name__ == '__main__':
    # # Build the model; calling P3D199 instead gives the 199-layer P3D network
    # # model = P3D199(pretrained=True, num_classes=400)
    # model = P3D63(num_classes=400)
    # model = model.cuda()
    # # Generate random input data: a first dimension of 10 means 10 clips,
    # # each clip has 16 frames, and each frame is a 160x160 3-channel image.
    # data = torch.autograd.Variable(
    #     torch.rand(10, 3, 16, 160, 160)).cuda()  # if modality=='Flow', please change the 2nd dimension 3==>2
    # out = model(data)
    # print(out.size(), out)
    test()
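
The shape comments in `forward` can be double-checked without running the network, using the standard output-size formula for convolution and pooling layers. The sketch below traces only the temporal length and the spatial side length for a $16\times160\times160$ input. The helper `out_size` and the `trace` list are made up for this example, while the kernel, stride and padding values are copied from the code above.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output length along one axis of a conv/pool layer (floor mode)."""
    return (n + 2 * padding - kernel) // stride + 1


trace = []
frames, side = 16, 160

# conv1_custom: kernel (1, 7, 7), stride (1, 2, 2), padding (0, 3, 3)
frames, side = out_size(frames, 1), out_size(side, 7, 2, 3)
trace.append(("conv1_custom", frames, side))

# maxpool: kernel (2, 3, 3), stride 2, padding (0, 1, 1)
frames, side = out_size(frames, 2, 2), out_size(side, 3, 2, 1)
trace.append(("maxpool", frames, side))

# layer1..layer3: a 1x1x1 conv with the given spatial stride in the first
# block, then maxpool_2 (kernel (2, 1, 1), stride (2, 1, 1)) halves the frames.
for name, spatial_stride in [("layer1", 1), ("layer2", 2), ("layer3", 2)]:
    side = out_size(side, 1, spatial_stride)
    frames = out_size(frames, 2, 2)
    trace.append((name + " + maxpool_2", frames, side))

# layer4 is 2D: a strided 1x1 conv, followed by the 5x5 average pool.
side = out_size(side, 1, 2)
trace.append(("layer4", 1, side))
side = out_size(side, 5, 1)
trace.append(("avgpool", 1, side))

for row in trace:
    print(row)
```

The trace shows that the temporal axis reaches exactly 1 before the reshape to 2D and that the $5\times5$ average pool exactly covers the final feature map, so the reshape and the pooling size in this implementation both rely on the 16-frame, $160\times160$ input.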