[Video Understanding] Part 4: P3D


References

Paper

  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks

Blogs

  [Paper Notes] P3D

  Paper 2017 [P3D]


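Chapter 1 Introduction

3D convolutions work well for learning spatio-temporal features from video, but 3D kernels carry many parameters, so 3D models end up shallow yet large. C3D, for example, has only 11 layers but a model size of 321 MB, whereas the 152-layer 2D ResNet is only 235 MB. Moreover, directly fine-tuning ResNet-152 on Sports-1M performs better than training C3D from scratch.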


Fig. 1 Comparison of performance and model size for ResNet-152, C3D, and P3D

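The GoogLeNet-Inception family introduced the idea of asymmetric convolution: a $1\times k$ convolution stacked with a $k\times 1$ convolution replaces a single $k\times k$ convolution.

This design keeps comparable feature-extraction capability while cutting the parameter count substantially: $k^2$ parameters become $2k$.

The authors factorize 3D convolutions in the same spirit. A $k\times k\times k$ 3D convolution ($k^3$ parameters) is decomposed into a $1\times k\times k$ convolution and a $k\times 1\times 1$ convolution ($k^2+k$ parameters in total), with $k=3$ in the paper. The former extracts spatial features and the latter fuses temporal features. A further benefit is that the model can be initialized from 2D networks pretrained on ImageNet.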
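
The parameter savings described above can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `conv_params` and the channel count of 64 are assumptions, and bias terms are ignored.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weight count of a convolution with the given kernel shape (bias ignored)."""
    n = in_ch * out_ch
    for k in kernel:
        n *= k
    return n


k, c = 3, 64  # k = 3 as in the paper; c = 64 channels is an arbitrary choice

full_3d = conv_params((k, k, k), c, c)    # k^3 weights per channel pair
spatial = conv_params((1, k, k), c, c)    # k^2 weights per channel pair
temporal = conv_params((k, 1, 1), c, c)   # k weights per channel pair
pseudo_3d = spatial + temporal

print(full_3d, pseudo_3d, pseudo_3d / full_3d)
```

Per input/output channel pair, the count drops from $k^3=27$ to $k^2+k=12$. The $1\times3\times3$ part has exactly as many weights as a 2D $3\times3$ kernel, which is what makes reuse of ImageNet-pretrained 2D weights possible.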

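In addition, building on the separable 3D kernel and the 2D ResNet, the authors try serial and parallel arrangements to design several bottleneck building blocks. These blocks substantially reduce the parameter count, and are finally assembled into the deep P3D ResNet model.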


Chapter 2 Related Work

There are two main families of video representation learning methods:

(1) Methods based on hand-crafted features

  • STIP
  • Histogram of Gradient and Histogram of Optical Flow
  • 3D Histogram of Gradient
  • SIFT-3D
  • Dense trajectory features

(2) Methods based on deep learning

  • Stack CNN-based frame-level representations
  • Two-Stream
  • TSN
  • 3D CNN

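For models that do not use 3D convolution kernels, the authors argue that they fail to capture motion features across consecutive frames well.

As for C3D, which does use 3D kernels, the authors argue that the parameter and computation cost of 3D kernels limits it to only 11 layers, too shallow to have strong representational capacity.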


Chapter 3 P3D Network Architecture

The authors first review the Residual Unit of ResNet, where $h(x_t)=x_t$ is the identity shortcut and $F$ is the residual function:
$$x_{t+1}=h(x_t)+F(x_t)$$
This can also be written as:
$$(I+F)\cdot x_t=x_t+F\cdot x_t=x_t+F(x_t)=x_{t+1}$$
Building on this, the authors propose the P3D design, which replaces the 2D convolution kernels with separable 3D kernels, in the following three variants:

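  • Fig. 2 (a): the two convolutions in series;
  • Fig. 2 (b): the two convolutions in parallel;
  • Fig. 2 (c): the two convolutions in series and in parallel at once.

In (b) and (c), the output of each convolution is connected directly to the skip connection, so both can directly influence the final output of the block; in (a), only the temporal convolution is.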


Fig. 2 The three connection schemes for the basic Residual block
  • P3D-A: $(I+T\cdot S)\cdot x_t:=x_t+T(S(x_t))=x_{t+1}$
  • P3D-B: $(I+S+T)\cdot x_t:=x_t+S(x_t)+T(x_t)=x_{t+1}$
  • P3D-C: $(I+S+T\cdot S)\cdot x_t:=x_t+S(x_t)+T(S(x_t))=x_{t+1}$
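
To make the three formulas concrete, the following toy sketch treats S and T as arbitrary callables and composes them exactly as written. The function names and the scalar stand-ins for S and T are hypothetical; real blocks operate on 5D tensors and include convolution, batch normalization, and ReLU.

```python
def p3d_a(x, S, T):
    """P3D-A: x + T(S(x)), spatial then temporal in series."""
    return x + T(S(x))


def p3d_b(x, S, T):
    """P3D-B: x + S(x) + T(x), both branches applied to the input in parallel."""
    return x + S(x) + T(x)


def p3d_c(x, S, T):
    """P3D-C: x + S(x) + T(S(x)), where S feeds both the output and T."""
    s = S(x)
    return x + s + T(s)


# Toy stand-ins (assumptions for illustration only): S doubles, T adds one.
S = lambda v: 2 * v
T = lambda v: v + 1

x = 3
print(p3d_a(x, S, T), p3d_b(x, S, T), p3d_c(x, S, T))
```

Note that in `p3d_a` the output of S reaches the sum only through T, whereas in `p3d_b` and `p3d_c` it is added to the output directly.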

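The authors closely follow ResNet's Bottleneck block when designing the P3D blocks.

As a side note, a Bottleneck first reduces and then restores the channel count, which saves computation. In the ResNet family, ResNet-50, -101, and -152 use the Bottleneck block, while ResNet-18 and -34 use the Basic Residual block.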
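
A quick count shows why the bottleneck saves computation. This is a sketch with assumed channel sizes (256 outer, 64 inner, matching the 4x expansion of the ResNet bottleneck); `conv2d_params` is a hypothetical helper and biases are ignored.

```python
def conv2d_params(k, in_ch, out_ch):
    """Weight count of a k x k 2D convolution (bias ignored)."""
    return k * k * in_ch * out_ch


outer, inner = 256, 64  # assumed sizes; outer = 4 * inner

# A single 3x3 convolution applied directly at the outer width.
direct = conv2d_params(3, outer, outer)

# Bottleneck: 1x1 reduce -> 3x3 at the inner width -> 1x1 restore.
bottleneck = (conv2d_params(1, outer, inner)
              + conv2d_params(3, inner, inner)
              + conv2d_params(1, inner, outer))

print(direct, bottleneck)
```

With these sizes the bottleneck needs roughly an eighth of the weights of the direct 3x3 convolution.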


Fig. 3 The basic Residual block and the three P3D blocks

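Chapter 4 Experiments

To compare the P3D block variants, the paper builds four networks based on ResNet-50:

  • The original ResNet-50, which processes single frames
  • ResNet-50 with all Bottleneck blocks replaced by P3D-A
  • ResNet-50 with all Bottleneck blocks replaced by P3D-B
  • ResNet-50 with all Bottleneck blocks replaced by P3D-C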

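These networks are trained as follows:

  • The original ResNet-50 is fine-tuned on UCF-101. Each video is first resized to $240\times320$, then a $224\times224$ region is randomly cropped. All BN layers except the first are frozen, and a Dropout layer with a dropout rate of 0.9 is added. The video-level prediction is obtained by averaging the scores of all individual frames.
  • The three P3D networks are initialized from the pretrained ResNet-50 weights (the added temporal convolutions have no 2D counterpart, so they can only be randomly initialized) and fine-tuned on UCF-101. Each video is first resized to $182\times242$, then a non-overlapping 16-frame clip is selected and a $160\times160$ region is randomly cropped.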
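
The P3D input pipeline described above (resize, take a 16-frame clip, random crop) can be sketched as index arithmetic. This is only an illustration: the function names, the choice to drop a trailing partial clip, and the use of Python's `random` module are assumptions, not details from the paper.

```python
import random


def non_overlapping_clips(num_frames, clip_len=16):
    """Split frame indices 0..num_frames-1 into consecutive clips that share no frames.

    A trailing partial clip is dropped (an assumption of this sketch).
    """
    return [list(range(start, start + clip_len))
            for start in range(0, num_frames - clip_len + 1, clip_len)]


def random_crop_box(height=182, width=242, size=160, rng=random):
    """Pick a size x size crop inside a height x width frame.

    Returns (top, left, bottom, right) in pixel coordinates.
    """
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return top, left, top + size, left + size


clips = non_overlapping_clips(num_frames=50)
box = random_crop_box(rng=random.Random(0))
print(len(clips), clips[0][0], clips[-1][-1], box)
```

A 50-frame video yields three clips here (frames 0-15, 16-31, and 32-47); the same crop box would be applied to every frame of a clip so that the clip stays spatially aligned.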


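Fig. 4 Accuracy comparison of ResNet, P3D-A, P3D-B, P3D-C, and P3D ResNet

The P3D variants gain a clear accuracy improvement with only a small increase in parameter count. In addition, the two variants with a serial path, A and C, are more accurate than the parallel variant B, which suggests that the serial structure fuses spatial and temporal information more effectively.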

The last row of Fig. 4, P3D ResNet, is a network in which the authors mix all three block types A, B, and C in pursuit of structural diversity.

Fig. 5 The mixed architecture

Specifically, the ResNet configurations are shown in Fig. 6:


Fig. 6 ResNet architecture configurations

Taking ResNet-50 as an example, its [3,4,6,3] configuration gives 16 residual blocks in total. In P3D ResNet these 16 blocks are replaced by the A, B, and C blocks in turn, giving [ABC, ABCA, BCABCA, BCA]. This mixed structure achieves the highest accuracy.
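
The interleaving pattern follows from cycling A, B, C over a running block index. A minimal sketch (the function name `assign_block_types` is hypothetical):

```python
def assign_block_types(layers, types=("A", "B", "C")):
    """Cycle through the block types across all residual blocks, stage by stage."""
    stages, index = [], 0
    for num_blocks in layers:
        # The running index carries over between stages, so the cycle is not reset.
        stage = "".join(types[(index + i) % len(types)] for i in range(num_blocks))
        stages.append(stage)
        index += num_blocks
    return stages


print(assign_block_types([3, 4, 6, 3]))
```

The PyTorch listing in Chapter 5 applies the same modulo rule through `self.ST_struc[self.id % self.len_ST]`, although in that implementation the last stage is built from plain 2D convolutions.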


Chapter 5 Implementing P3D in PyTorch

Code reference

  qijiezhao/pseudo-3d-pytorch

Blog

  PyTorch code for the Pseudo-3D Residual Networks algorithm


The full code follows; it is best read alongside the ResNet source code.

"""
    Code reference: https://github.com/qijiezhao/pseudo-3d-pytorch
"""

from __future__ import print_function
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F
from torch.autograd import Variable
import math
from functools import partial
from torchsummary import summary

__all__ = ['P3D', 'P3D63', 'P3D131', 'P3D199']


# Spatial convolution
def conv_S(in_planes, out_planes, stride=1, padding=1):
    # As described in the paper, conv S is 1x3x3.
    # Note: the `stride` argument is ignored; the stride is fixed at 1.
    return nn.Conv3d(in_planes, out_planes, kernel_size=(1, 3, 3), stride=1,
                     padding=padding, bias=False)


# Temporal convolution
def conv_T(in_planes, out_planes, stride=1, padding=1):
    # conv T is 3x1x1.
    # Note: the `stride` argument is ignored; the stride is fixed at 1.
    return nn.Conv3d(in_planes, out_planes, kernel_size=(3, 1, 1), stride=1,
                     padding=padding, bias=False)


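def downsample_basic_block(x, planes, stride):
    # Parameter-free shortcut: subsample with a 1x1x1 average pool, then
    # zero-pad the channel dimension up to `planes`.
    out = F.avg_pool3d(x, kernel_size=1, stride=stride)
    # Allocate the padding with the same dtype and device as `out`, so no
    # explicit CUDA type check is needed.
    zero_pads = torch.zeros(out.size(0), planes - out.size(1),
                            out.size(2), out.size(3), out.size(4),
                            dtype=out.dtype, device=out.device)

    # Concatenate the tensor itself (not `.data`) so gradients still flow.
    out = torch.cat([out, zero_pads], dim=1)

    return out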


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, n_s=0, depth_3d=47, ST_struc=('A', 'B', 'C')):
        super(Bottleneck, self).__init__()
        self.downsample = downsample
        self.depth_3d = depth_3d
        self.ST_struc = ST_struc
        self.len_ST = len(self.ST_struc)

        stride_p = stride
        # If a downsample shortcut is supplied, stride over the spatial dimensions only
        if self.downsample is not None:
            stride_p = (1, 2, 2)

        if n_s < self.depth_3d:
            if n_s == 0:
                stride_p = 1
            self.conv1 = nn.Conv3d(inplanes, planes, kernel_size=1, bias=False, stride=stride_p)
            self.bn1 = nn.BatchNorm3d(planes)
        else:
            if n_s == self.depth_3d:
                stride_p = 2
            else:
                stride_p = 1
            self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, bias=False, stride=stride_p)
            self.bn1 = nn.BatchNorm2d(planes)
        # self.conv2 = nn.Conv3d(planes, planes, kernel_size=3, stride=stride,
        #                        padding=1, bias=False)
        self.id = n_s
        self.ST = list(self.ST_struc)[self.id % self.len_ST]
        if self.id < self.depth_3d:
            self.conv2 = conv_S(planes, planes, stride=1, padding=(0, 1, 1))
            self.bn2 = nn.BatchNorm3d(planes)
            #
            self.conv3 = conv_T(planes, planes, stride=1, padding=(1, 0, 0))
            self.bn3 = nn.BatchNorm3d(planes)
        else:
            self.conv_normal = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
            self.bn_normal = nn.BatchNorm2d(planes)

        if n_s < self.depth_3d:
            self.conv4 = nn.Conv3d(planes, planes * 4, kernel_size=1, bias=False)
            self.bn4 = nn.BatchNorm3d(planes * 4)
        else:
            self.conv4 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
            self.bn4 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)

        self.stride = stride

    # Serial structure (P3D-A): S followed by T
    def ST_A(self, x):
        x = self.conv2(x)   # 1x3x3
        x = self.bn2(x)
        x = self.relu(x)

        x = self.conv3(x)   # 3x1x1
        x = self.bn3(x)
        x = self.relu(x)

        return x

    # Parallel structure (P3D-B): S and T both applied to the input, outputs summed
    def ST_B(self, x):
        tmp_x = self.conv2(x)
        tmp_x = self.bn2(tmp_x)
        tmp_x = self.relu(tmp_x)

        x = self.conv3(x)
        x = self.bn3(x)
        x = self.relu(x)

        return x + tmp_x

    # Serial + parallel structure (P3D-C): S(x) + T(S(x))
    def ST_C(self, x):
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)

        tmp_x = self.conv3(x)
        tmp_x = self.bn3(tmp_x)
        tmp_x = self.relu(tmp_x)

        return x + tmp_x

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        # out = self.conv2(out)
        # out = self.bn2(out)
        # out = self.relu(out)
        if self.id < self.depth_3d:  # C3D parts:

            if self.ST == 'A':
                out = self.ST_A(out)
            elif self.ST == 'B':
                out = self.ST_B(out)
            elif self.ST == 'C':
                out = self.ST_C(out)
        else:
            out = self.conv_normal(out)  # normal is res5 part, C2D all.
            out = self.bn_normal(out)
            out = self.relu(out)

        out = self.conv4(out)
        out = self.bn4(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class P3D(nn.Module):

    def __init__(self, block, layers, modality='RGB',
                 shortcut_type='B', num_classes=400, dropout=0.5, ST_struc=('A', 'B', 'C')):
        self.inplanes = 64
        super(P3D, self).__init__()
        # self.conv1 = nn.Conv3d(3, 64, kernel_size=7, stride=(1, 2, 2),
        #                        padding=(3, 3, 3), bias=False)

        # RGB video frames have 3 input channels; optical flow has 2
        self.input_channel = 3 if modality == 'RGB' else 2  # 2 is for flow
        self.ST_struc = ST_struc

        self.conv1_custom = nn.Conv3d(self.input_channel, 64, kernel_size=(1, 7, 7), stride=(1, 2, 2),
                                      padding=(0, 3, 3), bias=False)

        # layers[0] + layers[1] + layers[2] use 3D convolutions; layers[3] uses 2D convolutions
        self.depth_3d = sum(layers[:3])  # C3D layers are only (res2,res3,res4),  res5 is C2D

        self.bn1 = nn.BatchNorm3d(64)  # bn1 is followed by conv1
        self.cnt = 0
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool3d(kernel_size=(2, 3, 3), stride=2, padding=(0, 1, 1))  # pooling layer for conv1.
        self.maxpool_2 = nn.MaxPool3d(kernel_size=(2, 1, 1), padding=0,
                                      stride=(2, 1, 1))  # pooling layer for res2, 3, 4.

        self.layer1 = self._make_layer(block, 64, layers[0], shortcut_type)
        self.layer2 = self._make_layer(block, 128, layers[1], shortcut_type, stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], shortcut_type, stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], shortcut_type, stride=2)

        self.avgpool = nn.AvgPool2d(kernel_size=(5, 5), stride=1)  # pooling layer for res5.
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv3d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm3d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

        # some private attribute
        self.input_size = (self.input_channel, 16, 160, 160)  # input of the network
        self.input_mean = [0.485, 0.456, 0.406] if modality == 'RGB' else [0.5]
        self.input_std = [0.229, 0.224, 0.225] if modality == 'RGB' else [np.mean([0.229, 0.224, 0.225])]

    @property
    def scale_size(self):
        return self.input_size[2] * 256 // 160  # assume that raw images are resized to (340, 256).

    @property
    def temporal_length(self):
        return self.input_size[1]

    @property
    def crop_size(self):
        return self.input_size[2]

    def _make_layer(self, block, planes, blocks, shortcut_type, stride=1):
        downsample = None
        stride_p = stride  # especially for downsample branch.

        # 3D convolution part
        if self.cnt < self.depth_3d:
            if self.cnt == 0:
                stride_p = 1
            else:
                stride_p = (1, 2, 2)

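            # A downsampling shortcut is needed when the stride or channel count changes
            # (Conv Block); otherwise the block keeps an identity shortcut (Identity Block)
            if stride != 1 or self.inplanes != planes * block.expansion:
                # Shortcut type A: downsample with avg_pool3d plus zero-padding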
                if shortcut_type == 'A':
                    downsample = partial(downsample_basic_block,
                                         planes=planes * block.expansion,
                                         stride=stride)
                # Otherwise: downsample with a 1x1x1 Conv3d
                else:
                    downsample = nn.Sequential(
                        nn.Conv3d(self.inplanes, planes * block.expansion,
                                  kernel_size=1, stride=stride_p, bias=False),
                        nn.BatchNorm3d(planes * block.expansion)
                    )

        # 2D convolution part
        else:
            # Same shortcut rule as in the 3D branch, built with 2D layers
            if stride != 1 or self.inplanes != planes * block.expansion:
                if shortcut_type == 'A':
                    downsample = partial(downsample_basic_block,
                                         planes=planes * block.expansion,
                                         stride=stride)
                else:
                    downsample = nn.Sequential(
                        nn.Conv2d(self.inplanes, planes * block.expansion,
                                  kernel_size=1, stride=2, bias=False),
                        nn.BatchNorm2d(planes * block.expansion)
                    )
        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, n_s=self.cnt, depth_3d=self.depth_3d,
                            ST_struc=self.ST_struc))
        # Increment the running block counter so that layers[3] is built with 2D convolutions
        self.cnt += 1

        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes, n_s=self.cnt, depth_3d=self.depth_3d, ST_struc=self.ST_struc))
            self.cnt += 1

        return nn.Sequential(*layers)

    def forward(self, x):
        # 3x16x160x160 -> 64x16x80x80 -> 64x8x40x40
        x = self.conv1_custom(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        # 64x8x40x40 -> 256x8x40x40 -> 256x4x40x40
        x = self.maxpool_2(self.layer1(x))  # Part Res2
        # 256x4x40x40 -> 512x4x20x20 -> 512x2x20x20
        x = self.maxpool_2(self.layer2(x))  # Part Res3
        # 512x2x20x20 -> 1024x2x10x10 -> 1024x1x10x10
        x = self.maxpool_2(self.layer3(x))  # Part Res4

        # Switch from 3D to 2D convolutions here
        sizes = x.size()
        # (batch, 1024, 1, 10, 10) -> (batch, 1024, 10, 10)
        x = x.view(-1, sizes[1], sizes[3], sizes[4])  # Part Res5
        # 1024x10x10 -> 2048x5x5 -> 2048x1x1
        x = self.layer4(x)
        x = self.avgpool(x)

        # Fully connected layer
        x = x.view(-1, self.fc.in_features)
        x = self.fc(self.dropout(x))

        return x


def P3D63(**kwargs):
    """Construct a P3D63 model based on a ResNet-50-3D model.
    """
    model = P3D(Bottleneck, [3, 4, 6, 3], **kwargs)
    return model


def P3D131(**kwargs):
    """Construct a P3D131 model based on a ResNet-101-3D model.
    """
    model = P3D(Bottleneck, [3, 4, 23, 3], **kwargs)
    return model


def P3D199(pretrained=False, modality='RGB', **kwargs):
    """Construct a P3D199 model based on a ResNet-152-3D model.
    """
    model = P3D(Bottleneck, [3, 8, 36, 3], modality=modality, **kwargs)
    if pretrained:
        if modality == 'RGB':
            pretrained_file = 'p3d_rgb_199.checkpoint.pth.tar'
        elif modality == 'Flow':
            pretrained_file = 'p3d_flow_199.checkpoint.pth.tar'
        else:
            # Without this branch, pretrained_file would be undefined below.
            raise ValueError("No pretrained weights for modality: {}".format(modality))
        weights = torch.load(pretrained_file)['state_dict']
        model.load_state_dict(weights)
    return model


# custom operation
def get_optim_policies(model=None, modality='RGB', enable_pbn=True):
    '''
    first conv:         weight --> conv weight
                        bias   --> conv bias
    normal action:      weight --> non-first conv + fc weight
                        bias   --> non-first conv + fc bias
    bn:                 the first BatchNorm3d only when enable_pbn is True (otherwise all), plus every BatchNorm2d.

    '''
    first_conv_weight = []
    first_conv_bias = []
    normal_weight = []
    normal_bias = []
    bn = []

    if model is None:
        raise ValueError("model must not be None")

    conv_cnt = 0
    bn_cnt = 0
    for m in model.modules():
        if isinstance(m, torch.nn.Conv3d) or isinstance(m, torch.nn.Conv2d):
            ps = list(m.parameters())
            conv_cnt += 1
            if conv_cnt == 1:
                first_conv_weight.append(ps[0])
                if len(ps) == 2:
                    first_conv_bias.append(ps[1])
            else:
                normal_weight.append(ps[0])
                if len(ps) == 2:
                    normal_bias.append(ps[1])
        elif isinstance(m, torch.nn.Linear):
            ps = list(m.parameters())
            normal_weight.append(ps[0])
            if len(ps) == 2:
                normal_bias.append(ps[1])

        elif isinstance(m, torch.nn.BatchNorm3d):
            bn_cnt += 1
            # later BN's are frozen
            if not enable_pbn or bn_cnt == 1:
                bn.extend(list(m.parameters()))
        elif isinstance(m, torch.nn.BatchNorm2d):
            bn.extend(list(m.parameters()))
        elif len(m._modules) == 0:
            if len(list(m.parameters())) > 0:
                raise ValueError("New atomic module type: {}. Need to give it a learning policy".format(type(m)))

    slow_rate = 0.7
    n_fore = int(len(normal_weight) * slow_rate)
    slow_feat = normal_weight[:n_fore]  # finetune slowly.
    slow_bias = normal_bias[:n_fore]
    normal_feat = normal_weight[n_fore:]
    normal_bias = normal_bias[n_fore:]

    return [
        {'params': first_conv_weight, 'lr_mult': 5 if modality == 'Flow' else 1, 'decay_mult': 1,
         'name': "first_conv_weight"},
        {'params': first_conv_bias, 'lr_mult': 10 if modality == 'Flow' else 2, 'decay_mult': 0,
         'name': "first_conv_bias"},
        {'params': slow_feat, 'lr_mult': 1, 'decay_mult': 1,
         'name': "slow_feat"},
        {'params': slow_bias, 'lr_mult': 2, 'decay_mult': 0,
         'name': "slow_bias"},
        {'params': normal_feat, 'lr_mult': 1, 'decay_mult': 1,
         'name': "normal_feat"},
        {'params': normal_bias, 'lr_mult': 2, 'decay_mult': 0,
         'name': "normal_bias"},
        {'params': bn, 'lr_mult': 1, 'decay_mult': 0,
         'name': "BN scale/shift"},
    ]

def test():
    model = P3D63(num_classes=400)
    # Create the model and move it to the GPU if one is available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    summary(model, (3, 16, 160, 160))

if __name__ == '__main__':
    # # Build the model: calling P3D199 gives the 199-layer P3D network
    # # model = P3D199(pretrained=True, num_classes=400)
    # model = P3D63(num_classes=400)
    # model = model.cuda()
    # # Generate random input data. The first dimension is 10, meaning the input holds 10 clips;
    # # each clip has 16 frames, and each frame is a 160x160 3-channel image.
    # data = torch.autograd.Variable(
    #     torch.rand(10, 3, 16, 160, 160)).cuda()  # if modality=='Flow', please change the 2nd dimension 3==>2
    # out = model(data)
    # print(out.size(), out)
    test()
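
The shape comments in `forward` can be double-checked without running the model, using the standard output-size rule for convolution and pooling, floor((n + 2p - k) / s) + 1. The sketch below is a hand-written trace under that rule; the helper name `out_size` is hypothetical, and the kernel, stride, and padding values are copied from the listing above for a 16x160x160 input.

```python
def out_size(n, kernel, stride, padding=0):
    """Output length along one axis: floor((n + 2*padding - kernel) / stride) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


t, hw = 16, 160  # number of frames and spatial side of the input clip

# conv1_custom: kernel (1,7,7), stride (1,2,2), padding (0,3,3)
t, hw = out_size(t, 1, 1, 0), out_size(hw, 7, 2, 3)
# maxpool: kernel (2,3,3), stride 2, padding (0,1,1)
t, hw = out_size(t, 2, 2, 0), out_size(hw, 3, 2, 1)
trace = [(t, hw)]

# layer1..layer3, each followed by maxpool_2 (kernel (2,1,1), stride (2,1,1)).
# layer1 keeps the spatial size; layer2 and layer3 halve it with a strided 1x1x1 conv.
for spatial_stride in (1, 2, 2):
    hw = out_size(hw, 1, spatial_stride)
    t = out_size(t, 2, 2)
    trace.append((t, hw))

# layer4 is 2D: a stride-2 1x1 conv, then a 5x5 average pool.
hw = out_size(hw, 1, 2)
final = out_size(hw, 5, 1)
print(trace, hw, final)
```

The 1x3x3 and 3x1x1 convolutions inside each block use padding that preserves the size, so they are omitted from the trace. The temporal axis reaches length 1 after layer3, which is what allows the `view` call to drop it before the 2D stage.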

