Understanding R(2+1)D and Its Implementation in the MindSpore Framework

1. Introduction to the R(2+1)D Algorithm

Paper: [1711.11248] A Closer Look at Spatiotemporal Convolutions for Action Recognition (arxiv.org)

In their CVPR 2018 paper "A Closer Look at Spatiotemporal Convolutions for Action Recognition", Tran et al. proposed R(2+1)D and showed that factorizing a 3D convolution kernel into separate spatial and temporal components significantly improves accuracy. The (2+1)D convolution block in R(2+1)D splits an $N \times t \times d \times d$ 3D convolution into an $N \times 1 \times d \times d$ 2D spatial convolution and an $M \times t \times 1 \times 1$ 1D temporal convolution, where $N$ and $M$ are the numbers of kernels. The hyperparameter $M$ determines the dimensionality of the intermediate subspace onto which the signal is projected between the spatial and the temporal convolution. The paper sets $M$ to
$$M_{i} = \left\lfloor \frac{t\,d^{2}\,N_{i-1}\,N_{i}}{d^{2}\,N_{i-1} + t\,N_{i}} \right\rfloor$$

where $i$ indexes the $i$-th convolutional block of the residual network; choosing $M_i$ this way keeps the number of parameters of the (2+1)D block approximately equal to that of the corresponding 3D convolution.
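As a quick sanity check of this choice, the following plain-Python sketch (values chosen for illustration only: t = 3, d = 3, N_{i-1} = N_i = 64) shows that the (2+1)D factorization with M_i mid-channels has the same parameter count as the full 3D convolution:

import math

t, d = 3, 3              # temporal and spatial kernel sizes
n_prev, n_cur = 64, 64   # illustrative input / output channel counts

# Mid-channel count from the paper's formula
m = math.floor(t * d ** 2 * n_prev * n_cur / (d ** 2 * n_prev + t * n_cur))
print(m)                                  # 144

params_3d = n_prev * t * d * d * n_cur    # full 3D convolution
params_2plus1d = n_prev * 1 * d * d * m + m * t * 1 * 1 * n_cur  # spatial conv + temporal conv
print(params_3d, params_2plus1d)          # 110592 110592 -- the counts match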

[Figure: a full 3D convolution vs. its (2+1)D factorization into a 2D spatial convolution followed by a 1D temporal convolution]
Compared with full 3D convolution, the (2+1)D decomposition has two advantages. First, although the number of parameters is unchanged, the number of nonlinearities in the network doubles thanks to the extra activation function between the 2D and 1D convolutions in each block, and more nonlinearities increase the complexity of the functions that can be represented. Second, forcing the 3D convolution into separate spatial and temporal components makes optimization easier: a (2+1)D network reaches lower training error than a 3D network with the same number of parameters.

The table below shows the architectures of the 18-layer and 34-layer R3D networks; replacing the 3D convolutions in R3D with (2+1)D convolutions yields the R(2+1)D network of the corresponding depth.

[Table: architectures of the 18-layer and 34-layer R3D networks]
The experiments compare the action recognition accuracy of different forms of convolution on Kinetics, as shown in the table below. All models are based on ResNet-18 and trained from scratch on 8-frame or 16-frame clip inputs; R(2+1)D outperforms all other models.

[Table: action recognition accuracy of different convolution types on Kinetics]
The comparison with state-of-the-art methods on Kinetics is shown in the table below. When trained from scratch on RGB input, R(2+1)D outperforms I3D by 4.5%, and R(2+1)D pretrained on Sports-1M also beats I3D pretrained on ImageNet by 2.2%.

[Table: comparison with state-of-the-art methods on Kinetics]

2. MindSpore Implementation of R(2+1)D

Description of the main components

Data preprocessing

  1. GeneratorDataset reads the video dataset files and outputs batches (batch_size=16) of three-channel clips with the specified number of frames.

  2. Data preprocessing includes shuffling and normalization.

  3. Data augmentation includes random cropping via the VideoRandomCrop class, resizing via the VideoResize class, and random horizontal flipping via the VideoRandomHorizontalFlip class.

Model backbone

  1. In R2Plus1d18, the input first passes through a (2+1)D convolution stem, then through 4 residual stages built from (2+1)D convolution modules, and finally through an average pooling layer, a flatten layer and a fully connected layer.

  2. The initial (2+1)D convolution stem is a Conv3d with kernel size (1, 7, 7) followed by a Conv3d with kernel size (3, 1, 1), with Batch Normalization and ReLU layers between the convolutions.

  3. R2Plus1d18 contains 4 residual stages, each of whose blocks is stacked twice. Every block consists of two (2+1)D convolution modules, and each (2+1)D convolution is a Conv3d with kernel size (1, 3, 3) followed by a Conv3d with kernel size (3, 1, 1), again with Batch Normalization and ReLU layers between the convolutions; the input and output of each block are joined by a residual connection, as the sketch below illustrates.
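To make the block structure concrete, here is a minimal, self-contained MindSpore sketch of one such residual (2+1)D block (stride 1, identity shortcut). It is an illustration only, not the repository's Inflate3D/ResidualBlockBase3D implementation shown further below; the mid-channel value 144 is simply the M computed above for 64 input and output channels.

import numpy as np
import mindspore as ms
from mindspore import nn


def conv2plus1d(in_ch, mid_ch, out_ch):
    """(2+1)D convolution: a (1, 3, 3) spatial conv followed by a (3, 1, 1) temporal conv."""
    return nn.SequentialCell([
        nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), pad_mode='pad',
                  padding=(0, 0, 1, 1, 1, 1), has_bias=False),
        nn.BatchNorm3d(mid_ch),
        nn.ReLU(),
        nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), pad_mode='pad',
                  padding=(1, 1, 0, 0, 0, 0), has_bias=False),
    ])


class Res2Plus1dBlock(nn.Cell):
    """Two (2+1)D convolution modules joined to the input by a residual connection."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        self.conv1 = conv2plus1d(channels, mid_channels, channels)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = conv2plus1d(channels, mid_channels, channels)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU()

    def construct(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


# On a platform where Conv3d is supported (e.g. GPU):
x = ms.Tensor(np.random.randn(1, 64, 8, 56, 56), ms.float32)
print(Res2Plus1dBlock(64, 144)(x).shape)  # (1, 64, 8, 56, 56)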

The roles of the individual classes used to build the model are:

  • The Unit3D class implements a Conv3d → BN → ReLU → Pooling structure, where the BN, ReLU and Pooling layers are optional.
class Unit3D(nn.Cell):
    """
    Conv3d fused with normalization and activation blocks definition.

    Args:
        in_channels (int):  The number of channels of input frame images.
        out_channels (int):  The number of channels of output frame images.
        kernel_size (tuple): The size of the conv3d kernel.
        stride (Union[int, Tuple[int]]): Stride size for the first convolutional layer. Default: 1.
        pad_mode (str): Specifies padding mode. The optional values are "same", "valid", "pad".
            Default: "pad".
        padding (Union[int, Tuple[int]]): Implicit paddings on both sides of the input x.
            If `pad_mode` is "pad" and `padding` is not specified by user, then the padding
            size will be `(kernel_size - 1) // 2` for C, H, W channel.
        dilation (Union[int, Tuple[int]]): Specifies the dilation rate to use for dilated
            convolution. Default: 1
        group (int): Splits filter into groups, in_channels and out_channels must be divisible
            by the number of groups. Default: 1.
        activation (Optional[nn.Cell]): Activation function which will be stacked on top of the
            normalization layer (if not None), otherwise on top of the conv layer. Default: nn.ReLU.
        norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution
            layer. Default: nn.BatchNorm3d.
        pooling (Optional[nn.Cell]): Pooling layer (if not None) will be stacked on top of all the
            former layers. Default: None.
        has_bias (bool): Whether to use Bias.

    Returns:
        Tensor, output tensor.

    Examples:
        Unit3D(in_channels=in_channels, out_channels=out_channels[0], kernel_size=(1, 1, 1))
    """

    def __init__(self,
                 in_channels: int,
                 out_channels: int,
                 kernel_size: Union[int, Tuple[int]] = 3,
                 stride: Union[int, Tuple[int]] = 1,
                 pad_mode: str = 'pad',
                 padding: Union[int, Tuple[int]] = 0,
                 dilation: Union[int, Tuple[int]] = 1,
                 group: int = 1,
                 activation: Optional[nn.Cell] = nn.ReLU,
                 norm: Optional[nn.Cell] = nn.BatchNorm3d,
                 pooling: Optional[nn.Cell] = None,
                 has_bias: bool = False
                 ) -> None:
        super().__init__()
        if pad_mode == 'pad' and padding == 0:
            padding = tuple((k - 1) // 2 for k in six_padding(kernel_size))
        else:
            padding = 0
        layers = [nn.Conv3d(in_channels=in_channels,
                            out_channels=out_channels,
                            kernel_size=kernel_size,
                            stride=stride,
                            pad_mode=pad_mode,
                            padding=padding,
                            dilation=dilation,
                            group=group,
                            has_bias=has_bias)
                  ]

        if norm:
            layers.append(norm(out_channels))
        if activation:
            layers.append(activation())

        self.pooling = None
        if pooling:
            self.pooling = pooling

        self.features = nn.SequentialCell(layers)

    def construct(self, x):
        """ construct unit3d"""
        output = self.features(x)
        if self.pooling:
            output = self.pooling(output)
        return output
  • The Inflate3D class uses Unit3D to implement the (2+1)D convolution module.
class Inflate3D(nn.Cell):
    """
    Inflate3D block definition.

    Args:
        in_channel (int):  The number of channels of input frame images.
        out_channel (int):  The number of channels of output frame images.
        mid_channel (int): The number of channels of inner frame images.
        kernel_size (tuple): The size of the spatial-temporal convolutional layer kernels.
        stride (Union[int, Tuple[int]]): Stride size for the second convolutional layer. Default: 1.
        conv2_group (int): Splits filter into groups for the second conv layer,
            in_channels and out_channels
            must be divisible by the number of groups. Default: 1.
        norm (Optional[nn.Cell]): Norm layer that will be stacked on top of the convolution
            layer. Default: nn.BatchNorm3d.
        activation (List[Optional[Union[nn.Cell, str]]]): Activation function which will be stacked
            on top of the normalization layer (if not None), otherwise on top of the conv layer.
            Default: nn.ReLU, None.
        inflate (int): Selects the kernel sizes of the two conv3d layers (0: spatial-only,
            1: separate temporal and spatial convolutions, 2: full 3D).

    Returns:
        Tensor, output tensor.

    Examples:
        >>> from mindvision.msvideo.models.blocks import Inflate3D
        >>> Inflate3D(3, 64, 64)
    """

    def __init__(self,
                 in_channel: int,
                 out_channel: int,
                 mid_channel: int = 0,
                 stride: tuple = (1, 1, 1),
                 kernel_size: tuple = (3, 3, 3),
                 conv2_group: int = 1,
                 norm: Optional[nn.Cell] = nn.BatchNorm3d,
                 activation: List[Optional[Union[nn.Cell, str]]] = (nn.ReLU, None),
                 inflate: int = 1,
                 ):
        super(Inflate3D, self).__init__()
        if not norm:
            norm = nn.BatchNorm3d
        self.in_channel = in_channel
        if mid_channel == 0:
            self.mid_channel = (in_channel * out_channel * kernel_size[1] * kernel_size[2] * 3) // \
                               (in_channel * kernel_size[1] * kernel_size[2] + 3 * out_channel)
        else:
            self.mid_channel = mid_channel
        self.inflate = inflate
        if self.inflate == 0:
            conv1_kernel_size = (1, 1, 1)
            conv2_kernel_size = (1, kernel_size[1], kernel_size[2])
        elif self.inflate == 1:
            conv1_kernel_size = (kernel_size[0], 1, 1)
            conv2_kernel_size = (1, kernel_size[1], kernel_size[2])
        elif self.inflate == 2:
            conv1_kernel_size = (1, 1, 1)
            conv2_kernel_size = (kernel_size[0], kernel_size[1], kernel_size[2])
        self.conv1 = Unit3D(
            self.in_channel,
            self.mid_channel,
            stride=(1, 1, 1),
            kernel_size=conv1_kernel_size,
            norm=norm,
            activation=activation[0])
        self.conv2 = Unit3D(
            self.mid_channel,
            self.mid_channel,
            stride=stride,
            kernel_size=conv2_kernel_size,
            group=conv2_group,
            norm=norm,
            activation=activation[1])

    def construct(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        return x
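Note that when mid_channel is left at 0, Inflate3D derives it with the same expression as the paper's M_i formula, with t = 3 and d given by the spatial kernel size. A quick plain-Python check with illustrative values:

in_channel, out_channel = 64, 64   # illustrative channel counts
kernel_size = (3, 3, 3)

# The expression used in Inflate3D when mid_channel == 0
mid_channel = (in_channel * out_channel * kernel_size[1] * kernel_size[2] * 3) // \
              (in_channel * kernel_size[1] * kernel_size[2] + 3 * out_channel)
print(mid_channel)   # 144, matching the paper's M_i for t=3, d=3, N_{i-1}=N_i=64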
  • The ResNet3D class implements the structure of a Unit3D stem and max pooling followed by 4 residual stages; the number of blocks stacked in each stage can be specified via a parameter.
class ResNet3D(nn.Cell):
    """
    ResNet3D architecture.

    Args:
        block (Optional[nn.Cell]): The block for the network.
        layer_nums (Tuple[int]): The numbers of block in different layers.
        stage_channels (Tuple[int]): Output channel for every res stage.
            Default: [64, 128, 256, 512].
        stage_strides (Tuple[Tuple[int]]): Strides for every res stage.
            Default:[[1, 1, 1],
                     [1, 2, 2],
                     [1, 2, 2],
                     [1, 2, 2]].
        group (int): The number of Group convolutions. Default: 1.
        base_width (int): The width of per group. Default: 64.
        norm (nn.Cell, optional): The module specifying the normalization layer to use.
            Default: None.
        down_sample (nn.Cell, optional): Downsample cell used in the shortcut of each res stage;
            it projects the input feature to the output channel count. Default: Unit3D.
        kwargs (dict, optional): Key arguments for "make_res_layer" and resblocks.
    Inputs:
        - **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, T_{in}, H_{in}, W_{in})`.

    Outputs:
        Tensor of shape :math:`(N, 2048, 7, 7, 7)`

    Supported Platforms:
        ``GPU``

    Examples:
        >>> import numpy as np
        >>> import mindspore as ms
        >>> from mindvision.msvideo.models.backbones import ResNet3D, ResidualBlock3D
        >>> net = ResNet3D(ResidualBlock3D, [3, 4, 23, 3])
        >>> x = ms.Tensor(np.ones([1, 3, 16, 224, 224]), ms.float32)
        >>> output = net(x)
        >>> print(output.shape)
        (1, 2048, 7, 7)

    About ResNet:

    The ResNet is to ease the training of networks that are substantially deeper than
        those used previously.
    The model explicitly reformulate the layers as learning residual functions with
        reference to the layer inputs, instead of learning unreferenced functions.

    """

    def __init__(self,
                 block: Optional[nn.Cell],
                 layer_nums: Tuple[int],
                 stage_channels: Tuple[int] = (64, 128, 256, 512),
                 stage_strides: Tuple[Tuple[int]] = ((1, 1, 1),
                                                     (1, 2, 2),
                                                     (1, 2, 2),
                                                     (1, 2, 2)),
                 group: int = 1,
                 base_width: int = 64,
                 norm: Optional[nn.Cell] = None,
                 down_sample: Optional[nn.Cell] = Unit3D,
                 **kwargs
                 ) -> None:
        super().__init__()
        if not norm:
            norm = nn.BatchNorm3d
        self.norm = norm
        self.in_channels = stage_channels[0]
        self.group = group
        self.base_with = base_width
        self.down_sample = down_sample
        self.conv1 = Unit3D(3, self.in_channels, kernel_size=7, stride=2, norm=norm)
        self.max_pool = ops.MaxPool3D(kernel_size=3, strides=2, pad_mode='same')
        self.layer1 = self._make_layer(
            block,
            stage_channels[0],
            layer_nums[0],
            stride=stage_strides[0],
            norm=self.norm,
            **kwargs)
        self.layer2 = self._make_layer(
            block,
            stage_channels[1],
            layer_nums[1],
            stride=stage_strides[1],
            norm=self.norm,
            **kwargs)
        self.layer3 = self._make_layer(
            block,
            stage_channels[2],
            layer_nums[2],
            stride=stage_strides[2],
            norm=self.norm,
            **kwargs)
        self.layer4 = self._make_layer(
            block,
            stage_channels[3],
            layer_nums[3],
            stride=stage_strides[3],
            norm=self.norm,
            **kwargs)

    def _make_layer(self,
                    block: Optional[nn.Cell],
                    channel: int,
                    block_nums: int,
                    stride: Tuple[int] = (1, 2, 2),
                    norm: Optional[nn.Cell] = nn.BatchNorm3d,
                    **kwargs):
        """Block layers."""
        down_sample = None
        if stride[1] != 1 or self.in_channels != channel * block.expansion:
            down_sample = self.down_sample(
                self.in_channels,
                channel * block.expansion,
                kernel_size=1,
                stride=stride,
                norm=norm,
                activation=None)
        self.stride = stride
        bkwargs = [{} for _ in range(block_nums)]  # block specified key word args
        temp_args = kwargs.copy()
        for pname, pvalue in temp_args.items():
            if isinstance(pvalue, (list, tuple)):
                Validator.check_equal_int(len(pvalue), block_nums, f'len({pname})')
                for idx, v in enumerate(pvalue):
                    bkwargs[idx][pname] = v
                kwargs.pop(pname)
        layers = []
        layers.append(
            block(
                self.in_channels,
                channel,
                stride=self.stride,
                down_sample=down_sample,
                group=self.group,
                base_width=self.base_with,
                norm=norm,
                **(bkwargs[0]),
                **kwargs
            )
        )
        self.in_channels = channel * block.expansion
        for i in range(1, block_nums):
            layers.append(
                block(self.in_channels,
                      channel,
                      stride=(1, 1, 1),
                      group=self.group,
                      base_width=self.base_with,
                      norm=norm,
                      **(bkwargs[i]),
                      **kwargs
                      )
            )
        return nn.SequentialCell(layers)

    def construct(self, x):
        """Resnet3D construct."""
        x = self.conv1(x)
        x = self.max_pool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        return x
  • The R2Plus1dNet class inherits from ResNet3D, mainly reusing its 4 residual stages: the input passes through a (2+1)D stem, then the 4 residual stages, and finally an average pooling layer, a flatten layer and a fully connected layer.
class R2Plus1dNet(ResNet3D):
    """Generic R(2+1)d generator.

    Args:
        block (Optional[nn.Cell]): The block for the network.
        layer_nums (Tuple[int]): The numbers of block in different layers.
        stage_channels (Tuple[int]): Output channel for every res stage. Default: (64, 128, 256, 512).
        stage_strides (Tuple[Tuple[int]]): Strides for every res stage.
            Default:((1, 1, 1),
                     (2, 2, 2),
                     (2, 2, 2),
                     (2, 2, 2)).
        conv12 (nn.Cell, optional): Conv1 and conv2 config in resblock. Default: Conv2Plus1D.
        base_width (int): The width of per group. Default: 64.
        norm (nn.Cell, optional): The module specifying the normalization layer to use. Default: None.
        num_classes(int): Number of categories in the action recognition dataset.
        keep_prob(float): Dropout probability in classification stage.
        kwargs (dict, optional): Key arguments for "make_res_layer" and resblocks.

    Returns:
        Tensor, output tensor.

    Examples:
        >>> from mindvision.msvideo.models.backbones.r2plus1d import *
        >>> from mindvision.msvideo.models.backbones.resnet3d import ResidualBlockBase3D
        >>> data = Tensor(np.random.randn(2, 3, 16, 112, 112), dtype=mindspore.float32)
        >>>
        >>> net = R2Plus1dNet(block=ResidualBlockBase3D, layer_nums=[2, 2, 2, 2])
        >>>
        >>> predict = net(data)
        >>> print(predict.shape)
    """

    def __init__(self,
                 block: Optional[nn.Cell],
                 layer_nums: Tuple[int],
                 stage_channels: Tuple[int] = (64, 128, 256, 512),
                 stage_strides: Tuple[Tuple[int]] = ((1, 1, 1),
                                                     (2, 2, 2),
                                                     (2, 2, 2),
                                                     (2, 2, 2)),
                 num_classes: int = 400,
                 **kwargs) -> None:
        super().__init__(block=block,
                         layer_nums=layer_nums,
                         stage_channels=stage_channels,
                         stage_strides=stage_strides,
                         conv12=Conv2Plus1d,
                         **kwargs)
        self.conv1 = nn.SequentialCell([nn.Conv3d(3, 45,
                                                  kernel_size=(1, 7, 7),
                                                  stride=(1, 2, 2),
                                                  pad_mode='pad',
                                                  padding=(0, 0, 3, 3, 3, 3),
                                                  has_bias=False),
                                        nn.BatchNorm3d(45),
                                        nn.ReLU(),
                                        nn.Conv3d(45, 64,
                                                  kernel_size=(3, 1, 1),
                                                  stride=(1, 1, 1),
                                                  pad_mode='pad',
                                                  padding=(1, 1, 0, 0, 0, 0),
                                                  has_bias=False),
                                        nn.BatchNorm3d(64),
                                        nn.ReLU()])
        self.avgpool = AdaptiveAvgPool3D((1, 1, 1))
        self.flatten = nn.Flatten()
        self.classifier = nn.Dense(stage_channels[-1] * block.expansion,
                                   num_classes)
        # init weights
        self._initialize_weights()

    def construct(self, x):
        """R2Plus1dNet construct."""
        x = self.conv1(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = self.flatten(x)
        x = self.classifier(x)
        return x

    def _initialize_weights(self):
        """
        Init the weight of Conv3d and Dense in the net.
        """
        for _, cell in self.cells_and_names():
            if isinstance(cell, nn.Conv3d):
                cell.weight.set_data(init.initializer(
                    init.HeNormal(math.sqrt(5), mode='fan_out', nonlinearity='relu'),
                    cell.weight.shape, cell.weight.dtype))
                if cell.bias:
                    cell.bias.set_data(init.initializer(
                        init.Zero(), cell.bias.shape, cell.bias.dtype))
            elif isinstance(cell, nn.BatchNorm2d):
                cell.gamma.set_data(init.initializer(
                    init.One(), cell.gamma.shape, cell.gamma.dtype))
                cell.beta.set_data(init.initializer(
                    init.Zero(), cell.beta.shape, cell.beta.dtype))
  • The R2Plus1d18 class inherits from R2Plus1dNet; its main role is to specify how many times the residual block is stacked in each stage, which here is set to two.
class R2Plus1d18(R2Plus1dNet):
    """
    The class of R2Plus1d-18 uses the registration mechanism to register,
    need to use the yaml configuration file to call.
    """

    def __init__(self, **kwargs):
        super(R2Plus1d18, self).__init__(block=ResidualBlockBase3D,
                                         layer_nums=(2, 2, 2, 2),
                                         **kwargs)
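As a quick smoke test (a sketch that assumes the msvideo package from this repository is installed; the import path matches the training example in the next section), the network maps a clip batch to class logits:

import numpy as np
import mindspore as ms
from msvideo.models.r2plus1d import R2Plus1d18

net = R2Plus1d18(num_classes=400)
x = ms.Tensor(np.random.randn(2, 3, 16, 112, 112), ms.float32)   # (N, C, T, H, W)
logits = net(x)
print(logits.shape)   # expected: (2, 400)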

3. Runnable Example

Notebook file link

Dataset preparation

The code repository uses the Kinetics400 dataset for training and validation.

Pretrained model

The pretrained model was trained on the Kinetics400 dataset; download: r2plus1d18_kinetic400.ckpt

Environment setup

git clone https://gitee.com/yanlq46462828/zjut_mindvideo.git
cd zjut_mindvideo

# Please first install mindspore according to instructions on the official website: https://www.mindspore.cn/install

pip install -r requirements.txt
pip install -e .

Training pipeline

from mindspore import nn
from mindspore import context, load_checkpoint, load_param_into_net
from mindspore.context import ParallelMode
from mindspore.communication import init, get_rank, get_group_size
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits

from msvideo.utils.check_param import Validator, Rel
Dataset loading

The Kinetics400 dataset is loaded through the Kinetic400 class, which is built on top of VideoDataset.

from msvideo.data.kinetics400 import Kinetic400
# Data Pipeline.
dataset = Kinetic400(path='/home/publicfile/kinetics-400',
                    split="train",
                    seq=32,
                    num_parallel_workers=1,
                    shuffle=True,
                    batch_size=6,
                    repeat_num=1)
ckpt_save_dir = './r2plus1d'
/home/publicfile/kinetics-400/cls2index.json
Data processing

VideoRescale rescales the video pixel values, VideoResize resizes the frames, VideoRandomCrop randomly crops the resized video, VideoRandomHorizontalFlip flips it horizontally with a given probability, VideoReOrder permutes the dimensions, and VideoNormalize normalizes the result.

from msvideo.data.transforms import VideoRandomCrop, VideoRandomHorizontalFlip, VideoRescale
from msvideo.data.transforms import VideoNormalize, VideoResize, VideoReOrder

transforms = [VideoRescale(shift=0.0),
                VideoResize([128, 171]),
                VideoRandomCrop([112, 112]),
                VideoRandomHorizontalFlip(0.5),
                VideoReOrder([3, 0, 1, 2]),
                VideoNormalize(mean=[0.43216, 0.394666, 0.37645],
                                std=[0.22803, 0.22145, 0.216989])]
dataset.transform = transforms
dataset_train = dataset.run()
Validator.check_int(dataset_train.get_dataset_size(), 0, Rel.GT)
step_size = dataset_train.get_dataset_size()
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-10:30:59.929.412 [mindspore/dataset/core/validator_helpers.py:804] 'Compose' from mindspore.dataset.transforms.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'Compose' from mindspore.dataset.transforms instead.
Network construction
from msvideo.models.r2plus1d import R2Plus1d18
# Create model
network = R2Plus1d18(num_classes=400)
from msvideo.schedule.lr_schedule import warmup_cosine_annealing_lr_v1
# Set learning rate scheduler.
learning_rate = warmup_cosine_annealing_lr_v1(lr=0.01,
                                                steps_per_epoch=step_size,
                                                warmup_epochs=4,
                                                max_epoch=100,
                                                t_max=100,
                                                eta_min=0)
# Define optimizer.
network_opt = nn.Momentum(network.trainable_params(),
                            learning_rate=learning_rate,
                            momentum=0.9,
                            weight_decay=0.00004)
# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
# Set the checkpoint config for the network.
ckpt_config = CheckpointConfig(
        save_checkpoint_steps=step_size,
        keep_checkpoint_max=10)
ckpt_callback = ModelCheckpoint(prefix='r2plus1d_kinetics400',
                                directory=ckpt_save_dir,
                                config=ckpt_config)
# Init the model.
model = Model(network, loss_fn=network_loss, optimizer=network_opt, metrics={'acc'})
# Begin to train.
print('[Start training `{}`]'.format('r2plus1d_kinetics400'))
print("=" * 80)
model.train(1,
            dataset_train,
            callbacks=[ckpt_callback, LossMonitor()],
            dataset_sink_mode=False)
print('[End of training `{}`]'.format('r2plus1d_kinetics400'))
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-10:41:43.490.637 [mindspore/dataset/core/validator_helpers.py:804] 'Compose' from mindspore.dataset.transforms.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'Compose' from mindspore.dataset.transforms instead.
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-10:41:43.498.663 [mindspore/dataset/core/validator_helpers.py:804] 'Compose' from mindspore.dataset.transforms.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'Compose' from mindspore.dataset.transforms instead.

[Start training `r2plus1d_kinetics400`]
================================================================================
epoch: 1 step: 1, loss is 5.998835563659668
epoch: 1 step: 2, loss is 5.921803951263428
epoch: 1 step: 3, loss is 6.024421691894531
epoch: 1 step: 4, loss is 6.08278751373291
epoch: 1 step: 5, loss is 6.014780044555664
epoch: 1 step: 6, loss is 5.945815086364746
epoch: 1 step: 7, loss is 6.078174114227295
epoch: 1 step: 8, loss is 6.0565361976623535
epoch: 1 step: 9, loss is 5.952683448791504
epoch: 1 step: 10, loss is 6.033120632171631
epoch: 1 step: 11, loss is 6.05575704574585
epoch: 1 step: 12, loss is 5.9879350662231445
epoch: 1 step: 13, loss is 6.006839275360107
epoch: 1 step: 14, loss is 5.9968180656433105
epoch: 1 step: 15, loss is 5.971335411071777
epoch: 1 step: 16, loss is 6.0620856285095215
epoch: 1 step: 17, loss is 6.081112861633301
epoch: 1 step: 18, loss is 6.106649398803711
epoch: 1 step: 19, loss is 6.095144271850586
epoch: 1 step: 20, loss is 6.00246000289917
epoch: 1 step: 21, loss is 6.061524868011475
epoch: 1 step: 22, loss is 6.046009063720703
epoch: 1 step: 23, loss is 5.997835159301758
epoch: 1 step: 24, loss is 6.007784366607666
epoch: 1 step: 25, loss is 5.946590423583984
epoch: 1 step: 26, loss is 5.9461164474487305
epoch: 1 step: 27, loss is 5.9034929275512695
epoch: 1 step: 28, loss is 5.925591945648193
epoch: 1 step: 29, loss is 6.176599979400635
......

Evaluation pipeline

from mindspore import context
from msvideo.data.kinetics400 import Kinetic400

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

# Data Pipeline.
dataset_eval = Kinetic400("/home/publicfile/kinetics-400",
                            split="val",
                            seq=32,
                            seq_mode="interval",
                            num_parallel_workers=1,
                            shuffle=False,
                            batch_size=8,
                            repeat_num=1)
/home/publicfile/kinetics-400/cls2index.json
from msvideo.data.transforms import VideoCenterCrop, VideoRescale, VideoReOrder
from msvideo.data.transforms import VideoNormalize, VideoResize

transforms = [VideoResize([128, 171]),
                VideoRescale(shift=0.0),
                VideoCenterCrop([112, 112]),
                VideoReOrder([3, 0, 1, 2]),
                VideoNormalize(mean=[0.43216, 0.394666, 0.37645],
                                 std=[0.22803, 0.22145, 0.216989])]
dataset_eval.transform = transforms
dataset_eval = dataset_eval.run()
from mindspore import nn
from mindspore import context, load_checkpoint, load_param_into_net
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from msvideo.utils.callbacks import EvalLossMonitor
from msvideo.models.r2plus1d import R2Plus1d18

# Create model
network = R2Plus1d18(num_classes=400)

# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")

param_dict = load_checkpoint('/home/zhengs/r2plus1d/r2plus1d18_kinetic400.ckpt')
load_param_into_net(network, param_dict)

# Define eval_metrics.
eval_metrics = {'Loss': nn.Loss(),
                'Top_1_Accuracy': nn.Top1CategoricalAccuracy(),
                'Top_5_Accuracy': nn.Top5CategoricalAccuracy()}


# Init the model.
model = Model(network, loss_fn=network_loss, metrics=eval_metrics)

print_cb = EvalLossMonitor(model)

# Begin to eval.
print('[Start eval `{}`]'.format('r2plus1d_kinetics400'))
result = model.eval(dataset_eval,
                    callbacks=[print_cb],
                    dataset_sink_mode=False)
print(result)
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-11:35:48.745.627 [mindspore/train/model.py:1077] For EvalLossMonitor callback, {'epoch_end', 'step_end', 'epoch_begin', 'step_begin'} methods may not be supported in later version, Use methods prefixed with 'on_train' or 'on_eval' instead when using customized callbacks.
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-11:35:48.747.418 [mindspore/dataset/core/validator_helpers.py:804] 'Compose' from mindspore.dataset.transforms.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'Compose' from mindspore.dataset.transforms instead.
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-11:35:48.749.293 [mindspore/dataset/core/validator_helpers.py:804] 'Compose' from mindspore.dataset.transforms.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'Compose' from mindspore.dataset.transforms instead.
[WARNING] ME(150956:140289176069952,MainProcess):2023-03-13-11:35:48.751.452 [mindspore/dataset/core/validator_helpers.py:804] 'Compose' from mindspore.dataset.transforms.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'Compose' from mindspore.dataset.transforms instead.

[Start eval `r2plus1d_kinetics400`]
step:[    1/ 2484], metrics:[], loss:[3.070/3.070], time:1923.473 ms, 
step:[    2/ 2484], metrics:['Loss: 3.0702', 'Top_1_Accuracy: 0.3750', 'Top_5_Accuracy: 0.7500'], loss:[0.808/1.939], time:169.314 ms, 
step:[    3/ 2484], metrics:['Loss: 1.9391', 'Top_1_Accuracy: 0.5625', 'Top_5_Accuracy: 0.8750'], loss:[2.645/2.175], time:192.965 ms, 
step:[    4/ 2484], metrics:['Loss: 2.1745', 'Top_1_Accuracy: 0.5417', 'Top_5_Accuracy: 0.8750'], loss:[2.954/2.369], time:172.657 ms, 
step:[    5/ 2484], metrics:['Loss: 2.3695', 'Top_1_Accuracy: 0.5000', 'Top_5_Accuracy: 0.8438'], loss:[2.489/2.393], time:176.803 ms, 
step:[    6/ 2484], metrics:['Loss: 2.3934', 'Top_1_Accuracy: 0.4750', 'Top_5_Accuracy: 0.8250'], loss:[1.566/2.256], time:172.621 ms, 
step:[    7/ 2484], metrics:['Loss: 2.2556', 'Top_1_Accuracy: 0.4792', 'Top_5_Accuracy: 0.8333'], loss:[0.761/2.042], time:172.149 ms, 
step:[    8/ 2484], metrics:['Loss: 2.0420', 'Top_1_Accuracy: 0.5357', 'Top_5_Accuracy: 0.8571'], loss:[3.675/2.246], time:181.757 ms, 
step:[    9/ 2484], metrics:['Loss: 2.2461', 'Top_1_Accuracy: 0.4688', 'Top_5_Accuracy: 0.7969'], loss:[3.909/2.431], time:186.722 ms, 
step:[   10/ 2484], metrics:['Loss: 2.4309', 'Top_1_Accuracy: 0.4583', 'Top_5_Accuracy: 0.7639'], loss:[3.663/2.554], time:199.209 ms, 
step:[   11/ 2484], metrics:['Loss: 2.5542', 'Top_1_Accuracy: 0.4375', 'Top_5_Accuracy: 0.7375'], loss:[3.438/2.635], time:173.766 ms, 
step:[   12/ 2484], metrics:['Loss: 2.6345', 'Top_1_Accuracy: 0.4318', 'Top_5_Accuracy: 0.7159'], loss:[2.695/2.640], time:171.364 ms, 
step:[   13/ 2484], metrics:['Loss: 2.6395', 'Top_1_Accuracy: 0.4375', 'Top_5_Accuracy: 0.7292'], loss:[3.542/2.709], time:172.889 ms, 
step:[   14/ 2484], metrics:['Loss: 2.7090', 'Top_1_Accuracy: 0.4231', 'Top_5_Accuracy: 0.7308'], loss:[3.404/2.759], time:216.287 ms, 
step:[   15/ 2484], metrics:['Loss: 2.7586', 'Top_1_Accuracy: 0.4018', 'Top_5_Accuracy: 0.7232'], loss:[4.012/2.842], time:171.686 ms, 
step:[   16/ 2484], metrics:['Loss: 2.8422', 'Top_1_Accuracy: 0.3833', 'Top_5_Accuracy: 0.7167'], loss:[5.157/2.987], time:170.363 ms, 
step:[   17/ 2484], metrics:['Loss: 2.9869', 'Top_1_Accuracy: 0.3750', 'Top_5_Accuracy: 0.6875'], loss:[4.667/3.086], time:171.926 ms, 
step:[   18/ 2484], metrics:['Loss: 3.0857', 'Top_1_Accuracy: 0.3603', 'Top_5_Accuracy: 0.6618'], loss:[5.044/3.194], time:197.028 ms, 
step:[   19/ 2484], metrics:['Loss: 3.1945', 'Top_1_Accuracy: 0.3403', 'Top_5_Accuracy: 0.6458'], loss:[3.625/3.217], time:222.758 ms, 
step:[   20/ 2484], metrics:['Loss: 3.2171', 'Top_1_Accuracy: 0.3355', 'Top_5_Accuracy: 0.6513'], loss:[1.909/3.152], time:207.416 ms, 
step:[   21/ 2484], metrics:['Loss: 3.1517', 'Top_1_Accuracy: 0.3563', 'Top_5_Accuracy: 0.6625'], loss:[4.591/3.220], time:171.645 ms, 
step:[   22/ 2484], metrics:['Loss: 3.2202', 'Top_1_Accuracy: 0.3631', 'Top_5_Accuracy: 0.6667'], loss:[3.545/3.235], time:209.975 ms, 
step:[   23/ 2484], metrics:['Loss: 3.2350', 'Top_1_Accuracy: 0.3693', 'Top_5_Accuracy: 0.6591'], loss:[3.350/3.240], time:185.889 ms,

Code

The code repository addresses are:

Gitee repository
GitHub repository
