基于Mindspore框架的VisTR模型复现

最新推荐文章于 2024-06-05 21:58:16 发布

holo337

最新推荐文章于 2024-06-05 21:58:16 发布

阅读量399

点赞数 1

文章标签：计算机视觉深度学习人工智能

本文链接：https://blog.csdn.net/spicy00/article/details/129738952

版权

VisTR可执行文件

VisTR图像分割

模型简介

VisTR是由美团无人车配送团队在CVPR 2021上发表的文章中提出一种图像分割算法。使用NVIDIA Tesla V100在 youtube-vis数据集上，以resnet50和resnet101为backbone分别取得了maskAP 36.2和maskAP 40.1的精度，另外FPS分别为69.9和57.5。

算法介绍

首先，相较于单帧图像，视频含有关于每个实例更为完备和丰富的信息，比如不同实例的轨迹和运动模态，这些信息能够帮助克服单帧实例分割任务中一些比较困难的问题，比如外观相似、物体邻近或者存在遮挡的情形等。另一方面，多帧所提供的关于单个实例更好的特征表示也有助于模型对物体进行更好的跟踪。因此，作者想要实现一个端到端对视频实例目标进行建模的框架。作者认为可以借鉴自然语言处理任务的思想，把视频实例分割建模为序列到序列的任务，即给定多帧图像作为输入，直接输出多帧的分割mask序列。其次，分割本身是像素特征之间相似度的学习，而跟踪本质是实例特征之间相似度的学习，因此理论上他们可以统一到同一个相似度学习的框架之下。基于以上的思考，作者选取了一个同时能够进行序列的建模和相似度学习的模型，即自然语言处理中的transformers模型。transformers本身可以用于seq2seq的任务，即给定一个序列，可以输入一个序列。并且该模型十分擅长对长序列进行建模，因此非常适合应用于视频领域对多帧序列的时序信息进行建模。其次，transformers的核心机制，自注意力模块(Self-attention) ，可以基于两两之间的相似度来进行特征的学习和更新，使得将像素特征之间相似度以及实例特征之间相似度统一在一个框架内实现成为可能。以上的特性使得transformers成为VIS任务的恰当选择。另外transformers已经有被应用于计算机视觉中进行目标检测的实践DETR。因此作者基于transformers设计了视频实例分割（VIS）的模型VisTR

图1 算法框图

图1 VisTR算法框架

模型架构

VisTR整体框架如图2所示。图中最左边表示输入的多帧原始图像序列（以三帧为例），右边表示输出的实例预测序列，其中相同形状对应同一帧图像的输出，相同颜色对应同一个物体实例的输出。给定多帧图像序列，首先利用卷积神经网络（CNN）进行初始图像特征的提取，然后将多帧的特征结合作为特征序列输入transformer进行建模，实现序列的输入和输出。不难看出，首先，VisTR是一个端到端的模型，即同时对多帧数据进行建模。建模的方式即：将其变为一个seq2seq的任务，输入多帧图像序列，模型可以直接输出预测的实例序列。虽然在时序维度多帧的输入和输出是有序的，但是单帧输入的实例的序列在初始状态下是无序的，这样仍然无法实现实例的跟踪关联，因此作者强制使得每帧图像输出的实例的顺序是一致的（用图中同一形状的符号有着相同的颜色变化顺序表示），这样只要找到对应位置的输出，便可自然而然实现同一实例的关联，无需任何后处理操作。为了实现此目标，需要对属于同一个实例位置处的特征进行序列维度的建模。针对性地，为了实现序列级别的监督，作者提出了Instance Sequence Matching的模块。同时为了实现序列级别的分割，作者提出了Instance Sequence Segmentation的模块。端到端的建模将视频的空间和时间特征当做一个整体，可以从全局的角度学习整个视频的信息，同时transformer所建模的密集特征序列又能够较好的保留细节的信息。

在这里插入图片描述

图2 VisTR网络结构

模型特点

端到端的模型架构
引入tansformer，把视频实例分割建模转换为了seq to seq的任务

环境准备

本文基于MIndSpore实现，dcn编码部分使用到了torch。在开始实验前确保环境配置完成

git clone https://gitee.com/yanlq46462828/zjut_mindvideo.git

cd zjut_mindvideo

pip install -r requirements.txt
pip install -e .

数据准备与处理

本文使用的数据集是youtube-vis。为了方便训练和推理，文中从原数据集中选取了部分，构成了一个比较小的子集。运行以下代码，把数据集下载并且解压到指定路径。

from download import download

dataset_url = "https://mindvideo.obs.cn-north-4.myhuaweicloud.com/VOS_min.zip"
path = "./VOS_min/"
path = download(dataset_url, path, kind="zip", replace=True)

数据集创建

调用了Ytvos类，创建了训练所需要的数据集。

import os
from mindspore import context
from msvideo.data.ytvos import Ytvos

os.environ["CUDA_VISIBLE_DEVICES"] = '1'
context.set_context(mode=context.PYNATIVE_MODE,
                    device_target='GPU')

dataset_train = Ytvos(path='./VOS_min',
                      split='train',
                      seq=36,
                      batch_size=1,
                      repeat_num=1,
                      shuffle=False,
                      shard_id=None,
                      num_shards=None)

loading annotations into memory...
Done (t=0.00s)
creating index...
index created!

数据处理

训练数据的处理主要包括了以下几个方面：

RandomHorizontalFlip
RandomResize
PhotometricDistort，光度失真主要包括了调整图像亮度，色度，对比度，饱和度以及加入噪点
RandomSizeCrop
Normalize

import numpy as np
from mindspore import ops
from msvideo.data.transforms import ytvos_transform
import mindspore.dataset.transforms.py_transforms as trans

class DeFaultTrans(trans.PyTensorOperation):
    """
    ytvos default transform
    """

    def __init__(self):
        self.cast = ops.Cast()
        self.trans = ytvos_transform.make_coco_transforms()

    def __call__(self, path, label, box, valid, mask, resize_shape):
        video = []
        for im in path:
            im_path = bytes.decode(im.tobytes())
            im = Image.open(im_path).convert('RGB')
            video.append(im)
        video, box, mask, resize_shape, label, valid = self.trans(video, box, mask, resize_shape, label, valid)
        video = video.astype(np.float32)
        labels = label.astype(np.int32)
        boxes = box.astype(np.float32)
        valids = valid.astype(np.int32)
        masks = mask.astype(np.float32)
        resize_shape = resize_shape.astype(np.float32)
        return video, labels, boxes, valids, masks, resize_shape
    
dataset_train = dataset_train.run()
step_size = dataset_train.get_dataset_size()

模型构建

调用VistrCom类创建模型，如图二所示，模型主要有以下几层组成

ResNet backbone
encoder和decoder
instance sequence segmentation

from msvideo.models.vistr import VistrCom

network = VistrCom(name="ResNet50",
                    num_frames=36,
                    num_queries=360,
                    dropout=0.1,
                    aux_loss=True,
                    num_class=41)

损失函数

根据前面的描述，网络学习中需要计算损失的主要有两个地方，一个是Instance Sequence Matching阶段的匹配过程，一个是找到监督之后最终整个网络的损失函数计算过程。

Instance Sequence Matching过程的计算公式如式1所示：由于matching阶段只是用于寻找监督，而计算mask之间的距离运算比较密集，因此在此阶段作者只考虑box和预测的类别c两个因素。第一行中的yi表示对应第i个实例的ground truth序列，其中c表示类别，b表示bounding box，T表示帧数，即T帧该实例对应的类别和boundingbox序列。第二行和第三行分别表示预测序列的结果，其中p表示在ci这个类别的预测的概率，b表示预测的boundingbox。序列之间距离的运算是通过两个序列对应位置的值两两之间计算损失函数得到的，图中用Lmatch表示，对于每个预测的序列，找到Lmatch最低那个ground truth序列作为它的监督。根据对应的监督信息，就可以计算整个网络的损失函数。

在这里插入图片描述

式1 Instance Sequence Matching

由于作者的方法是将分类、检测、分割和跟踪做到一个端到端网络里，因此最终的损失函数也同时包含类别、boundingbox和mask三个方面，跟踪通过直接对序列计算损失函数体现。式2表示分割的损失函数，得到了对应的监督结果之后，作者计算对应序列之间的Dice loss和Focal loss作为mask的损失函数。

在这里插入图片描述

式2 分割损失函数

最终的损失函数如式3所示，为同时包含分类（类别概率）、检测（bounding box）以及分割（mask）的序列损失函数之和。

在这里插入图片描述

式3 VisTR整体损失函数

from msvideo.loss.vistr_loss import SetCriterion
from msvideo.models.layers.instance_sequence_match import HungarianMatcher
# Define loss function

weight_dict = {'loss_ce': 1, 'loss_bbox': 5}
weight_dict['loss_giou'] = 2
weight_dict["loss_mask"] = 1
weight_dict["loss_dice"] = 1

aux_weight_dict = {}
for i in range(6 - 1):
    aux_weight_dict.update(
        {k + f'_{i}': v for k, v in weight_dict.items()})
weight_dict.update(aux_weight_dict)

matcher = HungarianMatcher(num_frames=36,
                            cost_class=1,
                            cost_bbox=1,
                            cost_giou=1)

network_loss = SetCriterion(num_classes=41,
                            matcher=matcher,
                            weight_dict=weight_dict,
                            eos_coef=0.1,
                            aux_loss=True)

[WARNING] ME(288:140031504295744,MainProcess):2023-03-14-15:09:57.527.081 [mindspore/common/_decorator.py:38] 'arange' is deprecated from version 2.0 and will be removed in a future version, use 'range' instead.

训练过程

模型在输出预测结果之后，将预测结果和GT进行匹配并计算损失

import os
import torch
import mindspore
from mindspore import nn, context
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig
from msvideo.schedule.lr_schedule import piecewise_constant_lr
from msvideo.utils.callbacks import LossMonitor

class LossCell(nn.Cell):
    "cell with loss function"

    def __init__(self, net, loss):
        super().__init__(auto_prefix=False)
        self._net = net
        self._loss = loss

    def construct(self, video, labels, boxes, valids, masks, resize_shape):
        """Cell with loss function."""
        outputs, pred_masks = self._net(video)
        return self._loss(outputs, pred_masks, labels, boxes, valids, masks, resize_shape)
    
# set learning rate scheduler.
lr = piecewise_constant_lr(
    [step_size * 1, step_size * 2],
    [0.0001, 0.0001 * 0.1]
)
lr_embed = piecewise_constant_lr(
    [step_size * 1, step_size * 2],
    [0.00001, 0.00001 * 0.1]
)

#  Define optimizer.
param_dicts = [
    {
        'params': [par for par in network.trainable_params() if 'embed' not in par.name],
        'lr': lr,
        'weight_decay': 0.0001
    },
    {
        'params': [par for par in network.trainable_params() if 'embed' in par.name],
        'lr': lr_embed,
        'weight_decay': 0.0001
    }
]


# Set the checkpoint config for the network.
ckpt_config = CheckpointConfig(save_checkpoint_steps=1, keep_checkpoint_max=10)
ckpt_callback = ModelCheckpoint(prefix="vistr_r50", directory="./ckpt_save_dir", config=ckpt_config)
# Init the model.
net_with_loss = LossCell(network, network_loss)

network_opt = nn.AdamWeightDecay(param_dicts)
loss_scale_manager = mindspore.FixedLossScaleManager(1024.0)
model = Model(net_with_loss,
              optimizer=network_opt,
              loss_scale_manager=loss_scale_manager)

# Begin to train.
print('[Start training `{}`]'.format('vistr_r50'))
print("=" * 80)
model.train(1,
            dataset_train,
            callbacks=[ckpt_callback, LossMonitor(lr)],
            dataset_sink_mode=False)
print('[End of training `{}`]'.format('vistr_r50'))

[Start training `vistr_r50`]
================================================================================


[WARNING] DEVICE(288,7f5ba0122740,python):2023-03-14-15:10:17.119.635 [mindspore/ccsrc/plugin/device/gpu/hal/device/kernel_info_setter.cc:205] SelectCustomKernel] Not find operator information for op[Custom_/home/zgz/vistr_mindspore/src/models/layers/dcn/ms_dcn.so:ms_deformable_conv_forward]. Infer operator information from inputs.
[WARNING] DEVICE(288,7f5ba0122740,python):2023-03-14-15:10:22.811.237 [mindspore/ccsrc/plugin/device/gpu/hal/device/kernel_info_setter.cc:205] SelectCustomKernel] Not find operator information for op[Custom_/home/zgz/vistr_mindspore/src/models/layers/dcn/ms_dcn.so:ms_deformable_conv_forward]. Infer operator information from inputs.
[WARNING] DEVICE(288,7f5ba0122740,python):2023-03-14-15:11:19.226.176 [mindspore/ccsrc/plugin/device/gpu/hal/device/kernel_info_setter.cc:205] SelectCustomKernel] Not find operator information for op[Custom_/home/zgz/vistr_mindspore/src/models/layers/dcn/ms_dcn_bprop.so:msDCN_backward_input]. Infer operator information from inputs.
[WARNING] DEVICE(288,7f5ba0122740,python):2023-03-14-15:11:19.227.735 [mindspore/ccsrc/plugin/device/gpu/hal/device/kernel_info_setter.cc:205] SelectCustomKernel] Not find operator information for op[Custom_/home/zgz/vistr_mindspore/src/models/layers/dcn/ms_dcn_bprop.so:msDCN_backward_filter]. Infer operator information from inputs.


Epoch:[  0/  1], step:[    1/   20], loss:[65.806/65.806], time:105643.677 ms, lr:0.00010
Epoch:[  0/  1], step:[    2/   20], loss:[94.434/80.120], time:47237.714 ms, lr:0.00010
Epoch:[  0/  1], step:[    3/   20], loss:[49.806/70.015], time:41246.504 ms, lr:0.00010
Epoch:[  0/  1], step:[    4/   20], loss:[51.003/65.262], time:37313.868 ms, lr:0.00010
Epoch:[  0/  1], step:[    5/   20], loss:[35.141/59.238], time:37865.883 ms, lr:0.00010
Epoch:[  0/  1], step:[    6/   20], loss:[32.705/54.816], time:39216.800 ms, lr:0.00010
Epoch:[  0/  1], step:[    7/   20], loss:[21.688/50.083], time:39396.472 ms, lr:0.00010
Epoch:[  0/  1], step:[    8/   20], loss:[24.573/46.894], time:37093.604 ms, lr:0.00010
Epoch:[  0/  1], step:[    9/   20], loss:[23.924/44.342], time:39990.465 ms, lr:0.00010
Epoch:[  0/  1], step:[   10/   20], loss:[18.781/41.786], time:38969.989 ms, lr:0.00010
Epoch:[  0/  1], step:[   11/   20], loss:[18.676/39.685], time:37904.329 ms, lr:0.00010
Epoch:[  0/  1], step:[   12/   20], loss:[23.221/38.313], time:37366.640 ms, lr:0.00010
Epoch:[  0/  1], step:[   13/   20], loss:[23.885/37.203], time:37826.169 ms, lr:0.00010
Epoch:[  0/  1], step:[   14/   20], loss:[18.362/35.857], time:36671.492 ms, lr:0.00010
Epoch:[  0/  1], step:[   15/   20], loss:[18.171/34.678], time:40284.733 ms, lr:0.00010
Epoch:[  0/  1], step:[   16/   20], loss:[21.431/33.850], time:37168.168 ms, lr:0.00010
Epoch:[  0/  1], step:[   17/   20], loss:[21.613/33.130], time:39942.996 ms, lr:0.00010
Epoch:[  0/  1], step:[   18/   20], loss:[22.077/32.516], time:38506.711 ms, lr:0.00010
Epoch:[  0/  1], step:[   19/   20], loss:[22.440/31.986], time:40051.668 ms, lr:0.00010
Epoch:[  0/  1], step:[   20/   20], loss:[18.993/31.336], time:39340.184 ms, lr:0.00010
Epoch time: 868269.174 ms, per step time: 43413.459 ms, avg loss: 31.336
[End of training `vistr_r50`]

推理过程

创建数据集

数据集通过直接读取的方式加载

创建推理模型

模型创建之后加载对应的ckpt模型权重文件

import os
import json
import numpy as np
import mindspore
from mindspore import nn, ops, context
from msvideo.models.vistr import VistrCom
from mindspore.dataset.vision import py_transforms as T_p
from mindspore import load_checkpoint, load_param_into_net

os.environ["CUDA_VISIBLE_DEVICES"] = '1'
context.set_context(mode=context.PYNATIVE_MODE,
                    device_target='GPU')


transform = T_p.ToTensor()

mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

num_frames = 36
num_ins = 10

ann_path = os.path.join('./VOS_min', "annotations/instances_valid_sub.json")
folder = os.path.join('./VOS_min', "val/JPEGImages/")
videos = json.load(open(ann_path, 'rb'))['videos']

ms_model = VistrCom(name='ResNet50',
                    aux_loss=True,
                    dropout=0.0)

param_dict = load_checkpoint("./ckpt/vistr_r50_all.ckpt")
load_param_into_net(ms_model, param_dict)
weight = np.load("./ckpt/weights_r50.npy")
weight = mindspore.Tensor(weight, mindspore.float32)
ms_model.mask_head.dcn.conv_weight = weight

vis_num = len(videos)
result = []

ms_sigmoid = ops.Sigmoid()
concat = ops.Concat(axis=0)
softmax = nn.Softmax(axis=-1)
expand_dims = ops.ExpandDims()

[WARNING] ME(2683:139926123353920,MainProcess):2023-03-14-15:38:29.300.036 [mindspore/dataset/core/validator_helpers.py:804] 'ToTensor' from mindspore.dataset.vision.py_transforms is deprecated from version 1.8 and will be removed in a future version. Use 'ToTensor' from mindspore.dataset.vision instead.

推理过程

数据处理

推理部分的数据处理主要包括以下几部：

Resize
Normalize

结果处理

在得到模型预测结果之后，由于在开始时对图片进行了Resize操作，之后也需要还原图片到原来的尺寸。

import os
import math
import json
import torch
from mindspore import Tensor
from PIL import Image
import pycocotools.mask as mask_util

folder_r = os.path.exists("./results/results.json")
if not folder_r:
    os.makedirs("./results")
    f = open("./results/results.json", 'w')
    f.close

result = []
for i in range(vis_num):
    print("Process video: ", i)
    id_ = videos[i]['id']
    length = videos[i]['length']
    file_names = videos[i]['file_names']    
    img_set = []
    if length < num_frames:
        clip_names = file_names*(math.ceil(num_frames/length))
        clip_names = clip_names[:num_frames]
    else:
        clip_names = file_names[:num_frames]
    if clip_names == []:
        continue
    if len(clip_names) < num_frames:
        clip_names.extend(file_names[:num_frames-len(clip_names)])
    for k in range(num_frames):
        im = Image.open(os.path.join(folder, clip_names[k]))
        h = im.size[1]
        w = im.size[0]
        width = int((im.size[0]*300) / im.size[1])
        height = 300
        im = im.resize((width, height), resample=Image.Resampling.BILINEAR)
        im = transform(im)
        im = (im - mean[:, None, None]) / std[:, None, None]
        im = Tensor(im, mindspore.float32)
        im = expand_dims(im, 0)
        img_set.append(im)
    img = concat(img_set)
    images = Tensor(img, mindspore.float32)
    images = images.expand_dims(axis=0)
    if images.shape[-1] <= 700:
        pred, pred_masks = ms_model(images)

        pred_logits = pred[-1, ..., :42]
        pred_boxes = pred[-1, ..., 42:]
        pred_logits = softmax(pred_logits)[0, :, :-1]
        pred_boxes = pred_boxes[0]
        pred_masks = pred_masks[0]

        pred_masks = pred_masks.reshape(36, 10, pred_masks.shape[-2], pred_masks.shape[-1])
        resize_bilinear = ops.ResizeBilinear((h, w))
        pred_masks = resize_bilinear(pred_masks)
        pred_masks = ms_sigmoid(pred_masks).asnumpy() > 0.5

        pred_masks = pred_masks[:length]
        pred_logits = pred_logits.reshape(num_frames, num_ins, pred_logits.shape[-1]).asnumpy()
        pred_logits = pred_logits[:length]
        pred_scores = np.max(pred_logits, axis=-1)
        pred_logits = np.argmax(pred_logits, axis=-1)
        for m in range(num_ins):
            if pred_masks[:, m].max() == 0:
                continue
            score = pred_scores[:, m].mean()
            category_id = np.argmax(np.bincount(pred_logits[:, m]))
            instance = {'video_id': id_, 'score': float(score), 'category_id': int(category_id)}
            segmentation = []
            for n in range(length):
                if pred_scores[n, m] < 0.001:
                    segmentation.append(None)
                else:
                    mask = (pred_masks[n, m]).astype('uint8')
                    rle = mask_util.encode(np.array(mask[:, :, np.newaxis], order='F'))[0]
                    rle["counts"] = rle["counts"].decode("utf-8")
                    segmentation.append(rle)
            instance['segmentations'] = segmentation
            result.append(instance)
    with open("results/results.json", 'w', encoding='utf-8') as f:
        json.dump(result, f)

Process video:  0
Process video:  1
Process video:  2
Process video:  3
Process video:  4
Process video:  5
Process video:  6
Process video:  7
Process video:  8
Process video:  9
Process video:  10
Process video:  11
Process video:  12
Process video:  13
Process video:  14
Process video:  15
Process video:  16
Process video:  17
Process video:  18


[WARNING] DEVICE(2683,7f4316e09740,python):2023-03-14-15:40:50.605.353 [mindspore/ccsrc/plugin/device/gpu/hal/device/kernel_info_setter.cc:205] SelectCustomKernel] Not find operator information for op[Custom_/home/zgz/vistr_mindspore/src/models/layers/dcn/ms_dcn.so:ms_deformable_conv_forward]. Infer operator information from inputs.


Process video:  19

holo337

关注

1
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
基于Mindspore框架的VisTR模型复现

VisTR是由美团无人车配送团队在CVPR 2021上发表的文章中提出一种图像分割算法。使用NVIDIA Tesla V100在 youtube-vis数据集上，以resnet50和resnet101为backbone分别取得了maskAP 36.2和maskAP 40.1的精度，另外FPS分别为69.9和57.5。
复制链接

扫一扫