Executable example download: Non-Local notebook
NonLocal
"Non-local Neural Networks" was published at CVPR 2018 and proposes a method for action classification in video.
Algorithm Overview
Figure 1: Non-local block
NonLocal is a flexible building block that can easily be combined with convolutional/recurrent layers. Unlike fc layers, which are usually placed at the end of a network, it can be added to the earlier parts of a deep neural network, which makes it possible to build a richer hierarchy that combines non-local and local information. The non-local operation in the paper computes the response at one position as a weighted sum over the features at all positions of the feature map, where the positions may be spatial, temporal, or spatiotemporal. Non-local is in fact closely related to the self-attention mechanism. So that the proposed non-local block can be freely plugged into any neural network as a component, the authors design the non-local operation to keep the input and output sizes identical. The concrete formulation is:
$$
y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)
$$
In the formula, x is the input and y is the output; i and j index positions of the input; x_i is a vector whose dimensionality equals the channel count of x; f computes the similarity between any two positions; g is an embedding function that maps a position to a vector, i.e. its feature; and C(x) is a normalization factor. To compute one output position, every input position must be considered, much as in the attention mechanism: f produces the attention weights (the mask), which are multiplied with the embedded features from g and summed, giving the attention of that output position over the whole input. Computing every position this way yields a non-local "attention map".
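For concreteness, the dot-product instantiation from the paper (the variant used as the example later in this tutorial) defines f, g, and the normalization factor as

$$
f(x_i, x_j) = \theta(x_i)^{\mathsf{T}} \phi(x_j), \qquad \theta(x_i) = W_\theta x_i, \quad \phi(x_j) = W_\phi x_j, \quad g(x_j) = W_g x_j, \quad \mathcal{C}(x) = N,
$$

where N is the number of positions in x; in the implementation, W_theta, W_phi, and W_g are realized as 1×1×1 convolutions.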
Table 1: C2D baseline under a ResNet-50 backbone
Table 1 shows the C2D baseline under a ResNet-50 backbone. In this repository, we use the Inflated 3D ConvNet (I3D) with a ResNet-50 backbone. The C2D model in Table 1 can be converted into a 3D convolutional model by "inflating" its kernels: for example, a 2D k×k kernel can be inflated into a 3D t×k×k kernel spanning t frames, as sketched below. We add 5 non-local blocks (3 to res4 and 2 to res3, one to every other residual block). For more details, please read the paper "Non-local Neural Networks".
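A minimal sketch of the inflation trick (the helper name below is illustrative, not the repository's API): the 2D kernel is repeated t times along a new temporal axis and rescaled by 1/t, so the inflated filter initially produces the same activations on a video of repeated frames as the original 2D filter did on a single frame.

import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D conv kernel (C_out, C_in, k, k) into a 3D kernel
    (C_out, C_in, t, k, k): repeat along a new temporal axis, rescale by 1/t."""
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], t, axis=2)
    return w3d / t

w2d = np.random.randn(64, 3, 7, 7).astype(np.float32)
print(inflate_2d_kernel(w2d, t=5).shape)  # (64, 3, 5, 7, 7)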
Environment Setup
git clone https://gitee.com/yanlq46462828/zjut_mindvideo.git
cd zjut_mindvideo
# Please first install mindspore according to instructions on the official website: https://www.mindspore.cn/install
pip install -r requirements.txt
pip install -e .
Training Pipeline
from mindspore import nn
from mindspore.train import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn.metrics import Accuracy
from msvideo.utils.check_param import Validator, Rel
Dataset Loading
The kinetics400 dataset is loaded through the Kinetic400 class, which is built on VideoDataset. Download the dataset to the path below, or change the path to suit your needs. Dataset download link: https://deepmind.com/research/open-source/kinetics
from msvideo.data.kinetics400 import Kinetic400
# Data Pipeline.
dataset = Kinetic400(path='/home/publicfile/kinetics-400',
                     split="train",
                     shuffle=True,
                     seq=32,
                     seq_mode='interval',
                     num_parallel_workers=1,
                     batch_size=6,
                     repeat_num=1,
                     frame_interval=6)
ckpt_save_dir = './nonlocal'
Data Processing
VideoShortEdgeResize first resizes each clip according to its short edge; VideoRandomCrop then takes a random crop from the resized video; VideoRandomHorizontalFlip flips the video horizontally with a given probability; VideoRescale rescales the pixel values; VideoReOrder permutes the dimensions; and finally VideoNormalize normalizes the frames.
from msvideo.data.transforms import VideoRandomCrop, VideoRandomHorizontalFlip, VideoRescale
from msvideo.data.transforms import VideoNormalize, VideoShortEdgeResize, VideoReOrder
# Data Pipeline.
transforms = [VideoShortEdgeResize(size=256, interpolation='bicubic'),
              VideoRandomCrop([224, 224]),
              VideoRandomHorizontalFlip(0.5),
              VideoRescale(),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])]
dataset.transform = transforms
dataset_train = dataset.run()
Validator.check_int(dataset_train.get_dataset_size(), 0, Rel.GT)
step_size = dataset_train.get_dataset_size()
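To confirm the pipeline produces what the model expects, it can help to peek at one batch (a hypothetical check; the column names depend on the Kinetic400 implementation, so we simply print whatever keys come back). With the settings above, the video tensor should be (6, 3, 32, 224, 224).

# Hypothetical sanity check: inspect the shapes of one processed batch.
for batch in dataset_train.create_dict_iterator(num_epochs=1):
    print({name: value.shape for name, value in batch.items()})
    break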
Network Construction
The most important structure in Nonlocal is NonlocalBlockND(nn.Cell). The block supports four pairwise similarity formulations; taking dot_product as an example, the linear transformations are implemented mainly through three Conv3d layers. The NonlocalBlockND operation only requires common operators such as convolution, matrix multiplication, addition, and softmax, so users can assemble it into a network very conveniently.
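Below is a minimal, hypothetical sketch of the dot-product variant (a simplified stand-in for the repository's NonlocalBlockND; the class and attribute names are illustrative). It shows how three 1×1×1 Conv3d embeddings plus batched matrix multiplication realize the formula above while keeping input and output shapes identical:

from mindspore import nn, ops

class NonLocalDotProduct(nn.Cell):
    """Sketch of a dot-product non-local block: y = x + w_z(attention(x))."""
    def __init__(self, in_channels, inter_channels):
        super().__init__()
        # theta / phi / g are 1x1x1 3D convolutions, i.e. linear embeddings.
        self.theta = nn.Conv3d(in_channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv3d(in_channels, inter_channels, kernel_size=1)
        self.g = nn.Conv3d(in_channels, inter_channels, kernel_size=1)
        # w_z projects back to in_channels so the block can be residual.
        self.w_z = nn.Conv3d(inter_channels, in_channels, kernel_size=1)
        self.bmm = ops.BatchMatMul()

    def construct(self, x):
        n, _, t, h, w = x.shape
        thw = t * h * w
        # Flatten all space-time positions: (N, C', THW).
        theta = self.theta(x).reshape(n, -1, thw)
        phi = self.phi(x).reshape(n, -1, thw)
        g = self.g(x).reshape(n, -1, thw)
        # Pairwise dot-product similarities f(x_i, x_j): (N, THW, THW).
        attn = self.bmm(ops.transpose(theta, (0, 2, 1)), phi)
        # The dot-product variant normalizes by the number of positions N.
        attn = attn / thw
        # Weighted sum of g over all positions j, restored to 5D.
        y = self.bmm(g, ops.transpose(attn, (0, 2, 1))).reshape(n, -1, t, h, w)
        # Residual connection keeps input and output sizes identical.
        return x + self.w_z(y)

Because w_z maps back to in_channels and the result is added to x, the block can be inserted between any two layers without changing tensor shapes.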
nonlocal3d is composed of a backbone, avg_pool, flatten, and head. Roughly: first, the backbone is NLResInflate3D50 (the NLInflateResNet3D class), which instantiates the NLInflateResNet3D structure with the [3, 4, 6, 3] stage configuration. NLInflateResNet3D itself inherits from the ResNet3d50 structure and inserts a NonlocalBlockND into every other of the 10 residual blocks in the 2nd and 3rd stages of ResNet3d50's [3, 4, 6, 3] configuration. Second, the output of NLResInflate3D50 goes through average pooling and is flattened. Third, the classification head: the flattened tensor is fed into Dropdensehead for classification, producing a tensor of shape (N, NUM_CLASSES).
from msvideo.models.nonlocal3d import nonlocal3d
# Create model
network = nonlocal3d(in_d=32,
                     in_h=224,
                     in_w=224,
                     num_classes=400,
                     keep_prob=0.5)
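As a quick smoke test of the shape contract described above (a sketch; the (N, C, T, H, W) layout follows the VideoReOrder([3, 0, 1, 2]) transform used earlier):

import numpy as np
from mindspore import Tensor

# Hypothetical check: one zero-filled clip in (N, C, T, H, W) layout.
dummy_clip = Tensor(np.zeros((1, 3, 32, 224, 224), dtype=np.float32))
print(network(dummy_clip).shape)  # expected: (1, 400)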
from msvideo.schedule.lr_schedule import warmup_step_lr
# Set learning rate scheduler.
lr = warmup_step_lr(lr=0.0003,
                    lr_epochs=[1],
                    steps_per_epoch=step_size,
                    warmup_epochs=1,
                    max_epoch=1,
                    gamma=0.1)
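Assuming warmup_step_lr follows the common MindSpore model-zoo pattern of returning one learning-rate value per training step (an assumption about this helper, not a documented guarantee), printing the two ends of the array makes the warmup ramp and the gamma decay visible:

# Assumed per-step schedule: ramps up over the warmup epoch, then decays by gamma.
print(lr[:3], lr[-3:])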
# Define optimizer.
network_opt = nn.SGD(network.trainable_params(),
                     lr,
                     momentum=0.9,
                     weight_decay=0.0001)
# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
# set checkpoint for the network
ckpt_config = CheckpointConfig(
    save_checkpoint_steps=step_size,
    keep_checkpoint_max=1)
ckpt_callback = ModelCheckpoint(prefix='nonlocal_kinetics400',
                                directory=ckpt_save_dir,
                                config=ckpt_config)
# Init the model.
model = Model(network,
              loss_fn=network_loss,
              optimizer=network_opt,
              metrics={"Accuracy": Accuracy()})
# Begin to train.
print('[Start training `{}`]'.format('nonlocal_kinetics400'))
print("=" * 80)
model.train(1,
            dataset_train,
            callbacks=[ckpt_callback, LossMonitor()],
            dataset_sink_mode=False)
print('[End of training `{}`]'.format('nonlocal_kinetics400'))
Evaluation Pipeline
from mindspore import context
from mindspore.train.callback import Callback
class PrintEvalStep(Callback):
    """ print eval step """
    def step_end(self, run_context):
        """ eval step """
        cb_params = run_context.original_args()
        print("eval: {}/{}".format(cb_params.cur_step_num, cb_params.batch_num))
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
from msvideo.data.kinetics400 import Kinetic400
dataset_eval = Kinetic400(path="/home/publicfile/kinetics-400",
                          split="val",
                          shuffle=True,
                          seq=32,
                          seq_mode='interval',
                          num_parallel_workers=1,
                          batch_size=1,
                          frame_interval=6)
from msvideo.data.transforms import VideoReOrder, VideoRescale, VideoNormalize
from msvideo.data.transforms import VideoCenterCrop, VideoShortEdgeResize
transforms = [VideoShortEdgeResize(size=256, interpolation='bicubic'),
              VideoCenterCrop([224, 224]),
              VideoRescale(),
              VideoReOrder([3, 0, 1, 2]),
              VideoNormalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])]
dataset_eval.transform = transforms
dataset_eval = dataset_eval.run()
from mindspore import nn
from mindspore.train import Model
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore import load_checkpoint, load_param_into_net
from msvideo.models.nonlocal3d import nonlocal3d
# Create model.
network = nonlocal3d()
# Define loss function.
network_loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
# Load pretrained model.
param_dict = load_checkpoint(ckpt_file_name='/home/hcx/nonlocal_mindspore/scripts/nonlocal_output_0.0003/nonlocal-1_4975.ckpt')
load_param_into_net(network, param_dict)
# Define eval_metrics.
eval_metrics = {'Loss': nn.Loss(),
                'Top_1_Accuracy': nn.Top1CategoricalAccuracy(),
                'Top_5_Accuracy': nn.Top5CategoricalAccuracy()}
print_cb = PrintEvalStep()
# Init the model.
model = Model(network, loss_fn=network_loss, metrics=eval_metrics)
# Begin to eval.
print('[Start eval `{}`]'.format('nonlocal_kinetics400'))
result = model.eval(dataset_eval,
                    callbacks=[print_cb],
                    dataset_sink_mode=False)
print(result)