Video Saliency Detection Paper Notes and Code Walkthrough (Part 2): TASED-Net

Published at ICCV 2019
Kyle Min, Jason J. Corso, University of Michigan

Paper

Abstract

Personally I find the abstract fairly conventional: it simply presents the model's highlights, namely its strong performance on the DHF1K, Hollywood2, and UCF Sports datasets.

Introduction

This passage gives a nice statement of the research significance:
Video saliency detection aims to model the gaze fixation patterns of humans when viewing a dynamic scene. Because the predicted saliency map can be used to prioritize the video information across space and time, this task has a number of applications such as video surveillance [12, 41], video captioning [27], video compression [11, 13], etc.
Previous work mostly extracted temporal information with LSTMs, but the problem is that the extraction, fusion, and encoding of spatiotemporal features are then performed as separate steps.

  • S3D itself is essentially a network that extracts spatial and temporal features separately (via factorized convolutions)

fail to jointly process spatial and temporal information when predicting a saliency map from the extracted features. Specifically, either spatial decoding and temporal aggregation are performed separately, or only one of these two processes is considered for the final prediction.
The paper therefore proposes a 3D convolutional network to extract the features; the overall design is described as a
3D fully-convolutional encoder-decoder network architecture for video saliency detection,
which we call the Temporally-Aggregating Spatial Encoder-Decoder Network (TASED-Net).

[Figure: overall TASED-Net architecture]
The encoder of the network is transferred from an S3D model pre-trained on an action-recognition dataset.
S3D is efficient and effective in extracting spatiotemporal features,
For the decoder, the paper proposes Auxiliary pooling, which provides a larger temporal receptive field while reducing the temporal dimension; in practice it works better than interpolation and transposed convolution (deconvolution).
To motivate Auxiliary pooling, the paper states:
The tricky part is that the max-unpooling layers cannot reuse the pooling indices or switches [42] from the corresponding max-pooling layers since they have larger temporal receptive field than the max-unpooling layers.

  • I did not fully understand this at first; my reading is that the encoder's max-pooling layers operate on features with a larger temporal extent than the decoder's unpooling layers, so their switch tensors have mismatched shapes, and Auxiliary pooling exists precisely to re-pool the encoder features temporally so that the switches do match.

The three contributions are:

  1. An end-to-end 3D fully-convolutional encoder-decoder network
  2. The newly proposed Auxiliary pooling operation
  3. Strong results on the three major benchmark datasets

Related Work

Video saliency detection

STSConvNet models spatial and temporal information in two separate streams, with the temporal stream fed by optical flow;
RMDN uses a C3D model to extract spatiotemporal features jointly and adds an LSTM module for long-range temporal information;
OMCNN extracts spatial and temporal information with YOLO and FlowNet respectively and then feeds both into an LSTM network;
ACLNet builds a ConvLSTM on top of SALICON and even outperforms image saliency detection methods, which shows how important spatial features are for video saliency detection.

2D ConvNets

Introduces VGG16 and the SegNet model as background,
and sets up the Auxiliary pooling layer:
max-unpooling is more suitable for decoding than other upsampling operations such as linear upsampling or even learnable upsampling method through transposed convolution,
which inspires our Auxiliary pooling.

3D ConvNets

Introduces S3D and notes that, for video feature extraction, S3D can play the same role that VGG16 plays for images.

Approach

Overview

Three assumptions:

  1. The saliency map of a frame can be determined from a fixed number T of preceding frames
  2. Saliency detection with multi-frame input performs better than with a single image
  3. The video provides enough input frames, at least 2T-1 (with T = 32 in the released code, that is at least 63 frames)

Core idea: temporal information is aggregated inside the decoder (prediction network), which requires the proposed Auxiliary pooling:
our model comes from temporal aggregation inside the prediction network, which requires extra operations that we call Auxiliary pooling
The backbone structure of the network is as follows:
[Figure: TASED-Net backbone architecture]

Auxiliary pooling

The concrete operation is easy to follow from the example below: the 1st Auxiliary pooling is a max pooling over the temporal dimension, then the 2nd Auxiliary pooling is a max pooling over the spatial dimensions, and the switches it returns finally determine the locations used for upsampling (max-unpooling).
Compared with previous approaches, because the decoder here keeps the spatial dimensions of the encoder features while the temporal dimension is reduced, Auxiliary pooling has to be introduced so that the dimensions of the skipped features and their switches match the decoder.
[Figure: step-by-step example of Auxiliary pooling]
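To make this concrete, below is a minimal PyTorch sketch of the idea (the kernel sizes and channel counts are illustrative, not the repo's exact configuration): an encoder skip feature is first max-pooled over time, then max-pooled over space with return_indices=True, and the returned switches are later consumed by a MaxUnpool3d in the decoder.

import torch
import torch.nn as nn

# encoder skip feature: (batch, channels, T, H, W), e.g. the output of an S3D block
feat = torch.randn(1, 192, 16, 56, 96)

# 1st Auxiliary pooling: max pooling over the temporal dimension (16 -> 4)
aux_temporal = nn.MaxPool3d(kernel_size=(4, 1, 1), stride=(4, 1, 1))
# 2nd Auxiliary pooling: spatial max pooling that also returns the switches
aux_spatial = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), return_indices=True)

t = aux_temporal(feat)             # (1, 192, 4, 56, 96)
pooled, switches = aux_spatial(t)  # (1, 192, 4, 28, 48), switches of the same shape

# in the decoder, a feature of matching shape is spatially unpooled with those switches
decoder_feat = torch.randn(1, 192, 4, 28, 48)
unpool = nn.MaxUnpool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
upsampled = unpool(decoder_feat, switches)
print(upsampled.shape)             # torch.Size([1, 192, 4, 56, 96])

Because the temporal pooling happens first, the switches already have the reduced temporal size used in the decoder, which is exactly what resolves the shape mismatch mentioned in the Introduction.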

Temporal aggregation strategy

As shown below, the authors tried three spatio-temporal decoding strategies; the experiments show that the last one, a two-step scheme that first upsamples spatially and then aggregates temporally, gives the best performance.
[Figure: the three temporal aggregation strategies compared]

Evaluation

Datasets: DHF1K, Hollywood2, and UCF Sports; given its richer and more diverse data, DHF1K is used as the main benchmark.

  • Each frame is resized to 224 × 384.
  • batch size of 40 on 600 videos from the DHF1K training set
  • SGD algorithm with 0.9 momentum in an end-to-end manner.
  • The learning rate is fixed at 0.001 for the encoder network.
  • sample 2K clips to approximate the validation loss.
  • Kullback-Leibler (KL) divergence as the loss function, which Jiang et al. [20] have
    shown to be effective for training saliency models (a minimal sketch of such a loss
    follows this list).
  • Metrics: (i) Normalized Scanpath Saliency (NSS), (ii) Linear Correlation Coefficient (CC),
    (iii) Similarity (SIM), (iv) Area Under the Curve by Judd (AUC-J),
    and (v) Shuffled-AUC (s-AUC).
  • 21.2M parameters and 63.2G FLOPs
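For reference, here is a minimal sketch of a KL-divergence loss for saliency maps (the official loss.py may differ in details such as normalization and numerical constants): both the predicted map and the ground-truth fixation density are normalized to probability distributions over the pixels before computing KL(gt || pred).

import torch
import torch.nn as nn

class KLDLossSketch(nn.Module):
    ''' KL divergence between ground-truth and predicted saliency distributions '''
    def forward(self, pred, gt, eps=1e-7):
        # pred, gt: (batch, H, W) non-negative saliency maps
        pred = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)
        gt = gt / (gt.sum(dim=(1, 2), keepdim=True) + eps)
        kld = (gt * torch.log(gt / (pred + eps) + eps)).sum(dim=(1, 2))
        return kld.mean()

# usage: criterion = KLDLossSketch(); loss = criterion(model(clip), annotation)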

Different numbers of input frames
[Figure: results for different numbers of input frames]
Performance of different models on the DHF1K dataset
[Figure: comparison with other models on DHF1K]
Different datasets
[Figure: results on the other datasets]

Necessity of Auxiliary pooling

The first variant, which we call TASED-Net-tri, replaces all the max-unpooling layers with trilinear upsampling (interpolation).
The second variant, which we name TASED-Net-trp, replaces the max-unpooling layers with transposed convolutions (deconvolution).
[Figure: comparison of TASED-Net with the two upsampling variants]
Number of input frames
[Figure: effect of the number of input frames]

Code

model.py

S3D

For the BasicConv3d, SepConv3d, and Mixed_* modules in the model, refer to this post:

Video classification: the S3D (separable 3D convolutions) model and code analysis (视频分类 S3D(separable 3D convolutions)模型及代码分析)
Since every layer of the network is a standard, predefined PyTorch module, this section focuses on the overall structure of the network.
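As background, the core idea of S3D's separable convolution is to factorize a k×k×k 3D convolution into a 1×k×k spatial convolution followed by a k×1×1 temporal convolution. A minimal sketch of that idea (the repo's SepConv3d additionally inserts BatchNorm and ReLU between the two convolutions):

import torch
import torch.nn as nn

class SepConv3dSketch(nn.Module):
    ''' factorized 3D convolution: spatial (1,k,k) followed by temporal (k,1,1) '''
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_s = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                padding=(0, k // 2, k // 2), bias=False)
        self.conv_t = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, C, T, H, W)
        return self.conv_t(self.conv_s(x))

x = torch.randn(1, 3, 32, 224, 384)
print(SepConv3dSketch(3, 64)(x).shape)  # torch.Size([1, 64, 32, 224, 384])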

Backbone network

Reading this figure together with the data flow in the forward function of TASED_v2 gives a quick, intuitive picture of the network structure.
[Figure: TASED-Net structure with per-layer feature map sizes]
In the encoder, the output of each Mixed_* block of S3D is taken as a skip feature and passed through an Auxiliary pooling, i.e. a temporal MaxPool3d followed by a spatial MaxPool3d with return_indices=True; the resulting indices feed the different layers of the decoder.
In the decoder, the final S3D output first goes through a 1×1×1 3D convolution to re-distribute the encoded information across the channel dimension. This is followed by three convtsp blocks, each built from nn.ConvTranspose3d(stride=1) and nn.MaxUnpool3d(stride=(1,2,2)). The last block, convtsp4, consists of two pairs of nn.ConvTranspose3d(stride=(1,2,2)) and nn.Conv3d(stride=(2,1,1)).
The feature map sizes of every layer are annotated in the figure above and repeated as shape comments in the code below; the way encoder and decoder feature layers are connected is structurally quite close to U-Net.
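As a sanity check on the shape annotations in the code below, here is a rough sketch of how convtsp4 can map [1, 192, 4, 56, 96] to [1, 1, 1, 224, 384]: two transposed convolutions with stride (1,2,2) take care of the 4x spatial upsampling, while Conv3d layers with kernel and stride (2,1,1) collapse the temporal dimension 4 -> 2 -> 1. Channel widths, kernel sizes, and the final sigmoid are illustrative assumptions, not necessarily the repo's exact values.

import torch
import torch.nn as nn

convtsp4_sketch = nn.Sequential(
    # spatial x2: (4, 56, 96) -> (4, 112, 192)
    nn.ConvTranspose3d(192, 64, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1), bias=False),
    nn.ReLU(inplace=True),
    # temporal aggregation: 4 -> 2
    nn.Conv3d(64, 64, kernel_size=(2, 1, 1), stride=(2, 1, 1), bias=False),
    nn.ReLU(inplace=True),
    # spatial x2: (2, 112, 192) -> (2, 224, 384)
    nn.ConvTranspose3d(64, 4, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1), bias=False),
    nn.ReLU(inplace=True),
    # temporal aggregation: 2 -> 1, then collapse to a single-channel saliency map
    nn.Conv3d(4, 4, kernel_size=(2, 1, 1), stride=(2, 1, 1), bias=False),
    nn.Conv3d(4, 1, kernel_size=1),
    nn.Sigmoid(),
)

z = torch.randn(1, 192, 4, 56, 96)
print(convtsp4_sketch(z).shape)  # torch.Size([1, 1, 1, 224, 384])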

import torch
import torch.nn as nn

class TASED_v2(nn.Module):
    def __init__(self):
        super(TASED_v2, self).__init__()
        # layer definitions are omitted here for brevity: base1-base4 are the four stages
        # of S3D, maxp*/maxm*/maxt* are the (Auxiliary) pooling layers, and convtsp*/unpool*
        # form the decoder; see model.py in the official repo for the full definitions

    def forward(self, x):
        # Encoder
        # [1, 3, 32, 224, 384] => [1, 192, 16, 56, 96]
        y3 = self.base1(x)
        # [1, 192, 16, 56, 96] => [1, 192, 16, 28, 48]
        y = self.maxp2(y3)
        # [1, 192, 16, 56, 96] => [1, 192, 4, 56, 96]
        y3 = self.maxm2(y3)
        # [1, 192, 4, 56, 96] => [1, 192, 4, 28, 48]
        _, i2 = self.maxt2(y3)
        # [1, 192, 16, 28, 48] => [1, 480, 16, 28, 48]
        y2 = self.base2(y)
        # [1, 480, 16, 28, 48] => [1, 480, 8, 14, 24]
        y = self.maxp3(y2)
        # [1, 480, 16, 28, 48] => [1, 480, 4, 28, 48]
        y2 = self.maxm3(y2)
        # [1, 480, 4, 28, 48] => [1, 480, 4, 14, 24]
        _, i1 = self.maxt3(y2)
        # [1, 480, 8, 14, 24] => [1, 832, 8, 14, 24]
        y1 = self.base3(y)
        # [1, 832, 8, 14, 24] => [1, 832, 4, 14, 24]
        y = self.maxt4(y1)
        # [1, 832, 4, 14, 24] => [1, 832, 4, 7, 12], [1, 832, 4, 7, 12]
        y, i0 = self.maxp4(y)
        # [1, 832, 4, 7, 12] => [1, 1024, 4, 7, 12]
        y0 = self.base4(y)

        # Decoder
        # [1, 1024, 4, 7, 12] => [1, 832, 4, 7, 12]
        z = self.convtsp1(y0)
        # [1, 832, 4, 7, 12], [1, 832, 4, 7, 12] => [1, 832, 4, 14, 24]
        z = self.unpool1(z, i0)
        # [1, 832, 4, 14, 24] => [1, 480, 4, 14, 24]
        z = self.convtsp2(z)
        # [1, 480, 4, 14, 24], [1, 480, 4, 14, 24] => [1, 480, 4, 28, 48]
        z = self.unpool2(z, i1, y2.size())
        # [1, 480, 4, 28, 48] => [1, 192, 4, 28, 48]
        z = self.convtsp3(z)
        # [1, 192, 4, 28, 48], [1, 192, 4, 28, 48] => [1, 192, 4, 56, 96]
        z = self.unpool3(z, i2, y3.size())
        # [1, 192, 4, 56, 96] => [1, 1, 1, 224, 384]
        z = self.convtsp4(z)
        # [1, 1, 1, 224, 384] => [1, 224, 384]
        z = z.view(z.size(0), z.size(3), z.size(4))

        return z

if __name__ == '__main__':
    model = TASED_v2()

    # list the parameter names and shapes
    for name, parameters in model.named_parameters():
        print(name, ':', parameters.size())

    # inspect the keys of the state dict
    model_dict = model.state_dict()
    print(model_dict.keys())

    # with the full layer definitions, a 32-frame 224x384 clip yields a single
    # saliency map of shape torch.Size([1, 224, 384])
    input_tensor = torch.randn(1, 3, 32, 224, 384)
    output_tensor = model(input_tensor)
    print(output_tensor.shape)

Source code
https://github.com/MichiganCOG/TASED-Net

run_example.py

This script demonstrates the model's predictions on an example; with the weight file provided by the authors, results can be generated directly.

import sys
import os
import numpy as np
import cv2
import torch
from model import TASED_v2
from scipy.ndimage import gaussian_filter

def main():
    ''' read frames in path_indata and generate frame-wise saliency maps in path_output '''
    # optional two command-line arguments
    path_indata = './example'
    path_output = './output'
    if len(sys.argv) > 1:
        path_indata = sys.argv[1]
        if len(sys.argv) > 2:
            path_output = sys.argv[2]
    if not os.path.isdir(path_output):
        os.makedirs(path_output)

    len_temporal = 32
    file_weight = './TASED_updated.pt'

    model = TASED_v2()
    # print(model)

    # load the weight file and copy the parameters
    if os.path.isfile(file_weight):
        print ('loading weight file')
        weight_dict = torch.load(file_weight)
        model_dict = model.state_dict()
        for name, param in weight_dict.items():
            if 'module' in name:
                name = '.'.join(name.split('.')[1:])
            if name in model_dict:
                if param.size() == model_dict[name].size():
                    model_dict[name].copy_(param)
                else:
                    print (' size? ' + name, param.size(), model_dict[name].size())
            else:
                print (' name? ' + name)

        print (' loaded')
    else:
        print ('weight file?')

    model = model.cuda()
    # input sizes are fixed here; cudnn.benchmark is set to False (benchmark=True can speed
    # things up when input sizes do not change, see https://blog.csdn.net/byron123456sfsfsfa/article/details/96003317)
    torch.backends.cudnn.benchmark = False
    model.eval()

    # iterate over the path_indata directory
    list_indata = [d for d in os.listdir(path_indata) if os.path.isdir(os.path.join(path_indata, d))]
    list_indata.sort()
    for dname in list_indata:
        print ('processing ' + dname)
        list_frames = [f for f in os.listdir(os.path.join(path_indata, dname)) if os.path.isfile(os.path.join(path_indata, dname, f))]
        list_frames.sort()

        # process in a sliding window fashion
        if len(list_frames) >= 2*len_temporal-1:
            path_outdata = os.path.join(path_output, dname)
            if not os.path.isdir(path_outdata):
                os.makedirs(path_outdata)

            snippet = []
            for i in range(len(list_frames)):
                img = cv2.imread(os.path.join(path_indata, dname, list_frames[i]))
                img = cv2.resize(img, (384, 224))
                img = img[...,::-1]
                snippet.append(img)

                # once at least len_temporal (32) frames have accumulated, the snippet holds 32 images
                if i >= len_temporal-1:
                    clip = transform(snippet)

                    process(model, clip, path_outdata, i)

                    # for the first (len_temporal-1) frames there is no full backward window, so the
                    # clip is fed in reverse order to produce those predictions; feeding temporally
                    # later frames this way feels a bit arbitrary, probably just to pad the output
                    # process first (len_temporal-1) frames
                    if i < 2*len_temporal-2:
                        process(model, torch.flip(clip, [1]), path_outdata, i-len_temporal+1)

                    # drop the oldest frame to keep the sliding window at 32 frames
                    del snippet[0]

        else:
            print (' more frames are needed')


def transform(snippet):
    ''' stack & noralization '''
    #list => (224,384,96)
    snippet = np.concatenate(snippet, axis=-1)
    # (224, 384, 96)=> (96, 224, 384) float
    snippet = torch.from_numpy(snippet).permute(2, 0, 1).contiguous().float()
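    # map pixel values from [0, 255] to [-1, 1]: x -> (2*x - 255) / 255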
    snippet = snippet.mul_(2.).sub_(255).div(255)
    #(96, 224, 384) => (1, 32, 3, 224, 384) 
    return snippet.view(1,-1,3,snippet.size(1),snippet.size(2)).permute(0,2,1,3,4)


def process(model, clip, path_outdata, idx):
    ''' process one clip and save the predicted saliency map '''
    with torch.no_grad():
        smap = model(clip.cuda()).cpu().data[0]

    # smap (224, 384)
    # quantize to 8-bit levels (np.int was removed from recent NumPy; plain int behaves the same)
    smap = (smap.numpy()*255.).astype(int)/255.
    smap = gaussian_filter(smap, sigma=7)
    cv2.imwrite(os.path.join(path_outdata, '%04d.png'%(idx+1)), (smap/np.max(smap)*255.).astype(np.uint8))


if __name__ == '__main__':
    main()
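
To reproduce the demo: place the pretrained weight file TASED_updated.pt (provided by the authors) in the working directory, put the extracted frames of each video in a sub-directory of ./example, and run python run_example.py (an input directory and an output directory can optionally be passed as the two command-line arguments). The predicted saliency maps are written as %04d.png files under the corresponding sub-directory of ./output.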

run_train.py

import sys
import os
import numpy as np
import cv2
import time
from datetime import timedelta
import torch
from model import TASED_v2
from loss import KLDLoss
from dataset import DHF1KDataset, InfiniteDataLoader
from itertools import islice

def main():
    ''' concise script for training '''
    # optional two command-line arguments
    path_indata = './DHF1K'
    path_output = './output'
    if len(sys.argv) > 1:
        path_indata = sys.argv[1]
        if len(sys.argv) > 2:
            path_output = sys.argv[2]

    # we checked that using only 2 gpus is enough to produce similar results
    num_gpu = 1
    pile = 5
    batch_size = 8
    num_iters = 1000
    len_temporal = 32
    file_weight = '../S3D-master/S3D_kinetics400.pt'
    path_output = os.path.join(path_output, time.strftime("%m-%d_%H-%M-%S"))
    if not os.path.isdir(path_output):
        os.makedirs(path_output)

    model = TASED_v2()

    # load the weight file and copy the parameters
    if os.path.isfile(file_weight):
        print ('loading weight file')
        weight_dict = torch.load(file_weight)
        model_dict = model.state_dict()
        for name, param in weight_dict.items():
            # rename keys from the weight file to match the parameter names in model.py
            # weight name : 'module.base.0.conv_s.weight'
            if 'module' in name:
                # name: 'base.0.conv_s.weight'
                name = '.'.join(name.split('.')[1:])
            if 'base.' in name:
                bn = int(name.split('.')[1]) # base num
                sn_list = [0, 5, 8, 14]
                sn = sn_list[0]
                if bn >= sn_list[1] and bn < sn_list[2]:
                    sn = sn_list[1]
                elif bn >= sn_list[2] and bn < sn_list[3]:
                    sn = sn_list[2]
                elif bn >= sn_list[3]:
                    sn = sn_list[3]
                # 'conv_s.weight'
                name = '.'.join(name.split('.')[2:])
                # 'base1.0.conv_s.weight'
                name = 'base%d.%d.'%(sn_list.index(sn)+1, bn-sn)+name
            if name in model_dict:
                if param.size() == model_dict[name].size():
                    model_dict[name].copy_(param)
                else:
                    print (' size? ' + name, param.size(), model_dict[name].size())
            else:
                print (' name? ' + name)

        print (' loaded')
    else:
        print ('weight file?')

    # parameter setting for fine-tuning: use different learning rates for different layers.
    # the encoder learning rate is fixed at 0.001, while for the prediction (decoder) network
    # it starts at 0.1 and decays twice by a factor of 10 based on the validation loss
    params = []
    for key, value in dict(model.named_parameters()).items():
        if 'convtsp' in key:
            # decoder (prediction network) parameters: tag the key with '(new)' so the lr schedule below can find them
            params += [{'params':[value], 'key':key+'(new)'}]
        else:
            # encoder parameters: learning rate fixed at 0.001
            params += [{'params':[value], 'lr':0.001, 'key':key}]

    optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=2e-7)
    criterion = KLDLoss()

    model = model.cuda()
    # parallelize across multiple GPUs (see https://zhuanlan.zhihu.com/p/102697821)
    model = torch.nn.DataParallel(model, device_ids=range(num_gpu))
    torch.backends.cudnn.benchmark = False
    model.train()

    train_loader = InfiniteDataLoader(DHF1KDataset(path_indata, len_temporal), batch_size=batch_size, shuffle=True, num_workers=24)

    i, step = 0, 0
    loss_sum = 0
    start_time = time.time()
    # islice(iterable, [start,] stop[, step]): take a slice of the (infinite) data loader
    for clip, annt in islice(train_loader, num_iters*pile):
        with torch.set_grad_enabled(True):
            output = model(clip.cuda())
            loss = criterion(output, annt.cuda())

        loss_sum += loss.item()
        loss.backward()
        # effective batch size: pile*batch_size = 40, accumulated over 'pile' iterations because of the model's size and parameter count
        if (i+1) % pile == 0:
            optimizer.step()
            optimizer.zero_grad()
            step += 1

            # the whole training takes less than 3 hours; print the average loss of each effective batch
            print ('iteration: [%4d/%4d], loss: %.4f, %s' % (step, num_iters, loss_sum/pile, timedelta(seconds=int(time.time()-start_time))), flush=True)
            loss_sum = 0

            # adjust the learning rate of the decoder layers only
            # first decaying point is at step 750, the second one is at step 950.
            if step in [750, 950]:
                for opt in optimizer.param_groups:
                    if 'new' in opt['key']:
                        opt['lr'] *= 0.1

            if step % 25 == 0:
                torch.save(model.state_dict(), os.path.join(path_output, 'iter_%04d.pt' % step))

        i += 1


if __name__ == '__main__':
    main()
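
To train: put the DHF1K training data under ./DHF1K (in the layout expected by dataset.py, which is not shown here) and the Kinetics-pretrained S3D weights at ../S3D-master/S3D_kinetics400.pt, then run python run_train.py (optionally passing the data and output directories). A checkpoint is saved every 25 effective steps to ./output/<timestamp>/iter_%04d.pt.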