Video Saliency Detection Paper Notes and Code Walkthrough (Part 2): TASED-Net

Published at ICCV 2019
Kyle Min, Jason J. Corso, University of Michigan

Paper

Abstract

Personally I find the abstract fairly conventional: it simply presents the model's highlights, namely its strong performance on the DHF1K, Hollywood2, and UCF Sports datasets.

Introduction

This passage gives a nice statement of the research significance:
Video saliency detection aims to model the gaze fixation patterns of humans when viewing a dynamic scene. Because the predicted saliency map can be used to prioritize the video information across space and time, this task has a number of applications such as video surveillance [12, 41], video captioning [27], video compression [11, 13], etc.
Previous work mostly extracted temporal information with LSTMs, but the problem is that the extraction, fusion, and encoding of spatiotemporal features are then performed as separate steps.

  • S3D itself is essentially a network that extracts spatial and temporal features separately (via factorized convolutions)

fail to jointly process spatial and temporal information when predicting a saliency map from the extracted features. Specifically, either spatial decoding and temporal aggregation are performed separately, or only one of these two processes is considered for the final prediction.
The paper therefore proposes a 3D convolutional network to extract the features; the overall design is described as a
3D fully-convolutional encoder-decoder network architecture for video saliency detection,
which we call the Temporally-Aggregating Spatial Encoder-Decoder Network (TASED-Net).

[Figure: overall TASED-Net architecture]
The encoder of the network is transferred from an S3D model pre-trained on an action-recognition dataset.
S3D is efficient and effective in extracting spatiotemporal features,
For the decoder, the paper proposes Auxiliary pooling, which provides a larger temporal receptive field while reducing the temporal dimension; in practice it works better than interpolation and transposed convolution (deconvolution).
To motivate Auxiliary pooling, the paper states:
The tricky part is that the max-unpooling layers cannot reuse the pooling indices or switches [42] from the corresponding max-pooling layers since they have larger temporal receptive field than the max-unpooling layers.

  • I did not fully understand this at first; my reading is that the encoder's max-pooling layers operate on features with a larger temporal extent than the decoder's unpooling layers, so their switch tensors have mismatched shapes, and Auxiliary pooling exists precisely to re-pool the encoder features temporally so that the switches do match.

The three contributions are:

  1. An end-to-end 3D fully-convolutional encoder-decoder network
  2. The newly proposed Auxiliary pooling operation
  3. Strong results on the three major benchmark datasets

Related Work

Video saliency detection

STSConvNet models spatial and temporal information in two separate streams, with the temporal stream fed by optical flow;
RMDN uses a C3D model to extract spatiotemporal features jointly and adds an LSTM module for long-range temporal information;
OMCNN extracts spatial and temporal information with YOLO and FlowNet respectively and then feeds both into an LSTM network;
ACLNet builds a ConvLSTM on top of SALICON and even outperforms image saliency detection methods, which shows how important spatial features are for video saliency detection.

2D ConvNets

Introduces VGG16 and the SegNet model as background,
and sets up the Auxiliary pooling layer:
max-unpooling is more suitable for decoding than other upsampling operations such as linear upsampling or even learnable upsampling method through transposed convolution,
which inspires our Auxiliary pooling.

3D ConvNets

Introduces S3D and notes that, for video feature extraction, S3D can play the same role that VGG16 plays for images.

Approach

Overview

Three assumptions:

  1. The saliency map of a frame can be determined from a fixed number T of preceding frames
  2. Saliency detection with multi-frame input performs better than with a single image
  3. The video provides enough input frames, at least 2T-1 (with T = 32 in the released code, that is at least 63 frames)

Core idea: temporal information is aggregated inside the decoder (prediction network), which requires the proposed Auxiliary pooling:
our model comes from temporal aggregation inside the prediction network, which requires extra operations that we call Auxiliary pooling
The backbone structure of the network is as follows:
[Figure: TASED-Net backbone architecture]

Auxiliary pooling

The concrete operation is easy to follow from the example below: the 1st Auxiliary pooling is a max pooling over the temporal dimension, then the 2nd Auxiliary pooling is a max pooling over the spatial dimensions, and the switches it returns finally determine the locations used for upsampling (max-unpooling).
Compared with previous approaches, because the decoder here keeps the spatial dimensions of the encoder features while the temporal dimension is reduced, Auxiliary pooling has to be introduced so that the dimensions of the skipped features and their switches match the decoder.
[Figure: step-by-step example of Auxiliary pooling]
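To make this concrete, below is a minimal PyTorch sketch of the idea (the kernel sizes and channel counts are illustrative, not the repo's exact configuration): an encoder skip feature is first max-pooled over time, then max-pooled over space with return_indices=True, and the returned switches are later consumed by a MaxUnpool3d in the decoder.

import torch
import torch.nn as nn

# encoder skip feature: (batch, channels, T, H, W), e.g. the output of an S3D block
feat = torch.randn(1, 192, 16, 56, 96)

# 1st Auxiliary pooling: max pooling over the temporal dimension (16 -> 4)
aux_temporal = nn.MaxPool3d(kernel_size=(4, 1, 1), stride=(4, 1, 1))
# 2nd Auxiliary pooling: spatial max pooling that also returns the switches
aux_spatial = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2), return_indices=True)

t = aux_temporal(feat)             # (1, 192, 4, 56, 96)
pooled, switches = aux_spatial(t)  # (1, 192, 4, 28, 48), switches of the same shape

# in the decoder, a feature of matching shape is spatially unpooled with those switches
decoder_feat = torch.randn(1, 192, 4, 28, 48)
unpool = nn.MaxUnpool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
upsampled = unpool(decoder_feat, switches)
print(upsampled.shape)             # torch.Size([1, 192, 4, 56, 96])

Because the temporal pooling happens first, the switches already have the reduced temporal size used in the decoder, which is exactly what resolves the shape mismatch mentioned in the Introduction.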

Temporal aggregation strategy

As shown below, the authors tried three spatio-temporal decoding strategies; the experiments show that the last one, a two-step scheme that first upsamples spatially and then aggregates temporally, gives the best performance.
[Figure: the three temporal aggregation strategies compared]

Evaluation

Datasets: DHF1K, Hollywood2, and UCF Sports; given its richer and more diverse data, DHF1K is used as the main benchmark.

  • Each frame is resized to 224 × 384.
  • batch size of 40 on 600 videos from the DHF1K training set
  • SGD algorithm with 0.9 momentum in an end-to-end manner.
  • The learning rate is fixed at 0.001 for the encoder network.
  • sample 2K clips to approximate the validation loss.
  • Kullback-Leibler (KL) divergence as the loss function, which Jiang et al. [20] have
    shown to be effective for training saliency models (a minimal sketch of such a loss
    follows this list).
  • Metrics: (i) Normalized Scanpath Saliency (NSS), (ii) Linear Correlation Coefficient (CC),
    (iii) Similarity (SIM), (iv) Area Under the Curve by Judd (AUC-J),
    and (v) Shuffled-AUC (s-AUC).
  • 21.2M parameters and 63.2G FLOPs
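For reference, here is a minimal sketch of a KL-divergence loss for saliency maps (the official loss.py may differ in details such as normalization and numerical constants): both the predicted map and the ground-truth fixation density are normalized to probability distributions over the pixels before computing KL(gt || pred).

import torch
import torch.nn as nn

class KLDLossSketch(nn.Module):
    ''' KL divergence between ground-truth and predicted saliency distributions '''
    def forward(self, pred, gt, eps=1e-7):
        # pred, gt: (batch, H, W) non-negative saliency maps
        pred = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)
        gt = gt / (gt.sum(dim=(1, 2), keepdim=True) + eps)
        kld = (gt * torch.log(gt / (pred + eps) + eps)).sum(dim=(1, 2))
        return kld.mean()

# usage: criterion = KLDLossSketch(); loss = criterion(model(clip), annotation)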

Different numbers of input frames
[Figure: results for different numbers of input frames]
Performance of different models on the DHF1K dataset
[Figure: comparison with other models on DHF1K]
Different datasets
[Figure: results on the other datasets]

Necessity of Auxiliary pooling

The first variant, which we call TASED-Net-tri, replaces all the max-unpooling layers with trilinear upsampling (interpolation).
The second variant, which we name TASED-Net-trp, replaces the max-unpooling layers with transposed convolutions (deconvolution).
[Figure: comparison of TASED-Net with the two upsampling variants]
Number of input frames
[Figure: effect of the number of input frames]

Code

model.py

S3D

For the BasicConv3d, SepConv3d, and Mixed_* modules in the model, refer to this post:

Video classification: the S3D (separable 3D convolutions) model and code analysis (视频分类 S3D(separable 3D convolutions)模型及代码分析)
Since every layer of the network is a standard, predefined PyTorch module, this section focuses on the overall structure of the network.
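As background, the core idea of S3D's separable convolution is to factorize a k×k×k 3D convolution into a 1×k×k spatial convolution followed by a k×1×1 temporal convolution. A minimal sketch of that idea (the repo's SepConv3d additionally inserts BatchNorm and ReLU between the two convolutions):

import torch
import torch.nn as nn

class SepConv3dSketch(nn.Module):
    ''' factorized 3D convolution: spatial (1,k,k) followed by temporal (k,1,1) '''
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_s = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                padding=(0, k // 2, k // 2), bias=False)
        self.conv_t = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (batch, C, T, H, W)
        return self.conv_t(self.conv_s(x))

x = torch.randn(1, 3, 32, 224, 384)
print(SepConv3dSketch(3, 64)(x).shape)  # torch.Size([1, 64, 32, 224, 384])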

Backbone network

Reading this figure together with the data flow in the forward function of TASED_v2 gives a quick, intuitive picture of the network structure.
[Figure: TASED-Net structure with per-layer feature map sizes]
In the encoder, the output of each Mixed_* block of S3D is taken as a skip feature and passed through an Auxiliary pooling, i.e. a temporal MaxPool3d followed by a spatial MaxPool3d with return_indices=True; the resulting indices feed the different layers of the decoder.
In the decoder, the final S3D output first goes through a 1×1×1 3D convolution to re-distribute the encoded information across the channel dimension. This is followed by three convtsp blocks, each built from nn.ConvTranspose3d(stride=1) and nn.MaxUnpool3d(stride=(1,2,2)). The last block, convtsp4, consists of two pairs of nn.ConvTranspose3d(stride=(1,2,2)) and nn.Conv3d(stride=(2,1,1)).
The feature map sizes of every layer are annotated in the figure above and repeated as shape comments in the code below; the way encoder and decoder feature layers are connected is structurally quite close to U-Net.
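As a sanity check on the shape annotations in the code below, here is a rough sketch of how convtsp4 can map [1, 192, 4, 56, 96] to [1, 1, 1, 224, 384]: two transposed convolutions with stride (1,2,2) take care of the 4x spatial upsampling, while Conv3d layers with kernel and stride (2,1,1) collapse the temporal dimension 4 -> 2 -> 1. Channel widths, kernel sizes, and the final sigmoid are illustrative assumptions, not necessarily the repo's exact values.

import torch
import torch.nn as nn

convtsp4_sketch = nn.Sequential(
    # spatial x2: (4, 56, 96) -> (4, 112, 192)
    nn.ConvTranspose3d(192, 64, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1), bias=False),
    nn.ReLU(inplace=True),
    # temporal aggregation: 4 -> 2
    nn.Conv3d(64, 64, kernel_size=(2, 1, 1), stride=(2, 1, 1), bias=False),
    nn.ReLU(inplace=True),
    # spatial x2: (2, 112, 192) -> (2, 224, 384)
    nn.ConvTranspose3d(64, 4, kernel_size=(1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1), bias=False),
    nn.ReLU(inplace=True),
    # temporal aggregation: 2 -> 1, then collapse to a single-channel saliency map
    nn.Conv3d(4, 4, kernel_size=(2, 1, 1), stride=(2, 1, 1), bias=False),
    nn.Conv3d(4, 1, kernel_size=1),
    nn.Sigmoid(),
)

z = torch.randn(1, 192, 4, 56, 96)
print(convtsp4_sketch(z).shape)  # torch.Size([1, 1, 1, 224, 384])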

import torch
import torch.nn as nn

class TASED_v2(nn.Module):
    def __init__(self):
        super(TASED_v2, self).__init__()
        # layer definitions are omitted here for brevity: base1-base4 are the four stages
        # of S3D, maxp*/maxm*/maxt* are the (Auxiliary) pooling layers, and convtsp*/unpool*
        # form the decoder; see model.py in the official repo for the full definitions

    def forward(self, x):
        # Encoder
        # [1, 3, 32, 224, 384] => [1, 192, 16, 56, 96]
        y3 = self.base1(x)
        # [1, 192, 16, 56, 96] => [1, 192, 16, 28, 48]
        y = self.maxp2(y3)
        # [1, 192, 16, 56, 96] => [1, 192, 4, 56, 96]
        y3 = self.maxm2(y3)
        # [1, 192, 4, 56, 96] => [1, 192, 4, 28, 48]
        _, i2 = self.maxt2(y3)
        # [1, 192, 16, 28, 48] => [1, 480, 16, 28, 48]
        y2 = self.base2(y)
        # [1, 480, 16, 28, 48] => [1, 480, 8, 14, 24]
        y = self.maxp3(y2)
        # [1, 480, 16, 28, 48] => [1, 480, 4, 28, 48]
        y2 = self.maxm3(y2)
        # [1, 480, 4, 28, 48] => [1, 480, 4, 14, 24]
        _, i1 = self.maxt3(y2)
        # [1, 480, 8, 14, 24] => [1, 832, 8, 14, 24]
        y1 = self.base3(y)
        # [1, 832, 8, 14, 24] => [1, 832, 4, 14, 24]
        y = self.maxt4(y1)
        # [1, 832, 4, 14, 24] => [1, 832, 4, 7, 12], [1, 832, 4, 7, 12]
        y, i0 = self.maxp4(y)
        # [1, 832, 4, 7, 12] => [1, 1024, 4, 7, 12]
        y0 = self.base4(y)

        # Decoder
        # [1, 1024, 4, 7, 12] => [1, 832, 4, 7, 12]
        z = self.convtsp1(y0)
        # [1, 832, 4, 7, 12], [1, 832, 4, 7, 12] => [1, 832, 4, 14, 24]
        z = self.unpool1(z, i0)
        # [1, 832, 4, 14, 24] => [1, 480, 4, 14, 24]
        z = self.convtsp2(z)
        # [1, 480, 4, 14, 24], [1, 480, 4, 14, 24] => [1, 480, 4, 28, 48]
        z = self.unpool2(z, i1, y2.size())
        # [1, 480, 4, 28, 48] => [1, 192, 4, 28, 48]
        z = self.convtsp3(z)
        # [1, 192, 4, 28, 48], [1, 192, 4, 28, 48] => [1, 192, 4, 56, 96]
        z = self.unpool3(z, i2, y3.size())
        # [1, 192, 4, 56, 96] => [1, 1, 1, 224, 384]
        z = self.convtsp4(z)
        # [1, 1, 1, 224, 384] => [1, 224, 384]
        z = z.view(z.size(0), z.size(3), z.size(4))

        return z

if __name__ == '__main__':
    model = TASED_v2()

    # list the parameter names and shapes
    for name, parameters in model.named_parameters():
        print(name, ':', parameters.size())

    # inspect the keys of the state dict
    model_dict = model.state_dict()
    print(model_dict.keys())

    # with the full layer definitions, a 32-frame 224x384 clip yields a single
    # saliency map of shape torch.Size([1, 224, 384])
    input_tensor = torch.randn(1, 3, 32, 224, 384)
    output_tensor = model(input_tensor)
    print(output_tensor.shape)

Source code
https://github.com/MichiganCOG/TASED-Net

run_example.py

This script demonstrates the model's predictions on an example; with the weight file provided by the authors, results can be generated directly.

import sys
import os
import numpy as np
import cv2
import torch
from model import TASED_v2
from scipy.ndimage import gaussian_filter

def main():
    ''' read frames in path_indata and generate frame-wise saliency maps in path_output '''
    # optional two command-line arguments
    path_indata = './example'
    path_output = './output'
    if len(sys.argv) > 1:
        path_indata = sys.argv[1]
        if len(sys.argv) > 2:
            path_output = sys.argv[2]
    if not os.path.isdir(path_output):
        os.makedirs(path_output)

    len_temporal = 32
    file_weight = './TASED_updated.pt'

    model = TASED_v2()
    # print(model)

    # load the weight file and copy the parameters
    if os.path.isfile(file_weight):
        print ('loading weight file')
        weight_dict = torch.load(file_weight)
        model_dict = model.state_dict()
        for name, param in weight_dict.items():
            if 'module' in name:
                name = '.'.join(name.split('.')[1:])
            if name in model_dict:
                if param.size() == model_dict[name].size():
                    model_dict[name].copy_(param)
                else:
                    print (' size? ' + name, param.size(), model_dict[name].size())
            else:
                print (' name? ' + name)

        print (' loaded')
    else:
        print ('weight file?')

    model = model.cuda()
    # input sizes are fixed here; cudnn.benchmark is set to False (benchmark=True can speed
    # things up when input sizes do not change, see https://blog.csdn.net/byron123456sfsfsfa/article/details/96003317)
    torch.backends.cudnn.benchmark = False
    model.eval()

    # iterate over the path_indata directory
    list_indata = [d for d in os.listdir(path_indata) if os.path.isdir(os.path.join(path_indata, d))]
    list_indata.sort()
    for dname in list_indata:
        print ('processing ' + dname)
        list_frames = [f for f in os.listdir(os.path.join(path_indata, dname)) if os.path.isfile(os.path.join(path_indata, dname, f))]
        list_frames.sort()

        # process in a sliding window fashion
        if len(list_frames) >= 2*len_temporal-1:
            path_outdata = os.path.join(path_output, dname)
            if not os.path.isdir(path_outdata):
                os.makedirs(path_outdata)

            snippet = []
            for i in range(len(list_frames)):
                img = cv2.imread(os.path.join(path_indata, dname, list_frames[i]))
                img = cv2.resize(img, (384, 224))
                img = img[...,::-1]
                snippet.append(img)

                # once at least len_temporal (32) frames have accumulated, the snippet holds 32 images
                if i >= len_temporal-1:
                    clip = transform(snippet)

                    process(model, clip, path_outdata, i)

                    # for the first (len_temporal-1) frames there is no full backward window, so the
                    # clip is fed in reverse order to produce those predictions; feeding temporally
                    # later frames this way feels a bit arbitrary, probably just to pad the output
                    # process first (len_temporal-1) frames
                    if i < 2*len_temporal-2:
                        process(model, torch.flip(clip, [1]), path_outdata, i-len_temporal+1)

                    # drop the oldest frame to keep the sliding window at 32 frames
                    del snippet[0]

        else:
            print (' more frames are needed')


def transform(snippet):
    ''' stack & noralization '''
    #list => (224,384,96)
    snippet = np.concatenate(snippet, axis=-1)
    # (224, 384, 96)=> (96, 224, 384) float
    snippet = torch.from_numpy(snippet).permute(2, 0, 1).contiguous().float()
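    # map pixel values from [0, 255] to [-1, 1]: x -> (2*x - 255) / 255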
    snippet = snippet.mul_(2.).sub_(255).div(255)
    #(96, 224, 384) => (1, 32, 3, 224, 384) 
    return snippet.view(1,-1,3,snippet.size(1),snippet.size(2)).permute(0,2,1,3,4)


def process(model, clip, path_outdata, idx):
    ''' process one clip and save the predicted saliency map '''
    with torch.no_grad():
        smap = model(clip.cuda()).cpu().data[0]

    # smap (224, 384)
    # quantize to 8-bit levels (np.int was removed from recent NumPy; plain int behaves the same)
    smap = (smap.numpy()*255.).astype(int)/255.
    smap = gaussian_filter(smap, sigma=7)
    cv2.imwrite(os.path.join(path_outdata, '%04d.png'%(idx+1)), (smap/np.max(smap)*255.).astype(np.uint8))


if __name__ == '__main__':
    main()
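
To reproduce the demo: place the pretrained weight file TASED_updated.pt (provided by the authors) in the working directory, put the extracted frames of each video in a sub-directory of ./example, and run python run_example.py (an input directory and an output directory can optionally be passed as the two command-line arguments). The predicted saliency maps are written as %04d.png files under the corresponding sub-directory of ./output.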

run_train.py

import sys
import os
import numpy as np
import cv2
import time
from datetime import timedelta
import torch
from model import TASED_v2
from loss import KLDLoss
from dataset import DHF1KDataset, InfiniteDataLoader
from itertools import islice

def main():
    ''' concise script for training '''
    # optional two command-line arguments
    path_indata = './DHF1K'
    path_output = './output'
    if len(sys.argv) > 1:
        path_indata = sys.argv[1]
        if len(sys.argv) > 2:
            path_output = sys.argv[2]

    # we checked that using only 2 gpus is enough to produce similar results
    num_gpu = 1
    pile = 5
    batch_size = 8
    num_iters = 1000
    len_temporal = 32
    file_weight = '../S3D-master/S3D_kinetics400.pt'
    path_output = os.path.join(path_output, time.strftime("%m-%d_%H-%M-%S"))
    if not os.path.isdir(path_output):
        os.makedirs(path_output)

    model = TASED_v2()

    # load the weight file and copy the parameters
    if os.path.isfile(file_weight):
        print ('loading weight file')
        weight_dict = torch.load(file_weight)
        model_dict = model.state_dict()
        for name, param in weight_dict.items():
            # rename keys from the weight file to match the parameter names in model.py
            # weight name : 'module.base.0.conv_s.weight'
            if 'module' in name:
                # name: 'base.0.conv_s.weight'
                name = '.'.join(name.split('.')[1:])
            if 'base.' in name:
                bn = int(name.split('.')[1]) # base num
                sn_list = [0, 5, 8, 14]
                sn = sn_list[0]
                if bn >= sn_list[1] and bn < sn_list[2]:
                    sn = sn_list[1]
                elif bn >= sn_list[2] and bn < sn_list[3]:
                    sn = sn_list[2]
                elif bn >= sn_list[3]:
                    sn = sn_list[3]
                # 'conv_s.weight'
                name = '.'.join(name.split('.')[2:])
                # 'base1.0.conv_s.weight'
                name = 'base%d.%d.'%(sn_list.index(sn)+1, bn-sn)+name
            if name in model_dict:
                if param.size() == model_dict[name].size():
                    model_dict[name].copy_(param)
                else:
                    print (' size? ' + name, param.size(), model_dict[name].size())
            else:
                print (' name? ' + name)

        print (' loaded')
    else:
        print ('weight file?')

    # parameter setting for fine-tuning: use different learning rates for different layers.
    # the encoder learning rate is fixed at 0.001, while for the prediction (decoder) network
    # it starts at 0.1 and decays twice by a factor of 10 based on the validation loss
    params = []
    for key, value in dict(model.named_parameters()).items():
        if 'convtsp' in key:
            # decoder (prediction network) parameters: tag the key with '(new)' so the lr schedule below can find them
            params += [{'params':[value], 'key':key+'(new)'}]
        else:
            # encoder parameters: learning rate fixed at 0.001
            params += [{'params':[value], 'lr':0.001, 'key':key}]

    optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=2e-7)
    criterion = KLDLoss()

    model = model.cuda()
    # parallelize across multiple GPUs (see https://zhuanlan.zhihu.com/p/102697821)
    model = torch.nn.DataParallel(model, device_ids=range(num_gpu))
    torch.backends.cudnn.benchmark = False
    model.train()

    train_loader = InfiniteDataLoader(DHF1KDataset(path_indata, len_temporal), batch_size=batch_size, shuffle=True, num_workers=24)

    i, step = 0, 0
    loss_sum = 0
    start_time = time.time()
    # islice(iterable, [start,] stop[, step]): take a slice of the (infinite) data loader
    for clip, annt in islice(train_loader, num_iters*pile):
        with torch.set_grad_enabled(True):
            output = model(clip.cuda())
            loss = criterion(output, annt.cuda())

        loss_sum += loss.item()
        loss.backward()
        # effective batch size: pile*batch_size = 40, accumulated over 'pile' iterations because of the model's size and parameter count
        if (i+1) % pile == 0:
            optimizer.step()
            optimizer.zero_grad()
            step += 1

            # the whole training takes less than 3 hours; print the average loss of each effective batch
            print ('iteration: [%4d/%4d], loss: %.4f, %s' % (step, num_iters, loss_sum/pile, timedelta(seconds=int(time.time()-start_time))), flush=True)
            loss_sum = 0

            # adjust the learning rate of the decoder layers only
            # first decaying point is at step 750, the second one is at step 950.
            if step in [750, 950]:
                for opt in optimizer.param_groups:
                    if 'new' in opt['key']:
                        opt['lr'] *= 0.1

            if step % 25 == 0:
                torch.save(model.state_dict(), os.path.join(path_output, 'iter_%04d.pt' % step))

        i += 1


if __name__ == '__main__':
    main()
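
To train: put the DHF1K training data under ./DHF1K (in the layout expected by dataset.py, which is not shown here) and the Kinetics-pretrained S3D weights at ../S3D-master/S3D_kinetics400.pt, then run python run_train.py (optionally passing the data and output directories). A checkpoint is saved every 25 effective steps to ./output/<timestamp>/iter_%04d.pt.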