【SIFA】

最新推荐文章于 2024-07-06 12:03:33 发布

Shiina丶Mashiro

最新推荐文章于 2024-07-06 12:03:33 发布

阅读量128

点赞数

文章标签： python 深度学习人工智能

本文链接：https://blog.csdn.net/afadgas/article/details/130238885

版权

简介

这篇文章是一篇视频动作识别的论文，和之前的图像分类文章不同。从结构来说它做的是将当前帧的patch（作为Q）和下一帧的八邻域patches（作为K和V）之间做一个注意力。这九个patches根据offset predictor 预测做的注意力和原位置做注意力的相加。
另外一个特点就是以通用图像识别骨干网络（resnet）作为主干网络。
在这里插入图片描述

代码

这个帧间关系的模块用在了resnet的倒数三个layer。

        x_origin = self.layer4(x)
        x_dual = self.maxpool_dual(x) #两帧抽一帧
        x_dual = self.layer4_dual(x_dual) #抽一帧之后再来resnet_layer
        x_g_origin = self.pool(x_origin) 
        x_g_dual = self.pool_dual(x_dual)
        x_g_origin = self.drop(x_g_origin)
        x_g_dual = self.drop_dual(x_g_dual)
        x1_origin = self.fc(x_g_origin)
        x1_dual = self.fc_dual(x_g_dual)

        return [x1_origin, x1_dual]

在这里插入图片描述
out_tmp是运动图，通过运动图求偏置。这个偏置是 $2k^2$ 个位移。
corre_weight做一个带偏置的注意力。

out_agg是融合，详细代码在下面：

        # insert the aggregate block
        out_tmp = out.clone()
        out_tmp[:,:,1:,:,:] = out_tmp[:,:,:-1,:,:]
        out_tmp = (torch.sigmoid(out - out_tmp) * out) + out # [sig(f_{t}-f_{t-1})*f_{t}]*f_{t}
        offset = self.conv_offset(out_tmp)        
        
        corre_weight = self.def_cor(out, offset)
        out_agg = self.def_agg(out, offset, corre_weight)
        
        mask = torch.ones(out.size()).cuda()
        mask[:,:,-1,:,:] = 0
        mask.requires_grad = False
        out_shift = out_agg.clone()
        out_shift[:,:,:-1,:,:] = out_shift[:,:,1:,:,:]
        out = out + out_shift * mask

代码写了一段cuda:核心代码如下：实现了下一帧的n个patch和本帧的patch相乘再乘一个通道权重。

    float val = static_cast<float>(0);
    const int c_in_start = group_id * c_in_per_defcor;
    const int c_in_end = (group_id + 1) * c_in_per_defcor;
    if (h_im > -1 && w_im > -1 && h_im < height && w_im < width){
      for (int c_in = c_in_start; c_in < c_in_end; ++c_in){
        const int former_ptr = (((n_out * channels + c_in) * times + t_former) * height + h_out) * width + w_out;
        const int weight_offset = ((c_in * times + t_out) * kernel_h + kh) * kernel_w + kw;
        const float* data_im_ptr = data_im + ((n_out * channels + c_in) * times + t_out) * height * width;
        float next_val = im2col_bilinear_cuda(data_im_ptr, width, height, width, h_im, w_im);
        val += (data_weight[weight_offset] * data_im[former_ptr] * next_val);
      }
    }
    data_out[index] = val;