Understanding the non-local module (with GLTR for video person ReID as an example)

Paper Content

Global-Local Temporal Representations For Video Person Re-Identification (original paper link)

Main Method

This paper proposes the Global-Local Temporal Representation (GLTR) to exploit the multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling the short-term temporal cues among adjacent frames, then capturing the long-term relations among inconsecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrian. The long-term relations are captured by a temporal self-attention model to alleviate the occlusions and noises in video sequences. The short and long-term temporal cues are aggregated as the final GLTR by a simple single-stream CNN.

GLTR first models short-term temporal cues among adjacent frames, then captures long-term relations among inconsecutive frames. The short-term cues are extracted by the Dilated Temporal Pyramid (DTP): parallel dilated convolutions with different temporal dilation rates. The long-term relations are extracted by the Temporal Self-Attention (TSA) model, which alleviates the impact of occlusion and noise in the video sequences. The short- and long-term features are aggregated by a simple single-stream CNN.

Model Details

Model architecture diagram

GLTR framework

Understanding the forward-pass code

Code source: https://github.com/ljn114514/GLTR

Dilated Temporal Pyramid (DTP)
"""
		self.feat1 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(1,1), padding=(1,0), bias=False)
		self.feat2 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(2,1), padding=(2,0), bias=False)
		self.feat3 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(3,1), padding=(3,0), bias=False)
"""
	def forward(self, x, shape=None):
		if shape is not None:
			x = x.squeeze()[:shape]
		x = self.conv1(x)
		x = self.bn1(x)
		x = self.relu(x)
		x = self.maxpool(x)

		x = self.layer1(x)
		x = self.layer2(x)
		x = self.layer3(x)
		x = self.layer4(x)
		
		x = self.avgpool(x)
		x = x.view(x.size(0), -1)  ## resnet backbone

		x = self.feat(x)  ## project each frame to a feat_dim=128 embedding
		x = x.view(x.size(0)//self.frames, self.frames, -1)  #(N, T, d), d=feat_dim=128, T=self.frames=16

		x0 = torch.transpose(x, 1, 2) #(N, d, T)
		x = x.unsqueeze(dim=1) #(N, 1, T, d)
		x1 = self.feat1(x).squeeze(dim=3)
		#feat1 output: (N, 128, T, 1), squeezed to (N, 128, T); conv(in_channels=1, out_channels=128)
		x2 = self.feat2(x).squeeze(dim=3)
		x3 = self.feat3(x).squeeze(dim=3)

		x = torch.cat((x0, x1, x2, x3), dim=1)  #(N, 4*128, T) short-term DTP feature map
		x = self.Nonlocal_block0(x).mean(dim=2)  #TSA: non-local block over time, then temporal average -> (N, 4*128)
		if self.istrain:
			x = self.feat_bn(x)
			x = self.relu(x)
			x = self.drop(x)
			x = self.classifier(x)
		return x

By modifying the dataloader, 16 consecutive frames are sampled from each sequence as the model input, i.e. T = self.frames = 16. After the ResNet backbone and the self.feat projection, x has shape (N, T, d), where N is the number of sequences and d = 128 is the embedding dimension of each frame. After reshaping, x has shape (N, 1, T, d) and is fed into the dilated convolutions.
Take self.feat2 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(2,1), padding=(2,0), bias=False) as an example. This dilated convolution takes 1 input channel and outputs 128 channels. Its kernel size is (3, 128), i.e. it covers 3 frames × all 128 feature dimensions at once (with dilation rate 2, the 3 sampled frames are spaced 2 time steps apart).
The output of each dilated convolution has shape (N, 128, T, 1): at every temporal position, 3 frames × 128 dimensions are aggregated into a single value, and the 128 output channels form a 128 × T short-term feature map, i.e. the transformation of $f_t$ into $f'_t$ described in Section 3.1 of the paper.
Concatenating x0, x1, x2 and x3 yields the short-term feature map of shape (N, 4d, T), which serves as the input for long-term feature extraction.
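As a quick sanity check on the shape arithmetic above, here is a minimal standalone sketch (not the repo's code): with padding equal to the dilation rate and a temporal kernel size of 3, the output length along time stays T, while the width-128 feature axis collapses to 1, so every branch outputs (N, 128, T, 1).

import torch
import torch.nn as nn

N, T, d = 2, 16, 128
x = torch.randn(N, 1, T, d)          # (N, 1, T, d) frame embeddings, as in the forward pass

for r in (1, 2, 3):                  # the three dilation rates of feat1/feat2/feat3
    conv = nn.Conv2d(1, d, kernel_size=(3, d), stride=1,
                     dilation=(r, 1), padding=(r, 0), bias=False)
    out = conv(x)
    # temporal length: T + 2*r - r*(3-1) = T;  width: d - (d-1) = 1
    print(r, out.shape)              # torch.Size([2, 128, 16, 1]) for every r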

Temporal Self-Attention (TSA)
# non-local block
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock1D(nn.Module):
    def __init__(self, in_channels, inter_channels=None):
        '''
        when training:
        in_channels=4d (4*128), inter_channels=None
        '''
        super(NonLocalBlock1D, self).__init__()
        self.in_channels = in_channels
        self.inter_channels = inter_channels
        
        if self.inter_channels is None:
            self.inter_channels = in_channels // 2
            if self.inter_channels == 0:
                self.inter_channels = 1

        self.g = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)
        self.W = nn.Sequential(
            nn.Conv1d(self.inter_channels, in_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm1d(in_channels)
        )
        nn.init.constant_(self.W[1].weight, 0)
        nn.init.constant_(self.W[1].bias, 0)       

        self.theta = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)
        self.phi = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        '''
        :param x: (N,4d,T)
        :return:
        '''
        batch_size = x.size(0)
        g_x = self.g(x).view(batch_size, self.inter_channels, -1) # (N,2d,T)
        g_x = g_x.permute(0, 2, 1) #(N, T, 2d)

        theta_x = self.theta(x).view(batch_size, self.inter_channels, -1)
        theta_x = theta_x.permute(0, 2, 1) #(N,T,2d)
        phi_x = self.phi(x).view(batch_size, self.inter_channels, -1) #(N,2d,T)
        f = torch.matmul(theta_x, phi_x) #(N,T,T)
        f_div_C = F.softmax(f, dim=-1) 
        # softmax over the last (key) dimension, so that each row of f_div_C is a
        # normalized attention weight used correctly in torch.matmul(f_div_C, g_x)
        y = torch.matmul(f_div_C, g_x) #(N,T,2d) 
        y = y.permute(0, 2, 1).contiguous() #(N,2d,T)
        y = y.view(batch_size, self.inter_channels, *x.size()[2:]) #(N,2d,T)
        W_y = self.W(y)
        z = W_y + x  #(N,4d,T)
        return z

self.theta(), self.phi() and self.g() correspond to the convolutions in the figure that produce $B$, $C$ and $\bar{\mathcal{F}'}$. The output z is the result of self-attention across the 16 frames and therefore carries the long-term temporal cues. Finally, averaging over the temporal dimension gives an (N, 4d) vector, i.e. one feature for every 16 consecutive frames.
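A minimal usage sketch of the block above (assuming NonLocalBlock1D and its imports from the snippet are in scope, with 4d = 512 channels and T = 16 as in GLTR): the attention map f is T × T, and the temporal average of z gives the (N, 4d) GLTR vector.

import torch

block = NonLocalBlock1D(in_channels=512)   # 4d = 4 * 128 channels coming from DTP
block.eval()                               # inference mode for the BatchNorm inside W

x = torch.randn(2, 512, 16)                # (N, 4d, T) short-term feature map
with torch.no_grad():
    z = block(x)                           # (N, 4d, T): self-attended long-term features
    gltr = z.mean(dim=2)                   # (N, 4d): one GLTR vector per 16-frame clip
print(z.shape, gltr.shape)                 # torch.Size([2, 512, 16]) torch.Size([2, 512])

Note that the BatchNorm inside W is zero-initialized (nn.init.constant_ on weight and bias), so a freshly constructed block behaves as an identity mapping (z = x); the attention term only starts to contribute after training. This is the usual residual-friendly initialization of non-local blocks.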

Application Method

From Section 4.2 of the paper:

  • Employ standard ResNet50 [12] as the backbone for frame feature extraction.
  • For 2D CNN training, the training is finished after 20 epochs.
  • For DTP and TSA training, we sample 16 adjacent frames from each sequence as input for each training epoch.
  • All models are trained with only softmax loss.
  • During testing, we use the 2D CNN to extract a d=128-dim feature from each video frame, then fuse the frame features into GLTR using the network illustrated in the figure (a rough test-time sketch follows below).
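As a rough sketch of this test-time procedure (a hedged illustration only: the names backbone, embed and dtp are assumptions for clarity, not the repo's actual API, and NonLocalBlock1D from the TSA snippet above is assumed to be in scope): a 2D CNN maps each frame to a d = 128-dim feature, and the DTP + TSA network fuses the 16 frame features of one tracklet into a single GLTR descriptor used for retrieval.

import torch
import torch.nn as nn
import torchvision

d, T = 128, 16

# hypothetical 2D CNN: ResNet50 backbone plus a d-dim embedding head
resnet = torchvision.models.resnet50()
backbone = nn.Sequential(*list(resnet.children())[:-1]).eval()   # pooled 2048-dim frame features
embed = nn.Linear(2048, d).eval()

# DTP branches: parallel dilated temporal convs with dilation rates 1/2/3, as in the DTP snippet
dtp = nn.ModuleList([
    nn.Conv2d(1, d, kernel_size=(3, d), dilation=(r, 1), padding=(r, 0), bias=False)
    for r in (1, 2, 3)
])
tsa = NonLocalBlock1D(in_channels=4 * d).eval()

frames = torch.randn(T, 3, 256, 128)             # one tracklet of T = 16 adjacent frames
with torch.no_grad():
    f = embed(backbone(frames).flatten(1))       # (T, d) per-frame features
    f = f.unsqueeze(0)                           # (1, T, d): a batch holding one tracklet
    branches = [f.transpose(1, 2)]               # identity branch, (1, d, T)
    branches += [conv(f.unsqueeze(1)).squeeze(3) for conv in dtp]
    st = torch.cat(branches, dim=1)              # (1, 4d, T) short-term DTP feature map
    gltr = tsa(st).mean(dim=2)                   # (1, 4d) GLTR descriptor
print(gltr.shape)                                # torch.Size([1, 512])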

Further Reading

SANet: link
