Understanding the Non-Local Module (with Video Person ReID: GLTR as an Example)
Paper Content
Global-Local Temporal Representations For Video Person Re-Identification: paper link
Main Method
This paper proposes the Global-Local Temporal Representation (GLTR) to exploit the multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling the short-term temporal cues among adjacent frames, then capturing the long-term relations among inconsecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrian. The long-term relations are captured by a temporal self-attention model to alleviate the occlusions and noises in video sequences. The short and long-term temporal cues are aggregated as the final GLTR by a simple single-stream CNN.
GLTR first extracts short-term temporal features from adjacent frames, then long-term features from inconsecutive frames. The short-term cues are modeled by parallel dilated convolutions with different temporal dilation rates, the Dilated Temporal Pyramid (DTP). The long-term relations are captured by a temporal self-attention model, the Temporal Self-Attention (TSA), which alleviates the effect of occlusion and noise in video sequences. The short- and long-term features are aggregated by a simple single-stream CNN.
Model Details
Model architecture diagram
Understanding the Forward-Pass Code
Code source: https://github.com/ljn114514/GLTR
Dilated Temporal Pyramid (DTP)
"""
self.feat1 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(1,1), padding=(1,0), bias=False)
self.feat2 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(2,1), padding=(2,0), bias=False)
self.feat3 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(3,1), padding=(3,0), bias=False)
"""
def forward(self, x, shape=None):
    if shape is not None:
        x = x.squeeze()[:shape]
    # ResNet backbone: conv1 -> bn1 -> relu -> maxpool -> layer1..4 -> avgpool
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), -1)  # flatten the ResNet backbone output
    x = self.feat(x)
    # map each frame to a feat_dim=128 feature
    x = x.view(x.size(0) // self.frames, self.frames, -1)  # (N, T, d), d=feat_dim=128, T=16
    x0 = torch.transpose(x, 1, 2)  # (N, d, T)
    x = x.unsqueeze(dim=1)  # (N, 1, T, d)
    x1 = self.feat1(x).squeeze(dim=3)
    # feat1 output: (N, 128, T, 1); conv(in_channels=1, out_channels=128)
    x2 = self.feat2(x).squeeze(dim=3)
    x3 = self.feat3(x).squeeze(dim=3)
    x = torch.cat((x0, x1, x2, x3), dim=1)  # (N, 4*128, T)
    x = self.Nonlocal_block0(x).mean(dim=2)
    if self.istrain:
        x = self.feat_bn(x)
        x = self.relu(x)
        x = self.drop(x)
        x = self.classifier(x)
    return x
By modifying the dataloader, 16 consecutive frames are sampled from each sequence as the model input, i.e. self.frames = T = 16. After the backbone and self.feat, the per-frame features are reshaped to x.shape = (N, T, d), where T = 16, N is the number of sequences, and d = 128 is the embedding dimension of each frame. After unsqueezing, x.shape = (N, 1, T, d), which is the input to the dilated convolutions.
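The reshaping around the DTP input can be reproduced in isolation (a minimal sketch; frame_feats stands in for the backbone output after self.feat):

import torch

N, T, d = 2, 16, 128
frame_feats = torch.randn(N * T, d)  # one d-dim embedding per frame, frames stacked
x = frame_feats.view(N, T, d)        # (N, T, d): group frames by sequence
x0 = x.transpose(1, 2)               # (N, d, T): identity branch kept for concatenation
x = x.unsqueeze(dim=1)               # (N, 1, T, d): input to the dilated convolutions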
Take self.feat2 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(2,1), padding=(2,0), bias=False) as an example. This dilated convolution has 1 input channel and 128 output channels. The kernel size (3, 128) means the receptive field spans three frames × the full 128 feature dimensions; with dilation rate 2, the three frames are 2 steps apart.
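Why the temporal length is preserved: for kernel size $k = 3$ along time, dilation $r$, stride 1, and padding $p = r$, the standard convolution output-length formula gives

$$T_{\text{out}} = \frac{T + 2p - r(k-1) - 1}{1} + 1 = T + 2r - 2r = T,$$

so every branch outputs exactly $T$ temporal positions regardless of its dilation rate.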
The dilated convolution output has shape (N, 128, T, 1): each 3-frame × 128-dim window is aggregated into a single value per output channel, and the 128 output channels form a 128×T short-term feature map. This is the transformation of $f_t$ into $f'_t$ described in Section 3.1 of the paper. Concatenating x0, x1, x2, and x3 gives the short-term feature map of shape (N, 4d, T), which serves as the input for long-term feature extraction.
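Continuing the shape walk-through, the concatenation can be checked in isolation (a sketch with random stand-ins for the four branches):

import torch

N, T, d = 2, 16, 128
x0 = torch.randn(N, d, T)  # identity (appearance) branch
x1 = torch.randn(N, d, T)  # dilation-1 branch, after squeeze(dim=3)
x2 = torch.randn(N, d, T)  # dilation-2 branch
x3 = torch.randn(N, d, T)  # dilation-3 branch
x = torch.cat((x0, x1, x2, x3), dim=1)
print(x.shape)             # torch.Size([2, 512, 16]) = (N, 4d, T)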
Temporal Self-Attention (TSA)
# non-local block
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock1D(nn.Module):
    def __init__(self, in_channels, inter_channels=None):
        '''
        when training:
        in_channels = 4d (4*128), inter_channels = None
        '''
        super(NonLocalBlock1D, self).__init__()
        self.in_channels = in_channels
        self.inter_channels = inter_channels
        if self.inter_channels is None:
            self.inter_channels = in_channels // 2
            if self.inter_channels == 0:
                self.inter_channels = 1
        self.g = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)
        self.W = nn.Sequential(
            nn.Conv1d(self.inter_channels, in_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm1d(in_channels)
        )
        # zero-init the BatchNorm so the block starts out as an identity mapping
        nn.init.constant_(self.W[1].weight, 0)
        nn.init.constant_(self.W[1].bias, 0)
        self.theta = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)
        self.phi = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        '''
        :param x: (N, 4d, T)
        :return: (N, 4d, T)
        '''
        batch_size = x.size(0)
        g_x = self.g(x).view(batch_size, self.inter_channels, -1)  # (N, 2d, T)
        g_x = g_x.permute(0, 2, 1)  # (N, T, 2d)
        theta_x = self.theta(x).view(batch_size, self.inter_channels, -1)
        theta_x = theta_x.permute(0, 2, 1)  # (N, T, 2d)
        phi_x = self.phi(x).view(batch_size, self.inter_channels, -1)  # (N, 2d, T)
        f = torch.matmul(theta_x, phi_x)  # (N, T, T) pairwise frame affinities
        f_div_C = F.softmax(f, dim=-1)
        # softmax over the last dimension, so that torch.matmul(f_div_C, g_x) aggregates in the right order
        y = torch.matmul(f_div_C, g_x)  # (N, T, 2d)
        y = y.permute(0, 2, 1).contiguous()  # (N, 2d, T)
        y = y.view(batch_size, self.inter_channels, *x.size()[2:])  # (N, 2d, T)
        W_y = self.W(y)
        z = W_y + x  # (N, 4d, T) residual connection
        return z
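A minimal usage sketch of the block as it appears in GLTR (in_channels = 4d = 512; the final .mean(dim=2) mirrors the temporal averaging in the DTP forward above):

import torch

# assuming NonLocalBlock1D as defined above
block = NonLocalBlock1D(in_channels=512)  # inter_channels defaults to 256 (= 2d)
x = torch.randn(2, 512, 16)               # (N, 4d, T) short-term feature map from DTP
z = block(x)                              # (N, 4d, T): residual self-attention output
gltr = z.mean(dim=2)                      # (N, 4d): sequence-level GLTR descriptor
print(z.shape, gltr.shape)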
self.theta(), self.phi(), and self.g() are the convolutions that produce $B$, $C$, and $\bar{\mathcal{F}'}$ in the figure. The resulting z is the output of self-attention across the 16 frames and therefore contains the long-term features. Finally, averaging over the temporal dimension yields a compact (N, 4d) vector, i.e. one feature per 16 consecutive frames.
Implementation Details
From Section 4.2:
- Employ standard ResNet50 [12] as the backbone for frame feature extraction.
- For 2D CNN training, the training is finished after 20 epochs.
- For DTP and TSA training, we sample 16 adjacent frames from each sequence as input for each training epoch.
- All models are trained with only softmax loss.
- During testing, we use 2D CNN to extract a d=128-dim feature from each video frame, then fuse frame features into GLTR using the network illustrated in Fig.
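The 16-adjacent-frame sampling from Section 4.2 can be implemented in a few lines. The following is a hypothetical sketch, not the repo's dataloader; looping short sequences is one common convention that the paper does not specify:

import random
import torch

def sample_adjacent_frames(seq_frames, num_frames=16):
    # seq_frames: (L, C, H, W) tensor holding all frames of one sequence
    length = seq_frames.size(0)
    if length < num_frames:
        # loop short sequences until they reach num_frames (assumed convention)
        reps = (num_frames + length - 1) // length
        return seq_frames.repeat(reps, 1, 1, 1)[:num_frames]
    start = random.randint(0, length - num_frames)
    return seq_frames[start:start + num_frames]

clip = sample_adjacent_frames(torch.randn(40, 3, 256, 128))  # -> (16, 3, 256, 128)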
Further Reading
SANet: link