Understanding the Non-Local Module (with Video Person ReID: GLTR as an Example)
Paper Content
Global-Local Temporal Representations For Video Person Re-Identification: paper link
Main Method
This paper proposes the Global-Local Temporal Representation (GLTR) to exploit the multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling the short-term temporal cues among adjacent frames, then capturing the long-term relations among inconsecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrian. The long-term relations are captured by a temporal self-attention model to alleviate the occlusions and noises in video sequences. The short and long-term temporal cues are aggregated as the final GLTR by a simple single-stream CNN.
GLTR first extracts short-term temporal features from adjacent frames, then long-term features from inconsecutive frames. The short-term cues are modeled by parallel dilated convolutions with different temporal dilation rates, the Dilated Temporal Pyramid (DTP). The long-term relations are captured by a temporal self-attention model, the Temporal Self-Attention (TSA), which alleviates the effect of occlusion and noise in video sequences. The short- and long-term features are aggregated by a simple single-stream CNN.
Model Details
Model architecture diagram
Understanding the Forward-Pass Code
Code source: https://github.com/ljn114514/GLTR
Dilated Temporal Pyramid (DTP)
"""
self.feat1 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(1,1), padding=(1,0), bias=False)
self.feat2 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(2,1), padding=(2,0), bias=False)
self.feat3 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(3,1), padding=(3,0), bias=False)
"""
def forward(self, x, shape=None):
    if shape is not None:
        x = x.squeeze()[:shape]
    # ResNet backbone: conv1 -> bn1 -> relu -> maxpool -> layer1..4 -> avgpool
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), -1)  # flatten the ResNet backbone output
    x = self.feat(x)
    # map each frame to a feat_dim=128 feature
    x = x.view(x.size(0) // self.frames, self.frames, -1)  # (N, T, d), d=feat_dim=128, T=16
    x0 = torch.transpose(x, 1, 2)  # (N, d, T)
    x = x.unsqueeze(dim=1)  # (N, 1, T, d)
    x1 = self.feat1(x).squeeze(dim=3)
    # feat1 output: (N, 128, T, 1); conv(in_channels=1, out_channels=128)
    x2 = self.feat2(x).squeeze(dim=3)
    x3 = self.feat3(x).squeeze(dim=3)
    x = torch.cat((x0, x1, x2, x3), dim=1)  # (N, 4*128, T)
    x = self.Nonlocal_block0(x).mean(dim=2)
    if self.istrain:
        x = self.feat_bn(x)
        x = self.relu(x)
        x = self.drop(x)
        x = self.classifier(x)
    return x
By modifying the dataloader, 16 consecutive frames are sampled from each sequence as the model input, i.e. self.frames = T = 16. After the backbone and self.feat, the per-frame features are reshaped to x.shape = (N, T, d), where T = 16, N is the number of sequences, and d = 128 is the embedding dimension of each frame. After unsqueezing, x.shape = (N, 1, T, d), which is the input to the dilated convolutions.
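The reshaping around the DTP input can be reproduced in isolation (a minimal sketch; frame_feats stands in for the backbone output after self.feat):

import torch

N, T, d = 2, 16, 128
frame_feats = torch.randn(N * T, d)  # one d-dim embedding per frame, frames stacked
x = frame_feats.view(N, T, d)        # (N, T, d): group frames by sequence
x0 = x.transpose(1, 2)               # (N, d, T): identity branch kept for concatenation
x = x.unsqueeze(dim=1)               # (N, 1, T, d): input to the dilated convolutions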
Take self.feat2 = nn.Conv2d(1, 128, kernel_size=(3,128), stride=1, dilation=(2,1), padding=(2,0), bias=False) as an example. This dilated convolution has 1 input channel and 128 output channels. The kernel size (3, 128) means the receptive field spans three frames × the full 128 feature dimensions; with dilation rate 2, the three frames are 2 steps apart.
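Why the temporal length is preserved: for kernel size $k = 3$ along time, dilation $r$, stride 1, and padding $p = r$, the standard convolution output-length formula gives

$$T_{\text{out}} = \frac{T + 2p - r(k-1) - 1}{1} + 1 = T + 2r - 2r = T,$$

so every branch outputs exactly $T$ temporal positions regardless of its dilation rate.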
The dilated convolution output has shape (N, 128, T, 1): each 3-frame × 128-dim window is aggregated into a single value per output channel, and the 128 output channels form a 128×T short-term feature map. This is the transformation of $f_t$ into $f'_t$ described in Section 3.1 of the paper. Concatenating x0, x1, x2, and x3 gives the short-term feature map of shape (N, 4d, T), which serves as the input for long-term feature extraction.
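Continuing the shape walk-through, the concatenation can be checked in isolation (a sketch with random stand-ins for the four branches):

import torch

N, T, d = 2, 16, 128
x0 = torch.randn(N, d, T)  # identity (appearance) branch
x1 = torch.randn(N, d, T)  # dilation-1 branch, after squeeze(dim=3)
x2 = torch.randn(N, d, T)  # dilation-2 branch
x3 = torch.randn(N, d, T)  # dilation-3 branch
x = torch.cat((x0, x1, x2, x3), dim=1)
print(x.shape)             # torch.Size([2, 512, 16]) = (N, 4d, T)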
Temporal Self-Attention (TSA)
# non-local block
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock1D(nn.Module):
    def __init__(self, in_channels, inter_channels=None):
        '''
        when training:
        in_channels = 4d (4*128), inter_channels = None
        '''
        super(NonLocalBlock1D, self).__init__()
        self.in_channels = in_channels
        self.inter_channels = inter_channels
        if self.inter_channels is None:
            self.inter_channels = in_channels // 2
            if self.inter_channels == 0:
                self.inter_channels = 1
        self.g = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)
        self.W = nn.Sequential(
            nn.Conv1d(self.inter_channels, in_channels, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm1d(in_channels)
        )
        # zero-init the BatchNorm so the block starts out as an identity mapping
        nn.init.constant_(self.W[1].weight, 0)
        nn.init.constant_(self.W[1].bias, 0)
        self.theta = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)
        self.phi = nn.Conv1d(in_channels, self.inter_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        '''
        :param x: (N, 4d, T)
        :return: (N, 4d, T)
        '''
        batch_size = x.size(0)
        g_x = self.g(x).view(batch_size, self.inter_channels, -1)  # (N, 2d, T)
        g_x = g_x.permute(0, 2, 1)  # (N, T, 2d)
        theta_x = self.theta(x).view(batch_size, self.inter_channels, -1)
        theta_x = theta_x.permute(0, 2, 1)  # (N, T, 2d)
        phi_x = self.phi(x).view(batch_size, self.inter_channels, -1)  # (N, 2d, T)
        f = torch.matmul(theta_x, phi_x)  # (N, T, T) pairwise frame affinities
        f_div_C = F.softmax(f, dim=-1)
        # softmax over the last dimension, so that torch.matmul(f_div_C, g_x) aggregates in the right order
        y = torch.matmul(f_div_C, g_x)  # (N, T, 2d)
        y = y.permute(0, 2, 1).contiguous()  # (N, 2d, T)
        y = y.view(batch_size, self.inter_channels, *x.size()[2:])  # (N, 2d, T)
        W_y = self.W(y)
        z = W_y + x  # (N, 4d, T) residual connection
        return z
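A minimal usage sketch of the block as it appears in GLTR (in_channels = 4d = 512; the final .mean(dim=2) mirrors the temporal averaging in the DTP forward above):

import torch

# assuming NonLocalBlock1D as defined above
block = NonLocalBlock1D(in_channels=512)  # inter_channels defaults to 256 (= 2d)
x = torch.randn(2, 512, 16)               # (N, 4d, T) short-term feature map from DTP
z = block(x)                              # (N, 4d, T): residual self-attention output
gltr = z.mean(dim=2)                      # (N, 4d): sequence-level GLTR descriptor
print(z.shape, gltr.shape)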
self.theta(), self.phi(), and self.g() are the convolutions that produce $B$, $C$, and $\bar{\mathcal{F}'}$ in the figure. The resulting z is the output of self-attention across the 16 frames and therefore contains the long-term features. Finally, averaging over the temporal dimension yields a compact (N, 4d) vector, i.e. one feature per 16 consecutive frames.
Implementation Details
From Section 4.2:
- Employ standard ResNet50 [12] as the backbone for frame feature extraction.
- For 2D CNN training, the training is finished after 20 epochs.
- For DTP and TSA training, we sample 16 adjacent frames from each sequence as input for each training epoch.
- All models are trained with only softmax loss.
- During testing, we use 2D CNN to extract a d=128-dim feature from each video frame, then fuse frame features into GLTR using the network illustrated in Fig.
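The 16-adjacent-frame sampling from Section 4.2 can be implemented in a few lines. The following is a hypothetical sketch, not the repo's dataloader; looping short sequences is one common convention that the paper does not specify:

import random
import torch

def sample_adjacent_frames(seq_frames, num_frames=16):
    # seq_frames: (L, C, H, W) tensor holding all frames of one sequence
    length = seq_frames.size(0)
    if length < num_frames:
        # loop short sequences until they reach num_frames (assumed convention)
        reps = (num_frames + length - 1) // length
        return seq_frames.repeat(reps, 1, 1, 1)[:num_frames]
    start = random.randint(0, length - num_frames)
    return seq_frames[start:start + num_frames]

clip = sample_adjacent_frames(torch.randn(40, 3, 256, 128))  # -> (16, 3, 256, 128)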
Further Reading
SANet: link