Scene boundary prediction:
This is a binary classification problem: for each shot, predict whether or not it is a scene boundary.
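The binary formulation can be illustrated with a small sketch that turns per-shot scene annotations into 0/1 boundary labels. The function name `boundary_labels` and the convention that a shot is labeled 1 when the next shot belongs to a different scene are assumptions made for illustration, not something stated in the original notes.

```python
def boundary_labels(scene_ids):
    """Return one 0/1 label per shot: 1 if the next shot starts a new scene.

    Assumed convention: the last shot of the video is labeled 0 because
    no shot follows it.
    """
    labels = []
    for i in range(len(scene_ids)):
        is_last = i == len(scene_ids) - 1
        labels.append(0 if is_last else int(scene_ids[i] != scene_ids[i + 1]))
    return labels


# Six shots grouped into three scenes (toy data).
scene_ids = [0, 0, 0, 1, 1, 2]
print(boundary_labels(scene_ids))  # [0, 0, 1, 0, 1, 0]
```

Under this convention the model only has to output one probability per shot, which is what the networks below are trained to produce.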
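(1) First, a boundary network (BNet) extracts the differences and relations between shots. BNet consists of two branches: Bd captures the difference between the parts before and after a shot, and Br captures the relations among shots. Br is built from one convolution followed by one max-pooling layer.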
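import torch
import torch.nn as nn

# Cos and feat_extractor are defined elsewhere in the same code base.
class BNet(nn.Module):
    def __init__(self, cfg):
        super(BNet, self).__init__()
        self.shot_num = cfg.shot_num
        self.channel = cfg.model.sim_channel
        # Br branch: convolution spanning the whole shot window, then max pooling over channels
        self.conv1 = nn.Conv2d(1, self.channel, kernel_size=(cfg.shot_num, 1))
        self.max3d = nn.MaxPool3d(kernel_size=(self.channel, 1, 1))
        # Bd branch: similarity between the parts before and after the shot
        self.cos = Cos(cfg)
        self.feat_extractor = feat_extractor(cfg)

    def forward(self, x):  # [batch_size, seq_len, shot_num, 3, 224, 224]
        feat = self.feat_extractor(x)
        # [batch_size, seq_len, shot_num, feat_dim]
        context = feat.view(
            feat.shape[0]*feat.shape[1], 1, feat.shape[-2], feat.shape[-1])
        context = self.conv1(context)
        # [batch_size*seq_len, sim_channel, 1, feat_dim]
        context = self.max3d(context)
        # [batch_size*seq_len, 1, 1, feat_dim]
        context = context.squeeze()
        # [batch_size*seq_len, feat_dim] (squeeze() would also drop the first dim if it were 1)
        sim = self.cos(feat)
        # concatenate the Br and Bd outputs along the feature axis
        bound = torch.cat((context, sim), dim=1)
        return bound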
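To make the two branches more concrete, here is a minimal pure-Python sketch on a toy shot window. It is not the original implementation: `bd_difference` assumes the Bd branch can be approximated by the cosine similarity between the mean feature of the first half and the mean feature of the second half of the window (the real `Cos` module is not shown in these notes), and `br_relation` mimics a convolution over the full shot axis followed by a max over channels, with hand-picked, hypothetical kernel weights and no bias term.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def mean_vec(rows):
    """Element-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def bd_difference(window):
    """Assumed Bd stand-in: compare the halves before and after the window centre."""
    half = len(window) // 2
    return cosine(mean_vec(window[:half]), mean_vec(window[half:]))


def br_relation(window, kernels):
    """Assumed Br stand-in: one weighted sum over shots per channel, then max over channels."""
    feat_dim = len(window[0])
    channel_outputs = [
        [sum(w * shot[d] for w, shot in zip(kernel, window)) for d in range(feat_dim)]
        for kernel in kernels
    ]
    return [max(col) for col in zip(*channel_outputs)]


# Toy windows of 4 shots with 2-dimensional features.
same_scene = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
scene_change = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

print(bd_difference(same_scene))    # close to 1: both halves look alike
print(bd_difference(scene_change))  # 0.0: the halves are orthogonal
print(br_relation(scene_change, [[1, 1, 1, 1], [1, 0, 0, 0]]))  # [2.0, 2.0]
```

A low similarity from the Bd stand-in signals that the content changes across the window centre, while the Br stand-in summarises the whole window into a single feature vector, matching the `[batch_size*seq_len, feat_dim]` shape noted in the code above.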
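(2) Based on the shot representations obtained from BNet, an LSTM model is built to estimate the probability that each shot forms a scene boundary. The final result is obtained by setting a threshold on the number of scenes.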
class LGSSone(nn.Module):
    def __init__(self, cfg, mode="image"):
        super(LGSSone, s