1) Existing datasets such as the 100M-scale one contain a lot of noise, so a new dataset is introduced.
2) A model is proposed that can exploit both images and videos at scale.
2 Encoders:
1) Text encoder:
The text encoder is essentially off-the-shelf: it just calls a few functions from the transformers library. The code:
self.text_model = AutoModel.from_pretrained(text_params['model'])

def compute_text(self, text_data):
    # first, encode the text with the transformer
    if self.text_params['model'].startswith('bert'):
        text_embeddings = self.text_model(text_data['input_ids'],
                                          attention_mask=text_data['attention_mask'])['pooler_output']
    elif self.text_params['model'].startswith('distilbert'):
        text_embeddings = self.text_model(**text_data).last_hidden_state[:, 0, :]
    else:
        raise NotImplementedError
    # then apply a linear projection
    text_embeddings = self.txt_proj(text_embeddings)
    return text_embeddings
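For the distilbert branch, the sentence embedding is simply the hidden state of the first ([CLS]) token, i.e. `last_hidden_state[:, 0, :]`. A minimal numpy sketch of that pooling step (the shapes and values here are made up for illustration, not taken from the repo):

```python
import numpy as np

# hypothetical transformer hidden states: (batch, seq_len, hidden_dim)
last_hidden_state = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# CLS pooling: keep only the first token of every sequence
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # (2, 3)
```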
2) Video encoder:
First, an image is treated as a video with a single frame; the encoder then follows ViT's 2D patch-embedding procedure. On top of that, each patch receives both a spatial (positional) embedding and a temporal embedding, and the result is produced by the self-attention blocks.
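Treating an image as a one-frame video just means inserting a frame axis of size 1 so the input matches the (B, F, C, H, W) layout the video encoder expects. A minimal numpy sketch (the shapes are illustrative, not taken from the repo):

```python
import numpy as np

# a batch of 2 RGB images, 8x8 pixels: (B, C, H, W)
images = np.zeros((2, 3, 8, 8), dtype=np.float32)

# view each image as a single-frame video: (B, F=1, C, H, W)
video = np.expand_dims(images, axis=1)

print(video.shape)  # (2, 1, 3, 8, 8)
```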
video_transformer.py:
def forward_features(self, x):
    b, curr_frames, channels, _, _ = x.shape
    x = self.patch_embed(x)  # conv patch embedding: 3*M*H*W -> M*N
    x = x.flatten(2).transpose(2, 1)
    x = x.reshape(b, -1, self.patch_embed.embed_dim)
    BF = x.shape[0]
    cls_tokens = self.cls_token.expand(BF, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
    x = torch.cat((cls_tokens, x), dim=1)  # prepend the "CLS" token to the input tensor
    # positional embed needs to be tiled for each frame (this does [1,2,3] --> [1,2,3,1,2,3]...)
    cls_embed = self.pos_embed[:, 0, :].unsqueeze(1)
    tile_pos_embed = self.pos_embed[:, 1:, :].repeat(1, self.num_frames, 1)  # add spatial position info
    # temporal embed needs to be repeated within each frame (this does [1,2,3] --> [1,1,1,2,2,2,3,3,3]...)
    tile_temporal_embed = self.temporal_embed.repeat_interleave(self.patches_per_frame, 1)  # add temporal info
    total_pos_embed = tile_pos_embed + tile_temporal_embed
    total_pos_embed = torch.cat([cls_embed, total_pos_embed], dim=1)
    curr_patches = x.shape[1]
    x = x + total_pos_embed[:, :curr_patches]
    x = self.pos_drop(x)  # dropout: randomly zeroes some elements
    n = self.patches_per_frame
    f = curr_frames
    for blk in self.blocks:
        # self-attention layer computation
        x = blk(x, self.einops_from_space, self.einops_to_space,
                self.einops_from_time, self.einops_to_time,
                time_n=n, space_f=f)
    x = self.norm(x)[:, 0]
    x = self.pre_logits(x)
    return x
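The two embedding layouts described in the comments can be checked with a tiny numpy example: spatial positions are tiled once per frame ([1,2,3] -> [1,2,3,1,2,3], like `pos_embed.repeat`), while each temporal index is repeated for every patch of its frame ([10,20] -> [10,10,10,20,20,20], like `temporal_embed.repeat_interleave`). The toy values below (2 frames, 3 patches per frame) are made up for illustration:

```python
import numpy as np

num_frames = 2
patches_per_frame = 3

pos = np.array([1, 2, 3])   # one spatial position id per patch within a frame
tem = np.array([10, 20])    # one temporal id per frame

# tile spatial positions once per frame, like pos_embed.repeat(1, num_frames, 1)
tile_pos = np.tile(pos, num_frames)           # [1 2 3 1 2 3]
# repeat each temporal id for all patches of its frame,
# like temporal_embed.repeat_interleave(patches_per_frame, 1)
tile_tem = np.repeat(tem, patches_per_frame)  # [10 10 10 20 20 20]

# every patch ends up with spatial + temporal information
total = tile_pos + tile_tem
print(total)  # [11 12 13 21 22 23]
```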
3) Similarity function:
This is the version everyone commonly uses:
def sim_matrix(a, b, eps=1e-8):
    """
    added eps for numerical stability
    """
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    sim_mt = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sim_mt
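Since sim_matrix L2-normalizes the rows of both inputs (clamping the norms at eps) and multiplies them, the output is a pairwise cosine-similarity matrix. A numpy re-implementation to make the behaviour concrete (this equivalent and its test vectors are mine, not from the repo):

```python
import numpy as np

def sim_matrix_np(a, b, eps=1e-8):
    # mirror of the torch version: clamp row norms at eps, then a_norm @ b_norm.T
    a_n = np.linalg.norm(a, axis=1, keepdims=True)
    b_n = np.linalg.norm(b, axis=1, keepdims=True)
    a_norm = a / np.maximum(a_n, eps)
    b_norm = b / np.maximum(b_n, eps)
    return a_norm @ b_norm.T

a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[3.0, 0.0], [0.0, 0.5], [1.0, 1.0]])
sim = sim_matrix_np(a, b)
print(np.round(sim, 4))
# row 0 vs b: [1.0, 0.0, 0.7071]; row 1 vs b: [0.0, 1.0, 0.7071]
```

Note that the magnitude of the vectors is irrelevant after normalization, which is exactly what makes the matrix usable as logits for a contrastive loss.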