1) Existing datasets such as the 100M-scale one contain a lot of noise, so a new dataset is introduced.
2) A model is proposed that can exploit both images and videos at scale.
2 Encoders:
1) Text encoder:
The text encoder is essentially off-the-shelf: it just calls a few functions from the transformers library. The code:
self.text_model = AutoModel.from_pretrained(text_params['model'])

def compute_text(self, text_data):
    # first, encode the text with the transformer
    if self.text_params['model'].startswith('bert'):
        text_embeddings = self.text_model(text_data['input_ids'],
                                          attention_mask=text_data['attention_mask'])['pooler_output']
    elif self.text_params['model'].startswith('distilbert'):
        text_embeddings = self.text_model(**text_data).last_hidden_state[:, 0, :]
    else:
        raise NotImplementedError
    # then apply a linear projection
    text_embeddings = self.txt_proj(text_embeddings)
    return text_embeddings
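For the distilbert branch, the sentence embedding is simply the hidden state of the first ([CLS]) token, i.e. `last_hidden_state[:, 0, :]`. A minimal numpy sketch of that pooling step (the shapes and values here are made up for illustration, not taken from the repo):

```python
import numpy as np

# hypothetical transformer hidden states: (batch, seq_len, hidden_dim)
last_hidden_state = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)

# CLS pooling: keep only the first token of every sequence
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # (2, 3)
```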
2) Video encoder:
First, an image is treated as a video with a single frame; the encoder then follows ViT's 2D patch-embedding procedure. On top of that, each patch receives both a spatial (positional) embedding and a temporal embedding, and the result is produced by the self-attention blocks.
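Treating an image as a one-frame video just means inserting a frame axis of size 1 so the input matches the (B, F, C, H, W) layout the video encoder expects. A minimal numpy sketch (the shapes are illustrative, not taken from the repo):

```python
import numpy as np

# a batch of 2 RGB images, 8x8 pixels: (B, C, H, W)
images = np.zeros((2, 3, 8, 8), dtype=np.float32)

# view each image as a single-frame video: (B, F=1, C, H, W)
video = np.expand_dims(images, axis=1)

print(video.shape)  # (2, 1, 3, 8, 8)
```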
video_transformer.py:
def forward_features(self, x):
    b, curr_frames, channels, _, _ = x.shape
    x = self.patch_embed(x)  # conv patch embedding: 3*M*H*W -> M*N
    x = x.flatten(2).transpose(2, 1)
    x = x.reshape(b, -1, self.patch_embed.embed_dim)
    BF = x.shape[0]
    cls_tokens = self.cls_token.expand(BF, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
    x = torch.cat((cls_tokens, x), dim=1)  # prepend the "CLS" token to the input tensor
    # positional embed needs to be tiled for each frame (this does [1,2,3] --> [1,2,3,1,2,3]...)
    cls_embed = self.pos_embed[:, 0, :].unsqueeze(1)
    tile_pos_embed = self.pos_embed[:, 1:, :].repeat(1, self.num_frames, 1)  # add spatial position info
    # temporal embed needs to be repeated within each frame (this does [1,2,3] --> [1,1,1,2,2,2,3,3,3]...)
    tile_temporal_embed = self.temporal_embed.repeat_interleave(self.patches_per_frame, 1)  # add temporal info
    total_pos_embed = tile_pos_embed + tile_temporal_embed
    total_pos_embed = torch.cat([cls_embed, total_pos_embed], dim=1)
    curr_patches = x.shape[1]
    x = x + total_pos_embed[:, :curr_patches]
    x = self.pos_drop(x)  # dropout: randomly zeroes some elements
    n = self.patches_per_frame
    f = curr_frames
    for blk in self.blocks:
        # self-attention layer computation
        x = blk(x, self.einops_from_space, self.einops_to_space,
                self.einops_from_time, self.einops_to_time,
                time_n=n, space_f=f)
    x = self.norm(x)[:, 0]
    x = self.pre_logits(x)
    return x
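The two embedding layouts described in the comments can be checked with a tiny numpy example: spatial positions are tiled once per frame ([1,2,3] -> [1,2,3,1,2,3], like `pos_embed.repeat`), while each temporal index is repeated for every patch of its frame ([10,20] -> [10,10,10,20,20,20], like `temporal_embed.repeat_interleave`). The toy values below (2 frames, 3 patches per frame) are made up for illustration:

```python
import numpy as np

num_frames = 2
patches_per_frame = 3

pos = np.array([1, 2, 3])   # one spatial position id per patch within a frame
tem = np.array([10, 20])    # one temporal id per frame

# tile spatial positions once per frame, like pos_embed.repeat(1, num_frames, 1)
tile_pos = np.tile(pos, num_frames)           # [1 2 3 1 2 3]
# repeat each temporal id for all patches of its frame,
# like temporal_embed.repeat_interleave(patches_per_frame, 1)
tile_tem = np.repeat(tem, patches_per_frame)  # [10 10 10 20 20 20]

# every patch ends up with spatial + temporal information
total = tile_pos + tile_tem
print(total)  # [11 12 13 21 22 23]
```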
3) Similarity function:
This is the version everyone commonly uses:
def sim_matrix(a, b, eps=1e-8):
    """
    added eps for numerical stability
    """
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    sim_mt = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sim_mt
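Since sim_matrix L2-normalizes the rows of both inputs (clamping the norms at eps) and multiplies them, the output is a pairwise cosine-similarity matrix. A numpy re-implementation to make the behaviour concrete (this equivalent and its test vectors are mine, not from the repo):

```python
import numpy as np

def sim_matrix_np(a, b, eps=1e-8):
    # mirror of the torch version: clamp row norms at eps, then a_norm @ b_norm.T
    a_n = np.linalg.norm(a, axis=1, keepdims=True)
    b_n = np.linalg.norm(b, axis=1, keepdims=True)
    a_norm = a / np.maximum(a_n, eps)
    b_norm = b / np.maximum(b_n, eps)
    return a_norm @ b_norm.T

a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[3.0, 0.0], [0.0, 0.5], [1.0, 1.0]])
sim = sim_matrix_np(a, b)
print(np.round(sim, 4))
# row 0 vs b: [1.0, 0.0, 0.7071]; row 1 vs b: [0.0, 1.0, 0.7071]
```

Note that the magnitude of the vectors is irrelevant after normalization, which is exactly what makes the matrix usable as logits for a contrastive loss.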