Text-to-Video Retrieval 4: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

Code: https://github.com/m-bain/frozen-in-time (Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, ICCV'21)

1 Contributions:

1) Existing datasets such as HowTo100M contain a great deal of noise, so the paper introduces a new, cleaner dataset (WebVid-2M).

2) It proposes a single model that can be trained end to end on both images and videos at scale.

2 Encoders:

1) Text (sequence) encoder:

The text encoder is essentially off the shelf: it loads a pretrained transformer (BERT or DistilBERT, via Hugging Face's AutoModel) and calls it. The code:

self.text_model = AutoModel.from_pretrained(text_params['model'])

    def compute_text(self, text_data):
        if self.text_params['model'].startswith('bert'):
            text_embeddings = self.text_model(text_data['input_ids'], attention_mask=text_data['attention_mask'])[
                'pooler_output']
        elif self.text_params['model'].startswith('distilbert'):
            text_embeddings = self.text_model(**text_data).last_hidden_state[:, 0, :]
        else:
        else:
            raise NotImplementedError    # first encode the text with the transformer
        text_embeddings = self.txt_proj(text_embeddings)  # then apply a linear projection
        return text_embeddings
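The shape of this pipeline can be sketched without loading a real checkpoint. Below is a minimal stand-in (all sizes are assumed for illustration; `txt_proj` here is just a fresh `nn.Linear`, mirroring `self.txt_proj` above): the DistilBERT branch takes the first ([CLS]) token of `last_hidden_state` as the sentence embedding, then projects it into the joint embedding space.

```python
import torch
import torch.nn as nn

# Toy sizes standing in for a real transformer output (assumed for illustration):
# a transformer would produce last_hidden_state of shape [batch, seq_len, hidden].
batch, seq_len, hidden, proj_dim = 2, 8, 768, 256
last_hidden_state = torch.randn(batch, seq_len, hidden)

txt_proj = nn.Linear(hidden, proj_dim)      # stands in for self.txt_proj
cls_embedding = last_hidden_state[:, 0, :]  # take the first (CLS) token: [batch, hidden]
text_embeddings = txt_proj(cls_embedding)   # linear projection: [batch, proj_dim]
print(text_embeddings.shape)                # torch.Size([2, 256])
```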

2) Video encoder:

An image is treated as a video with a single frame. Patch embedding then follows ViT (a 2D convolution per frame); on top of that, each patch receives both a spatial positional embedding and a temporal embedding, and the resulting sequence is processed by self-attention blocks to produce the final representation.

video_transformer.py:

    def forward_features(self, x):
        b, curr_frames, channels, _, _ = x.shape
        x = self.patch_embed(x)  # convolutional patch embedding: each 3×H×W frame -> N patch tokens
        x = x.flatten(2).transpose(2, 1)
        x = x.reshape(b, -1, self.patch_embed.embed_dim)

        BF = x.shape[0]
        cls_tokens = self.cls_token.expand(BF, -1, -1)  # stole cls_tokens impl from Phil Wang, thanks
        x = torch.cat((cls_tokens, x), dim=1)  # prepend the "CLS" token to the input sequence
        # positional embed needs to be tiled for each frame (this does [1,2,3] --> [1,2,3,1,2,3]...)
        cls_embed = self.pos_embed[:, 0, :].unsqueeze(1)
        tile_pos_embed = self.pos_embed[:, 1:, :].repeat(1, self.num_frames, 1)  # spatial positional embedding
        # temporal embed needs to be repeated within each frame (this does [1,2,3] --> [1,1,1,2,2,2,3,3,3]...)
        tile_temporal_embed = self.temporal_embed.repeat_interleave(self.patches_per_frame, 1)  # temporal embedding
        total_pos_embed = tile_pos_embed + tile_temporal_embed
        total_pos_embed = torch.cat([cls_embed, total_pos_embed], dim=1)

        curr_patches = x.shape[1]
        x = x + total_pos_embed[:, :curr_patches]
        x = self.pos_drop(x)  # dropout on the embeddings
        n = self.patches_per_frame
        f = curr_frames

        for blk in self.blocks:
            x = blk(x, self.einops_from_space, self.einops_to_space, self.einops_from_time, self.einops_to_time,
                    time_n=n, space_f=f)
            # divided space-time self-attention block
        x = self.norm(x)[:, 0]
        x = self.pre_logits(x)

        return x
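The two tiling patterns described in the comments above (`repeat` for the spatial embedding, `repeat_interleave` for the temporal one) are easiest to see on toy numbers. A minimal sketch, assuming 2 frames and 3 patches per frame with scalar embeddings:

```python
import torch

# Toy sizes (assumed for illustration): 2 frames, 3 patches per frame, embed dim 1.
num_frames, patches_per_frame = 2, 3
pos = torch.tensor([[[1.], [2.], [3.]]])   # spatial pos embed (CLS excluded): [1, 3, 1]
temp = torch.tensor([[[10.], [20.]]])      # temporal embed, one per frame:   [1, 2, 1]

# [1,2,3] -> [1,2,3,1,2,3]: the same spatial position recurs in every frame
tile_pos = pos.repeat(1, num_frames, 1)
# [10,20] -> [10,10,10,20,20,20]: the same frame index covers every patch of that frame
tile_temp = temp.repeat_interleave(patches_per_frame, 1)

total = tile_pos + tile_temp
print(total.squeeze(-1))  # tensor([[11., 12., 13., 21., 22., 23.]])
```

So after the sum, every patch token carries a unique (position, frame) code, which is what lets the attention blocks separate space from time.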

3) Similarity function:

This is the standard cosine-similarity matrix that most retrieval implementations use:

def sim_matrix(a, b, eps=1e-8):
    """
    added eps for numerical stability
    """
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    sim_mt = torch.mm(a_norm, b_norm.transpose(0, 1))
    return sim_mt
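A quick usage sketch (toy sizes assumed; the function is repeated here so the snippet runs on its own): rows index text queries, columns index videos, and each entry is a cosine similarity, so it lies in [-1, 1].

```python
import torch

def sim_matrix(a, b, eps=1e-8):
    """Cosine-similarity matrix, as defined above (eps added for numerical stability)."""
    a_n, b_n = a.norm(dim=1)[:, None], b.norm(dim=1)[:, None]
    a_norm = a / torch.max(a_n, eps * torch.ones_like(a_n))
    b_norm = b / torch.max(b_n, eps * torch.ones_like(b_n))
    return torch.mm(a_norm, b_norm.transpose(0, 1))

# Toy usage: 2 text embeddings vs. 3 video embeddings, dimension 4.
text = torch.randn(2, 4)
video = torch.randn(3, 4)
sim = sim_matrix(text, video)
print(sim.shape)  # torch.Size([2, 3])
```

Ranking the columns of each row then gives the retrieval order for that text query.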
