Text-Video Retrieval 2 (Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval)

衡一的光

Last modified: 2023-10-20 22:44:23

Tags: artificial intelligence

First published: 2023-10-20 22:34:28

Article link: https://blog.csdn.net/qq_51964119/article/details/133951898

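Summary (auto-generated by CSDN): This post introduces a new cross-modal retrieval method that proposes a two-channel feature fusion strategy and a new loss function that focuses on the hardest negative sample. It walks through how the features are obtained, covering the fusion of video, text, motion, and audio features, and the specially designed loss function used to optimize the model.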

Code: https://github.com/niluthpol/multimodal_vtt

Introduction: this is another paper on cross-modal retrieval between text and video. Let me start with its main contributions:

1) It proposes a new feature extraction framework:

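[Figure: overview of the proposed framework]

The figure looks complicated. The papers I read before usually fuse the appearance feature directly with the sentence feature and use the resulting feature for prediction. This paper instead proposes two channels. The first fuses the appearance feature with the sentence feature to obtain one final feature. The second first fuses the motion feature with the audio feature, then fuses that result with the sentence feature to obtain another final feature. The two final features are then combined for prediction. It looks cumbersome, but the fusion itself is fairly simple, as the code later in this post shows. What I consider the hardest part, the feature extraction itself, is not included in the released code, which simply loads precomputed features.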
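To make the two-channel idea concrete, here is a minimal sketch in plain Python. It assumes all features have already been projected so that they are comparable, skips the learned layers entirely, and uses hypothetical names (`cosine`, `fused_similarity`) that do not appear in the released code. The only thing it borrows from the post is the structure: one similarity from the appearance-text space, one from the (motion + audio)-text space, and the two scores added together.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def fused_similarity(appearance, motion, audio, text_obj, text_act):
    """Sum of the similarities from the two joint spaces.

    appearance      -> compared with text_obj (object-text space)
    motion + audio  -> concatenated, then compared with text_act (activity-text space)
    Assumption: the learned projection layers are omitted for brevity.
    """
    activity = list(motion) + list(audio)  # simple concatenation
    return cosine(appearance, text_obj) + cosine(activity, text_act)

# Toy vectors (made up for illustration only)
score = fused_similarity(
    appearance=[1.0, 0.0],
    motion=[0.0, 1.0],
    audio=[1.0, 0.0],
    text_obj=[1.0, 0.0],
    text_act=[0.0, 1.0, 1.0, 0.0],
)
print(score)
```

In this toy case both spaces agree perfectly, so the fused score is close to the maximum possible value of 2.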

2) It proposes a new loss function.

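S(v,t) is the score of the positive pair and S(v,t') is the score of a negative pair. A typical ranking loss sums over all negatives, but this one only considers the hardest negative.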
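For reference, the loss computed by the code in section 3 of this post can be written roughly as below. This is a reconstruction from that code, not a formula copied from the paper, and the notation is an assumption: α is the margin, β the weighting hyperparameter, N the batch size, and r the 0-based rank of the positive pair among the scores for that query.

```latex
\mathcal{L} = \sum_{(v,t)} \Big( w_v \max_{t'} \big[\alpha + S(v,t') - S(v,t)\big]_+
            + w_t \max_{v'} \big[\alpha + S(v',t) - S(v,t)\big]_+ \Big),
\qquad w = 1 + \frac{\beta}{N - r}
```

Here [x]_+ means max(0, x), and w_v and w_t are computed from the rank of the positive pair in the video-to-text and text-to-video directions respectively.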

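To expand on this: the similarity is cosine similarity, so larger is better. The loss ignores the other negatives and only looks at the negative with the highest similarity.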
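The difference between summing over all negatives and keeping only the hardest one can be seen on a toy score matrix. The sketch below is plain Python and covers only the video-to-text direction; the function names and the margin of 0.2 are assumptions chosen for illustration.

```python
def hinge_costs(scores, i, margin=0.2):
    """Hinge costs of video i against every non-matching caption."""
    pos = scores[i][i]  # the diagonal holds the positive pairs
    return [max(0.0, margin + s - pos) for j, s in enumerate(scores[i]) if j != i]

def sum_of_hinges(scores, margin=0.2):
    """Conventional ranking loss: add up the violations of all negatives."""
    return sum(sum(hinge_costs(scores, i, margin)) for i in range(len(scores)))

def max_of_hinges(scores, margin=0.2):
    """Hardest-negative variant: keep only the largest violation per query."""
    return sum(max(hinge_costs(scores, i, margin)) for i in range(len(scores)))

# scores[i][j] = similarity between video i and caption j (made-up values)
scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.6],
    [0.45, 0.4, 0.5],
]
print(sum_of_hinges(scores), max_of_hinges(scores))
```

Only the last row has two violating negatives, so that is the only place where the two losses differ.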

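Now let's look at the actual code (I only cover the parts that I personally consider important):

1. Getting the basic features

1) This part of the code loads the video (appearance) and text features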

    def __getitem__(self, index):
        '''
        Return a training sample pair (the video frame feature and the corresponding caption).
        The video is looked up from the caption, so videos must be stored in ascending id order.
        '''
        caption = self.captions[index]
        length = self.lengths[index]
        video_id = self.video_ids[index]
        vid_feat_dir = self.vid_feat_dir

        # vid_feat_dir must end with a path separator, e.g.
        # /hdd2/niluthpol/VTT/MSR_VTT/resnet_feat_caffe_all/video8763.npy
        path = vid_feat_dir + "video" + str(video_id) + ".npy"
        video_feat = torch.from_numpy(np.load(path))
        video_feat = video_feat.mean(dim=0, keepdim=False)  # average pooling over dim 0
        video_feat = video_feat.float()

        return video_feat, caption, index, video_id
        # Example of the returned values:
        # video_feat: torch.Size([2048])
        # caption:    torch.Size([28])  (length varies)
        # index:      18033             (varies)
        # video_id:   9144              (varies)
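The `mean(dim=0)` call above is what turns a variable number of per-frame vectors into one fixed-size video vector. A tiny plain-Python sketch of that average pooling, with a hypothetical helper name and made-up numbers:

```python
def mean_pool(frames):
    """Average a list of equal-length frame vectors into a single vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

# Two frames with 3-dimensional features (real features here are 2048-dimensional)
frames = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
pooled = mean_pool(frames)
print(pooled)
```

The output length depends only on the feature dimension, not on how many frames the video has.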

2) Loading the motion feature, audio feature, and sentence feature

    def __getitem__(self, index):
        caption = self.captions[index]
        length = self.lengths[index]
        video_id = self.video_ids[index]
        vid_feat_dir = self.vid_feat_dir

        # activity (I3D) feature
        path1 = vid_feat_dir + '/video_features' + "/msr_vtt-I3D-RGBFeatures-video" + str(video_id) + ".npy"
        video_feat = torch.from_numpy(np.load(path1))
        video_feat = video_feat.mean(dim=0, keepdim=False)  # average over dim 0

        # audio (SoundNet) feature
        audio_feat_file = vid_feat_dir + '/audio_features' + "/video" + str(video_id) + ".mp3.soundnet.h5"
        with h5py.File(audio_feat_file, 'r') as audio_h5:
            audio_feat = audio_h5['layer24'][()]  # read the whole dataset into memory
        audio_feat = torch.from_numpy(audio_feat)
        audio_feat = audio_feat.mean(dim=1, keepdim=False)  # average over dim 1

        # directly concatenate the activity and audio features
        video_feat = torch.cat([video_feat, audio_feat])

        return video_feat, caption, index, video_id

    # Example of the returned values:
    # video_feat: torch.Size([2048])
    # caption:    torch.Size([8])  (varies)
    # index:      39587            (varies)
    # video_id:   8604             (varies)
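Note that the two features are pooled along different axes before being concatenated. The sketch below illustrates this in plain Python, assuming the audio array is laid out as (channels, time steps); that layout is my reading of `mean(dim=1)` and is not stated in the code, and all values are made up.

```python
def mean_over_time(feature_map):
    """Average each channel of a (channels, time) feature map over the time axis."""
    return [sum(channel) / len(channel) for channel in feature_map]

# Hypothetical, already pooled activity feature (2 dimensions for illustration)
activity_feat = [0.5, 1.5]

# Hypothetical audio feature map with 3 channels and 2 time steps
audio_map = [
    [1.0, 3.0],
    [2.0, 4.0],
    [0.0, 0.0],
]
audio_feat = mean_over_time(audio_map)

# Same idea as torch.cat([video_feat, audio_feat]): one longer vector
combined = activity_feat + audio_feat
print(combined)
```

The combined vector length is simply the sum of the two feature dimensions.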

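As you can see, obtaining the basic features is relatively simple.

2. Feature transformation

This step maps all of the features to the same dimensionality. The code is basically a linear layer plus an activation (which might be a direction for future improvement), and every feature ends up 1024-dimensional.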
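A minimal sketch of such a projection in plain Python follows. The choice of `tanh` as the activation and the final L2 normalization are assumptions made for illustration (the post only says "linear plus activation"), the weights are made up, and the output size is 2 instead of 1024 to keep the example readable.

```python
import math

def linear(x, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def project(x, weight, bias):
    """Linear layer, then tanh (assumed), then L2 normalization (assumed)."""
    h = [math.tanh(v) for v in linear(x, weight, bias)]
    norm = math.sqrt(sum(v * v for v in h))
    return [v / norm for v in h] if norm else h

# A 3-dimensional "video" feature and a 2-dimensional "text" feature
# are both mapped into the same 2-dimensional space.
video_emb = project([1.0, 2.0, 3.0], [[0.1, 0.0, -0.1], [0.2, 0.1, 0.0]], [0.0, 0.0])
text_emb = project([0.5, -0.5], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(video_emb, text_emb)
```

With unit-length embeddings, a plain dot product equals cosine similarity, which is consistent with the evaluation code in section 4 using `numpy.dot` as the score.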

3. Loss function:

    def forward(self, im, s):
        # compute the video-sentence score matrix (cosine similarity)
        scores = self.sim(im, s)
        diagonal = scores.diag().view(im.size(0), 1)  # scores of the matching (positive) pairs
        d1 = diagonal.expand_as(scores)      # d1[i, j] = scores[i, i]: constant within each row
        d2 = diagonal.t().expand_as(scores)  # d2[i, j] = scores[j, j]: constant within each column

        # sort each row in descending order (values, indices)
        d1_sort, d1_indice = torch.sort(scores, dim=1, descending=True)
        val, id1 = torch.min(d1_indice, 1)  # per-row minimum; only used to allocate a length-N tensor
        rank_weights1 = id1.float()

        for j in range(d1.size(0)):
            # weight = 1 + beta / (N - rank of the positive pair in row j)
            rank = (d1_indice[j, :] == j).nonzero().item()
            rank_weights1[j] = 1 + self.beta / (d1.size(0) - rank)

        # same computation on the transposed matrix (one row per caption)
        d2_sort, d2_indice = torch.sort(scores.t(), dim=1, descending=True)
        val, id2 = torch.min(d2_indice, 1)
        rank_weights2 = id2.float()

        for k in range(d2.size(0)):
            rank = (d2_indice[k, :] == k).nonzero().item()
            rank_weights2[k] = 1 + self.beta / (d2.size(0) - rank)

        # caption retrieval: for each video (row), compare its positive score
        # with the score of every caption in that row
        cost_s = (self.margin + scores - d1).clamp(min=0)
        # video retrieval: for each caption (column), compare its positive score
        # with the score of every video in that column
        cost_im = (self.margin + scores - d2).clamp(min=0)

        # clear the diagonal (positive pairs contribute no cost)
        mask = torch.eye(scores.size(0), device=scores.device) > .5
        cost_s = cost_s.masked_fill_(mask, 0)
        cost_im = cost_im.masked_fill_(mask, 0)

        # keep only the maximum violating negative for each query
        cost_s = cost_s.max(1)[0]
        cost_im = cost_im.max(0)[0]

        # weight the costs by the rank-based weights
        cost_s = torch.mul(rank_weights1, cost_s)
        cost_im = torch.mul(rank_weights2, cost_im)

        return cost_s.sum() + cost_im.sum()
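The rank-based weight is the least obvious part of this loss, so here is a small plain-Python sketch of just that piece. It follows the expression used in the code above, `1 + beta / (N - rank)` with a 0-based rank; the function name and the value `beta=0.5` are assumptions for illustration.

```python
def rank_weight(row_scores, positive_index, beta=0.5):
    """Weight for one query, following 1 + beta / (N - rank) from the code above.

    rank is the 0-based position of the positive item when the scores
    of this query are sorted in descending order.
    """
    n = len(row_scores)
    order = sorted(range(n), key=lambda j: row_scores[j], reverse=True)
    rank = order.index(positive_index)
    return 1 + beta / (n - rank)

# Positive (index 0) is ranked first: smallest weight, 1 + beta / N
w_best = rank_weight([0.9, 0.8, 0.1], positive_index=0)

# Positive (index 0) is ranked last: largest weight, 1 + beta
w_worst = rank_weight([0.1, 0.8, 0.9], positive_index=0)
print(w_best, w_worst)
```

So a query whose correct match is ranked poorly gets a larger weight on its hardest-negative cost than a query whose correct match is already at the top.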

4. Computing the final ranking metrics: R@1, R@5, R@10

import numpy

def i2t(videos, captions, videos2, captions2, shared_space='both', measure='cosine', return_ranks=False):
    # The code assumes 20 captions per video: rows 20 * index .. 20 * index + 19
    # belong to video `index`.
    npts = int(videos.shape[0] / 20)
    index_list = []
    print(npts)

    ranks = numpy.zeros(npts)
    top1 = numpy.zeros(npts)
    for index in range(npts):
        # Get the query video (one embedding per joint space)
        im = videos[20 * index].reshape(1, videos.shape[1])
        im2 = videos2[20 * index].reshape(1, videos2.shape[1])
        # Compute scores
        if 'both' == shared_space:
            # score one video against all captions, flattened to a 1-D array
            d1 = numpy.dot(im, captions.T).flatten()
            d2 = numpy.dot(im2, captions2.T).flatten()
            d = d1 + d2  # add the scores from the two spaces
        elif 'object_text' == shared_space:
            d = numpy.dot(im, captions.T).flatten()
        elif 'activity_text' == shared_space:
            d = numpy.dot(im2, captions2.T).flatten()

        inds = numpy.argsort(d)[::-1]  # sort, then reverse so the highest score comes first
        index_list.append(inds[0])
        # Score: best (lowest) rank among the 20 ground-truth captions
        rank = 1e20
        for i in range(20 * index, 20 * index + 20, 1):
            tmp = numpy.where(inds == i)[0][0]
            if tmp < rank:
                rank = tmp
                flag = i - 20 * index
        ranks[index] = rank
        top1[index] = inds[0]

    # Compute metrics (ranks are 0-based, so rank < k means "within the top k")
    r1 = 100.0 * len(numpy.where(ranks < 1)[0]) / len(ranks)
    r5 = 100.0 * len(numpy.where(ranks < 5)[0]) / len(ranks)
    r10 = 100.0 * len(numpy.where(ranks < 10)[0]) / len(ranks)
    medr = numpy.floor(numpy.median(ranks)) + 1
    meanr = ranks.mean() + 1
    if return_ranks:
        return (r1, r5, r10, medr, meanr), (ranks, top1)
    else:
        return (r1, r5, r10, medr, meanr)
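To see how the metrics fall out of the ranks, here is a small plain-Python sketch that mirrors the arithmetic at the end of `i2t` on made-up ranks. The helper names are hypothetical, and the ranks are 0-based as in the code above.

```python
import math
import statistics

def best_rank(order, ground_truth):
    """Lowest 0-based position at which any ground-truth item appears in `order`."""
    return min(order.index(item) for item in ground_truth)

def recall_at_k(ranks, k):
    """Percentage of queries whose best ground-truth item is within the top k."""
    return 100.0 * sum(1 for r in ranks if r < k) / len(ranks)

# One query: candidates sorted by descending score, ground truth is items 0 and 1
example_rank = best_rank([4, 2, 0, 1, 3], ground_truth=[0, 1])

# Made-up best ranks for four queries
ranks = [0, 3, 7, 12]
r1, r5, r10 = (recall_at_k(ranks, k) for k in (1, 5, 10))
medr = math.floor(statistics.median(ranks)) + 1  # reported as a 1-based rank
meanr = sum(ranks) / len(ranks) + 1
print(example_rank, r1, r5, r10, medr, meanr)
```

Higher R@K is better, while lower median and mean rank are better.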
