Multi-Modal Transformer with Global-Local Alignment for Composed Query Image Retrieval — paper notes

Abstract

1. Goal:

composed query image retrieval

2. Problem addressed:

the architecture discrepancy between the feature encoders restricts rich vision-language interaction

3. Proposed method:
ComqueryFormer:
(1) a unified transformer-based architecture homogeneously encodes the vision-language inputs;
(2) a cross-modal transformer hierarchically fuses the composed query at various vision scales;
(3) an efficient global-local alignment module.

1. Introduction

1. Main challenges:

(1) how to extract the relevant information from the reference image and the modification text;
(2) how to learn a better alignment strategy that narrows the distance between the composed query and the target image.

2. Proposed method:

Components: the vision Transformer, the language Transformer, and the cross-modal Transformer

3. Main contributions:

(1) ComqueryFormer;

(2) an effective global-local mechanism that aligns the composed query and the target image in a complementary manner.

2. Related Work

2.1 Composed Query Image Retrieval

Traditional image retrieval usually takes a single query such as an example image, a piece of text, or a sketch map, but a single query can hardly satisfy the user's needs.

interactive image search

2.2 Transformer

2.3 Multi-modal Machine Learning

3. Method

Inputs: Ir is the reference image, M is the modification text, and It is the target image.

3.1 Text Encoder

BERT:

(1) prepend a [CLS] token and append a [SEP] token to the modification text;

(2) convert each word in the sentence into a word vector through the word embedding layer;

(3) add each word vector to its position embedding and feed the result into several transformer blocks.
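The three steps above can be sketched as follows; this is a minimal numpy illustration with a hypothetical toy vocabulary and random embedding tables (real BERT uses a WordPiece vocabulary of about 30k subword tokens and learned weights):

```python
import numpy as np

# Hypothetical toy vocabulary for illustration only.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
         "change": 3, "the": 4, "color": 5, "to": 6, "red": 7}
d_model = 16
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(len(vocab), d_model))  # word embedding table
pos_emb = rng.normal(size=(32, d_model))           # position embeddings

def encode(text):
    """Wrap the modification text with [CLS]/[SEP], look up word vectors,
    and add position embeddings -- the input to the transformer blocks."""
    ids = [vocab["[CLS]"]] + [vocab[w] for w in text.split()] + [vocab["[SEP]"]]
    x = word_emb[ids] + pos_emb[:len(ids)]         # (seq_len, d_model)
    return np.array(ids), x

ids, x = encode("change the color to red")
print(ids.shape, x.shape)  # (7,) (7, 16)
```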

The distinctive components of a transformer block are multi-head attention, a feed-forward network (FFN), layer normalization (LN), and residual connections.
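A minimal numpy sketch of one transformer block combining those four components (pre-norm variant; identity Q/K/V projections are used for brevity, whereas a real block learns separate projection matrices):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, n_heads):
    # Identity projections for brevity; real blocks learn Wq, Wk, Wv, Wo.
    seq, d = x.shape
    dh = d // n_heads
    out = np.empty_like(x)
    for h in range(n_heads):
        q = k = v = x[:, h * dh:(h + 1) * dh]
        att = softmax(q @ k.T / np.sqrt(dh))       # scaled dot-product attention
        out[:, h * dh:(h + 1) * dh] = att @ v
    return out

def transformer_block(x, n_heads=4):
    # Residual connection around attention, then around the 2-layer ReLU FFN.
    x = x + multi_head_attention(layer_norm(x), n_heads)
    x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2
    return x

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, 4 * d)) * 0.1             # FFN expands 4x, as in BERT
W2 = rng.normal(size=(4 * d, d)) * 0.1
y = transformer_block(rng.normal(size=(7, d)))
print(y.shape)  # (7, 16)
```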

3.2 Image Encoder

Swin Transformer
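Swin's key idea is computing self-attention inside non-overlapping local windows (alternating with shifted windows across layers), which keeps the cost linear in image size. A sketch of the window-partition step, assuming the 56x56x96 stage-1 resolution of Swin-T for illustration:

```python
import numpy as np

def window_partition(feat, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows;
    self-attention is then computed independently within each window."""
    H, W, C = feat.shape
    feat = feat.reshape(H // win, win, W // win, win, C)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

x = np.zeros((56, 56, 96))           # assumed stage-1 feature map of Swin-T
windows = window_partition(x, 7)     # Swin's default window size is 7
print(windows.shape)                 # (64, 49, 96): 64 windows of 49 tokens
```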

3.3 Vision-language Composition Module

Language Guide Vision (LGV) and Vision Guide Language (VGL):

Final output:
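The LGV/VGL pair can be read as two directions of cross-attention: in one, image tokens query the text; in the other, text tokens query the image. A minimal sketch under that reading (identity projections and random token features are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_attention(query, context):
    """Tokens of one modality attend to tokens of the other
    (identity Q/K/V projections for brevity)."""
    att = softmax(query @ context.T / np.sqrt(query.shape[-1]))
    return att @ context

rng = np.random.default_rng(0)
vision = rng.normal(size=(49, 16))   # 49 patch tokens (assumed)
text = rng.normal(size=(7, 16))      # 7 word tokens (assumed)

lgv = cross_attention(vision, text)  # Language Guide Vision: image queries text
vgl = cross_attention(text, vision)  # Vision Guide Language: text queries image
print(lgv.shape, vgl.shape)          # (49, 16) (7, 16)
```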

3.4 Global-local Alignment Module

Local alignment: we learn k local semantic regions via multiple spatial attention masks. Each region is generated by:

Global alignment:
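One way to picture the complementary global-local features: each of the k spatial attention masks softly pools one local region vector, while the global feature is a plain average pool over all spatial positions. A numpy sketch with assumed shapes (49 flattened positions, k = 4 regions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def global_local_features(feat_map, mask_logits):
    """feat_map: (N, C) flattened spatial features; mask_logits: (k, N).
    Each softmax-normalized mask pools one local region feature;
    the global feature is average-pooled over all N positions."""
    masks = softmax(mask_logits)        # (k, N), one attention mask per region
    local = masks @ feat_map            # (k, C) local region features
    global_feat = feat_map.mean(axis=0) # (C,) global feature
    return global_feat, local

rng = np.random.default_rng(0)
feat = rng.normal(size=(49, 16))        # assumed flattened feature map
logits = rng.normal(size=(4, 49))       # k = 4 regions (an assumption)
g, loc = global_local_features(feat, logits)
print(g.shape, loc.shape)               # (16,) (4, 16)
```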
