Abstract
1. Goal:
composed query image retrieval
2. Existing problem:
the discrepancy between the feature encoders' architectures restricts rich vision-language interaction
1. Introduction
1. Main challenges:
2. Proposed method:
Components: the vision Transformer, the language Transformer, and the cross-modal Transformer
3. Main contributions:
(1) ComqueryFormer;
(2) an effective global-local mechanism to align the composed query and target image in a complementary manner.
2. Related Work
2.1 Composed Query Image Retrieval
Traditional image retrieval: the query is usually an example image, text, or sketch map, but a single-modality query can hardly satisfy the user's needs.
2.2 Transformer
2.3 Multi-modal Machine Learning
3. Method
Inputs: Ir is the reference image, M is the modification text, and It is the target image.
3.1 Text Encoder
BERT:
(1) add a [CLS] token at the beginning and a [SEP] token at the end of the modification text;
(2) convert each word in the sentence into a one-hot vector and map it to a word vector through the word embedding layer;
(3) each word vector is added to its position embedding and then fed into several transformer blocks.
The distinctive parts of the transformer block are multi-head attention, the feed-forward network (FFN), layer normalization (LN), and residual connections.
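The transformer-block components listed above can be sketched in plain NumPy. This is only an illustration of the data flow: the learned projection matrices (W_Q, W_K, W_V and the FFN weights) are omitted or replaced by toy operations, so it is not BERT's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_attention(q, k, v, num_heads):
    # q, k, v: (seq_len, d); each head attends over a d/num_heads slice
    d = q.shape[-1]
    dh = d // num_heads
    out = np.zeros_like(q)
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)  # scaled dot-product
        out[:, sl] = softmax(scores) @ v[:, sl]
    return out

def transformer_block(x, num_heads=4):
    # self-attention sub-layer with residual connection + LN
    x = layer_norm(x + multi_head_attention(x, x, x, num_heads))
    # FFN sub-layer (a ReLU stands in for the learned two-layer MLP)
    return layer_norm(x + np.maximum(x, 0.0))
```

Each sub-layer follows the "residual then LN" pattern the notes mention; stacking several such blocks gives the encoder.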
3.2 Image Encoder
3.3 Vision-language Composition Module
Language Guide Vision (LGV) and Vision Guide Language (VGL):
Final output:
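The two attention directions can be sketched as below. This single-head version with no learned projections is my simplification for illustration, not the paper's exact LGV/VGL design; the function names `cross_attention` and `compose` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # query: (n_q, d) tokens of one modality; context: (n_c, d) of the other
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores) @ context  # (n_q, d)

def compose(vision_tokens, text_tokens):
    # LGV: vision tokens query the text; VGL: text tokens query the vision
    lgv = cross_attention(vision_tokens, text_tokens)
    vgl = cross_attention(text_tokens, vision_tokens)
    return lgv, vgl
```

The two outputs keep the token count of their respective query modality, which is why the two directions complement each other.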
3.4 Global-local Alignment Module
Local alignment: we learn k local semantic regions via multiple spatial attention maps.
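The local-alignment idea, k spatial attention maps pooling the feature map into k region descriptors, can be sketched as follows; the weight matrix `W` stands in for the learned attention parameters and is an assumption, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_regions(feat_map, W):
    # feat_map: (N, d) flattened spatial features (N = H*W positions)
    # W: (d, k) weights producing k spatial attention maps
    attn = softmax(feat_map @ W, axis=0)  # (N, k); each map sums to 1 over positions
    return attn.T @ feat_map              # (k, d): one descriptor per region
```

Computing the same k descriptors for the composed query and the target image, then matching them region by region, gives the local counterpart to the global alignment.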