Abstract
1. Goal:
composed query image retrieval
2. Existing problem:
the discrepancy between the feature encoders' architectures restricts rich vision-language interaction
1. Introduction
1. Main challenges:
2. Proposed method:
Components: the vision Transformer, the language Transformer, and the cross-modal Transformer
3. Main contributions:
(1) ComqueryFormer;
(2) an effective global-local mechanism to align the composed query and target image in a complementary manner.
2. Related Work
2.1 Composed Query Image Retrieval
Traditional image retrieval: the query is usually an example image, text, or sketch map, but a single-modality query can hardly satisfy the user's needs.
2.2 Transformer
2.3 Multi-modal Machine Learning
3. Method
Inputs: Ir is the reference image, M is the modification text, and It is the target image.
3.1 Text Encoder
BERT:
(1) add a [CLS] token at the beginning and a [SEP] token at the end of the modification text;
(2) convert each word in the sentence into a one-hot vector and map it to a word vector through the word embedding layer;
(3) each word vector is added to its position embedding and then fed into several transformer blocks.
The distinctive parts of the transformer block are multi-head attention, the feed-forward network (FFN), layer normalization (LN), and residual connections.
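The transformer-block components listed above can be sketched in plain NumPy. This is only an illustration of the data flow: the learned projection matrices (W_Q, W_K, W_V and the FFN weights) are omitted or replaced by toy operations, so it is not BERT's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def multi_head_attention(q, k, v, num_heads):
    # q, k, v: (seq_len, d); each head attends over a d/num_heads slice
    d = q.shape[-1]
    dh = d // num_heads
    out = np.zeros_like(q)
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)  # scaled dot-product
        out[:, sl] = softmax(scores) @ v[:, sl]
    return out

def transformer_block(x, num_heads=4):
    # self-attention sub-layer with residual connection + LN
    x = layer_norm(x + multi_head_attention(x, x, x, num_heads))
    # FFN sub-layer (a ReLU stands in for the learned two-layer MLP)
    return layer_norm(x + np.maximum(x, 0.0))
```

Each sub-layer follows the "residual then LN" pattern the notes mention; stacking several such blocks gives the encoder.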
3.2 Image Encoder
3.3 Vision-language Composition Module
Language Guide Vision (LGV) and Vision Guide Language (VGL):
Final output:
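The two attention directions can be sketched as below. This single-head version with no learned projections is my simplification for illustration, not the paper's exact LGV/VGL design; the function names `cross_attention` and `compose` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # query: (n_q, d) tokens of one modality; context: (n_c, d) of the other
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    return softmax(scores) @ context  # (n_q, d)

def compose(vision_tokens, text_tokens):
    # LGV: vision tokens query the text; VGL: text tokens query the vision
    lgv = cross_attention(vision_tokens, text_tokens)
    vgl = cross_attention(text_tokens, vision_tokens)
    return lgv, vgl
```

The two outputs keep the token count of their respective query modality, which is why the two directions complement each other.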
3.4 Global-local Alignment Module
Local alignment: we learn k local semantic regions via multiple spatial attention maps.
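The local-alignment idea, k spatial attention maps pooling the feature map into k region descriptors, can be sketched as follows; the weight matrix `W` stands in for the learned attention parameters and is an assumption, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_regions(feat_map, W):
    # feat_map: (N, d) flattened spatial features (N = H*W positions)
    # W: (d, k) weights producing k spatial attention maps
    attn = softmax(feat_map @ W, axis=0)  # (N, k); each map sums to 1 over positions
    return attn.T @ feat_map              # (k, d): one descriptor per region
```

Computing the same k descriptors for the composed query and the target image, then matching them region by region, gives the local counterpart to the global alignment.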