Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models

A CVPR 2018 paper on cross-modal retrieval. Paper: https://arxiv.org/abs/1711.06420. The first author is a PhD student at Nanyang Technological University (homepage: http://jxgu.cc/), and the code has been released at https://github.com/ujiuxiang/NLP_Practice.PyTorch/tree/master/cross_modal_retrieval
Personal remarks: three reasons for reading this paper.

  • 1. This is the first paper I have seen that uses both GANs and reinforcement learning (RL) for cross-modal retrieval.
  • 2. This single network can handle three cross-modal tasks at once: cross-modal retrieval, image captioning, and text-to-image synthesis (for the latter two tasks the paper only shows qualitative visualizations, without quantitative analysis).
  • 3. The paper was published at CVPR 2018 as a Spotlight, and its cross-modal retrieval performance on MSCOCO is state-of-the-art.

The task addressed by the paper (cross-modal retrieval):
Input: an image (or a sentence) plus the dataset.      Output: a ranked list of sentences (or images).
The qualitative results shown in the paper are as follows.
[Figure: visual results]
The comparison with state-of-the-art methods is shown below.
[Figure: comparison with SOTA]
The ablation study in the paper is shown below.
[Figure: ablation study]

Method
The framework of the paper is shown below.
[Figure: framework]
The method consists of three parts: multi-modal feature embedding (the entire upper part), image-to-text generative feature learning (the blue path), and text-to-image generative adversarial feature learning (the green path).

Multi-modal feature embedding:
Sentence encoding: the sentence is first represented as one-hot vectors, which are mapped to word embeddings by an embedding matrix and then encoded by a two-layer bidirectional GRU.
Image encoding: a CNN pre-trained on ImageNet. (The image and sentence encoders provide both the high-level abstract features and the detailed grounded features used below.)
Feature embedding loss: a two-branch ranking loss with the order-violation penalty [ https://arxiv.org/abs/1511.06361 ] is applied to the high-level abstract features and the detailed grounded features separately.
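As a rough illustration of this embedding objective, the sketch below (an assumption-laden sketch, not the authors' released code) combines the order-violation penalty of the order-embeddings paper with a bidirectional max-margin ranking loss; the margin value and tensor shapes are illustrative.

```python
# Minimal sketch: order-violation similarity (arXiv:1511.06361) plus a
# bidirectional hinge ranking loss over a batch of matched image/sentence pairs.
import torch

def order_similarity(im, s):
    """Negative order-violation penalty between every image/sentence pair.

    im: (n, d) image embeddings, s: (n, d) sentence embeddings.
    Returns an (n, n) similarity matrix where larger means a smaller
    violation E(i, c) = ||max(0, c - i)||^2.
    """
    diff = s.unsqueeze(0) - im.unsqueeze(1)        # (n, n, d), broadcasted
    return -diff.clamp(min=0).pow(2).sum(dim=2)

def ranking_loss(im, s, margin=0.05):
    """Bidirectional max-margin ranking loss; matched pairs sit on the diagonal."""
    scores = order_similarity(im, s)               # (n, n)
    pos = scores.diag().view(-1, 1)                # positive-pair scores
    cost_s = (margin + scores - pos).clamp(min=0)       # image -> sentence
    cost_im = (margin + scores - pos.t()).clamp(min=0)  # sentence -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_im.masked_fill(mask, 0).sum()
```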

Image-to-text generative feature learning:
A CNN first encodes the image into detailed grounded features, an RNN then decodes these features into a sentence, and the loss function drives the decoded sentence to be as similar as possible to the sentence paired with the image.
Loss function: cross-entropy (XE) loss + reinforcement learning (RL) loss [ https://arxiv.org/abs/1612.00563 ]; see the sketch after the list below.

  • XE: a word-level cost. At each step the decoder predicts a word, and the probability of the ground-truth word is maximized.
  • RL: a sentence-level cost. The decoded sentence is scored against the ground-truth sentence with a metric such as BLEU or CIDEr, and that score serves as the reward.
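Below is a minimal sketch of the two cost terms, assuming the decoder already exposes per-step logits plus the log-probability and metric score of a sampled and a greedily decoded caption; the self-critical baseline follows arXiv:1612.00563, and all argument names are placeholders rather than the authors' API.

```python
# Sketch of the captioning objective: word-level cross entropy plus a
# self-critical sequence training (SCST) term. Inputs are assumed to be
# produced elsewhere by the caption decoder.
import torch
import torch.nn.functional as F

def xe_loss(logits, targets, pad_idx=0):
    """Word-level cross entropy.

    logits: (batch, seq_len, vocab) decoder outputs at each step.
    targets: (batch, seq_len) ground-truth word indices.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_idx,
    )

def scst_loss(sampled_logprobs, sampled_reward, greedy_reward):
    """Sentence-level REINFORCE term with a self-critical baseline.

    sampled_logprobs: (batch,) sum of log-probs of the sampled caption.
    sampled_reward:   (batch,) CIDEr/BLEU score of the sampled caption.
    greedy_reward:    (batch,) score of the greedy (baseline) caption.
    """
    advantage = sampled_reward - greedy_reward   # baseline reduces variance
    return -(advantage.detach() * sampled_logprobs).mean()
```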

Text-to-image generative adversarial feature learning:
An RNN first encodes the sentence into detailed grounded features. Conditioning augmentation [ https://arxiv.org/abs/1612.03242 ] then compresses these features into a lower-dimensional code, which augments the data and enforces smoothness of the conditioning manifold. This augmented code is concatenated with a noise vector, and the concatenated vector is fed to a text-to-image synthesis model [ https://arxiv.org/abs/1605.05396 ] to generate the image. During training, a Kullback-Leibler (KL) divergence term is added on the generator side to further enforce smoothness and avoid overfitting.
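A minimal sketch of conditioning augmentation as described in the StackGAN paper: the text feature is mapped to a Gaussian, a code is sampled with the reparameterization trick, and a KL term pulls the distribution toward a standard normal. The dimensions and module name are assumptions, not the released implementation.

```python
# Conditioning augmentation: text feature -> (mu, logvar) -> sampled code c,
# regularized with KL(N(mu, sigma^2) || N(0, I)).
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mu and logvar

    def forward(self, text_feat):
        mu, logvar = self.fc(text_feat).chunk(2, dim=1)
        std = (0.5 * logvar).exp()
        c = mu + torch.randn_like(std) * std         # reparameterized sample
        # KL divergence to the standard normal, averaged over the batch
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
        return c, kl

# usage: the code c is concatenated with noise z and fed to the generator,
# e.g. g_input = torch.cat([c, torch.randn(c.size(0), 100)], dim=1)
```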

Framework training:
The training procedure first trains the image-to-text generative feature learning branch and the text-to-image generative adversarial feature learning branch (training the discriminator before the generator), and then trains the multi-modal feature embedding.
The pseudocode given in the paper is shown below.
[Figure: pseudocode]
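Since the paper's pseudocode is not reproduced here, the toy skeleton below only mirrors the update order (caption branch, then discriminator, then generator, then the embedding); every network is a stand-in nn.Linear and the losses are simplified placeholders, not the paper's actual objectives.

```python
# Runnable toy skeleton of the alternating training schedule described above.
import torch
import torch.nn as nn

img_feat = torch.randn(8, 512)          # dummy image CNN features
txt_feat = torch.randn(8, 512)          # dummy sentence GRU features

captioner = nn.Linear(512, 512)         # stand-in image-to-text decoder
generator = nn.Linear(512 + 100, 512)   # stand-in text-to-image generator
discriminator = nn.Linear(512, 1)       # stand-in image discriminator
embedder = nn.Linear(512, 256)          # stand-in joint embedding

cap_opt = torch.optim.Adam(captioner.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
emb_opt = torch.optim.Adam(embedder.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(3):
    # 1) image-to-text branch (XE + RL loss in the paper; an MSE stands in here)
    cap_loss = (captioner(img_feat) - txt_feat).pow(2).mean()
    cap_opt.zero_grad(); cap_loss.backward(); cap_opt.step()

    # 2) text-to-image branch: train the discriminator first, then the generator
    z = torch.randn(8, 100)
    fake = generator(torch.cat([txt_feat, z], dim=1))
    d_loss = bce(discriminator(img_feat), torch.ones(8, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(8, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    g_loss = bce(discriminator(fake), torch.ones(8, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # 3) multi-modal embedding (a ranking loss in the paper; cosine stands in)
    emb_loss = 1 - torch.cosine_similarity(embedder(img_feat), embedder(txt_feat)).mean()
    emb_opt.zero_grad(); emb_loss.backward(); emb_opt.step()
```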

Summary:

  • In essence, this paper is a ranking loss plus a bidirectional cross-modal autoencoder [ https://people.cs.clemson.edu/~jzwang/1501863/mm2014/p7-feng.pdf ]. The ranking loss still does the main work, while the bidirectional cross-modal autoencoder helps learn better features in the common space.
  • The GAN component comes from text-to-image synthesis and the reinforcement learning (RL) component essentially comes from image captioning; GAN and RL are used to learn additional features rather than to directly improve the ranking loss.
  • The ranking loss with the order-violation penalty beats the hard-triplet loss (VSE++) on some metrics (it is unclear whether the experimental settings are identical, or whether tuning tricks differ) and is comparable on others, so for retrieval the hard-triplet loss is not necessarily the best choice.