Grounding of Textual Phrases in Images by Reconstruction


(Reading an image grounding paper to look for ideas for the moment task.)

Note: ECCV 2016; the paper is available on arXiv.

Motivation:

  1. Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground. On the contrary, we want to tackle the problem of grounding arbitrary natural language phrases in images. (Prior work mainly grounded a small set of nouns in images → this paper targets arbitrary natural-language phrases.)

  2. Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes) and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images without bounding box annotations but which is also able to incorporate phrases with bounding box supervision when available. (Annotation is too costly → the method targets weakly supervised grounding, i.e. the training set contains only the phrases annotated for each image, with no annotation of the corresponding regions inside the image.)

Target:

Under weak supervision, given a phrase $p$ and its image $I$, find the region $r_i$ in $I$ (a segment or a bounding box) that corresponds to $p$.

Main Idea:

A phrase $p$ and an image $I$ determine a region $r_i$, i.e. $f: p, I \to r_i$. Ideally, then, the phrase $p$ can in turn be reconstructed from the image $I$ and the region $r_i$ (similar in spirit to an autoencoder).
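Written out under assumed notation, the idea looks roughly as follows. The decoder name $g$, the hats, and the symbol $\mathcal{L}_{rec}$ are introduced here for illustration only and are not taken from the notes above:

$$
\hat{r} = f(p, I), \qquad \hat{p} = g(I, \hat{r}), \qquad \mathcal{L}_{rec} = -\log P(p \mid I, \hat{r})
$$

No bounding box annotation appears in $\mathcal{L}_{rec}$, which is why a loss of this shape can be trained from image-phrase pairs alone.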

Contribution:

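  1. The proposed model uses an attention mechanism in the grounding stage.

  2. A module that reconstructs the phrase $p$ is added, introducing a reconstruction loss, so the proposed model can be used under different levels of supervision: supervised, semi-supervised, and unsupervised.

  3. Good performance.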
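A minimal sketch of how a reconstruction term lets one model cover several supervision regimes. The function names, the weight `lam`, and the exact form of each term are assumptions made for illustration, not the paper's definitions: the reconstruction term is always available, and an attention term is added only when a box annotation exists.

```python
import math

def reconstruction_loss(word_probs):
    """Negative log-likelihood of the ground-truth phrase.

    word_probs: probabilities a (hypothetical) phrase decoder assigned
    to each true word of the phrase.
    """
    return -sum(math.log(p) for p in word_probs)

def attention_loss(attention, gt_index):
    """Cross-entropy between the attention weights and the annotated box."""
    return -math.log(attention[gt_index])

def total_loss(attention, word_probs, gt_index=None, lam=1.0):
    """Combine both terms; gt_index=None means no box annotation.

    lam is a hypothetical weight on the attention term.
    """
    loss = reconstruction_loss(word_probs)
    if gt_index is not None:
        loss += lam * attention_loss(attention, gt_index)
    return loss

attention = [0.1, 0.7, 0.2]   # attention over three region proposals
word_probs = [0.5, 0.5]       # decoder probabilities of the true words

loss_without_box = total_loss(attention, word_probs)
loss_with_box = total_loss(attention, word_probs, gt_index=1)
```

In this sketch a training batch can mix annotated and unannotated phrases, since each example simply contributes whichever terms it has labels for.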

Model:


1) Grounding

To select the correct bounding box from the region proposals $\{r_i\}_{i=1,\dots,N}$, we define an attention function $f_{ATT}$ and select the box $j$
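The selection step can be sketched as below. This is a toy version under stated assumptions: the dot-product score, the feature vectors, and the choice of the highest-attention box are all illustrative stand-ins, since the notes do not spell out the form of $f_{ATT}$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def f_att(phrase_vec, region_vecs):
    """Toy attention: softmax over phrase-region dot products.

    The dot-product score is an assumption for illustration only.
    """
    return softmax([dot(phrase_vec, r) for r in region_vecs])

def select_box(phrase_vec, region_vecs):
    """Return the index j of the proposal with the largest attention weight."""
    att = f_att(phrase_vec, region_vecs)
    j = max(range(len(att)), key=att.__getitem__)
    return j, att

# Made-up 2-d features for one phrase and N = 3 region proposals.
phrase = [1.0, 0.0]
regions = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]

j, att = select_box(phrase, regions)
print(j)  # prints 1: the second proposal has the largest score (0.8)
```

The attention weights sum to one, so they can also be used as soft weights over the proposals rather than only for a hard choice.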
