Grounding of Textual Phrases in Images by Reconstruction

thyya

Article link: https://blog.csdn.net/weixin_44048998/article/details/101268502

(Reading an image grounding paper to look for inspiration for the moment task.)

Many prior efforts in this area have focused on rather constrained settings with a small number of nouns to ground. On the contrary, we want to tackle the problem of grounding arbitrary natural language phrases in images. (Prior work mainly grounds a small set of nouns $\to$ this paper grounds arbitrary natural language phrases.)
Most parallel corpora of sentence/visual data do not provide localization annotations (e.g. bounding boxes) and the annotation process is costly. We propose an approach which can learn to localize phrases relying only on phrases associated with images without bounding box annotations but which is also able to incorporate phrases with bounding box supervision when available. (Annotation is too costly $\to$ the method targets weakly supervised grounding, i.e. the training set pairs each image with phrases but has no annotation of the corresponding regions inside the image.)

Weakly supervised setting: given a phrase $p$ and the corresponding image $I$, find the region $r_i$ in $I$ (a segment or a bounding box) that is related to $p$.

Since a phrase $p$ and an image $I$ determine a region $r_i$, i.e. $(p, I) \to r_i$, then ideally the phrase $p$ can also be reconstructed from the image $I$ and the region $r_i$ (similar in spirit to an autoencoder).
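The reconstruction idea can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions, not the paper's model: the bilinear attention, the bag-of-words decoder, and the names `W_att`, `W_dec` and `reconstruction_loss` are all hypothetical simplifications. The point it shows is that the loss needs only the phrase and the image regions, never a ground-truth box.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def reconstruction_loss(phrase_vec, region_feats, word_ids, W_att, W_dec):
    """Attend over regions with the phrase, then score the phrase's words
    from the attended visual feature alone (no box labels are used).

    phrase_vec:   (D_p,)     phrase embedding (assumed given)
    region_feats: (N, D_v)   one feature vector per region proposal
    word_ids:     list[int]  vocabulary indices of the words in the phrase
    W_att:        (D_v, D_p) hypothetical bilinear attention weights
    W_dec:        (V, D_v)   hypothetical bag-of-words decoder weights
    """
    # Attention over the N regions: bilinear score of phrase vs. each region.
    scores = region_feats @ W_att @ phrase_vec   # (N,)
    alpha = softmax(scores)                      # (N,), sums to 1
    # Attended visual feature: attention-weighted sum of region features.
    v_att = alpha @ region_feats                 # (D_v,)
    # Simplified decoder: a single distribution over the vocabulary.
    word_probs = softmax(W_dec @ v_att)          # (V,)
    # Negative log-likelihood of the words that occur in the phrase.
    loss = -np.mean(np.log(word_probs[word_ids]))
    return loss, alpha
```

Minimizing such a loss pushes the attention weights `alpha` toward regions whose features make the phrase easy to reconstruct, which is where the grounding signal comes from in the weakly supervised case.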

To select the correct bounding box from the region proposals $\{r_i\}_{i=1,\dots,N}$, we define an attention function $f_{ATT}$ and select the box
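As a rough sketch of box selection with an attention function, the snippet below scores every proposal against the phrase and returns one box. It rests on assumptions that are not stated in the notes above: the selected box is taken to be the one with the highest score, and `f_att` is written as a small two-layer scorer with hypothetical weights `W_p`, `W_r` and `w_out`.

```python
import numpy as np

def f_att(phrase_vec, region_feats, W_p, W_r, w_out):
    """Hypothetical attention function: one scalar score per region.

    phrase_vec:   (D_p,)    phrase embedding
    region_feats: (N, D_v)  one feature vector per region proposal
    W_p:          (H, D_p)  projects the phrase into the hidden space
    W_r:          (H, D_v)  projects each region into the hidden space
    w_out:        (H,)      maps the hidden layer to a scalar score
    """
    # ReLU over the sum of the projected region and the projected phrase.
    hidden = np.maximum(0.0, region_feats @ W_r.T + W_p @ phrase_vec)  # (N, H)
    return hidden @ w_out                                              # (N,)

def select_box(phrase_vec, region_feats, boxes, W_p, W_r, w_out):
    # Assumption: pick the proposal with the highest attention score.
    scores = f_att(phrase_vec, region_feats, W_p, W_r, w_out)
    j = int(np.argmax(scores))
    return j, boxes[j]

# Toy usage with hand-picked weights so the scores are easy to follow.
feats = np.array([[0.1, 5.0], [0.9, 0.0], [0.3, 1.0]])   # 3 proposals
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (2, 8, 12, 30)]  # (x1, y1, x2, y2)
j, box = select_box(np.zeros(2), feats, boxes,
                    W_p=np.zeros((2, 2)), W_r=np.eye(2),
                    w_out=np.array([1.0, 0.0]))
```

With these toy weights the score of each proposal is just its first feature, so the second proposal (index 1) is selected.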
