Abstract
problem:
MNER and MRE usually suffer from error sensitivity when irrelevant object images are incorporated with the text.
solution:
Hierarchical Visual Prefix fusion NeTwork (HVPNeT)
detail:
regard visual representation as a pluggable visual prefix to guide the textual representation toward error-insensitive forecasting decisions
a dynamic gated aggregation strategy to achieve hierarchical multi-scaled visual features as the visual prefix for fusion
1 Introduction
main contributions:
present a hierarchical visual prefix fusion network for MNER and MRE
the first work to leverage hierarchical pyramidal visual features for multimodal learning
2 Related work
Multimodal Entity and Relation Extraction
text-only -> multimodal, ignoring error sensitivity -> multimodal with a classifier to filter irrelevant images, but requiring expensive annotation -> our work
Pre-trained Multimodal Representation
the existing visual-linguistic BERT models: architecture and pre-training tasks
why not apply current vision-language models to the MNER and MRE tasks?
MNER and MRE mainly focus on leveraging visual information to enhance the text rather
than conducting prediction on the image side
3 Methodology
The overall architecture of our hierarchical visual prefix for multimodal entity and relation extraction
![](https://img-blog.csdnimg.cn/97493e4ab7e9411fb2ccefe5254640c9.png)
3.1 Collection of Pyramidal Visual Feature
regional images provide more semantic knowledge to assist information extraction
global images express abstract concepts as weak learning signals
so we take the regional images as the vital information and the global images as the supplement
adopt a visual grounding toolkit to extract local visual objects with top-m salience
rescale the global image and object images to 224 × 224 pixels, giving the global image I and visual objects O = {o_1, o_2, ..., o_m}
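The rescaling step above can be sketched as follows. This is a minimal NumPy illustration using nearest-neighbour indexing; the `rescale` helper and the dummy arrays for `I` and `O` are purely hypothetical stand-ins (a real pipeline would typically use torchvision transforms on the grounded crops):

```python
import numpy as np

def rescale(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Nearest-neighbour rescale of an H x W x 3 image to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row index for each target row
    cols = np.arange(size) * w // size   # source column index for each target column
    return img[rows][:, cols]

# Dummy global image I and m = 3 grounded object crops of arbitrary sizes.
I = np.zeros((480, 640, 3), dtype=np.uint8)
O = [np.zeros((50 + 10 * i, 80, 3), dtype=np.uint8) for i in range(3)]

# All m + 1 inputs end up at the backbone's expected 224 x 224 resolution.
inputs = [rescale(I)] + [rescale(o) for o in O]
```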
given an image, we encode it with a backbone model and generate a list of pyramidal feature maps {F_1, F_2, F_3, ..., F_c}
![](https://img-blog.csdnimg.cn/acfda6ca288b47119260e41f24175d62.png)
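As a toy illustration of the pyramidal feature maps {F_1, ..., F_c} and the dynamic gated aggregation mentioned in the abstract, the sketch below uses random NumPy arrays in place of real backbone stage outputs; the average-pooling pyramid, the gate parameter `w`, and all shapes are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def avg_pool2x2(f: np.ndarray) -> np.ndarray:
    """2x2 average pooling; halves the spatial resolution (assumes even dims)."""
    h, w, c = f.shape
    return f.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)

# Stand-in for the finest backbone stage output; coarser levels via pooling.
F1 = rng.standard_normal((56, 56, 64))
pyramid = [F1]
for _ in range(3):                       # F_2..F_4 at progressively coarser scales
    pyramid.append(avg_pool2x2(pyramid[-1]))

# Gated aggregation (sketch): pool each level to a channel vector, score it,
# softmax the scores, and take the gated sum as the visual prefix.
pooled = np.stack([f.mean(axis=(0, 1)) for f in pyramid])  # (levels, channels)
w = rng.standard_normal(64)              # hypothetical gate parameters
scores = pooled @ w
gate = np.exp(scores - scores.max())
gate /= gate.sum()                       # softmax over pyramid levels
visual_prefix = gate @ pooled            # gated multi-scale feature, (64,)
```

In the paper the gate is learned end to end and the aggregated features serve as a pluggable prefix for the text encoder; here the gate is just a random linear scorer to keep the sketch self-contained.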
The remainder is omitted.