multimodal relation extraction to address the lack of context (visual content supplements the missing semantics)
develop a dual graph alignment method to capture this correlation for better performance
1 INTRODUCTION
Different from the multimodal named entity recognition task, introducing visual information into relation extraction requires models not only to capture the correlations between visual objects and textual entities, but also to map the visual relations between objects in an image onto the textual relations between entities in a sentence. (This is convoluted; I should also paste the paper's example here.)
contributions:
present the multimodal relation extraction (MRE) task; provide a human-annotated dataset (MNRE)
propose a multimodal relation extraction neural network with an efficient alignment strategy for textual and visual graphs
conduct experiments on the MNRE dataset
2 RELATED WORKS
2.1 Relation Extraction in Social Media
2.2 Multimodal Representation and Alignment
compute the graph similarity from both structural similarity and semantic agreement
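A minimal sketch of combining the two signals into one graph-similarity score. The edge-Jaccard structural metric, the mean-cosine semantic agreement, and the mixing weight `alpha` are all illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def structural_similarity(adj_a, adj_b):
    # Jaccard overlap of edge sets between two graphs with aligned node order
    # (a simple stand-in for the paper's structural similarity).
    ea = set(zip(*np.nonzero(adj_a)))
    eb = set(zip(*np.nonzero(adj_b)))
    return len(ea & eb) / max(len(ea | eb), 1)

def semantic_agreement(feat_a, feat_b):
    # Mean cosine similarity between aligned node feature vectors.
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

def graph_similarity(adj_a, adj_b, feat_a, feat_b, alpha=0.5):
    # Mix structural and semantic scores; alpha is an assumed hyperparameter.
    return alpha * structural_similarity(adj_a, adj_b) + \
        (1.0 - alpha) * semantic_agreement(feat_a, feat_b)
```

Identical graphs with identical node features score 1.0 under both terms, so the combined score is 1.0 regardless of `alpha`.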
3 METHODOLOGY
steps to build the model:
1. extract textual semantic representations with a pretrained BERT encoder; generate scene graphs (structural representations) from the images, which provide rich visual information including visual object features and the visual relations among objects.
2. to acquire the textual structural representation, obtain the syntax dependency tree of the input text, which models its syntactic structure; the visual object relations extracted by the scene graph are likewise constructed as a structural graph representation.
3. to make good use of the image information, align the structural and the semantic information of the multimodal features respectively, capturing the multi-perspective correlation between the modalities.
4. concatenate the textual representations of the two entities with the aligned visual representation as the fused text-image feature to predict the relation between the entities.
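The final fusion step above can be sketched as follows; the vector shapes, the plain linear classifier, and all names are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def fuse_and_classify(head_repr, tail_repr, visual_repr, W, b):
    # Concatenate head-entity, tail-entity, and aligned visual representations,
    # then score each candidate relation with a linear layer (an assumed
    # stand-in for the paper's classifier head).
    fused = np.concatenate([head_repr, tail_repr, visual_repr])
    logits = W @ fused + b
    return int(np.argmax(logits))

# toy dimensions (assumed): 4-d entity vectors, 4-d visual vector, 3 relations
rng = np.random.default_rng(0)
head, tail, vis = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
W, b = rng.normal(size=(3, 12)), np.zeros(3)
pred = fuse_and_classify(head, tail, vis, W, b)
```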
3.1 Semantic Feature Representation
3.1.1 Textual Semantic Representation.
The input text message is first tokenized into a token sequence 𝑠1
to fit the BERT encoding procedure, we add the tokens '[CLS]' and '[SEP]'
we augment 𝑠1 with four reserved word pieces, [𝐸1𝑠𝑡𝑎𝑟𝑡], [𝐸1𝑒𝑛𝑑], [𝐸2𝑠𝑡𝑎𝑟𝑡] and [𝐸2𝑒𝑛𝑑], to mark the positions of the two entities
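A minimal sketch of this augmentation. The marker strings follow the notes, but the [start, end) span convention and the helper name are assumptions:

```python
def add_entity_markers(tokens, e1_span, e2_span):
    # Insert the four reserved marker tokens around the two entity spans
    # (each span is a [start, end) pair of token indices, assumed non-overlapping),
    # then wrap the sequence with [CLS] and [SEP] for BERT.
    inserts = sorted(
        [(e1_span[0], "[E1start]"), (e1_span[1], "[E1end]"),
         (e2_span[0], "[E2start]"), (e2_span[1], "[E2end]")],
        key=lambda x: x[0], reverse=True)
    out = list(tokens)
    # Insert from the rightmost position first so earlier indices stay valid.
    for pos, marker in inserts:
        out.insert(pos, marker)
    return ["[CLS]"] + out + ["[SEP]"]
```

For example, marking "JFK" and "Obama" in the tokens `["JFK", "greets", "Obama"]` yields `["[CLS]", "[E1start]", "JFK", "[E1end]", "greets", "[E2start]", "Obama", "[E2end]", "[SEP]"]`.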