Multimodal Relation Extraction with Efficient Graph Alignment

ABSTRACT
Multimodal relation extraction tackles the lack of textual context by using visual content to supplement the missing semantics.
The paper develops a dual graph alignment method to capture this text-image correlation for better performance.
1 INTRODUCTION
Different from the multimodal named entity recognition task, introducing visual information into relation extraction requires models not only to capture the correlations between visual objects and textual entities, but also to focus on the mappings from visual relations between objects in an image to textual relations between entities in a sentence. (Convoluted; I should also put the paper's example here.)
contributions:
present the multimodal relation extraction (MRE) task and provide a human-annotated dataset (MNRE)
propose a multimodal relation extraction neural network with an efficient alignment strategy for textual and visual graphs
conduct experiments on the MNRE dataset
2 RELATED WORKS
2.1 Relation Extraction in Social Media
2.2 Multimodal Representation and Alignment
the graph similarity is computed from both structural similarity and semantic agreement
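As a rough illustration of this idea, here is a minimal sketch (assuming PyTorch, two graphs with the same, already node-aligned size, and an illustrative weighting `alpha`) of a graph similarity that mixes structural similarity with semantic agreement; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def graph_similarity(adj_a, adj_b, feats_a, feats_b, alpha=0.5):
    """Toy graph similarity combining a structural term (agreement between
    adjacency matrices) and a semantic term (mean cosine similarity between
    node features). Assumes both graphs have the same, already-aligned node
    order; `alpha` and the exact formulas are illustrative assumptions."""
    structural = 1.0 - (adj_a - adj_b).abs().mean()   # in [0, 1] for 0/1 adjacencies
    semantic = F.cosine_similarity(feats_a, feats_b, dim=-1).mean()
    return alpha * structural + (1.0 - alpha) * semantic

# Example: two 3-node graphs with 4-dimensional node features.
adj_a = torch.eye(3)
adj_b = torch.eye(3)
feats_a = torch.randn(3, 4)
feats_b = feats_a.clone()
print(graph_similarity(adj_a, adj_b, feats_a, feats_b))  # close to 1.0
```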
3 METHODOLOGY
Steps to build the model:

1. Extract the textual semantic representations with a pretrained BERT encoder, and generate scene graphs (structural representations) from the images, which provide rich visual information including visual object features and the visual relations among the objects.
2. To acquire the structural representations, obtain the syntax dependency tree of the input text, which models the syntactic structure of the textual information; the visual object relations extracted by the scene graph are likewise constructed as a structural graph representation.
3. To make good use of the image information for multimodal relation extraction, respectively align the structural and semantic information of the multimodal features to capture the multi-perspective correlation between the modalities.
4. Concatenate the textual representations of the two entities with the aligned visual representation as the fusion feature of text and image to predict the relation between the entities (see the fusion sketch after this list).
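A minimal sketch of the fusion step (4), assuming PyTorch; the hidden sizes, the single linear classifier, and the relation count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Toy fusion head: concatenate the two entity representations from the
    text encoder with the aligned visual representation, then classify the
    relation between the two entities."""
    def __init__(self, text_dim=768, visual_dim=768, num_relations=20):
        super().__init__()
        # num_relations is a placeholder, not the MNRE label count.
        self.classifier = nn.Linear(2 * text_dim + visual_dim, num_relations)

    def forward(self, head_repr, tail_repr, aligned_visual):
        # Fusion feature of text and image, as described in step 4.
        fused = torch.cat([head_repr, tail_repr, aligned_visual], dim=-1)
        return self.classifier(fused)  # relation logits

# Example with a batch of 2.
head = RelationHead()
logits = head(torch.randn(2, 768), torch.randn(2, 768), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 20])
```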

3.1 Semantic Feature Representation
3.1.1 Textual Semantic Representation.
The input text is first tokenized into a token sequence s1.
To fit the BERT encoding procedure, the tokens '[CLS]' and '[SEP]' are added.
We augment s1 with four reserved word pieces, [E1start], [E1end], [E2start], and [E2end], to mark the two entities.
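A minimal sketch of this marker-augmented input, assuming the HuggingFace transformers BERT tokenizer and model; the marker spellings and the example sentence are illustrative, not copied from the paper.

```python
from transformers import BertTokenizer, BertModel

# Illustrative entity-marker spellings; the paper's reserved word pieces
# may be written differently.
MARKERS = ["[E1start]", "[E1end]", "[E2start]", "[E2end]"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": MARKERS})

model = BertModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))  # make room for the new markers

# Wrap the two entities with marker tokens; the tokenizer adds
# [CLS] and [SEP] automatically.
text = "[E1start] JFK [E1end] and [E2start] Obama [E2end] at the White House"
inputs = tokenizer(text, return_tensors="pt")
token_repr = model(**inputs).last_hidden_state  # (1, seq_len, 768)
```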
3.1.2 Visual Semantic Representation.
3.2 Structural Feature Representation
3.2.1 Syntax Dependency Tree
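No details noted here, but as a rough sketch of this step: a dependency tree can be obtained with an off-the-shelf parser and converted into a graph adjacency matrix. This assumes spaCy's en_core_web_sm pipeline and is not necessarily the paper's tooling.

```python
import numpy as np
import spacy

# Assumes the small English pipeline is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def dependency_adjacency(sentence: str) -> np.ndarray:
    """Parse a sentence and return a symmetric adjacency matrix that links
    each token to its syntactic head (with self-loops)."""
    doc = nlp(sentence)
    adj = np.eye(len(doc))
    for token in doc:
        adj[token.i, token.head.i] = 1.0
        adj[token.head.i, token.i] = 1.0
    return adj

print(dependency_adjacency("JFK and Obama at the White House"))
```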
3.2.2 Scene Graph Generation
3.3 Multimodal Feature Alignment
3.3.1 Graph Structure Alignment.
3.3.2 Semantic Features Alignment.
3.4 Entities Representation Concatenation
4 EXPERIMENT SETTINGS
4.1 Dataset
4.2 Baseline Methods
4.3 Parameter Settings
5 RESULTS AND DISCUSSION