Multimodal Relation Extraction with Efficient Graph Alignment

ABSTRACT
multimodal relation extraction: addresses the lack of textual context by using visual content to supplement the missing semantics
develop a dual graph alignment method to capture the correlation between textual and visual content for better performance
1 INTRODUCTION
Unlike the multimodal named entity recognition task, introducing visual information into relation extraction requires models not only to capture the correlations between visual objects and textual entities, but also to map the visual relations between objects in an image onto the textual relations between entities in a sentence. (This is convoluted; I'll also include the paper's example here.)
contributions:
present the multimodal relation extraction (MRE) task; provide a human-annotated dataset (MNRE)
propose a multimodal relation extraction neural network with efficient alignment strategy for textual and visual graphs
conduct experiments on the MNRE dataset
2 RELATED WORKS
2.1 Relation Extraction in Social Media
2.2 Multimodal Representation and Alignment
compute graph similarity from both structural similarity and semantic agreement
3 METHODOLOGY
steps to build the model:
1. extract the textual semantic representations with a pretrained BERT encoder; in parallel, generate scene graphs (structural representations) from the images, which provide rich visual information including visual object features and the visual relations among those objects
2. to acquire the structural representations of the text, obtain the syntax dependency tree of the input sentence, which models the syntactic structure of the textual information; the visual object relations extracted by the scene graph likewise form a structural graph representation
3. to make good use of the image information for multimodal relation extraction, align the structural and the semantic information of the multimodal features separately, capturing the multi-perspective correlation between the two modalities
4. concatenate the textual representations of the two entities with the aligned visual representation as the fused text-image feature, and use it to predict the relation between the entities

3.1 Semantic Feature Representation
3.1.1 Textual Semantic Representation.
the input text is first tokenized into a token sequence s1
to fit the BERT encoding procedure, the tokens '[CLS]' and '[SEP]' are added
s1 is augmented with four reserved word pieces, [E1start], [E1end], [E2start] and [E2end], which mark the boundaries of the two entities
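
A minimal sketch of this marking scheme with the HuggingFace tokenizer; the marker spellings and the sample sentence are illustrative, not taken from the paper's code:

from transformers import BertTokenizer

# Register the four entity markers as special tokens so BERT keeps them whole.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
markers = ["[E1start]", "[E1end]", "[E2start]", "[E2end]"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})

# Wrap the two entity mentions with their markers before encoding;
# the tokenizer adds [CLS] ... [SEP] on its own.
text = "[E1start] JFK [E1end] and [E2start] Obama [E2end] at Harvard"
encoded = tokenizer(text)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))

If these markers are fed to a model, its embedding table needs resizing with model.resize_token_embeddings(len(tokenizer)).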
3.1.2 Visual Semantic Representation.
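As a hedged illustration only (the paper's exact visual encoder is not specified in these notes), object-level semantic features are commonly taken from a pretrained CNN backbone with its classification head removed:

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ResNet-50 as a generic feature extractor; fc replaced by identity
# so the forward pass returns the 2048-d pooled feature.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

crop = Image.open("object_region.jpg").convert("RGB")  # e.g. a detected object crop
with torch.no_grad():
    feature = backbone(preprocess(crop).unsqueeze(0))   # shape: (1, 2048)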
3.2 Structural Feature Representation
3.2.1 Syntax Dependency Tree
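A small sketch of building the dependency tree with spaCy (illustrative; the notes do not name the parser the authors used) and reading it out as head-to-child edges, i.e. the textual structural graph:

import spacy

# Parse a sentence and list the dependency edges (head -> child, relation label).
nlp = spacy.load("en_core_web_sm")
doc = nlp("JFK and Obama met at Harvard")

edges = [(tok.head.i, tok.i, tok.dep_) for tok in doc if tok.head.i != tok.i]
for head, child, dep in edges:
    print(f"{doc[head].text} -> {doc[child].text} ({dep})")

These edges can be packed into an adjacency matrix over tokens for the alignment step in 3.3.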
3.2.2 Scene Graph Generation
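Scene graph generators vary; as a toy illustration of the output structure assumed here (detected objects as nodes, predicate triples as edges), not any specific generator:

from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Nodes are detected objects; edges are (subject_idx, predicate, object_idx).
    objects: list[str] = field(default_factory=list)
    relations: list[tuple[int, str, int]] = field(default_factory=list)

# A hand-written example of what a generator might emit for one image.
sg = SceneGraph(
    objects=["man", "woman", "podium"],
    relations=[(0, "standing next to", 1), (0, "behind", 2)],
)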
3.3 Multimodal Feature Alignment
3.3.1 Graph Structure Alignment.
3.3.2 Semantic Features Alignment.
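A hedged sketch of the graph similarity idea from 2.2: score each textual/visual node pair by semantic agreement (cosine similarity of projected node features) and by a structural-similarity proxy (node degree here; the paper's exact formulation is not recorded in these notes), then soft-align visual nodes to textual nodes:

import torch
import torch.nn.functional as F

def align_graphs(text_feats, vis_feats, text_adj, vis_adj, alpha=0.5):
    # text_feats: (n, d), vis_feats: (m, d) node features in a shared space;
    # text_adj: (n, n), vis_adj: (m, m) binary adjacency matrices;
    # alpha trades off semantic agreement vs. structural similarity (a toy choice).
    sem = F.normalize(text_feats, dim=-1) @ F.normalize(vis_feats, dim=-1).T  # (n, m)
    td = text_adj.sum(-1, keepdim=True)        # (n, 1) text node degrees
    vd = vis_adj.sum(-1, keepdim=True).T       # (1, m) visual node degrees
    struct = 1.0 / (1.0 + (td - vd).abs())     # close degrees -> higher score
    scores = alpha * sem + (1 - alpha) * struct
    attn = scores.softmax(dim=-1)              # each text node attends over visual nodes
    return attn @ vis_feats                    # (n, d) aligned visual representation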
3.4 Entities Representation Concatenation
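A minimal sketch of the fusion head, assuming the two entity vectors are the hidden states at the [E1start]/[E2start] markers (a common convention; not stated in these notes) and a relation count of 23 for MNRE (an assumption):

import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    # Concatenate the two entity vectors and the aligned visual vector, classify.
    def __init__(self, hidden=768, num_relations=23):
        super().__init__()
        self.fc = nn.Linear(3 * hidden, num_relations)

    def forward(self, e1_repr, e2_repr, visual_repr):
        fused = torch.cat([e1_repr, e2_repr, visual_repr], dim=-1)
        return self.fc(fused)

head = FusionClassifier()
logits = head(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 23])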
4 EXPERIMENT SETTINGS
4.1 Dataset
4.2 Baseline Methods
4.3 Parameter Settings
5 RESULTS AND DISCUSSION

 
