Notes on Multimodal Entity Recognition

更科瑠夏Q_Q

Last modified: 2023-07-14 16:51:16

Tags: notes, python

First published: 2023-07-14 16:49:09

Article link: https://blog.csdn.net/qq_46236257/article/details/131727289

Reading notes on papers about multimodal named entity recognition (MNER)

I. Task Introduction

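Multimodal named entity recognition (MNER) aims to find named entities in free text that is accompanied by an image and to classify them into predefined types. In the CoNLL-2003 task, the entity types are LOC, PER, ORG, and MISC, i.e. location, person, organization, and miscellaneous, and non-entity tokens are tagged "O". Because some entities span several words, a tagging scheme is used to mark the beginning of an entity (B-..., begin) and its continuation (I-..., inside); other schemes, such as IOBES, also exist.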
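To make the B-/I-/O scheme concrete, here is a minimal sketch that turns entity spans into BIO tags. The function name `spans_to_bio`, the span format (start, exclusive end, type), and the example sentence are my own assumptions for illustration; they are not taken from any dataset loader.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into BIO tags.

    spans: list of (start, end, entity_type) over token indices,
    with `end` exclusive. Spans are assumed not to overlap.
    """
    tags = ["O"] * len(tokens)  # every token starts as a non-entity
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{entity_type}"      # remaining tokens of the entity
    return tags


# Made-up example sentence, for illustration only.
tokens = ["Barack", "Obama", "visited", "New", "York", "."]
spans = [(0, 2, "PER"), (3, 5, "LOC")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Single-word entities simply get a lone B- tag; an IOBES scheme would additionally mark them with S- and mark the last token of longer entities with E-.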

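MNER has become an important research direction within named entity recognition (NER): it uses images as additional input to improve text-based NER. It assumes that when the text alone is insufficient, image information can help disambiguate named entities. For example, given the text "Handsome Rob after a fish dinner", we cannot infer the type of the named entity Rob, which could refer to a person or an animal. With the accompanying image (shown in Figure 1), we can easily determine that its type is MISC.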

[Figure 1: the image accompanying the example text; not reproduced here]

Thoughts on MNER

MNER is a hard task because it requires multimodal understanding in the social media domain. However, existing methods reduce it to extracting helpful visual clues to assist NER, usually demonstrated with a simple showcase. In the Twitter datasets, the image-text pair often has no relationship or only a vague one, so the model needs extra information or supervision to understand it. I believe that is why MNER-QG, MoRe, R-GCN and PromptMNER work. Still, existing works are nowhere near logical understanding, since they all introduce out-of-sample knowledge. I am now trying to introduce a knowledge graph into MNER to provide in-sample context.

Things to do:

Tricky task: while developing my own work (SOTA on two datasets, currently under submission), I found that

1. Tuning only the task head (with BERT and ViT frozen) achieves comparable results (a 0.2-0.5% drop). So I believe we can directly introduce prompt engineering for the text.

2. A large language model matters more than fancy innovations.

3. Replacing all valid images with an empty image at test time lowers performance by only 2%.

4. A simple loss such as a contrastive loss brings a 2% improvement. So I think the model focuses heavily on text.

5. Running the same code in different environments, or even on different GPUs, gives results with large variance...

6. There are two mainstream approaches, depending on whether the input is image+text or caption+text. The DAMO-series works mainly focus on the latter, which has proved more extensible and achieves SOTA results compared with the others.
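Item 4 mentions a contrastive loss without saying which one. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over paired text and image embeddings, written in plain Python so it runs without a deep learning framework. The function names, the temperature value, and the toy vectors are all my assumptions, not the loss used in the work described above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, image_embs, temperature=0.1):
    """Mean text-to-image InfoNCE loss.

    Pair i is (text_embs[i], image_embs[i]); every other image in the
    batch serves as a negative for text i.
    """
    losses = []
    for i, text in enumerate(text_embs):
        logits = [cosine(text, image) / temperature for image in image_embs]
        # Negative log-softmax of the matching pair, computed stably.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made-up numbers): each text is close to its own image.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.1], [0.1, 1.0]]
aligned_loss = info_nce(texts, images)
shuffled_loss = info_nce(texts, images[::-1])
print(aligned_loss < shuffled_loss)
```

The loss is small when each text is most similar to its own image and grows when the pairing is shuffled, which is the behaviour an auxiliary alignment objective relies on.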

MAF: A General Matching and Alignment Framework for Multimodal Named Entity Recognition (2022 WSDM)

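1. To address the first problem (mismatch between image and text), the authors propose a novel cross-modal matching (CM) module that computes a similarity score between the text and the image and uses that score to determine how much of the image information to keep.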

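2. To address the second problem (inconsistent image and text representations), the authors propose a cross-modal alignment (CA) module that makes the representations of the two modalities more consistent.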
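The idea behind the CM module, a score that decides how much image information survives, can be sketched as a simple gate. This is not the paper's formulation: the linear-plus-sigmoid scoring head, the names `matching_score` and `gate_image`, and all numbers below are hypothetical and only show how a score in (0, 1) can scale image features.

```python
import math


def matching_score(text_vec, image_vec, weight, bias=0.0):
    """Hypothetical matching head: sigmoid of a linear layer over [text; image]."""
    features = list(text_vec) + list(image_vec)
    z = sum(w * x for w, x in zip(weight, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gate_image(image_regions, score):
    """Scale every image region vector by the matching score."""
    return [[score * x for x in region] for region in image_regions]


# Made-up vectors; in a real model the weights would be learned.
text_vec = [0.2, 0.4]
image_vec = [0.1, 0.3]
weight = [0.0, 0.0, 0.0, 0.0]  # all-zero weights give a score of exactly 0.5
score = matching_score(text_vec, image_vec, weight)
kept = gate_image([[2.0, 4.0], [6.0, 8.0]], score)
print(score, kept)
```

A score near 0 effectively removes the image from the model's input, which is the desired behaviour when the image and text do not match.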

Overall pipeline

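1. The text is fed into BERT to obtain a representation of each word and of the whole text; the image is fed into ResNet to obtain regional and global representations of the image.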

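2. The whole-text representation and the global image representation are then fed into the cross-modal alignment module, while the per-word representations and the regional image representations are fed into the cross-modal interaction module. The cross-modal alignment (CA) module makes the representations of the text encoder and the image encoder more consistent, and the cross-modal interaction module produces a text-aware image representation.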

3. The cross-modal matching (CM) module is then used to determine how much of the image information to keep.

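4. Finally, a cross-modal fusion module fuses the representations of the two modalities, and the result is fed into a conditional random field (CRF) layer to obtain the final predictions.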
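The interaction and fusion steps above can be sketched in a few lines of plain Python. This is a toy sketch under strong assumptions, not the MAF implementation: the encoders are replaced by hand-written vectors, the interaction step is assumed to be scaled dot-product attention from words over image regions, the fusion is assumed to be concatenation, the matching score is passed in as a given number, and the CRF layer is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def text_aware_image(word_vecs, region_vecs):
    """For each word, attend over image regions (assumed scaled dot-product attention).

    Word and region vectors are assumed to share the same dimension.
    """
    dim = len(region_vecs[0])
    scale = math.sqrt(dim)
    attended = []
    for word in word_vecs:
        weights = softmax([dot(word, region) / scale for region in region_vecs])
        attended.append([
            sum(w * region[d] for w, region in zip(weights, region_vecs))
            for d in range(dim)
        ])
    return attended


def fuse(word_vecs, image_vecs, match_score):
    """Concatenate each word vector with its image vector scaled by the matching score."""
    return [w + [match_score * x for x in v] for w, v in zip(word_vecs, image_vecs)]


# Toy stand-ins for BERT word vectors and ResNet region vectors (made-up numbers).
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = text_aware_image(words, regions)
fused = fuse(words, attended, match_score=0.5)
print(fused)
```

In a full model, `fused` would be passed to a CRF layer that decodes one tag per word; with `match_score=0.0` the image half of every fused vector is zero, so the model falls back to text only.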