QIUXP - Pre-trained Language Models: BertMarker: MarkBERT: Marking Word Boundaries Improves Chinese BERT

YingJingh

Category: Paper Notes. Tags: language model, word, bert

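Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.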

Article link: https://blog.csdn.net/Hekena/article/details/128159295

MarkBERT: Marking Word Boundaries Improves Chinese BERT

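The authors argue that existing approaches that treat words as the basic unit do not work well for OOV words or for Chinese.
The proposed MarkBERT builds on word-level segmentation and additionally inserts marker tokens at the word boundaries.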
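
The idea of adding markers on top of a segmentation can be sketched roughly as follows. This is only an illustration, not the authors' code: the marker string `[S]`, the helper `insert_markers`, the choice to split words into characters, and the example segmentation are all assumptions, and the segmentation is presumed to come from some external word segmenter.

```python
from typing import List

MARKER = "[S]"  # assumed marker token name, for illustration only


def insert_markers(words: List[str], marker: str = MARKER) -> List[str]:
    """Split each word into characters and append a marker after every word."""
    tokens: List[str] = []
    for word in words:
        tokens.extend(word)    # one token per character of the word
        tokens.append(marker)  # the marker closes the word
    return tokens


# Hypothetical segmentation produced by an external word segmenter.
words = ["我", "喜欢", "自然", "语言"]
print(insert_markers(words))
```

Because the marker sits right behind each word, a model reading this sequence sees the boundary explicitly without needing a word-level vocabulary.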

Pre-training involves two tasks:

The first task is masked language modeling and we
also mask markers such that word boundary knowledge can
be learned since the pre-trained model needs to recognize
the word boundaries within the context. The second task is
replaced word detection. We replace a word with artificially
generated words and ask the markers behind the word to predict whether the word is replaced.
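
A minimal sketch of how training examples for the two tasks might be built, under stated assumptions. The function names, the `[S]` marker string, the confusion table, and the simplified masking (always substituting `[MASK]`, with no random or keep-original branch) are illustrative choices and are not taken from the paper.

```python
import random
from typing import Dict, List, Tuple

MARKER = "[S]"   # assumed marker token
MASK = "[MASK]"  # standard BERT mask token


def make_rwd_example(
    words: List[str],
    confusions: Dict[str, str],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Replace one word that has a confusion; label each marker 1 if its word was replaced."""
    candidates = [i for i, w in enumerate(words) if w in confusions]
    replaced = rng.choice(candidates) if candidates else -1
    tokens: List[str] = []
    marker_labels: List[int] = []
    for i, word in enumerate(words):
        shown = confusions[word] if i == replaced else word
        tokens.extend(shown)    # characters of the (possibly replaced) word
        tokens.append(MARKER)   # the marker behind the word carries the label
        marker_labels.append(1 if i == replaced else 0)
    return tokens, marker_labels


def mask_tokens(
    tokens: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Mask characters and markers alike; return the masked sequence and masked positions."""
    masked = list(tokens)
    positions: List[int] = []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = MASK
            positions.append(i)
    return masked, positions


rng = random.Random(0)
words = ["我", "喜欢", "苹果"]
confusions = {"苹果": "平果"}  # hypothetical confusion table
tokens, labels = make_rwd_example(words, confusions, rng)
print(tokens, labels)
print(mask_tokens(tokens, mask_prob=0.15, rng=rng))
```

Since markers are ordinary positions in the sequence, `mask_tokens` can mask them too, which is what forces the model to recover word boundaries from context.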

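In replaced word detection, the confusion words can be constructed in many ways. We adopt two simple strategies: (1) we use synonyms as confusions; (2) we use words that are phonetically (pinyin) similar in Chinese. To obtain synonyms, we use the external word embeddings provided by Zhang and Yang (2018). We compute the cosine similarity between words and use the most similar word as the synonym confusion. To obtain phonetic-based confusions, as shown in Figure 2, we use an external tool to get the phonetics of the word and select a word that shares the same phonetics as the confusion.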
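
The two confusion strategies can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the tiny embedding table stands in for the external embeddings, the hand-written pinyin dictionary stands in for the external phonetic tool, and the helper names are hypothetical.

```python
import math
from typing import Dict, List, Optional


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonym_confusion(word: str, embeddings: Dict[str, List[float]]) -> Optional[str]:
    """Return the other word whose embedding is most similar to `word`."""
    if word not in embeddings:
        return None
    others = [w for w in embeddings if w != word]
    if not others:
        return None
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


def phonetic_confusion(word: str, pinyin: Dict[str, str]) -> Optional[str]:
    """Return a different word that has the same pinyin reading as `word`."""
    target = pinyin.get(word)
    if target is None:
        return None
    for other, reading in pinyin.items():
        if other != word and reading == target:
            return other
    return None


# Toy stand-ins for the external embeddings and the external phonetic tool.
embeddings = {"高兴": [0.9, 0.1], "开心": [0.8, 0.2], "桌子": [0.0, 1.0]}
pinyin = {"权利": "quan li", "权力": "quan li", "苹果": "ping guo"}

print(synonym_confusion("高兴", embeddings))
print(phonetic_confusion("权利", pinyin))
```

With a real vocabulary the linear scans would be replaced by a nearest-neighbour index and a pinyin-to-words lookup table, but the selection rule stays the same.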

[Figure 2]

Model architecture

[Figure: model architecture]

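The authors argue that this pre-training scheme is especially helpful for marker-based recognition in Chinese NER. I am not convinced it is that good, and the experiments in the paper are not very thorough.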
