Improvements in MacBERT (Revisiting Pre-Trained Models for Chinese Natural Language Processing)


GitHub: https://github.com/ymcui/MacBERT
Paper: https://arxiv.org/abs/2004.13922
MacBERT's main contribution is a set of improvements built on top of RoBERTa, especially in the masking strategy.

1. Introduction to MacBERT

We also propose a new pre-trained model called MacBERT, which replaces the original MLM task with an MLM as correction (Mac) task and mitigates the discrepancy between the pre-training and fine-tuning stages.

2. Main Contributions of the Paper

The contributions of this paper are listed as follows.

  • [1] Extensive empirical studies are carried out to revisit the performance of Chinese pre-trained models on various tasks with careful analyses.
  • [2] We propose a new pre-trained model called MacBERT that mitigates the gap between the pre-training and fine-tuning stages by masking a word with a similar word, which has proven to be effective on downstream tasks.
  • [3] To further accelerate future research on Chinese NLP, we create and release a series of Chinese pre-trained models to the community.

2.1 Comparison of Pre-Trained Models

[Figure: comparison of pre-trained models, from the original paper]

BERT

BERT consists of two pre-training tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
MLM: randomly masks some of the tokens in the input; the objective is to predict the original word based only on its context.
NSP: predicts whether sentence B is the next sentence of A.
Later, the authors further proposed a technique called Whole Word Masking (WWM) to optimize the original masking in the MLM task. In this setting, instead of randomly selecting WordPiece tokens to mask, all of the tokens corresponding to a whole word are always masked at once.
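As a rough illustration (not BERT's actual data pipeline), the sketch below groups WordPiece tokens into whole words via the standard "##" continuation prefix and masks every piece of a chosen word together; `group_whole_words` and `whole_word_mask` are hypothetical helper names for this toy example.

```python
import random

def group_whole_words(wordpiece_tokens):
    """Group WordPiece token indices into whole words using the '##' continuation prefix."""
    groups = []
    for i, tok in enumerate(wordpiece_tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)      # continuation piece of the previous word
        else:
            groups.append([i])        # start of a new word
    return groups

def whole_word_mask(wordpiece_tokens, mask_rate=0.15):
    """Mask all pieces of a selected word at once instead of masking pieces independently."""
    tokens = list(wordpiece_tokens)
    groups = group_whole_words(tokens)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    random.shuffle(groups)
    masked = 0
    for group in groups:
        if masked >= n_to_mask:
            break
        for i in group:
            tokens[i] = "[MASK]"
        masked += len(group)
    return tokens

# e.g. "playing" = ["play", "##ing"]: both pieces get masked together
print(whole_word_mask(["play", "##ing", "foot", "##ball", "is", "fun"], mask_rate=0.3))
```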

ERNIE

Enhanced Representation through kNowledge IntEgration (ERNIE) is designed to optimize the masking process of BERT, which includes entity-level masking and phrase-level masking. Different from selecting random words in the input, entity-level masking masks named entities, which are often formed by several words. Phrase-level masking masks consecutive words, which is similar to the N-gram masking strategy.
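The toy sketch below illustrates the two masking granularities. The entity spans are assumed to come from an external NER step or dictionary; `entity_level_mask` and `phrase_level_mask` are hypothetical helpers, not ERNIE's actual pipeline.

```python
import random

def entity_level_mask(tokens, entity_spans):
    """Mask every token of a named entity; `entity_spans` are (start, end) index spans
    assumed to come from an external NER step or entity dictionary."""
    tokens = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            tokens[i] = "[MASK]"
    return tokens

def phrase_level_mask(tokens, n=3):
    """Mask a random run of n consecutive tokens, similar to N-gram masking."""
    tokens = list(tokens)
    start = random.randrange(0, max(1, len(tokens) - n + 1))
    for i in range(start, min(start + n, len(tokens))):
        tokens[i] = "[MASK]"
    return tokens

sentence = ["Harry", "Potter", "is", "a", "series", "of", "fantasy", "novels"]
print(entity_level_mask(sentence, entity_spans=[(0, 2)]))   # mask the entity "Harry Potter"
print(phrase_level_mask(sentence, n=3))                     # mask a random 3-gram
```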

XLNet

To alleviate the discrepancy between the pre-training and fine-tuning stages of BERT, XLNet was proposed, which is based on Transformer-XL. XLNet mainly modifies BERT in two ways. The first is to maximize the expected likelihood over all permutations of the factorization order of the input, which is called the Permutation Language Model (PLM). The other is to change the autoencoding language model into an autoregressive one, which is similar to traditional statistical language models.
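For reference, the permutation language modeling objective described in the XLNet paper can be written roughly as follows, where $\mathcal{Z}_T$ is the set of all permutations of the length-$T$ index sequence and $\mathbf{z}_{<t}$ are the first $t-1$ elements of a permutation $\mathbf{z}$:

$$\max_{\theta}\;\mathbb{E}_{\mathbf{z}\sim\mathcal{Z}_T}\!\left[\sum_{t=1}^{T}\log p_{\theta}\!\left(x_{z_t}\mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right]$$

Because the model predicts tokens autoregressively under a random factorization order, no artificial [MASK] symbol is needed, which is how XLNet sidesteps the pre-training/fine-tuning mismatch mentioned above.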

3. Structure of MacBERT

3.1 BERT-wwm & RoBERTa-wwm

We use a traditional Chinese Word Segmentation (CWS) tool to split the text into words. In this way, whole word masking can be adopted in Chinese to mask whole words instead of individual Chinese characters. We use LTP (Che et al., 2010) for Chinese word segmentation to identify the word boundaries. Note that whole word masking only affects the selection of the masked tokens in the pre-training stage; the input of BERT still uses the WordPiece tokenizer to split the text, which is identical to the original BERT. MacBERT keeps the same pre-training tasks as BERT, with several modifications described in the next subsection.
[Figure: comparison of the different masking strategies]

3.2 MacBERT Training Process

For the MLM task, we perform the following modifications.

  1. We use whole word masking as well as N-gram masking strategies for selecting candidate tokens for masking, with percentages of 40%, 30%, 20%, and 10% for unigram up to 4-gram.
  2. Instead of masking with the [MASK] token, which never appears in the fine-tuning stage, we propose to use similar words for the masking purpose. A similar word is obtained by using the Synonyms toolkit, which is based on word2vec similarity calculations. If an N-gram is selected to mask, we find similar words for each word individually. In rare cases, when there is no similar word, we degrade to random word replacement.
  3. We use a percentage of 15% of the input words for masking, where 80% are replaced with similar words, 10% are replaced with a random word, and the remaining 10% keep the original words (see the sketch after this list).
    For the NSP task, we perform the sentence-order prediction (SOP) task as introduced by ALBERT, where the negative samples are created by switching the original order of two consecutive sentences.
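To make the procedure above concrete, here is a minimal Python sketch (not the authors' code) of Mac-style masking and SOP sample construction. `TOY_SYNONYMS`, `mac_mask`, and `sop_example` are hypothetical names; the tiny synonym dictionary stands in for the Synonyms toolkit, and the 80/10/10 decision is simplified to a per-word draw.

```python
import random

# Stand-in for the Synonyms toolkit lookup used in the paper (hypothetical toy data).
TOY_SYNONYMS = {"使用": "利用", "语言": "言语", "模型": "模式"}

def pick_ngram_length():
    """Choose the masking span length with the 40/30/20/10 unigram-to-4-gram split."""
    return random.choices([1, 2, 3, 4], weights=[0.4, 0.3, 0.2, 0.1])[0]

def mac_mask(words, vocab, mask_rate=0.15):
    """Toy MLM-as-correction masking over a word-segmented sentence:
    select ~15% of the words; 80% -> similar word, 10% -> random word, 10% -> unchanged."""
    words = list(words)
    n_to_mask = max(1, round(len(words) * mask_rate))
    masked = 0
    while masked < n_to_mask:
        n = pick_ngram_length()
        start = random.randrange(0, len(words))
        for i in range(start, min(start + n, len(words))):
            r = random.random()
            if r < 0.8:
                # Replace with a similar word; fall back to a random word if none exists.
                words[i] = TOY_SYNONYMS.get(words[i], random.choice(vocab))
            elif r < 0.9:
                words[i] = random.choice(vocab)      # random replacement
            # else: keep the original word unchanged
            masked += 1
            if masked >= n_to_mask:
                break
    return words

def sop_example(sent_a, sent_b, negative=False):
    """SOP sample: the negative example simply swaps two consecutive sentences."""
    return (sent_b, sent_a) if negative else (sent_a, sent_b)

print(mac_mask(["我们", "使用", "语言", "模型"], vocab=["苹果", "天气", "学习"]))
print(sop_example("今天天气很好。", "我们出去散步。", negative=True))
```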

Experimental Results

[Figures: experimental result tables from the original paper]

Summary

MacBERT builds on the various improved versions of BERT and continues to refine the training strategy: (1) for the MLM task, it uses whole word masking and N-gram masking, and replaces the masked words with similar words, reducing the gap between pre-training and fine-tuning; (2) it adopts ALBERT's SOP task. On downstream tasks, it outperforms the previous BERT variants.
