BERT - A Practitioner's Perspective

What is BERT?

BERT stands for “Bidirectional Encoder Representations from Transformers”. It is currently the leading language model. According to published results, it (or its variants) has hit quite a few language benchmarks out of the park. If you are interested, you can read the paper here: Bidirectional Encoder Representations from Transformers. The appendix of the paper discusses the different types of language tasks that BERT was tested on.

Details

There are many articles online that explain what BERT does. I found most of them cumbersome and loaded with details that make BERT even harder to understand. Let me try to explain it in simple terms, i.e. not as a researcher who is planning to improve BERT, but as a person who is interested in using it.

As a black-box

Let us first understand BERT as a black-box.

BERT language model

The above picture is taken from the paper. Let us first understand what the inputs and outputs for BERT are. You can see that there are multiple ways in which you can submit inputs to BERT.

What can you input?

  • Single sentence — In the above figure you can see illustrations b and d, where you feed a single sentence to BERT. This is done for sentence classification (e.g. sentiment analysis) or sentence tagging (e.g. named entity recognition).

  • Two sentences (separated by a marker) — In the above figure you can see illustrations a and c, where you feed two sentences separated by a marker ([SEP]). This is done for sentence-pair tasks such as sentence similarity, determining whether sentence 2 follows sentence 1, or obtaining answers (to questions) from a given paragraph of text.

What is the output?

  • For single sentence tasks — For single-sentence tasks (b and d in the figure) you consume either the class label (in b) for sentence classification (sentiment analysis) tasks or a tag (in d) for sentence-tagging tasks (named entity recognition, etc.).

  • For two sentence tasks — For cases where you input two sentences (a and c in the figure), you either consume the class label (in a), which can be a score between 0 and 1 (for example, indicating the probability that sentence 2 follows sentence 1), or you consume the second part of the output tags (in c), which is a representation of the answer you require (for the question-and-paragraph pair input to BERT).

Each output tag (C or T) listed in the output is a vector in an H-dimensional space, where H is 768 (as per the paper); most implementations of BERT give you an embedding in 768 dimensions.

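To make the black-box picture concrete, here is a minimal sketch of pulling those H-dimensional vectors out of a pre-trained BERT with the huggingface transformers library (the model name "bert-base-uncased" and the example sentences are illustrative assumptions, not anything prescribed by the paper):

    import torch
    from transformers import AutoTokenizer, AutoModel

    # Load a pre-trained BERT (bert-base-uncased has H = 768)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    # Single-sentence input (cases b and d in the figure)
    single = tokenizer("The movie was surprisingly good.", return_tensors="pt")

    # Two-sentence input (cases a and c); the tokenizer inserts [CLS] and [SEP] for you
    pair = tokenizer("How old is the Earth?",
                     "The Earth is about 4.5 billion years old.",
                     return_tensors="pt")

    with torch.no_grad():
        out = model(**pair)

    # One 768-dimensional vector per input token: these are the T's
    print(out.last_hidden_state.shape)   # (1, sequence_length, 768)
    # The vector at the [CLS] position is what the paper calls C
    cls_vector = out.last_hidden_state[:, 0, :]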

Cool! So far so good. If you are just interested in using BERT then you are good to go. You can directly install a few open-source libraries and start playing with BERT. Examples are listed below.

Example

You can check the examples quoted here to see how straightforward it is to use BERT: probably not more than 10 lines of code. I will not quote their full code here; you can check the links for the code (a rough sketch of the sentence encoder follows the list below).

  1. Sentence summarizer — https://pypi.org/project/bert-extractive-summarizer/

  2. Sentence encoder — https://pypi.org/project/sentence-transformers/

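For a sense of scale, here is a rough sketch of using the sentence-encoder package above (the model name "bert-base-nli-mean-tokens" is an assumption for illustration; any pre-trained sentence-transformers model would do):

    from sentence_transformers import SentenceTransformer

    # Load a BERT-based sentence encoder (model name assumed for illustration)
    model = SentenceTransformer("bert-base-nli-mean-tokens")

    sentences = ["BERT is a language model.",
                 "Transformers replaced LSTM-based RNNs for many NLP tasks."]

    # One fixed-size embedding per sentence
    embeddings = model.encode(sentences)
    print(embeddings.shape)   # (2, embedding_dimension)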

Advanced - Step 1 - Pre-training & Fine-tuning

Pre-training

The major motivation behind BERT seems to be to build a model that is pre-trained on an existing corpus, so that the same model can then be fine-tuned for different tasks. For example, in the above figure we see that the same BERT model is being used for various tasks. So what the research team did was build a BERT model and train it on the English Wikipedia (2,500M words) and BooksCorpus (800M words) on two tasks. The learning tasks are also simple:

  1. Masked LM (MLM) — They mask 15% of the input tokens (i.e. words) and then make BERT guess the masked tokens, training it accordingly. During this training they do not check whether the entire sentence is reproduced in order; they only make sure the model guesses the masked tokens correctly. To ensure that they do not create a mismatch between pre-training and fine-tuning (the [MASK] token never appears during fine-tuning), they make the following adjustment: if the i-th token is chosen (at random) for masking, it is replaced with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. Then Ti (the output vector at that position) is used to predict the original token with a cross-entropy loss. A rough sketch of this masking procedure follows the list below.

  2. Next Sentence Prediction (NSP) — They input two sentences A and B to BERT and make it predict whether B follows A. For 50% of the inputs B does follow A, and for the other 50% it does not. Sentences A and B are input to BERT with a [SEP] token inserted between them.

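As referenced in item 1, here is a minimal sketch of the 80/10/10 masking procedure, assuming a whitespace tokenizer and a toy vocabulary (everything here is illustrative; the real implementation works on WordPiece tokens):

    import random

    def mask_tokens(tokens, vocab, mask_prob=0.15):
        """Return (masked_tokens, labels); labels mark the positions the model must predict."""
        masked = list(tokens)
        labels = [None] * len(tokens)          # None = position not selected, no loss here
        for i, token in enumerate(tokens):
            if random.random() < mask_prob:    # select roughly 15% of the positions
                labels[i] = token              # the model must recover the original token
                r = random.random()
                if r < 0.8:                    # 80% of the time: replace with [MASK]
                    masked[i] = "[MASK]"
                elif r < 0.9:                  # 10% of the time: replace with a random token
                    masked[i] = random.choice(vocab)
                # remaining 10% of the time: leave the token unchanged
        return masked, labels

    vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
    tokens = "the cat sat on the mat".split()
    print(mask_tokens(tokens, vocab))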

Fine-tuning

So, in the previous step you have a BERT that is pre-trained on some corpus and on some learning tasks. You now have a model that outputs a set of tags, with each tag/output T living in an H-dimensional space (768 dimensions as per the paper). Cool! What you need to do next is fine-tune the entire model for your use-case. You can attach this BERT output layer to another layer of your choice (e.g. a multi-label classifier). The paper lists a set of 11 tasks that they fine-tuned the pre-trained BERT for, along with the results they obtained on those tasks. When they fine-tune, they make sure that the weights are tuned across the entire model, not just in the layers attached on top of BERT.

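A minimal sketch of what "attach another layer and tune the whole model" can look like with the huggingface transformers library and PyTorch (the model name, number of labels, learning rate and toy batch are illustrative assumptions):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # BERT body plus a freshly initialized classification head on top of C (the [CLS] vector)
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # By default every parameter requires gradients, so the whole model is fine-tuned,
    # not just the newly attached head.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    batch = tokenizer(["great movie", "terrible movie"],
                      return_tensors="pt", padding=True)
    labels = torch.tensor([1, 0])

    outputs = model(**batch, labels=labels)   # also returns the cross-entropy loss
    outputs.loss.backward()
    optimizer.step()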

Advanced - Step 2 - Under the hood

Transformers

BERT basically leverages Transformers. The paper “Attention is all you need” introduces the concept of the Transformer, which is nothing but an encoder-decoder architecture. For full details, refer to this article: http://jalammar.github.io/illustrated-transformer/ (pictures are taken from this article). This article is also good: https://towardsdatascience.com/transformers-141e32e69591. Transformers are now the bleeding edge in NLP and they have been replacing LSTM-based RNN models.

Encoder-Decoder layers in a Transformer

From the above example you can see that the Transformer is leveraged for translating sentences (French to English in the example above) or even for sentence prediction. When we peer inside the encoder and decoder, the layers, at a superficial level, are as follows:

Within an encoder and decoder.

And Transformers, in turn, leverage the concept of attention. The idea behind attention is that the context of a word in a paragraph is captured, in some sense, by all the other words in that paragraph/document. You can refer to the links mentioned above for more details.

Attention

Now, attention can be achieved using multiple methods. Google's attention paper, which I mentioned above, does not use an RNN/CNN on top of the attention layers. But I believe there are other approaches, like the one mentioned in this video (not me, btw :) ), where attention is used in conjunction with RNNs:

Achieving Attention

Transformers contd.

With reference to the Google paper (which is a well-cited one), when you zoom into the encoding and decoding layers, the Transformer architecture looks like this:

The Transformer architecture (from the paper)

The fun part is that instead of plain attention, this one talks about multi-head attention. What is the difference?

Single-headed attention apparently brings focus to a single area of the picture/corpus at a time, whereas multi-head attention ensures that multiple areas of the corpus/image are focused upon at the same time.

More details can be found here: NLP — Bert & Transformer

And the Google paper proposes its multi-head attention mechanism like this:

Multi-head attention (from the paper)

Now, if you are still interested in digging into the details of what those V, K and Q vectors are and what linear operations are applied to them, then the best place to get those details is the paper. The article [NLP — Bert & Transformer] also has a very good description of these details. If I talked about them here, this article would become super long and boring. For this article it is sufficient to understand that there is something called attention and that it is used in a Transformer.

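That said, for readers who want just a taste, here is a minimal sketch of scaled dot-product attention and the multi-head idea, following the formula softmax(QK^T / sqrt(d_k))V from the "Attention is all you need" paper (the tensor sizes and random inputs are illustrative assumptions; real projections are learned, not random):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V: each output row is a weighted mix of the rows of V."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over keys
        return weights @ V

    # Toy example: 4 tokens, model dimension 8, 2 attention heads of size 4
    seq_len, d_model, n_heads = 4, 8, 2
    d_head = d_model // n_heads
    x = np.random.randn(seq_len, d_model)          # token representations

    heads = []
    for _ in range(n_heads):
        # Each head gets its own projections (random here, learned in practice)
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        heads.append(scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v))

    # Multi-head attention: concatenate the heads, then mix them with one more projection
    W_o = np.random.randn(d_model, d_model)
    output = np.concatenate(heads, axis=-1) @ W_o
    print(output.shape)                            # (4, 8): one vector per token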

Uses of a Transformer

The Transformer is used in BERT, so that is one major use. As I said already, Transformers are replacing LSTM-based RNNs, so whenever you think about an RNN for your project/problem you should probably also give the Transformer a thought and see if you can train a Transformer instead of an RNN for your use-case. It is an encoder-decoder architecture, and you should be able to use a Transformer in such use-cases. Some typical problems that are tackled with a Transformer are as follows:

  • Next-sentence prediction
  • Question answering
  • Reading comprehension
  • Sentiment analysis
  • Text summarization

BERT

Now, coming to BERT, we already said that it builds on top of Transformers. BERT basically builds on the knowledge gained from models built a few years earlier, e.g. ELMo and OpenAI GPT. The major difference is that BERT is, in effect, fully connected: every token can attend to every other token in both directions, whereas OpenAI GPT uses only left-to-right connections and ELMo uses separate blocks for the forward and backward directions. This way of building a fully-connected (bidirectional) Transformer network seems to ensure that attention is properly distributed across the entire network, thereby ensuring that the model learns the language.

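To make that difference concrete, here is a tiny sketch of the two attention patterns expressed as masks (purely illustrative; a 1 at position (i, j) means token i is allowed to attend to token j):

    import numpy as np

    seq_len = 5

    # BERT-style bidirectional self-attention: every token sees every other token
    bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

    # GPT-style left-to-right attention: token i only sees positions <= i
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

    print(bidirectional_mask)
    print(causal_mask)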

Why does it work?

The paper talks about the reasons behind the apparent success of BERT over other models. But no one really knows the precise reasons behind the success of these models (or, for that matter, of any deep-learning model). This is an active area of research, and you may want to consider it for your thesis :) . It also means that DL modeling is still more art than science. But the BERT paper deserves credit, because the authors seem to be well aware of many other architectures and their inner workings, and to have a hunch about what may or may not work. They also addressed three key things which make BERT outright appealing and interesting:

  • Unsupervised learning for pre-training
  • Using a readily available corpus like Wikipedia for pre-training
  • Reducing the effort required to fine-tune for specific tasks

How to use BERT?

There are many libraries available for BERT. Some notable ones that I have come across are the ones linked throughout this article: huggingface transformers, sentence-transformers, and bert-extractive-summarizer.

Conclusion

In conclusion, if you are just consuming a pre-trained BERT then it is pretty straightforward. Huggingface also hosts fine-tuned models that others have shared with the community. If you wish to fine-tune BERT for your own use-cases and you have some tagged data, then you can use huggingface transformers and PyTorch to fine-tune a pre-trained BERT for your use-case. So, why are you waiting? Just install and explore BERT.

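As a parting sketch, consuming one of those community-shared fine-tuned models can be as short as this (the task name is real, but which exact model gets downloaded depends on the library's defaults, so treat the output as illustrative):

    from transformers import pipeline

    # Downloads a community model fine-tuned for sentiment analysis and wraps it
    classifier = pipeline("sentiment-analysis")
    print(classifier("BERT is surprisingly easy to use."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]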

Translated from: https://medium.com/swlh/bert-a-practitioners-perspective-11d49cdcb0a0
