Google BERT NLP Machine Learning Tutorial

There are plenty of applications for machine learning, and one of those is natural language processing or NLP.

NLP handles things like text responses, figuring out the meaning of words within context, and holding conversations with us. It helps computers understand the human language so that we can communicate in different ways.

From chat bots to job applications to sorting your email into different folders, NLP is being used everywhere around us.

At its core, natural language processing is a blend of computer science and linguistics. Linguistics gives us the rules to use to train our machine learning models and get the results we're looking for.

There are a lot of reasons natural language processing has become a huge part of machine learning. It helps machines detect the sentiment from a customer's feedback, it can help sort support tickets for any projects you're working on, and it can read and understand text consistently.

And since it operates off of a set of linguistic rules, it doesn't have the same biases as a human would.

Since NLP is such a large area of study, there are a number of tools you can use to analyze data for your specific purposes.

There's the rules-based approach where you set up a lot of if-then statements to handle how text is interpreted. Usually a linguist will be responsible for this task and what they produce is very easy for people to understand.

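To make the idea concrete, here's a minimal, hypothetical sketch of a rules-based sentiment check; the keyword lists and function are invented purely for illustration and aren't from any particular library.

# A minimal, hypothetical sketch of the rules-based idea: hand-written
# keyword rules decide how a piece of text is interpreted.
POSITIVE_WORDS = {"great", "excellent", "love"}
NEGATIVE_WORDS = {"terrible", "awful", "hate"}

def rule_based_sentiment(text: str) -> str:
    words = set(text.lower().split())
    if words & NEGATIVE_WORDS:
        return "negative"
    if words & POSITIVE_WORDS:
        return "positive"
    return "neutral"

print(rule_based_sentiment("I love this place, the food was great"))  # positive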

This might be good to start with, but it becomes very complex as you start working with large data sets.

Another approach is to use machine learning where you don't need to define rules. This is great when you are trying to analyze large amounts of data quickly and accurately.

Picking the right algorithm so that the machine learning approach works is important in terms of efficiency and accuracy. There are common algorithms like Naïve Bayes and Support Vector Machines. Then there are the more specific algorithms like Google BERT.

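As a point of reference, here's a rough sketch of what one of those common algorithms looks like in practice: a Naïve Bayes classifier over bag-of-words counts, using scikit-learn. The toy reviews and labels are made up purely for illustration.

# A tiny baseline: bag-of-words counts fed into a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great food", "terrible service", "loved it", "awful experience"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["the food was awful"]))  # most likely [0]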

What is BERT?

BERT is an open-source library created in 2018 at Google. It's a new technique for NLP and it takes a completely different approach to training models than any other technique.

BERT is an acronym for Bidirectional Encoder Representations from Transformers. That means unlike most techniques that analyze sentences from left-to-right or right-to-left, BERT goes both directions using the Transformer encoder. Its goal is to generate a language model.

This gives it incredible accuracy and performance on smaller data sets which solves a huge problem in natural language processing.

While there is a huge amount of text-based data available, very little of it has been labeled for use in training a machine learning model. Since most of the approaches to NLP problems take advantage of deep learning, you need large amounts of data to train with.

You really see the huge improvements in a model when it has been trained with millions of data points. To help get around this problem of not having enough labeled data, researchers came up with ways to train general-purpose language representation models through pre-training using text from around the internet.

These pre-trained representation models can then be fine-tuned to work on specific data sets that are smaller than those commonly used in deep learning. These smaller data sets can be for problems like sentiment analysis or spam detection. This is the way most NLP problems are approached because it gives more accurate results than starting with the smaller data set.

That's why BERT is such a big discovery. It provides a way to more accurately pre-train your models with less data. The bidirectional approach it uses means it gets more of the context for a word than if it were just training in one direction. With this additional context, it is able to take advantage of another technique called masked LM.

How it's different from other machine learning algorithms

Masked LM randomly masks 15% of the words in a sentence with a [MASK] token and then tries to predict them based on the words surrounding the masked one. That's how BERT is able to look at words from both left-to-right and right-to-left.

This is completely different from every other existing language model because it looks at the words before and after a masked word at the same time. A lot of the accuracy BERT has can be attributed to this.

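Here's a simplified sketch of that masking step. It is not BERT's actual pre-processing code; in the real implementation a chosen token is only replaced with [MASK] most of the time and is otherwise swapped for a random word or left unchanged, but this shows the core mechanic.

import random

def mask_tokens(tokens, mask_prob=0.15, seed=42):
    # randomly hide roughly 15% of the tokens and remember what they were
    random.seed(seed)
    masked = []
    targets = {}  # position -> original token the model must predict
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token
            masked.append("[MASK]")
        else:
            masked.append(token)
    return masked, targets

tokens = "the dog has blue spots and it rolls around the parking lot".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)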

To get BERT working with your data set, you do have to add a bit of metadata. There will need to be token embeddings to mark the beginning and end of sentences. You'll need to have segment embeddings to be able to distinguish different sentences. Lastly you'll need positional embeddings to indicate the position of words in a sentence.

It'll look similar to this.

[CLS] the [MASK] has blue spots [SEP] it rolls [MASK] the parking lot [SEP]

With the metadata added to your data points, masked LM is ready to work.

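As a concrete illustration of that metadata (plain Python, not BERT's tokenizer), here are the segment and position ids for the example above; the token embeddings come from the tokens themselves.

tokens = ["[CLS]", "the", "[MASK]", "has", "blue", "spots", "[SEP]",
          "it", "rolls", "[MASK]", "the", "parking", "lot", "[SEP]"]

# segment ids: 0 for the first sentence (up to and including the first [SEP]),
# 1 for the second sentence
segment_ids = [0] * 7 + [1] * 7

# position ids: simply the index of each token in the sequence
position_ids = list(range(len(tokens)))

for tok, seg, pos in zip(tokens, segment_ids, position_ids):
    print(f"{tok:10s} segment={seg} position={pos}")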

Once it's finished predicting words, then BERT takes advantage of next sentence prediction. This looks at the relationship between two sentences. It does this to better understand the context of the entire data set by taking a pair of sentences and predicting if the second sentence is the next sentence based on the original text.

For next sentence prediction to work in the BERT technique, the second sentence is sent through the Transformer-based model.

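Sketched in plain Python (again, not BERT's actual data pipeline), building those sentence pairs looks roughly like this: half the pairs use the genuine next sentence, and half pair the first sentence with a random one.

import random

sentences = [
    "the dog has blue spots",
    "it rolls around the parking lot",
    "the coffee was cold",
    "the waiter never came back",
]

def make_nsp_pair(sentences, i, seed=None):
    # 50% of the time keep the real next sentence, otherwise pick a random one
    rng = random.Random(seed)
    first = sentences[i]
    if rng.random() < 0.5 and i + 1 < len(sentences):
        return first, sentences[i + 1], "IsNext"
    other = rng.choice([s for j, s in enumerate(sentences) if j not in (i, i + 1)])
    return first, other, "NotNext"

print(make_nsp_pair(sentences, 0, seed=1))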

There are four different pre-trained versions of BERT depending on the scale of data you're working with. You can learn more about them here: https://github.com/google-research/bert#bert

The drawback to this approach is that the loss function only considers the masked word predictions and not the predictions of the others. That means the BERT technique converges more slowly than other right-to-left or left-to-right techniques.

BERT can be applied to any NLP problem you can think of, including intent prediction, question-answering applications, and text classification.

Code Example

Getting set up

Now we're going to go through an example of BERT in action. The first thing you'll need to do is clone the BERT repo.

git clone https://github.com/google-research/bert.git

Now you need to download the pre-trained BERT model files from the BERT GitHub page. Throughout the rest of this tutorial, I'll refer to the directory of this repo as the root directory.

These files give you the hyper-parameters, weights, and other things you need, capturing the information BERT learned during pre-training. I'll be using the BERT-Base, Uncased model, but you'll find several other options across different languages on the GitHub page.

One reason you would choose the BERT-Base, Uncased model is that you don't have access to a Google TPU, in which case you would typically stick with a Base model.

If the casing of the text you're trying to analyze matters (that is, capitalization carries real contextual meaning), then you would go with a Cased model.

If the casing isn't important or you aren't quite sure yet, then an Uncased model would be a valid choice.

We'll be working with some Yelp reviews as our data set. Remember, BERT expects the data in a certain format using those token embeddings and others. We'll need to add those to a .tsv file. This file will be similar to a .csv, but it will have four columns and no header row.

Here's what the four columns will look like; a sample row follows the list.

  • Column 0: Row id
  • Column 1: Row label (needs to be an integer)
  • Column 2: A column of the same letter for all rows (it doesn't get used for anything, but BERT expects it)
  • Column 3: The text we want to classify
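
For example, a single row in train.tsv might look like the line below. The fields are separated by tabs, and the review text here is invented purely for illustration.

0	1	q	The food was amazing and the staff were friendly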

You'll need to make a folder called data in the directory where you cloned BERT and add three files there: train.tsv, dev.tsv, test.tsv.

In the train.tsv and dev.tsv files, we'll have the four columns we talked about earlier. In the test.tsv file, we'll only have the row id and text we want to classify as columns. These are going to be the data files we use to train and test our model.

Prepping the data

First we need to get the data we'll be working with. You can download the Yelp reviews for yourself here: https://course.fast.ai/datasets#nlp. They're under the NLP section, and you'll want the Polarity version.

The reason we'll work with this version is because the data already has a polarity, which means it already has a sentiment associated with it. Save this file in the data directory.

Now we're ready to start writing code. Create a new file in the root directory called pre_processing.py and add the following code.

import pandas as pd
# this is to extract the data from that .tgz file
import tarfile
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# get all of the data out of that .tgz
yelp_reviews = tarfile.open('data/yelp_review_polarity_csv.tgz')
yelp_reviews.extractall('data')
yelp_reviews.close()

# check out what the data looks like before you get started
# look at the training data set
train_df = pd.read_csv('data/yelp_review_polarity_csv/train.csv', header=None)
print(train_df.head())

# look at the test data set
test_df = pd.read_csv('data/yelp_review_polarity_csv/test.csv', header=None)
print(test_df.head())

In this code, we've imported some Python packages and extracted the data to see what it looks like. You'll notice that the values associated with reviews are 1 and 2, with 1 being a bad review and 2 being a good review. We need to convert these values to the more standard labels 0 and 1. You can do that with the following code.

train_df[0] = (train_df[0] == 2).astype(int)
test_df[0] = (test_df[0] == 2).astype(int)

Whenever you make updates to your data, it's always important to check whether things turned out right. So we'll do that with the following commands.

print(train_df.head())
print(test_df.head())

You should see that the polarity values have changed to what you expected, so the data now has 1s and 0s.

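As an optional extra check beyond head(), you can also count the label values to confirm that only 0s and 1s remain; this is just a quick sanity check, not a required step in the walkthrough.

# quick sanity check: the only labels left should be 0 and 1
print(train_df[0].value_counts())
print(test_df[0].value_counts())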

Since we've cleaned the initial data, it's time to get things ready for BERT. We'll have to make our data fit the column formats we talked about earlier. Let's start with the training data.

The training data will have all four columns: row id, row label, single letter, text we want to classify.

BERT expects two files for training called train and dev. We'll make those files by splitting the initial train file into two files after we format our data with the following commands.

bert_df = pd.DataFrame({
    'id': range(len(train_df)),                            # Column 0: row id
    'label': train_df[0],                                  # Column 1: integer label
    'alpha': ['q']*train_df.shape[0],                      # Column 2: same throwaway letter for every row
    'text': train_df[1].replace(r'\n', ' ', regex=True)    # Column 3: review text with newlines removed
})

# hold out 1% of the training data as the dev set
train_bert_df, dev_bert_df = train_test_split(bert_df, test_size=0.01)

With the bert_df variable, we have formatted the data to be what BERT expects. You can choose any other letter for the alpha value if you like. The train_test_split method we imported in the beginning handles splitting the training data into the two files we need.

Take a look at how the data has been formatted with this command.

print(train_bert_df.head())

Now we need to format the test data. This will look different from how we handled the training data. BERT only expects two columns for the test data: row id, text we want to classify. We don't need to do anything else to the test data once we have it in this format and we'll do that with the following command.

test_bert_df = pd.DataFrame({
    'id': range(len(test_df)),                             # Column 0: row id
    'text': test_df[1].replace(r'\n', ' ', regex=True)     # Column 1: review text with newlines removed
})

It's similar to what we did with the training data, just without two of the columns. Take a look at the newly formatted test data.

print(test_bert_df.head())

If everything looks good, you can save these variables as the .tsv files BERT will work with.

train_bert_df.to_csv('data/train.tsv', sep='\t', index=False, header=False)
dev_bert_df.to_csv('data/dev.tsv', sep='\t', index=False, header=False)
test_bert_df.to_csv('data/test.tsv', sep='\t', index=False, header=False)

Training the model

One quick note before we get into training the model: BERT can be very resource intensive on laptops. It might cause memory errors because there isn't enough RAM or some other hardware isn't powerful enough. You could try making the train_batch_size smaller, but that's going to make the model training really slow.

Add a folder to the root directory called model_output. That's where our model will be saved after training is finished. Now open a terminal and go to the root directory of this project. Once you're in the right directory, run the following command and it will begin training your model.

python run_classifier.py --task_name=cola --do_train=true --do_eval=true --data_dir=./data/ --vocab_file=./uncased_L-12_H-768_A-12/vocab.txt --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=./uncased_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=./model_output --do_lower_case=True

You should see some output scrolling through your terminal. Once this finishes running, you will have a trained model that's ready to make predictions!

Making a prediction

If you take a look in the model_output directory, you'll notice there are a bunch of model.ckpt files. These files have the weights for the trained model at different points during training so you want to find the one with the highest number. That will be the final trained model that you'll want to use.

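If you'd rather not scan the directory by hand, a small helper like this (my own sketch, not part of the BERT repo, and it assumes the checkpoints follow the usual model.ckpt-<step>.index naming) will print the highest checkpoint number.

import os
import re

# collect the step numbers of all saved checkpoints in model_output
steps = []
for name in os.listdir('model_output'):
    match = re.match(r'model\.ckpt-(\d+)\.index$', name)
    if match:
        steps.append(int(match.group(1)))

print(max(steps))  # plug this into model.ckpt-<highest checkpoint number>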

Now we'll run run_classifier.py again with slightly different options. In particular, we'll be changing the init_checkpoint value to the highest model checkpoint and setting a new --do_predict value to true. Here's the command you need to run in your terminal.

python run_classifier.py --task_name=cola --do_predict=true --data_dir=./data --vocab_file=./uncased_L-12_H-768_A-12/vocab.txt --bert_config_file=./uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=./model_output/model.ckpt-<highest checkpoint number> --max_seq_length=128 --output_dir=./model_output

Once the command is finished running, you should see a new file called test_results.tsv. This will have your predicted results based on the model you trained!

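To turn those results into labels, a short pandas sketch like the one below should work; it assumes the file contains one row of tab-separated class probabilities per test example, with the column order matching the 0 and 1 labels we trained with.

import pandas as pd

# each row of test_results.tsv holds the class probabilities for one test example
results = pd.read_csv('model_output/test_results.tsv', sep='\t', header=None)
predicted_labels = results.idxmax(axis=1)  # column index of the highest probability
print(predicted_labels.head())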

You've just used BERT to analyze some real data and hopefully this all made sense.

Other thoughts

I felt it was necessary to go through the data cleaning process here just in case someone hasn't been through it before. Sometimes machine learning seems like magic, but it's really taking the time to get your data in the right condition to train with an algorithm.

BERT is still relatively new since it was just released in 2018, but it has so far proven to be more accurate than existing models even if it is slower.

Source: https://www.freecodecamp.org/news/google-bert-nlp-machine-learning-tutorial/
