No Fear of Machine Learning: How to Classify Your Textual Data in Less Than 10 Lines of Code

This article builds upon my previous two articles where I share some tips on how to get started with data analysis in Python (or R) and explain some basic concepts on text analysis in Python. In this article, I want to go a step further and talk about how to get started with text classification with the help of machine learning.

The motivation behind writing this article is the same as for the previous ones: there are plenty of people out there stuck with tools that are not optimal for the task at hand, e.g. using MS Excel for text analysis. I want to encourage people to use Python, not to be scared of programming, and to automate as much of their work as possible.

Speaking of automation, in my last article I presented some methods on how to extract information out of textual data, using railroad incident reports as an example. Instead of having to read each incident text, I showed how to extract the most common words from a dataset (to get an overall idea of the data), how to look for specific words like “damage” or “hazardous materials” in order to log whether damages occurred or not, and how to extract the cost of the damages with the help of regular expressions.

However, what if there are too many words to look for? Using crime reports as an example: what if you are interested in whether minors were involved? There are many different words you can use to describe minors: underage person, minor, youth, teenager, juvenile, adolescent, and so on. Keeping track of all these words would be rather tedious. But if you already have labeled data from the past, there is a simple solution. Assuming you have a dataset of previous crime reports that you have labeled manually, you can train a classifier that learns the patterns in the labeled crime reports and matches them to the labels (whether a minor was involved, whether the person was intoxicated, etc.). If you get a new batch of unlabelled crime reports, you can run the classifier on the new data and label the reports automatically, without ever having to read them. Sounds great, doesn't it? And what if I told you that this can be done in as little as 7 lines of code? Mind-blowing, I know.

In this article, I will go over some main concepts in machine learning and natural language processing (NLP) and link articles for further reading. Some knowledge of text preprocessing is required, such as stopword removal and lemmatisation, which I described in my previous article. I will then show how to train a classifier on a real-world example, using a dataset that I acquired during my latest PhD study.

Classification Problem

Classification is the task of identifying to which of a set of categories a new observation belongs. An easy example would be the classification of emails into spam and ham (binary classification). If you have more than two categories, it is called multi-class classification. There are several popular classification algorithms which I will not discuss in depth, but I invite you to check out the provided links and do your own research.

Training and Testing Data

When you receive a new email, the email is not labelled as “spam” or “ham” by the sender. Your email provider has to label the incoming email either as ham and send it to your inbox, or as spam and send it to the spam folder. To tell how well your classifier (machine learning model) works in the real world, you split your data into a training and a testing set during development.

Let’s say you have 1000 emails that are labelled as spam and ham. If you train your classifier on all of the data, you have no data left to tell how accurate your classifier is, because there are no email examples that the model has not seen. So you want to leave some emails out of the training set and let your trained model predict the labels of those left-out emails (the testing set). By comparing the predicted with the actual labels, you will be able to tell how well your model generalises to new data and how it would perform in the real world.

Representing text in computer-readable format (numbers)

Ok, so now you have your data and you have divided it into a training and a testing set, so let's start training the classifier, no? Given that we are dealing with text data, there is one more thing we need to do. If we were dealing with numerical data, we could feed those numbers into the classifier directly. But computers don't understand text. The text has to be converted into numbers, a step known as text vectorisation. There are several ways of doing it, but I will only go through one. Check out these articles for further reading on tf-idf and word embeddings.

A popular and simple method of feature extraction with text data is called the bag-of-words model of text. Let’s take the following sentence as an example:

“apples are great but so are pears, however, sometimes I feel like oranges and on other days I like bananas”

This quote is used as our vocabulary. The sentence contains 17 distinct words. The following three sentences can then be represented as vectors using our vocabulary:

  • “I like apples”

    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

  • “Bananas are great. Bananas are awesome,”

    [0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

    or

    [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1] if you don’t care how often a word occurs.

  • “She eats kiwis”

    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Below you can see a representation of the sentence vectors in table format.

You can see that we count how often each word of the vocabulary shows up in the vector representation of the sentence. Words that are not in our initial vocabulary are of course not known, and therefore the last sentence contains only zeros. You should now be able to see how a classifier can identify patterns when trying to predict the label of a sentence. Imagine a training set that contains sentences about different fruits. The sentences that talk about apples will have a value higher than 0 at index 1 of the vector representation, whereas sentences that talk about bananas will have a value higher than 0 at index 17.

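To make this concrete, below is a minimal sketch of how such count vectors can be produced with scikit-learn's CountVectorizer. It is not taken from the notebook, and note that by default CountVectorizer lowercases the text, drops one-character tokens such as "I", and orders its vocabulary alphabetically, so the exact column order differs from the hand-built vectors above.

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary_sentence = ("apples are great but so are pears, however, sometimes "
                       "I feel like oranges and on other days I like bananas")

# Learn the vocabulary from our example sentence.
vectorizer = CountVectorizer()
vectorizer.fit([vocabulary_sentence])

new_sentences = ["I like apples",
                 "Bananas are great. Bananas are awesome",
                 "She eats kiwis"]

# Encode the new sentences as count vectors over the learned vocabulary.
# Words that are not in the vocabulary, like "kiwis", are simply ignored.
vectors = vectorizer.transform(new_sentences)

print(vectorizer.get_feature_names_out())
print(vectors.toarray())
```

Passing binary=True to CountVectorizer gives the 0/1 variant mentioned above, for when you do not care how often a word occurs.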

Measuring accuracy

Great, now we have trained our classifier; how do we know how accurate it is? That depends on your type of data. The simplest way to measure the accuracy of your classifier is by comparing the predictions on the test set with the actual labels. If the classifier got 80 out of 100 correct, the accuracy is 80%. There are, however, a few things to consider. I will not go into great detail but will give an example to demonstrate my point: imagine you have a dataset about credit card fraud. This dataset contains credit card transactions that occurred over two days, with 492 frauds out of 284,807 transactions. You can see that the dataset is highly unbalanced, and frauds account for only 0.172% of all transactions. You could assign “not fraud” to all transactions and get about 99.8% accuracy! This sounds great but completely misses the point of detecting fraud, as we have identified exactly 0% of the fraud cases.

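To see the accuracy trap in numbers, here is a small illustrative sketch using scikit-learn's DummyClassifier as an "always predict the majority class" baseline on a made-up label vector with the same class balance as the fraud example (the actual fraud dataset is not used here):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up stand-in labels: 0 = not fraud, 1 = fraud (492 frauds out of 284,807).
y = np.zeros(284_807, dtype=int)
y[:492] = 1
X = np.zeros((len(y), 1))  # dummy features, ignored by the baseline

# A "classifier" that always predicts the most frequent class, i.e. "not fraud".
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print("accuracy:", accuracy_score(y, pred))    # ~0.998, looks impressive
print("fraud recall:", recall_score(y, pred))  # 0.0, not a single fraud caught
```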

Here is an article that explains different ways to evaluate machine learning models. And here is another one that describes techniques for dealing with unbalanced data.

Cross-Validation

There is one more concept I want to introduce before going through a coding example, which is cross-validation. There are many great videos and articles that explain cross-validation, so, again, I will only quickly touch upon it with regard to accuracy evaluation. Cross-validation is usually used to avoid over- and underfitting the model on the training data. This is an important concept in machine learning, so I invite you to read up on it.

When splitting your data into a training and a testing set, what if you got lucky with the testing set, and the model's performance on it overestimates how well it would perform in real life? Or the opposite: what if you got unlucky, and your model would perform better on real-world data than on the testing set? To get a more general idea of your model's accuracy, it therefore makes sense to perform several different training and testing splits and to average the accuracies that you get for each iteration.

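As a rough sketch of what that looks like with scikit-learn's cross_val_score (using a small synthetic dataset purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset: 100 samples, 5 numeric features, binary labels.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = (X[:, 0] + 0.1 * rng.randn(100) > 0.5).astype(int)

# 5-fold cross-validation: the data is split into 5 parts, and each part is used
# once as the testing set while the remaining 4 parts are used for training.
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(scores)         # accuracy of each individual split
print(scores.mean())  # averaged accuracy, less dependent on one lucky or unlucky split
```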

Classifying people's concerns about a potential COVID-19 vaccine

Initially, I wanted to use a public dataset from the BBC comprising 2225 articles, each labeled under one of 5 categories (business, entertainment, politics, sport, or tech), and train a classifier that can predict the category of a news article. However, popular datasets often give you very high accuracy scores without much preprocessing of the data. The notebook is available here, where I demonstrate that you can get 97% accuracy without doing anything to the textual data. This is great but, unfortunately, not how things work for most real-life problems.

So, I used my own small dataset instead, which I collected during my latest PhD study. Once I have finished writing the paper, I will link it here. I crowdsourced arguments from people who are opposed to taking a COVID-19 vaccine, should one be developed. I labeled the data partly automatically and partly manually, by the “concern” that each argument addresses. The most popular concern is, not surprisingly, potential side effects of the vaccine, followed by a lack of trust in the government and concerns about insufficient testing due to the speed of the vaccine's development.

I am using Python and the scikit-learn library for this task. Scikit-learn can do a lot of the heavy lifting for you which, as stated in the title of this article, allows you to code up classifiers for textual data in as little as 7 lines of code. So, let’s start!

After reading in the data, which is a CSV file that contains 2 columns, the argument and its assigned concern, I first preprocess the data. I describe preprocessing methods in my last article. I add a new column that contains the preprocessed arguments to the data frame and will use the preprocessed arguments to train the classifier(s).

I split the arguments and their labels into a training and a testing set at an 80:20 ratio. Scikit-learn has a train_test_split function which can also stratify and shuffle the data for you.

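A sketch of what this step could look like is shown below. The file name and the column names "argument", "concern", and "preprocessed" are placeholders for illustration, not necessarily the names used in the notebook, and the .str.lower() call merely stands in for the actual preprocessing:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed CSV layout: one column with the raw argument, one with its concern label.
df = pd.read_csv("vaccine_concerns.csv")          # hypothetical file name
df["preprocessed"] = df["argument"].str.lower()   # stand-in for the real preprocessing

# 80:20 split; stratify keeps the proportion of each concern similar in both sets,
# and shuffling (on by default) randomises the order before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    df["preprocessed"], df["concern"],
    test_size=0.2, stratify=df["concern"], random_state=42)
```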

Next, I instantiate a CountVectorizer to transform the arguments into sparse vectors, as described above. Then I call the fit() function to learn the vocabulary from the arguments in the training set, and afterward I call the transform() function on the arguments to encode each as a vector (scikit-learn provides a function which does both steps at the same time). Now note that I only call transform() on the testing arguments. Why? Because I do not want to learn the vocabulary from the testing arguments and introduce bias into the classifier.

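Continuing the illustrative sketch from above, the vectorisation step could look roughly like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectoriser = CountVectorizer()

# fit_transform() learns the vocabulary from the training arguments and encodes them.
X_train_vec = vectoriser.fit_transform(X_train)

# Only transform() the testing arguments: words that were not seen during training
# are ignored, so no information from the testing set leaks into the vocabulary.
X_test_vec = vectoriser.transform(X_test)
```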

Finally, I instantiate a multinomial Naive Bayes classifier, fit it on the training arguments, test it on the testing arguments, and print out the accuracy score: almost 74%. But what if we use cross-validation? Then the accuracy drops to 67%. So we were just lucky with the testing data that the random state created. The random state ensures that the splits you generate are reproducible.

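A hedged sketch of this final step, still using the illustrative variable names from above (the exact code in the notebook may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Fit the classifier on the vectorised training arguments and score the testing set.
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
print(accuracy_score(y_test, clf.predict(X_test_vec)))

# Cross-validation on the preprocessed text: bundling the vectoriser and the
# classifier into a pipeline re-learns the vocabulary inside each fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
print(cross_val_score(pipeline, df["preprocessed"], df["concern"], cv=5).mean())
```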

Out of curiosity, I print out the predictions in the notebook and check which arguments the classifier got wrong, and we can see that the concern “side_effects” is mostly what gets assigned to the wrongly labelled arguments. This makes sense because the dataset is unbalanced: there are more than twice as many arguments that address side effects as for the second most popular concern, lack of government trust. So assigning side effects as a concern has a higher chance of being correct than assigning another one. I have tried undersampling and dropped some side effects arguments, but the results were not great. I encourage you to experiment with it.

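One way to do such an inspection, sketched with the same illustrative names as above:

```python
import pandas as pd

# Compare predicted and actual concerns on the testing set and look at the mistakes.
results = pd.DataFrame({
    "argument": X_test.values,
    "actual": y_test.values,
    "predicted": clf.predict(X_test_vec),
})
wrong = results[results["actual"] != results["predicted"]]
print(wrong["predicted"].value_counts())  # which concern the classifier over-assigns
print(wrong.head())
```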

I coded up a few more classifiers, and we can see that the accuracy for all of them is around 70%. I managed to get it up to 74% using a support vector machine and dropping around 30 side_effect arguments. I have also coded up a neural network using tensorflow, which is beyond the scope of this article and which increased accuracy by almost 10%.

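A quick sketch of how such a comparison could look; which classifiers the notebook actually uses may differ, and the tensorflow neural network is, as noted, out of scope here:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Try a few classifiers on the same vectorised data and compare their accuracies.
for model in (MultinomialNB(), LogisticRegression(max_iter=1000), LinearSVC()):
    model.fit(X_train_vec, y_train)
    print(type(model).__name__, round(model.score(X_test_vec, y_test), 3))
```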

I hope this article provided you with the necessary resources to get started with machine learning on textual data. Feel free to use any of the code from the notebook and I encourage you to play around with different preprocessing methods, classification algorithms, text vectorisations and, of course, datasets.

Translated from: https://towardsdatascience.com/no-fear-of-machine-learning-classify-your-textual-data-in-less-than-10-lines-of-code-2360d5cec798
