End-to-End Machine Learning Project: Reviews Classification

weixin_26752765

Original article: https://towardsdatascience.com/end-to-end-machine-learning-project-reviews-classification-60666d90ec19

In this article, we will work through a classification problem: labeling a review as either positive or negative. The reviews used here were written by customers of a service referred to as ABC.

Data Collection and Pre-processing

The data used in this project were scraped from the web, and the data cleaning was done in this notebook.

After scraping, the data was saved to a .txt file. Here is an example of one line of the file, which represents one data point:

{'socialShareUrl': 'https://www.abc.com/reviews/5ed0251025e5d20a88a2057d', 'businessUnitId': '5090eace00006400051ded85', 'businessUnitDisplayName': 'ABC', 'consumerId': '5ed0250fdfdf8632f9ee7ab6', 'consumerName': 'May', 'reviewId': '5ed0251025e5d20a88a2057d', 'reviewHeader': 'Wow - Great Service', 'reviewBody': 'Wow. Great Service with no issues.  Money was available same day in no time.', 'stars': 5}

The data point is a dictionary and we are interested in the reviewBody and stars.
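As a minimal sketch of pulling those two fields out of one line: the sample line uses single quotes, so it is a Python dict literal rather than valid JSON, and `ast.literal_eval` is one safe way to parse it. The record below is the sample line shortened to the relevant fields, and the variable names are illustrative assumptions, not the author's code.

```python
import ast

# One line of the scraped .txt file, shortened here to the relevant fields.
line = (
    "{'reviewHeader': 'Wow - Great Service', "
    "'reviewBody': 'Wow. Great Service with no issues.  "
    "Money was available same day in no time.', 'stars': 5}"
)

# The line is a Python dict literal (single-quoted keys and strings),
# so json.loads would reject it; ast.literal_eval parses it without
# executing arbitrary code.
record = ast.literal_eval(line)

text = record["reviewBody"]   # the review message
stars = record["stars"]       # the rating, an integer from 1 to 5

print(stars, text)
```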

We will categorize the reviews by their star rating as follows:

1 and 2 - Negative
3 - Neutral
4 and 5 - Positive

At the time of data collection, there were 36,456 reviews on the site. The data is highly imbalanced: 94% of the reviews are positive, 4% are negative, and 2% are neutral. In this project, we will fit different Sklearn models on the imbalanced data and also on balanced data (obtained by dropping the excess positive reviews so that the numbers of positive and negative reviews are equal).
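One way to build such a balanced set is random downsampling of the majority class. The sketch below is an assumption about how this could be done, not the author's code: the function name, the `(text, label)` tuple layout, and the choice to drop neutral reviews are all hypothetical, and the data is a toy example.

```python
import random

def balance_by_downsampling(reviews, seed=0):
    """Return equal numbers of positive and negative reviews.

    `reviews` is a list of (text, label) pairs. Labels other than
    "POSITIVE" and "NEGATIVE" (for example "NEUTRAL") are dropped,
    which is an assumption of this sketch.
    """
    positives = [r for r in reviews if r[1] == "POSITIVE"]
    negatives = [r for r in reviews if r[1] == "NEGATIVE"]

    rng = random.Random(seed)              # fixed seed for reproducibility
    n = min(len(positives), len(negatives))

    # Sample n items from each class, then shuffle so the classes are mixed.
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced

# Toy data only: 8 positive and 2 negative reviews.
toy = [(f"good {i}", "POSITIVE") for i in range(8)]
toy += [(f"bad {i}", "NEGATIVE") for i in range(2)]

balanced = balance_by_downsampling(toy)
```

Downsampling discards most of the positive reviews, so the balanced set is much smaller than the original. That is the trade-off for giving both classes equal weight during training.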

Below is a plot showing the composition of the data:

Fig 2: Data composition (Source: Author)

Fig 2 and the percentages above show that the data is highly imbalanced. Could this be a sign of a problem? We shall see.

Let's start by importing the necessary packages and defining the class Review, which we will use to categorize a given review message.
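The original code for this step is not reproduced here, so the following is only a sketch of what such a class could look like, assuming the star-to-sentiment mapping given earlier (1 and 2 negative, 3 neutral, 4 and 5 positive). The `Sentiment` helper and the attribute names are hypothetical, and this version needs no external packages.

```python
class Sentiment:
    """String constants for the three categories (hypothetical helper)."""
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"


class Review:
    """A review message together with its star rating and derived sentiment."""

    def __init__(self, text, score):
        self.text = text      # the reviewBody field
        self.score = score    # the stars field, an integer from 1 to 5
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:   # 1 and 2 stars
            return Sentiment.NEGATIVE
        if self.score == 3:   # 3 stars
            return Sentiment.NEUTRAL
        return Sentiment.POSITIVE  # 4 and 5 stars


review = Review("Wow. Great Service with no issues.", 5)
print(review.sentiment)
```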

Here, we will load the data and use the Review class to categorize each review message as positive, negative, or neutral.
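A loading step along these lines might look as follows. This is a sketch under stated assumptions rather than the author's code: it assumes one dict literal per line, as in the sample data point, and it includes a compact stand-in `Review` dataclass so the block runs on its own. The names `load_reviews`, `sentiment_counts`, and `LABEL_BY_STARS` are hypothetical.

```python
import ast
from collections import Counter
from dataclasses import dataclass

# Star rating -> category, following the mapping described in the text.
LABEL_BY_STARS = {
    1: "NEGATIVE", 2: "NEGATIVE",
    3: "NEUTRAL",
    4: "POSITIVE", 5: "POSITIVE",
}


@dataclass
class Review:
    """Compact stand-in for the Review class, so this block is self-contained."""
    text: str
    score: int

    @property
    def sentiment(self):
        return LABEL_BY_STARS[self.score]


def load_reviews(path):
    """Read one dict literal per line and wrap each record in a Review."""
    reviews = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                        # skip blank lines
                continue
            record = ast.literal_eval(line)     # the lines are not valid JSON
            reviews.append(Review(record["reviewBody"], record["stars"]))
    return reviews


def sentiment_counts(reviews):
    """Count how many reviews fall into each category."""
    return Counter(r.sentiment for r in reviews)
```

With a file of scraped reviews (the filename `reviews.txt` here is an assumption), `sentiment_counts(load_reviews("reviews.txt"))` would give the number of reviews per category, which is a quick way to check the class imbalance described above.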
