End-to-End Machine Learning Project: Review Classification

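This article walks through an end-to-end machine learning project focused on review classification. Using Python and related tools, it covers the whole workflow from data pre-processing to model training and evaluation.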


In this article, we will work through a classification problem: labelling a review as either positive or negative. The reviews used here were left by customers of a service, ABC.

Data Collection and Pre-processing

The data used in this project was scraped from the web, and the data cleaning was done in this notebook.

After scraping, the data was saved to a .txt file. Here is one line of that file, which represents one data point:

{'socialShareUrl': 'https://www.abc.com/reviews/5ed0251025e5d20a88a2057d', 'businessUnitId': '5090eace00006400051ded85', 'businessUnitDisplayName': 'ABC', 'consumerId': '5ed0250fdfdf8632f9ee7ab6', 'consumerName': 'May', 'reviewId': '5ed0251025e5d20a88a2057d', 'reviewHeader': 'Wow - Great Service', 'reviewBody': 'Wow. Great Service with no issues.  Money was available same day in no time.', 'stars': 5}

Each data point is a dictionary. We are only interested in two of its fields: reviewBody and stars.
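As a minimal sketch of how one such line could be turned back into a dictionary, the snippet below uses the standard-library `ast.literal_eval`. This is an assumption about the parsing step, not necessarily what the original notebook does; the sample line is the data point above shortened to the two fields of interest.

```python
import ast

# One line of the scraped .txt file, shortened to the two fields we care about.
line = (
    "{'reviewBody': 'Wow. Great Service with no issues.  "
    "Money was available same day in no time.', 'stars': 5}"
)

# The line uses single quotes, so it is a Python dict literal rather than
# strict JSON. ast.literal_eval parses literals without executing code.
data_point = ast.literal_eval(line)

review_body = data_point["reviewBody"]
stars = data_point["stars"]
print(stars, review_body)
```

`json.loads` would reject this line because JSON requires double-quoted strings, which is why a literal parser is used here.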

We will categorize the reviews by star rating as follows:

1 and 2 - Negative
3 - Neutral
4 and 5 - Positive

At the time of data collection, there were 36456 reviews on the site. The data is highly imbalanced: 94% of the reviews are positive, 4% are negative and 2% are neutral. In this project, we will fit different Sklearn models on the imbalanced data and also on balanced data (obtained by dropping the excess positive reviews so that we have the same number of positive and negative reviews).
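The balancing step described above can be sketched as random downsampling of the positive class. The function name `balance_reviews`, the `(text, sentiment)` pair layout, the label strings and the fixed seed are all hypothetical choices for illustration. The text does not say what happens to neutral reviews, so this sketch simply keeps them unchanged.

```python
import random


def balance_reviews(reviews, seed=0):
    """Downsample positive reviews to match the number of negative ones.

    `reviews` is a list of (text, sentiment) pairs (a hypothetical layout).
    """
    positive = [r for r in reviews if r[1] == "POSITIVE"]
    negative = [r for r in reviews if r[1] == "NEGATIVE"]
    # Assumption: anything that is neither positive nor negative is kept as-is.
    other = [r for r in reviews if r[1] not in ("POSITIVE", "NEGATIVE")]

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept_positive = rng.sample(positive, min(len(positive), len(negative)))

    balanced = kept_positive + negative + other
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring the 94% / 4% / 2% split reported above.
toy = (
    [("good", "POSITIVE")] * 94
    + [("bad", "NEGATIVE")] * 4
    + [("ok", "NEUTRAL")] * 2
)
balanced = balance_reviews(toy)
print(len(balanced))
```

Downsampling discards most of the positive reviews, so the balanced set is much smaller than the original one; that trade-off is worth keeping in mind when comparing models trained on the two versions.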

The plot below shows the composition of the data:

Fig 2: Data composition (Source: Author)

Both Fig 2 and the percentages above show that the data is highly imbalanced. Could this be a sign of a problem? We shall see.

Let's start by importing the necessary packages and defining the class Review, which we will use to categorize a given review message.
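The original code for this step is not reproduced here, so the following is only a sketch of what such a Review class might look like, based on the star-rating rule given earlier (1 and 2 negative, 3 neutral, 4 and 5 positive). The `Sentiment` helper, the attribute names and the label strings are assumptions made for illustration.

```python
class Sentiment:
    """String labels for the three review categories (hypothetical names)."""

    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"


class Review:
    """Holds one review message and derives its sentiment from the stars."""

    def __init__(self, text, stars):
        self.text = text
        self.stars = stars
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # 1 and 2 stars: negative, 3 stars: neutral, 4 and 5 stars: positive.
        if self.stars <= 2:
            return Sentiment.NEGATIVE
        if self.stars == 3:
            return Sentiment.NEUTRAL
        return Sentiment.POSITIVE


review = Review("Wow. Great Service with no issues.", 5)
print(review.sentiment)
```

Computing the label once in the constructor keeps later filtering simple, since each object already carries its category.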

Next, we load the data and use the Review class to categorize each review message as positive, negative or neutral.
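A possible loading routine is sketched below under the assumption that the file holds one Python-style dictionary per line, as in the sample data point. The function name `load_reviews` and the temporary file are hypothetical, and the second sample line is made up purely to exercise the code. Each resulting (text, stars) pair could then be handed to a Review class for categorization.

```python
import ast
import os
import tempfile


def load_reviews(path):
    """Read one dict literal per line and return (reviewBody, stars) pairs."""
    reviews = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = ast.literal_eval(line)
            reviews.append((record["reviewBody"], record["stars"]))
    return reviews


# Demo file: the first line is shortened from the sample data point,
# the second line is invented for illustration only.
sample_lines = [
    "{'reviewBody': 'Wow. Great Service with no issues.', 'stars': 5}",
    "{'reviewBody': 'Made-up negative example.', 'stars': 1}",
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(sample_lines) + "\n")
    path = tmp.name

reviews = load_reviews(path)
os.remove(path)  # clean up the temporary demo file
print(reviews)
```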
