Sentiment Analysis on Amazon Food Reviews: From EDA to Deployment


Amazon.com, Inc. is an American multinational technology company based in Seattle, Washington. Amazon focuses on e-commerce, cloud computing, digital streaming, and artificial intelligence. Because its e-commerce platform is so large, its review system can be abused by sellers or customers who write fake reviews in exchange for incentives. It is expensive to check every review manually and label its sentiment, so a better approach is to rely on machine learning/deep learning models. In this case study, we will focus on the Amazon fine food reviews data set, which is available on Kaggle.


Note: This article is not a code walkthrough for the problem. Rather, I will explain the approach I used. You can look at my code here.


About the Data Set

The data set consists of reviews of fine foods from Amazon spanning more than 10 years, with 568,454 reviews up to October 2012. Reviews include a rating, product and user information, and a plain-text review. The data set also includes reviews from all other Amazon categories.


We have the following columns:


  1. Product Id: Unique identifier for the product
  2. User Id: Unique identifier for the user
  3. Profile Name: Profile name of the user
  4. Helpfulness Numerator: Number of users who found the review helpful
  5. Helpfulness Denominator: Number of users who indicated whether they found the review helpful or not
  6. Score: Rating between 1 and 5
  7. Time: Timestamp of the review
  8. Summary: Summary of the review
  9. Text: Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).


How to determine if a review is positive or negative?


We could use the Score/Rating. A rating of 4 or 5 can be considered a positive review, and a rating of 1 or 2 a negative one. Reviews with a rating of 3 are considered neutral and are ignored in our analysis. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
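
As a rough sketch, this mapping can be done with a few pandas operations (the file and column names follow the Kaggle data set; the DataFrame name is an assumption):

```python
import pandas as pd

# Load the Kaggle "Amazon Fine Food Reviews" file and map scores to sentiment labels.
df = pd.read_csv("Reviews.csv")
df = df[df["Score"] != 3]                          # drop neutral (rating 3) reviews
df["Sentiment"] = (df["Score"] > 3).astype(int)    # 1 = positive (4 or 5), 0 = negative (1 or 2)
```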

Exploratory Data Analysis

Basic Preprocessing

As a basic data-cleaning step, we first checked for missing values. Fortunately, there are none. Next, we checked for duplicate entries. On analysis, we found that the same review appears for different products, posted by the same user at the same time, which practically does not make sense. So we keep only the first occurrence and remove the other duplicates.
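
A minimal sketch of that deduplication with pandas (the choice of columns to match on is an assumption based on the description above):

```python
# Keep only the first occurrence of reviews posted by the same user,
# with the same text, at the same time (across different products).
df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
```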

Example of a duplicate entry:

[Figure: duplicate entries in the data]

After this step, about 69% of the original data points remain.

Analyzing the Review Trend

[Figure: review trend over time]
  • From 2001 to 2006, the number of reviews is fairly constant, but after that it begins to increase, and among those the share of 5-star reviews is particularly high. This may be due to unverified accounts boosting sellers inappropriately with fake reviews, or simply to growth in the number of user accounts.

Analyzing the Target Variable

As discussed earlier, we assign all data points with a rating above 3 to the positive class and those below 3 to the negative class; the remaining points (rating 3) are dropped.

[Figure: target class distribution]

Observation: the data set is clearly imbalanced for classification, so we cannot choose accuracy as a metric. Instead, we use AUC (area under the ROC curve).

Why is accuracy not suitable for imbalanced data sets?

Consider a scenario with an imbalanced data set, for example credit card fraud detection where 98% of the points are non-fraud (0) and the remaining 2% are fraud (1). Even if we predict every point as non-fraud, we still get 98% accuracy, yet the model is useless for catching fraud. That is why we cannot use accuracy as the metric here.

What is AUC-ROC?

AUC is the area under the ROC curve. It indicates how well the model can distinguish between classes: the higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The ROC curve plots TPR against FPR, with TPR on the y-axis and FPR on the x-axis.

Analyzing User Behavior

[Figure: user behavior]
  • After analyzing the number of products each user bought, we found that most users bought only a single product.
  • Another thing to note is that the helpfulness denominator should always be greater than or equal to the numerator, since the numerator is the number of users who found the review helpful and the denominator is the number of users who indicated whether they found it helpful or not. Some data points violate this, so we remove them; a sketch of this filter follows the list.
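
A minimal pandas sketch of that filter (column names follow the data set):

```python
# Keep only rows where the denominator is at least as large as the numerator.
df = df[df["HelpfulnessNumerator"] <= df["HelpfulnessDenominator"]]
```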

After our preprocessing, the data is reduced from 568,454 rows to 364,162, i.e., about 64% of the data remains. Now let's get to the important part: processing the review text.

Preprocessing Text Data

Text data requires some preprocessing before we move on to analysis and building the prediction model. In the preprocessing phase, we do the following, in order (a code sketch of these steps follows the list):

  • Begin by removing HTML tags.
  • Remove punctuation and a limited set of special characters such as , . # !
  • Check that the word is made up of English letters and is not alphanumeric.
  • Convert the word to lowercase.
  • Finally, remove stopwords.
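
A minimal sketch of these cleaning steps (the regular expression and stopword handling are illustrative, not the exact code used):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords   # may require nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_review(text):
    text = BeautifulSoup(text, "html.parser").get_text()   # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)                # drop punctuation / special characters
    words = [w.lower() for w in text.split()
             if w.isalpha() and w.lower() not in stop_words]
    return " ".join(words)

df["CleanedText"] = df["Text"].apply(clean_review)
```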

Train-Test Split

Once preprocessing is done, we split the data into train and test sets. We split after sorting the data by time, since changes over time can influence the reviews.
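
A sketch of the time-based split (the 80/20 ratio is an assumption):

```python
# Sort by timestamp so that the test set contains the most recent reviews.
df = df.sort_values("Time")
split = int(0.8 * len(df))
X_train, X_test = df["CleanedText"].iloc[:split], df["CleanedText"].iloc[split:]
y_train, y_test = df["Sentiment"].iloc[:split], df["Sentiment"].iloc[split:]
```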

Vectorizing Text Data

After that, I applied bag-of-words (BoW) vectorization, TF-IDF vectorization, average word2vec, and TF-IDF-weighted word2vec to featurize the text, and saved the results as separate vectors. Since vectorizing a large amount of data is expensive, I computed the features once and stored them so that I would not have to recompute them again and again.

Note: I used a unigram approach for bag-of-words and TF-IDF. In the case of word2vec, I trained the model myself rather than using pre-trained weights. You can always try an n-gram approach for BoW/TF-IDF and pre-trained embeddings for word2vec.

You should always fit your vectorizer on the train data and only transform the test data. Do not fit the vectorizer on the test data, as that causes data leakage.
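
A minimal sketch of that fit/transform pattern for BoW and TF-IDF with scikit-learn (the min_df value is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow = CountVectorizer(ngram_range=(1, 1), min_df=10)   # unigram bag-of-words
X_train_bow = bow.fit_transform(X_train)               # fit on train only
X_test_bow = bow.transform(X_test)                     # reuse the train vocabulary

tfidf = TfidfVectorizer(ngram_range=(1, 1), min_df=10)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```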

t-SNE Visualization

t-SNE, which stands for t-distributed stochastic neighbor embedding, is one of the most popular dimensionality-reduction techniques. It is mainly used for visualizing data in lower dimensions. Before getting into machine learning models, I tried to visualize the data in a lower dimension.

[Figure: t-SNE visualization]

Steps I followed for t-SNE:

  • Keeping the perplexity constant, I ran t-SNE for different numbers of iterations and found the most stable iteration count.
  • Keeping that iteration count constant, I then ran t-SNE at different perplexities to get a better result.
  • Once I got a stable result, I ran t-SNE again with the same parameters.

But I found that t-SNE is not able to separate the points well in the lower-dimensional space.

Note: I ran t-SNE on a random sample of 20,000 points (with equal class distribution). Results may improve with a larger number of data points.
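
A minimal scikit-learn sketch of that procedure (the sample size, perplexity, and iteration count are illustrative; recent scikit-learn versions name the iteration argument max_iter instead of n_iter):

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Run t-SNE on a small sample of the TF-IDF features (densified for t-SNE).
sample = X_train_tfidf[:20000].toarray()
embedded = TSNE(n_components=2, perplexity=50, n_iter=2000, random_state=42).fit_transform(sample)

plt.scatter(embedded[:, 0], embedded[:, 1], c=y_train.iloc[:20000], s=2, cmap="coolwarm")
plt.title("t-SNE of TF-IDF features")
plt.show()
```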

Machine Learning Approach

Naive Bayes

It is always better in machine learning to have a baseline model to evaluate against, so we begin by creating a naive Bayes model. For naive Bayes we split the data into train, CV, and test sets, since we use manual cross-validation. Finally, we tried multinomial naive Bayes on the BoW and TF-IDF features; after hyperparameter tuning we end up with the following results.
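
A sketch of that tuning over the smoothing parameter alpha, assuming a separate CV split (X_cv_bow, y_cv) as described above; the grid values are illustrative:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

best_alpha, best_auc = None, 0.0
for alpha in [1e-4, 1e-3, 1e-2, 0.1, 1, 10]:
    nb = MultinomialNB(alpha=alpha).fit(X_train_bow, y_train)
    auc = roc_auc_score(y_cv, nb.predict_proba(X_cv_bow)[:, 1])   # evaluate on the CV split
    if auc > best_auc:
        best_alpha, best_auc = alpha, auc
print(best_alpha, best_auc)
```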

[Figure: naive Bayes results]

We can see that in both cases the model is slightly overfitting. Don't worry, we will try out other algorithms as well.

Logistic Regression

As the algorithm is fast, it was easy to train on a 12 GB RAM machine. In this case I split the data only into train and test, since GridSearchCV does internal cross-validation. Finally, I did hyperparameter tuning on the BoW, TF-IDF, average word2vec, and TF-IDF word2vec features.
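
A sketch of the grid search on the BoW features (the parameter grid and solver settings are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train_bow, y_train)
print(grid.best_params_, grid.best_score_)
```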

[Figure: logistic regression results]

Even though the BoW and TF-IDF features gave a higher AUC on the test data, those models are slightly overfitting. The average word2vec features give a more generalized model, with an AUC of 91.09 on the test data.

[Figure: performance metrics for logistic regression on avg-word2vec]

Support Vector Machines

Next, I tried the SVM algorithm, both linear SVM and RBF-kernel SVM. SVMs perform well with high-dimensional data. Linear SVM with the average word2vec features resulted in a more generalized model.
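
A sketch of the linear SVM; on data of this size a common choice is SGDClassifier with hinge loss (the feature matrix name X_train_w2v and the alpha grid are assumptions):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Linear SVM trained with stochastic gradient descent on the average-word2vec features.
svm = SGDClassifier(loss="hinge", penalty="l2", random_state=42)
grid = GridSearchCV(svm, {"alpha": [1e-5, 1e-4, 1e-3, 1e-2]}, scoring="roc_auc", cv=5)
grid.fit(X_train_w2v, y_train)
print(grid.best_params_, grid.best_score_)
```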

[Figure: SVM results]
[Figure: performance metrics for linear SVM on avg-word2vec]

Decision Trees

Even though we already know that this data can easily overfit decision trees, I tried them just to see how well tree-based models perform.

After hyperparameter tuning, I ended up with the following results. We can see that the models overfit and that the decision trees perform worse than logistic regression, naive Bayes, and SVM. We can overcome this to some extent with post-pruning techniques such as cost-complexity pruning, or we can use ensemble models on top of the trees. Here I decided to use ensemble models like random forest and XGBoost and check their performance.
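
As an aside, cost-complexity pruning is available directly in scikit-learn through the ccp_alpha parameter; a minimal sketch (the alpha value is illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

# Larger ccp_alpha prunes the tree more aggressively, trading variance for bias.
pruned_tree = DecisionTreeClassifier(ccp_alpha=1e-3, random_state=42)
pruned_tree.fit(X_train_bow, y_train)
```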

[Figure: decision tree results]

Random Forest

With random forest we can see that the test AUC increased, but most of the models are still slightly overfitting.

[Figure: random forest results]

XGBoost

XGBoost performed similarly to the random forest: most of the models were overfitting.

[Figure: XGBoost results]

After trying several machine learning approaches, we can see that logistic regression and linear SVM on the average word2vec features give the most generalized models.

Do not Stop Here!!!


What about sequence models? They have proven effective for handling text data. Next, we will try to solve the problem with a deep learning approach and see whether the results improve.

Deep Learning Approach

Basically the text preprocessing is a little different if we are using sequence models to solve this problem.


  • The initial preprocessing is the same as before: we remove punctuation, special characters, stopwords, etc., and convert each word to lowercase.
  • Next, instead of vectorizing the data directly, we take a different approach. We first convert the text into sequences by encoding it: each unique word in the corpus is assigned an integer, and that integer is repeated wherever the word repeats.

For example, the sequence for "it is really tasty food and it is awesome" might be "25, 12, 20, 50, 11, 17, 25, 12, 109", and the sequence for "it is bad food" would be "25, 12, 78, 11".

  • Finally, we pad each sequence to the same length.

[Figure: padding]

After plotting the sequence lengths, I found that most reviews have a sequence length of 225 or less, so I set the maximum sequence length to 225. If a sequence is longer than 225, we keep the last 225 numbers; if it is shorter, we pad the beginning with zeros.
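
A minimal Keras sketch of the encoding and padding ('pre' padding/truncation keeps the last 225 tokens and fills the beginning with zeros):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 225
tokenizer = Tokenizer()                            # assigns an integer to each unique word
tokenizer.fit_on_texts(X_train)
train_seq = tokenizer.texts_to_sequences(X_train)
test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(train_seq, maxlen=MAX_LEN, padding="pre", truncating="pre")
X_test_pad = pad_sequences(test_seq, maxlen=MAX_LEN, padding="pre", truncating="pre")
```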

Our model consists of an embedding layer initialized with pre-trained weights, LSTM layers, and multiple dense layers. We tried different combinations of LSTM and dense layers with different dropout rates. We used pre-trained GloVe vectors for the embedding, which I would say played an important role in improving our AUC score. In the end, we got the best results with 2 LSTM layers, 2 dense layers, and a dropout rate of 0.2. Our architecture looks as follows:
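
A sketch of one such architecture in Keras (the layer sizes are illustrative, and embedding_matrix is assumed to be built beforehand from the GloVe vectors for our vocabulary; the exact architecture used is shown in the figure below):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.initializers import Constant
from tensorflow.keras.metrics import AUC

vocab_size = len(tokenizer.word_index) + 1
model = Sequential([
    # embedding_matrix: (vocab_size, 300) array of GloVe vectors, assumed built beforehand.
    Embedding(vocab_size, 300,
              embeddings_initializer=Constant(embedding_matrix), trainable=False),
    LSTM(128, return_sequences=True),
    LSTM(64),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[AUC()])
```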

[Figure: model architecture]

Our model converged easily by the second epoch. We got a validation AUC of about 94.8%, which is the highest AUC we obtained for a generalized model.

[Figure: training with 1 LSTM layer]
[Figure: training with 2 LSTM layers]

Some of our experimentation results are as follows:


[Figure: LSTM results]

Thus, I trained a model successfully. Now comes an interesting question: how do we actually use it? Don't worry, I will also explain how I deployed the model using Flask.

Model Deployment Using Flask

This is the most exciting part, and the one everyone misses out on: how do we deploy the model we just created? I chose Flask because it is a Python-based micro web framework, and coming from a non-web-developer background I found it comparatively easy to use.
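
A minimal sketch of what such a Flask app might look like (the endpoint name, saved-artifact paths, and JSON format are assumptions, not the exact deployed code):

```python
import pickle
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
model = load_model("sentiment_lstm.h5")                  # assumed path to the trained model
tokenizer = pickle.load(open("tokenizer.pkl", "rb"))     # assumed saved tokenizer
MAX_LEN = 225

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=MAX_LEN)
    prob = float(model.predict(seq)[0][0])
    label = "positive" if prob >= 0.5 else "negative"
    return jsonify({"class": label, "probability": prob})

if __name__ == "__main__":
    app.run()
```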

Now we will test the application by predicting the sentiment of the text "food has good taste". We test it by creating a request as follows:

[Figure: test code]
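
A minimal sketch of such a request, assuming the app is running locally and exposes the /predict endpoint sketched above:

```python
import requests

response = requests.post("http://127.0.0.1:5000/predict",
                         json={"text": "food has good taste"})
print(response.json())   # e.g. {"class": "positive", "probability": 0.94}
```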

The application outputs both the probability of the given text belonging to the predicted class and the class name. Here our text is predicted to be positive with a probability of about 94%.

[Figure: prediction output]

You can play with the full code from my GitHub project.

Scope for improvement:

  • There is still a lot of room for improvement in the present model. I never used the full data set to train the machine learning models; you can always try that, and it may help overcome the overfitting of the ML models.
  • I only used pre-trained word embeddings for the deep learning model, not for the machine learning models. You can try using pre-trained embeddings such as GloVe or word2vec with the machine learning models as well.

Translated from: https://towardsdatascience.com/sentiment-analysis-on-amazon-food-reviews-from-eda-to-deployment-f985c417b0c
