Logistic Regression with Amazon Food Reviews


Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. There are many classification algorithms available, but logistic regression is a common and useful regression method for solving the binary classification problem.

Contents

  1. Geometric Intuition of Logistic Regression
  2. Regularization Techniques to Avoid Overfitting and Underfitting
  3. Probabilistic Interpretation of Logistic Regression
  4. Loss Minimization Interpretation of Logistic Regression
  5. Implementation of the Logistic Regression Algorithm with Amazon Food Reviews

Logistic Regression is one of the simplest and most commonly used machine learning algorithms for two-class classification. It is easy to implement and can be used as the baseline for any binary classification problem.

Its fundamental concepts are also constructive in deep learning. Logistic regression describes and estimates the relationship between one dependent binary variable and one or more independent variables.

Geometric Intuition of Logistic Regression

ASSUMPTION: The biggest assumption of Logistic Regression is that our data are linearly separable or almost linearly separable.


Picture a separating hyperplane: W ⇒ the normal to the plane, Pi (𝜋) ⇒ the plane.

If we take any of the +ve class points and compute its distance to the plane, di = wT*xi/||w|| (let the norm ||w|| be 1). Since w and xi are on the same side of the decision boundary, the distance will be +ve. Now compute dj = wT*xj; since xj is on the opposite side of w, the distance will be -ve. In other words, points lying in the same direction as w are all +ve points, and points lying in the opposite direction of w are -ve points.

Now, we can easily classify the -ve and +ve points: if wT*xi > 0 then y = +1, and if wT*xi < 0 then y = -1. While doing this we may make some mistakes, but that is okay, because in the real world we will never get data that are perfectly separable.

Observations:

Keeping this geometry in mind, observe all the cases listed below:

  • If Yi = +1, the point is a positive data point, and if wT*xi > 0 the classifier (a mathematical function, implemented by a classification algorithm, that maps input data to a category) also says it is positive. So if Yi*wT*xi > 0, the point is correctly classified, because multiplying two positive numbers is always greater than 0.
  • If Yi = -1, the point is a -ve data point, and if wT*xi < 0 the classifier also says it is negative. Again Yi*wT*xi > 0, and the point is correctly classified, because multiplying two negative numbers is always greater than zero. So, for both positive and negative points, Yi*wT*xi > 0 implies the model is correctly classifying the point xi.
  • If Yi = +1 and wT*xi < 0, i.e. Yi is a positive point but the classifier says it is negative, then we get a negative value. The actual class label is positive but the point is classified as negative, so it is a misclassified point.
  • If Yi = -1 and wT*xi > 0, the actual class label is negative but the point is classified as positive, so it is again a misclassified point (Yi*wT*xi < 0).

From the above observations, we want our classifier to minimize the misclassification error, i.e. we want Yi*wT*xi to be greater than 0. Here xi and Yi are fixed because they come from the dataset. As we change w and b, the sum of these terms changes, and we want to find the w and b that maximize the sum given below.

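Written out in the article's notation, that objective is (a sketch reconstructed from the description above):

(w*, b*) = argmax over w, b of ∑i Yi*(wT*xi + b)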

Squashing (or) Sigmoid Function

The sigmoid function is a differentiable real function that is defined for all real inputs and has a non-negative derivative at each point. It is a monotonic function that squashes values between 0 and 1. We will look at a very simple example of how the sum of signed distances (Yi*wT*xi) can be thrown off by erroneous (or outlier) points, and why we need another formulation that is less impacted by an outlier.

Suppose we have one hyperplane for which the distance (d) from every point to the decision boundary is 1 and every point is on the correct side, except for a single outlier that sits deep on the positive side of the decision boundary at distance 100 even though its label is negative. If we compute the sum of signed distances, it works out to -90, even though only this one point is misclassified.

Now take another hyperplane for which every point is again at distance 1 from the decision boundary, but 5 points are misclassified (negative points sitting on the positive side of the decision boundary). Here the sum of signed distances works out to +1. Remember that we wanted to maximize the sum of signed distances, which is 1 in this case, so maximizing it would make us pick the hyperplane with 5 mistakes instead of the one with a single mistake. So, if we choose the sum of signed distances, then in the presence of an outlier our prediction may not be correct and we end up with the worse model.


So, to avoid this problem we need another function that is more robust than maximizing the sum of signed distances. The function we use here is called the sigmoid function.

So instead of using the signed distance directly, we squash it: if the signed distance is small, use it as it is; if the signed distance is large, map it to a small bounded value. We want a function that increases roughly linearly when its value is small and tapers off when its value becomes large. One such function is the SIGMOID FUNCTION.

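For reference, the squashing function the article relies on is the standard sigmoid:

σ(z) = 1 / (1 + exp(-z)), which maps any real z into the interval (0, 1).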

Maximizing some function f(x) is the same as minimizing that function with a -ve sign, i.e. argmax f(x) = argmin -f(x). If we also take the log, the final formulation becomes the one below, where Yi = +1 or -1.

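A sketch of that final formulation, consistent with the loss term described in the regularization section below:

w* = argmin over w of ∑i log(1 + exp(-Yi*wT*xi))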

Regularization Techniques to Avoid Overfitting and Underfitting

What is Regularization?


Regularization is a technique to discourage the complexity of the model. It does this by penalizing the loss function, which helps to solve the overfitting problem.

wT means the transpose of W; W is the normal to the hyperplane we are dealing with and is represented as a row vector.

The optimization problem then reads w* = argmin over w of ∑i log(1 + exp(-zi)), where the quantity zi = Yi*wT*xi is also known as the signed distance.

If we pick W such that all the training points are correctly classified and every zi tends to +infinity, then we get the optimal w*.


If all training points are correctly classified, then we have the problem of overfitting (meaning the model does a perfect job on the training set but performs very badly on the test set, i.e. the error on train data is almost zero but the error on test data is very high). Likewise, each zi tending to infinity leads to the same problem. To overcome it we use regularization techniques.

L2 Regularization (or) Ridge Regression:

The L2-norm loss function is also known as the least-squares error (LSE).


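Reconstructed from the description below, the L2-regularized objective has roughly this form:

w* = argmin over w of ∑i log(1 + exp(-zi)) + λ ∑j (wj)²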

∑ (wj)² is the regularization term and ∑ [log(1+exp(-zi))] is the loss term. λ is a hyperparameter.

We add the regularization term (i.e. the squared magnitude of the weights) to the loss term to make sure that the model does not undergo an overfitting problem.

Here we will minimize both the loss term and the regularization term. If the hyperparameter λ is 0, there is no regularization term and the model will overfit; if λ is very large, the regularization term gets too much weight, which leads to underfitting.

We can find the best hyperparameter by using cross-validation or grid-search cross-validation.

L1 Regularization (or) Lasso Regression:

The L1-norm loss function is also known as the least absolute deviations (LAD) or least absolute errors (LAE).

In L1 regularization we use the L1 norm instead of the L2 norm.

With the L1 norm, we shrink the parameters towards zero. Input features whose weights are driven all the way to zero make the solution sparse: in a sparse solution, the majority of the input features have zero weights and very few features have non-zero weights.

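By analogy with the L2 form above, a sketch of the L1-regularized objective:

w* = argmin over w of ∑i log(1 + exp(-zi)) + λ ∑j |wj|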

Here the L1-norm term also keeps the model from overfitting. The additional advantage of using L1 regularization is sparsity.

Sparsity:


A vector (w in this case) is said to be sparse when most of its cells (the wi's in this case) are zero. So w* is said to be sparse when most of the wi's are zeros.

If we use L1 regularization in Logistic Regression, all the less important features' weights become zero. If we use L2 regularization, the wi values become small but not necessarily zero.

Below is the idea of the code used to check how sparsity increases as the hyperparameter value changes: we check how sparsity increases as we increase lambda (or, equivalently, decrease C, since C = 1/λ) when the L1 regularizer is used. In the code, the hyperparameter C is the inverse of the regularization strength; it must be a positive float.
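A minimal sketch of that check, assuming bag-of-words features X_train and labels y_train have already been built (the author's original code was shown as an image and is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Decreasing C means increasing lambda, i.e. stronger L1 regularization.
for C in [10, 1, 0.1, 0.01, 0.001]:
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    clf.fit(X_train, y_train)
    zero_weights = np.sum(clf.coef_ == 0)      # sparsity: number of zero weights
    print("C = %-6s -> zero weights: %d / %d" % (C, zero_weights, clf.coef_.size))
```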

Elastic-Net:

Elastic net regularization is a combination of both L1 and L2 regularization.


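Reconstructed from the description above, the elastic-net objective combines both penalties (a sketch):

w* = argmin over w of ∑i log(1 + exp(-zi)) + λ1 ∑j |wj| + λ2 ∑j (wj)²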

λ1 and λ2 are hyperparameters.


Probabilistic Interpretation of Logistic Regression

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data.


  • Y is boolean, governed by a Bernoulli distribution, with parameter π = P(Y = 1).
  • X = ⟨X1, …, Xn⟩, where each Xi is a continuous random variable.
  • For each Xi, P(Xi | Y = yk) is a Gaussian distribution of the form N(µik, σi).
  • For all i and j ≠ i, Xi and Xj are conditionally independent given Y.

Note that here we are assuming the standard deviations σi vary from attribute to attribute but do not depend on Y. The probabilistic interpretation of Logistic Regression is then given by the parametric form below.

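Under these assumptions the parametric form works out to the familiar sigmoid of a linear function (a sketch; w0 and the wi are the parameters estimated from the training data):

P(Y = 1 | X) = 1 / (1 + exp(-(w0 + ∑i wi*Xi))), and P(Y = 0 | X) = 1 - P(Y = 1 | X).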

In order to maximize the log-likelihood function, or minimize the loss, to find the coefficients, we need to compute partial derivatives; any problem where we have to maximize or minimize something is an optimization problem. If you want the derivation of the mathematical equations, you can visit here.

Loss Minimization Interpretation of Logistic Regression

Binary classification involves the 0/1 loss (non-convex), and when the data are not perfectly separable we want to minimize the number of errors, i.e. misclassified points (Yi(wT*xi + b) < 0). The problem then becomes finding the optimal w and b that minimize this loss. This is again an optimization problem, in which we solve the following equation.

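A sketch of that equation, with L denoting the 0/1 loss described next and f(xi) = wT*xi + b:

(w*, b*) = argmin over w, b of ∑i L(Yi, f(xi))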

Here L is the 0/1 loss function: if Yi(wT*xi + b) < 0 it gives 1 (a misclassified point), else 0 (a correctly classified point).


So, in many practical methods, we replace the non-convex function (such as the 0/1 loss) with a convex function, because optimizing a non-convex function is very hard: the algorithm may get stuck in a local minimum that does not correspond to the actual minimum value of the objective function L(yi, f(xi)), where f(xi) = wT*xi + b.


The basic idea is to work with a smooth (differentiable) function that is an approximation to the 0-1 loss. When we use the logistic loss (log-loss) as an approximation of the 0-1 loss to solve a classification problem, the method is called logistic regression. There are many approximations of the 0-1 loss, and they are used by different algorithms to solve classification problems.


Approximation of 0–1 Loss

When y ∈ {1, -1}, where 1 is the positive class and -1 the negative class, the logistic loss function (which we will not focus on here) is defined as follows.

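(A sketch of that form, consistent with the formulation used earlier in the article: L(y, f(x)) = log(1 + exp(-y*f(x))).)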

And when y ∈ {0, 1}, then the logistic loss function is defined as follows:


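Reconstructed from the description that follows (a sketch):

L(y, P) = -[ y*log(P) + (1 - y)*log(1 - P) ]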

Here, for each row i in the dataset, y is an outcome that can be either 0 or 1, and P is the predicted probability obtained by applying the logistic regression equation (P = e^x / (1 + e^x), where x = wT*xi + b).

From the equation: when y = 1 the loss becomes -log(P), and as P approaches 1 the loss approaches 0. Similarly, when y = 0 the loss becomes -log(1 - P), and as P approaches 0 the loss again approaches 0. In effect, we end up taking the log of the predicted probability assigned to the actual class label.

When the response variable y is 1, the predicted probability should be as high as possible, and when it is 0 the predicted probability should be as low as possible; this minimizes the total log loss.

Column standardization: Logistic Regression also uses distance as a measure, so it is 'mandatory' to perform feature standardization (or column standardization) before training on our dataset.

Feature importance: If all the features are independent of each other, take the absolute values of the weights; the features with large absolute weights are the more important features.

If the features are not independent of each other, we cannot use the weights as feature importance; instead we use forward (or backward) feature selection to find the best features.

If you don't know about feature selection methods, don't worry: I have written a blog about them; you can visit my previous blog here.

Perturbation technique: We don't want collinearity (or multicollinearity) in our data.

To check for multicollinearity in the given data we use the perturbation test as follows (a code sketch of the idea follows the steps):

  • First, find the weights of the model.
  • Then add some small noise ε to each feature, i.e. replace each Xi with Xi + ε.
  • Find the weights again using another model trained on the perturbed data.
  • If the weights are significantly different before and after the perturbation test, we conclude that collinearity is present in the given data.
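A minimal sketch of that test, assuming a (dense) feature matrix X_train and labels y_train; the variable names are illustrative, not the author's original code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

clf_before = LogisticRegression()
clf_before.fit(X_train, y_train)                 # weights on the original data

eps = 1e-3 * np.random.randn(*X_train.shape)     # small noise epsilon
clf_after = LogisticRegression()
clf_after.fit(X_train + eps, y_train)            # weights on the perturbed data

# A large change in the weights suggests collinearity in the features.
print("max absolute change in weights:",
      np.max(np.abs(clf_before.coef_ - clf_after.coef_)))
```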

Assumption: The biggest assumption of Logistic Regression is that our data are linearly separable or almost linearly separable.

Decision surface: The decision surface of Logistic Regression is a line (or hyperplane).

Outlier impact: Outliers have less impact because of the sigmoid function. To reduce their impact further, compute the weights of the features and remove points that are very far away from the hyperplane.

Multiclass classification: For multiclass classification we can typically use the one-vs-rest method.

Similarity Matrix: Normal Logistic Regression methods can’t handle similarity matrices.


Logistic Regression Algorithm with Amazon Food Reviews Analysis

Let's apply the Logistic Regression algorithm to the real-world Amazon Fine Food Reviews dataset from Kaggle.

First, we want to know: what is the Amazon Fine Food Reviews dataset?

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. We also have reviews from all other Amazon categories.

Amazon reviews are often the most publicly visible reviews of consumer products. As a frequent Amazon user, I was interested in examining the structure of a large database of Amazon reviews and visualizing this information so as to be a smarter consumer and reviewer.


Source: https://www.kaggle.com/snap/amazon-fine-food-reviews


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.


Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 — Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:


  1. Id
  2. ProductId — unique identifier for the product
  3. UserId — unique identifier for the user
  4. ProfileName
  5. HelpfulnessNumerator — number of users who found the review helpful
  6. HelpfulnessDenominator — number of users who indicated whether they found the review helpful or not
  7. Score — a rating between 1 and 5
  8. Time — timestamp for the review
  9. Summary — brief summary of the review
  10. Text — text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

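A small sketch of that labelling step, assuming the reviews are loaded in a pandas DataFrame called df; treating reviews with a score of 3 as neutral and dropping them is an assumption here, not something stated above:

```python
import pandas as pd

# Keep only clearly positive (4, 5) and clearly negative (1, 2) reviews.
df = df[df['Score'] != 3]
# Binary class label: 1 = positive (score 4 or 5), 0 = negative (score 1 or 2).
df['label'] = (df['Score'] > 3).astype(int)
```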

Data Preprocessing

Data preprocessing is a technique used to convert raw data into a clean dataset. In other words, whenever data are gathered from different sources they are collected in a raw format, which is not feasible for analysis.


To see a complete overview of the Amazon Food Reviews dataset and its featurization, visit my previous blog link here.

Train-Test Split

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.


If you have one dataset, you’ll need to split it by using the Sklearn train_test_split function first.


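A minimal sketch of that split (the original code was shown as an image; the column names df['Text'] and df['label'] are carried over from the labelling sketch above and are illustrative):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the reviews for testing.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['Text'], df['label'], test_size=0.3, random_state=42)
```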

Text Featurization Using Bag of Words

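A sketch of the bag-of-words featurization with scikit-learn's CountVectorizer, fitted on the training text only so that the test-set vocabulary does not leak into training:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train_text)   # learn vocabulary on train
X_test_bow = vectorizer.transform(X_test_text)         # reuse it on test
```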

Hyperparameter Tuning

We want to choose the best value of the regularization hyperparameter for better performance of the model, and we choose it using grid-search cross-validation.

We have already defined a Grid_search function; when we call it, it will give the result.

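The author's Grid_search helper was shown as an image; a minimal sketch of the same idea with scikit-learn's GridSearchCV (the C grid is illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}     # C = 1 / lambda
grid = GridSearchCV(LogisticRegression(penalty='l2', solver='liblinear'),
                    param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train_bow, y_train)
print("best C:", grid.best_params_['C'], "cross-validation AUC:", grid.best_score_)
```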

After we find the best hyperparameter using grid-search CV, we want to check the performance on the test data. In this case study we use the AUC as the performance measure.

We have already defined a function for testing on the test data; when we call it, it will give the result.

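A sketch of that evaluation step: refit with the best C found above and score the held-out test set with AUC:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

best_clf = LogisticRegression(penalty='l2', C=grid.best_params_['C'],
                              solver='liblinear')
best_clf.fit(X_train_bow, y_train)
test_probs = best_clf.predict_proba(X_test_bow)[:, 1]   # predicted P(positive)
print("Test AUC:", roc_auc_score(y_test, test_probs))
```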

Performance Metrics

Performance metrics are used to measure the behavior, activities, and performance of a business. This should be in the form of data that measures required data within a range, allowing a basis to be formed supporting the achievement of overall business goals.


To know detailed information about the performance metrics used in machine learning, please visit my previous blog link here.

We have already defined a function for the performance metrics; when we call it, it will give the result.


Calculating Sparsity of the Weight Vector Obtained Using L1 Regularization on BoW


Similarly, we built Logistic Regression models with TFIDF, AvgWord2Vec, and TFIDF-weighted AvgWord2Vec features, with both L1 and L2 regularization. To understand the full code, please visit my GitHub link.

Conclusions

To write the conclusions into a table we used the Python library PrettyTable.

PrettyTable is a simple Python library designed to make it quick and easy to represent tabular data in visually appealing tables.

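A sketch of how such a summary table can be built with PrettyTable. The only number quoted in this article is the TFIDF + L2 result below; the other cells are left as placeholders rather than invented values:

```python
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Vectorizer", "Regularization", "Best C", "Test AUC (%)"]
table.add_row(["TFIDF", "L2", "...", 93.25])   # best result reported in the article
table.add_row(["BoW", "L2", "...", "..."])     # placeholders: values not quoted here
print(table)
```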

Observations

  1. Compared to the bag-of-words feature representation, TFIDF features with L2 regularization get the highest AUC score on the test data, 93.25%.
  2. The best C values differ from model to model.

To see a complete overview of the Amazon Food Reviews dataset and its featurization, visit my previous blog link here.

To know detailed information about the performance metrics used in machine learning, please visit my previous blog link here.

To understand the full code, please visit my GitHub link.

Translated from: https://medium.com/analytics-vidhya/logistic-regression-with-amazon-food-reviews-164b3748335e
