Linear Regression in Python: Sklearn vs Excel

weixin_26752765

Original article: https://towardsdatascience.com/linear-regression-in-python-sklearn-vs-excel-6790187dc9ca

Inside AI

Around 13 years ago, Scikit-learn development started as a Google Summer of Code project by David Cournapeau. Over time, Scikit-learn became one of the most famous machine learning libraries in Python. It offers several classification, regression and clustering algorithms, and its key strength, in my opinion, is its seamless integration with NumPy, pandas and SciPy.

In this article, I will compare the prediction accuracy of multiple linear regression in Scikit-learn with that in Excel. Scikit-learn offers many parameters (known as hyper-parameters of an estimator) to fine-tune the training of a model and increase the accuracy of prediction. In Excel, there is not much we can tune in the regression algorithm. For a fair comparison, I will train the sklearn regression model with default parameters.

Objective

This comparison aims to assess the prediction accuracy of linear regression in Excel and in Scikit-learn. I will also touch briefly on how to perform linear regression in Excel.

Sample Data File

For the comparison, we will use 100,000 historical readings of precipitation, minimum temperature, maximum temperature and wind speed, measured several times a day over 8 years.

We will use the precipitation, minimum temperature and maximum temperature to predict the wind speed. Hence, wind speed is the dependent variable, and the other three are the independent variables.

We will first build a linear regression model in Excel and use it to predict the wind speed. Then we will do the same exercise with Scikit-learn, and finally, we will compare the predicted results.

To perform the linear regression in Excel, we will open the sample data file and click the “Data” tab in the Excel ribbon. In the “Data” tab, select the “Data Analysis” option.

Tip: If you do not see the “Data Analysis” option, click File > Options > Add-ins. Select “Analysis ToolPak” and click the “Go” button.

On clicking the “Data Analysis” option, a pop-up window will open showing the different analysis tools available in Excel. We will select “Regression” and then click “OK”.

Another pop-up window will ask for the independent and dependent value series. The Excel cell reference of wind speed (the dependent variable) goes in the “Input Y Range” field. In “Input X Range” we provide the cell reference for the independent variables, i.e. precipitation, minimum temperature and maximum temperature.

We need to select the “Labels” checkbox, as the first row in our sample data has variable names.

On clicking the “OK” button after specifying the data, Excel will build a linear regression model. You can think of it as the training step (the fit method) in Scikit-learn code.

Excel does the calculations and shows the information in a nice format. In our example, Excel fitted the linear regression model with an R Square of 0.953. With 100,000 records in the training dataset, Excel performed the linear regression in less than 7 seconds. Along with other statistical information, it also shows the intercept and the coefficients of the independent variables.

Based on the Excel linear regression output, we can put together the following mathematical relationship.

Wind Speed = 2.438 + (Precipitation * 0.026) + (MinTemp * 0.393) + (MaxTemp * 0.395)

We will use this formula to predict the wind speed for the test data set, which the Excel regression model has not seen before.

For example, for the first record in the test data set: Wind Speed = 2.438 + (0.51 * 0.026) + (17.78 * 0.393) + (25.56 * 0.395) = 19.55
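
The same arithmetic can be sketched in Python. This is a minimal illustration, not part of the original workflow: the function name `predict_wind_speed` and the constant names are hypothetical, and the coefficients are the rounded values quoted from the Excel output.

```python
# Coefficients copied from the Excel regression output quoted above
# (rounded to three decimals, so results are approximate).
INTERCEPT = 2.438
COEF_PRECIPITATION = 0.026
COEF_MIN_TEMP = 0.393
COEF_MAX_TEMP = 0.395


def predict_wind_speed(precipitation, min_temp, max_temp):
    """Apply the fitted linear equation to one reading (hypothetical helper)."""
    return (INTERCEPT
            + COEF_PRECIPITATION * precipitation
            + COEF_MIN_TEMP * min_temp
            + COEF_MAX_TEMP * max_temp)


# First record of the test data set, as quoted in the text.
first_test_prediction = predict_wind_speed(0.51, 17.78, 25.56)
print(round(first_test_prediction, 2))
```

With the rounded coefficients this evaluates to roughly 19.5. The small gap to the quoted 19.55 is presumably because Excel works with the unrounded coefficients.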

We have also calculated the residuals of the predictions and plotted them to understand their trend. We can see that in nearly all cases the predicted wind speed is lower than the actual value, and the higher the wind speed, the larger the prediction error.
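
A minimal sketch of the residual calculation, assuming the convention residual = actual − predicted (so a positive residual means the model under-predicted). The numbers below are made up for illustration and are not taken from the weather data set.

```python
# Made-up actual and predicted wind speeds, for illustration only.
actual = [12.0, 18.5, 25.0, 31.0]
predicted = [11.5, 17.0, 22.0, 26.5]

# Residual = actual - predicted; positive means the prediction was too low.
residuals = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error summarises the typical size of the residuals.
mean_abs_error = sum(abs(r) for r in residuals) / len(residuals)

for a, r in zip(actual, residuals):
    print(f"actual={a:5.1f}  residual={r:+.1f}")
print("mean absolute error:", mean_abs_error)
```

In this toy example the residuals grow with the actual wind speed, which is the kind of pattern a residual plot makes visible.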

Let us now delve into linear regression in Scikit-learn.

Step 1- We will import the packages that we are going to use for our analysis. The independent variables span different value ranges and do not follow a standard normal distribution, hence we need StandardScaler to standardize them.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2- Read the training and test data from the Excel files into the pandas DataFrames Training_data and Test_data respectively.

Training_data = pd.read_excel("Weather.xlsx", sheet_name="Sheet1")
Test_data = pd.read_excel("Weather Test.xlsx", sheet_name="Sheet1")

In this article I will not focus on preliminary data quality checks such as blank values and outliers, or on the respective correction approaches, and I assume that the data series have no such discrepancies.

Please refer to “How to identify the right independent variables for Machine Learning Supervised Algorithms?” for independent variable selection criteria and correlation analysis.

Step 3- In the code below, we declare every column except “WindSpeed” as the independent variables and only “WindSpeed” as the dependent variable, for both the training and test data. Please note that we will not use “SourceData_test_dependent” to fit the linear regression; we will only compare the predicted values with it.

SourceData_train_independent = Training_data.drop(["WindSpeed"], axis=1)  # drop the dependent variable from the training dataset
SourceData_train_dependent = Training_data["WindSpeed"].copy()  # Series holding only the dependent variable of the training dataset
SourceData_test_independent = Test_data.drop(["WindSpeed"], axis=1)
SourceData_test_dependent = Test_data["WindSpeed"].copy()

Step 4- As the independent variable ranges are quite disparate, we need to scale them to avoid the unintended influence of any one variable. In the code below, the independent training and test variables are scaled and saved to X_train and X_test respectively. The scaler is fitted on the training data only and then applied to the test data. We do not need to scale the dependent variable for either training or testing, so y_train and y_test hold the dependent variable values without scaling.

sc_X = StandardScaler()
X_train = sc_X.fit_transform(SourceData_train_independent.values)  # scale the independent variables
y_train = SourceData_train_dependent  # scaling is not required for the dependent variable
X_test = sc_X.transform(SourceData_test_independent.values)  # reuse the mean and std learned from the training data
y_test = SourceData_test_dependent
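
To make the standardization step concrete, here is a minimal NumPy sketch of the same idea: subtract the training mean and divide by the training standard deviation, column by column. The arrays are made-up values, not the weather data, and the variable names are illustrative only.

```python
import numpy as np

# Made-up readings: columns are precipitation, min temp, max temp.
train = np.array([[0.0, 10.0, 20.0],
                  [1.0, 15.0, 25.0],
                  [5.0, 20.0, 30.0]])
test = np.array([[0.5, 17.8, 25.6]])

# Learn the per-column mean and standard deviation from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse training statistics; do not refit on test data

print(train_scaled.mean(axis=0))  # each column is centred near 0
print(train_scaled.std(axis=0))   # each column has unit spread
```

Reusing the training statistics for the test rows mirrors calling `fit_transform` on the training data and plain `transform` on the test data.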

Step 5- Now we will feed the independent and dependent training data, i.e. X_train and y_train respectively, to train the linear regression model. We will fit the model with default parameters for the reasons mentioned at the start of the article.

reg = LinearRegression().fit(X_train, y_train)
print("The Linear regression score on training data is ", round(reg.score(X_train, y_train),2))
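
As a rough counterpart to Excel's regression output table, a fitted `LinearRegression` exposes its intercept and coefficients as attributes. The sketch below uses synthetic, noise-free data generated from a known equation (an assumption made purely for illustration) so the recovered values can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 2.0 + 0.5*x1 + 0.3*x2 (made-up equation).
rng = np.random.default_rng(0)
X = rng.uniform(0, 30, size=(200, 2))
y = 2.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1]

model = LinearRegression().fit(X, y)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2 on training data:", model.score(X, y))
```

Note that when the model is fitted on standardized inputs, as in Step 4, the `coef_` values are expressed per standard deviation of each variable, so they are not directly comparable with coefficients fitted on the raw columns.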

The linear regression score on the training data is the same as the one we observed with Excel.

Step 6- Finally, we will predict the wind speed from the independent variables of the test data set.

predict=reg.predict(X_test)
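
One way to quantify how close predictions are to the actual values is the R² score, for which `r2_score` was imported in Step 1. The sketch below uses made-up numbers rather than the article's results and checks the library value against the textbook formula R² = 1 − SS_res / SS_tot.

```python
from sklearn.metrics import r2_score

# Made-up actual and predicted wind speeds, for illustration only.
y_true = [12.0, 18.5, 25.0, 31.0]
y_pred = [11.5, 17.0, 22.0, 26.5]

# Manual R^2: one minus residual sum of squares over total sum of squares.
mean_true = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
manual_r2 = 1 - ss_res / ss_tot

library_r2 = r2_score(y_true, y_pred)

print("manual R^2: ", manual_r2)
print("sklearn R^2:", library_r2)
```

In the article's setting, the same call would presumably be applied to `y_test` and `predict`.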

Based on the predicted wind speed values and the residual scatter plot, we can see that the Sklearn predictions are closer to the actual values.

On comparing the Sklearn and Excel residuals side by side, we can see that both models deviate more from the actual values as the wind speed increases, but sklearn did better than Excel.

On a different note, Excel did predict wind speeds in a value range similar to sklearn's. If an approximate linear regression model is good enough for your business case, Excel is a very good option for quickly predicting values.

The takeaway of this exercise is not that Excel can perform linear regression prediction at the same accuracy level as sklearn. We can improve the sklearn prediction accuracy substantially by fine-tuning the parameters, and sklearn is better equipped to handle complex models. For quick, approximate prediction use cases, Excel is a very good alternative with acceptable accuracy.
