Linear Regression Assumptions

This article explores the fundamentals of Linear Regression, focusing on the algorithm's key assumptions: a linear relationship in the data, independence, normally distributed errors, and homoscedasticity. Understanding these assumptions helps improve the predictive accuracy of Linear Regression models.

Linear Regression is the bicycle of regression models. It’s simple yet incredibly useful. It can be used in a variety of domains. It has a nice closed-form solution, which makes model training a super-fast, non-iterative process.

A Linear Regression model’s performance characteristics are well understood and backed by decades of rigorous research. The model’s predictions are easy to understand, easy to explain and easy to defend.

If there is only one regression model that you have time to learn inside out, it should be the Linear Regression model.

If your data satisfies the assumptions made by the Linear Regression model, specifically the Ordinary Least Squares Regression (OLSR) model, in most cases you need look no further.

Which brings us to the following four assumptions that the OLSR model makes:

  1. Linear functional form: The response variable y should be linearly related to the explanatory variables X.

  2. Residual errors should be i.i.d.: After fitting the model on the training data set, the residual errors of the model should be independent and identically distributed random variables.

  3. Residual errors should be normally distributed: The residual errors should be normally distributed.

  4. Residual errors should be homoscedastic: The residual errors should have constant variance.

Let’s look at the four assumptions in detail and how to test them.

Assumption 1: Linear functional form

Linearity requires little explanation. After all, if you have chosen to do Linear Regression, you are assuming that the underlying data exhibits linear relationships, specifically the following linear relationship:

y = β*X + ϵ

where y is the dependent variable vector, X is the matrix of explanatory variables (which includes the intercept), β is the vector of regression coefficients, and ϵ is the vector of error terms, i.e. the portion of y that X is unable to explain.
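
To make the notation concrete, here is a minimal sketch (using a synthetic, made-up data set rather than the power plant data introduced below) that generates y from exactly this linear form and then recovers β with ordinary least squares:

import numpy as np

rng = np.random.default_rng(42)
n = 1000

# X includes a column of ones for the intercept, as described above
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
beta_true = np.array([5.0, 2.0, -3.0])  # made-up coefficients
epsilon = rng.normal(0.0, 1.0, size=n)  # the error term vector

y = X @ beta_true + epsilon

# OLS: find the beta that minimizes ||y - X*beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # should be close to [5.0, 2.0, -3.0]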

How to test the linearity assumption using Python

This can be done in two ways:

  1. An easy way is to plot y against each explanatory variable x_j and visually inspect the scatter plot for signs of non-linearity.

  2. One could also use the DataFrame.corr() method in Pandas to get Pearson’s correlation coefficient ‘r’ between the response variable y and each explanatory variable x_j, to get a quantitative feel for the degree of linear correlation.

Note that Pearson’s ‘r’ should be used only when the relation between y and X is known to be linear.
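
As a quick, made-up illustration of this caveat (not part of the power plant data), Pearson’s ‘r’ can be near zero even when y is completely determined by x, if the dependence is non-linear:

import numpy as np
import pandas as pd

x = np.linspace(-3, 3, 200)
demo = pd.DataFrame({'x': x, 'y': x**2})  # perfect, but non-linear, dependence

# Pearson's r only measures linear association, so it is ~0 here
print(demo.corr().loc['x', 'y'])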

Let’s test the linearity assumption on the following data set of 9568 observations of 4 operating parameters of a combined cycle power plant taken over 6 years:

The explanatory variables x_j are the following 4 power plant parameters:

Ambient_Temp in degrees Celsius
Exhaust_Volume in column height of mercury, in centimeters
Ambient_Pressure in millibars
Relative_Humidity expressed as a percentage

The response variable y is the power plant’s Power_Output, in MW.

Let’s load the data set into a Pandas DataFrame.

import pandas as pd
from patsy import dmatrices
from matplotlib import pyplot as plt
import numpy as np

df = pd.read_csv('power_plant_output.csv', header=0)

Plot the scatter plots of each explanatory variable against the response variable Power_Output.

df.plot.scatter(x='Ambient_Temp', y='Power_Output')
plt.xlabel('Ambient_Temp', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()

df.plot.scatter(x='Exhaust_Volume', y='Power_Output')
plt.xlabel('Exhaust_Volume', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()

df.plot.scatter(x='Ambient_Pressure', y='Power_Output')
plt.xlabel('Ambient_Pressure', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()

df.plot.scatter(x='Relative_Humidity', y='Power_Output')
plt.xlabel('Relative_Humidity', fontsize=18)
plt.ylabel('Power_Output', fontsize=18)
plt.show()

Here is a collage of the four plots:

[Figure: Scatter plots of Power_Output against each explanatory variable]

You can see that Ambient_Temp and Exhaust_Volume seem to be the most linearly related to the power plant’s Power_Output, followed by Ambient_Pressure and Relative_Humidity, in that order.

Let’s also print out Pearson’s ‘r’:

df.corr()['Power_Output']

We get the following output, which backs up our visual intuition:

Ambient_Temp        -0.948128
Exhaust_Volume      -0.869780
Ambient_Pressure     0.518429
Relative_Humidity    0.389794
Power_Output         1.000000
Name: Power_Output, dtype: float64

Related read: The Intuition Behind Correlation, for an in-depth explanation of Pearson’s correlation coefficient.

Assumption 2: i.i.d. residual errors

The second assumption that one makes while fitting OLSR models is that the residual errors left over from fitting the model to the data are independent, identically distributed random variables.

We break this assumption into three parts:

  1. The residual errors are random variables,

  2. They are independent random variables, and

  3. Their probability distributions are identical.

Why are residual errors random variables?

After we train a Linear Regression model on a data set, if we run the training data through the same model, the model will generate predictions. Let’s call them y_pred. For each predicted value y_pred in the vector y_pred, there is a corresponding actual value y from the response variable vector y. The difference (y − y_pred) is the residual error ε. There are as many of these ε as the number of rows in the training set, and together they form the residual errors vector ε.

Each residual error ε is a random variable. To understand why, recollect that our training set (y_train, X_train) is just a sample of n values drawn from some very large population of values.
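
To make ε concrete, here is a minimal sketch that fits the power plant model described above and extracts the residual errors vector, together with one simple independence check: the Durbin-Watson statistic, which is close to 2.0 when the residuals show no first-order autocorrelation. This assumes the df loaded earlier, and Durbin-Watson is only one possible starting point, not a complete test of independence:

import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.stattools import durbin_watson

# Build y and X (with an intercept) from the power plant DataFrame
expr = 'Power_Output ~ Ambient_Temp + Exhaust_Volume + Ambient_Pressure + Relative_Humidity'
y, X = dmatrices(expr, df, return_type='dataframe')

# Fit the OLSR model and compute the residual errors vector epsilon
olsr_results = sm.OLS(y, X).fit()
epsilon = olsr_results.resid

# Durbin-Watson statistic: values near 2.0 suggest no first-order
# autocorrelation among the residual errors
print(durbin_watson(epsilon))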
