Working with Outliers: OLS vs. Robust Regressions

This article looks at how to handle outliers in linear regression data, comparing two approaches: ordinary least squares (OLS) and robust regression. A worked example in R shows how these methods can affect the accuracy of a regression analysis.


When it comes to regression analysis, outliers (values that lie well outside the typical range for a particular set of data) can cause issues.

Background

Let’s consider this issue further using the Pima Indians Diabetes dataset.


Here is a boxplot of BMI values across patients. According to this boxplot, several outliers are present that lie well beyond the upper bound indicated by the interquartile range.

Furthermore, we also have visual indication of a positively skewed distribution — where several positive outliers “push” the distribution out to the right:


Figure: Boxplot of BMI values across patients (source: RStudio).

Outliers can cause issues when it comes to conducting regression analysis. This model works by finding the line of best fit: the regression line that minimises the sum of squared distances between the line and the individual observations.

Clearly, if outliers are present, this weakens the predictive power of the line of best fit. Outliers can also violate the assumption that the residuals are normally distributed.
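To make this concrete, here is a minimal Python sketch (using made-up toy data, not the diabetes set) showing how a single outlier drags the OLS line of best fit:

```python
import numpy as np

# Toy data: a clean linear trend (roughly y = 2x) plus one extreme outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 40.0])  # last point is an outlier

# OLS fit: minimise the sum of squared residuals
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("slope with outlier:", beta[1])

# Refit without the outlier to see how far it pulled the line
beta_clean, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)
print("slope without outlier:", beta_clean[1])
```

The fitted slope roughly triples once the outlier is included, even though only one of six observations changed.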

In this regard, both an OLS regression model and robust regression models (using Huber and Bisquare weights) are run in order to predict BMI values across the test set — with a view to measuring whether accuracy was significantly improved by using the latter model.


Here is a quick overview of the data and the correlations between each feature:


Figure: Overview of the data and the correlation plot between features (source: RStudio).

OLS

Using the above correlation plot to ensure that the independent variables in the regression model are not strongly correlated with each other, the regression model is defined as follows:

reg1 <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset)

Note that Outcome is a binary categorical variable (0 = not diabetic, 1 = diabetic).

The data is split into both a training set and a test set (to serve as unseen data for the model).


Of this data, 80% is used to train the regression model, while 20% is held out as a validation set to assess the results.

# Training and Validation Data
trainset <- diabetes1[1:479, ]
valset <- diabetes1[480:599, ]

Here are the OLS results:


> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + Age + Insulin + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.0813  -4.2762  -0.8733   3.4031  28.2196

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   28.0498978  0.9740705  28.797  < 2e-16 ***
Outcome        4.1290646  0.6171707   6.690 6.30e-11 ***
Age           -0.0101171  0.0248626  -0.407    0.684
Insulin        0.0000262  0.0027077   0.010    0.992
SkinThickness  0.1513285  0.0195945   7.723 6.81e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.176 on 474 degrees of freedom
Multiple R-squared:  0.2135, Adjusted R-squared:  0.2069
F-statistic: 32.17 on 4 and 474 DF,  p-value: < 2.2e-16

Outcome and SkinThickness are identified as significant variables at the 5% level. The R-squared of 21.35% is quite low, but this is to be expected, as many other variables that can influence BMI have not been included in the model.

Let’s drop the Age and Insulin variables from the OLS model and run it once again.

> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1740  -4.2115  -0.8532   3.3852  28.3072

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   27.70940    0.49723  55.728   <2e-16 ***
Outcome        4.06953    0.59223   6.872    2e-11 ***
SkinThickness  0.15247    0.01774   8.595   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.165 on 476 degrees of freedom
Multiple R-squared:  0.2132, Adjusted R-squared:  0.2099
F-statistic: 64.51 on 2 and 476 DF,  p-value: < 2.2e-16

Robust Regressions

A modified version of the above regression, known as a robust regression, is now run. The reason we refer to the regression as “robust” is that such models are less sensitive to violations of OLS assumptions, including the presence of outliers in the data.

In this example, we will use two different types of weighting to run this type of regression: Huber and Bisquare weights.


The same regressions are run once again, but this time using the above weightings.

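Under the hood, these robust fits use iteratively reweighted least squares (IRLS): observations with large scaled residuals receive weights below 1, so they pull less on the fitted line. Below is an illustrative numpy sketch of that idea on toy data, using the usual default tuning constants (k = 1.345 for Huber, c = 4.685 for bisquare). It is a simplified illustration, not a reimplementation of R's rlm:

```python
import numpy as np

def huber_weight(u, k=1.345):
    """Huber weights: 1 inside the threshold, k/|u| beyond it."""
    a = np.abs(u)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def bisquare_weight(u, c=4.685):
    """Tukey bisquare weights: taper smoothly to 0 beyond c."""
    a = np.abs(u)
    return np.where(a <= c, (1 - (a / c) ** 2) ** 2, 0.0)

def irls(X, y, weight_fn, n_iter=50):
    """Iteratively reweighted least squares for a robust linear fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # start from OLS
    for _ in range(n_iter):
        r = y - X @ beta
        # Scale residuals by a robust spread estimate (normalised MAD)
        s = np.median(np.abs(r - np.median(r))) / 0.6745
        w = weight_fn(r / max(s, 1e-12))
        sw = np.sqrt(w)  # weighted least squares via row scaling
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Toy data: linear trend (roughly y = 2x) with one outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 40.0])
X = np.column_stack([np.ones_like(x), x])

print("Huber slope:   ", irls(X, y, huber_weight)[1])
print("Bisquare slope:", irls(X, y, bisquare_weight)[1])
```

Both robust slopes land near the clean trend of 2, while plain OLS on the same data gives a slope near 6; the bisquare weights drop the outlier entirely, while Huber only downweights it.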

Huber Weights

> # Huber Weights
> rr.huber <- rlm(BMI ~ Outcome + SkinThickness, data=trainset)
> summary(rr.huber)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.4130  -3.6492  -0.3479   3.7717  28.7081

Coefficients:
              Value    Std. Error t value
(Intercept)   27.0596   0.4685    57.7581
Outcome        3.7631   0.5580     6.7438
SkinThickness  0.1645   0.0167     9.8445

Residual standard error: 5.47 on 476 degrees of freedom

Bisquare Weights

> # Bisquare weighting
> rr.bisquare <- rlm(BMI ~ Outcome + SkinThickness, data=trainset, psi = psi.bisquare)
> summary(rr.bisquare)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset,
    psi = psi.bisquare)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1991  -3.6106  -0.3015   3.8074  28.8724

Coefficients:
              Value    Std. Error t value
(Intercept)   27.0524   0.4793    56.4472
Outcome        3.6491   0.5708     6.3927
SkinThickness  0.1636   0.0171     9.5689

Residual standard error: 5.483 on 476 degrees of freedom

Comparison

Here is the performance of the regression models in predicting the test set values (both on a root mean squared error and mean absolute percentage error basis):


RMSE

  • OLS: 5.81
  • Huber: 5.86
  • Bisquare: 5.87

MAPE

  • OLS: 0.139
  • Huber: 0.137
  • Bisquare: 0.137

We can see that the errors increased slightly on an RMSE basis (contrary to our expectations), while there was only a marginal decrease on an MAPE basis.

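For reference, the two error metrics can be sketched in Python as follows. The data here is invented purely for illustration; the figures above come from the R models on the actual test set:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more heavily."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.139 = 13.9%)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs((actual - predicted) / actual))

# Hypothetical BMI predictions, for illustration only
actual = [30.0, 25.0, 35.0, 28.0]
predicted = [28.5, 26.0, 33.0, 29.0]
print(rmse(actual, predicted))
print(mape(actual, predicted))
```

Because RMSE squares the errors, it is itself more sensitive to a few large misses than MAPE is, which is worth keeping in mind when the two metrics disagree.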

Are the outliers “influential”?

Using a robust regression to account for outliers did not show significant accuracy improvements as might have been expected.


However, the mere presence of outliers in a dataset does not necessarily mean that those outliers are influential.

By influential, we mean that the outlier has a substantial effect on the fitted regression line.

This can be determined by using Cook’s Distance.


Figure: Cook's distance plot of the fitted model (source: RStudio).

We can see that while outliers are present in the dataset, they still do not approach the threshold outlined by Cook's distance at the top right-hand corner of the graph.
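Cook's distance can also be computed directly from the residuals and leverages of an OLS fit. Here is a minimal numpy sketch on toy data (not the diabetes set), using the standard formula:

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit:
    D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2,
    where h_ii is the leverage from the hat matrix."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                          # residuals
    H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
    h = np.diag(H)                            # leverages
    mse = e @ e / (n - p)
    return (e ** 2 / (p * mse)) * h / (1 - h) ** 2

# Toy data: linear trend with one outlier at the high-leverage end
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 40.0])
X = np.column_stack([np.ones_like(x), x])

d = cooks_distance(X, y)
# Common rules of thumb flag points with D_i > 4/n, or D_i > 1
print(d.round(3))
```

In this toy case the outlier's distance exceeds 1, so it would be flagged as influential; in the diabetes example above, no point crosses the threshold.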

In this regard, it is now evident why the robust regression did not outperform OLS from an accuracy standpoint: the outliers are not influential enough to warrant using a robust regression.

Conclusion

Robust regressions are useful when it comes to modelling outliers in a dataset and there have been cases where they can produce superior results to OLS.


However, those outliers must be influential, so one must exercise caution in using robust regressions in a situation such as this, where outliers are present but do not particularly influence the response variable.

Hope you enjoyed this article, and any questions or feedback are greatly welcomed. You can find the code and datasets for this example at the MGCodesandStats GitHub repository.


Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.


Translated from: https://towardsdatascience.com/working-with-outliers-ols-vs-robust-regressions-5cf861168ac4
