In regression analysis, outliers (values that lie far from the bulk of a particular set of data) can cause issues.


Background

Let’s consider this issue further using the Pima Indians Diabetes dataset.


A boxplot of BMI values across patients shows several outliers that lie well above the range implied by the interquartile range.
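To make the boxplot rule concrete, here is a minimal sketch of the usual 1.5 × IQR criterion. The `bmi` vector is made up for illustration and the helper name `iqr_outliers` is an assumption, not something from the original analysis. Note also that `boxplot()` uses hinges rather than `quantile()`, so its fences can differ slightly.

```r
# Made-up BMI-like values for illustration only (not the Pima data).
bmi <- c(22.1, 24.8, 26.3, 27.5, 29.0, 30.2, 31.4, 33.6, 35.1, 52.3, 67.1)

# Hypothetical helper: return the values outside Q1 - k*IQR and Q3 + k*IQR.
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

outliers <- iqr_outliers(bmi)
print(outliers)
```

On the real data, the same helper could be called as `iqr_outliers(diabetes1$BMI)`.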

Furthermore, we also have a visual indication of a positively skewed distribution, where several positive outliers “push” the distribution out to the right:


[Figure: distribution of BMI values. Source: RStudio]
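The visual impression of skew can also be checked numerically. The sketch below computes the moment-based sample skewness in base R. The function name `sample_skewness` and the example vectors are assumptions made for illustration. A value above zero points to a longer right tail.

```r
# Hypothetical helper: third standardised moment of a numeric vector.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 2, 3, 4, 20)   # one large value stretches the right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
```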

Outliers can cause issues when conducting regression analysis. An OLS model is built around the line of best fit, i.e. the regression line that minimises the sum of squared distances between the line and the individual observations.

If outliers are present, they can pull the line of best fit towards themselves, because least squares penalises large residuals heavily, and this weakens the predictive power of the regression model. Outliers can also violate the assumption that the residuals are normally distributed.
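The pull that a single outlier exerts on the line of best fit can be seen in a small simulation. This is a sketch on simulated data with a known slope of 2. None of the numbers come from the Pima dataset.

```r
# Simulated data: y = 1 + 2x + noise.
set.seed(42)
x <- 1:30
y <- 1 + 2 * x + rnorm(30, sd = 1)

fit_clean <- lm(y ~ x)

# Shift one observation at the edge of the x range far upwards and refit.
y_out <- y
y_out[30] <- y_out[30] + 100
fit_out <- lm(y_out ~ x)

# Compare the estimated coefficients with and without the outlier.
print(coef(fit_clean))
print(coef(fit_out))
```

Because the shifted point sits at the edge of the x range, it has high leverage and moves the slope noticeably. The same shift applied near the middle of the range would mostly move the intercept.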

With this in mind, both an OLS regression model and robust regression models (using Huber and bisquare weights) are run to predict BMI values across the test set, with a view to measuring whether the robust models significantly improve accuracy.


Here is a quick overview of the data and the correlations between the features:


[Figure: overview of the data and correlation plot of the features. Source: RStudio]
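A check for strongly correlated predictors can also be done numerically rather than by eye. The sketch below uses simulated stand-in data, and the helper `high_cor_pairs` and the 0.7 cut-off are assumptions chosen for illustration. The column names merely mirror typical columns of the dataset.

```r
# Simulated stand-in data; values are random and purely illustrative.
set.seed(1)
n <- 200
age <- rnorm(n, mean = 33, sd = 11)
demo <- data.frame(
  Age = age,
  Insulin = rnorm(n, mean = 80, sd = 100),
  SkinThickness = rnorm(n, mean = 20, sd = 15),
  Pregnancies = 0.3 * age + rnorm(n)   # deliberately tied to Age
)

# Hypothetical helper: list variable pairs whose |correlation| exceeds a cut-off.
high_cor_pairs <- function(df, threshold = 0.7) {
  cm <- cor(df, use = "pairwise.complete.obs")
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx]
  )
}

pairs_found <- high_cor_pairs(demo)
print(pairs_found)
```

Any pair this flags is a candidate for dropping one of the two variables before fitting the regression.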

OLS

Using the correlation plot above to check that the independent variables in the regression model are not strongly correlated with each other, the regression model is defined as follows:


reg1 <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset)

Note that Outcome is a binary categorical variable coded 0 or 1 (not diabetic vs. diabetic).

The data is split into a training set and a test set (the latter serving as unseen data for the model).

Within the training set, 80% of the observations are used to train the regression model, while the remaining 20% are held out as a validation set to assess the results.


# Training and Validation Data
trainset <- diabetes1[1:479, ]
valset <- diabetes1[480:599, ]

Here are the OLS results:


> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + Age + Insulin + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.0813  -4.2762  -0.8733   3.4031  28.2196

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   28.0498978  0.9740705  28.797  < 2e-16 ***
Outcome        4.1290646  0.6171707   6.690 6.30e-11 ***
Age           -0.0101171  0.0248626  -0.407    0.684
Insulin        0.0000262  0.0027077   0.010    0.992
SkinThickness  0.1513285  0.0195945   7.723 6.81e-14 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.176 on 474 degrees of freedom
Multiple R-squared: 0.2135, Adjusted R-squared: 0.2069
F-statistic: 32.17 on 4 and 474 DF, p-value: < 2.2e-16

Outcome and SkinThickness are identified as significant variables at the 5% level. The R-squared of 21.35% is quite low, but this is to be expected, since many other variables that can influence BMI have not been included in the model.

Let’s drop the Age and Insulin variables, neither of which is significant, from the OLS model and run it once again.


> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max
-12.1740  -4.2115  -0.8532   3.3852  28.3072

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   27.70940    0.49723  55.728   <2e-16 ***
Outcome        4.06953    0.59223   6.872    2e-11 ***
SkinThickness  0.15247    0.01774   8.595   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.165 on 476 degrees of freedom
Multiple R-squared: 0.2132, Adjusted R-squared: 0.2099
F-statistic: 64.51 on 2 and 476 DF, p-value: < 2.2e-16

Robust Regressions

A modified version of the above regression, known as a robust regression, is now run. We refer to the regression as “robust” because such models are less sensitive to violations of the OLS assumptions, including the presence of outliers in the data. The following presentation gives more information on how specifically a robust regression works.

In this example, we will use two different types of weighting to run this type of regression: Huber and Bisquare weights.
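As a rough illustration of how the two weighting schemes differ, the sketch below writes out the standard Huber and bisquare weight functions by hand. The function names are assumptions. The tuning constants 1.345 and 4.685 are the defaults used by `psi.huber` and `psi.bisquare` in the MASS package. Here `u` stands for a residual divided by a robust estimate of scale.

```r
# Huber: full weight for small residuals, weight k/|u| beyond the cut-off k.
huber_weight <- function(u, k = 1.345) {
  pmin(1, k / abs(u))
}

# Bisquare (Tukey): weights fall smoothly to exactly zero at |u| >= c.
bisquare_weight <- function(u, c = 4.685) {
  ifelse(abs(u) < c, (1 - (u / c)^2)^2, 0)
}

u <- c(0, 0.5, 1, 2, 4, 6)
print(round(huber_weight(u), 3))
print(round(bisquare_weight(u), 3))
```

The practical difference is that Huber weights never reach zero, so every observation keeps some say in the fit, whereas bisquare weights discard sufficiently extreme observations altogether.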

The same regressions are run once again, but this time using the above weightings.
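For intuition about what happens when a regression is rerun with such weights, here is a hand-rolled sketch of iteratively reweighted least squares (IRLS) with Huber weights in base R. It is a simplified illustration on simulated data, not the implementation behind `rlm`. The function name `irls_huber`, the convergence settings, and the data are all assumptions.

```r
# Hypothetical helper: simple-regression Huber fit via IRLS.
irls_huber <- function(x, y, k = 1.345, max_iter = 50, tol = 1e-8) {
  X <- cbind(Intercept = 1, x = x)
  beta <- lm.fit(X, y)$coefficients           # start from the OLS solution
  for (i in seq_len(max_iter)) {
    res <- as.vector(y - X %*% beta)
    s <- median(abs(res)) / 0.6745            # MAD-based scale estimate
    w <- pmin(1, k / abs(res / s))            # Huber weights
    beta_new <- lm.wfit(X, y, w)$coefficients # weighted least squares step
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Simulated data: true intercept 1, true slope 2, three inflated responses.
set.seed(123)
x <- 1:30
y <- 1 + 2 * x + rnorm(30)
y[c(5, 15, 25)] <- y[c(5, 15, 25)] + 60

ols_coef <- coef(lm(y ~ x))
robust_coef <- irls_huber(x, y)
print(ols_coef)
print(robust_coef)
```

Each pass refits the model with the observations that currently have large residuals down-weighted, which is why a few extreme responses end up with limited say in the final coefficients.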


Huber Weights

> # Huber Weights
> library(MASS)  # provides rlm()
> rr.huber <- rlm(BMI ~ Outcome + SkinThickness, data=trainset)
> summary(rr.huber)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset)
Residuals:
     Min       1Q   Median       3Q      Max
-12.4130  -3.6492  -0.3479   3.7717  28.7081

Coefficients:
                Value Std. Error t value
(Intercept)   27.0596     0.4685 57.7581
Outcome        3.7631     0.5580  6.7438
SkinThickness  0.1645     0.0167  9.8445

Residual standard error: 5.47 on 476 degrees of freedom

Bisquare Weights

> # Bisquare weighting
> rr.bisquare <- rlm(BMI ~ Outcome + SkinThickness, data=trainset, psi = psi.bisquare)
> summary(rr.bisquare)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset,
    psi = psi.bisquare)
Residuals:
     Min       1Q   Median       3Q      Max
-12.1991  -3.6106  -0.3015   3.8074  28.8724

Coefficients:
                Value Std. Error t value
(Intercept)   27.0524     0.4793 56.4472
Outcome        3.6491     0.5708  6.3927
SkinThickness  0.1636     0.0171  9.5689

Residual standard error: 5.483 on 476 degrees of freedom

Comparison

Here is the performance of the regression models in predicting the test set values (both on a root mean squared error and mean absolute percentage error basis):

RMSE


  • OLS: 5.81

  • Huber: 5.86

  • Bisquare: 5.87


MAPE

  • OLS: 0.139

  • Huber: 0.137

  • Bisquare: 0.137


We can see that the robust models’ errors are slightly higher on an RMSE basis (5.86 and 5.87 vs. 5.81 for OLS), contrary to our expectations, while there is only a marginal decrease on a MAPE basis (0.137 vs. 0.139).
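For reference, the two error measures can be written as short R functions. This is a minimal sketch with made-up numbers. The helper names `rmse` and `mape` are assumptions rather than functions from the original code. Note that MAPE is undefined whenever an actual value is zero.

```r
# Root mean squared error: penalises large misses more heavily.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean absolute percentage error, returned as a proportion (0.139 = 13.9%).
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual))
}

# Made-up values for illustration only.
actual <- c(30, 25, 40, 35)
predicted <- c(28, 27, 37, 36)

print(rmse(actual, predicted))
print(mape(actual, predicted))
```

With a fitted model and a held-out data frame (here called `testset`, a hypothetical name), the calls would look like `rmse(testset$BMI, predict(ols, testset))`.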


Are the outliers “influential”?

Using a robust regression to account for outliers did not show significant accuracy improvements as might have been expected.


However, the mere presence of outliers in a dataset doesn’t necessarily mean that those outliers are influential.


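By influential, we mean that the outlier has a substantial effect on the fitted regression, i.e. the estimated coefficients and predictions would change noticeably if that observation were removed.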


This can be determined by using Cook’s Distance.


[Figure: Cook’s distance plot. Source: RStudio]

We can see that while outliers are indicated as being present in the dataset, they still do not approach the Cook’s distance threshold shown at the top right-hand corner of the graph.
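Cook’s distance can also be inspected numerically with `cooks.distance()` from base R. The sketch below uses simulated data in which one point is deliberately both far out in x and far from the trend. The 4/n cut-off is a common rule of thumb adopted here as an assumption. It is not necessarily the threshold drawn in the plot above.

```r
# Simulated data: 49 ordinary points plus one high-leverage, off-trend point.
set.seed(7)
x <- c(rnorm(49), 8)
y <- 2 * x + rnorm(50)
y[50] <- -10

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

# Rule-of-thumb cut-off (an assumption): flag observations with D > 4/n.
threshold <- 4 / length(cd)
influential <- which(cd > threshold)

print(round(cd[influential], 3))
```

For a fitted `lm` object, `plot(fit, which = 5)` draws the residuals-versus-leverage plot with Cook’s distance contours, which is a quick way to see whether any point comes near them.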


In this regard, it is now evident why the robust regression did not outperform OLS from an accuracy standpoint: the outliers are not influential enough to warrant using a robust regression.


Conclusion

Robust regressions are useful for modelling data that contain outliers, and there are cases where they produce superior results to OLS.


However, those outliers must be influential. One should therefore be cautious about using robust regressions in a situation such as this one, where outliers are present but do not particularly influence the fitted model.


Hope you enjoyed this article, and any questions or feedback are greatly welcomed. You can find the code and datasets for this example at the MGCodesandStats GitHub repository.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.
