多元线性回归—异方差(二)
异方差检验问题®
1 异方差及来源
在经典OLS估计中,需要假定扰动项为同方差,即 ε ∼ N ( 0 , σ 2 ) \varepsilon \sim N(0,\sigma^2) ε∼N(0,σ2),及扰动项方差为大于0的常数,不随其他自变量的取值而发生改变。但实际样本往往存在异方差问题,随着自变量取值不同,其方差也会发生改变。如果存在异方差,估计量无偏性不会发生变化,但不再是方差最小,相对OLS假定意味着方差增大,则统计量t值将降低,造成系数不显著。从异方差来源看,包括
- 截面型数据结构,不同截面的性质特征
- 重要解释变量忽略,导致扰动项保护遗漏变量信息,波动性更大
- 模型设定错误,非线性模型设定为线性模型
- 经济结构变动,存在结构性断点问题
- 其他经济原因导致
2 异方差识别与解决
识别异方差的方法有很多,从经验上看,如果是一元线性回归,通过自变量和因变量的散点图可大致识别—如果被解释变量随着自变量增大其波动幅度发生改变。对于多元线性回归,可以将残差平方项和自变量分别画散点图,观察散点图走势,但有时候不好判断。
在理论上,可以通过相关系数检验、white检验、Park检验以及BP检验等。这里介绍相关系数检验、white与BP检验
- 相关系数检验,将残差绝对值与自变量作相关系数矩阵,哪个自变量与残差绝对值相关系数越高,极可能扰动项方差是该自变量的函数,具体什么形式函数,不知道
- white检验,将残差平方项对所有自变量,交互项、平方项作回归,哪个自变量显著,极可能意味着扰动项方差是关于该自变量的函数,具体形式不知道
- BP检验,类似于white检验,但不包括交互项、平方项,哪个自变量显著,极可能意味着扰动项方差是关于该自变量的函数,具体形式不知道
3 异方差识别及处理®
# R语言与异方差
# 内存清理
rm(list=ls())
# 导入数据
library(readxl)
HET <- read_excel("D:/Allcode/Rstudy/model/OLS/HET.xlsx")
# 变量名处理
mydata = HET
# 普通最小二乘法
OLS1 = lm(y~.,data = mydata)
# 回归结果详情
summary(OLS1)
普通最小二乘法估计结果
# Call:
# lm(formula = y ~ ., data = mydata)
#
# Residuals:
# Min 1Q Median 3Q Max
# -454.01 -58.68 0.15 66.83 564.76
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -82.22311 111.58700 -0.737 0.47537
# x1 0.08479 0.03299 2.571 0.02452 *
# x2 2.66835 0.77964 3.423 0.00505 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 243.6 on 12 degrees of freedom
# Multiple R-squared: 0.7292, Adjusted R-squared: 0.6841
# F-statistic: 16.16 on 2 and 12 DF, p-value: 0.000394
由于变量x1,x2系数不大显著,考虑存在异方差问题。先进性识别
方法1,相关系数检验
# 异方差检验
mydata$r = abs(OLS1$residuals)
# 自变量与残差绝对值相关系数
cor(mydata[,-1])
相关系数矩阵
# x1 x2 r
# x1 1.00000000 0.4399801 0.02981016
# x2 0.43998014 1.0000000 0.82297373
# r 0.02981016 0.8229737 1.00000000
其中x2与残差绝对值相关系数最高,0.823
方法2,white检验方法
# white检验:将残差平方项对所有自变量平方及交互项作回归
# 平方项生成
mydata$x12 = mydata$x1^2;mydata$x22 = mydata$x2^2
# 辅助回归
OLS_white = lm(r~1+x1*x2+x12+x22,data = mydata)
summary(OLS_white)
OLS_white = lm(r~1+x1*x2,data = mydata)
summary(OLS_white)
结果如下
# Call:
# lm(formula = r ~ 1 + x1 * x2 + x12 + x22, data = mydata)
#
# Residuals:
# Min 1Q Median 3Q Max
# -80.30 -34.97 15.39 25.53 103.82
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -2.379e+01 9.860e+01 -0.241 0.8147
# x1 9.843e-02 4.953e-02 1.987 0.0782 .
# x2 2.900e-01 2.004e+00 0.145 0.8881
# x12 -6.122e-06 4.911e-06 -1.247 0.2440
# x22 5.564e-03 5.864e-03 0.949 0.3675
# x1:x2 -4.335e-04 1.485e-04 -2.919 0.0171 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 62.78 on 9 degrees of freedom
# Multiple R-squared: 0.9132, Adjusted R-squared: 0.865
# F-statistic: 18.94 on 5 and 9 DF, p-value: 0.0001548
#-------------------------x12,x22及不显著,去掉这两个变量----------------
# Call:
# lm(formula = r ~ 1 + x1 * x2, data = mydata)
#
# Residuals:
# Min 1Q Median 3Q Max
# -109.238 -22.457 -1.769 24.784 111.056
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -9.983e+01 3.402e+01 -2.934 0.0136 *
# x1 3.926e-02 2.622e-02 1.498 0.1623
# x2 2.202e+00 2.362e-01 9.323 1.48e-06 ***
# x1:x2 -3.719e-04 1.294e-04 -2.875 0.0151 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 62.78 on 11 degrees of freedom
# Multiple R-squared: 0.8939, Adjusted R-squared: 0.865
# F-statistic: 30.9 on 3 and 11 DF, p-value: 1.178e-05
white检验表明x2与方差存在较高相关性。x2是导致异方差的主要来源·
方法3,BP检验
OLS_white = lm(r~1+x1+x2,data = mydata)
summary(OLS_white)
结果如下
# Call:
# lm(formula = r ~ 1 + x1 + x2, data = mydata)
#
# Residuals:
# Min 1Q Median 3Q Max
# -144.911 -27.066 2.737 40.910 121.507
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -47.57546 36.44256 -1.305 0.2162
# x1 -0.03204 0.01077 -2.974 0.0116 *
# x2 1.84539 0.25462 7.248 1.02e-05 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 79.55 on 12 degrees of freedom
# Multiple R-squared: 0.8142, Adjusted R-squared: 0.7832
# F-statistic: 26.29 on 2 and 12 DF, p-value: 4.114e-05
与white检验类似
接下来对异方差进行处理,常用的方法是采用加权最小二乘法。对于残差波动较大的观测值赋予较小的权重,对于残差较小观测赋予较大权重,与处理存在异常值的稳健回归的思想类似。由于x2是异方差主要来源,因此权重设定为x2的幂函数,即 1 / x 2 α 1/x_2^\alpha 1/x2α,参数 α \alpha α通过模型目的或指标进行挑选。
# 权重设定,例如α=2.5
w = 1/(mydata$x2)^2.5
OLS2 = lm(y~1+x1+x2,weights = w,data = mydata)
summary(OLS2)
# Call:
# lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
#
# Weighted Residuals:
# Min 1Q Median 3Q Max
# -0.42144 -0.31103 -0.01519 0.17582 0.81897
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -59.03487 53.82151 -1.097 0.2942
# x1 0.08145 0.03010 2.706 0.0191 *
# x2 2.48872 0.93144 2.672 0.0203 *
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.3814 on 12 degrees of freedom
# Multiple R-squared: 0.7485, Adjusted R-squared: 0.7065
# F-statistic: 17.85 on 2 and 12 DF, p-value: 0.0002533
OLS与WLS结果对比
# 结果对比
#install.packages("stargazer")
library("stargazer")
# 默认latex
stargazer(OLS1,OLS2)
stargazer(OLS1,OLS2,type = "text")
Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
% Date and time: 周五, 10月 22, 2021 - 12:57:40
\begin{table}[!htbp] \centering
\caption{}
\label{}
\begin{tabular}{@{\extracolsep{5pt}}lcc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{2}{c}{\textit{Dependent variable:}} \\
\cline{2-3}
\\[-1.8ex] & \multicolumn{2}{c}{y} \\
\\[-1.8ex] & (1) & (2)\\
\hline \\[-1.8ex]
x1 & 0.085$^{**}$ & 0.081$^{**}$ \\
& (0.033) & (0.030) \\
& & \\
x2 & 2.668$^{***}$ & 2.489$^{**}$ \\
& (0.780) & (0.931) \\
& & \\
Constant & $-$82.223 & $-$59.035 \\
& (111.587) & (53.822) \\
& & \\
\hline \\[-1.8ex]
Observations & 15 & 15 \\
R$^{2}$ & 0.729 & 0.748 \\
Adjusted R$^{2}$ & 0.684 & 0.707 \\
Residual Std. Error (df = 12) & 243.593 & 0.381 \\
F Statistic (df = 2; 12) & 16.160$^{***}$ & 17.853$^{***}$ \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{2}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
>
将上述代码放入到latex相关编译器运行即可,下面是文本形式
# ==========================================================
# Dependent variable:
# ----------------------------
# y
# (1) (2)
# ----------------------------------------------------------
# x1 0.085** 0.081**
# (0.033) (0.030)
#
# x2 2.668*** 2.489**
# (0.780) (0.931)
#
# Constant -82.223 -59.035
# (111.587) (53.822)
#
# ----------------------------------------------------------
# Observations 15 15
# R2 0.729 0.748
# Adjusted R2 0.684 0.707
# Residual Std. Error (df = 12) 243.593 0.381
# F Statistic (df = 2; 12) 16.160*** 17.853***
# ==========================================================
# Note: *p<0.1; **p<0.05; ***p<0.01
现在通过遍历得到系列加权ols,比较R2选择最优模型
result = NULL
d = seq(1,4,0.5)
for (i in d){
w = 1/(mydata$x2)^i
n = (i-0.5)/0.5
result[[n]] = lm(y~1+x1+x2,weights = w,data = mydata)
cat("==========第",n,"个回归===============")
print(summary(result[[n]]))
}
结果有7个
==========第 1 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-25.966 -8.332 -0.465 6.064 34.187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.81289 67.33876 -1.007 0.33379
x1 0.08390 0.03008 2.789 0.01637 *
x2 2.56948 0.76343 3.366 0.00561 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 16.23 on 12 degrees of freedom
Multiple R-squared: 0.7859, Adjusted R-squared: 0.7502
F-statistic: 22.03 on 2 and 12 DF, p-value: 9.629e-05
==========第 2 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-6.2014 -3.0223 -0.1551 1.8806 8.4148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -63.68727 57.93664 -1.099 0.29322
x1 0.08357 0.02952 2.831 0.01514 *
x2 2.52927 0.79761 3.171 0.00805 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.412 on 12 degrees of freedom
Multiple R-squared: 0.7888, Adjusted R-squared: 0.7535
F-statistic: 22.4 on 2 and 12 DF, p-value: 8.886e-05
==========第 3 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.48714 -1.04040 -0.04757 0.57733 2.39052
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -60.80776 54.08594 -1.124 0.2829
x1 0.08288 0.02959 2.801 0.0160 *
x2 2.50010 0.85613 2.920 0.0128 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.262 on 12 degrees of freedom
Multiple R-squared: 0.7761, Adjusted R-squared: 0.7388
F-statistic: 20.8 on 2 and 12 DF, p-value: 0.0001259
==========第 4 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.42144 -0.31103 -0.01519 0.17582 0.81897
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -59.03487 53.82151 -1.097 0.2942
x1 0.08145 0.03010 2.706 0.0191 *
x2 2.48872 0.93144 2.672 0.0203 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3814 on 12 degrees of freedom
Multiple R-squared: 0.7485, Adjusted R-squared: 0.7065
F-statistic: 17.85 on 2 and 12 DF, p-value: 0.0002533
==========第 5 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.147015 -0.088932 -0.005241 0.054579 0.281038
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.98229 55.26792 -1.049 0.3148
x1 0.07909 0.03070 2.576 0.0243 *
x2 2.49388 1.00874 2.472 0.0294 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1211 on 12 degrees of freedom
Multiple R-squared: 0.7084, Adjusted R-squared: 0.6598
F-statistic: 14.58 on 2 and 12 DF, p-value: 0.0006143
==========第 6 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.051320 -0.023101 -0.001766 0.018498 0.096651
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -56.73416 56.89965 -0.997 0.338
x1 0.07591 0.03107 2.443 0.031 *
x2 2.50036 1.07254 2.331 0.038 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03985 on 12 degrees of freedom
Multiple R-squared: 0.6621, Adjusted R-squared: 0.6057
F-statistic: 11.75 on 2 and 12 DF, p-value: 0.00149
==========第 7 个回归===============
Call:
lm(formula = y ~ 1 + x1 + x2, data = mydata, weights = w)
Weighted Residuals:
Min 1Q Median 3Q Max
-0.017860 -0.005890 -0.000502 0.006249 0.033352
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -54.35365 57.82320 -0.940 0.3658
x1 0.07219 0.03108 2.323 0.0386 *
x2 2.48902 1.11469 2.233 0.0454 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01344 on 12 degrees of freedom
Multiple R-squared: 0.6168, Adjusted R-squared: 0.5529
F-statistic: 9.656 on 2 and 12 DF, p-value: 0.003169
结果比较
stargazer(result[[1]],result[[2]],result[[3]],
result[[4]],result[[5]],result[[6]],
result[[7]],type = "text")
# ==================================================================================================
# Dependent variable:
# --------------------------------------------------------------------
# y
# (1) (2) (3) (4) (5) (6) (7)
# --------------------------------------------------------------------------------------------------
# x1 0.084** 0.084** 0.083** 0.081** 0.079** 0.076** 0.072**
# (0.030) (0.030) (0.030) (0.030) (0.031) (0.031) (0.031)
#
# x2 2.569*** 2.529*** 2.500** 2.489** 2.494** 2.500** 2.489**
# (0.763) (0.798) (0.856) (0.931) (1.009) (1.073) (1.115)
#
# Constant -67.813 -63.687 -60.808 -59.035 -57.982 -56.734 -54.354
# (67.339) (57.937) (54.086) (53.822) (55.268) (56.900) (57.823)
#
# --------------------------------------------------------------------------------------------------
# Observations 15 15 15 15 15 15 15
# R2 0.786 0.789 0.776 0.748 0.708 0.662 0.617
# Adjusted R2 0.750 0.754 0.739 0.707 0.660 0.606 0.553
# Residual Std. Error (df = 12) 16.225 4.412 1.262 0.381 0.121 0.040 0.013
# F Statistic (df = 2; 12) 22.025*** 22.403*** 20.802*** 17.853*** 14.579*** 11.754*** 9.656***
# ==================================================================================================
# Note: *p<0.1; **p<0.05; ***p<0.01
发现结果(2)R2比较大,优先选择参数 α = 1.5 \alpha = 1.5 α=1.5,即权重 w = 1 / x 2 1.5 w = 1/x_2^{1.5} w=1/x21.5.