文章目录
第14章 模型诊断
14.2 残差
本例使用的数据前五行
neighborhood type units year_built sq_ft income \
0 FINANCIAL R9-CONDOMINIUM 42 1920.0 36500 1332615
1 FINANCIAL R4-CONDOMINIUM 78 1985.0 126420 6633257
2 FINANCIAL RR-CONDOMINIUM 500 NaN 554174 17310000
3 FINANCIAL R4-CONDOMINIUM 282 1930.0 249076 11776313
4 TRIBECA R4-CONDOMINIUM 239 1985.0 219495 10004582
income_per_sq_ft expense expense_per_sq_ft net_income value \
0 36.51 342005 9.37 990610 7300000
1 52.47 1762295 13.94 4870962 30690000
2 31.24 3543000 6.39 13767000 90970000
3 47.28 2784670 11.18 8991643 67556006
4 45.58 2783197 12.68 7221385 54320996
value_per_sq_ft boro
0 200.00 Manhattan
1 242.76 Manhattan
2 164.15 Manhattan
3 271.23 Manhattan
4 247.48 Manhattan
- 数据进行多元线性回归拟合
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
house1 = smf.glm(formula='value_per_sq_ft~units + sq_ft + boro', data = housing).fit()
print (house1.summary())
- 得到回归的结果
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: value_per_sq_ft No. Observations: 2626
Model: GLM Df Residuals: 2619
Model Family: Gaussian Df Model: 6
Link Function: identity Scale: 1879.5
Method: IRLS Log-Likelihood: -13621.
Date: Thu, 15 Jul 2021 Deviance: 4.9224e+06
Time: 14:50:04 Pearson chi2: 4.92e+06
No. Iterations: 3
Covariance Type: nonrobust
=========================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 43.2909 5.330 8.122 0.000 32.845 53.737
boro[T.Brooklyn] 34.5621 5.535 6.244 0.000 23.714 45.411
boro[T.Manhattan] 130.9924 5.385 24.327 0.000 120.439 141.546
boro[T.Queens] 32.9937 5.663 5.827 0.000 21.895 44.092
boro[T.Staten Island] -3.6303 9.993 -0.363 0.716 -23.216 15.956
units -0.1881 0.022 -8.511 0.000 -0.231 -0.145
sq_ft 0.0002 2.09e-05 10.079 0.000 0.000 0.000
=========================================================================================
1.残差图
import seaborn as sns
import matplotlib.pyplot as plt
fig,ax = plt.subplots()
axm= sns.regplot(x=house1.fittedvalues, y=house1.resid_deviance, fit_reg = False)
plt.show()
得到结果
残差图的效果非常不好,包含了明显的集中分布,而正态分布的残差图应该是随机分布在直线0两侧,为了更清楚残差图是如何分布的,需要对不同的区域进行着色
res_df = pd.DataFrame({
'fittedvalues':house1.fittedvalues,
'resid_deviance':house1.resid_deviance,
'boro':housing['boro']
})
fig = sns.lmplot(x='fittedvalues', y='resid_deviance', data= res_df, hue = 'boro', fit_reg =False)
plt.show()
2.Q-Q图
Q-Q图是用来判断数据是否成正态分布,当图形近似一条直线时可以服从正态分布
from scipy import stats
resid = house1.resid_deviance.copy()
resid_std = stats.zscore(resid)
fig = statsmodels.graphics.gofplots.qqplot(resid,line="r")
plt.show()
3.直方图
除了Q-Q图,直方图作为描述统计的一种方法也可用作判断数据是否服从正态分布
fig, ax = plt.subplots()
ax= sns.displot(resid_std)
plt.show()
14.3 比较多个模型
需要从多个模型中挑选出一个表现较好的模型,该如何进行了?
14.3.1 比较线性模型
+向模型中添加协变量,而有些是用*添加的,表示变量之间不是独立的,变量不是简单的相加。
f1 = 'value_per_sq_ft~units + sq_ft + boro'
f2 = 'value_per_sq_ft~units * sq_ft + boro'
f3 = 'value_per_sq_ft~units + sq_ft * boro + type'
f4 = 'value_per_sq_ft~units + sq_ft * boro + sq_ft*type'
f5 = 'value_per_sq_ft~ boro + type'
house1 = smf.ols(f1, data= housing).fit()
house2 = smf.ols(f2, data= housing).fit()
house3 = smf.ols(f3, data= housing).fit()
house4 = smf.ols(f4, data= housing).fit()
house5 = smf.ols(f5, data= housing).fit()
#针对所有模型,获取所有系数和相关模型
mod_results = pd.concat([house1.params,house2.params,house3.params,house4.params,house5.params], axis= 1).\
rename(columns = lambda x: 'house' + str(x+1)).\
reset_index().\
rename(columns = {'index':'param'}).\
melt(id_vars='param', var_name='model', value_name ='estimate')
print (mod_results.head())
获得所有模型的系数和相关模型
param model estimate
0 Intercept house1 43.290863
1 boro[T.Brooklyn] house1 34.562150
2 boro[T.Manhattan] house1 130.992363
3 boro[T.Queens] house1 32.993674
4 boro[T.Staten Island] house1 -3.630251
将系数绘制出来,直观展示如何绘制彼此参数
fig, ax = plt.subplots()
ax = sns.pointplot(x = 'estimate', y = "param", hue= 'model', data = mod_results,dodge= True, join = False)
plt.tight_layout()
plt.show()
- Q: 如何分析这个图?不太明白
1.方差分析
model_names = ['house1', 'house2', 'house3', 'house4','house5']
house_anova = statsmodels.stats.anova.anova_lm(house1,house2,house3,house4,house5)
house_anova.index = model_names
print (house_anova)
输出结果
df_resid ssr df_diff ss_diff F \
house1 2619.0 4.922389e+06 0.0 NaN NaN
house2 2618.0 4.884872e+06 1.0 37517.437605 20.039049
house3 2612.0 4.619926e+06 6.0 264945.539994 23.585728
house4 2609.0 4.576671e+06 3.0 43255.441192 7.701289
house5 2618.0 4.901463e+06 -9.0 -324791.847907 19.275539
Pr(>F)
house1 NaN
house2 7.912333e-06
house3 2.754431e-27
house4 4.025581e-05
house5 NaN
2.赤池信息量准则(AIC)和贝叶斯准则(BIC)
house_models = [house1, house2, house3, house4, house5]
house_aic = list(map(statsmodels.regression.linear_model.RegressionResults.aic,house_models))
house_bic = list(map(statsmodels.regression.linear_model.RegressionResults.bic,house_models))
abic = pd.DataFrame({'model':model_names, 'aic':house_aic, 'bic':house_bic})
- Q:一直报错也不知道为啥