1. Introduction
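An earlier post gave a brief introduction to regression diagnostics. This post shows how to compute the quantities involved in Python, using statsmodels.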
2. Computation
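import statsmodels.api as sm
# Use the Boston housing data as the example.
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
X, y = load_boston(return_X_y=True)
# Add a column of ones for the intercept
X = sm.add_constant(X)
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2021)
# Fit the linear regression model
ols = sm.OLS(y_train, X_train)
models = ols.fit()
# Fitted values on the training set
y_predict = models.predict(X_train)
# Influence and outlier diagnostics for the fitted model
outliers = models.get_influence()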
2.1 Computing residuals
resid = y_train - y_predict  # (n_samples,), same values as models.resid
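2.2 Computing studentized residuals
resids1 = outliers.resid_studentized_external # (n_samples,)
or
resids2 = outliers.resid_studentized_internal # (n_samples,)
The two differ in how the error variance is estimated. Internally studentized residuals use the variance estimate from the full fit, while externally studentized residuals use the estimate from a fit with the i-th sample left out. For most samples the two values are close; they diverge mainly for samples with large residuals.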
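To make the difference concrete, here is a minimal numpy-only sketch of both definitions on synthetic data. The function name `studentized_residuals` and the data are made up for illustration; this is not how statsmodels implements the attributes, only the textbook formulas they correspond to.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (internal, external) studentized residuals of an OLS fit.

    X is assumed to already contain the constant column and to have
    full column rank.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages: diagonal of the hat matrix, computed from the thin QR factor
    q, _ = np.linalg.qr(X)
    h = np.sum(q ** 2, axis=1)
    # Internal: error variance estimated from the full fit
    s2 = resid @ resid / (n - p)
    internal = resid / np.sqrt(s2 * (1 - h))
    # External: error variance estimated with sample i left out
    s2_loo = (resid @ resid - resid ** 2 / (1 - h)) / (n - p - 1)
    external = resid / np.sqrt(s2_loo * (1 - h))
    return internal, external

# Synthetic data (illustrative only) with one planted outlier
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
y[0] += 8.0
internal, external = studentized_residuals(X, y)
print(internal[:3], external[:3])
```

Whenever the internal value exceeds 1 in absolute value, the external value is larger in magnitude, because the outlying sample no longer inflates its own variance estimate.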
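2.3 Plotting the residuals
import matplotlib.pyplot as plt
# Externally studentized residuals against the fitted values
plt.scatter(y_predict, resids1)
plt.xlabel('y_predict')
plt.ylabel('studentized resid')
plt.yticks(range(-5, 6))
# Reference lines at +2 and -2; points outside them deserve a closer look
plt.axhline(y=2, color='r', linestyle='--')
plt.axhline(y=-2, color='r', linestyle='--')
plt.show()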
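2.4 Computing Cook's distance
# cooks_distance returns a tuple: the distances and their p-values
cook, cook_pvalues = outliers.cooks_distance # each (n_samples,)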
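As a rough illustration of what Cook's distance measures, the sketch below computes it from the residuals and leverages with numpy only. The function name `cooks_distance`, the synthetic data, and the 4/n cutoff are assumptions for the example; 4/n is only one common rule of thumb.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every sample of an OLS fit (X includes the constant)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages from the thin QR factor of X
    q, _ = np.linalg.qr(X)
    h = np.sum(q ** 2, axis=1)
    s2 = resid @ resid / (n - p)
    # Equivalent to r_i^2 / p * h_i / (1 - h_i), with r_i the internally
    # studentized residual
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2

# Synthetic data (illustrative only); sample 0 is an outlier with high leverage
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
X[0, 1] = 4.0
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
y[0] += 10.0
n, p = X.shape
d = cooks_distance(X, y)
# Assumed rule-of-thumb cutoff, not a formal test
flagged = np.flatnonzero(d > 4 / n)
print(flagged)
```

A sample gets a large Cook's distance only when it combines a sizeable residual with non-trivial leverage, which is why it is often read together with the two quantities above.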
2.5 Hat matrix
# Diagonal of the hat matrix, i.e. the leverage of each sample
h = outliers.hat_matrix_diag # (n_samples,)
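For reference, the leverages are the diagonal of H = X (X'X)^-1 X'. The following numpy-only sketch computes them without forming the full matrix. The function name `leverage`, the synthetic data, and the 2p/n cutoff are assumptions for illustration; 2p/n is a common convention rather than a strict rule.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix for a full-column-rank design matrix X."""
    # With X = QR (thin QR), H = Q Q', so h_i is the squared norm of row i of Q
    q, _ = np.linalg.qr(X)
    return np.sum(q ** 2, axis=1)

# Synthetic design matrix (illustrative only); row 0 lies far from the rest
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
X[0, 1:] = [6.0, -6.0]
n, p = X.shape
h = leverage(X)
# Assumed rule-of-thumb cutoff for "high leverage"
high = np.flatnonzero(h > 2 * p / n)
print(high, h.sum())
```

The leverages lie between 0 and 1 and sum to the number of columns of X, so the average leverage is p/n.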
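2.6 DFFITS values
# dffits returns a tuple: the values and a suggested cutoff
dffits, dffits_threshold = outliers.dffits
dffits # (n_samples,)
dffits_threshold # scalar, the cutoff 2 * sqrt(k_vars / n_samples)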
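A minimal numpy-only sketch of the same two quantities, assuming the usual textbook definition of DFFITS (the externally studentized residual scaled by sqrt(h / (1 - h))). The function name `dffits_values` and the synthetic data are made up for the example.

```python
import numpy as np

def dffits_values(X, y):
    """Return (DFFITS per sample, cutoff) for an OLS fit (X includes the constant)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Leverages from the thin QR factor of X
    q, _ = np.linalg.qr(X)
    h = np.sum(q ** 2, axis=1)
    # Externally studentized residuals (leave-one-out variance estimate)
    s2_loo = (resid @ resid - resid ** 2 / (1 - h)) / (n - p - 1)
    t = resid / np.sqrt(s2_loo * (1 - h))
    values = t * np.sqrt(h / (1 - h))
    threshold = 2 * np.sqrt(p / n)
    return values, threshold

# Synthetic data (illustrative only); sample 0 is an influential outlier
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
X[0, 1] = 3.0
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
y[0] += 8.0
values, threshold = dffits_values(X, y)
# Samples whose own fitted value shifts a lot when they are left out
influential = np.flatnonzero(np.abs(values) > threshold)
print(threshold, influential)
```

Comparing the absolute values against the cutoff is how the two elements of the tuple are usually used together.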