Chapter 5 Qualitative Variables as Predictors

5.0 Libraries Needed Before Running the Code

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

5.1 Introduction

  Qualitative variables can be represented by indicator (dummy) variables. Such a variable usually takes the value 0 or 1, showing which of two possible categories an observation belongs to; indicator variables can also be used to distinguish three or more categories.
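As a minimal sketch of how a multi-category qualitative variable becomes 0/1 columns, pandas' get_dummies can encode a toy three-level factor (the data and names here are illustrative, not from the salary survey):

import pandas as pd

edu = pd.Series([1, 2, 3, 1, 2], name='E')  # hypothetical three-level factor
# One 0/1 indicator column per level: E_1, E_2, E_3
print(pd.get_dummies(edu, prefix='E', dtype=float))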

5.2 Salary Survey Data

(1) Purpose: identify and quantify the variables that determine salary differences; the analysis can also be used to check whether the company's salary administration policy is being followed.
(2) The response variable is salary S. The predictors are years of experience X, education E [1 = high school, 2 = bachelor's degree, 3 = advanced degree], and management status M [1 = management position, 0 = other positions].
Two indicator variables represent the three education levels; they allow us to measure the effect of education on salary differences.

data1=pd.read_csv(r'F:\linear  regression\chapter 5\Table 5.1.csv')  # raw string avoids backslash-escape issues
data1.columns=['S','X','E','M']  # X: years of experience, E: education, M: management indicator
E=data1['E']
X=data1['X']

We now need a quantitative representation of the qualitative variables in the table. Only one indicator variable is needed to distinguish the two management categories: the category whose indicator equals 0 is the base (reference) category, because an indicator variable's regression coefficient measures the increment relative to that base category.
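A sketch of the base-category idea using get_dummies with drop_first=True, which drops the first level so each remaining column is measured against that base (toy data, illustrative names):

import pandas as pd

m = pd.Series([0, 1, 1, 0], name='M')  # hypothetical two-level factor
# drop_first=True keeps a single column for two categories;
# the dropped level (M == 0) becomes the base category
print(pd.get_dummies(m, prefix='M', drop_first=True, dtype=float))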

# Dummy (indicator) variables
# First convert the qualitative variable E into dummy variables
dummies=pd.get_dummies(data1['E'],'E')  # one 0/1 column per education level: E_1, E_2, E_3
# Insert the first two dummy columns; E == 3 serves as the base category
data1.insert(2,'E_1',dummies.iloc[:,0])  # first dummy column (E == 1)
data1.insert(3,'E_2',dummies.iloc[:,1])  # second dummy column (E == 2)
data1.insert(1,'constant',1)
# Drop the original categorical column
data1=data1.drop('E',axis=1)  # education is now represented by the 0/1 dummies E_1 and E_2
data1

Table 5.3 Regression Analysis of Salary Survey Data

ols_1=(sm.OLS(data1['S'],data1[data1.columns[1:6]].astype(float))).fit()
print(ols_1.summary())
           OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.957
Model:                            OLS   Adj. R-squared:                  0.953
Method:                 Least Squares   F-statistic:                     226.8
Date:                Wed, 13 Dec 2023   Prob (F-statistic):           2.23e-27
Time:                        21:59:45   Log-Likelihood:                -381.63
No. Observations:                  46   AIC:                             773.3
Df Residuals:                      41   BIC:                             782.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant    1.103e+04    383.217     28.787      0.000    1.03e+04    1.18e+04
X            546.1840     30.519     17.896      0.000     484.549     607.819
E_1        -2996.2103    411.753     -7.277      0.000   -3827.762   -2164.659
E_2          147.8249    387.659      0.381      0.705    -635.069     930.719
M           6883.5310    313.919     21.928      0.000    6249.559    7517.503
==============================================================================
Omnibus:                        2.293   Durbin-Watson:                   2.237
Prob(Omnibus):                  0.318   Jarque-Bera (JB):                1.362
Skew:                          -0.077   Prob(JB):                        0.506
Kurtosis:                       2.171   Cond. No.                         35.4
==============================================================================
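The same model can also be fitted without building the dummies by hand, via the formula interface; a sketch (note that C(E) applies treatment coding with the lowest level E = 1 as base, unlike the E = 3 base used above, so the coefficients differ but the fit is the same):

import statsmodels.formula.api as smf

# hypothetical re-read of the raw table, before E was replaced by dummies
raw = pd.read_csv(r'F:\linear  regression\chapter 5\Table 5.1.csv')
raw.columns = ['S', 'X', 'E', 'M']
print(smf.ols('S ~ X + C(E) + M', data=raw).fit().params)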
from scipy.stats import t,f
t(41).isf(0.025)  # two-sided 5% critical t value, 41 residual d.f.
2.019540963982894
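variance_inflation_factor was imported at the top but has not been used yet; as a quick collinearity check on this design matrix, a sketch:

# VIF for each predictor of ols_1 (constant kept in the design but not reported)
cols = ['constant', 'X', 'E_1', 'E_2', 'M']
exog = data1[cols].astype(float).values
for i, name in enumerate(cols):
    if name == 'constant':
        continue
    print(name, variance_inflation_factor(exog, i))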

5.3 Interaction Variables

A model specification problem

Figure 5.1 Standardized residuals versus years of experience

   The first thing to notice is that the plot shows three or more distinct levels (bands) of residuals.

outliers=ols_1.get_influence()
res1=outliers.resid_studentized_internal
plt.scatter(data1['X'],res1)
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Figure 5.1')
plt.show()

Figure 5.2 Standardized residuals versus the E-M category combinations

There are six combinations. The plot makes it clear that within each category the residuals are either almost all positive or almost all negative, which indicates that the model does not adequately explain the relationship between salary and the experience, education, and management variables [in short: some structure hidden in the data has not yet been uncovered].

em=-1+2*E+data1['M']  # maps the six (E, M) combinations to categories 1-6
plt.scatter(em,res1)
plt.xlabel('Category')
plt.ylabel('Residuals')
plt.title('Figure 5.2')
plt.show()

Table 5.4 The Expanded Model

A model in which education and management are no longer additive, but which admits an interaction effect between the two variables:

data1['E1M']=data1['E_1']*data1['M']  # interaction terms between the education dummies and management
data1['E2M']=data1['E_2']*data1['M']
data1

(1) To get an initial look at the expanded model, first run the regression analysis:

ols_2=(sm.OLS(data1['S'],data1[data1.columns[1:]].astype(float))).fit()
print(ols_2.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     5517.
Date:                Wed, 13 Dec 2023   Prob (F-statistic):           1.67e-55
Time:                        21:59:53   Log-Likelihood:                -298.74
No. Observations:                  46   AIC:                             611.5
Df Residuals:                      39   BIC:                             624.3
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant     1.12e+04     79.065    141.698      0.000     1.1e+04    1.14e+04
X            496.9870      5.566     89.283      0.000     485.728     508.246
E_1        -1730.7483    105.334    -16.431      0.000   -1943.806   -1517.690
E_2         -349.0777     97.568     -3.578      0.001    -546.427    -151.728
M           7047.4120    102.589     68.695      0.000    6839.906    7254.918
E1M        -3066.0351    149.330    -20.532      0.000   -3368.084   -2763.986
E2M         1836.4879    131.167     14.001      0.000    1571.177    2101.799
==============================================================================
Omnibus:                       74.761   Durbin-Watson:                   2.244
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1037.873
Skew:                          -4.103   Prob(JB):                    4.25e-226
Kurtosis:                      24.776   Cond. No.                         83.4
==============================================================================


Figure 5.3 Standardized residuals versus years of experience, expanded model

One point stands out: the model over-predicts the salary there. Checking this observation in the original data shows that this person earns a salary a few hundred lower per year than others with similar characteristics. To make sure this single observation does not unduly influence the regression estimates, we delete it and refit the model.

n=len(X)
index=np.arange(1,n+1)  # observation numbers 1..46 (np.arange(1,46) would miss the last one)
outliers=ols_2.get_influence()
res2=outliers.resid_studentized_internal
# Flag outliers: points with |standardized residual| > 2 are drawn in red and annotated
col=[]
for i in range(0,n):
    if abs(res2[i])>2:
        col.append('r')
        plt.annotate(index[i],xy=(X[i],res2[i]))  # label the point with its observation number
    else:
        col.append('b')
plt.scatter(X,res2,c=col)
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Figure 5.3: standardized residuals, expanded model')
plt.show()
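statsmodels also provides a built-in outlier check; a sketch using Bonferroni-adjusted externally studentized residuals, as an alternative to the hand-rolled loop above:

# columns: student_resid, unadj_p, bonf(p); a small bonf(p) flags an outlier
print(ols_2.outlier_test().sort_values('bonf(p)').head())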

The outlier is observation 33, which needs to be deleted.

data1_1=data1.drop(32,axis=0)  # drop row label 32, i.e., the 33rd observation
ols_3=(sm.OLS(data1_1['S'],data1_1[data1_1.columns[1:]].astype(float))).fit()
print(ols_3.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.543e+04
Date:                Wed, 13 Dec 2023   Prob (F-statistic):           1.30e-69
Time:                        21:59:56   Log-Likelihood:                -249.34
No. Observations:                  45   AIC:                             512.7
Df Residuals:                      38   BIC:                             525.3
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant     1.12e+04     30.533    366.802      0.000    1.11e+04    1.13e+04
X            498.4178      2.152    231.640      0.000     494.062     502.774
E_1        -1741.3359     40.683    -42.803      0.000   -1823.693   -1658.979
E_2         -357.0423     37.681     -9.475      0.000    -433.324    -280.761
M           7040.5801     39.619    177.707      0.000    6960.376    7120.785
E1M        -3051.7633     57.674    -52.914      0.000   -3168.519   -2935.008
E2M         1997.5306     51.785     38.574      0.000    1892.697    2102.364
==============================================================================
Omnibus:                        1.462   Durbin-Watson:                   2.438
Prob(Omnibus):                  0.482   Jarque-Bera (JB):                1.171
Skew:                           0.177   Prob(JB):                        0.557
Kurtosis:                       2.293   Cond. No.                         82.5
==============================================================================

Figure 5.4

Standardized residuals versus X for the expanded model with observation 33 deleted. The new residual plot is far more satisfactory than the comparable plot for the original additive model.

outliers=ols_3.get_influence()
res3=outliers.resid_studentized_internal
plt.scatter(data1_1['X'],res3)
plt.xlabel('X')
plt.ylabel('Residuals')
plt.title('Figure 5.4')
plt.show()

Figure 5.5 Standardized residuals versus the E-M category combinations (expanded model, observation 33 deleted)

Within each combination category the residuals are now distributed symmetrically around zero.

E=E.drop(32)  # keep E aligned with the data after deleting observation 33
EM=-1+2*E+data1_1['M']  # six E-M categories, coded 1-6
plt.scatter(EM,res3)
plt.xlabel('Category',size=13)
plt.ylabel('Residuals',size=13)
plt.title('Figure 5.5',size=15)
plt.show()

Compute the critical t value and the predictions for each category

from scipy.stats import t,f
t(38).isf(0.025)  # two-sided 5% critical t value, 38 residual d.f.
2.0243941645751367
beta=ols_3.params
# Design points for the six E-M categories at X = 0; the columns of each row
# are [constant, X, E_1, E_2, M, E1M, E2M]
x0=np.vstack([[1,1,1,1,1,1],[0,0,0,0,0,0],[1,1,0,0,0,0],[0,0,1,1,0,0],[0,1,0,1,0,1],[0,1,0,0,0,0],[0,0,0,1,0,0]]).T
yuce=x0.dot(beta)  # predicted base salary for each category

# Standard error of prediction for each category
X=data1_1[['constant','X','E_1','E_2','M','E1M','E2M']].values
fit=ols_3.fittedvalues
sigma=((np.sum((data1_1['S']-fit)**2))/38)**0.5  # residual standard deviation (38 d.f.)
s_e=[]
for i in range(6):
    x01=x0[i,:].reshape(7,1)
    se=x01.T.dot(np.linalg.inv(X.T.dot(X))).dot(x01)
    se=((se+1)**0.5)*sigma  # the "+1" makes this a prediction (not mean) standard error
    s_e.insert(i,se.item())  # .item() extracts the scalar from the 1x1 array

# 95% prediction intervals for the six categories
upper=[]
lower=[]
for i in range(6):
    upper.insert(i,yuce[i]+t.isf(0.025,38)*s_e[i])
    lower.insert(i,yuce[i]-t.isf(0.025,38)*s_e[i])

table=pd.DataFrame()
table['category']=[1,2,3,4,5,6]
table['estimate']=yuce
table['se']=s_e
table['PI_lower']=lower  # prediction-interval bounds (the "+1" term above), not mean CIs
table['PI_upper']=upper
table
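As a cross-check (a sketch), statsmodels can produce the same point predictions and prediction intervals directly; obs_ci_lower / obs_ci_upper below are the prediction-interval bounds matching the hand computation above:

pred = ols_3.get_prediction(x0.astype(float))
print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])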

5.4 Systems of Regression Equations: Comparing Two Groups

5.4.0 Data Import and Required Libraries

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
data2=pd.read_csv(r'F:\linear  regression\chapter 5\Table 5.7.csv')  # raw string avoids backslash-escape issues
data2.columns=['TEST','RACE','JPERF']  # pre-employment test score, race indicator, job performance
RACE=data2['RACE']
JPERF=data2['JPERF']
data2.insert(1,'constant',1)
ols_4=(sm.OLS(data2['JPERF'],data2[['constant','TEST']].astype(float))).fit()
print(ols_4.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.517
Model:                            OLS   Adj. R-squared:                  0.490
Method:                 Least Squares   F-statistic:                     19.25
Date:                Wed, 13 Dec 2023   Prob (F-statistic):           0.000356
Time:                        22:09:05   Log-Likelihood:                -36.614
No. Observations:                  20   AIC:                             77.23
Df Residuals:                      18   BIC:                             79.22
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       1.0350      0.868      1.192      0.249      -0.789       2.859
TEST           2.3605      0.538      4.387      0.000       1.230       3.491
==============================================================================
Omnibus:                        0.324   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.850   Jarque-Bera (JB):                0.483
Skew:                          -0.186   Prob(JB):                        0.785
Kurtosis:                       2.336   Cond. No.                         5.26
==============================================================================
from prettytable import PrettyTable  # for building ANOVA-style tables
label1=['Source','Sum of Squares','d.f.','Mean Square']
table=PrettyTable(label1)
table.add_row(['Regression',ols_4.df_model*ols_4.mse_model,ols_4.df_model,ols_4.mse_model])
table.add_row(['Residuals',ols_4.ssr,ols_4.df_resid,ols_4.mse_resid])
table.add_column('F',[ols_4.fvalue,'--'])
print(table)


+------------+--------------------+------+--------------------+------------------+
|   Source   |   Sum of Squares   | d.f. |    Mean Square     |        F         |
+------------+--------------------+------+--------------------+------------------+
| Regression |  48.7229581716978  | 1.0  |  48.7229581716978  | 19.2461274204537 |
| Residuals  | 45.568296828302195 | 18.0 | 2.5315720460167888 |        --        |
+------------+--------------------+------+--------------------+------------------+
data2['RACE*TEST']=data2['RACE']*data2['TEST']  # interaction term: lets the TEST slope differ by RACE
ols_5=(sm.OLS(data2['JPERF'],data2[['constant','TEST','RACE','RACE*TEST']].astype(float))).fit()
print(ols_5.summary())
 OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.664
Model:                            OLS   Adj. R-squared:                  0.601
Method:                 Least Squares   F-statistic:                     10.55
Date:                Wed, 13 Dec 2023   Prob (F-statistic):           0.000451
Time:                        22:09:06   Log-Likelihood:                -32.971
No. Observations:                  20   AIC:                             73.94
Df Residuals:                      16   BIC:                             77.92
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       2.0103      1.050      1.914      0.074      -0.216       4.236
TEST           1.3134      0.670      1.959      0.068      -0.108       2.735
RACE          -1.9132      1.540     -1.242      0.232      -5.179       1.352
RACE*TEST      1.9975      0.954      2.093      0.053      -0.026       4.021
==============================================================================
Omnibus:                        3.377   Durbin-Watson:                   3.015
Prob(Omnibus):                  0.185   Jarque-Bera (JB):                1.330
Skew:                           0.120   Prob(JB):                        0.514
Kurtosis:                       1.760   Cond. No.                         13.8
==============================================================================
label2=['Source','Sum of Squares','d.f.','Mean Square']
table=PrettyTable(label2)
table.add_row(['Regression',ols_5.df_model*ols_5.mse_model,ols_5.df_model,ols_5.mse_model])
table.add_row(['Residuals',ols_5.ssr,ols_5.df_resid,ols_5.mse_resid])
table.add_column('F',[ols_5.fvalue,'--'])
print(table)
+------------+-------------------+------+-------------------+-------------------+
|   Source   |   Sum of Squares  | d.f. |    Mean Square    |         F         |
+------------+-------------------+------+-------------------+-------------------+
| Regression | 62.63578183766151 | 3.0  | 20.87859394588717 | 10.55291454406796 |
| Residuals  | 31.65547316233848 | 16.0 | 1.978467072646155 |         --        |
+------------+-------------------+------+-------------------+-------------------+


Figure 5.7 Model 1: standardized residuals versus test score

outliers=ols_4.get_influence()
res4=outliers.resid_studentized_internal
plt.scatter(data2['TEST'],res4)
plt.xlabel('TEST',size=13)
plt.ylabel('Residuals',size=13)
plt.title('Figure 5.7',size=15)
plt.show()

Figure 5.8 Model 3: standardized residuals versus test score

outliers=ols_5.get_influence()
res5=outliers.resid_studentized_internal
plt.scatter(data2['TEST'],res5)
plt.xlabel('TEST',size=13)
plt.ylabel('Residuals',size=13)
plt.title('Figure 5.8',size=15)
plt.show()

Figure 5.9 Model 1: standardized residuals versus RACE

plt.scatter(data2['RACE'],res4)
plt.xlabel('RACE')
plt.ylabel('Residuals')
plt.title('Figure 5.9')
plt.show()

from scipy.stats import t,f
f.isf(0.05,2,16)  # 5% critical value for the F(2, 16) test of Model 1 vs Model 3
3.63372346759163
# The residual sums of squares must come from the fits themselves; summing
# squared studentized residuals (sum(res4**2)) does not give the SSE.
SSE_4=ols_4.ssr  # Model 1 (pooled): 45.568296828302195
SSE_5=ols_5.ssr  # Model 3 (separate intercepts and slopes): 31.65547316233848
F=((SSE_4-SSE_5)/2)/(SSE_5/16)
print(F)  # ≈ 3.52, below the critical value 3.634: the pooled model is not rejected at the 5% level

# Separate within-group fits: rows 0-9 are the RACE == 1 group, rows 10-19 the RACE == 0 group
results_1=(sm.OLS(data2.iloc[0:10,3],data2.iloc[0:10,0:2])).fit()
print(results_1.summary())
results_2=(sm.OLS(data2.iloc[10:20,3],data2.iloc[10:20,0:2])).fit()
print(results_2.summary())
OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.779
Model:                            OLS   Adj. R-squared:                  0.751
Method:                 Least Squares   F-statistic:                     28.14
Date:                Wed, 13 Dec 2023   Prob (F-statistic):           0.000724
Time:                        22:20:37   Log-Likelihood:                -15.637
No. Observations:                  10   AIC:                             35.27
Df Residuals:                       8   BIC:                             35.88
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
TEST           3.3109      0.624      5.305      0.001       1.872       4.750
constant       0.0971      1.035      0.094      0.928      -2.290       2.484
==============================================================================
Omnibus:                        0.635   Durbin-Watson:                   2.411
Prob(Omnibus):                  0.728   Jarque-Bera (JB):                0.554
Skew:                          -0.161   Prob(JB):                        0.758
Kurtosis:                       1.893   Cond. No.                         5.55
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.293
Model:                            OLS   Adj. R-squared:                  0.205
Method:                 Least Squares   F-statistic:                     3.320
Date:                Wed, 13 Dec 2023   Prob (F-statistic):              0.106
Time:                        22:20:37   Log-Likelihood:                -17.211
No. Observations:                  10   AIC:                             38.42
Df Residuals:                       8   BIC:                             39.03
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
TEST           1.3134      0.721      1.822      0.106      -0.349       2.976
constant       2.0103      1.129      1.780      0.113      -0.593       4.614
==============================================================================
Omnibus:                        2.077   Durbin-Watson:                   2.880
Prob(Omnibus):                  0.354   Jarque-Bera (JB):                0.932
Skew:                           0.294   Prob(JB):                        0.627
Kurtosis:                       1.624   Cond. No.                         5.01
==============================================================================
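The two separate fits agree with Model 3 (ols_5): adding the RACE and RACE*TEST coefficients to the pooled intercept and slope reproduces the RACE == 1 group's equation. A quick sketch of that check:

b = ols_5.params
print(b['constant'] + b['RACE'])     # ≈ 0.0971, intercept of the RACE == 1 fit
print(b['TEST'] + b['RACE*TEST'])    # ≈ 3.3109, TEST slope of the RACE == 1 fit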

Figure 5.10 Standardized residuals versus TEST, RACE == 1 group

# Figure 5.10
outliers=results_1.get_influence()
res1=outliers.resid_studentized_internal
plt.scatter(data2.loc[data2['RACE']==1,'TEST'],res1)
plt.xlabel('TEST',size=13)
plt.ylabel('Residuals',size=13)
plt.title('Figure 5.10',size=15)
plt.show()


Figure 5.11 Standardized residuals versus TEST, RACE == 0 group

# Figure 5.11
outliers=results_2.get_influence()
res2=outliers.resid_studentized_internal
plt.scatter(data2.loc[data2['RACE']==0,'TEST'],res2)
plt.xlabel('TEST',size=13)
plt.ylabel('Residuals',size=13)
plt.title('Figure 5.11',size=15)
plt.show()


# Model 2: common TEST slope, intercepts allowed to differ by RACE
ols_6=(sm.OLS(data2['JPERF'],data2[['constant','TEST','RACE']].astype(float))).fit()
print(ols_6.summary())
 OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.572
Model:                            OLS   Adj. R-squared:                  0.522
Method:                 Least Squares   F-statistic:                     11.38
Date:                Sun, 17 Dec 2023   Prob (F-statistic):           0.000731
Time:                        19:18:40   Log-Likelihood:                -35.390
No. Observations:                  20   AIC:                             76.78
Df Residuals:                      17   BIC:                             79.77
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
constant       0.6120      0.887      0.690      0.500      -1.260       2.483
TEST           2.2988      0.522      4.400      0.000       1.197       3.401
RACE           1.0276      0.691      1.487      0.155      -0.430       2.485
==============================================================================
Omnibus:                        0.251   Durbin-Watson:                   3.028
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.437
Skew:                          -0.059   Prob(JB):                        0.804
Kurtosis:                       2.286   Cond. No.                         5.72
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


label3=['Source','Sum of Squares','d.f.','Mean Square']
table=PrettyTable(label3)
table.add_row(['Regression',ols_6.df_model*ols_6.mse_model,ols_6.df_model,ols_6.mse_model])
table.add_row(['Residuals',ols_6.ssr,ols_6.df_resid,ols_6.mse_resid])
table.add_column('F',[ols_6.fvalue,'--'])
print(table)
+------------+--------------------+------+-------------------+--------------------+
|   Source   |   Sum of Squares   | d.f. |    Mean Square    |         F          |
+------------+--------------------+------+-------------------+--------------------+
| Regression | 53.96970908443848  | 2.0  | 26.98485454221924 | 11.377106626278483 |
| Residuals  | 40.321545915561515 | 17.0 | 2.371855642091854 |         --         |
+------------+--------------------+------+-------------------+--------------------+
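To test whether the RACE intercept in Model 2 improves on the pooled Model 1, compare the two models with a one-restriction F test; a sketch (this F(1, 17) statistic equals the square of the RACE t-statistic, 1.487^2 ≈ 2.21):

from scipy.stats import f

F = (ols_4.ssr - ols_6.ssr) / (ols_6.ssr / 17)  # one restriction, 17 residual d.f. in Model 2
print(F, f.isf(0.05, 1, 17))  # ≈ 2.21 vs critical ≈ 4.45: the RACE intercept is not significant at 5%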