pandas: Linear Models

当星星也可以

Tags: programming languages, python, pandas

Article link: https://blog.csdn.net/qq_45286306/article/details/130881791

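1 Linear regression

Linear regression aims to describe the straight-line relationship between a response variable y and a predictor variable x.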
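To make the straight-line fit concrete, here is a minimal sketch of the closed-form least squares solution for one predictor. The toy numbers are made up for illustration and are not from the tips dataset.

```python
import numpy as np

# Hypothetical toy data: y is roughly 2*x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Closed-form least squares for a single predictor:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)
#   b = mean(y) - m * mean(x)
x_centered = x - x.mean()
m = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
b = y.mean() - m * x.mean()
print(m, b)

# Cross-check against NumPy's degree-1 polynomial fit (slope first, then intercept).
m_np, b_np = np.polyfit(x, y, deg=1)
print(m_np, b_np)
```

Both approaches should agree up to floating-point error; the libraries in the following sections compute the same quantities and add diagnostics on top.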

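1.1 The statsmodels library

The ols function specifies an ordinary least squares model. For the line y = mx + b (y is the response variable, x the predictor), the formula has two parts: y ~ x.
The fit method fits the model to the data.
The summary method displays the results.
The params attribute returns only the coefficients m and b.
The conf_int() method extracts the confidence intervals, which give the margin of error around each estimate.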

import pandas as pd
import seaborn as sns
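# Use the tips dataset as an example
tips = sns.load_dataset('tips')
print(tips.head())

# statsmodels library
import statsmodels.formula.api as smf
# ols specifies an ordinary least squares model (line y = mx + b; y is the response, x the predictor); the formula has two parts (y ~ x)
model = smf.ols(formula='tip ~ total_bill', data=tips)
results = model.fit()         # fit fits the model to the data

print(results.summary())      # summary shows the full results
print(results.params)         # params returns only the coefficients
print(results.conf_int())     # conf_int() extracts the confidence intervals (margin of error around each estimate)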

'''
                           OLS Regression Results                            
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.457
Model:                            OLS   Adj. R-squared:                  0.454
Method:                 Least Squares   F-statistic:                     203.4
Date:                Mon, 29 May 2023   Prob (F-statistic):           6.69e-34
Time:                        17:33:40   Log-Likelihood:                -350.54
No. Observations:                 244   AIC:                             705.1
Df Residuals:                     242   BIC:                             712.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9203      0.160      5.761      0.000       0.606       1.235
total_bill     0.1050      0.007     14.260      0.000       0.091       0.120
==============================================================================
Omnibus:                       20.185   Durbin-Watson:                   2.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.750
Skew:                           0.443   Prob(JB):                     6.35e-09
Kurtosis:                       4.711   Cond. No.                         53.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Intercept     0.920270
total_bill    0.105025

                   0         1
Intercept   0.605622  1.234918
total_bill  0.090517  0.119532
'''
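A fitted statsmodels result can also be used for prediction. The sketch below uses synthetic data as a stand-in for tips (the `demo` frame and its coefficients are assumptions for illustration) and shows `predict` on new rows plus the `alpha` argument of `conf_int`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data standing in for tips: tip = 1 + 0.1 * total_bill + noise.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"total_bill": rng.uniform(5, 50, size=200)})
demo["tip"] = 1.0 + 0.1 * demo["total_bill"] + rng.normal(0, 0.5, size=200)

demo_results = smf.ols(formula="tip ~ total_bill", data=demo).fit()

# predict() applies the formula to new rows; column names must match the formula.
new_bills = pd.DataFrame({"total_bill": [10.0, 20.0, 40.0]})
predicted_tips = demo_results.predict(new_bills)

# The same prediction by hand: intercept + slope * x.
by_hand = (demo_results.params["Intercept"]
           + demo_results.params["total_bill"] * new_bills["total_bill"])

# conf_int takes an alpha argument; alpha=0.10 gives 90% intervals (the default 0.05 gives 95%).
ci_90 = demo_results.conf_int(alpha=0.10)
print(predicted_tips)
print(ci_90)
```

The 90% intervals are narrower than the default 95% ones, since a lower confidence level needs a smaller margin of error.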

1.2 The sklearn library

In fit, specify the predictor X and the response y; note the capitalization.

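# After pip install sklearn, the import failed; running pip install sklearn again reported it as already installed
# The package name for sklearn is scikit-learn, so reinstall with pip install scikit-learn; after that, import sklearn works

from sklearn import linear_model
lr = linear_model.LinearRegression()     # create the LinearRegression object

# sklearn expects X as a 2-D array, so the data sometimes needs reshaping; running this statement raises an error about the shape of the input
#predicted = lr.fit(X=tips['total_bill'],y=tips['tip'])

# Fix: use reshape(-1,1) if the data has a single feature, or reshape(1,-1) if it contains a single sample
# Use the values attribute to get a NumPy array from a pandas Series; note uppercase X and lowercase y
predicted = lr.fit(X=tips['total_bill'].values.reshape(-1,1), y=tips['tip'])   # fit returns the fitted estimator

# The results match the statsmodels output above
print(predicted.coef_)         # the coefficient(s)
print(predicted.intercept_)    # the intercept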

'''
[0.10502452]
0.9202696135546731
'''
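The difference between reshape(-1,1) and reshape(1,-1) is easiest to see from the resulting shapes. A small sketch, using a short made-up Series in place of tips['total_bill']:

```python
import pandas as pd

# Hypothetical stand-in for tips['total_bill'].
bills = pd.Series([16.99, 10.34, 21.01, 23.68], name="total_bill")

flat = bills.values                  # 1-D array, which sklearn rejects as X
one_feature = flat.reshape(-1, 1)    # 4 samples, 1 feature
one_sample = flat.reshape(1, -1)     # 1 sample, 4 features

# Converting the Series to a one-column DataFrame also gives a 2-D input.
as_frame = bills.to_frame()

print(flat.shape, one_feature.shape, one_sample.shape, as_frame.shape)
# (4,) (4, 1) (1, 4) (4, 1)
```

For a regression on one column, the (n_samples, 1) layout is the one that matches what fit expects.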

2 Multiple regression

Multiple regression puts several predictors into the model.

2.1 Using statsmodels

import statsmodels.formula.api as smf

# In the formula argument of ols, use '+' to add covariates.
model = smf.ols(formula ='tip ~ total_bill+size',data=tips).fit()

print(model.summary())      
print(model.params)         
print(model.conf_int())    
'''
Intercept     0.668945
total_bill    0.092713
size          0.192598
'''

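2.2 Using statsmodels with categorical variables

statsmodels creates dummy variables automatically. To avoid multicollinearity, one dummy variable per categorical variable is dropped and serves as the reference level.

unique() returns the distinct values of a Series in order of first appearance; it does not sort them.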
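A quick sketch of how unique() orders its result on a pandas Series, using a small made-up Series. pandas keeps the order of first appearance, so an explicit sort is needed for sorted output.

```python
import pandas as pd

# Hypothetical values, in the order they happen to occur.
s = pd.Series(["Sun", "Sat", "Thur", "Sat", "Fri", "Sun"])

appearance_order = list(s.unique())   # order of first appearance
sorted_order = sorted(s.unique())     # sort explicitly when needed

print(appearance_order)   # ['Sun', 'Sat', 'Thur', 'Fri']
print(sorted_order)       # ['Fri', 'Sat', 'Sun', 'Thur']
```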

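# Everything above used continuous predictors. How are categorical variables in the data handled?
print(tips.info())
# unique() returns the distinct values in order of appearance (not sorted)
print(tips.sex.unique())      # the distinct values of sex
# statsmodels creates the dummy variables automatically and drops one per variable as the reference level, e.g. the dummy for Male
# The model below uses all the variables; see the results
# In the results, sex[T.Female] means: when sex changes from Male to Female, tip increases by 0.0324, other variables held fixed
model = smf.ols(
    formula='tip ~ total_bill + size + sex + smoker + day + time', data=tips).fit()
print(model.summary())
# Checking day shows that Thur is missing from the results above, i.e. Thur is the reference level.
print(tips.day.unique())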
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
None

['Female', 'Male']
Categories (2, object): ['Male', 'Female']

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.470
Model:                            OLS   Adj. R-squared:                  0.452
Method:                 Least Squares   F-statistic:                     26.06
Date:                Mon, 29 May 2023   Prob (F-statistic):           1.20e-28
Time:                        20:55:16   Log-Likelihood:                -347.48
No. Observations:                 244   AIC:                             713.0
Df Residuals:                     235   BIC:                             744.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.5908      0.256      2.310      0.022       0.087       1.095
sex[T.Female]      0.0324      0.142      0.229      0.819      -0.247       0.311
smoker[T.No]       0.0864      0.147      0.589      0.556      -0.202       0.375
day[T.Fri]         0.1623      0.393      0.412      0.680      -0.613       0.937
day[T.Sat]         0.0408      0.471      0.087      0.931      -0.886       0.968
day[T.Sun]         0.1368      0.472      0.290      0.772      -0.793       1.066
time[T.Dinner]    -0.0681      0.445     -0.153      0.878      -0.944       0.808
total_bill         0.0945      0.010      9.841      0.000       0.076       0.113
size               0.1760      0.090      1.966      0.051      -0.000       0.352
==============================================================================
Omnibus:                       27.860   Durbin-Watson:                   2.096
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               52.555
Skew:                           0.607   Prob(JB):                     3.87e-12
Kurtosis:                       4.923   Cond. No.                         281.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']
'''
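The reference level does not have to be the first category. As a hedged sketch, the formula syntax used by statsmodels accepts `C(..., Treatment(reference=...))` to pick a different one. The data here is synthetic and only the resulting parameter names matter.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical synthetic data; only the dummy-variable names are of interest.
rng = np.random.default_rng(1)
days = ["Thur", "Fri", "Sat", "Sun"]
demo = pd.DataFrame({
    "day": pd.Categorical(rng.choice(days, size=120), categories=days),
    "total_bill": rng.uniform(5, 50, size=120),
})
demo["tip"] = 1.0 + 0.1 * demo["total_bill"] + rng.normal(0, 0.5, size=120)

# Default: the first category (Thur) is the reference level and gets no column.
default_fit = smf.ols("tip ~ total_bill + day", data=demo).fit()

# Treatment(reference=...) selects another reference level, here Sun.
sun_ref_fit = smf.ols(
    "tip ~ total_bill + C(day, Treatment(reference='Sun'))", data=demo
).fit()

default_names = default_fit.params.index.tolist()
sun_ref_names = sun_ref_fit.params.index.tolist()
print(default_names)
print(sun_ref_names)
```

Changing the reference level changes how the dummy coefficients are read (each is a difference from the reference), but not the overall fit of the model.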

2.3 Using sklearn

from sklearn import linear_model
lr = linear_model.LinearRegression()     # create the LinearRegression object

# Again pass the columns to use into the model; note the double brackets [[]]
predicted = lr.fit(X=tips[['total_bill','size']],y=tips['tip'])

print(predicted.coef_)         # the coefficients
print(predicted.intercept_)    # the intercept

'''
[0.09271334 0.19259779]
0.6689447408125022
'''

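2.4 Using sklearn with categorical variables

For sklearn, the dummy variables have to be created manually; the pandas get_dummies function does this.

Pass drop_first=True to drop the reference level.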
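Why dropping one dummy matters can be checked numerically. In this minimal sketch (a made-up one-column frame, with NumPy used for the rank), the full set of dummies plus an intercept column is rank-deficient, because the dummies sum to the column of ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one categorical column.
levels = ["Thur", "Fri", "Sat", "Sun"]
toy = pd.DataFrame({
    "day": pd.Categorical(["Thur", "Fri", "Sat", "Sun", "Sat", "Fri"], categories=levels)
})

full = pd.get_dummies(toy, dtype=float)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)  # first level (Thur) dropped

def with_intercept(frame):
    """Prepend a column of ones, as a model with an intercept would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

# 5 columns but rank 4: the dummies are perfectly collinear with the intercept.
rank_full = np.linalg.matrix_rank(with_intercept(full))
# 4 columns and rank 4: the collinearity is gone.
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))
print(rank_full, rank_reduced)
```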

from sklearn import linear_model

# Convert the categorical variables to dummy variables manually; printing shows all the columns
tips_dummy = pd.get_dummies(
    tips[['total_bill','size','sex','smoker','day','time']]
    )
print(tips_dummy.head())

# Same as above, but with drop_first=True to drop the reference levels; the output shows they are gone
tips_dummy_ref = pd.get_dummies(
    tips[['total_bill','size','sex','smoker','day','time']],drop_first=True
    )
print(tips_dummy_ref.head())


# The remaining steps are the same as in the previous section
lr = linear_model.LinearRegression()     # create the LinearRegression object
predicted = lr.fit(X=tips_dummy_ref,y=tips['tip'])   # pass in the columns converted to dummy variables
print(predicted.coef_)         # the coefficients
print(predicted.intercept_)    # the intercept
'''
   total_bill  size  sex_Male  ...  day_Sun  time_Lunch  time_Dinner
0       16.99     2         0  ...        1           0            1
1       10.34     3         1  ...        1           0            1
2       21.01     3         1  ...        1           0            1
3       23.68     2         1  ...        1           0            1
4       24.59     4         0  ...        1           0            1

   total_bill  size  sex_Female  ...  day_Sat  day_Sun  time_Dinner
0       16.99     2           1  ...        0        1            1
1       10.34     3           0  ...        0        1            1
2       21.01     3           0  ...        0        1            1
3       23.68     2           0  ...        0        1            1
4       24.59     4           1  ...        0        1            1

[ 0.09448701  0.175992    0.03244094  0.08640832  0.1622592   0.04080082
  0.13677854 -0.0681286 ]
0.5908374259513769
'''

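3 The label problem with sklearn output

sklearn output carries no labels, which makes it harder to read than statsmodels output. The coefficients can be stored together with their labels manually.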

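# Building on the code in 2.4
import numpy as np
# The next two lines are the same as in 2.4
lr = linear_model.LinearRegression()     # create the LinearRegression object
predicted = lr.fit(X=tips_dummy_ref,y=tips['tip'])   # pass in the columns converted to dummy variables

# This part is different
# Collect the intercept and the coefficients
values = np.append(predicted.intercept_,predicted.coef_)  # np.append returns a NumPy array
# Collect the matching names
names = np.append('intercept',tips_dummy_ref.columns)
# Put everything into a labeled DataFrame
results = pd.DataFrame(values,index = names,columns=['coef'])  # note the square brackets
print(results)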

'''
                 coef
intercept    0.590837
total_bill   0.094487
size         0.175992
sex_Female   0.032441
smoker_No    0.086408
day_Fri      0.162259
day_Sat      0.040801
day_Sun      0.136779
time_Dinner -0.068129
'''
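An alternative sketch of the same idea builds a pandas Series keyed by the training columns. The data is synthetic (y is an exact linear function of two made-up columns `a` and `b`), so the recovered coefficients are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Hypothetical synthetic data: y = 1 + 2*a - 3*b exactly.
rng = np.random.default_rng(2)
X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
y = 1.0 + 2.0 * X["a"] - 3.0 * X["b"]

lr = linear_model.LinearRegression().fit(X, y)

# coef_ follows the column order of X, so the columns can serve as the index.
labeled = pd.Series(lr.coef_, index=X.columns, name="coef")
labeled.loc["intercept"] = lr.intercept_
print(labeled)
```

This relies on coef_ being in the same order as the columns passed to fit, which is also what the DataFrame version above assumes.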