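Linear regression does not suit every situation. Some outcomes are binary (for example, heads or tails) and others are counts. Generalized linear models (GLMs) can handle such data while still using a linear combination of the predictors.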
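As a minimal sketch of that idea, the snippet below pushes one linear predictor through two inverse link functions: the inverse logit, which yields a probability, and the inverse log, which yields a positive expected count. The coefficients and predictor values are made up for illustration and do not come from the dataset used later.

```python
import math

def linear_predictor(intercept, coefs, xs):
    """eta = b0 + b1*x1 + ... + bk*xk"""
    return intercept + sum(b * x for b, x in zip(coefs, xs))

def inverse_logit(eta):
    """Inverse of the logit link: maps eta to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def inverse_log(eta):
    """Inverse of the log link: maps eta to a positive expected count."""
    return math.exp(eta)

# Hypothetical coefficients and a single observation
eta = linear_predictor(-1.0, [0.5, 0.25], [2.0, 4.0])  # -1 + 1 + 1
prob = inverse_logit(eta)   # mean of a binary outcome
rate = inverse_log(eta)     # mean of a count outcome
print(eta, prob, rate)
```

The linear part is the same in both cases; only the link between it and the mean of the response changes.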
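Logistic regression
When the response variable is binary, logistic regression is the usual way to model the data.
The data below comes from the datasets accompanying the book Pandas for Everyone (pandas活用). If needed, it can be downloaded from https://download.csdn.net/download/qq_57099024/79301082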
import pandas as pd
d=pd.read_csv('D:/pandas活用/pandas_for_everyone-master/data/acs_ny.csv')
print(d.columns)
print('@'*66)  # print a separator line between the two outputs
print(d.head())
'''Output:
Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms', 'NumChildren',
       'NumPeople', 'NumRooms', 'NumUnits', 'NumVehicles', 'NumWorkers',
       'OwnRent', 'YearBuilt', 'HouseCosts', 'ElectricBill', 'FoodStamp',
       'HeatingFuel', 'Insurance', 'Language'],
      dtype='object')
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
   Acres  FamilyIncome   FamilyType  NumBedrooms  NumChildren  NumPeople  \
0   1-10           150      Married            4            1          3
1   1-10           180  Female Head            3            2          4
2   1-10           280  Female Head            4            0          2
3   1-10           330  Female Head            2            1          2
4   1-10           330    Male Head            3            1          2

   NumRooms         NumUnits  NumVehicles  NumWorkers   OwnRent    YearBuilt  \
0         9  Single detached            1           0  Mortgage    1950-1959
1         6  Single detached            2           0    Rented  Before 1939
2         8  Single detached            3           1  Mortgage    2000-2004
3         4  Single detached            1           0    Rented    1950-1959
4         5  Single attached            1           0  Mortgage  Before 1939

   HouseCosts  ElectricBill  FoodStamp  HeatingFuel  Insurance        Language
0        1800            90         No          Gas       2500         English
1         850            90         No          Oil          0         English
2        2600           260         No          Oil       6600  Other European
3        1800           140         No          Oil          0         English
4         860           150         No          Gas        660         Spanish
'''
Next, bin FamilyIncome with pd.cut to create a binary response variable (1 if the income is above 150,000, otherwise 0):
d['income_15w']=pd.cut(d['FamilyIncome'],[0,150000,d['FamilyIncome'].max()],labels=[0,1])
d['income_15w']=d['income_15w'].astype(int)
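To see what the two-bin cut does, here is a small sketch on made-up incomes (the values are illustrative, not rows from acs_ny.csv). By default pd.cut uses right-closed intervals, so the bins are (0, 150000] and (150000, max].

```python
import pandas as pd

# Toy incomes (made up); the real column is d['FamilyIncome']
income = pd.Series([50, 150000, 150001, 300000])

# Two bins: (0, 150000] -> 0 and (150000, max] -> 1
flag = pd.cut(income, [0, 150000, income.max()], labels=[0, 1])

# pd.cut returns a categorical; convert it to plain integers for modeling
flag = flag.astype(int)
print(flag.tolist())
```

Because the first interval is open on the left, an income of exactly 0 would fall outside both bins and become NaN, and the astype(int) call would then fail. Passing include_lowest=True to pd.cut is one way to guard against that if such rows exist.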
Using statsmodels
import statsmodels.formula.api as smf
model=smf.logit('income_15w~HouseCosts+NumWorkers+OwnRent+NumBedrooms+FamilyType',data=d)
results=model.fit()
print(results.summary())
Optimization terminated successfully.
         Current function value: 0.391651
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:              income_15w   No. Observations:               22745
Model:                           Logit   Df Residuals:                   22737
Method:                            MLE   Df Model:                           7
Date:                 Sat, 05 Feb 2022   Pseudo R-squ.:                 0.2078
Time:                         08:46:18   Log-Likelihood:               -8908.1
converged:                        True   LL-Null:                      -11244.
Covariance Type:             nonrobust   LLR p-value:                    0.000
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -5.8081      0.120    -48.456      0.000      -6.043      -5.573
OwnRent[T.Outright]         1.8276      0.208      8.782      0.000       1.420       2.236
OwnRent[T.Rented]          -0.8763      0.101     -8.647      0.000      -1.075      -0.678
FamilyType[T.Male Head]     0.2874      0.150      1.913      0.056      -0.007       0.582
FamilyType[T.Married]       1.3877      0.088     15.781      0.000       1.215       1.560
HouseCosts                  0.0007   1.72e-05     42.453      0.000       0.001       0.001
NumWorkers                  0.5873      0.026     22.393      0.000       0.536       0.639
NumBedrooms                 0.2365      0.017     13.985      0.000       0.203       0.270
===========================================================================================
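The coefficients in the summary are on the log-odds scale, which is hard to read directly. A common way to interpret them is to exponentiate each one into an odds ratio. The sketch below does this for a few of the coefficients copied by hand from the summary above; with a fitted model, the same values would come from results.params.

```python
import math

# Coefficients copied from the summary above (log-odds scale)
coefs = {
    "OwnRent[T.Outright]": 1.8276,
    "OwnRent[T.Rented]": -0.8763,
    "FamilyType[T.Married]": 1.3877,
    "NumWorkers": 0.5873,
    "NumBedrooms": 0.2365,
}

# exp(coef) is the multiplicative change in the odds for a one-unit increase
# (or, for a dummy variable, relative to the baseline category)
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name:24s} {ratio:.3f}")
```

For example, the coefficient 1.3877 for FamilyType[T.Married] corresponds to odds of roughly four times those of the baseline family type, holding the other predictors fixed.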
Using sklearn
predictors=pd.get_dummies(d[['HouseCosts','NumWorkers','OwnRent','NumBedrooms','FamilyType']],drop_first=True)
from sklearn import linear_model
lr=linear_model.LogisticRegression()
results=lr.fit(X=predictors,y=d['income_15w'])
print(results.coef_)
print('-*-'*10)
print(results.intercept_)
[[ 5.86894916e-04  7.32489391e-01  2.86764784e-01  7.17542587e-02
  -2.13282748e+00 -1.03910262e+00  2.63647146e-01]]
-*--*--*--*--*--*--*--*--*--*-
[-4.86108187]
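The raw coef_ array carries no labels, and its values need not match the statsmodels results: sklearn's LogisticRegression applies an L2 penalty by default (C=1.0), and the columns follow the order produced by pd.get_dummies. The sketch below uses a small made-up data frame (not the ACS data) to show one way of labeling the coefficients and of making the penalty negligible with a very large C. The column name high_income and the toy values are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (made up) standing in for the ACS columns
toy = pd.DataFrame({
    "NumWorkers": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
    "OwnRent": ["Mortgage"] * 4 + ["Rented"] * 4 + ["Outright"] * 4,
    "high_income": [0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1],
})

# dtype=float keeps the design matrix purely numeric
X = pd.get_dummies(toy[["NumWorkers", "OwnRent"]], drop_first=True, dtype=float)

# A very large C makes the default L2 penalty negligible
lr = LogisticRegression(C=1e9, max_iter=1000)
lr.fit(X, toy["high_income"])

# Pair each coefficient with the column it belongs to
labeled = pd.Series(lr.coef_[0], index=X.columns)
print(labeled)
print(lr.intercept_)
```

With drop_first=True the alphabetically first level of each categorical column (here Mortgage) becomes the baseline, so it has no coefficient of its own.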
Poisson regression
Poisson regression is commonly used to model count data.
Using statsmodels
results=smf.poisson('NumChildren~FamilyIncome+FamilyType+OwnRent',data=d).fit()
print(results.summary())
Optimization terminated successfully.
         Current function value: nan
         Iterations 1
                          Poisson Regression Results
==============================================================================
Dep. Variable:             NumChildren   No. Observations:               22745
Model:                         Poisson   Df Residuals:                   22739
Method:                            MLE   Df Model:                           5
Date:                 Sat, 05 Feb 2022   Pseudo R-squ.:                    nan
Time:                         09:05:28   Log-Likelihood:                   nan
converged:                        True   LL-Null:                      -30977.
Covariance Type:             nonrobust   LLR p-value:                      nan
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                      nan        nan        nan        nan         nan         nan
FamilyType[T.Male Head]        nan        nan        nan        nan         nan         nan
FamilyType[T.Married]          nan        nan        nan        nan         nan         nan
OwnRent[T.Outright]            nan        nan        nan        nan         nan         nan
OwnRent[T.Rented]              nan        nan        nan        nan         nan         nan
FamilyIncome                   nan        nan        nan        nan         nan         nan
===========================================================================================
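Every estimate in the summary above is nan, so this fit did not produce usable results even though it reports convergence. One plausible cause (an assumption, not verified against this dataset) is numerical overflow: FamilyIncome can reach the hundreds of thousands, so exp() of the linear predictor may blow up during optimization. Rescaling such a predictor before fitting is a common thing to try. The sketch below illustrates the idea on synthetic data; the column names events and income_100k are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic data (not the ACS file): incomes in dollars, counts from a Poisson model
n = 500
income = rng.uniform(10_000, 300_000, size=n)
income_100k = income / 100_000           # rescaled predictor, roughly 0.1 to 3
events = rng.poisson(np.exp(-0.5 + 0.3 * income_100k))
toy = pd.DataFrame({"events": events, "income_100k": income_100k})

# Fit on the rescaled predictor; disp=0 silences the optimizer message
res = smf.poisson("events ~ income_100k", data=toy).fit(disp=0)
print(res.params)
```

On the real data, the analogous step would be to add a rescaled income column to d and use it in the formula in place of FamilyIncome. Whether that resolves the nan output here would need to be checked by rerunning the model.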
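Negative binomial regression
If the assumptions of Poisson regression do not hold well (for example, the data are overdispersed), negative binomial regression can be used instead.
The statsmodels GLM documentation lists many distribution families that can be passed to GLM through the family argument. The link functions available for each family can be found under sm.families.<FAMILY>.links:
Binomial
Gamma
InverseGaussian
NegativeBinomial
Poisson
Tweedie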
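A Poisson distribution has its variance equal to its mean, so a rough first check for overdispersion is to compare the two for the observed counts. The sketch below uses made-up counts. It looks only at the marginal distribution, whereas the model assumption concerns the variance given the predictors, so treat it as a hint rather than a test.

```python
from statistics import mean, variance

# Made-up counts for illustration (e.g. number of children per household)
counts = [0, 0, 0, 0, 1, 0, 2, 0, 5, 0, 0, 7]

m = mean(counts)
v = variance(counts)   # sample variance
ratio = v / m          # a ratio well above 1 hints at overdispersion
print(m, v, ratio)
```

With real data, the same comparison could be run on d['NumChildren'].mean() and d['NumChildren'].var().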
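import statsmodels.api as sm
import statsmodels.formula.api as smf
# Negative binomial family with a log link (log is also this family's default link).
# Recent statsmodels versions expect a link instance, hence Log() rather than the class itself.
model=smf.glm('NumChildren~FamilyIncome+FamilyType+OwnRent',data=d,family=sm.families.NegativeBinomial(sm.families.links.Log()))
results=model.fit()
print(results.summary())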
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:             NumChildren   No. Observations:               22745
Model:                             GLM   Df Residuals:                   22739
Model Family:         NegativeBinomial   Df Model:                           5
Link Function:                     log   Scale:                         1.0000
Method:                           IRLS   Log-Likelihood:               -29749.
Date:                 Sat, 05 Feb 2022   Deviance:                      20731.
Time:                         10:06:21   Pearson chi2:                1.77e+04
No. Iterations:                      6
Covariance Type:             nonrobust
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -0.3345      0.029    -11.672      0.000      -0.391      -0.278
FamilyType[T.Male Head]    -0.0468      0.052     -0.905      0.365      -0.148       0.055
FamilyType[T.Married]       0.1529      0.029      5.200      0.000       0.095       0.211
OwnRent[T.Outright]        -1.9737      0.243     -8.113      0.000      -2.450      -1.497
OwnRent[T.Rented]           0.4164      0.030     13.754      0.000       0.357       0.476
FamilyIncome             5.398e-07   9.55e-08      5.652      0.000    3.53e-07    7.27e-07
===========================================================================================