Nonlinear Regression - Polynomial Regression

In a previous post we covered linear regression. This post focuses on polynomial regression, one branch of nonlinear regression. Nonlinear regression covers many algorithms, for example logistic regression, SVM, decision trees, neural networks, exponential smoothing, and so on. Polynomial regression can be counted as a kind of nonlinear regression, although some articles also classify polynomial regression under linear regression.

Let's look at an example of a general polynomial regression expression: $y = \beta_0 + \beta_1 x_1 + \beta_4 x_2 + \beta_2 x_1 x_2 + \beta_3 x_3^2 + \epsilon$

In this expression, $x_1$, $x_2$ and $x_3$ are not linearly related to the dependent variable. But if we substitute $x_4$ for $x_1 x_2$ and $x_5$ for $x_3^2$, the expression can be rewritten as: $y = \beta_0 + \beta_1 x_1 + \beta_4 x_2 + \beta_2 x_4 + \beta_3 x_5 + \epsilon$
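To make the substitution idea concrete, here is a minimal sketch on synthetic data (my own example, not from the original post): once $x_1 x_2$ and $x_3^2$ are treated as extra columns, ordinary least squares applies directly and recovers the coefficients.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 200))
# synthetic target following the expression above (arbitrary coefficients)
y = 1 + 2*x1 + 3*x2 + 0.5*x1*x2 - 1.5*x3**2 + rng.normal(scale=0.1, size=200)

# x4 = x1*x2 and x5 = x3^2 become ordinary columns of the design matrix
X = np.column_stack([x1, x2, x1*x2, x3**2])
print(sm.OLS(y, sm.add_constant(X)).fit().params)   # approximately [1, 2, 3, 0.5, -1.5]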

In this form, the model formally becomes a linear regression. Of course, if we fit it with plain linear regression we run into multicollinearity. Handling multicollinearity usually means removing some variables, and whatever method we use, information is lost. For example, in the earlier linear regression post, AveRooms and AveBedrms were collinear and we removed AveRooms to deal with the problem; but AveRooms reflects the overall size of a dwelling, and simply removing the feature hurts the fit. Likewise, if we simply replace $x_1 x_2$ with $x_4$, then $x_4$ will be collinear with both $x_1$ and $x_2$ (see the sketch below). So nonlinear regression cannot simply be handled the same way as linear regression; it is only in form that polynomial regression can be grouped under linear regression in the broad sense.
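A small sketch of that collinearity (again synthetic data, my own illustration): the interaction column x4 = x1*x2 is strongly collinear with x1 and x2, which shows up in the variance inflation factors.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.uniform(5, 10, size=500)
x2 = rng.uniform(5, 10, size=500)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x4': x1 * x2})   # x4 = x1 * x2

exog = sm.add_constant(df)   # include a constant, as an OLS fit would
for i, col in enumerate(exog.columns):
    if col != 'const':
        print(col, variance_inflation_factor(exog.values, i))
# x4 is built directly from x1 and x2, so the VIFs come out well above the
# usual rule-of-thumb threshold of 10, i.e. strong multicollinearity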

Let's use the Boston housing data as an example of how to apply polynomial regression.
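The original post does not show the imports or how `data` and `boston` are loaded; a minimal assumed setup, inferred from the later code (e.g. `boston.feature_names`), might look like this. Note that `load_boston` is deprecated and removed in newer scikit-learn versions.

# Assumed setup (not shown in the original post)
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston          # available in older scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['MEDValue'] = boston.target                  # 13 features + 1 target column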

data.describe()
             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT    MEDValue
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000
Variable                                                                  Feature or Target
CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT    Feature
MEDValue                                                                  Target
X = data.iloc[:, 0:13]   # the 13 original features
y = data.iloc[:, 13]     # the target, MEDValue
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=48)

# degree-2 expansion: original features, their squares, and all pairwise interactions
poly_re = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
x_poly = pd.DataFrame(poly_re.fit_transform(x_train),
                      columns=poly_re.get_feature_names(input_features=boston.feature_names))
# note: in newer scikit-learn versions get_feature_names is get_feature_names_out
x_poly.head(3)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS   RAD    TAX  ...     TAX^2  TAX PTRATIO      TAX B  TAX LSTAT  PTRATIO^2  PTRATIO B  PTRATIO LSTAT          B^2    B LSTAT   LSTAT^2
0  0.17783   0.0   9.69   0.0  0.585  5.569  73.5  2.3999   6.0  391.0  ...  152881.0       7507.2  154746.07    5904.10     368.64   7598.784        289.920  156633.8929  5976.1270  228.0100
1  0.07503  33.0   2.18   0.0  0.472  7.420  71.9  3.0992   7.0  222.0  ...   49284.0       4084.8   88111.80    1436.34     338.56   7302.960        119.048  157529.6100  2567.9430   41.8609
2  5.66998   0.0  18.10   1.0  0.631  6.683  96.8  1.3567  24.0  666.0  ...  443556.0      13453.2  249969.78    2484.18     408.04   7581.666         75.346  140872.6089  1399.9809   13.9129

3 rows × 104 columns

At this point the number of features has grown from 13 to 104: the 13 original features, the 13 squared terms of those features, and $\frac{13 \times 12}{2} = 78$ pairwise interaction features.
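A quick sanity check on that count (my own snippet, not from the original post):

from math import comb
n = 13
# 13 originals + 13 squares + C(13, 2) = 78 pairwise interactions = 104
print(n + n + comb(n, 2))   # 104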

Base model - linear regression

We use plain linear regression as the base model and look at its $R^2$.

# base model: OLS on the 13 original training features, with an intercept
model_l = sm.OLS(y_train, sm.add_constant(x_train)).fit()
model_l.summary()
OLS Regression Results
Dep. Variable:     MEDValue           R-squared:             0.755
Model:             OLS                Adj. R-squared:        0.746
Method:            Least Squares      F-statistic:           92.28
Date:              Sun, 10 Jul 2022   Prob (F-statistic):    3.24e-110
Time:              16:27:37           Log-Likelihood:        -1177.5
No. Observations:  404                AIC:                   2383.
Df Residuals:      390                BIC:                   2439.
Df Model:          13
Covariance Type:   nonrobust

               coef   std err         t     P>|t|    [0.025    0.975]
const 34.9728 5.470 6.393 0.000 24.218 45.728
CRIM -0.1110 0.032 -3.433 0.001 -0.175 -0.047
ZN 0.0453 0.015 3.010 0.003 0.016 0.075
INDUS 0.0130 0.066 0.198 0.843 -0.116 0.142
CHAS 3.4712 0.950 3.652 0.000 1.603 5.340
NOX -16.2379 4.159 -3.904 0.000 -24.415 -8.061
RM 3.8985 0.450 8.662 0.000 3.014 4.783
AGE -0.0112 0.015 -0.771 0.441 -0.040 0.017
DIS -1.3897 0.214 -6.492 0.000 -1.811 -0.969
RAD 0.2533 0.071 3.564 0.000 0.114 0.393
TAX -0.0117 0.004 -2.907 0.004 -0.020 -0.004
PTRATIO -0.8912 0.139 -6.418 0.000 -1.164 -0.618
B 0.0064 0.003 2.290 0.023 0.001 0.012
LSTAT -0.4716 0.054 -8.656 0.000 -0.579 -0.364
Omnibus:        113.968   Durbin-Watson:       2.178
Prob(Omnibus):    0.000   Jarque-Bera (JB):  362.559
Skew:             1.278   Prob(JB):         1.87e-79
Kurtosis:         6.874   Cond. No.         1.52e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
y_predict = model_l.predict(sm.add_constant(x_test))

Results on the test set:

from sklearn.metrics import r2_score
r2_score(y_test,y_predict)
0.679162956664036
# apply the same degree-2 expansion to the test set
# (transform rather than fit_transform, since poly_re was already fitted on x_train)
x_poly_test = pd.DataFrame(poly_re.transform(x_test),
                           columns=poly_re.get_feature_names(input_features=boston.feature_names))

Polynomial Regression

Next we loop over the second-order features to see how adding a single one of them on top of the 13 original features affects the fit, using adjusted $R^2$ to evaluate the fit.

# x_poly was rebuilt with a fresh 0..n index, so align the target the same way
# (y_poly is not defined in the original post; this is the assumed definition)
y_poly = y_train.reset_index(drop=True)

numbers = []
for i in range(13, 104):
    # the 13 original features plus the i-th second-order feature
    x_poly_alt2 = x_poly.iloc[:, list(range(13)) + [i]]
    model = sm.OLS(y_poly, sm.add_constant(x_poly_alt2)).fit()
    numbers.append(model.rsquared_adj)
numbers_df = pd.DataFrame(numbers)

# names of the 91 second-order features (everything except the 13 originals)
names = x_poly.drop(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
                     'PTRATIO', 'B', 'LSTAT'], axis=1)
col_heads = pd.DataFrame(list(names))
output = pd.concat([col_heads, numbers_df.reset_index(drop=True)], axis=1)
output.columns = ['names', 'rsquared_adj']
output = output.sort_values(by=['rsquared_adj'], ascending=False)
output
    names       rsquared_adj
62  RM LSTAT    0.815162
55  RM^2        0.809425
59  RM TAX      0.802159
58  RM RAD      0.796393
90  LSTAT^2     0.789813
..  ...         ...
34  INDUS B     0.745860
20  ZN RAD      0.745845
79  RAD B       0.745836
83  TAX B       0.745833
37  CHAS NOX    0.745833

91 rows × 2 columns

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid', {'axes.facecolor': '.9'})
plt.figure(figsize=(15, 8))
# adjusted R^2 of each candidate second-order feature, sorted in descending order
plt.plot(output['names'], output['rsquared_adj'])

(Figure: adjusted R² of the 13 original features plus each single second-order feature, sorted in descending order.)
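The original post does not show how the selected feature list `orign_featue` is built. Judging from the ranking above and the regression output below, it presumably holds the 13 original features plus the second-order terms that appear in the final model; a plausible reconstruction (column names taken from the summary that follows) is:

# Assumed reconstruction of `orign_featue` (not shown in the original post)
orign_featue = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
                'TAX', 'PTRATIO', 'B', 'LSTAT',
                'RM LSTAT', 'RM^2', 'RM TAX', 'RM RAD', 'LSTAT^2', 'RM PTRATIO',
                'INDUS RM', 'NOX RM', 'RM B', 'RM AGE', 'CRIM RM', 'RM DIS',
                'CRIM CHAS', 'ZN RM']   # 13 originals + 14 second-order terms = 27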

# refit OLS on the 13 original features plus the selected second-order terms
selected_df = x_poly[orign_featue]
model_f = sm.OLS(y_poly, sm.add_constant(selected_df)).fit()
model_f.summary()
OLS Regression Results
Dep. Variable:     MEDValue           R-squared:             0.873
Model:             OLS                Adj. R-squared:        0.864
Method:            Least Squares      F-statistic:           96.00
Date:              Sun, 10 Jul 2022   Prob (F-statistic):    9.30e-151
Time:              16:27:30           Log-Likelihood:        -1044.0
No. Observations:  404                AIC:                   2144.
Df Residuals:      376                BIC:                   2256.
Df Model:          27
Covariance Type:   nonrobust

               coef   std err         t     P>|t|    [0.025    0.975]
const -62.4149 35.867 -1.740 0.083 -132.939 8.110
CRIM -0.8410 0.234 -3.586 0.000 -1.302 -0.380
ZN -0.0942 0.115 -0.817 0.415 -0.321 0.133
INDUS -0.8661 0.584 -1.482 0.139 -2.015 0.283
CHAS -0.9866 0.890 -1.108 0.268 -2.737 0.764
NOX 73.3084 29.128 2.517 0.012 16.035 130.582
RM 13.4162 6.691 2.005 0.046 0.260 26.572
AGE 0.0111 0.121 0.092 0.927 -0.228 0.250
DIS 0.3002 1.702 0.176 0.860 -3.046 3.646
RAD 0.3127 0.566 0.553 0.581 -0.800 1.425
TAX 0.0275 0.034 0.803 0.422 -0.040 0.095
PTRATIO 3.2518 1.130 2.878 0.004 1.030 5.473
B -0.0386 0.022 -1.717 0.087 -0.083 0.006
LSTAT 0.3855 0.487 0.791 0.430 -0.573 1.344
RM LSTAT -0.1561 0.064 -2.457 0.014 -0.281 -0.031
RM^2 0.8728 0.267 3.267 0.001 0.348 1.398
RM TAX -0.0065 0.006 -1.173 0.241 -0.018 0.004
RM RAD -0.0166 0.093 -0.178 0.859 -0.199 0.166
LSTAT^2 0.0035 0.005 0.761 0.447 -0.006 0.013
RM PTRATIO -0.6216 0.178 -3.497 0.001 -0.971 -0.272
INDUS RM 0.1534 0.093 1.641 0.102 -0.030 0.337
NOX RM -14.7428 4.795 -3.074 0.002 -24.172 -5.313
RM B 0.0064 0.004 1.734 0.084 -0.001 0.014
RM AGE -0.0031 0.020 -0.158 0.874 -0.042 0.035
CRIM RM 0.1130 0.038 3.000 0.003 0.039 0.187
RM DIS -0.1752 0.267 -0.656 0.512 -0.700 0.350
CRIM CHAS 2.1670 0.268 8.086 0.000 1.640 2.694
ZN RM 0.0133 0.017 0.769 0.442 -0.021 0.047
Omnibus:         50.040   Durbin-Watson:       2.019
Prob(Omnibus):    0.000   Jarque-Bera (JB):  199.285
Skew:             0.453   Prob(JB):         5.32e-44
Kurtosis:         6.319   Cond. No.         9.18e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.18e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
y_poly_predict = model_f.predict(sm.add_constant(x_poly_test[orign_featue]))
# the test target is simply y_test (the original post's y_poly_test is not defined)
r2_score(y_test, y_poly_predict)
0.7293035500578724

The model's $R^2$ rises to 0.873, compared with 0.755 for the base model above. Comparing performance on the test set, $R^2$ also improves from 0.68 to 0.73. So adding second-order features has a considerable effect on the fit. This post only used degree = 2 as an example; of course the highest-order term could also be 3. How to choose between such models will be covered later.
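As a preview of that model-selection step, one common approach (my own sketch, not from the original post) is to cross-validate the degree on the training set and pick the degree with the best average score:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# compare degrees 1-3 by 5-fold cross-validated R^2 on the training data
for degree in (1, 2, 3):
    pipe = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                         LinearRegression())
    scores = cross_val_score(pipe, x_train, y_train, cv=5, scoring='r2')
    print(degree, scores.mean())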
