Nonlinear Regression - Polynomial Regression
We previously covered linear regression. This article focuses on polynomial regression, one form of nonlinear regression. Nonlinear regression covers a wide range of algorithms, such as logistic regression, SVM, decision trees, neural networks, exponential smoothing, and so on. Polynomial regression can be classified as a kind of nonlinear regression, although some articles also classify it as linear regression.
Let's look at an example of a general polynomial regression expression:

$$y = \beta_0 + \beta_1 x_1 + \beta_4 x_2 + \beta_2 x_1 x_2 + \beta_3 x_3^2 + \epsilon$$
In this expression, $x_1$, $x_2$, and $x_3$ are not linearly related to the dependent variable. But if we substitute $x_4$ for $x_1 x_2$ and $x_5$ for $x_3^2$, the expression can be rewritten as:

$$y = \beta_0 + \beta_1 x_1 + \beta_4 x_2 + \beta_2 x_4 + \beta_3 x_5 + \epsilon$$
Now the expression is, in form, a linear regression. Of course, if we treat it with plain linear regression, we run into the problem of multicollinearity. Handling multicollinearity inevitably means removing some variables, and whatever method we use, information is lost. For example, in the earlier linear regression article, AveRooms and AveBedrms were multicollinear and we removed AveRooms to deal with it; but AveRooms reflects the overall size of a dwelling, so dropping the feature outright hurts the fit. Likewise, if we simply substitute $x_4$ for $x_1 x_2$, then $x_4$ will be multicollinear with both $x_1$ and $x_2$. So nonlinear regression cannot simply be handled with the plain linear regression workflow; it is only in form that polynomial regression can be classified as a generalized kind of linear regression.
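To see this concretely, here is a minimal sketch (not from the original post) showing that for positive-valued features, as is typical of housing data, a product term is strongly correlated with its factors:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 500)   # two independent, positive-valued features
x2 = rng.uniform(1, 10, 500)
x4 = x1 * x2                   # the interaction term

# correlation matrix of (x1, x2, x4): the off-diagonal entries involving x4
# come out around 0.6-0.7, i.e. substantial collinearity
print(np.corrcoef([x1, x2, x4]))
```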
Let's use the Boston housing data as an example to show how to use polynomial regression.
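The original post does not show the setup step; the snippets below assume the data was loaded roughly as follows (a sketch, with the names `data` and `boston` matching the later code):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; use an older version or load the data manually
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

boston = load_boston()
data = pd.DataFrame(boston.data, columns=boston.feature_names)
data['MEDValue'] = boston.target
```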
```python
data.describe()
```
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDValue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
| Variable | Feature or Target |
|---|---|
| CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT | Feature |
| MEDValue | Target |
```python
X = data.iloc[:, 0:13]
y = data.iloc[:, 13]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=48)

# degree-2 features: the originals, their squares, and pairwise interactions
poly_re = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
x_poly = pd.DataFrame(poly_re.fit_transform(x_train),
                      columns=poly_re.get_feature_names(input_features=boston.feature_names))
# note: get_feature_names was renamed to get_feature_names_out in scikit-learn 1.0 and removed in 1.2
x_poly.head(3)
```
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | ... | TAX^2 | TAX PTRATIO | TAX B | TAX LSTAT | PTRATIO^2 | PTRATIO B | PTRATIO LSTAT | B^2 | B LSTAT | LSTAT^2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.17783 | 0.0 | 9.69 | 0.0 | 0.585 | 5.569 | 73.5 | 2.3999 | 6.0 | 391.0 | ... | 152881.0 | 7507.2 | 154746.07 | 5904.10 | 368.64 | 7598.784 | 289.920 | 156633.8929 | 5976.1270 | 228.0100 |
| 1 | 0.07503 | 33.0 | 2.18 | 0.0 | 0.472 | 7.420 | 71.9 | 3.0992 | 7.0 | 222.0 | ... | 49284.0 | 4084.8 | 88111.80 | 1436.34 | 338.56 | 7302.960 | 119.048 | 157529.6100 | 2567.9430 | 41.8609 |
| 2 | 5.66998 | 0.0 | 18.10 | 1.0 | 0.631 | 6.683 | 96.8 | 1.3567 | 24.0 | 666.0 | ... | 443556.0 | 13453.2 | 249969.78 | 2484.18 | 408.04 | 7581.666 | 75.346 | 140872.6089 | 1399.9809 | 13.9129 |

3 rows × 104 columns
The number of features has now expanded from 13 to 104: the 13 original features, their 13 squared terms, and $\frac{13 \times 12}{2} = 78$ pairwise interaction features.
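A quick sanity check of that count (a throwaway snippet, not from the original post):

```python
from math import comb

n = 13
# originals + squares + pairwise interactions
print(n + n + comb(n, 2))  # 13 + 13 + 78 = 104
```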
Base model - linear regression
We use linear regression as the base model and look at its $R^2$.
```python
model_l = sm.OLS(y_train, sm.add_constant(x_train)).fit()
model_l.summary()
```
| Dep. Variable: | MEDValue | R-squared: | 0.755 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.746 |
| Method: | Least Squares | F-statistic: | 92.28 |
| Date: | Sun, 10 Jul 2022 | Prob (F-statistic): | 3.24e-110 |
| Time: | 16:27:37 | Log-Likelihood: | -1177.5 |
| No. Observations: | 404 | AIC: | 2383. |
| Df Residuals: | 390 | BIC: | 2439. |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 34.9728 | 5.470 | 6.393 | 0.000 | 24.218 | 45.728 |
| CRIM | -0.1110 | 0.032 | -3.433 | 0.001 | -0.175 | -0.047 |
| ZN | 0.0453 | 0.015 | 3.010 | 0.003 | 0.016 | 0.075 |
| INDUS | 0.0130 | 0.066 | 0.198 | 0.843 | -0.116 | 0.142 |
| CHAS | 3.4712 | 0.950 | 3.652 | 0.000 | 1.603 | 5.340 |
| NOX | -16.2379 | 4.159 | -3.904 | 0.000 | -24.415 | -8.061 |
| RM | 3.8985 | 0.450 | 8.662 | 0.000 | 3.014 | 4.783 |
| AGE | -0.0112 | 0.015 | -0.771 | 0.441 | -0.040 | 0.017 |
| DIS | -1.3897 | 0.214 | -6.492 | 0.000 | -1.811 | -0.969 |
| RAD | 0.2533 | 0.071 | 3.564 | 0.000 | 0.114 | 0.393 |
| TAX | -0.0117 | 0.004 | -2.907 | 0.004 | -0.020 | -0.004 |
| PTRATIO | -0.8912 | 0.139 | -6.418 | 0.000 | -1.164 | -0.618 |
| B | 0.0064 | 0.003 | 2.290 | 0.023 | 0.001 | 0.012 |
| LSTAT | -0.4716 | 0.054 | -8.656 | 0.000 | -0.579 | -0.364 |
| Omnibus: | 113.968 | Durbin-Watson: | 2.178 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 362.559 |
| Skew: | 1.278 | Prob(JB): | 1.87e-79 |
| Kurtosis: | 6.874 | Cond. No. | 1.52e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
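As an aside (not in the original post), the condition number statsmodels reports is essentially the ratio of the largest to smallest singular value of the design matrix, so it can be checked directly:

```python
import numpy as np

# ratio of largest to smallest singular value of the design matrix;
# should come out close to the 1.52e+04 reported in the summary above
print(np.linalg.cond(sm.add_constant(x_train)))
```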
```python
y_predict = model_l.predict(sm.add_constant(x_test))
```
Test-set results:
```python
from sklearn.metrics import r2_score
r2_score(y_test, y_predict)
```

0.679162956664036
```python
# use transform (not fit_transform) on the test set, so the transformation
# defined on the training data is applied consistently
x_poly_test = pd.DataFrame(poly_re.transform(x_test),
                           columns=poly_re.get_feature_names(input_features=boston.feature_names))
```
Polynomial Regression
We iterate over the second-order features to see how adding a single one of them on top of the 13 original features affects the fit, using adjusted $R^2$ to evaluate the fit.
```python
# align the target's index with x_poly (which fit_transform re-indexed from 0);
# this definition is missing from the original post and is assumed here
y_poly = y_train.reset_index(drop=True)

numbers = []
for i in range(13, 104):
    # the 13 original features plus one candidate second-order feature
    x_poly_alt2 = x_poly.iloc[:, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, i]]
    model = sm.OLS(y_poly, sm.add_constant(x_poly_alt2)).fit()
    numbers.append(model.rsquared_adj)

numbers_df = pd.DataFrame(numbers)
# names of the 91 second-order features (drop the 13 originals)
names = x_poly.drop(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
                     'PTRATIO', 'B', 'LSTAT'], axis=1)
col_heads = pd.DataFrame(list(names))
output = pd.concat([col_heads, numbers_df.reset_index(drop=True)], axis=1)
output.columns = ['names', 'rsquared_adj']
output = output.sort_values(by=['rsquared_adj'], ascending=False)
output
```
|  | names | rsquared_adj |
|---|---|---|
| 62 | RM LSTAT | 0.815162 |
| 55 | RM^2 | 0.809425 |
| 59 | RM TAX | 0.802159 |
| 58 | RM RAD | 0.796393 |
| 90 | LSTAT^2 | 0.789813 |
| ... | ... | ... |
| 34 | INDUS B | 0.745860 |
| 20 | ZN RAD | 0.745845 |
| 79 | RAD B | 0.745836 |
| 83 | TAX B | 0.745833 |
| 37 | CHAS NOX | 0.745833 |

91 rows × 2 columns
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('darkgrid', {'axes.facecolor': '.9'})
plt.figure(figsize=(15, 8))
plt.plot(output['names'], output['rsquared_adj'])
```
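The original post does not show how `orign_featue` is defined. Judging from the model summary below, it appears to hold the 13 original features plus the 14 second-order features that raised adjusted $R^2$ the most; a plausible reconstruction:

```python
# assumed reconstruction: the 13 originals plus the top second-order features,
# matching the columns that appear in the model summary below
orign_featue = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
                'PTRATIO', 'B', 'LSTAT',
                'RM LSTAT', 'RM^2', 'RM TAX', 'RM RAD', 'LSTAT^2', 'RM PTRATIO',
                'INDUS RM', 'NOX RM', 'RM B', 'RM AGE', 'CRIM RM', 'RM DIS',
                'CRIM CHAS', 'ZN RM']
```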
```python
selected_df = x_poly[orign_featue]
model_f = sm.OLS(y_poly, sm.add_constant(selected_df)).fit()
model_f.summary()
```
| Dep. Variable: | MEDValue | R-squared: | 0.873 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.864 |
| Method: | Least Squares | F-statistic: | 96.00 |
| Date: | Sun, 10 Jul 2022 | Prob (F-statistic): | 9.30e-151 |
| Time: | 16:27:30 | Log-Likelihood: | -1044.0 |
| No. Observations: | 404 | AIC: | 2144. |
| Df Residuals: | 376 | BIC: | 2256. |
| Df Model: | 27 | | |
| Covariance Type: | nonrobust | | |
|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -62.4149 | 35.867 | -1.740 | 0.083 | -132.939 | 8.110 |
| CRIM | -0.8410 | 0.234 | -3.586 | 0.000 | -1.302 | -0.380 |
| ZN | -0.0942 | 0.115 | -0.817 | 0.415 | -0.321 | 0.133 |
| INDUS | -0.8661 | 0.584 | -1.482 | 0.139 | -2.015 | 0.283 |
| CHAS | -0.9866 | 0.890 | -1.108 | 0.268 | -2.737 | 0.764 |
| NOX | 73.3084 | 29.128 | 2.517 | 0.012 | 16.035 | 130.582 |
| RM | 13.4162 | 6.691 | 2.005 | 0.046 | 0.260 | 26.572 |
| AGE | 0.0111 | 0.121 | 0.092 | 0.927 | -0.228 | 0.250 |
| DIS | 0.3002 | 1.702 | 0.176 | 0.860 | -3.046 | 3.646 |
| RAD | 0.3127 | 0.566 | 0.553 | 0.581 | -0.800 | 1.425 |
| TAX | 0.0275 | 0.034 | 0.803 | 0.422 | -0.040 | 0.095 |
| PTRATIO | 3.2518 | 1.130 | 2.878 | 0.004 | 1.030 | 5.473 |
| B | -0.0386 | 0.022 | -1.717 | 0.087 | -0.083 | 0.006 |
| LSTAT | 0.3855 | 0.487 | 0.791 | 0.430 | -0.573 | 1.344 |
| RM LSTAT | -0.1561 | 0.064 | -2.457 | 0.014 | -0.281 | -0.031 |
| RM^2 | 0.8728 | 0.267 | 3.267 | 0.001 | 0.348 | 1.398 |
| RM TAX | -0.0065 | 0.006 | -1.173 | 0.241 | -0.018 | 0.004 |
| RM RAD | -0.0166 | 0.093 | -0.178 | 0.859 | -0.199 | 0.166 |
| LSTAT^2 | 0.0035 | 0.005 | 0.761 | 0.447 | -0.006 | 0.013 |
| RM PTRATIO | -0.6216 | 0.178 | -3.497 | 0.001 | -0.971 | -0.272 |
| INDUS RM | 0.1534 | 0.093 | 1.641 | 0.102 | -0.030 | 0.337 |
| NOX RM | -14.7428 | 4.795 | -3.074 | 0.002 | -24.172 | -5.313 |
| RM B | 0.0064 | 0.004 | 1.734 | 0.084 | -0.001 | 0.014 |
| RM AGE | -0.0031 | 0.020 | -0.158 | 0.874 | -0.042 | 0.035 |
| CRIM RM | 0.1130 | 0.038 | 3.000 | 0.003 | 0.039 | 0.187 |
| RM DIS | -0.1752 | 0.267 | -0.656 | 0.512 | -0.700 | 0.350 |
| CRIM CHAS | 2.1670 | 0.268 | 8.086 | 0.000 | 1.640 | 2.694 |
| ZN RM | 0.0133 | 0.017 | 0.769 | 0.442 | -0.021 | 0.047 |
| Omnibus: | 50.040 | Durbin-Watson: | 2.019 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 199.285 |
| Skew: | 0.453 | Prob(JB): | 5.32e-44 |
| Kurtosis: | 6.319 | Cond. No. | 9.18e+05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.18e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
```python
y_poly_predict = model_f.predict(sm.add_constant(x_poly_test[orign_featue]))
# the original post calls this y_poly_test, which is never defined;
# the actual test target is y_test
r2_score(y_test, y_poly_predict)
```

0.7293035500578724
The model's $R^2$ improved to 0.873, versus 0.755 for the base model. On the test set, $R^2$ likewise improved from 0.68 to 0.73. So adding second-order features has a substantial effect on the model's fit. This article only took degree 2 as an example; the highest-order term could just as well be degree 3. How to choose among such models will be covered in a later article.
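As a preview (a minimal sketch, not necessarily the method of that later article), one common way to choose the degree is cross-validation over a pipeline:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# compare degrees by 5-fold cross-validated R^2 on the training data
for degree in (1, 2, 3):
    pipe = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                         LinearRegression())
    scores = cross_val_score(pipe, x_train, y_train, cv=5, scoring='r2')
    print(degree, scores.mean())
```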