Import the required packages and create the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import statsmodels.api as sm
import statsmodels.graphics.api as smg
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
import seaborn as sns
x = [[0.18], [0.24], [0.3], [0.36], [0.41], [0.48], [0.49], [0.6], [0.62], [0.64], [0.71], [0.72], [0.82], [0.83], [0.96], [0.98],
[1.0], [1.19], [1.2], [1.2], [1.24], [1.24], [1.25], [1.28], [1.29], [1.42], [1.49], [1.64], [1.96], [1.99], [2.0], [2.09],
[2.38], [2.39], [2.4], [2.48], [2.5], [2.51], [2.56], [2.84], [3.92], [4.0], [4.76], [5.0]]
y = [[0.12], [0.16], [0.20], [0.24], [0.3], [0.32], [0.39], [0.4], [0.41], [0.45], [0.48], [0.5], [0.59], [0.6], [0.64], [0.67],
[0.69], [0.7], [0.78], [0.8], [0.81], [0.82], [0.89], [0.9], [0.99], [1.1], [1.11], [1.15], [1.19], [1.2], [1.29], [1.34],
[1.35], [1.38], [1.39], [1.56], [1.6], [1.64], [1.8], [2.36], [2.39], [2.4], [2.68], [2.76]]
Linear regression fit:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)
print("Slope:", model.coef_[0][0])
print("Intercept:", model.intercept_[0])
print(f"Fitted function: y={model.coef_[0][0]}x+{model.intercept_[0]}")
plt.scatter(x, y)
X = np.linspace(0, 5.5, 100)
plt.plot(X, model.coef_[0][0]*X+model.intercept_[0], 'g')
plt.plot(x, model.predict(x), 'r')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x)))  # r2_score expects (y_true, y_pred)
Slope: 0.5799536736145978
Intercept: 0.1049651997671357
Fitted function: y=0.5799536736145978x+0.1049651997671357
Fit quality (R^2): 0.964665988316247
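Note that `r2_score` expects its arguments in the order `(y_true, y_pred)`; swapping them generally changes the value, because R^2 is not symmetric in its inputs. A minimal sketch with made-up numbers (not the data above):

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.9]

# R^2 measures how much variance of y_true is explained by y_pred,
# so the argument order matters.
print(r2_score(y_true, y_pred))
print(r2_score(y_pred, y_true))  # generally a different number
```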
We can also do the fit with import statsmodels.api as sm:
X = pd.DataFrame({"c": np.ones(len(x)), "x": pd.Series([_[0] for _ in x])})
model = sm.OLS(pd.Series([_[0] for _ in y]), X)
result = model.fit()
print(result.params)
print("Intercept:", result.params['c'])
print("Slope:", result.params['x'])
c 0.104965
x 0.579954
dtype: float64
Intercept: 0.10496519976713606
Slope: 0.5799536736145978
The first-degree fit does not look great, so we can try higher-degree polynomial fits:
from sklearn.preprocessing import PolynomialFeatures
Quadratic:
poly = PolynomialFeatures(2, include_bias=True)
x_poly = poly.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
print("First-degree coefficient:", model.coef_[0][1])
print("Second-degree coefficient:", model.coef_[0][2])
print("Intercept:", model.intercept_[0])
print(f"Fitted function: y={model.coef_[0][2]}x^2+{model.coef_[0][1]}x+{model.intercept_[0]}")
plt.scatter(x, y)
X = np.linspace(0, 5.5, 100)
plt.plot(X, model.coef_[0][2]*X**2 + model.coef_[0][1]*X + model.intercept_[0], 'g')
plt.plot(x, model.predict(x_poly), 'r')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x_poly)))
First-degree coefficient: 0.7469011589908677
Second-degree coefficient: -0.036181545253088934
Intercept: -0.021172564638651803
Fitted function: y=-0.036181545253088934x^2+0.7469011589908677x+-0.021172564638651803
Fit quality (R^2): 0.973319174087569
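To make explicit what `PolynomialFeatures` does here: `fit_transform` expands each input column into its powers up to the requested degree, plus a bias column when `include_bias=True`. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0], [3.0]])
poly_demo = PolynomialFeatures(2, include_bias=True)

# Each row [x] becomes [1, x, x^2]
print(poly_demo.fit_transform(X_demo))
```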
This fit is equivalent to fitting with a pipeline:
model = Pipeline([('poly', PolynomialFeatures(2)), ('linear', LinearRegression())])
model = model.fit(x, y)
plt.scatter(x, y)
plt.plot(x, model.predict(x), 'g')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x)))
Fit quality (R^2): 0.973319174087569
model = make_pipeline(PolynomialFeatures(5), LinearRegression())
model = model.fit(x, y)
plt.scatter(x, y)
plt.plot(x, model.predict(x), 'g')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x)))
Fit quality (R^2): 0.9758071614095045
Whether you use Pipeline or make_pipeline, the goal is the same: to chain transformers and estimators together. In the code above, PolynomialFeatures is the transformer and LinearRegression is the estimator.
import statsmodels.api as sm gives the same result here as well:
X = pd.DataFrame({"c": np.ones(len(x)), "x": pd.Series([_[0] for _ in x]), "x^2": pd.Series([_[0] for _ in x])**2})
model = sm.OLS(pd.Series([_[0] for _ in y]), X)
result = model.fit()
print(result.params)
print("Intercept:", result.params['c'])
print("First-degree coefficient:", result.params['x'])
print("Second-degree coefficient:", result.params['x^2'])
c -0.021173
x 0.746901
x^2 -0.036182
dtype: float64
Intercept: -0.02117256463865172
First-degree coefficient: 0.7469011589908683
Second-degree coefficient: -0.036181545253089045
Cubic:
poly = PolynomialFeatures(3, include_bias=True)
x_poly = poly.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
print("First-degree coefficient:", model.coef_[0][1])
print("Second-degree coefficient:", model.coef_[0][2])
print("Third-degree coefficient:", model.coef_[0][3])
print("Intercept:", model.intercept_[0])
print(f"Fitted function: y={model.coef_[0][3]}x^3+{model.coef_[0][2]}x^2+{model.coef_[0][1]}x+{model.intercept_[0]}")
plt.scatter(x, y)
X = np.linspace(0, 5.5, 100)
plt.plot(X, model.coef_[0][3]*X**3 + model.coef_[0][2]*X**2 + model.coef_[0][1]*X + model.intercept_[0], 'g')
plt.plot(x, model.predict(x_poly), 'r')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x_poly)))
First-degree coefficient: 0.6121695193895076
Second-degree coefficient: 0.035088817026739584
Third-degree coefficient: -0.009720786309123671
Intercept: 0.03716717408435666
Fitted function: y=-0.009720786309123671x^3+0.035088817026739584x^2+0.6121695193895076x+0.03716717408435666
Fit quality (R^2): 0.9741082686199946
Quartic:
poly = PolynomialFeatures(4, include_bias=True)
x_poly = poly.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
print("First-degree coefficient:", model.coef_[0][1])
print("Second-degree coefficient:", model.coef_[0][2])
print("Third-degree coefficient:", model.coef_[0][3])
print("Fourth-degree coefficient:", model.coef_[0][4])
print("Intercept:", model.intercept_[0])
print(f"Fitted function: y={model.coef_[0][4]}x^4+{model.coef_[0][3]}x^3+{model.coef_[0][2]}x^2+{model.coef_[0][1]}x+{model.intercept_[0]}")
plt.scatter(x, y)
X = np.linspace(0, 5.5, 100)
plt.plot(X, model.coef_[0][4]*X**4 + model.coef_[0][3]*X**3 + model.coef_[0][2]*X**2 + model.coef_[0][1]*X + model.intercept_[0], 'g')
plt.plot(x, model.predict(x_poly), 'r')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x_poly)))
First-degree coefficient: 0.8237407976439868
Second-degree coefficient: -0.15713093509088588
Third-degree coefficient: 0.05250537654853375
Fourth-degree coefficient: -0.00641270191634358
Intercept: -0.025611612440094467
Fitted function: y=-0.00641270191634358x^4+0.05250537654853375x^3+-0.15713093509088588x^2+0.8237407976439868x+-0.025611612440094467
Fit quality (R^2): 0.9747156306781533
Quintic:
poly = PolynomialFeatures(5, include_bias=True)
x_poly = poly.fit_transform(x)
model = LinearRegression()
model.fit(x_poly, y)
print("First-degree coefficient:", model.coef_[0][1])
print("Second-degree coefficient:", model.coef_[0][2])
print("Third-degree coefficient:", model.coef_[0][3])
print("Fourth-degree coefficient:", model.coef_[0][4])
print("Fifth-degree coefficient:", model.coef_[0][5])
print("Intercept:", model.intercept_[0])
print(f"Fitted function: y={model.coef_[0][5]}x^5+{model.coef_[0][4]}x^4+{model.coef_[0][3]}x^3+{model.coef_[0][2]}x^2+{model.coef_[0][1]}x+{model.intercept_[0]}")
plt.scatter(x, y)
X = np.linspace(0, 5.5, 100)
plt.plot(X, model.coef_[0][5]*X**5 + model.coef_[0][4]*X**4 + model.coef_[0][3]*X**3 + model.coef_[0][2]*X**2 + model.coef_[0][1]*X + model.intercept_[0], 'g')
plt.plot(x, model.predict(x_poly), 'r')
plt.show()
print("Fit quality (R^2):", r2_score(y, model.predict(x_poly)))
First-degree coefficient: 1.502446352683112
Second-degree coefficient: -1.0642795452335645
Third-degree coefficient: 0.5350710566181669
Fourth-degree coefficient: -0.11523898348823729
Fifth-degree coefficient: 0.008694307907109533
Intercept: -0.17266066636826705
Fitted function: y=0.008694307907109533x^5+-0.11523898348823729x^4+0.5350710566181669x^3+-1.0642795452335645x^2+1.502446352683112x+-0.17266066636826705
Fit quality (R^2): 0.9758071614095045
It appears that the higher the polynomial degree, the better the fit.
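Caution: these R^2 values are all computed on the training data, where a higher-degree nested model can never score worse, even when the extra terms only fit noise. A quick hedge is to score on held-out folds instead; a sketch with made-up, truly linear data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 5, 60).reshape(-1, 1)
y_demo = 0.58 * x_demo.ravel() + 0.10 + rng.normal(0, 0.05, 60)  # truly linear

means = {}
for degree in (1, 2, 5):
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # mean R^2 over 5 held-out folds, not on the training points
    means[degree] = cross_val_score(pipe, x_demo, y_demo, cv=5, scoring='r2').mean()
    print(degree, means[degree])
```

On data like this, the extra degrees buy little or nothing out of sample, and with small noisy samples they often hurt.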
Let's compare the quintic fit with the quadratic fit (note that result above holds the quadratic OLS fit, not the first-degree one):
X = pd.DataFrame({"c": np.ones(len(x)), "x": pd.Series([_[0] for _ in x]), "x^2": pd.Series([_[0] for _ in x])**2,
"x^3": pd.Series([_[0] for _ in x])**3, "x^4": pd.Series([_[0] for _ in x])**4, "x^5": pd.Series([_[0] for _ in x])**5})
model = sm.OLS(pd.Series([_[0] for _ in y]), X)
result5 = model.fit()
print(result5.params)
print("Intercept:", result5.params['c'])
print("First-degree coefficient:", result5.params['x'])
print("Second-degree coefficient:", result5.params['x^2'])
print("Third-degree coefficient:", result5.params['x^3'])
print("Fourth-degree coefficient:", result5.params['x^4'])
print("Fifth-degree coefficient:", result5.params['x^5'])
c -0.172661
x 1.502446
x^2 -1.064280
x^3 0.535071
x^4 -0.115239
x^5 0.008694
dtype: float64
Intercept: -0.17266066636842506
First-degree coefficient: 1.502446352683135
Second-degree coefficient: -1.0642795452334017
Third-degree coefficient: 0.5350710566180569
Fourth-degree coefficient: -0.11523898348821326
Fifth-degree coefficient: 0.00869430790710762
result.summary()
| Dep. Variable: | y | R-squared: | 0.974 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.973 |
| Method: | Least Squares | F-statistic: | 768.3 |
| Date: | Thu, 24 Sep 2020 | Prob (F-statistic): | 3.18e-33 |
| Time: | 10:21:01 | Log-Likelihood: | 34.613 |
| No. Observations: | 44 | AIC: | -63.23 |
| Df Residuals: | 41 | BIC: | -57.87 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| c | -0.0212 | 0.046 | -0.462 | 0.647 | -0.114 | 0.071 |
| x | 0.7469 | 0.049 | 15.275 | 0.000 | 0.648 | 0.846 |
| x^2 | -0.0362 | 0.010 | -3.584 | 0.001 | -0.057 | -0.016 |

| Omnibus: | 49.431 | Durbin-Watson: | 1.073 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 316.704 |
| Skew: | 2.578 | Prob(JB): | 1.69e-69 |
| Kurtosis: | 15.090 | Cond. No. | 27.2 |
result5.summary()
| Dep. Variable: | y | R-squared: | 0.976 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.973 |
| Method: | Least Squares | F-statistic: | 314.1 |
| Date: | Thu, 24 Sep 2020 | Prob (F-statistic): | 8.20e-30 |
| Time: | 10:21:14 | Log-Likelihood: | 36.714 |
| No. Observations: | 44 | AIC: | -61.43 |
| Df Residuals: | 38 | BIC: | -50.72 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| c | -0.1727 | 0.149 | -1.162 | 0.252 | -0.473 | 0.128 |
| x | 1.5024 | 0.584 | 2.573 | 0.014 | 0.320 | 2.685 |
| x^2 | -1.0643 | 0.732 | -1.453 | 0.154 | -2.547 | 0.418 |
| x^3 | 0.5351 | 0.379 | 1.412 | 0.166 | -0.232 | 1.302 |
| x^4 | -0.1152 | 0.084 | -1.365 | 0.180 | -0.286 | 0.056 |
| x^5 | 0.0087 | 0.007 | 1.293 | 0.204 | -0.005 | 0.022 |

| Omnibus: | 38.240 | Durbin-Watson: | 1.163 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 157.238 |
| Skew: | 2.023 | Prob(JB): | 7.18e-35 |
| Kurtosis: | 11.331 | Cond. No. | 3.88e+04 |
Comparing the R-squared values, result5's (0.976) is higher than result's (0.974), but that alone is not strong evidence that the quintic fit beats the quadratic fit, because we have not yet checked the residuals.
fig, ax = plt.subplots()
line = np.linspace(-2.0, 2.1, 100)  # reference line y = x; avoid reusing the name x, which still holds the data
ax.plot(line, line, 'g')
smg.qqplot(result.resid, ax=ax, color='b')
smg.qqplot(result5.resid, ax=ax, color='r')
fig.tight_layout()
plt.show()
This plot also suggests that the red points (quintic residuals) sit slightly closer to the green reference line than the blue points (quadratic residuals), so we can tentatively conclude that the quintic fit is somewhat better than the quadratic one.
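Besides the visual QQ comparison, residual normality can also be checked numerically, for example with the Jarque-Bera statistic that already appears in the summary tables above. A sketch on synthetic residual-like samples (not the fitted residuals):

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(1)
resid_normal = rng.normal(0.0, 1.0, 300)        # roughly normal "residuals"
resid_skewed = rng.exponential(1.0, 300) - 1.0  # heavily skewed "residuals"

# jarque_bera returns (statistic, p-value, skew, kurtosis);
# a large statistic / tiny p-value rejects normality of the residuals
for name, r in [("normal", resid_normal), ("skewed", resid_skewed)]:
    jb, p, skew, kurt = jarque_bera(r)
    print(name, jb, p)
```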