Linear Regression with Scikit-Learn: Gradient Descent
The previous sections implemented gradient descent, linear regression, and normalization by hand. But when ready-made libraries exist, why not use them? This section shows how to do linear regression with Scikit-Learn.
NumPy documentation
scikit-learn Chinese community (some of the translations are clearly machine-generated)
scikit-learn official documentation
1. Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.preprocessing import StandardScaler
from lab_utils_multi import load_house_data   # course helper that returns the housing data

np.set_printoptions(precision=2)  # print arrays with two decimal places
dlblue = '#0096ff'; dlorange = '#FF9300'; dldarkred = '#C00000'; dlmagenta = '#FF40FF'; dlpurple = '#7030A0'
plt.style.use('./deeplearning.mplstyle')
Scikit-learn has a gradient descent regression model, sklearn.linear_model.SGDRegressor. Like the hand-written version, this model expects standardized input data; scikit-learn also provides sklearn.preprocessing.StandardScaler to perform the z-score normalization. See the documentation for the full parameter list. (A sketch that chains the two into a single pipeline follows the data loading below.)
2. Load the data and preprocess the inputs
Load the data:
X_train, y_train = load_house_data()
X_features = ['size(sqft)','bedrooms','floors','age']
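As an aside, a scaler and a regressor can be chained into one estimator with sklearn.pipeline.make_pipeline, so that standardization happens automatically inside fit and predict. This is a minimal sketch, not the approach used in the rest of this section, which keeps the steps separate:
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000))
model.fit(X_train, y_train)     # raw features go in; the pipeline scales them internally
y_hat = model.predict(X_train)  # predict() applies the same scaling automatically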
Normalize the data
Create a StandardScaler object, then call fit_transform to perform the normalization:
scaler = StandardScaler()               # create the scaler object
X_norm = scaler.fit_transform(X_train)  # fit the scaler to X_train and transform it
print(f"Peak to Peak range by column in Raw X:{np.ptp(X_train,axis=0)}")
print(f"Peak to Peak range by column in Normalized X:{np.ptp(X_norm,axis=0)}")
Peak to Peak range by column in Raw X:[2.41e+03 4.00e+00 1.00e+00 9.50e+01]
Peak to Peak range by column in Normalized X:[5.85 6.14 2.06 3.69]
See the np.ptp documentation for the function used above; "peak to peak" is the maximum minus the minimum along a given axis. For example:
x = np.array([[4, 9, 2, 10],
              [6, 9, 7, 12]])
np.ptp(x, axis=1)
array([8, 6])        # along axis 1: 10-2=8, 12-6=6
np.ptp(x, axis=0)
array([2, 0, 5, 2])  # along axis 0: 6-4=2, 9-9=0, ...
np.ptp(x)            # max minus min over all values
10
3. Fit the data
Create an SGDRegressor object (which fits via stochastic gradient descent), then call fit on the normalized data:
sgdr = SGDRegressor(max_iter=1000)
sgdr.fit(X_norm, y_train)
print(sgdr)
print(f"number of iterations completed: {sgdr.n_iter_}, number of weight updates: {sgdr.t_}")
SGDRegressor()
number of iterations completed: 129, number of weight updates: 12772.0
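A sanity check on those two numbers: n_iter_ is the number of passes (epochs) over the training set, while t_ counts individual weight updates. The scikit-learn docs describe t_ as n_iter_ * n_samples + 1, which is consistent with the output above given the 99 examples in this dataset (129 × 99 + 1 = 12772):
# should hold per the documented relation between t_ and n_iter_
assert sgdr.t_ == sgdr.n_iter_ * X_norm.shape[0] + 1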
Inspect the model parameters intercept_ and coef_:
b_norm = sgdr.intercept_
w_norm = sgdr.coef_
print(f"model parameters: w: {w_norm}, b:{b_norm}")
print(f"model parameters from previous lab: w: [110.56 -21.27 -32.71 -37.97], b: 363.16")
model parameters: w: [110.22 -21.12 -32.53 -38.02], b:[363.15]
model parameters from previous lab: w: [110.56 -21.27 -32.71 -37.97], b: 363.16
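Note that these parameters are for the z-score normalized features. To get weights that apply to the raw features, the scaling can be undone algebraically: since x_norm = (x - mean) / scale, the prediction w_norm·x_norm + b_norm equals (w_norm/scale)·x + (b_norm - sum(w_norm*mean/scale)). A minimal sketch using the scaler's mean_ and scale_ attributes:
# undo the z-score scaling to express the model on the raw features
w_orig = w_norm / scaler.scale_
b_orig = b_norm - np.sum(w_norm * scaler.mean_ / scaler.scale_)
print(f"model parameters on raw features: w: {w_orig}, b: {b_orig}")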
4. Predict
The library provides a predict method:
# make a prediction using sgdr.predict()
y_pred_sgd = sgdr.predict(X_norm)
# make a prediction using w,b.
y_pred = np.dot(X_norm, w_norm) + b_norm
print(f"prediction using np.dot() and sgdr.predict match: {(y_pred == y_pred_sgd).all()}")
print(f"Prediction on training set:\n{y_pred[:4]}" )
print(f"Target values \n{y_train[:4]}")
prediction using np.dot() and sgdr.predict match: True
Prediction on training set:
[295.19 485.88 389.58 492.04]
Target values
[300. 509.8 394. 540. ]
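To predict the price of a house that is not in the training set, the new features must be standardized with the same scaler, using transform rather than fit_transform (which would re-fit the scaler to the new data). The feature values below are hypothetical:
# hypothetical house: 1200 sqft, 3 bedrooms, 1 floor, 40 years old
x_house = np.array([[1200, 3, 1, 40]])
x_house_norm = scaler.transform(x_house)  # reuse the mean and scale learned from X_train
print(f"predicted price: {sgdr.predict(x_house_norm)[0]:0.2f}")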
5. Visualize the results
# plot predictions and targets vs original features
fig, ax = plt.subplots(1, 4, figsize=(12, 3), sharey=True)
for i in range(len(ax)):
    ax[i].scatter(X_train[:, i], y_train, label='target')
    ax[i].set_xlabel(X_features[i])
    ax[i].scatter(X_train[:, i], y_pred, color=dlorange, label='predict')
ax[0].set_ylabel("Price"); ax[0].legend()
fig.suptitle("target versus prediction using z-score normalized model")
plt.show()
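Incidentally, LinearRegression was imported at the top but never used here: it solves the same least-squares problem in closed form (the normal equation), so it needs neither normalization nor iteration. A quick sketch for comparison:
# closed-form least squares on the raw (unnormalized) features
lr = LinearRegression()
lr.fit(X_train, y_train)
print(f"normal-equation parameters: w: {lr.coef_}, b: {lr.intercept_:0.2f}")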