1. Overfitting and Underfitting
Question: the model fits the training data well, with small error, so why does it run into trouble on the test set?
Underfitting:
Overfitting:
Analysis of figure 1 above:
After training, the model knows that a swan has wings and a long beak, and naively concludes that anything with these features is a swan. Because the machine has learned too few swan features, its decision criterion is too coarse to identify swans accurately.
Analysis of figure 2 above:
The machine learns the swan's features from these images. After training, it knows that a swan has wings, a long curved beak, a long slightly curved neck, and a body shaped like a "2" that is somewhat larger than a duck's. At this point it can mostly tell swans apart from other animals. Unfortunately, all of the available swan images happen to show white swans, so after training the machine concludes that all swans have white feathers, and when it later sees a black swan it decides that is not a swan.
As training proceeds, a linear model can grow into a more complex one (linear becomes nonlinear; the straight line becomes a curve).
In the figure above, the first fit is clearly underfitting, the third is overfitting, and the second is just right.
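The three regimes can be reproduced with a small sketch. The data (a noisy sine curve) and the degree values are my own illustration, not from the notes: a degree-1 polynomial underfits, a moderate degree fits well, and a high degree drives the training error down while the test error stays high.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical example): a noisy sine curve
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 60)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

results = {}
for degree in (1, 4, 15):  # too simple / about right / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(x_train)),  # train MSE
        mean_squared_error(y_test, model.predict(x_test)),    # test MSE
    )
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

The degree-15 model has the lowest training error of the three, but that does not carry over to the test set, which is exactly the overfitting symptom described in the question above.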
1.1 Causes of underfitting and how to fix it
• Cause:
• The model has learned too few features of the data
• Fix:
• Add more features to the data
1.2 Causes of overfitting and how to fix it
• Cause:
• There are too many original features, some of them noisy; the model becomes overly complex because it tries to accommodate every training data point
• Fixes:
• Feature selection: remove highly correlated features (hard to do well in practice)
• Cross-validation, so that all of the data takes part in training (this only detects overfitting, it does not cure it)
• Regularization (overview)
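The cross-validation bullet can be made concrete with scikit-learn's cross_val_score: the data is split into k folds, every sample serves as validation data exactly once, and a large gap between training and fold scores signals overfitting. The synthetic data here is my own illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression data: y is a linear function of X plus noise
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# 5-fold cross-validation: each of the 5 folds is held out once for scoring
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("per-fold R^2:", scores)
print("mean R^2:", scores.mean())
```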
1.3 L2 regularization
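L2 regularization adds a penalty alpha * ||w||^2 to the squared-error loss, which shrinks the weights toward zero and keeps the model from becoming overly complex. For linear regression the penalized problem has the closed-form solution w = (X^T X + alpha * I)^(-1) X^T y. A minimal sketch on hypothetical synthetic data, checked against scikit-learn's Ridge (with the intercept disabled so the two formulations match):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)

alpha = 1.0
# Closed-form ridge solution: (X^T X + alpha*I) w = X^T y
w_manual = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# The same model via sklearn's Ridge estimator
rd = Ridge(alpha=alpha, fit_intercept=False)
rd.fit(X, y)
print("manual:", w_manual)
print("sklearn:", rd.coef_)
```

Note how alpha * I makes the matrix being inverted better conditioned; this is also why ridge regression copes better with correlated features than the plain normal equation.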
2. Linear regression with regularization: Ridge
• sklearn.linear_model.Ridge
Observe how changing the regularization strength affects the result.
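One way to observe this is to sweep the alpha parameter of Ridge and watch the coefficients shrink. The data below (two nearly collinear features, my own illustration) is exactly the setting where the L2 penalty stabilizes otherwise erratic weights:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: two almost identical (collinear) features,
# where weakly regularized least squares yields large, unstable weights
rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(0, 0.01, 100)])
y = 3 * x1 + rng.normal(0, 0.1, 100)

norms = []
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    rd = Ridge(alpha=alpha)
    rd.fit(X, y)
    norms.append(np.linalg.norm(rd.coef_))
    print(f"alpha={alpha:6.2f}  coef={rd.coef_}  ||w||={norms[-1]:.3f}")
```

As alpha grows, the L2 norm of the coefficient vector decreases monotonically; with alpha too large the model underfits, so alpha trades bias against variance.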
2.1 LinearRegression vs. Ridge
Code:
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, SGDRegressor, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Predict house prices with linear regression
# Load the data (note: load_boston was removed in scikit-learn 1.2)
lb = load_boston()
# Split the data into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(lb.data, lb.target, test_size=0.25)
print(y_test.shape)  # 1-D
# Standardize (both the feature values and the target values need it)
std_x = StandardScaler()
x_train = std_x.fit_transform(x_train)
x_test = std_x.transform(x_test)
# Target values
std_y = StandardScaler()
y_train = std_y.fit_transform(y_train.reshape(-1, 1))  # since sklearn 0.19 the array passed in must be 2-D
y_test = std_y.transform(y_test.reshape(-1, 1))
# Estimator predictions
# Prediction via the normal equation
lr = LinearRegression()
lr.fit(x_train, y_train)
print(lr.coef_)
# Predict the prices of the test-set houses
y_predict = lr.predict(x_test)
y_predict = std_y.inverse_transform(y_predict)  # undo the standardization
print('House prices:', y_predict)
print("Normal-equation mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_predict))
# Predict house prices with gradient descent
sgd = SGDRegressor()
sgd.fit(x_train, y_train.ravel())  # SGDRegressor expects a 1-D target
print(sgd.coef_)
# Predict the prices of the test-set houses
y_sgd_predict = std_y.inverse_transform(sgd.predict(x_test).reshape(-1, 1))
print("Gradient-descent predicted prices for the test set:", y_sgd_predict)
print("Gradient-descent mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_sgd_predict))
# Ridge regression
rd = Ridge()
rd.fit(x_train, y_train)
print(rd.coef_)
# Predict the prices of the test-set houses
y_predict = rd.predict(x_test)
y_predict = std_y.inverse_transform(y_predict)  # undo the standardization
print('House prices:', y_predict)
print("Ridge mean squared error:", mean_squared_error(std_y.inverse_transform(y_test), y_predict))
Results (the long prediction arrays are truncated here for brevity):
(127,)
[[-0.10567608  0.12800059 -0.02081012  0.08019896 -0.20419646  0.26203745
  -0.0143011  -0.34425272  0.32636026 -0.24074745 -0.21175963  0.08813659
  -0.40618423]]
House prices: [[14.77076723]
 [13.27831203]
 [16.60033542]
 ...
 [42.28879618]]
Normal-equation mean squared error: 19.943775280232305
[-0.08193998  0.07953498 -0.07694884  0.09692202 -0.13142005  0.30038964
 -0.01906175 -0.27527616  0.16869863 -0.0634908  -0.20211355  0.09301433
 -0.39414702]
Gradient-descent predicted prices for the test set: [15.39218891 13.52798063 16.47874323
 ... 42.72435392]
Gradient-descent mean squared error: 20.404166344036355
[[-0.10424982  0.12529391 -0.02529284  0.08079958 -0.20029906  0.26336097
  -0.01482654 -0.33993608  0.31510062 -0.22903615 -0.21089159  0.08823867
  -0.40448616]]
House prices: [[14.78924732]
 [13.30010965]
 [16.57696075]
 ...
 [42.27200545]]
Ridge mean squared error: 19.958165933908848