Steam Prediction Competition: Model Validation
Tianchi competition page: link
Theory
- Underfitting (high bias)
    - Add extra features
    - Add polynomial features
    - Reduce the regularization penalty
- Overfitting (high variance)
    - Collect more data
    - Use fewer features
    - Increase the regularization penalty
- Generalization: how well the concepts a model has learned carry over to samples it has never seen before
- Regularization: prevents overfitting
    - L1 regularization (Lasso)
        - Penalty is the sum of the absolute values of the coefficients
        - Forces the coefficients of weak features to exactly 0, so the learned model is often sparse
    - L2 regularization (Ridge regression)
        - Penalty is the Euclidean norm: the square root of the sum of squared coefficients
        - Shrinks all coefficients without zeroing them, so every feature keeps contributing to the fitted curve
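To make the sparsity difference concrete, here is a minimal sketch on synthetic data (not the competition data): only the first two features carry signal, and Lasso is expected to zero out the rest while Ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# only the first two features carry signal; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# L1 drives weak-feature coefficients exactly to 0; L2 only shrinks them
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```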
Evaluation metrics for regression models
- Mean Absolute Error (MAE): average of absolute errors; from sklearn.metrics import mean_absolute_error
- Mean Squared Error (MSE): average of squared errors; mean_squared_error
- Root Mean Squared Error (RMSE): square root of the MSE; np.sqrt(mean_squared_error(...))
- R-Squared: 1 - (sum of squared residuals) / (sum of squared deviations from the mean); r2_score
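A short sketch of the four metrics on a toy pair of true/predicted vectors; sklearn has no separate RMSE function here, so RMSE is taken as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)   # average |error|
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # RMSE = sqrt(MSE)
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot
print(mae, mse, rmse, r2)
```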
Cross-validation
- Simple hold-out validation: typically set aside 30% of the data as the test set
- K-fold cross-validation (KFold)
    - Split the data into K folds; each fold serves as the validation set once, with the remaining K-1 folds as the training set; average the K scores
    - from sklearn.model_selection import KFold
      kf = KFold(n_splits=10)
      for train, test in kf.split(data):
          ...
- Leave-one-out cross-validation (LeaveOneOut)
    - The training set is all samples except one; the single held-out sample forms the validation set
    - from sklearn.model_selection import LeaveOneOut
      loo = LeaveOneOut()
- Leave-P-out cross-validation (LeavePOut)
    - Remove p samples from the full dataset to form the validation set; the remaining samples form the training set
    - from sklearn.model_selection import LeavePOut
      lpo = LeavePOut(p=5)
- Stratified cross-validation based on class labels
    - Addresses class imbalance
    - Ensures the class frequencies of the full dataset are preserved in every training and validation fold
    - StratifiedKFold: K-Fold variant that returns indices; the class proportions in each fold roughly match those of the full dataset
    - StratifiedShuffleSplit: ShuffleSplit variant that creates randomized splits in which each class's proportion matches the full dataset
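A minimal sketch of the stratification guarantee: with an 80/20 class split, each StratifiedKFold test fold keeps the same 4:1 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((10, 1))
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced: 80% class 0, 20% class 1

skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    # each test fold keeps the 4:1 class ratio of the full data
    print(np.bincount(y[test_index]))
```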
- Cross-validation for grouped data
    - Holds out an entire group of related samples so the same group never appears in both the training and test sets
    - GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit
- Time series split (TimeSeriesSplit)
    - Time-series samples must keep their temporal order
    - Never validate current data against data from the future
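A small sketch of the no-future-leakage property: in every TimeSeriesSplit fold, all training indices come before all test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(6, 1)   # six time-ordered samples
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    # every training index precedes every test index: no future leakage
    print("train:", train_index, "test:", test_index)
```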
Model tuning
- Goal: balance the bias and variance of the overall model
- Process-level parameters
    - Number of sub-models: n_estimators
    - Learning rate: learning_rate
- Sub-model-level parameters
    - Maximum tree depth: max_depth
    - Split criterion: criterion
- Bagging reduces variance
- Boosting reduces bias
- RandomForest
    - criterion, n_estimators, max_leaf_nodes, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf
- GradientTree
    - learning_rate, n_estimators, max_leaf_nodes, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, subsample
- Grid search: exhaustive search over the parameter grid
1. Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import lightgbm as lgb
from sklearn.metrics import mean_squared_error
2. Load the data
train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"
train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
feature_columns = [col for col in train_data if col not in ['target']]
min_max_scaler = MinMaxScaler().fit(train_data[feature_columns])
train_data_scaler = min_max_scaler.transform(train_data[feature_columns])
test_data_scaler = min_max_scaler.transform(test_data[feature_columns])
train_data_scaler = pd.DataFrame(train_data_scaler,columns=feature_columns)
train_data_scaler['target'] = train_data['target']
test_data_scaler = pd.DataFrame(test_data_scaler,columns=feature_columns)
pca = PCA(n_components=16)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:,:-1])
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
new_test_pca_16 = pca.transform(test_data_scaler)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
train = new_train_pca_16.fillna(0)[new_test_pca_16.columns]
target = new_train_pca_16['target']
train_data,test_data,train_target,test_target = train_test_split(train,target,test_size=0.2,random_state=0)
3. Fit the models
# Underfitting: max_iter caps the number of iterations; tol stops training once the loss improvement falls below this value
clf = SGDRegressor(max_iter=500,tol=1e-2)
clf.fit(train_data,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
# SGDRegressor train MSE: 0.1518963746787445
# SGDRegressor test MSE: 0.15617004077751775
# Overfitting
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data_poly)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data_poly)))
# SGDRegressor train MSE: 0.13239797454332378
# SGDRegressor test MSE: 0.14471496796078484
# A reasonable fit
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data_poly)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data_poly)))
# SGDRegressor train MSE: 0.1339474648768595
# SGDRegressor test MSE: 0.14245462035372333
# Elastic net regularization
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
# Elastic net combines weighted L1 and L2 penalties; l1_ratio is the L1 share
clf = SGDRegressor(max_iter=1000,tol=1e-3,penalty='elasticnet',l1_ratio=0.9,alpha=0.001)
clf.fit(train_data_poly,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data_poly)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data_poly)))
# SGDRegressor train MSE: 0.1415386321846727
# SGDRegressor test MSE: 0.1480928293310648
4. Cross-validation: KFold (K-fold), LeaveOneOut (leave-one-out), LeavePOut (leave-P-out)
# from sklearn.model_selection import LeaveOneOut
# loo = LeaveOneOut()
# from sklearn.model_selection import LeavePOut
# lpo = LeavePOut(p=10)
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k,(train_index,test_index) in enumerate(kf.split(train)):
    train_data,test_data,train_target,test_target = train.values[train_index],train.values[test_index],\
                                                    target.values[train_index],target.values[test_index]
    clf = SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    print(k,"fold ","SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
    print(k,"fold ","SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
'''
Note: KFold returns indices
0 fold  SGDRegressor train MSE:  0.15000344819294997
0 fold  SGDRegressor test MSE:  0.10575417172621228
1 fold  SGDRegressor train MSE:  0.13355864901674058
1 fold  SGDRegressor test MSE:  0.18233565251295702
2 fold  SGDRegressor train MSE:  0.1465085643709417
2 fold  SGDRegressor test MSE:  0.13286058060702355
3 fold  SGDRegressor train MSE:  0.14066988317392046
3 fold  SGDRegressor test MSE:  0.1618205154729875
4 fold  SGDRegressor train MSE:  0.13813813503634081
4 fold  SGDRegressor test MSE:  0.16437792249692298
'''
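The manual per-fold loop above can also be condensed with cross_val_score; a sketch on synthetic stand-in data (make_regression here replaces the competition features/target):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the competition features/target
X, y = make_regression(n_samples=200, n_features=16, noise=0.1, random_state=0)

model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
# sklearn maximizes scores, so MSE is reported negated
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
print("mean CV MSE:", -scores.mean())
```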
5. Hyperparameter space search: GridSearchCV
from sklearn.model_selection import GridSearchCV
randomForestRegressor = RandomForestRegressor()
# Build the hyperparameter search space as a dict
parameters = {'n_estimators':[100,200],'max_depth':[5,10]}
clf = GridSearchCV(randomForestRegressor,param_grid=parameters,cv=3)
clf.fit(train_data,train_target)
print("RandomForestRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
print("RandomForestRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
# cv_results_ contains fit times and validation metrics
print(sorted(clf.cv_results_))
print(clf.best_params_)
'''
RandomForestRegressor train MSE: 0.03937490165306472
RandomForestRegressor test MSE: 0.15825533758727703
['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_n_estimators', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']
{'max_depth': 10, 'n_estimators': 200}
'''
# Randomized parameter search
train_data,test_data,train_target,test_target = train_test_split(train,target,test_size=0.2,random_state=0)
from sklearn.model_selection import RandomizedSearchCV
randomForestRegressor = RandomForestRegressor()
parameters = {'n_estimators':[50,100,150,200],'max_depth':[3,5,10,15]}
clf = RandomizedSearchCV(randomForestRegressor,param_distributions=parameters,cv=3)
clf.fit(train_data,train_target)
print("RandomForestRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
print("RandomForestRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
print(sorted(clf.cv_results_))
print(clf.best_params_)
'''
RandomForestRegressor train MSE: 0.022121433444846298
RandomForestRegressor test MSE: 0.15637853039323252
['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_n_estimators', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']
{'n_estimators': 200, 'max_depth': 15}
'''
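RandomizedSearchCV can also sample parameters from distributions rather than fixed lists, which covers the space more finely at the same budget. A sketch on synthetic stand-in data (make_regression replaces the competition data; the ranges here are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# synthetic stand-in for the competition data
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

# sample integer parameters from ranges instead of fixed lists
param_dist = {'n_estimators': randint(50, 200), 'max_depth': randint(3, 15)}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=param_dist,
                            n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```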
6. 5-fold cross-validation with a LightGBM (LGB) model
train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"
train_data2 = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data2 = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
train_data2_d = train_data2[test_data2.columns].values
train_data2_t = train_data2['target'].values
from sklearn.model_selection import KFold
import lightgbm as lgb
import numpy as np
kf = KFold(n_splits=5,shuffle=True,random_state=2020)
MSE_DICT={'train_mse':[],'test_mse':[]}
for k,(train_index,test_index) in enumerate(kf.split(train_data2_d)):
    lgb_reg = lgb.LGBMRegressor(
        learning_rate=0.01,
        max_depth=-1,
        n_estimators=5000,
        boosting_type='gbdt',
        random_state=2020,
        objective='regression',
    )
    X_train_KFold,X_test_KFold = train_data2_d[train_index],train_data2_d[test_index]
    Y_train_KFold,Y_test_KFold = train_data2_t[train_index],train_data2_t[test_index]
    lgb_reg.fit(X=X_train_KFold,
                y=Y_train_KFold,
                eval_set=[(X_train_KFold,Y_train_KFold),(X_test_KFold,Y_test_KFold)],
                eval_names=['train','test'],
                early_stopping_rounds=100,
                eval_metric='MSE',
                verbose=50
                )
    y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)
    y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_)
    train_mse = mean_squared_error(y_train_KFold_predict, Y_train_KFold)
    test_mse = mean_squared_error(y_test_KFold_predict, Y_test_KFold)
    MSE_DICT['train_mse'].append(train_mse)
    MSE_DICT['test_mse'].append(test_mse)
    print('Fold {}: train and predict (train MSE / test MSE)'.format(k + 1))
    print('------\n', 'Train MSE\n', train_mse, '\n------')
    print('------\n', 'Test MSE\n', test_mse, '\n------\n')
print('------\n', 'Train MSE\n', MSE_DICT['train_mse'], '\n',np.mean(MSE_DICT['train_mse']), '\n------')
print('------\n', 'Test MSE\n', MSE_DICT['test_mse'], '\n',np.mean(MSE_DICT['test_mse']), '\n------')
'''
Training until validation scores don't improve for 100 rounds
[50] train's l2: 0.425465 test's l2: 0.496533
[100] train's l2: 0.220271 test's l2: 0.283883
[150] train's l2: 0.134273 test's l2: 0.19732
[200] train's l2: 0.0949213 test's l2: 0.158257
[250] train's l2: 0.0741039 test's l2: 0.13845
[300] train's l2: 0.0617553 test's l2: 0.127983
[350] train's l2: 0.052972 test's l2: 0.122248
[400] train's l2: 0.0463628 test's l2: 0.118739
[450] train's l2: 0.0408907 test's l2: 0.116877
[500] train's l2: 0.036397 test's l2: 0.115797
[550] train's l2: 0.0326647 test's l2: 0.115223
[600] train's l2: 0.0293509 test's l2: 0.114269
[650] train's l2: 0.0265367 test's l2: 0.113535
[700] train's l2: 0.0241231 test's l2: 0.112868
[750] train's l2: 0.0219745 test's l2: 0.112329
[800] train's l2: 0.0200866 test's l2: 0.111965
[850] train's l2: 0.0184577 test's l2: 0.111526
[900] train's l2: 0.0169686 test's l2: 0.111169
[950] train's l2: 0.0156194 test's l2: 0.110916
[1000] train's l2: 0.014383 test's l2: 0.1108
[1050] train's l2: 0.0132859 test's l2: 0.110887
[1100] train's l2: 0.0122815 test's l2: 0.110883
Early stopping, best iteration is:
[1002] train's l2: 0.0143354 test's l2: 0.110787
Fold 1: train and predict (train MSE / test MSE)
------
 Train MSE
 0.014335406865061465
------
------
 Test MSE
 0.11078669083212028
....
 Train MSE
 [0.014335406865061465, 0.00038001724874838936, 0.0066264997159705035, 0.002476702516409125, 0.006477785687227073]
 0.006059282406683311
------
------
 Test MSE
 [0.11078669083212028, 0.11085583591285612, 0.11445116914718123, 0.09379035773727774, 0.11315095015348814]
 0.10860700075658469
------
------
'''
7. Learning curve (learning_curve)
- learning_curve: see reference link
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve
plt.rcParams.update({# Use mathtext, not LaTeX
'text.usetex': False,
# Use the Computer modern font
'font.family': 'serif',
'font.serif': 'cmr10',
'mathtext.fontset': 'cm',
})
def plot_learning_curve(estimator,title,X,y,ylim=None,cv=None,n_jobs=1,train_sizes=np.linspace(.1, 1.0, 7)):
    plt.figure(figsize=(9, 5), dpi=250)
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes,
                     train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std,
                     alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes,
                     test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std,
                     alpha=0.1,
                     color="g")
    plt.plot(train_sizes,
             train_scores_mean,
             'o-',
             color="r",
             label="Training score")
    plt.plot(train_sizes,
             test_scores_mean,
             'o-',
             color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"
train_data2 = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data2 = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
X = train_data2[test_data2.columns].values
Y = train_data2['target'].values
title = "SGDRegressor"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=2020)
estimator = SGDRegressor()
plot_learning_curve(estimator, title, X, Y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
8. Validation curve (validation_curve)
- Plots scores against a hyperparameter so the best value can be read off the figure
- validation_curve: see reference link
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve
plt.rcParams.update({# Use mathtext, not LaTeX
'text.usetex': False,
# Use the Computer modern font
'font.family': 'serif',
'font.serif': 'cmr10',
'mathtext.fontset': 'cm',
})
train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"
train_data2 = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data2 = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')
X = train_data2[test_data2.columns].values
Y = train_data2['target'].values
# Find a reasonable value for the regularization strength alpha
param_range = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
# validation_curve reference: https://blog.csdn.net/lvchunyang66/article/details/104411659
train_scores, test_scores = validation_curve(
    SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1'),X,Y,
    param_name="alpha",
    param_range=param_range,
    cv=10,scoring='r2',n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.figure(figsize=(9, 5), dpi=150)
plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
# semilogx: plot the curve on a logarithmic x-axis
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range,
train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std,
alpha=0.2,
color="r")
plt.semilogx(param_range,test_scores_mean,label="Cross-validation score",color="g")
plt.fill_between(param_range,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.2,
color="g")
plt.legend(loc="best")
plt.show()