智慧海洋建设 - Task 4: Model Building
This is the model-building module of the 智慧海洋建设 (Smart Ocean) competition. It introduces how to build models and how to tune them.
Learning objectives
- Learn how to choose an appropriate model and how to use models for feature selection
- Master the use of the random forest, LightGBM, and XGBoost models
- Master the concrete use of Bayesian optimization
Contents
- Model training and prediction
- Random forest
- LightGBM model
- XGBoost model
- Cross-validation
- Model tuning
- Model code examples on the 智慧海洋 dataset
Model training and prediction
The main steps of model training and prediction are:
(1) Import the required libraries.
(2) Preprocess the data: load the dataset and clean it, including handling missing values, normalizing continuous features, and encoding categorical features.
(3) Train the model: choose a suitable machine-learning model and fit it on the training set.
(4) Predict: feed the data to be predicted into the trained model to obtain the predictions.
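The four steps above can be sketched end to end. This is a minimal illustration only; the iris dataset and the random forest model are stand-ins, and any dataset or model could be substituted:

```python
# (1) Import the required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# (2) Preprocess: load the data and normalize the continuous features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# (3) Train a model on the training set
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# (4) Predict on unseen data
pred = clf.predict(X_test)
print(pred[:5])
```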
Below we introduce several commonly used classification algorithms.
Random forest classification
Random forest parameters
A random forest is an algorithm that combines many trees through the idea of ensemble learning; its basic unit is the decision tree, and the method itself belongs to the ensemble-learning branch of machine learning.
The main advantages of random forest models are: good accuracy among current algorithms; efficient operation on large datasets; the ability to handle high-dimensional inputs without dimensionality reduction; the ability to estimate the importance of each feature for the classification task; an unbiased estimate of the internal generalization error obtained during training; and good results even in the presence of missing values.
Using sklearn to train and predict with a random forest classifier:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# Load the dataset
iris=datasets.load_iris()
feature=iris.feature_names
X = iris.data
y = iris.target
# Random forest
clf=RandomForestClassifier(n_estimators=200)
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.1,random_state=5)
clf.fit(train_X,train_y)
test_pred=clf.predict(test_X)
# Inspect the feature importances
print(str(feature)+'\n'+str(clf.feature_importances_))
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0.09838896 0.01544017 0.34365936 0.5425115 ]
We evaluate the model with the F1 score; the comments below briefly explain how to choose the average parameter.
# F1 score for model evaluation
# For binary classification, use average='binary'
# To account for class imbalance with a class-weighted average, use 'weighted'
# To ignore class imbalance and compute the macro average, use 'macro'
score=f1_score(test_y,test_pred,average='macro')
print("Random forest - macro:",score)
score=f1_score(test_y,test_pred,average='weighted')
print("Random forest - weighted:",score)
Random forest - macro: 0.818181818181818
Random forest - weighted: 0.8
LightGBM model
For learning LightGBM, refer to this article.
The LightGBM Chinese documentation explains the hyperparameters in detail and is worth reading carefully.
- To reduce overfitting in LightGBM:
- Use a smaller max_bin
- Use a smaller num_leaves
- Use min_data_in_leaf and min_sum_hessian_in_leaf
- Enable bagging by setting bagging_fraction and bagging_freq
- Enable feature subsampling by setting feature_fraction
- Use more training data
- Apply regularization via lambda_l1, lambda_l2, and min_gain_to_split
- Tune max_depth to avoid growing overly deep trees
- For faster training in LightGBM:
- Enable bagging by setting bagging_fraction and bagging_freq
- Enable feature subsampling by setting feature_fraction
- Use a smaller max_bin
- Use save_binary to speed up data loading in future runs
- Use parallel learning; see the parallel learning guide
- For better accuracy in LightGBM:
- Use a larger max_bin (training may become slower)
- Use a smaller learning_rate with a larger num_iterations
- Use a larger num_leaves (may cause overfitting)
- Use more training data
- Try dart
import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score
import matplotlib.pyplot as plt
# Load the data
iris = datasets.load_iris()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
# Convert to the Dataset format
train_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_test, label=y_test)
# Parameters
results = {}
params = {
    'learning_rate': 0.1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.9,
    'max_depth': 1,
    'objective': 'multiclass',  # objective function
    'num_class': 3,
    'verbose': -1
}
# Train the model
gbm = lgb.train(params, train_data, valid_sets=(validation_data, train_data), valid_names=('validate', 'train'), evals_result=results)
# Predict
y_pred_test = gbm.predict(X_test)
y_pred_data = gbm.predict(X_train)
y_pred_data = [list(x).index(max(x)) for x in y_pred_data]
y_pred_test = [list(x).index(max(x)) for x in y_pred_test]
# Evaluate
print(accuracy_score(y_test, y_pred_test))
print('train set', f1_score(y_train, y_pred_data, average='macro'))
print('validation set', f1_score(y_test, y_pred_test, average='macro'))
[1] train's multi_logloss: 0.975702 validate's multi_logloss: 1.009
[2] train's multi_logloss: 0.877457 validate's multi_logloss: 0.914377
[3] train's multi_logloss: 0.794798 validate's multi_logloss: 0.824134
[4] train's multi_logloss: 0.723326 validate's multi_logloss: 0.750893
[5] train's multi_logloss: 0.661667 validate's multi_logloss: 0.682191
[6] train's multi_logloss: 0.607721 validate's multi_logloss: 0.628136
[7] train's multi_logloss: 0.560519 validate's multi_logloss: 0.574289
[8] train's multi_logloss: 0.518687 validate's multi_logloss: 0.529814
[9] train's multi_logloss: 0.481561 validate's multi_logloss: 0.485778
[10] train's multi_logloss: 0.448635 validate's multi_logloss: 0.449967
[11] train's multi_logloss: 0.418864 validate's multi_logloss: 0.414047
[12] train's multi_logloss: 0.392319 validate's multi_logloss: 0.386407
[13] train's multi_logloss: 0.368389 validate's multi_logloss: 0.357079
[14] train's multi_logloss: 0.346782 validate's multi_logloss: 0.33318
[15] train's multi_logloss: 0.327196 validate's multi_logloss: 0.308858
[16] train's multi_logloss: 0.309539 validate's multi_logloss: 0.288072
[17] train's multi_logloss: 0.293482 validate's multi_logloss: 0.268706
[18] train's multi_logloss: 0.278991 validate's multi_logloss: 0.25158
[19] train's multi_logloss: 0.265753 validate's multi_logloss: 0.23781
[20] train's multi_logloss: 0.253744 validate's multi_logloss: 0.226251
[21] train's multi_logloss: 0.242663 validate's multi_logloss: 0.211902
[22] train's multi_logloss: 0.232649 validate's multi_logloss: 0.202371
[23] train's multi_logloss: 0.223439 validate's multi_logloss: 0.192921
[24] train's multi_logloss: 0.215001 validate's multi_logloss: 0.182008
[25] train's multi_logloss: 0.2072 validate's multi_logloss: 0.175881
[26] train's multi_logloss: 0.200111 validate's multi_logloss: 0.168868
[27] train's multi_logloss: 0.193543 validate's multi_logloss: 0.160918
[28] train's multi_logloss: 0.187559 validate's multi_logloss: 0.155117
[29] train's multi_logloss: 0.182121 validate's multi_logloss: 0.148551
[30] train's multi_logloss: 0.177063 validate's multi_logloss: 0.141508
[31] train's multi_logloss: 0.172155 validate's multi_logloss: 0.136823
[32] train's multi_logloss: 0.167851 validate's multi_logloss: 0.13318
[33] train's multi_logloss: 0.163832 validate's multi_logloss: 0.127932
[34] train's multi_logloss: 0.160045 validate's multi_logloss: 0.124999
[35] train's multi_logloss: 0.156511 validate's multi_logloss: 0.11994
[36] train's multi_logloss: 0.153185 validate's multi_logloss: 0.117388
[37] train's multi_logloss: 0.150086 validate's multi_logloss: 0.113542
[38] train's multi_logloss: 0.147138 validate's multi_logloss: 0.11118
[39] train's multi_logloss: 0.144376 validate's multi_logloss: 0.107657
[40] train's multi_logloss: 0.141792 validate's multi_logloss: 0.105666
[41] train's multi_logloss: 0.139327 validate's multi_logloss: 0.102515
[42] train's multi_logloss: 0.137023 validate's multi_logloss: 0.101176
[43] train's multi_logloss: 0.134844 validate's multi_logloss: 0.0975092
[44] train's multi_logloss: 0.132768 validate's multi_logloss: 0.0948682
[45] train's multi_logloss: 0.130798 validate's multi_logloss: 0.0939896
[46] train's multi_logloss: 0.128917 validate's multi_logloss: 0.0915695
[47] train's multi_logloss: 0.127132 validate's multi_logloss: 0.0906398
[48] train's multi_logloss: 0.12546 validate's multi_logloss: 0.0892012
[49] train's multi_logloss: 0.123835 validate's multi_logloss: 0.0884964
[50] train's multi_logloss: 0.122284 validate's multi_logloss: 0.087185
[51] train's multi_logloss: 0.120772 validate's multi_logloss: 0.0849336
[52] train's multi_logloss: 0.119346 validate's multi_logloss: 0.0835437
[53] train's multi_logloss: 0.11795 validate's multi_logloss: 0.0829754
[54] train's multi_logloss: 0.116534 validate's multi_logloss: 0.0819892
[55] train's multi_logloss: 0.115189 validate's multi_logloss: 0.0808175
[56] train's multi_logloss: 0.113915 validate's multi_logloss: 0.0791856
[57] train's multi_logloss: 0.112663 validate's multi_logloss: 0.0778838
[58] train's multi_logloss: 0.111477 validate's multi_logloss: 0.0767819
[59] train's multi_logloss: 0.110319 validate's multi_logloss: 0.0761175
[60] train's multi_logloss: 0.109189 validate's multi_logloss: 0.075811
[61] train's multi_logloss: 0.108108 validate's multi_logloss: 0.0743217
[62] train's multi_logloss: 0.107049 validate's multi_logloss: 0.0730824
[63] train's multi_logloss: 0.106037 validate's multi_logloss: 0.0725497
[64] train's multi_logloss: 0.105039 validate's multi_logloss: 0.0709544
[65] train's multi_logloss: 0.104078 validate's multi_logloss: 0.0703405
[66] train's multi_logloss: 0.103134 validate's multi_logloss: 0.0701205
[67] train's multi_logloss: 0.102229 validate's multi_logloss: 0.0692772
[68] train's multi_logloss: 0.101354 validate's multi_logloss: 0.068559
[69] train's multi_logloss: 0.100491 validate's multi_logloss: 0.0673473
[70] train's multi_logloss: 0.0995573 validate's multi_logloss: 0.0674286
[71] train's multi_logloss: 0.0986634 validate's multi_logloss: 0.0674853
[72] train's multi_logloss: 0.0978101 validate's multi_logloss: 0.0672873
[73] train's multi_logloss: 0.0969953 validate's multi_logloss: 0.0673621
[74] train's multi_logloss: 0.0962072 validate's multi_logloss: 0.066834
[75] train's multi_logloss: 0.0954358 validate's multi_logloss: 0.06728
[76] train's multi_logloss: 0.0946999 validate's multi_logloss: 0.0666785
[77] train's multi_logloss: 0.093984 validate's multi_logloss: 0.0652261
[78] train's multi_logloss: 0.093268 validate's multi_logloss: 0.0653247
[79] train's multi_logloss: 0.0925889 validate's multi_logloss: 0.0654675
[80] train's multi_logloss: 0.0919186 validate's multi_logloss: 0.0649799
[81] train's multi_logloss: 0.0912796 validate's multi_logloss: 0.0638035
[82] train's multi_logloss: 0.0906195 validate's multi_logloss: 0.0638154
[83] train's multi_logloss: 0.0899888 validate's multi_logloss: 0.0642833
[84] train's multi_logloss: 0.0893663 validate's multi_logloss: 0.0636025
[85] train's multi_logloss: 0.0887785 validate's multi_logloss: 0.0626043
[86] train's multi_logloss: 0.0881855 validate's multi_logloss: 0.0623685
[87] train's multi_logloss: 0.0875885 validate's multi_logloss: 0.0627226
[88] train's multi_logloss: 0.0870171 validate's multi_logloss: 0.0624081
[89] train's multi_logloss: 0.0864743 validate's multi_logloss: 0.0625911
[90] train's multi_logloss: 0.0859347 validate's multi_logloss: 0.0620309
[91] train's multi_logloss: 0.0854179 validate's multi_logloss: 0.0622157
[92] train's multi_logloss: 0.0849131 validate's multi_logloss: 0.0617822
[93] train's multi_logloss: 0.0844192 validate's multi_logloss: 0.0619947
[94] train's multi_logloss: 0.0839399 validate's multi_logloss: 0.0614539
[95] train's multi_logloss: 0.0834681 validate's multi_logloss: 0.0616655
[96] train's multi_logloss: 0.0830149 validate's multi_logloss: 0.06134
[97] train's multi_logloss: 0.0825657 validate's multi_logloss: 0.0613612
[98] train's multi_logloss: 0.0821295 validate's multi_logloss: 0.0611025
[99] train's multi_logloss: 0.0816869 validate's multi_logloss: 0.0613398
[100] train's multi_logloss: 0.0812595 validate's multi_logloss: 0.0610704
0.9777777777777777
train set 0.9717813051146384
validation set 0.9734471313418682
# The curves below compare the training and validation loss; if the validation loss sat clearly above the training loss, the model would be overfitting
# If overfitting appears, one remedy is to increase regularization, e.g. raising lambda_l2 (set to 0.9 in the params above)
lgb.plot_metric(results)
plt.show()
# Plot the feature importances
lgb.plot_importance(gbm, importance_type="split")
plt.show()
XGBoost model
XGBoost basic introduction
XGBoost parameter reference
XGBoost parameter tuning methods
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score  # F1 score
# Load the sample dataset
iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)  # split the dataset
# Algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'eval_metric':'mlogloss',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}
train_data = xgb.DMatrix(X_train, y_train)  # build the DMatrix
num_rounds = 500
model = xgb.train(params, train_data, num_boost_round=num_rounds)  # train the XGBoost model
# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)
# Compute the F1 score
F1_score = f1_score(y_test,y_pred,average='macro')
print("F1_score: %.2f%%" % (F1_score*100.0))
# Plot the feature importances
plot_importance(model)
plt.show()
F1_score: 95.56%
References:
- https://blog.csdn.net/han_xiaoyang/article/details/52665396
- https://www.cnblogs.com/TimVerion/p/11436001.html
Cross-validation
Cross-validation is a statistical method for assessing classifier performance. The basic idea is to partition the original data into groups, using one part as the training set and another as the validation set: the classifier is first trained on the training set, and the resulting model is then tested on the validation set, whose score serves as the performance metric. Common variants include simple cross-validation, K-fold cross-validation, leave-one-out cross-validation, and leave-P-out cross-validation.
1. Simple cross-validation (hold-out validation)
Simple cross-validation splits the original data into two groups, one used as the training set and the other as the validation set. The classifier is trained on the training set and evaluated on the validation set, and the resulting classification accuracy is taken as the performance metric. Typically around 30% of the data is held out for testing.
2. K-fold cross-validation
K-fold cross-validation splits the original data into K subsets. Each subset is used once as the validation set while the remaining K-1 subsets form the training set, yielding K models; the average validation accuracy of these K models is the performance metric. K is usually set to 5 or 10.
3. Leave-one-out cross-validation (LOO-CV)
In leave-one-out cross-validation, each training set consists of all samples except one, and the held-out sample forms the validation set. For a dataset of N samples this produces N different training sets and N different validation sets, and hence N models; the average validation accuracy over the N models is the performance metric.
4. Leave-P-out cross-validation
This method is similar to leave-one-out: it removes P samples from the complete dataset, generating all possible training and validation sets.
Example code for cross-validation
1. Simple cross-validation
from sklearn.model_selection import train_test_split
from sklearn import datasets
# Load the dataset
iris=datasets.load_iris()
feature=iris.feature_names
X = iris.data
y = iris.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.4,random_state=0)
2. K-fold cross-validation
from sklearn.model_selection import KFold
folds = KFold(n_splits=10, shuffle=True)
3. Leave-one-out cross-validation
from sklearn.model_selection import LeaveOneOut
loo=LeaveOneOut()
4. Leave-P-out cross-validation
from sklearn.model_selection import LeavePOut
lpo=LeavePOut(p=5)
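Any of the splitters above can be passed to cross_val_score through its cv argument. A minimal sketch with 5-fold CV on iris (the model choice here is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold CV: each fold is used once as the validation set.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=folds, scoring='f1_macro')
print(scores.mean())  # average validation macro-F1 over the 5 folds
```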
There are also other cross-validation splitters, such as stratified cross-validation based on class labels, which is mainly used for imbalanced samples.
In this situation the stratified samplers StratifiedKFold and StratifiedShuffleSplit are commonly used; they ensure that the class frequencies of the full dataset are preserved in every training and validation fold.
StratifiedKFold: a variant of K-fold that returns stratified folds, so the class proportions in each subset roughly match those of the complete dataset.
StratifiedShuffleSplit: a variant of ShuffleSplit that returns stratified splits, i.e. each split keeps the same class proportions as the complete dataset.
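A small sketch, using made-up imbalanced labels, showing how StratifiedKFold preserves class frequencies in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Each fold keeps the 9:1 class ratio of the full dataset.
    print(np.bincount(y[test_idx]))  # -> [18  2] in every fold
```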
Model tuning
Tuning means adjusting the model's hyperparameters to find the settings under which the model performs best overall.
1. Grid search
Grid search is an exhaustive search: it loops over every combination of candidate parameter values and keeps the combination with the best result.
2. Learning curves
A learning curve plots the model's accuracy on the training set and on the cross-validation set as the training-set size varies. It shows how the model behaves on new data, helping diagnose whether the model suffers from high variance or high bias, and whether adding more training data would reduce overfitting.
![img](https://i-blog.csdnimg.cn/blog_migrate/7f1acda8b01de9ab28bed86ec12303c6.png)
Top-left panel: high bias. Accuracy is low on both the training and validation sets, which usually indicates underfitting.
Here we can make the model more expressive, for example by constructing more features or shrinking the regularization term.
Adding more data will not help in this case.
When there is a large gap between the training and test errors, the model has high variance.
A training accuracy much higher than the accuracy measured on independent held-out data generally indicates overfitting.
Top-right panel: high variance. Training and validation accuracy differ widely, which suggests overfitting.
Here we can add training data, reduce model complexity, increase the regularization term, or reduce the number of features via feature selection.
The ideal case is low bias and low variance: the curves converge and the error is small.
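sklearn's learning_curve utility implements exactly this diagnostic: it retrains the model on increasing training-set sizes and reports both scores, so the bias/variance cases above can be read off the two curves. A minimal sketch on iris (the model choice is illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)
# Train on 20%, 40%, ..., 100% of the available training data, with 5-fold CV.
train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring='accuracy')

# A large gap between the two columns suggests high variance (overfitting);
# two low, close scores suggest high bias (underfitting).
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))
```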
3. Validation curves
Unlike a learning curve, a validation curve's horizontal axis is a range of values of a single hyperparameter, so it compares model accuracy across different hyperparameter settings. As the validation curve in the figure below shows, the model may pass from underfitting to a good fit to overfitting as the hyperparameter changes, so the curve can be used to pick a suitable setting that improves performance.
(figure: validation curve)
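sklearn's validation_curve computes this directly: train/validation scores across a range of one hyperparameter. A minimal sketch varying max_depth of a decision tree on iris (the parameter grid is an arbitrary example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_range = [1, 2, 4, 8, 16]
# One row of scores per candidate max_depth, one column per CV fold.
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name='max_depth', param_range=param_range, cv=5)

# Pick the depth with the best mean validation score.
best = param_range[int(np.argmax(valid_scores.mean(axis=1)))]
print('best max_depth:', best)
```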
# Grid-search example using XGBoost
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
cancer = load_breast_cancer()
x = cancer.data[:50]
y = cancer.target[:50]
train_x, valid_x, train_y, valid_y = train_test_split(x, y, test_size=0.333, random_state=0)  # split into training and validation sets
# No DMatrix is needed here
parameters = {
    'max_depth': [5, 10, 15, 20, 25],
    'learning_rate': [0.01, 0.02, 0.05, 0.1, 0.15],
    'n_estimators': [50, 100, 200, 300, 500],
    'min_child_weight': [0, 2, 5, 10, 20],
    'max_delta_step': [0, 0.2, 0.6, 1, 2],
    'subsample': [0.6, 0.7, 0.8, 0.85, 0.95],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9],
    'reg_alpha': [0, 0.25, 0.5, 0.75, 1],
    'reg_lambda': [0.2, 0.4, 0.6, 0.8, 1],
    'scale_pos_weight': [0.2, 0.4, 0.6, 0.8, 1]
}
xlf = xgb.XGBClassifier(max_depth=10,
                        learning_rate=0.01,
                        n_estimators=2000,
                        silent=True,
                        objective='binary:logistic',
                        nthread=-1,
                        gamma=0,
                        min_child_weight=1,
                        max_delta_step=0,
                        subsample=0.85,
                        colsample_bytree=0.7,
                        colsample_bylevel=1,
                        reg_alpha=0,
                        reg_lambda=1,
                        scale_pos_weight=1,
                        seed=1440,
                        missing=None)
# GridSearchCV fits the model internally, so we call fit on the search object rather than on xlf
gsearch = GridSearchCV(xlf, param_grid=parameters, scoring='accuracy', cv=3)
gsearch.fit(train_x, train_y)
print("Best score: %0.3f" % gsearch.best_score_)
print("Best parameters set:")
best_parameters = gsearch.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
# This full grid has 5^10 combinations and is extremely time-consuming; it did not finish on my machine
Model code examples on the 智慧海洋 dataset
LightGBM model
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold, KFold,train_test_split
import lightgbm as lgb
import os
import warnings
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
all_df=pd.read_csv('group_df.csv',index_col=0)
use_train = all_df[all_df['label'] != -1]
use_test = all_df[all_df['label'] == -1]  # label == -1 marks the test set
use_feats = [c for c in use_train.columns if c not in ['ID', 'label']]
X_train,X_verify,y_train,y_verify= train_test_split(use_train[use_feats],use_train['label'],test_size=0.3,random_state=0)
1. Feature selection based on feature importance
############## Feature-selection parameters ###################
selectFeatures = 200  # number of features to keep
earlyStopping = 100  # early-stopping rounds
select_num_boost_round = 1000  # boosting rounds for feature selection
# First set the base parameters
selfParam = {
    'learning_rate':0.01,  # learning rate
    'boosting':'dart',  # boosting type: gbdt or dart
    'objective':'multiclass',  # multiclass objective
    'metric':'None',
    'num_leaves':32,
    'feature_fraction':0.7,  # fraction of features per iteration
    'bagging_fraction':0.8,  # fraction of samples per iteration
    'min_data_in_leaf':30,  # minimum samples per leaf
    'num_class': 3,
    'max_depth':6,  # maximum tree depth
    'num_threads':8,  # number of LightGBM threads
    'min_data_in_bin':30,  # minimum samples per bin
    'max_bin':256,  # maximum number of bins
    'is_unbalance':True,  # unbalanced classes
    'train_metric':True,
    'verbose':-1,
}
# Feature selection ---------------------------------------------------------------------------------
def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True
train_data = lgb.Dataset(data=X_train,label=y_train,feature_name=use_feats)
valid_data = lgb.Dataset(data=X_verify,label=y_verify,reference=train_data,feature_name=use_feats)
sm = lgb.train(params=selfParam,train_set=train_data,num_boost_round=select_num_boost_round,
valid_sets=[valid_data],valid_names=['valid'],
feature_name=use_feats,
early_stopping_rounds=earlyStopping,verbose_eval=False,keep_training_booster=True,feval=f1_score_eval)
features_importance = {k:v for k,v in zip(sm.feature_name(),sm.feature_importance(iteration=sm.best_iteration))}
sort_feature_importance = sorted(features_importance.items(),key=lambda x:x[1],reverse=True)
print('total feature best score:', sm.best_score)
print('total feature importance:',sort_feature_importance)
print('select forward {} features:{}'.format(selectFeatures,sort_feature_importance[:selectFeatures]))
# model_feature is the list of selected features
model_feature = [k[0] for k in sort_feature_importance[:selectFeatures]]
D:\SOFTWEAR_H\Anaconda3\lib\site-packages\lightgbm\callback.py:186: UserWarning: Early stopping is not available in dart mode
warnings.warn('Early stopping is not available in dart mode')
total feature best score: defaultdict(<class 'collections.OrderedDict'>, {'valid': OrderedDict([('f1_score', 0.9004541298211368)])})
total feature importance: [('pos_neq_zero_speed_q_40', 1783), ('lat_lon_countvec_1_x', 1771), ('rank2_mode_lat', 1737), ('pos_neq_zero_speed_median', 1379), ('pos_neq_zero_speed_q_60', 1369), ('lat_lon_tfidf_0_x', 1251), ('pos_neq_zero_speed_q_80', 1194), ('sample_tfidf_0_x', 1168), ('w2v_9_mean', 1134), ('lat_lon_tfidf_11_x', 963), ('rank3_mode_lat', 946), ('w2v_5_mean', 900), ('w2v_16_mean', 874), ('pos_neq_zero_speed_q_30', 866), ('w2v_12_mean', 862), ('pos_neq_zero_speed_q_70', 856), ('lat_lon_tfidf_9_x', 787), ('grad_tfidf_7_x', 772), ('pos_neq_zero_speed_q_90', 746), ('rank3_mode_cnt', 733), ('grad_tfidf_12_x', 729), ('w2v_4_mean', 697), ('sample_tfidf_14_x', 695), ('lat_lon_tfidf_4_x', 693), ('lat_min', 683), ('w2v_23_mean', 647), ('rank2_mode_lon', 631), ('w2v_26_mean', 626), ('rank1_mode_lon', 620), ('grad_tfidf_15_x', 607), ('speed_neq_zero_speed_q_90', 603), ('grad_tfidf_5_x', 572), ('lat_lon_countvec_22_x', 571), ('lat_lon_countvec_1_y', 565), ('w2v_13_mean', 557), ('w2v_27_mean', 550), ('grad_tfidf_2_x', 507), ('lat_lon_tfidf_20_x', 503), ('lat_lon_countvec_0_x', 499), ('lat_lon_countvec_18_x', 490), ('sample_tfidf_21_x', 488), ('grad_tfidf_14_x', 484), ('lat_lon_countvec_27_x', 470), ('w2v_22_mean', 466), ('lat_lon_tfidf_1_x', 461), ('direction_nunique', 460), ('lon_max', 457), ('w2v_15_mean', 441), ('grad_tfidf_23_x', 431), ('w2v_19_mean', 429), ('w2v_11_mean', 428), ('lat_lon_tfidf_29_x', 420), ('pos_neq_zero_lon_q_10', 417), ('w2v_3_mean', 411), ('lat_lon_tfidf_0_y', 407), ('sample_tfidf_29_x', 406), ('anchor_cnt', 404), ('grad_tfidf_8_x', 397), ('sample_tfidf_10_x', 397), ('sample_tfidf_12_x', 385), ('w2v_28_mean', 384), ('grad_tfidf_13_x', 381), ('direction_q_90', 380), ('speed_neq_zero_lon_min', 374), ('w2v_25_mean', 371), ('anchor_ratio', 367), ('lat_lon_tfidf_16_x', 367), ('rank1_mode_lat', 365), ('w2v_18_mean', 365), ('sample_tfidf_23_x', 364), ('lon_min', 354), ('grad_tfidf_0_x', 351), ('pos_neq_zero_lat_q_90', 341), ('w2v_20_mean', 341), 
('sample_tfidf_4_x', 334), ('lat_lon_tfidf_23_x', 332), ('sample_tfidf_0_y', 328), ('pos_neq_zero_direction_q_90', 326), ('speed_neq_zero_direction_nunique', 326), ('sample_tfidf_19_x', 323), ('lat_lon_countvec_9_x', 319), ('pos_neq_zero_lon_q_90', 314), ('w2v_8_mean', 312), ('grad_tfidf_3_x', 309), ('lon_median', 305), ('pos_neq_zero_speed_q_20', 304), ('lat_lon_countvec_4_x', 304), ('lat_mean', 301), ('speed_neq_zero_lon_max', 301), ('lat_lon_tfidf_14_x', 301), ('speed_neq_zero_lat_min', 300), ('lat_lon_countvec_5_x', 296), ('speed_neq_zero_speed_q_80', 294), ('grad_tfidf_16_x', 293), ('rank3_mode_lon', 292), ('lat_lon_tfidf_18_x', 291), ('w2v_7_mean', 290), ('grad_tfidf_6_x', 285), ('grad_tfidf_20_x', 283), ('grad_tfidf_18_x', 282), ('w2v_0_mean', 280), ('grad_tfidf_21_x', 279), ('grad_tfidf_22_x', 273), ('sample_tfidf_24_x', 273), ('speed_q_90', 271), ('w2v_2_mean', 271), ('lat_max', 264), ('sample_tfidf_9_x', 264), ('grad_tfidf_11_x', 262), ('lon_q_20', 260), ('rank1_mode_cnt', 258), ('speed_max', 256), ('lat_lon_tfidf_12_x', 251), ('pos_neq_zero_lon_q_20', 248), ('lat_lon_tfidf_28_x', 242), ('speed_neq_zero_direction_q_60', 241), ('sample_tfidf_11_x', 241), ('w2v_17_mean', 241), ('sample_tfidf_13_x', 238), ('w2v_14_mean', 236), ('lat_nunique', 235), ('grad_tfidf_4_x', 234), ('w2v_21_mean', 234), ('sample_tfidf_5_x', 231), ('lat_lon_tfidf_9_y', 225), ('speed_neq_zero_lat_q_90', 222), ('direction_median', 221), ('sample_tfidf_17_x', 220), ('sample_tfidf_14_y', 216), ('lat_lon_tfidf_21_x', 215), ('lon_q_10', 214), ('lat_lon_tfidf_22_x', 214), ('grad_tfidf_26_x', 213), ('grad_tfidf_7_y', 213), ('w2v_29_mean', 212), ('pos_neq_zero_lat_q_80', 210), ('cnt', 209), ('lat_lon_tfidf_4_y', 208), ('direction_q_60', 204), ('sample_tfidf_18_x', 203), ('lat_lon_tfidf_11_y', 203), ('pos_neq_zero_lat_min', 202), ('pos_neq_zero_speed_mean', 201), ('speed_neq_zero_lat_q_70', 200), ('grad_tfidf_12_y', 198), ('sample_tfidf_20_x', 197), ('w2v_1_mean', 194), 
('speed_neq_zero_lat_q_40', 193), ('pos_neq_zero_speed_max', 192), ('grad_tfidf_27_x', 192), ('grad_tfidf_15_y', 191), ('lat_lon_tfidf_19_x', 189), ('lat_median', 187), ('lat_lon_tfidf_15_x', 187), ('lat_q_20', 186), ('lat_q_70', 186), ('lon_q_70', 185), ('w2v_24_mean', 184), ('pos_neq_zero_lat_q_40', 183), ('grad_tfidf_25_x', 181), ('w2v_10_mean', 181), ('lon_mean', 180), ('sample_tfidf_27_x', 180), ('w2v_6_mean', 180), ('lat_lon_tfidf_24_x', 178), ('lat_lon_countvec_12_x', 178), ('pos_neq_zero_lat_mean', 177), ('speed_neq_zero_speed_q_70', 174), ('speed_neq_zero_direction_q_80', 172), ('rank2_mode_cnt', 172), ('speed_neq_zero_lat_nunique', 171), ('lat_lon_tfidf_2_x', 171), ('sample_tfidf_25_x', 170), ('lat_lon_tfidf_5_x', 169), ('lat_lon_countvec_26_x', 167), ('grad_tfidf_9_x', 166), ('lat_lon_countvec_28_x', 163), ('lat_lon_countvec_22_y', 163), ('sample_tfidf_1_x', 162), ('pos_neq_zero_direction_nunique', 161), ('pos_neq_zero_speed_q_10', 157), ('sample_tfidf_16_x', 155), ('speed_neq_zero_direction_q_90', 154), ('grad_tfidf_14_y', 153), ('lat_lon_tfidf_7_x', 151), ('pos_neq_zero_direction_q_80', 149), ('lat_q_80', 148), ('grad_tfidf_23_y', 148), ('lat_lon_countvec_11_x', 147), ('sample_tfidf_22_x', 146), ('speed_neq_zero_lat_max', 144), ('sample_tfidf_15_x', 144), ('grad_tfidf_2_y', 144), ('pos_neq_zero_lat_q_10', 142), ('lat_lon_tfidf_1_y', 142), ('lat_lon_countvec_16_x', 141), ('grad_tfidf_13_y', 138), ('lat_lon_countvec_29_x', 136), ('lat_lon_tfidf_29_y', 136), ('grad_tfidf_5_y', 136), ('direction_max', 135), ('pos_neq_zero_lon_median', 134), ('lat_lon_tfidf_27_x', 134), ('lon_q_80', 133), ('lat_lon_countvec_15_x', 133), ('pos_neq_zero_lon_max', 132), ('lat_lon_countvec_14_x', 132), ('lat_lon_tfidf_26_x', 131), ('grad_tfidf_19_x', 131), ('sample_tfidf_8_x', 131), ('lat_q_60', 130), ('sample_tfidf_28_x', 130), ('lat_lon_countvec_27_y', 130), ('lat_lon_countvec_6_x', 128), ('lat_lon_countvec_0_y', 128), ('sample_tfidf_12_y', 127), ('lat_lon_tfidf_8_x', 126), 
('sample_tfidf_29_y', 126), ('lat_lon_countvec_17_x', 125), ('direction_q_70', 124), ('lat_lon_tfidf_20_y', 124), ('lat_lon_tfidf_3_x', 121), ('sample_tfidf_21_y', 120), ('grad_tfidf_0_y', 119), ('pos_neq_zero_lat_median', 118), ('lat_lon_tfidf_16_y', 118), ('grad_tfidf_10_x', 117), ('sample_tfidf_2_x', 116), ('lat_lon_countvec_4_y', 116), ('speed_median', 115), ('pos_neq_zero_direction_q_10', 115), ('speed_neq_zero_lon_mean', 115), ('pos_neq_zero_direction_max', 114), ('lat_q_40', 113), ('grad_tfidf_1_x', 113), ('speed_nunique', 111), ('sample_tfidf_23_y', 111), ('speed_q_30', 110), ('pos_neq_zero_lat_q_30', 110), ('lat_lon_tfidf_10_x', 110), ('lat_lon_countvec_10_x', 110), ('lat_lon_tfidf_23_y', 109), ('pos_neq_zero_speed_min', 106), ('speed_neq_zero_lat_q_60', 106), ('lat_lon_countvec_21_x', 106), ('lat_lon_countvec_18_y', 106), ('lat_lon_tfidf_17_x', 105), ('grad_tfidf_8_y', 103), ('grad_tfidf_6_y', 102), ('sample_tfidf_10_y', 101), ('pos_neq_zero_lon_min', 100), ('lat_lon_countvec_8_x', 100), ('lat_lon_countvec_9_y', 100), ('direction_mean', 99), ('grad_tfidf_21_y', 99), ('lat_lon_tfidf_6_x', 98), ('lat_lon_tfidf_18_y', 97), ('direction_q_80', 96), ('pos_neq_zero_direction_q_70', 96), ('lat_lon_countvec_20_x', 95), ('speed_neq_zero_direction_q_70', 93), ('lat_lon_countvec_25_x', 93), ('lat_lon_countvec_23_x', 92), ('lat_lon_tfidf_14_y', 92), ('lat_q_90', 91), ('sample_tfidf_7_x', 91), ('pos_neq_zero_lon_q_70', 90), ('lat_lon_countvec_5_y', 90), ('pos_neq_zero_direction_q_20', 89), ('lat_lon_tfidf_12_y', 89), ('lat_lon_tfidf_28_y', 89), ('sample_tfidf_4_y', 89), ('direction_q_40', 88), ('pos_neq_zero_lat_q_20', 87), ('grad_tfidf_17_x', 87), ('sample_tfidf_9_y', 87), ('sample_tfidf_24_y', 87), ('pos_neq_zero_lat_max', 86), ('pos_neq_zero_lon_mean', 86), ('speed_neq_zero_direction_q_40', 86), ('lat_lon_countvec_7_x', 86), ('speed_neq_zero_speed_q_40', 85), ('sample_tfidf_6_x', 84), ('sample_tfidf_19_y', 84), ('speed_min', 83), ('direction_q_10', 83), 
('lat_lon_countvec_19_x', 83), ('grad_tfidf_24_x', 83), ('speed_q_60', 82), ('lat_lon_tfidf_25_x', 82), ('sample_tfidf_3_x', 82), ('grad_tfidf_22_y', 82), ('direction_q_30', 80), ('speed_neq_zero_direction_mean', 80), ('grad_tfidf_18_y', 77), ('lat_q_10', 76), ('speed_neq_zero_speed_max', 75), ('grad_tfidf_3_y', 75), ('sample_tfidf_11_y', 75), ('lon_nunique', 74), ('lon_q_90', 74), ('speed_neq_zero_lon_q_10', 74), ('speed_neq_zero_speed_median', 74), ('grad_tfidf_28_x', 74), ('grad_tfidf_20_y', 74), ('speed_neq_zero_lon_q_70', 73), ('lat_lon_tfidf_24_y', 73), ('pos_neq_zero_lat_q_60', 72), ('lat_lon_countvec_2_x', 72), ('lat_lon_countvec_3_x', 69), ('sample_tfidf_20_y', 69), ('lat_lon_tfidf_13_x', 68), ('grad_tfidf_16_y', 68), ('sample_tfidf_13_y', 67), ('speed_neq_zero_lon_q_30', 66), ('speed_q_40', 65), ('grad_tfidf_4_y', 65), ('sample_tfidf_5_y', 65), ('lat_q_30', 64), ('pos_neq_zero_direction_median', 64), ('speed_neq_zero_lat_median', 64), ('grad_tfidf_11_y', 64), ('grad_tfidf_27_y', 64), ('lat_lon_tfidf_19_y', 62), ('pos_neq_zero_lon_q_40', 61), ('lat_lon_countvec_26_y', 61), ('pos_neq_zero_lon_q_80', 60), ('sample_tfidf_17_y', 60), ('lon_q_40', 59), ('lat_lon_countvec_28_y', 59), ('lat_lon_tfidf_22_y', 57), ('grad_tfidf_29_x', 56), ('lat_lon_countvec_12_y', 56), ('sample_tfidf_15_y', 56), ('sample_tfidf_27_y', 56), ('speed_q_70', 55), ('lat_lon_tfidf_21_y', 55), ('grad_tfidf_9_y', 55), ('sample_tfidf_25_y', 55), ('pos_neq_zero_direction_mean', 54), ('sample_tfidf_26_x', 54), ('sample_tfidf_18_y', 53), ('speed_neq_zero_lon_q_90', 51), ('speed_neq_zero_direction_max', 51), ('lat_lon_tfidf_5_y', 50), ('pos_neq_zero_direction_q_60', 49), ('sample_tfidf_2_y', 49), ('pos_neq_zero_lon_q_60', 48), ('speed_neq_zero_speed_mean', 48), ('lat_lon_tfidf_15_y', 48), ('pos_neq_zero_direction_q_30', 47), ('speed_neq_zero_lon_nunique', 47), ('lat_lon_countvec_24_x', 47), ('sample_tfidf_8_y', 47), ('lat_lon_tfidf_10_y', 46), ('lon_q_60', 45), ('pos_neq_zero_lat_q_70', 45), 
('speed_neq_zero_direction_q_10', 45), ('lat_lon_tfidf_3_y', 45), ('speed_neq_zero_lat_mean', 43), ('speed_neq_zero_lat_q_80', 43), ('lat_lon_tfidf_2_y', 43), ('lat_lon_tfidf_8_y', 43), ('grad_tfidf_19_y', 43), ('grad_tfidf_25_y', 43), ('grad_tfidf_26_y', 43), ('lon_q_30', 42), ('speed_neq_zero_lon_q_20', 42), ('pos_neq_zero_speed_nunique', 41), ('speed_neq_zero_speed_nunique', 41), ('speed_neq_zero_speed_q_30', 41), ('lat_lon_tfidf_7_y', 41), ('lat_lon_tfidf_17_y', 41), ('lat_lon_countvec_14_y', 41), ('grad_tfidf_10_y', 41), ('lat_lon_tfidf_26_y', 40), ('grad_tfidf_1_y', 40), ('speed_neq_zero_lat_q_20', 39), ('speed_q_80', 38), ('speed_neq_zero_lat_q_30', 38), ('lat_lon_countvec_15_y', 38), ('pos_neq_zero_direction_q_40', 37), ('speed_neq_zero_direction_median', 37), ('pos_neq_zero_lon_q_30', 36), ('lat_lon_countvec_11_y', 36), ('lat_lon_countvec_21_y', 35), ('sample_tfidf_28_y', 35), ('speed_neq_zero_speed_q_60', 34), ('lat_lon_countvec_29_y', 34), ('sample_tfidf_1_y', 34), ('sample_tfidf_22_y', 34), ('lat_lon_countvec_6_y', 33), ('lat_lon_countvec_10_y', 33), ('lat_lon_countvec_16_y', 33), ('speed_mean', 32), ('lat_lon_countvec_17_y', 31), ('lat_lon_countvec_23_y', 31), ('speed_neq_zero_direction_q_30', 30), ('lat_lon_tfidf_13_y', 30), ('sample_tfidf_16_y', 30), ('speed_neq_zero_lat_q_10', 29), ('lat_lon_tfidf_27_y', 29), ('grad_tfidf_17_y', 29), ('lat_lon_countvec_13_x', 27), ('lat_lon_countvec_19_y', 27), ('grad_tfidf_24_y', 26), ('speed_neq_zero_lon_q_40', 25), ('lat_lon_tfidf_25_y', 25), ('lat_lon_countvec_8_y', 25), ('speed_neq_zero_lon_median', 24), ('speed_neq_zero_speed_min', 24), ('lat_lon_countvec_25_y', 24), ('sample_tfidf_6_y', 24), ('pos_neq_zero_lat_nunique', 23), ('speed_neq_zero_lon_q_80', 23), ('lat_lon_countvec_20_y', 23), ('speed_neq_zero_speed_q_10', 22), ('lat_lon_countvec_3_y', 22), ('grad_tfidf_28_y', 22), ('sample_tfidf_7_y', 22), ('lat_lon_countvec_7_y', 21), ('sample_tfidf_26_y', 21), ('lat_lon_tfidf_6_y', 20), ('sample_tfidf_3_y', 20), 
('grad_tfidf_29_y', 18), ('speed_neq_zero_lon_q_60', 16), ('speed_neq_zero_speed_q_20', 14), ('lat_lon_countvec_24_y', 14), ('lat_lon_countvec_2_y', 11), ('speed_neq_zero_direction_q_20', 9), ('lat_lon_countvec_13_y', 8), ('speed_q_10', 7), ('pos_neq_zero_lon_nunique', 5), ('direction_q_20', 4), ('speed_q_20', 2), ('pos_neq_zero_direction_min', 2), ('direction_min', 0), ('speed_neq_zero_direction_min', 0)]
select forward 200 features:[('pos_neq_zero_speed_q_40', 1783), ('lat_lon_countvec_1_x', 1771), ('rank2_mode_lat', 1737), ('pos_neq_zero_speed_median', 1379), ('pos_neq_zero_speed_q_60', 1369), ('lat_lon_tfidf_0_x', 1251), ('pos_neq_zero_speed_q_80', 1194), ('sample_tfidf_0_x', 1168), ('w2v_9_mean', 1134), ('lat_lon_tfidf_11_x', 963), ('rank3_mode_lat', 946), ('w2v_5_mean', 900), ('w2v_16_mean', 874), ('pos_neq_zero_speed_q_30', 866), ('w2v_12_mean', 862), ('pos_neq_zero_speed_q_70', 856), ('lat_lon_tfidf_9_x', 787), ('grad_tfidf_7_x', 772), ('pos_neq_zero_speed_q_90', 746), ('rank3_mode_cnt', 733), ('grad_tfidf_12_x', 729), ('w2v_4_mean', 697), ('sample_tfidf_14_x', 695), ('lat_lon_tfidf_4_x', 693), ('lat_min', 683), ('w2v_23_mean', 647), ('rank2_mode_lon', 631), ('w2v_26_mean', 626), ('rank1_mode_lon', 620), ('grad_tfidf_15_x', 607), ('speed_neq_zero_speed_q_90', 603), ('grad_tfidf_5_x', 572), ('lat_lon_countvec_22_x', 571), ('lat_lon_countvec_1_y', 565), ('w2v_13_mean', 557), ('w2v_27_mean', 550), ('grad_tfidf_2_x', 507), ('lat_lon_tfidf_20_x', 503), ('lat_lon_countvec_0_x', 499), ('lat_lon_countvec_18_x', 490), ('sample_tfidf_21_x', 488), ('grad_tfidf_14_x', 484), ('lat_lon_countvec_27_x', 470), ('w2v_22_mean', 466), ('lat_lon_tfidf_1_x', 461), ('direction_nunique', 460), ('lon_max', 457), ('w2v_15_mean', 441), ('grad_tfidf_23_x', 431), ('w2v_19_mean', 429), ('w2v_11_mean', 428), ('lat_lon_tfidf_29_x', 420), ('pos_neq_zero_lon_q_10', 417), ('w2v_3_mean', 411), ('lat_lon_tfidf_0_y', 407), ('sample_tfidf_29_x', 406), ('anchor_cnt', 404), ('grad_tfidf_8_x', 397), ('sample_tfidf_10_x', 397), ('sample_tfidf_12_x', 385), ('w2v_28_mean', 384), ('grad_tfidf_13_x', 381), ('direction_q_90', 380), ('speed_neq_zero_lon_min', 374), ('w2v_25_mean', 371), ('anchor_ratio', 367), ('lat_lon_tfidf_16_x', 367), ('rank1_mode_lat', 365), ('w2v_18_mean', 365), ('sample_tfidf_23_x', 364), ('lon_min', 354), ('grad_tfidf_0_x', 351), ('pos_neq_zero_lat_q_90', 341), ('w2v_20_mean', 341), 
('sample_tfidf_4_x', 334), ('lat_lon_tfidf_23_x', 332), ('sample_tfidf_0_y', 328), ('pos_neq_zero_direction_q_90', 326), ('speed_neq_zero_direction_nunique', 326), ('sample_tfidf_19_x', 323), ('lat_lon_countvec_9_x', 319), ('pos_neq_zero_lon_q_90', 314), ('w2v_8_mean', 312), ('grad_tfidf_3_x', 309), ('lon_median', 305), ('pos_neq_zero_speed_q_20', 304), ('lat_lon_countvec_4_x', 304), ('lat_mean', 301), ('speed_neq_zero_lon_max', 301), ('lat_lon_tfidf_14_x', 301), ('speed_neq_zero_lat_min', 300), ('lat_lon_countvec_5_x', 296), ('speed_neq_zero_speed_q_80', 294), ('grad_tfidf_16_x', 293), ('rank3_mode_lon', 292), ('lat_lon_tfidf_18_x', 291), ('w2v_7_mean', 290), ('grad_tfidf_6_x', 285), ('grad_tfidf_20_x', 283), ('grad_tfidf_18_x', 282), ('w2v_0_mean', 280), ('grad_tfidf_21_x', 279), ('grad_tfidf_22_x', 273), ('sample_tfidf_24_x', 273), ('speed_q_90', 271), ('w2v_2_mean', 271), ('lat_max', 264), ('sample_tfidf_9_x', 264), ('grad_tfidf_11_x', 262), ('lon_q_20', 260), ('rank1_mode_cnt', 258), ('speed_max', 256), ('lat_lon_tfidf_12_x', 251), ('pos_neq_zero_lon_q_20', 248), ('lat_lon_tfidf_28_x', 242), ('speed_neq_zero_direction_q_60', 241), ('sample_tfidf_11_x', 241), ('w2v_17_mean', 241), ('sample_tfidf_13_x', 238), ('w2v_14_mean', 236), ('lat_nunique', 235), ('grad_tfidf_4_x', 234), ('w2v_21_mean', 234), ('sample_tfidf_5_x', 231), ('lat_lon_tfidf_9_y', 225), ('speed_neq_zero_lat_q_90', 222), ('direction_median', 221), ('sample_tfidf_17_x', 220), ('sample_tfidf_14_y', 216), ('lat_lon_tfidf_21_x', 215), ('lon_q_10', 214), ('lat_lon_tfidf_22_x', 214), ('grad_tfidf_26_x', 213), ('grad_tfidf_7_y', 213), ('w2v_29_mean', 212), ('pos_neq_zero_lat_q_80', 210), ('cnt', 209), ('lat_lon_tfidf_4_y', 208), ('direction_q_60', 204), ('sample_tfidf_18_x', 203), ('lat_lon_tfidf_11_y', 203), ('pos_neq_zero_lat_min', 202), ('pos_neq_zero_speed_mean', 201), ('speed_neq_zero_lat_q_70', 200), ('grad_tfidf_12_y', 198), ('sample_tfidf_20_x', 197), ('w2v_1_mean', 194), 
('speed_neq_zero_lat_q_40', 193), ('pos_neq_zero_speed_max', 192), ('grad_tfidf_27_x', 192), ('grad_tfidf_15_y', 191), ('lat_lon_tfidf_19_x', 189), ('lat_median', 187), ('lat_lon_tfidf_15_x', 187), ('lat_q_20', 186), ('lat_q_70', 186), ('lon_q_70', 185), ('w2v_24_mean', 184), ('pos_neq_zero_lat_q_40', 183), ('grad_tfidf_25_x', 181), ('w2v_10_mean', 181), ('lon_mean', 180), ('sample_tfidf_27_x', 180), ('w2v_6_mean', 180), ('lat_lon_tfidf_24_x', 178), ('lat_lon_countvec_12_x', 178), ('pos_neq_zero_lat_mean', 177), ('speed_neq_zero_speed_q_70', 174), ('speed_neq_zero_direction_q_80', 172), ('rank2_mode_cnt', 172), ('speed_neq_zero_lat_nunique', 171), ('lat_lon_tfidf_2_x', 171), ('sample_tfidf_25_x', 170), ('lat_lon_tfidf_5_x', 169), ('lat_lon_countvec_26_x', 167), ('grad_tfidf_9_x', 166), ('lat_lon_countvec_28_x', 163), ('lat_lon_countvec_22_y', 163), ('sample_tfidf_1_x', 162), ('pos_neq_zero_direction_nunique', 161), ('pos_neq_zero_speed_q_10', 157), ('sample_tfidf_16_x', 155), ('speed_neq_zero_direction_q_90', 154), ('grad_tfidf_14_y', 153), ('lat_lon_tfidf_7_x', 151), ('pos_neq_zero_direction_q_80', 149), ('lat_q_80', 148), ('grad_tfidf_23_y', 148), ('lat_lon_countvec_11_x', 147), ('sample_tfidf_22_x', 146), ('speed_neq_zero_lat_max', 144), ('sample_tfidf_15_x', 144), ('grad_tfidf_2_y', 144), ('pos_neq_zero_lat_q_10', 142), ('lat_lon_tfidf_1_y', 142), ('lat_lon_countvec_16_x', 141), ('grad_tfidf_13_y', 138), ('lat_lon_countvec_29_x', 136), ('lat_lon_tfidf_29_y', 136), ('grad_tfidf_5_y', 136)]
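A ranking like the list above can be built by pairing a trained model's importance counts with its feature names and sorting. A minimal sketch of that step, using a handful of (feature, split-count) pairs taken from the list above (in practice the pairs would come from `zip(model.feature_name(), model.feature_importance())` on the trained LightGBM booster):

```python
# stand-in (feature, split-count) pairs; real values come from the trained model
importances = {'w2v_9_mean': 1134, 'lat_min': 683, 'rank2_mode_lat': 1737,
               'lon_max': 457, 'pos_neq_zero_speed_median': 1379}

top_n = 3
# sort by importance count, descending, and keep the top N features
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
print(ranked)
# → [('rank2_mode_lat', 1737), ('pos_neq_zero_speed_median', 1379), ('w2v_9_mean', 1134)]
```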
Bayesian optimization is another method commonly used for hyperparameter tuning during modeling. The code below uses Bayesian optimization (via hyperopt) to select hyperparameters:
import numpy as np
import lightgbm as lgb
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from sklearn.metrics import f1_score

############## search space for hyperparameter optimization ###################
spaceParam = {
    'boosting': hp.choice('boosting', ['gbdt', 'dart']),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.05)),
    'num_leaves': hp.quniform('num_leaves', 3, 66, 3),
    'feature_fraction': hp.uniform('feature_fraction', 0.7, 1),
    'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 50, 5),
    'num_boost_round': hp.quniform('num_boost_round', 500, 2000, 100),
    'bagging_fraction': hp.uniform('bagging_fraction', 0.6, 1)
}

# hyperparameter optimization -------------------------------------------------
def getParam(param):
    # hp.quniform returns floats; LightGBM expects integers for these keys
    for k in ['num_leaves', 'min_data_in_leaf', 'num_boost_round']:
        param[k] = int(float(param[k]))
    for k in ['learning_rate', 'feature_fraction', 'bagging_fraction']:
        param[k] = float(param[k])
    # fmin returns the index picked by hp.choice, so map it back to the value
    if param['boosting'] == 0:
        param['boosting'] = 'gbdt'
    elif param['boosting'] == 1:
        param['boosting'] = 'dart'
    # add the fixed parameters
    param['objective'] = 'multiclass'
    param['max_depth'] = 7
    param['num_threads'] = 8
    param['is_unbalance'] = True
    param['metric'] = 'None'
    param['train_metric'] = True
    param['verbose'] = -1
    param['bagging_freq'] = 5
    param['num_class'] = 3
    param['feature_pre_filter'] = False
    return param

def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True

def lossFun(param):
    param = getParam(param)
    m = lgb.train(params=param, train_set=train_data, num_boost_round=param['num_boost_round'],
                  valid_sets=[train_data, valid_data], valid_names=['train', 'valid'],
                  feature_name=features, feval=f1_score_eval,
                  early_stopping_rounds=earlyStopping, verbose_eval=False, keep_training_booster=True)
    train_f1_score = m.best_score['train']['f1_score']
    valid_f1_score = m.best_score['valid']['f1_score']
    loss_f1_score = 1 - valid_f1_score
    print('训练集f1_score:{},测试集f1_score:{},loss_f1_score:{}'.format(train_f1_score, valid_f1_score, loss_f1_score))
    return {'loss': loss_f1_score, 'params': param, 'status': STATUS_OK}

# X_train/y_train, X_verify/y_verify and model_feature come from the earlier
# feature-engineering and feature-selection steps
earlyStopping = 100  # early-stopping rounds used inside lossFun
features = model_feature
train_data = lgb.Dataset(data=X_train[model_feature], label=y_train, feature_name=features)
valid_data = lgb.Dataset(data=X_verify[features], label=y_verify, reference=train_data, feature_name=features)
best_param = fmin(fn=lossFun, space=spaceParam, algo=tpe.suggest, max_evals=100, trials=Trials())
best_param = getParam(best_param)
print('Search best param:', best_param)
训练集f1_score:1.0,测试集f1_score:0.9238060849905194,loss_f1_score:0.07619391500948058
训练集f1_score:0.9414337502771342,测试集f1_score:0.8878751759836653,loss_f1_score:0.11212482401633472
训练集f1_score:1.0,测试集f1_score:0.9275451088133652,loss_f1_score:0.07245489118663484
训练集f1_score:1.0,测试集f1_score:0.9262405937033683,loss_f1_score:0.07375940629663169
训练集f1_score:0.9708237804866381,测试集f1_score:0.9105982243190386,loss_f1_score:0.08940177568096142
训练集f1_score:0.9689912364726484,测试集f1_score:0.9086459359345839,loss_f1_score:0.09135406406541613
训练集f1_score:0.9841597696688008,测试集f1_score:0.9027075194168233,loss_f1_score:0.09729248058317674
训练集f1_score:1.0,测试集f1_score:0.9215512877825286,loss_f1_score:0.0784487122174714
训练集f1_score:1.0,测试集f1_score:0.924555451978199,loss_f1_score:0.075444548021801
训练集f1_score:0.998357894114157,测试集f1_score:0.9157797895654226,loss_f1_score:0.08422021043457739
训练集f1_score:1.0,测试集f1_score:0.9225868784774544,loss_f1_score:0.07741312152254565
训练集f1_score:1.0,测试集f1_score:0.9188521505717673,loss_f1_score:0.08114784942823272
训练集f1_score:0.9268245763808158,测试集f1_score:0.8763935795977332,loss_f1_score:0.12360642040226677
训练集f1_score:1.0,测试集f1_score:0.9215959099478135,loss_f1_score:0.07840409005218651
训练集f1_score:1.0,测试集f1_score:0.9265015559936258,loss_f1_score:0.07349844400637418
训练集f1_score:1.0,测试集f1_score:0.9143628354188641,loss_f1_score:0.0856371645811359
训练集f1_score:1.0,测试集f1_score:0.9202754009210264,loss_f1_score:0.07972459907897356
训练集f1_score:0.9550283459834631,测试集f1_score:0.8923546584333147,loss_f1_score:0.10764534156668526
训练集f1_score:1.0,测试集f1_score:0.9255732985564632,loss_f1_score:0.0744267014435368
训练集f1_score:1.0,测试集f1_score:0.926093875740129,loss_f1_score:0.07390612425987098
训练集f1_score:1.0,测试集f1_score:0.9275189170142104,loss_f1_score:0.07248108298578959
训练集f1_score:1.0,测试集f1_score:0.9257895202231272,loss_f1_score:0.07421047977687278
训练集f1_score:1.0,测试集f1_score:0.9248738969479765,loss_f1_score:0.0751261030520235
训练集f1_score:1.0,测试集f1_score:0.9272520229049039,loss_f1_score:0.07274797709509606
训练集f1_score:1.0,测试集f1_score:0.9256769527801775,loss_f1_score:0.07432304721982252
训练集f1_score:1.0,测试集f1_score:0.9252959646692677,loss_f1_score:0.07470403533073233
训练集f1_score:1.0,测试集f1_score:0.9280536344383128,loss_f1_score:0.07194636556168721
训练集f1_score:1.0,测试集f1_score:0.9316114105930104,loss_f1_score:0.06838858940698955
训练集f1_score:1.0,测试集f1_score:0.9282603014798921,loss_f1_score:0.07173969852010786
训练集f1_score:1.0,测试集f1_score:0.9169851848129301,loss_f1_score:0.08301481518706988
训练集f1_score:0.9998006409358186,测试集f1_score:0.9170084634982812,loss_f1_score:0.08299153650171875
训练集f1_score:1.0,测试集f1_score:0.919142326688697,loss_f1_score:0.080857673311303
训练集f1_score:1.0,测试集f1_score:0.927350422658861,loss_f1_score:0.07264957734113897
训练集f1_score:1.0,测试集f1_score:0.9248086877712395,loss_f1_score:0.07519131222876052
训练集f1_score:1.0,测试集f1_score:0.9170626453496801,loss_f1_score:0.08293735465031993
训练集f1_score:1.0,测试集f1_score:0.9277641923766077,loss_f1_score:0.07223580762339232
训练集f1_score:1.0,测试集f1_score:0.9221988666312404,loss_f1_score:0.0778011333687596
训练集f1_score:1.0,测试集f1_score:0.9225220095934339,loss_f1_score:0.07747799040656611
训练集f1_score:1.0,测试集f1_score:0.9239565521812777,loss_f1_score:0.0760434478187223
训练集f1_score:1.0,测试集f1_score:0.9276828960144917,loss_f1_score:0.07231710398550828
训练集f1_score:1.0,测试集f1_score:0.9205931627810685,loss_f1_score:0.07940683721893149
训练集f1_score:1.0,测试集f1_score:0.9262928923256212,loss_f1_score:0.07370710767437882
训练集f1_score:0.9944566925965641,测试集f1_score:0.9103100448505551,loss_f1_score:0.08968995514944489
训练集f1_score:1.0,测试集f1_score:0.9267901922541096,loss_f1_score:0.07320980774589037
训练集f1_score:1.0,测试集f1_score:0.920503002249437,loss_f1_score:0.07949699775056296
训练集f1_score:0.9315809154440894,测试集f1_score:0.888114739372245,loss_f1_score:0.11188526062775495
训练集f1_score:1.0,测试集f1_score:0.9312944518110373,loss_f1_score:0.06870554818896268
训练集f1_score:1.0,测试集f1_score:0.9303459748533314,loss_f1_score:0.06965402514666863
训练集f1_score:1.0,测试集f1_score:0.931353840440614,loss_f1_score:0.06864615955938602
训练集f1_score:1.0,测试集f1_score:0.9229280238009058,loss_f1_score:0.07707197619909423
训练集f1_score:1.0,测试集f1_score:0.9081707271979852,loss_f1_score:0.0918292728020148
训练集f1_score:1.0,测试集f1_score:0.9263682433473132,loss_f1_score:0.07363175665268684
训练集f1_score:0.9979810910594639,测试集f1_score:0.9137152734108268,loss_f1_score:0.08628472658917319
训练集f1_score:1.0,测试集f1_score:0.9258220879299731,loss_f1_score:0.07417791207002689
训练集f1_score:1.0,测试集f1_score:0.9174454505221505,loss_f1_score:0.08255454947784946
训练集f1_score:1.0,测试集f1_score:0.9271364668867941,loss_f1_score:0.07286353311320592
训练集f1_score:1.0,测试集f1_score:0.9147023183361269,loss_f1_score:0.08529768166387308
训练集f1_score:0.9818127606280159,测试集f1_score:0.9017199309349478,loss_f1_score:0.09828006906505216
训练集f1_score:1.0,测试集f1_score:0.9144702886766378,loss_f1_score:0.08552971132336218
训练集f1_score:0.9987361493711533,测试集f1_score:0.9152462742627984,loss_f1_score:0.08475372573720164
训练集f1_score:1.0,测试集f1_score:0.9283825864164065,loss_f1_score:0.07161741358359353
训练集f1_score:1.0,测试集f1_score:0.9185245776900096,loss_f1_score:0.08147542230999039
训练集f1_score:1.0,测试集f1_score:0.9176200948292667,loss_f1_score:0.08237990517073335
训练集f1_score:0.9993129514194335,测试集f1_score:0.9174352830766729,loss_f1_score:0.08256471692332712
训练集f1_score:1.0,测试集f1_score:0.9276704131051788,loss_f1_score:0.07232958689482116
训练集f1_score:1.0,测试集f1_score:0.9268048760558437,loss_f1_score:0.07319512394415628
训练集f1_score:1.0,测试集f1_score:0.9304568955332027,loss_f1_score:0.06954310446679735
训练集f1_score:1.0,测试集f1_score:0.9222607611550148,loss_f1_score:0.07773923884498524
训练集f1_score:1.0,测试集f1_score:0.9303686983620825,loss_f1_score:0.06963130163791753
训练集f1_score:1.0,测试集f1_score:0.9275281467065163,loss_f1_score:0.07247185329348371
训练集f1_score:1.0,测试集f1_score:0.9263494542572851,loss_f1_score:0.0736505457427149
训练集f1_score:1.0,测试集f1_score:0.9262464202510822,loss_f1_score:0.07375357974891783
训练集f1_score:1.0,测试集f1_score:0.9213298706249988,loss_f1_score:0.07867012937500117
训练集f1_score:1.0,测试集f1_score:0.9255381820063792,loss_f1_score:0.07446181799362084
训练集f1_score:1.0,测试集f1_score:0.9262492441399471,loss_f1_score:0.07375075586005286
训练集f1_score:1.0,测试集f1_score:0.9267529385979496,loss_f1_score:0.0732470614020504
训练集f1_score:1.0,测试集f1_score:0.9279362552557409,loss_f1_score:0.07206374474425914
训练集f1_score:1.0,测试集f1_score:0.9105496558898486,loss_f1_score:0.0894503441101514
训练集f1_score:1.0,测试集f1_score:0.9255677088759965,loss_f1_score:0.07443229112400351
训练集f1_score:1.0,测试集f1_score:0.9258810998636311,loss_f1_score:0.0741189001363689
训练集f1_score:1.0,测试集f1_score:0.9236045683410877,loss_f1_score:0.07639543165891227
训练集f1_score:1.0,测试集f1_score:0.9236482035413927,loss_f1_score:0.07635179645860735
训练集f1_score:0.9998006409358186,测试集f1_score:0.9161826380576955,loss_f1_score:0.08381736194230449
训练集f1_score:1.0,测试集f1_score:0.9226427795765888,loss_f1_score:0.0773572204234112
训练集f1_score:1.0,测试集f1_score:0.9227047668043812,loss_f1_score:0.07729523319561882
训练集f1_score:1.0,测试集f1_score:0.9255689533534145,loss_f1_score:0.07443104664658551
训练集f1_score:1.0,测试集f1_score:0.9322007348532765,loss_f1_score:0.06779926514672352
训练集f1_score:1.0,测试集f1_score:0.9169573599775939,loss_f1_score:0.08304264002240613
训练集f1_score:1.0,测试集f1_score:0.9230059720988804,loss_f1_score:0.07699402790111964
训练集f1_score:1.0,测试集f1_score:0.922697478395862,loss_f1_score:0.07730252160413797
训练集f1_score:1.0,测试集f1_score:0.9079606352786754,loss_f1_score:0.09203936472132457
训练集f1_score:1.0,测试集f1_score:0.9229248123974857,loss_f1_score:0.0770751876025143
训练集f1_score:1.0,测试集f1_score:0.923913432252704,loss_f1_score:0.07608656774729605
训练集f1_score:1.0,测试集f1_score:0.9257200990324236,loss_f1_score:0.07427990096757642
训练集f1_score:1.0,测试集f1_score:0.9276995504041144,loss_f1_score:0.07230044959588555
训练集f1_score:1.0,测试集f1_score:0.9251348482525271,loss_f1_score:0.07486515174747288
训练集f1_score:1.0,测试集f1_score:0.9231090610362633,loss_f1_score:0.07689093896373667
训练集f1_score:1.0,测试集f1_score:0.9164413618677342,loss_f1_score:0.08355863813226583
训练集f1_score:1.0,测试集f1_score:0.9293008018695311,loss_f1_score:0.07069919813046888
训练集f1_score:1.0,测试集f1_score:0.9301285411934597,loss_f1_score:0.06987145880654033
100%|█████████████████████████████████████████████| 100/100 [33:56<00:00, 20.36s/trial, best loss: 0.06779926514672352]
Search best param: {'bagging_fraction': 0.7310343530671259, 'boosting': 'gbdt', 'feature_fraction': 0.8644701162989126, 'learning_rate': 0.0483933201073737, 'min_data_in_leaf': 15, 'num_boost_round': 1100, 'num_leaves': 60, 'objective': 'multiclass', 'max_depth': 7, 'num_threads': 8, 'is_unbalance': True, 'metric': 'None', 'train_metric': True, 'verbose': -1, 'bagging_freq': 5, 'num_class': 3, 'feature_pre_filter': False}
After feature selection and hyperparameter optimization, the final model sets its parameters to the values found by the Bayesian search, trains with 5-fold cross-validation, and averages the five folds' predictions on the test set.
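The fold-averaging pattern described above can be sketched in isolation. A minimal example on synthetic data, using sklearn's `KFold` and a trivial class-frequency baseline as a stand-in for the LightGBM model (the data and "model" here are illustrative only):

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 3, size=100)   # 3 classes, like the fishing-gear labels
X_test = rng.normal(size=(20, 5))
n_class, n_splits = 3, 5

oof_pred = np.zeros((len(X_train), n_class))   # out-of-fold predictions
test_pred = np.zeros((len(X_test), n_class))   # running average over folds

folds = KFold(n_splits=n_splits, shuffle=True, random_state=1024)
for trn_idx, val_idx in folds.split(X_train):
    # stand-in "model": predict class probabilities from training-fold frequencies
    freq = np.bincount(y_train[trn_idx], minlength=n_class) / len(trn_idx)
    oof_pred[val_idx] = freq        # each fold fills only its own validation rows
    test_pred += freq / n_splits    # test set: average of the 5 folds' predictions

labels = np.argmax(test_pred, axis=1)
```

The key point is that each training row is predicted exactly once (by the fold where it was held out), while every test row receives the mean of five predictions.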
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import f1_score, classification_report

def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True

def sub_on_line_lgb(train_, test_, pred, label, cate_cols, split,
                    is_shuffle=True,
                    use_cart=False,
                    get_prob=False):
    n_class = 3
    train_pred = np.zeros((train_.shape[0], n_class))
    test_pred = np.zeros((test_.shape[0], n_class))
    n_splits = 5
    assert split in ['kf', 'skf'], '{} Not Support this type of split way'.format(split)
    if split == 'kf':
        folds = KFold(n_splits=n_splits, shuffle=is_shuffle, random_state=1024)
        kf_way = folds.split(train_[pred])
    else:
        # unlike KFold, StratifiedKFold samples per class, so each fold keeps
        # the same class proportions as the full dataset
        folds = StratifiedKFold(n_splits=n_splits,
                                shuffle=is_shuffle,
                                random_state=1024)
        kf_way = folds.split(train_[pred], train_[label])
    print('Use {} features ...'.format(len(pred)))
    # set the parameters below to the values found by Bayesian optimization
    params = {
        'learning_rate': 0.05,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'None',
        'num_leaves': 60,
        'feature_fraction': 0.86,
        'bagging_fraction': 0.73,
        'bagging_freq': 5,
        'seed': 1,
        'bagging_seed': 1,
        'feature_fraction_seed': 7,
        'min_data_in_leaf': 15,
        'num_class': n_class,
        'nthread': 8,
        'verbose': -1,
        'num_boost_round': 1100,  # kept in params; triggers the UserWarning in the output below
        'max_depth': 7,
    }
    for n_fold, (train_idx, valid_idx) in enumerate(kf_way, start=1):
        print('the {} training start ...'.format(n_fold))
        train_x, train_y = train_[pred].iloc[train_idx], train_[label].iloc[train_idx]
        valid_x, valid_y = train_[pred].iloc[valid_idx], train_[label].iloc[valid_idx]
        if use_cart:
            dtrain = lgb.Dataset(train_x,
                                 label=train_y,
                                 categorical_feature=cate_cols)
            dvalid = lgb.Dataset(valid_x,
                                 label=valid_y,
                                 categorical_feature=cate_cols)
        else:
            dtrain = lgb.Dataset(train_x, label=train_y)
            dvalid = lgb.Dataset(valid_x, label=valid_y)
        clf = lgb.train(params=params,
                        train_set=dtrain,
                        # num_boost_round=3000,
                        valid_sets=[dvalid],
                        early_stopping_rounds=100,
                        verbose_eval=100,
                        feval=f1_score_eval)
        train_pred[valid_idx] = clf.predict(valid_x,
                                            num_iteration=clf.best_iteration)
        test_pred += clf.predict(test_[pred],
                                 num_iteration=clf.best_iteration) / folds.n_splits
    print(classification_report(train_[label], np.argmax(train_pred, axis=1),
                                digits=4))
    if get_prob:
        sub_probs = ['qyxs_prob_{}'.format(q) for q in ['围网', '刺网', '拖网']]
        prob_df = pd.DataFrame(test_pred, columns=sub_probs)
        prob_df['ID'] = test_['ID'].values
        return prob_df
    else:
        test_['label'] = np.argmax(test_pred, axis=1)
        return test_[['ID', 'label']]

use_train = all_df[all_df['label'] != -1]
use_test = all_df[all_df['label'] == -1]
# use_feats = [c for c in use_train.columns if c not in ['ID', 'label']]
use_feats = model_feature
sub = sub_on_line_lgb(use_train, use_test, use_feats, 'label', [], 'kf',
                      is_shuffle=True, use_cart=False, get_prob=False)
Use 200 features ...
the 1 training start ...
Training until validation scores don't improve for 100 rounds
D:\SOFTWEAR_H\Anaconda3\lib\site-packages\lightgbm\engine.py:151: UserWarning: Found `num_boost_round` in params. Will use it instead of argument
warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
[100] valid_0's f1_score: 0.894256
[200] valid_0's f1_score: 0.909942
[300] valid_0's f1_score: 0.913423
[400] valid_0's f1_score: 0.917897
[500] valid_0's f1_score: 0.920616
Early stopping, best iteration is:
[456] valid_0's f1_score: 0.920717
the 2 training start ...
Training until validation scores don't improve for 100 rounds
[100] valid_0's f1_score: 0.918357
[200] valid_0's f1_score: 0.916436
Early stopping, best iteration is:
[140] valid_0's f1_score: 0.92449
the 3 training start ...
Training until validation scores don't improve for 100 rounds
[100] valid_0's f1_score: 0.915242
[200] valid_0's f1_score: 0.927189
[300] valid_0's f1_score: 0.930614
Early stopping, best iteration is:
[238] valid_0's f1_score: 0.930614
the 4 training start ...
Training until validation scores don't improve for 100 rounds
[100] valid_0's f1_score: 0.901683
[200] valid_0's f1_score: 0.912985
[300] valid_0's f1_score: 0.916988
[400] valid_0's f1_score: 0.92147
[500] valid_0's f1_score: 0.921353
Early stopping, best iteration is:
[411] valid_0's f1_score: 0.922153
the 5 training start ...
Training until validation scores don't improve for 100 rounds
[100] valid_0's f1_score: 0.900975
[200] valid_0's f1_score: 0.908373
[300] valid_0's f1_score: 0.91384
[400] valid_0's f1_score: 0.917567
Early stopping, best iteration is:
[369] valid_0's f1_score: 0.919843
              precision    recall  f1-score   support

           0     0.8726    0.9001    0.8861      1621
           1     0.9569    0.8949    0.9249      1018
           2     0.9586    0.9619    0.9603      4361

    accuracy                         0.9379      7000
   macro avg     0.9294    0.9190    0.9238      7000
weighted avg     0.9385    0.9379    0.9380      7000
<ipython-input-42-6cbdd079efb6>:88: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
test_['label'] = np.argmax(test_pred, axis=1)
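The SettingWithCopyWarning above arises because `test_` is a slice of `all_df` (`all_df[all_df['label'] == -1]`), and assigning a new column to a slice is ambiguous. A common fix, sketched here on toy data, is to take an explicit `.copy()` of the slice before adding columns:

```python
import numpy as np
import pandas as pd

# toy stand-ins for all_df and the averaged fold predictions
all_df = pd.DataFrame({'ID': [1, 2, 3, 4], 'label': [0, -1, 2, -1]})
test_pred = np.array([[0.1, 0.2, 0.7],
                      [0.6, 0.3, 0.1]])

# .copy() detaches the slice from all_df, so the column assignment below is
# unambiguous and raises no SettingWithCopyWarning
test_ = all_df[all_df['label'] == -1].copy()
test_['label'] = np.argmax(test_pred, axis=1)
print(test_[['ID', 'label']])
```

Applying the same `.copy()` inside `sub_on_line_lgb` (or to `use_test` before the call) silences the warning without changing the submission result.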