智慧海洋 (Smart Ocean) Challenge - Task 4: Model Building
This section covers how to build a model and tune its parameters.
Model Training and Prediction
The main steps of model training and prediction are:
(1) Import the required libraries.
(2) Preprocess the data: load the dataset, handle missing values, normalize continuous features, encode categorical features, and so on.
(3) Train the model: choose a suitable machine learning model and fit it on the training set until it fits the data well.
(4) Predict: feed the data to be predicted into the trained model to obtain predictions.
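The four steps can be sketched with scikit-learn; the iris dataset and the pipeline components below are illustrative stand-ins, not the competition setup:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# (1)/(2) load the data and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# (2)/(3) preprocessing and training chained in one pipeline:
# impute missing values, scale continuous features, then fit the model
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# (4) predict on unseen data
pred = pipe.predict(X_test)
print(pred[:5])
```

Wrapping preprocessing in the pipeline keeps the same transformations applied to training and prediction data automatically.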
Several commonly used algorithms follow:
Random Forest
A random forest combines many decision trees through ensemble learning; its basic unit is the decision tree, and the method belongs to the ensemble-learning branch of machine learning. Its main advantages: good accuracy among current algorithms; runs efficiently on large datasets; handles high-dimensional inputs without dimensionality reduction; can estimate the importance of each feature for classification; yields an unbiased estimate of the generalization error (the out-of-bag error) during training; and copes well with missing values.
Example
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# Load the dataset
iris=datasets.load_iris()
feature=iris.feature_names
X = iris.data
y = iris.target
# Random forest classifier
clf=RandomForestClassifier(n_estimators=200)
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.1,random_state=5)
clf.fit(train_X,train_y)
test_pred=clf.predict(test_X)
test_pred
array([1, 1, 2, 0, 2, 1, 0, 2, 0, 1, 1, 1, 2, 2, 0])
# Inspect feature importances
print(str(feature)+'\n'+str(clf.feature_importances_))
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0.09698324 0.01857797 0.36312347 0.52131532]
Evaluating the model with the F1 score
# F1 score for model evaluation
# for binary classification, use average='binary'
# to account for class imbalance with a support-weighted average over classes, use 'weighted'
# to ignore class imbalance and take the plain (macro) average over classes, use 'macro'
score = f1_score(test_y, test_pred, average='macro')
print("Random forest - macro:", score)
score = f1_score(test_y, test_pred, average='weighted')
print("Random forest - weighted:", score)
Random forest - macro: 0.818181818181818
Random forest - weighted: 0.8
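The difference between the two averages is easiest to see on an imbalanced toy example (hypothetical labels, not from the iris split above):

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: class 0 has 6 samples, class 1 has 2
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]  # one class-1 sample misclassified

# per-class F1: class 0 -> 12/13, class 1 -> 2/3
macro = f1_score(y_true, y_pred, average='macro')        # plain mean of per-class F1
weighted = f1_score(y_true, y_pred, average='weighted')  # mean weighted by class support
print(macro, weighted)  # ~0.795 vs ~0.859
```

Because the majority class scores higher, the support-weighted average exceeds the macro average here; macro treats every class equally, so it penalizes poor minority-class performance more.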
LightGBM
1. Dealing with overfitting in LightGBM:
Use a smaller max_bin
Use a smaller num_leaves
Use min_data_in_leaf and min_sum_hessian_in_leaf
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use more training data
Use lambda_l1, lambda_l2 and min_gain_to_split for regularization
Limit max_depth to avoid growing overly deep trees
2. For faster training speed:
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use a smaller max_bin
Use save_binary to speed up data loading in future runs
Use parallel learning (see the Parallel Learning Guide)
3. For better accuracy:
Use a larger max_bin (may slow training)
Use a smaller learning_rate with a larger num_iterations
Use a larger num_leaves (may cause overfitting)
Use more training data
Try dart
import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
import matplotlib.pyplot as plt
# Load the data
iris = datasets.load_iris()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
# Convert to LightGBM Dataset format
train_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_test, label=y_test)
# Parameters
results = {}
params = {
    'learning_rate': 0.1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.9,
    'max_depth': 1,
    'objective': 'multiclass',  # objective function
    'num_class': 3,
    'verbose': -1
}
# Train the model
gbm = lgb.train(params, train_data, valid_sets=(validation_data, train_data),
                valid_names=('validate', 'train'), evals_result=results)
# Predict
y_pred_test = gbm.predict(X_test)
y_pred_data = gbm.predict(X_train)
y_pred_data = [list(x).index(max(x)) for x in y_pred_data]
y_pred_test = [list(x).index(max(x)) for x in y_pred_test]
# Evaluate
print(accuracy_score(y_test, y_pred_test))
print('train', f1_score(y_train, y_pred_data, average='macro'))
print('valid', f1_score(y_test, y_pred_test, average='macro'))
[1] train's multi_logloss: 0.974035 validate's multi_logloss: 0.989901
[2] train's multi_logloss: 0.871516 validate's multi_logloss: 0.895742
[3] train's multi_logloss: 0.785872 validate's multi_logloss: 0.817409
...
[69] train's multi_logloss: 0.0637188 validate's multi_logloss: 0.167495
[70] train's multi_logloss: 0.0630436 validate's multi_logloss: 0.167048
[71] train's multi_logloss: 0.0623662 validate's multi_logloss: 0.167153
...
[98] train's multi_logloss: 0.0498117 validate's multi_logloss: 0.170944
[99] train's multi_logloss: 0.049484 validate's multi_logloss: 0.17093
[100] train's multi_logloss: 0.0491638 validate's multi_logloss: 0.171452
(log truncated; the validation multi_logloss reaches its minimum of 0.167048 at iteration 70 and then drifts upward while the training loss keeps falling)
0.9777777777777777
train 0.9903381642512077
valid 0.9784047370254267
# The curves below show the validation loss flattening out and creeping back up while the training loss keeps decreasing, which indicates the model has started to overfit
lgb.plot_metric(results)
plt.show()
[Figure: training vs. validation multi_logloss curves (output_15_0.png)]
# Plot feature importance
lgb.plot_importance(gbm,importance_type = "split")
plt.show()
[Figure: feature importance by number of splits (output_16_0.png)]
XGBoost
from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score  # evaluation metric
# Load the sample dataset
iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)  # train/test split
# Algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'eval_metric': 'mlogloss',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}
train_data = xgb.DMatrix(X_train, y_train)  # build the DMatrix format
num_rounds = 500
model = xgb.train(params, train_data, num_boost_round=num_rounds)  # train the XGBoost model
# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)
# Compute the macro F1 score
F1_score = f1_score(y_test,y_pred,average='macro')
print("F1_score: %.2f%%" % (F1_score*100.0))
F1_score: 95.56%
Model code example for the 智慧海洋 dataset
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold, KFold,train_test_split
import lightgbm as lgb
import os
import warnings
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
all_df=pd.read_csv(r'E:\百度云下载\智慧海洋组队文件\group_df.csv',index_col=0)
all_df.head()
| | ID | label | cnt | anchor_cnt | anchor_ratio | lat_min | lat_max | lat_mean | lat_median | lat_nunique | ... | w2v_20_mean | w2v_21_mean | w2v_22_mean | w2v_23_mean | w2v_24_mean | w2v_25_mean | w2v_26_mean | w2v_27_mean | w2v_28_mean | w2v_29_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 414 | 372 | 0.898551 | 40.071140 | 40.276246 | 40.077743 | 40.071642 | 23 | ... | 5.184389 | 4.208591 | -2.088988 | -2.854722 | -1.686467 | 1.700887 | 1.500208 | -1.897580 | 1.876206 | 1.622946 |
| 1 | 1 | 2 | 385 | 187 | 0.485714 | 40.070104 | 40.214892 | 40.100511 | 40.071709 | 138 | ... | 1.413161 | 2.259752 | -0.436357 | -1.192455 | -0.682811 | 1.128157 | 1.210701 | -0.223389 | 1.656132 | 1.078244 |
| 2 | 10 | 2 | 397 | 60 | 0.151134 | 40.206756 | 40.594734 | 40.283386 | 40.208845 | 97 | ... | 4.050247 | 2.544489 | -0.881991 | -0.593533 | -0.080704 | 2.097406 | 0.287642 | -0.038832 | 1.786582 | 1.170315 |
| 3 | 100 | 2 | 411 | 67 | 0.163017 | 40.073365 | 40.525435 | 40.312746 | 40.336845 | 337 | ... | 0.841477 | 0.671789 | -0.587554 | -0.842925 | -0.349460 | 0.170843 | 0.420094 | 0.127812 | 0.514026 | 0.275592 |
| 4 | 1000 | 0 | 377 | 36 | 0.095491 | 41.112123 | 41.892649 | 41.742938 | 41.851099 | 257 | ... | 0.291974 | 0.578577 | -0.997851 | -0.379300 | -0.004710 | 0.518054 | 0.317797 | 0.060017 | 0.463325 | 0.733860 |
5 rows × 440 columns
use_train = all_df[all_df['label'] != -1]
use_test = all_df[all_df['label'] == -1]  # rows with label == -1 are the test set
use_feats = [c for c in use_train.columns if c not in ['ID', 'label']]
X_train,X_verify,y_train,y_verify= train_test_split(use_train[use_feats],use_train['label'],test_size=0.2,random_state=0)
# 1. Feature selection based on feature importance
############## feature-selection parameters ###################
selectFeatures = 200  # number of features to keep
earlyStopping = 100  # early-stopping rounds
select_num_boost_round = 1000  # boosting rounds for feature selection
# Base parameters
selfParam = {
    'learning_rate': 0.01,  # learning rate
    'boosting': 'dart',  # boosting type: gbdt or dart
    'objective': 'multiclass',  # multiclass objective
    'metric': 'None',
    'num_leaves': 32,  # maximum leaves per tree
    'feature_fraction': 0.8,  # fraction of features used per iteration
    'bagging_fraction': 0.8,  # fraction of samples used per iteration
    'min_data_in_leaf': 30,  # minimum samples per leaf
    'num_class': 3,
    'max_depth': 6,  # maximum tree depth
    'num_threads': 8,  # number of LightGBM threads
    'min_data_in_bin': 30,  # minimum samples per bin
    'max_bin': 256,  # maximum number of bins
    'is_unbalance': True,  # unbalanced classes
    'train_metric': True,
    'verbose': -1,
}
# Feature selection ---------------------------------------------------------------------------------
def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    # LightGBM passes multiclass predictions as a flat, class-major array,
    # so reshape to (num_class, n_samples) before taking the argmax
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True
train_data = lgb.Dataset(data=X_train,label=y_train,feature_name=use_feats)
valid_data = lgb.Dataset(data=X_verify,label=y_verify,reference=train_data,feature_name=use_feats)
sm = lgb.train(params=selfParam, train_set=train_data, num_boost_round=select_num_boost_round,
               valid_sets=[valid_data], valid_names=['valid'],
               feature_name=use_feats,
               early_stopping_rounds=earlyStopping, verbose_eval=False,
               keep_training_booster=True, feval=f1_score_eval)
features_importance = {k:v for k,v in zip(sm.feature_name(),sm.feature_importance(iteration=sm.best_iteration))}
sort_feature_importance = sorted(features_importance.items(),key=lambda x:x[1],reverse=True)
print('total feature best score:', sm.best_score)
print('total feature importance:',sort_feature_importance)
print('select forward {} features:{}'.format(selectFeatures,sort_feature_importance[:selectFeatures]))
# model_feature is the list of selected feature names
model_feature = [k[0] for k in sort_feature_importance[:selectFeatures]]
############## search space for hyper-parameter optimization ###################
spaceParam = {
    'boosting': hp.choice('boosting', ['gbdt', 'dart']),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.05)),
    'num_leaves': hp.quniform('num_leaves', 3, 66, 3),
    'feature_fraction': hp.uniform('feature_fraction', 0.7, 1),
    'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 50, 5),
    'num_boost_round': hp.quniform('num_boost_round', 500, 2000, 100),
    'bagging_fraction': hp.uniform('bagging_fraction', 0.6, 1)
}
# Hyper-parameter optimization ---------------------------------------------------------------------------------
def getParam(param):
    for k in ['num_leaves', 'min_data_in_leaf', 'num_boost_round']:
        param[k] = int(float(param[k]))
    for k in ['learning_rate', 'feature_fraction', 'bagging_fraction']:
        param[k] = float(param[k])
    # hp.choice values arrive as strings during the search, but fmin's return
    # value holds the chosen index, so map it back to the name here
    if param['boosting'] == 0:
        param['boosting'] = 'gbdt'
    elif param['boosting'] == 1:
        param['boosting'] = 'dart'
    # Fixed parameters
    param['objective'] = 'multiclass'
    param['max_depth'] = 7
    param['num_threads'] = 8
    param['is_unbalance'] = True
    param['metric'] = 'None'
    param['train_metric'] = True
    param['verbose'] = -1
    param['bagging_freq'] = 5
    param['num_class'] = 3
    param['feature_pre_filter'] = False
    return param
def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True
def lossFun(param):
    param = getParam(param)
    m = lgb.train(params=param, train_set=train_data, num_boost_round=param['num_boost_round'],
                  valid_sets=[train_data, valid_data], valid_names=['train', 'valid'],
                  feature_name=features, feval=f1_score_eval,
                  early_stopping_rounds=earlyStopping, verbose_eval=False, keep_training_booster=True)
    train_f1_score = m.best_score['train']['f1_score']
    valid_f1_score = m.best_score['valid']['f1_score']
    loss_f1_score = 1 - valid_f1_score
    print('train f1_score: {}, valid f1_score: {}, loss_f1_score: {}'.format(
        train_f1_score, valid_f1_score, loss_f1_score))
    return {'loss': loss_f1_score, 'params': param, 'status': STATUS_OK}
features = model_feature
train_data = lgb.Dataset(data=X_train[model_feature],label=y_train,feature_name=features)
valid_data = lgb.Dataset(data=X_verify[features],label=y_verify,reference=train_data,feature_name=features)
best_param = fmin(fn=lossFun, space=spaceParam, algo=tpe.suggest, max_evals=100, trials=Trials())
best_param = getParam(best_param)
print('Search best param:',best_param)
After feature selection and hyper-parameter optimization, the final model uses the Bayesian-optimized hyper-parameters, runs 10-fold cross-validation, and averages the fold predictions on the test set.
def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True
def sub_on_line_lgb(train_, test_, pred, label, cate_cols, split,
                    is_shuffle=True,
                    use_cart=False,
                    get_prob=False):
    n_class = 3
    train_pred = np.zeros((train_.shape[0], n_class))
    test_pred = np.zeros((test_.shape[0], n_class))
    n_splits = 10
    assert split in ['kf', 'skf'], 'Unsupported split type: {}'.format(split)
    if split == 'kf':
        folds = KFold(n_splits=n_splits, shuffle=is_shuffle, random_state=1024)
        kf_way = folds.split(train_[pred])
    else:
        # Unlike KFold, StratifiedKFold uses stratified sampling, so each fold
        # keeps the same class proportions as the original dataset.
        folds = StratifiedKFold(n_splits=n_splits,
                                shuffle=is_shuffle,
                                random_state=1024)
        kf_way = folds.split(train_[pred], train_[label])
    print('Use {} features ...'.format(len(pred)))
    # Replace the parameters below with the Bayesian-optimized ones
    params = {
        'learning_rate': 0.05,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'None',
        'num_leaves': 60,
        'feature_fraction': 0.86,
        'bagging_fraction': 0.73,
        'bagging_freq': 5,
        'seed': 1,
        'bagging_seed': 1,
        'feature_fraction_seed': 7,
        'min_data_in_leaf': 15,
        'num_class': n_class,
        'nthread': 8,
        'verbose': -1,
        'num_boost_round': 1100,
        'max_depth': 7,
    }
    for n_fold, (train_idx, valid_idx) in enumerate(kf_way, start=1):
        print('the {} training start ...'.format(n_fold))
        train_x, train_y = train_[pred].iloc[train_idx], train_[label].iloc[train_idx]
        valid_x, valid_y = train_[pred].iloc[valid_idx], train_[label].iloc[valid_idx]
        if use_cart:
            dtrain = lgb.Dataset(train_x, label=train_y, categorical_feature=cate_cols)
            dvalid = lgb.Dataset(valid_x, label=valid_y, categorical_feature=cate_cols)
        else:
            dtrain = lgb.Dataset(train_x, label=train_y)
            dvalid = lgb.Dataset(valid_x, label=valid_y)
        clf = lgb.train(params=params,
                        train_set=dtrain,
                        num_boost_round=3000,
                        valid_sets=[dvalid],
                        early_stopping_rounds=100,
                        verbose_eval=100,
                        feval=f1_score_eval)
        train_pred[valid_idx] = clf.predict(valid_x, num_iteration=clf.best_iteration)
        test_pred += clf.predict(test_[pred], num_iteration=clf.best_iteration) / folds.n_splits
    print(classification_report(train_[label], np.argmax(train_pred, axis=1), digits=4))
    if get_prob:
        # class names: 围网 (purse seine), 刺网 (gill net), 拖网 (trawl)
        sub_probs = ['qyxs_prob_{}'.format(q) for q in ['围网', '刺网', '拖网']]
        prob_df = pd.DataFrame(test_pred, columns=sub_probs)
        prob_df['ID'] = test_['ID'].values
        return prob_df
    else:
        test_['label'] = np.argmax(test_pred, axis=1)
        return test_[['ID', 'label']]
use_train = all_df[all_df['label'] != -1]
use_test = all_df[all_df['label'] == -1]
# Use the features kept by the feature-selection step
use_feats = model_feature
sub = sub_on_line_lgb(use_train, use_test, use_feats, 'label', [], 'kf', is_shuffle=True, use_cart=False, get_prob=False)