智慧海洋建设 (Smart Ocean) – Task 4: Model Building

This section covers how to build models and tune their parameters.

Model Training and Prediction

The main steps of model training and prediction are:
(1) Import the required libraries.
(2) Preprocess the data: load the dataset and clean it up, including handling missing values, normalizing continuous features, and converting categorical features (see the sketch after this list).
(3) Train the model: choose a suitable machine learning model and fit it on the training set until it fits well.
(4) Predict: feed the data to be predicted into the trained model to obtain the predictions.
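A minimal preprocessing sketch for step (2); the file name and column names here are hypothetical, not taken from the competition data:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('train.csv')  # hypothetical input file
# missing-value handling: fill numeric gaps with the column median
df['speed'] = df['speed'].fillna(df['speed'].median())
# continuous-feature normalization: scale latitude/longitude to [0, 1]
df[['lat', 'lon']] = MinMaxScaler().fit_transform(df[['lat', 'lon']])
# categorical-feature conversion: one-hot encode the vessel type
df = pd.get_dummies(df, columns=['type'])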
Several commonly used algorithms follow.

Random Forest

Random forest is an algorithm that combines many trees through the idea of ensemble learning; its basic unit is the decision tree, and it belongs to the ensemble-learning branch of machine learning. Its main advantages are: good accuracy among current algorithms; efficient operation on large datasets; the ability to handle high-dimensional inputs without dimensionality reduction; the ability to estimate each feature's importance for classification; an unbiased estimate of the internal generalization error obtained during training (the out-of-bag estimate); and good results even in the presence of missing values.
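The out-of-bag (OOB) error estimate mentioned above is exposed in scikit-learn through the oob_score flag; a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
# accuracy estimated from the samples each tree never saw in its bootstrap
print(clf.oob_score_)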

Example

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
# Load the iris dataset
iris=datasets.load_iris()
feature=iris.feature_names
X = iris.data
y = iris.target
# Random forest classifier
clf=RandomForestClassifier(n_estimators=200)
train_X,test_X,train_y,test_y = train_test_split(X,y,test_size=0.1,random_state=5)
clf.fit(train_X,train_y)
test_pred=clf.predict(test_X)
test_pred
array([1, 1, 2, 0, 2, 1, 0, 2, 0, 1, 1, 1, 2, 2, 0])
# Inspect feature importances
print(str(feature)+'\n'+str(clf.feature_importances_))
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0.09698324 0.01857797 0.36312347 0.52131532]

Evaluating the model with the F1 score

#F1-score is used for model evaluation
#for binary classification, use average='binary'
#to account for class imbalance with a support-weighted average, use 'weighted'
#to ignore class imbalance and take the unweighted macro average, use 'macro'
score=f1_score(test_y,test_pred,average='macro')
print("随机森林-macro:",score)
score=f1_score(test_y,test_pred,average='weighted')
print("随机森林-weighted:",score)
随机森林-macro: 0.818181818181818
随机森林-weighted: 0.8
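To make the difference concrete, a small hand-checkable sketch on toy labels (not the iris split above): macro averages the per-class F1 scores equally, while weighted averages them by class support:

import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]
per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
print(per_class.mean())                             # == average='macro'
support = np.bincount(y_true)                       # class frequencies
print(np.average(per_class, weights=support))       # == average='weighted'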

LightGBM

1. Handling LightGBM overfitting (a params sketch wiring these controls together follows the lists):
Use a smaller max_bin
Use a smaller num_leaves
Use min_data_in_leaf and min_sum_hessian_in_leaf
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use more training data
Use regularization via lambda_l1, lambda_l2 and min_gain_to_split
Try max_depth to avoid growing overly deep trees
2. Getting faster LightGBM training:
Use bagging by setting the bagging_fraction and bagging_freq parameters
Use feature sub-sampling by setting the feature_fraction parameter
Use a smaller max_bin
Use save_binary to speed up data loading in future runs
Use parallel learning; see the parallel learning guide
3. Getting better accuracy:
Use a larger max_bin (training may slow down)
Use a smaller learning_rate with a larger num_iterations
Use a larger num_leaves (may cause overfitting)
Use more training data
Try dart
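As a sketch only, the overfitting controls from list 1 map onto a LightGBM params dict like the following; the concrete values are illustrative, not tuned for any dataset:

anti_overfit_params = {
    'objective': 'multiclass',
    'num_class': 3,
    'max_bin': 127,                  # smaller max_bin
    'num_leaves': 15,                # smaller num_leaves
    'min_data_in_leaf': 50,          # minimum samples per leaf
    'min_sum_hessian_in_leaf': 1e-2, # minimum hessian sum per leaf
    'bagging_fraction': 0.8,         # row subsampling ...
    'bagging_freq': 5,               # ... re-drawn every 5 iterations
    'feature_fraction': 0.8,         # column subsampling
    'lambda_l1': 0.1,                # L1 regularization
    'lambda_l2': 0.1,                # L2 regularization
    'min_gain_to_split': 0.01,       # minimum gain required to split
    'max_depth': 5,                  # cap tree depth
}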


import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score  # f1_score is used in the evaluation below
import matplotlib.pyplot as plt

# Load the data
iris = datasets.load_iris()
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
# Convert to LightGBM Dataset format
train_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_test, label=y_test)
# Parameters
results = {}
params = {
    'learning_rate': 0.1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.9,
    'max_depth': 1,
    'objective': 'multiclass',  # objective function
    'num_class': 3,
    'verbose': -1 
}
# Train the model (evals_result collects per-iteration metrics for plotting)
gbm = lgb.train(params, train_data, valid_sets=(validation_data, train_data),
                valid_names=('validate', 'train'), evals_result=results)
# Predict class probabilities, then take the argmax as the predicted class
y_pred_test = gbm.predict(X_test)
y_pred_data = gbm.predict(X_train)
y_pred_data = [list(x).index(max(x)) for x in y_pred_data]
y_pred_test = [list(x).index(max(x)) for x in y_pred_test]
# Evaluate the model
print(accuracy_score(y_test, y_pred_test))
print('train set', f1_score(y_train, y_pred_data, average='macro'))
print('validation set', f1_score(y_test, y_pred_test, average='macro'))
[1]	train's multi_logloss: 0.974035	validate's multi_logloss: 0.989901
[2]	train's multi_logloss: 0.871516	validate's multi_logloss: 0.895742
[3]	train's multi_logloss: 0.785872	validate's multi_logloss: 0.817409
...
[70]	train's multi_logloss: 0.0630436	validate's multi_logloss: 0.167048
...
[100]	train's multi_logloss: 0.0491638	validate's multi_logloss: 0.171452
0.9777777777777777
train set 0.9903381642512077
validation set 0.9784047370254267
# The curves below show the validation loss sitting above the training loss and
# rising again after roughly iteration 70 while the training loss keeps falling,
# so the model is overfitting
lgb.plot_metric(results)
plt.show()

[Figure: training vs. validation multi_logloss curves (output_15_0.png)]
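A standard remedy for this overfitting is early stopping on the validation loss. A minimal sketch, assuming a recent LightGBM where this is configured through callbacks (older versions accepted an early_stopping_rounds argument to lgb.train instead):

results = {}
gbm = lgb.train(params, train_data,
                num_boost_round=500,
                valid_sets=(validation_data, train_data),
                valid_names=('validate', 'train'),
                callbacks=[lgb.early_stopping(stopping_rounds=30),
                           lgb.record_evaluation(results)])
# stops ~30 rounds after the validate multi_logloss last improved
print(gbm.best_iteration)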

# Plot feature importances
lgb.plot_importance(gbm,importance_type = "split")
plt.show()

[Figure: LightGBM feature importance by split count (output_16_0.png)]

XGBoost

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score   # F1 score for evaluation
# Load the sample dataset
iris = load_iris()
X,y = iris.data,iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565) # split the dataset
# Algorithm parameters
params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'eval_metric':'mlogloss',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.75,
    'min_child_weight': 3,
    'eta': 0.1,
    'seed': 1,
    'nthread': 4,
}

train_data = xgb.DMatrix(X_train, y_train) # convert to DMatrix format
num_rounds = 500
model = xgb.train(params, train_data, num_boost_round=num_rounds) # train the XGBoost model (num_rounds was previously defined but never passed)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

# Compute the macro F1 score
F1_score = f1_score(y_test,y_pred,average='macro')
print("F1_score: %.2f%%" % (F1_score*100.0))
F1_score: 95.56%
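The plot_importance helper imported at the top of this block is never used; as a small follow-up sketch, it visualizes how many splits each feature receives:

plot_importance(model)  # bar chart of per-feature split counts
plt.show()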

Model code example on the 智慧海洋 competition dataset

import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold, KFold,train_test_split
import lightgbm as lgb
import os
import warnings
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
all_df=pd.read_csv(r'E:\百度云下载\智慧海洋组队文件\group_df.csv',index_col=0)
all_df.head()
[all_df.head() output: columns include ID, label, cnt, anchor_cnt, anchor_ratio, lat_min, lat_max, lat_mean, lat_median, lat_nunique, ..., w2v_20_mean ... w2v_29_mean; 5 rows × 440 columns]

use_train = all_df[all_df['label'] != -1]
use_test = all_df[all_df['label'] == -1]  # rows with label == -1 are the test set
use_feats = [c for c in use_train.columns if c not in ['ID', 'label']]
X_train,X_verify,y_train,y_verify= train_test_split(use_train[use_feats],use_train['label'],test_size=0.2,random_state=0)
#1. Feature selection based on feature importance
############## feature-selection parameters ###################
selectFeatures = 200 # number of features to keep
earlyStopping = 100 # early-stopping rounds
select_num_boost_round = 1000 # boosting rounds for the selection run
# base parameters
selfParam = {
    'learning_rate':0.01, # learning rate
    'boosting':'dart', # algorithm type: gbdt or dart
    'objective':'multiclass', # multiclass objective
    'metric':'None',
    'num_leaves':32, # number of leaves
    'feature_fraction':0.8, # fraction of features per tree
    'bagging_fraction':0.8, # fraction of samples per iteration
    'min_data_in_leaf':30, # minimum samples per leaf
    'num_class': 3,
    'max_depth':6, # maximum tree depth

    'num_threads':8, # number of LightGBM threads
    'min_data_in_bin':30, # minimum samples per bin
    'max_bin':256, # maximum number of bins
    'is_unbalance':True, # unbalanced classes
    'train_metric':True,
    'verbose':-1,
}
# Feature selection ---------------------------------------------------------------------------------
def f1_score_eval(preds, valid_df):
    # LightGBM hands multiclass preds to feval as a flat, class-major array,
    # so reshape to (n_class, n_samples) and take the argmax over the class axis
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True

train_data = lgb.Dataset(data=X_train,label=y_train,feature_name=use_feats)
valid_data = lgb.Dataset(data=X_verify,label=y_verify,reference=train_data,feature_name=use_feats)

sm = lgb.train(params=selfParam,train_set=train_data,num_boost_round=select_num_boost_round,
                      valid_sets=[valid_data],valid_names=['valid'],
                      feature_name=use_feats,
                      early_stopping_rounds=earlyStopping,verbose_eval=False,keep_training_booster=True,feval=f1_score_eval)
features_importance = {k:v for k,v in zip(sm.feature_name(),sm.feature_importance(iteration=sm.best_iteration))}
sort_feature_importance = sorted(features_importance.items(),key=lambda x:x[1],reverse=True)
print('total feature best score:', sm.best_score)
print('total feature importance:',sort_feature_importance)
print('select forward {} features:{}'.format(selectFeatures,sort_feature_importance[:selectFeatures]))
#model_feature holds the names of the selected features
model_feature = [k[0] for k in sort_feature_importance[:selectFeatures]]
############## search space for hyperparameter optimization ###################
spaceParam = {
    'boosting': hp.choice('boosting',['gbdt','dart']),
    'learning_rate':hp.loguniform('learning_rate', np.log(0.01), np.log(0.05)),
    'num_leaves': hp.quniform('num_leaves', 3, 66, 3), 
    'feature_fraction': hp.uniform('feature_fraction', 0.7,1),
    'min_data_in_leaf': hp.quniform('min_data_in_leaf', 10, 50,5), 
    'num_boost_round':hp.quniform('num_boost_round',500,2000,100), 
    'bagging_fraction':hp.uniform('bagging_fraction',0.6,1)  
}
# Hyperparameter optimization ---------------------------------------------------------------------------------
def getParam(param):
    for k in ['num_leaves', 'min_data_in_leaf','num_boost_round']:
        param[k] = int(float(param[k]))
    for k in ['learning_rate', 'feature_fraction','bagging_fraction']:
        param[k] = float(param[k])
    if param['boosting'] == 0:
        param['boosting'] = 'gbdt'
    elif param['boosting'] == 1:
        param['boosting'] = 'dart'
    # add the fixed parameters
    param['objective'] = 'multiclass'
    param['max_depth'] = 7
    param['num_threads'] = 8
    param['is_unbalance'] = True
    param['metric'] = 'None'
    param['train_metric'] = True
    param['verbose'] = -1
    param['bagging_freq']=5
    param['num_class']=3 
    param['feature_pre_filter']=False
    return param
def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True
def lossFun(param):
    param = getParam(param)
    m = lgb.train(params=param,train_set=train_data,num_boost_round=param['num_boost_round'],
                          valid_sets=[train_data,valid_data],valid_names=['train','valid'],
                          feature_name=features,feval=f1_score_eval,
                          early_stopping_rounds=earlyStopping,verbose_eval=False,keep_training_booster=True)
    train_f1_score = m.best_score['train']['f1_score']
    valid_f1_score = m.best_score['valid']['f1_score']
    loss_f1_score = 1 - valid_f1_score
    print('train f1_score: {}, validation f1_score: {}, loss_f1_score: {}'.format(train_f1_score, valid_f1_score, loss_f1_score))
    return {'loss': loss_f1_score, 'params': param, 'status': STATUS_OK}

features = model_feature
train_data = lgb.Dataset(data=X_train[model_feature],label=y_train,feature_name=features)
valid_data = lgb.Dataset(data=X_verify[features],label=y_verify,reference=train_data,feature_name=features)

best_param = fmin(fn=lossFun, space=spaceParam, algo=tpe.suggest, max_evals=100, trials=Trials())
best_param = getParam(best_param)
print('Search best param:',best_param)
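Note that fmin returns the option index (0/1) for hp.choice parameters rather than the value itself, which is why getParam maps the index back to 'gbdt'/'dart'. hyperopt's space_eval performs the same decoding on the raw fmin result, as an equivalent alternative sketch:

from hyperopt import space_eval

trials = Trials()
raw_best = fmin(fn=lossFun, space=spaceParam, algo=tpe.suggest, max_evals=100, trials=trials)
decoded = space_eval(spaceParam, raw_best)  # hp.choice indices resolved back to 'gbdt'/'dart'
print(decoded)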

After feature selection and hyperparameter optimization, the final model is trained with the Bayesian-optimized hyperparameters under 10-fold cross-validation, and the per-fold predictions on the test set are accumulated and averaged.

def f1_score_eval(preds, valid_df):
    labels = valid_df.get_label()
    preds = np.argmax(preds.reshape(3, -1), axis=0)
    scores = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', scores, True

def sub_on_line_lgb(train_, test_, pred, label, cate_cols, split,
                    is_shuffle=True,
                    use_cart=False,
                    get_prob=False):
    n_class = 3
    train_pred = np.zeros((train_.shape[0], n_class))
    test_pred = np.zeros((test_.shape[0], n_class))
    n_splits = 10

    assert split in ['kf', 'skf'], '{}: unsupported split type'.format(split)

    if split == 'kf':
        folds = KFold(n_splits=n_splits, shuffle=is_shuffle, random_state=1024)
        kf_way = folds.split(train_[pred])
    else:
        # Unlike KFold, StratifiedKFold samples in a stratified way so each fold keeps the same class proportions as the full dataset.
        folds = StratifiedKFold(n_splits=n_splits,
                                shuffle=is_shuffle,
                                random_state=1024)
        kf_way = folds.split(train_[pred], train_[label])

    print('Use {} features ...'.format(len(pred)))
    # set these to the hyperparameters found by the Bayesian optimization above
    params = {
        'learning_rate': 0.05,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'None',
        'num_leaves': 60,
        'feature_fraction':0.86,
        'bagging_fraction': 0.73,
        'bagging_freq': 5,
        'seed': 1,
        'bagging_seed': 1,
        'feature_fraction_seed': 7,
        'min_data_in_leaf': 15,
        'num_class': n_class,
        'nthread': 8,
        'verbose': -1,
        'num_boost_round': 1100,
        'max_depth': 7,
    }
    for n_fold, (train_idx, valid_idx) in enumerate(kf_way, start=1):
        print('fold {} training starts ...'.format(n_fold))
        train_x, train_y = train_[pred].iloc[train_idx], train_[label].iloc[train_idx]
        valid_x, valid_y = train_[pred].iloc[valid_idx], train_[label].iloc[valid_idx]

        if use_cart:
            dtrain = lgb.Dataset(train_x,
                                 label=train_y,
                                 categorical_feature=cate_cols)
            dvalid = lgb.Dataset(valid_x,
                                 label=valid_y,
                                 categorical_feature=cate_cols)
        else:
            dtrain = lgb.Dataset(train_x, label=train_y)
            dvalid = lgb.Dataset(valid_x, label=valid_y)

        clf = lgb.train(params=params,
                        train_set=dtrain,
                        num_boost_round=3000,
                        valid_sets=[dvalid],
                        early_stopping_rounds=100,
                        verbose_eval=100,
                        feval=f1_score_eval)
        train_pred[valid_idx] = clf.predict(valid_x,
                                            num_iteration=clf.best_iteration)
        test_pred += clf.predict(test_[pred],
                                 num_iteration=clf.best_iteration) / folds.n_splits
    print(classification_report(train_[label], np.argmax(train_pred, axis=1), digits=4))
    if get_prob:
        sub_probs = ['qyxs_prob_{}'.format(q) for q in ['围网', '刺网', '拖网']]  # seine, gillnet, trawl
        prob_df = pd.DataFrame(test_pred, columns=sub_probs)
        prob_df['ID'] = test_['ID'].values
        return prob_df
    else:
        test_['label'] = np.argmax(test_pred, axis=1)
        return test_[['ID', 'label']]
use_train = all_df[all_df['label'] != -1]
use_test = all_df[all_df['label'] == -1]
use_feats = [c for c in use_train.columns if c not in ['ID', 'label']]
use_feats = model_feature  # restrict to the features kept by the selection step
sub = sub_on_line_lgb(use_train, use_test, use_feats, 'label', [], 'kf', is_shuffle=True, use_cart=False, get_prob=False)
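As a final sketch, the returned predictions can be written out as a submission file; the numeric-to-name mapping and the output format below are assumptions for illustration, not specified above:

# hypothetical label order: verify against the encoding used when labels were created
label_map = {0: '拖网', 1: '围网', 2: '刺网'}
sub['label'] = sub['label'].map(label_map)
sub.to_csv('submission.csv', index=False, header=False)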
