DW-心电信号分类预测

1.数据集介绍

赛题以心电图数据为背景,根据心电图感应数据预测心跳信号,其中心跳信号对应正常病例以及受不同心律不齐和心肌梗塞影响的病例,这是一个多分类的问题。但是由于信号数据具有时间顺序,所以是一个时序建模分析问题。只是给定的数据是转化为数值的.csv数值文件
。它的总数据量超过20万,主要为1列心跳信号序列数据,其中每个样本的信号序列采样频次一致,长度相等,从中抽取10万条作为训练集,2万条作为测试集A,2万条作为测试集B,同时会对心跳信号类别(label)信息进行脱敏。

比赛地址:https://tianchi.aliyun.com/competition/entrance/531883/introduction

1.1数据概况

train.csv

  • id 为心跳信号分配的唯一标识
  • heartbeat_signals 心跳信号序列(数据之间采用“,”进行分隔)
  • label 心跳信号类别(0、1、2、3)

testA.csv

  • id 心跳信号分配的唯一标识
  • heartbeat_signals 心跳信号序列(数据之间采用“,”进行分隔)

1.2预测指标

本比赛的评价指标是 预测的概率与真实值差值的绝对值。

具体计算公式如下:

总共有n个病例,针对某一个信号,若真实值为[y1,y2,y3,y4],模型预测概率值为[a1,a2,a3,a4],那么该模型的评价指标abs-sum为
a b s − s u m = ∑ j = 1 n ∑ i = 1 4 ∣ y i − a i ∣ {abs-sum={\mathop{ \sum }\limits_{{j=1}}^{{n}}{{\mathop{ \sum }\limits_{{i=1}}^{{4}}{{ \left| {y\mathop{{}}\nolimits_{{i}}-a\mathop{{}}\nolimits_{{i}}} \right| }}}}}} abssum=j=1ni=14yiai
例如,某心跳信号类别为1,通过编码转成[0,1,0,0],预测不同心跳信号概率为[0.1,0.7,0.1,0.1],那么这个信号预测结果的abs-sum为
a b s − s u m = ∣ 0.1 − 0 ∣ + ∣ 0.7 − 1 ∣ + ∣ 0.1 − 0 ∣ + ∣ 0.1 − 0 ∣ = 0.6 {abs-sum={ \left| {0.1-0} \right| }+{ \left| {0.7-1} \right| }+{ \left| {0.1-0} \right| }+{ \left| {0.1-0} \right| }=0.6} abssum=0.10+0.71+0.10+0.10=0.6

多分类算法常见的评估指标如下:

其实多分类的评价指标的计算方式与二分类完全一样,只不过我们计算的是针对于每一类来说的召回率、精确度、准确率和 F1分数。

1、混淆矩阵(Confuse Matrix)

  • (1)若一个实例是正类,并且被预测为正类,即为真正类TP(True Positive )
  • (2)若一个实例是正类,但是被预测为负类,即为假负类FN(False Negative )
  • (3)若一个实例是负类,但是被预测为正类,即为假正类FP(False Positive )
  • (4)若一个实例是负类,并且被预测为负类,即为真负类TN(True Negative )

第一个字母T/F,表示预测的正确与否;第二个字母P/N,表示预测的结果为正例或者负例。如TP就表示预测对了,预测的结果是正例,那它的意思就是把正例预测为了正例。

2.准确率(Accuracy)
不适合样本不均衡的情况,医疗数据大部分都是样本不均衡数据。
A c c u r a c y = C o r r e c t T o t a l A c c u r a c y = T P + T N T P + T N + F P + F N Accuracy=\frac{Correct}{Total}\\ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} Accuracy=TotalCorrectAccuracy=TP+TN+FP+FNTP+TN
3、精确率(Precision)也叫查准率简写为P
精确率(Precision)是针对预测结果而言的,其含义是在被所有预测为正的样本中实际为正样本的概率
精确率代表对正样本结果中的预测准确程度,准确率则代表整体的预测准确程度,包括正样本和负样本。
P r e c i s i o n = T P T P + F P Precision = \frac{TP}{TP + FP} Precision=TP+FPTP
4.召回率(Recall) 也叫查全率 简写为R
召回率(Recall)是针对原样本而言的,其含义是在实际为正的样本中被预测为正样本的概率
R e c a l l = T P T P + F N Recall = \frac{TP}{TP + FN} Recall=TP+FNTP

5.宏查准率(macro-P)

计算每个样本的精确率然后求平均值

6.宏查全率(macro-R)

计算每个样本的召回率然后求平均值
7.宏F1(macro-F1)
m a c r o F 1 = 2 × m a c r o P × m a c r o R m a c r o P + m a c r o R {macroF1=\frac{{2 \times macroP \times macroR}}{{macroP+macroR}}} macroF1=macroP+macroR2×macroP×macroR
与上面的宏不同,微查准查全,先将多个混淆矩阵的TP,FP,TN,FN对应位置求平均,然后按照P和R的公式求得micro-P和micro-R,最后根据micro-P和micro-R求得micro-F1

8.微查准率(micro-P)
m i c r o P = T P ‾ T P ‾ × F P ‾ {microP=\frac{{\overline{TP}}}{{\overline{TP} \times \overline{FP}}}} microP=TP×FPTP
9.微查全率(micro-R)
m i c r o R = T P ‾ T P ‾ × F N ‾ {microR=\frac{{\overline{TP}}}{{\overline{TP} \times \overline{FN}}}} microR=TP×FNTP
10.微F1(micro-F1)
m i c r o F 1 = 2 × m i c r o P × m i c r o R m i c r o P + m i c r o R {microF1=\frac{{2 \times microP\times microR }}{{microP+microR}}} microF1=microP+microR2×microP×microR

1.3赛题分析

  • 本题为传统的数据挖掘问题,通过数据科学以及机器学习深度学习的办法来进行建模得到结果。
  • 本题为典型的多分类问题,心跳信号一共有4个不同的类别
  • 主要应用xgb、lgb、catboost,以及pandas、numpy、matplotlib、seabon、sklearn、keras等等数据挖掘常用库或者框架来进行数据挖掘任务。

2.数据加载

2.1数据读取

import os
import gc
import math

import pandas as pd
import numpy as np

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler


from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('train.csv')
test=pd.read_csv('testA.csv')
train.head()
idheartbeat_signalslabel
00.9912297987616655,0.9435330436439665,0.764677…0.0
10.9912297987616655,0.9435330436439665,0.764677…0.0
21.0,0.9591487564065292,0.7013782792997189,0.23…2.0
30.9757952826275774,0.9340884687738161,0.659636…0.0
40.0,0.055816398940721094,0.26129357194994196,0…2.0
test.head()
idhearbeat_signals
1000000.9915713654170097,1.0,0.6318163407681274,0.13…
1000010.6075533139615096,0.5417083883163654,0.340694…
1000020.9752726292239277,0.6710965234906665,0.686758…
1000030.9956348033996116,0.9170249621481004,0.521096…
1000041.0,0.8879490481178918,0.745564725322326,0.531…

2.2分类指标计算

def abs_sum(y_pre,y_tru):
    #y_pre为预测概率矩阵
    #y_tru为真实类别矩阵
    y_pre=np.array(y_pre)
    y_tru=np.array(y_tru)
    loss=sum(sum(abs(y_pre-y_tru)))
    return loss

2.3EDA和数据预处理

data.describe()——获取数据的相关统计量

data.info()——获取数据类型

data.isnull().sum()——判断数据缺失和异常,查看每列的存在nan情况

"""存疑?"""
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
# 简单预处理:将第二列中的205个信号点转化为205列
train_list = []

for items in train.values:
    train_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])

train = pd.DataFrame(np.array(train_list))
train.columns = ['id'] + ['s_'+str(i) for i in range(len(train_list[0])-2)] + ['label']
train = reduce_mem_usage(train)

#test同理
Memory usage of dataframe is 157.93 MB
Memory usage after optimization is: 39.67 MB
Decreased by 74.9%
Memory usage of dataframe is 31.43 MB
Memory usage after optimization is: 7.90 MB
Decreased by 74.9%

稍微复杂些的时序特征转换方法

# 对心电特征进行行转列处理,同时为每个心电信号加入时间步特征time
train_heartbeat_df = data_train["heartbeat_signals"].str.split(",", expand=True).stack()
train_heartbeat_df = train_heartbeat_df.reset_index()
train_heartbeat_df = train_heartbeat_df.set_index("level_0")
train_heartbeat_df.index.name = None
train_heartbeat_df.rename(columns={"level_1":"time", 0:"heartbeat_signals"}, inplace=True)
train_heartbeat_df["heartbeat_signals"] = train_heartbeat_df["heartbeat_signals"].astype(float)

train_heartbeat_df
			time		heartbeat_signals
0			0				0.991230
0			1				0.943533
0			2				0.764677
0			3				0.618571
0			4				0.379632
...		...			...
99999	200			0.000000
99999	201			0.000000
99999	202			0.000000
99999	203			0.000000
99999	204			0.000000

20500000 rows × 2 columns
# 将处理后的心电特征加入到训练数据中,同时将训练数据label列单独存储
data_train_label = data_train["label"]
data_train = data_train.drop("label", axis=1)
data_train = data_train.drop("heartbeat_signals", axis=1)
data_train = data_train.join(train_heartbeat_df)

data_train
			id		time	heartbeat_signals
0			0			0			0.991230
0			0			1			0.943533
0			0			2			0.764677
0			0			3			0.618571
0			0			4			0.379632
...		...		...		...
99999	99999	200		0.0
99999	99999	201		0.0
99999	99999	202		0.0
99999	99999	203		0.0
99999	99999	204		0.0

20500000 rows × 4 columns
data_train[data_train["id"]==1]
			id		time	heartbeat_signals
1			1			0			0.971482
1			1			1			0.928969
1			1			2			0.572933
1			1			3			0.178457
1			1			4			0.122962
...		...		...		...
1			1			200		0.0
1			1			201		0.0
1			1			202		0.0
1			1			203		0.0
1			1			204		0.0

205 rows × 4 columns

2.4特征选择

**Tsfresh(TimeSeries Fresh)**可以自动计算大量的时间序列数据的特征。此外,该包还包含了特征重要性评估、特征选择的方法

  • 特征提取extract_features
from tsfresh import extract_features

# 特征提取
train_features = extract_features(data_train, column_id='id', column_sort='time')
train_features
id		sum_values		abs_energy		mean_abs_change		mean_change 	...
0			38.927945			18.216197			0.019894					-0.004859			...
1			19.445634			7.705092			0.019952					-0.004762			...
2			21.192974			9.140423			0.009863					-0.004902			...
...		...						...						...								...						...
99997	40.897057			16.412857			0.019470					-0.004538			...
99998	42.333303			14.281281			0.017032					-0.004902			...
99999	53.290117			21.637471			0.021870					-0.004539			...

100000 rows × 779 columns
  • 去除NaN值 from tsfresh.utilities.dataframe_functions import impute

train_features中包含了heartbeat_signals的779种常见的时间序列特征(所有这些特征的解释可以去看官方文档:Introduction ),这其中有的特征可能为NaN值(产生原因为当前数据不支持此类特征的计算),使用以下方式去除NaN值:

from tsfresh.utilities.dataframe_functions import impute

# 去除抽取特征中的NaN值
impute(train_features)
id		sum_values		abs_energy		mean_abs_change		mean_change 	...
0			38.927945			18.216197			0.019894					-0.004859			...
1			19.445634			7.705092			0.019952					-0.004762			...
2			21.192974			9.140423			0.009863					-0.004902			...
...		...						...						...								...						...
99997	40.897057			16.412857			0.019470					-0.004538			...
99998	42.333303			14.281281			0.017032					-0.004902			...
99999	53.290117			21.637471			0.021870					-0.004539			...

100000 rows × 779 columns

按照特征和响应变量之间的相关性进行特征选择,这一过程包含两步:首先单独计算每个特征和响应变量之间的相关性,然后进行特征选择,决定哪些特征可以被保留。

from tsfresh import select_features

# 按照特征和数据label之间的相关性进行特征选择
train_features_filtered = select_features(train_features, data_train_label)

train_features_filtered
id		sum_values		fft_coefficient__attr_"abs"__coeff_35		fft_coefficient__attr_"abs"__coeff_34		...
0			38.927945			1.168685																0.982133																...
1			19.445634			1.460752																1.924501																...
2			21.192974			1.787166																2.1469872																...
...		...						...																			...																			...
99997	40.897057			1.190514																0.674603																...
99998	42.333303			1.237608																1.325212																...
99999	53.290117			0.154759																2.921164																...

100000 rows × 700 columns

经过特征选择,留下了700个特征。

2.5训练集划分

对于数据集的划分有三种方法:留出法,交叉验证法和自助法

①留出法

留出法是直接将数据集D划分为两个互斥的集合,其中一个集合作为训练集S,另一个作为测试集T。需要注意的是在划分的时候要尽可能保证数据分布的一致性,即避免因数据划分过程引入额外的偏差而对最终结果产生影响。为了保证数据分布的一致性,通常我们采用分层采样的方式来对数据进行采样。

Tips: 通常,会将数据集D中大约2/3~4/5的样本作为训练集,其余的作为测试集。

②交叉验证法

k折交叉验证通常将数据集D分为k份,其中k-1份作为训练集,剩余的一份作为测试集,这样就可以获得k组训练/测试集,可以进行k次训练与测试,最终返回的是k个测试结果的均值。交叉验证中数据集的划分依然是依据分层采样的方式来进行。

对于交叉验证法,其k值的选取往往决定了评估结果的稳定性和保真性,通常k值选取10。当k=1的时候,我们称之为留一法

③自助法

我们每次从数据集D中取一个样本作为训练集中的元素,然后把该样本放回,重复该行为m次,这样我们就可以得到大小为m的训练集,在这里面有的样本重复出现,有的样本则没有出现过,我们把那些没有出现过的样本作为测试集。

进行这样采样的原因是因为在D中约有36.8%的数据没有在训练集中出现过。留出法与交叉验证法都是使用分层采样的方式进行数据采样与划分,而自助法则是使用有放回重复采样的方式进行数据采样

数据集划分总结
对于数据量充足的时候,通常采用留出法或者k折交叉验证法来进行训练/测试集的划分;
对于数据集小且难以有效划分训练/测试集时使用自助法;
对于数据集小且可有效划分的时候最好使用留一法来进行划分,因为这种方法最为准确

x_train = train.drop(['id','label'], axis=1)
y_train = train['label']
x_test=test.drop(['id'], axis=1)

3.模型训练

from sklearn.model_selection import KFold
# 分离数据集,方便进行交叉验证
X_train = data.drop(['id','label'], axis=1)
y_train = data['label']

# 5折交叉验证
folds = 5
seed = 2021
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

因为树模型中没有f1-score评价指标,所以需要自定义评价指标,在模型迭代中返回验证集f1-score变化情况。

def f1_score_vali(preds, data_vali):
    labels = data_vali.get_label()
    preds = np.argmax(preds.reshape(4, -1), axis=0)
    score_vali = f1_score(y_true=labels, y_pred=preds, average='macro')
    return 'f1_score', score_vali, True

使用Lightgbm进行建模

"""对训练集数据进行划分,分成训练集和验证集,并进行相应的操作"""
from sklearn.model_selection import train_test_split
import lightgbm as lgb
# 数据集划分
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train, y_train, test_size=0.2)
train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
valid_matrix = lgb.Dataset(X_val, label=y_val)

params = {
    "learning_rate": 0.1,
    "boosting": 'gbdt',  
    "lambda_l2": 0.1,
    "max_depth": -1,
    "num_leaves": 128,
    "bagging_fraction": 0.8,
    "feature_fraction": 0.8,
    "metric": None,
    "objective": "multiclass",
    "num_class": 4,
    "nthread": 10,
    "verbose": -1,
}

"""使用训练集数据进行模型训练"""
model = lgb.train(params, 
                  train_set=train_matrix, 
                  valid_sets=valid_matrix, 
                  num_boost_round=2000, 
                  verbose_eval=50, 
                  early_stopping_rounds=200,
                  feval=f1_score_vali)
Training until validation scores don't improve for 200 rounds
[50]	valid_0's multi_logloss: 0.0535465	valid_0's f1_score: 0.953675
[100]	valid_0's multi_logloss: 0.0484882	valid_0's f1_score: 0.961373
[150]	valid_0's multi_logloss: 0.0507799	valid_0's f1_score: 0.962653
[200]	valid_0's multi_logloss: 0.0531035	valid_0's f1_score: 0.963224
[250]	valid_0's multi_logloss: 0.0547945	valid_0's f1_score: 0.963721
Early stopping, best iteration is:
[88]	valid_0's multi_logloss: 0.0482441	valid_0's f1_score: 0.959676

对验证集进行预测

val_pre_lgb = model.predict(X_val, num_iteration=model.best_iteration)
preds = np.argmax(val_pre_lgb, axis=1)
score = f1_score(y_true=y_val, y_pred=preds, average='macro')
print('未调参前lightgbm单模型在验证集上的f1:{}'.format(score))
未调参前lightgbm单模型在验证集上的f1:0.9596756568138634

更进一步的,使用5折交叉验证进行模型性能评估

"""使用lightgbm 5折交叉验证进行建模预测"""
cv_scores = []
for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
    print('************************************ {} ************************************'.format(str(i+1)))
    X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    
    train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
    valid_matrix = lgb.Dataset(X_val, label=y_val)

    params = {
                "learning_rate": 0.1,
                "boosting": 'gbdt',  
                "lambda_l2": 0.1,
                "max_depth": -1,
                "num_leaves": 128,
                "bagging_fraction": 0.8,
                "feature_fraction": 0.8,
                "metric": None,
                "objective": "multiclass",
                "num_class": 4,
                "nthread": 10,
                "verbose": -1,
            }
    
    model = lgb.train(params, 
                      train_set=train_matrix, 
                      valid_sets=valid_matrix, 
                      num_boost_round=2000, 
                      verbose_eval=100, 
                      early_stopping_rounds=200,
                      feval=f1_score_vali)
    
    val_pred = model.predict(X_val, num_iteration=model.best_iteration)
    
    val_pred = np.argmax(val_pred, axis=1)
    cv_scores.append(f1_score(y_true=y_val, y_pred=val_pred, average='macro'))
    print(cv_scores)

print("lgb_scotrainre_list:{}".format(cv_scores))
print("lgb_score_mean:{}".format(np.mean(cv_scores)))
print("lgb_score_std:{}".format(np.std(cv_scores)))
...
lgb_scotrainre_list:[0.9674515729721614, 0.9656700872844327, 0.9700043639844769, 0.9655663272378014, 0.9631137190307674]
lgb_score_mean:0.9663612141019279
lgb_score_std:0.0022854824074775683

测试结果

temp=pd.DataFrame(lgb_test)
result=pd.read_csv('sample_submit.csv')
result['label_0']=temp[0]
result['label_1']=temp[1]
result['label_2']=temp[2]
result['label_3']=temp[3]
result.to_csv('submit.csv',index=False)

4.建模调参

  • 1. 贪心调参

    先使用当前对模型影响最大的参数进行调优,达到当前参数下的模型最优化,再使用对模型影响次之的参数进行调优,如此下去,直到所有的参数调整完毕。

    这个方法的缺点就是可能会调到局部最优而不是全局最优,但是只需要一步一步的进行参数最优化调试即可,容易理解。

    需要注意的是在树模型中参数调整的顺序,也就是各个参数对模型的影响程度,这里列举一下日常调参过程中常用的参数和调参顺序:

    • ①:max_depth、num_leaves
    • ②:min_data_in_leaf、min_child_weight
    • ③:bagging_fraction、 feature_fraction、bagging_freq
    • ④:reg_lambda、reg_alpha
    • ⑤:min_split_gain
    from sklearn.model_selection import cross_val_score
    # 调objective
    best_obj = dict()
    for obj in objective:
        model = LGBMRegressor(objective=obj)
        """预测并计算roc的相关指标"""
        score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()
        best_obj[obj] = score
    
    # num_leaves
    best_leaves = dict()
    for leaves in num_leaves:
        model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0], num_leaves=leaves)
        """预测并计算roc的相关指标"""
        score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()
        best_leaves[leaves] = score
    
    # max_depth
    best_depth = dict()
    for depth in max_depth:
        model = LGBMRegressor(objective=min(best_obj.items(), key=lambda x:x[1])[0],
                              num_leaves=min(best_leaves.items(), key=lambda x:x[1])[0],
                              max_depth=depth)
        """预测并计算roc的相关指标"""
        score = cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()
        best_depth[depth] = score
    
    """
    可依次将模型的参数通过上面的方式进行调整优化,并且通过可视化观察在每一个最优参数下模型的得分情况
    """
    

    可依次将模型的参数通过上面的方式进行调整优化,并且通过可视化观察在每一个最优参数下模型的得分情况

  • 2. 网格搜索

    sklearn 提供GridSearchCV用于进行网格搜索,只需要把模型的参数输进去,就能给出最优化的结果和参数。相比起贪心调参,网格搜索的结果会更优,但是网格搜索只适合于小数据集,一旦数据的量级上去了,很难得出结果。

    同样以Lightgbm算法为例,进行网格搜索调参:

    """通过网格搜索确定最优参数"""
    from sklearn.model_selection import GridSearchCV
    
    def get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=31, max_depth=-1, bagging_fraction=1.0, 
                           feature_fraction=1.0, bagging_freq=0, min_data_in_leaf=20, min_child_weight=0.001, 
                           min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=None):
        # 设置5折交叉验证
        cv_fold = KFold(n_splits=5, shuffle=True, random_state=2021)
    
        model_lgb = lgb.LGBMClassifier(learning_rate=learning_rate,
                                       n_estimators=n_estimators,
                                       num_leaves=num_leaves,
                                       max_depth=max_depth,
                                       bagging_fraction=bagging_fraction,
                                       feature_fraction=feature_fraction,
                                       bagging_freq=bagging_freq,
                                       min_data_in_leaf=min_data_in_leaf,
                                       min_child_weight=min_child_weight,
                                       min_split_gain=min_split_gain,
                                       reg_lambda=reg_lambda,
                                       reg_alpha=reg_alpha,
                                       n_jobs= 8
                                      )
    
        f1 = make_scorer(f1_score, average='micro')
        grid_search = GridSearchCV(estimator=model_lgb, 
                                   cv=cv_fold,
                                   param_grid=param_grid,
                                   scoring=f1
    
                                  )
        grid_search.fit(X_train, y_train)
    
        print('模型当前最优参数为:{}'.format(grid_search.best_params_))
        print('模型当前最优得分为:{}'.format(grid_search.best_score_))
    
    """以下代码未运行,耗时较长,请谨慎运行,且每一步的最优参数需要在下一步进行手动更新,请注意"""
    
    """
    需要注意一下的是,除了获取上面的获取num_boost_round时候用的是原生的lightgbm(因为要用自带的cv)
    下面配合GridSearchCV时必须使用sklearn接口的lightgbm。
    """
    """设置n_estimators 为581,调整num_leaves和max_depth,这里选择先粗调再细调"""
    lgb_params = {'num_leaves': range(10, 80, 5), 'max_depth': range(3,10,2)}
    get_best_cv_params(learning_rate=0.1, n_estimators=581, num_leaves=None, max_depth=None, min_data_in_leaf=20, 
                       min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
    
    """num_leaves为30,max_depth为7,进一步细调num_leaves和max_depth"""
    lgb_params = {'num_leaves': range(25, 35, 1), 'max_depth': range(5,9,1)}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=None, max_depth=None, min_data_in_leaf=20, 
                       min_child_weight=0.001,bagging_fraction=1.0, feature_fraction=1.0, bagging_freq=0, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
    
    """
    确定min_data_in_leaf为45,min_child_weight为0.001 ,下面进行bagging_fraction、feature_fraction和bagging_freq的调参
    """
    lgb_params = {'bagging_fraction': [i/10 for i in range(5,10,1)], 
                  'feature_fraction': [i/10 for i in range(5,10,1)],
                  'bagging_freq': range(0,81,10)
                 }
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                       min_child_weight=0.001,bagging_fraction=None, feature_fraction=None, bagging_freq=None, 
                       min_split_gain=0, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
    
    """
    确定bagging_fraction为0.4、feature_fraction为0.6、bagging_freq为 ,下面进行reg_lambda、reg_alpha的调参
    """
    lgb_params = {'reg_lambda': [0,0.001,0.01,0.03,0.08,0.3,0.5], 'reg_alpha': [0,0.001,0.01,0.03,0.08,0.3,0.5]}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                       min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, 
                       min_split_gain=0, reg_lambda=None, reg_alpha=None, param_grid=lgb_params)
    
    """
    确定reg_lambda、reg_alpha都为0,下面进行min_split_gain的调参
    """
    lgb_params = {'min_split_gain': [i/10 for i in range(0,11,1)]}
    get_best_cv_params(learning_rate=0.1, n_estimators=85, num_leaves=29, max_depth=7, min_data_in_leaf=45, 
                       min_child_weight=0.001,bagging_fraction=0.9, feature_fraction=0.9, bagging_freq=40, 
                       min_split_gain=None, reg_lambda=0, reg_alpha=0, param_grid=lgb_params)
    
    """
    参数确定好了以后,我们设置一个比较小的learning_rate 0.005,来确定最终的num_boost_round
    """
    # 设置5折交叉验证
    # cv_fold = StratifiedKFold(n_splits=5, random_state=0, shuffle=True, )
    final_params = {
                    'boosting_type': 'gbdt',
                    'learning_rate': 0.01,
                    'num_leaves': 29,
                    'max_depth': 7,
                    'objective': 'multiclass',
                    'num_class': 4,
                    'min_data_in_leaf':45,
                    'min_child_weight':0.001,
                    'bagging_fraction': 0.9,
                    'feature_fraction': 0.9,
                    'bagging_freq': 40,
                    'min_split_gain': 0,
                    'reg_lambda':0,
                    'reg_alpha':0,
                    'nthread': 6
                   }
    
    cv_result = lgb.cv(train_set=lgb_train,
                       early_stopping_rounds=20,
                       num_boost_round=5000,
                       nfold=5,
                       stratified=True,
                       shuffle=True,
                       params=final_params,
                       feval=f1_score_vali,
                       seed=0,
                      )
    

    在实际调整过程中,可先设置一个较大的学习率(上面的例子中0.1),通过Lgb原生的cv函数进行树个数的确定,之后再通过上面的实例代码进行参数的调整优化。

    最后针对最优的参数设置一个较小的学习率(例如0.05),同样通过cv函数确定树的个数,确定最终的参数。

    需要注意的是,针对大数据集,上面每一层参数的调整都需要耗费较长时间,

  • 贝叶斯调参

    在使用之前需要先安装包bayesian-optimization,运行如下命令即可:

    pip install bayesian-optimization
    

    贝叶斯调参的主要思想是:给定优化的目标函数(广义的函数,只需指定输入和输出即可,无需知道内部结构以及数学性质),通过不断地添加样本点来更新目标函数的后验分布(高斯过程,直到后验分布基本贴合于真实分布)。简单的说,就是考虑了上一次参数的信息,从而更好的调整当前的参数。

    贝叶斯调参的步骤如下:

    • 定义优化函数(rf_cv)
    • 建立模型
    • 定义待优化的参数
    • 得到优化结果,并返回要优化的分数指标
    from sklearn.model_selection import cross_val_score
    
    """定义优化函数"""
    def rf_cv_lgb(num_leaves, max_depth, bagging_fraction, feature_fraction, bagging_freq, min_data_in_leaf, 
                  min_child_weight, min_split_gain, reg_lambda, reg_alpha):
        # 建立模型
        model_lgb = lgb.LGBMClassifier(boosting_type='gbdt', objective='multiclass', num_class=4,
                                       learning_rate=0.1, n_estimators=5000,
                                       num_leaves=int(num_leaves), max_depth=int(max_depth), 
                                       bagging_fraction=round(bagging_fraction, 2), feature_fraction=round(feature_fraction, 2),
                                       bagging_freq=int(bagging_freq), min_data_in_leaf=int(min_data_in_leaf),
                                       min_child_weight=min_child_weight, min_split_gain=min_split_gain,
                                       reg_lambda=reg_lambda, reg_alpha=reg_alpha,
                                       n_jobs= 8
                                      )
        f1 = make_scorer(f1_score, average='micro')
        val = cross_val_score(model_lgb, X_train_split, y_train_split, cv=5, scoring=f1).mean()
    
        return val
    
    from bayes_opt import BayesianOptimization
    """定义优化参数"""
    bayes_lgb = BayesianOptimization(
        rf_cv_lgb, 
        {
            'num_leaves':(10, 200),
            'max_depth':(3, 20),
            'bagging_fraction':(0.5, 1.0),
            'feature_fraction':(0.5, 1.0),
            'bagging_freq':(0, 100),
            'min_data_in_leaf':(10,100),
            'min_child_weight':(0, 10),
            'min_split_gain':(0.0, 1.0),
            'reg_alpha':(0.0, 10),
            'reg_lambda':(0.0, 10),
        }
    )
    
    """开始优化"""
    bayes_lgb.maximize(n_iter=10)
    
    |   iter    |  target   | baggin... | baggin... | featur... | max_depth | min_ch... | min_da... | min_sp... | num_le... | reg_alpha | reg_la... |
    |  1        |  0.9785   |  0.5174   |  10.78    |  0.8746   |  10.15    |  4.288    |  48.97    |  0.2337   |  42.83    |  6.551    |  9.015    |
    |  2        |  0.9778   |  0.6777   |  41.77    |  0.5291   |  12.15    |  4.16     |  26.39    |  0.2461   |  55.78    |  6.528    |  0.6003   |
    |  3        |  0.9745   |  0.5825   |  68.77    |  0.5932   |  8.36     |  9.296    |  77.74    |  0.7946   |  79.12    |  3.045    |  5.593    |
    |  4        |  0.9802   |  0.9669   |  78.34    |  0.77     |  19.68    |  9.886    |  66.34    |  0.255    |  161.1    |  4.727    |  8.18     |
    |  5        |  0.9836   |  0.9897   |  51.9     |  0.9737   |  16.82    |  2.001    |  42.1     |  0.03563  |  134.2    |  3.437    |  1.368    |
    |  6        |  0.9749   |  0.5575   |  46.2     |  0.6518   |  15.9     |  7.817    |  34.12    |  0.341    |  153.2    |  7.144    |  7.899    |
    |  7        |  0.9793   |  0.9644   |  55.08    |  0.9795   |  18.5     |  2.085    |  41.22    |  0.7031   |  129.9    |  3.369    |  2.717    |
    |  8        |  0.9819   |  0.5926   |  58.23    |  0.6149   |  16.81    |  2.911    |  39.91    |  0.1699   |  137.3    |  2.685    |  2.891    |
    |  9        |  0.983    |  0.7796   |  50.38    |  0.7261   |  17.87    |  3.499    |  37.59    |  0.1404   |  136.1    |  2.442    |  6.621    |
    |  10       |  0.9843   |  0.638    |  49.32    |  0.9282   |  11.33    |  6.504    |  43.21    |  0.288    |  137.7    |  0.2083   |  6.966    |
    |  11       |  0.9798   |  0.8196   |  47.05    |  0.5845   |  9.075    |  2.965    |  46.16    |  0.3984   |  131.6    |  3.634    |  2.601    |
    |  12       |  0.9726   |  0.7688   |  37.57    |  0.9811   |  10.26    |  1.239    |  17.54    |  0.9651   |  46.5     |  8.834    |  6.276    |
    |  13       |  0.9836   |  0.5214   |  48.3     |  0.8203   |  19.13    |  3.129    |  35.47    |  0.08455  |  138.2    |  2.345    |  9.691    |
    |  14       |  0.9738   |  0.5617   |  45.75    |  0.8648   |  18.88    |  4.383    |  46.88    |  0.9315   |  141.8    |  4.968    |  5.563    |
    |  15       |  0.9807   |  0.8046   |  47.05    |  0.6449   |  12.38    |  0.3744   |  41.13    |  0.6808   |  138.7    |  0.8521   |  9.461    |
    =================================================================================================================================================
    
    """显示优化结果"""
    bayes_lgb.max
    
    {'target': 0.9842625,
     'params': {'bagging_fraction': 0.6379596054685973,
      'bagging_freq': 49.319589248277715,
      'feature_fraction': 0.9282486828608231,
      'max_depth': 11.32826513626976,
      'min_child_weight': 6.5044214037514845,
      'min_data_in_leaf': 43.211716584925405,
      'min_split_gain': 0.28802399981965143,
      'num_leaves': 137.7332804262704,
      'reg_alpha': 0.2082701560002398,
      'reg_lambda': 6.966270735649479}}
    

    参数优化完成后,我们可以根据优化后的参数建立新的模型,降低学习率并寻找最优模型迭代次数

    """调整一个较小的学习率,并通过cv函数确定当前最优的迭代次数"""
    base_params_lgb = {
                        'boosting_type': 'gbdt',
                        'objective': 'multiclass',
                        'num_class': 4,
                        'learning_rate': 0.01,
                        'num_leaves': 138,
                        'max_depth': 11,
                        'min_data_in_leaf': 43,
                        'min_child_weight':6.5,
                        'bagging_fraction': 0.64,
                        'feature_fraction': 0.93,
                        'bagging_freq': 49,
                        'reg_lambda': 7,
                        'reg_alpha': 0.21,
                        'min_split_gain': 0.288,
                        'nthread': 10,
                        'verbose': -1,
    }
    
    cv_result_lgb = lgb.cv(
        train_set=train_matrix,
        early_stopping_rounds=1000, 
        num_boost_round=20000,
        nfold=5,
        stratified=True,
        shuffle=True,
        params=base_params_lgb,
        feval=f1_score_vali,
        seed=0
    )
    print('迭代次数{}'.format(len(cv_result_lgb['f1_score-mean'])))
    print('最终模型的f1为{}'.format(max(cv_result_lgb['f1_score-mean'])))
    
    迭代次数4833
    最终模型的f1为0.961641452120875
    

    模型参数已经确定,建立最终模型并对验证集进行验证

    import lightgbm as lgb
    """使用lightgbm 5折交叉验证进行建模预测"""
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
        print('************************************ {} ************************************'.format(str(i+1)))
        X_train_split, y_train_split, X_val, y_val = X_train.iloc[train_index], y_train[train_index], X_train.iloc[valid_index], y_train[valid_index]
    
        train_matrix = lgb.Dataset(X_train_split, label=y_train_split)
        valid_matrix = lgb.Dataset(X_val, label=y_val)
    
        params = {
                    'boosting_type': 'gbdt',
                    'objective': 'multiclass',
                    'num_class': 4,
                    'learning_rate': 0.01,
                    'num_leaves': 138,
                    'max_depth': 11,
                    'min_data_in_leaf': 43,
                    'min_child_weight':6.5,
                    'bagging_fraction': 0.64,
                    'feature_fraction': 0.93,
                    'bagging_freq': 49,
                    'reg_lambda': 7,
                    'reg_alpha': 0.21,
                    'min_split_gain': 0.288,
                    'nthread': 10,
                    'verbose': -1,
        }
    
        model = lgb.train(params, train_set=train_matrix, num_boost_round=4833, valid_sets=valid_matrix, 
                          verbose_eval=1000, early_stopping_rounds=200, feval=f1_score_vali)
        val_pred = model.predict(X_val, num_iteration=model.best_iteration)
        val_pred = np.argmax(val_pred, axis=1)
        cv_scores.append(f1_score(y_true=y_val, y_pred=val_pred, average='macro'))
        print(cv_scores)
    
    print("lgb_scotrainre_list:{}".format(cv_scores))
    print("lgb_score_mean:{}".format(np.mean(cv_scores)))
    print("lgb_score_std:{}".format(np.std(cv_scores)))
    
    ...
    lgb_scotrainre_list:[0.9615056903324599, 0.9597829114711733, 0.9644760387635415, 0.9622009947666585, 0.9607941521618003]
    lgb_score_mean:0.9617519574991267
    lgb_score_std:0.0015797109890455313
    
    
  • 模型调参小总结

    • 集成模型内置的cv函数可以较快的进行单一参数的调节,一般可以用来优先确定树模型的迭代次数

    • 数据量较大的时候(例如本次项目的数据),网格搜索调参会特别特别慢,不建议尝试

    • 集成模型中原生库和sklearn下的库部分参数不一致,需要注意,具体可以参考官方API

5.模型融合

  • 简单加权融合:
    • 回归(分类概率):算术平均融合(Arithmetic mean),几何平均融合(Geometric mean);
    • 分类:投票(Voting)
    • 综合:排序融合(Rank averaging),log融合
import numpy as np
import pandas as pd
from sklearn import metrics

##生成一些简单的样本数据,test_prei 代表第i个模型的预测值
test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

#y_test_true 代表第模型的真实值
y_test_true = [1, 3, 2, 6] 

##定义结果的加权平均函数
def Weighted_method(test_pre1,test_pre2,test_pre3,w=[1/3,1/3,1/3]):
   Weighted_result = w[0]*pd.Series(test_pre1)+w[1]*pd.Series(test_pre2)+w[2]*pd.Series(test_pre3)
   return Weighted_result

#各模型的预测结果计算MAE
print('Pred1 MAE:',metrics.mean_absolute_error(y_test_true, test_pre1))
print('Pred2 MAE:',metrics.mean_absolute_error(y_test_true, test_pre2))
print('Pred3 MAE:',metrics.mean_absolute_error(y_test_true, test_pre3))

##根据加权计算MAE
w = [0.3,0.4,0.3] # 定义比重权值
Weighted_pre = Weighted_method(test_pre1,test_pre2,test_pre3,w)
print('Weighted_pre MAE:',metrics.mean_absolute_error(y_test_true, Weighted_pre))
Pred1 MAE: 0.1750000000000001
Pred2 MAE: 0.07499999999999993
Pred3 MAE: 0.10000000000000009
Weighted_pre MAE: 0.05750000000000027

可以发现加权结果相对于之前的结果是有提升的,这种我们称其为简单的加权平均。
还有一些特殊的形式,比如mean平均,median平均

## 定义结果的加权平均函数
def Mean_method(test_pre1,test_pre2,test_pre3):
    Mean_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).mean(axis=1)
    return Mean_result

Mean_pre = Mean_method(test_pre1,test_pre2,test_pre3)
print('Mean_pre MAE:',metrics.mean_absolute_error(y_test_true, Mean_pre))

## 定义结果的加权平均函数
def Median_method(test_pre1,test_pre2,test_pre3):
    Median_result = pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).median(axis=1)
    return Median_result

Median_pre = Median_method(test_pre1,test_pre2,test_pre3)
print('Median_pre MAE:',metrics.mean_absolute_error(y_test_true, Median_pre))
Mean_pre MAE: 0.06666666666666693
Median_pre MAE: 0.07500000000000007
  • stacking/blending:
    • 构建多层模型,并利用预测结果再拟合预测。
from sklearn import linear_model

def Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,test_pre1,test_pre2,test_pre3,model_L2= linear_model.LinearRegression()):
  model_L2.fit(pd.concat([pd.Series(train_reg1),pd.Series(train_reg2),pd.Series(train_reg3)],axis=1).values,y_train_true)
  Stacking_result = model_L2.predict(pd.concat([pd.Series(test_pre1),pd.Series(test_pre2),pd.Series(test_pre3)],axis=1).values)
  return Stacking_result

##生成一些简单的样本数据,test_prei 代表第i个模型的预测值
train_reg1 = [3.2, 8.2, 9.1, 5.2]
train_reg2 = [2.9, 8.1, 9.0, 4.9]
train_reg3 = [3.1, 7.9, 9.2, 5.0]
#y_test_true 代表第模型的真实值
y_train_true = [3, 8, 9, 5] 

test_pre1 = [1.2, 3.2, 2.1, 6.2]
test_pre2 = [0.9, 3.1, 2.0, 5.9]
test_pre3 = [1.1, 2.9, 2.2, 6.0]

#y_test_true 代表第模型的真实值
y_test_true = [1, 3, 2, 6] 

model_L2= linear_model.LinearRegression()
Stacking_pre = Stacking_method(train_reg1,train_reg2,train_reg3,y_train_true,
                             test_pre1,test_pre2,test_pre3,model_L2)
print('Stacking_pre MAE:',metrics.mean_absolute_error(y_test_true, Stacking_pre))
Stacking_pre MAE: 0.04213483146067404

可以发现模型结果相对于之前有进一步的提升,这是我们需要注意的一点是,对于第二层Stacking的模型不宜选取的过于复杂,这样会导致模型在训练集上过拟合,从而使得在测试集上并不能达到很好的效果。

  • boosting/bagging(在xgboost,Adaboost,GBDT中已经用到):

    • 多树的提升方法

    分类模型融合

import numpy as np
import lightgbm as lgb
from sklearn.datasets import make_blobs
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
(1) Voting投票机制

Voting即投票机制,分为软投票和硬投票两种,其原理采用少数服从多数的思想。

'''
硬投票:对多个模型直接进行投票,不区分模型结果的相对重要度,最终投票数最多的类为最终被预测的类。
'''
iris = datasets.load_iris()

x=iris.data
y=iris.target
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

clf1 = lgb.LGBMClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2, subsample=0.7,
                     colsample_bytree=0.6, objective='binary:logistic')
clf2 = RandomForestClassifier(n_estimators=200, max_depth=10, min_samples_split=10,
                              min_samples_leaf=63,oob_score=True)
clf3 = SVC(C=0.1)

# 硬投票
eclf = VotingClassifier(estimators=[('lgb', clf1), ('rf', clf2), ('svc', clf3)], voting='hard')
for clf, label in zip([clf1, clf2, clf3, eclf], ['LGB', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.95 (+/- 0.05) [LGB]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [SVM]
Accuracy: 0.95 (+/- 0.05) [Ensemble]
(2) 分类的Stacking\Blending融合:

stacking是一种分层模型集成框架。

以两层为例,第一层由多个基学习器组成,其输入为原始训练集,第二层的模型则是以第一层基学习器的输出作为训练集进行再训练,从而得到完整的stacking模型, stacking两层模型都使用了全部的训练数据。

'''
5-Fold Stacking
'''
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier,GradientBoostingClassifier
import pandas as pd
#创建训练的数据集
data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]

#模型融合中使用到的各个单模型
clfs = [LogisticRegression(solver='lbfgs'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]
 
#切分一部分数据作为测试集
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)

dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
dataset_blend_test = np.zeros((X_predict.shape[0], len(clfs)))

#5折stacking
n_splits = 5
skf = StratifiedKFold(n_splits)
skf = skf.split(X, y)

for j, clf in enumerate(clfs):
    #依次训练各个单模型
    dataset_blend_test_j = np.zeros((X_predict.shape[0], 5))
    for i, (train, test) in enumerate(skf):
        #5-Fold交叉训练,使用第i个部分作为预测,剩余的部分来训练模型,获得其预测的输出作为第i部分的新特征。
        X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]
        clf.fit(X_train, y_train)
        y_submission = clf.predict_proba(X_test)[:, 1]
        dataset_blend_train[test, j] = y_submission
        dataset_blend_test_j[:, i] = clf.predict_proba(X_predict)[:, 1]
    #对于测试集,直接用这k个模型的预测值均值作为新的特征。
    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_blend_test[:, j]))

clf = LogisticRegression(solver='lbfgs')
clf.fit(dataset_blend_train, y)
y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

print("Val auc Score of Stacking: %f" % (roc_auc_score(y_predict, y_submission)))
val auc Score: 1.000000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
val auc Score: 0.500000
Val auc Score of Stacking: 1.000000

Blending,其实和Stacking是一种类似的多层模型融合的形式

  • 其主要思路是把原始的训练集先分成两部分,比如70%的数据作为新的训练集,剩下30%的数据作为测试集。
  • 在第一层,我们在这70%的数据上训练多个模型,然后去预测那30%数据的label,同时也预测test集的label。
  • 在第二层,我们就直接用这30%数据在第一层预测的结果做为新特征继续训练,然后用test集第一层预测的label做特征,用第二层训练的模型做进一步预测

其优点在于

  • 比stacking简单(因为不用进行k次的交叉验证来获得stacker feature)
  • 避开了一个信息泄露问题:generlizers和stacker使用了不一样的数据集

缺点在于:

  • 使用了很少的数据(第二阶段的blender只使用training set10%的量)
  • blender可能会过拟合
  • stacking使用多次的交叉验证会比较稳健
'''
Blending
'''
 
#创建训练的数据集
#创建训练的数据集
data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]
 
#模型融合中使用到的各个单模型
clfs = [LogisticRegression(solver='lbfgs'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        #ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]

#切分一部分数据作为测试集
X, X_predict, y, y_predict = train_test_split(data, target, test_size=0.3, random_state=2020)

#切分训练数据集为d1,d2两部分
X_d1, X_d2, y_d1, y_d2 = train_test_split(X, y, test_size=0.5, random_state=2020)
dataset_d1 = np.zeros((X_d2.shape[0], len(clfs)))
dataset_d2 = np.zeros((X_predict.shape[0], len(clfs)))
 
for j, clf in enumerate(clfs):
    #依次训练各个单模型
    clf.fit(X_d1, y_d1)
    y_submission = clf.predict_proba(X_d2)[:, 1]
    dataset_d1[:, j] = y_submission
    #对于测试集,直接用这k个模型的预测值作为新的特征。
    dataset_d2[:, j] = clf.predict_proba(X_predict)[:, 1]
    print("val auc Score: %f" % roc_auc_score(y_predict, dataset_d2[:, j]))

#融合使用的模型
clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(dataset_d1, y_d2)
y_submission = clf.predict_proba(dataset_d2)[:, 1]
print("Val auc Score of Blending: %f" % (roc_auc_score(y_predict, y_submission)))
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
val auc Score: 1.000000
Val auc Score of Blending: 1.000000

一些其它方法

将特征放进模型中预测,并将预测结果变换并作为新的特征加入原有特征中再经过模型预测结果 (Stacking变化)
(可以反复预测多次将结果加入最后的特征中)

def Ensemble_add_feature(train,test,target,clfs):
    
    # n_flods = 5
    # skf = list(StratifiedKFold(y, n_folds=n_flods))

    train_ = np.zeros((train.shape[0],len(clfs*2)))
    test_ = np.zeros((test.shape[0],len(clfs*2)))

    for j,clf in enumerate(clfs):
        '''依次训练各个单模型'''
        # print(j, clf)
        '''使用第1个部分作为预测,第2部分来训练模型,获得其预测的输出作为第2部分的新特征。'''
        # X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]

        clf.fit(train,target)
        y_train = clf.predict(train)
        y_test = clf.predict(test)

        ## 新特征生成
        train_[:,j*2] = y_train**2
        test_[:,j*2] = y_test**2
        train_[:, j+1] = np.exp(y_train)
        test_[:, j+1] = np.exp(y_test)
        # print("val auc Score: %f" % r2_score(y_predict, dataset_d2[:, j]))
        print('Method ',j)

    train_ = pd.DataFrame(train_)
    test_ = pd.DataFrame(test_)
    return train_,test_

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

data_0 = iris.data
data = data_0[:100,:]

target_0 = iris.target
target = target_0[:100]

x_train,x_test,y_train,y_test=train_test_split(data,target,test_size=0.3)
x_train = pd.DataFrame(x_train) ; x_test = pd.DataFrame(x_test)

#模型融合中使用到的各个单模型
clfs = [LogisticRegression(),
        RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='gini'),
        ExtraTreesClassifier(n_estimators=5, n_jobs=-1, criterion='entropy'),
        GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=5)]

New_train,New_test = Ensemble_add_feature(x_train,x_test,y_train,clfs)

clf = LogisticRegression()
# clf = GradientBoostingClassifier(learning_rate=0.02, subsample=0.5, max_depth=6, n_estimators=30)
clf.fit(New_train, y_train)
y_emb = clf.predict_proba(New_test)[:, 1]

print("Val auc Score of stacking: %f" % (roc_auc_score(y_test, y_emb)))
Method  0
Method  1
Method  2
Method  3
Method  4
Val auc Score of stacking: 1.000000

6.总结

reduce_mem_usage减少内存使用这块
lgb模型参数和模型原理
对模型进行调参,贪心调参、网格搜索调参、贝叶斯调参共三种调参手段,使用贝叶斯调参对本次项目进行简单优化
模型融合:(之后补)

  1. 结果层面的融合,这种是最常见的融合方法,其可行的融合方法也有很多,比如根据结果的得分进行加权融合,还可以做Log,exp处理等。在做结果融合的时候。有一个很重要的条件是模型结果的得分要比较近似但结果的差异要比较大,这样的结果融合往往有比较好的效果提升。如果不满足这个条件带来的效果很低,甚至是负效果。

  2. 特征层面的融合,这个层面叫融合融合并不准确,主要是队伍合并后大家可以相互学习特征工程。如果我们用同种模型训练,可以把特征进行切分给不同的模型,然后在后面进行模型或者结果融合有时也能产生比较好的效果。

  3. 模型层面的融合,模型层面的融合可能就涉及模型的堆叠和设计,比如加stacking,部分模型的结果作为特征输入等,这些就需要多实验和思考了,基于模型层面的融合最好不同模型类型要有一定的差异,用同种模型不同的参数的收益一般是比较小的。
    如果给定的是信号图的形式又该如何先提取成数值形式的.csv文件呢

  • 2
    点赞
  • 18
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值