招商银行 (China Merchants Bank) 2022 Fintech Elite Training Camp, Data Track: 4th Place Solution

0. Competition Overview

Competition window: April 29, 9:00 to May 12, 17:00

Format: from April 29, 9:00 to May 9, 24:00, the A-stage data (test_A) is open, with at most 3 prediction submissions per day; from May 10, 00:00 to May 12, 17:00, the B-stage data (test_B) is open, likewise limited to 3 submissions per day. Duplicate submissions and incorrectly formatted submissions both consume an attempt, so submit carefully; after submitting, be sure to click the "Run" button to see your current ranking.

The leaderboard ranks participants by a final score of: best A-stage score * 0.3 + best B-stage score * 0.7. Higher is better.

This is a Kaggle-style data science competition: given anonymized financial data from the bank, predict whether a customer will make a deposit (binary classification), evaluated by AUC (Area Under the ROC Curve). This post shares the data analysis and some common machine learning tricks.

1. Data Preprocessing

The training data consists of 49 features plus a LABEL column (0 or 1). The features mix numeric (int or float) and string (str) types, and also contain "?" placeholders and many missing values; since the models used here only accept numeric input, the data needs preprocessing.

1.1 Converting missing values to np.nan

import pandas as pd
import numpy as np

train = pd.read_excel('fintech训练营/train.xlsx')
test = pd.read_excel('fintech训练营/test_A榜.xlsx')
datasets = [train, test]
for dataset in datasets:
    for i in dataset.columns:
        # Convert the "?" placeholder to np.nan
        dataset[i] = dataset[i].apply(lambda x : np.nan if x=='?' else x)

1.2 Encoding categorical features

from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()
for dataset in datasets:
    for i in dataset.columns:
        if dataset[i].dtype == 'object':
            if i != 'CUST_UID':
                # e.g. maps "A", "B" to 0, 1; note that fitting the encoder
                # per dataset assumes train and test share the same category
                # values in each column
                dataset[i] = label.fit_transform(dataset[i])

2. Data Exploration

2.1 Checking for class imbalance

train['LABEL'].value_counts() 

LABEL
0    30000
1    10000
Name: LABEL, dtype: int64

The 0:1 ratio is 3:1. A common rule of thumb treats ratios above roughly 10:1 as imbalanced, so no resampling or data augmentation is needed here.
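This rule of thumb is easy to check in code. A minimal sketch (the 30000/10000 counts mirror the printout above; the 10:1 threshold is only a heuristic, not a hard rule):

```python
import pandas as pd

# Hypothetical label column mirroring the 3:1 split shown above.
labels = pd.Series([0] * 30000 + [1] * 10000, name="LABEL")

counts = labels.value_counts()
ratio = counts.max() / counts.min()

# Rule of thumb: treat ratios above ~10:1 as imbalanced enough to
# warrant resampling or class weighting.
needs_rebalancing = ratio > 10
print(ratio, needs_rebalancing)  # 3.0 False
```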

2.2 Outliers

import matplotlib.pyplot as plt
import seaborn as sns
cols = 4
rows = 13
plt.figure(figsize = (4*cols, 3*rows))
i = 1
for col in train.columns[2:]:
    if train[col].dtype != 'object':
        ax = plt.subplot(rows, cols, i)
        ax = sns.boxplot(train[col], orient = 'v', width = 0.5)
        ax.set_xlabel(col)
        ax.set_ylabel('value')
        ax.legend(['train'])
        i += 1
plt.tight_layout()
plt.show()

The box plots show each feature's outlier distribution, which is useful for further analysis.

2.3 Kernel density plots

cols = 4
rows = 13
plt.figure(figsize = (4*cols, 3*rows))
i = 1
for col in train.columns[2:]:
    ax = plt.subplot(rows,cols,i)
    ax = sns.kdeplot(train[col].dropna(), color = 'red', shade = True)
    ax = sns.kdeplot(test[col].dropna(), color = 'blue', shade = True)
    ax.set_xlabel(col)
    ax.set_ylabel('density')
    ax = ax.legend(['train','test'])
    i += 1
plt.tight_layout()
plt.show()

The kernel density plots show each feature's distribution (normal, bimodal, etc.) and how train and test compare, which is useful for further analysis.

2.4 Adversarial validation

If the train and test sets have different distributions, the model may overfit the training set and generalize poorly. Adversarial validation checks for this: assign label=1 to every training row and label=0 to every test row, train a classifier on all features to predict that label, and compute its cross-validated AUC (Area Under Curve). Then:

(1) If the AUC is high, some features separate train from test, i.e. they can accurately tell the two sets apart (dataset shift), which hurts the model's predictions. Drop the features at the top of the importance ranking and predict again.

(2) Repeat step 1 until the AUC falls to roughly 0.5-0.6.

from sklearn.model_selection import train_test_split
train_new = train.copy()
test_new = test.copy()
train_new = train_new.drop(['CUST_UID','LABEL'], axis = 1) # keep features only
test_new = test_new.drop(['CUST_UID'], axis = 1) # keep features only
train_new['label'] = 1 # training rows get label 1
test_new['label'] = 0 # test rows get label 0
data = pd.concat([train_new, test_new], axis = 0)
test_size_pct = 0.2 # hold out 20% for validation
X_train, X_valid, y_train, y_valid = train_test_split(data.drop(['label'], axis = 1), data['label'], test_size = test_size_pct, random_state = 42)

Next we train a LightGBM model (hereafter LGBM; install with pip install lightgbm):

from lightgbm import LGBMClassifier
from lightgbm import log_evaluation, early_stopping
lgb = LGBMClassifier(verbosity = -1) 
lgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'], 
       callbacks = [log_evaluation(period = 50), early_stopping(stopping_rounds = 128)])

Training until validation scores don't improve for 128 rounds
[50]   valid_0's auc: 0.500989  valid_0's binary_logloss: 0.534569
[100]  valid_0's auc: 0.504464  valid_0's binary_logloss: 0.536658
Did not meet early stopping. Best iteration is:
[6]    valid_0's auc: 0.510417  valid_0's binary_logloss: 0.532726

Then evaluate the adversarial classifier's AUC:

from sklearn.metrics import roc_auc_score
pred_lgb = lgb.predict_proba(X_valid)[:,1]
roc_auc_score(y_valid, pred_lgb)

 0.510416950256567

An AUC of 0.51 means the train and test feature distributions are close, so no features need to be dropped.
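Had the adversarial AUC come out high, the next step would be to drop the most "discriminative" features. A sketch of that loop's core on synthetic data, with a RandomForestClassifier standing in for the LGBM above (the drifted column f0 and all other names here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: feature "f0" drifts between the train rows
# (label 1) and test rows (label 0), so it should top the importance ranking.
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame({
    "f0": np.r_[rng.normal(0, 1, n), rng.normal(3, 1, n)],  # drifted
    "f1": rng.normal(0, 1, 2 * n),                          # stable
    "f2": rng.normal(0, 1, 2 * n),                          # stable
    "label": np.r_[np.ones(n), np.zeros(n)],
})

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(data.drop(columns="label"), data["label"])

# Rank features by importance and drop the top offender; in the real
# pipeline one would retrain until the adversarial AUC falls toward 0.5.
ranking = pd.Series(clf.feature_importances_,
                    index=data.columns.drop("label")).sort_values(ascending=False)
to_drop = ranking.index[:1].tolist()
print(to_drop)  # ['f0']
```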

3. Model Prediction

3.1 Initial model training

We first use LGBM to get a feel for prediction quality on the training data:

from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier, log_evaluation, early_stopping
from sklearn.metrics import roc_auc_score
ignore = ['CUST_UID','LABEL']
features = [feat for feat in train.columns if feat not in ignore]
target_feature = 'LABEL'
test_size_pct = 0.10
X_train, X_valid, y_train, y_valid = train_test_split(train[features], train[target_feature], test_size = test_size_pct, random_state = 42)
lgb = LGBMClassifier(verbosity = -1) 
lgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'], 
       callbacks = [log_evaluation(period = 50), early_stopping(stopping_rounds = 128)])
pred_lgb = lgb.predict_proba(X_valid)[:,1]
roc_auc_score(y_valid, pred_lgb)

Training until validation scores don't improve for 128 rounds
[50]   valid_0's auc: 0.949571  valid_0's binary_logloss: 0.240469
[100]  valid_0's auc: 0.949393  valid_0's binary_logloss: 0.240118
Did not meet early stopping. Best iteration is:
[63]   valid_0's auc: 0.949871  valid_0's binary_logloss: 0.239181

0.9498706534679444

3.2 K-fold cross-validation

K-fold cross-validation is a statistical technique for estimating model performance: it measures how well results generalize to an independent dataset, largely avoids overfitting to a single split, and strengthens the generalization estimate. For details see: https://blog.csdn.net/Rocky6688/article/details/107296546

from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier
lgb = LGBMClassifier(learning_rate = 0.05, max_depth = 20, num_leaves = 100, random_state = 1000, verbosity = -1)
strtfdKFold = StratifiedKFold(n_splits = 5, random_state = 100, shuffle = True)
# pass features and labels to the StratifiedKFold instance
X_train = train[features]
y_train = train[target_feature]
kfold = strtfdKFold.split(X_train, y_train)
scores = []
for k, (train1, test1) in enumerate(kfold):
    lgb.fit(X_train.iloc[train1,:], y_train.iloc[train1])
    pred_lgb = lgb.predict_proba(X_train.iloc[test1, :])[:,1]
    score = roc_auc_score(y_train.iloc[test1], pred_lgb)
    scores.append(score)
    print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %s' % (k+1, np.bincount(y_train.iloc[train1]), score))
print('Cross-Validation AUC: %s +/- %s' %(np.mean(scores), np.std(scores)))

Fold: 1, Training/Test Split Distribution: [24000 8000], AUC: 0.9490365000000001
Fold: 2, Training/Test Split Distribution: [24000 8000], AUC: 0.9481471250000001
Fold: 3, Training/Test Split Distribution: [24000 8000], AUC: 0.9523520416666665
Fold: 4, Training/Test Split Distribution: [24000 8000], AUC: 0.9509735416666666
Fold: 5, Training/Test Split Distribution: [24000 8000], AUC: 0.9490646666666667
Cross-Validation AUC: 0.9499147749999999 +/- 0.0015283908750980848

The hyperparameters above were set more or less at random, yet the cross-validation (CV) AUC already reaches 0.9499, so the model starts from a strong baseline.

3.3 Multi-model prediction

3.3.1 Single-model comparison

Here we compare four strong tree-based models: LightGBM, XGBoost, CatBoost, and HistGradientBoostingClassifier:

from sklearn import model_selection, ensemble
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from tqdm import tqdm
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc', ensemble.HistGradientBoostingClassifier(random_state = 42)),
    # lightgbm
    ('lgb', LGBMClassifier(verbosity = -1, random_state = 42)),
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier(verbosity = 0, random_state = 42)),
    ('cbc', CatBoostClassifier(verbose = 0, random_state = 42))
]

MLA_columns = ['MLA Name','MLA Train AUC Mean', 'MLA Test AUC Mean', 'MLA Test AUC 3*STD' ,'MLA Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
row_index = 0
cv_split = model_selection.ShuffleSplit(n_splits = 5, test_size = 0.2, train_size = 0.8, random_state = 0)
for i in tqdm(vote_est):
    model = i[1]
    MLA_compare.loc[row_index, 'MLA Name'] = i[0]
    cv_results = model_selection.cross_validate(model, train[features], train[target_feature], cv = cv_split, scoring = 'roc_auc', return_train_score = True)
    MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()
    MLA_compare.loc[row_index, 'MLA Train AUC Mean'] = cv_results['train_score'].mean()
    MLA_compare.loc[row_index, 'MLA Test AUC Mean'] = cv_results['test_score'].mean()   
    #if this is a non-bias random sample, then +/-3 standard deviations (std) from the mean, should statistically capture 99.7% of the subsets
    MLA_compare.loc[row_index, 'MLA Test AUC 3*STD'] = cv_results['test_score'].std()*3   # let's know the worst that can happen!
    row_index += 1
    del model
MLA_compare.sort_values(by = ['MLA Test AUC Mean'], ascending = False, inplace = True)
MLA_compare
MLA Name   MLA Train AUC Mean   MLA Test AUC Mean   MLA Test AUC 3*STD   MLA Time
lgb        0.976829             0.950642             0.007759             0.365629
cbc        0.979338             0.950619             0.007465             14.070166
hgbc       0.970688             0.950087             0.007439             0.70215
xgb        0.994984             0.947138             0.007695             0.768368

LGBM performs best, while CatBoost has by far the longest training-plus-prediction time.

3.3.2 Voting

Voting is an ensembling strategy for classification problems: it follows majority rule, combining several models to reduce noise and variance and make the ensemble more robust. In general, a vote should outperform any single constituent model. There are two main forms:

  1. Hard voting: predict the class that receives the most votes across models.
  2. Soft voting: average each model's predicted probability per class and predict the class with the highest mean probability.

For details see: https://blog.csdn.net/deephub/article/details/122976720

sklearn already implements voting in its ensemble module, so we use it directly. Since the metric is AUC, which needs class probabilities, only soft voting is tried here.

grid_soft = ensemble.VotingClassifier(estimators = vote_est, voting = 'soft')
grid_soft_cv = model_selection.cross_validate(grid_soft, train[features], train[target_feature],
                                              scoring='roc_auc', cv = cv_split, return_train_score = True)
print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv['train_score'].mean()*100)) 
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv['test_score'].std()*100*3))

Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 98.52
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 95.20
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 0.71

Soft voting scores 0.9520 on the held-out folds, above the best single model (LGBM at 0.9506). Would removing the weakest single model, XGBoost (0.947138), improve the vote? Let's see:

vote_est2 = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc', ensemble.HistGradientBoostingClassifier(random_state = 42)),
    # lightgbm
    ('lgb', LGBMClassifier(verbosity = -1, random_state = 42)),
    ('cbc', CatBoostClassifier(verbose = 0, random_state = 42))
]
grid_soft2 = ensemble.VotingClassifier(estimators = vote_est2 , voting = 'soft')
grid_soft_cv2 = model_selection.cross_validate(grid_soft2, train[features], train[target_feature],
                                               scoring='roc_auc', cv = cv_split, return_train_score = True)
print("Soft Voting w/Tuned Hyperparameters Training w/bin score mean: {:.2f}". format(grid_soft_cv2['train_score'].mean()*100)) 
print("Soft Voting w/Tuned Hyperparameters Test w/bin score mean: {:.2f}". format(grid_soft_cv2['test_score'].mean()*100))
print("Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- {:.2f}". format(grid_soft_cv2['test_score'].std()*100*3))

Soft Voting w/Tuned Hyperparameters Training w/bin score mean: 97.77
Soft Voting w/Tuned Hyperparameters Test w/bin score mean: 95.18
Soft Voting w/Tuned Hyperparameters Test w/bin score 3*std: +/- 0.70

The score actually dropped after removing XGBoost. So as long as no single model is clearly bad, it pays to include more models in the vote to improve generalization.

3.3.3 Stacking

Stacking (stacked generalization) trains a model to combine other models: first train several different models, then train a second-level model that takes the first-level models' outputs as its inputs and produces the final prediction. For details see: https://blog.csdn.net/ueke1/article/details/137190677

from mlxtend.classifier import StackingCVClassifier
from sklearn import linear_model
train_new = train.copy()
test_new = test.copy()
dataset_net = [train_new, test_new]
for dataset in dataset_net:
    for i in dataset.columns:
        if dataset[i].dtype != 'object':
            dataset[i] = dataset[i].fillna(dataset[i].mean()) # logistic regression cannot handle NaN
hgbc = ensemble.HistGradientBoostingClassifier(random_state = 42)
lgb = LGBMClassifier(verbosity = -1, random_state = 42)
xgb = XGBClassifier(verbosity = 0, random_state = 42)
cbc = CatBoostClassifier(verbose = 0, random_state = 42)
lr = linear_model.LogisticRegressionCV()
sclf = StackingCVClassifier(classifiers = [hgbc, lgb, cbc],  # first-level classifiers
                            meta_classifier = lr,   # second-level classifier: a logistic regression trained on the first level's predictions (not a second round of stacking)
                            cv = 5)
strtfdKFold = StratifiedKFold(n_splits = 5)
# pass features and labels to the StratifiedKFold instance
X_train = train_new[features]
y_train = train_new[target_feature]
kfold = strtfdKFold.split(X_train, y_train)
scores = []
for k, (train1, test1) in enumerate(kfold):
    sclf.fit(X_train.iloc[train1,:], y_train.iloc[train1])
    pred_lgb = sclf.predict_proba(X_train.iloc[test1, :])[:,1]
    score = roc_auc_score(y_train.iloc[test1], pred_lgb)
    scores.append(score)
    print('Fold: %2d, Training/Test Split Distribution: %s, Accuracy: %.3f' % (k+1, np.bincount(y_train.iloc[train1]), score))
print('\n\nCross-Validation accuracy: %.3f +/- %.3f' %(np.mean(scores), np.std(scores)))

Fold: 1, Training/Test Split Distribution: [24000 8000], Accuracy: 0.893
Fold: 2, Training/Test Split Distribution: [24000 8000], Accuracy: 0.882
Fold: 3, Training/Test Split Distribution: [24000 8000], Accuracy: 0.883
Fold: 4, Training/Test Split Distribution: [24000 8000], Accuracy: 0.880
Fold: 5, Training/Test Split Distribution: [24000 8000], Accuracy: 0.882
Cross-Validation accuracy: 0.884 +/- 0.005

After a few stacking experiments, the results were far worse than the single models, so soft voting it is.

4. Model Optimization Tricks

4.1 Pseudo-labeling

Pseudo-labeling is a semi-supervised technique: use a model trained on labeled data to predict the unlabeled data, then treat those predictions as labels, so the model can also learn from the structure hidden in the unlabeled data. Pseudo-labels are usually taken as the class with the highest predicted probability; they can be used to fine-tune the model and improve its generalization. For details see: https://zhuanlan.zhihu.com/p/157325083

Here is the pseudo-labeling wrapper class:

from sklearn.utils import shuffle
from sklearn.base import BaseEstimator, RegressorMixin

class PseudoLabeler(BaseEstimator, RegressorMixin):
    
    def __init__(self, model, test, features, target, sample_rate=0.2, seed=42):
        self.sample_rate = sample_rate
        self.seed = seed
        self.model = model
        self.model.seed = seed
        
        self.test = test
        self.features = features
        self.target = target
        
    def get_params(self, deep=True):
        return {
            "sample_rate": self.sample_rate,
            "seed": self.seed,
            "model": self.model,
            "test": self.test,
            "features": self.features,
            "target": self.target
        }

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

        
    def fit(self, X, y):
        if self.sample_rate > 0.0:
            augmented_train = self.__create_augmented_train(X, y)
            self.model.fit(
                augmented_train[self.features],
                augmented_train[self.target]
            )
        else:
            self.model.fit(X, y)
        
        return self


    def __create_augmented_train(self, X, y):
        num_of_samples = int(len(self.test) * self.sample_rate)
        
        # Train the model and create the pseudo-labels
        self.model.fit(X, y)
        pseudo_labels = self.model.predict(self.test[self.features])
        
        # Add the pseudo-labels to the test set
        augmented_test = self.test.copy(deep=True)
        augmented_test[self.target] = pseudo_labels
        
        # Take a subset of the pseudo-labeled test set and append it onto
        # the training set (seeded for reproducibility)
        sampled_test = augmented_test.sample(n=num_of_samples, random_state=self.seed)
        temp_train = pd.concat([X, y], axis=1)
        augmented_train = pd.concat([sampled_test, temp_train])

        return shuffle(augmented_train, random_state=self.seed)
        
    def predict(self, X):
        return self.model.predict(X)
    
    def predict_proba(self, X):
        return self.model.predict_proba(X)
    
    def get_model_name(self):
        return self.model.__class__.__name__

Pseudo-labeling is invoked like this:

strtfdKFold = StratifiedKFold(n_splits = 5)
grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
X_train = train[features]
y_train = train[target_feature]
X_test = test[features]
kfold = strtfdKFold.split(X_train, y_train)
pred = pd.DataFrame()
for k, (train1, test1) in enumerate(kfold):
    pseudo = PseudoLabeler(grid_soft, test, features, target_feature, sample_rate = 1)
    pseudo.fit(X_train.iloc[train1,:], y_train.iloc[train1])
    pred_lgb = pseudo.predict_proba(X_test)[:,1]
    pred[str(k)] = pred_lgb
pred['result'] = pred[['0', '1', '2', '3', '4']].mean(axis = 1) # average the 5 folds' predictions

Pseudo-labeling noticeably strengthened the model's robustness and thus its performance on unseen data. Because of training time, I did not run an offline CV test for it, but the online score went up, which supports its effectiveness.
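For reference, an offline sanity check of the mechanic is cheap on synthetic data. A minimal sketch with LogisticRegression standing in for the voting ensemble (the data and every name here are made up; it only illustrates the train-on-pseudo-labels step, not the competition result):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the competition data; half of it plays the
# role of the unlabeled test set.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_lab, X_unlab, y_lab, _ = train_test_split(X, y, test_size=0.5, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_lab, y_lab, test_size=0.3, random_state=42)

base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_base = roc_auc_score(y_val, base.predict_proba(X_val)[:, 1])

# Pseudo-label the "unlabeled" pool with the base model, then retrain
# on the union of real and pseudo-labeled rows.
pseudo_y = base.predict(X_unlab)
aug = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_tr, X_unlab]), np.concatenate([y_tr, pseudo_y]))
auc_aug = roc_auc_score(y_val, aug.predict_proba(X_val)[:, 1])
print(round(auc_base, 3), round(auc_aug, 3))
```

Comparing `auc_base` and `auc_aug` on a held-out fold is exactly the offline check skipped above.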

4.2 Neural network

The tree models do well; what about a neural network (nn)? I wrote a simple multilayer perceptron:

(1) Data loading and preprocessing:

import os, gc, math, time, random, warnings
import numpy as np
import pandas as pd
import torch
warnings.filterwarnings('ignore')
from torch import nn
from torch.utils.data import DataLoader, Dataset
from torch.nn import functional as F
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
from sklearn.preprocessing import LabelEncoder
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# ====================================================
# CFG
# ====================================================
class CFG:
    seed = 42
    num_hidden1 = 768
    num_hidden2 = 512
    num_hidden3 = 768
    num_hidden4 = 768
    num_output = 2
    print_freq = 100
    scheduler = 'cosine'
    batch_size = 32
    num_workers = 3
    lr = 1e-5
    weight_decay = 0
    epochs = 5
    num_warmup_steps = 0
    num_cycles = 0.5
    n_accumulate = 1
    train = True
    n_fold = 5

def seed_everything(seed=CFG.seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

seed_everything(seed = 42)
train_net = pd.read_excel('fintech训练营/train.xlsx')
for i in train_net.columns:
    train_net[i] = train_net[i].apply(lambda x : np.nan if x=='?' else x)
label = LabelEncoder()
for i in train_net.columns:
    if train_net[i].dtype == 'object':
        if i != 'CUST_UID':
            train_net[i] = label.fit_transform(train_net[i])
    else:
        if i != "LABEL":
            train_net[i] = train_net[i].fillna(train_net[i].mode().values[0]) # fill missing values with the mode
            train_net[i] = (train_net[i] - train_net[i].min()) / (train_net[i].max() - train_net[i].min()) # min-max normalization

ignore = ['CUST_UID','LABEL']
CFG.feas = [feat for feat in train_net.columns if feat not in ignore]
CFG.target_fea = 'LABEL'

skf = StratifiedKFold(n_splits = 5, random_state = CFG.seed, shuffle = True)
train_net['fold'] = -1
for i, (_, val_) in enumerate(skf.split(train_net[CFG.feas], train_net[CFG.target_fea])):
    train_net.loc[val_, 'fold'] = int(i)

The building blocks of the network:

(1) criterion is the loss function; we use nn.CrossEntropyLoss

(2) get_score computes the validation metric, AUC

(3) FeedBackDataset is the Dataset class used to feed data in; __getitem__ defines the format of each item

(4) custom_collate_fn is the collate function: since FeedBackDataset yields one item at a time but a batch contains several, it assembles the individual items and converts them to tensors

(5) BPNetModel is the network itself; forward defines the forward pass, and loss.backward() backpropagates gradients. The loss_accumulate helper implements gradient accumulation for when GPU memory is tight but a large effective batch size is wanted: for example, batchsize=1 with accumulate=4 behaves like batchsize=4

(6) asMinutes and timeSince are timing helpers for estimating elapsed and remaining runtime

(7) train_one_epoch trains one epoch; it handles the training log, the parameter update (optimizer.step), the learning-rate schedule (scheduler.step), and zeroing gradients (optimizer.zero_grad)

(8) valid_one_epoch runs prediction on the validation set; the best model is not necessarily the one at the end of the last epoch, so scoring the validation set during training lets us pick the best checkpoint

(9) train_loop sets everything up: the DataLoader, the model, the optimizer (here AdamW), and the scheduler

def criterion(outputs, labels):
    return nn.CrossEntropyLoss(reduction = 'sum')(outputs, labels)

def get_score(outputs, labels):
    outputs = F.softmax(torch.tensor(outputs), dim = 1).numpy()[:,1]
    return roc_auc_score(labels, outputs)

class FeedBackDataset(Dataset):
    def __init__(self, data):
        self.data = data[CFG.feas].values
        self.targets = data[CFG.target_fea].values

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, index):
        return {
            'feature': self.data[index], 
            'target': self.targets[index]
        }

def custom_collate_fn(batch):
    datas, targets = [], []
    for batchid, data in enumerate(batch):
        datas.append(data['feature'])
        targets.append(data['target'])
    # stack into a single ndarray first: much faster than building a
    # tensor from a list of ndarrays
    datas = torch.tensor(np.array(datas))
    targets = torch.tensor(np.array(targets))
    return datas, targets

class BPNetModel(nn.Module):
    def __init__(self):
        super(BPNetModel, self).__init__()
        self.hidden1 = torch.nn.Linear(len(CFG.feas), CFG.num_hidden1) # hidden layers
        self.hidden2 = torch.nn.Linear(CFG.num_hidden1, CFG.num_hidden2)
        self.hidden3 = torch.nn.Linear(CFG.num_hidden2, CFG.num_hidden3)
        self.hidden4 = torch.nn.Linear(CFG.num_hidden3, CFG.num_hidden4)
        self.out = torch.nn.Linear(CFG.num_hidden4, CFG.num_output) # output layer
        self.relu = torch.nn.ReLU() # ReLU activation after each hidden layer
    def forward(self, x):
        x = x.to(torch.float32)
        x = self.relu(self.hidden1(x))
        x = self.relu(self.hidden2(x))
        x = self.relu(self.hidden3(x))
        x = self.relu(self.hidden4(x))
        x = self.out(x)
        return x

    def get_loss(self, inputs):
        inputs, targets = inputs[0].to(device), inputs[1]
        outs = self.forward(inputs)
        loss = criterion(outs, targets.to(device))
        return loss, outs

    def loss_accumulate(self, data_list):
        running_loss = 0
        result = []
        all_bs = sum(data_bts['batchsize'] for data_bts in data_list)
        for data_bts in data_list:
            data = data_bts['data']
            loss, outs = self.get_loss(data)
            loss = loss / all_bs
            loss.backward()
            running_loss += loss.item()
            result.append(outs.detach().to('cpu').numpy())
        return running_loss, all_bs, result

def asMinutes(s):
    m = math.floor(s/60)
    s -= m * 60
    return "%dm %ds" % (m, s)

def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return "%s (remain %s)" % (asMinutes(s), asMinutes(rs))

def get_scheduler(cfg, optimizer, num_train_steps):
    if cfg.scheduler == 'linear':
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps = cfg.num_warmup_steps, num_training_steps = num_train_steps
        )
    elif cfg.scheduler == 'cosine':
        scheduler = get_cosine_schedule_with_warmup(
            optimizer, num_warmup_steps = cfg.num_warmup_steps, num_training_steps = num_train_steps, num_cycles = cfg.num_cycles
        )
    return scheduler


def train_one_epoch(model, optimizer, scheduler, dataloader, epoch, valid_data):
    model.train()
    dataset_size = 0
    running_loss_awp = 0
    epoch_loss_awp = 0
    # valid_labels = valid_data.label.values
    validDataset = FeedBackDataset(valid_data)
    valid_loader = DataLoader(validDataset,
                            batch_size = CFG.batch_size,
                            shuffle = False,
                            collate_fn = custom_collate_fn,
                            num_workers = CFG.num_workers,
                            pin_memory = True)

    start = end = time.time()
    data_list = []
    target_list = []
    pred_list = []
    for step, data in enumerate(dataloader):
        if os.path.exists('break.txt'):
            raise ValueError('break error')

        batch_size = data[-1].shape[0]
        data_list.append({'data': data, 'batchsize': batch_size})
        target_list.append(data[1].numpy())
        if (step +1) % CFG.n_accumulate == 0:
            accum_loss, datalist_size, result = model.loss_accumulate(data_list)
            optimizer.step()
            optimizer.zero_grad()
            if scheduler is not None:
                scheduler.step()
            pred_list += result
            data_list = [] # refresh the accumulate data_list
            running_loss_awp += (accum_loss * datalist_size)
            dataset_size += datalist_size
            # average loss
            epoch_loss_awp = running_loss_awp / dataset_size

        end = time.time()

        if step % CFG.print_freq == 0 or step == (len(dataloader)-1):
            score_train = get_score(np.concatenate(pred_list), np.concatenate(target_list))
            print('Train: [{}] '
                  'Loss: {:.4f}  ' 
                  'Train AUC: {:.4f}  ' 
                  'Step: [{}/{}] '
                  'Elapsed {remain:s} '
                  .format(epoch, epoch_loss_awp, score_train, step+1, len(dataloader), 
                          remain = timeSince(start, float(step+1)/len(dataloader))))
    pred = valid_one_epoch(model, valid_loader, epoch)

    gc.collect()
    return epoch_loss_awp, pred

@torch.no_grad()
def valid_one_epoch(model, dataloader, epoch):
    model.eval() # good practice even though this MLP has no dropout/batchnorm; train_one_epoch flips it back
    dataset_size = 0
    running_loss = 0

    start = end = time.time()
    pred = []

    for step, data in enumerate(dataloader):
        if os.path.exists('break.txt'):
            raise ValueError('break error')
        loss, outputs = model.get_loss(data)
        pred.append(outputs.to('cpu').numpy())
        batch_size = data[-1].shape[0]
        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size
        epoch_loss = running_loss / dataset_size

    print('EVAL: [{}] ' 
            'Loss: {:.4f}  '
            'Step: [{}/{}] '
            'Elapsed {remain:s} '
            .format(epoch, epoch_loss, step + 1 , len(dataloader),
                    remain = timeSince(start, float(step+1)/len(dataloader))))
    pred = np.concatenate(pred)
    model.train()
    return pred


def train_loop(fold):
    train_data = train_net[train_net.fold != fold].reset_index(drop=True)
    valid_data = train_net[train_net.fold == fold].reset_index(drop=True)
    trainDataset = FeedBackDataset(train_data)
    train_loader = DataLoader(trainDataset,
                              batch_size = CFG.batch_size,
                              shuffle = True,
                              collate_fn = custom_collate_fn,
                              num_workers = CFG.num_workers,
                              pin_memory = True)

    model = BPNetModel().to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr = CFG.lr)
    # loop
    best_score = 0
    # General training
    num_train_steps = int(len(train_data) / CFG.batch_size * CFG.epochs) 
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)
    for epoch in range(CFG.epochs):
        print(f'-------------epoch:{epoch} training-------------')
        start_time = time.time()
        train_epoch_loss, pred = train_one_epoch(model, optimizer, scheduler, train_loader, epoch, valid_data)
        score = get_score(pred, valid_data[CFG.target_fea].values)
        elapsed = time.time() - start_time
        print(f'Fold {fold} Epoch {epoch} - avg_train_loss: {train_epoch_loss:.4f} time: {elapsed:.0f}s')
        if score > best_score:
            best_score = score
        print(f'AUC score:{score}  best_score:{best_score}')
    torch.cuda.empty_cache()
    gc.collect()
    
    return best_score

Model results:

best_scores = []
if CFG.train:
    for fold in range(CFG.n_fold):
        print(f'-------------fold:{fold} training-------------')
        best_scores.append(train_loop(fold))
    print(f"Cross Validation: {np.mean(best_scores)}")

Part of the training log:

-------------epoch:3 training-------------
Train: [3] Loss: 0.5390  Test AUC: 0.6302  Step: [1/1000] Elapsed 0m 0s (remain 1m 7s) 
Train: [3] Loss: 0.5257  Test AUC: 0.6642  Step: [101/1000] Elapsed 0m 0s (remain 0m 3s) 
Train: [3] Loss: 0.5302  Test AUC: 0.6614  Step: [201/1000] Elapsed 0m 0s (remain 0m 2s) 
Train: [3] Loss: 0.5326  Test AUC: 0.6601  Step: [301/1000] Elapsed 0m 0s (remain 0m 1s) 
Train: [3] Loss: 0.5342  Test AUC: 0.6591  Step: [401/1000] Elapsed 0m 0s (remain 0m 1s) 
Train: [3] Loss: 0.5349  Test AUC: 0.6580  Step: [501/1000] Elapsed 0m 1s (remain 0m 1s) 
Train: [3] Loss: 0.5320  Test AUC: 0.6597  Step: [601/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [3] Loss: 0.5323  Test AUC: 0.6583  Step: [701/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [3] Loss: 0.5330  Test AUC: 0.6592  Step: [801/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [3] Loss: 0.5336  Test AUC: 0.6593  Step: [901/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [3] Loss: 0.5328  Test AUC: 0.6590  Step: [1000/1000] Elapsed 0m 1s (remain 0m 0s) 
EVAL: [3] Loss: 17.1517  Step: [250/250] Elapsed 0m 0s (remain 0m 0s) 
Fold 4 Epoch 3 - avg_train_loss: 0.5328 time: 3s
AUC score:0.6507739583333333  best_score:0.6507739583333333
-------------epoch:4 training-------------
Train: [4] Loss: 0.5875  Test AUC: 0.6727  Step: [1/1000] Elapsed 0m 0s (remain 1m 12s) 
Train: [4] Loss: 0.5156  Test AUC: 0.6763  Step: [101/1000] Elapsed 0m 0s (remain 0m 2s) 
Train: [4] Loss: 0.5246  Test AUC: 0.6699  Step: [201/1000] Elapsed 0m 0s (remain 0m 2s) 
Train: [4] Loss: 0.5281  Test AUC: 0.6632  Step: [301/1000] Elapsed 0m 0s (remain 0m 1s) 
Train: [4] Loss: 0.5335  Test AUC: 0.6615  Step: [401/1000] Elapsed 0m 0s (remain 0m 1s) 
Train: [4] Loss: 0.5364  Test AUC: 0.6585  Step: [501/1000] Elapsed 0m 1s (remain 0m 1s) 
Train: [4] Loss: 0.5355  Test AUC: 0.6569  Step: [601/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [4] Loss: 0.5337  Test AUC: 0.6572  Step: [701/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [4] Loss: 0.5336  Test AUC: 0.6570  Step: [801/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [4] Loss: 0.5325  Test AUC: 0.6593  Step: [901/1000] Elapsed 0m 1s (remain 0m 0s) 
Train: [4] Loss: 0.5325  Test AUC: 0.6598  Step: [1000/1000] Elapsed 0m 2s (remain 0m 0s) 
EVAL: [4] Loss: 17.1514  Step: [250/250] Elapsed 0m 0s (remain 0m 0s) 
Fold 4 Epoch 4 - avg_train_loss: 0.5325 time: 2s
AUC score:0.6505471666666667  best_score:0.6507739583333333
Cross Validation: 0.6568245416666667

This baseline normalizes the data, fills missing values with the mode, and stacks four linear layers (768, 512, 768, 768) into a simple multilayer perceptron. Its CV is only 0.6568, far below the tree models (the data and architecture would need more work). Given time constraints, and since neural networks are typically weaker than tree models on tabular binary classification, I dropped the nn and kept optimizing the trees.

4.3 Bad Case Analysis

A bad case is, as the name suggests, a sample the model fails to predict. Bad case analysis means pulling these samples out and studying what they share (common feature values, different distributions, etc.) to guide model improvements. I won't go into more detail here; interested readers can see: https://zhuanlan.zhihu.com/p/104961266
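The core of the analysis can be sketched quickly: rank validation rows by how far the predicted probability falls from the true label and inspect the worst offenders. A minimal illustration on synthetic data (LogisticRegression and all names here are stand-ins, not the competition pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]

# A "bad case" here is a validation row whose predicted probability is
# far from its true label; sort by that error and inspect the worst rows
# (e.g. their feature values, shared categories, distribution shifts).
report = pd.DataFrame({"label": y_val, "proba": proba})
report["error"] = (report["label"] - report["proba"]).abs()
bad_cases = report.sort_values("error", ascending=False).head(20)
print(bad_cases.head())
```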

4.4 Hyperparameter tuning

4.4.1 Grid search

Grid search, also called exhaustive search, simply tries every candidate value of every hyperparameter (e.g. a learning rate of 1e-5, 1e-4, 1e-3, ...) and keeps the best model. The combinations multiply quickly, though: with 5 values for parameter A, 10 for B, and 6 for C there are already 300 configurations to try, so the search can take a very long time. For details see: https://blog.csdn.net/qq_39521554/article/details/86227582

cv_split = model_selection.ShuffleSplit(n_splits = 5, test_size = .2, train_size = .8, random_state = 0 ) 
vote_est = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc',ensemble.HistGradientBoostingClassifier()),
    # lightgbm
    ('lgb', LGBMClassifier()),
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier(verbosity=0)),
    ('cbc',CatBoostClassifier(verbose=0))

]

grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [42]


grid_param = [
            [{
            # hgbc
            'learning_rate': grid_learn, 
            'max_depth': [1, 3, 5, 7, 9], 
            'max_iter':[50, 100, 200, 500],
            'random_state': grid_seed 
             }],

            [{
            # lgb
            'learning_rate': grid_learn, 
            'max_depth': [1, 3, 5, 7, 9], 
            'n_estimators': grid_n_estimator,
            'colsample_bytree':[0.6, 0.7, 0.8, 0.9, 1],
            'reg_alpha': [0, 0.05, 1],
            'reg_lambda': [0, 0.1, 0.5, 1],
            'seed': grid_seed 
             }], 
    
            [{
            # xgb
            'learning_rate': grid_learn, 
            'max_depth': [1, 3, 5, 7, 9], 
            'n_estimators': grid_n_estimator,
            'gamma':[0, 0.2, 0.5],
            'subsample':[0.6, 0.7, 0.8, 0.9, 1],
            'seed': grid_seed ,
            'verbosity':[0]
             }],

             [{
            # cbc
            'learning_rate': grid_learn, 
            'n_estimators': grid_n_estimator,
            'depth': [4, 5, 6, 7, 8, 9], 
            'l2_leaf_reg': [1,3,5,7],
            'seed': grid_seed ,
            'verbose':[0]
             }]  
             
        ]

start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip (vote_est, grid_param): #https://docs.python.org/3/library/functions.html#zip
    start = time.perf_counter()        
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(train[features], train[target_feature])
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))

run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))

部分模型训练结果如下:

The best parameter for HistGradientBoostingClassifier is {'learning_rate': 0.03, 'max_depth': 7, 'max_iter': 500, 'random_state': 0} with a runtime of 6016.12 seconds.

The best parameter for LGBMClassifier is {'colsample_bytree': 0.7, 'learning_rate': 0.03, 'max_depth': 7, 'n_estimators': 300, 'reg_alpha': 1, 'reg_lambda': 0.1, 'seed': 0} with a runtime of 33976.22 seconds.

用网格搜索得到的参数重新训练模型,发现效果并没有提升。由此可见,依靠手动给定候选值的网格搜索很难进一步提高模型效果,而且耗时巨大。

4.4.2 Optuna

Optuna是一个基于贝叶斯优化的超参数优化框架,其目标是通过智能的搜索策略,用尽可能少的实验次数找到最佳超参数组合。详细可参考Optuna官方文档:https://zh-cn.optuna.org/index.html

import numpy as np 
import pandas as pd 
import os 
os.chdir(os.path.abspath(os.curdir))
from tqdm import tqdm
from lightgbm import LGBMClassifier

#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score


import warnings
warnings.filterwarnings('ignore')

from optuna.integration import LightGBMPruningCallback
import optuna

from lightgbm import early_stopping

import warnings
warnings.filterwarnings('ignore')

train = pd.read_excel('fintech训练营/train.xlsx')
test = pd.read_excel('fintech训练营/test_A榜.xlsx')
datasets = [train,test]
for dataset in datasets:
    for i in dataset.columns:
        dataset[i] = dataset[i].apply(lambda x : np.nan if x=='?' else x)

label = LabelEncoder()
for dataset in datasets:
    for i in dataset.columns:
        if dataset[i].dtype == 'object':
            if i != 'CUST_UID':
                dataset[i] = label.fit_transform(dataset[i])

ignore = ['CUST_UID','LABEL']
features = [feat for feat in train.columns if feat not in ignore]
target_feature = 'LABEL'


def objective(trial, X, y):
    # 参数网格
    param_grid = {
        "n_estimators": trial.suggest_categorical("n_estimators", [300,500,1000]),
        "learning_rate": trial.suggest_categorical("learning_rate", [0.01, 0.03, 0.05, 0.1, 0.25]),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "subsample": trial.suggest_float("subsample", 0.001, 0.999),
        "subsample_freq": trial.suggest_categorical("subsample_freq", [1]),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.001, 0.999),
        "random_state": 42,
        "verbosity": -1
    }
    # 5折交叉验证
    cv = StratifiedKFold(n_splits=5,random_state=100,shuffle=True)

    cv_scores = np.empty(5)
    for idx, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        # LGBM建模
        model = LGBMClassifier(**param_grid)
        model.fit(
            X_train,
            y_train,
            eval_set = [(X_test, y_test)],
            eval_metric = ['auc'],
            callbacks = [early_stopping(stopping_rounds = 100, verbose = 0)]
        )
        # 模型预测
        preds = model.predict_proba(X_test)[:,1]
        # 优化指标auc最大
        cv_scores[idx] = roc_auc_score(y_test, preds)
    
    return np.mean(cv_scores)

study = optuna.create_study(direction="maximize", study_name="LGBM Classifier")
func = lambda trial: objective(trial, train[features], train[target_feature])
study.optimize(func, n_trials=2000)
print('best params:',study.best_params)
print('best score:',study.best_value)

 部分训练日志如下:

[I 2024-08-01 10:45:25,902] A new study created in memory with name: LGBM Classifier

[I 2024-08-01 10:45:38,510] Trial 0 finished with value: 0.9422476416666667 and parameters: {'n_estimators': 500, 'learning_rate': 0.05, 'num_leaves': 2660, 'max_depth': 11, 'subsample': 0.0905498028535803, 'subsample_freq': 1, 'colsample_bytree': 0.2742667597572731}. Best is trial 0 with value: 0.9422476416666667.

[I 2024-08-01 10:45:43,009] Trial 1 finished with value: 0.947348975 and parameters: {'n_estimators': 300, 'learning_rate': 0.25, 'num_leaves': 2520, 'max_depth': 4, 'subsample': 0.58422461412904, 'subsample_freq': 1, 'colsample_bytree': 0.41895727668202326}. Best is trial 1 with value: 0.947348975.

[I 2024-08-01 10:45:53,076] Trial 2 finished with value: 0.94371525 and parameters: {'n_estimators': 500, 'learning_rate': 0.03, 'num_leaves': 2220, 'max_depth': 12, 'subsample': 0.027482276053391742, 'subsample_freq': 1, 'colsample_bytree': 0.5719792346676457}. Best is trial 1 with value: 0.947348975.

[I 2024-08-01 10:46:11,290] Trial 3 finished with value: 0.9489362166666666 and parameters: {'n_estimators': 1000, 'learning_rate': 0.05, 'num_leaves': 2420, 'max_depth': 3, 'subsample': 0.4658330549093411, 'subsample_freq': 1, 'colsample_bytree': 0.9729852421378948}. Best is trial 3 with value: 0.9489362166666666.

 Optuna相对于网格搜索速度大大提升,而且会自动搜索超参数,我们只需要提前输入超参数的值或范围(0~1等),它就会自动搜索最优参数,实在是调参的不二之选。

4.5 特征工程

4.5.1 分箱

对于非树模型,可以对连续值或者取值较多的离散变量进行分箱操作,有助于减少异常值的影响、处理缺失值(缺失值单独为一个箱)等,详细可参考:https://blog.csdn.net/CarryLvan/article/details/108775507
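等频分箱并把缺失值单独作为一箱,可以用一个最小示意说明(数据为虚构):

```python
# 等频分箱示意:qcut按分位数分箱,缺失值单独编为一箱
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.5, np.nan, 7.0, 3.3, 9.9, 0.4, 5.5])
binned = pd.qcut(s, 4, labels=False)      # 等频分成4箱,缺失值暂时保持为NaN
binned = binned.fillna(-1).astype(int)    # 缺失值单独作为一箱(编号-1)
print(binned.tolist())
```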

grid_soft = ensemble.VotingClassifier(estimators = vote_est , voting = 'soft')
for name in train.columns[2:]:
    train_cut=train.copy()
    if train[name].dtype!='object':
        train_cut[name] = pd.qcut(train_cut[name].rank(method='first'),5)
        train_cut[name] = label.fit_transform(train_cut[name])
        print(name)
        strtfdKFold = StratifiedKFold(n_splits=5,random_state=100,shuffle=True)
        #把特征和标签传递给StratifiedKFold实例
        X_train = train_cut[features]
        y_train = train_cut[target_feature]
        kfold = strtfdKFold.split(X_train, y_train)
        scores = []
        for k, (train1, test1) in enumerate(kfold):
            grid_soft.fit(X_train.iloc[train1,:], y_train.iloc[train1])
            pred_lgb = grid_soft.predict_proba(X_train.iloc[test1, :])[:,1]
            score=roc_auc_score(y_train.iloc[test1], pred_lgb)
            scores.append(score)
            print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %s' % (k+1, np.bincount(y_train.iloc[train1]), score))
        print('Cross-Validation AUC: %s +/- %s\n' %(np.mean(scores), np.std(scores)))
        

作者实现了等频分箱,主要利用pandas库的pd.qcut函数(配合rank避免分位点重复)。但由于使用的是树模型,分箱特征帮助不大(树模型的节点分裂本身就是一种分箱),最终没有使用该方法。

4.5.2 手工特征

其实特征工程算是机器学习最重要的一个模块,好的特征能够提取出更多的信息,决定了模型效果的上限,其它技巧只是在逼近这个上限罢了。那么本文为什么把特征工程放在最后呢?一个很重要的原因是该数据集的特点为匿名特征、脱敏数据,很难构造出有意义的手工强特。作者猜测了特征的含义(由于涉及脱敏就不展开了),构造了两个强特如下:

train_ft['transfer_amount_avg'] = train_ft['MON_12_EXT_SAM_TRSF_OUT_AMT'] / train_ft['MON_12_EXT_SAM_NM_TRSF_OUT_CNT']
train_ft['dps_cur_month_peak_ratio'] = (train_ft['LAST_12_MON_COR_DPS_TM_PNT_BAL_PEAK_VAL'] - train_ft['CUR_MON_COR_DPS_MON_DAY_AVG_BAL']) / train_ft['CUR_MON_COR_DPS_MON_DAY_AVG_BAL']

这两个特征在A榜作用不大(估计是训练集和测试集分布近似原因),但在B榜提升不少。

4.5.3 Featuretools

匿名手工特征很难构造,作者另辟思路,选取featuretools自动化特征工具库,它可以运用groupby、mean、max、min 等算子,快速构建丰富的数据特征,从而提高模型的效果。详细可参考:https://blog.csdn.net/ShowMeAI/article/details/123650547

import featuretools as ft
train_ft = pd.read_excel('fintech训练营/train.xlsx')
test_ft = pd.read_excel('fintech训练营/test_A榜.xlsx')
datasets = [train_ft,test_ft]
for dataset in datasets:
    for i in dataset.columns:
        dataset[i]=dataset[i].apply(lambda x : np.nan if x=='?' else x)

ignore = ['CUST_UID','LABEL']
categorical = ['MON_12_CUST_CNT_PTY_ID',
            'AI_STAR_SCO',
            'WTHR_OPN_ONL_ICO',
            'SHH_BCK',
            'LGP_HLD_CARD_LVL',
            'NB_CTC_HLD_IDV_AIO_CARD_SITU']
features = [feat for feat in train_ft.columns if feat not in (ignore + categorical)]
target_feature = 'LABEL'

es = ft.EntitySet(id='fintech')  # 用id标识实体集
es=es.add_dataframe(
    dataframe_name = "fintech_train",
    dataframe = train_ft[features],
    index = '1',
    make_index = True
)
es=es.add_dataframe(
    dataframe_name = "fintech_test",
    dataframe = test_ft[features],
    index = '2',
    make_index = True
)

feature_train, feature_defs_train = ft.dfs(entityset=es, 
                                    target_dataframe_name='fintech_train',
                                    agg_primitives=["mean", "sum", "mode"],
                                    trans_primitives=['add_numeric', 'subtract_numeric', 'multiply_numeric', 'divide_numeric'], # 2列相加减乘除来生成
                                    max_depth = 1)
feature_test, feature_defs_test = ft.dfs(entityset=es, 
                                    target_dataframe_name='fintech_test',
                                    agg_primitives=["mean", "sum", "mode"],
                                    trans_primitives=['add_numeric', 'subtract_numeric', 'multiply_numeric', 'divide_numeric'], # 2列相加减乘除来生成
                                    max_depth = 1)

feature_train_last = pd.concat([feature_train,train_ft[categorical+[target_feature]]],axis=1)
feature_test_last = pd.concat([feature_test,test_ft[categorical]],axis=1)

label = LabelEncoder()
datasets=[feature_train_last,feature_test_last]
for dataset in datasets:
    for i in categorical:            
        dataset[i] = label.fit_transform(dataset[i])

for dataset in datasets:
    for i in dataset.columns:
        dataset[i] = dataset[i].apply(lambda x : np.nan if x==np.inf else x) # 除法会产生inf, XGBoost预测前需转换成nan

作者运用mean、sum、mode、add_numeric、subtract_numeric、multiply_numeric、divide_numeric(两列相加减乘除)等算子,足足构造了上千个特征(可在trans_primitives参数里进一步添加cosine、sine、modulo_numeric、percentile、natural_logarithm等算子)。需要注意,除法会产生inf,导致XGBoost预测时报错,需要转换成nan。此外,上千个特征导致数据占用内存特别大,读取、保存、训练都会耗费很长时间,这里介绍一种减少内存的方法:

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # 按取值范围依次降级为int8/int16/int32/int64
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)  
            else:
                # 浮点列最低降到float32(不用float16,避免精度损失过大)
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

Memory usage of dataframe is 1393.43 MB  

Memory usage after optimization is: 703.28 MB  

Decreased by 49.5%  

Memory usage of dataframe is 417.94 MB  

Memory usage after optimization is: 209.01 MB  

Decreased by 50.0%  

该函数通过对数值型变量的转换,可以大大缩减特征的内存占用大小。由上部分运行记录可见,可以把一个dataframe的内存占用缩减一半,同时不会对数据的精度造成太大影响。
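该函数对浮点列的核心操作(float64降为float32)可以用一个自包含的小例子验证,数据为随机生成:

```python
# 数值降精度示意:float64 -> float32 可将数值列内存减半
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 5), columns=list('abcde'))
before = df.memory_usage().sum()
for col in df.columns:
    df[col] = df[col].astype(np.float32)   # 与reduce_mem_usage的float分支相同的降精度操作
after = df.memory_usage().sum()
print(f'{before / 1024**2:.2f} MB -> {after / 1024**2:.2f} MB')
```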

4.5.4 特征筛选

对featuretools构造的特征进行训练预测,结果如下(代码同上):

Fold: 1, Training/Test Split Distribution: [24000 8000], AUC: 0.9479184166666665

Fold: 2, Training/Test Split Distribution: [24000 8000], AUC: 0.9481912083333333

Fold: 3, Training/Test Split Distribution: [24000 8000], AUC: 0.9511504583333334

Fold: 4, Training/Test Split Distribution: [24000 8000], AUC: 0.9495554166666667

Fold: 5, Training/Test Split Distribution: [24000 8000], AUC: 0.9486693333333335

Cross-Validation AUC: 0.9490969666666667 +/- 0.0011678401394241225

大家可以看到,效果反而比不构造特征时更差了。这是因为这些特征是由原始特征之间直接交互(加减乘除)得到的,包含很多噪声,导致模型效果下降。这里介绍一种表现优异的特征筛选方法--逐步特征筛选法(以n个特征为例):

(1)用n个特征进行模型训练预测,用模型(比如LGBM)的特征重要性进行排名

(2)剔除重要性排名最后的n/2(每次剔除的数量可灵活调整)个特征

(3)用n/2个特征重新训练预测,重复步骤(1)、(2)

为什么用这个方法呢?这其实是贪心算法的一个变种:特征之间是耦合的,只保留一次训练中重要性排名靠前的特征未必最优,所以每次只删一部分。最理想的情况是每次只删除一个特征,但特征数量一多会导致训练时间大大增加,所以用可能损失的微小精度来换取时间。

import numpy as np 
import pandas as pd 
import os 
os.chdir(os.path.abspath(os.curdir))
from tqdm import tqdm
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

#Common Model Helpers
from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('Agg')
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import time

def get_importance(importance, names, model_type):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    return fi_df

def score(X_train,y_train):
    start = time.time()
    print(len(X_train.columns))
    params_hgbc = {'max_iter': 300, 
                'learning_rate': 0.03, 
                'max_leaf_nodes': 1980, 
                'max_depth': 6, 
                'random_state': 22934
    }

    params_lgb = {'n_estimators': 300, 
                'learning_rate': 0.03, 
                'num_leaves': 1240, 
                'max_depth': 8, 
                'subsample': 0.614339420520959, 
                'subsample_freq': 1, 
                'colsample_bytree': 0.9711563047222685,
                'random_state' : 2022
    }

    params_xgb = {'n_estimators': 300, 
                'learning_rate': 0.03, 
                'num_leaves': 1880, 
                'max_depth': 8, 
                'subsample': 0.7574143599011826, 
                'subsample_freq': 1, 
                'colsample_bytree': 0.682578966844618,
                'verbosity':0,
                'random_state': 2022
    }
    vote_est_new = [
    #Ensemble Methods: http://scikit-learn.org/stable/modules/ensemble.html
    ('hgbc',ensemble.HistGradientBoostingClassifier(**params_hgbc)),
    #lightbgm
    ('lgb', LGBMClassifier(**params_lgb)),
    #xgboost: http://xgboost.readthedocs.io/en/latest/model.html
    ('xgb', XGBClassifier(**params_xgb)),
    ]  

    grid_soft2 = ensemble.VotingClassifier(estimators = vote_est_new , voting = 'soft' ,weights = [0.2,0.4,0.4])
    strtfdKFold = StratifiedKFold(n_splits=5,random_state=100,shuffle=True)
    #把特征和标签传递给StratifiedKFold实例
    kfold = strtfdKFold.split(X_train, y_train)
    scores = []
    for k, (train1, test1) in enumerate(kfold):
        grid_soft2.fit(X_train.iloc[train1,:], y_train.iloc[train1])
        pred_lgb = grid_soft2.predict_proba(X_train.iloc[test1, :])[:,1]
        score=roc_auc_score(y_train.iloc[test1], pred_lgb)
        scores.append(score)
        print('Fold: %2d, Training/Test Split Distribution: %s, AUC: %s' % (k+1, np.bincount(y_train.iloc[train1]), score))
    print('\nCross-Validation AUC: %s +/- %s\n' %(np.mean(scores), np.std(scores)))
    print('time:',time.time()-start)
    return np.mean(scores)

train = pd.read_parquet('data/train_feature_last.parquet')
test = pd.read_parquet('data/test_feature_last.parquet')
test_original = pd.read_excel('fintech训练营/test_A榜.xlsx')
ignore = ['CUST_UID','LABEL']
original_feature = list(test_original.columns)
target_feature = 'LABEL'
important=pd.read_csv('important/step_10/important_790.csv')
result_1 = pd.DataFrame(columns=['num','score'])
row = 0

for num in range(780,100,-10):
    features = [feat for feat in important['feature_names'][:num] if feat not in ignore]
    score_new = score(train[features],train[target_feature])
    lgb=LGBMClassifier()
    lgb.fit(train[features], train[target_feature])
    important = get_importance(lgb.feature_importances_,features,'LGBM ')
    important.to_csv('important/no_add_original_800/important'+str(num)+'.csv',index=False)
    result_1.loc[row,'num'] = num 
    result_1.loc[row,'score'] = score_new
    result_1.to_csv('important/no_add_original_800/result.csv',index=False) 
    row += 1

通过上述代码,筛选出分数最高的特征组合,再用optuna调参,模型的表现如下:

| MLA Name | MLA Train AUC Mean | MLA Test AUC Mean | MLA Test AUC 3*STD | MLA Time |
|---|---|---|---|---|
| xgb | 0.997148 | 0.952981 | 0.006871 | 121.238519 |
| lgb | 0.996984 | 0.952962 | 0.007715 | 23.109799 |
| hgbc | 0.979313 | 0.952644 | 0.0074 | 26.32936 |
| cbc | 0.986432 | 0.951637 | 0.008834 | 47.699232 |

可见每个模型的CV都有较大提高,同时线上分数也同样提升,可见策略的有效性。

5. 赛题总结

5.1  A榜

本次竞赛A榜的训练集和测试集分布一致,导致大家评分非常接近(用LGBM跑baseline就有0.93;作者CV 0.9539、LB 0.9551,排在第53名,0.956估计就能排第1),剩下的就是卷过拟合罢了。

5.2  B榜

B榜的训练集和测试集分布不一致,数据漂移非常严重,有“毒特征”(CV很好,LeaderBoard很差)。

(1)作者通过对抗验证筛掉了漂移严重的特征(对抗验证AUC  0.99->0.71)

(2)构造手工强特,用这些基础特征通过featuretools构造特征(5000多个特征),并用逐步特征筛选法筛选(180个特征)

(3)训练集加入了A榜(用A榜模型打标签)、B测试集的伪标签

(4)选用LGBM、XGBoost、CatBoost、HistGradientBoostingClassifier四个集成树模型,并用optuna进行树模型超参数的调优,最后软投票融合(融合权重为LGBM 0.35、XGBoost 0.35、CatBoost 0.15、HistGradientBoostingClassifier 0.15)

运用这些技巧,作者最终B榜的分数为CV: 0.863,LB: 0.872,与A榜分数按比例计分后排名第4。

文章涉及的数据集和源代码可从Github下载:https://github.com/CNLCNL/2022-fintech,希望这些方法和技巧能对大家有所帮助,谢谢!
