Machine Learning Models in Practice

Using Ant Financial's payment-risk anomaly detection competition as an example, this post walks through the steps and reasoning of the modeling process; the parts still missing (under continuous optimization) will be added once their effect has been verified.

ATEC learning competition: payment-risk anomaly detection with data mining:
https://dc.cloud.alipay.com/index#/topic/intro?id=9

For handling categorical encodings (categorical-encoding):
https://github.com/scikit-learn-contrib/categorical-encoding
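The features used in this walkthrough are treated as numeric throughout, so no encoding step appears below; for datasets that do contain categorical columns, usage looks roughly like this (a minimal sketch assuming the category_encoders package, the pip name for that repo; the column names are made up):

import pandas as pd
import category_encoders as ce

# hypothetical categorical columns, not competition features
df = pd.DataFrame({'city': ['a','b','a','c'],
                   'device': ['ios','android','ios','ios'],
                   'label': [0, 1, 0, 1]})
enc = ce.TargetEncoder(cols=['city','device'])
# replaces each category with a smoothed mean of the label
X_enc = enc.fit_transform(df[['city','device']], df['label'])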

This competition was chosen because it stays open long-term, so new ideas can be tested continuously. Without further ado, let's get started.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import missingno

%matplotlib inline

Read the training and test sets:

train = pd.read_csv('atec_anti_fraud_train.csv')
test_b = pd.read_csv('atec_anti_fraud_test_b.csv')
print(train.shape,test_b.shape)

The training set has roughly 1.5 million rows.

rate = pd.DataFrame(train.label.value_counts())
rate['rate'] = rate['label'].apply(lambda x: x/train.shape[0])
rate
train.head()

Browsing the data: there are 300 columns in total, namely 297 feature variables plus id, label, and date. Internet data reflects large volumes of user behavior, so samples and features drift noticeably over time, which makes the date column well worth exploiting. Several shared solutions from other participants build their k-fold CV splits from date.
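A minimal sketch of that idea, assuming the folds are contiguous chunks in date order (sklearn's TimeSeriesSplit is a ready-made alternative):

import numpy as np

def date_folds(df, n_folds=5):
    # split the date-sorted index into contiguous, time-ordered folds
    idx = df.sort_values('date').index.to_numpy()
    return np.array_split(idx, n_folds)

# each fold can then serve once as the validation set, training only on
# the folds that precede it in time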

Given the limits of a personal computer, the dataset can be sampled before analysis.

sample_size = 100000
sample = train.sample(sample_size)
sample.shape

Analyze column by column and drop variables with a high missing rate:

fea_cols = sample.drop(['id','label','date'],axis=1).columns
def get_col_null(df,threshold=0.8):
    # columns whose missing count exceeds threshold * number of rows
    col_null = df.isnull().sum(axis=0)
    col_null = col_null[col_null>threshold*df.shape[0]]
    return list(col_null.index)

def get_row_null(df,threshold=0.8):
    # index labels of rows whose missing count exceeds threshold * number of columns
    row_null = df.isnull().sum(axis=1)
    row_null = row_null[row_null>threshold*df.shape[1]]
    return list(row_null.index)

features = sample[fea_cols]
col_null = get_col_null(features,threshold=0.8)
# drop columns whose missing rate exceeds the threshold
sample = sample.drop(col_null,axis=1)
sample.shape
def select_fea_corr(df,threshold=0.5):
    # among highly correlated feature pairs, mark for deletion the one that
    # is less correlated with label
    corr = df.corr()
    corr_del = []
    for col in corr.columns:
        if col=='label':
            continue
        tmp = corr[col]
        tmp = list(tmp[tmp>threshold].index)
        tmp.remove(col)
        if 'label' in tmp:
            tmp.remove('label')
        # iterate over a copy: removing from a list while iterating over it skips elements
        for i in list(tmp):
            # i is more correlated with label than col is, so keep i
            if corr.loc['label',i]>corr.loc['label',col]:
                tmp.remove(i)
        corr_del.extend(tmp)
    return list(set(corr_del))
    
df_corr = sample.drop(['id','date'],axis=1)
corr_del_cols = select_fea_corr(df_corr,threshold=0.8)
# drop the columns flagged by the correlation filter
sample = sample.drop(corr_del_cols,axis=1)
sample.shape

sample.columns
col_need = list(sample.columns)
These are the variables remaining after the correlation filter.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

train_need = train[col_need]
feature_cols = [c for c in train_need.columns if c not in ('id','label','date')]

def get_mean_median_mode(df):
    mean_ = df.mean()
    median_ = df.median()
    mode_ = df.mode().T[0]
    return mean_,median_,mode_

mean_,median_,mode_ = get_mean_median_mode(train_need[feature_cols])
train_need = train_need.fillna(median_)
Missing values are filled with the median for now; other fill strategies can be tried later, and which one works best is ultimately an empirical question (with a dash of alchemy).
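For instance, sklearn's SimpleImputer makes the strategy easy to swap (a sketch; 'median' can be replaced with 'mean' or 'most_frequent'):

import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(train_need[feature_cols]),
                      columns=feature_cols, index=train_need.index)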
train_need.label.value_counts()


train_need1 = train_need.copy()
train_need1['label'] = train_need1['label'].apply(lambda x: 1 if x==-1 else x)
train_need1['label'].value_counts()
Trusting Ant Financial's risk-control flags, all samples labeled -1 can be treated as 1.
X = train_need1.drop(['id','label','date'],axis=1)
y = train_need1['label']
trainedforest = RandomForestClassifier(n_estimators=500,random_state=2019,n_jobs=-1).fit(X,y)  # n_jobs=-1: use all cores
Feature selection with the RandomForest.

import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
joblib.dump(trainedforest,'randomforest.pkl')
randomforest = joblib.load('randomforest.pkl')
Training is time-consuming, so to avoid refitting over and over, save the model and the variable files first.
feat_importances = pd.Series(randomforest.feature_importances_, index=X.columns)
feat_importances = pd.DataFrame(feat_importances).reset_index()
feat_importances.columns = ['feature','importance']
feat_importances = feat_importances.sort_values(['importance'],ascending=False)
feat_importances.to_csv('feat_importances.csv')
feat_importances
If this code were reused on another dataset with many more variables, the RandomForest importances would be sorted and only the top K variables kept.
top_k = 200
features_select = feat_importances.head(top_k)['feature'].values

Next, eli5's PermutationImportance is used to check how stable each variable's importance is across the training and validation sets.
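Permutation importance shuffles one column at a time and measures how much the score drops; a feature whose importance holds up on the validation set is more trustworthy. A hand-rolled sketch of the same idea (scored with AUC here; eli5 defaults to the estimator's own score method):

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def permutation_importance_auc(model, X, y, n_repeats=3, seed=1):
    # mean drop in AUC when each column is independently shuffled
    rng = np.random.RandomState(seed)
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    imps = {}
    for col in X.columns:
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[col] = rng.permutation(Xp[col].values)
            drops.append(base - roc_auc_score(y, model.predict_proba(Xp)[:, 1]))
        imps[col] = np.mean(drops)
    return pd.Series(imps).sort_values(ascending=False)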

import eli5
from eli5.sklearn import PermutationImportance
Sample, sort by date, and split into training and validation sets:
sam = 200000
train_need2 = train_need.sample(sam)
train_need2 = train_need2.sort_values(['date'])
train_need2.head()

# time-ordered split: the most recent 20% of the sample becomes the validation set
valid_need2 = train_need2.iloc[int(0.8*sam):,:]
train_need2 = train_need2.iloc[:int(0.8*sam),:]

# X_train,X_valid,y_train,y_valid = train_test_split(X_perm,y_perm,test_size=0.2)
X_train = train_need2[features_select]
y_train = train_need2['label']
X_valid = valid_need2[features_select]
y_valid = valid_need2['label']
print(X_train.shape,X_valid.shape)

The RandomForestClassifier's parameters should be tuned appropriately, so that the model can pick out a genuinely good set of input variables.
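For example, a quick randomized search over a few RandomForest parameters (a sketch; the ranges are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [200, 500, 800],
    'max_depth': [None, 8, 12, 16],
    'min_samples_leaf': [1, 5, 20],
    'max_features': ['sqrt', 0.3, 0.5],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=2019, n_jobs=-1),
                            param_dist, n_iter=10, scoring='roc_auc', cv=3,
                            random_state=2019)
search.fit(X_train, y_train)
print(search.best_params_)

Here a fixed configuration is used: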
model = RandomForestClassifier(n_estimators=500,random_state=2019).fit(X_train,y_train)
perm_train = PermutationImportance(model,random_state=1).fit(X_train,y_train)
perm_valid = PermutationImportance(model,random_state=1).fit(X_valid,y_valid)
eli5.show_weights(perm_train,top=100,feature_names=X_train.columns.tolist())
eli5.show_weights(perm_valid,top=100,feature_names=X_valid.columns.tolist())


perm_feature_importance_train = pd.concat([pd.Series(X_train.columns),pd.Series(perm_train.feature_importances_)],axis=1).sort_values(by=1,ascending=False)
perm_feature_importance_train.columns = ['feature','imp']
perm_feature_importance_train = perm_feature_importance_train.reset_index(drop=True)
perm_feature_importance_train

perm_feature_importance_valid = pd.concat([pd.Series(X_valid.columns),pd.Series(perm_valid.feature_importances_)],axis=1).sort_values(by=1,ascending=False)
perm_feature_importance_valid.columns = ['feature','imp']
perm_feature_importance_valid = perm_feature_importance_valid.reset_index(drop=True)
perm_feature_importance_valid

perm_feature_importance_train.to_csv('perm_feature_importance_train.csv',index=False)
perm_feature_importance_valid.to_csv('perm_feature_importance_valid.csv',index=False)

perm_feature_importance_train = pd.read_csv('./perm_feature_importance_train.csv')
perm_feature_importance_valid = pd.read_csv('./perm_feature_importance_valid.csv')
perm_feature_importance_train = perm_feature_importance_train[perm_feature_importance_train['imp']>0]
perm_feature_importance_valid = perm_feature_importance_valid[perm_feature_importance_valid['imp']>0]
Keep the variables whose importance is greater than 0 in train and in valid.

# how_param = 'outer'
how_param = 'inner'
perm_feature_select = pd.merge(perm_feature_importance_train,perm_feature_importance_valid,how=how_param,on='feature')
Merge the two variable sets; different merge strategies ('inner' keeps the intersection, 'outer' the union) can be tried and compared by model performance.
perm_feature_select


col_need1 = perm_feature_select.feature.values.tolist()
col_need2 = perm_feature_select.feature.values.tolist()
col_need1.extend(['label','date'])
col_need2.extend(['date'])

total_res=pd.DataFrame()
total_res['id']=test_b.id
train_need = train[col_need1]
test_b_need = test_b[col_need2]
# KDE plots: per-label train distributions, overall train, test, and a train-vs-test overlay
import gc
def plot_kde(train, test, col):
    fig,ax = plt.subplots(1,4,figsize=(15,5))
    # panel 0: train distribution for each label
    sns.kdeplot(train[col][train['label']==0],color='g',ax=ax[0])
    sns.kdeplot(train[col][train['label']==1],color='r',ax=ax[0])
    sns.kdeplot(train[col][train['label']==-1],color='y',ax=ax[0])
    # panel 1: overall train; panel 2: test; panel 3: train vs test
    sns.kdeplot(train[col],color='y',ax=ax[1])
    sns.kdeplot(test[col],color='b',ax=ax[2])
    sns.kdeplot(train[col],color='y',ax=ax[3])
    sns.kdeplot(test[col],color='b',ax=ax[3])
    plt.show()
    plt.close(fig)  # free the figure; hundreds of plots add up in memory
    gc.collect()

cols = train_need.columns
for col in cols:
    if col not in ('id','label','date'):
        plot_kde(train_need,test_b_need,col)
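Eyeballing hundreds of KDE panels is tedious; a two-sample Kolmogorov-Smirnov statistic gives a quantitative complement (a sketch using scipy; the 0.1 cutoff is only an illustration):

import pandas as pd
from scipy.stats import ks_2samp

drift = {}
for col in cols:
    if col not in ('id','label','date'):
        # KS statistic between the train and test distributions of this column
        stat, _ = ks_2samp(train_need[col].dropna(), test_b_need[col].dropna())
        drift[col] = stat
drift = pd.Series(drift).sort_values(ascending=False)
drift[drift > 0.1]  # the columns that shift most between train and test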
# drop rows where more than 75% of the columns are missing, then sort by date
row_null = get_row_null(train_need,0.75)
len(row_null)
train_need = train_need.drop(row_null,axis=0)
train_need.shape
train_need = train_need.sort_values('date')
train_need.head()
train_need.label.value_counts()
train_unlabel = train_need[train_need['label']==-1].copy()
train_unlabel.shape
train_label = train_need[train_need['label']!=-1]
train_label.shape
# as before, trust the risk-control flag and treat the unlabeled samples as positive
train_unlabel['label'] = 1
train_unlabel.head()

train1 = pd.concat([train_label,train_unlabel])
train_sample = train1
train_sample
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV
from sklearn.metrics import roc_auc_score,classification_report,roc_curve,auc,accuracy_score
from bayes_opt import BayesianOptimization
x = train_sample.drop(['label','date'],axis=1)
y = train_sample['label']
col_need = x.columns
x = x[col_need]
test_b_need = test_b_need[col_need]
print(x.shape,test_b_need.shape)
# free memory
del train
del test_b
# fill missing values with the training medians computed earlier
x = x.fillna(median_)
test_b_need = test_b_need.fillna(median_)
# objective for Bayesian optimization: mean 10-fold CV AUC of an LGBMClassifier
# built from the candidate hyper-parameters (integer-valued ones are cast)
def model_cv(learning_rate, max_bin, num_leaves, n_estimators, max_depth,
             min_split_gain, colsample_bytree, subsample, reg_alpha,
             reg_lambda, min_child_weight, min_child_samples):
    val = cross_val_score(
        LGBMClassifier(
            learning_rate = learning_rate,
            max_bin = int(max_bin),
            num_leaves = int(num_leaves),
            n_estimators = int(n_estimators),
            max_depth=int(max_depth),
            min_split_gain=int(min_split_gain),
            colsample_bytree=colsample_bytree,
            subsample=subsample,
            reg_alpha=reg_alpha,
            reg_lambda=int(reg_lambda),
            min_child_weight=min_child_weight,
            min_child_samples=int(min_child_samples),

            random_state=2,
            is_unbalance=True
        ),
        x, y, scoring='roc_auc', cv=10
    ).mean()
    return val
model_bo = BayesianOptimization(
        model_cv,
        {
            'learning_rate':(0.05, 0.1),
            'max_bin':(10,255),
            'num_leaves':(10,35),
            'n_estimators': (100, 300),
            'max_depth': (2, 13),
            'min_split_gain':(1,5),
            'colsample_bytree':(0.9,1.0),
            'subsample':(0.5,1.0),
            'reg_alpha':(0.1,5.0),
             'reg_lambda':(100,800),
            'min_child_weight':(0.01,0.1),
            'min_child_samples':(10,100)
        }
        )
model_bo.maximize()
model_bo.max

params = model_bo.max['params']
# cast the integer-valued parameters exactly as model_cv did, so the final
# model matches the configurations that were actually probed
params['max_bin'] = int(params['max_bin'])
params['num_leaves'] = int(params['num_leaves'])
params['n_estimators'] = int(params['n_estimators'])
params['max_depth'] = int(params['max_depth'])
params['min_split_gain'] = int(params['min_split_gain'])
params['reg_lambda'] = int(params['reg_lambda'])
params['min_child_samples'] = int(params['min_child_samples'])


model_base = LGBMClassifier(**params, random_state=2, is_unbalance=True)  # same fixed settings as in model_cv
np.mean(cross_val_score(model_base, x, y, cv=5, scoring='roc_auc'))
model_base.fit(x,y)
print('auc_score:',roc_auc_score(y,model_base.predict_proba(x)[:,1]))
# reset to a 0..n-1 index so .loc works with the positional fold indices below
x = x.reset_index(drop=True)
y = y.reset_index(drop=True)

Evaluation:

def feval_spec(y, preds):
    # the competition's weighted-TPR metric, evaluated at three low-FPR points
    fpr, tpr, threshold = roc_curve(y, preds)
    tpr_at_0001 = tpr[fpr <= 0.001].max()
    tpr_at_0005 = tpr[fpr <= 0.005].max()
    tpr_at_001 = tpr[fpr <= 0.01].max()
    return 0.4 * tpr_at_0001 + 0.3 * tpr_at_0005 + 0.3 * tpr_at_001
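This follows the competition's weighted-TPR metric: 0.4 * TPR at FPR <= 0.001, plus 0.3 * TPR at each of FPR <= 0.005 and FPR <= 0.01. A quick sanity check on toy data (synthetic scores, only to confirm the behavior):

import numpy as np

y_toy = np.array([0]*9000 + [1]*1000)
perfect = y_toy.astype(float)                 # scores that rank all positives first
noise = np.random.RandomState(0).rand(10000)  # uninformative scores
print(feval_spec(y_toy, perfect))             # 1.0 for a perfect ranker
print(feval_spec(y_toy, noise))               # close to 0 for random scores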
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=2019)  # shuffle=True is required when a random_state is given
skf.get_n_splits(x,y)
print(skf)
train_predict_proba = pd.DataFrame()
test_predict_proba = pd.DataFrame()
i = 0
for train_index,test_index in skf.split(x,y):
    i += 1
    X_train,X_test = x.loc[train_index],x.loc[test_index]
    y_train,y_test = y[train_index],y[test_index]
    model_base.fit(X_train,y_train)
    train_metric = feval_spec(y_train,model_base.predict_proba(X_train)[:,1])
    valid_metric = feval_spec(y_test,model_base.predict_proba(X_test)[:,1])
    print("Fold",i,": train spec_val:",train_metric,", validation spec_val:",valid_metric)
    # collect each fold's predictions on the full training set and on test B;
    # note the full-set score below is optimistic, since each fold also predicts its own training rows
    train_predict_proba = pd.concat([train_predict_proba,pd.DataFrame(model_base.predict_proba(x)[:,1])],axis=1)
    test_predict_proba = pd.concat([test_predict_proba,pd.DataFrame(model_base.predict_proba(test_b_need)[:,1])],axis=1)
train_predict_proba_mean = train_predict_proba.mean(axis=1)
print(feval_spec(y,train_predict_proba_mean))

total_res["score"] = test_predict_proba.mean(axis=1)
# total_res.to_csv("result.csv",index=False)
# rank-average the 10 fold predictions: average the ranks rather than the raw
# probabilities, then min-max scale the summed ranks back to [0, 1]
score_cols = ['score'+str(i) for i in range(10)]
train_predict_proba.columns = score_cols
train_predict_proba
train_predict_proba['score'] = 0
for col in score_cols:
    train_predict_proba['score'] = train_predict_proba['score'] + train_predict_proba[col].rank()
max_v = train_predict_proba['score'].max()
min_v = train_predict_proba['score'].min()
train_predict_proba['score'] = train_predict_proba['score'].apply(lambda v: (v - min_v) / (max_v - min_v))
print(feval_spec(y,train_predict_proba['score']))
# final check on the full training set, plus the ROC curve
y_pred_proba = model_base.predict_proba(x)
fpr,tpr,thresholds = roc_curve(y, y_pred_proba[:,1])
tpr_at_0001 = tpr[fpr <= 0.001].max()
tpr_at_0005 = tpr[fpr <= 0.005].max()
tpr_at_001 = tpr[fpr <= 0.01].max()

tprcal = 0.4 * tpr_at_0001 + 0.3 * tpr_at_0005 + 0.3 * tpr_at_001
tprcal
plt.plot(fpr,tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')