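Experience report: the bank customer binary classification competition run by the AI media outlet "深度之眼" together with the "科赛" (Kesci) platform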

 

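Competition overview

Recently I took part in the bank customer binary classification competition launched jointly by "深度之眼" and "科赛" (Kesci). Following the video tutorials of Mr. Li, an instructor at "深度之眼", I managed to reproduce the baseline. My thanks go to the platform and to Mr. Li. Competition link: 「二分类算法」提供银行精准营销解决方案 ("Binary classification algorithms" for precision bank marketing)

Task description

Dataset: taken from the "Bank Marketing Data Set" in the UCI Machine Learning Repository.

The data relate to the marketing campaigns of a Portuguese banking institution. The campaigns were run by phone: typically a bank agent had to contact a client at least once to find out whether the client would subscribe to the bank's product (a term deposit). The task for this dataset is therefore a classification task, and the target is to predict whether a client will ('1') or will not ('0') buy the product, which makes it a typical binary classification problem.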

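Data and evaluation metric

The evaluation metric for this competition is AUC (Area Under the Curve), that is, the area under the ROC curve. Many blog posts already introduce this metric, so it is not the focus of this article.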
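For readers who want a concrete picture of the metric, here is a minimal pure-Python sketch. It relies on the standard interpretation of AUC as the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties count as half). The function name `auc_score` and the toy labels and scores are made up for illustration; the competition's own scorer is not shown in this post.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the competition
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))  # 0.75
```

Because only the ordering of the scores matters, any strictly increasing rescaling of the predictions leaves the AUC unchanged.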

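Brief description of the training set

The organizers provide train_set.csv and test_set.csv: train_set.csv is for training and test_set.csv is for prediction. The columns of train_set.csv can be seen in the dataSet.head() output in the baseline code below.

The test set test_set.csv lacks the final 'y' column that has to be predicted; its other columns match those of train_set.csv. The training set has 18 fields in total and the data quality is high, with no NaN values or dirty data. Of the 18 fields, 8 are numeric features (counting ID), 9 are categorical features, and the label is 'y'.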

 

Baseline code

Importing the modules

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")

Reading the data

# Read the data
dataSet = pd.read_csv("D:\\AI\\game\\2019Kesci二分类算法比赛\\dataSet\\train_set.csv")
testSet = pd.read_csv("D:\\AI\\game\\2019Kesci二分类算法比赛\\dataSet\\test_set.csv")
dataSet.head()
   ID  age         job   marital  education  default  balance  housing  loan   contact  day  month  duration  campaign  pdays  previous  poutcome  y
0   1   43  management   married   tertiary       no      291      yes    no   unknown    9    may       150         2     -1         0   unknown  0
1   2   42  technician  divorced    primary       no     5076      yes    no  cellular    7    apr        99         1    251         2     other  0
2   3   47      admin.   married  secondary       no      104      yes   yes  cellular   14    jul        77         2     -1         0   unknown  0
3   4   28  management    single  secondary       no     -994      yes   yes  cellular   18    jul       174         2     -1         0   unknown  0
4   5   42  technician  divorced  secondary       no     2974      yes    no   unknown   21    may       187         5     -1         0   unknown  0
testSet.head()

 

      ID  age         job  marital  education  default  balance  housing  loan    contact  day  month  duration  campaign  pdays  previous  poutcome
0  25318   51   housemaid  married    unknown       no      174       no    no  telephone   29    jul       308         3     -1         0   unknown
1  25319   32  management  married   tertiary       no     6059      yes    no   cellular   20    nov       110         2     -1         0   unknown
2  25320   60     retired  married    primary       no        0       no    no  telephone   30    jul       130         3     -1         0   unknown
3  25321   32     student   single   tertiary       no       64       no    no   cellular   30    jun       598         4    105         5   failure
4  25322   41   housemaid  married  secondary       no        0      yes   yes   cellular   15    jul       368         4     -1         0   unknown

A quick look at the data distribution

dataSet.describe()
                  ID            age        balance            day       duration       campaign          pdays       previous              y
count   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000   25317.000000
mean    12659.000000      40.935379    1357.555082      15.835289     257.732393       2.772050      40.248766       0.591737       0.116957
std      7308.532719      10.634289    2999.822811       8.319480     256.975151       3.136097     100.213541       2.568313       0.321375
min         1.000000      18.000000   -8019.000000       1.000000       0.000000       1.000000      -1.000000       0.000000       0.000000
25%      6330.000000      33.000000      73.000000       8.000000     103.000000       1.000000      -1.000000       0.000000       0.000000
50%     12659.000000      39.000000     448.000000      16.000000     181.000000       2.000000      -1.000000       0.000000       0.000000
75%     18988.000000      48.000000    1435.000000      21.000000     317.000000       3.000000      -1.000000       0.000000       0.000000
max     25317.000000      95.000000  102127.000000      31.000000    3881.000000      55.000000     854.000000     275.000000       1.000000

Checking which values each string-typed column takes

print(dataSet['job'].unique())

['management' 'technician' 'admin.' 'services' 'retired' 'student'
 'blue-collar' 'unknown' 'entrepreneur' 'housemaid' 'self-employed'
 'unemployed']

print(dataSet['marital'].unique())

['married' 'divorced' 'single']

print(dataSet['education'].unique())

['tertiary' 'primary' 'secondary' 'unknown']

print(dataSet['default'].unique())

['no' 'yes']

print(dataSet['housing'].unique())

['yes' 'no']

print(dataSet['loan'].unique())

['no' 'yes']

print(dataSet['contact'].unique())

['unknown' 'cellular' 'telephone']

print(dataSet['month'].unique())

['may' 'apr' 'jul' 'jun' 'nov' 'aug' 'jan' 'feb' 'dec' 'oct' 'sep' 'mar']

print(dataSet['poutcome'].unique())

['unknown' 'other' 'failure' 'success']

print(dataSet['y'].unique())

[0 1]

Converting the string-typed data

# No feature engineering yet; first encode the string-typed columns as integer category codes
for col in dataSet.columns[dataSet.dtypes == 'object']:
    le = preprocessing.LabelEncoder()
    le.fit(dataSet[col])
    dataSet[col] = le.transform(dataSet[col])
    testSet[col] = le.transform(testSet[col])

dataSet.head()

 

   ID  age  job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome  y
0   1   43    4        1          2        0      291        1     0        2    9      8       150         2     -1         0         3  0
1   2   42    9        0          0        0     5076        1     0        0    7      0        99         1    251         2         1  0
2   3   47    0        1          1        0      104        1     1        0   14      5        77         2     -1         0         3  0
3   4   28    4        2          1        0     -994        1     1        0   18      5       174         2     -1         0         3  0
4   5   42    9        0          1        0     2974        1     0        2   21      8       187         5     -1         0         3  0

As the output shows, every string-typed feature value has been converted to the corresponding integer category code.
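To make the mapping concrete, the following is a rough pure-Python sketch of what a label encoder does: collect the distinct values, sort them, and use each value's position as its code. The helper names `fit_label_encoder` and `transform` are hypothetical and are not scikit-learn API; the sample values are the 'marital' entries of the first five training rows shown earlier.

```python
def fit_label_encoder(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    return {value: code for code, value in enumerate(classes)}


def transform(mapping, values):
    """Replace each value by its code; an unseen value raises KeyError."""
    return [mapping[v] for v in values]


# 'marital' values of the first five training rows
marital = ['married', 'divorced', 'married', 'single', 'divorced']
mapping = fit_label_encoder(marital)
print(mapping)                      # {'divorced': 0, 'married': 1, 'single': 2}
print(transform(mapping, marital))  # [1, 0, 1, 2, 0]
```

The codes agree with the 'marital' column of the encoded table. Since the encoder is fitted on the training set only, a category that appears only in the test set would make the transform step fail, so it is worth checking that both files share the same category values.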

Data normalization

# Standardize the numeric columns; the scaler is fitted on the training set only
num_cols = ['age','balance','duration','campaign','pdays','previous']
scaler = preprocessing.StandardScaler()
scaler.fit(dataSet[num_cols])
dataSet[num_cols] = scaler.transform(dataSet[num_cols])
testSet[num_cols] = scaler.transform(testSet[num_cols])

dataSet.head()

 

   ID        age  job  marital  education  default    balance  housing  loan  contact  day  month   duration   campaign      pdays   previous  poutcome  y
0   1   0.194151    4        1          2        0  -0.355546        1     0        2    9      8  -0.419241  -0.246187  -0.411617  -0.230404         3  0
1   2   0.100114    9        0          0        0   1.239579        1     0        0    7      0  -0.617708  -0.565061   2.103063   0.548333         1  0
2   3   0.570301    0        1          1        0  -0.417885        1     1        0   14      5  -0.703321  -0.246187  -0.411617  -0.230404         3  0
3   4  -1.216408    4        2          1        0  -0.783913        1     1        0   18      5  -0.325845  -0.246187  -0.411617  -0.230404         3  0
4   5   0.100114    9        0          1        0   0.538857        1     0        2   21      8  -0.275255   0.710435  -0.411617  -0.230404         3  0

As the output shows, the selected features have been standardized.
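The arithmetic behind this step is the z-score, z = (x - mean) / std, with the mean and standard deviation taken from the training set. The sketch below redoes it by hand for the first 'duration' value (150), using the count, mean and std that dataSet.describe() printed above. One assumption worth stating: describe() reports the sample standard deviation (divisor n - 1), while StandardScaler uses the population one (divisor n), so the sketch converts between the two. The helper name `zscore` is made up for illustration.

```python
import math


def zscore(x, mean, std):
    """Standardize a single value."""
    return (x - mean) / std


# Statistics of 'duration' as printed by dataSet.describe()
n = 25317
mean = 257.732393
sample_std = 256.975151                    # divisor n - 1

# Convert to the population standard deviation (divisor n)
pop_std = sample_std * math.sqrt((n - 1) / n)

z = zscore(150, mean, pop_std)
print(round(z, 6))                         # close to the -0.419241 in the table
```

The same training-set mean and std are reused for the test set, which is why the scaler is fitted once and then applied to both frames.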

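Preprocessing before building the model

The baseline version does no in-depth feature engineering yet. After the simple preprocessing above, it models the data with a blend of lightgbm and xgboost, as follows: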

dataSet_new = list(set(dataSet.columns) - set(['ID','y']))

seed = 42
X_train, X_val, y_train, y_val = train_test_split(dataSet[dataSet_new], dataSet['y'], test_size = 0.2, random_state = seed)

train_data = lgb.Dataset(X_train, label = y_train)
val_data = lgb.Dataset(X_val, label = y_val, reference = train_data)

Modeling and parameter tuning

params = {
            'task': 'train',
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': {'auc'},
            'verbose': 0,
            'num_leaves': 30,
            'learning_rate': 0.01,
            'is_unbalance': True
         }

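# Note: LightGBM 4.0 and later no longer accept early_stopping_rounds in lgb.train();
# on those versions pass callbacks=[lgb.early_stopping(10)] instead.
model = lgb.train(params,
                  train_data,
                  num_boost_round = 1000,
                  valid_sets = val_data,
                  early_stopping_rounds = 10,
                  categorical_feature = ['job','marital','education','default','housing','loan','contact','poutcome']
                 )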

According to the training log, early stopping triggered after 689 rounds, and the AUC on the validation set was 0.934334.
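The idea of early stopping can be sketched in a few lines of plain Python: track the best validation score seen so far and stop once it has not improved for a given number of rounds. This is only an illustration of the general rule, not LightGBM's internal code; the function name `best_round_with_early_stopping` and the score history are invented for the example.

```python
def best_round_with_early_stopping(val_scores, patience):
    """Return (best_round, best_score); rounds are 1-based, higher is better."""
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            break  # no improvement for `patience` rounds in a row
    return best_round, best_score


# Invented validation AUC per boosting round
history = [0.80, 0.85, 0.88, 0.87, 0.879, 0.86, 0.90]
print(best_round_with_early_stopping(history, patience=3))  # (3, 0.88)
```

With patience=3 the loop stops at round 6 and never sees the better score at round 7, which shows the trade-off: a small patience saves time but can stop too soon, especially with a learning rate as low as 0.01.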

Prediction with the lightgbm model

pred1 = model.predict(testSet[dataSet_new])

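Adding the xgboost model and tuning its parameters

# Note: newer XGBoost releases renamed the 'reg:linear' objective to 'reg:squarederror'.
xg_reg = xgb.XGBRegressor(objective = 'reg:linear', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 8,
                alpha = 8, n_estimators = 500, reg_lambda = 1)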
xg_reg.fit(X_train,y_train)

Prediction with the xgboost model

pred2 = xg_reg.predict(testSet[dataSet_new])

Generating the submission file

result = pd.DataFrame()
result['ID'] = testSet['ID']
result['pred'] = (pred1 + pred2) / 2
result.to_csv('D:\\AI\\game\\2019Kesci二分类算法比赛\\提交结果\\蜗壳星空_ver1.csv',index=False)
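One caveat about averaging the two predictions directly: the lightgbm model outputs probabilities, while the xgboost regressor outputs raw regression values that can fall outside [0, 1], so the two are not on the same scale. Since AUC depends only on the ordering of the scores, one possible alternative is to average ranks instead of raw values. The sketch below is not what the baseline does; `to_ranks`, `rank_average` and the sample predictions are hypothetical.

```python
def to_ranks(scores):
    """Scale scores to [0, 1] by rank; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of equal scores starting at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    denom = max(len(scores) - 1, 1)
    return [r / denom for r in ranks]


def rank_average(pred_a, pred_b):
    """Blend two prediction lists by averaging their normalized ranks."""
    ranks_a, ranks_b = to_ranks(pred_a), to_ranks(pred_b)
    return [(a + b) / 2 for a, b in zip(ranks_a, ranks_b)]


# Hypothetical outputs: probabilities vs. regression values that leave [0, 1]
pred1 = [0.10, 0.80, 0.40, 0.95]
pred2 = [-0.05, 0.70, 0.30, 1.10]
blend = rank_average(pred1, pred2)
print(blend)
```

Whether this beats the plain average on the leaderboard would have to be checked on the validation set; it is offered only as an option to try.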

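Online score and ranking

The submission ranks 167th, still a long way from the top-1 score of 1.00. This post only provides a baseline; I wish everyone good luck in the later stages of the competition and a satisfying result!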
