2019 Xiamen International Bank "数创金融杯" Data Modeling Competition: A Summary

Competition Overview

About the competition: the contest was jointly organized by Xiamen International Bank and the Data Mining Research Center of Xiamen University, and hosted by their joint "数创金融" laboratory.

Data download: https://download.csdn.net/download/weixin_35770067/13718841

Data Overview

The data comes as three files: train_x.csv, train_target.csv and test_x.csv. train_x.csv holds the training-set features and train_target.csv the training-set target variable. To improve the model's ability to generalize, the training set is made up of samples from two periods, marked by the field isNew. test_x.csv holds the test-set features, which are identical to those of the training set. The modeling task is to train a model on the training set and produce predictions for the test set.

Field Descriptions

a) Basic user attributes
id, target, certId, gender, age, dist, edu, job, ethnic, highestEdu, certValidBegin, certValidStop
b) Loan-related information
loanProduct, lmt, basicLevel, bankCard, residentAddr, linkRela,setupHour, weekday
c) User credit-reporting information
x_0至x_78以及ncloseCreditCard, unpayIndvLoan, unpayOtherLoan, unpayNormalLoan, 5yearBadloan
This part of the data involves relatively sensitive third-party information, so no further description is given.

Evaluation Metric

Rankings are determined by the AUC achieved on the test set.
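
For reference, AUC can be computed offline with scikit-learn's roc_auc_score from true labels and predicted probabilities. A minimal, self-contained sketch (the toy labels and scores below are made up purely for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: true 0/1 labels; y_score: predicted probability of the positive class,
# e.g. the output of clf.predict_proba(X_valid)[:, 1]
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print("AUC Score: %f" % roc_auc_score(y_true, y_score))  # 0.75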

RF-basemodel(0.75+)

We started by running a RandomForest classifier over all of the data to see what the default parameters give.

# Use all features
# ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'x_79', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
rf = RandomForestClassifier()

AUC Score (Train): 0.545862
The AUC achieved is only 0.545862, essentially no better than random guessing.

Let's look at the results after some tuning:

rf = RandomForestClassifier(n_estimators=100,random_state=10)

AUC Score (Train): 0.632956

rf = RandomForestClassifier(n_estimators=90,random_state=10)

AUC Score (Train): 0.638696

rf = RandomForestClassifier(n_estimators=80,random_state=10)

AUC Score (Train): 0.633332

rf = RandomForestClassifier(n_estimators=90, max_depth=4,random_state=10)

AUC Score (Train): 0.687838

rf = RandomForestClassifier(n_estimators=90, max_depth=6,random_state=10)

AUC Score (Train): 0.685170

rf = RandomForestClassifier(n_estimators=90, max_depth=8, random_state=10)

AUC Score (Train): 0.653320

rf = RandomForestClassifier(n_estimators=90, max_depth=10, random_state=10)

AUC Score (Train): 0.636410

This quick round of tuning shows that the parameters have a large effect on the AUC: the lowest score is 0.545862 and the highest 0.687838, a gap of roughly 14 percentage points. This is by no means optimal; there is still plenty of room to tune further.
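
The sweep above was done by hand; the same search can be automated with scikit-learn's GridSearchCV. A minimal sketch (the grid values are illustrative only, and X_train / y_train are assumed to be the prepared training features and target):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [80, 90, 100],
    "max_depth": [4, 6, 8, 10],
}
grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=10),
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by AUC, matching the competition metric
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)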

That covers the initial round of tuning; next let's look at the features.

The runs above used every feature. Below we use the random forest's built-in feature_importances_ attribute to pick out the more useful ones; the code is as follows:

importances = rf.feature_importances_ 
indices = np.argsort(importances)[::-1]
feat_labels = X_train.columns
std = np.std([tree.feature_importances_ for tree in rf.estimators_],
                axis=0) #  inter-trees variability. 
print("Feature ranking:") 
# print the importance of each feature
for f in range(X_train.shape[1]):
    print("%d. feature no:%d feature name:%s (%f)" % (f + 1, indices[f], feat_labels[indices[f]], importances[indices[f]]))
Feature ranking:
1. feature no:7 feature name:lmt (0.119897)
2. feature no:90 feature name:certBalidStop (0.070063)
3. feature no:91 feature name:bankCard (0.065635)
4. feature no:89 feature name:certValidBegin (0.061998)
5. feature no:93 feature name:residentAddr (0.055272)
6. feature no:0 feature name:certId (0.054448)
7. feature no:4 feature name:dist (0.048813)
8. feature no:8 feature name:basicLevel (0.042018)
9. feature no:97 feature name:weekday (0.040811)
10. feature no:96 feature name:setupHour (0.040214)
11. feature no:54 feature name:x_45 (0.038700)
12. feature no:3 feature name:age (0.031389)
13. feature no:1 feature name:loanProduct (0.028978)
14. feature no:95 feature name:linkRela (0.027006)
15. feature no:100 feature name:unpayOtherLoan (0.026191)
16. feature no:6 feature name:job (0.018915)
17. feature no:29 feature name:x_20 (0.018539)
18. feature no:55 feature name:x_46 (0.016263)
19. feature no:82 feature name:x_73 (0.015427)
20. feature no:42 feature name:x_33 (0.014756)
21. feature no:44 feature name:x_35 (0.009275)
22. feature no:92 feature name:ethnic (0.008969)
23. feature no:34 feature name:x_25 (0.008467)
24. feature no:71 feature name:x_62 (0.008017)
25. feature no:37 feature name:x_28 (0.007177)
26. feature no:2 feature name:gender (0.007070)
27. feature no:76 feature name:x_67 (0.006776)
28. feature no:85 feature name:x_76 (0.006183)
29. feature no:101 feature name:unpayNormalLoan (0.005641)
30. feature no:72 feature name:x_63 (0.005626)
31. feature no:98 feature name:ncloseCreditCard (0.005433)
32. feature no:81 feature name:x_72 (0.005120)
33. feature no:77 feature name:x_68 (0.004969)
34. feature no:43 feature name:x_34 (0.004652)
35. feature no:70 feature name:x_61 (0.004451)
36. feature no:35 feature name:x_26 (0.003792)
37. feature no:63 feature name:x_54 (0.003617)
38. feature no:60 feature name:x_51 (0.003151)
39. feature no:56 feature name:x_47 (0.003083)
40. feature no:25 feature name:x_16 (0.002995)
41. feature no:23 feature name:x_14 (0.002979)
42. feature no:36 feature name:x_27 (0.002700)
43. feature no:32 feature name:x_23 (0.002591)
44. feature no:99 feature name:unpayIndvLoan (0.002557)
45. feature no:80 feature name:x_71 (0.002379)
46. feature no:83 feature name:x_74 (0.002353)
47. feature no:68 feature name:x_59 (0.002294)
48. feature no:84 feature name:x_75 (0.002284)
49. feature no:61 feature name:x_52 (0.001965)
50. feature no:26 feature name:x_17 (0.001933)
51. feature no:10 feature name:x_1 (0.001912)
52. feature no:9 feature name:x_0 (0.001882)
53. feature no:31 feature name:x_22 (0.001662)
54. feature no:52 feature name:x_43 (0.001651)
55. feature no:74 feature name:x_65 (0.001631)
56. feature no:62 feature name:x_53 (0.001578)
57. feature no:13 feature name:x_4 (0.001530)
58. feature no:57 feature name:x_48 (0.001484)
59. feature no:59 feature name:x_50 (0.001357)
60. feature no:11 feature name:x_2 (0.001116)
61. feature no:16 feature name:x_7 (0.000877)
62. feature no:48 feature name:x_39 (0.000832)
63. feature no:102 feature name:5yearBadloan (0.000797)
64. feature no:64 feature name:x_55 (0.000787)
65. feature no:30 feature name:x_21 (0.000786)
66. feature no:47 feature name:x_38 (0.000759)
67. feature no:19 feature name:x_10 (0.000694)
68. feature no:66 feature name:x_57 (0.000653)
69. feature no:50 feature name:x_41 (0.000548)
70. feature no:20 feature name:x_11 (0.000508)
71. feature no:65 feature name:x_56 (0.000500)
72. feature no:17 feature name:x_8 (0.000400)
73. feature no:15 feature name:x_6 (0.000390)
74. feature no:79 feature name:x_70 (0.000378)
75. feature no:94 feature name:highestEdu (0.000355)
76. feature no:75 feature name:x_66 (0.000229)
77. feature no:53 feature name:x_44 (0.000226)
78. feature no:21 feature name:x_12 (0.000183)
79. feature no:58 feature name:x_49 (0.000129)
80. feature no:38 feature name:x_29 (0.000120)
81. feature no:51 feature name:x_42 (0.000112)
82. feature no:73 feature name:x_64 (0.000096)
83. feature no:39 feature name:x_30 (0.000005)
84. feature no:24 feature name:x_15 (0.000000)
85. feature no:40 feature name:x_31 (0.000000)
86. feature no:88 feature name:x_79 (0.000000)
87. feature no:87 feature name:x_78 (0.000000)
88. feature no:86 feature name:x_77 (0.000000)
89. feature no:5 feature name:edu (0.000000)
90. feature no:41 feature name:x_32 (0.000000)
91. feature no:78 feature name:x_69 (0.000000)
92. feature no:45 feature name:x_36 (0.000000)
93. feature no:22 feature name:x_13 (0.000000)
94. feature no:67 feature name:x_58 (0.000000)
95. feature no:12 feature name:x_3 (0.000000)
96. feature no:46 feature name:x_37 (0.000000)
97. feature no:14 feature name:x_5 (0.000000)
98. feature no:33 feature name:x_24 (0.000000)
99. feature no:49 feature name:x_40 (0.000000)
100. feature no:28 feature name:x_19 (0.000000)
101. feature no:18 feature name:x_9 (0.000000)
102. feature no:27 feature name:x_18 (0.000000)
103. feature no:69 feature name:x_60 (0.000000)

Based on this importance ranking, we drop the features with zero importance and test again:

x_columns = ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_4', 'x_6', 'x_7', 'x_8', 'x_10', 'x_11', 'x_12', 'x_14', 'x_16', 'x_17', 'x_20', 'x_21', 'x_22', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_33', 'x_34', 'x_35', 'x_38', 'x_39', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_59','x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']
rf = RandomForestClassifier(n_estimators=90, max_depth=4,random_state=10)

AUC Score (Train): 0.681259
After removing the features the random forest considered unimportant, the score actually got worse.

Continuing in the same way, we print the feature importances again and drop the features that are now rated zero.

x_columns = ['certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel', 'x_1', 'x_2', 'x_4', 'x_6', 'x_8', 'x_12', 'x_14', 'x_16', 'x_17', 'x_20', 'x_21', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_33', 'x_34', 'x_35', 'x_39', 'x_41', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_57','x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']

rf = RandomForestClassifier(n_estimators=90, max_depth=4,random_state=10)

AUC Score (Train): 0.677848
After removing another round of features the RF rated as unimportant, the score dropped yet again.

x_columns = ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_6', 'x_8', 'x_12', 'x_14', 'x_16', 'x_20', 'x_22', 'x_23', 'x_25', 'x_26', 'x_27', 'x_28', 'x_33', 'x_34', 'x_35', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certBalidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan']

rf = RandomForestClassifier(n_estimators=90, max_depth=4,random_state=10)

AUC Score (Train): 0.690318
The first two rounds of pruning made things worse, but this time the AUC finally improved a little; the interplay here is fairly subtle.

For now we will take 0.690318 as the best result (it is not actually the best); there is far more performance to be gained. That wraps up the tuning and feature selection for this first model. Feature engineering, hand-crafted rules, cross-validation and similar techniques would certainly push the score higher, but at this stage we only did simple tuning and feature selection.
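
As an aside, the hard-coded column lists above could be generated programmatically from the importance scores. A minimal sketch (assuming rf has already been fitted on X_train as above; X_predict is just a placeholder name for the test-set features here):

import pandas as pd

# keep only the columns whose importance is strictly above zero
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
keep_cols = importances[importances > 0].index.tolist()

X_train_reduced = X_train[keep_cols]
X_predict_reduced = X_predict[keep_cols]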

XGBoost-basemodel(76+)

The previous base model was random-forest based and reached 75+ on the leaderboard. This time we tried XGBoost, which reaches 76+ online.

First we run the XGBoost classifier with default parameters to see how it does.

# Use all features
# ['id','certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
xgboost = xgb.XGBClassifier()

AUC Score (Train): 0.703644
The random forest with default parameters only reached 0.545862, so the gap between the two is quite large.

First, the effect of tuning the parameters:

xgboost = xgb.XGBClassifier(max_depth=6, n_estimators=100)

AUC Score (Train): 0.702864

xgboost = xgb.XGBClassifier(max_depth=6, n_estimators=200)

AUC Score (Train): 0.688059

This quick sweep again shows that the parameters matter a lot for the AUC, but nothing beat the default settings, so we gave up on further manual tuning. (Later we switch to grid search, which takes longer to train.)

Now let's look at how feature selection affects the result.
The tests above all used the full feature set; below we use XGBoost's feature_importances_ attribute to pick out the more useful features. The code is as follows:

importances = xgboost_model.feature_importances_ 
indices = np.argsort(importances)[::-1]
feat_labels = X_train.columns
print("Feature ranking:") 
for f in range(X_train.shape[1]):
    print("%d. feature no:%d feature name:%s (%f)" % (f + 1, indices[f], feat_labels[indices[f]], importances[indices[f]]))
print (">>>>>", importances)

After pruning features by importance in the same way as with the random forest, the AUC did not change at all, so we submitted the result directly; it reaches 76+ online.

XGBoost-KFold(77+)

In the previous step the XGBoost base model reached 76+ online. This time we tried cross-validation with XGBoost at different fold counts, which reaches 77+.

Below is the cross-validation code, together with the best result for 5-fold, 7-fold and 8-fold splits.

# Use all features
# ['id','certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
......
n_splits = 7
kf = KFold(n_splits=n_splits, shuffle=True, random_state=1234)
for train_index, test_index in kf.split(X_train):
    xgboost = xgb.XGBClassifier()

5-fold CV AUC Score (Train): 0.7245306571511836
7-fold CV AUC Score (Train): 0.7306788309565827
8-fold CV AUC Score (Train): 0.7511906354858096
All of these submissions score 77+ online. The offline AUC rises with the number of folds, which makes sense, since each fold then trains on a larger share of the data.
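
The loop body above is elided in the original. One common way to complete it is to score each fold on its held-out part and average every fold's test-set predictions for the submission; a minimal sketch (assuming X_train / y_train are the prepared training features and target Series, and X_predict the test-set features):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

n_splits = 7
kf = KFold(n_splits=n_splits, shuffle=True, random_state=1234)

fold_aucs = []
test_pred = np.zeros(len(X_predict))
for train_index, valid_index in kf.split(X_train):
    model = xgb.XGBClassifier()
    model.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    # out-of-fold AUC on the held-out part of the training data
    valid_pred = model.predict_proba(X_train.iloc[valid_index])[:, 1]
    fold_aucs.append(roc_auc_score(y_train.iloc[valid_index], valid_pred))
    # accumulate the test-set prediction, averaged over all folds
    test_pred += model.predict_proba(X_predict)[:, 1] / n_splits

print("mean fold AUC:", np.mean(fold_aucs))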

XGBoost-KFold-Feature Engineering

# Concatenate the training and test data
train_test_data = pd.concat([X_train,X_predict],axis=0,ignore_index = True)

# Data conversion: certValidBegin / certValidStop appear to be second-level timestamps, so
# multiply by 1e9 (pd.to_datetime expects nanoseconds); subtracting 70 years suggests the raw
# values count from 1900 rather than the 1970 Unix epoch
train_test_data['certBeginDt'] = pd.to_datetime(train_test_data["certValidBegin"] * 1000000000) - pd.offsets.DateOffset(years=70)
print ("time >>>", train_test_data['certBeginDt'])
train_test_data = train_test_data.drop(['certValidBegin'], axis=1)
train_test_data['certStopDt'] = pd.to_datetime(train_test_data["certValidStop"] * 1000000000) - pd.offsets.DateOffset(years=70)
train_test_data = train_test_data.drop(['certValidStop'], axis=1)

# Feature combination: certificate validity period = stop date - begin date
train_test_data["certStopDt"+"certBeginDt"] = train_test_data["certStopDt"] - train_test_data["certBeginDt"]
print ("train_test_data>>>>>>", train_test_data["certStopDt"+"certBeginDt"])

print ("进行分箱")
train_test_data["age_bin"] = pd.cut(train_test_data["age"],20,labels=False)
train_test_data = train_test_data.drop(['age'], axis=1)
train_test_data["dist_bin"] = pd.qcut(train_test_data["dist"],60,labels=False)
train_test_data = train_test_data.drop(['dist'], axis=1)
train_test_data["lmt_bin"] = pd.qcut(train_test_data["lmt"],50,labels=False)
train_test_data = train_test_data.drop(['lmt'], axis=1)
train_test_data["setupHour_bin"] = pd.qcut(train_test_data["setupHour"],10,labels=False)
train_test_data = train_test_data.drop(['setupHour'], axis=1)
train_test_data["certStopDtcertBeginDt_bin"] = pd.cut(train_test_data["certStopDtcertBeginDt"],30,labels=False)
train_test_data = train_test_data.drop(['certStopDtcertBeginDt'], axis=1)
# 'certValidBegin', 'certValidStop'
train_test_data["certBeginDt_bin"] = pd.cut(train_test_data["certBeginDt"],30,labels=False)
train_test_data = train_test_data.drop(['certBeginDt'], axis=1)
train_test_data["certStopDt_bin"] = pd.cut(train_test_data["certStopDt"],30,labels=False)
train_test_data = train_test_data.drop(['certStopDt'], axis=1)
X_train = train_test_data.iloc[:X_train.shape[0],:]
X_predict = train_test_data.iloc[X_train.shape[0]:,:]

# Use all features
print ("One-hot encoding")
train_data = X_train
test_data = X_predict
# Candidate columns for one-hot encoding: ['id', 'certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew', 'target']
# ["gender", "edu", "job", 'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78']
# edu
dummy_fea = ["gender","job", "loanProduct", "basicLevel","ethnic"] #'x_0', 'x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'x_6', 'x_7', 'x_8', 'x_9', 'x_10', 'x_11', 'x_12', 'x_13', 'x_14', 'x_15', 'x_16', 'x_17', 'x_18', 'x_19', 'x_20', 'x_21', 'x_22', 'x_23', 'x_24', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_30', 'x_31', 'x_32', 'x_33', 'x_34', 'x_35', 'x_36', 'x_37', 'x_38', 'x_39', 'x_40', 'x_41', 'x_42', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_49', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_57', 'x_58', 'x_59', 'x_60', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_70', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'x_77', 'x_78']
train_test_data = pd.concat([train_data,test_data],axis=0,ignore_index = True) 
dummy_df = pd.get_dummies(train_test_data.loc[:,dummy_fea])
dummy_fea_rename_dict = {}
for per_i in dummy_df.columns.values:
    dummy_fea_rename_dict[per_i] = per_i + '_onehot'  # suffix the one-hot columns to avoid name clashes
print (">>>>>", dummy_fea_rename_dict)
dummy_df = dummy_df.rename( columns=dummy_fea_rename_dict )
train_test_data = pd.concat([train_test_data,dummy_df],axis=1)
column_headers = list( train_test_data.columns.values )
print(column_headers)
train_test_data = train_test_data.drop(dummy_fea,axis=1)
column_headers = list( train_test_data.columns.values )
print(column_headers)
train_train = train_test_data.iloc[:train_data.shape[0],:]
test_test = train_test_data.iloc[train_data.shape[0]:,:]
X_train = train_train
X_predict = test_test

# Cross-validation, same as shown earlier
..........
# Grid search
n_splits = 5
cv_params = {'max_depth': [4, 6, 8, 10], 'min_child_weight': [3, 4, 5, 6], 'scale_pos_weight':[5,8,10]}
other_params = {'learning_rate': 0.1, 'n_estimators': 4, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
                    'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 1, 'reg_alpha': 1, 'reg_lambda': 1}
xgboost = xgb.XGBClassifier()  # note: other_params is defined above but not passed in; add **other_params to use it
optimized_GBM = GridSearchCV(estimator=xgboost, param_grid=cv_params, scoring='roc_auc', cv=n_splits, verbose=1, n_jobs=4)
xgboost_model = optimized_GBM.fit(X_train, y_train) 
y_pp = xgboost_model.predict_proba(X_predict)[:, 1]

The improvement turned out to be small. A word of advice to anyone competing: piling on feature engineering without first analyzing the data can easily make results worse rather than better; the analysis has to be targeted at the data at hand.
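
As one example of such targeted analysis (a sketch, not from the competition code; it assumes train_data with a 'target' column as loaded above), you can check the bad-sample rate per bin before deciding whether a binned feature is worth keeping:

import pandas as pd

# bad-sample rate and count per age bin: if the rate is roughly flat across bins,
# the binned feature is unlikely to help the model
age_bin = pd.cut(train_data["age"], 20, labels=False)
print(train_data.groupby(age_bin)["target"].agg(["mean", "count"]))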

stacking-KFold(78+)

Below is the stacking (model ensemble) code. After tuning and repeated refinement, the final online score reaches 78+.

# -*- coding: utf-8 -*-
from heamy.dataset import Dataset
from heamy.estimator import Regressor, Classifier
# ModelsPipeline:https://blog.csdn.net/qiqzhang/article/details/85477242 ; https://cloud.tencent.com/developer/article/1463294
from heamy.pipeline import ModelsPipeline
import pandas as pd
import xgboost as xgb
import datetime
from sklearn.metrics import roc_auc_score
# lightgbm installation: https://blog.csdn.net/weixin_41843918/article/details/85047492
# lgb example: https://www.jianshu.com/p/c208cac3496f
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from pandas.core.frame import DataFrame



def xgb_feature(X_train, y_train, X_test, y_test=None):
    other_params = {'learning_rate': 0.125, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)  
    predict = model.predict_proba(X_test)[:,1]
    #minmin = min(predict)
    #maxmax = max(predict)
    #vfunc = np.vectorize(lambda x:(x-minmin)/(maxmax-minmin))
    #return vfunc(predict)
    return predict

def xgb_feature2(X_train, y_train, X_test, y_test=None):
    # , 'num_boost_round':12
    other_params = {'learning_rate': 0.1, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)  
    predict = model.predict_proba(X_test)[:,1]
    #minmin = min(predict)
    #maxmax = max(predict)
    #vfunc = np.vectorize(lambda x:(x-minmin)/(maxmax-minmin))
    #return vfunc(predict)
    return predict

def xgb_feature3(X_train, y_train, X_test, y_test=None):
    # , 'num_boost_round':20
    other_params = {'learning_rate': 0.13, 'max_depth': 3}
    model = xgb.XGBClassifier(**other_params).fit(X_train, y_train)  
    predict = model.predict_proba(X_test)[:,1]
    #minmin = min(predict)
    #maxmax = max(predict)
    #vfunc = np.vectorize(lambda x:(x-minmin)/(maxmax-minmin))
    #return vfunc(predict)
    return predict

def rf_model(X_train, y_train, X_test, y_test=None):
    # n_estimators = 100
    model = RandomForestClassifier(n_estimators=90, max_depth=4,random_state=10).fit(X_train,y_train)
    predict = model.predict_proba(X_test)[:,1]
    #minmin = min(predict)
    #maxmax = max(predict)
    #vfunc = np.vectorize(lambda x:(x-minmin)/(maxmax-minmin))
    #return vfunc(predict)
    return predict


def et_model(X_train, y_train, X_test, y_test=None):
    model = ExtraTreesClassifier(max_features = 'log2', n_estimators = 1000 , n_jobs = -1).fit(X_train,y_train)
    return model.predict_proba(X_test)[:,1]

def gbdt_model(X_train, y_train, X_test, y_test=None):
    # n_estimators = 700
    model = GradientBoostingClassifier(learning_rate = 0.02, max_features = 0.7, n_estimators = 100 , max_depth = 5).fit(X_train,y_train)
    predict = model.predict_proba(X_test)[:,1]
    #minmin = min(predict)
    #maxmax = max(predict)
    #vfunc = np.vectorize(lambda x:(x-minmin)/(maxmax-minmin))
    #return vfunc(predict)
    return predict


def logistic_model(X_train, y_train, X_test, y_test=None):
    model = LogisticRegression(penalty = 'l2').fit(X_train,y_train)
    return model.predict_proba(X_test)[:,1]

def lgb_feature(X_train, y_train, X_test, y_test=None):
    model = lgb.LGBMClassifier(boosting_type='gbdt',  min_data_in_leaf=5, max_bin=200, num_leaves=25, learning_rate=0.01).fit(X_train, y_train) 
    predict = model.predict_proba(X_test)[:,1]
    #minmin = min(predict)
    #maxmax = max(predict)
    #vfunc = np.vectorize(lambda x:(x-minmin)/(maxmax-minmin))
    #return vfunc(predict)
    return predict


VALID = False
if __name__ == '__main__':
    if VALID == False:
        ##############################
        train_data = pd.read_csv('data/train_data_target.csv',engine = 'python')
         # 
        # x_columns = [x for x in train_data.columns if x not in ["target", "id"]]
        x_columns = ['certId', 'loanProduct', 'gender', 'age', 'dist', 'edu', 'job', 'lmt', 'basicLevel', 'x_12', 'x_14', 'x_16', 'x_20', 'x_25', 'x_26', 'x_27', 'x_28', 'x_29', 'x_33', 'x_34', 'x_41', 'x_43', 'x_44', 'x_45', 'x_46', 'x_47', 'x_48', 'x_50', 'x_51', 'x_52', 'x_53', 'x_54', 'x_55', 'x_56', 'x_61', 'x_62', 'x_63', 'x_64', 'x_65', 'x_66', 'x_67', 'x_68', 'x_69', 'x_71', 'x_72', 'x_73', 'x_74', 'x_75', 'x_76', 'certValidBegin', 'certValidStop', 'bankCard', 'ethnic', 'residentAddr', 'highestEdu', 'linkRela', 'setupHour', 'weekday', 'ncloseCreditCard', 'unpayIndvLoan', 'unpayOtherLoan', 'unpayNormalLoan', '5yearBadloan', 'isNew']
        train_data.fillna(0,inplace = True)
        test_data = pd.read_csv('data/test.csv',engine = 'python')
        test_data.fillna(0,inplace = True)
        train_test_data = pd.concat([train_data,test_data],axis=0,ignore_index = True)
        train_test_data = train_test_data.fillna(-888)  # fillna returns the filled frame (with inplace=True it would return None)
        # dummy_fea = ["gender", "edu", "job"]
        dummy_fea = []
        #dummy_df = pd.get_dummies(train_test_data.loc[:,dummy_fea])
        #dunmy_fea_rename_dict = {}
        #for per_i in dummy_df.columns.values:
        #    dunmy_fea_rename_dict[per_i] = per_i + '_onehot'
        #print (">>>>>",  dunmy_fea_rename_dict)
        #dummy_df.rename( columns=dunmy_fea_rename_dict )
        #train_test_data = pd.concat([train_test_data,dummy_df],axis=1)
        #train_test_data = train_test_data.drop(dummy_fea,axis=1)
        train_train = train_test_data.iloc[:train_data.shape[0],:]
        test_test = train_test_data.iloc[train_data.shape[0]:,:]
        train_train_x = train_train
        test_test_x = test_test
        xgb_dataset = Dataset(X_train=train_train_x,y_train=train_data['target'],X_test=test_test_x,y_test=None,use_cache=False)
        #heamy
        print ("---------------------------------------------------------------------------------------)")
        print ("开始构建pipeline:ModelsPipeline(model_xgb,model_xgb2,model_xgb3,model_lgb,model_gbdt)")
        model_xgb = Regressor(dataset=xgb_dataset, estimator=xgb_feature,name='xgb',use_cache=False)
        model_xgb2 = Regressor(dataset=xgb_dataset, estimator=xgb_feature2,name='xgb2',use_cache=False)
        model_xgb3 = Regressor(dataset=xgb_dataset, estimator=xgb_feature3,name='xgb3',use_cache=False)
        model_gbdt = Regressor(dataset=xgb_dataset, estimator=gbdt_model,name='gbdt',use_cache=False)
        model_lgb = Regressor(dataset=xgb_dataset, estimator=lgb_feature,name='lgb',use_cache=False)
        model_rf = Regressor(dataset=xgb_dataset, estimator=rf_model,name='rf',use_cache=False)

        # pipeline = ModelsPipeline(model_xgb,model_xgb2,model_xgb3,model_lgb,model_gbdt, model_rf)
        pipeline = ModelsPipeline(model_xgb, model_xgb2, model_xgb3, model_lgb, model_rf)
        print ("---------------------------------------------------------------------------------------)")
        print ("开始训练pipeline:pipeline.stack(k=7, seed=111, add_diff=False, full_test=True)")
        stack_ds = pipeline.stack(k=7, seed=111, add_diff=False, full_test=True)
        # k = 7    model_xgb, model_xgb2, model_xgb3, model_lgb, model_rf :   AUC: 0.780043 
        print ("stack_ds: ", stack_ds)
        print ("---------------------------------------------------------------------------------------)")
        print ("开始训练Regressor:Regressor(dataset=stack_ds, estimator=LinearRegression,parameters={'fit_intercept': False})")
        stacker = Regressor(dataset=stack_ds, estimator=LinearRegression,parameters={'fit_intercept': False})
        print ("---------------------------------------------------------------------------------------)")
        print ("开始预测:")
        predict_result = stacker.predict()

        id_list = test_data["id"].tolist()
        d ={ "id" : id_list, "target" : predict_result  }
        res = DataFrame(d)  # convert the dict into a DataFrame
        print (">>>>", res)
        csv_file = 'stacking_res/res_stacking.csv'
        res.to_csv( csv_file ) 

Afterwards we did a lot more feature engineering and model blending, but, probably due to limited knowledge of financial risk control (and limited skill on my part), the score stopped here.

Summary

From the attempts above, even without any feature work or parameter tuning, the score can be improved by:

  • switching to a stronger model
  • using cross-validation
  • adopting model ensembling

Of course, for later improvement there are many more options: data augmentation (enlarging the data, handling class imbalance), data cleaning (outliers, distributions, and so on), feature engineering (feature selection, statistical features, normalization, encoding, binning, and so on), model selection, loss functions, and model ensembling. Trying every one of these is genuinely hard; it relies on day-to-day accumulation, for example knowing what preprocessing each model needs, which parameters suit which data sizes, and which feature-selection methods to reach for (chi-square, variance, model-based, distribution-based, and so on); a small feature-selection sketch follows below.
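
For instance, scikit-learn's SelectKBest covers the chi-square style of filter mentioned above. A minimal sketch (assuming X_train / y_train as before; note that chi2 requires non-negative feature values, so encode or shift the features first if necessary):

from sklearn.feature_selection import SelectKBest, chi2

# keep the 50 features most associated with the target according to the chi-square test
selector = SelectKBest(score_func=chi2, k=50)
X_train_sel = selector.fit_transform(X_train, y_train)
print(X_train.columns[selector.get_support()].tolist())
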
In short, it all comes down to regular practice and accumulated experience.
