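For bank customer churn prediction, the workflow has three stages: feature preprocessing, feature selection, and classification model selection and training. The main work is as follows: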
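1: Feature preprocessing and selection
Gender is encoded as a dummy variable;
For the "has **** information" fields, Boolean values are converted to 0/1;
A histogram of age shows an approximately normal distribution; age is binned, and missing values are then filled by imputation;
Total current assets = current total of deposit-type assets = current total of local-currency deposits, and monthly daily-average balance = monthly daily-average balance of deposit-type assets = monthly daily-average balance of local-currency deposits, so two of the three fields in each group are dropped;
For the *NUM, *DUR, *AMT, and *BAL fields, feature selection (SelectKBest) is applied to each group separately to reduce dimensionality;
Finally, the data are merged and standardized (StandardScaler), leaving 44 features.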
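The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the original script: every column name (GENDER, HAS_INFO, TOTAL_ASSET_BAL, and so on), the age bin edges, the mode-based imputation, the f_classif scoring function, and k=2 are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "GENDER": rng.choice(["M", "F"], size=n),
    "HAS_INFO": rng.choice([True, False], size=n),
    "AGE": rng.normal(40, 10, size=n).round().clip(18, 90),
    "TOTAL_ASSET_BAL": rng.gamma(2.0, 1000.0, size=n),
    "TXN_NUM": rng.integers(0, 50, size=n),
    "LOAN_NUM": rng.integers(0, 5, size=n),
    "CARD_NUM": rng.integers(0, 3, size=n),
})
# Two columns that duplicate TOTAL_ASSET_BAL, mimicking the redundant asset fields.
df["SAVING_ASSET_BAL"] = df["TOTAL_ASSET_BAL"]
df["LOCAL_SAVING_BAL"] = df["TOTAL_ASSET_BAL"]
# Knock out a few ages so there is something to impute.
df.loc[rng.choice(n, size=10, replace=False), "AGE"] = np.nan
y = pd.Series(rng.integers(0, 2, size=n), name="CHURN")

# Step 1: gender -> dummy variable (drop_first avoids a redundant column).
df = pd.get_dummies(df, columns=["GENDER"], drop_first=True, dtype=int)

# Step 2: Boolean flag -> 0/1.
df["HAS_INFO"] = df["HAS_INFO"].astype(int)

# Step 3: bin age, then impute missing bins (assumed here: most frequent bin).
age_bins = [0, 30, 40, 50, 60, 120]
df["AGE_BIN"] = pd.cut(df["AGE"], bins=age_bins, labels=False)
df["AGE_BIN"] = df["AGE_BIN"].fillna(df["AGE_BIN"].mode()[0]).astype(int)
df = df.drop(columns=["AGE"])

# Step 4: confirm the asset columns are identical, then keep only one of them.
if df["TOTAL_ASSET_BAL"].equals(df["SAVING_ASSET_BAL"]):
    df = df.drop(columns=["SAVING_ASSET_BAL", "LOCAL_SAVING_BAL"])

# Step 5: SelectKBest within one field group (here the *NUM columns).
num_cols = [c for c in df.columns if c.endswith("NUM")]
selector = SelectKBest(score_func=f_classif, k=2).fit(df[num_cols], y)
kept_num = [c for c, keep in zip(num_cols, selector.get_support()) if keep]
df = df.drop(columns=[c for c in num_cols if c not in kept_num])

# Step 6: standardize the merged feature table.
X = StandardScaler().fit_transform(df)
print(df.columns.tolist(), X.shape)
```

In a real run the same SelectKBest step would presumably be repeated for the *DUR, *AMT, and *BAL groups, each with its own k.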
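2: Classification model selection and training
Dataset splitting: K-fold cross-validation, plus train_test_split for a manual split of the dataset
Model selection: a decision tree, boosted trees (GBDT/XGBoost), an SVM (libsvm), and a neural network (multilayer perceptron) were each trained separately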
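As a rough sketch of how the splitting and model comparison could fit together, the example below holds out a test set with train_test_split and scores each model family with K-fold cross-validation on the training part. The data come from make_classification, not the bank dataset. The stratified variant of K-fold, the F1 scoring, the hyperparameters, and the use of scikit-learn's GradientBoostingClassifier in place of XGBoost are all assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the standardized features.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out 40% as a final test set (same ratio as the decision-tree script).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# One representative per model family named in the text.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(random_state=0),  # scikit-learn's SVC is built on libsvm
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}

# 5-fold cross-validation on the training part only; stratification keeps
# the churn ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {cv_f1[name]:.3f}")
```

Cross-validating on the training part only keeps the held-out test set untouched until a model has been chosen.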
3: Main Python code:
- decisiontree.py
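from sklearn import tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# StS (the standardized feature matrix) and y (the churn labels) are built earlier in the script.
X_train, X_test, y_train, y_test = train_test_split(StS, y, test_size=0.4, random_state=0)

# Fit a decision tree and predict on the held-out 40%.
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
pre_labels = clf.predict(X_test)

print('accuracy score:', accuracy_score(y_test, pre_labels, normalize=True))
print('recall score:', recall_score(y_test, pre_labels))
print('precision score:', precision_score(y_test, pre_labels))
print('f1 score:', f1_score(y_test, pre_labels))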
- XGBoost.py
import time

import pandas as pd
import xgboost as xgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost.sklearn import XGBClassifier

# Record the program's running time
start_time = time.time()

bankChurn = pd.read_csv('D:/work/lost data and dictionary/t