Modeling Credit Card Default with Machine Learning and the Pipeline Mechanism

H~A~H

Tags: machine learning, python, data analysis

Article link: https://blog.csdn.net/weixin_46278697/article/details/104496659

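Understanding the dataset

This is a default dataset from a bank.
Meaning of the dataset

Data exploration

First, we import the libraries needed for modeling.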

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

Read the data and take a look at it.

warnings.filterwarnings('ignore')

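# Set the font so that Chinese text and minus signs render correctly in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load the data
data = pd.read_csv('D:/credit_default-master/credit_default-master/UCI_Credit_Card.csv', encoding='utf-8')
data.info()  # info() prints the summary itself and returns None, so no print() is needed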

The loaded data is summarized in the figure above: 30,000 records in total.
Next, let's look at how many customers defaulted.

next_month = data['default.payment.next.month'].value_counts()  # count defaults vs. non-defaults
df = pd.DataFrame({'default.payment.next.month': next_month.index, 'values': next_month.values})  # build a small summary frame
plt.figure()
sns.barplot(x='default.payment.next.month', y='values', data=df)  # bar chart of the default counts
plt.title('Customer default status\n(default: 1, no default: 0)')
plt.show()

The default distribution is shown in the figure.
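
Besides the bar chart, it can help to look at the class shares as numbers. The sketch below uses a handful of made-up labels (not the real data) to show how `value_counts(normalize=True)` gives the proportion of each class, and what accuracy you would get by always predicting the majority class. That number is a useful reference point when reading the accuracy scores later on.

```python
import pandas as pd

# Made-up labels for illustration only (1 = default, 0 = no default).
# With the real data this would be data['default.payment.next.month'].
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name='default.payment.next.month')

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)

# Accuracy obtained by always predicting the most common class
majority_baseline = shares.max()
print('majority-class baseline accuracy:', majority_baseline)
```
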
Checking collinearity

# Check collinearity with a correlation heatmap
corr = data.corr()
plt.figure()
sns.heatmap(corr, annot=True)
plt.show()

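The collinearity is shown in the figure above. The correlation coefficients among BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5 and BILL_AMT6 are all greater than or equal to 0.8, which means these variables are highly linearly correlated, so we only need to keep one of them. PAY_4, PAY_5 and PAY_6 are likewise highly correlated, so one of them is enough as well.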
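
Reading pairs off a large heatmap is easy to get wrong, so here is a minimal sketch of listing the highly correlated pairs in code. It runs on synthetic columns (`BILL_A`, `BILL_B` and `OTHER` are made-up names, not columns of the real dataset) and reuses the 0.8 threshold from the text; with the real data you would start from `data.corr()` instead.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration: BILL_B is almost a copy of BILL_A
rng = np.random.default_rng(0)
base = rng.normal(size=500)
frame = pd.DataFrame({
    'BILL_A': base,
    'BILL_B': base + rng.normal(scale=0.1, size=500),
    'OTHER': rng.normal(size=500),  # unrelated to the other two
})

corr = frame.corr().abs()  # absolute Pearson correlation
cols = corr.columns

# Visit each unordered pair once (j > i) and keep those at or above the threshold
threshold = 0.8
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] >= threshold
]
print(high_pairs)
```
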

Data processing

# Drop ID and the redundant correlated features (BILL_AMT1, BILL_AMT6 and PAY_4 are kept from the correlated groups)
data.drop(['ID', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'PAY_5', 'PAY_6'], axis=1, inplace=True)
y = data['default.payment.next.month'].values
columns = data.columns.tolist()  # all remaining column names
columns.remove('default.payment.next.month')  # everything except the target is a feature
x = data[columns].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, stratify=y, random_state=1)  # 70/30 stratified split
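
The `stratify=y` argument in the split above keeps the share of defaulters roughly the same in the training and test sets. Here is a small sketch on synthetic labels (the arrays `x_demo` and `y_demo` are made up for the example) that compares the positive rate before and after splitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 200 positives out of 1000 rows
y_demo = np.array([1] * 200 + [0] * 800)
x_demo = np.arange(1000).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

# The positive rate should be close to identical in all three
print('overall:', y_demo.mean())
print('train:  ', y_tr.mean())
print('test:   ', y_te.mean())
```

Without `stratify`, the two rates can drift apart by chance, which matters more the rarer the positive class is.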

Modeling

# Build the classifiers
classifiers = [
    SVC(random_state=1, kernel='rbf'),
    DecisionTreeClassifier(random_state=1, criterion='gini'),
    RandomForestClassifier(random_state=1, criterion='gini'),
    KNeighborsClassifier(metric='minkowski'),
]

# Step names for the classifiers (used as prefixes in the parameter grids)
classifiers_names = [
    'svc',
    'decisiontreeclassifier',
    'randomforestclassifier',
    'kneighborsclassifier',
]

# Parameter grids for the classifiers
classifiers_param_grid = [
    {'svc__C': [1], 'svc__gamma': [0.01]},
    {'decisiontreeclassifier__max_depth': [6, 9, 11]},
    {'randomforestclassifier__n_estimators': [3, 5, 6]},
    {'kneighborsclassifier__n_neighbors': [4, 6, 8]},
]


def GridSearchCV_work(pipeline, x_train, x_test, y_train, y_test, param_grid, score='accuracy'):
    # Grid-search the pipeline on the training set, then evaluate the best model on the test set
    response = {}
    gridsearch = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring=score)
    search = gridsearch.fit(x_train, y_train)
    print('Best parameters:', search.best_params_)
    print('Best score:', search.best_score_)  # mean cross-validation score of the best candidate
    y_predict = gridsearch.predict(x_test)
    accuracy = accuracy_score(y_test, y_predict)
    print('Accuracy:', accuracy)
    response['predictions'] = y_predict
    response['accuracy'] = accuracy
    return response


for model, model_name, model_param_grid in zip(classifiers, classifiers_names, classifiers_param_grid):
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        (model_name, model)
    ])
    result = GridSearchCV_work(pipeline, x_train, x_test, y_train, y_test, model_param_grid, score='accuracy')
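
The keys in the parameter grids follow scikit-learn's `<step name>__<parameter>` convention, which is why the names in `classifiers_names` have to match the prefixes used in `classifiers_param_grid`. A minimal sketch of how that naming works (the variable names `pipe` and `auto` are just for this example):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf')),
])

# Every parameter of a step is reachable as '<step name>__<parameter>'
svc_params = sorted(k for k in pipe.get_params() if k.startswith('svc__'))
print(svc_params)

# The same keys work with set_params, which is what GridSearchCV relies on
pipe.set_params(svc__C=1, svc__gamma=0.01)

# make_pipeline names each step after its lowercased class name
auto = make_pipeline(StandardScaler(), SVC())
step_names = [name for name, _ in auto.steps]
print(step_names)
```

If a grid key uses a prefix that is not a step name in the pipeline, the search fails with an invalid-parameter error, so it is worth checking `get_params()` when a grid does not run.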

Running the code prints the following results.

Best parameters: {'svc__C': 1, 'svc__gamma': 0.01}
Best score: 0.8193809523809523
Accuracy: 0.818
Best parameters: {'decisiontreeclassifier__max_depth': 6}
Best score: 0.8197619047619048
Accuracy: 0.8147777777777778
Best parameters: {'randomforestclassifier__n_estimators': 6}
Best score: 0.8001904761904761
Accuracy: 0.8007777777777778
Best parameters: {'kneighborsclassifier__n_neighbors': 8}
Best score: 0.805952380952381
Accuracy: 0.8088888888888889
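
The printout only shows the best candidate for each model. If you want to see how every candidate in a grid scored, the fitted search object keeps the full table in `cv_results_`. The sketch below uses a small synthetic dataset from `make_classification` rather than the credit card data, and the grid values are chosen only for the example:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real training data
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('decisiontreeclassifier', DecisionTreeClassifier(random_state=1)),
])
grid = GridSearchCV(
    pipe,
    param_grid={'decisiontreeclassifier__max_depth': [2, 4, 6]},
    scoring='accuracy',
    cv=5,
)
grid.fit(x_demo, y_demo)

# One row per candidate, with its mean and spread across the CV folds
table = pd.DataFrame(grid.cv_results_)[[
    'param_decisiontreeclassifier__max_depth',
    'mean_test_score',
    'std_test_score',
    'rank_test_score',
]]
print(table.sort_values('rank_test_score'))
```
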

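From these results, the support vector machine with penalty coefficient C = 1 and kernel coefficient gamma = 0.01 gives the highest test accuracy (0.818), although the decision tree has a marginally higher cross-validation score (0.8198 vs. 0.8194). Of course, this is only the best among the parameters I supplied; you can tune over a finer and wider range of parameters to improve the model further. Also, after checking collinearity and filtering the features, you can use feature selection algorithms such as RFE and RFECV to pick out better features. Thanks for reading!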
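
As a starting point for the RFECV suggestion, here is a minimal sketch on synthetic data. The dataset comes from `make_classification`, and the choice of a shallow decision tree as the ranking estimator is an assumption made for the example, not something taken from the experiment above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 10 features, only a few of which carry signal
x_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=2, random_state=1
)

# RFECV drops one feature per step and uses cross-validation
# to decide how many features to keep
selector = RFECV(
    estimator=DecisionTreeClassifier(max_depth=4, random_state=1),
    step=1,
    cv=5,
    scoring='accuracy',
)
selector.fit(x_demo, y_demo)

print('features kept:', selector.n_features_)
print('selection mask:', selector.support_)

# Reduce the matrix to the selected columns only
x_reduced = selector.transform(x_demo)
print(x_reduced.shape)
```

RFECV needs an estimator that exposes `coef_` or `feature_importances_`, so tree-based models and linear models work directly, while an RBF-kernel SVC does not.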
