[Data Analysis Practice] Task 1.1: Model Building

Import the packages needed for this exercise:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix

%matplotlib inline

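Model building

Downloading the dataset

Download link for the practice data: https://pan.baidu.com/s/1dtHJiV6zMbf_fWPi-dZ95g

Note: this is a financial dataset (not the raw data; it has already been preprocessed). The task is to predict whether a loan customer will become overdue. The "status" column is the label: 0 means not overdue, 1 means overdue.

Loading the data

Load the raw data from data_all.csv and take a first look at it: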

data_origin = pd.read_csv('data_all.csv')
data_origin.head()
(shown transposed: one feature per row, one sample per column; "..." marks the columns omitted from the display)

                                              0        1        2        3        4
low_volume_percent                         0.01     0.02     0.04     0.00     0.01
middle_volume_percent                      0.99     0.94     0.96     0.96     0.99
take_amount_in_later_12_month_highest         0     2000        0     2000        0
trans_amount_increase_rate_lately          0.90     1.28     1.00     0.13     0.46
trans_activity_month                       0.55     1.00     1.00     0.57     1.00
trans_activity_day                        0.313    0.458    0.114    0.777    0.175
transd_mcc                                 17.0     19.0     13.0     22.0     13.0
trans_days_interval_filter                 27.0     30.0     68.0     14.0     66.0
trans_days_interval                        26.0     14.0     22.0      6.0     42.0
regional_mobility                           3.0      4.0      1.0      3.0      1.0
...
consfin_product_count                       2.0      6.0      1.0      5.0      2.0
consfin_max_limit                        1200.0  22800.0   4200.0  30000.0   8400.0
consfin_avg_limit                        1200.0   9360.0   4200.0  12180.0   8250.0
latest_query_day                           12.0      4.0      2.0      2.0     22.0
loans_latest_day                           18.0      2.0      6.0      4.0    120.0
reg_preference_for_trad                       0        0        0        1        0
latest_query_time_month                     4.0      5.0      5.0      5.0      4.0
latest_query_time_weekday                   2.0      3.0      5.0      5.0      6.0
loans_latest_time_month                     4.0      5.0      5.0      5.0      1.0
loans_latest_time_weekday                   3.0      5.0      5.0      3.0      6.0

5 rows × 85 columns

View summary statistics for each column:

data_origin.describe()
(shown transposed: one feature per row, one statistic per column; "..." marks the columns omitted from the display)

                                               count          mean           std           min           25%           50%           75%           max
low_volume_percent                       4754.000000      0.021801      0.041519      0.000000      0.010000      0.010000      0.020000      1.000000
middle_volume_percent                    4754.000000      0.901332      0.144837      0.000000      0.880000      0.960000      0.990000      1.000000
take_amount_in_later_12_month_highest    4754.000000   1940.197728   3923.971494      0.000000      0.000000    500.000000   2000.000000  68000.000000
trans_amount_increase_rate_lately        4754.000000     14.152318    693.961441      0.000000      0.620000      0.970000      1.600000  47596.740000
trans_activity_month                     4754.000000      0.804493      0.196920      0.120000      0.670000      0.860000      1.000000      1.000000
trans_activity_day                       4754.000000      0.365356      0.170194      0.033000      0.233000      0.350000      0.479500      0.941000
transd_mcc                               4754.000000     17.503155      4.474686      2.000000     15.000000     17.000000     20.000000     42.000000
trans_days_interval_filter               4754.000000     29.004628     22.711659      0.000000     16.000000     23.000000     32.000000    285.000000
trans_days_interval                      4754.000000     21.748422     16.472031      4.000000     12.000000     17.000000     26.750000    234.000000
regional_mobility                        4754.000000      2.678797      0.890198      1.000000      2.000000      3.000000      3.000000      5.000000
...
consfin_product_count                    4754.000000      5.088347      3.344794      0.000000      3.000000      4.000000      7.000000     20.000000
consfin_max_limit                        4754.000000  16418.973496  13885.107357      0.000000   7800.000000  14400.000000  20400.000000 266400.000000
consfin_avg_limit                        4754.000000   7507.426378   5830.674623      0.000000   4200.000000   6750.000000   9696.250000  82800.000000
latest_query_day                         4754.000000     24.041649     36.500344     -2.000000      6.000000     16.000000     23.000000    360.000000
loans_latest_day                         4754.000000     51.984013     53.249364     -2.000000      7.000000     29.000000     86.000000    323.000000
reg_preference_for_trad                  4754.000000      0.372949      0.687382      0.000000      0.000000      0.000000      1.000000      4.000000
latest_query_time_month                  4754.000000      4.273875      1.333778      1.000000      4.000000      4.000000      5.000000     12.000000
latest_query_time_weekday                 4754.00000       3.42196       1.93213       0.00000       2.00000       4.00000       5.00000       6.00000
loans_latest_time_month                  4754.000000      4.542701      2.987731      1.000000      3.000000      4.000000      5.000000     12.000000
loans_latest_time_weekday                4754.000000      3.025873      1.895870      0.000000      2.000000      3.000000      5.000000      6.000000

8 rows × 85 columns

Check whether the data contains missing values:

data_origin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 85 columns):
low_volume_percent                            4754 non-null float64
middle_volume_percent                         4754 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4754 non-null float64
trans_activity_month                          4754 non-null float64
trans_activity_day                            4754 non-null float64
transd_mcc                                    4754 non-null float64
trans_days_interval_filter                    4754 non-null float64
trans_days_interval                           4754 non-null float64
regional_mobility                             4754 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4754 non-null float64
first_transaction_time                        4754 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4754 non-null float64
rank_trad_1_month                             4754 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4754 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4754 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4754 non-null float64
trans_top_time_last_1_month                   4754 non-null float64
trans_top_time_last_6_month                   4754 non-null float64
consume_top_time_last_1_month                 4754 non-null float64
consume_top_time_last_6_month                 4754 non-null float64
cross_consume_count_last_1_month              4754 non-null float64
trans_fail_top_count_enum_last_1_month        4754 non-null float64
trans_fail_top_count_enum_last_6_month        4754 non-null float64
trans_fail_top_count_enum_last_12_month       4754 non-null float64
consume_mini_time_last_1_month                4754 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4754 non-null float64
railway_consume_count_last_12_month           4754 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4754 non-null float64
status                                        4754 non-null int64
first_transaction_day                         4754 non-null float64
trans_day_last_12_month                       4754 non-null float64
apply_score                                   4754 non-null float64
apply_credibility                             4754 non-null float64
query_org_count                               4754 non-null float64
query_finance_count                           4754 non-null float64
query_cash_count                              4754 non-null float64
query_sum_count                               4754 non-null float64
latest_one_month_apply                        4754 non-null float64
latest_three_month_apply                      4754 non-null float64
latest_six_month_apply                        4754 non-null float64
loans_score                                   4754 non-null float64
loans_credibility_behavior                    4754 non-null float64
loans_count                                   4754 non-null float64
loans_settle_count                            4754 non-null float64
loans_overdue_count                           4754 non-null float64
loans_org_count_behavior                      4754 non-null float64
consfin_org_count_behavior                    4754 non-null float64
loans_cash_count                              4754 non-null float64
latest_one_month_loan                         4754 non-null float64
latest_three_month_loan                       4754 non-null float64
latest_six_month_loan                         4754 non-null float64
history_suc_fee                               4754 non-null float64
history_fail_fee                              4754 non-null float64
latest_one_month_suc                          4754 non-null float64
latest_one_month_fail                         4754 non-null float64
loans_long_time                               4754 non-null float64
loans_credit_limit                            4754 non-null float64
loans_credibility_limit                       4754 non-null float64
loans_org_count_current                       4754 non-null float64
loans_product_count                           4754 non-null float64
loans_max_limit                               4754 non-null float64
loans_avg_limit                               4754 non-null float64
consfin_credit_limit                          4754 non-null float64
consfin_credibility                           4754 non-null float64
consfin_org_count_current                     4754 non-null float64
consfin_product_count                         4754 non-null float64
consfin_max_limit                             4754 non-null float64
consfin_avg_limit                             4754 non-null float64
latest_query_day                              4754 non-null float64
loans_latest_day                              4754 non-null float64
reg_preference_for_trad                       4754 non-null int64
latest_query_time_month                       4754 non-null float64
latest_query_time_weekday                     4754 non-null float64
loans_latest_time_month                       4754 non-null float64
loans_latest_time_weekday                     4754 non-null float64
dtypes: float64(73), int64(12)
memory usage: 3.1 MB

From the output above, the dataset has 85 columns (84 features plus the label column status) and 4754 samples, and it contains no missing values.
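Reading 85 lines of `info()` output by eye is error-prone, so the same check can be done by counting missing values directly. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical stand-ins, not rows from data_all.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_origin (hypothetical values, not the real data).
toy = pd.DataFrame({
    "apply_score": [520.0, np.nan, 610.0],
    "loans_count": [3.0, 1.0, 7.0],
    "status": [0, 1, 0],
})

# Number of missing values per column, then the grand total.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

Applied to the real data, `data_origin.isnull().sum().sum()` should return 0 if the `info()` listing above is accurate.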

Splitting the dataset

First, take the status column as the label y and the remaining columns as the feature matrix X:

y = data_origin.status
X = data_origin.drop(['status'], axis=1)

Then use sklearn to split this financial dataset into a training set and a test set at a 7:3 ratio, with random seed 2018:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

Check the sizes of the resulting training and test sets:

[X_train.shape, y_train.shape, X_test.shape, y_test.shape]
[(3327, 84), (3327,), (1427, 84), (1427,)]
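A purely random split does not guarantee that the share of overdue samples is the same in both partitions. `train_test_split` accepts a `stratify` argument for this. The sketch below is one possible variation, not what the post does, and it runs on synthetic arrays (`X_demo` and `y_demo` are made up) so that the effect is easy to see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1000 samples, 25% positive (made-up numbers).
rng = np.random.RandomState(2018)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 750 + [1] * 250)

# stratify=y_demo asks for the same class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive rate train: %.3f, test: %.3f" % (y_tr.mean(), y_te.mean()))
```

With the real data the equivalent call would pass `stratify=y`; note that this changes which rows land in the test set, so the scores reported later in the post would no longer be reproduced exactly.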

Building the models

Three models are built in this section: logistic regression, an SVM, and a decision tree.

Logistic regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
SVM
svc = SVC()
svc.fit(X_train, y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Decision tree
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
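The printed parameters show `random_state=None` for the decision tree. scikit-learn shuffles the candidate features at each split, so when several splits are equally good the fitted tree, and with it the scores, can differ from run to run. A minimal sketch of pinning the seed follows; it uses synthetic data from `make_classification` rather than the loan dataset, and the seed value 2018 is simply reused from the split.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X_train / y_train (not the loan dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Two trees built with the same seed.
tree_a = DecisionTreeClassifier(random_state=2018).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=2018).fit(X_demo, y_demo)

# With a fixed seed the two trees have the same structure:
# same number of nodes and the same split feature at every node.
same_size = tree_a.tree_.node_count == tree_b.tree_.node_count
same_splits = bool((tree_a.tree_.feature == tree_b.tree_.feature).all())
print("same structure:", same_size and same_splits)
```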

Model evaluation

First, predict on the test set:

log_predict = log_reg.predict(X_test)
svc_predict = svc.predict(X_test)
dt_predict = dt_clf.predict(X_test)
Accuracy
print('Logistic regression accuracy: %.2f%%' % (accuracy_score(y_test, log_predict) * 100))
print('Decision tree accuracy: %.2f%%' % (accuracy_score(y_test, dt_predict) * 100))
print('SVM accuracy: %.2f%%' % (accuracy_score(y_test, svc_predict) * 100))
Logistic regression accuracy: 74.84%
Decision tree accuracy: 67.83%
SVM accuracy: 74.84%
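Accuracy is easier to judge next to a trivial baseline. The confusion matrices in this post show that the test set holds 1068 non-overdue and 359 overdue samples, so always predicting "not overdue" already scores 1068 / 1427. The sketch below reproduces that number with `DummyClassifier`; the label array is rebuilt from those two counts, and the classifier is fit on the same labels purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels with the same class counts as the test set in this post:
# 1068 negatives (status=0) and 359 positives (status=1).
y_true = np.array([0] * 1068 + [1] * 359)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by DummyClassifier

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))

print("majority-class baseline accuracy: %.2f%%" % (baseline_acc * 100))
```

The baseline comes out at 74.84%, the same figure reported for logistic regression and the SVM, which suggests that those two models learned nothing beyond the class prior.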
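Confusion matrix

Each row of the confusion matrix represents an actual class and each column a predicted class. A perfect classifier produces only true negatives and true positives, so the larger the values on the main diagonal (top left to bottom right) and the smaller the off-diagonal values, the better.

# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
log_conf = confusion_matrix(y_test, log_predict)
dt_conf = confusion_matrix(y_test, dt_predict)
svm_conf = confusion_matrix(y_test, svc_predict)

print('Logistic regression confusion matrix:\n%s' % (log_conf))
print('Decision tree confusion matrix:\n%s' % (dt_conf))
print('SVM confusion matrix:\n%s' % (svm_conf))
Logistic regression confusion matrix:
[[1068    0]
 [ 359    0]]
Decision tree confusion matrix:
[[832 236]
 [223 136]]
SVM confusion matrix:
[[1068    0]
 [ 359    0]]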
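To read the four cells without guessing at the orientation, it helps to unpack them by name. scikit-learn's `confusion_matrix` takes its arguments in the order `(y_true, y_pred)` and, for a binary problem, `ravel()` yields TN, FP, FN, TP in that order. The labels below are a tiny hand-made example, not the loan data.

```python
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (not the loan data).
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
```

Swapping the two arguments transposes the matrix, which exchanges the FP and FN cells.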

Plotting the confusion matrices:

fig = plt.figure(figsize=(8, 6))
fig1 = plt.subplot(131)
fig1.matshow(log_conf, cmap=plt.cm.gray)
fig2 = plt.subplot(132)
fig2.matshow(dt_conf, cmap=plt.cm.gray)
fig3 = plt.subplot(133)
fig3.matshow(svm_conf, cmap=plt.cm.gray)

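[Figure: grayscale plots of the three confusion matrices, left to right: logistic regression, decision tree, SVM]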

Precision

Precision is defined as:
$$Precision=\frac{TP}{TP+FP}$$
where TP is the number of positive instances correctly identified (true positives) and FP is the number of negative instances wrongly predicted as positive (false positives).

# precision_score expects (y_true, y_pred)
print('Logistic regression precision: %.2f%%' % (precision_score(y_test, log_predict) * 100))
print('Decision tree precision: %.2f%%' % (precision_score(y_test, dt_predict) * 100))
print('SVM precision: %.2f%%' % (precision_score(y_test, svc_predict) * 100))
Logistic regression precision: 0.00%
Decision tree precision: 36.56%
SVM precision: 0.00%
Recall

Recall is defined as:
$$Recall=\frac{TP}{TP+FN}$$
where TP is the number of positive instances correctly identified (true positives) and FN is the number of positive instances wrongly predicted as negative (false negatives).

# recall_score expects (y_true, y_pred)
print('Logistic regression recall: %.2f%%' % (recall_score(y_test, log_predict) * 100))
print('Decision tree recall: %.2f%%' % (recall_score(y_test, dt_predict) * 100))
print('SVM recall: %.2f%%' % (recall_score(y_test, svc_predict) * 100))
Logistic regression recall: 0.00%
Decision tree recall: 37.88%
SVM recall: 0.00%
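The two definitions can be checked by hand against the decision tree's counts from the confusion matrix in this post: 832 true negatives, 236 false positives (predicted overdue, actually not), 223 false negatives (actually overdue, predicted not), and 136 true positives. The sketch below rebuilds label arrays with exactly those counts; the arrays are reconstructions for illustration, not the real `y_test` and `dt_predict`. It also shows that the argument order matters: scikit-learn expects `(y_true, y_pred)`, and passing them the other way round exchanges precision and recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Decision-tree counts taken from the confusion matrix in this post.
tn, fp, fn, tp = 832, 236, 223, 136

# Rebuild label arrays that produce exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 136 / 372
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 136 / 359
print("precision: %.2f%%" % (precision * 100))
print("recall:    %.2f%%" % (recall * 100))

# Passing the arguments the other way round exchanges the two metrics.
swapped = precision_score(y_pred, y_true)
print("precision with swapped arguments: %.2f%%" % (swapped * 100))
```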

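Because the logistic regression and SVM models never predict the positive class (status=1), they have no true positives (TP = 0), so both precision and recall come out as 0. Where the denominator is also 0, scikit-learn reports 0 and issues a warning.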
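A model that never predicts the positive class often points to features on very different scales. In this dataset, for example, consfin_max_limit reaches 266400 while low_volume_percent stays within [0, 1], and models such as `SVC` with an RBF kernel are generally sensitive to that. Whether scaling is the actual cause here is an assumption that the post does not test. The sketch below only shows the mechanics of putting `StandardScaler` in front of `SVC` in a pipeline, on synthetic data with one deliberately inflated feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (not the loan dataset); one feature is blown up to a large scale.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 10000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2018
)

# The scaler is fit on the training split only, then applied to the test split.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)

pred = scaled_svc.predict(X_te)
print("distinct predicted classes:", np.unique(pred))
print("test accuracy: %.3f" % scaled_svc.score(X_te, y_te))
```

On the real data the same pipeline would be fit with `X_train` and `y_train`; the resulting scores would have to be measured rather than assumed.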
