GDUT Big Data Association, Alibaba Cloud Tianchi Financial Risk Control Training Camp: Task 1

Article link: https://blog.csdn.net/weixin_52311669/article/details/115931965

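Task 1: Understanding the Competition Problem

I. Learning Outline

● Competition overview
● Data overview
● Evaluation metrics
● Problem analysis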

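II. Learning Content

1. Competition overview

Participants are asked to build a model on the provided dataset to predict financial risk.
(1) The data comes from the loan records of a credit platform. It contains more than 1.2 million records and 47 columns of variables, 15 of which are anonymous variables.
(2) From this dataset, 800,000 records are drawn as the training set, 200,000 as test set A, and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode, and title are desensitized.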

2. Data overview

This part describes what each column (feature) of the dataset represents.

3. Evaluation metrics

The competition uses AUC (Area Under the ROC Curve) as its evaluation metric.

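(A) Common evaluation metrics for classification algorithms:

(1) Confusion matrix

True positive, TP: the instance is positive and is predicted as positive.
False negative, FN: the instance is positive but is predicted as negative.
False positive, FP: the instance is negative but is predicted as positive.
True negative, TN: the instance is negative and is predicted as negative.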
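
As a minimal sketch of the four definitions above, the counts can be tallied directly from two label lists. The helper name `confusion_counts` and the sample labels are illustrative assumptions, and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally (TP, FN, FP, TN) for a binary problem (hypothetical helper)."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive instance, predicted positive
        elif actual == positive:
            fn += 1  # positive instance, predicted negative
        elif predicted == positive:
            fp += 1  # negative instance, predicted positive
        else:
            tn += 1  # negative instance, predicted negative
    return tp, fn, fp, tn


y_true = [0, 1, 1, 0]
y_pred = [0, 1, 1, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print('TP, FN, FP, TN =', tp, fn, fp, tn)  # TP, FN, FP, TN = 2 0 1 1
```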

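(2) Accuracy [Note: not suitable when the classes are imbalanced]
(3) Precision
(4) Recall
(5) F1 score [balances precision and recall]
(6) P-R curve
(7) ROC (ROC space uses the false positive rate as the X axis and the true positive rate as the Y axis)
(8) AUC (in practice between 0.5 and 1.0; the closer to 1.0, the better the model separates the two classes; at 0.5 it is no better than random guessing and has no practical value)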
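
To make the first four metrics in the list above concrete, here is a small sketch that derives them from confusion-matrix counts using the standard formulas. The function name `basic_metrics` and the example counts are assumptions made for illustration. The second call shows why accuracy is misleading on imbalanced data.

```python
def basic_metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return accuracy, precision, recall, f1


# Counts for y_true = [0, 1, 1, 0] and y_pred = [0, 1, 1, 1]
print(basic_metrics(tp=2, fn=0, fp=1, tn=1))

# Imbalanced case: 95 negatives, 5 positives, model predicts everything negative.
# Accuracy is 0.95 even though recall is 0.
print(basic_metrics(tp=0, fn=5, fp=0, tn=95))
```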

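(B) Common evaluation metrics for financial risk-control prediction:

KS (Kolmogorov-Smirnov)

In risk control, KS is commonly used to assess a model's discriminative power. The greater the discrimination, the stronger the model's ability to rank risk.
The K-S curve differs from the ROC curve as follows:
The ROC curve plots the false positive rate on the horizontal axis and the true positive rate on the vertical axis;
The K-S curve plots both the true positive rate and the false positive rate on the vertical axis, with the chosen threshold on the horizontal axis. The KS value is the largest gap between these two curves.
In general, the larger the KS value, the stronger the model's discriminative power. Larger is not always better, though: a very high KS may indicate a problem with the model, so check for overfitting when KS is unusually high. The ranges below relate KS values to model quality; they are not definitive and only indicate the general trend.
KS < 0.2: the model is generally considered to have no discriminative power
0.2 <= KS < 0.3: the model has some discriminative power
0.3 <= KS <= 0.5: the model has strong discriminative power
KS > 0.75: the model is abnormal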
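
The KS description above can be sketched in a few lines: sweep each distinct score as a threshold, compute the true positive rate and false positive rate, and keep the largest gap. This is a plain-Python illustration rather than the competition's own implementation. The helper name `ks_statistic` and the scores are made up, and the sketch assumes both classes are present and that class 1 is positive.

```python
def ks_statistic(y_true, y_score):
    """Return (KS, threshold), where KS = max |TPR - FPR| over score thresholds."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best_ks, best_threshold = 0.0, None
    for threshold in sorted(set(y_score), reverse=True):
        # An instance is predicted positive when its score is >= threshold.
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        gap = abs(tp / n_pos - fp / n_neg)
        if gap > best_ks:
            best_ks, best_threshold = gap, threshold
    return best_ks, best_threshold


y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
ks, threshold = ks_statistic(y_true, y_score)
print('KS:', ks, 'at threshold', threshold)  # KS: 0.5 at threshold 0.8
```

With scikit-learn available, the same value can usually be obtained from the arrays returned by `roc_curve` as the maximum of `abs(TPR - FPR)`.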

III. Code Examples

1. Loading the data

import pandas as pd

# Read the training set and test set A from the Tianchi-hosted CSV files
train = pd.read_csv('http://tianchi-media.oss-cn-beijing.aliyuncs.com/dragonball/FRC/data_set/train.csv')
testA = pd.read_csv('http://tianchi-media.oss-cn-beijing.aliyuncs.com/dragonball/FRC/data_set/testA.csv')
print('Train data shape:', train.shape)
print('TestA data shape:', testA.shape)
# Preview the first five rows
train.head()

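2. Evaluation metrics for classification algorithms

1) Confusion matrix

from sklearn.metrics import confusion_matrix

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
# Rows are the true classes, columns are the predicted classes
print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))

Confusion matrix:
 [[1 1]
 [0 2]]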

2) Accuracy

from sklearn.metrics import accuracy_score

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('ACC:', accuracy_score(y_true, y_pred))

ACC: 0.75

3) Precision

from sklearn import metrics

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('Precision', metrics.precision_score(y_true, y_pred))

Precision 0.6666666666666666

4) Recall

from sklearn import metrics

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('Recall', metrics.recall_score(y_true, y_pred))

Recall 1.0

5) F1 score

from sklearn import metrics

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
# F1 = 2 * precision * recall / (precision + recall)
print('F1-score:', metrics.f1_score(y_true, y_pred))

F1-score: 0.8

6) P-R curve

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
# By convention, recall goes on the x axis and precision on the y axis
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()


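7) ROC

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Hard 0/1 predictions are used as scores here, so the curve has few points
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
FPR, TPR, thresholds = roc_curve(y_true, y_pred)
plt.title('ROC')
plt.plot(FPR, TPR, 'b')
# Diagonal reference line: the performance of random guessing
plt.plot([0, 1], [0, 1], 'r--')
plt.ylabel('TPR')
plt.xlabel('FPR')
plt.show()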

8) AUC

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0])
y_scores = np.array([0.2, 0.3, 0.4, 0.7])
print('AUC score:', roc_auc_score(y_true, y_scores))
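
One way to read AUC is as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below counts such pairs directly for the same labels and scores as in the example above. The helper name `pairwise_auc` is an assumption for illustration; ties are counted as half a win, and both classes are assumed to be present.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


y_true = [0, 1, 1, 0]
y_scores = [0.2, 0.3, 0.4, 0.7]
# Positives (0.3, 0.4) beat the negative 0.2 but lose to 0.7: 2 of 4 pairs.
print('Pairwise AUC:', pairwise_auc(y_true, y_scores))  # Pairwise AUC: 0.5
```

For these inputs the pairwise count gives 0.5, which means the scores rank the two classes no better than chance.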