[Risk Control in Practice] Credit Card Fraud Detection (Part 1)

Source: https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets

Correcting Previous Mistakes from Imbalanced Datasets:

  • Never test on the oversampled or undersampled dataset.
  • If we want to implement cross-validation, oversample or undersample the training data during cross-validation, not before!
  • Don't use accuracy as a metric with imbalanced datasets (it will usually be high and misleading); use the F1 score, precision/recall, or a confusion matrix instead.
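
To make the last point concrete, here is a minimal sketch using only NumPy. The labels are synthetic but use the class counts quoted for this dataset (492 frauds in 284,807 rows), and the "model" is a hypothetical one that always predicts non-fraud.

```python
import numpy as np

# Synthetic labels with the dataset's class counts: 492 frauds in 284,807 rows.
y_true = np.zeros(284807, dtype=int)
y_true[:492] = 1

# A useless "model" that always predicts the majority class (non-fraud).
y_pred = np.zeros_like(y_true)

accuracy = float((y_true == y_pred).mean())

# Recall = caught frauds / all frauds.
true_pos = int(((y_true == 1) & (y_pred == 1)).sum())
false_neg = int(((y_true == 1) & (y_pred == 0)).sum())
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.4f} recall={recall:.2f}")
```

The accuracy is very high even though not a single fraud is caught, which is why recall, precision and F1 are the more informative metrics here.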

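Let's get started!

About the data:

The dataset has 284,807 records.

It contains transactions made with credit cards by European cardholders in September 2013. The transactions occurred over two days, and 492 of the 284,807 transactions are frauds. The dataset is highly imbalanced: the positive class (fraud) accounts for 0.172% of all transactions.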

Keep in mind that features need to be scaled before a PCA transformation is applied. (In this case, we assume that the people who developed the dataset already scaled all the V features.)

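It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and further background information about the data cannot be provided. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'.

  • 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
  • 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
  • 'Class' is the response variable; it takes the value 1 in case of fraud and 0 otherwise.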

# Assumes df is the transaction data already loaded into a pandas DataFrame.
# Check whether the data has any missing values
# Good, no null values!
print(df.isnull().sum().max()) # 0

print(df.columns) # list the features
"""
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')
"""

# The classes are heavily skewed; we need to solve this issue later.
# Check the proportion of fraud vs. non-fraud
print('No Frauds', round(df['Class'].value_counts()[0]/len(df) * 100,2), '% of the dataset')
print('Frauds', round(df['Class'].value_counts()[1]/len(df) * 100,2), '% of the dataset')
# No Frauds 99.83 % of the dataset
# Frauds 0.17 % of the dataset

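0 means normal and 1 means fraud. The two classes are severely imbalanced, not even within the same order of magnitude. (Key point: handling imbalanced data!)


Severely imbalanced! So we need to build a subsample in which the ratio of fraud to non-fraud transactions is 50/50, meaning the subsample contains the same number of fraud and non-fraud transactions. Using the original dataframe would cause the following problems:

Overfitting: our classification models will assume that in most cases there is no fraud! We want our model to be certain when a fraud occurs.

Wrong correlations: although we don't know what the "V" features stand for, it is useful to understand how each of them influences the result (fraud or no fraud). With an imbalanced dataframe we cannot see the true correlations between the class and the features.

scaled_amount and scaled_time are the columns with scaled values. (The features that were not yet scaled are scaled with RobustScaler.)

There are 492 fraud cases in our dataset, so we can randomly pick 492 non-fraud cases to create a new sub-dataframe with a 1:1 distribution.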
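
The random undersampling described above could be sketched as follows. The data here is a synthetic stand-in (10,000 rows with 20 frauds, numbers chosen only for illustration), and names such as `toy_df` and `balanced_df` are hypothetical; on the real data the same steps would apply to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 10,000 rows, 20 of them labelled as fraud.
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "V1": rng.normal(size=10_000),
    "Class": np.r_[np.ones(20, dtype=int), np.zeros(9_980, dtype=int)],
})

fraud_df = toy_df[toy_df["Class"] == 1]
# Randomly draw as many non-fraud rows as there are fraud rows.
non_fraud_df = toy_df[toy_df["Class"] == 0].sample(n=len(fraud_df), random_state=42)

# Concatenate and shuffle so the two classes are interleaved.
balanced_df = pd.concat([fraud_df, non_fraud_df]).sample(frac=1, random_state=42)

print(balanced_df["Class"].value_counts())
```

Every fraud row is kept and only the majority class is thinned out, so the subsample is small but has a 1:1 class ratio.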

# Since most of our data has already been scaled we should scale the columns that are left to scale (Amount and Time)
from sklearn.preprocessing import StandardScaler, RobustScaler

# RobustScaler is less sensitive to outliers.

std_scaler = StandardScaler()
rob_scaler = RobustScaler()

df['scaled_amount'] = rob_scaler.fit_transform(df['Amount'].values.reshape(-1,1))
df['scaled_time'] = rob_scaler.fit_transform(df['Time'].values.reshape(-1,1))

df.drop(['Time','Amount'], axis=1, inplace=True)
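
As a rough illustration of why RobustScaler copes well with outliers: with its default settings it centers each column on the median and divides by the interquartile range, so a single extreme amount barely moves the scale. The amounts below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical transaction amounts with one extreme value.
amounts = np.array([1.0, 5.0, 10.0, 20.0, 50.0, 25000.0]).reshape(-1, 1)

scaled = RobustScaler().fit_transform(amounts)

# The same result by hand: subtract the median, divide by the interquartile range.
median = np.median(amounts)
q25, q75 = np.percentile(amounts, [25, 75])
manual = (amounts - median) / (q75 - q25)

print(np.allclose(scaled, manual))
```

A mean/standard-deviation scaler would instead let the 25000.0 value dominate both statistics and squash the ordinary amounts together.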

Next we split the original data. Why? For testing purposes: although we split the data when implementing random undersampling or oversampling, we want to test our models on the original testing set, not on the testing set created by either of these techniques. (The constructed 1:1 data is used to select models; testing is still done on the original data.)
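
One way to set aside untouched test folds is a stratified split, which keeps the original fraud ratio in every fold. The sketch below assumes scikit-learn's `StratifiedKFold` and uses synthetic labels (50 frauds among 10,000 rows, numbers picked for illustration); it is not necessarily the exact split used in the source notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 50 frauds among 10,000 transactions (assumed numbers).
y = np.zeros(10_000, dtype=int)
y[:50] = 1
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_fraud_counts = []
for train_index, test_index in skf.split(X, y):
    # Each test fold keeps the original fraud ratio and is never resampled.
    fold_fraud_counts.append(int(y[test_index].sum()))

print(fold_fraud_counts)
```

Any undersampling or oversampling would then be applied to the training indices only.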

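Getting to know the features!!!

Correlation matrices are essential to understanding our data. We want to know whether there are features that heavily influence whether a specific transaction is a fraud. However, we must use the correct subsample in order to see which features have a high positive or negative correlation with fraud transactions.

Negative correlations: V17, V14, V12 and V10 are negatively correlated. Note that the lower these values are, the more likely the end result is a fraud transaction.
Positive correlations: V2, V4, V11 and V19 are positively correlated. Note that the higher these values are, the more likely the end result is a fraud transaction.
BoxPlots: we will use boxplots to get a better understanding of how these features are distributed in fraudulent and non-fraudulent transactions.

We have to make sure we use the subsample in the correlation matrix, or else the correlation matrix will be affected by the class imbalance.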
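
The effect of class imbalance on correlations can be seen in a small synthetic experiment. The feature `V` below is made up (lower values for fraud), and the class sizes are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_normal, n_fraud = 50_000, 100  # assumed class sizes for the illustration

# One synthetic feature whose values are clearly lower for fraud.
full_df = pd.DataFrame({
    "V": np.r_[rng.normal(0, 1, n_normal), rng.normal(-3, 1, n_fraud)],
    "Class": np.r_[np.zeros(n_normal, dtype=int), np.ones(n_fraud, dtype=int)],
})

# Balanced subsample: all frauds plus an equal number of random non-frauds.
balanced_df = pd.concat([
    full_df[full_df["Class"] == 1],
    full_df[full_df["Class"] == 0].sample(n=n_fraud, random_state=0),
])

corr_full = full_df["V"].corr(full_df["Class"])
corr_balanced = balanced_df["V"].corr(balanced_df["Class"])
print(f"full data: {corr_full:.2f}, balanced subsample: {corr_balanced:.2f}")
```

The feature separates the classes equally well in both frames, yet the correlation with `Class` only looks strong in the balanced subsample, because in the full data the rare class contributes almost nothing to the variance.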

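# Feature V10: draw the boxplot for feature V10
# (ax3, colors and new_df come from the surrounding figure setup, which is not shown here)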
sns.boxplot(x="Class", y="V10", data=new_df, ax=ax3, palette=colors)
ax3.set_title("V10 Feature \n Reduction of outliers", fontsize=14)
ax3.annotate('Fewer extreme \n outliers', xy=(0.95, -16.5), xytext=(0, -12),
            arrowprops=dict(facecolor='black'),
            fontsize=14)

Removing "extreme outliers" from features that have a high correlation with the class has a positive impact on model accuracy. (Removing outliers improves model performance considerably!)

The outlier detection below removes outliers and improves performance.

After implementing outlier reduction our accuracy has been improved by over 3%! Some outliers can distort the accuracy of our models but remember, we have to avoid an extreme amount of information loss or else our model runs the risk of underfitting.

Outlier detection method

Interquartile Range Method:

  • Interquartile Range (IQR): We calculate this as the difference between the 75th percentile and the 25th percentile. Our aim is to create a threshold beyond the 75th and 25th percentiles so that any instance passing this threshold is deleted.
  • Boxplots: Besides making the 25th and 75th percentiles easy to see (both ends of the box), boxplots also make extreme outliers easy to see (points beyond the lower and upper extremes).
# -----> V12 removing outliers from fraud transactions
v12_fraud = new_df['V12'].loc[new_df['Class'] == 1].values
q25, q75 = np.percentile(v12_fraud, 25), np.percentile(v12_fraud, 75)
v12_iqr = q75 - q25

v12_cut_off = v12_iqr * 1.5
v12_lower, v12_upper = q25 - v12_cut_off, q75 + v12_cut_off
print('V12 Lower: {}'.format(v12_lower))
print('V12 Upper: {}'.format(v12_upper))
outliers = [x for x in v12_fraud if x < v12_lower or x > v12_upper]
print('V12 outliers: {}'.format(outliers))
print('Feature V12 Outliers for Fraud Cases: {}'.format(len(outliers)))
new_df = new_df.drop(new_df[(new_df['V12'] > v12_upper) | (new_df['V12'] < v12_lower)].index)
print('Number of Instances after outliers removal: {}'.format(len(new_df)))
print('----' * 44)

Outlier Removal Tradeoff:

We have to be careful about how far we set the threshold for removing outliers. We determine the threshold by multiplying a number (e.g. 1.5) by the interquartile range. The higher this threshold is, the fewer outliers it will detect (multiplying by a higher number, e.g. 3), and the lower this threshold is, the more outliers it will detect.

The Tradeoff: The lower the threshold, the more outliers it will remove. However, we want to focus on "extreme outliers" rather than just outliers. Why? Because we might run the risk of information loss, which will cause our models to have a lower accuracy. You can play with this threshold and see how it affects the accuracy of our classification models.
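
The tradeoff can be explored with a small helper that counts how many points fall outside the cut-offs for different multipliers. This is a sketch on a synthetic heavy-tailed feature; `iqr_bounds` and `count_outliers` are hypothetical helper names, not functions from the source notebook.

```python
import numpy as np

def iqr_bounds(values, multiplier=1.5):
    """Return the (lower, upper) cut-offs at multiplier * IQR beyond the quartiles."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = (q75 - q25) * multiplier
    return q25 - cut_off, q75 + cut_off

def count_outliers(values, multiplier):
    """Count the values lying outside the IQR cut-offs."""
    lower, upper = iqr_bounds(values, multiplier)
    return int(((values < lower) | (values > upper)).sum())

# Synthetic heavy-tailed feature standing in for something like V12.
rng = np.random.default_rng(1)
feature = rng.standard_t(df=3, size=2_000)

for multiplier in (1.5, 3.0):
    print(multiplier, count_outliers(feature, multiplier))
```

The larger multiplier flags no more points than the smaller one, so it keeps more of the data and targets only the most extreme values.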

Summary:

  • Visualize Distributions: We first start by visualizing the distribution of the feature we are going to use to eliminate some of the outliers. V14 is the only feature that has a Gaussian distribution compared to features V12 and V10.
  • Determining the threshold: After we decide which number to multiply the IQR by (the lower the number, the more outliers are removed), we determine the lower and upper thresholds by subtracting the cut-off from q25 (lower extreme threshold) and adding it to q75 (upper extreme threshold).
  • Conditional Dropping: Lastly, we create a conditional drop stating that if an instance exceeds the threshold at either extreme, it will be removed.
  • Boxplot Representation: Visualize through the boxplot that the number of "extreme outliers" has been reduced considerably.

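Experimental result: logistic regression performs best.

Classifiers (UnderSampling): undersampling reduces the majority class to the same size as the minority class.
In most cases the Logistic Regression classifier is more accurate than the other three classifiers. (We will analyze Logistic Regression further.)
GridSearchCV is used to determine the parameters that give the best predictive score for each classifier.
Logistic Regression has the best ROC score, meaning it separates fraud and non-fraud transactions quite accurately.
Learning curves:
The wider the gap between the training score and the cross-validation score, the more likely the model is overfitting (high variance).
If the score is low on both the training and cross-validation sets, the model is underfitting (high bias).
The Logistic Regression classifier shows the best score on both the training and cross-validation sets.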
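
The train/validation gap mentioned above can be computed with scikit-learn's `learning_curve`. The sketch below uses synthetic balanced data from `make_classification` as a stand-in for the undersampled training set, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic balanced data standing in for the undersampled training set.
X, y = make_classification(n_samples=600, n_features=10, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, train_sizes=np.linspace(0.2, 1.0, 5),
)

# Average over the 5 folds, then look at the train/validation gap per training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(size):4d} train={tr:.3f} cv={va:.3f} gap={tr - va:.3f}")
```

A gap that stays large as the training size grows points to high variance, while two low curves that sit close together point to high bias.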

# Split the dataset
# Undersampling before cross validating (prone to overfit)
X = new_df.drop('Class', axis=1)
y = new_df['Class']

# Our data is already scaled we should split our training and test sets
from sklearn.model_selection import train_test_split

# This is explicitly used for undersampling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Turn the values into an array for feeding the classification algorithms.
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

Logistic Regression

# Use GridSearchCV to find the best parameters.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Logistic Regression
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# liblinear supports both the 'l1' and 'l2' penalties; the default lbfgs solver rejects 'l1'.
grid_log_reg = GridSearchCV(LogisticRegression(solver='liblinear'), log_reg_params)
grid_log_reg.fit(X_train, y_train)
# We automatically get the logistic regression with the best parameters.
log_reg = grid_log_reg.best_estimator_

log_reg_score = cross_val_score(log_reg, X_train, y_train, cv=5) # 5-fold cross-validation
print('Logistic Regression Cross Validation Score: ', round(log_reg_score.mean() * 100, 2).astype(str) + '%')

# Logistic Regression Cross Validation Score:  93.66%

SVM

# Support Vector Classifier
from sklearn.svm import SVC

svc_params = {'C': [0.5, 0.7, 0.9, 1], 'kernel': ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(SVC(), svc_params)
grid_svc.fit(X_train, y_train)

# SVC best estimator
svc = grid_svc.best_estimator_

svc_score = cross_val_score(svc, X_train, y_train, cv=5)
print('Support Vector Classifier Cross Validation Score', round(svc_score.mean() * 100, 2).astype(str) + '%')

#Support Vector Classifier Cross Validation Score 93.92%

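KNN

# KNears Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier

# NOTE: this parameter grid is an assumption; adjust it to your data.
knears_params = {"n_neighbors": list(range(2, 5, 1)), 'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knears = GridSearchCV(KNeighborsClassifier(), knears_params)
grid_knears.fit(X_train, y_train)

# KNears best estimator
knears_neighbors = grid_knears.best_estimator_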

knears_score = cross_val_score(knears_neighbors, X_train, y_train, cv=5)
print('Knears Neighbors Cross Validation Score', round(knears_score.mean() * 100, 2).astype(str) + '%')

#Knears Neighbors Cross Validation Score 93.66%

Decision Tree

# DecisionTree Classifier
from sklearn.tree import DecisionTreeClassifier

tree_params = {"criterion": ["gini", "entropy"], "max_depth": list(range(2,4,1)),
              "min_samples_leaf": list(range(5,7,1))}
grid_tree = GridSearchCV(DecisionTreeClassifier(), tree_params)
grid_tree.fit(X_train, y_train)

# tree best estimator
tree_clf = grid_tree.best_estimator_

tree_score = cross_val_score(tree_clf, X_train, y_train, cv=5)
print('DecisionTree Classifier Cross Validation Score', round(tree_score.mean() * 100, 2).astype(str) + '%')

# DecisionTree Classifier Cross Validation Score 91.81%

# Predict, then compute the AUC and the ROC curves
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict
# Collect cross-validated predictions from each classifier.

log_reg_pred = cross_val_predict(log_reg, X_train, y_train, cv=5,
                             method="decision_function")

knears_pred = cross_val_predict(knears_neighbors, X_train, y_train, cv=5)

svc_pred = cross_val_predict(svc, X_train, y_train, cv=5,
                             method="decision_function")

tree_pred = cross_val_predict(tree_clf, X_train, y_train, cv=5)

from sklearn.metrics import roc_auc_score

print('Logistic Regression: ', roc_auc_score(y_train, log_reg_pred))
print('KNears Neighbors: ', roc_auc_score(y_train, knears_pred))
print('Support Vector Classifier: ', roc_auc_score(y_train, svc_pred))
print('Decision Tree Classifier: ', roc_auc_score(y_train, tree_pred))

log_fpr, log_tpr, log_threshold = roc_curve(y_train, log_reg_pred)
knear_fpr, knear_tpr, knear_threshold = roc_curve(y_train, knears_pred)
svc_fpr, svc_tpr, svc_threshold = roc_curve(y_train, svc_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train, tree_pred)

Using the original data leads to overfitting, so we train on a subsample (undersampling).

# We will undersample during cross-validation.
from collections import Counter

from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.under_sampling import NearMiss
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Assumed splitter: the definition of sss is not shown in this post.
sss = StratifiedKFold(n_splits=5, shuffle=False)

undersample_X = df.drop('Class', axis=1)
undersample_y = df['Class']

for train_index, test_index in sss.split(undersample_X, undersample_y):
    print("Train:", train_index, "Test:", test_index)
    undersample_Xtrain, undersample_Xtest = undersample_X.iloc[train_index], undersample_X.iloc[test_index]
    undersample_ytrain, undersample_ytest = undersample_y.iloc[train_index], undersample_y.iloc[test_index]

undersample_Xtrain = undersample_Xtrain.values
undersample_Xtest = undersample_Xtest.values
undersample_ytrain = undersample_ytrain.values
undersample_ytest = undersample_ytest.values

undersample_accuracy = []
undersample_precision = []
undersample_recall = []
undersample_f1 = []
undersample_auc = []

# Implementing the NearMiss technique
# Distribution of NearMiss (just to see how it distributes the labels; we won't use these variables)
# fit_resample is the current name of the method formerly called fit_sample.
X_nearmiss, y_nearmiss = NearMiss().fit_resample(undersample_X.values, undersample_y.values)
print('NearMiss Label Distribution: {}'.format(Counter(y_nearmiss)))
# NearMiss Label Distribution: Counter({0: 492, 1: 492})

# Cross-validating the right way
for train, test in sss.split(undersample_Xtrain, undersample_ytrain):
    # NearMiss undersampling happens during cross-validation, not before.
    undersample_pipeline = imbalanced_make_pipeline(NearMiss(sampling_strategy='majority'), log_reg)
    undersample_model = undersample_pipeline.fit(undersample_Xtrain[train], undersample_ytrain[train])
    undersample_prediction = undersample_model.predict(undersample_Xtrain[test])

    # Score on the held-out fold, which keeps the original (imbalanced) class distribution.
    undersample_accuracy.append(undersample_pipeline.score(undersample_Xtrain[test], undersample_ytrain[test]))
    undersample_precision.append(precision_score(undersample_ytrain[test], undersample_prediction))
    undersample_recall.append(recall_score(undersample_ytrain[test], undersample_prediction))
    undersample_f1.append(f1_score(undersample_ytrain[test], undersample_prediction))
    undersample_auc.append(roc_auc_score(undersample_ytrain[test], undersample_prediction))

# Results
"""
Overfitting: (undersampled before cross-validation and scored on the undersampled test set)

Recall Score: 0.91
Precision Score: 0.74
F1 Score: 0.82
Accuracy Score: 0.80
----------------------------------------------------------------------------------
----------------------------------------------------------------------------------
How it should be: (undersampling done inside cross-validation and scored on folds of the original data)

Accuracy Score: 0.58
Precision Score: 0.00
Recall Score: 0.38
F1 Score: 0.00
"""

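The recall and confusion matrix above were computed on test data produced by undersampling. Because the undersampled data is tiny compared with the original data, that test result is not very convincing, so we should test on the original data (data that has not been undersampled).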
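
A back-of-the-envelope calculation shows why scores on a 1:1 test set flatter the model. The class counts are those of this dataset; the recall and false-positive rate are hypothetical values chosen only for illustration, and `precision_at` is a made-up helper.

```python
# Class counts from the dataset description.
n_total, n_fraud = 284_807, 492
n_normal = n_total - n_fraud

# Hypothetical classifier quality, chosen only for illustration.
recall = 0.90               # share of frauds that are caught
false_positive_rate = 0.05  # share of normal transactions flagged as fraud

def precision_at(n_pos, n_neg):
    """Precision implied by the fixed recall / false-positive rate on a given class mix."""
    true_pos = recall * n_pos
    false_pos = false_positive_rate * n_neg
    return true_pos / (true_pos + false_pos)

precision_balanced = precision_at(n_fraud, n_fraud)   # 1:1 undersampled test set
precision_original = precision_at(n_fraud, n_normal)  # original class distribution

print(f"balanced test set: {precision_balanced:.3f}")
print(f"original test set: {precision_original:.3f}")
```

The same classifier looks precise on the balanced test set and very imprecise on the original distribution, because the many normal transactions generate far more false alarms than there are frauds to catch.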

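The above handles the imbalanced data with undersampling.

Unlike undersampling, which reduces the data, oversampling takes a different approach:

Oversampling: generate additional samples for the minority class until it matches the size of the majority class.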
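
The simplest form of this idea is random oversampling, which just duplicates existing minority rows. The sketch below is that plain variant, not SMOTE (which synthesizes new points instead of copying); `random_oversample` is a hypothetical helper for binary labels, and the toy data is made up.

```python
import numpy as np

def random_oversample(X, y, random_state=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()

    # Sample minority row indices with replacement and append the copies.
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

    X_res = np.vstack([X, X[extra_idx]])
    y_res = np.concatenate([y, y[extra_idx]])
    return X_res, y_res

# Toy data: 95 normal rows and 5 fraud rows.
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 95 + [1] * 5)

X_over, y_over = random_oversample(X_toy, y_toy)
print(np.bincount(y_over))
```

As with undersampling, this should be applied to the training folds only, so that the duplicated rows never leak into the data used for evaluation.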

The next part analyzes oversampling with the SMOTE algorithm.


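Summary:

The overall idea: given the data, how to look for patterns and which model to choose for building an anti-fraud model. Describe the relationship between the label and each feature, and find the features that affect the label most, i.e. the most correlated ones.

One more takeaway:

hyperopt parameter optimization

Grid search scans the whole space, so it is relatively slow. hyperopt is a tool that tunes parameters through Bayesian optimization; for algorithms with many parameters, such as XGBoost, it can be used to find good parameter values.

hyperopt needs a search space for each parameter rather than a list of values as in grid search, e.g. parameter x sampled uniformly in the interval 0-1 and parameter y sampled log-uniformly between 0 and 1. You can then specify the search algorithm for the optimization, such as random search (hyperopt.rand.suggest), simulated annealing (hyperopt.anneal.suggest), or the TPE algorithm.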
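
To show what "a search space instead of a list of values" means, here is a minimal random search written in plain Python. It is not hyperopt itself: the objective is a made-up function, and the lower bound 1e-4 for y is an assumption, since a log-uniform range cannot start at 0. In hyperopt the space would be declared with expressions such as `hp.uniform` and `hp.loguniform` and passed to `fmin` together with the chosen suggest algorithm.

```python
import math
import random

def objective(params):
    """Hypothetical loss with its minimum at x = 0.3, y = 0.01."""
    return (params["x"] - 0.3) ** 2 + (math.log10(params["y"]) + 2) ** 2

def sample_space(rng):
    """Draw one point: x uniform in [0, 1], y log-uniform in [1e-4, 1]."""
    return {
        "x": rng.uniform(0.0, 1.0),
        "y": 10 ** rng.uniform(-4.0, 0.0),
    }

def random_search(n_trials=200, seed=0):
    """Evaluate n_trials random points and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_space(rng)
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search()
print(best_params, best_loss)
```

Sampling y on a log scale spends as many trials between 0.0001 and 0.001 as between 0.1 and 1, which suits parameters such as regularization strengths; a model-based algorithm like TPE would additionally use past trials to decide where to sample next.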


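Reference:

Analyzing credit card anti-fraud with Python: two sampling methods for handling imbalanced data, with effect analysis and a model-tuning example

Some interesting things I have not read yet:

Applications of graph models in anti-fraud

Ant Financial ATEC AI Competition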
