Creditcard_prediction — a small practice project

Required libraries and environment

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import sklearn

print('pandas:',pd.__version__)
print('matplotlib:',matplotlib.__version__)
print('numpy:',np.__version__)
print('sklearn:',sklearn.__version__)
pandas: 0.23.4
matplotlib: 2.2.3
numpy: 1.16.4
sklearn: 0.22.2.post1

Load the data and inspect the columns

data = pd.read_csv('creditcard.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB

From the output above, no column has missing values, and every column is already numeric, so no discretization or imputation is needed.
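The no-missing-values claim can be checked directly with `isnull()`. A minimal sketch on a toy frame standing in for creditcard.csv (the values here are illustrative; the real file has 284807 rows and 31 numeric columns):

```python
import pandas as pd

# toy stand-in for creditcard.csv
toy = pd.DataFrame({'V1': [0.1, -0.2, 0.3],
                    'Amount': [5.6, 22.0, 77.2],
                    'Class': [0, 0, 1]})

missing = toy.isnull().sum().sum()   # total missing cells across all columns
print('missing cells:', missing)     # 0 -> no imputation needed
print(toy.dtypes.unique())           # all numeric dtypes -> no discretization needed
```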

Next, take another look at the distribution of each feature:

data.describe()
(output of data.describe(), 8 rows × 31 columns; abridged here to five representative columns)

             Time        V1        V28     Amount     Class
count  284807.00  284807.0  284807.0  284807.00  284807.00
mean    94813.86  3.92e-15 -1.21e-16      88.35     0.0017
std     47488.15  1.96e+00  3.30e-01     250.12     0.0415
min         0.00 -5.64e+01 -1.54e+01       0.00     0.0000
25%     54201.50 -9.20e-01 -5.30e-02       5.60     0.0000
50%     84692.00  1.81e-02  1.12e-02      22.00     0.0000
75%    139320.50  1.32e+00  7.83e-02      77.17     0.0000
max    172792.00  2.45e+00  3.38e+01   25691.16     1.0000

From the table above, features V1 through V28 are on a fairly uniform scale. The Time feature is a monotonically increasing counter, so it is unsuitable as a training feature and is dropped. Whether the Amount feature needs standardization is left to be judged from training and test accuracy later.
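If Amount does turn out to need scaling, one common option is a z-score transform. A hedged sketch using sklearn's StandardScaler on made-up Amount-like values (this is not the project's actual preprocessing, just an illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up Amount-like values with a heavy right tail
amount = np.array([[5.6], [22.0], [77.165], [250.0], [25691.16]])

scaled = StandardScaler().fit_transform(amount)  # (x - mean) / std per column
print(round(scaled.mean(), 6), round(scaled.std(), 6))  # ~0.0 and 1.0
```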

print('0:{:d}, 1:{:d}'.format(sum(data.Class==0),sum(data.Class==1)))
0:284315, 1:492

The Class label takes only the values 0 (normal) and 1 (fraud), and the counts above show the labels are severely imbalanced.

Building the training and test sets

Because the original labels are so imbalanced, the test set is balanced 1:1 so that its accuracy is meaningful: 50 samples of class 0 and 50 samples of class 1.

data_fixed = data.drop(['Time'], axis=1)

# shuffle each class separately
data_pos = data_fixed[data_fixed['Class'].values == 0].sample(frac=1).reset_index(drop=True)
data_neg = data_fixed[data_fixed['Class'].values == 1].sample(frac=1).reset_index(drop=True)
# first 50 rows of each class form the balanced test set; the remainder is the training set
data_train = pd.concat([data_neg.iloc[50:, :], data_pos.iloc[50:, :]]).sample(frac=1).reset_index(drop=True)
data_test = pd.concat([data_neg.iloc[:50, :], data_pos.iloc[:50, :]]).sample(frac=1).reset_index(drop=True)
data_train.to_csv('creditcard_train.csv')
data_test.to_csv('creditcard_test.csv')
# last column is Class; everything before it is the feature matrix
X_train, y_train = data_train.iloc[:, :-1], data_train.iloc[:, -1]
X_test, y_test = data_test.iloc[:, :-1], data_test.iloc[:, -1]
print('0:{:d}, 1:{:d}'.format(sum(y_test==0),sum(y_test==1)))
0:50, 1:50
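The same shuffle/concat/iloc split pattern, reproduced on a toy frame. The `random_state` arguments are an addition here for reproducibility (the split above is unseeded, so its exact rows vary run to run):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
toy = pd.DataFrame({'V1': rng.normal(size=120),
                    'Class': [0] * 100 + [1] * 20})

# shuffle each class separately, then peel off the first 5 rows of each for the test set
pos = toy[toy['Class'] == 0].sample(frac=1, random_state=0).reset_index(drop=True)
neg = toy[toy['Class'] == 1].sample(frac=1, random_state=0).reset_index(drop=True)
test = pd.concat([neg.iloc[:5], pos.iloc[:5]]).sample(frac=1, random_state=0).reset_index(drop=True)
train = pd.concat([neg.iloc[5:], pos.iloc[5:]]).sample(frac=1, random_state=0).reset_index(drop=True)

print('0:{:d}, 1:{:d}'.format(sum(test.Class == 0), sum(test.Class == 1)))  # 0:5, 1:5
```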

Training stage

First, define a helper that trains a model and returns its test-set accuracy together with the confusion matrix:

from sklearn.metrics import confusion_matrix
def model_train(model):
    model = model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    matrix = confusion_matrix(y_test, y_predict)
    return (sum(y_predict == y_test)/len(y_test)),matrix
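A quick sanity check of this helper's logic on a synthetic, linearly separable problem; the data, the classifier choice, and the explicit data arguments are illustrative only (the helper above reads the globals instead):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
Xtr = rng.normal(size=(200, 3)); ytr = (Xtr[:, 0] > 0).astype(int)
Xte = rng.normal(size=(40, 3));  yte = (Xte[:, 0] > 0).astype(int)

def run(model, X_train, y_train, X_test, y_test):
    # same logic as model_train, with the data passed in explicitly
    model = model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    return (y_predict == y_test).mean(), confusion_matrix(y_test, y_predict)

acc, cm = run(DecisionTreeClassifier(max_depth=3), Xtr, ytr, Xte, yte)
print(acc)       # close to 1.0 on this separable toy problem
print(cm.sum())  # 40 test samples in total
```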
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

NBM = [KNeighborsClassifier(n_neighbors=6, n_jobs=8), 
       GaussianNB(), 
       DecisionTreeClassifier(max_depth=5, min_samples_split=5), 
       RandomForestClassifier(n_estimators= 100, max_depth=10, n_jobs=8),
       RandomForestClassifier(n_estimators= 100, max_depth=10, n_jobs=8, class_weight='balanced'),
       xgb.XGBClassifier(tree_method = "hist", n_estimators=100, n_jobs = 8)]
NAME= ["KNN", "GNB", "DCT", "RF", "RF_Balanced", "XGBT"]

for itr, itrname in zip(NBM, NAME):
    acc, con_matrix = model_train(itr)
    print(itrname+' '+str(acc*100)+'%\n',con_matrix)
KNN 83.0%
 [[50  0]
 [17 33]]
GNB 91.0%
 [[49  1]
 [ 8 42]]
DCT 91.0%
 [[50  0]
 [ 9 41]]
RF 91.0%
 [[50  0]
 [ 9 41]]
RF_Balanced 91.0%
 [[50  0]
 [ 9 41]]
XGBT 92.0%
 [[50  0]
 [ 8 42]]

Because the dataset is fairly large, the chosen models all train quickly and support multithreading. The results are fairly even across models, and all of them share the same failure mode: class 0 is predicted almost perfectly, while every error falls on the fraud class. This is a consequence of the label imbalance; in real fraud detection, fraudulent transactions are indeed the minority. Since this dataset is quite clean, and everyday experience suggests fraudulent transactions differ substantially from normal ones (i.e. the two classes are well separated in feature space), the overall predictions still come out reasonably well.
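Given the imbalance just discussed, per-class metrics say more than overall accuracy. A sketch that reads fraud-class recall and precision directly off the XGBT confusion matrix reported above:

```python
import numpy as np

# confusion matrix from the XGBT run above: rows = true class, columns = predicted
cm = np.array([[50, 0],
               [8, 42]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)     # fraction of frauds caught: 42/50
precision = tp / (tp + fp)  # fraction of fraud alerts that are real: 42/42
print('recall: {:.2f}, precision: {:.2f}'.format(recall, precision))  # recall: 0.84, precision: 1.00
```

So XGBT catches 84% of frauds with no false alarms on this balanced test set, which is the figure of interest here rather than the 92% accuracy.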
