Credit Card Fraud Detection in Python

Latest recommended article published on 2023-01-24 17:48:06

小锐 -> Technology makes dreams come true, and dreams create brilliance.

Category: python  Tags: python, machine learning, sklearn

Article link: https://blog.csdn.net/weixin_56636204/article/details/122418541

# Build a logistic regression model to classify the two classes of data
# 6.1.1 Reading and inspecting the data
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

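# Load the data
data = pd.read_csv('跟着迪哥学.001/creditcard.csv')
# Show the first five rows
print(data.head())
# print(data.shape)
# Plot a bar chart comparing the number of normal and fraud samples
# (Series.value_counts replaces the deprecated top-level pd.value_counts)
count_classes = data['Class'].value_counts().sort_index()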
count_classes.plot(kind='bar')
plt.title('Fraud class histogram')
plt.xlabel('class')
plt.ylabel('Frequency')
plt.show()
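# 6.1.2 Handling class imbalance
# Undersampling: fraud samples are scarce, so reduce the normal samples until both classes are equally small.
# Oversampling: generate synthetic fraud samples; data generation is a common approach at present.
# 6.1.3 Feature standardization
# z = (x - mean(x)) / std(x)
# where z is the standardized value, x is the original value, mean(x) is the mean of the original data and std(x) is its standard deviation
# Use scikit-learn to standardize the feature
# Import the module first
# preprocessing: the data preprocessing module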
from sklearn.preprocessing import StandardScaler

data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time', 'Amount'], axis=1)
print(data.head())
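# StandardScaler standardizes the data: import it, then call fit_transform. reshape(-1, 1) turns the values into
# a single column, which is the 2-D shape the function requires. drop removes the features that are no longer needed.
# The normAmount column in the output is the standardized result; its values now vary within a small range.
# 6.2 Undersampling
# Every column except the label is a feature
X = data.loc[:, data.columns != 'Class']
# Label
y = data.loc[:, data.columns == 'Class']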
number_records_fraud = len(data[data.Class == 1])
# Indices of all fraud samples
fraud_indices = np.array(data[data.Class == 1].index)
# Indices of all normal samples
normal_indices = np.array(data[data.Class == 0].index)
# Randomly draw as many normal samples as there are fraud samples, without replacement
# (np.random.choice already returns an ndarray)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
# print(random_normal_indices)
# Combine the fraud indices with the sampled normal indices
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
# Select the undersampled rows; the indices are index labels, so use loc rather than iloc
under_sample_data = data.loc[under_sample_indices, :]
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
# Print the class proportions after undersampling
print('Proportion of normal samples:', len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
print('Proportion of fraud samples:', len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
print('Total number of samples after undersampling:', len(under_sample_data))
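# Cross-validation
# Split the data into a training set and a test set, then split the training set further into folds. Each fold serves once
# as the validation set, the scores are averaged to choose the model, and only at the very end is the test set used.
# Import the data-splitting function
from sklearn.model_selection import train_test_split

# Split the full dataset: X is the feature data, y the labels, test_size the fraction held out for testing,
# and random_state the random seed, so the split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print('Number of samples in the original training set:', len(X_train))
print('Number of samples in the original test set:', len(X_test))
print('Total number of original samples:', len(X_train) + len(X_test))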

# Split the undersampled dataset
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = train_test_split(X_undersample,
                                                                                                    y_undersample,
                                                                                                    test_size=0.3,
                                                                                                    random_state=0)

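print('Number of samples in the undersampled training set:', len(X_train_undersample))
print('Number of samples in the undersampled test set:', len(X_test_undersample))
print('Total number of undersampled samples:', len(X_train_undersample) + len(X_test_undersample))

# Model evaluation
# Accuracy is the most commonly used classification metric: the fraction of all samples that are predicted correctly.
# Recall focuses on one target class: of the samples that truly belong to it, the fraction the model finds,
# TP / (TP + FN), rather than a score over the whole dataset.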

# Logistic regression model
# 6.3.1 How the parameters affect the results
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, classification_report
import warnings
warnings.filterwarnings("ignore")

def printing_Kfold_scores(x_train_data, y_train_data):
    # 5-fold cross-validation
    fold = KFold(5, shuffle=False)
    # Candidate values of C; in scikit-learn C is the inverse of the regularization strength,
    # so a smaller C means a stronger penalty
    c_param_range = [0.01, 0.1, 1, 10, 100]
    # Table for the results, one row per candidate C
    results_table = pd.DataFrame(index=range(len(c_param_range)), columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range
    # fold.split yields a pair of index arrays per fold: training set = indices[0], validation set = indices[1]
    j = 0
    # Loop over the candidate parameters
    for c_param in c_param_range:
        print('-------------')
        print('Regularization parameter C:', c_param)
        print('-------------')
        print('')
        recall_accs = []
        # Run the cross-validation step by step

        for iteration, indices in enumerate(fold.split(y_train_data), start=1):
            # Create the model with the given parameter
            lr = LogisticRegression(C=c_param, penalty='l2')
            # Fit on the training part of the fold, so both X and y use indices[0]
            lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
            # Predict on the validation part of the fold, indices[1]; pass a DataFrame so the feature names match fit
            y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :])
            # recall_score takes the true labels first, then the predictions
            recall_acc = recall_score(y_train_data.iloc[indices[1], :].values.ravel(), y_pred_undersample)
            # Store the score so it can be averaged later
            recall_accs.append(recall_acc)
            print('Iteration', iteration, ': recall =', recall_acc)
        # After all folds are done, record the mean recall for this C
        results_table.loc[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('')
        print('Mean recall:', np.mean(recall_accs))
        print('')
    # Only after every C has been evaluated, pick the one with the highest mean recall
    best_c = results_table.loc[results_table['Mean recall score'].astype('float64').idxmax(), 'C_parameter']
    # Print the best result
    print('**********************************')
    print('Best C parameter = ', best_c)
    print('**********************************')
    return best_c


best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
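
The model evaluation comments above distinguish accuracy from recall. The sketch below illustrates why recall is the more informative score on imbalanced data. It uses hypothetical toy labels (95 normal, 5 fraud) and a hypothetical classifier that catches only one fraud; neither comes from creditcard.csv. The counts are computed by hand so that the definitions stay visible.

```python
# Hypothetical toy labels: 95 normal (0) and 5 fraud (1) transactions
y_true = [0] * 95 + [1] * 5
# Hypothetical predictions: every normal sample is correct, but only 1 of the 5 frauds is flagged
y_pred = [0] * 95 + [1] + [0] * 4

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # frauds correctly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # frauds missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal samples correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal samples wrongly flagged

accuracy = (tp + tn) / len(y_true)  # fraction of all samples predicted correctly
recall = tp / (tp + fn)             # fraction of true frauds that were found

print('accuracy:', accuracy)
print('recall:', recall)
```

Accuracy stays high because the normal class dominates, while recall shows that most frauds were missed. This is the reason the cross-validation loop above scores each C by recall.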
