Logistic Regression: Debugging Notes on Credit Card Fraud Detection Code (Binary Classification)


Category column: Python机器学习入门 (Introduction to Python Machine Learning)

Article link: https://blog.csdn.net/yuyue_chn/article/details/103173369


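# conda list
# conda install numpy
# anaconda search -t conda sklearn
# anaconda show
# pip install scikit-learn # pip install imbalanced-learn (the imbalanced-data module, imported as imblearn)

The commands above search for or install some of the libraries used here.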

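What %matplotlib does

%matplotlib is mostly seen in code written for Jupyter Notebook or Jupyter QtConsole, so a script that contains it was probably edited in one of those tools. For what Jupyter Notebook is, see: [Jupyter Notebook introduction, installation, and usage tutorial][1]
Specifically, with %matplotlib in effect, calling a matplotlib.pyplot plotting function such as plot(), or creating a figure canvas, renders the image directly in your Python console.

When actually running the code in Spyder or PyCharm, you can simply comment this line out and the code still runs (in a plain .py script the magic is not valid Python syntax, so it has to be commented out). Link: https://www.jianshu.com/p/2dda5bb8ce7d Author: hplllrhp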

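1. count_classes = pd.value_counts(data['Class'], sort = True).sort_index()

pd.value_counts() counts and sorts: it counts each distinct value (0/1/2) separately. sort=True is the default and orders the result by count, largest first, for example 6-3-2.

.sort_index() then reorders the counts by index (the class value), giving the order 0-1-2.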
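
As a small illustration of the two orderings, the sketch below uses a made-up `Class` column (the values are hypothetical, not the credit card data) and the `Series.value_counts()` method, which is the method form of `pd.value_counts()`.

```python
import pandas as pd

# Hypothetical toy labels: six 0s, two 1s, three 2s
data = pd.DataFrame({"Class": [0, 2, 0, 1, 0, 2, 0, 0, 1, 2, 0]})

# Default sort=True: ordered by count, largest first
by_count = data["Class"].value_counts()

# sort_index(): reordered by the class value itself
by_class = by_count.sort_index()

print(by_count.index.tolist(), by_count.tolist())
print(by_class.index.tolist(), by_class.tolist())
```

Sorting by index keeps the classes in a fixed order regardless of which one happens to be more frequent, which is convenient when the counts are plotted as a bar chart.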

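2. Handling the imbalance between the two classes

1) Undersampling: randomly draw from the majority class as many samples as the minority class has, then recombine the two classes into a new sample set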

# Remove the order-of-magnitude difference of one feature (data preprocessing); the standard scaler is used here
from sklearn.preprocessing import StandardScaler
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
# Error hit while debugging: TypeError: list indices must be integers or slices, not str
# .values is required to convert the column to an array; -1 lets the row count be inferred, 1 fixes the column count

# Number of data points in the minority class
number_records_fraud = len(data[data.Class == 1])  # number of minority-class samples
fraud_indices = np.array(data[data.Class == 1].index)  # indices of the minority-class samples

# Picking the indices of the normal classes
normal_indices = data[data.Class == 0].index  # indices of the candidate majority-class (Class 0) samples

# Out of the indices we picked, randomly select "x" number (number_records_fraud)
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace = False)  # sample without replacement
# Random choice function: 1) where to choose from, 2) how many to choose
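
The code above stops after the random draw. A minimal end-to-end sketch of the remaining "recombine" step might look like the following. The toy DataFrame and the names `under_sample_indices`, `under_sample_data`, `X_undersample`, and `y_undersample` are assumptions made for illustration, not taken from the original code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: 95 normal rows (Class 0) and 5 fraud rows (Class 1)
data = pd.DataFrame({
    "normAmount": rng.normal(size=100),
    "Class": [0] * 95 + [1] * 5,
})

fraud_indices = np.array(data[data.Class == 1].index)
normal_indices = np.array(data[data.Class == 0].index)

# Draw as many normal rows as there are fraud rows, without replacement
random_normal_indices = rng.choice(normal_indices, size=len(fraud_indices), replace=False)

# Recombine both index sets and pull the matching rows
under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
under_sample_data = data.loc[under_sample_indices]

X_undersample = under_sample_data.drop(columns="Class")
y_undersample = under_sample_data["Class"]

print(y_undersample.value_counts().to_dict())
```

Selecting by index rather than by position keeps each sampled row tied to its original label, so the features and the `Class` column stay aligned after recombination.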

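2) Oversampling strategy: bring the minority class up to the size of the majority class by generating samples with SMOTE

from imblearn.over_sampling import SMOTE

oversampler=SMOTE(random_state=0)  # the generated data is the same on every run
os_features,os_labels=oversampler.fit_resample(features_train,labels_train)  # fit_sample was renamed fit_resample in newer imbalanced-learn; resample the training set only, after which labels 1 and 0 have equal counts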
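
To show the idea behind SMOTE-style sample generation, here is a simplified NumPy sketch. It is not imbalanced-learn's implementation: the function name `smote_like_samples`, its parameters, and the toy points are all hypothetical. Each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating toward near neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))        # pick a minority sample
        j = rng.choice(neighbours[i])          # pick one of its k nearest neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# Hypothetical minority class: the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points.shape)
```

Because every new point lies between two existing minority samples, the synthetic data stays inside the region the minority class already occupies instead of duplicating rows exactly.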

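3. Cross-validation

1) Splitting into training and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)
# Split the data with 30% held out as the test set; the split is random (shuffled); random_state = 0 makes every run produce the same split
# After the split, each row of X still corresponds to the same sample's entry in y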
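
The two comments above can be checked on a small example. The sketch below assumes a toy array in which each label is recoverable from its feature row, so the alignment of X and y after the split is easy to verify; the data is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features; row i is [2*i, 2*i + 1], label is i
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)

# X[:, 0] // 2 recovers the label, so this checks that rows and labels stayed paired
print(np.array_equal(X_test[:, 0] // 2, y_test))

# The same random_state reproduces the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_test, X_test_again))
```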

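2) Defining the cross-validation function

from sklearn.model_selection import KFold, cross_val_score

fold = KFold(5,shuffle=False)  # 5-fold cross-validation; the new library (sklearn.model_selection) no longer takes n, the number of samples

# the k-fold will give 2 lists: train_indices = indices[0], test_indices = indices[1] (with the new library, iterate over fold.split(X)); in each round, k-1 of the folds form the training set and the remaining fold is the test set, for k rounds in total

The goal is to learn from the training set and determine the best model parameters, which can be judged with metrics such as Accuracy, Precision, and Recall.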
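
A minimal sketch of this parameter search, assuming the parameter being tuned is the regularization strength `C` of `LogisticRegression` and that recall is the metric of interest. The toy data and the candidate values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical toy data: one informative feature; labels alternate so every fold holds both classes
rng = np.random.default_rng(0)
y = np.tile([0, 1], 50)
X = (y + rng.normal(scale=0.5, size=100)).reshape(-1, 1)

fold = KFold(5, shuffle=False)

# fold.split yields one (train_indices, test_indices) pair per round
for train_indices, test_indices in fold.split(X):
    print(len(train_indices), len(test_indices))

# Mean recall over the 5 folds for each candidate value of C
mean_recall = {}
for c_param in [0.01, 0.1, 1, 10, 100]:
    lr = LogisticRegression(C=c_param)
    scores = cross_val_score(lr, X, y, cv=fold, scoring="recall")
    mean_recall[c_param] = scores.mean()

best_c = max(mean_recall, key=mean_recall.get)
print(best_c, mean_recall[best_c])
```

For fraud detection, recall is a natural choice for `scoring` because a missed fraud case usually costs more than a false alarm, but any of the metrics named above could be substituted.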

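4. Confusion matrix display function

1) Confusion matrix

Purpose: evaluate the accuracy of a classifier
Function: sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None)
Inputs:

y_true: the actual target values
y_pred: the predicted values
labels: the list of labels used to index the matrix; it fixes the order of the classes (for example string labels) so that they map to positions 0, 1, 2. Both axes use this same order for their row and column numbers
sample_weight: per-sample weights (optional)

Output:

A matrix of shape [number of classes in y, number of classes in y]
Each entry counts the samples with a given true class (row) and predicted class (column)
The entry at row 0, column 0 is the number of samples where y_true is 0 and y_pred is also 0
The entry at row 0, column 1 is the number of samples where y_true is 0 and y_pred is 1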
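
The row and column layout described above can be seen on a hand-checkable example. The label lists below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 5 true negatives/positives mixed with one error of each kind
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())  # rows are true classes, columns are predicted classes

# For a binary problem, ravel() unpacks the matrix row by row
tn, fp, fn, tp = cm.ravel()
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(recall, precision)

# labels=[1, 0] puts class 1 first on both axes
cm_reordered = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm_reordered.tolist())
```

Recall and precision fall straight out of the four cells, which is why the matrix is a convenient starting point for comparing models on imbalanced data.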

2) Confusion matrix display function

# Confusion matrix display function
import itertools

import numpy as np
import matplotlib.pyplot as plt


def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    """
    # aspect: {'equal', 'auto'} or float, optional. Controls the aspect ratio
    # of the axes. The aspect is of particular relevance for images since it
    # may distort the image, i.e. pixels will not be square.
    plt.imshow(cm, aspect='equal', interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))  # produces array([0, 1])
    plt.xticks(tick_marks, classes, rotation=0)  # axis ticks
    # Debugging note: with ticks [0, 1] the y-axis was not fully displayed in the
    # author's environment, and plt.yticks([0], classes) was used as a workaround.
    # One tick for two labels is a count mismatch, so tick_marks is used here.
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.  # cm is the confusion matrix, e.g.
    # array([[56324,   537],
    #        [    9,    92]], dtype=int64)

    # range(2) == range(0, 2) is half-open, so i and j each take the values 0 and 1
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")  # i, j must match the position of the matrix element

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

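3) Displaying the result of the call

5. Custom classification decision threshold

y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)
# predict_proba returns an array with n rows and k columns: the probability of every class for each sample
# The value at row i, column j is the model's predicted probability that sample i belongs to class j (columns follow lr.classes_), and each row sums to 1.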