Machine Learning: Oversampling Large-Scale Data for Logistic Regression <9>

菜就多练_0828

Published 2024-08-21 14:28:33

Article link: https://blog.csdn.net/qq_64603703/article/details/141391561

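I. Case File

We again use the bank loan case from the previous lesson. The file looks roughly as follows (more than 280,000 rows, 31 columns):

Next, we continue from the previous lesson and optimize the model.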

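II. Oversampling Workflow

1. Workflow Diagram

2. Step-by-Step Walkthrough

1) Data split

Split the original data into a 70% training set and a 30% test set.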

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

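2) Data augmentation

Synthesize new class=1 samples from the class=1 rows of the training set until the class=1 count equals the class=0 count.

# class_1_data / class_0_data: the class=1 and class=0 rows of the training set (assumed already extracted)
# generate_augmented_data is a placeholder for the oversampling routine, not a library function

# Only the shortfall needs to be synthesized, because X_train already contains the original class=1 rows
n_new = len(class_0_data) - len(class_1_data)
augmented_data = generate_augmented_data(class_1_data, n_new)

# Once appended to X_train, class=1 and class=0 have the same number of rows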
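The snippet above calls `generate_augmented_data` without defining it. As a rough illustration only, here is a minimal NumPy sketch of how such a helper could synthesize rows in the spirit of SMOTE: pick a minority row, pick one of its nearest minority neighbours, and place a new point somewhere on the segment between them. The signature, the `k` and `seed` parameters, and the choice to pass the number of new rows as the second argument are all assumptions made for this sketch; the complete code later in the article uses `imblearn`'s `SMOTE` instead.

```python
import numpy as np


def generate_augmented_data(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows and their neighbours."""
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority rows to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Squared Euclidean distance between every pair of minority rows.
    # This needs O(n^2) memory, acceptable only while the minority class is small.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a row must not be its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest rows for each row

    base = rng.integers(0, n, size=n_new)  # random starting row per new sample
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one of its neighbours
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[pick] - minority[base])


# Toy minority class: the four corners of the unit square
class_1_data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
augmented_data = generate_augmented_data(class_1_data, n_new=6, k=2)
print(augmented_data.shape)  # (6, 2)
```

Because every synthetic row is a convex combination of two real minority rows, the new points stay inside the region the minority class already occupies rather than being exact duplicates.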

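3) Build the enlarged dataset

Merge the synthetic samples into the original training set to obtain a new, enlarged dataset. The original test set stays held out.

import numpy as np

# Append the synthetic class=1 rows (all labelled 1) to the original training set
X_train_augmented = np.concatenate((X_train, augmented_data))
y_train_augmented = np.concatenate((y_train, np.ones(len(augmented_data), dtype=int)))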

4) Split the enlarged dataset

Split the enlarged dataset into a training set and a test set.

X_train_big, X_test_big, y_train_big, y_test_big = train_test_split(X_train_augmented, y_train_augmented, test_size=0.3, random_state=0)

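5) k-fold cross-validation

Run k-fold cross-validation on the training split to select and tune the model.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_index, val_index in kf.split(X_train_big):
    # Positional indexing: assumes X_train_big and y_train_big are NumPy arrays
    X_train_fold, X_val_fold = X_train_big[train_index], X_train_big[val_index]
    y_train_fold, y_val_fold = y_train_big[train_index], y_train_big[val_index]

    # Fit a model on the training folds
    model = LogisticRegression()
    model.fit(X_train_fold, y_train_fold)
    # Score on X_val_fold / y_val_fold here to compare candidate settings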

6) Model training

Build the model and fit it on the training split of the enlarged dataset to obtain the trained model.

model = LogisticRegression()
model.fit(X_train_big, y_train_big)

7) Model testing

Feed both the test set split off earlier and the enlarged dataset itself to the trained model and evaluate the predictions.
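To make concrete what the evaluation is looking at, here is a small sketch, using only NumPy and made-up labels, of the per-class quantities that a classification report summarizes. The function name `binary_report` and the toy data are assumptions for illustration. The example shows why recall on class 1 matters more than accuracy when the classes are imbalanced: a model that always predicts 0 looks accurate yet finds none of the positives.

```python
import numpy as np


def binary_report(y_true, y_pred):
    """Accuracy, plus precision and recall for class 1, from the confusion counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # positives found
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # positives missed
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # negatives correctly rejected
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy imbalanced labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_all_zero = np.zeros(100, dtype=int)  # a "model" that always predicts class 0

report = binary_report(y_true, y_all_zero)
print(report)  # accuracy is 0.95 while recall is 0.0
```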

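8) Optimal probability threshold

Based on the test results, adjust the decision threshold to find the probability cutoff that performs best.

import numpy as np
from sklearn.metrics import f1_score

# Predicted probability of class=1 for each test row
y_pred_prob = model.predict_proba(X_test_big)[:, 1]

best_threshold = 0.5  # default cutoff, used if no candidate scores higher
best_f1_score = 0

for threshold in np.arange(0.1, 1.0, 0.1):
    # Label a row 1 when its probability exceeds the current threshold
    y_pred_test_threshold = (y_pred_prob > threshold).astype(int)
    f1_score_test = f1_score(y_test_big, y_pred_test_threshold)

    if f1_score_test > best_f1_score:
        best_f1_score = f1_score_test
        best_threshold = threshold

print("Best threshold:", best_threshold)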

III. Complete Code Implementation

import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pylab import mpl

# Plot a confusion matrix; a commonly shared helper that can be reused as-is
def cm_plot(y, yp):
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt

    cm = confusion_matrix(y, yp)
    plt.matshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    # Write each count into its cell: row = true label, column = predicted label
    for row in range(len(cm)):
        for col in range(len(cm)):
            plt.annotate(cm[row, col], xy=(col, row), horizontalalignment='center',verticalalignment='center')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt

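data = pd.read_csv(r'./creditcard.csv')
data.head()  # shows the first 5 rows by default; kept here as a quick reminder of the columns

# Set a font that can render Chinese text in plots
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']
mpl.rcParams['axes.unicode_minus'] = False

# Count the rows of each class in the Class column (the classes are extremely imbalanced)
labels_count = data['Class'].value_counts()
# Visualize the class counts
# plt.title("Positive and negative sample counts")
# plt.xlabel('Class')
# plt.ylabel('Count')
# labels_count.plot(kind='bar')
# plt.show()

# Z-score standardize the Amount column; the other columns contain negative values, so 0-1 normalization is not used
from sklearn.preprocessing import StandardScaler
# StandardScaler is the class-based standardizer, suited to large data; scale is the function form for small one-off arrays
scaler = StandardScaler()
data['Amount'] = scaler.fit_transform(data[['Amount']])  # standardize the whole Amount column and overwrite the original
data.head()

# Drop the unused Time column (axis=1 means columns, axis=0 means rows) and reassign to data
data = data.drop(['Time'],axis=1)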


from sklearn.model_selection import train_test_split

x_whole = data.drop('Class',axis=1)  # features of the original dataset
y_whole = data.Class  # labels of the original dataset
x_train,x_test,y_train,y_test = train_test_split(x_whole,y_whole,test_size=0.2,random_state=0)  # hold out a random 20% as the test set

from imblearn.over_sampling import SMOTE  # oversampling method

oversampler = SMOTE(random_state=0)  # fixed random seed so the result is reproducible
os_x_train,os_y_train = oversampler.fit_resample(x_train,y_train)  # oversample only the 80% training portion of the original data

labels_count = os_y_train.value_counts()  # class counts after oversampling
plt.title("Positive and negative sample counts")
plt.xlabel('Class')
plt.ylabel('Count')
labels_count.plot(kind='bar')
plt.show()

os_x_train_w,os_x_test_w,os_y_train_w,os_y_test_w = (   # split the oversampled data: a random 20% for testing, 80% for training
    train_test_split(os_x_train,os_y_train,test_size=0.2,random_state=0))



from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.model_selection import cross_val_score  # k-fold cross-validation

scores = []  # mean recall for each candidate C (inverse regularization strength)
c_param_range = [0.01,0.1,1,10,100] # candidate C values
z = 1
for i in c_param_range:
    start_time = time.time()  # start time
    lr = LogisticRegression(C=i,penalty='l2',solver='lbfgs',max_iter=1000)  # model with the current C
    score = cross_val_score(lr,os_x_train_w,os_y_train_w,cv=5,scoring='recall')  # 5-fold cross-validation
    score_mean = sum(score)/len(score)
    scores.append(score_mean)
    end_time = time.time()   # end time
    print("Run {}...".format(z))
    print("time spent:{:.2f}".format(end_time-start_time))
    print("recall:{}".format(score_mean))
    z+=1
    # print(score_mean)

best_c = c_param_range[np.argmax(scores)]  # C value with the highest mean recall

lr = LogisticRegression(C=best_c,penalty='l2',max_iter=1000)
lr.fit(os_x_train_w,os_y_train_w)   # train on the training split of the oversampled data

from sklearn import metrics   # used to print the classification reports

train_predict_big = lr.predict(os_x_train_w)  # predict on the oversampled training split itself
print(metrics.classification_report(os_y_train_w,train_predict_big))

test_predict_big = lr.predict(os_x_test_w)   # predict on the oversampled test split
print(metrics.classification_report(os_y_test_w,test_predict_big))

train_predict = lr.predict(x_train)  # predict on the original training set
print(metrics.classification_report(y_train,train_predict))
cm_plot(y_train,train_predict).show()

test_predict = lr.predict(x_test)  # predict on the original test set
print(metrics.classification_report(y_test,test_predict))
cm_plot(y_test,test_predict).show()



thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]   # candidate decision thresholds
recalls = []
y_predict_proba = lr.predict_proba(x_test)[:, 1]  # probability of class 1; computed once, outside the loop
for i in thresholds:
    y_predict = (y_predict_proba > i).astype(int)  # 1 if the probability exceeds the threshold, else 0

    recall = metrics.recall_score(y_test,y_predict)
    recalls.append(recall)
    print("{} Recall metric in the testing dataset:{:.3f}".format(i,recall))

The output is as follows:

1) Class distribution after oversampling

2) Confusion matrix visualization

3) Printed results

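IV. Summary

Oversampling large datasets is a data preprocessing technique for addressing class imbalance. When some classes have far fewer samples than others, a model trained on the data tends to predict the minority classes poorly. Oversampling aims to increase the number of minority-class samples so that the class counts become more balanced.

It can be carried out in the following steps:

Choose an oversampling method: common methods include random duplication, SMOTE (Synthetic Minority Over-sampling Technique), and ADASYN (Adaptive Synthetic Sampling). Each works on a different principle and suits different scenarios, so choosing the one that fits the problem is key.
Perform the oversampling: using the chosen method, duplicate minority-class samples or generate new ones. Duplication copies original samples directly, while generation synthesizes new samples from the features of existing minority-class samples.
Adjust the oversampling ratio: oversampling can produce too many samples, which increases training time or causes overfitting. Controlling the oversampling ratio keeps the class counts close to each other without inflating the dataset more than necessary.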
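As an illustration of the simplest method listed above, random duplication, together with ratio control, here is a minimal NumPy sketch. The function `random_oversample` and its `ratio` parameter are hypothetical names for this example; `ratio` is the desired minority count divided by the majority count after resampling, which is similar in spirit to the `sampling_strategy` float accepted by `imblearn`'s samplers.

```python
import numpy as np


def random_oversample(X, y, ratio=1.0, minority_label=1, seed=0):
    """Duplicate random minority rows until minority count reaches ratio * majority count."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    n_majority = len(y) - len(minority_idx)
    n_target = int(ratio * n_majority)  # minority rows wanted after resampling
    n_extra = max(n_target - len(minority_idx), 0)  # never remove rows
    extra = rng.choice(minority_idx, size=n_extra, replace=True)  # rows to duplicate
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Toy data: 8 majority rows and 2 minority rows
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_half, y_half = random_oversample(X, y, ratio=0.5)  # minority grows to half the majority
X_full, y_full = random_oversample(X, y, ratio=1.0)  # fully balanced
print((y_half == 1).sum(), (y_full == 1).sum())  # 4 8
```

A ratio below 1.0 adds fewer duplicated rows, which is one way to limit the extra training time and overfitting risk mentioned above.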

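Oversampling large datasets can improve a model's ability to predict the minority class, but it has drawbacks. It may introduce noisy samples and make the model focus too heavily on the minority class at the expense of the majority class. It may also lower performance on the test set, because the test set's class distribution may differ from that of the oversampled training set.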