Machine Learning in Practice (1): Logistic Regression Prediction

Article link: https://blog.csdn.net/qq_24140919/article/details/89677791

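I have been learning about prediction recently and am starting with the simplest approach. This post covers two practical cases I implemented with logistic regression:

1. Predicting whether a student is admitted to a university, based on past application data

2. Credit card fraud prediction

The code will be organized into my GitHub later. To be continued!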

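I. Predicting whether a student is admitted to a university, based on past application data

The data looks like this:

1. Data analysis

The data turns out to be reasonably balanced: 100 samples in total, 60 positive and 40 negative.

Since there are only two attributes, we can plot them.

Code: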

def show_data():
    positive = pdData[pdData['Classes'] == 1]
    negative = pdData[pdData['Classes'] == 0]
    print(len(positive))
    print(len(negative))
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(positive['attribute1'], positive['attribute2'], s=30, c='b', marker='o', label='Classes')
    ax.scatter(negative['attribute1'], negative['attribute2'], s=30, c='r', marker='x', label='Not Classes')
    ax.legend()
    ax.set_xlabel('attribute1 Score')
    ax.set_ylabel('attribute2 Score')
    plt.show()
    plt.close()

The resulting scatter plot of the data:

2. Data preparation

(1) Data preprocessing

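The linear term $\theta_0 + \theta_1 x_1 + \theta_2 x_2$ is equivalent to $(1, x_1, x_2)\,(\theta_0, \theta_1, \theta_2)^T$.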

So we need to preprocess the data by inserting a column of ones:

pdData.insert(0, 'Ones', 1)
orig_data = pdData.values

(2) Obtain the X, y, and theta needed for training

Their shapes are (100, 3), (100, 1), and (1, 3), respectively.
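The post does not show how X, y, and theta are cut out of orig_data, so here is a minimal sketch of one way to do it. The synthetic array below is only a stand-in for the real data, and the column order (Ones, attribute1, attribute2, Classes) is my assumption based on the insert call above.

```python
import numpy as np

# Synthetic stand-in for orig_data (assumed column order:
# Ones, attribute1, attribute2, Classes). The real array comes from pdData.values.
rng = np.random.default_rng(0)
n_samples = 100
orig_data = np.column_stack([
    np.ones(n_samples),
    rng.uniform(30, 100, n_samples),
    rng.uniform(30, 100, n_samples),
    rng.integers(0, 2, n_samples),
])

cols = orig_data.shape[1]
X = orig_data[:, 0:cols - 1]       # every column except the label
y = orig_data[:, cols - 1:cols]    # label column, kept two-dimensional
theta = np.zeros([1, cols - 1])    # one parameter per column of X, starting at 0

print(X.shape, y.shape, theta.shape)
```

Slicing with `cols - 1:cols` instead of a single index keeps y as a column vector, which is what the element-wise operations in the cost function expect.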

3. Training the model

(1) sigmoid: the function that maps a value to a probability

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

(2) model: returns the predicted value

def model(X, theta):
    return sigmoid(np.dot(X, theta.T))

(3) cost: computes the loss for the given parameters

def cost(X, y, theta):
    h = model(X, theta)                         # predicted probabilities, computed once
    left = np.multiply(-y, np.log(h))           # term for the positive samples
    right = np.multiply(1 - y, np.log(1 - h))   # term for the negative samples
    return np.sum(left - right) / len(X)
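A quick sanity check for the cost function: it computes J(θ) = -(1/m) Σ [y·log(h) + (1-y)·log(1-h)], so with theta set to all zeros every prediction is 0.5 and the cost should be ln 2 ≈ 0.693 regardless of the data. The sketch below checks this on a small made-up dataset (the numbers are for illustration only).

```python
import math
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(X, theta):
    return sigmoid(np.dot(X, theta.T))

def cost(X, y, theta):
    h = model(X, theta)
    left = np.multiply(-y, np.log(h))
    right = np.multiply(1 - y, np.log(1 - h))
    return np.sum(left - right) / len(X)

# Made-up toy data: a column of ones, two features, and 0/1 labels.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 0.5, 2.5],
              [1.0, 3.0, 3.0]])
y = np.array([[1.0], [0.0], [0.0], [1.0]])
theta = np.zeros([1, 3])

initial_cost = cost(X, y, theta)
print(initial_cost, math.log(2))
```

If the first entry of the costs list during training is far from 0.693, something is probably wrong with the shapes of X, y, or theta.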

(4) gradrent: computes the gradient with respect to each parameter, i.e. the derivative of the objective function

def gradrent(X, y, theta):
    grad = np.zeros(theta.shape)
    error = (model(X, theta) - y).ravel()
    for j in range(len(theta.ravel())):
        term = np.multiply(error, X[:, j])
        grad[0, j] = np.sum(term) / len(X)
    return grad
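The loop over j can also be written as a single matrix product. The sketch below compares the loop version with a vectorized one on random data; gradrent_vectorized is a name I made up for the comparison and is not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(X, theta):
    return sigmoid(np.dot(X, theta.T))

def gradrent(X, y, theta):
    # Loop version, as in the post: one parameter at a time.
    grad = np.zeros(theta.shape)
    error = (model(X, theta) - y).ravel()
    for j in range(len(theta.ravel())):
        term = np.multiply(error, X[:, j])
        grad[0, j] = np.sum(term) / len(X)
    return grad

def gradrent_vectorized(X, y, theta):
    # Hypothetical vectorized variant: (1, m) dot (m, n) gives a (1, n) gradient.
    error = model(X, theta) - y
    return np.dot(error.T, X) / len(X)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y = rng.integers(0, 2, size=(6, 1)).astype(float)
theta = rng.normal(size=(1, 3))

loop_grad = gradrent(X, y, theta)
vec_grad = gradrent_vectorized(X, y, theta)
print(np.allclose(loop_grad, vec_grad))
```

Both compute (1/m) Σ (h(x_i) - y_i)·x_ij for every j. The vectorized form mainly matters when there are many features.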

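(5) decent: updates the parameters using gradient descent, with three possible stopping strategies

There are three gradient descent variants: mini-batch gradient descent, stochastic gradient descent, and batch gradient descent.

There are also three stopping strategies: by number of iterations, by a threshold on the gradient, and by a threshold on the change in loss.

# Define the three stopping strategies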
STOP_ITER = 0
STOP_COST = 1
STOP_GRAD = 2
def stopCriterion(type, value, threshold):
    if type == STOP_ITER:
        return value > threshold
    elif type == STOP_COST:
        return abs(value[-1]-value[-2]) < threshold
    elif type == STOP_GRAD:
        return np.linalg.norm(value) < threshold
def decent(data, batchsize, stopType, thresh, alpha):
    init_time = time.time()
    i = 0       # iteration count
    k = 0       # start index of the current batch
    X, y, theta = shuffleData(data)
    grad = np.zeros(theta.shape)    # gradient
    costs = [cost(X, y, theta)]     # loss history
    while True:
        # batchsize = any other number: mini-batch gradient descent
        # batchsize = 1: stochastic gradient descent
        # batchsize = number of samples: batch gradient descent
        grad = gradrent(X[k:k+batchsize], y[k:k+batchsize], theta)
        k += batchsize
        if k >= len(X):
            k = 0
            X, y, _ = shuffleData(data)         # reshuffle, but keep the learned theta
        theta = theta - alpha*grad              # update the parameters
        costs.append(cost(X, y, theta))
        i += 1

        if stopType == STOP_ITER:
            value = i
        elif stopType == STOP_COST:
            value = costs
        elif stopType == STOP_GRAD:
            value = grad
        if stopCriterion(stopType, value, thresh):
            break
    total_time = time.time() - init_time
    return theta, i - 1, costs, grad, total_time
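decent calls shuffleData, which is not shown in this post. Below is a guess at what such a helper could look like, judging only from how it is called (it takes the data array and returns X, y, theta). Treat the body as an assumption rather than the original implementation.

```python
import numpy as np

def shuffleData(data):
    """Hypothetical helper: shuffle the rows, then split into X, y and a zeroed theta."""
    shuffled = data.copy()               # leave the caller's array untouched
    np.random.shuffle(shuffled)          # shuffles whole rows in place
    cols = shuffled.shape[1]
    X = shuffled[:, 0:cols - 1]          # features (including the column of ones)
    y = shuffled[:, cols - 1:cols]       # labels as a column vector
    theta = np.zeros([1, cols - 1])      # fresh parameters, all zeros
    return X, y, theta

# Tiny made-up array: each row is [a, a+1, a+2, a+3], so pairing is easy to verify.
data = np.arange(20, dtype=float).reshape(5, 4)
X, y, theta = shuffleData(data)
print(X.shape, y.shape, theta.shape)
```

Because a helper like this returns a freshly zeroed theta on every call, code that reshuffles in the middle of training should keep its own theta and ignore the returned one, otherwise the learned parameters are thrown away at each reshuffle.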

(6) Visualizing the results

def runExpe(data, batchSize, stopType, thresh, alpha):
    theta, iter, costs, grad, dur = decent(data, batchSize, stopType, thresh, alpha)
    name = "Original" if (data[:, 1] > 2).sum() > 1 else "Scaled"
    name += " data - learning rate: {} - ".format(alpha)
    if batchSize == orig_data.shape[0]:
        strDescType = "Gradient"
    elif batchSize == 1:
        strDescType = "Stochastic"
    else:
        strDescType = "Mini-batch ({})".format(batchSize)
    name += strDescType + " descent - Stop: "
    if stopType == STOP_ITER:
        strStop = "{} iterations".format(thresh)
    elif stopType == STOP_COST:
        strStop = "costs change < {}".format(thresh)
    else: strStop = "gradient norm < {}".format(thresh)
    name += strStop
    print("***{}\nTheta: {} - Iter: {} - Last cost: {:03.2f} - Duration: {:03.2f}s".format(
        name, theta, iter, costs[-1], dur))
    fig, ax = plt.subplots(figsize=(12,4))
    ax.plot(np.arange(len(costs)), costs, 'r')
    ax.set_xlabel('Iterations')
    ax.set_ylabel('Cost')
    ax.set_title(name.upper() + ' - Error vs. Iteration')
    plt.show()
    plt.close()
    return theta

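Tip: preprocess the loaded data with sklearn's scale function (from sklearn.preprocessing, used as pp in the code below).

As for how to tune the training parameters, I still don't really understand it.

Code: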

def process_data(data):
    scaled_data = data.copy()
    scaled_data[:, 1:3] = pp.scale(data[:, 1:3])
    return scaled_data
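To see what the scaling does, here is a small sketch that standardizes the two score columns by hand: subtract the column mean and divide by the column standard deviation. As far as I know this matches what scale does with its default settings. The function name standardize_columns and the sample values are made up for illustration.

```python
import numpy as np

def standardize_columns(data, first_col, last_col):
    """Standardize columns [first_col, last_col) to zero mean and unit variance."""
    scaled = data.astype(float)                      # astype returns a copy
    block = scaled[:, first_col:last_col]
    scaled[:, first_col:last_col] = (block - block.mean(axis=0)) / block.std(axis=0)
    return scaled

# Made-up rows in the layout [Ones, attribute1, attribute2, Classes].
data = np.array([[1.0, 34.6, 78.0, 0.0],
                 [1.0, 30.3, 43.9, 0.0],
                 [1.0, 60.2, 86.3, 1.0],
                 [1.0, 79.0, 75.3, 1.0]])

scaled_data = standardize_columns(data, 1, 3)
print(scaled_data[:, 1:3].mean(axis=0))
print(scaled_data[:, 1:3].std(axis=0))
```

Only columns 1 and 2 are touched. The column of ones and the label column stay as they are, which is why process_data slices with 1:3.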

The data after preprocessing:

Comparison of the results:

Result without preprocessing:

Result with preprocessing:

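II. Credit card fraud prediction

Compared with the previous case, analyzing the data matters much more here.

The data looks like this:

1. Data analysis

(1) Analyzing the feature values

Inspecting each column shows that the values in the Amount column vary over a very wide range. To keep the model from treating larger values as more important, Amount needs to be normalized (the code below standardizes it with StandardScaler). The Time column is treated as uninformative and is dropped.

Code: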

def normalization_data(data):
    '''
        Data preprocessing, step 1: standardize one column and drop the columns that are not needed
    '''
    data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
    data = data.drop(['Time', 'Amount'], axis=1)
    return data

(2) Check whether the class distribution is balanced

Code:

def histogram_class(data):
    '''
        Data analysis: bar chart of the class distribution
    '''
    # number of samples per class, ordered by class label
    count_classes = data['Class'].value_counts(sort=True).sort_index()      # type: Series
    # print(count_classes)
    count_classes.plot(kind='bar')
    plt.title("Fraud class histogram")
    plt.xlabel("Class")
    plt.ylabel("Frequency")
    plt.show()
    plt.close()
histogram_class(credit_cards)

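The resulting chart:

Conclusion: the data is highly imbalanced. Of 284807 samples in total, only 492 belong to class 1, while 284315 belong to class 0.

Remedies:

Oversampling (generate additional samples for the minority class, then combine them with the majority class to form a new dataset)

Undersampling (draw as many samples from the majority class as there are in the minority class, then combine them to form a new dataset)

Oversampling code: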

def oversample_data(data, split_size):
    X_train, X_test, y_train, y_test = slice_data(data, split_size)
    oversampler = SMOTE(random_state=0)
    # fit_sample was renamed to fit_resample in newer imbalanced-learn releases
    x_train, y_train = oversampler.fit_resample(X_train, y_train)
    return x_train, X_test, y_train, y_test

Undersampling code:

def upsample_data(data, split_size):
    '''
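        Data preprocessing, step 2: balance the classes by undersampling, then split into training and test sets
        (despite its name, this function undersamples the majority class)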
    '''
    number_records_fraud = len(data[data.Class == 1])
    fraud_indices = np.array(data[data.Class == 1].index)

    normal_indices = data[data.Class == 0].index
    random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
    random_normal_indices = np.array(random_normal_indices)

    under_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
    # the collected values are index labels, so select by label (.loc) rather than by position
    under_sample_data = data.loc[under_sample_indices, :]

    X_train, X_test, y_train, y_test = slice_data(under_sample_data, split_size)

    return X_train, X_test, y_train, y_test

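The split itself uses train_test_split from sklearn.model_selection.

From the data analysis we now have X_train, X_test, y_train, y_test. Next we train the model.

2. Training the logistic regression model with K-fold cross-validation

Here we use cross-validation to train the model. Code: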

def printing_Kfold_scores(x_train_data,y_train_data, kf_size):
    fold = KFold(kf_size, shuffle=False)
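    # regularization penalty
    for iteration, indices in enumerate(fold.split(y_train_data)):
        # print(iteration)    # number of the current fold
        # print(indices)      # two index arrays: the first is the training set (typically 4/5 of the data), the second is the validation set

        # model training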
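To make the two index arrays returned by each fold more concrete, here is a small pure-Python sketch of how unshuffled K-fold splitting carves up the sample indices. kfold_indices is a hypothetical helper written for illustration. It is meant to mirror the behavior of KFold with shuffle=False, not to replace it.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Every fold gets n_samples // n_splits samples; the first
    # n_samples % n_splits folds get one extra.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop

folds = list(kfold_indices(10, 5))
for iteration, (train_idx, val_idx) in enumerate(folds):
    print(iteration, train_idx, val_idx)
```

With 10 samples and 5 folds, each validation set holds 2 samples and the remaining 8 (4/5 of the data) are used for training. Every sample lands in a validation set exactly once.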

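During model training, there are a few points to pay attention to:

(1) Regularization penalty

sklearn provides two kinds of regularization, L1 and L2, and we can tune the penalty strength to find the value that gives the best score. Note that in LogisticRegression, C is the inverse of the regularization strength, so a smaller C means a stronger penalty.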

c_param_range = [0.01, 0.1, 1, 10, 100]
for c_param in c_param_range:
    for iteration, indices in enumerate(fold.split(y_train_data)):
        # the L1 penalty needs a solver that supports it, such as liblinear
        lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')

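(2) Model evaluation metric

There are two common evaluation metrics. The first is accuracy: the number of correctly predicted samples ÷ the total number of samples. It is not suitable for some real problems. For example, take 100 patients of whom 90 are healthy and 10 have cancer. A model that predicts all 100 as healthy reaches 90% accuracy, which looks high, but we know the model is useless because it finds none of the cancer patients. The second is recall: of all samples that truly belong to a class, the fraction that the model predicts correctly. The key is to be clear about TP, FP, FN, and TN, as in the figure below:

The recall formula is:

$\text{Recall} = \dfrac{TP}{TP + FN}$

Code: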

# the L1 penalty needs a solver that supports it, such as liblinear
lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
lr.fit(x_train_data.iloc[indices[0], :], y_train_data.iloc[indices[0], :].values.ravel())
y_pred_undersample = lr.predict(x_train_data.iloc[indices[1], :].values)
recall_acc = recall_score(y_train_data.iloc[indices[1], :].values, y_pred_undersample)
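The patient example above can be checked with a few lines of plain Python. The accuracy and recall helpers below are written by hand for illustration. In practice recall_score from sklearn does the same job.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# 90 healthy patients (0) and 10 cancer patients (1); the model predicts everyone healthy.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print("accuracy:", accuracy(y_true, y_pred))
print("recall:  ", recall(y_true, y_pred))
```

Accuracy comes out at 0.9 while recall for the cancer class is 0, which is why recall is the metric to watch on an imbalanced dataset like the credit card data.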

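(3) Threshold

Logistic regression's last step passes the linear score through the sigmoid function to get a probability and then assigns a class, so we can choose the classification threshold ourselves. Different thresholds give noticeably different results.

Code: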

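# the L1 penalty needs a solver that supports it, such as liblinear
lr = LogisticRegression(C=c, penalty='l1', solver='liblinear')
lr.fit(X_upsample_train, y_upsample_train.values.ravel())
# the value at row i, column j is the predicted probability that sample i has label j; each row sums to 1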
y_pred_undersample_prob = lr.predict_proba(X_upsample_test.values)      # type: array

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

for i in thresholds:
    y_test_predictions_high_recall = y_pred_undersample_prob[:, 1] > i
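The effect of the threshold is easy to see on a handful of made-up probabilities. The sketch below is plain Python with invented numbers, not output from the fraud model: it counts recall and false positives at three thresholds.

```python
# Made-up predicted probabilities of class 1 and the matching true labels.
probabilities = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]
y_true = [0, 0, 1, 0, 1, 0, 1, 1]

def predict(threshold):
    """Label a sample as class 1 when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def recall_at(threshold):
    y_pred = predict(threshold)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def false_positives_at(threshold):
    y_pred = predict(threshold)
    return sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

for threshold in [0.1, 0.5, 0.9]:
    print(threshold, recall_at(threshold), false_positives_at(threshold))
```

A low threshold catches every positive but also flags several negatives. A high threshold removes the false alarms at the price of missing most positives. Picking the threshold is a trade-off between the two.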

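3. Conclusions

Results with undersampling: recall is high, but the error on the original data is also large. The experimental results are shown below:

Recall on y_upsample_train:

Recall on y_test: 0.9115646258503401 (to push recall up, that is, to classify class 1 as class 1, many class-0 samples also end up classified as class 1)

Results with oversampling: recall is lower than with undersampling, but the error on the original data is small. The experimental results are shown below:

Recall on y_upsample_train:

Recall on y_test: 0.9183673469387755

Techniques used in the conclusions:

(1) Confusion matrix

A confusion matrix can be drawn from the predicted and actual values. Code: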

def plot_confusion_matrix(pre_classes, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    """
        Plot the TP, FP, FN, and TN of the model evaluation as a confusion matrix
    """
    plt.imshow(pre_classes, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = pre_classes.max() / 2.
    for i, j in itertools.product(range(pre_classes.shape[0]), range(pre_classes.shape[1])):
        plt.text(j, i, pre_classes[i, j],
                 horizontalalignment="center",
                 color="white" if pre_classes[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
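For reference, the matrix that gets plotted has true labels as rows and predicted labels as columns, so for two classes it reads [[TN, FP], [FN, TP]]. The plain-Python sketch below builds that layout by hand on made-up labels. confusion_counts is a hypothetical stand-in for sklearn's confusion_matrix.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

cm = confusion_counts(y_true, y_pred)
recall = cm[1][1] / (cm[1][0] + cm[1][1])   # TP / (FN + TP)

print(cm)
print(recall)
```

Reading recall off the second row (true class 1) is the same calculation as TP / (TP + FN).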

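(2) Showing multiple plots on one canvas

The key call is:

plt.subplot(3, 3, j)

For example, the code that draws the confusion matrices for the different thresholds: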

def show_matrix(c):
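    # the L1 penalty needs a solver that supports it, such as liblinear
    lr = LogisticRegression(C=c, penalty='l1', solver='liblinear')
    lr.fit(X_upsample_train, y_upsample_train.values.ravel())
    # the value at row i, column j is the predicted probability that sample i has label j; each row sums to 1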
    y_pred_undersample_prob = lr.predict_proba(X_upsample_test.values)      # type: array

    thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

    plt.figure(figsize=(10, 10))
    j = 1
    for i in thresholds:
        plt.subplot(3, 3, j)
        j += 1
        y_test_predictions_high_recall = y_pred_undersample_prob[:, 1] > i
        cnf_matrix = confusion_matrix(y_upsample_test, y_test_predictions_high_recall)
        np.set_printoptions(precision=2)

        print("Recall: ", cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

        class_names = [0, 1]
        plot_confusion_matrix(cnf_matrix
                              , classes=class_names
                              , title='Threshold >= %s' % i)
    plt.show()
    plt.close()

The experimental results: