Machine Learning, Part 4: Logistic Regression

Jayden Huang

Article link: https://blog.csdn.net/u011585024/article/details/82951860

This post collects my notes from studying the Logistic regression algorithm. It is intended for learning purposes only!

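1. Overview

Suppose we have a set of data points and fit a straight line to them (this line is called the best-fit line). The fitting process is called regression.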
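To make the idea of a best-fit line concrete, here is a minimal sketch using `np.polyfit`. The sample points are hypothetical and chosen to lie exactly on y = 2x + 1, so the fitted coefficients are easy to check.

```python
import numpy as np

# Hypothetical sample points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 least-squares fit returns the coefficients [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

With noisy data the fitted line would no longer pass through every point; it would be the line that minimizes the squared vertical distances.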

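2. Core Idea

Build a regression formula for the classification boundary from the existing data, then use that formula to classify.

3. Pros and Cons

(1) Pros

Low computational cost; easy to understand and implement.

(2) Cons

Prone to underfitting; classification accuracy may be low.

In summary, the algorithm works with numeric and nominal data.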

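4. The Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = w_0 x_0 + w_1 x_1 + \dots + w_n x_n = w^T x$ is the weighted sum of the input features.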
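As a quick illustration of how the Sigmoid output turns into a class label, the sketch below evaluates it for one input. The weights and the input vector are hypothetical values picked for the example, and the 0.5 threshold matches the one used in the test code later in this post.

```python
import numpy as np

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and input; x[0] = 1.0 plays the role of the bias input
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 1.5])

z = np.dot(w, x)               # 0.5 - 2.0 + 3.0 = 1.5
p = sigmoid(z)                 # value in (0, 1), read as P(class 1)
label = 1 if p > 0.5 else 0    # threshold at 0.5
print(z, p, label)
```

Because sigmoid(0) is exactly 0.5, thresholding the probability at 0.5 is the same as checking the sign of z.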

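5. Finding the Best Regression Coefficients with an Optimization Method

(1) Gradient ascent

To find the maximum of a function, a standard approach is to move along the direction of its gradient. The gradient of a function f(x, y) is:

$$\nabla f(x, y) = \begin{pmatrix} \dfrac{\partial f(x, y)}{\partial x} \\ \dfrac{\partial f(x, y)}{\partial y} \end{pmatrix}$$

Along the x axis: $\dfrac{\partial f(x, y)}{\partial x}$

Along the y axis: $\dfrac{\partial f(x, y)}{\partial y}$

The gradient iteration formula, with step size $\alpha$:

$$w := w + \alpha \nabla_w f(w)$$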
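The iteration can be tried on a small function whose maximum is known. This is only a sketch: the function f(x, y) = -(x - 1)^2 - (y + 2)^2, the step size, and the iteration count are assumptions chosen for illustration. Its maximum is at (1, -2).

```python
import numpy as np

def gradient(point):
    """Gradient of f(x, y) = -(x - 1)^2 - (y + 2)^2."""
    x, y = point
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])

point = np.array([0.0, 0.0])   # arbitrary starting point
alpha = 0.1                    # step size (assumed value)

for _ in range(200):
    # Gradient ascent: step in the direction of the gradient
    point = point + alpha * gradient(point)

print(point)
```

Each step shrinks the distance to the maximum by a constant factor here, so the point settles very close to (1, -2).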

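(2) Gradient descent

This is the same as gradient ascent, except that the addition in the update formula becomes a subtraction: $w := w - \alpha \nabla_w f(w)$, where $\alpha$ is the step size. Gradient ascent looks for a maximum; gradient descent looks for a minimum.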
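The relationship between the two methods can be checked numerically: ascending f gives the same iterates as descending g = -f. The one-dimensional function below is a hypothetical example, not taken from the original notes.

```python
def grad_f(w):
    """Gradient of f(w) = -(w - 3)^2, which has its maximum at w = 3."""
    return -2.0 * (w - 3.0)

def grad_g(w):
    """Gradient of g(w) = -f(w) = (w - 3)^2, which has its minimum at w = 3."""
    return 2.0 * (w - 3.0)

alpha = 0.1        # step size (assumed value)
w_ascent = 0.0
w_descent = 0.0

for _ in range(100):
    w_ascent = w_ascent + alpha * grad_f(w_ascent)      # ascent: add the gradient
    w_descent = w_descent - alpha * grad_g(w_descent)   # descent: subtract the gradient

print(w_ascent, w_descent)
```

Both variables follow the same path toward w = 3, which is why the choice between the two is mostly a matter of whether the objective is written as something to maximize or to minimize.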

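6. Method Steps

# ###########################################################
# Method steps:
# 1. Collect data: use the given data files
# 2. Prepare data: parse the files with Python and fill in missing values
# 3. Analyze data: visualize and inspect the data
# 4. Train the algorithm: use an optimization algorithm to find the best coefficients
# 5. Test the algorithm: measure the error rate to quantify how well the regression works. Based on the error rate, decide whether to go back to the training stage and adjust parameters such as the number of iterations and the step size to get better regression coefficients
# 6. Use the algorithm: implement a simple command-line program that collects data and outputs predictions
# ###########################################################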

7. Code Examples

(1) Collect data: use the given data files

(2) Prepare data: parse the files with Python and fill in missing values

# 1. Collect data: use the given data files
# 2. Prepare data: parse the files with Python
# Note: read_file only parses; it assumes missing values were already filled in the data files
import numpy as np


def read_file(file_name):
    # Every tab-separated column except the last is a feature; the last is the label
    list_of_set = []
    list_of_labels = []
    # The with-statement closes the file even if parsing fails
    with open(file_name) as fr:
        for line in fr:
            curr_line = line.strip().split('\t')
            num_of_features = len(curr_line) - 1
            line_arr = [float(value) for value in curr_line[:num_of_features]]
            list_of_set.append(line_arr)
            list_of_labels.append(float(curr_line[num_of_features]))
    return list_of_set, list_of_labels


def load_dataset():
    training_file_name = "horseColicTraining.txt"
    test_file_name = "horseColicTest.txt"

    list_of_training_set, list_of_training_labels = read_file(training_file_name)
    list_of_test_set, list_of_test_labels = read_file(test_file_name)
    return list_of_training_set, list_of_training_labels, list_of_test_set, list_of_test_labels
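To see what the parser produces without needing the horse colic files, the sketch below runs the same parsing logic on a small temporary file. The helper name `parse_tab_file` and the file contents are hypothetical; only the format (tab-separated features followed by a label) follows the code above.

```python
import os
import tempfile

def parse_tab_file(file_name):
    """Split each tab-separated line into float features and a float label (last column)."""
    features, labels = [], []
    with open(file_name) as handle:
        for line in handle:
            parts = line.strip().split('\t')
            features.append([float(value) for value in parts[:-1]])
            labels.append(float(parts[-1]))
    return features, labels

# Write a tiny made-up data file: two features and one label per line
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as tmp:
    tmp.write("1.0\t2.0\t1\n")
    tmp.write("0.5\t3.5\t0\n")
    tmp_name = tmp.name

features, labels = parse_tab_file(tmp_name)
os.remove(tmp_name)   # clean up the temporary file

print(features)
print(labels)
```

The features come back as a list of rows and the labels as a parallel list, which is the shape the training and test functions in this post expect.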

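(3) Analyze data: visualize and inspect the data

# Analyze the data: plot the decision boundary
import numpy as np
import matplotlib.pyplot as plt


def plot_best_fit(wei, data_matrix, label_matrix):
    # Assumes each row of data_matrix is [1.0, x1, x2] (a constant bias column plus
    # two features), so the boundary is a line in the x1-x2 plane. load_dataset()
    # returns four values and data with more than two features, so the data to
    # plot is passed in as arguments instead of being loaded here.
    weights = np.array(wei)
    data_arr = np.array(data_matrix)
    n = data_arr.shape[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(label_matrix[i]) == 1:
            xcord1.append(data_arr[i, 1])
            ycord1.append(data_arr[i, 2])
        else:
            xcord2.append(data_arr[i, 1])
            ycord2.append(data_arr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s = 30, c = 'red', marker = 's')
    ax.scatter(xcord2, ycord2, s = 30, c = 'green')
    x = np.arange(-3.0, 3.0, 0.1)
    # On the boundary the Sigmoid input is 0: w0 + w1 * x1 + w2 * x2 = 0; solve for x2
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()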
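The boundary line in the plot comes from setting the Sigmoid input to zero. The sketch below checks this numerically without plotting: every point on the line should get a predicted probability of 0.5. The weights are hypothetical values chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights [w0, w1, w2] for inputs of the form [1.0, x1, x2]
weights = np.array([4.0, 0.5, -0.6])

# Points on the boundary: solve w0 + w1 * x1 + w2 * x2 = 0 for x2
x1 = np.arange(-3.0, 3.0, 0.1)
x2 = (-weights[0] - weights[1] * x1) / weights[2]

# Rebuild the full inputs and evaluate the model on them
boundary_points = np.column_stack([np.ones_like(x1), x1, x2])
probabilities = sigmoid(boundary_points @ weights)

print(probabilities.min(), probabilities.max())
```

Points on one side of the line get probabilities above 0.5 and points on the other side get probabilities below 0.5, which is what separates the two classes in the scatter plot.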

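(4) Train the algorithm: use an optimization algorithm to find the best coefficients

import numpy as np


# h(x) = 1 / (1 + e ** (-x))
def sigmod(inX):
    return 1 / (1 + np.exp(-inX))

def model(inX, theta):
    return sigmod(np.dot(inX, theta))

# Stochastic gradient ascent
def stoc_grad_ascent_0(data_matrix, class_labels, num_iter = 150):
    m, n = data_matrix.shape
    step = 0.01
    weights = np.ones(n)
    for j in np.arange(num_iter):
        # Indices of the samples not yet used in this pass
        data_index = list(range(m))
        for i in np.arange(m):
            # alpha shrinks as 4 / (1 + j + i), where j is the pass number and i is the position within the pass;
            # the constant step keeps it above zero. Because i restarts at 0 on every pass, alpha is not strictly
            # decreasing when j << max(i)
            alpha = 4 / (1.0 + j + i) + step
            # Update the coefficients with a randomly chosen sample; random selection reduces periodic fluctuations
            rand_index = int(np.random.uniform(0, len(data_index)))
            sample_index = data_index[rand_index]
            h = model(data_matrix[sample_index], weights)
            error = class_labels[sample_index] - h
            weights = weights + alpha * error * data_matrix[sample_index]
            # Remove the sample so each one is used once per pass
            del data_index[rand_index]
    return weights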
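To watch the update rule work end to end, here is a small self-contained variant run on synthetic data. It is a sketch under stated assumptions, not the code from this post: the name `sgd_ascent`, the two Gaussian clusters, the seeds, and the use of `rng.permutation` to visit each sample once per pass are all choices made for the example. The update `weights + alpha * error * x` is the gradient-ascent step on the log-likelihood of logistic regression for a single sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ascent(features, labels, num_passes=50, seed=0):
    """Stochastic gradient ascent; each pass visits every sample once in random order."""
    rng = np.random.default_rng(seed)
    m, n = features.shape
    weights = np.ones(n)
    for j in range(num_passes):
        for i, k in enumerate(rng.permutation(m)):
            alpha = 4.0 / (1.0 + j + i) + 0.01    # decaying step size with a floor of 0.01
            error = labels[k] - sigmoid(np.dot(features[k], weights))
            weights = weights + alpha * error * features[k]
    return weights

# Synthetic, well-separated data: class 0 around (-2, -2), class 1 around (2, 2)
rng = np.random.default_rng(42)
class0 = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
class1 = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
points = np.vstack([class0, class1])
features = np.hstack([np.ones((100, 1)), points])   # prepend the bias column
labels = np.array([0.0] * 50 + [1.0] * 50)

weights = sgd_ascent(features, labels)
predictions = (sigmoid(features @ weights) > 0.5).astype(float)
accuracy = float(np.mean(predictions == labels))
print(weights, accuracy)
```

On data this cleanly separated the training accuracy should be very high; real data such as the horse colic set is noisier, which is why the post measures the error rate on a separate test file.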

(5) Test the algorithm: measure the error rate to quantify how well the regression works. Based on the error rate, decide whether to go back to the training stage and adjust parameters such as the number of iterations and the step size to get better regression coefficients

import numpy as np


# model, stoc_grad_ascent_0 and load_dataset are the functions defined in the earlier steps
def classify_vector(inX, weights):
    prob = model(inX, weights)
    if prob > 0.5:
        return 1
    else:
        return 0


def colic_test():
    list_of_training_set, list_of_training_labels, list_of_test_set, list_of_test_labels = load_dataset()
    weights_of_training = stoc_grad_ascent_0(np.array(list_of_training_set), list_of_training_labels, 500)

    error_count = 0
    num_of_test_vec = 0

    # Count the test vectors whose predicted class differs from the label
    for index, line in enumerate(list_of_test_set):
        num_of_test_vec += 1
        if classify_vector(np.array(line), weights_of_training) != list_of_test_labels[index]:
            error_count += 1
    error_rate = float(error_count) / num_of_test_vec
    print("the error rate of this test is : {0}".format(error_rate))
    return error_rate


def multi_test():
    # Training is randomized, so average the error rate over several runs
    num_of_test = 10
    error_sum = 0

    for k in np.arange(num_of_test):
        error_sum += colic_test()
    print("after {0} iterations the average error rate is {1} ".format(num_of_test, error_sum / float(num_of_test)))
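The error-rate computation can also be written in vectorized form. The sketch below uses a hypothetical helper `error_rate` and a tiny made-up test set with fixed weights, so the result can be checked by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def error_rate(features, labels, weights):
    """Fraction of samples whose thresholded prediction differs from the label."""
    probabilities = sigmoid(features @ weights)
    predictions = (probabilities > 0.5).astype(int)
    return float(np.mean(predictions != labels))

# Made-up weights and test data; the first column is the bias input
weights = np.array([0.0, 1.0])
features = np.array([[1.0, 2.0],
                     [1.0, -1.0],
                     [1.0, 0.5],
                     [1.0, -3.0]])
labels = np.array([1, 0, 0, 0])

# Sigmoid inputs are 2, -1, 0.5, -3, so the predictions are 1, 0, 1, 0:
# only the third sample is misclassified
rate = error_rate(features, labels, weights)
print(rate)  # 0.25
```

This gives the same number as counting mismatches in a loop, as `colic_test` does, just computed on whole arrays at once.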