Implementing DECISION_STUMP for Homework 2 of Machine Learning Foundations

萝卜地里的兔子

Category: Machine Learning. Tags: Machine Learning Foundations, Machine Learning

Article link: https://blog.csdn.net/u012613903/article/details/85065438

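Overview: In Professor Lin's problem statement, DECISION_STUMP refers to a decision stump, that is, a decision tree with only one level. The problem says that $\theta$ is chosen by sorting the (one-dimensional) attribute values in ascending order and taking the average of each pair of adjacent values. Some implementations found online additionally add one value smaller than the minimum at the front and one value larger than the maximum at the end; how large those two added values should be is an open question.

Averaging also makes the result harder to interpret. Take the example from the "Watermelon Book" (《西瓜书》) of judging whether a watermelon is good by its sweetness. Suppose the collected sweetness values are 0.1, 0.2, 0.5, 0.6, 0.9 (ignoring the good/bad labels), and the algorithm ends up using 0.35 (assuming it places the threshold between 0.2 and 0.5) as the criterion separating good melons from bad ones. That value never appears in the training data, so the natural reaction is: why this value, and where did it come from? For this reason the Watermelon Book notes that the observed values themselves can be used directly as $\theta$, which is easier to interpret.

Taking averages is of course also correct, but I prefer to use the observed values directly as $\theta$. The algorithm below therefore does not sort the attribute values, does not add a value below the minimum or above the maximum, and does not take averages. (When the Watermelon Book takes averages, it does not add the two extra values either; it simply sorts and uses the average of each pair of adjacent values as $\theta$.)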

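Example: take the three values 1, 3, 2:

(1) Sort in ascending order and take the average of each pair of adjacent values: 1 $\theta$ 2 $\theta$ 3; this gives 2 values of $\theta$.

(2) Sort in ascending order and take the average of each pair of adjacent values, after manually adding one value smaller than the minimum at the front and one value larger than the maximum at the end: 0 $\theta$ 1 $\theta$ 2 $\theta$ 3 $\theta$ 4; this gives 4 values of $\theta$.

(3) Use the values directly; this gives 3 values of $\theta$.

The three approaches differ, but in practice they make little difference to the result. Because the code treats a point lying exactly on the threshold as negative, (3) with $\theta$ equal to the largest value still produces the all-negative labeling (and, with $s = -1$, the all-positive one) that (2) obtains from its padded values. The code below uses (3).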
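To make the comparison concrete, here is a minimal sketch that builds the candidate thresholds for each of the three options and counts how many distinct labelings of the samples a stump $s \cdot \mathrm{sign}(x - \theta)$ can produce with them. The helper names `candidate_thresholds` and `dichotomies` are made up for this illustration, and the padding of one unit on each side in option (2) is simply the choice used in the example above, not a recommendation.

```python
import numpy as np


def candidate_thresholds(values, method):
    """Candidate thetas for options (1), (2) and (3) described above."""
    xs = np.sort(np.asarray(values, dtype=float))
    if method == 1:
        # midpoints of adjacent sorted values
        return (xs[:-1] + xs[1:]) / 2
    if method == 2:
        # same, after padding one unit below the min and above the max (an assumption)
        padded = np.concatenate(([xs[0] - 1], xs, [xs[-1] + 1]))
        return (padded[:-1] + padded[1:]) / 2
    # option (3): the observed values themselves
    return xs


def dichotomies(values, thetas):
    """Distinct labelings produced by s * sign(x - theta), with sign(0) taken as -1."""
    x = np.asarray(values, dtype=float)
    found = set()
    for theta in thetas:
        base = np.where(x > theta, 1, -1)
        for s in (1, -1):
            found.add(tuple(int(v) for v in s * base))
    return found


values = [1, 3, 2]
for method in (1, 2, 3):
    thetas = candidate_thresholds(values, method)
    print(method, thetas.tolist(), len(dichotomies(values, thetas)))
```

On this example, options (2) and (3) reach the same set of labelings, while option (1) misses the two constant ones (all positive, all negative).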

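util.py (shared helper, used to load the data)

# -*- coding:utf-8 -*-
# Author: Evan Mi
import numpy as np


def load_data(file_name):
    """
    Load whitespace-separated samples: every column except the last is a
    feature, and the last column is the integer label.
    A constant 1 is prepended to each feature row.
    """
    x = []
    y = []
    # Read-only access is enough here
    with open(file_name, 'r') as f:
        for line in f:
            temp = line.split()
            if not temp:
                continue  # skip blank lines
            temp.insert(0, '1')
            x_temp = [float(val) for val in temp[:-1]]
            y_tem = int(temp[-1])
            x.append(x_temp)
            y.append(y_tem)

    nx = np.array(x)
    ny = np.array(y)
    return nx, ny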
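As a quick illustration of what the loader returns, the sketch below applies the same parsing to two in-memory lines. `parse_rows` is a hypothetical stand-in for `load_data` that reads from any iterable of lines, and the sample values are invented. Note the prepended constant: the feature matrix has one more column than the file, so column 0 is always 1 and the real features start at index 1.

```python
import io

import numpy as np


def parse_rows(lines):
    """Hypothetical in-memory variant of load_data (same parsing rules)."""
    x, y = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # constant 1 first, then every column except the last
        x.append([1.0] + [float(v) for v in fields[:-1]])
        # last column is the label
        y.append(int(fields[-1]))
    return np.array(x), np.array(y)


sample = io.StringIO("0.5 -1.2 1\n-0.3 0.8 -1\n")
features, labels = parse_rows(sample)
print(features.shape)
print(labels)
```

If the stump later reports a chosen dimension, that index refers to this padded matrix, so it is offset by one relative to the columns in the data file.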

decision_stump_one_dimension.py (for the one-dimensional problem)

# -*- coding:utf-8 -*-
# Author: Evan Mi
import numpy as np


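def sign_zero_as_neg(x):
    """
    Variant of np.sign that returns -1 instead of 0 when the input is 0,
    so points lying on the boundary are treated as negative examples.
    :param x: NumPy array of values
    :return: array of the same shape containing -1 or 1
    """
    result = np.sign(x)
    result[result == 0] = -1
    return result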


def data_generator(size):
    """
    Draw random numbers in [-1, 1), then add 20% noise, i.e. each observed
    label is flipped with probability 0.2.
    :param size: number of samples
    :return: (x_arr, y_arr)
    """
    x_arr = np.random.uniform(-1, 1, size)
    y_arr = sign_zero_as_neg(x_arr)
    y_arr = np.where(np.random.uniform(0, 1, size) < 0.2, -y_arr, y_arr)
    return x_arr, y_arr


def err_in_counter(x_arr, y_arr, s, theta):
    """
    Count the in-sample errors (E_in before dividing by n) for every candidate theta
    :param x_arr: the samples, repeated once per candidate theta
    [[x1, x2, x3, ... ,xn]
     [x1, x2, x3, ... ,xn]
     [x1, x2, x3, ... ,xn]
            ...
     [x1, x2, x3, ... ,xn]]
    :param y_arr: the labels, repeated once per candidate theta
    [[y1, y2, y3, ... ,yn]
     [y1, y2, y3, ... ,yn]
     [y1, y2, y3, ... ,yn]
            ...
     [y1, y2, y3, ... ,yn]]
    :param s:{-1,1}
    :param theta: row k holds candidate theta_k, repeated once per sample
    [[theta1, theta1, theta1, ... ,theta1]
     [theta2, theta2, theta2, ..., theta2]
     [theta3, theta3, theta3, ..., theta3]
                ...
     [thetak, thetak, thetak, ..., thetak]]
    :return: the smallest of [err_theta1, err_theta2, ..., err_thetak] and its index
    """
    result = s * sign_zero_as_neg(x_arr - theta)
    err_tile = np.where(result == y_arr, 0, 1).sum(1)
    return err_tile.min(), err_tile.argmin()


def err_out_calculator(s, theta):
    # Closed-form E_out for x uniform on [-1, 1], target sign(x), 20% label noise
    return 0.5 + 0.3 * s * (abs(theta) - 1)


def decision_stump_1d(x_arr, y_arr):
    # Option (3): the observed values themselves are the candidate thresholds
    theta = x_arr
    # Row k of each tiled matrix pairs candidate theta[k] with every sample
    theta_tile = np.tile(theta, (len(x_arr), 1)).T
    x_tile = np.tile(x_arr, (len(theta), 1))
    y_tile = np.tile(y_arr, (len(theta), 1))
    err_pos, index_pos = err_in_counter(x_tile, y_tile, 1, theta_tile)
    err_neg, index_neg = err_in_counter(x_tile, y_tile, -1, theta_tile)
    # Return (E_in, E_out) of the better direction; ties go to s = -1
    if err_pos < err_neg:
        return err_pos / len(y_arr), err_out_calculator(1, theta[index_pos])
    else:
        return err_neg / len(y_arr), err_out_calculator(-1, theta[index_neg])


if __name__ == '__main__':
    avg_err_in = 0
    avg_err_out = 0
    for i in range(5000):
        x, y = data_generator(20)
        e_in, e_out = decision_stump_1d(x, y)
        avg_err_in = avg_err_in + (1.0 / (i + 1)) * (e_in - avg_err_in)
        avg_err_out = avg_err_out + (1.0 / (i + 1)) * (e_out - avg_err_out)
    print("e_in:", avg_err_in)
    print("e_out:", avg_err_out)
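The formula in `err_out_calculator` is used without explanation, so here is one way to see where it comes from. For $x$ uniform on $[-1, 1]$ and target $\mathrm{sign}(x)$, a stump with $s = +1$ disagrees with the target only between $0$ and $\theta$, which has probability $\mu = |\theta|/2$; with $s = -1$ it is $\mu = 1 - |\theta|/2$. With labels flipped with probability 0.2, $E_{out} = 0.8\mu + 0.2(1 - \mu) = 0.2 + 0.6\mu$, which rearranges to $0.5 + 0.3\,s\,(|\theta| - 1)$. The sketch below checks this numerically by simulation; `err_out_monte_carlo`, the sample count and the seed are assumptions made for this illustration only.

```python
import numpy as np


def err_out_closed_form(s, theta):
    # Same expression as err_out_calculator in the listing above
    return 0.5 + 0.3 * s * (abs(theta) - 1)


def err_out_monte_carlo(s, theta, n_samples=200_000, seed=0):
    """Estimate E_out by sampling fresh noisy data (illustrative helper)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, n_samples)
    # noiseless target sign(x), with sign(0) taken as -1
    y = np.where(x > 0, 1, -1)
    # flip each label with probability 0.2
    flip = rng.uniform(0, 1, n_samples) < 0.2
    y = np.where(flip, -y, y)
    # stump prediction, boundary points counted as negative
    pred = s * np.where(x > theta, 1, -1)
    return float(np.mean(pred != y))


for s, theta in [(1, 0.3), (-1, -0.5)]:
    print(s, theta, err_out_closed_form(s, theta), err_out_monte_carlo(s, theta))
```

The two numbers on each printed line should agree up to sampling noise.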

decision_stump_multi_dimension.py (for the multi-dimensional problem)

# -*- coding:utf-8 -*-
# Author: Evan Mi
import numpy as np
from decison_stump import util


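def sign_zero_as_neg(x):
    """
    Variant of np.sign that returns -1 instead of 0 when the input is 0,
    so points lying on the boundary are treated as negative examples.
    :param x: NumPy array of values
    :return: array of the same shape containing -1 or 1
    """
    result = np.sign(x)
    result[result == 0] = -1
    return result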


def err_in_counter(x_arr, y_arr, s, theta):
    """
    Count the in-sample errors (E_in before dividing by n) for every candidate theta
    :param x_arr: the samples, repeated once per candidate theta
    [[x1, x2, x3, ... ,xn]
     [x1, x2, x3, ... ,xn]
     [x1, x2, x3, ... ,xn]
            ...
     [x1, x2, x3, ... ,xn]]
    :param y_arr: the labels, repeated once per candidate theta
    [[y1, y2, y3, ... ,yn]
     [y1, y2, y3, ... ,yn]
     [y1, y2, y3, ... ,yn]
            ...
     [y1, y2, y3, ... ,yn]]
    :param s:{-1,1}
    :param theta: row k holds candidate theta_k, repeated once per sample
    [[theta1, theta1, theta1, ... ,theta1]
     [theta2, theta2, theta2, ..., theta2]
     [theta3, theta3, theta3, ..., theta3]
                ...
     [thetak, thetak, thetak, ..., thetak]]
    :return: the smallest of [err_theta1, err_theta2, ..., err_thetak] and its index
    """
    result = s * sign_zero_as_neg(x_arr - theta)
    err_tile = np.where(result == y_arr, 0, 1).sum(1)
    return err_tile.min(), err_tile.argmin()


def err_out_counter(x_arr, y_arr, s, theta, dimension):
    # Error rate of the chosen stump on held-out data, used as the estimate of E_out
    temp = s * sign_zero_as_neg(x_arr.T[dimension] - theta)
    e_out = np.where(temp == y_arr, 0, 1).sum() / np.size(x_arr, 0)
    return e_out


def decision_stump_1d(x_arr, y_arr):
    # Option (3): the observed values themselves are the candidate thresholds
    theta = x_arr
    # Row k of each tiled matrix pairs candidate theta[k] with every sample
    theta_tile = np.tile(theta, (len(x_arr), 1)).T
    x_tile = np.tile(x_arr, (len(theta), 1))
    y_tile = np.tile(y_arr, (len(theta), 1))
    err_pos, index_pos = err_in_counter(x_tile, y_tile, 1, theta_tile)
    err_neg, index_neg = err_in_counter(x_tile, y_tile, -1, theta_tile)
    # Return (E_in, index of the best theta, s); ties go to s = -1
    if err_pos < err_neg:
        return err_pos / len(y_arr), index_pos, 1
    else:
        return err_neg / len(y_arr), index_neg, -1


def decision_stump_multi_d(x, y):
    # After the transpose, each row of x holds one feature across all samples
    x = x.T
    dimension, e_in, theta, s = 0, float('inf'), 0, 0
    for i in range(np.size(x, 0)):
        e_in_temp, index, s_temp = decision_stump_1d(x[i], y)
        if e_in_temp < e_in:
            dimension, e_in, theta, s = i, e_in_temp, x[i][index], s_temp
        # When the error rates are equal, choose at random.
        # elif: otherwise this test is always true right after the update above
        elif e_in_temp == e_in:
            pick_rate = np.random.uniform(0, 1)
            if pick_rate > 0.5:
                dimension, e_in, theta, s = i, e_in_temp, x[i][index], s_temp
    return dimension, e_in, theta, s


if __name__ == '__main__':
    x_train, y_train = util.load_data('data/train.txt')
    x_test, y_test = util.load_data('data/test.txt')
    determined_dimension, e_in_result, theta_result, s_result = decision_stump_multi_d(x_train, y_train)
    print("E_IN:", e_in_result)
    print("E_OUT:", err_out_counter(x_test, y_test, s_result, theta_result, determined_dimension))
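The listings above build explicit `np.tile` matrices to compare every sample against every candidate threshold. As an alternative, the same comparison can be written with NumPy broadcasting. The following is a minimal sketch, not the author's implementation: `stump_1d` and `stump_multi_d` are names invented here, the toy data is made up, and ties between dimensions are resolved by keeping the first one found rather than at random.

```python
import numpy as np


def stump_1d(x, y):
    """Best stump on one feature; returns (error rate, theta, s)."""
    # pred[k, j] is the prediction for sample j with theta = x[k] and s = +1;
    # a point on the boundary (x[j] == theta) counts as negative.
    pred = np.where(x[None, :] > x[:, None], 1, -1)
    err_pos = (pred != y[None, :]).sum(axis=1)
    # With labels in {-1, +1}, flipping s turns every hit into a miss.
    err_neg = len(y) - err_pos
    if err_pos.min() < err_neg.min():
        k = int(err_pos.argmin())
        return err_pos[k] / len(y), float(x[k]), 1
    k = int(err_neg.argmin())
    return err_neg[k] / len(y), float(x[k]), -1


def stump_multi_d(features, y):
    """Try every column and keep the stump with the lowest in-sample error."""
    best = None
    for d in range(features.shape[1]):
        e_in, theta, s = stump_1d(features[:, d], y)
        # strict comparison: the first dimension wins on ties (an assumption)
        if best is None or e_in < best[1]:
            best = (d, e_in, theta, s)
    return best


toy_x = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
toy_y = np.array([-1, -1, 1, 1])
dimension, e_in, theta, s = stump_multi_d(toy_x, toy_y)
print(dimension, e_in, theta, s)
```

On the toy data the first column separates the labels perfectly at $\theta = 2$ with $s = +1$, so that is the stump selected. Broadcasting avoids materializing three $n \times n$ copies of the inputs, though the pairwise comparison itself is still $O(n^2)$ per feature.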

The full project code and the data it uses can be found at: DECISION_STUMP