Hand-Derived Machine Learning, Part 1: Regression and Classification (with Sample Code)

理想主义小白

Categories: Machine Learning, Hand-Derived Algorithms. Tags: machine learning, python

Article link: https://blog.csdn.net/weixin_44063570/article/details/107863983

Ridge / LASSO / Elastic Net regression, gradient descent, Sigmoid/Softmax, lifting data to higher dimensions, and more.

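0 Preface
1 Regression: Linear Regression
2 Classification: Logistic Regression
Extension: Generalized Linear Models and Lifting Data to Higher Dimensions
3 Code (sklearn implementation)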

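0 Preface

I recently went back through Zou Bo's machine learning course and was struck by how thoroughly he understands the models and the mathematics behind them, especially his formula derivations and intuitive explanations. I am organizing the key points here to reinforce my own memory. If you spot errors or anything inappropriate, corrections are very welcome.

This first part centers on two basic models, linear regression and logistic regression, which represent the two most fundamental machine learning tasks: regression and classification. It also branches out into some related topics; see the table of contents for details.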

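1 Regression: Linear Regression

By definition, a regression task learns a mapping f from x to y, where y takes continuous values. It is commonly used for data analysis and curve fitting, factor and correlation analysis, weather forecasting, stock price prediction, product control, and similar settings.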

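1.1 The loss function of (multiple) linear regression

1.1.1 Maximum likelihood estimation and least squares

The loss function of multiple linear regression is easy to understand: it is the sum of squared differences between the model's predictions and the true values.

It can, however, be understood more deeply from the viewpoint of maximum likelihood estimation.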
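
A sketch of that argument, under assumptions stated here rather than taken from the post: m training samples, the hypothesis hθ(x) = θᵀx, and independent Gaussian noise ε with variance σ². Maximizing the log-likelihood ℓ(θ) is then the same as minimizing the squared-error loss J(θ), because the first term of ℓ(θ) does not depend on θ:

```latex
\begin{aligned}
y^{(i)} &= \theta^{T}x^{(i)} + \varepsilon^{(i)}, \qquad \varepsilon^{(i)} \sim \mathcal{N}(0,\sigma^{2}) \\
L(\theta) &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
             \exp\!\left(-\frac{\bigl(y^{(i)}-\theta^{T}x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right) \\
\ell(\theta) &= \log L(\theta)
             = m\log\frac{1}{\sqrt{2\pi}\,\sigma}
               - \frac{1}{\sigma^{2}}\cdot\frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)}-\theta^{T}x^{(i)}\bigr)^{2} \\
J(\theta) &= \frac{1}{2}\sum_{i=1}^{m}\bigl(h_{\theta}(x^{(i)})-y^{(i)}\bigr)^{2}
\end{aligned}
```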

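1.1.2 Closed-form solution of the cost function, and the λ perturbation

Before introducing gradient descent, we first look at the closed-form solution of the linear regression cost function, which is the normal equation method from Andrew Ng's lectures. It yields the optimal θ directly, but it only applies to linear regression models.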
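
As a minimal sketch of the normal equation, assuming the usual form θ = (XᵀX + λI)⁻¹Xᵀy: with λ = 0 this is plain least squares, and a small positive λ (the "λ perturbation") keeps the matrix invertible even when XᵀX is singular. The function name `normal_equation` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np


def normal_equation(X, y, lam=0.0):
    """Closed-form least squares: theta = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solve A theta = X^T y instead of forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)


# Toy data generated from y = 1 + 2x (hypothetical, for illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])    # prepend the bias column
y = 1.0 + 2.0 * x

theta = normal_equation(X, y)                # lam = 0: plain least squares
theta_ridge = normal_equation(X, y, lam=1.0) # lam > 0: shrunken solution
print(theta)
print(theta_ridge)
```

With noiseless data the unperturbed solution recovers the generating coefficients, while the perturbed one is pulled toward zero.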

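1.2 Complexity penalty (regularization)

Ridge regression, LASSO regression, and Elastic Net regression

Three regression models are introduced here. In each of them λ is the penalty factor, a hyperparameter that can be tuned according to performance on the validation set.
Take Ridge regression as an example: the sum that follows λ is the sum of squared parameters. During gradient descent it "penalizes" (shrinks) large θ parameters, which helps prevent overfitting.
Ridge regression

LASSO regression (Least Absolute Shrinkage and Selection Operator)

Unlike Ridge regression, which adds the squared L2 norm of θ, LASSO adds the L1 norm of θ.
Because the absolute value is not differentiable at zero, dedicated optimization methods were proposed for it, such as the LARS algorithm (Bradley Efron et al., 2004).

Elastic Net regression

Elastic Net combines the L1 norm and the L2 norm.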
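
The three penalized objectives can be sketched as follows. This is one common convention, not necessarily the one in the original slides: the scaling of the loss, the mixing weight `rho`, and whether the bias term is penalized all vary between textbooks and libraries, and the function names are hypothetical.

```python
import numpy as np


def squared_error(theta, X, y):
    """Unregularized loss: 0.5 * sum of squared residuals."""
    r = X @ theta - y
    return 0.5 * (r @ r)


def ridge_cost(theta, X, y, lam):
    """Squared error plus lam times the squared L2 norm of theta."""
    return squared_error(theta, X, y) + lam * np.sum(theta ** 2)


def lasso_cost(theta, X, y, lam):
    """Squared error plus lam times the L1 norm of theta."""
    return squared_error(theta, X, y) + lam * np.sum(np.abs(theta))


def elastic_net_cost(theta, X, y, lam, rho):
    """Squared error plus a rho-weighted mix of the L1 and squared L2 penalties."""
    l1 = np.sum(np.abs(theta))
    l2 = np.sum(theta ** 2)
    return squared_error(theta, X, y) + lam * (rho * l1 + (1.0 - rho) * l2)


# Hypothetical toy values, chosen so the numbers are easy to check by hand.
X = np.eye(2)
y = np.array([1.0, 1.0])
theta = np.array([2.0, -1.0])
print(ridge_cost(theta, X, y, lam=0.1))
print(lasso_cost(theta, X, y, lam=0.1))
print(elastic_net_cost(theta, X, y, lam=0.1, rho=0.5))
```

In this parameterization `rho=1` gives back the LASSO objective and `rho=0` the Ridge objective.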

1.3 Gradient descent
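
As a minimal sketch of the method, batch gradient descent for linear regression repeatedly moves θ against the gradient of the squared-error loss. The step size `alpha`, the iteration count, the 1/m scaling of the loss, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent for J(theta) = 1/(2m) * sum((X theta - y)^2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of J with respect to theta
        theta -= alpha * grad              # step against the gradient
    return theta


# Toy data generated from y = 1 + 2x (hypothetical).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = gradient_descent(X, y)
print(theta)
```

If `alpha` is too large the iteration diverges; if it is too small, many more iterations are needed.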

1.4 Further analysis of linear regression

The model may be nonlinear in the samples x, as long as it is linear in the parameters θ.
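
For instance, a quadratic curve is nonlinear in x but still linear in θ, so ordinary least squares can fit it once the columns 1, x, x² are used as features. The sketch below uses hypothetical noiseless data.

```python
import numpy as np

# Hypothetical noiseless data from y = 1 - 3x + 0.5x^2.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 3.0 * x + 0.5 * x ** 2

# Design matrix with columns [1, x, x^2]: nonlinear in x, linear in theta.
Phi = np.column_stack([np.ones_like(x), x, x ** 2])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)
```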

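2 Classification: Logistic Regression

2.1 Binary classification: the Sigmoid function

2.1.1 The Sigmoid function and its derivative

The Sigmoid function can be thought of as an S-shaped function. It is also one of the common activation functions (see Section 2.3 below).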
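
A short sketch of the function and its derivative, using the standard definition g(z) = 1 / (1 + e^(-z)) and the identity g'(z) = g(z)(1 - g(z)). The function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    """g(z) = 1 / (1 + exp(-z)); maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """g'(z) = g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(sigmoid_grad(z))
```

The derivative is largest at z = 0 and decays toward zero in both tails, which is where the S shape flattens out.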

2.1.2 Parameter estimation for logistic regression and the (negative) log-likelihood function
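
A sketch of the standard derivation, assuming labels y ∈ {0, 1}, m samples, and the hypothesis hθ(x) given by the Sigmoid of θᵀx. The loss minimized in practice is the negative log-likelihood:

```latex
\begin{aligned}
P(y \mid x;\theta) &= h_{\theta}(x)^{y}\,\bigl(1-h_{\theta}(x)\bigr)^{1-y},
  \qquad h_{\theta}(x)=\frac{1}{1+e^{-\theta^{T}x}} \\
\ell(\theta) &= \sum_{i=1}^{m}\Bigl[\,y^{(i)}\log h_{\theta}(x^{(i)})
  + \bigl(1-y^{(i)}\bigr)\log\bigl(1-h_{\theta}(x^{(i)})\bigr)\Bigr] \\
\frac{\partial \ell(\theta)}{\partial \theta_{j}}
  &= \sum_{i=1}^{m}\bigl(y^{(i)}-h_{\theta}(x^{(i)})\bigr)\,x_{j}^{(i)} \\
\mathrm{NLL}(\theta) &= -\ell(\theta)
\end{aligned}
```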


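2.1.3 Comparison with linear regression

After working through the derivation, we find that logistic regression and linear regression have gradient descent update formulas of the same form! The only difference is the prediction function hθ(x).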
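
A small sketch of that observation, with hypothetical toy data and step size: the same update loop serves both models, and only the `hypothesis` function passed in changes (the identity for linear regression, the Sigmoid for logistic regression).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit(X, y, hypothesis, alpha=0.1, n_iters=5000):
    """Gradient descent with the shared update theta -= alpha * X^T (h - y) / m."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = hypothesis(X @ theta)           # the only model-specific line
        theta -= alpha * X.T @ (h - y) / m
    return theta


x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])

# Linear regression: h(z) = z, targets from y = 1 + 2x.
theta_lin = fit(X, 1.0 + 2.0 * x, hypothesis=lambda z: z)

# Logistic regression: h(z) = sigmoid(z), labels that are not linearly separable.
y_cls = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta_log = fit(X, y_cls, hypothesis=sigmoid)

print(theta_lin)
print(theta_log)
```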

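2.2 Multi-class classification: the Softmax function

(I ran out of steam deriving this by hand, so for now here is a screenshot of the slides; I will derive it another day.)
[Figure: plot of the softmax function, two-dimensional case]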

2.3 The relationship between Sigmoid/Softmax and neural networks

Note also that when softmax is used for binary classification, it reduces to the Sigmoid.
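
This reduction can be checked numerically. In the sketch below (function names and scores are hypothetical), the softmax probability of the first of two classes equals the Sigmoid of the score difference z1 - z2:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Softmax over a 1-D score vector."""
    z = z - np.max(z)      # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()


# Two-class case: e^z1 / (e^z1 + e^z2) = 1 / (1 + e^-(z1 - z2)).
z1, z2 = 1.3, -0.4
p = softmax(np.array([z1, z2]))
print(p[0], sigmoid(z1 - z2))
```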

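Extension: Generalized Linear Models and Lifting Data to Higher Dimensions

With first-order features, logistic regression is a typical log-linear model; on a plot, this shows up as a decision boundary that is a straight line (or a hyperplane).

However, any first-order linear model can be lifted to a higher dimension. In practice you can "select features" as needed: if the original features are 1, x1, x2, x3, …, xn, the lifted features are 1, x1, x2, x3, …, x1x2, x1x3, …, x1^2, …
On a plot, the decision boundary then becomes a "curve".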
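
As a small illustration of this lifting, sklearn's `PolynomialFeatures` (also used in the code section of this post) expands two features x1, x2 at degree 2 into the columns 1, x1, x2, x1^2, x1x2, x2^2. The sample values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])    # one sample with x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2, include_bias=True)
X_lifted = poly.fit_transform(X)
print(X_lifted)               # columns: 1, x1, x2, x1^2, x1*x2, x2^2
```

A classifier that is linear in these six lifted features has a quadratic decision boundary in the original (x1, x2) plane.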

3 Code (sklearn implementation)

3.1 Regression example: Boston house price prediction

Dataset download link: https://archive.ics.uci.edu/ml/datasets/Housing

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
import sklearn.datasets
from pprint import pprint
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import warnings


def not_empty(s):
    return s != ''


if __name__ == "__main__":
    warnings.filterwarnings(action='ignore')
    np.set_printoptions(suppress=True)
    file_data = pd.read_csv('./housing.data', header=None)                # read the data, one string per row
    # a = np.array([float(s) for s in str if s != ''])
    data = np.empty((len(file_data), 14))
    for i, d in enumerate(file_data.values):
        d = list(map(float, list(filter(not_empty, d[0].split(' ')))))
        data[i] = d
    x, y = np.split(data, (13, ), axis=1)                                 # first 13 columns are x, the last column is y
    # data = sklearn.datasets.load_boston()   # note: load_boston was removed in scikit-learn 1.2
    # x = np.array(data.data)
    # y = np.array(data.target)
    print('Number of samples: %d, number of features: %d' % x.shape)
    print(y.shape)
    y = y.ravel()

    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=0)
    model = Pipeline([
        ('ss', StandardScaler()),
        ('poly', PolynomialFeatures(degree=3, include_bias=True)),
        ('linear', ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.99, 1], alphas=np.logspace(-3, 2, 5),
                                fit_intercept=False, max_iter=1000, cv=3))            # Elastic Net regression with 3-fold cross-validation
    ])   # Pipeline parameters are set as model.set_params(<step name>__<parameter>=value)
    print('Start fitting...')
    model.fit(x_train, y_train)
    # linear = model.get_params('linear')['linear']
    # print('Hyperparameter alpha:', linear.alpha_)
    # print('L1 ratio:', linear.l1_ratio_)
    # print('Coefficients:', linear.coef_.ravel())

    order = y_test.argsort(axis=0)
    y_test = y_test[order]
    x_test = x_test[order, :]
    y_pred = model.predict(x_test)
    r2 = model.score(x_test, y_test)
    mse = mean_squared_error(y_test, y_pred)
    print('R2:', r2)
    print('Mean squared error:', mse)

    t = np.arange(len(y_pred))
    mpl.rcParams['font.sans-serif'] = ['simHei']   # CJK-capable font; only needed for Chinese labels
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(t, y_test, 'r-', lw=2, label='True value')
    plt.plot(t, y_pred, 'g-', lw=2, label='Predicted value')
    plt.legend(loc='best')
    plt.title('Boston house price prediction', fontsize=18)
    plt.xlabel('Sample index', fontsize=15)
    plt.ylabel('House price', fontsize=15)
    plt.grid()
    plt.show()
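
The `<step name>__<parameter>` convention mentioned in the Pipeline comment above can be tried on its own. The sketch below reuses the step names `ss`, `poly`, and `linear` from the example but swaps in a plain `ElasticNet` so that nothing needs to be fitted; the parameter values are arbitrary.

```python
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

model = Pipeline([
    ('ss', StandardScaler()),
    ('poly', PolynomialFeatures(degree=3)),
    ('linear', ElasticNet()),
])

# <step name>__<parameter name> addresses one parameter of one pipeline step.
model.set_params(poly__degree=2, linear__alpha=0.5, linear__l1_ratio=0.3)

print(model.named_steps['poly'].degree)
print(model.get_params()['linear__alpha'])
```

The same double-underscore names are what a grid search over a Pipeline expects as keys of its parameter grid.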

3.2 Classification example: Iris

Dataset download link: http://archive.ics.uci.edu/ml/datasets/Iris

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
# import matplotlib.pyplot as plt
# import matplotlib as mpl
# import matplotlib.patches as mpatches


if __name__ == "__main__":
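    path = './iris.data'  # path to the data file

    def iris_type(s):
        # Map the class name to an integer label. np.loadtxt may pass bytes or str
        # to a converter depending on the NumPy version, so accept both.
        if isinstance(s, bytes):
            s = s.decode()
        it = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}
        return it[s]

    # Path, float data, comma-separated; column 4 is converted by iris_type
    data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})
    print(data)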

    data = pd.read_csv(path, header=None)
    data[4] = pd.Categorical(data[4]).codes
    # iris_types = data[4].unique()
    # print(iris_types)
    # for i, type in enumerate(iris_types):
    #     data.set_value(data[4] == type, 4, i)
    x, y = np.split(data.values, (4,), axis=1)
    # print('x = \n', x)
    # print('y = \n', y)
    # use only the first two feature columns
    x = x[:, :2]
    lr = Pipeline([('sc', StandardScaler()),
                   ('poly', PolynomialFeatures(degree=2)),
                   ('clf', LogisticRegression()) ])                  # logistic regression model
    lr.fit(x, y.ravel())
    y_hat = lr.predict(x)
    y_hat_prob = lr.predict_proba(x)
    np.set_printoptions(suppress=True)
    print('y_hat = \n', y_hat)
    print('y_hat_prob = \n', y_hat_prob)
    print('Accuracy: %.2f%%' % (100*np.mean(y_hat == y.ravel())))

# Plotting
# N, M = 500, 500     # number of sample points along each axis
# x1_min, x1_max = x[:, 0].min(), x[:, 0].max()   # range of column 0
# x2_min, x2_max = x[:, 1].min(), x[:, 1].max()   # range of column 1
# t1 = np.linspace(x1_min, x1_max, N)
# t2 = np.linspace(x2_min, x2_max, M)
# x1, x2 = np.meshgrid(t1, t2)                    # build the grid of sample points
# x_test = np.stack((x1.flat, x2.flat), axis=1)   # test points
# 
# # # no real meaning, just filling in the other two dimensions
# # x3 = np.ones(x1.size) * np.average(x[:, 2])
# # x4 = np.ones(x1.size) * np.average(x[:, 3])
# # x_test = np.stack((x1.flat, x2.flat, x3, x4), axis=1)  # test points
# 
# mpl.rcParams['font.sans-serif'] = ['simHei']
# mpl.rcParams['axes.unicode_minus'] = False
# cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
# cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
# y_hat = lr.predict(x_test)                  # predicted labels
# y_hat = y_hat.reshape(x1.shape)                 # reshape to match the input grid
# plt.figure(facecolor='w')
# plt.pcolormesh(x1, x2, y_hat, cmap=cm_light)     # show the predicted regions
# plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k', s=50, cmap=cm_dark)    # show the samples
# plt.xlabel(u'Sepal length', fontsize=14)
# plt.ylabel(u'Sepal width', fontsize=14)
# plt.xlim(x1_min, x1_max)
# plt.ylim(x2_min, x2_max)
# plt.grid()
# patchs = [mpatches.Patch(color='#77E0A0', label='Iris-setosa'),
#           mpatches.Patch(color='#FF8080', label='Iris-versicolor'),
#           mpatches.Patch(color='#A0A0FF', label='Iris-virginica')]
# plt.legend(handles=patchs, fancybox=True, framealpha=0.8)
# plt.title(u'Iris classification with logistic regression - standardized', fontsize=17)
# plt.show()

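The plotting code ran into some problems while debugging T^T: the errors were an invalid RGBA value and a failure caused by changes to the plt.scatter function. I will upload it again once I get the chance to finish debugging.

Saving the model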

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import matplotlib as mpl
import os
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


if __name__ == "__main__":
    data = pd.read_csv('..\\iris.data', header=None)
    x = data[[0, 1]]
    y = pd.Categorical(data[4]).codes
    if os.path.exists('iris.model'):                              # check whether a saved model already exists
        print('Load Model...')
        lr = joblib.load('iris.model')
    else:
        print('Train Model...')
        lr = Pipeline([('sc', StandardScaler()),
                       ('poly', PolynomialFeatures(degree=3)),
                       ('clf', LogisticRegression()) ])
        lr.fit(x, y.ravel())
    y_hat = lr.predict(x)
    joblib.dump(lr, 'iris.model')
    print('y_hat = \n', y_hat)
    print('accuracy = %.3f%%' % (100*accuracy_score(y, y_hat)))
