Task01: Logistic Regression & Linear Regression

Latest recommended article published on 2024-08-12 19:10:16

小曹小曹喜欢吃草

Category: ml. Tags: logistic regression, linear regression, machine learning

Article link: https://blog.csdn.net/weixin_46180512/article/details/121965562

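1. Logistic Regression

Given data $X=\{x_1,x_2,\ldots\}$ and labels $Y=\{y_1,y_2,\ldots\}$, consider the binary classification task, i.e. $y_i\in\{0,1\},\ i=1,2,\ldots$

Hypothesis function

The hypothesis function is the basic model: $h_{\theta}(x)=g(\theta^{T}x)$, where $\theta^{T}x=w^Tx+b$ (the bias $b$ is absorbed into $\theta$ by appending a constant 1 to $x$), and $g(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function, also called an activation function.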
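As a minimal sketch of the hypothesis function, the snippet below evaluates $h_\theta(x)$ for one sample. The numeric values of `theta` and `x` are made up for illustration, and the convention of storing $b$ as the last entry of `theta` is an assumption chosen to match the appended constant 1.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x ends with a constant 1 so theta[-1] acts as b."""
    return sigmoid(np.dot(theta, x))

# Illustrative values (not from the article): w = [2, -1], b = 0.5
theta = np.array([2.0, -1.0, 0.5])
x = np.array([1.0, 3.0, 1.0])  # features [1, 3] followed by the constant 1
p = hypothesis(theta, x)       # theta^T x = 2 - 3 + 0.5 = -0.5
print(p)                       # estimated probability that y = 1
```

Because the sigmoid maps any real number into (0, 1), the output can be read as $p(y=1|x)$.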

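Loss function

The loss function, also called the cost function, measures how good the model is. Here it can be defined through maximum likelihood estimation.

For the difference between likelihood and probability, and for what maximum likelihood estimation is, see the article "一文搞懂极大似然估计".

The cost function is built from the likelihood $L(\theta)=\prod_{i=1}p(y_i|x_i)=h_\theta(x_1)(1-h_\theta(x_2))\cdots$ , where, as an example, $x_1$ has label $y_1=1$ and $x_2$ has label $y_2=0$. That is, the probability of the positive class is set to $h_\theta(x_i)$ : $p(y_i=1|x_i)=h_\theta(x_i)$ , $p(y_i=0|x_i)=1-h_\theta(x_i)$

By the maximum likelihood principle, our goal is $\theta^* = \arg \max _{\theta} L(\theta)$

To simplify the computation, take the logarithm, which gives $\theta^* = \arg \max _{\theta} L(\theta) \Rightarrow \theta^* = \arg \min _{\theta} -\ln(L(\theta))$

Simplifying yields (stated here only for the sake of writing code; for the full derivation see the "watermelon book", Zhou Zhihua's Machine Learning): $-\ln(L(\theta))=\ell(\boldsymbol{\theta})=\sum_{i=1}(-y_i\theta^Tx_i+\ln(1+e^{\theta^Tx_i}))$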
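For readers who want the intermediate steps, the simplification can be sketched as follows. It uses only the definitions above, writing the per-sample probability in the compact form $h^{y}(1-h)^{1-y}$ and substituting $h_\theta(x_i)=e^{\theta^Tx_i}/(1+e^{\theta^Tx_i})$.

```latex
\begin{align*}
p(y_i \mid x_i) &= h_\theta(x_i)^{y_i}\,\bigl(1-h_\theta(x_i)\bigr)^{1-y_i} \\
-\ln L(\theta) &= -\sum_{i}\Bigl[y_i \ln h_\theta(x_i) + (1-y_i)\ln\bigl(1-h_\theta(x_i)\bigr)\Bigr] \\
\ln h_\theta(x_i) &= \theta^{T}x_i - \ln\bigl(1+e^{\theta^{T}x_i}\bigr), \qquad
\ln\bigl(1-h_\theta(x_i)\bigr) = -\ln\bigl(1+e^{\theta^{T}x_i}\bigr) \\
-\ln L(\theta) &= \sum_{i}\Bigl(-y_i\theta^{T}x_i + \ln\bigl(1+e^{\theta^{T}x_i}\bigr)\Bigr)
\end{align*}
```

The second line is the familiar cross-entropy loss, so minimizing $\ell(\boldsymbol{\theta})$ and minimizing cross-entropy are the same problem.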

Solving: gradient descent

By convex optimization theory, the optimum of this function can be found with gradient descent or Newton's method.

For gradient descent, with $\eta$ the learning rate: $\theta^{t+1}=\theta^{t}-\eta \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ where $\frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{i=1}(-y_ix_i+\frac{e^{\theta^Tx_i}x_i}{1+e^{\theta^Tx_i}})=\sum_{i=1}x_i(-y_i+h_\theta(x_i))=\sum_{i=1}x_i(-error_i)$ and $error_i=y_i-h_\theta(x_i)$

Gradient ascent is slightly more convenient here: $\theta^{t+1}=\theta^{t}+\eta \frac{-\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ where $\frac{-\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{i=1}(y_ix_i-\frac{e^{\theta^Tx_i}x_i}{1+e^{\theta^Tx_i}})=\sum_{i=1}x_i(y_i-h_\theta(x_i))=\sum_{i=1}x_i \cdot error_i$
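One way to gain confidence in the gradient formula is a finite-difference check. The sketch below compares the analytic gradient with a central-difference estimate on random data. The data, the seed and the function names are illustrative assumptions, not part of the original article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """l(theta) = sum_i ( -y_i * theta^T x_i + ln(1 + exp(theta^T x_i)) )."""
    z = X @ theta
    return np.sum(-y * z + np.log1p(np.exp(z)))

def gradient(theta, X, y):
    """Analytic gradient: sum_i x_i * (h_theta(x_i) - y_i)."""
    return X.T @ (sigmoid(X @ theta) - y)

# Random illustrative data: 20 samples, 3 features plus a constant column
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(20, 3)), np.ones((20, 1))])
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=4)

# Central differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (neg_log_likelihood(theta + eps * e, X, y)
     - neg_log_likelihood(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(4)
])
max_diff = np.max(np.abs(numeric - gradient(theta, X, y)))
print("largest deviation:", max_diff)
```

If the formula is right, the largest deviation should be tiny compared with the size of the gradient entries.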

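Pseudocode

The training algorithm is as follows:

Input: training data $X=\{x_1,x_2,\ldots,x_n\}$ and training labels $Y=\{y_1,y_2,\ldots,y_n\}$ ; note that both are in matrix form

Output: the trained model parameters $\theta$ , or equivalently $h_{\theta}(x)$

Initialize the model parameters $\theta$ , the number of iterations $n\_iters$ , and the learning rate $\eta$

$\mathbf{FOR} \ i\_iter \ \mathrm{in \ range}(n\_iters)$
$\quad \mathbf{FOR} \ i \ \mathrm{in \ range}(n)$   $\rightarrow n=\mathrm{len}(X)$
$\quad\quad error=y_i-h_{\theta}(x_i)$
$\quad\quad grad=error \cdot x_i$
$\quad\quad \theta \leftarrow \theta + \eta \cdot grad$   $\rightarrow$ gradient ascent
$\quad \mathbf{END \ FOR}$
$\mathbf{END \ FOR}$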
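The pseudocode above can be turned into a small runnable sketch. The version below uses synthetic two-dimensional data instead of a real dataset. The data generation, the default `n_iters` and `eta`, and the name `train_logistic` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, n_iters=50, eta=0.1):
    """Per-sample gradient ascent on the log-likelihood, as in the pseudocode."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        for x_i, y_i in zip(X, y):
            error = y_i - sigmoid(x_i @ theta)
            theta += eta * error * x_i
    return theta

# Synthetic data: label is 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
y = (features[:, 0] + features[:, 1] > 0).astype(float)
X = np.hstack([features, np.ones((n, 1))])  # constant column for the bias

theta = train_logistic(X, y)
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1))
print("training accuracy:", accuracy)
```

Predictions are thresholded at 0.5, which corresponds to the decision boundary $\theta^Tx=0$.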

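Regression

Depending on the distribution of the dependent variable, there are several kinds of regression:

Continuous: multiple linear regression (多重线性回归; note that this differs from multivariate linear regression, 多元线性回归: for example, in the multivariate case the independent variables are continuous, while in the multiple case they can be of several data types)
Binomial distribution: logistic regression
Poisson distribution: Poisson regression
Negative binomial distribution: negative binomial regression

Like linear regression, logistic regression has to solve for the $n$ parameters in $\theta$ : $z=\theta^{T}x$

Logistic regression introduces a nonlinearity through the sigmoid function, which makes binary classification easy to handle: $g(z)=\frac{1}{1+e^{-z}}$

Unlike linear regression, logistic regression uses the cross-entropy loss: $\ell(\boldsymbol{\theta})=-\sum_{i=1}\left[y_i\ln h_\theta(x_i)+(1-y_i)\ln(1-h_\theta(x_i))\right]$ Its gradient is: $\frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{i=1}x_i(h_\theta(x_i)-y_i)$

The form is the same as for linear regression, but the hypothesis function differs. For logistic regression it is: $h_{\theta}(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$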

import sys
from pathlib import Path

import numpy as np

# Add the parent directory to sys.path so that the local Mnist package can be imported
parent_path = str(Path().absolute().parent)
sys.path.append(parent_path)

from Mnist.load_data import load_local_mnist

(x_train, y_train), (x_test, y_test) = load_local_mnist(one_hot=False)

# print(np.shape(x_train), np.shape(y_train))

# Append a column of ones so that the bias b is absorbed into theta
x_train_modified = np.hstack([x_train, np.ones((len(x_train), 1))])
x_test_modified = np.hstack([x_test, np.ones((len(x_test), 1))])

# print(np.shape(x_train_modified))

# MNIST has ten labels (0-9). Since this is a binary task, the digit 1 is
# relabeled as the positive class (1) and every other digit as 0.
y_train_modified = (np.asarray(y_train) == 1).astype(int)
y_test_modified = (np.asarray(y_test) == 1).astype(int)
n_iters = 10

theta = np.zeros(x_train_modified.shape[1])
lr = 0.01  # learning rate

def sigmoid(x):
    """Sigmoid function."""
    return 1.0 / (1 + np.exp(-x))

Stochastic gradient descent (one update per sample)

for i_iter in range(n_iters):
    for n in range(len(x_train_modified)):
        hypothesis = sigmoid(np.dot(x_train_modified[n], theta))
        error = y_train_modified[n] - hypothesis
        grad = error * x_train_modified[n]
        theta += lr * grad  # gradient ascent on the log-likelihood
    print('LogisticRegression Model(learning_rate={},i_iter={})'.format(
        lr, i_iter + 1))
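The loop above updates `theta` after every single sample. For comparison, here is a hedged sketch of a mini-batch variant, which averages the gradient over a small batch before each update. It runs on synthetic data rather than MNIST, and `train_minibatch`, `batch_size` and the other defaults are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_minibatch(X, y, n_iters=100, lr=0.1, batch_size=16, seed=0):
    """Mini-batch gradient ascent on the log-likelihood."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(n_iters):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = y[idx] - sigmoid(X[idx] @ theta)
            theta += lr * (X[idx].T @ error) / len(idx)  # averaged batch gradient
    return theta

# Synthetic data: label is 1 when the two features sum to a positive number
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 2))
y = (features[:, 0] + features[:, 1] > 0).astype(float)
X = np.hstack([features, np.ones((200, 1))])

theta = train_minibatch(X, y)
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1))
print("training accuracy:", accuracy)
```

Averaging over a batch gives a less noisy update direction than a single sample, at the cost of fewer updates per pass over the data.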

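1. Introduction to Linear Regression

1.1 Generating the data

Linear regression is a stepping stone into machine learning algorithms. To make the introduction convenient and intuitive, we use simple, artificially generated data here. The idea is to fix a function of a single variable (anything higher-dimensional cannot be drawn in the plane), generate some discrete data points from it, add a small perturbation (noise) to each point, and finally see how well our algorithm fits, or regresses, the data.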

import numpy as np
import matplotlib.pyplot as plt 

def true_fun(X):  # the true function we chose, i.e. the ground-truth model
    return 1.5*X + 0.2

np.random.seed(0)  # set the random seed
n_samples = 30  # number of sampled data points

# Generate random data as the training set and add some noise
X_train = np.sort(np.random.rand(n_samples))
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples,1)

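1.2 Defining the model

After generating the data, we can define the model by importing the LinearRegression class from sklearn. Since linear regression is simple, the class takes few parameters and needs no extra configuration. Once the model is defined we train it directly and obtain the fitted parameters.

from sklearn.linear_model import LinearRegression  # import the linear regression model
model = LinearRegression()  # define the model
model.fit(X_train[:,np.newaxis], y_train)  # train the model
print("Parameter w:", model.coef_)  # print the model parameter w
print("Parameter b:", model.intercept_)  # print the parameter b
Parameter w: [[1.4474774]]
Parameter b: [0.22557542]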
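To see what the fit amounts to, the same parameters can be obtained from an ordinary least-squares solve in plain NumPy. This is a sketch that regenerates the data with the same seed and uses `np.linalg.lstsq`. The design-matrix layout `[x, 1]` is a choice made here for illustration.

```python
import numpy as np

def true_fun(X):
    return 1.5 * X + 0.2

# Same data generation as above
np.random.seed(0)
n_samples = 30
X_train = np.sort(np.random.rand(n_samples))
y_train = true_fun(X_train) + np.random.randn(n_samples) * 0.05

# Design matrix with columns [x, 1]; least squares minimizes ||A @ [w, b] - y||^2
A = np.column_stack([X_train, np.ones(n_samples)])
(w, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)
print("w:", w, "b:", b)
```

Since both approaches minimize the same squared error, the printed values should land close to the w and b reported above.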

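1.3 Testing and comparing the model

The fitted parameters are about 1.45 and 0.23, close to the true values 1.5 and 0.2, which shows that the algorithm performs well. Next we take a batch of test points and plot the fitted model against the true model to see the gap.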

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X_train,y_train)  # plot the training points
plt.legend(loc="best")
plt.show()

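[Figure: the fitted model and the true function plotted over the training points]

Because our data are simple, the figure also shows that the fitted line is very close to the true one. For more complex (for example nonlinear) relationships, plain linear regression no longer meets our needs, and we need the somewhat more advanced polynomial regression.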

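2. Polynomial Regression

The usual idea of polynomial regression is to turn a degree-$m$ polynomial equation into an $m$-variable linear regression equation, i.e. to convert $y=b_0+b_1x+\cdots+b_mx^m$ into $y=b_0+b_1x_1+\cdots+b_mx_m$ (simply let $x_k=x^k$), and then solve for the parameters with linear regression. Practical implementations work the same way: we chain a polynomial feature generator with a linear regression, and once the linear regression parameters are computed they map back to the polynomial coefficients.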
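The substitution can be sketched by hand in NumPy: build the columns $x, x^2, \ldots, x^m$ and run an ordinary least-squares fit on them. The helper names and the noise-free cubic used as test data are illustrative assumptions.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x, x^2, ..., x^degree (no constant column)."""
    return np.column_stack([x ** k for k in range(1, degree + 1)])

def fit_polynomial(x, y, degree):
    """Least-squares fit that is linear in the features x_k = x^k."""
    A = np.column_stack([np.ones(len(x)), poly_features(x, degree)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # [b_0, b_1, ..., b_m]

def predict_polynomial(coef, x):
    A = np.column_stack([np.ones(len(x)), poly_features(x, len(coef) - 1)])
    return A @ coef

# Noise-free cubic with made-up coefficients: y = 1 - 2x + 0.5x^3
x = np.linspace(0, 1, 20)
y = 1.0 - 2.0 * x + 0.5 * x ** 3
coef = fit_polynomial(x, y, degree=3)
print(coef)
```

The model is nonlinear in $x$ but linear in the coefficients, which is why a linear solver is enough.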

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures  # class that computes polynomial features
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def true_fun(X):  # the true function we chose, i.e. the ground-truth model
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)  # set the random seed
n_samples = 30  # number of sampled data points

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

degrees = [1, 4, 15]  # highest polynomial degree for each fit
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])  # chain the steps with a Pipeline
    pipeline.fit(X[:, np.newaxis], y)

    # 10-fold cross-validation, scored by negative mean squared error
    scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()

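[Figure: polynomial fits of degree 1, 4 and 15 plotted against the true function and the samples, each panel titled with its cross-validated MSE]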

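2.1 Cross-validation

During training we used one technique, cross-validation. Similar methods include holdout validation and the bootstrap (a relative of cross-validation in which the data are sampled with replacement each time). Cross-validation tries different train/test splits in order to train and test the model several times, which addresses test results that are too one-sided as well as a shortage of training data. The procedure is shown in the figure below:
[Figure: the cross-validation procedure]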
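The procedure can also be written out by hand. The sketch below splits the 30 samples into 10 contiguous folds, fits on nine of them, and measures the mean squared error on the remaining one. It is similar in spirit to `cross_val_score(..., cv=10)`, but it uses `np.polyfit` instead of the sklearn pipeline, and `k_fold_mse` is a hypothetical helper name.

```python
import numpy as np

def k_fold_mse(x, y, degree, k=10):
    """Mean and standard deviation of the per-fold test MSE for a polynomial fit."""
    folds = np.array_split(np.arange(len(x)), k)  # contiguous folds, no shuffling
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False  # hold this fold out of training
        coef = np.polyfit(x[train_mask], y[train_mask], degree)
        pred = np.polyval(coef, x[test_idx])
        scores.append(np.mean((pred - y[test_idx]) ** 2))
    return np.mean(scores), np.std(scores)

# Same data generation as in the polynomial regression example
np.random.seed(0)
x = np.sort(np.random.rand(30))
y = np.cos(1.5 * np.pi * x) + np.random.randn(30) * 0.1

mse_1, std_1 = k_fold_mse(x, y, degree=1)
mse_4, std_4 = k_fold_mse(x, y, degree=4)
print("degree 1:", mse_1, "degree 4:", mse_4)
```

Every sample is used for testing exactly once, so the averaged score depends less on any single split than a one-off holdout would.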