sklearn-第一节

最新推荐文章于 2024-08-06 22:06:11 发布

~一段浮华

最新推荐文章于 2024-08-06 22:06:11 发布

阅读量1.6k

点赞数

文章标签：线性回归梯度下降逻辑回归多项式回归模型拟合

本文链接：https://blog.csdn.net/qq_42258383/article/details/121964754

版权

一、线性回归

1一元线性回归

“线性回归” (linear regression) 试图学得一个线性模型以尽可能准确地预测实值输出标记.举一个二维函数的例子y=1.5x+0.2,根据这个函数生成一些离散的数据点，对每个数据点加一点波动，也就是噪声，最后看看我们算法的拟合或者说回归效果。

1.1.数据生成

import numpy as np
import matplotlib.pyplot as plt

def true_fun(X):
    return 1.5*X + 0.2

np.random.seed(0) # 随机种子,0代表一系列的种子堆，生成的随机数可以复现
n_samples = 30
'''生成随机数据作为训练集'''
X_train = np.sort(np.random.rand(n_samples)) #生成30个随机数，并对此进行排序
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples,1) #将上边生成的随机数x代入到函数中，同时加入reshape之后的噪声

1.2模型生成

实际模型会存在一个偏置量b，以一元为例， $y=w_1x_1+b=w^Tb=w_1x_1+w_0x_0$ , 实际使用梯度下降法时可以添加一维并令 $x_0=1$ , 则求出的 $w_0=b$

1.下面介绍一下梯度下降法

在这里插入图片描述

2.如何求取梯度

假设给定模型 $h(\theta)=\sum_{j=0}^{n} \theta_{j} x_{j}$ 以及目标函数(损失函数): $J(\theta)=\frac{1}{m} \sum_{i=0}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right)^{2}$ , 其中 $m$ 表示数据的量，我们目标是为了 $J(\theta)$ 尽可能小，所以这里加上 $\frac{1}{2}$ 为了后面的简化，即 $J(\theta)=\frac{1}{2m} \sum_{i=0}^{m}\left(y^{i}-h_{\theta}\left(x^{i}\right)\right)^{2}$ 。
那么梯度则为：
$\begin{aligned} \frac{\partial J(\theta)}{\partial \theta_{j}} &=\frac{1}{m} \sum_{i=0}^{m}\left(y^{i}-h_{\theta}\left(x^{i}\right)\right) \frac{\partial}{\partial \theta_{j}}\left(y^{i}-h_{\theta}\left(x^{i}\right)\right) \\ &=-\frac{1}{m} \sum_{i=0}^{m}\left(y^{i}-h_{\theta}\left(x^{i}\right)\right) \frac{\partial}{\partial \theta_{j}}\left(\sum_{j=0}^{n} \theta_{j} x_{j}^{i}-y^{i}\right) \\ &=-\frac{1}{m} \sum_{i=0}^{m}\left(y^{i}-h_{\theta}\left(x^{i}\right)\right) x_{j}^{i}\\ &=\frac{1}{m} \sum_{i=0}^{m}\left(h_{\theta}(x^{i})-y^{i})\right) x_{j}^{i} \end{aligned}$

设 $x$ 是(m,n)维的矩阵， $y$ 是(m,1)维度的矩阵， $h_{\theta}$ 是预测的值，维度与 $y$ 相同，那么梯度用矩阵表示如下:
$\frac{\partial J(\theta)}{\partial \theta_{j}} = \frac{1}{m}x^{T}(h_{\theta}-y)$

生成数据之后，我们可以手动我们的算法模型，也可以直接从sklearn库中导入类LinearRegression

m,p = np.shape(data_X) # m, 数据量 p: 特征数
max_iter = 1000 # 迭代数
weights = np.ones((p,1))  #可以创建任意维度和元素个数的数组，其元素值均为1,生成n个分量w，每个分量均为1，因为W的分量的个数要对应于X中分量的个数
alpha = 0.1 # 学习率
for i in range(0,max_iter):
    error = np.dot(data_X,weights)- y_train   #np.dot可以做矩阵乘法和点积, 此值为预测值h(Xi)  ,  dot(矩阵1,矩阵2)是矩阵（向量）乘法运算函数 ，误差向量=预测值h(Xi)-真实值yi （此向量的分量相加就是总误差）
    gradient = data_X.transpose().dot(error)/m #求梯度，X的转置与误差向量乘积乘以1/m
    weights = weights - alpha * gradient   #作为评判的指标
print("输出参数w:",weights[1:][0]) # 输出模型参数w
print("输出参数:b",weights[0]) # 输出参数b

或者

from sklearn.linear_model import LinearRegression # 导入线性回归模型
model = LinearRegression() # 定义模型
model.fit(X_train[:,np.newaxis], y_train) # 训练模型
print("输出参数w：",model.coef_) # 输出模型参数w
print("输出参数b：",model.intercept_) # 输出参数b

输出参数w： [[1.4474774]]
输出参数b： [0.22557542]

1.3可视化

线性回归拟合的参数是1.44和0.22，很接近实际的1.5和0.2,通过画图看看算法模型与实际模型的差距。

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, X_test*weights[1][0]+weights[0][0], label="Model") 
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X_train,y_train) # 画出训练集的点
plt.legend(loc="best")
plt.show()

在这里插入图片描述

由于数据比较简单，我们的算法拟合曲线与实际的很接近。对于更复杂以及高维的情况，线性回归不能满足我们回归的需求，这时候我们需要用到更为高级一些的多项式回归了。

2 多元线性回归

在实际中更多的是遇见数据集D由多个属性描述，以三元为例， $y=w_1x_1+w_2x_2+w_3x_3+b=w^Tb$

from sklearn.linear_model import LinearRegression

X_train = [[1,1,1],[1,1,2],[1,2,1]]
y_train = [[6],[9],[8]]
 
model = LinearRegression()   #可以直接使用 LinearRegression()
model.fit(X_train, y_train)
print("输出参数w:",model.coef_) # 输出参数w1,w2,w3
print("输出参数b:",model.intercept_) # 输出参数b
test_X = [[1,3,5]]
pred_y = model.predict(test_X)
print("预测结果:",pred_y)

3. 多项式回归

多项式回归的思路一般是将m次多项式方程转化为m线性回归方程，即将 $y=b_0+b_1x+...+b_mx^m转换为y=b_0*+b_1x_1+...+b_mx_m（令x_m=x^m$ 即可），然后使用线性回归的方法求出相应的参数。一般实际的算法也是如此，我们将多项式特征分析器和线性回归串联，算出线性回归的参数之后倒推过去就行。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures # 导入能够计算多项式特征的类
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def true_fun(X): # 这是我们设定的真实函数，即ground truth的模型
    return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30 # 设置随机种子

X = np.sort(np.random.rand(n_samples)) 
y = true_fun(X) + np.random.randn(n_samples) * 0.1

degrees = [1, 4, 15] # 多项式最高次
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)]) # 使用pipline串联模型
    pipeline.fit(X[:, np.newaxis], y)
    
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,scoring="neg_mean_squared_error", cv=10) # 使用交叉验证
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()

在这里插入图片描述

3.1交叉验证法

在多项式回归中，使用了交叉验证方法；

“交叉验证法” (cross alidation) 将数据分为K个大小相似的互斥子集，每个子集尽可保持数据分布的一致性，即从通过分层采样得到后每次用k-1个子集的并集作为训练集。剩余的那个子集作测试集;这样就可获得k组训练/测试集，从而可进行 k次训练和测试。最终返回的是测试结果的均值， K最常用的取 10 ，此时称为 10折交叉验证。

在这里插入图片描述

常用的还有留出法和自助法

留出法： D=S U T，S ∩ T= ∅，S与T互斥，训练/测试集的划分要尽可能保持数据分布的一致性，在分类任务中至少要保持样本的类别比例相似，保留类别比例的采样方式通常称为**“分层采样”** (stratified sampling)。

注：在使用留出法时，一般要采用若干次随机划分、重复进行实验评估后取平均值作为留出法的评估结果.

自助法：

对数据集D挑选样本拷贝至D’,即在D‘中会有重复的数据，而D中会有一部分数据不会在D’中出现,取极限大约是在36.8%。当D与D’的数据量相同时，停止采样，将D’作训练集，将D中未在**D’**中出现的数据用作测试集。

注：自助法在数据集较小、难以有效划分训练/测试集时很有用;但自助法产生的数据集改变了初始数据集的分布，这会引入估计偏差.因此，在初始数据量足够时，留出法和交叉验证法更常用一些.

二、逻辑回归

2.1损失函数

在广义线性模型的中，只需将单调可微函数将分类任务的真实标记y与线性回归模型的预测值联系起来，考虑二分类任务， $y_i\in{\{0,1\}},i=1,2,...$ ,

假设函数就是其基本模型，如下：
$h_{\theta}(x)=g(\theta^{T}x)$
其中 $\theta^{T}x=w^Tx+b$ , 而 $g(z)=\frac{1}{1+e^{-z}}$ 为 $s i g m o i d$ 函数，也称激活函数。

损失函数又叫代价函数，用于衡量模型的好坏，这里可以用极大似然估计法来定义损失函数。

代价函数可定义为极大似然估计，即 $L(\theta)=\prod_{i=1}p(y_i=1|x_i)=h_\theta(x_1)(1-h_\theta(x_2))...$ ,
其中 $x_1$ 对应的标签 $y_1=1$ ， $x_2$ 对应的标签 $y_2=0$
$-\ln(L(\theta))=\ell(\boldsymbol{\theta})=\sum_{i=1}(-y_i\theta^Tx_i+\ln(1+e^{\theta^Tx_i}))$

具体可见损失函数的推导https://blog.csdn.net/qq_42258383/article/details/121483322?spm=1001.2014.3001.5502

2.2梯度下降法

根据凸优化理论，该函数可以由梯度下降法，牛顿法得出最优解。

对于梯度下降来讲, 其中 $\eta$ 为学习率：
$\theta^{t+1}=\theta^{t}-\eta \frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$
其中
$\frac{\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{i=1}(-y_ix_i+\frac{e^{\theta^Tx_i}x_i}{1+e^{\theta^Tx_i}})=\sum_{i=1}x_i(-y_i+h_\theta(x_i))=\sum_{i=1}x_i(-error)$

这里梯度上升更方便点：
$\theta^{t+1}=\theta^{t}+\eta \frac{-\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$
其中
$\frac{-\partial \ell(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}=\sum_{i=1}(y_ix_i-\frac{e^{\theta^Tx_i}x_i}{1+e^{\theta^Tx_i}})=\sum_{i=1}x_i(y_i-h_\theta(x_i))=\sum_{i=1}x_i*error$

2.3代码逻辑

训练算法如下：

输入：训练数据 $X=\{x_1,x_2,...,x_n\}$ ,训练标签 $Y=\{y_1,y_2,...,\}$ ，注意均为矩阵形式
输出: 训练好的模型参数 $\theta$ ，或者 $h_{\theta}(x)$
初始化模型参数 $\theta$ ，迭代次数 $n\_iters$ ，学习率 $\eta$
$\mathbf{FOR} \ i\_iter \ \mathrm{in \ range}(n\_iters)$
- $\mathbf{FOR} \ i \ \mathrm{in \ range}(n)$ $\rightarrow n=len(X)$
  - $error=y_i-h_{\theta}(x_i)$
  - $grad=error*x_i$
  - $\theta \leftarrow \theta + \eta*grad$ $\rightarrow$ 梯度上升
  - $\mathbf{END \ FOR}$
$\mathbf{END \ FOR}$

2.4代码实现

import sys
from pathlib import Path
curr_path = str(Path().absolute())
parent_path = str(Path().absolute().parent)
sys.path.append(parent_path) # add current terminal path to sys.path

import numpy as np
from Mnist.load_data import load_local_mnist

(x_train, y_train), (x_test, y_test) = load_local_mnist(one_hot=False)

# print(np.shape(x_train),np.shape(y_train))

ones_col=[[1] for i in range(len(x_train))] # 生成全为1的二维嵌套列表，即[[1],[1],...,[1]]
x_train_modified=np.append(x_train,ones_col,axis=1)
ones_col=[[1] for i in range(len(x_test))]
x_test_modified=np.append(x_test,ones_col,axis=1)

# print(np.shape(x_train_modified))

# Mnsit有0-9十个标记，由于是二分类任务，所以可以将标记0的作为1，其余为0用于识别是否为0的任务
y_train_modified=np.array([1 if y_train[i]==1 else 0 for i in range(len(y_train))])
y_test_modified=np.array([1 if y_test[i]==1 else 0 for i in range(len(y_test))])
n_iters=10 

x_train_modified_mat = np.mat(x_train_modified)
theta = np.mat(np.zeros(len(x_train_modified[0])))
lr = 0.01 # 学习率

def sigmoid(x):
    '''sigmoid函数
    '''
    return 1.0/(1+np.exp(-x))

小批量梯度下降法

for i_iter in range(n_iters):
    for n in range(len(x_train_modified)):
        hypothesis = sigmoid(np.dot(x_train_modified[n], theta.T))
        error = y_train_modified[n]- hypothesis
        grad = error*x_train_modified_mat[n]
        theta += lr*grad
    print('LogisticRegression Model(learning_rate={},i_iter={})'.format(
    lr, i_iter+1))

总结

作为sklearn的初学者，本文对一元线性回归、多元线性回归、多项式线性回归以及逻辑回归进行了算法分析和代码实现，对于代码部分只是囫囵吞枣，欢迎大家指正。

~一段浮华

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
sklearn-第一节

一、线性回归1一元线性回归“线性回归” (linear regression) 试图学得一个线性模型以尽可能准确地预测实值输出标记.举一个二维函数的例子y=1.5x+0.2,根据这个函数生成一些离散的数据点，对每个数据点加一点波动，也就是噪声，最后看看我们算法的拟合或者说回归效果。1.1.数据生成import numpy as npimport matplotlib.pyplot as pltdef true_fun(X): return 1.5*X + 0.2np.random.
复制链接

扫一扫