Machine Learning with sklearn (Part 1)

黑小板

Category: Machine Learning. Tags: sklearn, machine learning, python

Article link: https://blog.csdn.net/weixin_45397053/article/details/121910610

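This article introduces the basic concepts of linear regression and logistic regression, how they are implemented, and how to use them in the sklearn library. It fits data with linear regression by example, and discusses polynomial regression, cross-validation, and overfitting versus underfitting. It then explains the principles behind logistic regression, including the sigmoid function, maximum likelihood estimation, and gradient descent. Finally, it applies logistic regression to a classification task on the iris dataset.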

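Task01
This study follows the Datawhale open-source course: https://github.com/datawhalechina/machine-learning-toy-code/tree/main/ml-with-sklearn
The content is organised as follows and consists mainly of code implementations with some introduction to the underlying theory.
[Figure: content outline]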

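1. Linear Regression and Logistic Regression

Regression studies how the independent variables X affect the dependent variable Y. Regression can be categorised in several ways; according to the distribution of the dependent variable, the main types are:

Continuous: multiple linear regression
Binomial distribution: logistic regression
Poisson distribution: Poisson regression
Negative binomial distribution: negative binomial regression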

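1.1. Linear Regression

Model: learn a linear model that predicts the real-valued output label as accurately as possible.
[Figure: linear model formula]
Strategy: minimise the mean squared error objective, i.e. make the sum of squared residuals between the samples and the fitted line as small as possible (the least squares method).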
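The least squares strategy can be sketched directly in NumPy before turning to sklearn. The data below are hypothetical (a line with slope 1.5 and intercept 0.2 plus noise), and `np.linalg.lstsq` is used here as one possible solver, not necessarily what sklearn does internally.

```python
import numpy as np

# Hypothetical toy data: y = 1.5 * x + 0.2 plus small Gaussian noise
rng = np.random.default_rng(0)
x = rng.random(30)
y = 1.5 * x + 0.2 + rng.normal(scale=0.05, size=30)

# Design matrix with a column of ones so that the intercept b is estimated too
A = np.column_stack([x, np.ones_like(x)])

# Least squares solution: minimises ||A @ [w, b] - y||^2
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Mean squared error of the fitted line on the training data
mse = np.mean((A @ np.array([w, b]) - y) ** 2)
print(f"w = {w:.3f}, b = {b:.3f}, training MSE = {mse:.5f}")
```

The recovered `w` and `b` should land close to the values used to generate the data, with the gap driven by the noise level.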

Below, sklearn is used to implement linear regression.

1.1.1. Data Generation

# Generate random data as the training set and add some noise
import numpy as np

def true_fun(X): 
    return 1.5*X + 0.2

np.random.seed(0) # Set the random seed
n_samples = 30 # Number of sample points

X_train = np.sort(np.random.rand(n_samples)) 
y_train = (true_fun(X_train) + np.random.randn(n_samples) * 0.05).reshape(n_samples,1)

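1.1.2. Defining the Model

from sklearn.linear_model import LinearRegression # Import the linear regression model
model = LinearRegression() # Define the model
model.fit(X_train[:,np.newaxis], y_train) # Train the model; X must be 2-D (n_samples, n_features)
print("Parameter w:",model.coef_) # Print the weight w
print("Parameter b:",model.intercept_) # Print the intercept b

Parameter w: [[1.4474774]]
Parameter b: [0.22557542]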
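The nested brackets in the output above come from `y_train` having shape `(n_samples, 1)`. The following sketch, on hypothetical data generated with `np.random.default_rng`, compares the shapes of `coef_` and `intercept_` when the target is 1-D versus 2-D.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 30 samples, 1 feature
rng = np.random.default_rng(0)
X = rng.random((30, 1))
y = 1.5 * X[:, 0] + 0.2 + rng.normal(scale=0.05, size=30)

model_1d = LinearRegression().fit(X, y)                 # target shape (30,)
model_2d = LinearRegression().fit(X, y.reshape(-1, 1))  # target shape (30, 1)

# 1-D target: coef_ has shape (n_features,), intercept_ is a scalar
print(model_1d.coef_.shape, np.shape(model_1d.intercept_))
# 2-D target: coef_ has shape (n_targets, n_features), intercept_ has shape (n_targets,)
print(model_2d.coef_.shape, np.shape(model_2d.intercept_))
```

Both fits estimate the same line; only the array shapes of the reported parameters differ.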

1.1.3. Testing and Comparing the Model

import matplotlib.pyplot as plt

X_test = np.linspace(0, 1, 100)
plt.plot(X_test, model.predict(X_test[:, np.newaxis]), label="Model")
plt.plot(X_test, true_fun(X_test), label="True function")
plt.scatter(X_train,y_train) # Plot the training points
plt.legend(loc="best")
plt.show()

[Figure: fitted model and true function plotted with the training points]

1.1.4. Polynomial Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures # Class that generates polynomial features
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def true_fun(X): # The true function we chose, i.e. the ground-truth model
    return np.cos(1.5 * np.pi * X)
np.random.seed(0) # Set the random seed
n_samples = 30 # Number of sample points

X = np.sort(np.random.rand(n_samples)) 
y = true_fun(X) + np.random.randn(n_samples) * 0.1

degrees = [1, 4, 15] # Highest polynomial degree for each model
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)]) # Chain the steps with a pipeline
    pipeline.fit(X[:, np.newaxis], y)
    
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,scoring="neg_mean_squared_error", cv=10) # 10-fold cross-validation
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()

[Figure: polynomial fits of degree 1, 4, and 15, each panel titled with its cross-validated MSE]
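The numbers shown in the plot titles can also be compared without plotting. The sketch below reuses the same data-generating setup and prints the 10-fold cross-validated MSE for each degree; `make_pipeline` is swapped in for the explicit `Pipeline` purely for brevity, and exact values may vary between sklearn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same setup as above: noisy samples of cos(1.5 * pi * x)
np.random.seed(0)
X = np.sort(np.random.rand(30))
y = np.cos(1.5 * np.pi * X) + np.random.randn(30) * 0.1

cv_mse = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    # Scores are negated MSE values, so flip the sign to report MSE
    scores = cross_val_score(model, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    cv_mse[degree] = -scores.mean()
    print(f"degree {degree:2d}: CV MSE = {cv_mse[degree]:.2e}")

# Degree with the lowest cross-validated error
best_degree = min(cv_mse, key=cv_mse.get)
print("best degree:", best_degree)
```

Picking the degree with the lowest cross-validated error is one simple way to use these scores for model selection.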

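1.1.5. Cross-Validation

The training procedure above uses cross-validation through cross_val_score. The original data are divided into a training set (green), a validation set (yellow), and a test set (red). In each round a different part of the data serves as the training set and the validation set, the performance metric of each round is recorded, and a suitable model is finally chosen based on the results of all rounds. Depending on how many parts the data are split into, the procedure is called K-fold cross-validation (K-CV). The code above passes cv=10, i.e. 10-fold cross-validation; in the figure below the data are split into five parts, i.e. 5-fold cross-validation.
[Figure: 5-fold cross-validation diagram]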
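To make the fold mechanics concrete, here is a minimal sketch that runs 5-fold cross-validation by hand with `KFold` and checks that it matches `cross_val_score`. The data are hypothetical and a plain `LinearRegression` stands in for whatever model is being validated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data: a noisy line
rng = np.random.default_rng(0)
X = rng.random((30, 1))
y = 1.5 * X[:, 0] + 0.2 + rng.normal(scale=0.05, size=30)

kfold = KFold(n_splits=5)
manual_mse = []
for train_idx, val_idx in kfold.split(X):
    # Fit on four folds, evaluate on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# cross_val_score performs the same loop internally
auto_mse = -cross_val_score(LinearRegression(), X, y,
                            scoring="neg_mean_squared_error", cv=kfold)
print(np.round(manual_mse, 5))
print(np.round(auto_mse, 5))
```

Each sample is used for validation exactly once, so the five scores together cover the whole dataset.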

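1.1.6. Overfitting and Underfitting

In the figure from 1.1.4, Degree 1 underfits (a straight line cannot follow the cosine curve), while Degree 15 overfits (the curve chases the noise in the training points).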
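One way to tell the two apart numerically is to compare training error with validation error. The sketch below uses `cross_validate` with `return_train_score=True` on the same cosine data; treat it as an illustration, since the exact numbers depend on the sklearn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
X = np.sort(np.random.rand(30))
y = np.cos(1.5 * np.pi * X) + np.random.randn(30) * 0.1

train_mse, val_mse = {}, {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    result = cross_validate(model, X[:, np.newaxis], y, cv=10,
                            scoring="neg_mean_squared_error",
                            return_train_score=True)
    # Scores are negated MSE, so flip the sign
    train_mse[degree] = -result["train_score"].mean()
    val_mse[degree] = -result["test_score"].mean()
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.2e}, "
          f"validation MSE = {val_mse[degree]:.2e}")
```

Underfitting typically shows up as high error on both sets, whereas overfitting shows a low training error paired with a much higher validation error.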

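1.1.7. Parameter Notes

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize='deprecated', copy_X=True, n_jobs=None, positive=False)

fit_intercept: whether to compute the intercept; bool, default True.
normalize: whether to normalise the data before fitting; bool, formerly default False. As the signature shows, this parameter was deprecated in scikit-learn 1.0, and it was removed in 1.2; scale the data in a preprocessing step instead.
copy_X: whether to copy X; bool, default True. If False, X may be overwritten with the centred and normalised data.
n_jobs: number of jobs used for the computation; int or None, default None. -1 means use all CPUs.
positive: bool, default False. When True, the coefficients are forced to be non-negative. This option is only supported for dense arrays.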
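As a quick illustration of two of these parameters, the sketch below fits the same hypothetical data (one positive and one negative true coefficient) with the defaults, with `fit_intercept=False`, and with `positive=True`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = 2 * x0 - 1 * x1 + 0.5 plus noise
rng = np.random.default_rng(0)
X = rng.random((50, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + rng.normal(scale=0.01, size=50)

default_model = LinearRegression().fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)
nonneg_model = LinearRegression(positive=True).fit(X, y)

print("default:      ", default_model.coef_, default_model.intercept_)
print("no intercept: ", no_intercept.coef_, no_intercept.intercept_)
print("positive=True:", nonneg_model.coef_, nonneg_model.intercept_)
```

With `fit_intercept=False` the intercept is fixed at zero, and with `positive=True` the negative coefficient cannot be recovered, so both variants fit this particular data worse than the default.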

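1.2. Logistic Regression

Despite its name, Logistic Regression is actually a classification model, most commonly used for binary classification. In essence it assumes the labels follow a Bernoulli distribution and estimates the parameters by maximum likelihood.
Model: suppose we want to classify the × points below and try a linear model:
[Figure: sample points with a fitted line]
We can classify by thresholding at $h_{\theta}(x)=0.5$:

But if one more point is added, thresholding at $h_{\theta}(x)=0.5$ is no longer appropriate.

We therefore use a function (the sigmoid function $g(z)$) to map the linear output $\theta^{T}x$ into the range between 0 and 1, so that for any $h_{\theta}(x)$ we predict y=1 when it is at least 0.5 and y=0 when it is below 0.5:
[Figure: sigmoid function]
This gives the logistic regression model (a probabilistic model): $h_{\theta}(x)=g\left(\theta^{T} x\right), g(z)=\frac{1}{1+e^{-z}}$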
where, with $x_0=1$:
$z=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n=\theta^Tx$
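The model can be written in a few lines of NumPy. The parameter vector and inputs below are hypothetical values chosen only to show the sigmoid and the 0.5 threshold in action.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters [theta_0, theta_1]
theta = np.array([-1.0, 2.0])

# Each row is one sample; the first column is x_0 = 1 for the intercept
x = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [1.0, 2.0]])

z = x @ theta                    # linear part theta^T x
h = sigmoid(z)                   # estimated P(y = 1 | x)
labels = (h >= 0.5).astype(int)  # predict 1 when the probability is at least 0.5

print(z)
print(np.round(h, 4))
print(labels)
```

Since $g(0)=0.5$, thresholding the probability at 0.5 is the same as checking the sign of $\theta^{T}x$.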
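Strategy: from the model above, the probabilities that y equals 1 and 0 are, respectively: $p(y=1|x;\theta)=h_{\theta}(x)$ and $p(y=0|x;\theta)=1-h_{\theta}(x)$
These can be combined into: $p(y|x;\theta)=(h_{\theta}(x))^{y}(1-h_{\theta}(x))^{1-y}$
Assuming we have m mutually independent training samples, the likelihood function for estimating the parameters from them is:
$L(\theta)=\prod_{i=1}^{m} p\left(y^{(i)} | x^{(i)} ; \theta\right)=\prod_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)\right)^{y^{(i)}}\left(1-h_{\theta}\left(x^{(i)}\right)\right)^{1-y^{(i)}}$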
To simplify the computation, take the logarithm, divide by m, and negate; maximising the likelihood is then equivalent to minimising the following cross-entropy loss:
$J(\theta)=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log h_{\theta}\left(x^{(i)}\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]$
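The loss $J(\theta)$ translates almost literally into NumPy. In the sketch below the labels and predicted probabilities are made-up values, and `sklearn.metrics.log_loss` is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import log_loss

def cross_entropy(y, h):
    """Mean binary cross-entropy between labels y and probabilities h."""
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical labels and predicted probabilities h_theta(x)
y = np.array([1, 0, 1, 1, 0])
h = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

J = cross_entropy(y, h)
print(f"J = {J:.4f}")
print(f"sklearn log_loss = {log_loss(y, h):.4f}")
```

Confident correct predictions contribute little to the loss, while confident wrong ones are penalised heavily because of the logarithm.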
Algorithm: as in linear regression, compute the gradient:
$\begin{aligned} \frac{\partial}{\partial \theta_{j}} J(\theta) &=\frac{\partial}{\partial \theta_{j}}\left[-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \log \left(h_{\theta}\left(x^{(i)}\right)\right)+\left(1-y^{(i)}\right) \log \left(1-h_{\theta}\left(x^{(i)}\right)\right)\right]\right] \\ &=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{1}{h_{\theta}\left(x^{(i)}\right)} \frac{\partial}{\partial \theta_{j}} h_{\theta}\left(x^{(i)}\right)-\left(1-y^{(i)}\right) \frac{1}{1-h_{\theta}\left(x^{(i)}\right)} \frac{\partial}{\partial \theta_{j}} h_{\theta}\left(x^{(i)}\right)\right] \\ &=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{1}{h_{\theta}\left(x^{(i)}\right)}-\left(1-y^{(i)}\right) \frac{1}{1-h_{\theta}\left(x^{(i)}\right)}\right] \frac{\partial}{\partial \theta_{j}} h_{\theta}\left(x^{(i)}\right) \\ &=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)} \frac{1}{h_{\theta}\left(x^{(i)}\right)}-\left(1-y^{(i)}\right) \frac{1}{1-h_{\theta}\left(x^{(i)}\right)}\right] \frac{\partial}{\partial \theta_{j}} g\left(\theta^{T} x^{(i)}\right) \end{aligned}$
Because:
$\begin{aligned} \frac{\partial}{\partial \theta_{j}} g\left(\theta^{T} x^{(i)}\right) &=\frac{\partial}{\partial \theta_{j}} \frac{1}{1+e^{-\theta^{T} x^{(i)}}} \\ &=\frac{e^{-\theta^{T} x^{(i)}}}{\left(1+e^{-\theta^{T} x^{(i)}}\right)^{2}} \frac{\partial}{\partial \theta_{j}} \theta^{T} x^{(i)} \\ &=g\left(\theta^{T} x^{(i)}\right)\left(1-g\left(\theta^{T} x^{(i)}\right)\right) x_{j}^{(i)} \end{aligned}$
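Therefore:
$\begin{aligned} \frac{\partial}{\partial \theta_{j}} J(\theta) &=-\frac{1}{m} \sum_{i=1}^{m}\left[y^{(i)}\left(1-g\left(\theta^{T} x^{(i)}\right)\right)-\left(1-y^{(i)}\right) g\left(\theta^{T} x^{(i)}\right)\right] x_{j}^{(i)} \\ &=-\frac{1}{m} \sum_{i=1}^{m}\left(y^{(i)}-g\left(\theta^{T} x^{(i)}\right)\right) x_{j}^{(i)} \\ &=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{j}^{(i)} \end{aligned}$
The minimum is where this gradient vanishes. Unlike linear regression, that equation has no closed-form solution, so $\theta$ is found iteratively, for example by gradient descent with learning rate $\alpha$: $\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J(\theta)$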
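The final gradient expression leads to a very short gradient descent loop. The sketch below is a minimal illustration on hypothetical one-feature data; the learning rate `alpha` and the iteration count are arbitrary choices, and sklearn's own solvers work differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of J: (1/m) * X^T (h_theta(X) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Hypothetical data: one feature, noisy labels so the classes overlap
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])  # column of ones for theta_0
y = (x1 + rng.normal(scale=0.5, size=200) > 0).astype(float)

theta = np.zeros(2)
alpha = 0.5                              # learning rate (assumed value)
for _ in range(2000):
    theta -= alpha * gradient(theta, X, y)

grad_norm = np.linalg.norm(gradient(theta, X, y))
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == y)
print("theta:", np.round(theta, 3))
print(f"gradient norm: {grad_norm:.2e}, training accuracy: {accuracy:.3f}")
```

A gradient norm close to zero indicates that the iteration has reached the stationary point of $J(\theta)$ that the derivation describes.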

1.2.1. Data Generation

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Use the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

print(X[0])
print(y[0])

[5.1 3.5 1.4 0.2]
0
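Before training, it can help to look at the structure of the dataset beyond its first row. The short sketch below prints the shape, the feature and class names, and the number of samples per class.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)            # (n_samples, n_features)
print(iris.feature_names)         # names of the four measurements
print(iris.target_names)          # names of the three species
print(np.bincount(iris.target))   # number of samples in each class
```

The dataset has three classes, so the logistic regression that follows is a multiclass problem, not the binary case derived above.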

1.2.2. Classification Task

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y)

lg = LogisticRegression() # Instantiate the logistic regression model
lg.fit(X_train,y_train)   # Train the model
lg.predict_proba(X_test)  # Predicted probability of each class for the test data
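The snippet above stops at the class probabilities. A possible follow-up, sketched below, turns them into labels and measures accuracy; `random_state=0`, `stratify=y`, and `max_iter=1000` are assumptions added here for reproducibility and to avoid convergence warnings, not settings from the original.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Fixed seed and stratified split so each class is represented in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)   # one column of probabilities per class
y_pred = clf.predict(X_test)        # label of the most probable class
accuracy = accuracy_score(y_test, y_pred)

print(proba[:3].round(3))
print(y_pred[:3], y_test[:3])
print(f"test accuracy: {accuracy:.3f}")
```

Each row of `proba` sums to 1, and `predict` returns the class whose column holds the largest probability.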

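1.2.3. Parameter Notes

class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

penalty: the penalty term; str. Options are l1, l2, elasticnet, and none; default l2.
dual: dual or primal formulation; bool, default False.
tol: tolerance for the stopping criterion; float, default 1e-4.
C: inverse of the regularisation strength λ; float, default 1.0. Smaller values mean stronger regularisation.
fit_intercept: whether to include an intercept (bias) term; bool, default True.
intercept_scaling: only useful when the solver is "liblinear" and fit_intercept is True; float, default 1.
class_weight: weights associated with each class; dict or the string 'balanced', default None.
random_state: random seed; int, default None.
solver: the optimisation algorithm; five options: newton-cg, lbfgs, liblinear, sag, saga. Default lbfgs.
max_iter: maximum number of iterations for the solver to converge; int, default 100.
multi_class: multiclass strategy; str, options auto, ovr, and multinomial; default auto.
verbose: verbosity of logging; int, default 0.
warm_start: whether to reuse the previous solution as initialisation; bool, default False.
n_jobs: number of parallel jobs; int, default None.
l1_ratio: elastic-net mixing parameter; float, default None, with 0 <= l1_ratio <= 1. Only used when penalty='elasticnet'. Setting l1_ratio=0 is equivalent to penalty='l2', and setting l1_ratio=1 is equivalent to penalty='l1'.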
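To see what C does in practice, the sketch below fits the iris data with three values of C and compares the size of the learned coefficients. The specific C values and `max_iter=5000` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_iris(return_X_y=True)

norms = {}
for C in [0.01, 1.0, 100.0]:
    # Smaller C means a stronger L2 penalty on the coefficients
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C = {C:6.2f}: coefficient norm = {norms[C]:.3f}, "
          f"training accuracy = {clf.score(X, y):.3f}")
```

As C grows the penalty weakens and the coefficients are allowed to become larger, which fits the training data more closely at the risk of overfitting.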
