Training a Handwritten-Digit Classifier on the load_digits Dataset with Scikit-learn's LogisticRegression


TBest_


Article link: https://blog.csdn.net/m0_59611146/article/details/136797721


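This article shows how to work with scikit-learn's handwritten digits dataset: preparing the data, splitting it into training and test sets, and building and evaluating a logistic regression model. Model performance is estimated with ten-fold cross-validation and the error rate is computed on the test set, illustrating a basic machine learning workflow.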


1. Dataset Analysis

The handwritten digits data is a built-in scikit-learn dataset. Import the loader:

from sklearn.datasets import load_digits

1.1 Dataset specification

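The dataset has 1797 samples. Each sample consists of an 8x8-pixel image and an integer label in [0, 9].
In the data array, each sample is a row of 64 values of dtype float64 (the flattened pixels, which are integer gray levels from 0 to 16).
The handwritten digit recognition task: given the gray level of each pixel in an 8x8 image of a handwritten digit, predict which digit the image shows. This is a classification problem.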
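
The figures above can be checked directly. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names digits and class_counts are illustrative and not part of the original article.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Shape and dtype of the flattened feature matrix and of the label vector
print(digits.data.shape, digits.data.dtype)   # (1797, 64) float64
print(digits.target.shape)                    # (1797,)

# Pixel values are integer gray levels stored as floats
print(digits.data.min(), digits.data.max())   # 0.0 16.0

# Number of samples per digit class (roughly 180 each)
class_counts = np.bincount(digits.target)
print(class_counts)
```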

The loader's docstring in the scikit-learn source describes the dataset:

    """Load and return the digits dataset (classification).

    Each datapoint is a 8x8 image of a digit.

    =================   ==============
    Classes                         10
    Samples per class             ~180
    Samples total                 1797
    Dimensionality                  64
    Features             integers 0-16
    =================   ==============

    This is a copy of the test set of the UCI ML hand-written digits datasets
    https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

    Read more in the :ref:`User Guide <digits_dataset>`.

    Parameters
    ----------
    n_class : int, default=10
        The number of classes to return. Between 0 and 10.

    return_X_y : bool, default=False
        If True, returns ``(data, target)`` instead of a Bunch object.
        See below for more information about the `data` and `target` object.

        .. versionadded:: 0.18

    as_frame : bool, default=False
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric). The target is
        a pandas DataFrame or Series depending on the number of target columns.
        If `return_X_y` is True, then (`data`, `target`) will be pandas
        DataFrames or Series as described below.

        .. versionadded:: 0.23

    Returns
    -------
    data : :class:`~sklearn.utils.Bunch`
        Dictionary-like object, with the following attributes.

        data : {ndarray, dataframe} of shape (1797, 64)
            The flattened data matrix. If `as_frame=True`, `data` will be
            a pandas DataFrame.
        target: {ndarray, Series} of shape (1797,)
            The classification target. If `as_frame=True`, `target` will be
            a pandas Series.
        feature_names: list
            The names of the dataset columns.
        target_names: list
            The names of target classes.

            .. versionadded:: 0.20

        frame: DataFrame of shape (1797, 65)
            Only present when `as_frame=True`. DataFrame with `data` and
            `target`.

            .. versionadded:: 0.23
        images: {ndarray} of shape (1797, 8, 8)
            The raw image data.
        DESCR: str
            The full description of the dataset.

    (data, target) : tuple if ``return_X_y`` is True
        A tuple of two ndarrays by default. The first contains a 2D ndarray of
        shape (1797, 64) with each row representing one sample and each column
        representing the features. The second ndarray of shape (1797) contains
        the target samples.  If `as_frame=True`, both arrays are pandas objects,
        i.e. `X` a dataframe and `y` a series.

        .. versionadded:: 0.18

    Examples
    --------
    To load the data and visualize the images::

        >>> from sklearn.datasets import load_digits
        >>> digits = load_digits()
        >>> print(digits.data.shape)
        (1797, 64)
        >>> import matplotlib.pyplot as plt
        >>> plt.gray()
        >>> plt.matshow(digits.images[0])
        <...>
        >>> plt.show()
    """

Key points of the docstring:

        n_class: int, default=10. The number of classes to return, between 0 and 10.

        return_X_y: bool, default=False. If True, returns (data, target) instead of a Bunch object.

        as_frame: bool, default=False. If True, data is a pandas DataFrame with appropriately typed numeric columns, and target is a pandas Series.

The function returns a dictionary-like Bunch object with the following attributes:

        data: the flattened data matrix of shape (1797, 64).

        target: the classification target of shape (1797,).

        feature_names: the names of the dataset columns.

        target_names: the names of the target classes.

        frame: a DataFrame of shape (1797, 65) holding data and target; only present when as_frame=True.

        images: the raw image data of shape (1797, 8, 8).

        DESCR: the full description of the dataset.

1.2 Loading the data

# Load the dataset and extract the features and labels
datas = load_digits()
X_data = datas.data
y_data = datas.target
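
As the docstring notes, the loader can also return the arrays directly. A short sketch of the return_X_y and n_class parameters follows; the names X, y, X01, and y01 are illustrative only.

```python
from sklearn.datasets import load_digits

# return_X_y=True returns the (data, target) tuple instead of a Bunch
X, y = load_digits(return_X_y=True)
print(X.shape, y.shape)   # (1797, 64) (1797,)

# n_class=2 keeps only the samples whose label is 0 or 1
X01, y01 = load_digits(n_class=2, return_X_y=True)
labels_kept = sorted(set(y01.tolist()))
print(labels_kept)        # [0, 1]
```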

1.3 Displaying the first ten samples

Code:

from matplotlib import pyplot as plt

# Show the images of the first ten samples
fig, ax = plt.subplots(
    nrows=2,
    ncols=5,
    sharex=True,
    sharey=True, )
ax = ax.flatten()
for i in range(10):
    ax[i].imshow(datas.data[i].reshape((8, 8)), cmap='Greys', interpolation='nearest')
plt.show()

[Figure: the first ten digit images in a 2x5 grid]

2. Data Processing

2.1 Splitting the dataset

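from sklearn.model_selection import train_test_split

# Split the dataset: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3)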
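
The split above is random, so results change from run to run. As a hedged variation (not what the article's code does), passing random_state makes the split reproducible and stratify=y keeps the class proportions nearly identical in both parts. The value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# random_state fixes the shuffle; stratify=y preserves the class balance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_tr.shape, X_te.shape)

# Fraction of each digit class in the training and test parts
train_frac = np.bincount(y_tr) / len(y_tr)
test_frac = np.bincount(y_te) / len(y_te)
print(train_frac.round(3))
print(test_frac.round(3))
```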

3. Building the Model

3.1 Logistic regression

3.1.1 Main parameters of LogisticRegression()

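        penalty: the type of regularization. Options include "l1" and "l2"; the default is "l2". Note: l1 regularization can shrink some coefficients to exactly 0, whereas l2 regularization generally only pushes coefficients toward 0 without making them exactly 0.

        C: a positive float, the inverse of the regularization strength. The smaller C is, the stronger the regularization penalty.

        multi_class: how a multiclass problem is handled. "ovr" (one-vs-rest) fits one binary classifier per class; "multinomial" fits a single model with the multinomial loss and is not available when solver="liblinear"; "auto" lets the model choose based on the data and the solver. The default has been "auto" since scikit-learn 0.22, and the parameter is deprecated from scikit-learn 1.5 onward.

        solver: the optimization algorithm used to fit the model, for example "lbfgs" (the default), "liblinear", or "saga".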
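
To make the role of C concrete, the sketch below fits the same model with three values of C and compares the overall size of the learned coefficients. It is an illustration under stated assumptions: the pixels are divided by 16 only so that the solver converges quickly, and the chosen C values and the name coef_norms are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X = X / 16.0  # assumption: scale pixels to [0, 1] to help convergence

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X, y)
    # Overall magnitude of all class-by-pixel weights
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.2f}")
```

A smaller C means stronger regularization, so the coefficient norm should grow as C increases.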

3.2 Building the logistic regression model

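from sklearn.linear_model import LogisticRegression

# Build the model. multi_class is omitted: the default lbfgs solver already uses
# the multinomial loss for multiclass targets, and newer releases deprecate it.
model = LogisticRegression(max_iter=10000, random_state=42)

# Train the model
model.fit(X_train, y_train)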
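
Once fitted, the model exposes what it learned. The following self-contained sketch (the split seed 0 and the names clf and proba are illustrative assumptions) inspects the coefficient matrix and the per-class probabilities for a few test images.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=10000, random_state=42)
clf.fit(X_tr, y_tr)

print(clf.classes_)       # the ten digit labels
print(clf.coef_.shape)    # one row of 64 pixel weights per class

# Class probabilities for the first three test images; each row sums to 1
proba = clf.predict_proba(X_te[:3])
print(proba.round(3))

# The predicted label is the class with the highest probability
print(proba.argmax(axis=1), clf.predict(X_te[:3]))
```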

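4. Model Evaluation

4.1 Ten-fold cross-validation

Ten-fold cross-validation splits the training set into 10 subsets (folds). One fold is held out to validate the model and the other 9 are used for training. This is repeated 10 times so that each fold serves as the validation set exactly once, and the 10 results are averaged (or combined in some other way) into a single estimate. The advantage of this method is that every sample is used for both training and validation, and each sample is used for validation exactly once. Ten folds is the most common choice.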

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=10)  # ten-fold cross-validation
print("Ten-fold cross-validation mean:", scores.mean())
print(f"Ten-fold cross-validation scores: {scores}\n")

[Output: the ten cross-validation scores and their mean]
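
To show what cross_val_score does internally, here is a hedged sketch that writes the ten-fold loop by hand. For a classifier and an integer cv, scikit-learn splits with StratifiedKFold, so the manual scores should match the library's. The pixel scaling, max_iter=2000, and all variable names are assumptions made for this illustration, and the whole dataset is used instead of the article's training split to keep the block self-contained.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_digits(return_X_y=True)
X = X / 16.0  # assumption: scaling added only to speed up convergence
clf = LogisticRegression(max_iter=2000)

folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in folds.split(X, y):
    fold_model = clone(clf)                      # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on nine folds
    # Validate on the held-out fold (accuracy)
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(clf, X, y, cv=10)
print(manual_scores.mean(), library_scores.mean())
```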

4.2 Error rate

y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)  # score() returns accuracy, not error rate
error_rate = 1 - accuracy

print(f"Error rate: {error_rate}\n")
print(f"Test set predictions: {y_pred}\n")

[Output: the test-set error rate and the predicted labels]
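
A single error rate does not show which digits get confused with each other. As an optional extension beyond the original article, the sketch below computes the accuracy, the error rate as its complement, and a confusion matrix. The split seed and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)    # same value as clf.score(X_te, y_te)
err = 1.0 - acc                     # error rate is the complement of accuracy
cm = confusion_matrix(y_te, pred)   # rows: true digit, columns: predicted digit

print(f"accuracy={acc:.4f}, error rate={err:.4f}")
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the number of test samples equals the accuracy.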

5. Full Source Code

from sklearn.linear_model import LogisticRegression

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from matplotlib import pyplot as plt

# Load the dataset and extract the features and labels
datas = load_digits()
X_data = datas.data
y_data = datas.target

# Show the images of the first ten samples
fig, ax = plt.subplots(
    nrows=2,
    ncols=5,
    sharex=True,
    sharey=True, )
ax = ax.flatten()
for i in range(10):
    ax[i].imshow(datas.data[i].reshape((8, 8)), cmap='Greys', interpolation='nearest')
plt.show()

# Split the dataset: 70% for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3)
# Build the logistic regression model. multi_class is omitted: the default lbfgs
# solver already uses the multinomial loss, and newer releases deprecate it.
model = LogisticRegression(max_iter=10000, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=10)  # ten-fold cross-validation
print("Ten-fold cross-validation mean:", scores.mean())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# score() returns accuracy, so the error rate is its complement
error_rate = 1 - model.score(X_test, y_test)

print(f"Ten-fold cross-validation scores: {scores}\n")
print(f"Error rate: {error_rate}\n")
print(f"Test set predictions: {y_pred}\n")