Data Science Fundamentals Lab Assignment (4): Nanjing Tech University, Spring 2023

Melody_0v0

Last modified: 2024-07-14 16:08:34

Tags: machine learning, artificial intelligence

First published: 2024-07-14 16:03:54

Article link: https://blog.csdn.net/qq_41723563/article/details/140418206

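Data Science Lab Assignment, April 27

made by njtech_计2104 Melody

Task 5: Classification Analysis

5.1 kNN and SVM

Using the kNN and SVM algorithms as examples, understand the basic principles and workflow of classification algorithms, and understand how kNN and SVM differ.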

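kNN (k-Nearest Neighbors)

kNN is an instance-based learning method, most commonly used for classification.

Basic principle: a sample's class can be inferred from the classes of its neighbors. To classify an unknown sample, all labeled samples serve as references: compute the distance from the unknown sample to every labeled sample, take the K labeled samples closest to it, and apply majority voting, assigning the unknown sample to the class that is most common among those K nearest neighbors.

Procedure:

1. Compute the distance between the sample to classify and each labeled sample.
2. Sort by distance in ascending order and take the k nearest neighbors.
3. Count how often each class occurs among the k neighbors.
4. Assign the most frequent class to the sample.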
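
The four steps above can be sketched from scratch in NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name knn_predict and the toy data are made up for this example, and Euclidean distance is assumed.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Step 1: Euclidean distance from the query to every training sample
    distances = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: count how often each class appears among the neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    # Step 4: return the most frequent class (ties go to the smallest label)
    return labels[np.argmax(counts)]

# Toy data (made up for illustration): two clusters in 2-D
X_demo = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_demo = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_demo, y_demo, [1.1, 1.0]))  # query near the first cluster
print(knn_predict(X_demo, y_demo, [5.1, 5.0]))  # query near the second cluster
```

The sketch scans every training point for each query, which corresponds to brute-force search; tree-based indexes avoid that full scan.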

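Implementation details:

Choosing K: the letter K denotes the number of nearest training samples to consult. In scikit-learn, K is set through the n_neighbors parameter, whose default is 5. This post uses K = 3.
```
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
```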

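KNN source code analysis

First, a recap of the KNN procedure:

1. Compute the distance between the sample to classify and every sample in the labeled dataset.
2. Sort the distances in ascending order.
3. Take the k samples with the smallest distances.
4. Count the number of samples of each class among those k.
5. Assign the most frequent class to the sample.

Below is a walkthrough of the KNeighborsClassifier source code.

Imports of the relevant libraries and functions (the scikit-learn source itself uses package-relative imports):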

import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, ClassifierMixin
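from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors._base import KNeighborsMixin, NeighborsBase  # private module; the path may differ between versions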
from sklearn.utils import check_array, check_X_y
from sklearn.utils.validation import check_is_fitted
from sklearn.utils.multiclass import check_classification_targets

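Defining the KNeighborsClassifier class: KNeighborsClassifier inherits from KNeighborsMixin, ClassifierMixin, and NeighborsBase.

class KNeighborsClassifier(KNeighborsMixin, ClassifierMixin, NeighborsBase):

KNeighborsMixin: provides the k-nearest-neighbor query methods, mainly kneighbors and kneighbors_graph. kneighbors finds the k nearest neighbors of each query point and returns their distances and indices. kneighbors_graph returns a sparse graph of the connections between each point and its k nearest neighbors.
ClassifierMixin: provides functionality common to all classifiers, mainly the score method, which computes the mean accuracy. Every classifier that inherits from ClassifierMixin therefore shares the same evaluation method.
NeighborsBase: the base class of the neighbors module. It provides basic parameter checking and the logic around distance metrics. The nearest-neighbor estimators (such as KNeighborsClassifier and RadiusNeighborsClassifier) all inherit from it.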
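
To make the roles of the mixins concrete, here is a small sketch that calls the inherited methods on a fitted classifier. The 1-D toy data is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D training set (made up for illustration)
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0, 0, 1, 1])

clf = KNeighborsClassifier(n_neighbors=2).fit(X_toy, y_toy)

# kneighbors (from KNeighborsMixin): distances and indices of the 2 nearest
# training points for each query point
dist, ind = clf.kneighbors(np.array([[0.4]]))
print(dist, ind)

# kneighbors_graph (from KNeighborsMixin): sparse connectivity matrix with one
# row per query point and one column per training point
graph = clf.kneighbors_graph(np.array([[0.4]]))
print(graph.toarray())

# score (from ClassifierMixin): mean accuracy on the given data
accuracy = clf.score(X_toy, y_toy)
print(accuracy)
```

For the query 0.4, the two nearest training points are those at 0.0 and 1.0, so the graph row marks the first two columns.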

Initialization function:

The __init__ function initializes the object.

    def __init__(
        self,					
        n_neighbors=5,			
        *,
        weights="uniform",
        algorithm="auto",
        leaf_size=30,
        p=2,
        metric="minkowski",
        metric_params=None,
        n_jobs=None,
    ):
        super().__init__(
            n_neighbors=n_neighbors,
            algorithm=algorithm,
            leaf_size=leaf_size,
            metric=metric,
            p=p,
            metric_params=metric_params,
            n_jobs=n_jobs,
        )
        self.weights = weights

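Parameters: the __init__ method accepts the following:
- n_neighbors: default 5; the number of neighbors to query.
- weights: default "uniform"; the weight function. Options are "uniform" (all neighbors weigh the same), "distance" (weights are inversely proportional to distance), or a user-defined callable.
- algorithm: default "auto"; the algorithm used to find the nearest neighbors. Options are "ball_tree", "kd_tree", "brute", and "auto".
- leaf_size: default 30; the leaf size for BallTree or KDTree. It affects construction and query speed as well as the memory needed to store the tree.
- p: default 2; the power parameter of the Minkowski distance. p=1 gives the Manhattan distance and p=2 the Euclidean distance.
- metric: default "minkowski"; the distance metric.
- metric_params: default None; additional keyword arguments for the metric function.
- n_jobs: default None; the number of parallel jobs for the neighbor search. -1 uses all available processors.
Calling the parent constructor: super().__init__() invokes NeighborsBase.__init__ and passes the parameters along so that NeighborsBase completes its part of the initialization.
Storing the weights: self.weights = weights saves the weights argument on the instance, so later computations can refer to self.weights.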
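
The effect of the weights parameter is easiest to see on a tiny example. The sketch below uses made-up 1-D data in which one neighbor is very close to the query and two neighbors of the other class are farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data (made up): one class-0 point close to the query,
# two class-1 points farther away
X_toy = np.array([[0.0], [1.0], [1.2]])
y_toy = np.array([0, 1, 1])
query = np.array([[0.1]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

# Uniform voting: class 1 wins 2 votes to 1
pred_uniform = uniform.predict(query)[0]
# Distance weighting: the class-0 neighbor at distance 0.1 gets weight 1/0.1 = 10,
# which outweighs 1/0.9 + 1/1.1 for the two class-1 neighbors
pred_weighted = weighted.predict(query)[0]
print(pred_uniform, pred_weighted)
```

With the same three neighbors, the two weighting schemes disagree: uniform voting follows the count, distance weighting follows the closest point.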

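Training function .fit()

The fit function trains the classifier. It validates the hyperparameters and the input data, stores the training data and labels, and builds the neighbor-search structure. No neighbors are computed until prediction time.

    def fit(self, X, y):
        """Fit the k-nearest neighbors classifier from the training dataset.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features) or \
                (n_samples, n_samples) if metric='precomputed'
            Training data.

        y : {array-like, sparse matrix} of shape (n_samples,) or \
                (n_samples, n_outputs)
            Target values.

        Returns
        -------
        self : KNeighborsClassifier
            The fitted k-nearest neighbors classifier.
        """
        self._validate_params()  # validate the constructor hyperparameters

        return self._fit(X, y)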

    def _fit(self, X, y=None):
        if self._get_tags()["requires_y"]:
            if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
                X, y = self._validate_data(
                    X, y, accept_sparse="csr", multi_output=True, order="C"
                )

            if is_classifier(self):
                # Classification targets require a specific format
                if y.ndim == 1 or y.ndim == 2 and y.shape[1] == 1:
                    if y.ndim != 1:
                        warnings.warn(
                            "A column-vector y was passed when a "
                            "1d array was expected. Please change "
                            "the shape of y to (n_samples,), for "
                            "example using ravel().",
                            DataConversionWarning,
                            stacklevel=2,
                        )

                    self.outputs_2d_ = False
                    y = y.reshape((-1, 1))
                else:
                    self.outputs_2d_ = True

                check_classification_targets(y)
                self.classes_ = []
                self._y = np.empty(y.shape, dtype=int)
                for k in range(self._y.shape[1]):
                    classes, self._y[:, k] = np.unique(y[:, k], return_inverse=True)
                    self.classes_.append(classes)

                if not self.outputs_2d_:
                    self.classes_ = self.classes_[0]
                    self._y = self._y.ravel()
            else:
                self._y = y

        else:
            if not isinstance(X, (KDTree, BallTree, NeighborsBase)):
                X = self._validate_data(X, accept_sparse="csr", order="C")

        self._check_algorithm_metric()
        if self.metric_params is None:
            self.effective_metric_params_ = {}
        else:
            self.effective_metric_params_ = self.metric_params.copy()

        effective_p = self.effective_metric_params_.get("p", self.p)
        if self.metric in ["wminkowski", "minkowski"]:
            self.effective_metric_params_["p"] = effective_p

        self.effective_metric_ = self.metric
        # For minkowski distance, use more efficient methods where available
        if self.metric == "minkowski":
            p = self.effective_metric_params_.pop("p", 2)
            w = self.effective_metric_params_.pop("w", None)

            if p == 1 and w is None:
                self.effective_metric_ = "manhattan"
            elif p == 2 and w is None:
                self.effective_metric_ = "euclidean"
            elif p == np.inf and w is None:
                self.effective_metric_ = "chebyshev"
            else:
                # Use the generic minkowski metric, possibly weighted.
                self.effective_metric_params_["p"] = p
                self.effective_metric_params_["w"] = w

        if isinstance(X, NeighborsBase):
            self._fit_X = X._fit_X
            self._tree = X._tree
            self._fit_method = X._fit_method
            self.n_samples_fit_ = X.n_samples_fit_
            return self

        elif isinstance(X, BallTree):
            self._fit_X = X.data
            self._tree = X
            self._fit_method = "ball_tree"
            self.n_samples_fit_ = X.data.shape[0]
            return self

        elif isinstance(X, KDTree):
            self._fit_X = X.data
            self._tree = X
            self._fit_method = "kd_tree"
            self.n_samples_fit_ = X.data.shape[0]
            return self

        if self.metric == "precomputed":
            X = _check_precomputed(X)
            # Precomputed matrix X must be squared
            if X.shape[0] != X.shape[1]:
                raise ValueError(
                    "Precomputed matrix must be square."
                    " Input is a {}x{} matrix.".format(X.shape[0], X.shape[1])
                )
            self.n_features_in_ = X.shape[1]

        n_samples = X.shape[0]
        if n_samples == 0:
            raise ValueError("n_samples must be greater than 0")

        if issparse(X):
            if self.algorithm not in ("auto", "brute"):
                warnings.warn("cannot use tree with sparse input: using brute force")

            if self.effective_metric_ not in VALID_METRICS_SPARSE[
                "brute"
            ] and not callable(self.effective_metric_):
                raise ValueError(
                    "Metric '%s' not valid for sparse input. "
                    "Use sorted(sklearn.neighbors."
                    "VALID_METRICS_SPARSE['brute']) "
                    "to get valid options. "
                    "Metric can also be a callable function." % (self.effective_metric_)
                )
            self._fit_X = X.copy()
            self._tree = None
            self._fit_method = "brute"
            self.n_samples_fit_ = X.shape[0]
            return self

        self._fit_method = self.algorithm
        self._fit_X = X
        self.n_samples_fit_ = X.shape[0]

        if self._fit_method == "auto":
            # A tree approach is better for small number of neighbors or small
            # number of features, with KDTree generally faster when available
            if (
                self.metric == "precomputed"
                or self._fit_X.shape[1] > 15
                or (
                    self.n_neighbors is not None
                    and self.n_neighbors >= self._fit_X.shape[0] // 2
                )
            ):
                self._fit_method = "brute"
            else:
                if (
                    # TODO(1.3): remove "wminkowski"
                    self.effective_metric_ in ("wminkowski", "minkowski")
                    and self.effective_metric_params_["p"] < 1
                ):
                    self._fit_method = "brute"
                elif (
                    self.effective_metric_ == "minkowski"
                    and self.effective_metric_params_.get("w") is not None
                ):
                    # Be consistent with scipy 1.8 conventions: in scipy 1.8,
                    # 'wminkowski' was removed in favor of passing a
                    # weight vector directly to 'minkowski'.
                    #
                    # 'wminkowski' is not part of valid metrics for KDTree but
                    # the 'minkowski' without weights is.
                    #
                    # Hence, we detect this case and choose BallTree
                    # which supports 'wminkowski'.
                    self._fit_method = "ball_tree"
                elif self.effective_metric_ in VALID_METRICS["kd_tree"]:
                    self._fit_method = "kd_tree"
                elif (
                    callable(self.effective_metric_)
                    or self.effective_metric_ in VALID_METRICS["ball_tree"]
                ):
                    self._fit_method = "ball_tree"
                else:
                    self._fit_method = "brute"

        if (
            # TODO(1.3): remove "wminkowski"
            self.effective_metric_ in ("wminkowski", "minkowski")
            and self.effective_metric_params_["p"] < 1
        ):
            # For 0 < p < 1 Minkowski distances aren't valid distance
            # metric as they do not satisfy triangular inequality:
            # they are semi-metrics.
            # algorithm="kd_tree" and algorithm="ball_tree" can't be used because
            # KDTree and BallTree require a proper distance metric to work properly.
            # However, the brute-force algorithm supports semi-metrics.
            if self._fit_method == "brute":
                warnings.warn(
                    "Mind that for 0 < p < 1, Minkowski metrics are not distance"
                    " metrics. Continuing the execution with `algorithm='brute'`."
                )
            else:  # self._fit_method in ("kd_tree", "ball_tree")
                raise ValueError(
                    f'algorithm="{self._fit_method}" does not support 0 < p < 1 for '
                    "the Minkowski metric. To resolve this problem either "
                    'set p >= 1 or algorithm="brute".'
                )

        if self._fit_method == "ball_tree":
            self._tree = BallTree(
                X,
                self.leaf_size,
                metric=self.effective_metric_,
                **self.effective_metric_params_,
            )
        elif self._fit_method == "kd_tree":
            if (
                self.effective_metric_ == "minkowski"
                and self.effective_metric_params_.get("w") is not None
            ):
                raise ValueError(
                    "algorithm='kd_tree' is not valid for "
                    "metric='minkowski' with a weight parameter 'w': "
                    "try algorithm='ball_tree' "
                    "or algorithm='brute' instead."
                )
            self._tree = KDTree(
                X,
                self.leaf_size,
                metric=self.effective_metric_,
                **self.effective_metric_params_,
            )
        elif self._fit_method == "brute":
            self._tree = None

        return self

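First, the _fit method takes two arguments, X and y: X is the input data and y the labels. If _get_tags()["requires_y"] is True, X and y are validated and processed. For classification, y is converted to a suitable format (integer-encoded labels) and the class information is stored.
Next, the code checks whether the chosen distance metric and algorithm are compatible, raising an error or a warning if they are not.
Then the code picks a nearest-neighbor search algorithm based on the input type and parameters:
- If X is already a NeighborsBase, BallTree, or KDTree instance, that instance is reused directly.
- If X is a sparse matrix, brute-force search is used.
- If X is a dense array, kd-tree, ball-tree, or brute-force search is chosen according to the number of features, the distance metric, and related conditions.
Finally, the matching tree structure (BallTree or KDTree) is instantiated, or _tree is set to None for brute-force search.

After fitting, a KNeighborsClassifier instance has the following attributes:

_fit_X: the input data.
_y: the label data.
_tree: the fitted tree structure (when the kd-tree or ball-tree algorithm is used).
_fit_method: the search algorithm actually used ("brute", "kd_tree", or "ball_tree").
n_samples_fit_: the number of samples in the fitted data.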
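
These attributes can be inspected after fitting. The sketch below assumes the selection logic shown in the source listing above; _fit_method is a private attribute and may change between scikit-learn versions, and the 20-feature array is synthetic data generated only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# Few features and the default metric: "auto" resolves to a KD-tree,
# and minkowski with p=2 is stored as "euclidean"
print(knn._fit_method, knn.effective_metric_, knn.n_samples_fit_)
print(knn.classes_)

# More than 15 features: "auto" falls back to brute-force search
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(150, 20))  # synthetic data, for illustration only
knn_wide = KNeighborsClassifier(n_neighbors=3).fit(X_wide, y)
print(knn_wide._fit_method)
```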

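Prediction function .predict()

The predict function predicts the class of each input sample. It makes sure the model has been fitted (the check runs inside kneighbors), finds the k nearest training samples of each query point, and then lets the neighbors' labels vote to produce the predicted class.

    def predict(self, X):
        """Predict the class labels for the provided data.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_queries, n_features), \
                or (n_queries, n_indexed) if metric == 'precomputed'
            Test samples.

        Returns
        -------
        y : ndarray of shape (n_queries,) or (n_queries, n_outputs)
            Class labels for each data sample.
        """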
        if self.weights == "uniform":
            # In that case, we do not need the distances to perform
            # the weighting so we do not compute them.
            neigh_ind = self.kneighbors(X, return_distance=False)
            neigh_dist = None
        else:
            neigh_dist, neigh_ind = self.kneighbors(X)

        classes_ = self.classes_
        _y = self._y
        if not self.outputs_2d_:
            _y = self._y.reshape((-1, 1))
            classes_ = [self.classes_]

        n_outputs = len(classes_)
        n_queries = _num_samples(X)
        weights = _get_weights(neigh_dist, self.weights)

        y_pred = np.empty((n_queries, n_outputs), dtype=classes_[0].dtype)
        for k, classes_k in enumerate(classes_):
            if weights is None:
                mode, _ = _mode(_y[neigh_ind, k], axis=1)
            else:
                mode, _ = weighted_mode(_y[neigh_ind, k], weights, axis=1)

            mode = np.asarray(mode.ravel(), dtype=np.intp)
            y_pred[:, k] = classes_k.take(mode)

        if not self.outputs_2d_:
            y_pred = y_pred.ravel()

        return y_pred

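First, the weight type (self.weights) decides whether the distances to the neighbors are needed. With "uniform" weights every neighbor counts equally, so the distances are not returned; otherwise they are.
The kneighbors method finds the nearest neighbors of the given data X and returns the neighbor indices (neigh_ind) and, if needed, the distances (neigh_dist).
Class information is then prepared. classes_ and _y hold the class information; depending on the output dimensionality (outputs_2d_), they may need reshaping.
The prediction is computed next. Using the neighbor indices and distances from kneighbors together with the class information in _y, each query point gets a prediction: a weighted vote over the nearest neighbors' labels when weights are used, or a plain majority vote when the weights are "uniform".
Finally, the result is brought into the appropriate shape, (n_queries,) or (n_queries, n_outputs), and returned.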
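
The uniform-weight path can be reproduced by hand with kneighbors and a vote count. This is a simplified sketch for single-output integer labels 0..2 (as in iris), not the library code; np.bincount stands in for the mode computation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Indices of the 3 nearest training samples for every test sample
neigh_ind = knn.kneighbors(X_test, return_distance=False)
# Labels of those neighbors, shape (n_queries, 3)
neigh_labels = y_train[neigh_ind]
# Majority vote per row; argmax breaks ties in favor of the smallest label
manual_pred = np.array([np.bincount(row, minlength=3).argmax() for row in neigh_labels])

print(np.array_equal(manual_pred, knn.predict(X_test)))
```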

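The essence of the KNN algorithm

Consider how to decide which class the green circle in the figure belongs to: the red triangles or the blue squares? With K=3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red-triangle class. With K=5, blue squares make up 3/5, so it is assigned to the blue-square class.

[Figure: a green circle (the query point) surrounded by red triangles and blue squares]

As this shows, the parameter that usually affects KNN accuracy the most is K, which strongly influences both accuracy and stability. A small k is sensitive to noisy data and tends to overfit; a large k can oversmooth the decision and underfit. In practice, k is usually chosen by cross-validation.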
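
One way to pick k by cross-validation is sketched below, using cross_val_score on the iris data. The candidate range 1 to 15 and the 5 folds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate k values (an arbitrary illustrative range)
candidate_ks = range(1, 16)
mean_scores = {}
for k in candidate_ks:
    # Mean accuracy over 5 stratified folds for this k
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print("best k:", best_k, "cv accuracy: {:.3f}".format(mean_scores[best_k]))
```

On a dataset this small, several values of k often score almost identically, so the selected value should be read as one reasonable choice rather than a sharp optimum.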

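So what does training KNN actually amount to? Unlike a neural network, it does not iterate repeatedly; it simply records every training point. The trained model is essentially the stored training set. KNN is an instance-based (or lazy) learning method: it learns no explicit model and instead looks up the nearest neighbors in the training set at prediction time. The training phase involves no optimization; it stores the training data (optionally indexed in a tree) for later queries and predictions.

In effect, the trained KNN model is the training dataset itself!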

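Appendix: differences between KNN and neural networks

KNN (K-Nearest Neighbors) and neural networks differ considerably in their trained model, learning process, and prediction method. The main differences are:

Learning method: KNN is an instance-based method, also called lazy learning. It learns no explicit model and predicts directly from the training data. A neural network is a model-based method: training learns the underlying structure and feature representations of the dataset and yields an explicit model.
Training process: training KNN is trivial, since it only stores the training set. Training a neural network typically involves many iterations of weight updates to minimize a loss function, and usually requires substantial compute and time.
Trained model: the KNN model is the training set itself, with no explicit learned parameters. A neural network's model is a structure of layers and connection weights, and those weights are optimized during training.
Prediction: KNN must compute distances between the query and the training points (all of them under brute-force search), pick the k nearest neighbors, and vote. A neural network only needs a forward pass through the trained network, and the output layer gives the prediction.
Interpretability: KNN is relatively easy to explain because its prediction is a vote among nearest neighbors. Neural networks, especially deep ones, are harder to interpret because of their many parameters and layered structure.
Compute requirements: at prediction time, KNN needs considerable compute to calculate distances and find neighbors, especially on large datasets, whereas a trained neural network usually predicts quickly with a forward pass. During training, however, a neural network needs far more compute and time to optimize its parameters.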

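SVM (Support Vector Machine)

SVM is a binary classification algorithm based on the principle of structural risk minimization, and it can be extended to multi-class problems. Its core idea is to find an optimal hyperplane in feature space that separates samples of different classes.

Basic principle: find the hyperplane in feature space whose margin to the two classes is largest, which gives the best separation. For problems that are not linearly separable, SVM uses a kernel function to map the low-dimensional feature space into a higher-dimensional one in which the data become linearly separable.

Procedure:

1. Determine whether the dataset is linearly separable; if it is not, choose a suitable kernel function to map the data into a higher-dimensional space.
2. Construct the optimization objective and solve for the parameters of the optimal hyperplane.
3. Use the optimal parameters to obtain the optimal hyperplane and classify samples with it.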
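
The first step, choosing a kernel when the data are not linearly separable, can be illustrated with a small experiment. The sketch below assumes synthetic ring-shaped data from make_circles (not part of the assignment) and compares a linear kernel with an RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data (for illustration): one class forms a ring around the other
X_c, y_c = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X_c, y_c).score(X_c, y_c)
rbf_acc = SVC(kernel="rbf").fit(X_c, y_c).score(X_c, y_c)

print("linear kernel training accuracy: {:.2f}".format(linear_acc))
print("rbf kernel training accuracy:    {:.2f}".format(rbf_acc))
```

No straight line can separate a ring from its center, so the linear kernel does poorly here, while the RBF kernel separates the two classes.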

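Source code analysis

Importing the relevant class

from sklearn.svm import SVC

class SVC(BaseSVC):

Class definition and inheritance: SVC inherits from BaseSVC, which in turn inherits from BaseLibSVM, so SVC has all the methods and attributes of both. BaseLibSVM is scikit-learn's base wrapper around the libsvm library.

Initialization function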

    def __init__(
        self,
        *,
        C=1.0,
        kernel="rbf",
        degree=3,
        gamma="scale",
        coef0=0.0,
        shrinking=True,
        probability=False,
        tol=1e-3,
        cache_size=200,
        class_weight=None,
        verbose=False,
        max_iter=-1,
        decision_function_shape="ovr",
        break_ties=False,
        random_state=None,
    ):

        super().__init__(
            kernel=kernel,
            degree=degree,
            gamma=gamma,
            coef0=coef0,
            tol=tol,
            C=C,
            nu=0.0,
            shrinking=shrinking,
            probability=probability,
            cache_size=cache_size,
            class_weight=class_weight,
            verbose=verbose,
            max_iter=max_iter,
            decision_function_shape=decision_function_shape,
            break_ties=break_ties,
            random_state=random_state,
        )

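C: float, default 1.0. Penalty coefficient for the error term (smaller values mean stronger regularization).
kernel: str, default "rbf". Kernel type.
degree: int, default 3. Degree of the polynomial kernel.
gamma: {"scale", "auto"} or float, default "scale". Kernel coefficient.
coef0: float, default 0.0. Independent term in the kernel function.
shrinking: bool, default True. Whether to use the shrinking heuristic.
probability: bool, default False. Whether to enable probability estimates.
tol: float, default 1e-3. Tolerance for the stopping criterion.
cache_size: float, default 200. Kernel cache size (MB).
class_weight: {dict, "balanced"}, default None. Class weights.
verbose: bool, default False. Whether to enable verbose output.
max_iter: int, default -1. Maximum number of iterations (-1 means no limit).
decision_function_shape: {"ovo", "ovr"}, default "ovr". Shape of the decision function.
break_ties: bool, default False. Whether predict breaks ties between classes.
random_state: int, RandomState instance, or None, default None. Seed for the random number generator.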
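
To get a feel for the C parameter, the following sketch fits a linear SVC on iris with three values of C and reports the number of support vectors. The specific C values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

n_sv = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class
    n_sv[C] = int(model.n_support_.sum())
    print("C={:<6} support vectors: {:3d}  training accuracy: {:.2f}".format(
        C, n_sv[C], model.score(X, y)))
```

A small C tolerates more margin violations, so more training points end up as support vectors; a large C penalizes violations heavily and the boundary is determined by fewer points.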

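Kernel functions

The SVC class supports several kernels. The built-in kernels are evaluated inside libsvm; scikit-learn also exposes the same formulas in sklearn.metrics.pairwise (for example through the pairwise_kernels function), which computes kernel values between input data points. The formulas, with $\gamma$ set by gamma, $r$ by coef0, and $d$ by degree, are:

Linear kernel (linear): $K(x, y) = x^T y$
Polynomial kernel (poly): $K(x, y) = (\gamma x^T y + r)^d$
Radial basis function kernel (rbf): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
Sigmoid kernel (sigmoid): $K(x, y) = \tanh(\gamma x^T y + r)$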
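
The four formulas can be checked numerically against the helpers in sklearn.metrics.pairwise. The matrices A and B and the values of gamma, r, and d below are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

# Two small sample matrices (made up for illustration); rows are data points
A = np.array([[1.0, 2.0], [0.5, -1.0]])
B = np.array([[0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]])
gamma, r, d = 0.5, 1.0, 3  # gamma, coef0, degree

inner = A @ B.T  # matrix of x^T y for every pair of rows
# Squared Euclidean distance ||x - y||^2 for every pair of rows
sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

K_linear = inner
K_poly = (gamma * inner + r) ** d
K_rbf = np.exp(-gamma * sq_dist)
K_sigmoid = np.tanh(gamma * inner + r)

print(np.allclose(K_linear, linear_kernel(A, B)))
print(np.allclose(K_poly, polynomial_kernel(A, B, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(A, B, gamma=gamma)))
print(np.allclose(K_sigmoid, sigmoid_kernel(A, B, gamma=gamma, coef0=r)))
```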

Training function .fit()

    def fit(self, X, y, sample_weight=None):
        """Fit the SVM model according to the given training data.

        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features) \
                or (n_samples, n_samples)
            Training vectors, where `n_samples` is the number of samples
            and `n_features` is the number of features.
            For kernel="precomputed", the expected shape of X is
            (n_samples, n_samples).

        y : array-like of shape (n_samples,)
            Target values (class labels in classification, real numbers in
            regression).

        sample_weight : array-like of shape (n_samples,), default=None
            Per-sample weights. Rescale C per sample. Higher weights
            force the classifier to put more emphasis on these points.

        Returns
        -------
        self : object
            Fitted estimator.

        Notes
        -----
        If X and y are not C-ordered and contiguous arrays of np.float64 and
        X is not a scipy.sparse.csr_matrix, X and/or y may be copied.

        If X is a dense array, then the other methods will not support sparse
        matrices as input.
        """
        self._validate_params()

        rnd = check_random_state(self.random_state)

        sparse = sp.isspmatrix(X)
        if sparse and self.kernel == "precomputed":
            raise TypeError("Sparse precomputed kernels are not supported.")
        self._sparse = sparse and not callable(self.kernel)

        if callable(self.kernel):
            check_consistent_length(X, y)
        else:
            X, y = self._validate_data(
                X,
                y,
                dtype=np.float64,
                order="C",
                accept_sparse="csr",
                accept_large_sparse=False,
            )

        y = self._validate_targets(y)

        sample_weight = np.asarray(
            [] if sample_weight is None else sample_weight, dtype=np.float64
        )
        solver_type = LIBSVM_IMPL.index(self._impl)

        # input validation
        n_samples = _num_samples(X)
        if solver_type != 2 and n_samples != y.shape[0]:
            raise ValueError(
                "X and y have incompatible shapes.\n"
                + "X has %s samples, but y has %s." % (n_samples, y.shape[0])
            )

        if self.kernel == "precomputed" and n_samples != X.shape[1]:
            raise ValueError(
                "Precomputed matrix must be a square matrix."
                " Input is a {}x{} matrix.".format(X.shape[0], X.shape[1])
            )

        if sample_weight.shape[0] > 0 and sample_weight.shape[0] != n_samples:
            raise ValueError(
                "sample_weight and X have incompatible shapes: "
                "%r vs %r\n"
                "Note: Sparse matrices cannot be indexed w/"
                "boolean masks (use `indices=True` in CV)."
                % (sample_weight.shape, X.shape)
            )

        kernel = "precomputed" if callable(self.kernel) else self.kernel

        if kernel == "precomputed":
            # unused but needs to be a float for cython code that ignores
            # it anyway
            self._gamma = 0.0
        elif isinstance(self.gamma, str):
            if self.gamma == "scale":
                # var = E[X^2] - E[X]^2 if sparse
                X_var = (X.multiply(X)).mean() - (X.mean()) ** 2 if sparse else X.var()
                self._gamma = 1.0 / (X.shape[1] * X_var) if X_var != 0 else 1.0
            elif self.gamma == "auto":
                self._gamma = 1.0 / X.shape[1]
        elif isinstance(self.gamma, Real):
            self._gamma = self.gamma

        fit = self._sparse_fit if self._sparse else self._dense_fit
        if self.verbose:
            print("[LibSVM]", end="")

        seed = rnd.randint(np.iinfo("i").max)
        fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
        # see comment on the other call to np.iinfo in this file

        self.shape_fit_ = X.shape if hasattr(X, "shape") else (n_samples,)

        # In binary case, we need to flip the sign of coef, intercept and
        # decision function. Use self._intercept_ and self._dual_coef_
        # internally.
        self._intercept_ = self.intercept_.copy()
        self._dual_coef_ = self.dual_coef_
        if self._impl in ["c_svc", "nu_svc"] and len(self.classes_) == 2:
            self.intercept_ *= -1
            self.dual_coef_ = -self.dual_coef_

        dual_coef = self._dual_coef_.data if self._sparse else self._dual_coef_
        intercept_finiteness = np.isfinite(self._intercept_).all()
        dual_coef_finiteness = np.isfinite(dual_coef).all()
        if not (intercept_finiteness and dual_coef_finiteness):
            raise ValueError(
                "The dual coefficients or intercepts are not finite. "
                "The input data may contain large values and need to be"
                "preprocessed."
            )

        # Since, in the case of SVC and NuSVC, the number of models optimized by
        # libSVM could be greater than one (depending on the input), `n_iter_`
        # stores an ndarray.
        # For the other sub-classes (SVR, NuSVR, and OneClassSVM), the number of
        # models optimized by libSVM is always one, so `n_iter_` stores an
        # integer.
        if self._impl in ["c_svc", "nu_svc"]:
            self.n_iter_ = self._num_iter
        else:
            self.n_iter_ = self._num_iter.item()

        return self

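It takes the following parameters:

X: array or sparse matrix of shape (n_samples, n_features) or (n_samples, n_samples); the training vectors, where n_samples is the number of samples and n_features the number of features. For kernel="precomputed", the expected shape of X is (n_samples, n_samples).
y: array of shape (n_samples,); the target values (class labels in classification, real numbers in regression).
sample_weight: array of shape (n_samples,); per-sample weights. Higher weights make the classifier put more emphasis on those points.

The method returns the fitted estimator object.

The method first validates the hyperparameters and then checks whether the input is sparse. Next it validates the input data to ensure the correct dtype and shape, validates the target values, and converts the sample weights as needed.

It then determines the solver type and performs further checks on the input. It also derives kernel parameters (such as gamma) from the input data.

Finally, it calls _sparse_fit or _dense_fit (depending on whether the input is sparse) with the appropriate arguments to fit the SVM model, and then sets related attributes such as the fitted shape, the intercept, and the dual coefficients.

In short, this fit method fits the SVM model to the given training data, handling data validation, parameter adjustment, and the model fit itself.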
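
The gamma handling in the listing can be verified on iris. The sketch below reads the private attribute _gamma, which appears in the source above but is not public API and may change between versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svc_scale = SVC(kernel="rbf", gamma="scale").fit(X, y)
svc_auto = SVC(kernel="rbf", gamma="auto").fit(X, y)

# gamma="scale": 1 / (n_features * X.var()); gamma="auto": 1 / n_features
expected_scale = 1.0 / (X.shape[1] * X.var())
expected_auto = 1.0 / X.shape[1]

# _gamma is a private attribute and may change between scikit-learn versions
print(svc_scale._gamma, expected_scale)
print(svc_auto._gamma, expected_auto)
```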

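The essence of the algorithm

Suppose we have a dataset in two-dimensional space with two classes, A and B. The goal is to build a classifier from these labeled points so that new, unlabeled points can be classified later.

Here SVM helps us find a decision boundary (a straight line in this case) that separates the class A points from the class B points as correctly as possible.

[Figure: two classes of 2-D points separated by a straight decision boundary]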

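Here is how SVM achieves this:

Maximizing the margin: SVM looks for the line whose distance (margin) to the nearest points of both classes, the support vectors, is as large as possible. This improves generalization, because a larger margin lowers the risk of overfitting.
Minimizing classification error: a dataset may not be linearly separable, meaning no straight line separates class A from class B perfectly. In that case SVM tolerates some misclassified points in order to find a boundary that performs well overall. To do so it introduces slack variables and the penalty coefficient C, which control how much misclassification is tolerated.
Handling non-linearly separable problems: if the dataset is not linearly separable, SVM can use the kernel trick to map the points into a higher-dimensional feature space and look for a separating hyperplane there. Kernel functions (such as the radial basis function or the polynomial kernel) compute inner products in that high-dimensional space directly, avoiding the cost of mapping the points explicitly.

Once trained, the SVM gives us a classifier that assigns new, unlabeled points according to the decision boundary it found. This is useful for real-world classification problems such as spam filtering, image recognition, and text classification.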
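
For a linear kernel, the margin can be read off the fitted model: the boundary is $w^T x + b = 0$ and the margin width is $2 / \|w\|$. The sketch below uses four made-up points and a very large C to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up): class 0 on x=0, class 1 on x=2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard margin (no misclassification tolerated)
svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = svm.coef_[0]                   # normal vector of the separating line
b = svm.intercept_[0]              # offset
margin = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("w =", w, "b =", b)
print("margin width:", margin)
print("support vectors:\n", svm.support_vectors_)
```

The two classes sit on the vertical lines x=0 and x=2, so the widest possible separating band has width 2 and the boundary lies near x=1; the solver's result matches this up to its numerical tolerance.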

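Differences

1. Algorithm type: kNN is an instance-based method, whereas SVM is a boundary-based (maximum-margin) method.
2. Complexity: kNN is expensive at prediction time because the query must be compared against the stored training samples; once an SVM is trained, classification involves only the support vectors, so prediction is cheaper.
3. Robustness: SVM is fairly robust to outliers and noise, while kNN is more sensitive to them.
4. Kernel functions: SVM can turn a nonlinear problem into a linear one by using different kernel functions; kNN has no such mechanism.
5. Online learning: kNN supports online learning, since new data can simply be added to the training set, whereas SVM has to be retrained.
6. Parameter selection: the main parameter of kNN is k, which must be tuned for the problem at hand; SVM's parameters include the regularization parameter C and the kernel parameters, and their choice strongly affects model performance.
7. Interpretability: kNN is easy to interpret, since the classification result can be understood intuitively; an SVM model involves kernel functions and an optimal hyperplane, so it is somewhat harder to interpret.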

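5.2-4 Classification with KNN

Using the kNN algorithm with k=3, write a program to classify iris flowers. Using the preprocessed dataset built into sklearn is recommended.

Output the classification accuracy of the trained model on the training and test sets.

Visualize the predictions on the test set and compare them with the ground truth.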

# Import the required libraries
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classify with kNN, using k=3
k = 3
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Compute the classification accuracy on the training and test sets
train_accuracy = knn.score(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)

# Print the accuracy on the training and test sets
print("Training accuracy: {:.2f}".format(train_accuracy))
print("Test accuracy: {:.2f}".format(test_accuracy))

Training accuracy: 0.95
Test accuracy: 1.00

# Predict on the test set
y_pred = knn.predict(X_test)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the ground truth
ax.scatter(range(len(y_test)), y_test, c='b', marker='o', label='Ground Truth')

# Plot the predictions
ax.scatter(range(len(y_pred)), y_pred, c='r', marker='x', label='Predictions')

# Set the axis labels
ax.set_xlabel('Index')
ax.set_ylabel('Iris Type')

# Add the legend
ax.legend()

# Show the figure
plt.show()

# Predict on the test set
y_pred = knn.predict(X_test)

# Create the subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
axes = axes.ravel()

# Visualize each feature against the next one (wrapping around after the last)
for i in range(4):
    for target in np.unique(y_test):
        # Plot the true classes
        axes[i].scatter(X_test[(y_test == target), i], X_test[(y_test == target), (i + 1) % 4], label='True Class {}'.format(target))
        # Plot the predicted classes
        axes[i].scatter(X_test[(y_pred == target), i], X_test[(y_pred == target), (i + 1) % 4], marker='x', label='Predicted Class {}'.format(target))
    axes[i].set_xlabel('Feature {}'.format(i + 1))
    axes[i].set_ylabel('Feature {}'.format((i + 1) % 4 + 1))
    axes[i].legend()


plt.tight_layout()
plt.show()

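5.5-6 Classification with SVM

Train a model with the SVM algorithm and compute its classification accuracy on the training and test sets.

Pick two dimensions of the data and visualize the SVM model's decision boundary.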

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names  # extract the feature names

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Classify with SVM
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Compute the classification accuracy on the training and test sets
train_accuracy = svm.score(X_train, y_train)
test_accuracy = svm.score(X_test, y_test)

# Print the accuracy on the training and test sets
print("Training accuracy: {:.2f}".format(train_accuracy))
print("Test accuracy: {:.2f}".format(test_accuracy))

Training accuracy: 0.97
Test accuracy: 1.00

# Plot the decision boundaries for every pair of features
fig, axes = plt.subplots(4, 4, figsize=(16, 16))

for i in range(4):
    for j in range(4):
        if i != j:
            # Train a separate two-feature SVM so the four-feature model above is not overwritten
            svm_2d = SVC(kernel='linear')
            svm_2d.fit(X_train[:, [i, j]], y_train)
            # Build a grid over the two features and predict on every grid point
            x_min, x_max = X[:, i].min() - 1, X[:, i].max() + 1
            y_min, y_max = X[:, j].min() - 1, X[:, j].max() + 1
            xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
            Z = svm_2d.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            # Draw the decision regions and overlay all samples
            axes[i, j].contourf(xx, yy, Z, alpha=0.4)
            axes[i, j].scatter(X[:, i], X[:, j], c=y, edgecolors='k')
            axes[i, j].set_xlabel(feature_names[i])  # use the actual feature name
            axes[i, j].set_ylabel(feature_names[j])  # use the actual feature name
        else:
            # Diagonal cells only show the feature name
            axes[i, j].text(0.5, 0.5, feature_names[i], ha='center', va='center', fontsize=12)

plt.tight_layout()
plt.show()