【sklearn】Training Classic Classification Models with scikit-learn (Algorithm Principles and Implementation)


This post aims to summarize the classifiers commonly used in machine learning, walking each one from its principle to its sklearn implementation. Treat it as introductory reading or as a revision outline; take whatever you need.
I meant to write this summary back in early May; several months on, I have since switched to TensorFlow and MXNet. My fondness for sklearn hasn't changed, but I no longer have the energy to write up every classification algorithm from scratch, especially since the sklearn API is already so clear. So here is a light summary/compilation, which also puts a full stop on my scikit-learn chapter.

Classification Algorithms Provided by scikit-learn

A pleasant surprise: sklearn now has Chinese documentation, shared here: sklearn监督学习 (supervised learning).

The documentation divides classification methods into several broad families by model structure, which I find very sensible (a quick sketch after this list shows that, whatever the family, every estimator exposes the same API).
They include:
- Generalized linear models (including least squares, logistic regression, Lasso, ridge regression, Bayesian regression, and other common base classifiers)
- SVM
- Nearest neighbors
- Naive Bayes
- Decision trees
- Ensemble methods (including the main Bagging and Boosting approaches, plus random forests, GBDT, and other models that work well in practice)
- Neural networks (an MLP trained by backpropagation, with configurable activation functions and hidden-layer sizes; the drawback is no support for large-scale data or GPUs, but it serves as a baseline or quick test on small samples)
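
Whatever the family, each of these classifiers is exposed through the same estimator interface. Below is a minimal sketch of that uniformity (my own illustration, not taken from the docs), with one representative per family:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

# One representative estimator per family; all share fit/predict/score
for clf in [LogisticRegression(max_iter=1000), SVC(), KNeighborsClassifier(),
            GaussianNB(), DecisionTreeClassifier(),
            RandomForestClassifier(), MLPClassifier(max_iter=1000)]:
    clf.fit(X, y)
    print("%-24s train accuracy: %.2f"
          % (clf.__class__.__name__, clf.score(X, y)))

Swapping one estimator for another changes nothing else in the script, which is a big part of why prototyping in sklearn is so fast.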

Classic Classification Algorithms: Principles and sklearn Implementation

First, Gaussian Naive Bayes on the iris dataset (taken from the sklearn documentation). As a one-line refresher on the principle: Naive Bayes applies Bayes' rule under the assumption that features are conditionally independent given the class, and GaussianNB models each per-class feature likelihood as a univariate Gaussian.

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

# Load iris, fit Gaussian Naive Bayes, and predict back on the training data
iris = datasets.load_iris()
gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
print("Number of mislabeled points out of a total %d points : %d"
      % (iris.data.shape[0], (iris.target != y_pred).sum()))
Next, a decision tree on the same data, exported for visualization with graphviz. The original excerpt assumed an already-fitted clf; here we fit one explicitly:

from sklearn import tree
import graphviz

# Fit a decision tree on iris
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# Basic export: render the tree to "iris.pdf"
dot_data = tree.export_graphviz(clf, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")

# Prettier export with feature/class names and colored, rounded nodes
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph  # in a Jupyter notebook, this displays the tree inline
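
One caveat before moving on: both snippets above evaluate on the very data they were trained on, so the error counts are optimistic. A quick sketch of a less biased estimate via cross-validation (my own addition, not part of the original examples):

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validated accuracy instead of training-set accuracy
iris = datasets.load_iris()
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)
print("CV accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))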

======================================
Want to know what sklearn offers beyond convenient, approachable documentation?

Here are two examples to give newcomers a feel for it.

  • 1. A comparison of common classifiers on the same datasets, with decision surfaces plotted; it covers almost all of the classifiers above. (Ported from the sklearn example Classifier Comparison.) The resulting figure shows each classifier's decision surface and test accuracy on three synthetic datasets.
print(__doc__)


# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = .02  # step size in the mesh

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [make_moons(noise=0.3, random_state=0),
            make_circles(noise=0.2, factor=0.5, random_state=1),
            linearly_separable
            ]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=.4, random_state=42)

    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
               edgecolors='k')
    # and testing points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
               edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)

        # Plot also the training points
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
                   edgecolors='k')
        # and testing points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,
                   edgecolors='k', alpha=0.6)

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
                size=15, horizontalalignment='right')
        i += 1

plt.tight_layout()
plt.show()
  • 2. Hyper-parameter tuning for an SVM via grid search with cross-validation, on the digits dataset. (Ported from the sklearn example Parameter estimation using grid search with cross-validation.)

from __future__ import print_function

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

print(__doc__)

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier to this data, we need to flatten the images,
# turning the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

# Note the problem is too easy: the hyperparameter plateau is too flat and the
# output model is the same for precision and recall with ties in quality.

=======================================

Final Words

If you've read this post and the linked material carefully, congratulations: you can already handle the vast majority of classification tasks! In particular, merge code snippets 1 & 2 above and the rest of the work belongs to the computer; deploy it and wait for the results (a minimal sketch of such a merge follows below).
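
To make that concrete, here is a minimal sketch of merging the two ideas (my own illustration, not an official example): wrap classifiers from snippet 1 in the GridSearchCV of snippet 2, so tuning and comparison run in a single pass:

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# One of the synthetic datasets from snippet 1
X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# (name, estimator, parameter grid) triples; extend with any classifier
searches = [
    ("Nearest Neighbors", KNeighborsClassifier(), {'n_neighbors': [3, 5, 9]}),
    ("RBF SVM", SVC(), {'C': [0.1, 1, 10], 'gamma': [0.1, 1, 10]}),
]

for name, est, grid in searches:
    clf = GridSearchCV(est, grid, cv=5)
    clf.fit(X_train, y_train)
    print("%s: best params %r, test accuracy %.2f"
          % (name, clf.best_params_, clf.score(X_test, y_test)))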

Last of all: you'll forget this post soon after reading it anyway, so why not grab a recent dataset and see for yourself how these classic models hold up?
