Classification with scikit-learn's SVM (code walkthrough)

玥晓珖

Article link: https://blog.csdn.net/u010327061/article/details/84262536

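Among the many SciPy toolkits (scikits), the best known, and the one dedicated to machine learning, is Scikit-learn. The project was started by data scientist David Cournapeau in 2007. It depends on other packages such as NumPy and SciPy, and it is an open-source Python framework developed specifically for machine learning applications. Scikit-learn deliberately avoids expanding beyond machine learning and does not adopt algorithms that have not been widely validated.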

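Scikit-learn's core functionality falls into six areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.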

Only some of these are covered here.

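Classification means identifying the category a given object belongs to. It is a form of supervised learning, and its most common applications include spam detection and image recognition. The algorithms Scikit-learn currently implements include support vector machines (SVM), nearest neighbors, logistic regression, random forests, decision trees, and multi-layer perceptron (MLP) neural networks, among others.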
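Scikit-learn classifiers share the same fit/predict/score interface, so swapping one algorithm for another takes very little code. The following is a minimal sketch of that idea on the digits dataset used later in this article. The choice of the two classifiers, their parameters (`n_neighbors=3`, `random_state=0`), and the half-and-half split are assumptions made for illustration, not part of the original example.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target
half = len(X) // 2  # train on the first half, test on the second half

scores = {}
for name, model in [
    ("k-nearest neighbours", KNeighborsClassifier(n_neighbors=3)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
]:
    model.fit(X[:half], y[:half])                     # same call for every classifier
    scores[name] = model.score(X[half:], y[half:])    # mean accuracy on held-out data
    print(f"{name}: accuracy = {scores[name]:.3f}")
```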

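Note that because Scikit-learn does not support deep learning or GPU acceleration, its MLP implementation is not suited to large-scale problems. Readers with such needs can look at frameworks that also have good Python support, such as Keras and Theano.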

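Dimensionality reduction means using techniques such as principal component analysis (PCA), non-negative matrix factorization (NMF), or feature selection to reduce the number of random variables under consideration. Its main applications are visualization and improved efficiency.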
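As a rough illustration of dimensionality reduction for visualization, the sketch below projects the 64-dimensional digits data onto two principal components. The choice of `n_components=2` is an assumption made here so that the result could be drawn as a 2D scatter plot.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Project the 64 pixel features onto the 2 directions of largest variance
pca = PCA(n_components=2)
reduced = pca.fit_transform(digits.data)

print(reduced.shape)                  # (n_samples, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```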

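Model selection means comparing, validating, and choosing among given parameters and models. Its main purpose is to improve accuracy through parameter tuning. The modules Scikit-learn currently provides include grid search, cross-validation, and a range of metric functions for evaluating prediction error.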
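Grid search and cross-validation are typically combined through `GridSearchCV`. The sketch below tunes the `gamma` parameter of the SVM classifier used later in this article. The three candidate values and the 3-fold setting are assumptions picked to keep the run short, not recommended settings.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Candidate values to try; each one is scored with 3-fold cross-validation
param_grid = {"gamma": [1e-4, 1e-3, 1e-2]}
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data, digits.target)

print("best parameters:", search.best_params_)
print("best mean cross-validated accuracy:", search.best_score_)
```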

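Data preprocessing covers feature extraction and normalization, and it is the first and one of the most important steps in a machine learning workflow. Normalization here usually means standardization: transforming each input feature so that it has zero mean and unit variance. A separate option is min-max scaling, which rescales each feature to a fixed range, typically 0 to 1. Feature extraction means converting text or image data into numeric variables that can be used for machine learning.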
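Zero-mean/unit-variance scaling and scaling to the 0-1 range are two different transformations. The sketch below applies both to a tiny made-up array (the numbers are purely illustrative) using `StandardScaler` and `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: two features on very different scales (illustrative values only)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

standardized = StandardScaler().fit_transform(X)  # each column: mean 0, variance 1
rescaled = MinMaxScaler().fit_transform(X)        # each column: mapped onto [0, 1]

print(standardized.mean(axis=0), standardized.std(axis=0))
print(rescaled.min(axis=0), rescaled.max(axis=0))
```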

1. Installation:

A TensorFlow setup based on an Anaconda virtual environment was built earlier, with Python 3.6, NumPy, and SciPy installed.

Inside the virtual environment, run pip install -U scikit-learn
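One simple way to check that the installation succeeded is to import the package from the same environment and print its version. This is a minimal sketch; the version printed depends on what pip installed.

```python
import sklearn

# If the import succeeds, the package is installed in the active environment
version = sklearn.__version__
print("scikit-learn version:", version)
```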

2. Running the example code:

The examples are all available at https://scikit-learn.org/stable/auto_examples/index.html#general-examples

Take "Recognizing hand-written digits" as an example:

The script starts with a docstring, which print(__doc__) prints when the script runs:

"""
================================
Recognizing hand-written digits
================================

An example showing how the scikit-learn can be used to recognize images of
hand-written digits.

This example is commented in the
:ref:`tutorial section of the user manual <introduction>`.

"""
print(__doc__)

Import the required modules:

import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

Load the data:

digits = datasets.load_digits()

The dataset contains 1797 images in total, each 8×8 pixels, along with the corresponding labels.
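The structure of the dataset can be checked directly by printing the array shapes and one sample image. This is a small inspection sketch and not part of the original example script.

```python
from sklearn import datasets

digits = datasets.load_digits()

print(digits.images.shape)  # (1797, 8, 8): 1797 images of 8x8 pixels
print(digits.target.shape)  # (1797,): one label per image
print(digits.images[0])     # pixel intensities of the first image
print(digits.target[0])     # its label
```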

The images and labels are then zipped into a list, and the first four are displayed.
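The code for this step is not shown in the article. The sketch below follows the corresponding step of the scikit-learn example from that period: it pairs each image with its label and draws the first four in the top row of a 2×4 grid. It is a reconstruction and may not match the author's exact code. Call plt.show() afterwards to display the figure.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# Pair every image with its label, then draw the first four pairs
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)  # top row of a 2x4 grid
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)
```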

Then reshape the data:

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)  # 1797
data = digits.images.reshape((n_samples, -1))  # shape (1797, 64)
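The flattening step can be verified with a short check: each 8×8 image becomes a row of 64 values, and the result matches the ready-made `digits.data` array that the dataset already provides.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)

# -1 lets NumPy infer the second dimension: 8 * 8 = 64 features per image
data = digits.images.reshape((n_samples, -1))

print(data.shape)
print(np.array_equal(data, digits.data))  # the dataset ships the same flattened view
```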

Create the classifier and train it on the first half of the data:

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

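For a detailed explanation of the svm.SVC parameters, see https://blog.csdn.net/github_39261590/article/details/75009069. The gamma parameter is the kernel coefficient and only applies to the 'rbf', 'poly', and 'sigmoid' kernels. If gamma is 'auto', its value is the reciprocal of the number of features, 1/n_features. 'auto' was the default in older releases. Since scikit-learn 0.22 the default is 'scale', which uses 1/(n_features * X.var()).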
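To put the explicit gamma=0.001 in context, the sketch below computes by hand what the 'auto' and 'scale' settings would resolve to for the digits data. The formulas are the ones given in the scikit-learn documentation for SVC. The variable names are illustrative.

```python
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
n_features = X.shape[1]  # 64 pixel features per image

gamma_auto = 1.0 / n_features                # value used for gamma='auto'
gamma_scale = 1.0 / (n_features * X.var())   # value used for gamma='scale'

print("gamma='auto'  ->", gamma_auto)
print("gamma='scale' ->", gamma_scale)
```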

Next, predict on the second half of the data:

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])
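Before printing the full report, a single overall accuracy figure is often a useful sanity check. The sketch below repeats the training and prediction steps so that it runs on its own, then computes the accuracy with `metrics.accuracy_score`. The `accuracy` and `mismatches` names are introduced here for illustration.

```python
from sklearn import datasets, metrics, svm

digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Train on the first half, predict the second half (as in the steps above)
classifier = svm.SVC(gamma=0.001)
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

accuracy = metrics.accuracy_score(expected, predicted)  # fraction predicted correctly
mismatches = int((expected != predicted).sum())         # number of wrong predictions
print("accuracy:", accuracy, "wrong predictions:", mismatches)
```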

Print the metrics and plot the predictions:

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()

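Output: the classification report and confusion matrix are printed, and the figure titles each of the first four test images with its predicted label.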
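In the confusion matrix, rows correspond to true digits and columns to predicted digits, so the diagonal counts the correct predictions. As a sketch of how to read it, the code below recomputes the per-class recall from the matrix and compares it with `metrics.recall_score`. It repeats the training steps so that it runs on its own.

```python
import numpy as np
from sklearn import datasets, metrics, svm

digits = datasets.load_digits()
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

classifier = svm.SVC(gamma=0.001)
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

cm = metrics.confusion_matrix(expected, predicted)

# Recall per digit: correct predictions (diagonal) / number of true samples (row sum)
recall_per_class = cm.diagonal() / cm.sum(axis=1)
for digit, recall in enumerate(recall_per_class):
    print("digit %d: recall %.2f" % (digit, recall))
```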