[Python] Machine Learning Notes 05: Support Vector Machine (SVM)

RM -RF /星

Category: Data Science and Artificial Intelligence. Tags: machine learning, python, data mining

Article link: https://blog.csdn.net/weixin_41429999/article/details/108047718

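Reference for this article: Python Data Science Handbook.
The source code for this article has been uploaded to Gitee.

Packages used in this article: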

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs, make_circles, fetch_lfw_people
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

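sns.set()
# SimHei is a CJK font; it is only needed when plot text contains Chinese characters
plt.rc('font', family='SimHei')
# Keep the minus sign rendering correctly while a CJK font is active
plt.rc('axes', unicode_minus=False)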

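Support Vector Machine

A support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression.
In sklearn, SVM classification is implemented by the SVC class and SVM regression by the SVR class.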
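To make the two classes concrete, here is a minimal sketch that fits an SVC and an SVR. The synthetic datasets (two blobs for classification, a noisy sine curve for regression) and the hyperparameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, SVR

# Classification: two Gaussian blobs (illustrative settings).
X_cls, y_cls = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=0)
clf = SVC(kernel='linear').fit(X_cls, y_cls)
cls_accuracy = clf.score(X_cls, y_cls)  # mean accuracy on the training data

# Regression: a noisy sine curve (synthetic data, chosen only for illustration).
rng = np.random.default_rng(0)
X_reg = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y_reg = np.sin(X_reg).ravel() + rng.normal(scale=0.1, size=80)
reg = SVR(kernel='rbf', C=10).fit(X_reg, y_reg)
reg_r2 = reg.score(X_reg, y_reg)  # coefficient of determination R^2

print(f'SVC training accuracy: {cls_accuracy:.3f}')
print(f'SVR training R^2: {reg_r2:.3f}')
```

Both estimators share the same fit / predict / score interface; only the meaning of the score differs (accuracy for SVC, R^2 for SVR).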

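SVM margin maximization

[Figure: two classes of points with several candidate separating lines and their margins]

Take the figure above as an example: if we want a straight line that separates the two classes, there are infinitely many choices. Among all of these candidates, a support vector machine picks the one with the largest margin, as the figure shows.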
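As a rough numerical illustration of "largest margin": for a linear SVM with decision function $f(x) = w \cdot x + b$, the margin boundaries are the lines $f(x) = \pm 1$, so the margin is $2 / \lVert w \rVert$ wide. The sketch below assumes a small, linearly separable toy dataset (the `make_blobs` settings are chosen for illustration) and uses a very large `C` to approximate a hard margin.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative, linearly separable toy data (these settings are an assumption).
X, y = make_blobs(n_samples=50, n_features=2, centers=2, random_state=0, cluster_std=0.6)

# A very large C approximates a hard-margin SVM.
model = SVC(kernel='linear', C=1e10).fit(X, y)

w = model.coef_[0]        # normal vector of the separating line
b = model.intercept_[0]   # offset of the separating line
margin_width = 2 / np.linalg.norm(w)

# Distance of every training point to the separating line w.x + b = 0.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

print(f'margin width: {margin_width:.3f}')
print(f'distance of the closest point: {distances.min():.3f}')  # about half the margin width
```

The closest training points sit at roughly half the margin width from the line, which is exactly the gap the SVM tries to make as large as possible.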

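Support vectors in SVM

Let's use a support vector machine for a simple classification example and plot the model's decision boundary; the result is shown in the figure below. Some training points lie exactly on the margin boundaries (the dashed lines). Their positions are stored in the model's support_vectors_ attribute. These points are called support vectors and are the key to fitting the model. This also means that a support vector machine is insensitive to data far from the boundary.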

x, y = make_blobs(
    n_samples=100,
    n_features=2,
    centers=2,
    random_state=2333,
    cluster_std=0.8,
)

plt.figure(figsize=(10, 10))
plt.scatter(x=x[:, 0], y=x[:, 1], c=y, cmap=plt.get_cmap('autumn', lut=2), edgecolors='black')

model = SVC(kernel='linear', C=1e10)
model.fit(x, y)

x_fit = np.linspace(-8, 3, 200)
y_fit = np.linspace(6, 11, 200)
x_fit, y_fit = np.meshgrid(x_fit, y_fit)
xy = np.vstack((x_fit.ravel(), y_fit.ravel())).T
res = model.decision_function(xy).reshape(x_fit.shape)
plt.contour(x_fit, y_fit, res, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

plt.scatter(
    model.support_vectors_[:, 0],
    model.support_vectors_[:, 1],
    s=200, linewidths=2, edgecolors='k', facecolors='none', label='support vectors',
)
plt.legend()
plt.title('Classification with SVC: decision boundary and support vectors')

Result:

[Figure: linear SVC decision boundary, margin lines, and circled support vectors]
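The claim that points far from the boundary do not matter can be checked directly. The sketch below, on an assumed separable toy dataset, refits the model using only the support vectors and compares it with the model trained on all points. With a near-hard margin the two should agree up to solver tolerance.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Illustrative, linearly separable toy data (these settings are an assumption).
X, y = make_blobs(n_samples=50, n_features=2, centers=2, random_state=0, cluster_std=0.6)

full_model = SVC(kernel='linear', C=1e10).fit(X, y)

# Keep only the support vectors and discard every other training point.
sv_idx = full_model.support_  # indices of the support vectors in X
reduced_model = SVC(kernel='linear', C=1e10).fit(X[sv_idx], y[sv_idx])

print('support vectors used:', len(sv_idx), 'of', len(X))
print('full model    w, b:', full_model.coef_[0], full_model.intercept_[0])
print('reduced model w, b:', reduced_model.coef_[0], reduced_model.intercept_[0])

same_predictions = np.array_equal(full_model.predict(X), reduced_model.predict(X))
print('identical predictions on all points:', same_predictions)
```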

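SVM kernel functions

The example above uses linearly separable data. To fit data that is not linearly separable, specify a different kernel (the kernel parameter) when initializing the SVC class. Below is an example of using SVC on such data: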

x, y = make_circles(
    n_samples=100,
    noise=0.1,
    factor=0.2,
    random_state=233,
)

plt.figure(figsize=(10, 10))
plt.scatter(x=x[:, 0], y=x[:, 1], c=y, edgecolors='k', cmap=plt.get_cmap('autumn', lut=2))

model = SVC(kernel='rbf', C=1e10)
model.fit(x, y)

x_fit = np.linspace(-1.2, 1.2, 200)
y_fit = np.linspace(-1.2, 1.2, 100)
x_fit, y_fit = np.meshgrid(x_fit, y_fit)
xy = np.vstack((x_fit.ravel(), y_fit.ravel())).T
res = model.decision_function(xy).reshape(x_fit.shape)

plt.contour(x_fit, y_fit, res, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
plt.scatter(
    x=model.support_vectors_[:, 0],
    y=model.support_vectors_[:, 1],
    s=200, linewidths=2, edgecolors='k', facecolors='none', label='support vectors',
)
plt.legend()
plt.title('Fitting non-linear data with SVC')

Result:

[Figure: RBF-kernel SVC decision boundary on concentric-circle data, with circled support vectors]
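To see why the kernel matters, the sketch below compares a linear and an RBF kernel on the same concentric-circle data, and also tries a hand-made alternative: adding the squared radius as a third feature so that a linear SVM has a chance to separate the classes with a plane. The default `C` and `gamma` values and the lifted feature are assumptions made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, noise=0.1, factor=0.2, random_state=233)

# A straight line cannot separate an inner cluster from the ring around it.
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)

# The RBF kernel handles the same data without manual feature engineering.
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)

# Hand-made alternative: lift the data with the squared radius x1^2 + x2^2,
# then fit a plain linear SVM in the resulting 3-D feature space.
X_lifted = np.column_stack([X, (X ** 2).sum(axis=1)])
lifted_acc = SVC(kernel='linear').fit(X_lifted, y).score(X_lifted, y)

print(f'linear kernel, raw features:    {linear_acc:.2f}')
print(f'RBF kernel, raw features:       {rbf_acc:.2f}')
print(f'linear kernel, lifted features: {lifted_acc:.2f}')
```

A kernel plays the role of such a lifting implicitly, without the transformed features ever being built explicitly.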

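Softening the margin

Sometimes the data has no clear boundary and the classes overlap. In that case we need to allow some points to cross into the margin. SVC's C parameter controls this (its meaning has not been explained until now): the larger C is, the less the model tolerates points crossing the margin; the smaller C is, the more crossings it allows.
An example: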

x, y = make_blobs(
    n_samples=100,
    n_features=2,
    centers=2,
    random_state=233,
    cluster_std=2,
)

fig, axs = plt.subplots(1, 2, figsize=(20, 10))
for ax in axs:
    ax.scatter(
        x=x[:, 0],
        y=x[:, 1],
        c=y,
        edgecolors='k',
        cmap=plt.get_cmap('autumn', lut=2),
    )

model = [SVC(kernel='linear', C=1e6), SVC(kernel='linear', C=1)]
for i in range(2):
    m = model[i]  # type: SVC
    ax = axs[i]  # type: plt.Axes
    m.fit(x, y)

    x_fit = np.linspace(-13, 8, 200)
    y_fit = np.linspace(1, 14, 100)
    x_fit, y_fit = np.meshgrid(x_fit, y_fit)
    xy = np.vstack((x_fit.ravel(), y_fit.ravel())).T
    res = m.decision_function(xy).reshape(x_fit.shape)
    ax.contour(x_fit, y_fit, res, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

    ax.scatter(
        x=m.support_vectors_[:, 0],
        y=m.support_vectors_[:, 1],
        s=200,
        linewidths=2,
        edgecolors='k',
        facecolors='none',
        label='support vectors',
    )
    ax.set_title(f'C = {m.C}')
    ax.legend()

Result:

[Figure: two panels comparing the linear SVC margin for C=1e6 and C=1, with circled support vectors]
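The effect of C can also be read off as numbers rather than from a plot. The sketch below, on an assumed overlapping toy dataset, fits a linear SVC for several values of C and records the margin width $2 / \lVert w \rVert$ and the number of support vectors. The dataset settings and the list of C values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping toy data (the settings are an assumption made for illustration).
X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=0, cluster_std=1.2)

results = {}
for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    margin_width = 2 / np.linalg.norm(model.coef_[0])
    n_sv = len(model.support_vectors_)
    results[C] = (margin_width, n_sv)
    print(f'C={C:<6} margin width={margin_width:.3f}  support vectors={n_sv}')
```

A small C typically yields a wide margin with many support vectors, while a large C narrows the margin and keeps only the points closest to the boundary.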

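Face recognition with a support vector machine

Here we use the face dataset provided by sklearn.
Each face image is $62 \times 47$ pixels (2,914 pixels in total). Using every pixel as a feature is impractical, so we use principal component analysis to extract 150 features (PCA is covered in a later note).
To find the best parameters for the model, we use grid search.
Finally, we use a classification report and a confusion matrix to check how well the model actually performs.

The data is loaded with the fetch_lfw_people function, the classification report is generated with the classification_report function, and the confusion matrix is computed with confusion_matrix.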
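The same PCA + SVC + grid search workflow can be tried on a smaller problem first. The sketch below swaps in sklearn's bundled digits dataset (an assumption made here so that nothing has to be downloaded) and uses a reduced grid; the component count and parameter values are illustrative, not tuned.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in dataset: 8x8 digit images, i.e. 64 pixel features per sample.
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=42)

# make_pipeline names each step after its lowercased class name ('pca', 'svc'),
# so SVC parameters are addressed in the grid as 'svc__<parameter>'.
model = make_pipeline(PCA(n_components=20, whiten=True, random_state=42),
                      SVC(kernel='rbf', class_weight='balanced'))
param_grid = {'svc__C': [1, 10], 'svc__gamma': [0.01, 0.1]}

grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(x_train, y_train)

test_accuracy = grid.best_estimator_.score(x_test, y_test)
print('best parameters:', grid.best_params_)
print(f'best cross-validation accuracy: {grid.best_score_:.3f}')
print(f'held-out test accuracy: {test_accuracy:.3f}')
```

Because PCA sits inside the pipeline, it is refitted on the training folds of every cross-validation split, so the validation folds never leak into the dimensionality reduction.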


faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.images[0].shape)
print(faces.target_names)

fig, axs = plt.subplots(3, 5, figsize=(9, 10))
for i, ax in enumerate(axs.flatten()):  # show a few sample images
    ax.imshow(faces.images[i], cmap='bone')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(faces.target_names[faces.target[i]])

pca = PCA(n_components=150, whiten=True, random_state=42, svd_solver='randomized')  # randomized PCA
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
grid = GridSearchCV(model, dict(
    svc__C=[1, 5, 20, 50],
    svc__gamma=[0.0001, 0.0005, 0.001, 0.005],
))

x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)
grid.fit(x_train, y_train)


best_model = grid.best_estimator_
res = best_model.predict(x_test)
print(f'Best cross-validation score: {grid.best_score_:.3f}')
print('Best parameters:')
for name, value in grid.best_params_.items():
    # :g keeps small values such as gamma=0.0001 readable (:0.3f would print 0.000)
    print(f'{name:<10} : {value:g}')

report = classification_report(y_test, res, target_names=faces.target_names)
print(f'Classification report:\n{report}')

mat = confusion_matrix(y_test, res)
plt.figure(figsize=(10, 10))
sns.heatmap(
    data=mat.T,
    square=True,
    annot=True,
    fmt='d',
    cbar=False,
    xticklabels=faces.target_names,
    yticklabels=faces.target_names,
    cmap='Greens',
)

Result:
[Figure: sample face images from the dataset]

[Figure: confusion matrix heatmap]

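Summary of support vector machines

The model depends on relatively few support vectors, so it uses little memory.
Once training is complete, prediction is fast.
Because the model is affected only by points near the margin, it works well with high-dimensional data.
In the worst case, training cost scales as $O(N^3)$ in the number of samples $N$.
The results depend strongly on the softening parameter C.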
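The first two points can be made concrete: a fitted SVC only needs its support vectors, their dual coefficients, and the intercept to score new points. The sketch below rebuilds the decision function of a binary RBF-kernel SVC by hand from those stored attributes. The value of `gamma` is fixed explicitly (an illustrative choice) so that the kernel can be recomputed outside the model.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, noise=0.1, factor=0.2, random_state=233)

gamma = 0.5  # fixed explicitly so the kernel can be recomputed by hand
model = SVC(kernel='rbf', gamma=gamma, C=1.0).fit(X, y)

# Decision function rebuilt from the stored support vectors only:
# f(x) = sum_i dual_coef_i * K(sv_i, x) + intercept
K = rbf_kernel(X, model.support_vectors_, gamma=gamma)
manual_scores = K @ model.dual_coef_[0] + model.intercept_[0]

matches = np.allclose(manual_scores, model.decision_function(X))
print('support vectors stored:', len(model.support_vectors_), 'of', len(X))
print('manual scores match decision_function:', matches)
```

The cost of scoring one point therefore grows with the number of support vectors, not with the size of the training set.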

Complete code (Jupyter Notebook)

#%%

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs, make_circles, fetch_lfw_people
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

sns.set()
plt.rc('font', family='SimHei')
plt.rc('axes', unicode_minus=False)

#%% md

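# Support vector machine

A support vector machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression.

## Margin maximization

As the figure below shows, there are many ways to draw a straight line that separates the two kinds of points. Among all feasible lines, a support vector machine looks for the one with the largest "margin".
Here the margin is the sum of the distances from the line to the nearest point of each class, as shown in the figure.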

#%%

x, y = make_blobs(
    n_samples=50,
    n_features=2,
    centers=2,
    random_state=0,
    cluster_std=0.6,
)

plt.figure(figsize=(10, 10))
plt.scatter(x=x[:, 0], y=x[:, 1], c=y, cmap=plt.get_cmap('autumn', lut=2), edgecolors='black')

x_fit = np.linspace(-1, 3.5, 100)
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    y_fit = m * x_fit + b
    plt.plot(x_fit, y_fit, '-k')
    plt.fill_between(x_fit, y_fit - d, y_fit + d, edgecolor=None, color='gray', alpha=0.5)
plt.title('What margin maximization means')

#%% md

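### Support vectors

Below, an SVC is fitted to a dataset and the model's decision boundary is plotted. Some data points lie exactly on the margin boundaries (the dashed lines). These points are called **support vectors** and are the key to fitting the model.
In other words, an SVC model is insensitive to data far from the boundary.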

#%%

x, y = make_blobs(
    n_samples=100,
    n_features=2,
    centers=2,
    random_state=2333,
    cluster_std=0.8,
)

plt.figure(figsize=(10, 10))
plt.scatter(x=x[:, 0], y=x[:, 1], c=y, cmap=plt.get_cmap('autumn', lut=2), edgecolors='black')

model = SVC(kernel='linear', C=1e10)
model.fit(x, y)

x_fit = np.linspace(-8, 3, 200)
y_fit = np.linspace(6, 11, 200)
x_fit, y_fit = np.meshgrid(x_fit, y_fit)
xy = np.vstack((x_fit.ravel(), y_fit.ravel())).T
res = model.decision_function(xy).reshape(x_fit.shape)
plt.contour(x_fit, y_fit, res, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

plt.scatter(
    model.support_vectors_[:, 0],
    model.support_vectors_[:, 1],
    s=200, linewidths=2, edgecolors='k', facecolors='none', label='support vectors',
)
plt.legend()
plt.title('Classification with SVC: decision boundary and support vectors')

#%% md

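### SVM and kernel functions

The example above fits linearly separable data. To fit data that is not linearly separable, the data has to be transformed with a different kernel function. In sklearn this is done with the kernel parameter of the SVC class.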

#%%

x, y = make_circles(
    n_samples=100,
    noise=0.1,
    factor=0.2,
    random_state=233,
)

plt.figure(figsize=(10, 10))
plt.scatter(x=x[:, 0], y=x[:, 1], c=y, edgecolors='k', cmap=plt.get_cmap('autumn', lut=2))

model = SVC(kernel='rbf', C=1e10)
model.fit(x, y)

x_fit = np.linspace(-1.2, 1.2, 200)
y_fit = np.linspace(-1.2, 1.2, 100)
x_fit, y_fit = np.meshgrid(x_fit, y_fit)
xy = np.vstack((x_fit.ravel(), y_fit.ravel())).T
res = model.decision_function(xy).reshape(x_fit.shape)

plt.contour(x_fit, y_fit, res, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])
plt.scatter(
    x=model.support_vectors_[:, 0],
    y=model.support_vectors_[:, 1],
    s=200, linewidths=2, edgecolors='k', facecolors='none', label='support vectors',
)
plt.legend()
plt.title('Fitting non-linear data with SVC')

#%% md

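### Softening the margin

Sometimes the data has no clear boundary and the classes overlap. In that case we need to allow some points to cross into the margin, which is what SVC's C parameter controls.
The larger C is, the less the model tolerates points crossing the margin; the smaller C is, the more crossings it allows. This is known as softening the margin.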

#%%

x, y = make_blobs(
    n_samples=100,
    n_features=2,
    centers=2,
    random_state=233,
    cluster_std=2,
)

fig, axs = plt.subplots(1, 2, figsize=(20, 10))
for ax in axs:
    ax.scatter(
        x=x[:, 0],
        y=x[:, 1],
        c=y,
        edgecolors='k',
        cmap=plt.get_cmap('autumn', lut=2),
    )

model = [SVC(kernel='linear', C=1e6), SVC(kernel='linear', C=1)]
for i in range(2):
    m = model[i]  # type: SVC
    ax = axs[i]  # type: plt.Axes
    m.fit(x, y)

    x_fit = np.linspace(-13, 8, 200)
    y_fit = np.linspace(1, 14, 100)
    x_fit, y_fit = np.meshgrid(x_fit, y_fit)
    xy = np.vstack((x_fit.ravel(), y_fit.ravel())).T
    res = m.decision_function(xy).reshape(x_fit.shape)
    ax.contour(x_fit, y_fit, res, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

    ax.scatter(
        x=m.support_vectors_[:, 0],
        y=m.support_vectors_[:, 1],
        s=200,
        linewidths=2,
        edgecolors='k',
        facecolors='none',
        label='support vectors',
    )
    ax.set_title(f'C = {m.C}')
    ax.legend()

#%% md

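## Face recognition with a support vector machine

We use the face dataset provided by sklearn.
Each face image is $62 \times 47$ pixels (2,914 pixels in total). Using every pixel as a feature is impractical, so we use principal component analysis to extract 150 features.
To find the best parameters for the model, we use grid search.
Finally, we use a classification report and a confusion matrix to check how well the model actually performs.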

#%%

faces = fetch_lfw_people(min_faces_per_person=60)
print(faces.images[0].shape)
print(faces.target_names)

fig, axs = plt.subplots(3, 5, figsize=(9, 10))
for i, ax in enumerate(axs.flatten()):
    ax.imshow(faces.images[i], cmap='bone')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel(faces.target_names[faces.target[i]])

pca = PCA(n_components=150, whiten=True, random_state=42, svd_solver='randomized')
svc = SVC(kernel='rbf', class_weight='balanced')
model = make_pipeline(pca, svc)
grid = GridSearchCV(model, dict(
    svc__C=[1, 5, 20, 50],
    svc__gamma=[0.0001, 0.0005, 0.001, 0.005],
))

x_train, x_test, y_train, y_test = train_test_split(faces.data, faces.target, random_state=42)
grid.fit(x_train, y_train)

#%%

best_model = grid.best_estimator_
res = best_model.predict(x_test)
print(f'Best cross-validation score: {grid.best_score_:.3f}')
print('Best parameters:')
for name, value in grid.best_params_.items():
    # :g keeps small values such as gamma=0.0001 readable (:0.3f would print 0.000)
    print(f'{name:<10} : {value:g}')

report = classification_report(y_test, res, target_names=faces.target_names)
print(f'Classification report:\n{report}')

mat = confusion_matrix(y_test, res)
plt.figure(figsize=(10, 10))
sns.heatmap(
    data=mat.T,
    square=True,
    annot=True,
    fmt='d',
    cbar=False,
    xticklabels=faces.target_names,
    yticklabels=faces.target_names,
    cmap='Greens',
)

#%% md

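## Summary of support vector machines

-   The model depends on relatively few support vectors, so it uses little memory.
-   Once training is complete, prediction is fast.
-   Because the model is affected only by points near the margin, it works well with high-dimensional data.
-   In the worst case, training cost scales as $O(N^3)$ in the number of samples $N$.
-   The results depend strongly on the softening parameter C.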
