SVM (Support Vector Machine)

Latest recommended article published on 2023-11-18 09:00:00

GSmate

Article link: https://blog.csdn.net/qq_44318499/article/details/104844435

Support Vector Machine

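Learning models
Supervised learning: the data must be labeled with classes in advance, so the machine knows which class each sample belongs to.

Unsupervised learning: the data carries no class labels, either because prior knowledge is lacking or because labeling is too expensive. The machine then has to do part of that work for us, for example by clustering the data so that people can later analyze each cluster.

SVM is a supervised learning model. It can be used for pattern recognition, classification, and regression analysis.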
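
The difference between the two settings shows up directly in how a model is fitted. The sketch below is only an illustration on made-up points: it uses scikit-learn's `SVC` as the supervised model and `KMeans` as a stand-in for "clustering" (the article does not name a clustering algorithm, so that choice is an assumption).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Two well-separated groups of 2-D points (made-up toy data).
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
              [5.0, 5.0], [5.5, 5.2], [5.2, 5.6]])
y = np.array([0, 0, 0, 1, 1, 1])  # class labels, only used in the supervised case

# Supervised: the model is fitted on the features AND the labels.
svc = SVC(kernel='linear').fit(X, y)
supervised_pred = svc.predict([[0.3, 0.3], [5.3, 5.3]])

# Unsupervised: the model sees the features only and groups them into clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

print('SVC predictions   :', supervised_pred)
print('KMeans cluster ids:', cluster_ids)
```

The cluster ids are arbitrary numbers: they group the points, but a person still has to decide what each group means.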

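How SVM works
Example: balls of two colors lie mixed together on a table and we want to separate them. If we slap the table hard, the balls jump into the air, and at that instant there is a horizontal cutting plane that separates the two colors.

Reason: in the two-dimensional plane no straight line separates the two colors, but in three-dimensional space we can find a plane that does. That plane is called a hyperplane.

SVM computation: the process of finding such a hyperplane. The resulting hyperplane is the SVM classifier.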
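
The "slap the table" idea can be imitated numerically. The following is a minimal sketch under assumptions of my own: the two colors are laid out as a small and a large ring, and the height each ball reaches is taken to be `x1^2 + x2^2`. Neither the layout nor that height function comes from the article.

```python
import numpy as np

# Hypothetical layout: one colour forms a small ring, the other a large ring.
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
ring = np.column_stack((np.cos(angles), np.sin(angles)))
inner = 0.5 * ring   # radius 0.5
outer = 2.0 * ring   # radius 2.0

def lift(points):
    """Append a third coordinate z = x1^2 + x2^2 (the 'height' of each ball)."""
    z = np.sum(points ** 2, axis=1, keepdims=True)
    return np.hstack((points, z))

inner_3d, outer_3d = lift(inner), lift(outer)

# In 3-D, the horizontal plane z = 1 has every inner ball below it
# and every outer ball above it.
plane_height = 1.0
print('inner below plane:', np.all(inner_3d[:, 2] < plane_height))
print('outer above plane:', np.all(outer_3d[:, 2] > plane_height))
```

No straight line in the original two coordinates separates the rings, yet a flat plane does once the third coordinate is added.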

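Classification margin
In the example we find a decision surface C that separates the two colors. Keeping the orientation of C fixed, and without introducing any classification error, we can shift it to two limit positions: decision surface A and decision surface B. The surface C midway between them is the best decision surface for that orientation, and the distance from a limit position to C is the classification margin.

If we rotate the decision surface, we find many such surfaces that all separate the data set correctly, but their margins generally differ. The decision surface with the largest margin is the optimal solution that SVM looks for.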
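
To make "different decision surfaces have different margins" concrete, here is a small sketch on made-up 2-D data. Each candidate line is written as `w.x + b = 0`; the helper name `geometric_margin` and the two candidate lines are illustrative choices, not something from the article.

```python
import numpy as np

# Made-up, linearly separable toy data with labels -1 / +1.
X = np.array([[0, 0], [1, 0], [0, 1], [-1, -1],
              [3, 3], [4, 3], [3, 4], [5, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

def geometric_margin(w, b, X, y):
    """Smallest signed distance from any sample to the line w.x + b = 0.

    A negative result means the line misclassifies at least one sample.
    """
    w = np.asarray(w, dtype=float)
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Two candidate decision lines that both separate the data without errors.
vertical = geometric_margin([1.0, 0.0], -2.0, X, y)   # the line x1 = 2
diagonal = geometric_margin([1.0, 1.0], -3.5, X, y)   # the line x1 + x2 = 3.5

print('margin of x1 = 2       :', vertical)
print('margin of x1 + x2 = 3.5:', diagonal)
```

Both lines classify every point correctly, but the diagonal one keeps the nearest points farther away, so it is the better of these two candidates.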

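Point-to-hyperplane distance formula
In a one-dimensional space the decision surface is a point, in two dimensions a straight line, and in three dimensions a plane. In higher dimensions the decision surface is called a "hyperplane": the set of points where the linear function $g(x) = w^T x + b$ equals zero, with $w, x \in \mathbb{R}^n$.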

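$w$ and $x$ are vectors in the n-dimensional space: $x$ is the variable and $w$ is the normal vector. The normal vector is perpendicular to the hyperplane and determines its orientation.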
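
The section title refers to the distance from a point to the hyperplane. For reference, the standard formula in the notation above is given below, where $x_0$ (a name chosen here) is an arbitrary point:

```latex
d(x_0) = \frac{|g(x_0)|}{\|w\|} = \frac{|w^T x_0 + b|}{\|w\|},
\qquad \|w\| = \sqrt{w^T w}
```

If $w$ and $b$ are rescaled so that the samples closest to the hyperplane satisfy $|g(x)| = 1$, those samples lie at distance $1/\|w\|$, so making the margin large amounts to making $\|w\|$ small.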

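SVM finds a hyperplane that separates the classes while maximizing the smallest distance from any sample to that hyperplane (the classification margin). The support vectors are the samples closest to the separating hyperplane. Once the support vectors are fixed, the hyperplane is fixed as well, so the support vectors determine the size of the margin. Samples lying beyond the margin have no influence on the classifier.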
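
The claim that only the support vectors matter can be checked on a toy problem. This is a sketch, not a proof: the data are made up, and `C=1000` is an assumed value that is large enough to behave like a hard margin on these points.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data with labels -1 / +1.
X = np.array([[0, 0], [1, 0], [0, 1], [-1, -1],
              [3, 3], [4, 3], [3, 4], [5, 5]], dtype=float)
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# A large C approximates a hard margin (assumed value for this data).
clf = SVC(kernel='linear', C=1000.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Distance of every sample to the hyperplane: |w.x + b| / ||w||.
distances = np.abs(X @ w + b) / np.linalg.norm(w)
print('support vectors:\n', clf.support_vectors_)
print('smallest distance (the margin):', distances.min())

# Refit on the support vectors alone: the hyperplane should stay almost the same.
sv = clf.support_
clf_sv = SVC(kernel='linear', C=1000.0).fit(X[sv], y[sv])
print('w, b from all samples    :', w, b)
print('w, b from support vectors:', clf_sv.coef_[0], clf_sv.intercept_[0])
```

Dropping the five samples that lie beyond the margin leaves `w` and `b` unchanged up to solver tolerance.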

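Hard margin, soft margin, and nonlinear SVM
Linearly separable data: the model is called a hard-margin support vector machine
Hard margin: every sample is classified correctly; no classification errors occur
Soft margin: a certain number of samples are allowed to be misclassified
Nonlinear data: the model is called a nonlinear support vector machine
Nonlinear data set: balls of two colors arranged as a small ring inside a large ring.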
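
For the hard versus soft margin distinction, scikit-learn exposes the trade-off through the penalty parameter `C` of `SVC`. The sketch below uses made-up data with one deliberately mislabeled point, so a hard margin is impossible; the two `C` values are arbitrary picks for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: two clusters plus one sample labeled 1 inside the class-0
# cluster, so no straight line can classify every sample correctly.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1],
              [4, 4], [5, 4], [4, 5], [5, 5],
              [0.5, 0.5]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Distance between the two margin boundaries g(x) = -1 and g(x) = +1.
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    results[C] = (width, clf.score(X, y))
    print('C=%-6g margin width=%.3f training accuracy=%.3f'
          % (C, width, results[C][1]))
```

A small `C` tolerates more margin violations and yields a wider margin; a large `C` penalizes violations heavily and narrows it. Either way the model still trains, even though at least one sample stays misclassified.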

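For a nonlinear data set like this, no classifier can cope as long as its mapping function is linear, however sophisticated it is. A linear SVM cannot handle it either.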

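Kernel function: maps the samples from the original space into a higher-dimensional feature space in which they become linearly separable. We can then reuse the earlier derivation, with every step carried out in the new space instead of the original one.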
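
The effect of a kernel can be seen on ring-shaped data like the example above. This sketch assumes scikit-learn's synthetic `make_circles` generator as a stand-in for the two colors of balls, and compares a linear kernel with an RBF kernel at their default settings.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic concentric rings standing in for the two colours of balls.
X, y = make_circles(n_samples=200, factor=0.3, random_state=0)

linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)

print('linear kernel training accuracy:', linear_acc)
print('RBF kernel training accuracy   :', rbf_acc)
```

The linear kernel cannot do much better than guessing on this layout, while the RBF kernel separates the rings.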

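In a nonlinear SVM, the choice of kernel function is the factor that affects the model most. Commonly used kernels are:

Linear kernel
Polynomial kernel
Gaussian (RBF) kernel
Laplacian kernel
Sigmoid kernel
These kernels differ in how they map the data. Through a kernel function, the sample space is projected into a new, higher-dimensional space.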
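
The article lists the kernels without their formulas. As a reference sketch, the functions below follow the parameterization used by scikit-learn (`gamma`, `coef0`, `degree`); the function names and default values here are illustrative. Note that the Laplacian kernel is written with the L1 (Manhattan) distance, as scikit-learn does, whereas some textbooks use the Euclidean norm.

```python
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = x . z
    return x @ z

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    # K(x, z) = (gamma * x . z + coef0) ** degree
    return (gamma * (x @ z) + coef0) ** degree

def gaussian_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2), also called the RBF kernel
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||_1)
    return np.exp(-gamma * np.sum(np.abs(x - z)))

def sigmoid_kernel(x, z, gamma=1.0, coef0=1.0):
    # K(x, z) = tanh(gamma * x . z + coef0)
    return np.tanh(gamma * (x @ z) + coef0)

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
print('linear    :', linear_kernel(x, z))
print('polynomial:', polynomial_kernel(x, z, degree=2))
print('gaussian  :', gaussian_kernel(x, z, gamma=0.5))
print('laplacian :', laplacian_kernel(x, z, gamma=0.5))
print('sigmoid   :', sigmoid_kernel(x, z, gamma=0.1, coef0=0.0))
```

Each function returns the inner product of the two samples in the implicit feature space, so the mapping itself never has to be computed.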

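Result: soft margins and kernel functions were both introduced to make it practical to solve for w and b in the hyperplane formula and obtain the hyperplane with the largest classification margin.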
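
For reference, the problem that "solving for w and b" refers to is usually written as the soft-margin optimization below. The notation is the textbook one and is an assumption with respect to this article: labels $y_i \in \{-1, +1\}$, slack variables $\xi_i$, penalty parameter $C$, and feature map $\phi$ induced by the kernel.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i
\quad \text{s.t.}\quad
y_i\,\bigl(w^T \phi(x_i) + b\bigr) \ge 1 - \xi_i,\quad
\xi_i \ge 0,\quad i = 1,\dots,m
```

Setting every $\xi_i = 0$ gives the hard-margin case, and $\phi(x) = x$ gives the linear case.
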
SVM

SVM_introduction:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 'sepal length', 'sepal width', 'petal length', 'petal width'
iris_feature = 'sepal length', 'sepal width', 'petal length', 'petal width'

if __name__ == "__main__":
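    path = 'iris.data'  # path to the data file
    data = pd.read_csv(path, header=None)
    x, y = data[list(range(4))], data[4]  # four feature columns, one label column
    y = pd.Categorical(y).codes  # encode the class names as integer codes
    x = x[[0, 1]]  # keep only the first two features (sepal length and width)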
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)

    # Classifier
    clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
    # clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
    clf.fit(x_train, y_train.ravel())

    # Accuracy (clf.score and accuracy_score report the same value)
    print(clf.score(x_train, y_train))
    print('Training accuracy:', accuracy_score(y_train, clf.predict(x_train)))
    print(clf.score(x_test, y_test))
    print('Test accuracy:', accuracy_score(y_test, clf.predict(x_test)))

    # decision_function
    print('decision_function:\n', clf.decision_function(x_train))
    print('\npredict:\n', clf.predict(x_train))

    # Plot
    x1_min, x2_min = x.min()
    x1_max, x2_max = x.max()
    x1, x2 = np.mgrid[x1_min:x1_max:500j, x2_min:x2_max:500j]  # grid of sample points
    grid_test = np.stack((x1.flat, x2.flat), axis=1)  # points to classify
    # print('grid_test =\n', grid_test)
    # Z = clf.decision_function(grid_test)    # distance of each sample to the decision surface
    # print(Z)
    grid_hat = clf.predict(grid_test)  # predicted class of every grid point
    grid_hat = grid_hat.reshape(x1.shape)  # reshape to match the grid
    mpl.rcParams['font.sans-serif'] = [u'SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
    cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
    plt.figure(facecolor='w')
    plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)
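    plt.scatter(x[0], x[1], c=y, edgecolors='k', s=50, cmap=cm_dark)  # all samples
    # Hollow markers need an explicit edge colour, otherwise they are invisible
    plt.scatter(x_test[0], x_test[1], s=120, facecolors='none', edgecolors='k', zorder=10)  # circle the test samples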
    plt.xlabel(iris_feature[0], fontsize=13)
    plt.ylabel(iris_feature[1], fontsize=13)
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.title('Iris classification with SVM on two features', fontsize=16)
    plt.grid(True, ls=':')  # the 'b' keyword was removed in recent matplotlib versions
    plt.tight_layout(pad=1.5)
    plt.show()
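
If the local file `iris.data` is not at hand, roughly the same experiment can be run with the copy of the iris data that ships with scikit-learn. This is a sketch under the assumption that `load_iris` matches the file used above; the plotting part is left out.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x = iris.data[:, :2]   # sepal length and sepal width only
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, random_state=1, train_size=0.6)

clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
clf.fit(x_train, y_train)

train_acc = clf.score(x_train, y_train)
test_acc = clf.score(x_test, y_test)
scores = clf.decision_function(x_test)

print('training accuracy:', train_acc)
print('test accuracy    :', test_acc)
# With 'ovr', decision_function returns one column of scores per class.
print('decision_function shape:', scores.shape)
```

Each row of `scores` holds three values, one per iris class, and `predict` picks the class with the largest one.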

SVR:

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt

if __name__ == "__main__":
    N = 50
    np.random.seed(0)
    x = np.sort(np.random.uniform(0, 6, N), axis=0)
    y = 2 * np.sin(x) + 0.1 * np.random.randn(N)
    x = x.reshape(-1, 1)
    print('x =\n', x)
    print('y =\n', y)

    print('SVR - RBF')
    svr_rbf = svm.SVR(kernel='rbf', gamma=0.2, C=100)
    svr_rbf.fit(x, y)
    print('SVR - Linear')
    svr_linear = svm.SVR(kernel='linear', C=100)
    svr_linear.fit(x, y)
    print('SVR - Polynomial')
    svr_poly = svm.SVR(kernel='poly', degree=3, C=100)
    svr_poly.fit(x, y)
    print('Fit OK.')

    # Exercise: change the coefficient 1.1 to 1.5 and see how each kernel extrapolates
    x_test = np.linspace(x.min(), 1.1 * x.max(), 100).reshape(-1, 1)
    y_rbf = svr_rbf.predict(x_test)
    y_linear = svr_linear.predict(x_test)
    y_poly = svr_poly.predict(x_test)

    plt.figure(figsize=(9, 8), facecolor='w')
    plt.plot(x_test, y_rbf, 'r-', linewidth=2, label='RBF Kernel')
    plt.plot(x_test, y_linear, 'g-', linewidth=2, label='Linear Kernel')
    plt.plot(x_test, y_poly, 'b-', linewidth=2, label='Polynomial Kernel')
    plt.plot(x, y, 'mo', markersize=6)
    plt.scatter(x[svr_rbf.support_], y[svr_rbf.support_], s=200, c='r', marker='*', label='RBF Support Vectors',
                zorder=10)
    plt.legend(loc='lower left')
    plt.title('SVR', fontsize=16)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.grid(True)
    plt.tight_layout(pad=2)  # pad must be passed by keyword in recent matplotlib versions
    plt.show()
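
The plot compares the kernels visually. A numeric comparison is sketched below on the same synthetic data; using the training mean squared error as the measure is my own choice, and the polynomial kernel is left out to keep the example short.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import mean_squared_error

# Same synthetic data as in the listing above.
np.random.seed(0)
N = 50
x = np.sort(np.random.uniform(0, 6, N), axis=0)
y = 2 * np.sin(x) + 0.1 * np.random.randn(N)
x = x.reshape(-1, 1)

models = {
    'rbf': svm.SVR(kernel='rbf', gamma=0.2, C=100),
    'linear': svm.SVR(kernel='linear', C=100),
}
errors = {}
for name, model in models.items():
    model.fit(x, y)
    errors[name] = mean_squared_error(y, model.predict(x))
    print('%-6s training MSE = %.4f, support vectors = %d'
          % (name, errors[name], len(model.support_)))
```

A straight line cannot follow a sine curve, so the linear kernel's error should come out far larger than the RBF kernel's.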

sklearn.metrics evaluation metrics
SVM evaluation metrics

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score
from sklearn.metrics import precision_recall_fscore_support, classification_report

if __name__ == "__main__":
    y_true = np.array([1, 1, 1, 1, 0, 0])
    y_hat = np.array([1, 0, 1, 1, 1, 1])
    print('Accuracy:\t', accuracy_score(y_true, y_hat))

    # The precision is the ratio 'tp / (tp + fp)' where 'tp' is the number of
    # true positives and 'fp' the number of false positives. The precision is
    # intuitively the ability of the classifier not to label as positive a sample
    # that is negative.
    # The best value is 1 and the worst value is 0.
    precision = precision_score(y_true, y_hat)
    print('Precision:\t', precision)

    # The recall is the ratio 'tp / (tp + fn)' where 'tp' is the number of
    # true positives and 'fn' the number of false negatives. The recall is
    # intuitively the ability of the classifier to find all the positive samples.
    # The best value is 1 and the worst value is 0.
    recall = recall_score(y_true, y_hat)
    print('Recall:  \t', recall)

    # F1 score, also known as balanced F-score or F-measure
    # The F1 score is the harmonic mean of the precision and
    # recall, where an F1 score reaches its best value at 1 and worst score at 0.
    # The relative contribution of precision and recall to the F1 score are
    # equal. The formula for the F1 score is:
    #     F1 = 2 * (precision * recall) / (precision + recall)
    print('f1 score: \t', f1_score(y_true, y_hat))
    print(2 * (precision * recall) / (precision + recall))

    # The F-beta score is the weighted harmonic mean of precision and recall,
    # reaching its optimal value at 1 and its worst value at 0.
    # The 'beta' parameter determines the weight of recall in the combined
    # score. 'beta < 1' lends more weight to precision, while 'beta > 1'
    # favors recall ('beta -> 0' considers only precision, 'beta -> inf' only recall).
    print('F-beta:')
    for beta in np.logspace(-3, 3, num=7, base=10):
        fbeta = fbeta_score(y_true, y_hat, beta=beta)
        print('\tbeta=%9.3f\tF-beta=%.5f' % (beta, fbeta))
        # print((1 + beta**2) * precision * recall / (beta**2 * precision + recall))

    print(precision_recall_fscore_support(y_true, y_hat, beta=1))
    print(classification_report(y_true, y_hat))
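
To see where these numbers come from, the same scores can be worked out by hand from the confusion counts. This sketch uses only NumPy and the same `y_true` and `y_hat` as above; the helper name `fbeta` is illustrative.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0])
y_hat = np.array([1, 0, 1, 1, 1, 1])

# Confusion counts for the positive class (label 1).
tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # 3
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # 2
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # 1
tn = int(np.sum((y_true == 0) & (y_hat == 0)))  # 0

accuracy = (tp + tn) / len(y_true)   # 3 / 6
precision = tp / (tp + fp)           # 3 / 5
recall = tp / (tp + fn)              # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print('tp, fp, fn, tn:', tp, fp, fn, tn)
print('accuracy :', accuracy)
print('precision:', precision)
print('recall   :', recall)
print('f1       :', f1)
print('F-beta, beta=0.001:', fbeta(precision, recall, 0.001))  # close to precision
print('F-beta, beta=1000 :', fbeta(precision, recall, 1000))   # close to recall
```

With a very small `beta` the score approaches the precision (0.6), and with a very large `beta` it approaches the recall (0.75).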