Support Vector Machine Series, Part 03

Latest recommended article published on 2024-04-30 17:07:10

Winfred_Bo

Category: Machine Learning. Tags: support vector machine, python, machine learning

Article link: https://blog.csdn.net/Winfred_Bo/article/details/107925009

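An Advanced Look at Support Vector Machine Principles

1. Class imbalance in binary SVC
- 1.1 Handling imbalance: the class_weight parameter
- 1.2 Confusion Matrix
2. ROC curve and related topics

1. Class imbalance in binary SVC

1.1 Handling imbalance: the class_weight parameter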

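Classification models naturally lean toward the majority class: the majority class is more likely to be predicted correctly, and the minority class is sacrificed. Take crime prediction as an example. Even a model that does nothing and labels every sample as "will not commit a crime" reaches a very high accuracy. That makes accuracy meaningless as an evaluation metric here, and such a model fails the actual modeling goal of identifying the people who do commit crimes.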
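The point about accuracy can be checked with a few lines. This is a minimal sketch that assumes hypothetical labels with the same 500:50 split used later in this article; the "model" is simply an array of zeros, not a trained classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 500 majority samples (0) and 50 minority samples (1).
y_true = np.array([0] * 500 + [1] * 50)

# A "model" that does nothing: it predicts the majority class for everyone.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)   # 500 correct out of 550
rec = recall_score(y_true, y_pred)     # share of minority samples that were found

print(f"accuracy        = {acc:.3f}")  # 0.909
print(f"minority recall = {rec:.3f}")  # 0.000
```

Accuracy looks respectable even though not a single minority sample is identified, which is why the rest of this section relies on other metrics.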

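A. Resampling (as with logistic regression)
Resampling is a poor fit for SVMs, though. The number of samples has a large effect on training speed, so simply adding samples increases computation time and also adds many points that have no influence on the decision boundary.
B. Weighting the labels
SVC offers two parameters for this: class_weight (set when the model is created) and sample_weight (passed to fit). In general we set only one of the two.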
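To illustrate why one of the two parameters is usually enough, here is a hedged sketch that trains the same linear SVC both ways. The weight of 10 for class 1 and the variable names (`by_class`, `by_sample`, `plain`) are illustrative assumptions; the data mirrors the imbalanced dataset used in this section.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import recall_score

# Imbalanced toy data: 500 majority samples (label 0), 50 minority samples (label 1).
X, y = make_blobs(n_samples=[500, 50], centers=[[0.0, 0.0], [2.0, 2.0]],
                  cluster_std=[1.5, 0.5], random_state=0, shuffle=False)

# Option 1: weight the class. The penalty C for label 1 becomes 10 * C.
by_class = svm.SVC(kernel="linear", class_weight={1: 10}).fit(X, y)

# Option 2: weight each sample. Minority samples get weight 10, all others 1.
weights = np.where(y == 1, 10.0, 1.0)
by_sample = svm.SVC(kernel="linear").fit(X, y, sample_weight=weights)

# Unweighted baseline for comparison.
plain = svm.SVC(kernel="linear").fit(X, y)

agreement = np.mean(by_class.predict(X) == by_sample.predict(X))
recall_plain = recall_score(y, plain.predict(X))
recall_weighted = recall_score(y, by_class.predict(X))

print("agreement between the two weighted models:", agreement)
print("minority recall, unweighted:", recall_plain)
print("minority recall, weighted:  ", recall_weighted)
```

Both options rescale the penalty C, one per class and one per sample, so setting both at once would multiply the two effects.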

1. Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs
2. Create an imbalanced dataset
class_1 = 500
class_2 = 50
centers = [[0.0 , 0.0] ,[2.0 ,2.0]]
cluster_std = [1.5 ,0.5]
X,y = make_blobs(n_samples=[class_1 , class_2]
                ,centers= centers
                ,cluster_std= cluster_std
                ,random_state= 0 
                ,shuffle= False
                )
plt.scatter(X[:,0],X[:,1],c=y , cmap = 'rainbow',s= 10)
-------------------------------------------------------------------------
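3. Build two models (with and without class weights)
# Without class_weight
clf = svm.SVC(kernel='linear')
clf.fit(X,y)
# With class_weight: errors on class 1 are penalized 10 times as heavily
wclf = svm.SVC(kernel='linear',class_weight={1:10})
wclf.fit(X,y)
# Accuracy of the first model
clf.score(X,y)   # 0.9418181818181818
# Accuracy of the second model
wclf.score(X,y)  # 0.9127272727272727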
-------------------------------------------------------------------------
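4. Plot the decision boundaries of both models
# Start with the data distribution
plt.figure(figsize=(6,5))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="rainbow",s=10)
ax = plt.gca() # get the current axes, creating them if none exist

# Step 1 of drawing a decision boundary: build a grid
xlim = ax.get_xlim()
ylim = ax.get_ylim()

xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T

# Step 2: evaluate the decision function (signed distance to the boundary) at every grid point
Z_clf = clf.decision_function(xy).reshape(XX.shape)
a = ax.contour(XX, YY, Z_clf, colors='black', levels=[0], alpha=0.5, linestyles=['-'])

Z_wclf = wclf.decision_function(xy).reshape(XX.shape)
b = ax.contour(XX, YY, Z_wclf, colors='red', levels=[0], alpha=0.5, linestyles=['-'])

# Step 3: draw the legend
# legend_elements() returns the legend handles of a contour set; it is used here
# because ContourSet.collections is deprecated in recent matplotlib releases
plt.legend([a.legend_elements()[0][0], b.legend_elements()[0][0]], ["non weighted", "weighted"],
           loc="upper right")
plt.show()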


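1.2 Confusion Matrix

[Figure: confusion matrix. In the formulas below, each two-digit entry is a count of samples: the first digit is the true label and the second digit is the predicted label. For example, 10 is the number of samples that are truly 1 but predicted as 0. Class 1 is the minority class and class 0 the majority class.]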

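1. Accuracy
$\frac{11+00}{11+10+01+00}$
Accuracy is the number of correctly predicted samples divided by the total number of samples; the closer to 1, the better.
Accuracy is also the metric for a model's overall performance.

2. Precision
$\frac{11}{11+01}$
Precision is the share of samples predicted as 1 that are truly 1. It measures the cost of misjudging the majority class: every majority sample wrongly predicted as 1 lowers it.
(y[y == clf.predict(X)] == 1).sum() / (clf.predict(X)==1).sum()
When every misjudged majority sample is very costly, we pursue high precision.

3. Recall
$\frac{11}{11+10}$
Recall is the share of samples that are truly 1 that we predict correctly.
(y[y == clf.predict(X)] == 1).sum() / (y == 1).sum()
If we want to find the minority class at any cost, we pursue high recall.

4. F1 measure
$F_1 = \frac{2}{{1\over Precision} + {1\over Recall}} = \frac{2 \cdot Precision \cdot Recall}{Precision+Recall}$
The F1 measure is the harmonic mean of precision and recall, so it is high only when both of them are high.

5. Specificity
$\frac{00}{00+01}$
Specificity is the share of samples that are truly 0 that are correctly predicted as 0.
(y[y == clf.predict(X)] == 0).sum() / (y == 0).sum()
Specificity measures a model's ability to judge the majority class correctly.

6. False positive rate
$\frac{01}{00+01}$
The false positive rate equals 1 - specificity; it measures how often a model misjudges the majority class.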
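The six formulas above can be computed in one pass from a confusion matrix. The sketch below is an illustration rather than part of the original walkthrough: the helper `report` is a hypothetical name, the data mirrors the imbalanced dataset from section 1.1, and it assumes the model predicts at least one sample as class 1 (otherwise precision is undefined).

```python
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, f1_score

X, y = make_blobs(n_samples=[500, 50], centers=[[0.0, 0.0], [2.0, 2.0]],
                  cluster_std=[1.5, 0.5], random_state=0, shuffle=False)

def report(model):
    """Compute the six metrics of this section from the confusion matrix."""
    # labels=[1, 0] orders the matrix as [[11, 10], [01, 00]]
    # (rows: true label, columns: predicted label).
    (n11, n10), (n01, n00) = confusion_matrix(y, model.predict(X), labels=[1, 0])
    precision = n11 / (n11 + n01)
    recall = n11 / (n11 + n10)
    return {
        "accuracy": (n11 + n00) / (n11 + n10 + n01 + n00),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": n00 / (n00 + n01),
        "fpr": n01 / (n00 + n01),
    }

plain = svm.SVC(kernel="linear").fit(X, y)
weighted = svm.SVC(kernel="linear", class_weight={1: 10}).fit(X, y)
plain_metrics = report(plain)
weighted_metrics = report(weighted)

for name in plain_metrics:
    print(f"{name:12s} plain={plain_metrics[name]:.3f}  "
          f"weighted={weighted_metrics[name]:.3f}")
```

Printing the two models side by side makes the trade-off visible: weighting the minority class typically raises recall while lowering precision and specificity.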

2. ROC curve and related topics

2.1 Probability prediction with logistic regression

1. Build a dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import make_blobs

class_1_ = 7
class_2_ = 4
centers_ = [[0,0],[1,1]]
clusters_std = [0.5 , 1]
X_ , y_ = make_blobs(n_samples=[class_1_ , class_2_]
                    ,centers=centers_
                    ,cluster_std= clusters_std
                    ,random_state= 0
                    ,shuffle= False)
plt.scatter(X_[:,0] , X_[:,1]
           ,c = y_
           ,cmap = 'rainbow'
           ,s = 30)


2. Fit the model and get the predicted probabilities
from sklearn.linear_model import LogisticRegression as LogiR
clf_lo = LogiR().fit(X_ , y_)
prob = clf_lo.predict_proba(X_)
import pandas as pd 
prob = pd.DataFrame(prob)
prob.columns = ['0', '1']
prob
           0         1
0   0.694619  0.305381
1   0.510931  0.489069
2   0.820038  0.179962
3   0.785647  0.214353
4   0.777387  0.222613
5   0.656634  0.343366
6   0.768586  0.231414
7   0.349171  0.650829
8   0.366184  0.633816
9   0.663272  0.336728
10  0.607529  0.392471
--------------------------------------------------------------------------------
3. Apply a threshold of 0.5: samples with a probability above 0.5 are predicted as 1, all others as 0
for i in range(prob.shape[0]):
    if prob.loc[i,'1'] > 0.5:
        prob.loc[i,'pred'] = 1
    else:
        prob.loc[i,'pred'] = 0
prob['y_true'] = y_
prob = prob.sort_values(by = '1', ascending= False)
prob
           0         1  pred  y_true
7   0.349171  0.650829   1.0       1
8   0.366184  0.633816   1.0       1
1   0.510931  0.489069   0.0       0
10  0.607529  0.392471   0.0       1
5   0.656634  0.343366   0.0       0
9   0.663272  0.336728   0.0       1
0   0.694619  0.305381   0.0       0
6   0.768586  0.231414   0.0       0
4   0.777387  0.222613   0.0       0
3   0.785647  0.214353   0.0       0
2   0.820038  0.179962   0.0       0
--------------------------------------------------------------------------------
4. Inspect the results with a confusion matrix
from sklearn.metrics import confusion_matrix as CM , precision_score as P ,recall_score as R
CM(prob.loc[:,'y_true'] , prob.loc[:,'pred'],labels=[1,0])
P(prob.loc[:,'y_true'] , prob.loc[:,'pred'],labels=[1,0])
R(prob.loc[:,'y_true'] , prob.loc[:,'pred'],labels=[1,0])

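Summary: the evaluation metrics change with the threshold, but raising or lowering the threshold does not necessarily move a metric in one fixed direction (precision, for instance, can rise or fall). It all depends on the actual distribution of the data.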
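A short threshold sweep makes the summary concrete. This is a sketch under assumptions: the four cut-offs are arbitrary illustrative values, and the data is regenerated with the same settings as the toy dataset above so that the block runs on its own.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

X_, y_ = make_blobs(n_samples=[7, 4], centers=[[0, 0], [1, 1]],
                    cluster_std=[0.5, 1], random_state=0, shuffle=False)

# Probability of class 1 for every sample.
proba_1 = LogisticRegression().fit(X_, y_).predict_proba(X_)[:, 1]

thresholds = [0.3, 0.4, 0.5, 0.6]  # arbitrary illustrative cut-offs
recalls = []
for t in thresholds:
    pred = (proba_1 > t).astype(int)
    # zero_division=0 avoids a warning if no sample is predicted as 1.
    p = precision_score(y_, pred, zero_division=0)
    r = recall_score(y_, pred)
    recalls.append(r)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can only shrink the set of samples predicted as 1, so recall never increases; precision has no such guarantee, which is why the data distribution decides the outcome.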

2.2 Probability prediction with SVM

1. Build a dataset
class_1 = 500
class_2 = 50
centers = [[0,0],[2,2]]
clusters_std = [1.5 , 0.5]
X ,y = make_blobs(n_samples=[class_1 , class_2]
                 ,centers =centers
                 ,cluster_std= clusters_std
                 ,random_state= 0
                 ,shuffle= False)
plt.scatter(X[:,0] , X[:,1] ,c= y
           ,cmap= 'rainbow' ,s= 10)


clf_proba = svm.SVC(kernel='linear' , C = 1.0 ,probability= True ).fit(X ,y)
clf_proba.predict_proba(X)
array([[0.68035758, 0.31964242],
       [0.24033895, 0.75966105],
       [0.9662359 , 0.0337641 ],
       ...,
       [0.13634925, 0.86365075],
       [0.33415679, 0.66584321],
       [0.29297621, 0.70702379]])
clf_proba.predict_proba(X).shape
(550, 2)
clf_proba.decision_function(X).shape
(550,)

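decision_function produces a single column of distances, and the sign of the distance determines a sample's class; predict_proba produces two columns, one probability for each class.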
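The relationship between the two outputs can be verified directly. The sketch below assumes the same imbalanced data as above; `random_state=0` is added here only to make the internally cross-validated probability calibration repeatable and is not part of the original code.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=[500, 50], centers=[[0, 0], [2, 2]],
                  cluster_std=[1.5, 0.5], random_state=0, shuffle=False)

clf = svm.SVC(kernel="linear", C=1.0, probability=True, random_state=0).fit(X, y)

dist = clf.decision_function(X)   # shape (550,): one signed score per sample
proba = clf.predict_proba(X)      # shape (550, 2): one column per class

# The sign of the score decides the class: positive means clf.classes_[1].
pred_from_sign = np.where(dist > 0, clf.classes_[1], clf.classes_[0])
sign_matches_predict = np.array_equal(pred_from_sign, clf.predict(X))

# Each row of predict_proba sums to 1.
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

print(sign_matches_predict, rows_sum_to_one)
```

Note that predict follows the sign of decision_function, not the larger of the two probabilities, so the probabilities can occasionally disagree with the predicted label for samples close to the boundary.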

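2.3 Plotting the ROC curve of an SVM

The ROC curve plots the false positive rate (FPR) at different thresholds on the horizontal axis against the recall at the same thresholds on the vertical axis.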
from sklearn.metrics import confusion_matrix as CM

recall = []
FPR = []
# Compute the class-1 probabilities once instead of on every loop iteration
proba_1 = clf_proba.predict_proba(X)[:,1]
probrange = np.linspace(proba_1.min() , proba_1.max() , num = 50 , endpoint= False)

for i in probrange:
    # Predict 1 whenever the class-1 probability exceeds the threshold i
    y_predict = (proba_1 > i).astype(int)
    cm = CM(y , y_predict , labels=[1,0])
    recall.append(cm[0,0]/cm[0,:].sum())   # 11 / (11 + 10)
    FPR.append(cm[1,0]/cm[1,:].sum())      # 01 / (01 + 00)

# The thresholds were visited in ascending order, so both lists are descending;
# sort them into ascending order for plotting
recall.sort()
FPR.sort()

plt.plot(FPR , recall , c = 'red')
plt.plot(probrange + 0.05 , probrange + 0.05 , c = 'black' ,linestyle = '--')
plt.show()


For a convex ROC curve, the closer it bends toward the upper-left corner, the better the model performs; the lower it lies, the worse.

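2.4 Understanding ROC

sklearn.metrics.roc_curve(y_true, y_score, pos_label=None, sample_weight=None, drop_intermediate=True)
This function computes the ROC curve's horizontal coordinate (the false positive rate, FPR), its vertical coordinate (recall), and the corresponding thresholds.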

from sklearn.metrics import roc_curve
FPR , recall , thresholds = roc_curve(y ,clf_proba.decision_function(X) , pos_label= 1)
# Returns the FPR, the recall, and the corresponding thresholds along the ROC curve
from sklearn.metrics import roc_auc_score as AUC
area = AUC(y,clf_proba.decision_function(X))
# Returns the AUC, the area under the ROC curve
plt.figure()
plt.plot(FPR ,recall , color = 'red'
         ,label = 'ROC curve (area = %0.2f)'%area)
plt.plot([0,1],[0,1],color = 'black',linestyle = '--')
plt.xlim([-0.05 , 1.05])
plt.xlabel('False Positive Rate')
plt.ylim([-0.05 , 1.05])
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc = 'lower right')
plt.show()
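As a sanity check on what the AUC value means, the area can also be integrated from the curve's own points. This sketch assumes the same data as above and uses `sklearn.metrics.auc`, which applies the trapezoidal rule; the variable names are illustrative.

```python
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import auc, roc_auc_score, roc_curve

X, y = make_blobs(n_samples=[500, 50], centers=[[0, 0], [2, 2]],
                  cluster_std=[1.5, 0.5], random_state=0, shuffle=False)

# Signed scores from a linear SVC serve as y_score.
scores = svm.SVC(kernel="linear").fit(X, y).decision_function(X)

fpr, tpr, thr = roc_curve(y, scores, pos_label=1)

area_trapezoid = auc(fpr, tpr)          # area integrated from the curve points
area_direct = roc_auc_score(y, scores)  # area computed directly from the scores

print(f"trapezoid: {area_trapezoid:.4f}  direct: {area_direct:.4f}")
print("curve runs from", (fpr[0], tpr[0]), "to", (fpr[-1], tpr[-1]))
```

The two numbers agree, and the curve always starts at (0, 0), where nothing is predicted as 1, and ends at (1, 1), where everything is.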

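Youden index: we want the model to get better at capturing the minority class while hurting the majority class as little as possible. In other words, as recall grows, FPR should stay as small as possible. The Youden index is the largest gap between recall and FPR, and the threshold at which this gap is reached is a common choice for the decision threshold.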

maxindex = (recall - FPR).tolist().index(max(recall - FPR))
thresholds[maxindex]
-1.0860191749391461
plt.figure()
plt.plot(FPR ,recall , color = 'red'
         ,label = 'ROC curve (area = %0.2f)'%area)
plt.plot([0,1],[0,1],color = 'black',linestyle = '--')
plt.scatter(FPR[maxindex] ,recall[maxindex] ,c ='k' , s = 30)
plt.xlim([-0.05 , 1.05])
plt.xlabel('False Positive Rate')
plt.ylim([-0.05 , 1.05])
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc = 'lower right')
plt.show()
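Once the threshold with the largest gap is known, it can replace the default cut-off of 0 on decision_function. The following is a minimal sketch under the same data assumptions as above; it uses `np.argmax` as a shorter alternative to the list-based index lookup, and the variable names are illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix, recall_score, roc_curve

X, y = make_blobs(n_samples=[500, 50], centers=[[0, 0], [2, 2]],
                  cluster_std=[1.5, 0.5], random_state=0, shuffle=False)
scores = svm.SVC(kernel="linear").fit(X, y).decision_function(X)

fpr, tpr, thr = roc_curve(y, scores, pos_label=1)

# Youden index: the largest value of recall - FPR over all thresholds.
gap = tpr - fpr
best = int(np.argmax(gap))

# roc_curve treats a sample as positive when its score is >= the threshold.
y_pred = (scores >= thr[best]).astype(int)

rec_at_best = recall_score(y, y_pred)
cm = confusion_matrix(y, y_pred, labels=[1, 0])
fpr_at_best = cm[1, 0] / cm[1, :].sum()

print(f"threshold={thr[best]:.4f}  recall={rec_at_best:.3f}  FPR={fpr_at_best:.3f}")
```

Predicting with this threshold reproduces exactly the recall and FPR of the marked point on the curve, which confirms that the point and the threshold belong together.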