ML Study Notes (1): SVM Experimental Verification Notes

啥都不会的蒋浩同学

Tags: support vector machine, study notes

Article link: https://blog.csdn.net/m0_59114786/article/details/134567730

2023.11.23: SVM study notes and the SMO efficient optimization algorithm

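General framework for applying SVM

(1) Collect data: any method can be used.

(2) Prepare data: numeric data is required.

(3) Analyze data: visualizing the separating hyperplane helps.

(4) Train the algorithm: most of the time in SVM is spent on training, which mainly amounts to tuning two parameters.

(5) Test the algorithm: a very simple computation suffices.

(6) Use the algorithm: SVM can be applied to almost any classification problem. Note that SVM itself is a binary classifier, so applying it to a multi-class problem requires some changes to the code.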
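One common way to reuse a binary classifier for several classes is one-vs-rest: train one binary SVM per class and pick the class whose classifier is most confident. The sketch below only shows the relabeling and the final vote; the helper names `one_vs_rest_labels` and `predict_one_vs_rest` are made up for illustration, and the decision values are toy numbers rather than output from a trained model.

```python
import numpy as np

def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class label list as +1 (positive_class) / -1 (all other classes)."""
    return [1.0 if y == positive_class else -1.0 for y in labels]

def predict_one_vs_rest(decision_values, classes):
    """For each sample, return the class whose binary classifier gives the largest value.

    decision_values has shape (n_samples, n_classes); column k holds f_k(x),
    the decision value of the classifier trained for classes[k].
    """
    decision_values = np.asarray(decision_values, dtype=float)
    return [classes[k] for k in decision_values.argmax(axis=1)]

# Toy example (hypothetical labels and scores)
labels = [0, 9, 3, 9]
binary_labels = one_vs_rest_labels(labels, 9)
scores = [[0.2, -1.0, 0.5],
          [-0.3, 0.8, 0.1]]
predicted = predict_one_vs_rest(scores, [0, 3, 9])
print(binary_labels)
print(predicted)
```

The handwritten-digit experiment later in these notes uses the same kind of relabeling for a single class (digit 9 against all the others), so it stays a two-class problem.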

SMO: an efficient optimization algorithm

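When using an SVM, the main task is to find the optimal values of alpha; once they are known, the classification decision function follows directly.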
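For reference, the decision function in terms of the alphas can be written as below. This is the standard dual form, with $m$ training points $x_i$, labels $y_i \in \{-1, +1\}$ and offset $b$; with a kernel, the inner product $\langle x_i, x \rangle$ is replaced by $K(x_i, x)$.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{m} \alpha_i \, y_i \, \langle x_i, x \rangle + b \right)
```

Only the points with $\alpha_i > 0$ contribute to the sum; these are the support vectors.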

SMO stands for Sequential Minimal Optimization.

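The goal of SMO is to find a set of alphas and b; once the alphas are found, it is easy to compute the weight vector w and obtain the separating hyperplane.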
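As a minimal sketch of that last step for the linear case, w is the sum of alpha_i * y_i * x_i over all training points. The function name `weights_from_alphas` and the three toy points are made up for illustration; the alphas and b are the hand-derived solution for this toy set, not output from the SMO code in these notes.

```python
import numpy as np

def weights_from_alphas(alphas, X, y):
    """Return w = sum_i alpha_i * y_i * x_i, computed as a single matrix product."""
    alphas = np.asarray(alphas, dtype=float).ravel()
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    return X.T @ (alphas * y)

# Toy data: (1,1) is negative, (3,3) and (5,5) are positive.
X = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]]
y = [-1.0, 1.0, 1.0]
alphas = [0.25, 0.25, 0.0]   # the third point is not a support vector
b = -2.0

w = weights_from_alphas(alphas, X, y)
decision_values = np.asarray(X) @ w + b   # w.x + b for every point
print(w)
print(np.sign(decision_values))
```

Here w comes out as (0.5, 0.5), the two support vectors sit exactly on the margins (decision values -1 and +1), and the third point lies beyond the margin.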

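SMO works as follows: each iteration picks two alphas to optimize. Once a suitable pair is found, one of them is increased and the other is decreased. Here "suitable" means the two alphas must satisfy certain conditions: first, both must lie outside the margin boundary; second, they must not already have been clipped or lie on the boundary.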
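The update of one pair can be sketched in isolation as follows. It mirrors the usual SMO step (box bounds L and H, curvature eta, clipping), but the function `smo_pair_step` and its scalar arguments are a simplification made up for illustration; the listings in these notes work on NumPy matrices and also update b.

```python
def smo_pair_step(a_i, a_j, y_i, y_j, E_i, E_j, K_ii, K_jj, K_ij, C):
    """One SMO update of the pair (alpha_i, alpha_j).

    E_i, E_j are the prediction errors f(x) - y; K_* are kernel values.
    Returns the new pair, or the old pair when no progress is possible.
    """
    # Bounds that keep both alphas inside [0, C] while preserving
    # the constraint y_i*alpha_i + y_j*alpha_j = const.
    if y_i != y_j:
        L, H = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
    else:
        L, H = max(0.0, a_j + a_i - C), min(C, a_j + a_i)
    if L == H:
        return a_i, a_j
    eta = 2.0 * K_ij - K_ii - K_jj
    if eta >= 0:
        return a_i, a_j
    # Unconstrained optimum for alpha_j, then clip it to [L, H]
    a_j_new = a_j - y_j * (E_i - E_j) / eta
    a_j_new = min(H, max(L, a_j_new))
    # Move alpha_i by the same amount in the opposite direction
    a_i_new = a_i + y_i * y_j * (a_j - a_j_new)
    return a_i_new, a_j_new

# Toy pair: x_i = (1,1) with y_i = -1, x_j = (3,3) with y_j = +1, linear kernel.
# With all alphas and b at zero, f(x) = 0, so E_i = 0 - (-1) = 1 and E_j = 0 - 1 = -1.
new_i, new_j = smo_pair_step(0.0, 0.0, -1.0, 1.0, 1.0, -1.0,
                             K_ii=2.0, K_jj=18.0, K_ij=6.0, C=0.6)
print(new_i, new_j)
```

In this toy case eta is -8, so alpha_j moves from 0 to 0.25 and alpha_i moves by the same amount, which keeps y_i*alpha_i + y_j*alpha_j at zero.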

Environment: a PaddlePaddle (飞桨) AI Studio notebook

# View dataset directory. 
# This directory will be recovered automatically after resetting environment. 
!ls /home/aistudio/data

data112477

# View personal work directory. 
# All changes under this directory will be kept even after reset. 
# Please clean unnecessary files in time to speed up environment loading. 
!ls /home/aistudio/work

# If a persistence installation is required, 
# you need to use the persistence path as the following: 
!mkdir /home/aistudio/external-libraries
!pip install beautifulsoup4 -t /home/aistudio/external-libraries

# Also add the following code, 
# so that every time the environment (kernel) starts, 
# just run the following code: 
import sys 
sys.path.append('/home/aistudio/external-libraries')


```python
# Listing 6-1: helper functions for the SMO algorithm
from numpy import *
from time import sleep

def loadDataSet(fileName):
    # Open the file and parse it line by line to get the data matrix and the class label of each row.
    dataMat = []; labelMat = []
    with open(fileName) as fr:  # close the file once parsing is done
        for line in fr:
            lineArr = line.strip().split('\t')
            dataMat.append([float(lineArr[0]), float(lineArr[1])])
            labelMat.append(float(lineArr[2]))
    return dataMat,labelMat

def selectJrand(i,m): # i is the index of the first alpha, m is the total number of alphas; keep drawing at random until the result differs from i.
    j=i #we want to select any J not equal to i
    while (j==i):
        j = int(random.uniform(0,m))
    return j

def clipAlpha(aj,H,L): # clip alpha values that are greater than H or less than L.
    if aj > H: 
        aj = H
    if L > aj:
        aj = L
    return aj

# Load the testSet.txt data for testing
dataMat,labelMat = loadDataSet('/home/aistudio/testSet.txt')
#print(dataMat,labelMat)

# Listing 6-2: the simplified SMO algorithm
def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
    # Inputs: data set, class labels, constant C, tolerance, and the maximum number of passes before exiting
    dataMatrix = mat(dataMatIn); labelMat = mat(classLabels).transpose()
    b = 0; m,n = shape(dataMatrix)
    alphas = mat(zeros((m,1)))
    iter = 0
    while (iter < maxIter):
        alphaPairsChanged = 0
        for i in range(m):
            fXi = float(multiply(alphas,labelMat).T*(dataMatrix*dataMatrix[i,:].T)) + b
            Ei = fXi - float(labelMat[i])#if checks if an example violates KKT conditions
            if ((labelMat[i]*Ei < -toler) and (alphas[i] < C)) or ((labelMat[i]*Ei > toler) and (alphas[i] > 0)):
                j = selectJrand(i,m)
                fXj = float(multiply(alphas,labelMat).T*(dataMatrix*dataMatrix[j,:].T)) + b
                Ej = fXj - float(labelMat[j])
                alphaIold = alphas[i].copy(); alphaJold = alphas[j].copy();
                if (labelMat[i] != labelMat[j]):
                    L = max(0, alphas[j] - alphas[i])
                    H = min(C, C + alphas[j] - alphas[i])
                else:
                    L = max(0, alphas[j] + alphas[i] - C)
                    H = min(C, alphas[j] + alphas[i])
                if L==H: continue  # no room to move alpha_j inside [L, H]
                eta = 2.0 * dataMatrix[i,:]*dataMatrix[j,:].T - dataMatrix[i,:]*dataMatrix[i,:].T - dataMatrix[j,:]*dataMatrix[j,:].T
                if eta >= 0: continue  # avoid dividing by zero or stepping in a non-improving direction
                alphas[j] -= labelMat[j]*(Ei - Ej)/eta
                alphas[j] = clipAlpha(alphas[j],H,L)
                if (abs(alphas[j] - alphaJold) < 0.00001): 
                    #print("j not moving enough"); 
                    continue
                alphas[i] += labelMat[j]*labelMat[i]*(alphaJold - alphas[j])#update i by the same amount as j
                                                                        #the update is in the opposite direction
                b1 = b - Ei- labelMat[i]*(alphas[i]-alphaIold)*dataMatrix[i,:]*dataMatrix[i,:].T - labelMat[j]*(alphas[j]-alphaJold)*dataMatrix[i,:]*dataMatrix[j,:].T
                b2 = b - Ej- labelMat[i]*(alphas[i]-alphaIold)*dataMatrix[i,:]*dataMatrix[j,:].T - labelMat[j]*(alphas[j]-alphaJold)*dataMatrix[j,:]*dataMatrix[j,:].T
                if (0 < alphas[i]) and (C > alphas[i]): b = b1
                elif (0 < alphas[j]) and (C > alphas[j]): b = b2
                else: b = (b1 + b2)/2.0
                alphaPairsChanged += 1
                #print("iter: %d i:%d, pairs changed %d" % (iter,i,alphaPairsChanged))
        if (alphaPairsChanged == 0): iter += 1
        else: iter = 0
        #print("iteration number: %d" % iter)
    return b,alphas

# Test smoSimple
C = 0.6
toler = 0.001
maxIter = 40
b,alphas = smoSimple(dataMat,labelMat, C, toler, maxIter)
print('b:',b)
print('alphas:',alphas[alphas>0])

b: [[-3.75701938]]
alphas: [[0.12842297 0.23588326 0.36430804]]

# Speeding up optimization with the full Platt SMO algorithm
class optStruct:
    def __init__(self,dataMatIn, classLabels, C, toler, kTup):  # Initialize the structure with the parameters 
        self.X = dataMatIn
        self.labelMat = classLabels
        self.C = C
        self.tol = toler
        self.m = shape(dataMatIn)[0]
        self.alphas = mat(zeros((self.m,1)))
        self.b = 0
        self.eCache = mat(zeros((self.m,2))) #first column is valid flag
        self.K = mat(zeros((self.m,self.m)))
        for i in range(self.m):
            self.K[:,i] = kernelTrans(self.X, self.X[i,:], kTup)

# Listing 6-4: the optimization routine of the full Platt SMO algorithm
def innerL(i, oS):
    Ei = calcEk(oS, i)
    if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or ((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
        j,Ej = selectJ(i, oS, Ei) #this has been changed from selectJrand
        alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy();
        if (oS.labelMat[i] != oS.labelMat[j]):
            L = max(0, oS.alphas[j] - oS.alphas[i])
            H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
        else:
            L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
            H = min(oS.C, oS.alphas[j] + oS.alphas[i])
        if L==H: 
            #print("L==H"); 
            return 0
        eta = 2.0 * oS.K[i,j] - oS.K[i,i] - oS.K[j,j]
        #changed for kernel
        if eta >= 0: 
            #print("eta>=0"); 
            return 0
        oS.alphas[j] -= oS.labelMat[j]*(Ei - Ej)/eta
        oS.alphas[j] = clipAlpha(oS.alphas[j],H,L)
        updateEk(oS, j) #added this for the Ecache
        if (abs(oS.alphas[j] - alphaJold) < 0.00001): 
            #print("j not moving enough"); 
            return 0
        oS.alphas[i] += oS.labelMat[j]*oS.labelMat[i]*(alphaJold - oS.alphas[j])#update i by the same amount as j
        updateEk(oS, i) #added this for the Ecache                    #the update is in the opposite direction
        b1 = oS.b - Ei- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,i] - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[i,j]
        b2 = oS.b - Ej- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,j]- oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[j,j]
        if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]): oS.b = b1
        elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]): oS.b = b2
        else: oS.b = (b1 + b2)/2.0
        return 1
    else: return 0

# Listing 6-5: the outer loop of the full Platt SMO algorithm
def smoP(dataMatIn, classLabels, C, toler, maxIter,kTup=('lin', 0)):    #full Platt SMO
    oS = optStruct(mat(dataMatIn),mat(classLabels).transpose(),C,toler, kTup)
    iter = 0
    entireSet = True; alphaPairsChanged = 0
    while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
        alphaPairsChanged = 0
        if entireSet:   #go over all
            for i in range(oS.m):        
                alphaPairsChanged += innerL(i,oS)
               # print("fullSet, iter: %d i:%d, pairs changed %d" % (iter,i,alphaPairsChanged))
            iter += 1
        else:#go over non-bound (railed) alphas
            nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
            for i in nonBoundIs:
                alphaPairsChanged += innerL(i,oS)
                #print("non-bound, iter: %d i:%d, pairs changed %d" % (iter,i,alphaPairsChanged))
            iter += 1
        if entireSet: entireSet = False #toggle entire set loop
        elif (alphaPairsChanged == 0): entireSet = True  
        #print("iteration number: %d" % iter)
    return oS.b,oS.alphas

def kernelTrans(X, A, kTup): #calc the kernel or transform data to a higher dimensional space
    m,n = shape(X)
    K = mat(zeros((m,1)))
    if kTup[0]=='lin': K = X * A.T   #linear kernel
    elif kTup[0]=='rbf':
        for j in range(m):
            deltaRow = X[j,:] - A
            K[j] = deltaRow*deltaRow.T
        K = exp(K/(-1*kTup[1]**2)) #divide in NumPy is element-wise not matrix like Matlab
    else: raise NameError('Houston We Have a Problem -- \
    That Kernel is not recognized')
    return K
def calcEk(oS, k):
    fXk = float(multiply(oS.alphas,oS.labelMat).T*oS.K[:,k] + oS.b)
    Ek = fXk - float(oS.labelMat[k])
    return Ek
        
def selectJ(i, oS, Ei):         #this is the second choice heuristic, and calcs Ej
    maxK = -1; maxDeltaE = 0; Ej = 0
    oS.eCache[i] = [1,Ei]  #set valid #choose the alpha that gives the maximum delta E
    validEcacheList = nonzero(oS.eCache[:,0].A)[0]
    if (len(validEcacheList)) > 1:
        for k in validEcacheList:   #loop through valid Ecache values and find the one that maximizes delta E
            if k == i: continue #don't calc for i, waste of time
            Ek = calcEk(oS, k)
            deltaE = abs(Ei - Ek)
            if (deltaE > maxDeltaE):
                maxK = k; maxDeltaE = deltaE; Ej = Ek
        return maxK, Ej
    else:   #in this case (first time around) we don't have any valid eCache values
        j = selectJrand(i, oS.m)
        Ej = calcEk(oS, j)
    return j, Ej

def updateEk(oS, k):#after any alpha has changed update the new value in the cache
    Ek = calcEk(oS, k)
    oS.eCache[k] = [1,Ek]

# Test smoP
C = 0.6
toler = 0.001
maxIter = 40
b,alphas = smoP(dataMat,labelMat, C, toler, maxIter,kTup=('lin', 0))
print('b:',b)
print('alphas:',alphas[alphas>0]) # the nonzero alphas correspond to the support vectors

b: [[-2.89901748]]
alphas: [[0.06961952 0.0169055  0.0169055  0.0272699  0.04522972 0.0272699
  0.0243898  0.06140181 0.06140181]]

# Compute w from the alphas
def calcWs(alphas,dataArr,classLabels):
    X = mat(dataArr); labelMat = mat(classLabels).transpose()
    m,n = shape(X)
    w = zeros((n,1))
    for i in range(m):
        w += multiply(alphas[i]*labelMat[i],X[i,:].T)
    return w

# Plot the classifier
from numpy import *
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def showClassifer(dataMat,labelMat,alphas,w,b):
    data_plus = []
    data_minus = []
    dataSize = len(dataMat)
    for i in range(dataSize):
        if labelMat[i] > 0:
            data_plus.append(dataMat[i])
        else:
            data_minus.append(dataMat[i])
    # Convert to NumPy arrays
    new_data_plus = np.array(data_plus)
    new_data_minus = np.array(data_minus)
    plt.scatter(np.transpose(new_data_plus)[0],np.transpose(new_data_plus)[1],s=30, c='r', marker='s')
    plt.scatter(np.transpose(new_data_minus)[0],np.transpose(new_data_minus)[1],s=30, c='g')
    # Compute the endpoints of the separating line
    x1 = max(dataMat)[0]
    x2 = min(dataMat)[0]
    a1,a2 = w
    b = float(b)
    a1 = float(a1[0])
    a2 = float(a2[0])
    y1,y2 = (-b - a1 *x1) /a2,(-b - a1 * x2) / a2
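    # Draw the lines: the two margins w.x + b = +/-1 and the separating line w.x + b = 0
    margin = 1.0 / a2  # vertical offset of each margin line from the separating line
    plt.plot([x1, x2], [y1+margin, y2+margin])
    plt.plot([x1, x2], [y1-margin, y2-margin])
    plt.plot([x1,x2],[y1,y2])
    # Find the support vectors
    for i,alpha in enumerate(alphas):
        # A point with a nonzero alpha is a support vector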
        if(abs(alpha) > 0):
            x , y = dataMat[i]
            plt.scatter([x],[y],s = 150,c = 'none',alpha = 0.7,linewidths=2,edgecolors='black')
    plt.xlim((-2, 10))
    plt.ylim((-5, 5))
    plt.show()

# Full end-to-end test
dataMat,labelMat = loadDataSet('/home/aistudio/testSet.txt')
C = 2
toler = 0.0001
maxIter = 50
b,alphas = smoP(dataMat,labelMat, C, toler, maxIter,kTup=('lin', 0))
w = calcWs(alphas,dataMat,labelMat)
showClassifer(dataMat,labelMat,alphas,w,b)

# [Figure: classifier plot produced by showClassifer]

# Next, classify handwritten digits

# (1) Unzip the data

import os
import zipfile
os.chdir('/home/aistudio/data/data112477')
extracting = zipfile.ZipFile('digits.zip')
extracting.extractall()

# (2) Load the data set
def img2vector(filename):
    returnVect = zeros((1,1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect
def loadImages(dirName):
    from os import listdir
    hwLabels = []
    trainingFileList = listdir(dirName)           #load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]     #take off .txt
        classNumStr = int(fileStr.split('_')[0])
        if classNumStr == 9: hwLabels.append(-1)
        else: hwLabels.append(1)
        trainingMat[i,:] = img2vector('%s/%s' % (dirName, fileNameStr))
    return trainingMat, hwLabels   
#trainingMat, hwLabels  = loadImages('/home/aistudio/data/data112477/trainingDigits')
# (3) Test on the data set
def testDigits(kTup=('rbf', 10)):
    dataArr,labelArr = loadImages('/home/aistudio/data/data112477/trainingDigits')
    b,alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, kTup)
    datMat=mat(dataArr); labelMat = mat(labelArr).transpose()
    svInd=nonzero(alphas.A>0)[0]
    sVs=datMat[svInd] 
    labelSV = labelMat[svInd];
    print("there are %d Support Vectors" % shape(sVs)[0])
    m,n = shape(datMat)
    errorCount = 0
    for i in range(m):
        kernelEval = kernelTrans(sVs,datMat[i,:],kTup)
        predict=kernelEval.T * multiply(labelSV,alphas[svInd]) + b
        if sign(predict)!=sign(labelArr[i]): errorCount += 1
    print("the training error rate is: %f" % (float(errorCount)/m))
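    # Note: this reloads trainingDigits, so the "test" error below is measured on the training set
    # and always equals the training error; point it at a held-out directory for a true test error.
    dataArr,labelArr = loadImages('/home/aistudio/data/data112477/trainingDigits')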
    errorCount = 0
    datMat=mat(dataArr); labelMat = mat(labelArr).transpose()
    m,n = shape(datMat)
    for i in range(m):
        kernelEval = kernelTrans(sVs,datMat[i,:],kTup)
        predict=kernelEval.T * multiply(labelSV,alphas[svInd]) + b
        if sign(predict)!=sign(labelArr[i]): errorCount += 1
    print("the test error rate is: %f" % (float(errorCount)/m)) 
testDigits(kTup=('rbf', 5))

there are 402 Support Vectors
the training error rate is: 0.000000
the test error rate is: 0.000000

# kTup[1] from 3 to 25
testDigits(kTup=('rbf', 3))
testDigits(kTup=('rbf', 10))
testDigits(kTup=('rbf', 15))
testDigits(kTup=('rbf', 20))
testDigits(kTup=('rbf', 25))

there are 402 Support Vectors
the training error rate is: 0.000000
the test error rate is: 0.000000
there are 112 Support Vectors
the training error rate is: 0.000000
the test error rate is: 0.000000
there are 68 Support Vectors
the training error rate is: 0.004975
the test error rate is: 0.004975
there are 54 Support Vectors
the training error rate is: 0.002488
the test error rate is: 0.002488
there are 43 Support Vectors
the training error rate is: 0.034826
the test error rate is: 0.034826

# kTup[1] from 30 to 50
testDigits(kTup=('rbf', 30))
testDigits(kTup=('rbf', 35))
testDigits(kTup=('rbf', 40))
testDigits(kTup=('rbf', 45))
testDigits(kTup=('rbf', 50))

there are 43 Support Vectors
the training error rate is: 0.000000
the test error rate is: 0.000000
there are 38 Support Vectors
the training error rate is: 0.034826
the test error rate is: 0.034826
there are 36 Support Vectors
the training error rate is: 0.002488
the test error rate is: 0.002488
there are 44 Support Vectors
the training error rate is: 0.022388
the test error rate is: 0.022388
there are 46 Support Vectors
the training error rate is: 0.000000
the test error rate is: 0.000000

# kTup[1] from 55 to 75
testDigits(kTup=('rbf', 55))
testDigits(kTup=('rbf', 60))
testDigits(kTup=('rbf', 65))
testDigits(kTup=('rbf', 70))
testDigits(kTup=('rbf', 75))

there are 43 Support Vectors
the training error rate is: 0.007463
the test error rate is: 0.007463
there are 36 Support Vectors
the training error rate is: 0.019900
the test error rate is: 0.019900
there are 47 Support Vectors
the training error rate is: 0.019900
the test error rate is: 0.019900
there are 43 Support Vectors
the training error rate is: 0.022388
the test error rate is: 0.022388
there are 38 Support Vectors
the training error rate is: 0.012438
the test error rate is: 0.012438

# kTup[1] from 80 to 100
testDigits(kTup=('rbf', 80))
testDigits(kTup=('rbf', 85))
testDigits(kTup=('rbf', 90))
testDigits(kTup=('rbf', 95))
testDigits(kTup=('rbf', 100))

there are 37 Support Vectors
the training error rate is: 0.017413
the test error rate is: 0.017413
there are 47 Support Vectors
the training error rate is: 0.007463
the test error rate is: 0.007463
there are 43 Support Vectors
the training error rate is: 0.019900
the test error rate is: 0.019900
there are 32 Support Vectors
the training error rate is: 0.027363
the test error rate is: 0.027363
there are 45 Support Vectors
the training error rate is: 0.014925
the test error rate is: 0.014925
```