Machine Learning: Support Vector Machines for Multiclass Classification in Python

Hi everyone, this is 带我去滑雪!

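Background: a support vector machine separates classes with a hyperplane, so it has no natural extension to multiclass problems. The usual workaround is one-versus-one classification, also called the all-pairs approach. Suppose the data contain K classes with K > 2. Choosing two of the K classes (order ignored) can be done in $\binom{K}{2} = C_K^2 = \frac{K(K-1)}{2}$ ways.

An SVM is fitted to each of these $C_K^2$ binary problems, giving $C_K^2$ SVM models. A new test observation is passed through all $C_K^2$ models, and the class predicted most often becomes the final prediction. In effect the classes duel pairwise and the results are aggregated by majority vote.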
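The pairing and voting described above can be sketched in a few lines of plain Python. This is only an illustration of the counting and the majority rule, not scikit-learn's implementation; the helper names `ovo_pairs` and `majority_vote` and the list of pairwise winners are made up for the example.

```python
from collections import Counter
from itertools import combinations
from math import comb


def ovo_pairs(classes):
    """All unordered class pairs; each pair would get its own binary SVM."""
    return list(combinations(classes, 2))


def majority_vote(pairwise_winners):
    """Return the class predicted most often across the pairwise models."""
    votes = Counter(pairwise_winners)
    # Ties go to the class seen first, an arbitrary choice for this sketch.
    return votes.most_common(1)[0][0]


classes = [1, 2, 3, 4]                      # the four flowmeter states
pairs = ovo_pairs(classes)
assert len(pairs) == comb(len(classes), 2)  # K(K-1)/2 = 6 for K = 4

# Hypothetical winners of the six duels for one test observation,
# listed in the same order as `pairs`.
winners = [1, 3, 1, 3, 2, 3]
print(pairs)
print(majority_vote(winners))               # class 3 collects three votes
```

With K = 4 classes there are six binary models, so every prediction is the outcome of six votes.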

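This post uses the liquid ultrasonic flowmeter data set Meter_D.csv from the UCI Machine Learning Repository to fit SVMs to a multiclass problem. V44 is the response variable and records four states of the flowmeter (1 = Healthy, 2 = Gas injection, 3 = Installation effects, 4 = Waxing). V1-V43 are numeric diagnostic measurements of the meter. The goal is to classify the meter's health state from these measurements.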

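This post works through the following tasks:

  (1) Load the data and inspect its shape and first 5 observations;

  (2) Using random_state=0 and stratified sampling, randomly select 100 observations as the training set;

  (3) Standardize all feature variables based on the training set;

  (4) With random_state=123, fit an SVM with a linear kernel (other SVC parameters at their defaults) and compute the test-set accuracy;

  (5) Predict on the test set and show the test-set confusion matrix;

  (6) Fit an SVM with a quadratic kernel (same parameters) and compute the test-set accuracy;

  (7) Fit an SVM with a cubic kernel (same parameters) and compute the test-set accuracy;

  (8) Fit an SVM with a radial (RBF) kernel (same parameters) and compute the test-set accuracy;

  (9) Fit an SVM with a sigmoid kernel (same parameters) and compute the test-set accuracy;

  (10) Use 10-fold cross-validation to select the best combination of tuning parameters;

  (11) Fit a linear-kernel SVM with the best tuning parameters and compute the test-set accuracy.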

  (1) Load the data and inspect its shape and first 5 observations

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# In a notebook, display the value of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

data = pd.read_csv(r'E:\工作\硕士\博客\博客20-多分类问题的SVM估计\Meter_D.csv')  # load the data
data.shape   # dimensions of the data set
data.head()  # first 5 rows

Output:

(180, 44)
         V1        V2        V3        V4        V5        V6        V7           V8           V9          V10  ...   V35         V36         V37         V38         V39         V40         V41         V42         V43  V44
0  1.104706  1.004679  0.994783  2.345833  2.604444  2.580000  2.347222  1485.805556  1485.930556  1485.941667  ...  -0.7  124.150000  123.911944  151.615833  151.275000  151.559444  151.221111  124.253611  124.017778    1
1  1.089401  0.997778  0.998012  3.399444  3.711111  3.711944  3.414444  1486.163889  1486.302778  1486.302778  ...  -0.7  124.171944  123.831667  151.652222  151.164444  151.595278  151.113333  124.274167  123.937778    1
2  1.079671  1.006056  0.999027  3.438056  3.714722  3.689722  3.420000  1486.322222  1486.452778  1486.455556  ...  -0.7  124.160000  123.818056  151.636111  151.150833  151.580000  151.096944  124.260833  123.923611    1
3  1.090834  1.013194  0.994911  3.399444  3.725000  3.660556  3.371111  1486.455556  1486.575000  1486.583333  ...  -0.7  124.144722  123.808611  151.622500  151.139167  151.564167  151.084167  124.250833  123.913056    1
4  1.093816  1.009716  1.003622  3.398889  3.702778  3.681389  3.351944  1486.600000  1486.725000  1486.730556  ...  -0.7  124.134444  123.797500  151.607500  151.122778  151.550278  151.068889  124.236111  123.903056    1

5 rows × 44 columns

  (2) Using random_state=0 and stratified sampling, randomly select 100 observations as the training set

x = data.iloc[:, 0:43]  # feature matrix V1-V43, selected by position with iloc
y = data.iloc[:, 43]    # response variable V44
x
y

Output for x:

            V1        V2        V3         V4         V5         V6         V7           V8           V9          V10  ...        V34        V35         V36         V37         V38         V39         V40         V41         V42         V43
  0   1.104706  1.004679  0.994783   2.345833   2.604444   2.580000   2.347222  1485.805556  1485.930556  1485.941667  ...  -0.700000  -0.700000  124.150000  123.911944  151.615833  151.275000  151.559444  151.221111  124.253611  124.017778
  1   1.089401  0.997778  0.998012   3.399444   3.711111   3.711944   3.414444  1486.163889  1486.302778  1486.302778  ...  -0.700000  -0.700000  124.171944  123.831667  151.652222  151.164444  151.595278  151.113333  124.274167  123.937778
  2   1.079671  1.006056  0.999027   3.438056   3.714722   3.689722   3.420000  1486.322222  1486.452778  1486.455556  ...  -0.700000  -0.700000  124.160000  123.818056  151.636111  151.150833  151.580000  151.096944  124.260833  123.923611
  3   1.090834  1.013194  0.994911   3.399444   3.725000   3.660556   3.371111  1486.455556  1486.575000  1486.583333  ...  -0.700000  -0.700000  124.144722  123.808611  151.622500  151.139167  151.564167  151.084167  124.250833  123.913056
  4   1.093816  1.009716  1.003622   3.398889   3.702778   3.681389   3.351944  1486.600000  1486.725000  1486.730556  ...  -0.700000  -0.700000  124.134444  123.797500  151.607500  151.122778  151.550278  151.068889  124.236111  123.903056
 ..        ...       ...       ...        ...        ...        ...        ...          ...          ...          ...  ...        ...        ...         ...         ...         ...         ...         ...         ...         ...         ...
175   4.593220  5.153846  0.123245  -0.138611  -1.703889  -0.102778  -0.254722  1513.505556  1507.080556  1494.355556  ...  44.600000  44.600000  121.266111  122.244167  145.081111  155.030000  147.680000  153.622500  117.866389  117.880278
176   0.173768  0.214250  0.098034  -0.200833  -0.401389  -0.103889  -2.706944  1521.897222  1474.400000  1500.705556  ...  44.600000  44.600000  121.082222  121.059167  151.401944  154.034722  147.871389  151.595278  118.459444  118.631111
177   0.049027  0.115593  0.134579  -0.281111  -0.050833  -0.098889  -2.772778  1518.072222  1439.802778  1512.719444  ...  44.169444  44.169444  121.298889  121.370278  158.391111  154.248611  149.020000  148.266389  118.569722  118.807500
178  -0.053056  0.099441  0.143832  -0.386667   0.129722   0.029444  -2.613333  1487.083333  1422.211111  1484.836111  ...  44.100000  44.100000  125.552500  122.318333  161.174167  155.306389  149.217222  153.626111  118.638333  118.872778
179   0.008931  0.914347  1.184204  -2.617222   0.144167  -0.190000  -2.514722  1476.855556  1413.877778  1506.325000  ...  44.100000  44.100000  127.293611  122.416111  161.170278  157.195000  148.598611  150.280556  118.752222  118.986389

180 rows × 43 columns

Output for y:

0      1
1      1
2      1
3      1
4      1
      ..
175    4
176    4
177    4
178    4
179    4
Name: V44, Length: 180, dtype: int64

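# test_size=80 leaves 180 - 80 = 100 observations for the training set.
# stratify=y keeps the class proportions of y the same in the training and test sets as in the full data.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=80, stratify=y, random_state=0)
x_train

Output: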

           V1         V2        V3         V4         V5         V6         V7           V8           V9          V10  ...   V34   V35         V36         V37         V38         V39         V40         V41         V42         V43
105  1.085543   1.006881  0.990884   6.246667   6.841389   6.736944   6.261667  1484.180556  1484.283333  1484.294444  ...  -0.7  -0.7  124.478056  123.857778  152.061944  151.167778  151.999722  151.118333  124.585278  123.961111
 92  1.085677   1.016798  1.000567   2.371667   2.571667   2.532222   2.329444  1486.672222  1486.797222  1486.797222  ...  -0.7  -0.7  124.076667  123.842222  151.525000  151.191389  151.468889  151.136944  124.179444  123.947778
174  2.117979  10.441463  0.451199  -0.709167  -1.669167  -0.101111  -0.126667  1472.986111  1481.625000  1469.966667  ...  44.6  44.6  122.591111  127.706944  147.530556  156.243611  152.669444  153.077222  118.620556  118.630833
 45  1.087155   1.002735  0.983435   6.240833   6.898333   6.770833   6.332500  1484.302778  1484.402778  1484.427778  ...  -1.0  -1.0  124.466389  123.843889  152.052500  151.151667  151.985556  151.105833  124.580000  123.942778
104  1.077505   1.005015  0.991753   6.270278   6.811111   6.724444   6.291667  1483.861111  1483.950000  1483.958333  ...  -0.7  -0.7  124.508611  123.881944  152.095000  151.201944  152.038611  151.148056  124.611667  123.988056
 ..       ...        ...       ...        ...        ...        ...        ...          ...          ...          ...  ...   ...   ...         ...         ...         ...         ...         ...         ...         ...         ...
128  1.114358   1.001870  0.979336   7.100833   8.079444   7.906944   7.245000  1485.650000  1485.761111  1485.816667  ...  -0.9  -0.9  124.397778  123.687778  151.991944  150.933611  151.922778  150.878889  124.493889  123.782500
 70  1.101280   1.010280  1.000218   1.221667   1.344444   1.331667   1.208333  1485.308333  1485.416667  1485.411111  ...  -0.9  -0.9  124.135278  124.012500  151.586389  151.409722  151.530833  151.358056  124.240278  124.120278
 54  1.110681   0.975261  0.989486   2.675556   3.007778   3.049444   2.778056  1484.905556  1484.655556  1484.588889  ...  -0.6  -0.6  124.241389  123.973056  151.774167  151.379444  151.728611  151.326944  124.387500  124.111667
135  2.047681   1.852623  1.825457   3.443611   3.728611   3.691389   0.180000  1483.769444  1483.880556  1483.891667  ...  44.4  44.4  124.371389  124.031111  151.900278  151.411111  151.842500  151.356944  116.763889  116.749722
124  1.107886   1.011329  1.002777   6.222500   6.870833   6.815556   6.131111  1484.386111  1484.497222  1484.583333  ...  -0.9  -0.9  124.455000  123.845278  152.042500  151.140556  151.977778  151.083611  124.552778  123.940833

100 rows × 43 columns
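To make the idea of stratified sampling concrete, here is a small library-free sketch that splits toy labels so that each class keeps its share in the test set. It is not the algorithm scikit-learn uses inside train_test_split; the function `stratified_split`, the largest-remainder rounding and the toy labels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, n_test, seed=0):
    """Split indices so each class keeps (roughly) its share in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    n = len(labels)
    # Ideal (possibly fractional) number of test observations per class.
    quotas = {c: len(ix) * n_test / n for c, ix in by_class.items()}
    counts = {c: int(q) for c, q in quotas.items()}
    # Give leftover slots to the classes with the largest fractional part.
    leftover = n_test - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - counts[c], reverse=True)
    for c in by_remainder[:leftover]:
        counts[c] += 1

    test = []
    for c, ix in by_class.items():
        rng.shuffle(ix)                 # random choice within each class
        test.extend(ix[:counts[c]])
    test_set = set(test)
    train = [i for i in range(n) if i not in test_set]
    return train, sorted(test)


labels = [1] * 6 + [2] * 3 + [3] * 9    # toy labels: 18 observations, 3 classes
train_idx, test_idx = stratified_split(labels, n_test=6)
print(Counter(labels[i] for i in test_idx))   # class shares 6:3:9 become 2:1:3
```

A purely random split could by chance leave a small class almost absent from the training set, which matters here because only 100 observations are used for training.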

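  (3) Standardize all feature variables based on the training set

scaler = StandardScaler()
scaler.fit(x_train)                    # learn each feature's mean and standard deviation from the training set only
x_train_s = scaler.transform(x_train)  # standardize the training features
x_test_s = scaler.transform(x_test)    # standardize the test features with the training-set statistics (do not refit on the test set)
# The outputs reported below come from the original run, which refit the scaler on x_test; rerunning with transform() may shift them slightly.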
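Step (3) asks for standardization based on the training set, which means the mean and standard deviation are learned from the training data and then reused unchanged on the test data. A minimal single-feature sketch with the standard library follows; the helpers `fit_scaler` and `transform` and the toy numbers are illustrative assumptions, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Apply previously learned statistics; nothing is re-estimated here."""
    return [(v - mean) / std for v in column]


train = [2.0, 4.0, 6.0, 8.0]   # toy training values of one feature
test = [5.0, 10.0]             # toy test values of the same feature

mean, std = fit_scaler(train)              # statistics come from the training set only
train_s = transform(train, mean, std)
test_s = transform(test, mean, std)        # same mean and std reused on the test set

print(fmean(train_s), pstdev(train_s))     # training column: mean 0, std 1
print(test_s)                              # test column is generally not mean 0 / std 1
```

Re-estimating the statistics on the test set would put the two sets on different scales and let test information leak into preprocessing.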

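  (4) With random_state=123, fit an SVM with a linear kernel (other SVC parameters at their defaults) and compute the test-set accuracy

model = SVC(kernel="linear", random_state=123)  # linear kernel
model.fit(x_train_s, y_train)   # fit the model
model.score(x_test_s, y_test)   # test-set accuracy

Output:

0.825

The linear-kernel SVM classifies 82.5% of the test observations correctly.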

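  (5) Predict on the test set and show the test-set confusion matrix

linear_pred = model.predict(x_test_s)  # test-set predictions of the linear-kernel SVC
# The labels 1-4 are categories, so these regression-style metrics treat them as numbers and are shown for reference only
model_rmse = np.sqrt(mean_squared_error(y_test, linear_pred))  # RMSE on the test set
model_mae = mean_absolute_error(y_test, linear_pred)           # MAE on the test set
model_r2 = r2_score(y_test, linear_pred)                       # R^2 (not the same thing as accuracy)
print("The RMSE of linear SVC: ", model_rmse)
print("The MAE of linear SVC: ", model_mae)
print("R^2 of linear SVC: ", model_r2)

Output:

The RMSE of linear SVC:  0.6614378277661477
The MAE of linear SVC:  0.2625
R^2 of linear SVC:  0.6857816182246661

pred = model.predict(x_test_s)
pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])  # test-set confusion matrix

Output: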

Predicted   1   2   3   4
Actual
1          15   2   6   0
2           0   5   5   0
3           1   0  23   0
4           0   0   0  23
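The confusion matrix above already contains the accuracy reported in step (4), and it also shows which classes are hard. The sketch below recomputes accuracy and per-class recall and precision from the printed counts, reading the crosstab as a 4×4 table with actual classes in rows and predicted classes in columns. The variable names are illustrative.

```python
labels = [1, 2, 3, 4]
# Rows: actual class, columns: predicted class (counts from the crosstab above).
cm = [
    [15, 2, 6, 0],
    [0, 5, 5, 0],
    [1, 0, 23, 0],
    [0, 0, 0, 23],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: share of each actual class that was predicted correctly (row-wise).
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}
# Precision: share of each predicted class that was truly that class (column-wise).
precision = {lab: cm[i][i] / sum(row[i] for row in cm) for i, lab in enumerate(labels)}

print(total, correct, accuracy)
print(recall)
print(precision)
```

Class 4 (Waxing) is classified perfectly, while half of the class 2 (Gas injection) observations are predicted as class 3, so most of the remaining error sits in classes 1 and 2.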

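  (6) Fit an SVM with a quadratic kernel (same parameters) and compute the test-set accuracy

model = SVC(kernel='poly', degree=2, random_state=123)  # polynomial kernel of degree 2
model.fit(x_train_s, y_train)   # fit the model
model.score(x_test_s, y_test)   # test-set accuracy

Output: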

0.6

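  (7) Fit an SVM with a cubic kernel (same parameters) and compute the test-set accuracy

model = SVC(kernel='poly', degree=3, random_state=123)  # polynomial kernel of degree 3
model.fit(x_train_s, y_train)   # fit the model
model.score(x_test_s, y_test)   # test-set accuracy

Output: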

0.5875

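  (8) Fit an SVM with a radial (RBF) kernel (same parameters) and compute the test-set accuracy

model = SVC(kernel='rbf', random_state=123)  # radial basis function kernel
model.fit(x_train_s, y_train)   # fit the model
model.score(x_test_s, y_test)   # test-set accuracy

Output: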

0.6875

  (9) Fit an SVM with a sigmoid kernel (same parameters) and compute the test-set accuracy

model = SVC(kernel='sigmoid', random_state=123)  # sigmoid kernel
model.fit(x_train_s, y_train)   # fit the model
model.score(x_test_s, y_test)   # test-set accuracy

Output:

0.6125
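Steps (4) and (6) to (9) differ only in the kernel, that is, in how similarity between two observations is measured. The sketch below writes the four kernel formulas as plain functions so the differences are visible. The formulas follow the definitions used by scikit-learn's SVC, but gamma=1.0 and coef0=0.0 are assumptions chosen for readability (SVC's default gamma is 'scale', which depends on the data), and the two vectors are toy values.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def poly_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    # degree=2 is the quadratic kernel, degree=3 the cubic kernel
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # depends only on the squared distance between the two points
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 1.0]   # toy standardized observations
print(linear_kernel(u, v))
print(poly_kernel(u, v, degree=2), poly_kernel(u, v, degree=3))
print(rbf_kernel(u, v))
print(sigmoid_kernel(u, v))
```

On this data set the linear kernel gives the highest test accuracy (0.825), ahead of the radial (0.6875), sigmoid (0.6125), quadratic (0.6) and cubic (0.5875) kernels, all at their default settings.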

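  (10) Use 10-fold cross-validation to select the best combination of tuning parameters

# Parameter grid. Note: gamma has no effect with kernel="linear", so only C actually changes the fitted model here.
param_grid = {'C': [0.01, 0.1, 1, 10, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 360, 500],
              'gamma': [0.001, 0.01, 1, 10]}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # 10 stratified folds with shuffling
model = GridSearchCV(SVC(kernel="linear", random_state=123), param_grid, cv=kfold)
model.fit(x_train_s, y_train)
model.best_params_

Output: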

{'C': 180, 'gamma': 0.001}
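To get a feel for how much work the grid search in step (10) does, the candidate combinations can be enumerated directly. The sketch below only counts candidates and fits under the grid shown above and does not train anything; the variable names are illustrative.

```python
from itertools import product

C_grid = [0.01, 0.1, 1, 10, 20, 40, 60, 80, 100, 120,
          140, 160, 180, 200, 220, 240, 360, 500]
gamma_grid = [0.001, 0.01, 1, 10]
n_folds = 10

# Every (C, gamma) pair is one candidate evaluated by cross-validation.
candidates = [{'C': c, 'gamma': g} for c, g in product(C_grid, gamma_grid)]
n_cv_fits = len(candidates) * n_folds

# A linear kernel has no gamma term, so candidates that share the same C
# describe the same model; only the C values are truly distinct.
n_distinct_linear_models = len(C_grid)

print(len(candidates), n_cv_fits, n_distinct_linear_models)
```

Because gamma does not enter a linear kernel, the gamma value in the reported best parameters carries no information; dropping gamma from the grid would give the same best C with a quarter of the fits.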

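  (11) Fit a linear-kernel SVM with the best tuning parameters and compute the test-set accuracy

model1 = model.best_estimator_   # SVC with the best hyperparameters; GridSearchCV has already refit it on the full training set (refit=True is the default)
model1.fit(x_train_s, y_train)   # fitting again is therefore redundant, but harmless
score = model1.score(x_test_s, y_test)
score

Output:

0.8625

Tuning C raises the test-set accuracy of the linear-kernel SVM from 82.5% to 86.25%.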

More posts are on the way; see my homepage for the rest.

Questions are welcome by email: 1736732074@qq.com

WeChat: TCB1736732074

Like and follow so you can find this blog again!
