Hello everyone, I'm 带我去滑雪!
Background: because a support vector machine separates classes with a single hyperplane, there is no natural way to extend it to multiclass problems. The standard workaround is one-vs-one classification, also known as the all-pairs approach. Concretely, suppose the data contain K classes with K > 2. The number of ways to pick an unordered pair of classes is C(K, 2) = K(K - 1)/2.
Fitting an SVM to each of these binary problems yields K(K - 1)/2 SVM models. For a new test observation, every one of these models makes a prediction, and the most frequently predicted class is taken as the final result. In effect, the classes play pairwise matches (PK), and the winner is decided by majority vote.
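The pair count above is easy to verify in code; a minimal sketch (plain Python, not part of the original post):

```python
from itertools import combinations

# With K = 4 classes (as in the flowmeter data below), one-vs-one trains
# one binary SVM per unordered pair of classes: K*(K-1)/2 models in total.
K = 4
pairs = list(combinations(range(1, K + 1), 2))
n_models = K * (K - 1) // 2
print(pairs)     # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(n_models)  # 6
```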
This post uses the liquid ultrasonic flowmeter data Meter_D.csv from the UCI Machine Learning Repository to fit SVMs for a multiclass problem. V44 is the response variable, coding four meter states (1 = Healthy, 2 = Gas injection, 3 = Installation effects, 4 = Waxing). V1-V43 are a series of numeric measurements taken from the meter. The goal is to judge the meter's condition from these measurements.
This post works through the following tasks:
(1) Load the data and inspect its shape and the first 5 observations;
(2) With random_state=0, use stratified sampling to draw 100 observations at random as the training set;
(3) Standardize all feature variables based on the training data;
(4) Fit an SVM with a linear kernel using random_state=123 (and SVC's other defaults), and compute the test-set accuracy;
(5) Predict on the test set and display the test-set confusion matrix;
(6) Fit an SVM with a quadratic kernel (same parameters as above) and compute the test-set accuracy;
(7) Fit an SVM with a cubic kernel (same parameters as above) and compute the test-set accuracy;
(8) Fit an SVM with a radial (RBF) kernel (same parameters as above) and compute the test-set accuracy;
(9) Fit an SVM with a sigmoid kernel (same parameters as above) and compute the test-set accuracy;
(10) Use 10-fold cross-validation to select the best combination of tuning parameters;
(11) Fit a linear-kernel SVM with the best tuning parameters and compute the test-set accuracy.
(1) Load the data and inspect its shape and the first 5 observations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
get_ipython().run_line_magic('matplotlib', 'inline')
plt.rcParams['font.sans-serif'] = ['SimHei']     # render Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False       # render minus signs correctly
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'  # echo every expression in a cell, not just the last
import warnings
warnings.filterwarnings('ignore')                # suppress non-essential warnings
data = pd.read_csv(r'E:\工作\硕士\博客\博客20-多分类问题的SVM估计\Meter_D.csv')  # load the data
data.shape   # shape of the data
data.head()  # first 5 rows of the data set
Output:
(180, 44)
       V1        V2        V3  ...         V43  V44
0  1.104706  1.004679  0.994783  ...  124.017778    1
1  1.089401  0.997778  0.998012  ...  123.937778    1
2  1.079671  1.006056  0.999027  ...  123.923611    1
3  1.090834  1.013194  0.994911  ...  123.913056    1
4  1.093816  1.009716  1.003622  ...  123.903056    1

[5 rows × 44 columns]
(2) With random_state=0, use stratified sampling to draw 100 observations at random as the training set
x = data.iloc[:, 0:43]  # use pandas' iloc slicing to take the feature matrix V1-V43 and assign it to x
y = data.iloc[:, 43]    # take the response variable V44 and assign it to y
x
y
Output for x:
           V1         V2        V3  ...         V43
0    1.104706   1.004679  0.994783  ...  124.017778
1    1.089401   0.997778  0.998012  ...  123.937778
..        ...        ...       ...  ...         ...
178 -0.053056   0.099441  0.143832  ...  118.872778
179  0.008931   0.914347  1.184204  ...  118.986389

[180 rows × 43 columns]
Output for y:
0      1
1      1
2      1
3      1
4      1
      ..
175    4
176    4
177    4
178    4
179    4
Name: V44, Length: 180, dtype: int64
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=80,  # test set = 180 - 100 = 80 observations
                                                    stratify=y, random_state=0)
# stratify=y: split so that each class appears in the train and test sets in
# the same proportions as in the original data (stratified sampling).
x_train
Output:
           V1         V2        V3  ...         V43
105  1.085543   1.006881  0.990884  ...  123.961111
92   1.085677   1.016798  1.000567  ...  123.947778
..        ...        ...       ...  ...         ...
135  2.047681   1.852623  1.825457  ...  116.749722
124  1.107886   1.011329  1.002777  ...  123.940833

[100 rows × 43 columns]
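A small sketch of what stratify=y guarantees, on toy data with four equally sized classes (synthetic, not the Meter_D data): each class keeps exactly its original share in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 180 observations, 45 per class, mirroring the 180/80 split above.
rng = np.random.default_rng(0)
y = pd.Series([1] * 45 + [2] * 45 + [3] * 45 + [4] * 45)
x = pd.DataFrame(rng.normal(size=(180, 3)))

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=80,
                                          stratify=y, random_state=0)
# With 45/180 = 25% per class, the 100 training rows get 25 per class
# and the 80 test rows get 20 per class.
print(y_tr.value_counts().sort_index().tolist())  # [25, 25, 25, 25]
print(y_te.value_counts().sort_index().tolist())  # [20, 20, 20, 20]
```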
(3) Standardize all feature variables based on the training data
scaler = StandardScaler()
scaler.fit(x_train)                    # learn the mean and std from the training set only
x_train_s = scaler.transform(x_train)  # standardize the training features
x_test_s = scaler.transform(x_test)    # standardize the test features with the TRAINING set's mean/std
Note: the test set must be transformed with the scaler fitted on the training set; calling fit_transform on the test set would leak test-set information and scale the two sets inconsistently.
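The key point is that the scaler's mean and std come from the training set alone; a minimal sketch on made-up numbers (not the flowmeter features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny toy arrays: column means of x_train are 2.0 and 20.0.
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test = np.array([[2.0, 25.0]])

scaler = StandardScaler().fit(x_train)   # statistics from train only
x_train_s = scaler.transform(x_train)    # train columns: mean 0, std 1
x_test_s = scaler.transform(x_test)      # scaled with the TRAIN statistics

print(x_train_s.mean(axis=0))  # ≈ [0. 0.]
print(x_test_s[0, 0])          # 0.0 (2.0 is exactly the training mean of column 0)
```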
(4) Fit an SVM with a linear kernel using random_state=123 (and SVC's other defaults), and compute the test-set accuracy
model = SVC(kernel="linear", random_state=123)  # linear kernel
model.fit(x_train_s, y_train)                   # fit the model
model.score(x_test_s, y_test)                   # test-set accuracy
Output:
0.825
The linear-kernel SVM achieves a test-set accuracy of 82.5%.
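As an aside, SVC itself trains in exactly the one-vs-one fashion described in the background section; a toy sketch on synthetic data (make_classification, not the Meter_D set) exposing the K(K-1)/2 pairwise decision columns:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic 4-class problem. With decision_function_shape='ovo', SVC reports
# one decision value per class pair: K*(K-1)/2 = 6 columns for K = 4.
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=0)
model = SVC(kernel="linear", decision_function_shape='ovo',
            random_state=123).fit(X, y)
print(model.decision_function(X[:1]).shape)  # (1, 6)
```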
(5) Predict on the test set and display the test-set confusion matrix
pred = model.predict(x_test_s)  # predict on the test set
pd.crosstab(y_test, pred, rownames=['Actual'], colnames=['Predicted'])  # test-set confusion matrix
Output:
Predicted   1  2   3   4
Actual
1          15  2   6   0
2           0  5   5   0
3           1  0  23   0
4           0  0   0  23
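The per-class recall and the overall accuracy can be read straight off this confusion matrix; a small sketch using the matrix above:

```python
import numpy as np

# Confusion matrix from step (5): rows = actual class, columns = predicted.
cm = np.array([[15, 2,  6,  0],
               [ 0, 5,  5,  0],
               [ 1, 0, 23,  0],
               [ 0, 0,  0, 23]])

recall = cm.diagonal() / cm.sum(axis=1)    # correct / actual, per class
accuracy = cm.diagonal().sum() / cm.sum()  # overall accuracy

print(recall.round(3))  # class-by-class: 0.652, 0.5, 0.958, 1.0
print(accuracy)         # 0.825 — matches the score from step (4)
```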
(6) Fit an SVM with a quadratic kernel (same parameters as above) and compute the test-set accuracy
model = SVC(kernel='poly', degree=2, random_state=123)  # quadratic (degree-2 polynomial) kernel
model.fit(x_train_s, y_train)                           # fit the model
model.score(x_test_s, y_test)                           # test-set accuracy
Output:
0.6
(7) Fit an SVM with a cubic kernel (same parameters as above) and compute the test-set accuracy
model = SVC(kernel='poly', degree=3, random_state=123)  # cubic (degree-3 polynomial) kernel
model.fit(x_train_s, y_train)                           # fit the model
model.score(x_test_s, y_test)                           # test-set accuracy
Output:
0.5875
(8) Fit an SVM with a radial (RBF) kernel (same parameters as above) and compute the test-set accuracy
model = SVC(kernel='rbf', random_state=123)  # radial (RBF) kernel
model.fit(x_train_s, y_train)                # fit the model
model.score(x_test_s, y_test)                # test-set accuracy
Output:
0.6875
(9) Fit an SVM with a sigmoid kernel (same parameters as above) and compute the test-set accuracy
model = SVC(kernel='sigmoid', random_state=123)  # sigmoid kernel
model.fit(x_train_s, y_train)                    # fit the model
model.score(x_test_s, y_test)                    # test-set accuracy
Output:
0.6125
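Steps (4) through (9) differ only in the kernel (and, for 'poly', the degree), so they can be folded into one loop. A sketch on synthetic data (make_classification, not Meter_D); the accuracies will of course differ from the ones reported above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 4-class problem standing in for the flowmeter data.
X, y = make_classification(n_samples=200, n_classes=4, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)                 # fit on train only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# degree is only used by the 'poly' kernel; elsewhere it is ignored.
for kernel, degree in [('linear', 3), ('poly', 2), ('poly', 3),
                       ('rbf', 3), ('sigmoid', 3)]:
    acc = SVC(kernel=kernel, degree=degree,
              random_state=123).fit(X_tr_s, y_tr).score(X_te_s, y_te)
    print(f'{kernel} (degree={degree}): {acc:.4f}')
```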
(10) Use 10-fold cross-validation to select the best combination of tuning parameters
param_grid = {'C': [0.01, 0.1, 1, 10, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 360, 500],
              'gamma': [0.001, 0.01, 1, 10]}  # parameter grid (note: SVC ignores gamma when kernel="linear")
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)  # 10 stratified folds, shuffled
model = GridSearchCV(SVC(kernel="linear", random_state=123), param_grid, cv=kfold)
model.fit(x_train_s, y_train)
model.best_params_
Output:
{'C': 180, 'gamma': 0.001}
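The same tuning pattern in a self-contained sketch on synthetic data (not Meter_D). Because SVC ignores gamma for the linear kernel, searching over gamma only repeats each value of C, so the grid can be reduced to C alone:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic 3-class problem, 50 observations per class (>= 10 per fold).
X, y = make_classification(n_samples=150, n_classes=3, n_informative=5,
                           random_state=0)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
grid = GridSearchCV(SVC(kernel="linear", random_state=123),
                    {'C': [0.01, 0.1, 1, 10, 100]},  # C alone suffices here
                    cv=kfold)
grid.fit(X, y)
print(grid.best_params_)
```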
(11) Fit a linear-kernel SVM with the best tuning parameters and compute the test-set accuracy
model1 = model.best_estimator_          # the model with the best hyperparameters
model1.fit(x_train_s, y_train)          # fit the model
score = model1.score(x_test_s, y_test)  # test-set accuracy
score
Output:
0.8625
After tuning, the linear-kernel SVM's test-set accuracy rises from 82.5% to 86.25%.
More quality content is on the way; head over to my homepage for more.
Questions? Reach me by email: 1736732074@qq.com
My WeChat: TCB1736732074
Like and follow so you don't get lost next time!