Machine Learning: SVM

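This code example shows how to use Python's scikit-learn library to work with the breast cancer dataset: data exploration, feature inspection, a scaling pipeline (defined but not applied to the data), and model training. The training section compares SVM models with linear, polynomial, and RBF kernels on the training and test sets. The RBF kernel reaches 100% accuracy on the training set but only 62.28% on the test set, where it predicts class 1 for every sample, which points to overfitting.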
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
col_names = list(cancer.feature_names)
col_names.append('target')

df = pd.DataFrame(np.c_[cancer.data, cancer.target], columns=col_names)
df.head()
                                0        1        2        3        4
mean radius                 17.99    20.57    19.69    11.42    20.29
mean texture                10.38    17.77    21.25    20.38    14.34
mean perimeter             122.80   132.90   130.00    77.58   135.10
mean area                  1001.0   1326.0   1203.0    386.1   1297.0
mean smoothness           0.11840  0.08474  0.10960  0.14250  0.10030
mean compactness          0.27760  0.07864  0.15990  0.28390  0.13280
mean concavity             0.3001   0.0869   0.1974   0.2414   0.1980
mean concave points       0.14710  0.07017  0.12790  0.10520  0.10430
mean symmetry              0.2419   0.1812   0.2069   0.2597   0.1809
mean fractal dimension    0.07871  0.05667  0.05999  0.09744  0.05883
...                           ...      ...      ...      ...      ...
worst texture               17.33    23.41    25.53    26.50    16.67
worst perimeter            184.60   158.80   152.50    98.87   152.20
worst area                 2019.0   1956.0   1709.0    567.7   1575.0
worst smoothness           0.1622   0.1238   0.1444   0.2098   0.1374
worst compactness          0.6656   0.1866   0.4245   0.8663   0.2050
worst concavity            0.7119   0.2416   0.4504   0.6869   0.4000
worst concave points       0.2654   0.1860   0.2430   0.2575   0.1625
worst symmetry             0.4601   0.2750   0.3613   0.6638   0.2364
worst fractal dimension   0.11890  0.08902  0.08758  0.17300  0.07678
target                        0.0      0.0      0.0      0.0      0.0

5 rows × 31 columns (shown transposed: one line per column, one column per row index)

df.describe()
                               count        mean         std         min         25%         50%         75%         max
mean radius               569.000000   14.127292    3.524049    6.981000   11.700000   13.370000   15.780000   28.110000
mean texture              569.000000   19.289649    4.301036    9.710000   16.170000   18.840000   21.800000   39.280000
mean perimeter            569.000000   91.969033   24.298981   43.790000   75.170000   86.240000  104.100000  188.500000
mean area                 569.000000  654.889104  351.914129  143.500000  420.300000  551.100000  782.700000 2501.000000
mean smoothness           569.000000    0.096360    0.014064    0.052630    0.086370    0.095870    0.105300    0.163400
mean compactness          569.000000    0.104341    0.052813    0.019380    0.064920    0.092630    0.130400    0.345400
mean concavity            569.000000    0.088799    0.079720    0.000000    0.029560    0.061540    0.130700    0.426800
mean concave points       569.000000    0.048919    0.038803    0.000000    0.020310    0.033500    0.074000    0.201200
mean symmetry             569.000000    0.181162    0.027414    0.106000    0.161900    0.179200    0.195700    0.304000
mean fractal dimension    569.000000    0.062798    0.007060    0.049960    0.057700    0.061540    0.066120    0.097440
...                              ...         ...         ...         ...         ...         ...         ...         ...
worst texture             569.000000   25.677223    6.146258   12.020000   21.080000   25.410000   29.720000   49.540000
worst perimeter           569.000000  107.261213   33.602542   50.410000   84.110000   97.660000  125.400000  251.200000
worst area                569.000000  880.583128  569.356993  185.200000  515.300000  686.500000 1084.000000 4254.000000
worst smoothness          569.000000    0.132369    0.022832    0.071170    0.116600    0.131300    0.146000    0.222600
worst compactness         569.000000    0.254265    0.157336    0.027290    0.147200    0.211900    0.339100    1.058000
worst concavity           569.000000    0.272188    0.208624    0.000000    0.114500    0.226700    0.382900    1.252000
worst concave points      569.000000    0.114606    0.065732    0.000000    0.064930    0.099930    0.161400    0.291000
worst symmetry            569.000000    0.290076    0.061867    0.156500    0.250400    0.282200    0.317900    0.663800
worst fractal dimension   569.000000    0.083946    0.018061    0.055040    0.071460    0.080040    0.092080    0.207500
target                    569.000000    0.627417    0.483918    0.000000    0.000000    1.000000    1.000000    1.000000

8 rows × 31 columns (shown transposed: one line per column, one column per statistic)

Feature selection

df.columns
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')
sns.countplot(x = 'target', label = "Count",data=df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8baae71f60>

[Figure: count plot of the target classes (svm_files/svm_5_1.png)]

plt.figure(figsize=(10, 8))
sns.scatterplot(x = 'mean area', y = 'mean smoothness', hue = 'target', data = df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b882cbe10>

[Figure: scatter plot of mean area vs. mean smoothness, colored by target (svm_files/svm_6_1.png)]

# Pearson correlation coefficients
plt.figure(figsize=(20,10)) 
sns.heatmap(df.corr(), annot=True) 
<matplotlib.axes._subplots.AxesSubplot at 0x7f8b882b2128>

[Figure: heatmap of the Pearson correlation matrix (svm_files/svm_7_1.png)]
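A 31 × 31 annotated heatmap is hard to read at a glance. The following is a minimal sketch, assuming the same DataFrame layout as above, that ranks the features by the absolute value of their Pearson correlation with target. The names corr_with_target and top_features are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the DataFrame the same way as earlier in the post.
cancer = load_breast_cancer()
col_names = list(cancer.feature_names) + ['target']
df = pd.DataFrame(np.c_[cancer.data, cancer.target], columns=col_names)

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr()['target'].drop('target')

# Rank by absolute value: the sign only gives the direction of the relationship.
top_features = corr_with_target.abs().sort_values(ascending=False).head(10)
print(top_features)
```

A ranking like this is only a first filter: it looks at each feature in isolation and ignores the strong correlations between features that the heatmap shows.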

Model training

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = df.drop('target', axis=1)
y = df.target

print(f"'X' shape: {X.shape}")
print(f"'y' shape: {y.shape}")

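# NOTE: this scaling pipeline is defined here but never fitted or applied;
# the models below are trained on the raw, unscaled features.
pipeline = Pipeline([
    ('min_max_scaler', MinMaxScaler()),
    ('std_scaler', StandardScaler())
])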

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
'X' shape: (569, 30)
'y' shape: (569,)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:  # training set
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:  # test set
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
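print_score evaluates a fitted classifier on a single train/test split. The notebook also imports cross_val_score but never calls it. As a hedged sketch of what a k-fold comparison could look like, the block below scores three SVC kernels with 5-fold cross-validation. The StandardScaler step, the fold count, and the names candidates and cv_means are assumptions of this sketch, not part of the original code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Kernel settings to compare (the poly settings mirror the post; the rest are defaults).
candidates = {
    'linear': SVC(kernel='linear'),
    'poly (degree=2)': SVC(kernel='poly', degree=2, gamma='auto', coef0=1, C=5),
    'rbf': SVC(kernel='rbf', gamma='scale'),
}

cv_means = {}
for name, svc in candidates.items():
    # Scaling lives inside the pipeline so it is refit on every training fold,
    # which keeps information from the validation fold out of the scaler.
    model = make_pipeline(StandardScaler(), svc)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Averaging over folds gives a less split-dependent estimate than one 80/20 split, at the cost of fitting each model five times.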

Linear and polynomial kernels

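C: the penalty parameter of C-SVC. The default is 1.0.

The larger C is, the more heavily the slack variables are penalized and the closer they are pushed toward 0. Misclassifications cost more, so the model tends toward classifying every training sample correctly: training accuracy is high, but generalization is weaker. A small C reduces the penalty for misclassification and tolerates errors by treating those points as noise, which usually generalizes better.

kernel: the kernel function. The default is 'rbf'; the options are 'linear', 'poly', 'rbf', 'sigmoid', and 'precomputed'.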

0 – linear: u'*v
1 – polynomial: (gamma*u'*v + coef0)^degree
2 – RBF: exp(-gamma*|u-v|^2)
3 – sigmoid: tanh(gamma*u'*v + coef0)

Here u and v are two feature vectors and u' is the transpose of u.

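degree: the degree of the polynomial ('poly') kernel. The default is 3; it is ignored by all other kernels.

gamma: the kernel coefficient for 'rbf', 'poly', and 'sigmoid'. With 'auto', 1/n_features is used. Since scikit-learn 0.22 the default is 'scale', which uses 1/(n_features * X.var()).

coef0: the constant term of the kernel function. It only matters for 'poly' and 'sigmoid'.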
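To make the four kernel formulas concrete, here is a small sketch that computes each one with plain NumPy and compares the result against the corresponding helper in sklearn.metrics.pairwise. The toy matrices U and V and the values chosen for gamma, coef0, and degree are arbitrary illustrations.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))  # 4 toy samples with 3 features
V = rng.normal(size=(5, 3))  # 5 toy samples with 3 features
gamma, coef0, degree = 0.5, 1.0, 3  # arbitrary illustrative values

dot = U @ V.T  # u'*v for every (u, v) pair, shape (4, 5)

k_linear = dot
k_poly = (gamma * dot + coef0) ** degree
# Squared Euclidean distance |u - v|^2 for every pair.
sq_dist = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
k_rbf = np.exp(-gamma * sq_dist)
k_sigmoid = np.tanh(gamma * dot + coef0)

# Compare the hand-written formulas with scikit-learn's implementations.
checks = {
    'linear': np.allclose(k_linear, linear_kernel(U, V)),
    'poly': np.allclose(
        k_poly, polynomial_kernel(U, V, degree=degree, gamma=gamma, coef0=coef0)),
    'rbf': np.allclose(k_rbf, rbf_kernel(U, V, gamma=gamma)),
    'sigmoid': np.allclose(
        k_sigmoid, sigmoid_kernel(U, V, gamma=gamma, coef0=coef0)),
}
print(checks)
```

The RBF line shows why feature scale matters: the kernel depends on raw squared distances, so a feature with values in the hundreds dominates one with values near 0.1.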

from sklearn.svm import SVC

linear_model = SVC(kernel='linear')
linear_model.fit(X_train, y_train)

print_score(linear_model, X_train, y_train, X_test, y_test, train=True)
print_score(linear_model, X_train, y_train, X_test, y_test, train=False)
/Users/gaoguli/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py:1692: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['str_']. An error will be raised in 1.2.
  FutureWarning,


Train Result:
================================================
Accuracy Score: 96.92%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.975460    0.965753  0.969231    0.970607      0.969359
recall       0.940828    0.986014  0.969231    0.963421      0.969231
f1-score     0.957831    0.975779  0.969231    0.966805      0.969112
support    169.000000  286.000000  0.969231  455.000000    455.000000
_______________________________________________
Confusion Matrix: 
 [[159  10]
 [  4 282]]

Test Result:
================================================
Accuracy Score: 95.61%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0        1.0  accuracy   macro avg  weighted avg
precision   0.975000   0.945946   0.95614    0.960473      0.956905
recall      0.906977   0.985915   0.95614    0.946446      0.956140
f1-score    0.939759   0.965517   0.95614    0.952638      0.955801
support    43.000000  71.000000   0.95614  114.000000    114.000000
_______________________________________________
Confusion Matrix: 
 [[39  4]
 [ 1 70]]
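The FutureWarning printed with the results above complains about feature names of dtype str_. A likely cause (an assumption, not something the original post confirms) is that list(cancer.feature_names) keeps NumPy string scalars as column names instead of built-in str objects. A minimal sketch of one way to avoid that when building the DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# ndarray.tolist() converts NumPy string scalars to built-in str objects,
# whereas list(ndarray) keeps them as numpy.str_.
col_names = cancer.feature_names.tolist() + ['target']
df = pd.DataFrame(np.c_[cancer.data, cancer.target], columns=col_names)

# Collect the distinct Python types used for the column labels.
column_types = {type(c) for c in df.columns}
print(column_types)
```

With plain str column names, the feature-name check in scikit-learn should no longer have a reason to warn.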



from sklearn.svm import SVC

poly_model = SVC(kernel='poly', degree=2, gamma='auto', coef0=1, C=5)
poly_model.fit(X_train, y_train)

print_score(poly_model, X_train, y_train, X_test, y_test, train=True)
print_score(poly_model, X_train, y_train, X_test, y_test, train=False)


Train Result:
================================================
Accuracy Score: 97.14%
_______________________________________________
CLASSIFICATION REPORT:
                  0.0         1.0  accuracy   macro avg  weighted avg
precision    0.987500    0.962712  0.971429    0.975106      0.971919
recall       0.934911    0.993007  0.971429    0.963959      0.971429
f1-score     0.960486    0.977625  0.971429    0.969056      0.971259
support    169.000000  286.000000  0.971429  455.000000    455.000000
_______________________________________________
Confusion Matrix: 
 [[158  11]
 [  2 284]]

Test Result:
================================================
Accuracy Score: 94.74%
_______________________________________________
CLASSIFICATION REPORT:
                 0.0        1.0  accuracy   macro avg  weighted avg
precision   0.974359   0.933333  0.947368    0.953846      0.948808
recall      0.883721   0.985915  0.947368    0.934818      0.947368
f1-score    0.926829   0.958904  0.947368    0.942867      0.946806
support    43.000000  71.000000  0.947368  114.000000    114.000000
_______________________________________________
Confusion Matrix: 
 [[38  5]
 [ 1 70]]
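The two models above use different values of C (the default 1.0 for the linear kernel, 5 for the polynomial kernel). To see the effect of C described earlier in isolation, the sketch below fits a linear SVC with three values of C and records accuracy and the number of support vectors. Standardizing the features first is an assumption of this sketch, made so the solver converges quickly; the original notebook fits on raw features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for C in (0.01, 1, 100):
    model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=C))
    model.fit(X_train, y_train)
    svc = model.named_steps['svc']  # the fitted SVC inside the pipeline
    results[C] = {
        'train_acc': model.score(X_train, y_train),
        'test_acc': model.score(X_test, y_test),
        # Total number of support vectors across both classes.
        'n_support': int(svc.n_support_.sum()),
    }
    print(C, results[C])
```

A small C widens the margin, so more training points end up on or inside it and become support vectors; a large C narrows the margin and fits the training set more tightly.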




Gaussian (RBF) kernel

rbf_model = SVC(kernel='rbf', gamma=0.1, C= 1)
rbf_model.fit(X_train, y_train)

print_score(rbf_model, X_train, y_train, X_test, y_test, train=True)
print_score(rbf_model, X_train, y_train, X_test, y_test, train=False)
Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
             0.0    1.0  accuracy  macro avg  weighted avg
precision    1.0    1.0       1.0        1.0           1.0
recall       1.0    1.0       1.0        1.0           1.0
f1-score     1.0    1.0       1.0        1.0           1.0
support    169.0  286.0       1.0      455.0         455.0
_______________________________________________
Confusion Matrix: 
 [[169   0]
 [  0 286]]

Test Result:
================================================
Accuracy Score: 62.28%
_______________________________________________
CLASSIFICATION REPORT:
            0.0        1.0  accuracy   macro avg  weighted avg
precision   0.0   0.622807  0.622807    0.311404      0.387889
recall      0.0   1.000000  0.622807    0.500000      0.622807
f1-score    0.0   0.767568  0.622807    0.383784      0.478046
support    43.0  71.000000  0.622807  114.000000    114.000000
_______________________________________________
Confusion Matrix: 
 [[ 0 43]
 [ 0 71]]
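The confusion matrix above shows that rbf_model predicts class 1 for every test sample. One plausible explanation is that the features were never scaled: with raw values such as mean area in the hundreds, squared distances are huge, so exp(-gamma*|u-v|^2) with gamma=0.1 is practically 0 for any two distinct samples and the model can only memorize the training set. The sketch below, which is not the author's code, fits the same kind of model once on raw features and once behind a StandardScaler. Using gamma='scale' for the scaled model is an assumption of this sketch rather than a tuned choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Raw features, same hyperparameters as rbf_model above.
raw_rbf = SVC(kernel='rbf', gamma=0.1, C=1).fit(X_train, y_train)

# Standardized features: the scaler is fitted on the training set only
# and then reused, unchanged, when scoring the test set.
scaled_rbf = Pipeline([
    ('std_scaler', StandardScaler()),
    ('svc', SVC(kernel='rbf', gamma='scale', C=1)),
]).fit(X_train, y_train)

raw_test_acc = raw_rbf.score(X_test, y_test)
scaled_test_acc = scaled_rbf.score(X_test, y_test)
print(f"raw features:    {raw_test_acc:.4f}")
print(f"scaled features: {scaled_test_acc:.4f}")
```

Putting the scaler and the classifier in one Pipeline means a single fit and score call handles both steps, and the test data never influences the scaling statistics.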



/Users/gaoguli/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
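The UndefinedMetricWarning above appears because class 0 is never predicted on the test set, so its precision is 0/0. The warning text itself points to the zero_division parameter. The following sketch uses made-up toy labels (y_true and y_pred are illustrative, not the notebook's data) to show how classification_report can be told explicitly what to report in that case.

```python
import warnings

import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import classification_report

# Toy labels mimicking the RBF test result: class 0 is never predicted.
y_true = np.array([0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1])

with warnings.catch_warnings():
    # Turn the warning into an exception so a silent pass proves it is gone.
    warnings.simplefilter('error', category=UndefinedMetricWarning)
    report = classification_report(
        y_true, y_pred, output_dict=True, zero_division=0)

print(report['0'])
print(report['1'])
```

Setting zero_division only changes what is reported and whether a warning is raised; the underlying problem, a model that never predicts one of the classes, still has to be fixed in the model itself.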


